Introduction
We report the results of an extensive analysis of HTML documents from
the World Wide Web. Our data set, collected by the Inktomi Web crawler, currently
comprises over 2.6
million HTML documents. We present a broad range of statistics
pertaining to these pages.
Such an analysis of the content of HTML documents is of interest for
several reasons:
- Evolution of HTML. Unused features and extensions that do
not achieve a reasonable level of acceptance should be deprecated and,
eventually, eliminated. This prevents the accretion of useless
language features.
- Improving Web content. Widespread awareness of poor
natural and markup language usage will promote the spread of helpful
tools and practices.
- Control of HTML. The marketplace perceives the relative
ability of vendors to force acceptance of new, non-standard language
extensions as market "strength." Understanding the true acceptance
level of such extensions can help fight vendor disinformation.
- Sociological insights. Many interesting sociological
observations may be derived from the content of Web pages.
Despite these motivations, however, previous studies relating
to the Web have either focused on other topics or have been
limited in scope. The most closely related work includes:
- User studies.
User surveys
[COMM95, PITK94b, PITK95a, PITK95b, RISS95, YAHO95]
and browser usage studies
[CATL95, PITK94a]
have become very common. Such studies
focus on high-level user issues (e.g., choice of software, available
connectivity) and low-level user-browser interaction (e.g., use of the
back button). The information extracted, though valuable, is
wholly user-centric.
- Content analyses of small data sets.
There have been some attempts to perform simple analyses of the
content of the Web. For example, the original Lycos
project at Carnegie Mellon University's Center for Machine Translation
[MAUL94]
tracked a number of interesting statistics while
its data set was still relatively small. These included:
- content of title and headings
- 100 top keywords and first 20 lines
- word frequency count
- file size (bytes, words)
- URL types
- most-linked-to URLs
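Several of the statistics above can be tabulated with a single pass over each document. The sketch below is our own illustration, not the Lycos code: it extracts the title and outgoing links with Python's standard HTML parser, and approximates word frequency, file size, and most-linked-to URLs with a simple counter (the toy page and URL are made up for the example).

```python
import re
from collections import Counter
from html.parser import HTMLParser

class PageStats(HTMLParser):
    """Collects title text and outgoing link URLs from one HTML document."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def analyze(pages):
    """Tabulate word frequencies, file sizes, and link counts over pages."""
    words, link_counts, sizes = Counter(), Counter(), []
    for html in pages:
        parser = PageStats()
        parser.feed(html)
        sizes.append(len(html))  # file size in bytes (of the raw markup)
        # crude word count over the raw markup; a real study would strip tags
        words.update(re.findall(r"[a-z]+", html.lower()))
        link_counts.update(parser.links)
    return words, link_counts, sizes

pages = ['<html><head><title>Demo</title></head>'
         '<body><a href="http://a.example/">a</a> word word</body></html>']
words, links, sizes = analyze(pages)
print(words["word"], links.most_common(1))
```

A production crawler would of course stream documents from disk and normalize URLs before counting, but the per-document bookkeeping is essentially this simple.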
- Structural analysis.
The CMU Lycos project generated at least one complete graph
of its data set. The project's commercial successor, Lycos, Inc.,
now tracks the 250 most-linked-to sites as a side effect of its
indexing [LYCO95].
Other projects have likewise focused on (graph-oriented)
structural analysis.
These include several Web visualization systems (e.g., Webspace
[CHI95]
and the Navigational View Builder
[MUKH95]).
For the most part, such visualization
has been very small-scale and limited in scope.
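The core of such graph-oriented analyses is a directed link graph over pages. As a rough illustration (the function names and toy URLs below are our own, not drawn from any of the cited systems), one can represent the graph as an adjacency list and rank pages by in-degree to recover a "most-linked-to" list:

```python
from collections import Counter, defaultdict

def build_link_graph(link_lists):
    """Build a directed graph: page URL -> list of URLs it links to."""
    graph = defaultdict(list)
    for page, links in link_lists:
        graph[page].extend(links)
    return graph

def most_linked_to(graph, n):
    """Rank target URLs by in-degree (number of distinct pages linking to them)."""
    indegree = Counter()
    for links in graph.values():
        indegree.update(set(links))  # count each linking page at most once
    return indegree.most_common(n)

# Toy crawl: two pages and their outgoing links.
crawl = [("http://x.example/", ["http://z.example/", "http://y.example/"]),
         ("http://y.example/", ["http://z.example/"])]
graph = build_link_graph(crawl)
print(most_linked_to(graph, 1))
```

Visualization systems layer rendering on top of such a graph; the analyses themselves reduce to standard graph computations over the adjacency structure.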
More sophisticated analyses are possible that combine structural
analysis with semantic modelling. A project at Xerox PARC
[PIRO95] is
conducting such analyses over small data sets.
To complement the above work,
we have conducted a large-scale investigation of the
content of HTML documents from the
Web. The remainder of this paper is structured as follows.
First, we describe the tools we used to perform our study.
We next discuss the scope of our study and our results.
Finally, we present some lessons learned and
possible future directions.