Introduction
We report the results of an extensive analysis of HTML documents from
the World Wide Web. Our data set, collected by the Inktomi Web crawler, currently
comprises over 2.6
million HTML documents. We present a broad range of statistics
pertaining to these pages.
Such an analysis of the content of HTML documents is of interest for
several reasons:
- Evolution of HTML. Unused features and extensions that do
not achieve a reasonable level of acceptance should be deprecated and,
eventually, eliminated. This prevents the accretion of useless
language features.
- Improving Web content. Widespread awareness of poor
natural and markup language usage will promote the spread of helpful
tools and practices.
- Control of HTML. The marketplace perceives the relative
ability of vendors to force acceptance of new, non-standard language
extensions as market "strength." Understanding the true acceptance
level of such extensions can help fight vendor disinformation.
- Sociological insights. Many interesting sociological
observations may be derived from the content of Web pages.
Despite these motivations, however, previous studies relating
to the Web have either focused on other topics or have been
limited in scope. The most closely related work includes:
- User studies.
User surveys
[COMM95, PITK94b, PITK95a, PITK95b, RISS95, YAHO95]
and browser usage studies
[CATL95, PITK94a]
have become very common. Such studies
focus on high-level user issues (e.g., choice of software, available
connectivity) and low-level user-browser interaction (e.g., use of the
back button). The information extracted, though valuable, is
wholly user-centric.
- Content analyses of small data sets.
There have been some attempts to perform simple analyses of the
content of the Web. For example, the original Lycos
project at Carnegie Mellon University's Center for Machine Translation
[MAUL94]
tracked a number of interesting statistics while
its data set was still relatively small. These included:
- content of title and headings
- 100 top keywords and first 20 lines
- word frequency count
- file size (bytes, words)
- URL types
- most-linked-to URLs
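Several of the statistics above can be tabulated with a single pass over each document. The sketch below is our own illustration, not the Lycos code: it extracts the title and outgoing links with Python's standard HTML parser, and approximates word frequency, file size, and most-linked-to URLs with a simple counter (the toy page and URL are made up for the example).

```python
import re
from collections import Counter
from html.parser import HTMLParser

class PageStats(HTMLParser):
    """Collects title text and outgoing link URLs from one HTML document."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def analyze(pages):
    """Tabulate word frequencies, file sizes, and link counts over pages."""
    words, link_counts, sizes = Counter(), Counter(), []
    for html in pages:
        parser = PageStats()
        parser.feed(html)
        sizes.append(len(html))  # file size in bytes (of the raw markup)
        # crude word count over the raw markup; a real study would strip tags
        words.update(re.findall(r"[a-z]+", html.lower()))
        link_counts.update(parser.links)
    return words, link_counts, sizes

pages = ['<html><head><title>Demo</title></head>'
         '<body><a href="http://a.example/">a</a> word word</body></html>']
words, links, sizes = analyze(pages)
print(words["word"], links.most_common(1))
```

A production crawler would of course stream documents from disk and normalize URLs before counting, but the per-document bookkeeping is essentially this simple.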
- Structural analysis.
The CMU Lycos project generated at least one complete graph
of its data set. The project's commercial successor, Lycos, Inc.,
now tracks the 250 most-linked-to sites as a side effect of its
indexing [LYCO95].
Other projects have likewise focused on (graph-oriented)
structural analysis.
These include several Web visualization systems (e.g., Webspace
[CHI95]
and the Navigational View Builder
[MUKH95]).
For the most part, such visualization
has been very small-scale and limited in scope.
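The core of such graph-oriented analyses is a directed link graph over pages. As a rough illustration (the function names and toy URLs below are our own, not drawn from any of the cited systems), one can represent the graph as an adjacency list and rank pages by in-degree to recover a "most-linked-to" list:

```python
from collections import Counter, defaultdict

def build_link_graph(link_lists):
    """Build a directed graph: page URL -> list of URLs it links to."""
    graph = defaultdict(list)
    for page, links in link_lists:
        graph[page].extend(links)
    return graph

def most_linked_to(graph, n):
    """Rank target URLs by in-degree (number of distinct pages linking to them)."""
    indegree = Counter()
    for links in graph.values():
        indegree.update(set(links))  # count each linking page at most once
    return indegree.most_common(n)

# Toy crawl: two pages and their outgoing links.
crawl = [("http://x.example/", ["http://z.example/", "http://y.example/"]),
         ("http://y.example/", ["http://z.example/"])]
graph = build_link_graph(crawl)
print(most_linked_to(graph, 1))
```

Visualization systems layer rendering on top of such a graph; the analyses themselves reduce to standard graph computations over the adjacency structure.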
More sophisticated analyses are possible that combine structural
analysis with semantic modelling. A project at Xerox PARC
[PIRO95] is
conducting such analyses over small data sets.
To complement the above work,
we have conducted a large-scale investigation of the
content of HTML documents from the
Web. The remainder of this paper is structured as follows.
First, we describe the tools we used to perform our study.
We next discuss the scope of our study and our results.
Finally, we present some lessons learned and
possible future directions.