An Investigation of Documents from the World Wide Web
Allison Woodruff
Paul M. Aoki
Eric Brewer
Paul Gauthier
Lawrence A. Rowe
Computer Science Division
University of California at Berkeley
Berkeley, CA
94720-1776
email: {woodruff,aoki,brewer,gauthier,rowe}@cs.berkeley.edu
Abstract:
We report on our examination of pages from the World Wide Web. We
have analyzed data collected by the Inktomi Web crawler (these data
currently comprise over 2.6 million HTML documents). We have
examined many characteristics of these documents, including: document
size; number and types of tags, attributes, file extensions,
protocols, and ports; the number of in-links; and the ratio of
document size to the number of tags and attributes. For a more
limited set of documents, we have also examined the number and types
of syntax errors, as well as readability scores. These data have
been aggregated to create a number of ranked lists, e.g., the ten
most-used tags and the ten most common HTML errors.
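As a rough illustration of the per-document measurements listed above, the following sketch counts tags and attributes in one HTML document and computes the ratio of document size to the number of tags and attributes. It is not the tooling used in this study, which the abstract does not describe; it assumes Python's standard html.parser module, and the names TagCounter and summarize are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Tallies start tags and attribute names in one HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.attrs = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attrs[name] += 1


def summarize(html):
    """Return size, ranked tag/attribute counts, and bytes per tag or attribute.

    Hypothetical helper: the metric definitions here are assumptions,
    not the definitions used in the paper.
    """
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    size = len(html.encode("utf-8"))
    markup = sum(counter.tags.values()) + sum(counter.attrs.values())
    return {
        "size_bytes": size,
        "top_tags": counter.tags.most_common(10),
        "top_attrs": counter.attrs.most_common(10),
        "bytes_per_markup_item": size / markup if markup else None,
    }


sample = '<html><body><p>Hi <a href="x.html">there</a></p><p>Bye</p></body></html>'
stats = summarize(sample)
```

Summing the two Counter objects across many documents (Counter supports addition) would give corpus-wide ranked lists of the kind mentioned above, such as the ten most-used tags.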
PostScript versions of this document (as submitted to WWW5) are
available in color and in black and white.