Results

We examined over 2.6 million HTML documents collected by the Inktomi crawler in November of 1995. Although Inktomi occasionally downloads non-HTML documents, the results presented reflect only HTML documents. (For example, we filtered out all binary files, such as images.) Furthermore, because Inktomi implements the Robot Exclusion Standard, the contents of automated databases (e.g., genome data sets) have also been excluded. The distribution of the documents in the data set by domain is as follows:

Documents Studied by Domain
Domain	# of HTML Documents	% of Total
other	1064318	41%
com	516709	20%
edu	698616	27%
gov	117125	4%
net	113595	4%
mil	14734	1%
org	89939	3%
total	2615036	100%

Here, "other" covers every domain outside the six top-level domains listed individually in the table. For example, "other" contains all non-US top-level domains (such as Germany's .de).
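
The bucketing rule and the percentage column can be illustrated with a minimal Python sketch. This is not the code used in the study. The function names (`domain_bucket`, `percent_of_total`) are hypothetical, and the sketch assumes that a document's domain is decided by the last label of its URL's host name. The counts are copied from the table.

```python
from urllib.parse import urlsplit

# Top-level domains reported individually in the table;
# every other host falls into the "other" row.
NAMED_TLDS = {"com", "edu", "gov", "net", "mil", "org"}

# Document counts copied from the table "Documents Studied by Domain".
TABLE_COUNTS = {
    "other": 1064318,
    "com": 516709,
    "edu": 698616,
    "gov": 117125,
    "net": 113595,
    "mil": 14734,
    "org": 89939,
}


def domain_bucket(url: str) -> str:
    """Return the table row a URL falls under (assumption: last host label decides)."""
    host = urlsplit(url).hostname or ""
    last_label = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return last_label if last_label in NAMED_TLDS else "other"


def percent_of_total(counts: dict) -> dict:
    """Map each row to its share of all documents, rounded to a whole percent."""
    total = sum(counts.values())
    return {row: round(100 * n / total) for row, n in counts.items()}


if __name__ == "__main__":
    print(domain_bucket("http://www.berkeley.edu/index.html"))
    print(domain_bucket("http://www.uni-bonn.de/"))
    print(percent_of_total(TABLE_COUNTS))
```

Under this rule, hosts given as bare IP addresses also land in "other", since their last label is not one of the six named domains. Whether the study treated such hosts that way is not stated in the text.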

We analyzed a variety of properties of these documents. In this paper, we present results on the following:

Document Size
Tag/Size Ratio
Tag Usage
Attribute Usage
Browser-specific Usage
Port Usage
Protocols Used in Child URLs
File Types Used in Child URLs
Number of In-links
Readability
Syntax Errors