Results

We examined over 2.6 million HTML documents collected by the Inktomi crawler in November 1995. Although Inktomi occasionally downloads non-HTML documents, the results presented here reflect only HTML documents; for example, we filtered out all binary files, such as images. Furthermore, because Inktomi implements the Robot Exclusion Standard, the contents of automated databases (e.g., genome data sets) that sites have placed off-limits to robots are also absent from the data set. The distribution of the documents in the data set by domain is as follows:

Documents Studied by Domain

Domain    # of HTML Documents    % of Total
other                 1064318           41%
com                    516709           20%
edu                    698616           27%
gov                    117125            4%
net                    113595            4%
mil                     14734            1%
org                     89939            3%
total                 2615036          100%

Here, "other" includes all domains other than the given top-level domains. For example, "other" contains all non-US top-level domains (such as Germany's .de).
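The grouping rule just described can be sketched in a few lines of Python. This is only an illustration, not the tooling used for the study: the function names `domain_bucket` and `percent_shares`, the constant names, and the example URLs are all hypothetical, and we assume the percentages in the table were rounded to the nearest whole number. The counts are copied from the table.

```python
from urllib.parse import urlsplit

# The six top-level domains reported individually; everything else is "other".
NAMED_DOMAINS = {"com", "edu", "gov", "net", "mil", "org"}

# Document counts copied from the table above.
TABLE_COUNTS = {
    "other": 1064318,
    "com": 516709,
    "edu": 698616,
    "gov": 117125,
    "net": 113595,
    "mil": 14734,
    "org": 89939,
}


def domain_bucket(url):
    """Return the table row a URL would fall into (hypothetical helper)."""
    host = urlsplit(url).hostname or ""
    # The top-level domain is the last dot-separated label of the host name.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld if tld in NAMED_DOMAINS else "other"


def percent_shares(counts):
    """Return each bucket's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {domain: round(100 * n / total) for domain, n in counts.items()}


if __name__ == "__main__":
    print(domain_bucket("http://www.example.de/index.html"))
    print(domain_bucket("http://www.example.edu/"))
    print(percent_shares(TABLE_COUNTS))
```

Under this sketch, a host ending in .de falls into "other", and the rounded shares computed from the counts agree with the percentages listed in the table.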

We analyzed a variety of properties of these documents. In this paper, we present results on the following: