Web Data Collection
The Inktomi research
project at Berkeley, consisting of Prof. Eric Brewer and graduate
student Paul Gauthier, investigates the construction of scalable Web
servers using parallel processing technology. To date, the
project has produced two major software components: a parallel Web
crawler and a parallel Web index search engine. In this paper,
the name Inktomi refers to the crawler.
The data presented in this study comes entirely from Inktomi. The
high speed of the crawler enables us, for the first time, to consider
taking ``snapshots'' of the Web and analyzing them. As of this
writing, the Inktomi team has completed two sets of crawl runs. The first set,
from July to October 1995, collected 1.3 million unique HTML documents.
The second set of runs, in November 1995, collected 2.6 million
unique HTML documents.