Web Data Collection

The Inktomi research project at Berkeley, consisting of Prof. Eric Brewer and graduate student Paul Gauthier, studies the construction of scalable Web servers using parallel processing technology. To date, the project has produced two major software components: a parallel Web crawler and a parallel Web index search engine. Throughout this paper, the name Inktomi refers to the crawler.

The data presented in this study come entirely from Inktomi. The high speed of the crawler makes it feasible, for the first time, to take ``snapshots'' of the Web and analyze them. As of this writing, the Inktomi team has crawled the Web twice. The first set of runs, from July to October 1995, collected 1.3 million unique HTML documents. The second set of runs, in November 1995, collected 2.6 million unique HTML documents.