Conclusions

Truisms

Two maxims are particularly apropos of our experience. First, dealing with large data sets is difficult and time-consuming. None of the existing tools we used scaled adequately to a data set on the order of millions of documents.

Second, we observed empirically that the Web changes exceptionally quickly. Many properties of the documents in our first data set have altered in the months since the data was collected. For example, the largest document in our data set was 1.6 Mbytes; when we checked its current size, it had grown to 9 Mbytes. Similarly, many of the most popular URLs in the first data set no longer exist.
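By way of illustration, one might re-check a document's size and availability with a short script such as the following. This sketch is not part of our measurement tools, and the URL shown is hypothetical.

```python
import urllib.error
import urllib.request

def check_document(url):
    """Issue an HTTP HEAD request and report status and size, if known."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            size = response.headers.get("Content-Length")
            return ("alive", int(size) if size else None)
    except urllib.error.HTTPError as err:
        return ("gone", err.code)        # e.g. 404 for a vanished URL
    except urllib.error.URLError:
        return ("unreachable", None)

# Compare the recorded size against the current one (URL hypothetical).
print(check_document("http://example.com/large-document.html"))
```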

Future Work

In the short term, we intend to improve our tools. These improvements will focus on speed and robustness, e.g., checkpointing and improvements to third-party software such as style and weblint.
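As a sketch of the kind of checkpointing we have in mind, the following Python fragment periodically saves progress so that a crashed analysis run can resume rather than restart from the beginning; the analyze() function and the checkpoint file name are hypothetical.

```python
import os
import pickle

CHECKPOINT = "analysis.ckpt"     # hypothetical checkpoint file

def run(documents, analyze):
    """Process documents in order, checkpointing every 1000 items."""
    done, results = 0, []
    # Resume from the last checkpoint if one exists.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            done, results = pickle.load(f)

    for i in range(done, len(documents)):
        results.append(analyze(documents[i]))
        if (i + 1) % 1000 == 0:
            with open(CHECKPOINT, "wb") as f:
                pickle.dump((i + 1, results), f)
    return results
```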

In the long term, structural graph analysis seems promising. In particular, analysis of the kind practiced by sociologists in structural network analysis [WASS94] promises insight. However, existing social network algorithms are several orders of magnitude more computationally expensive than is viable for a data set of this size. Significant work would have to be done to make such analysis feasible.
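To make the cost argument concrete: a local measure such as degree centrality needs only one pass over the links and remains tractable at millions of documents, whereas classical measures such as betweenness centrality require a shortest-path computation from every node. The sketch below, using a hypothetical edge-list representation of the Web graph, illustrates the cheap end of that spectrum.

```python
from collections import defaultdict

def degree_centrality(edges):
    """One pass over the links: O(m) even for millions of documents."""
    degree = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1   # outbound link
        degree[dst] += 1   # inbound link
    return degree

# Hypothetical edge list; each pair is (source page, target page).
edges = [("a.html", "b.html"), ("b.html", "c.html"), ("a.html", "c.html")]
print(degree_centrality(edges))
# Betweenness centrality, by contrast, requires shortest paths from
# every node, which is why it does not scale to a data set of this size.
```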

It would also be interesting to allow user-defined queries against the data set. The simplest functionality would let a user ascertain how a URL, specified via a form, compares with the data set as a whole. A more interesting and complex interface would allow the user to pose arbitrary queries against the data set.
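As an illustration of the simpler functionality, the sketch below reports where a document's measured properties fall relative to the data set; the property names, the precomputed percentile table, and the sample values are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical precomputed, sorted per-property samples from the data set.
DATASET = {
    "size_bytes": [512, 2048, 8192, 40960, 1600000],
    "outbound_links": [0, 3, 7, 15, 120],
}

def percentile_of(prop, value):
    """Fraction of data-set documents at or below the given value."""
    samples = DATASET[prop]
    return bisect_right(samples, value) / len(samples)

# Example: where does a 50 Kbyte page with 10 outbound links fall?
for prop, value in [("size_bytes", 50000), ("outbound_links", 10)]:
    print(prop, "%.0f%%" % (100 * percentile_of(prop, value)))
```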