Readability

We used the UNIX utility style to assess the readability level of a subset of the HTML documents in our data set (approximately 150,000). Before invoking style on a document, we remove its HTML markup, for two reasons. First, style does not understand HTML, so the extra punctuation would confuse its analyzer. Second, breaking English text into sentences and sentence fragments can be tricky, and the style analyzer needs some assistance. For example, it is not always clear whether a bulleted list should be ignored, treated as a single long sentence, or treated as a list of individual sentences. When invoked on troff documents, style uses a set of heuristics to insert punctuation into the text, using the markup as a guide [CHER81]. Later passes of the analyzer then use this punctuation to determine where sentences and sentence fragments break. We use a similar set of heuristics to insert periods and commas into HTML documents as we strip out the markup.
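
As an illustration only, the following Python sketch shows one way markup could be stripped while inserting sentence breaks. The tag sets, the class name MarkupStripper, the function name strip_markup, and the rule "append a period when a block-level element ends without punctuation" are all assumptions made for this example; they are not the heuristics used in the study, and the sketch inserts periods only, not commas.

```python
from html.parser import HTMLParser

# Hypothetical rule set: elements whose boundaries we treat as sentence breaks.
BLOCK_TAGS = {"p", "li", "dt", "dd", "td", "th", "title",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Elements whose content is not prose and is dropped entirely.
SKIP_TAGS = {"script", "style"}
# If a chunk already ends with one of these, no period is added.
TERMINATORS = (".", "!", "?", ":", ";", ",")


class MarkupStripper(HTMLParser):
    """Collects text chunks and closes each block-level element with a period."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def _end_sentence(self):
        # Add a period only if there is text and it is not already punctuated.
        if self.chunks and not self.chunks[-1].endswith(TERMINATORS):
            self.chunks[-1] += "."

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._end_sentence()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._end_sentence()

    def handle_data(self, data):
        if self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def strip_markup(html):
    """Return the plain text of an HTML string with sentence breaks inserted."""
    parser = MarkupStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Under these assumed rules, each list item becomes its own sentence, which is only one of the possible treatments of a bulleted list mentioned above. The output of strip_markup could then be written to a file and passed to style.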

The table below gives the average score of each domain on the Kincaid readability test, which expresses readability as a U.S. school grade level; a higher score indicates text that is harder to read. The "other" domain is excluded because it represents extraordinarily diverse sources.

Average Readability by Domain

Domain    Readability Score
com       10.3
edu       11.0
gov       10.0
net       12.3
mil       12.1
org       11.2
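
For reference, the Kincaid score is computed from the average sentence length and the average number of syllables per word: 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The Python sketch below applies that formula. The sentence splitter and the vowel-group syllable counter are simplifying assumptions made for this example and are cruder than the analysis style performs, so its scores will not match the values in the table exactly; the function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count vowel groups, discounting a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as 'table'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def kincaid_grade(text):
    """Kincaid grade level of a plain-text string."""
    # Assumption: a sentence ends at any run of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text contains no words")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Because the score grows with both sentence length and word length, a document of short sentences made of short words can score below zero, while dense technical prose scores well above 12.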