Readability
We used the UNIX utility style to assess the readability
level of a subset of the HTML documents in our data set (approximately
150,000).
We remove HTML markup before invoking style on each document.
We do this for two reasons. First, style does not understand
HTML, so the extra punctuation would confuse its analyzer. Second,
breaking English text into sentences and sentence fragments can be
tricky, and we need to provide the style analyzer with some
assistance. For example, it is not always clear whether a bulleted list
should be ignored, treated as a single long sentence, or treated as a
list of individual sentences. When invoked on troff
documents, style uses a set of heuristics to insert
punctuation into text, using the markup to assist it [CHER81]. This
information is then used by later passes of the analyzer to determine
sentence and sentence fragment breaks. We use a similar set of
heuristics to insert periods and commas into HTML documents as we
strip out markup.
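
The markup-stripping step can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the class name PunctuatingStripper, the function strip_html, and the choice of tags in SENTENCE_TAGS are all assumptions, and the sketch inserts only periods (not commas) at the end of block-level elements that lack terminal punctuation.

```python
from html.parser import HTMLParser

# Assumed heuristic: the end of any of these elements closes a sentence.
# The actual rules used in the study are not listed in the text.
SENTENCE_TAGS = {"p", "li", "dd", "dt", "td", "th", "title",
                 "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}
# Elements whose content is not English prose and is dropped entirely.
SKIP_TAGS = {"script", "style"}


class PunctuatingStripper(HTMLParser):
    """Strip HTML markup, inserting periods at block boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.buf = ""
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in SENTENCE_TAGS:
            self._end_sentence()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buf += data

    def _end_sentence(self):
        # Add a period only if the text does not already end a sentence.
        self.buf = self.buf.rstrip()
        if self.buf and self.buf[-1] not in ".!?:;":
            self.buf += "."
        self.buf += " "

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup.
        return " ".join(self.buf.split())


def strip_html(html):
    parser = PunctuatingStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With these assumptions, each item of a bulleted list is treated as an individual sentence, which is one of the three possible treatments mentioned above; the resulting plain text could then be passed to style.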
The numbers presented below are the average scores of the
different domains on the Kincaid readability test, which estimates
the U.S. school grade level needed to understand a text.
The "other" domain is excluded because it represents
extraordinarily diverse sources.
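
For reference, the Kincaid formula combines average sentence length with average syllables per word. The sketch below shows the standard formula in Python; the function names are hypothetical, and the sentence splitter and syllable counter are crude assumptions that will not match the more careful analysis performed by style, so scores from this sketch should not be expected to reproduce the numbers reported here.

```python
import re


def kincaid_from_counts(words, sentences, syllables):
    """Standard Kincaid grade-level formula applied to raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word):
    # Assumed heuristic: one syllable per vowel group, minus a silent final "e".
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def kincaid_grade(text):
    # Assumed tokenization: sentences end at ., ! or ?; words are letter runs.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if not sentences or not words:
        raise ValueError("text contains no sentences or words")
    syllables = sum(count_syllables(w) for w in words)
    return kincaid_from_counts(len(words), len(sentences), syllables)
```

Longer sentences and longer words both raise the score, so a domain average near 10 to 12 corresponds roughly to text written at a high-school level.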
Average Readability broken down by Domain

Domain | Kincaid Readability Score
com | 10.3
edu | 11.0
gov | 10.0
net | 12.3
mil | 12.1
org | 11.2