Natural Language Analysis: style
We scored English-language documents using the standard UNIX
style program [CHER81]. The style program
reports a variety of statistical properties of each document, such as
the average sentence length and the number of complex sentences. It
also scores the document using four readability metrics. These
metrics indicate the nominal educational (grade) level a reader would
need to understand the document.
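
As an illustration of how a grade-level score can be derived from surface statistics, the sketch below computes the Automated Readability Index (ARI), a widely used metric of this kind. The text above does not name the four metrics, so the choice of ARI is an assumption, as are the naive tokenization rules; the function name is hypothetical and is not part of style.

```python
import re


def automated_readability_index(text: str) -> float:
    """Return an approximate ARI grade level for `text`.

    Assumption: words are runs of ASCII letters or digits, and sentences
    end at '.', '!' or '?'. A real tool would use more careful rules.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    characters = sum(len(w) for w in words)
    chars_per_word = characters / len(words)
    words_per_sentence = len(words) / len(sentences)

    # Standard ARI formula: longer words and longer sentences raise the grade.
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43
```

Short words in short sentences yield a low (even negative) score, while long words and long sentences push the nominal grade level up.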
Since most HTML documents do not conform to an internationalization
standard, we applied heuristics to screen out non-English documents.
We filtered out documents that contained any character with
the high bit set (indicating a non-ASCII character set) or that contained
character sequences indicating known encodings (such as the Shift-JIS
encoding of the Japanese character set).
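
A minimal sketch of such a screening heuristic follows. The high-bit test mirrors the description above; the list of marker sequences is an assumption for illustration (it uses ISO-2022-JP escape sequences, which are 7-bit and would slip past the high-bit test), and the names ENCODING_MARKERS and looks_english are hypothetical. The actual sequences used in the study are not given here.

```python
# Hypothetical marker list: 7-bit escape sequences that switch character
# sets in ISO-2022-JP. The sequences actually used are not specified.
ENCODING_MARKERS = (b"\x1b$B", b"\x1b$@", b"\x1b(J")


def looks_english(raw: bytes) -> bool:
    """Return True if the document passes the non-English screen."""
    # Reject any byte with the high bit set (value >= 0x80), i.e. non-ASCII.
    if any(byte & 0x80 for byte in raw):
        return False
    # Reject documents containing a known encoding marker sequence.
    return not any(marker in raw for marker in ENCODING_MARKERS)
```

Shift-JIS text that contains two-byte characters is already rejected by the high-bit test, since the lead bytes of those characters have the high bit set. A screen like this is conservative: it also discards English documents that happen to contain a single accented character.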