HTML Data Extraction

HTML Data Extraction: `libink`

Although toolkits such as the W3C Reference Library [FRYS94] already exist for manipulating HTML and HTTP objects, we developed our own special-purpose library, libink, because our performance and functionality requirements differed substantially from those of the other toolkits' developers.

libink consists of four major subcomponents:

HTML parser. libink contains a simple flex-based HTML scanner. We found existing parsers either too slow (especially those written in scripting languages) or too difficult to modify. Because the libink scanner is small, we were able to make it fast, relatively robust, and highly configurable. Like the W3C SGML/HTML lexical analyzer [CONN95], our scanner uses a callback interface to handle events such as the recognition of a tag and its attributes. The W3C lexical analyzer, however, is not configurable.
URL parser. Unlike many freely available implementations, the libink URL parser conforms to RFC 1808 [FIEL95].
Domain name service (DNS) translation and caching. We identify hosts by their Internet addresses rather than by name, which reduces hostname aliasing in our data. To speed up lookups, we provide a wrapper around the standard name service routines that caches the translation of every URL hostname it resolves.
General hash table services. The lookup tables on which libink relies sometimes exceed the capacity of a single machine's physical memory. In addition to in-memory hash tables, libink therefore provides interfaces to striped on-disk hash tables (using GNU DBM) and to hash-partitioned distributed hash tables (using ONC RPC). The distributed hash tables support a 1 ms turnaround on lookups, 20 to 30 times faster than the 20-30 ms required to fetch a hash table page from secondary storage.
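
The callback style described for the HTML scanner can be illustrated with a minimal Python sketch. It is not libink's interface: libink's scanner is generated by flex, whereas this sketch uses regular expressions, and the names TagScanner, on_tag, and scan are hypothetical. Callers register handlers for the tags they care about, and the scanner skips attribute parsing for every other tag, which is one plausible reading of "configurable".

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout; this is an illustration, not libink's API.
Attrs = List[Tuple[str, str]]
TagCallback = Callable[[str, Attrs], None]

# A start tag: "<", a tag name, then anything up to ">" that respects quotes.
TAG_RE = re.compile(r'''<\s*([A-Za-z][A-Za-z0-9]*)((?:[^>"']|"[^"]*"|'[^']*')*)>''')
# An attribute name with an optional quoted or unquoted value.
ATTR_RE = re.compile(r'''([A-Za-z_:][-A-Za-z0-9_:.]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+))?''')


class TagScanner:
    """Scans HTML text and fires callbacks for registered start tags."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[TagCallback]] = {}

    def on_tag(self, name: str, callback: TagCallback) -> None:
        """Register a callback for start tags called `name` (case-insensitive)."""
        self._callbacks.setdefault(name.lower(), []).append(callback)

    def scan(self, html: str) -> None:
        for match in TAG_RE.finditer(html):
            name = match.group(1).lower()
            callbacks = self._callbacks.get(name)
            if not callbacks:
                continue  # nobody listens for this tag, so skip attribute parsing
            attrs: Attrs = []
            for key, raw in ATTR_RE.findall(match.group(2)):
                # Strip surrounding quotes; a bare attribute gets an empty value.
                value = raw[1:-1] if raw[:1] in ('"', "'") else raw
                attrs.append((key.lower(), value))
            for callback in callbacks:
                callback(name, attrs)
```

A link extractor, for instance, would register a single handler for the a tag and collect the href attribute from each event. The sketch ignores comments, end tags, and malformed markup, all of which a robust scanner must handle.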
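
The caching wrapper around the name service routines might look like the following Python sketch. The class name CachingResolver and its lookup parameter are assumptions made for illustration; the only real library call is socket.gethostbyname, which stands in for the "standard name service routines". The sketch omits expiry and caching of failed lookups, which a long-running crawler would likely need.

```python
import socket
from typing import Callable, Dict


class CachingResolver:
    """Hypothetical wrapper that memoizes hostname-to-address translations."""

    def __init__(self, lookup: Callable[[str], str] = socket.gethostbyname) -> None:
        self._lookup = lookup              # underlying name service routine
        self._cache: Dict[str, str] = {}   # lowercased hostname -> address string

    def resolve(self, hostname: str) -> str:
        # Hostnames are case-insensitive, so normalize before consulting the cache.
        key = hostname.lower()
        if key not in self._cache:
            self._cache[key] = self._lookup(key)
        return self._cache[key]
```

Two hostnames that resolve to the same address can then be treated as one host, which is how keying on addresses reduces aliasing.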
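
Hash partitioning, the technique named for the distributed tables, can be sketched as follows. This is an assumption-laden illustration: the class PartitionedTable is hypothetical, each partition is an in-process dictionary standing in for a remote server that libink would reach over ONC RPC, and CRC-32 is chosen only because it gives the same result on every machine. The source does not say which hash function libink uses.

```python
import zlib
from typing import Dict, List, Optional


class PartitionedTable:
    """Hypothetical hash-partitioned table; each dict stands in for a remote node."""

    def __init__(self, num_partitions: int) -> None:
        if num_partitions < 1:
            raise ValueError("need at least one partition")
        self._partitions: List[Dict[bytes, bytes]] = [{} for _ in range(num_partitions)]

    def partition_for(self, key: bytes) -> int:
        # A stable hash sends the same key to the same partition on every client.
        return zlib.crc32(key) % len(self._partitions)

    def put(self, key: bytes, value: bytes) -> None:
        self._partitions[self.partition_for(key)][key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._partitions[self.partition_for(key)].get(key)
```

Because each key maps to exactly one partition, a lookup costs a single round trip to one node, and the aggregate memory of all nodes can hold a table too large for any one of them.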
