HTML Data Extraction: libink
Although toolkits such as the W3C Reference Library [FRYS94] already exist for manipulating HTML and
HTTP objects, we developed our own special-purpose library,
libink, because our performance and functionality requirements
differed substantially from those of the other toolkit developers.
libink consists of four major subcomponents:
- HTML parser. libink contains a simple
flex-based HTML scanner. We found existing parsers either too slow
(especially those written in scripting languages) or too
difficult to modify. The libink scanner is
small, enabling us to make it both fast and relatively robust, as well
as highly configurable. Like the W3C SGML/HTML lexical analyzer [CONN95], our scanner uses a callback interface to
handle various events (e.g., recognition of a tag and its attributes).
The W3C lexical analyzer, however, is not configurable.
- URL parser. The URL parser, unlike many freely available
implementations, conforms to RFC 1808 [FIEL95].
- Domain name service (DNS) translation and caching. We identify
hosts by Internet address rather than by name, which reduces hostname
aliasing in our data. To speed up the lookup process, we provide a
wrapper around the standard name service routines that caches the
translation of every URL hostname.
- General hash table services. The various lookup tables on
which libink relies sometimes exceed the capacity of a single
machine's physical memory. Therefore, in addition to in-memory hash
tables, libink provides interfaces to striped on-disk hash
tables (using GNU DBM) as well as hash-partitioned distributed hash
tables (using ONC RPC). The distributed hash tables support 1 ms
turnaround on hash table lookups, which is far better than the 20-30 ms
required to fetch a hash table page from secondary storage.
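
To illustrate the DNS caching wrapper described in the list above, the following is a minimal Python sketch rather than libink's actual code. The class name CachingResolver, the injectable resolve parameter, and the decision to cache failed lookups are all assumptions made for the example; only socket.gethostbyname is a real standard-library call.

```python
import socket
from typing import Callable, Dict, Optional


class CachingResolver:
    """Hypothetical wrapper that memoizes hostname-to-address lookups."""

    def __init__(self, resolve: Callable[[str], str] = socket.gethostbyname) -> None:
        # The underlying name-service routine; injectable so it can be stubbed.
        self._resolve = resolve
        self._cache: Dict[str, Optional[str]] = {}

    def lookup(self, hostname: str) -> Optional[str]:
        """Return the address for hostname, or None if it does not resolve."""
        key = hostname.lower()  # hostnames are case-insensitive
        if key not in self._cache:
            try:
                self._cache[key] = self._resolve(key)
            except OSError:
                # Assumption: remember failures too, so a dead host is
                # queried only once.
                self._cache[key] = None
        return self._cache[key]
```

Two hostnames that resolve to the same address then compare equal by address, which is how aliasing is reduced, and a repeated lookup never reaches the name service a second time.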
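
The hash-partitioned tables described in the list above can be sketched in the same spirit. This is an illustration under stated assumptions, not libink's interface: the class PartitionedTable is hypothetical, CRC-32 is an arbitrary choice of partitioning hash, and in-process dictionaries stand in for the remote servers that the real system reaches through ONC RPC.

```python
import zlib
from typing import Dict, List, Optional


class PartitionedTable:
    """Hypothetical hash-partitioned table; each dict stands in for one server."""

    def __init__(self, num_partitions: int) -> None:
        if num_partitions < 1:
            raise ValueError("need at least one partition")
        self._parts: List[Dict[bytes, bytes]] = [{} for _ in range(num_partitions)]

    def partition_of(self, key: bytes) -> int:
        # Assumption: CRC-32 of the key, modulo the partition count, picks
        # the owner. Any stable hash would serve the same purpose.
        return zlib.crc32(key) % len(self._parts)

    def put(self, key: bytes, value: bytes) -> None:
        # In a distributed setting this would be a remote call to the owner.
        self._parts[self.partition_of(key)][key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        # Only the owning partition is consulted, so a lookup costs one
        # round trip regardless of how many partitions exist.
        return self._parts[self.partition_of(key)].get(key)
```

Because the owner of a key is computed from the key alone, every client agrees on where an entry lives without consulting a directory, and each machine holds only its share of the table in memory.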