Construction of Wrappers

In the Continual Queries project at OGI [7, 6, 8], we have built an event-driven update monitoring system for Internet information sources. Motivated by our initial experience of, and frustration with, writing wrapper programs by hand, we have developed an interactive software tool, called XWRAP, for semi-automatic generation of XML-enabled wrappers for Internet data sources [4]. The main objective of XWRAP is to wrap source documents into XML format and to provide content filtering capabilities through the XWRAP XML query processor. A wrapper programmer (developer) enters the URL of the source that he/she would like to wrap. XWRAP then walks the wrapper developer through the XWRAP wrapping process, which consists of the following sequence of eight steps:

Enter URL accepts the URL of the target data source to be wrapped and fires a remote fetch function to obtain the target source document.
Source Normalization parses the source document, which arrives in plain text format, and transforms it into a tree structure. XWRAP normalizes each HTML document into an XML-tree with standard HTML tags.
Semantic Token Extraction accepts the semantic tokens (S-tokens) identified by the wrapper developer and generates a set of source-specific S-token extraction rules and a comma-delimited file, consisting of S-token name and S-token value pairs.
Hierarchical Structure Extraction takes the hierarchical structures identified by the wrapper developer through interactive clicks and highlights, and derives a set of source-specific hierarchical structure extraction (H-struct) rules written in the form of an XML-template, an XML file with action semantics.
Learn is the step in which XWRAP works through the procedures needed to generate the wrapper program by generating the XML representation of the source document. The input of the learning step includes the semantic token extraction rules, the comma-delimited file, and the hierarchical structure extraction rules (XWRAP-script). It produces an XML representation of the source document fetched from the URL entered at the beginning, together with the pseudo code for the wrapper program.
Wrapper Program Construction produces the executable wrapper program for the given data source from the pseudo code generated in the Learn step.
Wrapper Program Test allows the developer to test the generated wrapper program by entering another URL of the same data source; it invokes the Learn process after source normalization. If the result is not satisfactory, an iterative process can restart from S-token extraction, incrementally revising the S-token extraction rules and the H-struct extraction rules.
Wrapper Program Release provides final packaging of the wrapper program. Once the wrapper developer is satisfied with the test of his/her wrapper program generated by XWRAP, he/she can click the release button to exit the XWRAP generator.
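
To make the Source Normalization and Semantic Token Extraction steps more concrete, the following is a minimal Python sketch of the idea, not XWRAP's actual implementation. It normalizes an HTML string into a tree and applies a set of extraction rules to produce comma-delimited S-token name and value pairs. The names HtmlNormalizer, normalize, extract_s_tokens, to_csv, SAMPLE_HTML and RULES are all hypothetical, as are the sample page and the use of ElementTree path expressions as the rule language.

```python
import csv
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML tags that never take a closing tag (partial list, for illustration only).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlNormalizer(HTMLParser):
    """Hypothetical normalizer: turns an HTML string into an element tree."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        node = self.stack[-1]
        if len(node):  # text after a child element belongs to that child's tail
            node[-1].tail = (node[-1].tail or "") + text
        else:
            node.text = (node.text or "") + text


def normalize(html):
    """Source Normalization (sketch): plain-text HTML in, tree out."""
    parser = HtmlNormalizer()
    parser.feed(html)
    parser.close()
    return parser.root


def extract_s_tokens(root, rules):
    """Apply assumed rules (token name -> ElementTree path) to the tree."""
    pairs = []
    for name, path in rules.items():
        for node in root.findall(path):
            pairs.append((name, "".join(node.itertext()).strip()))
    return pairs


def to_csv(pairs):
    """Serialize (name, value) pairs as comma-delimited text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(pairs)
    return buf.getvalue()


# Made-up source page and rules, used only to exercise the sketch.
SAMPLE_HTML = (
    "<html><body><h1>Portland Weather</h1>"
    "<table><tr><td class='temp'>52 F</td><td class='wind'>NW 10 mph</td></tr></table>"
    "</body></html>"
)
RULES = {
    "location": ".//h1",
    "temperature": ".//td[@class='temp']",
    "wind": ".//td[@class='wind']",
}

print(to_csv(extract_s_tokens(normalize(SAMPLE_HTML), RULES)))
```

In this sketch the rules are written by hand; in XWRAP they are derived from the tokens the developer identifies interactively, and the real rule format may differ.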

The wrapper program generated by XWRAP is now ready for use. It can automatically query and access the target data source remotely through the wrapper API. A wrapper API often provides more powerful and finer-grained content filtering than the search capability offered at the original data source site. Furthermore, the wrapper program can extract specific content fragments of interest, in addition to removing irrelevant advertisements and graphics from the original source document.
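
As an illustration of why XML output enables finer-grained filtering than a site's own search form, the sketch below filters a wrapped document by a numeric condition. It does not use the XWRAP XML query processor or the wrapper API, whose interfaces are not described here; it relies only on Python's standard ElementTree module. The element names, the sample data and the function filter_stations are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper output; XWRAP's real XML vocabulary depends on the source.
WRAPPED_XML = """
<report source="hypothetical-weather-site">
  <station><name>Portland</name><temperature unit="F">52</temperature></station>
  <station><name>Bend</name><temperature unit="F">31</temperature></station>
  <station><name>Astoria</name><temperature unit="F">48</temperature></station>
</report>
"""


def filter_stations(xml_text, min_temp):
    """Return the names of stations whose temperature is at least min_temp."""
    root = ET.fromstring(xml_text.strip())
    return [
        station.findtext("name")
        for station in root.iter("station")
        if float(station.findtext("temperature")) >= min_temp
    ]


print(filter_stations(WRAPPED_XML, 45))  # ['Portland', 'Astoria']
```

Once the content is structured, a condition such as a temperature threshold becomes a one-line predicate over named elements, which a keyword search at the source site typically cannot express.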

Ling Liu
Sun Feb 7 00:31:54 PST 1999