How does the XWRAP Composer toolkit work?

The main idea behind the XWRAP Composer toolkit is to separate what to extract from how to extract it, and to distinguish information extraction logic from query-answering control logic. The control logic describes the different ways in which a query can be answered from a given web site. The data extraction logic describes the cross-page extraction steps, including what information is important to extract at each page and how that information is used as a complex filter in the next search or extraction step. We use the interface and outerface descriptions to specify what needs to be extracted, and we use the Composer Extraction Script language to describe the control logic and extraction logic and to implement the output alignment and tagging of extracted data items based on the outerface specification.

The following figure presents an architecture sketch of the XWRAP Composer toolkit. The XWRAP Composer compiler takes three inputs (an interface specification, an outerface specification, and an extraction script) and compiles them into a Java wrapper, which can be further extended into either a Web service with wrapping capability or a Ptolemy wrapper actor for future integration. The compilation process includes generating code from the control logic and extraction logic, as well as generating the correct output alignment and semantically meaningful tagging based on the outerface specification.

XWRAP Composer System Structure

Structure of XWRAP Composer Toolkit

Interface and Outerface Specification

The design of the XWRAP Composer Interface and Outerface Specification serves two important objectives. First and foremost, it eases the use of XWRAP wrappers. Second, it helps the XWRAP Composer wrapper code generator produce Java code. Therefore, some components of the specification may not be directly useful to the users of these wrappers.

Concretely, the interface specification describes the wrapper name and which web site needs to be wrapped by giving the source URL and other related information. The outerface specification describes what data items are of interest and the semantically meaningful names to be used to tag those data items. The following figure shows a concrete example for the XWRAP Composer interface and outerface schema for the NCBI BLAST summary wrapper.

Interface and Outerface Specification

An example of interface and outerface specification - NCBi Summary
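
To make the division of labor between the two specifications concrete, here is a minimal Java sketch. It is not the XWRAP Composer schema format and not generated wrapper code: the class names `InterfaceSpec` and `OuterfaceSpec`, the field names, the tag names, and the `example.org` URL are all assumptions made for illustration. It only shows the idea that the interface names the wrapper and its source, while the outerface decides which extracted items are emitted and under which tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SpecSketch {

    /** Hypothetical model of an interface specification: the wrapper and the site it targets. */
    static final class InterfaceSpec {
        final String wrapperName;
        final String sourceUrl;
        final List<String> inputNames;

        InterfaceSpec(String wrapperName, String sourceUrl, List<String> inputNames) {
            this.wrapperName = wrapperName;
            this.sourceUrl = sourceUrl;
            this.inputNames = Collections.unmodifiableList(new ArrayList<String>(inputNames));
        }
    }

    /** Hypothetical model of an outerface specification: the items to emit and their tags. */
    static final class OuterfaceSpec {
        final String rootTag;
        final List<String> itemTags;

        OuterfaceSpec(String rootTag, List<String> itemTags) {
            this.rootTag = rootTag;
            this.itemTags = Collections.unmodifiableList(new ArrayList<String>(itemTags));
        }

        /** Emits only the items named in the specification, in specification order. */
        String tag(Map<String, String> extracted) {
            StringBuilder out = new StringBuilder();
            out.append('<').append(rootTag).append('>');
            for (String name : itemTags) {
                String value = extracted.get(name);
                if (value != null) {
                    out.append('<').append(name).append('>')
                       .append(escape(value))
                       .append("</").append(name).append('>');
                }
            }
            out.append("</").append(rootTag).append('>');
            return out.toString();
        }

        /** Escapes the three characters that would break element content. */
        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }

    public static void main(String[] args) {
        // Placeholder values; the real NCBi Summary specification is shown in the figure.
        InterfaceSpec in = new InterfaceSpec(
                "NCBiSummary", "http://example.org/blast", Arrays.asList("sequence"));
        OuterfaceSpec out = new OuterfaceSpec("summary", Arrays.asList("geneId", "description"));

        Map<String, String> extracted = new LinkedHashMap<String, String>();
        extracted.put("description", "example hit");
        extracted.put("internalScore", "42"); // not named in the outerface, so it is dropped
        extracted.put("geneId", "G1");

        System.out.println(in.wrapperName + " wraps " + in.sourceUrl);
        System.out.println(out.tag(extracted));
    }
}
```

In this sketch the output order follows the outerface rather than the order of extraction, which is one simple reading of "output alignment".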

Extraction Script

The XWRAP Composer script usually contains three types of root commands: document retrieval, data extraction, and post-processing. The document retrieval commands construct a file request or an HTTP request and fetch the document. The data extraction commands extract information from the fetched document. The post-processing commands allow adding semantic filters to make the output conform to the outerface specification.
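
The three kinds of root command can be pictured as stages that share one working context. The Java sketch below is only an illustration of that idea, not the Composer script syntax or the generated code: `RootCommand`, `Context`, and the three factory methods are hypothetical names, the fetcher is a stub supplied by the caller instead of a real HTTP request, and a regular expression stands in for whatever extraction mechanism a real script would use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RootCommandSketch {

    /** Shared state that each command reads from and writes to. */
    static final class Context {
        String document = "";
        List<String> items = new ArrayList<String>();
    }

    /** Hypothetical common shape for all three kinds of root command. */
    interface RootCommand {
        void run(Context ctx);
    }

    /** Document retrieval: fetches the document for a URL through the given fetcher. */
    static RootCommand retrieve(final String url, final Function<String, String> fetcher) {
        return ctx -> {
            ctx.document = fetcher.apply(url);
        };
    }

    /** Data extraction: collects capture group 1 of every match in the fetched document. */
    static RootCommand extract(String regex) {
        final Pattern pattern = Pattern.compile(regex);
        return ctx -> {
            Matcher m = pattern.matcher(ctx.document);
            while (m.find()) {
                ctx.items.add(m.group(1));
            }
        };
    }

    /** Post-processing: keeps only the items that pass a filter. */
    static RootCommand postProcess(final Predicate<String> keep) {
        return ctx -> {
            List<String> kept = new ArrayList<String>();
            for (String item : ctx.items) {
                if (keep.test(item)) {
                    kept.add(item);
                }
            }
            ctx.items = kept;
        };
    }

    /** Runs the commands in script order against a fresh context. */
    static Context run(List<RootCommand> script) {
        Context ctx = new Context();
        for (RootCommand command : script) {
            command.run(ctx);
        }
        return ctx;
    }

    public static void main(String[] args) {
        // Stub fetcher: returns a canned page so the sketch needs no network access.
        Function<String, String> fetcher = url -> "<li>AB123</li><li>x</li><li>CD456</li>";

        Context result = run(Arrays.asList(
                retrieve("http://example.org/page", fetcher),
                extract("<li>(.*?)</li>"),
                postProcess(item -> item.length() > 1)));

        System.out.println(result.items);
    }
}
```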

The general usage of commands is as follows:

General usage of commands

The following table lists the commands currently supported in the first release of the XWRAP Composer toolkit. Please refer to Dr. Wei Han's Ph.D. dissertation for a detailed explanation.

XWRAPComposer Extraction Root Commands

The following is an extraction script example for NCBi Summary. Given a sequence from the input, we first construct an NCBI BLAST search URL. In the script, Set variable{[text()]} indicates that the sequence is located in the input by the XPath "text()". The first FetchDocument retrieves the NCBI BLAST response page, which contains a request ID. We extract the ID and construct the URL of the search results. The control-flow command while...do... periodically invokes the second FetchDocument to retrieve the result page until the results are delivered. Finally, we use GrabXWRAPEliteData to extract useful data from the result page.

Script Example

Extraction Script Example for NCBi Summary
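
The control flow described for the NCBi Summary script (submit, read the request ID, poll the result page, then extract) can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the script itself and not the code the compiler generates: the `example.org` URL shapes, the `RID=` and `WAITING` page markers, the `<hit>` element, and the poll limit are all invented for the sketch, and the fetcher is passed in as a function so the example runs without network access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PollingSketch {

    // Hypothetical URL base and page markers; a real site defines its own.
    static final String BASE = "http://example.org/search";
    private static final Pattern REQUEST_ID = Pattern.compile("RID=(\\w+)");
    private static final Pattern HIT = Pattern.compile("<hit>(.*?)</hit>");

    /**
     * Submits the sequence, polls the result page until it stops reporting
     * WAITING, and returns the extracted hits. Assumes the sequence needs no
     * URL escaping.
     */
    static List<String> run(String sequence, Function<String, String> fetch,
                            int maxPolls, long pauseMillis) {
        // First fetch: the response page carries a request ID.
        String submitPage = fetch.apply(BASE + "?query=" + sequence);
        Matcher id = REQUEST_ID.matcher(submitPage);
        if (!id.find()) {
            throw new IllegalStateException("no request ID in response page");
        }
        String resultUrl = BASE + "?rid=" + id.group(1);

        // Second fetch, repeated until the results are delivered.
        String page = fetch.apply(resultUrl);
        int polls = 1;
        while (page.contains("WAITING")) {
            if (polls >= maxPolls) {
                throw new IllegalStateException("results not ready after " + polls + " polls");
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while polling", e);
            }
            page = fetch.apply(resultUrl);
            polls++;
        }

        // Final step: extract the useful data from the result page.
        List<String> hits = new ArrayList<String>();
        Matcher m = HIT.matcher(page);
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stub site: answers WAITING twice, then delivers two hits.
        final int[] resultFetches = {0};
        Function<String, String> stub = url -> {
            if (url.contains("?query=")) {
                return "<p>RID=ABC123</p>";
            }
            resultFetches[0]++;
            return resultFetches[0] < 3 ? "Status=WAITING" : "<hit>g1</hit><hit>g2</hit>";
        };
        System.out.println(run("ACGT", stub, 10, 0L));
    }
}
```

Bounding the number of polls is a choice made for this sketch; the source only says the result page is fetched periodically until the results are delivered.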

Code Generation and Extensions

XWRAP Composer compiles the interface, outerface, and extraction script into an XWRAP Composer wrapper, which contains an executable Java program and a set of configuration files. The configuration files include the input and output schemas obtained from the interface and outerface, and resource files used in the data extraction phase, such as XSLT files. Furthermore, because Java programs are usually not convenient for application integration, we provide two extensions to XWRAP Composer wrappers.

Web Service

Ptolemy Wrapper Actor

Ptolemy is a modeling tool for assembling concurrent components. It provides a friendly GUI environment in which to connect task components and govern the interaction between them. Task components have to conform to Ptolemy's actor interface in order to be integrated into Ptolemy. We have implemented such an actor interface so that generated XWRAP Composer wrappers can be used in Ptolemy. The following two figures show two Ptolemy diagrams composed of XWRAP Composer wrapper actors taken from the left frame.

Ptolemy Diagram 1

Ptolemy Diagram 2

After the play button at the top is clicked, StartWrapping triggers ReadInputFile to read a gene ID from a specified input file. The gene ID is then sent to the NCBiSummary Wrapper Actor, which performs the wrapping function on it and returns the IDs of related genes as results. XMLDisplay then pops up a window to present the results.

Ptolemy Result 1

Ptolemy Result 2
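
The dataflow in the diagrams (read an input, pass it to the wrapper, display the result) can be mimicked in plain Java without Ptolemy. The sketch below deliberately does not use the Ptolemy actor API: `Stage`, `fire`, and `connect` are hypothetical stand-ins, the wrapper stage is a stub that fabricates related IDs from its input, and the display stage returns a string instead of opening a window. It only illustrates how stages with matching input and output types chain together.

```java
import java.util.Arrays;
import java.util.List;

final class ActorChainSketch {

    /** Hypothetical stand-in for an actor: consumes one token and produces one token. */
    interface Stage<I, O> {
        O fire(I input);
    }

    /** Connects the output of the first stage to the input of the second. */
    static <A, B, C> Stage<A, C> connect(final Stage<A, B> first, final Stage<B, C> second) {
        return input -> second.fire(first.fire(input));
    }

    public static void main(String[] args) {
        // Stands in for ReadInputFile: takes the lines of a file, yields the first ID.
        Stage<List<String>, String> readInput = lines -> lines.get(0).trim();

        // Stands in for the wrapper actor: a stub, not a real call to any web site.
        Stage<String, List<String>> wrapper = id -> Arrays.asList(id + "-a", id + "-b");

        // Stands in for XMLDisplay: renders the result as a string.
        Stage<List<String>, String> display =
                ids -> "<related>" + String.join(",", ids) + "</related>";

        Stage<List<String>, String> flow = connect(connect(readInput, wrapper), display);
        System.out.println(flow.fire(Arrays.asList(" G42 ")));
    }
}
```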