What is XWRAP Composer?

XWRAP Composer is a semi-automated wrapper generation system that generates wrappers capable of extracting information from multiple heterogeneous Web documents. By encoding wrapper developers' knowledge in an Interface Specification, an Outerface Specification, and an Extraction Script, XWRAP Composer integrates single-page wrappers into a composite wrapper capable of extracting information across multiple interconnected pages. The enhancement module allows the generated composite wrapper to be used as a WSDL Web service or a Ptolemy actor for future integration.

Why do we need XWRAP Composer?

Automatically extracting data from Web sites has been an important task for the entire life of the Web. Although there have been many advances in automated information exchange, there remain many applications that can benefit from automated techniques for extracting data from a variety of complex Web sites.

Wrappers generated by wrapper generators, such as XWRAP Original or XWRAP Elite, usually perform well when extracting information from individual documents but are poorly equipped to extract information from multiple linked Web documents. Consider the following typical data integration workflow. A biologist first uses a program called Clusfavor to cluster genes that changed significantly in a microarray analysis experiment. After extracting all gene IDs from the Clusfavor result, he feeds them into the NCBI BLAST Web site, which searches for all related sequences over a variety of data sources. The returned sequences are then examined further to find promoter sequences.

Each of the phases described above can also be decomposed into smaller tasks. For example, the search in NCBI BLAST actually involves four steps. Because a BLAST search on the NCBI Web site can take a long time to serve, the NCBI Web server first returns a response page with a request ID. When the biologist asks for the results by the request ID, the Web site presents a delay page if the results are not yet ready to display. Once the search results are delivered, they are displayed in a summary page that contains a summary of all genes matching the search query condition and links to the detail page of each gene found. If the summary page does not include the detailed information that the biologist is interested in, he has to visit each detail page.
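The four-step search described above can be sketched as a chain of single-page extractors driven by one composite function. This is only an illustrative sketch, not XWRAP Composer's generated code: the FakeSearchSite class, the page formats, and every function name are hypothetical, and an in-memory stand-in replaces the real NCBI Web site so that the example runs without network access.

```python
from typing import Dict, List


class FakeSearchSite:
    """Hypothetical in-memory stand-in for a search site with delayed results.

    It imitates the four page types from the text (response page, delay page,
    summary page, detail page); it is NOT the real NCBI interface.
    """

    def __init__(self, polls_until_ready: int = 2) -> None:
        self.polls_until_ready = polls_until_ready
        self.details = {
            "/detail/geneA": "NAME: geneA\nSEQUENCE: ACGT",
            "/detail/geneB": "NAME: geneB\nSEQUENCE: TTGA",
        }

    def submit(self, gene_id: str) -> str:
        # Step 1: the response page carries only a request ID.
        return f"REQUEST_ID: R-{gene_id}"

    def fetch_results(self, request_id: str) -> str:
        # Step 2: a delay page is returned until the results are ready.
        if self.polls_until_ready > 0:
            self.polls_until_ready -= 1
            return "STATUS: WAITING"
        # Step 3: the summary page lists hits and links to detail pages.
        return "HIT geneA /detail/geneA\nHIT geneB /detail/geneB"

    def fetch_detail(self, path: str) -> str:
        # Step 4: one detail page per hit.
        return self.details[path]


# Single-page extractors: each one understands exactly one page type.
def extract_request_id(page: str) -> str:
    return page.split("REQUEST_ID:", 1)[1].strip()


def is_delay_page(page: str) -> bool:
    return page.startswith("STATUS: WAITING")


def extract_summary_links(page: str) -> List[str]:
    return [line.split()[2] for line in page.splitlines() if line.startswith("HIT ")]


def extract_detail(page: str) -> Dict[str, str]:
    record = {}
    for line in page.splitlines():
        key, _, value = line.partition(":")
        record[key.strip().lower()] = value.strip()
    return record


def composite_extract(site: FakeSearchSite, gene_id: str, max_polls: int = 10) -> List[Dict[str, str]]:
    """Chain the single-page extractors across the linked pages."""
    request_id = extract_request_id(site.submit(gene_id))
    for _ in range(max_polls):
        page = site.fetch_results(request_id)
        if not is_delay_page(page):
            break
        # A real wrapper would wait here before asking again.
    else:
        raise TimeoutError(f"results for {request_id} not ready after {max_polls} polls")
    return [extract_detail(site.fetch_detail(link)) for link in extract_summary_links(page)]


if __name__ == "__main__":
    for record in composite_extract(FakeSearchSite(polls_until_ready=2), "g1"):
        print(record)
```

Each extractor on its own handles a single page, which is the situation the text describes for single-page wrappers; the value of the composite function lies in following the request ID, the delay page, and the detail links from one page to the next.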

Data Integration

A critical challenge in providing system-level support for scientists performing such data integration tasks is locating, accessing, and fusing information from a rapidly growing, heterogeneous, and distributed collection of data sources available on the Web. Automating such complex search and data collection workflows presents three major challenges.

To address these challenges, we designed XWRAP Composer, a semi-automated wrapper generation system that generates wrappers capable of extracting information from multiple heterogeneous Web documents.