XWRAP Elite is a software toolkit for generating XML-enabling wrapper
programs for Web sources.
By XML-enabling, we mean that the wrapper programs
generated by XWRAP Elite can transform an HTML document into an XML document
and deliver the extracted data content in XML format with a DTD.
The XWRAP Elite service is provided free of charge. By using the Elite service,
you can build a ready-to-go wrapper for any of your favorite web sites
in just a few minutes. You can learn how to build a wrapper program with a few clicks (see walkthrough).
The wrappers are generated as Java classes (click here for an example).
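To make "XML format with a DTD" concrete, here is a minimal sketch of how extracted records could be serialized as XML with an internal DTD. It is not code produced by XWRAP Elite; the class name XmlWithDtdSketch, the tag names (books, book, title, price), and the sample record are assumptions made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: serializes extracted records as XML with an internal DTD. */
final class XmlWithDtdSketch {

    /** Escapes the characters that may not appear raw in XML character data. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Builds a document with one `object` element per record and one child
     * element per field. Assumes every record has the same field names and
     * that those names are valid XML element names.
     */
    static String toXml(String root, String object, List<Map<String, String>> records) {
        if (records.isEmpty()) {
            throw new IllegalArgumentException("need at least one record");
        }
        StringBuilder sb = new StringBuilder("<?xml version=\"1.0\"?>\n");
        // Internal DTD subset derived from the field names of the first record.
        sb.append("<!DOCTYPE ").append(root).append(" [\n");
        sb.append("  <!ELEMENT ").append(root).append(" (").append(object).append("*)>\n");
        String fields = String.join(", ", records.get(0).keySet());
        sb.append("  <!ELEMENT ").append(object).append(" (").append(fields).append(")>\n");
        for (String field : records.get(0).keySet()) {
            sb.append("  <!ELEMENT ").append(field).append(" (#PCDATA)>\n");
        }
        sb.append("]>\n");
        // Document body: one element per extracted object.
        sb.append("<").append(root).append(">\n");
        for (Map<String, String> record : records) {
            sb.append("  <").append(object).append(">\n");
            for (Map.Entry<String, String> e : record.entrySet()) {
                sb.append("    <").append(e.getKey()).append(">")
                  .append(escape(e.getValue()))
                  .append("</").append(e.getKey()).append(">\n");
            }
            sb.append("  </").append(object).append(">\n");
        }
        sb.append("</").append(root).append(">\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> book = new LinkedHashMap<>(); // hypothetical record
        book.put("title", "JDBC & Java");
        book.put("price", "29.95");
        System.out.print(toXml("books", "book", List.of(book)));
    }
}
```

The DTD here is derived from the field names of the first record, which is only one possible design; a real wrapper may describe its output differently.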
The core technology of the XWRAP Elite toolkit is the automatic discovery
and extraction of objects of interest and their elements.
The object and element extraction heuristics are computed and derived
automatically for any given web page. For data-object-intensive
web sites such as bn.com, ebay.com, cnet.com, and buy.com, our object
and element extraction algorithms achieve 95% to 100% accuracy.
Another distinct feature of our extraction algorithms is their robustness
against representational changes of the websites.
The XWRAP Elite toolkit has been tested over thousands of web pages and
hundreds of web sources.
By utilizing XML-enabling wrappers, the content of Web sources
can be easily made accessible to those applications that need to
filter, fuse, integrate, and summarize data from multiple and
disparate Web information sources.
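As a rough illustration of the filtering and fusing mentioned above, the following sketch parses the XML output of several wrappers with the standard Java DOM API and collects the field values that contain a keyword. The class name WrapperOutputFilter and the tag names used in the example are assumptions; the actual tags are whatever you specify when generating a wrapper.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: filters objects across XML documents from several wrappers. */
final class WrapperOutputFilter {

    /**
     * Returns the text of `field` for every `object` element, in any of the
     * given documents, whose `field` contains the keyword (case-insensitive).
     */
    static List<String> matching(List<String> xmlDocuments, String object,
                                 String field, String keyword) {
        List<String> hits = new ArrayList<>();
        String needle = keyword.toLowerCase();
        for (String xml : xmlDocuments) {
            Document doc;
            try {
                DocumentBuilder builder =
                        DocumentBuilderFactory.newInstance().newDocumentBuilder();
                doc = builder.parse(new InputSource(new StringReader(xml)));
            } catch (Exception e) {
                throw new IllegalArgumentException("could not parse wrapper output", e);
            }
            NodeList objects = doc.getElementsByTagName(object);
            for (int i = 0; i < objects.getLength(); i++) {
                Element obj = (Element) objects.item(i);
                NodeList fields = obj.getElementsByTagName(field);
                if (fields.getLength() == 0) {
                    continue; // this object has no such element
                }
                String text = fields.item(0).getTextContent().trim();
                if (text.toLowerCase().contains(needle)) {
                    hits.add(text);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two different wrappers.
        String a = "<books><book><title>JDBC API Tutorial</title></book></books>";
        String b = "<books><book><title>Perl Cookbook</title></book></books>";
        System.out.println(matching(List.of(a, b), "book", "title", "jdbc"));
    }
}
```

Because every source is delivered in the same XML form, the application code does not need to know how each site lays out its HTML.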
In addition to the object and element extraction algorithms, the XWRAP Elite
service also provides an XML Wrapper Query Language (XML-WQL) as the
data exchange language for accepting
application query requests (either keyword-based or content-sensitive
queries), and for returning the matching source documents or matching
source objects in XML format.
Other components of the XWRAP Elite service include the automatic extraction
of user query interfaces of the Web site to be wrapped, the code generator,
the code testing module, and the code packaging component.
For a demonstration of the usefulness of the XWRAP Elite wrappers, you may
visit the Adaptive Query Routing system.
To download ready-made XWRAP Elite wrappers,
you may visit the XWRAP wrapper repository.
How does it work?
As a user of XWRAP Elite, you can use the toolkit to generate
the wrapper program of your favorite Web site in three consecutive phases:
- Phase 1 - Object & Element Extraction
This phase helps you generate an object and
element extraction component that extracts and converts an HTML page into
an XML document. You can obtain the source code for the extraction component in the following steps:
-
First, you enter the website for which you want to generate a
wrapper. For example, you choose fatbrain.com.
-
Then you can surf the website through our proxy-like service, and find
the kind of web pages that you want to wrap. For example, you run a search
on all JDBC books, and you are now at the search result page where
all the JDBC books offered at the fatbrain.com site are listed.
If the result page is the kind of page you want to wrap,
then click on the Data Extraction button in the Elite service panel
located at the top-left corner.
-
Now, XWRAP Elite automatically discovers and separates objects and elements
in the sample page that you have chosen.
At this step, the toolkit computes and learns a set of object extraction
heuristics and element extraction heuristics. These heuristics are the core of the
extraction component of the wrapper to be generated.
If you are not happy with the first run of the extraction result, you may
refine the object extraction and element extraction results using our
easy-to-use GUI (see Walkthrough and
Examples for details).
If you are satisfied with the extraction result, you may click the
button to enter the next step.
-
At the element extraction refinement step, you can refine the extraction by
restricting or relaxing the number of elements per object
or by refining the data types of
the elements. If you are happy with the extraction result, you may enter your
favorite tag name for each of the elements extracted.
Then click the button to enter the code generation step for the extraction
component.
-
At the code generation step, XWRAP Elite produces the extraction module as
a Java class that takes a URL (and a query string if the page is accessed
by the HTTP POST method) and outputs XML data with the tagging you specify.
You may download the source code, or run a few more tests before downloading.
If you want to generate a wrapper that has a filtering capability over the XML
document, you need to continue by entering Phase 2.
- Phase 2 - Search Interface Extraction
This phase allows you to build a component that constructs a URL for a web page
from given keywords. Elite automatically captures the URL (and the query string
if the page is accessed by the HTTP POST method).
You identify the dynamic part of the URL and the query string as keywords.
Elite then automatically generates a Java class that takes keywords as
inputs and outputs a URL (and a query string).
- Phase 3 - Code Packaging
This phase integrates the two source code components automatically
by generating the wrapper main program as a wrapper class.
This wrapper class takes keywords as input and produces XML data as output.
Now you can download the complete set of source code and the object code to
any computer where you would like to run the wrapper program.
If you would like to share your wrapper with others, you may also register it
with the XWRAP wrapper repository with you as the sole owner
of your wrappers.
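The way the three phases fit together can be sketched as follows. This is a simplified illustration under stated assumptions, not the code that XWRAP Elite generates: the class names (ToyExtractor, SearchUrlBuilder, PackagedWrapper), the {keywords} placeholder, and the example.com URL are hypothetical, the extractor is a toy regular expression rather than the toolkit's learned heuristics, and the HTTP POST query string is left out for brevity.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Phase 1 stand-in: turns an HTML page into XML (toy rule, not XWRAP heuristics). */
final class ToyExtractor {
    private static final Pattern ITEM = Pattern.compile("<li>(.*?)</li>", Pattern.DOTALL);

    String extract(String html) {
        StringBuilder xml = new StringBuilder("<results>");
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            // Drop nested markup; assumes the remaining text needs no XML escaping.
            String text = m.group(1).replaceAll("<[^>]*>", "").trim();
            xml.append("<item>").append(text).append("</item>");
        }
        return xml.append("</results>").toString();
    }
}

/** Phase 2 stand-in: fills the dynamic part of a captured URL with keywords. */
final class SearchUrlBuilder {
    private final String template; // captured URL, "{keywords}" marks the dynamic part

    SearchUrlBuilder(String template) {
        this.template = template;
    }

    String build(String keywords) {
        return template.replace("{keywords}",
                URLEncoder.encode(keywords, StandardCharsets.UTF_8));
    }
}

/** Phase 3 stand-in: the packaged wrapper takes keywords and returns XML. */
final class PackagedWrapper {
    private final SearchUrlBuilder urls;
    private final Function<String, String> fetch; // URL -> HTML, injected so the sketch runs offline
    private final ToyExtractor extractor;

    PackagedWrapper(SearchUrlBuilder urls, Function<String, String> fetch,
                    ToyExtractor extractor) {
        this.urls = urls;
        this.fetch = fetch;
        this.extractor = extractor;
    }

    String query(String keywords) {
        String url = urls.build(keywords);   // Phase 2: keywords -> URL
        String html = fetch.apply(url);      // retrieve the page
        return extractor.extract(html);      // Phase 1: HTML -> XML
    }

    public static void main(String[] args) {
        SearchUrlBuilder urls = new SearchUrlBuilder("http://example.com/search?q={keywords}");
        // A canned page replaces the network call in this sketch.
        Function<String, String> fetch = url -> "<ul><li>Book A</li><li>Book B</li></ul>";
        PackagedWrapper wrapper = new PackagedWrapper(urls, fetch, new ToyExtractor());
        System.out.println(wrapper.query("JDBC books"));
    }
}
```

Passing the page fetcher in as a function keeps the sketch testable without network access; a deployed wrapper would fetch the page over HTTP instead.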