Frequently Asked Questions

XWRAP Elite is a software toolkit that helps you generate wrappers for web information sources that you want to wrap or use in data integration or summarization tasks.

How does XWRAP Elite differ from XWRAP Original?

XWRAP Elite differs from XWRAP Original in a number of ways. First, the Elite version discovers the objects and elements to extract automatically, but it is restricted to web pages that contain a reasonably large number of similar objects, such as product catalog web sites. XWRAP Original can wrap many more complex types of web pages, but it requires wrapper developers to interact much more with the XWRAP toolkit during the wrapper generation process. Another salient difference is the robustness of the generated wrappers with respect to presentational changes in the corresponding web pages. XWRAP Elite is much more robust: it works as long as the web pages remain data objects rich. Presentational or layout changes to the data objects or advertisements have no effect on the performance and effectiveness of the wrappers. In contrast, the robustness of wrappers generated by XWRAP Original depends heavily on the experience of the toolkit users and on the quality of the interactions between the wrapper developers and the toolkit.

What do I use XWRAP Elite for?

Whenever you find web sources of interest online and would like to build a data integration, data fusion, or data summarization engine, you may want to first wrap the set of web sources and turn them into XML data sources before the integration processing. XWRAP Elite can help you generate XML-enabling wrappers that transform HTML web documents into XML documents and filter XML documents by keyword search.

Does it have to be installed on my computer before I can run it?

No. You do not have to install the XWRAP Elite service on your computer in order to run it. You can access it through a browser and use it freely. However, we assume that all users have read and agreed to the terms and conditions of this free service.

Does XWRAP Elite work on all web pages?

No. XWRAP Elite targets data-rich web pages. It is not designed for wrapping personal web pages or web pages that do not have repetitive data objects. XWRAP Original, which will be released in the future, can be used when XWRAP Elite does not apply.

What types of web sources does XWRAP Elite work well on?

Although XWRAP Elite may not work well for all kinds of web pages, it does work well on many web sources that are data objects rich. A web page is called data objects rich if it contains objects of similar structure (although some objects may have missing elements) and the number of objects residing on the same page is not too small (say, more than 5 objects). Most such data objects rich pages are result pages returned from searches via HTML forms. Some static pages that are data objects rich can also be wrapped effectively by wrappers generated by XWRAP Elite.
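
To get a rough feel for whether a page is data objects rich, the definition above can be approximated with a small script. The following is only a sketch and not the algorithm XWRAP Elite uses: it assumes well-formed HTML, treats sibling elements with the same tag name as "objects of similar structure", and the names ChildTagCounter and looks_data_objects_rich are made up for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not open a new level.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class ChildTagCounter(HTMLParser):
    """Tracks, for every open element, how often each child tag occurs."""

    def __init__(self):
        super().__init__()
        # One Counter per currently open element; index 0 is the document root.
        self.stack = [Counter()]
        # Largest number of same-tag siblings seen under any single parent.
        self.best = 0

    def handle_starttag(self, tag, attrs):
        self.stack[-1][tag] += 1
        self.best = max(self.best, self.stack[-1][tag])
        if tag not in VOID_TAGS:
            self.stack.append(Counter())

    def handle_endtag(self, tag):
        # Assumes well-formed input: every non-void start tag has an end tag.
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()


def looks_data_objects_rich(html, min_objects=6):
    """Return True if some element has at least min_objects same-tag children.

    min_objects=6 mirrors the "more than 5 objects" rule of thumb.
    """
    parser = ChildTagCounter()
    parser.feed(html)
    parser.close()
    return parser.best >= min_objects
```

A search result page rendered as a table with eight rows would pass this check, while a short personal page with two paragraphs would not.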

The Remote Page Fetch doesn't work all the time. Can you tell me why?

The remote page fetch model in the current release of the XWRAP Elite service does not work on some web pages where JavaScript controls the links.

Can I tune the heuristics for locating objects?

Yes, you can. The current release supports six subtree heuristics to locate and separate data objects. One of the heuristics is chosen as the default. When you observe low-quality object and element extraction results, you are encouraged to try the other subtree heuristics to tune the extraction accuracy. You can do so by simply clicking on the subtree heuristic you would like to try.

How do I use the optional steps in Element Extraction Refinement?

The last two steps (Step 3 and Step 4) of Element Extraction Refinement are optional. Step 3 allows you to input another URL as a sample page to improve the extraction quality. This step works only if the target web site has more than one web page or has a search interface. In the latter case, you may simply replace the word(s) you typed in the search form with a new input. Step 4 allows you to change the recommended text separators and the recommended tag separators. The format for text separators is "text-sep1", "text-sep2", "text-sep3", ..., and the format for tag separators is tag-name1, tag-name2, tag-name3, ... .
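
The two separator formats in Step 4 can be illustrated by parsing them. This is a minimal sketch and not part of XWRAP Elite: the function names are hypothetical, and it assumes that the quotes around text separators exist so that a separator may itself contain a comma.

```python
import csv
import io


def parse_text_separators(spec):
    """Parse a quoted, comma-separated list such as '"Price:", "by"'.

    Assumption: quoting follows CSV conventions, so a comma inside the
    quotes belongs to the separator rather than ending it.
    """
    reader = csv.reader(io.StringIO(spec), skipinitialspace=True)
    return [field for row in reader for field in row if field]


def parse_tag_separators(spec):
    """Parse an unquoted, comma-separated list of tag names such as 'br, hr'."""
    return [name.strip() for name in spec.split(",") if name.strip()]
```

For example, the text separator list "Price:", "by, author" yields two separators, while the tag separator list br, hr, td yields three tag names.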

How many programs does XWRAP Elite generate during one round of the wrapper construction process?

XWRAP Elite generates three pieces of code. First, it generates an extraction component that extracts the data in a web page and translates it into XML. Second, it produces a search interface component that constructs a URL (and a query string, if needed) from some keywords. Finally, an integration program is built on top of the two components. The integration program takes keywords as input and produces XML data as output.
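
The division of labor among the three pieces can be sketched as follows. This is an illustration only: the generated wrappers are not Python, and every name here (build_search_url, extract_to_xml, run_wrapper, the objects and object element names) is hypothetical. The page fetching and object extraction are passed in as a function so that the sketch stays self-contained.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET


def build_search_url(base_url, keywords):
    """Search interface component: build a URL plus query string from keywords."""
    if not keywords:
        return base_url
    return base_url + "?" + urlencode(keywords)


def extract_to_xml(objects):
    """Extraction component stand-in: serialize extracted objects as XML.

    `objects` is a list of dicts mapping element names to string values.
    """
    root = ET.Element("objects")
    for obj in objects:
        node = ET.SubElement(root, "object")
        for name, value in obj.items():
            ET.SubElement(node, name).text = value
    return ET.tostring(root, encoding="unicode")


def run_wrapper(base_url, keywords, fetch_and_extract):
    """Integration program: keywords in, XML out.

    `fetch_and_extract` is a caller-supplied function that takes a URL and
    returns the list of extracted objects for that page.
    """
    url = build_search_url(base_url, keywords)
    return extract_to_xml(fetch_and_extract(url))
```

Calling run_wrapper with a base URL, a keyword dictionary such as {"title": "xml"}, and a function that fetches and extracts a page returns the XML for the matching result page.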

How do I use the programs generated by XWRAP Elite?

There are two ways to use the programs generated by Elite. First, you can register the wrapper in our wrapper repository. Then you can go to the repository, search for the wrapper, and see its details. We have an HTML search interface for each wrapper; you can simply type your query to get results from the wrapper. Second, you can download the generated wrapper programs onto another computer and run them on the command line. However, you must also download the Elite library code to make them work. To download the Elite library code, go to the "distribution" page.

The labels for elements are wrong. Could you give me some tips on how to correct them?

XWRAP Elite's alignment algorithm (in Element Extraction Refinement) helps you find the correct labels for elements. First, choose the most representative objects as the reference for alignment. Normally, these are the objects with the maximum number of elements. We assume that these objects contain all the potential elements. (This is another reason why we encourage you to use a sample page with a larger number of objects when generating wrappers.) Sometimes only one or two objects on a page have the maximum number of elements. If you want to play it safe, find another sample page with more objects that have the maximum number of elements. Second, check the types of the elements (String, Dollar Value, Links, or Images) to see whether they are aligned correctly. XWRAP Elite detects the element types automatically. However, it sometimes has a hard time determining the type of an element, in which case it simply treats the element as a string. You can correct this manually. Third, when an element or string occurs at most once in every data object, we refer to it as an object discriminator. When an element is an object discriminator, you can use the discriminator string as an Alignment Hint to correct the alignment of elements. This refinement is especially helpful for web pages containing objects that are missing different kinds of elements. However, when an element appears twice in the same object, we suggest that you do not treat it as a unique object discriminator and do not use it as an Alignment Hint. In the current release, we do not distinguish digits from strings in the refinement of element alignment.
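
The object discriminator rule above can be checked mechanically before you enter a string as an Alignment Hint. The sketch below is an illustration under assumptions, not XWRAP Elite's alignment algorithm: each object is modeled as a list of element strings, and the function names and the way hints are applied are hypothetical.

```python
def is_object_discriminator(text, objects):
    """Return True if `text` occurs at most once in every object.

    Each object is a list of element strings; occurrences are counted
    across all elements of the object.
    """
    return all(
        sum(element.count(text) for element in obj) <= 1
        for obj in objects
    )


def label_with_hints(obj, hints):
    """Label each element of one object using discriminator strings.

    `hints` maps a label to its discriminator string. An element gets the
    first label whose string it contains, or None if no hint matches.
    """
    labels = []
    for element in obj:
        label = next(
            (name for name, text in hints.items() if text in element),
            None,
        )
        labels.append(label)
    return labels
```

With objects such as ["XML Handbook", "Price: $30", "In stock"] and ["Java Basics", "Price: $20"], the string "Price:" qualifies as a discriminator and identifies the price element even though the second object has no availability element. If another object contained "List Price: $15 Price: $12", the string would appear twice in that object and should not be used as a hint.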

What are keywords in Search Interface Extraction?

Many web pages are generated dynamically from keyword inputs. Search Interface Extraction enables the generated wrappers to identify these keyword inputs and to access the web pages correctly and conveniently with new keyword inputs.

How can I specify keyword names in Search Interface Extraction?

The keyword names indicate what these keywords are. You can either choose the names from a list of output attributes from the wrapper or specify a new name.

Does XWRAP Elite use any third-party software packages?

Yes. We have used the following packages in our XWRAP Elite library: (1) The packages we wrote ourselves. (2) Java Tidy (Tidy by Dave Raggett <dsr@w3.org>; translation to Java by Andy Quick <ac.quick@sympatico.ca>). (3) IBM XML parser. (4) GNU Regular Expression library -- Copyright (C) 1998 Wes Biggs. (5) GNU Base64 library -- Copyright (c) 1998 by Kevin Kelley.

Do I have to register the wrapper in the wrapper repository?

You don't have to. However, you will miss the functionality provided by our wrapper repository, such as being able to keep track of the wrappers you have generated, or automatically incorporating your wrappers into those tools which use the wrappers in the repository, such as AQR.

How can I register the generated wrapper in the wrapper repository?

After you generate the wrapper at the last step of Code Packaging, click on the "register into the wrapper repository" link. You must provide your username and password to register a wrapper, so you need to register a new user account before you register your first wrapper. The password is used to protect your wrapper from being edited or deleted by other people.

What are the limitations of the current implementation of XWRAP Elite?

There are two major limitations. First, to ensure the extraction quality of XWRAP Elite, you may want to start the wrapper generation process using a sample page with a large number of objects. Typically, the sample page should contain more than 5 objects. Occasionally, a sample page with fewer than 10 objects will not work if the pages at the same web site usually return a larger number of objects, such as 50 or more. Second, a wrapper generated by XWRAP Elite from a sample page with a good number of objects may not work on pages that change the result display format. For instance, several e-commerce sites use a list of product summaries with links to each product page as the standard display, but when the result has only one product, the result page is the same as the product page. The second limitation will be addressed in a future release.

Can I filter the XML wrapping results?

Yes, you can. XWRAP Elite has implemented a simple query component on XML that allows you to filter objects according to query conditions. When you run the Elite wrappers in the wrapper repository through the HTML form interface, the required attributes are used to connect to the web sites, and the optional attributes are conditions that filter the XML results. The current implementation of the query component is not optimized, so it may take a while if you specify filter conditions. This feature is also available if you integrate the Elite wrappers into your Java programs. The query language we use will be described in the Java API document. However, this feature is not available when you run the wrappers on the command line.

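
The effect of optional filter attributes can be pictured with a short script. This is only a sketch: it does not use XWRAP Elite's query language (which is left to the Java API document), and it assumes a hypothetical XML layout with an objects root and object children, plus a simple case-insensitive keyword match.

```python
from xml.etree import ElementTree as ET


def filter_objects(xml_text, conditions):
    """Keep only objects whose named child elements contain the given keywords.

    `conditions` maps an element name to a keyword. An object is removed as
    soon as one condition fails; matching ignores case.
    """
    root = ET.fromstring(xml_text)
    for obj in list(root):  # iterate over a copy because we remove children
        for name, keyword in conditions.items():
            value = obj.findtext(name, default="")
            if keyword.lower() not in value.lower():
                root.remove(obj)
                break
    return ET.tostring(root, encoding="unicode")
```

Filtering a result set with the condition {"title": "xml"} would keep an object titled "XML Handbook" and drop one titled "Java Basics". Every object is scanned once per condition, which hints at why an unoptimized filter can be slow on large results.
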
Do XWRAP Elite wrappers follow next pages when the results are returned in several pages?

Yes, they do. We implemented a simple component that discovers the links to next pages and extracts data from those pages as well. You can try this feature by testing wrappers in the repository through the HTML interface. We do not provide this feature when you run wrappers on the command line. The algorithm for discovering next-page links is fairly naive and may not work on all web sites. (One web site on which it works is www.kingbooks.com, which was tested on May 8, 2000.) We plan to improve the algorithm in the future.
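
To show what a naive next-page discovery might look like, here is a minimal sketch. It is not the algorithm used by XWRAP Elite, whose details are not described here: it simply assumes that the link to the following page is the first anchor whose text starts with the word "next", and the names NextLinkFinder and find_next_page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Remembers the href of the first anchor whose text starts with 'next'."""

    def __init__(self):
        super().__init__()
        self.current_href = None  # href of the anchor being read, if any
        self.current_text = []    # text pieces collected inside that anchor
        self.next_href = None     # result: href of the first matching anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip().lower()
            if self.next_href is None and text.startswith("next"):
                self.next_href = self.current_href
            self.current_href = None


def find_next_page(html, page_url):
    """Return the absolute URL of the next result page, or None if not found."""
    finder = NextLinkFinder()
    finder.feed(html)
    finder.close()
    if finder.next_href is None:
        return None
    return urljoin(page_url, finder.next_href)
```

A rule this simple fails on sites that use images, numbered links, or scripts for paging, which is consistent with the caveat that such discovery may not work on all web sites.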

For problems or questions regarding this web contact [XWRAP Elite]. Last updated: April 06, 2000.
