Next: Hosting and Sharing Wrappers:
Up: Building an Extensible Wrapper
Previous: Sharing and Distribution of
A wrapper framework consists of a generic code structure and a collection of code components that can be tailored to build specialized wrappers through
source-specific code customization. A key component of a wrapper framework is the wrapper API and a library of code that implements this API.
Specifically, a wrapper API implemented in Java usually includes the Java class hierarchy rooted at the Wrapper class.
- Wrapper class, representing wrapper objects. A wrapper object is an execution of the corresponding wrapper program. Each Wrapper object therefore consists of a query object, a result-packaging object, a wrapper execution status object, and a wrapper exception handler. The Wrapper class also specifies the properties that are common to all types of wrappers, such as the version number, the software release date, the manufacturer name, the coding language, and the operational platforms, to name a few. Relational wrappers and HTTP wrappers are specialized classes of the Wrapper class.
- Query class, representing query objects and the mechanisms for firing queries at remote sites and fetching the data of interest. HTTP queries are instances of the HTTPQuery class, which can be implemented as a specialization of the Query class.
Each HTTP query can be further classified as an object of the HTTPGetQuery class or of the HTTPPostQuery class.
- WrapperOutput class, representing the results to be returned to the caller of the wrapper program. A WrapperOutput object is generated by applying a series of transformation and filtering steps to the source document fetched from the original data source.
- WrapperStatus, indicating the status of a wrapper that is running.
Different types of wrappers may need different methods for monitoring the execution status of a wrapper program. Thus, the WrapperStatus class can be specialized into RelationalWrapperStatus, HTTPWrapperStatus, and so forth.
- WrapperException, used to indicate errors that arise when firing a wrapper, such as timeouts caused by network connection failures or slow links. While the WrapperException class captures the errors and exceptions common to all wrappers, different types of wrappers may have specific exceptions and specific exception-handling needs. Thus, both generic and specialized wrapper exception-handling methods are needed.
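The class hierarchy listed above can be sketched in Java roughly as follows. This is a minimal illustration and not the XWRAP API itself: every field, method signature, and state name here (for example `fetch()`, `extract(...)`, `getStatus()`, and the `State` enum) is an assumption made for the example, and network transport is deliberately left abstract.

```java
import java.util.ArrayList;
import java.util.List;

/** Errors common to all wrappers, e.g. a timeout on a slow or failed connection. */
class WrapperException extends Exception {
    WrapperException(String message) { super(message); }
}

/** Example of a source-specific exception (the field is an assumption). */
class HTTPWrapperException extends WrapperException {
    final int httpStatus;
    HTTPWrapperException(String message, int httpStatus) {
        super(message);
        this.httpStatus = httpStatus;
    }
}

/** Runtime status of a wrapper; the state names are assumptions. */
class WrapperStatus {
    enum State { IDLE, RUNNING, DONE, FAILED }
    private volatile State state = State.IDLE;
    State getState() { return state; }
    void setState(State newState) { state = newState; }
}

class HTTPWrapperStatus extends WrapperStatus { }

/** A query knows how to fetch the raw source document from a remote site. */
abstract class Query {
    abstract String fetch() throws WrapperException;
}

abstract class HTTPQuery extends Query {
    final String url;
    HTTPQuery(String url) { this.url = url; }
    abstract String method();
}

// fetch() stays abstract in both leaf classes: transport is outside this sketch.
abstract class HTTPGetQuery extends HTTPQuery {
    HTTPGetQuery(String url) { super(url); }
    @Override String method() { return "GET"; }
}

abstract class HTTPPostQuery extends HTTPQuery {
    final String body;
    HTTPPostQuery(String url, String body) { super(url); this.body = body; }
    @Override String method() { return "POST"; }
}

/** Result returned to the caller after transformation and filtering. */
class WrapperOutput {
    final List<String> records;
    WrapperOutput(List<String> records) { this.records = records; }
}

abstract class Wrapper {
    final String version;
    protected final Query query;
    protected final WrapperStatus status;

    Wrapper(String version, Query query, WrapperStatus status) {
        this.version = version;
        this.query = query;
        this.status = status;
    }

    WrapperStatus getStatus() { return status; }

    /** Runs the query, extracts the output, and keeps the status object up to date. */
    WrapperOutput fire() throws WrapperException {
        status.setState(WrapperStatus.State.RUNNING);
        try {
            WrapperOutput output = extract(query.fetch());
            status.setState(WrapperStatus.State.DONE);
            return output;
        } catch (WrapperException e) {
            status.setState(WrapperStatus.State.FAILED);
            throw e;
        }
    }

    /** Source-specific customization point. */
    protected abstract WrapperOutput extract(String sourceDocument);
}

class HTTPWrapper extends Wrapper {
    HTTPWrapper(String version, HTTPQuery query) {
        super(version, query, new HTTPWrapperStatus());
    }

    /** Toy extraction rule: one record per non-blank line of the source document. */
    @Override
    protected WrapperOutput extract(String sourceDocument) {
        List<String> records = new ArrayList<>();
        for (String line : sourceDocument.split("\n")) {
            if (!line.trim().isEmpty()) records.add(line.trim());
        }
        return new WrapperOutput(records);
    }
}
```

In this sketch a source-specific wrapper is obtained by subclassing `Wrapper` and overriding `extract`, while a concrete query supplies `fetch()`; a relational wrapper would follow the same pattern with its own query and status classes.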
A major challenge in designing an extensible wrapper framework for wrapper construction is the identification of mandatory functionality of a wrapper and the clean separation of optional functionality of a wrapper from undesirable functionality. For example, should we consider sophisticated retrieval mechanisms, error handling, and the choice of streaming mode or blocking mode as mandatory wrapper functionality? Should performance, robustness, statistics, proxies, and optimization be better treated as optional functionality?
Should we consider data massaging and sophisticated recovery strategies as undesirable wrapper functionality?
Based on our experience in building wrappers, the mandatory functionality of a wrapper should contain those capabilities that are crucial for achieving the basic goal of a wrapper. For example, to enhance the information extraction quality, it is necessary for a wrapper to provide sophisticated retrieval and filtering mechanisms, and simple error handling strategies. Examples of error handling strategies include handling timeouts with user-specified thresholds and providing a status method that returns the runtime status of the wrapper. Furthermore, to improve the responsiveness of a wrapper, both blocking mode and streaming mode of interaction between a wrapper and its applications should be provided. When a wrapper runs in the streaming mode, applications are able to fire a wrapper (e.g., by issuing a query) and receive a stream of returned data, rather than having to block until the wrapper query terminates.
To provide the streaming interface, the Wrapper's fire(...) method returns a synchronized queue. The wrapper runs in its own thread and writes to the queue, and the application reads from it. Many applications may not take advantage of the streaming mode, so a much simpler blocking interface is also provided.
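The two interaction modes can be illustrated with a short sketch built on `java.util.concurrent.BlockingQueue`. The class name `StreamingWrapper`, the `fireBlocking()` method, and the end-of-stream sentinel are assumptions made for this example; the text does not say how XWRAP signals the end of a stream, and the in-memory list stands in for data fetched from a remote source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class StreamingWrapper {
    /** Sentinel marking the end of the stream (assumption); compared by identity. */
    static final String END_OF_STREAM = new String("<end-of-stream>");

    /** Stand-in for records that a real wrapper would extract from a remote source. */
    private final List<String> records;

    StreamingWrapper(List<String> records) {
        this.records = new ArrayList<>(records);
    }

    /** Streaming mode: returns immediately; a worker thread fills the queue. */
    BlockingQueue<String> fire() {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread worker = new Thread(() -> {
            try {
                for (String record : records) {
                    queue.put(record);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                // The queue is unbounded, so offer() always succeeds.
                queue.offer(END_OF_STREAM);
            }
        }, "wrapper-worker");
        worker.setDaemon(true);
        worker.start();
        return queue;
    }

    /** Blocking mode: drains the stream and returns all records at once. */
    List<String> fireBlocking() throws InterruptedException {
        BlockingQueue<String> queue = fire();
        List<String> all = new ArrayList<>();
        for (String r = queue.take(); r != END_OF_STREAM; r = queue.take()) {
            all.add(r);
        }
        return all;
    }
}
```

Building the blocking call on top of the streaming one keeps a single code path for retrieval; an application that wants early results calls `fire()` and reads with `take()` until it sees the sentinel.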
In addition to the mandatory functionality, a wrapper may offer a number of optional functions that are important and desirable, including performance statistics and mechanisms for wrapper query optimization. For example, many applications need more detailed statistics about the wrapper execution (e.g., clock times, bytes transferred, number of tuples or objects returned) than a simple indication that the status is done or running. It is also desirable for a wrapper to act as a responsible optimizer that prevents applications from throwing dozens of queries per second at a site, in particular by allowing an application-configurable inter-query delay.
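As a rough illustration of these optional features, the sketch below shows a plain statistics holder and a guard that enforces an application-configurable delay between consecutive queries. The class names `WrapperStatistics` and `InterQueryDelay`, their fields, and the `awaitTurn()` method are hypothetical and are not taken from XWRAP.

```java
import java.util.concurrent.TimeUnit;

/** Per-execution statistics; the field names are assumptions. */
class WrapperStatistics {
    long elapsedMillis;
    long bytesTransferred;
    int objectsReturned;
}

/** Enforces a minimum delay between consecutive queries sent to one site. */
class InterQueryDelay {
    private final long delayNanos;
    private long lastQueryNanos;
    private boolean queriedBefore = false;

    InterQueryDelay(long delayMillis) {
        if (delayMillis < 0) {
            throw new IllegalArgumentException("delay must be >= 0");
        }
        this.delayNanos = TimeUnit.MILLISECONDS.toNanos(delayMillis);
    }

    /** Sleeps for whatever remains of the configured delay since the previous query. */
    synchronized void awaitTurn() throws InterruptedException {
        if (queriedBefore) {
            long remaining = delayNanos - (System.nanoTime() - lastQueryNanos);
            if (remaining > 0) {
                TimeUnit.NANOSECONDS.sleep(remaining);
            }
        }
        lastQueryNanos = System.nanoTime();
        queriedBefore = true;
    }
}
```

A wrapper would call `awaitTurn()` just before sending each query; the first call returns immediately, and later calls wait only if the application fires queries faster than the configured delay allows.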
In the current design of XWRAP, we consider the following functionality undesirable, primarily because of the complexity introduced.
- Caching. A wrapper should not try to do any caching on behalf of the applications that invoke the wrapper. Such functionality should be provided by the application that needs the caching, not the wrapper.
- Data Mediation. A wrapper should return the extracted content with minimal mediation. For example, a wrapper should not remove duplicated information or canonicalize attribute values. Such requirements should be met by the applications.
- Failure recovery. A wrapper should not try to provide any sort of sophisticated failure recovery strategy, such as retrying failed queries a few times. Such functionality is best handled by the application.
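To make the division of responsibility concrete, here is a minimal sketch of a retry policy that lives entirely in the application, with the wrapper call passed in as a plain `Callable`. The helper `ApplicationRetry.callWithRetry` is hypothetical and only illustrates the idea that the wrapper itself stays free of recovery logic.

```java
import java.util.concurrent.Callable;

/** Application-side retry helper; the wrapper being called knows nothing about it. */
final class ApplicationRetry {
    private ApplicationRetry() { }

    /**
     * Invokes wrapperCall up to maxAttempts times and returns the first successful
     * result. If every attempt fails, the last exception is rethrown.
     */
    static <T> T callWithRetry(Callable<T> wrapperCall, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return wrapperCall.call();
            } catch (InterruptedException e) {
                // Do not retry after an interrupt; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
            }
        }
        throw last;
    }
}
```

Caching and duplicate removal can be layered on in the same way, for instance by having the application keep its own map from query to result, so that the wrapper continues to return the extracted content unmodified.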
Several capabilities are useful but are not considered in the current design of XWRAP, including mechanisms that enable wrappers to learn to adapt to changes at source sites, and the incorporation of complex data types. Whether or not such functionality is desirable remains an open question.
Ling Liu
Sun Feb 7 00:31:54 PST 1999