PROJECT SUMMARY

[ Problem ||| Concept ||| Objective ||| Approach ||| Milestones ]


Organization Name: College of Computing, Georgia Tech

Principal Investigator: Calton Pu

Co-PI and Technical Manager: Ling Liu

Problem

With the proliferation of Internet and intranets and the ongoing advances in the World Wide Web (WWW or Web) technology, everyone today can publish information on the web independently at any time. The flexibility and autonomy of producing and sharing information on WWW is phenomenal. However, the technology to organize, fuse, and access information from these vast data stores has not kept pace with the rapid growing volume of available data. On the one hand, this flood of information can easily overwhelm the users, and on the other hand, naive query processing can easily overload the database system, the data stores, the network, and the users. Both problems can be aggravated when constant addition and updates of on-line information sources force users to revisit Web sites and re-issue their queries frequently to obtain the new data that match their queries. Trying to monitor the constant information changes on the Internet is both labor intensive for the user and resource intensive for the system.

The first challenge is the heterogeneity of data sources and consumer requirements. There is a need for new technology to intelligently process, interpret, fuse, combine, and abstract useful knowledge from these heterogeneous data sources, sources which use different interfaces, query languages, data structures, terminology, and semantics to represent the same or similar information.

The second challenge is the requirement in data quality, reliability, and up-to-dateness that must be maintained throughout the information flow from producers to consumers. Consider the logistic application scenario of a joint operation in East Africa. As supplies are loaded and transported from their current location around to world to the theater of operations, the logistic anchor desk must keep track of their progress and current location. This is a time-critical task since the logistic anchor desk must have up-to-date information for planning as well as command and control. In this situation, the periodical update policy is of limited value; even dynamically changeable periodicity will only alleviate the problem. Thus, there is a need for new technology to effectively and efficiently monitor information updates.

The third challenge is the query responsiveness and fault tolerance in the presence of slow delivery and availability problems caused by netwok congestion and source contention. For example, in logistic applications, it is very important to propagate partial results when sources are unavailable or the network is degraded. This problem is aggravated when we take into account the quality of information requirements of the second challenge. A complementary problem is the handling of conflicts and update merging when new sources join the network, or when old sources recovers from crash. Thus, there is a need for dynamic query evaluation techniques that can incorporate run-time information to revise the query plan when some sources or intermediate sites are unavailable or have slow delivery frequency.

Objective

The primary objective of the Continual Queries project is to investigate information changes (insertion, deletion, and update) monitoring in large-scale distributed information, and to design an intelligent and adaptive architecture for coordinating different levels of system support for delivering data of required quality, reliability, and up-to-dateness. We are building on currently ongoing research to help handle heterogeneity and query responsiveness challenges. Concretely, we have three principal subgoals: (1) to develop an integrated set of data delivery techniques that allow information consumers to subscribe to their most preferred data delivery services, and enable information producers to publish and deliver their source data in a timely fashion; (2) to design an architecture for efficient and effective information delivery and updates monitoring; (3) to build a prototype that implements and tests the Continual Queries architecture and the set of techniques in the context of logistics applications.

Concept

A continual query CQ is described as a triple, consisting of a normal query Q (written in SQL), a triggering condition Tcq and a termination condition. We define the result of running a continual query CQ as a sequence of query answers, say Q(S1), ..., Q(Sn), obtained by running query Q on the sequence of database states Si, i = 1, ..., n, each element Q(Si) in the sequence is generated when the triggering condition Tcq associated with the continual query becomes true. The triggering condition can take several forms: (1) a direct specification of time such as every Monday, (2) a specification of time interval from a previous query result such as a week since the last query result was produced, (3) a condition on the database state such as Q should be evaluated when the price of Siemens stock is reached to 110 DM, or (4) a relationship between a previous result and the current database state such as Q should be evaluated when the price of Simens stock is increased by 10% since the last query result was produced.

A continual query can be posed over multiple heterogeneous data sources. It is quite common that at any given time point, some sources may be changed and others remain the same with respect to the previous time point. It is in general expensive to re-evaluate the whole query over the entire data sources for each execution of a continual query, although in some circumstances it may be unavoidable to reprocess the query from scratch. Therefore, it is important to find optimization methods that can bypass the complete re-evaluation to avoid duplicate computation and unnecessary data transmission over the networks.

Approach

The main research and development challenges of the Continual Queries project are the following:

General Updates: An important distinction between CQs and previous work is the handling of general updates, including insertions, deletions, and modifications. For example, let us assume that a job opening database is updated whenever a posted position is filled (deletion) or a new job posting arrives (insertion). Some users may desire to be informed of both the incoming new openings from some particular type of companies and the positions that were available before (e.g., in the previous execution of the query) but have been recently filled or canceled due to budget disapproval. Simply recomputing the query for each execution does not guarantee the correct results, because conventional SQL query manager can only filter data currently matching the query. To provide adequate support for flexible incorporation of updates, the CQ manager must exploit the knowledge implied by both the query expressions and database update operations.

CQ Evaluation: Another important problem in monitoring of information updates is the cost of query evaluation. A naive query evaluation strategy would recompute the same query from scratch for each execution request and highlight the difference. In the Internet environment, such strategy can easily overwhelm the users and overload the system and networks, especially when a large number of records match the query, but only a small fraction is changed each time, or when some users want to see only the data that have been updated since the last query result.

Incremental Algorithms: We have proposed the differential re-evaluation algorithm (DRA) for processing CQs. The solution we promote avoids the overhead of recomputing the query from scratch, and relaxes the restrictions of append-only database and monotonic-transformable queries. Our design builds on the research in differential files, hypothetical relations, and incremental updating materialized views, and offers several advantages over the previous solutions. The basic idea is to compute the result of the current execution of a continual query by combining the result produced by the most recent execution with the updates occurred in between these two consecutive executions. We use differential relations to record changes to the source relations (or materialized views). We define a set of differential operators to compute the updates occurred in between two consecutive executions of a continual query.

Performance Tuning Through Specializations: An important problem in the processing of continual queries is the varied performance of different update propagation and query evaluation algorithms for different environments, since different update propagation and CQ query processing algorithms work best in different scenarios. We will evaluate these algorithms according to their capabilities and use specialization techniques developed in the Synthetix project to improve the overall system performance. Intuitively, specialization says that we should use the best algorithm for each specific situation we can characterize. For example, in this project we will implement different algorithms for the different scenarios such as those given below.

  1. Large number of queries, relatively slow update rates. For instance, stock market trend analysis.
  2. Relatively small number of queries, high update rates. For instance, mobile inventory tracking and recording.
  3. Both small and large number of queries (depending on the application area), slow update rates in information sources (directories). For instance, Agile/Harvest information discovery.

The main two variables are the number of queries on the data and the update rate. It is intuitive that an algorithm that takes advantage of low update rates (the first case below) probably will not work well when that assumption is not true. Instead of trying to come up with a general algorithm for all cases, we will apply our experience with specialization to accommodate these kinds of algorithms.

Combining Client Pull with Server Push: Continual queries can be seen as a hybrid mode of data delivery which combines the client-pull and server-push mechanisms, namely, the transfer of information from servers to clients is first initiated by a client pull and the subsequent transfer of updated information to clients is initiated by a server push. In the CQ project, we will explore a number of ways to provide more flexible combinations of client-pull and server-push, including monitoring changes in information sources by incremental organization and management of client caches based on query specification; providing a flexible and user-friendly CQ adapter interface for installation of distributed continual query services; developing strategies for dynamic interaction of ad-hoc query answering and smart (semantic) query caching; and incorporating services such as prefetching, update dissemination, and balancing push-based and pull-based information access into the CQ system architecture.

Milestones

FY-97 Plans

FY-98 Plans

FY-98 Plans

Technology Transition



CQ Home Page cq@cc.gatech.edu