Organization Name:
College of Computing, Georgia Tech
Principal Investigator:
Calton Pu
Co-PI and Technical Manager:
Ling Liu
The first challenge is the heterogeneity of data sources and consumer requirements. There is a need for new technology to intelligently process, interpret, fuse, combine, and abstract useful knowledge from these heterogeneous data sources, which use different interfaces, query languages, data structures, terminologies, and semantics to represent the same or similar information.
The second challenge is the data quality, reliability, and freshness requirements that must be maintained throughout the information flow from producers to consumers. Consider the logistics application scenario of a joint operation in East Africa. As supplies are loaded and transported from their current locations around the world to the theater of operations, the logistics anchor desk must keep track of their progress and current location. This is a time-critical task, since the logistics anchor desk must have up-to-date information for planning as well as command and control. In this situation, a periodic update policy is of limited value; even a dynamically adjustable period only alleviates the problem. Thus, there is a need for new technology to effectively and efficiently monitor information updates.
The third challenge is query responsiveness and fault tolerance in the presence of slow delivery and availability problems caused by network congestion and source contention. For example, in logistics applications, it is very important to propagate partial results when sources are unavailable or the network is degraded. This problem is aggravated when we take into account the quality-of-information requirements of the second challenge. A complementary problem is the handling of conflicts and update merging when new sources join the network or old sources recover from crashes. Thus, there is a need for dynamic query evaluation techniques that can incorporate run-time information to revise the query plan when some sources or intermediate sites are unavailable or slow to deliver.
A continual query can be posed over multiple heterogeneous data sources. At any given time point, it is quite common that some sources have changed while others remain the same with respect to the previous time point. Re-evaluating the whole query over all of the data sources for each execution of a continual query is in general expensive, although in some circumstances reprocessing the query from scratch is unavoidable. Therefore, it is important to find optimization methods that bypass complete re-evaluation, avoiding duplicate computation and unnecessary data transmission over the network.
General Updates: An important distinction between CQs and previous work is the handling of general updates, including insertions, deletions, and modifications. For example, assume that a job-opening database is updated whenever a posted position is filled (deletion) or a new job posting arrives (insertion). Some users may wish to be informed both of new openings from particular types of companies and of positions that were available before (e.g., in the previous execution of the query) but have recently been filled or canceled due to budget disapproval. Simply recomputing the query for each execution does not guarantee correct results, because a conventional SQL query manager can only filter the data currently matching the query. To provide adequate support for the flexible incorporation of updates, the CQ manager must exploit the knowledge implied by both the query expressions and the database update operations.
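The job-opening scenario can be made concrete with a minimal sketch: to report deletions as well as insertions, the CQ manager must compare the current matching rows against the previous execution's result rather than merely filter the current data. The function and sample data below are illustrative.

```python
# Hypothetical sketch: report both insertions and deletions between two
# consecutive executions of a continual query over a job-openings table.
def diff_results(previous, current):
    """Compare two snapshots of matching rows (sets of tuples)."""
    inserted = current - previous   # new postings that now match the query
    deleted = previous - current    # postings filled or canceled since last run
    return inserted, deleted

prev = {("Acme", "engineer"), ("Initech", "analyst")}
curr = {("Acme", "engineer"), ("Hooli", "designer")}
new, gone = diff_results(prev, curr)
# new  -> {("Hooli", "designer")}
# gone -> {("Initech", "analyst")}
```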
CQ Evaluation: Another important problem in monitoring information updates is the cost of query evaluation. A naive query evaluation strategy would recompute the same query from scratch for each execution request and highlight the differences. In the Internet environment, such a strategy can easily overwhelm users and overload the system and networks, especially when a large number of records match the query but only a small fraction changes each time, or when some users want to see only the data that have been updated since the last query result.
Incremental Algorithms: We have proposed the differential re-evaluation algorithm (DRA) for processing CQs. The solution we promote avoids the overhead of recomputing the query from scratch, and relaxes the restrictions of append-only databases and monotonically transformable queries. Our design builds on research in differential files, hypothetical relations, and the incremental updating of materialized views, and offers several advantages over previous solutions. The basic idea is to compute the result of the current execution of a continual query by combining the result produced by the most recent execution with the updates that occurred between these two consecutive executions. We use differential relations to record changes to the source relations (or materialized views), and define a set of differential operators to compute the updates that occurred between two consecutive executions of a continual query.
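For a simple selection query, the differential idea above can be sketched as follows. This is a simplified illustration, not the DRA itself: it assumes a modification is recorded as a deletion plus an insertion, and all names are illustrative.

```python
def sigma(predicate, rows):
    """Selection operator: the subset of rows satisfying the query predicate."""
    return {r for r in rows if predicate(r)}

def differential_reevaluate(prev_result, inserted, deleted, predicate):
    """Combine the previous result with the differential relations
    (inserted and deleted tuples) instead of rescanning the source."""
    return (prev_result - sigma(predicate, deleted)) | sigma(predicate, inserted)

# Query: positions paying at least 100 (units illustrative).
high_paying = lambda r: r[1] >= 100
prev_result = {("a", 120)}
new_result = differential_reevaluate(
    prev_result,
    inserted={("b", 150), ("c", 90)},   # only ("b", 150) matches
    deleted={("a", 120)},               # drops out of the result
    predicate=high_paying,
)
# new_result -> {("b", 150)}
```

Note that only the (typically small) differential relations are scanned; the full source relation is never touched.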
Performance Tuning Through Specializations: An important problem in the processing of continual queries is that different update propagation and query evaluation algorithms perform best in different environments. We will evaluate these algorithms according to their capabilities and use specialization techniques developed in the Synthetix project to improve overall system performance. Intuitively, specialization says that we should use the best algorithm for each specific situation we can characterize. For example, in this project we will implement different algorithms for different scenarios such as those given below.
The two main variables are the number of queries on the data and the update rate. Intuitively, an algorithm that takes advantage of low update rates (the first case below) will probably not work well when that assumption does not hold. Instead of trying to devise a single general algorithm for all cases, we will apply our experience with specialization to accommodate these kinds of algorithms.
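The dispatch on these two variables can be sketched as below. The thresholds and strategy names are hypothetical placeholders, not the actual specializations of the Synthetix project.

```python
# Illustrative sketch of specialization: choose an evaluation strategy
# from the number of queries and the update rate. Thresholds and strategy
# names are hypothetical.
def choose_strategy(num_queries, updates_per_second):
    if updates_per_second < 0.01:
        return "cache-and-diff"        # low update rate: reuse cached results
    if num_queries > 1000:
        return "grouped-incremental"   # many queries: share work per update
    return "per-query-incremental"     # default: evaluate deltas per query

choose_strategy(5, 0.001)     # "cache-and-diff"
choose_strategy(5000, 1.0)    # "grouped-incremental"
```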
Combining Client Pull with Server Push: Continual queries can be seen as a hybrid mode of data delivery that combines the client-pull and server-push mechanisms: the transfer of information from servers to clients is first initiated by a client pull, and subsequent transfers of updated information to clients are initiated by server pushes. In the CQ project, we will explore a number of ways to provide more flexible combinations of client pull and server push, including: monitoring changes in information sources through incremental organization and management of client caches based on query specifications; providing a flexible and user-friendly CQ adapter interface for installing distributed continual query services; developing strategies for the dynamic interaction of ad-hoc query answering and smart (semantic) query caching; and incorporating services such as prefetching, update dissemination, and the balancing of push-based and pull-based information access into the CQ system architecture.
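The hybrid delivery mode can be illustrated with a minimal in-process sketch: the client's initial request pulls the current result, and later updates matching the query are pushed to a registered callback. Class and method names are illustrative, not the CQ system's interface.

```python
# Minimal sketch of pull-then-push delivery for a continual query.
class CQServer:
    def __init__(self, rows):
        self.rows = set(rows)
        self.subscribers = []   # (predicate, callback) pairs

    def register(self, predicate, callback):
        """Client pull: return the current result and subscribe for pushes."""
        self.subscribers.append((predicate, callback))
        return {r for r in self.rows if predicate(r)}

    def insert(self, row):
        """Server push: forward a matching update to interested clients."""
        self.rows.add(row)
        for predicate, callback in self.subscribers:
            if predicate(row):
                callback(row)

received = []
server = CQServer({("pads", 3)})
initial = server.register(lambda r: r[0] == "pads", received.append)
server.insert(("pads", 5))   # pushed to the client
server.insert(("ink", 1))    # does not match; not pushed
# initial  -> {("pads", 3)}
# received -> [("pads", 5)]
```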