Overview

Motivation

The Internet today contains millions of hosts that provide useful information accessible from anywhere in the world. Many of these information sources partially overlap, and their querying capabilities are varied yet limited. Locating and accessing information in a rapidly growing, heterogeneous, distributed collection of data sources such as the Internet is a difficult problem of growing importance, especially for large-scale data-intensive systems where a single point of contact is desired. However, neither the organization of the data nor the available Internet tools for associative access to the data are adequate for quickly discovering relevant information.

Query Routing

Query routing is the process of directing user queries to appropriate servers by constraining the search space through query refinement and source selection. The goal of query routing middleware is to reduce both false positives (useless answers that are delivered but fail to fulfill the user's needs) and false negatives (useful answers that the system fails to deliver to the user) in answering a query. Effective query routing not only minimizes the query response time and the overall processing cost, but also eliminates much unnecessary communication overhead on the global networks and on the individual information sources.
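
As a small illustration of the two error types defined above, the following sketch counts false positives and false negatives for a single query. The function name routing_errors and the document identifiers are hypothetical and are not part of AQR; the sketch also assumes the set of truly relevant answers is known, which in practice it usually is not.

```python
def routing_errors(delivered, relevant):
    """Return (false_positives, false_negatives) for one query.

    delivered: answers the system returned to the user
    relevant:  answers that would actually fulfill the user's needs
               (assumed known here purely for illustration)
    """
    delivered, relevant = set(delivered), set(relevant)
    false_positives = delivered - relevant   # delivered but useless
    false_negatives = relevant - delivered   # useful but never delivered
    return false_positives, false_negatives


fp, fn = routing_errors(delivered=["a", "b", "c"], relevant=["b", "c", "d"])
print(fp, fn)  # {'a'} {'d'}
```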

The goal of a query routing system is to provide efficient associative access to a large, heterogeneous, distributed collection of information providers by routing a user query to the most relevant information sources, those that can provide the best answer. We pursue this goal along two dimensions. The first is to develop a set of methods and techniques that incorporate a query routing step into the query evaluation and search process to improve query responsiveness and system scalability. The second is to build a working system that demonstrates our ideas, concepts, and techniques for efficient and responsive querying and search using real-world application scenarios.

AQR Software

AQR is an adaptive query routing middleware system. In AQR, a query routing task is divided into two cooperating processes: query refinement and source selection. It is well known that a broadly defined query inevitably produces many false positives. Query refinement is a process in which various refinement mechanisms are used to narrow a query definition so that it focuses only on useful answers. As a complementary process, source selection reduces false negatives by identifying and locating a set of relevant information providers from a large collection of available sources. By pruning information sources that are irrelevant to a user query, source selection also reduces the overhead of contacting information servers that do not contribute to the answer of the query.

The AQR Software serves as an adaptive middleware layer that incorporates several different query routing mechanisms and provides an architecture for deploying them in an open networked environment. AQR also includes facilities for performance monitoring, which allow a system developer to examine the impact of using different query routing mechanisms.

Query Refinement

In a large, rapidly evolving network of information servers, users will inevitably submit broadly defined queries that produce enormous result sets with many false positives. Such result sets are likely to degrade system performance and overwhelm the user with unwanted material. Query refinement is a value-added function that helps the user formulate queries that return more useful results and can be processed efficiently.

Query refinement can be seen as a mechanism that recommends query modifications to reduce false positives. For instance, query refinement can assist a user in formulating well-focused queries by suggesting terms related to a query. The user can apply these terms to narrow the query definition to documents of interest, either by formulating new queries or by refining an existing query. Query refinement drives this focusing process incrementally: it repeatedly offers the user an automatically generated set of terms from which to choose in order to make the query more specific.

There are a number of methods for deriving recommended terms to add to a user's query. Typical query refinement algorithms rely on collocation of terms. A commonly used approach consists of two basic steps: it first computes the set of documents that contain one or more terms from the user's query, and then suggests to the user the terms with the highest cumulative frequency over that set of documents. Any concrete query refinement algorithm may incrementally drive the refinement process by iterating these two steps. We refer to this approach as source-documents based term collocation.

In AQR, a new approach to query refinement is explored, which uses user query profiles as a means to assist a user in formulating well-focused queries. The main idea is to derive recommended terms based on the semantic context and scope of what the user wants in a particular query, and to replace the terms that are too broad in the original query definition with the recommended terms. For instance, in response to the query "book suppliers", the query routing system will derive the following recommended terms: "book store", "book club", "publisher". These recommended terms are either derived automatically from a domain-specific ontology (in this case, an ontology for the book-related application domain) or obtained directly from the user's feedback on the query context. We refer to our approach to finding collocation of terms as ontology-based term collocation.
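
The two steps of source-documents based term collocation described above can be sketched as follows. This is a minimal illustration rather than the algorithm used in AQR: the function name suggest_terms, the whitespace tokenization, and the toy document collection are all assumptions made for the example.

```python
from collections import Counter


def suggest_terms(query, documents, top_k=3):
    """Suggest refinement terms by source-documents based term collocation.

    Tokenization by whitespace and lowercasing is a simplifying assumption.
    """
    query_terms = set(query.lower().split())

    # Step 1: find the documents that contain one or more query terms.
    matching = [
        doc for doc in documents
        if query_terms & set(doc.lower().split())
    ]

    # Step 2: accumulate term frequencies over the matching documents,
    # skipping terms that are already part of the query.
    counts = Counter()
    for doc in matching:
        for term in doc.lower().split():
            if term not in query_terms:
                counts[term] += 1

    # Suggest the terms with the highest cumulative frequency.
    return [term for term, _ in counts.most_common(top_k)]


docs = [
    "book store downtown",
    "book club monthly meeting",
    "garden supplies store",
    "book store online",
]
print(suggest_terms("book", docs, top_k=1))  # ['store']
```

Iterating the refinement would amount to calling suggest_terms again with the query extended by the term the user picked. Note that the suggestions depend entirely on the document collection, which is the property the ontology-based approach avoids.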

A significant difference between the user-query-profile approach to query refinement and the approach based on collocation of terms in the source documents lies in the way additional terms are derived. The approach driven by user query profiles computes and recommends terms to focus a query primarily from domain knowledge about the terms used in the original user query, and is thus independent of the collection of source documents over which the query is posed. An obvious advantage of the ontology-based term collocation approach is its ability to reduce false positives before the query is actually run, thus enhancing the efficiency and accuracy of the query refinement algorithms.
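
The ontology-based approach can be illustrated with a toy sketch that uses the "book suppliers" example. The dictionary BOOK_ONTOLOGY and the functions recommend_terms and refine_query are hypothetical; how AQR represents ontologies and user query profiles is not described here, so a plain mapping from a broad term to narrower terms is assumed.

```python
# Hypothetical toy ontology: broad term -> narrower recommended terms.
BOOK_ONTOLOGY = {
    "book suppliers": ["book store", "book club", "publisher"],
}


def recommend_terms(broad_term, ontology):
    """Look up narrower terms for a broad term.

    No source documents are consulted, so recommendations are available
    before the query is ever run.
    """
    return list(ontology.get(broad_term.strip().lower(), []))


def refine_query(query, broad_term, chosen_term, ontology):
    """Replace a too-broad term in the query with a term the user chose."""
    if chosen_term not in recommend_terms(broad_term, ontology):
        raise ValueError(f"{chosen_term!r} is not recommended for {broad_term!r}")
    return query.replace(broad_term, chosen_term)


print(recommend_terms("book suppliers", BOOK_ONTOLOGY))
# ['book store', 'book club', 'publisher']
print(refine_query("cheap book suppliers", "book suppliers", "publisher", BOOK_ONTOLOGY))
# cheap publisher
```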

Source Selection

Source selection is the process in query routing that aims to reduce or minimize false negatives in the result set of a query. It helps the user locate relevant information by identifying the information sources that are irrelevant to a user query and pruning them, thus reducing the overhead of contacting information servers that do not contribute to the answer of the query. Given a query and a distributed collection of information sources, the first decision to make is which of four different search semantics is required for this specific query. Given a query and its search semantics, there are a number of alternative approaches to computing the best data sources for that query, depending on the types and capabilities of the information sources. It is important to note that different source selection mechanisms use different structures for source capability profiles and require different pruning algorithms.
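
As one possible illustration of pruning against source capability profiles, the sketch below represents each profile as a set of topic terms and discards sources whose profile shares no term with the query. The profile structure, the overlap score, and the names select_sources and min_overlap are assumptions made for this example; as noted above, real mechanisms use different profile structures and pruning algorithms, and the choice of search semantics is not modeled here.

```python
def select_sources(query, profiles, min_overlap=1):
    """Return the names of relevant sources, most relevant first.

    profiles: dict mapping a source name to a set of topic terms
              (a hypothetical, deliberately simple capability profile).
    """
    query_terms = set(query.lower().split())
    scored = []
    for name, terms in profiles.items():
        overlap = len(query_terms & terms)
        # Prune sources that cannot contribute to the answer.
        if overlap >= min_overlap:
            scored.append((overlap, name))
    # Rank by descending overlap, breaking ties by source name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


profiles = {
    "books-r-us": {"book", "store", "publisher"},
    "garden-hub": {"garden", "tools"},
    "lit-club": {"book", "club"},
}
print(select_sources("book publisher", profiles))
# ['books-r-us', 'lit-club']
```

Only the selected sources would then be contacted, so the pruned source ("garden-hub" in this toy data) adds no communication overhead.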

Multi-level Source Relevance Reasoning