Source-Biased Discovery using DynaBot


Sponsor: Ling Liu / James Caverlee
{lingliu, caverlee}@cc.gatech.edu
216 / 225B CCB
Area: Systems and Databases


Problem

The past few years have witnessed great strides in the accessibility and manageability of vast amounts of Web data. In particular, the widespread adoption of general-purpose search engines like Google and AllTheWeb has added a layer of organization to an otherwise unwieldy medium. But with the rise of high-quality, data-intensive web services on the so-called Deep Web (or Hidden Web) and the emergence of the web services paradigm, these popular tools are becoming less relevant. Recent studies suggest that the size and growth rate of the dynamic Web greatly exceed those of the static Web, yet dynamic content is often ignored by existing search engine indexers owing to the technical challenges of searching the Deep Web.

To address these challenges, we argue that there is a growing need for efficient mechanisms for discovering and ranking data-intensive services. Effective mechanisms for web service discovery and ranking are critical for organizations to take advantage of the tremendous opportunities offered by web services: to engage in business collaborations and service compositions, to identify potential service partners, and to understand service competitors and increase the competitive edge of their service offerings.

In the context of service discovery, we have previously introduced two related systems:

  1. DynaBot: We have developed DynaBot, a prototype service-centric crawler designed to discover and cluster Deep Web sources offering dynamic content. DynaBot utilizes a service class model of the Web, implemented through the construction of service class descriptions (SCDs), and employs a modular, self-tuning system architecture for focused crawling of the Deep Web guided by these descriptions.
  2. BASIL: We have also developed the BASIL algorithms (BiAsed Service dIscovery aLgorithm) for discovering and ranking relevant data-intensive web services. Our first prototype supports a personalized view of web services through source-biased probing and source-biased relevance detection and ranking metrics. Concretely, our approach discovers and ranks web services based on the nature and degree of the data relevance of a source service to others.
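To give a flavor of source-biased ranking, the sketch below scores candidate services by how much of the biasing source's vocabulary they cover. This is a minimal illustration only: the function names, the term-set representation, and the simple overlap metric are assumptions for exposition, not the published BASIL definitions (which are built on probing and richer relevance metrics).

```python
def focus(source_terms: set[str], target_terms: set[str]) -> float:
    """Source-biased focus: the fraction of the source's vocabulary that
    also appears in the target. Deliberately asymmetric -- the score is
    taken from the source's point of view."""
    if not source_terms:
        return 0.0
    return len(source_terms & target_terms) / len(source_terms)

def rank_services(source_terms: set[str], candidates: dict[str, set[str]]):
    """Rank candidate services by their source-biased relevance score."""
    scored = [(name, focus(source_terms, terms))
              for name, terms in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy example: vocabularies one might extract by probing each service.
source = {"gene", "protein", "sequence", "genome"}
candidates = {
    "bio-db":   {"gene", "protein", "genome", "pathway"},
    "movie-db": {"actor", "film", "genre"},
}
ranked = rank_services(source, candidates)
```

Because the score is normalized by the source's vocabulary, two services can rank each other differently, which is exactly the personalized, source-biased view described above.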

This mini-project focuses on implementing a combined source-biased crawler that incorporates the BASIL algorithms into the DynaBot crawling architecture. Sample source code and relevant papers will be provided. There are a number of exciting opportunities to incorporate interesting research ideas into the crawler, so we look forward to hearing from interested students.
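One plausible shape for such a combination is a focused-crawl loop whose frontier is prioritized by a source-biased score. The sketch below is purely illustrative: the helper callables (fetch, extract_terms, extract_links), the term-overlap score, and the choice to let a child URL inherit its parent page's score are all simplifying assumptions, not the DynaBot or BASIL implementations.

```python
import heapq

def source_biased_crawl(seed_urls, fetch, extract_terms, extract_links,
                        source_terms, max_pages=100):
    """Focused crawl with a priority frontier: pages whose vocabulary
    overlaps the biasing source's are expanded first. All helpers are
    caller-supplied (hypothetical for this sketch)."""
    frontier = [(0.0, url) for url in seed_urls]  # (-score, url) min-heap
    heapq.heapify(frontier)
    seen = set(seed_urls)
    results = []
    while frontier and len(results) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        terms = extract_terms(page)
        overlap = (len(source_terms & terms) / len(source_terms)
                   if source_terms else 0.0)
        results.append((url, overlap))
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                # Simplification: a child inherits its parent's score
                # as its crawl priority until it is fetched itself.
                heapq.heappush(frontier, (-overlap, link))
    return results

# Toy in-memory "web" for illustration (all names hypothetical).
web = {
    "a": ({"gene", "protein"}, ["b", "c"]),
    "b": ({"gene", "genome", "sequence"}, []),
    "c": ({"actor", "film"}, []),
}
results = source_biased_crawl(
    ["a"],
    fetch=lambda url: web[url],
    extract_terms=lambda page: page[0],
    extract_links=lambda page: page[1],
    source_terms={"gene", "protein", "genome", "sequence"},
)
```

In a real integration, the scoring step would be replaced by BASIL's source-biased probing and relevance metrics, and the fetch/extract helpers by DynaBot's modular crawler components.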


Deliverables:

  1. A brief report describing your efforts.
  2. A tarball of your code.


Evaluation: You will be graded on the novelty and quality of your report and implementation.