Object Mining On the Web


Sponsor Ling Liu / David Buttler
{lingliu, buttler}@cc.gatech.edu
223 / 260 CCB
Area Systems and Databases
Related Projects


Problem
An overwhelming amount of information is now available online. Putting information online has made it easier to access; however, most of it is designed for human consumption rather than computer processing. As a result, human attention is required to gather useful information, which leads to information overload and to coping strategies such as ignoring potentially important data. To counteract this tendency, it is necessary to build tools that can find, extract, and organize information. This project focuses on converting online information into a form that machines can use, thus allowing more complex aggregation, summarization, and automated information management.

The first step is identifying the types of information available in a particular web page. A single page may contain many different types of data, some complex and some simple. Before understanding complex types, it is best to understand their simpler component parts. For example, to understand a book object it is easier to first recognize the names of the authors, editors, and publishers, as well as the publication dates and possibly the price. This example points to the necessity of extracting certain fundamental data types, such as names, dates, and prices.
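To make the idea of fundamental types concrete, here is a minimal Python sketch that labels the tokens of one text node as dates, prices, or a generic string type. The two regular expressions, the type names, and the helper names `classify_token` and `type_structure` are illustrative assumptions, not part of the project specification; a real page would need broader patterns and would walk the DOM tree rather than a single string.

```python
import re

# Hypothetical patterns for illustration only: ISO or US-style dates,
# and dollar prices with optional thousands separators and cents.
TYPE_PATTERNS = [
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4}")),
    ("price", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")),
]


def classify_token(token):
    """Return the first type whose pattern matches the whole token, else 'string'."""
    for type_name, pattern in TYPE_PATTERNS:
        if pattern.fullmatch(token):
            return type_name
    return "string"  # generic type for tokens that match no defined pattern


def type_structure(text):
    """Map the whitespace-separated tokens of one text node to (type, token) pairs."""
    return [(classify_token(token), token) for token in text.split()]
```

Because the sketch matches whole tokens, a price followed directly by punctuation (for example "$29.95.") falls back to the generic type; stripping punctuation before matching is one possible refinement.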

There are several goals in this project, each of which can be turned into a mini-project suitable for 7001.

  1. Design a system to identify some commonly used types in a web page. For example, define a set of regular expressions that can identify dates or prices. In conjunction with the regular expressions, develop a small program that converts a page into a type structure: given a DOM model of a web page, identify all of the types that you have defined and replace the matching string tokens with XML tags identifying the types. Replace all non-type tokens with a generic type, and return the tree as a full type structure.
  2. Develop a system to recognize names in a page. Given a list of names and a web page, identify possible matches in the page. Then, using the structure of the page and the distribution of recognized names, identify other strings that may also be names based on their location in the DOM tree hierarchy representing the page.
  3. Develop a framework into which different type identifiers can simply be plugged. The framework should support objects that are based on single tokens from a web page, multiple tokens, or even tokens that span tree leaf boundaries. A simple type identifier could be implemented to validate the ideas.
  4. Identify complex objects in a web page based on patterns of simple types. Using association rule mining methods, the goal is to identify objects with similar association patterns. For the mini-project it is sufficient to develop a hand-coded rule that converts a single page with marked-up simple types into a list of complex objects, and to describe one possible direction for scaling the system to handle a larger variety of page types. For example, a book object has a name (person name type), a date (date type), a title (string type), a publisher (company type), and possibly a price (price type). If 90% of the objects follow such a pattern, you may decide that the set of objects are books, and/or that the 10% of objects that do not follow the pattern do not belong in the list. The deliverable of the project should be a design report, a validation method, and results. The validation can be done by implementing a simple prototype that demonstrates the validity of the design ideas, or by a simulation-based evaluation.
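The hand-coded rule in goal 4 can be sketched in Python as follows. This is only one possible reading of the book example: the type labels (`person_name`, `date`, `string`, `company`, `price`), the function names, and the input format (each candidate object given as a list of the simple types found in it) are assumptions made for illustration.

```python
# Types a candidate must contain to count as a book; price is optional,
# following the book example in goal 4. The labels are hypothetical.
BOOK_REQUIRED = {"person_name", "date", "string", "company"}


def matches_book(types):
    """True if the candidate contains every required book type."""
    return BOOK_REQUIRED <= set(types)


def classify_list(candidates, threshold=0.9):
    """Decide whether a list of candidates is a list of books.

    Returns (is_book_list, outliers), where outliers holds the indices of
    candidates that break the pattern in a list judged to contain books.
    """
    if not candidates:
        return False, []
    flags = [matches_book(candidate) for candidate in candidates]
    support = sum(flags) / len(flags)  # fraction of candidates following the pattern
    is_book_list = support >= threshold
    outliers = [i for i, ok in enumerate(flags) if not ok] if is_book_list else []
    return is_book_list, outliers
```

With nine candidates that follow the pattern and one that does not, the support is exactly 90%, so the list is treated as books and the tenth candidate is reported as an outlier. Replacing the fixed `BOOK_REQUIRED` set with patterns mined from the data is one direction for scaling beyond a single page type.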

Useful References

Background

A background in programming in Java or Python would be very useful for this project. Some knowledge of machine learning may be useful but is not required.


Deliverables

A report describing what you have learned and listing useful references, along with source code for any implementation done to validate your ideas.


Evaluation
You will be graded on the novelty and quality of the report and implementation.