Omini

Object Mining and Extraction System

The web has become one of the primary ways that people, businesses, and organizations share information. However, due to the wealth of information available, finding and reusing information has become much more difficult.

Search engines have tried to deal with finding information by indexing all of the pages on the web, but they have several shortcomings. First, they cannot keep pace with the expansion of the web. Second, they ignore most of the information available, because they only look at static pages. By some estimates, more than 90% of the information on the web is "hidden" and only available through forms. Finally, they offer no granularity finer than a whole page.

The goal of Omini is to get at the data behind Web forms. The Omini software automatically extracts content objects from a page and ignores the irrelevant parts. A key design feature of Omini is that it remains robust as the web pages from which it extracts data evolve, eliminating the need for a programmer to manually determine where the objects are.

Omini's technology is useful in several different domains. We have already applied Omini as the foundation of XWRAPElite. XWRAPElite is an interactive online toolkit that generates wrappers which extract data from web sites and convert it into semantically relevant XML.
We are also in the process of constructing a search engine for dynamic web sites that is based on Omini. The search engine will be able to locate relevant data objects in web sites that are appropriate to the context of a search. This approach complements traditional search engines, such as HyperBee or Google, which index static web pages.
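
As a rough illustration of what a wrapper does, the sketch below pulls the data rows out of a small HTML page and emits them as XML, skipping the navigation and footer text. It is not the code that XWRAPElite generates and it is not Omini's extraction method: the location of the objects (table rows) is hard-coded here, whereas Omini is meant to find it automatically. The names RowWrapper, html_to_xml, PAGE, and the field names "title" and "price" are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class RowWrapper(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell (menus, footers) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_to_xml(html, fields=("title", "price")):
    """Return an XML string with one <item> per table row (assumed layout)."""
    parser = RowWrapper()
    parser.feed(html)
    parser.close()
    root = ET.Element("results")
    for row in parser.rows:
        item = ET.SubElement(root, "item")
        for name, value in zip(fields, row):
            ET.SubElement(item, name).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical result page: only the table rows are content objects.
PAGE = """
<html><body>
<div>Home | About | Contact</div>
<table>
<tr><td>Blue Widget</td><td>9.99</td></tr>
<tr><td>Red Widget</td><td>12.50</td></tr>
</table>
<p>Copyright notice</p>
</body></html>
"""

print(html_to_xml(PAGE))
```

A hand-written wrapper like this one breaks as soon as the site changes its layout, which is the maintenance cost that automatic object extraction is intended to remove.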

Software

Software Demo
The source code is available on SourceForge. Note that it is only available via CVS.

Publications

David Buttler, Ling Liu, Calton Pu, Henrique Paques, Wei Han, Wei Tang. "OminiSearch: A method for searching dynamic content on the Web." ACM SIGMOD 2001, Santa Barbara, California (May 21-24, 2001).
David Buttler, Ling Liu, Calton Pu. "A Fully Automated Extraction System for the World Wide Web." IEEE ICDCS-21, Phoenix, Arizona (April 16-19, 2001).

The People Behind Omini

Who we are

Email us

Your comments

FAQ

This material is based upon work partially supported by the National Science Foundation under Grant No. 9988452. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Last Update Jan 2000

Comments and information: email disl@cc.gatech.edu