The web has become one of the primary ways that people, businesses, and organizations share information. However, due to the wealth of information available, finding and reusing information has become much more difficult.
Search engines address the problem of finding information by indexing pages on the web, but they have several shortcomings. First, they cannot keep pace with the expansion of the web. Second, they ignore most of the information available, because they only look at static pages. By some estimates, more than 90% of the information on the web is "hidden" and only available through forms. Finally, they do not offer any granularity finer than a whole page.
The goal of Omini is to get at the data behind Web forms. Omini software automatically extracts content objects and ignores irrelevant parts of the page. One of the key design features of Omini is its robustness even as the web pages from which it extracts data evolve, eliminating the need for a programmer to manually determine where objects are.
Omini's technology is useful in several different domains. We have already applied Omini as the foundation of XWRAPElite. XWRAPElite is an interactive online toolkit that generates wrappers which extract data from web sites and convert it into semantically relevant XML.
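To make the idea of a wrapper concrete, here is a minimal hand-written sketch in Python. It is not Omini's algorithm and not code generated by XWRAPElite. It assumes the objects of interest are table rows, and the names `RowExtractor`, `wrap_to_xml`, `SAMPLE_PAGE`, and the field names `title` and `price` are hypothetical, chosen only for illustration. It shows the general shape of the task: pull the repeated data objects out of a page, skip navigation and footer text, and emit XML with meaningful element names.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class RowExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell (navigation, footer) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def wrap_to_xml(html, fields):
    """Turn each extracted row into an <item> whose children are named by `fields`."""
    parser = RowExtractor()
    parser.feed(html)
    parser.close()
    root = ET.Element("results")
    for row in parser.rows:
        item = ET.SubElement(root, "item")
        for name, value in zip(fields, row):
            ET.SubElement(item, name).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical result page: two data objects surrounded by irrelevant markup.
SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a> | <a href="/help">Help</a></div>
<table>
<tr><td>Web Data Extraction</td><td>$30</td></tr>
<tr><td>XML in Practice</td><td>$25</td></tr>
</table>
<p>Copyright notice</p>
</body></html>
"""

if __name__ == "__main__":
    print(wrap_to_xml(SAMPLE_PAGE, ["title", "price"]))
```

A hand-written wrapper like this one hard-codes where the objects live (`<tr>` and `<td>`), so it breaks as soon as the page layout changes. That is the manual step Omini is meant to eliminate by locating the objects automatically.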
We are also in the process of constructing a search engine for dynamic web sites that is based on Omini. The search engine will be able to locate data objects in web sites that are relevant to the context of a search. This approach complements traditional search engines, such as HyperBee or Google, which index static web pages.
The source code is available at SourceForge. Note that it is only available via CVS.