Object Mining On the Web
Problem
An overwhelming amount of information is now available online. Being online has made information easier to access;
however, most of it is designed for human consumption rather than computer processing. This inevitably means
that human attention is required to gather useful information, resulting in information overload and coping strategies
such as ignoring potentially important data. To counteract this tendency, it is necessary to build tools that can
find, extract, and organize information. This project focuses on converting online information into
a form that is usable by machines, thus allowing more complex aggregation, summarization, and automated information
management.
The first step is identifying the types of information available in a particular web page. A single page may
contain a large number of different types of data, some complex and others simple.
Before understanding complex types, it is best to understand their simpler component parts. For example, to understand
a book object it is easier to first recognize the names of the authors, editors, and publishers, as well as the publication
dates and possibly the price. This example points to the necessity of extracting certain fundamental data types:
- people's names
- company names
- dates
- prices
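As a rough illustration of what recognizing two of these fundamental types might involve, the sketch below uses regular expressions to find dates and prices in a plain string. The patterns, the names `DATE_RE`, `PRICE_RE`, and `find_simple_types`, and the handful of formats covered (US-dollar prices and three date layouts) are assumptions made for this example; real pages will need much broader coverage.

```python
import re

# Hypothetical month pattern: full names and common three-letter abbreviations.
MONTHS = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
          r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")

# Assumed date layouts: 2004-03-15, 3/15/2004, and "March 15, 2004".
DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}"
    r"|\d{1,2}/\d{1,2}/\d{2,4}"
    r"|" + MONTHS + r"\.?\s+\d{1,2},\s+\d{4})\b"
)

# Assumed price layout: a dollar sign, digits with optional thousands
# separators, and optional cents (e.g. $24.95 or $1,299.00).
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")


def find_simple_types(text):
    """Return (type, matched_string) pairs in order of appearance in text."""
    found = [(m.start(), "date", m.group()) for m in DATE_RE.finditer(text)]
    found += [(m.start(), "price", m.group()) for m in PRICE_RE.finditer(text)]
    return [(kind, s) for _, kind, s in sorted(found)]


if __name__ == "__main__":
    sample = "Published March 15, 2004 by Acme Press. Price: $24.95"
    for kind, s in find_simple_types(sample):
        print(kind, "->", s)
```

People's names and company names are harder to capture with patterns alone, which is why dictionaries of names are a more likely starting point for those types.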
There are several goals in this project, each of which can be turned into a mini-project suitable for 7001.
- Design a system to identify some commonly used types in a web page. For example, define a set
of regular expressions that can identify dates or prices. In conjunction with the regular expressions, develop
a small program that converts a page into a type structure (i.e., given a DOM model of a web page, identify all
of the types that you have defined and replace the string tokens with XML tags identifying the types; replace
all non-type tokens with a generic type, and return the tree as a full type structure).
- Develop a system to recognize names in a page. Given a list of names and a web page, identify possible matches in
the page. Based on the structure of the page and the distribution of recognized names, identify strings that may
also be names based on their location in the DOM tree hierarchy representing the page.
- Develop a framework into which different type identifiers can simply be plugged.
The framework should support objects that are based on single tokens from a web page, multiple tokens, or even
tokens that span tree leaf boundaries. A simple type identifier could be implemented to validate the ideas.
- Identify complex objects in a web page based on patterns of simple types. Using association rule
mining methods, the goal is to identify objects with similar association patterns.
For the mini-project it is sufficient to develop a hand-coded rule to convert a single page with marked-up simple
types into a list of complex objects, and to describe one possible direction for scaling the system to handle a larger
variety of page types.
For example, a book object has a name (person name type), a date (date type), a title (string type),
a publisher (company type), and possibly a price (price type). If 90% of the objects follow such a pattern, you may decide
that the set of objects are books, and/or that the 10% of objects that do not follow the pattern do not belong in the list.
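The book example above can be sketched as a hand-coded rule. This is only an illustration under stated assumptions: the page is assumed to be already split into records, each a list of (type, text) pairs; the type labels `person`, `date`, `string`, `company`, and `price`, the function names, and the 0.9 support threshold are hypothetical choices for this example. It is a fixed rule, not association rule mining.

```python
# Hypothetical rule: simple types a record must contain to count as a book.
BOOK_REQUIRED = {"person", "date", "string", "company"}
BOOK_OPTIONAL = {"price"}


def to_book(record):
    """Convert one record, a list of (type, text) pairs, into a dict.

    Returns None when a required type is missing from the record.
    """
    types = {kind for kind, _ in record}
    if not BOOK_REQUIRED <= types:
        return None
    book = {}
    for kind, text in record:
        if kind in BOOK_REQUIRED | BOOK_OPTIONAL:
            book.setdefault(kind, text)  # keep the first occurrence of each type
    return book


def extract_books(records, min_support=0.9):
    """Return (books, outliers, support) for a list of records.

    If fewer than min_support of the records fit the rule, the list is
    treated as not being a list of books and no books are returned.
    """
    converted = [(rec, to_book(rec)) for rec in records]
    books = [b for _, b in converted if b is not None]
    outliers = [r for r, b in converted if b is None]
    support = len(books) / len(records) if records else 0.0
    if support < min_support:
        return [], list(records), support
    return books, outliers, support


if __name__ == "__main__":
    book = [("string", "Some Title"), ("person", "A. Author"),
            ("company", "Acme Press"), ("date", "2004"), ("price", "$10.00")]
    advert = [("string", "Click here"), ("price", "$5.00")]
    books, outliers, support = extract_books([book] * 9 + [advert])
    print(len(books), "books,", len(outliers), "outlier(s), support", support)
```

Scaling beyond one page type would mean learning the required and optional sets from the data instead of fixing them by hand.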
The deliverable of the project should be a design report together with a validation method and results. The validation can be done
by implementing a simple prototype that demonstrates the validity of the design ideas, or by a simulation-based evaluation.
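As one hedged illustration of how small such a prototype can be, the sketch below shows a possible shape for the plug-in framework goal: identifiers register themselves, each reports spans over a token list, and the tagger wraps typed spans in XML-style tags and everything else in a generic tag. All names (`IDENTIFIERS`, `identifier`, `tag_tokens`, the `text` tag) are assumptions for this example. It works on a flat list of whitespace-separated tokens, not a DOM tree, and does not handle tokens that span tree leaf boundaries.

```python
import re
from xml.sax.saxutils import escape

IDENTIFIERS = []  # registry of plugged-in identifier functions

MONTH_NAMES = {"January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"}


def identifier(func):
    """Decorator that plugs a type identifier into the registry."""
    IDENTIFIERS.append(func)
    return func


@identifier
def price_identifier(tokens):
    """Single-token type; yields (start, end, type) with end exclusive."""
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"\$\d+(?:\.\d{2})?", tok):
            yield (i, i + 1, "price")


@identifier
def date_identifier(tokens):
    """Multi-token type, e.g. the three tokens 'March', '15,', '2004'."""
    for i in range(len(tokens) - 2):
        month, day, year = tokens[i:i + 3]
        if (month in MONTH_NAMES and re.fullmatch(r"\d{1,2},?", day)
                and re.fullmatch(r"\d{4}", year)):
            yield (i, i + 3, "date")


def tag_tokens(tokens):
    """Wrap typed spans in <type> tags and all other tokens in <text> tags."""
    spans = sorted(s for ident in IDENTIFIERS for s in ident(tokens))
    out, pos = [], 0
    for start, end, kind in spans:
        if start < pos:  # skip a span that overlaps one already emitted
            continue
        out += [f"<text>{escape(t)}</text>" for t in tokens[pos:start]]
        out.append(f"<{kind}>{escape(' '.join(tokens[start:end]))}</{kind}>")
        pos = end
    out += [f"<text>{escape(t)}</text>" for t in tokens[pos:]]
    return out


if __name__ == "__main__":
    print("\n".join(tag_tokens("Published March 15, 2004 for $24.95".split())))
```

Adding a new type then only requires writing one more decorated function; the tagger itself does not change.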
Useful References
- Tidy -- an HTML syntax corrector which can produce a DOM representation of an HTML page.
- A dictionary of people's names (several sources are available).
- A dictionary of company names and stock symbols.
- Local Java code library (converts an HTML file into a tree, automatically extracts textual objects from a page, and more).
- A set of web pages to test solutions.
Background
A background in programming in Java or Python would be very useful for this project. Some knowledge of machine
learning may be useful but is not required.
Deliverables
A report describing what you have learned and listing useful references, together with source code for any
implementation done to validate your ideas.
Evaluation
You will be graded on the novelty and quality of the report and implementation.