Sponsor | Ling Liu / Daniel Rocco |
Area | Database Systems / Internet Data Management |
Problem
One important consideration for large Web services is data storage and
processing efficiency. Much of the data on the Internet is contained
in HTML documents that are useful for human browsing but incur
significant drawbacks from a data management perspective. HTML has no
real type information aside from layout instructions, so any data
contained in the document is mixed with formatting and layout
constructs intended to help browser software render pages on screen.
Automated data extraction or comparison of Web pages is expensive.
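As a rough illustration of the point above, the following minimal sketch uses Python's standard html.parser module to pull the text out of a small HTML fragment. The fragment, the TextExtractor class, and the extract_text helper are hypothetical names invented for this example; they are not part of the encoding scheme described in this project.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects non-empty text nodes and counts opening tags."""

    def __init__(self):
        super().__init__()
        self.texts = []      # text found between tags, in document order
        self.tag_count = 0   # number of opening tags seen

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.texts.append(stripped)


def extract_text(html):
    """Return (list of text strings, number of opening tags) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts, parser.tag_count


# Hypothetical fragment: two data values wrapped in six layout-oriented tags.
SAMPLE = (
    '<table><tr><td><b>Price</b></td>'
    '<td><font color="red">19.99</font></td></tr></table>'
)

texts, tags = extract_text(SAMPLE)
print(texts, tags)
```

In this fragment the two values come back as plain strings, and nothing in the markup says that the second one is a number or that it belongs to the first; a program has to infer that from the layout.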
We have produced a new document encoding scheme that addresses some of
the problems associated with storing and processing Web documents and
that enables Web service applications to operate efficiently on a large
scale. We are interested in possible enhancements to the design
and in applications that can benefit from a more efficient encoding scheme.
We suspect that there are many possible applications on the Web that can benefit from efficient document management. The main goal of this project is to generate ideas and avenues of exploration for future work. Some possibilities include comparable compression mechanisms for Web documents, efficient document management and semantically interesting query techniques in large-scale document repositories, and selective document dissemination in document streams. Since we hope to generate new ideas, discussion, collaboration, and questions are all welcome as part of this project. You are also free to customize this project in cooperation with the sponsors. Extensions of this work are certainly possible, and ideas toward that end are welcome.
Here is what you need to do:
References
[2] XMill: an Efficient Compressor for XML Data - Liefke, Suciu
[3] Meaningful Change Detection in Structured Data - Chawathe, Garcia-Molina
Deliverables
A report describing your findings and any related designs.
Evaluation
Based on the report turned in to the sponsor of the project by the due
date.