Page Digest Applications


Sponsor

Ling Liu / Daniel Rocco
(lingliu | rockdj)@cc.gatech.edu

Area
Database Systems / Internet Data Management

Problem
One important consideration for large Web services is data storage and processing efficiency. Much of the data on the Internet is contained in HTML documents that are useful for human browsing but have significant drawbacks from a data management perspective. HTML carries no real type information aside from layout instructions, so any data contained in a document is mixed with formatting and layout constructs intended to help browser software render pages on screen. As a result, automated data extraction or comparison of Web pages is expensive. We have produced a new document encoding scheme, the Page Digest, that addresses some of the problems associated with storage and processing of Web documents and enables Web service applications to operate efficiently on a large scale. We are interested in possible enhancements to the design and in applications that can benefit from a more efficient encoding scheme.

We suspect that many applications on the Web can benefit from efficient document management. The main goal of this project is to generate ideas and avenues of exploration for future work. Some possibilities include comparable compression mechanisms for Web documents, efficient document management and semantically interesting query techniques in large-scale document repositories, and selective document dissemination in document streams. Since we hope to generate new ideas, discussion, collaboration, and questions are all welcome as part of this project. You are also free to customize this project in cooperation with the sponsors. Extensions of this work are certainly possible, and ideas toward that end are welcome.

Here is what you need to do:

  1. Review current literature to discover state-of-the-art techniques in XML compression, Web document comparison (change detection), and other related topics. Possible starting points include references [1, 2, 3].
  2. Propose ideas for new and interesting ways of managing data on the Web that can benefit from an efficient encoding scheme like the Page Digest.

References

[1] Our paper on Page Digest.

[2] XMill: an Efficient Compressor for XML Data - Liefke, Suciu

[3] Meaningful Change Detection in Structured Data - Chawathe, Garcia-Molina

Deliverables

A report describing your findings and any related designs.

Evaluation
Evaluation is based on the report submitted to the project sponsor by the due date.