Page Digest

The Page Digest is a mechanism for efficient storage and processing of Web documents. Its design encourages a clean separation of the structural elements of a Web document from its content. The encoding transformation is invertible and adds no significant cost or complexity to normal document parsing. Our initial experimental results show that, compared with standard DOM implementations, Page Digest encoding can provide an order-of-magnitude speedup when traversing a Web document or comparing two arbitrary Web documents.
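
As a rough illustration of separating structure from content, the sketch below flattens a small document tree into parallel arrays and rebuilds it. It is a simplified sketch under our own assumptions, not the actual Page Digest format: the names Node, Digest, encode, and decode are hypothetical, and attributes are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A minimal document node (hypothetical; attributes are omitted)."""
    tag: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


@dataclass
class Digest:
    """Flattened form: structure and content live in separate arrays."""
    child_counts: List[int]      # structure: children per node, in preorder
    tag_ids: List[int]           # per node, an index into tag_names
    tag_names: List[str]         # each distinct tag name is stored once
    texts: List[Optional[str]]   # content per node, in preorder


def encode(root: Node) -> Digest:
    """Flatten the tree with one preorder traversal."""
    digest = Digest([], [], [], [])
    tag_index = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag not in tag_index:
            tag_index[node.tag] = len(digest.tag_names)
            digest.tag_names.append(node.tag)
        digest.child_counts.append(len(node.children))
        digest.tag_ids.append(tag_index[node.tag])
        digest.texts.append(node.text)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return digest


def decode(digest: Digest) -> Node:
    """Rebuild the tree from the flat arrays (inverse of encode)."""
    pos = 0

    def build() -> Node:
        nonlocal pos
        i = pos
        pos += 1
        node = Node(digest.tag_names[digest.tag_ids[i]], digest.texts[i])
        for _ in range(digest.child_counts[i]):
            node.children.append(build())
        return node

    return build()
```

In this sketch the child-count array alone is enough to recover the tree shape, so a traversal can scan a flat list of integers instead of following object pointers, and decode(encode(doc)) returns a tree equal to doc.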

We have examined the potential benefits of using Page Digest in other large-scale Web services, such as Web search software, Web data extraction services, and automatic fragment detection for dynamic content caching. Our experimental results show that change detection using the Page Digest operates in linear time, offering a 75% improvement in execution performance compared with popular existing change detection and difference systems. In addition, the Page Digest format reduces the tag name redundancy found in Web documents, which provides up to a 50% reduction in document size without employing data compression techniques.
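
To give a feel for why a flat encoding lends itself to linear-time comparison, the sketch below scans two flattened documents once and reports the positions that differ. This is a simplified positional comparison under our own assumptions, not the change detection algorithm evaluated above: FlatNode and changed_positions are hypothetical names, and the sketch does not realign nodes after an insertion or deletion.

```python
from typing import List, Optional, Tuple

# One entry per node in preorder: (child count, tag name, text).
# Hypothetical flattened form used only for this illustration.
FlatNode = Tuple[int, str, Optional[str]]


def changed_positions(old: List[FlatNode], new: List[FlatNode]) -> List[int]:
    """Return preorder positions where the two documents differ.

    Runs in a single pass, so the cost is linear in the longer document.
    """
    shared = min(len(old), len(new))
    diffs = [i for i in range(shared) if old[i] != new[i]]
    # Positions present in only one document also count as changes.
    diffs.extend(range(shared, max(len(old), len(new))))
    return diffs
```

For example, if two versions of a page differ only in the text of the fourth node in preorder, the function returns the single position 3.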


Publications
  1. Daniel Rocco, David Buttler, Ling Liu. Page Digest for Large-Scale Web Services. In Proc. IEEE Conference on E-Commerce, 2003. (pdf)
