Paper #: [1.1 SE] 16
Title: "Crawling towards Eternity: Building an Archive of the World Wide Web" by Mike Burner

PROBLEM
Users sometimes want to see how a page looked on a certain date, or to find out when a page was modified. To make this possible, Brewster Kahle has been archiving the internet since 1996. His goal was to trace the "development of technologies, styles, and cultural trends." Crawling for an archive differs from crawling for a search engine because the focus is on answering questions such as "when was this URL last modified and what was its global namespace at the time?" rather than "where are files that correspond to 'mouse trap'?".

NEW IDEAS AND STRENGTHS
*The paper does a nice job of detailing many of the challenges of web crawling, such as keeping track of retrieved objects and crawling without upsetting web authors and administrators. The wishes of web-site owners are respected through the use of SRE (Standard for Robot Exclusion).
*One good decision was to crawl on a site-by-site basis. Going site by site lets the crawler reuse the IP address and robot exclusion information it has already fetched for a site, which minimizes the expense of IP lookups and robot exclusion retrievals.
*Another idea is "resting" between retrievals of objects from a site during a crawl in order to avoid overloading the site.
*A good point was made about the efficiency of using a bitmap instead of a relational database to solve the URL duplication problem. In addition, the paper shows a nice focus on scalability with the idea of keeping "site" bitmaps instead of "internet wide" bitmaps.

WEAKNESSES AND EXTENSIONS
*The paper mentions that it is hard for multiple machines to synchronize their bitmaps. Further research should be done in this area.
*It would be nice if the functionality of this archiving service were combined with a search engine in the future.
This would give users a one-stop shop for queries such as "List mouse trap solutions available in 1982".
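
ILLUSTRATION
To make the "site" bitmap idea from the strengths list concrete, here is a minimal Python sketch of how a crawler might detect duplicate URLs with one small bitmap per site. It is my own illustration, not code from the paper: the names SiteBitmap, test_and_set, and crawl_order, the bitmap size, and the choice of SHA-256 as the hash are all assumptions. Because several URLs can hash to the same bit, a sketch like this can occasionally report a new URL as already seen.

```python
import hashlib
from urllib.parse import urlsplit


class SiteBitmap:
    """Fixed-size bitmap recording which URLs of one site have been seen.

    Hypothetical helper for illustration; size and hash are assumptions.
    """

    def __init__(self, num_bits=1 << 16):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)

    def _index(self, url):
        # Hash only the path and query; the host is implied by the site.
        parts = urlsplit(url)
        key = parts.path + "?" + parts.query
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def test_and_set(self, url):
        """Return True if the URL's bit was already set, then set it."""
        i = self._index(url)
        byte, mask = i // 8, 1 << (i % 8)
        seen = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return seen


def crawl_order(urls):
    """Group URLs by host, dropping duplicates with one bitmap per site."""
    bitmaps = {}   # host -> SiteBitmap
    by_site = {}   # host -> list of URLs not seen before
    for url in urls:
        host = urlsplit(url).netloc
        bitmap = bitmaps.setdefault(host, SiteBitmap())
        if not bitmap.test_and_set(url):
            by_site.setdefault(host, []).append(url)
    return by_site
```

Grouping the URLs by host also mirrors the site-by-site crawl: each site's list can be fetched in one pass, with its bitmap held in memory only while that site is being crawled.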