Week 6, Paper 16
Crawling towards Eternity: Building An Archive of The World Wide Web

Problems
This publication is a technical article that discusses the Internet Archive project. The Internet Archive is an endeavor to crawl and archive the World Wide Web as it evolves. The Internet Archive's crawler works like any other web crawler: it retrieves objects and stores them away. However, unlike a search engine, its motivation is to archive the Internet so that the content can be used later. The article discusses the design of the Internet Archive crawler.

Strengths and New Ideas
(1) The article discusses the design of the crawler and how it differs slightly from other crawlers because its primary goal is to retrieve data rather than to support a search engine. In this respect, its crawling strategy is different from that of search engines. To make crawling faster and more efficient, the archive crawler crawls on a per-site basis. This avoids the cost of repeatedly hopping between sites and re-downloading each site's robot exclusion file, since the per-site state can be held in memory.
(2) Most of the technical workings of the archive crawler are similar to those of any other crawler. However, the information is presented at a high level, so a novice reader can easily understand the basic design of a crawler.
(3) The indexing strategy of the archive crawler is different from that of search engines. While search engines index based on keywords, the archive crawler's index is more like a card catalog, where you can issue queries like "Tell me the modification times and locations in the Archive of all objects with URL = 'some url'". This feature gives us an idea of how an archive could be used in the future and how it differs from a search engine.
(4) The article discusses the major bottlenecks and problems in the system, such as DNS lookups and testing for already-seen URLs. It briefly discusses how the crawler overcomes these problems by using non-blocking I/O and a bitmap, respectively.
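
The bitmap-based test for seen URLs mentioned in point (4) can be sketched as follows. This is only a minimal illustration of the general idea, not the Internet Archive's actual code: the class name `SeenUrlBitmap`, the helper `filter_unseen`, the bitmap size, the number of hash positions, and the use of SHA-256 are all assumptions made for this example.

```python
import hashlib


class SeenUrlBitmap:
    """Approximate 'have we seen this URL?' test backed by a fixed-size bit array.

    Hypothetical sketch: sizes and hash choice are illustrative assumptions.
    """

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        # num_hashes must be <= 8 because a SHA-256 digest has 32 bytes
        # and each position consumes 4 of them.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Derive several bit positions from a single digest of the URL.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, url):
        # Mark the URL as seen by setting all of its bits.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # The URL counts as seen only if every one of its bits is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


def filter_unseen(seen, urls):
    """Return the URLs not yet marked in `seen`, marking each one as it is returned."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A structure like this keeps memory use constant no matter how many URLs a site has, at the price of occasional false positives: a URL that was never crawled may be reported as seen and skipped, but a URL that was added is never reported as unseen.
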

Weaknesses
(1) Since this publication is an article rather than a research paper, it does not introduce any groundbreaking ideas or results. The design of the archive crawler is similar to that of other crawlers, such as Mercator. In this respect, the article caters only to readers who are not familiar with the design issues involving web crawlers.
(2) The article does not discuss any UI-related questions. How would the UI for an archive differ from that of a search engine?
(3) The article does not discuss any metrics for gauging the effectiveness of an archiving utility. There is no way of answering the question, "Is this a good archive?". Only a few numbers are provided, relating to the amount and diversity of information collected.