Mercator: Designing a Scalable, Extensible Web Crawler

(1) Problems
This paper discusses the design of a scalable and extensible web crawler, defining the various components of such a crawler clearly and in detail. Designing a crawler that can scale to the entire web and be extended easily through pluggable components is a daunting task. Mercator, the crawler discussed in the paper, successfully negotiates this challenge. Most crawlers face constraints of both time and space, as well as other bottlenecks. The paper describes how Mercator deals with these problems effectively.

(2) New Ideas and Strengths
(a) This is one of the few papers that describe the design of a web crawler in detail. Designing a web crawler is not simple, and since web search is such a competitive industry, the designs of other systems are not well known, at least not at the level of detail given for Mercator. The paper helps in understanding the various components of a crawler in detail.
(b) In discussing each component, the authors describe their implementation and contrast it with other possible implementations. For example, the paper compares the data structure Mercator uses for the URL-seen test with the one used by the Internet Archive crawler, a Bloom filter. It explains that a Bloom filter can return false positives (reporting a URL as already seen when it has not been), and that reducing this probability requires larger bit vectors. This does not scale well, since increasing the size of the crawl requires more and more memory. In contrast, Mercator's URL-seen test does not admit false positives and uses a bounded amount of memory irrespective of the size of the crawl.
(c) The paper shows how DNS resolution can be a significant bottleneck. The authors discuss how implementing a custom DNS resolver that can handle multiple requests at the same time improved efficiency considerably.
(d) The authors discuss how their design turned out to be inefficient for crawling intranets.
Because the design assigns all URLs from a given server to the same worker thread, an intranet crawl, where the number of servers is small, left many worker threads idle. This example highlights how a web crawler designer must take the parameters of the crawl into account.
(e) The paper shows how the Java VM implementation can matter to the overall efficiency of the crawler. The authors use a custom Java runtime that improved the crawler's efficiency.
(f) The extensible design allows the authors to modify the crawler to perform a variety of tasks, such as random walks and gathering different statistics.

(3) Weaknesses and Extensions
(a) Although the authors' numbers compare favorably against the statistics published by Google, the paper fails to address why Mercator is not on the market. Since the numbers are heavily in its favor, there is no apparent reason why it should not be powering a live search engine.
(b) The authors mention that their custom runtime improved efficiency considerably, but they do not discuss how the runtime affects performance. This should be elaborated on further.
(c) The authors use only one crawling machine. The paper does not discuss how Mercator would behave if multiple distributed crawlers worked together. This is an important factor in building a high-quality crawler, as the size of the web increases daily.
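
To make the Bloom filter trade-off in point (2)(b) concrete, here is a minimal sketch of a Bloom filter used as a URL-seen test. It is an illustration only, not the Internet Archive crawler's code and not Mercator's data structure. The class name `BloomFilter`, the helper `expected_false_positive_rate`, the use of SHA-256 with double hashing, and all parameter names are assumptions made for this example.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter for a URL-seen test (hypothetical, not from the paper)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url: str):
        # Derive k bit positions from a single SHA-256 digest (double hashing).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd so h2 is nonzero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True means "possibly seen": every bit is set, perhaps by other URLs.
        # False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


def expected_false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k for m bits, k hashes, n items."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes
```

With the bit vector size held fixed, `expected_false_positive_rate` rises as more URLs are inserted, and keeping the rate constant requires growing `num_bits` along with the crawl. This is the scaling concern the review attributes to the Bloom filter approach. A false positive here means the crawler would skip a URL it has never fetched.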