Paper #: 12 Title: Mercator: A Scalable, Extensible Web Crawler (1) Problems 每 A web crawler is software that starts with a seed list of URL*s, downloads one of them, extracts URL links on that page (and perhaps does other operations like indexing on the page), and continues the process on the URL links it finds. This paper describes a web crawler that is scalable and extensible, and describes it in detail. Other web crawlers have been described, but not in good detail because their owners were protecting their technology. The paper addresses the problems of making a web crawler scalable and extensible. A crawler is scalable if you can scale up the software and use it to crawl the entire web. This requires that the software use a bounded amount of memory. A crawler is extensible if you can easily add new functionality to it, and/or easily change its functionality by changing the functionality of its components. Extensibility is important because this kind of product has many uses, some of which haven*t been thought of yet, so it needs to be easy to adapt it. Scalability is important because any such product must be able to handle the entire Internet to be useful, and the Internet keeps growing. (2) New Ideas and Strengths 每 Tests indicate that the software runs faster than Google and Internet Archive; successful web crawler products, and, it seems to have accomplished its goals of being extensible and scalable. Thus, the product is useful and it*s useful to study how it*s implemented. The paper describes the implementation in good detail, so you can learn a lot from it. The product is written all in Java (although a special fast java runtime is used), and considering the popularity of java since the time of this paper, and the good performance of the product, this seems to be a strength; since java is an object oriented language that is inherently easy to use to make extensible, well-organized programs. The product differs from existing crawler designs in that it uses a multi-threaded approach, and uses protocol modules and processing modules that help make it easily extensible. It*s method of testing if documents and URLs have been seen is also new and is claimed to be superior to existing techniques, but see possible weakness discussed below. I had difficulty grasping the synchronous v. asynchronous i/o issue, but I think I understand that each Mercator thread is able to just wait for i/o operations to complete before proceeding, and the other designs that are single threaded have to queue i/o ops and come back later to service them 每 and thus, Mercator*s operating structure seems easier to understand. (3) Weaknesses and Extensions 每 Mecator was not designed ※from the ground up§ to scale to multiple machines 每 as were Google and Internet Archive. This could be a weakness, although the paper claims it*s not much of a problem. Mercator claims that it*s method of using checksums (as opposed to a bloom filter approach) avoids all false positives. I may misunderstand this concept, but it seems to me that it would be possible for two URLs to have the same checksum, just as it is possible for a bloom filter to register a false positive. If so, the paper is weak, in making this claim. An extension that would help prove Mercator*s case would be to actually adapt the product to run on multiple machines, instead of just claiming that that would ※not be too difficult.§ An appendix that explains in detail the high-performance java runtime, might be very useful in educating the reader about the java changes that had to be implemented to make the product run fast enough.