Paper #: [1.1 SE] 12 Title: Mercator: A Scalable, Extensible Web Crawler by Heydon and Najork Week 2 - CS 8803 Paper Summary 1/21/2005 Problems , There is a lack of details on the challenges and tradeoffs of web crawler design due to competition amongst search engine companies limiting published work. Consequently, a paper was needed to address the different items involved in creating a web crawler. , Earlier crawlers did not address scaling problems. In addition web crawlers that were customizable in hopes of increasing extensibility were site specific, such as SPHINX. , Currently, useful documents such as images are ignored by web crawlers who are focused on HTML pages. This is because it is hard to index images without some sort of metadata. , While the process of web crawling is slow, one reason that it is hard to speed it up is because it is considered bad manners to have multiple requests pending on the same server. , Slow read/write disk time slows down web crawlers. , Many documents are located on the Web with different URLs, consequently a method is needed to ensure that the same document is not stored and referenced twice. Otherwise inefficiency can occur. , DNS bottlenecks occur because only one DNS request can be outstanding at a time. New Ideas and Strengths , The authors created a modular web crawler that is extensible. Modularity allows the ability to create a module for an intranet and another module that is optimized for the internet by ensuring that all threads are working as long as the # of hosts >= # of worker threads. , Mercator can handle any file associated with MIME including GIF images, unlike other web crawlers associated with search engines such as Google, which are limited to HTML pages. , Unlike the Google paper, this paper addresses explicitly how a web master may exclude pages from indexing as well as how the Mercator implementation handles this and it¨s performance affects. , Checksum: It was a good idea to use checksum instead of ^diff ̄ for comparison of documents during the ^content-seen test ̄. In addition, using checksums to store URLs with the same host close together helps to limit the slow down from disk seek. , The authors wrote a DNS resolver in order to remove the DNS bottleneck. , Mercator¨s URL-seen test is better than Internet Archives bloom-filters because it doesn¨t have false negatives that could exclude some files from being downloaded that should be downloaded. , Mercator uses a multithreaded architecture and is faster than both Google and Internet Archive. Weaknesses and Extensions , The 1MB limit on documents may be too small for an implementation which can handle documents other than HTML pages. Many file types such as images and videos, which may be of interest take up more than 1 MB. This is aided by the fact that disk space is cheap. , One weakness is that it can not avoid downloading multiple copies if they exist on different web servers. However, on the bright side it does only process one copy. , Using a Java implementation was successful. But, I would like to see how a C++ implementation may have performed. This is because C++ usually runs faster than Java.