Apoidea

Decentralized P2P Web Crawling

 

Introduction

Web crawling is an integral, and probably the most critical, part of search engine technology. Crawlers such as Google's and Mercator (the AltaVista crawler) run on a well-provisioned cluster of tightly coupled machines and incur heavy administrative costs to maintain it. In addition, they rely on a central URL repository that determines whether a URL has already been crawled and allocates the URLs to crawl to the various machines. Such an architecture suffers from the usual flaws of centralized designs, such as a single point of failure and link congestion (since every communication has to go through the central repository).

GAtech has been leading efforts to develop a decentralized version of such a crawler. An earlier work, Hyperbee, developed a semi-centralized version. With Apoidea, we took our efforts to a new level: we developed a completely decentralized, "pure" P2P web crawler. It uses no central authority to coordinate which URLs to crawl or to determine whether a URL has already been crawled. In addition, we exploit an important benefit of a global distributed system: its widespread presence. We use the geographical proximity of peers to the domains that need to be crawled to decide which peer does the crawling. All of this decision making happens among "equal" peers, without anybody coordinating the effort.
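
To make the idea of coordinator-free URL assignment concrete, here is a minimal sketch in Python. It is not Apoidea's implementation: this page does not describe the actual mechanism, so the sketch assumes a consistent-hashing ring, a technique commonly used in structured P2P systems. The class name Ring, the peer names, and the 32-bit ring size are hypothetical. Geographical proximity is not modeled here. Each peer owns the domains that hash to its arc of the ring and keeps its own record of the URLs it has seen, so no central repository is consulted.

```python
import hashlib
from bisect import bisect_left
from urllib.parse import urlsplit


def ring_position(key: str) -> int:
    """Map a string to a point on a 2**32 hash ring (ring size is an assumption)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:4], "big")


class Ring:
    """Hypothetical peer ring; simulates all peers in one process for illustration."""

    def __init__(self, peer_ids):
        # Place every peer on the ring, sorted by its hashed position.
        self._points = sorted((ring_position(p), p) for p in peer_ids)
        self._keys = [pos for pos, _ in self._points]
        # Each peer keeps its own set of already-seen URLs (no shared repository).
        self.seen = {p: set() for p in peer_ids}

    def owner(self, url: str) -> str:
        """Return the peer responsible for the URL's domain."""
        domain = urlsplit(url).hostname or ""
        # First peer at or after the domain's position, wrapping around the ring.
        idx = bisect_left(self._keys, ring_position(domain)) % len(self._points)
        return self._points[idx][1]

    def offer(self, url: str) -> bool:
        """Route a URL to its owner; True means the owner had not seen it before."""
        seen = self.seen[self.owner(url)]
        if url in seen:
            return False
        seen.add(url)
        return True


if __name__ == "__main__":
    ring = Ring(["peer-a", "peer-b", "peer-c"])
    for url in ["http://example.org/x", "http://example.org/x"]:
        print(url, ring.owner(url), ring.offer(url))
```

Because the owner is computed from the domain alone, any peer that discovers a URL can work out where to send it, and all URLs of one domain land on the same peer. In this sketch the second offer of the same URL is rejected by that peer's local record.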

The design of Apoidea has many advantages. First, it is scalable: we can add more peers as the web grows. Second, it places low demands on resources. In fact, users with typical home PCs can donate some of their idle computer time to Apoidea and participate meaningfully. In addition, it requires no manual administration and is self-configuring and self-healing.

 

People

Aameek Singh
Mudhakar Srivatsa
Ling Liu
Todd Miller

 

Publications

Aameek Singh, Mudhakar Srivatsa, Ling Liu, Todd Miller, "Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web", Proceedings of the SIGIR 2003 Workshop on Distributed Information Retrieval, Lecture Notes in Computer Science, Volume 2924. [slides]

 

External Citations

  1. J. Stribling, J. Li, I. Councill, M. Kaashoek, R. Morris, "OverCite: A Distributed, Cooperative CiteSeer", NSDI 2006. [html]

  2. Pleng Chirawatkul, "Structured Peer-to-Peer Search to build a Bibliographic Paper Recommendation System", Masters Thesis, Saarland University, 2006. [pdf]

  3. S. Tongchim, P. Srichaivattana, C. Kruengkrai, V. Sornlertlamvanich, H. Isahara, "Collaborative Web Crawler over High-speed Research Network", International Conference on Knowledge, Information and Creativity Support Systems (KICSS) 2006. [pdf]

  4. Fu Xiang-hua, Feng Bo-qin, "Towards an Effective Personalized Information Filter for P2P Based Focused Web Crawling", Journal of Computer Science, Volume 2(1), 2006. [pdf]

  5. J. Casey, W. Zhou, "Reducing the Bandwidth Requirements of P2P Keyword Indexing", 6th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2005). [pdf]

  6. J. Stribling, I. Councill, J. Li, M. Kaashoek, D. Karger, R. Morris, S. Shenker, "OverCite: A Cooperative Digital Research Library", IPTPS 2005. [pdf]

  7. Xianghua Fu, Boqin Feng, "Content Filtering of Decentralized P2P Search System Based on Heterogeneous Neural Networks Ensemble", ISNN 2005. [html]

  8. Pavel Serdyukov, "Query Routing in Peer-to-Peer Web Search", Masters Thesis, Saarland University. [pdf]

  9. Christian Zimmer, "Peer-to-Peer Indexing and Crawling", Feb 2004. [pdf]

  10. Jaroslav Pokorny, "Web Searching and Information Retrieval", IEEE Computing in Science & Engineering.

    ... Earlier attempts to distribute processes suffered many problems; for example, Web servers got requests from different search-engine crawlers that increased the servers' load. Most of the objects the crawlers retrieved were useless and subsequently discarded; compounding this, there was no coordination among the crawlers. Fortunately, this bleak picture has improved: a new completely distributed and decentralized P2P crawler called Apoidea is both self-managing and uses the resource's geographical proximity to its peers for a better and faster crawl [6]. Another recent work [7] explores the possibility of using document rankings in searches...

  11. Teppo Marin, "Supporting Web Search with Peer-to-Peer Methods", P2P Technologies Seminar, Helsinki University of Technology, 2005. [pdf]

  12. Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy, "Distributed Web Crawling over DHTs", University of California, Berkeley. [pdf]

  13. The David Scott Lewis Information Retrieval Archive [html] and his blog [html]

 

Source Code

Available by emailing any of the contacts.

 

Why Apoidea?

Apoidea is a family of bees that does not have any queen bees. Having no central authority seemed just too appropriate for the project; we also wanted to continue the bee tradition of GAtech's P2P web crawling efforts (Hyperbee).

 

Contact

Aameek Singh <aameek[AT]aameeksingh.com>
Mudhakar Srivatsa <mudhakar[AT]cc.gatech.edu>

 

Acknowledgements

This work is partially supported by the National Science Foundation under a CNS grant, an ITR grant, and a Research Infrastructure grant, and by a DoE SciDAC grant, an IBM SUR grant, an IBM Faculty Award, and an HP equipment grant. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or DoE.

 

© 2007 Aameek Singh