Result caching in the Gnutella network


Sponsor

Ling Liu / Bugra Gedik
{lingliu, bgedik}@cc.gatech.edu
CCB 216 / CCB 263

Area

Systems and Databases


Problem
Gnutella is a peer-to-peer (P2P) protocol for distributed search. It is widely used by many popular file-sharing applications. In these applications, the aim is to exchange files between peers through a direct connection between them. Each peer stores the files it wants to share in its local disk and exchanges these files with other peers by locating them and directly communicating with them. The way a peer finds other peers that store the files corresponding to its query is the main part that varies between different peer-to-peer protocols. For instance Napster uses centralized servers to index the files each peer shares. As opposed to Napster, which is a centralized system, Gnutella is totally decentralized. It forms an overlay network (called the Gnutella network) of peers and all queries flood over this network.

Two main problems with the Gnutella protocol are: (1) The search is expensive when we consider the number of messages used for querying the network and getting back the results. This causes inefficient use of network resources. (2) The routing responsibilities are assigned to peers with the assumption that each peer has the same resource capacity. This hurts the reliability of the system, since many nodes have weak connections and short participation times, which makes them unsuitable for taking active part in the protocol.

A known solution for the problem (2) is to use the so-called super-peers as proxies for less powerful nodes of the system. A basic approach to the problem (1) is to use routing structures at each node to control the flooding in a more intelligent way.

In this project we will investigate the possibility of using result caching together with the idea of super peers to better solve the problem (1). You are expected to first get familiar with the Gnutella protocol. The basic protocol specification is available at here. And a more informal description can be found at here. At the second step, you are expected to explore the idea of super peers by reading the paper called "Designing a super peer network" available on-line and (optionally) the idea of Routing Indexes as described by the "Routing Indices for Peer-to-peer Systems" paper available on-line.

After learning how Gnutella works, you are expected to explore the possibility of building caches on the super peers based on the query results observed. These caches will store the addresses of peers that have returned results to queries, not the files themselves. These caches are expected to support returning results fast via the use of an index structure built on them.

This mini-project description consists of at least two mini-projects. You may choose to work with me on another miniproject (the second project), you may also choose one over another.

(1)   analyze the search text of queries that pass through the super peers to see if there are repeating similar queries due to the non-uniformity in user interests. If so, a basic caching approach is expected to be developed. An important question to be answered is when to use the caches and when to use the traditional flooding, and if a probabilistic method might be used.

(2)   define the details of the cache organization and maintenance. We will provide a Gnutella simulator to implement the caching scheme and get some simulation measures that reflects the effectiveness of the proposed approach.

Deliverables
(1) For task (1)

A report describing your findings from the search text analysis, the code you used for collecting and processing this information,

(2) For task (2)

A report summarizing the cache organization and maintenance, and simulation results showing the effectiveness of the proposed approach.

Evaluation
You will be graded on the quality of your report.