Content Sensitive Link Analysis


Sponsor

Ling Liu (lingliu@cc.gatech.edu)

Area

Information and Knowledge Management


Problem
Relevance analysis is a central issue for search engines, which return ranked lists that put the most relevant web documents first. It consists of three parts: term-based proximity analysis, anchor text analysis, and link analysis. PageRank and hub-authority algorithms are well-known examples of link analysis; PageRank in particular is global, since it is based on links between roughly all pages on the web and is independent of the specific query. However, the link information among the documents that a search engine returns for a specific user query could also help improve relevance. For example, users may want documents that are related to each other by hyperlinks to be listed together. In this project, we want to find out whether the first few documents returned by a search engine are related by hyperlinks and, if so, what the link structure looks like.
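As one illustration of listing hyperlinked documents together, the sketch below groups a ranked result list into connected components of the link graph. It is only a sketch under stated assumptions: the function name group_linked_results, the input format (a ranked list of URLs plus a dict mapping each URL to the set of result URLs it links to), and the choice to treat links as undirected are all hypothetical and not prescribed by the project.

```python
def group_linked_results(ranked_urls, graph):
    """Group results that are connected by hyperlinks, preserving rank order.

    ranked_urls: result URLs, best first (assumed input format).
    graph: dict mapping a result URL to the set of result URLs it links to.
    """
    # Treat links as undirected: a link in either direction relates two results.
    neighbors = {url: set() for url in ranked_urls}
    for source, targets in graph.items():
        for target in targets:
            if source in neighbors and target in neighbors:
                neighbors[source].add(target)
                neighbors[target].add(source)

    rank = {url: position for position, url in enumerate(ranked_urls)}
    seen = set()
    groups = []
    # Visiting in rank order means groups are ordered by their best-ranked member.
    for url in ranked_urls:
        if url in seen:
            continue
        seen.add(url)
        stack, component = [url], []
        while stack:
            current = stack.pop()
            component.append(current)
            for nxt in neighbors[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(component, key=rank.__getitem__))
    return groups
```

Each group starts with its highest-ranked member, so the original ranking still decides the order in which groups appear. Whether such grouping actually helps users is one of the questions the project asks you to investigate.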

 

The project requires you to use the Google API to get the search results. Google offers a programmatic interface at http://www.google.com/apis. You can build your program on the Java sample code or write new code in another language, such as Perl or Python, using the SOAP protocol.

Here is what you need to do.

  1. Get access to Google's API. Go to http://www.google.com/apis/, register, and download the package. It includes some sample Java code; try it and build your code on it. Alternatively, write new code in Perl or Python using the SOAP protocol.
  2. Try three queries: “latex”, “jaguar”, and “apple”. For each query, analyze the first 100 documents returned by Google and extract the hyperlinks from the original pages. Determine whether any of the 100 documents link to one another, and draw a reference graph of any links you find.
  3. Consider the problem: how can you use the link information to improve relevance, or to improve the variety of the search results?
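The link-extraction part of step 2 could be sketched in Python roughly as follows. This is a minimal sketch, not a required design: it assumes the result pages have already been fetched into a dict mapping each result URL to its HTML text, it uses only the standard library, and the names LinkExtractor, normalize, extract_links, and build_reference_graph are hypothetical. The URL normalization (dropping the fragment and a trailing slash) is a deliberate simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def normalize(url):
    """Drop the fragment and a trailing slash (a simplifying assumption)."""
    url, _fragment = urldefrag(url)
    return url.rstrip("/")


def extract_links(html, base_url):
    """Return the set of normalized absolute URLs that a page links to."""
    parser = LinkExtractor()
    parser.feed(html)
    return {normalize(urljoin(base_url, href)) for href in parser.hrefs}


def build_reference_graph(pages):
    """pages maps result URL -> HTML text (assumed to be fetched already).

    Returns a dict mapping each result URL to the set of other result
    URLs that it links to.
    """
    known = {normalize(url): url for url in pages}
    graph = {}
    for url, html in pages.items():
        targets = extract_links(html, url)
        graph[url] = {known[t] for t in targets
                      if t in known and known[t] != url}
    return graph
```

The resulting dict is an adjacency list, so it can be drawn directly as a reference graph. Real pages may need more careful URL matching than this, for example handling redirects or "www." prefixes.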

Deliverables

  1. The code you used to fetch the query results, extract hyperlinks and find the references.
  2. A report, summarizing your results and suggesting how to utilize the link information.

 

Evaluation
Evaluation is based on the report turned in to the project sponsor by the due date.