Paper #: 1
Title: The Anatomy of a Large-Scale Hypertextual Web Search Engine by Brin and Page
Week 2 - CS 8803 Paper Summary
1/17/2005

Problems

· A search engine was needed that was not biased toward mainstream topics, so that everyone's search needs could be addressed, especially those of the academic community, which at the time was neglected by search engines such as "AltaVista".
· While the number of pages on the Web is increasing, the attention span and reading speed of users are not. Consequently, a more efficient search engine is needed.
· Commercialism: Current search engine technology is hidden because companies focus on making money and keeping a competitive edge rather than on advancing knowledge in the area. In addition, some Web authors have ulterior motives and try to trick the web crawler into giving their pages a higher ranking, in order to gain recognition and make a profit. Consequently, metadata cannot be relied on to aid web crawlers, because not all authors can be trusted to create accurate metadata.
· Currently, there is a lack of fast crawling technology, which limits the ability to reach all pages. In addition, disk seek time is slow, which slows query answering.

New Ideas and Strengths

· A strength of this paper is that it considers both the positive and negative aspects of the future of this area. Brin and Page reference current technological changes and expect hardware performance and storage capacity to improve at the rate of Moore's law. Consequently, their solution was not primarily concerned with the amount of space required, but instead focused on speed. On the other hand, one hindrance to the future of search technology is the growth rate of the Web, which present web crawler technology will be unable to keep up with.
· One new idea of this paper is the PageRank concept. Instead of using only the text of a document to rate the quality of a page, Google also uses items such as the text of the link, since that is probably a better brief description than anything the page itself provides. It also considers items such as font size and weight.
· Another new concept is making the system and its data accessible to those who want to do research, rather than concealing them, in order to encourage further gains in information retrieval and data usage.

Weaknesses and Extensions

· In discussing the use of link structure and link text alongside PageRank, the authors did not mention the weakness that link text is unhelpful when HTML authors write links such as "click here for info on Bill Clinton" instead of "Bill Clinton".
· The word list was capped at 14 million words. Consequently, how does it handle new words? For example, the word "muggle" did not exist until the Harry Potter books were written. Are old words removed?
· The paper mentions that a "trusted user" can evaluate the results and give feedback in order to help with the tuning process. However, it does not describe who a "trusted user" is or what characteristics a "trusted user" has.
· A possible extension to this paper is to address the problem of the web crawler crawling and indexing the pages of people who do not want their pages indexed. The paper acknowledges that current techniques do not behave the way some Web authors expect; detailing a way for people to opt out of having their pages indexed, and how Google handles this, would be helpful.
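
As an illustration of the opt-out idea in the last point, one common mechanism is the Robots Exclusion Protocol, where a site publishes a robots.txt file and a well-behaved crawler checks it before fetching a page. The sketch below is only an assumption about how such a check could look, not a description of how Google's crawler works. It uses Python's standard urllib.robotparser module; the robots.txt contents, the crawler names ("examplebot", "otherbot"), and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might publish to opt out.
# Every crawler is kept out of /private/, and "examplebot" is kept out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow this crawler to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for agent, url in [
        ("otherbot", "http://example.com/index.html"),
        ("otherbot", "http://example.com/private/page.html"),
        ("examplebot", "http://example.com/index.html"),
    ]:
        print(agent, url, may_crawl(parser, agent, url))
```

A crawler that follows this convention would skip any URL for which may_crawl returns False. The protocol is voluntary, so it only helps authors whose pages are visited by crawlers that choose to honor it.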