Paper #:[1.1 SE] 37
Title: "Trawling the web for emerging cyber-communities" by Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins

PROBLEM
Some sites have such a narrow focus that major directories such as "Yahoo" gloss over them, judging them not worth the directories' resources. However, the detailed information available through these specialized sites, which the authors call "communities", is usually useful and hard to find. Consequently, a need exists for a way to search efficiently for small, specific communities of information.

NEW IDEAS AND STRENGTHS
*The paper does a good job of referencing previous work and reusing the parts of earlier solutions that are relevant to its goals, such as using links to determine the quality of a page.
*One idea the authors propose for these specialized communities is targeted advertising. Since the people drawn to these sites share a specific interest, it would be easy to advertise to that group of users.
*The authors use web-crawl data to discover "communities": they analyze the crawled HTML text documents to find bipartite subgraphs.
*One of the ideas that surprised me the most was that "co-citation" is more important than direct links in their approach to discovering "communities". The authors do a good job of using co-citation to find not only known communities but also emerging ones.

WEAKNESSES AND EXTENSIONS
*The experimental data consisted solely of text-only HTML files. I do not think that this is an accurate representation of the Internet. Consequently, I would have liked to see other file types, such as images and PDF files, included to give a more accurate portrayal of the data available in the communities. Data in these file types may also be useful to the user.
*While the authors' goal is to focus on "little-known" communities, the approach is biased against more popular communities. For example, it considers communities referenced by sites such as "Yahoo" uninteresting. Selective users may still be interested in sites recognized by mainstream sources. I would like to see an extension that is all-inclusive, covering both mainstream and narrow-focus sites.
*A needed future addition is automated inspection of results when checking the accuracy of community discovery.
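
To illustrate the co-citation idea noted under strengths, here is a minimal Python sketch of how pages that link to the same set of sites could be grouped into a small bipartite core. It is a brute-force toy under my own assumptions, not the authors' algorithm: the names find_cores and toy_links, the page names, and the thresholds i and j are all hypothetical, and a real crawl would be far too large for this exhaustive search.

```python
from itertools import combinations


def find_cores(links, i=2, j=2):
    """Return (fans, centers) pairs where every fan links to every center.

    links: dict mapping a page to the set of pages it links to.
    i: number of co-citing "fan" pages required (assumed threshold).
    j: minimum number of shared "center" pages required (assumed threshold).
    """
    cores = []
    # Exhaustively try every group of i pages (toy-sized inputs only).
    for fans in combinations(sorted(links), i):
        # Centers are the pages co-cited by every fan in this group.
        shared = set.intersection(*(set(links[f]) for f in fans))
        if len(shared) >= j:
            cores.append((fans, sorted(shared)))
    return cores


# Hypothetical toy link graph: page -> pages it links to.
toy_links = {
    "fan_a": {"site_x", "site_y", "site_z"},
    "fan_b": {"site_x", "site_y"},
    "fan_c": {"site_z"},
}

cores = find_cores(toy_links, i=2, j=2)
for fans, centers in cores:
    print(fans, "co-cite", centers)
```

In this toy graph, fan_a and fan_b never link to each other, yet they are grouped because both point to site_x and site_y. That is the sense in which co-citation, rather than direct linking, signals a possible community.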