Week 6, Paper 37
Trawling the web for emerging cyber communities

Problems

This paper describes an attempt to crawl the web to discover cyber communities at their nascent stages. Cyber communities include, but are not limited to, Usenet newsgroups, webrings, resource collections, and other websites dedicated to popular topics. The paper uses a graph-theoretic approach based only on link information to discover such communities.

Strengths & New Ideas

(1) The idea of searching for cyber communities is novel and interesting. Since communities are driven by a common interest, they can be a source of excellent information on their topic of interest. Tracking such communities would allow access to information that might not be easily found using search engines.
(2) The paper discusses a graph-centric approach to discovering such communities. The authors claim that cyber communities are characterized by dense bipartite graphs that contain cores, i.e. subsets of the graph in which every page on one side links to every page on the other. Starting from these cores, it is possible to discover the communities.
(3) The pruning strategies used to reduce the amount of data from a crawl are discussed in great detail and shown to be effective. This is an excellent result, as the same strategies could be used for future data mining from the web (they might scale with the size of the web).
(4) All the different parts of the data collection, pruning, and community discovery process are described at a fine level of detail. This gives the reader a good idea of the complexity and design of the system.
(5) The results discussed in the paper validate the authors' claims to a great degree. They show that purely link-based crawling of the web can yield useful results while significantly reducing the overhead of crawling and parsing other data.

Weaknesses

(1) Since the paper is so graph-theory driven, it can be a little hard to follow for readers who are not familiar with graph theory.
(2) There is no mention of a live version of this system with a list of the communities that it has discovered. It would be interesting to see how the authors would build an index of such communities that could actually be searched. Having the data is one thing, but data is useless if users cannot look it up.
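
To make the core and pruning ideas from Strengths (2) and (3) more concrete for readers less familiar with graph theory, here is a minimal sketch of my own. It is not the authors' implementation. It assumes the crawl has already been reduced to a list of (fan, center) link pairs, where a fan is a page that links out and a center is a page that is linked to. The names prune and find_cores, and the parameters i and j (the number of fans and centers in a core), are hypothetical and chosen only for illustration. The pruning shown is a simple degree-based filter, and the enumeration is brute force, so it would only be practical on data that has already been pruned heavily.

```python
from itertools import combinations


def prune(edges, i, j):
    """Repeatedly drop links whose fan has fewer than j out-links
    or whose center has fewer than i in-links, until nothing changes."""
    edges = set(edges)
    while True:
        out_deg, in_deg = {}, {}
        for fan, center in edges:
            out_deg[fan] = out_deg.get(fan, 0) + 1
            in_deg[center] = in_deg.get(center, 0) + 1
        kept = {(f, c) for f, c in edges
                if out_deg[f] >= j and in_deg[c] >= i}
        if kept == edges:
            return edges
        edges = kept


def find_cores(edges, i, j):
    """Enumerate (i, j) cores: i fans that all link to the same j centers.
    Brute force over fan subsets, so only suitable for small pruned graphs."""
    links = {}
    for fan, center in prune(edges, i, j):
        links.setdefault(fan, set()).add(center)
    cores = []
    for fans in combinations(sorted(links), i):
        # Centers that every fan in this subset links to.
        shared = set.intersection(*(links[f] for f in fans))
        for centers in combinations(sorted(shared), j):
            cores.append((fans, centers))
    return cores


# Toy link data (made up): fans a, b, c all link to centers x and y.
edges = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"),
         ("c", "x"), ("c", "y"), ("d", "x"), ("e", "z")]
print(find_cores(edges, 3, 2))
# prints [(('a', 'b', 'c'), ('x', 'y'))]
```

In the toy data, pages d and e are removed by pruning because each has only one out-link, which leaves a single core of three fans and two centers. Note that the pruning step never looks at page content, which is the point made in Strength (5).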