Paper #: 37
Title: Trawling the web for emerging cyber-communities

(1) Problems

This paper addresses the problem of finding nascent web communities: communities that exist but are not yet recognized as such by the big portals like Yahoo!, and may not even be recognized by their own members. Finding them is useful for advertising and marketing, for information access, and for learning about the sociology of the web. Interestingly, members of such a community may hold opposing viewpoints (e.g. on abortion), and may not reference, or even like, other members of the community.

(2) New Idea and Strengths

The idea is to identify instances of graph structures indicative of such communities, a process called "trawling." You find a community by finding its "core," and then use the core to find the rest of it. The core is a small (i,j)-sized complete bipartite subgraph of the community, in which each of i fan pages links to all j center pages. To find cores, you scan through the data set (e.g., the www) and find "fans," specialized hubs with links to at least 6 different websites. You use some method, such as shingling, to eliminate mirror sites, because such sites can result in the identification of spurious communities. You must then prune the results to find the real communities. One way to do this is to eliminate pages that have an in-degree greater than a threshold K, which the authors choose as about 50. After enough pruning, you end up with the community cores.

(3) Weaknesses and Extensions

An obvious weakness is that it's hard to find a way to make money by using the technique; none of the major search engines are using it. The open problems identified are automatically extracting semantic information and organizing the trawled communities into a useful structure. An interesting extension would be to trawl successive copies of the web, tracking the sociology of emerging communities.

-- END --
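As a rough illustration of the trawling steps summarized in section (2), here is a minimal Python sketch. It is not the authors' implementation. The function names (`prune`, `find_cores`), the dict-of-sets graph representation, and the repeated degree-based pruning loop are assumptions made for this example. The fan filter (links to at least 6 different websites) and the shingling step are left out, and the brute-force enumeration is only practical on small graphs.

```python
from collections import defaultdict
from itertools import combinations


def in_degrees(graph):
    """Count how many fans link to each target page."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts


def prune(links, i, j, max_in_degree=50):
    """Prune a link graph before searching for (i, j) cores.

    links: dict mapping a fan page to the set of pages it links to
           (hypothetical representation chosen for this sketch).
    max_in_degree: the threshold K from the review (about 50).
    """
    # Drop very popular targets (in-degree above K), e.g. big portals.
    popular = {t for t, d in in_degrees(links).items() if d > max_in_degree}
    graph = {fan: set(targets) - popular for fan, targets in links.items()}

    # Assumed pruning rule: a fan in an (i, j) core needs at least j links,
    # and a center needs at least i in-links. Repeat until nothing changes.
    changed = True
    while changed:
        changed = False
        for fan in [f for f, targets in graph.items() if len(targets) < j]:
            del graph[fan]
            changed = True
        weak = {t for t, d in in_degrees(graph).items() if d < i}
        if weak:
            for fan in graph:
                graph[fan] -= weak
            changed = True
    return graph


def find_cores(graph, i, j):
    """Return (fans, centers) pairs where at least i fans all link to the
    same j centers, i.e. a complete bipartite subgraph."""
    fans_by_centers = defaultdict(set)
    for fan, targets in graph.items():
        for centers in combinations(sorted(targets), j):
            fans_by_centers[centers].add(fan)
    return [
        (sorted(fans), list(centers))
        for centers, fans in fans_by_centers.items()
        if len(fans) >= i
    ]


if __name__ == "__main__":
    # Toy data: three fans share three centers; "x" is a weakly linked page.
    links = {
        "fanA": {"c1", "c2", "c3"},
        "fanB": {"c1", "c2", "c3"},
        "fanC": {"c1", "c2", "c3", "x"},
        "fanD": {"x"},
    }
    print(find_cores(prune(links, i=3, j=3), i=3, j=3))
```

On the toy data, `fanD` and the page `x` are pruned away because they cannot belong to a (3,3) core, and the three remaining fans with their three shared centers are reported as one core.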