Webb Spam Corpus 2011

Web spam consists of Web pages that are created to manipulate search engines and deceive Web users. It is regarded as one of the most important challenges currently facing search engines and Web users, and recent studies suggest that it accounts for a significant portion of all Web content. Although the problems associated with Web spam have been widely acknowledged, research efforts have been limited by the lack of a publicly available Web spam data set. The Webb Spam Corpus was created to help address this gap.

The Webb Spam Corpus 2011 was collected by De Wang as part of the Denial of Information Project at the Georgia Institute of Technology; the original Webb Spam Corpus was collected by Steve Webb. It is a first-of-its-kind, large-scale, publicly available Web spam data set that was created using a novel, fully automated Web spam collection method. The corpus consists of nearly 350,000 Web spam pages, making it more than two orders of magnitude larger than any other previously cited Web spam data set.

A detailed description of the corpus and the methodology used to collect it can be found in the following conference publication:

Introducing the Webb Spam Corpus: Using Email Spam to Identify Web Spam Automatically

Please cite the following paper if you use the Webb Spam Corpus 2011 in your publication:

De Wang, Danesh Irani, and Calton Pu. "Evolutionary Study of Web Spam: Webb Spam Corpus 2011 versus Webb Spam Corpus 2006". In Proc. of 8th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2012). Pittsburgh, Pennsylvania, United States, October 2012.

Disclaimer

Permission is granted for research use of the Webb Spam Corpus, provided users agree to and abide by the Usage Agreement.

A great deal of time and effort has gone towards ensuring the quality of this corpus. However, it is possible that a few false positives (i.e., legitimate Web pages) may still exist in the data set. If you find a page that you believe has been misclassified as Web spam, please send an email to the following address:

wang6 [AT] cc [DOT] gatech [DOT] edu

Download

Webb Spam Corpus 2011 [Coming soon]

Webb Spam Corpus (~1.1GB, tarred and gzipped)
Webb Spam Corpus graph files (~10MB, tarred and gzipped)
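
Because the downloads are tarred and gzipped, it can be useful to check an archive before unpacking it. The following is a minimal sketch, not part of the corpus distribution, that streams through a gzipped tar archive with Python's standard tarfile module and counts the regular files it contains. The function name summarize_archive is hypothetical, and whether each regular file corresponds to exactly one Web spam page is an assumption about the archive layout.

```python
import tarfile


def summarize_archive(archive_path):
    """Return (file_count, total_bytes) for the regular files in a .tar.gz archive."""
    file_count = 0
    total_bytes = 0
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:gz") as archive:
        # Iterating reads member headers one at a time; nothing is extracted to disk.
        for member in archive:
            if member.isfile():  # skip directories, links, and other entry types
                file_count += 1
                total_bytes += member.size  # uncompressed size of this file
    return file_count, total_bytes
```

Calling summarize_archive on the path of a downloaded archive returns the number of files and their combined uncompressed size, which can be compared against the figures quoted on this page before extraction.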