Data set summary


1 Size estimates

1.1 Available data

1.2 Private data

  • Facebook:
  • Myspace:
    • Outside estimate of monthly unique visitors: 76M (1/9)
  • Twitter:
    • Outside estimate: 13M users

2 Publicly available data sets

2.1 WebBase

Various web crawls are available through http://www.webvac.org/. Most crawls cover around 72M pages.

The Internet Archive also has crawls available, but it's unclear to me how to grab them.

2.2 Wikipedia and dbpedia

An archive of Wikipedia pages is available through http://en.wikipedia.org/wiki/Wikipedia:Database_download.
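As a rough sketch of how the page dump might be turned into node and link sets, the snippet below streams an XML export and pulls out each page title with its wiki-link targets. The names `iter_pages`, `local_name`, `WIKILINK`, and `SAMPLE` are made up for illustration, the sample XML is invented, and the link regex is a simplification that ignores templates and other markup.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches the target of [[Target]] or [[Target|label]]; section anchors (#...) are dropped.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on element tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, [link targets]) for each <page> in an XML export stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, [t.strip() for t in WIKILINK.findall(text)]
            title, text = None, ""
            elem.clear()  # drop this page's children to limit memory use


# Tiny made-up sample standing in for a real dump file (assumption, not real data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Graph theory</title>
    <revision>
      <text>See [[Vertex (graph theory)|vertices]] and [[Edge]].</text>
    </revision>
  </page>
</mediawiki>"""

for page_title, links in iter_pages(io.BytesIO(SAMPLE)):
    print(page_title, links)
```

For a real dump, the same function should accept any binary file object, for example one returned by `bz2.open(path, "rb")` if the downloaded file is bzip2-compressed.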

dbpedia converts Wikipedia data into triples for RDF stores; see http://wiki.dbpedia.org/Downloads32.
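If the triples are downloaded in N-Triples form (one `subject predicate object .` statement per line), a minimal reader could look like the sketch below. This is an assumption about the file format rather than something checked against the dbpedia downloads; `parse_ntriples`, `TRIPLE`, and the sample lines are illustrative only, and the regex skips lines it does not understand instead of validating them.

```python
import re

# subject: <IRI> or blank node; predicate: <IRI>; object: everything up to the final " ."
TRIPLE = re.compile(r"^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$")


def parse_ntriples(lines):
    """Yield (subject, predicate, object) string tuples from N-Triples lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match:
            yield match.groups()


# Made-up sample lines in the general shape of RDF triples (not real dbpedia output).
SAMPLE_LINES = [
    "# illustrative data only",
    '<http://dbpedia.org/resource/Atlanta> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Atlanta"@en .',
    "<http://dbpedia.org/resource/Atlanta> "
    "<http://example.org/locatedIn> <http://dbpedia.org/resource/Georgia> .",
]

for subject, predicate, obj in parse_ntriples(SAMPLE_LINES):
    print(subject, predicate, obj)
```

Treating each subject and each IRI-valued object as a node, and each triple as a labeled edge, would give a graph directly; literal objects such as `"Atlanta"@en` would more naturally become node attributes.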

2.3 identi.ca

identi.ca is an open Twitter-like service. They're willing to dump the data if we send them the SQL. They also give out "firehose" access to the complete stream (although without deletions) if we ever want to test really dynamic mechanisms.

2.4 imdb

imdb lists various statistics. Node sets include people, titles, links, etc. Data files are available.

3 Not yet classified

Author: Jason Riedy <jason.riedy@cc.gatech.edu>

Date: 2009-04-22 12:24:49 EDT
