Data set summary
1 Size estimates
- An estimate for the total web: 11.5B pages (Jan 2005)
1.1 Available data
- WebBase: Crawls of 72M pages
- Wikipedia and dbpedia: 1.6M pages, multiple sets of 70M triples
- identi.ca: 50k users, 3M updates, unknown number of groups and tags as of 4/9 (determined by checking @welcomebot and the home page)
- imdb: Multiple node sets (people, titles) of a few million each.
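For a rough sense of scale, the page counts listed above can be set against the 11.5B total-web estimate. The sketch below is only illustrative arithmetic: the constant and function names are made up for this example, and the figures come from different dates (a 2005 web estimate against 2009 data set sizes), so the ratios are order-of-magnitude at best.

```python
# Figures copied from the notes above; all of them are rough estimates.
WEB_PAGES_2005 = 11.5e9   # estimated size of the total web, Jan 2005
WEBBASE_PAGES = 72e6      # typical WebBase crawl
WIKIPEDIA_PAGES = 1.6e6   # Wikipedia page count

def share_of_web(pages, total=WEB_PAGES_2005):
    """Return the fraction of the estimated web that `pages` represents."""
    return pages / total

if __name__ == "__main__":
    for name, pages in [("WebBase", WEBBASE_PAGES), ("Wikipedia", WIKIPEDIA_PAGES)]:
        # The ".3%" format prints the fraction as a percentage with 3 decimals.
        print(f"{name}: {share_of_web(pages):.3%} of the 2005 web estimate")
```

Under these numbers a single WebBase crawl covers well under 1% of the estimated web.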
1.2 Private data
- Facebook:
  - Press release: 200M (4/9)
  - Outside estimate of monthly unique visitors: 55M (1/9)
- Myspace:
  - Outside estimate of monthly unique visitors: 76M (1/9)
- Twitter:
  - Outside estimate: 13M users
2 Publicly available data sets
2.1 WebBase
Various web crawls are available through http://www.webvac.org/. Most scan around 72M pages.
The Internet Archive also has crawls available, but it's unclear to me how to grab them.
2.2 Wikipedia and dbpedia
An archive of Wikipedia pages is available through http://en.wikipedia.org/wiki/Wikipedia:Database_download.
dbpedia converts Wikipedia data into triples for RDF stores; the downloads are at http://wiki.dbpedia.org/Downloads32.
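To make the triple format concrete, here is a minimal sketch of reading such a dump and counting triples per predicate. It assumes the files are in N-Triples form (one `<subject> <predicate> <object> .` statement per line); the regular expression is a simplification rather than a full N-Triples parser, and the sample lines use made-up example.org URIs, not real dbpedia data.

```python
import re
from collections import Counter

# One N-Triples statement: subject (URI or blank node), predicate (URI),
# object (URI, blank node, or literal), then a terminating dot.
# Simplified: it does not validate the object term itself.
TRIPLE_RE = re.compile(r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.$')

def parse_triple(line):
    """Return (subject, predicate, object), or None for blank/comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = TRIPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not an N-Triples statement: {line!r}")
    return match.groups()

def predicate_counts(lines):
    """Count how many triples use each predicate."""
    counts = Counter()
    for line in lines:
        triple = parse_triple(line)
        if triple is not None:
            counts[triple[1]] += 1
    return counts

# Hypothetical sample data for illustration only.
SAMPLE = [
    '# a comment line',
    '<http://example.org/A> <http://example.org/label> "Page A. Draft"@en .',
    '<http://example.org/A> <http://example.org/linksTo> <http://example.org/B> .',
    '<http://example.org/B> <http://example.org/label> "Page B"@en .',
]

if __name__ == "__main__":
    for predicate, n in predicate_counts(SAMPLE).most_common():
        print(n, predicate)
```

For a real dump, `predicate_counts` can be passed an open file object directly, since it only iterates over lines and never holds the whole file in memory.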
2.3 identi.ca
identi.ca is an open, Twitter-like service. They're willing to dump the data if we send them the SQL. They also give out "firehose" access to the complete stream (although without deletions) if we ever want to test really dynamic mechanisms.
2.4 imdb
imdb lists various statistics. Node sets include people, titles, links, etc. Data files are available.
3 Not yet classified
- Notre Dame's: http://www-personal.umich.edu/~mejn/netdata/
- webgraph: http://webgraph.dsi.unimi.it/
- many scientific sets: http://www-personal.umich.edu/~mejn/netdata/
Date: 2009-04-22 12:24:49 EDT