Clustering Instances

Last graph file update: July 26, 2011, addition of files (descriptions further down):

The graphs in this category have been taken from various sources, as listed below. Except where otherwise noted, all graphs have been processed as follows:

  • symmetrized (the union of the graph and its transpose, i.e., all edges are un/bidirectional)
  • parallel edges have been removed
  • self-loops have been removed
  • edges with weight 0 or weight "inf" have been removed
  • graphs with non-integer weights have been made unweighted
  • graphs where all weights equal 1 have been made unweighted
  • weights are only added up if the original graph is directed and contains integer weights
Please note the accompanying conversion notes in the description of each graph, if any, for a deviating description of the preprocessing that has been performed on the original datasets to arrive at the files hosted here.

The following four datasets are taken from http://deim.urv.cat/~aarenas/data/welcome.htm , with kind permission of Alex Arenas, Dept. Enginyeria Informatica i Matematiques (Computer Science & Mathematics), Universidad Rovira i Virgili.

Graph n m
jazz.graph.bz2
Jazz musicians network . List of edges of the network of Jazz musicians. Data compiled by P.Gleiser and L. Danon , Adv. Complex Syst.6, 565 (2003).
198 2742
celegans_metabolic.graph.bz2 :
C. elegans metabolic network. List of edges of the metabolic network of C.elegans. Community identification using Extremal Optimization J. Duch and A. Arenas, Physical Review E , vol. 72, 027104, (2005).
453 2025
email.graph.bz2 :
E-mail network URV. List of edges of the network of e-mail interchanges between members of the Univeristy Rovira i Virgili (Tarragona). Please cite R. Guimera, L. Danon, A. Diaz-Guilera, F. Giralt and A. Arenas, Physical Review E , vol. 68, 065103(R), (2003).
1133 5451
PGPgiantcompo.graph.bz2 :
PGP network. List of edges of the giant component of the network of users of the Pretty-Good-Privacy algorithm for secure information interchange. M. Boguña, R. Pastor-Satorras, A. Diaz-Guilera and A. Arenas, Physical Review E, vol. 70, 056122 (2004).
10680 24316

The following 16 datasets are taken from http://www-personal.umich.edu/~mejn/netdata , with kind permission of Mark Newman, Department of Physics and Center for the Study of Complex Systems, University of Michigan and Santa Fe Institute.

Graph n m
adjnoun.graph.bz2
Word adjacencies: adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens. M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).
112 425
as-22july06.graph.bz2
Internet: a symmetrized snapshot of the structure of the Internet at the level of autonomous systems, reconstructed from BGP tables posted by the University of Oregon Route Views Project. Mark Newman, using data for July 22, 2006.
22963 48436
astro-ph.graph.bz2
Astrophysics collaborations: weighted network of coauthorships between scientists posting preprints on the Astrophysics E-Print Archive between Jan 1, 1995 and December 31, 1999. M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001).
16706 121251
celegansneural.graph.bz2
Neural network: A directed, weighted network representing the neural network of C. Elegans. Data compiled by D. Watts and S. Strogatz and made available on the web here. Please cite D. J. Watts and S. H. Strogatz, Nature 393, 440-442 (1998). Original experimental data taken from J. G. White, E. Southgate, J. N. Thompson, and S. Brenner, Phil. Trans. R. Soc. London 314, 1-340 (1986). A note on the conversion: the original graph is directed and contains integer weights. Each undirected edge in the converted graph that represents two directed edges has an edge weight that equals the sum of the weights of these edges.
297 2148
cond-mat.graph.bz2
Condensed matter collaborations 1999: weighted network of coauthorships between scientists posting preprints on the Condensed Matter E-Print Archive between Jan 1, 1995 and December 31, 1999. M. E. J. Newman, The structure of scientific collaboration networks, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001).
16726 47594
cond-mat-2003.graph.bz2
Condensed matter collaborations 2003: updated network of coauthorships between scientists posting preprints on the Condensed Matter E-Print Archive. This version includes all preprints posted between Jan 1, 1995 and June 30, 2003. The largest component of this network, which contains 27519 scientists, has been used by several authors as a test-bed for community-finding algorithms for large networks; see for example J. Duch and A. Arenas, Phys. Rev. E 72, 027104 (2005). M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001).
31163 120029
cond-mat-2005.graph.bz2
Condensed matter collaborations 2005: updated network of coauthorships between scientists posting preprints on the Condensed Matter E-Print Archive. This version includes all preprints posted between Jan 1, 1995 and March 31, 2005. M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001). A note on the conversion: the original graph contains two edges with weight inf, these were removed before making the graph unweighted.
40421 175691
dolphins.graph.bz2
Dolphin social network: an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand. D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, Behavioral Ecology and Sociobiology 54, 396-405 (2003). Thanks to David Lusseau for permission to post these data on this web site..
62 159
football.graph.bz2
American College football: network of American football games between Division IA colleges during regular season Fall 2000. M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).
115 613
hep-th.graph.bz2
High-energy theory collaborations: weighted network of coauthorships between scientists posting preprints on the High-Energy Theory E-Print Archive between Jan 1, 1995 and December 31, 1999. M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001).
8361 15751
karate.graph.bz2
Zachary's karate club: social network of friendships between 34 members of a karate club at a US university in the 1970s. W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977).
34 78
lesmis.graph.bz2
Les Miserables: coappearance network of characters in the novel Les Miserables. D. E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley, Reading, MA (1993).
77 254
netscience.graph.bz2
Coauthorships in network science: coauthorship network of scientists working on network theory and experiment, as compiled by M. Newman in May 2006. M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).
1589 2742
polblogs.graph.bz2
Political blogs: A directed network of hyperlinks between weblogs on US politics, recorded in 2005 by Adamic and Glance. L. A. Adamic and N. Glance, "The political blogosphere and the 2004 US Election", in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005). Thanks to Lada Adamic for permission to post these data on this web site.
1490 16715
polbooks.graph.bz2
Books about US politics: A network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. Edges between books represent frequent copurchasing of books by the same buyers. The network was compiled by V. Krebs and is unpublished, but can found on Krebs' web site ( http://www.orgnet.com/ ). Thanks to Valdis Krebs for permission to post these data on this web site.
105 441
power.graph.bz2
Power grid: An undirected, unweighted network representing the topology of the Western States Power Grid of the United States. Data compiled by D. Watts and S. Strogatz and made available on the web here. D. J. Watts and S. H. Strogatz, Nature 393, 440-442 (1998).
4941 6594

The following three datasets are taken from http://law.dsi.unimi.it/datasets.php, and have all been gathered by members and collaborators of the Laboratory of Web Algorithms, University of Milano. We refer to the above site for further details on the datasets. Works that these data sets base upon and should be cited when using the data are:

inproceedings{BoVWFI,
author ="Paolo Boldi and Sebastiano Vigna",
title = "The {W}eb{G}raph Framework {I}: {C}ompression Techniques",
year = 2004,
booktitle="Proc. of the Thirteenth International World Wide Web Conference (WWW 2004)",
address="Manhattan, USA",
pages="595--601",
publisher="ACM Press"
}

@inproceedings{BRSLLP,
author = "Paolo Boldi and Marco Rosa and Massimo Santini and Sebastiano Vigna",
title = "Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks",
booktitle="Proceedings of the 20th international conference on World Wide Web",
year = 2011,
publisher="ACM Press"
}

If the graphs you are using were gathered by UbiCrawler, please acknowledge the usage of UbiCrawler by quoting the following paper:

@article{BCSU3,
author="Paolo Boldi and Bruno Codenotti and Massimo Santini and Sebastiano Vigna",
title="UbiCrawler: A Scalable Fully Distributed Web Crawler",
journal="Software: Practice \& Experience",
year=2004,
volume=34,
number=8,
pages="711--726"
}

Graph n m
cnr-2000.graph.bz2 :
A very small crawl of the Italian CNR domain
325557 2738969
eu-2005.graph.bz2 :
A small crawl of the .eu domain
862664 16138468
in-2004.graph.bz2 :
A small crawl of the .in domain performed for the Nagaoka University of Technology
1382908 13591473
uk-2002.graph.bz2 :
This graph has been obtained from a 2002 crawl of the .uk domain performed by UbiCrawler.
18520486 261787258
uk-2007-05.graph.bz2 :
This graph is a time-aware graph generated by combining twelve monthly snapshot of the .uk domain collected for the DELIS project. You might be interested in the fact that the format the authors of this graph use compresses it as to use only 1.207 bits/link.
105896555 3301876564

The following two networks are road networks taken from the benchmark of the 9th DIMACS Implementation Challenge on shortest paths (see http://www.dis.uniroma1.it/~challenge9/download.shtml)

Graph n m
road_central.graph.bz2 14081816 16933413
road_usa.graph.bz2 23947347 28854312

The following graph is taken from Vladimir Batagelj and Andrej Mrvar (2006): Pajek datasets (http://vlado.fmf.uni-lj.si/pub/networks/data/)

Graph n m
chesapeake.graph.bz2
Chesapeake Bay Mesohaline Network; D. Baird; Umcees Ref No. XXX-86; MGC/M2/SUM 1 from datall.zip from http://www.cbl.cees.edu/~ulan/ntwk/network.html Chesapeake Bay mesohaline ecosystem. Summer carbon flows. Baird, D. and R.E. Ulanowicz. 1989. The seasonal dynamics of the Chesapeake Bay ecosystem. Ecol. Monogr. 59: 329-364. transformed from SCOR format by Scor2net 15.7.2003
39 170

Other graphs:

Graph n m
caidaRouterLevel.graph.bz2
The Cooperative Association for Internet Data Analysis (CAIDA) investigates the structure of the Internet and releases measurements of the adjacency matrix of the Internet router-level graph. This graph is the undirected network available from the CAIDA homepage. The data has been collected by CAIDA between Apr 21 and May 8, 2003.
192244 609066
preferentialAttachment.graph.bz2
This graph has been generated following a preferential attachment process (see Barabási and Albert, "Emergence of scaling in random networks", Science, 1999). Starting with a clique of five vertices, the vertices are successively added to the graph. Each new vertex chooses exactly five neighbors among the existing vertices, such that the probability of choosing a particular vertex is proportional to its degree. In our implementation, a vertex can choose a neighbour only once, such that the resulting random graph is guaranteed to be simple.
100000 499985
smallworld.graph.bz2
This graph has been generated using the small world generator of the Boost Graph Library. Starting with a ring of 100000 vertices and 500000 edges, edges are rewired with a probability of 0.2 according to the random model of Watts and Strogatz (see Watts and Strogatz, "Collective Dynamics of Small-World Networks", Nature, 1998). Multiple edges are removed after the process.
100000 499998
G_n_pin_pout.graph.bz2
This graph has been generated using a two-level Gnp random-graph generator. First, each vertex chooses a cluster to belong to, iid randomly. Then, in the spirit of the Erdös-Rényi model, cluster-internal edges are created with a given internal probability each, then cluster-external edges are created with a smaller external probability each. The parameters for this instance are: 100000 vertices, 316 clusters, the internal and the external edge probability are chosen such that the expected number of cluster-internal and the expected number of cluster-external incidences of a node are both five. Such a graph is simple. For references, details and a dynamic version see this project page.
100000 501198

The set of bibliometric graphs is also part of the testbed for the clustering challenge, but the files are not repeated here. Please visit the corresponding page to obtain them.