Maximal Independent Set Benchmark
1 Introduction
This is a proposed benchmark for the Graph 500 effort.
2 Problem Specification
This benchmark computes a maximal independent set (MIS) in a generated, undirected graph. Maximal independent set algorithms combine graph walking, set manipulation, and often random choice generation.
2.1 Input and output

Input specification:
 Integer number of vertices n

Output specification:
 Integer size of the maximal independent set as described below
 Double-precision time in seconds for computing this maximal independent set
2.2 Description
Given a number of vertices n, construct an undirected graph G = (V, E) with vertex set V of cardinality n and edge set E of cardinality almost surely Θ(n sqrt(n log n)) and compute a maximal independent set of vertices M ⊂ V. An independent set S ⊂ V of vertices is one where no two vertices in S are adjacent. A maximal independent set M is an independent set where every vertex in V \ M is adjacent to a vertex in M and thus cannot be added to M without breaking independence.
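As a concrete check of these two definitions, the following sketch tests both conditions: independence (no two vertices of M adjacent) and maximality (every outside vertex has a neighbor in M). It is in Python rather than the Octave used later, and the adjacency-set encoding is a hypothetical choice for illustration, not part of the benchmark specification.

```python
def is_maximal_independent_set(adj, M):
    """Check the two conditions from Section 2.2.
    adj maps each vertex to the set of its neighbors (an illustrative
    encoding, not mandated by the benchmark)."""
    M = set(M)
    # Independence: no vertex of M has a neighbor inside M.
    independent = all(adj[u].isdisjoint(M) for u in M)
    # Maximality: every vertex outside M has at least one neighbor in M.
    maximal = all(adj[v] & M for v in adj if v not in M)
    return independent and maximal

# Path graph 0 - 1 - 2: {0, 2} is a maximal independent set, {0} is not.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
```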
The graph is constructed through a batched version of the triangle-free process ^{1}. To generate G:
 Generate a batch B of k = ⌈ n sqrt (n log n)⌉ unique random edges.
 Insert all edges from B into G that are not already in G and do not form a triangle.
 Repeat until the number of undirected edges in G increases by less than 10%.
If you consider G as a symmetric matrix, the first step is equivalent to choosing k locations from the strictly lower triangle, or k unique integers in [0, n(n-1)/2).
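The linear addressing can be sketched as follows. This is a Python analogue of the trinumsplit.m utility in Section 4, using 0-based indices rather than Octave's 1-based ones and folding in the row shift; the function name is illustrative, not from the benchmark code.

```python
import math

def lower_tri_split(x):
    """Map a 0-based linear index x in [0, n*(n-1)/2) to a (row, col)
    pair in the strictly lower triangle (row > col, 0-based)."""
    # Largest triangular number at or below x gives the row.
    row = (math.isqrt(1 + 8 * x) - 1) // 2
    col = x - row * (row + 1) // 2
    return row + 1, col  # shift down one row to skip the diagonal
```

For n = 4, indices 0 through 5 enumerate (1,0), (2,0), (2,1), (3,0), (3,1), (3,2), covering each of the n(n-1)/2 possible undirected edges exactly once.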
On termination with large n, the graph G almost surely will have the following properties:
 G is triangle-free (by construction, always true),
 has Θ(n sqrt(n log n)) edges,
 has vertex degrees of Θ(sqrt (n log n)), and
 all maximal independent sets are of size Θ(sqrt (n log n)).
From the generated graph G, compute a maximal independent set M. Return its cardinality |M| as well as the time taken to compute M.
3 Open issues: Experimental specification
3.1 Summary
There are many issues to address that cross all the benchmarks. I'll bring up a few of the issues in the context of the MIS benchmark here. Some useful background reading includes:
 general background on experimental design for irregular algorithms, see the papers by D.S. Johnson ^{2} and Catherine C. McGeoch ^{3}, and
 for specific statistical techniques like ANOVA for more regular benchmarks, see Raj Jain's text ^{4}.
The first two are quick to read and apply. The latter is a long text most immediately useful for citation and hitting people over the head. I'll mention a few specific statistical concerns later.
Ideally, our experimental design will permit addressing variations from multiple sources. The time reported is only that for calculating the maximal independent set, so variations within graph generation can be ignored but not the variation caused by different graphs. Of the remaining variations, the following three categories seem important:
 runtime system variations
 thread scheduling, network interactions, job and data placement, etc.;
 random sampling variations
 not only the obvious PRNG sampling but also races between multiple execution threads consulting the same or related PRNGs; and
 algorithm variations
 variations between any reference algorithm / implementation and vendor-tuned versions.
Possibilities for choosing repetition strategies to investigate / emphasize particular features include the following:
 runtime system variations
 repeat with the same graph and the same random samples within MIS,
 random sampling variations
 repeat with the same graph but different samples within MIS, and
 algorithm variations
 repeat the same set of algorithms across multiple graphs.
We need to choose which variations are interesting to capture and smooth from the results. We also need to understand the associated costs of requiring repetitions on massive-scale problems.
Alas, irregular problems tend to confound standard experimental analysis. Standard ANOVA methods assume the response fits a linear model with normally distributed errors, but this often doesn't hold. Once we start collecting data from a few sample systems, we may find transformations or other models that are more applicable.
3.2 Detailed open issues with respect to MIS
 Given a set of times T_{i} and MIS sizes M_{i}, what is the final number used to compare systems? If vendors submit a set of repetitions, should we use the best T_{i}? The median? Vendors will never accept the worst. Also, it's difficult to scale by M_{i}. The extreme cases M_{i} = 1 (dense) or M_{i} = n (no edges) are quite fast.
 How many repetitions should we specify? Experiments on smaller graphs (n = 4000) imply that 5 is sufficient to smooth out random sampling variations in MIS as well as in the triangle-free generator itself.
 What are the problem class sizes? Given Θ(n sqrt(n log n)) edges for n vertices with the trianglefree generator above, we can scale the MIS problems easily. The scales need to be consistent across benchmarks, however.
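To make the scaling concrete, the batch size k from Section 2.2 also gives the order of the final edge count. A small Python sketch (the n values below are illustrative only, not proposed problem classes):

```python
import math

def expected_edges(n):
    """Batch size k = ceil(n * sqrt(n log n)) from Section 2.2, which is
    also the asymptotic order of the generated graph's edge count."""
    return math.ceil(n * math.sqrt(n * math.log(n)))

# Edge counts grow slightly faster than n^1.5.
for n in (10**4, 10**5, 10**6):
    print(n, expected_edges(n))
```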
3.3 Strawman proposal for MIS
The following steps are meant as a starting point for discussion. The numbers are pulled from the air with little thought.
 Run three repetitions on each of five randomly generated graphs.
 Use bootstrapping to compute the expected best of 3 runs over the resulting 15 data points.
My goal is not to require pre-generated graphs or specific PRNGs but rather to minimize their impact. Then we can scale the input to any size.
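The bootstrapping step above can be sketched as follows; the resample count and the use of the minimum as "best" are my assumptions for illustration.

```python
import random
import statistics

def expected_best_of(times, r=3, nboot=10000, seed=1):
    """Bootstrap estimate of the expected best (minimum) of r runs,
    resampling with replacement from the observed timings."""
    rng = random.Random(seed)
    mins = [min(rng.choices(times, k=r)) for _ in range(nboot)]
    return statistics.fmean(mins)

# E.g., feed in the 15 timings from 3 repetitions on each of 5 graphs.
```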
4 Example source code
These are examples demonstrating functionality and not performance. The code is for the GNU Octave environment.
4.1 Triangle-free graph generation
Generate an undirected triangle-free graph in a (symmetric, unit) sparse matrix given the number of vertices and an optional convergence tolerance: trifree.m.
function G = trifree (N, TOL=0.1, VERBOSE=false)
  ## -*- texinfo -*-
  ## @deftypefn{Function File} {@var{G} =} trifree (@var{N})
  ## @deftypefnx{Function File} {@var{G} =} trifree (@var{N}, @var{TOL})
  ## @cindex random graph
  ## Generate a triangle-free, unit-weight graph through the triangle-free
  ## process.  Stop when the number of edges added in a batch is below
  ## @var{TOL}%, which defaults to 10%.
  ## @seealso{Bohman, T. (2008), The triangle-free process, arXiv:0806.4375}
  ## @end deftypefn

  ## The default number of samples is of the same order as the expected
  ## number of edges.
  nsamp = ceil (N * sqrt (N * log (N)));
  G = sparse ([], [], [], N, N, 3*nsamp);

  ## In practice, doubling nsamp seems to converge after two iterations,
  ## but at the cost of long genedgebatch times.  So pay the cost on the
  ## first generation and back off later.
  if VERBOSE, fprintf (stderr, "go0 %d\n", nnz (G)); endif
  [newi, newj] = genedgebatch (G, 2*nsamp);
  if VERBOSE, fprintf (stderr, "prennz0 %d\n", nnz (G)); endif
  G = incorporatebatch (G, newi, newj);
  if VERBOSE, fprintf (stderr, "nnz0 %d\n", nnz (G)); endif
  nsamp = ceil (nsamp / 2);
  iter = 1;
  do
    [newi, newj] = genedgebatch (G, nsamp);
    nnzold = nnz (G);
    if VERBOSE, fprintf (stderr, "prennz %d\n", nnz (G)); endif
    G = incorporatebatch (G, newi, newj);
    ++iter;
    if VERBOSE, fprintf (stderr, "nnz %d\n", nnz (G)); endif
  until ((nnz (G) - nnzold) / nnz (G)) < TOL;
  assert (istrifree (G));
endfunction

function [row, col] = trinumsplit (x)
  ## Split a 1-based linear address into a row and column in a triangle.
  x -= 1;
  row = 1 + floor ((sqrt (1 + 8 * x) - 1) / 2);
  col = 1 + x - (row .* (row-1)) / 2;
endfunction

function out = gencomb (N, M)
  ## Randomly select at most M entries from 1:N.  Floyd's algorithm as
  ## relayed by Bentley.
  if M > N, M = N; endif
  if M == N, out = randperm (N); return; endif
  out = zeros (M, 1);
  J = N - M + (1:M);
  T = 1 + floor (rand (1, M) .* J);
  for k = 1:M,
    j = J(k);
    if any (T(k) == out(1:k-1)),
      out(k) = j;
    else
      out(k) = T(k);
    endif
  endfor
endfunction

function [newi, newj] = genedgebatch (G, nsamp)
  ## Generate a batch of edges by choosing random linear addresses and
  ## converting to row,col pairs.
  N = size (G, 2);
  ## Choose nsamp from the lower triangle.  The lower triangle has
  ## N*(N-1)/2 entries.
  Z = gencomb ((N/2)*(N-1), nsamp);
  [newi, newj] = trinumsplit (Z);
  newi += 1;  # lower triangle, so shift down.
endfunction

function G = incorporatebatch (G, newi, newj)
  ## Slurp a batch of new edges into the graph G.
  nbatch = length (newi);
  for k = 1:nbatch,
    i = newi(k);
    j = newj(k);
    ineigh = find (G(:, i));
    ## Drop existing edges.
    if ismember (j, ineigh), continue; endif
    jneigh = find (G(:, j));
    ## Add only if the edge does not form a triangle.
    if isempty (intersect (ineigh, jneigh)),
      G(i, j) = 1;
      G(j, i) = 1;
    endif
  endfor
endfunction
4.2 MIS
Use Luby's randomized algorithm to compute a maximal independent set given an undirected graph encoded in a sparse matrix: MIS.m.
function IS = MIS (G)
  ## -*- texinfo -*-
  ## @deftypefn{Function File} {@var{IS} =} MIS (@var{G})
  ## @cindex graph
  ## Compute a maximal independent set of vertices on the graph represented
  ## by sparse matrix @var{G} using Luby's randomized algorithm.
  ## @end deftypefn

  Gsaved = G;
  N = size (G, 1);
  IS = [];
  idx = 1:N;
  warning ("off", "Octave:divide-by-zero");
  while !isempty (G),
    N = size (G, 1);
    deg = full (spstats (G));
    prob = 1 ./ (2 * deg);
    select = find (rand (1, N) < prob);
    subG = G(select, select);
    if any (subG(:)),
      ## If any of the sampled vertices have neighbors within the sample,
      ## add only the largest-degree one.  We should add more, but adding
      ## one is sufficient to guarantee progress.
      assert (!any (subG - subG.'))
      zz = select(find (sum (subG) > 0));
      select = setdiff (select, zz);
      [ign, ix] = sort (deg(zz), "descend");
      select = union (select, zz(ix(1)));
    endif
    IS = [IS; idx(select)(:)];
    assert (!any (G(select,select)(:)));
    ## Clear the selected vertices and their neighbors from G.
    zz = sparse (N, 1);
    zz(select) = 1;
    zz = union (find (zz), find (G * zz));
    if length (zz) == N, break; endif
    G(:,zz) = [];
    G(zz,:) = [];
    idx(zz) = [];  # Also update G's global indices.
  endwhile

  ## Verify that IS is a /maximal/ independent set.  Check that every
  ## vertex not in IS has a neighbor in IS.
  N = size (Gsaved, 1);
  IS = sort (IS);
  Gslice = Gsaved(IS, setdiff (1:N, IS));
  assert (size (Gslice, 2) == 0 || all (sum (Gslice)));
endfunction
4.3 Utilities
4.3.1 Check the trianglefree property
Check whether a graph encoded in a sparse matrix is triangle-free: istrifree.m. (Expensive brute-force check.)
function out = istrifree (G)
  ## -*- texinfo -*-
  ## @deftypefn{Function File} {@var{flag} =} istrifree (@var{G})
  ## @cindex graph property
  ## Verify that the graph in matrix @var{G} is triangle-free.
  ## Return a true value if @var{G} is triangle-free, zero otherwise.
  ## @end deftypefn

  G = spones (G);
  [I, J] = find (G);
  J = unique (J);
  x = sparse (size (G, 2), 1);
  for k = 1 : length (J),
    j = J(k);
    x(j) = 1;
    y = G * x;
    j2 = find (y);  # All vertices reached in one step.
    ## A triangle through j exists iff any two of j's neighbors are adjacent.
    if any (spstats (G(j2,j2))),
      out = 0;
      return
    endif
    x(j) = 0;
  endfor
  out = 1;
endfunction
4.3.2 Generate combinations
Generate a vector of M items drawn from 1 to N without replacement: gencomb.m.
function out = gencomb (N, M)
  ## Randomly select M entries from 1:N without replacement using
  ## Floyd's algorithm.
  out = zeros (M, 1);
  for k = 1:M,
    j = N - M + k;
    t = 1 + floor (rand () * j);
    if any (t == out),
      out(k) = j;
    else
      out(k) = t;
    endif
  endfor
endfunction
4.3.3 Uppertriangular matrix addressing
Given a linear index into the upper triangle of a square matrix, compute the row and column indices: trinumsplit.m.
function [row, col] = trinumsplit (x)
  ## Split a 1-based linear index into row and column in a triangle.
  x -= 1;
  row = 1 + floor ((sqrt (1 + 8 * x) - 1) / 2);
  col = 1 + x - (row .* (row-1)) / 2;
endfunction
Footnotes:
^{1} Tom Bohman, The triangle-free process, Advances in Mathematics, Volume 221, Issue 5, 1 August 2009, pp. 1653-1677, ISSN 0001-8708, DOI: 10.1016/j.aim.2009.02.018. Keywords: Ramsey theory; Random graphs. http://arxiv.org/abs/0806.4375
^{2} D. S. Johnson, A Theoretician's Guide to the Experimental Analysis of Algorithms, in Data Structures, Near Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation Challenges, M. H. Goldwasser, D. S. Johnson, and C. C. McGeoch, Editors, American Mathematical Society, Providence, 2002, pp. 215-250. http://www2.research.att.com/~dsj/papers/experguide.pdf
^{3} Catherine C. McGeoch, Feature Article – Toward an Experimental Method for Algorithm Simulation, INFORMS Journal on Computing 8(1):1-15, 1996. http://joc.journal.informs.org/cgi/content/abstract/8/1/1 http://scholar.google.com/scholar?cluster=15493961017810078589&hl=en&as_sdt=80000
^{4} Raj Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley, 1991. http://scholar.google.com/scholar?cluster=3111465748583530237&hl=en&as_sdt=80000
Date: 20100517 15:07:51 EDT