
Maximal Independent Set Benchmark

1 Introduction

This is a proposed benchmark for the Graph 500 effort.

2 Problem Specification

This benchmark computes a maximal independent set (MIS) in a generated, undirected graph. Maximal independent set algorithms combine graph walking, set manipulation, and often random choice generation.

2.1 Input and output

  • Input specification:
    • Integer number of vertices n
  • Output specification:
    • Integer size of the maximal independent set, as described below
    • Double-precision time in seconds for computing this maximal independent set

2.2 Description

Given a number of vertices n, construct an undirected graph G = (V, E) with vertex set V of cardinality n and edge set E of cardinality almost surely Θ(n sqrt(n log n)) and compute a maximal independent set of vertices M ⊂ V. An independent set S ⊂ V of vertices is one where no two vertices in S are adjacent. A maximal independent set M is an independent set where every vertex in V \ M is adjacent to a vertex in M and thus cannot be added to M without breaking independence.
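
For concreteness, the definition can be checked mechanically. The following Octave sketch assumes the graph is stored as a symmetric 0/1 sparse adjacency matrix G and the candidate set as a vector of vertex indices M; the function name is illustrative and not part of the benchmark.

function ok = ismaxindependent (G, M)
### Hypothetical checker: true if M is a maximal independent set of G.
  ## Independence: no edges among the vertices in M.
  indep = !any (any (G(M, M)));
  ## Maximality: every vertex outside M has at least one neighbor in M.
  outside = setdiff (1:columns (G), M);
  maximal = isempty (outside) || all (sum (G(outside, M), 2) > 0);
  ok = indep && maximal;
endfunction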

The graph is constructed through a batched version of the triangle-free process 1. To generate G:

  1. Generate a batch B of k = ⌈ n sqrt (n log n)⌉ unique random edges.
  2. Insert all edges from B into G that are not already in G and do not form a triangle.
  3. Repeat until the number of undirected edges in G increases by less than 10%.

If you consider G as a symmetric matrix, the first step is equivalent to choosing k locations from the strict lower triangle, or equivalently k unique integers in [0, n(n-1)/2).
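
As a tiny worked example of this addressing, using the gencomb and trinumsplit routines listed in Section 4.3 (the sizes here are illustrative):

n = 5;                       # example vertex count
k = 4;                       # number of candidate edges to sample
Z = gencomb (n*(n-1)/2, k);  # k unique linear indices into the strict lower triangle
[i, j] = trinumsplit (Z);    # triangle coordinates with i >= j
i += 1;                      # shift down one row so that i > j, as genedgebatch does
[i j]                        # each row is one candidate undirected edge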

On termination with large n, the graph G almost surely will have the following properties:

  • G is triangle-free (by construction, always true),
  • has Θ(n sqrt(n log n)) edges,
  • has vertex degrees of Θ(sqrt (n log n)), and
  • all maximal independent sets are of size Θ(sqrt (n log n)).

From the generated graph G, compute a maximal independent set M. Return its cardinality |M| as well as the time taken to compute M.
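
A minimal Octave driver for this specification, using the trifree and MIS routines listed in Section 4 and timing only the MIS computation (the problem size is illustrative):

n = 4000;            # illustrative number of vertices
G = trifree (n);     # graph generation is not timed
tic;
M = MIS (G);
t = toc;             # double-precision seconds for the MIS computation
printf ("|M| = %d  time = %g s\n", length (M), t);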

3 Open issues: Experimental specification

3.1 Summary

There are many issues to address that cross all the benchmarks. I'll bring up a few of the issues in the context of the MIS benchmark here. Some useful background reading includes:

  • for general background on experimental design for irregular algorithms, see the papers by D.S. Johnson 2 and Catherine C. McGeoch 3, and
  • for specific statistical techniques like ANOVA for more regular benchmarks, see Raj Jain's text 4.

The first two are quick to read and apply. The last is a long text most immediately useful for citation and for hitting people over the head. I'll mention a few specific statistical concerns later.

Ideally, our experimental design will permit addressing variations from multiple sources. The time reported is only that for calculating the maximal independent set, so variations within graph generation can be ignored but not the variation caused by different graphs. Of the remaining variations, the following three categories seem important:

run-time system variations
thread scheduling, network interactions, job and data placement, etc.;
random sampling variations
not only the obvious PRNG sampling but also races between multiple execution threads consulting the same or related PRNGs; and
algorithm variations
variations between any reference algorithm / implementation and vendor-tuned versions.

Possibilities for choosing repetition strategies to investigate / emphasize particular features include the following:

run-time system variations
repeat with the same graph and the same random samples within MIS,
random sampling variations
repeat with the same graph but different samples within MIS, and
algorithm variations
repeat the same set of algorithms across multiple graphs.

We need to choose which variations are interesting to capture and smooth from the results. We also need to understand the associated costs of requiring repetitions on massive-scale problems.
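
A rough Octave sketch of what these repetition strategies might look like with the example codes in Section 4; the seed handling, sizes, and repetition counts are illustrative only:

n = 4000;  nrep = 5;   # illustrative problem size and repetition count
G = trifree (n);
seed = 1;              # illustrative PRNG seed
## Run-time system variations: same graph, same random samples within MIS.
for r = 1:nrep, rand ("state", seed); tic; MIS (G); t_rts(r) = toc; endfor
## Random sampling variations: same graph, fresh samples within MIS.
for r = 1:nrep, tic; MIS (G); t_rand(r) = toc; endfor
## Algorithm variations: run each implementation across multiple graphs.
for r = 1:nrep, Gr = trifree (n); tic; MIS (Gr); t_alg(r) = toc; endfor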

Alas, irregular problems tend to confound standard experimental analysis. Standard ANOVA methods assume the response fits a linear model with normally distributed errors, but this often doesn't hold. Once we start collecting data from a few sample systems, we may find transformations or other models that are more applicable.

3.2 Detailed open issues with respect to MIS

  • Given a set of times Ti and MIS sizes |Mi|, what is the final number used to compare systems? If vendors submit a set of repetitions, should we use the best Ti? The median? Vendors will never accept the worst. Also, it's difficult to scale by |Mi|. The extreme cases |Mi| = 1 (dense) or |Mi| = n (no edges) are quite fast.
  • How many repetitions should we specify? Experiments on smaller graphs (n = 4000) imply that five repetitions are sufficient to smooth out random sampling variations both in MIS on graphs from this triangle-free generator and in the generator itself.
  • What are the problem class sizes? Given Θ(n sqrt(n log n)) edges for n vertices with the triangle-free generator above, we can scale the MIS problems easily (a quick scaling computation is sketched below). The scales need to be consistent across benchmarks, however.
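
A quick Octave calculation of the expected edge counts for a few candidate vertex counts (constants ignored; the sizes are illustrative, not a proposal):

n = 2 .^ (10:2:20);                      # candidate vertex counts
edges = ceil (n .* sqrt (n .* log (n))); # expected Theta(n sqrt(n log n)) edge counts
[n; edges]'                              # vertex and edge counts side by side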

3.3 Straw-man proposal for MIS

The following steps are meant as a starting point for discussion. The numbers are pulled from the air with little thought.

  1. Run three repetitions on each of five randomly generated graphs.
  2. Use bootstrapping to compute the expected best of 3 runs over the resulting 15 data points.

My goal is not to require pre-generated graphs or specific PRNGs but rather to minimize their impact. Then we can scale the input to any size.
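
A sketch of the bootstrap step above in Octave; the function name and resample count are illustrative assumptions, not part of the proposal. The input T holds the 15 measured times.

function est = bootbest3 (T, nresamp = 1000)
### Hypothetical sketch: bootstrap estimate of the expected best of 3 runs.
  best3 = zeros (nresamp, 1);
  for k = 1:nresamp,
    ## Draw three timings with replacement and keep the best (smallest).
    samp = T(1 + floor (rand (3, 1) * length (T)));
    best3(k) = min (samp);
  endfor
  est = mean (best3);
endfunction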

4 Example source code

These are examples demonstrating functionality and not performance. The code is for the GNU Octave environment.

4.1 Triangle-free graph generation

Generate an undirected triangle-free graph in a (symmetric, unit) sparse matrix given the number of vertices and an optional stopping tolerance: trifree.m.

function G = trifree (N, TOL=0.1, VERBOSE=false)
### -*- texinfo -*-
### @deftypefn{Function File} {@var{G} =} trifree (@var{N})
### @deftypefnx{Function File} {@var{G} =} trifree (@var{N}, @var{TOL})
### @cindex random graph
### Generate a triangle-free, unit-weight graph through the triangle-free
### process.  Stop when a batch increases the number of edges by less than the
### fraction @var{TOL}, which defaults to 0.1 (10%).
### @seealso{Bohman, T. (2008), The triangle-free process, arXiv:0806.4375}
### @end deftypefn

  ## The default number of samples is of the same order as the expected
  ## number of edges.
  nsamp = ceil (N * sqrt (N * log (N)));
  G = sparse ([], [], [], N, N, 3*nsamp);

  ## In practice, doubling it seems to converge after two iterations,
  ## but at the cost of long genedgebatch times.  So pay the cost on the
  ## first generation and back off later.
  if VERBOSE, fprintf (stderr, "go0 %d\n", nnz (G)); endif
  [newi, newj] = genedgebatch (G, 2*nsamp);
  if VERBOSE, fprintf (stderr, "prennz0 %d\n", nnz (G)); endif
  G = incorporatebatch (G, newi, newj);
  if VERBOSE, fprintf (stderr, "nnz0 %d\n", nnz (G)); endif
  nsamp = ceil (nsamp / 2);

  iter = 1;
  do
    [newi, newj] = genedgebatch (G, nsamp);
    nnzold = nnz (G);
    if VERBOSE, fprintf (stderr, "prennz %d\n", nnz (G)); endif
    G = incorporatebatch (G, newi, newj);
    ++iter;
    if VERBOSE, fprintf (stderr, "nnz %d\n", nnz (G)); endif
  until ((nnz (G) - nnzold) / nnz (G)) < TOL;
  assert (istrifree (G));
endfunction

function [row, col] = trinumsplit (x)
### Split a linear address into a row and column in a triangle.
  x -= 1;
  row = 1 + floor ((sqrt (1 + 8 * x) - 1) / 2);
  col = 1 + x - (row .* (row-1)) / 2;
endfunction

function out = gencomb (N, M)
### Randomly select at most M entries from 1:N.  Floyd's algorithm as relayed by
### Bentley.
  if M > N,
    M = N;
  endif
  if M == N,
    out = randperm (N);
    return;
  endif
  out = zeros (M, 1);
  J = N - M + (1:M);
  T = 1 + floor (rand (1, M) .* J);
  for k = 1:M,
    j = J(k);
    if any (T(k) == out(1:k-1)),
      out(k) = j;
    else
      out(k) = T(k);
    endif
  endfor
endfunction

function [newi, newj] = genedgebatch (G, nsamp)
### Generate a batch of edges by choosing random linear addresses and
### converting to row,col pairs.
  N = size (G, 2);
  ## Choose nsamp from the lower triangle.  The lower triangle has
  ## N*(N-1)/2 entries.
  Z = gencomb ((N/2)*(N-1), nsamp);
  [newi,newj] = trinumsplit (Z);
  newi += 1; # lower triangle, so shift down.
endfunction

function G = incorporatebatch (G, newi, newj)
### Slurp a batch of new edges into the graph G.
  nbatch = length (newi);
  for k = 1:nbatch,
    i = newi(k);
    j = newj(k);
    ineigh = find (G(:, i));
    ## Drop existing edges
    if ismember (j, ineigh), continue; endif
    jneigh = find (G(:, j));
    ## Add if the edge does not form a triangle.
    if isempty (intersect (ineigh, jneigh)),
      G(i, j) = 1;
      G(j, i) = 1;
    endif
  endfor
endfunction
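
A quick sanity check of the generated graph against the properties claimed in Section 2.2, ignoring constant factors (the size is illustrative):

n = 4000;                  # illustrative size
G = trifree (n);
deg = full (sum (G));      # vertex degrees
printf ("edges: %d  (n sqrt(n log n) ~ %d)\n", nnz (G) / 2, ceil (n * sqrt (n * log (n))));
printf ("mean degree: %.1f  (sqrt(n log n) ~ %.1f)\n", mean (deg), sqrt (n * log (n)));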

4.2 MIS

Use Luby's randomized algorithm to compute a maximal independent set given an undirected graph encoded in a sparse matrix: MIS.m.

function IS = MIS (G)
### -*- texinfo -*-
### @deftypefn{Function File} {@var{IS} =} MIS (@var{G})
### @cindex graph
### Compute a maximal independent set of vertices on the graph represented
### by sparse matrix @var{G} using Luby's randomized algorithm.
### @end deftypefn

  Gsaved = G;
  N = size (G, 1);
  IS = [];
  idx = 1:N;
  warning ("off", "Octave:divide-by-zero");

  while !isempty (G),
    N = size (G, 1);
    deg = full (spstats (G));
    prob = 1 ./ (2 * deg);
    select = find (rand (1, N) < prob);

    subG = G(select, select);
    if any (subG(:)),
      ## If any of the sample have neighbors within the sample, add only
      ## the largest degree one.  We should add more, but adding one is
      ## sufficient to guarantee progress.
      assert (!any (subG - subG.'))

      ## Drop every selected vertex that conflicts, then add back the
      ## largest-degree conflicting vertex.
      zz = select(find (sum (subG) > 0));
      select = setdiff (select, zz);

      ## sort is ascending, so the largest degree is last.
      [ign, ix] = sort (deg(zz));
      select = union (select, zz(ix(end)));
    endif

    IS = [IS; idx(select)(:)];
    assert (!any (G(select,select)(:)));

    ## Clear the selected vertices and their neighbors from G.
    zz = sparse (N, 1);
    zz(select) = 1;
    zz = union (find(zz), find (G * zz));
    if length (zz) == N, break; endif
    G(:,zz) = [];
    G(zz,:) = [];
    idx(zz) = []; # Also update G's global indices.
  endwhile

  ## Verify that IS is a /maximal/ independent set.  Check that every
  ## vertex not in IS has a neighbor in IS.
  IS = sort (IS);
  N = size (Gsaved, 1);  # N was clobbered in the loop; restore the original size.
  Gslice = Gsaved (IS, setdiff (1:N, IS));
  assert (size (Gslice, 2) == 0 || all (sum (Gslice)));
endfunction

4.3 Utilities

4.3.1 Check the triangle-free property

Check if a graph encoded in a sparse matrix is triangle-free: istrifree.m. (Expensive brute force check.)

function out = istrifree (G)
### -*- texinfo -*-
### @deftypefn{Function File} {@var{flag} =} istrifree (@var{G})
### @cindex graph property
### Verify that the graph in matrix @var{G} is triangle-free.
### Return a true value if @var{G} is triangle-free, zero otherwise.
### @end deftypefn

  G = spones (G);
  [I, J] = find (G);
  J = unique (J);
  x = sparse (size (G, 2), 1);
  for k = 1 : length (J),
    j = J(k);
    x(j) = 1;
    y = G * x;
    j2 = find (y); # All reached in one step
    if any (spstats (G(j2,j2))),
      out = 0;
      return
    endif
    x(j) = 0;
  endfor
  out = 1;
endfunction
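
A minimal usage check: a 3-cycle is itself a triangle, so istrifree should reject it.

T3 = sparse ([1 2 3 1 2 3], [2 3 1 3 1 2], 1, 3, 3);  # undirected triangle
assert (!istrifree (T3));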

4.3.2 Generate combinations

Generate a vector of M items drawn from 1 to N without replacement: gencomb.m.

function out = gencomb (N, M)
  out = zeros (M, 1);
  for k = 1:M,
    j = N-M+k;
    t = 1 + floor (rand () * j);
    if any (t == out),
      out(k) = j;
    else
      out(k) = t;
    endif
  endfor
endfunction
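
A small usage check (sizes illustrative): the draws should be unique and lie in 1:N.

s = gencomb (100, 10);
assert (length (unique (s)) == 10 && all (s >= 1 & s <= 100));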

4.3.3 Lower-triangular matrix addressing

Given a linear index into the lower triangle (including the diagonal) of a square matrix, compute the row and column indices: trinumsplit.m.

function [row, col] = trinumsplit (x)
  x -= 1;
  row = 1 + floor ((sqrt (1 + 8 * x) - 1) / 2);
  col = 1 + x - (row .* (row-1)) / 2;
endfunction
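
A quick round-trip check (sizes illustrative): the 1-based linear index of entry (row, col) in the triangle is row (row-1)/2 + col.

x = (1:10)';               # all linear indices for a triangle with four rows
[r, c] = trinumsplit (x);
assert (all (x == r .* (r - 1) / 2 + c));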

Footnotes:

1 Tom Bohman, The triangle-free process, Advances in Mathematics, Volume 221, Issue 5, 1 August 2009, pp. 1653-1677, ISSN 0001-8708, DOI: 10.1016/j.aim.2009.02.018. Keywords: Ramsey theory; Random graphs http://arxiv.org/abs/0806.4375

2 D. S. Johnson, A Theoretician's Guide to the Experimental Analysis of Algorithms. in Data Structures, Near Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation Challenges, M. H. Goldwasser, D. S. Johnson, and C. C. McGeoch, Editors, American Mathematical Society, Providence, 2002, pp. 215-250. http://www2.research.att.com/~dsj/papers/experguide.pdf

3 Catherine C. McGeoch, Feature Article – Toward an Experimental Method for Algorithm Simulation, INFORMS Journal on Computing, 8(1):1-15, 1996. http://joc.journal.informs.org/cgi/content/abstract/8/1/1 http://scholar.google.com/scholar?cluster=15493961017810078589&hl=en&as_sdt=80000

4 Raj Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley, 1991. http://scholar.google.com/scholar?cluster=3111465748583530237&hl=en&as_sdt=80000

Author: Jason Riedy

Date: 2010-05-17 15:07:51 EDT
