This software was produced with support in part by the
National Science Foundation under grants
CCR-9103135,
HPCC/GCAG-BIR-9318183,
CISE-Postdoc-EIA-9625668,
CCR-9627210,
ITR-ACI-0081404,
DEB-9910123,
CCR-9457800,
CAREER-ACI-0093039,
ITR-EIA-0121377,
DEB-0120709,
ITR-EF/BIO-0331654,
DBI-0420513,
CNS-0614915,
and
CNS-0708307;
by NIH NIGMS R01 GM083621;
and by NASA GSRP Fellowship NGT-50951.

Source Code released under the GNU General Public License (GPL)

All source code on this page is released under the GNU General Public License
(GPL), which is intended to guarantee your freedom to share and
change free software--to make sure the software is free for all its users.

/* Copyright (C) 1995-2010 David A. Bader
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
* USA. */

Pasqual: a Parallel de Novo Genome Sequence Assembler

PASQUAL Date last modified: 7 September 2011
Our assembler PASQUAL, short for PArallel SeQUence AssembLer, is designed for shared-memory parallelism using OpenMP, chosen for its good tradeoff between performance and programmer productivity.

GraphCT: a Graph Characterization Toolkit for the Cray XMT

GraphCT Trac Repository Date last modified: 26 October 2010
GraphCT is a parallel toolkit for analyzing massive graphs
on the massively multithreaded Cray XMT. Designed from
the ground up to take advantage of the large shared
memory and fine-grained synchronization of the Cray XMT,
GraphCT provides an optimized library of functions
including clustering coefficients, connected components,
betweenness centrality, graph diameter, and distribution
statistics. Functions are provided to read and write large
data files in parallel using the shared memory. A proof-of-concept
scripting and command-line interface is available for non-programmers.
GraphCT uses a single common graph representation for all
analysis kernels, and a straightforward API makes it easy
to extend the library by implementing your own custom routines. References:

D. Ediger, K. Jiang, J. Riedy, D.A. Bader, C. Corley, R. Farber and W.N. Reynolds,
``Massive Social Network Analysis: Mining Twitter for Social Good,'' The 39th International Conference on Parallel Processing (ICPP 2010), San Diego, CA, September 13-16, 2010.
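As a taste of what one such kernel computes, here is a minimal sequential sketch of a local clustering coefficient on a graph in CSR form; the function names and data layout are illustrative only, not GraphCT's API:

```c
#include <assert.h>

/* Local clustering coefficient of one vertex on an undirected graph in
 * CSR (compressed sparse row) form.  Hypothetical sketch in the spirit
 * of one GraphCT kernel; GraphCT's real API and layout differ. */
static int has_edge(const int *off, const int *adj, int u, int v) {
    for (int i = off[u]; i < off[u + 1]; i++)
        if (adj[i] == v) return 1;
    return 0;
}

/* Fraction of a vertex's neighbor pairs that are themselves connected. */
double clustering_coeff(const int *off, const int *adj, int v) {
    int deg = off[v + 1] - off[v];
    if (deg < 2) return 0.0;
    int links = 0;  /* ordered neighbor pairs (x, y) with an edge x-y */
    for (int i = off[v]; i < off[v + 1]; i++)
        for (int j = off[v]; j < off[v + 1]; j++)
            if (adj[i] != adj[j] && has_edge(off, adj, adj[i], adj[j]))
                links++;
    /* deg*(deg-1) ordered pairs; counting ordered pairs cancels the 2 */
    return (double)links / (deg * (deg - 1));
}
```

On a triangle every vertex scores 1.0; on a path the middle vertex scores 0.0.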

STINGER

STINGER Trac Repository Date last modified: 26 October 2010
STINGER is an extensible representation for streaming, dynamic
networks and graphs. Linking blocks of edges in a hybrid data
structure, STINGER is able to accommodate massive streams of
edge insertions and deletions while simultaneously supporting
fast, parallel access to neighborhood information. STINGER
supports directed and undirected graphs, weighted and unweighted
edges, edge and vertex types, and timestamps, and can be extended
with additional metadata. STINGER can be built on shared-memory
platforms including the Cray XMT and commodity hardware using
OpenMP + atomic intrinsics.
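The core idea, blocks of edges linked per vertex, can be sketched in a few lines of C; the struct below is a hypothetical miniature for illustration, not STINGER's actual layout or API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Miniature of the STINGER idea: each vertex keeps a linked list of
 * fixed-size edge blocks, so an insertion touches only the tail block
 * while traversal stays cache-friendly.  Names and sizes are assumed
 * for illustration. */
#define EDGES_PER_BLOCK 4

typedef struct edge_block {
    int ndest;                    /* edges used in this block */
    int dest[EDGES_PER_BLOCK];    /* destination vertices */
    int weight[EDGES_PER_BLOCK];
    struct edge_block *next;
} edge_block;

typedef struct { edge_block *head, *tail; int nedges; } vertex;

void insert_edge(vertex *v, int dest, int weight) {
    if (!v->tail || v->tail->ndest == EDGES_PER_BLOCK) {
        edge_block *b = calloc(1, sizeof *b);   /* grow by one block */
        if (v->tail) v->tail->next = b; else v->head = b;
        v->tail = b;
    }
    v->tail->dest[v->tail->ndest] = dest;
    v->tail->weight[v->tail->ndest] = weight;
    v->tail->ndest++;
    v->nedges++;
}

int degree(const vertex *v) { return v->nedges; }
```

Deletions, types, and timestamps would add fields to the block, which is exactly the extensibility the description above refers to.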

SNAP: Small-world Network Analysis and Partitioning

SNAP at SourceForge Date last modified: 4 August 2010
SNAP is an extensible parallel framework for exploratory
analysis and partitioning of large-scale networks. SNAP
is implemented in C, uses OpenMP primitives for parallelization,
and targets sequential, multicore, and symmetric multiprocessor
platforms. Our intent with SNAP is to provide a simple and
intuitive interface for network analysis and application design,
hiding the parallel programming complexity from the user. In
addition to path-based, centrality, and community identification
queries on large-scale graphs, we support commonly-used
preprocessing kernels and quantitative measures that help
understand the global network topology.

CUDARMAT-v1.tar.gz Date last modified: 16 June 2011
CUDARMAT generates an R-MAT test graph using CUDA on nVidia
GPUs and outputs the graph in one of the following formats: a
host-endian binary edge list, textual DIMACS format, or a
host-endian STINGER graph and dynamic edge action list.

The Makefile depends on CUDA SDK makefiles. Pass
--help to the built executable for more argument
details.
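The R-MAT recursion itself is compact. Below is a sequential sketch of the per-edge quadrant sampling (with a toy LCG standing in for the GPU's random-number generator); CUDARMAT's actual CUDA kernel differs:

```c
#include <assert.h>

/* Sketch of R-MAT edge sampling (Chakrabarti, Zhan, and Faloutsos):
 * for each of `scale` bits, pick one quadrant of the adjacency matrix
 * with probabilities a, b, c, and d = 1-a-b-c.  CUDARMAT runs this
 * per-edge loop on the GPU; this version only illustrates the
 * recursion. */
static unsigned lcg(unsigned *s) { return *s = *s * 1103515245u + 12345u; }

void rmat_edge(int scale, double a, double b, double c,
               unsigned *seed, int *src, int *dst) {
    int u = 0, v = 0;
    for (int bit = 1 << (scale - 1); bit > 0; bit >>= 1) {
        double r = (lcg(seed) >> 8) / (double)(1u << 24);  /* in [0,1) */
        if (r < a) {
            /* top-left quadrant: both bits stay 0 */
        } else if (r < a + b) {
            v |= bit;                 /* top-right */
        } else if (r < a + b + c) {
            u |= bit;                 /* bottom-left */
        } else {
            u |= bit; v |= bit;       /* bottom-right */
        }
    }
    *src = u;
    *dst = v;
}
```

Skewing a toward one quadrant is what produces the heavy-tailed degree distributions R-MAT is used for.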

GTfold: A Scalable Multicore Code for RNA Secondary Structure Prediction

GTfold at SourceForge Date last modified: 30 March 2010
The prediction of the correct secondary structures of large
RNAs is one of the unsolved challenges of computational
molecular biology. Among the major obstacles is the fact
that accurate calculations scale as O(n^{4}), so the computational
requirements become prohibitive as the length increases.
GTfold is a new scalable, parallel multicore program
that is one to two orders of magnitude faster than the
de facto standard programs while achieving comparable prediction
accuracy. Development of GTfold opens a new path for
algorithmic improvements and for applying an improved
thermodynamic model to increase the prediction accuracy.
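To illustrate the shape of the dynamic program such codes parallelize, here is the classic Nussinov base-pair maximization, an O(n^3) simplification in which all cells on one diagonal are independent; GTfold itself implements a full thermodynamic model, not this toy scoring:

```c
#include <assert.h>
#include <string.h>

/* Nussinov-style base-pair maximization over Watson-Crick pairs.
 * M[i][j] = max pairs in s[i..j]; cells on each diagonal (fixed
 * interval length) are independent, which is what a multicore folding
 * code exploits.  Assumes strings up to MAXN characters. */
#define MAXN 64

static int pairs(char x, char y) {
    return (x=='A'&&y=='U') || (x=='U'&&y=='A') ||
           (x=='G'&&y=='C') || (x=='C'&&y=='G');
}

int max_pairs(const char *s) {
    int n = (int)strlen(s);
    static int M[MAXN][MAXN];
    memset(M, 0, sizeof M);
    for (int len = 2; len <= n; len++)           /* interval length */
        for (int i = 0; i + len - 1 < n; i++) {  /* cells on one diagonal */
            int j = i + len - 1, best = M[i][j-1];   /* s[j] unpaired */
            for (int k = i; k < j; k++)              /* s[k] pairs s[j] */
                if (pairs(s[k], s[j])) {
                    int v = (k > i ? M[i][k-1] : 0) + 1 + M[k+1][j-1];
                    if (v > best) best = v;
                }
            M[i][j] = best;
        }
    return n ? M[0][n-1] : 0;
}
```

An accurate thermodynamic recurrence adds loop-energy terms and is the O(n^4) computation mentioned above.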

SWARM: Software and Algorithms for Running on Multicore

SWARM at SourceForge Date last modified: 12 April 2007
SWARM is an open-source parallel programming
framework: a library of primitives that fully exploit
multicore processors. SWARM is built on POSIX threads and lets the
user call either the provided primitives or the underlying
thread primitives directly. SWARM has constructs for parallelization,
restricting control of threads, allocation and de-allocation of
shared memory, and communication primitives for synchronization,
replication, and broadcasting. The framework has been successfully
used to implement efficient parallel versions of primitive
algorithms such as list ranking, prefix sums, and symmetry breaking.
SWARM is available for Linux, FreeBSD, Solaris, AIX, and Microsoft Windows.
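A prefix-sums primitive of the kind SWARM provides is usually built as a three-phase block scan. The sketch below is sequential, with comments marking the two phases a threaded version runs in parallel; the function name and block count are illustrative, not SWARM's API:

```c
#include <assert.h>

/* Block-wise inclusive prefix sums: each "processor" scans a
 * contiguous block, the block totals are scanned serially, then each
 * block adds its offset.  The two block loops are the parts a
 * pthreads/SWARM version executes concurrently. */
#define NBLOCKS 4

void prefix_sums(int *a, int n) {
    int total[NBLOCKS + 1] = {0};
    int chunk = (n + NBLOCKS - 1) / NBLOCKS;
    for (int b = 0; b < NBLOCKS; b++) {        /* parallel: local scans */
        int sum = 0;
        for (int i = b * chunk; i < n && i < (b + 1) * chunk; i++)
            a[i] = (sum += a[i]);
        total[b + 1] = sum;
    }
    for (int b = 1; b <= NBLOCKS; b++)         /* serial: scan of totals */
        total[b] += total[b - 1];
    for (int b = 1; b < NBLOCKS; b++)          /* parallel: add offsets */
        for (int i = b * chunk; i < n && i < (b + 1) * chunk; i++)
            a[i] += total[b];
}
```

The serial middle phase touches only NBLOCKS values, so the scheme scales as long as n is much larger than the number of threads.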

DARPA Graph Theory Benchmark (Scalable Synthetic Compact Application #2)

SSCA2-Bader.tar.gz Date last modified: 4 May 2005
[Please click here for a README for building software packages based on SIMPLE libraries]
Graph theoretic problems are representative of fundamental
computations in traditional and emerging scientific disciplines like scientific
computing and computational biology, as well as applications in
national security. We present our design and implementation of a graph
theory application that supports the kernels from the Scalable Synthetic
Compact Applications (SSCA) benchmark suite, developed under the
DARPA High Productivity Computing Systems (HPCS) program. This
synthetic benchmark consists of four kernels that require irregular access
to a large, directed, weighted multi-graph. We have developed a parallel
implementation of this benchmark in C using the POSIX thread library
for commodity symmetric multiprocessors (SMPs). We primarily
discuss the data layout choices and algorithmic design issues for
each kernel, and also present execution time and benchmark validation
results.

listrank-MTA.c, listrank-SMP.c and listrank-SMP.h, listrank-2.1.tar.gz (see this README for building the base SIMPLE package first), listrank-Cell.tar.gz (for the Cell processor) Date last modified: 14 April 2007
[Please click here for a README for building software packages based on SIMPLE libraries]
Irregular problems such as those from graph theory pose serious challenges for parallel machines due to non-contiguous accesses to global data structures with low degrees of locality. These parallel codes perform list ranking on two types of shared-memory computers: symmetric multiprocessors (SMPs) and multithreaded architectures (MTAs) such as the Cray MTA-2. Previous studies show that for SMPs performance is primarily a function of non-contiguous memory accesses, whereas for the MTA it is primarily a function of the number of concurrent operations.
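For a concrete picture of the irregular access pattern, here is a sketch of pointer-jumping (Wyllie) list ranking; the released codes use the more work-efficient sparse-ruling-set scheme, so this simpler variant is illustrative only:

```c
#include <assert.h>

/* Pointer-jumping (Wyllie) list ranking: every element repeatedly adds
 * its successor's rank and jumps to its successor's successor, so the
 * distance to the tail is accumulated in O(log n) synchronous rounds.
 * next[i] == i marks the tail; assumes n <= NMAX.  The reads of
 * rank[next[i]] are the non-contiguous accesses discussed above. */
#define NMAX 128

void list_rank(int *next, int *rank, int n) {
    int nx[NMAX], r2[NMAX];
    for (int i = 0; i < n; i++)
        rank[i] = (next[i] == i) ? 0 : 1;   /* hops toward the tail */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n; i++) {       /* one synchronous round */
            r2[i] = rank[i] + (next[i] == i ? 0 : rank[next[i]]);
            nx[i] = next[next[i]];
            if (nx[i] != next[i]) changed = 1;
        }
        for (int i = 0; i < n; i++) { rank[i] = r2[i]; next[i] = nx[i]; }
    }
}
```

Wyllie's method does O(n log n) work; the sparse-ruling-set approach brings this down to O(n), which matters on real SMPs.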

biconncomp-1.0.tar.gz Date last modified: 1 October 2004
[Please click here for a README for building software packages based on SIMPLE libraries]
The code provides an implementation of parallel biconnected-components
algorithms for SMPs, as a module for SIMPLE/SMP. Our new parallel algorithm includes
various optimization techniques that
significantly improve the performance of finding the biconnected
components of a graph on symmetric multiprocessors
(SMPs). Finding biconnected components has applications in
fault-tolerant network design and is also used in graph planarity
testing.

GRAPPA-1.6.tar.gz Date last modified: 31 July 2002
GRAPPA is software for Genome Rearrangements Analysis under Parsimony and other
Phylogenetic Algorithms; the latest version can also be found
here.
Phylogenies derived from gene order data may prove crucial in
answering some fundamental open questions in biomolecular evolution.
Yet very few techniques are available for such phylogenetic
reconstructions. One method is breakpoint analysis,
developed by Blanchette and Sankoff for solving the ``breakpoint phylogeny.''
Our earlier studies confirmed the usefulness of this approach, but also found that BPAnalysis,
the implementation developed by Sankoff and Blanchette, was too slow
to use on all but very small datasets.
We have re-implemented BPAnalysis using the principles
of algorithmic engineering. Our faster (by 2 to 3 orders of magnitude) and
flexible implementation allowed us to conduct studies
on the characteristics of breakpoint analysis, in terms of running
time, quality, and robustness, as well as to analyze datasets that
had so far been considered out of reach.
GRAPPA also provides the first linear-time implementation of inversion distance.
Hannenhalli and Pevzner gave the first polynomial-time algorithm
for computing the inversion distance between two signed permutations,
as part of the larger task of determining the shortest sequence of
inversions needed to transform one permutation into the other.
Their algorithm (restricted to distance calculation) proceeds in two
stages: in the first stage, the overlap graph induced by the permutation
is decomposed into connected components, then in the second stage certain
graph structures (hurdles and others) are identified. Berman and Hannenhalli
avoided the explicit computation of the overlap graph and gave
an O(n α(n)) algorithm, based on a Union-Find structure,
to find its connected components, where α is the inverse Ackermann
function. Since for all practical purposes α(n) is a constant
no larger than four, this algorithm had been the fastest practical
algorithm to date. In this code, we implement a new linear-time
algorithm for computing the connected components, which is more
efficient than that of Berman and Hannenhalli in both theory and practice.
Our algorithm uses only a stack and is very easy to implement.
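For context, the breakpoint count that breakpoint analysis minimizes is easy to compute. This sketch is the unsigned textbook version, measured against the identity permutation; it is for illustration and is not GRAPPA's code:

```c
#include <assert.h>

/* Breakpoints of an unsigned permutation pi of 1..n relative to the
 * identity: extend with 0 in front and n+1 at the end, and count every
 * adjacent pair that is not of the form (x, x+1).  Breakpoint counts
 * between genomes underlie the breakpoint-phylogeny method GRAPPA
 * accelerates. */
int breakpoints(const int *pi, int n) {
    int bp = 0, prev = 0;                 /* virtual leading 0 */
    for (int i = 0; i <= n; i++) {
        int cur = (i < n) ? pi[i] : n + 1;  /* virtual trailing n+1 */
        if (cur != prev + 1) bp++;
        prev = cur;
    }
    return bp;
}
```

The signed, pairwise version used on real gene orders replaces "x followed by x+1" with "gene g followed by the successor of g in the other genome", but the counting loop has the same shape.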

mst-1.0.tar.gz Date last modified: 3 October 2003
[Please click here for a README for building software packages based on SIMPLE libraries]
Minimum Spanning Tree (MST) is one of the most studied combinatorial
problems, with practical applications in VLSI layout, wireless
communication, and distributed networks; in recent problems in biology
and medicine such as cancer detection, medical imaging, and
proteomics; and in national security and bioterrorism, such as
detecting the spread of toxins through populations in the case of
biological/chemical warfare.
Most of the previous attempts for improving the speed of MST using
parallel computing are too complicated to implement or perform well
only on special graphs with regular structure.
This parallel code (a module for SIMPLE/SMP) provides
implementations of four parallel MST algorithms
(three variations of Boruvka plus our new approach) for arbitrary
sparse graphs that for the first time give speedup when compared
with the best sequential algorithm. In fact, our algorithms
also solve the minimum spanning forest problem.
Our new implementation achieves good speedups over a wide
range of input graphs with regular and irregular structures,
including the graphs used by previous parallel MST studies.
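The Boruvka step at the heart of these variants can be sketched sequentially as follows; the edge-list layout and function name are assumptions for illustration, and, as in the classic algorithm, distinct edge weights are assumed so that no cycle of chosen edges can form:

```c
#include <assert.h>

/* Boruvka's algorithm: in each round every component picks its
 * lightest outgoing edge, the chosen edges are contracted, and rounds
 * repeat until no component has an outgoing edge.  Sequential sketch;
 * the parallel variants distribute both loops.  Assumes n <= NV and
 * distinct edge weights. */
#define NV 16
typedef struct { int u, v, w; } edge;

static int find(int *comp, int x) {          /* root with path halving */
    while (comp[x] != x) x = comp[x] = comp[comp[x]];
    return x;
}

int boruvka_msf(const edge *e, int m, int n) {
    int comp[NV], total = 0, merged = 1;
    for (int i = 0; i < n; i++) comp[i] = i;
    while (merged) {
        merged = 0;
        int best[NV];                        /* lightest edge index per comp */
        for (int i = 0; i < n; i++) best[i] = -1;
        for (int j = 0; j < m; j++) {
            int cu = find(comp, e[j].u), cv = find(comp, e[j].v);
            if (cu == cv) continue;
            if (best[cu] < 0 || e[j].w < e[best[cu]].w) best[cu] = j;
            if (best[cv] < 0 || e[j].w < e[best[cv]].w) best[cv] = j;
        }
        for (int i = 0; i < n; i++) {        /* contract chosen edges */
            if (best[i] < 0) continue;
            int cu = find(comp, e[best[i]].u), cv = find(comp, e[best[i]].v);
            if (cu == cv) continue;
            comp[cu] = cv;
            total += e[best[i]].w;
            merged = 1;
        }
    }
    return total;   /* weight of the minimum spanning forest */
}
```

Because it never needs a sorted edge list and each round halves the number of components, Boruvka's scheme is the usual starting point for parallel MST codes.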

spantree-1.0.tar.gz Date last modified: 25 September 2002
[Please click here for a README for building software packages based on SIMPLE libraries]
The code provides an implementation of parallel spanning tree
algorithms for SMPs, as a module for SIMPLE/SMP. This new randomized
algorithm and implementation has superior performance that, for the
first time, achieves parallel speedup on arbitrary graphs (both regular
and irregular topologies) when compared with the best sequential
implementation for finding a spanning tree.

parjedi-1.0.tar.gz Date last modified: 31 July 2003
[Please click here for a README for building software packages based on SIMPLE libraries]
This parallel code (a SIMPLE/SMP module) provides an
implementation of a parallel algorithm for state assignment of large
Finite State Machines. High performance CAD tools are necessary to
overcome the computational complexity involved in the optimization of
large sequential circuits. FSMs constitute an important class of logic
circuits and state assignment is one of the key steps in combinational
logic optimization. The SMP-based parallel algorithm, based on the
sequential program
JEDI
targeting multilevel logic
implementation, scales nearly linearly with the number of
processors for FSMs of varying problem sizes chosen from standard
benchmark suites while attaining quality of results comparable to
the best sequential algorithms.

rake-1.0.tar.gz Date last modified: 9 September 2002
[Please click here for a README for building software packages based on SIMPLE libraries]
This parallel code (a SIMPLE/SMP module) evaluates arithmetic
expression trees using the algorithmic techniques of list ranking and
tree contraction; this problem is not only of interest in its own
right, but is representative of a large class of irregular
combinatorial problems that have simple and efficient sequential
implementations and fast PRAM algorithms, but have no known efficient
parallel implementations.

listrank-2.0.tar.gz Date last modified: 27 September 2001
[Please click here for a README for building software packages based on SIMPLE libraries]
This SIMPLE/SMP module (written by David R. Helman and Joseph
JáJá, and later modified by David A. Bader)
implements a randomized, parallel code for list ranking.
Helman and JáJá introduce a new optimal prefix
computation algorithm on linked lists which builds upon the
sparse ruling set approach of Reid-Miller
and Blelloch. Besides being somewhat simpler and requiring nearly half
the number of memory accesses, its complexity can be bounded with high
probability instead of merely on average. Moreover, whereas
Reid-Miller and Blelloch targeted their algorithm at a vector
multiprocessor architecture, Helman and JáJá develop theirs for
symmetric multiprocessor (SMP) architectures. These symmetric multiprocessors dominate the high-end server
market and are currently the primary candidate for constructing large
scale multiprocessor systems. The authors ran this code using a variety
of benchmarks which they identified to examine the dependence of the
algorithm on memory access patterns. For some problems, their algorithm
actually matched or exceeded the optimal sequential solution using only
a single thread. Moreover, in spite of the fact that the
processors must compete for access to main memory, their algorithm still
resulted in scalable performance up to 16 processors, which was the
largest platform available to them. References:

D. R. Helman and J. JáJá,
``Prefix Computations on Symmetric Multiprocessors,''
Journal of Parallel and Distributed Computing, 61(2):265--278, 2001.

SIMPLE

simple-4.4J.tar.gz Date last modified: 20 September 2005
[Please click here for a README for building software packages based on SIMPLE libraries]
SIMPLE is a framework for implementation of parallel algorithms
using our methodology for developing high performance programs running
on clusters of SMP nodes. Our methodology is based on a small kernel
(SIMPLE) of collective communication primitives that make efficient
use of the hybrid shared and message passing environment. We
illustrate the power of our methodology by presenting experimental
results for sorting integers, two-dimensional fast Fourier transforms
(FFT), and constraint-satisfied searching.

cycledetect-1.0.tar.gz Date last modified: 23 July 1999
This package provides an MPI implementation of a new parallel
algorithm for detecting cycles in partitioned, planar directed
graphs that is scalable in both the graph and machine size and
performs well in practice. As an example, on a p = 64
processor cluster, we have solved an extremely large and
difficult input problem with n = 2^{28} vertices in less
than five minutes.

Parallel Sorting using Sampling with Randomized and Deterministic Approaches

psort-1.0.tar.gz Date last modified: 22 August 1996
Previous schemes for sorting on general-purpose parallel machines
have had to choose between poor load balancing and irregular
communication or multiple rounds of all-to-all personalized
communication. This code provides a sample sort which uses only two
rounds of regular all-to-all personalized communication in a scheme
that yields very good load balancing with virtually no
overhead. Moreover, unlike previous variations, our algorithm
efficiently handles the presence of duplicate values without the
overhead of tagging each element with a unique identifier. This
algorithm was implemented in Split-C and run on a variety of
platforms, including the Thinking Machines CM-5, the IBM SP-2, and the
Cray Research T3D. This implementation of our new algorithm seems to
outperform all similar algorithms known to the authors on these
platforms, and its performance is invariant over the set of input
distributions unlike previous efficient algorithms. Our results also
compare favorably with those reported for the simpler ranking problem
posed by the NAS Integer Sorting (IS) Benchmark.
The code also provides a new deterministic parallel sorting
algorithm based on the regular sampling approach. The performance
compares closely to that of our random sample sort algorithm, which
seems to outperform all similar algorithms known to the authors on
these platforms. Together, their performance is nearly invariant over
the set of input distributions, unlike previous efficient
algorithms. However, unlike our randomized sorting algorithm, the
performance and memory requirements of our regular sorting algorithm
can be deterministically guaranteed.
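The two-round structure is easy to see in a sequential skeleton: sort a regular sample, pick splitters, bucket every key, and sort each bucket. Here each bucket plays the role of one processor, the bucketing loop stands in for the all-to-all exchange, and all names and constants are illustrative:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sequential skeleton of sample sort with P buckets ("processors").
 * Because bucket b holds only keys between splitter[b-1] and
 * splitter[b], sorting each bucket and concatenating yields a fully
 * sorted array regardless of how balanced the buckets are. */
#define P 4

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

void sample_sort(int *a, int n) {
    int sample[P * 4], splitter[P - 1];
    int s = (P * 4 < n) ? P * 4 : n;
    for (int i = 0; i < s; i++)
        sample[i] = a[(long)i * n / s];          /* regular sample */
    qsort(sample, s, sizeof *sample, cmp_int);
    for (int i = 1; i < P; i++)
        splitter[i - 1] = sample[i * s / P];
    int *bucket[P], cnt[P] = {0};
    for (int b = 0; b < P; b++) bucket[b] = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) {                /* the "all-to-all" step */
        int b = 0;
        while (b < P - 1 && a[i] >= splitter[b]) b++;
        bucket[b][cnt[b]++] = a[i];
    }
    int k = 0;
    for (int b = 0; b < P; b++) {                /* local sorts */
        qsort(bucket[b], cnt[b], sizeof(int), cmp_int);
        memcpy(a + k, bucket[b], cnt[b] * sizeof(int));
        k += cnt[b];
        free(bucket[b]);
    }
}
```

The load-balancing guarantees described above come from how the splitters are chosen (random oversampling versus regular sampling); the skeleton is the same either way.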

psrs-1.1.tar.gz Date last modified: 3 March 2006
This package provides an MPI implementation of Parallel Sorting
by Regular Sampling, an algorithm from Shi and Schaeffer.
References:

H. Shi and J. Schaeffer,
``Parallel Sorting by Regular Sampling,''
Journal of Parallel and Distributed Computing, 14(4):361--372, 1992.

Parallel Selection and Median Finding

mpiselect-1.4A.tar.gz Date last modified: 10 January 2000
A common statistical problem is that of finding the median element in
a set of data. This parallel MPI code implements an efficient, randomized, high-level
parallel algorithm for finding the median of a set of elements
distributed across a parallel machine. In fact, our algorithm solves
the general selection problem, which requires determining the
element of rank k for an arbitrarily given integer k.
We use efficient techniques for distributing and coalescing data as well as
efficient combinations of task and data parallelism. The algorithms
have been coded in the message passing standard MPI, and our
experimental results from the IBM SP-2 illustrate the scalability and
efficiency of our algorithm and improve upon all the related
experimental results known to the authors.
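The sequential kernel of randomized selection, partitioning around a random pivot and recursing into the side containing rank k, can be sketched as follows; the parallel code applies the same idea with the data spread across processors, and the function name here is illustrative:

```c
#include <assert.h>
#include <stdlib.h>

/* Randomized selection: three-way partition around a random pivot
 * into [< pivot | == pivot | > pivot], then continue in the region
 * containing rank k (0-based).  Expected O(n) time; modifies a. */
int select_rank(int *a, int n, int k) {
    while (n > 1) {
        int pivot = a[rand() % n], lo = 0, hi = n, i = 0;
        while (i < hi) {
            if (a[i] < pivot)      { int t = a[lo]; a[lo++] = a[i]; a[i++] = t; }
            else if (a[i] > pivot) { int t = a[--hi]; a[hi] = a[i]; a[i] = t; }
            else i++;
        }
        if (k < lo)      n = lo;                 /* recurse left */
        else if (k < hi) return pivot;           /* k lies in == band */
        else { a += hi; k -= hi; n -= hi; }      /* recurse right */
    }
    return a[0];
}
```

In the distributed setting, each processor partitions its local data and the processors exchange counts to decide which side holds rank k, which is where the coalescing techniques mentioned above come in.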

imageU-3.8.tar.gz Date last modified: 5 June 1998
This code provides efficient and portable implementations of
useful image processing algorithms (histogramming, connected
components, image enhancement, and segmentation), as well as parallel
codes for combinatorial problems such as routing h-relations, sorting,
and selection. The algorithms have been coded in Split-C and provide
the best known execution times for these approaches, even when
compared with machine-specific implementations.
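As one example of the techniques involved, histogramming is commonly parallelized with per-processor private bins merged in a final reduction, avoiding contention on shared counters. The sketch below is sequential and illustrative, not imageU's code:

```c
#include <assert.h>
#include <string.h>

/* Histogram of 8-bit pixel values using per-processor private bins.
 * Each iteration of the b loop would be one processor; the reduction
 * at the end combines the NPROC private copies. */
#define NPROC 4
#define NBINS 256

void histogram(const unsigned char *img, int n, int *bins) {
    static int priv[NPROC][NBINS];
    memset(priv, 0, sizeof priv);
    int chunk = (n + NPROC - 1) / NPROC;
    for (int b = 0; b < NPROC; b++)            /* parallel: private counts */
        for (int i = b * chunk; i < n && i < (b + 1) * chunk; i++)
            priv[b][img[i]]++;
    for (int k = 0; k < NBINS; k++) {          /* reduction over processors */
        bins[k] = 0;
        for (int b = 0; b < NPROC; b++) bins[k] += priv[b][k];
    }
}
```

The same private-copy-plus-reduction pattern reappears in the connected-components and segmentation codes, where per-processor label tables are merged instead of bins.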