David A. Bader
IEEE Fellow
AAAS Fellow
College of Computing
Georgia Tech
Atlanta, GA 30332

This software was produced with support in part by the National Science Foundation under grants CCR-9103135, HPCC/GCAG-BIR-9318183, CISE-Postdoc-EIA-9625668, CCR-9627210, ITR-ACI-0081404, DEB-9910123, CCR-9457800, CAREER-ACI-0093039, ITR-EIA-0121377, DEB-0120709, ITR-EF/BIO-0331654, DBI-0420513, CNS-0614915, and CNS-0708307; by NIH NIGMS R01 GM083621; and by NASA GSRP Fellowship NGT-50951.

Source Code released under the GNU General Public License (GPL)

All source code on this page is released under the GNU General Public License (GPL) which is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users.
/* Copyright (C) 1995-2010  David A. Bader
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * GNU General Public License for more details.
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307,
 * USA.   */
The GPL is also available here.
  • Pasqual: a Parallel de Novo Genome Sequence Assembler

    Date last modified: 7 September 2011
    Our assembler PASQUAL, short for PArallel SeQUence AssembLer, is designed for shared memory parallelism, using OpenMP due to its good tradeoff between performance and programmer productivity.
  • GraphCT: a Graph Characterization Toolkit for the Cray XMT

    GraphCT Trac Repository
    Date last modified: 26 October 2010
    GraphCT is a parallel toolkit for analyzing massive graphs on the massively multithreaded Cray XMT. Designed from the ground up to take advantage of the large shared memory and fine-grained synchronization of the Cray XMT, GraphCT provides an optimized library of functions including clustering coefficients, connected components, betweenness centrality, graph diameter, and distribution statistics. Functions are provided to read and write large data files in parallel using the shared memory. A proof-of-concept scripting and command-line interface is available for non-programmers. GraphCT uses a single common graph representation for all analysis kernels, and a straightforward API makes it easy to extend the library by implementing your own custom routines.

    STINGER Trac Repository
    Date last modified: 26 October 2010
    STINGER is an extensible representation for streaming, dynamic networks and graphs. Linking blocks of edges in a hybrid data structure, STINGER is able to accomodate massive streams of edge insertions and deletions while simultaneously supporting fast, parallel access to neighborhood information. STINGER supports directed and undirected graphs, weighted and unweighted edges, edge and vertex types, timestamps, and can be extended with additional metadata. STINGER can be built on shared memory platforms including the Cray XMT and commodity hardware using OpenMP + atomic intrinsics.
  • SNAP: Small-world Network Analysis and Partitioning

    SNAP at SourceForge
    Date last modified: 4 August 2010
    SNAP is an extensible parallel framework for exploratory analysis and partitioning of large-scale networks. SNAP is implemented in C, uses OpenMP primitives for parallelization, and targets sequential, multicore, and symmetric multiprocessor platforms. Our intent with SNAP is to provide a simple and intuitive interface for network analysis and application design, hiding the parallel programming complexity from the user. In addition to path-based, centrality, and community identification queries on large-scale graphs, we support commonly-used preprocessing kernels and quantitative measures that help understand the global network topology.
  • CUDARMAT: R-MAT Graph Generator for NVIDIA CUDA

    Date last modified: 16 June 2011
    CUDARMAT generates an R-MAT test graph using CUDA on nVidia GPUs and outputs the graph in one of the following formats:

    • host-endian binary edge list,
    • textual DIMACS format, and
    • host-endian STINGER graph and dynamic edge action list.

    The Makefile depends on CUDA SDK makefiles. Pass --help to the built executable for more argument details.

    GTfold: A Scalable Multicore Code for RNA Secondary Structure Prediction

    GTfold at SourceForge
    Date last modified: 30 March 2010
    The prediction of the correct secondary structures of large RNAs is one of the unsolved challenges of computational molecular biology. Among the major obstacles is the fact that accurate calculations scale as O(n4), so the computational requirements become prohibitive as the length increases. GTfold is a new parallel multicore and scalable program which is one to two orders of magnitude faster than the de facto standard programs and achieves comparable accuracy of prediction. Development of GTfold opens up a new path for the algorithmic improvements and application of an improved thermodynamic model to increase the prediction accuracy.
  • SWARM: Software and Algorithms for Running on Multicore

    SWARM at SourceForge
    Date last modified: 12 April 2007
    SWARM has been introduced as an open source parallel programming framework. It is a library of primitives that fully exploit the multicore processors. SWARM is built on POSIX threads that allow the user to use either the already developed primitives or direct thread primitives. SWARM has constructs for parallelization, restricting control of threads, allocation and de-allocation of shared memory, and communication primitives for synchronization, replication and broadcasting. The framework has been successfully used to implement efficient parallel versions of primitive algorithms. Viz. List ranking, Prefix sums, Symmetry breaking, etc. SWARM is available for Linux, FreeBSD, Solaris, AIX, and Microsoft Windows.
  • DARPA Graph Theory Benchmark (Scalable Synthetic Compact Application #2)

    Date last modified: 4 May 2005
    [Please click here for a README for building software packages based on SIMPLE libraries]
    Graph theoretic problems are representative of fundamental computations in traditional and emerging scientific disciplines like scientific computing and computational biology, as well as applications in national security. We present our design and implementation of a graph theory application that supports the kernels from the Scalable Synthetic Compact Applications (SSCA) benchmark suite, developed under the DARPA High Productivity Computing Systems (HPCS) program. This synthetic benchmark consists of four kernels that require irregular access to a large, directed, weighted multi-graph. We have developed a parallel implementation of this benchmark in C using the POSIX thread library for commodity symmetric multiprocessors (SMPs). In this paper, we primarily discuss the data layout choices and algorithmic design issues for each kernel, and also present execution time and benchmark validation results.
  • List Ranking

    listrank-SMP.c and listrank-SMP.h
    listrank-2.1.tar.gz (see this README for building the base SIMPLE package first.)
    listrank-Cell.tar.gz (for the Cell processor)
    Date last modified: 14 April 2007
    [Please click here for a README for building software packages based on SIMPLE libraries]
    Irregular problems such as those from graph theory pose serious challenges for parallel machines due to non-contiguous accesses to global data structures with low degrees of locality. These parallel codes perform list ranking on two types of shared-memory computers: symmetric multiprocessors (SMP) and multithreaded architectures (MTA) such as the Cray MTA-2. Previous studies show that for SMPs performance is primarily a function of non-contiguous memory accesses, whereas for the MTA, it is primarily a function of the number of concurrent operations.
  • Biconnected Components

    Date last modified: 1 October 2004
    [Please click here for a README for building software packages based on SIMPLE libraries]
    The code provides an implementation of parallel biconnected components algorithms for SMPs, as a module for SIMPLE/SMP. Our new parallel algorithm includes various optimization techniques that significantly improves the performance of finding biconnected components of a graph on symmetric multiprocessors (SMPs). Finding biconnected components has application in fault-tolerant network design, and is also used in graph planarity testing.


    Date last modified: 31 July 2002
    GRAPPA is software for Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms; the latest version can also be found here.
    Phylogenies derived from gene order data may prove crucial in answering some fundamental open questions in biomolecular evolution. Yet very few techniques are available for such phylogenetic reconstructions. One method is breakpoint analysis, developed by Blanchette and Sankoff for solving the ``breakpoint phylogeny.'' Our earlier studies confirmed the usefulness of this approach, but also found that BPAnalysis, the implementation developed by Sankoff and Blanchette, was too slow to use on all but very small datasets. We have re-implemented BPAnalysis using the principles of algorithmic engineering. Our faster (by 2 to 3 orders of magnitude) and flexible implementation allowed us to conduct studies on the characteristics of breakpoint analysis, in terms of running time, quality, and robustness, as well as to analyze datasets that had so far been considered out of reach.
    GRAPPA also provides the first linear-time implementation of inversion distance. Hannenhalli and Pevzner gave the first polynomial-time algorithm for computing the inversion distance between two signed permutations, as part of the larger task of determining the shortest sequence of inversions needed to transform one permutation into the other. Their algorithm (restricted to distance calculation) proceeds in two stages: in the first stage, the overlap graph induced by the permutation is decomposed into connected components, then in the second stage certain graph structures (hurdles and others) are identified. Berman and Hannenhalli avoided the explicit computation of the overlap graph and gave an O(nα (n)) algorithm, based on a Union-Find structure, to find its connected components, where α is the inverse Ackerman function. Since for all practical purposes α(n) is a constant no larger than four, this algorithm has been the fastest practical algorithm to date. In this code, we implement a new linear-time algorithm for computing the connected components, which is more efficient than that of Berman and Hannenhalli in both theory and practice. Our algorithm uses only a stack and is very easy to implement.
  • Minimum Spanning Forest

    Date last modified: 3 October 2003
    [Please click here for a README for building software packages based on SIMPLE libraries]
    Minimum Spanning Tree (MST) is one of the most studied combinatorial problems with practical applications in VLSI layout, wireless communication, and distributed networks, recent problems in biology and medicine such as cancer detection, medical imaging, and proteomics, and national security and bioterrorism such as detecting the spread of toxins through populations in the case of biological/chemical warfare. Most of the previous attempts for improving the speed of MST using parallel computing are too complicated to implement or perform well only on special graphs with regular structure.
    This parallel code (a module for SIMPLE/SMP) provides implementations of four parallel MST algorithms (three variations of Boruvka plus our new approach) for arbitrary sparse graphs that for the first time give speedup when compared with the best sequential algorithm. In fact, our algorithms also solve the minimum spanning forest problem. Our new implementation achieves good speedups over a wide range of input graphs with regular and irregular structures, including the graphs used by previous parallel MST studies.
  • Spanning Tree

    Date last modified: 25 September 2002
    [Please click here for a README for building software packages based on SIMPLE libraries]
    The code provides an implementation of parallel spanning tree algorithms for SMPs, as a module for SIMPLE/SMP. This new randomized algorithm and implementation has superior performance that for the first-time achieves parallel speedup on arbitrary graphs (both regular and irregular topologies) when compared with the best sequential implementation for finding a spanning tree.
  • ParJedi

    Date last modified: 31 July 2003
    [Please click here for a README for building software packages based on SIMPLE libraries]
    This parallel code (a SIMPLE/SMP module) provides an implementation of a parallel algorithm for state assignment of large Finite State Machines. High performance CAD tools are necessary to overcome the computational complexity involved in the optimization of large sequential circuits. FSMs constitute an important class of logic circuits and state assignment is one of the key steps in combinational logic optimization. The SMP-based parallel algorithm, based on the sequential program JEDI targeting multilevel logic implementation, scales nearly linearly with the number of processors for FSMs of varying problem sizes chosen from standard benchmark suites while attaining quality of results comparable to the best sequential algorithms.
  • Expression Evaluation using Tree Contraction

    Date last modified: 9 September 2002
    [Please click here for a README for building software packages based on SIMPLE libraries]
    This parallel code (a SIMPLE/SMP module) evaluates arithmetic expression trees using the algorithmic techniques of list ranking and tree contraction; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations.
  • Parallel List Ranking

    Date last modified: 27 September 2001
    [Please click here for a README for building software packages based on SIMPLE libraries]
    This SIMPLE/SMP module (written by David R. Helman and Joseph JáJá , and later modified by David A. Bader) implements a randomized, parallel code for link ranking.
    Helman and JáJá introduce a new optimal prefix computation algorithm on linked lists which builds upon the sparse ruling set approach of Reid-Miller and Blelloch. Besides being somewhat simpler and requiring nearly half the number of memory accesses, and can bound our complexity with high probability instead of merely on average. Moreover, whereas Reid-Miller and Blelloch targeted their algorithm for implementation on a vector multiprocessor architecture, they develop an algorithm for implementation on the symmetric multiprocessor architecture (SMP). These symmetric multiprocessors dominate the high-end server market and are currently the primary candidate for constructing large scale multiprocessor systems. The authors ran this code using a variety of benchmarks which they identified to examine the dependence of the algorithm on memory access patterns. For some problems, their algorithm actually matched or exceeded the optimal sequential solution using only a single thread. Moreover, in spite of the fact that the processors must compete for access to main memory, their algorithm still resulted in scalable performance up to 16 processors, which was the largest platform available to them.
    • D. R. Helman and J. JáJá , ``Prefix Computations on Symmetric Multiprocessors,'' Journal of Parallel and Distributed Computing, 61(2):265--278, 2001.


    Date last modified: 20 September 2005
    [Please click here for a README for building software packages based on SIMPLE libraries]
    SIMPLE is a framework for implementation of parallel algorithms using our methodology for developing high performance programs running on clusters of SMP nodes. Our methodology is based on a small kernel (SIMPLE) of collective communication primitives that make efficient use of the hybrid shared and message passing environment. We illustrate the power of our methodology by presenting experimental results for sorting integers, two-dimensional fast Fourier transforms (FFT), and constraint-satisfied searching.
  • Cycle Detection of Partitioned Planar Digraphs

    Date last modified: 23 July 1999
    This package provides an MPI implementation of a new, parallel algorithm for detecting cycles in partitioned, planar directed graphs that is both scalable in the graph and machine size, and performs well in practice. As an example, on a p = 64 processor cluster, we have solved an extremely large and difficult input problem with n = 228 vertices in less than five minutes.
  • Parallel Sorting using Sampling with Randomized and Deterministic Approaches

    Date last modified: 22 August 1996
    Previous schemes for sorting on general-purpose parallel machines have had to choose between poor load balancing and irregular communication or multiple rounds of all-to-all personalized communication. This code provides a sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, and the Cray Research T3D. This implementation of our new algorithm seems to outperform all similar algorithms known to the authors on these platforms, and its performance is invariant over the set of input distributions unlike previous efficient algorithms. Our results also compare favorably with those reported for the simpler ranking problem posed by the NAS Integer Sorting (IS) Benchmark.
    The code also provides a new deterministic parallel sorting algorithm based on the regular sampling approach. The performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platforms. Together, their performance is nearly invariant over the set of input distributions, unlike previous efficient algorithms. However, unlike our randomized sorting algorithm, the performance and memory requirements of our regular sorting algorithm can be deterministically guaranteed.
  • Parallel Sorting using Regular Sampling

    Date last modified: 3 March 2006
    This package provides an MPI implementation of Parallel Sorting by Regular Sampling, an algorithm from Shi and Schaeffer.
    • H. Shi and J. Schaeffer. ``Parallel Sorting by Regular Sampling. Journal of Parallel and Distributed Computing,'' 14(4):361--372, 1992.

  • Parallel Selection and Median Finding

    Date last modified: 10 January 2000
    A common statistical problem is that of finding the median element in a set of data. This parallel MPI code implements an efficient, randomized high-level parallel algorithms for finding the median given a set of elements distributed across a parallel machine. In fact, our algorithm solves the general selection problem that requires the determination of the element of rank k, for an arbitrarily given integer k. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. The algorithms have been coded in the message passing standard MPI, and our experimental results from the IBM SP-2 illustrate the scalability and efficiency of our algorithm and improve upon all the related experimental results known to the authors.
  • Parallel Combinatorial Algorithms

    Date last modified: 5 June 1998
    This code provides efficient and portable implementations of useful image processing algorithms, histogramming, connected components, image enhancement, and segmentation, as well parallel codes for combinatorial problems such as routing h-relations, sorting and selection. The algorithms have been coded in Split-C and provide the best known execution times for these approaches, even when compared with machine specific implementations.


Last updated: September 7, 2011


Computational Biology

Parallel Computing