SNAP 0.4 has been released, plugging memory leaks, fixing
modularity calculations for undirected graphs, adding
conductance and clustering coefficient metrics, and adding
seeded community detection routines.

David Ediger, Karl Jiang, E. Jason Riedy, David A. Bader, Courtney Corley, Rob Farber, and William N. Reynolds.
Massive social network analysis: Mining Twitter for social good.
In *39th International Conference on Parallel Processing (ICPP)*, San Diego, CA, September 2010.
[bib]
conference proceedings

Social networks produce an enormous quantity of data. Facebook consists of over 400 million active users sharing over 5 billion pieces of information each month. Analyzing this vast quantity of unstructured data presents challenges for software and hardware. We present GraphCT, a Graph Characterization Toolkit for massive graphs representing social network data. On a 128-processor Cray XMT, GraphCT estimates the betweenness centrality of an artificially generated (R-MAT) 537 million vertex, 8.6 billion edge graph in 55 minutes and a real-world graph (Kwak et al.) with 61.6 million vertices and 1.47 billion edges in 105 minutes. We use GraphCT to analyze public data from Twitter, a microblogging network. Twitter's message connections appear primarily tree-structured as a news dissemination system. Within the public data, however, are clusters of conversations. Using GraphCT, we can rank actors within these conversations and help analysts focus attention on a much smaller data subset.

Proposal for the edge-traversal Graph500 benchmark: Maximal
independent sets in triangle-free graphs.

Release 0.4.1 contains bug fixes and performance enhancements.

STING proof-of-concept snapshot, 10 May 2010.

David Ediger, Karl Jiang, E. Jason Riedy, and David A. Bader.
Massive streaming data analytics: A case study with clustering
coefficients.
In *4th Workshop on Multithreaded Architectures and Applications (MTAAP)*, Atlanta, GA, April 2010.
[bib | .html]
conference proceedings

We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. Static analysis kernels often rely on storing input data in a specific structure. Maintaining these structures for each possible kernel with high data rates incurs a significant performance cost. A case study computing clustering coefficients on a general-purpose data structure demonstrates incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 2^{24} ≈ 16 million vertices and 2^{29} ≈ 537 million edges, the brute-force method processes a mean of over 50,000 updates per second and our Bloom filter approaches 200,000 updates per second.
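The incremental idea in this abstract is compact enough to sketch. The Python below is illustrative only (not the paper's Cray XMT code; all names are ours): on inserting edge (u, v), the new triangles are exactly the common neighbours of u and v, so per-vertex triangle counts can be maintained with one set intersection instead of a global recount.

```python
from collections import defaultdict

def make_graph():
    """Adjacency sets: a general-purpose dynamic graph representation."""
    return defaultdict(set)

def local_cc(adj, v):
    """Local clustering coefficient: closed wedges / total wedges at v."""
    neigh = adj[v]
    d = len(neigh)
    if d < 2:
        return 0.0
    # Count edges among neighbours; each counts one triangle through v.
    tri = sum(1 for a in neigh for b in neigh if a < b and b in adj[a])
    return tri / (d * (d - 1) / 2)

def insert_edge(adj, u, v, tri):
    """Incrementally maintain per-vertex triangle counts on edge insertion.

    The new triangles are exactly the common neighbours of u and v, so the
    update costs one set intersection rather than a global recomputation.
    """
    common = adj[u] & adj[v]
    for w in common:
        tri[w] += 1
    tri[u] += len(common)
    tri[v] += len(common)
    adj[u].add(v)
    adj[v].add(u)
```

The paper's Bloom-filter method approximates this same neighbour intersection to reduce memory traffic.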

E. Jason Riedy.
(untitled).
In Dana Martin Guthrie, editor, *Read Write Poem NaPoWriMo Anthology*, page 84. Lulu Press, 2010.
(to appear).
[bib | http]
poetry

E. Jason Riedy. Dependable direct solutions for linear systems using a little extra precision. CSE Seminar at Georgia Institute of Technology, August 2009. [bib | http] presentation

James W. Demmel, Mark Frederick Hoemmen, Yozo Hida, and E. Jason Riedy.
Non-negative diagonals and high performance on low-profile matrices
from Householder *QR*.
*SIAM Journal on Scientific Computing*, 31(4):2832–2841, July 2009.
[bib | DOI]
refereed journal

The Householder reflections used in LAPACK's QR factorization leave positive and negative real entries along R's diagonal. This is sufficient for most applications of QR factorizations, but a few require that R have a nonnegative diagonal. This note describes a new Householder generation routine to produce a nonnegative diagonal. Additionally, we find that scanning for trailing zeros in the generated reflections leads to large performance improvements when applying reflections with many trailing zeros. Factoring low-profile matrices, those with nonzero entries mostly near the diagonal (e.g., band matrices), now requires far fewer operations. For example, QR factorization of matrices with profile width b that are stored densely in an n × n matrix improves from O(n^{3}) to O(n^{2} + nb^{2}). These routines are in LAPACK 3.2.

Keywords: LAPACK; QR factorization; Householder reflection; floating-point
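A minimal sketch of the sign choice behind a nonnegative diagonal (plain Python with illustrative names; it omits the scaling and cancellation safeguards a production LAPACK routine needs): choosing the reflector v = x − ||x||·e₁ sends x to +||x||·e₁, so the diagonal entry of R it produces is nonnegative.

```python
import math

def householder_nonneg(x):
    """Return (v, beta) so that (I - beta*v*v^T) x = +||x|| e_1.

    Mapping x to +||x|| e_1 (rather than -sign(x_1)*||x|| e_1) keeps the
    resulting diagonal entry of R nonnegative.  The subtraction x[0] - ||x||
    can cancel when x[0] > 0; a careful routine rescales, this sketch does not.
    """
    norm = math.sqrt(sum(t * t for t in x))
    if norm == 0.0:
        return [0.0] * len(x), 0.0
    v = list(x)
    v[0] -= norm            # v = x - ||x|| e_1
    vnorm2 = sum(t * t for t in v)
    if vnorm2 == 0.0:       # x is already a positive multiple of e_1
        return v, 0.0
    return v, 2.0 / vnorm2

def apply_reflector(v, beta, x):
    """Compute (I - beta*v*v^T) x."""
    s = beta * sum(a * b for a, b in zip(v, x))
    return [xi - s * vi for vi, xi in zip(v, x)]
```

The conventional choice v = x + sign(x₁)·||x||·e₁ avoids the cancellation but leaves the mixed-sign diagonal the note sets out to remove.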

James W. Demmel, Yozo Hida, Xiaoye S. Li, and E. Jason Riedy.
Extra-precise iterative refinement for overdetermined least squares
problems.
*ACM Transactions on Mathematical Software*, 35(4):1–32, February 2009.
[bib | DOI] refereed journal

We present the algorithm, error bounds, and numerical results for extra-precise iterative refinement applied to overdetermined linear least squares (LLS) problems. We apply our linear system refinement algorithm to Björck's augmented linear system formulation of an LLS problem. Our algorithm reduces the forward normwise and componentwise errors to O(ε) unless the system is too ill conditioned. In contrast to linear systems, we provide two separate error bounds for the solution x and the residual r. The refinement algorithm requires only limited use of extra precision and adds only O(mn) work to the O(mn^{2}) cost of QR factorization for problems of size m-by-n. The extra precision calculation is facilitated by the new extended-precision BLAS standard in a portable way, and the refinement algorithm will be included in a future release of LAPACK and can be extended to the other types of least squares problems.

E. Jason Riedy. Auctions for distributed (and possibly parallel) matchings. Visit to CERFACS courtesy of the Franco-Berkeley Fund, December 2008. [bib | .pdf] presentation

James W. Demmel, Mark Frederick Hoemmen, Yozo Hida, and E. Jason Riedy.
Non-negative diagonals and high performance on low-profile matrices
from Householder *QR*.
LAPACK Working Note 203, Netlib, May 2008.
Also issued as UCB/EECS-2008-76; modified from SISC version.
[bib | .pdf]
technical report

James W. Demmel, Yozo Hida, Xiaoye S. Li, and E. Jason Riedy. Extra-precise iterative refinement for overdetermined least squares problems. LAPACK Working Note 188, Netlib, May 2007. Also issued as UCB/EECS-2007-77; version accepted for TOMS. [bib | .pdf] technical report

We present the algorithm, error bounds, and numerical results for extra-precise iterative refinement applied to overdetermined linear least squares (LLS) problems. We apply our linear system refinement algorithm to Björck's augmented linear system formulation of an LLS problem. Our algorithm reduces the forward normwise and componentwise errors to O(ε) unless the system is too ill conditioned. In contrast to linear systems, we provide two separate error bounds for the solution x and the residual r. The refinement algorithm requires only limited use of extra precision and adds only O(mn) work to the O(mn^{2}) cost of QR factorization for problems of size m-by-n. The extra precision calculation is facilitated by the new extended-precision BLAS standard in a portable way, and the refinement algorithm will be included in a future release of LAPACK and can be extended to the other types of least squares problems.

James W. Demmel, Yozo Hida, Xiaoye S. Li, E. Jason Riedy, Meghana Vishvanath, and David Vu. Precise solutions for overdetermined least squares problems. Stanford 50 – Eighth Bay Area Scientific Computing Day, March 2007. [bib | .pdf] poster

Linear least squares (LLS) fitting is the most widely used data modeling technique and is included in almost every data analysis system (e.g. spreadsheets). These software systems often give no feedback on the conditioning of the LLS problem or the floating-point calculation errors present in the solution. With limited use of extra precision, we can eliminate these concerns for all but the most ill-conditioned LLS problems. Our algorithm provides either a solution and residual with relatively tiny error or a notice that the LLS problem is too ill-conditioned.

James W. Demmel, Jack Dongarra, Beresford Parlett, W. Kahan, Ming Gu, David Bindel, Yozo Hida, Xiaoye S. Li, Osni A. Marques, E. Jason Riedy, Christof Vömel, Julien Langou, Piotr Luszczek, Jakub Kurzak, Alfredo Buttari, Julie Langou, and Stanimire Tomov. Prospectus for the next LAPACK and ScaLAPACK libraries. LAPACK Working Note 181, Netlib, February 2007. Also issued as UT-CS-07-592. [bib | .pdf] technical report

Jack Dongarra, Julien Langou, and E. Jason Riedy. Sca/LAPACK program style. August 2006. [bib | .html] unpublished

The purpose of this document is to facilitate contributions to LAPACK and ScaLAPACK by documenting their design and implementation guidelines. The long-term goal is to provide guidelines for both LAPACK and ScaLAPACK. However, the parallel ScaLAPACK code has more open issues, so this document primarily concerns LAPACK.

James W. Demmel, Jack Dongarra, Beresford Parlett, W. Kahan, Ming Gu, David Bindel, Yozo Hida, Xiaoye S. Li, Osni A. Marques, E. Jason Riedy, Christof Vömel, Julien Langou, Piotr Luszczek, Jakub Kurzak, Alfredo Buttari, Julie Langou, and Stanimire Tomov.
Prospectus for the next LAPACK and ScaLAPACK libraries.
In *PARA'06: State-of-the-Art in Scientific and Parallel Computing*, Umeå, Sweden, June 2006. High Performance Computing Center North (HPC2N) and the Department of Computing Science, Umeå University, Springer.
[bib | .pdf] conference proceedings

LAPACK and ScaLAPACK are widely used software libraries for numerical linear algebra. There have been over 68M web hits at www.netlib.org for the associated libraries LAPACK, ScaLAPACK, CLAPACK and LAPACK95. LAPACK and ScaLAPACK are used to solve leading edge science problems and they have been adopted by many vendors and software providers as the basis for their own libraries, including AMD, Apple (under Mac OS X), Cray, Fujitsu, HP, IBM, Intel, NEC, SGI, several Linux distributions (such as Debian), NAG, IMSL, the MathWorks (producers of MATLAB), Interactive Supercomputing, and PGI. Future improvements in these libraries will therefore have a large impact on users.

James W. Demmel, Yozo Hida, W. Kahan, Xiaoye S. Li, Sonil Mukherjee, and
E. Jason Riedy.
Error bounds from extra-precise iterative refinement.
*ACM Transactions on Mathematical Software*, 32(2):325–351,
June 2006.
[bib | DOI] refereed journal

We present the design and testing of an algorithm for iterative refinement of the solution of linear equations where the residual is computed with extra precision. This algorithm was originally proposed in 1948 and analyzed in the 1960s as a means to compute very accurate solutions to all but the most ill-conditioned linear systems. However, two obstacles have until now prevented its adoption in standard subroutine libraries like LAPACK: (1) There was no standard way to access the higher precision arithmetic needed to compute residuals, and (2) it was unclear how to compute a reliable error bound for the computed solution. The completion of the new BLAS Technical Forum Standard has essentially removed the first obstacle. To overcome the second obstacle, we show how the application of iterative refinement can be used to compute an error bound in any norm at small cost and use this to compute both an error bound in the usual infinity norm, and a componentwise relative error bound.
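The refinement loop this abstract describes is short enough to sketch. In the toy Python below, Fraction arithmetic stands in for the extended-precision residual and a hard-coded 2×2 solver stands in for an LU factorization; all names are ours, not LAPACK's.

```python
from fractions import Fraction

def lu_solve_2x2(A, b):
    """Working-precision solve of a 2x2 system (stand-in for an LU solver)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def refine(A, b, steps=3):
    """Iterative refinement with the residual computed in extra precision.

    Only the residual r = b - A x is formed precisely (Fraction here,
    extended-precision BLAS in the paper); every solve stays in working
    (double) precision, so the extra-precision cost is limited.
    """
    x = lu_solve_2x2(A, b)
    for _ in range(steps):
        r = [Fraction(b[i]) - sum(Fraction(A[i][j]) * Fraction(x[j])
                                  for j in range(2))
             for i in range(2)]
        dx = lu_solve_2x2(A, [float(ri) for ri in r])  # correction solve
        x = [x[i] + dx[i] for i in range(2)]
    return x
```

A production version, as the abstract notes, also monitors the iteration to report a reliable error bound rather than just a refined solution.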

E. Jason Riedy. Making static pivoting dependable. Seventh Bay Area Scientific Computing Day, March 2006. [bib | .pdf] poster

For sparse LU factorization, dynamic pivoting tightly couples symbolic and numerical computation. Dynamic structural changes limit parallel scalability. Demmel and Li use static pivoting in distributed SuperLU for performance, but intentionally perturbing the input may lead silently to erroneous results. Are there experimentally stable static pivoting heuristics that lead to a dependable direct solver? The answer is currently a qualified yes. Current heuristics fail on a few systems, but all failures are detectable.

Osni A. Marques, E. Jason Riedy, and Christof Vömel.
Benefits of IEEE-754 features in modern symmetric tridiagonal
eigensolvers.
*SIAM Journal on Scientific Computing*, 28(5):1613–1633, January 2006.
[bib | DOI] refereed journal

Bisection is one of the most common methods used to compute the eigenvalues of symmetric tridiagonal matrices. Bisection relies on the Sturm count: For a given shift sigma, the number of negative pivots in the factorization T - sigma I = LDL^{T} equals the number of eigenvalues of T that are smaller than sigma. In IEEE-754 arithmetic, the value ∞ permits the computation to continue past a zero pivot, producing a correct Sturm count when T is unreduced. Demmel and Li showed [IEEE Trans. Comput., 43 (1994), pp. 983–992] that using ∞ rather than testing for zero pivots within the loop could significantly improve performance on certain architectures. When eigenvalues are to be computed to high relative accuracy, it is often preferable to work with LDL^{T} factorizations instead of the original tridiagonal T. One important example is the MRRR algorithm. When bisection is applied to the factored matrix, the Sturm count is computed from LDL^{T}, which makes differential stationary and progressive qds algorithms the methods of choice. While it seems trivial to replace T by LDL^{T}, in reality these algorithms are more complicated: In IEEE-754 arithmetic, a zero pivot produces an overflow followed by an invalid exception (NaN, or "Not a Number") that renders the Sturm count incorrect. We present alternative, safe formulations that are guaranteed to produce the correct result. Benchmarking these algorithms on a variety of platforms shows that the original formulation without tests is always faster provided that no exception occurs. The transforms see speed-ups of up to 2.6x over the careful formulations. Tests on industrial matrices show that encountering exceptions in practice is rare. This leads to the following design: First, compute the Sturm count by the fast but unsafe algorithm. Then, if an exception occurs, recompute the count by a safe, slower alternative. The new Sturm count algorithms improve the speed of bisection by up to 2x on our test matrices. Furthermore, unlike the traditional tiny-pivot substitution, proper use of IEEE-754 features provides a careful formulation that imposes no input range restrictions.
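For the plain tridiagonal case this abstract starts from, the Sturm count is only a few lines. The Python sketch below (names ours) shows the traditional tiny-pivot substitution it mentions; Python raises on division by zero rather than producing ∞, so the paper's fast path (let the pivot overflow to ∞ and recompute only when an exception occurs) appears only in a comment.

```python
def sturm_count(alpha, beta, sigma, tiny=1e-300):
    """Number of eigenvalues of a symmetric tridiagonal T below sigma.

    alpha holds the diagonal, beta the off-diagonal.  Counts negative
    pivots d_i in the factorization T - sigma*I = L D L^T.  The paper's
    fast IEEE-754 variant skips the zero test and lets 1/0 produce an
    infinity, falling back to a guarded loop only on an exception;
    Python floats raise instead, so we substitute a tiny pivot.
    """
    count = 0
    d = alpha[0] - sigma
    if d < 0:
        count += 1
    for i in range(1, len(alpha)):
        if d == 0.0:
            d = tiny            # tiny-pivot substitution in place of 1/0
        d = alpha[i] - sigma - beta[i - 1] ** 2 / d
        if d < 0:
            count += 1
    return count
```

Bisection then brackets each eigenvalue by evaluating this count at successive shifts, which is why its inner-loop speed matters so much.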

E. Jason Riedy, Yozo Hida, and James W. Demmel. The future of LAPACK and ScaLAPACK. Robert C. Thompson Matrix Meeting, November 2005. [bib | .pdf] presentation

We are planning new releases of the widely used LAPACK and ScaLAPACK numerical linear algebra libraries. Based on an on-going user survey (http://www.netlib.org/lapack-dev) and research by many people, we are proposing the following improvements: Faster algorithms (including better numerical methods, memory hierarchy optimizations, parallelism, and automatic performance tuning to accommodate new architectures), more accurate algorithms (including better numerical methods, and use of extra precision), expanded functionality (including updating and downdating, new eigenproblems, etc. and putting more of LAPACK into ScaLAPACK), and improved ease of use (friendlier interfaces in multiple languages). To accomplish these goals we are also relying on better software engineering techniques and contributions from collaborators at many institutions. This is joint work with Jack Dongarra.

Osni A. Marques, E. Jason Riedy, and Christof Vömel. Benefits of IEEE-754 features in modern symmetric tridiagonal eigensolvers. LAPACK Working Note 172, Netlib, September 2005. Also issued as UCB//CSD-05-1414; expanded from SISC version. [bib | .pdf] technical report

E. Jason Riedy. Modern language tools and 754R. ARITH'05, June 2005. [bib | .pdf] panel participant

David Hough, Bill Hay, Jeff Kidder, E. Jason Riedy, Guy L. Steele Jr., and Jim Thomas.
Arithmetic interactions: From hardware to applications.
In *17th IEEE Symposium on Computer Arithmetic (ARITH'05)*, June 2005.
See related presentation.
[bib | DOI] conference proceedings

The entire process of creating and executing applications that solve interesting problems with acceptable cost and accuracy involves a complex interaction among hardware, system software, programming environments, mathematical software libraries, and applications software, all mediated by standards for arithmetic, operating systems, and programming environments. This panel will discuss various issues arising among these various contending points of view, sometimes from the point of view of issues raised during the current IEEE 754R standards revision effort.

E. Jason Riedy. Parallel combinatorial computing and sparse matrices. SIAM Conference on Computational Science and Engineering, February 2005. [bib | .pdf] minisymposium speaker

James W. Demmel, Yozo Hida, W. Kahan, Xiaoye S. Li, Sonil Mukherjee, and E. Jason Riedy. Error bounds from extra-precise iterative refinement. LAPACK Working Note 165, Netlib, February 2005. Also issued as UCB//CSD-05-1414, UT-CS-05-547, and LBNL-56965; expanded from TOMS version. [bib | .pdf] technical report

E. Jason Riedy. Sparse data structures for weighted bipartite matching. SIAM Workshop on Combinatorial Scientific Computing, February 2004. [bib | .pdf] presentation

E. Jason Riedy. Parallel weighted bipartite matching and applications. SIAM Parallel Processing for Scientific Computing, February 2004. [bib | .pdf] minisymposium speaker

E. Jason Riedy. Practical alternatives for parallel pivoting. SIAM Annual Meeting, June 2003. [bib | .pdf] presentation

E. Jason Riedy. Parallel bipartite matching for sparse matrix computations. SIAM Conference on Computational Science and Engineering, February 2003. [bib | .pdf] poster

David Bindel and E. Jason Riedy. Exception handling interfaces, implementations, and evaluation. IEEE-754r revision meeting, August 2002. [bib | .pdf] presentation

E. Jason Riedy. Parallel bipartite matching for sparse matrix computation. Third Bay Area Scientific Computing Day, March 2002. [bib] poster

E. Jason Riedy. Type system support for floating-point computation. May 2001. [bib | .pdf] unpublished

Floating-point arithmetic is often seen as untrustworthy. We show how manipulating precisions according to the following rules of thumb enhances the reliability of and removes surprises from calculations: Store data narrowly, compute intermediates widely, and derive properties widely. Further, we describe a typing system for floating point that both supports and is supported by these rules. A single type is established for all intermediate computations. The type describes a precision at least as wide as all inputs to and results from the computation. Picking a single type provides benefits to users, compilers, and interpreters. The type system also extends cleanly to encompass intervals and higher precisions.
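The "store data narrowly, compute intermediates widely" rule is easy to demonstrate. In the Python sketch below (our construction, not the note's notation), struct packing emulates narrow single-precision storage while Python's double-precision floats act as the wide intermediate type.

```python
import struct

def to_f32(x):
    """Round a double to the nearest IEEE-754 single (narrow storage)."""
    return struct.unpack('f', struct.pack('f', x))[0]

def sum_narrow(data):
    """Accumulate in the narrow type: every partial sum rounds to single."""
    acc = 0.0
    for x in data:
        acc = to_f32(acc + x)
    return acc

def sum_wide(data):
    """Accumulate in a wider type (double), rounding only at the end."""
    acc = 0.0
    for x in data:
        acc += x
    return acc

# Store data narrowly: each element is exactly representable in single.
data = [to_f32(0.1)] * 100_000
exact = data[0] * 100_000  # essentially exact in double precision
```

Summing 100,000 copies of float32(0.1) in the wide accumulator loses far less accuracy than rounding every partial sum back to single, which is exactly the behaviour the rule predicts.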

E. Jason Riedy and Robert Szewczyk. Power and control in networked sensors. Cited, May 2000. [bib | .pdf] unpublished

The fundamental constraint on a networked sensor is its energy consumption, since it may be either impossible or not feasible to replace its energy source. We analyze the power dissipation implications of implementing the network sensor with either a central processor switching between I/O devices or a family of processors, each dedicated to a single device. We present the energy measurements of the current generations of networked sensors, and develop an abstract description of tradeoffs between both designs.

E. Jason Riedy and Rich Vuduc. Microbenchmarking the Tera MTA. Cited, presentation version available at http://purl.oclc.org/NET/jason-riedy/resume/material/Tera-presentation.pdf, May 1999. [bib | .pdf] unpublished

The Tera Multithreaded Architecture, or MTA, addresses scalable shared memory system design with a different approach; it tolerates latency through providing fast access to multiple threads of execution. The MTA employs a number of radical design ideas: creation of hardware threads (streams) with frequent context switching; full-empty bits for each memory word; a flat memory hierarchy; and deep pipelines. Recent evaluations of the MTA have taken a top-down approach: port applications and application benchmarks, and compare the absolute performance with conventional systems. While useful, these studies do not reveal the effect of the Tera MTA's unique hardware features on an application. We present a bottom-up approach to the evaluation of the MTA via a suite of microbenchmarks to examine in detail the underlying hardware mechanisms and the cost of runtime system support for multithreading. In particular, we measure memory, network, and instruction latencies; memory bandwidth; the cost of low-level synchronization via full-empty bits; overhead for stream management; and the effects of software pipelining. These data should provide a foundation for performance modeling on the MTA. We also present results for list ranking on the MTA, an application which has traditionally been difficult to scale on conventional parallel systems.

Joseph N. Wilson, E. Jason Riedy, Gerhard X. Ritter, and Hongchi Shi.
An Image Algebra based SIMD image processing environment.
In C. W. Chen and Y. Q. Zhang, editors, *Visual Information Representation, Communication, and Image Processing*, pages 523–542. Marcel Dekker, New York, 1999.
[bib | .pdf] book chapter

SIMD parallel computers have been employed for image related applications since their inception. They have been leading the way in improving processing speed for those applications. However, current parallel programming technologies have not kept pace with the performance growth and cost decline of parallel hardware. A highly usable parallel software development environment is needed. This chapter presents a computing environment that integrates a SIMD mesh architecture with image algebra for high-performance image processing applications. The environment describes parallel programs through a machine-independent, retargetable image algebra object library that supports SIMD execution on the Lockheed Martin PAL-I parallel computer. Program performance on this machine is improved through on-the-fly execution analysis and scheduling. We describe the relevant elements of the system structure, outline the scheme for execution analysis, and provide examples of the current cost model and scheduling system.

Joseph N. Wilson and E. Jason Riedy.
Efficient SIMD evaluation of image processing programs.
In Hongchi Shi and Patrick C. Coffield, editors, *Parallel and Distributed Methods for Image Processing*, volume 3166, pages 199–210, San Diego, CA, July 1997. SPIE.
[bib | DOI | .pdf] conference proceedings

SIMD parallel systems have been employed for image processing and computer vision applications since their inception. This paper describes a system in which parallel programs are implemented using a machine-independent, retargetable object library that provides SIMD execution on the Lockheed Martin PAL-I SIMD parallel processor. Programs' performance on this machine is improved through on-the-fly execution analysis and scheduling. We describe the relevant elements of the system structure, the general scheme for execution analysis, and the current cost model for scheduling.