## Distributed Parallel Quantum Chemistry

Our long-term goal for this project is to scale quantum chemistry calculations to millions of processor cores. Hartree-Fock or self-consistent field (SCF) calculations are typically compute-bound, due to computationally intensive electron repulsion integral (ERI) computations. However, the calculation is very irregular, and on large distributed memory supercomputers, unbalanced communication will easily ruin scalability. We have developed partitioning methods and distributed dynamic scheduling techniques to reduce and balance communication costs. This work won the Best Paper Award (Applications Track) at IPDPS 2014.

## Particle Simulations with Long-Range Interactions

Particle simulations involving long-range forces (such as hydrodynamic, electrostatic, and gravitational) become challenging at large scales because every particle interacts with every other particle. Luckily, some fast algorithms exist. We are applying the FFT-based particle-mesh Ewald method to Brownian dynamics and Stokesian dynamics for non-dilute systems when hydrodynamic forces are dominant. One application is the modeling of diffusion processes inside biological cells (see figure to the left). These types of particle simulations are rich in problems for numerical linear algebra. For example, we were led to discover how to precondition Lanczos methods for computing Brownian forces, a technique that can also be applied to statistical simulation in other domains. Our implementations of PME can simulate hydrodynamic systems involving 500,000 particles, probably the largest ever performed on a single node.

## Numerical Problems in Computational Chemistry

Eigenvalue calculations can also impact the scalability of SCF calculations. For this, we are pursuing methods more amenable to parallelization, in particular those based on McWeeny purification. This is simply a type of Newton iteration for computing the matrix sign function, and is based on matrix-matrix product operations.

The 4-dimensional tensor of ERIs is too
large to be stored, and must be recomputed at each SCF iteration.
An approach that tries to avoid this recomputation is *density fitting*.
Here, the ERI tensor is approximately factored as the contraction
of two 3-dimensional tensors. We are developing distributed memory
parallel codes for density fitting that account for the 8-way
symmetry in the ERI tensor.

## Parallel Asynchronous Matrix Factorizations

Our work addresses the problem of the massive fine-grained concurrency required in scientific and engineering algorithms in order to run efficiently on current and future computer architectures, such as GPUs and MIC. Some of the most effective algorithms are inherently sequential, but we break up this sequential bottleneck by applying approximations and iterations. We focus on certain matrix factorization problems arising in scientific and engineering simulation and data analysis, for example, incomplete factorization preconditioners and matrix completion problems.

## Intel Xeon Phi and Heterogeneous Computing

We are developing new algorithmic techniques to exploit the large number of cores and wide SIMD available on the Intel Xeon Phi coprocessor. Our software for the sparse matrix-vector product kernel, using a data structure we call ELLPACK Sparse Block (ESB), has been released by Intel as a prototype package, and may eventually become part of the Intel Math Kernel Library for the MIC architecture.

We are also optimizing the ERI kernel of quantum chemistry for Intel Xeon Phi. This is extremely challenging due to the lack of natural data parallelism in this kernel. We are collaborating with Intel to develop and run this software for a large-scale calculation on Tianhe-2, currently the world's fastest supercomputer. To fully utilize this machine, we are designing and testing the best ways to schedule and balance tasks heterogeneously across multiple CPUs and Phi coprocessors, while also minimizing offloading overheads.