rahimian a, lashuk i, veerapaneni s. k, aparna c, malhotra d, moon l,
sampath r, shringarpure a, vetter j, vuduc r, zorin d, biros g|
petascale direct numerical simulation of blood flow on 200K cores and heterogeneous architectures
acm/ieee scxy conference series, pp. 1–11, 2010, (Gordon Bell Prize)
We present a fast, petaflop-scalable algorithm for Stokesian particulate flows. Our goal is the direct simulation of blood, which we model as a mixture of a Stokesian fluid (plasma) and red blood cells (RBCs). Directly simulating blood is a challenging multiscale, multiphysics problem. We report simulations with up to 260 million deformable RBCs. The largest simulation amounts to 90 billion unknowns in space. In terms of the number of cells, we improve the state-of-the art by several orders of magnitude: the previous largest simulation, at the same physical fidelity as ours, resolved the flow of O(1,000-10,000) RBCs. Overall, the code has scaled on 256 CPU-GPUs on Teragrid's Lincoln cluster, on 327 CPU-GPUs on Keeneland cluster, and on 200,000 AMD cores of the Oak Ridge National Laboratory'sJaguar PF system. In our largest simulation, we have achieved 0.7 Petaflops/s of sustained performance on Jaguar.
veerapaneni sk, rahimian a, biros g, zorin d|
a fast algorithm for simulating vesicle flows in three dimensions
in review, pp. 1–40, 2010
Vesicles are locally-inextensible fluid membranes that can sustain bending. In this paper, we present a fast algorithm for simulating the dynamics of vesicles suspended in viscous fluids. Spatial quantities are discretized using spherical harmonics, and quadrature rules for singular surface integrals need to be adapted to this case; an algorithm for surface reparameterization is neeed to ensure suffcient of the time- stepping scheme, and spectral filtering is introduced to maintain reasonable accuracy while minimizing computational costs. We obtain a time-stepping scheme that, in our numerical experiments, is unconditionally stable. We present results to analyze the cost and convergence rates of the overall scheme. To illustrate the applicability of the new method, we consider a few vesicle-flow interaction problems: a single vesicle in relaxation, sedimentation, shear flows, and many-vesicle flows.
chaillat s, biros g|
FaIMS: a fast algorithm for the inverse medium problem with multiple frequencies and multiple sources for the scalar Helmholtz equation
in review, pp. 1–27, 2010
We consider the inverse medium problem, for the low-frequency time-harmonic wave equation with broadband and multi-point illumination in the low frequency regime. This model finds many applications in science and engineering (e.g., seismic imaging, non-destructive evaluation, and optical tomography). We formulate the problem using a Lippmann-Schwinger formulation, which we discretize using a quadrature method. We consider small perturbations of the background medium and we invert the Born approximation. To solve this inverse problem, we use a least squares formulation that is regularized with the truncated Singular Value Decomposition (SVD). We have developed an approximate SVD method that reduces the cost of the factorization that provides orders of magnitude improvements over a black-box dense SVD. We provide numerical results that demonstrate the scalability of the method.
ghilliotti g, rahimian a, biros g, misbah c|
vesicle migration and spatial organization driven by flow line curvature
physical review letters, in press, pp. 1–4, 2010
Cross-streamline migration of deformable entities is essential in many problems such as industrial particulate flows, DNA sorting, and blood rheology. Using numerical experiments, we have discovered that vesicles suspended in a flow with curved flow lines migrate towards regions of high flow-line curvature, which are regions of high shear rates. The migration velocity of a vesicle is found to be a universal function of the normal stress difference and the flow curvature. This finding quantitatively demonstrates a direct coupling between a microscopic quantity (migration) and a macroscopic one (normal stress difference). Furthermore, simulations with multiple vesicles revealed a self-organization, which corresponds to segregation, in a rim closer to the inner cylinder, resulting from a subtle interaction among vesicles. Such segregation effects could have signifficant impact on rheology of vesicle flows.
gooya a, biros g, davatzikos c|
deformable registration of glioma images using an EM algorithm and diffusion-reaction modeling
IEEE Transcations in Medical Imaging, in press, pp. 1–15, 2010
We investigate the problem of atlas registration of brain images with gliomas. Multi-parametric imaging modalities (T1, T1-CE, T2, and FLAIR) are first utilized for segmentations of different tissues, and to compute the posterior probability map (PBM) of membership to each tissue class, using supervised learning. Similar maps are generated in the initially normal atlas, by modeling the tumor growth, using reactiondiffusion equation. Deformable registration using a demonslike algorithm is used to register the patient images with the tumor bearing atlas. Joint estimation of the simulated tumor parameters (e.g. location, mass effect and degree of infiltration), and the spatial transformation is achieved by maximization of the log-likelihood of observation. The proposed method has been evaluated on five simulated data sets created by Statistically Simulated Deformations (SSD), and fifteen real multichannel glioma data sets.
sampath rs, biros g|
parallel elastic registration using a multigrid preconditioned Gauss-Newton-Krylov solver, grid continuation and octrees
in review, pp. 1–30, 2010
We present a parallel algorithm for intensity-based elastic image registration. This algorithm integrates several components: parallel octrees, multigrid preconditioning, a Gauss-Newton-Krylov solver, and grid continuation. We use a non-parametric deformation model based on trilinear finite element shape functions defined on octree meshes. Our C++ based implementation uses the Message Passing Interface (MPI) standard and is built on top of the Dendro and PETSc libraries. We demonstrate the performance of our method on synthetic and medical images. We demonstrate the scalability of our implementation on up to 4096 processors on the Sun Constellation Linux Cluster "Ranger" at the Texas Advanced Computing Center (TACC).
adavani s, biros g|
fast algorithms for inverse problems with parabolic PDE constraints
in review, pp. 1–15, 2010
We present optimal complexity algorithms to solve the inverse problem with parabolic PDE constraints on the 2D unit box where the temporal variation of a source function is known but the spatial variation is unknown. We consider measurements on a single boundary and two opposite boundaries. The problem is formulated as a PDE-constrained optimization problem. We use a reduced space approach in which we eliminate the state and adjoint variables and we iterate in the inversion parameter space using the Conjugate Gradients algorithm. We derive analytical expressions for the entries of the reduced Hessian and propose preconditioners based on a low-rank approximation of the Hessian. We also propose preconditioners for problems with non-constant coefficient PDE constraints. We observed mesh-independent and noise-independent convergence of CG with the preconditioner.
rahimian a, veerapaneni sk, biros g|
dynamic simulation of locally inextensible vesicles suspended in an arbitrary two-dimensional domain, a boundary integral method
journal of computational physics, pp. 6466–6484, (229) 2010
We consider numerical algorithms for the simulation of hydrodynamics of two-dimensional vesicles suspended in a viscous Stokesian fluid. The motion of vesicles is governed by the interplay between hydrodynamic and elastic forces. Continuum models of vesicles use a two-phase fluid system with interfacial forces that include tension (to maintain local ace” in inextensibility) and bending. We use a semi-implicit time-marching scheme based on a boundary integral formulation of the Stokes problem for vesicles in an unbounded medium was proposed. In this paper, we consider confined flows within arbitrary-shaped stationary/moving geometries and flows in which the interior (to the vesicle) and exterior fluids have different viscosity. Overall, our method achieves several orders of magnitude speed-up compared to standard explicit schemes.
sampath rs, biros g|
a parallel geometric multigrid method for finite elements on octree meshes
siam journal on scientific computing, 32(3), pp.1361–1392, 2010
We present a parallel geometric multigrid algorithm for solving variable-coefficient elliptic partial differential equations on the unit box (with Dirichlet or Neumann boundary conditions) using highly nonuniform, octree-based, conforming finite element discretizations. Our octrees are 2:1 balanced, that is, we allow no more than one octree-level difference between octants that share a face, edge, or vertex. We describe a parallel algorithm whose input is an arbitrary 2:1 balanced fine-grid octree and whose output is a set of coarser 2:1 balanced octrees that are used in the multigrid scheme. Also, we derive matrix-free schemes for the discretized finite element operators and the intergrid transfer operations. The overall scheme is second-order accurate for sufficiently smooth right-hand sides and material properties; its complexity for nearly uniform trees is O( N np log N np )+O(np log np), where N is the number of octree nodes and np is the number of processors. Our implementation uses the Message Passing Interface standard. We present numerical experiments for the Laplace and Navier (linear elasticity) operators that demonstrate the scalability of our method. Our largest run was a highly nonuniform, 8-billion-unknown, elasticity calculation using 32,000 processors on the Teragrid system, “Ranger,” at the Texas Advanced Computing Center. Our implementation is publically available in the Dendro library, which is built on top of the PETSc library from Argonne National Laboratory.
aparna c, williams s,oliker l, lashuk i, biros g, vuduc r|
optimizing and tuning the fast multipole method for state-of-the-art multicore architectures
ieee proceedings of ipdps, pp. 1–15, 2010
This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multicore systems. We consider single- and double-precision with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our numerous findings, we show that optimization and parallelization can improve double-precision performance by 25X on Intel'quad-core Nehalem, 9.4X on AMD’s'ad-core Barcelona, and 37.6X on Sun'Victoria Falls (dual-sockets on all systems). We also compare our single-precision code against our prior state-of-the-art GPU-based code and show, surprisingly, that the most advanced multicore architecture (Nehalem) reaches parity in both performance and power efficiency with NVIDIA'most advanced GPU architecture.
kaoui b, biros g, misbah c|
why do red blood cells have asymmetric shapes even in a symmetric flow?
physical refview laters, (103), 2009
Understanding why red blood cells (RBCs) move with an asymmetric shape (slipperlike shape) in small blood vessels is a long-standing puzzle in blood circulatory research. By considering a vesicle (a model system for RBCs), we discovered that the slipper shape results from a loss in stability of the symmetric shape. It is shown that the adoption of a slipper shape causes a significant decrease in the velocity difference between the cell and the imposed flow, thus providing higher flow efficiency for RBCs. Higher membrane rigidity leads to a dramatic change in the slipper morphology, thus offering a potential diagnostic tool for cell pathologies.
lashuk i, aparna c, langston h,
nguyen ta, sampath ra, shringarpure a, vuduc r, ying l,
zorin d, biros g|
a massively parallel adaptive fast-multipole method on heterogeneous architectures
acm/ieee scxy conference series, pp. 1–11, 2009
We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC ’03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAYbased Kraken system at NSF/NICS) for highly non-uniform point distributions. We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond.