Research Interests
High-Performance Computing, Parallel Programming Models, Autotuning, Numerical and Discrete Algorithms in Computational Science.
Research Projects
Performance Optimization and Autotuning of the Fast Multipole Method
We are working on the performance optimization, tuning, and analysis of the kernel-independent fast multipole method (FMM) on modern multicore systems. The main feature of this FMM algorithm is that it does not require the implementation of multipole expansions of the underlying kernel, and is based only on kernel evaluations.
Algorithm Design and Autotuning for the Concurrent Collections C/C++ Parallel Programming Model
Intel Concurrent Collections (CNC) is a new programming model in which the programmer specifies only the high-level computational steps, including their inputs and outputs, without imposing unnecessary ordering on their execution. This results in a separation of concerns between the specification of a program and the optimization of its execution on a target parallel architecture. In Summer 2008, I implemented two numerical kernels (Cholesky decomposition and eigenvalue computation) using the C/C++ CNC run-time for multicore architectures, to study programmability and performance bottlenecks. In recent work, we have also compared the performance of two multicore implementations of CNC (Java and C/C++), and evaluated memory management strategies for both. Currently, we are investigating methodologies to autotune programs written in the CNC C/C++ model.
Accelerating Financial Applications on the Cell/B.E. Processor
The Cell processor is a heterogeneous multicore architecture that offers a significant performance improvement over current architectures for data-intensive multimedia and scientific applications. We design and optimize a parallel 64-bit linear congruential generator (LCG), a type of pseudo-random number generator, on the Cell processor. Our Cell/B.E. LCG implementation achieves an average speedup of 33 over current Intel architectures. We use this fast generator for Monte Carlo simulations, and speed up the computation of Value at Risk (VaR), a commonly used model for risk assessment in financial markets.
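As an illustration of how a 64-bit LCG can be split across parallel workers, the following is a minimal Python sketch of the standard skip-ahead technique. It is not the Cell implementation described above: the multiplier and increment are Knuth's MMIX constants, chosen here only as an assumption, and the names `lcg_next`, `lcg_jump`, and `parallel_streams` are hypothetical.

```python
MASK = (1 << 64) - 1          # arithmetic is done modulo 2^64
A = 6364136223846793005       # assumed multiplier (Knuth's MMIX LCG)
C = 1442695040888963407       # assumed increment (Knuth's MMIX LCG)


def lcg_next(x):
    """Advance the generator by one step: x -> (A*x + C) mod 2^64."""
    return (A * x + C) & MASK


def lcg_jump(k):
    """Return (a_k, c_k) such that x -> (a_k*x + c_k) mod 2^64 equals k steps.

    The affine map is composed with itself by repeated squaring, so the
    cost is O(log k) multiplications instead of k generator steps.
    """
    a_k, c_k = 1, 0           # identity map (zero steps)
    a, c = A, C               # map for 2^i steps, starting with i = 0
    while k > 0:
        if k & 1:
            a_k, c_k = (a * a_k) & MASK, (a * c_k + c) & MASK
        a, c = (a * a) & MASK, (a * c + c) & MASK
        k >>= 1
    return a_k, c_k


def parallel_streams(seed, num_workers, block):
    """Give each worker a contiguous block of the sequential stream."""
    streams = []
    for w in range(num_workers):
        a_k, c_k = lcg_jump(w * block)
        x = (a_k * seed + c_k) & MASK     # state after w*block steps
        values = []
        for _ in range(block):
            x = lcg_next(x)
            values.append(x)
        streams.append(values)
    return streams
```

Because each worker jumps directly to its own starting state, the concatenated worker outputs reproduce the sequential stream exactly, which keeps a Monte Carlo run reproducible regardless of the number of workers.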
Tunable Parallelism in the Asynchronous Variational Integrator (AVI) Framework
The asynchronous variational integrator (AVI) framework of Lew et al. was originally developed for time-integration of partial differential equations (PDEs). We study tunable parallelism in the context of AVIs by designing several multicore implementations. An AVI can be naturally cast as a discrete event simulation (DES), where element (or super-element) updates become events with prescribed time-stamps, and the DES software framework takes care of all the scheduling and causality-preserving details. We designed a scalable AVI implementation using the optimistic DES methodology, observing a speedup of nearly 30 on 64 threads of a Niagara 2 machine.
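The mapping from element updates to time-stamped events can be sketched with a simple priority queue. The Python example below is a sequential illustration only, not the optimistic parallel implementation described above; the function name `run_avi_schedule` and the per-element time steps are hypothetical, and the actual element physics is omitted.

```python
import heapq


def run_avi_schedule(time_steps, t_end):
    """Process element updates in time-stamp order, as a sequential DES would.

    time_steps maps an element id to that element's own local time step.
    Returns the list of (time, element) updates in the order they occur.
    """
    # Each element's first event fires after one local time step.
    queue = [(dt, elem) for elem, dt in time_steps.items()]
    heapq.heapify(queue)

    processed = []
    while queue:
        t, elem = heapq.heappop(queue)     # earliest pending event
        if t > t_end:
            continue                       # past the horizon: do not reschedule
        processed.append((t, elem))        # the element update would happen here
        heapq.heappush(queue, (t + time_steps[elem], elem))
    return processed
```

For example, `run_avi_schedule({"a": 0.5, "b": 0.25}, 1.0)` updates element `b` twice as often as element `a`, which is the asynchrony an AVI exploits: each element advances with its own time step, while the queue preserves causality by always processing the earliest time-stamp first.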
An Image-Processing Library for the Cell/B.E. Processor
As part of my undergraduate senior-year project, I designed an image-processing library for the Cell processor. The implementation included functions such as convolution, Laplacian filters, contrast stretching, thresholding, edge detection using MSB, Sobel edge detection, rotation, and mirroring.
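To give a sense of the kind of routine such a library contains, here is a minimal Python sketch of Sobel edge detection on a grayscale image stored as a list of rows. It is not the library's code: the function name `sobel_magnitude` is hypothetical, border pixels are simply left at zero, and the gradient magnitude is approximated as |gx| + |gy|, all of which are assumptions made for brevity.

```python
def sobel_magnitude(image):
    """Return an approximate gradient magnitude for each interior pixel.

    image is a list of equal-length rows of numeric intensities.
    Border pixels are left at 0 because the 3x3 window does not fit there.
    """
    gx_kernel = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]      # responds to horizontal intensity changes
    gy_kernel = [[-1, -2, -1],
                 [0, 0, 0],
                 [1, 2, 1]]       # responds to vertical intensity changes

    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]

    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    pixel = image[y + j - 1][x + i - 1]
                    gx += gx_kernel[j][i] * pixel
                    gy += gy_kernel[j][i] * pixel
            out[y][x] = abs(gx) + abs(gy)   # cheap stand-in for sqrt(gx^2 + gy^2)
    return out
```

The inner 3x3 window loop is the same access pattern as general convolution, so a single tuned convolution routine can serve several of the filters listed above.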
Indian Language Processing
As a summer intern at Tata Consultancy Services, I worked on an automated converter for processing documents in Indian languages and standardizing them. This facilitates text processing operations like spell checking, searching, and sorting on the processed documents.