HPC 2.0: The Challenge of Scale
In just a few years, scientific computing has moved from the terascale to the trans-petascale regime, with research on exascale designs now underway. Large-scale scientific instruments, ubiquitous sensors and large scale simulations are now producing a torrent of digital data, with associated challenges of transport, analysis and curation. Disciplinary fusion and multidisciplinary collaborations are challenging our social structures, as teams from diverse backgrounds and organizations work together. Changes in scale, whether technical or sociological, always bring new challenges.
What are the emerging technologies that will drive the future of high-performance computing? How can we best integrate software management of component failures and system resilience, energy consumption and efficiency, and performance optimization in manageable and intuitive ways? How can we measure and optimize the performance of systems at scale? How do we balance use of commodity components and economies of scale against custom design, recognizing that technical computing has always tracked the consumer mainstream? What are the best ways to manage and extract insights from data at scale, and what are the economically viable ways to enable multidisciplinary data fusion and technology transfer? How can we simplify use of technical computing and broaden the base of users?
This talk will examine these and other issues technically, socially and economically, with an eye to the fusion of technology and policy.

Dr. Aydin Buluç
Lawrence Berkeley National Laboratory
30 March 2012

A sustainable software stack for parallel graph analysis
Graph theory is used to model large-scale complex systems in various domains, such as genomics and social network analysis. I will describe a sustainable software stack for parallel analysis of very large graphs. The layered software architecture combines performance, customizability, and conceptual simplicity though separation of concerns. I will first focus on the parallel backend, the Combinatorial BLAS, which implements state-of-the-art algorithms for sparse matrix primitives that serve as parallel building blocks for tightly-coupled graph computations. I will present the algorithmic innovations that make the primitives scalable, and the algebraic semiring model that is fundamental to the customizability of the Combinatorial BLAS. Several graph kernels that are implemented by composing a handful of the Combinatorial BLAS primitives scale up to tens of thousands of cores on distributed memory architectures, surpassing specialized implementations. I will then outline the architecture and capabilities of the Python based frontend, the Knowledge Discovery Toolbox, and describe several applications. I will briefly touch on supporting semantic graphs without sacrificing performance, thus enabling a broad set of real applications that maintain important auxiliary information through attributes on individual edges and vertices.

The New Era in Genomics: Opportunities and Challenges for High Performance Computing
The advent of second and third generation sequencing technologies is revolutionizing biosciences research by enabling high-throughput sampling of genomes and transcriptomes. With these technologies, it is now possible to sequence a few billion DNA fragments in a single experiment. This quantum leap in throughput creates many technical challenges: 1) instrumentation-specific challenges such as error correction; 2) reinventing methodologies for classic applications such as genome assembly; 3) challenges due to rapid scale up in scope and number of research projects driven by lower economic costs; and 4) expanding the repertoire of applications – e.g., personalized genome sequencing, and digital expression analysis for systems biology. The rapid development of next generation technologies, and the richness and diversity of their applications, has created a pressing need for new bioinformatics tools and computational methods capable of dealing with large volumes of data. This talk will outline the computational challenges in exploiting high-throughput sequencing technologies, and the role of high performance computing as we transition to this new era.

Algorithm Engineering for Route Planning - An Update –
Nowadays, route planning systems belong to the most frequently used information systems. The algorithmic core problem of such systems, i.e., the fast computation of shortest paths is a classical problem that can be solved by Dijkstra's shortest paths algorithm. However, algorithms for route planning in transportation networks have recently undergone a rapid development, leading to methods that are up to several million times faster than Dijkstra's algorithm. In particular, computing shortest paths in huge networks has become a showpiece of Algorithm Engineering demonstrating the engineering cycle that consists of design, analysis, implementation and experimental evaluation of practicable algorithms. We will provide a condensed overview of the techniques enabling this development.

Design by Transformation - Application to Dense Linear Algebra Libraries
The FLAME project has yielded modern alternatives to LAPACK and related effort. An attractive feature of this effort is the complete vertical integration of the entire software stack, starting with low level kernels that support the BLAS and finishing with a new distributed memory library, Elemental. In between are layers that target a single core, multicore, and multiGPU architectures. What this now enables is a new approach where libraries are viewed not as instantiations in code but instead as a repository of algorithms, knowledge about those algorithm, and knowledge about target architectures. Representations in code are then mechanically generated by a tool that performs optimizations for a given architecture by applying high-level transformations much like a human expert would. We discuss how this has been used to mechanically generate tens of thousands of different distributed memory implementations given a single sequential algorithm. By attaching cost functions to the component operations, a highly optimized implementation is chosen by the tool that invariably matches or exceeds the performance of implementations by human experts. We call this approach Design by Transformation (DxT).

The Hub Labeling Algorithm
This is a survey of Hub Labeling results for general and road
networks. Given a weighted graph, a distance oracle takes as an input a
pair of vertices and returns the distance between them. The labeling
approach to distance oracle design is to precompute a label for every
vertex so that distances can be computed from the corresponding
labels. This approach has been introduced by [Gavoille et al. '01],
who also introduced the Hub Labeling algorithm (HL). HL has been
further studied by [Cohen et al. '02].We study HL in the context of
graphs with small highway dimension (e.g., road networks). We show
that under this assumption HL labels are small and the queries are
sublinear. We also give an approximation algorithm for computing small
HL labels that uses the fact that shortest path set systems have small
VC-dimension.Although polynomial-time, precomputation given by theory
is too slow for continental-size road networks. However, heuristics
guided by the theory are fast, and compute very small labels. This
leads to the fastest currently known practical distance oracles for
road networks.The simplicity of HL queries allows their implementation
inside of a relational database (e.g., in SQL), and query efficiency
assures real-time response. Furthermore, including HL data in the
database allows efficient implementation of more sophisticated
location-based queries. These queries can be combined with traditional
SQL queries. This approach brings the power of location-based services
to SQL programmers, and benefits from external memory implementation
and query optimization provided by the underlying database.Joint work
with Ittai Abraham, Daniel Delling, Amos Fiat, and Renato Werneck.

DIMACS Implementation Challenges: Past, Present, and Future
This talk is being presented at the workshop for the 10th DIMACS Implementation Challenge. In it, I will describe the history and impact of the Implementation Challenge series, and also talk about directions for the future.

Model Order Reduction for Time-Critical Applications in Aeronautical Engineering
In many applications, high-fidelity time-dependent numerical simulations remain so computationally intensive that they cannot be used as often as needed, or are more often used in special circumstances than routinely. This is the case, for example, for turbulent CFD computations at high Reynolds numbers. Consequently, in many engineering fields including aeronautics, the impact of computational sciences on time-critical operations such as design, design optimization, control, and flight test has not yet fully materialized. For these operations, reduced order models that can faithfully reproduce the essential features of the larger computational models at a fraction of their computational cost offer a promising alternative. This presentation will cover recent advances in this field. It will also demonstrate real-time performance on mobile devices of a CFD-based computational methodology for assisting flutter flight-testing and the aerodynamic design of a Formula 1 car.

Context in image and video understanding
In spite of the significant effort that has been devoted to the core problems of object and action recognition in images and videos, the recognition performance of state of the art algorithms is well below what would be required for any successful deployment in robotic applications. Additionally, there are challenging combinatorial problems associated with constructing globally “optimal” descriptions of images and videos in terms of potentially very large collections of object and action models. The constraints that are utilized in these optimization procedures are loosely referred to as “context.” So, for example, vehicles are generally supported by the ground, so that an estimate of ground plane location parameters in an image constrains positions and apparent sizes of vehicles. Another source of context are the everyday spatial and temporal relationships between objects and actions; so, for example, keyboards are typically “on” tables and not “on” cats. The first part of the talk will discuss how visually grounded models of object appearance and relations between objects can be simultaneously learned from weakly labeled images (images which are linguistically but not spatially annotated – i.e., we are told there is a car in the image, but not where the car is located). Next, I will discuss how these models can be more efficiently learned using active learning methods. Once these models are acquired, one approach to inferring what objects appear in a new image is to segment the image into pieces, construct a graph based on the regions in the segmentation and the relationships modeled, and then apply belief propagation to the graph. However, this typically results in a very dense graph with many “noisy” edges, leading to inefficient and inaccurate inference. I will briefly describe a learning approach that can construct smaller and more informative graphs for inference. Finally, I will relax the (unreasonable) assumption that one can segment an image into regions that correspond to objects, and describe an approach that can simultaneously construct instances of objects out of collections of connected segments that look like objects, while also softly enforcing contextual constraints.

Dr. RK Shyamasundar
Tata Institute of Fundamental Research, Mumbai, India
23 September 2011

Analysis, Synchronization and Scheduling Challenges in X10
X10 is a modern object-oriented programming language in the family of Partially Global Address Space languages that is under design at IBM Research. The fundamental goal of X10 is to enable scalable, high-performance, high productivity transformational programming for high-end computers for traditional numerical computation workloads as well as commercial server workloads. X10 introduces a flexible treatment of concurrency, distribution and locality, within an integrated type system. X10 introduces places as an abstraction for a computational context with a locally synchronous view of shared memory. An X10 computation runs over a large collection of places. Each place hosts some data and runs one or more activities. Activities are extremely lightweight threads of execution. An activity may synchronously (and atomically) use one or more memory locations in the place in which it resides, leveraging current symmetric multi-processor (SMP) technology. It must spawn activities asynchronously to access or update memory at other places. X10 provides weaker ordering guarantees for inter-place data access, enabling applications to scale. One or more clocks may be used to order activities running in multiple places. (Multi-dimensional) Arrays may be distributed across multiple places. Arrays support parallel collective operations. A novel exception flow model ensures that exceptions thrown by asynchronous activities can be caught at a suitable parent activity.
In this talk, first, we shall provide a overview of X10 and discuss some of the challenges and initial works towards may-happen-parallelism, generalization of work stealing of Cilk to X10, guaranteed deadlock-free execution of X10 programs with bounded resource, algorithm for affinity driven distributed scheduling of multi-place 2 parallel computations, relationship of clocks to barrier synchronization, the power of clocks etc.

X-caliber and Sandia's Exascale Grand Challenge (XGC): A Data Movement View of the Requirements of Exascale Supercomputing
Since the creation of the modern computer era in the late 1930s, system performance has been dominated by memory. In modern supercomputers, the problem is exacerbated by the long latency of remote memory accesses through complex interconnection networks, and as we enter the exascale era system performance is increasingly dominated by energy constraints. Estimates for "scaling-out" today's MPP-based supercomputers the three orders of magnitude necessary to reach exascale show that power requirements will grow from 10 MW (often less) to 125+ MW. This is a significant shift in the operational and deployment possibilities of machines, as energy will rapidly exceed the purchase price of the hardware, and foretells of the same problem with large-scale commercial datacenters.
This talk will discuss the potential of two critical technologies to alleviate the data movement restrictions of current systems: 3D Integrated Memory Systems, and Silicon Photonics-based interconnects. It will focus on the potential to rearchitect the system in light of these enabling technologies, and demonstrate that doing so will create a more capable system than one that focuses on FLOPS. In fact, several "data movement starved" proposals to build special purpose supercomputers to address the LINPACK benchmark exist, but would prove poor candidates for large-scale sparse applications typical of complex physics and informatics codes.
The results and systems in this talk will focus on ongoing work at Sandia in the X-caliber and XGC projects.

Statistical Inference Protein Structure Function
The study of the structure and function of proteins raises many problems that offer challenges and opportunities for computational and statistical research. I will overview my experiences in several such problem domains, ranging from those in which off-the-shelf ideas can be fruitfully applied to others that require new thinking. These domains include: (1) the identification of active sites in enzymes; (2) the modeling of protein backbone configurations; (3) the prediction of molecular function based on phylogeny; and (4) joint inference of alignment and phylogeny.

Spatial Stochastic Simulation of Polarization in Yeast Mating
In microscopic systems formed by living cells, the small numbers of some reactant molecules can result in dynamical behavior that is discrete and stochastic rather than continuous and deterministic. Spatio-temporal gradients and patterns play an important role in many of these systems. In this lecture we report on recent progress in the development of computational methods and software for spatial stochastic simulation. Then we describe a spatial stochastic model of polarisome formation in mating yeast. The new model is built on simple mechanistic components, but is able to achieve a highly polarized phenotype with a relatively shallow input gradient, and to track movement in the gradient. The spatial stochastic simulations are able to reproduce experimental observations to an extent that is not possible with deterministic simulation.

Mining Billion-node Graphs What do graphs look like? How do they evolve over time? How to handle a graph with a billion nodes? We present a comprehensive list of static and temporal laws, and some recent observations on real graphs (like, e.g., "eigenSpokes''). For generators, we describe some recent ones, which naturally match all of the known properties of real graphs. Finally, for tools, we present "oddBall'' for iscovering anomalies and patterns, as well as an overview of the PEGASUS system which is designed for handling Billion-node graphs,
running on top of the "hadoop'' system.

Multicore-oblivious algorithms This talk addresses the design of efficient algorithms for multicores
that are oblivious to machine parameters, and thus are portable across
machines with different multicore configurations. We consider HM, a
multicore model consisting of a parallel shared memory machine with
hierarchical multi-level caching, and we introduce a multicore
oblivious (MO) approach to algorithms and schedulers for HM. An MO
algorithm is specified with no mention of any machine parameters, such
as the number of cores, number of cache levels, cache sizes and block
lengths. However, it is equipped with a small set of instructions that
can be used to provide hints to the run-time scheduler on how to
schedule parallel tasks. We present efficient MO algorithms for
several fundamental problems including matrix transposition, FFT,
sorting, the Gaussian Elimination Paradigm, list ranking, and
connected components. The notion of an MO algorithm is complementary
to that of a network-oblivious (NO) algorithm, recently introduced by
Bilardi et al. for parallel distributed-memory machines where
processors communicate point-to-point. Indeed several of our MO
algorithms translate into efficient NO algorithms, adding to the body
of known efficient NO algorithms.
Towards the end of this talk I will give a brief overview of some of
my recent work related to computational sciences. First I will talk
about "Pochoir" (pronounced "PO-shwar") - a stencil computation
compiler for multicores developed at MIT CSAIL. Stencils have numerous
applications in computational sciences including geophysics, fluid
dynamics, finance, and computational biology. Next I will talk about
"F2Dock" - a state-of-the-art rigid-body protein-protein docking
software developed at UT Austin in collaboration with the SCRIPPS
Research Institute.

Parallel Computing -- A Theoretical Perspective The speeds of microprocessors are not increasing anymore. Yet the transistor sizes keep shrinking exponentially according to Moore's Law. This resulted in the paradigm shift from sequential to parallel computing: multi-cores processors -- processors with multiple CPUs -- have become a norm, rather than exceptions. Yet, we do not have good theoretical foundation for modern parallel systems and a clear understanding of what will make algorithms efficient on these systems. In this presentation I will talk about the current landscape of modeling parallelism in modern multi-core architectures from the algorithmic perspective. I will also touch on emerging computational frameworks, alternative to multi-cores, such as GPUs and Google's MapReduce and how they are changing the landscape of parallel and distributed computing.

Software Challenge for Extreme Scale Systems The computer industry is at a major inflection point in its hardware roadmap due to the end of a decades-long trend of exponentially increasing clock frequencies. Computer systems anticipated in the 2015 -- 2020 timeframe are referred to as Extreme Scale because they will be built using homogeneous and heterogeneous many-core processors with 100's of cores per chip. These systems pose new critical challenges for software in the areas of concurrency, energy efficiency and resiliency. Unlike previous generations of hardware evolution, this shift towards extreme scale will have a profound impact on software. The software challenges are further compounded by the need to enable parallelism in workloads and application domains that have traditionally not had to worry about multiprocessor parallelism in the past. Domain-specific computing offers new opportunities for advances in both performance and productivity for extreme scale systems. In this talk, we will focus on the programming problem for tightly coupled homogeneous and heterogeneous multicore processors, which provide the foundation for extreme scale systems. We identify key execution model primitives that we believe will be necessary for successful programming models, and discuss the portability of these primitives based on their ability to be mapped to a wide range of homogeneous and heterogeneous hardware. We also identify limitations in current programming models that are mismatched to future systems where performance has to be driven by parallelism and constrained by energy. We present early experiences with these execution model primitives in the Habanero Multicore Software Research project which targets a wide range of homogeneous and heterogeneous manycore processors, and the NSF Expeditions Center for Domain-Specific Computing which targets customizable heterogeneous systems with an initial focus on the medical imaging domain. Both projects takes a two-level approach to programming models, with a higher-level macro-dataflow model based on Intel Concurrent Collections (CnC) for parallelism-oblivious domain experts, and a lower-level task-parallel model based on the Habanero-Java and Habanero-C languages for parallelism-aware developers. We discuss language, compiler and runtime implementation challenges that must be overcome to efficiently support these execution model primitives on
future extreme scale processors.

PHAST: Hardware-Accelerated Shortest Path Trees We present a novel algorithm to solve the nonnegative single-source shortest path problem on road networks and other graphs with low highway dimension. After a quick preprocessing phase, we can compute all distances from a given source in the graph with essentially a linear sweep over all vertices. Because this sweep is independent of the source, we are able to reorder vertices in advance to exploit locality. Moreover, our algorithm takes advantage of features of modern CPU architectures, such as SSE and multi-core. Compared to Dijkstra's algorithm, our method makes fewer operations, has better locality, and is better able to exploit parallelism at multi-core and instruction levels. We gain additional speedup when implementing our algorithm on a GPU, where our algorithm is up to three orders of magnitude faster than Dijkstra's algorithm on a high-end CPU. This makes applications based on all-pairs shortest-paths practical for continental-sized road networks. The applications include, for example, computing graph diameter, exact arc flags, and centrality measures such as exact reaches or betweenness.
Joint work with Andrew Goldberg, Andreas Nowatzyk, and Renato Werneck .

Prof. David Keyes
King Abdullah University of Science and Technology
11 February 2011

The Exascale: Why and How Sustained floating-point computation rates on real applications, as tracked by the ACM Gordon Bell Prize, increased by three orders of magnitude from 1988 (1 Gigaflop/s) to 1998 (1 Teraflop/s), and by
another three orders of magnitude to 2008 (1 Petaflop/s). Computer engineering provided only a couple of orders of magnitude of improvement for individual cores over that period; the remaining factor
came from concurrency, which is approaching one million-fold. Algorithmic improvements contributed meanwhile to making each flop more
valuable scientifically.
As the semiconductor industry now slips relative to its own roadmap for silicon-based logic and memory, concurrency, especially on-chip many-core concurrency and GPGPU SIMD-type concurrency, will play an
increasing role in the next few orders of magnitude, to arrive at the ambitious target of1 Exaflop/s, extrapolated for 2018. An important question is whether today¹s best algorithms are efficiently hosted on such hardware and how
much co-design of algorithms and architecture will be required.
From the applications perspective, we illustrate eight reasons why today's computational scientists have an insatiable appetite for such performance: resolution, fidelity, dimension, artificial boundaries, parameter
inversion, optimal control, uncertainty quantification, and the statistics of ensembles.
The paths to the exascale summit are debated, but all are narrow and treacherous, constrained by fundamental laws of physics, cost, power consumption, programmability, and reliability. Drawing on recent
reports, workshops, vendor projections, and experiences with scientific codes on contemporary platforms, we propose roles for today¹s
researchers in one of the great global scientific quests of the next decade.

AM++: A Generalized Active Message Framework Active messages have been shown to be a good model to express irregular applications for distributed-memory systems. However, current active messaging frameworks are either too inflexible or too inefficient for use by graph algorithms. We have developed a new framework, AM++, for "generalized active messages" that provides both efficiency and flexibility. It also provides features such as flexible, user-configurable message coalescing and duplicate message elimination without users needing to entangle those behaviors into their applications. Implementation and benchmarking of graph algorithms using AM++ shows the usability benefits of the new features. I will also briefly discuss my work on the Graph 500 benchmark: the graph generators, MPI reference implementations, and the Argonne entry to the first Graph 500 list.

X10: Productivity and Programming at Scale The X10 project was started in January 2004 to develop a productive programming model for multi-core and scale-out computing. We review the research results of the last seven years. These include the development of the APGAS programming model founded on places and asynchrony (articulated in print in AMP 2010, see also OOPSLA Onwards 2005), a framework for weak memory models (PPoPP 2007), a formal semantics for the core APGAS constructs, including a proof of deadlock freedom (Concur 2005), a framework for constraint-based types for object-oriented languages (OOPSLA 2007), type inference for place-types (PPoPP 2008), adaptive work-stealing for graphs (ICPP 2008), compilation of asynchronous multi-place programs into SPMD programs (PPoPP 2009), distributed graph algorithms in PGAS languages (Supercomputing 2010), and a framework for global load-balancing (PPoPP 2011, forthcoming).
The main outcome of the project, however, is the X10 language system, available for download at http://x10-lang.org. The system supports a
C++ code generator that produces code that runs on a variety of
architectures (Power SMPs, multi-core x86s, clusters, Ultrasparc machines, NVidia GPUs), operating systems (Linux, AIX, MacOS, cygwin), transport mechanisms (shared memory, sockets, MPI, LAPI). Code written in X10 can achieve performance comparable to (or better than) MPI for many applications, while providing the productivity benefits of a modern, type-safe, object-oriented, functional language with strong static guarantees.
This work has been done in collaboration with several researchers including Ras Bodik, Satish Chandra, Guojing Cong, David Cunningham, Kemal Ebcioglu, Radha Jagadeesan, David Grove, Doug Lea, Maged Michael, Nathaniel Nystrom, Jens Palsberg, Igor Peshansky, Christoph von Praun, Vivek Sarkar, Sayantan Sur, and Olivier Tardieu.

Novel Applications of Graph Embedding Techniques Force-directed graph embedding algorithms, like the Fruchterman-Reingold method, are typically used to generate aesthetically pleasing graph layouts. At a fundamental level, these algorithms are based on manipulating the structural properties of the graph to match them to certain spatial requirements. This relation between structural and spatial properties is also present in other areas beyond graph visualization. In this talk, I will discuss how graph embedding can be used in diverse areas such as (i) improving the accuracy of unsupervised clustering, (ii) creating good quality elements in unstructured meshes and (iii) identifying perturbations in large-scale networks.

Dr. M.M. Rao Kunda
Indian Space Research Organization, India
24 September 2010

Advances in Image Processing, HealthCare and Challenges Digital image processing is playing an important role in all fields of Science and Technology today. Pictures/picture signals are digitized with 8-bit to 16-bit digitizers to accommodate vide dynamic range of picture signals. Pictures are computer processed for image enhancement, geometric and radiometric errors, etc., Registration of temporal images are carried out for checking the changes in the images with time, of a patient in case of health care, remote sensing images in case of crop monitoring, etc., Image processing techniques so far developed for processing satellite images ,images from space probes are being adopted for healthcare applications.
Imaging techniques are constantly developed in medicine using different modality devices such as MRI, CT, Ultra-sound and PET. 2D images are used to analyze the diseases, while some of these images are combined to generate 3D volumes of human body for calculating volume of affected areas.
Various hyper spectral imaging techniques, which map wavelength-resolved reflectivity across a two dimensional scene offer considerable promise as a new clinical tool for ophthalmology. Due to the delicate nature of the eye, invasive biopsy or mechanical access to the retina are not possible and relies strongly upon optical imaging methods for diagnosis of diseases connected to Carnia, retina and optic disc.
Lot of technologies such as image processing, hyper spectral imaging, networking, etc., developed for various applications, if adopted with suitable modifications for health care, will benefit millions of people for getting healthcare at low cost.
This talk covers Satellite Image processing, Remote sensing sensors and use of image processing for Healthcare.

Computational Science Impact - From Solar Cells to Exascale Computing Beginning with a brief overview of a boundary integral problem that led from an intractable computational problem to model an efficient solar cell but through a novel mathematical formulation became computationally tractable, I will lay the foundation of the importance of mathematics in computation. This problem led to my career in Computational Science and High Performance Computing (HPC). Today, as we move from the Petascale era to Exascale era, HPC is a tool frequently used to understand complex problems in numerous areas such as aerospace, biology, climate modeling and energy. Scientists and engineers working on problems in these and other areas demand ever increasing compute power for their problems. In order to satisfy the demand for increase performance to achieve breakthrough science and engineering, we turn to parallelism through large systems with multi-core chips. For these systems to be useful massive parallelism at the chip level is not sufficient. I will describe some of the challenges that will need to be considered in designing Petascale and eventually Exascale systems. Through the combination of HPC hardware coupled with novel mathematical and algorithmic approaches, such as those described in the initial problem of this talk, some efforts toward breakthroughs in science and engineering are described. While progress is being made, there remain many challenges for the computational science and engineering community to apply ultra-scale, multi-core systems to “Big” science problems with impact on society. New radical algorithmic approaches are needed that emphasize minimization of data transmission instead of floating point performance. In conclusion, some discussion not only on the most obvious way to use ultra-scale, multi-core HPC systems will be given but also some thoughts on how one might use such systems coupled with new computational science techniques to tackle previously intractable problems.

Positive and Negative Relationships in On-Line Social Networks The social networks in many on-line settings encode a mixture of positive (friendly) and negative (antagonistic) relationships, but the bulk of research on these networks to date has focused almost exclusively on the positive interpretations of the links. We discuss how the interplay between positive and negative relationships affects the overall functioning of on-line social networks, and we connect our analysis to theories of signed networks from social psychology. In addition to providing a perspective for reasoning about the underlying networks and applications, this analysis provides one of the first large-scale evaluations of these theories using on-line datasets. We find that classical theories capture certain of the underlying effects, but that they are also at odds with some of the fundamental phenomena we observe--particularly related to the evolving, directed nature of the networks.

X10 - Concurrent Programming for Modern Architectures Two major trends are converging to reshape the landscape of concurrent object oriented programming languages. First, trends in modern architectures (multicore, accelerators, high performance clusters such as Blue Gene) are making concurrency and distribution inescapable for large classes of Object-oriented (OO) programmers. Second, experience with first-generation concurrent OO languages (e.g. Java threads and synchronization) have revealed several drawbacks of unstructured threads with lock-based synchronization.
X10 is a second generation OO language designed to address both programmer productivity and parallel performance for modern architectures. It extends the sequential Java programming lbbnguage with a handful of constructs for concurrency and distribution: clustered address space (with global data-structures) to deal with distribution, lightweight asynchrony to deal with massive parallelism, recursive fork-join parallelism for structured concurrency, termination detection for sequencing and atomic blocks for mutual exclusion.
Additionally, it introduces a rich framework for constraint-based dependent types (and annotations) in OO languages. Specifically, the framework is useful for statically specifying the shape of various data-structures (e.g. multidimensional arrays) and the location of objects and distribution of arrays in the clustered address space.
X10 is being developed jointly as part of the DARPA-IBM high performance computing project, PERCS, under the Eclipse open-source license. A first implementation of the language is available at x10-lang.org. Submissions based on X10 have won HPCC Type II awards at SuperComputing for the last three years.

Spiral: Program Generation for Linear Transforms and Beyond Spiral is a program and hardware design generation system for linear transforms such as the discrete Fourier transform, discrete cosine transforms, filters, and others. We are currently extending Spiral beyond its original problem domain, using coding algorithms (Viterbi decoding and JPEG 2000 encoding) and image formation synthetic aperture radar, SAR) as examples. For a user-selected problem specification, Spiral autonomously generates different algorithms, represented in a declarative form as mathematical formulas, and their implementations to find the best match to the given target platform. Besides the search, Spiral performs deterministic optimizations on the formula level, effectively restructuringthe code in ways unpractical at the code or design level. Spiral generates
specialized single-size implementations or adaptive general-size autotuning libraries, and utilizes special instructions and multiple processor cores.
The implementation generated by Spiral rival the performance of expertly hand-tuned libraries.In this talk, we give a short overview on Spiral. We explain then howSpiral generates efficient programs for parallel platforms
including vector architectures, shared and distributed memory platforms, and GPUs; as well as hardware designs (Verilog) and automatically partitioned software/hardware implementations. We overview how Spiral targets the Cell BE and PowerXCell 8i, the BlueGene/P PPC450d processors, as well as Intel's upcoming Larrabee GPU and AVX vector instruction set. As all optimizations in Spiral, parallelization and partitioning are performed on a high abstraction level of algorithm representation, using rewriting systems.

Prof. John Gilbert
University of California, Santa Barbara
12 March 2010

Challenges in Combinatorial Scientific Computing Computation on large combinatorial structures -- graphs, strings, partial orders, etc. -- has become fundamental in many areas of data analysis and scientific modeling. The field of high-performance combinatorial computing, however, is in its infancy. By way of contrast, in numerical supercomputing we possess standard algorithmic primitives, high-performance software libraries, powerful rapid-prototyping tools, and a deep understanding of effective mappings of problems to high-performance computer architectures.
This talk will describe a number of challenges for the field of combinatorial scientific computing, in algorithms, tools,
architectures, and mathematics. I will draw examples from various applications, and I will highlight our group's work on algebraic primitives for high-performance computation on large graphs and networks.

Accurate Inference of Phylogenetic Relationships from Multi-locus Data Accurate inference of phylogenetic relationships of species, and understanding their
relationships with gene trees are two central themes in molecular and evolutionary
biology. Traditionally, a species tree is inferred by (1) sequencing a genomic region of
interest from the group of species under study, (2) reconstructing its evolutionary history,
and (3) declaring it to be the estimate of the species tree. However, recent analyses of
increasingly available multi-locus data from various groups of organisms have
demonstrated that different genomic regions may have evolutionary histories (called
“gene trees”) that may disagree with each other, as well as with that of the species. This
observation has called into question the suitability of the traditional approach to species
tree inference. Further, when some, or all, of these disagreements are caused by reticulate
evolutionary events, such as hybridization, then the phylogenetic relationship of the
species is more appropriately modeled by a phylogenetic network than a tree. As a result,
a new, post-genomic paradigm has emerged, in which multiple genomic regions are
analyzed simultaneously, and their evolutionary histories are reconciled in order to infer
the evolutionary history of the species, which may not necessarily be treelike.
In this talk, I will describe our recent work on developing mathematical criteria and
algorithmic techniques for analyzing incongruence among gene trees, and inferring
phylogenetic relationships among species despite such incongruence. This includes work
on lineage sorting, reticulate evolution, as well as simultaneous treatment of both. If time
permits, I will describe our recent work on population genomic analysis of bacterial data,
and the implications on the evolutionary forces shaping the genomic diversity in these
populations.

Load-Balanced Bonded Force Calculations on Anton Molecular dynamics (MD) simulations of biological molecules involve the calculation of "bonded" force terms due to covalent bonds. On parallel computers, this calculation is normally not load balanced because the data partitioning required for load balance is in conflict with the partitioning needed for the much more expensive "nonbonded" (electrostatic and van der Waals) force calculation. On Anton, a specialized parallel machine for MD calculations, the situation has changed: the nonbonded component of the calculation has been dramatically accelerated and the remaining bonded component can often be a non-negligible determinant of overall performance. This talk describes the challenges in load balancing the bonded force calculations on Anton. In addition to the usual considerations of balancing load and minimizing communication across processors, we consider balancing the storage required per node, to allow larger chemical systems to be simulated. This interesting combinatorial problem arises because many bond terms share the same parameter data and this data does not need to be duplicated within a node. We also consider the hierarchical problem of partitioning the data among Anton nodes and its relation to partitioning the data within each node for computation by multiple cores. Additional resource limitations at this level also lead to interesting (and messy!) combinatorial problems for load balance.

Parallel Data Warehousing/OLAP and Parallel Bioinformatics This presentation discusses applications of parallel algorithms design in parallel data warehousing/OLAP and parallel bioinformatics:
(1) The building of a parallel data cube protype for parallel data warehousing and OLAP.
(2) A global investigation of protein-protein interactions in yeast Saccharomyces cerevisiae using re-occurring short polypeptide sequences. We performed the first all-to-all sequence-based computational screen of PPIs in yeast, in which we identify 29,589 high confidence interactions out of approximately 2 x 10^{7} possible pairs. Of these, 14,438 PPIs have not been previously reported.
(3) A new computational approach towards the design of RNA pools for in vitro selection of complex aptamers.

Discovery of Mechanisms from Mathematical Modeling of DNA Microarray Data: Computational Prediction and Experimental Verification Future discovery and control in biology and medicine will come from the mathematical modeling of large-scale molecular biological data,such as DNA microarray data, just as Kepler discovered the laws of planetary motion by using mathematics to describe trends in astronomical data. In this talk, I will demonstrate that mathematical modeling of DNA microarray data can be used to correctly predict previously unknown mechanisms that govern the activities of DNA and RNA. First, I will describe the computational prediction of a mechanism of regulation, by developing generalizations of the matrix and tensor computations that underlie theoretical physics and using them to uncover a genome-wide pattern of correlation between DNA replication initiation and RNA expression during the cell cycle. Second, I will describe the recent experimental verification of this computational prediction, by analyzing global expression in synchronized cultures of yeast under conditions that prevent DNA replication initiation without delaying cell cycle progression. Third, I will describe the use of the singular value decomposition to uncover "asymmetric Hermite functions," a generalization of the eigenfunctions of the quantum harmonic oscillator, in genome-wide mRNA lengths distribution data. These patterns might be explained by a previously undiscovered asymmetry in RNA gel electrophoresis band broadening and hint at two competing evolutionary forces that determine the lengths of gene transcripts. Finally, I will describe ongoing work in the development of tensor algebra algorithms (as well as viusal correlation tools), the integrative and comparative modeling of DNA microarray data (as well as rRNA sequence data), and the discovery of mechanisms that regulate cell division, cancer and evolution.

Parallel Thinking
With the advent of manycore computers on every desktop and the recent access by just about anyone to massively parallel data centers, parallelism is becoming pervasive. This trend is likely to eventually lead to a scenario in which parallel programming will
become predominant and sequential programming will be a special case. Are we ready for this change? Short-term solutions based just on add-ons to sequential languages are unlikely to be sufficient in the long term. Instead the change will likely require a more fundamental rethinking that permeates the programmers' methodologies from early stages of algorithm and system design. This will require developing a form of ``parallel thinking''.
Perhaps the biggest barrier to the widespread effective use of parallelism is educating people on how to think parallel. Many if not most computer science classes, however, remain case studies in how to push students into thinking sequentially. This talk will address how parallelism could be taught right from the start, and if presented at the right level of abstraction could be no harder than teaching sequential programming.

Blocked Plane Rotations for Band Reduction and Sparse SVD
With the success of Basic Linear Algebra Subroutines (BLAS) in using the memory efficiently, the algorithms with vector operations (BLAS2) have given way to algorithms with matrix operations (BLAS3). In some cases, BLAS3 based algorithms are successful even with the cost of doing additional floating point operations and using additional memory. In this talk, I will talk about two problems where algorithms with vector operations when combined with blocking can perform better than BLAS3 based algorithms.
Band reduction methods are mainly used in computing the eigen value decomposition and singular value decomposition of band matrices. In the first part of this talk, I will outline a blocking scheme for plane rotations. The blocked plane rotations when coupled with a pipelining scheme leads to fewer floating point operations and memory usage than the BLAS3 based band reduction methods. The blocked method is also able to extract the same performance benefits from the cache as the BLAS3 based methods leading to a faster band reduction method. I will also show how we can exploit the zeros while finding the eigen and singular vectors.
In the second part of the talk, I will introduce a method for computing the bidiagonalization of a sparse upper triangular matrix R. In this method, we exploit the sparsity of R and use plane rotations to reduce it to the bidiagonal form. We choose the rotations to minimize the fill generated in R itself. I will show how to extend this method to use dynamic blocking and the pipelining scheme to arrive at an efficient R-bidiagonalization method for computing the sparse SVD.

Dr. Bill Appelbe
Victorian Partnership For Advanced Computing (VPAC)
18 August 2009

The Australian National Research Data Grid
Traditionally, most research groups and institutions are responsible for maintaining and archiving their research data - from raw data to published results. Linking of research data is ad-hoc and based on community consensus, and reproducing published results is often difficult - requiring personal effort and collaboration with the author(s). However, there is increasing demand from researchers and industry to be able to link research data sets in disciplines such as Medicine, Bioinformatics, Engineering.
Recently the Australian Government announced a $100M+ ambitious 5 year program to establish a national data-grid for archiving the countries research data (from all Universities and research institutes), through a network of data centres operated by a single organization, ARCS (which VPAC is the Lead Agent for). This effort builds on earlier successful efforts by VPAC and others to build discipline specific data grids, such as Biogrid (for clinical research data). While the national data grid is intended to be an operational facility, it introduces a wide range of open research and implementation problems - ranging from security, a single national authentication and authorization framework (AAF), privacy and consent, discoverability, reproducibility of research, and meta-data schema.
This talk will outline the emerging architecture and technology of the national data grid (building on technologies such as iRods and Shibboleth), and lessons learned from successful predecessors such as BioGrid. The talk will also outline the cultural and governmental factors that make such initiatives possible in Australia and relate to the challenges and opportunities in the USA.

Automatic Performance Analysis for Terascale Systems Performance analysis is an important step in the development of programs for high-end parallel systems. Several tools are available that are based on profiling or tracing. Trace-based tools provide the programmer with detailed information that can then be analyzed manually with the help of powerful visualization tools.
In contrast to those performance analysis systems, Periscope is based on profile data and determines performance bottlenecks automatically. Periscope was developed at Technische Universität München and is currently productized in two recently started projects. It is based on a formal specification of performance properties. The specified properties are searched by a set of analysis agents during the execution of the parallel application. Depending on the number of additional processors allocated for the performance analysis, Periscope can scale to very large application runs.
This presentation will outline the general concepts of performance analysis tools and will then introduce the techniques used in Periscope. It will demonstrate different search strategies for analyzing the single node performance of applications on the Itanium processor in the Altix 4700 supercomputer at Leibniz Computing Center.

Petascale and Multicore programming Models: What is needed The almost simultaneous emergence of multicore chips and petascale computers presents multidimensional challenges and opportunities for parallel programming. What kind of programming models will prevail? What are some of the required and desired characteristics of such model/s? I will attempt to answer these questions. My answers are based in part on my experience with several applications ranging from quantum chemistry, biomolecular simulations, simulation of solid propellant rockets, and computational astronomy.
First, the models need to be independent of number of processors, allowing programmers to over-decompose the computation into logical pieces. Such models, including 15 year old Charm++, enable intelligent runtime optimizations. More importantly, they promote compositionality.
Second, building on this compositionality, one needs a collection of parallel programming languages/models, each incomplete by itself, but capable of inter-operating with each other. Third, many parallel applications can be "covered" by simple, deterministic, mini-languages which lead to programs that are easy to reason about and debug. These should be used in conjunction with more complex but complete languages.
Also, domain specific frameworks and libraries, which encapsulate expertise and facilitate reuse of commonly needed functionality, should be developed and used whenever feasible. I will illustrate these answers with examples drawn from our CSE applications, and from some relatively new programming notations we have been developing.

Non-Parametric Modeling of Partially Ranked Data A growing number of modeling applications involve partially ranked, rather than numeric, data. Examples include voting data, psychological
studies, product recommendations in online marketing, and website and ad placement in search engines. Learning models on partial rankings of n
items are often of limited practical use for large n due to computational considerations. We explore several non-parametric and
conditional models for partially ranked data and derive computationally efficient procedures for large n. The derivations are largely possible
through the combinatorics of the lattice of partial rankings and the algebraic structure of the symmetric group. We demonstrate the
effectiveness of the proposed framework with a bias-variance analysis and a large scale experimental study involving voting data, product recommendations, and web search.

The Opportunities and Challenges of Multi-Scale Computational Science and Engineering — A Pragmatist’s View The next generation of computers will offer society unprecedented opportunities to solve problems of strategic importance in basic and applied science and engineering. Within ten years we will have access not only to multi-petaflop supercomputers, but also to multi-teraflop desktops and small clusters. Multi-scale applications present some of the greatest opportunities as well as some of the greatest challenges.
Developing multi-scale applications to exploit this exponential growth in computing power is the key challenge. Generally each application area has unique requirements. Developing each major application has taken large teams (10 to 30 staff) many years (10 or more). It’s more difficult today. Massive parallelization has resulted in highly complex computer platforms and has increased the challenge of developing applications. The challenges include integrating the effects of many complex, strongly interacting scientific phenomena across many orders of magnitude of time and distance scales; managing and coordinating multi-institutional and multi-disciplinary project teams; organizing the code development process while maintaining adequate flexibility and agility; meeting the requirements of the laws of nature and groups of production users; ensuring adequate verification and validation; and developing and using software development tools for these complex platforms. Unfortunately, while there is much support for developing more powerful computers, but little (or none) for addressing the code development challenges.
Multi-scale problems typically have distance (and time) scales that have a dynamic range of a factor of 105 or more. For problems where increased spatial resolution is desirable, adaptive mesh refinement is often used. To obtain increased stability with large time steps, implicit techniques are often useful. However, implicit techniques generally require communication across the whole mesh each time step, which is undesirable for massively parallel computers with limited bandwidth and high memory latency. Another approach is to express the small scale physics with an approximate but rapid treatment that captures the main features of interest of the small scale phenomena. Other approaches include identifying the key physics and “averaging” out the inessential features. Many successful approaches capture the “emergent” features of the small scale effects (i.e., use of hydrodynamics instead of molecular dynamics for calculating water flow in a river). Many multi-scale problems are NP-hard, meaning that the problem complexity is such that a brute force technique requires a level of computer power that grows exponentially with the size of the problem. In those cases, the use of emergent principles and focusing the calculation on the key element of interest is essential. There are few, if any, general methods for solving multi-scale problems when increased resolution doesn’t work. Most successful techniques build on a deep understanding of the physics of the problem to capture the essential features in an economical algorithm.

Analysis of Large-Scale Gene Expression Microarray Compendia Over the past decade, gene expression microarray data has become an important tool for biologists to understand molecular processes and mechanisms on the whole-genome scale. Microarray data provides a window into the inner workings of the transcriptional process that is vital for cellular maintenance, development, biological regulation, and disease progression. While a rapidly increasing amount of microarray data is being generated for a wide variety of organisms, there is a severe lack of methods designed to utilize the vast amount of data currently available. In my work, I explore several techniques to meaningfully harness large-scale collections of microarray data both to provide biologists with a greater ability to explore data repositories, and to computationally utilize these repositories to discover novel biology.
First, effective search and analysis techniques are required to guide researchers and enable their effective use of large-scale compendia. I will present a user-driven similarity search algorithm designed to both quickly locate relevant datasets in a collection and to then identify novel players related to the user’s query. Second, I will describe novel methods that allow users to simultaneously visualize multiple datasets with the goal of providing a larger biological context within which to understand these data. Finally, I will discuss how we have used these approaches to discover novel biology, including successfully directing a large-scale experimental investigation of S. cerevisiae mitochondrial organization and biogenesis.

Depicting beyond Terascale Advanced computing and imaging technologies enable scientists to study natural phenomena at unprecedented precision, resulting in an explosive growth of data. The size of the collected information about the Internet and mobile device users is expected to be even greater, a daunting challenge we must address in order to make sense and maximize utilization of all the available information. Visualization transforms large quantities of, possibly multiple-dimensional, data into graphical representations that exploit the high-bandwidth channel of the human visual system, leveraging the brain's remarkable ability to detect patterns and draw inferences. It has thus become an indispensable tool in many areas of study involving large data. As we are moving from terascale to petascale computing and data analysis, are existing visualization solutions still usable? I will go over a few of the new visual-based strategies we have recently developed offering promise for addressing the upcoming petascale challenges.

Prof. Ulrich Rüde
Universitaet Erlangen-Nuernberg, Germany
11 March 2008

Parallel Simulation and Animation of Flows with the Lattice Boltzmann Method
The Lattice Boltzmann Method (LBM) is based on a
discretization of the Boltzmann equation and results in a cellular
automaton that is relatively simple, easy to extend, and to parallelize.
In the past few years, the LBM method has been established as an
alternative method in computational fluid dynamics. Here, we will
discuss an extension of the LBM to compute free surface flows. The basis
of our study is an approach similar to volume-of-fluids method. In
combination with appropriate boundary conditions, it is efficient and
fully mass conserving. Results for parallel versions of the algorithm
will be discussed together with extensions such as coupling to shallow
water simulations.

Dynamics of real networks: patterns and algorithms With the advent of the Web, large scale social and information networks containing detailed traces of human activity have become available. This
offers great opportunities to measure, model and predict actions of millions of people. For example, we had an opportunity to analyze a ``planetary scale'' Microsoft Instant Messenger network that contains 240 million people, with more than 1 billion conversations per day
(4.5TB of data), which makes it the largest social network analyzed to date.
In this talk I will focus on two aspects of the dynamics of large real-world networks: (a) dynamics of information diffusion and cascading
behavior in networks, and (b) dynamics of time evolving networks. First, I will consider network cascades that are created by the diffusion
process where behavior spreads from node to node like an epidemic. We study two related scenarios: information diffusion among blogs, and a
viral marketing setting of 16 million product recommendations among four million people. Motivated by our empirical observations we develop
algorithms for finding influential bloggers and detecting disease outbreaks. We exploit the ''submodularity'' principle to develop an
efficient algorithm that achieves near optimal solutions, while scaling to large problems and being 700 times faster than a simple greedy
algorithm. Second, in our recent work we found interesting and counter intuitive patterns, which change some of the basic assumptions about
fundamental structural properties of networks varying over time. Leveraging our observations we developed a Kronecker graph generator
model that explains processes governing network evolution. Moreover, we can fit the model to large networks, and then use it to generate
realistic graphs and give formal statements about their properties. Estimating the model naively takes O(N!N^2) while we develop a linear
time O(E) algorithm.

Making sense of genome-scale data High-throughput data production technologies are revolutionizing modern biology. Translating this experimental data into discoveries of
relevance to human health relies on sophisticated computational tools
that can handle large-scale data.
One area of rapid ongoing data generation is whole genome sequencing.
Comparisons between sequenced genomes can be a powerful tool to
understand functional genomic regions, by going beyond the primary
sequence to capture patterns in how functional regions evolve. Using
data generated by the ENCODE project we will demonstrate the power of
genome comparisons to distinguish cis-regulatory elements (critical for
the control of gene expression). We will then describe a machine
learning approach that goes beyond sequence conservation to capture
broader and more informative sequence and evolutionary patterns that
better distinguish different classes of elements. This approach has
proven successful for a variety of classification problems. In
particular, the "Regulatory Potential Score" has been used to identify
putative regulatory elements with high rates of experimental validation.
Sophisticated methods for the analysis of biological data are of little
value if they are not accessible. Powerful analysis tools, data
warehouses, and browsers exist, but for the average experimental
biologists with limited computer expertise, making effective use of
these resources is still out of reach. We have developed "Galaxy", which
solves this problem by providing an integrated web-based workspace that
bridges the gap between different tools and data sources. For
computational tool developers, Galaxy eliminates the repetitive effort
involved in creating high-quality user interfaces, while giving them the
benefit of being able to provide their tools in an integrated
environment. For experimental biologists, Galaxy allows running complex
analysis on huge datasets with nothing more than a web browser, and
without needing to worry about details like resource allocation, format
conversions, etc. Galaxy makes high-end computational biology more
accessible, efficient, and reproducible.

Scalable Implicit Methods for Magnetic Fusion Modeling Fusion energy holds the promise of a clean, sustainable and safe energy source for the future. While research in this field has been ongoing for the last half century, much work remains before it may prove a viable source of energy. In this talk, I discuss some of the scientific and engineering challenges remaining in fusion energy, and the role of applied mathematics and scientific computation in helping overcome these obstacles. I then introduce some of the mathematical models used in studying fusion stability and refueling, and how solutions to those models may be approximated. Of particular interest in such approximation techniques is the ability of the relevant numerical methods to scale up to resolutions necessary for accurately modeling the underlying physics. To that end, I discuss some of my recent work in the development of fully implicit solution approaches for magnetic fusion modeling, presenting both numerical results and theoretical analysis demonstrating the benefit of these approaches over traditional methods.

Prof. Yves Robert
Ecole Normale Supérieure de Lyon, France
8 February 2008

Algorithm design and scheduling techniques for clusters and grids In this talk we provide several examples to illustrate key algorithmic concepts required to efficiently execute applications on clusters and grids.
The idea is to give a lively exposition of the necessity to inject whatever static knowledge is available into the design
of typical applications, such as master-slave tasking, numerical kernels, and job workflows. We claim that this is the
key to an efficient deployment of these applications onto large-scale distributed computational platforms.
The talk will proceed through examples to explain how to cope with resource selection, memory constraints, platform heterogeneity, etc.

Graph Mining: Laws, Generators and Tools How do graphs look like? How do they evolve over time? How can we
generate realistic-looking graphs? We review some static and temporal
'laws', and we describe the "Kronecker" graph generator, which
naturally matches all of the known properties of real graphs.
Moreover, we present tools for discovering anomalies and patterns in
two types of graphs, static and time-evolving. For the former, we
present the 'CenterPiece' subgraphs (CePS), which expects q query
nodes (eg., suspicious people) and finds the node that is best
connected to all q of them (eg., the master mind of a criminal
group). We also show how to compute CenterPiece subgraphs
efficiently. For the time evolving graphs, we present tensor-based
methods, and apply them on real data, like the DBLP author-paper
dataset, where they are able to find natural research communities, and
track their evolution. Finally, we also briefly mention some results
on influence and virus propagation on real graphs.

What Can We Do With A Multitude Of Genome Sequences? There are currently 575 bacterial species and 28 vertebrate
species, ranging from primates to fishes, for which we know
(nearly) their entire DNA sequences. These number will continue to
increase rapidly over the next few years. Comparing these genome
sequences has emerged as one of the most important areas of
computational biology. For example, one way to predict functional
portions of the human genome is to search among related
genomes for sequences that appear to be remarkably similar due
to selective pressure. I will discuss and demonstrate some of the
methods and tools for such an approach, as well as some of the
challenges and unsolved problems. This talk will be self-contained:
no knowledge of biology beyond what you have heard in the news
will be assumed.

Toward petaflop algorithms for medical-image driven inverse problems in biophysics In this talk, I will describe algorithms for massively parallel assimilation of multimodal time-varying medical images with biophysical models. The goal is to create highly-resolved, physically-realistic, patient-specific biophysical models. In addition, the numerical methods should deliver optimal algorithmic complexity, and should scale to thousands of processors. The target applications are problems in the cardiac and brain physiology.
In particular, I will discuss three topics: (1) fast octree-based algorithms for elliptic and parabolic partial differential equations
(PDEs) that are used to model the biophysics; (2) algorithms for inverse problems constrained by partial differential equations; and
(3) the application of the inverse and forward solvers to 4D deformable image registration problems that aim to extract cardiac motion data from MRI sequences and to subsequently use the reconstructed motion fields as virtual observations to identify patient-specific spatial tissue heterogeneities (for example, ischemic regions in the heart).

100 Years of Digital Data The Information Age has brought with it a deluge of digital data. Current estimates are that in 2006, 161 exabytes (10^{18} bytes) of digital data were created from cell phones, computers, iPods, DVDs, sensors, satellites, scientific instruments, and other sources, providing a foundation for our digital world. Migrating digital content through new generations of storage media, making sense of its content, and ensuring that needed information is accessible now and for the foreseeable future constitute some of the most critical challenges of the Information Age.
The San Diego Supercomputer Center (SDSC) is a national Center leading the development and deployment of a comprehensive infrastructure for managing, storing, preserving, and using digital data. Leveraging ongoing collaborations with the research community (National Science Foundation, Department of Energy, etc.), data preservation and archival communities (Library of Congress, National Archives and Records Administration) and other partners, SDSC is providing innovative leadership in the emerging area of Data Cyberinfrastructure. In this talk, SDSC Director Fran Berman discusses SDSC's approach to building and deploying data-oriented computational and data cyberinfrastructure, and describes the next generation of challenges and opportunities for the data that drives the Information Age.

Computational and Mathematical Challenges Involved
in Very Large-Scale Phylogenetics
Phylogenetic inference presents enormous computational and mathematical
challenges, but these are particularly exacerbated when dealing with very
large datasets (containing thousands of sequences) or when sequences evolve
under complex models of evoluiton. In this talk, I will describe some of the
recent progress in large-scale phylogenetics. In particular, I will talk
about multiple sequence alignment and its implications for large-scale
phylogenetics.

Essential Computing: Is Inference the key ingredient? Intel Research’s bold “Essential Computing” vision is focused on a leaping capability which will make our interactions with technology intuitive, helpful and robust. These next generation computing systems will move from task and utility-orientation to supporting the essence of our lives. A major challenge is to eliminate annoying, labor-intensive and fragile interactions. Is Inference the key ingredient to make this leap?

Dr. Gary Flake
Technical Fellow Microsoft
4 October 2007

The “Blind Men and the Elephant” Parable and the Future of Web Search This presentation will consider the future of web search from three different perspectives: the business strategist, the end user, and the research scientist. For the business strategist, I propose an “ecosystem” pattern for describing how parts of the online worlds are both merging and encroaching on offline activities. For the end user, I consider how web search is but one part of a larger work flow, which suggests that search will eventually become part of a more integrated user experience. And for the research scientist, I will argue that many Web-based research programs point towards common themes and that these themes suggest new directions for future advancements in web search.
Put together, these three views of the future of web search reinforce one another in a way that suggests good news for our field: right now may be the best moment in the history of the world to be a computer scientist.
Time permitting, I will also demo two emerging technologies from Live Labs which resonate with the themes of the talk: Seadragon, a multi-scale visualization platform; and Photosynth, an entirely new way to visualize collections of photos.

Opportunities with Graph Based Programming Reformulating an algorithm to mask communication delays can be crucial in maintaining scalability, but the required programming effort is a
challenge even for the expert. We present Thyme and Tarragon, run time libraries, together with a programming methodology, for implementing
communication-tolerant algorithms. Thyme and Tarragon employ a task precedence graph to represent a partial ordering over a logical grouping
of the application's loop iteration spaces that is is similar to dataflow. They avoid the classic problems of split-phase code and
built-in scheduling assumptions that hamper performance robustness in the presence of technological change. We present results with
applications running on large scale systems, and demonstrate significant reductions in communication overhead. The libraries rely on
background run time services to realize dataflow semantics. They are lightweight, and avoid the heavy machinery of thread migration, an
alternative technique for supporting latency tolerance. We are currently exploring compile time techniques targeting graph based models.

Dr. Kunda M.M. Rao
Ex-Deputy Director, Department of Space, India
28 August 2007

Applications and Challenges in Data Processing of Medium and High Resolution Remote Sensing Satellite Data A series of Indian Remote Sensing (IRS) satellites have been launched by Indian Space Research Organisation (ISRO) during the last three decades for earth resource surveys and estimating natural resources. Indian Remote Sensing satellites provide high, medium and coarse resolution data to ~ 30 countries and the data is used for various applications. These satellites have steerable cameras, cameras with stereo data capture, wide field Imaging cameras etc. Advances in satellite Remote Sensing (RS) have enhanced various applications such as water resources, agriculture, etc., for the gross root level development.
Data is processed for mosaicing, data fusion,3D images, temporal data registration etc., to meet the requirement of various applications. RS data, geospatial technologies, information technology, and communication technology is currently being converged to provide quick information in the form of SMS on mobile phones during natural disasters.
This talk also covers data reception, processing, applications of IRS satellite data and application of satellite image processing techniques to biomedical images.

Handling Large Scale Biological Workloads on Networked Computing Systems - Potential for Divisible Load Paradigm In this talk, I will first give a quick overview of Divisible Load Theory (DLT) – the model, some key results, and highlight on applications that had directly used this paradigm. This will be the first part of the talk. In the second part of the talk, I will present few of our recent works that had directly used DLT paradigm for biological sequence matching problems.
Similarity matching or homology detection is an impetrative step in many of the Bioinformatics applications. For instance, protein structure-to-structure comparison methods invariably start with some understanding on the results from similarity detection mechanisms. Protein classification problems based on domain specific knowledge invariably use sequence aligning methods in a fairly repetitive way thus consuming enormous amount of computing power from resources. These are few examples to immediately concede the computational challenge involved in handling such voluminous data. Modern day computing platforms, such as networked workstations, a high-speed LAN infrastructure or even Grid computing domain offers enormous amount of power that can be tapped for handling such data intensive applications. Such applications seem to demand ‘divide-and-conquer’ approach to gain significant speed-ups, although the data partitioning and distribution among the nodes play a vital role in deciding the speed-up. I will present our experiences and outcomes in adopting DLT paradigm for biological problems on two different topologies – Bus and Mesh networks. The latter topology is attempted to design an automatic computational engine for handling fairly large number of sequences delivering an impressive speed-up! These ventures clearly encourage such applications to consider modeling under divisible load paradigm and also serve as a viable starting point towards mapping onto high-performance systems.

Challenges and Opportunities in an Information Dominated Age for Large-Scale Scientific Knowledge Discovery Computational scientists must understand results from experimental,
observational and computational simulation generated data to gain
insights and perform knowledge discovery. As systems approach the
petascale range, problems that were unimaginable a few years ago are
within reach. It is rapidly becoming apparent, that as large-scale
computers and instruments become more capable and more widely used the
data generated for subsequent analysis is increasing in size and
complexity. Scientific analysis and discovery already seems to be
limited not by the speed or capacity of high-performance computers,
but by the speed at which scientists can construct meaningful insights
from this huge amount of data. The task of managing scientific data
is so overwhelming that scientists spend much of their time managing
the data by developing special purpose solutions, rather than using
their time effectively for scientific investigation and discovery.
In this talk, we first discuss the issues and challenges in scientific
knowledge discovery and scientific data management. We describe a
framework for scientific data management for scalable systems. Then we
present (1) Scalable parallel I/O interfaces and optimizations for
scalable scientific simulations and knowledge discoveries, (2) the use
of high-level information extraction at runtime to optimize I/O and
storage accesses, (3) scalable analytical and mining techniques for
scientific data mining and discoveries, and (4) future directions.

Parallel Numerical Algorithms
in Computational Science and Engineering In this presentation, we provide a definition of the interdisciplinary
field of Computational Science and Engineering (CSE) and point out its
pivotal role in advancing certain science and engineering
disciplines. Such advances, however, require extensive large-scale and
large-scope numerical simulations. These, in turn, can only be
realized through the use of high-end parallel computing platforms. One
often discovers, however, that simple implementation of the key
classical sequential algorithms on these novel architectures leads to
poor overall performance. High performance is thus most frequently
realized through the redesign of algorithms that take maximum
advantage of the modern parallel architectures in which floating point
operations are the least time-consuming tasks. We provide few
numerical linear algebra examples to illustrate this point with
applications in computational mechanics and nanoelectronics.

Dr. Rich Vuduc
Lawrence Livermore National Laboratory
2 April 2007

Autotuning sparse matrix kernels
"Autotuning" is the process of automatically improving an
application's performance, for a given machine and/or input problem,
using automated experiments. In this talk, I present OSKI, a freely
available autotuning system for sparse matrix kernels. OSKI targets
kernels, such as sparse matrix-vector multiply (SpMV) and triangular
solve, that are frequent bottlenecks in diverse applications in
scientific computing, economic modeling, and information
retrieval. OSKI hides the complexity of choosing data structure and
code transformations at run-time that will yield the best performance
for a given matrix, machine, and workload of kernel operations. For
SpMV, traditional implementations typically run at 10% or less of peak
machine speed; in contrast, OSKI's SpMV achieves up to 31% of peak and
runs up to 4x faster. OSKI is being adopted by commercial tools and
high-level parallel solver libraries.
Looking forward, we might ask to what extent could we autotune an
arbitrary application. I close this talk by summarizing an
early-stage, long-term effort to develop an autotuning compiler.

Tornado Codes for Archival Storage
This presentation examines the application of Tornado Codes, a class of low density parity check (LDPC) erasure codes, to archival storage systems based on massive arrays of idle disks (MAID). The fault tolerance of Tornado Code graphs is analyzed, and it is shown that it is possible to identify and mitigate worst-case failure scenarios in small (96 node) graphs through use of simulations that find and eliminate critical node sets that can cause Tornado Codes to fail even when almost all blocks are present. The resulting graph construction procedure is then used to construct a 96-device Tornado Code stripe storage system with capacity overhead equivalent to RAID 10 that tolerates any 4 device failures. This system is demonstrated to be superior to parity-based RAID.
After establishing the fault tolerance of Tornado Codes, a log-structured extent-based file system is constructed using the Tornado Coded stripe storage. The file system is combined with a MAID simulator to emulate the behavior of a large-scale storage system, with the goal of employing Tornado Codes to increase fault tolerance while minimizing power consumption. The effect of power conservation constraints and design choices on system throughput is examined, and a policy of placing multiple data nodes on a single device is shown to increase read throughput with an identifiable decrease in fault tolerance. Finally, the system is implemented on a 100 TB Lustre storage cluster, providing GridFTP accessible storage with higher reliability and availability than the underlying storage architecture.

From microarrays to networks: integrated analysis and visualization of functional genomics data Now that genomic sequencing is a routine technology, the next key
challenge in systems biology is understanding protein function,
interactions, and regulation. Experimental approaches to this problem
generated vast amounts of diverse functional genomics data, but an
equivalent explosion in biological understanding has not yet followed
the near-exponential growth in experimental results. I will discuss
how computer science approaches for integrated analysis of these data
combined with targeted, computation-driven experiments can address
this gap and substantially contribute to our understanding of biology.
Specifically, I will discuss my group’s work on development of novel
methods for study of gene function and regulation in biological
systems through integration, analysis, and visualization of
heterogeneous biological data. I will focus on our machine learning
and search-based methods for function prediction, context-specific
biological network modeling, and integrated analysis and visualization
of microarray data. I will describe biological insights that
motivated our algorithms, their evaluation, and implementation in
publicly available systems; and present how using these systems, we
have modeled multiple known processes in the yeast S. cerevisiae,
characterized unknown components in these processes through
computational predictions and experimental validation, and identified
novel cross-talk relationships between biological processes.
Throughout the talk, I will also outline related challenges for
computer science and insights we gained in developing algorithms for
these problems.

Dr. Ali Pinar
Lawrence Berkeley National Laboratory
26 February 2007

Vulnerability Analysis of the Electric Power Grid Robust operation of a power grid requires anticipating likely events that can lead to costly blackouts. The Northeast blackout of August 2003 for instance, started with only three broken power lines. Detecting vulnerabilities of the power system requires solving combinatorial optimization problems constrained by nonlinear power flow equations. In this talk, I will first present a mixed integer nonlinear programming (MINLP) formulation of the vulnerability analysis problem, and provide an analysis of the structure of an optimal solution for this problem. I will use this analysis to approximate the original MINLP formulation with pure combinatorial problems, namely the network inhibition problem and the inhibiting bisection problem.

Performance Modeling: Understanding the Past and Predicting the Future in HPC In the context of High Performance Computing, a performance
model is a calculable expression that takes as parameters attributes
of application software, input data, and target machine hardware (and
possibly other factors) and computes, as output, expected
performance. Via parameterization, performance models enable
exploration of performance as a function of possible enhancements to
machine or code, and can thereby reveal greater understanding of the
performance, as well as the opportunities to improve it. The models
are therefore useful to explore and improve the design of future
machines in the context of application requirements, to explore and
improve application algorithm choice and software implementation in
the context of machine capabilities, and to determine affinities of
applications to machines.
Because of the historic difficulty in producing truly general models,
prior-art generally limited the scope of models to a single system and
application, allowing only the system size and job size to vary. This
talk will describe methods that can be effective over a broader range
of system/application choices. The models can then be used to improve
architecture design, inform procurement, and guide application tuning,
as well as to improve resource allocation (scheduling). In our
research, we are exploring these multiple uses and applying the
research results to several real-world problems as will be
described. The process of producing performance models historically
has been rather time- consuming, requiring large amounts of computer
time and highly expert human effort. This has severely limited the
number of high-end applications that can be modeled and studied. It
has been observed that, due to the difficulty of developing
performance models for new applications, as well as the increasing
complexity of new systems, supercomputers have become better at
predicting and explaining natural phenomena (such as the weather) than
at predicting and explaining the performance of themselves or other
computers! In our research we are addressing these challenges by
automating the formation of models, and making the processes of
acquiring application data faster and easier to store, and
representing the performance of target machines by a few simple,
orthogonal benchmarks. The result is increased scope, accuracy, and
applicability of performance models.

Cell Programming at IBM Research This presentation will summarize the various Cell
programming projects on-going at IBM Research, IBMs plans for the
future of the Cell tool chain and will include information on methods
for additional collaboration.

Addressing Challenges of Adaptivity and Scale in Parallel Scientific Simulations Simulations are playing an increasingly important role in science and
engineering and are rapidly becoming critical research
modalities. Large scale, coupled and dynamically adaptive simulations
can enable highly accurate solutions to realistic models, and provide
dramatic insight into complex phenomena. However, the phenomena being
modeled by these simulations are inherently dynamic and heterogeneous,
spanning multiple time and space scales. As a result, their large
scale parallel implementation presents significant challenges. In this
talk, I will present a computational engine that incorporates
algorithmic and infrastructure solutions to addresses these challenges
and enables efficient and scalable implementation of adaptive
formulations in a wide range of application domains. Specifically, I
will focus on addressing the space, time and computational
heterogeneity, and dynamism in simulations based on parallel adaptive
mesh refinement.

Solution of large nonlinear Eigenvalue Problems In Electronic Structure Calculations Density Functional Theory (DFT) is a successful technique used to determine the electronic structure of matter that is based on a number of approximations. It converts the original n-particle problem into an effective one-electron system, resulting in a coupled one-electron Schrödinger equation and a Poisson equation. This coupling is nonlinear and rather complex. It involves a charge density $\rho$ which can be computed from the eigenfunctions $\psi_i$, for all occupied states. These eigenfunctions are solutions of a nonlinear eigenvalue problem resulting from a Schrödinger equation whose potential depends nonlinearly on the charge density, which in turn depends on the eigenfunctions. This problem is solved `self-consistently' by an iterative procedure. The challenge comes from the large number of eigenfunctions to be computed for realistic systems with, say, thousands of electrons. We will desribe a parallel implementation with a finite difference approach for this problem with an emphasis on diagonalization, illustrating as needed with our in-house code called PARSEC. PARSEC evolved over more than a decade from a multi-disciplinary collaboration as many features were progressively added to it and the diagonalization routine, which accounts for the biggest part of a typical execution time, was upgraded a few times. We found that it is important to consider the problem as one of computing an invariant subspace in the non-linear context of the Kohn-Sham equations. This viewpoint leads to considerable savings as it de-emphasizes the accurate computation of individual eigenvectors and focuses instead on the subspace which they span.

Petascale Computing in the Biosciences: Simulating Entire Life Forms Living cells are made of atoms that are organized in molecules and their hierarchical assemblies. Most lasting advances in our understanding of the building blocks of living systems, for example genes, proteins, membranes, were based on atomic level structures and resulting properties, computing playing a key role. The question arises if these types of advances can be extended to explain the self-organization of complete living systems, i.e., if an entire cell can be resolved structurally to the atomic level and then its self-organized behavior deduced from physical principles. This lecture demonstrates through examples in how far a description of life from atomic properties and pure physics is already achievable today and may reach further with advances in microscopy, multi-scale theory, and petascale computing. Examples to be provided will include the self-assembly of proteins and lipids, simulation of a virus, swimming of bacteria and, lastly, the atomic level description of an entire cellular organelle, the photosynthetic apparatus of purple bacteria. The lecture links present day achievement in modeling technology and computational discovery with petascale era prospects to demonstrate that the biosciences are ready for petascale computing.

Cell Broadband Engine: Rationale, design, and applications This talk will provide insight into the major design goals and design decisions that led to the definition
of the Cell Broadband Engine(TM) processor. We discuss the impementation of the first generation Cell BE
processor, and provide some guidance on how best to program this processor. Next we discuss use of Cell
beyond the gaming space and show a number of application examples. We'll end with a view of the future
for this family of processors.

Algorithms and Infrastructure for Molecular Dynamics Simulations Molecular dynamics (MD) approaches are at the core of diverse problems in biophysical simulations, materials engineering, and chemical modeling, among others. Over the past several years, with collaborators in various application domains, we have investigated fast algorithms, mathematical formalisms, software infrastructure, and more recently, application-specific coarse-graining techniques and novel molecular dynamics techniques that further augment traditional approches with information abstracted from quantum-scale modeling. In this talk, we describe our results in the development of
(i) fast algorithms for long-range potentials, (ii) error analysis and acceleration techniques for multipole approximations, (iii) parallelism and scaling properties, and (iv) applications of our techniques to sparse and dense linear system solvers. In spite of various algorithmic and hardware advances, molecular dynamics remains highly limited in its spatial and temporal scales. For this reason, it is important to develop coarse-grained models capable of fundamentally extending space and time scales, while maintaining the fidelity and predictive properties of simulations. We describe our work on the development of these models for the simulation of biophysical membranes. Traditional molecular dynamics approaches do not extend to reactive systems, since bonds are static through the simulation.
We conclude with a discussion of our more recent work on simulation of reactive systems.

Analyzing and interrogating protein interaction maps High-throughput technologies have determined physical interactions for a significant fraction of proteins in several organisms. Taken together, these linkages form large protein interaction networks. Computational analysis of these networks can help uncover the functions of individual proteins as well their roles within larger pathways. In this talk, I will introduce novel methods for analyzing protein interaction networks and uncovering recurring means with which diverse biological processes are carried out. Our approach lends insight into the organizational principles of protein networks and can be used to determine functionally cohesive subnetworks and help predict protein function.

Sequence- Structure Relationships in Proteins The rapid growth in sequence data through genomic projects presents the obvious challenge of accurately modeling the structure and function of newly discovered proteins. The most efficient modeling approach is based on homology. In homology modeling we first identify a protein similar to the target (template). The template is used to model the structure of the new sequence. I shall describe our machine learning efforts to design fully automated systems for better detection of templates, alignment of the new sequence against the template, and producing and scoring final models. I will present a few blind tests, and an example of significant biological interest of a gene that controls the size of the tomato fruit.
Another challenge presented by the new data is the development of a global model of sequence and structure relationships in proteins. We simulate the global relationships between protein sequences and structures with randomized algorithms. Since structures are preserved much better than sequences, we ask: What is the sequence capacity of protein structures? i.e.
how many sequences fold into a particular protein shape? We also seek ways to characterize the evolutionary interplay of protein sequences and folds.
We define the `temperature of evolution' to characterize mutation mechanisms, and present a directed graph for all known protein folds to describe the flow of sequences between all known structures. Implications to experimentally observed evolutionary relationships and protein design will be discussed.

Opportunities and challenges for petascale inverse earthquake modeling We have been working on the inverse problem of estimating the earth model from observations of ground motion from past earthquakes. We discuss such compounding issues as extreme large scale, ill-posedness, discontinuous solutions, multigrid preconditioners for the reduced Hessian, nonlinear convergence difficulties, and multiple local minima, and techniques for addressing them. We end with a discussion of the prospects of earthquake inversion to frequencies of engineering interest on petascale computers.

Adaptive immersed boundary methods for simulating cardiac blood-muscle-valve mechanics and electrophysiology
Although the equations that describe cardiac mechanics and electrophysiology are different, in both cases a realistic treatment demands the use of methods that account for anisotropy, inhomogeneity, and complex geometries. We employ a unified theoretical framework, the immersed boundary (IB) method, for both aspects of cardiac physiology. This unified approach not only yields methodological overlap but also allows for substantial software reuse.
This talk will include an overview of an adaptive version of the IB method for problems of fluid-structure interaction, as well as results from the application of this adaptive methodology to the three-dimensional simulation of blood flow in the heart. I shall also discuss the application of the IB framework to the modeling and simulation of the electrical function of the heart. Preliminary simulation results obtained using this new IB formulation of the bidomain equations will be presented.
Basic details of cardiac physiology will be introduced as necessary, and computer animations of the model heart will be shown.

Tangent Space Alignment for Manifold Learning In this talk we present an algorithm for the manifold learning and parametrization recovery problems. Based on a set of
scattered data points sampled with noise from a parameterized manifold, the local geometry of the manifold is learned by
constructing an approximation for the tangent space at each data point, and those tangent spaces are then
aligned to produce the global coordinates of the data points with respect to the underlying manifold. We relate
the alignment process to biharmonic eigenvalue problems and discuss the spectral properties of the alignment operators.
We also mention how those techniques can be applied to supervised manifold learning problems. Applications
in scientific computing will be discussed as well.

Progress in Supercomputing: The Top Three Breakthroughs of the Last 20 Years and the Top Three Challenges for the Next 20 Years As a community we have almost forgotten, what supercomputing was like twenty years ago in 1985. The state-of-the-art system then was a 2 Gflop/s peak Cray-2, with at that
time a phenomenal 2 GBytes memory. It was the era of custom-built vector mainframes, where anything beyond 100 Mflop/s sustained was considered excellent performance.
The software environment was Fortran with vectorizing compilers (at best), and a proprietary operating system. There was hand tuning only, no tools, no visualization, and
dumb terminals with remote batch. If one was lucky and had an account, remote access via 9600 baud was state-of-the-art. Usually, a single code developer developed and coded
everything from scratch. What a long way we have come in the last twenty years! Teraflop/s level performance on inexpensive, highly parallel commodity clusters, open source software, community
codes, grid access via 10 Gbit/s, powerful visualization systems, and a productive development environment on a desktop system that is more powerful than the Cray-2
from 20 years ago—these are the characteristics of high performance computing in 2005. Of course, a significant contribution to this progress is due to the continued increase of
computing power following Moore's Law. But what I want to argue here is that progress was not just simply quantitative alone. We did not just get more of the same at a
cheaper price. There were several powerful ideas and concepts that were shaped in the last 20 years that made supercomputing the vibrant field that it is today. As an active
researcher in the field for the last 25 years, I will offer my subjective opinion, what were the real top breakthrough ideas that led to qualitative change and significant progress in
our field. Retrospection leads to extrapolation: in the last part of the lecture, I will envision what supercomputing will be like 20 years from now in the year 2025. Can we expect similar
performance increases? How will supercomputing change qualitatively and what are the top challenges that we will have to overcome to reach that vision of supercomputing in 2025?

Rearrangements and Duplications in Tumor Genomes: Towards a Cancer Genome Project Cancer is a disease driven by mutations in the genome that alter the structure, function, and regulation of genes. These mutations range from single letter changes in the DNA sequence to more drastic rearrangements, gains, or losses of large pieces of DNA. In some types of cancer these large-scale alterations are directly implicated in the pathogenesis of cancer and provide targets for cancer diagnostics and therapeutics.
I will describe computational methods for reconstructing tumor genome architectures and analyzing rearrangements in tumor genomes at high resolution using a technique called End Sequence Profiling (ESP). These methods are inspired by techniques in comparative genomics and view a tumor genome as a rearranged version of the normal human genome. In this framework, both the human and tumor genomes are represented by permutations and the problem is to find a parsimonious sequence of rearrangement operations that transform one permutation into another. I will also describe how computational analysis of ESP data suggests mechanisms that produce complicated patterns of overlapping rearrangement and duplication events that are observed in some tumor genomes.
Another experimental technique called array comparative genomic hybridization (aCGH) has become indispensable in the identification of duplicated and deleted segments of DNA in tumor genomes. ESP provides an effective complement to aCGH, and I will discuss how to combine data from both types of experiments using network flow techniques in order to obtain a comprehensive view of tumor genome architecture. I will demonstrate the application of these methods to ESP and aCGH data from breast cancers. Finally, I will describe the implications of this work for the recently proposed Cancer Genome Atlas, a genome project for cancer.

Parallel Methods for Maize Genome Assembly Maize is the first complex plant genome to be sequenced in the United States, and its sequencing is currently underway. Repeats span 65-80% of the maize genome, which creates particular challenges in assembling it.
In this talk, I will describe parallel bioinformatics methods we developed for this purpose. The novel features of our approach include the design of memory efficient algorithms, and algorithmic techniques and use of parallel processing to facilitate large-scale assembly with extremely fast turnaround times. I will present our assembly efforts on the IBM BlueGene/L massively parallel supercomputer, and discuss the NSF/USDA/DOE interagency project for maize genome sequencing and assembly.

Experimental Methods for Tuning Parameter Selection in Machine Learning and Statistics Today, models need to be complex to cope with the complexity of data sets. Model complexity arises in part from model, or tuning, parameters that either determine a model from a class of models for the data, or that specify options in a learning algorithm that searches for structure in the data or carries out a task. For example, the following are tuning parameters in locally-weighted regression: bandwidth, polynomial degree, scaling coefficients of the explanatory variables, and number of leaf nodes of a k-d tree used for fast computation. Tuning parameter selection is a ubiquitous task in statistics and machine learning. It is typically carried out by choosing a model criterion such as the cross-validation sum of squares, and using an optimization algorithm to find the parameters that minimize the criterion. We advocate, in place of just a machine minimization, an experimental approach. We begin with the same framework, tuning parameters and a model criterion, and add a measure of model complexity and an estimate of the noise variance. We approach the tuning parameter selection as we would a multi-factor experiment with multiple response surfaces that are functions of the factors. There are three responses - the selection criterion, the complexity measure and the estimated noise variance - and the factors are the tuning parameters. The three responses are evaluated over a design space, a collection of values of the factors; then the resulting meta-data are analyzed. When each evaluation is expensive, as it might be if the data are massive, methods of multi-factor experimental design are used to choose the design space. The meta-data are analyzed by methods for studying data from designed experiments: transformations of the factors to simply the response surfaces, extensive data visualization, the evolutionary operation methods of Box, fitting equations to response surfaces, and anything else that helps to understand the meta-data. The benefits of this experimental approach are (1) more effective model and algorithm optimization, (2) models with more tuning parameters than could succeed with a blind optimization, and (3) model choice based on a balancing of bias and variance to meet the needs of the analysis. Furthermore, with experience across data sets, much insight is gained into the effect and relationship of the tuning parameters, which can be used to improve the model or learning algorithm. We have begun our investigation of this approach for locally-weighted regression. Initial results have been very encouraging.

Fault Tolerance for Large-Scale Computing As computing systems grow in size they become less reliable. Although each component may be quite reliable, connecting thousands of them together (the largest supercomputers have thousands of processors, TBs of RAM and miles of network cable) virtually guarantees that something will fail somewhere every few hours. In this talk I will focus on two aspects of this problem.
First, I will discuss my work on application-level checkpointing for parallel applications. Unlike prior work on checkpointing, which is very
OS- and library-specific, this approach works on any OS and any implementation of system libraries, scaling upto systems with thousands of processors.
Second, I will look into the more general problem of detecting and tolerating errors in real computing systems. Unlike simple fail-stop fault models, real systems fail in complex ways and after five decades of research there exists no efficient way of detecting and tolerating such faults. I will present a specific emerging source of complex system
failures: radiation-induced soft errors. I will then outline a new compiler-based approach for the detection and tolerance of such errors that works by analyzing the application's vulnerability to random bit flips and using this information to reduce this vulnerability by the intelligent application of existing fault tolerance techniques.

Mechanical Generation of Correct Linear Algebra Algorithms One of my research goals is to produce a system that mechanically transforms the mathematical description of a linear algebra operation into formally correct high-performance routines together with their performance and
(ideally) stability analyses. Building high-performance libraries for dense linear algebra is fundamental to scientific computing. Almost all high-performance applications spend a substantial amount of time in dense linear algebra operations like those supported by the BLAS and LAPACK libraries. With the advent of architectures with complex multi-level memories, it no longer suffices to include only one or two algorithms for each operation in such libraries: different algorithms attain better performance under different settings. Expanding existing libraries to include all possible algorithms is a challenging task if traditional development techniques are employed.
The Formal Linear Algebra Methods Environment (FLAME) project at UT-Austin radically changes the building of linear algebra libraries by applying classical techniques from computer science. We have discovered how to systematically generate loop-based algorithms from a mathematical specification of the input linear algebra operation. The key is the a priori identification of loop-invariants corresponding to formally correct loops.
This advance enables us to constructively derive families of provably correct algorithms so that the best (e.g., fastest or most stable) algorithm can be chosen for a target setting. In order to carry over the correcteness of the algorithms to routines, we designed domain-specific APIs that allow a programmer to write code that mirrors the algorithm description.
My vision is to provide a webpage where a user can 1) input the mathematical specification of a linear algebra operation, 2) choose a target language, 3) describe a target a architecture, and 4) at the press of a button receive one or more routines---written in the target language and optimized for the target architecture---implementing the operation. I show that the goal is within reach; the FLAME methodology is sufficiently systematic that it can be automated. I present my system that mechanically generates algorithms and code with limited human intervention. The system has been used to derive dozens of algorithms for a broad selection of linear algebra operations. In many cases the resulting algorithms were previously unknown. I will conclude with how the performance and stability analyses of algorithms can be also automated or made systematic.

Monte Carlo Methods for Partial Differential Equations We present a brief overview of Monte Carlo methods for the solution of elliptic and parabolic partial differential equations (PDEs). We begin with a review of the Feynman-Kac formula, and its use in the probabilistic representation of the solutions of elliptic and parabolic PDEs. We then consider some specific Monte Carlo methods used for obtaining the solution of simple elliptic partial differential equations (PDEs) as part of exterior boundary value problems that arise in electrostatics and flow through porous media. These Monte Carlo methods use Feynman-Kac to represent the solution of the elliptic PDE at a point as the expected value of functionals of Brownian motion trajectories started at the point of interest. We discuss the rapid solution of these equations, in complex exterior geometries, using both the "walk on spheres" and "Greens function first-passage" algorithms. We then concentrate on methods for quickly computing the isotropic permeability using the "unit capacitance" and "penetration depth" methods. The first of these methods, requires computing a linear functional of the solution to an exterior elliptic PDE. Both these methods for computing permeability are simple, and provide accurate solutions in a few seconds on laptop-scale computers. We then conclude with a brief look at other Monte Carlo methods and problems that arise on related application areas.

Why a Mouse Resembles an Oil Reservoir We will present an overview of middleware and
algorithms designed to respond to challenges posed by heterogeneity,
coordination, data size and control requirements posed by dynamic data
driven application systems. We will outline the architectural roles
played systematic metadata management mechanisms, and by mechanisms that
provide an abstract view of combined information stored in filesystems,
XML databases and in relational databases. Finally, we will motivate
this talk by a variety of examples that arise in exploring the
morphological impact of Rb gene knockouts, tumor microenvironments in
breast cancer research, in grid based computer aided diagnosis and in
oil reservoir management.

A History of Numerical Linear Algebra From the days of the first ballistic computations on digital computers, the vast majority of computer time used for scientific computation is spent on linear algebra problems. Pioneers like Lanczos, von Neumann, and Wilkinson led a revolution in advanced computing using machines like the ENIAC and ACE in the early and middle years of this century. We shall describe some pioneers in numerical linear algebra and their influence. Over the years, many effective techniques have been developed for solving scientific and engineering computing problems from ballistics to quantum mechanics. We shall discuss several of these problems in linear algebra and describe, in outline, their solution. Within the last decade, parallel and vector computers have sparked a new revolution with profound affects on numerical analysis. Some techniques banished as inferior for conventional computers have proved to be attractive alternatives for machines with advanced architectures. Supercomputer research has also led to improved algorithms for conventional serial computers as well.
Finally, we shall discuss some of the latest advances, results, and current directions in scientific computation and numerical linear algebra.

The Challenge of Consilience and Scale Legend says that Archimedes remarked, on the discovery of the lever, “Give me a place to stand, and I can move the world.” Today, computing pervades all aspects of science and engineering. “Science” and “computational science” have become largely synonymous, and computing is the intellectual lever that opens the pathway to discovery. As new discoveries increasingly lie at the interstices of traditional disciplines, computing is also the enabler for a scholarship in the arts, humanities, creative practice and public policy.
This talk will describe emerging opportunities in the arts, humanities, science and engineering where interdisciplinary Renaissance approaches can have profound impact on discovery and creative expression. It will also the implications of these opportunities for design of high-performance computing systems.
As node counts for multi-teraflop systems grow to tens of thousands, with proposed petaflop system likely to contain hundreds of thousands of nodes, and with a tsunami of new experimental and computational data ahead, we must rethink traditional assumptions about software scaling and manageability and hardware reliability. Although the mean time before failure (MTBF) for the individual components (i.e., processors, disks, memories, power supplies, fans and networks) is high, the large overall component count means the system itself can still fail more frequently. Our thesis is that these “two worlds” of software – distributed systems and parallel systems – must meet, embodying ideas from each, if we are to build resilient systems.

Mondriaan sparse matrix partitioning Mondriaan is a two-dimensional, multilevel, hypergraph-based package for the partitioning of a rectangular sparse matrix, to be used as a sequential preprocessing step for parallel sparse matrix-vector multiplication. The partitioning is done recursively, each time splitting a rectangular submatrix into two parts, trying to minimise communication while maintaining a good load balance. In this talk, the Mondriaan approach will be explained and its performance will be compared with physics-based partitioning in the application of finding the minimum-energy configuration of a 20,000-particle sample of amorphous silicon. We compare our geometry-based partitioning (using BCC, FCC, and HCP sphere packings) to a partitioning obtained by first translating the problem to a sparse matrix formulation and then using Mondriaan. Furthermore, we show how the communication load of the multiplication can be balanced among the processors by using improved heuristics for the vector partitioning or, in certain cases, by using an optimal graph-based algorithm.

Stateful Parallel Programming Our field has a long history of trying to find a way to compute in parallel on mutable state without data races or certain kinds of non-determinism ("bad" non-determinism). Recently, atomic transactions have been proposed as a way to simplify updates to shared mutable state in parallel. This talk will argue that nested, multi-object atomic transactions are in fact sufficient to insure correctness including "good" determinism of stateful parallel programs.

Cray Cascade Project In 2002, the US Department of Defense initiated the High Productivity Computing Systems Project to develop a next generation computer system capable of sustaining a petaflop. Cray is one of three vendors funded for Phase II development work. We have proposed, CASCADE, a revolutionary new computer system comprised of custom processors, a next-generation interconnection network, and an active memory system. While still in design, the system is expected to include support for both heavy-weight threads that exploit high temporal locality and light-weight threads that exploit high spatial locality. The former will execute on processors that tolerate memory latencies through a combination of multithreading, vector, and stream processing. The latter may execute in an active memory systems with PIM-like characteristics that may also be multithreaded to tolerate memory latencies. The interconnection network may be a symmetric Cayley graph network capable of high bandwidth, low latency communications. Memory will be physically distributed, but shared. A sophisticated programming environment is proposed to assist application programmers utilize automatically the machine's unique processing capabilities. We expect that the global shared memory and the hardware's ability to tolerate memory lApril 15, 2012s will eliminate many of the programming challenges confronting scientific application developers today. In this talk, I will present the design goals for Cascade and describe the architecture and programming environment as they are currently envisioned.

Monitoring Human Activity using Computer Vision During the past decade, we have been studying problems related to detection, tracking and behavioral analysis of humans in action. This talk will review that research, emphasizing our work on real time vision systems for visual surveillance. I will discuss the W4 system and recent research we have conducted on components of that system. W4 is a real time system that can detect and track people and their body parts, segment small groups of people into individuals and recognize when people are carrying and exchanging objects. W4 controls a stationary, monochromatic pant/tilt/zoom camera to develop and update statistical image models of the backgrounds against which it will analyze human activity. It identifies moving objects as bipeds, quadrupeds or vehicles based on an analysis of the periodicity of their appearance. Objects identified as people are further analyzed; based on their shape we determine their pose (upright, sitting, etc.) and then use pose-specific models to find principal body parts (head, arms, legs, torso) and initialize trackers for the body and its parts. Dynamic shape analysis is used to determine if a person is carrying an object, and to segment the object from the person carrying it. W4 also has a limited capability to analyze people moving in groups; under certain circumstances it can segment the group into individuals and find and track their heads. We have very recently developed more robust methods for segmenting people into groups based on a combination of color and occlusion analysis in both single view and multiple view imagery, and have developed appearance models for people that can be used under varying pose and illumination conditions. I will discuss and illustrate these research projects.

Prof. Jack Dongarra
University of Tennessee &
Oak Ridge National Lab
27 October 2003

An Overview of High Performance and the Computational Grid In this talk we will look at how High Performance computing has changed over the last 10-year and look toward the future in terms of trends. In addition, we advocate the `Computational Grids' to support `large-scale' applications. These must provide transparent access to the complex mix of resources - computational, networking, and storage - that can be provided through aggregation of resources. The vision is of uniform, location independent, and transient access to the Computational, Catalogued data, Instrument system, and Human collaborator, resources of contemporary research activity in order to facilitate the solution of large-scale, complex, multi institutional/multidisciplinary data and computational based problems. It envisages these resources being accessible through a Problem Solving Environment appropriate to the target community.

Copyright vs Community in the Age of Computer Networks
Copyright developed in the age of the printing press, and was designed to fit with the system of centralized copying imposed by the printing press. But the copyright system does not fit well with computer networks, and only draconian punishments can enforce it. The global corporations that profit from copyright are lobbying for draconian punishments, and to increase their copyright powers, while suppressing public access to technology. But if we seriously hope to serve the only legitimate purpose of copyright--to promote progress, for the benefit of the public--then we must make changes in the other direction.

PICO: A Massively-Parallel Branch-and-Bound Engine
This talk describes how to implement branch-and-bound (B&B) search algorithms using the PICO (Parallel Integer and Combinatorial Optimizer) massively parallel B&B framework. The developer must specify how to partition the search space, how to determine feasibility, and how to bound a subproblem. PICO allows optional specification of a number of other properties to improve search efficiency and/or user efficiency. We use a simple knapsack example throughout the talk to illustrate each point. Time permitting, we will also include highlights of a mixed-integer-programming B&B solver under PICO. We will also discuss the design and implementation of PICO's B&B engine. PICO is a C++/MPI code designed to scale to thousands of processors, but sufficiently flexible to accomodate virtually any parallel computing environment. It has a number of features for efficient parallel branch and bound including integration of general and problem-specific heuristics for early generation of good feasible solutions, flexible interprocessor communication, flexible centralization of control, efficient work storage and distribution, multiple methods of load balancing, and a customized non-preemptive task (thread) scheduler to manage CPU utilization. PICO is now entering a beta test phase in preparation for (free) public release. Attendees interested in using PICO to solve their optimization problems are welcome to contact us. Joint work with Jonathan Eckstein (Rutgers) and William Hart (Sandia National Laboratories)

Prof. Yi Pan
Georgia State University
30 August 2002

Effective Parallelization of Scientific Applications: Labor Costs vs. Execution Performance
There is always a trade-off between labor costs and execution performance when we implement a scientific application using a parallel programming language. How to choose a suitable and effective language to parallelize a scientific application is a challenge. This talk briefly reviews some of the most popular high-level and low-level parallel programming languages used for scientific computing. We will report our experiences of using these languages in our research and compare the performance of several parallel scientific equation solvers implemented in different parallel languages. Major features and comparisons of these languages will be discussed. Some insights into when and where these languages should be used and how to use them effectively during the parallelization process will be provided.

Columnsort Lives! An Efficient Out-of-Core Sorting Program
Suppose that you have a big parallel computer and you wish to sort a lot of data. So much data that it does not fit in the memory of your big computer (we call such problems "out-of-core"), nor does it fit on a single disk. How could you sort it efficiently? In this talk, we present the design and implementation of a sorting algorithm that works under the above conditions. Our algorithm is based on Leighton's columnsort algorithm. We show how to relax some of the steps of the original columnsort algorithm to permit a faster out-of-core implementation. Our algorithm requires only four passes over the data. Although there is a limit on the number of items that can be sorted--as a function of the memory used per processor--this upper limit need not be a severe restriction, and it increases superlinearly with the per-processor memory. We define several measures of sorting efficiency and demonstrate that our implementation's sorting efficiency is competitive with that of NOW-Sort, a sorting algorithm developed to sort large amounts of data quickly on a cluster of workstations. We also discuss three improvements to the algorithm. One uses threads for maximum overlap of I/O, computation, and communication, and this improvement pays off nicely. One reduces the number of passes from four down to three, but this improvement does not always pay off. And one adds a pass but in so doing increases the maximum problem size as a function of per-processor memory. Joint work with Geeta Chaudhry and Len Wisniewski.

Prof. Joseph JaJa University of Maryland, College Park
30 November 2001

Efficient Techniques for Exploring Geospatial Data A challenging fundamental problem in exploring large scale geospatial data is to organize the data and develop efficient methods to explore these data sets by content to determine associations between physical phenomena and spatial factors, detect changes and trends, and build environmental models. In this talk, we address the core problem of handling queries to identify regions whose parameter values over a specified number of layers fall within certain ranges. We assume that the input data is too large to fit in main memory. We develop a data structure that is a combination of an R-tree and efficient representation of regions such that such queries can be answered very quickly. Most of the results are based on theoretical foundation and are justified by extensive experimental results.

Towards a Universal Model for Personal Mobility Management in Cellular Wireless Networks
The global convergence of wired/wireless telecommunication networks and IP-based data networks to form a seamless global personal communication system (PCS) has set up the stage for a network independent universal location service. Symbolic representation of location space in terms of cells is the key to its viability. In this talk we will characterize the complexity of the mobility tracking problem in a cellular environment under an information-theoretic framework. We will identify Shannon's entropy measure as a basis for comparing user mobility models. By building and maintaining a dictionary of individual user's path updates (as opposed to the widely used location updates), we propose an adaptive on-line algorithm called LeZi-update, that can learn subscibers' profiles. This technique evolves out of the concepts of lossless compression. The compressibility of the variable-to-fixed length encoding of the acclaimed Lempel-Ziv family of algorithms reduces the update cost, whereas their built-in predictive power can be effectively used to reduce paging cost. Under this framework, universal learning, estimation and prediction of personal mobility is still possible for well-behaved users, in spite of the fact no single mobility model works for the diverse set of wireless infrastructures.

Tinker Toy Parallel Computing: Complicated Applications from Simple Tools
Developing parallel software for unstructured problems
continues to be a difficult undertaking. A number
of elaborate frameworks have been developed for parallel
applications, but they tend to be applicable to only
limited classes of problems. For non-standard applications,
developers are often forced to code from scratch.
In this talk, we show that this needn't be the case.
We describe a set of simple primitives which can be
combined in different ways to provide solutions to
a variety of complex, unstructured parallel computing
problems. Specifically, we show how a small set of
tools can yield efficient
parallel algorithms for particle modeling, crash simulations and transferring data between two independent grids in multiphysics simulations.

Secure Multi-Party Protocols for Approximate Searching
Suppose that Bob has a database D and that Alice wants to perform a search query q on D (e.g., "is q in D?"). There are elegant cryptographic techniques for solving this problem under various constraints such as ``Bob should know neither q nor the answer to q and Alice should learn nothing about D other than the answer to q'' while optimizing various performance criteria (e.g., amount of communication). We consider the version of this problem where the query is of the type "is q approximately in D?" for a number of different notions of "approximate", some of which arise in image processing and template matching, while others are of the string-edit type that arise in biological sequence comparisons. New techniques are needed in this framework of approximate searching, because each notion of "approximate equality" introduces its own set of difficulties; using encryption is more problematic in this framework because two items that are approximately equal cease to be so after encryption or cryptographic hashing. Practical protocols for solving such problems make possible new forms of e-commerce between proprietary database owners and customers who seek to query the database, with privacy and security. Joint work with Wenliang (Kevin) Du.

Large Scale Data Visualization Using Parallel Data Streaming
Effective large-scale data visualization remains a significant and important challenge with simulation codes already producing terabyte results on clusters with thousands of processors. Frequently the simulation codes produce distributed data and consume a significant portion of the available memory per node. This talk presents an architectural approach to handling these visualization problems based on mixed dataset topology parallel data streaming. This enables visualizations on clusters that would normally require more storage/memory than is available while at the same time achieving high code reuse. Results from a variety of hardware andvisualization configurations are discussed with data sizes ranging near to a petabyte.

Telescoping Languages: A Framework for Generating High Performance Problem-Solving Systems
Increases in the complexity of both computer architectures and application structure have led to corresponding increases in the
difficulty of application development, making it the nearly exclusive
domain of the professional programmer. When combined with the current
shortage of programmers at the highest skill levels, this has created
a software gap between the demand for new software and the ability of
our workforce to deliver it. One way to bridge this gap is to make it possible for end users to
develop applications for themselves. Indeed, many users today are
producing highly functional applications using scripting languages and
high-level problem-solving systems such as Matlab and Visual
Basic. Unfortunately, these applications are not typically considered
in the statistics that measure productivity because they fall short in
some measure of performance. Any such application that is deemed
useful is usually rewritten in a more traditional programming
language-C, C++, or Fortran-before being used for "production"
work. If we could skip this rewriting step, we would take a
significant step toward overcoming the software gap, while making it
possible for end users to develop truly powerful high-performance
applications. At Rice, we are developing a framework, called telescoping languages,
for generating optimized high-level problem-solving languages from
annotated domain libraries. The strategy involves an extensive,
compute-intensive preliminary analysis of the library, performed at
language-generation time. The output of this process, which could take
many hours to complete, will be an efficient compiler for an extended
scripting language in which calls to the underlying domain library are
recognized and optimized as primitive operations. The talk will
describe this strategy and its applications in detail and report on
some preliminary experiments demonstrating its effectiveness.

Prof. H.J. Siegel
Colorado State University
(formerly Purdue University)
12 November 2000

High-Performance Heterogeneous Computing: Goals and Open Problems
Heterogeneous computing systems provide a variety of architectural capabilities, and can be orchestrated to perform an application whose tasks have diverse execution requirements. One type of heterogeneous computing system is a mixed-machine system, where a suite of different kinds of high-performance machines is interconnected by high-speed links. One way to exploit such systems is to decompose an application into tasks, where (ideally) each task is computationally homogeneous. Different tasks may have different computational needs, and one task may generate input data for another. The goal is then to assign the tasks to machines and order the execution of the tasks and any inter-task communications to minimize the overall completion time of the application. Typically, users must specify this decomposition and assignment. One long-term pursuit in the field of heterogeneous computing is to do this automatically. An overview of a conceptual model of what this involves will be given. An example of the type of research being conducted in this area will be presented. In particular, a genetic algorithm based approach will be described for matching tasks to machines and scheduling the execution of the tasks and any inter-task communications. Open problems in the field of mixed-machine heterogeneous computing will be discussed. "Alligators" will be shown. (This research was done in collaboration with L. Wang, A. A. Maciejewski, and others.)

Prof. Amnon Barak
The Hebrew University of Jerusalem
23 September 1999

Scalable cluster computing with MOSIX for Linux
MOSIX is a software tool for supporting cluster computing with
Linux. The core of MOSIX are kernel-level, adaptive load-balancing and memory ushering algorithms that are geared for maximal performance, overhead-free scalability and ease-of-use. These algorithms are designed to respond on-line to variations in the resource usage among the nodes by migrating processes from one node to another, preemptively and transparently, to balance the load and to prevent memory depletion at any node. The MOSIX software allows a cluster of PCs (workstations and servers) to work cooperatively as if part of a single system. MOSIX uses competitive algorithms for on-line distribution and redistribution of the workload and the resources in order to improve the overall performance of a computing-cluster of any size. MOSIX conveniently supports a multi-user time-sharing environment for the execution of both sequential and parallel tasks. The talk gives an overview of MOSIX for Linux and its resource sharing algorithms. The talk then presents our experience with the execution of several large-scale parallel applications, including Protein sequences, molecular and quantum dynamics simulations.

Prof. Mahmut Kandemir
Pennsylvania State University
(formerly Northwestern University)
3 May 1999

Improving Locality Using Loop and Data Transformations in an Integrated Framework
High performance computers of today extensively use multiple levels of memory hierarchies which renders the performance of applications critically dependent on their memory access characteristics. It is now widely acknowledged that exploiting these hierarchies is an important issue for sequential as well as parallel machines. In order to appreciate the significance of this issue it is sufficient to observe that in the last decade the microprocessor speeds increased at a minimum rate of 50% per year whereas the decrease in the main memory access times was only about 7% per year. This huge performance gap, which is the main reason to have a memory hierarchy, has caused significant amount of activity in the area of compiler-directed cache locality optimization. In this talk I will present a new integrated compiler framework for improving the cache performance of regular scientific applications. In addition to applying iteration space (loop) transformations, the method includes data space (array layout) optimizations, i.e., those that change the memory layouts of large data structures (arrays in this case). A key characteristic of this approach is that loop transformations are used to improve temporal locality while data layout optimizations are used to improve spatial locality. This optimization framework was used with sixteen programs from several benchmarks and math libraries, and the performance was measured on the SGI Origin 2000 distributed-shared-memory machine. The results demonstrate that the approach is very effective in enhancing cache locality and outperforms current solutions that use either loop based or data layout based transformations alone. We expect that our solution to the locality problem will also enable better register usage due to increased temporal locality in the innermost loops, and that it will also help in eliminating false-sharing on multiprocessors due to exploiting spatial locality in the innermost loops.

Prof. Ishfaq Ahmad
The University of Texas at Arlington
(formerly The Hong Kong University of Science and Technology)
31 March 1999

MPEG-4 Based Interactive Multimedia using Parallel Processing
MPEG-4 is the emerging standard for interactive multimedia based information systems. It is aimed at supporting content-based coding, communication, access, and manipulation of digital video (and accompanying audio) objects. Compared to the conventional frame-based compression techniques, the object-based representation and coding of video enable MPEG-4 to cover a broad range of emerging applications, real-time, non-real-time, interactive, and non-interactive. In this talk, we describe our current research in building multimedia information systems based on this standard. The MPEG-4 encoder is considerably more complex and requires more computing power than previous video coding standards. To overcome this problem, we employ parallel processing using a cluster of workstations collectively working as virtual parallel machine. Since MPEG-4 is an object-based standard and the size and shape of each video object may vary from time to time, parallelization of the encoder is a challenging research problem. We propose efficient scheduling and load balancing techniques that enhance the encoding speed while ensuring synchronization among the objects. We also present an overview of our other projects related to MPEG-2 and MPEG-4, and discuss some of the future research directions.