Biocomputing: Description of Selected Ongoing Research Activities


A. GENOME DATABASES

A mitochondrial genome information system defining the genetic basis of mitochondria is being developed in collaboration with Emory scientists from the Department of Molecular Genetics (Profs. Wallace, Brown et al.). The project involves designing and implementing a multifaceted, comprehensive database system encompassing mitochondrial genetics and mitochondrial disease. The system will be capable of delivering information about the mitochondrial DNA (mtDNA) sequence, variations of that sequence based on population studies, clinical disease information, functional information about DNA/RNA/proteins, and information on the interaction between nuclear and mitochondrial genes. A prototype database for mtDNA has already been developed and is available on the World Wide Web at http://www.gen.emory.edu/mitomap.html. It not only provides a valuable reference for the mitochondrial biologist, but may also serve as a model for the development of information storage and retrieval systems for other components of the human genome. We also expect that it will serve as a unifying element, bringing together information on structure and function, pathogenic mutations and their clinical characteristics, population-associated variations, and information eventually useful in gene therapy. This work is currently supported by a seed grant from the Emory-GIT research program and by a graduate fellowship from the National Library of Medicine.
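To make the data model concrete, the following sketch shows one way such a database might relate reference sequence positions, population variants, and disease associations. The schema, table names, and sample query are illustrative assumptions only; they are not the actual MITOMAP design.

    # Illustrative sketch only: a minimal relational schema linking an mtDNA
    # reference sequence to population variants and disease associations.
    # Table and column names are hypothetical, not those of MITOMAP itself.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE locus (
        position INTEGER PRIMARY KEY,  -- coordinate in the mtDNA reference
        ref_base TEXT NOT NULL,        -- reference nucleotide at the position
        gene     TEXT                  -- gene or region containing it
    );
    CREATE TABLE variant (
        variant_id INTEGER PRIMARY KEY,
        position   INTEGER REFERENCES locus(position),
        alt_base   TEXT NOT NULL,      -- observed substitution
        population TEXT,               -- population in which it was observed
        frequency  REAL                -- observed allele frequency, if known
    );
    CREATE TABLE disease_assoc (
        variant_id INTEGER REFERENCES variant(variant_id),
        disease    TEXT NOT NULL,      -- clinical phenotype
        evidence   TEXT                -- literature citation or study notes
    );
    """)

    # Sample rows: the well-known G11778A mutation in MT-ND4 associated
    # with Leber hereditary optic neuropathy (LHON).
    conn.execute("INSERT INTO locus VALUES (11778, 'G', 'MT-ND4')")
    conn.execute("INSERT INTO variant VALUES (1, 11778, 'A', 'multiple populations', NULL)")
    conn.execute("INSERT INTO disease_assoc VALUES (1, 'LHON', 'Wallace et al.')")

    # Example query: all reported variants in a gene, with any disease links.
    rows = conn.execute("""
        SELECT l.position, l.ref_base, v.alt_base, v.population, d.disease
        FROM locus l
        JOIN variant v ON v.position = l.position
        LEFT JOIN disease_assoc d ON d.variant_id = v.variant_id
        WHERE l.gene = ?
    """, ("MT-ND4",)).fetchall()
    for row in rows:
        print(row)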


B. DISTRIBUTED AND PARALLEL SYSTEMS

New developments in computing technologies, and innovative applications of those technologies, must be an essential element in any competitive effort by Georgia Tech in the life sciences. This was apparent in recent federal initiatives (e.g., the digital library initiative), where winning teams typically combined strong computing elements with their innovative use or further development in specific application domains. Research in computing at Georgia Tech can play (and is already playing) a similar role in two domains, (1) high performance computing and (2) interactive systems, as exemplified by projects already being undertaken jointly by researchers in application domains and computer scientists.

A project addressing both high performance and interactive systems, currently being undertaken with researchers in environmental sciences, is entitled "The Parallelization and Visual Analysis of Multidimensional Fields: Application to Ozone Production, Destruction, and Transport in Three Dimensions" (Investigators: Karsten Schwan, College of Computing; Fred Alyea, Earth and Atmospheric Sciences; M. William Ribarsky, Information Technology; and Mary Trauner, Information Technology). This project concerns the parallelization, on-line monitoring, and steering of a global atmospheric modeling code for multiple target high performance machines. Novel contributions of this work include: (1) a detailed investigation of opportunities for parallelism in atmospheric transport based on spectral solution methods; (2) the use of such parallelization as an enabling technology (i.e., vastly improved model speeds) that now permits end users to interact with their models during execution, using output data visualizations and animations of program functionality or performance; and (3) the development of sophisticated interactive 3D interfaces for such monitoring and steering on data sets of significant size.
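The monitoring and steering idea can be sketched in miniature: a running model periodically publishes state snapshots for visualization and applies parameter changes sent by the user between time steps. The toy model and parameter below are assumptions for illustration; the actual project instruments a parallel spectral transport code.

    # A minimal sketch of on-line monitoring and steering, assuming a simple
    # time-stepped model; the real system instruments a parallel atmospheric
    # code, which this toy loop does not attempt to reproduce.
    import queue
    import threading

    steering = queue.Queue()   # user -> model: parameter updates
    monitor = queue.Queue()    # model -> user: state snapshots to visualize

    def model_loop(steps=100):
        params = {"diffusion": 0.1}   # hypothetical steerable parameter
        state = 1.0
        for t in range(steps):
            # Apply any steering commands received since the last step.
            while not steering.empty():
                name, value = steering.get()
                params[name] = value
            state *= (1.0 - params["diffusion"])   # stand-in for one model step
            monitor.put((t, state, dict(params)))  # publish snapshot for display

    threading.Thread(target=model_loop, daemon=True).start()
    steering.put(("diffusion", 0.2))   # steer the model while it runs
    print(monitor.get())               # consume one snapshot for visualization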


C. DISTRIBUTED LABORATORIES

We envision a distributed computational laboratory in which researchers across campus (and potentially scientists with specialized capabilities at other locations) can conduct interdisciplinary research, interacting with each other and with sophisticated simulation models as if all were in a single building. This distributed laboratory will greatly enhance the productivity of researchers collaborating in interdisciplinary teams.

To permit scientists and engineers at geographically distinct locations (including individuals telecommuting from home) to combine their expertise in solving shared problems, we are constructing a combined networking and high performance computing infrastructure that will be broadly useful in the life sciences. At the same time, we are using two specific applications to focus our efforts and to help ensure that our results and software tools are properly integrated. One such application is the construction of a distributed laboratory for experimentation with high performance numeric computations for applications in the atmospheric and biomedical sciences, including remote instrument viewing and control (e.g., telemedicine). The initial version of this laboratory is distributed across local-area ATM networks, and it employs a heterogeneous set of parallel supercomputers, multiprocessor workstations, and less powerful machines, all running the Unix operating system. Later versions of this laboratory will be distributed across wide area networks, using existing middleware for user-user and user-program interactions.

The "middleware" projects consist of: (a) dynamic monitoring, adaptation, and interactive steering of high performance computations for on-line control of `virtual laboratory instrument;" (b) efficient execution of simulation programs, especially discrete-event simulations of physical systems on multi-granular compute servers; (c) exploring the real-time properties of such dynamic, distributed systems, by construction of advance benchmarks.

The emphasis in distributed systems research is on support for shared-state in multi-granular and distributed computing environments, by supporting the construction of high performance and interactive object technologies. Networking research is concerned with providing the necessary high performance and real-time communication protocols for distributed laboratory applications and for extensions of these applications into the home.


D. DISTRIBUTED SIMULATION

Simulations of complex biological systems, ranging from molecular structures to models of ecosystems, are necessary to provide an effective computational environment for researchers in biology and bioengineering. Because of the inherent complexity of these systems, the computation required to perform these simulations is enormous, placing severe limitations on the scale and degree of detail the simulations can capture. High performance simulation methods are required that exploit state-of-the-art parallel and distributed computing platforms for rapid completion of computation-intensive simulation problems. Environments supporting rapid development and understanding of simulation models are also required. This work is currently funded by grants from several federal agencies, including ARPA.
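At the core of such work is an event-driven simulation kernel. The sketch below shows a minimal sequential version, with a toy binding/unbinding model chosen purely for illustration; the parallel and distributed engines discussed above must additionally partition the model and synchronize event processing across processors.

    # A minimal sequential discrete-event simulation kernel (a sketch, not
    # the parallel/distributed engines described above).
    import heapq

    class Simulator:
        def __init__(self):
            self.now = 0.0
            self._events = []        # priority queue ordered by timestamp

        def schedule(self, delay, action):
            heapq.heappush(self._events, (self.now + delay, id(action), action))

        def run(self, until):
            while self._events and self._events[0][0] <= until:
                self.now, _, action = heapq.heappop(self._events)
                action(self)         # an action may schedule further events

    # Toy model: a molecule repeatedly binding and unbinding a receptor.
    def bind(sim):
        print(f"{sim.now:6.2f}  bound")
        sim.schedule(2.0, unbind)

    def unbind(sim):
        print(f"{sim.now:6.2f}  unbound")
        sim.schedule(1.5, bind)

    sim = Simulator()
    sim.schedule(0.0, bind)
    sim.run(until=10.0)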


E. ALGORITHMS

Computers play a role, for example, in sequencing the genome, where algorithms are used to "assemble" the sequence of a long segment from the sequences of shorter, overlapping fragments; in inferring evolutionary trees from similarity data for different species; in drug design, where computers allow researchers to manipulate images of molecules on the screen, in the hope of inferring how the molecules will behave; and in inferring the 3D structure of a protein from its one-dimensional sequence of amino acids.
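As a concrete illustration of the assembly problem, the sketch below implements a simple greedy strategy: repeatedly merge the pair of fragments with the longest suffix/prefix overlap. Real assemblers must also cope with sequencing errors, repeats, and reverse complements, all of which are ignored here.

    # A sketch of greedy fragment assembly: repeatedly merge the pair of
    # fragments with the longest suffix/prefix overlap.
    def overlap(a, b):
        """Length of the longest suffix of a that is a prefix of b."""
        for k in range(min(len(a), len(b)), 0, -1):
            if a[-k:] == b[:k]:
                return k
        return 0

    def assemble(fragments):
        frags = list(fragments)
        while len(frags) > 1:
            best = (0, 0, 1)   # (overlap length, left index, right index)
            for i in range(len(frags)):
                for j in range(len(frags)):
                    if i != j:
                        k = overlap(frags[i], frags[j])
                        if k > best[0]:
                            best = (k, i, j)
            k, i, j = best
            merged = frags[i] + frags[j][k:]          # join on the overlap
            frags = [f for n, f in enumerate(frags) if n not in (i, j)]
            frags.append(merged)
        return frags[0]

    print(assemble(["ATGGCC", "GCCTTA", "TTAGAC"]))   # -> ATGGCCTTAGAC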

Researchers work on these and related problems in the fast-growing area of computational biology, whose goal is improved algorithms for the computational problems that biologists solve. Because these problems are often enormous, fast algorithms are all the more important. Since the human genome, for example, has about three billion base pairs, algorithms for storing and retrieving the data, or for helping to sequence the genome, must be especially fast. The exciting area of computational biology is a new source of important applications for computer science.


F. INTELLIGENT SYSTEMS AND COMPUTATIONAL NEUROSCIENCE

Robotics research within the College of Computing has a rich history of drawing heavily upon neuroscientific models of behavior as the basis for control systems. We have in the past interacted most heavily with colleagues at Emory and Georgia State due to the dearth of neuroscientists at Tech. Computing plays a valuable role from the computational neuroscience perspective: it provides a means for testing and validating ideas of computational structure in living systems, while simultaneously opening an avenue for the discovery and invention of new technologies, particularly in computer vision and robotic control. Computational and robotic models of behavior, conversely, provide a test arena in which neuroscientists can explore their ideas. A recent National Science Foundation award to the intelligent systems group does just that: it ports abstract neural models of frog and mantid behavior, developed by neuroscientists in Mexico, into robotic control programs.

In a program funded by the National Library of Medicine, we are also exploring methods for modeling visual reasoning, such that these computational models can be used for processing and analyzing 3D cardiovascular nuclear imagery. Typically, these images are difficult to interpret, and their interpretation requires extensive training and expertise. By studying the visual reasoning processes employed by clinical experts, we have formulated a knowledge-based approach to performing the interpretive tasks. Importantly, the knowledge-based system infers structure (an estimate of the level of arterial disease) from function (the perfusion imagery). This computer-based "assistant" has demonstrated its reliability and accuracy in clinical settings, and has recently benefited from extensive refinements to its user interface design. In addition, we are exploring the use of neural networks to help interpret patterns in new imaging techniques, and ways in which symbolic knowledge can be extracted from these connectionist approaches.
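The structure-from-function inference can be caricatured as rules that map regional perfusion scores to candidate coronary territories. The sketch below is a deliberately simplified, hypothetical illustration; the clinical system's actual segment model and rule base are far richer, and the thresholds and mappings here are invented for the example.

    # A highly simplified, hypothetical sketch of the knowledge-based idea:
    # rules map regional perfusion scores (function) to candidate coronary
    # territories (structure). Thresholds and mappings are illustrative only.
    TERRITORIES = {           # assumed mapping of myocardial walls to arteries
        "anterior": "LAD",
        "lateral":  "LCX",
        "inferior": "RCA",
    }

    def interpret(perfusion):
        """perfusion: dict mapping region -> score in [0, 1], 1.0 = normal uptake."""
        findings = []
        for region, score in perfusion.items():
            if score < 0.7:                      # assumed defect threshold
                severity = "severe" if score < 0.4 else "moderate"
                findings.append(f"{severity} defect in {region} wall: "
                                f"suggests {TERRITORIES[region]} disease")
        return findings or ["no significant perfusion defect"]

    print(interpret({"anterior": 0.35, "lateral": 0.85, "inferior": 0.65}))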


G. HIGH PERFORMANCE COMPUTING IN SUPPORT OF ENVIRONMENTAL RESEARCH

The application of high performance computing and cross-disciplinary team research to fundamental problems in environmental science, technology, and policy is essential. Within the computing discipline, several areas will play a significant role in this effort, such as database management, parallel and distributed computing, computer graphics, visualization, simulation, and numerical/statistical analysis. Being able to establish, maintain, and share large data collections, as well as to retrieve and analyze the data, will be fundamental to solving environmental problems. Consider the problems of predicting weather, climate, and global change. The aim is to understand the coupled atmosphere/ocean/biosphere system in enough detail to make long range predictions about its behavior. One can easily imagine the tremendous amount of information that must be stored in order to solve these problems. Current generation database systems running on traditional large mainframe computers will have neither the functionality, nor the compute power, nor the I/O speed to deal efficiently with data of this magnitude. A network of high performance workstations and parallel computers with parallel I/O capabilities will be needed to meet data sharing and performance requirements.

In the College of Computing, research is being performed on methodologies and tools for representing and manipulating very large volumes of data in a parallel/distributed heterogeneous environment. More specifically, we focus on several areas. First, we consider database models: we need to represent and manipulate new data types, such as temporal, spatial, and image data, in the database system, and we need the capability to store and retrieve more complex metadata, such as device characteristics and documentation about experiments and observations. Second, we are investigating systems issues concerned with multimedia data storage, data indexing, archiving, and parallel/distributed processing of large scale environmental models.

Because of these massive amounts of data, performance will be a crucial issue, both in optimizing compute cycles and in optimizing I/O cycles. From an I/O perspective, it will become necessary to maintain a three-level data storage hierarchy, i.e., main memory, disk storage, and optical disks; the efficient staging of data through this hierarchy will play an important role in providing a high level of performance. Third, we focus on knowledge discovery from data contained in the database: can we provide efficient algorithms that discover, in massive amounts of data, significant patterns relevant to the goals of the scientists (e.g., identifying interesting features and characteristics)? It will also be necessary to give scientists the ability to browse data quickly. Doing this efficiently, from both a computing and a user's standpoint, is important, and visualization techniques will play a key role; the visualization of output data from large scale simulation models will also be needed. Fourth, a resource sharing environment is necessary to support collaboration among environmental scientists in a heterogeneous computing environment; however, providing efficient, transparent distributed program execution and data sharing requires much work.
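The staging idea can be illustrated with a simple policy: keep the working set of blocks at the fast level and evict the least recently used blocks to the slower level. The level names, capacity, and access pattern below are illustrative assumptions only.

    # A sketch of staging data through a storage hierarchy: an LRU policy
    # keeps the working set in the fast level and evicts to the slow level.
    from collections import OrderedDict

    class StagedStore:
        def __init__(self, fast_capacity, slow):
            self.fast = OrderedDict()     # block_id -> data, kept in LRU order
            self.slow = slow              # backing store, e.g. disk or optical
            self.capacity = fast_capacity

        def read(self, block_id):
            if block_id in self.fast:                 # hit in the fast level
                self.fast.move_to_end(block_id)
                return self.fast[block_id]
            data = self.slow[block_id]                # stage in from slow level
            self.fast[block_id] = data
            if len(self.fast) > self.capacity:        # evict least recently used
                evicted, value = self.fast.popitem(last=False)
                self.slow[evicted] = value
            return data

    store = StagedStore(fast_capacity=2, slow={i: f"block-{i}" for i in range(5)})
    for b in [0, 1, 0, 2, 3]:       # a hypothetical access pattern
        store.read(b)
    print(list(store.fast))         # the two most recently used blocks remain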


H. MODELING, IMAGING, AND VISUALIZATION

A number of ongoing projects, funded by the National Institutes of Health and the National Library of Medicine, are underway with the objective of achieving a better understanding of the information contained in complex imagery acquired through a number of modalities, including CT, MRI, PET, and SPECT. The discrete datasets may consist of a set of two-dimensional images, a sequence of temporal images, or volumes of image information capturing different biomedical structures or processes. Using these datasets, computer vision and image processing methods are explored to extract salient features that characterize the structure of interest. These features are subsequently used to create models of these structures: the models are mathematical and geometrical representations that exhibit physical and material properties, which can be static (e.g., internal structural detail) or dynamic (e.g., flexibility and elasticity properties to model possible deformations or movement). These models are then rendered as visual displays that are three-dimensional (or of higher dimension, representing movement or other complex information) and that can be manipulated or studied interactively. These physically-based visualization models and simulations are interpreted in turn to assist in understanding the structure or process of interest.
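One common way to endow a geometric model with elastic, deformable behavior is a mass-spring formulation, sketched below in one dimension with simple Euler-type integration. The constants are arbitrary illustrative choices, and this is not necessarily the formulation used in the projects above.

    # A minimal mass-spring sketch of a deformable model: point masses joined
    # by springs relax toward their rest length. Constants are illustrative.
    def step(positions, velocities, rest=1.0, k=50.0, damping=2.0, mass=1.0, dt=0.01):
        n = len(positions)
        forces = [0.0] * n
        for i in range(n - 1):                       # spring between nodes i, i+1
            stretch = (positions[i + 1] - positions[i]) - rest
            f = k * stretch                          # Hooke's law
            forces[i] += f
            forces[i + 1] -= f
        for i in range(n):
            forces[i] -= damping * velocities[i]     # viscous damping
            velocities[i] += dt * forces[i] / mass
            positions[i] += dt * velocities[i]
        return positions, velocities

    # A chain stretched beyond its rest length gradually contracts.
    pos, vel = [0.0, 1.5, 3.0], [0.0, 0.0, 0.0]
    for _ in range(500):
        pos, vel = step(pos, vel)
    print([round(p, 2) for p in pos])   # spacing approaches the rest length 1.0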

The challenges involve the exploration of methods for extracting, displaying, and interpreting the information. Thus, the research is undergirded by the subdisciplines of computer vision, graphics, visualization, and models of visual reasoning. The results of the research have yielded a number of interesting methods to represent and interact with multidimensional information, models for reasoning about incomplete, misleading, and noisy information, techniques for creating deformable tissue models, and approaches to fuse different types of imagery into single visualization models. The research has been conducted with colleagues from Emory University and Medical College of Georgia, and has spanned projects dealing with cardiac structure and dynamics, brain data fusion and analysis, surgery simulation, and medical image labeling and understanding.


I. MODELING AND LEARNING USING ARTIFICIAL INTELLIGENCE TECHNIQUES

One of the contributions that computing can make to biology is to provide a methodology for modeling biological problems computationally. Examples from AI include neural network models of brain function and of image interpretation; multiagent intelligent systems that model groups (e.g., flocking behavior in birds); models of vision and action; genetics; and knowledge-based interpretation of images. Another example would be to look for regularities and clusters in genetic population distribution data: for instance, how genes vary across populations. We can also contribute by (1) building models of adaptive agents in various environments; (2) using learning algorithms to find binding patterns or regularities in biological or behavioral observations; and (3) building models of flocking and other group behaviors.
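As one concrete example, a boids-style flocking model can be sketched in a few lines: each agent steers by cohesion toward the flock center, separation from crowded neighbors, and alignment with neighbors' velocities. The weights, radii, and flock size below are arbitrary illustrative choices, not parameters from any particular study.

    # A minimal sketch of a multiagent flocking model in the boids style.
    import random

    def step(agents, dt=0.1):
        new = []
        for (x, y, vx, vy) in agents:
            others = [a for a in agents if (a[0], a[1]) != (x, y)]
            cx = sum(a[0] for a in others) / len(others)   # cohesion target
            cy = sum(a[1] for a in others) / len(others)
            ax, ay = (cx - x) * 0.05, (cy - y) * 0.05
            for (ox, oy, _, _) in others:                  # separation
                if (x - ox) ** 2 + (y - oy) ** 2 < 1.0:
                    ax += (x - ox) * 0.5
                    ay += (y - oy) * 0.5
            avx = sum(a[2] for a in others) / len(others)  # alignment
            avy = sum(a[3] for a in others) / len(others)
            ax += (avx - vx) * 0.1
            ay += (avy - vy) * 0.1
            new.append((x + vx * dt, y + vy * dt, vx + ax * dt, vy + ay * dt))
        return new

    flock = [(random.uniform(0, 10), random.uniform(0, 10), 0.0, 0.0)
             for _ in range(8)]
    for _ in range(200):
        flock = step(flock)
    print(flock[0])   # agents drift toward a common heading and loose cluster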


