CS 4230/6236: Parallel and Distributed Simulation Systems
Spring Semester, 2000
Projects
Preliminaries
Below are descriptions of suitable course projects. With additional work, some of these projects can lead to a publication in a conference, and/or PhD dissertation research. In some cases, support in the form of a graduate research assistantship for continuing work this summer may be possible.
Each project includes two checkpoints where a specific deliverable is due, and the final report. Weekly meetings will be arranged through the remainder of the semester to maintain progress in the projects and to provide feedback. An oral presentation is also required at the end of the semester describing the problem you worked on, and results that are available at the time of the report.
Due dates are as follows:
If you are doing a project from the following list, simply turn in a sheet of paper indicating which project you will work on for the project selection due date (note some additional information is required if you are doing the virtual environment project, however). You are free to propose your own project. If you plan to define your own project you should see me first to discuss what you want to do. In this case, the project selection write up should describe your project in detail comparable to the project descriptions that follow. Be sure to indicate what milestones you will reach at each of the checkpoints, as well as the final report.
All projects listed below are intended for one person, unless indicated otherwise. Persons taking this course for undergraduate credit may complete any of the projects listed below. Persons taking this course for graduate credit may complete any except the Virtual Environment project.
More than one person may select the same project to work on. You may (and in fact are encouraged) to discuss how you will approach the project with other persons working on the same project, but all code and reports that you turn in must be entirely your own work.
Most of the projects that follow will require making some significant modifications to the FDK software, or another software package called GTW/TeD. We will provide help and suggestions on how to modify this software to complete the project, but keep in mind these are research oriented projects, so complete solutions have not been worked out.
Project 1: Multiple SMP RTI Implementation
The goal of this project is to realize an efficient implementation of the FDK software that runs over a collection of shared memory multiprocessors interconnected via Myrinet switches. Communications between processors residing within the same SMP should use a shared memory message passing mechanism. This software is already available for the SGI multiprocessor; you will need to port it to the PC SMP cluster for use in this project. Communication between processors residing in different SMPs should use Myrinet message passing software, using a software package called GM. Myrinet is a high speed switch that enables low latency communication by bypassing the software protocol stack used in traditional networks. The native Myrinet hardware can provide communication latency on the order of a few microseconds when sending small messages, compared to hundreds of microseconds for traditional local area networks. You will use a cluster of PC-based SMPs interconnected with Myrinet hardware for this project.
There are two parts to this project. The first is porting the existing SMP implementation of FDK to execute on a single PC SMP. The second, which forms the main portion of the project, requires extending this implementation to include fast communications via the Myrinet for communication between processes mapped to different SMPs. This requires extending the communication software used by FDK (called FM-lib) to support inter-SMP communication. This should all be transparent to higher level software (specifically the rest of the RTI software and federates interconnected by the RTI).
Achieving high performance is a central focus of this project. It is important that your design minimize software overheads, e.g., message copying.
You must complete the following tasks for this project:
Suggested Reading:
Project 2: Message Bundling Software
Communication protocols such as TCP/IP executing over local or wide area networks are more efficient in sending large messages than small ones. Applications such as discrete event simulations tend to send many small messages. Such applications are often more efficiently implemented by using a technique called message bundling where several outgoing messages for a specific destination are buffered, and then sent over the network as a single large message rather than several smaller messages. The goal of this project is to implement message bundling in FDK, and to evaluate its performance relative to the current implementation that does not use message bundling.
This project requires the completion of the following tasks:
Suggested Reading:
Project 3: Threading Approaches
The current FDK software uses a single threaded approach where each federate must explicitly call the tick() procedure to give the RTI cycles to perform tasks such as receiving messages and to perform time management computations. This project will explore an alternate approach where federates need not call tick(), but rather a separate thread executes on each processor that performs this function. To complete this project you must come up with a design to transform the single-threaded FDK software to a multi-threaded version, implement your design, and perform a thorough performance evaluation study.
This project requires completion of the following tasks:
Suggested Reading:
Project 4: Time Management Software
The goal of this project is to realize an alternate implementation of FDK's time management software (TM-Kit). Specifically, the current FDK software computes LBTS values (a lower bound on the time stamp of future messages a processor can receive), but does not take into consideration topology information, i.e., information concerning which federates communicate with which other federates. This project involves developing a replacement that takes into account topology information, and an evaluation to determine its effect on performance.
To simplify the implementation of this mechanism, you should try to preserve the existing interface to the time management software as much as possible. You will need to add some additional information (e.g., topology information) however. In principle, one can deduce topology information from the publications and subscriptions made by each federate. To simplify this project, however, you may use a separate mechanism, e.g., a configuration file of your own design to provide topology information. Assume a separate lookahead value is associated with each link in the topology, and the topology and lookahead values do not change during the execution of the simulation.
Suggested Reading:
Project 5: Reverseable Computation and Fault Tolerant System Design
Many of the techniques used in optimistic synchronization of distributed simulations are also applicable to fault tolerant system design. In particular, it is well known that optimistic synchronization can be used to recover from failures in distributed systems. Specifically, detecting a failure can be used to trigger a Time Warp like rollback mechanism to return the distributed system to a consistent state. The state of the system is checkpointed (to secondary storage) to enable recovery from failures.
While one can directly apply Time Warp like mechanisms to implement distributed failure recovery, a problem is the associated overhead which involves disk I/O to retrieve a checkpointed state of each process being rolled back. This overhead can be reduced by recomputing rather than checkpointing a prior state of the system. In prior work, we have developed a way to automatically generate the reverse computation of segments of code in order to recompute prior states of the system.
This project involves (1) developing an algorithm for distributed failure recovery using the reverse computation mechanism, and (2) adding modifications to our existing reverse execution Time Warp system to partially implement the failure recovery mechanism. Specifically, you will modify an operational Time Warp system based on reverse computation to add disk I/O to checkpoint state and message logs. Note that this project does not use the FDK software, but rather a different package called Georgia Tech Time Warp (GTW).
To complete this project, you must perform the following tasks:
Professor Doug Blough (ECE) and Dr. Kalyan Perumalla will participate in this project. Professor Blough is an expert in fault tolerant computing, and Dr. Perumalla is an expert in reverse computation, and developed the reverse computation software you will modify for use in this project.
Suggested Reading:
C. D. Carothers, K. S. Perumalla, R. M. Fujimoto, "Efficient Optimistic Parallel Simulation Using Reverse Computation," 13th Workshop on Parallel and Distributed Simulation, 1999 pp. 126-135.
S. Rao, L. Alvisi, and H.M. Vin, "The Cost of Recovery in Message Logging Protocols," Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems, 1998, pp 10-18.
D. B. Johnson and W. Zwaenepoel, "Sender-Based Message Logging," FTCS 17: Digest of Papers. The Seventeenth International Symposium on Fault-Tolerant Computing, 1987 pp 14-19.
Project 6: Multi-Player Virtual Environment
This project is only for those taking the course for undergraduate credit. It involves building a distributed simulation application on top of the FDK software. The goal of this project is to realize an interactive distributed simulation application (e.g., a multi-player video game) over FDK. You are free to choose the type of game to be implemented. A "first-person" viewer application is preferable where the display seen by each user is that which would be seen by a person embedded within the virtual environment. In addition to the interactive participants in the virtual environment, there should be simulated entities moving throughout the virtual world, e.g., simulated adversaries.
A software package called Crystal is available for use in this project that provides support for creating virtual environments. It is recommended you use Crystal or a similar package rather than attempt to build everything from scratch.
This project can be done individually or in teams. One approach for a team project is to have each person build a different type of entity that is embedded in the virtual environment.
The tasks for this project include: