CS 7210 Distributed Computing 
Spring 2008

Index

  1. News&Notification (Final exam and project due dates announced!)

  2. Course Description

  3. Prerequisites

  4. Grading

  5. Text and Other Readings

  6. Syllabus

  7. Gaming Infrastructures and Environments

  8. Gaming Related Papers

  9. Class Wiki

  10. Project Information

  11. Class Schedule and Student Presentations

  12. Project Demonstration Schedule

  13. Sample Exam


News & Notification


Course Description

Distributed computing systems have become pervasive. From clusters to internet-worked computers, to mobile machines, distributed systems are being used to support a wide variety of applications. This course will focus on the fundamental concepts in distributed computing systems enabling such applications. The following are the objectives of this course:


These goals are achieved as follows. First, class sessions cover a set of papers that discuss basic principles in distributed systems. Principles covered in the course include global states of distributed computations, logical and physical clocks, and failure models. Distributed algorithms for consensus, replicated state management, and resource finding are also covered. Midterm and final exams feature questions on the papers covered in class. A second element of the class exposes students to representative development environments and infrastructures for distributed systems. This includes wide area applications and infrastructures (e.g., using Websphere or its open source counterparts on Internet-linked systems), high performance applications or algorithms across high end machines and network links (e.g., using Georgia Tech's cluster machines and facilities), mobile applications running on pervasive platforms (e.g., using representative pervasive systems at Georgia Tech or the MobiEmu emulation platform), and peer to peer systems or applications running on PlanetLab. Programming projects evaluate students' knowledge for this class component. A third element of the class introduces students to rigorous academic research, where for a specific topic and some set of papers covered in class, each student selects a set of recent papers in a certain area, and produces a short term paper describing the problems and results discussed in those papers. These summaries are shared with all students, and they are presented in class. There is considerable flexibility in defining class projects and topics for term papers. Concerning projects, students can formulate their own, and depending on the nature of the project, they can work in groups. Term papers are worked on individually, but their topics may be linked to class projects.

After successfully completing this course, students should be ready to explore research problems in distributed systems and on emerging applications deployed in such systems. This rigorous course requires strong class participation, and part of a student's grade is derived from class presentations.


Prerequisites

CS 4210 or equivalent. Graduate standing strongly suggested.


Grading

30% Examinations
15% Term Paper
40% Programming Project
15% Class Participation
Note that a passing grade is required in each of the above components in order to pass the class.



Textbook and Other Readings

Journal and conference papers listed above will be made available. Some useful background appears in the Mullender and Singhal textbooks on distributed computing.

  1. Distributed Systems, Sape Mullender, Addison-Wesley.
  2. Advanced Concepts in Operating Systems, Mukesh Singhal.

Syllabus

The following papers on the listed topics will be discussed in class. Other related papers will be assigned for reading. Homework will include material from papers assigned for reading. For papers without links, you should be able to download them from ACM Digital Library or IEEE Xplore through GATech Library.

Event Ordering, Global States and Time in Distributed Systems

1.      M. Raynal and M. Singhal. Logical Time: A Way to Capture Causality in Distributed Systems. IRISA Technical Report.

2.      David Mills. Network Time Protocol, RFC 1305. (skipped)

3.      Jeremy Elson, Lewis Girod, and Deborah Estrin, Fine-Grained Network Time Synchronization Using Reference Broadcasts, OSDI 2002.

4.      Chandy, M. and Lamport, L., Distributed Snapshots: Determining Global States of Distributed Systems, ACM Trans. on Computer Systems, February 1985.

5.      Schwarz, R. and Mattern, F., Detecting Causal Relationships in Distributed Systems: In Search of the Holy Grain, Distributed Computing, 1994.

Term Paper Topics: Distributed Checkpointing/Recovery, Clock Synchronization Algorithms.

Failures and Distributed Systems

Background reading: Danny Dolev, Cynthia Dwork and Larry Stockmeyer, On the Minimal Synchronism Needed for Distributed Consensus, JACM, January 1987.

1.      Survey of failures in distributed systems (ch. 2, Mullender)

2.      M. J. Fischer, N. Lynch and M. S. Patterson, Impossibility of distributed consensus with one faulty process, JACM 32, 1985.

3.      Tushar Deepak Chandra, Vassos Hadzilacos and Sam Toueg; The weakest failure detector for solving consensus, JACM 43, 4 (Jul. 1996), Pages 685 - 722.

4.      M. Castro, B.  Liskov, Practical Byzantine Fault Tolerance, OSDI, Feb. 1999.

Sample systems: we will study one of them in some detail:
Cox et al., Pastiche: Making Backup Cheap and Easy, OSDI 2002.
Abd-El-Malek, Ganger, Fault-Scalable Byzantine Fault-Tolerant Services, SOSP 2005.
Dunagan et al, Fuse: Light-weight Distributed Failure Notification, OSDI 2004.

Term Paper Topics: Byzantine Failures, Costs of Consensus in Synchronous Systems, Probabilistic Consensus.

Abstractions for Supporting Distributed Applications I: Group Communication

1.      Ken Birman, Andre Schiper and Pat Stephenson, Lightweight Causal and Atomic Multicast, ACM TOCS, August 1991.

2.      David Cheriton and Dale Skeen, Understanding the Limitations of Causally and Totally Ordered Communication, ACM SOSP, December 1993.

Systems:
K. Birman et. al., Bimodal Multicast, ACM TOCS.
Papers on DHTs, gossiping algorithms and their implementation:
Trickle/Stanford, PlanetP/Rutgers, and DHTs:
Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica. Wide-area cooperative storage with CFS, SOSP 2001.
Muthitacharoen et al., Ivy: A Read/Write Peer-to-Peer File System, OSDI 2002.

Term Paper Topics: Reliable Multicast Protocols, Virtual Synchrony, Group Communication Systems, Real-time or other constraints. Publish/subscribe systems (e.g., Gryphon)

Midterm Examination

Monday, March 3 - Includes papers covered in class through February 25

Abstractions for Supporting Distributed Applications II: Replicated Objects

Background: F. Torres, M. Ahamad and M. Raynal,  Timed Consistency for Shared Distributed Objects , PODC 1999.

1.      F. B. Schneider,   Implementing Fault-tolerant Services Using the State Machine Approach: A Tutorial,  Computing Surveys, 1990.

2.      Gifford, D., Weighted Voting for Replicated Data, ACM Symp. on Operating Systems Principles, December 1979.

3.      Malkhi, D. and Reiter,  M.  Byzantine Quorum Systems , Journal of Distributed Computing, 1998.

4.      Terry, D. B. et. al., Session guarantees for weakly consistent replicated data, 1994 PDIS.

5.      Ahamad, M. and Kordale, R. Scalable Consistency Protocols for Distributed Services IEEE Transaction on Parallel and Distributed Systems. 1999.
(Background reading: Mustaque Ahamad, Jim Burns, Phillip Hutto, Prince Kohli and Gil Neiger, Causal Memory, Distributed Computing, 1995.)

Systems using such services:
H. Yu and A. Vahdat.  The Costs and Limits of Availability of Replicated Services , SOSP 2001.
Danco Davcev and W.A. Burkhard. Consistency and Recovery Control for Replicated Files. In Proc. 10th ACM SOSP 1985. (likely to be skipped).
Karin Petersen, Mike J. Spreitzer, Douglas B. Terry, Marvin M. Theimer and Alan J. Demers, Flexible update propagation for weakly consistent replication. SOSP 1997 Pages 288-301.

Term Paper Topics: Scalable Consistency Protocols, Update conflict detection/resolution, consistency protocols with constraints (e.g., timing, location, ...).

Naming, Resource Finding, and Peer-to-Peer Systems

1.      Mullender, S., Vitany, P., Distributed Match-Making, Algorithmica, No.3, 1988. (skipped)

2.      Steen, M., Hauck, F., Homburg, P. and Tanenbaum, A. Locating objects in wide-area systems. IEEE Communications Magazine 1998.

3.      Stoica, I., Morris, R., Karger, D., Kaashoek, M. F. and Balakrishnan, H.,  Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications , TON.

4.      David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek, Robert Morris. Resilient Overlay Networks. Proc. 18th ACM SOSP, Banff, Canada, October 2001.

5.      Landon P. Cox, Christopher D. Murray, and Brian D. Noble, Pastiche: Making Backup Cheap and Easy, OSDI 2002.

Other P2P Systems:
     
P2P ( FarsiteSamsara, Eigentrust and Strategyproof Mechanisms)

Term Paper Topics: Naming, Infrastructures for P2P, Structured and Unstructured Overlays, Constructing Distributed Systems and Applications (e.g., Pastry, Tapestry), FreeNet, Gnutella spec, P2P

Mobility

1.      Badrinath et. al., Designing Distributed Algorithms for Mobile Computing Networks. ICDCS.

2.      M. Satyanarayanan, Fundamental Challenges in Mobile Computing, PODC 1995.

3.      L. B. Mummert, M. R. Ebling and M. Satyanarayanan,  Exploiting weak connectivity for mobile file access. SOSP 1995 Pages 143-155.

Term Paper Topics: Algorithms adapted to mobile systems (e.g., radio issues, connectivity (e.g., reliable multi-hop routing), coordination in sensor networks, real-time communication and coordination, location management, end-to-end connectivity, structured (i.e., topologies) vs. unstructured (gossiping) communication, incentive-based approaches, ...).

Security in Distributed Systems - skipped due to overlap with Secure Systems Class

1.      Lampson, B., Abadi, M. and Burrows, M,  Authentication in Distributed Systems: Theory and Practice , ACM Transactions on Computer Systems, 1992.

2.      M. Kaminsky, G. Saviddes, D. Mazieres and M. F. Kaashoek,  Decentralized User Authentication in a Global File System ,  SOSP 2003.

Case Study

1.      Marcos K. Aguilera, et. al,  Sinfonia: A New Paradigm for Building Scalable Distributed Systems, SOSP 2007.

2.      Guiseppe DeCandia, et. al,  Dynamo: Amazon's Highly Available Key-Value Store,  SOSP 2007.

3.   Jeffrey Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004

4.   Fay Chang, et. al. Bigtable: A Distributed Storage System for Structured Data, OSDI 2006


Gaming Infrastructures and Environments

  1. "Stratagus" open source gaming engine.
    1. "Wargus" is one such game built on top of stratagus.
    2. There are other examples.
    3. Professor Ashwin Ram's group in AI uses this engine. Contact them for other possible uses.
  2. Second Life has released its client code (not the server code). Some interesting projects enhancing client code may be possible.
  3. Ogre 3D is another possibility.
  4. Solipsis
  5. list of other open source games. Some other examples:
    1. Sauerbraten (a.k.a. Cube 2)
    2. Fight Win Prevail
    3. Quake 3 (only engine is free, game data has to be bought)


Gaming Related Papers

The following papers on the listed topics will be discussed in class. Other related papers will be assigned for reading. Homework will include material from papers assigned for reading. For papers without links, you should be able to download them from ACM Digital Library or IEEE Xplore through GATech Library.

Infrastructures and Architectures

1.      Christophe Diot and Laurent Gautier. A Distributed Architecture for Multiplayer Interactive Applications on the Internet IEEE Networks magazine, vol. 13, no. 4, July/August 1999.

2.      K Guo, S Mukherjee, S Rangarajan, S Paul. A fair message exchange framework for distributed multi-player games 2nd workshop on Network and system support for games, 2003.

3.      D Saha, S Sahu, A Shaikh. A service platform for on-line games ACM 2nd workshop on Network and system support for games 2003.

4.      S. Fiedler, M. Wallner, and M. Weber. A communication architecture for massive multiplayer games In Proceedings of Workshop on Network and System Support for Games (NETGAMES), Braunschweig, Germany, April 2002.

5.      R. K. Balan,M. Ebling, P. Castro, and A.Misra. Matrix: Adaptive middleware for distributed multiplayer games. In Proceedings of the 6th ACM/IFIP/USENIX International Middleware Conference (Middleware), Nov. 2005.

6.      Ashwin Bharambe, Jeffrey Pang and Srinivasan Seshan. Colyseus: A Distributed Architecture for Multiplayer Games. In ACM/USENIX NSDI 2006, San Jose, USA. (slides extended tech report)

Peer-to-peer

1.      B Knutsson, H Lu, W Xu, B Hopkins. Peer-to-Peer Support for Massively Multiplayer Games IEEE Infocom, 2004.

2.      Takuji Iimura, Hiroaki Hazeyama, Youki Kadobayashi. Zoned federation of game servers: a peer-to-peer approach to scalable multi-player online games 3rd ACM SIGCOMM workshop on Network and system support for games, 2004.

3.      J. Keller (France Telecom Res. & Dev., Issy Moulineaux, France) and G. Simon. Toward a peer-to-peer shared virtual reality Proceedings 22nd International Conference on Distributed Computing Systems Workshops, 2002, p 695-700.

4.      A. Yu (Dept. of Oceanogr., British Columbia Univ., Vancouver, BC, Canada) and S.T. Vuong. MOPAR: a mobile peer-to-peer overlay architecture for interest management of massively multiplayer online games Proceedings of the 15th International Workshop on Network and Operating Systems Support for Digital Audio and Video. NOSSDAV 2005, p 99-104.

Specific Issues

1.      Markus Bylund and Fredrik Espinoza. Testing and demonstrating context-aware services with Quake III Arena CACM special issue "Game Engines in Scientific Research".

2.      Joseph Pelligrino and Constantinous Dovrolis. Bandwidth requirement and state consistency in three multiplayer game architectures ACM ????.

3.      Claypool, LaPoint, Winsolow. Network analysis of Counter-strike and Starcraft IEEE - Performance, Computing, and Communications Conference, 2003. 9-11 April 2003 Page(s):261 - 268

4.      Lothar Pantel, Lars C. Wolf. On the impact of delay on real-time multiplayer games 12th international workshop on Network and operating systems support for digital audio and video, 2002.

5.      E. Cronin, B. Filstrup, A. R. Kurc, S. Jamin. An Efficient Synchronization Mechanism for Mirrored Game Architectures Proceedings of NetGames2002 Workshop, Germany, April 2002.

6.      Yow-Jian Lin, K. Guo, S. Paul. Sync-MS: synchronized messaging service for real-time multi-player distributed games 10th IEEE International Conference on Network Protocols, 2002.

7.      S. Yamamoto, Y. Murata, K. Yasumoto, and M. Ito. A distributed event delivery method with load balancing for mmorpg In Proc. of the 4th Workshop on Network and system support for games (NETGAMES), 2005.

8.      B Knutsson, H Lu, W Xu, B Hopkins. A dynamic load sharing algorithm for massively multiplayer online games IEEE Infocom, 2004.

9.      B Knutsson, H Lu, W Xu, B Hopkins. Transfer Learning in Real-Time Strategy Games Using Hybrid CBR/RL IEEE Infocom, 2004.

10.      Ashwin Bharambe, Venkat Padmanabhan and Srinivasan Seshan. Supporting Spectators in Online Multiplayer Games, HOTNETS-III, San Diego, November 2004.

Game Design

1.      MUD-Dev mailing list archives or mirror 1 or mirror 2. This is the 62 MB zipped collection of the mailing list where high-level technical and design issues were worked out originally for the massively multi-player online games. While it had its roots in the MUD text games, people from games like EverQuest were also involved. It is a good place to understand why particular technical and game-play decisions were made for this style of game.



Class Wiki

The Class Wiki provides a communication channel for all participants in the class.


Project Information

Guidelines

The purpose of a student-defined project is to (1) involve you in ongoing research projects, (2) leverage your unique background in some way, and/or (3) leverage other work in which you are involved. In general, any special project should be of a caliber that can generate results publishable in reviewed outlets like workshops or conferences (often requiring some additional work beyond the time spent in this class). You may link your project work with the topics investigated in the term papers, but both are separate deliverables.

Proposals

To propose your project for this class, you must submit the following materials:

The first step must include both development (e.g., coding, experimentation) and background work, such as producing a bibliography of relevant papers and having read them, having designed suitable algorithms/approaches, and having learned or looked at suitable tools to be used for your project, including target platforms.

The second step, typically after the class midterm, should involve having produced much of the software necessary and having debugged it.

The third step should include not only software testing but also evaluation, on the platforms you have chosen. Such evaluations may include theoretical results if you choose to develop or experiment with a novel algorithm, for instance, but it should also include experimental results.

The final deliverable not only includes the actual software but also a report, which is outlined next.

Final Report

The on-line final report regarding your project should have the following parts:


Facilities Available for CS7210 Projects

1. 'Hack' cluster (physically part of the NetLab facility described under 2. below):

Contact Sanjay Kumar (ksanjay) or Ivan Ganev (ganev@cc) for details on hardware/software setup.

2. NetLab cluster machine: cluster of over 40 machines, able to emulate arbitrary Internet topologies and able to run multiple operating systems. Also able to run mobile system emulations using the MobiEmu support infrastructure.

Contact Himanshu Raj (rhim@cc) or Sanjay Kumar (ksanjay@cc) for more information and machine access.

3. High end cluster facilities, one comprised of 30 dual Itanium IIs, the other comprised of over 50 dual 64-bit PIVs.

Contact Matt Wolf (mwolf@cc) for more information and machine access.

4. 'Enterprise' computing clusters: a typical  3-tier setup using an IBM blade server as a backend and HP and IBM and workstations as frontends.

Contact Mohammed Mansour (mansour@cc) for more detail and/or for hardware access.

5. Handheld computers using wireless links, including Compaq iPAQ and XScale-based (Sitsang) machines.

Contact Ripal Nathuji (rnathuji@cc) for additional information.

6. Embedded communication subsystems, using Intel's IXP2400 network routers on a small cluster machine:

Contact Ada Gavrilovska (ada@cc) for more information on IXPs.

7. For research involving high performance systems or graphics, large-scale visualization media, including an Immersadesk, a video wall, an Access Grid node, and potential access to high end network links connecting Georgia Tech with remote institutions.

Contact Matt Wolf (mwolf@cc) for more information.

8. Special hardware, including virtualization-capable (VT instruction set) and multi-core machines.

Contact Himanshu Raj (rhim@cc) or Matt Wolf (mwolf@cc) for more information.

Sample Projects

These are simply some ideas that came to my mind. You are NOT bound by these ideas. In fact, I'd welcome alternative ideas, especially those involving other infrastructures like Chord, Pastry, ... or those involving innovative distributed algorithms or methods:

  1. Online Massively Multiplayer games and environments like World of Warcraft and Second Life demonstrate many of the distributed systems concepts emphasized in this class. With Second Life's client being released as open source and other online environments available in source code, new interaction mechanisms demonstrating different distributed computing concepts not currently exploited by the system can be added. For example, developing a distributed server approach where the state of the game world is maintained by various connected clients and shared among them. Having state synchronization, replication, and inter-node communication for a volatile client set would afford a larger, more detailed, fault tolerant, and scalable world than single-server solutions. (contact lofstead@cc).
  2. Experiment/work with Microsoft's .net infrastructure. For instance, first use its SOAP remote invocation functionality, then enhance it to improve SOAP performance. Enhancements are possible along two directions: 1) work with other students to design a new implementation of the SOAP protocol based on an open source release associated with Apache (contact  bala@cc for more information), then (2) further extend its ability to deal with larger application data.
  3. Work with realistic enterprise system  infrastructures, such as IBM's Websphere or open source products like JBOSS, or SOAP implementations. Sample work with such infrastructures involves changing their key communication primitives (e.g., making RMI  network aware) or developing or enhancing applications that require their rich functionality (contact mansour@cc or sandip@cc).
  4. Industry developments in middleware involve three different support infrastructures, one being the Java-based approaches like Websphere, another centered around .net, and a third focused the publish/subscribe model of communication. Our research group has been developing a high performance pub/sub infrastructure, termed  ECho, which has recently been extended to make it easier to create efficient very large scale pub/sub applications and to create new failure recovery methods. The extensions makes it easier to dynamically create overlay networks mapped to appropriate machines and network links. They also make it easy to create new methods for dealing with failures, including those studied in this class. Students can perform a wide variety of projects, including (1) automatic generation of and management of certain pub/sub functionality (e.g., the operators applied to the events  traversing the overlay network) (contact lofstead@cc or vibhore@cc, the former experimenting with XML-based representations of such operators and the latter involved with database operator-like formulation; (2) algorithms to create and update overlay networks, based on network feedback (contact ztcai@cc) or based on other resource monitoring (contact sandip@cc) or formulations of  distributed trust (contact ramesh@cc) or dealing with failures (contact vibhore@cc); and (3) new applications that utilize the rich middleware functionality provided by the overlay or pub/sub models (contact the instructor). Also of interest are uses of infrastructures like these in mobile systems (e.g., with the robotics applications we are studying jointly with Tucker Balch) (contact kjohara@cc for more information).
  5. Develop sample distributed applications. Most interesting are: (1) ubiquitous or embedded applications that involve both real-time sensing and reactions to sensor data and/or use portable or mobile machines, (2) experimentation with adaptive applications, including video transfers (raw video or MPEG/SPEG encoded) and evaluating their ability to run on wireless and embedded devices and systems, (3) innovative distributed applications including video games, remote robot control, immersive systems, using real-time feeds (e.g., gotten from Atlanta's traffic web pages or from sports actions) and/or large data sets (e.g., earth observational data). Propose to instructors and also contact mwolf@cc or eisen@cc for  information or project ideas.

Useful Information

These are the presentations on Jan 14th which would help you find out your own project topic.

1. Distributed Games:  for further information, please contact Balasubramanian Seshasayee (bala@cc)

BZFlags: Overview, Architecture and Projects

BZFlags: Features, Design and Issues

2. Distributed Service Application: for further information, please contact Peter Budny (bigpeteb@cc)

Service-Oriented Architecture Benchmark

3. Hermes: Distributed Information Service: for further information, please contact Dr. Matt Wolf (mwolf@cc)


Sample Exam

2007 MidExam



Last Updated: 2008 January 15th