Midterm will be on Oct 8th, 2009.
Midterm and Final will be in-class exams, closed book, covering papers studied in class
Distributed computing systems have become pervasive. From clusters to internet-worked computers, to mobile machines, distributed systems are being used to support a wide variety of applications. This course will cover both fundamental concepts in distributed applications and infrastructrures and systems enabling such applications. The following are the objectives of this course:
These goals are achieved as follows. First, class sessions cover a set of
papers that discuss basic principles in distributed systems as well as
representative systems and infrastructures. Principles covered in the course
include global states of distributed computations, logical and physical clocks,
and failure models. Distributed algorithms for consensus, replicated state
management, and resource finding are also covered. Midterm and final exams
feature questions on the papers covered in class. A second element of the class
exposes students to representative distributed applications, systems, and
infrastructures that support them. This includes wide area applications (e.g.,
peer to peer) and systems (e.g., DHTs, distributed stores), high performance
applications or algorithms across high end machines and network links (e.g.,
cloud computing, using Georgia Tech's cluster machines and facilities), mobile
applications running on pervasive platforms, and applications and infrastructures
from specific domains (e.g., data streaming applications from the
publish/subscribe domain, computer games). Programming projects evaluate
students' knowledge for these practical class components. A third element of
the class introduces students to rigorous academic research, where for a
specific topic and some set of papers covered in class, where each student
produces a term paper covering some technical topic and their projects. These
term papers are presented in class. There is considerable flexibility in
defining class projects and topics for term papers. Concerning projects,
students can formulate their own, and depending on the nature of the project,
they can work in groups. Finally, each student (or small group of students) is
responsible for presenting one of the papers discussed in class.
After successfully completing this course, students should be ready to explore research problems in distributed systems and on emerging applications deployed in such systems. This rigorous course requires strong class participation, and part of a student's grade is derived from class presentations.
CS 4210 or equivalent. Graduate standing strongly suggested.
40% Examinations
40% Programming Project
20% Class Participation (includes presentation)
A passing grade is required in each of the above components in order to
pass the class.
The class will primarily use journal and conference papers. Some useful background appears in the Mullender and Singhal textbooks on distributed computing.
The following papers on the listed topics will be discussed in class. Other related papers will be assigned for reading. Papers labeled as background are optional, for interested students. For papers without links, you should be able to download them from the ACM Digital Library or IEEE Xplore through the GATech Library.
Required background: Lamport, L., " Time, Clocks, and the Ordering of Events in a Distributed System ", Communications of the ACM, 21, 7, pgs. 558-565, July 1978. (see CS 6210)
1. M. Raynal and M. Singhal. Logical Time: A Way to Capture Causality in Distributed Systems. IRISA Technical Report.
2. Jeremy Elson, Lewis Girod, and Deborah Estrin, Fine-Grained Network Time Synchronization Using Reference Broadcasts, OSDI 2002 (presentation includes brief review of how network time protocols are implemented in the Internet - NTP).
3. Chandy, M. and Lamport, L., Distributed Snapshots: Determining Global States of Distributed Systems, ACM Trans. on Computer Systems, February 1985.
4. Schwarz, R. and Mattern, F., Detecting Causal Relationships in Distributed Systems: In Search of the Holy Grain, Distributed Computing, 1994.
1. Survey of failures in distributed systems (Ch. 2, Mullender)
2. M. Castro, B. Liskov, Practical Byzantine Fault Tolerance, OSDI, Feb. 1999.
3. Abd-El-Malek, Ganger, Fault-Scalable Byzantine Fault-Tolerant Services, SOSP 2005.
4. Dunagan et al, Fuse: Light-weight Distributed Failure Notification, OSDI 2004.
5. The Google File System (SOSP 03) and HDFS (Apache.org)
Additional background reading:
Danny Dolev, Cynthia Dwork and Larry Stockmeyer, On the Minimal Synchronism Needed for Distributed Consensus, JACM, January 1987.
Tushar Deepak Chandra, Vassos Hadzilacos and Sam Toueg; The weakest failure detector for solving consensus, JACM 43, 4 (Jul. 1996), Pages 685 - 722.
M. J. Fischer, N. Lynch and M. S. Patterson, Impossibility of distributed consensus with one faulty process, JACM 32, 1985.
1. Ken Birman, Andre Schiper and Pat Stephenson, Lightweight Causal and Atomic Multicast, ACM TOCS, August 1991.
2. David Cheriton and Dale Skeen, Understanding the Limitations of Causally and Totally Ordered Communication, ACM SOSP, December 1993.
3. K. Birman et. al., Bimodal Multicast, ACM TOCS.
4. Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica. Wide-area cooperative storage with CFS, SOSP 2001.
5. Muthitacharoen et al., Ivy: A Read/Write Peer-to-Peer File System, OSDI 2002.
Other topics of interest: Reliable Multicast Protocols, Virtual Synchrony, Group Communication Systems, Real-time or other constraints. Publish/subscribe systems (e.g., Gryphon). Gossiping algorithms and their implementation: Trickle/Stanford;
1. F. B. Schneider, Implementing Fault-tolerant Services Using the State Machine Approach: A Tutorial,, Computing Surveys, 1990.
2. Gifford, D., Weighted Voting for Replicated Data, ACM Symp. on Operating Systems Principles, December 1979.
3. H. Yu and A. Vahdat. The Costs and Limits of Availability of Replicated Services , SOSP 2001.
4. Ahamad,
M. and Kordale, R. Scalable Consistency Protocols for Distributed Services,
IEEE Transaction on Parallel and Distributed Systems. 1999.
(Useful background: Mustaque Ahamad, Jim Burns, Phillip Hutto, Prince Kohli and
Gil Neiger, Causal Memory, Distributed Computing, 1995.)
Other interesting readings:
Malkhi, D. and Reiter, M. Byzantine Quorum Systems , Journal of Distributed Computing, 1998.
F. Torres, M. Ahamad and M. Raynal, Timed Consistency for Shared Distributed Objects , PODC 1999.
Terry, D. B. et. al., Session guarantees for weakly consistent replicated data, 1994 PDIS.
Danco Davcev and W.A. Burkhard.
Consistency and Recovery Control for Replicated Files.
In Proc. 10th ACM SOSP 1985.
Karin Petersen, Mike J. Spreitzer, Douglas B. Terry, Marvin M. Theimer and Alan J. Demers, Flexible update propagation for weakly consistent replication. SOSP 1997 Pages 288-301.
1. Steen, M., Hauck, F., Homburg, P. and Tanenbaum, A. Locating objects in wide-area systems. IEEE Communications Magazine 1998.
2. David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek, Robert Morris. Resilient Overlay Networks. Proc. 18th ACM SOSP, Banff, Canada, October 2001.
3. Landon P. Cox, Christopher D. Murray, and Brian D. Noble, Pastiche: Making Backup Cheap and Easy, OSDI 2002.
Other readings and P2P
Systems:
P2P ( Farsite, Samsara, Eigentrust and Strategyproof
Mechanisms)
Mullender, S., Vitany, P., Distributed Match-Making, Algorithmica, No.3, 1988.
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F. and Balakrishnan, H., Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications , TON.
Other Topics: Naming, Infrastructures for P2P, Structured and Unstructured Overlays, Constructing Distributed Systems and Applications (e.g., Pastry, Tapestry), FreeNet, Gnutella spec, P2P
1. Guiseppe DeCandia, et. al, Dynamo: Amazon's Highly Available Key-Value Store, SOSP 2007.
2. Jeffrey Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004.
3. Fay Chang, et. al. Bigtable: A Distributed Storage System for Structured Data, OSDI 2006.
4. The Many Faces of Pub/Sub - ACM Surveys
Marcos K. Aguilera, et. al, Sinfonia: A New Paradigm for Building Scalable Distributed Systems, SOSP 2007.
1. Badrinath et. al., Designing Distributed Algorithms for Mobile Computing Networks. ICDCS.
2. M. Satyanarayanan, Fundamental Challenges in Mobile Computing, PODC 1995.
3. L. B. Mummert, M. R. Ebling and M. Satyanarayanan, Exploiting weak connectivity for mobile file access. SOSP 1995 Pages 143-155.
Other Topics: Algorithms adapted to mobile systems (e.g., radio issues, connectivity (e.g., reliable multi-hop routing), coordination in sensor networks, real-time communication and coordination, location management, end-to-end connectivity, structured (i.e., topologies) vs. unstructured (gossiping) communication, incentive-based approaches, ...).
1. Lampson, B., Abadi, M. and Burrows, M, Authentication in Distributed Systems: Theory and Practice , ACM Transactions on Computer Systems, 1992.
2. M. Kaminsky, G. Saviddes, D. Mazieres and M. F. Kaashoek, Decentralized User Authentication in a Global File System , SOSP 2003.
The following papers on the listed topics will be discussed in class. Other related papers will be assigned for reading. Homework will include material from papers assigned for reading. For papers without links, you should be able to download them from ACM Digital Library or IEEE Xplore through GATech Library.
1. Christophe Diot and Laurent Gautier. A Distributed Architecture for Multiplayer Interactive Applications on the Internet IEEE Networks magazine, vol. 13, no. 4, July/August 1999.
2. K Guo, S Mukherjee, S Rangarajan, S Paul. A fair message exchange framework for distributed multi-player games 2nd workshop on Network and system support for games, 2003.
3. D Saha, S Sahu, A Shaikh. A service platform for on-line games ACM 2nd workshop on Network and system support for games 2003.
4. S. Fiedler, M. Wallner, and M. Weber. A communication architecture for massive multiplayer games In Proceedings of Workshop on Network and System Support for Games (NETGAMES), Braunschweig, Germany, April 2002.
5. R. K. Balan,M. Ebling, P. Castro, and A.Misra. Matrix: Adaptive middleware for distributed multiplayer games. In Proceedings of the 6th ACM/IFIP/USENIX International Middleware Conference (Middleware), Nov. 2005.
6. Ashwin Bharambe, Jeffrey Pang and Srinivasan Seshan. Colyseus: A Distributed Architecture for Multiplayer Games. In ACM/USENIX NSDI 2006, San Jose, USA. (slides extended tech report)
1. B Knutsson, H Lu, W Xu, B Hopkins. Peer-to-Peer Support for Massively Multiplayer Games IEEE Infocom, 2004.
2. Takuji Iimura, Hiroaki Hazeyama, Youki Kadobayashi. Zoned federation of game servers: a peer-to-peer approach to scalable multi-player online games 3rd ACM SIGCOMM workshop on Network and system support for games, 2004.
3. J. Keller (France Telecom Res. & Dev., Issy Moulineaux, France) and G. Simon. Toward a peer-to-peer shared virtual reality Proceedings 22nd International Conference on Distributed Computing Systems Workshops, 2002, p 695-700.
4. A. Yu (Dept. of Oceanogr., British Columbia Univ., Vancouver, BC, Canada) and S.T. Vuong. MOPAR: a mobile peer-to-peer overlay architecture for interest management of massively multiplayer online games Proceedings of the 15th International Workshop on Network and Operating Systems Support for Digital Audio and Video. NOSSDAV 2005, p 99-104.
1. Markus Bylund and Fredrik Espinoza. Testing and demonstrating context-aware services with Quake III Arena CACM special issue "Game Engines in Scientific Research".
2. Joseph Pelligrino and Constantinous Dovrolis. Bandwidth requirement and state consistency in three multiplayer game architectures ACM ????.
3. Claypool, LaPoint, Winsolow. Network analysis of Counter-strike and Starcraft IEEE - Performance, Computing, and Communications Conference, 2003. 9-11 April 2003 Page(s):261 - 268
4. Lothar Pantel, Lars C. Wolf. On the impact of delay on real-time multiplayer games 12th international workshop on Network and operating systems support for digital audio and video, 2002.
5. E. Cronin, B. Filstrup, A. R. Kurc, S. Jamin. An Efficient Synchronization Mechanism for Mirrored Game Architectures Proceedings of NetGames2002 Workshop, Germany, April 2002.
6. Yow-Jian Lin, K. Guo, S. Paul. Sync-MS: synchronized messaging service for real-time multi-player distributed games 10th IEEE International Conference on Network Protocols, 2002.
7. S. Yamamoto, Y. Murata, K. Yasumoto, and M. Ito. A distributed event delivery method with load balancing for mmorpg In Proc. of the 4th Workshop on Network and system support for games (NETGAMES), 2005.
8. B Knutsson, H Lu, W Xu, B Hopkins. A dynamic load sharing algorithm for massively multiplayer online games IEEE Infocom, 2004.
9. B Knutsson, H Lu, W Xu, B Hopkins. Transfer Learning in Real-Time Strategy Games Using Hybrid CBR/RL IEEE Infocom, 2004.
10. Ashwin Bharambe, Venkat Padmanabhan and Srinivasan Seshan. Supporting Spectators in Online Multiplayer Games, HOTNETS-III, San Diego, November 2004.
1. MUD-Dev mailing list archives or mirror 1 or mirror 2. This is the 62 MB zipped collection of the mailing list where high-level technical and design issues were worked out originally for the massively multi-player online games. While it had its roots in the MUD text games, people from games like EverQuest were also involved. It is a good place to understand why particular technical and game-play decisions were made for this style of game.
The Class Wiki provides a communication channel for all participants in the class.
Guidelines
The purpose of a student-defined project is to (1) involve you in ongoing research projects, (2) leverage your unique background in some way, and/or (3) leverage other work in which you are involved. In general, any special project should be of a caliber that can generate results publishable in reviewed outlets like workshops or conferences (often requiring some additional work beyond the time spent in this class). You may link your project work with the topics investigated in the term papers, but both are separate deliverables.
Proposals
To propose your project for this class, you must submit the following materials:
The first step must include both development (e.g., coding, experimentation) and background work, such as producing a bibliography of relevant papers and having read them, having designed suitable algorithms/approaches, and having learned or looked at suitable tools to be used for your project, including target platforms.
The second step, typically after the class midterm, should involve having produced much of the software necessary and having debugged it.
The third step should include not only software testing but also evaluation, on the platforms you have chosen. Such evaluations may include theoretical results if you choose to develop or experiment with a novel algorithm, for instance, but it should also include experimental results.
The final deliverable not only includes the actual software but also a report, which is outlined next.
Final Report
The on-line final report regarding your project should have the following parts:
Facilities Available for CS7210 Projects
1. 'Hack' cluster (virtual machines running on new facility):
Contact Chad Huneycutt (chadh@cc) for more information.
2. NetLab cluster machine: cluster of over 40 machines, able to emulate arbitrary Internet topologies and able to run multiple operating systems. Also able to run mobile system emulations using the MobiEmu support infrastructure.
Contact Adit Ranadive (adit262@cc.gatech.edu) for more information.
3. Cloud computing facilities - to come online in mid-October.
Contact Chad Honeycutt (chadh@cc) for more information and machine access.
4. 'Enterprise' computing clusters: a typical 3-tier setup using an IBM blade server as a backend and HP and IBM and workstations as frontends.
information.
Sample Projects
Useful Information
Service-Oriented Architecture Benchmark