Project Demonstration Schedule: each team please check out available time slots and email TA to reserve time slot.
Project Final Report Due: 10:00am April 28th
Final Exam: 11:30am -- 1:30pm May 2nd, at usual place (ES&T L1105)
Midterm will include papers covered in class through Feb 25th. There is a sample exam.
Please ask TA (in KACB3201) for copy of reading material on Fault Tolerance (Section 4.5 in Andrew Tanenbaum's book Distributed Operating Systems)
Midterm: Monday, Mar 3rd
Project Update: Wednesday, Feb 27th
Project Proposal Due: Midnight Friday, Jan 18th
Please check out Class wiki. You can publish your project description and do some advertising there. It's also a good place to look for project partners.
Distributed computing systems have become pervasive. From clusters to internet-worked computers, to mobile machines, distributed systems are being used to support a wide variety of applications. This course will focus on the fundamental concepts in distributed computing systems enabling such applications. The following are the objectives of this course:
These goals are achieved as follows. First, class sessions cover a set of
papers that discuss basic principles in distributed systems. Principles covered
in the course include global states of distributed computations, logical and
physical clocks, and failure models. Distributed algorithms for consensus,
replicated state management, and resource finding are also covered. Midterm and
final exams feature questions on the papers covered in class. A second element
of the class exposes students to representative development environments and
infrastructures for distributed systems. This includes wide area applications
and infrastructures (e.g., using Websphere or its open source counterparts on
Internet-linked systems), high performance applications or algorithms across
high end machines and network links (e.g., using Georgia Tech's cluster
machines and facilities), mobile applications running on pervasive platforms
(e.g., using representative pervasive systems at Georgia Tech or the MobiEmu
emulation platform), and peer to peer systems or applications running on PlanetLab.
Programming projects evaluate students' knowledge for this class component. A
third element of the class introduces students to rigorous academic research,
where for a specific topic and some set of papers covered in class, each
student selects a set of recent papers in a certain area, and produces a short
term paper describing the problems and results discussed in those papers. These
summaries are shared with all students, and they are presented in class. There
is considerable flexibility in defining class projects and topics for term
papers. Concerning projects, students can formulate their own, and depending on
the nature of the project, they can work in groups. Term papers are worked on
individually, but their topics may be linked to class projects.
After successfully completing this course, students should be ready to explore research problems in distributed systems and on emerging applications deployed in such systems. This rigorous course requires strong class participation, and part of a student's grade is derived from class presentations.
CS 4210 or equivalent. Graduate standing strongly suggested.
30% Examinations
15% Term Paper
40% Programming Project
15% Class Participation
Note that a passing grade is required in each of the above components in
order to pass the class.
Journal and conference papers listed above will be made available. Some useful background appears in the Mullender and Singhal textbooks on distributed computing.
The following papers on the listed topics will be discussed in class. Other related papers will be assigned for reading. Homework will include material from papers assigned for reading. For papers without links, you should be able to download them from ACM Digital Library or IEEE Xplore through GATech Library.
1. M. Raynal and M. Singhal. Logical Time: A Way to Capture Causality in Distributed Systems. IRISA Technical Report.
2. David Mills. Network Time Protocol, RFC 1305. (skipped)
3. Jeremy Elson, Lewis Girod, and Deborah Estrin, Fine-Grained Network Time Synchronization Using Reference Broadcasts, OSDI 2002.
4. Chandy, M. and Lamport, L., Distributed Snapshots: Determining Global States of Distributed Systems, ACM Trans. on Computer Systems, February 1985.
5. Schwarz, R. and Mattern, F., Detecting Causal Relationships in Distributed Systems: In Search of the Holy Grain, Distributed Computing, 1994.
Term Paper Topics: Distributed Checkpointing/Recovery, Clock Synchronization Algorithms.
Background reading: Danny Dolev, Cynthia Dwork and Larry Stockmeyer, On the Minimal Synchronism Needed for Distributed Consensus, JACM, January 1987.
1. Survey of failures in distributed systems (ch. 2, Mullender)
2. M. J. Fischer, N. Lynch and M. S. Patterson, Impossibility of distributed consensus with one faulty process, JACM 32, 1985.
3. Tushar Deepak Chandra, Vassos Hadzilacos and Sam Toueg; The weakest failure detector for solving consensus, JACM 43, 4 (Jul. 1996), Pages 685 - 722.
4. M. Castro, B. Liskov, Practical Byzantine Fault Tolerance, OSDI, Feb. 1999.
Sample systems: we
will study one of them in some detail:
Cox et al., Pastiche:
Making Backup Cheap and Easy, OSDI 2002.
Abd-El-Malek, Ganger, Fault-Scalable
Byzantine Fault-Tolerant Services, SOSP 2005.
Dunagan et al, Fuse:
Light-weight Distributed Failure Notification, OSDI 2004.
Term Paper Topics: Byzantine Failures, Costs of Consensus in Synchronous Systems, Probabilistic Consensus.
1. Ken Birman, Andre Schiper and Pat Stephenson, Lightweight Causal and Atomic Multicast, ACM TOCS, August 1991.
2. David Cheriton and Dale Skeen, Understanding the Limitations of Causally and Totally Ordered Communication, ACM SOSP, December 1993.
Systems:
K. Birman et. al., Bimodal Multicast,
ACM TOCS.
Papers on DHTs, gossiping algorithms and their implementation:
Trickle/Stanford, PlanetP/Rutgers, and DHTs:
Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica. Wide-area cooperative
storage with CFS, SOSP 2001.
Muthitacharoen et al., Ivy: A Read/Write
Peer-to-Peer File System, OSDI 2002.
Term Paper Topics: Reliable Multicast Protocols, Virtual Synchrony, Group Communication Systems, Real-time or other constraints. Publish/subscribe systems (e.g., Gryphon)
Background: F.
Torres, M. Ahamad and M. Raynal, Timed Consistency
for Shared Distributed Objects , PODC 1999.
1. F. B. Schneider, Implementing Fault-tolerant Services Using the State Machine Approach: A Tutorial, Computing Surveys, 1990.
2. Gifford, D., Weighted Voting for Replicated Data, ACM Symp. on Operating Systems Principles, December 1979.
3. Malkhi, D. and Reiter, M. Byzantine Quorum Systems , Journal of Distributed Computing, 1998.
4. Terry, D. B. et. al., Session guarantees for weakly consistent replicated data, 1994 PDIS.
5.
Ahamad, M. and Kordale, R. Scalable
Consistency Protocols for Distributed Services IEEE Transaction on Parallel
and Distributed Systems. 1999.
(Background reading: Mustaque Ahamad, Jim Burns, Phillip Hutto, Prince Kohli
and Gil Neiger, Causal Memory, Distributed Computing, 1995.)
Systems using such services:
H. Yu and A. Vahdat. The Costs and Limits of Availability of Replicated Services , SOSP 2001.
Danco Davcev and W.A. Burkhard. Consistency and Recovery Control for
Replicated Files. In Proc. 10th ACM SOSP 1985. (likely to be skipped).
Karin Petersen, Mike J. Spreitzer, Douglas B. Terry, Marvin M. Theimer and
Alan J. Demers, Flexible update propagation for weakly consistent
replication.
SOSP 1997 Pages 288-301.
>
Term Paper Topics: Scalable Consistency Protocols, Update conflict detection/resolution, consistency protocols with constraints (e.g., timing, location, ...).
1. Mullender, S., Vitany, P., Distributed Match-Making, Algorithmica, No.3, 1988. (skipped)
2. Steen, M., Hauck, F., Homburg, P. and Tanenbaum, A. Locating objects in wide-area systems. IEEE Communications Magazine 1998.
3.
Stoica,
4. David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek, Robert Morris. Resilient Overlay Networks. Proc. 18th ACM SOSP, Banff, Canada, October 2001.
5.
Landon P. Cox, Christopher D. Murray, and Brian D. Noble,
Pastiche: Making Backup Cheap and Easy, OSDI 2002.
Other P2P Systems:
P2P
( Farsite,
Samsara,
Eigentrust
and Strategyproof
Mechanisms) >
Term Paper Topics: Naming, Infrastructures
for P2P, Structured and Unstructured Overlays, Constructing Distributed Systems
and Applications (e.g., Pastry, Tapestry), FreeNet, Gnutella spec, P2P
1. Badrinath et. al., Designing Distributed Algorithms for Mobile Computing Networks. ICDCS.
2. M. Satyanarayanan, Fundamental Challenges in Mobile Computing, PODC 1995.
3. L. B. Mummert, M. R. Ebling and M. Satyanarayanan, Exploiting weak connectivity for mobile file access. SOSP 1995 Pages 143-155.
Term Paper Topics: Algorithms adapted to mobile systems (e.g., radio issues, connectivity (e.g., reliable multi-hop routing), coordination in sensor networks, real-time communication and coordination, location management, end-to-end connectivity, structured (i.e., topologies) vs. unstructured (gossiping) communication, incentive-based approaches, ...).
1. Lampson, B., Abadi, M. and Burrows, M, Authentication in Distributed Systems: Theory and Practice , ACM Transactions on Computer Systems, 1992.
2. M. Kaminsky, G. Saviddes, D. Mazieres and M. F. Kaashoek, Decentralized User Authentication in a Global File System , SOSP 2003.
1. Marcos K. Aguilera, et. al, Sinfonia: A New Paradigm for Building Scalable Distributed Systems, SOSP 2007.
2. Guiseppe DeCandia, et. al, Dynamo: Amazon's Highly Available Key-Value Store, SOSP 2007.
3. Jeffrey Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
4. Fay Chang, et. al. Bigtable: A Distributed Storage System for Structured Data, OSDI 2006
The following papers on the listed topics will be discussed in class. Other related papers will be assigned for reading. Homework will include material from papers assigned for reading. For papers without links, you should be able to download them from ACM Digital Library or IEEE Xplore through GATech Library.
1. Christophe Diot and Laurent Gautier. A Distributed Architecture for Multiplayer Interactive Applications on the Internet IEEE Networks magazine, vol. 13, no. 4, July/August 1999.
2. K Guo, S Mukherjee, S Rangarajan, S Paul. A fair message exchange framework for distributed multi-player games 2nd workshop on Network and system support for games, 2003.
3. D Saha, S Sahu, A Shaikh. A service platform for on-line games ACM 2nd workshop on Network and system support for games 2003.
4. S. Fiedler, M. Wallner, and M. Weber. A communication architecture for massive multiplayer games In Proceedings of Workshop on Network and System Support for Games (NETGAMES), Braunschweig, Germany, April 2002.
5. R. K. Balan,M. Ebling, P. Castro, and A.Misra. Matrix: Adaptive middleware for distributed multiplayer games. In Proceedings of the 6th ACM/IFIP/USENIX International Middleware Conference (Middleware), Nov. 2005.
6. Ashwin Bharambe, Jeffrey Pang and Srinivasan Seshan. Colyseus: A Distributed Architecture for Multiplayer Games. In ACM/USENIX NSDI 2006, San Jose, USA. (slides extended tech report)
1. B Knutsson, H Lu, W Xu, B Hopkins. Peer-to-Peer Support for Massively Multiplayer Games IEEE Infocom, 2004.
2. Takuji Iimura, Hiroaki Hazeyama, Youki Kadobayashi. Zoned federation of game servers: a peer-to-peer approach to scalable multi-player online games 3rd ACM SIGCOMM workshop on Network and system support for games, 2004.
3. J. Keller (France Telecom Res. & Dev., Issy Moulineaux, France) and G. Simon. Toward a peer-to-peer shared virtual reality Proceedings 22nd International Conference on Distributed Computing Systems Workshops, 2002, p 695-700.
4. A. Yu (Dept. of Oceanogr., British Columbia Univ., Vancouver, BC, Canada) and S.T. Vuong. MOPAR: a mobile peer-to-peer overlay architecture for interest management of massively multiplayer online games Proceedings of the 15th International Workshop on Network and Operating Systems Support for Digital Audio and Video. NOSSDAV 2005, p 99-104.
1. Markus Bylund and Fredrik Espinoza. Testing and demonstrating context-aware services with Quake III Arena CACM special issue "Game Engines in Scientific Research".
2. Joseph Pelligrino and Constantinous Dovrolis. Bandwidth requirement and state consistency in three multiplayer game architectures ACM ????.
3. Claypool, LaPoint, Winsolow. Network analysis of Counter-strike and Starcraft IEEE - Performance, Computing, and Communications Conference, 2003. 9-11 April 2003 Page(s):261 - 268
4. Lothar Pantel, Lars C. Wolf. On the impact of delay on real-time multiplayer games 12th international workshop on Network and operating systems support for digital audio and video, 2002.
5. E. Cronin, B. Filstrup, A. R. Kurc, S. Jamin. An Efficient Synchronization Mechanism for Mirrored Game Architectures Proceedings of NetGames2002 Workshop, Germany, April 2002.
6. Yow-Jian Lin, K. Guo, S. Paul. Sync-MS: synchronized messaging service for real-time multi-player distributed games 10th IEEE International Conference on Network Protocols, 2002.
7. S. Yamamoto, Y. Murata, K. Yasumoto, and M. Ito. A distributed event delivery method with load balancing for mmorpg In Proc. of the 4th Workshop on Network and system support for games (NETGAMES), 2005.
8. B Knutsson, H Lu, W Xu, B Hopkins. A dynamic load sharing algorithm for massively multiplayer online games IEEE Infocom, 2004.
9. B Knutsson, H Lu, W Xu, B Hopkins. Transfer Learning in Real-Time Strategy Games Using Hybrid CBR/RL IEEE Infocom, 2004.
10. Ashwin Bharambe, Venkat Padmanabhan and Srinivasan Seshan. Supporting Spectators in Online Multiplayer Games, HOTNETS-III, San Diego, November 2004.
1. MUD-Dev mailing list archives or mirror 1 or mirror 2. This is the 62 MB zipped collection of the mailing list where high-level technical and design issues were worked out originally for the massively multi-player online games. While it had its roots in the MUD text games, people from games like EverQuest were also involved. It is a good place to understand why particular technical and game-play decisions were made for this style of game.
The Class Wiki provides a communication channel for all participants in the class.
Guidelines
The purpose of a student-defined project is to (1) involve you in ongoing research projects, (2) leverage your unique background in some way, and/or (3) leverage other work in which you are involved. In general, any special project should be of a caliber that can generate results publishable in reviewed outlets like workshops or conferences (often requiring some additional work beyond the time spent in this class). You may link your project work with the topics investigated in the term papers, but both are separate deliverables.
Proposals
To propose your project for this class, you must submit the following materials:
The first step must include both development (e.g., coding, experimentation) and background work, such as producing a bibliography of relevant papers and having read them, having designed suitable algorithms/approaches, and having learned or looked at suitable tools to be used for your project, including target platforms.
The second step, typically after the class midterm, should involve having produced much of the software necessary and having debugged it.
The third step should include not only software testing but also evaluation, on the platforms you have chosen. Such evaluations may include theoretical results if you choose to develop or experiment with a novel algorithm, for instance, but it should also include experimental results.
The final deliverable not only includes the actual software but also a report, which is outlined next.
Final Report
The on-line final report regarding your project should have the following parts:
Facilities Available for CS7210 Projects
1. 'Hack' cluster (physically part of the NetLab facility described under 2. below):
Contact Sanjay Kumar (ksanjay) or Ivan Ganev (ganev@cc) for details on hardware/software setup.
2. NetLab cluster machine: cluster of over 40 machines, able to emulate arbitrary Internet topologies and able to run multiple operating systems. Also able to run mobile system emulations using the MobiEmu support infrastructure.
Contact Himanshu Raj (rhim@cc) or Sanjay Kumar (ksanjay@cc) for more information and machine access.
3. High end cluster facilities, one comprised of 30 dual Itanium IIs, the other comprised of over 50 dual 64-bit PIVs.
Contact Matt Wolf (mwolf@cc) for more information and machine access.
4. 'Enterprise' computing clusters: a typical 3-tier setup using an IBM blade server as a backend and HP and IBM and workstations as frontends.
Contact Mohammed Mansour (mansour@cc) for more detail and/or for hardware access.
5. Handheld computers using wireless links, including Compaq iPAQ and XScale-based (Sitsang) machines.
Contact Ripal Nathuji (rnathuji@cc) for additional information.
6. Embedded communication subsystems, using Intel's IXP2400 network routers on a small cluster machine:
Contact Ada Gavrilovska (ada@cc) for more information on IXPs.
7. For research involving high performance systems or graphics, large-scale visualization media, including an Immersadesk, a video wall, an Access Grid node, and potential access to high end network links connecting Georgia Tech with remote institutions.
Contact Matt Wolf (mwolf@cc) for more information.
8. Special hardware, including virtualization-capable (VT instruction set) and multi-core machines.
Contact Himanshu Raj (rhim@cc) or Matt Wolf (mwolf@cc) for more information.
Sample Projects
These are simply some ideas that came to my mind. You are NOT bound by these ideas. In fact, I'd welcome alternative ideas, especially those involving other infrastructures like Chord, Pastry, ... or those involving innovative distributed algorithms or methods:
Useful Information
These are the presentations on Jan 14th which would help you find out your own project topic.
1. Distributed Games: for further information, please contact Balasubramanian Seshasayee (bala@cc)
2. Distributed Service Application: for further information, please contact Peter Budny (bigpeteb@cc)
3. Hermes: Distributed Information Service: for further information, please contact Dr. Matt Wolf (mwolf@cc)