Instructor: Mustaque Ahamad (mustaq@cc.gatech.edu)
Office: CoC 221
Phone: 894-2593
Office Hrs: MW 10:00-11:00am
TA: Matt Moyer
Office Hrs: TBA
Distributed computing systems have become pervasive. From clusters to internetworked computers, we are using distributed systems to support a wide variety of applications. This course will focus on the fundamentals of distributed computing systems. The following are the objectives of this course:
We will achieve these goals by covering a set of papers in class that discuss the core topics. These topics include global states of distributed computations, logical and physical clocks, and various failure models. Distributed algorithms for consensus, replicated state management, and resource finding will also be covered. We will also address issues that arise in the context of wide area distributed systems. In addition, each student will select a set of related papers and produce a short term paper describing the problems and results discussed in the papers. These summaries will be shared with all students and, if necessary, discussed in class. Finally, all students will be required to complete a programming project. Students have considerable flexibility in defining their own projects and, depending on the nature of the project, may work in groups.
After successfully completing this course, it is expected that students will be ready to explore research problems in distributed systems, and applications that are deployed in such systems. The course is intense and class participation is a must.
Advanced undergraduate or graduate course in operating systems or permission of instructor.
The following papers on the listed topics will be discussed in class. Other related papers will be assigned for reading. Homeworks will include material from papers assigned for reading.
M. Raynal and M. Singhal. Logical Time: A Way to Capture Causality in Distributed Systems. IRISA Technical Report.
David Mills. Network Time Protocol, RFC 1305.
Chandy, M. and Lamport, L., Distributed Snapshots: Determining Global States of Distributed Systems, ACM Trans. on Computer Systems, February 1985.
Schwarz, R. and Mattern, F., Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail, Distributed Computing, 1994.
Term Paper Topics: Distributed Checkpointing/Recovery, Clock Synchronization Algorithms.
Survey of failures in distributed systems (ch. 2, Mullender)
M. J. Fischer, N. Lynch and M. S. Paterson, Impossibility of distributed consensus with one faulty process, JACM 32, 1985.
Danny Dolev, Cynthia Dwork and Larry Stockmeyer, On the Minimal Synchronism Needed for Distributed Consensus, JACM, January 1987.
Term Paper Topics: Byzantine Failures, Costs of Consensus in Synchronous Systems, Probabilistic Consensus.
Ken Birman, Andre Schiper and Pat Stephenson, Lightweight Causal and Atomic Multicast, ACM TOCS, August 1991.
David Cheriton and Dale Skeen, Understanding the Limitations of Causally and Totally Ordered Communication, ACM SOSP, December 1993.
K. Birman. A Response to Cheriton and Skeen's Criticism of Causally and Totally Ordered Communication. Cornell University Technical Report TR93-1390.
Term Paper Topics: Reliable Multicast Protocols, Virtual Synchrony.
Gifford, D., Weighted Voting for Replicated Data, ACM Symp. on Operating Systems Principles, December 1979.
Danco Davcev and W.A. Burkhard. Consistency and Recovery Control for Replicated Files. In Proc. Tenth ACM Symposium on Operating Systems Principles, Operating Systems Review, 1985.
Mustaque Ahamad, Jim Burns, Phillip Hutto, Prince Kohli and Gil Neiger, Causal Memory, Distributed Computing, 1995.
Terry, D. B. et al., Session guarantees for weakly consistent replicated data, 1994 PDIS.
Term Paper Topics: Scalable Consistency Protocols, Update conflict detection/resolution.
Mullender, S., Vitanyi, P., Distributed Match-Making, Algorithmica, No.3, 1988.
van Steen, M., Hauck, F., Homburg, P. and Tanenbaum, A. Locating objects in wide-area systems. IEEE Communications Magazine.
Badrinath et al., Designing Distributed Algorithms for Mobile Computing Networks. ICDCS.
M. Satyanarayanan, Fundamental Challenges in Mobile Computing, PODC 1995.
Term Paper Topics: Naming, Distributed Algorithms for Mobile Systems.
Vahdat, A. et al. WebOS: Operating System Services for Wide Area Applications. Technical report UCB-CSD-97-938.
Liu, C. and Cao, P. Maintaining Strong Cache Consistency in the World-Wide Web. ICDCS 1997.
Ahamad, M. and Kordale, R. Scalable Consistency Protocols for Distributed Services. IEEE Transactions on Parallel and Distributed Systems, 1999.
Belani, E., Vahdat, A., Anderson, T. and Dahlin, M. The CRISIS Wide-area Security Architecture. 1998 Usenix Security Symposium.
Term Paper Topics: CORBA/DCOM Middleware, Wide area replication, Web caching.
Journal and conference papers. The following books are useful references.
| Evaluation Item | Credit |
| --- | --- |
| Two Homeworks | 10% |
| Examinations | 50% |
| Term Paper | 10% |
| Programming Project | 20% |
| Class Participation | 10% |