Spring 2003 Systems Qualifier Examination

Instructions: Please answer any 6 out of 9 questions

Give detailed answers to questions mentioning all the assumptions you are making and be comprehensive in your answers showing the details of your approach and understanding of the problem.

1. Communications/OS

Lightweight communication protocols have become an important topic in high-end applications and systems. Attributes of such protocols with respect to OS research include 1) overheads imposed on end systems, 2) latency, and 3) configurability.

 - Define the meaning of `overhead' vs. latency, and explain why both are not identical measures.
 - Identify two specific examples of protocol configurability and for each, identify their utility for performance-critical applications.
 - List the topics to consider if you were designing and implementing a low-latency, low-    overhead communication protocol.

2. Middleware

Your reading list has a substantial number of references on object-oriented distributed systems. In more modern terms, such infrastructures tend to be called 'middleware' when they implement a distributed OO layer across underlying OS platforms. In this question, you are asked to identify the technical linkages between middleware and underlying operating systems and networks, that is, you are asked to think about how middleware interacts with and uses underlying platforms.
 - List at least four such linkages (some being CPU centric, others communication centric).
 - Describe for at least two of these (one CPU, one comm.) configurability at the middleware level that would be useful for end user applications and/or for underlying     platforms (e.g., consider heterogeneity).
 - Identify two techniques in online monitoring that might be usefully associated with such configurability.

3. In peer-to-peer (P2P) distributed computing, several new issues arise that have not been addressed fully in past research. For example, because there are no central servers, peers must employ methods to locate resources and services they need using distributed algorithms that are appropriate in this environment. Several systems (e.g., Chord) have been proposed which use distributed hash tables (DHT) based to find resources. They typically require log n messages and time to find a resource. Consider a P2P system in which 80% of the requests are for a small set of popular resources whereas the remaining 20% requests are for resources hat are infrequently accessed by peer nodes. Can you adapt the DHT based schemes to this environment so that popular resources can be found more efficiently?  The infrequently accessed resources may be more expensive to find but once such a resource becomes popular, the cost of finding it must decrease.

Discuss the algorithms that you propose as well as their performance. Since a peer may choose to disconnect and not participate in the execution of the algorithm, how does your algorithm handle such disconnections?

Distributed Systems

4. Many believe that the design of secure distributed systems shares many problems that have been addressed in dependable or reliable distributed systems. For example, protocols that tolerate Byzantine failures, which capture the behavior of a malicious host that can behave arbitrarily, have been explored for several problems that arise in distributed computing systems. Consider the problem of building a highly available distributed services that can provide critical information to people who respond to emergency situation. Some of the responders invoke functions supported by the service to store information about conditions on the ground whereas others may invoke functions to read such information. To ensure that the service is highly available, it is replicated at multiple nodes. However, some subset of these nodes may fail, including in Byzantine mode, during the operation of the service.

Develop a replicated service management protocol that allows a client to send a request to the various service instances and receive a response as long as a bounded number of them are faulty. The service instances must coordinate among themselves to ensure that they behave as a single service.  How many of these instances can be become faulty if the single service abstraction is to be maintained?

5. Distributed Simulation

(A) Several algorithms are reported in the distributed computing literature to implement causal ordering.  Why are these algorithms seldom used in distributed simulations of (say) communication networks used to evaluate new communication protocols?

(B) Distributed simulations used to create virtual environments seldom use algorithms to ensure time stamp ordering of events.  Explain why.

(C) The Chandy/Misra/Bryant null message algorithm assumes reliable delivery of messages.  Consider relaxing this assumption, i.e., assume some messages are lost.  Will lost messages always result in a failure of the algorithm?  Explain why or why not.  Describe what problems can come up if messages are lost.

6. Cluster-based Web Cache System
 
Your new company, MyPortal, provides cluster-based software for ISP's that allows them to do simple firewalling and Web caching.  A customer installs a SIM caching cluster according to their expected user population size.  Under normal conditions (typical Web workloads, no undue network congestion effects, etc.), a single cluster node can handle about 60 hits per second.

Some of your more sophisticated customers tell you that they would like to be able to "augment" the caching with additional capabilities. For example, @mobi.com has many users who use PDA's who access the Web through their ISP cluster, so in addition to fetching and caching a page, they would like to be able to perform simple transformations on data, depending on whether the client is a Windows PDA, Palm PDA, or regular desktop browser (IE/Netscape).  So you agree to design an API that will allow them to insert "plug-ins" into the cache stream.

Below is the simplified control flow of your existing architecture.  Your software architecture is concurrent: a single thread handles the incoming request, but after the 'get page from web' procedure returns, the thread forks into two so that the "store in cache" and "deliver to client" code paths can be executed in parallel.

In the original logic diagram above, all of your functional blocks correspond to methods inside a single process.  (The "store in cache" and "deliver to client" blocks can be executed by different threads, but they are user-level threads in the same process space.) @Mobi.com wants to know if their plug-in module will run

(a) in the same process space as a separate procedure call that is part of, or
(b) in the same process space but as a separate (forked) thread, or
(c) in a different process space.  

Thoroughly enumerate the pros and cons of each approach. Consider performance, ease of programming, capacity provisioning, and failure management.  Which approach (a, b or c) would you ultimately recommend and why?

Please write down your answers to the best of your knowledge.

7. The GMS system described in Feeley95 is one of several schemes for using the main memory of remote workstations to store memory pages that would otherwise have to be stored on disk.  GMS adopts a design decision that pages cached in remote workstations are always "clean", i.e. they are always consistent with the copy on disk. Evaluate this design decision.  What are the tradeoffs between keeping only clean copies versus allowing dirty copies?  Be as quantitative as possible and include the effects of projected technology trends.

8. Real-Time Systems

Schedulability Analysis.  In classic real-time systems, jobs are classified as either periodic or sporadic.  Periodic jobs arrive on regular intervals (the period) and have predictable execution times.  Sporadic jobs arrive irregularly and execution times that may vary from job to job.  However, some sporadic jobs may be modeled as periodic jobs since their inter-arrival times and execution times have bounded irregularity.

However, this approach may result in under-utilization of CPU.  As a concrete example, consider a stream of sporadic jobs whose inter-arrival times are uniformly distributed from 8 to 12 seconds and execution times uniformly distributed from 2 to 4 seconds.

(a) How can we model this stream of sporadic jobs as a single periodic job?  (Hint: use worst case analysis.)  What is the CPU reservation required for this period job?

(b) Calculate the average CPU utilization of the sporadic job stream.  Compare this average utilization to the CPU reservation made in part (a).  How much is the resulting under-utilization due to the modeling of a sporadic job stream using a periodic job?

9. Parallel/distributed Systems

A parallel computing cluster is a popular vehicle for executing a variety of applications. The messaging fabric is the most important component of such a cluster from the point of view of performance.
Consider two classes of applications:
-Scientific applications with fairly fine-grain sharing of objects each of size less than 1KByte.
-Stream-oriented multimedia applications with fairly coarse-grain sharing of large     objects (e.g. video frames) each of size on the order of 1 MByte.

(a) Qualitatively discuss the attributes you would like to see in a messaging layer given the above two application scenarios.  Are the characteristics of the messaging layer different for supporting the two application scenarios?  If so, how would you optimize the messaging layer to suit the two application scenarios?
(b) Design a messaging layer that can adequately support the characteristics of both applications.