SYSTEMS QUALIFIER

Spring 2000

ANSWER 6 OUT OF THE 9 QUESTIONS BELOW.

 

1. "Optimism" is a recurrent theme in systems: e.g. Optimistic Active Messages (Wallach '95), Optimistic simulation (Jefferson '85) and Optimistic specialization (Audrey '95) in the reading list along with optimistic concurrency control in databases and speculative (presumably optimistic) execution in computer architecture.

As a pessimistic researcher, how does one properly *evaluate* a system that exploits optimism? Give your answer in the form of "N issues a paper with optimism in the title must address".

 

2. The remote procedure call (RPC) abstraction has become quite popular for building client/server distributed applications. RPC offers a synchronous mechanism for control and data transfer akin to procedure calls in sequential programs. However, the implementation of RPC must address problems that arise due to failures of client or server nodes, and its performance depends on many factors, including the communication protocol used by the implementation.

Consider an environment in which mobile clients call fixed services that are replicated in different localities. The RPC abstraction may have to be changed in such an environment to suit the application needs. For example, to avoid long periods of blocking when the mobile client may be disconnected from a service, a non-blocking or asynchronous RPC facility can be provided. Such a call returns at the client before it is executed by the server, and the client can claim the return results in the future when it needs them.

Discuss how issues such as at-most once execution of the call and security of RPC calls can be addressed when the calls are asynchronous. If your solution extends the implementation of synchronous RPC, you should describe that first. If needed, you should also develop a concrete definition of an asynchronous RPC.

 

3. Access control lists (ACLs) and capabilities are two different mechanisms that are used by operating systems to protect access to resources. Discuss the advantages as well disadvantages of these mechanisms.

It may be possible to develop a protection system that offers the best of both ACLs and capabilities. For example, access control for files in Unix performs an ACL style check when the file is opened. The open call then returns a file descriptor which can be viewed as a capability. Read or write requests can then be performed by presenting this capability and they require no ACL checks.

Discuss the design of a general protection system based on the idea of combining ACLs and capabilities. You can use specific examples (e.g., memory system) to illustrate important aspects of your design. You should also discuss the strengths and weakness of such a system.

 

4. The server-less distributed file system xFS developed at Berkeley and the global virtual memory project at Washington both chose to use the memory of other nodes to cache blocks or pages rather than fetching them from disk. This is beneficial when a block can be transferred over the network faster than fetching it from disk. In both of these systems, the problem of replacement has to be addressed to create room for a new block that is fetched from disk. In particular, a distributed global replacement algorithm must be employed to effectively use the memories of the nodes in the system. To keep copies in memory and to reduce the load on a small number of nodes, we propose an algorithm that will replace a least recently referenced block of which more than k copies exist in the memories of the nodes. If no such block exists, a least recently referenced block with most copies is replaced. Develop an efficient distributed algorithm to implement this replacement algorithm and compare it with one of the cooperative caching algorithms implemented by xFS.

 

5. Failures in distributed systems can result in an inconsistent state where the results of a distributed computation are reflected at some nodes but not at others. Commit protocols are used to avoid such inconsistent results. These protocols make use of a non-volatile store (e.g., disk) to save results of a computation until the outcome of the commit protocol becomes known. A logging service can be provided by the operating system to save results of a computation on non-volatile store. Systems such as Quicksilver and recoverable virtual memory make use of such a logging service. In these systems, the log records are flushed to disk when they must be in non-volatile memory. Since not all nodes of a distributed system fail at the same time, it may be better to use remote node memories to save copies of log records than disk for better performance (similar to xFS's use of remote node memory for caching file data). Design a log service that allows applications to choose to save log records on disk or in remote memories. List all the operations supported by the service and discuss their implementation. How can we evaluate the performance of this new log service with other existing implementations?

 

6. Execution errors (e.g., divide by zero) present challenges in Time Warp programs that do not arise when using conservative execution. Explain why.

Propose an approach to handle each of the following types of errors in a Time Warp system. Ideally, your approach should be transparent to the simulation application programmer. Explain the extent to which you can make your approach transparent.

a) Division by zero

b) Infinite loop

c) Invalid memory reference, e.g., overwriting a portion of the Time Warp executive's internal state

 

7. A consistent, recent trend has been to use cluster machines as the large-scale parallel machines of choice. Early efforts in cluster computing focused on the use of special purpose interconnects, such as those offering direct notions of consistent memory across multiple cluster nodes. Recent efforts are focusing on using Commercial-Off-The-Shelf (COTS) interconnects.

a) Describe the technical issues to be addressed when trying to link the MEMORY systems of communicating machines.

b) Discuss how even ero-copy' solutions (define zero-copy) may be difficult to implement for COTS interconnects (e.g., present some examples). (Recall Thekkath's paper here).

 

8. There has been substantial work on scheduling in parallel systems, including that addressing processor cache affinity, scheduling in the presence of real-time constraints, etc. This question is about SCHEDULERS, not scheduling. Since you can safely assume that each processor will have its own scheduler when running on a non-shared memory platform (e.g., a cluster machine), please describe some possible linkages one might need to define across such multiple schedulers to enable PARALLEL scheduling. Be specific, by for instance, considering the linkages needed to implement some form of group scheduling (e.g., co-scheduling), or when trying to implement a multi-machine reservation for purposes of meeting real-time constraints.

Identify at least two desirable, multi-machine scheduling disciplines and the associated scheduler linkages you might need to implement them. Assume schedulers ive in' kernel space.

 

9. Imagine you are constructing an object-based operating system for a parallel machine.

a) Describe what you mean by parallelism for object-based programs, thereby identifying what you need from programmers to create the parallel program on your SMP machine (assume SMP machines here!).

b) Describe what OS functions will need to be changed or paid attention to for implementing this parallelism (e.g., consider object invocation and scheduling).