Spring 2003 Systems Qualifier Examination
Instructions: Please answer any 6 out of 9 questions
Give detailed answers to questions mentioning all the assumptions you are
making and be comprehensive in your answers showing the details of your approach
and understanding of the problem.
1. Communications/OS
Lightweight communication protocols have become an important topic in high-end
applications and systems. Attributes of such protocols with respect to OS
research include 1) overheads imposed on end systems, 2) latency, and 3)
configurability.
- Define the meaning of `overhead' vs. latency, and explain why both
are not identical measures.
- Identify two specific examples of protocol configurability and for
each, identify their utility for performance-critical applications.
- List the topics to consider if you were designing and implementing
a low-latency, low- overhead communication protocol.
2. Middleware
Your reading list has a substantial number of references on object-oriented
distributed systems. In more modern terms, such infrastructures tend to be
called 'middleware' when they implement a distributed OO layer across underlying
OS platforms. In this question, you are asked to identify the technical linkages
between middleware and underlying operating systems and networks, that is,
you are asked to think about how middleware interacts with and uses underlying
platforms.
- List at least four such linkages (some being CPU centric, others
communication centric).
- Describe for at least two of these (one CPU, one comm.) configurability
at the middleware level that would be useful for end user applications and/or
for underlying platforms (e.g., consider heterogeneity).
- Identify two techniques in online monitoring that might be usefully
associated with such configurability.
3. In peer-to-peer (P2P) distributed computing, several new issues arise
that have not been addressed fully in past research. For example, because
there are no central servers, peers must employ methods to locate resources
and services they need using distributed algorithms that are appropriate
in this environment. Several systems (e.g., Chord) have been proposed which
use distributed hash tables (DHT) based to find resources. They typically
require log n messages and time to find a resource. Consider a P2P system
in which 80% of the requests are for a small set of popular resources whereas
the remaining 20% requests are for resources hat are infrequently accessed
by peer nodes. Can you adapt the DHT based schemes to this environment so
that popular resources can be found more efficiently? The infrequently
accessed resources may be more expensive to find but once such a resource
becomes popular, the cost of finding it must decrease.
Discuss the algorithms that you propose as well as their performance. Since
a peer may choose to disconnect and not participate in the execution of the
algorithm, how does your algorithm handle such disconnections?
Distributed Systems
4. Many believe that the design of secure distributed systems shares many
problems that have been addressed in dependable or reliable distributed systems.
For example, protocols that tolerate Byzantine failures, which capture the
behavior of a malicious host that can behave arbitrarily, have been explored
for several problems that arise in distributed computing systems. Consider
the problem of building a highly available distributed services that can
provide critical information to people who respond to emergency situation.
Some of the responders invoke functions supported by the service to store
information about conditions on the ground whereas others may invoke functions
to read such information. To ensure that the service is highly available,
it is replicated at multiple nodes. However, some subset of these nodes may
fail, including in Byzantine mode, during the operation of the service.
Develop a replicated service management protocol that allows a client to
send a request to the various service instances and receive a response as
long as a bounded number of them are faulty. The service instances must coordinate
among themselves to ensure that they behave as a single service. How
many of these instances can be become faulty if the single service abstraction
is to be maintained?
5. Distributed Simulation
(A) Several algorithms are reported in the distributed computing literature
to implement causal ordering. Why are these algorithms seldom used
in distributed simulations of (say) communication networks used to evaluate
new communication protocols?
(B) Distributed simulations used to create virtual environments seldom use
algorithms to ensure time stamp ordering of events. Explain why.
(C) The Chandy/Misra/Bryant null message algorithm assumes reliable delivery
of messages. Consider relaxing this assumption, i.e., assume some messages
are lost. Will lost messages always result in a failure of the algorithm?
Explain why or why not. Describe what problems can come up if messages
are lost.
6. Cluster-based Web Cache System
Your new company, MyPortal, provides cluster-based software for ISP's that
allows them to do simple firewalling and Web caching. A customer installs
a SIM caching cluster according to their expected user population size.
Under normal conditions (typical Web workloads, no undue network congestion
effects, etc.), a single cluster node can handle about 60 hits per second.
Some of your more sophisticated customers tell you that they would like to
be able to "augment" the caching with additional capabilities. For example,
@mobi.com has many users who use PDA's who access the Web through their ISP
cluster, so in addition to fetching and caching a page, they would like to
be able to perform simple transformations on data, depending on whether the
client is a Windows PDA, Palm PDA, or regular desktop browser (IE/Netscape).
So you agree to design an API that will allow them to insert "plug-ins" into
the cache stream.
Below is the simplified control flow of your existing architecture.
Your software architecture is concurrent: a single thread handles the incoming
request, but after the 'get page from web' procedure returns, the thread
forks into two so that the "store in cache" and "deliver to client" code
paths can be executed in parallel.
In the original logic diagram above, all of your functional blocks correspond
to methods inside a single process. (The "store in cache" and "deliver
to client" blocks can be executed by different threads, but they are user-level
threads in the same process space.) @Mobi.com wants to know if their plug-in
module will run
(a) in the same process space as a separate procedure call that is part of,
or
(b) in the same process space but as a separate (forked) thread, or
(c) in a different process space.
Thoroughly enumerate the pros and cons of each approach. Consider performance,
ease of programming, capacity provisioning, and failure management.
Which approach (a, b or c) would you ultimately recommend and why?
Please write down your answers to the best of your knowledge.
7. The GMS system described in Feeley95 is one of several schemes for using
the main memory of remote workstations to store memory pages that would otherwise
have to be stored on disk. GMS adopts a design decision that pages
cached in remote workstations are always "clean", i.e. they are always consistent
with the copy on disk. Evaluate this design decision. What are the
tradeoffs between keeping only clean copies versus allowing dirty copies?
Be as quantitative as possible and include the effects of projected technology
trends.
8. Real-Time Systems
Schedulability Analysis. In classic real-time systems, jobs are classified
as either periodic or sporadic. Periodic jobs arrive on regular intervals
(the period) and have predictable execution times. Sporadic jobs arrive
irregularly and execution times that may vary from job to job. However,
some sporadic jobs may be modeled as periodic jobs since their inter-arrival
times and execution times have bounded irregularity.
However, this approach may result in under-utilization of CPU. As a
concrete example, consider a stream of sporadic jobs whose inter-arrival
times are uniformly distributed from 8 to 12 seconds and execution times
uniformly distributed from 2 to 4 seconds.
(a) How can we model this stream of sporadic jobs as a single periodic job?
(Hint: use worst case analysis.) What is the CPU reservation required
for this period job?
(b) Calculate the average CPU utilization of the sporadic job stream.
Compare this average utilization to the CPU reservation made in part (a).
How much is the resulting under-utilization due to the modeling of a sporadic
job stream using a periodic job?
9. Parallel/distributed Systems
A parallel computing cluster is a popular vehicle for executing a variety
of applications. The messaging fabric is the most important component of
such a cluster from the point of view of performance.
Consider two classes of applications:
-Scientific applications with fairly fine-grain sharing of objects each of
size less than 1KByte.
-Stream-oriented multimedia applications with fairly coarse-grain sharing
of large objects (e.g. video frames) each of size on the
order of 1 MByte.
(a) Qualitatively discuss the attributes you would like to see in a messaging
layer given the above two application scenarios. Are the characteristics
of the messaging layer different for supporting the two application scenarios?
If so, how would you optimize the messaging layer to suit the two application
scenarios?
(b) Design a messaging layer that can adequately support the characteristics
of both applications.