Amdahl's law dictates that successful
machines balance computation vs. communication, latency vs. bandwidth and
load balance vs. locality. For instance the Cray 1, a vector machine, was
successful in part because of its attention to low latency which is crucial
when vectors are short. Critique the five approaches to high performance
computing below according to their approach to the three notions of balance
above.
message-passing multiprocessors, e.g. T3E
shared-memory multiprocessors, e.g. O2K
dataflow multiprocessors, e.g. Monsoon
multithreaded multiprocessors, e.g. HEP/Tera
vector supercomputers, e.g. Cray.
Imagine you are a user of a machine of unknown architecture (you
can only run user-mode programs). Describe an experiment (program that you
would use, inputs, conclusions based on the form of the output) that you
would perform in order to discover what caching hardware is used on that
machine. Explain how your experiment would let you discover how many levels
of cache hierarchy there are and how big each level is. Other questions to
address include:
Is your experiment sensitive on whether the
L1 cache is virtually or physically addressed? If so, how does the OS
virtual-to-physical mapping strategy affect the outcome?
Can you discover anything about the associativity of the caches through
your experiment? How?
Can you discover anything about the block size of the caches?
How?
List scheduling is a fast heuristic
which is used in superscalars to decide the wake-up order in wide-issue
processors which might use speculation etc. List scheduling attempts to
topologically order the instructions in ready-queue and use a heuristic
function (such as some kind of priority) to determine the schedule of
instructions at the same level in the dependence graph. It appears to give
schedules close to optimal in practice. Intuitively argue why this should
be the case. Consider the effects of instruction windows, issue rates etc.
What hardware support is needed for list scheduling? Give an example where
list scheduling performs at its worst. Try to generalize this behavior to a
class of problems. Can you give an alternative scheduling scheme for the
above class?
This question pertains to software and
hardware issues in memory management in modern operating systems (such as
Linux and NT) on modern processors such as DEC Alpha and Intel P-III.
What are multi-level page tables?
How do they work? Why do we need them for memory management in modern
processors?
Give a detailed description of memory management. Your answer
should cover both the hardware and software issues in memory management.
Explain what aspects are handled typically in hardware and what aspects are
handled typically in software. Explain why such a division of labor between
hardware and software is reasonable or unreasonable. If an aspect can be
handled in either software or hardware, then what are the pros and cons for
each approach.
What hardware support is deemed necessary for efficient memory
management, and how is it used by the operating system?
Indicate what kinds of hardware support are desirable for
efficient memory management but are not available. For any such support in
the wish list, explain with adequate reason whether the lack of hardware
support is because it is infeasible to implement in hardware, or because it
will have a detrimental effect on performance.
Using the memory hierarchy of any modern processor (physical
memory, virtual memory, page tables, TLB, L1, and L2, caches) walk through
what design decisions you will have to make in designing the memory
management piece of the operating system.
This problem deals with shared memory multiprocessors.
Clearly explain the distinction between memory consistency model and
cache coherence in shared memory multiprocessors. Why is this distinction
important? How is it used in the design and implementation of
multiprocessors?
SMPs keep the private caches of the processors coherent.
However, each processor in an SMP has its own private set of registers. Are
they kept coherent? If not, why not? If they are not kept coherent, how is
the desired semantics of a multithreaded program that uses a threading
package such as pthreads guaranteed to work correctly in an SMP?
Each processor in an SMP does its own virtual to physical
address translations for memory accesses. To speed up the address
translation, each processor has its own TLB as well. Consider a
multithreaded program executing on an SMP. All the threads execute in the
same address space. How is the semantics of the multithreaded program
preserved with the autonomous virtual memory management that is happening on
each processor of the SMP? Explain clearly, what needs to happen to
preserve the semantics and what are the possible solution approaches to
achieving them.
Write-invalidate and write-update
cache coherence protocols differ in that the former is a pull approach for
getting the most recent write to a memory location, and the latter is a push
approach for disseminating the most recent write to a memory location.
Discuss the pros and cons of both
these approaches from the point of view of program behavior. Give code
fragments where one or the other would be more beneficial.
Discuss the pros and cons of both these approaches from the
point of view of designing large-scale shared memory multiprocessors (i.e.
the interconnect does not have a broadcast or multicast capability). Be
concrete as to the problems that will be encountered with each approach and
what are possible solutions.
Given that programs show different sharing characteristics
over the lifetime of their execution, suggest a scheme for dynamically
switching from one approach to the other dynamically. It is sufficient if
you suggest heuristics you would use to switch from one to the other, and
what additional state information you will need in the cache coherence
hardware to enable this switching.
Message-passing and shared memory are
two alternative styles of communicating and synchronizing among the
different threads of the same program. In the former, the communication is
explicit (via primitives such as sends and receives) while in the latter the
communication is implicit (via primitives such as read and write to shared
memory locations).
Give the pros and cons of both
these approaches from the point of view of ease of parallel program
development and performance.
To reap the advantages of both styles it is desired to add
some explicit communication primitives to a shared memory multiprocessor.
Suggest a scheme to do this. Your scheme should clearly state the semantics
of the proposed primitives, and their implementation (at the level of
protocol enhancements, state transitions, and any additional state
information in the cache coherence hardware) in the context of a cache
coherent shared memory multiprocessor.
Compare and contrast polling and
interrupts as mechanisms for the delivery of events, e.g. notification of
the arrival of network messages in a multiprocessor. Under what
circumstances does polling provide better performance? Under what
circumstances does interrupt-based delivery provide better performance.
What other issues drive the selection of one mechanism or the other? Give
examples that are as concrete as possible.
This problem deals with branch prediction.
What problem is solved by branch prediction? Why is it
important?
Branch
predictors utilize tables indexed by the PC, by global history and by local
history. Define each indexing scheme and give detailed scenarios in which
each scheme performs better than the other two.
What issues limit the complexity of branch predictors?
How might compiler "advice" be propagated to a branch predictor?