Spring 2003 Architecture Qualifier Exam
Instructions:
Please answer any 6 out of 9 questions.
Give detailed answers. State all the assumptions you are making, and be
comprehensive, showing the details of your approach and your understanding
of the problem.
1. Conventional wisdom in cache design has been that the L1 cache needs
to be accessed in a single cycle in order to keep CPI anywhere near 1.0.
However, many recent machines (Alpha 21264, Pentium 4) have switched to 2-cycle
access times on both L1-I and L1-D caches. How can they get away with
this, i.e. what architectural techniques apply and why do they work?
Be as quantitative as possible in your answer.
2. Processors with branch prediction must provide a means to save and
restore processor state at the point of the branch in case of a misprediction.
However, with deep pipelines (20+ stages in the Pentium 4), the fetch unit
will likely encounter *multiple* branches before the first has resolved.
Describe means to deal with multiple branches. If there are tradeoffs
involved, how do you resolve them? Be as quantitative as possible,
e.g. provide equations and rough estimates for the tradeoffs.
3. Most current architectures have prefetch instructions. The
Alpha goes a step further and offers four instructions, one to manipulate
a memory line into each of the four MESI states.
Describe situations, preferably with code examples, in which a programmer/compiler
could exploit each of the four instructions (four answers).
4. Conventional wisdom says that less complex designs should be capable
of higher clock rates, i.e. that VLIW/EPIC (Itanium) or static superscalar
(SPARC) processors should have higher clock rates than complex, out-of-order
designs such as the Pentium 4 or Athlon. In practice, though, those
complex chips today are running with at least 2x faster clocks even accounting
for process differences. Explain the difference.
5. Compare and contrast simultaneous multithreading (SMT), or "Hyper-Threading"
as Intel calls it, speculative threading or "multiscalar" processing [Sohi95],
and plain chip multiprocessors (CMP) as routes to high performance Nirvana.
As processor chips exceed a billion transistors, which technique appears
most promising? Include in your discussion technology trends, workload
trends and engineering issues.
6. In shared memory multiprocessors the coherence criterion used is:
"read of a location returns the most recent write into that location".
(a) What are the implementation requirements to make this coherence criterion
viable in a general purpose shared memory multiprocessor?
(b) How are these requirements satisfied in a shared memory multiprocessor
with a single shared bus and private (per-processor) caches?
(c) Sketch a protocol for implementing this coherence criterion in a shared
memory multiprocessor with a switched interconnection network between the
processors (with private caches) and the memory modules.
7. If you pick up any classical computer architecture textbook, you
will find classification of parallel architectures into SISD, MIMD, SIMD,
or MISD.
(a) Are these simply random permutations of some chosen letters or are there
some good technical reasons for such a classification?
(b) What are celebrated examples of machines belonging to each of these categories?
What in your view is the reason for their celebrity status?
(c) Does such a classification stand the test of time? Explain.
Are some of these categories no longer pertinent? If so, why?
Give concrete technological reasons for the disappearance of some of these
categories.
8. Memory consistency models (such as SC, PC, PSO, and TSO) have a
fairly simple model of what goes on within a single processor, namely, each
processor issues a sequence of read and write requests to the memory.
A memory consistency model is simply a rule set for the interleaving of these
requests so that one can reason about program behavior given this interleaving.
However, SMPs are built using commodity processors as the basic building
blocks. These processors incorporate aggressive
implementation techniques such as deep pipelines, out-of-order processing,
and multiple functional units.
(a) Do modern processors with multiple functional units and hardware management
of ILP complicate realizing a particular memory consistency model?
Be concrete in your answer, using features of modern processors
and their interaction with the rule set of memory consistency models.
(b) Sketch how you might implement TSO using an out-of-order processor as
the basic unit and a switched interconnect to the memory.
(c) Do such modern processors with ILP obviate the need for exotic memory
consistency models? Explain with as much quantitative reasoning as
possible.
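As an illustration of the interleaving view described in question 8, the
following is a minimal sketch, assuming a two-processor program in which each
processor writes one location and then reads the other. The program, the
names P0, P1, interleavings, run, and sc_outcomes, and the choice of Python
are all assumptions made for this illustration. It enumerates every
interleaving that preserves each processor's program order, which is the
rule set of SC, and collects the values the two reads can return.

```python
from itertools import combinations

# Hypothetical two-processor program (an assumption for this sketch).
# Each request is (kind, location, value-to-write or register-to-fill).
P0 = [("write", "x", 1), ("read", "y", "r1")]
P1 = [("write", "y", 1), ("read", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        # Positions in `slots` take the next request from a, the rest from b.
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def run(order):
    """Apply the requests one at a time to a single shared memory."""
    memory = {"x": 0, "y": 0}  # both locations start at 0
    regs = {}
    for kind, loc, arg in order:
        if kind == "write":
            memory[loc] = arg
        else:
            regs[arg] = memory[loc]
    return regs["r1"], regs["r2"]


def sc_outcomes():
    """Set of (r1, r2) results over all program-order-preserving merges."""
    return {run(order) for order in interleavings(P0, P1)}


if __name__ == "__main__":
    print(sorted(sc_outcomes()))
```

Under this rule set the result r1 = 0, r2 = 0 never appears, because at least
one write must precede both reads in any single interleaving. A model that
lets a read bypass the processor's own earlier write to a different location,
as TSO does, admits that result as well.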
9. Performance evaluation of computer architectures is non-trivial.
It is customary to use analytical, simulation, or experimental techniques
for this purpose.
(a) Explain the fallacies and pitfalls of each one of these approaches
to performance evaluation. Substantiate your answers with quantitative
results and anecdotal evidence from published literature.
(b) Assume you have designed a brand new branch prediction algorithm
that seems intuitively appealing. Walk
through a process for doing a quantitative evaluation of your new algorithm
that will convince a skeptical community such as ISCA that your algorithm
is actually useful! Your answer should include how you choose the performance
metrics, the workload, and the evaluation techniques. Comment on how
you will validate your results.