Spring 2003 Architecture Qualifier Exam

Instructions:

Please answer any 6 of the 9 questions.

Give detailed answers and state every assumption you make.  Be comprehensive, showing the details of your approach and your understanding of the problem.

1.  Conventional wisdom in cache design has been that the L1 cache needs to be accessed in a single cycle in order to keep CPI anywhere near 1.0.  However, many recent machines (Alpha 21264, Pentium 4) have switched to 2-cycle access times on both L1-I and L1-D caches.  How can they get away with this, i.e. what architectural techniques apply and why do they work?  Be as quantitative as possible in your answer.

2.  Processors with branch prediction must provide a means to save and restore processor state at the point of the branch in case of a misprediction.  However, with deep pipelines (20+ stages in the Pentium 4), the fetch unit will likely encounter *multiple* branches before the first has resolved.  Describe means to deal with multiple branches.  If there are tradeoffs involved, how do you resolve them?  Be as quantitative as possible, e.g. provide equations and rough estimates for the tradeoffs.

3.  Most current architectures have prefetch instructions.  The Alpha goes a step further and offers four instructions, one to manipulate a memory line into each of the four MESI states.  

Describe situations, preferably with code examples, in which a programmer/compiler could exploit each of the four instructions (four answers).

4.  Conventional wisdom says that less complex designs should be capable of higher clock rates, i.e. that VLIW/EPIC (Itanium) or static superscalar (SPARC) processors should have higher clock rates than complex, out-of-order designs such as the Pentium 4 or Athlon.  In practice, though, those complex chips today are running with at least 2x faster clocks even accounting for process differences.  Explain the difference.

5.  Compare and contrast simultaneous multithreading (SMT), or "Hyper-Threading" as Intel calls it; speculative threading, or "multiscalar" processing [Sohi95]; and plain chip multiprocessors (CMP) as routes to high-performance Nirvana.  As processor chips exceed a billion transistors, which technique appears most promising?  Include in your discussion technology trends, workload trends, and engineering issues.

6.  In shared memory multiprocessors the coherence criterion used is:  "read of a location returns the most recent write into that location".

(a) What are the implementation requirements to make this coherence criterion viable in a general purpose shared memory multiprocessor?

(b) How are these requirements satisfied in a shared memory multiprocessor with a single shared bus and private (per-processor) caches?

(c) Sketch a protocol for implementing this coherence criterion in a shared memory multiprocessor with a switched interconnection network between the processors (with private caches) and the memory modules.

7.  If you pick up any classical computer architecture textbook, you will find classification of parallel architectures into SISD, MIMD, SIMD, or MISD.  

(a) Are these simply random permutations of some chosen letters or are there some good technical reasons for such a classification?

(b) What are celebrated examples of machines belonging to each of these categories?  What in your view is the reason for their celebrity status?

(c) Does such a classification stand the test of time?  Explain.

Are some of these categories no longer pertinent?  If so, why?

Give concrete technological reasons for the disappearance of some of these categories.

8.  Memory consistency models (such as SC, PC, PSO, and TSO) have a fairly simple model of what goes on within a single processor, namely, each processor issues a sequence of read and write requests to the memory.  A memory consistency model is simply a rule set for the interleaving of these requests so that one can reason about program behavior given this interleaving.  However, SMPs are built using commodity processors as the basic building blocks.  These processors incorporate aggressive implementation techniques such as deep pipelines, out-of-order processing, and multiple functional units.

(a) Do modern processors with multiple functional units and hardware management of ILP complicate realizing a particular memory consistency model?  Be concrete in your answer, using features of modern processors and their interaction with the rule set of memory consistency models.

(b) Sketch how you might implement TSO using an out-of-order processor as the basic unit and a switched interconnect to the memory.

(c) Do such modern processors with ILP obviate the need for exotic memory consistency models?  Explain with as much quantitative reasoning as possible.

9.  Performance evaluation of computer architectures is non-trivial.  It is customary to use analytical, simulation, or experimental techniques for this purpose.

(a) Explain the fallacies and pitfalls of each of these approaches to performance evaluation.  Substantiate your answers with quantitative results and anecdotal evidence from published literature.

(b) Let us assume you have designed a brand new branch prediction algorithm, and that intuitively your algorithm seems appealing.  Walk through a process for doing a quantitative evaluation of your new algorithm that will convince a skeptical community such as ISCA that your algorithm is actually useful!  Your answer should include how you choose the performance metrics, the workload, and the evaluation techniques.  Comment on how you will validate your results.