Architecture Area Questions for Qualifier

Fall 2000

Answer 6 of 9 questions

1. VLIW/EPIC processor philosophy touts compiler-based mechanisms including predication and software-visible speculative instructions over dynamic mechanisms for branch prediction and instruction reordering. Define two equivalent machines, one a three-wide EPIC and the other a three-wide dynamic superscalar, and give example code sequences (and usage scenarios) exhibiting (a) a case where predication and software speculation outperform dynamic branch prediction and reordering and (b) the opposite.

2. Custom network interface (NI) designs for parallel machines have differed significantly from network interfaces built for internetworking. Some differences are attributable to the (major) difference that parallel-machine NIs are custom while internetworking NIs are mass-market, but other differences are fundamental. In other words, even when NIs eventually are integrated on the processor chip, parallel designs are likely to differ from internetworking designs. What are the differences between these two domains, and how do the differences affect network interface design?

3. Prefetching of instructions and data can be performed in hardware via autonomous prefetching engines or in software through prefetch instructions. Discuss the pros and cons of each approach in detail including implementation complexity as well as performance costs. Propose a hybrid approach to overcome the drawbacks of each.

4. Processor interrupts are generally expensive because they cause a pipeline flush. The Smith/Pleszkun paper assumes that interrupts additionally will cause a rollback of any speculative state in an out-of-order processor. All this cost is most vexing to those who would like to use interrupts as a lightweight event delivery mechanism. What architectural features are required to support interrupts *without* requiring a flush of processor state? Be sure to account for interrupts that switch from user- to kernel-mode. Discuss any necessary modifications to the pipeline, speculative store, TLB/MMU, etc.

5. Register renaming in hardware is an important technique to remove dependencies at run time and increase parallelism in the instruction schedule. Give an example where dynamic register renaming helps. What are the costs associated with renaming? Give an example where the costs could reduce the benefits of renaming -- what alternative do you suggest over a basic renaming scheme? In the above, you may assume execution within a single basic block or, if you wish, you may also use speculation.

6. Suppose you are given a certain CPU, and are told you should look into optimizing its performance. In particular, from execution traces, you know certain pairs of instructions appear together frequently in instruction traces (e.g., a compare instruction followed by a conditional branch instruction). You are asked to consider defining a few new instructions that combine the functionality of such pairs of frequently occurring instructions (e.g., a "compare and branch" instruction).

Describe in detail how you would go about evaluating whether or not adding these instructions will result in improved performance. In addition to making the CPU faster, your boss would like to know quantitatively how much you expect incorporating these new instructions will improve performance. Describe what information you need to determine to make such an assessment.
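As an illustration of the premise only (not a model answer), the sketch below counts how often adjacent opcode pairs occur in a dynamic trace. The trace contents, the opcode names, and the helper `adjacent_pair_counts` are hypothetical and made up for this example.

```python
from collections import Counter

def adjacent_pair_counts(trace):
    """Count how often each ordered pair of opcodes appears back to back."""
    return Counter(zip(trace, trace[1:]))

# Hypothetical dynamic opcode trace (made-up data, for illustration only).
trace = ["LD", "CMP", "BR", "ADD", "CMP", "BR", "ST", "LD", "CMP", "BR"]

pairs = adjacent_pair_counts(trace)
cmp_br = pairs[("CMP", "BR")]

# Fraction of dynamic instructions eliminated if every CMP+BR pair
# were replaced by a single fused instruction.
fused_fraction = cmp_br / len(trace)

print(pairs.most_common(3))
print(f"CMP+BR pairs: {cmp_br}, instruction count reduction: {fused_fraction:.0%}")
```

A count like this gives only the change in dynamic instruction count; the other quantities needed for the assessment are left to the answer.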

7. Consider software- and hardware-based speculative execution techniques. What makes aggressive speculative execution difficult? Give specific program characteristics or examples. Suggest some ways to overcome these difficulties.

8. Consider the following code fragment:

Label L1:
   while (A == 0);
   if (t&s(A) == 0) {
     B = C;
     F = C+D+E;
   }
   A = 0;
Label L2:

The above code is compiled to the following set of instructions on an SMP:

BACK:    LD      R0,    A
         CMP     R0,    #0,   BACK
         T&S     R0,    A
         CMP     R0,    #1,   BACK
         LD      R1,    C
         ST      B,     R1
         LD      R2,    D
         LD      R3,    E
         ADD     R1,    R2
         ADD     R1,    R3
         ST      F,     R1
         ST      A,     R0

The programmer's intent is to ensure sequential consistency at the labels L1 and L2 in the code fragment. If the underlying hardware memory model is processor consistency (PC), will this code sequence achieve the programmer's intent? If it will not, suggest an architectural enhancement in each processor of the SMP to achieve the programmer's intent.

9. The early 90's saw a spate of research in memory consistency models for shared memory multiprocessors. Many research papers of that genre would have you believe that relaxed consistency models offer a significant performance advantage over a sequentially consistent memory model. Indeed research projects such as DASH implemented such a model in hardware.

The intent of this question is to critique relaxed consistency models for building memory systems (be they hardware or software shared memory systems). Your answer should address both the software and hardware issues in building such memory systems.

a) What is hard about implementing relaxed consistency models?

b) What is hard about writing system software on top of a memory system that uses a relaxed consistency model?

c) What is hard about writing application software on top of a memory system that uses a relaxed consistency model?

d) What are the sources of performance advantage with relaxed consistency models? Are such sources prevalent in a significant number of applications? If the research papers of the early 90's showed relaxed consistency models in a good light, how were they able to do that?

e) For any SMP you are familiar with that uses a relaxed consistency model, describe the model implemented by it. Give a state diagram that shows the protocol transitions. Qualitatively benchmark this model against release consistency and sequential consistency in terms of its "relaxedness".