Architecture Area Qualifier

Fall 2001

Answer 6 of 9 questions

  1. Amdahl's law dictates that successful machines balance computation vs. communication, latency vs. bandwidth and load balance vs. locality. For instance the Cray 1, a vector machine, was successful in part because of its attention to low latency which is crucial when vectors are short. Critique the five approaches to high performance computing below according to their approach to the three notions of balance above.
    1. message-passing multiprocessors, e.g. T3E
    2. shared-memory multiprocessors, e.g. O2K
    3. dataflow multiprocessors, e.g. Monsoon
    4. multithreaded multiprocessors, e.g. HEP/Tera
    5. vector supercomputers, e.g. Cray.

  2. Imagine you are a user of a machine of unknown architecture (you can only run user-mode programs). Describe an experiment (program that you would use, inputs, conclusions based on the form of the output) that you would perform in order to discover what caching hardware is used on that machine. Explain how your experiment would let you discover how many levels of cache hierarchy there are and how big each level is. Other questions to address include:
  3. List scheduling is a fast heuristic which is used in superscalars to decide the wake-up order in wide-issue processors which might use speculation etc. List scheduling attempts to topologically order the instructions in ready-queue and use a heuristic function (such as some kind of priority) to determine the schedule of instructions at the same level in the dependence graph. It appears to give schedules close to optimal in practice. Intuitively argue why this should be the case. Consider the effects of instruction windows, issue rates etc. What hardware support is needed for list scheduling? Give an example where list scheduling performs at its worst. Try to generalize this behavior to a class of problems. Can you give an alternative scheduling scheme for the above class?
  4. This question pertains to software and hardware issues in memory management in modern operating systems (such as Linux and NT) on modern processors such as DEC Alpha and Intel P-III.
    1. What are multi-level page tables? How do they work? Why do we need them for memory management in modern processors?
    2. Give a detailed description of memory management. Your answer should cover both the hardware and software issues in memory management. Explain what aspects are handled typically in hardware and what aspects are handled typically in software. Explain why such a division of labor between hardware and software is reasonable or unreasonable. If an aspect can be handled in either software or hardware, then what are the pros and cons for each approach.
    3. What hardware support is deemed necessary for efficient memory management, and how is it used by the operating system?
    4. Indicate what kinds of hardware support are desirable for efficient memory management but are not available. For any such support in the wish list, explain with adequate reason whether the lack of hardware support is because it is infeasible to implement in hardware, or because it will have a detrimental effect on performance.
    5. Using the memory hierarchy of any modern processor (physical memory, virtual memory, page tables, TLB, L1, and L2, caches) walk through what design decisions you will have to make in designing the memory management piece of the operating system.

  5. This problem deals with shared memory multiprocessors.
    1. Clearly explain the distinction between memory consistency model and cache coherence in shared memory multiprocessors. Why is this distinction important? How is it used in the design and implementation of multiprocessors?
    2. SMPs keep the private caches of the processors coherent. However, each processor in an SMP has its own private set of registers. Are they kept coherent? If not, why not? If they are not kept coherent, how is the desired semantics of a multithreaded program that uses a threading package such as pthreads guaranteed to work correctly in an SMP?
    3. Each processor in an SMP does its own virtual to physical address translations for memory accesses. To speed up the address translation, each processor has its own TLB as well. Consider a multithreaded program executing on an SMP. All the threads execute in the same address space. How is the semantics of the multithreaded program preserved with the autonomous virtual memory management that is happening on each processor of the SMP? Explain clearly, what needs to happen to preserve the semantics and what are the possible solution approaches to achieving them.

  6. Write-invalidate and write-update cache coherence protocols differ in that the former is a pull approach for getting the most recent write to a memory location, and the latter is a push approach for disseminating the most recent write to a memory location.
    1. Discuss the pros and cons of both these approaches from the point of view of program behavior. Give code fragments where one or the other would be more beneficial.
    2. Discuss the pros and cons of both these approaches from the point of view of designing large-scale shared memory multiprocessors (i.e. the interconnect does not have a broadcast or multicast capability). Be concrete as to the problems that will be encountered with each approach and what are possible solutions.
    3. Given that programs show different sharing characteristics over the lifetime of their execution, suggest a scheme for dynamically switching from one approach to the other dynamically. It is sufficient if you suggest heuristics you would use to switch from one to the other, and what additional state information you will need in the cache coherence hardware to enable this switching.

  7. Message-passing and shared memory are two alternative styles of communicating and synchronizing among the different threads of the same program. In the former, the communication is explicit (via primitives such as sends and receives) while in the latter the communication is implicit (via primitives such as read and write to shared memory locations).
    1. Give the pros and cons of both these approaches from the point of view of ease of parallel program development and performance.
    2. To reap the advantages of both styles it is desired to add some explicit communication primitives to a shared memory multiprocessor. Suggest a scheme to do this. Your scheme should clearly state the semantics of the proposed primitives, and their implementation (at the level of protocol enhancements, state transitions, and any additional state information in the cache coherence hardware) in the context of a cache coherent shared memory multiprocessor.

  8. Compare and contrast polling and interrupts as mechanisms for the delivery of events, e.g. notification of the arrival of network messages in a multiprocessor. Under what circumstances does polling provide better performance? Under what circumstances does interrupt-based delivery provide better performance. What other issues drive the selection of one mechanism or the other? Give examples that are as concrete as possible.
  9. This problem deals with branch prediction.
    1. What problem is solved by branch prediction? Why is it important?
    2. Branch predictors utilize tables indexed by the PC, by global history and by local history. Define each indexing scheme and give detailed scenarios in which each scheme performs better than the other two.
    3. What issues limit the complexity of branch predictors?
    4. How might compiler "advice" be propagated to a branch predictor?