;-*- Text -*-
; 29-May-98
;
; RCS: $Id$
;

Notes for Wednesday, 27-May-98

Today:
  1. Pipelining wrapup
  2. Quiz post-mortem

Reading: chapter 6

----------------------------------------------------------------

In this class, I re-drew pictures of the pipeline diagrams (from
980522.text) for the stall and squash situations.

We wrapped up pipelining by considering how the particular techniques
were selected to deal with the particular problems.  The problems
were:

  a. "ADD-ADD" hazards, i.e. data hazards through a register where
     the result is available in every stage after the execute stage.
     We solved this with forwarding.

  b. "LW-ADD" hazards, i.e. a data hazard where the result isn't
     available until after the memory stage, so forwarding can't
     always work.  We solved this with a combination of forwarding
     (when possible) and stalling in the one situation (LW followed
     immediately by something like ADD that uses the loaded value) in
     which forwarding is impossible.

  c. "BEQ" hazards, i.e. the control hazards.  We used
     predict-and-squash for these hazards.

The full suite of techniques is:

  x. "Expose", i.e. just let the programmer deal with it.

  y. "Stall", i.e. always stall the front half of the pipeline until
     the situation goes away (because the back half of the pipeline
     keeps moving).

  z. "Forward", i.e. find and forward the data needed.

  w. "Predict", i.e. guess an answer and squash if the answer turns
     out to be wrong.

We picked particular solutions, but it's instructive to consider what
the other possibilities are:

                             Technique:
  Problem   Expose          Stall           Forward      Predict
  -------   ------          -----           -------      -------
  ADD-ADD                                   [did this]
  LW-ADD                    [did this]      [and this]
  BEQ                                                    [did this]

Consider the empty slots (in an arbitrary order).

1. Exposing ADD-ADD.  We can always expose to fix a hazard.  The
   trouble with the ADD-ADD situation is that (as you can imagine) it
   probably happens a lot and we'd end up inserting a lot of NOOPs.
   Forwarding, assuming we can do it, is strictly better than exposing
   in terms of reducing the number of cycles that the program takes to
   run.

2. Stalling ADD-ADD.  Like exposing, we can always stall to fix a
   hazard.  Like exposing, though, this option is lousy compared to
   forwarding (in terms of the number of cycles to execute a program)
   because the hazard happens frequently.

   One drawback to forwarding, though, is that all that mux circuitry
   _might_ force the processor to have a longer clock period.  In
   other words, if the EX stage is the critical path (longest Tpd) in
   the processor, then the forwarding circuitry adds to the critical
   path and the minimum clock period has to be longer with forwarding
   circuitry than without.

   In a simple, 5-stage pipeline such as ours or the MIPS R2000, the
   EX stage plus the forwarding circuit is generally not the critical
   path (the worst stage is probably one of the memory stages, IF or
   MEM).  In aggressive current processors, however, which have
   multiple execution units in the EX stage and much more forwarding
   circuitry, the EX stage is becoming a potential bottleneck.  The
   21264, described in the paper we talked about in class [on Friday],
   actually splits up the EX stage into halves to reduce the
   forwarding circuitry (sometimes it stalls instead of forwarding) in
   order to keep the clock period low.

3. Exposing LW-ADD.  Assuming we keep forwarding (which solves most
   of the LW-ADD problem), exposing the one un-forwardable situation
   instead of stalling isn't a bad idea.  It's only one cycle and we
   could probably fill that one cycle fairly often.  Since the
   alternative is stalling (which *always* wastes the cycle), exposing
   looks pretty good.

   The drawback to exposing is that we have to re-compile programs to
   be aware of this implementation "feature".  Also, if we come back
   next year and re-implement the processor differently (say, with 4
   ALUs like the 21264), we may not want to expose things in the same
   way.

4. Forwarding BEQ.  This is nonsensical: there's nothing to forward.
5. Stalling BEQ.  Stalling BEQ is just a lose.  We'd have to stall
   for three cycles no matter which way the branch went.  Maybe we
   could reduce the penalty to one cycle, but exposing or predicting
   gives us a chance to use that one cycle whereas stalling never
   does.

6. Exposing BEQ (as opposed to predicting).  This is an interesting
   tradeoff:

   -- Exposing may allow us to get useful work out of all three
      cycles no matter which way the branch ultimately goes.
      Predict-and-squash sometimes has to throw away three
      cycles/instructions.

   -- On the other hand, it may not be possible to fill those three
      cycles with useful work.

   Expose vs. predict isn't clear... it depends on how good the
   prediction is vs. how often we can fill the "delay slots".  Both of
   these factors are hard to have much intuition about.  We'd have to
   go out, try it both ways & measure the results!

   Processors have been built both ways.  Early commercial RISC
   processors (MIPS & SPARC) had branches with one exposed delay
   slot.  More recent processors have switched to using prediction,
   mostly because (since they execute more than one instruction per
   cycle) the number of exposed slots would have to be "large".

7. Finally, what about using prediction to solve *data* hazards...
   The amusing thing is that you can in fact do this!  Consider the
   LW-ADD problem.  You could guess in the EX stage what the memory
   value is going to be and then check it on the next cycle.  The
   guess would be based on some info available in the EX stage (e.g.
   a hash of the register number and offset).  A guess might actually
   be pretty good if the LW were for something like a reference to the
   stack (something we'd just written recently using the same
   register number and offset).

   People are starting to try things like this as the need to deepen
   (more stages) and widen (more ALUs in the EX stage) the pipeline
   increases so that the cost of a stall becomes high.
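
The expose-vs-predict tradeoff in item 6 can be put into rough
numbers.  What follows is a minimal sketch in Python, not a
measurement.  It assumes a three-cycle branch penalty (as in the
discussion above), that each exposed delay slot gets filled
independently with some fixed probability, and that a wrong prediction
always squashes all three slots.  The function name and the rates fed
to it are made up for illustration.

```python
def wasted_cycles_per_branch(mispredict_rate, fill_rate, slots=3):
    """Expected cycles lost per BEQ under each technique (toy model)."""
    if not (0.0 <= mispredict_rate <= 1.0 and 0.0 <= fill_rate <= 1.0):
        raise ValueError("rates must be between 0 and 1")
    return {
        # Stall: the front half of the pipeline waits every time.
        "stall": float(slots),
        # Expose: every slot the compiler can't fill becomes a NOOP.
        "expose": slots * (1.0 - fill_rate),
        # Predict: nothing is lost unless the guess is wrong, and then
        # all of the slots are squashed.
        "predict": slots * mispredict_rate,
    }


if __name__ == "__main__":
    # Hypothetical rates, chosen only to show that the winner changes.
    for mispredict, fill in [(0.5, 0.5), (0.1, 0.5), (0.4, 0.8)]:
        costs = wasted_cycles_per_branch(mispredict, fill)
        best = min(costs, key=costs.get)
        print(f"mispredict={mispredict} fill={fill} -> {costs} best={best}")
```

In this model, predict beats expose exactly when the misprediction
rate is lower than the fraction of delay slots left unfilled, and
stall never beats either one.  Real numbers for both rates would still
have to come from measurement.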

Here's the table again as a summary:

                             Technique:
  Problem   Expose          Stall           Forward      Predict
  -------   ------          -----           -------      -------
  ADD-ADD   too common      too common      [did this]   probability
                                                         too low.

  LW-ADD    okay, but       [did this]      [and this]   maybe!
            happens a lot

  BEQ       maybe... but    lose: you'd     impossible   [did this]
            a really good   have to stall
            predictor is    *every* time
            probably a
            better bet.

----------------------------------------------------------------

I also handed back the quiz and talked about it a little bit.  The
average was 61 and the standard deviation was 15.

The toughest problem appeared to be number 6, the one that asked for a
memory-mapped multiplier unit.  The way to approach this problem is in
two stages:

1. Do the datapath first (ignore control for a moment).  The
   multiplier is a thing that we want to attach to a single bus.
   Therefore, as a start, it has to have the usual registers on the
   inputs and tri-state buffers on the outputs.  In other words, the
   datapath is the same no matter whether this is going to be a
   microcode-controlled thing (like the ALU in the single-bus
   implementation of the LC) or a memory-mapped thing.

   There are two registers, call them CAND and IER with control inputs
   LdCAND and LdIER, and two tri-state buffers (since the output of
   the multiplier is 64 bits), call them MULTHI and MULTLO with enable
   inputs DrMULTHI and DrMULTLO.

2. Add the controls for memory mapping.  Again, there are two parts.

   First, the existing memory unit needs to have its control inputs
   conditionalized so that it appears in the bottom 64K of address
   space (instead of replicated everywhere):

     memory "write" input      = (WrMEM && (MAR[31..16] == 0x0000))
     memory's tri-state enable = (DrMEM && (MAR[31..16] == 0x0000))

   Second, we need to develop control signals for the two registers
   and two buffers associated with the multiplier.  These
   inputs/outputs are supposed to show up in particular places in the
   address space.
   In other words, when the processor emits an address 0x00010003,
   I'm supposed to enable the high part of the multiplier result on
   the bus so that the processor reads that result as data.  So, the
   four control signals come from comparators that check for the
   address ANDed with the appropriate WrMEM or DrMEM signal:

     LdCAND   = (WrMEM && (MAR[31..0] == 0x00010000))
     LdIER    = (WrMEM && (MAR[31..0] == 0x00010001))
     DrMULTLO = (DrMEM && (MAR[31..0] == 0x00010002))
     DrMULTHI = (DrMEM && (MAR[31..0] == 0x00010003))

   Obviously, much of the circuitry implied by the six equations above
   is common and could use common hardware.
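
The six equations can be checked with a small behavioral model.  This
is a sketch in Python rather than hardware.  The signal names (WrMEM,
DrMEM, MAR, LdCAND, LdIER, DrMULTLO, DrMULTHI) come from the equations
above; the function name decode and the keys MemWrite and MemDrive are
invented for illustration.  It also shows one way the common circuitry
might be shared: compare MAR[31..16] once and decode the low bits
separately.

```python
def decode(mar, wr_mem, dr_mem):
    """Return the six control signals for a 32-bit MAR value.

    wr_mem and dr_mem are booleans standing in for WrMEM and DrMEM.
    """
    hi = (mar >> 16) & 0xFFFF   # MAR[31..16]
    lo = mar & 0xFFFF           # MAR[15..0]

    in_memory = (hi == 0x0000)  # bottom 64K: the existing memory unit
    in_mult = (hi == 0x0001)    # one shared comparison for the multiplier

    return {
        "MemWrite": wr_mem and in_memory,   # memory "write" input
        "MemDrive": dr_mem and in_memory,   # memory's tri-state enable
        "LdCAND":   wr_mem and in_mult and lo == 0x0000,
        "LdIER":    wr_mem and in_mult and lo == 0x0001,
        "DrMULTLO": dr_mem and in_mult and lo == 0x0002,
        "DrMULTHI": dr_mem and in_mult and lo == 0x0003,
    }


if __name__ == "__main__":
    # A read of address 0x00010003 should drive the high half of the
    # multiplier result onto the bus and nothing else.
    signals = decode(0x00010003, wr_mem=False, dr_mem=True)
    print([name for name, on in signals.items() if on])
```

A read of 0x00010003 asserts only DrMULTHI, and a write anywhere in
the bottom 64K asserts only the memory's write input, so at most one
unit ever drives the bus for a given address.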