;-*- Text -*-
; 26-May-98
;
; RCS: $Id$
;
Lecture notes for CS3760, Friday, 22-May-98

Continued pipelining the multiple-bus version of the LC

reading: read chapter 6.

see also the accompanying diagram (handed out in class):

        http://www.cc.gatech.edu/~kenmac/3760/notes/lc-multibus.ps

One note about the diagram: the pipeline registers are named by the
stages they separate. I.e., the one between the IF and ID stages is
called IFID. The others are IDEX, EXMEM and MEMWB. Also, for the
purpose of hazard detection in project 3, we add a last pipeline
register, WBEND.

----------------

0. Pipeline hazards are caused by dependencies between stages of the
pipeline, e.g. trying to read a register before the register has
actually been written.

We can do four things to deal with hazards:

  a. "Expose" them. After all, it's just a mismatch between what the
     hardware does and what the programmer expects it to do, so change
     the programmer's (i.e. the compiler writer's) expectations.
     Sometimes this approach is okay.

  b. "Forward" data to the stage where it's needed. In the case of
     reading a register before it's written, usually the data that is
     *going*to*be* written already exists and is just sitting
     somewhere in a pipeline register. Find it and "forward" it to
     the execution unit.

  c. "Stall" the pipeline, i.e. freeze the stage that needs the data
     (and all preceding stages) until the data is available.

  d. "Predict/Squash". Make a guess, execute as if the guess were
     true, and "squash" instructions in the pipeline if the guess is
     false. We'll do this with control transfer instructions.


1. Data hazards and forwarding.

The problem is that an instruction wants to read a value before it has
been written into the register file. E.g. the example from
Wednesday's note:

        ADD     1 2 3           ! writes R3 (as before)
        ADD     3 5 6           !
                                  changed to read from R3
        NAND    7 8 9
        LW      10 11 12
        ADD     13 14 15

cycle:|   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |
-----------------------------------------------------------------------
ADD:  |IF     |ID(1/2)|EX     |MEM    |WB(3)  |       |       |       |
ADD:  |       |IF     |ID(3/5)|EX     |MEM    |WB(6)  |       |       |
                         ^               ^
                         |               |
                      uh oh! <-----------/

This problem can be solved by forwarding. Ordinarily, one forwards
data to the stage that actually needs the data. In this case, that's
the EX stage. At the time the EX stage needs the data (cycle 4), the
data is actually in the EXMEM pipeline register (the portion that
stores the 32 output bits from the ALU) since the data needed is the
data the ALU computed on cycle 3.

That's easy to say, but how can one tell what to do automatically,
i.e. in a circuit?...

The trick is to compare the register _numbers_ associated with the
inputs to the ALU in the EX stage with the register numbers associated
with the data in subsequent pipeline stages. Assume that we save the
RA and RB specifiers in the IDEX pipeline register. Assume further
that we can specify the fields in the pipeline using a C-like syntax:

        IDEX.ra                 RA specifier in the IDEX pipeline register
        EXMEM.aluresult         ALU result in the EXMEM pipeline register

We can tell that this particular data hazard situation is occurring by
observing that the RDST register address in the EXMEM pipeline
register is the same as the RA register address in the IDEX pipeline
register. I.e., we need a piece of hardware that does this:

        int alu_input_a(void)
        {
          if (IDEX.ra == EXMEM.rdst)    /* in this hazard situation... */
            return(EXMEM.aluresult);    /* ... get input from EXMEM reg */
          else
            return(IDEX.a);             /* otherwise from IDEX as usual */
        }

The hardware that does '==' is a 4-bit comparator (4 bits because
register addresses are four bits). The hardware represented by the
if/else structure is a 32-bit 2-to-1 MUX.
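
To make the comparator-plus-mux idea concrete, here is a minimal
sketch in compilable C of the same selection, extended to check more
than one later pipeline register. The struct layouts, the writes_reg
flag and the name forward_alu_a are assumptions made for this sketch;
they are not taken from the LC diagram. The youngest matching result
is chosen first because it is the most recent write to that register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pipeline-register fields (assumed for illustration). */
struct idex_reg {
    unsigned ra;      /* RA specifier saved by the ID stage */
    uint32_t a;       /* value read from the register file  */
};

struct fwd_src {
    bool     writes_reg;  /* false for NOOPs and non-writing instrs */
    unsigned rdst;        /* destination register number            */
    uint32_t value;       /* result waiting to be written back      */
};

/* Choose the ALU A input: each comparison is one 4-bit comparator,
   and the if-chain as a whole plays the role of a 4-to-1 mux. */
uint32_t forward_alu_a(const struct idex_reg *idex,
                       const struct fwd_src *exmem,
                       const struct fwd_src *memwb,
                       const struct fwd_src *wbend)
{
    assert(idex->ra < 16u);            /* register numbers are 4 bits */

    if (exmem->writes_reg && exmem->rdst == idex->ra)
        return exmem->value;           /* youngest result wins        */
    if (memwb->writes_reg && memwb->rdst == idex->ra)
        return memwb->value;
    if (wbend->writes_reg && wbend->rdst == idex->ra)
        return wbend->value;
    return idex->a;                    /* no hazard: use IDEX as usual */
}
```

The writes_reg test matters: without it, a NOOP whose (meaningless)
rdst field happened to match would forward junk into the ALU.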

In general, the forwarding logic (both the muxes and the detectors) is
more complex because you have to be prepared to forward data from the
EXMEM, MEMWB or WBEND registers (meaning a 4-to-1 mux at the ALU A
input and three comparators) and you have to be prepared to forward
data to the ALU B input as well as the A input (double everything I
just said!). Figure 6.40 on page 484 of the text depicts the data
forwarding wiring for the MIPS with the multiplexors shown explicitly
and the comparators buried in a box called "forwarding unit".

Note that all this forwarding hardware is "outside" the pipeline.
When we were doing simple pipelining (as in Homework 3), the pipeline
register banks were carefully drawn to cover all paths through the
circuit. In this more complicated world of processor pipelining, you
have to add additional circuitry to deal with the pipeline itself...

The book assumes that its register file can read and write a value in
a single cycle. While this is common practice, we choose not to break
the clocking abstraction in this way; instead, the LC has an extra
pipeline register WBEND used to detect hazards at the pipeline unit.

To be explicit, consider this situation, where an ADD writes R3 and
then, three cycles later, another instruction attempts to read R3:

        ADD     1 2 3           ! writes R3 (as before)
        NOOP
        NOOP
        ADD     3 5 6           ! tries to read R3

cycle:|   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |
-----------------------------------------------------------------------
ADD:  |IF     |ID(1/2)|EX     |MEM    |WB(3)  |       |       |       |
NOOP: |       |IF     |ID     |EX     |MEM    |WB     |       |       |
NOOP: |       |       |IF     |ID     |EX     |MEM    |WB     |       |
ADD:  |       |       |       |IF     |ID(3/5)|EX     |MEM    |WB     |
                                         ^       ^
                          write/read     |       |
                          same cycle   --/       \-- R3 actually updated on
                                                     this cycle in the
                                                     standard clocking
                                                     discipline.

bottom line: the EX stage sees the *old* value of R3.

Summary: forwarding solves most of our data hazards (one exception
below). Forwarding is great if you can do it because it keeps the
pipeline moving at one instruction per cycle.
A cost, however, is that the multiplexors used may slow down the
system if they add delay to the critical path.

An aside: modern "superscalar" processors have more than one ALU in
the EX stage. Among other things, that means they need a lot more
forwarding circuitry. The delay induced by this forwarding circuitry
(which grows as O(n^2) given n ALUs) is a serious limit to the clock
speed of current processors.


2. Data hazards and stalls.

One situation that the forwarding hardware cannot eliminate is the
situation where a LW is followed by an instruction that wants to use
the result of the load, e.g.:

        LW      0 3 0x42        ! writes R3 (as before)
        ADD     3 3 6           ! reads R3

cycle:|   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |
-----------------------------------------------------------------------
LW:   |IF     |ID     |EX     |MEM    |WB(3)  |       |       |       |
ADD:  |       |IF     |ID(3/3)|EX     |MEM    |WB(6)  |       |       |
                                 ^
                                 |
                                 \-- data needed by the EX stage here
                                     but the MEM stage hasn't had a
                                     chance to read it from memory
                                     yet!!

The problem could be resolved a cycle later by forwarding, but when
the LW is followed immediately by the ADD in this way, the problem is
impossible(*) to solve without a delay because the data item doesn't
yet exist.

This situation could be exposed to the programmer as a "delay slot"
after a load instruction. That's not an unreasonable thing to do
since the time will be wasted anyway. However, as mentioned before,
exposing these pipeline details in the instruction set architecture is
annoying because it makes programs less portable, etc.

So we stall the pipeline. Figuring out *when* to stall the pipeline
is the same as figuring out when to perform forwarding:

        if (EXMEM.instr is a LW &&
            ((IDEX.instr is something that reads RA &&
              EXMEM.rdst == IDEX.ra) ||
             (IDEX.instr is something that reads RB &&
              EXMEM.rdst == IDEX.rb)))
          [... then stall ...]

Figuring out *how* to stall the pipeline is a little more interesting.
Essentially, we want to create the effect of inserting a NOOP into the
pipeline between the LW and the ADD instructions:

  a. The MEM and WB stages should operate normally, since the whole
     point is to let the operation in the MEM stage proceed.

  b. The EX stage can't compute anything this cycle, so rather than
     pass junk along to the MEM stage, it had better be sure to pass a
     NOOP instruction. You can do this by using a multiplexor at the
     inputs to the instr and control fields of the EXMEM pipeline
     register to select whether to pass along the instr/control
     values from the EX stage or NOOP instr/NOOP control values.

  c. The ID and IF stages (and the PC register) must be disabled for
     one cycle while the NOOP is being inserted.

Note, I described how to detect and implement the stall in the EX
stage. In fact, since the only thing that's really happening is that
a NOOP is being inserted into the pipeline, the detection/insertion
can occur earlier. In Section 6.5, the book inserts the NOOP at the
ID stage, which is probably more natural.

Summary: we solve all the data hazards through a combination of
forwarding and stalling. Forwarding is preferred. The LW hazard
requires stalling (since it crosses two stages), but only for one
cycle, since forwarding will fix things up after that.

Stalling wastes time but at least it only happens when there's
actually a dependency. For instance, in the following sequence (which
adds the contents of 0x42 to the contents of 0x43), only the second LW
causes a stall, because only its result is needed by the very next
instruction:

        LW      0 1 0x42        ! doesn't stall
        LW      0 2 0x43        ! stalls one cycle
        ADD     1 2 3


3. Control hazards.

The last pipeline hazard comes from the BEQ instruction. Looking at
the diagram, the PC is updated each cycle by the IF stage. However,
on a branch, the outcome of the branch and the target of the branch
aren't known until the MEM stage. That means that without additional
hardware, a BEQ instruction that took the branch would take three
cycles to take effect!

        BEQ     skip
        ADD             ! always executed
        ADD             !
                          always executed
        ADD             ! always executed
        NAND            ! executed if not equal
        NAND            ! executed if not equal
skip:   LW

Again, we could expose this effect to the programmer as a feature.
This feature, a "branch delay slot", was common in the first
generation of RISC architectures (MIPS & SPARC), but their designers
came to regret it as implementations became more complicated.

If you're going to have delay slots, you want as few of them as
possible. You can reduce the number down from our current three by
moving the branch target/branch taken logic to earlier stages.

You can't forward control info. You can always stall until the branch
direction is resolved, but unlike data dependencies, you have to stall
*every* time. Since branches occur frequently (every 5 instructions
or so), stalling would waste a lot of time.

So we explore a new technique. If you're an optimist (and we are),
try to guess the direction of the branch, execute instructions as if
that direction were true, and then *squash* the instructions in the
pipeline if the prediction turns out to be wrong. This
predict-and-squash speculative approach can work because (up to a
point) instructions in the pipeline haven't done any "damage" yet --
they haven't changed the permanent, program-visible state values in
memory or in registers -- so they can be squashed simply by turning
the instr/control fields in the pipeline registers to NOOPs.

The easiest way to apply prediction is to predict that a branch is not
taken, since that's what the IF hardware wants to do anyway (keep
fetching from PC+1). At the point the BEQ instruction reaches the MEM
stage, then, we may suddenly find that the instructions currently in
the IF, ID and EX stages are bogus and need to be dropped.

        BEQ     skip
        ADD             ! always executed
        ADD             ! always executed
        ADD             ! always executed
        NAND            ! executed if not equal
        NAND            !
                          executed if not equal
skip:   LW

cycle:|   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |
-----------------------------------------------------------------------
BEQ:  |IF     |ID     |EX     |MEM    |WB     |       |       |       |
ADD:  |       |IF     |ID     |EX     |xxMEM  |xxWB   |       |       |
ADD:  |       |       |IF     |ID     |xxEX   |xxMEM  |xxWB   |       |
ADD:  |       |       |       |IF     |xxID   |xxEX   |xxMEM  |xxWB   |
LW:   |       |       |       |       |IF     |ID     |EX     |MEM    |WB
                                 ^       ^
                                 |       |
           discover on cycle 4 --/       \-- three instructions squashed
           that the branch is                (internally replaced w/NOOPs)
           in fact taken!

More elaborate prediction strategies are certainly possible. A simple
one is to offer two BEQ opcodes, BEQ-probably-taken and
BEQ-probably-not-taken, and let the compiler pass along any info it
has to the hardware. Hardware schemes include keeping a table of the
actual directions of branches and using that information to predict
future executions of the same branch. The table is indexed by a hash
on the PC value so that the lookup can occur in the IF stage with no
lost cycles.

Aside: branch prediction is especially important to superscalar
processors that (with their multiple ALUs) are attempting to execute
more than one instruction per cycle. Branches matter even more there
because even one or two stalled/squashed cycles lead to a large number
of wasted instructions! There's a whole mini-industry in the research
community inventing branch prediction schemes...


4. More asides.

  a. Prediction can be applied to data hazard situations as well; this
     is another hot research topic right now. The only annoying data
     situation we have left at the moment is the LW followed
     immediately by a use of the target register, e.g. in an ADD.
     Suppose that you make a guess of what the value of the LW is
     going to be (e.g. from a table indexed by a hash of the RA/offset
     values) -- you could optimistically use the predicted value in
     the ADD right away and do a squash if the value turns out to be
     wrong.
     This simple-minded value prediction scheme probably wouldn't
     work often enough to be useful :-) although it might work well
     on things like stack accesses.

  b. We've been assuming a general-register architecture throughout
     the term. What about other instruction set architectures like
     the single-address architecture with only one register (an
     accumulator)? What happens when you try to pipeline it? The
     general-register idea allows multiple operations in parallel at
     least *sometimes*, and we can use the compiler to increase that
     probability.
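
Going back to the hardware branch prediction table mentioned at the
end of section 3: a minimal sketch in C of the simplest version, which
remembers the last direction each branch took and predicts the same
direction next time. The table size, the hash (low-order PC bits) and
the function names are all assumptions made for illustration; real
predictors are considerably more elaborate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 16u   /* assumed table size; must be a power of two */

static_assert((BHT_ENTRIES & (BHT_ENTRIES - 1u)) == 0u,
              "BHT_ENTRIES must be a power of two");

/* One bit per entry: did the branch that hashed here last go taken?
   Zero-initialized, so every branch starts out predicted not taken,
   which matches what the IF stage does by default (fetch PC+1). */
static bool bht[BHT_ENTRIES];

/* Hypothetical hash: just the low-order bits of the PC. */
static unsigned bht_index(uint32_t pc)
{
    return pc & (BHT_ENTRIES - 1u);
}

/* Looked up in the IF stage: should we fetch from the branch target? */
bool predict_taken(uint32_t pc)
{
    return bht[bht_index(pc)];
}

/* Called once the branch resolves (the MEM stage in our pipeline). */
void record_outcome(uint32_t pc, bool taken)
{
    bht[bht_index(pc)] = taken;
}
```

Because the table is indexed by a hash, two different branches can
share an entry and disturb each other's predictions. That costs only
performance, not correctness, since a wrong prediction is squashed
just like any other.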