;-*- Text -*-
; 19-May-98
;
; RCS: $Id$
;

Lecture notes for CS3760, Monday, 18-May-98

  1. quiz review
  2. continued pipelining the multiple-bus version of the LC

  reading: read chapter 6 about pipelining a processor (great stuff)

----------------

1. quiz topics                    assignment       book
        ------                    ----------       ----
   a. arithmetic                  Homework 2       chap. 4
   b. single-bus processors       Project 2        chap. 5
   c. memory and I/O interface    Project 2        chap. 5
   d. exceptions                  Project 2 (2D)   chap. 5
   e. simple pipelining           Homework 3       section 6.1, pipe.ps

2. Continued the pipelining story for the multiple-bus, single-cycle
   version of the LC.  Chapter 6 develops this story for the MIPS very
   clearly.

   a. I drew the LC datapath so that all signals flow from left to
      right on the picture.  One piece of trickiness is that I drew
      the register file write port in a separate box from the box
      with the two read ports -- this makes a cleaner diagram; there
      really is only one register file.

   b. I pipelined the resulting circuit into five stages:

        IF:  "instruction fetch" -- instruction port to memory
        ID:  "instruction decode" -- contains the opcode decode logic
             but most noticeably contains the read ports to the
             register file.
        EX:  "execute" stage -- contains the ALU and the M1 muxes.
        MEM: "memory" stage -- contains the data port to memory
        WB:  "writeback" stage -- M2 mux and write port to the
             regfile.

      Like any other pipelined circuit, this circuit can execute an
      instruction; it just takes more than one cycle for an
      instruction to "flow" through the pipeline from left to right.
      Most instructions take 5 cycles (have a latency of 5 cycles).
      The SW instruction actually takes only 4 cycles because it
      doesn't use the last stage.

   c. The beauty is that, since it's a pipeline, a *second*
      instruction can be started in the pipeline before the first is
      done.  For instance, a program fragment like this:

        ADD  1 2 3
        ADD  4 5 6
        NAND 7 8 9
        LW   10 11 12
        ADD  13 14 15

      executes in the pipelined machine like this.
      I'm using the standard notation of time (in cycles) increasing
      to the right and operations going down the page:

  cycle:|   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |   9   |
  -------------------------------------------------------------------------------
  ADD:  |IF     |ID(1/2)|EX     |MEM    |WB(3)  |       |       |       |       |
  ADD:  |       |IF     |ID(4/5)|EX     |MEM    |WB(6)  |       |       |       |
  NAND: |       |       |IF     |ID(7/8)|EX     |MEM    |WB(9)  |       |       |
  LW:   |       |       |       |IF     |ID(a/b)|EX     |MEM    |WB     |       |
  ADD:  |       |       |       |       |IF     |ID(c/d)|EX     |MEM    |WB     |

      The first instruction takes 5 cycles to make it through the
      pipeline, but subsequent instructions finish at a rate of one
      per cycle.  Assuming the pipeline is balanced, we can reduce
      the clock cycle by (nearly) 5X and increase the throughput by
      (nearly) 5X!

   d. The rub is that these instructions happen to be independent in
      the pipeline: they all read and write different registers.
      Consider what happens, though, if the second ADD wants to read
      the result from the first ADD:

        ADD  1 2 3       ! writes R3 (as before)
        ADD  3 5 6       ! changed to read from R3
        NAND 7 8 9
        LW   10 11 12
        ADD  13 14 15

  cycle:|   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |
  -----------------------------------------------------------------------
  ADD:  |IF     |ID(1/2)|EX     |MEM    |WB(3)  |       |       |       |
  ADD:  |       |IF     |ID(3/5)|EX     |MEM    |WB(6)  |       |       |
                           ^               ^
                           |               |
                           uh oh! <--------/

      What happens is that, because of pipelining, the second ADD
      instruction is trying to read R3 before the first instruction
      has written it!  The second ADD reaches its ID stage (where the
      registers are read) in the 3rd cycle.  But the first ADD
      instruction doesn't reach its WB stage (where the output
      register is written) until the 5th cycle.

      This isn't an electrical disaster or anything that will cause
      outright smoke and flames, but the effect is that the second
      instruction reads the _previous_ value of R3, not the value
      that the first instruction writes, which is contrary to the
      behavior that we expect from a sequential program.

3. The problem above is a "pipeline hazard", specifically a data hazard.
   [Another plug for the book: chapter 6 has a really nice discussion
   of pipelining and hazards].  There are four ways to deal with
   hazards:

   a. give up #1: oh well, let's just pipeline the processor in two
      stages: IF and ID/EX/MEM/WB.

      That's not a very satisfying solution.  First, it obviously
      gives almost no throughput improvement.  Second, as we'll see
      later, hazards involving branches (control hazards) will
      involve the IF stage, too.

   b. give up #2: make it the programmer's problem...

      This actually works: just declare that the programmer has to
      deal with a modified model of processor behavior: when you
      execute an instruction like ADD, the result is not available
      for 3 cycles.  If you want back-to-back dependent ADDs, as in
      the program above, you have to add NOOPs to the program like
      this:

  cycle:|   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |   9   |
  -------------------------------------------------------------------------------
  ADD:  |IF     |ID(1/2)|EX     |MEM    |WB(3)  |       |       |       |       |
  NOOP: |       |IF     |ID     |EX     |MEM    |WB     |       |       |       |
  NOOP: |       |       |IF     |ID     |EX     |MEM    |WB     |       |       |
  NOOP: |       |       |       |IF     |ID     |EX     |MEM    |WB     |       |
  ADD:  |       |       |       |       |IF     |ID(3/5)|EX     |MEM    |WB     |
                                           ^        ^
                                           |        |
                                           +-> ok ->+

      Exposing the pipeline to the programmer isn't entirely
      unreasonable.  A good compiler can often contrive to reorder
      instructions to fill the three slots.  Machines have been and
      continue to be built this way.

      There are problems, though.  It is not always possible for the
      compiler/programmer to fill the three slots with useful work.
      More importantly, the pipeline model is a level of detail that
      we'd like to avoid exposing in the instruction set model.  What
      if next year we want to bring out an implementation that has a
      6-stage pipeline with 4 delay slots!?  All the code for that
      machine would have to be recompiled.

   c. deal w/it #1: detect the situation and "stall" the pipeline.

      The situation is that the ID stage is attempting to read
      registers that are about-to-be-but-have-not-yet-been written.
      The situation is *detectable* because the register numbers
      involved are stored in registers in various stages of the
      pipeline -- we can build a circuit to check whether RA or RB
      in the ID stage (the numbers of the registers about to be
      read) match the RDST number in the subsequent EX, MEM or WB
      stages.  If they match, we stall the pipeline.

      We'll talk more about stalling on Friday.

   d. deal w/it #2: detect the situation and "forward" (or "bypass")
      the results.

      We didn't talk about this option Monday but I include it here
      for completeness.  The problem is that the new value of R3
      hasn't been written into the register file at the time we want
      to read R3.  However, the new value of R3 does *exist* -- it's
      in a pipeline register -- it just hasn't made it all the way to
      the register file.  So, if you detect the situation (same idea
      as detect-and-stall above), you can contrive to copy the new
      register value from a downstream pipeline register to where
      it's needed.

      And I'll leave the resulting ugly wiring for Friday (or read
      the book!)
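
   As a rough illustration of the "detect" step in 3c and the NOOP
   counting in 3b, here is a minimal Python sketch.  It is not the LC
   hardware and was not part of the lecture; the names Instr,
   stage_cycle and find_data_hazards are made up for this example.
   It assumes the operand order "op srcA srcB dst" suggested by the
   diagrams (ADD 1 2 3 reads R1/R2 and writes R3), an ideal pipeline
   with no stalls, and that a register written in WB can first be
   read in ID one cycle later, which matches the three-NOOP example
   above.

```python
from typing import List, NamedTuple, Tuple

# Stage order of the five-stage pipeline described in the notes.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


class Instr(NamedTuple):
    """Hypothetical three-register instruction: op src_a src_b dst."""
    op: str
    src_a: int
    src_b: int
    dst: int


def stage_cycle(index: int, stage: str) -> int:
    """Cycle (1-based) in which the instruction at position `index`
    (0-based) occupies `stage`, assuming one issue per cycle and no stalls."""
    return index + 1 + STAGES.index(stage)


def find_data_hazards(program: List[Instr]) -> List[Tuple[int, int, int, int]]:
    """Return (writer_index, reader_index, register, noops_needed) for each
    case where a reader reaches ID no later than the writer reaches WB.

    Assumption: a value written in WB is visible to an ID stage only in a
    later cycle, as in the NOOP diagram in the notes.
    """
    hazards = []
    for j, reader in enumerate(program):
        sources = sorted({reader.src_a, reader.src_b})
        for i in range(j):
            writer = program[i]
            # Number of cycles by which the read would have to be delayed.
            delay = stage_cycle(i, "WB") - stage_cycle(j, "ID") + 1
            if delay <= 0:
                continue  # writer finished WB before the reader's ID
            for reg in sources:
                if reg == writer.dst:
                    hazards.append((i, j, reg, delay))
    return hazards


# The dependent fragment from section 2d (LW omitted, since its operand
# roles are not spelled out in the notes).
program = [
    Instr("ADD", 1, 2, 3),   # writes R3
    Instr("ADD", 3, 5, 6),   # reads R3 two cycles before it is written
    Instr("NAND", 7, 8, 9),
]

for writer, reader, reg, noops in find_data_hazards(program):
    print(f"instr {reader} reads R{reg} too early "
          f"(written by instr {writer}); needs {noops} NOOP(s)")
```

   The comparison of source register numbers against an earlier
   instruction's destination is the same check the stall (or
   forwarding) logic would make in hardware; here it is done over the
   whole program at once rather than per cycle on pipeline registers.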