;
; 16-May-98
;
; RCS: $Id$
; Lecture notes for CS3760, Friday, 8-May-98

1. Punch line from Wednesday (once you memory-map the I/O device,
   the rest is easy).
2. Look ahead to the rest of the term.
3. Multiple-bus version of the LC.

----------------

Look ahead: there are two ideas in high-performance design:
"P" (parallelism) & "L" (locality).

        P                               L
    parallelism                     locality

    multiple busses                 registers
    pipelining                      caches
    multiple functional units       virtual memory
    multiple processors

The idea of exploiting parallelism is straightforward: do multiple
things simultaneously if you can...  We'll look at multiple busses
(easy) and pipelining (easy in concept, can get complex in practice).
Those techniques improve seconds/instruction and cycles/instruction,
but stop short of attempting to reduce cycles/instruction below one.
The subsequent parallelism techniques attempt to exceed that natural
limit, but we'll leave those topics to 4760 (gotta leave something).

Locality is the idea that variables "near" one another tend to be used
together in time, in memory space, in physical space, etc.  For
locality, you know all about registers as a programmer.  We'll talk
about the idea of caching in detail.  Virtual memory performs a caching
function but has a lot of uses beyond that; we'll look at it mostly
from the point of view of caching.

----------------

Multiple-bus LC.

My discussion of the multiple-bus LC follows the logic of the
"single-cycle" MIPS developed in the first sections of Chapter 5.  The
resulting circuit is very nearly the same, with some differences
arising from the differences between MIPS and the LC:

1. The MIPS uses byte addressing while the LC uses word addressing.
   Two consequences are:
   a. The MIPS PC incrementer adds 4 after each instruction
      (the LC's adds 1).
   b. The MIPS branch hardware multiplies the OFFSET field by 4
      (the LC uses the OFFSET field as is).

2. The MIPS has a more fully-functional ALU (the ALU developed in
   Chapter 4).

----------------

The multiple-bus/single-cycle machine has no microcode.
Instead, anywhere the circuit needs to make a decision based on the
opcode (or the equals-zero circuit), there is a little piece of custom
combinational logic that computes exactly what is needed during the
single cycle.

For instance, the "alufunc" inputs on the ALU are connected to a block
of combinational logic that picks an ALU function based on the opcode.
The first part of the truth table for this block looks like this:

    OP3 OP2 OP1 OP0 | F1 F0
    -----------------------
     0   0   0   0  |  0  0     ! OP = ADD,  func = addition
     0   0   0   1  |  1  0     ! OP = NAND, func = NAND
     0   0   1   0  |  0  0     ! OP = LW,   func = addition
     0   0   1   1  |  0  0     ! OP = SW,   func = addition
     0   1   0   0  |  0  1     ! OP = BEQ,  func = subtract
     [...]

The alufuncs for ADD/NAND are straightforward.  For LW/SW, the ALU is
used to compute the address (register + offset), so the ALU is set up
for addition.  For BEQ, the ALU is used to compare for equality, so
the function is subtraction (the operands are equal exactly when the
difference is zero).

----------------

note for next time: talk about the difference between instruction
layout and the branch instruction scheme in the Sparc & Alpha as
compared to the MIPS/LC.  The Sparc instruction layout avoids the MUX
of RB and RD.  The Sparc LD instruction comes in both R & I forms.
The Sparc branch instruction could use the main ALU (although I bet it
doesn't).  The Alpha branch instruction uses a zero detector separate
from the ALU (I think...).  Anyway, discuss the tradeoffs: # of
instructions, cycle time, silicon area.  Complex tradeoffs, but all
measurable.

[[[later on the book mentions that MIPS implementations often use an
equality comparator placed ahead of the real ALU so that branch
resolution can be moved up to the ID stage.]]]

note for next year: discuss switch-based logic at the beginning.  It
makes tri-states easy to explain.  It also makes pass-gates clear and
would allow me to explain the guts of real ROM/RAMs with no mystery.
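
----------------

Going back to the alufunc truth table: a small software model can make
the "little piece of custom combinational logic" idea concrete.  What
follows is only a sketch, not the actual circuit.  It covers just the
five opcode rows listed in the table (the elided rows are left out
rather than guessed).  The names ALUFUNC, alufunc and alu, and the
32-bit default word width, are my own assumptions for illustration.

```python
# Sketch of the alufunc decoder as a lookup table.
# Only the five rows shown in the truth table are modeled.
# Names and the default word width are illustrative assumptions.

ALUFUNC = {
    # (OP3, OP2, OP1, OP0): (F1, F0)
    (0, 0, 0, 0): (0, 0),  # ADD  -> addition
    (0, 0, 0, 1): (1, 0),  # NAND -> NAND
    (0, 0, 1, 0): (0, 0),  # LW   -> addition (register + offset)
    (0, 0, 1, 1): (0, 0),  # SW   -> addition (register + offset)
    (0, 1, 0, 0): (0, 1),  # BEQ  -> subtract (compare for equality)
}


def alufunc(opcode):
    """Map a 4-bit opcode to (F1, F0); KeyError for rows not in the table."""
    bits = tuple((opcode >> i) & 1 for i in (3, 2, 1, 0))
    return ALUFUNC[bits]


def alu(f1, f0, a, b, width=32):
    """Apply the selected function; width is an assumed word size."""
    mask = (1 << width) - 1
    if (f1, f0) == (0, 0):
        return (a + b) & mask
    if (f1, f0) == (0, 1):
        return (a - b) & mask
    if (f1, f0) == (1, 0):
        return ~(a & b) & mask
    raise ValueError("function code not in the listed rows")


# BEQ: subtract, then the equals-zero circuit looks at the result.
branch_taken = alu(*alufunc(0b0100), 7, 7) == 0
```

In hardware each F output would be a few gates on the OP bits, not a
lookup, but the input/output behavior is the same as the table.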