Project 2--CS3760 Winter 99 Assigned: Thursday, February 18, 1999 Due: 6:00 pm, Friday, March 5, 1999 1 Purpose This project is intended to help you understand in detail how a pipelined implementation works. You will write a cycle-accurate behavioral simulator for a pipelined implementation of the LC-99, complete with data forwarding and branch prediction. 2 LC-99 Pipelined Implementation 2.1 Datapath The dual-bus style of Project 1's LC-99 implementation is not suitable for pipelining, because it is difficult for more than one instruction to use the buses at the same time. So for this project we will use a datapath similar (but not identical) to the MIPS datapath from Chapter 6 of Patterson and Hennessy. We will provide a picture of the datapath for this assignment. There are several changes that you should note between the MIPS datapath in the textbook and the LC-99 datapath for this assignment: 1) Instead of a "4" input in the PC's adder, we will use a "1", since the LC-99 is word-addressed instead of byte-addressed. 2) The instruction bit fields have to be modified to suit the LC-99's instruction-set architecture. 3) The "shift left 2" component is not necessary in computing the branch target. Because LC-99 uses word-addressing, the offset does not have to be shifted before adding to the incremented PC. 4) We have moved the logic to compute the branch target earlier in the pipeline, to the ID stage. Note that we will know the branch target earlier than the branch condition. 5) The MUX on the input to the PC now includes three possible inputs: the incremented PC from the IF stage, the branch target, and the incremented PC from the EX stage. The second and third inputs to this MUX will be used for hardware branch prediction, which is described in detail below. For our datapath, we will ALWAYS PREDICT THAT THE BRANCH IS TAKEN. 6) The MUXes on the ALU inputs now include connections for forwarded data values. Values may be forwarded from the aluResult of the EX/MEM register, from the aluResult or loadMemData of the MEM/WB register, or from the writeRegData of the WB/end register. 7) We will add a pipeline register AFTER the write-back stage (called the WBEND pipeline register). This register is used to simplify data forwarding so that the register file does not have to do any internal forwarding. 2.2 jalr You will not implement the jalr instruction from the LC-99. Taking out jalr eliminates several dependencies. 2.3 Memory Note in the typedef of stateType below that there are two memories: instrMem and dataMem. When the program starts, read the machine-code file into BOTH instrMem and dataMem (ie. they'll have the same contents in the beginning). During execution, read instructions from instrMem and perform load/stores using dataMem. That is, instrMem will never change after the program starts, but dataMem will change. 2.4 Pipeline Registers To simplify the project and make the output formats uniform, you must use the following structures WITHOUT MODIFICATION to hold pipeline register contents. Note that the instruction gets passed down the pipeline in its entirety. #define NUMMEMORY 10000 /* maximum number of data words in memory */ #define NUMREGS 8 /* number of machine registers */ #define ADD 0 #define NAND 1 #define LW 2 #define SW 3 #define BEQ 4 /* note that JALR will NOT be implemented for project 3 */ #define HALT 6 #define NOOP 7 #define NOOPINSTRUCTION 0x1c000; typedef struct IFIDStruct { int instr; int pcPlus1; } IFIDType; typedef struct IDEXStruct { int instr; int pcPlus1; int readRegA; int readRegB; int immediate; } IDEXType; typedef struct EXMEMStruct { int instr; int aluResult; int readRegB; } EXMEMType; typedef struct MEMWBStruct { int instr; int loadMemData; int aluOutput; } MEMWBType; typedef struct WBENDStruct { int instr; int writeRegData; } WBENDType; typedef struct stateStruct { int pc; int instrMem[NUMMEMORY]; int dataMem[NUMMEMORY]; int reg[NUMREGS]; int numMemory; IFIDType IFID; IDEXType IDEX; EXMEMType EXMEM; MEMWBType MEMWB; WBENDType WBEND; int cycles; /* number of cycles run so far */ } stateType; 2.5 Branch prediction Our datapath will include branch prediction. When we encounter a branch, we will ALWAYS PREDICT THAT THE BRANCH IS TAKEN. We make this prediction because statistics show that most branches are taken. (For example, 67% of branches are taken in the SPEC benchmarks.) You should implement branch prediction as follows: When you decode an instruction in the ID phase and identify it as a branch, change the value of the PC to the branch target (which is also calculated during the ID phase). This has the effect of reducing the branch penalty on a taken branch from 3 cycles in the datapath shown in the book to just one cycle. When we begin fetching at the branch target, we must kill or abort the instruction in the IF phase; this instruction is the fall-through instruction, or the instruction immediately following the branch. Kill this instruction by inserting a NOOPinstruction into the IFID register at the end of the cycle. Most of the time, our prediction that the branch is taken will be correct. However, sometimes our prediction will be wrong. We will determine whether the prediction was correct in the EX stage of the branch instruction, when we calculate the branch condition. When the branch is not taken (i.e., our prediction was wrong), we must restore the value of the PC to the "fall-through" or branch-not-taken value of the PC (which has been passed along to this stage through the pipeline register.) The branch target instruction, which was just fetched, must be killed or aborted. On the next cycle, we will re-fetch the fall-through instruction that was discarded when we made our original prediction. (Note: An alternative implementation could store the fall-through instruction until the branch outcome was known, to avoid fetching it twice when the branch is not taken. For simplicity, we will re-fetch the instruction.) 2.6 Data Flow vs. Control In project 1, your main task was to set the appropriate control signals for each state of your state diagram. In project 2, your main task is to move values through the datapath. You will not set specific control signals. Rather, at each stage of the pipeline, you will perform necessary calculations or memory accesses and then record the results of these accesses in the pipeline register. 3 Problem 3.1 Basic Structure Your task is to write a cycle-accurate simulator for the LC-99. We will provide template code to get you started. At the start of the program, we initialize the pc and all registers to zero. Initialize the instruction field in all pipeline registers to the noop instruction (0x1c000). main() will be a loop, where each iteration through the loop executes one cycle. At the beginning of the cycle, print the complete state of the machine (you MUST use the printState function WITHOUT MODIFICATION). In the body of the loop, you will figure out what the new state of the machine (memory, registers, pipeline registers) will be at the end of the cycle. Recall that a cycle is defined by clock edges, for example, from one rising edge of the clock to the next rising edge. You will simulate all the actions that take place between the edges that define a clock cycle. Any changes that you make to the machine state (memory, register file and pipeline registers) will take effect on the clock edge that ends the current cycle. In hardware, all stages of the pipeline compute their new state simultaneously. However, since we are simulating the operation of our machine in hardware, we must use a language like C that executes statements sequentially. To simulate all the changes taking effect simultaneously, we will use two state variables: state and newState. The variable state will define the current state of the machine while the cycle is executing. At the end of the clock cycle, the variable newState will becoem the current state of the machine. So, during the current clock cycle, you will modify the newState variable using values in the current state variable. For example, in the ID stage, you will have a statement like newState.IDEX.instr = state.IFID.instr (to transfer the instruction in the IFID register to the IDEX register) In the body of loop, you will use newState ONLY as the target of an assignment and you will use state ONLY as the source of an assignment (e.g. newState... = state...). state should never appear on the left-hand side of an assignment, and newState should never appear on the right-hand side of an assignment. ****************************************************************************** Your simulator MUST be pipelined. This means that the work of carrying out an instruction should be done in different stages of the pipeline and the execution of multiple instructions should be overlapped. The IF stage should be the ONLY stage that reads the instruction memory; the ID stage should be the ONLY stage that reads the register file; the MEM stage should be the ONLY stage that accesses data memory; and so on. Values must be passed to other stages via the pipeline registers. If it violates these criteria, your program will get a 0. ****************************************************************************** 3.2 Halting At what point does the pipelined computer know to halt? It's incorrect to halt as soon as a halt instruction is fetched because if an earlier branch was actually taken, then the halt instruction could actually have been branched around. To solve this problem, halt the machine when a halt instruction reaches the MEMWB register. This ensures that previously executed instructions have completed, and it also ensures that the machine won't branch around this halt. (The final printState call before the check for halt will print the final state of the machine.) 3.3 Begin Your Implementation Assuming No Hazards The easiest way to start is to first write your simulator so that it does not account for data hazards or branch predictions. This will allow you to get started right away. Of course, this initial version of the simulator will only be able to correctly run assembly-language programs that have no hazards. It is thus the responsibility of the assembly-language programmer to insert noop instructions so that there are no data or branch hazards. This means putting a number of noops in an assembly-language program after a branch and a number of noops in an assembly-language program before a dependent data operation (it's a good exercise to figure out the minimum number needed in each situation). 3.4 Finish Your Implementation by Accounting for Hazards Modifying your first implementation to account for data hazards and branch prediction will probably be the hardest part of this assignment. Use data forwarding to resolve most data hazards. I.e. the ALU should be able to take its inputs from the EX/MEM, MEM/WB and WB/end pipeline registers (instead of just the IDEX register). There is no need for forwarding within the register file (as the book has). For this case of forwarding, you'll instead forward data from the WBEND pipeline register. Remember to take the most recent data (e.g. data in the EXMEM register gets priority over data in the MEMWB register). ONLY FORWARD DATA TO THE EX STAGE. You will also need to stall for one type of data hazard: a lw followed by an instruction that uses the register being loaded. You will detect this condition when the conflicting instruction (the one following the lw) reaches the ID stage. To implement the stall, don't allow the instructions in the IF and ID stage to proceed during the one-cycle stall period. As described above, always predict that branches are TAKEN. Change the PC to the branch target as soon as the branch instruction is decoded. This requires discarding the fall-through instruction of the branch. When the branch outcome is known, either continue executing at the branch target if the prediction was correct, or discard the predicted instructions and re-fetch the fall-through instruction of the branch. To insert a stall or to discard instructions, change the relevant instructions in the pipeline registers to the noop instruction (0x1c000). 4 Running and Demonstrating Your Program Your simulator should be run using the same command format specified in Project 1, that is: simulate code > output 5 Grading and Formatting We will grade almost solely on functionality. In particular, we will run your program on various assembly-language programs and check the contents of your memory, registers, and pipeline registers at each cycle. Most of these assembly-language programs will have hazards; a few will be hazard-free. Since we'll be grading on getting the exact right answers (both at the end of the run and being cycle-accurate throughout the run), it behooves you to spend a lot of time writing test assembly-language programs and testing your program. Programs that are not doing exactly what they're supposed to on every cycle will be penalized heavily, so check carefully. We will use a program to compare your output against a solution output. So it's very important that you follow these exact formatting rules: 1) Don't modify printState at all. 2) There should be ONLY ONE call to printState in your program, which is the printState shown in Section 3.1. Do not put in any extra printState calls (you can put these in for debugging, but take them out before submitting the program). 3) Make sure to initialize all values correctly. a. state.numMemory should be set to the number of memory words in the machine-code file. b. state.cycles should be set to 0. c. pc and all registers should be set to 0. d. the instruction field in all pipeline registers should be set to the noop instruction (0x1c000). 4) Check your program's output on the sample assembly-language program and output at the end of this handout. 5) Pay particular attention to what stage various operations are done in. For example, PC is incremented in the IF stage, so the IFID register should have PC+1. Also, the sign-extender is in the ID stage, so the IDEX register should contain the value of offsetField AFTER calling convertNum. 6) Don't print the sequence "@@@" anywhere except in printState(). Having events happen on the right cycle will also be very important (e.g. stall the exact number of cycles needed, write the branch target into the PC at exactly the right cycle, halt at the exact right cycle, stalling only when needed).