Project 1--EE/CMPE 4760 (Fall 1998) 1 Purpose This project is intended to review in detail how a pipelined implementation works. Using a subset of the DLX architecture, you will write a cycle-accurate behavioral simulator for a pipelined implementation, complete with data forwarding and branch prediction. 2 DLX Pipelined Implementation 2.1 Instruction Set You will implement the pipeline processor for the following subset of DLX instructions: LW, SW ADD, ADDI SUB AND, ANDI BEQZ J We will add two instructions: --the HALT instruction, which tells the simulator when to stop executing --the NOOP instruction, which does nothing. For simplicity, we will assume in Project 1 that all the arithmetic operations can be executed by the ALU in a single EX cycle. Assume the instruction formats in the book (Figure 2.21). Since the book doesn't define the opcodes for the instructions, assume the following MIPS and other instruction opcodes: Instr Type Opcode Function Code ----- ---- ------ ------------- LW I 35 N/A (not applicable) SW I 43 N/A ADD R 0 32 ADDI I 8 N/A SUB R 0 34 AND R 0 36 ANDI I 12 N/A BEQZ I 4 N/A J J 2 N/A HALT J 1 N/A NOOP J 3 N/A 2.2 Which Pipeline? There are several variations of the pipeline presented in Chapter 3. We will assume the datapath in Figure 3.4. Important points about this datapath: the branch target and condition are resolved in the EX phase. We will make one addition to the datapath of Figure 3.4: a pipeline register AFTER the write-back stage (the WBEND pipeline register). This will be used to simplify data forwarding so that the register file does not have to do any internal forwarding. 2.3 Writing Programs for Your Simulator We will supply sample programs and sample machine code for you to run on your simulator. You can also write your own programs in DLX assembly language and translate them to machine code using an assembler that we provide. To execute the assembler, you will run: assemble The output of the assembler (in ) is then used as input to your simulator. Each line of machine code corresponds to an instruction of assembly. As you'll see in the code provided, the simulator reads the entire program into memory and then executes each instruction. When writing assembly language programs, you will use a special notation a little different from that in the book. This notation matches the instruction formats more closely. For example, in the book a load word of register 1 from the memory location [30 + contents of register 2] would be specified by: lw R1, 30(R2) For our assembler, we would write that instruction: lw 2 1 30 This reflects the I-type instruction format, where the first register specified is Rs1, the second register specified is Rd, and the third field is the offset or immediate value. Likewise, the book specifies an ADD instruction for reg[1] <- reg[2] + reg[3] as: add r1, r2, r3 We would specify an ADD instruction as: add 2 3 1 Note the different ordering of the registers. Again, this reflects the R-type structure that lists the registers in the order Rs1, Rs2 and Rd. The format for a line of assembly code is: labelinstructionfield0field1field2comments The leftmost field on a line is the label field. Valid labels contain a maximum of 6 characters and can consist of letters and numbers. The label is optional (the tab following the label field is not). Labels make it much easier to write assembly language programs, since otherwise you would need to modify all address fields each time you added a line to your assembly-language program! After the optional label is a tab. Next is the instruction field, where the instruction can be any of the assembly-language instruction names listed in the above table. After another tab comes a series of fields. All fields are given as decimal numbers. The number of fields depends on the instruction. I-type instructions (lw, sw, addi, andi, beqz) require 3 fields: field0 is the register rs1, field1 is register rd, and field2 is an immediate value or a symbolic address that the assembler resolves. For the beqz instruction, field1 is unused; if the rs1 register is zero, the program will branch to location PC+1+(16-bit offset). Numeric offsetFields are limited to 16 bits and can be positive or negative. R-type instructions (add, sub, and) instructions require 3 fields: field0 is source register 1, field1 is source register 2, field2 is destReg. J-type instructions (j) require 1 field: field0 is the 26-bit jump offset for j instructions. The jump is PC-relative, meaning that after the jump, the new value of the PC is PC+1+(26-bit jump offset). O-type instructions (noop, halt) require no fields. After the last used field comes another tab, then any comments. We also need to be able to load values into memory locations. We do this with an assembler directive called .fill. This .fill directive tells the assembler to put a number into the place where the instruction would normally be stored. .fill instructions use one field, which can be either a numeric value or a symbolic address. For example, ".fill 32" puts the value 32 where the instruction would normally be stored. .fill with a symbolic address will store the address of the label. Here is an example assembly language program that uses each instruction type: lw 0 8 neg1 load reg8 with -1 lw 0 5 four load reg5 with 4 sub 5 8 6 reg6 <- reg5 - reg8 (reg6 <- 5) addi 5 3 4 reg3 <- reg5 + 4 (reg3 <- 8) andi 5 2 240 reg2 <- reg5 & 0xf0 (reg2 <- 0) loop add 5 8 5 reg5 <- reg5 + reg8 (reg5 decremented) beqz 5 done if reg5=0, go to done j loop noop (never executed) done halt neg1 .fill -1 four .fill 4 And here is the corresponding machine code output: -1945632758 (0x8c08000a) -1945829365 (0x8c05000b) 11022370 (0xa83022) 547553284 (0x20a30004) 815923440 (0x30a200f0) 11020320 (0xa82820) 278921218 (0x10a00002) 201326589 (0xbfffffd) 201326592 (0xc000000) 67108864 (0x4000000) -1 (0xffffffff) 4 (0x4) Make sure you understand how assembly instructions get encoded into machine instructions. Pay particular attention to the branch and jump instructions, which use a 16-bit or 26-bit offset, respectively, from the value PC + 1. (Note we don't use PC + 4 here, as the book does, because we only increment by 1 as we step through the memory.) 2.4 Data Structures The following data structures specify the fields in the pipeline registers used in the simulator and the entire hardware state of the simulated computer. You must use these data structures WITHOUT MODIFICATION to hold pipeline register contents. Note that we have added an extra pipeline register to hold the results of the WB stage. typedef struct IFIDStruct { int instr; int pcPlus1; } IFIDType; typedef struct IDEXStruct { int instr; int pcPlus1; int readRegA; int readRegB; int immediate; } IDEXType; typedef struct EXMEMStruct { int instr; int branchTarget; int branchCond; int aluOutput; int readRegB; } EXMEMType; typedef struct MEMWBStruct { int instr; int loadMemData; int aluOutput; } MEMWBType; typedef struct WBENDStruct { int instr; int writeRegData; } WBENDType; typedef struct stateStruct { int pc; int instrMem[NUMMEMORY]; int dataMem[NUMMEMORY]; int reg[NUMREGS]; int numMemory; IFIDType IFID; IDEXType IDEX; EXMEMType EXMEM; MEMWBType MEMWB; WBENDType WBEND; int cycles; /* number of cycles run so far */ } stateType; 3 Completing the Simulator 3.1 Overall Problem We will give you a "template" program that contains the framework of a pipeline simulator. You will complete the simulator by adding code to execute instructions, detect hazards in the ID phase and perform data forwarding in the EX stage. At the start of the program, we initialize the memories, the pc and all registers to zero. We initialize the instruction field in all pipeline registers to the noop instruction. The simulator program executes as a loop, where each iteration through the loop executes one cycle. At the beginning of the cycle, we print the complete state of the machine (you MUST use the printState function provided WITHOUT MODIFICATION). In the body of the loop, you will figure out what the new state of the machine (memory, registers, pipeline registers) will be at the end of the cycle. Conceptually, all stages of the pipeline compute their new state simultaneously. Since statements execute sequentially in C rather than simultaneously, you will need two state variables: state and newState. state will be the state of the machine while the cycle is executing; newState will be the state of the machine at the end of the cycle. Each stage of the pipeline will modify the newState variable using the current values in the state variable. E.g. in the ID stage, you will have a statement like newState.IDEX.instr = state.IFID.instr (to transfer the instruction in the IFID register to the IDEX register) In the body of loop, you will use newState ONLY as the target of an assignment and you will use state ONLY as the source of an assignment (e.g. newState... = state...). state should never appear on the left-hand side of an assignment, and newState should never appear on the right-hand side of an assignment. 3.2 Halting At what point does the pipelined computer know to halt? It's incorrect to halt as soon as a halt instruction is fetched because if an earlier branch was actually taken, then the halt instruction could actually have been branched around. To solve this problem, halt the machine when a halt instruction reaches the MEMWB register. This ensures that previously executed instructions have completed, and it also ensures that the machine won't branch around this halt. This solution is shown above; note how the final printState call before the check for halt will print the final state of the machine. 3.3 First, Complete the Pipelined Implementation Assuming No Hazards The easiest way to begin this project is to complete the simulation without accounting for data or branch hazards. This will allow you to get started right away. Of course, the simulator will only be able to correctly run assembly-language programs that have no hazards. It is thus the responsibility of the assembly-language programmer to insert noop instructions so that there are no data or branch hazards. This means putting a number of noops in an assembly-language program after a branch and a number of noops in an assembly-language program before a dependent data operation (it's a good exercise to figure out the minimum number needed in each situation). 3.4 Finish Your Implementation by Detecting Hazards and Stalling or Forwarding Modifying your first implementation to account for data and branch hazards is the difficult part of this assignment. Stalling occurs in the ID phase. You must stall in only once case: a LW instruction immediately followed by some other instruction that tries to read the destination register of the load instruction. Since memory is not accessed until the MEM phase, forwarding cannot solve this hazard. When you detect that the instruction currently in the ID phase is trying to read the destination of a LW instruction that immediately preceded it, you must stall by turning the instruction in the ID phase to a NOOP, leaving the contents of the IFID register unchanged, and re-fetching the same instruction in the IF phase as the previous cycle. Resolve other data hazards using forwarding in the EX stage. There is some ambiguity in the book about where forwarding is done. Although the need for forwarding can be detected in the ID phase, the actual forwarding is done for the inputs of the ALU in the EX phase. I.e. the ALU should be able to take its inputs from any pipeline register (instead of just the IDEX register). There is no need for forwarding within the register file (as the book has). For this case of forwarding, you'll instead forward data from the WBEND pipeline register. If there are multiple hazards among several different instructions, remember to forward the most recent data (e.g. data in the EXMEM register gets priority over data in the MEMWB register). ONLY FORWARD DATA TO THE EX STAGE (not to memory as alluded to in the textbook). 3.5 Jumps and Conditional Branches There are several ways we could handle branches and jumps. For the conditional branch beqz, your simulator should do the following: Calculate the branch target and branch condition in the EX phase. If the branch is actually taken, change the PC in the MEM phase. For simplicity, our simulation will predict that the branch is not taken and continue fetching instructions even after decoding a branch. If the branch IS taken, then we must turn all instructions behind it in the pipeline into NOOPs. This happens in the MEM phase when we change the PC. For the unconditional jumps, although you will decode the instruction in the ID phase, change the PC to the jump PC in the MEM phase. This guarantees that a beqz instruction followed immediately by a jump will execute correctly. Calculate the jump target in the EX phase and pass it to the MEM phase using the branchTarget field of the pipeline register. (Note that if we instead implemented the more efficient hardware of Figure 3.22, both beqz and j instructions could update the PC in the EX phase.) 3.6 Simulator Hints Be careful how you handle the 16-bit and 26-bit offsets for the I-type and J-type instructions. They are 2's complement numbers, so you need to convert negative offsets to negative 32-bit numbers on the Sun workstations (by sign extending it). To do this, use the following convertNum (for 16-bit numbers) and convertNum26 (for 26-bit numbers)functions provided with the code. An example run of the simulator (not for the specified task of multiplication) is included at the end of this posting. Make sure your simulator and assembler work for all instructions. 3.7 C Programming Tips Here are a few programming tips for writing C programs to manipulate bits and strings: 1. To indicate a hexadecimal constant in C, precede the number by 0x. For example, 27 decimal is 0x1b in hexadecimal. 2. The value of the expression (a >> b) is the number "a" shifted right by "b" bits. Neither a nor b are changed. E.g. (25 >> 2) is 6. Note that 25 is 11001 in binary, and 6 is 110 in binary. Be aware when doing a shift that if the sign bit is 1 (negative), you will shift in 1's when doing a right shift. To be safe, perform a logical AND with a bit mask (for example, (instr >> 10) & 0x3f). See tip (4) below. 3. The value of the expression (a << b) is the number "a" shifted left by "b" bits. Neither a nor b are changed. E.g. (25 << 2) is 100. Note that 25 is 11001 in binary, and 100 is 1100100 in binary. 4. To find the value of the expression (a & b), perform a logical AND on each bit of a and b (i.e. bit 31 of a ANDED with bit 31 of b, bit 30 of a ANDED with bit 30 of b, etc.). E.g. (25 & 11) is 9, since: 11001 (binary) & 01011 (binary) --------------------- = 01001 (binary), which is 9 decimal. 5. To find the value of the expression (a | b), perform a logical OR on each bit of a and b (i.e. bit 31 of a ORED with bit 31 of b, bit 30 of a ORED with bit 30 of b, etc.). E.g. (25 | 11) is 27, since: 11001 (binary) & 01011 (binary) --------------------- = 11011 (binary), which is 27 decimal. 6. ~a is the bit-wise complement of a (a is not changed). Use these operations to create and manipulate machine-code. E.g. to look at bit 3 of the variable a, you might do: (a>>3) & 0x1. To look at bits (bits 15-12) of a 16-bit word, you could do: (a>>12) & 0xF. To put a 6 into bits 5-3 and a 3 into bits 2-1, you could do: (6<<3) | (3<<1). If you're not sure what an operation is doing, print some intermediate results to help you debug. 7. Several string functions that you may find useful: a) strtok returns a pointer to the next token in a string b) strcmp(str1, str2) compares two null-terminated strings c) strcpy(str1, str2) is used to copy the contents of str2 into str1 d) strcat(str1, str2) concatenates a copy of str2 to str1 4 Running and Demonstrating Your Program Your simulator should be run using the following command format: simulate code > output We will supply sample machine code files. You may also use the assembler we supply to create your own machine-code files. 5 Grading and Formatting We will grade almost solely on functionality. In particular, we will run your program on various assembly-language programs and check the contents of your memory, registers, and pipeline registers at each cycle. Most of these assembly-language programs will have hazards; a few will be hazard-free. Since we'll be grading on getting the exact right answers (both at the end of the run and being cycle-accurate throughout the run), it behooves you to spend a lot of time writing test assembly-language programs and testing your program. Programs that are not doing exactly what they're supposed to on every cycle will be penalized heavily, so check carefully. We will use a program to compare your output against a solution output. So it's very important that you follow these exact formatting rules: 1) Don't modify printState at all. 2) There should be ONLY ONE call to printState in your program, which is the printState shown in Section 3.1. Do not put in any extra printState calls (you can put these in for debugging, but take them out before submitting the program). 3) Check your program's output on the sample assembly-language program and output at the end of this handout. 4) Pay particular attention to what stage various operations are done in. For example, PC is incremented in the IF stage, so the IFID register should have PC+1. Also, the sign-extender is in the ID stage, so the IDEX register should contain the value of offsetField AFTER calling convertNum. 5) Don't print the sequence "@@@" anywhere except in printState(). Having events happen on the right cycle will also be very important (e.g. stall the exact number of cycles needed, write the branch target into the PC at exactly the right cycle, halt at the exact right cycle, stalling only when needed). 6 Turning in the Project You should submit the following separate files: 1. C program listing for your simulator Each of your programs must be in a single file and must be able to be compiled by running gcc. Do not expect us to specify additional flags or use a makefile. 2. A separate file containing just the CHANGES you made to the simulator we provided. To do this, you can use the diff program. Be sure to eliminate extraneous comments, punctuation marks, etc. from the diff output. The official time of submission for your project will be the time the files are sent. If you send in anything after the due date, your project will be considered late (and will use up your late days or will receive a zero).