CS4290/CS6290HPCA

School of computer science

Georgia Institute of Technology

CS4290/CS6290HPCA Fall 2011

Programming assignment #2
Part 1: Simulator: Due: Friday (9/23) 6:00 pm (you can submit Part 1 until Sunday 6:00 pm without late penalty. However, after Sunday mid-night: 40% late penalty, after Monday mid-night: 80% penalty, after Tuesday mid-night: 100% will be applied.)
Part 2: Report & questions (9/27) Hard copy only. Before the class

Hyesoon Kim, Instructor

This is an individual assignment. You can discuss this assignment with other classmates but you should do your assignment individually. Please follow the submission instructions. If you do not follow the submission file names, you will not receive full credit. Please check the class homepage to see the latest update. Your code must run on jinx cluster with g++4.1

Overview
This assignment is composed of two parts: Part 1: building the simulator, Part 2: simulation and analysis

Part 1(100 pts): Building a Branch Predictor (60 points) and Deeper Pipeline (40 points)

You will extend your Lab #1 pipeline design. We provide add_me.txt file. Please add the code at appropriate places. userknob.h and simknob.h files have been updated, please use the updated files.

From this assignment, you are building a much higher level of simulator. So, you don't need to maintain cycle accurate latch values.

Implement a g-share branch predictor

Add a g-share branch predictor in the FE stage.

When a processor fetches a conditional branch instruction (cf_type must be CF_CBR), it accesses the branch predictor. All other branch types, we assume that they are correctly predicted. There are no pipeline bubbles for other branches.
If a branch is mispredicted, the processor should not fetch instructions from the correct path. i.e., the processor should not call the get_op function. Instead, it should stall the pipeline until the branch instruction is resolved. After the branch is resolved in the EX stage, in the following cycle the correct PC address is forwarded to the PC latch. From the next cycle the processor can start to fetch instructions from the correct path. Note that in real hardware, the processor will keep fetching wrong-path instructions until a branch is resolved.
Once a misprediction is discovered, the processor flushes the pipeline and starts to fetch instructions from the correct path. Since we are building a trace-driven simulator, we cannot send wrong-path instructions into the pipeline. That's why we are stalling the FE stage to model the time delay between a branch prediction and branch resolution.
To simplify the problem, after a branch is predicted in the FE stage, you also update the branch predictor including BHR with the correct value at the same cycle. In real hardware, the branch predictor is speculatively updated with predicted values. When a branch misprediction is detected, the BHR value is recovered. However, since we do not fetch wrong-path instructions, this step is unnecessary. The 2bit counters in the PHT table are initialized with "10b (2)", weekly taken. In the op field, actually_taken field provides the correct branch prediction. So you can check whether a branch prediction is correct or not by comparing actually_taken value.
Note that, in a real hardware, when a branch misprediction is detected, the processor cannot start to rename until all correct path instructions are retired. Again, to simplify the simulator, the processor starts to resume as soon as the correct-path PC value is available.

Implement a deeper pipeline processor
Now, you convert your 5-stage pipeline into a much deeper pipeline machine. Similar to the previous execution stage (multi-cycle execution stages), you can have multiple cycles in FE, ID, WB We will not check the contents of the latches in this assignment, so you don't have to maintain the correct latch information. Please note that you are not adding additional pipeline stages in your simulator. Each pipeline stage can take more than one cycle.

KNOBS

New KNOBS

KNOB_GHR_LENGTH: It decides the length of GHR. The number of gshare predictor entry is 2^(KNOB_GHR_LENGTH). The maximum value of KNOB_GHR_LENGTH is 20. KNOB_DEBUG_PRINT: print a debug message. (0: no debug message 1: print debug message )

KNOB_FE_DEPTH: It sets the number of pipeline stages in the FE stage

KNOB_ID_DEPTH: It sets the number of pipeline stages in the ID stage

KNOB_WB_DEPTH: It sets the number of pipeline stage in the WB stage

KNOBS from Assignment #1
KNOB_OUTPUT_FILE: set the output filename of the print_stats function

KNOB_TRACE_NAME: set the input trace file name

KNOB_READ_TRACE: (1): read trace and execute the simulator (0): execute the binary

You should not set 1 for both KNOB_WRITE_TRACE and KNOB_READ_TRACE. Only one of them has to be 1.

KNOB_MAX_SIM_COUNT: set the maximum cycle_count for the simulation

KNOB_MAX_INST_COUNT: set the maximum inst_count for the simulation

KNOB_PRINIT_PIPE_FREQ: set the frequency of calling print_pipeline() function

Grading

We will check branch predictor accuracies as we vary the length of GHR.

We will check IPC values to grade your homework. Most of gradings are checking the IPC value trend.

Submission Guide
Please do not turn in pzip files(trace files). Trace file sizes are so huge so they will cause a lot of problems.
(Tar the lab2 directory. Gzip the tarfile and submit lab2.tar.gz file at T-square)
cd pin-2.8-36111-gcc.3.4.6-ia32_intel64-linux/source/tools

cd lab2
make clean
rm *.pzip
cd ..
tar cvf lab2.tar lab2
gzip lab2.tar

For this assignment, test sample outputs will not be provided. You should try to solve Part 2 before you submit. The answers to Part 2 will indicate whether you have implemented your simulator correctly or not.

Part 2 20 pts : Simulation and analysis
Due: 9/27/11 before class. Hard copy only
Using your simulator, you will do performance analysis. You will generate traces which are at least longer than 10,000 instructions. Any applications are fine. Please do not turn in your traces!!

(6 points) Vary the g-share history length (2, 4, 8, 10, 12, 16) and measure the branch predictor accuracy. Do you see the increases in the branch predictor accuracy? If not discuss why.

(6 points) G-share branch predictor uses XOR as a hash function. Instead of XOR, one of HPCA class students suggested using a concatenation (i.e., BHR@PC). Modify your simulator to use a concatenation instead of XOR. Keep the same PHT (2-bit counter table) size as the previous question and compare the performance with the g-share branch predictor results. In this case, you have vary the history length (1,2,4,5,6,8) to match the same G-share branch predictor sizes as the previous problem.

(5 points) A machine has a 5-stage pipeline and the frequency is 500MHz. We like to build a 1-GHz machine. so we increase the FE stage depth to 5 cycles, ID stage depth to 5 cycles. For two types of branch predictor (GHR length is 4, GHR length is 12) measure the IPC for three kinds of machines. machine-1: 1-cycle FE stage, 1-cycle ID stage, EX stage, 1-cycle MEM stage, 1-cycle WB stage. 500 MHz machine. machine-2: 5-cycle FE stage, 5-cycle ID stage, EX stage, 1-cycle MEM stage, 1-cycle WB stage. 1GHz machine machine-3: 5-cycle FE stage, 5-cycle ID stage, EX stage, 1-cycle MEM stage, 5-cycle WB stage. 1GHz machine. EX stage is dependent on op latency that are the same as Lab 1. (total 6 simulations) Which machine is faster? (please compare the performance with time.) What kind of conclusions can you make from this experiment?

(3 points) When a processor discovers a branch misprediction in an out of order processor, what are the steps for flushing the pipeline and resuming where on the correct path?

The following questions won't be graded. However, you may want to solve these questions to help prepare the exams. (hint! hint!)

A graduate student, Mary, likes drinking coffee. She drinks coffee every day. She wrote a simple program to decide which coffee shop to go every day. Here is the code. (to simplify the problem, we assume that Mary drinks coffee only on Th,F, Sa and Sunday)

class date_c{
	public:
	void get_coffee();
};
class weekday_c: public date_c{
	public:
	void get_coffee();
}
class Saturday_c:public date_c{
	public:
	void get_coffee();
}
class Sunday_c:public date_c{
	public:
	void get_coffee();
}


void TutoTh_c::get_coffee(){
// beginning of the function body starts at 0x800
	If (time >= 7 am && time < 8 am) printf("go to Seattle's Best\n");  // br: pc 0x801
	else if (time >= 8 am && time < 1pm ) printf ("go to CCB \n"); // br: pc 0x807
	else if (time < 6 pm ) printf ("go to IIB \n");  // br: pc 0x80a
	else if (time <=10 pm) printf("go to StarBucks\n");  // br: pc 0x815 
	else printf("sorry. No more coffee\n"); 
}

void MF_c::get_coffee(){
// beginning of the function body starts at 0x820
	If (time >= 7 am && time < 8 am) printf("go to Seattle's Best\n");  // br: pc 0x821
	else if (time >= 8 am && time < 1pm ) printf ("go to CCB \n"); // br: pc 0x827
	else if (time < 5 pm ) printf ("go to IIB \n");  // br: pc0x82a
	else if (time <=10 pm) printf("go to StarBucks\n");  // br: pc 0x825 
	else printf("sorry. No more coffee\n"); 
}

void Sa_c::get_coffee() {
// beginning of the function body starts at 0x900
	if (time >= 7 am && time < 10 pm) printf(\"go to Starbucks \n"); // br: pc 0x902
	else printf("sorry, No more coffee\n"); 

}
void Su_c::get_coffee(){
// beginning of the function body starts at 0xa00
	if (time> =10 am && time < 6 pm) printf("go to Starbucks \n"); // br: pc 0xa03
	else printf("sorry,  No more coffee\n"); 
}

main()
{
// initialize code. // week array is initialized 
for (today= Thursday ; today <= Sunday; today++) { // br: pc 0x100 : NT: Loop exit 
	date_c *date = week[today];
       for (time = 7 am ; time < 5 pm; time = time + 6) {  // br: pc 0x102, NT Loop exit
           date->get_coffee(); // indirect branch at PC 0x110: indirect branch X 
      }
}

To simplify the problem, we assume that multiple condition branches are handled by only one branch. i.e, if (time > 7 am && time < 10 pm) correspond to only one branch.
today is incremented from Th, F, Sa to Su.
each day data->get_coffee calls a different function.
Th : TutoTh_c::get_coffee
F: MF_c::get_coffee
Sa: Sa_c::get_coffee
Su: Su_c::get_coffee

We execute the code in a processor which uses the BTB to predict indirect branches. Calculate the branch prediction accuracy of indirect branch. (Hint: fill out the target addresses of indirect branch X.)

	Th	F	Sa	Su
7 am
1 pm

We execute the code in a processor which has a 3-bit history gshare branch predictor.
What is the overall accuracy of branch predictor?
For each branch shows: which entry (i.e. entry index), is used to predict it, whether the prediction is correct, what is the GHR value (before prediction, after update), what is the entry value (after update). We assume that the entry value and the GHR will be updated--after the branch is predicted--with the correct value.
The pc_pred address used in the lowermost 3 bits of the branch address. GHR is initialized as "000" at the beginning. All two bit counters in the PHT are initialized as "10"(weekly taken). We assume that only the following branches will be executed. Ignore all other branches in the code. (0x801, 0x807, 0x80a, 0x815, 0x821, 0x827, 0x82a, 0x825, 0x902, 0xa03, 0x100, 0x102).
Note that indirect branches do not update the GHR.

Branch prediction accuracy =

We provide a table to help you calculate the prediction accuracy but the table is not part of the grading.

br Address	T/NT	pc_pred (3 bits)	GHR (pred)	Entry index	Prediction	Correct?	Entry value (updated value)	GHR (updated value)

(Baer:Chap3Q1) The Alpha 21125 is called "in order". However, instructions can complete out of order. Give two examples of such an occurrence.

In a superscalar processor, the renaming mechanism needs to check dependence between two instructions that are issued in the same cycle. Draw a diagram if you extend the renaming logic into 3-wide rename mechanism. Please see the "dual-rename" slide in lec_ooo2.pdf. How about commit? Do you need similar dependence check logic for commit?

What is a hybrid (tournament hybrid) predictor and why is it a good idea?

Explain indirect branches. Write high-level source code that will result in indirect branches.

Explain why a longer branch history g-share branch predictor increases performance. Explain what could be the problem for a g-share branch predictor that has an extreme long branch history register.

What is a delayed branch and what are pros and cons?

What is an instruction trace cache? What are the benefits of having instruction trace cache?

Provide an example of code section where global branch predictor provides good performance. Provide an example of code section where local branch predictor provides good performance.

A byte-addressable computer has a 128B data cache. The cache block size is 128 bits. The cache uses the true LRU replacement policy. When a given program is executed, the processor reads data from the following sequence of hex addresses:

0x0500, 0x5004, 0x0508, 0x050C, 0x050A, 0x0500, 0x05F4, 0x0500, 0x0504, 0x0518, 0x051C, 0x054C, 0x05F4, 0x500C

Compute the hit rate if a direct-mapped cache is used.
Compute the hit rate if a 2-way set-associative cache is used.

What are differences between write-allocate and no-write-allocate? What are differences between blocking cache and non-blocking cache? What kind information is stored in MSHR?