School of computer science
Georgia Institute of Technology
CS4290/CS6290HPCA Fall 2011
Programming assignment #2
Part 1: Simulator: Due: Friday (9/23) 6:00 pm (you can submit Part 1 until Sunday 6:00 pm without late penalty. However, after Sunday mid-night: 40% late penalty, after Monday mid-night: 80% penalty, after Tuesday mid-night: 100% will be applied.)
Part 2: Report & questions (9/27) Hard copy only. Before the class
Hyesoon Kim, Instructor
This is an individual assignment. You can discuss this assignment with other classmates but you should do your assignment individually. Please follow the submission instructions. If you do not follow the submission file names, you will not receive full credit. Please check the class homepage to see the latest update. Your code must run on jinx cluster with g++4.1
Overview
This assignment is composed of two parts: Part 1: building the simulator, Part 2: simulation and analysis
Part 1(100 pts): Building a Branch Predictor (60 points) and Deeper Pipeline (40 points)
You will extend your Lab #1 pipeline design. We provide add_me.txt file. Please add the code at appropriate places. userknob.h and simknob.h files have been updated, please use the updated files.
From this assignment, you are building a much higher level of simulator.
So, you don't need to maintain cycle accurate latch values.
Implement a g-share branch predictor
Add a g-share branch predictor in the FE stage.
When a processor
fetches a conditional branch instruction (cf_type must be CF_CBR), it
accesses the branch predictor. All other branch types, we assume that they
are correctly predicted. There are no pipeline bubbles for other
branches.
If a branch is mispredicted, the processor should not fetch
instructions from the correct path. i.e., the processor should not
call the get_op function. Instead, it should stall the pipeline until
the branch instruction is resolved. After the branch is resolved in the EX stage, in the following cycle the correct PC address is forwarded to the PC
latch. From the next cycle the processor can start to fetch
instructions from the correct path. Note that in real hardware, the
processor will keep fetching wrong-path instructions until a branch is
resolved.
Once a misprediction is discovered, the processor flushes
the pipeline and starts to fetch instructions from the correct
path. Since we are building a trace-driven simulator, we cannot send
wrong-path instructions into the pipeline. That's why we are stalling
the FE stage to model the time delay between a branch prediction and
branch resolution.
To simplify the problem, after a branch is
predicted in the FE stage, you also update the branch predictor including BHR with the
correct value at the same cycle. In real hardware, the branch
predictor is speculatively updated with predicted values. When a
branch misprediction is detected, the BHR value is recovered. However,
since we do not fetch wrong-path instructions, this step is
unnecessary. The 2bit counters in the PHT table are initialized with
"10b (2)", weekly taken.
In the op field, actually_taken field provides the correct branch prediction. So you can check whether a branch prediction is correct or not by comparing actually_taken value.
Note that, in a real hardware, when a branch misprediction is
detected, the processor cannot start to rename until all correct path
instructions are retired. Again, to simplify the simulator, the
processor starts to resume as soon as the correct-path PC value is
available.
Implement a deeper pipeline processor
Now, you convert your 5-stage pipeline into a much deeper pipeline machine.
Similar to the previous execution stage (multi-cycle execution stages),
you can have multiple cycles in FE, ID, WB We will not check the contents of the latches in this assignment, so you don't have to maintain the correct latch information. Please note that you are not adding additional pipeline stages in your simulator. Each pipeline stage can take more than one cycle.
KNOBS
New KNOBS
KNOB_GHR_LENGTH: It decides the length of GHR. The number of gshare predictor entry is 2^(KNOB_GHR_LENGTH). The maximum value of KNOB_GHR_LENGTH is 20.
KNOB_DEBUG_PRINT: print a debug message. (0: no debug message 1: print debug message )
KNOB_FE_DEPTH: It sets the number of pipeline stages in the FE stage
KNOB_ID_DEPTH: It sets the number of pipeline stages in the ID stage
KNOB_WB_DEPTH: It sets the number of pipeline stage in the WB stage
KNOBS from Assignment #1
KNOB_OUTPUT_FILE: set the output filename of the print_stats function
KNOB_TRACE_NAME: set the input trace file name
KNOB_READ_TRACE: (1): read trace and execute the simulator (0): execute the binary
You should not set 1 for both KNOB_WRITE_TRACE and KNOB_READ_TRACE. Only one of them has to be 1.
KNOB_MAX_SIM_COUNT: set the maximum cycle_count for the simulation
KNOB_MAX_INST_COUNT: set the maximum inst_count for the simulation
KNOB_PRINIT_PIPE_FREQ: set the frequency of calling print_pipeline() function
Grading
We will check branch predictor accuracies as we vary the length of GHR.
We will check IPC values to grade your homework.
Most of gradings are checking the IPC value trend.
Submission Guide
Please do not turn in pzip files(trace files). Trace file sizes are so huge so they will cause a lot of problems.
(Tar the lab2 directory. Gzip the tarfile and submit lab2.tar.gz file at T-square)
cd pin-2.8-36111-gcc.3.4.6-ia32_intel64-linux/source/tools
cd lab2
make clean
rm *.pzip
cd ..
tar cvf lab2.tar lab2
gzip lab2.tar
For this assignment, test sample outputs will not be provided. You should try to solve Part 2 before you submit.
The answers to Part 2 will indicate whether you have implemented your simulator correctly or not.
Part 2 20 pts : Simulation and analysis
Due: 9/27/11 before class. Hard copy only
Using your simulator, you will do performance analysis. You will generate traces which are at least longer than 10,000 instructions. Any applications are fine. Please do not turn in your traces!!
(6 points) Vary the g-share history length (2, 4, 8, 10, 12, 16) and measure the branch predictor accuracy.
Do you see the increases in the branch predictor accuracy? If not discuss why.
(6 points) G-share branch predictor uses XOR as a hash function. Instead of XOR, one of HPCA class students suggested using a concatenation (i.e., BHR@PC). Modify your simulator to use a concatenation instead of XOR. Keep the same PHT (2-bit counter table) size as the previous question and compare the performance with the g-share branch predictor results. In this case, you have vary the history length (1,2,4,5,6,8) to match the same G-share branch predictor sizes as the previous problem.
(5 points) A machine has a 5-stage pipeline and the frequency is 500MHz. We like to build a 1-GHz machine.
so we increase the FE stage depth to 5 cycles, ID stage depth to 5 cycles.
For two types of branch predictor (GHR length is 4, GHR length is 12) measure the IPC for three kinds of machines.
machine-1: 1-cycle FE stage, 1-cycle ID stage, EX stage, 1-cycle MEM stage, 1-cycle WB stage. 500 MHz machine.
machine-2: 5-cycle FE stage, 5-cycle ID stage, EX stage, 1-cycle MEM stage, 1-cycle WB stage. 1GHz machine
machine-3: 5-cycle FE stage, 5-cycle ID stage, EX stage, 1-cycle MEM stage, 5-cycle WB stage. 1GHz machine.
EX stage is dependent on op latency that are the same as Lab 1.
(total 6 simulations)
Which machine is faster? (please compare the performance with time.)
What kind of conclusions can you make from this experiment?
(3 points) When a processor discovers a branch misprediction in an out of order processor, what are the steps for flushing the pipeline and resuming where on the correct path?
The following questions won't be graded. However, you may want to solve these questions to help prepare the exams. (hint! hint!)
A graduate student, Mary, likes drinking coffee. She drinks coffee every day. She wrote a simple program to decide which coffee shop to go every day.
Here is the code. (to simplify the problem, we assume that Mary drinks coffee only on Th,F, Sa and Sunday)
class date_c{
public:
void get_coffee();
};
class weekday_c: public date_c{
public:
void get_coffee();
}
class Saturday_c:public date_c{
public:
void get_coffee();
}
class Sunday_c:public date_c{
public:
void get_coffee();
}
void TutoTh_c::get_coffee(){
// beginning of the function body starts at 0x800
If (time >= 7 am && time < 8 am) printf("go to Seattle's Best\n"); // br: pc 0x801
else if (time >= 8 am && time < 1pm ) printf ("go to CCB \n"); // br: pc 0x807
else if (time < 6 pm ) printf ("go to IIB \n"); // br: pc 0x80a
else if (time <=10 pm) printf("go to StarBucks\n"); // br: pc 0x815
else printf("sorry. No more coffee\n");
}
void MF_c::get_coffee(){
// beginning of the function body starts at 0x820
If (time >= 7 am && time < 8 am) printf("go to Seattle's Best\n"); // br: pc 0x821
else if (time >= 8 am && time < 1pm ) printf ("go to CCB \n"); // br: pc 0x827
else if (time < 5 pm ) printf ("go to IIB \n"); // br: pc0x82a
else if (time <=10 pm) printf("go to StarBucks\n"); // br: pc 0x825
else printf("sorry. No more coffee\n");
}
void Sa_c::get_coffee() {
// beginning of the function body starts at 0x900
if (time >= 7 am && time < 10 pm) printf(\"go to Starbucks \n"); // br: pc 0x902
else printf("sorry, No more coffee\n");
}
void Su_c::get_coffee(){
// beginning of the function body starts at 0xa00
if (time> =10 am && time < 6 pm) printf("go to Starbucks \n"); // br: pc 0xa03
else printf("sorry, No more coffee\n");
}
main()
{
// initialize code. // week array is initialized
for (today= Thursday ; today <= Sunday; today++) { // br: pc 0x100 : NT: Loop exit
date_c *date = week[today];
for (time = 7 am ; time < 5 pm; time = time + 6) { // br: pc 0x102, NT Loop exit
date->get_coffee(); // indirect branch at PC 0x110: indirect branch X
}
}
Assumptions:
- To simplify the problem, we assume that multiple condition branches are handled by only one branch. i.e,
if (time > 7 am && time < 10 pm) correspond to only one branch.
- today is incremented from Th, F, Sa to Su.
each day data->get_coffee calls a different function.
Th : TutoTh_c::get_coffee
F: MF_c::get_coffee
Sa: Sa_c::get_coffee
Su: Su_c::get_coffee
We execute the code in a processor which uses the BTB to predict indirect branches. Calculate the branch prediction accuracy of indirect branch.
(Hint: fill out the target addresses of indirect branch X.)
We execute the code in a processor which has a 3-bit history gshare branch predictor.
What is the overall accuracy of branch predictor?
For each branch shows: which entry (i.e. entry index), is used to predict it, whether the prediction is correct, what is the GHR value (before prediction, after update), what is the entry value (after update). We assume that the entry value and the GHR will be updated--after the branch is predicted--with the correct value.
The pc_pred address used in the lowermost 3 bits of the branch address. GHR is initialized as "000" at the beginning. All two bit counters in the PHT are initialized as "10"(weekly taken). We assume that only the following branches will be executed. Ignore all other branches in the code. (0x801, 0x807, 0x80a, 0x815, 0x821, 0x827, 0x82a, 0x825, 0x902, 0xa03, 0x100, 0x102).
Note that indirect branches do not update the GHR.
Branch prediction accuracy =
We provide a table to help you calculate the prediction accuracy but the table is not part of the grading.
br Address | T/NT | pc_pred (3 bits) | GHR (pred) | Entry index | Prediction | Correct? | Entry value (updated value) | GHR (updated value) |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
(Baer:Chap3Q1)
The Alpha 21125 is called "in order". However, instructions can complete out of order. Give two examples of such an occurrence.
In a superscalar processor, the renaming mechanism needs to check dependence between two instructions that are issued in the same cycle. Draw a diagram if you extend the renaming logic into 3-wide rename mechanism. Please see the "dual-rename" slide in lec_ooo2.pdf. How about commit? Do you need similar dependence check logic for commit?
What is a hybrid (tournament hybrid) predictor and why is it a good idea?
Explain indirect branches. Write high-level source code that will result in indirect branches.
Explain why a longer branch history g-share branch predictor increases performance. Explain what could be the problem for a g-share branch predictor that has an extreme long branch history register.
What is a delayed branch and what are pros and cons?
What is an instruction trace cache? What are the benefits of having instruction trace cache?
Provide an example of code section where global branch predictor provides good performance. Provide an example of code section where local branch predictor provides good performance.
A byte-addressable computer has a 128B data cache. The cache block
size is 128 bits. The cache uses the true LRU replacement policy.
When a given program is executed, the processor reads data from the
following sequence of hex addresses:
0x0500, 0x5004, 0x0508, 0x050C, 0x050A, 0x0500, 0x05F4, 0x0500, 0x0504, 0x0518, 0x051C, 0x054C, 0x05F4, 0x500C
- Compute the hit rate if a direct-mapped cache is used.
- Compute the hit rate if a 2-way set-associative cache is used.
What are differences between write-allocate and no-write-allocate?
What are differences between blocking cache and non-blocking cache?
What kind information is stored in MSHR?