School of Computer Science
Georgia Institute of Technology
CS4290/CS6290HPCA Fall 2011
Programming assignment #4
For CS6290 students: Simulator due: Wednesday (11/9) 6:00 PM, Report due: Thursday (11/10) in class.
For CS4290 students: Simulator due: Friday (11/18) 6:00 PM, Report due: Tuesday (11/22) in class.
Hyesoon Kim, Instructor
This is an individual assignment. You may discuss this assignment with other classmates, but you must complete it individually.
Part 1 (80 points) Simulator
Overview
In this assignment, you will extend your simulator to support fine-grained multithreading (MT).
Supporting the MT feature takes multiple steps and requires modifications in several places in the simulator.
Additional data structures must be added. An add_me4.txt file is provided.
First, make sure the simulator runs correctly right after you add the add_me4.txt file. add_me4.txt changes the get_op function, adds thread_id to op_struct, and adds additional knobs to userknob.h and simknob.h. You also need to use the new trace.cpp and sim_pin.cpp.
Because the system now handles multiple traces, before you add the MT feature, make sure
that your simulator still runs one thread just as before; then add the features to support multiple traces.
In a real architecture simulator, the simulation ending condition would be more sophisticated. However, in this assignment, we do not change the ending conditions: the simulator reads from the remaining traces until all the traces are finished. max_inst_count is based on the sum of all threads.
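The ending condition above can be sketched as follows. This is a minimal illustration, not the simulator's actual code; the names trace_done, retired_inst_count, and sim_finished are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>

const int MAX_THREAD = 4;
bool trace_done[MAX_THREAD] = {};   // set when a trace file is exhausted
int num_threads = 2;
uint64_t retired_inst_count = 0;    // SUM of retired ops over all threads
uint64_t max_inst_count = 10000000;

bool sim_finished() {
  // Stop when every trace is exhausted...
  bool all_done = true;
  for (int t = 0; t < num_threads; t++)
    if (!trace_done[t]) all_done = false;
  // ...or when the combined instruction budget is used up.
  return all_done || (retired_inst_count >= max_inst_count);
}
```

Note that a thread whose trace ends early simply stops contributing ops; the remaining traces keep running until the condition holds.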
Fetch Stage
- Fetch needs to fetch from multiple threads.
- get_op(Op *) function is now updated to fetch from multiple threads.
- Deciding which thread to fetch from is itself a research topic; several papers have proposed fetch policies to increase
a processor's utilization. In this assignment, we simply use a round-robin policy. The add_me4.txt file already provides this feature.
- op->thread_id gives the thread id of each op.
- After fetch, the simulator accesses the branch predictor just as before.
- In real hardware, multiple PCs are needed.
- You need a separate GHR for each thread; however, the 2-bit counter table is shared among all threads.
- Branch misprediction handling
- When a thread is mispredicted, you must set br_stall[op->thread_id] = true.
- When the mispredicted instruction is resolved in the execution stage, you should reset br_stall[op->thread_id]=false.
- get_op function checks br_stall and if there is a misprediction, it doesn't fetch from the mispredicted thread.
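The per-thread GHR, shared counter table, and br_stall handling described above can be sketched like this. The names ghr, pht, predict, and update are illustrative, not the simulator's actual identifiers; the history length matches the default configuration of 12.

```cpp
#include <cassert>
#include <cstdint>

const int MAX_THREAD = 4;
const int HIST_LEN = 12;
const int PHT_SIZE = 1 << HIST_LEN;

uint32_t ghr[MAX_THREAD] = {};     // one global history register per thread
uint8_t  pht[PHT_SIZE] = {};       // 2-bit counters, shared by all threads
bool     br_stall[MAX_THREAD] = {}; // set on mispredict, cleared at resolve

bool predict(int tid, uint64_t pc) {
  uint32_t idx = (uint32_t)(pc ^ ghr[tid]) & (PHT_SIZE - 1);
  return pht[idx] >= 2;            // predict taken if counter is 2 or 3
}

void update(int tid, uint64_t pc, bool taken, bool mispredicted) {
  uint32_t idx = (uint32_t)(pc ^ ghr[tid]) & (PHT_SIZE - 1);
  if (taken  && pht[idx] < 3) pht[idx]++;   // shared table is trained
  if (!taken && pht[idx] > 0) pht[idx]--;   // by every thread
  ghr[tid] = ((ghr[tid] << 1) | (taken ? 1 : 0)) & (PHT_SIZE - 1);
  if (mispredicted) br_stall[tid] = true;   // get_op skips this thread
}
```

Because the counter table is shared, threads can constructively or destructively interfere with each other's predictions, which is worth discussing in your report.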
Decode Stage
- We need multiple copies of reg, i.e., reg[MAX_THREAD][NUM_REG].
- Do not add a new pipeline stage, but you must have a scheduler in the decode stage.
- The scheduler is simply an array that holds one op from each thread.
- The scheduler should implement a round-robin policy among ready threads.
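A minimal sketch of such a scheduler is below: one slot per thread, and selection rotates among threads whose slot holds a ready op. The struct and field names (valid, ready, select) are illustrative assumptions, not the assignment's required interface.

```cpp
#include <cassert>

const int MAX_THREAD = 4;

struct Scheduler {
  bool valid[MAX_THREAD] = {};   // slot holds an op from this thread
  bool ready[MAX_THREAD] = {};   // that op's sources are available
  int  last = MAX_THREAD - 1;    // last thread selected

  // Return the next ready thread in round-robin order, or -1 if none.
  int select() {
    for (int i = 1; i <= MAX_THREAD; i++) {
      int t = (last + i) % MAX_THREAD;
      if (valid[t] && ready[t]) { last = t; return t; }
    }
    return -1;
  }
};
```

Starting the search one past the last selected thread is what makes the policy round-robin rather than fixed-priority.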
Memory
- To simplify the problem, we change the address: the top 2 bits
are replaced with thread_id, and the rest of the original address (62 bits, [61:0]) is kept.
new_address[63:62] = thread_id, new_address[61:0] = original_address[61:0].
- This change is made in the get_op function.
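The transformation can be written as a one-line bit manipulation; this is a sketch of the rule above, and the function name mt_address is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Replace the top 2 bits of a 64-bit address with the thread id,
// keeping original_address[61:0].
uint64_t mt_address(uint64_t addr, uint64_t thread_id) {
  uint64_t low62 = addr & ((1ULL << 62) - 1);   // original_address[61:0]
  return (thread_id << 62) | low62;             // thread_id in bits [63:62]
}
```

This guarantees that different threads never alias in the cache or DRAM, since their addresses differ in the top bits.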
WB
- Retirement must be in order within a thread. Across threads, the processor can retire instructions out of order.
- In programming assignment #3, when an op was not finished, we stopped retirement. In this assignment, even if one thread's op is not finished, the processor should retire finished ops from other threads.
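The modified retirement rule can be sketched with one in-order queue per thread: each queue retires from its head and stops at the first unfinished op, but a stalled thread does not block the others. The names rob, RetireOp, and retire_cycle are illustrative assumptions.

```cpp
#include <cassert>
#include <deque>

const int MAX_THREAD = 4;

struct RetireOp { bool done; };

std::deque<RetireOp> rob[MAX_THREAD];   // one in-order queue per thread

// Retire up to `width` finished ops this cycle; returns how many retired.
int retire_cycle(int num_threads, int width) {
  int retired = 0;
  for (int t = 0; t < num_threads && retired < width; t++) {
    // Within thread t, stop at the first unfinished op (in-order).
    while (retired < width && !rob[t].empty() && rob[t].front().done) {
      rob[t].pop_front();
      retired++;
    }
  }
  return retired;
}
```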
Stats
Knobs related to this assignment
- KNOB_MAX_THREAD_NUM: number of threads that can be executed together (default 1, max 4)
- KNOB_TRACE_NAME2: sets the second input trace file name
- KNOB_TRACE_NAME3: sets the third input trace file name
- KNOB_TRACE_NAME4: sets the fourth input trace file name
Submission Guide
Please follow the submission instructions. If you do not use the required file names, you will not receive full credit. Please check the class homepage for the latest updates. Your code must run on jinx with g++ 4.1.
Please do not turn in pzip files (trace files). Trace files are huge and will cause a lot of problems.
(Tar the lab4 directory, gzip the tar file, and submit the lab4.tar.gz file on T-square.)
Please make sure the directory name is lab4!
cd pin-2.8-36111-gcc.3.4.6-ia32_intel64-linux/source/tools
cd lab4
make clean
rm *.pzip
cd ..
tar cvf lab4.tar lab4
gzip lab4.tar
Part 2: Simulations (20 points)
Include your simulation results in a report. You do not need to submit any traces.
Please note that there are many simulation cases, so it will take several hours to simulate all of them.
10M instructions provide enough data, so you can reduce the simulation time by simulating only 10M instructions.
The default configuration is
gshare history length: 12
MSHR size: 4
cache size, way: 512KB 4 way
Dcache latency: 5 cycles
DRAM row buffer hit latency: 100 cycles
DRAM row buffer miss latency: 200 cycles
FE/ID/WB depth: 5/5/1
MSHR size:4, DRAM_BANK_INDEX_SIZE: 2, DRAM_BANK_ROW_BITS: 20
Use 2 benchmarks: one very memory intensive and the other non-memory intensive. (A simple matrix multiplication would use lots of memory operations.)
2-way MT architecture.
- For MT, create 2-thread workloads by pairing 3 traces (trace1+trace2, trace1+trace3, trace2+trace3). You need to generate your own traces.
(Please do not share your traces. State in your report which code or application you used to generate each trace.)
Show weighted speedup results.
weighted speedup = (1/n) * sum over i = 1..n of ( IPC_smt(i) / IPC_alone(i) )
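The metric is the average, over the n co-scheduled threads, of each thread's SMT IPC divided by its IPC when running alone. A small helper for your report's post-processing might look like this (the function name is an assumption):

```cpp
#include <cassert>
#include <cmath>

// Weighted speedup: (1/n) * sum_i ( IPC_smt(i) / IPC_alone(i) ).
double weighted_speedup(const double ipc_smt[], const double ipc_alone[], int n) {
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += ipc_smt[i] / ipc_alone[i];
  return sum / n;
}
```

A value of 1.0 means the co-scheduled threads lose nothing relative to running alone; values above 0.5 for a 2-thread workload indicate the MT configuration beats time-sharing a single core.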
- Classify three traces as memory intensive or not. Discuss the performance impacts.
- Discuss the effect of MT on branch prediction accuracy. Do you see degradation compared to running a single thread? Why or why not?
- You may notice that one benchmark finishes much faster than the others. Discuss how we should
simulate benchmarks to reduce the resulting simulation errors.
[0 point]
Assume that there are two processors and each processor has a cache
as described in Problem 5 (see below). The processors use the MSI coherence
protocol. The coherence states are denoted M, S, and I for Modified,
Shared, and Invalid.
A sequence of one or more CPU operations is specified as
P#: op, address -> value.
P# designates the CPU (e.g., P0); op is the CPU operation (LDB: load byte, STB: store byte); address denotes the memory address; and -> value indicates the new value to be written on a store operation.
Each action occurs after the previous action has
finished. (1) Show only the cache blocks that change. For example,
P0.B0: (I, 1, 0x01) indicates that CPU P0's block B0 (0 is the index
number) has the final state I, tag 1, and data 0x01. Assume that
memory is initialized to all 0 at the beginning of the
program. (2) Show the final state of the caches (coherence state, cache tag, data, and index) after all the
following operations. To simplify the problem, you do not need to draw the entire cache; you
just need to fill out the relevant entries.
P0: LDB 0x450
P0: STB 0x448 -> 0x4
P1: STB 0x450 -> 0x5
P1: LDB 0x450
P1: STB 0x448 -> 0x6
P0: LDB 0x450
P0: LDB 0x052
P0: STB 0x452 -> 0x7
P0: STB 0x852 -> 0x8
P0: LDB 0x1850
P1: LDB 0x850
P1: LDB 0x450
[0 point] A computer has a 128KB write back cache and 2MB physical memory. Each
cache block is 16B, the cache is 2-way set associative and uses the
true LRU replacement policy. Assume 26-bit virtual address space and
byte-addressable memory. It uses a 1KB page size. The cache uses a
physical tag and physical index. How big (in bits) is the tag store?
The cache uses MSI protocol and assume that we need 2 bits to indicate
cache coherence states.
[0 point] The FR-FCFS DRAM memory scheduler shows better performance than the FCFS DRAM scheduler. Why is that?
(Additional question) p% of a program is vectorizable. We wish to run it on a multiprocessor. Assume we have an unlimited number of processing elements. If the maximum speedup achievable on this program is 30, what is p? (Hint: apply Amdahl's law with the parallel fraction's execution time going to zero.)