School of Computer Science
Georgia Institute of Technology
CS4290/CS6290HPCA Fall 2011
Programming assignment #4
For CS6290 students: Simulator due: Wednesday (11/9) 6:00 PM, Report due: Thursday (11/10) in class.
For CS4290 students: Simulator due: Friday (11/18) 6:00 PM, Report due: Tuesday (11/22) in class.
Hyesoon Kim, Instructor
This is an individual assignment. You may discuss this assignment with other classmates, but you must complete it individually.
Part 1 (80 points) Simulator
Overview
In this assignment, you will extend your simulator to support fine-grained multithreading (MT).
Supporting the MT feature takes multiple steps and requires modifications in several places in the simulator.
Additional data structures must be added. An add_me4.txt file is provided.
First, make sure the simulator runs correctly right after you add the add_me4.txt file. add_me4.txt changes the get_op function, adds thread_id to op_struct, and adds additional knobs to userknob.h and simknob.h. You also need to use the new trace.cpp and sim_pin.cpp.
Because the system now handles multiple traces, before you add the MT feature, make sure
that your simulator still runs one thread just as before; then add the features to support multiple traces.
In a real architecture simulator, the simulation ending condition would be more sophisticated. However, in this assignment, we do not change the ending conditions: the simulator reads from the remaining traces until all the traces are finished. max_inst_count is based on the sum of all threads.
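The ending condition above can be sketched as follows. This is a minimal illustration, not the simulator's actual code; the names trace_done, retired_inst_count, and sim_finished are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>

const int MAX_THREAD = 4;
bool trace_done[MAX_THREAD] = {};   // set when a trace file is exhausted
int num_threads = 2;
uint64_t retired_inst_count = 0;    // SUM of retired ops over all threads
uint64_t max_inst_count = 10000000;

bool sim_finished() {
  // Stop when every trace is exhausted...
  bool all_done = true;
  for (int t = 0; t < num_threads; t++)
    if (!trace_done[t]) all_done = false;
  // ...or when the combined instruction budget is used up.
  return all_done || (retired_inst_count >= max_inst_count);
}
```

Note that a thread whose trace ends early simply stops contributing ops; the remaining traces keep running until the condition holds.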
Fetch Stage
- Fetch needs to fetch from multiple threads.
- get_op(Op *) function is now updated to fetch from multiple threads.
- Deciding which thread to fetch from is itself a research topic; several papers have proposed fetch policies to increase
a processor's utilization. In this assignment, we simply use a round-robin policy. The add_me4.txt file already provides this feature.
- op->thread_id gives the thread id of each op.
- After fetch, the simulator accesses the branch predictor just as before.
- In real hardware, multiple PCs are needed.
- You need a separate GHR for each thread; however, the 2-bit counter table is shared among all threads.
- Branch misprediction handling
- When a thread is mispredicted, you must set br_stall[op->thread_id] = true.
- When the mispredicted instruction is resolved in the execution stage, you should reset br_stall[op->thread_id]=false.
- get_op function checks br_stall and if there is a misprediction, it doesn't fetch from the mispredicted thread.
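The per-thread GHR, shared counter table, and br_stall handling described above can be sketched like this. The names ghr, pht, predict, and update are illustrative, not the simulator's actual identifiers; the history length matches the default configuration of 12.

```cpp
#include <cassert>
#include <cstdint>

const int MAX_THREAD = 4;
const int HIST_LEN = 12;
const int PHT_SIZE = 1 << HIST_LEN;

uint32_t ghr[MAX_THREAD] = {};     // one global history register per thread
uint8_t  pht[PHT_SIZE] = {};       // 2-bit counters, shared by all threads
bool     br_stall[MAX_THREAD] = {}; // set on mispredict, cleared at resolve

bool predict(int tid, uint64_t pc) {
  uint32_t idx = (uint32_t)(pc ^ ghr[tid]) & (PHT_SIZE - 1);
  return pht[idx] >= 2;            // predict taken if counter is 2 or 3
}

void update(int tid, uint64_t pc, bool taken, bool mispredicted) {
  uint32_t idx = (uint32_t)(pc ^ ghr[tid]) & (PHT_SIZE - 1);
  if (taken  && pht[idx] < 3) pht[idx]++;   // shared table is trained
  if (!taken && pht[idx] > 0) pht[idx]--;   // by every thread
  ghr[tid] = ((ghr[tid] << 1) | (taken ? 1 : 0)) & (PHT_SIZE - 1);
  if (mispredicted) br_stall[tid] = true;   // get_op skips this thread
}
```

Because the counter table is shared, threads can constructively or destructively interfere with each other's predictions, which is worth discussing in your report.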
Decode Stage
- We need multiple copies of reg, i.e., reg[MAX_THREAD][NUM_REG].
- Do not add a new pipeline stage, but you must have a scheduler in the decode stage.
- The scheduler is simply an array that holds one op from each thread.
- The scheduler should implement a round-robin policy among ready threads.
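A minimal sketch of such a scheduler is below: one slot per thread, and selection rotates among threads whose slot holds a ready op. The struct and field names (valid, ready, select) are illustrative assumptions, not the assignment's required interface.

```cpp
#include <cassert>

const int MAX_THREAD = 4;

struct Scheduler {
  bool valid[MAX_THREAD] = {};   // slot holds an op from this thread
  bool ready[MAX_THREAD] = {};   // that op's sources are available
  int  last = MAX_THREAD - 1;    // last thread selected

  // Return the next ready thread in round-robin order, or -1 if none.
  int select() {
    for (int i = 1; i <= MAX_THREAD; i++) {
      int t = (last + i) % MAX_THREAD;
      if (valid[t] && ready[t]) { last = t; return t; }
    }
    return -1;
  }
};
```

Starting the search one past the last selected thread is what makes the policy round-robin rather than fixed-priority.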
Memory
- To simplify the problem, we change the address: the top 2 bits
are replaced with thread_id, and the rest of the original address (62 bits, [61:0]) is kept.
new_address[63:62] = thread_id, new_address[61:0] = original_address[61:0].
- This change is made in the get_op function.
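The transformation can be written as a one-line bit manipulation; this is a sketch of the rule above, and the function name mt_address is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Replace the top 2 bits of a 64-bit address with the thread id,
// keeping original_address[61:0].
uint64_t mt_address(uint64_t addr, uint64_t thread_id) {
  uint64_t low62 = addr & ((1ULL << 62) - 1);   // original_address[61:0]
  return (thread_id << 62) | low62;             // thread_id in bits [63:62]
}
```

This guarantees that different threads never alias in the cache or DRAM, since their addresses differ in the top bits.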
WB
- Retirement must be in order within a thread. Across threads, the processor can retire instructions out of order.
- In programming assignment #3, when an op was not finished, we stopped retirement. In this assignment, even if one thread's op is not finished, the processor should retire finished ops from other threads.
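The modified retirement rule can be sketched with one in-order queue per thread: each queue retires from its head and stops at the first unfinished op, but a stalled thread does not block the others. The names rob, RetireOp, and retire_cycle are illustrative assumptions.

```cpp
#include <cassert>
#include <deque>

const int MAX_THREAD = 4;

struct RetireOp { bool done; };

std::deque<RetireOp> rob[MAX_THREAD];   // one in-order queue per thread

// Retire up to `width` finished ops this cycle; returns how many retired.
int retire_cycle(int num_threads, int width) {
  int retired = 0;
  for (int t = 0; t < num_threads && retired < width; t++) {
    // Within thread t, stop at the first unfinished op (in-order).
    while (retired < width && !rob[t].empty() && rob[t].front().done) {
      rob[t].pop_front();
      retired++;
    }
  }
  return retired;
}
```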
Stats
Knobs related to this assignment
- KNOB_MAX_THREAD_NUM: number of threads that can be executed together (default 1, max 4)
- KNOB_TRACE_NAME2: sets the second input trace file name
- KNOB_TRACE_NAME3: sets the third input trace file name
- KNOB_TRACE_NAME4: sets the fourth input trace file name
Submission Guide
Please follow the submission instructions. If you do not use the required file names, you will not receive full credit. Please check the class homepage for the latest updates. Your code must run on jinx with g++ 4.1.
Please do not turn in pzip files (trace files). Trace files are huge and will cause a lot of problems.
(Tar the lab4 directory, gzip the tar file, and submit the lab4.tar.gz file on T-square.)
Please make sure the directory name is lab4!
cd pin-2.8-36111-gcc.3.4.6-ia32_intel64-linux/source/tools
cd lab4
make clean
rm *.pzip
cd ..
tar cvf lab4.tar lab4
gzip lab4.tar
Part 2: Simulations (20 points)
Include your simulation results in a report. You do not need to submit any traces.
Please note that there are many simulation cases, so it will take several hours to simulate all of them.
10M instructions provide enough data, so you can reduce the simulation time by simulating only 10M instructions.
The default configuration is
gshare history length: 12
MSHR size: 4
cache size, way: 512KB 4 way
Dcache latency: 5 cycles
DRAM row buffer hit latency: 100 cycles
DRAM row buffer miss latency: 200 cycles
FE/ID/WB depth: 5/5/1
MSHR size:4, DRAM_BANK_INDEX_SIZE: 2, DRAM_BANK_ROW_BITS: 20
Use 2 benchmarks: one very memory intensive and the other non-memory intensive. (A simple matrix multiplication would use lots of memory operations.)
2-way MT architecture.
- For MT, create 2-thread workloads by pairing 3 traces (trace1+trace2, trace1+trace3, trace2+trace3). You need to generate your own traces.
(Please do not share your traces. State in your report which code or application you used to generate each trace.)
Show weighted speedup results.
weighted speedup = (1/n) * sum over i = 1..n of ( IPC_smt(i) / IPC_alone(i) )
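The metric is the average, over the n co-scheduled threads, of each thread's SMT IPC divided by its IPC when running alone. A small helper for your report's post-processing might look like this (the function name is an assumption):

```cpp
#include <cassert>
#include <cmath>

// Weighted speedup: (1/n) * sum_i ( IPC_smt(i) / IPC_alone(i) ).
double weighted_speedup(const double ipc_smt[], const double ipc_alone[], int n) {
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += ipc_smt[i] / ipc_alone[i];
  return sum / n;
}
```

A value of 1.0 means the co-scheduled threads lose nothing relative to running alone; values above 0.5 for a 2-thread workload indicate the MT configuration beats time-sharing a single core.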
- Classify three traces as memory intensive or not. Discuss the performance impacts.
- Discuss the effect of MT on branch prediction accuracy. Do you see degradation compared to running a single thread? Why or why not?
- You may notice that one benchmark finishes much faster than the others. Discuss how we should
simulate benchmarks to reduce the resulting simulation errors.
[0 point]
Assume that there are two processors and each processor has a cache
as described in Problem 5 (see below). The processors use the MSI coherence
protocol. The coherence states are denoted M, S, and I for Modified,
Shared, and Invalid.
A sequence of one or more CPU operations is specified as
P#: op, address -> value.
P# designates the CPU (e.g., P0); op is the CPU operation (LDB: load byte, STB: store byte); address denotes the memory address; and -> value indicates the new value to be written on a store operation.
Each action occurs after the previous action has
finished. (1) Show only the cache blocks that change. For example,
P0.B0: (I, 1, 0x01) indicates that CPU P0's block B0 (0 is the index
number) has the final state I, tag 1, and data 0x01. Assume that
memory is initialized to all 0 at the beginning of the
program. (2) Show the final state of the caches (coherence state, cache tag, data, and index) after all the
following operations. To simplify the problem, you do not need to draw the entire cache; you
just need to fill out the relevant entries.
P0: LDB 0x450
P0: STB 0x448 -> 0x4
P1: STB 0x450 -> 0x5
P1: LDB 0x450
P1: STB 0x448 -> 0x6
P0: LDB 0x450
P0: LDB 0x052
P0: STB 0x452 -> 0x7
P0: STB 0x852 -> 0x8
P0: LDB 0x1850
P1: LDB 0x850
P1: LDB 0x450
[0 point] A computer has a 128KB write back cache and 2MB physical memory. Each
cache block is 16B, the cache is 2-way set associative and uses the
true LRU replacement policy. Assume 26-bit virtual address space and
byte-addressable memory. It uses a 1KB page size. The cache uses a
physical tag and physical index. How big (in bits) is the tag store?
The cache uses MSI protocol and assume that we need 2 bits to indicate
cache coherence states.
[0 point] The FR-FCFS DRAM memory scheduler shows better performance than the FCFS DRAM scheduler. Why is that?
(Additional question) p% of a program is vectorizable. We wish to run it on a multiprocessor. Assume we have an unlimited number of processing elements. If the maximum speedup achievable on this program is 30, what is p? (Hint: apply Amdahl's law with the parallel fraction's execution time going to zero.)