CS4290/CS6290 HPCA

School of Computer Science

Georgia Institute of Technology

CS4290/CS6290 HPCA Fall 2011

Programming assignment #3
Due: Simulator (10/20) (Th) 6:00 pm
Report: (10/25) Hard copy only, before class
Hyesoon Kim, Instructor

This is an individual assignment. You may discuss it with classmates, but you must do the work individually, except for the one extra bonus point question. Please follow the submission instructions; if you do not use the required submission file names, you will not receive full credit. Please check the class homepage for the latest updates. Your code must run on the Jinx cluster with g++ 4.1.

Simulator (80%): Complete the memory system

You will extend your Lab #2 pipeline design. Please add the code in add_me3.txt at the appropriate places. The userknob.h and simknob.h files have been updated; please use the updated files.

Step 1:
You need to complete the dcache_access function in this assignment. The I-cache is still a perfect cache for this assignment.

To activate your dcache structure, you must turn off KNOB_PERFECT_DCACHE.
e.g.) ../../../pin -t obj-intel64/sim.so -perfect_dcache 0 -readtrace 1 -- /bin/ls
Note that KNOB_PERFECT_DCACHE should still work after you implement a data cache. Hence, when the KNOB_PERFECT_DCACHE value is 1, every cache access should be a cache hit regardless of the data cache size.

The cache has a 64B block size, true LRU replacement, and a write-through policy. D-cache access latency is set by KNOB_DCACHE_LATENCY. Note that a load/store instruction still spends the load/store instruction latency inside the execution stage. If there is a cache miss, the processor needs to wait at least KNOB_MEM_LATENCY_ROW_HIT or KNOB_MEM_LATENCY_ROW_MISS cycles. You implement the write-allocate policy, so for both store and load misses you bring in the entire cache block. However, you can retire store instructions even before the requested block is serviced. We are implementing a non-blocking cache: even if an instruction generates a cache miss, the pipeline continues to execute as long as there are ready instructions.
KNOB_DCACHE_SIZE and KNOB_DCACHE_WAY set the cache configuration. The cache size is given in KB.

e.g.) ../../../pin -t obj-intel64/sim.so -perfect_dcache 0 -dcache_size 1 -dcache_way 4 -readtrace 1 -- /bin/ls
Cache size = 1KB, so 1024 / 4 / 64 = 4 sets.
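
The set-count arithmetic above can be sketched in code. This is only an illustration: CacheGeometry, make_geometry, block_addr, set_index, and tag_of are hypothetical names, not part of the provided simulator, and the set index is assumed to come from the low bits of the block address.

```cpp
#include <cassert>
#include <stdint.h>

// Block size fixed by this assignment (64B).
static const uint64_t kBlockSize = 64;

// Hypothetical helper type: cache shape derived from the knob values.
struct CacheGeometry {
    uint64_t num_sets;
    uint64_t ways;
};

// size_kb corresponds to KNOB_DCACHE_SIZE, ways to KNOB_DCACHE_WAY.
inline CacheGeometry make_geometry(uint64_t size_kb, uint64_t ways) {
    assert(size_kb > 0 && ways > 0);
    uint64_t bytes = size_kb * 1024;
    assert(bytes % (ways * kBlockSize) == 0);  // must divide into whole sets
    CacheGeometry g;
    g.num_sets = bytes / (ways * kBlockSize);
    g.ways = ways;
    return g;
}

// Block address: the byte address with the block offset removed.
inline uint64_t block_addr(uint64_t addr) { return addr / kBlockSize; }

// Assumption: the set index is the low part of the block address.
inline uint64_t set_index(uint64_t addr, const CacheGeometry& g) {
    return block_addr(addr) % g.num_sets;
}

// The tag is whatever remains above the set index.
inline uint64_t tag_of(uint64_t addr, const CacheGeometry& g) {
    return block_addr(addr) / g.num_sets;
}
```

With -dcache_size 1 -dcache_way 4 this yields 4 sets, matching the example; the default 512KB, 4-way configuration would yield 2048 sets.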

To give you hints on how to build a cache, a standalone cache simulator, cache.cc, is provided. You may also design your own cache structure.
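
As one possible starting point (not the structure used in cache.cc), here is a minimal sketch of a set-associative cache with true LRU replacement. The class and member names (LruCache, Line, access, insert) are hypothetical, and LRU is tracked with a per-line timestamp, which is only one of several ways to implement it.

```cpp
#include <cassert>
#include <stdint.h>
#include <vector>

// One cache line: valid bit, tag, and the timestamp of its last use.
struct Line {
    bool valid;
    uint64_t tag;
    uint64_t last_used;
    Line() : valid(false), tag(0), last_used(0) {}
};

class LruCache {
public:
    // size_kb and ways correspond to KNOB_DCACHE_SIZE and KNOB_DCACHE_WAY.
    LruCache(uint64_t size_kb, uint64_t ways)
        : ways_(ways), num_sets_(size_kb * 1024 / (ways * 64)),
          lines_(num_sets_ * ways), tick_(0) {
        assert(num_sets_ > 0);
    }

    // Returns true on a hit and marks the line as most recently used.
    bool access(uint64_t addr) {
        Line* l = find(addr);
        if (!l) return false;
        l->last_used = ++tick_;
        return true;
    }

    // Installs the block holding addr, evicting the LRU line if the set is full.
    void insert(uint64_t addr) {
        if (Line* hit = find(addr)) { hit->last_used = ++tick_; return; }
        Line* set = &lines_[set_of(addr) * ways_];
        Line* victim = &set[0];
        for (uint64_t w = 0; w < ways_; ++w) {
            if (!set[w].valid) { victim = &set[w]; break; }  // prefer a free way
            if (set[w].last_used < victim->last_used) victim = &set[w];
        }
        victim->valid = true;
        victim->tag = tag_of(addr);
        victim->last_used = ++tick_;
    }

private:
    uint64_t set_of(uint64_t addr) const { return (addr / 64) % num_sets_; }
    uint64_t tag_of(uint64_t addr) const { return (addr / 64) / num_sets_; }

    Line* find(uint64_t addr) {
        Line* set = &lines_[set_of(addr) * ways_];
        for (uint64_t w = 0; w < ways_; ++w)
            if (set[w].valid && set[w].tag == tag_of(addr)) return &set[w];
        return 0;
    }

    uint64_t ways_;
    uint64_t num_sets_;
    std::vector<Line> lines_;
    uint64_t tick_;  // monotonically increasing access counter
};
```

Splitting lookup (access) from fill (insert) fits a non-blocking design, because a missing block is only installed later, when the memory request completes.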

Step 2:
You need to implement an MSHR (miss status holding register) to handle memory latency correctly. The size of the MSHR is determined by KNOB_MSHR_SIZE.

Summary of how to handle memory instructions in the core

1: A memory instruction is fetched and decoded.
2: In MEM_stage, the instruction checks the dcache.
3 (cache hit): On a cache hit, the instruction moves to the WB stage after KNOB_DCACHE_LATENCY cycles.
3 (cache miss): On a cache miss, go to 4.
4: The instruction searches the MSHR using the memory address. If there is a match (i.e., the memory address is the same), the processor records the instruction id in the MSHR entry (waiting inst IDs). The instruction's ready cycle is the same as the ready cycle in the MSHR entry. (The simulator stores an op pointer inside the MSHR entry.)
When the simulator checks the MSHR, it compares the cache block address, not the actual memory address, because we always bring in the entire cache block.
5: If there is no match in the MSHR, the processor checks the size of the MSHR. If there is no space in the MSHR, the processor stalls. Otherwise, the processor creates an entry in the MSHR and sends the memory request to memory.
6: After we insert the block into the cache, we free the MSHR entry. The ops in the ready MSHR entry are ready to retire. The freed MSHR entry can be reused starting from the following cycle. Even if multiple instructions are ready to retire, the processor still retires only one instruction per cycle.
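
Steps 4 to 6 above can be sketched as follows. This is a simplified illustration rather than the required design: the names (Mshr, MshrEntry, on_miss, release) are hypothetical, waiting instructions are stored as integer ids instead of op pointers, and the rule that a freed entry becomes usable only in the following cycle is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <vector>

static const uint64_t kBlockSize = 64;  // block size from this assignment

struct MshrEntry {
    bool valid;
    uint64_t block_addr;           // entries are matched on the block address
    uint64_t ready_cycle;          // filled in when the DRAM model schedules it
    std::vector<int> waiting_ops;  // ids of instructions waiting on this block
    MshrEntry() : valid(false), block_addr(0), ready_cycle(0) {}
};

enum MshrResult { MSHR_MERGED, MSHR_ALLOCATED, MSHR_FULL };

class Mshr {
public:
    // size corresponds to KNOB_MSHR_SIZE.
    explicit Mshr(size_t size) : entries_(size) { assert(size > 0); }

    // Called on a dcache miss (steps 4 and 5).
    MshrResult on_miss(uint64_t addr, int op_id) {
        uint64_t blk = addr / kBlockSize;
        MshrEntry* free_entry = 0;
        for (size_t i = 0; i < entries_.size(); ++i) {
            MshrEntry& e = entries_[i];
            if (e.valid && e.block_addr == blk) {  // step 4: merge into the match
                e.waiting_ops.push_back(op_id);
                return MSHR_MERGED;
            }
            if (!e.valid && !free_entry) free_entry = &e;
        }
        if (!free_entry) return MSHR_FULL;  // step 5: no space, caller must stall
        free_entry->valid = true;
        free_entry->block_addr = blk;
        free_entry->ready_cycle = 0;
        free_entry->waiting_ops.clear();
        free_entry->waiting_ops.push_back(op_id);
        return MSHR_ALLOCATED;  // caller sends the request to memory
    }

    // Called once the block is inserted into the cache (step 6).
    // Frees the entry and returns the ids of the ops that were waiting on it.
    std::vector<int> release(uint64_t addr) {
        std::vector<int> ops;
        uint64_t blk = addr / kBlockSize;
        for (size_t i = 0; i < entries_.size(); ++i) {
            MshrEntry& e = entries_[i];
            if (e.valid && e.block_addr == blk) {
                ops.swap(e.waiting_ops);
                e.valid = false;
                break;
            }
        }
        return ops;
    }

private:
    std::vector<MshrEntry> entries_;
};
```

Returning a three-way result lets the MEM stage distinguish "piggyback on an outstanding request", "issue a new request", and "stall" without inspecting the MSHR internals.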

Step 3: Modeling a DRAM

Check all entries in the MSHR to see whether any requests are ready (i.e., MSHR->ready_cycle < cycle_count).
If so, all ops in the corresponding MSHR entry should broadcast their tags. We insert the block into the cache at that moment.
If there is an unscheduled memory request, check whether the corresponding bank is available.
(If bank[bank_id]->ready_cycle < cycle_count, the bank is available.)
bank_index = (addr << DRAM_BANK_ROW_BITS) >> (64 - DRAM_BANK_INDEX_SIZE)
If the corresponding bank is idle, check the row id currently held in its row buffer.
If the memory request has the same row id, set MSHR->ready_cycle = cycle_count + MEM_LATENCY_ROW_HIT.
If the row buffer's row id differs from the row id of the memory request, set MSHR->ready_cycle = cycle_count + MEM_LATENCY_ROW_MISS, and update the row buffer id to the row id of the memory request (addr/(KNOB_DRAM_PAGE_SIZE*1024)). Also set bank[bank_id]->ready_cycle = MSHR->ready_cycle.
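
The bank and row-buffer check can be sketched as below, using the bank index and row id formulas exactly as written in this handout. The names (DramConfig, Bank, schedule_request) are hypothetical. Two details are assumptions not stated above: a bank starts with no open row, so its first access counts as a row miss, and the bank's ready cycle is updated on row hits as well as row misses.

```cpp
#include <cassert>
#include <stdint.h>
#include <vector>

// Hypothetical container for the DRAM-related knob values.
struct DramConfig {
    unsigned row_bits;      // KNOB_DRAM_BANK_ROW_ADDR_BITS
    unsigned index_bits;    // KNOB_DRAM_BANK_INDEX_SIZE
    uint64_t page_kb;       // KNOB_DRAM_PAGE_SIZE
    uint64_t lat_row_hit;   // KNOB_MEM_LATENCY_ROW_HIT
    uint64_t lat_row_miss;  // KNOB_MEM_LATENCY_ROW_MISS
};

struct Bank {
    uint64_t ready_cycle;  // bank is busy until this cycle
    uint64_t open_row;     // row id currently in the row buffer
    bool row_valid;        // assumption: no row is open initially
    Bank() : ready_cycle(0), open_row(0), row_valid(false) {}
};

// bank_index = (addr << ROW_BITS) >> (64 - INDEX_SIZE), as given in the text.
inline uint64_t bank_index(uint64_t addr, const DramConfig& c) {
    assert(c.index_bits > 0 && c.index_bits < 64 && c.row_bits < 64);
    return (addr << c.row_bits) >> (64 - c.index_bits);
}

// row id = addr / (KNOB_DRAM_PAGE_SIZE * 1024), as given in the text.
inline uint64_t row_id(uint64_t addr, const DramConfig& c) {
    return addr / (c.page_kb * 1024);
}

// Tries to schedule one unscheduled request at cycle `now`.
// Returns false if the bank is busy; otherwise writes the cycle at which
// the data is ready (the value to store in MSHR->ready_cycle).
inline bool schedule_request(std::vector<Bank>& banks, const DramConfig& c,
                             uint64_t addr, uint64_t now,
                             uint64_t* ready_cycle) {
    Bank& b = banks[bank_index(addr, c)];
    if (!(b.ready_cycle < now)) return false;  // bank not available yet
    uint64_t row = row_id(addr, c);
    bool row_hit = b.row_valid && b.open_row == row;
    *ready_cycle = now + (row_hit ? c.lat_row_hit : c.lat_row_miss);
    b.open_row = row;  // the row buffer now holds the requested row
    b.row_valid = true;
    b.ready_cycle = *ready_cycle;  // assumption: also updated on a row hit
    return true;
}
```

The banks vector is expected to hold 2^index_bits entries, for example 4 banks for the default KNOB_DRAM_BANK_INDEX_SIZE of 2.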

relevant data structures:

Knobs related to this assignment

KNOB_DCACHE_SIZE: data cache size in KB (default value: 512, i.e., 512KB)
KNOB_DCACHE_WAY: N-way set associative data cache (default value: 4)
KNOB_DCACHE_LATENCY: cache latency on a cache hit (default value: 5)
KNOB_MEM_LATENCY_ROW_HIT: DRAM access latency on a row buffer hit (default value: 100)
KNOB_MEM_LATENCY_ROW_MISS: DRAM access latency on a row buffer miss (default value: 200)
KNOB_MSHR_SIZE: the number of entries in the MSHR (default value: 4)
KNOB_DRAM_BANK_INDEX_SIZE: log2 of the number of DRAM banks (default value: 2, i.e., 4 banks)
KNOB_DRAM_BANK_ROW_ADDR_BITS: the number of row address bits (default value: 20)
KNOB_DRAM_PAGE_SIZE: the DRAM page size in KB (default value: 2, i.e., 2KB)

You have to update dcache_hit_count and dcache_miss_count accordingly.

Submission Guide
Please do not turn in pzip files (trace files). Trace files are very large and will cause a lot of problems.
(Tar the lab3 directory, gzip the tar file, and submit lab3.tar.gz on T-square.)
cd pin-2.8-36111-gcc.3.4.6-ia32_intel64-linux/source/tools

cd lab3
make clean
rm *.pzip
cd ..
tar cvf lab3.tar lab3
gzip lab3.tar

Report (20%)
Include your simulation results in a report. You do not need to submit any traces. Please note that there are many simulation cases, so it will take several hours to simulate all of them. Please consider using the Jinx job batch system to run your simulations. 10M instructions provide enough data, so you can reduce the simulation time by simulating only 10M instructions.

The default configuration is

(5 points) Vary the cache size (128KB, 512KB, 1MB, and 2MB) and measure IPC and cache hit ratio. Discuss the performance improvements. Estimate a working set size based on the four evaluated points.
The estimated working set size is the cache size at which you see a big jump in cache hit ratio as you increase the cache size.
(5 points) Vary the set associativity (1, 2, 4, 8, and 16 ways) and measure IPC and cache hit ratio. Discuss the performance improvements.
(5 points) Choose one of the test cases above (only one cache configuration and only one trace, which generates a sufficient number of cache misses) and measure the histogram of actual memory latencies (10-cycle granularity). In this report, the actual latency means the time from the DCACHE check until the processor inserts the block into the DCACHE. Why does the access latency vary significantly?
(5 points) Summarize what kinds of data structures are good for Stream, Stride, Markov, and CDP prefetchers. Which hardware prefetcher requires the largest amount of hardware?
(10 points) We would like to insert software prefetching requests to speed up the following matrix addition code.
```
for (int ii = 0; ii < 1000; ii++) {
    a[ii] = b[ii] + c[ii];
}
```
The modified code is:
```
for (int ii = 0; ii < 1000; ii++) {
    prefetch(a[ii+k]);
    prefetch(b[ii+k]);
    prefetch(c[ii+k]);
    a[ii] = b[ii] + c[ii];
}
```
prefetch(x) prefetches memory address x.

(a) Find the best performing k value (k is an integer). Assume that the memory latency (cache miss latency) is 200 cycles, that a[], b[], and c[] are 64-bit floating point data structures, and that the cache block size is 16B. Assume that the statement a[ii]=b[ii]+c[ii] is translated into 2 LDs, 1 ST, and 1 FP add. LD/ST cache hit latency is 3 cycles, and FP add takes 1 cycle.

(b) The above code generates many extra prefetch requests. How can we reduce them? Show the modified source code.
(10 points) (Extra point question: this is the only question you may do in a group of two.) Write benchmarks that generate different cache miss ratios: small, medium, and high. Generate traces and measure cache miss ratios with your simulator. Measure IPC, MPKI (misses per 1000 instructions), and cache hit and miss ratios. Discuss how you generated the code. The source code should be submitted on T-square under lab3_bonus.
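
The metrics requested in the report questions above (the 10-cycle latency histogram, IPC, MPKI, and hit ratio) can be collected with small helpers such as the following sketch. All names here (LatencyHistogram, mpki, hit_ratio, ipc) are hypothetical and not part of the provided simulator; the hit and miss arguments are meant to be the dcache_hit_count and dcache_miss_count counters mentioned earlier.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <stdint.h>

// Histogram of memory latencies with a fixed bucket width (10 cycles here).
class LatencyHistogram {
public:
    explicit LatencyHistogram(uint64_t bucket_width = 10)
        : width_(bucket_width) { assert(width_ > 0); }

    // check_cycle: cycle of the DCACHE check; insert_cycle: cycle when the
    // block is inserted into the DCACHE.
    void record(uint64_t check_cycle, uint64_t insert_cycle) {
        assert(insert_cycle >= check_cycle);
        uint64_t latency = insert_cycle - check_cycle;
        ++buckets_[(latency / width_) * width_];  // key = start of the bucket
    }

    // Number of samples in the bucket that starts at bucket_start.
    uint64_t count(uint64_t bucket_start) const {
        std::map<uint64_t, uint64_t>::const_iterator it =
            buckets_.find(bucket_start);
        return it == buckets_.end() ? 0 : it->second;
    }

    // Prints one line per non-empty bucket, e.g. "100-109: 2".
    void print() const {
        std::map<uint64_t, uint64_t>::const_iterator it;
        for (it = buckets_.begin(); it != buckets_.end(); ++it)
            std::printf("%llu-%llu: %llu\n",
                        (unsigned long long)it->first,
                        (unsigned long long)(it->first + width_ - 1),
                        (unsigned long long)it->second);
    }

private:
    uint64_t width_;
    std::map<uint64_t, uint64_t> buckets_;
};

// Misses per 1000 retired instructions.
inline double mpki(uint64_t misses, uint64_t insts) {
    assert(insts > 0);
    return 1000.0 * misses / insts;
}

// Fraction of dcache accesses that hit; the miss ratio is 1 - hit_ratio.
inline double hit_ratio(uint64_t hits, uint64_t misses) {
    assert(hits + misses > 0);
    return static_cast<double>(hits) / (hits + misses);
}

// Retired instructions per cycle.
inline double ipc(uint64_t insts, uint64_t cycles) {
    assert(cycles > 0);
    return static_cast<double>(insts) / cycles;
}
```

A std::map keeps the buckets sorted by latency, so the printed histogram can be pasted directly into the report or a plotting tool.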