CS4290/CS6290HPCA

School of computer science

Georgia Institute of Technology

CS4290/CS6290HPCA Fall 2011

Programming assignment #5
For CS4290 Students:
Simulator due: Friday (12/9) 6 pm. ~~Thursday (12/8) before the class (+ 6 hours grace period)~~
Report due: Friday (12/9) 6 pm ~~Thursday (12/8) in the class~~
Please note the report is due on the same day as the simulator and the report requires many simulations and analysis. This assignment also requires the use of other software.

Hyesoon Kim, Instructor

This is an individual assignment. You can discuss this assignment with other classmates but you should do the assignment individually.

Part 1: Adding Counters in the Simulator (10 points)

Overview

In this assignment, you will extend your simulator to calculate power and energy. To simplify the assignment, we measure power and energy consumption for some important hardware structures only.
The basic approach to calculate power consumption is that you add counters in your simulators and after the end of simulation, you use the collected counter values to calculate power and energy.
You have to download McPAT from McPAT website or here . McPAT is multicore power, area and timing simulator.
Your have to add counters to your simulator and then provide the output values to McPAT as an XML file, since McPAT takes XML files as input. You do not have to generate an XML file directly from your simulator. Your simulator should generate a text file, power_counters.txt, containing the counter values, and then you will manually type the values into an XML file. You can use Excel or other XML editors to edit the XML file. The format of power_counters.txt is not specified. (You will be graded based on whether your simulator produces power_counters.txt file) Since we are building a simplified power model, we do not generate many counter values. For counter values that are not mentioned in this assignment, you just use the default values in the original XML file. Here are the counters that you have to add in your simulator. We also show the how these counters map to attributes (counters) in a McPAT XML file.

Overall

Total number of instructions (McPAT: total_instructions, committed_instructions)
Total simulation cycles (McPAT: total_cycles, busy_cycles)
Number of floating point operations (McPAT: fp_instructions, committed_fp_instructions)
Number of integer operations (McPAT: int_instructions, committed_int_instructions)
Number of branch instructions (McPAT: branch_instructions)
Number of load instructions (McPAT: load_instructions)
Number of store instructions (McPAT: store_instructions)

Fetch Stage

Number of I-cache accesses (McPAT:icache read_accesses)
Number of branch mispredictions (McPAT: branch_mispredictions)

Decode Stage

Number of scheduler accesses (i.e. the number of times the scheduler is accessed. Even if no instructions are scheduled in a cycle, this counter should be incremented if there are instructions in the scheduler. And all the instructions should update this value separately.) McPAT: ROB_reads, ROB_writes, rename_accesses, fp_rename_accesses, inst_window_reads, inst_window_writes, inst_window_wakeup_accesses, fp_inst_window_reads, fp_inst_window_writes, fp_inst_window_wakeup_accesses)
Number of register reads for integer operations (McPAT: int_regfile_reads, ialu_access)
Number of register reads for fp operations (McPAT: float_regfile_reads, fpu_access)

Execution Stage

Number of multiply instructions (OP_IMUL, OP_MM in simulator) (McPAT: mul_accesses)

Memory

Number of data cache reads (McPAT: dtlb:total_accesses (read+write), dcache:read_accesses)
Number of data cache writes (McPAT: dtlb:total_accesses (read+write), dcache:write_accesses)
Number of memory accesses (McPAT: system.Noc0:total_accesses = 4 times of number of memory accesses, system.mem.Memory accesses, system.mem.memory_reads, system.mem.memory_writes, (system.mc.memory_accesses, = 4 times of number of memory accesses system.mc.memory_reads = system.mc.memory_writes = 2 times of number of memory accesses))

Number of register writes for integer operations (int_regfile_writes)
Number of register writes for fp operations (float_regfile_writes)

Since there will be multiple answers depending on your simulator design, grades are mainly based on your reports.

Part 2: Setting up McPAT (10 points)
You copy Niagra1.xml that comes with McPAT as lab5.xml file and change certain values to match the configuration of your pipelined processor. You need to turn in lab5.xml file. (10 points) The values that you have to modify are:

Number of cores:1
Number of L1Directories:1
Number of L2s: 0
Number of memory controllers:1
Number of hardware threads: set up based on your simulation configuration.
pipeline depth: set up based on your simulator
Branch predictor (global_predictor_entries, chooser_predictor_entries): the number of entries of GHR size.
Icache/Dcache: icache_config, dcache_config: set up based on cache size.
set all zeros for L2/L3 accesses (read/write)
Counter values: Use the outputs from your simulator.
Set to be zero: function_calls, context_switches

Please see the README file provided with McPAT to learn how to use McPAT.

Part 3: Simulations (80 points)

Simulation 1: Clock frequency effects
Vary clock frequency (target_core_clockrate): 900, 1200, 1500
Simulation 2: Branch Prediction effects
Vary branch predictor size 4, 8, 12
Simulation 3: Cache size effects
16KB, 512KB, 1MB

Simulation 4: MT simulations
Vary the number of threads (1,2,4)
The default configuration is

Submission Guide

You must include an XML file for the default configuration. The name of the file should be lab5.xml Please follow the submission instructions. If you do not follow the submission file names, you will not receive full credit. Please check the class homepage to see the latest updates. Your code must run on jinx with g++4.1.

Please do not turn in pzip files(trace files). Trace file sizes are huge so they will cause a lot of problems.
(Tar the lab5 directory. Gzip the tarfile and submit lab5.tar.gz file on T-square)
Please make sure the directory name is lab5!
cd pin-2.8-36111-gcc.3.4.6-ia32_intel64-linux/source/tools

cd lab5
make clean
rm *.pzip
cd ..
tar cvf lab5.tar lab5
gzip lab5.tar

[0 point] Discuss techniques that are used for DTM (Dynamic Thermal Management). Why are these useful?

[0 point] A CMP has 8 cores in a chip. The total power budget for the chip is 140W. When all cores are active, the frequency of each core is 2GHz and Vdd=2.7V. Note that P = 1/2CV^2f. To improve power efficiency, the processor employs a power gating mechanism. When a program uses only two cores, the system increases the clock frequency of the two active cores to improve performance and the rest 6 cores are power gated.

(a) What will be the maximum frequency for these two cores? Frequency is proportional to voltage. The system has a powerful cooling system.

(b) The total area of the chip is (2.5cm x 2.5cm). The cooling capacity is 60W/cm^2. When we assume that 2 cores use only 1/4th of the total area, what will be the maximum frequency of two active cores?
(c) Instead of power gating, the system uses a clock gating mechanism. Now, the remaining 6 cores are clock gated. What will be the maximum frequency of two active cores? All the 8 cores share the same Vdd and clock.

(d) There is a two-thread application whose IPC per thread is 1.5. Each thread has 3M instructions. We can calculate EDP (Energy X Delay) for the two threads. What are the ratios of EDP for the power gating and clock gating mechanisms over baseline (no clock gating, no power gating)?

EDP(power gating)/EDP(baseline)

EDP(clock gating)/EDP(baseline)

[0 point](Cuda ) In the G80 architecture, each SM has 8K number of register files. Let's say that one thread consumes 24 registers, how many threads can be run on one SM? Assume that each thread consumes only 64B shared memory and there is a total 64KB shared memory space on one SM. Always a multiple of 32 threads can be executed together.

(The problem is continued). We assume that the memory latency is always 100 cycles and the execution latency of LD and FADD is 4 cycles. The processor is a pipelined machine and Fetch/Decode/Schedule takes 1 cycle each. There are total 512 threads to execute. How many cycles does it take to finish all 512 threads? The machine has a bandwidth of executing (fetch/decode/schedule) 32 threads at one cycle (this is called a warp.) Let's say that N number of threads can be executed together. (N is determined by the previous question). The processor waits until all N threads retire and then it starts to fetch the remaining threads and executes the code again. The G80 architecture is an SMT machine.

 
LD   R1 mem[R5]
FADD R2, R1, R3
FADD R3, R4, R5

Hint
If N is 3, the processor executes the code in the following way

 
cycle 1: fetch warp #1 (LD) 
cycle 2: fetch warp #2 (LD)   decode warp #1 
cycle 3: fetch warp #3  (LD)  decode warp #2  schedule warp #1 
cycle 4: fetch warp #1 (FADD) decode warp #3  schedule warp #2  LD warp #1 
cycle 5: fetch warp #2 (FADD) decode warp #1 (FADD) schedule warp #2  LD warp #2 
cycle 6: fetch warp #3 (FADD) decode warp #2 (FADD)  LD warp #3 
....
cycle 104:                                                LD warp #1 done 
cycle 105:                                                LD warp #2 done ...

How can this problem be changed if we consider memory bandwidth effects?
The CUDA programming has a block concept. How can this problem be changed if we consider a block?

[0 point] Show a software-pipelined version of this loop. Assume that you have infinite number of registers.

 
LOOP;  LD F0, 0 (R1)
       Add F4, F0, F2
       Mul F5, F4, F3
       Store F5, 0 (R1)
       Add R1, R1, #-8 
       Br R1, R2, LOOP

You may omit the start-up and clean-up code.

[0 point]Compare and contrast super-scalar architecture and VLIW architecture.

[0 point] For a VLIW machine, the compiler has to find independent instructions that can be executed at the same cycle. Supposed that there is a 3-wide VLIW machine. (i.e., there are three-instruction slots.) For the first and second slots, the processor can execute any types of instructions. For the third slots, the processor can execute all instructions except loads, stores. Arrange the following code to maximize the performance. The processor is an in-order processor. The compiler needs to insert nops if it cannot find any instructions for a slot.

 
0xa000 ADD R0, R0, R2
0xa004 ADD R1, R1, R0
0xa008 FADD F3, F2,  F3
0xa00B ADD R1, R1, R0
0xa010 FADD F3, F2,  F3
0xa014 LD R2, MEM[R6]
0xa008 ADD R1, R1, R0
0xa01B FMUL F1, F5, F4
0xa020 LD F2, MEM[R0]
0xa024 FADD F4, F1, F2
0xa028 LD F3, MEM[R0] 
0xa030 STORE MEM[R5], F4
0xa034 FADD F4, F3, F4
0xa038 ADD R2, R2, R0
0xa03B BR R2, 0x8000

ADD: simple ALU instruction (1-cycle latency)
FADD: floating point instruction (1-cycle latency)
FMUL: floating point instruction (2-cycle latency)
LD: load instruction (2-cycle latency)
STORE: store instruction (1-cycle latency)
BR: branch instruction (1-cycle latency)

[0 point] Many of x86 instructions are actually broken into multiple RISC type uops when they are executed inside a processor. If one of the uops generates an exception, how should we handle them?

[0 point] (H&P 5.2) You are investigating the possible benefits of a way-predicting level 1 cache. Assume that the 32 KB two-way set-associative single-banked level 1 data cache is currently the cycle time limiter. As an alternate cache organization you are considering a way-predicted cache modeled as a 16KB direct-mapped cache with 85% prediction accuracy. Unless stated otherwise, assume a mispredicted way access that hits in the cache takes one more cycle. Assume that the access time of 32-KB 2-way cache is 2 cycles and that of 16KB direct-mapped cache is 1 cycle.

What is the average memory access time of the current cache versus the way-predicted cache?
If all the components could operate with the faster way-predicted cache cycle time (including the main memory), what would be the impact on performance from using the way-predicted cache?
Way-predicted caches have usually only been used for instruction caches that feed an instruction queue or buffer. Imagine you want to try out way prediction on a data cache. Assume you have 85% prediction accuracy, and subsequent operations (e.g., data cache access of other instructions, and dependent operations, etc.) are issued assuming a correct way prediction. Thus a way misprediction necessitates a pipe flush and replay trap, which requires 15 cycles. Is the change in average memory access time per load instruction with data cache way prediction positive or negative, and how much is it?

[0 point] A processor has an ECC in register files and all the memory hierarchy. But still it is not free from faults? What kind of faults can still exist?