School of Computer Science

Georgia Institute of Technology

CS4290/CS6290 HPCA, Fall 2011


Programming Assignment #5
For CS4290 Students:
Simulator due: Friday (12/9) 6 pm (changed from Thursday (12/8) before class), plus a 6-hour grace period
Report due: Friday (12/9) 6 pm (changed from Thursday (12/8) in class)
Please note that the report is due on the same day as the simulator, and the report requires many simulations and substantial analysis. This assignment also requires the use of other software.

Hyesoon Kim, Instructor


This is an individual assignment. You can discuss this assignment with other classmates, but you must complete it individually.


Part 1: Adding Counters in the Simulator (10 points)

Overview

In this assignment, you will extend your simulator to calculate power and energy. To simplify the assignment, we measure power and energy consumption only for a few important hardware structures.
The basic approach is to add counters to your simulator; after the simulation ends, you use the collected counter values to calculate power and energy.
You have to download McPAT from the McPAT website. McPAT is a multicore power, area, and timing simulator.
You have to add counters to your simulator and then provide the output values to McPAT as an XML file, since McPAT takes XML files as input. You do not have to generate an XML file directly from your simulator. Your simulator should generate a text file, power_counters.txt, containing the counter values, and then you will manually type the values into an XML file. You can use Excel or another XML editor to edit the XML file. The format of power_counters.txt is not specified; you will be graded on whether your simulator produces the power_counters.txt file. Since we are building a simplified power model, we do not generate many counter values. For counters that are not mentioned in this assignment, just use the default values in the original XML file. Here are the counters that you have to add to your simulator. We also show how these counters map to attributes (counters) in a McPAT XML file.

  • Overall
  • Fetch Stage
  • Decode Stage
  • Execution Stage
  • Memory
  • WB

Since there will be multiple valid answers depending on your simulator design, grades are based mainly on your reports.
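As a concrete illustration, a simulator might keep these counters in a small struct and dump them as plain text at the end of simulation. This is only a sketch under assumed names (PowerCounters, the individual field names, and the one-pair-per-line format are all illustrative, not required); choose counters that match your own pipeline.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdint>

// Hypothetical counter bundle; the field names are illustrative only.
struct PowerCounters {
    uint64_t total_cycles   = 0;
    uint64_t fetched_insts  = 0;  // Fetch stage activity
    uint64_t decoded_insts  = 0;  // Decode stage activity
    uint64_t alu_accesses   = 0;  // Execution stage
    uint64_t loads          = 0;  // Memory stage
    uint64_t stores         = 0;
    uint64_t writebacks     = 0;  // WB stage

    // Write one "name value" pair per line; the exact format is up to you.
    void dump(const char* path) const {
        FILE* f = std::fopen(path, "w");
        if (!f) return;
        std::fprintf(f, "total_cycles %llu\n",  (unsigned long long)total_cycles);
        std::fprintf(f, "fetched_insts %llu\n", (unsigned long long)fetched_insts);
        std::fprintf(f, "decoded_insts %llu\n", (unsigned long long)decoded_insts);
        std::fprintf(f, "alu_accesses %llu\n",  (unsigned long long)alu_accesses);
        std::fprintf(f, "loads %llu\n",         (unsigned long long)loads);
        std::fprintf(f, "stores %llu\n",        (unsigned long long)stores);
        std::fprintf(f, "writebacks %llu\n",    (unsigned long long)writebacks);
        std::fclose(f);
    }
};
```

In your simulator you would increment these fields in the corresponding pipeline stages and call dump("power_counters.txt") once at the end of simulation.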


    Part 2: Setting up McPAT (10 points)
    Copy the Niagara1.xml file that comes with McPAT to lab5.xml and change certain values to match the configuration of your pipelined processor. You need to turn in the lab5.xml file. (10 points) The values that you have to modify are:
    Please see the README file provided with McPAT to learn how to use McPAT.
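For orientation, McPAT configuration files consist of nested <component> elements whose <param> entries describe the hardware and whose <stat> entries carry your simulator's counter values. The fragment below is illustrative only (the values are made up); start from the Niagara1.xml template and edit only the entries that correspond to your design.

```xml
<component id="system.core0" name="core0">
  <!-- hardware parameters: edit to match your pipelined processor -->
  <param name="clock_rate" value="2000"/>
  <param name="fetch_width" value="1"/>
  <!-- activity statistics: fill in from power_counters.txt -->
  <stat name="total_instructions" value="100000"/>
  <stat name="total_cycles" value="150000"/>
</component>
```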

    Part 3: Simulations (80 points)
    Submission Guide

    You must include an XML file for the default configuration; the name of the file should be lab5.xml. Please follow the submission instructions: if you do not use the required file names, you will not receive full credit. Please check the class homepage for the latest updates. Your code must run on jinx with g++ 4.1.

    Please do not turn in pzip files (trace files). Trace files are huge, so they will cause a lot of problems.
    (Tar the lab5 directory, gzip the tar file, and submit the lab5.tar.gz file on T-square.)
    Please make sure the directory name is lab5!
    cd pin-2.8-36111-gcc.3.4.6-ia32_intel64-linux/source/tools

    cd lab5
    make clean
    rm *.pzip
    cd ..
    tar cvf lab5.tar lab5
    gzip lab5.tar




  • [0 point] Discuss techniques that are used for DTM (Dynamic Thermal Management). Why are they useful?


  • [0 point] A CMP has 8 cores on a chip. The total power budget for the chip is 140 W. When all cores are active, the frequency of each core is 2 GHz and Vdd = 2.7 V. Note that P = (1/2)CV^2f. To improve power efficiency, the processor employs a power-gating mechanism. When a program uses only two cores, the system increases the clock frequency of the two active cores to improve performance, and the remaining 6 cores are power gated.

    (a) What will be the maximum frequency for these two cores? Frequency is proportional to voltage, and the system has a powerful cooling system.

    (b) The total area of the chip is 2.5 cm x 2.5 cm. The cooling capacity is 60 W/cm^2. Assuming that the 2 cores use only one quarter of the total area, what will be the maximum frequency of the two active cores?
    (c) Instead of power gating, the system uses a clock-gating mechanism, so the remaining 6 cores are clock gated. What will be the maximum frequency of the two active cores? All 8 cores share the same Vdd and clock.

    (d) There is a two-thread application whose IPC per thread is 1.5. Each thread has 3M instructions. We can calculate EDP (Energy x Delay) for the two threads. What are the ratios of EDP for the power-gating and clock-gating mechanisms over the baseline (no clock gating, no power gating)?

    EDP(power gating)/EDP(baseline)

    EDP(clock gating)/EDP(baseline)
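All parts above follow from the given relation P = (1/2)CV^2f together with "frequency is proportional to voltage", which makes dynamic power scale with the cube of frequency. A small helper sketch of that scaling relation (the function names are assumed, and this is not an answer key for the problem):

```cpp
#include <cassert>
#include <cmath>

// With P = 0.5 * C * V^2 * f and V proportional to f, dynamic power
// scales as the cube of the frequency ratio.
double scaled_power(double p0, double f0, double f) {
    double r = f / f0;
    return p0 * r * r * r;
}

// Invert the cubic relation: the highest frequency whose power fits
// the given per-core budget.
double max_freq(double p0, double f0, double budget) {
    return f0 * std::cbrt(budget / p0);
}
```

For example, a core drawing 17.5 W at 2 GHz draws 8x that power at 4 GHz; conversely, an 8x larger per-core budget allows only a 2x higher frequency.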


  • [0 point] (CUDA) In the G80 architecture, each SM has an 8K-entry register file. If one thread consumes 24 registers, how many threads can run on one SM? Assume that each thread consumes only 64 B of shared memory and that there is a total of 64 KB of shared memory on one SM. Threads always execute in groups that are a multiple of 32.

    (Continued from the previous problem.) We assume that the memory latency is always 100 cycles and that the execution latency of LD and FADD is 4 cycles. The processor is a pipelined machine, and Fetch/Decode/Schedule take 1 cycle each. There are 512 threads in total to execute. How many cycles does it take to finish all 512 threads? The machine has the bandwidth to fetch/decode/schedule 32 threads in one cycle (such a group of 32 threads is called a warp). Let N be the number of threads that can be executed together (N is determined by the previous question). The processor waits until all N threads retire and then starts to fetch the remaining threads and executes the code again. The G80 architecture is an SMT machine.


     
    LD   R1, mem[R5]
    FADD R2, R1, R3
    FADD R3, R4, R5 
    
    Hint
    If N is 3, the processor executes the code in the following way


     
    cycle 1: fetch warp #1 (LD)
    cycle 2: fetch warp #2 (LD)   decode warp #1
    cycle 3: fetch warp #3 (LD)   decode warp #2        schedule warp #1
    cycle 4: fetch warp #1 (FADD) decode warp #3        schedule warp #2  LD warp #1
    cycle 5: fetch warp #2 (FADD) decode warp #1 (FADD) schedule warp #3  LD warp #2
    cycle 6: fetch warp #3 (FADD) decode warp #2 (FADD)                   LD warp #3
    ....
    cycle 104: LD warp #1 done
    cycle 105: LD warp #2 done ...
    
    How would this problem change if we consider memory bandwidth effects?
    CUDA programming has a block concept. How would this problem change if we consider blocks?
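The value of N in the hint comes from a simple resource calculation: take the tighter of the register and shared-memory limits, then round down to a multiple of the 32-thread warp size. A sketch of that calculation (the function name is assumed):

```cpp
#include <algorithm>
#include <cassert>

// Threads per SM: limited by registers and by shared memory, and always
// a whole number of 32-thread warps.
int threads_per_sm(int regs_per_sm, int regs_per_thread,
                   int smem_per_sm_bytes, int smem_per_thread_bytes) {
    int by_regs = regs_per_sm / regs_per_thread;
    int by_smem = smem_per_sm_bytes / smem_per_thread_bytes;
    int n = std::min(by_regs, by_smem);
    return (n / 32) * 32;  // round down to a multiple of the warp size
}
```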




  • [0 point] Show a software-pipelined version of this loop. Assume that you have an infinite number of registers.

     
    LOOP:  LD F0, 0 (R1)
           Add F4, F0, F2
           Mul F5, F4, F3
           Store F5, 0 (R1)
           Add R1, R1, #-8 
           Br R1, R2, LOOP
    
    


    You may omit the start-up and clean-up code.
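For intuition only (this is not the requested assembly answer): software pipelining overlaps operations from different iterations, as in this C-style sketch of a loop with the same load/add/mul/store shape as the one above:

```cpp
#include <cassert>

// Software-pipelined form of "a[i] = (a[i] + c) * m": the load for
// iteration i+1 is issued while iteration i computes and stores,
// with an explicit prologue (first load) and epilogue (last compute).
void sp_loop(double* a, int n, double c, double m) {
    if (n == 0) return;
    double x = a[0];                 // prologue: load for iteration 0
    for (int i = 0; i + 1 < n; ++i) {
        double next = a[i + 1];      // load for iteration i+1
        a[i] = (x + c) * m;          // compute + store for iteration i
        x = next;
    }
    a[n - 1] = (x + c) * m;          // epilogue: final compute + store
}
```

The hand-written assembly version follows the same pattern, with the prologue and epilogue being the start-up and clean-up code the question lets you omit.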

  • [0 point] Compare and contrast superscalar architecture and VLIW architecture.

  • [0 point] For a VLIW machine, the compiler has to find independent instructions that can be executed in the same cycle. Suppose that there is a 3-wide VLIW machine (i.e., there are three instruction slots). In the first and second slots, the processor can execute any type of instruction. In the third slot, the processor can execute all instructions except loads and stores. Arrange the following code to maximize performance. The processor is an in-order processor. The compiler needs to insert nops if it cannot find an instruction for a slot.
     
    0xa000 ADD R0, R0, R2
    0xa004 ADD R1, R1, R0
    0xa008 FADD F3, F2, F3
    0xa00c ADD R1, R1, R0
    0xa010 FADD F3, F2, F3
    0xa014 LD R2, MEM[R6]
    0xa018 ADD R1, R1, R0
    0xa01c FMUL F1, F5, F4
    0xa020 LD F2, MEM[R0]
    0xa024 FADD F4, F1, F2
    0xa028 LD F3, MEM[R0]
    0xa030 STORE MEM[R5], F4
    0xa034 FADD F4, F3, F4
    0xa038 ADD R2, R2, R0
    0xa03c BR R2, 0x8000
    
    ADD: simple ALU instruction (1-cycle latency)
    FADD: floating point instruction (1-cycle latency)
    FMUL: floating point instruction (2-cycle latency)
    LD: load instruction (2-cycle latency)
    STORE: store instruction (1-cycle latency)
    BR: branch instruction (1-cycle latency)


  • [0 point] Many x86 instructions are actually broken into multiple RISC-type uops when they are executed inside a processor. If one of the uops generates an exception, how should it be handled?

  • [0 point] (H&P 5.2) You are investigating the possible benefits of a way-predicting level-1 cache. Assume that a 32 KB two-way set-associative, single-banked level-1 data cache is currently the cycle-time limiter. As an alternative cache organization, you are considering a way-predicted cache modeled as a 16 KB direct-mapped cache with 85% prediction accuracy. Unless stated otherwise, assume that a mispredicted way access that hits in the cache takes one more cycle. Assume that the access time of the 32 KB 2-way cache is 2 cycles and that of the 16 KB direct-mapped cache is 1 cycle.
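The expected hit time of a way-predicted cache follows directly from the prediction accuracy and the misprediction penalty. A helper sketch of that expectation (the function name is assumed; this is one step of the analysis, not the full answer):

```cpp
#include <cassert>
#include <cmath>

// Expected hit time of a way-predicted cache: a correct prediction
// costs hit_lat cycles; a misprediction adds a fixed penalty.
double way_predicted_hit_time(double hit_lat, double mispredict_penalty,
                              double accuracy) {
    return accuracy * hit_lat +
           (1.0 - accuracy) * (hit_lat + mispredict_penalty);
}
```

With a 1-cycle predicted access, a 1-cycle misprediction penalty, and 85% accuracy, the expected hit time is 0.85 * 1 + 0.15 * 2 = 1.15 cycles, to be compared against the 2-cycle conventional 2-way cache.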


  • [0 point] A processor has ECC on its register files and throughout the memory hierarchy, but it is still not free from faults. What kinds of faults can still exist?