School of Computer Science

Georgia Institute of Technology

CS 4290 / CS 6290 HPCA, Fall 2010


Homework assignment #4
Due: Thursday, December 9, online by 3:00 pm on T-Square (no late submissions)
Due to the TA's emergency, we are accepting only soft copies of the homework. Please submit your homework on T-Square.
Hyesoon Kim, Instructor

This is an individual assignment. You may discuss this assignment with other classmates, but you must complete it individually. You do not need to turn in the 0-point questions.

  1. [5 points] p% of a program is vectorizable, and we wish to run it on a multiprocessor. Assume we have an unlimited number of processing elements. If the maximum speedup achievable on this program is 30, what is p?
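
    For reference, the relationship this question exercises is Amdahl's law. The Python sketch below is only illustrative (the helper name and the example value p = 0.9 are mine, not part of the question); it shows how the speedup approaches a ceiling as the number of processing elements grows, which is the limit you need to invert to recover p.

    # Amdahl's law sketch (illustrative values only, not the answer to the question).
    def speedup(p, n):
        """Speedup of a program whose fraction p is parallel, run on n processing elements."""
        return 1.0 / ((1.0 - p) + p / n)

    p_example = 0.9                       # example: 90% of the program is vectorizable
    for n in (2, 8, 64, 10**6):
        print(n, speedup(p_example, n))   # approaches 1/(1 - p) = 10x as n grows

    # With an unlimited number of processing elements the p/n term vanishes,
    # so the maximum speedup is 1/(1 - p); solve that expression for p given the ceiling.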

  2. [10 points] Discuss techniques that are used for DTM (Dynamic Thermal Management). Why are these useful?


  3. [20 points] A CMP has 8 cores on a chip. The total power budget for the chip is 140W. When all cores are active, the frequency of each core is 2GHz and Vdd = 2.7V. Note that P = (1/2)CV^2f. To improve power efficiency, the processor employs a power-gating mechanism: when a program uses only two cores, the system increases the clock frequency of the two active cores to improve performance, and the remaining 6 cores are power gated. (A sketch of the frequency/power scaling used throughout this question appears after part (d).)

    (a) Assume frequency is proportional to voltage and that the system has a powerful cooling system. What will be the maximum frequency of these two active cores?

    (b) The total area of the chip is 2.5cm x 2.5cm, and the cooling capacity is 60W/cm^2. Assuming that the 2 active cores use only 1/4 of the total area, what will be the maximum frequency of the two active cores?
    (c) Instead of power gating, the system uses a clock-gating mechanism, so the remaining 6 cores are clock gated. All 8 cores share the same Vdd and clock. What will be the maximum frequency of the two active cores?

    (d) There is a two-thread application whose IPC per thread is 1.5, and each thread executes 3M instructions. We can calculate the EDP (Energy x Delay) for the two threads. What are the ratios of EDP for the power-gating and clock-gating mechanisms over the baseline (no clock gating, no power gating)?

    EDP(power gating)/EDP(baseline)

    EDP(clock gating)/EDP(baseline)
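
    Parts (a) through (d) all lean on the same scaling: with P = (1/2)CV^2f and V proportional to f, the dynamic power of a fixed core scales with f^3. The Python sketch below only illustrates that relationship under the assumption that the 140W budget is fully consumed and split evenly when all 8 cores run at 2GHz; the helper names and the numbers in the usage line are illustrative, not values from a specific sub-part.

    # Frequency/power scaling sketch, assuming P = 0.5*C*V^2*f and V proportional to f,
    # so P scales as f^3 for a fixed core (C unchanged).  Names are illustrative.
    BASELINE_F_GHZ = 2.0                 # all 8 cores active
    BASELINE_P_PER_CORE_W = 140.0 / 8    # assumed even split of the 140 W budget at baseline

    def power_per_core(f_ghz):
        """Dynamic power of one core at frequency f, scaled from the baseline point."""
        return BASELINE_P_PER_CORE_W * (f_ghz / BASELINE_F_GHZ) ** 3

    def max_frequency(budget_w, n_active):
        """Highest per-core frequency such that n_active cores fit within budget_w."""
        return BASELINE_F_GHZ * (budget_w / (n_active * BASELINE_P_PER_CORE_W)) ** (1.0 / 3.0)

    # Illustrative usage only: what frequency would 4 active cores reach under a 100 W budget?
    print(max_frequency(100.0, 4))
    # Parts (a)-(c) differ in what the effective budget is (full chip power, cooling-limited
    # area, or budget left after the non-power-gated idle cores); part (d) combines the
    # resulting frequency (delay) with energy to form EDP = Energy x Delay.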


  4. [15 points] Assume that there are two processors, and each processor has a cache as described in Problem 5 (see below). The processors use the MSI coherence protocol. The coherence states are denoted M, S, and I for Modified, Shared, and Invalid.
    A sequence of CPU operations is specified as P#: op, address -> value. P# designates the CPU (e.g., P0), op is the CPU operation (LDB: load byte, STB: store byte), address denotes the memory address, and -> value indicates the new value to be written by a store operation.
    Each action occurs after the previous action has finished. (1) Show only the cache blocks that change; for example, P0.B0: (I, 1, 0x01) indicates that CPU P0's block B0 (0 is the index number) has a final state of I, a tag of 1, and data 0x01. Assume that memory is initialized to all 0s at the beginning of the program. (2) Show the final state of the caches (coherence state, cache tag, data, and index) after all of the following operations. To simplify the problem, you do not need to draw the entire cache; just fill out the relevant entries. (An address-decomposition sketch follows the operation list.)

    P0: LDB 0x450

    P0: STB 0x448 -> 0x4

    P1: STB 0x450 -> 0x5

    P1: LDB 0x450

    P1: STB 0x448 -> 0x6

    P0: LDB 0x450

    P0: LDB 0x052

    P0: STB 0x452 -> 0x7

    P0: STB 0x852 -> 0x8

    P0: LDB 0x1850

    P1: LDB 0x850

    P1: LDB 0x450


  5. [0 point] A computer has a 128KB write-back cache and 2MB of physical memory. Each cache block is 16B, the cache is 2-way set associative, and it uses the true LRU replacement policy. Assume a 26-bit virtual address space and byte-addressable memory. The page size is 1KB. The cache uses physical tags and a physical index. How big (in bits) is the tag store? The cache uses the MSI protocol; assume that we need 2 bits to indicate the cache coherence state.
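
    The arithmetic here is mostly bookkeeping: derive the physical address width, split off the index and offset bits, and then count the metadata stored per block. The Python sketch below lays out that setup from the problem's parameters; deciding exactly which metadata bits (valid, dirty, LRU, coherence) belong in the tag store is the part the question asks you to justify.

    # Tag-store setup sketch for Problem 5 (parameters from the problem statement).
    import math

    cache_bytes = 128 * 1024
    block_bytes = 16
    ways = 2
    phys_mem_bytes = 2 * 1024 * 1024      # physical tag and index, so use the physical address

    phys_addr_bits = int(math.log2(phys_mem_bytes))         # 21-bit physical address
    offset_bits = int(math.log2(block_bytes))                # 4
    sets = cache_bytes // (block_bytes * ways)               # 4096
    index_bits = int(math.log2(sets))                        # 12
    tag_bits = phys_addr_bits - index_bits - offset_bits     # per-block tag width

    coherence_bits = 2   # MSI state, as given in the problem
    # Per-block entry = tag_bits + coherence_bits + whatever valid/dirty/LRU bits you argue for;
    # total tag-store size = sets * ways * (per-block entry bits) plus any per-set LRU bits.
    print(phys_addr_bits, index_bits, offset_bits, tag_bits)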

  6. [0 point] (CUDA) In the G80 architecture, each SM has an 8K-entry register file. If one thread consumes 24 registers, how many threads can run on one SM? Assume that each thread consumes only 64B of shared memory and that there is a total of 64KB of shared memory on one SM. Threads are always executed in multiples of 32.

    (The problem is continued.) Assume that the memory latency is always 100 cycles and that the execution latency of LD and FADD is 4 cycles. The processor is a pipelined machine, and Fetch/Decode/Schedule each take 1 cycle. There are 512 threads in total to execute. How many cycles does it take to finish all 512 threads? The machine has the bandwidth to fetch/decode/schedule 32 threads in one cycle (this group is called a warp). Let N be the number of threads that can be executed together (N is determined by the previous question). The processor waits until all N threads retire, and then it starts to fetch the remaining threads and executes the code again. The G80 architecture is an SMT machine. (A sketch of the occupancy arithmetic appears after this question.)


     
    LD   R1, mem[R5]
    FADD R2, R1, R3
    FADD R3, R4, R5 
    
    Hint
    If N is 3, the processor executes the code in the following way


     
    cycle 1: fetch warp #1 (LD) 
    cycle 2: fetch warp #2 (LD)   decode warp #1 
    cycle 3: fetch warp #3  (LD)  decode warp #2  schedule warp #1 
    cycle 4: fetch warp #1 (FADD) decode warp #3  schedule warp #2  LD warp #1 
    cycle 5: fetch warp #2 (FADD) decode warp #1 (FADD) schedule warp #3  LD warp #2 
    cycle 6: fetch warp #3 (FADD) decode warp #2 (FADD)  LD warp #3 
    ....
    cycle 104:                                                LD warp #1 done 
    cycle 105:                                                LD warp #2 done ... 
    
    How would this problem change if we consider memory bandwidth effects?
    CUDA programming has a block concept. How would this problem change if we consider blocks?
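
    The following Python sketch lays out the occupancy arithmetic for the first part of this question (how many threads fit on one SM) under the stated assumptions; the variable names are mine. For the timing part, the pipelined warp schedule from the hint is then repeated for 512/N batches of N threads.

    # Occupancy sketch: threads per SM limited by registers and shared memory,
    # rounded down to a whole number of 32-thread warps.  Names are illustrative.
    regs_per_sm = 8 * 1024        # 8K registers per SM
    regs_per_thread = 24
    smem_per_sm = 64 * 1024       # 64 KB shared memory per SM
    smem_per_thread = 64          # 64 B per thread
    warp_size = 32

    limit_by_regs = regs_per_sm // regs_per_thread
    limit_by_smem = smem_per_sm // smem_per_thread
    n = min(limit_by_regs, limit_by_smem)
    n = (n // warp_size) * warp_size   # threads always run in whole warps
    print(n)                           # N, the number of threads resident at once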



  7. [0 point] (H&P) Q 5.10
    Assume the DRAM is on a standard DDR2 DIMM with ECC, having 72 data lines. Also assume a burst length of 8, which reads out 8 bits per data line, or a total of 64 bytes of data from the DIMM. Assume the DRAMs have a 1KB page size, 8 banks, t = CL clock cycles, and Clock_frequency = Transfers_per_second / 2. The on-chip latency on a cache miss through levels 1 and 2 and back, not including the DRAM access, is 20 ns. Assume a DDR2-667 1GB DIMM with CL = 5; DDR2-667 means the clock speed is 333 MHz. (A sketch of the basic DDR2 timing and bandwidth arithmetic appears after the sub-questions.)
    • How many DRAMs are on the DIMM if 512 Mbit DRAMs are used, and how many data I/Os must each DRAM have if only one DRAM connects to each DIMM data pin?

    • What burst length is required to support 32-byte versus 64-byte level 2 cache blocks?

    • What is the peak bandwidth ratio between the DIMMs for reads from an active page?

    • How much time is required from the presentation of the activate command until the last requested bit of data from the DRAM transitions from valid to invalid for the DDR2-533 1 GB CL = 4 DIMM?

    • What is the relative latency when using the DDR2-533 DIMM of a read requiring a bank activate versus one to an already open page, including the time required to process the miss inside the processor?
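
    Several of these sub-questions reduce to the same two pieces of DDR2 arithmetic: peak bandwidth is the transfer rate times the 8 data bytes moved per transfer, and a CAS latency of CL means CL clock cycles (i.e., CL times the clock period, with Clock_frequency = Transfers_per_second / 2 as stated). The Python sketch below is only a setup under those assumptions; the helper names are mine.

    # DDR2 bandwidth/latency setup sketch (helper names are illustrative).
    def peak_bandwidth_bytes_per_s(transfers_per_s, data_bytes_per_transfer=8):
        """64 data bits (ECC excluded) move on each transfer of this DIMM."""
        return transfers_per_s * data_bytes_per_transfer

    def cl_latency_ns(cl, transfers_per_s):
        """CL is measured in clock cycles; Clock_frequency = Transfers_per_second / 2."""
        clock_hz = transfers_per_s / 2.0
        return cl / clock_hz * 1e9

    # Example with the DDR2-667, CL = 5 numbers given in the problem statement:
    print(peak_bandwidth_bytes_per_s(667e6))   # bytes/second from an active page
    print(cl_latency_ns(5, 667e6))             # one CL delay, in ns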

  8. [0 point] A Markov prefetcher is useful when the cache size is small. Why is that? What is the benefit of a stride prefetcher over a Markov prefetcher?
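
    As a rough illustration of what the question contrasts (a minimal sketch, not any particular published design): a stride prefetcher keeps only a last address and a stride per stream, while a Markov prefetcher keeps a table of observed miss-to-miss correlations, so its usefulness depends on how much history it can store and replay.

    # Minimal illustrative contrast between the two prefetchers (not a real design).
    class StridePrefetcher:
        """Keeps only (last address, last stride); predicts addr + stride on a repeat."""
        def __init__(self):
            self.last_addr = None
            self.stride = None
        def access(self, addr):
            prediction = None
            if self.last_addr is not None:
                delta = addr - self.last_addr
                if delta != 0 and delta == self.stride:
                    prediction = addr + delta   # can predict addresses never seen before
                self.stride = delta
            self.last_addr = addr
            return prediction

    p = StridePrefetcher()
    print([p.access(a) for a in (0x100, 0x140, 0x180, 0x1c0)])  # predicts once the stride repeats

    # A Markov prefetcher instead records "miss A was followed by miss B" pairs,
    # so it can only replay correlations it has already observed and needs table
    # space proportional to the miss footprint.
    markov_table = {}   # prev_miss_addr -> predicted next_miss_addr (sketch only)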


  9. [0 point] The FR-FCFS DRAM memory scheduler shows better performance than the FCFS DRAM scheduler. Why is that?
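
    As a minimal sketch of the policy difference (the request tuple format below is an assumption, not a full DRAM model): FCFS always issues the oldest request, while FR-FCFS first prefers requests that hit in the currently open row buffer and only then falls back to age.

    # Scheduler-policy sketch (illustrative data model: each request is (arrival, bank, row)).
    def fcfs_pick(queue):
        """Oldest request first, regardless of row-buffer state."""
        return min(queue, key=lambda r: r[0])

    def frfcfs_pick(queue, open_rows):
        """Prefer row-buffer hits (oldest first); otherwise fall back to oldest overall."""
        hits = [r for r in queue if open_rows.get(r[1]) == r[2]]
        return min(hits or queue, key=lambda r: r[0])

    # Example: bank 0 currently has row 7 open; the younger request to that row is served
    # first by FR-FCFS, avoiding a precharge/activate, which is where its advantage comes from.
    queue = [(0, 0, 3), (1, 0, 7)]
    print(fcfs_pick(queue), frfcfs_pick(queue, {0: 7}))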

  10. [0 point] In Programming Assignment #3, we implemented a non-blocking cache without a write buffer. If the cache were a blocking cache, what would need to change in your simulator, and what would be the performance impact?