School of Computer Science

Georgia Institute of Technology

CS4290/CS6290 HPCA, Fall 2010


Homework assignment #4
Due: Thursday, December 9, online by 3:00 pm on T-square (no late submissions)
Due to the TA's emergency situation, we are accepting only soft copies of the homework. Please submit your homework on T-square.
Hyesoon Kim, Instructor

This is an individual assignment. You may discuss this assignment with other classmates, but you must complete it individually. You do not need to turn in the 0-point questions.



  1. [5 pts] p% of a program is vectorizable. We wish to run it on a multiprocessor. Assume we have an unlimited number of processing elements. If the maximum speedup achievable on this program is 30, what is p?
    s = 1 / ((1-p) + p/N)
    With an unlimited number of processing elements, N -> infinity, so s -> 1/(1-p).
    30 = 1/(1-p)  =>  p = 1 - 1/30 = 29/30
    p = 0.97 or 97%
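
    As a quick numerical sanity check (a Python sketch, not part of the required answer):

    # Amdahl's law: with p = 29/30 the speedup approaches 30 as N grows.
    def speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    p = 29.0 / 30.0                          # ~0.97
    for n in (10, 1000, 10**9):
        print(n, round(speedup(p, n), 3))    # tends toward 30 for large N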

  2. [10 points] Discuss techniques that are used for DTM (Dynamic Thermal Management). Why are these useful?


    - DVFS (dynamic voltage and frequency scaling)
    - Power gating
    - Pipeline gating
    - Clock gating
    - These techniques are useful because they keep the chip within its thermal limits at run time, so the cooling and packaging do not have to be over-provisioned (and made expensive) for the worst case.

  3. [20 points] A CMP has 8 cores in a chip. The total power budget for the chip is 140 W. When all cores are active, the frequency of each core is 2 GHz and Vdd = 2.7 V. Note that P = 1/2 C V^2 f. To improve power efficiency, the processor employs a power gating mechanism. When a program uses only two cores, the system increases the clock frequency of the two active cores to improve performance, and the remaining 6 cores are power gated.

    (a) What will be the maximum frequency for these two cores? Frequency is proportional to voltage. The system has a powerful cooling system.

    Since f is proportional to Vdd, P = 1/2 C V^2 f can be rewritten as P = 1/2 C' Vdd^3.
    8 * (1/2 C' (2.7)^3) = 2 * (1/2 C' (Vdd')^3)
    Vdd' = 2.7 * 4^(1/3) = 4.29 V
    f' = f * (Vdd')/Vdd = 2 GHz * 4.29/2.7 = 3.17 GHz
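
    A quick Python check of (a) (sketch only; half_c is the lumped per-core constant 1/2*C'):

    # Budget 140 W shared by 2 cores instead of 8; P = 1/2*C'*Vdd^3 per core.
    half_c = (140.0 / 8) / (2.7 ** 3)            # per-core 1/2*C' in W/V^3
    vdd_a = (140.0 / (2 * half_c)) ** (1 / 3)
    f_a = 2.0 * vdd_a / 2.7                      # f scales linearly with Vdd
    print(round(vdd_a, 2), "V,", round(f_a, 2), "GHz")   # ~4.29 V, ~3.17 GHz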


    (b) The total area of the chip is 2.5 cm x 2.5 cm. The cooling capacity is 60 W/cm^2. Assuming that the 2 active cores use only 1/4 of the total area, what will be the maximum frequency of the two active cores?



    Cooling capacity = 60 W/cm^2 * (2.5 * 2.5 / 4) cm^2 = 93.75 W
    C' = P / (1/2 Vdd^3) = (140/8) / (1/2 * 2.7^3) = 1.78
    93.75 W = 2 * (1/2 * C' * Vdd''^3)
    Vdd'' = 3.75 V
    f'' = f * (Vdd'')/Vdd = 2 GHz * 3.75/2.7 = 2.77 GHz
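
    A quick Python check of (b) (sketch only, same lumped constant as in (a)):

    # Two active cores limited by 60 W/cm^2 over 1/4 of the 2.5 cm x 2.5 cm die.
    budget_b = 60.0 * (2.5 * 2.5 / 4)            # 93.75 W
    half_c = (140.0 / 8) / (2.7 ** 3)            # per-core 1/2*C' in W/V^3
    vdd_b = (budget_b / (2 * half_c)) ** (1 / 3)
    f_b = 2.0 * vdd_b / 2.7
    print(round(budget_b, 2), "W,", round(vdd_b, 2), "V,", round(f_b, 2), "GHz")
    # -> 93.75 W, ~3.75 V, ~2.78 GHz (2.77 GHz with the rounding used above)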


    (c) Instead of power gating, the system uses a clock gating mechanism: the remaining 6 cores are clock gated. What will be the maximum frequency of the two active cores? All 8 cores share the same Vdd and clock.
    If leakage power = 0, the clock-gated cores consume no power, so using the values from (b) the maximum frequency of the two active cores is again 2.77 GHz.
    Better question: let's assume that 60% of the power consumption is due to dynamic power and 40% is due to static power. Static power can also vary with voltage and temperature, but to simplify the problem, assume that static power consumption does not vary.

    P = 1/2 C V^2 f + P(static)
    Dynamic power = 140 W * 0.6 = 84 W
    Static power = 140 W * 0.4 = 56 W
    C'' = (84/8) / (1/2 * 2.7^3)   [in W/V^3, with f proportional to Vdd folded into the constant]
    With clock gating, the 6 idle cores draw no dynamic power, so the two active cores get the whole 84 W dynamic budget:
    84 W = 2 * (1/2 * C'' * Vdd'''^3)  =>  Vdd''' = 2.7 * 4^(1/3) = 4.29 V
    New frequency = 2 GHz * Vdd'''/2.7 = 3.17 GHz (the same as in (a), since the per-core dynamic budget again scales by a factor of 4)
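
    A quick Python check of (c), using the 60/40 dynamic/static split assumed above (sketch only):

    # Static power is fixed at 56 W; the 84 W dynamic budget goes to the 2 active cores.
    half_c = (84.0 / 8) / (2.7 ** 3)             # per-core 1/2*C'' in W/V^3
    vdd_c = (84.0 / (2 * half_c)) ** (1 / 3)
    f_c = 2.0 * vdd_c / 2.7
    print(round(vdd_c, 2), "V,", round(f_c, 2), "GHz")   # ~4.29 V, ~3.17 GHz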


    (d) There is a two-thread application whose IPC per thread is 1.5. Each thread has 3M instructions. We can calculate EDP (Energy X Delay) for the two threads. What are the ratios of EDP for the power gating and clock gating mechanisms over baseline (no clock gating, no power gating)?

    EDP(power gating)/EDP(baseline)

    EDP(clock gating)/EDP(baseline)

    Since each thread will be executed by each core in parallel, we just need to worry about the execution time of one thread. 
        Delay (time) = # of instructions /(IPC * frequency) 
        Instructions per thread = 3 * 10^6
        
        EDP(baseline) = Energy(baseline) * Delay(baseline)
                      = Power * (execution time)^2
                      = 140 W * (3*10^6 / (1.5 * 2*10^9))^2
                      = 1.4*10^(-4)
    
        EDP(power gating) = Energy(power gating) * Delay(power gating) = Power * ((# of insts) / (IPC * frequency(power gating)))^2
    (Assuming the results from (a))
        EDP(power gating) = 140 W * (3*10^6 / (1.5*3.17*10^9))^2 = 5.57*10^(-5)
        EDP(power gating)/EDP(baseline) = 0.40
    (Assuming the results from (b))
        EDP(power gating) = 93.75 W * (3*10^6 / (1.5*2.77*10^9))^2 = 4.88*10^(-5)
        EDP(power gating)/EDP(baseline) = 0.35
    (Assuming the results from (c), i.e., 140 W total power and f = 3.17 GHz)
        EDP(clock gating) = 140 W * (3*10^6 / (1.5*3.17*10^9))^2 = 5.57*10^(-5)
        EDP(clock gating)/EDP(baseline) = 0.40
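
    The three ratios can be reproduced with a short Python sketch (it assumes the frequencies derived in (a)-(c)):

    # EDP = P * t^2 with t = insts / (IPC * f).
    insts, ipc = 3e6, 1.5

    def edp(power_w, freq_hz):
        t = insts / (ipc * freq_hz)
        return power_w * t * t

    base = edp(140.0, 2.0e9)
    print(round(edp(140.0, 3.17e9) / base, 2))   # power gating with (a): ~0.40
    print(round(edp(93.75, 2.77e9) / base, 2))   # power gating with (b): ~0.35
    print(round(edp(140.0, 3.17e9) / base, 2))   # clock gating with (c): ~0.40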
    
  4. [15 points] Assume that there are two processors and each processor has a cache as described in Problem 5 (see below). The processors use the MSI coherence protocol. The coherence states are denoted M, S, and I for Modified, Shared, and Invalid.
    A sequence of one or more CPU operations is specified as P#: op, address -> value. P# designates the CPU (e.g., P0), op is the CPU operation (LDB: load byte, STB: store byte), address denotes the memory address, and -> value indicates the new value to be written by a store operation.
    Each action occurs after the previous action has finished. (1) Show only the cache blocks that change. For example, P0.B0: (I, 1, 0x01) indicates that CPU P0's block B0 (0 is the index number) has a final state of I, a tag of 1, and data 0x01. Assume that memory is initialized to all 0s at the beginning of the program. (2) Show the final state of the caches after all the following operations (coherence state, cache tag, data, and index). To simplify the problem you do not need to draw the entire cache; you only need to fill out the relevant entries.

    P0: LDB 0x450

    P0: STB 0x448 -> 0x4

    P1: STB 0x450 -> 0x5

    P1: LDB 0x450

    P1: STB 0x448 -> 0x6

    P0: LDB 0x450

    P0: LDB 0x052

    P0: STB 0x452 -> 0x7

    P0: STB 0x852 -> 0x8

    P0: LDB 0x1850

    P1: LDB 0x850

    P1: LDB 0x450



    
    SOLUTION:
    
    Address 0x052 falls under block 0x5.
    Address 0x448 falls under block 0x44.
    Addresses 0x450 and 0x452 fall under block 0x45.
    Addresses 0x850 and 0x852 fall under block 0x85.
    Address 0x1850 falls under block 0x185.
    
    All blocks have tag 0x0.
    
    For the given sequence of LDB, STB operations, the following MSI protocol operations occur:
    
    P0: LDB 0x450  
    P0.B45: (S,0,0x0)
    P0: STB 0x448 -> 0x4
    P0.B44: (M,0,0x4)
    P1: STB 0x450 -> 0x5
    P0.B45: (I,0,0x0) | P1.B45: (M,0,0x5)
    P1: LDB 0x450
    P1.B45: (M,0,0x5)
    P1: STB 0x448 -> 0x6
    P1.B44: (M,0,0x6) | P0.B44: (I,0,0x4)
    P0: LDB 0x450
    P0.B45: (S,0,0x5) | P1.B45: (S,0,0x5)
    P0: LDB 0x052
    P0.B5: (S,0,0x0)
    P0: STB 0x452 -> 0x7
    P0.B45: (M,0,0x7) | P1.B45: (I,0,0x5)
    P0: STB 0x852 -> 0x8
    P0.B85: (M,0,0x8)
    P0: LDB 0x1850
    P0.B185: (S,0,0x0)
    P1: LDB 0x850
    P1.B85: (S,0,0x8) | P0.B85: (S,0,0x8)
    P1: LDB 0x450
    P1.B45: (S,0,0x7) | P0.B45: (S,0,0x7)
    Final cache entries:

    Processor 0
      B5  : State: Shared,   Tag: 0x0, Data: 0x0, Index: 0x5
      B44 : State: Invalid,  Tag: 0x0, Data: 0x4, Index: 0x44
      B45 : State: Shared,   Tag: 0x0, Data: 0x7, Index: 0x45
      B85 : State: Shared,   Tag: 0x0, Data: 0x8, Index: 0x85
      B185: State: Shared,   Tag: 0x0, Data: 0x0, Index: 0x185

    Processor 1
      B5  : State: Invalid,  Tag: 0x0, Data: 0x0, Index: 0x5
      B44 : State: Modified, Tag: 0x0, Data: 0x6, Index: 0x44
      B45 : State: Shared,   Tag: 0x0, Data: 0x7, Index: 0x45
      B85 : State: Shared,   Tag: 0x0, Data: 0x8, Index: 0x85
      B185: State: Shared,   Tag: 0x0, Data: 0x8, Index: 0x85
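
    To double-check the transitions, the following Python sketch replays the trace with minimal MSI bookkeeping (illustrative only: like the solution, it tracks a single data value per 16 B block, and all tags for these addresses are 0x0):

    # Minimal MSI bookkeeping sketch for the trace above.
    BLOCK_BITS = 4                                # 16 B blocks

    class Cache:
        def __init__(self, name):
            self.name = name
            self.blocks = {}                      # block number -> [state, data]

        def line(self, blk):
            return self.blocks.setdefault(blk, ['I', 0x0])

    def snoop(other, blk, is_write):
        """Invalidate (on a write) or downgrade to S (on a read) the other copy."""
        line = other.line(blk)
        data = line[1] if line[0] != 'I' else None
        if line[0] != 'I':
            line[0] = 'I' if is_write else 'S'
        return data

    def access(cache, other, op, addr, value=None):
        blk = addr >> BLOCK_BITS
        line = cache.line(blk)
        remote = snoop(other, blk, is_write=(op == 'STB'))
        if op == 'LDB':
            if line[0] == 'I':                    # miss: memory is all zeros
                line[0], line[1] = 'S', (remote if remote is not None else 0x0)
        else:                                     # STB: gain M; record only the stored
            line[0], line[1] = 'M', value         # value, matching the simplified data field

    p0, p1 = Cache('P0'), Cache('P1')
    trace = [(p0, 'LDB', 0x450, None), (p0, 'STB', 0x448, 0x4), (p1, 'STB', 0x450, 0x5),
             (p1, 'LDB', 0x450, None), (p1, 'STB', 0x448, 0x6), (p0, 'LDB', 0x450, None),
             (p0, 'LDB', 0x052, None), (p0, 'STB', 0x452, 0x7), (p0, 'STB', 0x852, 0x8),
             (p0, 'LDB', 0x1850, None), (p1, 'LDB', 0x850, None), (p1, 'LDB', 0x450, None)]
    for cache, op, addr, val in trace:
        access(cache, p1 if cache is p0 else p0, op, addr, val)

    for c in (p0, p1):
        for blk in sorted(c.blocks):
            state, data = c.blocks[blk]
            print(f"{c.name}.B{blk:X}: ({state}, 0x0, {data:#x})")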

  5. [0 point] A computer has a 128 KB write-back cache and 2 MB of physical memory. Each cache block is 16 B, the cache is 2-way set associative, and it uses the true LRU replacement policy. Assume a 26-bit virtual address space and byte-addressable memory. It uses a 1 KB page size. The cache uses a physical tag and physical index. How big (in bits) is the tag store? The cache uses the MSI protocol; assume that we need 2 bits to indicate the cache coherence state. Physical tag and physical index:
    2 MB physical memory: 21-bit physical address
    128 KB, 2-way, 16 B blocks: 2^17/(2^4 * 2) = 2^12 sets --> 12-bit index
    Block offset: 4 bits
    Address tag = 21 - 12 - 4 = 5 bits
    Tag-store bits per block: address tag (5) + dirty (1) + valid (1) + LRU (0.5, one bit per 2-way set) + coherence (2) = 9.5 bits
    Total tag storage: 9.5 bits * (2^17/16 blocks) = 9.5 * 8192 = 77824 bits
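
    A quick Python check of the arithmetic (sketch only):

    # Tag-store size: 9.5 bits of metadata per block, 8192 blocks.
    blocks = (128 * 1024) // 16                  # 2^13 = 8192 cache blocks
    bits_per_block = 5 + 1 + 1 + 0.5 + 2         # tag + dirty + valid + LRU/2 + MSI state
    print(int(blocks * bits_per_block))          # 77824 bits
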
  6. [10 points] (CUDA) In the G80 architecture, each SM has an 8K-entry register file. If one thread consumes 24 registers, how many threads can run on one SM? Assume that each thread consumes only 64 B of shared memory and there is a total of 64 KB of shared memory on one SM. A multiple of 32 threads is always executed together.

    N <= 8*1024/24 = 341.3 (shared memory would allow 64 KB / 64 B = 1024 threads, so registers are the limit)
    N = 320 (the largest multiple of 32 that does not exceed 341)
    (The problem is continued.) We assume that the memory latency is always 100 cycles and that the execution latency of LD and FADD is 4 cycles. The processor is a pipelined machine, and Fetch/Decode/Schedule each take 1 cycle. There are 512 threads in total to execute. How many cycles does it take to finish all 512 threads? The machine has the bandwidth to fetch/decode/schedule 32 threads in one cycle (this group is called a warp). Let's say that N threads can be executed together (N is determined by the previous question). The processor waits until all N threads retire, and then it starts to fetch the remaining threads and executes the code again. The G80 architecture is an SMT machine.


     
    LD   R1 mem[R5]
    FADD R2, R1, R3
    FADD R3, R4, R5 
    

    N = 320, so 512 threads = 320 + 192 = 10 warps + 6 warps
    Memory latency = 100 cycles.
    During the first warp's 100-cycle memory wait, the fetch/decode/schedule of the 2 FADDs for all 10 warps can be finished:
    2 * 10 warps = 20 cycles < 100 cycles
    First batch of 10 warps:
    warp #1: Fetch+Decode+Schedule (3 cycles) + LD execution (4 cycles) + memory (100 cycles) + 4 cycles (first FADD) + 4 cycles (second FADD)
    the remaining 9 warps: (4+4)*9 = 72 more cycles
    3 + 4 + 100 + 8 + 72 = 187 cycles
    Second batch of 6 warps:
    3 + 4 + 100 + 8 + 8*(6-1) = 155 cycles
    Total cycles = 187 + 155 = 342 cycles
     
    warp #1 |--LD (FADD fetch/decode/schedule)-----------------------|FADD-EXEC|FADD-EXEC|
    warp #2  |--LD (FADD fetch/decode/schedule)------------------------------------------|FADD-EXEC|FADD-EXEC|
    warp #3   |--LD (FADD fetch/decode/schedule)-------------------------------------------------------------|FADD-EXEC|FADD-EXEC|
    warp #4    |--LD (FADD fetch/decode/schedule)----------------------------------------------------------------------------------|FADD-EXEC|FADD-EXEC|
    warp #5     |--LD (FADD fetch/decode/schedule)------------------------------------------------------------------------------------------------------|FADD-EXEC|FADD-EXEC|
    warp #6      |--LD (FADD fetch/decode/schedule)--------------------------------------------------------------------------------------------------------------------------|FADD-EXEC|FADD-EXEC|
    warp #7       |--LD (FADD fetch/decode/schedule)---------------------------------------------------------------------------------------------------------------------------------------------|FADD-EXEC|FADD-EXEC|
    warp #8        |--LD (FADD fetch/decode/schedule)-----------------------------------------------------------------------------------------------------------------------------------------------------------------|FADD-EXEC|FADD-EXEC|
    warp #9         |--LD (FADD fetch/decode/schedule)------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|FADD-EXEC|FADD-EXEC|
    warp #10         |--LD (FADD fetch/decode/schedule)-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|FADD-Exec|FADD-EXEC|
    
    
    Note that the length is not really proportional to the execution time.
    Note that this problem does not consider any memory bandwidth consumption. It assumes that 10 LDs can be serviced concurrently.
    What would happen if there were such limitations?
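
    The occupancy and cycle count can be reproduced with a small Python sketch (it hard-codes the assumptions above: F/D/S of 1 cycle each, 4-cycle LD/FADD execution, 100-cycle memory latency, no bandwidth limits):

    regs_per_sm, regs_per_thread, warp = 8 * 1024, 24, 32
    n = (regs_per_sm // regs_per_thread) // warp * warp   # 320 threads per batch
    total_threads = 512

    def batch_cycles(num_warps):
        # first warp: F+D+S (3) + LD exec (4) + memory (100) + two FADDs (8);
        # each additional warp appends its two back-to-back FADD executions (8 cycles).
        return 3 + 4 + 100 + 8 + 8 * (num_warps - 1)

    batches = [n // warp, (total_threads - n) // warp]    # 10 warps, then 6 warps
    print(n, sum(batch_cycles(w) for w in batches))       # 320 threads, 342 cycles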


    Q1. How many DRAMs are on each DIMM?
    512 Mb * n = 1 GB  =>  n = 16, so sixteen 512-Mbit DRAMs
    The 8 DRAMs on each side are connected to 72 pins: (8 data + 1 ECC) * 8 = 72 pins, so each DRAM has 8 data I/Os.

    Q2. A burst length of 8 delivers 8 * (64-bit data transfers) = 64 B.
    For 32 B, a burst length of 4 is sufficient.

    Q3. Peak bandwidth = clock frequency * 2 (DDR) * data width
    DDR2-667: (667/2 MHz) * 2 * 8 B = 5.3 GB/s
    DDR2-533: (533/2 MHz) * 2 * 8 B = 4.26 GB/s
    Ratio between the two DIMMs = 667/533 = 1.25

    Q4. Activate (tRCD) + CAS latency + data transfer cycles = CL + CL + 64 B/(8 B * 2)
    = (2*CL + 4) cycles
    time = (2*4 + 4)/(533*10^6) = 22 ns

    Q5. Starting from an activate command: 22 ns
    No activate command needed (open page): (CL + 4)/frequency = 8/(533*10^6) = 15 ns
    Total access latency = DRAM access latency + cache access latency (20 ns)
    (22 + 20)/(15 + 20) = 1.2
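
    The bandwidth and latency numbers above can be checked with a short Python sketch (it follows the same cycle-counting convention as the solution, treating the 533 MT/s data rate as the cycle rate, and assumes CL = 4 with an 8 B data bus):

    bw_667 = 667e6 * 8 / 1e9                     # ~5.3 GB/s
    bw_533 = 533e6 * 8 / 1e9                     # ~4.26 GB/s
    print(round(bw_667, 2), round(bw_533, 2), round(bw_667 / bw_533, 2))   # ratio ~1.25

    cl, xfer = 4, 64 // (8 * 2)                  # CAS latency, 4 cycles to move 64 B
    t_row_miss = (2 * cl + xfer) / 533e6         # activate + CAS + transfer, ~22 ns
    t_row_hit = (cl + xfer) / 533e6              # open page: CAS + transfer, ~15 ns
    cache_ns = 20e-9
    print(round((t_row_miss + cache_ns) / (t_row_hit + cache_ns), 2))      # ~1.2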


  7. [0 point] A Markov prefetcher is useful when the cache size is small. Why is that? What is the benefit of a stride prefetcher over a Markov prefetcher?


    A Markov prefetcher has to store the miss addresses it observes (and their successors) in a correlation table, so when the cache is larger the address storage also needs to be bigger. A stride prefetcher does not need to store all memory addresses, and unlike a Markov prefetcher it does not need to be trained on the exact same addresses again: training on a few addresses per stream is enough to find the stride, as the sketch below illustrates.
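
    A toy stride-detection sketch in Python (illustrative only; the table layout and the pc value are made up, and real stride prefetchers also track confidence and prefetch distance):

    # Per-PC table: remember the last address and stride; when the stride repeats,
    # predict the next address (only a few accesses are needed to train).
    last_addr, last_stride = {}, {}

    def train_and_predict(pc, addr):
        stride = addr - last_addr.get(pc, addr)
        prediction = addr + stride if stride != 0 and stride == last_stride.get(pc) else None
        last_addr[pc], last_stride[pc] = addr, stride
        return prediction

    # One load (pc = 0x40) streaming through an array with a 16 B stride.
    for addr in range(0x1000, 0x1060, 16):
        print(hex(addr), train_and_predict(0x40, addr))
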
  8. [0 point] The FR-FCFS DRAM memory scheduler shows better performance than the FCFS DRAM scheduler. Why is that?

    FR-FCFS increases the row-buffer hit ratio. On a row-buffer hit, the scheduler does not need to precharge the bank and resend the row address (activate), so the request is served with lower latency.
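
    A minimal Python sketch of the FR-FCFS selection rule (the request format here is made up; a real scheduler also obeys bank timing constraints and read/write priorities):

    # Among pending requests, prefer row-buffer hits; break ties by age (FCFS).
    # A request is (arrival_order, bank, row); open_rows maps bank -> open row.
    def pick_next(requests, open_rows):
        hits = [r for r in requests if open_rows.get(r[1]) == r[2]]
        pool = hits if hits else requests
        return min(pool, key=lambda r: r[0])     # oldest request in the preferred pool

    queue = [(0, 0, 5), (1, 0, 7), (2, 0, 7)]
    print(pick_next(queue, {0: 7}))              # (1, 0, 7): younger row hit beats the older miss
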
  9. [0 point] In Programming Assignment #3, we implemented a non-blocking cache without a write buffer. If the cache were a blocking cache, what would need to be changed in your simulator and what would the performance impact be?

    We would not need MSHRs. Each memory request would block all other memory requests, losing memory-level parallelism and causing severe performance degradation.