School of Computer Science

Georgia Institute of Technology

CS4290/CS6290 HPCA Fall 2010


Homework assignment #3
Hyesoon Kim, Instructor

This is an individual assignment. You may discuss this assignment with classmates, but you must complete your assignment individually.

  1. (5 points) The average memory latency measured for an application on a modern processor varies significantly from application to application. Discuss reasons why memory latency can vary so much. Assume that the CPU frequency and the DRAM frequency do not change during run-time.


    - DRAM row-buffer hits/misses. Some applications, such as streaming applications, have high row-buffer locality; others do not (see the sketch after this list).
    - Bank conflicts. Some applications generate many memory requests that map to the same bank. When two requests conflict in a bank, the second must wait for the first, roughly doubling its latency.
    - Queuing delay in each memory buffer.
    - Bus bandwidth. Some applications are bandwidth-limited: if an application generates too many memory requests, the bandwidth between the CPU and memory limits performance.
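
    As an illustration (a hypothetical sketch, not part of the original answer), the two loops below read the same array but stress DRAM very differently: the first streams through consecutive addresses, so most accesses hit the open row buffer, while the second jumps 8 KB at a time, so most accesses open a new row and can also conflict in the banks.

    #define N (1 << 20)
    double x[N];

    /* Sequential sweep: consecutive addresses fall into the same DRAM row,
       so most accesses after the first are row-buffer hits. */
    double sum_stream(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += x[i];
        return s;
    }

    /* Large-stride sweep: a stride of 1024 doubles (8 KB) lands each access
       in a different row, so the row buffer rarely helps. */
    double sum_strided(void) {
        double s = 0.0;
        for (int j = 0; j < 1024; j++)
            for (int i = j; i < N; i += 1024)
                s += x[i];
        return s;
    }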

  2. (10 points) We would like to insert software prefetching requests to speed up the following array addition code.
    for (int ii = 0; ii < 1000; ii++) {
        a[ii] = b[ii] + c[ii];
    }
    
    The modified code is:
     
    for (int ii = 0; ii < 1000; ii++) {
        prefetch(a[ii + k]);
        prefetch(b[ii + k]);
        prefetch(c[ii + k]);
        a[ii] = b[ii] + c[ii];
    }
    
    prefetch(x) prefetches memory address x.
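
    In practice (an assumption; the problem treats prefetch as a primitive), prefetch(x) could be realized with GCC/Clang's __builtin_prefetch, which takes the address to prefetch:

    /* hypothetical mapping of the problem's prefetch(x) onto a real compiler builtin */
    #define prefetch(x) __builtin_prefetch(&(x))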

    (a) Find the best-performing k value (k is an integer). Assume that the memory latency (cache miss latency) is 200 cycles, that a[], b[], and c[] hold 64-bit floating-point data, and that the cache block size is 8 B. Assume that the statement a[ii] = b[ii] + c[ii] is translated into 2 LDs, 1 ST, and 1 FP add. The LD/ST cache hit latency is 3 cycles, and an FP add takes 1 cycle.

    Normally, we ignore the cost of the prefetch instructions themselves. To simplify the problem, we also ignore the cost of branch-handling instructions.
    The assembly code would be:
    load f1, b[ii]
    load f2, c[ii]
    FADD f3, f1, f2
    ST   f3, a[ii]



    If you simply assume that all instructions are serialized, the loop body takes 3 + 3 + 1 + 3 = 10 cycles, so
    k = 200/10 = 20


    If the compiler knows the machine details, it can instead calculate the maximum (critical-path) execution length.
     
    LOAD   LOAD 
      \    / 
       \  /
       FADD 
        | 
       ST 
    
    The first load and the second load can be overlapped, so the two loads together take 4 cycles (assuming a single-issue processor, the second load issues one cycle after the first).
    FADD takes 1 cycle and issuing the ST takes 1 cycle; the store does not block the processor, so we need not wait the full 3 cycles.
    Hence, the loop body takes 6 cycles.
    Prefetch distance (k) = 200/6 ≈ 33

    If the machine is a superscalar processor, the two loads issue in the same cycle, so the total execution length is 3 cycles (2 loads) + 1 cycle (FADD) + 1 cycle (ST) = 5 cycles.

    Prefetch distance (k) = 200/5 = 40
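
    In general (a small generalization of the arithmetic above), the prefetch distance is the miss latency divided by the per-iteration cycle count, rounded up so that a prefetched block arrives before the iteration that needs it:

    /* prefetch distance as ceiling division; with the serialized numbers
       above: (200 + 10 - 1) / 10 = 20 */
    enum { MISS_LATENCY = 200, LOOP_CYCLES = 10 };
    int prefetch_distance = (MISS_LATENCY + LOOP_CYCLES - 1) / LOOP_CYCLES;
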
    (b) The above code generates many prefetch requests. How can we reduce them? Show the modified source code.

    Since the data element size and the cache block size are the same (8 B), every prefetch brings in a distinct block; the only useless prefetches are those issued past the loop bound (ii + k >= 1000).
     
    int ii;
    for (ii = 0; ii < 1000 - k; ii++) {
        prefetch(a[ii + k]);
        prefetch(b[ii + k]);
        prefetch(c[ii + k]);
        a[ii] = b[ii] + c[ii];
    }

    /* epilogue: the last k iterations need no prefetches */
    for (; ii < 1000; ii++) {
        a[ii] = b[ii] + c[ii];
    }
    
    


    To increase the performance benefit, we can also add a prologue that prefetches the data for the first k iterations.
    In that case:
     
    /* prologue: warm the cache for the first k iterations
       (a separate index keeps the main loop starting at 0) */
    for (int jj = 0; jj < k; jj++) {
        prefetch(a[jj]);
        prefetch(b[jj]);
        prefetch(c[jj]);
    }

    int ii;
    for (ii = 0; ii < 1000 - k; ii++) {
        prefetch(a[ii + k]);
        prefetch(b[ii + k]);
        prefetch(c[ii + k]);
        a[ii] = b[ii] + c[ii];
    }

    for (; ii < 1000; ii++) {
        a[ii] = b[ii] + c[ii];
    }
    
    If the cache block size were 16 B instead of 8 B, we could unroll the loop so that only one prefetch is issued per two elements, eliminating the redundant prefetch requests.
     
    
    /* prologue: one prefetch per 16-B block (two 8-B elements) */
    for (int jj = 0; jj < k; jj += 2) {
        prefetch(a[jj]);
        prefetch(b[jj]);
        prefetch(c[jj]);
    }

    int ii;
    for (ii = 0; ii < 1000 - k; ii += 2) {
        prefetch(a[ii + k]);
        prefetch(b[ii + k]);
        prefetch(c[ii + k]);
        a[ii] = b[ii] + c[ii];
        a[ii + 1] = b[ii + 1] + c[ii + 1];
    }

    for (; ii < 1000; ii++) {
        a[ii] = b[ii] + c[ii];
    }
    

  3. (10 points) Show a software-pipelined version of this loop. Assume that you have an infinite number of registers.

     
    LOOP:  LD F0, 0(R1)
           Add F4, F0, F2
           Mul F5, F4, F3
           Store F5, 0(R1)
           Add R1, R1, #-8
           Br R1, R2, LOOP
    
    


    You may omit the start-up and clean-up code.
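
    For reference, the loop computes roughly the following (a hypothetical C rendering; R1 walks an array of doubles downward until it reaches R2, and F2/F3 hold loop-invariant values):

    /* start plays the role of R1 (highest element), stop the role of R2;
       f2 and f3 are loop-invariant. */
    void scale_add(double *start, double *stop, double f2, double f3) {
        for (double *p = start; p != stop; p--)
            *p = (*p + f2) * f3;
    }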

    Without start-up/clean-up code:
    LOOP:  Store F5, 0(R1)
           Mul F5, F4, F3
           Add F4, F0, F2
           LD F0, -24(R1)
           Add R1, R1, #-8
           Br R1, R2, LOOP
    
    
    With all start-up/clean-up code:

           LD F0, 0(R1)
           Add F4, F0, F2
           Mul F5, F4, F3
           LD F0, -8(R1)
           Add F4, F0, F2
           LD F0, -16(R1)
    LOOP:  Store F5, 0(R1)
           Mul F5, F4, F3
           Add F4, F0, F2
           LD F0, -24(R1)
           Add R1, R1, #-8
           Br R1, R2, LOOP

           Store F5, -8(R1)
           Mul F5, F4, F3
           Store F5, -16(R1)
           Add F4, F0, F2
           Mul F5, F4, F3
           Store F5, -24(R1)
    
    
     
    
    
    



  4. (5 points) Discuss why a checkpoint-based branch misprediction recovery mechanism can provide faster recovery than a ROB-based mechanism. Explain with an example.
    A ROB-based mechanism must wait for the mispredicted branch to reach the head of the ROB before it can recover. Once the branch reaches the head, all instructions
    after it are flushed and the RAT is reset to point to the ARF.

    With checkpoint-based recovery, each branch prediction creates a new copy of the RAT. On a misprediction, the wrong-path instructions are flushed and the RAT
    checkpoint taken at the mispredicted branch is restored.

    Thus with checkpointing, recovery can begin as soon as the misprediction is detected, whereas with the ROB we must wait for the branch to retire before execution can resume.
    
    For example, consider the following sequence of instructions.
    
    I1: Add R1, R4, R5
    I2: Mul R2, R3, R6
    I3: BEQZ R2, End
    I4: Sub R1, R4, R5
    I5: Div R2, R3, R6
    I6: Halt
    
    End:
    I7: Add R4, R2, R5
    I8: Sub R3, R1, R6
    I9: Halt
    
    If BEQZ is mispredicted, then with ROB-based recovery I7 and I8 must wait for I1 and I2 to retire (so that the branch reaches the ROB head and the RAT becomes valid again) before execution resumes. If retiring I1 and I2 takes
    3 cycles, the misprediction causes a 3-cycle stall.
    With checkpointing, however, the RAT is restored to its checkpointed state as soon as the misprediction is detected, and I7 and I8 need not wait for the
    branch to retire. Hence, checkpointing is faster.
    



  5. (5 points) Summarize what kinds of data structures are a good match for stream, stride, Markov, and CDP prefetchers. Which hardware prefetcher requires the most hardware?

    Stream, stride: arrays and other sequential data structures.
    Markov, CDP: pointer-based structures such as linked lists (see the sketch below).
    A hardware Markov prefetcher would require the most resources, since it must keep a correlation table mapping miss addresses to their likely successors.
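
    For instance (an illustrative sketch, not part of the original answer), the pointer-chasing loop below has no fixed stride, so stream/stride prefetchers cannot help; a Markov prefetcher can learn the recurring miss-address sequence, and CDP can prefetch the pointer values it finds in fetched cache lines:

    struct node {
        int          val;
        struct node *next;   /* next address is known only after this load */
    };

    /* Each iteration's address depends on the previous load, and
       heap-allocated nodes rarely sit a constant stride apart. */
    int sum_list(struct node *head) {
        int s = 0;
        for (struct node *n = head; n != NULL; n = n->next)
            s += n->val;
        return s;
    }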



  6. (5 points) Some applications get much better performance from FR-FCFS memory scheduling than from FCFS memory scheduling. What kinds of applications could these be? (What characteristics are useful?)

    An application whose memory access sequence produces frequent row-buffer hits performs well under FR-FCFS. A good case to analyze is an application with strong spatial locality in its accesses, mixed with a few stray accesses to unrelated addresses.
    When most requests target the currently open row, FR-FCFS serves those row hits first and defers the stray requests, while FCFS serves requests strictly in arrival order and repeatedly closes the open row. A streaming application is a good example; the sketch below makes the difference concrete.
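
    A toy model (a minimal sketch under assumed timings: one bank, open-row policy, 1-cycle row hit, 3-cycle row miss; these numbers are illustrative, not from the problem):

    #include <stdio.h>

    #define NREQ 8
    #define HIT_COST  1
    #define MISS_COST 3

    /* Mostly-streaming request pattern: row 7 repeatedly, with two strays. */
    static const int rows[NREQ] = { 7, 7, 7, 3, 7, 7, 9, 7 };

    /* FCFS: serve requests strictly in arrival order. */
    static int fcfs(void) {
        int open = -1, cost = 0;
        for (int i = 0; i < NREQ; i++) {
            cost += (rows[i] == open) ? HIT_COST : MISS_COST;
            open = rows[i];
        }
        return cost;
    }

    /* FR-FCFS: among pending requests, serve the oldest row hit first;
       if none hits the open row, serve the oldest request. */
    static int frfcfs(void) {
        int done[NREQ] = { 0 }, open = -1, cost = 0;
        for (int served = 0; served < NREQ; served++) {
            int pick = -1;
            for (int i = 0; i < NREQ; i++)          /* oldest row hit */
                if (!done[i] && rows[i] == open) { pick = i; break; }
            if (pick < 0)
                for (int i = 0; i < NREQ; i++)      /* else oldest request */
                    if (!done[i]) { pick = i; break; }
            cost += (rows[pick] == open) ? HIT_COST : MISS_COST;
            open = rows[pick];
            done[pick] = 1;
        }
        return cost;
    }

    int main(void) {
        /* prints "FCFS: 18 cycles  FR-FCFS: 14 cycles" for this pattern */
        printf("FCFS: %d cycles  FR-FCFS: %d cycles\n", fcfs(), frfcfs());
        return 0;
    }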



  7. (5 points) Compare and contrast superscalar architectures and VLIW architectures.
    
    	VLIW                                            Superscalar
    	----                                            -----------
    	Issues more than one instruction per cycle      Issues more than one instruction per cycle

    	Instructions combined statically by             Instructions grouped dynamically by
    	the compiler                                    the hardware

    	Static issue                                    Dynamic issue

    	Optimized by the compiler                       Optimized at run-time in hardware

    	Cannot react to variable latencies              Can react to latencies in caches, memory, etc.

    	Larger view of the program may lead             Smaller window for optimization
    	to better optimization

    	Less complex hardware, less power               More complex hardware, more power
    	consumed
    



    (10 points) For a VLIW machine, the compiler has to find independent instructions that can be executed in the same cycle. Suppose that there is a 3-wide VLIW machine (i.e., there are three instruction slots). In the first and second slots, the processor can execute any type of instruction. In the third slot, the processor can execute all instructions except loads and stores. Arrange the following code to maximize performance. The processor is an in-order processor. The compiler needs to insert NOPs if it cannot find an instruction for a slot.

    I1: 0xa000 ADD R0, R0, R2
    I2: 0xa004 ADD R1, R1, R0
    I3: 0xa008 FADD F3, F2, F3
    I4: 0xa00c ADD R1, R1, R0
    I5: 0xa010 FADD F3, F2, F3
    I6: 0xa014 LD R2, MEM[R6]
    I7: 0xa018 ADD R1, R1, R0
    I8: 0xa01c FMUL F1, F5, F4
    I9: 0xa020 LD F2, MEM[R0]
    IA: 0xa024 FADD F4, F1, F2
    IB: 0xa028 LD F3, MEM[R0]
    IC: 0xa02c STORE MEM[R5], F4
    ID: 0xa030 FADD F4, F3, F4
    IE: 0xa034 ADD R2, R2, R0
    IF: 0xa038 BR R2, 0x8000
    
    ADD: simple ALU instruction (1-cycle latency) 
    FADD: floating point instruction (1-cycle latency) 
    FMUL: floating point instruction (2-cycle latency) 
    LD: load instruction (2-cycle latency) 
    STORE: store instruction (1-cycle latency) 
    BR: branch instruction (1-cycle latency) 
    
    Rename I9:F2 to F6
           IA:F2 to F6
           IB:F3 to F7
           ID:F3 to F7
    
    Now the VLIW schedule looks like this (each row is one cycle; note that I8's 2-cycle FMUL and the 2-cycle loads I6, I9, and IB all produce their results in time for their consumers IA, IE, and ID):
    
    	 -----------------------------------------------
    	|	I1	|	I6	|	I8	|
    	 -----------------------------------------------
    	|	I9	|	IB	|	I2	|
    	 -----------------------------------------------
    	|	I4	|	I3	|	IE	|
    	 -----------------------------------------------
    	|	I7	|	IA	|	I5	|
    	 -----------------------------------------------
    	|	IC	|	ID	|	IF	|
    	 -----------------------------------------------
    
    
    



  8. (10 points) For a VLIW machine, the compiler has to find independent instructions that can be executed in the same cycle. Suppose that there is a 2-wide VLIW machine (i.e., there are two instruction slots). In the first slot, the processor can execute any type of instruction. In the second slot, the processor can execute only loads and stores. Arrange the following code to maximize performance. The processor is an in-order processor. The compiler needs to insert NOPs if it cannot find an instruction for a slot.
     
    0xa000 ADD R0, R0, R2
    0xa004 ADD R1, R1, R0
    0xa008 FADD R2, R2, F3
    0xa00c FMUL R1, R1, R4
    0xa010 STORE MEM[R2], R3
    0xa014 FADD R3, F3, R4
    0xa018 BR R3, 0x8000
    
    ADD: simple ALU instruction
    FADD: floating point instruction
    FMUL: floating point instruction
    STORE: store instruction
    BR: branch instruction
     
    ADD R0, R0, R2    : NOP
    ADD R1, R1, R0    : NOP
    FADD R2, R2, F3   : NOP
    FMUL R1, R1, R4   : STORE MEM[R2], R3
    FADD R3, F3, R4   : NOP
    BR R3, 0x8000     : NOP
    
    
  9. (A better question) For a VLIW machine, the compiler has to find independent instructions that can be executed in the same cycle. Suppose that there is a 2-wide VLIW machine (i.e., there are two instruction slots). In the first slot, the processor can execute any type of instruction. In the second slot, the processor can execute only integer and floating-point operations. Arrange the following code to maximize performance. The processor is an in-order processor. The compiler needs to insert NOPs if it cannot find an instruction for a slot.
     
    0xa000 ADD R0, R0, R5
    0xa004 ADD R1, R1, R0
    0xa008 FADD R2, R2, F3
    0xa00c FMUL R1, R1, R4
    0xa010 STORE MEM[R2], R3
    0xa014 FADD R3, F3, R4
    0xa018 BR R3, 0x8000
    
    ADD: simple ALU instruction
    FADD: floating point instruction
    FMUL: floating point instruction
    STORE: store instruction
    BR: branch instruction
     
    ADD R0, R0, R5    : FADD R2, R2, F3
    ADD R1, R1, R0    : NOP
    STORE MEM[R2], R3 : FMUL R1, R1, R4
    FADD R3, F3, R4   : NOP
    BR R3, 0x8000     : NOP
    
    

    The following questions won't be graded, so you do not need to turn them in. They are provided to help you understand the material. Some questions may have more than one answer.


  10. (no grading) Discuss why a checkpoint-based branch misprediction recovery mechanism can provide faster recovery than a ROB-based mechanism. Explain with an example.


  11. (no grading) What is the memory disambiguation problem? One solution is to make every load wait until the processor knows the addresses of all older store instructions. A load-store dependency predictor is a more aggressive solution, and surprisingly it works well. Depending on the application and the ISA, the number of loads that depend on nearby stores varies significantly. Discuss why.


  12. (no grading) The benefits of out-of-order execution are increased ILP and MLP. A highly optimizing compiler can also improve ILP and MLP. Discuss how compilers improve ILP and MLP. What are the limitations of static (compile-time) scheduling? The answer is not simply "the compiler does not have run-time information"; using profiling, the compiler can obtain a lot of run-time information.


  13. (no grading) Many x86 instructions are broken into multiple RISC-type uops when they are executed inside the processor. If one of those uops generates an exception, how should it be handled?