School of Computer Science

Georgia Institute of Technology

CS 4290/CS 6290 HPCA, Fall 2010


Homework Assignment #3
Due: Tuesday, November 9, before class (no late submissions)
Hyesoon Kim, Instructor

This is an individual assignment. You may discuss it with other classmates, but you must complete your solutions individually.

  1. (5 points) The DRAM memory access time measured within an application on a modern processor varies significantly. Discuss reasons why the DRAM memory access time can vary so much. Assume that the CPU frequency and the DRAM frequency do not change at run-time.


  2. (10 points) We would like to insert software prefetch requests to speed up the following matrix-addition code:

    for (int ii = 0; ii < 1000; ii++) {
        a[ii] = b[ii] + c[ii];
    }

    The modified code is:

    for (int ii = 0; ii < 1000; ii++) {
        prefetch(a[ii+k]);
        prefetch(b[ii+k]);
        prefetch(c[ii+k]);
        a[ii] = b[ii] + c[ii];
    }

    prefetch(x) prefetches the memory address of x.

    (a) Find the best-performing k value (k is an integer). Assume that the memory latency (cache-miss latency) is 200 cycles, a[], b[], and c[] hold 64-bit floating-point data, and the cache block size is 16 B. Assume that the statement a[ii] = b[ii] + c[ii] is translated into 2 loads, 1 store, and 1 FP add. The LD/ST cache-hit latency is 3 cycles, and an FP add takes 1 cycle.

    (b) The above code generates many prefetch requests. How can we reduce their number? Show the modified source code.
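    One common direction, sketched here as a hedged illustration (the prefetch distance K = 100 is an assumed placeholder, not the answer to part (a), and the GCC/Clang intrinsic __builtin_prefetch stands in for prefetch()): since one 16 B cache block holds two 64-bit elements, unrolling the loop by two lets a single prefetch per array cover two iterations, halving the request count.

    ```c
    #include <stdio.h>

    #define N 1000
    #define K 100   /* assumed prefetch distance; derive the real one in (a) */

    /* Arrays are padded by K elements so prefetch addresses stay in bounds. */
    static double a[N + K], b[N + K], c[N + K];

    int main(void) {
        for (int ii = 0; ii < N; ii++) { b[ii] = ii; c[ii] = 2.0 * ii; }

        /* One 16 B block holds two 8 B doubles, so unroll by 2 and issue
           one prefetch per array every other iteration. */
        for (int ii = 0; ii < N; ii += 2) {
            __builtin_prefetch(&a[ii + K]);
            __builtin_prefetch(&b[ii + K]);
            __builtin_prefetch(&c[ii + K]);
            a[ii]     = b[ii]     + c[ii];
            a[ii + 1] = b[ii + 1] + c[ii + 1];
        }

        printf("a[0]=%g a[%d]=%g\n", a[0], N - 1, a[N - 1]);
        return 0;
    }
    ```

    The prefetches are hints, so the transformed loop computes exactly the same results as the original.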

  3. (5 points) Discuss why a checkpoint-based branch misprediction recovery mechanism can provide faster recovery than a ROB-based mechanism. Explain with an example.
  4. (5 points) Summarize what kinds of data structures are well suited to stream, stride, Markov, and CDP prefetchers. Which hardware prefetcher requires the largest amount of hardware?
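    As a concrete illustration of the access patterns involved (a hedged sketch with arbitrary loop bounds and values): a sequential array walk has a constant address stride, which stream/stride prefetchers capture easily, while a linked-list traversal produces a pointer-chasing address stream where each address is known only after the previous load returns.

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    /* A node for the pointer-chasing case. */
    struct node { int value; struct node *next; };

    int main(void) {
        /* Stride-friendly: consecutive addresses arr, arr+4, arr+8, ... */
        int arr[8], sum_arr = 0;
        for (int i = 0; i < 8; i++) arr[i] = i;
        for (int i = 0; i < 8; i++) sum_arr += arr[i];

        /* Pointer-chasing: the next address comes out of the previous load,
           so a stride prefetcher cannot predict it. */
        struct node *head = NULL;
        for (int i = 7; i >= 0; i--) {
            struct node *n = malloc(sizeof *n);
            n->value = i;
            n->next = head;
            head = n;
        }
        int sum_list = 0;
        for (struct node *p = head; p != NULL; ) {
            struct node *next = p->next;
            sum_list += p->value;
            free(p);
            p = next;
        }

        printf("sum_arr=%d sum_list=%d\n", sum_arr, sum_list);
        return 0;
    }
    ```

    Both loops compute the same sum; only the address streams they generate differ, which is what distinguishes the prefetchers.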

  5. (5 points) Some applications achieve much better performance with FR-FCFS memory scheduling than with FCFS memory scheduling. What kinds of applications could these be? (What application characteristics make FR-FCFS beneficial?)
  6. (10 points) Show a software-pipelined version of this loop. Assume that you have an infinite number of registers.

     
    LOOP:  LD    F0, 0(R1)
           ADD   F4, F0, F2
           MUL   F5, F4, F3
           STORE F5, 0(R1)
           ADD   R1, R1, #-8
           BR    R1, R2, LOOP
    
    


    You may omit the start-up and clean-up code.
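    To see the shape of the transformation (a C analogue, not the assembly answer; N and the operand values F2 and F3 are assumed for illustration), each steady-state iteration below overlaps three loop iterations: the MUL/STORE of iteration i, the ADD of iteration i+1, and the LD of iteration i+2.

    ```c
    #include <stdio.h>

    #define N 8

    int main(void) {
        double x[N];
        double F2 = 1.0, F3 = 2.0;     /* loop-invariant operands (assumed) */
        for (int i = 0; i < N; i++) x[i] = i;

        /* Original body per iteration: LD x[i]; ADD +F2; MUL *F3; ST x[i]. */
        double f4 = x[0] + F2;          /* prologue: LD + ADD for iteration 0 */
        double f0 = x[1];               /* prologue: LD for iteration 1 */
        for (int i = 0; i < N - 2; i++) {
            x[i] = f4 * F3;             /* MUL + ST for iteration i   */
            f4   = f0 + F2;             /* ADD for iteration i+1      */
            f0   = x[i + 2];            /* LD  for iteration i+2      */
        }
        x[N - 2] = f4 * F3;             /* epilogue: finish last two iterations */
        x[N - 1] = (f0 + F2) * F3;

        for (int i = 0; i < N; i++) printf("%g ", x[i]);
        printf("\n");
        return 0;
    }
    ```

    Every element ends up as (x[i] + F2) * F3, exactly what the unpipelined loop computes; only the schedule changes, which is what lets independent operations from different iterations share a cycle.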




  7. (5 points) Compare and contrast superscalar architecture and VLIW architecture.

  8. (This problem has been updated.)
    (10 points) For a VLIW machine, the compiler has to find independent instructions that can be executed in the same cycle. Suppose that there is a 3-wide VLIW machine (i.e., there are three instruction slots). In the first and second slots, the processor can execute any type of instruction. In the third slot, the processor can execute all instructions except loads and stores. Arrange the following code to maximize performance. The processor is an in-order processor. The compiler must insert NOPs if it cannot find an instruction for a slot.
     
    0xa000 ADD   R0, R0, R2
    0xa004 ADD   R1, R1, R0
    0xa008 FADD  F3, F2, F3
    0xa00c ADD   R1, R1, R0
    0xa010 FADD  F3, F2, F3
    0xa014 LD    R2, MEM[R6]
    0xa018 ADD   R1, R1, R0
    0xa01c FMUL  F1, F5, F4
    0xa020 LD    F2, MEM[R0]
    0xa024 FADD  F4, F1, F2
    0xa028 LD    F3, MEM[R0]
    0xa02c STORE MEM[R5], F4
    0xa030 FADD  F4, F3, F4
    0xa034 ADD   R2, R2, R0
    0xa038 BR    R2, 0x8000
    
    ADD: simple ALU instruction (1-cycle latency)
    FADD: floating point instruction (1-cycle latency)
    FMUL: floating point instruction (2-cycle latency)
    LD: load instruction (2-cycle latency)
    STORE: store instruction (1-cycle latency)
    BR: branch instruction (1-cycle latency)
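    The slot constraint can be stated mechanically, as in this hedged helper sketch (the function names are illustrative, not part of the assignment): a bundle is legal only if its third slot holds no load or store; any slot may hold a NOP.

    ```c
    #include <assert.h>
    #include <string.h>

    /* Returns nonzero if the opcode is a memory operation. */
    static int is_memory_op(const char *op) {
        return strcmp(op, "LD") == 0 || strcmp(op, "STORE") == 0;
    }

    /* Slots 1 and 2 accept any instruction; slot 3 accepts everything
       except loads and stores. */
    static int bundle_is_legal(const char *slot1, const char *slot2,
                               const char *slot3) {
        (void)slot1;
        (void)slot2;
        return !is_memory_op(slot3);
    }

    int main(void) {
        assert(bundle_is_legal("LD", "STORE", "FADD"));  /* legal bundle */
        assert(!bundle_is_legal("ADD", "ADD", "LD"));    /* needs a NOP in slot 3 */
        return 0;
    }
    ```

    When scheduling, any cycle that would place a third memory operation in slot 3 must instead carry a NOP there and push that instruction to a later bundle.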

    The following questions won't be graded, so you do not need to turn them in. They are provided to help you understand the material, and they may have more than one answer.





  9. (no grading) What is the memory disambiguation problem? One solution is to make every load wait until the processor knows the addresses of all older store instructions. A load-store dependency predictor is a more aggressive solution, and surprisingly it works well. Depending on the application and the ISA, the number of loads that depend on nearby stores varies significantly. Discuss why.
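    The core of the problem can be seen in a few lines of C (a hedged illustration; the function and variable names are invented for this sketch): whether the load may execute before the older store depends entirely on whether the two pointers alias, which the hardware cannot know until both addresses are computed.

    ```c
    #include <assert.h>

    /* The load of *q depends on the store to *p only when q == p. */
    int store_then_load(int *p, int *q, int v) {
        *p = v;        /* older store */
        return *q;     /* younger load */
    }

    int main(void) {
        int x = 0, y = 7;
        assert(store_then_load(&x, &x, 5) == 5);  /* aliasing: load must wait   */
        assert(store_then_load(&x, &y, 5) == 7);  /* no alias: load could bypass */
        return 0;
    }
    ```

    A dependence predictor gambles on the no-alias case and recovers when the gamble loses, which pays off when, as here, most load/store pairs touch different addresses.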


  10. (no grading) The benefits of out-of-order execution are increased ILP and MLP. A highly optimizing compiler can also improve ILP and MLP. Discuss how compilers improve ILP and MLP. What are the limitations of static (compile-time) scheduling? The answer is not simply "the compiler does not have run-time information"; using profiling, the compiler can obtain a great deal of run-time information.


  11. (no grading) Many x86 instructions are broken into multiple RISC-style uops when they are executed inside a processor. If one of the uops generates an exception, how should the processor handle it?