(10 points) We would like to insert software prefetch requests to speed up the following array addition code.
for (int ii = 0; ii < 1000; ii++) {
    a[ii] = b[ii] + c[ii];
}
The modified code is:
for (int ii = 0; ii < 1000; ii++) {
    prefetch(a[ii+k]);
    prefetch(b[ii+k]);
    prefetch(c[ii+k]);
    a[ii] = b[ii] + c[ii];
}
prefetch(x) prefetches memory address x.
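For readers who want to try the modified loop on a real compiler, the following is a minimal sketch, assuming GCC or Clang, where `__builtin_prefetch` takes a pointer to the location to prefetch. The wrapper `prefetch`, the function name `add_with_prefetch`, and the macro `N` are hypothetical names chosen for illustration. Unlike the problem's code, the sketch passes addresses (`&a[ii + k]`) and skips the prefetch once `ii + k` runs past the end of the arrays, so that no out-of-range address is formed.

```c
#include <assert.h>
#include <stddef.h>

#define N 1000  /* array length from the problem statement */

/* Hypothetical wrapper: maps the problem's prefetch(x) onto the
 * GCC/Clang builtin, which expects an address. */
static inline void prefetch(const void *addr) {
    __builtin_prefetch(addr);
}

/* Adds b and c into a, prefetching k elements ahead of the current
 * iteration. The prefetch distance k is left as a parameter. */
void add_with_prefetch(double *a, const double *b, const double *c, size_t k) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t ii = 0; ii < N; ii++) {
        if (ii + k < N) {  /* assumption: stay inside the arrays */
            prefetch(&a[ii + k]);
            prefetch(&b[ii + k]);
            prefetch(&c[ii + k]);
        }
        a[ii] = b[ii] + c[ii];
    }
}
```

A prefetch is only a hint, so the result of the addition is the same for any k; only the timing changes.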
(a) Find the best-performing value of k (k is an integer). Assume
that the memory latency (cache miss latency) is 200 cycles, that
a[], b[], and c[] are arrays of 64-bit floating-point values, and that
the cache block size is 16B. Assume that the statement a[ii] = b[ii] + c[ii] is translated into 2 LDs, 1 ST, and 1 FP add. The LD/ST cache hit latency is 3 cycles, and the FP add takes 1 cycle.
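The quantities in part (a) can be related with a small helper. This is only a sketch of the commonly used rule that the prefetch distance should cover the miss latency, that is, k = ceil(miss latency / cycles per loop iteration). The function names are hypothetical, and the number of cycles per iteration is deliberately left as an input, because it depends on how one assumes the loads, the add, the store, and the prefetches overlap.

```c
#include <assert.h>

/* Number of array elements that share one cache block. */
unsigned elements_per_block(unsigned block_bytes, unsigned element_bytes) {
    assert(element_bytes > 0);
    return block_bytes / element_bytes;
}

/* Smallest integer k such that k iterations take at least
 * miss_latency_cycles, i.e. ceil(latency / cycles_per_iteration). */
unsigned prefetch_distance(unsigned miss_latency_cycles,
                           unsigned cycles_per_iteration) {
    assert(cycles_per_iteration > 0);
    return (miss_latency_cycles + cycles_per_iteration - 1)
           / cycles_per_iteration;
}
```

With the values given in the problem, elements_per_block(16, 8) is 2, since each 64-bit element occupies 8 bytes. The per-iteration cycle count passed to prefetch_distance must come from your own analysis of the loop body.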
(b) The above code generates many extra prefetch requests. How can we reduce them? Show the modified source code.