;-*- Text -*-
; 3-Jun-98
;
; RCS: $Id$
;
notes for Monday, 1-Jun-98

Caches.

Reading: chapter 7

Topics:
  1. performance
  2. simulating cache action for a direct-mapped cache via a
     reference stream
  3. split instruction and data caches (an aside)
  4. blocking

----------------

1. Performance.

The goal of the cache is to have the processor see a memory speed
near that of the small, fast (but expensive) SRAM chips used for the
cache, with the capacity of the large, cheap (but slow) DRAM chips
used in main memory.

The performance in terms of cost is easy: add up the amount of money
you spent for all the chips (main memory DRAMs plus cache SRAMs).

The performance in terms of speed is usually given as the effective
memory access time:

    Teffective = Tcache + (1 - h) * Tmemory

where Teffective is the effective memory access time, Tcache is the
time to get an item of data out of the cache, Tmemory is the time to
get an item of data out of main memory, and h is the "hit rate", the
fraction of memory references that turn out to be in the cache.

The intuition behind this equation is this:

  (a) every memory reference goes to the cache, so we have to wait at
      least as long as the cache time, Tcache.

  (b) references that do not hit in the cache ("misses") must also
      pay the time to go to memory, Tmemory.

Suppose Tcache is 1 cycle, Tmemory is 100 cycles and h is 0.90 (90%).
Then:

    Teffective = 1 + (1 - 0.90) * 100
               = 1 + 10
               = 11 cycles

That's pretty good! By adding the cache (probably less than double
the cost of memory), we've reduced the average memory access time
from 100 cycles down to 11. That is, *if* the hit rate turns out to
be 90%.

2. Hit Rate & simulator.

So, we're extremely interested in hit rates and, in particular, how
the organization of the cache affects the hit rate.

The best way to understand how and why hits & misses happen is to
simulate the action of a real cache on a real program. Figure 7.6 in
the book shows one simulation.
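
To see why the hit rate matters so much, here is a small sketch (in
Python, which is not the language used elsewhere in these notes) that
plugs a few hit rates into the effective access time formula from
part 1. The function name is made up for illustration, and the hit
rates other than 0.90 are sample values, not measurements.

```python
def effective_access_time(t_cache, t_memory, hit_rate):
    """Teffective = Tcache + (1 - h) * Tmemory, with all times in cycles."""
    return t_cache + (1.0 - hit_rate) * t_memory


# Tcache = 1 cycle and Tmemory = 100 cycles, as in the example in part 1.
# Only h = 0.90 comes from the notes; 0.50 and 0.99 are illustrative.
for h in (0.50, 0.90, 0.99):
    t = effective_access_time(1, 100, h)
    print(f"h = {h:.2f}: {t:.1f} cycles")
```

With these numbers, each additional percentage point of hit rate
removes one cycle from the average access.
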
We talked about the loop used as the running example in the projects,
modified so that it walks all of memory in an infinite loop (ignore
what happens when the address wraps around and walks over the
instructions themselves :-):

            lw   0 1 one     ! load R1 with 1 (a handy constant)
            add  0 0 3       ! for (index = 0; ...
    loop:   lw   3 4 data    !   temp = data[index]
            nand 4 4 4       !   temp = (~temp
            add  4 1 4       !           + 1)  (i.e. temp = - temp)
            sw   3 4 data    !   data[index] = temp
            add  3 1 3       ! ... index++)  (i.e. end of for loop)
            beq  0 0 loop    ! branch-always to the top of the loop

Assuming the instructions start at location 0 in memory and the data
starts at location 0x1000, you see the following "string" of memory
references from this program. Note that the string includes both
instruction fetches and data read/write accesses (the one-time data
read of "one" by the first lw is left out of the listing):

          0x0       miss
          0x1       miss
          0x2       miss
          0x1000    miss
          0x3       miss
          0x4       miss
          0x5       miss
          0x1000    hit!   0x1000 was already in the cache!
          0x6       miss
          0x7       miss
     /--> 0x2       hit!   0x2 was already in the cache!
     |    0x1001    miss
     |    0x3       hit!
     |    0x4       hit!
     |    0x5       hit!
     |    0x1001    hit!
     |    0x6       hit!
     \--> 0x7       hit!
          0x2       hit!
          0x1002    miss
          [...]

After the first iteration of the loop, every access to memory hits in
the cache except the data access caused by LW. We can ignore the
first iteration of the loop for the purposes of computing the hit
rate, since that one iteration is inconsequential compared to the
large (infinite) number of times we'll run the loop.

The hit rate for this program is then 7/8: there are 8 accesses in
the loop (6 instruction, 2 data), of which 7 hit (all but one data
access). If the cache takes 1 cycle and memory takes 100, then the
effective memory access time is:

    1 + (1 - 7/8) * 100 = 13.5 cycles

3. Split instruction and data caches (an aside).

Processors often have more than one cache. These can be arranged
hierarchically (if I miss in one, then try the next one) or by
traffic type (instructions try one, data try another). The following
configuration is very common in contemporary systems (e.g.
the Sun Ultra 1 that I'm typing this on):

                 "primary"             "secondary"
                 or "level 1"          or "level 2"
                 caches                cache
                 (typically            (typically
                 8K-64K)               256K-4M)

                 /-- L1 data ----\
                 |               |
    processor -->|               |----> L2 unified ---> memory
                 |               |
                 \-- L1 instr. --/

where the L1 caches are on the same chip as the processor and the L2
cache is implemented externally on other chips.

Splitting the L1 into data and instruction parts (probably) reduces
the hit rate compared to a single, "unified" L1 cache, but has the
advantage that both caches may be accessed in a single cycle. This
split-cache trick is how the memory in the pipelined LC can
effectively be "dual-ported" without requiring real, dual-ported
memory.

If we re-think the test loop in part (2) above using split L1
instruction and data caches, then we can talk about the hit rates of
the instructions and data separately. The instruction hit rate is
100%. The data hit rate is 50% (every other data reference misses).

From here on, I'll just talk about data hit rates since they're more
interesting...

4. Effect of blocking.

The data hit rate for the program in part (2) was 50% with a
direct-mapped cache with one data word per tag. Suppose we employ a
blocked cache, e.g. one with a total size of 64K words and a blocking
factor of four (4 data words per tag). This new cache is like Figure
7.10 in the book, except that we address words, not bytes, so there's
no 2-bit byte offset and the index is 2 bits larger.

What's the hit rate of this new cache? Here's the reference stream
(data references only):

    0x1000    miss
    0x1000    hit!   used last time
    0x1001    hit!   part of the *block* used last time!
    0x1001    hit!
    0x1002    hit!
    0x1002    hit!
    0x1003    hit!
    0x1003    hit!
    0x1004    miss   new block
    0x1004    hit!
    0x1005    hit!
    0x1005    hit!
    [...]

Blocking with a block size of four effectively "prefetches" the next
three items when we miss on the first. The data hit rate is improved
from 1/2 to 7/8.
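
Both data hit rates (1/2 with one word per tag, 7/8 with a blocking
factor of four) can be checked with a small simulation. This is only
a sketch in Python under stated assumptions: a direct-mapped cache of
64K words, data starting at 0x1000, and a finite run of 1024 loop
iterations standing in for the infinite loop. The function names are
made up for illustration.

```python
def simulate(addresses, num_words=64 * 1024, block_size=1):
    """Direct-mapped cache; returns one boolean per access (True = hit)."""
    num_blocks = num_words // block_size
    tags = [None] * num_blocks        # one tag per cache block, None = empty
    results = []
    for addr in addresses:
        block = addr // block_size    # memory block holding this word
        index = block % num_blocks    # cache slot that block maps to
        tag = block // num_blocks     # distinguishes blocks sharing a slot
        hit = tags[index] == tag
        if not hit:
            tags[index] = tag         # a miss fetches the whole block
        results.append(hit)
    return results


def data_stream(start=0x1000, iterations=1024):
    """Data references of the loop: one lw and one sw per array element."""
    stream = []
    for i in range(iterations):
        stream += [start + i, start + i]
    return stream


def hit_rate(results):
    return sum(results) / len(results)


stream = data_stream()
print(hit_rate(simulate(stream, block_size=1)))   # 0.5
print(hit_rate(simulate(stream, block_size=4)))   # 0.875
```

The total cache size stays fixed, so a larger block size means fewer
blocks (and fewer tags), which is why num_blocks is derived from
block_size.
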
This blocking trick makes use of spatial locality: we expect (having
observed a lot of programs) that programs will use memory
sequentially, so that data words in a block will often be used
together.

On the other hand, if the program were changed slightly so that it
stepped through the array by increments of four:

    0x1000    miss
    0x1000    hit!
    0x1004    miss   new block
    0x1004    hit!
    0x1008    miss
    0x1008    hit!
    0x100c    miss
    0x100c    hit!

then we're back where we started: blocking isn't helping. In fact,
since blocking means that we have to fetch four words on every miss,
the system will run a little slower than one that had no blocking.

Happily, the sequential case is more common.
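
The effect of the stride can be sketched in a few lines of Python.
This is a deliberately simplified model, not a full cache: it
remembers only the most recently fetched block, which is enough here
because the addresses only increase and an old block is never
revisited. The function name, the start address of 0x1000 and the
iteration count are assumptions made for illustration.

```python
def data_hit_rate(stride, block_size, iterations=4096, start=0x1000):
    """Hit rate for a loop that does one lw and one sw per element visited.

    Simplification: only the last block fetched is remembered, which is
    valid because this address stream never returns to an earlier block.
    """
    hits = 0
    accesses = 0
    last_block = None
    for i in range(iterations):
        addr = start + i * stride
        for _ in range(2):                 # the lw, then the sw
            block = addr // block_size
            if block == last_block:
                hits += 1
            else:
                last_block = block         # miss: fetch the new block
            accesses += 1
    return hits / accesses


print(data_hit_rate(stride=1, block_size=4))   # 0.875 (sequential walk)
print(data_hit_rate(stride=4, block_size=4))   # 0.5   (one element per block)
print(data_hit_rate(stride=4, block_size=1))   # 0.5   (no blocking)
```

With a stride of four, the blocked and unblocked caches have the same
hit rate, so the blocked cache pays for the larger fetch on each miss
without getting any extra hits in return.
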