GRE CS Subject Test Notes


  • Compiler optimization

  • Cache mapping and associativity

    A very important factor in determining the effectiveness of the level 2 cache relates to how the cache is mapped to the system memory. What this means in brief is that there are many different ways to allocate the storage in our cache to the memory addresses it serves. Let's take as an example a system with 512 KB of L2 cache and 64 MB of main memory. The burning question is: how do we decide how to divvy up the 16,384 address lines in our cache amongst the "huge" 64 MB of memory?

    There are three different ways that this mapping can generally be done. The choice of mapping technique is so critical to the design that the cache is often named after this choice:

    • Direct Mapped Cache: The simplest way to allocate the cache to the system memory is to determine how many cache lines there are (16,384 in our example) and just chop the system memory into the same number of chunks. Then each chunk gets the use of one cache line. This is called direct mapping. So if we have 64 MB of main memory addresses, each cache line would be shared by 4,096 memory addresses (64 M divided by 16 K).
    • Fully Associative Cache: Instead of hard-allocating cache lines to particular memory locations, it is possible to design the cache so that any line can store the contents of any memory location. This is called fully associative mapping.
    • N-Way Set Associative Cache: "N" here is a number, typically 2, 4, 8 etc. This is a compromise between the direct mapped and fully associative designs. In this case the cache is broken into sets where each set contains "N" cache lines, let's say 4. Then, each memory address is assigned a set, and can be cached in any one of those 4 locations within the set that it is assigned to. In other words, within each set the cache is associative, and thus the name.
      This design means that there are "N" possible places that a given memory location may be in the cache. The tradeoff is that there are "N" times as many memory locations competing for the same "N" lines in the set. Let's suppose in our example that we are using a 4-way set associative cache. So instead of a single block of 16,384 lines, we have 4,096 sets with 4 lines in each. Each of these sets is shared by 16,384 memory addresses (64 M divided by 4 K) instead of 4,096 addresses as in the case of the direct mapped cache. So there is more to share (4 lines instead of 1) but more addresses sharing it (16,384 instead of 4,096).

    Conceptually, the direct mapped and fully associative caches are just "special cases" of the N-way set associative cache. You can set "N" to 1 to make a "1-way" set associative cache. If you do this, then there is only one line per set, which is the same as a direct mapped cache because each memory address is back to pointing to only one possible cache location. On the other hand, suppose you make "N" really large; say, you set "N" to be equal to the number of lines in the cache (16,384 in our example). If you do this, then you only have one set, containing all of the cache lines, and every memory location points to that huge set. This means that any memory address can be in any line, and you are back to a fully associative cache.

    Comparison of Cache Mapping Techniques

    There is a critical tradeoff in cache performance that has led to the creation of the various cache mapping techniques described in the previous section. In order for the cache to have good performance you want to maximize both of the following:

    • Hit Ratio: You want to increase as much as possible the likelihood of the cache containing the memory addresses that the processor wants. Otherwise, you lose much of the benefit of caching because there will be too many misses.
    • Search Speed: You want to be able to determine as quickly as possible if you have scored a hit in the cache. Otherwise, you lose a small amount of time on every access, hit or miss, while you search the cache.

    Now let's look at the three cache types and see how they fare:

    • Direct Mapped Cache: The direct mapped cache is the simplest form of cache and the easiest to check for a hit. Since there is only one possible place that any memory location can be cached, there is nothing to search; the line either contains the memory information we are looking for, or it doesn't.
      Unfortunately, the direct mapped cache also has the worst performance, because again there is only one place that any address can be stored. Let's look again at our 512 KB level 2 cache and 64 MB of system memory. As you recall this cache has 16,384 lines (assuming 32-byte cache lines) and so each one is shared by 4,096 memory addresses. In the absolute worst case, imagine that the processor needs 2 different addresses (call them X and Y) that both map to the same cache line, in alternating sequence (X, Y, X, Y). This could happen in a small loop if you were unlucky. The processor will load X from memory and store it in cache. Then it will look in the cache for Y, but Y uses the same cache line as X, so it won't be there. So Y is loaded from memory, and stored in the cache for future use. But then the processor requests X, and looks in the cache only to find Y. This conflict repeats over and over. The net result is that the hit ratio here is 0%. This is a worst case scenario, but in general the performance is worst for this type of mapping.
    • Fully Associative Cache: The fully associative cache has the best hit ratio because any line in the cache can hold any address that needs to be cached. This means the problem seen in the direct mapped cache disappears, because there is no dedicated single line that an address must use.
      However (you knew it was coming), this cache suffers from problems involving searching the cache. If a given address can be stored in any of 16,384 lines, how do you know where it is? Even with specialized hardware to do the searching, a performance penalty is incurred. And this penalty occurs for all accesses to memory, whether a cache hit occurs or not, because it is part of searching the cache to determine a hit. In addition, more logic must be added to determine which of the various lines to use when a new entry must be added (usually some form of a "least recently used" algorithm is employed to decide which cache line to use next). All this overhead adds cost, complexity and execution time.
    • N-Way Set Associative Cache: The set associative cache is a good compromise between the direct mapped and set associative caches. Let's consider the 4-way set associative cache. Here, each address can be cached in any of 4 places. This means that in the example described in the direct mapped cache description above, where we accessed alternately two addresses that map to the same cache line, they would now map to the same cache set instead. This set has 4 lines in it, so one could hold X and another could hold Y. This raises the hit ratio from 0% to near 100%! Again an extreme example, of course. As for searching, since the set only has 4 lines to examine this is not very complicated to deal with, although it does have to do this small search, and it also requires additional circuitry to decide which cache line to use when saving a fresh read from memory. Again, some form of LRU (least recently used) algorithm is typically used.

    Here's a summary table of the different cache mapping techniques and their relative performance:

    Cache Type

    Hit Ratio

    Search Speed

    Direct Mapped

    Good

    Best

    Fully Associative

    Best

    Moderate

    N-Way Set Associative, N>1

    Very Good, Better as N Increases

    Good, Worse as N Increases

    In the "real world", the direct mapped and set associative caches are by far the most common. Direct mapping is used more for level 2 caches on motherboards, while the higher-performance set-associative cache is found more commonly on the smaller primary caches contained within processors.

    Tag Storage

    Since each cache line (or set) in the data store is shared by a large number of memory addresses that map to it, we need to keep track of which one is using each cache line at a given time. This is what the tag RAM is used for.

    Let's take a look at the same example again: a system with 64 MB of main memory, a 512 KB cache, and 32-byte cache lines. There are 16,384 cache lines, and therefore 4,096 different memory locations that share each line. However, recall that each line contains 32 bytes; that means 32 different bytes can be placed in each line without interfering with each other. So really, there are 128 (4,096 divided by 32) different 32-byte lines of memory that must share a cache spot.

    Okay, now to address 64 MB of memory you need 26 address lines (because 2^26 is 64 M) which are numbered from A0 to A25. 512 KB only requires 19 lines, A0 to A18. The difference between these is 7 lines; not surprisingly, since 128 is 2^7. These 7 address lines are what tell you which of the 128 different address lines that can use a given cache line, are actually using it at the moment. That's what the tag RAM is for. There will be as many entries in the tag RAM as there are in the data store, so we will have 16,384 tag RAM lines, although of course these entries are only a few bits wide, not 32 bytes wide like the data store.

    Notice that the tag RAM is used early in the process of determining whether or not we have a cache hit. This means that no matter how fast the cache data store is, the tag RAM must be slightly faster.

    How the Memory Address Is Used

    The memory address provided by the processor represents which byte of information the processor is looking for at a given time. This is looked at in three sections by the cache controller as it does its work of checking for hits. This example is the same as before (64 MB memory, 512 KB cache, direct mapping to keep things simple) so we again have 26 address bits, A0 through A25:

    • A0 to A4: The lowest-order 5 bits represent the 32 different bytes within the data store (2^5 = 32). Recall that the cache we are looking at has 32 byte lines, all of which are moved around together. Therefore, the address bits A0 to A4 are ignored by the cache controller; the processor will use them later to determine which to use of the 32 bytes it receives from the cache.
    • A5 to A18: These 14 bits represent the cache line that this address maps to. 2^14 is 16,384, which is the total number of cache lines in our example, as you recall. This cache line address is used for looking up both the tag address in the tag RAM, and later the actual data in the data store if there is a hit.
    • A19 to A25: These 7 bits represent the tag address, which tells the system which of the possible memory locations that share the cache line (indicated by address lines A5 to A18) is currently using it.

    If the numbers in the example change, so do these ranges. If instead we have 32 MB of memory, 128 KB of cache, and 16 byte cache lines, then A0 to A3 are ignored, A4 to A16 represent the cache line address, and A17 to A24 are the tag address.

    Cache Write Policy and the Dirty Bit

    In addition to caching reads from memory, the system is capable of caching writes to memory. The handling of the address bits and the cache lines, etc. is pretty similar to how this is done when the cache is read. However, there are two different ways that the cache can handle writes, and this is referred to as the "write policy" of the cache.

    Write-Back and Write-Through policies have same hit-ratio.

    • Write-Back Cache: Also called "copy back" cache, this policy is "full" write caching of the system memory. When a write is made to system memory at a location that is currently cached, the new data is only written to the cache, not actually written to the system memory. Later, if another memory location needs to use the cache line where this data is stored, it is saved ("written back") to the system memory and then the line can be used by the new address.
    • Write-Through Cache: With this method, every time the processor writes to a cached memory location, both the cache and the underlying memory location are updated. This is really sort of like "half caching" of writes; the data just written is in the cache in case it is needed to be read by the processor soon, but the write itself isn't actually cached because we still have to initiate a memory write operation each time.

    write-miss policy

    • Write-allocation Cache: a cache line is allocated and loaded on a write-miss.

    Many caches that are capable of write-back operation can also be set to operate as write-through (not all however), but not generally the other way around.

    Comparing the two policies, in general terms write-back provides better performance, but at the slight risk of memory integrity. Write-back caching saves the system from performing many unnecessary write cycles to the system RAM, which can lead to noticeably faster execution. However, when write-back caching is used, writes to cached memory locations are only placed in cache, and the RAM itself isn't actually updated until the cache line is booted out to make room for another address to use it.

    As a result, at any given time, there can be a mismatch between many of the lines in the cache and the memory addresses that they correspond to. When this happens, the data in the memory is said to be "stale", since it doesn't have the fresh information yet that was only written to the cache. Memory used with a write-through cache can never be "stale" because the system memory is written at the same time that the cache is.

    Normally, stale memory isn't a problem, because the cache controller keeps track of which locations in the cache have been changed and therefore which memory locations may be stale. This is done by using an extra single bit of memory, one per cache line, called the "dirty bit". Whenever a write is cached, this bit is set (made a 1) to tell the cache controller "when you decide to re-use this cache line for a different address, you need to write the current contents back to memory". This dirty bit is normally implemented by adding one extra bit to the tag RAM, instead of using a separate memory chip (to save cost).

    However, the use of a write-back cache does entail the small possibility of data corruption if something were to happen before the "dirty" cache lines could be saved to memory. There aren't too many cases where this could happen, because both the memory and the cache are volatile (cleared when the machine is powered off).

    On the other hand, consider a disk cache, where system memory is used to cache writes to the disk. Here, the memory is volatile but the disk is not. If a write-back cache is used here, you could have stale data on your disk compared to what is in memory. Then, if the power goes out, you lose everything that hadn't yet been written back to the disk, leading to possible corruption. For this reason, most disk caches allow programs to over-rule the write-back policy to ensure consistency between the cache (in memory) and disk. Disk utilities, for example, don't like write-back caching very much!

    It is also possible with many caches to tell the controller "please write out to system memory all dirty cache lines, right now". This is done when it is necessary to make sure that the cache is in sync with the memory, and there is no stale data. This is sometimes called "flushing" the cache, and is especially common with disk caches, for the reason outlined in the previous paragraph.

    Summary: The Cache Read/Write Process

    Having looked at all the parts and design factors that make up a cache, in this section the actual process is described that is followed when the processor reads or writes from the system memory. This example is the same as in the other sections on this page: 64 MB memory, 512 KB cache, 32 byte cache lines. I will assume a direct mapped cache, since that is the simplest to explain (and is in fact most common for level 2 cache):

    1. The processor begins a read/write from/to the system memory.
    2. Simultaneously, the cache controller begins to check if the information requested is in the cache, and the memory controller begins the process of either reading or writing from the system RAM. This is done so that we don't lose any time at all in the event of a cache miss; if we have a cache hit, the system will cancel the partially-completed request from RAM, if appropriate. If we are doing a write on a write-through cache, the write to memory always proceeds.
    3. The cache controller checks for a hit by looking at the address sent by the processor. The lowest five bits (A0 to A4) are ignored, because these differentiate between the 32 different bytes in the cache line. We aren't concerned with that because the cache will always return the whole 32 bytes and let the processor decide which one it wants. The next 14 lines (A5 to A18) represent the line in the cache that we need to check (notice that 2^14 is 16,384).
    4. The cache controller reads the tag RAM at the address indicated by the 14 address lines A5 to A18. So if those 14 bits say address 13,714, the controller will examine the contents of tag RAM entry #13,714. It compares the 7 bits that it reads from the tag RAM at this location to the 7 address bits A19 to A25 that it gets from the processor. If they are identical, then the controller knows that the entry in the cache at that line address is the one the processor wanted; we have a hit. If the tag RAM doesn't match, then we have a miss.
    5. If we do have a hit, then for a read, the cache controller reads the 32-byte contents of the cache data store at the same line address indicated by bits A5 to A18 (13,714), and sends them to the processor. The read that was started to the system RAM is canceled. The process is complete. For a write, the cache controller writes 32 bytes to the data store at that same cache line location referenced by bits A5 to A18. Then, if we are using a write-through cache the write to memory proceeds; if we are using a write-back cache, the write to memory is canceled, and the dirty bit for this cache line is set to 1 to indicate that the cache was updated but the memory was not.
    6. If we have a miss and we were doing a read, the read of system RAM that we started earlier carries on, with 32 bytes being read from memory at the location specified by bits A5 to A25. These bytes are fed to the processor, which uses the lowest five bits (A0 to A4) to decide which of the 32 bytes it wanted. While this is happening the cache also must perform the work of storing these bytes that were just read from memory into the cache so they will be there for the next time this location is wanted. If we are using a write-through cache, the 32 bytes are just placed into the data store at the address indicated by bits A5 to A18. The contents of bits A19 to A25 are saved in the tag RAM at the same 14-bit address, A5 to A18. The entry is now ready for any future request by the processor. If we are using a write-back cache, then before overwriting the old contents of the cache line, we must check the line's dirty bit. If it is set (1) then we must first write back the contents of the cache line to memory, and then clear the dirty bit. If it is clear (0) then the memory isn't stale and we continue without the write cycle.
    7. If we have a cache miss and we were doing a write, interestingly, the cache doesn't do much at all, because most caches don't update the cache line on a write miss. They just leave the entry that was there alone, and write to memory, bypassing the cache entirely. There are some caches that put all writes into the appropriate cache line whenever a write is done. They make the general assumption that anything the processor has just written, it is likely to read back again at some point in the near future. Therefore, they treat every write as a hit, by definition. This means there is no check for a hit on a write; in essence, the cache line that is used by the address just written is always replaced by the data that was just put out by the processor. It also means that on a write miss the cache controller must update the cache, including checking the dirty bit on the entry that was there before the write, exactly the same as what happens for a read miss.

    As complex as it already is :^) this example would of course be even more complex if we used a set associative or fully associative cache. Then we would have a search to do when checking for a hit, and we would also have the matter of deciding which cache line to update on a cache miss.

    There are two common options on a write miss:
    Write Allocate - the block is loaded on a write miss, followed by the wite-hit action.
     No Write Allocate - the block is modified in the main memory and not loaded into the cache.

    Although either write-miss policy could be used with write through or write back, write-back caches generally use write allocate (hoping that subsequent writes to that block will be captured by the cache) and
    write-through caches often use no-write allocate (since subsequent writes to that block will still have to go to memory).

  • Processes (mutual exclusion, semaphore, test-and-set)

    race condition: two or more processes are reading or writing some shared data and the final result depends on who runs precisely when.

    mutual exclusion:

    1. mutual exclusion: no processes may be simultaneously inside their critical sections.

    2. No assumptions are made about relative process speeds or number of CPUs.

    3. dead lock: no process stopped outside its critical section should block other processes.

    4. starvation: no process should wait arbitrarily long to enter its critical section.

    IPC (Inter Processor Communication) techniques

    • disabling interrupts for kernel, but not user processes

    • lock variables: introduce race condition on the lock variable, use test-and-set to avoid this

    busy waiting

    • strict alternation:

      • busy waiting: continuously testing a variable waiting for some value to appear. it should be avoided, since it wastes CPU time.

      • require strict alteration, otherwise, violating rule 3

    • Dekker's algorithm: dose not require strict alternation

    • Peterson's Solution:

    • The TSL instruction: require busy waiting

      • Test and Set Lock: It reads the contents of the memory word into a register and then stores a nonzero value at that memory address.

      • The operations of reading the word and storing into it are guaranteed to be indivisible- no other processor can access the word until the instruction is finished. The CPU executing the TSL instruction locks the memory bus to prohibit other CPUs from accessing memory until it is done.

    blocking

    • sleep and wakeup: lost wakeup will cause deadlock

      • producer-consumer problem  (bounded buffer)

    • semaphores: solve the lost wakeup problem (proposed by E.W. Dijkstra (1965))

      • P: down, V: up, both are atomic actions.

        • P: down: the down operation on a semaphore checks to see if the value is greater than 0. If so, it decrements the value (i.e. uses up one stored wakeup) and just continues. If the value is 0, the process is put to sleep.

        • V: up: the up operation increments the value of the semaphore addressed. If one or more processes were sleeping on that semaphore, unable to complete an earlier P operation, one of them is chosen buy the system at random, and is allowed to complete its P (down).

      • for one CPU, P, V are system calls, with the OS disabling all interrupts while it is accessing the semaphore.

      • multiple CPUs, each semaphore should be protected by a lock variable with the TSL instruction used to make sure that only one CPU at a time examines the semaphore. (here, no busy waiting for TSL)

      • binary semaphores: semaphores that are initialized to 1 and used by two or more processes to ensure that only one of them can enter its critical region at the same time.

      • will cause deadlock if not carefully used.

      • it needs ceil(log2n) binary semaphores to implement a n-ary semaphore.

    • monitor: to make it easier to write correct programs using semaphores

      • definition: A monitor is a collection of procedures, variables, and data structures that are all grouped together in a special kind of module or package. Processes may call the procedures in a monitor whenever they want to, but they cannot directly access the monitor's internal data structures from procedures declared outside the monitor

      • only one process can be active in a monitor at any instant.

      • Monitors are a programming language construct, so the compiler knows they are special and can handle calls to monitor procedures differently from other procedure calls.

      • monitors provide an easy way to achieve mutual exclusion

      • conditional variables (wait and signal) provide block,

        condition full, empty;

        wait(full), signal(empty)

    • semaphores are too low level, and monitors are not usable except in a few programming languages. None of the primitives provide for information exchange between machines.

    • Message passing: uses two primitives SEND and RECEIVE, which, like semaphores and unlike monitors, are system calls rather than language constructs.

    Classical IPC problems

    • The Dining Philosophers Problem

      • starvation: all the programs continue to run indefinitely but fail to make any progress.

      • modeling processes that are competing for exclusive access to a limited number of resources.

    • The Readers and Writers Problem

      • models access to a database.

  • Process Scheduling

    • Scheduling algorithms criteria:

      1. Fairness: make sure each process gets its fair share of the CPU

      2. Efficiency: keep the CPU busy 100 percent of the time.

      3. Response time: minimize respponse time for interactive users.

      4. Turnaround: minimize the time batch users much wait for output

      5. Throughput: maximize the number of jobs processed per hour.

    • preemptive scheduling (in contrast to the run to completion)

      • definition: the strategy of allowing processes that are logically runnable to be temporarily suspended.

    • Round Robin:

      • efficiency vs. response time

    • Priority Scheduling:

      • starvation if priorities are not adjusted from time to time.

    • Shortest Job First (batch jobs) How to figure out which one is the shortest one.

      • optimal turnaround time.

      • average run time (4a+3b+2c+d)/4 for a, b, c, d, probably optimal if a<b<c<d.

      • minimum average response time.

      • aging: the technique of estimating the next value in  a series by taking the weighted average of current measured value and the previous estimate.

      • shortest job first is only optimal when all the jobs are available simultaneously.

    • Policy-Driven Scheduling: related to the ratio of user usage.

  • Dead Lock

    • necessary conditions:

      1. mutual exclusion: each resource is either currently assigned to exactly one process or is available.

      2. hold & wait: processes currently holding resources granted earlier can request new resources.

      3. no preemption: resources previously granted cannot be forcibly taken away from a process. They must be explicitly released by the process holding them.

      4. circular wait: There must be a circular chain of two or more processes, each of which is waiting for a resource held by the next member of the chain.

    • Banker's algorithm: a scheduling algorithm that can avoid deadlocks (Dijkstra 1965)

      • safe state: there exists a sequence of other states that leads to all the customers getting loans up to their credit limits.

      • consider each request as it occurs and see if granting it leads to a safe state. If it does, the request is granted, otherwise, it is postponed until later. To see if a state is safe, the banker checks to see if he has enough resources to satisfy the customer closest to his or her maximum.

  • Superscalar processor

    describes a microprocessor design that makes it possible for more than one instruction at a time to be executed during a single clock cycle. In a superscalar design, the processor or the instruction compiler is able to determine whether an instruction can be carried out independently of other sequential instructions, or whether it has a dependency on another instruction and must be executed in sequence with it. The processor then uses multiple execution units to simultaneously carry out two or more independent instructions at a time. Superscalar design is sometimes called "second generation RISC."



This page was last updated by Howard @01/21/2004 13:26:02 -0500