|
GRE CS Subject Test Notes
-
Compiler optimization -
Cache mapping and associativity
A very important factor in determining the effectiveness of the level 2 cache
relates to how the cache is mapped to the system memory. What this means in
brief is that there are many different ways to allocate the storage in our cache
to the memory addresses it serves. Let's take as an example a system with 512 KB
of L2 cache and 64 MB of main memory. The burning question is: how do we decide
how to divvy up the 16,384 address lines in our cache amongst the "huge" 64 MB
of memory?
There are three different ways that this mapping can generally be done. The
choice of mapping technique is so critical to the design that the cache is often
named after this choice:
- Direct Mapped Cache: The simplest way to allocate the cache to the
system memory is to determine how many cache lines there are (16,384 in our
example) and just chop the system memory into the same number of chunks. Then
each chunk gets the use of one cache line. This is called direct mapping.
So if we have 64 MB of main memory addresses, each cache line would be shared
by 4,096 memory addresses (64 M divided by 16 K).
- Fully Associative Cache: Instead of hard-allocating cache lines to
particular memory locations, it is possible to design the cache so that any
line can store the contents of any memory location. This is called fully
associative mapping.
- N-Way Set Associative Cache: "N" here is a number, typically 2, 4,
8 etc. This is a compromise between the direct mapped and fully associative
designs. In this case the cache is broken into sets where each set contains
"N" cache lines, let's say 4. Then, each memory address is assigned a set, and
can be cached in any one of those 4 locations within the set that it is
assigned to. In other words, within each set the cache is associative,
and thus the name.
This design means that there are "N" possible places that a given memory
location may be in the cache. The tradeoff is that there are "N" times as many
memory locations competing for the same "N" lines in the set. Let's suppose in
our example that we are using a 4-way set associative cache. So instead of a
single block of 16,384 lines, we have 4,096 sets with 4 lines in each. Each of
these sets is shared by 16,384 memory addresses (64 M divided by 4 K) instead
of 4,096 addresses as in the case of the direct mapped cache. So there is more
to share (4 lines instead of 1) but more addresses sharing it (16,384 instead
of 4,096).
Conceptually, the direct mapped and fully associative caches are just
"special cases" of the N-way set associative cache. You can set "N" to 1 to make
a "1-way" set associative cache. If you do this, then there is only one line per
set, which is the same as a direct mapped cache because each memory address is
back to pointing to only one possible cache location. On the other hand, suppose
you make "N" really large; say, you set "N" to be equal to the number of lines
in the cache (16,384 in our example). If you do this, then you only have one
set, containing all of the cache lines, and every memory location points to that
huge set. This means that any memory address can be in any line, and you are
back to a fully associative cache.
Comparison of Cache Mapping Techniques
There is a critical tradeoff in cache performance that has led to the
creation of the various cache mapping techniques described in the previous
section. In order for the cache to have good performance you want to maximize
both of the following:
- Hit Ratio: You want to increase as much as possible the likelihood
of the cache containing the memory addresses that the processor wants.
Otherwise, you lose much of the benefit of caching because there will be too
many misses.
- Search Speed: You want to be able to determine as quickly as
possible if you have scored a hit in the cache. Otherwise, you lose a small
amount of time on every access, hit or miss, while you search the
cache.
Now let's look at the three cache types and see how they fare:
- Direct Mapped Cache: The direct mapped cache is the simplest form
of cache and the easiest to check for a hit. Since there is only one possible
place that any memory location can be cached, there is nothing to search; the
line either contains the memory information we are looking for, or it doesn't.
Unfortunately, the direct mapped cache also has the worst performance, because
again there is only one place that any address can be stored. Let's look again
at our 512 KB level 2 cache and 64 MB of system memory. As you recall this
cache has 16,384 lines (assuming 32-byte cache lines) and so each one is
shared by 4,096 memory addresses. In the absolute worst case, imagine that the
processor needs 2 different addresses (call them X and Y) that both map to the
same cache line, in alternating sequence (X, Y, X, Y). This could happen in a
small loop if you were unlucky. The processor will load X from memory and
store it in cache. Then it will look in the cache for Y, but Y uses the same
cache line as X, so it won't be there. So Y is loaded from memory, and stored
in the cache for future use. But then the processor requests X, and looks in
the cache only to find Y. This conflict repeats over and over. The net result
is that the hit ratio here is 0%. This is a worst case scenario, but in
general the performance is worst for this type of mapping.
- Fully Associative Cache: The fully associative cache has the best
hit ratio because any line in the cache can hold any address that needs to be
cached. This means the problem seen in the direct mapped cache disappears,
because there is no dedicated single line that an address must use.
However (you knew it was coming), this cache suffers from problems involving
searching the cache. If a given address can be stored in any of 16,384 lines,
how do you know where it is? Even with specialized hardware to do the
searching, a performance penalty is incurred. And this penalty occurs for
all accesses to memory, whether a cache hit occurs or not, because it is
part of searching the cache to determine a hit. In addition, more logic must
be added to determine which of the various lines to use when a new entry must
be added (usually some form of a "least recently used" algorithm is employed
to decide which cache line to use next). All this overhead adds cost,
complexity and execution time.
- N-Way Set Associative Cache: The set associative cache is a good
compromise between the direct mapped and set associative caches. Let's
consider the 4-way set associative cache. Here, each address can be cached in
any of 4 places. This means that in the example described in the direct mapped
cache description above, where we accessed alternately two addresses that map
to the same cache line, they would now map to the same cache set
instead. This set has 4 lines in it, so one could hold X and another could
hold Y. This raises the hit ratio from 0% to near 100%! Again an extreme
example, of course. As for searching, since the set only has 4 lines to
examine this is not very complicated to deal with, although it does have to do
this small search, and it also requires additional circuitry to decide which
cache line to use when saving a fresh read from memory. Again, some form of
LRU (least recently used) algorithm is typically used.
Here's a summary table of the different cache mapping techniques and their
relative performance:
|
Cache Type |
Hit Ratio |
Search Speed |
|
Direct Mapped |
Good |
Best |
|
Fully Associative |
Best |
Moderate |
|
N-Way Set Associative, N>1 |
Very Good, Better as N Increases |
Good, Worse as N Increases |
In the "real world", the direct mapped and set associative caches are by far
the most common. Direct mapping is used more for level 2 caches on motherboards,
while the higher-performance set-associative cache is found more commonly on the
smaller primary caches contained within processors.
Tag Storage
Since each cache line (or set) in the data store is shared by a large number
of memory addresses that map to it, we need to keep track of which one is using
each cache line at a given time. This is what the tag RAM is used for.
Let's take a look at the same example again: a system with 64 MB of main
memory, a 512 KB cache, and 32-byte cache lines. There are 16,384 cache lines,
and therefore 4,096 different memory locations that share each line. However,
recall that each line contains 32 bytes; that means 32 different bytes can be
placed in each line without interfering with each other. So really, there are
128 (4,096 divided by 32) different 32-byte lines of memory that must share a
cache spot.
Okay, now to address 64 MB of memory you need 26 address lines (because 2^26
is 64 M) which are numbered from A0 to A25. 512 KB only requires 19 lines, A0 to
A18. The difference between these is 7 lines; not surprisingly, since 128 is
2^7. These 7 address lines are what tell you which of the 128 different address
lines that can use a given cache line, are actually using it at the moment.
That's what the tag RAM is for. There will be as many entries in the tag RAM as
there are in the data store, so we will have 16,384 tag RAM lines, although of
course these entries are only a few bits wide, not 32 bytes wide like the data
store.
Notice that the tag RAM is used early in the process of determining whether
or not we have a cache hit. This means that no matter how fast the cache data
store is, the tag RAM must be slightly faster.
How the Memory Address Is Used
The memory address provided by the processor represents which byte of
information the processor is looking for at a given time. This is looked at in
three sections by the cache controller as it does its work of checking for hits.
This example is the same as before (64 MB memory, 512 KB cache, direct mapping
to keep things simple) so we again have 26 address bits, A0 through A25:
- A0 to A4: The lowest-order 5 bits represent the 32 different bytes
within the data store (2^5 = 32). Recall that the cache we are looking at has
32 byte lines, all of which are moved around together. Therefore, the address
bits A0 to A4 are ignored by the cache controller; the processor will use them
later to determine which to use of the 32 bytes it receives from the cache.
- A5 to A18: These 14 bits represent the cache line that this address
maps to. 2^14 is 16,384, which is the total number of cache lines in our
example, as you recall. This cache line address is used for looking up both
the tag address in the tag RAM, and later the actual data in the data store if
there is a hit.
- A19 to A25: These 7 bits represent the tag address, which tells the
system which of the possible memory locations that share the cache line
(indicated by address lines A5 to A18) is currently using it.
If the numbers in the example change, so do these ranges. If instead we have
32 MB of memory, 128 KB of cache, and 16 byte cache lines, then A0 to A3 are
ignored, A4 to A16 represent the cache line address, and A17 to A24 are the tag
address.
Cache Write Policy and the Dirty Bit
In addition to caching reads from memory, the system is capable of caching
writes to memory. The handling of the address bits and the cache lines,
etc. is pretty similar to how this is done when the cache is read. However,
there are two different ways that the cache can handle writes, and this is
referred to as the "write policy" of the cache.
Write-Back and Write-Through policies have same hit-ratio.
- Write-Back Cache: Also called "copy back" cache, this policy is
"full" write caching of the system memory. When a write is made to system
memory at a location that is currently cached, the new data is only written to
the cache, not actually written to the system memory. Later, if another memory
location needs to use the cache line where this data is stored, it is saved
("written back") to the system memory and then the line can be used by the new
address.
- Write-Through Cache: With this method, every time the processor
writes to a cached memory location, both the cache and the underlying memory
location are updated. This is really sort of like "half caching" of writes;
the data just written is in the cache in case it is needed to be read by the
processor soon, but the write itself isn't actually cached because we still
have to initiate a memory write operation each time.
write-miss policy
- Write-allocation Cache: a cache line is allocated and loaded on a
write-miss.
Many caches that are capable of write-back operation can also be set to
operate as write-through (not all however), but not generally the other way
around.
Comparing the two policies, in general terms write-back provides better
performance, but at the slight risk of memory integrity. Write-back caching
saves the system from performing many unnecessary write cycles to the
system RAM, which can lead to noticeably faster execution. However, when
write-back caching is used, writes to cached memory locations are only placed in
cache, and the RAM itself isn't actually updated until the cache line is booted
out to make room for another address to use it.
As a result, at any given time, there can be a mismatch between many of the
lines in the cache and the memory addresses that they correspond to. When this
happens, the data in the memory is said to be "stale", since it doesn't have the
fresh information yet that was only written to the cache. Memory used with a
write-through cache can never be "stale" because the system memory is written at
the same time that the cache is.
Normally, stale memory isn't a problem, because the cache controller keeps
track of which locations in the cache have been changed and therefore which
memory locations may be stale. This is done by using an extra single bit of
memory, one per cache line, called the "dirty bit". Whenever a write is cached,
this bit is set (made a 1) to tell the cache controller "when you decide to
re-use this cache line for a different address, you need to write the current
contents back to memory". This dirty bit is normally implemented by adding one
extra bit to the tag RAM, instead of using a separate memory chip (to save
cost).
However, the use of a write-back cache does entail the small possibility of
data corruption if something were to happen before the "dirty" cache lines could
be saved to memory. There aren't too many cases where this could happen, because
both the memory and the cache are volatile (cleared when the machine is powered
off).
On the other hand, consider a disk cache, where system memory is used
to cache writes to the disk. Here, the memory is volatile but the disk is not.
If a write-back cache is used here, you could have stale data on your disk
compared to what is in memory. Then, if the power goes out, you lose everything
that hadn't yet been written back to the disk, leading to possible corruption.
For this reason, most disk caches allow programs to over-rule the write-back
policy to ensure consistency between the cache (in memory) and disk. Disk
utilities, for example, don't like write-back caching very much!
It is also possible with many caches to tell the controller "please write out
to system memory all dirty cache lines, right now". This is done when it
is necessary to make sure that the cache is in sync with the memory, and there
is no stale data. This is sometimes called "flushing" the cache, and is
especially common with disk caches, for the reason outlined in the previous
paragraph.
Summary: The Cache Read/Write Process
Having looked at all the parts and design factors that make up a cache, in
this section the actual process is described that is followed when the processor
reads or writes from the system memory. This example is the same as in the other
sections on this page: 64 MB memory, 512 KB cache, 32 byte cache lines. I will
assume a direct mapped cache, since that is the simplest to explain (and is in
fact most common for level 2 cache):
- The processor begins a read/write from/to the system memory.
- Simultaneously, the cache controller begins to check if the information
requested is in the cache, and the memory controller begins the process of
either reading or writing from the system RAM. This is done so that we don't
lose any time at all in the event of a cache miss; if we have a cache hit, the
system will cancel the partially-completed request from RAM, if appropriate.
If we are doing a write on a write-through cache, the write to memory always
proceeds.
- The cache controller checks for a hit by looking at the address sent by
the processor. The lowest five bits (A0 to A4) are ignored, because these
differentiate between the 32 different bytes in the cache line. We aren't
concerned with that because the cache will always return the whole 32 bytes
and let the processor decide which one it wants. The next 14 lines (A5 to A18)
represent the line in the cache that we need to check (notice that 2^14 is
16,384).
- The cache controller reads the tag RAM at the address indicated by the 14
address lines A5 to A18. So if those 14 bits say address 13,714, the
controller will examine the contents of tag RAM entry #13,714. It compares the
7 bits that it reads from the tag RAM at this location to the 7 address bits
A19 to A25 that it gets from the processor. If they are identical, then the
controller knows that the entry in the cache at that line address is the one
the processor wanted; we have a hit. If the tag RAM doesn't match, then we
have a miss.
- If we do have a hit, then for a read, the cache controller reads the
32-byte contents of the cache data store at the same line address indicated by
bits A5 to A18 (13,714), and sends them to the processor. The read that was
started to the system RAM is canceled. The process is complete. For a write,
the cache controller writes 32 bytes to the data store at that same cache line
location referenced by bits A5 to A18. Then, if we are using a write-through
cache the write to memory proceeds; if we are using a write-back cache, the
write to memory is canceled, and the dirty bit for this cache line is set to 1
to indicate that the cache was updated but the memory was not.
- If we have a miss and we were doing a read, the read of system RAM that we
started earlier carries on, with 32 bytes being read from memory at the
location specified by bits A5 to A25. These bytes are fed to the processor,
which uses the lowest five bits (A0 to A4) to decide which of the 32 bytes it
wanted. While this is happening the cache also must perform the work of
storing these bytes that were just read from memory into the cache so they
will be there for the next time this location is wanted. If we are using a
write-through cache, the 32 bytes are just placed into the data store at the
address indicated by bits A5 to A18. The contents of bits A19 to A25 are saved
in the tag RAM at the same 14-bit address, A5 to A18. The entry is now ready
for any future request by the processor. If we are using a write-back cache,
then before overwriting the old contents of the cache line, we must check the
line's dirty bit. If it is set (1) then we must first write back the contents
of the cache line to memory, and then clear the dirty bit. If it is clear (0)
then the memory isn't stale and we continue without the write cycle.
- If we have a cache miss and we were doing a write, interestingly, the
cache doesn't do much at all, because most caches don't update the cache line
on a write miss. They just leave the entry that was there alone, and write to
memory, bypassing the cache entirely. There are some caches that put all
writes into the appropriate cache line whenever a write is done. They make the
general assumption that anything the processor has just written, it is likely
to read back again at some point in the near future. Therefore, they treat
every write as a hit, by definition. This means there is no check for a
hit on a write; in essence, the cache line that is used by the address just
written is always replaced by the data that was just put out by the processor.
It also means that on a write miss the cache controller must update the cache,
including checking the dirty bit on the entry that was there before the write,
exactly the same as what happens for a read miss.
As complex as it already is :^) this example would of course be even more
complex if we used a set associative or fully associative cache. Then we would
have a search to do when checking for a hit, and we would also have the matter
of deciding which cache line to update on a cache miss.
There are two common options on a write miss:
Write Allocate - the block is loaded on a write miss, followed by
the wite-hit action.
No Write Allocate - the block is modified in the main memory and
not loaded into the cache.
Although either write-miss policy could be used with write
through or write back, write-back caches generally use write allocate
(hoping that subsequent writes to that block will be captured by the cache) and
write-through caches often use no-write allocate
(since subsequent writes to that block will still have to go to memory).
-
Processes (mutual exclusion, semaphore, test-and-set)
race condition: two or more processes are reading or writing some shared
data and the final result depends on who runs precisely when.
mutual exclusion:
-
mutual exclusion: no processes may be
simultaneously inside their critical sections.
-
No assumptions are made about relative process speeds or
number of CPUs.
-
dead lock: no process stopped outside its critical
section should block other processes.
-
starvation: no process should wait arbitrarily
long to enter its critical section.
IPC (Inter Processor Communication) techniques
-
disabling interrupts for kernel, but not user processes
-
lock variables: introduce race condition on the lock
variable, use test-and-set to avoid this
busy waiting
blocking
-
sleep and wakeup: lost wakeup will cause
deadlock
-
semaphores: solve the lost wakeup problem
(proposed by E.W. Dijkstra (1965))
-
P: down, V: up, both are atomic actions.
-
P: down: the down operation on a semaphore checks to see
if the value is greater than 0. If so, it decrements the value (i.e. uses up one
stored wakeup) and just continues. If the value is 0, the process is put to
sleep.
-
V: up: the up operation increments the value of the
semaphore addressed. If one or more processes were sleeping on that semaphore,
unable to complete an earlier P operation, one of them is chosen buy the system
at random, and is allowed to complete its P (down).
-
for one CPU, P, V are system calls, with the OS disabling
all interrupts while it is accessing the semaphore.
-
multiple CPUs, each semaphore should be protected by a
lock variable with the TSL instruction used to make sure that only one CPU at a
time examines the semaphore. (here, no busy waiting for TSL)
-
binary semaphores: semaphores that are initialized
to 1 and used by two or more processes to ensure that only one of them can enter
its critical region at the same time.
-
will cause deadlock if not carefully used.
-
it needs ceil(log2n) binary semaphores
to implement a n-ary semaphore.
-
monitor: to make it easier to write correct
programs using semaphores
-
definition: A monitor is a collection of procedures,
variables, and data structures that are all grouped together in a special kind
of module or package. Processes may call the procedures in a monitor whenever
they want to, but they cannot directly access the monitor's internal data
structures from procedures declared outside the monitor
-
only one process can be active in a monitor at any
instant.
-
Monitors are a programming language construct, so the
compiler knows they are special and can handle calls to monitor procedures
differently from other procedure calls.
-
monitors provide an easy way to achieve mutual exclusion
-
conditional variables (wait and signal) provide block,
condition full, empty; wait(full),
signal(empty)
-
semaphores are too low level, and monitors are not usable
except in a few programming languages. None of the primitives provide for
information exchange between machines.
-
Message passing: uses two primitives SEND and
RECEIVE, which, like semaphores and unlike monitors, are system calls rather
than language constructs.
Classical IPC problems
-
Process Scheduling
-
Scheduling algorithms criteria:
-
Fairness: make sure each process gets its fair share of
the CPU
-
Efficiency: keep the CPU busy 100 percent of the time.
-
Response time: minimize respponse time for interactive
users.
-
Turnaround: minimize the time batch users much wait for
output
-
Throughput: maximize the number of jobs processed per
hour.
-
preemptive scheduling (in contrast to the run to
completion)
-
Round Robin:
-
Priority Scheduling:
-
Shortest Job First (batch jobs) How to figure out which
one is the shortest one.
-
optimal turnaround time.
-
average run time (4a+3b+2c+d)/4 for a, b, c, d, probably
optimal if a<b<c<d.
-
minimum average response time.
-
aging: the technique of estimating the next value
in a series by taking the weighted average of current measured value and
the previous estimate.
-
shortest job first is only optimal
when all the jobs are available simultaneously.
-
Policy-Driven Scheduling: related to the ratio of user
usage.
-
Dead Lock
-
Superscalar processor
describes a
microprocessor design that makes it possible for more than one
instruction at a time to be executed during a single
clock cycle. In a superscalar design, the processor or the
instruction
compiler is able to determine whether an instruction can be carried
out independently of other sequential instructions, or whether it has a
dependency on another instruction and must be executed in sequence with
it. The processor then uses multiple execution units to simultaneously
carry out two or more independent instructions at a time. Superscalar
design is sometimes called "second generation
RISC."
|