

#### **CS4803DGC Design Game Consoles**

Spring 2010 Prof. Hyesoon Kim







Thanks to Prof. Loh & Prof. Prvulovic







## Multiprocessing

- Flynn's Taxonomy of Parallel Machines
  - How many Instruction streams?
  - How many Data streams?
- SISD: Single I Stream, Single D Stream
  - A uniprocessor
- SIMD: Single I, Multiple D Streams
  - Each "processor" works on its own data
  - But all execute the same instructions in lockstep
  - E.g., a vector processor, MMX, or CUDA
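The lockstep idea above can be sketched in a few lines of Python. This is a toy model, not how any real SIMD unit is programmed: the `run_simd` function, the `OPS` table, and the `(op, operand)` instruction format are all made up for illustration.

```python
import operator

# Hypothetical two-instruction "ISA" used only for this sketch.
OPS = {"add": operator.add, "mul": operator.mul}


def run_simd(program, lanes):
    """Run one instruction stream over several data lanes in lockstep.

    program: list of (op_name, operand) pairs -- the single I stream.
    lanes:   list of per-lane values          -- the multiple D streams.
    """
    lanes = list(lanes)
    for op_name, operand in program:
        op = OPS[op_name]
        # Every lane executes the same instruction on its own data.
        lanes = [op(value, operand) for value in lanes]
    return lanes


# One instruction stream (x * 2 + 1) applied to four data items.
result = run_simd([("mul", 2), ("add", 1)], [0, 1, 2, 3])
print(result)  # [1, 3, 5, 7]
```

A SISD machine would run the same two instructions four times, once per data item; the SIMD model fetches each instruction once and applies it to all lanes.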



# Flynn's Taxonomy

- MISD: Multiple I, Single D Stream
  - Not used much
  - Stream processors are closest to MISD
- MIMD: Multiple I, Multiple D Streams
  - Each processor executes its own instructions and operates on its own data
  - This is your typical off-the-shelf multiprocessor (made using a bunch of "normal" processors)
  - Includes multi-core processors




### SIMD Model

- Examples: Texas Instruments C62xx, IA-32 (SSE), AMD K6, CUDA, Xbox, etc.
- Early SIMD machines: e.g., CM-2 (a large distributed system)
  - Lack of vector register files and efficient transposition support in the memory system
  - Lack of irregular indexed memory accesses
- Modern SIMD machines:
  - The SIMD engine is on the same die



### **SIMD Execution Model**





## **Locality and Caches**

- Data Locality
  - Temporal: if a data item is needed now, it is likely to be needed again in the near future
  - Spatial: if a data item is needed now, nearby data is likely to be needed in the near future
- Exploiting Locality: Caches
  - Keep recently used data in fast memory close to the processor
  - Also bring nearby data there
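To make the two kinds of locality concrete, here is a minimal sketch of a direct-mapped cache that only counts hits and misses. The class name, the default sizes (4 lines of 4 words), and the word-addressed interface are assumptions chosen for illustration, not parameters from the lecture.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks hits and misses only (no data)."""

    def __init__(self, num_lines=4, words_per_line=4):
        self.num_lines = num_lines
        self.words_per_line = words_per_line
        self.blocks = [None] * num_lines  # block number held by each line
        self.hits = 0
        self.misses = 0

    def access(self, word_addr):
        block = word_addr // self.words_per_line  # which memory block
        index = block % self.num_lines            # which cache line it maps to
        if self.blocks[index] == block:
            self.hits += 1
            return True
        # Miss: bring in the whole block, i.e. the nearby words as well.
        self.blocks[index] = block
        self.misses += 1
        return False


# Spatial locality: a sequential walk misses once per 4-word line.
seq = DirectMappedCache()
for addr in range(16):
    seq.access(addr)
print(seq.hits, seq.misses)  # 12 4

# Temporal locality: re-reading one word misses only the first time.
rep = DirectMappedCache()
for _ in range(8):
    rep.access(100)
print(rep.hits, rep.misses)  # 7 1
```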



# **MEMORY SYSTEM**



College of Computing



### **Storage Hierarchy and Locality**






### Memory Latency is Long

- 60-100 ns is not uncommon
- Quick back-of-the-envelope calculation:
  - 2 GHz CPU $\rightarrow$ 0.5 ns / cycle
  - 100 ns memory $\rightarrow$ 200-cycle memory latency!
- Solution: Caches
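The back-of-the-envelope calculation above is just latency divided by cycle time. A small helper (the name `latency_in_cycles` is chosen here for illustration) makes it easy to try other clock rates:

```python
def latency_in_cycles(latency_ns, clock_ghz):
    """Convert a memory latency in nanoseconds to CPU cycles."""
    cycle_time_ns = 1.0 / clock_ghz  # e.g. 2 GHz -> 0.5 ns per cycle
    return latency_ns / cycle_time_ns


print(latency_in_cycles(100, 2.0))  # 200.0
print(latency_in_cycles(60, 2.0))   # 120.0
```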



#### Cache



### **CPU-DRAM**








## SRAM vs. DRAM

- DRAM = Dynamic RAM

- SRAM: 6T per bit
  - built with normal high-speed CMOS technology
- DRAM: 1T per bit
  - built with special DRAM process optimized for density



### **Hardware Structures**





### **DRAM Chip Organization**



# **DRAM Chip Organization (2)**

- Differences with SRAM
  - reads are *destructive*: contents are erased after reading
  - row buffer
    - read lots of bits all at once, and then parcel them out based on different column addresses
      - similar to reading a full cache line, but only accessing one word at a time


- "Fast-Page Mode" (FPM) DRAM organizes the DRAM row to contain the bits for a complete page
  - the row address is held constant, and different locations in the same page are then read quickly
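The row buffer and destructive read can be illustrated with a toy model of one DRAM bank. This is a deliberately simplified sketch: the `DramBank` class, its sizes, and the policy of restoring a row only when it is closed are assumptions for illustration. Real DRAM timing and restore behavior are more involved.

```python
class DramBank:
    """Toy DRAM bank with a single row buffer (simplified, hypothetical)."""

    def __init__(self, rows=4, cols=8):
        self.cols = cols
        self.cells = [[0] * cols for _ in range(rows)]
        self.open_row = None
        self.row_buffer = None
        self.activations = 0  # number of slow row accesses

    def _close_row(self):
        # Write the buffer back so the row's contents are restored.
        if self.open_row is not None:
            self.cells[self.open_row] = self.row_buffer
            self.open_row = None
            self.row_buffer = None

    def _activate(self, row):
        self._close_row()
        self.row_buffer = self.cells[row]     # read the whole row at once
        self.cells[row] = [None] * self.cols  # destructive read: cells drained
        self.open_row = row
        self.activations += 1

    def read(self, row, col):
        if self.open_row != row:
            self._activate(row)
        return self.row_buffer[col]           # column select from the buffer

    def write(self, row, col, value):
        if self.open_row != row:
            self._activate(row)
        self.row_buffer[col] = value


bank = DramBank()
for col in range(8):
    bank.write(1, col, col * 10)

# Page-mode style access: same row, different columns, one activation.
values = [bank.read(1, col) for col in range(4)]
print(values, bank.activations)  # [0, 10, 20, 30] 1

# Touching another row closes row 1 (restoring it) and opens row 2.
bank.read(2, 0)
print(bank.activations)          # 2
```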



### **DRAM Read Operation**





### **Destructive Read**





# **CACHE COHERENCE**




#### **Problem**





| Address | Value |
|---------|-------|
| A1      | 10    |
| A2      | 20    |
| A3      | 39    |
| A4      | 17    |

# **SNOOPING**





Georgia Tech






# **MSI Snoopy Protocol**

- State of block B in cache C can be
  - Invalid: B is not cached in C
    - To read or write, must make a request on the bus
  - Modified: B is dirty in C
    - C has the block, no other cache has the block, and C must update memory when it displaces B
    - Can read or write B without going to the bus
  - Shared: B is clean in C
    - C has the block, other caches may also have the block, and C need not update memory when it displaces B
    - Can read B without going to the bus
    - To write, must send an upgrade request to the bus
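The state rules above can be exercised with a small simulation of one block B shared by several caches on a bus. This is a sketch under assumptions: the `MsiSystem` class and its counters are hypothetical, and the snoop-side transitions (a Modified copy writes back and drops to Shared on another cache's read, and other copies are invalidated on a write) follow the usual textbook MSI protocol rather than anything spelled out in the bullets.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"


class MsiSystem:
    """Tracks the MSI state of a single block B in each cache (toy model)."""

    def __init__(self, num_caches):
        self.states = [State.INVALID] * num_caches
        self.bus_requests = 0
        self.writebacks = 0

    def read(self, c):
        if self.states[c] is not State.INVALID:
            return  # S or M: read hit, no bus traffic
        self.bus_requests += 1  # read request on the bus
        for other, st in enumerate(self.states):
            if other != c and st is State.MODIFIED:
                self.writebacks += 1  # dirty copy updates memory
                self.states[other] = State.SHARED
        self.states[c] = State.SHARED

    def write(self, c):
        if self.states[c] is State.MODIFIED:
            return  # already the only (dirty) copy
        self.bus_requests += 1  # upgrade from S, or read-for-write from I
        for other, st in enumerate(self.states):
            if other != c:
                if st is State.MODIFIED:
                    self.writebacks += 1
                self.states[other] = State.INVALID  # snoopers invalidate
        self.states[c] = State.MODIFIED


system = MsiSystem(num_caches=2)
system.read(0)   # C0: I -> S
system.read(1)   # C1: I -> S
system.write(0)  # C0: S -> M (upgrade), C1: S -> I
system.write(0)  # hit in M, no bus request
system.read(1)   # C0 writes back, M -> S; C1: I -> S
print([s.value for s in system.states])        # ['S', 'S']
print(system.bus_requests, system.writebacks)  # 4 1
```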

### **MSI Example**

