

#### **CS4803DGC Design Game Consoles**

Spring 2010 Prof. Hyesoon Kim







Georgia Tech College of Computing

#### Lectures are from

- Reading assignment [LRB]
- http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Seiler.pdf



# **Motivation for LRB**

- Economic reason: Game market is getting bigger
- Technology reason: power wall and memory wall → heterogeneous architectures show promising results






## In-order vs. OOO

| Property           | Out-of-order design | In-order design |
|--------------------|---------------------|-----------------|
| # CPU cores        | 2 out-of-order      | 10 in-order     |
| Instruction issue  | 4 per clock         | 2 per clock     |
| VPU per core       | 4-wide SSE          | 16-wide         |
| L2 cache size      | 4 MB                | 4 MB            |
| Single-stream      | 4 per clock         | 2 per clock     |
| Vector throughput  | 8 per clock         | 160 per clock   |
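
The vector throughput figures in the table follow from the core count and the VPU width. A minimal sketch of that arithmetic, assuming each core issues one vector instruction per clock (the function name is illustrative, not from the paper):

```python
def vector_throughput(cores, vpu_width):
    """Peak vector operations per clock, assuming one VPU instruction per core per clock."""
    return cores * vpu_width

ooo = vector_throughput(cores=2, vpu_width=4)         # 2 out-of-order cores, 4-wide SSE
in_order = vector_throughput(cores=10, vpu_width=16)  # 10 in-order cores, 16-wide VPU

print(ooo, in_order, in_order // ooo)  # 8 160 20
```

With the same 4 MB of L2, the in-order design gives up half the single-stream issue rate in exchange for 20 times the peak vector throughput.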

Detour: OOO



## Larrabee Architecture




- 4-way SMT in-order processors with cache coherence
- Extended x86 ISA
- Fixed functions: texture filtering



#### **Programmable Pipeline Comparison**



#### Core






- In-order core derived from the Pentium processor
- Extended x86 (64-bit, new instructions)
- 4-way SMT
- 32KB I-cache, 32KB D-cache (statically partitioned)
- 256KB local L2 cache (subset of the L2 cache)




# **Dual issue**

- Two pipelines: U-pipe and V-pipe
- Primary pipeline (U-pipe): all instructions
- Secondary pipeline (V-pipe): limited instructions, namely loads, stores, simple ALU operations, branches, cache manipulation instructions, and vector stores
- Relies on the compiler's pairing of instructions
- VLIW-ish again
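
To make the pairing rule concrete, here is a minimal sketch of how a compiler-side model might count issue cycles. The instruction class names are hypothetical labels for the categories listed above, and the model ignores register dependences and any other pairing restrictions the real hardware has.

```python
# Instruction classes the V-pipe accepts, per the list above (labels are hypothetical).
V_PIPE_CLASSES = {
    "load", "store", "simple_alu", "branch", "cache_ctrl", "vector_store",
}

def can_pair(first, second):
    """True if two adjacent instructions could issue in the same clock.

    The first goes down the U-pipe, which accepts anything; the second
    must belong to the V-pipe's limited set. Dependences are ignored.
    """
    return second in V_PIPE_CLASSES

def issue_cycles(stream):
    """Count issue cycles for an in-order stream with greedy pairing."""
    cycles, i = 0, 0
    while i < len(stream):
        if i + 1 < len(stream) and can_pair(stream[i], stream[i + 1]):
            i += 2  # dual issue: U-pipe and V-pipe both used
        else:
            i += 1  # single issue: U-pipe only
        cycles += 1
    return cycles

stream = ["vector_alu", "load", "vector_alu", "vector_alu", "store"]
print(issue_cycles(stream))  # 3 cycles instead of 5
```

Because the hardware issues strictly in order, the instruction order chosen by the compiler decides how often the V-pipe is used, which is why the slide calls the design VLIW-ish.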


# **Cache management**

- Use the cache as extended register file storage
- Targets streaming applications
- Each core can
  - Access its local subset of L2 (256KB) quickly
  - Access other cores' L2 subsets too
  - Control non-temporal streaming data (SSE)
  - Prefetch into L1, or into L2 only
  - Mark a streaming cache line for early eviction
  - Keep render targets in L2 (e.g., FB, ZB, SB, etc.)
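
The effect of marking streaming lines for early eviction can be illustrated with a toy cache model. This is a sketch under stated assumptions, not Larrabee's actual replacement policy: the class `StreamingCache`, its fully associative organization, and the LRU fallback are all hypothetical.

```python
from collections import OrderedDict

class StreamingCache:
    """Toy fully associative cache: LRU replacement plus an
    'evict early' hint for streaming lines (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> evict-early flag, oldest first

    def access(self, addr, streaming=False):
        """Touch a line and return True on a hit."""
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)      # refresh LRU position
        elif len(self.lines) >= self.capacity:
            self._evict()
        self.lines[addr] = streaming
        return hit

    def _evict(self):
        # Prefer the oldest line marked for early eviction; otherwise plain LRU.
        victim = next((a for a, early in self.lines.items() if early), None)
        if victim is None:
            self.lines.popitem(last=False)
        else:
            del self.lines[victim]

cache = StreamingCache(capacity=2)
cache.access(0x100)                  # e.g. a render-target line we want to keep
cache.access(0x200, streaming=True)  # streaming data, used once
cache.access(0x300, streaming=True)  # evicts 0x200, not the older 0x100
print(cache.access(0x100))           # True: the render-target line survived
```

Under plain LRU the third access would have evicted `0x100`; the hint lets one-pass streaming data flow through without displacing data that will be reused.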



# L2-Cache and Ring network

- The global L2 cache is divided into a 256KB local L2 subset per core
- Data written by a CPU core is stored in its own L2 cache subset and is flushed from other subsets, if necessary
- Bi-directional ring network (<16 cores)
  - Each agent accepts messages from one direction on even clocks and from the other on odd clocks; a message travels one hop per clock
- Each ring data path is 512 bits wide per direction
- Inserting a line into the L2 cache requires a cache coherence check
- The ring also carries memory and fixed-function accesses
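
A rough feel for ring latency and bandwidth can be had from a few lines of arithmetic. The sketch below assumes a message always takes the shorter direction and ignores contention and the even/odd clock arbitration; the function name and the agent count of 10 are illustrative.

```python
def ring_hops(src, dst, agents):
    """Hops between two agents on a bidirectional ring, taking the shorter way."""
    clockwise = (dst - src) % agents
    return min(clockwise, agents - clockwise)

RING_BITS = 512
bytes_per_clock = RING_BITS // 8         # 64 bytes per direction per clock
floats_per_clock = bytes_per_clock // 4  # 16 single-precision values per clock

# At one hop per clock, the hop count is also the uncontended transit time in clocks.
print(ring_hops(0, 3, agents=10))  # 3
print(ring_hops(0, 8, agents=10))  # 2, going the other way around
print(bytes_per_clock, floats_per_clock)  # 64 16
```

Note that one 512-bit transfer carries exactly the 16 single-precision values that fill a VPU register.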






# **Ring Network**

| Fixed Function Logic | In-Order<br>CPU core        | In-Order<br>CPU core | ••• | In-Order<br>CPU core | In-Order<br>CPU core | Memory & I/O Interfaces |
|----------------------|-----------------------------|----------------------|-----|----------------------|----------------------|-------------------------|
|                      | Interprocessor Ring Network |                      |     |                      |                      |                         |
|                      | Coherent<br>L2 cache        | Coherent<br>L2 cache | ••• | Coherent<br>L2 cache | Coherent<br>L2 cache |                         |
|                      | Coherent<br>L2 cache        | Coherent<br>L2 cache | ••• | Coherent<br>L2 cache | Coherent<br>L2 cache |                         |
|                      | Interprocessor Ring Network |                      |     |                      |                      |                         |
|                      | In-Order<br>CPU core        | In-Order<br>CPU core | ••• | In-Order<br>CPU core | In-Order<br>CPU core |                         |



# **VPU (Vector Processing Unit)**



- 16-wide SIMD unit
  - 16-wide single precision
  - 8-wide double precision
- Hardware scatter/gather operations: 16 elements are loaded from or stored to up to 16 different addresses that are specified in another vector register.
- New instructions: fused multiply-add, and the standard logical operations, including instructions to extract non-byte-aligned
- Data can be replicated from L2 cache directly
- Free numeric type conversion and data replication while reading from memory
- Mask registers: predicated execution
- 3 source operands; one of them can come directly from L1





#### **Gather and Scatter Operation Support**

- Loads and stores from non-contiguous addresses
- Up to 16 data values can be loaded from or stored to the addresses held in another vector register.
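
The semantics of gather and scatter can be modeled in a few lines. This is a behavioral sketch only: memory is a Python list indexed by word, and the names `gather`, `scatter`, and `VLEN` are illustrative rather than actual Larrabee instruction names.

```python
VLEN = 16  # lanes in one vector register

def gather(memory, indices):
    """Load one element per lane from the address held in that lane."""
    assert len(indices) == VLEN
    return [memory[i] for i in indices]

def scatter(memory, indices, values):
    """Store one element per lane to the address held in that lane."""
    assert len(indices) == len(values) == VLEN
    for i, v in zip(indices, values):
        memory[i] = v

memory = list(range(100, 164))            # 64 words of toy memory
idx = [3 * lane for lane in range(VLEN)]  # stride-3 addresses: 0, 3, ..., 45
vec = gather(memory, idx)                 # vec[lane] == memory[3 * lane]
scatter(memory, idx, [v + 1 for v in vec])
```

After the scatter, only the 16 addressed words have changed; the words between them are untouched, which is what distinguishes this from a contiguous vector store.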





## Mask Bits






- All lanes participate in the computation in the vector unit
- Mask bits decide the write-enable signal for each lane
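
A behavioral sketch of masked execution: every lane computes, and the mask only controls which results are written back. The function `masked_add` is a hypothetical stand-in, not a real instruction name.

```python
VLEN = 16

def masked_add(dst, a, b, mask):
    """Compute a + b in every lane, but write only lanes whose mask bit is set."""
    computed = [x + y for x, y in zip(a, b)]  # all lanes do the work
    return [c if m else old                   # mask acts as the write enable
            for c, m, old in zip(computed, mask, dst)]

a = list(range(VLEN))
b = [10] * VLEN
dst = [-1] * VLEN
mask = [lane % 2 == 0 for lane in range(VLEN)]  # enable even lanes only
result = masked_add(dst, a, b, mask)
print(result[:4])  # [10, -1, 12, -1]
```

Disabled lanes keep their old destination value, so both sides of a branch can be executed over the same register with complementary masks.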


## **Predication for Load/Store**



Figure: load/store predication controlled by MASK\_bit.



# **Fixed Functions**

- Use FIFO for load balancing
- No fixed-function rasterization
- Texture filtering
  - 32KB texture cache per core
  - Core passes commands through L2 cache
- Texture units perform virtual-to-physical page translation




#### **Larrabee Data Parallelism: Threads**



Programming Larrabee: Beyond Data Parallelism

http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Seiler.pdf





# **Data Parallelism Support**

- Implementation hierarchy (your names may vary)
  - Strand: runs a data kernel in one (or more) VPU lane(s)
  - Fiber: SW-managed group of strands, like co-routines
  - Thread: HW-managed, swaps fibers to cover long latencies
  - Core: runs multiple threads to cover short latencies
- Comparison to GPU data parallelism
  - Same mechanisms as used in GPUs, except...
  - Larrabee allows SW scheduling (except for HW threads)

http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Seiler.pdf
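
Since fibers are described as co-routine-like and software-managed, Python generators give a convenient way to sketch the idea. Everything here is an illustrative assumption: the names `fiber` and `run_thread`, the round-robin policy, and the use of `yield` to stand for a point where a fiber would otherwise wait on a long-latency operation.

```python
from collections import deque

def fiber(name, batches, log):
    """A fiber: runs a kernel over batches of strands and yields wherever
    it would otherwise stall on a long-latency operation."""
    for batch in range(batches):
        log.append((name, batch))  # stand-in for running the kernel on 16 strands
        yield                      # software-visible switch point

def run_thread(fibers):
    """One hardware thread round-robins its fibers in software."""
    ready = deque(fibers)
    while ready:
        f = ready.popleft()
        try:
            next(f)
            ready.append(f)  # fiber yielded: requeue it and run another meanwhile
        except StopIteration:
            pass             # fiber finished

log = []
run_thread([fiber("A", 2, log), fiber("B", 2, log)])
print(log)  # [('A', 0), ('B', 0), ('A', 1), ('B', 1)]
```

The interleaved log shows fiber B doing useful work while A is waiting, which is the latency-hiding role the hierarchy assigns to fiber switching; the hardware threads of a core then cover the shorter stalls.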



# LRB new instruction sets

#### New Data types



- Vector arithmetic, logical, an
- Vector mask generation
- Vector load/store



Figure 3: 1-to-16 {1to16} and 4-to-16 {4to16} broadcasts.
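
The two broadcast modes named in the caption can be modeled as follows. This is a sketch of the data movement only; it assumes the 4-to-16 form repeats the 4-element group in order across the register, and the function names are illustrative.

```python
VLEN = 16

def broadcast_1to16(value):
    """{1to16}: replicate one element into all 16 lanes."""
    return [value] * VLEN

def broadcast_4to16(values):
    """{4to16}: repeat a 4-element group four times across 16 lanes
    (lane ordering is an assumption of this sketch)."""
    assert len(values) == 4
    return list(values) * (VLEN // 4)

print(broadcast_1to16(7)[:4])             # [7, 7, 7, 7]
print(broadcast_4to16([1, 2, 3, 4])[:8])  # [1, 2, 3, 4, 1, 2, 3, 4]
```

Broadcasts like these let a scalar or a small vector loaded from memory be combined with a full 16-wide register without a separate replication instruction.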




# LRB Programming Models

- Stream data handles
  - Cache control instructions
    - Prefetch into L1 or L2 caches
    - Early evictions
- CPU/GPU memory space






# Project

- Team members (1-2)
- Schedule
  - 4/12 (1 paragraph project proposal)
  - 4/14 proposal 1<sup>st</sup> feedback
  - 4/16 detailed proposal description meeting
  - 4/23 progress meeting
  - 4/28 final project presentation
    - 10 min for each team
  - 4/30 project submission




# **Suggested Topics**

- Programs
  - Game related programs
  - CUDA, Nintendo DS

- Architecture
  - Architecture survey
  - Hardware architecture simple models