

#### CS4803DGC Design and Programming of Game Consoles

Spring 2011 Prof. Hyesoon Kim







#### **Overview of GPU (Tesla) Architecture**



Georgia College of Tech Computing



#### **Execution Unit: Warp**

#### Warp is the basic unit of execution

A group of threads (e.g. 32 threads for the Tesla GPU architecture)

#### **Warp Execution**



#### SM Executes Blocks



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

#### **Blocks**

- Threads are assigned to SMs in Block granularity
  - Up to 8 Blocks to each SM as resource allows (# of blocks is dependent on the architecture)
  - SM in G80 can take up to 768 threads
    - Could be 256 (threads/block) \* 3 blocks
    - Or 128 (threads/block) \* 6 blocks, etc.
- Threads run concurrently
  - SM assigns/maintains thread id #s
  - SM manages/schedules thread execution





#### **Warp Maintaing Unit**





College of

Computing

Georgia Tech

#### Pipeline



- Fetch
  - One instruction for each warp (could be further optimizations)
  - Round Robin, Greedy-fetch (switch when stall events such as branch, I-cache misses, buffer full)
- Thread scheduling polices
  - Execute when all sources are ready
  - In-order execution within warps
  - Scheduling polices: Greedy-execution, round-robin



# No Branch Prediction. Why?

- Enough parallelism
  - Switch to another thread
  - Speculative execution is
- Branch predictor could be expensive
  Per thread predictor
- Branch elimination techniques
- Pipeline flush is too costly

# Background: CFG (Control Flow

- Basic Block
  - Def: a sequence of consecutive operations in which flow of control enters at the beginning and leaves at the end without halt or possibility of branching except at the end

Control-flow graph

- Single entry, single exit





http://www.eecs.umich.edu/~mahlke/583w04/



#### **Dominator/Postdominator**

- Defn: Dominator Given a CFG, a node x dominates a node y, if every path from the Entry block to y contains x
  - Given some BB, which blocks are guaranteed to have executed prior to executing the BB
- **Defn: Post dominator**: Given a CFG, a node x post dominates a node y, if every path from y to the Exit contains x
  - Given some BB, which blocks are guaranteed to have executed after executing the BB
  - reverse of dominator



Computing

Georgia

http://www.eecs.umich.edu/~mahlke/583w04/



#### **Immediate Post Domiantor**

- <u>Defn: Immediate post</u> <u>dominator</u> (ipdom) – Each node n has a unique immediate post dominator m that is the first post dominator of n on any path from n to the Exit
  - Closest node that post dominates
  - First breadth-first successor that post dominates a node
- Immediate post dominator is the reconvergence point of divergent branch







#### **Control Flow**

- Recap:
  - 32 threads in a warm are executed in SIMD (share one instruction sequencer)
  - Threads within a warp can be disabled (masked)
    - For example, handling bank conflicts
  - Threads contain arbitrary code including conditional branches
- How do we handle different conditions in different threads?
  - No problem if the threads are in different warps
  - Control divergence
  - Predication



#### **Eliminating Branches**

- Predication
- Loop unrolling



College of Computing

#### Predication





Convert control flow dependency to data dependency Pro: Eliminate hard-to-predict branches (in traditional architecture) Eliminate branch divergence (in CUDA) Cons: Extra instructions

Computing

# Instruction Predication in G80

- Comparison instructions set condition codes (CC)
- Instructions can be predicated to write results only when CC meets criterion (CC != 0, CC >= 0, etc.)
- Compiler tries to predict if a branch condition is likely to produce many divergent warps
  - If guaranteed not to diverge: only predicates if < 4 instructions</li>
  - If not guaranteed: only predicates if < 7 instructions</li>
- May replace branches with instruction predication
- ALL predicated instructions take execution cycles
  - Those with false conditions don't write their output
    - Or invoke memory loads and stores
  - Saves branch instructions, so can be cheaper than serializing divergent paths



Computing

Tech

### Loop Unrolling

- Transforms an M-iteration loop into a loop with M/N iterations
  - We say that the loop has been unrolled N times





#### **Reduction Example**

• Sum { 1- 100}, How to calculate?





#### **Handling Branch Instructions**

• Reduction example 0 1 2 3 4 5 6 7 If (threadId.x%==2) 0 2 4 6 If (threadId.x%==4) 0 4 If (threadId.x%==8) 0



- What about other threads?
- What about different paths?



If (threadid.x>2) { do work B} else { do work C

**Divergent branch!** 



ng et al. MICRO '07



Computing

#### **Divergent Branches**

- All branch conditions are serialized and will be executed
  - Parallel code  $\rightarrow$  sequential code
- Divergence occurs within a warp granularity.
- It's a performance issue
  - Degree of nested branches
- Depending on memory instructions, (cache hits or misses), divergent warps can occur
  - Dynamic warp subdivision [Meng'10]
- Hardware solutions to reduce divergent branches
  - Dynamic warp formation [Fung'07]

#### Stack Based Divergent Branch Execution



#### **SM Register File**

- Register File (RF)
  - 32 KB
  - Provides 4 operands/clock
- TEX pipe can also read/write RF
  - 2 SMs share 1 TEX
- Load/Store pipe can also read/write RF



College of

Computing

Georgia Tech





#### Ports vs. Banks



Banks

Georgia Tech

College of Computing

Multiple read ports

#### **SM Memory Architecture**



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

t0 t1 t2 ... tm Blocks

- Threads in a Block share data & results
  - In Memory and Shared Memory
  - Synchronize at barrier instruction
- Per-Block Shared Memory Allocation
  - Keeps data close to processor

Georgia

Tech

- Minimize trips to global Memory
- SM Shared Memory dynamically allocated to Blocks, one of the limiting resources

College of

Computing

#### Constants

- Immediate address constants
- Indexed address constants
- Constants stored in DRAM, and cached on chip
  - L1 per SM
- A constant value can be broadcast to all threads in a Warp
  - Extremely efficient way of accessing a value that is common for all threads in a Block!
  - Can reduce the number of registers.



Georgia

Tech

College of

Computing





#### Textures

- Textures are 1D,2D, 3D arrays of values stored in global DRAM
- Textures are cached in L1 and L2
- Read-only access
- Caches are optimized for 2D access:
  - Threads in a warp that follow 2D locality will achieve better memory performance

Georgia

 Texels: elements of the arrays, texture elements

https://users.ece.utexas.edu/~merez/new/pmwiki.php/EE382VFa07/Schedule?action=download&upname=EEEextra07\_Lect13\_G80Mem.pdf

Computing

#### **Exploiting the Texture Samplers**

- Designed to map textures onto 3D polygons
- Specialty hardware pipelines for:
  - Fast data sampling from 1D, 2D, 3D arrays
  - Swizzling of 2D, 3D data for optimal access
  - Bilinear filtering in zero cycles
  - Image compositing & blending operations
- Arrays indexed by u,v,w coordinates easy to program
- Extremely well suited for multigrid & finite difference methods



College of

Computing

Georgia

Tech

### **GPU Memory System**



- Many levels of queues
- Large size of queues
- High number of DRAM banks
- Sensitive to memory scheduling algorithms
  - FRFCFS >> FCFS
- Interconnection network algorithm to get FRFCFS Effects
  - Yan'09,



### **Multiple In-flight Memory Requests**

- In-order execution but
- Warp cannot execute an instruction when sources are dependent on memory instructions, not when it generates memory requests
- High MLP ( M ) **W0 Context Switch W1** Computing

#### Same Data from Multiple Threads (SDMT)



Techniques to take advantages of this SDMT

College of

Computing

Georgia

Tech

- Compiler optimization[Yang'10]: increase memory reuse
- Cache coherence [Tarjan'10]
- Cache increase reuses



#### **Shared Memory: Bank Addressing Examples**





Tech

Thread 0 Thread 1



#### Data types and bank conflicts

• This has no conflicts if type of shared is 32-bits:

foo = shared[baseIndex + threadIdx.x]

- But not if the data type is smaller
  - 4-way bank conflicts:

```
___shared___ char shared[];
```

foo = shared[baseIndex + threadIdx.x];









### Synchronization Model

- Bulk-Synchronous Parallel (BSP) program (Valiant [90])
- Synchronization within blocks using explicit barrier
- Implicit barrier across kernels
  - Kernel 1  $\rightarrow$  Kernel 2
  - C.f.) Cuda 3.x





Computing

#### **Global Communications**

- Use multiple kernels
- Write to same memory addresses
  - Behavior is not guaranteed
  - Data race
- Atomic operation
  - No other threads can write to the same location
  - Memory Write order is still arbitrary
  - Keep being updated: atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}
- Performance degradation
  - Fermi increases atomic performance by 5x to 20x (M. Shebanow)



# **FERMI ARCHITECTURE**

White paper, NVIDIA's Next Generation, CUDA Compute Architecture Fermi

White paper: World's Fastest GPU Delivering Great Gaming Performance with True Geometric Realism

Tech

Computing

# Major Architecture Changes in Fermi

- SM
  - 32 CUDA cores per SM (fully 32-lane)
  - Dual Warp Scheduler and dispatches from two independent warps
  - 64KB of RAM with a configurable partitioning of shared memory and L1 cache
- Programming support
  - Unified Address Space with Full C++ Support
  - Full 32-bit integer path with 64-bit extensions
  - Memory access instructions to support transition to 64-bit addressing
- Memory system
  - Data cache, ECC support, Atomic memory operations

Computing

Concurrent kernel execution



- Better integer suppor
- 4 SFU
- Dual warp scheduler
- More TLP
- 16 cores are execute together





Fermi Streaming Multiprocessor (SM)



| GPU                       | G80         | GT200              | Fermi                          |
|---------------------------|-------------|--------------------|--------------------------------|
| Transistors               | 681 million | 1.4 billion        | 3.0 billion                    |
| CUDA Cores                | 128         | 240                | 512                            |
| Double Precision Floating | None        | 30 FMA ops / clock | 256 FMA ops /clock             |
| Point Capability          |             |                    |                                |
| Single Precision Floating | 128 MAD     | 240 MAD ops /      | 512 FMA ops /clock             |
| Point Capability          | ops/clock   | clock              |                                |
| Special Function Units    | 2           | 2                  | 4                              |
| (SFUs) / SM               |             |                    |                                |
| Warp schedulers (per SM)  | 1           | 1                  | 2                              |
| Shared Memory (per SM)    | 16 KB       | 16 KB              | Configurable 48 KB or<br>16 KB |
| L1 Cache (per SM)         | None        | None               | Configurable 16 KB or<br>48 KB |
| L2 Cache                  | None        | None               | 768 KB                         |
| ECC Memory Support        | No          | No                 | Yes                            |
| Concurrent Kernels        | No          | No                 | Up to 16                       |
| Load/Store Address Width  | 32-bit      | 32-bit             | 64-bit                         |