

#### **CS4803DGC Design Game Consoles**

Spring 2009 Prof. Hyesoon Kim



### CUDA

- "Compute Unified Device Architecture"
- Available for GeForce 8, 9 Series, Quadro FX5600/4600, and Tesla solutions
- Targeted software stack
  - Compute oriented drivers, language, and tools
- Driver for loading computation programs into GPU
  - Standalone Driver Optimized for computation
  - Interface designed for compute graphics free API

Georgia

College of

Tech // Computing

 Cuda provides general DRAM memory addressing (just like CPU)



# Why Programming with GPU?

- A quiet revolution and potential build-up
  - Calculation: 367 GFLOPS vs. 32 GFLOPS
  - Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
  - Until a few years, programmed through graphics API





#### Memory Bandwidth for the CPU and GPU



# CUDA Programming Model: A Highly Multithreaded Coprocessor

- The GPU is viewed as a compute device that:
  - Is a coprocessor to the CPU or host
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
- Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight
    - Very little creation overhead
  - GPU needs 1000s of threads for full efficiency

Georgia

College of

Tech // Computing

• Multi-core CPU needs only a few

# An Example of Physical Reality Behind CUDA





### Parallel Computing on a GPU

- NVIDIA GPU Computing Architecture
  - Via a separate HW interface
  - In laptops, desktops, workstations, servers
- 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
- GPU parallelism is doubling every year
- Programming model scales transparently
- Programmable in C with CUDA tools
- Multithreaded SPMD model uses application data parallelism and thread parallelism



Computing





16 highly threaded SM's, >128 FPU's, 367 GFLOPS, 768 MB DRAM, 86.4 GB/S Mem BW, 4GB/S BW to CPU





### **Code Example (HelloWorld)**





#### **Output of Helloworld**

dim3 threads (1, 2, 4); dim3 grid (2,1); helloworld<<< grid, threads >>> ();

Hello World! I am a thread with BlockId: {0,0}, ThreadId:{0,0,0} Hello World! I am a thread with BlockId: {0,0}, ThreadId:{0,1,0} Hello World! I am a thread with BlockId: {0,0}, ThreadId:{0,0,1} Hello World! I am a thread with BlockId: {0,0}, ThreadId:{0,1,1} Hello World! I am a thread with BlockId: {0,0}, ThreadId:{0,0,2} Hello World! I am a thread with BlockId: {0,0}, ThreadId:{0,1,2} Hello World! I am a thread with BlockId: {0,0}, ThreadId:{0,0,3} Hello World! I am a thread with BlockId: {0,0}, ThreadId:{0,1,3} Hello World! I am a thread with BlockId: {1,0}, ThreadId:{0,0,0} Hello World! I am a thread with BlockId: {1,0}, ThreadId:{0,1,0} Hello World! I am a thread with BlockId: {1,0}, ThreadId:{0,0,1} Hello World! I am a thread with BlockId: {1,0}, ThreadId:{0,1,1} Hello World! I am a thread with BlockId: {1,0}, ThreadId:{0,0,2} Hello World! I am a thread with BlockId: {1,0}, ThreadId:{0,1,2} Hello World! I am a thread with BlockId: {1,0}, ThreadId:{0,0,3} Hello World! I am a thread with BlockId: {1,0}, ThreadId:{0,1,3}

> College of Tech Computing

Georgia



#### **Extended C**

- Declspecs
  - global, device, shared, local, constant
- Keywords
  - threadIdx, blockIdx
- Intrinsics
  - \_\_\_syncthreads
- Runtime API
  - Memory, symbol, execution management

```
__device__ float filter[N];
__global__ void convolve (float *image)
__shared__ float region[M];
...
region[threadIdx] = image[i];
__syncthreads()
...
image[j] = result;
}
```

```
// Allocate GPU memory
void *myimage = cudaMalloc(bytes)
```

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

Function launch



### CUDA Programming Model: A Highly Multithreaded Coprocessor

- The GPU is viewed as a compute device that:
  - Is a coprocessor to the CPU or host
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
- Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads

Georgia

**lech** 

College of

Computing

- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight
    - Very little creation overhead
  - GPU needs 1000s of threads for full efficiency
    - Multi-core CPU needs only a few



#### **G80** Characteristics

- 367 GFLOPS peak performance (25-50 times of current high-end microprocessors)
- Massively parallel, 128 cores, 90W
- Massively threaded, sustains 1000s of threads per app

Georgia

College of Computing

 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics





Tech

#### CUDA

- CUDA is a programming system for utilizing the G80 processor for compute
  - CUDA follows the architecture very closely
- General purposed programming model
  - User kicks off batches of threads on the GPU
  - GPU = dedicated super-threaded, massively data parallel processor

Matches architecture features Specific parameters are not exposed

https://users.ece.utexas.edu/~merez/new/pmwiki.php/EE382VFa07/Schedule?action=download&upname=EE382V\_Fa07\_Lect13\_G80Mem.pdiGeorgia

# Programming model: Block and Thread

- A kernel is executed as a grid of thread blocks
- Threads and blocks have IDs
  - So each thread can decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes









#### **Hardware Model**



Georgia Tech College of Computing



# **CUDA Device Memory Space**

#### Overview

- Each thread can:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
- The host can R/W global, constant, and texture memories



#### Global, Constant, and Texture Memories (Long Latency Accesses)

- Global memory

   Main means of communicating R/W Data between host and device
  - Contents visible to all threads
- Texture and Constant Memories
  - Constants initialized by Host
  - Contents visible to all threads



**Techurtesym NDVIA** 



Tech // Computing

#### **Execution Model**





• CUDA – API





#### CUDA Highlights: Easy and Lightweight

The API is an extension to the ANSI C
 programming language

Low learning curve

• The hardware is designed to enable lightweight runtime and driver

High performance





- NOT part of CUDA
- It will be frequently used in many code examples
  - 2 D matrix
  - single precision float elements
  - width \* height elements
  - pitch is meaningful when the matrix is actually a sub-matrix of another matrix
  - data elements allocated and attached to elements

typedef struct {
 int width;
 int height;
 int pitch;
 float\* elements;
} Matrix;





### **CUDA Device Memory Allocation**

- cudaMalloc()
  - Allocates object in the device <u>Global Memory</u>
  - Requires two parameters
    - Address of a pointer to the allocated object
    - Size of allocated object
- cudaFree()
  - Frees object from device Global Memory
    - Pointer to freed object



# CUDA Device Memory Allocation

- Code example:
  - Allocate a 64 \* 64 single precision float array
  - Attach the allocated storage to Md.elements
  - "d" is often used to indicate a device data structure

```
BLOCK_SIZE = 64;
Matrix Md
int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);
```

```
cudaMalloc((void**)&Md.elements, size);
cudaFree(Md.elements);
```

Georgia

College of Computing



#### **CUDA Host-Device Data Transfer**

- cudaMemcpy()
  - memory data transfer
  - Requires four parameters
    - Pointer to source
    - · Pointer to destination
    - Number of bytes copied
    - Type of transfer
      - Host to Host
      - Host to Device
      - Device to Host
      - Device to Device
- Asynchronous in CUDA 1.0



# CUDA Host-Device Data Transfer

- Code example:
  - Transfer a 64 \* 64 single precision float array
  - M is in host memory and Md is in device memory
  - cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);

cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost);





#### **CUDA Function Declarations**

|                                      | Executed on the: | Only callable from the: |
|--------------------------------------|------------------|-------------------------|
| <pre>device float DeviceFunc()</pre> | device           | device                  |
| global void KernelFunc()             | device           | host                    |
| <pre>host float HostFunc()</pre>     | host             | host                    |

- \_\_global\_\_ defines a kernel function
  - Must return void

# CUDA Function Declarations

- \_\_device\_\_ functions cannot have their address taken
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments



Tech // Computing

#### **Review: Execution Model**





Georgia College of

Tech Computing

#### **Review: Calling a Kernel Function – Thread Creation**

• A kernel function must be called with an execution configuration:

\_\_\_global\_\_\_ void KernelFunc(...);

dim3 DimGrid(100, 50); // 5000 thread blocks

- dim3 DimBlock(4, 8, 8); // 256 threads per block
- size\_t SharedMemBytes = 64; // 64 bytes of shared
   memory

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes
>>>(...);



#### **Elementwise Matrix Addition**







#### **Elementwise Matrix Addition**

#### CPU Program

#### **GPU Program**

```
void add matrix
                                           global add matrix
(float *a, float* b, float *c, int N) {
                                         (float *a, float *b, float *c, int N) {
                                        int i = blockIdx.x * blockDim.x + threadIdx.x;
  int index:
  for (int i = 0; i < N; ++i)
                                        Int j = blockIdx.y * blockDim.y + threadIdx.y;
    for (int j = 0; j < N; ++j) {
                                        int index = i + j^*N;
      index = i + j^*N;
                                        if (i < N \&\& j < N)
      c[index] = a[index] + b[index];
                                          c[index] = a[index]+b[index];
                                        Int main() {
int main () {
                                         dim3 dimBlock( blocksize, blocksize);
                                         dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
  add matrix (a, b, c, N);
                                         add matrix<<<dimGrid, dimBlock>>>( a, b, c, N);
```





#### A Simple Running Example Matrix Multiplication

- A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
  - Leave shared memory usage until later
  - Local, register usage
  - Thread ID usage
  - Memory data transfer API between host and device





- NOT part of CUDA
- It will be frequently used in many code examples
  - 2 D matrix
  - single precision float elements
  - width \* height elements
  - pitch is meaningful when the matrix is actually a sub-matrix of another matrix
  - data elements allocated and attached to elements

typedef struct {
 int width;
 int height;
 int pitch;
 float\* elements;
} Matrix;





#### Programming Model: Square Matrix Multiplication Example

- P = M \* N of size WIDTH x WIDTH
- Without tiling:
  - One thread handles one element of P
  - M and N are loaded WIDTH times from global memory





Tech || Computing

#### **Step 1: Matrix Data Transfers**

```
// Allocate the device memory where we will copy M to
Matrix Md;
Md.width = WIDTH;
Md.height = WIDTH;
Md.pitch = WIDTH;
int size = WIDTH * WIDTH * sizeof(float);
cudaMalloc((void**)&Md.elements, size);
// Copy M from the host to the device
cudaMemcpy(Md.elements, M.elements, size,
   cudaMemcpyHostToDevice);
// Read M from the device to the host into P
cudaMemcpy(P.elements, Md.elements, size,
   cudaMemcpyDeviceToHost);
. . .
// Free device memory
cudaFree(Md.elements);
                                              College of
                                        Georgia
```



Georgia

**lech** 

College of

Computing

#### Step 2: Matrix Multiplication A Simple Host Code in C

// Matrix multiplication on the (CPU) host in double precision
// for simplicity, we will assume that all dimensions are equal

```
void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
```

```
for (int i = 0; i < M.height; ++i)
for (int j = 0; j < N.width; ++j) {
    double sum = 0;
    for (int k = 0; k < M.width; ++k) {
        double a = M.elements[i * M.width + k];
        double b = N.elements[k * N.width + j];
        sum += a * b;
    }
    P.elements[i * N.width + j] = sum;
}</pre>
```



#### **Multiply Using One Thread Block**

- One Block of threads compute matrix P
  - Each thread computes one element of P
- Each thread
  - Loads a row of matrix M
  - Loads a column of matrix N
  - Perform one multiply and addition for each pair of M and N elements
  - Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block



# Step 3: Matrix Multiplication Host-side Main Program Code

College of

Computing

Georgia Tech

```
int main(void) {
// Allocate and initialize the matrices
Matrix M = AllocateMatrix(WIDTH, WIDTH, 1);
Matrix N = AllocateMatrix(WIDTH, WIDTH, 1);
Matrix P = AllocateMatrix(WIDTH, WIDTH, 0);
```

```
// M * N on the device
MatrixMulOnDevice(M, N, P);
```

```
// Free matrices
    FreeMatrix(M);
    FreeMatrix(N);
    FreeMatrix(P);
return 0;
}
```



#### Step 3: Matrix Multiplication Host-side code

// Matrix multiplication on the device
void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P)
{
 // Load M and N to the device

Matrix Md = AllocateDeviceMatrix(M); CopyToDeviceMatrix(Md, M); Matrix Nd = AllocateDeviceMatrix(N); CopyToDeviceMatrix(Nd, N);

// Allocate P on the device
Matrix Pd = AllocateDeviceMatrix(P);
CopyToDeviceMatrix(Pd, P); // Clear memory



College of

Computing

Georgia Tech

#### Step 3: Matrix Multiplication Host-side Code (cont.)

// Setup the execution configuration dim3 dimBlock(WIDTH, WIDTH); dim3 dimGrid(1, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);

// Read P from the device CopyFromDeviceMatrix(P, Pd);

// Free device matrices
FreeDeviceMatrix(Md);
FreeDeviceMatrix(Nd);
FreeDeviceMatrix(Pd);



#### Step 4: Matrix Multiplication Device-side Kernel Function

// Matrix multiplication kernel – thread specification
\_\_\_\_global\_\_\_ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
// OD Thread ID

```
// 2D Thread ID
int tx = threadIdx.x;
int ty = threadIdx.y;
```

// Pvalue is used to store the element of the matrix
// that is computed by the thread
float Pvalue = 0;

Georgia

College of Computing







College of

Computing

Georgia

Tech

```
// Allocate a device matrix of same size as M.
Matrix AllocateDeviceMatrix(const Matrix M)
  Matrix Mdevice = M;
  int size = M.width * M.height * sizeof(float);
  cudaMalloc((void**)&Mdevice.elements, size);
  return Mdevice;
// Free a device matrix.
void FreeDeviceMatrix(Matrix M) {
  cudaFree(M.elements);
void FreeMatrix(Matrix M) {
  free(M.elements);
```



#### Step 5: Some Loose Ends (cont.)

```
// Copy a host matrix to a device matrix.
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost)
  int size = Mhost.width * Mhost.height * sizeof(float);
  cudaMemcpy(Mdevice.elements, Mhost.elements, size,
        cudaMemcpyHostToDevice);
// Copy a device matrix to a host matrix.
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
  int size = Mdevice.width * Mdevice.height * sizeof(float);
  cudaMemcpy(Mhost.elements, Mdevice.elements, size,
        cudaMemcpyDeviceToHost);
```

Georgia

College of Computing



#### **Access Times**

- Register dedicated HW single cycle
- Shared Memory dedicated HW single cycle
- Local Memory DRAM, no cache \*slow\*
- Global Memory DRAM, no cache \*slow\*
- Constant Memory DRAM, cached, 1...10s...100s of cycles, depending on cache locality
- Texture Memory DRAM, cached, 1...10s...100s of cycles, depending on cache locality
- Instruction Memory (invisible) DRAM, cached



#### **How about performance?**

- All threads access global memory for their input matrix elements
  - Two memory accesses (& bytes) per floating point multiply-add
  - 4B/s of memory bandwidth/FLOPS
  - 86.4 GB/s limits the code at 21.6 GFLOPS
- The actual code should run at about 15 GFLOPS
- Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS



## Idea: Use Shared Memory to reuse global memory data

- Each input element is read by WIDTH threads.
- If we load each element into Shared Memory and have several threads use the local version, we can drastically reduce the memory bandwidth

College of

Computing

Georgia

**Tech** 

- Load all the matrix ?
- Tiled algorithms
- Pattern
  - Copy data from global to shared memory
  - Synchronization
  - Computation (iteration)
  - Synchronization
  - Copy data from shared to global memory

## Blocked (Tiled) Matrix Multiply

Consider A,B,C to be N by N matrices of b by b subblocks where b=n / N is called the block size for i = 1 to N for j = 1 to N {read block C(i,j) into shared memory} for k = 1 to N {read block A(i,k) into shared memory} {read block B(k,j) into shared memory} C(i,j) = C(i,j) + A(i,k) \* B(k,j) {do a matrix multiply on blocks} {write block C(i,j) back to global memory}







## Blocked (Tiled) Matrix Multiply





www.sdsc.edu/~allans/cs260/lectures/matmul.ppt















## **Tiled Multiply Using Thread Blocks**

- One block computes one square submatrix P<sub>sub</sub> of size BLOCK\_SIZE
- One thread computes one element of P<sub>sub</sub>
- Assume that the dimensions of M and N are multiples of BLOCK\_SIZE and square shape

0

2

bsize-1 🗖

by 1





Tech 🛛 Computing

#### **Shared Memory Usage**

- Each SMP has 16KB shared memory
  - Each Thread Block uses 2 \*256\*4B = 2KB of shared memory. [2: two matrix, 256 = 16\*16, 4B (floating point)]
  - Can potentially have up to 8 Thread Blocks actively executing
  - Initial load:
    - For BLOCK\_SIZE = 16, this allows up to 8\*512 = 4,096 pending loads (8 blocks, 2 loads \* 256)
    - In practice, there will probably be up to half of this due to scheduling to make use of SPs.
  - The next BLOCK\_SIZE 32 would lead to 2\*32\*32\*4B= 8KB shared memory usage per Thread Block, allowing only up to two Thread Blocks active at the same time



# CUDA Code – Kernel Execution

For very large N and M dimensions, one will need to add another level of blocking and execute the second-level blocks sequentially.



Tech || Computing

#### **CUDA Code – Kernel Overview**

```
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
```

// Pvalue stores the element of the block sub-matrix
// that is computed by the thread
float Pvalue = 0;

// Loop over all the sub-matrices of M and N
// required to compute the block sub-matrix
for (int m = 0; m < M.width/BLOCK\_SIZE; ++m) {
 code from the next few slides };</pre>





Tech // Computing

# CUDA Code - Load Data to Shared Memory

// Get a pointer to the current sub-matrix Msub of M
Matrix Msub = GetSubMatrix(M, m, by);

// Get a pointer to the current sub-matrix Nsub of N
Matrix Nsub = GetSubMatrix(N, bx, m);

\_\_shared\_\_ float Ms[BLOCK\_SIZE][BLOCK\_SIZE]; \_\_shared\_\_ float Ns[BLOCK\_SIZE][BLOCK\_SIZE];

// each thread loads one element of the sub-matrix
Ms[ty][tx] = GetMatrixElement(Msub, tx, ty);

// each thread loads one element of the sub-matrix
Ns[ty][tx] = GetMatrixElement(Nsub, tx, ty);



Tech // Computing

#### **CUDA Code - Compute Result**



#### **CUDA Code - Save Result**

## // Get a pointer to the block sub-matrix of P Matrix Psub = GetSubMatrix(P, bx, by);

// Write the block sub-matrix to device memory; // each thread writes one element

SetMatrixElement(Psub, tx, ty, Pvalue);

Macro functions will be provided.





Computing

#### **Device Runtime Component:** Synchronization Function

- void \_\_\_\_\_syncthreads();
- Synchronizes all threads in a block
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block

```
if (tid>16) {__syncthreads(); code1 ... } T HOTT
else { code1; }
Georgia
```



Some Useful Information on Tools





#### Compilation

- Any source file containing CUDA language extensions must be compiled with nvcc
- nvcc is a compiler driver
  - Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
- nvcc can output:
  - Either C code
    - That must then be compiled with the rest of the application using another tool
  - Or object code directly





Computing

**Tech** 

#### Debugging Using the Device Emulation Mode

- An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
  - No need of any device and CUDA driver (??)
  - Each device thread is emulated with a host thread
- When running in device emulation mode, one can:
  - Use host native debug support (breakpoints, inspection, etc.)
  - Access any device-specific data from host code and vice-versa
  - Call any host function from device code (e.g. printf) and vice-versa



#### **Device Emulation Mode Pitfalls**

- Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results.
- Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
- Results of floating-point computations will slightly differ because of:
  - Different compiler outputs, instruction sets
  - Use of extended precision for intermediate results
    - There are various options to force strict single precision on the host





#### **Blocks must be Indepdent**

- Blocks may coordinate but not synchronize
  - Shared queue pointer:OK
  - Shared block: Bad…
- Thread blocks can run in any order
  - Concurrently or sequentially
  - Facilitates scaling of the same code across many devices





#### Linking

- Any executable with CUDA code requires two dynamic libraries:
  - The CUDA runtime library (cudart)
  - The CUDA core library (cuda)



Some Additional API Features







#### Language Extensions: Built-in Variables

- dim3 gridDim;
  - Dimensions of the grid in blocks (gridDim.z unused)
- dim3 blockDim;
  - Dimensions of the block in threads
- dim3 blockIdx;
  - Block index within the grid
- dim3 threadIdx;
  - Thread index within the block



#### **Common Runtime Component**

- Provides:
  - Built-in vector types
  - A subset of the C runtime library supported in both host and device codes

Georgia

Tech

College of

Computing

## Common Runtime Component: Built-in Vector Types

- [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]
  - Structures accessed with x, y, z, w fields:

Georgia

Tech

College of

Computing

uint4 param; int y = param.y;

- dim3
  - Based on uint3
  - Used to specify dimensions

### **Common Runtime Component:** Mathematical Functions

- pow, sqrt, cbrt, hypot
- exp, exp2, expm1
- log, log2, log10, log1p
- sin, cos, tan, asin, acos, atan, atan2
- sinh, cosh, tanh, asinh, acosh, atanh
- ceil, floor, trunc, round
- Etc.
  - When executed on the host, a given function uses the C runtime implementation if available
  - These functions are only supported for scalar types, not vector types

Georgia

College of

Computing



#### **Host Runtime Component**

- Provides functions to deal with:
  - Device management (including multi-device systems)
  - Memory management
  - Error handling
- Initializes the first time a runtime function is called
- A host thread can invoke device code on only one device
  - Multiple host threads required to run on multiple devices



Georgia

College of

Computing

#### Host Runtime Component: Memory Management

- Device memory allocation
  - cudaMalloc(),cudaFree()
- Memory copy from host to device, device to host, device to device
  - cudaMemcpy(),cudaMemcpy2D(), cudaMemcpyToSymbol(), cudaMemcpyFromSymbol()
- Memory addressing
  - cudaGetSymbolAddress()



College of

Computing

Georgia

Tech

## Mathematical Functions

 Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. \_\_sin(x))



- \_\_\_exp

• SFU