School of Computer Science

Georgia Institute of Technology

CS4803DGC, Spring 2011
Programming Assignment #1
Due: Friday, Feb. 18, 6:00 pm for an extra 10% (extended deadline: Monday, Feb. 21, 6:00 pm)
Hyesoon Kim, Instructor

Introduction

This is an individual assignment. In this assignment, you will design several micro-benchmarks that reveal the performance characteristics of GPUs. Each kernel has to run long enough for its execution time to be measured reliably. Submit a report that describes the basic idea of each benchmark and analyzes the results.

  • Peak Performance Benchmark
    Write a benchmark that achieves the highest FLOPS you can. Hint: use FMA (fused multiply-add) instructions.
    Calculate the FLOPS by counting the floating-point operations manually (i.e., how many FP instructions are in the loop, and how many times is the loop executed?). Include the results in the report. Vary the number of threads and blocks, and report those numbers alongside the results. Does the peak FLOPS vary as you vary the number of threads or blocks? Explain the results.

  • Memory latency measurement benchmark
    Start from a simple sequential memory access pattern. Vary the starting address of the memory instruction and see whether the performance varies as well.

  • Coalesced vs. Uncoalesced benchmarks
    Change the memory access pattern to generate uncoalesced memory accesses. For example, LD A[tid+3] shifts each warp's accesses off their natural alignment, and LD A[tid*3] generates a strided access pattern. Do you see a performance difference between sequential and uncoalesced memory accesses?

  • Arithmetic Intensity
    Write a program whose arithmetic intensity you can vary. Plot the performance vs. arithmetic intensity. Vary the number of threads and blocks (at least 15 cases).

  • Peak memory bandwidth
    Write a program that achieves peak memory bandwidth. Using the same kernel, vary the number of threads and blocks. The results should show bandwidth saturation beyond a certain number of threads and blocks. Compare your results with the bandwidth test program in the CUDA SDK.
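
For the peak performance benchmark, the manual FLOPS count could be worked out along the following lines. This is only a sketch in Python, not part of the assignment: the helper name measured_flops, its parameters, and the example numbers are all hypothetical. It assumes that every thread executes the same loop and that each FMA counts as two floating-point operations (one multiply and one add).

```python
def measured_flops(fma_per_iter, iterations, threads_per_block, blocks, elapsed_s):
    """Achieved FLOPS from a manual operation count (hypothetical helper)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    flops_per_fma = 2  # assumption: one FMA = one multiply + one add
    total_threads = threads_per_block * blocks
    total_ops = flops_per_fma * fma_per_iter * iterations * total_threads
    return total_ops / elapsed_s


# Made-up measurement: 16 FMAs per loop iteration, 1,000,000 iterations,
# 256 threads per block, 120 blocks, 1.5 s of kernel time.
gflops = measured_flops(16, 1_000_000, 256, 120, 1.5) / 1e9
print(f"{gflops:.2f} GFLOPS")
```

Repeating the calculation for each thread and block configuration gives the data points for the report.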
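
For the coalesced vs. uncoalesced benchmark, it can help to estimate how many memory segments one warp touches under a given index pattern before measuring it. The Python sketch below is a deliberately simplified model, not a description of any particular GPU: it assumes a 32-thread warp, 4-byte elements, and 128-byte aligned segments, and the name segments_touched is hypothetical. The actual coalescing rules depend on the compute capability of the device.

```python
def segments_touched(index_of, warp_size=32, elem_bytes=4, segment_bytes=128):
    """Count distinct aligned segments accessed by one warp (simplified model)."""
    addresses = [index_of(tid) * elem_bytes for tid in range(warp_size)]
    return len({addr // segment_bytes for addr in addresses})


sequential = segments_touched(lambda tid: tid)       # A[tid]
offset = segments_touched(lambda tid: tid + 3)       # A[tid+3]
stride_3 = segments_touched(lambda tid: tid * 3)     # A[tid*3]
stride_32 = segments_touched(lambda tid: tid * 32)   # A[tid*32]
print(sequential, offset, stride_3, stride_32)
```

Under these assumptions, more segments per warp means more memory traffic for the same amount of useful data, which is the effect the benchmark is meant to expose.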
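
For the arithmetic intensity benchmark, one way to choose the sweep is to compute the intensity of each kernel variant ahead of time. The Python sketch below assumes arithmetic intensity is defined as floating-point operations per byte of global memory traffic, which the assignment does not state explicitly, and that each element is loaded once, updated with a configurable number of FMAs, and stored once. The name arithmetic_intensity and its parameters are hypothetical.

```python
def arithmetic_intensity(fma_per_element, loads_per_element=1,
                         stores_per_element=1, elem_bytes=4):
    """FLOPs per byte of memory traffic (assumed definition)."""
    flops = 2 * fma_per_element  # assumption: one FMA = two FLOPs
    bytes_moved = (loads_per_element + stores_per_element) * elem_bytes
    return flops / bytes_moved


# Sweep the FMA count per element to obtain a range of intensities.
sweep = {n: arithmetic_intensity(n) for n in (1, 2, 4, 8, 16, 32)}
for fma_count, intensity in sweep.items():
    print(fma_count, intensity)
```

Each intensity value becomes one x-coordinate in the performance plot.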
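
For the peak memory bandwidth benchmark, the achieved bandwidth follows from the bytes moved and the measured kernel time. The Python sketch below assumes a simple copy kernel that reads and writes each 4-byte element once. The helper name effective_bandwidth_gbs and the example numbers are hypothetical, and GB here means 10^9 bytes.

```python
def effective_bandwidth_gbs(num_elements, elapsed_s, elem_bytes=4,
                            reads_per_element=1, writes_per_element=1):
    """Achieved bandwidth in GB/s (10^9 bytes per second)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    accesses = reads_per_element + writes_per_element
    bytes_moved = num_elements * elem_bytes * accesses
    return bytes_moved / elapsed_s / 1e9


# Made-up measurement: copying 64 Mi floats in 5 ms.
bandwidth = effective_bandwidth_gbs(64 * 1024 * 1024, 0.005)
print(f"{bandwidth:.1f} GB/s")
```

Plotting this value against the number of threads and blocks should show where the bandwidth stops increasing.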