School of Computer Science

Georgia Institute of Technology

CS4803DGC, Spring 2010
Programming Assignment #2
Due: Monday, Feb. 1 6:00 pm
Hyesoon Kim, Instructor

Introduction

This is an individual assignment. In this assignment, you will implement a tiled matrix multiplication using CUDA.

1) Untar lab2.tar.gz into ~/NVIDIA_GPU_Computing_SDK/C/src

Instructions:
cd ~/NVIDIA_GPU_Computing_SDK/C/src
tar -xvf lab2.tar.gz
cd lab2
make
To run the program:
../../bin/linux/release/matrixmul
2) Edit the source files matrixmul.cu and matrixmul_kernel.cu to complete the functionality of the matrix multiplication on the device. The two input matrices may be of any size, but the resulting matrix is guaranteed to have fewer than 64,000 elements.
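
The tiling scheme can be illustrated on the CPU before it is ported to a kernel. The following is a minimal sketch in plain C, not the provided skeleton: the function name tiled_matmul, the TILE constant, and the parameter names are assumptions made for illustration. Each (by, bx) pair stands in for one thread block, the local arrays Ms and Ns stand in for shared memory, and the bounds checks show one way to handle matrix sizes that are not multiples of the tile width.

```c
#include <assert.h>

#define TILE 8  /* assumed tile width; mirrors the default block size */

/* CPU emulation of tiled multiplication, P = M * N, all row-major.
 * M is m_h x m_w, N is m_w x n_w, P is m_h x n_w (illustrative names). */
void tiled_matmul(const float *M, const float *N, float *P,
                  int m_h, int m_w, int n_w)
{
    assert(m_h > 0 && m_w > 0 && n_w > 0);
    int tiles_y = (m_h + TILE - 1) / TILE;  /* round up so edge tiles are covered */
    int tiles_x = (n_w + TILE - 1) / TILE;
    int tiles_k = (m_w + TILE - 1) / TILE;

    for (int by = 0; by < tiles_y; ++by) {      /* one (by, bx) pair per thread block */
        for (int bx = 0; bx < tiles_x; ++bx) {
            float acc[TILE][TILE] = {{0.0f}};   /* one accumulator per thread */
            for (int t = 0; t < tiles_k; ++t) {
                float Ms[TILE][TILE], Ns[TILE][TILE];  /* stand-ins for shared memory */
                /* Load phase: pad out-of-range entries with zero. */
                for (int ty = 0; ty < TILE; ++ty) {
                    for (int tx = 0; tx < TILE; ++tx) {
                        int row = by * TILE + ty, col = bx * TILE + tx;
                        int mk = t * TILE + tx, nk = t * TILE + ty;
                        Ms[ty][tx] = (row < m_h && mk < m_w) ? M[row * m_w + mk] : 0.0f;
                        Ns[ty][tx] = (nk < m_w && col < n_w) ? N[nk * n_w + col] : 0.0f;
                    }
                }
                /* Compute phase: on the GPU this sits between two barriers. */
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx)
                        for (int k = 0; k < TILE; ++k)
                            acc[ty][tx] += Ms[ty][k] * Ns[k][tx];
            }
            /* Store phase: skip positions that fall outside P. */
            for (int ty = 0; ty < TILE; ++ty) {
                for (int tx = 0; tx < TILE; ++tx) {
                    int row = by * TILE + ty, col = bx * TILE + tx;
                    if (row < m_h && col < n_w)
                        P[row * n_w + col] = acc[ty][tx];
                }
            }
        }
    }
}
```

In a CUDA version, the two outer loops would typically become the grid, the ty/tx loops would become the threads of a block, and a barrier would separate the load and compute phases. Zero padding keeps the inner product loop free of branches, at the cost of some wasted work in edge tiles.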

3) There are several modes of operation for the application.

No arguments: The application will create two randomly sized and initialized matrices such that the matrix operation M * N is valid, and P is properly sized to hold the result. After the device multiplication is invoked, it will compute the correct solution matrix using the CPU and compare that solution with the device-computed solution. If they match (within a certain tolerance), it will print "Test PASSED" to the screen before exiting.
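
A tolerance-based comparison of this kind might look like the sketch below. It assumes an absolute-error test; the comparison and tolerance actually used by the provided test harness are not stated here, and the name results_match and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every device value is within `tol` of the CPU reference,
 * 0 otherwise. Assumes an absolute-error criterion (an illustration only). */
int results_match(const float *reference, const float *device,
                  size_t n, float tol)
{
    assert(tol >= 0.0f);
    for (size_t i = 0; i < n; ++i) {
        float diff = reference[i] - device[i];
        if (diff < 0.0f)
            diff = -diff;
        /* Written as !(diff <= tol) so that a NaN result also fails. */
        if (!(diff <= tol))
            return 0;
    }
    return 1;
}
```

An exact equality test would be too strict, because the device and the CPU may accumulate the floating-point products in a different order and round differently.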

One argument: The application will use the random initialization to create the input matrices, and write the device-computed output to the file specified by the argument.

Three arguments: The application will read input matrices from provided files. The first argument should be a file containing three integers. The first, second and third integers will be used as M.height, M.width, and N.height. The second and third arguments are expected to be files that have exactly enough entries to fill matrices M and N, respectively. No output is written to a file.

Four arguments: The application will read its inputs from the files provided by the first three arguments as described above, and write its output to the file provided in the fourth.

Note that if you wish to use the output of one run of the application as an input, you must delete the first line in the output file, which displays the accuracy of the values within the file. The value is not relevant for this application.

4) Measure the following cases.

For a matrix size of 1024, vary the block size (8 and 16) and measure the speedup.
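
Speedup is conventionally reported as a ratio of execution times. The sketch below assumes that convention; the baseline (for example, the CPU reference run or the run with block size 8) is not specified above, so the function name speedup and both parameter names are hypothetical.

```c
#include <assert.h>

/* Speedup of a measured run relative to a baseline run.
 * A value above 1.0 means the measured run is faster than the baseline. */
double speedup(double baseline_ms, double measured_ms)
{
    assert(baseline_ms > 0.0 && measured_ms > 0.0);
    return baseline_ms / measured_ms;
}
```

For example, under this definition a baseline of 200 ms and a measured time of 50 ms give a speedup of 4. State which baseline you used when reporting results.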

5) Submission:
The lab2.tar.gz file should contain the lab2 folder provided, with all the changes and additions you have made to the source code. Include a PDF file with your answer to item 4.

Instructions:
cd ~/NVIDIA_GPU_Computing_SDK/C/src/lab2
make clean
cd ..
tar cvf lab2.tar lab2
gzip lab2.tar
Upload the lab2.tar.gz file to T-square.

6) Grading
We will grade the functionality of the code with different matrix sizes. We will test arbitrary block sizes.
If your code works only for multiples of 8 (the default block size), you will receive 60% of the total grade. (Source: UIUC EE498AL)