CUDA Overview
CUDA is a feature set for programming NVIDIA GPUs. We have 16 nodes with NVIDIA Tesla M2050 GPUs. Each GPU has 448 cores running at 1.15 GHz and is very fast at floating-point math: over one teraFLOPS of peak single-precision throughput. However, programming in CUDA is difficult for the uninitiated.
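The teraFLOP figure above can be checked with a rough peak-rate estimate. The sketch below assumes each core completes one fused multiply-add per clock cycle in single precision, counted as two floating-point operations. That factor of two is an assumption that the text above does not state.

```latex
\[
448~\text{cores} \times 1.15 \times 10^{9}~\frac{\text{cycles}}{\text{s}} \times 2~\frac{\text{FLOP}}{\text{cycle} \cdot \text{core}} \approx 1.03 \times 10^{12}~\text{FLOP/s}
\]
```

This is a theoretical peak. Real applications typically achieve less.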
Training videos
CUDA Programming Model Overview: http://www.youtube.com/watch?v=aveYOlBSe-Y
CUDA Programming Basics Part I (Host functions): http://www.youtube.com/watch?v=79VARRFwQgY
CUDA Programming Basics Part II (Device functions): http://www.youtube.com/watch?v=G5-iI1ogDW4
Compiling CUDA Applications
nvcc is the compiler for CUDA applications. When compiling your applications manually you will need to keep the following in mind:
- The CUDA development headers are located here: /opt/cuda/sdk/C/common/inc
- The CUDA architecture is: sm_20
- The CUDA SDK is currently not available on the headnode. Compile on the nodes with CUDA, either in your jobscript or via qrsh -l cuda=TRUE.
- Do not run your CUDA applications on the headnode. I cannot guarantee that they will run, and they will give you terrible results if they do.
Putting it all together you can compile CUDA applications as follows:
nvcc -I /opt/cuda/sdk/C/common/inc -arch sm_20 <source>.cu -o <output>
Example
Create Your Application
Copy the following application to Beocat as vecadd.cu:
// Kernel definition, see also section 4.2.3 of the NVIDIA CUDA Programming Guide
__global__ void vecAdd(float* A, float* B, float* C)
{
// threadIdx.x is a built-in variable provided by CUDA at runtime
int i = threadIdx.x;
A[i]=0;
B[i]=i;
C[i] = A[i] + B[i];
}
#include <stdio.h>
#define SIZE 10
int main()
{
int N=SIZE;
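// Host arrays; A and B are zero-initialized so the copies to the device read defined values
float A[SIZE] = {0}, B[SIZE] = {0}, C[SIZE];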
float *devPtrA;
float *devPtrB;
float *devPtrC;
int memsize= SIZE * sizeof(float);
cudaMalloc((void**)&devPtrA, memsize);
cudaMalloc((void**)&devPtrB, memsize);
cudaMalloc((void**)&devPtrC, memsize);
cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);
// __global__ functions are called: Func<<< Dg, Db, Ns >>>(parameter);
vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);
cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);
for (int i=0; i<SIZE; i++)
printf("C[%d]=%f\n",i,C[i]);
// Free each device allocation exactly once
cudaFree(devPtrA);
cudaFree(devPtrB);
cudaFree(devPtrC);
return 0;
}
Gain Access to a CUDA-capable Node
qrsh -l cuda=TRUE
Compile Your Application
nvcc -I /opt/cuda/sdk/C/common/inc -arch sm_20 vecadd.cu -o vecadd
This will create a program with the name 'vecadd' (specified by the '-o' flag).
Run Your Application
Run the program as you usually would:
./vecadd
For a large job that you do not want to run interactively, submit it via qsub instead. Just be sure to add the '-l cuda=TRUE' directive.