 
== CUDA Overview ==
 
[[wikipedia:CUDA|CUDA]] is a feature set for programming nVidia [[wikipedia:Graphics_processing_unit|GPUs]]. We have many dwarf nodes that are CUDA-enabled with 1-2 GPUs each, and most of the wizard nodes have 4 GPUs each. Most of these are consumer-grade [https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/ nVidia 1080 Ti graphics cards], which are good at accelerating 32-bit calculations. Dwarf38 has two [https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980-ti/specifications nVidia 980 Ti graphics cards] and dwarf39 has two [https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/ nVidia 1080 Ti graphics cards] that are available for anybody to use. To use them, email beocat@cs.ksu.edu to request being added to the GPU priority group, then submit your jobs with <B>--partition=ksu-gen-gpu.q</B>. Wizard20 and wizard21 each have two [https://www.nvidia.com/object/quadro-graphics-with-pascal.html nVidia P100 cards], which are much more costly than the consumer-grade 1080 Ti cards but accelerate 64-bit calculations much better.
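A job targeting those public GPUs could use a script along these lines (a sketch only: the partition name comes from the paragraph above, while the time, memory, and program name are illustrative placeholders):

```shell
#!/bin/bash
# Sketch of a job script for the public GPU nodes (dwarf38/dwarf39).
# --partition comes from the paragraph above; --gres requests one GPU.
# Time, memory, and the program name are placeholders for your own job.
#SBATCH --partition=ksu-gen-gpu.q
#SBATCH --gres=gpu:1
#SBATCH --time=1:00:00
#SBATCH --mem=4G

module load fosscuda
./my_cuda_program
```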
  
 
== Training videos ==
 
CUDA Programming Model Overview: [http://www.youtube.com/watch?v=aveYOlBSe-Y http://www.youtube.com/watch?v=aveYOlBSe-Y]
 
<HTML5video type="youtube" width="800" height="480" autoplay="false">aveYOlBSe-Y</HTML5video>
 
 
CUDA Programming Basics Part I (Host functions): [http://www.youtube.com/watch?v=79VARRFwQgY http://www.youtube.com/watch?v=79VARRFwQgY]
 
<HTML5video type="youtube" width="800" height="480" autoplay="false">79VARRFwQgY</HTML5video>
 
 
CUDA Programming Basics Part II (Device functions): [http://www.youtube.com/watch?v=G5-iI1ogDW4 http://www.youtube.com/watch?v=G5-iI1ogDW4]
 
<HTML5video type="youtube" width="800" height="480" autoplay="false">G5-iI1ogDW4</HTML5video>
 
== Compiling CUDA Applications ==
 
nvcc is the compiler for CUDA applications. When compiling your applications manually, you will need to load a CUDA-enabled compiler toolchain (e.g. fosscuda):
  
* module load fosscuda
 
 
* '''Do not run your CUDA applications on the headnode. They are not guaranteed to run there, and will give you terrible results if they do.'''
 
With those two things in mind, you can compile CUDA applications as follows:
  
 
<syntaxhighlight lang="bash">
module load fosscuda
nvcc <source>.cu -o <output>
</syntaxhighlight>
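If you need to target a particular GPU generation, nvcc's <tt>-arch</tt> flag selects the compute capability to build for. A sketch, assuming the cards listed above (the 1080 Ti is compute capability 6.1, the 980 Ti is 5.2, and the P100 is 6.0):

```shell
# Optionally build for a specific GPU architecture:
#   sm_61 = GTX 1080 Ti, sm_52 = GTX 980 Ti, sm_60 = Tesla P100
nvcc -arch=sm_61 <source>.cu -o <output>
```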
 
== Example ==
 
=== Create your Application ===
 
Copy the following application into Beocat as <tt>vecadd.cu</tt>:

<syntaxhighlight lang="c">
// Kernel definition, see also section 4.2.3 of the nVidia CUDA Programming Guide
__global__ void vecAdd(float* A, float* B, float* C)
{
    // threadIdx.x is a built-in variable provided by CUDA at runtime
    int i = threadIdx.x;
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}

#include <stdio.h>
#define SIZE 10

int main()
{
    int N = SIZE;
    float A[SIZE], B[SIZE], C[SIZE];
    float *devPtrA;
    float *devPtrB;
    float *devPtrC;
    int memsize = SIZE * sizeof(float);

    cudaMalloc((void**)&devPtrA, memsize);
    cudaMalloc((void**)&devPtrB, memsize);
    cudaMalloc((void**)&devPtrC, memsize);
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);
    // __global__ functions are called: Func<<< Dg, Db, Ns >>>(parameters);
    vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);
    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);

    for (int i = 0; i < SIZE; i++)
        printf("C[%d]=%f\n", i, C[i]);

    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);
}
</syntaxhighlight>
 
=== Gain Access to a CUDA-capable Node ===
 
See our [[AdvancedSlurm|advanced scheduler documentation]] for how to request a GPU.
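If you want an interactive shell on a GPU node (the role the old <tt>qrsh -l cuda=TRUE</tt> used to fill), Slurm's <tt>srun</tt> can provide one. A sketch, assuming your account has access to a GPU partition:

```shell
# Request an interactive shell on a node with one GPU allocated.
srun --gres=gpu:1 --pty bash
```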
 
 
=== Compile Your Application ===
 
<syntaxhighlight lang="bash">
module load fosscuda
nvcc vecadd.cu -o vecadd
</syntaxhighlight>
 
This will create a program with the name 'vecadd' (specified by the '-o' flag).
 
 
=== Run Your Application ===
 
Run the program as you usually would:

<syntaxhighlight lang="bash">
./vecadd
</syntaxhighlight>
  
If you don't want to run the program interactively because it is a large job, you can submit it via '''sbatch''' instead; just be sure to add '<tt>--gres=gpu:1</tt>' to your '''sbatch''' directives.
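Putting the pieces together, a minimal batch script for the vecadd example might look like the following (a sketch: the time and memory values are placeholders you would tune for your own job):

```shell
#!/bin/bash
# Sketch of a batch script for the vecadd example.
# --gres=gpu:1 requests one GPU; time and memory are placeholders.
#SBATCH --gres=gpu:1
#SBATCH --time=0:10:00
#SBATCH --mem=1G

module load fosscuda
./vecadd
```

If this is saved as, say, <tt>vecadd.sh</tt>, submit it with <tt>sbatch vecadd.sh</tt>.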

Latest revision as of 12:11, 20 March 2020
