== CUDA Overview ==
[[wikipedia:CUDA|CUDA]] is a feature set for programming nVidia [[wikipedia:Graphics_processing_unit|GPUs]]. We have many dwarf nodes that are CUDA-enabled with 1-2 GPUs each, and most of the Wizard nodes have 4 GPUs each. Most of these are consumer-grade [https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/ nVidia 1080 Ti graphics cards], which are good for accelerating 32-bit calculations. Dwarf36-38 each have two [https://www.nvidia.com/en-us/design-visualization/rtx-a4000/ nVidia RTX A4000 graphics cards] and dwarf39 has two [https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/ nVidia 1080 Ti graphics cards]; these are available for anybody to use, but you'll need to email beocat@cs.ksu.edu to request being added to the GPU priority group, and then submit your jobs with <B>--partition=ksu-gen-gpu.q</B>. Wizard20 and wizard21 each have two [https://www.nvidia.com/object/quadro-graphics-with-pascal.html nVidia P100 cards], which are much more costly than the consumer-grade 1080 Ti cards but accelerate 64-bit calculations much better.
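As a minimal sketch of what that looks like in practice (<tt>myjob.sh</tt> is a placeholder for your own job script), a job targeting those shared GPUs could be submitted with:

<syntaxhighlight lang="bash">
# request the shared GPU partition and one GPU; myjob.sh is a placeholder
sbatch --partition=ksu-gen-gpu.q --gres=gpu:1 myjob.sh
</syntaxhighlight>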


== Training videos ==
CUDA Programming Model Overview: [http://www.youtube.com/watch?v=aveYOlBSe-Y http://www.youtube.com/watch?v=aveYOlBSe-Y]
 
{{#widget:YouTube|id=aveYOlBSe-Y|width=800|height=600}}
 
CUDA Programming Basics Part I (Host functions): [http://www.youtube.com/watch?v=79VARRFwQgY http://www.youtube.com/watch?v=79VARRFwQgY]
 
{{#widget:YouTube|id=79VARRFwQgY|width=800|height=600}}
 
CUDA Programming Basics Part II (Device functions): [http://www.youtube.com/watch?v=G5-iI1ogDW4 http://www.youtube.com/watch?v=G5-iI1ogDW4]
 
{{#widget:YouTube|id=G5-iI1ogDW4|width=800|height=600}}
== Compiling CUDA Applications ==
nvcc is the compiler for CUDA applications. When compiling your applications manually, you will need to load a CUDA-enabled compiler toolchain (e.g. fosscuda):


* module load fosscuda
* '''Do not run your CUDA applications on the headnode. We cannot guarantee they will run, and they will give you terrible results if they do.'''


With those two things in mind, you can compile CUDA applications as follows:


<syntaxhighlight lang="bash">
module load fosscuda
nvcc <source>.cu -o <output>
</syntaxhighlight>
== Example ==
=== Create your Application ===
Copy the following application into Beocat as vecadd.cu:

<syntaxhighlight lang="cuda">
// Kernel definition, see also section 4.2.3 of the nVidia CUDA Programming Guide
__global__ void vecAdd(float* A, float* B, float* C)
{
    // threadIdx.x is a built-in variable provided by CUDA at runtime
    int i = threadIdx.x;
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}

#include <stdio.h>
#define SIZE 10

int main()
{
    int N = SIZE;
    float A[SIZE], B[SIZE], C[SIZE];
    float *devPtrA;
    float *devPtrB;
    float *devPtrC;
    int memsize = SIZE * sizeof(float);

    // allocate device memory for the three vectors
    cudaMalloc((void**)&devPtrA, memsize);
    cudaMalloc((void**)&devPtrB, memsize);
    cudaMalloc((void**)&devPtrC, memsize);

    // copy the host arrays to the device (the kernel overwrites A and B anyway)
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);

    // __global__ functions are called: Func<<< Dg, Db, Ns >>>(parameters);
    vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);
    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);

    for (int i = 0; i < SIZE; i++)
        printf("C[%d]=%f\n", i, C[i]);

    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);
}
</syntaxhighlight>
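In the kernel launch <tt>vecAdd<<<1, N>>>(...)</tt>, the first value inside the <tt><<< >>></tt> brackets is the number of thread blocks and the second is the number of threads per block, so this example runs a single block of N threads, one thread per vector element.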
=== Gain Access to a CUDA-capable Node ===
See our [[AdvancedSlurm|advanced scheduler documentation]] for details on requesting GPU resources.
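If you just want a quick interactive shell on a GPU node, something like the following should work (a sketch assuming standard Slurm options; the scheduler documentation has the details):

<syntaxhighlight lang="bash">
# request one GPU and open an interactive shell on the allocated node
srun --gres=gpu:1 --pty bash
</syntaxhighlight>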
=== Compile Your Application ===
<syntaxhighlight lang="bash">
module load fosscuda
nvcc vecadd.cu -o vecadd
</syntaxhighlight>
This will create a program with the name 'vecadd' (specified by the '-o' flag).
=== Run Your Application ===
Run the program as you usually would, namely
<syntaxhighlight lang="bash">
./vecadd
</syntaxhighlight>
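Since the kernel sets <tt>A[i]=0</tt> and <tt>B[i]=i</tt> before adding them, you should see output like:

<pre>
C[0]=0.000000
C[1]=1.000000
...
C[9]=9.000000
</pre>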


Assuming you don't want to run the program interactively because this is a large job, you can submit it via sbatch; just be sure to add '<tt>--gres=gpu:1</tt>' to your '''sbatch''' options.
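For example, a minimal job script might look like this (the job name, time, and memory values are placeholders; adjust them for your work):

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --job-name=vecadd   # placeholder job name
#SBATCH --gres=gpu:1        # request one GPU
#SBATCH --time=0:10:00      # placeholder wall time
#SBATCH --mem=1G            # placeholder memory request

module load fosscuda
./vecadd
</syntaxhighlight>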
