From Beocat
Latest revision as of 18:41, 8 December 2022

CUDA Overview

CUDA is a feature set for programming nVidia GPUs. Many of our dwarf nodes are CUDA-enabled with 1-2 GPUs, and most of the Wizard nodes have 4 GPUs each. Most of these are consumer-grade nVidia 1080 Ti graphics cards, which are good for accelerating 32-bit calculations. Dwarf36-38 have two nVidia RTX A4000 graphics cards and dwarf39 has two nVidia 1080 Ti graphics cards. These are available for anybody to use, but you must first email beocat@cs.ksu.edu to request being added to the GPU priority group, and then submit your jobs with --partition=ksu-gen-gpu.q. Wizard20 and wizard21 each have two nVidia P100 cards, which are much more costly than the consumer-grade 1080 Ti cards but accelerate 64-bit calculations much better.
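
As a rough sketch of the partition request described above, the following writes a small batch script and submits it when Slurm is available. The file name gpu_check.sh, the time limit, and the use of nvidia-smi to list the allocated GPU are illustrative assumptions, not Beocat requirements. Only --partition=ksu-gen-gpu.q comes from this page, and it should only work once you have been added to the GPU priority group.

```shell
#!/bin/bash
# Sketch only: file name and time limit are assumptions for illustration.
set -euo pipefail

job_script="gpu_check.sh"   # hypothetical file name

# Write the batch script. The quoted 'EOF' prevents the shell from
# expanding anything inside the here-document.
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --partition=ksu-gen-gpu.q
#SBATCH --gres=gpu:1
#SBATCH --time=00:02:00

# Print the node name and the GPU(s) visible to this job.
hostname
nvidia-smi
EOF

# Submit only where Slurm is installed; otherwise just leave the file behind.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"
else
    echo "sbatch not found; wrote $job_script without submitting"
fi
```

If the job runs, its output file should name one of the nodes in that partition and show a single GPU.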

Training videos

CUDA Programming Model Overview: http://www.youtube.com/watch?v=aveYOlBSe-Y

CUDA Programming Basics Part I (Host functions): http://www.youtube.com/watch?v=79VARRFwQgY

CUDA Programming Basics Part II (Device functions): http://www.youtube.com/watch?v=G5-iI1ogDW4

Compiling CUDA Applications

nvcc is the compiler for CUDA applications. When compiling your applications manually, you will need to load a CUDA-enabled compiler toolchain (e.g. fosscuda):

  • module load fosscuda
  • Do not run your CUDA applications on the headnode. I cannot guarantee they will run, and they will give you terrible results if they do.

With those two things in mind, you can compile CUDA applications as follows:

module load fosscuda
nvcc <source>.cu -o <output>

Example

Create your Application

Save the following application on Beocat as vecadd.cu

// Kernel definition, see also section 4.2.3 of the Nvidia CUDA Programming Guide
__global__ void vecAdd(float* A, float* B, float* C)
{
    // threadIdx.x is a built-in variable provided by CUDA at runtime
    int i = threadIdx.x;
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}

#include <stdio.h>
#define SIZE 10
int main()
{
    int N = SIZE;
    // Zero-initialize the host arrays so the copies below never read uninitialized memory
    float A[SIZE] = {0}, B[SIZE] = {0}, C[SIZE] = {0};
    float *devPtrA;
    float *devPtrB;
    float *devPtrC;
    int memsize = SIZE * sizeof(float);

    // Allocate the three vectors on the device and copy the inputs over
    cudaMalloc((void**)&devPtrA, memsize);
    cudaMalloc((void**)&devPtrB, memsize);
    cudaMalloc((void**)&devPtrC, memsize);
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);
    // __global__ functions are called: Func<<< Dg, Db, Ns >>>(parameter);
    // here: one block (Dg) of N threads (Db)
    vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);
    // This copy waits for the kernel to finish before reading the result
    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);

    for (int i = 0; i < SIZE; i++)
        printf("C[%d]=%f\n", i, C[i]);

    // Free each device buffer exactly once
    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);

    return 0;
}

Gain Access to a CUDA-capable Node

See our advanced scheduler documentation

Compile Your Application

module load fosscuda
nvcc vecadd.cu -o vecadd

This will create a program with the name 'vecadd' (specified by the '-o' flag).

Run Your Application

Run the program as you usually would, namely

./vecadd

If you don't want to run the program interactively, for example because it is a large job, you can submit it via sbatch. Just be sure to add '--gres=gpu:1' to your sbatch directives.
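
A batch submission along these lines might look like the sketch below. It writes a job script that requests one GPU, loads the fosscuda toolchain, compiles vecadd.cu, and runs it. The file name vecadd_job.sh, the job name, and the time and memory requests are assumptions chosen for illustration. Compiling inside the job, rather than beforehand, is also just one possible choice.

```shell
#!/bin/bash
# Sketch only: file name, job name, time and memory values are assumptions.
set -euo pipefail

job_script="vecadd_job.sh"   # hypothetical file name

# Write the batch script. The quoted 'EOF' prevents the shell from
# expanding anything inside the here-document.
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=vecadd
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
#SBATCH --mem=1G

# Load the CUDA toolchain, then build and run on the GPU node.
module load fosscuda
nvcc vecadd.cu -o vecadd
./vecadd
EOF

# Submit only where Slurm is installed; otherwise just leave the file behind.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"
else
    echo "sbatch not found; wrote $job_script without submitting"
fi
```

Run this from the directory that contains vecadd.cu, since the job script refers to the source file by its relative name.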