== CUDA Overview ==
[[wikipedia:CUDA|CUDA]] is a feature set for programming nVidia [[wikipedia:Graphics_processing_unit|GPUs]]. We have many dwarf nodes that are CUDA-enabled with 1-2 GPUs, and most of the wizard nodes have 4 GPUs each. Most of these are consumer-grade [https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/ nVidia 1080 Ti graphics cards] that are good for accelerating 32-bit calculations. Dwarf36-38 each have two [https://www.nvidia.com/en-us/design-visualization/rtx-a4000/ nVidia RTX A4000 graphics cards] and dwarf39 has two [https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/ nVidia 1080 Ti graphics cards] that are available for anybody to use, but you'll need to email beocat@cs.ksu.edu to request being added to the GPU priority group and then submit jobs with <B>--partition=ksu-gen-gpu.q</B>. Wizard20 and wizard21 each have two [https://www.nvidia.com/object/quadro-graphics-with-pascal.html nVidia P100 cards] that are much more costly than the consumer-grade 1080 Ti cards but accelerate 64-bit calculations much better.
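For example, a job could be submitted to that partition with something like the following (a sketch: <tt>myjob.sh</tt> is a hypothetical script name, and the time and memory values are placeholders you should adjust for your own job):
<syntaxhighlight lang="bash">
# Request one GPU on the priority partition for a hypothetical job script
sbatch --partition=ksu-gen-gpu.q --gres=gpu:1 --time=1:00:00 --mem=4G myjob.sh
</syntaxhighlight>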
== Training videos ==
CUDA Programming Model Overview: http://www.youtube.com/watch?v=aveYOlBSe-Y
CUDA Programming Basics Part I (Host functions): http://www.youtube.com/watch?v=79VARRFwQgY
CUDA Programming Basics Part II (Device functions): http://www.youtube.com/watch?v=G5-iI1ogDW4
== Compiling CUDA Applications ==
<tt>nvcc</tt> is the compiler for CUDA applications. When compiling your applications manually, you will need to load a CUDA-enabled compiler toolchain (e.g. fosscuda):
* <tt>module load fosscuda</tt>
* Do not run your CUDA applications on the headnode. We cannot guarantee they will run, and they will give you terrible results if they do.
With those two things in mind, you can compile CUDA applications as follows:
<syntaxhighlight lang="bash">
module load fosscuda
nvcc <source>.cu -o <output>
</syntaxhighlight>
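If you want to verify that the toolchain loaded correctly, something like the following should work (<tt>nvcc --version</tt> and <tt>nvidia-smi</tt> are standard CUDA toolkit and driver utilities; <tt>nvidia-smi</tt> only reports GPUs on a node that actually has them):
<syntaxhighlight lang="bash">
module load fosscuda
nvcc --version   # print the CUDA compiler version
nvidia-smi       # list the GPUs visible on this node (run on a GPU node, not the headnode)
</syntaxhighlight>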
== Example ==
=== Create Your Application ===
Copy the following application to Beocat as <tt>vecadd.cu</tt>:
<syntaxhighlight lang="cuda">
#include <stdio.h>

#define SIZE 10

// Kernel definition, see also section 4.2.3 of the Nvidia CUDA Programming Guide
__global__ void vecAdd(float* A, float* B, float* C)
{
    // threadIdx.x is a built-in variable provided by CUDA at runtime
    int i = threadIdx.x;
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}

int main()
{
    int N = SIZE;
    float A[SIZE], B[SIZE], C[SIZE];
    float *devPtrA;
    float *devPtrB;
    float *devPtrC;
    int memsize = SIZE * sizeof(float);

    // Allocate device memory for the three vectors
    cudaMalloc((void**)&devPtrA, memsize);
    cudaMalloc((void**)&devPtrB, memsize);
    cudaMalloc((void**)&devPtrC, memsize);

    // Copy the host arrays to the device (the kernel overwrites A and B anyway)
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);

    // __global__ functions are launched as: Func<<< Dg, Db, Ns >>>(parameters);
    vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);

    // Copy the result back to the host and print it
    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);
    for (int i = 0; i < SIZE; i++)
        printf("C[%d]=%f\n", i, C[i]);

    // Free all three device allocations
    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);

    return 0;
}
</syntaxhighlight>
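The example above ignores the status codes returned by the CUDA runtime calls. In real applications you will usually want to check them; a minimal sketch (the <tt>CUDA_CHECK</tt> macro is our own helper here, not part of the CUDA API) might look like:
<syntaxhighlight lang="cuda">
#include <stdio.h>
#include <stdlib.h>

// Wrap any CUDA runtime call and abort with a message if it fails
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage, e.g.: CUDA_CHECK(cudaMalloc((void**)&devPtrA, memsize));
// Kernel launches do not return a cudaError_t, so check them afterwards:
//   vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);
//   CUDA_CHECK(cudaGetLastError());        // errors from the launch itself
//   CUDA_CHECK(cudaDeviceSynchronize());   // errors during kernel execution
</syntaxhighlight>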
=== Gain Access to a CUDA-capable Node ===
See our advanced scheduler documentation.
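If you just want an interactive shell on a GPU node to compile and test, something like the following should work with Slurm (a sketch; the time and memory values are placeholders, and the exact limits are covered in the scheduler documentation above):
<syntaxhighlight lang="bash">
# Request one GPU, one hour, and an interactive shell
srun --gres=gpu:1 --time=1:00:00 --mem=4G --pty bash
</syntaxhighlight>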
=== Compile Your Application ===
<syntaxhighlight lang="bash">
module load fosscuda
nvcc vecadd.cu -o vecadd
</syntaxhighlight>
This will create a program named <tt>vecadd</tt> (specified by the <tt>-o</tt> flag).
=== Run Your Application ===
Run the program as you usually would, namely
<syntaxhighlight lang="bash">
./vecadd
</syntaxhighlight>
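Since the kernel sets <tt>A[i]=0</tt> and <tt>B[i]=i</tt>, the output should look like:
<syntaxhighlight lang="text">
C[0]=0.000000
C[1]=1.000000
...
C[9]=9.000000
</syntaxhighlight>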
If you don't want to run the program interactively because it is a large job, you can submit it via '''sbatch'''; just be sure to add '<tt>--gres=gpu:1</tt>' as an '''sbatch''' directive.
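As a minimal sketch of such a submission script (the job name, time, and memory values are placeholders; adjust them for your job):
<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --job-name=vecadd   # placeholder job name
#SBATCH --gres=gpu:1        # request one GPU
#SBATCH --time=1:00:00      # placeholder walltime
#SBATCH --mem=4G            # placeholder memory request

module load fosscuda
./vecadd
</syntaxhighlight>
Save it as, say, <tt>vecadd.sh</tt> and submit it with <tt>sbatch vecadd.sh</tt>.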