== Resource Requests ==
Aside from the time, RAM, and CPU requirements listed on the SlurmBasics page, we have a couple of other requestable resources. Valid gres options are:
 gpu[[:type]:count]
 fabric[[:type]:count]
Generally, if you don't know whether you need a particular resource, you should use the default. The list of valid options can be generated with the command:
 srun --gres=help

=== Fabric ===
We currently offer 3 "fabrics" as request-able resources in Slurm. The "count" specified is the line-rate (in Gigabits-per-second) of the connection on the node.

==== Infiniband ====
First of all, let me state that just because it sounds "cool" doesn't mean you need it or even want it. InfiniBand does absolutely no good if running on a single machine. InfiniBand is a high-speed host-to-host communication fabric. It is (most often) used in conjunction with MPI jobs (discussed below). Several times we have had jobs which could run just fine, except that the submitter requested InfiniBand, and all the nodes with InfiniBand were currently busy. In fact, some of our fastest nodes do not have InfiniBand, so by requesting it when you don't need it, you are actually slowing down your job. To request InfiniBand, add <tt>--gres=fabric:ib:1</tt> to your sbatch command-line.
==== ROCE ====
ROCE (RDMA over Converged Ethernet), like InfiniBand, is a high-speed host-to-host communication layer, again used most often with MPI. Most of our nodes are ROCE-enabled, but requesting this resource guarantees that the nodes allocated to your job will be able to communicate over ROCE. To request ROCE, add <tt>--gres=fabric:roce:1</tt> to your sbatch command-line.
 
==== Ethernet ====
Ethernet is another communication fabric. All of our nodes are connected by ethernet; this resource simply lets you specify the interconnect speed. Speeds are selected in units of Gbps, with all nodes supporting 1 Gbps or above. The currently available speeds for ethernet are <tt>1, 10, 40, and 100</tt>. To select nodes with 40 Gbps and above, you could specify <tt>--gres=fabric:eth:40</tt> on your sbatch command-line. Since ethernet is used to connect to the file server, this can also be used to select nodes that have fast access for applications doing heavy IO. The Dwarves and Heroes have 40 Gbps ethernet, and we measure single-stream performance as high as 20 Gbps; if your application requires heavy IO, you'd want to avoid the Moles, which are connected to the file server with only 1 Gbps ethernet.
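For example, on the sbatch command line (the script name here is just a placeholder for your own MPI job script):
 sbatch --gres=fabric:roce:1 --nodes=2 --ntasks-per-node=16 MyMpiScript.sh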
 
=== CUDA ===
[[CUDA]] is the resource required for GPU computing. <tt>kstat -g</tt> will show you the GPU nodes and the jobs running on them. To request a GPU, add <tt>--gres=gpu:1</tt>, for example, to request 1 GPU for your job; if your job uses multiple nodes, the number of GPUs requested is per-node. You can also request a given type of GPU (<tt>kstat -g -l</tt> shows the types) by using <tt>--gres=gpu:geforce_gtx_1080_ti:1</tt> for a 1080 Ti GPU on the Wizards or Dwarves, or <tt>--gres=gpu:quadro_gp100:1</tt> for the P100 GPUs on Wizard20-21, which are best for 64-bit codes like VASP. Most of these GPU nodes are owned by various groups. If you want access to GPU nodes and your group does not own any, we can add you to the <tt>--partition=ksu-gen-gpu.q</tt> group, which has priority on Dwarf36-39. For more information on compiling CUDA code, click on this [[CUDA]] link.
 
A listing of the current types of GPUs can be gathered with this command:
<syntaxhighlight lang=bash>
scontrol show nodes | grep CfgTRES | tr ',' '\n' | awk -F '[:=]' '/gres\/gpu:/ { print $2 }' | sort -u
</syntaxhighlight>
At the time of this writing, that command produces this list:
* geforce_gtx_1080_ti
* geforce_rtx_2080_ti
* geforce_rtx_3090
* l40s
* quadro_gp100
* rtx_a4000
* rtx_a6000
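For illustration, a minimal GPU submit script might look like the following sketch; the module name and binary are placeholders for whatever your own workflow actually uses:
<syntaxhighlight lang=bash>
#!/bin/bash
## Request any single GPU on the node
#SBATCH --gres=gpu:1
## Or request a specific type instead, e.g.:
##SBATCH --gres=gpu:geforce_rtx_3090:1
#SBATCH --mem=8G
#SBATCH --time=4:00:00
## Load whatever CUDA-enabled software your job needs (placeholder module name)
module load CUDA
./my_cuda_program
</syntaxhighlight>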
 
== Parallel Jobs ==
There are two ways jobs can run in parallel, ''intra''node and ''inter''node. '''Note: Beocat will not automatically make a job run in parallel.''' Have I said that enough? It's a common misperception.
=== Intranode jobs ===
''Intra''node jobs run on many cores in the same node. These jobs can take advantage of many common libraries, such as [http://openmp.org/wp/ OpenMP], or any programming language that has the concept of ''threads''. Often, your program will need to know how many cores you want it to use, and many will use all available cores if not told explicitly otherwise. This can be a problem when you are sharing resources, as Beocat does. To request multiple cores, use the sbatch directives '<tt>--nodes=1 --cpus-per-task=''n''</tt>' or '<tt>--nodes=1 --ntasks-per-node=''n''</tt>', where ''n'' is the number of cores you wish to use. If your command can take an environment variable, you can use $SLURM_CPUS_ON_NODE to tell it how many cores you've been allocated.
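As a minimal sketch (the program name is a placeholder), an intranode job that requests 8 cores and passes the allocated core count to an OpenMP program could look like this:
<syntaxhighlight lang=bash>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
## Tell the OpenMP runtime to use only the cores Slurm actually allocated
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE
./my_openmp_program
</syntaxhighlight>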
 
=== Internode (MPI) jobs ===
''Inter''node jobs can utilize many cores on one or more nodes. Communicating between nodes is trickier than talking between cores on the same node. The specification for doing so is called "[[wikipedia:Message_Passing_Interface|Message Passing Interface]]", or MPI. We have [http://www.open-mpi.org/ OpenMPI] installed on Beocat for this purpose. Most programs written to take advantage of large multi-node systems will use MPI, but MPI also allows an application to run on multiple cores within a node. You can tell if you have an MPI-enabled program because its directions will tell you to run '<tt>mpirun ''program''</tt>'. Requesting MPI resources is only mildly more difficult than requesting single-node jobs. Instead of using '<tt>--cpus-per-task=''n''</tt>', you would use '<tt>--nodes=''n'' --ntasks-per-node=''m''</tt>' ''or'' '<tt>--nodes=''n'' --ntasks=''o''</tt>' for your sbatch request, where ''n'' is the number of nodes you want, ''m'' is the number of cores per node you need, and ''o'' is the total number of cores you need.


Some quick examples:


<tt>--nodes=6 --ntasks-per-node=4</tt> will give you 4 cores on each of 6 nodes for a total of 24 cores.


<tt>--ntasks=40</tt> will give you 40 cores spread across any number of nodes.


<tt>--nodes=10 --ntasks=100</tt> will give you a total of 100 cores across 10 nodes.
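Putting that together, a sketch of an MPI submit script (the module name and binary are placeholders) could look like this:
<syntaxhighlight lang=bash>
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --time=12:00:00
## Load an MPI-enabled toolchain or application module (placeholder name)
module load OpenMPI
## OpenMPI's mpirun picks the node list and task count up from the Slurm allocation
mpirun ./my_mpi_program
</syntaxhighlight>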


== Requesting memory for multi-core jobs ==
== Other Handy Slurm Features ==
=== Email status changes ===
One of the most commonly used options when submitting jobs that is not related to resource requests is to have Slurm email you when a job changes its status. This may require two directives to sbatch: <tt>--mail-type</tt> and, optionally, <tt>--mail-user</tt>.
==== --mail-type ====
<tt>--mail-type</tt> is used to tell Slurm to notify you about certain conditions. Options are comma-separated and include the following:
{| class="wikitable"
!Option!!Explanation
|-
| NONE || This disables event-based mail
|-
| BEGIN || Sends a notification when the job begins
|-
| END || Sends a notification when the job ends
|-
| FAIL || Sends a notification when the job fails.
|-
| REQUEUE || Sends a notification if the job is put back into the queue from a running state
|-
| STAGE_OUT || Burst buffer stage out and teardown completed
|-
| ALL || Equivalent to BEGIN,END,FAIL,REQUEUE,STAGE_OUT
|-
| TIME_LIMIT || Notifies if the job ran out of time
|-
| TIME_LIMIT_90 || Notifies when the job has used 90% of its allocated time
|-
| TIME_LIMIT_80 || Notifies when the job has used 80% of its allocated time
|-
| TIME_LIMIT_50 || Notifies when the job has used 50% of its allocated time
|-
| ARRAY_TASKS || Modifies the BEGIN, END, and FAIL options to apply to each array task (instead of notifying for the entire job)
|}
 
==== --mail-user ====
<tt>--mail-user</tt> is optional. It is only needed if you intend to send these job status updates to a different e-mail address than what you provided in the [https://acount.beocat.ksu.edu/user Account Request Page]. It is specified with the following arguments to sbatch: <tt>--mail-user=someone@somecompany.com</tt>
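For example, the two directives together in a submit script:
<syntaxhighlight lang=bash>
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=someone@somecompany.com
</syntaxhighlight>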
 
=== Job Naming ===
If you have several jobs in the queue, running the same script with different parameters, it's handy to have a different name for each job as it shows up in the queue. This is accomplished with the '<tt>-J ''JobName''</tt>' sbatch directive.
=== Separating Output Streams ===
Normally, Slurm will create one output file, containing both STDERR and STDOUT. If you want both of these to be separated into two files, you can use the sbatch directives '<tt>--output</tt>' and '<tt>--error</tt>'.
 
{| class="wikitable"
! option !! default !! example
|-
| --output || slurm-%j.out || slurm-206.out
|-
| --error || slurm-%j.out || slurm-206.out
|}
<tt>%j</tt> above indicates that it should be replaced with the job id.
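For example, to write STDOUT and STDERR to separate files named after the job id (the file names here are just examples):
<syntaxhighlight lang=bash>
#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.err
</syntaxhighlight>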
 
=== Running from the Current Directory ===
With Slurm, jobs start in the directory you submitted them from (your "current working directory" at submission time), so programs that assume they are running from the current directory will generally work as expected. If you need the job to run from a different directory, use the '<tt>--chdir=''directory''</tt>' sbatch option.
=== Running in a specific class of machine ===
If you want to run on a specific class of machines, e.g., the Dwarves, you can add the flag "--constraint=dwarves" to select any of those machines.
 
=== Processor Constraints ===
Because Beocat is a heterogeneous cluster (we have machines from many years in the cluster), not all of our processors support every new and fancy feature. You might have some applications that require newer processor features, so we provide a mechanism to request those.

<tt>--constraint</tt> tells the cluster to apply constraints to the types of nodes that the job can run on. For instance, we know of several applications that must be run on chips that have "AVX" processor extensions. To do that, you would specify <tt>--constraint=avx</tt> on your ''<tt>sbatch</tt>'' '''or''' ''<tt>srun</tt>'' command lines.
Using <tt>--constraint=avx</tt> will prohibit your job from running on the Mages, while <tt>--constraint=avx2</tt> will eliminate the Elves as well as the Mages.
 
=== Slurm Environment Variables ===
Within an actual job, sometimes you need to know specific things about the running environment to set up your scripts correctly. Slurm makes a number of environment variables available to you (run <tt>env | grep SLURM</tt> inside a job to see them all); of course, the value of these variables will be different based on many different factors.
Sometimes it is nice to know what hosts you have access to during a job; you can check $SLURM_JOB_NODELIST to find that. There are lots of useful environment variables, and I will leave it to you to identify the ones you want.
 
Some of the most commonly-used variables we see used are $SLURM_CPUS_ON_NODE, $HOSTNAME, and $SLURM_JOB_ID.
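A short sketch of using them inside a job script:
<syntaxhighlight lang=bash>
#!/bin/bash
#SBATCH --cpus-per-task=4
echo "Job $SLURM_JOB_ID is running on $HOSTNAME with $SLURM_CPUS_ON_NODE cores"
echo "Nodes in this allocation: $SLURM_JOB_NODELIST"
</syntaxhighlight>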


== Running from a sbatch Submit Script ==
No doubt after you've run a few jobs you get tired of typing something like 'sbatch --mem-per-cpu=2G --time=10:00 --cpus-per-task=8 -J MyJobTitle MyScript.sh'. How are you supposed to remember all of these every time? The answer is to create a 'submit script', which outlines all of these for you. Below is a sample submit script, which you can modify and use for your own purposes.
<syntaxhighlight lang=bash>
## it to your own purposes. The only exception here is the first line, which
## *must* be #!/bin/bash (or another valid shell).
## There is one strict rule for guaranteeing Slurm reads all of your options:
## Do not put *any* lines above your resource requests that aren't either:
##    1) blank. (no other characters)
##    2) comments (lines must begin with '#')


## Specify the amount of RAM needed _per_core_. Default is 1G
##SBATCH --mem-per-cpu=1G


## Specify the maximum runtime in DD-HH:MM:SS form. Default is 1 hour (1:00:00)
##SBATCH --time=1:00:00




## GPU directive. If you don't know what this is, you probably don't need it
##SBATCH --gres=gpu:1


## number of cores/nodes:
##SBATCH --nodes=2 --ntasks-per-node=1
##SBATCH --ntasks=20
## Constraints for this job. Maybe you need to run on the elves
##SBATCH --constraint=elves
## or perhaps you just need avx processor extensions
##SBATCH --constraint=avx


## Output file name. Default is slurm-%j.out where %j is the job id.
## Name my job, to make it easier to find in the queue
##SBATCH -J MyJobTitle


## Send email when certain criteria are met.
## request form.
##SBATCH --mail-user myemail@ksu.edu
## And finally, we run the job we came here to do.
## $HOME/ProgramDir/ProgramName ProgramArguments
## OR, for the case of MPI-capable jobs
## mpirun $HOME/path/MpiJobName
</syntaxhighlight>


== File Access ==
Beocat has a variety of options for storing and accessing your files.
Every user has a home directory for general use, which is limited in size and has decent file access performance.  Those needing more storage may purchase /bulk subdirectories, which have the same decent performance
but are not backed up.  The /fastscratch file system is a ZFS host with lots of NVMe drives that provides much faster
temporary file access.  When fast IO is critical to application performance, access to /fastscratch, the local disk on each node, or a
RAM disk are the best options.


===Home directory===
Every user has a <tt>/homes/''username''</tt> directory that they drop into when they log into Beocat.  
The home directory is for general use and provides decent performance for most file IO.  
Disk space in each home directory is limited to 1 TB, so larger files should be kept in a purchased /bulk
directory, and there is a limit of 100,000 files in each subdirectory in your account.
This file system is fully redundant, so 3 specific hard disks would need to fail before any data was lost.
===Bulk directory===

Bulk data storage may be provided at a cost of $45/TB/year billed monthly. Due to the cost, directories will be provided when we are contacted and provided with payment information.
File access is the same speed as for the home directories, and the same limit of 100,000 files
per subdirectory applies. There is no limit to the disk space you can use in your bulk directory,
but the files there will not be backed up.  They are still redundantly stored so you don't need to
worry about losing data to hardware failures, just don't delete something by accident.
If you need to back up large files in the bulk directory, talk to Dan Andresen (dan@ksu.edu) about
purchasing some hard disks for archival storage.


===Fast Scratch file system===

The /fastscratch file system is faster than /bulk or /homes.
In order to use fastscratch, you first need to make a directory for yourself.  
Fast Scratch is meant as temporary space for prepositioning files and accessing them
during runs.  Once runs are completed, any files that need to be kept should be moved to your home
or bulk directories since files on the fastscratch file system may get purged after 30 days.  


<syntaxhighlight lang=bash>
mkdir /fastscratch/$USER
</syntaxhighlight>


===Local disk===


If you are running on a single node, it may also be faster to access your files from the local disk
on that node.  Each job creates a subdirectory /tmp/job# (where '#' is the job ID number) on the
local disk of each node the job uses.  This can be accessed simply by writing to /tmp rather than
needing to use /tmp/job#.  

You may need to copy files to
local disk at the start of your script, or set the output directory for your application to point
to a file on the local disk; then you'll need to copy any files you want off the local disk before
the job finishes, since Slurm will remove all files in your job's directory on /tmp on completion
of the job or when it aborts.  Use <tt>kstat -l -h</tt> to see how much /tmp space is available on each node.


<syntaxhighlight lang=bash>
# Copy input files to the tmp directory if needed
cp $input_files /tmp

# Make an 'out' directory to pass to the app if needed
mkdir /tmp/out

# Example of running an app and passing the tmp directory in/out
app -input_directory /tmp -output_directory /tmp/out

# Copy the 'out' directory back to the current working directory after the run
cp -rp /tmp/out .
</syntaxhighlight>


<syntaxhighlight lang=bash>
# Supervisor:
mkdir /bulk/$USER/$STUDENT_USERNAME
setfacl -d -m u:$USER:rwX -R /bulk/$USER/$STUDENT_USERNAME
setfacl -m u:$USER:rwX -R /bulk/$USER/$STUDENT_USERNAME
setfacl -d -m u:$STUDENT_USERNAME:rwX -R /bulk/$USER/$STUDENT_USERNAME
setfacl -m u:$STUDENT_USERNAME:rwX -R /bulk/$USER/$STUDENT_USERNAME

# Student:
nohup mv /homes/$USER/data /bulk/$SUPERVISOR_USERNAME/$USER &

# Once the move is complete, the Supervisor should limit the permissions for the directory again by removing the student's access:
setfacl -d -x u:$STUDENT_USERNAME -R /bulk/$USER/$STUDENT_USERNAME
setfacl -x u:$STUDENT_USERNAME -R /bulk/$USER/$STUDENT_USERNAME
</syntaxhighlight>
==File Sharing==
This section will cover methods of sharing files with other users within Beocat and on remote systems.
In the past, Beocat users have been allowed to keep their
/homes and /bulk directories open so that any other user could
access files.  In order to bring Beocat into alignment with
State of Kansas regulations and industry norms, all users must now have their /homes, /bulk, /scratch, and /fastscratch directories
locked down from other users, but can still share files and directories within their group or with individual users
using group and individual ACLs (Access Control Lists) which will be explained below.
Beocat staff will be exempted from this
policy as we need to work freely with all users and will manage our
subdirectories to minimize access.
===Securing your home directory with the setacls script===
If you do not wish to share files or directories with other users, you do not need to do anything,
as rwx access to others has already been removed.
If you want to share files or directories, you can either use the '''setacls''' script or configure
the ACLs (Access Control Lists) manually.
Running '''setacls -h''' will show how to use the script.
 
  Eos: setacls -h
  setacls [-r] [-w] [-g group] [-u user] -d /full/path/to/directory
  Execute pemission will always be applied, you may also choose r or w
  Must specify at least one group or user
  Must specify at least one directory, and it must be the full path
  Example: setacls -r -g ksu-cis-hpc -u mozes -d /homes/daveturner/shared_dir
You can specify the permissions to be either -r for read or -w for write or you can specify both.
You can provide a priority group to share with, which is the same as the group used in a --partition=
statement in a job submission script.  You can also specify users.
You can specify a file or a directory to share.  If the directory is specified then all files in that
directory will also be shared, and all files created in the directory later will also be shared.
The script will set everything up for you, telling you the commands it is executing along the way,
then show the resulting ACLs at the end with the '''getfacl''' command.
====Manually configuring your ACLs====
If you want to manually configure the ACLs, you can use the directions below to do what the '''setacls'''
script would do for you.
You first need to provide the minimum execute access to your /homes
or /bulk directory before sharing individual subdirectories.  Setting the ACL to execute-only will allow those
in your group to get access to subdirectories, while not including read access means they will not
be able to see other files or subdirectories in your main directory; do keep in mind that they can still access them,
so you may want to still lock them down manually.  Below is an example of how I would change my
/homes/daveturner directory to allow ksu-cis-hpc group execute access.
<syntaxhighlight lang=bash>
setfacl -m g:ksu-cis-hpc:X /homes/daveturner
</syntaxhighlight>
If your research group owns any nodes on Beocat, then you have a group name that can be used to securely share
files with others within your group.  Below is an example of creating a directory called 'share_hpc',
then providing access to my ksu-cis-hpc group
(my group is ksu-cis-hpc, so I submit jobs to --partition=ksu-cis-hpc.q).
Using -R will make these changes recursively to all files and directories in that subdirectory, while changing the defaults with the setfacl -d command ensures that files and directories created
later will get these same ACLs.
<syntaxhighlight lang=bash>
mkdir share_hpc
# ACLs are used here for setting default permissions
setfacl -d -m g:ksu-cis-hpc:rX -R share_hpc
# ACLs are used here for setting actual permissions
setfacl -m g:ksu-cis-hpc:rX -R share_hpc
</syntaxhighlight>
This will give people in your group the ability to read files in the 'share_hpc' directory.  If you also want
them to be able to write or modify files in that directory, then change the ':rX' to ':rwX' instead, e.g. 'setfacl -d -m g:ksu-cis-hpc:rwX -R share_hpc'
If you want to know what groups you belong to, use the command below.
<syntaxhighlight lang=bash>
groups
</syntaxhighlight>
If your group does not own any nodes, you can still request a group name and manage the participants yourself
by emailing us at beocat@cs.ksu.edu.
If you want to share a directory with only a few people you can manage your ACLs using individual usernames
instead of with a group.
You can use the '''getfacl''' command to see which groups and users have access to a given directory.
<syntaxhighlight lang=bash>
getfacl share_hpc
  # file: share_hpc
  # owner: daveturner
  # group: daveturner_users
  user::rwx
  group::r-x
  group:ksu-cis-hpc:r-x
  mask::r-x
  other::---
  default:user::rwx
  default:group::r-x
  default:group:ksu-cis-hpc:r-x
  default:mask::r-x
  default:other::---
</syntaxhighlight>
ACLs give you great flexibility in controlling file access at the
group level.  Below is a more advanced example where I set up a directory to be shared with
my ksu-cis-hpc group, Dan's ksu-cis-dan group, and an individual user 'mozes' who I also want
to have write access.
<syntaxhighlight lang=bash>
mkdir share_hpc_dan_mozes
# acls are used here for setting default permissions
setfacl -d -m g:ksu-cis-hpc:rX -R share_hpc_dan_mozes
setfacl -d -m g:ksu-cis-dan:rX -R share_hpc_dan_mozes
setfacl -d -m u:mozes:rwX -R share_hpc_dan_mozes
# ACLs are used here for setting actual permissions
setfacl -m g:ksu-cis-hpc:rX -R share_hpc_dan_mozes
setfacl -m g:ksu-cis-dan:rX -R share_hpc_dan_mozes
setfacl -m u:mozes:rwX -R share_hpc_dan_mozes
getfacl share_hpc_dan_mozes
  # file: share_hpc_dan_mozes
  # owner: daveturner
  # group: daveturner_users
  user::rwx
  user:mozes:rwx
  group::r-x
  group:ksu-cis-hpc:r-x
  group:ksu-cis-dan:r-x
  mask::r-x
  other::---
  default:user::rwx
  default:user:mozes:rwx
  default:group::r-x
  default:group:ksu-cis-hpc:r-x
  default:group:ksu-cis-dan:r-x
  default:mask::r-x
  default:other::--x
</syntaxhighlight>
===Openly sharing files on the web===
If you create a 'public_html' directory in your home directory, then any files put there will be shared
openly on the web.  There is no way to restrict who has access to those files.
<syntaxhighlight lang=bash>
cd
mkdir public_html
# Opt-in to letting the webserver access your home directory:
setfacl -m g:public_html:x ~/
</syntaxhighlight>
Then access the data from a web browser using the URL:
http://people.beocat.ksu.edu/~your_user_name
This will show a list of the files you have in your public_html subdirectory.
===Globus===
We have a page here dedicated to [[Globus]]


== Array Jobs ==




   --array=n[-m[:s]]
     Submits a so called Array Job, i.e. an array of identical tasks being differentiated only by an index number and being treated by Slurm
     almost like a series of jobs. The option argument to --array specifies the number of array job tasks and the index number which will be
     associated with the tasks. The index numbers will be exported to the job tasks via the environment variable SLURM_ARRAY_TASK_ID. The option
     arguments n and m will be available through the environment variables SLURM_ARRAY_TASK_MIN and SLURM_ARRAY_TASK_MAX.
    
     The task id range specified in the option argument may be a single number, a simple range of the form n-m or a range with a step size.
     Hence, the task id range specified by 2-10:2 would result in the task id indexes 2, 4, 6, 8, and 10, for a total of 5 identical tasks, each
     with the environment variable SLURM_ARRAY_TASK_ID containing one of the 5 index numbers.
    
     Array jobs are commonly used to execute the same type of operation on varying input data sets correlated with the task index number. The
     number of tasks in an array job is unlimited.
    
     STDOUT and STDERR of array job tasks follow a slightly different naming convention (which can be controlled in the same way as mentioned above).
    
     slurm-%A_%a.out
    
     %A is the SLURM_ARRAY_JOB_ID, and %a is the SLURM_ARRAY_TASK_ID


=== Examples ===
<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --array=50-200:50
RUNSIZE=$SLURM_ARRAY_TASK_ID
app1 $RUNSIZE dataset.txt
</syntaxhighlight>
<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --array=1-5000
app2 `sed -n "${SLURM_ARRAY_TASK_ID}p" dataset.txt`
</syntaxhighlight>
This uses a subshell via `, and has the sed command print out only the line number $SLURM_ARRAY_TASK_ID out of the file dataset.txt.


Not only is this a smaller script, it is also faster to submit because it is one job instead of 5000, so sbatch doesn't have to verify as many.


To give you an idea about the time saved: submitting 1 job takes 1-2 seconds. By extension, if you are submitting 5000, that is 5,000-10,000 seconds, or 1.5-3 hours.
== Checkpoint/Restart using DMTCP ==
DMTCP is Distributed Multi-Threaded CheckPoint software that will checkpoint your application without modification, and
can be set up to automatically restart your job from the last checkpoint if, for example, the node you are running on fails. 
This has been tested successfully
on Beocat for some scalar and OpenMP codes, but has failed on all MPI tests so far.  We would like to encourage users to
try DMTCP out if their non-MPI jobs run longer than 24 hours.  If you want to try this, please contact us first since we are still
experimenting with DMTCP.
The sample job submission script below shows how dmtcp_launch is used to start the application, then dmtcp_restart is used to start from a checkpoint if the job has failed and been rescheduled.
If you are putting this in an array script, then add the Slurm array task ID to the end of the checkpoint directory name
like <B>ckptdir=ckpt-$SLURM_ARRAY_TASK_ID</B>.
  #!/bin/bash -l
  #SBATCH --job-name=gromacs
  #SBATCH --mem=50G
  #SBATCH --time=24:00:00
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=4
 
  module reset
  module load GROMACS/2016.4-foss-2017beocatb-hybrid
  module load DMTCP
  module list
 
  ckptdir=ckpt
  mkdir -p $ckptdir
  export DMTCP_CHECKPOINT_DIR=$ckptdir
 
  if ! ls -1 $ckptdir | grep -c dmtcp_restart_script > /dev/null
  then
    echo "Using dmtcp_launch to start the app the first time"
    dmtcp_launch --no-coordinator mpirun -np 1 -x OMP_NUM_THREADS=4 gmx_mpi mdrun -nsteps 50000 -ntomp 4 -v -deffnm 1ns -c 1ns.pdb -nice 0
  else
    echo "Using dmtcp_restart from $ckptdir to continue from a checkpoint"
    dmtcp_restart $ckptdir/*.dmtcp
  fi
You will need to run several tests to verify that DMTCP is working properly with your application.
First, run a short test without DMTCP and another with DMTCP with the checkpoint interval set to 5 minutes
by adding the line <B>export DMTCP_CHECKPOINT_INTERVAL=300</B> to your script.  Then use <B>kstat -d 1</B> to
check that the memory in both runs is close to the same.  Also use this information to calculate the time
that each checkpoint takes.  In most cases I've seen times less than a minute for checkpointing that will normally
be done once each hour.  If your application is taking more time, let us know.  Sometimes this can be sped up
by simply turning off compression by adding the line <B>export DMTCP_GZIP=0</B>.  Make sure to remove the
line where you set the checkpoint interval to 300 seconds so that the default time of once per hour will be used.
After verifying that your code completes using DMTCP and does not take significantly more time or memory, you
will need to start a run, then <B>scancel</B> it after the first checkpoint, then resubmit the same script to make
sure that it restarts and runs to completion.  If you are working with an array job script, the last step is to try a few
array tasks at once to make sure there is no conflict between the jobs.
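For the testing phase described above, the additions to your script would look something like this (remove the short checkpoint interval again for production runs so the default of once per hour is used):
<syntaxhighlight lang=bash>
## Checkpoint every 5 minutes for testing only
export DMTCP_CHECKPOINT_INTERVAL=300
## If checkpointing is slow, try turning off compression
export DMTCP_GZIP=0
</syntaxhighlight>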
== Running jobs interactively ==
Some jobs just don't behave like we think they should, or need to be run with somebody sitting at the keyboard and typing in response to the output the computers are generating. Beocat has a facility for this, called 'srun'. srun uses the exact same command-line arguments as sbatch, but you need to add the following arguments at the end: <tt>--pty bash</tt>. If no node is available with your resource requirements, srun will tell you something like the following:
  srun: error: Unable to allocate resources: Requested node configuration is not available
Note that, like sbatch, your interactive job will time out after your allotted time has passed.
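For example, a one-node interactive session with 4 GB of memory for 2 hours (adjust the resource requests to your own needs) might be started with:
 srun --nodes=1 --ntasks-per-node=1 --mem=4G --time=2:00:00 --pty bash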
== Connecting to an existing job ==
You can connect to an existing job using <B>srun</B> in the same way that the <B>MonitorNode</B> command
allowed us to in the old cluster.  This is essentially like using ssh to get into the node where your job is running, which
can be very useful in allowing you to look at files in /tmp/job# or in running <B>htop</B> to view the
activity level for your job.
srun --jobid=# --pty bash                        where '#' is the job ID number
== Altering Job Requests ==
We generally do not support modifying job parameters once the job has been submitted. It can be done, but there are numerous catches, and all of the variations can be a bit problematic; it is normally easier to simply delete the job (using '''scancel ''jobid''''') and resubmit it with the right parameters. '''If your job doesn't start after modifying such parameters (after a reasonable amount of time), delete the job and resubmit it.'''

As it is unsupported, this is an exercise left to the reader. A starting point is <tt>man scontrol</tt>
== Killable Jobs ==
There are a growing number of machines within Beocat that are owned by a particular person or group. Normally jobs from users that aren't in the group designated by the owner of these machines cannot use them. This is because we have guaranteed that the nodes will be accessible and available to the owner at any given time. We will allow others to use these nodes if they designate their job as "killable." If your job is designated as killable, your job will be able to use these nodes, but can (and will) be killed off at any point in time to make way for the designated owner's jobs. Jobs that are marked killable will be re-queued and may restart on another node.

The way you would designate your job as killable is to add <tt>--gres=killable:1</tt> to the '''<tt>sbatch</tt> or <tt>srun</tt>''' arguments. This could be either on the command-line or in your script file.

''Note: This is a submit-time only request, it cannot be added by a normal user after the job has been submitted.'' If you would like jobs modified to be '''killable''' after the jobs have been submitted (and it is too much work to <tt>scancel</tt> the jobs and re-submit), send an e-mail to the administrators detailing the job ids and what you would like done.
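For example, either form works:
<syntaxhighlight lang=bash>
# On the command line (MyScript.sh is a placeholder for your own submit script):
sbatch --gres=killable:1 MyScript.sh

# Or as a directive inside the submit script itself:
#SBATCH --gres=killable:1
</syntaxhighlight>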


== Scheduling Priority ==
Some users are members of projects that have contributed to Beocat. When those users have contributed nodes, the group gets access to a "partition" giving you priority on those nodes.
 
In most situations, the scheduler will automatically add those priority partitions to the jobs as submitted. You should not need to include a partition list in your job submission.
 
There are currently just a few exceptions that we will not automatically add:
* ksu-chem-mri.q
* ksu-gen-gpu.q
* ksu-gen-highmem.q
 
If you have access to any of the non-automatic partitions, and have need of the resources in that partition, you can then alter your <tt>#SBATCH</tt> lines to include your new partition:
#SBATCH --partition=ksu-gen-highmem.q
 
Otherwise, you shouldn't modify the partition line at all unless you really know what you're doing.
 
== Graphical Applications ==
Some applications are graphical and need to have some graphical input/output. We currently accomplish this with X11 forwarding or [[OpenOnDemand]].
=== OpenOnDemand ===
[[OpenOnDemand]] is likely the easier and more performant way to run a graphical application on the cluster.
# Visit [https://ondemand.beocat.ksu.edu/ ondemand] and log in with your cluster credentials.
# Check the "Interactive Apps" dropdown. We may have a workflow ready for you. If not, choose the desktop.
# Select the resources you need
# Select launch
# A job is now submitted to the cluster, and once the job has started you'll see a Connect button
# Use the app as needed. If using the desktop, start your graphical application.
 
=== X11 Forwarding ===
==== Connecting with an X11 client ====
===== Windows =====
If you are running Windows, we recommend MobaXTerm as your file/ssh manager; it is one relatively simple tool that does everything. MobaXTerm also automatically connects with X11 forwarding enabled.
===== Linux/OSX =====
Both Linux and OSX can connect in an X11 forwarding mode. Linux will have all of the tools you need installed already; OSX will need [https://www.xquartz.org/ XQuartz] installed.
 
Then you will need to change your 'ssh' command slightly:
 
  ssh -Y eid@headnode.beocat.ksu.edu


The '''-Y''' argument tells ssh to set up X11 forwarding.
==== Starting a Graphical job ====
All graphical jobs, by design, must be interactive, so we'll use the srun command. On a headnode, we run the following:
 # load an X11 enabled application
 module load Octave
 # start an X11 job; sbatch arguments are accepted for srun as well: 1 node, 1 hour, 1 GB of memory
 srun --nodes=1 --time=1:00:00 --mem=1G --pty --x11 octave --gui


Because these jobs are interactive, they may not be able to run at all times, depending on how busy the scheduler is at any point in time. '''--pty --x11''' are required arguments setting up the job, and '''octave --gui''' is the command to run inside the job.


== Job Accounting ==


===== My job didn't do anything when it ran! =====
{| class="wikitable" style="float:left; margin:0; margin-right:-1px; {{{style|}}}
|-
| &nbsp;
|-
|1
|-
|2
|-
|3
|}
<div style="overflow-x:auto; white-space:nowrap;">
{| class="wikitable" style="margin:0; {{{style|}}}"
|-
!JobID!!JobIDRaw!!JobName!!Partition!!MaxVMSize!!MaxVMSizeNode!!MaxVMSizeTask!!AveVMSize!!MaxRSS!!MaxRSSNode!!MaxRSSTask!!AveRSS!!MaxPages!!MaxPagesNode!!MaxPagesTask!!AvePages!!MinCPU!!MinCPUNode!!MinCPUTask!!AveCPU!!NTasks!!AllocCPUS!!Elapsed!!State!!ExitCode!!AveCPUFreq!!ReqCPUFreqMin!!ReqCPUFreqMax!!ReqCPUFreqGov!!ReqMem!!ConsumedEnergy!!MaxDiskRead!!MaxDiskReadNode!!MaxDiskReadTask!!AveDiskRead!!MaxDiskWrite!!MaxDiskWriteNode!!MaxDiskWriteTask!!AveDiskWrite!!AllocGRES!!ReqGRES!!ReqTRES!!AllocTRES
|-
|218.0||218.0||qqqqstat||||204212K||dwarf37||0||204212K||1420K||dwarf37||0||1420K||0||dwarf37||0||0||00:00:00||dwarf37||0||00:00:00||1||12||00:00:00||FAILED||2:0||196.52M||Unknown||Unknown||Unknown||1Gn||0||0||dwarf37||65534||0||0.00M||dwarf37||0||0.00M||||||||cpu=12,mem=1G,node=1
|}</div><br style="clear:both"/>
If you look at the columns showing Elapsed and State, you can see that they show 00:00:00 and FAILED respectively. This means that the job started and then promptly ended. This points to something being wrong with your submission script. Perhaps there is a typo somewhere in it.
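The tables above come from Slurm's accounting records; a sketch of pulling a few of those columns yourself for a given job id (218 in this example) with sacct:
 sacct -j 218 --format=JobID,JobName,Elapsed,State,ExitCode,MaxRSS,ReqMem,AllocTRES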


===== My job ran but didn't finish! =====
{| class="wikitable" style="float:left; margin:0; margin-right:-1px; {{{style|}}}
|-
| &nbsp;
|-
|1
|-
|2
|-
|3
|}
<div style="overflow-x:auto; white-space:nowrap;">
{| class="wikitable" style="margin:0; {{{style|}}}"
|-
!JobID!!JobIDRaw!!JobName!!Partition!!MaxVMSize!!MaxVMSizeNode!!MaxVMSizeTask!!AveVMSize!!MaxRSS!!MaxRSSNode!!MaxRSSTask!!AveRSS!!MaxPages!!MaxPagesNode!!MaxPagesTask!!AvePages!!MinCPU!!MinCPUNode!!MinCPUTask!!AveCPU!!NTasks!!AllocCPUS!!Elapsed!!State!!ExitCode!!AveCPUFreq!!ReqCPUFreqMin!!ReqCPUFreqMax!!ReqCPUFreqGov!!ReqMem!!ConsumedEnergy!!MaxDiskRead!!MaxDiskReadNode!!MaxDiskReadTask!!AveDiskRead!!MaxDiskWrite!!MaxDiskWriteNode!!MaxDiskWriteTask!!AveDiskWrite!!AllocGRES!!ReqGRES!!ReqTRES!!AllocTRES
|-
|220.0||220.0||sleep||||204212K||dwarf37||0||107916K||1000K||dwarf37||0||620K||0||dwarf37||0||0||00:00:00||dwarf37||0||00:00:00||1||1||00:01:27||CANCELLED||0:15||1.54G||Unknown||Unknown||Unknown||1Gn||0||0.05M||dwarf37||0||0.05M||0.00M||dwarf37||0||0.00M||||||||cpu=1,mem=1G,node=1
|}</div><br style="clear:both"/>
If you look at the column showing State, we can see some pointers to the issue. The job ran out of time (TIMEOUT) and then was killed (CANCELLED).
{| class="wikitable" style="float:left; margin:0; margin-right:-1px; {{{style|}}}
|-
| &nbsp;
|-
|1
|-
|2
|}
<div style="overflow-x:auto; white-space:nowrap;">
{| class="wikitable" style="margin:0; {{{style|}}}"
|-
!JobID!!JobIDRaw!!JobName!!Partition!!MaxVMSize!!MaxVMSizeNode!!MaxVMSizeTask!!AveVMSize!!MaxRSS!!MaxRSSNode!!MaxRSSTask!!AveRSS!!MaxPages!!MaxPagesNode!!MaxPagesTask!!AvePages!!MinCPU!!MinCPUNode!!MinCPUTask!!AveCPU!!NTasks!!AllocCPUS!!Elapsed!!State!!ExitCode!!AveCPUFreq!!ReqCPUFreqMin!!ReqCPUFreqMax!!ReqCPUFreqGov!!ReqMem!!ConsumedEnergy!!MaxDiskRead!!MaxDiskReadNode!!MaxDiskReadTask!!AveDiskRead!!MaxDiskWrite!!MaxDiskWriteNode!!MaxDiskWriteTask!!AveDiskWrite!!AllocGRES!!ReqGRES!!ReqTRES!!AllocTRES
|-
|221.batch||221.batch||batch||||137940K||dwarf37||0||137940K||1144K||dwarf37||0||1144K||0||dwarf37||0||0||00:00:00||dwarf37||0||00:00:00||1||1||00:00:01||CANCELLED||0:15||2.62G||0||0||0||1Mn||0||0||dwarf37||65534||0||0||dwarf37||65534||0||||||||cpu=1,mem=1M,node=1
|}</div><br style="clear:both"/>
If you look at the column showing State, we see it was "CANCELLED by 0", then we look at the AllocTRES column to see our allocated resources, and see that 1MB of memory was granted. Combine that with the column "MaxRSS" and we see that the memory granted was less than the memory we tried to use, thus the job was "CANCELLED".
If you look at the column showing State, we see it was "CANCELLED by 0", then we look at the AllocTRES column to see our allocated resources, and see that 1MB of memory was granted. Combine that with the column "MaxRSS" and we see that the memory granted was less than the memory we tried to use, thus the job was "CANCELLED".

Resource Requests

Aside from the time, RAM, and CPU requirements listed on the SlurmBasics page, we have a couple of other requestable resources:

Valid gres options are:
gpu[[:type]:count]
fabric[[:type]:count]

Generally, if you don't know whether you need a particular resource, you should use the default. A list of valid gres options can be generated with the command

srun --gres=help

Fabric

We currently offer 3 "fabrics" as requestable resources in Slurm. The "count" specified is the line rate (in gigabits per second) of the connection on the node.

Infiniband

First of all, let me state that just because it sounds "cool" doesn't mean you need it or even want it. InfiniBand does absolutely no good if running on a single machine. InfiniBand is a high-speed host-to-host communication fabric. It is (most-often) used in conjunction with MPI jobs (discussed below). Several times we have had jobs which could run just fine, except that the submitter requested InfiniBand, and all the nodes with InfiniBand were currently busy. In fact, some of our fastest nodes do not have InfiniBand, so by requesting it when you don't need it, you are actually slowing down your job. To request Infiniband, add --gres=fabric:ib:1 to your sbatch command-line.

ROCE

ROCE, like InfiniBand is a high-speed host-to-host communication layer. Again, used most often with MPI. Most of our nodes are ROCE enabled, but this will let you guarantee the nodes allocated to your job will be able to communicate with ROCE. To request ROCE, add --gres=fabric:roce:1 to your sbatch command-line.

Ethernet

Ethernet is another communication fabric. All of our nodes are connected by ethernet; this option simply lets you specify the interconnect speed. Speeds are selected in units of Gbps, with all nodes supporting 1 Gbps or above. The currently available speeds for ethernet are: 1, 10, 40, and 100. To select nodes with 40 Gbps and above, you could specify --gres=fabric:eth:40 on your sbatch command-line. Since ethernet is used to connect to the file server, this can be used to select nodes that have fast access for applications doing heavy IO. The Dwarves and Heroes have 40 Gbps ethernet and we measure single-stream performance as high as 20 Gbps, but if your application requires heavy IO then you'd want to avoid the Moles, which are connected to the file server with only 1 Gbps ethernet.

CUDA

CUDA is the resource required for GPU computing. 'kstat -g' will show you the GPU nodes and the jobs running on them. To request a GPU, add --gres=gpu:1 to your job; if your job uses multiple nodes, the number of GPUs requested is per node. You can also request a specific type of GPU ('kstat -g -l' shows the types): for example, --gres=gpu:geforce_gtx_1080_ti:1 for a 1080 Ti GPU on the Wizards or Dwarves, or --gres=gpu:quadro_gp100:1 for the P100 GPUs on Wizard20-21, which are best for 64-bit codes like Vasp. Most of these GPU nodes are owned by various groups. If you want access to GPU nodes and your group does not own any, we can add you to the --partition=ksu-gen-gpu.q group, which has priority on Dwarf36-39. For more information on compiling CUDA code, see the CUDA page.

A listing of the current types of gpus can be gathered with this command:

scontrol show nodes | grep CfgTRES | tr ',' '\n' | awk -F '[:=]' '/gres\/gpu:/ { print $2 }' | sort -u

At the time of this writing, that command produces this list:

  • geforce_gtx_1080_ti
  • geforce_rtx_2080_ti
  • geforce_rtx_3090
  • l40s
  • quadro_gp100
  • rtx_a4000
  • rtx_a6000
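
As a sketch, a minimal GPU batch script might look like the following. The GPU type is one from the list above; the job name and program are placeholders for whatever you actually run.

#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=1
## Request one GPU of a specific type from the list above
#SBATCH --gres=gpu:geforce_rtx_3090:1

## Placeholder application; replace with your own GPU-enabled program
./my_gpu_program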

Parallel Jobs

There are two ways jobs can run in parallel: intranode and internode. Note: Beocat will not automatically make a job run in parallel. Have I said that enough? It's a common misconception.

Intranode jobs

Intranode jobs run on many cores within the same node. These jobs can take advantage of many common libraries, such as OpenMP, or any programming language that has the concept of threads. Often, your program will need to know how many cores you want it to use, and many will use all available cores if not told explicitly otherwise. This can be a problem when you are sharing resources, as Beocat does. To request multiple cores, use the sbatch directives '--nodes=1 --cpus-per-task=n' or '--nodes=1 --ntasks-per-node=n', where n is the number of cores you wish to use. If your command can take an environment variable, you can use $SLURM_CPUS_ON_NODE to tell it how many cores you've been allocated.
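
A minimal sketch of a single-node multi-core script, assuming a hypothetical threaded program 'my_threaded_app':

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2G
#SBATCH --time=4:00:00

## Tell OpenMP-based programs how many threads to use
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE

## Placeholder program; replace with your own threaded application
./my_threaded_app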

Internode (MPI) jobs

Internode jobs can utilize many cores on one or more nodes. Communicating between nodes is trickier than talking between cores on the same node. The specification for doing so is called the "Message Passing Interface", or MPI. We have OpenMPI installed on Beocat for this purpose. Most programs written to take advantage of large multi-node systems will use MPI, but MPI also allows an application to run on multiple cores within a node. You can tell if you have an MPI-enabled program because its directions will tell you to run 'mpirun program'. Requesting MPI resources is only mildly more difficult than requesting single-node jobs. Instead of using '--cpus-per-task=n', you would use '--nodes=n --ntasks-per-node=m' or '--nodes=n --ntasks=o' for your sbatch request, where n is the number of nodes you want, m is the number of cores per node you need, and o is the total number of cores you need.

Some quick examples:

--nodes=6 --ntasks-per-node=4 will give you 4 cores on each of 6 nodes for a total of 24 cores.

--ntasks=40 will give you 40 cores spread across any number of nodes.

--nodes=10 --ntasks=100 will give you a total of 100 cores across 10 nodes.
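
Putting that together, a minimal MPI submit script might look like the following sketch; the module name and program are placeholders for whatever you actually load and run.

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=2G
#SBATCH --time=12:00:00

## Placeholder module; load whatever MPI-enabled software you actually use
module load OpenMPI

## With OpenMPI under Slurm, mpirun picks up the node and task layout from the allocation
mpirun ./my_mpi_program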

Requesting memory for multi-core jobs

Memory requests are easiest when they are specified per core. For instance, if you specified '--ntasks=20 --mem-per-cpu=20G', your job would have access to 400GB of memory in total.

Other Handy Slurm Features

Email status changes

One of the most commonly used options not related to resource requests is to have Slurm email you when a job changes its status. This may require two directives to sbatch: --mail-type and --mail-user.

--mail-type

--mail-type is used to tell Slurm to notify you about certain conditions. Options are comma separated and include the following

Option Explanation
NONE This disables event-based mail
BEGIN Sends a notification when the job begins
END Sends a notification when the job ends
FAIL Sends a notification when the job fails.
REQUEUE Sends a notification if the job is put back into the queue from a running state
STAGE_OUT Burst buffer stage out and teardown completed
ALL Equivalent to BEGIN,END,FAIL,REQUEUE,STAGE_OUT
TIME_LIMIT Notifies if the job ran out of time
TIME_LIMIT_90 Notifies when the job has used 90% of its allocated time
TIME_LIMIT_80 Notifies when the job has used 80% of its allocated time
TIME_LIMIT_50 Notifies when the job has used 50% of its allocated time
ARRAY_TASKS Modifies the BEGIN, END, and FAIL options to apply to each array task (instead of notifying for the entire job)

--mail-user

--mail-user is optional. It is only needed if you intend to send these job status updates to a different e-mail address than the one you provided on the Account Request page. It is specified with the following argument to sbatch: --mail-user=someone@somecompany.com
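
For example, to be emailed when the job ends or fails, at an address other than your default (the address below is just a placeholder):

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=someone@somecompany.com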

Job Naming

If you have several jobs in the queue, running the same script with different parameters, it's handy to have a different name for each job as it shows up in the queue. This is accomplished with the '-J JobName' sbatch directive.

Separating Output Streams

Normally, Slurm will create one output file, containing both STDERR and STDOUT. If you want both of these to be separated into two files, you can use the sbatch directives '--output' and '--error'.

option default example
--output slurm-%j.out slurm-206.out
--error slurm-%j.out slurm-206.out

%j above indicates that it should be replaced with the job id.

Running from the Current Directory

By default, Slurm runs your job from the directory you submitted it from, so programs that assume they are running in the "current directory" will normally just work. If you need the job to run from a different directory, use the '--chdir=<directory>' (or '-D') sbatch option to set the working directory, as in the example below.
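
For example, to have the job run from a different directory (the path below is just a placeholder):

#SBATCH --chdir=/fastscratch/your_username/run1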

Running in a specific class of machine

If you want to run on a specific class of machines, e.g., the Dwarves, you can add the flag "--constraint=dwarves" to select any of those machines.

Processor Constraints

Because Beocat is a heterogeneous cluster (we have machines from many years in the cluster), not all of our processors support every new and fancy feature. You might have some applications that require newer processor features, so we provide a mechanism to request those.

--constraint tells the cluster to apply constraints to the types of nodes that the job can run on. For instance, we know of several applications that must be run on chips that have "AVX" processor extensions. To do that, you would specify --constraint=avx on your sbatch or srun command lines. Using --constraint=avx will prohibit your job from running on the Mages, while --constraint=avx2 will eliminate the Elves as well as the Mages.
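
For example, either form below keeps the job off nodes lacking AVX2 (the script name is a placeholder):

# On the command line:
sbatch --constraint=avx2 myscript.sh

# Or as a directive inside the submit script:
#SBATCH --constraint=avx2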

Slurm Environment Variables

Within an actual job, sometimes you need to know specific things about the running environment to set up your scripts correctly. Here is a listing of environment variables that Slurm makes available to you. Of course, the values of these variables will differ based on many factors.

CUDA_VISIBLE_DEVICES=NoDevFiles
ENVIRONMENT=BATCH
GPU_DEVICE_ORDINAL=NoDevFiles
HOSTNAME=dwarf37
SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
SLURM_CLUSTER_NAME=beocat
SLURM_CPUS_ON_NODE=1
SLURM_DISTRIBUTION=cyclic
SLURMD_NODENAME=dwarf37
SLURM_GTIDS=0
SLURM_JOB_CPUS_PER_NODE=1
SLURM_JOB_GID=163587
SLURM_JOB_ID=202
SLURM_JOBID=202
SLURM_JOB_NAME=slurm_simple.sh
SLURM_JOB_NODELIST=dwarf37
SLURM_JOB_NUM_NODES=1
SLURM_JOB_PARTITION=batch.q,killable.q
SLURM_JOB_QOS=normal
SLURM_JOB_UID=163587
SLURM_JOB_USER=mozes
SLURM_LAUNCH_NODE_IPADDR=10.5.16.37
SLURM_LOCALID=0
SLURM_MEM_PER_NODE=1024
SLURM_NNODES=1
SLURM_NODEID=0
SLURM_NODELIST=dwarf37
SLURM_NPROCS=1
SLURM_NTASKS=1
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SRUN_COMM_HOST=10.5.16.37
SLURM_SRUN_COMM_PORT=37975
SLURM_STEP_ID=0
SLURM_STEPID=0
SLURM_STEP_LAUNCHER_PORT=37975
SLURM_STEP_NODELIST=dwarf37
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_SUBMIT_DIR=/homes/mozes
SLURM_SUBMIT_HOST=dwarf37
SLURM_TASK_PID=23408
SLURM_TASKS_PER_NODE=1
SLURM_TOPOLOGY_ADDR=due1121-prod-core-40g-a1,due1121-prod-core-40g-c1.due1121-prod-sw-100g-a9.dwarf37
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
SLURM_UMASK=0022
SRUN_DEBUG=3
TERM=screen-256color
TMPDIR=/tmp
USER=mozes

Sometimes it is nice to know what hosts you have access to during a job; check SLURM_JOB_NODELIST for that. There are lots of useful environment variables here, and I will leave it to you to identify the ones you want.

Some of the most commonly used variables are $SLURM_CPUS_ON_NODE, $HOSTNAME, and $SLURM_JOB_ID.
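
A small sketch of how a job script might use a few of these:

#!/bin/bash
#SBATCH --nodes=2 --ntasks-per-node=4

## Record where and on what the job is running; handy when debugging later
echo "Job $SLURM_JOB_ID started from $SLURM_SUBMIT_DIR"
echo "Running on node(s): $SLURM_JOB_NODELIST"
echo "Cores available on this node: $SLURM_CPUS_ON_NODE"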

Running from a sbatch Submit Script

No doubt after you've run a few jobs you get tired of typing something like 'sbatch --mem-per-cpu=2G --time=10:00 --cpus-per-task=8 -J MyJobTitle MyScript.sh'. How are you supposed to remember all of these every time? The answer is to create a 'submit script', which outlines all of these for you. Below is a sample submit script, which you can modify and use for your own purposes.

#!/bin/bash

## A Sample sbatch script created by Kyle Hutson
##
## Note: Usually a '#' at the beginning of a line is ignored. However, in
## the case of sbatch, lines beginning with #SBATCH are commands for sbatch
## itself, so I have taken the convention here of starting *every* line with a
## '#'. Just delete the first one if you want to use that line, and then modify
## it to your own purposes. The only exception here is the first line, which
## *must* be #!/bin/bash (or another valid shell).

## There is one strict rule for guaranteeing Slurm reads all of your options:
## Do not put *any* lines above your resource requests that aren't either:
##    1) blank. (no other characters)
##    2) comments (lines must begin with '#')

## Specify the amount of RAM needed _per_core_. Default is 1G
##SBATCH --mem-per-cpu=1G

## Specify the maximum runtime in DD-HH:MM:SS form. Default is 1 hour (1:00:00)
##SBATCH --time=1:00:00

## Require the use of infiniband. If you don't know what this is, you probably
## don't need it.
##SBATCH --gres=fabric:ib:1

## GPU directive. If You don't know what this is, you probably don't need it
##SBATCH --gres=gpu:1

## number of cores/nodes:
## quick note here. Jobs requesting 16 or fewer cores tend to get scheduled
## fairly quickly. If you need a job that requires more than that, you might
## benefit from emailing us at beocat@cs.ksu.edu to see how we can assist in
## getting your job scheduled in a reasonable amount of time. The default is 1.
##SBATCH --cpus-per-task=1
##SBATCH --cpus-per-task=12
##SBATCH --nodes=2 --tasks-per-node=1
##SBATCH --tasks=20

## Constraints for this job. Maybe you need to run on the elves
##SBATCH --constraint=elves
## or perhaps you just need avx processor extensions
##SBATCH --constraint=avx

## Output file name. Default is slurm-%j.out where %j is the job id.
##SBATCH --output=MyJobTitle.o%j

## Split the errors into a separate file. Default is the same as output
##SBATCH --error=MyJobTitle.e%j

## Name my job, to make it easier to find in the queue
##SBATCH -J MyJobTitle

## Send email when certain criteria are met.
## Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL (equivalent to
## BEGIN, END, FAIL, REQUEUE,  and  STAGE_OUT),  STAGE_OUT  (burst buffer stage
## out and teardown completed), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent
## of time limit), TIME_LIMIT_80 (reached 80 percent of time limit),
## TIME_LIMIT_50 (reached 50 percent of time limit) and ARRAY_TASKS (send
## emails for each array task). Multiple type values may be specified in a
## comma separated list. Unless the  ARRAY_TASKS  option  is specified, mail
## notifications on job BEGIN, END and FAIL apply to a job array as a whole
## rather than generating individual email messages for each task in the job
## array.
##SBATCH --mail-type=ALL

## Email address to send the email to based on the above line.
## Default is to send the mail to the e-mail address entered on the account
## request form.
##SBATCH --mail-user=myemail@ksu.edu

## And finally, we run the job we came here to do.
## $HOME/ProgramDir/ProgramName ProgramArguments

## OR, for the case of MPI-capable jobs
## mpirun $HOME/path/MpiJobName

File Access

Beocat has a variety of options for storing and accessing your files. Every user has a home directory for general use, which is limited in size but has decent file access performance. Those needing more storage may purchase /bulk subdirectories, which have the same decent performance but are not backed up. The /fastscratch file system is a ZFS host with lots of NVMe drives that provides much faster temporary file access. When fast IO is critical to application performance, /fastscratch, the local disk on each node, or a RAM disk are the best options.

Home directory

Every user has a /homes/username directory that they drop into when they log into Beocat. The home directory is for general use and provides decent performance for most file IO. Disk space in each home directory is limited to 1 TB, so larger files should be kept in a purchased /bulk directory, and there is a limit of 100,000 files in each subdirectory in your account. This file system is fully redundant, so 3 specific hard disks would need to fail before any data was lost. All files will soon be backed up nightly to a separate file server in Nichols Hall, so if you do accidentally delete something it can be recovered.

Bulk directory

Bulk data storage may be provided at a cost of $45/TB/year billed monthly. Due to the cost, directories will be provided when we are contacted and provided with payment information.

Fast Scratch file system

The /fastscratch file system is faster than /bulk or /homes. In order to use fastscratch, you first need to make a directory for yourself. Fast Scratch is meant as temporary space for prepositioning files and accessing them during runs. Once runs are completed, any files that need to be kept should be moved to your home or bulk directories since files on the fastscratch file system may get purged after 30 days.

mkdir /fastscratch/$USER
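
Once that directory exists, a typical pattern is to stage input files there, run against them, and copy anything you want to keep back to your home or bulk directory before it is purged. A sketch with placeholder file and program names:

# Stage input data onto fastscratch
cp $HOME/data/input.dat /fastscratch/$USER/

# Run against the staged copy (placeholder program name)
my_app -input /fastscratch/$USER/input.dat -output /fastscratch/$USER/results

# Copy results back to the home directory when the run is done
cp -rp /fastscratch/$USER/results $HOME/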

Local disk

If you are running on a single node, it may also be faster to access your files from the local disk on that node. Each job gets its own subdirectory /tmp/job#, where '#' is the job ID number, on the local disk of each node the job uses. This can be accessed simply by writing to /tmp rather than needing to use /tmp/job#.

You may need to copy files to the local disk at the start of your script, or set the output directory for your application to point to the local disk. You'll then need to copy any files you want to keep off the local disk before the job finishes, since Slurm will remove all files in your job's /tmp directory when the job completes or aborts. Use 'kstat -l -h' to see how much /tmp space is available on each node.

# Copy input files to the tmp directory if needed
cp $input_files /tmp

# Make an 'out' directory to pass to the app if needed
mkdir /tmp/out

# Example of running an app and passing the tmp directory in/out
app -input_directory /tmp -output_directory /tmp/out

# Copy the 'out' directory back to the current working directory after the run
cp -rp /tmp/out .

RAM disk

If you need ultrafast access to files, you can use a RAM disk which is a file system set up in the memory of the compute node you are running on. The RAM disk is limited to the requested memory on that node, so you should account for this usage when you request memory for your job. Below is an example of how to use the RAM disk.

# Copy input files over if necessary
cp $any_input_files /dev/shm/

# Run the application, possibly giving it the path to the RAM disk to use for output files
app -output_directory /dev/shm/

# Copy files from the RAM disk to the current working directory and clean it up
cp /dev/shm/* .

When you leave KSU

If you are done with your account and leaving KSU, please clean up your directory, move any files to your supervisor's account that need to be kept after you leave, and notify us so that we can disable your account. The easiest way to move your files to your supervisor's account is for them to set up a subdirectory for you with the appropriate write permissions. The example below shows moving just a user's 'data' subdirectory to their supervisor. The 'nohup' command is used so that the move will continue even if the window you are doing the move from gets disconnected.

# Supervisor:
mkdir /bulk/$USER/$STUDENT_USERNAME
setfacl -d -m u:$USER:rwX -R /bulk/$USER/$STUDENT_USERNAME
setfacl -m u:$USER:rwX -R /bulk/$USER/$STUDENT_USERNAME
setfacl -d -m u:$STUDENT_USERNAME:rwX -R /bulk/$USER/$STUDENT_USERNAME
setfacl -m u:$STUDENT_USERNAME:rwX -R /bulk/$USER/$STUDENT_USERNAME

# Student:
nohup mv /homes/$USER/data /bulk/$SUPERVISOR_USERNAME/$USER &

# Once the move is complete, the Supervisor should limit the permissions for the directory again by removing the student's access:
chown $USER: -R /bulk/$USER/$STUDENT_USERNAME
setfacl -d -x u:$STUDENT_USERNAME -R /bulk/$USER/$STUDENT_USERNAME
setfacl -x u:$STUDENT_USERNAME -R /bulk/$USER/$STUDENT_USERNAME

File Sharing

This section will cover methods of sharing files with other users within Beocat and on remote systems. In the past, Beocat users have been allowed to keep their /homes and /bulk directories open so that any other user could access files. In order to bring Beocat into alignment with State of Kansas regulations and industry norms, all users must now have their /homes, /bulk, /scratch, and /fastscratch directories locked down from other users. You can still share files and directories within your group or with individual users using group and individual ACLs (Access Control Lists), which are explained below. Beocat staff are exempted from this policy, as we need to work freely with all users, and we will manage our subdirectories to minimize access.

Securing your home directory with the setacls script

If you do not wish to share files or directories with other users, you do not need to do anything, as rwx access for others has already been removed. If you want to share files or directories, you can either use the setacls script or configure the ACLs (Access Control Lists) manually.

Running setacls -h will show how to use the script.

 Eos: setacls -h
 setacls [-r] [-w] [-g group] [-u user] -d /full/path/to/directory
 Execute pemission will always be applied, you may also choose r or w
 Must specify at least one group or user
 Must specify at least one directory, and it must be the full path
 Example: setacls -r -g ksu-cis-hpc -u mozes -d /homes/daveturner/shared_dir

You can specify the permissions to be either -r for read or -w for write, or you can specify both. You can provide a priority group to share with, which is the same as the group used in a --partition= statement in a job submission script. You can also specify individual users. You can specify a file or a directory to share. If a directory is specified, then all files in that directory will also be shared, and all files created in the directory later will be shared as well.

The script will set everything up for you, telling you the commands it is executing along the way, then show the resulting ACLs at the end with the getfacl command.

Manually configuring your ACLs

If you want to manually configure the ACLs, you can use the directions below to do what the setacls script would do for you. You first need to provide the minimum execute access to your /homes or /bulk directory before sharing individual subdirectories. Setting the ACL to execute only will allow those in your group to reach subdirectories, while leaving out read access means they will not be able to list the other files or subdirectories in your main directory. Keep in mind that they can still access those if they know the names, so you may want to lock them down individually. Below is an example of how I would change my /homes/daveturner directory to allow the ksu-cis-hpc group execute access.

setfacl -m g:ksu-cis-hpc:X /homes/daveturner

If your research group owns any nodes on Beocat, then you have a group name that can be used to securely share files with others within your group. Below is an example of creating a directory called 'share_hpc', then providing access to my ksu-cis-hpc group (my group is ksu-cis-hpc, so I submit jobs to --partition=ksu-cis-hpc.q). Using -R will make these changes recursively to all files and directories in that subdirectory, while changing the defaults with the setfacl -d command will ensure that files and directories created later get these same ACLs.

mkdir share_hpc
# ACLs are used here for setting default permissions
setfacl -d -m g:ksu-cis-hpc:rX -R share_hpc
# ACLs are used here for setting actual permissions
setfacl -m g:ksu-cis-hpc:rX -R share_hpc

This will give people in your group the ability to read files in the 'share_hpc' directory. If you also want them to be able to write or modify files in that directory then change the ':rX' to ':rwX' instead. e.g. 'setfacl -d -m g:ksu-cis-hpc:rwX -R share_hpc'

If you want to know what groups you belong to use the line below.

groups

If your group does not own any nodes, you can still request a group name and manage the participants yourself by emailing us at beocat@cs.ksu.edu. If you want to share a directory with only a few people, you can manage your ACLs using individual usernames instead of a group.

You can use the getfacl command to see which groups and users have access to a given directory.

getfacl share_hpc

  # file: share_hpc
  # owner: daveturner
  # group: daveturner_users
  user::rwx
  group::r-x
  group:ksu-cis-hpc:r-x
  mask::r-x
  other::---
  default:user::rwx
  default:group::r-x
  default:group:ksu-cis-hpc:r-x
  default:mask::r-x
  default:other::---

ACLs give you great flexibility in controlling file access at the group level. Below is a more advanced example where I set up a directory to be shared with my ksu-cis-hpc group, Dan's ksu-cis-dan group, and an individual user 'mozes' who I also want to have write access.

mkdir share_hpc_dan_mozes
# acls are used here for setting default permissions
setfacl -d -m g:ksu-cis-hpc:rX -R share_hpc_dan_mozes
setfacl -d -m g:ksu-cis-dan:rX -R share_hpc_dan_mozes
setfacl -d -m u:mozes:rwX -R share_hpc_dan_mozes
# ACLs are used here for setting actual permissions
setfacl -m g:ksu-cis-hpc:rX -R share_hpc_dan_mozes
setfacl -m g:ksu-cis-dan:rX -R share_hpc_dan_mozes
setfacl -m u:mozes:rwX -R share_hpc_dan_mozes

getfacl share_hpc_dan_mozes

  # file: share_hpc_dan_mozes
  # owner: daveturner
  # group: daveturner_users
  user::rwx
  user:mozes:rwx
  group::r-x
  group:ksu-cis-hpc:r-x
  group:ksu-cis-dan:r-x
  mask::r-x
  other::---
  default:user::rwx
  default:user:mozes:rwx
  default:group::r-x
  default:group:ksu-cis-hpc:r-x
  default:group:ksu-cis-dan:r-x
  default:mask::r-x
  default:other::--x

Openly sharing files on the web

If you create a 'public_html' directory in your home directory, then any files put there will be shared openly on the web. There is no way to restrict who has access to those files.

cd
mkdir public_html
# Opt-in to letting the webserver access your home directory:
setfacl -m g:public_html:x ~/

Then access the data from a web browser using the URL:

http://people.beocat.ksu.edu/~your_user_name

This will show a list of the files you have in your public_html subdirectory.

Globus

We have a separate page dedicated to Globus.

Array Jobs

One of Slurm's useful options is the ability to run "Array Jobs"

It can be used with the following option to sbatch.


 --array=n[-m[:s]]
    Submits a so-called Array Job, i.e. an array of identical tasks differentiated only by an index number and treated by Slurm
    almost like a series of jobs. The option argument to --array specifies the number of array job tasks and the index numbers that will be
    associated with the tasks. The index numbers will be exported to the job tasks via the environment variable SLURM_ARRAY_TASK_ID. The option
    arguments n and m will be available through the environment variables SLURM_ARRAY_TASK_MIN and SLURM_ARRAY_TASK_MAX.

    The task id range specified in the option argument may be a single number, a simple range of the form n-m, or a range with a step size.
    Hence, the task id range specified by 2-10:2 would result in the task id indexes 2, 4, 6, 8, and 10, for a total of 5 identical tasks, each
    with the environment variable SLURM_ARRAY_TASK_ID containing one of the 5 index numbers.

    Array jobs are commonly used to execute the same type of operation on varying input data sets correlated with the task index number. The
    number of tasks in an array job is unlimited.

    STDOUT and STDERR of array job tasks follow a slightly different naming convention (which can be controlled in the same way as mentioned above).

    slurm-%A_%a.out
    %A is the SLURM_ARRAY_JOB_ID, and %a is the SLURM_ARRAY_TASK_ID

Examples

Change the Size of the Run

Array Jobs have a variety of uses, one of the easiest to comprehend is the following:

I have an application, app1, that I need to run the exact same way, on the same data set, with only the size of the run changing.

My original script looks like this:

#!/bin/bash
RUNSIZE=50
#RUNSIZE=100
#RUNSIZE=150
#RUNSIZE=200
app1 $RUNSIZE dataset.txt

For every run of that job I have to change the RUNSIZE variable, and submit each script. This gets tedious.

With Array Jobs the script can be written like so:

#!/bin/bash
#SBATCH --array=50-200:50
RUNSIZE=$SLURM_ARRAY_TASK_ID
app1 $RUNSIZE dataset.txt

I then submit that job, and Slurm understands that it needs to run it 4 times, once for each task. It also knows that it can and should run these tasks in parallel.

Choosing a Dataset

A slightly more complex use of Array Jobs is the following:

I have an application, app2, that needs to be run against every line of my dataset. Every line changes how app2 runs slightly, but I need to compare the runs against each other.

Originally I had to take each line of my dataset and generate a new submit script and submit the job. This was done with yet another script:

 #!/bin/bash
 DATASET=dataset.txt
 scriptnum=0
 while read LINE
 do
     echo "app2 $LINE" > ${scriptnum}.sh
     sbatch ${scriptnum}.sh
     scriptnum=$(( $scriptnum + 1 ))
 done < $DATASET

Not only is this needlessly complex, it is also slow, as sbatch has to verify each job as it is submitted. This can be done easily with array jobs, as long as you know the number of lines in the dataset. That number can be obtained with 'wc -l dataset.txt'; in this case let's call it 5000.

#!/bin/bash
#SBATCH --array=1-5000
app2 `sed -n "${SLURM_ARRAY_TASK_ID}p" dataset.txt`

This uses a subshell via `, and has the sed command print out only the line number $SLURM_ARRAY_TASK_ID out of the file dataset.txt.

Not only is this a smaller script, it is also faster to submit because it is one job instead of 5000, so sbatch doesn't have to verify as many.

To give you an idea of the time saved: submitting 1 job takes 1-2 seconds. By extension, if you are submitting 5000, that is 5,000-10,000 seconds, or roughly 1.5-3 hours.

Checkpoint/Restart using DMTCP

DMTCP is Distributed Multi-Threaded CheckPoint software that will checkpoint your application without modification, and can be set up to automatically restart your job from the last checkpoint if for example the node you are running on fails. This has been tested successfully on Beocat for some scalar and OpenMP codes, but has failed on all MPI tests so far. We would like to encourage users to try DMTCP out if their non-MPI jobs run longer than 24 hours. If you want to try this, please contact us first since we are still experimenting with DMTCP.

The sample job submission script below shows how dmtcp_launch is used to start the application, then dmtcp_restart is used to start from a checkpoint if the job has failed and been rescheduled. If you are putting this in an array script, then add the Slurm array task ID to the end of the checkpoint directory name, like ckptdir=ckpt-$SLURM_ARRAY_TASK_ID.

 #!/bin/bash -l
 #SBATCH --job-name=gromacs
 #SBATCH --mem=50G
 #SBATCH --time=24:00:00
 #SBATCH --nodes=1
 #SBATCH --ntasks-per-node=4
 
 module reset
 module load GROMACS/2016.4-foss-2017beocatb-hybrid
 module load DMTCP
 module list
 
 ckptdir=ckpt
 mkdir -p $ckptdir
 export DMTCP_CHECKPOINT_DIR=$ckptdir
 
 if ! ls -1 $ckptdir | grep -c dmtcp_restart_script > /dev/null
 then
    echo "Using dmtcp_launch to start the app the first time"
    dmtcp_launch --no-coordinator mpirun -np 1 -x OMP_NUM_THREADS=4 gmx_mpi mdrun -nsteps 50000 -ntomp 4 -v -deffnm 1ns -c 1ns.pdb -nice 0
 else
    echo "Using dmtcp_restart from $ckptdir to continue from a checkpoint"
    dmtcp_restart $ckptdir/*.dmtcp
 fi

You will need to run several tests to verify that DMTCP is working properly with your application. First, run a short test without DMTCP and another with DMTCP with the checkpoint interval set to 5 minutes by adding the line export DMTCP_CHECKPOINT_INTERVAL=300 to your script. Then use kstat -d 1 to check that the memory in both runs is close to the same. Also use this information to calculate the time that each checkpoint takes. In most cases I've seen times less than a minute for checkpointing that will normally be done once each hour. If your application is taking more time, let us know. Sometimes this can be sped up by simply turning off compression by adding the line export DMTCP_GZIP=0. Make sure to remove the line where you set the checkpoint interval to 300 seconds so that the default time of once per hour will be used.

After verifying that your code completes using DMTCP and does not take significantly more time or memory, you will need to start a run, scancel it after the first checkpoint, then resubmit the same script to make sure that it restarts and runs to completion. If you are working with an array job script, the last step is to try a few array tasks at once to make sure there is no conflict between the jobs.

Running jobs interactively

Some jobs just don't behave like we think they should, or need to be run with somebody sitting at the keyboard and typing in response to the output the computers are generating. Beocat has a facility for this, called 'srun'. srun uses the exact same command-line arguments as sbatch, but you need to add the following arguments at the end: --pty bash. If no node is available with your resource requirements, srun will tell you something like the following:

srun --pty bash
srun: Force Terminated job 217
srun: error: CPU count per node can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available

Note that, like sbatch, your interactive job will timeout after your allotted time has passed.
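
A working request that asks for specific resources might look like this sketch:

# 4 cores, 8 GB of memory, and 2 hours, interactively
srun --nodes=1 --cpus-per-task=4 --mem=8G --time=2:00:00 --pty bash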

Connecting to an existing job

You can connect to an existing job using srun in the same way that the MonitorNode command allowed us to in the old cluster. This is essentially like using ssh to get into the node where your job is running which can be very useful in allowing you to look at files in /tmp/job# or in running htop to view the activity level for your job.

srun --jobid=# --pty bash                        where '#' is the job ID number

Altering Job Requests

We generally do not support modifying job parameters once the job has been submitted. It can be done, but there are numerous catches, and all of the variations can be a bit problematic; it is normally easier to simply delete the job (using scancel jobid) and resubmit it with the right parameters. If your job doesn't start within a reasonable amount of time after modifying such parameters, delete the job and resubmit it.

As it is unsupported, this is an exercise left to the reader. A starting point is man scontrol.

Killable jobs

There are a growing number of machines within Beocat that are owned by a particular person or group. Normally jobs from users that aren't in the group designated by the owner of these machines cannot use them. This is because we have guaranteed that the nodes will be accessible and available to the owner at any given time. We will allow others to use these nodes if they designate their job as "killable." If your job is designated as killable, your job will be able to use these nodes, but can (and will) be killed off at any point in time to make way for the designated owner's jobs. Jobs that are marked killable will be re-queued and may restart on another node.

The way you would designate your job as killable is to add --gres=killable:1 to the sbatch or srun arguments. This could be either on the command-line or in your script file.
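
For example, either on the command line or inside the submit script (the script name is a placeholder):

# On the command line:
sbatch --gres=killable:1 myscript.sh

# Or as a directive in the script:
#SBATCH --gres=killable:1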

Note: This is a submit-time only request, it cannot be added by a normal user after the job has been submitted. If you would like jobs modified to be killable after the jobs have been submitted (and it is too much work to scancel the jobs and re-submit), send an e-mail to the administrators detailing the job ids and what you would like done.

Scheduling Priority

Some users are members of projects that have contributed to Beocat. When a group has contributed nodes, it gets access to a "partition" giving its members priority on those nodes.

In most situations, the scheduler will automatically add those priority partitions to the jobs as submitted. You should not need to include a partition list in your job submission.

There are currently just a few exceptions that we will not automatically add:

  • ksu-chem-mri.q
  • ksu-gen-gpu.q
  • ksu-gen-highmem.q

If you have access to any of the non-automatic partitions, and have need of the resources in that partition, you can then alter your #SBATCH lines to include your new partition:

#SBATCH --partition=ksu-gen-highmem.q

Otherwise, you shouldn't modify the partition line at all unless you really know what you're doing.

Graphical Applications

Some applications are graphical and need to have some graphical input/output. We currently accomplish this with X11 forwarding or OpenOnDemand.

OpenOnDemand

OpenOnDemand is likely the easier and more performant way to run a graphical application on the cluster.

  1. Visit ondemand and log in with your cluster credentials.
  2. Check the "Interactive Apps" dropdown. We may have a workflow ready for you; if not, choose the desktop.
  3. Select the resources you need.
  4. Select Launch.
  5. A job is now submitted to the cluster, and once the job has started you'll see a Connect button.
  6. Use the app as needed. If using the desktop, start your graphical application.

X11 Forwarding

Connecting with an X11 client

Windows

If you are running Windows, we recommend MobaXterm as your file/ssh manager, because it is one relatively simple tool that does everything. MobaXterm also automatically connects with X11 forwarding enabled.

Linux/OSX

Both Linux and OSX can connect in an X11 forwarding mode. Linux will have all of the tools you need installed already; OSX will need XQuartz installed.

Then you will need to change your 'ssh' command slightly:

ssh -Y eid@headnode.beocat.ksu.edu

The -Y argument tells ssh to setup X11 forwarding.

Starting a Graphical job

All graphical jobs, by design, must be interactive, so we'll use the srun command. On a headnode, we run the following:

# load an X11 enabled application
module load Octave
# start an X11 job, sbatch arguments are accepted for srun as well, 1 node, 1 hour, 1 gb of memory
srun --nodes=1 --time=1:00:00 --mem=1G --pty --x11 octave --gui

Because these jobs are interactive, they may not be able to run at all times, depending on how busy the scheduler is at any point in time. --pty --x11 are required arguments setting up the job, and octave --gui is the command to run inside the job.

Job Accounting

Some people may find it useful to know what their job did during its run. The sacct tool will read Slurm's accounting database and give you summarized or detailed views on jobs that have run within Beocat.

sacct

This data can usually be used to diagnose two very common job failures.

Job debugging

It is simplest if you know the job number of the job you are trying to get information on.

# if you know the jobid, put it here:
sacct -j 1122334455 -l
# if you don't know the job id, you can look at your jobs started since some day:
sacct -S 2017-01-01
My job didn't do anything when it ran!

 JobID JobIDRaw JobName Partition MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AllocCPUS Elapsed State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqMem ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite AllocGRES ReqGRES ReqTRES AllocTRES
 218 218 slurm_simple.sh batch.q 12 00:00:00 FAILED 2:0 Unknown Unknown Unknown 1Gn cpu=12,mem=1G,node=1 cpu=12,mem=1G,node=1
 218.batch 218.batch batch 137940K dwarf37 0 137940K 1576K dwarf37 0 1576K 0 dwarf37 0 0 00:00:00 dwarf37 0 00:00:00 1 12 00:00:00 FAILED 2:0 1.36G 0 0 0 1Gn 0 0 dwarf37 65534 0 0.00M dwarf37 0 0.00M cpu=12,mem=1G,node=1
 218.0 218.0 qqqqstat 204212K dwarf37 0 204212K 1420K dwarf37 0 1420K 0 dwarf37 0 0 00:00:00 dwarf37 0 00:00:00 1 12 00:00:00 FAILED 2:0 196.52M Unknown Unknown Unknown 1Gn 0 0 dwarf37 65534 0 0.00M dwarf37 0 0.00M cpu=12,mem=1G,node=1


If you look at the columns showing Elapsed and State, you can see that they show 00:00:00 and FAILED respectively. This means that the job started and then promptly ended. This points to something being wrong with your submission script. Perhaps there is a typo somewhere in it.

My job ran but didn't finish!

 JobID JobIDRaw JobName Partition MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AllocCPUS Elapsed State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqMem ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite AllocGRES ReqGRES ReqTRES AllocTRES
 220 220 slurm_simple.sh batch.q 1 00:01:27 TIMEOUT 0:0 Unknown Unknown Unknown 1Gn cpu=1,mem=1G,node=1 cpu=1,mem=1G,node=1
 220.batch 220.batch batch 370716K dwarf37 0 370716K 7060K dwarf37 0 7060K 0 dwarf37 0 0 00:00:00 dwarf37 0 00:00:00 1 1 00:01:28 CANCELLED 0:15 1.23G 0 0 0 1Gn 0 0.16M dwarf37 0 0.16M 0.00M dwarf37 0 0.00M cpu=1,mem=1G,node=1
 220.0 220.0 sleep 204212K dwarf37 0 107916K 1000K dwarf37 0 620K 0 dwarf37 0 0 00:00:00 dwarf37 0 00:00:00 1 1 00:01:27 CANCELLED 0:15 1.54G Unknown Unknown Unknown 1Gn 0 0.05M dwarf37 0 0.05M 0.00M dwarf37 0 0.00M cpu=1,mem=1G,node=1


If you look at the column showing State, we can see some pointers to the issue. The job ran out of time (TIMEOUT) and then was killed (CANCELLED).

 
 JobID JobIDRaw JobName Partition MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AllocCPUS Elapsed State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqMem ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite AllocGRES ReqGRES ReqTRES AllocTRES
 221 221 slurm_simple.sh batch.q 1 00:00:00 CANCELLED by 0 0:0 Unknown Unknown Unknown 1Mn cpu=1,mem=1M,node=1 cpu=1,mem=1M,node=1
 221.batch 221.batch batch 137940K dwarf37 0 137940K 1144K dwarf37 0 1144K 0 dwarf37 0 0 00:00:00 dwarf37 0 00:00:00 1 1 00:00:01 CANCELLED 0:15 2.62G 0 0 0 1Mn 0 0 dwarf37 65534 0 0 dwarf37 65534 0 cpu=1,mem=1M,node=1


If you look at the column showing State, we see it was "CANCELLED by 0". Then we look at the AllocTRES column to see our allocated resources and see that 1MB of memory was granted. Combine that with the MaxRSS column and we see that the memory granted was less than the memory we tried to use, so the job was CANCELLED.