The Rocky/Slurm nodes

We converted Beocat from CentOS Linux to Rocky Linux on April 1st, 2024. Any applications or libraries built on the old system must be recompiled.

Using Modules

If you're using a common code that others may also use, we may already have it compiled as a module. You can list the available modules and load an application as in the example below for GROMACS.

eos>  module avail
eos>  module load GROMACS
eos>  module list

When a module gets loaded, all the necessary libraries are also loaded and the paths to the libraries and executables are automatically set up. Loading GROMACS, for example, also loads the OpenMPI library needed to run it and adds the paths to the MPI commands and the GROMACS executables. To see how the path is set up, try executing which gmx. The module system also allows you to easily switch between different versions of applications, libraries, or languages.
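
For example, after loading GROMACS you can check where the gmx executable comes from and see which other versions are installed. The version name in the last command is only a placeholder; module avail shows what actually exists on Beocat.

eos>  which gmx
eos>  module avail GROMACS
eos>  module swap GROMACS GROMACS/2023.3    # hypothetical version name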

If you are using a custom code or one that is not installed as a module, you'll need to recompile it yourself. This process is straightforward on the new system, as much of the work just involves loading the necessary set of modules. The first step is to decide whether to use the Intel compiler toolchain or the GNU toolchain, each of which includes the compilers and supporting math libraries. The module commands for each are below, and you can load one automatically when you log in by adding the corresponding module load statement to your .bashrc file. See /homes/daveturner/.bashrc as an example of where I put the module load statements.

To load the Intel compiler tool chain including the Intel Math Kernel Library (and OpenMPI):

icr-helios>  module load iomkl

To load the GNU compiler tool chain, including OpenMPI, OpenBLAS, FFTW, and ScaLAPACK, load foss (Free Open Source Software):

icr-helios>  module load foss
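
If you want one of these toolchains loaded automatically at login, a minimal .bashrc addition might look like the sketch below (foss is used here only as an example; substitute iomkl if you prefer the Intel toolchain):

if command -v module >/dev/null 2>&1; then    # only try if the module command is available
    module load foss
fi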

Modules provide an easy way to set up the compilers and libraries you may need to compile your code. Beyond that, there are many different ways to compile codes, so you'll just need to follow the directions for your particular application. If you need help you can always email us at beocat@cs.ksu.edu.
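
As a quick check that a toolchain is working, you can compile a small MPI program with the wrapper compiler the module puts on your path. The file mycode.c below is a hypothetical source file of your own, not something provided on Beocat, and the actual run should still go through the scheduler as described in the next section.

icr-helios>  module load foss
icr-helios>  which mpicc                       # wrapper compiler supplied by the toolchain
icr-helios>  mpicc -O2 -o mycode mycode.c      # compile your own MPI source file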

Submitting jobs to Slurm

You can submit your job script using the sbatch command.

icr-helios> sbatch sbatch_script.sh
icr-helios> kstat  --me

This will submit the script and show you a list of your jobs that are running and the jobs you have in the queue. By default the output for each job will go into a slurm-###.out file where ### is the job ID number. If you need to kill a job, you can use the scancel command with the job ID number.
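
For example, assuming 1234567 is a job ID reported by sbatch or shown by kstat (the number here is just a placeholder):

icr-helios>  scancel 1234567            # kill the job with that job ID
icr-helios>  cat slurm-1234567.out      # look at the output the job has written so far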

Submitting your first job

To submit a job to run under Slurm, we use the sbatch (submit batch) command. The scheduler finds the optimum place for your job to run. With over 300 nodes and 7500 cores to schedule, as well as differing priorities, hardware, and individual resources, the scheduler's job is not trivial and it can take some time for a job to start even when there are empty nodes available.

There are a few things you'll need to know before running sbatch.

  • How many cores you need. Note that unless your program is created to use multiple cores (called "threading"), asking for more cores will not speed up your job. This is a common misperception. Beocat will not magically make your program use multiple cores! For this reason the default is 1 core.
  • How much time you need. Many users when beginning to use Beocat neglect to specify a time requirement. The default is one hour, and we get asked why their job died after one hour. We usually point them to the FAQ.
  • How much memory you need. The default is 1 GB. If your job uses significantly more than you ask, your job will be killed off.
  • Any advanced options. See the AdvancedSlurm page for these requests. For our basic examples here, we will ignore these.

So let's now create a small script to test our ability to submit jobs. Create the following file (either by copying it to Beocat or by editing a text file there) and name it myhost.sh. Both of these methods are documented on our LinuxBasics page.

#!/bin/sh
hostname

Be sure to make it executable

chmod u+x myhost.sh

So, now let's submit it as a job and see what happens. Here I'm going to use five options:

  • --mem-per-cpu= tells how much memory I need. In my example, I'm using our system minimum of 512 MB, which is more than enough. Note that your memory request is per core, which doesn't make much difference for this example, but will as you submit more complex jobs.
  • --time= tells how much runtime I need. This can be in the form of "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". This is a very short job, so 1 minute should be plenty. This can't be changed after the job is started, so please make sure you have requested a sufficient amount of time.
  • --cpus-per-task=1 tells Slurm that I need only a single core per task. The AdvancedSlurm page has much more on the "cpus-per-task" switch.
  • --ntasks=1 tells Slurm that I only need to run 1 task. The AdvancedSlurm page has much more on the "ntasks" switch.
  • --nodes=1 tells Slurm that this must be run on one machine. The AdvancedSlurm page has much more on the "nodes" switch.

Other options you will see in later examples include --ntasks-per-node=16, which requests 16 cores on each node, and --constraint=moles, which requests only the Mole class of compute nodes.
% ls
myhost.sh
% sbatch --time=1 --mem-per-cpu=512M --cpus-per-task=1 --ntasks=1 --nodes=1 ./myhost.sh
Submitted batch job 1483446

Since this is such a small job, it is likely to be scheduled almost immediately, so a minute or so later, I now see

% ls
myhost.sh
slurm-1483446.out
% cat slurm-1483446.out
mage03
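
Rather than giving the options on the command line, you can also place them in the script itself as #SBATCH directives. Below is a minimal sketch equivalent to the command above; it can then be submitted with just sbatch ./myhost.sh.

#!/bin/sh
#SBATCH --time=1                # 1 minute of runtime
#SBATCH --mem-per-cpu=512M      # memory per core
#SBATCH --cpus-per-task=1       # one core per task
#SBATCH --ntasks=1              # a single task
#SBATCH --nodes=1               # all on one machine

hostname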

Monitoring Your Job

The kstat perl script has been developed at K-State to provide you with all the available information about your jobs on Beocat. kstat --help will give you a full description of how to use it.

Eos>  kstat --help
 
 USAGE: kstat [-q] [-c] [-g] [-l] [-u user] [-p NaMD] [-j 1234567] [--part partition]
        kstat alone dumps all info except for the core summaries
        choose -q -c for only specific info on queued or core summaries.
        then specify any searchables for the user, program name, or job id
 
 kstat                 info on running and queued jobs
 kstat -h              list host info only, no jobs
 kstat -q              info on the queued jobs only
 kstat -c              core usage for each user
 kstat -d #            show jobs run in the last # days
                       Memory per node - used/allocated/requested
                       Red is close to or over requested amount
                       Yellow is under utilized for large jobs
 kstat -g              Only show GPU nodes
 kstat -o Turner       Only show info for a given owner
 kstat -o CS_HPC          Same but sub _ for spaces
 kstat -l              long list - node features and performance
                       Node hardware and node CPU usage
                       job nodelist and switchlist
                       job current and max memory
                       job CPU utilizations
 kstat -u daveturner   job info for one user only
 kstat --me            job info for my jobs only
 kstat -j 1234567      info on a given job id
 kstat --osg           show OSG background jobs also
 kstat --nocolor       do not use any color
 kstat --name          display full names instead of eIDs
 
 ---------------- Graphs and Tables ---------------------------------------
 Specify graph/table,  CPU or GPU or host, usage or memory, and optional time
 kstat --graph-cpu-memory #      gnuplot CPU memory for job #
 kstat --table-gpu-usage-5min #  GPU usage table every 5 min for job #
 kstat --table-cpu-60min #       CPU usage, memory, swap table every 60 min for job #
 kstat --table-node [nodename]   cores, load, CPU usage, memory table for a node
 
 --------------------------------------------------------------------------
   Multi-node jobs are highlighted in Magenta
      kstat -l also provides a node list and switch list
      highlighted in Yellow when nodes are spread across multiple switches
   Run time is colorized yellow then red for jobs nearing their time limit
   Queue time is colorized yellow then red for jobs waiting longer times
 --------------------------------------------------------------------------

kstat can be used to give you a summary of your jobs that are running and in the queue:

Eos>  kstat --me

Hero43       24 of 24 cores       Load 23.4 / 24       495.3 / 512 GB used
     daveturner   unafold   1234567    1 core   running        4gb req    0 d  5 h 35 m
     daveturner   octopus   1234568   16 core   running      128gb req    8 d 15 h 42 m
##################################  BeoCat Queue  ###################################
     daveturner   NetPIPE   1234569    2 core   PD   2h        4gb req    0 d  1 h  2 m

kstat produces a separate line for each host. Use kstat -h to see information on all hosts without the jobs. For the example above we are listing our jobs and the hosts they are on.

Core usage - yellow for empty, red for empty on owned nodes, cyan for partially used, blue for all cores used.
Load level - yellow or yellow background indicates the node is being inefficiently used. Red just means more threads than cores.
Memory usage - yellow or red means most memory is used.
If the node is owned the group name will be in orange on the right. Killable jobs can still be run on those nodes.

Each job line will contain the username, program name, job ID, number of cores, the status (which may be colored red for killable jobs), the maximum memory used or the memory requested, and the amount of time the job has run. Jobs in the queue may also show the requested memory and run time, priority access, constraints, and how long the job has been waiting in the queue. In this case, I have 2 jobs running on Hero43. unafold is using 1 core while octopus is using 16 cores. Slurm did not provide any information on the actual memory use, so the memory request is reported instead.

Detailed information about a single job

kstat can provide a great deal of information on a particular job, including a very rough estimate of when it will start. This estimate is a worst-case scenario and will improve as other jobs finish early. This is a good way to check for job submission problems before contacting us. kstat colorizes the more important information to make it easier to identify.

Eos>  kstat -j 157054

##################################   Beocat Queue    ###################################
 daveturner  netpipe     157054   64 cores  PD       dwarves fabric  CS HPC     8gb req   0 d  0 h  0 m

JobId 157054  Job Name  netpipe
  UserId=daveturner GroupId=daveturner_users(2117) MCS_label=N/A
  Priority=11112 Nice=0 Account=ksu-cis-hpc QOS=normal
  Status=PENDING Reason=Resources Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
  RunTime=00:00:00 TimeLimit=00:40:00 TimeMin=N/A
  SubmitTime=2018-02-02T18:18:31 EligibleTime=2018-02-02T18:18:31
  Estimated Start Time is 2018-02-03T06:17:49 EndTime=2018-02-03T06:57:49 Deadline=N/A
  PreemptTime=None SuspendTime=None SecsPreSuspend=0
  Partitions killable.q,ksu-cis-hpc.q AllocNode:Sid=eos:1761
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=(null) SchedNodeList=dwarf[01-02]
  NumNodes=2-2 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  TRES 2 nodes 64 cores 8192  mem gres/fabric 2
  Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
  MinCPUsNode=32 MinMemoryNode=4G MinTmpDiskNode=0
  Constraint=dwarves DelayBoot=00:00:00
  Gres=fabric Reservation=(null)
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Slurm script  /homes/daveturner/perf/NetPIPE-5.x/sb.np
  WorkDir=/homes/daveturner/perf/NetPIPE-5.x
  StdErr=/homes/daveturner/perf/NetPIPE-5.x/0.o157054
  StdIn=/dev/null
  StdOut=/homes/daveturner/perf/NetPIPE-5.x/0.o157054
  Switches=1@00:05:00
#!/bin/bash -l
#SBATCH --job-name=netpipe
#SBATCH -o 0.o%j
#SBATCH --time=0:40:00
#SBATCH --mem=4G
#SBATCH --switches=1
#SBATCH --nodes=2
#SBATCH --constraint=dwarves
#SBATCH --ntasks-per-node=32
#SBATCH --gres=fabric:roce:1

host=`echo $SLURM_JOB_NODELIST | sed s/[^a-z0-9]/\ /g | cut -f 1 -d ' '`
nprocs=$SLURM_NTASKS
openmpi_hostfile.pl $SLURM_JOB_NODELIST 1 hf.$host
opts="--printhostnames --quick --pert 3"

echo "*******************************************************************"
echo "Running on $SLURM_NNODES nodes $nprocs cores on nodes $SLURM_JOB_NODELIST"
echo "*******************************************************************"

mpirun -np 2 --hostfile hf.$host NPmpi $opts -o np.${host}.mpi
mpirun -np 2 --hostfile hf.$host NPmpi $opts -o np.${host}.mpi.bi --async --bidir
mpirun -np $nprocs NPmpi $opts -o np.${host}.mpi$nprocs --async --bidir

Completed jobs and memory usage

kstat -d #

This will provide information on the jobs you have currently running and those that have completed in the last '#' days. This is currently the only reliable way to get the memory used per node for your job. This also provides information on whether the job completed normally, was canceled with scancel, timed out, or was killed because it exceeded its memory request.

Eos>  kstat -d 10
###########################  sacct -u daveturner  for 10 days  ###########################
                                     max gb used on a node /   gb requested per node
 193037   ADF         dwarf43           1 n  32 c   30.46gb/100gb    05:15:34  COMPLETED
 193289   ADF         dwarf33           1 n  32 c   26.42gb/100gb    00:50:43  CANCELLED
 195171   ADF         dwarf44           1 n  32 c   56.81gb/120gb    14:43:35  COMPLETED
 209518   matlab      dwarf36           1 n   1 c    0.00gb/  4gb    00:00:02  FAILED

Summary of core usage

kstat can also provide a listing of the core usage and cores requested for each user.

Eos>  kstat -c

##############################   Core usage    ###############################
  antariksh       1512 cores   %25.1 used     41528 cores queued
  bahadori         432 cores   % 7.2 used        80 cores queued
  eegoetz            0 cores   % 0.0 used         2 cores queued
  fahrialkan        24 cores   % 0.4 used        32 cores queued
  gowri             66 cores   % 1.1 used        32 cores queued
  jeffcomer        160 cores   % 2.7 used         0 cores queued
  ldcoates12        80 cores   % 1.3 used       112 cores queued
  lukesteg         464 cores   % 7.7 used         0 cores queued
  mike5454        1060 cores   %17.6 used       852 cores queued
  nilusha          344 cores   % 5.7 used         0 cores queued
  nnshan2014       136 cores   % 2.3 used         0 cores queued
  ploetz           264 cores   % 4.4 used        60 cores queued
  sadish           812 cores   %13.5 used         0 cores queued
  sandung           72 cores   % 1.2 used        56 cores queued
  zhiguang          80 cores   % 1.3 used       688 cores queued

Producing memory and CPU utilization tables and graphs

kstat can now produce tables or graphs for the memory or CPU utilization for a job. In order to view graphs you must set up X11 forwarding on your ssh connection by using the -X parameter.
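
For example, after connecting with X11 forwarding enabled you might run the commands below, where 1234567 is a placeholder job ID and the head node name should be whichever one you normally use:

ssh -X eid@headnode.beocat.ksu.edu      # hypothetical host name; use your usual Beocat head node
kstat --graph-cpu-memory 1234567        # gnuplot graph of CPU memory use for the job
kstat --table-cpu-60min 1234567         # hourly CPU usage, memory, and swap table for the same job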

If you want to read more, continue on to our AdvancedSlurm page.

kstat is now available to download and install on other clusters

https://gitlab.beocat.ksu.edu/Admin-Public/kstat

This software has been installed and used on several clusters for many years. It should be considered Beta software and may take some slight modifications to install on some clusters. Please contact the author if you want to give it a try (daveturner@ksu.edu).
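
A minimal sketch of downloading it, assuming the repository allows anonymous HTTPS clones:

git clone https://gitlab.beocat.ksu.edu/Admin-Public/kstat.git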