The CentOS/Slurm nodes

We converted Beocat from Gentoo Linux to CentOS Linux on December 26, 2017. Any applications or libraries from the old system must be recompiled. We also converted Beocat to use the Slurm scheduler instead of SGE, so you will also need to convert all your old qsub scripts to sbatch scripts. We have developed tools to make this process as easy as possible.

Using Modules

If you're using a common code that others may also be using, we may already have it compiled in a module. You can list the available modules and load an application as in the example below for VASP.

eos> module avail
eos> module load VASP
eos> module list

When a module is loaded, all the necessary libraries are also loaded, and the paths to the libraries and executables are set up automatically. Loading VASP, for example, also loads the OpenMPI library needed to run it and adds the paths to the MPI commands and VASP executables. To see how the path is set up, try executing which vasp_std. The module system also allows you to easily switch between different versions of applications, libraries, or languages.

If you are using a custom code or one that is not installed in a module, you'll need to recompile it yourself. This process is easier under CentOS, as some of the work just involves loading the necessary set of modules. The first step is to decide whether to use the Intel compiler toolchain or the GNU toolchain, each of which includes the compilers and math libraries. The module commands for each are below, and you can load these automatically when you log in by adding one of these module load statements to your .bashrc file. See /homes/daveturner/.bashrc for an example of where to put the module load statements.

To load the Intel compiler tool chain including the Intel Math Kernel Library:
eos> module load iomkl

To load the GNU compiler tool chain including OpenBLAS, FFTW, and ScaLAPACK, load foss (free open source software):
eos> module load foss
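
As a rough sketch of the .bashrc approach mentioned above, the fragment below loads one toolchain at login but skips quietly on machines where the module command is not defined. The function name load_toolchain is made up for this illustration and is not a Beocat tool.

```shell
# Hypothetical ~/.bashrc fragment (load_toolchain is an illustrative name).
# Loads a toolchain module only when the 'module' command is available.
load_toolchain() {
    local toolchain="${1:-foss}"   # default to the GNU toolchain
    if command -v module >/dev/null 2>&1; then
        module load "$toolchain"
    else
        echo "module command not found; skipping $toolchain" >&2
    fi
}

load_toolchain foss   # or: load_toolchain iomkl
```

Load only one of the two toolchains this way, as the text above suggests.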

Modules provide an easy way to set up the compilers and libraries you may need to compile your code. Beyond that, there are many different ways to compile codes, so you'll need to follow the build directions that come with your code. If you need help, you can always email us at beocat@cs.ksu.edu.

Converting your qsub script for sbatch using kstat.convert

If you already have a qsub script, I have created a new Perl program called kstat.convert that will automatically convert it to an sbatch script.

kstat.convert --sge qsub_script.sh --slurm slurm_script.sh

Below is an example of a simple qsub script and the resulting sbatch script after conversion.

#!/bin/bash
#$ -j y
#$ -cwd
#$ -N netpipe
#$ -P KSU-CIS-HPC

#$ -l mem=4G
#$ -l h_rt=100:00:00
#$ -pe single 32

#$ -M daveturner@ksu.edu
#$ -m ab

mpirun -np $NSLOTS NPmpi -o np.out

#!/bin/bash -l
#SBATCH --job-name=netpipe

#SBATCH --mem-per-cpu=4G   # Memory per core, use --mem= for memory per node
#SBATCH --time=4-04:00:00   # Use the form DD-HH:MM:SS
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32

#SBATCH --mail-user=daveturner@ksu.edu
#SBATCH --mail-type=ALL   # same as =BEGIN,FAIL,END

mpirun -np $SLURM_NPROCS NPmpi -o np.out

The sbatch file uses #SBATCH to identify command options for the scheduler where the qsub file uses #$. Most options are similar but simply use a different syntax. The memory can still be defined on a per-core basis as with SGE, or you can use --mem=128G to specify the total memory per node if you'd prefer. The --nodes= and --ntasks-per-node= options provide an easy way to request the core configuration you want. If your code can be distributed across multiple nodes and you don't care what the arrangement is, you can instead just specify the number of cores using --ntasks=. For more in-depth documentation on converting from SGE to Slurm, follow the links below:

https://srcc.stanford.edu/sge-slurm-conversion
https://slurm.schedmd.com/sbatch.html
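
The time limit is one place where the two syntaxes differ in form: SGE's h_rt=100:00:00 becomes --time=4-04:00:00 in the example above. The helper below is only a sketch of that arithmetic, not the code used by kstat.convert. The function name sge_to_slurm_time is hypothetical, and it assumes the input is always in HH:MM:SS form.

```shell
# Convert an SGE h_rt value (HH:MM:SS, hours may exceed 24) into
# Slurm's D-HH:MM:SS form. Illustrative helper, not part of kstat.convert.
sge_to_slurm_time() {
    local h rest m s
    h=${1%%:*}            # hours field
    rest=${1#*:}
    m=${rest%%:*}         # minutes field
    s=${rest#*:}          # seconds field
    # Strip leading zeros so "08" is not read as an octal number.
    h=${h#"${h%%[!0]*}"}
    h=${h:-0}
    printf '%d-%02d:%s:%s\n' $(( h / 24 )) $(( h % 24 )) "$m" "$s"
}

sge_to_slurm_time 100:00:00   # prints 4-04:00:00
```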

Submitting jobs to Slurm

Once your qsub script has been converted to an sbatch script and you have an application compiled for CentOS, you can submit the job using the sbatch command.

eos> sbatch sbatch_script.sh
eos> kstat --me

This will submit the script and show you a list of your jobs that are running and the jobs you have in the queue. By default the output for each job will go into a slurm-###.out file where ### is the job ID number. If you need to kill a job, you can use the scancel command with the job ID number.
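
If you want to script this, the job ID can be pulled out of the submission message and reused for the output file name or for scancel. The sketch below assumes the message ends with the numeric job ID (for example "Submitted batch job 1483446"); job_id_from_line is a hypothetical helper name.

```shell
# Extract the trailing job ID from a submission message.
# Assumes the ID is the last space-separated field; returns non-zero otherwise.
job_id_from_line() {
    local line="$1"
    local id="${line##* }"        # keep everything after the last space
    case "$id" in
        ''|*[!0-9]*) return 1 ;;  # not a plain number
        *) printf '%s\n' "$id" ;;
    esac
}

# Usage on the cluster (commented out because it needs Slurm):
#   jobid=$(job_id_from_line "$(sbatch sbatch_script.sh)")
#   cat "slurm-${jobid}.out"    # default output file for that job
#   scancel "$jobid"            # kill the job if needed
job_id_from_line "Submitted batch job 1483446"   # prints 1483446
```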

Submitting your first job

To submit a job to run under Slurm, we use the sbatch (submit batch) command. The scheduler finds the optimum place for your job to run. With over 300 nodes and 7500 cores to schedule, as well as differing priorities, hardware, and individual resources, the scheduler's job is not trivial and it can take some time for a job to start even when there are empty nodes available.

There are a few things you'll need to know before running sbatch.

  • How many cores you need. Note that unless your program is created to use multiple cores (called "threading"), asking for more cores will not speed up your job. This is a common misperception. Beocat will not magically make your program use multiple cores! For this reason the default is 1 core.
  • How much time you need. Many users new to Beocat neglect to specify a time requirement. The default is one hour, and we are often asked why a job died after one hour. We usually point them to the FAQ.
  • How much memory you need. The default is 1 GB. If your job uses significantly more than you ask for, it will be killed.
  • Any advanced options. See the AdvancedSlurm page for these requests. For our basic examples here, we will ignore these.
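
Because memory requested with --mem-per-cpu is per core, the total a job reserves on a node scales with the core count. A quick sanity check of that arithmetic, assuming whole-GB values (total_mem_gb is a made-up helper name):

```shell
# Total memory (GB) reserved when requesting memory per core.
# Arguments: number of cores, GB per core (whole numbers only).
total_mem_gb() {
    local cores="$1" per_core_gb="$2"
    echo $(( cores * per_core_gb ))
}

total_mem_gb 32 4   # prints 128: 32 cores at 4G each totals 128G on one node
```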

So let's now create a small script to test our ability to submit jobs. Create the following file (either by copying it to Beocat or by editing a text file) and name it myhost.sh. Both of these methods are documented on our LinuxBasics page.

#!/bin/sh
srun hostname

Be sure to make it executable:

chmod u+x myhost.sh

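So, now let's submit it as a job and see what happens. Here I'm going to use the first five options below; the last bullet shows a multi-node request for comparison and is not used in this example.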

  • --mem-per-cpu= tells how much memory I need. In my example, I'm using our system minimum of 512 MB, which is more than enough. Note that your memory request is per core, which doesn't make much difference for this example, but will as you submit more complex jobs.
  • --time= tells how much runtime I need. This can be in the form of "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". This is a very short job, so 1 minute should be plenty. The time limit can't be changed after the job has started, so please make sure you have requested a sufficient amount of time.
  • --cpus-per-task=1 tells Slurm that I need only a single core per task. The AdvancedSlurm page has much more on the "cpus-per-task" switch.
  • --ntasks=1 tells Slurm that I only need to run 1 task. The AdvancedSlurm page has much more on the "ntasks" switch.
  • --nodes=1 tells Slurm that this must be run on one machine. The AdvancedSlurm page has much more on the "nodes" switch.
  • --nodes=4 --ntasks-per-node=16 --constraint=elf requests 4 nodes with 16 cores on each and restricts the job to the Elves.

% ls
myhost.sh
% sbatch --time=1 --mem-per-cpu=512M --cpus-per-task=1 --ntasks=1 --nodes=1 ./myhost.sh
salloc: Granted job allocation 1483446

Since this is such a small job, it is likely to be scheduled almost immediately, so a minute or so later, I now see:

% ls
myhost.sh
slurm-1483446.out
% cat slurm-1483446.out
mage03
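
The same five options can also live inside the script as #SBATCH lines, in the style of the converted script earlier on this page, so the command line stays short. The sketch below writes such a script and submits it only if sbatch is available; the file name myhost_batch.sh is just an example.

```shell
# Write a batch script that carries its own resource requests.
cat > myhost_batch.sh <<'EOF'
#!/bin/bash
#SBATCH --time=1              # 1 minute of runtime
#SBATCH --mem-per-cpu=512M    # memory per core
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --nodes=1

srun hostname
EOF

# Submit only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch myhost_batch.sh
else
    echo "sbatch not found; myhost_batch.sh written but not submitted"
fi
```

Options given on the sbatch command line still work with a script like this, so you can keep the defaults in the file and adjust them per run.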

Monitoring Your Job

The kstat Perl script has been developed at K-State to provide you with all the available information about your jobs on Beocat. kstat --help will give you a full description of how to use it. The Slurm version of kstat is very similar to the SGE version, with two exceptions: the actual memory usage of each job is not always available, so the memory requested is reported instead, and the memory usage on each node is not always accurate since Slurm includes disk cache. We are continuing to look for better ways to get the memory usage for each job, but at the moment you may need to use Ganglia and look at the memory graph for the node you are running on to get an accurate idea of the memory being used by your application.

eos> kstat --help
USAGE: kstat [-q] [-c] [-g] [-l] [-u user] [-p NaMD] [-j 1234567] [--part partition]
      kstat alone dumps all info except for the core summaries
      choose -q -c for only specific info on queued or core summaries.
      then specify any searchables for the user, program name, or job id
kstat                 info on running and queued jobs
kstat -q              info on the queued jobs only
kstat -c              core usage for each user
kstat -g              gpu nodes only
kstat -l              long list - prints full node list
kstat -u daveturner   job info for one user only
kstat --me            job info for my jobs only
kstat -j 1234567      info on a given job id
kstat --nocolor       do not use any color
--------------------------------------------------------------------------
  Multi-node jobs are highlighted in Magenta
     The switch and nodes/switch are on the right
     highlighted in Yellow when nodes are spread across multiple switches
  Shared jobs are highlighted in Cyan
  Memory requested is reported along with the total used when available
     Total RSS / Total VMSize / Total requested
  Runtime is colorized with yellow then red for jobs nearing their time limit
  Time in the queue is colorized yellow then red for jobs waiting long times
--------------------------------------------------------------------------


kstat --me            This will give you a good summary of your jobs that are running and in the queue.

Hero43       24 of 24 cores       Load 23.4 / 24       495.3 / 512 GB used
     daveturner       unafold        1234567       1 core run                127 MB                 0 d 5 h 35 m
     daveturner       octopus       1234568      16 core run               125 GB                 8 d 15 h 42 m
################################## BeoCat Queue ###################################
     daveturner       NetPIPE       1234569      2 core     qw   2h   4 GB       killable      0 d 1 h 2 m

kstat produces a separate line for each host. Use kstat -h to see information on all hosts without the jobs. For the example above we are listing our jobs and the hosts they are on.

Host names - yellow background means reserved, red background means down, red means owned by the group in orange on the right.
Core usage - yellow for empty, cyan for partially used, blue for all cores used.
Load level - yellow or yellow background indicates the node is being inefficiently used. Red just means more threads than cores.
Memory usage - yellow or red means most memory is used. Exceeding memory will show disk swap in background red.

Each job line will contain the username, program name, job ID number, number of cores, maximum memory used, whether the job is killable, and the amount of time the job has run. If the job is still in the queue, the line may instead show the requested run time and memory per core, and the time shown is how long the job has been in the queue.

In this case, I have 2 jobs running on Hero43. unafold is using 1 core while octopus is using 16 cores. The most useful information here is the memory being used in each case. While unafold is taking very little memory, octopus is using 125 GB and the red font indicates that it is close to the amount requested. If the memory on a job is over the requested amount it will have a red background and you should request more memory in future runs. If the memory is flashing with a red background, you are more than 50% over your requested amount and your code will be forced to use disk swap which can slow it down enormously. You're usually better off killing the job and restarting with an appropriate memory request. If the code accesses large files, there may be an IO value reported. This number is not very accurate.
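
The color rules above amount to comparing used memory against requested memory. The sketch below mirrors that comparison. It is not kstat's actual code, the 90% cutoff for "close to the request" is an assumption made for illustration, and mem_status is a hypothetical name.

```shell
# Classify memory use relative to the request (same units for both, whole numbers).
# The 90% "near" threshold is an assumed value, not taken from kstat.
mem_status() {
    local used="$1" requested="$2"
    if [ $(( used * 2 )) -gt $(( requested * 3 )) ]; then
        echo "more than 50% over: expect swapping"
    elif [ "$used" -gt "$requested" ]; then
        echo "over request"
    elif [ $(( used * 10 )) -ge $(( requested * 9 )) ]; then
        echo "near request"
    else
        echo "ok"
    fi
}

mem_status 125 128   # hypothetical 128 GB request; prints "near request"
```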

kstat -d 7            This will show you information about the jobs that have completed in the last 7 days.

kstat -c            This provides a global view of Beocat showing how many cores each person is using.

Generally speaking, jobs that are higher on the list will start running before the ones lower on the list, so you can see your relative position. It is also useful to see how busy Beocat is; http://ganglia.beocat.ksu.edu/ will give you those statistics. A job you submit may start immediately or may take up to several weeks, depending on the priority of your job, the resources available, and the requested resources of the jobs ahead of you in the queue.

If you want to read more, continue on to our AdvancedSlurm page.