From Beocat
(Added MPI instructions.)
(Removed content as SGE is gone.)
== Resource Requests ==
== SGE is gone, long live [[AdvancedSlurm|slurm]] ==
Aside from the time, RAM, and CPU requirements listed on the [[SGEBasics]] page, we have several other requestable resources. Generally, if you don't know whether you need a particular resource, you should use the default. The list below can be generated with the command
<tt>qconf -sc | awk '{ if ($5 != "NO") { print }}'</tt>
{| class="wikitable sortable"
!name
!shortcut
!type
!relop
!requestable
!consumable
!default
!urgency
|-
|arch
|a
|RESTRING
|==
|YES
|NO
|NONE
|0
|-
|avx
|avx
|BOOL       
|==   
|YES       
|NO       
|FALSE   
|0
|-
|calendar           
|c         
|RESTRING   
|==     
|YES       
|NO       
|NONE   
|0
|-
|cpu               
|cpu       
|DOUBLE     
|>=     
|YES       
|NO       
|0       
|0
|-
|cpu_flags         
|c_f       
|STRING     
|==     
|YES       
|NO       
|NONE   
|0
|-
|cuda               
|cuda     
|INT       
|<=     
|YES       
|JOB       
|0       
|0
|-
|display_win_gui   
|dwg       
|BOOL       
|==     
|YES       
|NO       
|0       
|0
|-
|exclusive         
|excl     
|BOOL       
|EXCL   
|YES       
|YES       
|0       
|1000
|-
|h_core             
|h_core   
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|h_cpu             
|h_cpu     
|TIME       
|<=     
|YES       
|NO       
|0:0:0   
|0
|-
|h_data             
|h_data   
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|h_fsize           
|h_fsize   
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|h_rss             
|h_rss     
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|h_rt               
|h_rt     
|TIME       
|<=     
|FORCED     
|NO       
|0:0:0   
|0
|-
|h_stack           
|h_stack   
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|h_vmem             
|h_vmem   
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|hostname           
|h         
|HOST       
|==     
|YES       
|NO       
|NONE   
|0
|-
|infiniband         
|ib       
|BOOL       
|==     
|YES       
|NO       
|FALSE   
|0
|-
|m_core             
|core     
|INT       
|<=     
|YES       
|NO       
|0       
|0
|-
|m_socket           
|socket   
|INT       
|<=     
|YES       
|NO       
|0       
|0
|-
|m_thread           
|thread   
|INT       
|<=     
|YES       
|NO       
|0       
|0
|-
|m_topology         
|topo     
|RESTRING   
|==     
|YES       
|NO       
|NONE   
|0
|-
|m_topology_inuse   
|utopo     
|RESTRING   
|==     
|YES       
|NO       
|NONE   
|0
|-
|mem_free           
|mf       
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|mem_total         
|mt       
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|mem_used           
|mu       
|MEMORY     
|>=     
|YES       
|NO       
|0     
|0
|-
|memory             
|mem       
|MEMORY     
|<=     
|FORCED     
|YES       
|0       
|0
|-
|num_proc           
|p         
|INT       
|==     
|YES       
|NO       
|0       
|0
|-
|qname             
|q         
|RESTRING   
|==     
|YES       
|NO       
|NONE   
|0
|-
|s_core             
|s_core   
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|s_cpu             
|s_cpu     
|TIME       
|<=     
|YES       
|NO       
|0:0:0   
|0
|-
|s_data             
|s_data   
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|s_fsize           
|s_fsize   
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|s_rss             
|s_rss     
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|s_rt               
|s_rt     
|TIME       
|<=     
|YES       
|NO       
|0:0:0   
|0
|-
|s_stack           
|s_stack   
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|s_vmem             
|s_vmem   
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|slots             
|s         
|INT       
|<=     
|YES       
|YES       
|1       
|1000
|-
|swap_free         
|sf       
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|swap_rate         
|sr       
|MEMORY     
|>=     
|YES       
|NO       
|0       
|0
|-
|swap_rsvd         
|srsv     
|MEMORY     
|>=     
|YES       
|NO       
|0       
|0
|-
|swap_total         
|st       
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|swap_used         
|su       
|MEMORY     
|>=     
|YES       
|NO       
|0       
|0
|-
|virtual_free       
|vf       
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|virtual_total     
|vt       
|MEMORY     
|<=     
|YES       
|NO       
|0       
|0
|-
|virtual_used       
|vu       
|MEMORY     
|>=     
|YES       
|NO       
|0       
|0
|}
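To see what the <tt>awk</tt> filter in the command above the table does, here is a minimal sketch that feeds it two sample rows instead of real <tt>qconf -sc</tt> output. The fifth column is ''requestable'', so rows where it is <tt>NO</tt> are dropped. The <tt>arch</tt> row is taken from the table; the <tt>hidden_example</tt> row is made up purely for illustration.

```bash
# Same filter as in the qconf pipeline: keep rows whose 5th column is not "NO".
filter_requestable() {
  awk '{ if ($5 != "NO") { print } }'
}

# Two sample rows: "arch" is from the table, "hidden_example" is hypothetical.
printf '%s\n' \
  'arch            a   RESTRING  ==  YES  NO  NONE  0' \
  'hidden_example  he  INT       <=  NO   NO  0     0' |
  filter_requestable
```

Only the <tt>arch</tt> row is printed, because its ''requestable'' column is <tt>YES</tt>.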
 
The good news is that most of these are rarely, if ever, used. There are a few exceptions, though:
=== Infiniband ===
First of all, let me state that just because it sounds "cool" doesn't mean you need it or even want it. Infiniband does absolutely no good if running in a 'single' parallel environment. Infiniband is a high-speed host-to-host communication fabric. It is used in conjunction with MPI jobs (discussed below). Several times we have had jobs which could run just fine, except that the submitter requested Infiniband, and all the nodes with Infiniband were currently busy. In fact, some of our fastest nodes do not have Infiniband, so by requesting it when you don't need it, you are actually slowing down your job. To request Infiniband, add <tt>-l ib=true</tt> to your qsub command-line.
=== CUDA ===
[[CUDA]] is the resource required for GPU computing. We have a very small number of nodes which have GPUs installed. To request one of these nodes, add <tt>-l cuda=true</tt> to your qsub command-line.
=== Exclusive ===
Some programs just don't play nicely with others. They will attempt to use all available memory or all the cores they can find. The way to be a nice neighbor if your program has this problem is to request exclusive use of a node with <tt>-l excl=true</tt>. This can also be useful for benchmarking, where you want to be sure that no other jobs are interfering with yours.
== Parallel Jobs ==
There are two ways jobs can run in parallel, ''intra''node and ''inter''node. '''Note: Beocat will not automatically make a job run in parallel.''' Have I said that enough? It's a common misperception.
=== Intranode jobs ===
Intranode jobs are easier to code and can take advantage of many common libraries, such as [http://openmp.org/wp/ OpenMP], or Java's threads. Your program will often need to be told how many cores to use, and many programs will use every core on the node unless told otherwise. This can be a problem when you are sharing resources, as Beocat does. To request multiple cores, use the qsub directive '<tt>-pe single ''n''</tt>', where ''n'' is the number of cores you wish to use. If your command can take an environment variable, you can use <tt>$NSLOTS</tt> to tell how many cores you've been allocated.
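As a rough sketch of how this fits together for an OpenMP program, the job script below requests 8 cores and passes the allocated count on through <tt>OMP_NUM_THREADS</tt>, the standard OpenMP thread-count variable. SGE exports the slot count as <tt>NSLOTS</tt> (upper case). The core count of 8 and the program name <tt>my_openmp_program</tt> are placeholders, not anything specific to Beocat.

```bash
#!/bin/bash
#$ -pe single 8

# NSLOTS is set by SGE inside a job; fall back to 1 when run outside the scheduler.
OMP_NUM_THREADS="${NSLOTS:-1}"
export OMP_NUM_THREADS

echo "Using ${OMP_NUM_THREADS} thread(s)"

# Hypothetical program; replace with your own OpenMP executable.
# ./my_openmp_program
```

Setting the thread count from <tt>NSLOTS</tt> rather than hard-coding it keeps the program from using more cores than the scheduler gave you if you later change the <tt>-pe</tt> request.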
=== Internode (MPI) jobs ===
"Talking" between nodes is trickier than talking between cores on the same node. The specification for doing so is called "[[wikipedia:Message_Passing_Interface|Message Passing Interface]]", or MPI. We have [http://www.open-mpi.org/ OpenMPI] installed on Beocat for this purpose. Most programs written to take advantage of large multi-node systems will use MPI. You can tell if you have an MPI-enabled program because its directions will tell you to run '<tt>mpirun ''program''</tt>'. Requesting MPI resources is only mildly more difficult than requesting single-node jobs. Instead of using '<tt>-pe single ''n''</tt>' for your qsub request, you will use one of the following:
{|
|mpi-fill
|This environment will use as many slots on each node as it can until it reaches the number of cores you have requested.
|-
|mpi-spread
|This environment will spread itself out over as many nodes as possible until it reaches the number of cores you have requested.
|-
|mpi-1
|This environment will allocate the slots you've requested 1 per node.
|-
|mpi-2
|This environment will allocate the slots you've requested 2 per node. You must request cores as a multiple of 2.
|-
|mpi-4
|This environment will allocate the slots you've requested 4 per node. You must request cores as a multiple of 4.
|-
|mpi-8
|This environment will allocate the slots you've requested 8 per node. You must request cores as a multiple of 8.
|-
|mpi-10
|This environment will allocate the slots you've requested 10 per node. You must request cores as a multiple of 10.
|-
|mpi-12
|This environment will allocate the slots you've requested 12 per node. You must request cores as a multiple of 12.
|-
|mpi-16
|This environment will allocate the slots you've requested 16 per node. You must request cores as a multiple of 16.
|-
|mpi-80
|This environment will allocate the slots you've requested 80 per node. You must request cores as a multiple of 80.
|}
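The "multiple of ''N''" rule for the fixed-size environments is easy to get wrong, so it can help to check a request before submitting it. The helper below is only a sketch of that check: <tt>check_pe_request</tt> is a hypothetical function name, not a Beocat or SGE tool, and it knows nothing about which environments are actually configured beyond the naming pattern in the table.

```bash
# Print "ok" if the core count fits the given MPI parallel environment,
# "invalid" if it breaks the multiple-of-N rule.
check_pe_request() {
  pe="$1"
  cores="$2"
  case "$pe" in
    mpi-fill|mpi-spread)
      # No per-node multiple is required for these two.
      echo "ok" ;;
    mpi-[0-9]*)
      per_node="${pe#mpi-}"   # e.g. "mpi-4" -> "4"
      if [ $(( cores % per_node )) -eq 0 ]; then
        echo "ok"
      else
        echo "invalid"
      fi ;;
    *)
      echo "unknown environment" ;;
  esac
}

check_pe_request mpi-4 16   # prints "ok"
check_pe_request mpi-8 12   # prints "invalid"
```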
Some quick examples:
 
<tt>-pe mpi-4 16</tt> will give you 4 chunks of 4 cores apiece. They might all happen to be allocated on the same node (16 cores), on 4 different nodes (4 cores each), on 3 nodes (8 cores on one and 4 cores on the other two), or on 2 nodes (8 cores each).
 
<tt>-pe mpi-fill 40</tt> will give you 40 cores, but will attempt to get them all on the same node.
 
<tt>-pe mpi-fill 100</tt> will give you 100 cores, and place them on as few nodes as possible. In this case it's likely you would get a full mage (80 cores) and either part of another mage (the remaining 20 cores) or one of the 20-core elves.
 
<tt>-pe mpi-spread 40</tt> will give you 40 cores, and will attempt to place each on a separate node.
== Requesting memory for multi-core jobs ==
All memory requests are '''per core'''. One of the more common scenarios is where somebody needs, say, 20 cores and 400 GB of memory, so they make a request like '<tt>-pe single 20 -l mem=400G</tt>'. This will never run, because what you are really requesting is 20 cores and 8000 GB of memory (20 * 400). Since we have no nodes with 8000 GB of memory, the job will never run. Instead, divide the 400 GB total memory request by the number of cores (20), so the correct command would be '<tt>-pe single 20 -l mem=20G</tt>'.
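The division can be done in the shell before you write the request. This is a small sketch using the numbers from the example above; the variable names are arbitrary, and rounding up (so the job never gets less than the total you wanted) is an assumption about what you would prefer.

```bash
total_mem_gb=400   # total memory the whole job needs
cores=20           # number of cores requested

# Integer division, rounded up to the next whole gigabyte.
per_core_gb=$(( (total_mem_gb + cores - 1) / cores ))

# Prints: -pe single 20 -l mem=20G
echo "-pe single ${cores} -l mem=${per_core_gb}G"
```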
== Running jobs interactively ==
Some jobs just don't behave like we think they should, or need to be run with somebody sitting at the keyboard and typing in response to the output the computers are generating. Beocat has a facility for this, called 'qrsh'. qrsh uses the exact same command-line arguments as qsub. If no node is available with your resource requirements, qrsh will tell you:
Your "qrsh" request could not be scheduled, try again later.
Note that, as with qsub, your interactive job will time out after your allotted time has passed.
== Job Accounting ==
Some people may find it useful to know what their job did during its run. The qacct tool will read SGE's accounting file and give you summarized or detailed views on jobs that have run within Beocat.
=== qacct ===
This data can usually be used to diagnose two very common job failures.
==== Job debugging ====
It is simplest if you know the job number of the job you are trying to get information on.
<syntaxhighlight lang="bash" line>
# If you know the job ID, put it here:
qacct -j 1122334455
# If you don't know the job ID, list your jobs from the past several days (here, the past 14 days):
qacct -o "$USER" -d 14 -j
</syntaxhighlight>
 
===== My job didn't do anything when it ran! =====
<tt>qname        batch.q           
hostname    mage07.beocat     
group        some_user_users       
owner        some_user             
project      BEODEFAULT         
department  defaultdepartment 
jobname      my_job_script.sh 
jobnumber    1122334455         
...
snipped to save space
...
exit_status  1                  </tt>
<tt style="color: red">ru_wallclock 1s</tt>
<tt>ru_utime    0.030s
ru_stime    0.030s
...
snipped to save space
...
arid        undefined
category    -u some_user -q batch.q,long.q -l h_rt=604800,mem_free=1024.0M,memory=2G</tt>
If you look at the line showing ru_wallclock, you can see that it shows 1s. This means that the job started and then promptly ended, which points to something being wrong with your submission script. Perhaps there is a typo somewhere in it.
 
===== My job ran but didn't finish! =====
<tt>qname        batch.q           
hostname    scout59.beocat     
group        some_user_users   
owner        some_user         
project      BEODEFAULT         
department  defaultdepartment 
jobname      my_job_script.sh         
jobnumber    1122334455           
...
snipped to save space
...           
slots        1                  </tt>
<tt style="color: red">failed      37  : qmaster enforced h_rt, h_cpu, or h_vmem limit</tt>
<tt>exit_status  0                  </tt>
<tt style="color: red">ru_wallclock 21600s</tt>
<tt>ru_utime    0.130s
ru_stime    0.020s
...
snipped to save space
...
arid        undefined</tt>
<tt style="color: red">category    -u some_user -q batch.q,long.q -l h_rt=21600,mem_free=512.0M,memory=1G</tt>
If you look at the lines showing failed, ru_wallclock, and category, you can see some pointers to the issue.
It didn't finish because the scheduler (qmaster) enforced some limit. If you look at the category line, the only limit requested was h_rt, so it was a runtime (wallclock) limit.
Comparing ru_wallclock and the h_rt request, we can see that the job ran until the h_rt time was hit, and then the scheduler enforced the limit and killed the job. You will need to resubmit the job and ask for more time.
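The two checks described in this section can be roughly automated by piping <tt>qacct</tt> output through a small filter. The sketch below is only an illustration: <tt>diagnose_job</tt> is a hypothetical helper, the one-second threshold for "ended immediately" is an assumption, and it only looks at the <tt>ru_wallclock</tt> and <tt>category</tt> lines in the format shown above.

```bash
# Read qacct-style output on stdin and print a one-line guess at what went wrong.
diagnose_job() {
  awk '
    # Wallclock time, with the trailing "s" removed (e.g. "21600s" -> "21600").
    $1 == "ru_wallclock" { wall = $2; sub(/s$/, "", wall) }
    # Requested runtime limit in seconds, taken from "h_rt=NNN" on the category line.
    $1 == "category" && match($0, /h_rt=[0-9]+/) {
      limit = substr($0, RSTART + 5, RLENGTH - 5)
    }
    END {
      if (wall == "") {
        print "no ru_wallclock line found"
      } else if (wall + 0 <= 1) {
        print "ended almost immediately: check the submission script"
      } else if (limit != "" && wall + 0 >= limit + 0) {
        print "ran until the h_rt limit: request more time"
      } else {
        print "no obvious problem found"
      }
    }'
}

# Sample lines copied from the second example above.
printf '%s\n' \
  'ru_wallclock 21600s' \
  'category    -u some_user -q batch.q,long.q -l h_rt=21600,mem_free=512.0M,memory=1G' |
  diagnose_job
```

With the sample lines this prints "ran until the h_rt limit: request more time". On a real job you would run something like <tt>qacct -j 1122334455 | diagnose_job</tt>.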

Latest revision as of 23:11, 19 September 2018

SGE is gone, long live slurm