| == SGE is gone, long live [[AdvancedSlurm|slurm]] ==
| == Resource Requests ==
| Aside from the time, RAM, and CPU requirements listed on the [[SGEBasics]] page, we have several other requestable resources. Generally, if you don't know whether you need a particular resource, you should use the default. The full list below can be generated with the command:
| |
| <tt>qconf -sc | awk '{ if ($5 != "NO") { print }}'</tt>
| |
| {|
| |
| !name
| |
| !shortcut
| |
| !type
| |
| !relop
| |
| !requestable
| |
| !consumable
| |
| !default
| |
| !urgency
| |
| |-
| |
| |arch
| |
| |a
| |
| |RESTRING
| |
| |==
| |
| |YES
| |
| |NO
| |
| |NONE
| |
| |0
| |
| |-
| |
| |avx
| |
| |avx
| |
| |BOOL
| |
| |==
| |
| |YES
| |
| |NO
| |
| |FALSE
| |
| |0
| |
| |-
| |
| |calendar
| |
| |c
| |
| |RESTRING
| |
| |==
| |
| |YES
| |
| |NO
| |
| |NONE
| |
| |0
| |
| |-
| |
| |cpu
| |
| |cpu
| |
| |DOUBLE
| |
| |>=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |cpu_flags
| |
| |c_f
| |
| |STRING
| |
| |==
| |
| |YES
| |
| |NO
| |
| |NONE
| |
| |0
| |
| |-
| |
| |cuda
| |
| |cuda
| |
| |INT
| |
| |<=
| |
| |YES
| |
| |JOB
| |
| |0
| |
| |0
| |
| |-
| |
| |display_win_gui
| |
| |dwg
| |
| |BOOL
| |
| |==
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |exclusive
| |
| |excl
| |
| |BOOL
| |
| |EXCL
| |
| |YES
| |
| |YES
| |
| |0
| |
| |1000
| |
| |-
| |
| |h_core
| |
| |h_core
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |h_cpu
| |
| |h_cpu
| |
| |TIME
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0:0:0
| |
| |0
| |
| |-
| |
| |h_data
| |
| |h_data
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |h_fsize
| |
| |h_fsize
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |h_rss
| |
| |h_rss
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |h_rt
| |
| |h_rt
| |
| |TIME
| |<=
| |
| |FORCED
| |
| |NO
| |
| |0:0:0
| |
| |0
| |
| |-
| |
| |h_stack
| |
| |h_stack
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |h_vmem
| |
| |h_vmem
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |hostname
| |
| |h
| |
| |HOST
| |
| |==
| |
| |YES
| |
| |NO
| |
| |NONE
| |
| |0
| |
| |-
| |
| |infiniband
| |
| |ib
| |
| |BOOL
| |
| |==
| |
| |YES
| |
| |NO
| |
| |FALSE
| |
| |0
| |
| |-
| |
| |m_core
| |
| |core
| |
| |INT
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |m_socket
| |
| |socket
| |
| |INT
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |m_thread
| |
| |thread
| |
| |INT
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |m_topology
| |
| |topo
| |
| |RESTRING
| |
| |==
| |
| |YES
| |
| |NO
| |
| |NONE
| |
| |0
| |
| |-
| |
| |m_topology_inuse
| |
| |utopo
| |
| |RESTRING
| |
| |==
| |
| |YES
| |
| |NO
| |
| |NONE
| |
| |0
| |
| |-
| |
| |mem_free
| |
| |mf
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |mem_total
| |
| |mt
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |mem_used
| |
| |mu
| |
| |MEMORY
| |
| |>=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |memory
| |
| |mem
| |
| |MEMORY
| |
| |<=
| |
| |FORCED
| |
| |YES
| |
| |0
| |
| |0
| |
| |-
| |
| |num_proc
| |
| |p
| |
| |INT
| |
| |==
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |qname
| |
| |q
| |
| |RESTRING
| |
| |==
| |
| |YES
| |
| |NO
| |
| |NONE
| |
| |0
| |
| |-
| |
| |s_core
| |
| |s_core
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |s_cpu
| |
| |s_cpu
| |
| |TIME
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0:0:0
| |
| |0
| |
| |-
| |
| |s_data
| |
| |s_data
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |s_fsize
| |
| |s_fsize
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |s_rss
| |
| |s_rss
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |s_rt
| |
| |s_rt
| |
| |TIME
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0:0:0
| |
| |0
| |
| |-
| |
| |s_stack
| |
| |s_stack
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |s_vmem
| |
| |s_vmem
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |slots
| |
| |s
| |
| |INT
| |
| |<=
| |
| |YES
| |
| |YES
| |
| |1
| |
| |1000
| |
| |-
| |
| |swap_free
| |
| |sf
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |swap_rate
| |
| |sr
| |
| |MEMORY
| |
| |>=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |swap_rsvd
| |
| |srsv
| |
| |MEMORY
| |
| |>=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |swap_total
| |
| |st
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |swap_used
| |
| |su
| |
| |MEMORY
| |
| |>=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |virtual_free
| |
| |vf
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |virtual_total
| |
| |vt
| |
| |MEMORY
| |
| |<=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |-
| |
| |virtual_used
| |
| |vu
| |
| |MEMORY
| |
| |>=
| |
| |YES
| |
| |NO
| |
| |0
| |
| |0
| |
| |}
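The awk filter in the command above keeps only the rows whose fifth column (requestable) is something other than NO. Here is a minimal sketch of that filter run against a few sample rows, so it works without access to <tt>qconf</tt>. The first two rows are copied from the table; the third row (<tt>example_hidden</tt>) and the function names are made up purely for illustration.

```shell
# Sample rows in `qconf -sc` column order:
# name shortcut type relop requestable consumable default urgency
# The first two rows come from the table above; the last row is a
# hypothetical non-requestable entry, included only to show the filter.
sample_complexes() {
  cat <<'EOF'
h_rt h_rt TIME <= FORCED NO 0:0:0 0
infiniband ib BOOL == YES NO FALSE 0
example_hidden eh INT <= NO NO 0 0
EOF
}

# Keep rows whose fifth column (requestable) is anything other than NO.
requestable_only() {
  awk '$5 != "NO"'
}

sample_complexes | requestable_only
```

Only the <tt>h_rt</tt> and <tt>infiniband</tt> rows should be printed, since both YES and FORCED count as requestable.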
| |
| | |
| The good news is that nobody ever uses most of these resources. There are a couple of exceptions, though:
| |
| === Infiniband ===
| |
| First of all, just because Infiniband sounds "cool" doesn't mean you need it or even want it. Infiniband is a high-speed host-to-host communication fabric used in conjunction with MPI jobs (discussed below), so it does absolutely no good for a job running in a 'single' parallel environment. Several times we have had jobs that could have run just fine, except that the submitter requested Infiniband and all the nodes with Infiniband were busy. In fact, some of our fastest nodes do not have Infiniband, so requesting it when you don't need it actually slows down your job. To request Infiniband, add <tt>-l ib=true</tt> to your qsub command-line.
| |
| === CUDA ===
| |
| [[CUDA]] is the resource required for GPU computing. We have a very small number of nodes which have GPUs installed. To request one of these nodes, add <tt>-l cuda=true</tt> to your qsub command-line.
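To see where the <tt>-l</tt> requests for Infiniband and CUDA land on the command line, here is a small sketch that only prints the qsub command instead of submitting anything. The helper name <tt>print_qsub</tt> is hypothetical, and <tt>my_job_script.sh</tt> is just the placeholder script name used in the accounting examples on this page.

```shell
# Print (do not run) a qsub command line that requests extra resources.
# Usage: print_qsub SCRIPT [RESOURCE=VALUE ...]
# print_qsub is a made-up helper for illustration, not part of SGE.
print_qsub() {
  script=$1
  shift
  cmd="qsub"
  # Add one "-l resource=value" request per remaining argument.
  for res in "$@"; do
    cmd="$cmd -l $res"
  done
  printf '%s %s\n' "$cmd" "$script"
}

print_qsub my_job_script.sh ib=true
print_qsub my_job_script.sh cuda=true
```

Once the printed line looks right, the same arguments can be passed to the real <tt>qsub</tt>.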
| |
| == Job Accounting ==
| |
| Some people may find it useful to know what their job did during its run. The qacct tool reads SGE's accounting file and gives you summarized or detailed views of jobs that have run within Beocat.
| |
| === qacct ===
| |
| This data can usually be used to diagnose two very common job failures.
| |
| ==== Job debugging ====
| |
| It is simplest if you know the job number of the job you are trying to get information on.
| |
| <syntaxhighlight lang="bash" line>
| |
| # if you know the jobid, put it here:
| |
| qacct -j 1122334455
| |
| # if you don't know the job id, list your jobs from the past N days (here, the past 14 days):
| |
| qacct -o $USER -d 14 -j
| |
| </syntaxhighlight>
| |
| | |
| ===== My job didn't do anything when it ran! =====
| |
| <tt>qname batch.q
| |
| hostname mage07.beocat
| |
| group some_user_users
| |
| owner some_user
| |
| project BEODEFAULT
| |
| department defaultdepartment
| |
| jobname my_job_script.sh
| |
| jobnumber 1122334455
| |
| ...
| |
| snipped to save space
| |
| ...
| |
| exit_status 1 </tt>
| |
| <tt style="color: red">ru_wallclock 1s</tt>
| |
| <tt>ru_utime 0.030s
| |
| ru_stime 0.030s
| |
| ...
| |
| snipped to save space
| |
| ...
| |
| arid undefined
| |
| category -u some_user -q batch.q,long.q -l h_rt=604800,mem_free=1024.0M,memory=2G</tt>
| |
| Look at the line showing ru_wallclock: it shows 1s. This means that the job started and then promptly ended, which points to something being wrong with your submission script. Perhaps there is a typo somewhere in it.
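If you would rather not eyeball the record, a check along these lines can flag a job that ended almost immediately. This is only a sketch: the function names and the 5-second default cutoff are arbitrary choices made for this example, and the sample record is a trimmed copy of the output shown above.

```shell
# Trimmed copy of the record above, in qacct's "field value" layout.
sample_short_record() {
  cat <<'EOF'
jobnumber    1122334455
exit_status  1
ru_wallclock 1s
EOF
}

# Warn when ru_wallclock is below a cutoff in seconds.
# The default cutoff of 5 is an arbitrary value chosen for this sketch.
check_short_run() {
  awk -v limit="${1:-5}" '
    $1 == "exit_status"  { status = $2 }
    $1 == "ru_wallclock" { secs = $2; sub(/s$/, "", secs) }  # drop trailing "s" if present
    END {
      if (secs + 0 < limit)
        printf "ran only %ss (exit_status %s): check the job script\n", secs, status
      else
        printf "ran for %ss\n", secs
    }'
}

sample_short_record | check_short_run
```

On a real record the input would come from qacct instead, for example <tt>qacct -j 1122334455 | check_short_run</tt>.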
| |
| | |
| ===== My job ran but didn't finish! =====
| |
| <tt>qname batch.q
| |
| hostname scout59.beocat
| |
| group some_user_users
| |
| owner some_user
| |
| project BEODEFAULT
| |
| department defaultdepartment
| |
| jobname my_job_script.sh
| |
| jobnumber 1122334455
| |
| ...
| |
| snipped to save space
| |
| ...
| |
| slots 1 </tt>
| |
| <tt style="color: red">failed 37 : qmaster enforced h_rt, h_cpu, or h_vmem limit</tt>
| |
| <tt>exit_status 0 </tt>
| |
| <tt style="color: red">ru_wallclock 21600s</tt>
| |
| <tt>ru_utime 0.130s
| |
| ru_stime 0.020s
| |
| ...
| |
| snipped to save space
| |
| ...
| |
| arid undefined</tt>
| |
| <tt style="color: red">category -u some_user -q batch.q,long.q -l h_rt=21600,mem_free=512.0M,memory=1G</tt>
| |
| The lines showing failed, ru_wallclock, and category give some pointers to the issue.
| |
| The job didn't finish because the scheduler (qmaster) enforced some limit. Of the three limits named in the failed line (h_rt, h_cpu, and h_vmem), the category line shows that only h_rt was requested, so it was a runtime (wallclock) limit.
| |
| Comparing ru_wallclock with the h_rt request, we can see that the job ran until the h_rt time of 21600 seconds (6 hours) was hit, and then the scheduler enforced the limit and killed the job. You will need to resubmit the job and ask for more time.
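The same comparison can be scripted. The sketch below pulls ru_wallclock and the h_rt value out of a record and reports whether the job ran up to its requested runtime. The function names are hypothetical, the sample record is a trimmed copy of the output above, and the script assumes h_rt appears in the category line as plain seconds, as it does there.

```shell
# Trimmed copy of the record above, in qacct's "field value" layout.
sample_killed_record() {
  cat <<'EOF'
failed       37  : qmaster enforced h_rt, h_cpu, or h_vmem limit
exit_status  0
ru_wallclock 21600s
category     -u some_user -q batch.q,long.q -l h_rt=21600,mem_free=512.0M,memory=1G
EOF
}

# Compare ru_wallclock with the h_rt request found in the category line.
# Assumes h_rt is given in plain seconds, as in the record above.
hit_runtime_limit() {
  awk '
    $1 == "ru_wallclock" { wall = $2; sub(/s$/, "", wall) }
    $1 == "category" && match($0, /h_rt=[0-9]+/) {
      # Skip the 5 characters of "h_rt=" to keep only the number.
      limit = substr($0, RSTART + 5, RLENGTH - 5)
    }
    END {
      if (limit != "" && wall + 0 >= limit + 0)
        printf "hit h_rt limit: ran %ss of %ss requested\n", wall, limit
      else
        print "h_rt limit not reached"
    }'
}

sample_killed_record | hit_runtime_limit
```

If the check reports that the limit was hit, resubmit with a larger <tt>h_rt</tt> request.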
| |