OLD DEPRECATED AdvancedSGE

Resource Requests

Aside from the time, RAM, and CPU requirements listed on the SGEBasics page, we have several other requestable resources. Generally, if you don't know whether you need a particular resource, you should use the default. The full list of requestable resources can be generated with the command:

qconf -sc | awk '{ if ($5 != "NO") { print }}'

name	shortcut	type	relop	requestable	consumable	default	urgency
arch	a	RESTRING	==	YES	NO	NONE	0
avx	avx	BOOL	==	YES	NO	FALSE	0
calendar	c	RESTRING	==	YES	NO	NONE	0
cpu	cpu	DOUBLE	>=	YES	NO	0	0
cpu_flags	c_f	STRING	==	YES	NO	NONE	0
cuda	cuda	INT	<=	YES	JOB	0	0
display_win_gui	dwg	BOOL	==	YES	NO	0	0
exclusive	excl	BOOL	EXCL	YES	YES	0	1000
h_core	h_core	MEMORY	<=	YES	NO	0	0
h_cpu	h_cpu	TIME	<=	YES	NO	0:0:0	0
h_data	h_data	MEMORY	<=	YES	NO	0	0
h_fsize	h_fsize	MEMORY	<=	YES	NO	0	0
h_rss	h_rss	MEMORY	<=	YES	NO	0	0
h_rt	h_rt	TIME	<=	FORCED	NO	0:0:0	0
h_stack	h_stack	MEMORY	<=	YES	NO	0	0
h_vmem	h_vmem	MEMORY	<=	YES	NO	0	0
hostname	h	HOST	==	YES	NO	NONE	0
infiniband	ib	BOOL	==	YES	NO	FALSE	0
m_core	core	INT	<=	YES	NO	0	0
m_socket	socket	INT	<=	YES	NO	0	0
m_thread	thread	INT	<=	YES	NO	0	0
m_topology	topo	RESTRING	==	YES	NO	NONE	0
m_topology_inuse	utopo	RESTRING	==	YES	NO	NONE	0
mem_free	mf	MEMORY	<=	YES	NO	0	0
mem_total	mt	MEMORY	<=	YES	NO	0	0
mem_used	mu	MEMORY	>=	YES	NO	0	0
memory	mem	MEMORY	<=	FORCED	YES	0	0
num_proc	p	INT	==	YES	NO	0	0
qname	q	RESTRING	==	YES	NO	NONE	0
s_core	s_core	MEMORY	<=	YES	NO	0	0
s_cpu	s_cpu	TIME	<=	YES	NO	0:0:0	0
s_data	s_data	MEMORY	<=	YES	NO	0	0
s_fsize	s_fsize	MEMORY	<=	YES	NO	0	0
s_rss	s_rss	MEMORY	<=	YES	NO	0	0
s_rt	s_rt	TIME	<=	YES	NO	0:0:0	0
s_stack	s_stack	MEMORY	<=	YES	NO	0	0
s_vmem	s_vmem	MEMORY	<=	YES	NO	0	0
slots	s	INT	<=	YES	YES	1	1000
swap_free	sf	MEMORY	<=	YES	NO	0	0
swap_rate	sr	MEMORY	>=	YES	NO	0	0
swap_rsvd	srsv	MEMORY	>=	YES	NO	0	0
swap_total	st	MEMORY	<=	YES	NO	0	0
swap_used	su	MEMORY	>=	YES	NO	0	0
virtual_free	vf	MEMORY	<=	YES	NO	0	0
virtual_total	vt	MEMORY	<=	YES	NO	0	0
virtual_used	vu	MEMORY	>=	YES	NO	0	0
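The requestable column is worth a second look: in SGE's complex configuration, FORCED marks a resource that a job has to request explicitly, which in the table above applies to h_rt and memory. The following is a minimal sketch for pulling those entries out, assuming the whitespace-separated column order shown above; the function name forced_resources is made up for illustration.

```shell
# forced_resources: read `qconf -sc` style output on stdin and print the
# name and default value of every resource whose requestable column
# (column 5) is FORCED. Comment/header lines starting with '#' are skipped.
forced_resources() {
    awk '$1 !~ /^#/ && $5 == "FORCED" { print $1, $7 }'
}

# Typical use on a submit host (not run here):
#   qconf -sc | forced_resources
```

Filtering on column 5 mirrors the awk command above, which uses the same column to drop non-requestable entries.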

The good news is that nobody ever uses most of these. There are a couple of exceptions, though:

Infiniband

First of all, just because it sounds "cool" doesn't mean you need it or even want it. Infiniband is a high-speed host-to-host communication fabric used in conjunction with MPI jobs (discussed below), so it does absolutely no good if you are running in a 'single' parallel environment. Several times we have had jobs that could have run just fine, except that the submitter requested Infiniband and all the nodes with Infiniband were busy at the time. In fact, some of our fastest nodes do not have Infiniband, so by requesting it when you don't need it, you are actually slowing down your job. To request Infiniband, add -l ib=true to your qsub command line.

CUDA

CUDA is the resource required for GPU computing. We have a very small number of nodes which have GPUs installed. To request one of these nodes, add -l cuda=true to your qsub command line.

Job Accounting

Some people may find it useful to know what their job did during its run. The qacct tool will read SGE's accounting file and give you summarized or detailed views on jobs that have run within Beocat.

qacct

This data can usually be used to diagnose two very common job failures.

Job debugging

It is simplest if you know the job number of the job you are trying to get information on.

# If you know the job ID, put it here:
qacct -j 1122334455
# If you don't know the job ID, you can list your jobs over some number of days (in this case, the past 14 days):
qacct -o "$USER" -d 14 -j
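The multi-day listing can be long, so it may help to reduce it to one line per job. This is a minimal sketch, assuming each record lists jobnumber before failed, exit_status, and ru_wallclock, as in the sample output on this page; the function name summarize_jobs is made up for illustration.

```shell
# summarize_jobs: read `qacct -j` style output on stdin and print one line
# per job record: job number, failed code, exit status, wallclock time.
# Fields missing from a record are shown as "?".
summarize_jobs() {
    awk '
        $1 == "jobnumber"    { job = $2; failed = "?"; status = "?" }
        $1 == "failed"       { failed = $2 }
        $1 == "exit_status"  { status = $2 }
        $1 == "ru_wallclock" { print job, "failed=" failed, "exit=" status, "wallclock=" $2 }
    '
}

# Typical use (not run here):
#   qacct -o "$USER" -d 14 -j | summarize_jobs
```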

My job didn't do anything when it ran!

qname        batch.q             
hostname     mage07.beocat       
group        some_user_users        
owner        some_user              
project      BEODEFAULT          
department   defaultdepartment   
jobname      my_job_script.sh  
jobnumber    1122334455          
...
snipped to save space
...
exit_status  1                   
ru_wallclock 1s
ru_utime     0.030s
ru_stime     0.030s
...
snipped to save space
...
arid         undefined
category     -u some_user -q batch.q,long.q -l h_rt=604800,mem_free=1024.0M,memory=2G

If you look at the line showing ru_wallclock, you can see that it shows 1s. This means that the job started and then promptly ended, and the exit_status of 1 shows that it ended with an error. This points to something being wrong with your submission script. Perhaps there is a typo somewhere in it.
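One cheap way to catch some typos before resubmitting is to have the shell parse the script without running it. This is a minimal sketch, assuming the job script is a bash script; the function name check_job_script is made up for illustration. It only finds syntax errors (an unclosed if or quote, for example), not mistyped paths or program names.

```shell
# check_job_script: syntax-check a job script without executing it.
# Prints "ok" and returns 0 if bash can parse the file; otherwise prints
# "syntax error" and returns 1. bash's own error message, which includes
# the offending line number, goes to stderr.
check_job_script() {
    if bash -n "$1"; then
        echo "ok"
    else
        echo "syntax error"
        return 1
    fi
}

# Typical use:
#   check_job_script my_job_script.sh
```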

My job ran but didn't finish!

qname        batch.q             
hostname     scout59.beocat      
group        some_user_users     
owner        some_user           
project      BEODEFAULT          
department   defaultdepartment   
jobname      my_job_script.sh           
jobnumber    1122334455            
...
snipped to save space
...            
slots        1                   
failed       37  : qmaster enforced h_rt, h_cpu, or h_vmem limit
exit_status  0                   
ru_wallclock 21600s
ru_utime     0.130s
ru_stime     0.020s
...
snipped to save space
...
arid         undefined
category     -u some_user -q batch.q,long.q -l h_rt=21600,mem_free=512.0M,memory=1G

If you look at the lines showing failed, ru_wallclock, and category, you can see some pointers to the issue. The job didn't finish because the scheduler (qmaster) enforced some limit. Of the limits named in the failed line (h_rt, h_cpu, and h_vmem), the category line shows that only h_rt was requested, so it was the runtime (wallclock) limit. Comparing ru_wallclock with the h_rt request, you can see that the job ran until the h_rt time was hit, at which point the scheduler enforced the limit and killed the job. You will need to resubmit the job and ask for more time.
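The comparison between ru_wallclock and the h_rt request can be sketched as a small filter. It assumes that h_rt appears in the category line as a plain number of seconds, as in both samples on this page; the function name hit_runtime_limit is made up for illustration.

```shell
# hit_runtime_limit: read one job's `qacct -j` output on stdin. Prints
# "yes" if ru_wallclock reached the h_rt value from the category line,
# "no" if it did not, and "unknown" if either value is missing.
hit_runtime_limit() {
    awk '
        # Strip a trailing "s" so "21600s" compares as the number 21600.
        $1 == "ru_wallclock" { wall = $2; sub(/s$/, "", wall) }
        # Pull the digits that follow "h_rt=" out of the category line.
        $1 == "category" && match($0, /h_rt=[0-9]+/) {
            limit = substr($0, RSTART + 5, RLENGTH - 5)
        }
        END {
            if (wall == "" || limit == "") print "unknown"
            else if (wall + 0 >= limit + 0) print "yes"
            else print "no"
        }
    '
}

# Typical use (not run here):
#   qacct -j 1122334455 | hit_runtime_limit
```

For the two samples on this page, the first job (1s against h_rt=604800) gives "no" and the second (21600s against h_rt=21600) gives "yes".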
