Kylehutson (talk | contribs) (Placeholder) |
(Adding some quick debugging things. probably needs some cleanup.) |
== Coming Soon! ==
This is just a placeholder as we move our old support site to the new one. We'll have it up soon!
== Job Accounting ==
Some people may find it useful to know what their job did during its run. The qacct tool reads SGE's accounting file and gives you summarized or detailed views of jobs that have run on Beocat.
=== qacct ===
This data can usually be used to diagnose two very common job failures.
==== Job debugging ====
It is simplest if you know the job number of the job you want information about.
<syntaxhighlight lang="bash" line>
# if you know the job id, put it here:
qacct -j 1122334455
# if you don't know the job id, you can list your jobs over some number of days (here, the past 14 days):
qacct -o $USER -d 14 -j
</syntaxhighlight>
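The full qacct record is long, so it can help to trim it down to the handful of fields used in the examples on this page. The following is only a sketch: <tt>qacct_summary</tt> is a hypothetical helper name, not part of SGE, and it assumes the "field value" line layout shown in the sample output here.

```shell
# Hypothetical helper (not part of SGE): keep only the accounting fields
# that matter most when debugging. Reads qacct-style text on stdin.
qacct_summary() {
    grep -E '^(jobnumber|failed|exit_status|ru_wallclock|category)[[:space:]]'
}

# On the cluster, with the placeholder job id used above:
#   qacct -j 1122334455 | qacct_summary
```

The helper only filters text, so it can also be run against a saved copy of the qacct output.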
===== My job didn't do anything when it ran! =====
<tt>qname batch.q
hostname mage07.beocat
group some_user_users
owner some_user
project BEODEFAULT
department defaultdepartment
jobname my_job_script.sh
jobnumber 1122334455
...
snipped to save space
...
exit_status 1 </tt>
<tt style="color: red">ru_wallclock 1s</tt>
<tt>ru_utime 0.030s
ru_stime 0.030s
...
snipped to save space
...
arid undefined
category -u some_user -q batch.q,long.q -l h_rt=604800,mem_free=1024.0M,memory=2G</tt>
Look at the ru_wallclock line: it shows 1s, which means the job started and then ended almost immediately. This usually points to a problem in your submission script, such as a typo.
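The check described above can be sketched as a small shell test. Both function names are hypothetical, and the 5-second default cutoff is an assumption chosen for illustration, not a Beocat or SGE rule. The code assumes ru_wallclock is printed as in the sample output (a number, optionally followed by <tt>s</tt>).

```shell
# Hypothetical helper: print the value of one field from qacct-style
# "field value" text on stdin.
qacct_field() {
    awk -v key="$1" '$1 == key { $1 = ""; sub(/^ +/, ""); print; exit }'
}

# Succeeds when ru_wallclock is at or below a cutoff in seconds.
# The default of 5 is an assumed threshold, not an SGE setting.
ended_immediately() {
    limit=${1:-5}
    wall=$(qacct_field ru_wallclock)
    wall=${wall%s}       # "1s" -> "1"
    wall=${wall%%.*}     # drop any fractional part
    [ -n "$wall" ] && [ "$wall" -le "$limit" ]
}

# On the cluster:
#   qacct -j 1122334455 | ended_immediately && echo "check your submission script"
```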
===== My job ran but didn't finish! =====
<tt>qname batch.q
hostname scout59.beocat
group some_user_users
owner some_user
project BEODEFAULT
department defaultdepartment
jobname my_job_script.sh
jobnumber 1122334455
...
snipped to save space
...
slots 1 </tt>
<tt style="color: red">failed 37 : qmaster enforced h_rt, h_cpu, or h_vmem limit</tt>
<tt>exit_status 0 </tt>
<tt style="color: red">ru_wallclock 21600s</tt>
<tt>ru_utime 0.130s
ru_stime 0.020s
...
snipped to save space
...
arid undefined</tt>
<tt style="color: red">category -u some_user -q batch.q,long.q -l h_rt=21600,mem_free=512.0M,memory=1G</tt>
The failed, ru_wallclock, and category lines point to the problem.
The job didn't finish because the scheduler (qmaster) enforced a limit. In the category line, the only limit requested was h_rt, so it was the runtime (wallclock) limit.
Comparing ru_wallclock with the h_rt request shows that the job ran until it reached the h_rt time, at which point the scheduler enforced the limit and killed the job. You will need to resubmit the job and request more time.
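The comparison between ru_wallclock and the h_rt request can be sketched with awk. <tt>hit_runtime_limit</tt> is a hypothetical name, and the code assumes h_rt appears in the category line as a plain number of seconds, as it does in the sample above.

```shell
# Hypothetical helper: succeeds when ru_wallclock reached the h_rt value
# found in the category line. Reads qacct-style text on stdin.
hit_runtime_limit() {
    awk '
        $1 == "ru_wallclock" { wall = $2; sub(/s$/, "", wall) }
        $1 == "category" {
            # Pull the number out of "h_rt=21600" in the -l resource list.
            if (match($0, /h_rt=[0-9]+/))
                limit = substr($0, RSTART + 5, RLENGTH - 5)
        }
        END { exit !(wall != "" && limit != "" && wall + 0 >= limit + 0) }
    '
}

# On the cluster:
#   qacct -j 1122334455 | hit_runtime_limit && echo "job hit its h_rt limit"
```

If the check succeeds, resubmitting with a larger runtime request is the likely fix, for example <tt>qsub -l h_rt=43200 my_job_script.sh</tt>, where 43200 is only an illustrative value.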
Revision as of 09:28, 20 June 2014