|  |   | 
| (13 intermediate revisions by 3 users not shown) | 
| Line 1: | Line 1: | 
|  | == Submitting your first job == |  | == SGE is gone, long live [[SlurmBasics|slurm]] == | 
|  | To submit a job to run under SGE, we use the <code>qsub</code> command. qsub (queue submit) takes the commands you give it and runs them through the scheduler, which finds the best place for your job to run. With over 150 nodes and 2500 cores to schedule, as well as differing priorities, hardware, and individual resources, the scheduler's job is not trivial.
 |  | 
|  |   |  | 
|  | There are a few things you'll need to know before running qsub.
 |  | 
|  | * How many cores you need. Note that unless your program is created to use multiple cores (called "threading"), asking for more cores will not speed up your job. This is a common misperception. '''Beocat will not magically make your program use multiple cores!''' For this reason the default is 1 core.
 |  | 
|  | * How much time you need. Many users who are new to Beocat neglect to specify a time requirement. The default is one hour, and we are often asked why a job died after one hour. We usually point them to the [[FAQ]].
 |  | 
|  | * How much memory you need. The default is 1 GB. If your job uses significantly more than you ask for, your job will be placed on hold until you fix your request.
 |  | 
|  | * Any advanced options. See the [[AdvancedSGE]] page for these requests. For our basic examples here, we will ignore these.
 |  | 
|  |   |  | 
|  | So let's now create a small script to test our ability to submit jobs. Create the following file (either by copying it to Beocat or by editing a text file), and we'll name it <code>myhost.sh</code>. Both of these methods are documented on our [[LinuxBasics]] page.
 |  | 
|  | <syntaxhighlight lang="bash">
 |  | 
|  | #!/bin/sh
 |  | 
|  | hostname
 |  | 
|  | </syntaxhighlight>
 |  | 
|  |   |  | 
|  | Be sure to make it executable:
 |  | 
|  |  chmod u+x myhost.sh
 |  | 
|  |   |  | 
|  | Now, let's first run it on the headnode. As I write this, I'm logged into the headnode named 'minerva'. When I run it, it looks like this:
 |  | 
|  |  % ./myhost.sh
 |  | 
|  |  minerva
 |  | 
|  |   |  | 
|  | So, now let's submit it as a job and see what happens. Here I'm going to use three options:
 |  | 
|  | * <code>-l mem=</code> tells how much memory I need. In my example, I'm using our system minimum of 512 MB, which is more than enough. Note that your memory request is '''per core''', which doesn't make much difference for this example, but will as you submit more complex jobs.
 |  | 
|  | * <code>-l h_rt=</code> tells how much runtime I need. This can be in the form of ''seconds'', or ''hours'':''minutes'':''seconds''. This is a very short job, so 60 seconds should be plenty. Note that if you submit a job that needs to run for days or weeks, you'll need to translate that into hours (for example, 3 days is <code>72:00:00</code>).
 |  | 
|  | * <code>-pe single 1</code> tells SGE that I need only a single core on one machine. The [[AdvancedSGE]] page has much more on the "Parallel Environment" switch.
 |  | 
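For longer jobs, the days-to-hours translation mentioned above is easy to get wrong by hand. Here is a minimal sketch of one way to do it in the shell. The helper name <code>to_h_rt</code> is my own invention, not part of SGE, and it assumes plain integers without leading zeros.

```shell
#!/bin/sh
# to_h_rt is a hypothetical helper (not an SGE command).
# It turns days, hours, and minutes into the hours:minutes:seconds
# form described for -l h_rt= above.
# Arguments: days [hours] [minutes], plain integers without leading zeros.
to_h_rt() {
    days=$1
    hours=${2:-0}
    minutes=${3:-0}
    total_hours=$(( days * 24 + hours ))
    printf '%d:%02d:00\n' "$total_hours" "$minutes"
}

# Three and a half days:
to_h_rt 3 12      # prints 84:00:00
```

The result could then be passed along as <code>-l h_rt=84:00:00</code>.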
|  |   |  | 
|  |  % '''ls'''
 |  | 
|  |  myhost.sh
 |  | 
|  |  % '''qsub -l h_rt=60 -l mem=512M -pe single 1 ./myhost.sh'''
 |  | 
|  |  INFO: Requested resources for this job: pe=single cores=1 time(HH:MM:SS or SS)=60 mem(per-core)=512M
 |  | 
|  |  Your job 1483446 ("myhost.sh") has been submitted
 |  | 
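If you want to keep the job number for later (to find the output files, for instance), you can pull it out of the submission message. This is only a sketch: it assumes the message has exactly the wording shown above, and <code>job_id_from_msg</code> is a hypothetical helper name.

```shell
#!/bin/sh
# Sample message copied from the submission shown above.
msg='Your job 1483446 ("myhost.sh") has been submitted'

# job_id_from_msg is a hypothetical helper. It prints the digits that
# follow "Your job " and ignores any other lines (such as the INFO line).
job_id_from_msg() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

job_id=$(job_id_from_msg "$msg")
echo "$job_id"    # prints 1483446
```

Because the pattern is anchored on "Your job", the same <code>sed</code> expression should also work when the INFO line is part of the text being read.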
|  |   |  | 
|  | Since this is such a small job, it is likely to be scheduled almost immediately. A minute or so later, I see:
 |  | 
|  |  % '''ls'''
 |  | 
|  |  myhost.sh
 |  | 
|  |  myhost.sh.po1483446
 |  | 
|  |  myhost.sh.pe1483446
 |  | 
|  |  myhost.sh.e1483446
 |  | 
|  |  myhost.sh.o1483446
 |  | 
|  |   |  | 
|  | The four additional files that were created are in the form ''scriptname''.''XXjjjjjjj'', where ''scriptname'' is the script you submitted, ''XX'' is 'po' (parallel output), 'pe' (parallel error), 'e' (error), or 'o' (output), and ''jjjjjjj'' is the job number (which is given when you submitted your job, in this case 1483446).
 |  | 
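To make the naming pattern concrete, here is a small sketch that prints the four expected file names for a given script and job number, following the listing above. <code>output_files</code> is a hypothetical helper, not something SGE provides.

```shell
#!/bin/sh
# output_files is a hypothetical helper. Given a script name and a job
# number, it prints the four file names seen in the listing above.
output_files() {
    script=$1
    job=$2
    for kind in po pe e o; do
        printf '%s.%s%s\n' "$script" "$kind" "$job"
    done
}

output_files myhost.sh 1483446
```

For the job above, this prints the same four names that <code>ls</code> showed.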
|  |   |  | 
|  | If everything goes as planned, the po, pe, and e files will be empty...
 |  | 
|  |  % '''cat myhost.sh.po1483446'''
 |  | 
|  |  % '''cat myhost.sh.pe1483446'''
 |  | 
|  |  % '''cat myhost.sh.e1483446'''
 |  | 
|  | ...and the .o file will show the hostname of the node that ran the job...
 |  | 
|  |  % '''cat myhost.sh.o1483446'''
 |  | 
|  |  mage03
 |  | 
|  |   |  | 
|  | So we can see that this job ran on host 'mage03'.
 |  | 
|  |   |  | 
|  | == Monitoring Your Job ==
 |  | 
|  | '''TODO''' - qstat and status
 |  | 
|  |   |  | 
|  | == Getting Help ==
 |  | 
|  | '''TODO''' - self-diagnosis and asking for help
 |  |