Daveturner (talk | contribs)
Revision as of 20:39, 19 April 2021
Open Science Grid
OSG is a high-throughput computing system that gives users unlimited access to submit large numbers of small single-node jobs to an HTCondor queue. Jobs are typically 1-8 cores, 10 GB of memory, around 10 GB of I/O, and 24 hours of run time, and they run on supercomputers in the U.S. and Europe. To use OSG, you must first obtain an account from the OSG Connect group, arrange a short Zoom meeting with someone from their support team, upload your ssh keys through a webpage, then log into their head node. Below are several links on OSG, the signup process, and quick start guides for submitting HTCondor scripts.
https://opensciencegrid.org
Full OSG documentation: https://support.opensciencegrid.org/support/home
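Guidelines for determining if your jobs will work well with OSG:
https://opensciencegrid.org/about/computation/
https://support.opensciencegrid.org/support/solutions/articles/5000632058-is-the-open-science-grid-for-you-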
Below is an example HTCondor job script to run a code named NAMD.
 #!/bin/bash -l
 output = osg.namd.out
 error = osg.namd.error
 log = osg.namd.log
 # Requested resources
 request_cpus = 8
 request_memory = 8 GB
 request_disk = 1 GB
 requirements = Arch == "X86_64" && HAS_MODULES == True
 # Trailing slash means all files in that directory
 transfer_input_files = input_files/
 executable = namd2
 arguments = +p8 test.0.namd
 transfer_output_files = output
 queue 1
Below are some common HTCondor commands:
 > condor_submit htc.sh             # Submit the condor script to the queue
 > condor_q                         # Check on the status while in the queue
 > condor_q netid                   # Check status of the jobs for the given username
 > condor_q 1441271                 # Check status of a particular job
 > condor_history 1441271           # Check status of a job that is completed
 > condor_history -long 1441271     # Same but report more info
 > condor_rm 1441271                # Remove the job with the given job number
 > condor_rm daveturner             # Remove all jobs for the given username
The job output will be in the file specified by 'output='. This is similar to the slurm-#.out files on Beocat.
The log file contains timings for the file transfers and the job execution.
One big change from Slurm on Beocat is that you must list all the input files that need to be transferred before the run, since the job will not run on a computer with the same file system. You must also list the output files to transfer back after the run.
Modules are available, but difficult to use since they may have different names on different systems. Use 'module avail' on the login node to get a list, then request modules as a resource.
https://support.opensciencegrid.org/support/solutions/articles/12000048518
If you submit with 'queue 10', you get 10 identical jobs, similar to what an array job with Slurm will provide. Use 'output = job.$(Cluster).$(Process).output' to keep the outputs different, and use 'arguments = input_file.$(ProcId)' to vary the input file for each job.
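The array-style submission described above could be set up roughly as follows. This is only a sketch under stated assumptions: the names array.sub and my_program.sh and the placeholder input contents are hypothetical and not part of the original page, and the script only writes files; it does not submit anything.

```shell
#!/bin/sh
# Sketch: prepare per-job input files and a matching HTCondor submit file.
# array.sub and my_program.sh are hypothetical names used for illustration.
njobs=10

# Create one input file per job: input_files/input_file.0 ... input_file.9
mkdir -p input_files
i=0
while [ "$i" -lt "$njobs" ]; do
    printf 'placeholder input for job %d\n' "$i" > "input_files/input_file.$i"
    i=$((i + 1))
done

# Quoted heredoc so the shell leaves $(Cluster), $(Process), $(ProcId) alone;
# HTCondor expands them at submit time.
cat > array.sub <<'EOF'
# Hypothetical executable; replace with your own program
executable = my_program.sh
# Each job receives its own input file name as an argument
arguments = input_file.$(ProcId)
# Transfer only the input file this job needs
transfer_input_files = input_files/input_file.$(ProcId)
# Keep the output and error files of each job separate
output = job.$(Cluster).$(Process).output
error = job.$(Cluster).$(Process).error
log = job.$(Cluster).log
EOF

# The queue statement goes last and sets the number of jobs
echo "queue $njobs" >> array.sub
```

After reviewing the generated file, it would be submitted with 'condor_submit array.sub'. Resource requests such as request_cpus and request_memory from the NAMD example would normally be added as well.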