Open Science Grid
OSG is a high-throughput computing system where users can submit essentially unlimited numbers of small single-node jobs, typically 1-8 cores, 10 GB of memory, around 10 GB of I/O, and up to 24 hours of run time, to an HTCondor queue. The jobs are then run on supercomputers across the U.S. and Europe. To use OSG, you must first obtain an account from the OSG Connect group, arrange a short Zoom meeting with someone from their support team, upload your ssh keys through their web page, and then log into their head node (a minimal login example is sketched after the links below). Below are several links on OSG, the signup process, and quick-start guides for submitting HTCondor scripts.
- https://opensciencegrid.org
- https://support.opensciencegrid.org/support/home
- https://www.youtube.com/watch?v=oMAvxsFJaw4 (OSG tutorial video)
Guidelines for determining if your jobs will work well with OSG
- https://opensciencegrid.org/about/computation-ideal-for-OSPool/
- https://support.opensciencegrid.org/support/solutions/articles/5000632058-is-the-open-science-grid-for-you-
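Once your account is approved and your public ssh key is uploaded through the web page, logging in is a plain ssh session. This is only a sketch; the hostname is a placeholder, so use the access point assigned to you by the OSG Connect team.
> ssh-keygen -t ed25519                   # Generate an ssh key pair if you do not already have one
> cat ~/.ssh/id_ed25519.pub               # Paste this public key into the OSG Connect web page
> ssh your_username@<osg-access-point>    # Log into the head node assigned to your account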
Below is an example HTCondor job script to run a code named NAMD.
#!/bin/bash -l
output = osg.namd.out
error  = osg.namd.error
log    = osg.namd.log
# Requested resources
request_cpus   = 8
request_memory = 8 GB
request_disk   = 1 GB
requirements   = Arch == "X86_64" && HAS_MODULES == True
# The trailing slash means all files in that directory are transferred
transfer_input_files = input_files/
executable = namd2
arguments  = +p8 test.0.namd
transfer_output_files = output
queue 1
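HTCondor transfers the executable and the listed input files from the submit directory, so they need to exist there before you submit. A hypothetical layout matching the example above (the names are only illustrations) could look like:
htc.sh                    # The HTCondor submit script shown above
namd2                     # The executable; HTCondor transfers it automatically
input_files/              # Directory whose contents are transferred with the job
input_files/test.0.namd   # The NAMD input named on the 'arguments' line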
Below are some common HTCondor commands
> condor_submit htc.sh            # Submit the condor script to the queue
> condor_q                        # Check on the status while in the queue
> condor_q netid                  # Check status of currently running jobs for a user
> condor_q 1441271                # Check status of a particular job
> condor_history 1441271          # Check status of a job that has completed
> condor_history -long 1441271    # Same but report more info
> condor_rm 1441271               # Remove the given job
> condor_rm daveturner            # Remove all jobs for the given username
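If a job sits idle longer than expected, condor_q can also try to explain why it has not been matched to a machine:
> condor_q -better-analyze 1441271    # Explain why the job is not being matched and run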
The job output will be in the file specified by 'output='. This is similar to the slurm-#.out files on Beocat.
The log file contains timings for the file transfers and the job execution.
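Because the log file is updated as the job progresses, you can also block until the job recorded in that log finishes by using condor_wait:
> condor_wait osg.namd.log            # Returns once the job in this log completes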
One big change from Slurm on Beocat is that the job will not run on a computer that shares Beocat's file system. You therefore need to list all the input files to transfer to the remote machine before the run, and also list the output files to transfer back after the run, as in the sketch below.
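This is a minimal sketch of the transfer lines; 'extra_config.dat' is only a placeholder for an additional single file, and when_to_transfer_output = ON_EXIT limits the copy-back to jobs that exit on their own.
# Directories (trailing slash) and individual files, comma separated
transfer_input_files = input_files/, extra_config.dat
# Files or directories copied back to the submit directory after the run
transfer_output_files = output
# Only transfer the outputs when the job exits on its own
when_to_transfer_output = ON_EXIT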
Modules are available but can be difficult to use since the same software may have different module names on different systems. Use 'module avail' on the login node to get a list, then request module support in the job's requirements line (HAS_MODULES in the example above).
- https://support.opensciencegrid.org/support/solutions/articles/12000048518
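One common pattern is to load the module inside a small wrapper script and point 'executable =' at that wrapper, with the HAS_MODULES requirement keeping the job on machines that support modules. This is only a sketch; the wrapper name and the module name are assumptions, so check 'module avail' for the exact name on OSG.
#!/bin/bash
# wrapper.sh - hypothetical job wrapper used as the submit-script 'executable'
# The module name is an assumption; check 'module avail' on the login node
module load namd
namd2 +p8 test.0.namd > output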
If you end the submit script with 'queue 10', you get 10 identical jobs, similar to what a Slurm array job provides. Use 'output = job.$(Cluster).$(Process).output' to keep the output files separate, and 'arguments = input_file.$(Process)' to vary the input file for each job, as sketched below.
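A minimal sketch of the relevant submit-script lines (the input file names are placeholders):
# Keep the output, error, and log files for each of the 10 jobs separate
output = job.$(Cluster).$(Process).output
error  = job.$(Cluster).$(Process).error
log    = job.$(Cluster).$(Process).log
# Each job reads a different input file: input_file.0 through input_file.9
arguments = input_file.$(Process)
queue 10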