From Beocat
Revision as of 23:54, 1 February 2022 by Daveturner (talk | contribs) (→‎Open Science Grid)

Open Science Grid

OSG is a high-throughput computing system where users have unlimited access to submit large numbers of small single-node jobs to the HTCondor queue. Jobs are typically 1-8 cores, 10 GB of memory, around 10 GB of IO, and 24 hours of run time, and they will be run on supercomputers in the U.S. and Europe. To use OSG, you must first obtain an account from the OSG Connect group, arrange a short Zoom meeting with someone from their support team, upload your ssh keys through their webpage, and then log into their head node. Below are several links on OSG, the signup process, and quick start guides for submitting HTCondor scripts.

Guidelines for determining if your jobs will work well with OSG

Below is an example HTCondor job script to run a code named NAMD.

 #!/bin/bash -l
 output = osg.namd.out
 error = osg.namd.error
 log = osg.namd.log
 # Requested resources
 request_cpus = 8
 request_memory = 8 GB
 request_disk = 1 GB
 requirements = Arch == "X86_64" && HAS_MODULES == True
 # Trailing slash means all files in that directory are transferred
 transfer_input_files = input_files/
 executable = namd2
 arguments = +p8 test.0.namd
 transfer_output_files = output
 queue 1

Below are some common HTCondor commands

 > condor_submit                       # Submit the condor script to the queue
 > condor_q                                   # Check on the status while in the queue
 > condor_q netid                             # Check status of the jobs belonging to the given NetID
 > condor_q 1441271                           # Check status of a particular job
 > condor_history 1441271                     # Check status of a job that is completed
 > condor_history -long 1441271               # Same but report more info
 > condor_rm #                                # Remove the job #
 > condor_rm daveturner                       # Remove all jobs for the given username

The job output will be in the file specified by 'output='. This is similar to the slurm-#.out files on Beocat.

The log file contains timings for the file transfers and the job execution.

One big change from Slurm on Beocat is that the job will not run on a computer that shares your file system. You must therefore list all the input files that need to be transferred before the run, and also list the output files to transfer back after the run.
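As a rough sketch of the file-transfer settings described above, the fragment below lists inputs and outputs explicitly. The file name params.dat is a hypothetical placeholder; input_files/ and output are taken from the example script. The should_transfer_files and when_to_transfer_output lines are standard HTCondor submit commands, but whether OSG needs them set explicitly is an assumption, so check the OSG quick start guides.

```
# Sketch only: params.dat is a hypothetical file name.
# Use HTCondor's file transfer instead of relying on a shared file system.
should_transfer_files = YES
# Copy the results back when the job exits.
when_to_transfer_output = ON_EXIT
# Comma-separated list of inputs sent to the execute node before the run.
# A trailing slash on a directory transfers the files inside it.
transfer_input_files = params.dat, input_files/
# Files or directories copied back to the submit directory after the run.
transfer_output_files = output
```

Anything the job writes that is not listed in transfer_output_files stays on the remote node, so it is worth checking the list against what the code actually produces.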

Modules are available but difficult to use, since they may have different names on different systems. Use 'module avail' on the login node to get a list, then request module support as a job requirement, as with HAS_MODULES == True in the script above.

If you submit with 'queue 10', you get 10 identical jobs, similar to what an array job provides with Slurm. Use 'output = job.$(Cluster).$(Process).output' to keep the outputs separate, and 'arguments = input_file.$(ProcId)' to vary the input file for each job.
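A minimal sketch of such a 10-job submission might look like the following. The executable name my_program and the input_file.N naming are hypothetical; $(Cluster) and $(Process) are the HTCondor macros mentioned above, and $(ProcId) can be used in place of $(Process).

```
# Sketch only: my_program and the input_file.N names are hypothetical.
executable = my_program
# $(Process) runs from 0 to 9, so job 3 is started with "input_file.3".
arguments = input_file.$(Process)
# Send each job only the input file it needs.
transfer_input_files = input_file.$(Process)
# Separate output and error files for every job in the cluster.
output = job.$(Cluster).$(Process).output
error = job.$(Cluster).$(Process).error
# One shared log for the whole cluster.
log = job.$(Cluster).log
queue 10
```

Numbering starts at 0, so the input files would need to be named input_file.0 through input_file.9.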