Revision as of 16:35, 13 November 2020
How do I connect to Beocat
| Connection Settings | |
|---|---|
| Hostname | headnode.beocat.ksu.edu | 
| Port | 22 | 
| Username | eID | 
| Password | eID Password | 
| Supported Connection Software (Latest Versions of Each) | |
|---|---|
| Shell | |
| Putty | |
| ssh from openssh | |
| File Transfer Utilities | |
| Filezilla | |
| WinSCP | |
| scp and sftp from openssh | |
| Combination | |
| MobaXterm | |
How do I compile my programs?
Serial programs
Fortran
ifort or gfortran
C/C++
icc, gcc and g++
Parallel programs
Fortran
mpif77 or mpif90
C/C++
mpicc or mpic++
How are the filesystems on Beocat set up?
| Mountpoint | Local / Shared | Size | Filesystem | Advice | 
|---|---|---|---|---|
| /bulk | Shared | 3.1PB shared with /homes and /scratch | cephfs | Slower than /homes; costs $45/TB/year | 
| /homes | Shared | 3.1PB shared with /bulk and /scratch | cephfs | Good enough for most jobs; limited to 1TB per home directory | 
| /scratch | Shared | 3.1PB shared with /bulk and /homes | cephfs | Fast shared tmp space; files not used in 30 days are automatically culled | 
| /tmp | Local | >100GB (varies per node) | ext4 | Good for I/O intensive jobs | 
Usage Advice
For most jobs you shouldn't need to worry: your default working directory is your home directory, and it will be fast enough for most tasks. I/O-intensive work should use /tmp, but you will need to copy your files to and from this partition as part of your job script. The $TMPDIR environment variable, which is set inside your jobs, makes this easier.
Example usage of $TMPDIR in a job script
#!/bin/bash
# copy our input file to $TMPDIR to make processing faster
cp ~/experiments/input.data "$TMPDIR"
# use the input file we copied over to the local system
# generate the output file in $TMPDIR as well
~/bin/my_program --input-file="$TMPDIR/input.data" --output-file="$TMPDIR/output.data"
# copy the results back from $TMPDIR
cp "$TMPDIR/output.data" ~/experiments/results."$SLURM_JOB_ID"
You need to remember to copy over your data from $TMPDIR as part of your job. That directory and its contents are deleted when the job is complete.
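The stage-in, run, copy-back pattern above can also be wrapped in a small function that stops as soon as one step fails, so a failed copy does not go unnoticed. This is only a sketch: run_staged is a hypothetical helper name, and it assumes your program accepts the input path and the output path as its last two arguments (unlike my_program above, which takes them as --input-file and --output-file options).

```shell
#!/bin/bash
# Sketch only: run_staged is a hypothetical helper, not a Beocat-provided tool.
# Usage: run_staged INPUT RESULT COMMAND [ARGS...]
# COMMAND is called with the staged input path and the output path appended.
run_staged() {
    local input=$1 result=$2
    shift 2
    # Inside a job $TMPDIR is set for you; fall back to /tmp when testing by hand.
    local scratch=${TMPDIR:-/tmp}

    # Stage the input on the node-local disk; give up if the copy fails.
    cp "$input" "$scratch/input.data" || return 1

    # Run the program against the local copies.
    "$@" "$scratch/input.data" "$scratch/output.data" || return 1

    # Copy the result back before the job ends and $TMPDIR is deleted.
    cp "$scratch/output.data" "$result"
}
```

Inside a job script this might be called as run_staged ~/experiments/input.data ~/experiments/results.$SLURM_JOB_ID ~/bin/my_wrapper, where my_wrapper is a placeholder for a program or script of your own that takes the two paths.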
What is "killable:1" or "killable:0"
On Beocat, some of the machines have been purchased by specific users and/or groups. These users and/or groups get guaranteed access to their machines at any point in time. Often, these machines are sitting idle because the owners have no need for them at the time. This would be a significant waste of computational power if there were no other way to make use of the computing cycles.
Enter the "killable" resource
Killable (--gres=killable:1) jobs are jobs that can be scheduled onto these "owned" machines by users outside of the true group of owners. If a "killable" job starts on one of these owned machines and the owner of that machine then submits a job, the "killable" job is killed, returned to the queue, and restarted at some future point in time. The job will still complete eventually, and if it makes use of a checkpointing algorithm it may complete even faster. The trade-off in marking a job "killable" is that some applications need a significant amount of runtime and cannot resume from partial output, so such a job may get restarted over and over again, never reaching the finish line. As such, we only auto-enable "killable" for relatively short jobs (<=168:00:00). Some users still feel this is a hindrance, so we created a way to tell us not to automatically mark short jobs "killable".
Disabling killable
Specifying --gres=killable:0 will tell us to not mark your job as killable.
The trade-off
If a job is marked killable, there is a non-trivial number of additional nodes that the job can run on. If your job checkpoints itself, or is relatively short, there should be no downside to marking the job killable, as the job will probably start sooner. If your job is long-running and doesn't checkpoint itself (that is, save its state so that it can resume a previous session), marking it killable could cause it to take longer to complete.
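As a rough illustration of this rule of thumb, the function below prints the flag it would suggest for a job. It is a sketch, not something Beocat provides: killable_flag is a hypothetical name, and treating 168 hours as the cutoff for "relatively short" is an assumption borrowed from the auto-enable limit mentioned earlier.

```shell
#!/bin/bash
# Sketch only: killable_flag is a hypothetical helper, not part of Slurm or Beocat.
# Usage: killable_flag RUNTIME_HOURS CHECKPOINTS(yes|no)
killable_flag() {
    local hours=$1 checkpoints=$2
    # Assumption: 168 hours mirrors the auto-enable cutoff (<=168:00:00).
    if [ "$checkpoints" = yes ] || [ "$hours" -le 168 ]; then
        # Short or checkpointing jobs lose little by being killable.
        echo "--gres=killable:1"
    else
        # Long jobs that cannot resume risk restarting from scratch.
        echo "--gres=killable:0"
    fi
}
```

The output can be passed straight to sbatch, for example sbatch $(killable_flag 300 no) my_job.sh, where my_job.sh stands in for your own job script.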
Help! When I submit my jobs I get "Warning To stay compliant with standard unix behavior, there should be a valid #! line in your script i.e. #!/bin/tcsh"
Job submission scripts are supposed to start with a line similar to '#!/bin/bash'. We have had problems with people submitting jobs with invalid #! lines: when this happens the job fails and we have to manually clean it up, so we enforce that rule. The warning message is there just to inform you that the job script should have a line in it, in most cases #!/bin/tcsh or #!/bin/bash, to indicate what program should be used to run the script. When the line is missing from a script, your default shell is used to execute the script (in your case /usr/local/bin/tcsh). This works in most cases, but may not be what you want.
Help! When I submit my jobs I get "A #! line exists, but it is not pointing to an executable. Please fix. Job not submitted."
Like the warning above, this error concerns the #!/bin/bash or similar line in your job script. Here the line exists, but it does not name an executable file, so the script will not be able to run. Most likely you wanted #!/bin/bash instead of something else.
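If you would like to check a script before submitting it, something along the following lines may help. It is only an approximation of the two checks described above, not the code Beocat actually runs, and check_shebang is a hypothetical name.

```shell
#!/bin/bash
# Sketch only: check_shebang is a hypothetical helper; it approximates, and is
# not, the check that Beocat performs at submission time.
# Usage: check_shebang SCRIPT
check_shebang() {
    local script=$1 first interp
    # Only the first line of the script matters here.
    first=$(head -n 1 "$script")
    case $first in
        '#!'*) ;;
        *) echo "missing #! line"; return 1 ;;
    esac
    # Drop the leading "#!", trim spaces, and keep the first word (the interpreter).
    interp=${first:2}
    interp=${interp#"${interp%%[! ]*}"}
    interp=${interp%% *}
    # The interpreter must be an executable file, not a directory.
    if [ -x "$interp" ] && [ ! -d "$interp" ]; then
        echo "ok: $interp"
    else
        echo "not an executable: $interp"
        return 1
    fi
}
```

Running check_shebang my_job.sh (with your own script in place of my_job.sh) prints which case applies and returns a non-zero status when the line is missing or does not point to an executable.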
Help! My jobs keep dying after 1 hour and I don't know why
Beocat has a default runtime limit of 1 hour. If you need more than that, or need more than 1 GB of memory per core, you'll want to look at the documentation here to see how to request it.
In short, when you run sbatch for your job, you'll want to put something along the lines of '--time=0-10:00:00' before the job script if you want your job to run for 10 hours.
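The time value in that example has the form days-hours:minutes:seconds. If you think of runtimes in whole hours, a small function can build the string for you. This is a sketch for illustration, and slurm_time is a hypothetical name rather than a Slurm or Beocat command.

```shell
#!/bin/bash
# Sketch only: slurm_time is a hypothetical helper, not part of Slurm.
# Converts a whole number of hours into the days-hours:minutes:seconds form
# used with --time.
slurm_time() {
    local hours=$1
    # Integer division gives the days; the remainder gives the leftover hours.
    printf '%d-%02d:00:00\n' "$((hours / 24))" "$((hours % 24))"
}
```

For a 10-hour job this could be used as sbatch --time="$(slurm_time 10)" my_job.sh, where my_job.sh stands in for your own job script.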
Help! My error file has "Warning: no access to tty"
The warning message "Warning: no access to tty (Bad file descriptor)" is safe to ignore. It typically happens with the tcsh shell.
Help! My job isn't going to finish in the time I specified. Can I change the time requirement?
Generally speaking, no.
Jobs are scheduled based on execution times (among other things). If it were easy to change your time requirement, one could submit a job with a 15-minute run-time, get it scheduled quickly, and then say "whoops - I meant 15 weeks", effectively gaming the job scheduler. In extreme circumstances and depending on the job requirements, we may be able to manually intervene. Doing so prevents other users from using the node(s) you are currently using, so such requests are not routinely approved. Contact Beocat support (below) if you feel your circumstances warrant special consideration.
Help! My perl job runs fine on the head node, but only runs for a few seconds and then quits when submitted to the queue.
Take a look at our documentation on Perl
Help! When using mpi I get 'CMA: no RDMA devices found' or 'A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces'
This message simply means that some, but not all, of the nodes the job is running on have infiniband cards. The job will still run, but will not use the fastest interconnect we have available. This may or may not be an issue, depending on how message-heavy your job is. If you would like to not see this warning, you may request infiniband as a resource when submitting your job by adding --gres=fabric:ib:1 to your submission.
What happens to my data when I leave K-State?
First of all, although we use eID credentials, we are not tied in with K-State's central IT policies which apply to employees or students leaving the university. As long as you keep your eID password current, you still have access to Beocat. Once we deem your data to be "stale", we will archive your data and disable your account. We have no written policy on when we do this, because we only do so as necessity dictates, but generally speaking, if any of your data has been modified within the last two years, it will not be marked as stale. If your account is disabled for this reason, you will have to apply for a new account and un-archive your data.
Common Storage For Projects
Sometimes it is useful for groups of people to have a common storage area.
If you do not have a project, send a request via email to beocat@cs.ksu.edu. Note that these projects are generally reserved for tenure-track faculty, with a single project per eID.
If you already have a project you can do the following:
Note: The $group_name variable in the commands below needs to be replaced with the lower case name of your project. Project membership can be managed here
- Create a directory in one of the home directories of someone in your group, ideally the project owner's.
- mkdir $directory
 
- Change the group to the name assigned by the Beocat admins
- chgrp -R $group_name $directory
 
- Set the directory group-writable and setgid, so that new files inherit the group
- chmod -R g+ws $directory
 
- Change your umask to 002 (there will probably be a setting for it in your file transfer utilities, also). This step needs to be done by all group members.
- It needs to go at the very end of your .bashrc file:
- umask 002
 
- Finally, log out and log back in
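The directory-related steps above can be collected into one function, which may be convenient if you set up more than one shared directory. This is a sketch under the same assumptions as the list: setup_project_dir is a hypothetical name, and the group must be one that you belong to.

```shell
#!/bin/bash
# Sketch only: setup_project_dir is a hypothetical helper name.
# Usage: setup_project_dir DIRECTORY GROUP_NAME
setup_project_dir() {
    local directory=$1 group_name=$2
    # Create the shared directory (no error if it already exists).
    mkdir -p "$directory" || return 1
    # Hand the whole tree to the project group assigned by the Beocat admins.
    chgrp -R "$group_name" "$directory" || return 1
    # g+w lets group members write; g+s (setgid) makes new files inherit the group.
    chmod -R g+ws "$directory"
}
```

For example, setup_project_dir ~/shared_data my_project, with your own directory and project name substituted. Each group member still needs the umask 002 line in their own .bashrc, since the function only prepares the directory.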
How do I get more help?
There are many sources of help for most Linux systems.
Unix man pages
Linux provides man pages (short for manual pages). These are simple enough to call, for example: if you need information on submitting jobs to Beocat, you can type 'man sbatch'. This will bring up the manual for sbatch.
GNU info system
Not all applications have "man pages." Most of the rest have what they call info pages. For example, if you needed information on finding a file you could use 'info find'.
This documentation
This documentation is very thoroughly researched, and has been painstakingly assembled for your benefit. Please use it.
Contact support
Support can be contacted here. Please include detailed information about your problem, including the job number, applications you are trying to run, and the current directory that you are in.
