== How do I connect to Beocat ==
{| class="wikitable"
!colspan="2" | Connection Settings
|-
! Hostname
| style="text-align:right" | headnode.beocat.ksu.edu
|-
! Port
| style="text-align:right" | 22
|-
! Username
| style="text-align:right" | <tt>eID</tt>
|-
! Password
| style="text-align:right" | <tt>eID Password</tt>
|}
{| class="wikitable"
!colspan="2" | Supported Connection Software (Latest Versions of Each)
|-
!rowspan="3" | Shell
|-
| [http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html PuTTY]
|-
| ssh from openssh
|-
!rowspan="4" | File Transfer Utilities
|-
| [https://filezilla-project.org/ Filezilla]
|-
| [http://winscp.net/ WinSCP]
|-
| scp and sftp from openssh
|-
!rowspan="2" | Combination
|-
| [http://mobaxterm.mobatek.net/ MobaXterm]
|}
===Duo===
If your account is Duo enabled, you will by default be asked to approve ''each'' connection through Duo's push system on your smart device for any non-interactive protocol. If you don't have a smart device, or Duo cannot currently reach it, there are options.
====Automating Duo Method====
You need to configure your connection client to send an ''Environment'' variable called <tt>DUO_PASSCODE</tt>. Its value can be a currently valid Duo passcode, <tt>push</tt>, or <tt>phone</tt>. <tt>push</tt> sends the prompt to your smart device; <tt>phone</tt> has Duo call your phone number for approval.
===== OpenSSH =====
With OpenSSH (Linux or Mac command-line), to automatically set the Duo method to "push", use the command
<syntaxhighlight lang="bash">DUO_PASSCODE=push ssh -o SendEnv=DUO_PASSCODE headnode.beocat.ksu.edu</syntaxhighlight>
If you would like to put this in your <tt>~/.ssh/config</tt> file, the environment variable will be sent to Beocat on connection whenever it is set:
 Host headnode.beocat.ksu.edu
    HostName headnode.beocat.ksu.edu
    User YOUR_EID_GOES_HERE
    SendEnv DUO_PASSCODE
From there you would simply do the following:
<syntaxhighlight lang=bash>
export DUO_PASSCODE=push
ssh headnode.beocat.ksu.edu
</syntaxhighlight>
===== PuTTY =====
In PuTTY, to automatically set the Duo method to "push", expand "Connection" (if it isn't already), then click "Data". Under "Environment variables", enter '''<tt>DUO_PASSCODE</tt>''' beside ''Variable'' and '''<tt>push</tt>''' beside ''Value'', then click the "Add" button and it will show up underneath. Be sure to go back to "Session" and save the session so PuTTY remembers this change.
===== MobaXTerm =====
There doesn't seem to be a way to send an environment variable in MobaXTerm, so you won't be able to set DUO_PASSCODE to an actual valid temporary key. To get MobaXterm to push automatically, you can edit your SSH session and on the "Advanced SSH Settings" tab, change the "Execute command" to <syntaxhighlight lang="bash">DUO_PASSCODE=push bash</syntaxhighlight>
==== Common issues ====
; Duo Pushes sometimes don't show up in a timely manner.
: If you open the Duo MFA application on your smart device when you're expecting an authentication challenge, the prompts seem to show up faster.
; MobaXTerm has excessive prompts for managing files.
: MobaXTerm has a sidebar browser for managing your files. Unfortunately, that sidebar browser initiates another SSH connection for every file transfer, which triggers a Duo push that you need to approve. MobaXTerm's dedicated SFTP Session doesn't have this issue: it initiates a connection, keeps it open, and re-uses it as needed, so you will have far fewer Duo approvals to respond to. If you choose to use the dedicated SFTP Session, you might consider disabling the sidebar file browser: "Advanced SSH settings" -> "SSH-browser type" -> "None"
; WinSCP has auto-reconnect enabled by default.
: Auto-reconnect is a useful function when actively transferring files, but if you have an idle session and the connection drops, WinSCP will reconnect, sending you a Duo MFA prompt. If you don't approve it soon enough, WinSCP will attempt it again. Miss enough prompts and Duo will lock your account. It may be best to disable [https://winscp.net/eng/docs/ui_pref_resume reconnections during idle periods] if you do not wish to be locked out of all services at K-State that use Duo.
; FileZilla has auto-reconnect enabled by default.
: Auto-reconnect is a useful function when actively transferring files, but if you have an idle session and the connection drops, FileZilla will reconnect, sending you a Duo MFA prompt. If you don't approve it soon enough, FileZilla will attempt it again. Miss enough prompts and Duo will lock your account. It may be best to disable timeouts and/or connection retries under the <tt>Edit -> Settings -> Connection</tt> menu if you do not wish to be locked out of all services at K-State that use Duo.
; FileZilla has excessive prompts for managing files.
: FileZilla opens one connection for browsing the system. Transferring files opens 1-4 additional connections when the transfers start; once they finish, those connections disconnect. If you start additional transfers, new connections will be opened. Every one of those connections must be approved through Duo MFA on your smart device. You can adjust the number of connections that FileZilla opens for transfers if you like: <tt>File -> Site Manager -> (choose the site you're changing) -> Transfer Settings -> Limit number of simultaneous connections</tt>.
: Another option is to disable processing of the transfer queue, add the items you want to transfer, and then re-enable the queue. FileZilla will then at least re-use the connections until the queue is empty.
== How do I compile my programs? ==
=== Serial programs ===
; Fortran
: <tt>ifort</tt> or <tt>gfortran</tt>
; C/C++
: <tt>icc</tt>, <tt>gcc</tt> and <tt>g++</tt>
 
=== Parallel programs ===
; Fortran
: <tt>mpif77</tt> or <tt>mpif90</tt>
; C/C++
: <tt>mpicc</tt> or <tt>mpic++</tt>
 
== Do Beocat jobs have a maximum time limit? ==
Yes. The scheduler will reject jobs longer than 28 days. The other side of that is that we reserve the right to a maintenance period every 14 days. Unless it is an emergency, we will give at least two weeks' notice before a maintenance period actually occurs. Jobs of 14 days or less that have started when we announce a maintenance period should be able to complete before it begins.
 
With that being said, there is no guarantee that any physical piece of hardware, or the software that runs on it, will behave for any significant length of time. Memory, processors, and disk drives can all fail with little to no warning. Software may have bugs. We have had issues with the shared filesystem that resulted in several nodes losing connectivity and forced reboots. If you can, we recommend writing your jobs so that they can be resumed if they get interrupted.
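If your application has no built-in checkpointing, you can sometimes fake it at the job-script level. Below is a minimal, hypothetical bash sketch: progress is recorded in a file after each unit of work, so a restarted job skips the steps that already finished. The checkpoint file name and the step loop are illustrative, not Beocat conventions:

```shell
#!/bin/bash
CKPT=checkpoint.txt

# Resume from the step after the last recorded one, or start at step 1.
if [ -f "$CKPT" ]; then
    start=$(( $(cat "$CKPT") + 1 ))
else
    start=1
fi

for step in $(seq "$start" 10); do
    echo "working on step $step"    # replace with a real unit of work
    echo "$step" > "$CKPT"          # record progress after each step
done
```

If the job is interrupted after step 6, resubmitting it re-runs only steps 7 through 10.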
 
{{Note|The 28 day limit can be overridden on a temporary and per-user basis provided there is enough justification|reminder|inline=1}}
 
== How are the filesystems on Beocat set up? ==


{| class="wikitable"
! Mountpoint !! Local / Shared !! Size !! Filesystem !! Advice
|-
| /bulk || Shared || 3.1PB shared with /homes || cephfs || Slower than /homes; costs $45/TB/year
|-
| /homes || Shared || 3.1PB shared with /bulk || cephfs || Good enough for most jobs; limited to 1TB per home directory
|-
| /fastscratch || Shared || 280TB || nfs on top of ZFS || Faster than /homes or /bulk, built with all NVME disks; files not used in 30 days are automatically culled.
|-
| /tmp || Local || >100GB (varies per node) || XFS || Good for I/O intensive jobs. Unique per job; culled when the job finishes.
|}
=== Usage Advice ===
For most jobs you shouldn't need to worry: your default working directory is your home directory, and it is fast enough for most tasks. I/O-intensive work should use /tmp, but you will need to remember to copy your files to and from this partition as part of your job script. This is made easier through the <tt>$TMPDIR</tt> environment variable in your jobs.

Example usage of <tt>$TMPDIR</tt> in a job script:
<syntaxhighlight lang="bash">
#!/bin/bash

#copy our input file to $TMPDIR to make processing faster
cp ~/experiments/input.data $TMPDIR

#use the input file we copied over to the local system
#generate the output file in $TMPDIR as well
~/bin/my_program --input-file=$TMPDIR/input.data --output-file=$TMPDIR/output.data


#copy the results back from $TMPDIR
cp $TMPDIR/output.data ~/experiments/results.$SLURM_JOB_ID
</syntaxhighlight>


You need to remember to copy over your data from <tt>$TMPDIR</tt> as part of your job.
That directory and its contents are deleted when the job is complete.
== What is "killable:1" or "killable:0"? ==
On Beocat, some of the machines have been purchased by specific users and/or groups. These users and/or groups get guaranteed access to their machines at any point in time. Often, these machines sit idle because the owners have no need for them at the time. This would be a significant waste of computational power if there were no other way to make use of the computing cycles.
If you're wondering why a job may have the exit status of <tt>PREEMPTED</tt> from kstat or sacct, this is the reason.
=== Enter the "killable" resource ===
Killable (<tt>--gres=killable:1</tt>) jobs are jobs that can be scheduled onto these "owned" machines by users outside the group of owners. If a "killable" job starts on one of these owned machines and the owner of that machine submits a job, the "killable" job will be returned to the queue (killed off, as it were) and restarted at some future point in time. The job will still complete eventually, and if it uses a checkpointing algorithm it may complete even faster. The trade-off is that some applications need a significant amount of runtime and cannot resume from partial output, meaning the job may get restarted over and over again, never reaching the finish line. As such, we only auto-enable "killable" for relatively short jobs (<=168:00:00). Some users still feel this is a hindrance, so we created a way to tell us not to automatically mark short jobs "killable".
=== Disabling killable ===
Specifying --gres=killable:0 will tell us to not mark your job as killable.
=== The trade-off ===
If a job is marked killable, there is a non-trivial number of additional nodes that the job can run on. If your job checkpoints itself, or is relatively short, there should be no downside to marking it killable, as the job will probably start sooner. If your job is long-running and doesn't checkpoint (save its state so it can restart a previous session), marking it killable could cause it to take longer to complete.


== Help! When I submit my jobs I get "Warning To stay compliant with standard unix behavior, there should be a valid #! line in your script i.e. #!/bin/tcsh" ==
Job submission scripts are supposed to start with a line similar to '<code>#!/bin/bash</code>'. We have had problems with people submitting jobs with invalid #! lines, so we enforce that rule; when it is violated, the job fails and we have to clean it up manually. The warning is there to inform you that the job script should have a line, in most cases #!/bin/tcsh or #!/bin/bash, indicating what program should be used to run the script. When the line is missing, your default shell is used to execute the script (in your case /usr/local/bin/tcsh). This works in most cases, but may not be what you want.

== Help! When I submit my jobs I get "A #! line exists, but it is not pointing to an executable. Please fix. Job not submitted." ==
Like the above, this error says you need a #!/bin/bash or similar line in your job script. Here, the #! line exists but does not point to an executable file, so the script would not be able to run. Most likely you wanted #!/bin/bash instead of something else.


== Help! My jobs keep dying after 1 hour and I don't know why ==
Beocat has a default runtime limit of 1 hour. If you need more than that, or need more than 1 GB of memory per core, you'll want to look at the documentation [[SlurmBasics|here]] to see how to request it.


In short, when you run sbatch for your job, you'll want to put something along the lines of '<code>--time=0-10:00:00</code>' before the job script name if you want your job to run for 10 hours.
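The same requests can live inside the script itself. A minimal job script along those lines might look like the following; the memory request and job name are illustrative examples, and the #SBATCH lines are ordinary comments to bash but are read by the scheduler:

```shell
#!/bin/bash
#SBATCH --time=0-10:00:00        # days-hours:minutes:seconds
#SBATCH --mem-per-cpu=2G         # per-core memory request (example value)
#SBATCH --job-name=myjob

echo "job running on $(hostname)"
```

You would then submit it with <tt>sbatch myjob.sh</tt>; flags given on the sbatch command line override the matching #SBATCH lines.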


== Help! My error file has "Warning: no access to tty" ==
The warning message "Warning: no access to tty (Bad file descriptor)" is safe to ignore. It typically happens with the tcsh shell.

== Help! My job isn't going to finish in the time I specified. Can I change the time requirement? ==
Generally speaking, no.


Jobs are scheduled based on execution times (among other things). If it were easy to change your time requirement, one could submit a job with a 15-minute run-time, get it scheduled quickly, and then say "whoops - I meant 15 weeks", effectively gaming the job scheduler. In extreme circumstances and depending on the job requirements, we '''may''' be able to manually intervene. This process prevents other users from using the node(s) you are currently using, so such requests are not routinely approved. Contact Beocat support (below) if you feel your circumstances warrant special consideration.


== Help! My perl job runs fine on the head node, but only runs for a few seconds and then quits when submitted to the queue. ==
Take a look at our documentation on [[Installed_software#Perl|Perl]]
 
== Help! When using mpi I get 'CMA: no RDMA devices found' or 'A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces' ==
This message means that some, but not all, of the nodes the job is running on have InfiniBand cards. The job will still run, but will not use the fastest interconnect we have available. This may or may not be an issue, depending on how message-heavy your job is. If you would like to not see this warning, you may request InfiniBand as a resource when submitting your job: <code>--gres=fabric:ib:1</code>
 
== Help! when I use sbatch I get an error about line breaks ==
Beocat is a Linux system. Operating systems use certain characters to indicate line breaks in their files. Linux and similar operating systems use '\n' as their line-break character; Windows uses '\r\n'.
 
If you're getting an error that looks like this:
 sbatch: error: Batch script contains DOS line breaks (\r\n)
 sbatch: error: instead of expected UNIX line breaks (\n).


It means that your script is using Windows line endings. You can convert it with the <tt>dos2unix</tt> command:
 dos2unix myscript.sh


It would probably be beneficial to configure your editor to save files with UNIX line breaks in the future.
* Visual Studio Code -- "Text Editor" > "Files" > "Eol"
* Notepad++ -- "Edit" > "EOL Conversion"
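If you want to check for (and strip) DOS line endings yourself, standard tools are enough; <tt>dos2unix</tt> is just a convenience. The filename below is illustrative:

```shell
# Simulate a script saved with Windows line endings.
printf 'echo hello\r\n' > myscript.sh

file myscript.sh        # reports "... with CRLF line terminators"

# Strip the carriage returns (equivalent to dos2unix for simple cases).
tr -d '\r' < myscript.sh > fixed.sh && mv fixed.sh myscript.sh

file myscript.sh        # now reports plain ASCII text
```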


== Help! when logging into OnDemand I get a '400 Bad request' message ==
Unfortunately, there are some known issues with OnDemand and how it handles some of the complexities behind the scenes. This involves browser cookies that (occasionally) get too large and make it so you get these messages upon login.


The only workaround is to clear your browser cookies (although you can limit it to clearing only the ksu.edu ones).


Details for specific browsers are below:


* [https://support.mozilla.org/en-US/kb/clear-cookies-and-site-data-firefox Firefox]
* [https://support.microsoft.com/en-us/microsoft-edge/delete-cookies-in-microsoft-edge-63947406-40ac-c3b8-57b9-2a946a29ae09 Edge]
* [https://support.google.com/chrome/answer/95647?sjid=1537101898131489753-NA#zippy=%2Cdelete-cookies-from-a-site Chrome]
* [https://support.apple.com/guide/safari/manage-cookies-sfri11471/mac Safari]
* If you are using some other browser, we recommend searching for <tt>$browsername clear site cookies</tt>


== Common Storage For Projects ==
Sometimes it is useful for groups of people to have a common storage area.


If you do not have a project, send a request via email to beocat@cs.ksu.edu. Note that these projects are generally reserved for tenure-track faculty and with a single project per eID.
If you already have a project you can do the following:
'''Note:''' The <tt>$group_name</tt> variable in the commands below needs to be replaced with the lower case name of your project. Membership of the projects can be done using our [[Group Management]] application.
* Create a directory in one of the home directories of someone in your group, ideally the project owner's.
** <tt>mkdir $directory</tt>
* Set the default permissions for new files and directories created in the directory:
** <tt>setfacl -d -m g:$group_name:rX -R $directory</tt>
* Set the permissions for the existing files and directories:
** <tt>setfacl -m g:$group_name:rX -R $directory</tt>
 


This will give people in your group the ability to read files in the directory. If you also want them to be able to write or modify files in that directory, then change the ':rX' to ':rwX' in both setfacl commands, e.g. 'setfacl -d -m g:$group_name:rwX -R $directory'. As with other permissions, individuals will need access through every level of the directory hierarchy. [[LinuxBasics#Access_Control_Lists|It may be best to review our more in-depth topic on Access Control Lists.]]


== How do I get more help? ==


=== Unix man pages ===
Linux provides man pages (short for manual pages). These are simple to use; for example, if you need information on submitting jobs to Beocat, you can type '<code>man sbatch</code>'. This will bring up the manual for sbatch.


=== GNU info system ===

Latest revision as of 18:56, 26 January 2024

How do I connect to Beocat

Connection Settings
Hostname headnode.beocat.ksu.edu
Port 22
Username eID
Password eID Password
Supported Connection Software (Latest Versions of Each)
Shell
Putty
ssh from openssh
File Transfer Utilities
Filezilla
WinSCP
scp and sftp from openssh
Combination
MobaXterm

Duo

If your account is Duo Enabled, you will be asked to approve each connection through Duo's push system to your smart device by default for any non-interactive protocols. If you don't have a smart device, or your smart device is not currently able to be contacted by Duo, there are options.

Automating Duo Method

You would need to configure your connection client to send an Environment variable called DUO_PASSCODE. Its value could be the currently valid passcode from Duo, push or it could be set to phone. push will push the prompt to your smart device. phone will have duo call your phone number to approve.

OpenSSH

With OpenSSH (Linux or Mac command-line), to automatically set the Duo method to "push", use the command

DUO_PASSCODE=push ssh -o SendEnv=DUO_PASSCODE headnode.beocat.ksu.edu

If you would like to put this in your ~/.ssh/config file, it will send the environment variable whenever it is set to Beocat upon connection:

Host headnode.beocat.ksu.edu
    HostName headnode.beocat.ksu.edu
    User YOUR_EID_GOES_HERE
    SendEnv DUO_PASSCODE

From there you would simply do the following:

export DUO_PASSCODE=push
ssh headnode.beocat.ksu.edu
PuTTY

In PuTTY to automatically set the Duo method to "push", expand "Connection" (if it isn't already), then click "Data". Under Environment variables, enter DUO_PASSCODE beside Variable and push beside Value. Click the "Add" button and it will show up underneath. Be sure to go back to "Session" to save this change for PuTTY to remember this change.

MobaXTerm

There doesn't seem to be a way to send an environment variable in MobaXTerm, so you won't be able to set DUO_PASSCODE to an actual valid temporary key. To get MobaXterm to push automatically, you can edit your SSH session and on the "Advanced SSH Settings" tab, change the "Execute command" to

DUO_PASSCODE=push bash

Common issues

Duo Pushes sometimes don't show up in a timely manner.
If you open the Duo MFA application on your smart device when you're expecting an authentication challenge, the prompts seem to show up faster.
MobaXTerm has excessive prompts for managing files.
MobaXTerm has a sidebar browser for managing your files. Unfortunately, that sidebar browser initiates another SSH connection for every file transfer, which triggers a Duo push that you need to approve. MobaXTerm's dedicated SFTP Session doesn't have this same issue, it initiates a connection, keeps it open and re-uses it as needed, so you will have much fewer Duo approvals to respond to. If you choose to use the dedicated SFTP Session, you might consider disabling the sidebar file browser. "Advanced SSH settings" -> "SSH-browser type" -> "None"
WinSCP has auto-reconnect enabled by default.
Auto-reconnect is a useful function when actively transferring files, but if you have an idle session and the connection drops it will reconnect, sending you a Duo MFA prompt. If you don't approve it soon enough, WinSCP will attempt it again. Miss enough prompts and Duo will lock your account. It may be best to disable reconnections during idle periods if you do not wish be locked out of all services at K-State using Duo.
FileZilla has auto-reconnect enabled by default.
Auto-reconnect is a useful function when actively transferring files, but if you have an idle session and the connection drops it will reconnect, sending you a Duo MFA prompt. If you don't approve it soon enough, FileZilla will attempt it again. Miss enough prompts and Duo will lock your account. It may be best to disable timeouts and/or connection retries under the Edit -> Settings -> Connection menu if you do not wish to be locked out of all services at K-State using Duo.
FileZilla has excessive prompts for managing files.
Filezilla opens one connection for browsing the system. Transferring files opens 1-4 additional connections when the transfers start. Once they finish, those connections disconnect. If you start additional transfers, new connections will be opened. Every one of those connections must be approved through Duo MFA on your smart device. You can adjust the number of connections that FileZilla opens for transfers if you like. File -> Site Manager -> (choose the site you're changing) -> Transfer Settings -> Limit number of simultaneous connections.
Another option is to disable processing the transfer queue, add the things to it you want to transfer and then re-enable the transfer queue. Then at least it will re-use the connections until the queue is empty.

How do I compile my programs?

Serial programs

Fortran
ifort or gfortran
C/C++
icc, gcc and g++

Parallel programs

Fortran
mpif77 or mpif90
C/C++
mpicc or mpic++

Do Beocat jobs have a maximum Time Limit

Yes, there is a time limit, the scheduler will reject jobs longer than 28 days. The other side of that is that we reserve the right to a maintenance period every 14 days. Unless it is an emergency, we will give at least 2 weeks notice before these maintenance periods actually occur. Jobs 14 days or less that have started when we announce a maintenance period should be able to complete before it begins.

With that being said, there is no guarantee that any physical piece of hardware and the software that runs on it will behave for any significant length of time. Memory, processors, disk drives can all fail with little to no warning. Software may have bugs. We have had issues with the shared filesystem that resulted in several nodes losing connectivity and forced reboots. If you can, we always recommend that you write your jobs so that they can be resumed if they get interrupted.

The 28 day limit can be overridden on a temporary and per-user basis provided there is enough justification

How are the filesystems on Beocat set up?

Mountpoint Local / Shared Size Filesystem Advice
/bulk Shared 3.1PB shared with /homes cephfs Slower than /homes; costs $45/TB/year
/homes Shared 3.1PB shared with /bulk cephfs Good enough for most jobs; limited to 1TB per home directory
/fastscratch Shared 280TB nfs on top of ZFS Faster than /homes or /bulk, built with all NVME disks; files not used in 30 days are automatically culled.
/tmp Local >100GB (varies per node) XFS Good for I/O intensive jobs. Unique per job, culled with the job finishes.

Usage Advice

For most jobs you shouldn't need to worry, your default working directory is your homedir and it will be fast enough for most tasks. I/O intensive work should use /tmp, but you will need to remember to copy your files to and from this partition as part of your job script. This is made easier through the $TMPDIR environment variable in your jobs.

Example usage of $TMPDIR in a job script

#!/bin/bash

#copy our input file to $TMPDIR to make processing faster
cp ~/experiments/input.data $TMPDIR

#use the input file we copied over to the local system
#generate the output file in $TMPDIR as well
~/bin/my_program --input-file=$TMPDIR/input.data --output-file=$TMPDIR/output.data

#copy the results back from $TMPDIR
cp $TMPDIR/output.data ~/experiments/results.$SLURM_JOB_ID

You need to remember to copy over your data from $TMPDIR as part of your job. That directory and its contents are deleted when the job is complete.

What is "killable:1" or "killable:0"

On Beocat, some of the machines have been purchased by specific users and/or groups. These users and/or groups get guaranteed access to their machines at any point in time. Often, these machines are sitting idle because the owners have no need for it at the time. This would be a significant waste of computational power if there were no other way to make use of the computing cycles.

If you're wondering why a job may have the exit status of PREEMPTED from kstat or sacct, this is the reason.

Enter the "killable" resource

Killable (--gres=killable:1) jobs are jobs that can be scheduled to these "owned" machines by users outside of the true group of owners. If a "killable" job starts on one of these owned machines and the owner of said machine comes along and submits a job, the "killable" job will be returned to the queue, (killed off as it were), and restarted at some future point in time. The job will still complete at some future point, and if the job makes use of a checkpointing algorithm it may complete even faster. The trade off between marking a job "killable" and not, is that sometimes applications need a significant amount of runtime, and cannot resume running from a partial output, meaning that it may get restarted over and over again, never reaching the finish line. As such, we only auto-enable "killable" for relatively short jobs (<=168:00:00). Some users still feel this is a hindrance, so we created a way to tell us not to automatically mark short jobs "killable"

Disabling killable

Specifying --gres=killable:0 will tell us to not mark your job as killable.

The trade-off

If a job is marked killable, there are a non-trivial amount of additional nodes that the job can run on. If your job checkpoints itself, or is relatively short, there should be no downside to marking the job killable, as the job will probably start sooner. If your job is long-running and doesn't checkpoint (save its state to restart a previous session) itself, it could cause your job to take longer to complete.

Help! When I submit my jobs I get "Warning To stay compliant with standard unix behavior, there should be a valid #! line in your script i.e. #!/bin/tcsh"

Job submission scripts are supposed to have a line similar to '#!/bin/bash' in them to start. We have had problems with people submitting jobs with invalid #! lines, so we enforce that rule. When this happens the job fails and we have to manually clean it up. The warning message is there just to inform you that the job script should have a line in it, in most cases #!/bin/tcsh or #!/bin/bash, to indicate what program should be used to run the script. When the line is missing from a script, by default your default shell is used to execute the script (in your case /usr/local/bin/tcsh). This works in most cases, but may not be what you are wanting.

Help! When I submit my jobs I get "A #! line exists, but it is not pointing to an executable. Please fix. Job not submitted."

Like the above, error says you need a #!/bin/bash or similar line in your job script. This error says that while the line exists, the #! line isn't mentioning an executable file, thus the script will not be able to run. Most likely you wanted #!/bin/bash instead of something else.

Help! My jobs keep dying after 1 hour and I don't know why

Beocat has default runtime limit of 1 hour. If you need more than that, or need more than 1 GB of memory per core, you'll want to look at the documentation here to see how to request it.

In short, when you run sbatch for your job, you'll want to put something along the lines of '--time=0-10:00:00' before the job script if you want your job to run for 10 hours.

Help my error file has "Warning: no access to tty"

The warning message "Warning: no access to tty (Bad file descriptor)" is safe to ignore. It typically happens with the tcsh shell.

Help! My job isn't going to finish in the time I specified. Can I change the time requirement?

Generally speaking, no.

Jobs are scheduled based on execution times (among other things). If it were easy to change your time requirement, one could submit a job with a 15-minute run-time, get it scheduled quickly, and then say "whoops - I meant 15 weeks", effectively gaming the job scheduler. In extreme circumstances, and depending on the job requirements, we may be able to intervene manually. Such interventions prevent other users from using the node(s) you are currently on, so they are not routinely approved. Contact Beocat support (below) if you feel your circumstances warrant special consideration.

Help! My perl job runs fine on the head node, but only runs for a few seconds and then quits when submitted to the queue.

Take a look at our documentation on Perl

Help! When using mpi I get 'CMA: no RDMA devices found' or 'A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces'

This message simply means that some, but not all, of the nodes your job is running on have InfiniBand cards. The job will still run, but will not use the fastest interconnect we have available. This may or may not be an issue, depending on how message-heavy your job is. If you would like to avoid this warning, you can request InfiniBand as a resource when submitting your job: --gres=fabric:ib:1

Help! when I use sbatch I get an error about line breaks

Beocat is a Linux system. Operating systems use certain patterns of characters to indicate line breaks in their files. Linux and operating systems like it use '\n' as the line break character, while Windows uses '\r\n'.

If you're getting an error that looks like this:

sbatch: error: Batch script contains DOS line breaks (\r\n)
sbatch: error: instead of expected UNIX line breaks (\n).

It means that your script is using Windows line endings. You can convert it with the dos2unix command:

dos2unix myscript.sh

It would also be beneficial to configure your editor to save files with UNIX line breaks in the future.

  • Visual Studio Code -- “Text Editor” > “Files” > “Eol”
  • Notepad++ -- "Edit" > "EOL Conversion"
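If dos2unix is not installed, a portable fallback is to strip the carriage returns with tr. A sketch on a throwaway file (the filename and contents are illustrative):

```shell
# Create a sample script with Windows (CRLF) line endings
printf '#!/bin/bash\r\necho hello\r\n' > myscript.sh
# dos2unix myscript.sh                # preferred, when installed
tr -d '\r' < myscript.sh > fixed.sh && mv fixed.sh myscript.sh
! grep -q $'\r' myscript.sh           # no carriage returns remain
```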

Help! when logging into OnDemand I get a '400 Bad request' message

Unfortunately, there are some known issues with how OnDemand handles some of the complexities behind the scenes. Browser cookies can (occasionally) grow too large, which causes these messages upon login.

The only workaround is to clear your browser cookies (although you can limit this to simply clearing the ksu.edu ones).

Details for specific browsers are below

  • Firefox
  • Edge
  • Chrome
  • Safari
  • If you are using some other browser, search Google for '$browsername clear site cookies'

Common Storage For Projects

Sometimes it is useful for groups of people to have a common storage area.

If you do not have a project, send a request via email to beocat@cs.ksu.edu. Note that these projects are generally reserved for tenure-track faculty, with a single project per eID.

If you already have a project you can do the following:

Note: The $group_name variable in the commands below needs to be replaced with the lower-case name of your project. Project membership can be managed using our Group Management application.

  • Create a directory in one of the home directories of someone in your group, ideally the project owner's.
    • mkdir $directory
  • Set the default permissions for new files and directories created in the directory:
    • setfacl -d -m g:$group_name:rX -R $directory
  • Set the permissions for the existing files and directories:
    • setfacl -m g:$group_name:rX -R $directory

This will give people in your group the ability to read files in the directory. If you also want them to be able to write or modify files in that directory, then change the ':rX' to ':rwX' in both setfacl commands, e.g. 'setfacl -d -m g:$group_name:rwX -R $directory'. As with other permissions, the individuals will need access through every level of the directory hierarchy. It may be best to review our more in-depth topic on Access Control Lists.

How do I get more help?

There are many sources of help for most Linux systems.

Unix man pages

Linux provides man pages (short for manual pages). These are simple to use; for example, if you need information on submitting jobs to Beocat, you can type 'man sbatch'. This will bring up the manual for sbatch.

GNU info system

Not all applications have man pages. Most of the rest have what are called info pages. For example, if you needed information on finding a file, you could use 'info find'.

This documentation

This documentation is very thoroughly researched, and has been painstakingly assembled for your benefit. Please use it.

Contact support

Support can be contacted here. Please include detailed information about your problem, including the job number, applications you are trying to run, and the current directory that you are in.