== Programming for Performance ==

=== BLAS ===

BLAS (Basic Linear Algebra Subprograms) is a standard set of linear algebra subroutines. The standard was set so that software could be written against a standardized library interface and optimized libraries could be "plug-and-play." There are many implementations of the BLAS libraries, the most common being [http://software.intel.com/en-us/intel-mkl/ Intel's MKL] and [http://developer.amd.com/tools/cpu/acml/pages/default.aspx AMD's ACML].

==== Beocat BLAS Libraries ====

Since BLAS is a modular standard, we have installed a few (free) BLAS libraries.

* The BLAS reference library: An unoptimized reference library
* [http://developer.amd.com/tools/cpu/acml/pages/default.aspx AMD's ACML]: Optimized BLAS library for AMD systems
* [http://www.openblas.net OpenBLAS]: Optimized BLAS library for some AMD, and most Intel, systems

The default BLAS library is OpenBLAS.

==== Using a different BLAS library ====

If you want or need to use a different BLAS library, list the available libraries with 'ls -1 /etc/env.d/alternatives/blas' (ignore _current and _current_list):

<syntaxhighlight lang="console">
$ ls -1 /etc/env.d/alternatives/blas
_current
_current_list
acml-gfortran64
acml-gfortran64-openmp
acml-ifort64
acml-ifort64-openmp
mkl32-dynamic
mkl32-dynamic-openmp
mkl32-gfortran
mkl32-gfortran-openmp
mkl32-intel
mkl32-intel-openmp
mkl64-dynamic
mkl64-dynamic-openmp
mkl64-gfortran
mkl64-gfortran-openmp
mkl64-int64-dynamic
mkl64-int64-dynamic-openmp
mkl64-int64-gfortran
mkl64-int64-gfortran-openmp
mkl64-int64-intel
mkl64-int64-intel-openmp
mkl64-intel
mkl64-intel-openmp
openblas-openmp
reference
</syntaxhighlight>

To change your default BLAS version, you first need to determine which shell you are using.

===== CSH or TCSH =====

If your tool simply uses pkg-config to find the right BLAS, you can just run the following:

<syntaxhighlight lang="bash">
setenv PKG_CONFIG_PATH /etc/env.d/alternatives/blas/openblas-openmp/usr/lib64/pkgconfig
</syntaxhighlight>

where openblas-openmp is replaced with the name of your preferred BLAS. You can put that line in your job script or in your ~/.cshrc file.

If your tool needs actual library names and options for the compiler, then after you have run the above you can run these to get the right arguments/library names for your compiler:

<syntaxhighlight lang="bash">
pkg-config --cflags blas
pkg-config --libs blas
</syntaxhighlight>

===== SH, BASH, or ZSH =====

If your tool simply uses pkg-config to find the right BLAS, you can just run the following:

<syntaxhighlight lang="bash">
export PKG_CONFIG_PATH=/etc/env.d/alternatives/blas/openblas-openmp/usr/lib64/pkgconfig
</syntaxhighlight>

where openblas-openmp is replaced with the name of your preferred BLAS. You can put that line in your job script or in your ~/.bashrc or ~/.zshrc file.

If your tool needs actual library names and options for the compiler, then after you have run the above you can run these to get the right arguments/library names for your compiler:

<syntaxhighlight lang="bash">
pkg-config --cflags blas
pkg-config --libs blas
</syntaxhighlight>
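As a concrete sketch of how the pkg-config output feeds into a build: the following assumes PKG_CONFIG_PATH has already been set as above, and uses a hypothetical source file name; the exact flags printed depend on which BLAS you selected.

<syntaxhighlight lang="bash">
# Sketch, not a definitive recipe: compile against whichever BLAS
# PKG_CONFIG_PATH currently points at. "my_blas_program.c" is a
# hypothetical file name; substitute your own source file.
gcc my_blas_program.c $(pkg-config --cflags blas) $(pkg-config --libs blas) -o my_blas_program
</syntaxhighlight>

Using command substitution like this means the same compile line keeps working when you switch PKG_CONFIG_PATH to a different BLAS.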
=== LAPACK ===

LAPACK (Linear Algebra PACKage) is a standard set of linear algebra subroutines. Like BLAS implementations, LAPACK implementations are heavily optimized, but LAPACK covers a different set of functions. The standard was set so that software could be written against a standardized library interface and optimized libraries could be "plug-and-play." There are many implementations of the LAPACK libraries, the most common being [http://software.intel.com/en-us/intel-mkl/ Intel's MKL] and [http://developer.amd.com/tools/cpu/acml/pages/default.aspx AMD's ACML].

==== Beocat LAPACK Libraries ====

Since LAPACK is a modular standard, we have installed a few (free) LAPACK libraries.

* [http://www.netlib.org/lapack/ The LAPACK reference library]: An unoptimized reference library
* [http://developer.amd.com/tools/cpu/acml/pages/default.aspx AMD's ACML]: Optimized LAPACK library for AMD systems

The default LAPACK library is ACML.

==== Using a different LAPACK library ====

If you want or need to use a different LAPACK library, list the available libraries with 'ls -1 /etc/env.d/alternatives/lapack' (ignore _current and _current_list):

<syntaxhighlight lang="console">
$ ls -1 /etc/env.d/alternatives/lapack
_current
_current_list
acml-gfortran64
acml-gfortran64-openmp
acml-ifort64
acml-ifort64-openmp
mkl32-dynamic
mkl32-dynamic-openmp
mkl32-gfortran
mkl32-gfortran-openmp
mkl32-intel
mkl32-intel-openmp
mkl64-dynamic
mkl64-dynamic-openmp
mkl64-gfortran
mkl64-gfortran-openmp
mkl64-int64-dynamic
mkl64-int64-dynamic-openmp
mkl64-int64-gfortran
mkl64-int64-gfortran-openmp
mkl64-int64-intel
mkl64-int64-intel-openmp
mkl64-intel
mkl64-intel-openmp
reference
</syntaxhighlight>

To change your default LAPACK version, you first need to determine which shell you are using.

===== CSH or TCSH =====

If your tool simply uses pkg-config to find the right LAPACK, you can just run the following:

<syntaxhighlight lang="bash">
setenv PKG_CONFIG_PATH /etc/env.d/alternatives/lapack/acml-ifort64/usr/lib64/pkgconfig
</syntaxhighlight>

where acml-ifort64 is replaced with the name of your preferred LAPACK. You can put that line in your job script or in your ~/.cshrc file.

If your tool needs actual library names and options for the compiler, then after you have run the above you can run these to get the right arguments/library names for your compiler:

<syntaxhighlight lang="bash">
pkg-config --cflags lapack
pkg-config --libs lapack
</syntaxhighlight>

===== SH, BASH, or ZSH =====

If your tool simply uses pkg-config to find the right LAPACK, you can just run the following:

<syntaxhighlight lang="bash">
export PKG_CONFIG_PATH=/etc/env.d/alternatives/lapack/acml-ifort64/usr/lib64/pkgconfig
</syntaxhighlight>

where acml-ifort64 is replaced with the name of your preferred LAPACK. You can put that line in your job script or in your ~/.bashrc or ~/.zshrc file.

If your tool needs actual library names and options for the compiler, then after you have run the above you can run these to get the right arguments/library names for your compiler:

<syntaxhighlight lang="bash">
pkg-config --cflags lapack
pkg-config --libs lapack
</syntaxhighlight>
=== [http://openmp.org/wp/ OpenMP] ===

OpenMP is a set of directives for C, C++, and Fortran which greatly simplifies parallelizing applications on a single node. There is a good tutorial for OpenMP at [https://computing.llnl.gov/tutorials/openMP/ https://computing.llnl.gov/tutorials/openMP/]

To compile an OpenMP-enabled program, you need to tell GCC that OpenMP is available. This is done like so:

<syntaxhighlight lang="bash">
gcc -fopenmp myOpenMPprogram.c
</syntaxhighlight>

By default, OpenMP will use all available cores for its computation, which is a problem on shared resources like Beocat.

To make use of only the cores assigned to you, first make sure you have requested the 'single' parallel environment; then your job script will need something like the following (before the application you are trying to run):

==== bash, sh, zsh ====

<syntaxhighlight lang="bash">
export OMP_NUM_THREADS=${NSLOTS}
</syntaxhighlight>

==== csh or tcsh ====

<syntaxhighlight lang="bash">
setenv OMP_NUM_THREADS ${NSLOTS}
</syntaxhighlight>
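Putting the pieces together, a minimal bash job script might look like the sketch below. This is an assumption-laden sketch: it uses the SGE-style '&#36; -pe' request syntax for the 'single' parallel environment named above, a hypothetical core count of 8, and a hypothetical program name; adapt all three to your job.

<syntaxhighlight lang="bash">
#!/bin/bash
# Hypothetical job script sketch: request 8 slots in the 'single'
# parallel environment, then limit OpenMP to the slots we were given.
#$ -pe single 8
export OMP_NUM_THREADS=${NSLOTS}
./myOpenMPprogram
</syntaxhighlight>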
Beocat has a number of tools to make your work easier, some of which you may not know about. This is a simple list of these programs and some basic usage scenarios.

== Submitting your job to run the fastest ==

=== Size your jobs to use the fastest nodes ===

==== Specify the proper number of cores ====

Neither Beocat nor any other computer or cluster can make your job run on more than one core at a time if your program isn't designed to take advantage of multiple cores. Many people think "I can run this on 40 cores and it will run 40 times faster." This isn't true. While we have many programs that are designed to take advantage of multiple cores, do not assume this is the case.

==== Optimize your jobs for speed, not for number of cores ====

It seems that many people pick an arbitrarily large number of cores for their jobs; 20 seems to be a common one. However, some of our fastest nodes have 16 cores. It is quite likely that if your job will fit on an Elf (16 cores, 64 GB RAM), it will run faster with 16 cores than by specifying more cores and having it run on slower nodes.

=== Don't request resources you don't need ===

The most common culprit here is people specifying that they need InfiniBand when the job runs on a single node. This limits the scheduling such that a perfectly good node for your job may sit idle while your job is still waiting.
== Programs that make using Beocat easier ==

* '''nmon''': The name is short for "Nigel's Monitor"; it's a monitoring program written by Nigel Griffiths from IBM.
* A tool for producing graphs and spreadsheets from output generated by nmon.
* A prettier, easier-to-use top. Shows CPU and memory usage in an easy-to-digest format.
* '''screen''': A sort of terminal multiplexer; it allows you to run many terminal programs at once without mixing them up, and also allows you to disconnect and reconnect sessions. There is a good explanation of how to use screen at http://www.mattcutts.com/blog/a-quick-tutorial-on-screen/.
* '''Ganglia''': The web-based load monitoring tool for the cluster: http://ganglia.beocat.ksu.edu . From there, you can see how busy Beocat is.
* A very detailed performance analyzer.
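As a quick sketch of the screen workflow described above (the session name "analysis" is arbitrary):

<syntaxhighlight lang="bash">
screen -S analysis      # start a new session named "analysis"
# ... run your programs; detach with Ctrl-a d ...
screen -ls              # list your sessions
screen -r analysis      # reattach to the detached session
</syntaxhighlight>

Because the session keeps running after you detach, you can log out, log back in later, and pick up exactly where you left off.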
== Increasing file write performance ==

Credit for this goes to http://moo.nac.uci.edu/~hjm/bduc/BDUC_USER_HOWTO.html#writeperfongl

=== Use gzip ===

If you have written your own code or are using an app that writes zillions of tiny chunks of data to STDOUT, and you are storing the results on Beocat, you should consider passing the output through gzip to consolidate the writes into a continuous stream. If you don't do this, each write will be treated as a separate I/O event and write performance will suffer. If, however, STDOUT is passed through gzip, the wallclock runtime decreases even below the usual runtime, and you end up with an output file that is already compressed to about 1/5 the usual size.

Here's how to do it:

<syntaxhighlight lang="bash">
someapp --opt1 --opt2 --input=/path/to/input_file | gzip > /path/to/output_file
</syntaxhighlight>
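The compressed output can then be consumed later without ever writing an uncompressed copy to disk. In this runnable sketch, seq stands in for the real application:

<syntaxhighlight lang="bash">
# seq stands in for "someapp": write compressed, then read back with gunzip -c
seq 1 1000 | gzip > /tmp/output.gz
gunzip -c /tmp/output.gz | wc -l    # the original 1000 lines come back
</syntaxhighlight>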
=== Use named pipes ===

Named pipes are special files that don't actually write to the filesystem and can be used to communicate between processes. Since these pipes live in memory rather than going directly to disk, they can be used to buffer writes:

<syntaxhighlight lang="bash">
# Create the named pipe
mkfifo /path/to/MyNamedPipe
# Write some data to it
MyProgram --infile=/path/to/InputData1 --outfile=/path/to/MyNamedPipe &
MyOtherProgram < /path/to/InputData2 > /path/to/MyNamedPipe
# Extract the output
cat < /path/to/MyNamedPipe > $HOME/MyOutput
## OR, we could compress the output
gzip < /path/to/MyNamedPipe > $HOME/MyOutput.gz
# Delete the named pipe like you would a file
rm /path/to/MyNamedPipe
</syntaxhighlight>

One cautionary word: unlike normal files, named pipes cannot be used between machines, only among processes running on the same machine. So if you're running an MPI job that runs completely on one node, you could set up a named pipe, do all your writes to that pipe, and flush it at the end; but if you're running a multi-node MPI job and your named pipe is on a shared filesystem (like $HOME), each process will need to flush its named pipe to a regular file before the job quits.
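The pattern above can be tried end to end with stock tools. This sketch uses seq and gzip as stand-ins for the real programs, in a temporary directory:

<syntaxhighlight lang="bash">
# Runnable sketch of the named-pipe pattern using stand-in programs
tmp=$(mktemp -d)
mkfifo "$tmp/pipe"                     # create the named pipe
gzip < "$tmp/pipe" > "$tmp/out.gz" &   # reader: compress whatever arrives
seq 1 100 > "$tmp/pipe"                # writer: blocks until the reader attaches
wait                                   # let gzip finish
gunzip -c "$tmp/out.gz" | tail -n 1    # prints 100
rm -r "$tmp"                           # remove pipe and output
</syntaxhighlight>

Note that the reader must be started in the background first; opening a FIFO for writing blocks until some process opens it for reading.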
=== Use one big file instead of many small ones ===

This may seem like a non-issue, but it's a performance problem we've seen on Beocat many times. I love the term coined by UCI at the link above: they call many small files "Zillions Of Tiny files" (ZOTfiles). Using files like this is an inefficient use of our shared resources. A tiny file by itself is no more inefficient than a huge one; if you have only 100 bytes to store, store them in a single file. The problems start compounding when there are many of them, though. Because of the way data is stored on disk, 10 MB stored in ZOTfiles of 100 bytes each can easily take up not 10 MB but more than 400 MB, 40 times more space. Worse, data stored in this manner makes many operations very slow: instead of looking up 1 directory entry, the OS has to look up 100,000. This means 100,000 times more disk head movement, with a concomitant decrease in performance and disk lifetime. We have had Beocat users with several million files of less than 1 kB each; just creating a directory listing with ls would take nearly half an hour. Not only is that inefficient for you, but it also degrades the performance of everybody using that filesystem, and degrades our backups as well.

Please use large files instead of ZOTfiles any chance you can!

As a defense against too much abuse of tiny files, there is a limit of 100,000 entries in any directory in our shared filesystem space.
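One practical way to follow this advice after the fact is to consolidate an existing directory of tiny files into a single compressed archive; the directory and file names here are hypothetical:

<syntaxhighlight lang="bash">
# Pack a directory full of tiny files into one archive (one directory
# entry, one I/O stream), then remove the originals
tar -czf results.tar.gz results/ && rm -r results/
# The contents can still be listed, or extracted individually, later:
tar -tzf results.tar.gz
tar -xzf results.tar.gz results/run_0001.txt
</syntaxhighlight>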