Slurm is a very popular soft drink; it is highly addictive, and users who consume it in large quantities may experience radioactive glowing green skin.
Slurm, formerly known as the Simple Linux Utility for Resource Management, is a very powerful job scheduler that enjoys wide popularity within the HPC world. More than 60% of the TOP500 supercomputers use Slurm, and we use it for both the Turing and Wahab clusters.
Slurm, like most job schedulers, brings the following abilities to the cluster:
Slurm distributes computational tasks across the nodes within a cluster, avoiding situations where some nodes sit under-utilized while others are overloaded.
Slurm lets you submit tasks to the cluster and starts the computation for you once resources become available.
Slurm monitors job status, node status, and also keeps a historical record of them for further study.
A cluster is the collection of resources wired together for the purpose of high performance computing: computational devices (servers), networking devices (switches), and storage devices combined.
A node is a single computational device, usually a server.
When you want the scheduler to execute a program on your behalf, the work has to be boxed into an abstraction layer called a "job".
A partition is a set of compute nodes, grouped logically. We separate our computational resources based on the features of their hardware and the nature of the jobs.
For instance, there is a regular compute partition named main and a GPU-based partition named gpu.
It may be confusing at first, but a task in Slurm means a processor resource. By default, one task uses one core; however, this behavior can be altered.
Nodes come with some differences, for instance different generations of processors, and each such difference is recorded as a feature of the node. You can pick a set of machines that share certain features to execute your job.
sbatch -- a job script with complex instructions
salloc -- an interactive shell
srun -- a single command
There are two commands in Slurm that are often used to gather information regarding the cluster.
The sinfo command gives an overview of the resources offered by the cluster, including their current availability.
The squeue command shows jobs currently running and pending on the cluster.
sinfo shows the partitions and the nodes available or occupied in each partition. Example command output:
$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
main*       up     infinite       2  down*  coreV2-22-[012,026]
main*       up     infinite       1  mix    coreV1-22-016
main*       up     infinite      11  alloc  coreV1-22-[001,004,007,011-013],coreV2-22-[028,030],coreV2-25-[006-008]
main*       up     infinite      18  idle   coreV1-22-[017-024],coreV2-22-[001,005,010,014,016-017],coreV2-25-[011,018-020]
timed-main  up     2:00:00        2  down*  coreV2-22-[012,026]
timed-main  up     2:00:00        1  mix    coreV1-22-016
timed-main  up     2:00:00       11  alloc  coreV1-22-[001,004,007,011-013],coreV2-22-[028,030],coreV2-25-[006-008]
timed-main  up     2:00:00       18  idle   coreV1-22-[017-024],coreV2-22-[001,005,010,014,016-017],coreV2-25-[011,018-020]
You can try out some variations on sinfo; for instance, sinfo -N -l will show much more detailed information, down to the individual compute nodes:
$ sinfo -N -l
Thu May 25 14:36:08 2017
NODELIST       NODES  PARTITION   STATE      CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
coreV1-22-001      1  main*       allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-001      1  timed-main  allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-004      1  main*       allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-004      1  timed-main  allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-007      1  main*       allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-007      1  timed-main  allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-011      1  main*       allocated    16  2:8:1  126976         0       1  (null)    none
...
(The apparent duplication of compute nodes is due to each compute node belonging to two partitions.)
squeue shows the jobs running or waiting to run, along with information such as which partition they run on, the users who submitted them, and the time and resources they consume. Example output:
$ squeue
 JOBID  PARTITION  NAME      USER      ST  TIME        NODES  NODELIST(REASON)
   357  main       migr0000  abcde001  R   1-01:27:02      7  coreV1-22-[001,004,007,011-013,016]
   356  main       migr0001  abcde001  R   1-01:28:02      5  coreV2-22-[028,030],coreV2-25-[006-008]
   358  main       migr0002  abcde001  PD        0:00      8  (Resources)
In order to see only your own jobs, you can use squeue -u <USERNAME>, where <USERNAME> stands for your own user name. Here is an example:
squeue -u jpratt
 JOBID  PARTITION  NAME     USER    ST  TIME      NODES  NODELIST(REASON)
   360  main       testjob  jpratt  R   03:27:02      7  coreV2-25-[001,004,007,011-013,016]
It is worth noting that JOBID is a very useful property, since many Slurm commands take it as an argument. For example, to cancel job migr0000 shown above, the user can run scancel 357. The squeue output above also shows that job 358 is in the state PD, which means pending (waiting for execution); the scheduler will start it once the resource requirements of the job can be met.
There are two kinds of jobs on a cluster: interactive and non-interactive. An interactive job allows the user to interact with it by providing inputs via keyboard and/or mouse; the output can come via files, the terminal, or graphical window(s). Non-interactive jobs require all inputs to be provided via files and all outputs to be redirected to files.
There are three Slurm commands to submit jobs for execution:
sbatch -- for submitting a non-interactive job;
salloc -- for running a single job (interactive/non-interactive) or launching an interactive shell;
srun -- for running a single job (interactive/non-interactive).
sbatch is the primary method of running jobs on the HPC clusters; it is used more frequently than the other commands. For end users, salloc is primarily used to launch an interactive shell on a compute node, which you can then use for interactive work.
srun can be used to process an interactive or a non-interactive job;
srun waits until the job is completed before returning to the shell, whereas
sbatch does not (the job can be run much later when the resources are available).
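The difference is easy to see at the shell prompt. A sketch of both behaviors (the node name, job ID, and script name shown are made up):

```shell
# srun blocks until the command finishes, then prints its output
# directly to the terminal:
$ srun --ntasks=1 hostname
coreV2-25-011

# sbatch returns immediately with a job ID; the job itself may run
# later, and its output goes to a file:
$ sbatch JOB_SCRIPT.sh
Submitted batch job 400
```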
You will have to use different commands to submit the two kinds of jobs, but conveniently they use the same arguments.
command structure:
    salloc/srun [options] path-to-your-executable [arguments-for-your-executable]

example:
    salloc/srun --ntasks=1 --job-name=matlab /usr/local/bin/matlab -singleCompThread -nodesktop

options:
    --job-name    a name for the job (optional)
    --ntasks      number of processes requested
If you submit the same job often or the job is relatively complex, you can use a submission script and submit it with the sbatch command:
#!/bin/bash
#SBATCH --ntasks=4

enable_lmod
module load gcc
module load openmpi

srun your_application
Use a text editor such as vi to create this script. Save it to a file (let's call it JOB_SCRIPT.sh in the discussion below; you are free to pick a more appropriate name). After submitting the script with sbatch JOB_SCRIPT.sh, you can check on your jobs with squeue -u $USER.
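A typical submit-and-monitor session might then look like this (the job ID, user name, and node name are made up; squeue truncates long job names):

```shell
$ sbatch JOB_SCRIPT.sh
Submitted batch job 400

$ squeue -u $USER
 JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
   400  main       JOB_SCRI  jpratt  R   0:05      1  coreV1-22-017
```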
Let's study the example job script above more closely. Job submissions, whether salloc resource requests or sbatch scripts, instruct the cluster to perform a computation using a specified set of resources. You can list all the job requests at the beginning of a job script or pass them as command-line arguments.
The lines starting with #SBATCH near the top of a job script append options to the sbatch command. Consider the following example script:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=output.txt
#SBATCH --ntasks=4

/usr/bin/true
It is equivalent to creating a job script without the #SBATCH options and submitting it with the command:
$ sbatch --job-name=test --output=output.txt --ntasks=4 job_script.sh
HINT: A combination of command-line options and #SBATCH lines is possible, where command-line options will override the corresponding #SBATCH lines.
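For instance, a script whose #SBATCH line requests 4 tasks can be given 8 tasks at submission time without editing the file (the script name here is hypothetical):

```shell
# job_script.sh contains the line "#SBATCH --ntasks=4";
# the command-line option takes precedence, so this run gets 8 tasks:
$ sbatch --ntasks=8 job_script.sh
```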
Secondly, please look at the module command in this script. It changes your environment variables based on predefined module files. We discuss modules in the Dynamic Module Loading section of this wiki.
Finally, we give the instruction on how to execute our code. In this example, it is given by the line srun your_application.
Long-time HPC users may have already noticed a few differences:
Since Fall 2016, Research Computing Services has recommended bash to new users. The bash shell is richer in features, and it has been the default on many Linux distributions for years. We felt that moving to bash would make our new users' lives easier. We still maintain and support all existing tcsh scripts. If you are experienced with tcsh or have scripts previously written in tcsh, please feel free to continue using tcsh by changing the first line of the script to #!/bin/tcsh.
Since Fall 2016, Research Computing Services has also promoted lmod-based modules to all users. We encourage all users to move to the new module system, since it includes a number of improvements and works well with our new software repository. The lmod-based module system is not yet enabled by default; users must run the enable_lmod command to load it. We will maintain both module systems to ensure current job scripts continue to work, but eventually we will abandon the old module system altogether. For details, please read Dynamic Module Loading.
HINT: While not discussed in depth here, the job submission options specified above can be used to launch jobs via srun commands as well. As an example:
$ srun --ntasks=1 --job-name=test hostname
--ntasks=1 --job-name=test are the resource request specifications, and hostname is the command (computational task) to execute.
Resource requests in Slurm can be very simple, or they can be very specific and complex. Consider the following situations:
A typical MPI job requires a certain number of processes, and each process requires one core. Therefore, the resource request can be as simple as a single --ntasks line, for example:
#SBATCH --ntasks=8
A typical multi-threading job requires one process running on multiple cores. Therefore, the resource request would look like this:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
When a hybrid program requires both multiple processes and multi-threading, you can make such a request in Slurm as in the example below: 8 processes will be launched, each process executing on 8 cores, for a total of 64 cores allocated to this job.
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=8
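For a hybrid job it is common to pass the per-task core count on to the threading runtime. A minimal sketch of such a job script, assuming an OpenMP-based program (the application name is a placeholder):

```shell
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=8

enable_lmod
module load gcc
module load openmpi

# Let each MPI process spawn one OpenMP thread per allocated core.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun your_hybrid_application
```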
Sometimes, it is very useful to have some information regarding your Slurm job when you are writing your script.
Let's take a look at an example:
#!/bin/bash -l
#SBATCH --job-name=SBATCH_EXAMPLE
#SBATCH --output=output
#SBATCH --ntasks=64

cat << EOF
this job is called $SLURM_JOB_NAME and its ID is $SLURM_JOB_ID
job $SLURM_JOB_ID has been allocated $SLURM_NTASKS cores across $SLURM_NNODES hosts
job $SLURM_JOB_ID will be running on the following machines:
EOF

echo $SLURM_NODELIST

cat << EOF
the working directory for job $SLURM_JOB_NAME is $SLURM_SUBMIT_DIR
what is inside?
EOF

ls -l "$SLURM_SUBMIT_DIR"
If you take a look at the output after running this script, you will see:
this job is called SBATCH_EXAMPLE and its ID is 397
job 397 has been allocated 64 cores across 4 hosts
job 397 will be running on the following machines:
coreV2-25-[011,018-020]
the working directory for job SBATCH_EXAMPLE is /home/XXXXXXX/Testing - Slurm
what is inside?
total 480
....
These environment variables are very useful and self-explanatory. Please take a look at the cheat sheet below to see more.
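Since these are ordinary environment variables, you can preview such bookkeeping logic outside of Slurm by exporting stand-in values (the values below are made up):

```shell
# Simulate part of a Slurm job environment with hypothetical values.
export SLURM_JOB_NAME=SBATCH_EXAMPLE
export SLURM_JOB_ID=397
export SLURM_NTASKS=64
export SLURM_NNODES=4

# The same heredoc style used in the job script above now expands locally.
cat << EOF
this job is called $SLURM_JOB_NAME and its ID is $SLURM_JOB_ID
job $SLURM_JOB_ID has been allocated $SLURM_NTASKS cores across $SLURM_NNODES hosts
EOF
```

Inside a real job, Slurm sets these variables for you, so the script body needs no changes.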
Slurm supports both a long form and a short form of its options for convenience. The table below lists both where available.
|Option||Long Form||Short Form|
|Partition: specify which partition the job needs||--partition=partition_name||-p|
|Job name: give a name to your job, which makes managing it easier||--job-name=job_name||-J|
|Output: redirect your stdout to a file with the specified filename||--output=filename||-o|
Slurm can send you mail when certain events happen. The events are defined as follows:
BEGIN -- mail is sent at the beginning of the job
END -- mail is sent at the end of the job
FAIL -- mail is sent when the job is aborted or rescheduled
Slurm will send email to the list of addresses given here when the events defined in --mail-type occur.
Slurm provides a way to constrain a job to run only on nodes satisfying certain requirements. These requirements are specified using the --constraint flag. One notable example is choosing a specific GPU on Turing, where we have a mix of K40, K80, P100, and V100 GPUs. Suppose you want to run your workload only on nodes with V100 GPUs: you specify the --constraint=v100 flag (along with -p gpu --gres gpu:1) when submitting the job, and Slurm will only execute the job when a node with one or more V100 GPUs is available.
The list of constraints can be seen in the
/etc/slurm/slurm.conf file in the respective cluster, where the compute nodes are defined in the lines that begin with
NodeName=. Here is a snippet of the relevant section:
NodeName=coreV3-23-k40-[001-010] State=UNKNOWN RealMemory=128000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 Feature=c28,coreV3,AVX2,k40 Gres=gpu:1
NodeName=coreV4-21-k80-[001-005] State=UNKNOWN RealMemory=128000 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Feature=c32,coreV4,AVX2,k80 Gres=gpu:2
NodeName=coreV4-22-p100-[001-002] State=UNKNOWN RealMemory=128000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Feature=c24,coreV4,AVX2,p100 Gres=gpu:2
The words listed in the Feature field can be used to constrain the selection of compute nodes.
Some popular constraints are:
AVX2: CPUs supporting the "AVX2" instruction set, which doubles the processing speed of dense vector-like operations (e.g. matrix-matrix multiply using Intel MKL).
v100: Specific GPU type.
Slurm allows for more complex constraints using the AND (&) and OR (|) operators. For example, to let a GPU job run only on a node with either a K40 or a K80 GPU, use --constraint="k40|k80". Please note the explicit quoting with double quotes! Since both & and | are special characters in the shell, this quoting is mandatory. Please refer to the sbatch documentation for more details.
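For example, a submission restricting a one-GPU job to nodes with either a K40 or a K80 might look like this (the script name is hypothetical):

```shell
$ sbatch -p gpu --gres gpu:1 --constraint="k40|k80" gpu_job.sh
```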
When you decide that you no longer wish a job to continue executing, you can use the scancel job_number command to remove it from the cluster.
You can only remove your own jobs.
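Continuing the squeue example above, where job migr0000 has JOBID 357:

```shell
# Cancel a single job by its numeric job ID:
$ scancel 357

# scancel can also filter by user; this cancels all of your own jobs:
$ scancel -u $USER
```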
|Resource Request||Long Form||Short Form|
|Cores per Task||--cpus-per-task=cores||-c|

|Job Property||Environment Variable|
|Number of Tasks||$SLURM_NTASKS|
|Number of Nodes||$SLURM_NNODES|
|Task ID in Job Array||$SLURM_ARRAY_TASK_ID|
|First Task ID in Job Array||$SLURM_ARRAY_TASK_MIN|
|Last Task ID in Job Array||$SLURM_ARRAY_TASK_MAX|
|Task Step Size in Job Array||$SLURM_ARRAY_TASK_STEP|
|Job submission||sbatch script_file|
|Job deletion||scancel job_id|
|Job status (by job)||scontrol show job job_id|
|Job status (by user)||squeue -u username|
|Job hold||scontrol hold job_id|
|Job release||scontrol release job_id|