Slurm is a very popular soft drink; it is highly addictive. When consumed in large quantities, users may experience radioactive glowing green skin.
Slurm, formerly known as the Simple Linux Utility for Resource Management, is a very powerful job scheduler that enjoys wide popularity within the HPC world. More than 60% of the TOP500 supercomputers use Slurm, and we use it on both the Turing and Wahab clusters.
Slurm, like most job schedulers, brings the following capabilities to the cluster:
- **Load Balancing**: Slurm distributes computational tasks across the nodes of a cluster, avoiding the situation where some nodes sit underutilized while others are overloaded.
- **Scheduling**: Slurm lets you submit tasks to the cluster and starts the computation for you once resources become available.
- **Monitoring**: Slurm monitors job status and node status, and keeps a historical record of them for further study.
- **Cluster**: A cluster is all of the resources wired together for the purpose of high performance computing: computational devices (servers), networking devices (switches), and storage devices.
- **Node**: A node is a single computational device, usually a server.
- **Job**: When you want the scheduler to execute a program, performing a computation on your behalf, the work has to be boxed into an abstraction layer called a "job".
- **Partition**: A partition is a set of compute nodes, grouped logically. We separate our computational resources based on the features of their hardware and the nature of the jobs. For instance, there is a regular compute partition `main` and a CUDA-enabled, GPU-based partition `gpu`.
- **Task**: It may be confusing, but a task in Slurm means a processor resource. By default, 1 task uses 1 core; however, this behavior can be altered.
- **Feature**: Each node comes with some differences, for instance a different generation of processor, and such a difference is a feature of that node. You can require your job to execute on a set of machines that share certain features (see the example below).
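To see which features the nodes offer, you can ask `sinfo` for a custom output format; the `%f` format specifier prints each node's feature list. A small sketch (the node names and features shown are illustrative):

```
$ sinfo -o "%N %f"
NODELIST AVAIL_FEATURES
coreV2-25-[001-020] c32,coreV2,AVX2
```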
Slurm provides three commands for submitting work, each suited to running a different kind of task:

- `sbatch`: a job script with complex instructions
- `salloc`: an interactive shell
- `srun`: a single command

There are two commands in Slurm that are often used to gather information regarding the cluster:

- The `sinfo` command gives an overview of the resources offered by the cluster, including their current availability.
- The `squeue` command shows jobs currently running and pending on the cluster.

`sinfo` shows the partitions and the nodes that are available or occupied in each partition. Example command output:
```
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main* up infinite 2 down* coreV2-22-[012,026]
main* up infinite 1 mix coreV1-22-016
main* up infinite 11 alloc coreV1-22-[001,004,007,011-013],coreV2-22-[028,030],coreV2-25-[006-008]
main* up infinite 18 idle coreV1-22-[017-024],coreV2-22-[001,005,010,014,016-017],coreV2-25-[011,018-020]
timed-main up 2:00:00 2 down* coreV2-22-[012,026]
timed-main up 2:00:00 1 mix coreV1-22-016
timed-main up 2:00:00 11 alloc coreV1-22-[001,004,007,011-013],coreV2-22-[028,030],coreV2-25-[006-008]
timed-main up 2:00:00 18 idle coreV1-22-[017-024],coreV2-22-[001,005,010,014,016-017],coreV2-25-[011,018-020]
```
You can try out some variations on `sinfo`; for instance, `sinfo -N -l` will show much more detailed information, down to the individual compute nodes:
```
$ sinfo -N -l
Thu May 25 14:36:08 2017
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
coreV1-22-001 1 main* allocated 16 2:8:1 126976 0 1 (null) none
coreV1-22-001 1 timed-main allocated 16 2:8:1 126976 0 1 (null) none
coreV1-22-004 1 main* allocated 16 2:8:1 126976 0 1 (null) none
coreV1-22-004 1 timed-main allocated 16 2:8:1 126976 0 1 (null) none
coreV1-22-007 1 main* allocated 16 2:8:1 126976 0 1 (null) none
coreV1-22-007 1 timed-main allocated 16 2:8:1 126976 0 1 (null) none
coreV1-22-011 1 main* allocated 16 2:8:1 126976 0 1 (null) none
...
```
(The apparent duplication of compute nodes is because each compute node belongs to two partitions.)
The `squeue` command shows the jobs running or waiting to run, with information such as which partition they are running on, the users who submitted the jobs, and the time and resources consumed by the jobs. Example output:
```
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
357 main migr0000 abcde001 R 1-01:27:02 7 coreV1-22-[001,004,007,011-013,016]
356 main migr0001 abcde001 R 1-01:28:02 5 coreV2-22-[028,030],coreV2-25-[006-008]
358 main migr0002 abcde001 PD 0:00 8 (Resources)
```
In order to see only your own jobs, you can use `squeue -u <USERNAME>`, where `<USERNAME>` stands for your own user name. Here is an example:
```
$ squeue -u jpratt
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
360 main testjob jpratt R 03:27:02 7 coreV2-25-[001,004,007,011-013,016]
```
It is worth noting that `JOBID` is a very useful property, since many Slurm commands use it. For example, to cancel job `migr0000`, the user can use the command `scancel 357`.
The `squeue` output above also shows that job `358` is in the state `PD`, which means pending (waiting for execution). The scheduler will start it once the resource requirements for the job can be met.
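If you need more detail about a particular job -- for instance, why job `358` is still pending -- `scontrol show job` prints the job's full record. A heavily abbreviated illustration (real output contains many more fields):

```
$ scontrol show job 358
JobId=358 JobName=migr0002
   JobState=PENDING Reason=Resources
   NumNodes=8 ...
```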
There are two kinds of jobs on a cluster: interactive and non-interactive. An interactive job allows the user to interact with the job by providing inputs via keyboard and/or mouse; the output can come via files, on the terminal, or in graphical window(s). Non-interactive jobs require all inputs to be provided via files, and all outputs to be redirected to files.
There are three Slurm commands to submit jobs for execution:

- `sbatch` -- for submitting a non-interactive job;
- `salloc` -- for running a single job (interactive/non-interactive) or launching an interactive shell;
- `srun` -- for running a single job (interactive/non-interactive).

`sbatch` is the primary method of running jobs on HPC; it is used more frequently than the other commands. For end-users, `salloc` is used primarily to launch an interactive shell on a compute node, which you can use for interactive work. `srun` can be used to run an interactive or a non-interactive job; `srun` waits until the job is completed before returning to the shell, whereas `sbatch` does not (the job may run much later, when the resources become available).
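For example, an interactive session with `salloc` might look like the following; the job ID and node name are illustrative:

```
$ salloc --ntasks=4
salloc: Granted job allocation 361
$ srun hostname
coreV2-22-001
coreV2-22-001
coreV2-22-001
coreV2-22-001
$ exit
salloc: Relinquishing job allocation 361
```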
You will have to use different commands to submit them, but the good part is that they all use the same arguments.

Command structure:

```
salloc/srun [options] path-to-your-executable [arguments-for-your-executable]
```

Example:

```
salloc/srun --ntasks=1 --job-name=matlab /usr/local/bin/matlab -singleCompThread -nodesktop
```

Options:

```
--job-name    name for the job (optional)
--ntasks      number of processes requested
```
If you submit the same job often or the job is relatively complex, you can use a submission script and submit it with the `sbatch` command:

```bash
#!/bin/bash
#SBATCH --ntasks=4

enable_lmod
module load gcc
module load openmpi

srun your_application
```
Use a text editor such as `nano` or `vi` to create this script. Save it to a file (let's call it `JOB_SCRIPT.sh` in the discussion below -- you are free to specify a more appropriate name). Then:

- Submit the job with `sbatch JOB_SCRIPT.sh`.
- Check its status with `squeue -u $USER`, as in the sample session below.
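A sample session might look like this (the job ID and node assignment are illustrative):

```
$ sbatch JOB_SCRIPT.sh
Submitted batch job 359
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
359 main JOB_SCRI jpratt R 0:05 1 coreV2-22-001
```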
Let's study the example job script above more closely. Job submissions, including `srun`/`salloc` resource requests and `sbatch` scripts, instruct the cluster to perform a computation using a specified set of resources. You can list all the job requests at the beginning of a job script or pass them as command-line arguments.
The lines starting with `#SBATCH` near the top of a job script append options to the `sbatch` command. Consider the following example script:
```bash
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=output.txt
#SBATCH --ntasks=4
/usr/bin/true
```

is equivalent to creating a job script

```bash
#!/bin/bash
/usr/bin/true
```

submitted with the command:

```
$ sbatch --job-name=test --output=output.txt --ntasks=4 job_script.sh
```
HINT: A combination of command-line options and `#SBATCH` lines is possible, where command-line options will override the `#SBATCH` lines.
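For instance, given the script above (which requests 4 tasks), the following submission runs with 8 tasks, because the command-line option wins:

```
$ sbatch --ntasks=8 job_script.sh
```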
Secondly, please look at the `module` commands in this script. They change your environment variables based on pre-defined module files. We discuss modules in the section Dynamic Module Loading of this wiki.
Finally, we give the instruction on how to execute our code. In this example, it is given by:

```
srun your_application
```
Long-time HPC users may have already noticed a few differences:
`#!/bin/bash`

Since Fall 2016, Research Computing Services has recommended `bash` to new users. The `bash` shell is richer in features, and it has been the default on many Linux distributions for years. We felt that moving to `bash` would make our new users' lives easier. We still maintain and support all existing `tcsh` scripts. If you are experienced with `tcsh` or have scripts previously written in `tcsh`, please feel free to continue using `tcsh` by changing the first line to `#!/bin/tcsh`.
`enable_lmod`

Since Fall 2016, Research Computing Services has promoted `lmod`-based modules to all users. We encourage all users to move to the new module system, since it includes a number of improvements and works well with our new software repository. The `lmod`-based module system is not yet enabled by default; users must run the `enable_lmod` command to load it. We will maintain both module systems to ensure that current job scripts continue to work, but eventually we will retire the old module system altogether. For details, please read Dynamic Module Loading.
HINT: While not discussed in depth here, the job submission options specified above can be used to launch jobs via the `salloc` and `srun` commands as well. As an example:

```
$ srun --ntasks=1 --job-name=test hostname
```

Here, `--ntasks=1 --job-name=test` are the resource request specifications, and `hostname` is the command (computational task) to execute.
Resource requests in Slurm can be very simple, or they can be very specific and complex. Please take a look at the following situations.

A typical MPI job requires a certain number of processes, and each process requires 1 core. Therefore, the resource request can be as simple as:

```
#SBATCH --ntasks=8
```
A typical multi-threaded job requires 1 process that runs on multiple cores. Therefore, the resource request would look like this (a complete script sketch follows below):

```
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
```
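As a concrete sketch, a full script for such a job might look like the following. The application name is a placeholder, and exporting `OMP_NUM_THREADS` from `$SLURM_CPUS_PER_TASK` is the usual convention for OpenMP programs:

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Match the OpenMP thread count to the cores Slurm allocated
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./your_threaded_application
```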
When a hybrid program requires both multiple processes and multiple threads, you can make such a request in Slurm as in the example below: 8 processes will be launched, each process executing on 8 cores, so 64 cores in total will be allocated to this job.

```
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=8
```
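A hedged sketch of a complete hybrid job script, assuming an MPI+OpenMP program (the application name is a placeholder):

```bash
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=8

# Each of the 8 MPI processes runs 8 OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun your_hybrid_application
```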
Sometimes it is very useful to have some information about your Slurm job available when writing your script. Let's take a look at an example:
```bash
#!/bin/bash -l
#SBATCH --job-name=SBATCH_EXAMPLE
#SBATCH --output=output
#SBATCH --ntasks=64

cat << EOF
this job is called $SLURM_JOB_NAME and its ID is $SLURM_JOB_ID
job $SLURM_JOB_ID has been allocated $SLURM_NTASKS cores across $SLURM_NNODES hosts
job $SLURM_JOB_ID will be running on the following machines:
EOF
echo $SLURM_NODELIST
cat << EOF
the working directory for job $SLURM_JOB_NAME is $SLURM_SUBMIT_DIR
what is inside?
EOF
ls -l "$SLURM_SUBMIT_DIR"
```
If we take a look at the output after running this script, we will see:

```
this job is called SBATCH_EXAMPLE and its ID is 397
job 397 has been allocated 64 cores across 4 hosts
job 397 will be running on the following machines:
coreV2-25-[011,018-020]
the working directory for job SBATCH_EXAMPLE is /home/XXXXXXX/Testing - Slurm
what is inside?
total 480
....
```
These environment variables are very useful and self-explanatory. Please take a look at the cheat sheet below to see more.
Slurm supports both a long form and, for convenience, a short form of many options. Both are listed below where available.
- `--partition=partition_name` / `-p partition_name`: specify which partition the job needs.
- `--job-name=name` / `-J name`: give a name to your job; this will make managing your job easier.
- `--output=filename` / `-o filename`: redirect your stdout to a file with the specified filename.
- `--mail-type=type`: Slurm will send you mail when certain events happen. Events are defined as follows:
  - `BEGIN`: mail is sent at the beginning of the job
  - `END`: mail is sent at the end of the job
  - `FAIL`: mail is sent when the job is aborted or rescheduled
  - `ALL`: all of the above
- `--mail-user=your@email.address`: Slurm will send email to the list of addresses given here when the events defined in `--mail-type` occur (see the example below).
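For instance, to be notified when a job finishes or fails, you might add the following to a job script (the address is a placeholder):

```
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your@email.address
```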
Slurm provides a way to constrain a job to run only on nodes satisfying certain requirements. These requirements are specified using the `-C` or `--constraint` flag. One notable example is choosing a specific GPU on Turing, where we have a mix of K40, K80, P100, and V100 GPUs. Suppose you want to run your workload only on nodes with V100 GPUs: specify the `--constraint=v100` flag (along with `-p gpu --gres=gpu:1`) when submitting the job, and Slurm will only execute the job when a node with one or more V100 GPUs is available.
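Putting it together, such a submission might look like this (the script name is a placeholder):

```
$ sbatch -p gpu --gres=gpu:1 --constraint=v100 job_script.sh
```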
The list of constraints can be seen in the `/etc/slurm/slurm.conf` file on the respective cluster, where the compute nodes are defined in the lines that begin with `NodeName=`. Here is a snippet of the relevant section:
```
NodeName=coreV3-23-k40-[001-010] State=UNKNOWN RealMemory=128000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 Feature=c28,coreV3,AVX2,k40 Gres=gpu:1
NodeName=coreV4-21-k80-[001-005] State=UNKNOWN RealMemory=128000 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Feature=c32,coreV4,AVX2,k80 Gres=gpu:2
NodeName=coreV4-22-p100-[001-002] State=UNKNOWN RealMemory=128000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Feature=c24,coreV4,AVX2,p100 Gres=gpu:2
```
The words listed in the `Feature` field can be used to constrain the selection of compute nodes.
Some popular constraints are:

- `coreV1`, `coreV2`, `coreV3`, `coreV4`, `coreV5`: a specific generation of compute node.
- `AVX2`: CPUs supporting the "AVX2" instruction set, which doubles the processing speed of dense vector-like operations (e.g. matrix-matrix multiply using Intel MKL).
- `k40`, `k80`, `p100`, `v100`: a specific GPU type.

Slurm also allows more complex constraints using the AND (`&`) and OR (`|`) operators. For example, to let a GPU job run only on a node with either a K40 or K80 GPU, use `--constraint="k40|k80"`. Please note the explicit quoting with double quotes! Since both `&` and `|` are special characters in the shell, this quoting is mandatory on the command line. Please refer to the sbatch documentation for more details.
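A hedged example of such a submission from the command line (the script name is a placeholder):

```
$ sbatch -p gpu --gres=gpu:1 --constraint="k40|k80" job_script.sh
```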
When you decide that you no longer wish a job to continue executing, you can use the `scancel job_number` command to remove it from the cluster. You can only remove your own jobs.
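For example, using the job ID from the `squeue` output earlier (`scancel` also accepts a `-u` flag to cancel every job belonging to a user):

```
$ scancel 357        # cancel a single job by its JOBID
$ scancel -u $USER   # cancel all of your own jobs
```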
| Resource Request | Long Form | Short Form |
|---|---|---|
| Job name | `--job-name=name` | `-J name` |
| Stdout | `--output=file_name` | `-o file_name` |
| Stderr | `--error=file_name` | `-e file_name` |
| Email notification | `--mail-type=type` | |
| Email address | `--mail-user=address` | |
| Partition | `--partition=partition_name` | `-p partition_name` |
| Tasks | `--ntasks=number` | `-n number` |
| Cores per task | `--cpus-per-task=number` | `-c number` |
| Job array | `--array=indices` | `-a indices` |
| Information | Environment Variable |
|---|---|
| Job ID | `$SLURM_JOB_ID` |
| Job name | `$SLURM_JOB_NAME` |
| Partition name | `$SLURM_JOB_PARTITION` |
| Node list | `$SLURM_NODELIST` |
| Number of tasks | `$SLURM_NTASKS` |
| Number of nodes | `$SLURM_NNODES` |
| Submit directory | `$SLURM_SUBMIT_DIR` |
| Submit host | `$SLURM_SUBMIT_HOST` |
| Array task ID | `$SLURM_ARRAY_TASK_ID` |
| First array task ID | `$SLURM_ARRAY_TASK_MIN` |
| Last array task ID | `$SLURM_ARRAY_TASK_MAX` |
| Array task step size | `$SLURM_ARRAY_TASK_STEP` |
| Action | Command |
|---|---|
| Job submission | `sbatch script_file` |
| Job deletion | `scancel job_id` |
| Job status (by job) | `scontrol show job job_id` |
| Job status (by user) | `squeue -u username` |
| Job hold | `scontrol hold job_id` |
| Job release | `scontrol release job_id` |
| Cluster status | `sinfo` |