Slurm is a very popular soft drink; it is highly addictive. When consumed in large quantities, users may experience radioactive glowing green skin.
Slurm, formerly known as the Simple Linux Utility for Resource Management, is a very powerful job scheduler that enjoys wide popularity within the HPC world. More than 60% of the TOP500 supercomputers use Slurm, and we use it for both the Turing and Wahab clusters.
Slurm, like most job schedulers, brings the following capabilities to the cluster:
Slurm distributes computational tasks across the nodes within a cluster, avoiding the situation where certain nodes are underutilized while others are overloaded.
Slurm lets you submit tasks to the cluster, and starts the computation for you once resources become available.
Slurm monitors job status and node status, and keeps a historical record of both for further study.
A cluster is the collection of resources wired together for the purpose of high-performance computing: computational devices (servers), networking devices (switches), and storage devices.
A node is a single computational device, usually a server.
When you want the scheduler to execute a program on your behalf, the work has to be boxed into an abstraction called a "job".
A partition is a set of compute nodes, grouped logically. We separate our computational resources based on the features of their hardware and the nature of the jobs they run.
For instance, there is a regular compute partition `main` and a GPU-based partition.
It may be confusing, but a "task" in Slurm refers to a processor resource. By default, 1 task uses 1 core; however, this behavior can be altered.
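For instance, in a job script (a minimal sketch; the numbers are arbitrary illustrations, not recommendations):

```bash
#SBATCH --ntasks=4          # request 4 tasks; by default each task gets 1 core
#SBATCH --cpus-per-task=2   # alter the default: each task now gets 2 cores, 8 cores in total
```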
Each node comes with some differences, for instance a different generation of processor, and each such difference is recorded as a feature of that node. You can require your job to execute on a set of machines that share certain features.
You can hand work to Slurm in three forms:

- `sbatch`: a job script with complex instructions
- `salloc`: an interactive shell
- `srun`: a single command
There are two commands in Slurm that are often used to gather information regarding the cluster.
The `sinfo` command gives an overview of the resources offered by the cluster.
The `squeue` command shows jobs currently running and pending on the cluster.
```
$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
main*      up     infinite       2  down*  coreV2-22-[012,026]
main*      up     infinite       1  mix    coreV1-22-016
main*      up     infinite      11  alloc  coreV1-22-[001,004,007,011-013],coreV2-22-[028,030],coreV2-25-[006-008]
main*      up     infinite      18  idle   coreV1-22-[017-024],coreV2-22-[001,005,010,014,016-017],coreV2-25-[011,018-020]
timed-main up     2:00:00        2  down*  coreV2-22-[012,026]
timed-main up     2:00:00        1  mix    coreV1-22-016
timed-main up     2:00:00       11  alloc  coreV1-22-[001,004,007,011-013],coreV2-22-[028,030],coreV2-25-[006-008]
timed-main up     2:00:00       18  idle   coreV1-22-[017-024],coreV2-22-[001,005,010,014,016-017],coreV2-25-[011,018-020]
```
`sinfo` shows the partitions, and which nodes in each partition are available or occupied.

You can try out some variations on `sinfo`; for instance, `sinfo -N -l` shows the same information arranged in a node-oriented fashion:

```
$ sinfo -N -l
Thu May 25 14:36:08 2017
NODELIST       NODES  PARTITION   STATE      CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
coreV1-22-001      1  main*       allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-001      1  timed-main  allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-004      1  main*       allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-004      1  timed-main  allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-007      1  main*       allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-007      1  timed-main  allocated    16  2:8:1  126976         0       1  (null)    none
coreV1-22-011      1  main*       allocated    16  2:8:1  126976         0       1  (null)    none
...
```
In the example below, `squeue` shows two jobs running and one pending, with information such as which partition they run on, the user who submitted each job, and the time and resources consumed:
```
$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   357      main migr0000 XXXXXXXX  R 1-01:27:02      7 coreV1-22-[001,004,007,011-013,016]
   356      main migr0001 XXXXXXXX  R 1-01:28:02      5 coreV2-22-[028,030],coreV2-25-[006-008]
   358      main migr0002 XXXXXXXX PD       0:00      8 (Resources)
```
To see only your own jobs, use `squeue -u <username>`.
Here is an example:
```
$ squeue -u jpratt
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
```

(The list is empty because this user currently has no jobs running or pending.)
It is worth noting that the job ID is a very useful property, since many Slurm commands take it as an argument. For example, to cancel job `migr0000` (job ID 357 in the output above), the user can run the `scancel` command shown below.
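A sketch, using the job ID taken from the `squeue` output above:

```
$ scancel 357
```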
The `squeue` output above also shows that job 358 is in the `PD` (pending) state; the scheduler will start it once the job's resource requirements can be met.
A job submission, consisting of a resource request and a script, instructs the cluster to perform computation. You can list all resource requests at the beginning of a job script, or pass them as command-line arguments. For example, you can submit a job with the script below:
```bash
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=output.txt
#SBATCH --ntasks=1

hostname
```
then use the command:
```
$ sbatch job_script.sh
Submitted batch job 390
```
or you can use the command:
```
$ salloc --ntasks=1 --job-name=test hostname
salloc: Pending job allocation 391
salloc: job 391 queued and waiting for resources
salloc: job 391 has been allocated resources
salloc: Granted job allocation 391
turing2
salloc: Relinquishing job allocation 391
```
In the above example, `--ntasks=1 --job-name=test` is the resource request, and `hostname` is the computational task.
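`srun` works the same way for a single task; a sketch (run from a login node, it executes `hostname` on whichever compute node is allocated, so the output will vary):

```
$ srun --ntasks=1 --job-name=test hostname
```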
Resource requests in Slurm can be very simple, or they can be very specific. Consider the following situations:
A typical MPI job requires a certain number of processes, and each process requires 1 core.
Therefore, the resource request can be as simple as:
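The example below is a minimal sketch; the process count of 16 is just an illustrative choice:

```bash
#SBATCH --ntasks=16
```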
A typical multi-threading job requires 1 process that runs on multiple cores. Therefore, the resource request would look like the example below:
```bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
```
When a hybrid program requires both multiple processes and multiple threads, you can make such a request in Slurm as in the example below: 8 processes will be launched, and each process executes on 8 cores, so 64 cores in total will be allocated to this job.
```bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=8
```
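To actually use those cores, the threading runtime must be told how many cores each process may use. A sketch, assuming an OpenMP-based program (`your_hybrid_application` is a placeholder name):

```bash
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=8

# pass the per-task core count on to the OpenMP runtime
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./your_hybrid_application
```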
Typically, there are two kinds of jobs on a cluster: interactive and non-interactive.
```
sbatch - for a non-interactive job; more frequently used than the others
salloc - for running a single task or launching an interactive shell
srun   - for running a single task
```
You will have to use different commands to submit them, but conveniently, they all accept the same arguments.
```
command structure:

    salloc/srun [options] path-to-your-executable [arguments-for-your-executable]

example:

    salloc/srun --ntasks=1 --job-name=matlab /usr/local/bin/matlab -singleCompThread -nodesktop

options:

    --job-name    job name (optional)
    --ntasks      number of processes requested
```
If you submit the same job often, or the job is relatively complex, you can use a submission script and submit it with the `sbatch` command:
```bash
#!/bin/bash
#SBATCH --ntasks=4

enable_lmod
module load gcc/4
module load openmpi/2.1

srun your_application
```
Now, first look at the lines starting with `#SBATCH`. Each one appends an option to the `sbatch` command, so this script:
```bash
#!/bin/bash
#SBATCH --ntasks=4

/usr/bin/true
```
is the same as the script without the `#SBATCH` line, executed with the command:
```
$ sbatch --ntasks=4 job_script.sh
```
Secondly, look at the `module` command. It changes your environment variables based on pre-defined module files.
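A sketch of the idea (the exact module names and versions vary by cluster):

```
$ module load gcc/4      # adjusts $PATH, $LD_LIBRARY_PATH, etc. for GCC 4
$ module list            # show which modules are currently loaded
```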
We will discuss modules further in the section Dynamic Module Loading.
At the end, give the instructions for executing your code. In this example, that section is the single line `srun your_application`.
Sometimes it is very useful to have information about your Slurm job available while your script runs.
Let's take a look at an example:
```bash
#!/bin/bash -l
#SBATCH --job-name=SBATCH_EXAMPLE
#SBATCH --output=output
#SBATCH --ntasks=64

cat << EOF
this job is called $SLURM_JOB_NAME and its ID is $SLURM_JOB_ID
job $SLURM_JOB_ID has been allocated $SLURM_NTASKS cores across $SLURM_NNODES hosts
job $SLURM_JOB_ID will be running on the following machines:
EOF

echo $SLURM_NODELIST

cat << EOF
the working directory for job $SLURM_JOB_NAME is $SLURM_SUBMIT_DIR
what is inside?
EOF

ls -l "$SLURM_SUBMIT_DIR"
```
If we take a look at the output after running this script, we will see:
```
this job is called SBATCH_EXAMPLE and its ID is 397
job 397 has been allocated 64 cores across 4 hosts
job 397 will be running on the following machines:
coreV2-25-[011,018-020]
the working directory for job SBATCH_EXAMPLE is /home/XXXXXXX/Testing - Slurm
what is inside?
total 480
....
```
These environment variables are very useful and largely self-explanatory. Take a look at the cheat sheet below for more.
For convenience, Slurm supports both a long form and a short form of its options. The list below gives both where available.
| Long Form | Short Form | Description |
|---|---|---|
| `--partition=partition` | `-p partition` | specify which partition the job needs |
| `--job-name=name` | `-J name` | give a name to your job; this will make managing your jobs easier |
| `--output=filename` | `-o filename` | redirect your stdout to a file with the specified filename |
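Putting those options together in a script header, a minimal sketch (the partition is `main` from the examples above; the job and file names are placeholders):

```bash
#SBATCH --partition=main
#SBATCH --job-name=my_job
#SBATCH --output=my_job.out
```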
`--mail-type=events`: Slurm will send you mail when certain events happen. Events are defined as follows:
- `BEGIN`: mail is sent at the beginning of the job
- `END`: mail is sent at the end of the job
- `FAIL`: mail is sent when the job is aborted or rescheduled
`--mail-user=address`: Slurm will send email to the address(es) listed here when the events defined in `--mail-type` occur.
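For example, a sketch (the address is a placeholder):

```bash
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=user@example.com
```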
When you decide that you no longer wish a job to continue executing, you can use the `scancel job_id` command to remove it from the cluster.
You can only remove your own jobs.
| Resource Request | Long Form | Short Form |
|---|---|---|
| Cores per Task | `--cpus-per-task=cores` | `-c cores` |
| Information | Environment Variable |
|---|---|
| Number of Tasks | `$SLURM_NTASKS` |
| Number of Nodes | `$SLURM_NNODES` |
| Task ID in Job Array | `$SLURM_ARRAY_TASK_ID` |
| First Task ID in Job Array | `$SLURM_ARRAY_TASK_MIN` |
| Last Task ID in Job Array | `$SLURM_ARRAY_TASK_MAX` |
| Task Step Size in Job Array | `$SLURM_ARRAY_TASK_STEP` |
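The `SLURM_ARRAY_*` variables apply to job arrays, that is, jobs submitted with the `--array` option. A minimal sketch (the range and step size are arbitrary illustrations):

```bash
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --ntasks=1
#SBATCH --array=1-7:2      # array task IDs 1, 3, 5, 7 (step size 2)

# each array task runs this script with its own SLURM_ARRAY_TASK_ID
echo "array task $SLURM_ARRAY_TASK_ID of IDs $SLURM_ARRAY_TASK_MIN..$SLURM_ARRAY_TASK_MAX (step $SLURM_ARRAY_TASK_STEP)"
```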
| Action | Command |
|---|---|
| Job submission | `sbatch script_file` |
| Job deletion | `scancel job_id` |
| Job status (by job) | `scontrol show job job_id` |
| Job status (by user) | `squeue -u username` |
| Job hold | `scontrol hold job_id` |
| Job release | `scontrol release job_id` |
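For instance, to keep the pending job 358 from the earlier `squeue` example from starting, and later let it be scheduled again:

```
$ scontrol hold 358
$ scontrol release 358
```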