I decided to write this primer to encourage lab members to use our compute cluster.
We don’t really do distributed computing with tasks that run across
nodes/processes and inter-communicate. Thus, I’ll focus on the case where only
one task is requested (i.e. --ntasks=1).
- A job is a description of what to do; it is composed of one or more steps, which can execute serially or concurrently.
- A task is a unit of work performed by a job step.
- A node is a physical machine (e.g. bart).
- TRES are trackable resources: CPUs, memory, or GRES (generic resources).
--cpus-per-task 4 or --tres-per-task cpu:4
--mem 32G or --tres-per-task mem:32G
--gres scratch:100G or --tres-per-task gres/scratch:100G
--time 1-12:20:15 # days-hours:minutes:seconds
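Putting these flags together, a typical allocation might look like this
(train.py is a stand-in for whatever program you want to run):
srun --cpus-per-task 4 --mem 32G --time 1-12:20:15 python train.py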
srun allows you to run commands via Slurm.
srun echo "Hello world!"
Run your current shell with allocated resources:
srun --pty --mem 8G $SHELL
# a new shell will be created
$ echo "Hello world"
$ exit # or Ctrl-D
The --pty option will create a pseudo-terminal, making it possible to send
signals such as Ctrl-C and to close the session with Ctrl-D.
Reference: https://slurm.schedmd.com/srun.html
A batch script is a special kind of script that is interpreted by sbatch. It is
essentially a shell script with a special syntax for comments, allowing you to
conveniently allocate the resources you need.
In a file named script.sh:
#!/bin/bash
#SBATCH --cpus-per-task 4 --mem 32G
echo "Hello world!"Which can then be launched with:
sbatch script.sh
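sbatch prints the ID of the submitted job, which you can use with the
monitoring commands below (the ID shown here is just an example):
$ sbatch script.sh
Submitted batch job 12345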
A nicety of sbatch is that you can use srun within it to create job steps.
This allows you to create heterogeneous jobs that can request different
amounts of resources at different stages.
#!/bin/bash
srun --cpus-per-task 4 echo "Heavily parallelized Hello world!"
srun --cpus-per-task 1 echo "Regular Hello world!"
Nothing prevents you from running those steps in parallel, or from
interleaving parallel and serial steps.
#!/bin/bash
srun --cpus-per-task 4 echo "Heavily parallelized Hello world!" &
srun --cpus-per-task 1 echo "Regular Hello world!" &
wait # wait until both steps finish
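And a sketch of interleaving: two parallel steps, followed by a serial step
that starts only once both of them are done:
#!/bin/bash
srun --cpus-per-task 4 echo "First parallel Hello world!" &
srun --cpus-per-task 4 echo "Second parallel Hello world!" &
wait # wait until both parallel steps finish
srun --cpus-per-task 1 echo "Serial Hello world!"
Reference: https://slurm.schedmd.com/sbatch.html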
Some workflows require the creation of hundreds if not thousands of similar
jobs. To not overwhelm Slurm -- and also your ability to track those jobs --
you can create a job array with the --array flag.
It accepts a range of integer values (e.g. 0-99) or multiple values separated
by commas (e.g. 1,2,3,5,6). Ranges are inclusive.
In the job, you have access to the $SLURM_ARRAY_TASK_ID variable.
#!/bin/bash
#SBATCH --array 0-99
echo "Hello world from parallel universe #$SLURM_ARRAY_TASK_ID!"
Job arrays only work with sbatch.
Use --output and --error to redirect the output of a Slurm job to a file. You
can include the job ID with %j.
#!/bin/bash
#SBATCH --output output-%j.log --error error-%j.log
echo "Hello world!" # will be written to output-%j.logFor job arrays, use '%A' for the first job ID and '%a' for the task ID.
#!/bin/bash
#SBATCH --array 0-99 --output output-%A-%a.log --error error-%A-%a.log
echo "Hello world!" # will be written to output-%A-%a.logTo get notified by email when a job is completed of fails, add --mail-user
and --mail-type flags to your submission.
sbatch --mail-user=you@example.com --mail-type=END,FAIL script.sh
If you're using a job array, only one mail is sent for the whole job!
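Like any other flag, these can also live in the batch script itself (the
address is a placeholder):
#!/bin/bash
#SBATCH --mail-user=you@example.com --mail-type=END,FAIL
echo "Hello world!"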
Once jobs are submitted to the cluster, you can monitor them with squeue.
squeue --me
The --me flag restricts the output to your own jobs.
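To keep an eye on the queue without re-running the command yourself, you can
wrap it in watch (the 10-second refresh interval is arbitrary):
watch -n 10 squeue --me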
Reference: https://slurm.schedmd.com/squeue.html
To view the output of a running job step in real-time, use sattach. It takes
a job-step ID of the form jobid.stepid:
sattach $jobid.$stepid
Reference: https://slurm.schedmd.com/sattach.html
Use scancel to cancel a job that is either pending or running.
scancel $jobid
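scancel can also filter jobs instead of taking an explicit ID; for example,
this cancels all of your currently pending jobs:
scancel --user=$USER --state=PENDING
Reference: https://slurm.schedmd.com/scancel.html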