I decided to write this primer to encourage lab members to use our compute cluster.
We don’t really do distributed computing with tasks that run across
nodes/processes and inter-communicate. Thus, I’ll focus on the case where only
one task is requested (i.e. --ntasks=1).
- A job is a description of what to do; it is composed of one or more steps, which can execute serially or concurrently.
- A task is a unit of work performed by a job step.
- A node is a physical machine (e.g. bart).
- TRES are trackable resources: CPUs, memory, or GRES (generic resources).
--cpus-per-task 4 or --tres-per-task cpu:4
--mem 32G or --tres-per-task mem:32G
--gres scratch:100G or --tres-per-task gres/scratch:100G
--time 1-12:20:15 # days-hours:minutes:seconds
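Putting these flags together, a typical allocation might look like this
(train.py is a stand-in for whatever program you want to run):
srun --cpus-per-task 4 --mem 32G --time 1-12:20:15 python train.py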
srun allows you to run commands via Slurm.
srun echo "Hello world!"
Run your current shell with allocated resources:
srun --pty --mem 8G $SHELL
# a new shell will be created
$ echo "Hello world"
$ exit # or Ctrl-D
The --pty option will create a pseudo-terminal, making it possible to send
signals such as Ctrl-C and to close the session with Ctrl-D.
Reference: https://slurm.schedmd.com/srun.html
A batch script is a special kind of script that is interpreted by sbatch. It is
essentially a shell script with a special syntax for comments, allowing you to
conveniently allocate the resources you need.
In a file named script.sh:
#!/bin/bash
#SBATCH --cpus-per-task 4 --mem 32G
echo "Hello world!"Which can then be launched with:
sbatch script.sh
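sbatch prints the ID of the submitted job, which you can use with the
monitoring commands below (the ID shown here is just an example):
$ sbatch script.sh
Submitted batch job 12345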
A nicety of sbatch is that you can use srun within it to create job steps.
This allows you to create heterogeneous jobs that can request different
amounts of resources at different stages.
#!/bin/bash
srun --cpus-per-task 4 echo "Heavily parallelized Hello world!"
srun --cpus-per-task 1 echo "Regular Hello world!"
Nothing prevents you from running those steps in parallel, or from
interleaving parallel and serial steps.
#!/bin/bash
srun --cpus-per-task 4 echo "Heavily parallelized Hello world!" &
srun --cpus-per-task 1 echo "Regular Hello world!" &
wait # wait until both steps finish
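And a sketch of interleaving: two parallel steps, followed by a serial step
that starts only once both of them are done:
#!/bin/bash
srun --cpus-per-task 4 echo "First parallel Hello world!" &
srun --cpus-per-task 4 echo "Second parallel Hello world!" &
wait # wait until both parallel steps finish
srun --cpus-per-task 1 echo "Serial Hello world!"
Reference: https://slurm.schedmd.com/sbatch.html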
Some workflows require the creation of hundreds if not thousands of similar
jobs. To not overwhelm Slurm -- and also your ability to track those jobs --
you can create a job array with the --array flag.
It accepts a range of integer values (e.g. 0-99) or multiple values separated
by commas (e.g. 1,2,3,5,6). Ranges are inclusive.
In the job, you have access to the $SLURM_ARRAY_TASK_ID variable.
#!/bin/bash
#SBATCH --array 0-99
echo "Hello world from parallel universe #$SLURM_ARRAY_TASK_ID!"
Job arrays only work with sbatch.
Use --output and --error to redirect the output of a Slurm job to a file. You
can include the job ID with %j.
#!/bin/bash
#SBATCH --output output-%j.log --error error-%j.log
echo "Hello world!" # will be written to output-%j.logFor job arrays, use '%A' for the first job ID and '%a' for the task ID.
#!/bin/bash
#SBATCH --array 0-99 --output output-%A-%a.log --error error-%A-%a.log
echo "Hello world!" # will be written to output-%A-%a.logTo get notified by email when a job is completed of fails, add --mail-user
and --mail-type flags to your submission.
sbatch --mail-user=you@example.com --mail-type=END,FAIL script.sh
If you're using a job array, only one mail is sent for the whole job!
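Like any other flag, these can also live in the batch script itself (the
address is a placeholder):
#!/bin/bash
#SBATCH --mail-user=you@example.com --mail-type=END,FAIL
echo "Hello world!"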
Once jobs are submitted to the cluster, you can monitor them with squeue.
squeue --me
The --me flag restricts the output to your own jobs.
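To keep an eye on the queue without re-running the command yourself, you can
wrap it in watch (the 10-second refresh interval is arbitrary):
watch -n 10 squeue --me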
Reference: https://slurm.schedmd.com/squeue.html
To view the output of a running job step in real-time, use sattach. It takes
a job-step ID of the form jobid.stepid:
sattach $jobid.$stepid
Reference: https://slurm.schedmd.com/sattach.html
Use scancel to cancel a job that is either pending or running.
scancel $jobid
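scancel can also filter jobs instead of taking an explicit ID; for example,
this cancels all of your currently pending jobs:
scancel --user=$USER --state=PENDING
Reference: https://slurm.schedmd.com/scancel.html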