Run Jobs

This page gives basic instructions on how to run and monitor jobs on Slurm. Slurm provides excellent man pages for all of its commands, so refer to them if you have further questions.

Set Up

Load the environment module for Slurm to configure your PATH and Slurm-related environment variables.

module load {{ClusterName}}

The modulefile sets environment variables that control the defaults for Slurm commands. These are documented in the man pages for each command. If you don't like the defaults, set the corresponding variables in your environment (for example, in your .bashrc); the modulefile won't change any variables that are already set. Environment variables can always be overridden by command line options.

For example, the SQUEUE_FORMAT2 and SQUEUE_SORT environment variables are set so that the default output format is easier to read and contains useful information that isn't in the default format.
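
To set your own defaults, export the variables in your .bashrc before loading the module. The format string and time limit below are illustrative values, not the ones the modulefile sets:

# Because these are already set, the modulefile will leave them alone.
export SQUEUE_FORMAT2='jobid:10,partition:12,name:24,username:10,statecompact:6,timeused:12,reasonlist:20'
export SBATCH_TIMELIMIT='2:0:0'   # default time limit for sbatch jobs (2 hours)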

Key Slurm Commands

The key Slurm commands are

| Command | Description | Example |
| --- | --- | --- |
| salloc | Create a compute allocation. | salloc -c 1 --mem 1G -C 'spot&GHz:3.1' |
| srun | Run a job within an allocation. | srun --pty /bin/bash |
| sbatch | Submit a batch script. | sbatch -c 1 --mem 1G -C 'spot&GHz:3.1' script |
| squeue | Get job status. | |
| scancel | Cancel a job. | scancel jobid |
| sinfo | Get info about Slurm node status. | sinfo -p all |
| scontrol | View or modify Slurm configuration and state. | scontrol show node nodename |
| sstat | Display various status information about a running job/step. | |
| sshare | Tool for listing fair-share information. | |
| sprio | View the factors that comprise a job's scheduling priority. | |
| sacct | Display accounting data for jobs. | |
| sreport | Generate reports from the Slurm accounting data. | |
| sview | Graphical tool for viewing cluster state. | |

sbatch

The most common options for sbatch are listed here. For more details run man sbatch.

| Options | Description | Default |
| --- | --- | --- |
| -p, --partition=partition-names | Select the partition(s) to run the job on. | Set by slurm.InstanceConfig.DefaultPartition in the config file. |
| -t, --time=time | Set a limit on the total run time of the job. | SBATCH_TIMELIMIT="1:0:0" (1 hour) |
| -c, --cpus-per-task=ncpus | Number of cores. | 1 |
| --mem=size[units] | Amount of memory. Default unit is M; valid units are K, M, G, and T. | SBATCH_MEM_PER_NODE=100M |
| -L, --licenses=license | Licenses used by the job. | |
| -a, --array=indexes | Submit a job array (see the example at the end of this section). | |
| -C, --constraint=list | Features required by the job. Multiple constraints can be combined with AND (&) and OR (\|). | |
| -d, --dependency=dependency-list | Don't start the job until the dependencies have been satisfied. | |
| -D, --chdir=directory | Set the working directory of the job. | |
| --wait | Do not exit until the job finishes. The exit code of sbatch will be the same as the exit code of the job. | |
| --wrap | Wrap shell commands in a batch script. | |

Run a simulation build followed by a regression

# --parsable makes sbatch print just the job ID so it can be used in the dependency below
build_jobid=$(sbatch --parsable -c 4 --mem 4G -L vcs_build -C 'GHz:4|GHz:4.5' -t 30:0 sim-build.sh)
if sbatch -d "afterok:$build_jobid" -c 1 --mem 100M --wait submit-regression.sh; then
    echo "Regression Passed"
else
    echo "Regression Failed"
fi
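
The --array and --wrap options can be combined to run the same command many times without writing a batch script. A minimal sketch, where the index range and the run-test command are placeholders:

# Submit a 10-task job array; each task reads its index from SLURM_ARRAY_TASK_ID.
sbatch -a 0-9 -c 1 --mem 500M --wrap 'run-test --index $SLURM_ARRAY_TASK_ID'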

srun

The srun command is most often used to open a pseudo terminal on a compute node for running interactive jobs. It accepts most of the same options as sbatch to request CPUs, memory, and node features.

To open a pseudo terminal in your shell on a compute node with 4 cores and 8G of memory, execute the following command.

srun -c 4 --mem 8G --pty /bin/bash

This queues a job; when it is allocated to a node and starts running, control returns to your shell, but stdin and stdout are connected to the compute node. If you set your DISPLAY environment variable and allow external X11 connections, you can use this to run interactive GUI jobs on the compute node and have the windows appear on your instance.

xhost +  # allow X11 connections from other hosts
export DISPLAY=$(hostname):$(echo $DISPLAY | cut -d ':' -f 2)  # keep the display number but point clients at this host
srun -c 4 --mem 8G --pty /bin/bash
emacs .  # or whatever GUI application you want to run; a window should open on your instance

Another way to run interactive GUI jobs is to use srun's --x11 flag to enable X11 forwarding.

srun -c 1 --mem 8G --pty --x11 emacs

squeue

The squeue command shows the status of jobs.

The output format can be customized using the --format or --Format options and you can configure the default output format using the corresponding SQUEUE_FORMAT or SQUEUE_FORMAT2 environment variables.

squeue
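
To show a custom set of columns for just your own jobs, pass a field list to --Format (the fields chosen here are only an illustration):

squeue -u $USER --Format=jobid:10,partition:12,name:24,statecompact:6,timeused:12,numcpus:8,reasonlist:20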

sprio

Use sprio to get information about a job's priority. This can be useful to figure out why a job is scheduled before or after another job.

sprio -j10,11

sacct

The sacct command displays accounting information about jobs. For example, it can be used to compare the CPUs and memory a job requested with the CPU time and memory it actually used.

sacct -o JobID,User,JobName,AllocCPUS,State,ExitCode,Elapsed,CPUTime,MaxRSS,MaxVMSize,ReqCPUS,ReqMem,SystemCPU,TotalCPU,UserCPU -j 44

This shows more detail for all users and clusters:

sacct --allclusters --allusers --federation --starttime 1970-01-01 --format 'Submit,Start,End,jobid%15,State%15,user,account,cluster%15,AllocCPUS,AllocNodes,ExitCode,ReqMem,MaxRSS,MaxVMSize,MaxPages,Elapsed,CPUTime,UserCPU,SystemCPU,TotalCPU' | less

For more information:

man sacct

sreport

The sreport command can be used to generate reports from the Slurm accounting database.
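
For example, to report cluster utilization in hours over a date range (the dates below are placeholders):

sreport -t hours cluster utilization start=2024-01-01 end=2024-02-01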

Other Slurm Commands

Run man followed by the command name to get information about these less commonly used Slurm commands.

| Command | Description |
| --- | --- |
| sacctmgr | View/modify Slurm account information. |
| sattach | Attach to a job step. |
| sbcast | Transmit a file to the nodes allocated to a Slurm job. |
| scrontab | Manage Slurm crontab files. |
| sdiag | Diagnostic tool for Slurm. Shows information related to slurmctld execution. |
| seff | Report the CPU and memory efficiency of a completed job. |
| sgather | Transmit a file from the nodes allocated to a Slurm job. |
| sh5util | Tool for merging HDF5 files from the acct_gather_profile plugin that gathers detailed data for jobs. |
| sjobexitmod | Modify the derived exit code of a job. |
| strigger | Set, get, or clear Slurm trigger information. |
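
For example, if seff is installed on your cluster, it takes a job ID (the ID below is a placeholder):

seff 12345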