Slurm on Fermilab USQCD Clusters

SLURM (Simple Linux Utility for Resource Management) is a powerful open-source, fault-tolerant, highly available, and highly scalable resource manager and job scheduling system, currently developed by SchedMD. Initially developed for large Linux clusters at Lawrence Livermore National Laboratory, SLURM is used extensively on many of the Top 500 supercomputers around the globe.
If you have questions about job dispatch priorities on the Fermilab LQCD clusters, please visit this page or email your question to hpc-admin@fnal.gov.
Slurm Commands

One must log in to the appropriate submit host (see Start Here in the graphics above) in order to run Slurm commands for the relevant accounts and resources. The most commonly used commands are listed below; a short example session follows the list.
- scontrol and squeue: Job control and monitoring.
- sbatch: Batch job submission.
- salloc: Request an interactive job session.
- srun: Launch a job (or job step).
- sinfo: Node information and cluster status.
- sacct: Job and job step accounting data.
- Useful environment variables include $SLURM_NODELIST and $SLURM_JOBID.
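For example, a typical monitoring session on a submit host might look like the following (the user name and time window are illustrative):
- [@lattice ~]$ sinfo
- [@lattice ~]$ squeue -u $USER
- [@lattice ~]$ sacct -u $USER --starttime=2021-01-01 --format=JobID,JobName,Partition,State,Elapsed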
Slurm User Accounts

To check your "default" SLURM account, use the following command.
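One way to do this is with sacctmgr (shown as a sketch; the format fields are optional):
- [@lattice ~]$ sacctmgr show user name=$USER format=user,defaultaccount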
To check all of the SLURM accounts you are associated with, use the following command.
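Again a sacctmgr sketch; the association listing shows every cluster/account pairing for the user:
- [@lattice ~]$ sacctmgr show associations where user=$USER format=cluster,account,user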
NOTE: If you do not specify an account name during your job submission (using --account), the "default" account will be used to track usage.
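For example, to charge a non-default project when submitting a batch job (the account name myproject and the script name are placeholders):
- [@lattice ~]$ sbatch --account=myproject myscript.sh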
Slurm Resource Types

SLURM Partition / queue name (--partition) | Resource Type | Description | Number of resources (--nodes) | Number of tasks per resource (--ntasks-per-node) | GPU resources per node (--gres) | Max nodes per job
lq1csl | CPU | 2.50GHz Intel Xeon Gold 6248 "Cascade Lake", 196GB memory per node (4.9GB/core), EDR Omni-Path | 183 | 40 | n/a | 88
Using SLURM: examples
- Submit an interactive job requesting 12 "pi" nodes
-
- [@lattice:~]$ srun --pty --nodes=12 --ntasks-per-node=16 --partition pi bash
- [user@pi111:~]$ env | grep NTASKS
- SLURM_NTASKS_PER_NODE=16
- SLURM_NTASKS=192
- [user@pi111:~]$ exit
-
- Submit an interactive job requesting two "pigpu" nodes (4 GPUs per node)
-
- [@lattice:~]$ srun --pty --nodes=2 --partition pigpu --gres=gpu:4 bash
- [@pig607:~]$ PBS_NODEFILE=`generate_pbs_nodefile`
- [@pig607:~]$ rgang --rsh=/usr/bin/rsh $PBS_NODEFILE nvidia-smi -L
- pig607=
- GPU 0: Tesla K40m (UUID: GPU-2fe2a84f-3de9-2ca0-60f0-db011d53a20c)
- GPU 1: Tesla K40m (UUID: GPU-9afce23b-cfdf-2318-ed00-2b23c14337f1)
- GPU 2: Tesla K40m (UUID: GPU-782960ea-d854-e6ee-26ce-363a4c9c01e2)
- GPU 3: Tesla K40m (UUID: GPU-ee804701-10ac-919e-ae64-27888dcb4645)
- pig608=
- GPU 0: Tesla K40m (UUID: GPU-b20a4059-56c2-b36a-ba31-1403fa6de2dc)
- GPU 1: Tesla K40m (UUID: GPU-af290605-caeb-50e8-a4ca-fd533098c302)
- GPU 2: Tesla K40m (UUID: GPU-16ab19e4-9835-5eb2-9b8b-1e479753d20b)
- GPU 3: Tesla K40m (UUID: GPU-2b3d082e-3113-617a-dcc6-26eee33e3b2d)
- [@pig607:~]$ exit
-
- Submit a batch job requesting 4 GPUs, i.e., one "pigpu" node
-
- [@lattice ~]$ cat myscript.sh
- #!/bin/sh
- #SBATCH --job-name=test
- #SBATCH --partition=pigpu
- #SBATCH --nodes=1
- #SBATCH --gres=gpu:4
-
- nvidia-smi -L
- sleep 5
- exit
-
- [@lattice ~]$ sbatch myscript.sh
- Submitted batch job 46
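While the job is pending or running, it can be checked with the monitoring commands listed earlier, for example (using the job ID returned by sbatch):
- [@lattice ~]$ squeue -j 46
- [@lattice ~]$ scontrol show job 46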
Once the batch job completes, the output is available as follows:
- [@lattice ~]$ cat slurm-46.out
- GPU 0: Tesla K40m (UUID: GPU-2fe2a84f-3de9-2ca0-60f0-db011d53a20c)
- GPU 1: Tesla K40m (UUID: GPU-9afce23b-cfdf-2318-ed00-2b23c14337f1)
- GPU 2: Tesla K40m (UUID: GPU-782960ea-d854-e6ee-26ce-363a4c9c01e2)
- GPU 3: Tesla K40m (UUID: GPU-ee804701-10ac-919e-ae64-27888dcb4645)
SLURM Reporting

The lquota command, run on lq.fnal.gov, provides allocation usage reporting.
Units: lq1-ch = lq1 core-hour, Sky-ch = Sky core-hour; 1 lq1-ch = 1.05 Sky-ch.
Usage reports are also available on the Allocations page. If you have questions about the reports, or notice discrepancies in the data, please email lqcd-admin@fnal.gov.
SLURM Environment variables

Variable Name | Description | Example Value | PBS/Torque analog
$SLURM_JOB_ID | Job ID | 5741192 | $PBS_JOBID
$SLURM_JOBID | Deprecated; same as $SLURM_JOB_ID | - | -
$SLURM_JOB_NAME | Job name | myjob | $PBS_JOBNAME
$SLURM_SUBMIT_DIR | Submit directory | /project/charmonium | $PBS_O_WORKDIR
$SLURM_JOB_NODELIST | Nodes assigned to job | pi1[01-05] | cat $PBS_NODEFILE
$SLURM_SUBMIT_HOST | Host submitted from | lattice.fnal.gov | $PBS_O_HOST
$SLURM_JOB_NUM_NODES | Number of nodes allocated to job | 2 | $PBS_NUM_NODES
$SLURM_CPUS_ON_NODE | Number of cores per node | 8,3 | $PBS_NUM_PPN
$SLURM_NTASKS | Total number of cores for job | 11 | $PBS_NP
$SLURM_NODEID | Index of the node running on, relative to the nodes assigned to the job | 0 | $PBS_O_NODENUM
$SLURM_LOCALID | Index of the core running on within the node | 4 | $PBS_O_VNODENUM
$SLURM_PROCID | Index of the task relative to the job | 0 | $PBS_O_TASKNUM - 1
$SLURM_ARRAY_TASK_ID | Job array index | 0 | $PBS_ARRAYID
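A quick way to see several of these variables in context is to echo them from a small batch script (the partition and node counts here are placeholders):
- #!/bin/sh
- #SBATCH --job-name=env-check
- #SBATCH --partition=lq1csl
- #SBATCH --nodes=2
- #SBATCH --ntasks-per-node=4
-
- echo "Job ID:     $SLURM_JOB_ID"
- echo "Node list:  $SLURM_JOB_NODELIST"
- echo "Num nodes:  $SLURM_JOB_NUM_NODES"
- echo "Num tasks:  $SLURM_NTASKS"
- echo "Submit dir: $SLURM_SUBMIT_DIR"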
Binding and Distribution of tasks

- There is a good description of MPI process affinity, binding, and srun here.
- Reasonable affinity choices by partition type on the Fermilab LQCD clusters are listed below; an example srun invocation follows the list.
- Intel (lq1): --distribution=cyclic:cyclic --cpu_bind=sockets --mem_bind=no
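As a sketch, these options would be combined with an srun launch as follows (the node and task counts and the application binary my_mpi_app are placeholders):
- [@lattice ~]$ srun --nodes=2 --ntasks-per-node=40 --partition=lq1csl --distribution=cyclic:cyclic --cpu_bind=sockets --mem_bind=no ./my_mpi_app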
Launching MPI processes

Please refer to the following page for recommended MPI launch options.
Additional useful information