Architecture of the Slurm Workload Manager
M. Jette, T. Wickberg
1 Introduction
The development of Slurm¹ began at Lawrence Livermore National Laboratory (LLNL) in 2002. It was originally designed as a simple resource manager capable only of allocating whole nodes to jobs, then dispatching and managing those applications on their allocated resources [1]. Slurm relied upon an external scheduler such as Maui [2] to manage queues of work and schedule resources. Slurm has since evolved into a comprehensive workload scheduler capable of managing the most demanding workflows on many of the largest computers in the world. While Slurm has evolved a great deal over its two decades of existence, its original design goals have largely persisted and proven critical to its success.
Open Source: Slurm's source code is distributed under the GNU General
Public License and is freely available on GitHub [3]. This openness has
resulted in contributions from roughly 300 individuals ranging from minor
corrections to the documentation to complex added functionality.
Portability: Slurm is written in the C language with a GNU autoconf configuration engine. Slurm can be thought of as a highly modular and generic kernel with hundreds of plugins available for customization using a building-block approach in order to support a wide variety of hardware types, software environments, and scheduling capabilities. Site-specific plugins and scripts can easily be integrated for even greater customization. This flexibility allows Slurm to operate effectively in virtually any environment.
¹ Slurm was originally an acronym for "Simple Linux Utility for Resource Management", and stylized as "SLURM". The acronym was dropped in 2012 and the preferred capitalization changed to "Slurm".
this specification can include AND, OR, exclusive OR, and count specifications) and preferred (e.g., faster clock speed desired), required licenses, account name, job dependency specifications (to control the order of job execution), Quality Of Service (QOS), relative priority, and queues/partitions to use. The minimum time limit and size specifications are valuable if the user is willing to sacrifice run time and/or resources in order that the job be initiated as soon as possible. Slurm's backfill scheduler will take advantage of this flexibility to allocate such a job with the maximum run time and resources possible within its specified range without delaying the initiation time of higher priority jobs. The ability for a job to explicitly exclude specific nodes from its allocation is valuable if the user has doubts as to the integrity of specific nodes.
A job step in Slurm is a set of parallel tasks, typically an MPI application. A job can initiate an arbitrary number of steps serially and/or in parallel, with Slurm providing the queuing and resource management for those steps within the job's existing resource allocation. Use cases in which jobs execute thousands of steps are not uncommon, particularly when jobs may need to wait lengthy periods of time for their initial resource allocation request to be satisfied. Job step state information maintained by Slurm includes its ID, expressed as a job ID followed by a period and step ID (e.g., "123.45"), name, time limit (maximum), size specification (minimum and/or maximum count of nodes, CPUs, sockets, cores and/or threads), specific node names to include or exclude from its allocation, and node features required in its allocation. Job step management is lighter weight than job management. If currently available resources within a job allocation are insufficient for a step to be initiated, then it is queued until resources are available. Slurm does not support dependencies between job steps; that functionality can be provided by the job script managing the job step workflow if necessary.
A cluster typically consists of a collection of nodes sharing a common network. A Slurm cluster can be on premises, in the cloud, or spread across both. A Slurm federation is a collection of clusters sharing a common configuration database. By default, a job is submitted to the local cluster, and user commands to gather information about jobs and queues report information about the local cluster. However, all clusters in a federation may be configured to operate as a single system from the perspective of the users. Jobs can be submitted to any individual cluster in the federation or any set of clusters (e.g., clusters from the same manufacturer or having the same architecture), and may be eligible for execution on any of the federated systems.
Node configuration includes a wide variety of information, most of which is collected directly from the compute node when Slurm's slurmd daemon is started. Information collected from a compute node and maintained by Slurm includes: count of boards, sockets, cores and threads, a count of CPUs (usually defined as boards × sockets × cores × threads, but may vary depending upon configuration), memory size, and generic resources (GRES) including names, types and counts (used for GPUs, network bandwidth, scratch disk space, etc.). Information not collected from a compute node but maintained by Slurm includes a
After: job can begin execution after the specified job IDs have begun execution
AfterOK: job can begin execution after the specified job IDs have completed successfully (run to completion with an exit code of zero)
AfterNotOK: job can begin execution after the specified job IDs have terminated in some failed state
AfterAny: job can begin execution after the specified job IDs have terminated in any state
AfterCorr: an element of a job array can begin execution after the corresponding element ID in another job array completes
Singleton: the job can begin execution after any previously initiated jobs with the same job name and user ID have completed (i.e., only one job owned by a given user with the same job name can be running at any time)
The system administrator can configure the desired behavior for jobs with dependencies that cannot be satisfied (e.g., a job dependent upon the successful completion of another job, but that job fails). Typically such jobs are configured to be purged.
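To make the semantics of these dependency types concrete, the following is a minimal sketch of how such tests might be evaluated; the job states, structures, and enum names are hypothetical and do not reflect Slurm's internal representation.

    /* Hypothetical sketch of dependency evaluation; not Slurm's internal code. */
    #include <stdbool.h>

    typedef enum { JOB_PENDING, JOB_RUNNING, JOB_COMPLETED, JOB_FAILED } job_state_t;
    typedef enum { DEP_AFTER, DEP_AFTER_OK, DEP_AFTER_NOT_OK, DEP_AFTER_ANY } dep_type_t;

    struct job_info {
        job_state_t state;
        int exit_code;
    };

    /* Return true if a dependency of the given type on "other" is satisfied. */
    static bool dependency_satisfied(dep_type_t type, const struct job_info *other)
    {
        switch (type) {
        case DEP_AFTER:         /* the other job has begun execution */
            return other->state != JOB_PENDING;
        case DEP_AFTER_OK:      /* completed with an exit code of zero */
            return other->state == JOB_COMPLETED && other->exit_code == 0;
        case DEP_AFTER_NOT_OK:  /* terminated in some failed state */
            return other->state == JOB_FAILED ||
                   (other->state == JOB_COMPLETED && other->exit_code != 0);
        case DEP_AFTER_ANY:     /* terminated in any state */
            return other->state == JOB_COMPLETED || other->state == JOB_FAILED;
        }
        return false;
    }

    int main(void)
    {
        struct job_info other = { JOB_COMPLETED, 0 };
        return dependency_satisfied(DEP_AFTER_OK, &other) ? 0 : 1;
    }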
Users can request to be notified by email when their jobs change state. This
can be valuable for batch jobs in environments where long delays are possible.
Job state transitions which can be used to trigger email include: begin, end, fail,
requeue, and invalid dependency detected.
A Slurm account is used to group users into sets, independent of UNIX groups. Accounts are typically organized in a hierarchical fashion (e.g., division, group, project; see Figure 2). A user can have access to multiple accounts, with one designated as the default. Each account can have one or more account coordinators who are able to create sub-accounts, add or remove users from their account, and modify limits and resource apportioning for the users and accounts under their control. The account coordinator may also be able to view accounting information normally hidden from other users, such as a record of jobs executed by other users in the accounts over which they have control.
A Slurm association is a combination of cluster, account, user name, and
(optional) partition name. Each association can have a fair share allocation of
resources and a multitude of limits. It is worth noting that these limits come
in two forms. Many limits apply to individual jobs, such as the maximum time
limit. Other limits apply on an aggregate basis, such as the maximum number
of running jobs for an individual user or all users in some account.
Fig. 2. Example account hierarchy with fair-share allocations: Root (100%); below it A Group (30%), User Alice (10%), C Group (40%), and User Bob (20%); below A Group, User Adam (20%) and User Brenda (10%); below C Group, User Charles (20%), User Debra (15%), and User Edward (5%).

A Quality Of Service (QOS) is used to control a job's limits, priority, and charge multiplier. A QOS may be associated with a partition or independent of partitions and selected on a job by job basis. A QOS not explicitly bound to a partition can be used with any partition. The benefit in associating a QOS with a partition is in making a greater number of limits available than otherwise provided in Slurm's configuration file for a partition. A typical use case is to configure "standby", "normal", and "expedite" QOS on a system. Jobs submitted to any partition with a "standby" QOS would have very low priority and a correspondingly low charge multiplier, say being charged for resource use at 20% of the normal charge. Similarly, jobs submitted to any partition with an "expedite" QOS would be given a very high priority and high charge multiplier, perhaps being charged 5 times the normal rate. Access control lists can be configured to limit which accounts or users can use each QOS. QOS can also be used to define job preemption rules, so the "expedite" QOS might be configured to preempt (terminate running jobs) from the lower priority "standby" QOS. Limits set in a QOS override partition and association limits. So if a user's association has a maximum number of running jobs set to 10, but the "expedite" QOS has a limit of 20, the higher limit will apply to jobs running in the "expedite" QOS. In order to avoid confusion with the multitude of configurable limits, only a subset of limits are typically configured for associations and QOS.
The order of precedence for limits is as follows:
1. Partition QOS
2. Job QOS
3. User association
4. Account associations (ascending the hierarchy)
5. Root/Cluster association (i.e., the top of the account association hierarchy)
6. Partition configuration
If limits are defined at multiple points in this hierarchy, the point in this list where the limit is first defined will be used. Consider the following example:
MaxJobs=20 and MaxSubmitJobs is undefined in the partition QOS
No limits are set in the job QOS
MaxJobs=4 and MaxSubmitJobs=50 in the user association
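Applying the precedence rule stated above to this example, MaxJobs would come from the partition QOS (20) and MaxSubmitJobs from the user association (50), since those are the first points in the list where each limit is defined. A minimal sketch of that resolution follows, using hypothetical structures and ignoring any configuration flags that can alter the ordering.

    /* Sketch: resolve an effective limit by walking the precedence list above.
     * Hypothetical representation; a value of -1 means "limit not set". */
    #include <stdio.h>

    #define UNSET (-1)

    static int first_defined(const int *values, int count)
    {
        for (int i = 0; i < count; i++)
            if (values[i] != UNSET)
                return values[i];
        return UNSET;
    }

    int main(void)
    {
        /* Order: partition QOS, job QOS, user association, account
         * associations, root/cluster association, partition configuration. */
        int max_jobs[]        = { 20,    UNSET, 4,  UNSET, UNSET, UNSET };
        int max_submit_jobs[] = { UNSET, UNSET, 50, UNSET, UNSET, UNSET };

        printf("MaxJobs = %d\n", first_defined(max_jobs, 6));              /* 20 */
        printf("MaxSubmitJobs = %d\n", first_defined(max_submit_jobs, 6)); /* 50 */
        return 0;
    }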
3 Daemons
[Figure: a slurmctld daemon for Cluster 1 and another for Cluster 2, both communicating with a shared slurmdbd database daemon.]
4 Plugin Infrastructure
Slurm's extensive use of plugins is particularly valuable in providing portability and flexibility. Roughly 65% of Slurm's code is within its kernel, including the primary data structures, system daemons, and user commands. Over 100 plugins form the remainder of the code to support a wide range of differing hardware types, software environments, and scheduling capabilities. The configured plugins are loaded when a daemon or command is started and persist throughout its lifetime. Some plugins are called from multiple daemons and commands. In some cases plugins are also called by other plugins, such as the network topology plugin being called by the scheduling plugin. The plugin infrastructure provides a level of indirection to some configurable underlying functions. One example of the value in plugins was development work performed by Hewlett-Packard (HP) in 2005 [14]. Slurm's original implementation only supported the allocation of whole compute nodes to jobs, implemented through the select/linear scheduling plugin. HP added the ability to allocate resources on a node down to the core level with a new select/cons_res plugin. Later development introduced a new select/cons_tres plugin, now the default, which extended resource scheduling support to GPUs within the nodes as well. Roughly 80% of the changes to Slurm for this enhancement were in the form of a new job scheduling plugin, with the remaining changes in the kernel, much of that in the form of data structure changes. Given the number of plugins available, only a few will be described here.
Slurm's topology plugin is used to gather network topology and use that information to optimize resource allocations with respect to communication bandwidth. The topology plugins developed to date include 3-dimensional torus, 4-dimensional torus, hypercube, dragonfly, and tree. Slurm's GUI, sview, displays the nodes allocated to jobs, partitions, advanced reservations, etc., so that one can readily observe the network topology they utilize, as shown in Figure 4.
Slurm's job submit plugin is called from the slurmctld daemon. It is executed for each job submit or job modify RPC. An arbitrary number of job submit plugins may be used with a configurable call sequence. Each of these plugins can modify the arguments passed to slurmctld and return error messages as appropriate to the user. Some of the job submit plugins packaged with Slurm include: throttle (limits the rate at which a user can submit jobs, sleeping as needed to decrease job submission rates for individual users), require_time_limit (rejects jobs without an explicit time limit specification), pbs (adds PBS [15] environment variables for newly submitted jobs and supports the "before" job dependency), cray (sets Cray-specific generic resource parameters), and Lua (executes a customer-provided Lua script with almost limitless flexibility).
Four plugins are available to gather energy consumption from a node, including IPMI, RAPL, and Cray. Should some new mechanism become available
to gather a node's energy consumption data, one would need to develop a new
Slurm plugin to gather that information and present it to the Slurm kernel in
the appropriate format.
SPANK (Slurm Plugin Architecture for Node and job [K]ontrol) is a generic plugin mechanism. The plugins are written in C, but without requiring access to the Slurm source code. The plugins are executed by the Slurm daemons and the Slurm commands used for job submission. SPANK plugins can be used to add new site-specific job options, including making information about those options visible in the command's help messages. One example of SPANK use was the initial integration of Singularity containers with Slurm [16]. This plugin added new options to the Slurm job submission commands: --singularity-container, --singularity-bind, and --singularity-args. It also added support for Singularity-specific environment variables. The slurmstepd job step management process made use of this newly added information to initiate the application in an appropriate container environment.
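As an illustration of this mechanism (a hedged sketch, not the Singularity plugin itself), a minimal SPANK plugin that registers one hypothetical site-specific option might look as follows; the authoritative interface is the spank.h header installed with Slurm.

    /* Minimal SPANK plugin sketch registering a hypothetical --site-example
     * option; compiled against the installed <slurm/spank.h> only, with no
     * access to the Slurm source tree required. */
    #include <slurm/spank.h>

    SPANK_PLUGIN(site_example, 1);

    static int _opt_cb(int val, const char *optarg, int remote)
    {
        /* Record or act upon the option value here. */
        slurm_info("site_example: option value = %s", optarg ? optarg : "(none)");
        return ESPANK_SUCCESS;
    }

    static struct spank_option site_opt = {
        "site-example",                        /* shown as --site-example */
        "[value]",                             /* argument description */
        "Hypothetical site-specific option.",  /* appears in --help output */
        1,                                     /* option takes an argument */
        0,                                     /* value passed to callback */
        (spank_opt_cb_f) _opt_cb
    };

    /* Called by the Slurm daemons and submission commands when loaded. */
    int slurm_spank_init(spank_t sp, int ac, char **av)
    {
        return spank_option_register(sp, &site_opt);
    }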
The most common integration point remains the use of site-specific scripts in the Slurm prolog and epilog interfaces. These are executed in various places (the submit host, the head node by slurmctld, or the compute node by slurmd or slurmstepd), at various times (e.g., job allocation, step startup, and task launch), and by various users (SlurmUser, root, or the job user). Typical use cases include establishing the environment for the job (boot nodes, node health check, configure temporary storage, etc.) or cleaning up at job completion (deleting temporary files).
5 Configuration
Slurm requires at least one configuration file, although some plugins require their own configuration file. These files can either be in a location readable by all daemons and user commands, or the files can be replicated on every node. Alternatively, the files can be placed on the nodes where the slurmctld daemons execute (primary daemon plus backups), and the slurmctld daemon will make that configuration information available to the other daemons and commands upon request with a newer optional feature referred to as "configless" support.
A system administrator may find it difficult to upgrade the Slurm installation
simultaneously across all of the enterprise. In order to support rolling upgrades,
every daemon and command is able to support RPCs for three major releases,
which includes its release plus the previous two major releases of Slurm. Since
major releases are currently scheduled every 9 months, that supports a relaxed
upgrade schedule. Changes to RPCs are limited to major releases, so upgrades
between maintenance releases can be performed on a node-by-node basis.
6 Communications
Slurm uses a fault-tolerant hierarchical communication mechanism with configurable fanout for communications to the compute nodes. This offloads as much work as possible from the slurmctld daemon, which typically has a multitude of active threads. This also minimizes the wall time required for operations involving a large number of nodes, such as application launch and file transfer. It is typically recommended to configure a fanout value so that no more than a five-level communication tree is needed to reach all compute nodes. For
example, if the slurmctld daemon needs to kill a job running on 1110 compute nodes and Slurm's fanout is configured at 10, the slurmctld daemon will take the list of 1110 compute nodes and divide it into 10 sets of 111 nodes each. The slurmctld daemon will then launch 10 threads, each communicating with a single slurmd daemon, notifying it of the job to be killed along with a list of the additional 110 compute nodes that slurmd should forward the request to. Each of these 10 slurmd daemons launches 10 threads to communicate with additional slurmd daemons on other compute nodes. The process continues until every slurmd daemon involved in the operation is reached. In this case, the process requires three levels in the communication tree. Note the communication hierarchy is created as needed and destroyed upon completion of the communications. Multiple communication hierarchies may be active at any time using different or even overlapping sets of slurmd daemons. See the example in Figure 5.
[Figure 5: communication hierarchy rooted at the slurmctld daemon with messages fanning out to slurmd daemons.]
The srun command transmits an application launch request to a set of slurmd daemons using the same fanout logic as for messages originating from the slurmctld daemon.
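To make the arithmetic concrete, the sketch below (a simplified model, not Slurm's routing code) computes how many tree levels are needed to reach a given number of compute nodes with a given fanout.

    /* Sketch: number of communication-tree levels needed to reach "nnodes"
     * compute nodes with a given fanout (simplified model of the hierarchy). */
    #include <stdio.h>

    static int tree_levels(int nnodes, int fanout)
    {
        int levels = 0;
        long reached = 0, frontier = 1;   /* slurmctld is the root */
        while (reached < nnodes) {
            frontier *= fanout;           /* each daemon contacts "fanout" more */
            reached += frontier;
            levels++;
        }
        return levels;
    }

    int main(void)
    {
        /* 1110 nodes with fanout 10: 10 + 100 + 1000 >= 1110 -> 3 levels. */
        printf("%d\n", tree_levels(1110, 10));
        return 0;
    }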
7 Job Priority
Slurm assigns a priority to each job based on a multitude of factors including age, fair share, queue/partition, QOS, size, nice value, association, and a site-managed value. The weight of each factor in determining a job's priority is configurable, so that a job's priority may be based 60% on age, 30% on fair share, etc. The age component of job priority is based upon the time when the job first becomes eligible for execution, after dependencies are satisfied, rather than its submission time. Its value is proportional to that wait time, with a configurable maximum time (i.e., the value could be configured to stop increasing once a job is 7 days old). Fair share is a measure of how over- or under-serviced an association is relative to its resource allocation. The window of time used in this calculation is either fixed, with usage data cleared periodically (i.e., at the end of each week, month, quarter, year, etc.), or historic resource usage data is continuously decreased on an exponential basis through time. Different algorithms and parameters are available to control how fair share is computed based upon the association tree, and a plugin interface called site_factor is provided should an administrator wish to develop their own novel prioritization approach. For example, should an individual user be allowed to consume all resources allocated to his group if no other users in that group are active, or should some portion of that group's resources be retained for when other users in that group become active, and if so, how much [17]? Job size requirements can be used to consider a job's resource requirements in computing its priority. Size in this context is configurable to consider a variety of resources with different weights for each resource (e.g., one GPU might be given the same weight as 1 TB of memory in computing a job's size component of priority). A system administrator may want to increase the scheduling priority of jobs with large CPU, memory, or license requirements. This can also be reversed to favor jobs with low resource requirements. A user may specify a job's nice value to establish their relative scheduling priority in Slurm; this works similarly to a process's Linux nice value, although Slurm's nice value range is much larger, being a signed 32-bit value. If necessary, a system administrator may also explicitly set a job's scheduling priority in order to override the default calculated value. This is typically done to force a job to have the highest priority and ensure it will begin execution as soon as possible.
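A greatly simplified sketch of this weighted combination is shown below; the structure, weights, and normalization of each factor to the range [0, 1] are assumptions for illustration and not the multifactor priority plugin's actual implementation.

    /* Simplified sketch of a multifactor job priority calculation.
     * Each factor is assumed normalized to [0.0, 1.0]; the weights are the
     * configurable knobs (e.g., 60% age, 30% fair share in the text above). */
    #include <stdint.h>
    #include <stdio.h>

    struct priority_factors {
        double age;        /* time eligible, capped at a configurable maximum */
        double fairshare;  /* under- vs. over-serviced association */
        double partition;
        double qos;
        double job_size;
        int32_t nice;      /* user-requested adjustment, subtracted at the end */
        int32_t site;      /* value from a site_factor-style plugin */
    };

    static int64_t job_priority(const struct priority_factors *f)
    {
        int64_t prio = 0;
        prio += (int64_t)(600000.0 * f->age);        /* weight: 60% */
        prio += (int64_t)(300000.0 * f->fairshare);  /* weight: 30% */
        prio += (int64_t)( 50000.0 * f->partition);
        prio += (int64_t)( 30000.0 * f->qos);
        prio += (int64_t)( 20000.0 * f->job_size);
        prio += f->site;
        prio -= f->nice;
        return prio > 0 ? prio : 0;
    }

    int main(void)
    {
        struct priority_factors f = { 0.5, 0.8, 1.0, 0.0, 0.1, 0, 0 };
        printf("priority = %lld\n", (long long) job_priority(&f));
        return 0;
    }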
8 Typical Configurations
Each site and each cluster have unique configurations and scheduling considerations. Before considering Slurm's scheduling algorithms, some typical configurations and their scheduling requirements are described below.
Roughly half of the clusters that we work with are homogeneous: every compute node has the same processors, memory size, GPUs, etc. Homogeneous clusters may be configured with a single partition for the most efficient use of resources. While every job may be in a single queue/partition, a variety of limits are typically used to prevent any single user or group of users from being allocated more resources than desired. Depending on the configuration and use case, it is not uncommon to see dozens or even hundreds of the highest priority jobs have their resource allocation deferred by one or more limits (e.g., maximum running job count by user, maximum allocated GPU count by account). In addition to the compute nodes, global resources such as licenses and burst buffer space must be managed.
The other half of the clusters we work with are heterogeneous. Such clusters may include a small number of unique compute nodes, say with GPUs or a larger memory size. Other clusters may include a dozen or more unique compute node configurations including different processor types and clock speeds. Such clusters are typically assembled over time with different organizations contributing hardware best suited for their workload and budget. For example, the physics department at a university may purchase two racks of nodes with large memory size, the chemistry department another rack of nodes with GPUs, etc. We refer to each set of resources as a "condo" or "condominium", and they are interconnected into a single cluster sharing a high speed network. Typically each condo can be accessed from two or more partitions. One partition will provide priority access to the resources with an access control list identifying the organization financing those resources (e.g., the "physics" partition will have an access control list containing the faculty and students in the physics department). A second partition might span all compute nodes in the cluster with lower priority access for any user, typically with lower size and/or time limits. Slurm allows jobs to be submitted to multiple partitions simultaneously to take advantage of this configuration. The node "feature" parameter can be used to prevent job allocations from spanning different processor types, as shown in Figure 6.
Job throughput rate requirements also vary widely. Some workloads consist primarily of jobs that execute for days, in which case expending considerable time to optimize scheduling may be warranted. Other workloads consist primarily of jobs that execute for a few seconds, and a throughput rate of hundreds of jobs per second may be required [20]. Slurm can support both workloads, but with different algorithms and scheduling parameters.
9 Scheduling Algorithm
Slurm performs a quick and simple (first-in first-out, FIFO) scheduling attempt on an event-driven basis (with a configurable minimum time interval between executions): upon each job submission, job completion, or configuration change. Only the top priority jobs (a configurable count) in each partition will be evaluated for initiation, and this can be useful for high throughput computing. Given the appropriate configuration and hardware, Slurm can sustain a throughput rate exceeding 100 jobs per second. Since this scheduling algorithm is FIFO, once any job in a partition is found unable to be initiated, all lower priority jobs in that partition will be left pending without further consideration.

Fig. 6. Example heterogeneous cluster with a job submission request for any two nodes with the same architecture. Plugins can be used to set default partitions and node features as appropriate.
Slurm also performs a more comprehensive FIFO scheduling attempt on a periodic basis with a configurable interval. This algorithm typically executes far less frequently than the event-driven scheduling algorithm, but uses the same logic with different configuration parameters. The default configuration parameters will typically enable this algorithm to evaluate jobs in every partition from the highest priority until a job unable to be initiated is identified.
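The per-partition FIFO pass can be sketched as follows; the job structure and the resources_available()/start_job() helpers are hypothetical stand-ins for Slurm's internal logic.

    #include <stdbool.h>
    #include <stddef.h>

    struct job { int id; bool can_start; };

    /* Stand-ins for Slurm's internal checks (assumptions for this sketch). */
    static bool resources_available(const struct job *j) { return j->can_start; }
    static void start_job(struct job *j) { (void) j; /* allocate and launch */ }

    /* One partition's pending queue, already sorted by decreasing priority.
     * Consider only the top "depth" jobs and stop at the first that cannot
     * start, leaving all lower priority jobs pending. */
    static void fifo_pass(struct job **queue, size_t njobs, size_t depth)
    {
        size_t limit = njobs < depth ? njobs : depth;
        for (size_t i = 0; i < limit; i++) {
            if (!resources_available(queue[i]))
                break;
            start_job(queue[i]);
        }
    }

    int main(void)
    {
        struct job a = { 1, true }, b = { 2, false }, c = { 3, true };
        struct job *queue[] = { &a, &b, &c };
        fifo_pass(queue, 3, 100);   /* starts job 1, stops at job 2 */
        return 0;
    }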
Except for some high throughput computing configurations, the backfill scheduling plugin is used to initiate most jobs. Without the backfill plugin, jobs in each queue are scheduled strictly in priority order as described above. The backfill scheduling plugin will initiate lower priority jobs only if doing so does not delay the expected start time of any higher priority job, although reservation of resources for higher priority jobs can be limited to jobs which have been pending for over some configurable period of time or the highest priority jobs in each partition within a configurable queue depth. The expected start time of pending jobs is dependent on the expected completion time of running jobs, based solely upon the job's time limit. Approximately 20 configuration parameters are available for tuning backfill scheduling, such as: how far in the future to consider, what is the time resolution for scheduling, how many jobs to consider for each user, how many jobs to consider from each partition, etc. The backfill scheduler builds a table of expected resource availability through time, tracing the expected initiation and termination time of running and pending jobs. All resource limits are enforced by the backfill scheduler as it builds this table. For example, sufficient compute resources may be available to initiate the highest priority job in some partition one hour in the future, but job initiation prevented by the maximum number of CPU hours of running jobs for the job's association. In this case a
lower priority job may be initiated. Since a cluster may have thousands of executing jobs and tens of thousands of pending jobs, backfill scheduling overhead is kept more manageable by the time resolution configuration parameter cited above. Say the time resolution is configured to 300 seconds. The resources that become available in any 300-second interval are recorded in a single record rather than one record per job completion, which might number in the tens or even hundreds of jobs for any 300-second interval. For example, consider a job expected to end in 600 seconds and another job expected to end in 610 seconds. Rather than creating records of expected system state at those two times and determining which pending jobs can start at each of those times, the records are combined and pending jobs will only be evaluated for initiation at that one time. While this does result in some loss of precision when evaluating when pending jobs are expected to start, that is likely insignificant compared to the inaccuracy in job time limits. The benefit is that the computational overhead of the backfill scheduler can be dramatically reduced. The backfill scheduler determines the reason each pending job is currently unable to be initiated (e.g., some specific limit, dependency, waiting for resources, held by administrator) and its expected start time (i.e., when compute resources are available and no limits exceeded). This information is made available to users and is regularly consulted. Since the backfill algorithm can require multiple minutes to complete with large workloads, it is performed in a piecemeal manner. It acquires the appropriate locks, executes for a configurable time interval (typically a couple of seconds), releases locks for a configurable time interval (typically less than one second) in order to perform other outstanding operations (e.g., accept newly submitted jobs, process newly completed jobs, provide users with status information, etc.), and repeats the process until all pending jobs have been considered.
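The time-resolution optimization amounts to grouping expected end times into common buckets before building the availability table. A minimal sketch follows; the exact rounding rule Slurm applies is a configuration detail not reproduced here.

    /* Sketch: group expected job end times into buckets of "resolution" seconds
     * so jobs ending close together share one record in the availability table. */
    #include <stdio.h>

    static long bucket(long seconds_from_now, long resolution)
    {
        return (seconds_from_now / resolution) * resolution;
    }

    int main(void)
    {
        long res = 300;
        /* Jobs expected to end 600 and 610 seconds from now fall in the same
         * bucket, so pending jobs are evaluated against one combined record. */
        printf("%ld %ld\n", bucket(600, res), bucket(610, res));   /* 600 600 */
        return 0;
    }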
Slurm supports burst buffers via a plugin mechanism. Slurm will allocate burst buffer space for a job when it approaches its expected initiation time and stage-in any required data. The job will not be allocated compute resources until data stage-in has completed. When the job's computation has completed, data will be staged-out and the burst buffer space released. The burst buffers supported by Slurm plugins include Cray's DataWarp [21] and a generic Lua script based plugin.
Slurm has the ability to support a cluster that grows and shrinks on demand, typically relying upon a service such as Amazon Elastic Compute Cloud (Amazon EC2), Google Cloud Platform, or Microsoft Azure for resources. These resources can be combined with an existing cluster to process excess workload (cloud bursting), or they can operate as an independent self-contained cluster. Good responsiveness and throughput can be achieved while only paying for the resources needed.
Slurm has dozens of scheduling parameters available to control the number of jobs considered for scheduling in each partition, maximum scheduling frequency, etc. [18]. Slurm also maintains detailed information about scheduling performance and makes that information available to system administrators for tuning purposes [19]. Detailed information about anticipated resource availability in the future and expected resource allocations and initiation times of pending jobs is also available.
10 License Scheduling
In addition to node-centric resources, Slurm supports scheduling for licenses.
Licenses can be used to represent any resource available on a global basis such
as network bandwidth or global scratch disk space; although, as implied by the
name, they are most commonly used to track software licenses.
Licenses are requested as part of the job submission, and will be allocated
to the job alongside the compute resources. Backfill scheduling management for these licenses is a recent optional addition, and can be enabled by the administrator. Preemption support, in which lower priority jobs can be preempted to free up sufficient licenses for higher priority ones, has also been recently added.
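Conceptually, a license is a counted, cluster-wide resource. The sketch below shows the basic availability test using hypothetical structures, ignoring backfill reservations and preemption.

    /* Sketch: a license as a counted, cluster-wide resource. */
    #include <stdbool.h>

    struct license {
        const char *name;
        unsigned int total;    /* configured count, e.g. purchased seats */
        unsigned int in_use;   /* currently allocated to running jobs */
    };

    /* Can a job requesting "want" licenses be allocated them now? */
    static bool licenses_available(const struct license *lic, unsigned int want)
    {
        return lic->total - lic->in_use >= want;
    }

    int main(void)
    {
        struct license lic = { "example_license", 10, 7 };
        return licenses_available(&lic, 2) ? 0 : 1;   /* 3 free >= 2 -> yes */
    }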
11 Application Layout
The job has control over its resource allocation with respect to sockets, cores, and threads using options at job submit time, for example threads per core, cores per socket, and sockets per node. Similarly, the job step allocation can specify the number of tasks to launch per node, socket, and/or core. The job step has complete control over how tasks are distributed over the allocated resources by specifying layout patterns across nodes, sockets, and/or cores. Binding tasks to allocated resources can be performed using CPU affinity and/or Linux cgroups [11]. Linux cgroups are essential for configurations where more than one job can be active at the same time on a compute node. Besides limiting each job to its allocated CPUs, cgroups can also ensure that each job is constrained to its allocated memory and does not interfere with another job's memory allocation. Linux cgroups can constrain each job's RAM, kernel memory, swap space, and allocated generic resources such as GPUs. There are a variety of options available to control how each task/rank of the application is bound to the job step's allocated CPUs. Cgroups are also used to collect usage data for allocated resources.
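Two common layout patterns, block and cyclic distribution of tasks across allocated nodes, can be sketched as follows; this is a simplified illustration rather than Slurm's task layout code.

    /* Sketch: block vs. cyclic distribution of "ntasks" ranks across "nnodes"
     * allocated nodes (two of the layout patterns a job step can request). */
    #include <stdio.h>

    static void distribute(int ntasks, int nnodes, int cyclic)
    {
        for (int task = 0; task < ntasks; task++) {
            int node = cyclic ? task % nnodes                            /* round robin */
                              : task / ((ntasks + nnodes - 1) / nnodes); /* fill a node */
            printf("task %d -> node %d\n", task, node);
        }
    }

    int main(void)
    {
        distribute(8, 2, 0);   /* block:  tasks 0-3 on node 0, tasks 4-7 on node 1 */
        distribute(8, 2, 1);   /* cyclic: even tasks on node 0, odd tasks on node 1 */
        return 0;
    }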
Some topology plugins support the ability for a job to specify the maximum
number of leaf switches desired in its resource allocation and the maximum time
to wait for such an allocation (e.g., wait up to an extra 10 minutes for my job's
allocated nodes to be on one leaf switch). The system administrator can configure
the maximum wait time for any job to secure its desired leaf count in order to
limit resources idled for job layout optimization.
Jobs can increase or decrease their size per administrative controls. In the
case of increasing the size of a job, the user must submit a new job to acquire the
additional resources desired. Once this second job allocation has been made, the
user merges the two job allocations into a single job, a process which generates a
shell script the user executes in the original job in order to modify its environment
variables as appropriate.
12 Job Profiling
Slurm has the ability to collect detailed performance information about a job step on a periodic basis. This is more information than can reasonably be recorded in Slurm's database for every job, and may involve some additional overhead to collect; therefore, it is only collected when requested by the user. The user specifies what types of information are to be collected and at what frequencies (an independent frequency can be configured for each data type). The types of information available for collection include power consumption, file system usage, network interconnect usage, CPU and memory usage, and GPU utilization. At application termination the data is collected and stored into a single HDF5 [22] dataset. We recommend the HDFView [23] tool to graphically view the resulting data, which can easily identify problems such as spikes in memory usage, with the timing and offending task ID (rank) identified.
13 Compute Node Management
Alongside traditional workload management capabilities, additional integrations
have been developed to solve common HPC systems administration tasks.
Slurm has the ability to limit the power consumption of a cluster [24]. It does
this by monitoring each node's power consumption and periodically adjusting
its available power. Nodes consuming less than their available power will have
their power availability reduced and that power will be redistributed to the other
nodes in the cluster. Special mechanisms exist to manage power limits uniformly
across all nodes allocated to a job as well as job startup and termination.³
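The periodic rebalancing described above might be sketched as follows; the structures and margin parameter are hypothetical, and Slurm's actual power management logic is considerably more involved.

    /* Sketch: reclaim unused watts from nodes consuming well below their cap
     * and redistribute them evenly to the remaining nodes. Greatly simplified. */
    #include <stdbool.h>
    #include <stddef.h>

    struct node_power {
        double cap;        /* watts currently made available to the node */
        double consumed;   /* watts measured during the last interval */
    };

    static void rebalance(struct node_power *nodes, size_t n, double margin)
    {
        double reclaimed = 0.0;
        size_t nbusy = 0;

        if (n == 0)
            return;
        bool busy[n];                       /* C99 variable-length array */

        for (size_t i = 0; i < n; i++) {
            double slack = nodes[i].cap - nodes[i].consumed;
            busy[i] = (slack <= margin);
            if (busy[i]) {
                nbusy++;
            } else {                        /* under-consuming: shrink its cap */
                nodes[i].cap = nodes[i].consumed + margin;
                reclaimed += slack - margin;
            }
        }
        if (nbusy == 0)
            return;
        for (size_t i = 0; i < n; i++)      /* hand reclaimed power to busy nodes */
            if (busy[i])
                nodes[i].cap += reclaimed / nbusy;
    }

    int main(void)
    {
        struct node_power nodes[] = { { 500, 200 }, { 500, 490 } };
        rebalance(nodes, 2, 10.0);          /* node 0 gives up 290 W to node 1 */
        return 0;
    }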
Container support for OCI images [25,26] allows for the launch of compatible containers without any external tooling on the compute nodes. Management of these container images is a direct extension of existing support for Linux subsystems such as cgroups and filesystem namespaces, and required only minimal changes to the slurmd daemon. A newly added command, scrun, allows for use of Slurm as the underpinning for common tools such as Docker [27] and Podman [28].
pam_slurm_adopt is a PAM [29] module that intercepts user SSH connections and confines them to the resources that were allocated to their job. For connections initiated from other compute nodes (common with SSH-based MPI launchers), it will interrogate the originating compute node to determine which job the network connection originated from, and can perfectly match that to the allocated resources on that node. (If the job has not been allocated resources, the connection is usually denied, depending on configuration.) For connections initiated from login nodes or other external machines, the connection will usually be permitted under resources allocated to the first job running under that user account. Besides confining these processes to resources already allocated to the job, these processes have an accounting record created for them, which details their resource usage.

³ This capability was successfully used at King Abdullah University of Science and Technology (KAUST) for a period when power availability was limited. We are unaware of any other organization currently using this capability.
nss_slurm is an NSS [30,31] module that allows Slurm to centrally propagate the user and group information corresponding to each job's owner. This mitigates issues when compute nodes, which are usually connected to LDAP, all simultaneously try to resolve user and group information when the job is initiated. In addition, certain cluster deployment approaches can preclude the need to either synchronize /etc/passwd and /etc/group files to the compute nodes, or connect the compute node to LDAP or NIS.
The sbcast command, and the associated --bcast option to the srun command, allow Slurm to distribute files through its built-in hierarchical tree communication system. Optional compression can further improve performance. This can be used to avoid performance issues from large-scale job launches on common network filesystems by moving executables into local scratch (possibly tmpfs) space. An optional mode, enabled through the --send-libs argument, allows dynamic libraries to be identified and transmitted alongside the executable image, further improving performance for large-scale job launches.
The scrontab command allows users to register periodic compute jobs with a crontab-compatible syntax. This is designed to mitigate a common HPC user request for cron access on login nodes, and instead launches compute jobs at designated intervals (while ensuring that only one copy of the process / Slurm job is launched concurrently, a feature that cron itself lacks), avoiding reliance on specific login nodes remaining online continuously.
14 Conclusion
This paper presents an overview of Slurm's design and functionality. Slurm provides an open source tool that can effectively manage a wide variety of workloads on computers of any size. It is also highly modular in order to provide excellent flexibility and extensibility. Motivated researchers can experiment with alternative scheduling algorithms, network topologies, etc. by developing their own plugin while leveraging Slurm's extensive and stable framework.

Slurm continues to evolve with the help of numerous dedicated engineers. Areas under current development include improved integration with external orchestration systems such as Kubernetes, full adoption of an internal extensible HMAC authentication scheme [32], further improvements to the "cloud bursting" modes of operation, performance improvements in job throughput and user command interaction, and integration with external interfaces such as PMIx. Future directions, subject to community interest, may also involve improved support for energy-centric computing, ways to improve job step performance and workflow capabilities by divorcing responsibility from the central slurmctld process, and refactoring the core scheduling system to allow for different compute node hierarchies than the traditional board/socket/core/CPU model that is embedded in the existing scheduler subsystems.
References
1. Jette M., Yoo A., and Grondona M.: SLURM: Simple Linux Utility for Resource
Management. In: Feitelson D., Rudolph, L., Schwiegelshohn, U. (eds.) Proceedings
of the 9th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP),
LNCS, vol. 2862, pp 44-62, Springer-Verlag (2003).
2. Jackson D., Snell Q., and Clement M.: Core Algorithms of the Maui Scheduler.
In: Feitelson D. and Rudolph, L. (eds.) Proceedings of the 7th Workshop on Job
Scheduling Strategies for Parallel Processing (JSSPP), LNCS, vol. 2221, pp 88-102,
Springer-Verlag (2001).
3. Slurm code repository, https://github.com/SchedMD/slurm.git. Last accessed 3 Feb
2023.
4. Frontier User Guide, https://docs.olcf.ornl.gov/systems/frontier_user_guide.html.
Last accessed 3 Feb 2023.
5. MUNGE home page, https://dun.github.io/munge/. Last accessed 26 Apr 2023.
6. JWT home page, https://jwt.io/. Last accessed 1 May 2023.
7. Quadrics in Linux Clusters presentation, https://hsi.web.cern.ch/HNF-Europe/sem3_2001/hnf.pdf. Last accessed 3 Feb 2023.
8. Slurm Documentation, https://slurm.schedmd.com/. Last accessed 4 Feb 2023.
9. Pritchard H., Roweth D., Henseler D., and Cassella P.: Leveraging the Cray Linux
Environment Core Specialization Feature to Realize MPI Asynchronous Progress
on Cray XE Systems. In Proceedings of the Cray User Group (2012).
10. Jette M.: Expanding symmetric multiprocessor capability through gang scheduling.
In: Feitelson D. and Rudolph, L. (eds.) Proceedings of the 4th Workshop on Job
Scheduling Strategies for Parallel Processing (JSSPP), LNCS, vol. 1459, pp 199-216,
Springer-Verlag (1998).
11. Ondrejka P., Majorsinova E., Prpic M., Landmann R., Silas D.: Resource Management Guide, https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/index. Last accessed 23 Mar 2023.
12. MariaDB Foundation home page, https://mariadb.org/. Last accessed 23 Mar
2023.
13. MySQL corporate home page, https://www.mysql.com/. Last accessed 23 Mar
2023.
14. Balle, S. M. and Palermo, D.: Enhancing an Open Source Resource Manager with Multi-Core/Multi-threaded Support. In: Frachtenberg E. and Schwiegelshohn U. (eds.) Proceedings of the 13th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), LNCS, vol. 4942, pp 37-50, Springer-Verlag (2007).
15. OpenPBS home page, https://www.openpbs.org/. Last accessed 28 Mar 2023.
16. Singularity plugin for Slurm, https://github.com/sol-eng/singularity-rstudio/blob/main/slurm-singularity-exec.md. Last accessed 4 Feb 2023.
17. Cox, R. and Morrison, L.: Fair Tree: Fairshare Algorithm for Slurm. Slurm User
Group Meeting (2014). https://slurm.schedmd.com/SC14/BYU_Fair_Tree.pdf.
Last accessed 28 Mar 2023.
18. Slurm Scheduling Configuration Guide, https://slurm.schedmd.com/sched_config.html. Last accessed 31 Mar 2023.
19. Slurm Scheduling Diagnostic Documentation, https://slurm.schedmd.com/sdiag.html. Last accessed 31 Mar 2023.
20. High Throughput Computing Administration Guide,
https://slurm.schedmd.com/high_throughput.html. Last accessed 30 Mar 2023.
21. Henseler D., Landsteiner B., Petesch D., Wright C., and Wright N.: Architecture and Design of Cray DataWarp. In Proceedings of the Cray User Group (2016). https://cug.org/proceedings/cug2016_proceedings/includes/files/pap105s2-file1.pdf
22. HDF5 download page from The HDF Group,
https://www.hdfgroup.org/downloads/hdf5. Last accessed 25 Mar 2023.
23. HDFview download page from The HDF Group,
https://www.hdfgroup.org/downloads/hdfview. Last accessed 25 Mar 2023.
24. Jette, M.: Slurm Power Management Support. Slurm User Group Meeting
(2015). https://slurm.schedmd.com/SLUG15/Power_mgmt.pdf. Last accessed 26
Mar 2023.
25. Open Container Initiative organization home page, https://opencontainers.org/.
Last accessed 28 Mar 2023.
26. Slurm container guide, https://slurm.schedmd.com/containers.html. Last accessed 4 Feb 2023.
27. Docker home page, https://www.docker.com/. Last accessed 28 Mar 2023.
28. Podman home page, https://podman.io/. Last accessed 28 Mar 2023.
29. Garfinkel S., Spafford G., and Schwartz A.: Pluggable Authentication Modules (PAM). In Practical UNIX and Internet Security, 3rd Edition, pp 94-96. O'Reilly (2003).
30. Name Service Switch description, https://guix.gnu.org/manual/en/html_node/Name-Service-Switch.html. Last accessed 3 Feb 2023.
31. Name Service Switch implementation for Slurm,
https://slurm.schedmd.com/nss_slurm.html. Last accessed 3 Feb 2023.
32. Wikipedia description of HMAC, https://en.wikipedia.org/wiki/HMAC. Last ac-
cessed 1 May 2023.