Architecture of the Slurm Workload Manager
M. Jette, T. Wickberg
1 Introduction
The development of Slurm¹ began at Lawrence Livermore National Laboratory (LLNL) in 2002. It was originally designed as a simple resource manager capable only of allocating whole nodes to jobs, then dispatching and managing those applications on their allocated resources [1]. Slurm relied upon an external scheduler such as Maui [2] to manage queues of work and schedule resources. Slurm has since evolved into a comprehensive workload scheduler capable of managing the most demanding workflows on many of the largest computers in the world. While Slurm has evolved a great deal over its two decades of existence, its original design goals have largely persisted and proven critical to its success.
Open Source: Slurm's source code is distributed under the GNU General
Public License and is freely available on GitHub [3]. This openness has
resulted in contributions from roughly 300 individuals ranging from minor
corrections to the documentation to complex added functionality.
Portability: Slurm is written in the C language with a GNU autoconf configuration engine. Slurm can be thought of as a highly modular and generic kernel with hundreds of plugins available for customization using a building-block approach in order to support a wide variety of hardware types, software environments, and scheduling capabilities. Site-specific plugins and scripts can easily be integrated for even greater customization. This flexibility allows Slurm to operate effectively in virtually any environment.
¹ Slurm was originally an acronym for "Simple Linux Utility for Resource Management", and stylized as "SLURM". The acronym was dropped in 2012 and the preferred capitalization changed to "Slurm".
this specification can include AND, OR, exclusive OR, and count specifications) and preferred (e.g., faster clock speed desired), required licenses, account name, job dependency specifications (to control the order of job execution), Quality Of Service (QOS), relative priority, and queues/partitions to use. The minimum time limit and size specifications are valuable if the user is willing to sacrifice run time and/or resources in order that the job be initiated as soon as possible. Slurm's backfill scheduler will take advantage of this flexibility to allocate such a job with the maximum run time and resources possible within its specified range without delaying the initiation time of higher priority jobs. The ability for a job to explicitly exclude specific nodes from its allocation is valuable if the user has doubts as to the integrity of specific nodes.
A job step in Slurm is a set of parallel tasks, typically an MPI application. A job can initiate an arbitrary number of steps serially and/or in parallel, with Slurm providing the queuing and resource management for those steps within the job's existing resource allocation. Use cases in which jobs execute thousands of steps are not uncommon, particularly when jobs may need to wait lengthy periods of time for their initial resource allocation request to be satisfied. Job step state information maintained by Slurm includes its ID, expressed as a job ID followed by a period and step ID (e.g., "123.45"), name, time limit (maximum), size specification (minimum and/or maximum count of nodes, CPUs, sockets, cores and/or threads), specific node names to include or exclude from its allocation, and node features required in its allocation. Job step management is lighter weight than job management. If currently available resources within a job allocation are insufficient for a step to be initiated, then it is queued until resources are available. Slurm does not support dependencies between job steps; that functionality can be provided by the job script managing the job step workflow if necessary.
A cluster typically consists of a collection of nodes sharing a common network. A Slurm cluster can be on premises, in the cloud, or spread across both. A Slurm federation is a collection of clusters sharing a common configuration database. By default, a job is submitted to the local cluster, and user commands to gather information about jobs and queues report information about the local cluster. However, all clusters in a federation may be configured to operate as a single system from the perspective of the users. Jobs can be submitted to any individual cluster in the federation or any set of clusters (e.g., clusters from the same manufacturer or having the same architecture), and may be eligible for execution on any of the federated systems.
Node configuration includes a wide variety of information, most of which is collected directly from the compute node when Slurm's slurmd daemon is started. Information collected from a compute node and maintained by Slurm includes: count of boards, sockets, cores and threads, a count of CPUs (usually defined as boards × sockets × cores × threads, but may vary depending upon configuration), memory size, and generic resources (GRES) including names, types and counts (used for GPUs, network bandwidth, scratch disk space, etc.). Information not collected from a compute node but maintained by Slurm includes a
After: job can begin execution after the specified job IDs have begun execution
AfterOK: job can begin execution after the specified job IDs have completed successfully (run to completion with an exit code of zero)
AfterNotOK: job can begin execution after the specified job IDs have terminated in some failed state
AfterAny: job can begin execution after the specified job IDs have terminated in any state
AfterCorr: an element of a job array can begin execution after the corresponding element ID in another job array completes
Singleton: the job can begin execution after any previously initiated jobs with the same job name and user ID have completed (i.e., only one job owned by a given user with the same job name can be running at any time)
The system administrator can configure the desired behavior for jobs with dependencies that cannot be satisfied (e.g., a job dependent upon the successful completion of another job, but that job fails). Typically such jobs are configured to be purged.
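To make the semantics of these dependency types concrete, the following is a minimal sketch of how such tests might be evaluated; the job states, structures, and enum names are hypothetical and do not reflect Slurm's internal representation.

    /* Hypothetical sketch of dependency evaluation; not Slurm's internal code. */
    #include <stdbool.h>

    typedef enum { JOB_PENDING, JOB_RUNNING, JOB_COMPLETED, JOB_FAILED } job_state_t;
    typedef enum { DEP_AFTER, DEP_AFTER_OK, DEP_AFTER_NOT_OK, DEP_AFTER_ANY } dep_type_t;

    struct job_info {
        job_state_t state;
        int exit_code;
    };

    /* Return true if a dependency of the given type on "other" is satisfied. */
    static bool dependency_satisfied(dep_type_t type, const struct job_info *other)
    {
        switch (type) {
        case DEP_AFTER:         /* the other job has begun execution */
            return other->state != JOB_PENDING;
        case DEP_AFTER_OK:      /* completed with an exit code of zero */
            return other->state == JOB_COMPLETED && other->exit_code == 0;
        case DEP_AFTER_NOT_OK:  /* terminated in some failed state */
            return other->state == JOB_FAILED ||
                   (other->state == JOB_COMPLETED && other->exit_code != 0);
        case DEP_AFTER_ANY:     /* terminated in any state */
            return other->state == JOB_COMPLETED || other->state == JOB_FAILED;
        }
        return false;
    }

    int main(void)
    {
        struct job_info other = { JOB_COMPLETED, 0 };
        return dependency_satisfied(DEP_AFTER_OK, &other) ? 0 : 1;
    }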
Users can request to be notified by email when their jobs change state. This
can be valuable for batch jobs in environments where long delays are possible.
Job state transitions which can be used to trigger email include: begin, end, fail,
requeue, and invalid dependency detected.
A Slurm account is used to group users into sets, independent of UNIX groups. Accounts are typically organized in a hierarchical fashion (e.g., division, group, project; see Figure 2). A user can have access to multiple accounts, with one designated as the default. Each account can have one or more account coordinators who are able to create sub-accounts, add or remove users from their account, and modify limits and resource apportioning for the users and accounts under their control. The account coordinator may also be able to view accounting information normally hidden from other users, such as a record of jobs executed by other users in the accounts over which they have control.
A Slurm association is a combination of cluster, account, user name, and
(optional) partition name. Each association can have a fair share allocation of
resources and a multitude of limits. It is worth noting that these limits come
in two forms. Many limits apply to individual jobs, such as the maximum time
limit. Other limits apply on an aggregate basis, such as the maximum number
of running jobs for an individual user or all users in some account.
Fig. 2. Example account hierarchy with fair-share allocations: Root (100%); below it A Group (30%), User Alice (10%), C Group (40%), and User Bob (20%); below A Group, User Adam (20%) and User Brenda (10%); below C Group, User Charles (20%), User Debra (15%), and User Edward (5%).

A Quality Of Service (QOS) is used to control a job's limits, priority, and charge multiplier. A QOS may be associated with a partition or independent of partitions and selected on a job by job basis. A QOS not explicitly bound to a partition can be used with any partition. The benefit in associating a QOS with a partition is in making a greater number of limits available than otherwise provided in Slurm's configuration file for a partition. A typical use case is to configure "standby", "normal", and "expedite" QOS on a system. Jobs submitted to any partition with a "standby" QOS would have very low priority and a correspondingly low charge multiplier, say being charged for resource use at 20% of the normal charge. Similarly, jobs submitted to any partition with an "expedite" QOS would be given a very high priority and high charge multiplier, perhaps being charged 5 times the normal rate. Access control lists can be configured to limit which accounts or users can use each QOS. QOS can also be used to define job preemption rules, so the "expedite" QOS might be configured to preempt (terminate running jobs) from the lower priority "standby" QOS. Limits set in a QOS override partition and association limits. So if a user's association has a maximum number of running jobs set to 10, but the "expedite" QOS has a limit of 20, the higher limit will apply to jobs running in the "expedite" QOS. In order to avoid confusion with the multitude of configurable limits, only a subset of limits are typically configured for associations and QOS.
The order of precedence for limits is as follows:
1. Partition QOS
2. Job QOS
3. User association
4. Account associations (ascending the hierarchy)
5. Root/Cluster association (i.e., the top of the account association hierarchy)
6. Partition configuration
If limits are defined at multiple points in this hierarchy, the point in this list where the limit is first defined will be used. Consider the following example:
MaxJobs=20 and MaxSubmitJobs is undefined in the partition QOS
No limits are set in the job QOS
MaxJobs=4 and MaxSubmitJobs=50 in the user association
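Applying the precedence rule stated above to this example, MaxJobs would come from the partition QOS (20) and MaxSubmitJobs from the user association (50), since those are the first points in the list where each limit is defined. A minimal sketch of that resolution follows, using hypothetical structures and ignoring any configuration flags that can alter the ordering.

    /* Sketch: resolve an effective limit by walking the precedence list above.
     * Hypothetical representation; a value of -1 means "limit not set". */
    #include <stdio.h>

    #define UNSET (-1)

    static int first_defined(const int *values, int count)
    {
        for (int i = 0; i < count; i++)
            if (values[i] != UNSET)
                return values[i];
        return UNSET;
    }

    int main(void)
    {
        /* Order: partition QOS, job QOS, user association, account
         * associations, root/cluster association, partition configuration. */
        int max_jobs[]        = { 20,    UNSET, 4,  UNSET, UNSET, UNSET };
        int max_submit_jobs[] = { UNSET, UNSET, 50, UNSET, UNSET, UNSET };

        printf("MaxJobs = %d\n", first_defined(max_jobs, 6));              /* 20 */
        printf("MaxSubmitJobs = %d\n", first_defined(max_submit_jobs, 6)); /* 50 */
        return 0;
    }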
3 Daemons
[Figure: a slurmctld daemon for Cluster 1 and another for Cluster 2, both communicating with a shared slurmdbd database daemon.]
4 Plugin Infrastructure
Slurm's extensive use of plugins is particularly valuable in providing portability and flexibility. Roughly 65% of Slurm's code is within its kernel, including the primary data structures, system daemons, and user commands. Over 100 plugins form the remainder of the code to support a wide range of differing hardware types, software environments, and scheduling capabilities. The configured plugins are loaded when a daemon or command is started and persist throughout its lifetime. Some plugins are called from multiple daemons and commands. In some cases plugins are also called by other plugins, such as the network topology plugin being called by the scheduling plugin. The plugin infrastructure provides a level of indirection to some configurable underlying functions. One example of the value in plugins was development work performed by Hewlett-Packard (HP) in 2005 [14]. Slurm's original implementation only supported the allocation of whole compute nodes to jobs, implemented through the select/linear scheduling plugin. HP added the ability to allocate resources on a node down to the core level with a new select/cons_res plugin. Later development introduced a new select/cons_tres plugin, now the default, which extended resource scheduling support to GPUs within the nodes as well. Roughly 80% of the changes to Slurm for this enhancement were in the form of a new job scheduling plugin, with the remaining changes in the kernel, much of that in the form of data structure changes. Given the number of plugins available, only a few will be described here.
Slurm's topology plugin is used to gather network topology and use that information to optimize resource allocations with respect to communication bandwidth. The topology plugins developed to date include 3-dimensional torus, 4-dimensional torus, hypercube, dragonfly, and tree. Slurm's GUI, sview, displays the nodes allocated to jobs, partitions, advanced reservations, etc., so that one can readily observe the network topology they utilize, as shown in Figure 4.
Slurm's job submit plugin is called from the slurmctld daemon. It is executed for each job submit or job modify RPC. An arbitrary number of job submit plugins may be used with a configurable call sequence. Each of these plugins can modify the arguments passed to slurmctld and return error messages as appropriate to the user. Some of the job submit plugins packaged with Slurm include: throttle (limits the rate at which a user can submit jobs, sleeping as needed to decrease job submission rates for individual users), require_time_limit (rejects jobs without an explicit time limit specification), pbs (adds PBS [15] environment variables for newly submitted jobs and supports the "before" job dependency), cray (sets Cray-specific generic resource parameters), and Lua (executes a customer-provided Lua script with almost limitless flexibility).
Four plugins are available to gather energy consumption from a node, including IPMI, RAPL, and Cray. Should some new mechanism become available
to gather a node's energy consumption data, one would need to develop a new
Slurm plugin to gather that information and present it to the Slurm kernel in
the appropriate format.
SPANK (Slurm Plugin Architecture for Node and job [K]ontrol) is a generic plugin mechanism. The plugins are written in C, but without requiring access to the Slurm source code. The plugins are executed by the Slurm daemons and the Slurm commands used for job submission. SPANK plugins can be used to add new site-specific job options, including making information about those options visible in the command's help messages. One example of SPANK use was the initial integration of Singularity containers with Slurm [16]. This plugin added new options to the Slurm job submission commands: --singularity-container, --singularity-bind, and --singularity-args. It also added support for Singularity-specific environment variables. The slurmstepd job step management process made use of this newly added information to initiate the application in an appropriate container environment.
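As an illustration of this mechanism (a hedged sketch, not the Singularity plugin itself), a minimal SPANK plugin that registers one hypothetical site-specific option might look as follows; the authoritative interface is the spank.h header installed with Slurm.

    /* Minimal SPANK plugin sketch registering a hypothetical --site-example
     * option; compiled against the installed <slurm/spank.h> only, with no
     * access to the Slurm source tree required. */
    #include <slurm/spank.h>

    SPANK_PLUGIN(site_example, 1);

    static int _opt_cb(int val, const char *optarg, int remote)
    {
        /* Record or act upon the option value here. */
        slurm_info("site_example: option value = %s", optarg ? optarg : "(none)");
        return ESPANK_SUCCESS;
    }

    static struct spank_option site_opt = {
        "site-example",                        /* shown as --site-example */
        "[value]",                             /* argument description */
        "Hypothetical site-specific option.",  /* appears in --help output */
        1,                                     /* option takes an argument */
        0,                                     /* value passed to callback */
        (spank_opt_cb_f) _opt_cb
    };

    /* Called by the Slurm daemons and submission commands when loaded. */
    int slurm_spank_init(spank_t sp, int ac, char **av)
    {
        return spank_option_register(sp, &site_opt);
    }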
The most common integration point remains the use of site-specific scripts in the Slurm prolog and epilog interfaces. These are executed in various places (the submit host, the head node by slurmctld, or the compute node by slurmd or slurmstepd), at various times (e.g., job allocation, step startup, and task launch), and by various users (SlurmUser, root, or the job user). Typical use cases include establishing the environment for the job (boot nodes, node health check, configure temporary storage, etc.) or cleaning up at job completion (deleting temporary files).
5 Configuration
Slurm requires at least one configuration file, although some plugins require their own configuration file. These files can either be in a location readable by all daemons and user commands, or the files can be replicated on every node. Alternatively, the files can be placed on the nodes where the slurmctld daemons execute (primary daemon plus backups), and the slurmctld daemon will make that configuration information available to the other daemons and commands upon request with a newer optional feature referred to as "configless" support.
A system administrator may find it difficult to upgrade the Slurm installation
simultaneously across all of the enterprise. In order to support rolling upgrades,
every daemon and command is able to support RPCs for three major releases,
which includes its release plus the previous two major releases of Slurm. Since
major releases are currently scheduled every 9 months, that supports a relaxed
upgrade schedule. Changes to RPCs are limited to major releases, so upgrades
between maintenance releases can be performed on a node-by-node basis.
6 Communications
Slurm uses a fault-tolerant hierarchical communication mechanism with configurable fanout for communications to the compute nodes. This offloads as much work as possible from the slurmctld daemon, which typically has a multitude of active threads. This also minimizes the wall time required for operations involving a large number of nodes, such as application launch and file transfer. It is typically recommended to configure a fanout value so that no more than a five-level communication tree is needed to reach all compute nodes. For
example, if the slurmctld daemon needs to kill a job running on 1110 compute nodes and Slurm's fanout is configured at 10, the slurmctld daemon will take the list of 1110 compute nodes and divide it into 10 sets of 111 nodes each. The slurmctld daemon will then launch 10 threads, each communicating with a single slurmd daemon, notifying it of the job to be killed along with a list of the additional 110 compute nodes that slurmd should forward the request to. Each of these 10 slurmd daemons launches 10 threads to communicate with additional slurmd daemons on other compute nodes. The process continues until every slurmd daemon involved in the operation is reached. In this case, the process requires three levels in the communication tree. Note the communication hierarchy is created as needed and destroyed upon completion of the communications. Multiple communication hierarchies may be active at any time using different or even overlapping sets of slurmd daemons. See the example in Figure 5.
[Figure 5: communication hierarchy rooted at the slurmctld daemon with messages fanning out to slurmd daemons.]
The srun command transmits an application launch request to a set of slurmd daemons using the same fanout logic as for messages originating from the slurmctld daemon.
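To make the arithmetic concrete, the sketch below (a simplified model, not Slurm's routing code) computes how many tree levels are needed to reach a given number of compute nodes with a given fanout.

    /* Sketch: number of communication-tree levels needed to reach "nnodes"
     * compute nodes with a given fanout (simplified model of the hierarchy). */
    #include <stdio.h>

    static int tree_levels(int nnodes, int fanout)
    {
        int levels = 0;
        long reached = 0, frontier = 1;   /* slurmctld is the root */
        while (reached < nnodes) {
            frontier *= fanout;           /* each daemon contacts "fanout" more */
            reached += frontier;
            levels++;
        }
        return levels;
    }

    int main(void)
    {
        /* 1110 nodes with fanout 10: 10 + 100 + 1000 >= 1110 -> 3 levels. */
        printf("%d\n", tree_levels(1110, 10));
        return 0;
    }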
7 Job Priority
Slurm assigns a priority to each job based on a multitude of factors including age, fair share, queue/partition, QOS, size, nice value, association, and a site-managed value. The weight of each factor in determining a job's priority is configurable, so that a job's priority may be based 60% on age, 30% on fair share, etc. The age component of job priority is based upon the time when the job first becomes eligible for execution, after dependencies are satisfied, rather than its submission time. Its value is proportional to that wait time, with a configurable maximum time (i.e., the value could be configured to stop increasing once a job is 7 days old). Fair share is a measure of how over- or under-serviced an association is relative to its resource allocation. The window of time used in this calculation is either fixed, with usage data cleared periodically (i.e., at the end of each week, month, quarter, year, etc.), or historic resource usage data is continuously decreased on an exponential basis through time. Different algorithms and parameters are available to control how fair share is computed based upon the association tree, and a plugin interface called site_factor is provided should an administrator wish to develop their own novel prioritization approach. For example, should an individual user be allowed to consume all resources allocated to his group if no other users in that group are active, or should some portion of that group's resources be retained for when other users in that group become active, and if so, how much [17]? Job size requirements can be used to consider a job's resource requirements in computing its priority. Size in this context is configurable to consider a variety of resources with different weights for each resource (e.g., one GPU might be given the same weight as 1 TB of memory in computing a job's size component of priority). A system administrator may want to increase the scheduling priority of jobs with large CPU, memory, or license requirements. This can also be reversed to favor jobs with low resource requirements. A user may specify a job's nice value to establish their relative scheduling priority in Slurm; this works similarly to a process's Linux nice value, although Slurm's nice value range is much larger, being a signed 32-bit value. If necessary, a system administrator may also explicitly set a job's scheduling priority in order to override the default calculated value. This is typically done to force a job to have the highest priority and ensure it will begin execution as soon as possible.
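A greatly simplified sketch of this weighted combination is shown below; the structure, weights, and normalization of each factor to the range [0, 1] are assumptions for illustration and not the multifactor priority plugin's actual implementation.

    /* Simplified sketch of a multifactor job priority calculation.
     * Each factor is assumed normalized to [0.0, 1.0]; the weights are the
     * configurable knobs (e.g., 60% age, 30% fair share in the text above). */
    #include <stdint.h>
    #include <stdio.h>

    struct priority_factors {
        double age;        /* time eligible, capped at a configurable maximum */
        double fairshare;  /* under- vs. over-serviced association */
        double partition;
        double qos;
        double job_size;
        int32_t nice;      /* user-requested adjustment, subtracted at the end */
        int32_t site;      /* value from a site_factor-style plugin */
    };

    static int64_t job_priority(const struct priority_factors *f)
    {
        int64_t prio = 0;
        prio += (int64_t)(600000.0 * f->age);        /* weight: 60% */
        prio += (int64_t)(300000.0 * f->fairshare);  /* weight: 30% */
        prio += (int64_t)( 50000.0 * f->partition);
        prio += (int64_t)( 30000.0 * f->qos);
        prio += (int64_t)( 20000.0 * f->job_size);
        prio += f->site;
        prio -= f->nice;
        return prio > 0 ? prio : 0;
    }

    int main(void)
    {
        struct priority_factors f = { 0.5, 0.8, 1.0, 0.0, 0.1, 0, 0 };
        printf("priority = %lld\n", (long long) job_priority(&f));
        return 0;
    }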
8 Typical Configurations
Each site and each cluster have unique configurations and scheduling considerations. Before considering Slurm's scheduling algorithms, some typical configurations and their scheduling requirements are described below.
Roughly half of the clusters that we work with are homogeneous: every compute node has the same processors, memory size, GPUs, etc. Homogeneous clusters may be configured with a single partition for the most efficient use of resources. While every job may be in a single queue/partition, a variety of limits are typically used to prevent any single user or group of users from being allocated more resources than desired. Depending on the configuration and use case, it is not uncommon to see dozens or even hundreds of the highest priority jobs have their resource allocation deferred by one or more limits (e.g., maximum running job count by user, maximum allocated GPU count by account). In addition to the compute nodes, global resources such as licenses and burst buffer space must be managed.
The other half of the clusters we work with are heterogeneous. Such clusters may include a small number of unique compute nodes, say with GPUs or a larger memory size. Other clusters may include a dozen or more unique compute node configurations including different processor types and clock speeds. Such clusters are typically assembled over time with different organizations contributing hardware best suited for their workload and budget. For example, the physics department at a university may purchase two racks of nodes with large memory size, the chemistry department another rack of nodes with GPUs, etc. We refer to each set of resources as a "condo" or "condominium", and they are interconnected into a single cluster sharing a high speed network. Typically each condo can be accessed from two or more partitions. One partition will provide priority access to the resources with an access control list identifying the organization financing those resources (e.g., the "physics" partition will have an access control list containing the faculty and students in the physics department). A second partition might span all compute nodes in the cluster with lower priority access for any user, typically with lower size and/or time limits. Slurm allows jobs to be submitted to multiple partitions simultaneously to take advantage of this configuration. The node "feature" parameter can be used to prevent job allocations from spanning different processor types, as shown in Figure 6.
Job throughput rate requirements also vary widely. Some workloads consist primarily of jobs that execute for days, in which case expending considerable time to optimize scheduling may be warranted. Other workloads consist primarily of jobs that execute for a few seconds, and a throughput rate of hundreds of jobs per second may be required [20]. Slurm can support both workloads, but with different algorithms and scheduling parameters.
9 Scheduling Algorithm
Slurm performs a quick and simple (first-in first-out, FIFO) scheduling attempt on an event-driven basis (with a configurable minimum time interval between executions): upon each job submission, job completion, or configuration change. Only the top priority jobs (a configurable count) in each partition will be evaluated for initiation, and this can be useful for high throughput computing. Given the appropriate configuration and hardware, Slurm can sustain a throughput rate exceeding 100 jobs per second. Since this scheduling algorithm is FIFO, once any job in a partition is found unable to be initiated, all lower priority jobs in that partition will be left pending without further consideration.

Fig. 6. Example heterogeneous cluster with a job submission request for any two nodes with the same architecture. Plugins can be used to set default partitions and node features as appropriate.
Slurm also performs a more comprehensive FIFO scheduling attempt on a periodic basis with a configurable interval. This algorithm typically executes far less frequently than the event-driven scheduling algorithm, but uses the same logic with different configuration parameters. The default configuration parameters will typically enable this algorithm to evaluate jobs in every partition from the highest priority until a job unable to be initiated is identified.
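The per-partition FIFO pass can be sketched as follows; the job structure and the resources_available()/start_job() helpers are hypothetical stand-ins for Slurm's internal logic.

    #include <stdbool.h>
    #include <stddef.h>

    struct job { int id; bool can_start; };

    /* Stand-ins for Slurm's internal checks (assumptions for this sketch). */
    static bool resources_available(const struct job *j) { return j->can_start; }
    static void start_job(struct job *j) { (void) j; /* allocate and launch */ }

    /* One partition's pending queue, already sorted by decreasing priority.
     * Consider only the top "depth" jobs and stop at the first that cannot
     * start, leaving all lower priority jobs pending. */
    static void fifo_pass(struct job **queue, size_t njobs, size_t depth)
    {
        size_t limit = njobs < depth ? njobs : depth;
        for (size_t i = 0; i < limit; i++) {
            if (!resources_available(queue[i]))
                break;
            start_job(queue[i]);
        }
    }

    int main(void)
    {
        struct job a = { 1, true }, b = { 2, false }, c = { 3, true };
        struct job *queue[] = { &a, &b, &c };
        fifo_pass(queue, 3, 100);   /* starts job 1, stops at job 2 */
        return 0;
    }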
Except for some high throughput computing configurations, the backfill scheduling plugin is used to initiate most jobs. Without the backfill plugin, jobs in each queue are scheduled strictly in priority order as described above. The backfill scheduling plugin will initiate lower priority jobs only if doing so does not delay the expected start time of any higher priority job, although reservation of resources for higher priority jobs can be limited to jobs which have been pending for over some configurable period of time or the highest priority jobs in each partition within a configurable queue depth. The expected start time of pending jobs is dependent on the expected completion time of running jobs, based solely upon the job's time limit. Approximately 20 configuration parameters are available for tuning backfill scheduling, such as: how far in the future to consider, what is the time resolution for scheduling, how many jobs to consider for each user, how many jobs to consider from each partition, etc. The backfill scheduler builds a table of expected resource availability through time, tracing the expected initiation and termination time of running and pending jobs. All resource limits are enforced by the backfill scheduler as it builds this table. For example, sufficient compute resources may be available to initiate the highest priority job in some partition one hour in the future, but job initiation prevented by the maximum number of CPU hours of running jobs for the job's association. In this case a
lower priority job may be initiated. Since a cluster may have thousands of executing jobs and tens of thousands of pending jobs, backfill scheduling overhead is kept more manageable by the time resolution configuration parameter cited above. Say the time resolution is configured to 300 seconds. The resources that become available in any 300-second interval are recorded in a single record rather than one record per job completion, which might number in the tens or even hundreds of jobs for any 300-second interval. For example, consider a job expected to end in 600 seconds and another job expected to end in 610 seconds. Rather than creating records of expected system state at those two times and determining which pending jobs can start at each of those times, the records are combined and pending jobs will only be evaluated for initiation at that one time. While this does result in some loss of precision when evaluating when pending jobs are expected to start, that is likely insignificant compared to the inaccuracy in job time limits. The benefit is that the computational overhead of the backfill scheduler can be dramatically reduced. The backfill scheduler determines the reason each pending job is currently unable to be initiated (e.g., some specific limit, dependency, waiting for resources, held by administrator) and its expected start time (i.e., when compute resources are available and no limits exceeded). This information is made available to users and is regularly consulted. Since the backfill algorithm can require multiple minutes to complete with large workloads, it is performed in a piecemeal manner. It acquires the appropriate locks, executes for a configurable time interval (typically a couple of seconds), releases locks for a configurable time interval (typically less than one second) in order to perform other outstanding operations (e.g., accept newly submitted jobs, process newly completed jobs, provide users with status information, etc.), and repeats the process until all pending jobs have been considered.
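The time-resolution optimization amounts to grouping expected end times into common buckets before building the availability table. A minimal sketch follows; the exact rounding rule Slurm applies is a configuration detail not reproduced here.

    /* Sketch: group expected job end times into buckets of "resolution" seconds
     * so jobs ending close together share one record in the availability table. */
    #include <stdio.h>

    static long bucket(long seconds_from_now, long resolution)
    {
        return (seconds_from_now / resolution) * resolution;
    }

    int main(void)
    {
        long res = 300;
        /* Jobs expected to end 600 and 610 seconds from now fall in the same
         * bucket, so pending jobs are evaluated against one combined record. */
        printf("%ld %ld\n", bucket(600, res), bucket(610, res));   /* 600 600 */
        return 0;
    }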
Slurm supports burst buffers via a plugin mechanism. Slurm will allocate burst buffer space for a job when it approaches its expected initiation time and stage-in any required data. The job will not be allocated compute resources until data stage-in has completed. When the job's computation has completed, data will be staged-out and the burst buffer space released. The burst buffers supported by Slurm plugins include Cray's DataWarp [21] and a generic Lua script based plugin.
Slurm has the ability to support a cluster that grows and shrinks on demand, typically relying upon a service such as Amazon Elastic Compute Cloud (Amazon EC2), Google Cloud Platform, or Microsoft Azure for resources. These resources can be combined with an existing cluster to process excess workload (cloud bursting), or they can operate as an independent self-contained cluster. Good responsiveness and throughput can be achieved while only paying for the resources needed.
Slurm has dozens of scheduling parameters available to control the number of jobs considered for scheduling in each partition, maximum scheduling frequency, etc. [18]. Slurm also maintains detailed information about scheduling performance and makes that information available to system administrators for tuning purposes [19]. Detailed information about anticipated resource availability in the future and expected resource allocations and initiation times of pending jobs is also available.
10 License Scheduling
In addition to node-centric resources, Slurm supports scheduling for licenses.
Licenses can be used to represent any resource available on a global basis such
as network bandwidth or global scratch disk space; although, as implied by the
name, they are most commonly used to track software licenses.
Licenses are requested as part of the job submission, and will be allocated
to the job alongside the compute resources. Backfill scheduling management for these licenses is a recent optional addition, and can be enabled by the administrator. Preemption support, in which lower priority jobs can be preempted to free up sufficient licenses for higher priority ones, has also been recently added.
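Conceptually, a license is a counted, cluster-wide resource. The sketch below shows the basic availability test using hypothetical structures, ignoring backfill reservations and preemption.

    /* Sketch: a license as a counted, cluster-wide resource. */
    #include <stdbool.h>

    struct license {
        const char *name;
        unsigned int total;    /* configured count, e.g. purchased seats */
        unsigned int in_use;   /* currently allocated to running jobs */
    };

    /* Can a job requesting "want" licenses be allocated them now? */
    static bool licenses_available(const struct license *lic, unsigned int want)
    {
        return lic->total - lic->in_use >= want;
    }

    int main(void)
    {
        struct license lic = { "example_license", 10, 7 };
        return licenses_available(&lic, 2) ? 0 : 1;   /* 3 free >= 2 -> yes */
    }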
11 Application Layout
The job has control over its resource allocation with respect to sockets, cores, and threads using options at job submit time, for example threads per core, cores per socket, and sockets per node. Similarly, the job step allocation can specify the number of tasks to launch per node, socket, and/or core. The job step has complete control over how tasks are distributed over the allocated resources by specifying layout patterns across nodes, sockets, and/or cores. Binding tasks to allocated resources can be performed using CPU affinity and/or Linux cgroups [11]. Linux cgroups are essential for configurations where more than one job can be active at the same time on a compute node. Besides limiting each job to its allocated CPUs, cgroups can also ensure that each job is constrained to its allocated memory and does not interfere with another job's memory allocation. Linux cgroups can constrain each job's RAM, kernel memory, swap space, and allocated generic resources such as GPUs. There are a variety of options available to control how each task/rank of the application is bound to the job step's allocated CPUs. Cgroups are also used to collect usage data for allocated resources.
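Two common layout patterns, block and cyclic distribution of tasks across allocated nodes, can be sketched as follows; this is a simplified illustration rather than Slurm's task layout code.

    /* Sketch: block vs. cyclic distribution of "ntasks" ranks across "nnodes"
     * allocated nodes (two of the layout patterns a job step can request). */
    #include <stdio.h>

    static void distribute(int ntasks, int nnodes, int cyclic)
    {
        for (int task = 0; task < ntasks; task++) {
            int node = cyclic ? task % nnodes                            /* round robin */
                              : task / ((ntasks + nnodes - 1) / nnodes); /* fill a node */
            printf("task %d -> node %d\n", task, node);
        }
    }

    int main(void)
    {
        distribute(8, 2, 0);   /* block:  tasks 0-3 on node 0, tasks 4-7 on node 1 */
        distribute(8, 2, 1);   /* cyclic: even tasks on node 0, odd tasks on node 1 */
        return 0;
    }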
Some topology plugins support the ability for a job to specify the maximum
number of leaf switches desired in its resource allocation and the maximum time
to wait for such an allocation (e.g., wait up to an extra 10 minutes for my job's
allocated nodes to be on one leaf switch). The system administrator can configure
the maximum wait time for any job to secure its desired leaf count in order to
limit resources idled for job layout optimization.
Jobs can increase or decrease their size per administrative controls. In the
case of increasing the size of a job, the user must submit a new job to acquire the
additional resources desired. Once this second job allocation has been made, the
user merges the two job allocations into a single job, a process which generates a
shell script the user executes in the original job in order to modify its environment
variables as appropriate.
12 Job Profiling
Slurm has the ability to collect detailed performance information about a job step on a periodic basis. This is more information than can reasonably be recorded in Slurm's database for every job, and may involve some additional overhead to collect; therefore, it is only collected when requested by the user. The user specifies what types of information are to be collected and at what frequencies (an independent frequency can be configured for each data type). The types of information available for collection include power consumption, file system usage, network interconnect usage, CPU and memory usage, and GPU utilization. At application termination the data is collected and stored into a single HDF5 [22] dataset. We recommend the HDFView [23] tool to graphically view the resulting data, which can easily identify problems such as spikes in memory usage, with the timing and offending task ID (rank) identified.
13 Compute Node Management
Alongside traditional workload management capabilities, additional integrations
have been developed to solve common HPC systems administration tasks.
Slurm has the ability to limit the power consumption of a cluster [24]. It does
this by monitoring each node's power consumption and periodically adjusting
its available power. Nodes consuming less than their available power will have
their power availability reduced and that power will be redistributed to the other
nodes in the cluster. Special mechanisms exist to manage power limits uniformly
across all nodes allocated to a job as well as job startup and termination.³
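The periodic rebalancing described above might be sketched as follows; the structures and margin parameter are hypothetical, and Slurm's actual power management logic is considerably more involved.

    /* Sketch: reclaim unused watts from nodes consuming well below their cap
     * and redistribute them evenly to the remaining nodes. Greatly simplified. */
    #include <stdbool.h>
    #include <stddef.h>

    struct node_power {
        double cap;        /* watts currently made available to the node */
        double consumed;   /* watts measured during the last interval */
    };

    static void rebalance(struct node_power *nodes, size_t n, double margin)
    {
        double reclaimed = 0.0;
        size_t nbusy = 0;

        if (n == 0)
            return;
        bool busy[n];                       /* C99 variable-length array */

        for (size_t i = 0; i < n; i++) {
            double slack = nodes[i].cap - nodes[i].consumed;
            busy[i] = (slack <= margin);
            if (busy[i]) {
                nbusy++;
            } else {                        /* under-consuming: shrink its cap */
                nodes[i].cap = nodes[i].consumed + margin;
                reclaimed += slack - margin;
            }
        }
        if (nbusy == 0)
            return;
        for (size_t i = 0; i < n; i++)      /* hand reclaimed power to busy nodes */
            if (busy[i])
                nodes[i].cap += reclaimed / nbusy;
    }

    int main(void)
    {
        struct node_power nodes[] = { { 500, 200 }, { 500, 490 } };
        rebalance(nodes, 2, 10.0);          /* node 0 gives up 290 W to node 1 */
        return 0;
    }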
Container support for OCI images [25,26] allows for the launch of compatible containers without any external tooling on the compute nodes. Management of these container images is a direct extension of existing support for Linux subsystems such as cgroups and filesystem namespaces, and required only minimal changes to the slurmd daemon. A newly added command, scrun, allows for use of Slurm as the underpinning for common tools such as Docker [27] and Podman [28].
pam_slurm_adopt is a PAM [29] module that intercepts user SSH connections and confines them to the resources that were allocated to their job. For connections initiated from other compute nodes (common with SSH-based MPI launchers), it will interrogate the originating compute node to determine which job the network connection originated from, and can perfectly match that to the allocated resources on that node. (If the job has not been allocated resources, the connection is usually denied, depending on configuration.) For connections initiated from login nodes or other external machines, the connection will usually be permitted under resources allocated to the first job running under that user account. Besides confining these processes to resources already allocated to the job, these processes have an accounting record created for them, which details their resource usage.

³ This capability was successfully used at King Abdullah University of Science and Technology (KAUST) for a period when power availability was limited. We are unaware of any other organization currently using this capability.
nss_slurm is an NSS [30,31] module that allows Slurm to centrally propagate the user and group information corresponding to each job's owner. This mitigates issues when compute nodes, which are usually connected to LDAP, all simultaneously try to resolve user and group information when the job is initiated. In addition, certain cluster deployment approaches can preclude the need to either synchronize /etc/passwd and /etc/group files to the compute nodes, or connect the compute node to LDAP or NIS.
The sbcast command, and the associated --bcast option to the srun command, allow Slurm to distribute files through its built-in hierarchical tree communication system. Optional compression can further improve performance. This can be used to avoid performance issues from large-scale job launches on common network filesystems by moving executables into local scratch (possibly tmpfs) space. An optional mode, enabled through the --send-libs argument, allows dynamic libraries to be identified and transmitted alongside the executable image, further improving performance for large-scale job launches.
The scrontab command allows users to register periodic compute jobs with a crontab-compatible syntax. This is designed to mitigate a common HPC user request for cron access on login nodes, and instead launches compute jobs at designated intervals (while ensuring that only one copy of the process / Slurm job is launched concurrently, a feature that cron itself lacks), avoiding reliance on specific login nodes remaining online continuously.
14 Conclusion
This paper presents an overview of Slurm's design and functionality. Slurm provides an open source tool that can effectively manage a wide variety of workloads on computers of any size. It is also highly modular in order to provide excellent flexibility and extensibility. Motivated researchers can experiment with alternative scheduling algorithms, network topologies, etc. by developing their own plugin while leveraging Slurm's extensive and stable framework.

Slurm continues to evolve with the help of numerous dedicated engineers. Areas under current development include improved integration with external orchestration systems such as Kubernetes, full adoption of an internal extensible HMAC authentication scheme [32], further improvements to the "cloud bursting" modes of operation, performance improvements in job throughput and user command interaction, and integration with external interfaces such as PMIx. Future directions, subject to community interest, may also involve improved support for energy-centric computing, ways to improve job step performance and workflow capabilities by divorcing responsibility from the central slurmctld process, and refactoring the core scheduling system to allow for different compute node hierarchies than the traditional board/socket/core/CPU model that is embedded in the existing scheduler subsystems.
References
1. Jette M., Yoo A., and Grondona M.: SLURM: Simple Linux Utility for Resource
Management. In: Feitelson D., Rudolph, L., Schwiegelshohn, U. (eds.) Proceedings
of the 9th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP),
LNCS, vol. 2862, pp 44-62, Springer-Verlag (2003).
2. Jackson D., Snell Q., and Clement M.: Core Algorithms of the Maui Scheduler.
In: Feitelson D. and Rudolph, L. (eds.) Proceedings of the 7th Workshop on Job
Scheduling Strategies for Parallel Processing (JSSPP), LNCS, vol. 2221, pp 88-102,
Springer-Verlag (2001).
3. Slurm code repository, https://github.com/SchedMD/slurm.git. Last accessed 3 Feb
2023.
4. Frontier User Guide, https://docs.olcf.ornl.gov/systems/frontier_user_guide.html.
Last accessed 3 Feb 2023.
5. MUNGE home page, https://dun.github.io/munge/. Last accessed 26 Apr 2023.
6. JWT home page, https://jwt.io/. Last accessed 1 May 2023.
7. Quadrics in Linux Clusters presentation, https://hsi.web.cern.ch/HNF-Europe/sem3_2001/hnf.pdf. Last accessed 3 Feb 2023.
8. Slurm Documentation, https://slurm.schedmd.com/. Last accessed 4 Feb 2023.
9. Pritchard H., Roweth D., Henseler D., and Cassella P.: Leveraging the Cray Linux
Environment Core Specialization Feature to Realize MPI Asynchronous Progress
on Cray XE Systems. In Proceedings of the Cray User Group (2012).
10. Jette M.: Expanding symmetric multiprocessor capability through gang scheduling.
In: Feitelson D. and Rudolph, L. (eds.) Proceedings of the 4th Workshop on Job
Scheduling Strategies for Parallel Processing (JSSPP), LNCS, vol. 1459, pp 199-216,
Springer-Verlag (1998).
11. Ondrejka P., Majorsinova E., Prpic M., Landmann R., Silas D.: Resource Management Guide, https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/index. Last accessed 23 Mar 2023.
12. MariaDB Foundation home page, https://mariadb.org/. Last accessed 23 Mar
2023.
13. MySQL corporate home page, https://www.mysql.com/. Last accessed 23 Mar
2023.
14. Balle, S. M. and Palermo, D.: Enhancing an Open Source Resource Manager with Multi-Core/Multi-threaded Support. In: Frachtenberg E. and Schwiegelshohn U. (eds.) Proceedings of the 13th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), LNCS, vol. 4942, pp 37-50, Springer-Verlag (2007).
15. OpenPBS home page, https://www.openpbs.org/. Last accessed 28 Mar 2023.
16. Singularity plugin for Slurm, https://github.com/sol-eng/singularity-rstudio/blob/main/slurm-singularity-exec.md. Last accessed 4 Feb 2023.
17. Cox, R. and Morrison, L.: Fair Tree: Fairshare Algorithm for Slurm. Slurm User
Group Meeting (2014). https://slurm.schedmd.com/SC14/BYU_Fair_Tree.pdf.
Last accessed 28 Mar 2023.
18. Slurm Scheduling Configuration Guide, https://slurm.schedmd.com/sched_config.html. Last accessed 31 Mar 2023.
19. Slurm Scheduling Diagnostic Documentation, https://slurm.schedmd.com/sdiag.html. Last accessed 31 Mar 2023.
20. High Throughput Computing Administration Guide,
https://slurm.schedmd.com/high_throughput.html. Last accessed 30 Mar 2023.
21. Henseler D., Landsteiner B., Petesch D., Wright C., and Wright N.: Architecture and Design of Cray DataWarp. In Proceedings of the Cray User Group (2016). https://cug.org/proceedings/cug2016_proceedings/includes/files/pap105s2-file1.pdf
22. HDF5 download page from The HDF Group,
https://www.hdfgroup.org/downloads/hdf5. Last accessed 25 Mar 2023.
23. HDFview download page from The HDF Group,
https://www.hdfgroup.org/downloads/hdfview. Last accessed 25 Mar 2023.
24. Jette, M.: Slurm Power Management Support. Slurm User Group Meeting
(2015). https://slurm.schedmd.com/SLUG15/Power_mgmt.pdf. Last accessed 26
Mar 2023.
25. Open Container Initiative organization home page, https://opencontainers.org/.
Last accessed 28 Mar 2023.
26. Slurm container guide, https://slurm.schedmd.com/containers.html. Last accessed 4 Feb 2023.
27. Docker home page, https://www.docker.com/. Last accessed 28 Mar 2023.
28. Podman home page, https://podman.io/. Last accessed 28 Mar 2023.
29. Garfinkel S., Spafford G., and Schwartz A.: Pluggable Authentication Modules (PAM). In Practical UNIX and Internet Security, 3rd Edition, pp 94-96. O'Reilly (2003).
30. Name Service Switch description, https://guix.gnu.org/manual/en/html_node/Name-Service-Switch.html. Last accessed 3 Feb 2023.
31. Name Service Switch implementation for Slurm,
https://slurm.schedmd.com/nss_slurm.html. Last accessed 3 Feb 2023.
32. Wikipedia description of HMAC, https://en.wikipedia.org/wiki/HMAC. Last ac-
cessed 1 May 2023.