Linux Clusters Institute: Scheduling
Scheduling
David King, Sr. HPC Engineer
National Center for Supercomputing Applications
University of Illinois
August 2017 1
About me
August 2017 2
Scheduling Schedule
• Introduction
• Workload Management Overview
• What is the goal?
• Review of Fundamentals of Workload Managers
• Queues
• Priority
• FIFO
• Fairshare
• QOS
• Reservations
• Feature Set Survey
• PBS Pro
• Torque
• Moab/Maui
• LSF
• Other Open Source Tools
• Lunch
• Slurm Hands-on
• Cloud and Scheduling
• Break
• Slurm Deep Dive
• Components
• Interface
• Limiting Resources
• Overview
• Cgroups
• Processor affinity
• Containers
• Cloud and Scheduling
• Accounting
• Xdmod
• Gold/Mam
• Slurm
August 2017 3
Workload Management Overview
August 2017 4
Workload Management Overview
August 2017 5
Workload Management
August 2017 6
Resource Managers
August 2017 7
Schedulers
August 2017 8
Feature Set
August 2017 9
Fairshare
• Takes historical resource utilization as a factor in job priority.
• The more you use, the less priority you get.
• If Bob uses the entire cluster on Monday, his priority will be less for
the rest of the week.
• Can be set for users, groups, classes/queues, and QoS.
• Multi-level targets
• Highly tunable for interval (duration of window), depth (number of
days), decay (weighting of contribution of each day) and what metric
to use.
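• A minimal sketch of the Slurm side of this, assuming the multifactor priority plugin; the weights, half-life, and account name are arbitrary examples:
# slurm.conf
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0        # historical usage decays with a 7-day half-life
PriorityWeightFairshare=10000
PriorityWeightAge=1000
# share targets set in the accounting database
sacctmgr modify account chem set fairshare=100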
August 2017 10
QOS
August 2017 11
Multi-factor Priority
August 2017 12
Backfill
August 2017 13
Preemption/Gang Scheduling
August 2017 14
Reservations
August 2017 15
Reservation Examples
August 2017 16
Job Dependencies
August 2017 17
Job Arrays
August 2017 18
Topology Aware
August 2017 19
Hierarchical Topology Aware
August 2017 20
3-D Torus Interconnect Topology Aware
August 2017 21
Accelerator Support
August 2017 22
Power Capping and Green Computing
• The idea is to limit the work done by a specific job to a fixed amount of total power consumed
• The implementation is usually a resource manager (RM) throttle of local CPU performance per node
• May not account for motherboard power from DIMMs, networking, and any GPUs
• Alternatively, the RM may be able to monitor the total power consumed at the power supply (usually through BMC hooks) and simply terminate the job at the requested amount
• Nodes can also be shut down while not in use
• Not all schedulers support this
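• As one hedged example, Slurm exposes idle-node power-down through slurm.conf; the script paths here are site-specific placeholders:
SuspendProgram=/usr/local/sbin/node_poweroff   # site-provided script (placeholder path)
ResumeProgram=/usr/local/sbin/node_poweron     # site-provided script (placeholder path)
SuspendTime=600     # seconds a node must sit idle before it is powered down
SuspendRate=10      # nodes per minute powered down
ResumeRate=10       # nodes per minute powered back up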
August 2017 23
Common Open Source and Commercial
Schedulers and Workload Managers
• There are several commercial and open source choices for workload
management
• Portable Batch System derived
• PBS Pro – commercial and open source product supported by Altair Engineering
• Torque – open source maintained by Adaptive Computing
• Very limited built-in scheduler
• Can utilize the Maui Scheduler – open source
• In maintenance mode
• Moab scheduler is a commercial fork of Maui developed by Adaptive
• Univa Grid Engine (UGE) (formerly Sun Grid Engine) supported by Univa
• Platform LSF – IBM Commercial product
• SLURM – Open source with commercial support
August 2017 24
PBS Pro
August 2017 25
Torque
August 2017 26
Maui
August 2017 27
Moab
August 2017 28
Univa Grid Engine
August 2017 29
Platform LSF (Load Sharing Facility)
August 2017 30
Slurm
August 2017 31
Choosing the right Scheduler and Workload
Manager
• What is your budget?
• What support level do you need?
• What is your experience with the various schedulers?
• What is your workload?
• High-throughput computing?
• Number of jobs?
• Feature set needed
August 2017 32
SLURM In Depth
• Architecture
• Daemons
• Configuration Files
• Key Configuration Items
• Node Configuration
• Partition Configuration
• Commands
• Test Suite
August 2017 33
SLURM Architecture
August 2017 34
SLURM Daemons
• Daemons
• slurmctld – controller that handles scheduling, communication with nodes,
etc – One per cluster (plus an optional HA pair)
• slurmdbd – (optional) communicates with MySQL database, usually one per
enterprise
• slurmd – runs on a compute node and launches jobs
• slurmstepd – run by slurmd to launch a job step
• munged – authenticates RPC calls (https://code.google.com/p/munge/)
• Install munged everywhere with the same key
• Slurmd
• hierarchical communication between slurmd instances (for scalability)
• slurmctld and slurmdbd can have primary and backup instances
for HA
• State synchronized through shared file system (StateSaveLocation)
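• A sketch of bringing the daemons up by hand, assuming systemd packaging with the usual unit names:
systemctl enable --now munge        # on every node, after installing the shared key
systemctl enable --now slurmctld    # on the controller (and its backup)
systemctl enable --now slurmdbd     # on the accounting database host, if used
systemctl enable --now slurmd       # on each compute node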
August 2017 35
Slurm Prerequisites
• Each node in the cluster must be configured with a MUNGE key and have the daemon running
• A MUNGE-generated credential includes:
• User id
• group id
• time stamp
• whatever else it is asked to sign and/or encrypt
• names of nodes allocated to a job/step
• specific CPUs on each node allocated to job/step, etc.
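• A common way to generate, distribute, and sanity-check the key (node name is a placeholder):
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key   # or use create-munge-key
chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key
munge -n | unmunge               # local round-trip check
munge -n | ssh node01 unmunge    # cross-node check; clocks must be in sync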
August 2017 36
SLURM Configuration Files
• Config files are read directly from the node by commands and daemons
• Config files should be kept in sync everywhere
• Exception slurmdbd.conf: only used by slurmdbd, contains database
passwords
• DebugFlags=NO_CONF_HASH tells Slurm to tolerate some differences. Everything should be
consistent except maybe backfill parameters, etc. that slurmd doesn't need
• Can use “Include /path/to/file.conf” to separate out portions, e.g. partitions,
nodes, licenses
• Can configure generic resources with GresTypes=gpu
• man slurm.conf
• Easy: http://slurm.schedmd.com/configurator.easy.html
• Almost as easy: http://slurm.schedmd.com/configurator.html
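• A heavily trimmed slurm.conf sketch (2017-era parameter names; host, path, and file names are made up):
ClusterName=lci
ControlMachine=head01
StateSaveLocation=/var/spool/slurmctld
GresTypes=gpu
Include /etc/slurm/nodes.conf        # NodeName=... definitions
Include /etc/slurm/partitions.conf   # PartitionName=... definitions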
August 2017 37
Key Configuration Items
August 2017 38
Partition Configuration
August 2017 39
Commands
August 2017 40
A Simple Sequence of Jobs
August 2017 41
Tasks versus Nodes
• Tasks are like processes and can be distributed among nodes as the
scheduler sees fit.
• Nodes means you get that many distinct nodes.
• Must add --exclusive to ensure you are the only user of the node!
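• For example (the executable name is a placeholder):
srun -n 8 ./my_app                        # 8 tasks, packed onto as few nodes as Slurm chooses
srun -N 8 --ntasks-per-node=1 ./my_app    # exactly 8 distinct nodes, one task each
sbatch -N 2 --exclusive job.sh            # 2 whole nodes, shared with no other jobs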
August 2017 42
Interactive Jobs
• salloc uses a similar syntax to sbatch, but blocks until the job is
launched and you then have a shell within which to execute tasks
directly or with srun.
• salloc --ntasks=8 --time=20
salloc: Granted job allocation 104
• Try hostname directly.
• Try srun --label hostname
August 2017 43
srun
August 2017 44
sacct
August 2017 45
sacctmgr
August 2017 46
scancel Command
scancel 101.2
scancel 102
scancel --user=bob --state=pending
August 2017 47
sbcast
August 2017 48
strigger
August 2017 49
Host Range Syntax
• Host range syntax is more compact, allows smaller RPC calls, easier to
read config files, etc
• Node lists have a range syntax with [] using “,” and “-”
• Usable with commands and config files
• n[1-10,40-50] and n[5-20] are valid
• Comma separated lists are allowed:
• a-[1-5]-[1-2],b-3-[1-16],b-[4-5]-[1-2,7,9]
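• scontrol can expand or compress these lists, which is handy in scripts (node names are examples):
scontrol show hostnames n[1-3,5]     # prints n1 n2 n3 n5, one per line
scontrol show hostlist n1,n2,n3,n5   # prints the compact form n[1-3,5]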
August 2017 50
squeue
• Want to see all running jobs on nodes n[4-31] submitted by all users
in account acctE using QOS special with a certain set of job names in
reservation res8 but only show the job ID and the list of nodes the
jobs are assigned to, then sort by time remaining and then by descending
job ID?
• There's a command for that!
• squeue -t running -w n[4-31] -A acctE -q special -n name1,name2 -R
res8 -o "%.10i %N" -S +L,-i
• Way too many options to list here. Read the manpage.
August 2017 51
sbatch,salloc,srun
August 2017 52
sbatch,salloc,srun
• Short and long versions exist for most options
• -N 2 # node count, same as --nodes=2
• In order to get exclusive access to a node add --exclusive
• -n 8 # task count, same as --ntasks=8
• the default behavior is to pack tasks onto as few nodes as possible rather than spreading them out
• -t 2-04:30:00 # time limit in d-h:m:s, d-h, h:m:s, h:m, or m
• -p p1 # partition name(s): can list multiple partitions
• --qos=standby # QOS to use
• --mem=24G # memory per node
• --mem-per-cpu=2G # memory per CPU
• -a 1-1000 # job array
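• Putting several of these together in a batch script; the partition, QOS, and program names are placeholders:
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=04:00:00
#SBATCH --mem=24G
#SBATCH --partition=p1
#SBATCH --qos=standby
srun ./my_app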
August 2017 53
Job Arrays
• Used to submit homogeneous scripts that differ only by an index number
• $SLURM_ARRAY_TASK_ID stores the job's index number (from -a)
• An individual job ID looks like 1234_7, i.e.
${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
• “scancel 1234” for the whole array or “scancel 1234_7” for just
one job in the array
• Prior to 14.11
• Job arrays are purely for convenience
• One sbatch call, scancel can work on the entire array, etc
• Internally, one job entry created for each job array entry at submit time
• Overhead of job array w/1000 tasks is about equivalent to 1000 individual jobs
• Starting in 14.11
• “Meta” job is used internally
• Scheduling code is aware of the homogeneity of the array
• Individual job entries are created once a job is started
• Big performance advantage!
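• A minimal array script as a sketch; the input file naming is invented for illustration:
#!/bin/bash
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --time=30
# each array task works on its own input file
./my_app input.${SLURM_ARRAY_TASK_ID}.dat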
August 2017 54
scontrol
• scontrol can list, set and update a lot of different things
• scontrol show job $jobid # checkjob equiv
• scontrol show node $node
• scontrol show reservation
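• It can also change state, for example (node name, job ID, and reason are made up):
scontrol update NodeName=n01 State=DRAIN Reason="bad DIMM"
scontrol update JobId=1234 TimeLimit=2:00:00
scontrol hold 1234      # pair with: scontrol release 1234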
August 2017 55
Reservations
• Slurm supports time-based reservations on resources, with ACLs for users and groups.
• A system maintenance reservation for 120 minutes:
• scontrol create reservation starttime=2009-02-06T16:00:00 duration=120
user=root flags=maint,ignore_jobs nodes=ALL
• A repeating reservation:
• scontrol create reservation user=alan,brenda starttime=noon duration=60
flags=daily nodecnt=10
• For a specific account:
• scontrol create reservation account=foo user=-alan partition=pdebug
starttime=noon duration=60 nodecnt=2k,2k
• To associate a job with a reservation:
• sbatch --reservation=alan_6 -N4 my.script
• To review reservations:
• scontrol show reservation
August 2017 56
Node Configuration
August 2017 57
SLURM Test Suite
• SLURM includes an extensive test suite that can be used to verify proper operation
• includes over 300 test programs
• executes thousands of jobs
• executes tens of thousands of steps
• change directory to testsuite/expect
• create file “globals.local” with installation specific information
• set slurm_dir “/home/moe/SLURM/install.linux”
• set build_dir “/home/moe/SLURM/build.linux”
• set src_dir “/home/moe/SLURM/slurm.git”
• Execute individual tests or run regression for all tests
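• A hedged example of running it, assuming a source tree laid out as in globals.local above:
cd testsuite/expect
./test1.1        # run a single test
./regression     # run the full regression suite (takes a long time)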
August 2017 58
Plugins
August 2017 59
Robust Accounting
• Likely want:
• AccountingStorageEnforce=associations,limits,qos
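• With enforcement on, limits set through sacctmgr take effect; for example (account, QOS, and values are made up):
sacctmgr modify account chem set GrpTRES=cpu=512
sacctmgr modify qos standby set MaxWall=04:00:00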
August 2017 60
Database Use
August 2017 61
Setup Accounts
August 2017 62
Add Users to Accounts
August 2017 63
Break Time
• Back in 30 minutes
August 2017 64
Limiting and Managing Resources
• Cgroups
• PAM authentication modules
• Processor Affinity
• Containers
August 2017 65
Control Groups (Cgroups)
• Cgroups are a Linux Kernel feature that limits, accounts for, and
isolates the resource usage of a collection of processes
• Used to limit and/or track:
• CPU
• Memory
• Disk I/O
• Network
• Etc…
• Features
• Resource Limiting
• Prioritization
• Accounting
• Control
August 2017 66
Cgroup support within Workload managers
• Torque must be built with cgroup support at configure time
• Built using --enable-cgroups
• A newer version of hwloc is required, which can be built locally
• You must have cgroups mounted when compiling with cgroups
• You cannot disable cgroup support on a live system
• Slurm cgroup support is enabled via multiple plugins
• proctrack (process tracking)
• task (task management)
• jobacct_gather (job accounting statistics)
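• A sketch of the matching configuration, assuming the cgroup variants of those plugins:
# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup
# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes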
August 2017 67
Pam module for compute node
authentication
• A pluggable authentication module (PAM) is a mechanism to
integrate multiple low-level authentication schemes into a high-level
application programming interface (API). It allows programs that rely
on authentication to be written independently of the
underlying authentication scheme.
• Most schedulers have a PAM plugin module that can be used to
restrict ssh access to compute nodes to only nodes where the user
has an active job.
• This will not clean up user sessions that already exist on the compute nodes
August 2017 68
Pam_slurm
• Slurm provides a PAM plugin module that can be used to restrict ssh
access to compute nodes to only nodes where the user has an active
job.
• The pam_slurm PAM plugin is installed by the rpms.
• Need to add:
auth include password-auth
account required pam_slurm.so
account required pam_nologin.so
to /etc/pam.d/sshd
• Only do this on compute nodes! If you put it on the head node it will
lock out users!
August 2017 69
Processor Affinity
August 2017 70
Enabling Processor Affinity on Slurm
August 2017 71
Containers
August 2017 72
Docker
August 2017 73
Shifter
• Developed by the National Energy Research Scientific Computing
Center
• Leverages or integrates with public image repos such as Dockerhub
• Requires no administrator assistance to launch an application inside an image
• Shared resources such as parallel filesystems and network interfaces remain available
• Robust and secure implementation
• Localized data relieves metadata contention, improving application performance
• “native” application execution performance
• Slurm integration via SPANK Plugin
August 2017 74
Singularity
August 2017 75
Scheduling and Cloud
August 2017 76
Accounting
• XDMod
• MAM/Gold
August 2017 77
XDMod
August 2017 78
XDMod
August 2017 79
Moab Accounting Manager
August 2017 80
Job Submission Portals
August 2017 81
Moab Viewpoint
August 2017 82
Lunch
August 2017 83
Hands On with Slurm
August 2017 84
Getting Set Up on Nebula
• Grab handout
• SSH to host
• Find all hosts
• Users and groups
• Exploring SLURM
• Run a simple job
• After you are done exploring, users of the cluster will start submitting
jobs
August 2017 85
Things to know
August 2017 86
Exercise 1: Enabling Fairshare
August 2017 87
Exercise 2: Enable Fairshare for Groups and
Users
• The professors have decided that all departments need to share the
cluster evenly
• They also want all users to share within the account
• Set up hierarchical fairshare between users and between accounts
August 2017 88
Exercise 3: Issues with Priority
August 2017 89
Exercise 4: Limiting Groups with Accounting
August 2017 90
Exercise 5: Enable Preemption for a Low
Queue
• Users want the ability to submit low-priority jobs that can be preempted
• Make sure these jobs only backfill
• They should be half the cost of normal jobs
August 2017 91
Extra Exercise 6: Singularity
August 2017 92
References
• Brett Bode
• https://en.wikipedia.org/wiki/Linux_PAM
• https://en.wikipedia.org/wiki/Platform_LSF
• http://www.nersc.gov/research-and-development/user-defined-images/
• http://singularity.lbl.gov/
• https://geekyap.blogspot.com/2016/11/docker-vs-singularity-vs-shifter-in-hpc.html
• http://clusterdesign.org/
August 2017 93