19 Job Schedulers
Apr 1, 2024
IBM Blue Gene/Q
• November 2011
• 4,096-node BG/Q (Sequoia)
• #17 on top500 at 677.10 TF
• #1 Graph 500 at 254 GTEPS (giga traversed edges per second)
• #1 on Green 500 list at 2.0 Gflops/W
• June 2012
• #1 Sequoia at Lawrence Livermore National Laboratory (#13 in 2019)
• 96K nodes, 16.3 PF Max, 20 PF Peak, 7.8 MW
• #3 Mira at Argonne National Laboratory (#24 in 2019)
• 48K nodes, 8.1 PF Max, 10 PF Peak, 3.9 MW
• Decommissioned in Dec 2019
Real Applications on Sequoia
BG/Q Compute Node Board (32 nodes)
BG/Q Hierarchy
1 rack (1,024 nodes) -> 2 midplanes (512 nodes each) -> 16 node boards per midplane (32 nodes each)
Interconnects in BG
Why 5D torus?
- Lower diameter, higher bisection width, and lower latency than a 3D torus (see the worked comparison below)
- High nearest-neighbour bandwidth
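As a hypothetical comparison (node count chosen for round numbers, not a BG/Q configuration): the farthest node along a ring of length k is k/2 hops away, so for 32,768 nodes a 32x32x32 3D torus has diameter 3 x 16 = 48 hops, while an 8x8x8x8x8 5D torus of the same size has diameter only 5 x 4 = 20 hops.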
BG/Q Messaging Unit and Network Logic
• A, B, C, D, E dimensions (5D torus)
• Last dimension E is of size 2 (reduces wiring)
• Link chips on each node board connect via optics to node boards on other midplanes
• Dimension-order routing (example below)
• Injection and reception FIFOs (more than half of the latency is incurred here)
• Packets arriving on the A- receiver are always placed in the A- reception FIFO
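For example, with dimension-order routing a packet whose destination differs in the A, C, and E coordinates first completes all of its A hops, then its C hops, then its E hops; the fixed A, B, C, D, E order is what makes the routing deterministic.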
BG/Q Network Device
Supercomputer Job Allocation
status.alcf.anl.gov -> Theta (retired)
Resources Required
• Number of nodes
• Wall-clock time
• Users are charged for node-hours
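For example (numbers purely illustrative), a job that requests 2 nodes for 5 hours of wall-clock time is charged 2 x 5 = 10 node-hours.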
User Jobs
• Different types of applications
• Interactive vs. batch jobs
• Debug in interactive mode (see the example below)
• Exclusive vs. shared access
• Charged based on total resource usage
• Job is killed when requested wall-clock time is over
• Need to plan resource usage a priori
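On a PBS-managed cluster an interactive job can typically be requested as follows (the queue name and limits here are assumptions for illustration):

qsub -I -q small -l nodes=1:ppn=8 -l walltime=00:30:00

The scheduler allocates the nodes and opens a shell on the first one; a batch job instead runs a submitted script with no user interaction.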
David Lifka, The ANL/IBM SP Scheduling System, JSSPP 1995.
ANL IBM SP System Observations (Typical User Requirement)
FCFS with Backfilling
• FCFS scheduling
• Poor system utilization
• Backfilling – to overcome the inefficiency of FCFS
• Scan the queue for a job that does not cause the first queued job to wait any longer than it otherwise would (see the sketch after the example below)
• Improves system utilization
• Lowers queue waiting times
Backfilling – 128-node Example
(Figure: backfilling on a 128-node machine with jobs requesting 96, 32, and 8 nodes.)
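A minimal sketch of the backfilling check, loosely based on the 128-node example (the shadow time and job sizes below are assumptions, not values from the slide): a waiting job may start out of order only if it fits in the currently idle nodes and finishes before the reserved start time of the blocked head-of-queue job.

#!/bin/bash
# Backfilling admission check (illustrative sketch, not a real scheduler).
total_nodes=128
running_nodes=96                               # one job currently occupies 96 nodes
free_nodes=$((total_nodes - running_nodes))    # 32 idle nodes
shadow_minutes=60                              # assumed time until the head-of-queue
                                               # job (needing all 128 nodes) can start

can_backfill () {
  local nodes=$1 walltime=$2
  if [ "$nodes" -le "$free_nodes" ] && [ "$walltime" -le "$shadow_minutes" ]; then
    echo "backfill: ${nodes}-node job, ${walltime} min -> starts now"
  else
    echo "wait: ${nodes}-node job, ${walltime} min -> would delay the head job"
  fi
}

can_backfill 8 30     # fits in the 32 idle nodes and ends before the reservation
can_backfill 32 90    # fits in the idle nodes but runs past the head job's reserved start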
Scheduler Queues
• Jobs are submitted to a queue
• Different queuing policies, decided by the administrator (see the configuration sketch below)
• Multiple queues in some systems
• Based on the usage
• Queue waiting times differ across queues
• Static vs. dynamic partitioning
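A hypothetical sketch of how an administrator might define such queues with the PBS/Torque qmgr utility (queue names and limits are invented for illustration):

qmgr -c "create queue debug queue_type=execution"
qmgr -c "set queue debug resources_max.walltime=01:00:00"
qmgr -c "set queue debug resources_max.nodect=8"
qmgr -c "set queue debug enabled=true"
qmgr -c "set queue debug started=true"
qmgr -c "create queue production queue_type=execution"
qmgr -c "set queue production resources_max.walltime=24:00:00"
qmgr -c "set server default_queue=production"

A short, small-job debug queue like this typically has short waits, while a large production queue may hold jobs for much longer.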
Anomaly
Henderson, “Job Scheduling Under the Portable Batch System”, JSSPP 1995.
An Example Scheduler Script
Henderson, “Job Scheduling Under the Portable Batch System”, JSSPP 1995.
What is missing?
Network Utilization in Different Applications
Petrini and Feng, Time-Sharing Parallel Jobs in the Presence of Multiple Resource Requirements, JSSPP 2000.
Network Utilization in FFT
Batch Queueing Systems
• Schedules jobs based on queues
• Has full knowledge of queued and running jobs
• Has full knowledge of resource usage
• Often a combination of best-fit, fair-share, and priority-based policies (toy example below)
• Designed to be generic, can be customized
• Tuned to meet the scheduling goals of the centre
• Typically FIFO/FCFS with backfilling
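As a purely illustrative sketch (not any particular scheduler's formula), a priority-based policy might combine job aging with a fair-share factor:

#!/bin/bash
# Toy job-priority score: the weights and inputs are invented for illustration.
wait_minutes=90      # how long the job has been queued (aging)
fairshare=40         # 0-100; lower means the user has consumed more than their share
weight_age=2
weight_fs=10
priority=$(( weight_age * wait_minutes + weight_fs * fairshare ))
echo "priority score: $priority"    # a higher score is scheduled earlier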
Workload managers/Schedulers
• Portable Batch System (PBS)
• LoadLeveler
• Application Level Placement Scheduler (ALPS)
• Moab/Torque
• Simple Linux Utility for Resource Management (SLURM)
Example Batch Scheduler
• Network Queueing System (NQS), developed at NASA
• Supported multiple queues of several types
• Disable/enable each queue
• Tune the number of jobs running in each queue
Henderson, “Job Scheduling Under the Portable Batch System”, JSSPP 1995.
Portable Batch System (PBS)
• Genesis of PBS at NASA (as a successor to NQS)
• Client commands for submitting, modifying, and monitoring jobs (examples below)
• Daemons running on service nodes, compute nodes, and servers
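For reference, the standard PBS client commands look like this (the job ID and script name are hypothetical):

qsub job.sh                          # submit a batch job script
qstat -u $USER                       # monitor your queued and running jobs
qalter -l walltime=00:10:00 1234     # modify a queued job's resource request
qdel 1234                            # delete/cancel a job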
PBS daemons
• pbs_server (service node)
• Handles PBS commands
• Creates batch jobs
• Sends jobs for execution
• pbs_sched (service node): the scheduling daemon, decides which queued job runs next
• pbs_mom (compute nodes): starts and monitors jobs on the execution hosts
Application Level Placement Scheduler (Cray)
(Figure: ALPS daemons spread across the login node, service node, and compute nodes; an apinit daemon runs on each compute node.)
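Under ALPS, the batch system reserves the compute nodes and the application is then launched onto them with aprun; the option values below are an illustrative assumption, not taken from the slide:

aprun -n 32 -N 16 ./a.out    # 32 ranks, 16 per node, on the reserved compute nodes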
SLURM
• Central controller daemon (slurmctld):
• Monitors the states of nodes
• Accepts job requests
• Maintains the queue of requests
• Schedules jobs
• Initiates job execution and cleanup
• Polls slurmd (the per-node daemon) periodically
• Maintains complete state information
• User commands squeue, scancel, and sbatch correspond to qstat, qdel, and qsub in PBS
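A minimal SLURM batch script that mirrors the PBS example on the next slide (the directive values and the executable name are assumptions for illustration):

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=small
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:05:00
cd "$SLURM_SUBMIT_DIR"
srun ./a.out

It would be submitted with sbatch, monitored with squeue, and cancelled with scancel.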
HPC2010 example job script:
#!/bin/bash
#PBS -N test
#PBS -q small
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:05:00
cd $PBS_O_WORKDIR
# launch the application from here, e.g. with mpiexec/mpirun