19 JobSchedulers

The document provides details about supercomputers, including the IBM Blue Gene/Q system. It describes the architecture and components of Blue Gene/Q, such as the compute chip, node boards, and 5D torus interconnect. It also discusses job scheduling on supercomputers, including concepts like backfilling, queues, and batch queueing systems.


Supercomputers

Apr 1, 2024
IBM Blue Gene/Q
• November 2011
  • 4,096-node BG/Q (Sequoia)
  • #17 on the Top500 at 677.10 TF
  • #1 on the Graph 500 at 254 GTEPS (giga traversed edges per second)
  • #1 on the Green 500 at 2.0 GFLOPS/W
• June 2012
  • #1: Sequoia at Lawrence Livermore National Laboratory (#13 in 2019)
    • 96K nodes, 16.3 PF Rmax, 20 PF Rpeak, 7.8 MW
  • #3: Mira at Argonne National Laboratory (#24 in 2019)
    • 48K nodes, 8.1 PF Rmax, 10 PF Rpeak, 3.9 MW
    • Decommissioned in Dec 2019
Real Applications on Sequoia

• Cosmology code HACC: 14 PFLOPS
• Heart simulation code Cardioid: 12 PFLOPS

(Source: Wikipedia)

BG/Q Compute Chip

• 18.96 × 18.96 mm chip (45 nm, 1 billion transistors)
• 16 active cores, plus memory, cache, and NoC
• PowerPC A2 processor core
  • 1.6 GHz
  • 64-bit Power ISA
  • In-order execution
  • 4-way SMT
  • 2-way concurrent instruction issue
  • Quad FPU

The IBM Blue Gene/Q Compute Chip, IEEE MICRO, 2012
Machine Architecture (lstopo)

BG/Q Compute Node Board (32 nodes)

BG/Q Hierarchy
1 rack (1,024 nodes) → 2 midplanes (512 nodes each) → 16 node boards per midplane (32 nodes each); 2 × 16 × 32 = 1,024 nodes per rack
Interconnects in BG

• BG/P has a 3D torus with 425 MB/s per link
• BG/Q has a 5D torus with 2 GB/s per link

Why a 5D torus?
• Lower diameter, higher bisection width, and lower latency than a 3D torus of the same size (see the comparison sketched below)
• High nearest-neighbour bandwidth
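To make these claims concrete, here is a minimal Python sketch that compares the diameter and bisection width of a 3D and a 5D torus with the same node count. The torus shapes (16×16×16 vs. 8×8×8×4×2) are illustrative choices for this sketch, not actual Blue Gene configurations.

# Compare diameter and bisection links of equal-sized tori.
# Torus diameter = sum over dimensions of floor(d/2), since wrap-around
# links halve the worst-case distance in each dimension.

def torus_diameter(dims):
    """Worst-case hop count between any two nodes."""
    return sum(d // 2 for d in dims)

def torus_bisection_links(dims):
    """Links cut when bisecting the longest dimension; the factor of 2
    accounts for the wrap-around links that also cross the cut."""
    nodes = 1
    for d in dims:
        nodes *= d
    return 2 * nodes // max(dims)

for dims in [(16, 16, 16), (8, 8, 8, 4, 2)]:   # both shapes have 4,096 nodes
    print(dims, torus_diameter(dims), torus_bisection_links(dims))
# (16, 16, 16)    -> diameter 24, 512 bisection links
# (8, 8, 8, 4, 2) -> diameter 15, 1024 bisection links

At equal node count, the 5D shape wins on both metrics, which is the argument on this slide.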
BG/Q Messaging Unit and Network Logic
• A, B, C, D, E dimensions (5D torus)
• The last dimension, E, has size 2 (reduces wiring)
• Link chips on each node board connect via optics to node boards on other midplanes
• Dimension-order routing
• On-chip per-hop latency: 40 ns (20 network cycles)
• 16×16×16×12×2 point-to-point latency is about 2.6 μs (a rough model is sketched below)
  • 0.6 μs at 1 hop, 1.17 μs at 13 hops
• Injection and reception FIFOs (more than half the latency is incurred here)
  • Packets arriving on the A− receiver are always placed in the A− reception FIFO
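The per-hop and end-to-end numbers above can be sanity-checked with a rough model. This is a back-of-the-envelope sketch, not IBM's published model: dimension-order routing takes the shorter wrap-around direction in each dimension, and each hop costs about 40 ns.

# Rough point-to-point latency estimate on the 16x16x16x12x2 torus.

DIMS = (16, 16, 16, 12, 2)
HOP_NS = 40  # on-chip per-hop latency from the slide

def hops(src, dst, dims=DIMS):
    """Hop count under dimension-order routing (shorter direction per dimension)."""
    return sum(min((d - s) % n, (s - d) % n) for s, d, n in zip(src, dst, dims))

# Worst case: half-way around every dimension.
worst = hops((0, 0, 0, 0, 0), tuple(n // 2 for n in DIMS))
print(worst, "hops ->", worst * HOP_NS, "ns in the network")
# 31 hops -> 1240 ns

The remaining ~1.4 μs of the reported 2.6 μs is consistent with the slide's note that more than half the latency is incurred in the injection/reception FIFOs and endpoint software.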
BG/Q Network Device

Messaging Unit (MU)


References for BG/Q
• The IBM Blue Gene/Q Compute Chip, IEEE MICRO, 2012.
• The IBM Blue Gene/Q Interconnection Fabric, IEEE MICRO, 2012.
• The IBM Blue Gene/Q Interconnection Network and Message Unit, SC 2011.
• Looking Under the Hood of the IBM Blue Gene/Q Network, SC 2012.
• IBM System Blue Gene Solution: Blue Gene/Q Application Development, IBM Redbooks, 2013.
Supercomputer Job Allocation

[Screenshot: status.alcf.anl.gov → Theta (retired)]
Resources Required
• Number of nodes
• Wall-clock time
• Users are charged for node-hours

Should there be any constraints on the above requirements?

User Jobs
• Different types of applications
• Interactive vs. batch jobs
  • Debug in interactive mode
• Exclusive vs. shared access
• Charged based on total resource usage
• Job is killed when the requested wall-clock time is over
• Need to plan resource usage a priori (a small charging sketch follows)
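As a small illustration of the charging model (illustrative only; rates and billing rules vary by centre):

# Node-hours charging: users pay for nodes x wall-clock time.

def node_hours(nodes, walltime_hours):
    """Total charge units for an exclusive-access job."""
    return nodes * walltime_hours

# A 128-node job with a 6-hour wall-clock request consumes
# up to 768 node-hours of the user's allocation.
print(node_hours(128, 6))  # 768

Because the job is killed when the requested wall-clock time expires, under-requesting wastes the run, while over-requesting can delay scheduling (longer jobs are harder to backfill).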

David Lifka, “The ANL/IBM SP Scheduling System”, JSSPP 1995.
ANL IBM SP System Observations (Typical User Requirements)

Users were asked to use their scheduler and provide feedback.
Desirable Features of Scheduler
• Fair
• Simple
• Low average queue wait times
• High system utilization
• Provide optimum performance for all kinds of jobs
• Support different job classes (interactive vs. batch)
• Provide priority for special jobs

FCFS with Backfilling
• FCFS scheduling
  • Poor system utilization
• Backfilling – to overcome the inefficiency of FCFS
  • Scan the queue for a job that does not cause the first queued job to wait any longer than it otherwise would
  • Improves system utilization
  • Lowers queue waiting times
(A minimal sketch of this policy follows.)
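Here is a minimal Python sketch in the spirit of EASY backfilling: the head job gets a reservation, and a later job may start early only if it fits in the currently free nodes and finishes before that reservation. This is an illustration, not Lifka's exact algorithm; it assumes all jobs are queued at time 0, requested runtimes are exact, and no job requests more than the machine's node count.

import heapq

def easy_backfill(jobs, total_nodes):
    """jobs: list of (name, nodes, runtime) in arrival (FCFS) order.
    Returns {name: start_time}."""
    queue = list(jobs)
    running = []              # min-heap of (end_time, nodes)
    free, now, starts = total_nodes, 0.0, {}

    def start(job):
        nonlocal free
        name, nodes, runtime = job
        starts[name] = now
        free -= nodes
        heapq.heappush(running, (now + runtime, nodes))

    while queue:
        while running and running[0][0] <= now:    # reclaim finished jobs
            free += heapq.heappop(running)[1]
        while queue and queue[0][1] <= free:       # FCFS: start head if it fits
            start(queue.pop(0))
        if not queue:
            break
        # Head job blocked: find its "shadow" start time, i.e. when enough
        # running jobs will have finished to free the nodes it needs.
        avail, shadow = free, 0.0
        for end, nodes in sorted(running):
            avail += nodes
            if avail >= queue[0][1]:
                shadow = end
                break
        # Backfill: later jobs that fit now and end before the reservation.
        for job in queue[1:]:
            if job[1] <= free and now + job[2] <= shadow:
                queue.remove(job)
                start(job)
        now = running[0][0]                        # jump to next completion
    return starts

# 128-node machine: a 96-node job leaves 32 nodes idle while a 128-node
# job waits; the 32-node job backfills without delaying the big job.
print(easy_backfill([("A", 96, 10), ("B", 128, 10),
                     ("C", 32, 10), ("D", 8, 5)], 128))
# {'A': 0.0, 'C': 0.0, 'B': 10.0, 'D': 20.0}

Note that D cannot backfill at time 0 in this run: after C starts, no free nodes remain, so D waits for the head job's reservation to be honoured.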
Backfilling – 128-node Example
[Figure: jobs of 96, 32, and 8 nodes backfilled on a 128-node system]
Scheduler Queues
• Jobs are submitted to a queue
• Different queuing policies (decided by the administrator)
• Multiple queues in some systems
  • Partitioned based on usage
  • Queue waiting times differ across queues
• Static vs. dynamic partitioning (an illustrative queue configuration follows)
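As an illustration of multiple queues with static limits (the queue names and limits here are invented for this sketch, not any particular centre's configuration):

# Route a job request to the most restrictive queue that can hold it.

QUEUES = {
    "debug": {"max_nodes": 8,    "max_walltime_h": 1},
    "small": {"max_nodes": 128,  "max_walltime_h": 24},
    "large": {"max_nodes": 4096, "max_walltime_h": 12},
}

def route(nodes, walltime_h):
    """Pick the smallest queue whose limits accommodate the request."""
    for name, q in sorted(QUEUES.items(), key=lambda kv: kv[1]["max_nodes"]):
        if nodes <= q["max_nodes"] and walltime_h <= q["max_walltime_h"]:
            return name
    raise ValueError("request exceeds all queue limits")

print(route(4, 0.5))    # debug
print(route(64, 10))    # small
print(route(1024, 10))  # large

Keeping small, short jobs in their own queue is what makes low waiting times possible for debugging while long production jobs run elsewhere.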
Anomaly

[Figure: number of jobs executing per day on HPC2010]


An Example Scheduling Policy (144 nodes)

Henderson, “Job Scheduling Under the Portable Batch System”, JSSPP 1995.
An Example Scheduler Script

Henderson, “Job Scheduling Under the Portable Batch System”, JSSPP 1995.
What is missing?

Network Utilization in Different Applications

Petrini and Feng, “Time-Sharing Parallel Jobs in the Presence of Multiple Resource Requirements”, JSSPP 2000.
Network Utilization in FFT

Batch Queueing Systems
• Schedules jobs based on queues
• Has full knowledge of queued and running jobs
• Has full knowledge of resource usage
• Often a combination of best-fit, fair-share, and priority-based policies (a priority sketch follows)
• Designed to be generic, can be customized
• Tailored to meet the scheduling goals of the centre
• Typically FIFO/FCFS with backfilling
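Here is a sketch of how such policies can be combined into a single job priority, in the spirit of fair-share and priority-based scheduling. The weights and factors are illustrative inventions, not any particular scheduler's formula:

# Multi-factor job priority: higher value = scheduled sooner.

def job_priority(wait_hours, usage_ratio, qos_boost,
                 w_age=1.0, w_fairshare=2.0, w_qos=5.0):
    """wait_hours:  time spent in the queue (aging prevents starvation)
    usage_ratio:    user's recent usage / fair allocation (>1 = over-using)
    qos_boost:      administrator-assigned priority class (0..1)"""
    age_factor = min(wait_hours / 24.0, 1.0)        # saturates after a day
    fairshare_factor = max(0.0, 1.0 - usage_ratio)  # heavy users sink
    return w_age * age_factor + w_fairshare * fairshare_factor + w_qos * qos_boost

# A light user who has waited 12 hours vs. a heavy user who just arrived:
print(job_priority(wait_hours=12, usage_ratio=0.2, qos_boost=0.0))  # 2.1
print(job_priority(wait_hours=0,  usage_ratio=1.5, qos_boost=0.0))  # 0.0

The aging term is what keeps the policy fair: even a heavy user's job eventually rises to the top of the queue.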
Workload managers/Schedulers
• Portable Batch System (PBS)
• LoadLeveler
• Application Level Placement Scheduler (ALPS)
• Moab/Torque
• Simple Linux Utility for Resource Management (SLURM)

Example Batch Scheduler
• Network Queueing System (NQS), developed at NASA
• Supported multiple queues of several types
• Each queue could be disabled/enabled
• The number of jobs running in each queue could be tuned

Henderson, “Job Scheduling Under the Portable Batch System”, JSSPP 1995.
Portable Batch System (PBS)
• Genesis of PBS at NASA (evolved from NQS)
• Client commands for submitting, modifying, and monitoring jobs
• Daemons running on service nodes, compute nodes, and servers
PBS Daemons
• Server (pbs_server)
  • Handles PBS commands
  • Creates batch jobs
  • Sends jobs for execution
• Scheduler (pbs_sched)
  • Schedules jobs according to system policy
• MOM (pbs_mom)
  • Manages job execution on hosts
  • Monitors resource usage
  • Records diagnostic messages
  • Notifies the server about job completion
  • Cleans up after job completion

[Diagram: pbs_server, pbs_sched, and pbs_mom run on the service node; pbs_mom runs on each compute node]
PBS Daemons: Interactions
• The server contacts the scheduler when a job is queued or a job terminates
• The scheduler contacts the resource monitor (MOM)
  • Queries resource usage
  • Records diagnostic messages

[Diagram: pbs_server and pbs_sched on the service node communicate with the pbs_mom daemons on the service and compute nodes]
Application Level Placement Scheduler (Cray)
[Diagram: the aprun client and apsys daemon on the login node communicate with the apsched daemon on the service node; apinit daemons on each compute node launch and manage the application]
SLURM
• Controller daemon (slurmctld)
  • Monitors states of nodes
  • Accepts job requests
  • Maintains queue of requests
  • Schedules jobs
  • Initiates job execution and cleanup
  • Polls slurmd periodically
  • Maintains complete state information
• Node daemon (slurmd, one per compute node)
  • Responds to controller requests
  • Maintains job state
  • Initiates, manages, and cleans up processes
  • Handles job I/O
• User commands, with PBS equivalents: squeue (qstat), scancel (qdel), sbatch (qsub)

[Diagram: Slurm architecture, from Jette et al.]
Scheduler Commands (Example)

HPC2010 Example
• qsub -I -X                    (submit an interactive job with X forwarding)
• mpiicc -o sample sample.c     (compile with the Intel MPI compiler wrapper)
• qsub sub.sh                   (submit the batch script below)
• qstat                         (check queue status)
• http://172.31.30.3/new/code/index.html

Example job script (sub.sh):

#!/bin/bash
#PBS -N test                    # job name
#PBS -q small                   # queue to submit to
#PBS -l nodes=2:ppn=8           # 2 nodes, 8 processors per node
#PBS -l walltime=00:05:00       # wall-clock limit; job is killed after this

cd $PBS_O_WORKDIR               # run from the submission directory
source /opt/software/intel/initpaths intel64
export I_MPI_FABRICS=shm:dapl   # shared memory + DAPL fabrics
mpirun -np 8 ./sample
