Job Scheduling in HPC Clusters
maintaining job priority in accordance with the site policy that the administrator has established for the amount and timing of resources used to execute jobs. Based on that information, the scheduler decides which job will execute on which compute node and when.

[Figure 1. Typical resource management system. Users submit jobs to the resource manager, which places them in a job queue; an internal or external job scheduler then assigns the queued jobs to compute nodes 1 through n.]
Understanding job scheduling in clusters

When a job is submitted to a resource manager, the job waits in a queue until it is scheduled and executed. The time spent in the queue, or wait time, depends on several factors, including job priority, load on the system, and availability of requested resources. Turnaround time represents the elapsed time between when the job is submitted and when the job is completed; turnaround time includes the wait time as well as the job's actual execution time. Response time represents how fast a user receives a response from the system after the job is submitted.

Resource utilization during the lifetime of the job represents the actual useful work that has been performed. System throughput is defined as the number of jobs completed per unit of time. Mean response time is an important performance metric for users, who expect minimal response time. In contrast, system administrators are concerned with overall resource utilization because they want to maximize system throughput and return on investment (ROI), especially in high-throughput computing clusters.
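To make these definitions concrete, the following minimal Python sketch computes wait time, turnaround time, utilization, and throughput for a handful of completed jobs. The job records and cluster size are hypothetical; real schedulers derive these figures from their accounting logs.

```python
# Hypothetical accounting records: submit/start/end times in seconds,
# plus the number of processors each job used.
jobs = [
    {"submit": 0,  "start": 10,  "end": 110, "procs": 4},
    {"submit": 5,  "start": 110, "end": 140, "procs": 2},
    {"submit": 20, "start": 110, "end": 170, "procs": 2},
]
TOTAL_PROCS = 8  # assumed cluster size

for j in jobs:
    wait = j["start"] - j["submit"]        # time spent in the queue
    turnaround = j["end"] - j["submit"]    # wait time + execution time
    print(f"wait={wait}s turnaround={turnaround}s")

# Utilization: processor-seconds of useful work over the capacity offered.
span = max(j["end"] for j in jobs) - min(j["submit"] for j in jobs)
busy = sum((j["end"] - j["start"]) * j["procs"] for j in jobs)
print(f"utilization={busy / (TOTAL_PROCS * span):.0%}")

# Throughput: jobs completed per unit of time.
print(f"throughput={len(jobs) / span * 3600:.1f} jobs/hour")
```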
In a typical production environment, many different jobs are submitted to clusters. These jobs can be characterized by factors such as the number of processors requested (also known as job size, or job width), estimated runtime, priority level, parallel or distributed execution, and specific I/O requirements. During execution, large jobs can occupy significant portions of a cluster's processing and memory resources.

System administrators can create several types of queues, each with a different priority level and quality of service (QoS). To make intelligent schedules, however, schedulers need information regarding job size, priority, expected execution time (indicated by the user), resource access permission (established by the administrator), and resource availability (automatically obtained by the scheduler).
In high-performance computing clusters, the scheduling of parallel jobs requires special attention because parallel jobs comprise several subtasks. Each subtask is assigned to a unique compute node, and during execution the nodes constantly communicate among themselves. The manner in which the subtasks are assigned to processors is called mapping. Because mapping affects execution time, the scheduler must map subtasks carefully. The scheduler needs to ensure that nodes scheduled to execute parallel jobs are connected by fast interconnects to minimize the associated communication overhead. For parallel jobs, job efficiency also affects resource utilization: achieving high resource utilization for parallel jobs requires both job efficiency and advanced scheduling, and efficient job processing depends on effective application design.
Under heavy load conditions, the capability to provide a fair portion of the cluster's resources to each user is important. This capability can be provided by using the fair-share strategy, in which the scheduler collects historical data from previously executed jobs and uses that data to dynamically adjust the priority of the jobs in the queue. The capability to adjust priorities dynamically helps ensure that resources are fairly distributed among users.
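The sketch below shows one simple form of this idea; the usage records, equal-share target, and weighting are illustrative assumptions rather than the formula of any particular scheduler.

```python
from collections import defaultdict

# Historical records of completed jobs: (user, processor-seconds consumed).
# A production fair-share scheduler would also decay old records so that
# priorities reflect *recent* consumption.
history = [("alice", 8000), ("alice", 6000), ("bob", 1000)]

usage = defaultdict(float)
for user, proc_seconds in history:
    usage[user] += proc_seconds

TARGET_SHARE = 0.5  # assumed: two users with equal entitlement
total = sum(usage.values())

def fair_share_priority(user, base_priority, weight=100.0):
    """Raise or lower a job's priority by how far its owner is from the target share."""
    actual_share = usage[user] / total if total else 0.0
    return base_priority + weight * (TARGET_SHARE - actual_share)

queue = [("job1", "alice", 50), ("job2", "bob", 50)]
ranked = sorted(queue, key=lambda j: fair_share_priority(j[1], j[2]), reverse=True)
print([name for name, _, _ in ranked])  # bob's job now ranks first
```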
Most job schedulers have several parameters that can be adjusted to control job queues and scheduling algorithms, thus providing different response times and utilization percentages. Usually, high system utilization also means high average response time for jobs, and as system utilization climbs, the average response time tends to increase sharply beyond a certain threshold. This threshold depends on the job-processing algorithms and job profiles. In most cases, improving resource utilization and decreasing job turnaround time are conflicting considerations. The challenge for IT organizations is to maximize resource utilization while maintaining acceptable average response times for users.

Figure 2 summarizes the desirable features of job schedulers. These features can serve as guidelines for system administrators as they select job schedulers.
Figure 2. Desirable features of job schedulers

Broad scope: The nature of jobs submitted to a cluster can vary, so the scheduler must support batch, parallel, sequential, distributed, interactive, and noninteractive jobs with similar efficiency.

Support for algorithms: The scheduler should support numerous job-processing algorithms, including FCFS, FIFO, SJF, LJF, advance reservation, and backfill. In addition, the scheduler should be able to switch between algorithms and apply different algorithms at different times, apply different algorithms to different queues, or both.

Capability to integrate with standard resource managers: The scheduler should be able to interface with the resource manager in use, including common resource managers such as Platform LSF, Sun Grid Engine, and OpenPBS (the original, open source version of Portable Batch System).

Sensitivity to compute node and interconnect architecture: The scheduler should match the appropriate compute node architecture to the job profile, for example, by using compute nodes that have more than one processor to provide optimal performance for applications that can use the second processor effectively.

Scalability: The scheduler should be capable of scaling to thousands of nodes and processing thousands of jobs simultaneously.

Fair-share capability: The scheduler should distribute resources fairly under heavy load conditions and at different times.

Efficiency: The overhead associated with scheduling should be minimal and within acceptable limits. Advanced scheduling algorithms can take time to run; to be efficient, the scheduling algorithm itself must spend less time running than the saving it yields in application execution time.

Dynamic capability: The scheduler should be able to add or remove compute resources for a job on the fly, assuming that the job can adjust to and utilize the extra compute capacity.

Support for preemption: Preemption can occur at various levels; for example, jobs may be suspended while running. Checkpointing, that is, the capability to stop a running job, save the intermediate results, and restart the job later, can help ensure that results are not lost for very long jobs.
Using job scheduling algorithms

The parallel and distributed computing community has put substantial research effort into developing and understanding job scheduling algorithms. Today, several of these algorithms have been implemented in both commercial and open source job schedulers. Scheduling algorithms can be broadly divided into two classes: time-sharing and space-sharing. Time-sharing algorithms divide time on a processor into several discrete intervals, or slots. These slots are then assigned to unique jobs; hence, several jobs at any given time can share the same compute resource. Conversely, space-sharing algorithms give the requested resources to a single job until the job completes execution. Most cluster schedulers operate in space-sharing mode.
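As a minimal illustration of time-sharing, the sketch below carves one processor's timeline into fixed slots and hands them out to jobs in rotation; the slot count and job names are arbitrary.

```python
# Three jobs share one processor; each slot is a fixed time quantum,
# and slots are assigned to jobs in round-robin rotation.
jobs = ["A", "B", "C"]
NUM_SLOTS = 9

timeline = [jobs[slot % len(jobs)] for slot in range(NUM_SLOTS)]
print(timeline)  # ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']
```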
Common, simple space-sharing algorithms are first come, first served (FCFS); first in, first out (FIFO); round robin (RR); shortest job first (SJF); and longest job first (LJF). As the names suggest, FCFS and FIFO execute jobs in the order in which they enter the queue. This strategy is very simple to implement and works acceptably well under a low job load. RR assigns jobs to nodes as they arrive in the queue in a cyclical, round-robin manner. SJF periodically sorts the incoming jobs and executes the shortest job first, allowing short jobs to get a good turnaround time; however, this strategy may delay the execution of long (large) jobs. In contrast, LJF commits resources to the longest jobs first. The LJF approach tends to maximize system utilization at the cost of turnaround time.
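The difference among these orderings reduces to the sort key applied to the queue, as the following sketch shows; the job names and runtime estimates are hypothetical.

```python
# Queued jobs with user-supplied runtime estimates, in arrival order.
queue = [("A", 60), ("B", 5), ("C", 30), ("D", 10)]  # (name, minutes)

fcfs = list(queue)                                         # FCFS/FIFO: arrival order
sjf = sorted(queue, key=lambda job: job[1])                # shortest job first
ljf = sorted(queue, key=lambda job: job[1], reverse=True)  # longest job first

def mean_wait(order):
    """Mean wait time if the jobs run back to back on the same resource."""
    wait = elapsed = 0
    for _, runtime in order:
        wait += elapsed
        elapsed += runtime
    return wait / len(order)

print(mean_wait(fcfs), mean_wait(sjf), mean_wait(ljf))  # 55.0 16.25 62.5
```

On this toy queue, SJF yields the lowest mean wait and LJF the highest, matching the turnaround trade-off described above.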
Basic scheduling algorithms such as these can be enhanced by combining them with advance reservation and backfill techniques. Advance reservation uses execution time predictions provided by users to reserve resources (such as CPUs and memory) and to generate a schedule. The backfill technique improves space-sharing scheduling: given a schedule with advance-reserved, high-priority jobs and a list of low-priority jobs, a backfill algorithm tries to fit the small jobs into scheduling gaps. This allocation does not alter the sequence of previously scheduled jobs, but it improves system utilization by running low-priority jobs in between high-priority jobs. To use backfill, the scheduler requires a runtime estimate of each small job, which the user supplies when submitting the job.
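The core of a backfill pass can be sketched in a few lines. The version below makes strong simplifying assumptions (a fixed processor pool, integer time steps, and trustworthy runtime estimates); production backfill algorithms, such as those discussed later for Maui, are considerably more elaborate.

```python
TOTAL_PROCS = 8  # assumed cluster size

# High-priority jobs already placed by advance reservation:
# (name, procs, start, runtime) in abstract time units.
schedule = [("H1", 6, 0, 4), ("H2", 8, 4, 2)]

def procs_free(schedule, t):
    """Processors not committed to any scheduled job at time t."""
    return TOTAL_PROCS - sum(p for _, p, s, r in schedule if s <= t < s + r)

def backfill(schedule, candidates, horizon=12):
    """Fit low-priority jobs into gaps without moving any reserved job."""
    for name, procs, runtime in candidates:
        for start in range(horizon):
            # The job fits only if enough processors stay free for its
            # entire estimated runtime starting at this candidate time.
            if all(procs_free(schedule, t) >= procs
                   for t in range(start, start + runtime)):
                schedule.append((name, procs, start, runtime))
                break
    return schedule

low_priority = [("L1", 2, 4), ("L2", 2, 1)]  # (name, procs, runtime estimate)
print(backfill(schedule, low_priority))
# L1 starts at t=0 beside H1; L2 waits until t=6, after the reservations.
```

Because candidate jobs are placed only where enough processors are free for their whole estimated runtime, the reserved high-priority jobs are never delayed.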
Figure 3 illustrates the use of the basic algorithms and the enhancements discussed in this article. Figure 3a shows a queue with 11 jobs waiting; the queue has both high-priority and low-priority jobs. Figure 3b shows these jobs sorted according to their estimated execution time.

The example in Figure 3 assumes an eight-processor cluster and considers only two parameters: the number of processors and the estimated execution time. The figure shows the effects of generating schedules using the LJF and SJF algorithms with and without backfill techniques. Sections c through f of Figure 3 indicate that backfill can improve schedules generated by LJF and SJF by increasing utilization, decreasing response time, or both. To generate the schedules shown, the low- and high-priority jobs are sorted separately.

[Figure 3. Scheduling an example queue on an eight-processor cluster (compute nodes versus time): (a) jobs waiting in the queue; (b) the queue after each priority group is sorted according to execution time, yielding sorted high-priority and sorted low-priority jobs; (c) longest job first schedule; (d) longest job first and backfill schedule; (e) shortest job first schedule; (f) shortest job first and backfill schedule.]

Examining a commercial resource manager and an external job scheduler

This section introduces scheduling features of a commercial resource manager, Load Sharing Facility (LSF) from Platform Computing, and an open source job scheduler, Maui.
Platform Load Sharing Facility resource manager

Platform LSF is a popular resource manager for clusters. Its focus is to maximize resource utilization within the constraints of local administration policies. Platform Computing offers two products: Platform LSF and Platform LSF HPC. LSF is designed to handle a broad range of job types, such as batch, parallel, distributed, and interactive. LSF HPC is optimized for HPC parallel applications, providing additional facilities for intelligent scheduling that enable different QoS in different queues. LSF also implements a hierarchical fair-share scheduling algorithm to balance resources among users under all load conditions.

Platform LSF has built-in schedulers that implement advanced scheduling algorithms to provide easy configurability and high reliability for users. In addition to basic scheduling algorithms, Platform LSF uses advanced techniques such as advance reservation and backfill.
Platform LSF and Platform LSF HPC both have a dynamic scheduling decision mechanism in which scheduling decisions are based on processing load. Based on these decisions, jobs can be dynamically migrated among compute nodes or rescheduled, and loads can be balanced among compute nodes in heterogeneous environments. These features make Platform LSF suitable for a broad range of HPC applications. Platform LSF can also apply multiple scheduling algorithms to different queues simultaneously. Platform LSF HPC can make intelligent scheduling decisions based on the features of advanced interconnect networks, thus enhancing process mapping for parallel applications.
The term resource has a broad definition in Platform LSF and Platform LSF HPC. Resources can be CPUs, memory, storage space, or software licenses. (In some sectors, software licenses are expensive and are considered a valuable resource.)
Platform LSF and Platform LSF HPC each have an extensive advance reservation system that can reserve different kinds of resources. In some distributed applications, many instances of the same application are required to perform parametric studies. Platform LSF and Platform LSF HPC allow users to submit a job group that can contain a large number of jobs, making parametric studies much easier to manage.
Platform LSF and Platform LSF HPC can interface with external schedulers such as Maui. External schedulers can complement the features of the resource manager and enable sophisticated scheduling. For example, using Platform LSF HPC, the hierarchical fair-share algorithm can dynamically adjust priorities and feed these priorities to Maui for use in scheduling decisions.
Maui job scheduler

Maui is an advanced open source job scheduler that is specifically designed to optimize system utilization in policy-driven, heterogeneous HPC environments. Its focus is on fast turnaround of large parallel jobs, making Maui highly suitable for HPC. Maui can work with several common resource managers, including Platform LSF and Platform LSF HPC, and can potentially improve scheduling performance compared with built-in schedulers.

Maui has a two-phase scheduling algorithm. During the first phase, high-priority jobs are scheduled using advance reservation. In the second phase, a backfill algorithm is used to schedule low-priority jobs between previously scheduled jobs. Maui uses the fair-share technique when making scheduling decisions based on job history. Note that Maui's internal behavior is based on a single, unified queue, which maximizes the opportunity to utilize resources.
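A compressed sketch of that two-phase structure follows; the job fields and the back-to-back reservation policy are illustrative assumptions, not Maui's actual internals, and phase two reuses the backfill idea sketched earlier.

```python
def reserve(high_jobs):
    """Phase 1 (stand-in): reserve high-priority jobs back to back in time."""
    schedule, t = [], 0
    for name, procs, runtime in high_jobs:
        schedule.append((name, procs, t, runtime))
        t += runtime
    return schedule

def two_phase(queue):
    """Split the unified queue by priority, reserve first, then backfill."""
    high = [j for (j, qos) in queue if qos == "high"]
    low = [j for (j, qos) in queue if qos == "low"]
    schedule = reserve(high)
    # Phase 2 would now backfill the low-priority jobs into the gaps
    # left between reservations, exactly as in the earlier sketch.
    return schedule, low

queue = [(("H1", 6, 4), "high"), (("L1", 2, 4), "low"), (("H2", 8, 2), "high")]
print(two_phase(queue))
```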
Typically, users are guaranteed a certain QoS, but Maui gives a significant amount of control to administrators, allowing local policies to govern access to resources, especially for scheduling. For example, administrators can enable different QoS and access levels for users and jobs, and can flag jobs as eligible for preemption. Maui uses a tool called QBank for allocation management; QBank allows multisite control over the use of resources. Another Maui feature allows charge rates (the amount users pay for compute resources) to be based on QoS, resources, and time of day. Maui can scale to thousands of jobs even though its scheduler daemon is not distributed but centralized, running on a single node.

Maui supports job preemption, which can occur under several conditions. High-priority jobs can preempt lower-priority or backfilled jobs if the resources to run the high-priority jobs are not otherwise available. In some cases, resources reserved for high-priority jobs can be used to run low-priority jobs when no high-priority jobs are in the queue; when high-priority jobs are subsequently submitted, these low-priority jobs can be preempted to reclaim the resources.
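These rules amount to a victim-selection policy. The sketch below captures one plausible version; the job fields, the preference for backfilled jobs, and the all-or-nothing rule are illustrative assumptions, not Maui's implementation.

```python
def preemption_victims(running, incoming, needed_procs):
    """Pick lower-priority jobs to preempt so an incoming high-priority
    job can obtain the processors it needs."""
    # Only lower-priority jobs are eligible; prefer backfilled jobs,
    # since they ran opportunistically in reserved gaps.
    eligible = sorted(
        (j for j in running if j["priority"] < incoming["priority"]),
        key=lambda j: (not j["backfilled"], j["priority"]),
    )
    victims, freed = [], 0
    for job in eligible:
        if freed >= needed_procs:
            break
        victims.append(job["name"])
        freed += job["procs"]
    return victims if freed >= needed_procs else []  # preempt all or nothing

running = [{"name": "L1", "priority": 1, "procs": 4, "backfilled": True},
           {"name": "M1", "priority": 5, "procs": 4, "backfilled": False}]
incoming = {"name": "H1", "priority": 9, "procs": 4}
print(preemption_victims(running, incoming, incoming["procs"]))  # ['L1']
```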
Maui also has a simulation mode that can be used to evaluate the effect of queuing parameters on scheduler performance. Because each HPC environment has a unique job profile, the parameters of the queues and scheduler can be tuned based on historical logs to maximize scheduler performance.

Satisfying ever-increasing computing demands

As cluster sizes scale to satisfy growing computing needs in various industries as well as in academia, advanced schedulers can help maximize resource utilization and QoS. The profile of jobs, the nature of the computation performed by the jobs, and the number of jobs submitted can help determine the benefits of using advanced schedulers.

Saeed Iqbal, Ph.D., is a systems engineer and advisor in the Scalable Systems Group at Dell. He has a Ph.D. in Computer Engineering from The University of Texas at Austin.

Rinku Gupta is a systems engineer and advisor in the Scalable Systems Group at Dell. She has a B.E. in Computer Engineering from Mumbai University in India and an M.S. in Computer Information Science from The Ohio State University.

Yung-Chin Fang is a senior consultant in the Scalable Systems Group at Dell. He specializes in cyberinfrastructure management and high-performance computing.