HIGH-PERFORMANCE COMPUTING

Planning Considerations for Job Scheduling in HPC Clusters


As cluster installations continue growing to satisfy ever-increasing computing demands,
advanced schedulers can help improve resource utilization and quality of service. This
article discusses issues related to job scheduling on clusters and introduces scheduling
algorithms to help administrators select a suitable job scheduler.

BY SAEED IQBAL, PH.D.; RINKU GUPTA; AND YUNG-CHIN FANG

Cluster installations primarily comprise two types of standards-based hardware components—servers and networking interconnects. Clusters are divided into two major classes: high-throughput computing clusters and high-performance computing clusters. High-throughput computing clusters usually connect a large number of nodes using low-end interconnects. In contrast, high-performance computing clusters connect more powerful compute nodes using faster interconnects than high-throughput computing clusters. Fast interconnects are designed to provide lower latency and higher bandwidth than low-end interconnects.

These two classes of clusters have different scheduling requirements. In high-throughput computing clusters, the main goal is to maximize throughput—that is, jobs completed per unit of time—by reducing load imbalance among compute nodes in the cluster. Load balancing is particularly important if the cluster has heterogeneous compute nodes. In high-performance computing clusters, an additional consideration arises: the need to minimize communication overhead by mapping applications appropriately to the available compute nodes. High-throughput computing clusters are suitable for executing loosely coupled parallel or distributed applications, because such applications do not have high communication requirements among compute nodes during execution time. High-performance computing clusters are more suitable for tightly coupled parallel applications, which have substantial communication and synchronization requirements.

A resource management system manages the processing load by preventing jobs from competing with each other for limited compute resources. Typically, a resource management system comprises a resource manager and a job scheduler (see Figure 1). Most resource managers have an internal, built-in job scheduler, but system administrators can usually substitute an external scheduler for the internal scheduler to enhance functionality. In either case, the scheduler communicates with the resource manager to obtain information about queues, loads on compute nodes, and resource availability to make scheduling decisions.

Usually, the resource manager runs several daemons on the master node and compute nodes, including a scheduler daemon, which typically runs on the master node. The resource manager also sets up a queuing system for users to submit jobs—and users can query the resource manager to determine the status of their jobs. In addition, a resource manager maintains a list of available compute resources and reports the status of previously submitted jobs to the user. The resource manager helps organize submitted jobs based on priority, resources requested, and availability.

As shown in Figure 1, the scheduler receives periodic input from the resource manager regarding job queues and available resources, and makes a schedule that determines the order in which jobs will be executed. This is done while maintaining job priority in accordance with the site policy that the administrator has established for the amount and timing of resources used to execute jobs. Based on that information, the scheduler decides which job will execute on which compute node and when.

Figure 1. Typical resource management system (diagram: users submit jobs to a job queue; the resource manager, through its internal job scheduler or an external job scheduler, assigns jobs to compute nodes)
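
To make this division of labor concrete, the following minimal Python sketch models a resource manager that holds the queue and a scheduler pass that polls it. All names (Job, ResourceManager, schedule_step) and field choices are illustrative assumptions, not the interface of any particular product.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Job:
    """A submitted job and the attributes schedulers typically consider."""
    job_id: int
    user: str
    procs: int          # number of processors requested (job size)
    est_runtime: float  # user-supplied execution time estimate, in seconds
    priority: int = 0   # higher value means higher priority

@dataclass
class ResourceManager:
    """Tracks the job queue; the scheduler polls it for queue and load data."""
    total_procs: int
    queue: List[Job] = field(default_factory=list)

    def submit(self, job: Job) -> None:
        self.queue.append(job)

def schedule_step(rm: ResourceManager, free_procs: int) -> List[Job]:
    """One scheduling pass: launch queued jobs, highest priority first,
    that fit on the processors currently reported free."""
    launched = []
    for job in sorted(rm.queue, key=lambda j: -j.priority):
        if job.procs <= free_procs:
            free_procs -= job.procs
            launched.append(job)
    for job in launched:
        rm.queue.remove(job)
    return launched
```
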
Understanding job scheduling in clusters

When a job is submitted to a resource manager, the job waits in a queue until it is scheduled and executed. The time spent in the queue, or wait time, depends on several factors including job priority, load on the system, and availability of requested resources. Turnaround time represents the elapsed time between when the job is submitted and when the job is completed; turnaround time includes the wait time as well as the job's actual execution time. Response time represents how fast a user receives a response from the system after the job is submitted.

Resource utilization during the lifetime of the job represents the actual useful work that has been performed. System throughput is defined as the number of jobs completed per unit of time. Mean response time is an important performance metric for users, who expect minimal response time. In contrast, system administrators are concerned with overall resource utilization because they want to maximize system throughput and return on investment (ROI), especially in high-throughput computing clusters.
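
These metrics follow directly from each job's submit, start, and finish timestamps. A short worked example in Python (the timestamps are invented for illustration):

```python
# Hypothetical (submit, start, finish) timestamps, in seconds, for three jobs.
jobs = [(0, 10, 40), (5, 40, 55), (20, 55, 115)]

waits = [start - submit for submit, start, _ in jobs]        # time in queue
turnarounds = [finish - submit for submit, _, finish in jobs]

# Throughput: jobs completed per unit of time over the whole run.
makespan = max(f for _, _, f in jobs) - min(s for s, _, _ in jobs)
throughput = len(jobs) / makespan

print(sum(waits) / len(waits))              # mean wait: ~26.7 s
print(sum(turnarounds) / len(turnarounds))  # mean turnaround: ~61.7 s
print(throughput)                           # ~0.026 jobs per second
```
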
In a typical production environment, many different jobs are submitted to clusters. These jobs can be characterized by factors such as the number of processors requested (also known as job size, or job width), estimated runtime, priority level, parallel or distributed execution, and specific I/O requirements. During execution, large jobs can occupy significant portions of a cluster's processing and memory resources.

System administrators can create several types of queues, each with a different priority level and quality of service (QoS). To make intelligent schedules, however, schedulers need information regarding job size, priority, expected execution time (indicated by the user), resource access permission (established by the administrator), and resource availability (automatically obtained by the scheduler).

In high-performance computing clusters, the scheduling of parallel jobs requires special attention because parallel jobs comprise several subtasks. Each subtask is assigned to a unique compute node during execution, and nodes constantly communicate among themselves during execution. The manner in which the subtasks are assigned to processors is called mapping. Because mapping affects execution time, the scheduler must map subtasks carefully. The scheduler needs to ensure that nodes scheduled to execute parallel jobs are connected by fast interconnects to minimize the associated communication overhead. For parallel jobs, the job efficiency also affects resource utilization. To achieve high resource utilization for parallel jobs, both job efficiency and advanced scheduling are required. Efficient job processing depends on effective application design.
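
As a toy illustration of interconnect-aware mapping, the sketch below tries to place all of a job's subtasks on nodes within a single fast-interconnect group. The grouping model and the smallest-fit rule are assumptions made for this example, not an algorithm from the article.

```python
def map_subtasks(num_subtasks, free_nodes_by_group):
    """Place each subtask on its own node, preferably in one interconnect group.

    free_nodes_by_group: group name -> list of free node names, where nodes
    in the same group share a fast, low-latency interconnect.
    Returns the chosen node names, or None if no single group fits.
    """
    viable = [g for g, nodes in free_nodes_by_group.items()
              if len(nodes) >= num_subtasks]
    if not viable:
        return None  # a real mapper might span groups as a fallback
    # Pick the smallest viable group, keeping larger groups free for big jobs.
    best = min(viable, key=lambda g: len(free_nodes_by_group[g]))
    return free_nodes_by_group[best][:num_subtasks]

# Example: a 4-subtask job lands entirely inside switch group "b".
groups = {"a": ["a1", "a2"], "b": ["b1", "b2", "b3", "b4", "b5"]}
print(map_subtasks(4, groups))  # ['b1', 'b2', 'b3', 'b4']
```
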
Under heavy load conditions, the capability to provide a fair portion of the cluster's resources to each user is important. This capability can be provided by using the fair-share strategy, in which the scheduler collects historical data from previously executed jobs and uses the historical data to dynamically adjust the priority of the jobs in the queue. The capability to dynamically make priority changes helps ensure that resources are fairly distributed among users.
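
A minimal sketch of that idea, assuming usage is tracked per user as CPU-seconds over a recent window; the linear demotion rule is illustrative, and production schedulers typically apply tunable decay factors instead:

```python
def fair_share_priorities(queued_jobs, usage_history, base_priority=100.0):
    """Dynamically rank queued jobs, demoting users with heavy recent usage.

    queued_jobs:   list of (job_id, user) pairs
    usage_history: user -> CPU-seconds consumed in a recent window
    """
    total = sum(usage_history.values()) or 1.0
    return {
        job_id: base_priority * (1.0 - usage_history.get(user, 0.0) / total)
        for job_id, user in queued_jobs
    }

# User "a" consumed three times as much as "b", so b's job now ranks higher.
print(fair_share_priorities([(1, "a"), (2, "b")],
                            {"a": 3000.0, "b": 1000.0}))
# {1: 25.0, 2: 75.0}
```
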
Most job schedulers have several parameters that can be adjusted to control job queues and scheduling algorithms, thus providing different response times and utilization percentages. Usually, high system utilization also means high average response time for jobs—and as system utilization climbs, the average response time tends to increase sharply beyond a certain threshold. This threshold depends on the job-processing algorithms and job profiles. In most cases, improving resource utilization and decreasing job turnaround time are conflicting considerations. The challenge for IT organizations is to maximize resource utilization while maintaining acceptable average response times for users.

Figure 2 summarizes the desirable features of job schedulers. These features can serve as guidelines for system administrators as they select job schedulers.

Figure 2. Features of job schedulers

Broad scope: The nature of jobs submitted to a cluster can vary, so the scheduler must support batch, parallel, sequential, distributed, interactive, and noninteractive jobs with similar efficiency.

Support for algorithms: The scheduler should support numerous job-processing algorithms—including FCFS, FIFO, SJF, LJF, advance reservation, and backfill. In addition, the scheduler should be able to switch between algorithms and apply different algorithms at different times—or apply different algorithms to different queues, or both.

Capability to integrate with standard resource managers: The scheduler should be able to interface with the resource manager in use, including common resource managers such as Platform LSF, Sun Grid Engine, and OpenPBS (the original, open source version of Portable Batch System).

Sensitivity to compute node and interconnect architecture: The scheduler should match the appropriate compute node architecture to the job profile—for example, by using compute nodes that have more than one processor to provide optimal performance for applications that can use the second processor effectively.

Scalability: The scheduler should be capable of scaling to thousands of nodes and processing thousands of jobs simultaneously.

Fair-share capability: The scheduler should distribute resources fairly under heavy load conditions and at different times.

Efficiency: The overhead associated with scheduling should be minimal and within acceptable limits. Advanced scheduling algorithms can take time to run. To be efficient, the scheduling algorithm itself must spend less time running than the expected saving in application execution time from improved scheduling.

Dynamic capability: The scheduler should be able to add or remove compute resources to a job on the fly—assuming that the job can adjust and utilize the extra compute capacity.

Support for preemption: Preemption can occur at various levels; for example, jobs may be suspended while running. Checkpointing—that is, the capability to stop a running job, save the intermediate results, and restart the job later—can help ensure that results are not lost for very long jobs.

Using job scheduling algorithms

The parallel and distributed computing community has put substantial research effort into developing and understanding job scheduling algorithms. Today, several of these algorithms have been implemented in both commercial and open source job schedulers. Scheduling algorithms can be broadly divided into two classes: time-sharing and space-sharing. Time-sharing algorithms divide time on a processor into several discrete intervals, or slots. These slots are then assigned to unique jobs. Hence, several jobs at any given time can share the same compute resource. Conversely, space-sharing algorithms give the requested resources to a single job until the job completes execution. Most cluster schedulers operate in space-sharing mode.

Common, simple space-sharing algorithms are first come, first served (FCFS); first in, first out (FIFO); round robin (RR); shortest job first (SJF); and longest job first (LJF). As the names suggest, FCFS and FIFO execute jobs in the order in which they enter the queue. This is a very simple strategy to implement, and it works acceptably well with a low job load. RR assigns jobs to nodes as they arrive in the queue in a cyclical, round-robin manner. SJF periodically sorts the incoming jobs and executes the shortest job first, allowing short jobs to get a good turnaround time. However, this strategy may cause delays for the execution of long (large) jobs. In contrast, LJF commits resources to the longest jobs first. The LJF approach tends to maximize system utilization at the cost of turnaround time.
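
Because these basic policies differ only in how they order the queue, they can be expressed as sort keys. A small Python sketch with made-up jobs:

```python
# Each queued job is (job_id, arrival_time, est_runtime_seconds).
queue = [(1, 0.0, 120.0), (2, 1.0, 30.0), (3, 2.0, 600.0), (4, 3.0, 45.0)]

def fcfs(jobs):
    """First come, first served: run in arrival order (FIFO behaves alike)."""
    return sorted(jobs, key=lambda j: j[1])

def sjf(jobs):
    """Shortest job first: good turnaround for short jobs; long jobs may wait."""
    return sorted(jobs, key=lambda j: j[2])

def ljf(jobs):
    """Longest job first: tends to raise utilization at the cost of turnaround."""
    return sorted(jobs, key=lambda j: -j[2])

print([j[0] for j in fcfs(queue)])  # [1, 2, 3, 4]
print([j[0] for j in sjf(queue)])   # [2, 4, 1, 3]
print([j[0] for j in ljf(queue)])   # [3, 1, 4, 2]
```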

Basic scheduling algorithms such as these can be enhanced by combining them with the use of advance reservation and backfill techniques. Advance reservation uses execution time predictions provided by the users to reserve resources (such as CPUs and memory) and to generate a schedule. The backfill technique improves space-sharing scheduling. Given a schedule with advance-reserved, high-priority jobs and a list of low-priority jobs, a backfill algorithm tries to fit the small jobs into scheduling gaps. This allocation does not alter the sequence of jobs previously scheduled, but improves system utilization by running low-priority jobs in between high-priority jobs. To use backfill, the scheduler requires a runtime estimate of the small jobs, which is supplied by the user when jobs are submitted.

Figure 3 illustrates the use of the basic algorithms and the enhancements discussed in this article. Figure 3a shows a queue with 11 jobs waiting; the queue has both high-priority and low-priority jobs. Figure 3b shows these jobs sorted according to their estimated execution time.

The example in Figure 3 assumes an eight-processor cluster and considers only two parameters: the number of processors and the estimated execution time. This figure shows the effects of generating schedules using the LJF and SJF algorithms with and without backfill techniques. Sections c through f of Figure 3 indicate that backfill can improve schedules generated by LJF and SJF, either by increasing utilization, decreasing response time, or both. To generate the schedules shown, the low- and high-priority jobs are sorted separately.

Figure 3. Job scheduling algorithms (diagram panels, each schedule plotting compute nodes against time: a. jobs waiting in the queue; b. the queue after each priority group is sorted according to execution time; c. longest job first schedule; d. longest job first and backfill schedule; e. shortest job first schedule; f. shortest job first and backfill schedule)
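
The following Python sketch combines the two techniques under stated assumptions: every job carries a user-supplied runtime estimate, jobs are placed in priority order (longest first within a priority level, as in Figure 3c), and each job is reserved the earliest start at which its processor request fits. Because earlier placements are never moved, later low-priority jobs can only slide into gaps, a conservative form of backfill; the function names are invented for this example.

```python
def used_procs(placed, t):
    """Processors busy at time t; placed jobs are (start, runtime, procs)."""
    return sum(p for s, r, p in placed if s <= t < s + r)

def fits(placed, start, procs, runtime, total_procs):
    """True if `procs` processors stay free over [start, start + runtime).
    Utilization only rises when a job starts, so it suffices to check
    `start` itself and every job start inside the window."""
    points = [start] + [s for s, r, p in placed if start < s < start + runtime]
    return all(used_procs(placed, t) + procs <= total_procs for t in points)

def reserve_and_backfill(jobs, total_procs):
    """Schedule (job_id, procs, est_runtime, priority) tuples; each job must
    request no more than total_procs. Returns job_id -> reserved start time."""
    order = sorted(jobs, key=lambda j: (-j[3], -j[2]))  # priority, then LJF
    placed, starts = [], {}
    for job_id, procs, runtime, _prio in order:
        # A job can start at time 0 or whenever some placed job ends.
        candidates = sorted({0.0} | {s + r for s, r, p in placed})
        start = next(t for t in candidates
                     if fits(placed, t, procs, runtime, total_procs))
        placed.append((start, runtime, procs))
        starts[job_id] = start
    return starts

# Example in the spirit of Figure 3: an 8-processor cluster, two priorities.
jobs = [(1, 6, 4.0, 1), (2, 4, 3.0, 1), (3, 2, 2.0, 1),  # high priority
        (4, 2, 3.0, 0), (5, 1, 1.0, 0)]                  # low priority
print(reserve_and_backfill(jobs, total_procs=8))
# {1: 0.0, 2: 4.0, 3: 0.0, 4: 2.0, 5: 4.0} -- job 4 backfills at t=2.0
# without delaying job 2's reservation at t=4.0
```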

Examining a commercial resource manager and an external job scheduler

This section introduces the scheduling features of a commercial resource manager, Load Sharing Facility (LSF) from Platform Computing, and an open source job scheduler, Maui.

Platform Load Sharing Facility resource manager

Platform LSF is a popular resource manager for clusters. Its focus is to maximize resource utilization within the constraints of local administration policies. Platform Computing offers two products: Platform LSF and Platform LSF HPC. LSF is designed to handle a broad range of job types such as batch, parallel, distributed, and interactive. LSF HPC is optimized for HPC parallel applications by providing additional facilities for intelligent scheduling, which enables different QoS in different queues. LSF also implements a hierarchical fair-share scheduling algorithm to balance resources among users under all load conditions.

Platform LSF has built-in schedulers that implement advanced scheduling algorithms to provide easy configurability and high reliability for users. In addition to basic scheduling algorithms, Platform LSF uses advanced techniques like advance reservation and backfill.

Platform LSF and Platform LSF HPC both have a dynamic scheduling decision mechanism. The scheduling decisions under this mechanism are based on processing load. Based on these decisions, jobs can be migrated among compute nodes or rescheduled. Loads can also be balanced among compute nodes in heterogeneous environments. These features make Platform LSF suitable for a broad range of HPC applications. In addition, Platform LSF can dynamically migrate jobs among compute nodes. Platform LSF can also have multiple scheduling algorithms applied to different queues simultaneously. Platform LSF HPC can make intelligent scheduling decisions based on the features of advanced interconnect networks, thus enhancing process mapping for parallel applications.

The term resource has a broad definition in Platform LSF and Platform LSF HPC. Resources can be CPUs, memory, storage space, or software licenses. (In some sectors, software licenses are expensive and are considered a valuable resource.)

Platform LSF and Platform LSF HPC each have an extensive advance reservation system that can reserve different kinds of resources. In some distributed applications, many instances of the same application are required to perform parametric studies. Platform LSF and Platform LSF HPC allow users to submit a job group that can contain a large number of jobs, making parametric studies much easier to manage.

Platform LSF and Platform LSF HPC can interface with external schedulers such as Maui. External schedulers can complement features of the resource manager and enable sophisticated scheduling. For example, using Platform LSF HPC, the hierarchical fair-share algorithm can dynamically adjust priorities and feed these priorities to Maui for use in scheduling decisions.

Maui job scheduler

Maui is an advanced open source job scheduler that is specifically designed to optimize system utilization in policy-driven, heterogeneous HPC environments. Its focus is on fast turnaround of large parallel jobs, making the Maui scheduler highly suitable for HPC. Maui can work with several common resource managers, including Platform LSF and Platform LSF HPC, and can potentially improve scheduling performance compared to built-in schedulers.

Maui has a two-phase scheduling algorithm. During the first phase, high-priority jobs are scheduled using advance reservation. In the second phase, a backfill algorithm is used to schedule low-priority jobs between previously scheduled jobs. Maui uses the fair-share technique when making scheduling decisions based on job history. Note that Maui's internal behavior is based on a single, unified queue, which maximizes the opportunity to utilize resources.

Typically, users are guaranteed a certain QoS, but Maui gives a significant amount of control to administrators—allowing local policies to control access to resources, especially for scheduling. For example, administrators can enable different QoS and access levels for users and jobs, and jobs can be identified for preemption. Maui uses a tool called QBank for allocation management. QBank allows multisite control over the use of resources. Another Maui feature allows charge rates (the amount users pay for compute resources) to be based on QoS, resources, and time of day. Maui is scalable to thousands of jobs, despite its nondistributed scheduler daemon, which is centralized and runs on a single node.

Maui supports job preemption, which can occur under several conditions. High-priority jobs can preempt lower-priority or backfill jobs if resources to run the high-priority jobs are not available. In some cases, resources reserved for high-priority jobs can be used to run low-priority jobs when no high-priority jobs are in the queue. However, when high-priority jobs are submitted, these low-priority jobs can be preempted to reclaim resources for the high-priority jobs.
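
A minimal sketch of that preemption decision, using invented names and a simple lowest-priority-first victim rule (real schedulers consult richer policies, such as preemption limits and checkpointing support):

```python
def preempt_for(needed_procs, running, total_procs):
    """Choose running jobs to suspend so an arriving high-priority job fits.

    running: list of (job_id, procs, priority) for currently executing jobs.
    Returns the job_ids to suspend, or None if preemption cannot free enough.
    """
    free = total_procs - sum(p for _, p, _ in running)
    victims = []
    # Suspend the lowest-priority (for example, backfilled) jobs first.
    for job_id, procs, _prio in sorted(running, key=lambda j: j[2]):
        if free >= needed_procs:
            break
        victims.append(job_id)
        free += procs
    return victims if free >= needed_procs else None

# Example: 8 processors with 2 free; a 6-processor high-priority job arrives.
running = [(10, 4, 5), (11, 2, 1)]  # (job_id, procs, priority)
print(preempt_for(6, running, total_procs=8))  # [11, 10]: suspend both
```
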
Maui has a simulation mode that can be used to evaluate the effect of queuing parameters on scheduler performance. Because each HPC environment has a unique job profile, the parameters of the queues and scheduler can be tuned based on historical logs to maximize scheduler performance.

Satisfying ever-increasing computing demands

As cluster sizes scale to satisfy growing computing needs in various industries as well as in academia, advanced schedulers can help maximize resource utilization and QoS. The profile of jobs, the nature of computation performed by the jobs, and the number of jobs submitted can help determine the benefits of using advanced schedulers.

Saeed Iqbal, Ph.D., is a systems engineer and advisor in the Scalable Systems Group at Dell. He has a Ph.D. in Computer Engineering from The University of Texas at Austin.

Rinku Gupta is a systems engineer and advisor in the Scalable Systems Group at Dell. She has a B.E. in Computer Engineering from Mumbai University in India and an M.S. in Computer Information Science from The Ohio State University.

Yung-Chin Fang is a senior consultant in the Scalable Systems Group at Dell. He specializes in cyberinfrastructure management and high-performance computing.
