
IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 31, NO. 1, FEBRUARY 2023

Machine Learning Feature Based Job Scheduling for Distributed Machine Learning Clusters

Haoyu Wang, Student Member, IEEE, Zetian Liu, and Haiying Shen, Senior Member, IEEE, ACM

Abstract— With the rapid proliferation of Machine Learning (ML) and Deep Learning (DL) applications running on modern platforms, it is crucial to satisfy application performance requirements such as meeting deadlines and ensuring accuracy. To this end, researchers have proposed several job schedulers for ML clusters. However, none of the previously proposed schedulers consider ML model parallelism, though it has been proposed as an approach to increase the efficiency of running large-scale ML and DL jobs. Thus, in this paper, we propose an ML job Feature based job Scheduling system (MLFS) for ML clusters running both data parallelism and model parallelism ML jobs. MLFS first uses a heuristic scheduling method that considers an ML job's spatial and temporal features to determine task priority for job queue ordering in order to improve job completion time (JCT) and accuracy performance. It uses the data from the heuristic scheduling method for training a deep reinforcement learning (RL) model. After the RL model is well trained, it then switches to the RL method to automatically make decisions on job scheduling. In addition, MLFS has a system load control method that selects tasks from overloaded servers to move to underloaded servers based on task priority, and also intelligently removes the tasks that generate little or no improvement on the desired accuracy performance when the system is overloaded, to improve JCT and accuracy by the job deadline. Furthermore, we propose an Optimal ML iteration stopping method that determines the proper time to stop training an ML model when the model reaches its minimum loss value. Our real experiments and large-scale simulation based on a real trace show that MLFS reduces JCT by up to 53% and makespan by up to 52%, and improves accuracy by up to 64% when compared with existing ML job schedulers. We also open sourced our code.

Index Terms— Machine learning, resource management, job scheduling.

Manuscript received 20 February 2022; accepted 10 June 2022; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor G. Joshi. Date of publication 27 July 2022; date of current version 16 February 2023. This work was supported in part by U.S. NSF under Grant NSF-1827674 and Grant CCF-1822965, in part by the Microsoft Research Faculty Fellowship under Grant 8300751, and in part by Amazon Web Services (AWS) Machine Learning Research Awards. An early version of this work was presented in the Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies (CoNEXT 2020) [DOI: 10.1145/3386367.3432588]. (Corresponding author: Haiying Shen.) The authors are with the Department of Computer Science, University of Virginia, Charlottesville, VA 22903 USA (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TNET.2022.3190797

I. INTRODUCTION

THE rapid proliferation and development of Machine Learning (ML) and Deep Learning (DL) techniques have paved the way for a future where intelligent applications can assist people with personal tasks such as living, working, and learning, as well as broader community goals such as more efficient transportation, effective healthcare, and other activities. For example, scientists use ML techniques to predict the spread of a virus (e.g., coronavirus in Wuhan, China in 2020) among the population, and the path of a hurricane (e.g., Hurricane Dorian in 2019) for in-time precaution in affected areas. Many companies (e.g., Amazon and Google) and academic organizations now provide platforms for users to train their ML models. As shown in Figure 1, ML jobs are submitted to an ML cluster and put into a job queue where they wait to be assigned to servers. Jobs always have performance requirements (e.g., deadline and accuracy). For example, an ML job for predicting a hurricane path must be completed by a certain time before the hurricane landfall with a high prediction accuracy to maximally avoid damage and casualties. Therefore, it is crucial to ensure that all running and waiting ML jobs in an ML cluster complete in time without degrading their accuracy performance.

Fig. 1. ML job cluster.

In recent years, ML and DL models have been growing deeper and larger, which dramatically increases the job completion time (JCT). To meet the high-scalability requirement of increasingly large-scale ML or DL applications, model parallelism has been proposed [1]–[4], in which an ML model is partitioned, and multiple partitions run in parallel. As the scale of ML and DL models rapidly grows, a job scheduler for model parallelism is increasingly necessary to help reduce the JCTs of jobs while guaranteeing a high degree of accuracy. However, to the best of our knowledge, none of the previously proposed schedulers can handle both model and data parallelism, in spite of the previous research on job scheduling in ML clusters [5]–[17].

Therefore, in this paper, we propose a Machine Learning job Feature based job Scheduling system (MLFS) for both data parallelism and model parallelism ML jobs in an ML cluster. In addition, MLFS is also applicable to data parallelism only ML jobs and model parallelism only ML jobs. MLFS should meet the following requirements.

First, it should reduce the bandwidth cost, which can be the bottleneck of distributed ML jobs [18], [19] when enhancing JCT, where frequent data transmission between tasks is needed. For example, the communication overhead between GPUs is 970MB-3168MB per mini-batch for data parallelism jobs and 784MB-1037MB per mini-batch for model parallelism jobs [20]. Second, it should concurrently improve JCT and accuracy since both are important for ML jobs. For example, hurricane path prediction jobs are time critical and also require high accuracy to avoid damage and casualties. Third, it should prioritize jobs by considering their urgency and accuracy requirements. For example, hurricane path prediction is time-critical and requires high accuracy, but annual people movement prediction is time tolerant and does not require very high accuracy. Fourth, it should still provide a job deadline guarantee or low JCT when the system is overloaded.

In summary, the goal of MLFS is to minimize the average JCT, maximize the average job accuracy, maximize the number of jobs whose deadline and accuracy requirements are satisfied, and minimize the bandwidth cost. MLFS is novel in that it intelligently takes advantage of the spatial/temporal features of ML jobs. First, different tasks (for different model partitions) run separately, and the tasks have different impacts on the final job latency and accuracy (i.e., spatial features). For example, in the ML model partition graph (or task dependency graph), if a task has more dependent tasks, or its dependent tasks are in layers closer to the task, it should run earlier to improve job latency and accuracy by the job deadline. In addition, the model partition size (measured by the number of ML model parameters) influences the final accuracy result – a larger size generates a higher impact and vice versa. Second, an ML job usually runs many iterations, and earlier iterations have a higher impact on the accuracy than later iterations [6] (i.e., temporal features). Therefore, the tasks in earlier iterations should have a higher priority to run and vice versa. Accordingly, MLFS first uses a heuristic scheduling method that considers these features to determine task priority for job queue ordering in order to improve the JCT and accuracy performance. To relieve extra load from an overloaded server, it also considers the task priority in selecting tasks to move out from the server and puts them in the queue to be rescheduled. It uses the data from the heuristic scheduling method for training a deep reinforcement learning (RL) model. After the RL model is well trained, it then switches to the RL method to automatically make decisions on job scheduling. Furthermore, when the system is overloaded, MLFS has a system load control method that selects tasks from overloaded servers to move to underloaded servers based on task priority and also intelligently terminates the tasks that generate little improvement on the required accuracy performance, which helps improve both JCT and accuracy by the job deadline. MLFS consists of the following components:

(1) ML feature based heuristic task scheduling (MLF-H). MLF-H determines the task priority based on the spatial/temporal ML job features and computation features (e.g., deadline, waiting time, remaining job time) to improve both JCT and accuracy. Tasks that contribute more to the accuracy and JCT are given higher priorities. The priority of a task is used for queuing tasks and allocating tasks when there are servers with available resources. MLF-H also handles overloaded servers by considering task priority and other factors in selecting tasks to migrate out of overloaded servers, and reschedules these tasks along with the waiting tasks in the queue. MLF-H further tries to fully utilize resources and reduce bandwidth cost when selecting migration tasks from an overloaded server and when selecting a host server to allocate a task.

(2) ML feature based RL task scheduling (MLF-RL). MLFS initially runs MLF-H for a certain time period and uses the data to train a deep RL model, and it then switches to MLF-RL when the model is well trained. Given both running and waiting tasks as well as nodes, based on their status, MLF-RL selects tasks to move out of overloaded nodes and determines the destination node or the queue for the selected tasks and the waiting tasks in the queue to achieve the aforementioned goal.

(3) ML feature based system load control (MLF-C). When the system is overloaded, jobs will experience high JCT and low accuracy by the job deadline. MLF-C can stop running or generating tasks once the desired accuracy is reached (based on users' choices) to relieve system workload in order to improve JCT and accuracy by the job deadline. MLF-C also adopts an optimal ML iteration stopping method that finds the iteration to stop training in order to achieve the maximum accuracy while minimizing the number of iterations.

(4) Optimal ML iteration stopping (OptS). OptS can find the proper time to stop training an ML model near and after the minimum loss epoch so that the ML job does not waste computation time and resources on further training.

We conducted real experiments using Pytorch [21] on AWS [22] and large-scale simulation based on a real workload trace [23]. Extensive experimental results show the superior performance of MLFS compared to state-of-the-art methods in [5]–[8], [24] and the effectiveness of each of its components. We open-sourced our code in Github [25].

The rest of this paper is organized as follows. Section II presents related work. Section III presents our goal and the details of MLFS. Section IV presents the enhancement method. Section VI presents performance evaluation. Section VII presents the summary of the experiment results and the limitations of our methods. Section VIII presents our conclusions and our future work.

II. RELATED WORK

A significant amount of research effort has been devoted to job scheduling in clusters and cloud datacenters (e.g., [26]–[33]). In this paper, we focus on job scheduling for ML clusters [5]–[8], [24], [34], [35]. Li et al. [17] proposed a parameter server framework for ML data parallelism using first-in-first-out (FIFO) scheduling. TensorFlow [36] uses the Borg resource manager [24] that aims to achieve fairness of resource allocation among different jobs. SLAQ [6] aims to maximize the overall job accuracy. For the available CPU resource, SLAQ predicts the loss reduction and runtime if different numbers of CPU cores are assigned to each of the running jobs, and then chooses the job with the maximum loss reduction per unit runtime to adjust the number of CPU cores.


Tiresias [8] is proposed to schedule DL jobs in a GPU cluster to reduce JCT. It determines job priority (for queuing and preemption when there are no available GPUs) using two principles: 1) for jobs without prior knowledge of their task running time, the least-attended service principle gives higher priorities to the jobs that received less service time; and 2) for jobs with a known task running time distribution on GPUs, the priority is determined by how likely the job can complete within the next service epoch if it receives the available GPUs. Gandiva [7] uses first-in-first-out (FIFO) queuing. Also, it defines the jobs with the same number of GPU requirements as affinity jobs and tries to put the affinity jobs on the same machine to more fully utilize resources. Meanwhile, to relieve the extra load of an overloaded GPU (i.e., GPU utilization is higher than a threshold), Gandiva moves the job with the lowest GPU utilization to the GPU with the lowest utilization to avoid resource fragmentation and achieve high resource efficiency. Based on Gandiva, Gandiva-fair [14] was proposed, which is a distributed fair share scheduler that balances conflicting goals between efficiency and fairness of resource allocation among users. Optimus [9] uses online fitting to predict an ML job's convergence performance during training and estimates the training speed as a function of the resources assigned to each job. Then, it tends to assign more resources to the job which will be finished sooner in order to reduce the average JCT while providing an accuracy guarantee. HyperDrive [10] tends to give high priorities for resource allocation to those jobs that have higher possibilities to achieve better accuracy performance (based on the ML job model accuracy prediction) while reducing JCT. HyperSched [11] aims to produce a trained model with higher accuracy before the pre-set deadline under a certain resource constraint. This method pauses jobs that do not increase accuracy significantly and tends to assign more resources to the job with more accuracy improvement before its deadline. AlloX [12] transforms the job scheduling problem into a min-cost bipartite matching problem in order to reduce the average JCT. TetriSched [13] aims to improve the cluster utilization and also the number of jobs that are completed before their pre-set deadlines. It formulates the job scheduling problem into an optimization problem and uses integer linear programming to solve it. Yabu et al. [15] proposed an algorithm that first identifies whether a job is trial-and-error or best-effort and then tends to give more resources to best-effort jobs in order to reduce the JCT. The trial-and-error jobs aim to find the best hyperparameter configurations, and the best-effort jobs try to achieve the best accuracy improvement with a fixed hyperparameter configuration. DL2 [16] is an RL-based job scheduling algorithm which uses job completion time as a reward directly. Dorm [34] is a cluster management system that uses a container-based virtualization technique to partition a cluster and then runs one application per partition to increase resource utilization and speed up ML jobs. Narayanan et al. [37] proposed PipeDream, a method that uses pipeline parallelism to enable faster DNN training by combining intra-batch parallelism with inter-batch parallelization within one job. Grandl et al. [38] proposed Graphene, a job scheduler for general jobs with a task dependency DAG (directed acyclic graph). Within one job, Graphene tends to first assign the available resources to the "troublesome" tasks (the tasks that have more dependent tasks and tough-to-pack resource demands, e.g., a high resource demand of one job) and then assigns the remaining resources to the remaining tasks without dependency violation. For a set of jobs, Graphene determines the order of multiple jobs based on weighted scores calculated from multiple job scheduling objectives including average job completion time, cluster throughput and fairness. Mirhoseini et al. [5] applied RL to job scheduling in a GPU cluster to minimize the average JCT. The scheduler scans all tasks and then maps the tasks to the appropriate GPUs. In the past few years, RL has been used for job scheduling for general jobs in clusters to improve the average JCT [26], [39].

Compared with the above job schedulers, MLFS is novel in that it leverages the ML job spatial/temporal features in job scheduling to improve their JCT and accuracy.

Fig. 2. ML platform with data parallelism and model parallelism.

III. SYSTEM DESIGN

A. Assumptions

MLFS can be used for all types of ML workloads, and it significantly benefits ML jobs running in the model parallelism manner. Based on prior research, we make several assumptions. First, we assume that the accuracy of a job can be predicted. We adopt the previously proposed method in [40] for the accuracy estimation, which achieves around 90% accuracy. We will explain this method in Section III-E. The accuracy at a certain iteration is predicted based on the number of iterations executed and the accuracy change for each executed epoch. Therefore, a user does not need to enter the details of his/her job for the accuracy prediction in our system. We also assume that the total job running time can be predicted accurately using the approach in [9].

B. Problem Formulation

Currently, TensorFlow uses data parallelism, which divides a training dataset into multiple mini-batches, and each mini-batch (task) runs in a worker (i.e., a computing slot). We use the complex scenario with both data parallelism and model parallelism to present our methods. Our MLFS can also be applied to the data parallelism only, model parallelism only and no parallelism scenarios. In the scenario of both data parallelism and model parallelism (as shown in Figure 2), a task (running in a worker) computes one model partition for one mini-batch. The tasks form a task dependency graph based on the data flow between the tasks.

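To make the task model above concrete, the following minimal sketch (not taken from the MLFS implementation) shows one way a job and its task dependency graph could be represented; the class and field names are illustrative assumptions.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Task:
        """One model partition applied to one mini-batch (runs in a worker)."""
        task_id: str
        job_id: str
        partition_params: int          # number of ML model parameters in this partition
        minibatch_id: int
        children: List[str] = field(default_factory=list)  # tasks consuming this task's output

    @dataclass
    class Job:
        """An ML job with its tasks and the data-flow (dependency) graph."""
        job_id: str
        deadline: float                # d^r_J
        accuracy_requirement: float    # a^r_J
        urgency: int                   # L_J in [0, m]
        tasks: Dict[str, Task] = field(default_factory=dict)

        def dependency_edges(self):
            """Yield (parent, child) pairs of the task dependency graph."""
            for t in self.tasks.values():
                for c in t.children:
                    yield t.task_id, c

    # Example: a job whose model is split into two partitions, each fed two mini-batches.
    job = Job(job_id="j1", deadline=3600.0, accuracy_requirement=0.9, urgency=2)
    for b in (0, 1):
        job.tasks[f"p0_b{b}"] = Task(f"p0_b{b}", "j1", partition_params=1_000_000,
                                     minibatch_id=b, children=[f"p1_b{b}"])
        job.tasks[f"p1_b{b}"] = Task(f"p1_b{b}", "j1", partition_params=250_000, minibatch_id=b)

    print(list(job.dependency_edges()))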

The parameter server communication structure and the all-reduce communication structure are used to accumulate the learned parameters in training. In the parameter server communication structure, each task sends its training results to the parameter server, which accumulates the results and updates the ML model. To apply the data and model parallelism to the parameter server communication structure, an ML network is divided into different partitions, and each partition runs in a worker and handles one mini-batch. The workers transmit data based on the ML partition dependency, and the final workers in the dependency graph (i.e., model partition graph) send the training results to the parameter server. In the all-reduce communication structure [41], one parameter server containing the whole model and one worker are combined together as one reducer, which is the unit execution processor. For the model parameter update in each iteration, each reducer communicates with other reducers to update model parameters using a communication topology (e.g., ring all-reduce [42] and 2D-Torus [43]). To apply data and model parallelism to the all-reduce communication structure, we can directly use the existing communication topology for the parameter transmission between workers to update their models. Therefore, the task dependency graph in data and model parallelism applies to both the parameter server and all-reduce communication structures.

Assume that J represents the job set, T represents the task set, and N represents the node set and the queue. We use (k, n) to represent that task k is allocated to node n or the queue. Our goal is to find a task allocation plan A = {(k, n) | k ∈ T, n ∈ N} that minimizes the average JCT, maximizes the average ML model accuracy, maximizes the number of jobs whose deadline and accuracy requirements are satisfied, and minimizes the bandwidth cost between nodes. For job J, we use d_J to denote its JCT, d^r_J its deadline, a_J its final accuracy, and a^r_J its accuracy requirement. Then, the goal can be represented by the formulas below:

g_1(A) = 1 / (Σ_{J∈J} d_J / |J|)
g_2(A) = Σ_{J∈J} 1(d^r_J ≥ d_J)
g_3(A) = 1 / Σ_{n_i,n_j∈N} B_{n_i,n_j}
g_4(A) = Σ_{J∈J} 1(a_J ≥ a^r_J)
g_5(A) = Σ_{J∈J} a_J / |J|
max (g_1(A), g_2(A), g_3(A), g_4(A), g_5(A)),    (1)

in which 1 is a binary indicator function, and B_{n_i,n_j} is the overall size of the data transmitted between n_i and n_j. The goal is to maximize the value of each objective function g_i(A) (i = 1, 2, . . . , 5) simultaneously. Note that this problem formulation covers both the all-reduce communication structure and the parameter server communication structure. This is a multi-objective optimization problem. We could use the adaptive epsilon constraint algorithm [44] to solve the above multi-objective optimization problem. However, due to its high computation overhead, we instead propose a heuristic task scheduling method to obtain the solution of this optimization problem and an RL-based task scheduling method in the following sections.

C. ML Feature Based Heuristic Task Scheduling (MLF-H)

1) Priority Determination: We use P_{k,J} to denote the priority of task k of job J, which is in its I-th iteration. We use I_max to denote the specified maximum number of iterations of job J. We first introduce how we consider the ML spatial and temporal features to determine the priority (denoted by P^ML_{k,J}), and then introduce how we consider the traditional computation features to determine the priority (denoted by P^C_{k,J}). Finally, we combine the two priority values to calculate P_{k,J}.

First, the jobs running in a cluster have different urgency levels, as explained in Section I. The system sets m levels of job urgent coefficients L_J ∈ [0, m]; a higher urgent coefficient means that the job is more urgent and vice versa. The urgent coefficients of jobs in a cluster can be pre-determined by the system administrator based on the urgency of the applications or specified by the users. A higher urgent coefficient specified by a user leads to a higher monetary cost to the user. Giving higher priorities to jobs with higher urgency levels can help the jobs meet their deadline requirement.

Second, for an ML job, based on the ML temporal features, the earlier iterations are more important than later iterations for improving accuracy because of the diminishing loss reduction returns [6]. Therefore, tasks that will gain higher accuracy improvement in the next iteration should have a higher priority to be scheduled in order to improve the average accuracy of jobs in the system. We use the inverse proportional function 1/I (I ≥ 1) to represent the importance of the current iteration of the job. A larger ratio 1/I means an earlier stage of the job running, so the I-th iteration contributes more to the accuracy improvement of the ML job. We use δl_{I−1} to denote the loss reduction of the most recently finished iteration I − 1. Then, Σ_{j=1}^{I−1} δl_j denotes the overall loss reduction achieved by all the completed iterations, and δl_{I−1} / Σ_{j=1}^{I−1} δl_j represents the normalized loss reduction of the most recently completed iteration. A higher value of δl_{I−1} / Σ_{j=1}^{I−1} δl_j means that the I-th iteration contributes more to the loss reduction. Note that these two formulas represent the trends of general ML jobs and can be replaced by other appropriate formulas specific to certain ML jobs.

Third, considering the ML spatial features in model parallelism, a larger model partition usually plays a more important role in calculating the model parameters. We measure the model partition size by the number of ML model parameters in this model partition. We use S^J_k = S_k / S^J to denote the normalized size of the model partition of task k, where S_k is the size of the model partition and S^J is the size of the entire ML model. Giving a higher priority to a larger model partition can contribute more to increasing the model accuracy.

Combining all of these ML features, for task k that has no dependent tasks, its priority is calculated by:

P^ML_{k,J} = L_J · (1/I) · (δl_{I−1} / Σ_{j=1}^{I−1} δl_j) · S^J_k    (2)

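To illustrate Equation (2), the following is a small sketch of the base ML-feature priority for a task with no dependent tasks; the function signature and the way the loss-reduction history is passed in are our own assumptions, not the MLFS code.

    def ml_feature_priority(urgency, iteration, loss_reductions, partition_size, model_size):
        """Base ML-feature priority of Equation (2) for a task with no dependent tasks.

        urgency         -- job urgent coefficient L_J
        iteration       -- current iteration index I (>= 1)
        loss_reductions -- [delta_l_1, ..., delta_l_{I-1}] for finished iterations
        partition_size  -- number of parameters in this task's model partition (S_k)
        model_size      -- number of parameters in the whole model (S^J)
        """
        iteration_importance = 1.0 / iteration                        # 1/I: earlier iterations matter more
        total_reduction = sum(loss_reductions) or 1.0                 # guard against division by zero
        recent_share = (loss_reductions[-1] / total_reduction) if loss_reductions else 1.0
        normalized_size = partition_size / model_size                 # S_k / S^J
        return urgency * iteration_importance * recent_share * normalized_size

    # A task in its 3rd iteration whose last iteration still reduced the loss noticeably.
    print(ml_feature_priority(urgency=2, iteration=3,
                              loss_reductions=[0.30, 0.10],
                              partition_size=250_000, model_size=1_250_000))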

Next, we consider the dependency of the tasks (or workers) in the dependency graph as shown in Figure 2. The more tasks that depend on task k, the higher priority task k should have, because its completion enables more other tasks to start running, which helps reduce JCT. In addition, if more dependent tasks are in layers closer to task k in the dependency graph, then it should have a higher priority because its completion enables more tasks to start earlier. Based on this rationale, we calculate the priority of task k considering ML features, P^ML_{k,J}, as follows:

P^ML_{k,J} = { P^ML_{k,J},                                       no dependent tasks
             { P^ML_{k,J} + γ · Σ_{i∈child(k)} P^ML_{i,J},       otherwise        (3)

where γ ∈ (0, 1) is a discounting factor to count the closeness between a dependent task and task k, and child(k) is the set of direct children of task k in the dependency graph. A larger γ means a higher weight is given to the priorities of a task's children when determining the task's priority. This priority determination method can be directly applied to the all-reduce communication structure to determine the priority of each task running in a worker. For the parameter server communication structure, which has a separate centralized parameter server, its computation is also considered as a task and assigned the highest priority, since only after the parameter server is determined do the tasks in the workers know where to send their results.

For the computation features of task k of job J, we consider its task deadline (d_{k,J}), remaining running time (r_{k,J}) and waiting time in the queue (w_{k,J}). We consider two cases to estimate the remaining running time. If a job was executed previously, the deadline of each of its tasks can be calculated based on the job's deadline, dependency graph and historical task running time, as in [8]. We use the historical task running time as the task's required running time. If a job was not executed previously [1], we use sample running to estimate the required running time of each of the job's tasks and then infer the deadline of each task. Then, a task's remaining running time is calculated by r_{k,J} = t_{k,J} − p_{k,J}, where t_{k,J} is the estimated required running time of task k and p_{k,J} is the actual running time of task k. A task with a closer deadline, less remaining running time, or longer waiting time should have a higher priority to run because it helps meet the job deadline and improve the JCT. Considering the computation features, we calculate the priority of task k that does not have dependent tasks by:

P^C_{k,J} = γ_d · 1/(d_{k,J} − t) + γ_r · 1/r_{k,J} + γ_w · w_{k,J}    (4)

where t is the current time and γ_d, γ_r and γ_w are the weights of the considered factors. Larger γ_d, γ_r and γ_w place more weight on the deadline, the job remaining time and the job waiting time in the queue, respectively. The priority of task k considering the traditional computation features, P^C_{k,J}, is calculated by:

P^C_{k,J} = { P^C_{k,J},                                         no dependent tasks
            { P^C_{k,J} + γ · Σ_{i∈child(k)} P^C_{i,J},          otherwise        (5)

By combining the priorities based on the ML features and computation features (Equs. (3) and (5)), P_{k,J} is calculated by:

P_{k,J} = α · P^ML_{k,J} + (1 − α) · P^C_{k,J},  α ∈ [0, 1]    (6)

where α is a weight factor and a larger α means that the ML job features have higher weights than the computation features in determining a job's priority. MLF-H uses the task priority to order the tasks in the queue. Because tasks with higher priority are allocated to servers and executed earlier, this helps achieve our goal in Formula (1). Compared to previous job schedulers (e.g., Graphene [38]) that only consider traditional computation features and task dependency, MLFS is novel in that it additionally considers ML job spatial/temporal features in the priority determination. Also, it additionally considers the goal of increasing ML accuracy, which is not considered in previous schedulers for general jobs.

2) Basic Job Scheduling: MLF-H inserts newly submitted tasks and preempted tasks into the queue based on the priority. When there are underloaded servers and waiting tasks in the queue, it picks tasks one by one from the queue and assigns each to an underloaded node until there are no more underloaded nodes or the queue is empty. Here, one problem is which node among the underloaded nodes we should choose to allocate each task from the queue. To solve this problem, we can rely on the method in [45] as described below.

We consider M types of resources (e.g., GPU, CPU, memory, and bandwidth), and more types of resources can easily be added. The utilization of the type-m resource of server s at time slot t is calculated by u^t_m = l^t_m / c_m, where l^t_m is the resource consumption of server s in time slot t and c_m is the capacity of server s for the type-m resource. The utilization of the type-m resource of task k is calculated by u^t_{m,k} = l^t_{m,k} / c_m, where l^t_{m,k} is the resource consumption of task k in time slot t. The resource utilization of server s at time t (U^t_s) is represented as a vector (u^t_{1,s}, u^t_{2,s}, . . . , u^t_{m,s}, . . . , u^t_{M,s}), and the resource utilization of task k at time t (U^t_k) is represented as a vector (u^t_{1,k}, u^t_{2,k}, . . . , u^t_{m,k}, . . . , u^t_{M,k}). We consider that the type-m resource in a server is overloaded if its u^t_m is higher than a pre-defined threshold h_r (e.g., 90%). A larger h_r helps more fully utilize the resources but increases the probability that a server becomes overloaded and hence the JCT. A lower h_r increases the number of unnecessary task migrations. When at least one type of resource in a server is overloaded, we consider this server overloaded; otherwise, it is underloaded.

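The utilization and overload test above can be sketched as follows; the resource names, dictionary layout, and the 90% threshold are illustrative assumptions rather than the MLFS code.

    RESOURCES = ("gpu", "cpu", "memory", "bandwidth")   # the M resource types
    H_R = 0.90                                          # per-resource overload threshold h_r

    def utilization(consumption, capacity):
        """u^t_m = l^t_m / c_m for every resource type m of one server (or one task)."""
        return {m: consumption[m] / capacity[m] for m in RESOURCES}

    def is_overloaded(server_util, threshold=H_R):
        """A server is overloaded if any single resource type exceeds the threshold."""
        return any(server_util[m] > threshold for m in RESOURCES)

    server_util = utilization(
        consumption={"gpu": 14.5, "cpu": 30.0, "memory": 180.0, "bandwidth": 4.0},
        capacity={"gpu": 16.0, "cpu": 64.0, "memory": 256.0, "bandwidth": 10.0})
    print(server_util, is_overloaded(server_util))   # GPU at ~0.91 -> overloaded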

When we choose a server from several underloaded servers to allocate a task k, we hope that the chosen node is less likely to become overloaded, that the communication bandwidth consumption between this task and other tasks is maximally reduced, and that the task's performance degradation caused by the task movement to the node [46] is minimized. Avoiding overloaded servers and fully utilizing resources can help improve the JCT and accuracy of the jobs in the system. To achieve these objectives, among the underloaded servers, we first determine an ideal virtual host server for the task from the queue, represented by U^t_V = (u^t_{1,V}, u^t_{2,V}, . . . , u^t_{m,V}, . . . , u^t_{M,V}, u_{BW,V}, q_{k,V}), in which u^t_{m,V} is the minimum type-m resource utilization among all of the underloaded servers, u_{BW,V} denotes the maximum value among the communication data sizes between the task and each of the underloaded servers (in order to allocate high-volume communicating tasks to the same server), and q_{k,V} is the task performance degradation caused by the task movement [46], whose ideal value is 0. The server that is most similar to the ideal virtual host server based on the Euclidean distance should be chosen. That is, the server with min ||U^t_s − U^t_V|| among the candidate servers that will not be overloaded (on each resource and its least-loaded GPU) by hosting the task is selected as the host server for the task. Then, we schedule the task to the least-loaded GPU in the selected server. After one task is allocated, the scheduler continues to pick up the next task in the queue and schedule it. The scheduling stops when there are no more underloaded servers or the queue is empty. Finally, based on the determined task allocation schedule, the tasks are moved to their allocated GPUs.

3) Handling Overloaded Servers: MLF-H can be executed periodically or when there are underloaded nodes (and waiting tasks) or overloaded nodes. When there is an overloaded server, MLF-H chooses tasks in the overloaded server to be migrated out to relieve its excess load, inserts the tasks into the waiting queue, and reschedules the tasks along with the waiting tasks in the queue in the same manner. Note that these chosen migration tasks are virtually instead of actually moved to the queue in order to save the migration overhead. After one migration task is scheduled, it is directly moved from the overloaded server to the scheduled server. When there are no more underloaded servers but some migration tasks are still not assigned to servers, these migration tasks are moved back to the queue.

Many previous ML job schedulers [5], [6], [8], [24] only determine task priority for task allocation (and preemption [8]) but do not handle server overload. Gandiva [7] handles GPU overload but does not consider other types of resources, though ML jobs also use other types of resources including CPU, memory, and bandwidth. Server overload on other resources can still adversely affect the ML job performance. MLF-H overcomes these problems. Below, we present how MLF-H selects migration tasks from an overloaded node while avoiding resource fragmentation to more fully utilize resources in the system.

For an overloaded server, to move out some of its current tasks to relieve its excess workload, an important question is which tasks we should choose to move out. To fully utilize resources and relieve the load on overloaded resources in an overloaded server, we use the method in [45]. That is, we first determine an ideal virtual task to move out, represented by U^t_v = (u^t_{1,v}, u^t_{2,v}, . . . , u^t_{m,v}, . . . , u^t_{M,v}, u_{BW,v}), in which u^t_{m,v} for each overloaded resource is the maximum value among the tasks in the server and u^t_{m,v} for each underloaded resource is the minimum value among the tasks in the server. u_{BW,v} denotes the ideal communication data size between the migration task and the existing tasks in the server and it equals 0, which means that the task migration will not cause additional communication cost. Then, the task whose resource utilization is the closest to U^t_v based on the Euclidean distance, i.e., the task with min ||U^t_k − U^t_v||, is selected to move out. If the server is still overloaded after it migrates out this selected task, the same process is repeated until the server is not overloaded. Here, we further advance this method by considering the ML features. First, we need to make sure that the tasks with high priorities will not be selected to migrate out, in order to improve JCT and accuracy. Recall that the priority is calculated considering the accuracy improvement, so the tasks that will contribute more to improving the accuracy should be less likely to be chosen to migrate out. Second, because ML tasks run on multiple GPUs in a server, each GPU must not be overloaded. Therefore, if there exist overloaded GPUs, we order the tasks in the overloaded GPUs in ascending order of priority, and select tasks using the above method only among a certain percentage (p_s) of the tasks at the top, until there are no overloaded GPUs. A smaller p_s means that the migration tasks have lower priorities, which constrains the performance degradation on JCT and accuracy but may not relieve the overload quickly. When there are no overloaded GPUs, we select the task from all the running tasks in the server using the above method.

Stragglers may occur due to failing hardware, software bugs, misconfiguration and so on. To handle this problem, we can duplicate a task and assign the replica task to another server in task scheduling. Then, we use the output of the task that completes first and stop the other tasks. The number of replicas of a task depends on the possibility of straggler occurrence. More replicas can better avoid straggler occurrence but generate more overhead. We leave a detailed study of stragglers as future work.

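A sketch of the migration-task selection described in this subsection: build the ideal virtual task from the per-resource extremes, restrict the candidates to the lowest-priority fraction p_s, and pick the candidate closest to the ideal point in Euclidean distance. The data layout and helper names are assumptions; selecting a host server works analogously with the ideal virtual host server U^t_V.

    import math

    def ideal_virtual_task(task_utils, overloaded_resources):
        """U^t_v: max utilization among tasks for overloaded resources, min for the rest;
        the ideal extra communication volume (u_BW,v) is 0."""
        resources = task_utils[0]["util"].keys()
        ideal = {}
        for m in resources:
            vals = [t["util"][m] for t in task_utils]
            ideal[m] = max(vals) if m in overloaded_resources else min(vals)
        ideal["bw_with_server"] = 0.0
        return ideal

    def select_migration_task(task_utils, overloaded_resources, priorities, p_s=0.3):
        """Pick the task closest to the ideal virtual task, considering only the
        lowest-priority fraction p_s of tasks so that high-priority tasks stay put."""
        ideal = ideal_virtual_task(task_utils, overloaded_resources)
        candidates = sorted(task_utils, key=lambda t: priorities[t["id"]])
        candidates = candidates[:max(1, int(len(candidates) * p_s))]
        def dist(t):
            vec = {**t["util"], "bw_with_server": t["bw_with_server"]}
            return math.sqrt(sum((vec[k] - ideal[k]) ** 2 for k in ideal))
        return min(candidates, key=dist)["id"]

    tasks = [
        {"id": "t1", "util": {"gpu": 0.5, "cpu": 0.2}, "bw_with_server": 0.1},
        {"id": "t2", "util": {"gpu": 0.8, "cpu": 0.1}, "bw_with_server": 0.0},
        {"id": "t3", "util": {"gpu": 0.3, "cpu": 0.4}, "bw_with_server": 0.4},
    ]
    print(select_migration_task(tasks, overloaded_resources={"gpu"},
                                priorities={"t1": 5.0, "t2": 1.0, "t3": 2.0}))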

D. ML Feature Based RL Task Scheduling (MLF-RL)

The heuristic MLF-H and other heuristic scheduling methods may not be able to fully capture ML job features or set optimal parameter values (e.g., γ, α), so obtaining the optimal schedule for the goal in Formula (1) may be difficult. Also, the decision making in MLF-H may take a long time. To handle these problems, we rely on the deep RL technique. That is, MLFS initially runs MLF-H for a certain time period and uses the data to train MLF-RL, and then switches to MLF-RL when it is well trained. MLF-RL is novel compared with the previous RL-based job schedulers [5], [26], [39] in that it additionally considers ML features to improve both JCT and accuracy, while previous RL-based job schedulers do not aim to improve accuracy or consider ML features.

An RL consists of an agent, state, action, and reward [47]. In each time step t, the agent observes the environment state s_t, chooses an action a_t based on its optimal policy in response to the current state, and receives reward r_t. Recently, the Deep Neural Network (DNN) has become a popular function approximator because it automatically extracts features when solving large-scale RL problems [48]. Thus, we use a DNN to serve as the agent, which generates the optimal policy. The output of the DNN is the probability distribution of actions π : π(s_t, a_t) → [0, 1], where π(s_t, a_t) denotes the probability of taking an action a_t at the current state s_t. The goal of the agent is to maximize the expected cumulative discounted reward: E[Σ_{t=0} η^t r_t], where η ∈ (0, 1] is a factor discounting future rewards. A larger η enables the RL agent to place more weight on future rewards when updating the model.

Fig. 3. Deep RL structure in MLF-RL.

As shown in Figure 3, the state includes the information of all the waiting and running tasks and nodes in the cluster, including the information needed to derive the ML job features and computation features used in MLF-H and additional information such as the ML algorithm name and dependency graph. More specifically, the state (or the RL input) includes: i) the information of tasks, including whether a task is queuing or running, its job, arrival time, resource demand, its mini-batch, waiting time and running time; ii) the information of each task's job, including its ML algorithm, urgency level, deadline, number of maximum and finished iterations, loss reduction for each finished iteration, mini-batch size, training data size, and dependency graph; and iii) the information of servers and nodes (GPUs), including the utilization of each resource type in a server, the utilization of each GPU, and the current running tasks.

The DNN observes the state from the cluster environment and outputs the optimal action (i.e., task allocation schedule), represented by A = {(k, n) | k ∈ T, n ∈ N}. Recall that N also includes the waiting queue. The action includes the selection of tasks in overloaded nodes to move out and the assigned node (either an underloaded node or the queue) for each task among the selected migration tasks and the waiting tasks in the queue.

The RL model has only one objective function to optimize, and a normal way to create the objective function for multiple objectives is to create a weighted sum of the objectives according to the approach in [39]. Therefore, to achieve our goal indicated in Formula (1), we define the reward function at time step t as follows:

r_t = β_1·g_1(A) + β_2·g_2(A) + β_3·g_3(A) + β_4·g_4(A) + β_5·g_5(A),    (7)

where β_i is the reward weight for the i-th (i = 1, 2, . . . , 5) objective according to Formula (1). A larger β_i (i = 1, 2, . . . , 5) value means a higher weight on the i-th objective. A problem here is how to determine the weight combination (β_1, β_2, β_3, β_4, β_5) that generates better performance of the RL's learned optimal policy in each of the objective dimensions. For this purpose, we could directly adopt the reward tuning method in [49] that uses Bayesian optimization to search for the weight combination. However, Bayesian optimization can only give better (instead of the best) results by exploring the whole weight search space, and its time overhead is high [50]. After a limited number of rounds of Bayesian optimization, it cannot give fine-tuned weight results [49]. Specifically, Bayesian optimization cannot efficiently find the best results within a smaller value range [51]. Therefore, in order to obtain a better weight combination, we first run a limited number of rounds (e.g., 10) to obtain results, and then empirically try different combinations by slightly varying each value in the results. Finally, we choose the weight combination that generates the highest r_t in Equation (7).

Although the actual reward of the current scheduling decision can be known only when the scheduled tasks are completed, as in [26], [39], we wait for a time period after the scheduling decision is made at time t_0. That is, we compute the cumulative reward from t_0 to t_0 + t_m as the reward of the scheduling decision at time t_0: Σ_{i=0}^{m} η^i × r_{t_i}. Note that we do not consider task migration overhead in the reward function here and leave it as our future work.

The DNN uses all of the aforementioned information and the cumulative discounted reward to train the neural network, i.e., to update its policy neural network parameters θ to improve the scheduling decisions. We utilize gradient descent to update θ. Only after the RL model is well trained (i.e., converged) does MLFS switch from MLF-H to MLF-RL in order to output optimal scheduling decisions.

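The reward of Equation (7) and the delayed cumulative reward used to score one scheduling decision can be sketched as follows; the weights and the objective values are placeholders for the g_i(A) terms of Formula (1), and the helper names are our own.

    def reward(objectives, betas):
        """r_t = sum_i beta_i * g_i(A) over the five objectives of Formula (1)."""
        assert len(objectives) == len(betas) == 5
        return sum(b * g for b, g in zip(betas, objectives))

    def decision_reward(step_rewards, eta=0.95):
        """Cumulative discounted reward of one scheduling decision, accumulated over the
        waiting window [t0, t0 + t_m]: sum_i eta^i * r_{t_i}."""
        return sum((eta ** i) * r for i, r in enumerate(step_rewards))

    # Rewards observed at successive time steps after one scheduling decision.
    g = [0.8, 3.0, 0.5, 2.0, 0.76]           # g_1..g_5 evaluated for the chosen allocation
    betas = [0.3, 0.2, 0.1, 0.2, 0.2]
    print(reward(g, betas), decision_reward([reward(g, betas)] * 4))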

E. ML Feature Based System Load Control (MLF-C)

There is a tradeoff between ML model accuracy and ML job running time (the number of iterations) or the resource consumption [52]. To handle the ML overfitting problem, currently, users usually define a maximum epoch (i.e., iteration) to stop training, or observe the validation loss curve to decide at which epoch the model performs the best and stop training. However, the minimum accuracy loss may appear earlier than the specified maximum epoch, and then computation resources are wasted and JCT is increased by continuing to train the model. Also, the human observation method relies on intuition and needs expert knowledge, which is not applicable for general users. Therefore, the proper time to stop training an ML model is when the maximum accuracy is obtained before the job deadline. For this purpose, we can use an early training stopping method as in [40]. That is, when a job is running, we first use a weighted probabilistic learning curve model to predict the job's accuracy at the specified maximum iteration. If the predicted accuracy is less than an accuracy threshold, the training stops when the prediction confidence is higher than a threshold. Otherwise, the training continues and stops when the achieved accuracy reaches the accuracy threshold. The weighted probabilistic learning curve model is built based on historical data to fit one weighted learning curve for multiple types of ML jobs using Bayesian optimization. The inputs of the model include the number of iterations executed, the accuracy change for each executed iteration, and the iteration at which the accuracy needs to be predicted, and its output is the job's predicted accuracy at the indicated iteration. When one ML job is running, we monitor the accuracy change in real time.

In our proposed MLF-C, when users submit their ML jobs, they are asked to choose one of the following options: i) jobs run for the number of iterations indicated or controlled by them (i.e., the current approaches), ii) choose an ML iteration stopping algorithm (OptStop) (e.g., in [40]) that stops ML running when the achieved accuracy equals or is close to the maximum accuracy, or iii) only achieve the required accuracy. The users are also asked to indicate which of their chosen options can be changed by the system when the system is overloaded. For example, users choosing option i) allow the system to switch their choices to option ii) or iii), and users choosing option ii) allow the system to switch their choices to option iii), in order to help further reduce the system workload, which in turn improves their job JCT and accuracy performance. Further, the users are informed that option i) may lead to longer JCT, especially when the system is overloaded. Based on the users' different choices and their performance requirements on accuracy and JCT, they are charged with different payment rates. In a private cluster, the system administrator can directly make the decision about the above options. When there are enough resources in the cluster, we can run the jobs based on user preference. When the system is overloaded, we make changes to the user choices if the changes help reduce the system workload.

Thus, when the system is not overloaded, MLF-C follows the user choices accordingly, and when the system is overloaded, MLF-C changes the choices based on the users' indications to reduce system workload. Compared to option i) (the current approaches), options ii) and iii) reduce the number of iterations of the ML jobs and hence the system workload. As a result, they reduce JCT since jobs do not need to run more iterations or wait in the queue for a long time due to system overload, while still providing near-optimal accuracy or meeting users' accuracy requirements. In addition, the jobs' accuracies by the job deadlines are improved since important iterations have a higher chance to run and will not be blocked by the running of unimportant iterations. As a result, both JCT and accuracy performance are improved. Therefore, when the system is not overloaded, MLF-C proactively helps avoid system overload, and when the system is overloaded, MLF-C reactively reduces system workload to mitigate the overload. In addition, MLF-C also helps people who do not have the knowledge about how many iterations are needed to achieve the maximum or their desired accuracy.

Next, we explain how we judge whether the system resources are limited or the system is overloaded. Recall that the resource utilization of server s at time t is represented as a vector U^t_s = (u^t_1, u^t_2, . . . , u^t_m, . . . , u^t_M). The resource utilization of the cluster is represented as U^t_c = (U^t_1, U^t_2, . . . , U^t_{|N|}), where N is the set of servers in the cluster. The overload degree of server s is calculated by O^t_s = ||U^t_s||. The overload degree of the whole cluster is measured by the average of the overload degrees of all servers: O^t_c = (1/|N|) Σ_{s∈N} ||U^t_s||. The system is considered to be overloaded when there are tasks in the queue or when O^t_c > h_s, where h_s is a pre-defined threshold. A larger h_s means that more severe resource competition between jobs leads to higher JCT; however, a lower h_s may cause higher overhead for system load control. In this case, as mentioned above, based on the users' indications, MLF-C makes changes to user-selected options if the changes reduce the number of iterations, and stops producing or running unnecessary tasks accordingly. Consequently, the system does not need to further generate or run tasks for the next iteration of a job once the selected option's accuracy requirement is met.

Fig. 4. Validation loss in ML running of different ML algorithms.

IV. OPTIMAL ML ITERATION STOPPING (OPTS)

Figure 4 illustrates the validation loss curve versus epochs for different ML training algorithms including AlexNet, Residual Neural Network (ResNet), Multilayer Perceptron (MLP), and Long Short-Term Memory (LSTM). These ML algorithms and training are used in [7] and we obtained them from GitHub [53], [54]. There is a tradeoff between ML model accuracy and training time [52]. The current approaches to handle the overfitting problem (explained in Section III-E) are not efficient. First, the human observation method relies on intuition and needs expert knowledge, which is not applicable for general users. Second, the minimum validation loss may appear earlier than the specified maximum epoch, and then computation resources are wasted and JCT is increased by continuing to train the model. But we also cannot stop training at the current minimum loss (l_min) because there might be a new minimum loss later. Therefore, the proper time to stop training an ML model is when we can guarantee that all future losses are larger than l_min in the validation loss curve.

To find such a stopping epoch, three methods have been proposed in [52] to find the epoch where the loss increases significantly, which indicates that the model is overfitting the data. Let l_va(t) and l_tr(t) be the validation loss and training loss after training t epochs. The first method is called Generalization Loss (GL). It defines GL(t) = 100 × (l_va(t)/l_min − 1), which measures the fraction of additional loss comparing the current validation loss to the minimum loss. The stopping time is when GL(t) is larger than a threshold (h_GL). The second method is the Progress Quotient window based method (PQ). It defines PQ(t) = GL(t) / P_κ(t), where P_κ(t) = 1000 × ( (Σ_{i=t−κ+1}^{t} l_tr(i)) / (κ × min_{i=t−κ+1,...,t} l_tr(i)) − 1 ) measures how much the average training loss was larger than the minimum training loss during a window of κ epochs, assuming the window runs from epoch t − κ + 1 to epoch t. A higher GL(t) means more overfitting, and a lower P_κ(t) means a more stable training loss. Thus, it stops training when PQ is greater than a threshold (h_PQ). The third method is called UP. It stops training at the end of the (m_h)-th window when GL(t) keeps increasing over m_h successive windows.
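For reference, a compact sketch of the three prior stopping criteria (GL, PQ, and UP) as described above; the thresholds, window length κ, and m_h are illustrative values, and the helper names are our own.

    def gl(val_losses):
        """Generalization loss GL(t) = 100 * (l_va(t) / l_min - 1)."""
        return 100.0 * (val_losses[-1] / min(val_losses) - 1.0)

    def pq(val_losses, train_losses, kappa=5):
        """PQ(t) = GL(t) / P_kappa(t), where P_kappa measures training-loss progress
        over the last kappa epochs."""
        window = train_losses[-kappa:]
        p_kappa = 1000.0 * (sum(window) / (kappa * min(window)) - 1.0)
        return gl(val_losses) / p_kappa

    def up(val_losses, kappa=5, m_h=3):
        """UP: stop when GL(t) has increased over m_h successive windows of length kappa."""
        ends = range(len(val_losses), len(val_losses) - kappa * (m_h + 1), -kappa)
        gls = [gl(val_losses[:e]) for e in ends if e >= kappa]
        return len(gls) == m_h + 1 and all(a > b for a, b in zip(gls, gls[1:]))

    val = [1.0, 0.6, 0.45, 0.40, 0.41, 0.42, 0.44, 0.46, 0.47, 0.49, 0.52, 0.55,
           0.56, 0.58, 0.60, 0.61, 0.63, 0.64, 0.66, 0.67, 0.69]
    train = [v - 0.05 for v in val]
    print(gl(val), pq(val, train), up(val))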

purpose, we first find the fluctuation phase, and then find the loss-curve area in which the minimum validation loss is not likely to appear, that is, where the overall trend of the loss goes up and becomes more stable. Specifically, we keep track of the loss values. We consider a phase to be a quick-drop phase if the number of consecutive loss-decreasing epochs with loss values lower than the current minimum loss reaches a threshold m_d; that is,

1(l_va(i) < l_min) = 1,   i = k, k+1, ..., k+m_d−1,   (8)

where l_va(i) is the validation loss of the ith epoch and k is any epoch. Then, the subsequent curve starting from the (k+m_d−1)th epoch is the fluctuation phase. Next, we sample every window of κ losses in the fluctuation phase. Let k' = k+m_d−1; then the jth window is represented as w_j = {l_va(k'+jκ), l_va(k'+1+jκ), ..., l_va(k'+κ−1+jκ)} (j = 0, 1, ...). We stop training when

mean(w_{j+1}) > mean(w_j)  and  var(w_{j+1}) < var(w_j),   (9)

where mean(w_j) and var(w_j) denote the mean and the variance of the loss values in window w_j. The idea behind Equ. (9) is that mean(w_{j+1}) > mean(w_j) indicates that the model starts to overfit the training data [52], and var(w_{j+1}) < var(w_j) indicates that the loss value tends to be stable and is unlikely to decrease further. The trained ML model for each epoch after the quick-drop phase is recorded along with its loss. When the training stops, the stored trained ML model with the minimum loss l_min so far is fetched as the final trained ML model. To further enhance the accuracy of finding the minimum loss epoch, we can stop training only when Equ. (9) holds for m_o successive windows. Our experiments show that m_o = 1 is sufficient.

After the first fluctuation phase, there may be another quick-drop phase followed by another fluctuation phase, and so on; that is, a loss curve may contain multiple quick-drop and fluctuation phases. To handle this case, from the end of the quick-drop phase, i.e., the (k+m_d−1)th epoch, we also keep checking concurrently whether another quick-drop phase appears, using Equ. (8). If we find another quick-drop phase before we find the optimal stopping time, we repeat the above process; that is, we identify the subsequent fluctuation phase and find the optimal stopping epoch in it. In Figure 4, we mark the stopping epochs found by the three previous methods and by OptS. OptS stops training closely after the actual minimum loss epoch, while the other methods stop either much earlier or much later than the actual minimum epoch. In the former case, the method cannot find the minimum point; in the latter case, it wastes computation resources and time.
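To make the two-phase stopping rule concrete, the following is a minimal Python sketch of the logic described above. It is an illustration written for this description rather than the MLFS code base: the class and method names are invented, per-epoch validation losses are assumed to be fed in one at a time, and the handling of repeated quick-drop phases is omitted for brevity.

```python
# Sketch of the OptS stopping rule (Equ. (8) and (9)); names and defaults are illustrative.
from statistics import mean, variance

class OptSStoppingRule:
    def __init__(self, m_d=5, kappa=10, m_o=1):
        self.m_d = m_d            # consecutive improving epochs that define a quick-drop phase
        self.kappa = kappa        # window size in the fluctuation phase
        self.m_o = m_o            # successive windows that must satisfy Equ. (9)
        self.losses = []          # validation loss per epoch
        self.in_fluctuation = False
        self.fluct_losses = []    # losses collected after the quick-drop phase
        self.hits = 0             # count of successive windows satisfying Equ. (9)

    def observe(self, val_loss):
        """Feed one epoch's validation loss; return True when training should stop."""
        self.losses.append(val_loss)
        if not self.in_fluctuation:
            # Equ. (8): m_d consecutive decreasing epochs, all below the previous minimum loss.
            if len(self.losses) > self.m_d:
                recent = self.losses[-self.m_d:]
                prev_min = min(self.losses[:-self.m_d])
                decreasing = all(recent[i] > recent[i + 1] for i in range(self.m_d - 1))
                if decreasing and all(l < prev_min for l in recent):
                    self.in_fluctuation = True
            return False
        # Fluctuation phase: compare successive windows of kappa losses (Equ. (9)).
        self.fluct_losses.append(val_loss)
        n_windows = len(self.fluct_losses) // self.kappa
        if n_windows >= 2 and len(self.fluct_losses) % self.kappa == 0:
            w_prev = self.fluct_losses[(n_windows - 2) * self.kappa:(n_windows - 1) * self.kappa]
            w_curr = self.fluct_losses[(n_windows - 1) * self.kappa:]
            if mean(w_curr) > mean(w_prev) and variance(w_curr) < variance(w_prev):
                self.hits += 1
            else:
                self.hits = 0
        return self.hits >= self.m_o
```

In a full implementation, a checkpoint of the model would also be kept for every epoch after the quick-drop phase so that the checkpoint with the minimum recorded loss can be restored once the rule fires, and the detector would be reset if a new quick-drop phase (Equ. (8)) is observed before the rule fires.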
V. IMPLEMENTATION

We built our MLFS methods based on PyTorch using roughly 5000 lines of Python code.

A. Tracking Accuracy and Optimal Stopping

In order to realize the job scheduling and optimal stopping functions in our system, we need to track the accuracy change in real time. We use the existing accuracy API torch.utils.tensorboard to track the accuracy for each job and record the accuracy using the function SummaryWriter(). We create an OptimalStopping(accuracy, iterations) class in the code. Meanwhile, we use the API torch.utils.checkpoint(), which saves a checkpoint of one ML job model when the accuracy changes in each iteration. After the optimal point has been found by the method introduced in Section IV, the corresponding trained model with a satisfactory accuracy value is picked by OptimalStopping(), which then makes the decision with stop() or continue().

B. Job Scheduling

The MLFS scheduler is executed periodically or when there are overloaded nodes in the cluster. The GPU utilization is obtained from cutorch.getMemoryUsage, and the other resource utilizations are obtained through Linux commands. Based on our lab-made Python code, we create one queue waitingQueue() to store all the ready-to-schedule jobs' IDs. All the proposed job schedulers generate the job placement plan for the jobs in the queue in the format (JobID, NodeID). Algorithm 1 shows the pseudocode of MLF-H. The queueing tasks are scheduled one by one from the queue until the queue is empty or there are no available computing resources. Algorithm 2 shows the pseudocode of MLF-RL. The RL agent processes the tasks in the queue at the same time with the trained model as introduced in Section V-D and then generates the job placement plan. After MLF-H or MLF-RL produces the job placement plan, our implemented class JobAssign(JobID, NodeID) is executed for each assigned job according to the generated plan.

Algorithm 1 Pseudocode of MLF-H
1 Function MLF-H: heuristic job scheduling
2 Update the running tasks' information and the resource utilizations of all the nodes in the cluster;
3 for all the tasks in the waiting queue (including newly submitted tasks and ready-to-migrate tasks) do
4   Calculate the priority P_k,J according to Equation (6);
5 Sort the tasks in the queue in descending order of priority;
6 Assign the queueing tasks one by one from the top until there is no available resource or the queue is empty;
7 Update the resource utilization of each node;

Algorithm 2 Pseudocode of MLF-RL
1 Function MLF-RL: RL based job scheduling
2 Update the running tasks' information and the resource utilizations of all the nodes in the cluster;
3 for all the tasks in the waiting queue (including newly submitted jobs and ready-to-migrate tasks) do
4   Generate the current state s_t;
5   Select action a_t with π(s_t, a_t);
6   Execute the action a_t for all the queuing tasks and observe the reward r_t = R(s_t, a_t) according to Equation (7);
7   Update A^π_η(s_t, a_t);
8 if the waiting queue is not empty and there are no available resources then
9   keep the remaining tasks in the queue until the next scheduling round;
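As a rough illustration of how these pieces fit together, the sketch below shows a scheduler driver loop in the spirit of the description above. waitingQueue() and JobAssign() are the helpers named in the text; the scheduler and cluster objects, their method names, and the stubbed JobAssign body are our own assumptions rather than the released implementation.

```python
# Illustrative MLFS scheduler driver loop; helper object interfaces are assumed.
import time

SCHEDULING_PERIOD_S = 60  # the scheduler runs every minute in our experiments

def JobAssign(job_id, node_id):
    # Stub for the JobAssign(JobID, NodeID) class mentioned in the text;
    # in the real system this dispatches the job's tasks to the chosen node.
    print(f"assign job {job_id} -> node {node_id}")

def scheduling_loop(scheduler, cluster, waiting_queue):
    """scheduler: exposes plan(queue, cluster) -> [(job_id, node_id), ...] (MLF-H or MLF-RL)
       cluster:   exposes refresh_utilization() and has_overloaded_node()
       waiting_queue: list of ready-to-schedule job IDs (newly submitted + ready-to-migrate)."""
    last_run = 0.0
    while True:
        # Run periodically, or immediately when some node in the cluster is overloaded.
        if time.time() - last_run >= SCHEDULING_PERIOD_S or cluster.has_overloaded_node():
            cluster.refresh_utilization()                            # CPU/GPU/memory/bandwidth per node
            placement_plan = scheduler.plan(waiting_queue, cluster)  # [(JobID, NodeID), ...]
            for job_id, node_id in placement_plan:
                JobAssign(job_id, node_id)
                waiting_queue.remove(job_id)
            last_run = time.time()
        time.sleep(1)
```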

C. Model Partition

There are two types of model partition: sequential model partition and layer-based model partition. For the sequential model partition, we modified the MLP and AlexNet code based on GPipe [1] with the library support from torchgpipe.GPipe. For a given sequential model parallelism job, we partition the model into several parts evenly, where the number of parts is randomly chosen from {2, 4, 6, 8}. The sequential partition is generated with nn.Sequential(). For the layer-based model partition, we first randomly select three layers (excluding the input and output layers) and partition each selected layer into several parts evenly, where the number of parts is randomly chosen from {2, 3, 4}. The LSTM layer-based model partition is executed with lstm.unroll(), and the ResNet partition is executed with resnet.fc.train(). We apply model partition by modifying the code of all the original ML programs.
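To illustrate the sequential case, the sketch below groups the layers of a toy MLP into a chosen number of evenly sized nn.Sequential stages; the layer sizes are made up for the example, and the optional GPipe wrapping (commented out) assumes the torchgpipe package is installed.

```python
# Sketch: evenly partition a sequential model into `num_parts` stages (illustrative sizes).
import torch.nn as nn

def build_mlp():
    # A toy MLP standing in for the original MLP/AlexNet code.
    return nn.Sequential(
        nn.Linear(784, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 10),
    )

def partition_sequential(model: nn.Sequential, num_parts: int) -> nn.Sequential:
    """Group the layers of `model` into `num_parts` stages of roughly equal length."""
    layers = list(model.children())
    per_part = (len(layers) + num_parts - 1) // num_parts  # ceiling division
    stages = [nn.Sequential(*layers[i:i + per_part]) for i in range(0, len(layers), per_part)]
    return nn.Sequential(*stages)

model = partition_sequential(build_mlp(), num_parts=4)

# Optional: place the stages on multiple GPUs with GPipe-style pipeline parallelism.
# from torchgpipe import GPipe
# model = GPipe(model, balance=[1] * len(model), chunks=4)
```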
D. RL Scheduler Training

We first use the Microsoft Philly trace to perform offline supervised learning to train the RL agent. Since RL training from scratch can result in poor policies at the beginning of learning and a long time to converge, we adopt offline supervised learning to guide the RL policy update with an existing scheduling strategy, the default scheduler in TensorFlow. The RL agent is trained with the Adam optimizer [55] with a fixed learning rate of 0.001 for the offline supervised learning and 0.0001 for the online training. With these settings, the online training takes around 8 hours to converge after the offline learning. The input of our RL agent includes two parts: resource utilization and job information. The resource utilization part contains the utilization percentage of CPU, GPU, memory and bandwidth on each machine. The job information part contains the job's deadline, the required resources, the job's accuracy threshold, the job's iteration settings, the number of workers and parameter servers, and the task dependency of the tasks in one job. The output is the job placement plan, i.e., which job should be placed on which devices. According to our model tuning experience, the order of importance of the goals in the reward function follows: β2 ≈ β1 > β3 > β4 ≈ β5. β1 and β2 are more important because the reduction of job completion time can release more computing resources and thus lower the average workload in the GPU cluster. Meanwhile, due to the high communication cost of current ML jobs, β3 plays an important role in reducing the data transmission time and hence further reducing the job completion time. Under these coefficient settings, β4 and β5 are less important, since most jobs can finish soon or benefit from the optimal stopping method.
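The exact reward is defined by Equation (7) in the paper; as a rough illustration of how the five weighted objectives could be combined in code, the sketch below builds a scalar reward from normalized per-objective scores. The term names, the normalization to [0, 1], and the sign conventions are our own assumptions chosen to mirror the coefficient ordering above; only the default β values (listed in Section VI-A) come from the text.

```python
# Hedged sketch of a weighted multi-objective reward; term definitions are assumed.
from dataclasses import dataclass

@dataclass
class RewardWeights:
    beta1: float = 0.5    # job completion time term
    beta2: float = 0.55   # deadline guarantee term
    beta3: float = 0.25   # communication (bandwidth) cost term
    beta4: float = 0.15   # waiting time term
    beta5: float = 0.15   # accuracy term

def reward(jct_norm, deadline_met, comm_norm, wait_norm, acc_norm,
           w: RewardWeights = RewardWeights()):
    """All *_norm inputs are assumed to be scaled to [0, 1]; a larger reward is better.
    JCT, communication cost and waiting time are penalties, so they enter negatively."""
    return (-w.beta1 * jct_norm
            + w.beta2 * (1.0 if deadline_met else 0.0)
            - w.beta3 * comm_norm
            - w.beta4 * wait_norm
            + w.beta5 * acc_norm)
```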
E. Job Migration

The load control in our system is mainly used to change the job placement to avoid machine overload and to further improve resource utilization after job scheduling. We leverage torch.utils.checkpoint(), which is already provided in PyTorch. Once a job is selected to migrate, we use torch.load to load the saved checkpoint, and the job then continues on the destination machine.
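A small sketch of the checkpoint-based resume step is given below. It uses the standard torch.save/torch.load state_dict pattern rather than the exact MLFS code; the function names and checkpoint layout are illustrative, and re-registering the migrated job with the scheduler on the destination node is omitted.

```python
# Sketch: checkpoint a job's model on the source node and resume it on the destination node.
import torch

def save_job_checkpoint(model, optimizer, iteration, path):
    # Persist everything needed to resume training after migration.
    torch.save({
        "iteration": iteration,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def resume_job_from_checkpoint(model, optimizer, path, device="cuda:0"):
    # Load the saved checkpoint on the destination node and continue training from there.
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    model.to(device)
    return ckpt["iteration"]
```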
Algorithm 3 Pseudocode of MLF-C
1 Function MLF-C: system load control
2 Update the resource utilization of all the machines in the cluster to calculate Ust, Uct and Oct;
3 while Oct > hs do
4   for one selected job on the machine with the highest Ost do
5     Option-i: continue to run until the indicated number of iterations;
6     Option-ii: OptStopping(); continue;
7     Option-iii: stop once the required accuracy is reached;
8   Update Oct and Ost;

VI. PERFORMANCE EVALUATION

A. Experiment Settings

Real trace: We used a publicly available trace of DNN training workloads from Microsoft [23] to run trace-driven simulation. The trace contains a representative subset of the first-party DNN training workload on Microsoft's internal Philly clusters with 550 servers and 2474 GPUs, collected from Aug. 07, 2017 to Dec. 22, 2017. The trace has 117325 jobs, including Convolutional Neural Networks (CNNs), LSTMs and RNNs. It contains two types of information: 1) for each job, it contains the job arrival time, the number of GPUs requested, the GPU allocation position, the job completion status (the highest accuracy value when the job finished), and the job running information reported per minute, including CPU, memory, and GPU utilization; and 2) for each server, it contains the CPU, memory, and GPU utilization per minute [56]. In our experiment, we use each job's arrival time and number of GPUs requested, and use its job completion status as the accuracy requirement of the job.

The ML algorithms include AlexNet, ResNet, MLP, LSTM and Support Vector Machine (SVM). In SVM, we only used data parallelism. In MLP and AlexNet, because of their sequential task dependency graph structures, we partitioned the model sequentially into several parts for model parallelism and also used data parallelism. In LSTM and ResNet, we used both data parallelism and model parallelism and partitioned each layer into several parts for model parallelism. Therefore, we used mixed workloads in our experiments. The batch size is 1MB for AlexNet and ResNet, and 1.5KB for LSTM, MLP and SVM. To create each ML job, we randomly selected one of the ML algorithms. The size of the training data is randomly selected from [100, 1000]MB. We scaled down the number of jobs as in [8], which used 60 GPUs and 480 jobs in its experiments; we used 80 GPUs. The duration of the trace is 18 weeks, and we randomly selected one week to run the test.

Real implementation: We conducted a PyTorch [21] based implementation on Amazon AWS [22] for our methods and the comparison methods. We used 20 p3.8xlarge instances (to represent servers), which form an 80-GPU cluster [8], where each server has 4 Nvidia Tesla V100 GPUs (16 GB RAM, 5120 CUDA Cores and 640 Tensor Cores), 32 vCPU cores (based on the Intel Xeon E5-2686 v4 processor), and 244 GB memory [57]. We used the job arrival time from the real trace [23], as well as the training data and the five ML algorithms used in [7] (downloaded from GitHub [53], [54]) with their default batch size. We set the number of jobs to 620x, where x equals 1/4, 1/2, 1, 2, and 3.


Simulation: We developed a Python based simulator (running on the Google Colab platform) for large-scale testing with 2474 GPUs in 550 servers. We set the number of jobs to 117325x, where x equals 1/2 and is then varied from 1 to 4 with an increase of 1 in each step. We used all the trace data to drive the simulation. Given that the trace does not contain bandwidth cost information, for each job we randomly selected two values within [50, 100]MB as the communication volume between each worker and the parameter server and between workers in one communication, since [8] indicates that the communication volume is the same across iterations. As we use both data parallelism and model parallelism, we need the corresponding task information in this scenario to drive our simulation. However, the trace is not from a data parallelism and model parallelism scenario, nor does it provide the names of the ML algorithms. We therefore first sample-ran the aforementioned five ML algorithms with the aforementioned parallelism methods using different numbers of GPUs and recorded the job completion times. Then, we mapped each job in the trace to one of our running ML algorithms with a certain number of GPUs that has a similar job latency. Next, we ran the ML algorithm with our data parallelism degree and different model parallelism degrees to get the information of the different tasks in the job in this scenario.
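A minimal sketch of this mapping step, under the assumption that the profiling results are kept in a simple lookup table, is shown below. The table contents and field names are invented for illustration; only the rule of matching each trace job to the profiled (algorithm, GPU count) configuration with the closest completion time comes from the text.

```python
# Sketch of mapping trace jobs to profiled ML algorithm configurations (illustrative data).

# (algorithm, num_gpus) -> measured completion time in minutes from our sample runs
profiled_runs = {
    ("AlexNet", 4): 95.0, ("AlexNet", 8): 55.0,
    ("ResNet", 8): 140.0, ("ResNet", 16): 80.0,
    ("LSTM", 2): 60.0, ("MLP", 1): 25.0, ("SVM", 1): 15.0,
}

def map_trace_job(trace_duration_min):
    """Pick the profiled (algorithm, GPU count) whose measured completion time
    is closest to the trace job's duration."""
    return min(profiled_runs.items(), key=lambda item: abs(item[1] - trace_duration_min))[0]

print(map_trace_job(90.0))  # -> ('AlexNet', 4) with this toy profile (|95 - 90| is the smallest gap)
```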
Experimental setting: In the experiments, we considered resource types including CPU, memory, GPU and bandwidth cost. The number of GPUs needed by each ML job was randomly selected from {1, 2, 4, 8, 16, 32}. We also set the number of model partitions to this number. SVM did not run in model parallelism because it is hard to partition its network model. The default parameter values of our method are listed below: α = 0.3, γ = 0.8, γd = 0.3, γr = 0.3, γw = 0.35, β1 = 0.5, β2 = 0.55 (a larger β2 means more weight on deadline guarantee), β3 = 0.25, β4 = 0.15, β5 = 0.15, η = 0.95, hr = hs = 90%, and ps = 10%. We chose these default values because, in our experiments, we found that they achieve the best performance in our evaluation environment. In practice, these tunable parameters of a cluster are determined by the administrator of the cluster according to the goals they concern and the particular cluster environment and configuration.

The job scheduler runs every minute. For each job, we randomly used max{1.1te, tr} as its deadline, where te is its estimated execution time and tr was randomly chosen from [1/2, 24] hours. Unless otherwise specified, the number of iterations run by each job is the same as that in the trace. In the RL training process in MLF-RL, we randomly scheduled tasks to nodes, calculated the reward, and then updated the RL model. After the RL processed the first 50% of the data of the real trace, the model is trained, which takes around 26 hours. The error bars in our experimental figures represent the 1st and 99th percentiles and the median of the result values from 10 experiments. We use (y − z)/z to calculate the improvement when comparing the performance of methods y and z.

Comparison methods: We compared our methods with the state-of-the-art ML job schedulers including SLAQ [6], Gandiva [7], Tiresias [8], RL [5], Graphene [38], TensorFlow [36] (which uses the Borg resource manager [24]), and HyperSched [11]. RL aims to minimize the JCT; it uses data parallelism and partitions the model sequentially into several parts for model parallelism, where the number of parts equals the number of layers in one job. The details of these methods are explained in Section II. We implemented SLAQ, RL and Tetris by ourselves and used the open source code of the other comparison methods. In MLF-C, we assume that all jobs use OptStop and that it stops a job when its required accuracy is reached while the system is overloaded. We open-sourced our code on GitHub [25].

B. Experimental Results

1) Overall Performance Comparison: Figure 5 and Figure 6 compare the overall performance of our methods against the other methods in real experiments and simulation, respectively. Given that both sets of figures show similar trends and orders, we discuss the two figures together and show the performance improvement ratios from the simulation in "()" in the following.

Figures 5(a) and 6(a) show the cumulative distribution function (CDF) of the JCT of each method. The results show that the overall JCT follows: MLFS < MLF-RL < MLF-H < Graphene < Tiresias ≈ HyperSched ≈ RL ≈ Gandiva < TensorFlow ≺ SLAQ. We use ≺ to denote a slightly lower relationship and ≻ to denote a slightly higher relationship. Over 85%, 81%, and 78% of the jobs in MLFS, MLF-RL and MLF-H have JCTs less than 100 minutes, respectively, in the real experiments. Considering the percentage of jobs with JCTs less than 100 minutes, MLF-RL outperforms MLF-H by 4% (4%), and MLFS improves MLF-RL by 5% (6%) due to the additional MLF-C. Meanwhile, 61%, 60%, 60%, 55%, 46%, and 39% of the jobs have JCTs less than 100 minutes in Tiresias, HyperSched, RL, Gandiva, TensorFlow and SLAQ, respectively. MLFS improves Tiresias by 33% (38%) and improves SLAQ by 118% (128%).

Both MLF-RL and MLF-H jointly consider temporal/spatial ML job features and computation features in scheduling tasks to the underloaded servers and also in selecting migration tasks from overloaded servers. The additional consideration of ML job features compared to the other methods helps them achieve a lower JCT. Since the tasks in earlier iterations that have more close dependent tasks in the dependency graph have higher priorities to run, their completion enables more tasks to start earlier, thus helping reduce JCTs. In addition, MLF-RL and MLF-H consider computation features related to JCT, which directly reduces JCTs. MLF-RL outperforms MLF-H because MLF-RL can better extract ML job features, adapt to learn the scheduling policy and output the optimal schedule, whereas MLF-H may not be able to set optimal values for its parameters.

MLFS produces a significantly lower JCT than the other methods. The reduction of JCT comes from MLF-RL and MLF-C. MLF-C stops training ML jobs at their optimal stopping epochs to save computation resources and time without degrading their accuracy performance, and it also stops tasks in later iterations when the desired accuracy is reached to avoid resource overload and task waiting, thus reducing JCT.

In order to reduce JCT, Graphene considers the task dependency graph. The tasks with more dependent tasks have higher priority to be scheduled. The earlier completion of this kind of task leads to the earliest start of other dependent tasks and hence a decreased JCT. Thus, Graphene achieves better JCT reduction performance compared with the other comparison methods.

Tiresias gives the jobs that can complete in the next service epoch the highest priority to run, and RL uses the RL technique for job scheduling, but they do not consider the ML features. HyperSched, Gandiva and TensorFlow do not directly aim to improve JCT. HyperSched aims to increase the accuracy of jobs within their deadlines. Gandiva uses FIFO and also migrates tasks between GPUs to fully utilize GPU resources, and TensorFlow uses the Fair scheduler, which aims to achieve fairness between jobs in resource allocation. As a result, HyperSched, Tiresias and RL produce JCTs higher than our methods, but lower than TensorFlow. Gandiva's JCT is comparable to those of HyperSched, Tiresias and RL, and lower than TensorFlow due to Gandiva's additional task migration. SLAQ only aims to maximize the accuracy improvement across jobs rather than JCT, thus it produces a higher JCT than TensorFlow. In conclusion, in terms of the JCT, MLFS outperforms the other methods, and our proposed methods MLF-H, MLF-RL and MLF-C are all effective in reducing JCT.


Fig. 5. Overall performance in real experiments.

Fig. 6. Overall performance in large-scale simulation.

Makespan is the time period from when the first job is submitted to when the last job is completed. The makespan is 40-90 (63-336) hours in MLFS, 51-102 (67-342) hours in MLF-RL, and 54-116 (72-349) hours in MLF-H. For the highest workload, MLFS improves MLF-RL by 10% (9%) and MLF-RL improves MLF-H by 15% (11%). Also, the makespan is 56-125 (74-355) hours in Tiresias, 58-128 (78-360) hours in HyperSched, 61-134 (78-361) hours in RL, 64-138 (84-368) hours in Gandiva, 78-158 (103-416) hours in TensorFlow, and 85-170 (11-415) hours in SLAQ. Thus, MLFS improves Tiresias by 32% (17%) and improves SLAQ by 52% (39%).

Figures 5(b) and 6(b) show the average JCT. We see that the result follows: MLFS < MLF-RL < MLF-H < Graphene < Tiresias ≈ HyperSched ≈ RL ≈ Gandiva < TensorFlow ≺ SLAQ. For 1860 jobs, MLFS improves MLF-RL by 22% (18%) (due to the additional MLF-C) and MLF-RL improves MLF-H by 11% (10%). Meanwhile, MLFS improves Tiresias by 34% (30%) and improves SLAQ by 53% (47%). The results show similar orders and trends of all methods as in Figures 5(a) and 6(a), due to the same reasons.

Figures 5(c) and 6(c) show the job deadline guarantee ratio, which is the percentage of jobs whose deadlines are satisfied. We see that the result follows: MLFS > MLF-RL > MLF-H > HyperSched > Graphene > Tiresias ≈ RL ≈ Gandiva > TensorFlow ≻ SLAQ. For 1860 jobs, MLFS improves MLF-RL by 21% (16%) and MLF-RL improves MLF-H by 16% (21%). Meanwhile, MLFS improves HyperSched by 47% (43%) and improves SLAQ by 102% (101%). Part of the reasons for the results are the same as those for Figures 5(a) and 6(a). HyperSched considers the job deadline and tends to give more resources to the tasks with the best potential accuracy performance before the deadlines. These methods try to satisfy job deadlines in job scheduling while the other methods do not.


A job's waiting time is the accumulated time periods in which none of its tasks are running, from its submission time to its completion time. Figures 5(d) and 6(d) plot the average job waiting time per job. The result shows that MLFS < MLF-RL < MLF-H < Graphene < HyperSched ≈ Tiresias ≈ RL ≈ Gandiva < TensorFlow < SLAQ. MLFS improves MLF-RL by 25% (22%) (due to the additional MLF-C) and MLF-RL improves MLF-H by 14% (21%). MLFS improves Tiresias by 40% (47%) and improves SLAQ by 52% (57%). Part of the reasons are similar to those in Figures 5(a) and 6(a), since a lower JCT enables more jobs to run earlier and helps reduce job waiting time. In addition, waiting time is a factor considered in MLF-RL and MLF-H in priority determination in order to reduce the task waiting time, and MLF-C further reduces job waiting time by removing some unnecessary tasks. Though HyperSched pauses jobs that do not increase accuracy significantly, it does not aim to reduce JCT, so jobs are always running or waiting before their deadlines. As a result, our methods generate lower waiting time than the other methods, and MLFS produces the least average job waiting time.

Figures 5(e) and 6(e) plot the average accuracy of ML jobs by their deadlines. We see that the result follows: MLFS > MLF-RL > MLF-H > HyperSched > SLAQ > Tiresias ≈ RL ≈ Graphene > Gandiva ≈ TensorFlow for the lowest workload, and MLFS > MLF-RL > MLF-H > HyperSched > Tiresias ≈ RL ≈ Graphene > SLAQ ≈ Gandiva ≈ TensorFlow for the high workload. For 155 jobs, MLFS improves MLF-RL by 10% (11%) and MLF-RL improves MLF-H by 7% (8%). Also, MLFS improves Tiresias by 14% (16%) and improves TensorFlow by 44% (46%). For 1860 jobs, MLFS improves MLF-RL by 23% (20%) and MLF-RL improves MLF-H by 14% (8%). Also, MLFS improves HyperSched by 43% (41%) and improves TensorFlow by 64% (60%).

With a lower workload in the cluster, HyperSched performs the best among the comparison methods, since it pauses jobs that do not gain a high accuracy increase and also gives higher priority in resource allocation to the task that tends to achieve the best accuracy performance before its deadline. Tiresias, RL and Graphene try to improve JCT so that jobs can complete earlier, which also increases the accuracy achieved by the job deadlines due to resource competition mitigation. As a result, they achieve higher accuracy than SLAQ, even though SLAQ aims to improve accuracy. Gandiva and the Fair scheduler in TensorFlow do not directly aim to improve JCT or accuracy. Therefore, many jobs must wait, and their accuracy by the deadline is lower than that of the other methods. Since MLFS considers both accuracy and JCT requirements and leverages ML job features, it achieves the highest average accuracy under both high and low workloads.

Figures 5(f) and 6(f) show the accuracy guarantee ratio, i.e., the percentage of ML jobs whose accuracy requirements are satisfied by their deadlines. The trends and orders of the results are similar to those in Figures 5(e) and 6(e). For 155 jobs, MLFS improves MLF-RL by 15% (18%) and MLF-RL improves MLF-H by 20% (19%). Also, MLFS improves HyperSched by 36% (35%) and improves TensorFlow by 56% (52%). For 1860 jobs, MLFS improves MLF-RL by 22% (20%) and MLF-RL improves MLF-H by 8% (9%). Also, MLFS improves HyperSched by 32% (34%) and improves TensorFlow by 61% (54%). Part of the reasons for the results are the same as in Figures 5(e) and 6(e). In addition, meeting the accuracy requirement is an objective in our methods, but it is not considered in the other methods. As a result, our methods produce the highest accuracy guarantee ratio.

Figures 5(g) and 6(g) plot the communication bandwidth cost. We see that the result follows: MLFS < MLF-RL < MLF-H ≈ TensorFlow ≈ Graphene ≈ Tiresias ≈ HyperSched ≈ SLAQ ≈ RL < Gandiva. For 1860 jobs, MLFS improves MLF-RL by 22% (24%) and MLF-RL improves MLF-H by 14% (15%). Meanwhile, MLFS improves TensorFlow by 36% (37%) and improves Gandiva by 45% (54%). MLF-H tries to reduce bandwidth cost when selecting a node to allocate a task and when selecting tasks to migrate out of an overloaded node. MLF-RL considers bandwidth cost as a part of the reward function. MLF-C further removes tasks that make little or no contribution to the desired accuracy. As a result, MLFS, MLF-H and MLF-RL produce less bandwidth cost than the other methods, and MLFS produces the least bandwidth cost. Gandiva simply uses FIFO without trying to reduce JCT and uses task migration to improve resource utilization without considering bandwidth cost; the task migration therefore introduces extra bandwidth cost, so Gandiva generates the highest bandwidth cost. All the other comparison methods do not aim to reduce bandwidth cost, thus producing higher bandwidth cost than MLF-H, MLF-RL, and MLFS.

We recorded the scheduler's running time for each scheduling round and calculated its average value. Figure 5(h) plots the average time overhead of the different schedulers versus different numbers of jobs. We see that the result follows: Gandiva < TensorFlow < Tiresias < Graphene < HyperSched < SLAQ < RL ≈ MLF-RL < MLF-H < MLFS. For 1860 jobs, the scheduler time overhead is 54ms, 49ms, and 40ms in MLFS, MLF-RL and MLF-H, and it is 23ms, 28ms, 35ms, 38ms, 40ms, and 49ms in Gandiva, TensorFlow, Tiresias, HyperSched, SLAQ and RL, respectively. MLF-RL and RL both use the RL technique, so they have similar time overhead. MLFS generates a higher time overhead since it runs the additional MLF-C in addition to MLF-RL. MLF-H consists of two parts, a heuristic job scheduling method and the handling of overloaded servers, which needs to select migration tasks and destination nodes; thus, it generates a higher time overhead than MLF-RL and RL and a lower overhead than MLFS, which consists of MLF-RL and MLF-C. The results indicate the advantages of MLF-RL and RL in reducing the time for decision making. SLAQ, Tiresias, Graphene and HyperSched use a single heuristic method without handling server overload or system overload, which costs less time overhead compared with the RL-based methods. Gandiva uses simple FIFO and TensorFlow uses the simple Fair scheduler, so they produce the smallest scheduler overhead. Though our methods have slightly higher time overhead, at the millisecond level, than the other methods, they achieve much better JCT and accuracy performance, as shown above.


Fig. 7. Urgency and deadline consideration.

Fig. 8. Bandwidth consideration.

Fig. 9. Effectiveness of system load control.

Fig. 10. System load reduction.
In all of the figures, we see that as the number of jobs increases, the average JCT, the average job waiting time and the bandwidth cost increase. This is because more jobs generate a higher workload and hence longer running time and waiting time, and a higher bandwidth cost for running the jobs. One of the reasons that MLFS performs better in both JCT and accuracy than the comparison methods is that it uses OptStop, which helps reduce JCT while still achieving the highest accuracy for the stopped jobs and leaves more resources for the other jobs to achieve high accuracy.

2) Performance of Each System Component: We conducted real experiments to show the effectiveness of each system component of MLF-H and MLF-C.

a) Factors in Priority Determination: We set the urgency level of each job to a value randomly selected from [1, 10] and consider the jobs with an urgency level higher than 8 as urgent jobs. Figure 7 shows the deadline guarantee ratio for urgent jobs with and without the urgency coefficient consideration in the priority calculation in Equ. (2). It shows that this consideration improves the deadline guarantee ratio by 22%-30%. Figure 7 also plots the job deadline guarantee ratio with and without the job deadline consideration in Equ. (4). We find that the deadline consideration improves the job deadline guarantee ratio by 13%-25%.

Figure 8 shows the average JCT and bandwidth cost with and without the bandwidth cost consideration in Equ. (2). We observe that the bandwidth cost consideration reduces the JCT by 5%-15% and reduces the bandwidth cost by 20%-35%, which confirms the importance of considering the reduction of bandwidth cost.

b) Effectiveness of Task Migration: Figures 9(a) and 9(b) show the number of server overload occurrences, the bandwidth cost, the average accuracy by the job deadline, and the average JCT with and without task migration, respectively. We notice that the task migration component reduces the number of server overload occurrences by 36%-60% and increases the bandwidth cost by 10%-14%. The task migration also increases the average accuracy by 8%-10% and reduces the average JCT by 15%-24%. Server overload leads to longer waiting time and slower processing speed, which lowers the accuracy achieved by the deadline. Task migration helps relieve excess load from overloaded servers.

c) Effectiveness of ML-based System Load Reduction: Figure 10 shows the accuracy guarantee ratio and the average JCT with and without MLF-C (in Section III-E). MLF-C improves the accuracy guarantee ratio by 17%-23% and the average JCT by 28%-42%. Server overload leads to longer waiting time and slower processing speed, which lowers the accuracy achieved by the deadline. MLF-C solves this problem by removing unnecessary tasks when the desired accuracy is reached.

d) Effectiveness of Optimal ML Iteration Stopping: We compare our OptS (in Section IV) with the GL, PQ, and UP early stopping methods introduced in [52] and HyperSched (HS) in [11]. We replace OptS in MLFS with GL, PQ, UP, HS and the method that runs ML for the specified maximum number of iterations (denoted by Max), respectively. Figure 11(a) plots the average JCT of these methods. The result exhibits that OptS < GL ≈ PQ < HS < UP < Max. For 1860 jobs, OptS improves PQ, GL and HS by around 17% and improves Max by 33%. In Figure 11(b), the average accuracy exhibits that Max ≈ OptS ≈ HS > UP > PQ > GL. For 1860 jobs, OptS has a similar average accuracy to Max and improves GL by 23%. OptS can find the stopping epoch near and after the minimum loss epoch, so the job does not waste computation time on further training, while Max needs to run for the specified maximum number of iterations. OptS improves the other methods in both accuracy and JCT because the other methods can only avoid overfitting but may not find the optimal stopping epoch, as shown in Figure 4. GL generates the minimum accuracy and also a relatively low JCT. It stops training once it finds that the deviation between the current loss and the minimum loss is larger than a threshold. However, it is sensitive to noisy data (e.g., a sudden increase of the validation loss in a fluctuation phase), which is common in ML training jobs [52]. HS tries to allocate more resources to the job with the largest potential to achieve the best final accuracy improvement in a deadline- and resource-constraint-aware manner, so that HS can also achieve an average accuracy similar to those of OptS and Max. Without an efficient early stopping method, however, the average JCT of HS is larger than that of OptS. PQ improves GL's accuracy with a similar JCT. It reduces the sensitivity to sudden increases of the loss by considering the stability of the training loss, which is actually stable in most cases. UP has higher accuracy and also higher JCT than PQ and GL. It stops training when the validation loss keeps increasing for a certain number of successive epochs, thus generating an average JCT close to Max's.


Fig. 11. Effectiveness of OptS.

VII. SUMMARY AND LIMITATIONS

Compared to other ML cluster schedulers, first, MLFS is novel in that it combines the ML features and the computation features together to achieve better performance in both JCT and accuracy. Second, the RL based method (MLF-RL) in MLFS improves both the JCT and the accuracy of the ML feature based heuristic task scheduling method (MLF-H), thus contributing to the JCT and accuracy performance of MLFS. Third, with system load control (MLF-C), MLFS generates a higher time overhead but improves both JCT and accuracy. Fourth, compared with other job schedulers, MLFS has two additional methods, task migration and system load control, which can significantly improve JCT and accuracy.

There are several limitations of MLFS. First, MLFS needs the task dependency graph to determine the priorities of different tasks. For an ML model that cannot be partitioned, our method may achieve similar performance to other methods. Second, when the RL job scheduler needs to be used in a new environment (e.g., when there is a new model or a new GPU type), the model previously trained under different scenarios may not be able to achieve the best performance, because the different characteristics of ML jobs and the different computing abilities of different hardware can degrade the performance of the RL model.

VIII. CONCLUSION

In this paper, we propose MLFS. Compared with previous ML job schedulers, the advantages of MLFS are that i) it intelligently leverages ML job features to significantly improve JCT and accuracy, ii) it can be applied to ML model parallelism, which is an approach for large-scale ML and DL jobs, and iii) it can meet both the deadline and the accuracy requirements of ML applications even when the system is overloaded. Considering the spatial/temporal ML features and computation features, MLFS determines the task priority for job scheduling to achieve our goal. MLFS also uses deep RL in task scheduling to achieve the goal by considering all the features. In addition, MLFS intelligently controls the system load by removing tasks once the desired accuracy is reached when the system is overloaded, and it determines the optimal number of iterations that an ML job runs to maximize accuracy while avoiding running more iterations. We also propose Optimal ML iteration stopping (OptS), which determines the proper time to stop training an ML model when the model reaches the minimum loss value. Both trace-driven large-scale simulation and real implementation show that MLFS has superior performance in JCT and accuracy compared with state-of-the-art ML job schedulers, and show the effectiveness of each component. This work is the first to explore how to leverage ML job features to improve ML job schedulers, and we hope that it can stimulate more research work in this direction. In the future, we will explore more features of the data parallelism and model parallelism scenario and leverage those features in job scheduling.

REFERENCES

[1] Y. Huang et al., "GPipe: Efficient training of giant neural networks using pipeline parallelism," in Proc. NIPS, 2019, pp. 1–10.
[2] Z. Jia, M. Zaharia, and A. Aiken, "Beyond data and model parallelism for deep neural networks," in Proc. SysML, 2019, pp. 1–13.
[3] A. Gholami, A. Azad, P. Jin, K. Keutzer, and A. Buluc, "Integrated model, batch, and domain parallelism in training neural networks," in Proc. SPAA, Jul. 2018, pp. 77–86.
[4] S. Lee, J. Kim, X. Zheng, Q. Ho, G. Gibson, and E. Xing, "On model parallelization and scheduling strategies for distributed machine learning," in Proc. NIPS, 2014, pp. 1–9.
[5] A. Mirhoseini et al., "Device placement optimization with reinforcement learning," in Proc. ICML, 2017, pp. 2430–2439.
[6] H. Zhang, L. Stafman, A. Or, and M. J. Freedman, "SLAQ: Quality-driven scheduling for distributed machine learning," in Proc. SOCC, Sep. 2017, pp. 390–404.
[7] W. Xiao et al., "Gandiva: Introspective cluster scheduling for deep learning," in Proc. OSDI, 2018, pp. 595–610.
[8] J. Gu, M. Chowdhury, G. Shin, and Y. Zhu, "Tiresias: A GPU cluster manager for distributed deep learning," in Proc. NSDI, 2019, pp. 485–500.
[9] Y. Peng, Y. Bao, Y. Chen, C. Wu, and C. Guo, "Optimus: An efficient dynamic resource scheduler for deep learning clusters," in Proc. EuroSys, Apr. 2018, pp. 1–14.
[10] J. Rasley, Y. He, and F. Yan, "HyperDrive: Exploring hyperparameters with pop scheduling," in Proc. Middleware, 2017, pp. 1–13.
[11] R. Liaw et al., "HyperSched: Dynamic resource reallocation for model development on a deadline," in Proc. SoCC, 2019, pp. 61–73.
[12] T. N. Le, X. Sun, M. Chowdhury, and Z. Liu, "AlloX: Compute allocation in hybrid clusters," in Proc. EuroSys, Apr. 2020, pp. 1–16.
[13] A. Tumanov, T. Zhu, J. W. Park, M. A. Kozuch, M. Harchol-Balter, and G. R. Ganger, "TetriSched: Global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters," in Proc. EuroSys, Apr. 2016, pp. 1–16.
[14] S. Chaudhary, R. Ramjee, M. Sivathanu, N. Kwatra, and S. Viswanatha, "Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning," in Proc. EuroSys, Apr. 2020, pp. 1–16.
[15] H. Yabu and D. Taniw, "Low-latency job scheduling with preemption for the development of deep learning," in Proc. OpML, 2019, pp. 27–30.
[16] Y. Peng, Y. Bao, Y. Chen, C. Wu, C. Meng, and W. Lin, "DL2: A deep learning-driven scheduler for deep learning clusters," 2019, arXiv:1909.06040.
[17] M. Li et al., "Scaling distributed machine learning with the parameter server," in Proc. 11th USENIX Conf. Oper. Syst. Design Implement., 2014, pp. 583–598.
[18] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, "Deep gradient compression: Reducing the communication bandwidth for distributed training," 2017, arXiv:1712.01887.
[19] J. Zhan, O. Kayiran, G. H. Loh, C. R. Das, and Y. Xie, "OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures," in Proc. MICRO, Oct. 2016, pp. 1–13.
[20] C.-C. Chen, C.-L. Yang, and H.-Y. Cheng, "Efficient and robust parallel DNN training through model parallelism on multi-GPU platform," 2018, arXiv:1809.02839.


[21] PyTorch. Accessed: Jul. 2022. [Online]. Available: https://pytorch.org
[22] P. Moritz et al., "Ray: A distributed framework for emerging AI applications," in Proc. OSDI, 2018, pp. 561–577.
[23] Microsoft DNN Trace. Accessed: Jul. 2022. [Online]. Available: https://github.com/msr-fiddle/philly-traces/
[24] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, "Large-scale cluster management at Google with Borg," in Proc. EuroSys, Apr. 2015.
[25] Source Code. Accessed: Jul. 2022. [Online]. Available: https://github.com/hiddenlayer2020/ML-Job-Scheduler-MLFS
[26] H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh, "Learning scheduling algorithms for data processing clusters," in Proc. SIGCOMM, Aug. 2019, pp. 270–288.
[27] P. Delgado, D. Didona, F. Dinu, and W. Zwaenepoel, "Kairos: Preemptive data center scheduling without runtime estimates," in Proc. SOCC, Oct. 2018, pp. 135–148.
[28] A. Chung, J. W. Park, and G. R. Ganger, "Stratus: Cost-aware container scheduling in the public cloud," in Proc. SOCC, Oct. 2018, pp. 121–134.
[29] C. Curino et al., "Hydra: A federated resource manager for data-center scale analytics," in Proc. NSDI, 2019, pp. 177–192.
[30] W. Bai, L. Chen, K. Chen, D. Han, C. Tian, and H. Wang, "Information-agnostic flow scheduling for commodity data centers," in Proc. NSDI, 2015, pp. 455–468.
[31] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella, "Multi-resource packing for cluster schedulers," in Proc. SIGCOMM, Aug. 2014, pp. 455–466.
[32] M. Chowdhury and I. Stoica, "HUG: Multi-resource fairness for correlated and elastic demands," in Proc. NSDI, 2016, pp. 407–424.
[33] J. Mace, P. Bodik, R. Fonseca, and M. Musuvathi, "Retro: Targeted resource management in multi-tenant distributed systems," in Proc. NSDI, 2015, pp. 589–603.
[34] P. Sun, Y. Wen, N. B. D. Ta, and S. Yan, "Towards distributed machine learning in shared clusters: A dynamically-partitioned approach," in Proc. SMARTCOMP, May 2017, pp. 1–6.
[35] B. Huang, M. Boehm, Y. Tian, B. Reinwald, S. Tatikonda, and F. R. Reiss, "Resource elasticity for large-scale machine learning," in Proc. SIGMOD, May 2015, pp. 137–152.
[36] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. OSDI, 2016, pp. 265–283.
[37] D. Narayanan et al., "PipeDream: Generalized pipeline parallelism for DNN training," in Proc. 27th ACM Symp. Oper. Syst. Princ., 2019, pp. 1–15.
[38] R. Grandl, S. Kandula, S. Rao, A. Akella, and J. Kulkarni, "GRAPHENE: Packing and dependency-aware scheduling for data-parallel clusters," in Proc. 12th USENIX Symp. Oper. Syst. Design Implement., 2016, pp. 81–97.
[39] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, "Resource management with deep reinforcement learning," in Proc. HotNet, Nov. 2016, pp. 50–56.
[40] T. Domhan, J. Springenberg, and F. Hutter, "Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves," in Proc. 24th Int. Joint Conf. Artif. Intell., 2019, pp. 1–9.
[41] A. Sergeev and M. D. Balso, "Horovod: Fast and easy distributed deep learning in TensorFlow," 2018, arXiv:1802.05799.
[42] Baidu, Ring All Reduce. Accessed: Jul. 2022. [Online]. Available: https://github.com/baidu-research/baidu-allreduce
[43] H. Mikami, H. Suganuma, Y. Tanaka, and Y. Kageyama, "Massively distributed SGD: ImageNet/ResNet-50 training in a flash," 2018, arXiv:1811.05233.
[44] M. Laumanns and E. Zitzler, "An efficient, adaptive parameter variation scheme for metaheuristics based on the epsilon-constraint method," Eur. J. Oper. Res., vol. 169, no. 3, pp. 932–942, 2006.
[45] H. Shen and L. Chen, "A resource usage intensity aware load balancing method for virtual machine migration in cloud datacenters," IEEE Trans. Cloud Comput., vol. 8, no. 1, pp. 17–31, 2017.
[46] A. Beloglazov and R. Buyya, "Optimal online deterministic algorithms and adaptive heuristics for energy and performance efficient dynamic consolidation of virtual machines in cloud data centers," Concurrency Comput. Pract. Exp., vol. 24, no. 13, pp. 1397–1420, Sep. 2012.
[47] R. Sutton, Reinforcement Learning. Cambridge, MA, USA: MIT Press, 2014.
[48] D. Silver et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[49] H. Mao et al., "Real-world video adaptation with reinforcement learning," 2020, arXiv:2008.12858.
[50] J. Wang, J. Xu, and X. Wang, "Combination of hyperband and Bayesian optimization for hyperparameter optimization in deep learning," 2018, arXiv:1801.01596.
[51] B. Letham and E. Bakshy, "Bayesian optimization for policy search via online-offline experimentation," J. Mach. Learn. Res., vol. 20, p. 145, Jan. 2019.
[52] L. Prechelt, "Early stopping—But when?" in Neural Networks: Tricks of the Trade. Berlin, Germany: Springer, 1998, pp. 55–69.
[53] ILSVRC2010. Accessed: Jul. 2022. [Online]. Available: https://github.com/Abhisek-/AlexNet
[54] Text Classification on R8 Dataset. Accessed: Jul. 2022. [Online]. Available: https://paperswithcode.com/sota/text-classification-on-r8
[55] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[56] M. Jeon, S. Venkataraman, A. Phanishayee, J. Qian, W. Xiao, and F. Yang, "Analysis of large-scale multi-tenant GPU clusters for DNN training workloads," in Proc. ATC, 2019, pp. 947–960.
[57] Amazon EC2 Types. Accessed: Jul. 2022. [Online]. Available: https://aws.amazon.com/ec2/instance-types/

Haoyu Wang (Student Member, IEEE) received the B.S. degree from the University of Science and Technology of China and the M.S. degree from Columbia University in the City of New York. He is currently pursuing the Ph.D. degree with the Department of Computer Science, University of Virginia. His research interests include data centers, cloud, and distributed networks.

Zetian Liu received the B.S. degree from Jilin University in 2018. He is currently pursuing the Ph.D. degree with the Department of Computer Science, University of Virginia. His research interests include distributed networks, cloud computing, and machine learning algorithms and applications.

Haiying Shen (Senior Member, IEEE) received the B.S. degree in computer science and engineering from Tongji University, China, in 2000, and the M.S. and Ph.D. degrees in computer engineering from Wayne State University in 2004 and 2006, respectively. She is currently an Associate Professor with the Department of Computer Science, University of Virginia. Her research interests include distributed computer systems, cloud and edge computing, distributed machine learning, big data, and cyber-physical systems. She is a Microsoft Faculty Fellow of 2010. She is a Senior Member of ACM.
