Machine Learning Feature Based Job Scheduling for Distributed Machine Learning Clusters - 2023
1, FEBRUARY 2023
Authorized licensed use limited to: Tsinghua University. Downloaded on June 14,2023 at 01:26:44 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: MACHINE LEARNING FEATURE BASED JOB SCHEDULING FOR DISTRIBUTED MACHINE LEARNING CLUSTERS 59
data transmission between tasks is needed. For example, the communication overhead between GPUs is 970MB-3168MB per mini-batch for data parallelism jobs and 784MB-1037MB per mini-batch for model parallelism jobs [20]. Second, it should concurrently improve JCT and accuracy since both are important for ML jobs. For example, hurricane path prediction jobs are time critical and also require high accuracy to avoid damage and casualties. Third, it should prioritize jobs by considering their urgency and accuracy requirements. For example, hurricane path prediction is time critical and requires high accuracy, but annual people movement prediction is time tolerant and does not require very high accuracy. Fourth, it should still provide a job deadline guarantee or low JCT when the system is overloaded.
In summary, the goal of MLFS is to minimize the average JCT, maximize the average job accuracy, maximize the number of jobs whose deadline and accuracy requirements are satisfied, and minimize the bandwidth cost. MLFS is novel in that it intelligently takes advantage of the spatial/temporal features of ML jobs. First, different tasks (for different model partitions) run separately, and the tasks have different impacts on the final job latency and accuracy (i.e., spatial features). For example, in the ML model partition graph (or task dependency graph), if a task has more dependent tasks, or its dependent tasks are in layers closer to the task, it should run earlier to improve job latency and accuracy by the job deadline. In addition, the model partition size (measured by the number of ML model parameters) influences the final accuracy result: a larger size generates a higher impact and vice versa. Second, an ML job usually runs many iterations, and earlier iterations have higher impact on the accuracy than later iterations [6] (i.e., temporal features). Therefore, the tasks in earlier iterations should have a higher priority to run and vice versa. Accordingly, MLFS first uses a heuristic scheduling method that considers these features to determine task priority for job queue ordering in order to improve the JCT and accuracy performance. To relieve extra load from an overloaded server, it also considers the task priority in selecting tasks to move out from the server and puts them in the queue to be rescheduled. It uses the data from the heuristic scheduling method for training a deep reinforcement learning (RL) model. After the RL model is well trained, it then switches to the RL method to automatically make decisions on job scheduling. Furthermore, when the system is overloaded, MLFS has a system load control method that selects tasks from overloaded servers to move to underloaded servers based on task priority and also intelligently terminates the tasks that contribute little to the required accuracy performance, which helps improve both JCT and accuracy by the job deadline. MLFS consists of the following components:
(1) ML feature based heuristic task scheduling (MLF-H). MLF-H determines the task priority based on the spatial/temporal ML job features and computation features (e.g., deadline, waiting time, remaining job time) to improve both JCT and accuracy. Tasks that contribute more to the accuracy and JCT are given higher priorities. The priority of a task is used for queuing tasks and allocating tasks when there are servers with available resources. MLF-H also handles overloaded servers by considering task priority and other factors in selecting tasks to migrate out of overloaded servers, and reschedules these tasks along with the waiting tasks in the queue. MLF-H further tries to fully utilize resources and reduce bandwidth cost when selecting migration tasks from an overloaded server and when selecting a host server to allocate a task.
(2) ML feature based RL task scheduling (MLF-RL). MLFS initially runs MLF-H for a certain time period and uses the data to train a deep RL model, and it then switches to MLF-RL when the model is well trained. Given both running and waiting tasks as well as nodes, based on their status, MLF-RL selects tasks to move out of overloaded nodes and determines the destination node or the queue for the selected tasks and the waiting tasks in the queue to achieve the aforementioned goal.
(3) ML feature based system load control (MLF-C). When the system is overloaded, jobs will experience high JCT and low accuracy by the job deadline. MLF-C can stop running or generating tasks once the desired accuracy is reached (based on users’ choices) to relieve system workload in order to improve JCT and accuracy by the job deadline. MLF-C also adopts an optimal ML iteration stopping method that finds the iteration to stop training in order to achieve the maximum accuracy while minimizing the number of iterations.
(4) Optimal ML iteration stopping (OptS). OptS can find the proper time to stop training an ML model near and after the minimum loss epoch so that the ML job does not waste computation time and resources on further training.
We conducted real experiments using PyTorch [21] on AWS [22] and large-scale simulation based on a real workload trace [23]. Extensive experimental results show the superior performance of MLFS compared to state-of-the-art methods in [5]–[8], [24] and the effectiveness of each of its components. We open-sourced our code on GitHub [25].
The rest of this paper is organized as follows. Section II presents related work. Section III presents our goal and the details of MLFS. Section IV presents the enhancement method. Section VI presents performance evaluation. Section VII presents the summary of the experiment results and the limitations of our methods. Section VIII presents our conclusions and our future work.
II. RELATED WORK
A significant amount of research effort has been devoted to job scheduling in clusters and cloud datacenters (e.g., [26]–[33]). In this paper, we focus on job scheduling for ML clusters [5]–[8], [24], [34], [35]. Li et al. [17] proposed a parameter server framework for ML data parallelism using first-in-first-out (FIFO) scheduling. TensorFlow [36] uses the Borg resource manager [24] that aims to achieve fairness of resource allocation among different jobs. SLAQ [6] aims to maximize the overall job accuracy. For the available CPU resource, SLAQ predicts the loss reduction and runtime if different numbers of CPU cores are assigned to each of the running jobs, and then chooses the job with the maximum loss reduction per unit runtime to adjust the number of CPU cores. Tiresias [8] is proposed to schedule DL jobs in a GPU cluster to reduce JCT. It determines job priority (for queuing and preemption when there are no available GPUs) using two principles: 1) for jobs without prior knowledge of its task running
60 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 31, NO. 1, FEBRUARY 2023
each task sends its training results to the parameter server, which accumulates the results and updates the ML model. To apply the data and model parallelism to the parameter server communication structure, an ML network is divided into different partitions and each partition runs in a worker and handles one mini-batch. The workers transmit data based on the ML partition dependency, and the final workers in the dependency graph (i.e., model partition graph) send the training results to the parameter server. In the all-reduce communication structure [41], one parameter server containing the whole model and one worker are combined together as one reducer, which is the unit execution processor. For the model parameter update in each iteration, each reducer communicates to other reducers to update model parameters using a communication topology (e.g., ring all-reduce [42] and 2D-Torus [43]). To apply data and model parallelism to the all-reduce communication structure, we can directly use the existing communication topology for the parameter transmission between workers to update their models. Therefore, the task dependency graph in data and model parallelism can be used for both the parameter server and all-reduce communication structures.
Assume that $\mathcal{J}$ represents the job set, $\mathcal{T}$ represents the task set, and $\mathcal{N}$ represents the node set and the queue. We use $(k, n)$ to represent that task $k$ is allocated to node $n$ or the queue. Our goal is to find a task allocation plan $A = \{(k, n) \mid k \in \mathcal{T}, n \in \mathcal{N}\}$ that minimizes the average JCT, maximizes the average ML model accuracy, maximizes the number of jobs whose deadline and accuracy requirements are satisfied, and minimizes the bandwidth cost between nodes. For job $J$, we use $d_J$ to denote its JCT, $d^r_J$ its deadline, $a_J$ its final accuracy, and $a^r_J$ its accuracy requirement. Then, the goal can be represented by the formula below:
$$
\begin{cases}
g_1(A) = 1 \Big/ \dfrac{\sum_{J \in \mathcal{J}} d_J}{|\mathcal{J}|} \\
g_2(A) = \sum_{J \in \mathcal{J}} \mathbf{1}(d^r_J \ge d_J) \\
g_3(A) = 1 \Big/ \sum_{n_i, n_j \in \mathcal{N}} B_{n_i, n_j} \\
g_4(A) = \sum_{J \in \mathcal{J}} \mathbf{1}(a_J \ge a^r_J) \\
g_5(A) = \dfrac{\sum_{J \in \mathcal{J}} a_J}{|\mathcal{J}|}
\end{cases}
\qquad \max \big(g_1(A), g_2(A), g_3(A), g_4(A), g_5(A)\big) \tag{1}
$$
in which $\mathbf{1}(\cdot)$ is a binary indicator function, and $B_{n_i,n_j}$ is the overall size of the data transmitted between $n_i$ and $n_j$. The goal is to maximize the value of each objective function $g_i(A)$ ($i = 1, 2, \ldots, 5$) simultaneously. Note that this problem formulation covers both the all-reduce communication structure and the parameter server communication structure. This is a multi-objective optimization problem. We could use the adaptive epsilon constraint algorithm [44] to solve the above multi-objective optimization problem. However, due to its high computation overhead, we instead propose a heuristic task scheduling method to obtain the solution of this optimization problem and an RL-based task scheduling method in the following sections.
C. ML Feature Based Heuristic Task Scheduling (MLF-H)
1) Priority Determination: We use $P_{k,J}$ to denote the priority of task $k$ of job $J$ which is in its $I$th iteration. We use $I_{max}$ to denote the specified maximum number of iterations of job $J$. We first introduce how we consider the ML spatial and temporal features to determine the priority (denoted by $P^{ML}_{k,J}$), and then introduce how we consider the traditional computation features to determine the priority (denoted by $P^{C}_{k,J}$). Finally, we combine the two priority values to calculate $P_{k,J}$.
First, the jobs running in a cluster have different urgency levels, as explained in Section I. The system sets $m$ levels of job urgent coefficients $L_J \in [0, m]$; a higher urgent coefficient means that the job is more urgent and vice versa. The urgent coefficients of jobs in a cluster can be pre-determined by the system administrator based on the urgency of the applications or specified by the users. A higher urgent coefficient specified by a user leads to higher monetary cost to the user. Giving higher priorities to jobs with higher urgency levels can help the jobs to meet their deadline requirement.
Second, for an ML job, based on the ML temporal features, the earlier iterations are more important than later iterations for improving accuracy because of the diminishing loss reduction returns [6]. Therefore, tasks that will gain higher accuracy improvement in the next iteration should have a higher priority to be scheduled in order to improve the average accuracy of jobs in the system. We use the inverse proportional function $\frac{1}{I}$ ($I \ge 1$) to represent the importance of the current iteration of the job. A larger ratio $\frac{1}{I}$ means an earlier stage of the job running, so the $I$th iteration contributes more to the accuracy improvement of the ML job. We use $\delta l_{I-1}$ to denote the loss reduction of the most recent finished iteration $I-1$. Then, $\sum_{j=1}^{I-1} \delta l_j$ denotes the overall loss reduction achieved by all the completed iterations, and $\frac{\delta l_{I-1}}{\sum_{j=1}^{I-1} \delta l_j}$ represents the normalized loss reduction of the most recent completed iteration. A higher value of $\frac{\delta l_{I-1}}{\sum_{j=1}^{I-1} \delta l_j}$ means that the $I$th iteration contributes more to the loss reduction. Note that these two formulas represent the trends of general ML jobs and can be replaced by other appropriate formulas specific to certain ML jobs.
Third, considering the ML spatial features in model parallelism, a larger model partition usually plays a more important role in calculating the model parameters. We measure the model partition size by the number of ML model parameters in this model partition. We use $S^J_k = S_k / S^J$ to denote the normalized size of the model partition of task $k$, where $S_k$ is the size of the model partition and $S^J$ is the size of the entire ML model. Giving a higher priority to a larger model partition can contribute more to increasing the model accuracy.
Combining all of these ML features, for task $k$ that has no dependent tasks, its priority is calculated by:
$$
P^{ML}_{k,J} = L_J \cdot \frac{1}{I} \cdot \frac{\delta l_{I-1}}{\sum_{j=1}^{I-1} \delta l_j} \cdot S^J_k \tag{2}
$$
Next, we consider the dependency of the tasks (or workers) in the dependency graph as shown in Figure 2. The more tasks that depend on task $k$, the higher priority that task $k$ should have because its completion enables more other
tasks to start running, which helps reduce JCT. In addition, if more dependent tasks are in layers closer to task $k$ in the dependency graph, then it should have a higher priority because its completion enables more tasks to start earlier. Based on this rationale, we calculate the priority of task $k$ considering ML features, $P^{ML}_{k,J}$, as follows:
$$
P^{ML}_{k,J} =
\begin{cases}
P^{ML}_{k,J}, & \text{no dependent tasks} \\
P^{ML}_{k,J} + \gamma \sum_{i \in child(k)} P^{ML}_{i,J}, & \text{otherwise}
\end{cases}
\tag{3}
$$
where $\gamma \in (0, 1)$ is a discounting factor that accounts for the closeness between a dependent task and task $k$, and $child(k)$ is the set of direct children of task $k$ in the dependency graph. A larger $\gamma$ means a higher weight is given to the priorities of a task’s children when determining the task’s priority. This priority determination method can be directly applied to the all-reduce communication structure to determine the priority of each task running in a worker. For the parameter server communication structure, which has a separate centralized parameter server, its computation is also considered as a task and assigned the highest priority since only after the parameter server is determined do the tasks in the workers know where to send their results.
For the computation features of task $k$ of job $J$, we consider its task deadline ($d_{k,J}$), remaining running time ($r_{k,J}$) and waiting time in the queue ($w_{k,J}$). We consider two cases to estimate the remaining running time. If a job is executed previously, the deadline of each of its tasks can be calculated based on the job’s deadline, dependency graph and historical task running time, as in [8]. We use the historical task running time as the task’s required running time. If a job is not executed previously [1], we use sample running to estimate the required running time of each of the job’s tasks and then infer the deadline of each task. Then, a task’s remaining running time is calculated by $r_{k,J} = t_{k,J} - p_{k,J}$, where $t_{k,J}$ is the estimated required running time of task $k$ and $p_{k,J}$ is the actual running time of task $k$. A task with a closer deadline, less remaining running time, or longer waiting time should have a higher priority to run because it helps meet the job deadline and improve the JCT. Considering the computation features, we calculate the priority of task $k$ that does not have dependent tasks by:
$$
P^{C}_{k,J} = \gamma_d \cdot \frac{1}{d_{k,J} - t} + \gamma_r \cdot \frac{1}{r_{k,J}} + \gamma_w \cdot w_{k,J} \tag{4}
$$
where $t$ is the current time and $\gamma_d$, $\gamma_r$ and $\gamma_w$ are the weights of the considered factors. Larger $\gamma_d$, $\gamma_r$ and $\gamma_w$ lead to more weight on the deadline, the job remaining time and the job waiting time in the queue, respectively. The priority of task $k$ considering the traditional computation features, $P^{C}_{k,J}$, is calculated by:
$$
P^{C}_{k,J} =
\begin{cases}
P^{C}_{k,J}, & \text{no dependent tasks} \\
P^{C}_{k,J} + \gamma \sum_{i \in child(k)} P^{C}_{i,J}, & \text{otherwise}
\end{cases}
\tag{5}
$$
By combining the priorities based on the ML features and computation features (Equations (3) and (5)), $P_{k,J}$ is calculated by:
$$
P_{k,J} = \alpha P^{ML}_{k,J} + (1 - \alpha) P^{C}_{k,J}, \quad \alpha \in [0, 1] \tag{6}
$$
where $\alpha$ is a weight factor and a larger $\alpha$ means that the ML job features have higher weights than the computation features in determining a job’s priority. MLF-H uses the task priority to order the tasks in the queue. Because tasks with higher priority are allocated to servers and executed earlier, it helps achieve our goal in Formula (1). Compared to previous job schedulers (e.g., Graphene [38]) that only consider traditional computation features and task dependency, MLFS is novel in that it additionally considers ML job spatial/temporal features in the priority determination. Also, it additionally considers the goal of increasing ML accuracy, which is not considered in previous schedulers for general jobs.
2) Basic Job Scheduling: MLF-H inserts newly submitted tasks and preempted tasks into the queue based on the priority. When there are underloaded servers and waiting tasks in the queue, it picks tasks one by one from the queue and assigns each to an underloaded node until there are no more underloaded nodes or the queue is empty. Here, one problem is which node among the underloaded nodes we should choose to allocate each task from the queue. To solve this problem, we can rely on the method in [45] as described below.
We consider $M$ types of resources (e.g., GPU, CPU, memory, and bandwidth), and more types of resources can easily be added. The utilization of type-$m$ resource of server $s$ at time slot $t$ is calculated by $u^t_m = l^t_m / c_m$, where $l^t_m$ is the resource consumption of server $s$ in time slot $t$ and $c_m$ is the capacity of server $s$ on type-$m$ resource. The utilization of type-$m$ resource of task $k$ is calculated by $u^t_{m,k} = l^t_{m,k} / c_m$, where $l^t_{m,k}$ is the resource consumption of task $k$ in time slot $t$. The resource utilization of server $s$ at time $t$ ($U^t_s$) is represented as a vector $(u^t_{1,s}, u^t_{2,s}, \ldots, u^t_{m,s}, \ldots, u^t_{M,s})$, and the resource utilization of task $k$ at time $t$ ($U^t_k$) is represented as a vector $(u^t_{1,k}, u^t_{2,k}, \ldots, u^t_{m,k}, \ldots, u^t_{M,k})$. We consider that type-$m$ resource in a server is overloaded if its $u^t_m$ is higher than a pre-defined threshold $h_r$ (e.g., 90%). A larger $h_r$ helps more fully utilize the resources but increases the probability that a server becomes overloaded and hence JCT. A lower $h_r$ increases the number of unnecessary task migrations. When at least one type of resource in a server is overloaded, we consider that this server is overloaded; otherwise, it is underloaded.
When we choose a server from several underloaded servers to allocate a task $k$, we hope that the chosen node is less likely to be overloaded, the communication bandwidth consumption between this task and other tasks is maximally reduced, and the task’s performance degradation caused by the task movement to the node [46] is minimized. Avoiding overloaded servers and fully utilizing resources can help improve JCT and accuracy of the jobs in the system. To achieve these objectives, among the underloaded servers, we first determine an ideal virtual host server for the task from the queue represented by $U^t_V = (u^t_{1,V}, u^t_{2,V}, \ldots, u^t_{m,V}, \ldots, u^t_{M,V}, u_{BW,V}, q_{k,V})$, in which $u^t_{m,V}$ is the minimum type-$m$ resource utilization among all of the underloaded servers, $u_{BW,V}$ denotes the maximum value among the communication data sizes between the task and each of the underloaded servers (in order to allocate high-volume communicating tasks to the same server), and $q_{k,V}$ is the task performance degradation caused by the task movement [46] and its ideal value is 0. The server that is
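To make Equations (2)-(6) concrete, the following is a minimal Python sketch and not the authors' implementation. The `Task` container, the function names, and the default weights are hypothetical. The sketch further assumes that the child priorities in Equations (3) and (5) are themselves computed with the same rule (so that deeper descendants are discounted by higher powers of the discounting factor), that the deadline lies in the future, that the remaining running time is positive, and that at least one iteration has completed.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical container for the per-task quantities used in Eqs. (2)-(6)."""
    urgency: float         # L_J, urgent coefficient of the task's job
    iteration: int         # I, current iteration of the job
    loss_reductions: list  # [delta_l_1, ..., delta_l_{I-1}] of completed iterations
    size_ratio: float      # S_k / S^J, normalized model partition size
    deadline: float        # d_{k,J}
    remaining: float       # r_{k,J} = t_{k,J} - p_{k,J}
    waiting: float         # w_{k,J}
    children: list = field(default_factory=list)  # direct children in the dependency graph


def base_ml_priority(task):
    """Eq. (2): urgency * 1/I * normalized latest loss reduction * partition size."""
    normalized = task.loss_reductions[-1] / sum(task.loss_reductions)
    return task.urgency * (1.0 / task.iteration) * normalized * task.size_ratio


def base_comp_priority(task, now, g_d=1.0, g_r=1.0, g_w=1.0):
    """Eq. (4): weighted deadline closeness, remaining time, and waiting time."""
    return g_d / (task.deadline - now) + g_r / task.remaining + g_w * task.waiting


def with_children(task, base, gamma):
    """Eqs. (3)/(5): own base priority plus gamma-discounted child priorities.

    Assumption: children are evaluated recursively with the same rule.
    """
    return base(task) + gamma * sum(with_children(c, base, gamma) for c in task.children)


def priority(task, now, alpha=0.5, gamma=0.5):
    """Eq. (6): convex combination of the ML-feature and computation-feature priorities."""
    p_ml = with_children(task, base_ml_priority, gamma)
    p_c = with_children(task, lambda t: base_comp_priority(t, now), gamma)
    return alpha * p_ml + (1.0 - alpha) * p_c


leaf = Task(urgency=2, iteration=2, loss_reductions=[0.5], size_ratio=0.5,
            deadline=10, remaining=2, waiting=1)
parent = Task(urgency=2, iteration=2, loss_reductions=[0.5], size_ratio=0.5,
              deadline=10, remaining=2, waiting=1, children=[leaf])
queue = sorted([leaf, parent], key=lambda t: priority(t, now=0), reverse=True)
```

In this toy example the parent is ordered ahead of the leaf because its priority also counts the discounted priority of the task that depends on it.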
most similar to the ideal virtual host server based on the Euclidean distance should be chosen. That is, the server with $\min \|U^t_s - U^t_V\|$ among the candidate servers that will not be overloaded (on each resource and its least-loaded GPU) by hosting the task is selected as the host server for the task. Then, we schedule the task to the least-loaded GPU in the selected server. After one task is allocated, the scheduler continues to pick up the next task in the queue and schedule it. The scheduling stops when there are no more underloaded servers or the queue is empty. Finally, based on the determined task allocation schedule, the tasks are moved to their allocated GPUs.
3) Handling Overloaded Servers: MLF-H can be executed periodically or when there are underloaded nodes (and waiting tasks) or overloaded nodes. When there is an overloaded server, MLF-H chooses tasks in the overloaded server to be migrated out to relieve its excess load, inserts the tasks into the waiting queue, and reschedules the tasks along with the waiting tasks in the queue in the same manner. Note that these chosen migration tasks are virtually instead of actually moved to the queue in order to save the migration overhead. After one migration task is scheduled, it is directly moved from the overloaded server to the scheduled server. When there are no more underloaded servers but some migration tasks are still not assigned to servers, then these migration tasks are moved back to the queue.
Many previous ML job schedulers [5], [6], [8], [24] only determine task priority for task allocation (and preemption [8]) but do not handle server overload. Gandiva [7] handles GPU overload but does not consider other types of resources, though ML jobs also use other types of resources including CPU, memory, and bandwidth. Server overload on other resources still can adversely affect the ML job performance. MLF-H overcomes these problems. Below, we present how MLF-H selects migration tasks from an overloaded node while avoiding resource fragmentation to more fully utilize resources in the system.
For an overloaded server, to move out some of its current tasks to relieve its excess workload, an important question is which tasks we should choose to move out. To fully utilize resources and relieve the load on overloaded resources in an overloaded server, we use the method in [45]. That is, we first determine an ideal virtual task to move out represented by $U^t_v = (u^t_{1,v}, u^t_{2,v}, \ldots, u^t_{m,v}, \ldots, u^t_{M,v}, u_{BW,v})$, in which $u^t_{m,v}$ for each overloaded resource is the maximum value among the tasks in the server and $u^t_{m,v}$ for each underloaded resource is the minimum value among the tasks in the server. $u_{BW,v}$ denotes the ideal communication data size between the migration task and existing tasks in the server and it equals 0, which means that the task migration will not cause additional communication cost. Then, the task whose resource utilization is the closest to $U^t_v$ based on the Euclidean distance, i.e., the task with $\min \|U^t_k - U^t_v\|$, is selected to move out. If the server is still overloaded after it migrates out this selected task, the same process is repeated until the server is not overloaded. Here, we further advance this method by considering the ML features. First, we need to make sure that the tasks with high priorities will not be selected to migrate out in order to improve JCT and accuracy. Recall that the priority is calculated considering the accuracy improvement, so the tasks that will contribute more to improving the accuracy should be less likely to be chosen to migrate out. Second, because ML tasks run in multiple GPUs in a server, each GPU must not be overloaded. Therefore, if there exist overloaded GPUs, we order the tasks in the overloaded GPUs based on the ascending order of the priority, and select tasks using the above method only among a certain percentage ($p_s$) of the tasks on the top until there are no overloaded GPUs. A smaller $p_s$ means that the migration tasks have lower priorities, which constrains the performance degradation on JCT and accuracy but may not relieve overload quickly. When there are no overloaded GPUs, we select the task from all the running tasks in the server using the above method.
Stragglers may occur due to failing hardware, software bugs, misconfiguration and so on. To handle this problem, we can duplicate a task and assign the replica task to another server in task scheduling. Then, we use the output of the task that completes first and stop the other tasks. The number of replicas of a task depends on the possibility of the straggler occurrence. More replicas can better avoid straggler occurrence but generate more overhead. We leave detailed study of stragglers as future work.
D. ML Feature Based RL Task Scheduling (MLF-RL)
The heuristic MLF-H and other heuristic scheduling methods may not be able to fully catch ML job features or set optimal parameter values (e.g., $\gamma$, $\alpha$), so obtaining the optimal schedule for the goal in Equation (1) may be difficult. Also, the decision making in MLF-H may take a long time. To handle these problems, we rely on the deep RL technique. That is, MLFS initially runs MLF-H for a certain time period and uses the data to train MLF-RL, and then switches to MLF-RL when it is well trained. MLF-RL is novel compared with the previous RL-based job schedulers [5], [26], [39] in that it additionally considers ML features to improve both JCT and accuracy while previous RL-based job schedulers do not aim to improve accuracy or consider ML features.
An RL consists of an agent, state, action, and reward [47]. In each time step $t$, the agent observes the environment state $s_t$ and chooses an action $a_t$ based on its optimal policy in response to the current state and receives reward $r_t$. Recently, the Deep Neural Network (DNN) has become a popular function approximator because it automatically extracts features when solving large-scale RL problems [48]. Thus, we use DNN to serve as the agent, which generates the optimal policy. The output of DNN is the probability distribution of actions $\pi: \pi(s_t, a_t) \to [0, 1]$, where $\pi(s_t, a_t)$ denotes the probability of taking an action $a_t$ at the current state $s_t$. The goal of the agent is to maximize the expected cumulative discounted reward: $E\left[\sum_{t=0}^{\infty} \eta^t r_t\right]$, where $\eta \in (0, 1]$ is a factor discounting future rewards. A larger $\eta$ enables the RL agent to put more weight on the future rewards when updating the model.
As shown in Figure 3, the state includes the information of all the waiting and running tasks and nodes in the cluster including the information needed to derive the ML job features and computation features used in MLF-H and additional information such as the ML algorithm name and dependency graph. More specifically, the state (or the RL input) includes:
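As a small illustration of the discounted objective used by the RL agent, the sketch below evaluates the cumulative discounted reward for a finite reward sequence, that is, a truncated version of the infinite sum in the text. The function name `discounted_return` and the sample rewards are hypothetical and are not taken from the MLFS code.

```python
def discounted_return(rewards, eta):
    """Return sum over t of eta**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + eta * G_{t+1}.
    for r in reversed(rewards):
        total = r + eta * total
    return total


# A reward that only arrives at the third step counts for more
# when the discount factor eta is larger.
late_reward = [0.0, 0.0, 10.0]
short_sighted = discounted_return(late_reward, eta=0.5)
far_sighted = discounted_return(late_reward, eta=0.9)
```

With a larger discount factor the same late reward contributes more to the return, which matches the statement that a larger factor puts more weight on future rewards.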
option i) may lead to longer JCT especially when the system is overloaded. Based on the users’ different choices and their performance requirements on accuracy and JCT, they are charged with different payment rates. In a private cluster, the system administrator can directly make the decision about the above options. When there are enough resources in the cluster, we can run the jobs based on user preference. When the system is overloaded, we will make changes on the user choices if the changes help reduce the system workload.
Thus, when the system is not overloaded, MLF-C follows the user choices accordingly, and when the system is overloaded, MLF-C changes the choices based on the users’ indications to reduce system workload. Compared to option i) (the current approaches), options ii) and iii) reduce the number of iterations of the ML jobs and hence system workload. As a result, this reduces JCT since jobs do not need to run more iterations or wait in the queue for a long time due to system overload while still providing near-optimal accuracy or meeting users’ accuracy requirements. In addition, the jobs’ accuracies by the job deadlines are improved since important iterations have a higher chance to run and will not be blocked due to the running of unimportant iterations. As a result, both JCT and accuracy performances are improved. Therefore, when the system is not overloaded, MLF-C proactively helps avoid system overload, and when the system is overloaded, MLF-C reactively reduces system workload to mitigate the overload. In addition, MLF-C also helps people who do not have the knowledge about how many iterations are needed to achieve the maximum or their desired accuracy.
Next, we explain how we judge whether the system resources are limited or the system is overloaded. Recall that the resource utilization of server $s$ at time $t$ is represented as a vector $U^t_s = (u^t_1, u^t_2, \ldots, u^t_m, \ldots, u^t_M)$. The resource utilization of the cluster is represented as $U^t_c = (U^t_1, U^t_2, \ldots, U^t_{|\mathcal{N}|})$, where $\mathcal{N}$ is the set of servers in the cluster. The overload degree of server $s$ is calculated by $O^t_s = \|U^t_s\|$. The overload degree of the whole cluster is measured by the average of overload degrees of all servers: $O^t_c = \frac{1}{|\mathcal{N}|} \sum_{s \in \mathcal{N}} \|U^t_s\|$. The system is considered to be overloaded when there are tasks in the queue or when $O^t_c > h_s$, where $h_s$ is a pre-defined threshold. A larger $h_s$ tolerates more severe resource competition between jobs, which leads to higher JCT. However, a lower $h_s$ may cause a higher overhead for system load control. In this case, as mentioned above, based on the users’ indications, MLF-C makes changes on user selected options if the changes reduce the number of iterations, and stops producing or running unnecessary tasks accordingly. Consequently, the system does not need to further generate or run tasks for the next iteration for the job once the selected option’s accuracy requirement is met.
IV. OPTIMAL ML ITERATION STOPPING (OPTS)
Figure 4 illustrates the validation loss curve versus epochs for different ML training algorithms including AlexNet, Residual Neural Network (ResNet), Multilayer Perceptron (MLP), and Long Short-term Memory (LSTM). These ML algorithms and training are used in [7] and we obtained them from GitHub [53], [54]. There is a tradeoff between ML model accuracy and training time [52]. The current approaches to handle the overfitting problem (explained in Section III-E) are not efficient. First, the human observation method relies on intuition and needs expert knowledge, which is not applicable for general users. Second, the minimum validation loss may appear earlier than the specified maximum epoch and then computation resources are wasted and JCT is increased by continuing training the model. But we also cannot stop training at the current minimum loss ($l_{min}$) because there might be a new minimum loss later. Therefore, the proper time to stop training an ML model is when we can guarantee that all future losses are larger than $l_{min}$ in the validation loss curve.
To find such a stopping epoch, three methods in [52] have been proposed to find the epoch where the loss increases significantly, which indicates that the model is overfitting the data. Let $l_{va}(t)$ and $l_{tr}(t)$ be the validation loss and training loss after training $t$ epochs. The first method is called Generalization Loss (GL). It defines $GL(t) = 100 \times \left(\frac{l_{va}(t)}{l_{min}} - 1\right)$, which measures the fraction of additional loss comparing the current validation loss to the minimum loss. The stopping time is when $GL(t)$ is larger than a threshold ($h_{GL}$). The second method is the Progress Quotient window based method (PQ). It defines $PQ = \frac{GL(t)}{P_\kappa(t)}$, where $P_\kappa(t) = 1000 \times \left(\frac{\sum_{i=t-\kappa+1}^{t} l_{tr}(i)}{\kappa \times \min_{i=t-\kappa+1}^{t} l_{tr}(i)} - 1\right)$ measures how much the average training loss was larger than the minimum training loss during window $\kappa$ assuming the training is from epoch $t - \kappa + 1$ to epoch $t$. A higher $GL(t)$ means more overfitting, and a lower $P_\kappa(t)$ means more stable training loss. Thus, it stops training when $PQ$ is greater than a threshold ($h_{PQ}$). The third method is called UP. It stops training at the end of the ($m_h$)th window when $GL(t)$ keeps increasing in $m_h$ successive windows. The validation loss curve of ML jobs usually can be divided into two phases: quick-drop phase and fluctuation phase [52] as shown in Figure 4. After a quick-drop phase, the validation loss begins to increase but may still further decrease. Thus, we aim to find the minimum loss point in the fluctuation phase. For this
purpose, we first find the fluctuation phase, and then find the region of the loss curve where the minimum validation loss is unlikely to appear, that is, where the overall trend of losses goes up and becomes more stable. Specifically, we keep track of the loss values. We consider a phase as a quick-drop phase if the number of consecutive loss-decreasing epochs with loss values lower than the current minimum loss reaches a threshold $m_d$; that is,

$$\mathbb{1}\big(l_{va}(i) < l_{min}\big) = 1 \quad (i = k, k+1, \ldots, k+m_d-1), \tag{8}$$

where $l_{va}(i)$ is the validation loss of the $i$-th epoch, and $k$ is any epoch. Then, the subsequent curve starting from the $(k+m_d-1)$-th epoch is the fluctuation phase. Next, we sample every window of $\kappa$ losses in the fluctuation phase. Let $k' = k+m_d-1$; then the $j$-th window is represented as $w_j = \{l_{va}(k'+j\kappa), l_{va}(k'+1+j\kappa), \ldots, l_{va}(k'+\kappa-1+j\kappa)\}$ $(j = 0, 1, \ldots)$. We stop training when

$$\mathrm{mean}(w_{j+1}) > \mathrm{mean}(w_j) \ \text{ and } \ \mathrm{var}(w_{j+1}) < \mathrm{var}(w_j), \tag{9}$$

where $\mathrm{mean}(w_j)$ and $\mathrm{var}(w_j)$ denote the mean and variance of the loss values in window $w_j$. The idea behind Equ. (9) is that $\mathrm{mean}(w_{j+1}) > \mathrm{mean}(w_j)$ means that the model starts to overfit the training data [52], and $\mathrm{var}(w_{j+1}) < \mathrm{var}(w_j)$ means that the loss value tends to be stable and is unlikely to decrease further. The trained ML model for each epoch after the quick-drop phase is recorded along with its loss. When the training stops, the stored trained ML model with $l_{min}$ so far is fetched as the final trained ML model. To further enhance the accuracy of finding the minimum loss epoch, we can stop training when Equ. (9) holds for $m_o$ successive windows. Our experiments show that $m_o = 1$ is sufficient.

After the first fluctuation phase, there may be another quick-drop phase followed by another fluctuation phase and so on, which means there are multiple quick-drop and fluctuation phases in a curve. To handle this case, from the end of the quick-drop phase, i.e., the $(k+m_d-1)$-th epoch, we also keep checking concurrently whether there exists another quick-drop phase using Equ. (8). If we find another quick-drop phase before we find the optimal stopping time, then we repeat the above process; that is, we identify the subsequent fluctuation phase and find the optimal stopping epoch. In Figure 4, we marked the stopping epochs found by the three previous methods and OptS. OptS stops training closely after the actual minimum loss epoch, while the other methods stop either much earlier or much later than the actual minimum epoch. In the former case, the method cannot find the minimum point, and in the latter case, it wastes computation resources and time.

V. IMPLEMENTATION

We built our MLFS methods on PyTorch using roughly 5000 lines of Python code.

A. Tracking Accuracy and Optimal Stopping

In order to realize the job scheduling and optimal stopping functions in our system, we need to track the accuracy change in real time. We use the existing API torch.utils.tensorboard to track the accuracy of each job and record the accuracy using the function SummaryWriter(). We create an OptimalStopping(accuracy,iterations) class in the code. Meanwhile, we use the API torch.utils.checkpoint(), which saves a checkpoint of an ML job's model when the accuracy changes in each iteration. After the optimal point has been found by the method introduced in Section IV, the corresponding trained model with satisfactory accuracy is picked by OptimalStopping(), which then makes the decision with stop() or continue().

B. Job Scheduling

The MLFS scheduler is executed periodically or when there are overloaded nodes in the cluster. The GPU utilization is obtained from cutorch.getMemoryUsage, and the utilization of other resources is obtained with Linux commands. In our lab-made Python code, we create one queue waitingQueue() to store the IDs of all the ready-to-schedule jobs. All the proposed job schedulers generate the job placement plan for the jobs in the queue in the format JobID, NodeID. Algorithm 1 shows the pseudocode of MLF-H. The queueing tasks are scheduled one by one from the queue until the queue is empty or there are no available computing resources. Algorithm 2 shows the pseudocode of MLF-RL. The RL agent processes the tasks in the queue at the same time with the trained model as introduced in Section V-D and then generates the job placement plan. After MLF-H or MLF-RL makes the job placement plan, our implemented class JobAssign(JobID, NodeID) is executed for each assigned job according to the generated job placement plan.

C. Model Partition

There are two types of model partition: sequential model partition and layer-based model partition. For the sequential model partition, we modified the MLP and AlexNet code based on

Algorithm 1 Algorithm Pseudocode of MLF-H
1 Function MLF-H: heuristic job scheduling
2 Update the running tasks' information and resource utilizations of all the nodes in the cluster;
3 for all the tasks in the waiting queue (including newly submitted tasks and ready-to-migrate tasks) do
4 Calculate the priority $P_{k,J}$ according to Equation (6);
5 Sort the tasks in the queue in descending order of priority;
6 Assign the queueing tasks one by one from the top until there is no available resource or the queue is empty;
7 Update the resource utilization of each node;

Algorithm 2 Algorithm Pseudocode of MLF-RL
1 Function MLF-RL: RL based job scheduling
2 Update the running tasks' information and resource utilizations of all the nodes in the cluster;
3 for all the tasks in the waiting queue (including newly submitted jobs and ready-to-migrate tasks) do
4 Generate the current state $s_t$;
5 Select action $a_t$ with $\pi(s_t, a_t)$;
6 Execute the action ($a_t$) for all the queueing tasks and observe reward $r_t = R(s_t, a_t)$ according to Equation (7);
7 Update $A^{\pi}_{\eta}(s_t, a_t)$;
8 if the waiting queue is not empty and there are no available resources then
9 keep the remaining tasks in the queue until the next loop;
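As an illustration of the stopping rule in Equ. (8) and Equ. (9), the following Python sketch scans a validation loss curve, restarts the fluctuation phase whenever $m_d$ consecutive epochs each set a new minimum loss, and stops at the first pair of adjacent windows of $\kappa$ losses whose mean rises while the variance falls (i.e., $m_o = 1$). The function name `opts_stop_epoch`, the default values of `m_d` and `kappa`, the use of the population variance, and the sample loss curve are assumptions made for this sketch and are not taken from the MLFS code.

```python
from statistics import fmean, pvariance


def opts_stop_epoch(val_losses, m_d=3, kappa=4):
    """Sketch of the OptS rule; returns (stop_epoch, best_epoch).

    stop_epoch is None if Equ. (9) never holds on the given curve.
    best_epoch is the epoch with the minimum validation loss seen so far.
    Epochs are 0-indexed. All names and defaults are hypothetical.
    """
    l_min = float("inf")
    best_epoch = None
    run = 0       # consecutive epochs that each set a new minimum loss
    start = None  # first epoch of the current fluctuation phase
    for t, loss in enumerate(val_losses):
        if loss < l_min:
            l_min, best_epoch = loss, t
            run += 1
        else:
            run = 0
        if run >= m_d:
            # Equ. (8) holds for epochs t-m_d+1..t: the quick-drop phase
            # ends here, so the fluctuation phase (re)starts at epoch t.
            start = t
        if start is None:
            continue
        seen = t - start + 1  # epochs observed in the fluctuation phase
        if seen % kappa == 0 and seen // kappa >= 2:
            # Epoch t completes a window; compare it with the previous one.
            prev = val_losses[t - 2 * kappa + 1 : t - kappa + 1]
            curr = val_losses[t - kappa + 1 : t + 1]
            # Equ. (9): mean goes up and variance goes down.
            if fmean(curr) > fmean(prev) and pvariance(curr) < pvariance(prev):
                return t, best_epoch
    return None, best_epoch


# Hypothetical loss curve: quick drop (epochs 0-4), then fluctuation.
losses = [1.0, 0.8, 0.6, 0.5, 0.45, 0.60, 0.40, 0.55, 0.50, 0.56, 0.52, 0.54]
print(opts_stop_epoch(losses, m_d=3, kappa=3))  # (9, 6)
```

With this curve, the quick-drop phase ends at epoch 4, and the windows over epochs 4-6 and 7-9 satisfy Equ. (9), so training stops at epoch 9 and the model stored at epoch 6 (loss 0.40) would be fetched as the final model.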
with 2474 GPUs in 550 servers. We set the number of jobs to $117325x$, where $x$ equals 1/2 and is then varied from 1 to 4 in steps of 1. We used all the trace data to drive the simulation. Given that the trace does not contain bandwidth cost information, for each job, we randomly selected two values within [50, 100] MB as the communication volume between each worker and the parameter server and between workers in one communication, since [8] indicates that the communication volume is the same across iterations. As we use both data parallelism and model parallelism, we need the corresponding information of the tasks in this scenario to drive our simulation. However, the trace is not from a data parallelism and model parallelism scenario, nor does it provide the names of the ML algorithms. We therefore first sample-ran the aforementioned five ML algorithms with the aforementioned parallelism methods using different numbers of GPUs and recorded the job completion times. Then, we mapped each job in the trace to one of our running ML algorithms with a certain number of GPUs that has a similar job latency. Next, we ran the ML algorithm with our data parallelism degree and different model parallelism degrees to get the information of the different tasks in the job in this scenario.

Experimental setting: In the experiments, we considered resource types including CPU, memory, GPU and bandwidth cost. The number of GPUs needed by each ML job was randomly selected from {1, 2, 4, 8, 16, 32}. We also set the number of model partitions to this number. SVM did not run in model parallelism because it is hard to partition its network model. The default parameter values of our method are as follows: $\alpha = 0.3$, $\gamma = 0.8$, $\gamma_d = 0.3$, $\gamma_r = 0.3$, $\gamma_w = 0.35$, $\beta_1 = 0.5$, $\beta_2 = 0.55$ (a larger $\beta_2$ means more weight on the deadline guarantee), $\beta_3 = 0.25$, $\beta_4 = 0.15$, $\beta_5 = 0.15$, $\eta = 0.95$, $h_r = h_s = 90\%$, and $p_s = 10\%$. We chose these default values because we found in our experiments that they achieve the best performance in our evaluation environment. In practice, these tunable parameters of a cluster are determined by the cluster administrator according to the goals of concern and the particular cluster environment and configurations.

The job scheduler runs every minute. For each job, we used $\max\{1.1 t_e, t_r\}$ as its deadline, where $t_e$ is its estimated execution time and $t_r$ was randomly chosen from $[\frac{1}{2}, 24]$ hours. Unless otherwise specified, the number of iterations run by each job is the same as that in the trace. In the RL training process in MLF-RL, we randomly scheduled tasks to nodes, calculated the reward, and then updated the RL model. The model is trained after the RL agent has processed the first 50% of the data of the real trace, which takes around 26 hours. The error bars in our experimental figures represent the 1st and 99th percentiles and the median of the result values from 10 experiments. We use $\frac{y-z}{z}$ to calculate the improvement when comparing the performance of methods $y$ and $z$.

Comparison methods: We compared our methods with the state-of-the-art ML job schedulers including SLAQ [6], Gandiva [7], Tiresias [8], RL [5], Graphene [38], TensorFlow [36], which uses the Borg resource manager [24], and HyperSched [11]. RL aims to minimize the JCT; it uses data parallelism and partitions the model sequentially into several parts for model parallelism. The number of parts equals the number of layers in one job. The details of these methods are explained in Section II. We implemented SLAQ, RL and Tetris by ourselves and used the open-source code of the other comparison methods. In MLF-C, we assume that all jobs use OptStop and that it stops a job when its required accuracy is reached while the system is overloaded. We open-sourced our code on GitHub [25].

B. Experimental Results

1) Overall Performance Comparison: Figure 5 and Figure 6 compare the overall performance of our methods against other methods in real experiments and simulation, respectively. Given that both sets of figures show similar trends and orders, we discuss the two figures together and show the performance improvement ratios from the simulation in "()" in the following.

Figures 5(a) and 6(a) show the cumulative distribution function (CDF) of the JCT of each method. The results show that the overall JCT follows: MLFS < MLF-RL < MLF-H < Graphene < Tiresias ≈ HyperSched ≈ RL ≈ Gandiva < TensorFlow ≺ SLAQ. We use ≺ to mean a slightly lower relationship and ≻ to mean a slightly higher relationship. Over 85%, 81%, and 78% of the jobs in MLFS, MLF-RL and MLF-H have JCTs less than 100 minutes, respectively, in real experiments. Considering the percentage of jobs with JCTs less than 100 minutes, MLF-RL outperforms MLF-H by 4% (4%), and MLFS improves MLF-RL by 5% (6%) due to the additional MLF-C. Meanwhile, 61%, 60%, 60%, 55%, 46%, and 39% of the jobs have JCTs less than 100 minutes in Tiresias, HyperSched, RL, Gandiva, TensorFlow and SLAQ, respectively. MLFS improves Tiresias by 33% (38%) and improves SLAQ by 118% (128%).

Both MLF-RL and MLF-H jointly consider temporal/spatial ML job features and computation features when scheduling tasks to the underloaded servers and when selecting migration tasks from overloaded servers. The additional consideration of ML job features compared to other methods helps them achieve lower JCT. Since the tasks in earlier iterations that have more and closer dependent tasks in the dependency graph have higher priorities to run, their completion enables more tasks to start earlier, thus helping reduce JCTs. In addition, MLF-RL and MLF-H consider computation features related to JCT, which directly reduces JCTs. MLF-RL outperforms MLF-H because MLF-RL can better extract ML job features, adaptively learn the scheduling policy and output the optimal schedule, whereas MLF-H may not be able to set optimal values for its parameters.

MLFS produces significantly lower JCT than the other methods. The reduction of JCT is contributed by MLF-RL and MLF-C. MLF-C stops training ML jobs at their optimal stopping epochs to save computation resources and time without degrading their accuracy, and also stops tasks in later iterations when the desired accuracy is reached to avoid resource overload and task waiting, thus reducing JCT.

In order to reduce JCT, Graphene considers the task dependency graph. The tasks with more dependent tasks have higher priority to be scheduled. The earlier completion of this kind of task leads to the earliest start of other dependent tasks and thus decreased JCT. Thus, Graphene achieves better JCT reduction performance compared with other comparison
methods. Tiresias gives the jobs that can complete in the next service epoch the highest priority to run, and RL uses the RL technique for job scheduling, but they do not consider the ML features. HyperSched, Gandiva and TensorFlow do not directly aim to improve JCT. HyperSched aims to increase the accuracy of jobs within their deadlines. Gandiva uses FIFO and also migrates tasks between GPUs to fully utilize GPU resources, and TensorFlow uses the Fair scheduler, which aims to achieve fairness between jobs in resource allocation. As a result, HyperSched, Tiresias and RL produce JCTs higher than our methods, but lower than TensorFlow. Gandiva's JCT is comparable to those of HyperSched, Tiresias and RL, and lower than TensorFlow's due to Gandiva's additional task migration. SLAQ only aims to maximize the accuracy improvement across jobs rather than JCT; thus, it produces higher JCT than TensorFlow. In conclusion, in terms of JCT, MLFS outperforms the other methods, and our proposed methods MLF-H, MLF-RL and MLF-C are all effective in reducing JCT.

Makespan is the time period from when the first job is submitted to when the last job is completed. The makespan is 40-90 (63-336) hours in MLFS, 51-102 (67-342) hours in MLF-RL, and 54-116 (72-349) hours in MLF-H. For the highest workload, MLFS improves MLF-RL by 10% (9%) and MLF-RL improves MLF-H by 15% (11%). Also, the makespan is 56-125 (74-355) hours in Tiresias, 58-128 (78-360) hours in HyperSched, 61-134 (78-361) hours in RL, 64-138 (84-368) hours in Gandiva, 78-158 (103-416) hours in TensorFlow, and 85-170 (11-415) hours in SLAQ. Thus, MLFS improves Tiresias by 32% (17%) and improves SLAQ by 52% (39%).

Figures 5(b) and 6(b) show the average JCT. We see that the result follows: MLFS < MLF-RL < MLF-H < Graphene < Tiresias ≈ HyperSched ≈ RL ≈ Gandiva < TensorFlow ≺ SLAQ. For 1860 jobs, MLFS improves MLF-RL by 22% (18%) (due to the additional MLF-C) and MLF-RL improves MLF-H by 11% (10%). Meanwhile, MLFS improves Tiresias by 34% (30%) and improves SLAQ by 53% (47%). The results show similar orders and trends of all methods as in Figures 5(a) and 6(a) due to the same reasons.

Figures 5(c) and 6(c) show the job deadline guarantee ratio, which is the percentage of the jobs whose deadlines are satisfied. We see that the result follows: MLFS > MLF-RL > MLF-H > HyperSched > Graphene > Tiresias ≈ RL ≈ Gandiva >
TensorFlow ≻ SLAQ. For 1860 jobs, MLFS improves MLF-RL by 21% (16%) and MLF-RL improves MLF-H by 16% (21%). Meanwhile, MLFS improves HyperSched by 47% (43%) and improves SLAQ by 102% (101%). Part of the reasons for the results are the same as those for Figures 5(a) and 6(a). HyperSched considers the job deadline and tends to give more resources to the tasks with the best potential accuracy performance before the deadlines. These methods try to satisfy job deadlines in job scheduling while the other methods do not.

A job's waiting time is the accumulated time periods in which none of its tasks are running, from its submission time to its completion time. Figures 5(d) and 6(d) plot the average waiting time per job. The result shows that MLFS < MLF-RL < MLF-H < Graphene < HyperSched ≈ Tiresias ≈ RL ≈ Gandiva < TensorFlow < SLAQ. MLFS improves MLF-RL by 25% (22%) (due to the additional MLF-C) and MLF-RL improves MLF-H by 14% (21%). MLFS improves Tiresias by 40% (47%) and improves SLAQ by 52% (57%). Part of the reasons are similar to those for Figures 5(a) and 6(a), since a lower JCT enables more jobs to run earlier and helps reduce job waiting time. In addition, waiting time is a factor considered in MLF-RL and MLF-H in priority determination in order to reduce the task waiting time, and MLF-C further reduces job waiting time by removing some unnecessary tasks. Though HyperSched pauses jobs that do not increase accuracy significantly, it does not aim to reduce JCT, so jobs are always running or waiting before their deadlines. As a result, our methods generate lower waiting time than the other methods, and MLFS produces the least average job waiting time.

Figures 5(e) and 6(e) plot the average accuracy of ML jobs by their deadlines. We see that the result follows: MLFS > MLF-RL > MLF-H > HyperSched > SLAQ > Tiresias ≈ RL ≈ Graphene > Gandiva ≈ TensorFlow for the lowest workload, and MLFS > MLF-RL > MLF-H > HyperSched > Tiresias ≈ RL ≈ Graphene > SLAQ ≈ Gandiva ≈ TensorFlow for high workload. For 155 jobs, MLFS improves MLF-RL by 10% (11%) and MLF-RL improves MLF-H by 7% (8%). Also, MLFS improves Tiresias by 14% (16%) and improves TensorFlow by 44% (46%). For 1860 jobs, MLFS improves MLF-RL by 23% (20%) and MLF-RL improves MLF-H by 14% (8%). Also, MLFS improves HyperSched by 43% (41%) and improves TensorFlow by 64% (60%).

On the one hand, with a lower workload in the cluster, HyperSched performs the best among the comparison methods since it pauses jobs that do not gain a high accuracy increase and also gives higher priority in resource allocation to the task that tends to achieve the best accuracy performance before its deadline. Tiresias, RL and Graphene try to improve JCT so that the jobs can complete earlier, which also increases accuracy by the job deadlines due to resource competition mitigation. As a result, they achieve higher accuracy than SLAQ though SLAQ aims to improve accuracy. Gandiva and the Fair scheduler in TensorFlow do not directly aim to improve JCT or accuracy. Therefore, many jobs must wait and their accuracy by the deadline is lower than that of the other methods. Since MLFS considers both accuracy and JCT requirements and leverages ML job features, it achieves the highest average accuracy in both high and low workloads.

Figures 5(f) and 6(f) show the accuracy guarantee ratio, i.e., the percentage of ML jobs whose accuracy requirements are satisfied by their deadlines. The trends and orders of the results are similar to those in Figures 5(e) and 6(e). For 155 jobs, MLFS improves MLF-RL by 15% (18%) and MLF-RL improves MLF-H by 20% (19%). Also, MLFS improves HyperSched by 36% (35%) and improves TensorFlow by 56% (52%). For 1860 jobs, MLFS improves MLF-RL by 22% (20%) and MLF-RL improves MLF-H by 8% (9%). Also, MLFS improves HyperSched by 32% (34%) and improves TensorFlow by 61% (54%). Part of the reasons for the results are the same as those for Figures 5(e) and 6(e). In addition, meeting the accuracy requirement is an objective in our methods, but is not considered in the other methods. As a result, our methods produce the highest accuracy guarantee ratio.

Figures 5(g) and 6(g) plot the communication bandwidth cost. We see that the result follows: MLFS < MLF-RL < MLF-H ≈ TensorFlow ≈ Graphene ≈ Tiresias ≈ HyperSched ≈ SLAQ ≈ RL < Gandiva. For 1860 jobs, MLFS improves MLF-RL by 22% (24%) and MLF-RL improves MLF-H by 14% (15%). Meanwhile, MLFS improves TensorFlow by 36% (37%) and improves Gandiva by 45% (54%). MLF-H tries to reduce bandwidth cost when selecting a node to allocate a task and when selecting tasks to migrate out of an overloaded node. MLF-RL considers bandwidth cost as a part of the reward function. MLF-C further removes tasks that make little or no contribution to the desired accuracy. As a result, MLFS, MLF-H and MLF-RL produce less bandwidth cost than the other methods, and MLFS produces the least bandwidth cost. Gandiva simply uses FIFO without trying to reduce JCT and uses task migration to improve resource utilization without considering bandwidth cost. Therefore, the task migration introduces extra bandwidth cost, so Gandiva generates the highest bandwidth cost. All other comparison methods do not aim to reduce bandwidth cost, thus producing higher bandwidth cost than MLF-H, MLF-RL, and MLFS.

We recorded a scheduler's running time for each scheduling and calculated its average value. Figure 5(h) plots the average time overhead of different schedulers versus different numbers of jobs. We see that the result follows: Gandiva < TensorFlow < Tiresias < Graphene < HyperSched < SLAQ < RL ≈ MLF-RL < MLF-H < MLFS. For 1860 jobs, the scheduler time overhead is 54 ms, 49 ms, and 40 ms in MLFS, MLF-RL and MLF-H, and it is 23 ms, 28 ms, 35 ms, 38 ms, 40 ms, and 49 ms in Gandiva, TensorFlow, Tiresias, HyperSched, SLAQ and RL, respectively. MLF-RL and RL use the RL technique, so they have similar time overhead. MLFS generates higher time overhead since it has the additional MLF-C in addition to MLF-RL. MLF-H consists of two parts, a heuristic job scheduling method and the handling of overloaded servers, which needs to select migration tasks and destination nodes. Thus, it generates higher time overhead than MLF-RL and RL and less overhead than MLFS, which consists of MLF-RL and MLF-C. The results indicate the advantages of MLF-RL and RL in reducing the time for decision making. SLAQ, Tiresias, Graphene and HyperSched use a single heuristic method without handling server overload or system overload, which costs less time overhead compared with the RL-based methods. Gandiva uses simple FIFO and
[21] PyTorch. Accessed: Jul. 2022. [Online]. Available: https://pytorch.org
[22] P. Moritz et al., “Ray: A distributed framework for emerging AI applications,” in Proc. OSDI, 2018, pp. 561–577.
[23] Microsoft DNN Trace. Accessed: Jul. 2022. [Online]. Available: https://github.com/msr-fiddle/philly-traces/
[24] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, “Large-scale cluster management at Google with Borg,” in Proc. EuroSys, Apr. 2015.
[25] Source Code. Accessed: Jul. 2022. [Online]. Available: https://github.com/hiddenlayer2020/ML-Job-Scheduler-MLFS
[26] H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh, “Learning scheduling algorithms for data processing clusters,” in Proc. SIGCOMM, Aug. 2019, pp. 270–288.
[27] P. Delgado, D. Didona, F. Dinu, and W. Zwaenepoel, “Kairos: Preemptive data center scheduling without runtime estimates,” in Proc. SOCC, Oct. 2018, pp. 135–148.
[28] A. Chung, J. W. Park, and G. R. Ganger, “Stratus: Cost-aware container scheduling in the public cloud,” in Proc. SOCC, Oct. 2018, pp. 121–134.
[29] C. Curino et al., “Hydra: A federated resource manager for data-center scale analytics,” in Proc. NSDI, 2019, pp. 177–192.
[30] W. Bai, L. Chen, K. Chen, D. Han, C. Tian, and H. Wang, “Information-agnostic flow scheduling for commodity data centers,” in Proc. NSDI, 2015, pp. 455–468.
[31] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella, “Multi-resource packing for cluster schedulers,” in Proc. SIGCOMM, Aug. 2014, pp. 455–466.
[32] M. Chowdhury and I. Stoica, “HUG: Multi-resource fairness for correlated and elastic demands,” in Proc. NSDI, 2016, pp. 407–424.
[33] J. Mace, P. Bodik, R. Fonseca, and M. Musuvathi, “Retro: Targeted resource management in multi-tenant distributed systems,” in Proc. NSDI, 2015, pp. 589–603.
[34] P. Sun, Y. Wen, N. B. D. Ta, and S. Yan, “Towards distributed machine learning in shared clusters: A dynamically-partitioned approach,” in Proc. SMARTCOMP, May 2017, pp. 1–6.
[35] B. Huang, M. Boehm, Y. Tian, B. Reinwald, S. Tatikonda, and F. R. Reiss, “Resource elasticity for large-scale machine learning,” in Proc. SIGMOD, May 2015, pp. 137–152.
[36] M. Abadi et al., “TensorFlow: A system for large-scale machine learning,” in Proc. OSDI, 2016, pp. 265–283.
[37] D. Narayanan et al., “PipeDream: Generalized pipeline parallelism for DNN training,” in Proc. 27th ACM Symp. Oper. Syst. Princ., 2019, pp. 1–15.
[38] R. Grandl, S. Kandula, S. Rao, A. Akella, and J. Kulkarni, “GRAPHENE: Packing and dependency-aware scheduling for data-parallel clusters,” in Proc. 12th USENIX Symp. Oper. Syst. Design Implement., 2016, pp. 81–97.
[39] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource management with deep reinforcement learning,” in Proc. HotNet, Nov. 2016, pp. 50–56.
[40] T. Domhan, J. Springenberg, and F. Hutter, “Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves,” in Proc. 24th Int. Joint Conf. Artif. Intell., 2019, pp. 1–9.
[41] A. Sergeev and M. D. Balso, “Horovod: Fast and easy distributed deep learning in TensorFlow,” 2018, arXiv:1802.05799.
[42] Baidu, Ring All Reduce. Accessed: Jul. 2022. [Online]. Available: https://github.com/baidu-research/baidu-allreduce
[43] H. Mikami, H. Suganuma, Y. Tanaka, and Y. Kageyama, “Massively distributed SGD: ImageNet/ResNet-50 training in a flash,” 2018, arXiv:1811.05233.
[44] M. Laumanns and E. Zitzler, “An efficient, adaptive parameter variation scheme for metaheuristics based on the epsilon-constraint method,” Eur. J. Oper. Res., vol. 169, no. 3, pp. 932–942, 2006.
[45] H. Shen and L. Chen, “A resource usage intensity aware load balancing method for virtual machine migration in cloud datacenters,” IEEE Trans. Cloud Comput., vol. 8, no. 1, pp. 17–31, 2017.
[46] A. Beloglazov and R. Buyya, “Optimal online deterministic algorithms and adaptive heuristics for energy and performance efficient dynamic consolidation of virtual machines in cloud data centers,” Concurrency Comput. Pract. Exp., vol. 24, no. 13, pp. 1397–1420, Sep. 2012.
[47] R. Sutton, Reinforcement Learning. Cambridge, MA, USA: MIT Press, 2014.
[48] D. Silver et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[49] H. Mao et al., “Real-world video adaptation with reinforcement learning,” 2020, arXiv:2008.12858.
[50] J. Wang, J. Xu, and X. Wang, “Combination of hyperband and Bayesian optimization for hyperparameter optimization in deep learning,” 2018, arXiv:1801.01596.
[51] B. Letham and E. Bakshy, “Bayesian optimization for policy search via online-offline experimentation,” J. Mach. Learn. Res., vol. 20, p. 145, Jan. 2019.
[52] L. Prechelt, “Early stopping—But when?” in Neural Networks: Tricks of the Trade. Berlin, Germany: Springer, 1998, pp. 55–69.
[53] ILSVRC2010. Accessed: Jul. 2022. [Online]. Available: https://github.com/Abhisek-/AlexNet
[54] Text Classification on R8 Dataset. Accessed: Jul. 2022. [Online]. Available: https://paperswithcode.com/sota/text-classification-on-r8
[55] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980.
[56] M. Jeon, S. Venkataraman, A. Phanishayee, J. Qian, W. Xiao, and F. Yang, “Analysis of large-scale multi-tenant GPU clusters for DNN training workloads,” in Proc. ATC, 2019, pp. 947–960.
[57] Amazon EC2 Types. Accessed: Jul. 2022. [Online]. Available: https://aws.amazon.com/ec2/instance-types/

Haoyu Wang (Student Member, IEEE) received the B.S. degree from the University of Science and Technology of China and the M.S. degree from the Columbia University in the City of New York. He is currently pursuing the Ph.D. degree with the Department of Computer Science, University of Virginia. His research interests include data center, cloud, and distributed networks.

Zetian Liu received the B.S. degree from Jilin University in 2018. He is currently pursuing the Ph.D. degree with the Department of Computer Science, University of Virginia. His research interests include distributed networks, cloud computing, and machine learning algorithms and applications.

Haiying Shen (Senior Member, IEEE) received the B.S. degree in computer science and engineering from Tongji University, China, in 2000, and the M.S. and Ph.D. degrees in computer engineering from Wayne State University in 2004 and 2006, respectively. She is currently an Associate Professor with the Department of Computer Science, University of Virginia. Her research interests include distributed computer systems, cloud and edge computing, distributed machine learning, big data, and cyber-physical systems. She is a Microsoft Faculty Fellow of 2010. She is a Senior Member of ACM.