
Automating System Configuration of Distributed Machine Learning
Woo-Yeon Lee1, Yunseong Lee1, Joo Seong Jeong1, Gyeong-In Yu1, Joo Yeon Kim2, Ho Jin Park1,
Beomyeol Jeon3, Wonwook Song1, Gunhee Kim1, Markus Weimer4, Brian Cho5, Byung-Gon Chun1*
1 Seoul National University, 2 Samsung Electronics, 3 University of Illinois at Urbana-Champaign, 4 Microsoft, 5 Facebook
{wooyeonlee0, yunseong.lee0, joosjeong, gyeonginyu, jykim88, hojinpark.cs}@gmail.com,
{beomyeolj, wsong0512}@gmail.com, [email protected], [email protected], [email protected], [email protected]

Abstract—The performance of distributed machine learning systems is dependent on their system configuration. However, configuring the system for optimal performance is challenging and time consuming even for experts due to diverse runtime factors such as workloads or the system environment. We present cost-based optimization to automatically find a good system configuration for parameter server (PS) machine learning (ML) frameworks. We design and implement Cruise, which applies the optimization technique to tune distributed PS ML execution automatically. Evaluation results on three ML applications verify that Cruise automates the system configuration of the applications to achieve good performance with minor reconfiguration costs.

I. INTRODUCTION

Machine learning (ML) systems are widely used to extract insights from data. Ever increasing dataset sizes and model complexity gave rise to many efforts towards efficient distributed machine learning systems. One of the popular approaches to support large-scale data and complicated models is the parameter server (PS) approach [1], [12], [16]. In this approach, training data is partitioned across workers, while model parameters – which compose the global model being trained – are partitioned across servers. During training, each of the workers computes model updates using its allocated data and sends the model updates to the corresponding servers. Workers then fetch fresh models from servers in order to work with the latest model parameter values. Servers, meanwhile, apply the model updates received from workers and send the latest model parameter values back to workers as requested. This process occurs iteratively during the course of the ML job until the global model converges. The performance of this approach is crucially dependent on choosing the right system configuration1: the number of workers and servers as well as the training data and model partitioning across them [24].

* Corresponding author
1 ML training systems have two types of configurations: system and algorithmic configurations. Algorithmic configurations include hyper-parameters such as learning rate and batch size. In this paper, we focus on system configuration parameters.

Current PS implementations assume the system configuration to be static: the configuration is chosen before training commences and remains unchanged until job termination [4], [5], [7], [23]. However, as we illustrate in Section II-B, choosing the best system configuration is challenging; optimal system configuration parameters vary widely across algorithms, hyper-parameters, and environments. Furthermore, the best configuration changes during runtime as the total amount of available resources changes.

We present cost-based optimization that finds a good system configuration for PS-based frameworks. We extend a PS-based ML framework to build Cruise, which automatically tunes its system configuration with the optimization technique. Ideally we would like to model convergence time, but this is an open research question that requires modeling the algorithms themselves. Instead, Cruise focuses on the system aspects: it analytically models worker performance, as an optimization goal, from the system's runtime statistics, and computes optimal configurations by solving the optimization problem. Cruise applies the new configurations efficiently during runtime by elastically changing allocated resources and migrating data. The reconfiguration allows us to make the best use of given resources as well as opportunistically available resources.

Our evaluation shows that our cost model is valid and that Cruise automatically finds a good system configuration that optimizes performance. With three widely-used machine learning workloads, we demonstrate that the configuration found by Cruise performs within 6.5% of the optimal configuration found by exhaustive search. Cruise reduces training time by up to 58.3% compared to static configurations, with a reconfiguration overhead of tens of seconds.

II. BACKGROUND

A. Parameter Server ML Framework

A typical machine learning (ML) process builds models from input data, and such a training process can be designed in various ways. In many recent ML systems, the notion of parameter servers (PS) is used to manage training models [5], [7], [16], [23].

The PS architecture shown in Figure 1 consists of workers and servers. Machine learning is an iterative convergent process, where an epoch is the unit of iteration. An epoch is normally defined as a full scan of the entire training data, but in asynchronous ML systems, an epoch can refer to the process of scanning the same number of training data instances as the entire dataset.
TABLE I: Epoch times varying the numbers of workers and servers for different algorithms, hyper-parameters, and virtual machine instance types. (W, S): W denotes the number of workers and S denotes the number of servers. For each column, a cell presents the ratio between the epoch time of that configuration and the optimal epoch time.

(a) Case 1. Epoch time comparison for different algorithms.
    (W, S)     NMF     MLR     LDA
    (18, 14)   1.30x   2.29x   Best
    (23, 9)    Best    1.12x   1.21x
    (27, 5)    1.66x   Best    1.60x

(b) Case 2. Epoch time comparison for different hyper-parameters (number of topics) in LDA.
    (W, S)     400 topics   4K topics
    (18, 14)   Best         1.44x
    (23, 9)    1.21x        Best
    (27, 5)    1.60x        1.20x

(c) Case 3. Epoch time comparison for different VM instance types in NMF.
    (W, S)     m4.large   m4.xlarge
    (34, 18)   Best       1.10x
    (38, 14)   1.08x      1.05x
    (42, 10)   1.48x      Best

Fig. 1: Parameter server ML framework. (Workers hold training data partitions D1–D3 and issue push/pull requests; servers hold model partitions M1, M2; worker and server code runs inside Executors.)

Training data is partitioned across workers to achieve data parallelism. Each worker executes the code to compute gradients for a model using its allocated data and communicates with servers to contribute towards model convergence. The model, similar to the input data, is split into many partial models and distributed across servers.

Each server supports push and pull requests from workers for the partial models it is assigned. The literature [21] contains multiple proposals for scheduling the worker-server communication. We focus on the common push/pull model below. A worker issues a push request when it wants to update a certain portion of a model. The request consists of a key, identifying the partial model to update, and the corresponding update data. When the server with the partial model receives a push request, it first searches for the model value associated with the key, and then applies the update by calling a user-defined update function defined in the server code. A worker sends a pull request when it needs to access a certain part of the model. Unlike push requests, a pull request only contains the key associated with the necessary partial model. After receiving a pull request, the server with the partial model fetches the corresponding model data and replies to the sender worker by transmitting a pull response.

In the figure, Executor is an environment on which the ML application code (e.g., worker computation code and server model update code) runs. Executor takes care of low-level system support such as initializing and maintaining network connections between nodes. In this paper, we consider a cluster environment where each Executor runs in a container obtained from a Resource Manager such as YARN or Mesos.
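The push/pull protocol above can be summarized by a small worker-facing interface. The following is a minimal sketch of such an interface; the type and method names (ParameterServerClient, UpdateFunction, and the generic parameters) are ours for illustration and are not Cruise's actual API.

```java
/**
 * Minimal sketch of the worker-facing push/pull protocol described above.
 * All type and method names are illustrative, not Cruise's actual API.
 */
public interface ParameterServerClient<K, P, U> {
  // Push: send an update for the partial model identified by `key`.
  // The owning server folds it in with a user-defined update function.
  void push(K key, U update);

  // Pull: fetch the current value of the partial model identified by `key`.
  // The request carries only the key; the server replies with the model data.
  P pull(K key);
}

// Server-side hook: how a pushed update is applied to the stored partial model.
interface UpdateFunction<P, U> {
  P apply(P currentValue, U update);
}
```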
B. System Configuration Challenges

The system configuration of a PS system includes the allocation of worker and server roles to the available containers, as well as the partitioning of the training data across workers and of the model parameters across servers. A good system configuration is essential for the performance of the machine learning system [24]. However, optimal system configurations that produce minimal training time are difficult to find, even for system experts. This is because, first, predicting how ML application code translates to actual running time - how much time each step takes - is nontrivial. Even if we could estimate the exact running time for an algorithm, there may exist many different implementations of that particular algorithm, all with slightly different running times. Second, even for the same algorithm, different hyper-parameters can change the application's computation or communication overhead. Finally, the capabilities of the environment on which the applications run vary from cluster to cluster.

We illustrate the challenges with experiments that vary algorithms, hyper-parameters, and machines in Table I. In each case, we fix the total number of machines, assign a fraction of the machines to run workers, and assign the rest of the machines to run servers. We experiment with all possible worker and server configurations to compare the epoch times of different configurations.

All experiments in Tables Ia and Ib were run on a cluster of 32 AWS EC2 r4.xlarge instances (4 CPU vCores, 30.5GB memory, and 1.25 Gbps network bandwidth), and Table Ic shows the epoch time of an ML application on either a cluster of 52 m4.large instances (2 CPU vCores, 8GB memory, and 0.5 Gbps network bandwidth) or a cluster of 52 m4.xlarge instances (4 CPU vCores, 16GB memory, and 1.0 Gbps network bandwidth). In the table, NMF denotes Non-negative Matrix Factorization, MLR denotes Multinomial Logistic Regression, and LDA denotes Latent Dirichlet Allocation. We present the details of these algorithms in Section V-A.

Case 1: ML algorithm. Different ML algorithms have different optimal configurations. From Table Ia, MLR achieves the smallest epoch time with (W:27, S:5), whereas LDA runs 1.6 times slower with this configuration compared to LDA's optimal configuration (W:18, S:14). This is because MLR is more compute-intensive than LDA, thus requiring more workers for a smaller epoch time.

Case 2: Hyper-parameter. Hyper-parameter values affect the optimal configuration of an ML application.
A hyper-parameter in LDA is the number of topics used to categorize documents. Increasing the number of topics makes both computation and communication more expensive, but they are affected differently. Table Ib shows that (W:18, S:14) is the best configuration for 400 topics but is 1.44 times slower than the best configuration (W:23, S:9) for 4K topics.

Case 3: Machine environment. The specification of the cluster on which jobs run also heavily affects the best configuration, due to varying computation and communication capabilities. Running NMF on different clusters, we observe that the best configuration varies drastically, as shown in Table Ic. When we use AWS EC2 m4.xlarge instances, the best configuration is (W:42, S:10); the same configuration on m4.large instances is 1.48 times slower than the m4.large best configuration (W:34, S:18).

The three factors investigated above demonstrate that discovering the optimal system configuration is challenging. Even worse, there are other factors, such as the algorithm implementation and the dataset, that can also affect the performance of different configurations. Since the problem space is too broad, it is hard to predict the performance of an ML application given a specific setting. This motivates adapting to optimal configurations automatically.

III. FINDING GOOD SYSTEM CONFIGURATION

In this section, we describe our cost formulation of the training epoch time along with our model assumptions, and how we minimize epoch time by using the cost model to find values for the system configuration parameters – namely, the number of workers and servers as well as the training data and model partitioning across them.

A. Cost Model

Given the PS architecture, we define the cost C of the entire system to be the maximum of the time for each worker i to process its assigned training data in each epoch (C^i: epoch time). By minimizing the maximum epoch time (C), we can improve the absolute performance as well as balance all workers' performance. Unbalanced training can slow down the learning process because the training data does not contribute evenly to the global model parameters. In a general system consisting of heterogeneous containers2 and uneven data partitions, C^i is usually different for each worker.

C = max_i C^i    (1)

2 We focus on modeling heterogeneous containers because of heterogeneous hardware or virtual machines. Transient stragglers are not part of the model. We borrow work stealing techniques from prior work [10] to handle stragglers.

A worker's epoch can be further split into smaller components. Figure 2 depicts the timeline of a worker's epoch. A worker first performs computation on its training data using the current model to produce model gradients. The worker then communicates with the servers to send its gradients via push requests and fetches the updated model via pull requests. Depending on the algorithm and additional job parameters, workers may divide the training data into several smaller subsets and go through a computation-communication cycle for each subset. Such computation-communication cycles are called mini-batches. The next epoch begins once the worker has processed all of its mini-batches (i.e., all training data assigned to the worker).

Fig. 2: A worker's epoch. Each mini-batch consists of (A) local computation followed by (B) communication, where communication comprises (1) push and (2) pull requests.

To simplify the cost model, we make the following assumptions on communication between workers and servers. First, push requests from workers do not block gradient computation and thus can be sent asynchronously with respect to the workers' local computation. On the other hand, a fresh pull of the whole model must always occur before local computation takes place. We assume such a model where pull requests are issued synchronously and block local computation [24]. The synchrony of pull requests can be partially resolved by decoupling the computation mechanism from the communication threads [20]; this leads to a different cost formulation that can be understood as a variation of the one described in this section.

We define the total time spent on local computation in an epoch as the computation cost, and the time spent on communication as the communication cost (denoted by (A) and (B) in Figure 2, respectively). The communication cost, to be more specific, is the sum of the elapsed times between a push request's initiation and the response of the successive pull request in each mini-batch. Using C^i_comp and C^i_comm to denote the computation and communication cost of worker i respectively, the epoch time of worker i becomes

C^i = C^i_comp + C^i_comm    (2)
B. Cost Formulation

a) Computation cost: This cost depends on the size of the training dataset and the computing power of the workers. The entire dataset of size D is split and distributed to w workers. C^i_comp depends on the size d_i of the training data assigned to worker i and the computing power of that worker. Depending on the time complexity f of the ML algorithm, C^i_comp depends on f(d_i), since a worker-side computation scans all of the allocated training data during an epoch. In case an ML algorithm has linear time complexity (e.g., NMF, MLR, LDA), C^i_comp is proportional to d_i. In a general system consisting of heterogeneous containers (e.g., containers with different numbers of cores), each worker i takes C^i_w.proc, the time spent to perform computation on a single training data instance, which varies across workers.

C^i_comp(d_i) = C^i_w.proc · d_i    (3)

C^i_w.proc depends on factors such as the implementation of the ML algorithm or the hardware of the worker container. In Cruise, C^i_w.proc is measured by monitoring workers' local computation; we measure the elapsed time for workers to compute gradients and divide it by the number of training data instances. A larger dataset makes the computation cost more expensive, but we can reduce the cost by introducing more workers, which reduces the size of the dataset d_i that each worker processes in each epoch.

b) Communication cost: We model the communication cost as the time a worker spends communicating with the servers. The entire model of size M is split and distributed over s servers. m_j is the size of the partial models assigned to server j. We consider the following two cases to model C^i_comm.

1) Server network bandwidth is the bottleneck. The serving latency of server j is the number of bytes sent to j divided by the bandwidth b_ij between worker i and server j: m_j·w / b_ij, where w is the number of workers. With a mini-batch size of B, each worker i executes ⌈d_i/B⌉ mini-batches per epoch. Since the communication cost is determined by the slowest server, C^i_comm = ⌈d_i/B⌉ · max_j (m_j·w / b_ij).

2) Worker network bandwidth is the bottleneck. In this case, the worker's network bandwidth is fully utilized to serve push and pull requests. The cost is the number of bytes sent by worker i divided by its bandwidth: C^i_comm = ⌈d_i/B⌉ · Σ_j (m_j / b_ij).

Then, the communication cost C^i_comm is the maximum of the above two terms:

C^i_comm = ⌈d_i/B⌉ · max( max_j (m_j·w / b_ij), Σ_j (m_j / b_ij) )    (4)

C. Optimization

a) Optimization problem: In Figure 3, we formally define our optimization problem. The problem is formulated for heterogeneous environments where some machines have higher computing power or network bandwidth. In these environments, the configuration space becomes larger, because we also need to determine the data distribution as well as decide whether to run a worker or a server on each container. The optimization goal is to find the parameters w, s, d, m that minimize the cost function, where w and s denote the assignment of machines to workers and servers, respectively, and d and m denote the partitioning of the training dataset and partial models, respectively.

Fig. 3: Optimization Problem.
  Given parameters
    N: the total number of machines
    D: the entire dataset size
    M: the entire model size
    B: the mini-batch size
  Variables
    w ∈ {0,1}^N: w_i = 1 if a worker runs on machine i
    s ∈ {0,1}^N: s_j = 1 if a server runs on machine j
    d = (d_1, ..., d_N): training data partitioning for workers
    m = (m_1, ..., m_N): model partitioning for servers
  Problem
    Find w*, s*, d*, m* = argmin_{w,s,d,m} max_i C^i(w, s, d, m)
      = argmin_{w,s,d,m} max_i [ C^i_w.proc · d_i + ⌈d_i/B⌉ · max( max_j (m_j·||w|| / b_ij), Σ_j (m_j / b_ij) ) ]
  Constraints
    ||w|| + ||s|| = N: workers and servers are assigned to the N machines disjointly
    Σ_i d_i = D: total number of training data samples
    Σ_j m_j = M: total number of model partitions

Given N machines, we adjust the configuration to meet the optimal balance between computation and communication costs. For example, using more machines as workers certainly brings down the computation cost by reducing the training data size that each worker deals with. However, this leads to a higher communication cost due to an increased number of push and pull requests within an epoch and fewer containers available for servers.

b) Solution: Based on the problem definition in Figure 3, we cast the optimization problem as Mixed Integer Programming (MIP) and solve it using a solver library from Gurobi [9]. Since the quadratic terms affect the performance significantly, we encode the integer variables d and m in binary representation, which allows the solver to multiply variables faster. As a result, the MIP program consists of O(N^2) variables, O(N) quadratic constraints, and objective terms. In case we have homogeneous machines (i.e., all machines have the same computing power and network bandwidth), the optimal solution distributes d evenly across workers and m evenly across servers. Thus, we can derive an analytical solution that runs in O(N). We present our analytical solution below, but due to space constraints, we omit its derivation.

w* = argmin_w [ (D/||w||) · T(||w||) ],
  where T(||w||) = C_w.proc + max(1, M·||w|| / (b·(N − ||w||))) / B.    (5)
s*: s_i = 1 − w_i,   d*: d_i = D/||w*||,   m*: m_j = M/(N − ||w*||),   b: the machines' bandwidth
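To make the homogeneous case concrete, the sketch below evaluates the per-worker cost of Equations (3)-(4) under an even partitioning and scans the number of workers ||w|| from 1 to N-1 to pick the minimizer, in the spirit of Equation (5). It is an illustrative O(N) search under the stated assumptions (uniform C_w.proc and bandwidth b), not Cruise's solver; the heterogeneous case is handled by the MIP formulation above.

```java
/**
 * Illustrative O(N) search for the homogeneous case: even data/model
 * partitioning, identical per-instance compute time and bandwidth.
 * A sketch of the cost model only, not Cruise's actual solver.
 */
public final class HomogeneousConfigSearch {

  /** Per-worker epoch cost under an even split, following Eqs. (3)-(4). */
  static double epochCost(int numWorkers, int numMachines,
                          double datasetSize, double modelSize,
                          double miniBatchSize, double cWProc, double bandwidth) {
    int numServers = numMachines - numWorkers;
    double dPerWorker = datasetSize / numWorkers;             // d_i
    double mPerServer = modelSize / numServers;               // m_j
    double miniBatches = Math.ceil(dPerWorker / miniBatchSize);

    double computation = cWProc * dPerWorker;                 // Eq. (3)
    double serverBound = mPerServer * numWorkers / bandwidth; // slowest server term
    double workerBound = modelSize / bandwidth;               // bytes sent by a worker
    double communication = miniBatches * Math.max(serverBound, workerBound); // Eq. (4)
    return computation + communication;
  }

  /** Returns the number of workers that minimizes the epoch cost. */
  public static int bestNumWorkers(int numMachines, double datasetSize,
                                   double modelSize, double miniBatchSize,
                                   double cWProc, double bandwidth) {
    int best = 1;
    double bestCost = Double.MAX_VALUE;
    for (int w = 1; w < numMachines; w++) {   // keep at least one server
      double cost = epochCost(w, numMachines, datasetSize, modelSize,
                              miniBatchSize, cWProc, bandwidth);
      if (cost < bestCost) {
        bestCost = cost;
        best = w;
      }
    }
    return best;
  }
}
```

Given the measured C_w.proc and bandwidth, the remaining assignments follow Equation (5): the unchosen machines become servers, and d and m are split evenly.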
Fig. 4: Cruise Architecture. (Each Elastic Executor couples worker or server code with a Monitor and an Elastic Store holding data blocks D1–D3 or model blocks M1–M2; the Master contains the Optimizer and the Elasticity Controller, which receives metrics and issues reconfigurations.)

IV. CRUISE

We extend an existing PS system to automatically configure distributed ML execution. The extended system, called Cruise, adds an Optimizer and an Elastic Runtime to the PS system, as depicted in Figure 4. The Optimizer estimates the optimal configuration for a running ML job using runtime metrics. Following the decision of the Optimizer, the Elastic Runtime applies the necessary changes dynamically to the system without stopping the running job.

A. Optimizer

Optimizer performs cost-based optimization by solving the optimal configuration problem formulated in Section III. Monitors collect runtime statistics related to performance (e.g., the elapsed time for workers to compute gradients) and report the metrics to Master periodically. Optimizer then estimates the performance of different system configurations based on the runtime status. By doing so, our optimizer does not require knowledge about the ML jobs (e.g., algorithms and hyper-parameters). After finding the configuration that is expected to be optimal, Optimizer maps the difference from the current configuration and generates an optimization plan, consisting of operations provided by Elastic Runtime. By executing the operations in the plan, Cruise changes the system configuration to one with better performance. To achieve a performance benefit with optimization, we need to make decisions such as when to calculate an optimization plan and whether to execute the plan or not. We describe these policies below.

1) Metric Collection: Cruise collects runtime metrics to use as inputs to the Optimizer. Workers measure local computation time and communication time and report to Master at the end of every mini-batch. Servers, on the other hand, report their metrics to Master periodically. Since the runtime metrics can fluctuate, we apply a moving average to reduce noise.

2) Optimization Trigger Policy: Based on the cost model above, Cruise triggers optimization after collecting sufficient metrics to substitute the unknown variables in the cost model. We use metrics at the mini-batch granularity to be responsive to the changes of the running job. Using metrics from a configured number of subsequent mini-batches, we estimate the cost of an epoch.

In order to prevent the system from continuously reconfiguring back and forth around the estimated optimum, Optimizer predicts the performance benefit of a new configuration and skips the attempt if the gain is less than a certain threshold. A threshold from our experience - 5% - is good enough to prevent the system from "oscillating", while allowing the system to undergo moderately-sized optimizations.

When the amount of available resources (e.g., N) increases, Optimizer opportunistically tries to use the extra resources. If more resources become available, Optimizer can adjust to find an optimal configuration including the new resources. When the amount of available resources decreases, it rebalances execution accordingly.

3) Optimization Execution: Once the decision of a reconfiguration is made with the computed system configuration (w*, s*, d*, m*), the new configuration is contrasted with the current configuration (w, s, d, m) to generate a reconfiguration plan. All plans consist of a subset of four Elastic Runtime operations, which we discuss in detail in Section IV-B. The operation add is for newly joining containers, while the operation delete deletes containers that are no longer assigned data or a partial model. The operation switch changes an existing server container to a worker, or a worker container to a server. The training data and model partitioning, (d, m), can be modified by migrating data between containers to preserve the state of the running job, which we will further discuss in Section IV-B. The move operation migrates data between containers, starting with the containers that have the largest training data or model changes in a greedy fashion, to minimize the amount of data to move and the number of movements.

Optimizer executes a plan by simply invoking the Elastic Runtime API that reconfigures the system transparently without stopping training. The simplest approach to executing the plan would be to invoke the operations sequentially. However, to make the reconfiguration agile, Optimizer generates the plan as a directed acyclic graph of independent operations that can be executed concurrently.
operations in the plan, Cruise changes the system configuration
to the one with better performance. To achieve performance B. Elastic Runtime
benefit with optimization, we need to make decisions such as Elastic Runtime is an execution environment that exposes
when to calculate an optimization plan, whether to execute the operations which Optimizer can call to dynamically reconfig-
plan or not. We describe these policies below. ure the system. Elastic Runtime manages workers and servers
1) Metric Collection: Cruise collects runtime metrics to in the form of containers, each integrated with an Elastic
use them as inputs to the Optimizer. Workers measure local Executor. Elastic Executor runs application code on data
computation time and communication time and report to encapsulated by Elastic Store, a distributed key-value store
Master at the end of every mini-batch. On the other hand, that constructs an effective management scheme. Elasticity
Servers report the metrics to Master periodically. Since the Controller manages the distributed Elastic Executors. It is
runtime metrics can fluctuate, we apply moving average to also the endpoint where Optimizer triggers reconfigurations
reduce noise. according to the generated optimization plan.
2) Optimization Trigger Policy: Based on the cost model Elastic Runtime deals with two types of reconfigurations: re-
above, Cruise triggers optimization after collecting sufficient source reconfiguration and workload repartitioning. Resource
metrics to substitute the unknown variables in the cost model. reconfiguration is achieved by Elastic Executor, a container-
We use metrics at the mini-batch granularity to be responsive ized and reconfigurable runtime which extends the existing
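As a sketch of the operations Elastic Runtime exposes (summarized in Fig. 5 below), the interfaces here mirror the data-access and reconfiguration calls. The Java type names (ResourceConf, RuntimeConf, and so on) are placeholders we introduce for illustration, not Cruise's actual signatures; note that switch is a Java keyword, so the sketch names that operation switchRole.

```java
import java.util.List;
import java.util.function.BiFunction;

/**
 * Illustrative interfaces mirroring the Elastic Runtime operations of Fig. 5.
 * Type names are placeholders, not Cruise's actual API.
 */
interface ElasticStore<K, V, D> {
  void put(K key, V value);                              // store (key, value)
  V get(K key);                                          // read the value for key
  void update(K key, BiFunction<V, D, V> func, D delta); // value <- func(value, delta)
}

interface ElasticityController<ResourceConf, RuntimeConf, Container, Block> {
  void add(ResourceConf resources, RuntimeConf runtime);        // launch new containers
  void delete(List<Container> containers);                      // release existing containers
  void switchRole(Container container, RuntimeConf runtime);    // worker <-> server
  void move(List<Block> blocks, Container src, Container dst);  // migrate data blocks
}
```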
Elastic Runtime deals with two types of reconfigurations: resource reconfiguration and workload repartitioning. Resource reconfiguration is achieved by Elastic Executor, a containerized and reconfigurable runtime which extends the existing PS architecture's Executor. Elasticity Controller coordinates resource reconfiguration by easily adding and removing Elastic Executors. Workload repartitioning is conducted efficiently with Elastic Executor's internal component, Elastic Store. An Elastic Store encapsulates data in an in-memory storage with a management scheme that provides flexibility in the accommodated data type (e.g., training data or model data).

Transparency must be maintained in the course of a reconfiguration. Reconfiguration must occur with minimal effects on the running job by maintaining the application's access to data without any loss or significant overhead. Elastic Executor performs several additional tasks required for a transparent reconfiguration, such as adaptive data ownership management and redirection of requests to the new owner of the data.

We explain the details of resource reconfiguration in Section IV-B1, workload repartitioning in Section IV-B2, and maintaining transparency in Section IV-B3.

Fig. 5: Elastic Runtime Interfaces.
  Data access interface
    put(Key, Value)               Puts (Key, Value) to the Elastic Store
    get(Key)                      Gets the value associated with Key
    update(Key, Func, Delta)      Updates the value for Key with the result of Func(Value, Delta)
  Reconfiguration interface
    add(ResourceConf, RuntimeConf)          Adds new containers and starts the runtime on them
    delete(Containers)                      Deletes existing containers
    switch(Container, RuntimeConf)          Switches a container to run a specified runtime
    move(Blocks, SrcContainer, DstContainer)  Moves blocks from one container to another

1) Resource Reconfiguration: Containers can be added or deleted when Optimizer determines so, with the operations add and delete, for which simple signatures are provided in Figure 5. When add is executed, Elastic Runtime simply launches an Elastic Executor on a new container. In the case of a container delete, the Elastic Executor stops the app code and itself to release the container.

When deleting an executor, Elastic Runtime performs additional wrap-up corresponding to its role (e.g., worker or server). Before a server-side Elastic Executor shuts down, it redirects all remaining pull requests from workers to the new owning Elastic Executors, to prevent workers from waiting long for a response. On the worker side, it waits until the ongoing mini-batch is finished and push requests are flushed to servers.

Elastic Runtime also provides the switch operation, which changes an Elastic Executor to another type (e.g., from server to worker or from worker to server). This operation also involves the setup and cleanup procedures of add and delete. However, the two procedures occur in parallel in the existing container, and there is no container setup or cleanup involved. This is especially beneficial in an environment with constrained resources, as add must wait for a container to become free after a delete completes.

2) Workload Repartitioning: Workload repartitioning includes changing each container's ownership of training data/partial models and migrating the data accordingly. A resource reconfiguration must occur in conjunction with workload repartitioning. When a container is added or deleted, the workload for each container must be readjusted across the new set of containers now running in the system. Workload repartitioning may also occur on its own in the case of an imbalance in workload between containers.

Any runtime state of the job, such as the model data across servers, must be preserved so as not to lose the job's progress. Thus, these states must be migrated from one container to another. In addition to such mutable data, the training data across workers, which remains unchanged, can also enjoy the benefit of migration to reduce the overhead of reloading the entire dataset in workload repartitioning. Both mutable and immutable data can be stored in Elastic Stores, on which workload repartitioning occurs.

Data Storage and Ownership Management: Data management in Elastic Runtime involves a collection of Elastic Stores where the actual key-value tuples are stored. The actual ownership of each data instance is maintained by the respective Elastic Store, but Elasticity Controller also maintains a global ownership view to orchestrate migration between Elastic Stores. Ownership tables are updated during the migration process, whose details we discuss below in this section.

Elastic Stores are composed of blocks containing data, and a block is owned by exactly one Elastic Store. The entire key-space of the data is partitioned, and each block contains data for a range in the key-space. For an even partitioning of keys over blocks, each block stores data for a hashed key range. Clients of Elastic Stores - worker and server code in our paper - use a key, which is mapped to a value, to access each key-value tuple. For each client access, only the Elastic Store owning the block where the key-value tuple is stored processes the request, according to the ownership table.

Data Access: Elastic Runtime allows values to be stored to and retrieved from Elastic Stores through simple operations, similar to what can be done in distributed hash tables (DHTs) [8]. The difference of Elastic Stores over such key-value stores is that Elastic Runtime exposes options to migrate data. Elastic Store provides simple and standard operations for clients to access and update each data instance with a key, as shown in Figure 5. Elastic Runtime guarantees that operations are served exactly once by maintaining a single owner of the block containing the key-value tuple on which the operation is conducted, across all Elastic Stores. When an operation is requested to an Elastic Store that does not own the block, the request is processed by remotely accessing the owner according to the ownership table in each Elastic Store.

In addition to the put/get operations, we provide the update operation, which atomically executes Func, a user-defined function that should be commutative and accumulative, to guarantee atomic incremental updates.
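For illustration, a typical Func in this sense is element-wise addition of a gradient into a stored model vector, which is commutative and accumulative. The snippet below is a hypothetical example of such a function, not code taken from Cruise.

```java
import java.util.function.BiFunction;

/** Hypothetical example of a commutative, accumulative update function. */
public final class VectorAddUpdate {

  /** Func(value, delta): add the gradient delta into the stored partial model. */
  static final BiFunction<float[], float[], float[]> ADD = (value, delta) -> {
    float[] result = value.clone();
    for (int i = 0; i < result.length; i++) {
      result[i] += delta[i];
    }
    return result;
  };

  public static void main(String[] args) {
    float[] model = {1.0f, 2.0f};
    float[] gradient = {0.5f, -0.5f};
    // In the Elastic Store this would be invoked via update(key, Func, Delta)
    // on the store that owns the block containing the key.
    float[] updated = ADD.apply(model, gradient);
    System.out.println(updated[0] + ", " + updated[1]); // 1.5, 1.5
  }
}
```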
In Cruise, when starting a job, a worker Elastic Executor loads its assigned set of training data into its local Elastic Store using put. While running the job, the Elastic Executor fetches the data to process for each mini-batch from the local Elastic Store using get. Servers, however, must use update when processing a push request, to guarantee atomicity. To process a pull request, servers simply get the model data from the local Elastic Store.

Data Migration: The move operation changes ownership and migrates data between Elastic Stores. This must be done carefully to prevent loss or duplicated processing of an operation while changing the block owner. It is also the most critical factor that determines reconfiguration performance, and thus Elastic Runtime executes multiple moves concurrently, with each move parallelized in block units.

We implement the following protocol in Elastic Runtime to provide an efficient migration process. Elasticity Controller initiates a migration for a set of blocks by sending a message to the source container. The source container migrates blocks concurrently to the destination container and reports to Elasticity Controller the completion of the migration for every block, upon each ACK message from the destination container. Finally, Elasticity Controller broadcasts the ownership change of the block to all other containers. Specifically, the block migration is done in two distinct steps: ownership handover and actual data transfer. When the source container starts migration for a block, it hands over ownership first, so access operations for the block in this container are redirected to the destination container. When the destination container takes ownership, it starts queueing access operations for the block and starts processing them after receiving the actual block data.

The key point in the migration process is that block ownership is transferred atomically, such that there is always a single owner for a block. Another key point is that even if multiple blocks are requested for migration, client access to a key is blocked only during the actual migration of the block containing that key.
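The following sketch restates the two-step block migration (ownership handover, then data transfer) from the destination side: once ownership arrives, incoming accesses for the block are queued and drained after the block data lands. It is a schematic, single-process illustration with invented names; the real protocol runs across containers with Elasticity Controller coordinating and broadcasting ownership changes.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Schematic destination-side view of the two-step block migration. */
public final class BlockMigrationSketch {

  private boolean ownsBlock = false;        // step 1: ownership handover
  private boolean hasBlockData = false;     // step 2: actual data transfer
  private final Queue<Runnable> pendingAccesses = new ArrayDeque<>();

  /** Called when ownership of the block is handed over to this container. */
  synchronized void onOwnershipReceived() {
    ownsBlock = true;                       // accesses are now routed here...
  }

  /** Client access to a key in the block: queue it until the data has arrived. */
  synchronized void access(Runnable operation) {
    if (ownsBlock && !hasBlockData) {
      pendingAccesses.add(operation);       // ...but must wait for the block data
    } else {
      operation.run();
    }
  }

  /** Called when the block data has been transferred; drain queued accesses. */
  synchronized void onBlockDataReceived() {
    hasBlockData = true;
    while (!pendingAccesses.isEmpty()) {
      pendingAccesses.poll().run();
    }
    // The source would now be ACKed, and Elasticity Controller would
    // broadcast the ownership change to all other containers.
  }
}
```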
3) Transparency during Reconfiguration: Dynamic reconfiguration must occur without any extra work for Elastic Runtime's clients, and it also must refrain from any performance degradation. Such transparent reconfiguration imposes the following requirements. First, the client access APIs must be supported during reconfiguration, maintaining read-my-write consistency. Second, in serving client accesses, overheads such as an increased number of remote data accesses are inevitable due to resource reconfigurations; such inefficiency must be minimized. Finally, the reconfiguration must guarantee that the accuracy of the model being learned is unaffected. Elastic Runtime meets these key requirements for transparent reconfiguration with the following features.

Data accessibility: Data must be accessible at any time during and after data migration for client access. Elastic Store enables remote access with the ownership table maintained atomically during the migration process.

Data locality: Though data is remotely accessible through local Elastic Stores, remote access is expensive. Elastic Executor aligns its workload partitioning with the actual data in its Elastic Store to maximize locality under the migration protocol. During a migration, when worker code running on an Elastic Executor asks for a batch of data to process, a local set of data is guaranteed to be returned by keeping track of the keys of the local training data.

Dynamic ownership table: In Cruise, workers send requests containing the key of the partial model to specific servers, according to each worker's local ownership table. Since the ownership update is immediately broadcast to all worker Elastic Executors during migration, workers can immediately send requests to the new owner server. For requests that arrive at the old owner server prior to the worker-side ownership update, the Elastic Executor of the old owner server refers to its ownership table and redirects the requests to the new owning server.

V. EVALUATION

We implemented Cruise with around 20K lines of code in Java 1.8. We built Cruise on Apache REEF [22], a library for application development on cluster resource managers such as Apache YARN and Apache Mesos. REEF provides a control plane for data processing frameworks, including the negotiation with the cluster resource manager and the control channel between containers.

We evaluate Cruise with three machine learning applications. Our evaluation mainly consists of the following four sections: (1) We compare the performance of our expected optimal configuration to that of the actual optimal configuration (Section V-B). (2) We demonstrate that Cruise reduces epoch time, speeding up training (Section V-C). (3) We show how Cruise optimizes the system configuration when resource availability changes (Section V-D) and in heterogeneous environments (Section V-E). (4) We investigate the overhead incurred while optimizing the system (Section V-F).

A. Experimental Setup

Default cluster setup: We run experiments on AWS EC2 instances with YARN running on Ubuntu 14.04. Unless explicitly mentioned, we use 32 r4.xlarge instances, each of which has 4 virtual cores, 30.5GB RAM, and a 1.25 Gbps network connection3. We launch one Elastic Executor per machine to run a worker or a server.

3 AWS specifies that the r4.xlarge type provides up to 10 Gbps network bandwidth. We measured the actual bandwidth with the iperf tool.

Workloads: We choose three popular ML workloads in different categories: recommendation, classification, and topic modeling, as summarized in Table II.

Non-negative Matrix Factorization (NMF) is commonly used in recommendation systems. The main idea is to find undetermined entries in a given matrix. NMF factorizes a matrix M (m×n) into factor matrices L (m×r) and R (r×n), where M ≈ LR. We implement NMF via the stochastic gradient descent (SGD) algorithm, similar to the one described in [21]. Matrix R is partitioned across servers, while L is partitioned across workers, where the smallest unit of training data is a single user's rating matrix (1 × n). The NMF experiments use the 16x Netflix dataset, whose size is around 40 times greater than the one in the evaluation of [21]. We set the mini-batch size to 10K.
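As a rough illustration of SGD-based matrix factorization in the PS setting, a worker holding its user rows of L and pulling the item columns of R could apply a step of the following form for one observed rating. The update rule, learning rate, squared-error objective, and non-negativity projection are our simplifying assumptions for the example; this is not the exact algorithm of [21] or Cruise's implementation.

```java
/**
 * Rough sketch of one SGD step for matrix factorization (M ~= L * R).
 * The plain squared-error update and the projection to non-negative
 * values are simplifying assumptions for illustration only.
 */
public final class NmfSgdStep {

  /**
   * Update the user's factor row (worker-local L) in place and return the
   * gradient for the item's factor column (to be pushed to the server owning R).
   *
   * @param userRow  L[user], length r, updated in place
   * @param itemCol  R[:, item], length r, as pulled from the server
   * @param rating   the observed matrix entry
   * @param lr       learning rate
   * @return the gradient to push for the item column
   */
  static float[] sgdStep(float[] userRow, float[] itemCol, float rating, float lr) {
    float predicted = 0f;
    for (int k = 0; k < userRow.length; k++) {
      predicted += userRow[k] * itemCol[k];
    }
    float error = rating - predicted;

    float[] itemGradient = new float[itemCol.length];
    for (int k = 0; k < userRow.length; k++) {
      itemGradient[k] = lr * error * userRow[k];                     // push this to the server
      // Projected update keeps the local factor non-negative (one common choice).
      userRow[k] = Math.max(0f, userRow[k] + lr * error * itemCol[k]);
    }
    return itemGradient;
  }
}
```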
TABLE II: Description of the datasets used in the evaluation. Each dataset is partitioned across workers. "Num. of model parameters" refers to the total number of parameters in the model, which is partitioned across servers.
  Application   Dataset                                           Hyper-parameter   Num. of model parameters
  NMF           16x Netflix (1.9M users, 71K movies)              1K rank           1K * 71K
  MLR           Synthetic sparse (100K samples, 160K features)    4K classes        4K * 160K
  LDA           PubMed (8.2M documents, 141K words)               400 topics        400 * 141K

Multinomial Logistic Regression (MLR) is an algorithm for classification. Each d-dimensional observation x ∈ R^d belongs to one of the M classes, with a model parameter size of M × d. We also implement MLR using SGD. Our experiments use a synthetic dataset generated by a public script from the Petuum framework [3]. The dataset is around 46 times greater than the one used in the evaluation of [21]. We process 1K observations in a mini-batch for our experiments.

LDA is an algorithm to discover hidden properties (topics) from a group of documents. Each document consists of a bag of words, with which LDA associates latent topic assignments. Our LDA implementation uses an efficient variant of the collapsed Gibbs sampling algorithm [25], which is widely used [17], [21]. We run the LDA experiments using the PubMed dataset. Our dataset is 15 times larger than one of the datasets used in [21]. We process 1K documents in a mini-batch.

Optimizer setup: We observe that the performance of the initial mini-batches fluctuates until the system stabilizes. To prevent Optimizer from computing the cost inaccurately, we configure it to wait until all workers finish a set of mini-batches. In addition, Optimizer does not trigger reconfiguration if the estimated performance gain (in terms of the cost) is below 5%, in order to avoid oscillation. As mentioned in Section III-C, we use the O(N) analytical solver for the homogeneous environment and the ILP solver for the heterogeneous environment.

B. Finding Baselines with Grid Search

Before evaluating Cruise's optimization, we find the baseline for all experiments. We simply perform a grid search that runs all possible configurations (w, s) to find the ground truth optimal configurations for the various experiments. Since such a grid search including (d, m) is quite complicated, we use heuristics to eliminate these variables. In the homogeneous environment, an even partitioning is intuitively optimal. The result of this grid search yields the 'Optimal (W, S)' column in Table III. In the heterogeneous environment, we distribute blocks proportionally to each machine's power based on metrics including the workers' local computation time and the servers' network bandwidth.

TABLE III: Comparison between the configurations found by Cruise's Optimizer and the ground truth optimum found by the grid search for the cases in Table I. Relative performance is the epoch time in the optimal (W, S) divided by the epoch time in each configuration.
  App.   Initial (W, S)       Cruise's Choice (W, S)   Optimal (W, S)   Rel. Perf
  NMF    (27, 5)              (24, 8)                  (23, 9)          98.8%
  NMF    (18, 14)             (22, 10)                 (23, 9)          93.9%
  MLR    (18, 14), (23, 9)    (26, 6)                  (27, 5)          97.8%
  LDA    (27, 5), (23, 9)     (18, 14)                 (18, 14)         100%

C. Optimization in the Homogeneous Environment

After performing the grid search, we run NMF, MLR, and LDA, each starting with its optimal configuration and with the optimal configurations of the other two applications (as the two configurations are reasonable starting points for users running the applications in the same cluster).

Cruise finds configurations close to the ground-truth optimum found in Section V-B for the various cases mentioned in Section II-B in the homogeneous environment. Table III shows the comparisons: in NMF and MLR, Cruise chooses near-optimal configurations where the number of machines for each role differs by one node compared to the optimum found by grid search. The resulting performance in terms of epoch time is slightly inferior to the optimum, but the difference is smaller than 6.1%. Cruise finds the optimal configuration in LDA, with the same performance as the optimum.

Fig. 6: Epoch time of an NMF job starting at 3 different configurations. The black line shows the global optimum, and the blue and red lines show optimizations from the other initial configurations. The dotted lines show the performance without optimization in the corresponding colors. The vertical lines represent the reconfiguration of each case (the figure annotates the reconfigurations (18, 14) → (22, 10) taking 31.3 s and (27, 5) → (24, 8) taking 23.2 s).

Figure 6 depicts how Cruise decreases the epoch times of NMF. Starting at (W:27, S:5), Cruise moves to (W:24, S:8), with a relative performance 1.1% slower than the optimum at (W:23, S:9). With the initial (W:18, S:14) configuration, Cruise reconfigures to (W:22, S:10), with 6.5% slower performance than the optimum. We observe that Cruise optimizes the misconfigured NMF jobs with significant drops in epoch time of 35.8% and 22.3% in each case, soon stabilizing to that of the new configuration. Our experiments with the other applications in the same environment also decrease epoch time, by 55.3% and 8.7% in MLR, and 37.5% and 17.4% in LDA.

D. Utilizing Opportunistic Resources

The previous experiments show how Cruise optimizes the system configuration when the available resources do not change. Cruise's capability to optimize the system configuration during runtime, however, is more powerful when the available resources change over time. Cruise keeps track of the available resources in the cluster and updates the system configuration if there are changes in resource availability. In this experiment, we assume that the cluster has 16 extra containers that are available opportunistically; starting with 16 containers, we add/reclaim 16 containers every 20 minutes. We show how Cruise's runtime optimization utilizes opportunistic resources by comparing cases with and without optimization. In both cases, we run the ML jobs with the initial configuration of (W:14, S:2), the actual optimum found for 16 containers from experiments.

Fig. 7: Utilizing opportunistic resources in the NMF job. The blue line shows Cruise's ability to adapt to resource availability compared to the baseline drawn in the black line, where the configuration is fixed to the initial 16 resources. Vertical lines represent the events of resource addition (green) and reclamation (red). The areas filled in sky-blue denote the reconfigurations (annotated at 87.3 s, 98.4 s, and 69.7 s).

Figure 7 demonstrates that the average epoch time approximately halves when 16 more resources are available. At the 20th minute, the configuration moves toward (W:25, S:7), taking advantage of the added resources. The reconfiguration takes around 87.3 seconds, with most of the overhead caused by migrating half of the total training data and model data to the new containers. At the 40th minute, Cruise returns to the previous configuration (W:12, S:4) with 98.4 seconds of reconfiguration overhead for data migration and state cleanup. The additional resources become available again at the 60th minute, and Cruise goes to (W:25, S:7), the same as the previous optimization in the [20, 40] minute time interval.

E. Optimization in the Heterogeneous Environment

The setup in the heterogeneous environment differs from the homogeneous environment in two aspects. The workload should be partitioned differently corresponding to the machine types, and each type of machine should be assigned a more proper role (e.g., worker or server). The heterogeneous environment uses two types of machines: in addition to the 28 instances of r4.xlarge, we use 4 faster machines (r4.4xlarge) that have 16 virtual cores, 122 GB RAM, and a 5.00 Gbps network connection.4 In our experiments, we allocate the faster instances evenly to begin with, 2 for workers and 2 for servers. For block partitioning, we start all experiments with even partitioning, denoted as 'E', whereas the optimal configuration distributes blocks proportionally to machine capability, denoted as 'P'. For example, we denote a configuration of 20 workers with 3 strong machines and 12 servers with 1 strong machine, with even block partitioning, as (W:17+3, S:11+1)|E. We run the same three applications with the same starting points as in the homogeneous environment, and we show how differently Cruise optimizes the configuration in the heterogeneous environment.

4 AWS specifies that the r4.4xlarge type provides up to 10 Gbps network bandwidth. We measured the actual bandwidth with the iperf tool.

Fig. 8: Epoch time of NMF at different starting points in the heterogeneous environment. 4 machines are r4.4xlarge ("Faster") and the remaining 28 are r4.xlarge ("Slower") EC2 instances. The notation for configurations in the legend is defined in Section V-E (the figure annotates 28.2 s to solve and 77.7 s to execute the reconfiguration).

Figure 8 shows the results of running NMF starting at (W:25+2, S:3+2)|E. Cruise reconfigures the job to (W:20+4, S:8+0)|P, close to the ground truth optimum in the heterogeneous environment, (W:19+4, S:9+0)|P. Data is repartitioned so that the faster instances have about 2 times more blocks than the slower instances, reflecting the heterogeneity. Epoch time decreases by 35.2% (from 162 s to 105 s). Our experiments with the other applications in the same environment also reduce the epoch times, by 58.3% in MLR starting at (W:16+2, S:12+2)|E and 41.3% in LDA starting at (W:25+2, S:3+2)|E.
To focus on the benefit of role reassignment and workload repartitioning, we run the job again at (W:21+2, S:7+2)|E, without any optimization (the green line). Here, we observe that this static configuration is 12.4% slower (118 s vs. 105 s) than the epoch time of the configuration chosen by Cruise.

F. Reconfiguration Speed

The optimization procedure is composed of cost calculation and plan execution. Cost calculation takes 30 ms for the homogeneous environment and 37.4 s for the heterogeneous environment on average. More time is spent on plan execution, especially for move: the overhead includes (de)serialization time, network transfer time, and the time to acquire the lock on the block to migrate. The data size also affects the time to execute the operation. The overhead of add and delete is relatively small, taking around 2~3 s for resource initialization and cleanup. The time for switch is even smaller, since there is no cost for resource setup.

Here we break down the plan execution of an NMF experiment that starts from (W:18, S:14) in Section V-C. The plan changes the configuration to (W:22, S:10) and is composed of 4 switches from server to worker and 30 moves that repartition data and model blocks. It takes 31.3 s in total for our plan executor to execute these operations in parallel. Most of this time is taken by worker-side moves, due to the large size of the data being migrated. The input data is divided into 200 worker blocks. A worker block is 100 MB, each containing 10K items. 18 moves migrate 36 data blocks in total between worker executors. The longest move takes 25.8 s, migrating 4 blocks at once, and serves as the bottleneck of this plan execution time. On the other hand, the model data is divided into 128 server blocks. The block size is 2 MB, and each block contains only 280 items. The plan migrates 36 model blocks with 12 moves. Server-side moves take at most 1.4 s.

VI. RELATED WORK

An early version of our work appeared in a workshop paper that sketched a high-level approach to optimizing PS system configurations [6]. This paper presents the complete problem formulation, design, implementation, and evaluation of the system. Cruise differs from other systems in that it automatically tunes the PS system configuration by solving the configuration optimization problem with runtime PS system metrics and applying the solution to the running system efficiently. Below we summarize the works that are most relevant to Cruise.

TuPAQ [18] is a system for identifying ML model configurations (e.g., support vector machine vs. logistic regression, hyper-parameter values) that lead to high performance in terms of model accuracy, built on Apache Spark [14], [26]. TuPAQ casts ML model identification as a query planning problem and applies a bandit allocation strategy as well as various optimizations, such as batching, optimal cluster sizing, and advanced hyper-parameter tuning techniques, to solve the problem efficiently. This study focuses on supervised ML models. In contrast, Cruise addresses the problem of tuning the system configuration in the PS architecture.

SystemML [13] is a hybrid runtime system that uses an in-memory Control Program (CP) and MapReduce (MR) jobs to run declarative ML programs. There is also another version of the work [2] that applied the same concept to Spark [26], regarding the Spark Driver as the CP and Executors as the worker jobs. The system focuses on optimizing memory configurations during runtime when there is a change in available resources. In contrast, Cruise optimizes the running time of ML applications on the PS architecture by automatically tuning the system configuration.

Yan et al. [24] propose a cost formulation that predicts the computation and communication overheads of Deep Neural Network (DNN) applications by modeling the internals of the algorithm. In contrast, Cruise measures computation and communication time at runtime instead of modeling the internals of the algorithm, uses the runtime measurements for cost-based optimization, and applies the estimated optimal system configuration by reconfiguring a running job, which allows us to take advantage of resource elasticity.

Starfish [11], built on Apache Hadoop [19], performs optimization for MapReduce. It gathers job profiles from runtime statistics via dynamic instrumentation for job-level tuning, guaranteeing shorter execution times. Many of Starfish's design considerations come from the MapReduce programming model, while Cruise targets ML applications running on the PS architecture.

Recent works like Bösen [21], Ako [20], and MALT [15] do not decide on the number of workers and servers, since each node runs both a worker and a server. They have different styles of model synchronization. Bösen is a PS implementation that requires all-to-all communication. Ako is a peer-to-peer DNN training system; it exchanges partial gradients across multiple rounds to adjust the hardware and statistical efficiency. Similarly, MALT is a peer-to-peer ML training system where each node exchanges parameter updates with log n nodes deterministically. In contrast, Cruise employs cost-based optimization, allows flexible worker and server allocation, handles elastically changing resources, and considers heterogeneous environments.

VII. CONCLUSION

In this paper, we present a methodology to automatically tune the system configuration of PS-based ML systems. We build Cruise by extending an existing PS-based system to optimize the system configuration based on this methodology. The Cruise Optimizer estimates the optimal system configuration - resource configuration and workload partitioning - using a cost-based model with runtime metrics. Elastic Runtime enables efficient runtime reconfigurations according to the computed optimal configuration. Our evaluation shows that Cruise frees ML application developers from choosing the right system configuration by tuning the system configuration automatically. Cruise is publicly available at https://github.com/snuspl/cruise.

ACKNOWLEDGEMENT

We thank all reviewers for their comments. This research was supported by the Samsung Research Funding Center of Samsung Electronics under Project Number SRFC-TC1603-01.
REFERENCES

[1] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In WSDM, 2012.
[2] M. Boehm, M. W. Dusenberry, D. Eriksson, A. V. Evfimievski, F. M. Manshadi, N. Pansare, B. Reinwald, F. R. Reiss, P. Sen, A. C. Surve, et al. SystemML: Declarative machine learning on Spark. Proceedings of the VLDB Endowment, 9(13):1425–1436, 2016.
[3] Carnegie Mellon University. Petuum Bösen, 2016. https://github.com/petuum/bosen.
[4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[5] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In OSDI, 2014.
[6] B.-G. Chun, B. Cho, B. Jeon, J. S. Jeong, G. Kim, J. Y. Kim, W.-Y. Lee, Y. S. Lee, M. Weimer, Y. Yang, and G.-I. Yu. Dolphin: Runtime optimization for distributed machine learning. In NIPS ML Systems Workshop, 2016.
[7] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In NIPS, 2012.
[8] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. SIGOPS, 2007.
[9] Gurobi Optimization, Inc. Gurobi optimizer, 2017.
[10] A. Harlap, H. Cui, W. Dai, J. Wei, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Addressing the straggler problem for iterative convergent parallel ML. In SoCC, 2016.
[11] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A self-tuning system for big data analytics. In CIDR, volume 11, pages 261–272, 2011.
[12] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 2013.
[13] B. Huang, M. Boehm, Y. Tian, B. Reinwald, S. Tatikonda, and F. R. Reiss. Resource elasticity for large-scale machine learning. In SIGMOD, 2015.
[14] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan. MLbase: A distributed machine-learning system. In CIDR, volume 1, pages 2–1, 2013.
[15] H. Li, A. Kadav, E. Kruus, and C. Ungureanu. MALT: Distributed data-parallelism for existing ML applications. In EuroSys, 2015.
[16] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.
[17] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Proceedings of the VLDB Endowment (PVLDB), volume 3, pages 703–710, Sept. 2010.
[18] E. R. Sparks, A. Talwalkar, D. Haas, M. J. Franklin, M. I. Jordan, and T. Kraska. Automating model search for large scale machine learning. In SoCC, 2015.
[19] The Apache Software Foundation. Apache Hadoop, 2015. http://hadoop.apache.org.
[20] P. Watcharapichat, V. L. Morales, R. C. Fernandez, and P. Pietzuch. Ako: Decentralised deep learning with partial gradient exchange. In SoCC, 2016.
[21] J. Wei, W. Dai, A. Qiao, Q. Ho, H. Cui, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Managed communication and consistency for fast data-parallel iterative analytics. In SoCC, 2015.
[22] M. Weimer, Y. Chen, B.-G. Chun, T. Condie, C. Curino, C. Douglas, Y. Lee, T. Majestro, D. Malkhi, S. Matusevych, B. Myers, S. Narayanamurthy, R. Ramakrishnan, S. Rao, R. Sears, B. Sezgin, and J. Wang. REEF: Retainable evaluator execution framework. In SIGMOD, 2015.
[23] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. In SIGKDD, 2015.
[24] F. Yan, O. Ruwase, Y. He, and T. Chilimbi. Performance modeling and scalability optimization of distributed deep learning systems. In SIGKDD, 2015.
[25] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In SIGKDD, 2009.
[26] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.
