SPARKNET: TRAINING DEEP NETWORKS IN SPARK
ABSTRACT
1 INTRODUCTION
Deep learning has advanced the state of the art in a number of application domains. Many of the
recent advances involve fitting large models (often several hundred megabytes) to larger datasets
(often hundreds of gigabytes). Given the scale of these optimization problems, training can be time-
consuming, often requiring multiple days on a single GPU using stochastic gradient descent (SGD).
For this reason, much effort has been devoted to leveraging the computational resources of a cluster
to speed up the training of deep networks (and more generally to perform distributed optimization).
Many attempts to speed up the training of deep networks rely on asynchronous, lock-free optimiza-
tion (Dean et al., 2012; Chilimbi et al., 2014). This paradigm uses the parameter server model (Li
et al., 2014; Ho et al., 2013), in which one or more master nodes hold the latest model parameters
in memory and serve them to worker nodes upon request. The workers then compute gradients with respect to these parameters on a minibatch drawn from their local data shard. These gradients are
shipped back to the server, which updates the model parameters.
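As a purely illustrative sketch of this parameter-server pattern (and not of SparkNet's scheme), the following self-contained Scala toy models a master that serves parameters and workers that asynchronously pull them, compute a gradient on an example from a local shard, and push the gradient back; the least-squares model, dimensions, and iteration counts are arbitrary placeholders.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Random

// Toy sketch of the parameter-server pattern described above (not SparkNet's
// scheme). The server holds the latest parameters; each worker repeatedly pulls
// them, computes a gradient on an example from its local shard, and pushes the
// gradient back. The least-squares model and all sizes are placeholders.
object ParameterServerSketch {

  class ParameterServer(dim: Int, lr: Double) {
    private val w = Array.fill(dim)(0.0)
    def pull(): Array[Double] = synchronized { w.clone() }    // serve the latest parameters
    def push(grad: Array[Double]): Unit = synchronized {      // apply a worker's gradient
      var i = 0
      while (i < dim) { w(i) -= lr * grad(i); i += 1 }
    }
  }

  def main(args: Array[String]): Unit = {
    val dim = 10
    val server = new ParameterServer(dim, lr = 0.01)

    // Four workers, each holding a local shard of (features, label) pairs.
    val shards = Seq.fill(4) {
      Seq.fill(256)((Array.fill(dim)(Random.nextGaussian()), Random.nextGaussian()))
    }

    val workers = shards.map { shard =>
      Future {
        for (_ <- 1 to 1000) {
          val w = server.pull()                               // fetch current parameters
          val (x, y) = shard(Random.nextInt(shard.size))      // "minibatch" of size one
          val err = x.zip(w).map { case (a, b) => a * b }.sum - y
          val grad = x.map(_ * err)                           // gradient of the squared error
          server.push(grad)                                   // ship the gradient back; workers proceed without waiting for each other
        }
      }
    }
    Await.result(Future.sequence(workers), 5.minutes)
    println("final parameters: " + server.pull().mkString(", "))
  }
}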
At the same time, batch-processing frameworks enjoy widespread usage and have been gaining in
popularity. Beginning with MapReduce (Dean & Ghemawat, 2008), a number of frameworks for
distributed computing have emerged to make it easier to write distributed programs that leverage the
resources of a cluster (Zaharia et al., 2010; Isard et al., 2007; Murray et al., 2013). These frameworks
have greatly simplified many large-scale data analytics tasks. However, state-of-the-art deep learning
systems rely on custom implementations to facilitate their asynchronous, communication-intensive
workloads. One reason is that popular batch-processing frameworks (Dean & Ghemawat, 2008;
Zaharia et al., 2010) are not designed to support the workloads of existing deep learning systems.
SparkNet implements a scalable, distributed algorithm for training deep networks that lends itself to
batch computational frameworks such as MapReduce and Spark and works well out-of-the-box in
bandwidth-limited environments.
The benefits of integrating model training with existing batch frameworks are numerous. Much of
the difficulty of applying machine learning has to do with obtaining, cleaning, and processing data as
well as deploying models and serving predictions. For this reason, it is convenient to integrate model
training with the existing data-processing pipelines that have been engineered in today’s distributed
computational environments. Furthermore, this approach allows data to be kept in memory from
start to finish, whereas a segmented approach requires writing to disk between operations. If a
user wishes to train a deep network on the output of a SQL query or on the output of a graph
computation and to feed the resulting predictions into a distributed visualization tool, this can be
done conveniently within a single computational framework.
We emphasize that the hardware requirements of our approach are minimal. Whereas many ap-
proaches to the distributed training of deep networks involve heavy communication (often com-
municating multiple gradient vectors for every minibatch), our approach gracefully handles the
bandwidth-limited setting while also taking advantage of clusters with low-latency communication.
For this reason, we can easily deploy our algorithm on clusters that are not optimized for com-
munication. Our implementation works well out of the box on a five-node EC2 cluster in which
broadcasting and collecting model parameters (several hundred megabytes per worker) takes on the
order of 20 seconds, and performing a single minibatch gradient computation requires about 2 sec-
onds (for AlexNet). We achieve this by providing a simple algorithm for parallelizing SGD that
involves minimal communication and lends itself to straightforward implementation in batch com-
putational frameworks. Our goal is not to outperform custom computational frameworks but rather
to propose a system that can be easily implemented in popular batch frameworks and that performs
nearly as well as what can be accomplished with specialized frameworks.
2 IMPLEMENTATION
Here we describe our implementation of SparkNet. SparkNet builds on Apache Spark (Zaharia et al., 2010) and the Caffe deep learning library (Jia et al., 2014). In addition, we use Java Native Access for accessing Caffe data and weights natively from Scala, and we use the Java implementation of Google Protocol Buffers to allow the dynamic construction of Caffe networks at runtime.
The Net class wraps Caffe and exposes a simple API containing the methods shown in Listing 1.

Listing 1: The SparkNet API.
class Net {
  def Net(netParams: NetParams): Net
  def setTrainingData(data: Iterator[(NDArray,Int)])
  def setValidationData(data: Iterator[(NDArray,Int)])
  def train(numSteps: Int)
  def test(numSteps: Int): Float
  def setWeights(weights: WeightCollection)
  def getWeights(): WeightCollection
}
The NetParams type specifies a network architecture, and the WeightCollection type is
a map from layer names to lists of weights. It allows the manipulation of network components
and the storage of weights and outputs for individual layers. To facilitate manipulation of data
and weights without copying memory from Caffe, we implement the NDArray class, which is a
lightweight multi-dimensional tensor library. One benefit of building on Caffe is that any existing
Caffe model definition or solver file is automatically compatible with SparkNet. There is a large
community developing Caffe models and extensions, and these can easily be used in SparkNet. By
building on top of Spark, we inherit the advantages of modern batch computational frameworks.
These include the high-throughput loading and preprocessing of data and the ability to keep data in
memory between operations. In Listing 2, we give an example of how network architectures can
be specified in SparkNet. In addition, model specifications or weights can be loaded directly from
Caffe files. An example sketch of code that uses our API to perform distributed training is given in
Listing 3.
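As a hypothetical illustration of how the Listing 1 API might be used on a single node, the snippet below loads a model definition, attaches data, and runs a few training steps; the loadCaffeModelFromFile and loadImages helpers and the file paths are placeholders for this sketch, not actual SparkNet functions.

// Hypothetical single-node usage of the Listing 1 API. The helpers
// loadCaffeModelFromFile and loadImages and the paths are placeholders.
val netParams: NetParams = loadCaffeModelFromFile("models/alexnet/train_val.prototxt")
val net = new Net(netParams)

val trainData: Iterator[(NDArray, Int)] = loadImages("data/train")   // (image tensor, label) pairs
val valData: Iterator[(NDArray, Int)] = loadImages("data/val")
net.setTrainingData(trainData)
net.setValidationData(valData)

net.train(1000)                                   // run 1000 SGD steps inside Caffe
val accuracy = net.test(50)                       // evaluate on 50 validation batches
val weights: WeightCollection = net.getWeights()  // snapshot the weights layer by layer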
The parallelization scheme is described in Listing 3. Spark consists of a single master node and a
number of worker nodes. The data is split among the Spark workers. In every iteration, the Spark
master broadcasts the model parameters to each worker. Each worker then runs SGD on the model
with its subset of data for a fixed number of iterations τ (we use τ = 50 in Listing 3) or for a fixed
length of time, after which the resulting model parameters on each worker are sent to the master and
averaged to form the new model parameters. We recommend initializing the network by running
SGD for a small number of iterations on the master. A similar and more sophisticated approach to
parallelizing SGD with minimal communication overhead is discussed in Zhang et al. (2015).
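A rough sketch of this loop in Spark, written against the Listing 1 API, is shown below; it is in the spirit of Listing 3 rather than a reproduction of it, and the average helper (layer-wise averaging of WeightCollections), the use of a constructor, and the variables sc, netParams, and numRounds are assumptions made for illustration.

// Illustrative sketch of SparkNet-style training in Spark (in the spirit of
// Listing 3, not a reproduction of it). trainData is assumed to be an
// RDD[(NDArray, Int)] with one partition per worker; average() is an assumed
// helper that averages a collection of WeightCollections layer by layer.
var netWeights: WeightCollection = new Net(netParams).getWeights()   // initialize on the master

for (round <- 1 to numRounds) {
  val broadcastWeights = sc.broadcast(netWeights)                    // master -> workers
  val updatedWeights = trainData.mapPartitions { partition =>
    val net = new Net(netParams)
    net.setWeights(broadcastWeights.value)
    net.setTrainingData(partition)
    net.train(50)                                                    // tau = 50 local SGD steps
    Iterator(net.getWeights())
  }.collect()                                                        // workers -> master
  netWeights = average(updatedWeights)                               // new model = average of the worker models
}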
The standard approach to parallelizing each gradient computation requires broadcasting and collect-
ing model parameters (hundreds of megabytes per worker and gigabytes in total) after every SGD
update, which occurs tens of thousands of times during training. On our EC2 cluster, each broadcast
and collection takes about twenty seconds, putting a bound on the speedup that can be expected
using this approach without better hardware or without partitioning models across machines. Our approach broadcasts and collects the parameters a factor of τ fewer times for the same number of iterations. In our experiments, we set τ = 50, but other values seem to work about as well.
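As a rough illustration using the cluster figures reported in this paper (S ≈ 20 s per broadcast-and-collect round, C(b) ≈ 2 s per minibatch gradient for AlexNet, K = 5, and τ = 50), the wallclock cost per SGD update under the two schemes is approximately
\[
\underbrace{\frac{C(b)}{K} + S}_{\text{naive}} \approx \frac{2}{5} + 20 \approx 20.4 \text{ s}, \qquad
\underbrace{\frac{\tau\,C(b) + S}{\tau}}_{\text{SparkNet, amortized over a round}} \approx \frac{50 \cdot 2 + 20}{50} = 2.4 \text{ s}.
\]
This comparison reflects only raw time per update; the two schemes make different progress per update, which the analysis in Section 3 takes into account.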
We note that Caffe supports parallelism across multiple GPUs within a single node. This is not a
competing form of parallelism but rather a complementary one. In some of our experiments, we use
Caffe to handle parallelism within a single node, and we use the parallelization scheme described in
Listing 3 to handle parallelism across nodes.
3 EXPERIMENTS
In Section 3.2, we will benchmark the performance of SparkNet and measure the speedup that our
system obtains relative to training on a single node. However, the outcomes of those experiments
depend on a number of different factors. In addition to τ (the number of iterations between syn-
chronizations) and K (the number of machines in our cluster), they depend on the communication overhead S in our cluster. In Section 3.1, we find it instructive to measure the speedup in the ideal-
ized case of zero communication overhead (S = 0). This idealized model gives us an upper bound
on the maximum speedup that we could hope to obtain in a real-world cluster, and it allows us to
build a model for the speedup as a function of S (the overhead is easily measured in practice).
Before benchmarking our system, we determine the maximum possible speedup that could be ob-
tained in principle in a cluster with no communication overhead. We determine the dependence of
this speedup on the parameters τ (the number of iterations between synchronizations) and K (the
number of machines in our cluster).
effectively parallelize the minibatch computation. One might imagine circumventing this limitation
by using a larger batch size b. Unfortunately, the benefit of using larger batches is relatively modest.
As the batch size b increases, Na (b) does not decrease enough to justify the use of a very large value
of b.
Furthermore, the benefits of this approach depend greatly on the degree of communication overhead.
If aggregating the gradients and broadcasting the model parameters requires S units of time, then
the time required by this approach is at least C(b)/K + S per iteration and Na (b)(C(b)/K + S) to
achieve an accuracy of a. Therefore, the maximum achievable speedup is C(b)/(C(b)/K + S) ≤
C(b)/S. We may expect S to increase modestly as K increases, but we suppress this effect here.
The performance of the naive parallelization scheme is easily understood because its behavior is
equivalent to that of the serial algorithm. In contrast, SparkNet uses a parallelization scheme that is
not equivalent to serial SGD (described in Section 2.1), and so its analysis is more complex.
SparkNet’s parallelization scheme proceeds in rounds (see Figure 2c). In each round, each machine
runs SGD for τ iterations with batch size b. Between rounds, the models on the workers are gathered
together on the master, averaged, and broadcast to the workers.
We use Ma (b, K, τ ) to denote the number of rounds required to achieve an accuracy of a. The number of parallel iterations of SGD under SparkNet’s parallelization scheme required to achieve an accuracy of a is then τ Ma (b, K, τ ), and the wallclock time is
Ma (b, K, τ )(τ C(b) + S). (2)
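Collecting the timing quantities in one place (consistent with Equation 1, Equation 2, and the speedup expressions used in the discussion below):
\[
T_{\text{serial}} = N_a(b)\,C(b), \qquad
T_{\text{naive}} = N_a(b)\left(\frac{C(b)}{K} + S\right), \qquad
T_{\text{SparkNet}} = M_a(b, K, \tau)\,\bigl(\tau\,C(b) + S\bigr),
\]
so the corresponding speedups over serial SGD are
\[
\frac{T_{\text{serial}}}{T_{\text{naive}}} = \frac{C(b)}{C(b)/K + S}
\qquad \text{and} \qquad
\frac{T_{\text{serial}}}{T_{\text{SparkNet}}} = \frac{N_a(b)\,C(b)}{(\tau\,C(b) + S)\,M_a(b, K, \tau)}.
\]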
Figure 2: (a) This figure depicts a serial run of SGD. Each block corresponds to a single SGD update with
batch size b. The quantity Na (b) is the number of iterations required to achieve an accuracy of a.
(b) This figure depicts a parallel run of SGD on K = 4 machines under a naive parallelization
scheme. At each iteration, each batch of size b is divided among the K machines, the gradients
over the subsets are computed separately on each machine, the updates are aggregated, and the new
model is broadcast to the workers. Algorithmically, this approach is exactly equivalent to the serial
run of SGD in Figure 2a and so the number of iterations required to achieve an accuracy of a is the
same value Na (b).
(c) This figure depicts a parallel run of SGD on K = 4 machines under SparkNet’s parallelization
scheme. At each step, each machine runs SGD with batch size b for τ iterations, after which the
models are aggregated, averaged, and broadcast to the workers. The quantity Ma (b, K, τ ) is the
number of rounds (of τ iterations) required to obtain an accuracy of a. The total number of parallel
iterations of SGD under SparkNet’s parallelization scheme required to obtain an accuracy of a is
then τ Ma (b, K, τ ).
The speedup given by the naive parallelization scheme can be computed exactly and is given by
C(b)/(C(b)/K +S). This formula is essentially Amdahl’s law. Note that when S ≥ C(b), the naive
parallelization scheme is slower than the computation on a single machine. The speedup obtained
by SparkNet is Na (b)C(b)/[(τ C(b) + S)Ma (b, K, τ )] for a specific value of τ . The numerator is
the time required by serial SGD to achieve an accuracy of a from Equation 1, and the denominator is
the time required by SparkNet to achieve the same accuracy from Equation 2. Choosing the optimal
value of τ gives us a speedup of maxτ Na (b)C(b)/[(τ C(b) + S)Ma (b, K, τ )]. In practice, choosing τ is not a difficult problem.
[Figure 3 is a grid of speedup values with one row per cluster size K (2 through 6) and one column per value of τ (1 through 2500); the measured speedups range from roughly 1.0 to 3.0.]
Figure 3: This figure shows the speedup Na (b)/(τ Ma (b, K, τ )) given by SparkNet’s parallelization
scheme relative to training on a single machine to obtain an accuracy of a = 20%. Each grid square
corresponds to a different choice of K and τ . We show the speedup in the zero communication
overhead setting. This experiment uses a modified version of AlexNet on a subset of ImageNet
(100 classes each with approximately 1000 images). Note that these numbers are dataset specific.
Nevertheless, the trends they capture are of interest.
[Figure 4 plots speedup against communication overhead S on a logarithmic scale from 10^-2 to 10^3, with curves for the naive parallelization scheme, SparkNet, and the no-speedup baseline.]
Figure 4: This figure shows the speedups obtained by the naive parallelization scheme and by
SparkNet as a function of the cluster’s communication overhead (normalized so that C(b) = 1).
We consider K = 5. The data for this plot applies to training a modified version of AlexNet on
a subset of ImageNet (approximately 1000 images for each of the first 100 classes). The speedup
obtained by the naive parallelization scheme is C(b)/(C(b)/K + S). The speedup obtained by
SparkNet is Na (b)C(b)/[(τ C(b) + S)Ma (b, K, τ )] for a specific value of τ . The numerator is the
time required by serial SGD to achieve an accuracy of a, and the denominator is the time required by
SparkNet to achieve the same accuracy (see Equation 1 and Equation 2). For the optimal value of τ ,
the speedup is maxτ Na (b)C(b)/[(τ C(b) + S)Ma (b, K, τ )]. To plot the SparkNet speedup curve,
we maximize over the set of values τ ∈ {1, 2, 5, 10, 25, 100, 500, 1000, 2500} and use the values
Ma (b, K, τ ) and Na (b) from the experiments in the fifth row of Figure 3. In our experiments, we
have S ≈ 20s and C(b) ≈ 2s.
The ratio Na (b)/(τ Ma (b, K, τ )) (the speedup when S = 0) degrades slowly as τ increases, so it suffices to choose τ to be a small multiple of S (say 5S) so that the algorithm spends only a fraction of its time in communication.
When plotting the SparkNet speedup in Figure 4, we do not maximize over all positive integer values of τ but rather over the set τ ∈ {1, 2, 5, 10, 25, 100, 500, 1000, 2500}, and we use the values of Na (b) and Ma (b, K, τ ) corresponding to the fifth row of Figure 3.
[Figures 5 and 6 plot accuracy against training time in hours.]
Figure 5: This figure shows the performance of SparkNet on a 3-node, 5-node, and 10-node cluster, where each node has 1 GPU. In these experiments, we use τ = 50. The baseline was obtained by running Caffe on a single GPU with no communication overhead. The experiments are performed on ImageNet using AlexNet.
Figure 6: This figure shows the performance of SparkNet on a 3-node cluster and on a 6-node cluster, where each node has 4 GPUs. In these experiments, we use τ = 50. The baseline uses Caffe on a single node with 4 GPUs and no communication overhead. The experiments are performed on ImageNet using GoogLeNet.
Including more values of τ would only increase the SparkNet speedup. The distributed training of deep networks is typically
thought of as a communication-intensive procedure. However, Figure 4 demonstrates the value of
SparkNet’s parallelization scheme even in the most bandwidth-limited settings.
The naive parallelization scheme may appear to be a straw man. However, it is a frequently used ap-
proach to parallelizing SGD (Noel et al., 2015; Iandola et al., 2015), especially when asynchronous
updates are not an option (as in computational frameworks like MapReduce and Spark).
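For concreteness, a minimal sketch of this naive per-minibatch scheme in Spark might look as follows; the flat Array[Float] parameter representation, the computeGradient and applyUpdate helpers, and the variables data, initialWeights, numSteps, batchFraction, and sc are assumptions made for illustration, not part of any of the systems cited above.

// Minimal sketch of naive data-parallel SGD in Spark: every update broadcasts
// the parameters and aggregates one averaged gradient, so a broadcast-and-collect
// happens once per minibatch. computeGradient and applyUpdate are assumed helpers;
// parameters and gradients are flat Array[Float]s for simplicity.
var weights: Array[Float] = initialWeights

for (step <- 1 to numSteps) {
  val bw = sc.broadcast(weights)
  val minibatch = data.sample(withReplacement = false, fraction = batchFraction)
  val (gradSum, count) = minibatch
    .map(example => (computeGradient(bw.value, example), 1L))
    .treeReduce { case ((g1, n1), (g2, n2)) =>
      (g1.zip(g2).map { case (a, b) => a + b }, n1 + n2)     // sum gradients across workers
    }
  weights = applyUpdate(weights, gradSum.map(_ / count))     // one SGD step with the averaged gradient
}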
To explore the scaling behavior of our algorithm and implementation, we perform experiments on
EC2 using clusters of g2.8xlarge nodes. Each node has four NVIDIA GRID GPUs and 60GB of
memory. We train the default Caffe model of AlexNet (Krizhevsky et al., 2012) on the ImageNet
dataset (Russakovsky et al., 2015). We run SparkNet with K = 3, 5, and 10 and plot the results
in Figure 5. For comparison, we also run Caffe on the same cluster with a single GPU and no
communication overhead to obtain the K = 1 plot. These experiments use only a single GPU on
each node. To measure the speedup, we compare the wall-clock time required to obtain an accuracy
of 45%. With 1 GPU and no communication overhead, this takes 55.6 hours. With 3, 5, and 10
GPUs, SparkNet takes 22.9, 14.5, and 12.8 hours, giving speedups of 2.4, 3.8, and 4.4.
We also train the default Caffe model of GoogLeNet (Szegedy et al., 2015) on ImageNet. We run
SparkNet with K = 3 and K = 6 and plot the results in Figure 6. In these experiments, we
use Caffe’s multi-GPU support to take advantage of all four GPUs within each node, and we use
SparkNet’s parallelization scheme to handle parallelism across nodes. For comparison, we train
Caffe on a single node with four GPUs and no communication overhead. To measure the speedup,
we compare the wall-clock time required to obtain an accuracy of 40%. Relative to the baseline of
Caffe with four GPUs, SparkNet on 3 and 6 nodes gives speedups of 2.7 and 3.2. Note that this is
on top of the speedup of roughly 3.5 that Caffe with four GPUs gets over Caffe with one GPU, so
the speedups that SparkNet obtains over Caffe on a single GPU are roughly 9.4 and 11.2.
Furthermore, we explore the dependence of the parallelization scheme described in Section 2.1 on
the parameter τ, which determines the number of iterations of SGD that each worker does before
synchronizing with the other workers. These results are shown in Figure 7. Note that in the presence
of stragglers, it suffices to replace the fixed number of iterations τ with a fixed length of time, but in
our experimental setup, the timing was sufficiently consistent and stragglers did not arise. The single
GPU experiment in Figure 5 was trained on a single GPU node with no communication overhead.
[Figure 7 plots accuracy against training time in hours, with one curve for each choice of τ ∈ {20, 50, 100, 150} iterations between synchronizations.]
Figure 7: This figure shows the dependence of the parallelization scheme described in Section 2.1
on τ . Each experiment was run with K = 5 workers. This figure shows that good performance can
be achieved without collecting and broadcasting the model after every SGD update.
4 RELATED WORK
Much work has been done to build distributed frameworks for training deep networks. Coates et al.
(2013) build a model-parallel system for training deep networks on a GPU cluster using MPI over
InfiniBand. Dean et al. (2012) build DistBelief, a distributed system capable of training deep net-
works on thousands of machines using stochastic and batch optimization procedures. In particular,
they highlight asynchronous SGD and batch L-BFGS. DistBelief exploits both data parallelism and
model parallelism. Chilimbi et al. (2014) build Project Adam, a system for training deep networks
on hundreds of machines using asynchronous SGD. Li et al. (2014); Ho et al. (2013) build parameter
servers to exploit model and data parallelism, and though their systems are better suited to sparse
gradient updates, they could very well be applied to the distributed training of deep networks. More
recently, Abadi et al. (2015) build TensorFlow, a sophisticated system for training deep networks
and more generally for specifying computation graphs and performing automatic differentiation.
Iandola et al. (2015) build FireCaffe, a data-parallel system that achieves impressive scaling using
naive parallelization in the high-performance computing setting. They minimize communication
overhead by using a tree reduce for aggregating gradients in a supercomputer with Cray Gemini
interconnects.
These custom systems have numerous advantages including high performance, fine-grained control
over scheduling and task placement, and the ability to take advantage of low-latency communication
between machines. On the other hand, due to their demanding communication requirements, they
are unlikely to exhibit the same scaling on an EC2 cluster. Furthermore, due to their nature as custom
systems, they lack the benefits of tight integration with general-purpose computational frameworks
such as Spark. For some of these systems, preprocessing must be done separately by a MapReduce
style framework, and data is written to disk between segments of the pipeline. With SparkNet,
preprocessing and training are both done in Spark.
Training a machine learning model such as a deep network is often one step of many in real-world
data analytics pipelines (Sparks et al., 2015). Obtaining, cleaning, and preprocessing the data are
often expensive operations, as is transferring data between systems. Training data for a machine
learning model may be derived from a streaming source, from a SQL query, or from a graph com-
putation. A user wishing to train a deep network in a custom system on the output of a SQL query
would need a separate SQL engine. In SparkNet, training a deep network on the output of a SQL
query, or a graph computation, or a streaming data source is straightforward due to its general pur-
pose nature and its support for SQL, graph computations, and data streams (Armbrust et al., 2015;
Gonzalez et al., 2014; Zaharia et al., 2013).
Some attempts have been made to train deep networks in general-purpose computational frameworks; however, existing work typically hinges on extremely low-latency intra-cluster communication. Noel et al. (2015) train deep networks in Spark on top of YARN using SGD and leverage cluster
resources to parallelize the computation of the gradient over each minibatch. To achieve competitive
performance, they use remote direct memory access over InfiniBand to exchange model parameters
quickly between GPUs. In contrast, SparkNet tolerates low-bandwidth intra-cluster communication
and works out of the box on Amazon EC2.
A separate line of work addresses speeding up the training of deep networks using single-machine
parallelism. For example, Caffe con Troll (Abuzaid et al., 2015) modifies Caffe to leverage both
CPU and GPU resources within a single node. These approaches are compatible with SparkNet and
the two can be used in conjunction.
Many popular computational frameworks provide support for training machine learning models
(Meng et al., 2015) such as linear models and matrix factorization models. However, due to the
demanding communication requirements and the larger scale of many deep learning problems, these
libraries have not been extended to include deep networks.
Various authors have studied the theory of averaging separate runs of SGD. In the bandwidth-limited
setting, Zinkevich et al. (2010) analyze a simple algorithm for convex optimization that is easily
implemented in the MapReduce framework and can tolerate high-latency communication between
machines. Zhang et al. (2015) define a parallelization scheme that penalizes divergences between
parallel workers, and they provide an analysis in the convex case. Zhang & Jordan (2015) pro-
pose a general abstraction for parallelizing stochastic optimization algorithms along with a Spark
implementation.
5 DISCUSSION
ACKNOWLEDGMENTS
We would like to thank Cyprien Noel, Andy Feng, Tomer Kaftan, Evan Sparks, and Shivaram
Venkataraman for valuable advice. This research is supported in part by NSF grant number DGE-
1106400. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, DOE
Award SN10040 DE-SC0012463, and DARPA XData Award FA8750-12-2-0331, and gifts from
Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Adatao,
Adobe, Apple, Blue Goji, Bosch, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Fujitsu,
Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger,
Splunk, Virdata and VMware.
REFERENCES
Abadi, Martín, Agarwal, Ashish, Barham, Paul, et al. TensorFlow: Large-scale machine learning on
heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from
tensorflow.org.
Abuzaid, Firas, Hadjis, Stefan, Zhang, Ce, and Ré, Christopher. Caffe con Troll: Shallow ideas to
speed up deep learning. arXiv preprint arXiv:1504.04343, 2015.
Armbrust, Michael, Xin, Reynold S, Lian, Cheng, Huai, Yin, Liu, Davies, Bradley, Joseph K, Meng,
Xiangrui, Kaftan, Tomer, Franklin, Michael J, Ghodsi, Ali, et al. Spark SQL: Relational data
processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on
Management of Data, pp. 1383–1394. ACM, 2015.
Chilimbi, Trishul, Suzue, Yutaka, Apacible, Johnson, and Kalyanaraman, Karthik. Project Adam:
Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on
Operating Systems Design and Implementation, pp. 571–582, 2014.
Coates, Adam, Huval, Brody, Wang, Tao, Wu, David, Catanzaro, Bryan, and Ng, Andrew. Deep learning with COTS HPC systems. In Proceedings of the 30th International Conference on Machine
Learning, pp. 1337–1345, 2013.
Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: simplified data processing on large clusters.
Communications of the ACM, 51(1):107–113, 2008.
Dean, Jeffrey, Corrado, Greg, Monga, Rajat, Chen, Kai, Devin, Matthieu, Mao, Mark, Ranzato,
Marc’Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, Le, Quoc V., and Ng, Andrew Y. Large
scale distributed deep networks. In Advances in Neural Information Processing Systems, pp.
1223–1231, 2012.
Gonzalez, Joseph E, Xin, Reynold S, Dave, Ankur, Crankshaw, Daniel, Franklin, Michael J, and
Stoica, Ion. GraphX: Graph processing in a distributed dataflow framework. In Proceedings of
OSDI, pp. 599–613, 2014.
Ho, Qirong, Cipar, James, Cui, Henggang, Lee, Seunghak, Kim, Jin Kyu, Gibbons, Phillip B, Gib-
son, Garth A, Ganger, Greg, and Xing, Eric P. More effective distributed ML via a stale syn-
chronous parallel parameter server. In Advances in Neural Information Processing Systems, pp.
1223–1231, 2013.
Iandola, Forrest N, Ashraf, Khalid, Moskewicz, Mattthew W, and Keutzer, Kurt. FireCaffe:
near-linear acceleration of deep neural network training on compute clusters. arXiv preprint
arXiv:1511.00175, 2015.
Isard, Michael, Budiu, Mihai, Yu, Yuan, Birrell, Andrew, and Fetterly, Dennis. Dryad: Distributed
data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOP-
S/EuroSys European Conference on Computer Systems, pp. 59–72, 2007.
Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross,
Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature em-
bedding. In Proceedings of the ACM International Conference on Multimedia, pp. 675–678.
ACM, 2014.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convo-
lutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105,
2012.
Li, Mu, Andersen, David G, Park, Jun Woo, Smola, Alexander J, Ahmed, Amr, Josifovski, Vanja,
Long, James, Shekita, Eugene J, and Su, Bor-Yiing. Scaling distributed machine learning with the
parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation,
pp. 583–598, 2014.
Meng, Xiangrui, Bradley, Joseph, Yavuz, Burak, Sparks, Evan, Venkataraman, Shivaram, Liu,
Davies, Freeman, Jeremy, Tsai, DB, Amde, Manish, Owen, Sean, et al. MLlib: Machine learning
in Apache Spark. arXiv preprint arXiv:1505.06807, 2015.
Murray, Derek G, McSherry, Frank, Isaacs, Rebecca, Isard, Michael, Barham, Paul, and Abadi,
Martín. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium
on Operating Systems Principles, pp. 439–455. ACM, 2013.
Noel, Cyprien, Shi, Jun, and Feng, Andy. Large scale distributed deep learning on Hadoop
clusters, 2015. URL http://yahoohadoop.tumblr.com/post/129872361846/
large-scale-distributed-deep-learning-on-hadoop.
Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang,
Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei,
Li. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer
Vision, pp. 1–42, 2015.
Sparks, Evan R., Venkataraman, Shivaram, Kaftan, Tomer, Franklin, Michael, and Recht, Benjamin.
KeystoneML: End-to-end machine learning pipelines at scale. 2015.
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir,
Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions.
In Computer Vision and Pattern Recognition, 2015.
Zaharia, Matei, Chowdhury, Mosharaf, Franklin, Michael J, Shenker, Scott, and Stoica, Ion. Spark:
cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics
in cloud computing, volume 10, pp. 10, 2010.
Zaharia, Matei, Das, Tathagata, Li, Haoyuan, Hunter, Timothy, Shenker, Scott, and Stoica, Ion.
Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the Twenty-
Fourth ACM Symposium on Operating Systems Principles, pp. 423–438. ACM, 2013.
Zhang, Sixin, Choromanska, Anna E, and LeCun, Yann. Deep learning with elastic averaging SGD.
In Advances in Neural Information Processing Systems, pp. 685–693, 2015.
Zhang, Yuchen and Jordan, Michael I. Splash: User-friendly programming interface for parallelizing
stochastic algorithms. arXiv preprint arXiv:1506.07552, 2015.
Zinkevich, Martin, Weimer, Markus, Li, Lihong, and Smola, Alex J. Parallelized stochastic gradient
descent. In Advances in Neural Information Processing Systems, pp. 2595–2603, 2010.