BigDL: A Distributed Deep Learning Framework for Big Data

ABSTRACT
This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in the industry for building deep learning applications on production big data platforms. It allows deep learning applications to run on the Apache Hadoop/Spark cluster so as to directly process the production data, and to be deployed and managed as part of the end-to-end data analysis pipeline. Unlike existing deep learning frameworks, BigDL implements distributed, data-parallel training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. We also share real-world experience and "war stories" of users that have adopted BigDL to address their challenges (i.e., how to easily build end-to-end data analysis and deep learning pipelines for their production data).

CCS CONCEPTS
• Theory of computation → Distributed algorithms; • Computing methodologies → Neural networks.

KEYWORDS
distributed deep learning, big data, Apache Spark, end-to-end data pipeline

ACM Reference Format:
Jason (Jinquan) Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry (Li) Zhang, Yan Wan, Zhichao Li, Jiao Wang, Shengsheng Huang, Zhongyuan Wu, Yang Wang, Yuhao Yang, Bowen She, Dongjie Shi, Qi Lu, Kai Huang, and Guoqiong Song. 2019. BigDL: A Distributed Deep Learning Framework for Big Data. In SoCC '19: ACM Symposium on Cloud Computing, November 20–23, 2019, Santa Cruz, CA. ACM, New York, NY, USA, 11 pages. DOI: 10.1145/3357223.3362707

1 INTRODUCTION
Continued advancements in artificial intelligence applications have brought deep learning to the forefront of a new generation of data analytics development; as the requirements and usage models expand, new systems and architectures beyond existing deep learning frameworks (e.g., Caffe [1], Torch [2], TensorFlow [3], MXNet [4], Chainer [5], PyTorch [6], etc.) have inevitably emerged. In particular, there is increasing demand from organizations to apply deep learning technologies to their big data analysis pipelines.

To support these new requirements, we have developed BigDL, a distributed deep learning framework for big data platforms and workflows. It is implemented as a library on top of Apache Spark [7], and allows users to write their deep learning applications as standard Spark programs, running directly on existing big data (Apache Hadoop [8] or Spark) clusters. It supports an API similar to Torch and Keras [9] for constructing neural network models (as illustrated in Figure 1); it also supports both large-scale distributed training and inference, leveraging the scale-out architecture of the underlying Spark framework (which runs efficiently across hundreds or thousands of servers).

BigDL provides an expressive, "data-analytics integrated" deep learning programming model; within a single, unified data analysis pipeline, users can efficiently process very large datasets using Spark APIs (e.g., RDD [10], DataFrame [11], Spark SQL, ML pipeline, etc.), feed the distributed dataset to the neural network model, and perform distributed training or inference on top of Spark. Contrary to the conventional wisdom of the machine learning community (that fine-grained data access and in-place updates are critical for efficient distributed training [3]), BigDL provides large-scale, data-parallel training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. By unifying the execution model of neural network models and big data analytics, BigDL allows new deep learning algorithms to be seamlessly integrated into production data pipelines, which can then run as standard Spark jobs on existing big data clusters.
Figure 1: The end-to-end text classification pipeline (including data loading, processing, training, prediction, etc.) on Spark
and BigDL
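Figure 1 itself is not reproduced here; as a rough illustration of the kind of single-program pipeline it depicts, the following minimal sketch assumes the BigDL 0.x Python API on PySpark. The input path and the featurize function are hypothetical placeholders (not part of BigDL), and the sketch does not claim to match the figure's exact code.

# A minimal sketch of the "data-analytics integrated" programming model,
# assuming the BigDL 0.x Python API; `featurize` and the input path are
# hypothetical placeholders.
from pyspark import SparkContext
import numpy as np

from bigdl.util.common import init_engine, create_spark_conf, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

sc = SparkContext(conf=create_spark_conf())
init_engine()  # initialize BigDL on the Spark cluster

# 1. Load and process the raw text with standard Spark transformations.
def featurize(line):                                  # hypothetical text featurizer
    label, text = line.split("\t", 1)
    return np.random.rand(100), float(label) + 1      # placeholder features; 1-based label

raw = sc.textFile("hdfs://.../news/*.txt")
train_rdd = raw.map(featurize) \
               .map(lambda t: Sample.from_ndarray(t[0], t[1]))  # RDD of Samples

# 2. Define the neural network with a Torch/Keras-like API.
model = Sequential().add(Linear(100, 64)).add(ReLU()) \
                    .add(Linear(64, 20)).add(LogSoftMax())

# 3. Run distributed training directly on the Spark cluster.
optimizer = Optimizer(model=model, training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(2), batch_size=256)
trained_model = optimizer.optimize()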
data, etc.). Therefore, it is highly inefficient to run these workloads on separate big data and deep learning systems (e.g., processing data on a Spark cluster, and then exporting the processed data to a separate TensorFlow cluster for training/inference), in terms of not only data transfer, but also development, debugging, deployment and operation productivity.

One way to address the above challenge is to adopt a "connector approach" (e.g., TFX [15], CaffeOnSpark [16], TensorFlowOnSpark [17], SageMaker [18], etc.), which develops proper interfaces to connect the different data processing and deep learning components using an integrated workflow (and possibly on a shared cluster). However, the adaptation between different frameworks can impose very large overheads in practice (e.g., inter-process communication, data serialization and persistence, etc.). More importantly, this approach suffers from impedance mismatches [19] that arise from crossing boundaries between heterogeneous components. For instance, many of these systems (such as TensorFlowOnSpark) first use big data (e.g., Spark) tasks to allocate resources (e.g., Spark worker nodes), and then run deep learning (e.g., TensorFlow) tasks on the allocated resources. However, big data and deep learning systems have very different distributed execution models: big data tasks are embarrassingly parallel and independent of each other, while deep learning tasks need to coordinate with and depend on each other. For instance, when a Spark worker fails, the Spark system simply relaunches the worker (which in turn re-runs the TensorFlow task); this, however, is incompatible with the TensorFlow execution model and can cause the entire workflow to block indefinitely.

The Big Data community has also started to provide better support for the "connector approach". For instance, the barrier execution mode introduced by Project Hydrogen [20] provides gang scheduling [21] support in Spark, so as to overcome the errors caused by the different execution models of Spark and existing deep learning frameworks (as described in the preceding paragraph). On the other hand, this does not eliminate the difference between the two execution models, which can still lead to lower efficiency (e.g., it is unclear how to apply delay scheduling [22] to gang scheduling in Spark, resulting in poorer data locality). In addition, it does not address other impedance mismatches, such as the different parallelism behaviors of data processing and model computation (e.g., see Section 5.1).

BigDL has taken a different approach that directly implements the distributed deep learning support in the big data system (namely, Apache Spark). Consequently, one can easily build end-to-end, "data-analytics integrated" deep learning pipelines (under a unified programming paradigm, as illustrated in Figure 1), which can then run as standard Spark jobs to apply large-scale data processing and deep learning training/inference to production datasets within a single framework. This completely eliminates the impedance mismatch problems, and greatly improves the efficiency of development and operations of deep learning applications for big data.

3 BIGDL EXECUTION MODEL
This section describes in detail how BigDL supports large-scale, distributed training on top of Apache Spark. While it has adopted standard practices (such as data-parallel training [23], parameter server and AllReduce [3] [24] [25] [26] [27]) for scalable training, the key novelty of BigDL is how to efficiently implement these functionalities on the functional, coarse-grained compute model of Spark.

The conventional wisdom of the machine learning community is that fine-grained data access and in-place data mutation are critical to support highly efficient parameter server, AllReduce and distributed training [3]. On the other hand, big data systems (such as Spark) usually adopt a very different, functional compute model, where datasets are immutable and can only be transformed into new datasets without side effects (i.e., copy-on-write); in addition, the transformations are coarse-grained operations (i.e., applying the same operation to all data items at once).

Figure 2: A Spark job consists of many Spark tasks; the driver node is responsible for scheduling and dispatching the tasks to worker nodes, which run the actual Spark tasks.

Algorithm 1 Data-parallel training in BigDL
1: for i = 1 to M do
2:   // "model forward-backward" job
3:   for each task in the Spark job do
4:     read the latest weights;
5:     get a random batch of data from the local Sample partition;
6:     compute local gradients (forward-backward on the local model replica);
7:   end for
8:   // "parameter synchronization" job
9:   aggregate (sum) all the gradients;
10:  update the weights per the specified optimization method;
11: end for

BigDL is implemented as a standard library on Spark and has adopted this functional compute model; nevertheless, it still provides an efficient "parameter server" style architecture for distributed training (by implementing an AllReduce-like operation directly using existing primitives in Spark).

3.1 Spark execution model
Similar to other Big Data systems (such as MapReduce [28] and Dryad [29]), a Spark cluster consists of a single driver node and multiple worker nodes, as shown in Figure 2.
Figure 3: The "model forward-backward" Spark job, which computes the local gradients for each model replica in parallel.
The driver is responsible for coordinating tasks in a Spark job (e.g., task scheduling and dispatching), while the workers are responsible for the actual computation. To automatically parallelize the data processing across the cluster in a fault-tolerant fashion, Spark provides a data-parallel, functional compute model. In a Spark job, data are represented as a Resilient Distributed Dataset (RDD) [10], an immutable collection of records partitioned across the cluster that can only be transformed to derive new RDDs (i.e., copy-on-write) through functional operators like map, filter and reduce (e.g., see lines 4–6 in Figure 1); in addition, these operations are both data-parallel (i.e., applied to individual data partitions in parallel by different Spark tasks) and coarse-grained (i.e., applying the same operation to all data items at once).
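As a concrete illustration of this compute model (plain PySpark, independent of BigDL), each transformation below derives a new immutable RDD and applies the same function to every record of every partition in parallel; the input path is a placeholder.

# Plain PySpark (no BigDL): every transformation derives a new, immutable RDD
# and applies the same function to all records in all partitions in parallel.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-example")
lines = sc.textFile("hdfs://.../ratings.csv")           # immutable RDD of text records
ratings = lines.map(lambda l: float(l.split(",")[2]))   # coarse-grained map -> new RDD
good = ratings.filter(lambda r: r >= 4.0)               # another derived RDD (copy-on-write)
total = good.reduce(lambda a, b: a + b)                 # action: triggers the Spark job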
3.2 Data-parallel training in BigDL
Built on top of the data-parallel, functional compute model of Spark, BigDL provides synchronous data-parallel training to train a deep neural network model across the cluster, which is shown to achieve better scalability and efficiency (in terms of time-to-quality) compared to asynchronous training [30]. Specifically, the distributed training in BigDL is implemented as an iterative process, as illustrated in Algorithm 1; each iteration runs a couple of Spark jobs to first compute the gradients using the current mini-batch (by a "model forward-backward" job), and then make a single update to the parameters of the neural network model (by a "parameter synchronization" job).

The training data in BigDL are represented as an RDD of Samples (see line 6 in Figure 1), which are automatically partitioned across the Spark cluster. In addition, to implement the data-parallel training, BigDL also constructs an RDD of models, each of which is a replica of the original neural network model. Before the training, both the model and Sample RDDs are cached in memory, and co-partitioned and co-located across the cluster, as shown in Figure 3; consequently, in each iteration of the model training, a single "model forward-backward" Spark job can simply apply the functional zip operator to the co-located partitions of the two RDDs (with no extra cost), and compute the local gradients in parallel for each model replica (using a small batch of data in the co-located Sample partition), as illustrated in Figure 3.

BigDL does not support model parallelism (i.e., no distribution of the model across different workers). This is not a limitation in practice, as BigDL runs on Intel Xeon CPU servers, which usually have large (100s of GB) memory and can easily hold very large models.
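The following sketch only approximates this co-located "model forward-backward" pattern in PySpark (BigDL's actual implementation is written in Scala and applies the zip operator directly to the co-partitioned RDDs); sample_rdd is assumed to be the cached RDD of Samples, and build_model / forward_backward are hypothetical helpers.

# Illustrative PySpark approximation of the co-located "model forward-backward"
# job (not BigDL's internal code). `build_model` and `forward_backward` are
# hypothetical helpers; `sample_rdd` is the cached RDD of Samples.
import random

num_parts = sample_rdd.getNumPartitions()

# One model replica per partition, co-partitioned with the cached Sample RDD.
model_rdd = sc.parallelize([build_model() for _ in range(num_parts)], num_parts).cache()

# Collapse each Sample partition into one list so the two RDDs zip 1-to-1 per
# partition; no shuffle is involved because the partitions are co-located.
batches_rdd = sample_rdd.glom().cache()

def local_gradients(replica_and_samples):
    replica, samples = replica_and_samples
    batch = random.sample(samples, min(32, len(samples)))   # a small local mini-batch
    return forward_backward(replica, batch)                 # gradients for this replica

grads_rdd = model_rdd.zip(batches_rdd).map(local_gradients)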
3.3 Parameter synchronization in BigDL
Parameter synchronization is a performance-critical operation for data-parallel distributed model training (in terms of speed and scalability). To support efficient parameter synchronization, existing deep learning frameworks usually implement parameter server or AllReduce using operations like fine-grained data access and in-place data mutation. Unfortunately, these operations are not supported by the functional compute model of big data systems (such as Spark).

Algorithm 2 "Parameter synchronization" job
1: for each task n in the "parameter synchronization" job do
2:   shuffle the nth partition of all gradients to this task;
3:   aggregate (sum) these gradients;
4:   update the nth partition of the weights;
5:   broadcast the nth partition of the updated weights;
6: end for

BigDL has taken a completely different approach that directly implements an efficient AllReduce-like operation using existing primitives in Spark (e.g., shuffle, broadcast, in-memory cache, etc.), so as to mimic the functionality of a parameter server architecture (as illustrated in Figure 4).
• A Spark job has N tasks, each of which is assigned a unique Id ranging from 1 to N in BigDL. After each task in the "model forward-backward" job computes the local gradients (as described in Section 3.2 and illustrated in Figure 3), it evenly divides the local gradients into N partitions, as shown in Figure 4.
• Next, another "parameter synchronization" job is launched; each task n of this job is responsible for managing the nth partition of the parameters (as shown in Algorithm 2), just like a parameter server.
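The sketch below mimics Algorithm 2 with ordinary Spark primitives (a shuffle via reduceByKey plus a broadcast); it is illustrative only and not BigDL's internal implementation. It assumes each element of grads_rdd (from the previous sketch) is a flat NumPy array, and sgd_update is a hypothetical optimizer step applied to one weight slice.

# Illustrative PySpark sketch of Algorithm 2 (not BigDL's internal code):
# each local gradient is sliced into N chunks, a shuffle routes chunk n of
# every gradient to reduce task n for aggregation, and the updated weight
# slices are then re-broadcast. `sgd_update` is a hypothetical optimizer step.
import numpy as np

N = grads_rdd.getNumPartitions()                  # one parameter slice per task

def slice_gradient(grad):
    # key each slice by its slice id so the shuffle sends slice n to task n
    return list(enumerate(np.array_split(grad, N)))

updated_slices = (grads_rdd
                  .flatMap(slice_gradient)
                  .reduceByKey(lambda a, b: a + b, numPartitions=N)   # aggregate (sum)
                  .map(lambda kv: (kv[0], sgd_update(kv[0], kv[1])))  # update slice kv[0]
                  .collect())

# re-broadcast the updated weights so that every task of the next
# "model forward-backward" job reads the latest values
new_weights = sc.broadcast(np.concatenate([w for _, w in sorted(updated_slices)]))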
Figure 4: Parameter synchronization in BigDL. Each local gradient (computed by a task in the “model forward-backward” job)
is evenly divided into N partitions; then each task n in the “parameter synchronization” job aggregates these local gradients
and updates the weights for the nth partition.
automatically and efficiently address failures (e.g., cluster scale-down, task preemption, random bugs in the code, etc.) in a timely fashion.

While AllReduce has been implemented in almost all existing deep learning frameworks, the implementation in BigDL is very different. In particular, existing deep learning frameworks usually implement the AllReduce operation using MPI-like primitives; as a result, they often create long-running task replicas that coordinate among themselves with no central control. On the other hand, BigDL has adopted logically centralized control for distributed training [33]; that is, a single driver program coordinates the distributed training (as illustrated in Algorithm 1). The driver program first launches the "model forward-backward" job to compute the local gradients, and then launches the "parameter synchronization" job to update the weights. The dependence between the two jobs is explicitly managed by the driver program, and each individual task in the two jobs is completely stateless and non-blocking once launched by the driver.
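A minimal sketch of this logically centralized control, mirroring Algorithm 1 (not BigDL's actual driver code): the single driver program launches the two Spark jobs in each iteration, and the tasks themselves hold no state between jobs. The wrapper functions and RDDs reuse the hypothetical names from the earlier sketches.

# A minimal sketch mirroring Algorithm 1 (not BigDL's actual driver code).
# run_forward_backward_job / run_parameter_sync_job are hypothetical wrappers
# around the two Spark jobs sketched earlier; both are launched by this
# single driver program, and their dependence is managed here explicitly.
weights = sc.broadcast(initial_weights)          # initial_weights: flat NumPy array
for i in range(num_iterations):
    # Job 1: "model forward-backward" -- every task reads the latest weights
    # and computes local gradients on its co-located Sample partition.
    grads_rdd = run_forward_backward_job(model_rdd, batches_rdd, weights)

    # Job 2: "parameter synchronization" -- aggregate the gradients, make a
    # single update to the weights, and re-broadcast them for the next iteration.
    weights = run_parameter_sync_job(grads_rdd, weights)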
4 EVALUATION
This section evaluates the computing performance and scalability of neural network training in BigDL. In addition, while we do not report inference performance results in this section, Section 5.1 shows the comparison of a real-world object detection inference pipeline running on BigDL vs. Caffe (as reported by JD.com, the BigDL inference pipeline running on 24 Intel Xeon servers is 3.83x faster than Caffe running on 5 servers and 20 GPU cards).

4.1 Experiments
Two categories of neural network models are used in this section to evaluate the performance and scalability of BigDL, namely, neural collaborative filtering (NCF) and convolutional neural networks (CNNs), which are representative of the workloads that BigDL users run on their production Big Data platforms.

Neural Collaborative Filtering (NCF) [34] is one of the most commonly used neural network models for recommendation, and has also been included in MLPerf [35], a widely used benchmark suite for measuring the training and inference performance of machine learning hardware, software, and services. In our experiments, we compare the training performance of BigDL (running on an Intel Xeon server) vs. PyTorch (running on a GPU).

In addition, deep convolutional neural networks (CNNs) have achieved human-level accuracy and are widely used for many computer vision tasks (such as image classification and object detection). In our experiments, we study the scalability and efficiency of training Inception-v1 [36] on the ImageNet dataset [37] in BigDL with various numbers of Intel Xeon servers and Spark tasks; the results for other deep convolutional models, such as Inception-v3 [38] and ResNet-50 [39], are similar. We do not include results for RNN (recurrent neural network) training in this section, because it actually has better scalability compared to CNN training. This is because RNN computation is much slower than CNN, and therefore the parameter synchronization overhead (as a fraction of model compute time) is also much lower.

4.2 Computing Performance
To study the computing performance of BigDL, we compare the training speed of the NCF model using BigDL and PyTorch. MLPerf has provided a reference implementation of the NCF program [40] based on PyTorch 0.4, which trains a movie recommender using the MovieLens 20 Million dataset (ml-20m) [41], a widely used benchmark dataset with 20 million ratings and 465,000 tags applied to 27,000 movies by 138,000 users. It also provides the reference training speed of the PyTorch implementation (to achieve the target accuracy goal) on a single Nvidia P100 GPU.

We have implemented the same NCF program using BigDL 0.7.0 and Spark 2.1.0 [42]. We then trained the program on a dual-socket Intel Skylake 8180 2.5GHz server (with 56 cores in total and 384GB memory), and it took 29.8 minutes to converge and achieve the same accuracy goal.

Figure 5: The training performance of NCF using the BigDL implementation is 1.6x faster than the reference PyTorch implementation, as reported by MLPerf [43].

As reported by MLPerf, the training performance of NCF using the BigDL implementation is 1.6x faster than the reference PyTorch implementation [43] (as shown in Figure 5). While this only compares the training performance of BigDL on a single CPU server to PyTorch on a single GPU, it shows that BigDL provides efficient implementations for neural network model computation (forward and backward). We will study the scalability and efficiency of the distributed training in BigDL in Sections 4.3 and 4.4.

4.3 Scalability of distributed training
In the machine learning community, it is commonly believed that fine-grained data access and in-place data mutation are critical for efficient distributed training, and that mechanisms like Spark's RDDs would impose significant overheads [3]. In this section, we show that BigDL provides highly efficient and scalable training, despite being built on top of the coarse-grained functional compute model and immutable RDDs of Spark.

The scalability of distributed training in BigDL is determined by the efficiency (or overheads) of its parameter synchronization. We first study the parameter synchronization overheads in BigDL by running ImageNet Inception-v1 model training using BigDL on various numbers of Xeon servers (dual-socket Intel Broadwell
Figure 9: End-to-end object detection and image feature extraction pipeline (using SSD and DeepBit models) on top of Spark
and BigDL [48].
Figure 11: End-to-end precipitation nowcasting workflow (using sequence-to-sequence models) on Spark and BigDL [45].
Figure 12: Predicting precipitation patterns for the next hour (i.e., a sequence of images for the future time steps of the next
hour) on Spark and BigDL [45].
5.2 Precipitation nowcasting using Seq2Seq models
Cray has built a precipitation nowcasting (predicting short-term precipitation) application using a Seq2Seq [51] model (with a stacked convolutional LSTM network [52] as the encoder, and another stacked convolutional LSTM network as the decoder); the end-to-end pipeline runs on Spark and BigDL [45], including data preparation, model training and inference (as illustrated in Figure 11).
• The application first reads over a terabyte of raw radar scan data into Spark (as an RDD of radar images), and then converts it into an RDD of NumPy ndarrays (see the sketch after this list).
• It then trains a sequence-to-sequence model, using a sequence of images leading up to the current time as the input, and a sequence of predicted images in the future as the output.
• After the model is trained, it can be used to predict, say, the precipitation patterns of the next hour (i.e., a sequence of images for the future time steps), as illustrated in Figure 12.
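The Cray pipeline itself is not public code; the rough sketch below only illustrates the data-preparation step described in the list above, assuming PySpark with NumPy and two hypothetical helpers (decode_radar_scan and make_training_windows). The input path is a placeholder.

# A rough sketch of the data-preparation step (assumptions: PySpark + NumPy;
# `decode_radar_scan` and `make_training_windows` are hypothetical helpers,
# not the actual Cray code).
import numpy as np

# 1. Read the raw radar scans into Spark and decode each one into an ndarray.
scans = sc.binaryFiles("hdfs://.../radar/*.scan")                  # (path, bytes)
images = scans.map(lambda kv: (kv[0], decode_radar_scan(kv[1])))   # (path, 2-D ndarray)

# 2. Order the scans by time (assumed here to be encoded in the file name) and
#    group them into (input sequence, target sequence) pairs for the Seq2Seq
#    model: the previous k images are the input, the next k images the target.
k = 6
ordered = images.sortByKey().values()
training_pairs = make_training_windows(ordered, k)   # RDD of (input ndarrays, target ndarrays)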
Cray engineers have previously implemented the application using two separate workflows: running data processing on a highly distributed Spark cluster, and deep learning training on another GPU cluster running TensorFlow. It turns out that this approach not only brings high data movement overheads, but also greatly hurts the development productivity due to the fragmented workflow. As a result, Cray engineers chose to implement the solution using a single unified data analysis and deep learning pipeline on Spark and BigDL, which greatly improves the efficiency of development and deployment.

5.3 Real-time streaming speech classification
GigaSpaces has built a speech classification application for efficient call center management [53], which automatically routes client calls to the corresponding support specialists in real time. The end-to-end workflow is implemented using BigDL with Apache Kafka [54] and Spark Streaming [55] (as illustrated in Figure 13), so as to provide distributed real-time streaming model inference.
• When a customer calls the call center, his or her speech is first processed on the fly by a speech recognition unit, and the result is stored in Kafka.
• A Spark Streaming job then reads the speech recognition results from Kafka and classifies each call using the BigDL model in real time.
• The classification result is in turn used by a routing system to redirect the call to the proper support specialist.

One of the key challenges for GigaSpaces engineers in implementing the end-to-end workflow is how to efficiently integrate the new neural network models into the real-time stream processing pipeline, and how to seamlessly scale the streaming applications from a handful of machines to thousands of nodes.
Figure 13: The end-to-end workflow of real-time streaming speech classification on Kafka, Spark Streaming and BigDL [53].
BigDL allows neural network models to be directly applied in the standard distributed streaming architecture for Big Data (using Apache Kafka and Spark Streaming), which can then efficiently scale out to a large number of nodes in a transparent fashion. As a result, this greatly improves the developer productivity and deployment efficiency of the end-to-end streaming workflow.
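A minimal sketch of this streaming pattern, assuming Spark Streaming's Kafka direct-stream API and the BigDL 0.x Python API; the topic, broker address, model path, and the to_sample_rdd / route_call helpers are hypothetical and not taken from the GigaSpaces deployment.

# A minimal sketch of distributed streaming inference with Kafka, Spark
# Streaming and a pre-trained BigDL model (hypothetical topic/broker/path
# names and helper functions).
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from bigdl.nn.layer import Model

ssc = StreamingContext(sc, batchDuration=2)                      # 2-second micro-batches
model = Model.loadModel("hdfs://.../speech_classifier.model")    # pre-trained BigDL model

stream = KafkaUtils.createDirectStream(
    ssc, ["speech-to-text"], {"metadata.broker.list": "broker1:9092"})

def classify(rdd):
    if not rdd.isEmpty():
        samples = to_sample_rdd(rdd.values())   # transcripts -> RDD of BigDL Samples
        predictions = model.predict(samples)    # distributed inference on the cluster
        predictions.foreach(route_call)         # hand each result to the routing system

stream.foreachRDD(classify)
ssc.start()
ssc.awaitTermination()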
6 RELATED WORK
Existing deep learning frameworks (such as TensorFlow, MXNet, Petuum, ChainerMN, etc.) typically provide efficient parameter server and/or AllReduce implementations (using fine-grained data access and in-place data mutation) for distributed training. In contrast, BigDL provides distributed training support directly on top of the functional compute model of big data systems (with copy-on-write and coarse-grained operations), which is completely different from the implementations in existing deep learning frameworks. This provides a viable design alternative for distributed model training by adopting the state of practice of big data systems, and makes it easy to handle failures, resource changes, task preemptions, etc., in a more timely and fine-grained fashion.

As discussed in Section 2, to address the challenge of integrating deep learning into real-world data pipelines, there have been many efforts in the industry that adopt a "connector approach" (e.g., TFX, CaffeOnSpark, TensorFlowOnSpark, SageMaker, etc.). Unfortunately, these frameworks can incur very large overheads in practice due to the adaptation layer between different frameworks; more importantly, they often suffer from impedance mismatches that arise from crossing boundaries between heterogeneous components. While efforts in the Big Data community (such as Project Hydrogen in Spark) attempt to overcome some of the issues brought by the "connector approach", they still do not address the fundamental "impedance mismatch" problem (as discussed in Section 2). By unifying the distributed execution model of deep neural network models and big data analysis, BigDL provides a single unified data pipeline for both deep learning and big data analysis, which eliminates the adaptation overheads and impedance mismatch.

7 SUMMARY
We have described BigDL, including its distributed execution model, computation performance, training scalability, and real-world use cases. It allows users to build deep learning applications for big data using a single unified data pipeline; the entire pipeline can directly run on top of existing big data systems in a distributed fashion. Unlike existing deep learning frameworks, it provides efficient and scalable distributed training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. BigDL is a work in progress, but our initial experience is encouraging. Since its initial open source release on Dec 30, 2016, it has received over 3100 stars on GitHub; and it has enabled many users (e.g., Mastercard, World Bank, Cray, Talroo, UCSF, JD, UnionPay, Telefonica, GigaSpaces, etc.) to build new analytics and deep learning applications for their production data pipelines.

REFERENCES
[1] Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. MM'14.
[2] Collobert, Ronan and Kavukcuoglu, Koray and Farabet, Clément. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop. (2011).
[3] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation. OSDI'16.
[4] Chen, Tianqi and Li, Mu and Li, Yutian and Lin, Min and Wang, Naiyan and Wang, Minjie and Xiao, Tianjun and Xu, Bing and Zhang, Chiyuan and Zhang, Zheng. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) at NIPS. (2015).
[5] Tokui, Seiya and Oono, Kenta and Hido, Shohei and Clayton, Justin. Chainer: A next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) at NIPS. (2015).
[6] Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and Antiga, Luca and Lerer, Adam. Automatic differentiation in PyTorch. NIPS 2017 Autodiff Workshop. (2017).
[7] Apache Spark. Apache Software Foundation. (2014) (https://spark.apache.org).
[8] Apache Hadoop. Apache Software Foundation. (2006) (https://hadoop.apache.org).
[9] Chollet, F. et al. Keras. (https://keras.io).
[10] Zaharia, Matei and Chowdhury, Mosharaf and Das, Tathagata and Dave, Ankur and Ma, Justin and McCauley, Murphy and Franklin, Michael J and Shenker, Scott and Stoica, Ion. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. NSDI'12.
[11] Armbrust, Michael and Xin, Reynold S and Lian, Cheng and Huai, Yin and Liu, Davies and Bradley, Joseph K and Meng, Xiangrui and Kaftan, Tomer and Franklin, Michael J and Ghodsi, Ali and others. Spark SQL: Relational data processing in Spark. In 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD'15.
[12] Russakovsky, Olga and Deng, Jia and Su, Hao and Krause, Jonathan and Satheesh, Sanjeev and Ma, Sean and Huang, Zhiheng and Karpathy, Andrej and Khosla, Aditya and Bernstein, Michael and others. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV). (2015).
[13] Rajpurkar, P and Zhang, J and Lopyrev, K and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP. (2016).
[14] Jawaheer, G and Szomszor, M and Kostkova, P. Comparison of implicit and explicit feedback from an online music recommendation service. In Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems. (2010) HetRec'10.
[15] Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., Koo, C. Y., Lew, L., Mewald, C., Modi, A. N., Polyzotis, N., Ramesh, S., Roy, S., Whang, S. E., Wicke, M., Wilkiewicz, J., Zhang, X., and Zinkevich, M. TFX: A TensorFlow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD'17.
[16] CaffeOnSpark. Yahoo. (2016) (https://github.com/yahoo/CaffeOnSpark).
[17] TensorFlowOnSpark. Yahoo. (2017) (https://github.com/yahoo/TensorFlowOnSpark).
[18] SageMaker. Amazon. (2017) (https://aws.amazon.com/sagemaker/).
[19] Lin, Jimmy and Ryaboy, Dmitriy. Scaling big data mining infrastructure: The Twitter experience. ACM SIGKDD Explorations Newsletter 14(2). (December 2012).
[20] Reynold Xin. "Project Hydrogen: Unifying state-of-the-art AI and big data in Apache Spark". Spark + AI Summit 2018.
[21] Gang scheduling. (https://en.wikipedia.org/wiki/Gang_scheduling/).
[22] Zaharia, Matei and Borthakur, Dhruba and Sen Sarma, Joydeep and Elmeleegy, Khaled and Shenker, Scott and Stoica, Ion. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems. EuroSys'10.
[23] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, Marc'aurelio, Senior, A., Tucker, P., Yang, K., Le, Q.V., Ng, A.Y. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems. NIPS'12.
[24] Li, M., Andersen, D.G., Park, J.W., Smola, A.J., Ahmed, A., Josifovski, V., Long, J., Shekita, E.J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation. OSDI'14.
[25] Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation. OSDI'14.
[26] Xing, E.P., Ho, Q., Dai, W., Kim, J.-K., Wei, J., Lee, S., Zheng, X., Xie, P., Kumar, A., and Yu, Y. Petuum: A new platform for distributed machine learning on big data. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD'15.
[27] Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., Hu, Z., Wei, J., Xie, P., and Xing, E.P. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In 2017 USENIX Annual Technical Conference (USENIX ATC 17). (2017).
[28] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation. OSDI. (2004).
[29] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007. EuroSys'07.
[30] Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. Revisiting distributed synchronous SGD. In International Conference on Learning Representations Workshop Track. (2016).
[31] Gibiansky, Andrew. "Bringing HPC techniques to deep learning". (http://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/).
[32] Akiba, T., Fukuda, K., and Suzuki, S. ChainerMN: Scalable distributed deep learning framework. In Proceedings of the Workshop on ML Systems at NIPS. (2017).
[33] Eric Liang, Richard Liaw, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica. RLlib: Abstractions for distributed reinforcement learning. International Conference on Machine Learning (ICML). (2018).
[34] He, Xiangnan and Liao, Lizi and Zhang, Hanwang and Nie, Liqiang and Hu, Xia and Chua, Tat-Seng. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee. (2017).
[35] MLPerf. (https://mlperf.org/).
[36] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR). (2015).
[37] Deng, J., Socher, R., Fei-Fei, L., Dong, W., Li, K., and Li, L.-J. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2009).
[38] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016).
[39] He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016).
[40] Reference NCF implementation using PyTorch in MLPerf. (https://github.com/mlperf/training/blob/master/recommendation/pytorch/README.md).
[41] Harper, F Maxwell and Konstan, Joseph A. "The MovieLens datasets: History and context". ACM Trans. Interact. Intell. Syst. 5(4):19. (2015).
[42] NCF implementation in BigDL. (https://github.com/mlperf/training_results_v0.5/tree/master/v0.5.0/intel/intel_ncf_submission).
[43] MLPerf 0.5 training results. (https://mlperf.org/training-results-0-5).
[44] Jason (Jinquan) Dai, and Ding Ding. Very large-scale distributed deep learning with BigDL. O'Reilly AI Conference, San Francisco. (2017).
[45] Alex Heye, et al. "Scalable deep learning with BigDL on the Urika-XC software suite". (https://www.cray.com/blog/scalable-deep-learning-bigdl-urika-xc-software-suite/).
[46] Shivaram Venkataraman, et al. "Accelerating deep learning training with BigDL and Drizzle on Apache Spark". (https://rise.cs.berkeley.edu/blog/accelerating-deep-learning-training-with-bigdl-and-drizzle-on-apache-spark/).
[47] Venkataraman, S., Panda, A., Ousterhout, K., Armbrust, M., Ghodsi, A., Franklin, M.J., Recht, B., and Stoica, I. Drizzle: Fast and adaptable stream processing at scale. In Proceedings of the 26th Symposium on Operating Systems Principles. SOSP'17.
[48] Jason (Jinquan) Dai, et al. Building large-scale image feature extraction with BigDL at JD.com. (https://software.intel.com/en-us/articles/building-large-scale-image-feature-extraction-with-bigdl-at-jdcom).
[49] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C.-Y., and Berg, A.C. SSD: Single shot multibox detector. In ECCV. (2016).
[50] Lin, K., Lu, J., Chen, C.-S., and Zhou, J. Learning compact binary descriptors with unsupervised deep neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016).
[51] Sutskever, I., Vinyals, O., and Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Vol. 2. NIPS'14.
[52] Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-k., and Woo, W.-c. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Vol. 1. NIPS'15.
[53] Rajiv Shah. GigaSpaces integrates InsightEdge platform with Intel's BigDL for scalable deep learning innovation. (https://www.gigaspaces.com/blog/gigaspaces-to-demo-with-intel-at-strata-data-conference-and-microsoft-ignite/).
[54] Apache Kafka. (https://kafka.apache.org/).
[55] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In The Twenty-Fourth ACM Symposium on Operating Systems Principles. (2013) SOSP'13.