
BigDL: A Distributed Deep Learning Framework for Big Data

Jason (Jinquan) Dai (Intel Corporation), Yiheng Wang∗ (Tencent Inc.), Xin Qiu (Intel Corporation), Ding Ding (Intel Corporation), Yao Zhang∗ (Sequoia Capital), Yanzhang Wang (Intel Corporation), Xianyan Jia∗ (Alibaba Group), Cherry (Li) Zhang (Intel Corporation), Yan Wan∗ (Alibaba Group), Zhichao Li (Intel Corporation), Jiao Wang (Intel Corporation), Shengsheng Huang (Intel Corporation), Zhongyuan Wu (Intel Corporation), Yang Wang (Intel Corporation), Yuhao Yang (Intel Corporation), Bowen She (Intel Corporation), Dongjie Shi (Intel Corporation), Qi Lu (Intel Corporation), Kai Huang (Intel Corporation), Guoqiong Song (Intel Corporation)

∗ Work was done when the author worked at Intel.

ABSTRACT
This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in the industry for building deep learning applications on production big data platforms. It allows deep learning applications to run on the Apache Hadoop/Spark cluster so as to directly process the production data, and as a part of the end-to-end data analysis pipeline for deployment and management. Unlike existing deep learning frameworks, BigDL implements distributed, data parallel training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. We also share real-world experience and "war stories" of users that have adopted BigDL to address their challenges (i.e., how to easily build end-to-end data analysis and deep learning pipelines for their production data).

CCS CONCEPTS
• Theory of computation → Distributed algorithms; • Computing methodologies → Neural networks.

KEYWORDS
distributed deep learning, big data, Apache Spark, end-to-end data pipeline

ACM Reference Format:
Jason (Jinquan) Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry (Li) Zhang, Yan Wan, Zhichao Li, Jiao Wang, Shengsheng Huang, Zhongyuan Wu, Yang Wang, Yuhao Yang, Bowen She, Dongjie Shi, Qi Lu, Kai Huang, and Guoqiong Song. 2019. BigDL: A Distributed Deep Learning Framework for Big Data. In SoCC '19: ACM Symposium on Cloud Computing, November 20-23, 2019, Santa Cruz, CA. ACM, New York, NY, USA, 11 pages. DOI: 10.1145/3357223.3362707

1 INTRODUCTION
Continued advancements in artificial intelligence applications have brought deep learning to the forefront of a new generation of data analytics development; as the requirements and usage models expand, new systems and architectures beyond existing deep learning frameworks (e.g., Caffe [1], Torch [2], TensorFlow [3], MXNet [4], Chainer [5], PyTorch [6], etc.) have inevitably emerged. In particular, there is increasing demand from organizations to apply deep learning technologies to their big data analysis pipelines.

To support these new requirements, we have developed BigDL, a distributed deep learning framework for big data platforms and workflows. It is implemented as a library on top of Apache Spark [7], and allows users to write their deep learning applications as standard Spark programs, running directly on existing big data (Apache Hadoop [8] or Spark) clusters. It supports an API similar to Torch and Keras [9] for constructing neural network models (as illustrated in Figure 1); it also supports both large-scale distributed training and inference, leveraging the scale-out architecture of the underlying Spark framework (which runs across hundreds or thousands of servers efficiently).

BigDL provides an expressive, "data-analytics integrated" deep learning programming model; within a single, unified data analysis pipeline, users can efficiently process very large datasets using Spark APIs (e.g., RDD [10], DataFrame [11], Spark SQL, ML pipeline, etc.), feed the distributed dataset to the neural network model, and perform distributed training or inference on top of Spark. Contrary to the conventional wisdom of the machine learning community (that fine-grained data access and in-place updates are critical for efficient distributed training [3]), BigDL provides large-scale, data parallel training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. By unifying the execution model of neural network models and big data analytics, BigDL allows new deep learning algorithms to be seamlessly integrated into production data pipelines, which can then be easily deployed, monitored and managed in a single unified big data platform.


Figure 1: The end-to-end text classification pipeline (including data loading, processing, training, prediction, etc.) on Spark
and BigDL
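To make this programming model concrete, the sketch below shows what such a single-program pipeline can look like in PySpark. It is a minimal illustration only: the input path, the bag-of-words featurizer and the model shape are invented for the example, and the BigDL class and argument names (create_spark_conf, init_engine, Sample, Sequential, Linear, Optimizer, SGD, MaxEpoch) follow our recollection of the BigDL 0.x Python API and should be checked against the BigDL documentation; it is not the exact code of Figure 1.

# Hedged sketch of an end-to-end text-classification pipeline in the spirit of Figure 1
# (hypothetical paths and hyper-parameters; BigDL 0.x Python API names are assumptions).
import numpy as np
from pyspark import SparkContext

from bigdl.util.common import create_spark_conf, init_engine, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, MaxEpoch, SGD

sc = SparkContext(conf=create_spark_conf().setAppName("text-classification"))
init_engine()                                          # initialize BigDL on the cluster

def featurize(line):
    # Toy featurizer: hash words of a "label<TAB>text" record into a bag-of-words vector
    label, text = line.split("\t", 1)
    vec = np.zeros(1000, dtype="float32")
    for word in text.lower().split():
        vec[hash(word) % 1000] += 1.0
    return Sample.from_ndarray(vec, np.array([float(label) + 1]))  # BigDL labels are 1-based

# Data processing and model training live in one Spark program
train_rdd = sc.textFile("hdfs:///data/news/train.txt").map(featurize)

model = (Sequential().add(Linear(1000, 200)).add(ReLU())
                     .add(Linear(200, 20)).add(LogSoftMax()))

optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(2),
                      batch_size=256)
trained_model = optimizer.optimize()                   # distributed training on Spark
predictions = trained_model.predict(train_rdd)         # distributed inference (an RDD)

The point of the sketch is that the RDD produced by ordinary Spark transformations is fed directly into distributed training, with no separate cluster or data export step.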

BigDL is developed as an open source project¹; over the past years, a variety of users in the industry (e.g., Mastercard, World Bank, Cray, Talroo, UCSF, JD, UnionPay, Telefonica, GigaSpaces, etc.) have built their data analytics and deep learning applications on top of BigDL for a wide range of workloads, such as transfer learning based image classification, object detection and feature extraction, sequence-to-sequence prediction for precipitation nowcasting, neural collaborative filtering for recommendations, etc. In this paper, we focus on the execution model of BigDL to support large-scale distributed training (a challenging system problem for deep learning frameworks), as well as empirical results of real-world deep learning applications built on top of BigDL. The main contributions of this paper are:

• It presents BigDL, a working system that has been used by many users in the industry for distributed deep learning on production big data systems.
• It describes the distributed execution model in BigDL (which adopts the state of practice of big data systems), and provides a viable design alternative for distributed model training (compared to existing deep learning frameworks).
• It shares real-world experience and "war stories" of users that have adopted BigDL to address their challenges (i.e., how to easily build end-to-end data analysis and deep learning pipelines for their production data).

¹ https://github.com/intel-analytics/BigDL

2 MOTIVATION
A lot of effort in the deep learning community has been focused on improving the accuracy and/or speed of standard deep learning benchmarks (such as ImageNet [12] or SQuAD [13]). For these benchmarks, the input datasets have already been curated and explicitly labelled, and it makes sense to run deep learning algorithms on specialized deep learning frameworks for the best computing efficiency. On the other hand, if the input data are dynamic and messy (e.g., live data streaming into the production data pipeline that requires complex processing), it makes more sense to adopt BigDL to build end-to-end, integrated data analytics and deep learning pipelines for production data.

As mentioned in Section 1, BigDL has been used by a variety of users in the industry to build deep learning applications on their production data platforms. The key motivation for adopting a unified data analytics and deep learning system like BigDL is to improve the ease of use (including development, deployment and operations) for applying deep learning in real-world data pipelines.

In the real world, it is critical to run deep learning applications directly where the data are stored, and as a part of the end-to-end data analysis pipelines. Applying deep learning to production big data is very different from the ImageNet [12] or SQuAD [13] problems; real-world big data are both dynamic and messy, and are possibly implicitly labeled (e.g., implicit feedback in recommendation applications [14]), which requires very complex data processing; furthermore, instead of running ETL (extract, transform and load) and data processing only once, a real-world data analytics pipeline is an iterative and recurrent process (e.g., back-and-forth development and debugging, incremental model updates with new production data, etc.).


Therefore, it is highly inefficient to run these workloads on separate big data and deep learning systems (e.g., processing data on a Spark cluster, and then exporting the processed data to a separate TensorFlow cluster for training/inference), in terms of not only data transfer, but also development, debugging, deployment and operations productivity.

One way to address the above challenge is to adopt a "connector approach" (e.g., TFX [15], CaffeOnSpark [16], TensorFlowOnSpark [17], SageMaker [18], etc.), which develops proper interfaces to connect the different data processing and deep learning components using an integrated workflow (and possibly on a shared cluster). However, the adaptation between different frameworks can impose very large overheads in practice (e.g., inter-process communication, data serialization and persistence, etc.). More importantly, this approach suffers from impedance mismatches [19] that arise from crossing boundaries between heterogeneous components. For instance, many of these systems (such as TensorFlowOnSpark) first use big data (e.g., Spark) tasks to allocate resources (e.g., Spark worker nodes), and then run deep learning (e.g., TensorFlow) tasks on the allocated resources. However, big data and deep learning systems have very different distributed execution models – big data tasks are embarrassingly parallel and independent of each other, while deep learning tasks need to coordinate with and depend on one another. For instance, when a Spark worker fails, the Spark system just relaunches the worker (which in turn re-runs the TensorFlow task); this, however, is incompatible with the TensorFlow execution model and can cause the entire workflow to block indefinitely.

The Big Data community has also started to provide better support for the "connector approach". For instance, the barrier execution mode introduced by Project Hydrogen [20] provides gang scheduling [21] support in Spark, so as to overcome the errors caused by the different execution models of Spark and existing deep learning frameworks (as described in the preceding paragraph). On the other hand, this does not eliminate the difference between the two execution models, which can still lead to lower efficiency (e.g., it is unclear how to apply delay scheduling [22] to gang scheduling in Spark, resulting in poorer data locality). In addition, it does not address other impedance mismatches such as the different parallelism behaviors of data processing and model computations (e.g., see Section 5.1).

BigDL has taken a different approach that directly implements the distributed deep learning support in the big data system (namely, Apache Spark). Consequently, one can easily build end-to-end, "data-analytics integrated" deep learning pipelines (under a unified programming paradigm, as illustrated in Figure 1), which can then run as standard Spark jobs to apply large-scale data processing and deep learning training/inference to production datasets within a single framework. This completely eliminates the impedance mismatch problems, and greatly improves the efficiency of development and operations of deep learning applications for big data.

3 BIGDL EXECUTION MODEL
This section describes in detail how BigDL supports large-scale, distributed training on top of Apache Spark. While it has adopted the standard practice (such as data parallel training [23], parameter server and AllReduce [3] [24] [25] [26] [27]) for scalable training, the key novelty of BigDL is how to efficiently implement these functionalities on the functional, coarse-grained compute model of Spark.

The conventional wisdom of the machine learning community is that fine-grained data access and in-place data mutation are critical to support highly efficient parameter server, AllReduce and distributed training [3]. On the other hand, big data systems (such as Spark) usually adopt a very different, functional compute model, where datasets are immutable and can only be transformed into new datasets without side effects (i.e., copy-on-write); in addition, the transformations are coarse-grained operations (i.e., applying the same operation to all data items at once).

Algorithm 1 Data-parallel training in BigDL
1: for i = 1 to M do
2:   // "model forward-backward" job
3:   for each task in the Spark job do
4:     read the latest weights;
5:     get a random batch of data from the local Sample partition;
6:     compute the local gradients (forward-backward on the local model replica);
7:   end for
8:   // "parameter synchronization" job
9:   aggregate (sum) all the gradients;
10:  update the weights per the specified optimization method;
11: end for

BigDL is implemented as a standard library on Spark and has adopted this functional compute model; nevertheless, it still provides an efficient "parameter server" style architecture for efficient distributed training (by implementing an AllReduce-like operation directly using existing primitives in Spark).

3.1 Spark execution model
Figure 2: A Spark job consists of many Spark tasks; the driver node is responsible for scheduling and dispatching the tasks to the worker nodes, which run the actual Spark tasks.

Similar to other Big Data systems (such as MapReduce [28] and Dryad [29]), a Spark cluster consists of a single driver node and multiple worker nodes, as shown in Figure 2. The driver is responsible for coordinating the tasks in a Spark job (e.g., task scheduling and dispatching), while the workers are responsible for the actual computation. To automatically parallelize the data processing across the cluster in a fault-tolerant fashion, Spark provides a data-parallel, functional compute model. In a Spark job, data are represented as a Resilient Distributed Dataset (RDD) [10], which is an immutable collection of records partitioned across the cluster, and can only be transformed to derive new RDDs (i.e., copy-on-write) through functional operators like map, filter and reduce (e.g., see lines 4-6 in Figure 1); in addition, these operations are both data-parallel (i.e., applied to individual data partitions in parallel by different Spark tasks) and coarse-grained (i.e., applying the same operation to all data items at once).
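As a concrete (if simplified) illustration of this compute model, the PySpark snippet below applies only coarse-grained, side-effect-free transformations to an RDD; the input values and parsing logic are invented for the example, and the snippet stands on its own rather than being part of BigDL.

# Coarse-grained, functional RDD operations: each transformation derives a new,
# immutable RDD (copy-on-write) and applies the same operation to all items in parallel.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-functional-model")

lines = sc.parallelize(["3.0,cat", "7.5,dog", "-1.0,cat", "2.5,bird"], numSlices=2)

records = lines.map(lambda s: s.split(","))               # applied to every item at once
positives = records.filter(lambda r: float(r[0]) > 0)     # derives yet another RDD
total = positives.map(lambda r: float(r[0])).reduce(lambda a, b: a + b)

print(total)          # 13.0
print(lines.count())  # 4 -- the original RDD is unchanged (immutable, no side effects)

No operation mutates an existing dataset in place, which is exactly the property that makes the parameter synchronization scheme of Section 3.3 non-trivial to implement.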


Figure 3: The "model forward-backward" Spark job, which computes the local gradients for each model replica in parallel.

3.2 Data-parallel training in BigDL
Built on top of the data-parallel, functional compute model of Spark, BigDL provides synchronous data-parallel training to train a deep neural network model across the cluster, which is shown to achieve better scalability and efficiency (in terms of time-to-quality) than asynchronous training [30]. Specifically, the distributed training in BigDL is implemented as an iterative process, as illustrated in Algorithm 1; each iteration runs a couple of Spark jobs to first compute the gradients using the current mini-batch (by a "model forward-backward" job), and then make a single update to the parameters of the neural network model (by a "parameter synchronization" job).

The training data in BigDL are represented as an RDD of Samples (see line 6 in Figure 1), which are automatically partitioned across the Spark cluster. In addition, to implement the data-parallel training, BigDL also constructs an RDD of models, each of which is a replica of the original neural network model. Before the training, both the model and Sample RDDs are cached in memory, and co-partitioned and co-located across the cluster, as shown in Figure 3; consequently, in each iteration of the model training, a single "model forward-backward" Spark job can simply apply the functional zip operator to the co-located partitions of the two RDDs (with no extra cost), and compute the local gradients in parallel for each model replica (using a small batch of data from the co-located Sample partition), as illustrated in Figure 3.
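To make the mechanics concrete, the sketch below mimics one "model forward-backward" job with plain PySpark and NumPy; it is a conceptual stand-in, not BigDL's implementation. BigDL zips the cached RDD of model replicas with the co-partitioned Sample RDD at the Scala level; since PySpark has no zipPartitions, the sketch emulates the co-located replica with a per-partition lookup, and the linear "model", its gradient formula and all sizes are invented for the example.

# Conceptual stand-in for the "model forward-backward" job (not BigDL's actual code):
# one local gradient is computed per partition, i.e., per model replica.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="forward-backward-sketch")
P = 4                                                      # partitions = model replicas

# RDD of Samples: (features, label) pairs for a toy linear model y = w.x (invented data)
samples = sc.parallelize(
    [(np.random.rand(10).astype("float32"), 1.0) for _ in range(4096)], P).cache()

# One weight vector per partition, standing in for the cached RDD of model replicas
replicas = sc.broadcast([np.zeros(10, dtype="float32") for _ in range(P)])

def local_gradient(pid, partition):
    # Forward-backward on one replica: mean squared-error gradient over a small batch
    w = replicas.value[pid]                                # this partition's model replica
    batch = [next(partition) for _ in range(32)]           # a small local mini-batch
    grad = np.zeros_like(w)
    for x, y in batch:
        grad += 2.0 * (np.dot(w, x) - y) * x               # d/dw of (w.x - y)^2
    yield grad / len(batch)

# One "model forward-backward" job: local gradients computed in parallel, one per replica
local_grads = samples.mapPartitionsWithIndex(local_gradient).collect()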
3; consequently, in each iteration of the model training, a single (as described in Section 3.2 and illustrated in Figure 3), it
“model forward-backward” Spark job can simply apply the func- evenly divides the local gradients into N partitions, as shown
tional zip operator to the co-located partitions of the two RDDs in Figure 4.
(with no extra cost), and compute the local gradients in parallel for • Next, another “parameter synchronization” job is launched;
each model replica (using a small batch of data in the co-located each task n of this job is responsible for managing the nt h
Sample partition), as illustrated in Figure 3. partition of the parameters (as shown in Algorithm 2), just


Figure 4: Parameter synchronization in BigDL. Each local gradient (computed by a task in the “model forward-backward” job)
is evenly divided into N partitions; then each task n in the “parameter synchronization” job aggregates these local gradients
and updates the weights for the nth partition.

• A Spark job has N tasks, each of which is assigned a unique id ranging from 1 to N in BigDL. After each task in the "model forward-backward" job computes the local gradients (as described in Section 3.2 and illustrated in Figure 3), it evenly divides the local gradients into N partitions, as shown in Figure 4.
• Next, another "parameter synchronization" job is launched; each task n of this job is responsible for managing the nth partition of the parameters (as shown in Algorithm 2), just like a parameter server does. Specifically, the nth partition of the local gradients (computed by the previous "model forward-backward" job) is first shuffled to task n, which aggregates these gradients and applies the updates to the nth partition of the weights, as illustrated in Figure 4.
• After that, each task n in the "parameter synchronization" job broadcasts the nth partition of the updated weights; consequently, the tasks in the "model forward-backward" job of the next iteration can read the latest value of all the weights before the next training step begins.
• The shuffle and task-side broadcast operations described above are implemented on top of the distributed in-memory storage in Spark: the relevant tasks simply store the local gradients and updated weights in the in-memory storage, which can then be read remotely by the Spark tasks with extremely low latency.
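Continuing the same toy example, the sketch below imitates the "parameter synchronization" job with ordinary Spark primitives: each local gradient is cut into N slices, a shuffle lets task n sum slice n, and the updated slices are then made available for the next iteration. For simplicity the updated slices are re-assembled on the driver and re-broadcast, whereas BigDL keeps the slices in Spark's distributed in-memory storage and broadcasts them task-side; the plain SGD update is likewise a stand-in for whatever optimization method is configured.

# Conceptual stand-in for the "parameter synchronization" job (not BigDL's actual code);
# it reuses 'sc' and 'local_grads' from the forward-backward sketch above.
import numpy as np

N = 4                                                      # number of parameter slices/tasks
lr = 0.1                                                   # learning rate (arbitrary)
weights = np.zeros(10, dtype="float32")                    # current weights of the toy model

grads_rdd = sc.parallelize(local_grads, N)                 # RDD of local gradients

def slice_gradient(grad):
    # Evenly divide one local gradient into N (slice_id, slice) pairs
    return [(n, s) for n, s in enumerate(np.array_split(grad, N))]

updated_slices = (grads_rdd
                  .flatMap(slice_gradient)                 # (n, slice) pairs from every task
                  .reduceByKey(lambda a, b: a + b)         # shuffle: task n sums slice n
                  .map(lambda kv: (kv[0],                  # SGD update for slice n
                                   np.array_split(weights, N)[kv[0]] - lr * kv[1]))
                  .collect())

weights = np.concatenate([s for _, s in sorted(updated_slices)])
new_weights = sc.broadcast(weights)                        # visible to the next iteration's job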
The implementation of AllReduce in BigDL has similar performance characteristics to the Ring AllReduce from Baidu Research [31]. As described in [31], the total amount of data transferred to and from every node is 2K(N-1)/N in Ring AllReduce (where N is the number of nodes and K is the total size of the parameters); similarly, in BigDL, the total amount of data transferred to and from every node is 2K. In addition, all the bandwidth of every node in the cluster is fully utilized in both BigDL and Ring AllReduce. As a result, BigDL can efficiently train large deep neural networks across large (e.g., hundreds of servers) clusters, as shown in Section 4.3.
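As a rough sanity check of these numbers (our own accounting, assuming one task per node, point-to-point slice transfers, and ignoring the constants of Spark's broadcast implementation), the per-node traffic of one BigDL iteration can be tallied as follows:

  shuffle:    out ≈ K (the local gradient, sent as N slices);  in ≈ N · (K/N) = K (every task's copy of slice n)
  broadcast:  out ≈ (N-1) · (K/N) ≈ K (slice n of the updated weights);  in ≈ K (the full updated weights)
  total:      ≈ 2K out and ≈ 2K in per node, comparable to Ring AllReduce's 2K(N-1)/N.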
3.4 Discussions
While BigDL has followed the standard practice (such as data parallel training and AllReduce operations) for scalable training, its implementation is very different from existing deep learning frameworks. By adopting the state of practice of big data systems (i.e., a coarse-grained functional compute model), BigDL provides a viable design alternative for distributed model training. This allows deep learning algorithms and big data analytics to be seamlessly integrated into a single unified data pipeline, and completely eliminates the impedance mismatch problem described in Section 2. Furthermore, this also makes it easy to handle failures, resource changes, task preemptions, etc., which are expected to be the norm rather than the exception in large-scale systems.

Existing distributed deep learning frameworks (e.g., TensorFlow, MXNet, Petuum [26], ChainerMN [32], etc.) have adopted an architecture where multiple long-running, stateful tasks interact with each other for model computation and parameter synchronization, usually in a blocking fashion to support synchronous distributed training. While this is optimized for constant communication among the tasks, it can only support coarse-grained failure recovery by completely starting over from a previous snapshot (e.g., one taken a couple of epochs earlier).

In contrast, BigDL runs a series of short-lived Spark jobs (e.g., two jobs per mini-batch as described earlier), and each task in the job is stateless, non-blocking, and completely independent of the others; as a result, BigDL tasks can simply run without gang scheduling. In addition, BigDL can efficiently support fine-grained failure recovery by just re-running the failed task (which then re-generates the associated partition of the local gradients or updated weights in the in-memory storage of Spark); this allows the framework to automatically and efficiently address failures (e.g., cluster scale-down, task preemption, random bugs in the code, etc.) in a timely fashion.


While AllReduce has been implemented in almost all existing deep learning frameworks, the implementation in BigDL is very different. In particular, existing deep learning frameworks usually implement the AllReduce operation using MPI-like primitives; as a result, they often create long-running task replicas that coordinate among themselves with no central control. On the other hand, BigDL has adopted logically centralized control for distributed training [33]; that is, a single driver program coordinates the distributed training (as illustrated in Algorithm 1). The driver program first launches the "model forward-backward" job to compute the local gradients, and then launches the "parameter synchronization" job to update the weights. The dependence between the two jobs is explicitly managed by the driver program, and each individual task in the two jobs is completely stateless and non-blocking once launched by the driver.
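Putting the two jobs together, the driver-side control of Algorithm 1 is conceptually just a sequential loop in the Spark application, as in the sketch below (again reusing the toy 'samples' RDD and 'sc' from the earlier sketches); BigDL's actual driver logic, its Optimizer, adds checkpointing, metrics and failure handling on top of this skeleton.

# Conceptual driver loop for Algorithm 1 (a sketch, not BigDL's Optimizer code). The single
# driver program sequences the two short-lived Spark jobs of every iteration; the tasks are
# stateless and read the latest weights through a broadcast.
import numpy as np

N = 4                                           # parameter slices (one per sync task)
lr = 0.1                                        # learning rate (arbitrary)
weights = np.zeros(10, dtype="float32")

for step in range(100):                         # M training iterations
    w_bcast = sc.broadcast(weights)             # latest weights, readable by every task

    def forward_backward(partition, w_bcast=w_bcast):
        w = w_bcast.value
        batch = [next(partition) for _ in range(32)]
        grad = np.zeros_like(w)
        for x, y in batch:
            grad += 2.0 * (np.dot(w, x) - y) * x
        yield grad / len(batch)

    # job 1: "model forward-backward" (local gradients, materialized in cluster memory)
    grads = samples.mapPartitions(forward_backward).cache()
    grads.count()

    # job 2: "parameter synchronization" (slice, shuffle, aggregate, update)
    summed = (grads.flatMap(lambda g: list(enumerate(np.array_split(g, N))))
                   .reduceByKey(lambda a, b: a + b)
                   .collect())
    weights = np.concatenate(
        [np.array_split(weights, N)[n] - lr * s for n, s in sorted(summed)])
    grads.unpersist()
    w_bcast.unpersist()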

4 EVALUATION
This section evaluates the computing performance and scalability of neural network training in BigDL. In addition, while we do not report inference performance results in this section, Section 5.1 shows the comparison of a real-world object detection inference pipeline running on BigDL vs. Caffe (as reported by JD.com, the BigDL inference pipeline running on 24 Intel Xeon servers is 3.83x faster than Caffe running on 5 servers and 20 GPU cards).

4.1 Experiments
Two categories of neural network models are used in this section to evaluate the performance and scalability of BigDL, namely, neural collaborative filtering (NCF) and convolutional neural networks (CNN), which are representative of the workloads that BigDL users run on their production Big Data platforms.

Neural Collaborative Filtering (NCF) [34] is one of the most commonly used neural network models for recommendation, and has also been included in MLPerf [35], a widely used benchmark suite for measuring the training and inference performance of machine learning hardware, software, and services. In our experiments, we compare the training performance of BigDL (running on an Intel Xeon server) vs. PyTorch (running on a GPU).

In addition, deep convolutional neural networks (CNNs) have achieved human-level accuracy and are widely used for many computer vision tasks (such as image classification and object detection). In our experiments, we study the scalability and efficiency of training Inception-v1 [36] on the ImageNet dataset [37] in BigDL with various numbers of Intel Xeon servers and Spark tasks; the results for other deep convolutional models, such as Inception-v3 [38] and ResNet50 [39], are similar. We do not include results for RNN (recurrent neural network) training in this section, because it actually has better scalability compared to CNN training. This is because RNN computation is much slower than CNN, and therefore the parameter synchronization overhead (as a fraction of model compute time) is also much lower.

4.2 Computing Performance
To study the computing performance of BigDL, we compare the training speed of the NCF model using BigDL and PyTorch. MLPerf has provided a reference implementation of the NCF program [40] based on PyTorch 0.4, which trains a movie recommender using the MovieLens 20Million dataset (ml-20m) [41], a widely used benchmark dataset with 20 million ratings and 465,000 tags applied to 27,000 movies by 138,000 users. It also provides the reference training speed of the PyTorch implementation (to achieve the target accuracy goal) on a single Nvidia P100 GPU.

We have implemented the same NCF program using BigDL 0.7.0 and Spark 2.1.0 [42]. We then trained the program on a dual-socket Intel Skylake 8180 2.5GHz server (with 56 cores in total and 384GB memory), and it took 29.8 minutes to converge and achieve the same accuracy goal.

Figure 5: The training performance of NCF using the BigDL implementation is 1.6x faster than the reference PyTorch implementation, as reported by MLPerf [43].

As reported by MLPerf, the training performance of NCF using the BigDL implementation is 1.6x faster than the reference PyTorch implementation [43] (as shown in Figure 5). While this only compares the training performance of BigDL on a single CPU server to PyTorch on a single GPU, it shows that BigDL provides efficient implementations for neural network model computation (forward and backward). We will study the scalability and efficiency of distributed training in BigDL in Sections 4.3 and 4.4.

4.3 Scalability of distributed training
In the machine learning community, it is commonly believed that fine-grained data access and in-place data mutation are critical for efficient distributed training, and that mechanisms like Spark's RDDs would impose significant overheads [3]. In this section, we show that BigDL provides highly efficient and scalable training, despite being built on top of the coarse-grained functional compute model and immutable RDDs of Spark.

The scalability of distributed training in BigDL is determined by the efficiency (or overheads) of its parameter synchronization. We first study the parameter synchronization overheads in BigDL by running ImageNet Inception-v1 model training using BigDL on various numbers of Xeon servers (dual-socket Intel Broadwell 2.20GHz, 256GB RAM and 10GbE network) [44].


As shown in Figure 6, the parameter synchronization overheads, measured as a fraction of the average model computation (forward and backward) time, turn out to be small (e.g., less than 7% for Inception-v1 training on 32 nodes) in BigDL.

Figure 6: Overheads of parameter synchronization (as a fraction of average model computation time) of ImageNet Inception-v1 training in BigDL [44].

To study the scalability of the distributed training of BigDL on very large-scale Intel Xeon clusters, Cray has run ImageNet Inception-v1 model training using BigDL 0.3.0 with various node counts (starting at 16 nodes and scaling up to 256 nodes) [45]. Each node is a dual-socket Intel Broadwell 2.1 GHz (CCU 36 and DDR4 2400) server; the learning rate and Spark's executor memory are set to 0.10 and 120 GB respectively in the experiments.

Figure 7: Throughput of ImageNet Inception-v1 training in BigDL 0.3.0 reported by Cray, which scales almost linearly up to 96 nodes (and continues to scale reasonably up to 256 nodes) [45].

Figure 7 shows the throughput of ImageNet Inception-v1 training; the training throughput scales almost linearly up to 96 nodes (e.g., about a 5.3x speedup on 96 nodes compared to 16 nodes), and continues to scale reasonably well up to 256 nodes [45]. The results show that, even though BigDL implements its parameter server architecture directly on top of Spark (with immutable RDDs and coarse-grained functional operations), it can still provide efficient distributed training on large clusters.

4.4 Efficiency of task scheduling
As described in Section 3.4, BigDL needs to run a very large number of short-lived tasks on Spark (e.g., ImageNet Inception-v1 training may run 100s of thousands of iterations with 100s of tasks in parallel per iteration, while each task runs for just a couple of seconds); as a result, the underlying Spark framework needs to schedule a very large number of tasks across the cluster in a short period of time, which can potentially become a bottleneck on large clusters. For instance, Figure 8 shows that the overhead of launching tasks (as a fraction of average model computation time) in ImageNet Inception-v1 training on BigDL, while low for 100-200 tasks per iteration, can grow to over 10% when there are close to 500 tasks per iteration [46].

Figure 8: Overheads of task scheduling and dispatch (as a fraction of average computation time) for ImageNet Inception-v1 training in BigDL [46].

To address this issue, in each training iteration BigDL launches only a single (multi-threaded) task on each server, so as to achieve high scalability on large clusters (e.g., up to 256 machines, as described in Section 4.3). To scale to an even larger number (e.g., over 500) of servers, one can potentially leverage the iterative nature of model training (in which the same operations are executed repeatedly). For instance, group scheduling introduced by Drizzle [47], a low latency execution engine for Spark, can help schedule multiple iterations (or a group) of computations at once, so as to greatly reduce scheduling overheads even if there are a large number of tasks in each iteration, as shown in Figure 8 (which ran on AWS EC2 using r4.2xlarge instances) [46].

5 APPLICATIONS
Since its initial open source release (on Dec 30, 2016), BigDL users have built many deep learning applications on Spark and big data platforms. In this section, we share the real-world experience and "war stories" of our users that have adopted BigDL to build end-to-end data analysis and deep learning pipelines for their production data.

5.1 Image feature extraction using object detection models
JD.com has built an end-to-end object detection and image feature extraction pipeline on top of Spark and BigDL [48], as illustrated in Figure 9.


Figure 9: End-to-end object detection and image feature extraction pipeline (using SSD and DeepBit models) on top of Spark
and BigDL [48].

• The pipeline first reads hundreds of millions of pictures from a distributed database into Spark (as an RDD of pictures), and then pre-processes the RDD of pictures in a distributed fashion using Spark.
• It then uses BigDL to load an SSD [49] model (pre-trained in Caffe) for large-scale, distributed object detection on Spark, which generates the coordinates and scores for the detected objects in each of the pictures.
• It then generates the RDD of target images (by keeping the object with the highest score as the target, and cropping the original picture based on the coordinates of the target), and further pre-processes the RDD of target images.
• Finally, it uses BigDL to load a DeepBit [50] model (again pre-trained in Caffe) for distributed feature extraction of the target images, and stores the results (the RDD of extracted object features) in HDFS.
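A condensed sketch of such an inference pipeline is shown below. All paths, the decoding stub and the cropping step are placeholders, and the BigDL calls (create_spark_conf, init_engine, Model.load_caffe_model, predict) reflect our reading of the 0.x Python API rather than JD's actual code, which is described in [48].

# Hedged sketch of an SSD + DeepBit inference pipeline on Spark (hypothetical paths and
# pre-processing; BigDL 0.x Python API names are assumptions to verify).
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import create_spark_conf, init_engine, Sample
from bigdl.nn.layer import Model

sc = SparkContext(conf=create_spark_conf().setAppName("image-feature-extraction"))
init_engine()

def decode_and_preprocess(raw_bytes):
    # Placeholder: decode an image and resize/normalize it into a CHW float32 array
    return np.zeros((3, 300, 300), dtype="float32")

pictures = sc.sequenceFile("hdfs:///jd/pictures")          # hypothetical source table
images = pictures.map(lambda kv: decode_and_preprocess(kv[1])).map(
    lambda arr: Sample.from_ndarray(arr, np.array([0.0])))  # dummy label for inference

# Distributed object detection with a Caffe-pretrained SSD model
ssd = Model.load_caffe_model("ssd_deploy.prototxt", "ssd.caffemodel")
detections = ssd.predict(images)                           # RDD of detection outputs

# Keep the highest-scoring box and crop the target region (cropping omitted here),
# then run distributed feature extraction with a Caffe-pretrained DeepBit model
deepbit = Model.load_caffe_model("deepbit_deploy.prototxt", "deepbit.caffemodel")
features = deepbit.predict(images)                         # stand-in for the cropped targets
features.saveAsTextFile("hdfs:///jd/object_features")      # store extracted features in HDFS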

Previously, JD engineers had deployed the same solution on a 5-node GPU cluster with 20 NVIDIA Tesla K40 cards, following a "connector approach" (similar to CaffeOnSpark): reading data from HBase, partitioning and processing the data across the cluster, and then running the deep learning models on Caffe. This turned out to be very complex and error-prone (because all of the data partitioning, load balancing, fault tolerance, etc., needed to be manually managed). In addition, it also revealed an impedance mismatch of the "connector approach" (HBase + Caffe in this case) – reading data from HBase took about half of the total time in this solution (because the task parallelism is tied to the number of GPU cards in the system, which is too low for interacting with HBase to read the data).

After migrating the solution to BigDL, JD engineers can easily implement the entire data analysis and deep learning pipeline (including data loading, partitioning, pre-processing, model inference, etc.) under a unified programming paradigm on Spark. This not only greatly improves the efficiency of development and deployment, but also delivers about a 3.83x speedup (running on about 24 Intel Broadwell 2.2GHz servers) compared to running the Caffe-based solution on the GPU cluster (with 20 NVIDIA Tesla K40 cards), as reported by JD [48] and shown in Figure 10.

Figure 10: Throughput of the GPU cluster and the Xeon cluster for the image feature extraction pipeline benchmarked by JD; the GPU cluster consists of 20 NVIDIA Tesla K40 cards, and the Xeon cluster consists of 1200 logical cores (with each Intel Xeon E5-2650 v4 2.2GHz server running 50 logical cores) [48].


Figure 11: End-to-end precipitation nowcasting workflow (using sequence-to-sequence models) on Spark and BigDL [45].

Figure 12: Predicting precipitation patterns for the next hour (i.e., a sequence of images for the future time steps) on Spark and BigDL [45].

5.2 Precipitation nowcasting using Seq2Seq models
Cray has built a precipitation nowcasting (predicting short-term precipitation) application using a Seq2Seq [51] model (with a stacked convolutional LSTM network [52] as the encoder, and another stacked convolutional LSTM network as the decoder); the end-to-end pipeline runs on Spark and BigDL [45], including data preparation, model training and inference (as illustrated in Figure 11).

• The application first reads over a terabyte of raw radar scan data into Spark (as an RDD of radar images), and then converts it into an RDD of NumPy ndarrays.
• It then trains a sequence-to-sequence model, using a sequence of images leading up to the current time as the input, and a sequence of predicted images in the future as the output.
• After the model is trained, it can be used to predict, say, the precipitation patterns of the next hour (i.e., a sequence of images for the future time steps), as illustrated in Figure 12.
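The data preparation step above amounts to turning a time-ordered collection of radar frames into (input sequence, target sequence) training pairs. The sketch below shows one way to do that with plain PySpark and NumPy; the frame size, window lengths and random stand-in data are invented for the example and this is not Cray's actual code.

# Hedged sketch: build sequence-to-sequence training pairs from timestamped radar frames
# (a simplified stand-in for the data preparation stage in Figure 11).
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="nowcasting-data-prep")

IN_STEPS, OUT_STEPS = 6, 6                                  # 6 past frames -> 6 future frames

# RDD of (timestamp, 2-D radar frame); random data stands in for the raw radar scans
frames = sc.parallelize(
    [(t, np.random.rand(64, 64).astype("float32")) for t in range(996)])

def to_windows(ts_frame):
    # Emit (window_id, (offset, frame)) so each window holds IN_STEPS + OUT_STEPS frames
    t, frame = ts_frame
    window = t // (IN_STEPS + OUT_STEPS)
    return (window, (t % (IN_STEPS + OUT_STEPS), frame))

def to_pair(group):
    # Sort one window's frames by offset and split into (input_seq, target_seq)
    frames_sorted = [f for _, f in sorted(group)]
    seq = np.stack(frames_sorted)                           # shape: (12, 64, 64)
    return seq[:IN_STEPS], seq[IN_STEPS:]

pairs = frames.map(to_windows).groupByKey().mapValues(to_pair)
# 'pairs' is an RDD keyed by window id whose values are (input, target) ndarray pairs,
# ready to be wrapped as BigDL Samples and fed to the distributed Seq2Seq training.
print(pairs.count())                                        # 83 complete windows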
Cray engineers had previously implemented the application using two separate workflows: running data processing on a highly distributed Spark cluster, and deep learning training on a separate GPU cluster running TensorFlow. It turned out that this approach not only brings high data movement overheads, but also greatly hurts development productivity due to the fragmented workflow. As a result, Cray engineers chose to implement the solution as a single unified data analysis and deep learning pipeline on Spark and BigDL, which greatly improves the efficiency of development and deployment.

5.3 Real-time streaming speech classification
GigaSpaces has built a speech classification application for efficient call center management [53], which automatically routes client calls to the corresponding support specialists in real time. The end-to-end workflow is implemented using BigDL with Apache Kafka [54] and Spark Streaming [55] (as illustrated in Figure 13), so as to provide distributed, real-time streaming model inference.

• When a customer calls the call center, his or her speech is first processed on the fly by a speech recognition unit, and the result is stored in Kafka.
• A Spark Streaming job then reads the speech recognition results from Kafka and classifies each call using the BigDL model in real time.
• The classification result is in turn used by a routing system to redirect the call to the proper support specialist.
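A minimal sketch of such a streaming job is shown below, assuming the legacy Spark Streaming Kafka 0.8 direct-stream API, a pre-trained BigDL model saved on HDFS, and placeholder topic names, featurization and routing; the BigDL calls (Model.loadModel, predict) again follow our reading of the 0.x Python API rather than GigaSpaces' actual implementation.

# Hedged sketch: real-time classification of speech-recognition results from Kafka with a
# BigDL model inside a Spark Streaming job (hypothetical topics, featurization and routing).
import numpy as np
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from bigdl.util.common import create_spark_conf, init_engine, Sample
from bigdl.nn.layer import Model

sc = SparkContext(conf=create_spark_conf().setAppName("call-center-routing"))
init_engine()
ssc = StreamingContext(sc, batchDuration=2)                 # 2-second micro-batches

model = Model.loadModel("hdfs:///models/speech_classifier.bigdl")  # pre-trained BigDL model

def featurize(text):
    # Placeholder: turn a recognized transcript into a fixed-size feature vector
    vec = np.zeros(1000, dtype="float32")
    for word in text.lower().split():
        vec[hash(word) % 1000] += 1.0
    return Sample.from_ndarray(vec, np.array([0.0]))

stream = KafkaUtils.createDirectStream(
    ssc, ["speech-transcripts"], {"metadata.broker.list": "kafka:9092"})

def classify_and_route(rdd):
    if rdd.isEmpty():
        return
    samples = rdd.map(lambda kv: featurize(kv[1]))
    for prediction in model.predict(samples).collect():
        pass  # hand the predicted class to the (external) call-routing system here

stream.foreachRDD(classify_and_route)
ssc.start()
ssc.awaitTermination()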
One of the key challenges for GigaSpaces engineers to implement the end-to-end workflow is how to efficiently integrate the new neural network models in the real-time stream processing pipeline, and how to seamlessly scale the streaming applications from a handful of machines to thousands of nodes.


Figure 13: The end-to-end workflow of real-time streaming speech classification on Kafka, Spark Streaming and BigDL [53].

BigDL allows neural network models to be directly applied in the standard distributed streaming architecture for Big Data (using Apache Kafka and Spark Streaming), which can then efficiently scale out to a large number of nodes in a transparent fashion. As a result, this greatly improves the developer productivity and deployment efficiency of the end-to-end streaming workflow.

6 RELATED WORK
Existing deep learning frameworks (such as TensorFlow, MXNet, Petuum, ChainerMN, etc.) typically provide efficient parameter server and/or AllReduce implementations (using fine-grained data access and in-place data mutation) for distributed training. In contrast, BigDL provides distributed training support directly on top of the functional compute model of big data systems (with copy-on-write and coarse-grained operations), which is completely different from the implementation in existing deep learning frameworks. This provides a viable design alternative for distributed model training by adopting the state of practice of big data systems, and makes it easy to handle failures, resource changes, task preemptions, etc., in a more timely and fine-grained fashion.

As discussed in Section 2, to address the challenge of integrating deep learning into real-world data pipelines, there have been many efforts in the industry that adopt a "connector approach" (e.g., TFX, CaffeOnSpark, TensorFlowOnSpark, SageMaker, etc.). Unfortunately, these frameworks can incur very large overheads in practice due to the adaptation layer between different frameworks; more importantly, they often suffer from impedance mismatches that arise from crossing boundaries between heterogeneous components. While efforts in the Big Data community (such as Project Hydrogen in Spark) attempt to overcome some of the issues brought by the "connector approach", they still do not address the fundamental "impedance mismatch" problem (as discussed in Section 2). By unifying the distributed execution model of deep neural network models and big data analysis, BigDL provides a single unified data pipeline for both deep learning and big data analysis, which eliminates these adaptation overheads and impedance mismatches.

7 SUMMARY
We have described BigDL, including its distributed execution model, computing performance, training scalability, and real-world use cases. It allows users to build deep learning applications for big data using a single unified data pipeline; the entire pipeline can directly run on top of existing big data systems in a distributed fashion. Unlike existing deep learning frameworks, it provides efficient and scalable distributed training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. BigDL is a work in progress, but our initial experience is encouraging. Since its initial open source release on Dec 30, 2016, it has received over 3100 stars on GitHub; and it has enabled many users (e.g., Mastercard, World Bank, Cray, Talroo, UCSF, JD, UnionPay, Telefonica, GigaSpaces, etc.) to build new analytics and deep learning applications for their production data pipelines.

REFERENCES
[1] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM'14).
[2] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop (2011).
[3] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16).
[4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) at NIPS (2015).
[5] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: A next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) at NIPS (2015).
[6] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. NIPS 2017 Autodiff Workshop (2017).
[7] Apache Spark. Apache Software Foundation (2014). https://spark.apache.org
[8] Apache Hadoop. Apache Software Foundation (2006). https://hadoop.apache.org
[9] F. Chollet et al. Keras. https://keras.io
[10] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12).


[11] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD'15).
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) (2015).
[13] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2016).
[14] G. Jawaheer, M. Szomszor, and P. Kostkova. Comparison of implicit and explicit feedback from an online music recommendation service. In Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec'10) (2010).
[15] D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, C. Y. Koo, L. Lew, C. Mewald, A. N. Modi, N. Polyzotis, S. Ramesh, S. Roy, S. E. Whang, M. Wicke, J. Wilkiewicz, X. Zhang, and M. Zinkevich. TFX: A TensorFlow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'17).
[16] CaffeOnSpark. Yahoo (2016). https://github.com/yahoo/CaffeOnSpark
[17] TensorFlowOnSpark. Yahoo (2017). https://github.com/yahoo/TensorFlowOnSpark
[18] SageMaker. Amazon (2017). https://aws.amazon.com/sagemaker/
[19] J. Lin and D. Ryaboy. Scaling big data mining infrastructure: The Twitter experience. ACM SIGKDD Explorations Newsletter 14(2) (December 2012).
[20] R. Xin. Project Hydrogen: Unifying state-of-the-art AI and big data in Apache Spark. Spark + AI Summit 2018.
[21] Gang scheduling. https://en.wikipedia.org/wiki/Gang_scheduling
[22] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems (EuroSys'10).
[23] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'12).
[24] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14).
[25] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14).
[26] E. P. Xing, Q. Ho, W. Dai, J.-K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'15).
[27] H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In 2017 USENIX Annual Technical Conference (USENIX ATC'17) (2017).
[28] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI'04) (2004).
[29] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys'07).
[30] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. In International Conference on Learning Representations Workshop Track (2016).
[31] A. Gibiansky. Bringing HPC techniques to deep learning. http://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/
[32] T. Akiba, K. Fukuda, and S. Suzuki. ChainerMN: Scalable distributed deep learning framework. In Proceedings of the Workshop on ML Systems at NIPS (2017).
[33] E. Liang, R. Liaw, P. Moritz, R. Nishihara, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica. RLlib: Abstractions for distributed reinforcement learning. In International Conference on Machine Learning (ICML) (2018).
[34] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (WWW'17) (2017).
[35] MLPerf. https://mlperf.org/
[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR) (2015).
[37] J. Deng, R. Socher, L. Fei-Fei, W. Dong, K. Li, and L.-J. Li. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009).
[38] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
[39] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
[40] Reference NCF implementation using PyTorch in MLPerf. https://github.com/mlperf/training/blob/master/recommendation/pytorch/README.md
[41] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems 5(4):19 (2015).
[42] NCF implementation in BigDL. https://github.com/mlperf/training_results_v0.5/tree/master/v0.5.0/intel/intel_ncf_submission
[43] MLPerf v0.5 training results. https://mlperf.org/training-results-0-5
[44] J. Dai and D. Ding. Very large-scale distributed deep learning with BigDL. O'Reilly AI Conference, San Francisco (2017).
[45] A. Heye et al. Scalable deep learning with BigDL on the Urika-XC software suite. https://www.cray.com/blog/scalable-deep-learning-bigdl-urika-xc-software-suite/
[46] S. Venkataraman et al. Accelerating deep learning training with BigDL and Drizzle on Apache Spark. https://rise.cs.berkeley.edu/blog/accelerating-deep-learning-training-with-bigdl-and-drizzle-on-apache-spark/
[47] S. Venkataraman, A. Panda, K. Ousterhout, M. Armbrust, A. Ghodsi, M. J. Franklin, B. Recht, and I. Stoica. Drizzle: Fast and adaptable stream processing at scale. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP'17).
[48] J. Dai et al. Building large-scale image feature extraction with BigDL at JD.com. https://software.intel.com/en-us/articles/building-large-scale-image-feature-extraction-with-bigdl-at-jdcom
[49] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV (2016).
[50] K. Lin, J. Lu, C.-S. Chen, and J. Zhou. Learning compact binary descriptors with unsupervised deep neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
[51] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), Vol. 2.
[52] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15), Vol. 1.
[53] R. Shah. GigaSpaces integrates InsightEdge platform with Intel's BigDL for scalable deep learning innovation. https://www.gigaspaces.com/blog/gigaspaces-to-demo-with-intel-at-strata-data-conference-and-microsoft-ignite/
[54] Apache Kafka. https://kafka.apache.org/
[55] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In The Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP'13) (2013).
