Real-Time Machine Learning: The Missing Pieces
Robert Nishihara∗ , Philipp Moritz∗ , Stephanie Wang, Alexey Tumanov, William Paul,
Johann Schleier-Smith, Richard Liaw, Mehrdad Niknami, Michael I. Jordan, Ion Stoica
arXiv:1703.03924v2 [cs.DC] 19 May 2017
UC Berkeley
Abstract
Machine learning applications are increasingly deployed not only to serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making. These applications pose a new set of requirements, none of which are difficult to achieve in isolation, but the combination of which creates a challenge for existing distributed execution frameworks: computation with millisecond latency at high throughput, adaptive construction of arbitrary task graphs, and execution of heterogeneous kernels over diverse sets of resources. We assert that a new distributed execution framework is needed for such ML applications and propose a candidate approach with a proof-of-concept architecture that achieves a 63x performance improvement over a state-of-the-art execution framework for a representative application.

Figure 1: (a) Traditional ML pipeline (off-line training of a model on data sets, followed by model serving to answer prediction queries). (b) Example reinforcement learning pipeline: the system continuously interacts with an environment to learn a policy, i.e., a mapping between observations and actions.
1 Introduction

The landscape of machine learning (ML) applications is undergoing a significant change. While ML has predominantly focused on training and serving predictions based on static models (Figure 1a), there is now a strong shift toward the tight integration of ML models in feedback loops. Indeed, ML applications are expanding from the supervised learning paradigm, in which static models are trained on offline data, to a broader paradigm, exemplified by reinforcement learning (RL), in which applications may operate in real environments, fuse and react to sensory data from numerous input streams, perform continuous micro-simulations, and close the loop by taking actions that affect the sensed environment (Figure 1b).

Since learning by interacting with the real world can be unsafe, impractical, or bandwidth-limited, many reinforcement learning systems rely heavily on simulating physical or virtual environments. Simulations may be used during training (e.g., to learn a neural network policy), and during deployment. In the latter case, we may constantly update the simulated environment as we interact with the real world and perform many simulations to figure out the next action (e.g., using online planning algorithms like Monte Carlo tree search). This requires the ability to perform simulations faster than real time.

Such emerging applications require new levels of programming flexibility and performance. Meeting these requirements without losing the benefits of modern distributed execution frameworks (e.g., application-level fault tolerance) poses a significant challenge. Our own experience implementing ML and RL applications in Spark, MPI, and TensorFlow highlights some of these challenges

∗ equal contribution
Figure 2: Example components of a real-time ML application: (a) online processing of streaming sensory data to model the environment, (b) dynamic graph construction for Monte Carlo tree search (here tasks are simulations exploring sequences of actions), and (c) heterogeneous tasks in recurrent neural networks (RNNs). Different shades represent different types of tasks, and the task lengths represent their durations.
and gives rise to three groups of requirements for supporting these applications. Though these requirements are critical for ML and RL applications, we believe they are broadly useful.

Performance Requirements. Emerging ML applications have stringent latency and throughput requirements.

• R1: Low latency. The real-time, reactive, and interactive nature of emerging ML applications calls for fine-granularity task execution with millisecond end-to-end latency [8].

• R2: High throughput. The volume of micro-simulations required both for training [16] as well as for inference during deployment [19] necessitates support for high-throughput task execution on the order of millions of tasks per second.

Execution Model Requirements. Though many existing parallel execution systems [9, 21] have gotten great mileage out of identifying and optimizing for common computational patterns, emerging ML applications require far greater flexibility [10].

• R3: Dynamic task creation. RL primitives such as Monte Carlo tree search may generate new tasks during execution based on the results or the durations of other tasks.

• R4: Heterogeneous tasks. Deep learning primitives and RL simulations produce tasks with widely different execution times and resource requirements. Explicit system support for heterogeneity of tasks and resources is essential for RL applications.

• R5: Arbitrary dataflow dependencies. Similarly, deep learning primitives and RL simulations produce arbitrary and often fine-grained task dependencies (not restricted to bulk synchronous parallel).

Practical Requirements.

• R6: Transparent fault tolerance. Fault tolerance remains a key requirement for many deployment scenarios, and supporting it alongside high-throughput and non-deterministic tasks poses a challenge.

• R7: Debuggability and Profiling. Debugging and performance profiling are the most time-consuming aspects of writing any distributed application. This is especially true for ML and RL applications, which are often compute-intensive and stochastic.

Existing frameworks fall short of achieving one or more of these requirements (Section 5). We propose a flexible distributed programming model (Section 3.1) to enable R3-R5. In addition, we propose a system architecture to support this programming model and meet our performance requirements (R1-R2) without giving up key practical requirements (R6-R7). The proposed system architecture (Section 3.2) builds on two principal components: a logically-centralized control plane and a hybrid scheduler. The former enables stateless distributed components and lineage replay. The latter allocates resources in a bottom-up fashion, splitting locally-born work between node-level and cluster-level schedulers.

The result is millisecond-level performance on microbenchmarks and a 63x end-to-end speedup on a representative RL application over a bulk synchronous parallel (BSP) implementation.

2 Motivating Example

To motivate requirements R1-R7, consider a hypothetical application in which a physical robot attempts to achieve a goal in an unfamiliar real-world environment. Various sensors may fuse video and LIDAR input to build multiple candidate models of the robot’s environment (Fig. 2a). The robot is then controlled in real time using actions informed by a recurrent neural network (RNN) policy (Fig. 2c), as well as by Monte Carlo tree search (MCTS) and other online planning algorithms (Fig. 2b). Using a physics simulator along with the most recent environment models, MCTS tries millions of action sequences in parallel, adaptively exploring the most promising ones.

The Application Requirements. Enabling these kinds of applications involves simultaneously solving a number of challenges. In this example, the latency requirements (R1) are stringent, as the robot must be controlled in real time. High task throughput (R2) is needed to support the online simulations for MCTS as well as the streaming sensory input.

Task heterogeneity (R4) is present on many scales: some tasks run physics simulators, others process diverse data streams, and some compute actions using RNN-based policies. Even similar tasks may exhibit substantial variability in duration. For example, the RNN consists of different functions for each “layer”, each of which may require different amounts of computation. Or, in a task simulating the robot’s actions, the simulation length may depend on whether the robot achieves its goal or not.

In addition to the heterogeneity of tasks, the dependencies between tasks can be complex (R5, Figs. 2a and 2c) and difficult to express as batched BSP stages.

Dynamic construction of tasks and their dependencies (R3) is critical. Simulations will adaptively use the most recent environment models as they become available, and MCTS may choose to launch more tasks exploring particular subtrees, depending on how promising they are or how fast the computation is. Thus, the dataflow graph must be constructed dynamically in order to allow the algorithm to adapt to real-time constraints and opportunities.

3 Proposed Solution

In this section, we outline a proposal for a distributed execution framework and a programming model satisfying requirements R1-R7 for real-time ML applications.

3.1 API and Execution Model

In order to support the execution model requirements (R3-R5), we outline an API that allows arbitrary functions to be specified as remotely executable tasks, with dataflow dependencies between them.

1. Task creation is non-blocking. When a task is created, a future [4] representing the eventual return value of the task is returned immediately, and the task is executed asynchronously.

2. Any function invocation can be designated as a remote task, making it possible to support arbitrary execution kernels (R4). Task arguments can be either regular values or futures. When an argument is a future, the newly created task becomes dependent on the task that produces that future, enabling arbitrary DAG dependencies (R5).

3. Any task execution can create new tasks without blocking on their completion. Task throughput is therefore not limited by the bandwidth of any one worker (R2), and the computation graph is dynamically built (R3).

4. The actual return value of a task can be obtained by calling the get method on the corresponding future. This blocks until the task finishes executing.

5. The wait method takes a list of futures, a timeout, and a number of values. It returns the subset of futures whose tasks have completed when the timeout occurs or the requested number have completed.

The wait primitive allows developers to specify latency requirements (R1) with a timeout, accounting for arbitrarily sized tasks (R4). This is important for ML applications, in which a straggler task may produce negligible algorithmic improvement but block the entire computation. This primitive enhances our ability to dynamically
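The API above lends itself to a compact futures-based sketch. The following single-process Python sketch is illustrative only: the names remote, get, and wait, and the thread-pool backing, are our assumptions rather than the paper's concrete interface, and a real system would schedule tasks across a cluster rather than across local threads.

```python
# Single-process sketch of the proposed futures API (R3-R5).
# Futures are backed by a thread pool here; the proposed system
# would instead schedule tasks over distributed workers.
import time
from concurrent.futures import ThreadPoolExecutor, Future

_pool = ThreadPoolExecutor(max_workers=8)

def remote(fn, *args):
    """Launch fn as a task without blocking (item 1). Future arguments
    are resolved first, so the new task depends on the tasks that
    produce them, forming an arbitrary DAG (items 2-3, R5)."""
    def run():
        resolved = [a.result() if isinstance(a, Future) else a for a in args]
        return fn(*resolved)
    return _pool.submit(run)  # returns a future immediately

def get(future):
    """Block until the task finishes and return its value (item 4)."""
    return future.result()

def wait(futures, timeout, num_returns):
    """Return the completed subset once num_returns tasks are done or
    the timeout elapses, whichever comes first (item 5, R1)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if sum(f.done() for f in futures) >= num_returns:
            break
        time.sleep(0.001)
    return [f for f in futures if f.done()]

def square(x):
    return x * x

a = remote(square, 3)   # non-blocking task creation
b = remote(square, a)   # depends on a's future: 9 -> 81
print(get(b))           # blocks until b completes: 81
```

Note how the wait primitive lets a training loop proceed with whichever simulation results arrive within its latency budget, instead of blocking on a straggler.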
[Figure: proposed architecture — a logically-centralized control plane (task table, function table, event logs) shared by a global scheduler, per-node local schedulers, workers with shared memory, and tools such as a web UI and error diagnosis.]

the database is fault-tolerant, we can recover from component failures by simply restarting the failed components (R6). The database also makes it easy to write tools to profile and inspect the state of the system (R7).

To achieve the throughput requirement (R2), we shard the database. Since we require only exact matching operations and since the keys are computed as hashes, sharding is straightforward. Our early experiments show that this design enables sub-millisecond scheduling latencies (R1).
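The hash-based sharding described above can be sketched as follows. This is a minimal illustration under our own assumptions: the shard count, record contents, and function names are hypothetical, and a real deployment would place each shard on a separate server rather than in a local dict.

```python
# Sketch of exact-match, hash-sharded lookup for the control-plane
# database. Keys are hashes of identifiers, so a key's shard is simply
# its value modulo the number of shards; no range queries are needed.
import hashlib

NUM_SHARDS = 4  # hypothetical; the paper does not fix a shard count

# Stand-ins for independent database shards (e.g., separate servers).
shards = [dict() for _ in range(NUM_SHARDS)]

def key_hash(task_id: str) -> int:
    # Keys are computed as hashes of task/object identifiers.
    return int(hashlib.sha1(task_id.encode()).hexdigest(), 16)

def shard_for(key: int) -> dict:
    return shards[key % NUM_SHARDS]

def put(task_id: str, record: dict) -> None:
    key = key_hash(task_id)
    shard_for(key)[key] = record

def get(task_id: str) -> dict:
    key = key_hash(task_id)
    return shard_for(key)[key]

put("task-42", {"state": "RUNNING", "node": "n3"})
print(get("task-42")["state"])  # RUNNING
```

Because every operation touches exactly one shard, throughput scales with the number of shards, which is what makes the millions-of-tasks-per-second target (R2) plausible.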
[6] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS Workshop on Machine Learning Systems (LearningSys'16) (2016).

[7] Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. Deep learning with COTS HPC systems. In Proceedings of The 30th International Conference on Machine Learning (2013), pp. 1337–1345.

[8] Crankshaw, D., Bailis, P., Gonzalez, J. E., Li, H., Zhang, Z., Franklin, M. J., Ghodsi, A., and Jordan, M. I. The missing piece in complex analytics: Low latency, scalable model management and serving with Velox. arXiv preprint arXiv:1409.3809 (2014).

[9] Dean, J., and Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (Jan. 2008), 107–113.

[10] Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML) (2016).

[11] Gabriel, E., Fagg, G. E., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R. H., Daniel, D. J., Graham, R. L., and Woodall, T. S. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting (Budapest, Hungary, September 2004), pp. 97–104.

[17] Rocklin, M. Dask: Parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th Python in Science Conference (2015), K. Huff and J. Bergstra, Eds., pp. 130–136.

[18] Sanfilippo, S. Redis: An open source, in-memory data structure store. https://redis.io/, 2009.

[19] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.

[20] Wentzlaff, D., and Agarwal, A. Factored operating systems (fos): The case for a scalable operating system for multicores. SIGOPS Oper. Syst. Rev. 43, 2 (Apr. 2009), 76–85.

[21] Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S., and Stoica, I. Apache Spark: A unified engine for big data processing. Commun. ACM 59, 11 (Oct. 2016), 56–65.