Ray: A Distributed Framework for Emerging AI Applications
Philipp Moritz∗, Robert Nishihara∗, Stephanie Wang, Alexey Tumanov, Richard Liaw,
Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica
University of California, Berkeley
3 Programming and Computation Model

Ray implements a dynamic task graph computation model, i.e., it models an application as a graph of dependent tasks that evolves during execution. On top of this model, Ray provides both an actor and a task-parallel programming abstraction. This unification differentiates Ray from related systems like CIEL, which only provides a task-parallel abstraction, and from Orleans [14] or Akka [1], which primarily provide an actor abstraction.

3.1 Programming Model

Tasks. A task represents the execution of a remote function on a stateless worker. When a remote function is invoked, a future representing the result of the task is returned immediately. Futures can be retrieved using ray.get() and passed as arguments into other remote functions without waiting for their result. This allows the user to express parallelism while capturing data dependencies. Table 1 shows Ray's API.

Remote functions operate on immutable objects and are expected to be stateless and side-effect free: their outputs are determined solely by their inputs. This implies idempotence, which simplifies fault tolerance through function re-execution on failure.

Actors. An actor represents a stateful computation. Each actor exposes methods that can be invoked remotely and are executed serially. A method execution is similar to a task, in that it executes remotely and returns a future, but differs in that it executes on a stateful worker. A handle to an actor can be passed to other actors or tasks, making it possible for them to invoke methods on that actor.

Table 2 summarizes the properties of tasks and actors. Tasks enable fine-grained load balancing (through load-aware scheduling at task granularity), input data locality (as each task can be scheduled on the node storing its inputs), and low recovery overhead (as there is no need to checkpoint and recover intermediate state). In contrast, actors provide much more efficient fine-grained updates, as these updates are performed on internal rather than external state, which would otherwise require serialization and deserialization. For example, actors can be used to implement parameter servers [32] and GPU-based iterative computations (e.g., training). In addition, actors can be used to wrap third-party simulators and other opaque handles that are hard to serialize.

To satisfy the requirements for heterogeneity and flexibility (Section 2), we augment the API in three ways. First, to handle concurrent tasks with heterogeneous durations, we introduce ray.wait(), which waits for the first k available results, instead of waiting for all results like ray.get(). Second, to handle resource-heterogeneous tasks, we enable developers to specify resource requirements so that the Ray scheduler can efficiently manage resources. Third, to improve flexibility, we enable nested remote functions, meaning that remote functions can invoke other remote functions. This is also critical for achieving high scalability (Section 4), as it enables multiple processes to invoke remote functions in a distributed fashion.
Tasks (stateless)                    Actors (stateful)
Fine-grained load balancing          Coarse-grained load balancing
Support for object locality          Poor locality support
High overhead for small updates      Low overhead for small updates
Efficient failure handling           Overhead from checkpointing

Table 2: Tasks vs. actors tradeoffs.

3.2 Computation Model

Ray employs a dynamic task graph computation model [21], in which the execution of both remote functions and actor methods is automatically triggered by the system when their inputs become available. In this section, we describe how the computation graph (Figure 4) is constructed from a user program (Figure 3). This program uses the API in Table 1 to implement the pseudocode from Figure 2.

Ignoring actors first, there are two types of nodes in a computation graph: data objects and remote function invocations, or tasks. There are also two types of edges: data edges and control edges.
    @ray.remote
    def create_policy():
        # Initialize the policy randomly.
        return policy

    @ray.remote(num_gpus=1)
    class Simulator(object):
        def __init__(self):
            # Initialize the environment.
            self.env = Environment()

        def rollout(self, policy, num_steps):
            observations = []
            observation = self.env.current_state()
            for _ in range(num_steps):
                action = policy(observation)
                observation = self.env.step(action)
                observations.append(observation)
            return observations

    @ray.remote(num_gpus=2)
    def update_policy(policy, *rollouts):
        # Update the policy.
        return policy

    @ray.remote
    def train_policy():
        # Create a policy.
        policy_id = create_policy.remote()
        # Create 10 actors.
        simulators = [Simulator.remote() for _ in range(10)]
        # Do 100 steps of training.
        for _ in range(100):
            # Perform one rollout on each actor.
            rollout_ids = [s.rollout.remote(policy_id)
                           for s in simulators]
            # Update the policy with the rollouts.
            policy_id = update_policy.remote(policy_id, *rollout_ids)
        return ray.get(policy_id)

Figure 3: Python code implementing the example in Figure 2 in Ray. Note that @ray.remote indicates remote functions and actors. Invocations of remote functions and actor methods return futures, which can be passed to subsequent remote functions or actor methods to encode task dependencies. Each actor has an environment object self.env shared between all of its methods.

Figure 4: The task graph corresponding to an invocation of train_policy.remote() in Figure 3. Remote function calls and actor method calls correspond to tasks in the task graph. The figure shows two actors. The method invocations for each actor (the tasks labeled A1i and A2i) have stateful edges between them, indicating that they share the mutable actor state. There are control edges from train_policy to the tasks that it invokes. To train multiple policies in parallel, we could call train_policy.remote() multiple times.

Data edges capture the dependencies between data objects and tasks. More precisely, if data object D is an output of task T, we add a data edge from T to D. Similarly, if D is an input to T, we add a data edge from D to T. Control edges capture the computation dependencies that result from nested remote functions (Section 3.1): if task T1 invokes task T2, then we add a control edge from T1 to T2.

Actor method invocations are also represented as nodes in the computation graph. They are identical to tasks with one key difference. To capture the state dependency across subsequent method invocations on the same actor, we add a third type of edge: a stateful edge. If method Mj is called right after method Mi on the same actor, then we add a stateful edge from Mi to Mj. Thus, all methods invoked on the same actor object form a chain that is connected by stateful edges (Figure 4). This chain captures the order in which these methods were invoked.

Stateful edges help us embed actors in an otherwise stateless task graph, as they capture the implicit data dependency between successive method invocations sharing the internal state of an actor. Stateful edges also enable us to maintain lineage. As in other dataflow systems [64], we track data lineage to enable reconstruction. By explicitly including stateful edges in the lineage graph, we can easily reconstruct lost data, whether produced by remote functions or actor methods (Section 4.2.3).
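To make the stateful-edge chain concrete, consider a toy counter actor (an illustrative example, not taken from the paper's figures): each inc.remote() call becomes a task node, and because the calls share the actor's mutable state, the corresponding tasks are linked by stateful edges and execute serially in invocation order.

    import ray

    ray.init()

    @ray.remote
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            # Mutates actor-internal state; successive calls are chained
            # by stateful edges and therefore execute in order.
            self.value += 1
            return self.value

    counter = Counter.remote()
    f1 = counter.inc.remote()     # task A1
    f2 = counter.inc.remote()     # task A2, stateful edge A1 -> A2
    f3 = counter.inc.remote()     # task A3, stateful edge A2 -> A3
    print(ray.get([f1, f2, f3]))  # [1, 2, 3]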
4 Architecture

Ray's architecture comprises (1) an application layer implementing the API, and (2) a system layer providing high scalability and fault tolerance.

4.1 Application Layer

The application layer consists of three types of processes:

• Driver: A process executing the user program.

• Worker: A stateless process that executes tasks (remote functions) invoked by a driver or another worker. Workers are started automatically and assigned tasks by the system layer. When a remote function is declared, the function is automatically published to all workers. A worker executes tasks serially, with no local state maintained across tasks.

• Actor: A stateful process that executes, when invoked, only the methods it exposes. Unlike a worker, an actor is explicitly instantiated by a worker or a driver. Like workers, actors execute methods serially, except that each method depends on the state resulting from the previous method execution.

Figure 5: Ray's architecture consists of two parts: an application layer and a system layer. The application layer implements the API and the computation model described in Section 3; the system layer implements task scheduling and data management to satisfy the performance and fault-tolerance requirements. (The figure shows, on each node, a driver, workers, and actors above a shared object store and local scheduler; the system layer also includes the global control store (GCS) with object, task, and function tables and event logs, one or more global schedulers, and a web UI with debugging, profiling, and error diagnosis tools.)

4.2 System Layer

The system layer consists of three major components: a global control store, a distributed scheduler, and a distributed object store. All components are horizontally scalable and fault-tolerant.

4.2.1 Global Control Store (GCS)

The global control store (GCS) maintains the entire control state of the system, and it is a unique feature of our design. At its core, GCS is a key-value store with pub-sub functionality. We use sharding to achieve scale, and per-shard chain replication [61] to provide fault tolerance. The primary reason for the GCS and its design is to maintain fault tolerance and low latency for a system that can dynamically spawn millions of tasks per second.

Fault tolerance in case of node failure requires a solution to maintain lineage information. Existing lineage-based solutions [64, 63, 40, 28] focus on coarse-grained parallelism and can therefore use a single node (e.g., master, driver) to store the lineage without impacting performance. However, this design is not scalable for a fine-grained and dynamic workload like simulation. Therefore, we decouple the durable lineage storage from the other system components, allowing each to scale independently.

Maintaining low latency requires minimizing overheads in task scheduling, which involves choosing where to execute, and subsequently task dispatch, which involves retrieving remote inputs from other nodes. Many existing dataflow systems [64, 40, 48] couple these by storing object locations and sizes in a centralized scheduler, a natural design when the scheduler is not a bottleneck. However, the scale and granularity that Ray targets require keeping the centralized scheduler off the critical path. Involving the scheduler in each object transfer is prohibitively expensive for primitives important to distributed training like allreduce, which is both communication-intensive and latency-sensitive. Therefore, we store the object metadata in the GCS rather than in the scheduler, fully decoupling task dispatch from task scheduling.

In summary, the GCS significantly simplifies Ray's overall design, as it enables every component in the system to be stateless. This not only simplifies support for fault tolerance (i.e., on failure, components simply restart and read the lineage from the GCS), but also makes it easy to scale the distributed object store and scheduler independently, as all components share the needed state via the GCS. An added benefit is the easy development of debugging, profiling, and visualization tools.

4.2.2 Bottom-Up Distributed Scheduler

As discussed in Section 2, Ray needs to dynamically schedule millions of tasks per second, tasks which may take as little as a few milliseconds. None of the cluster schedulers we are aware of meet these requirements. Most cluster computing frameworks, such as Spark [64], CIEL [40], and Dryad [28], implement a centralized scheduler, which can provide locality but at latencies in the tens of milliseconds. Distributed schedulers such as work stealing [12], Sparrow [45], and Canary [47] can achieve high scale, but they either don't consider data locality [12], or assume tasks belong to independent jobs [45], or assume the computation graph is known [47].

To satisfy the above requirements, we design a two-level hierarchical scheduler consisting of a global scheduler and per-node local schedulers. To avoid overloading the global scheduler, the tasks created at a node are submitted first to the node's local scheduler. A local scheduler schedules tasks locally unless the node is overloaded (i.e., its local task queue exceeds a predefined threshold), or it cannot satisfy a task's requirements (e.g., lacks a GPU). If a local scheduler decides not to schedule a task locally, it forwards it to the global scheduler. Since this scheduler attempts to schedule tasks locally first (i.e., at the leaves of the scheduling hierarchy), we call it a bottom-up scheduler.
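The local scheduler's forwarding decision can be summarized by the following sketch (illustrative pseudocode only; names such as queue_threshold and GlobalScheduler do not correspond to actual Ray internals):

    class LocalScheduler:
        def __init__(self, node, global_scheduler, queue_threshold):
            self.node = node
            self.global_scheduler = global_scheduler
            self.queue_threshold = queue_threshold
            self.queue = []

        def submit(self, task):
            # Schedule locally unless the node is overloaded or cannot
            # satisfy the task's resource requirements (e.g., no GPU).
            overloaded = len(self.queue) > self.queue_threshold
            infeasible = not self.node.has_resources(task.resources)
            if overloaded or infeasible:
                # Bottom-up: escalate to the global scheduler only when needed.
                self.global_scheduler.submit(task)
            else:
                self.queue.append(task)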
Figure 6: Bottom-up distributed scheduler. Tasks are submitted bottom-up, from drivers and workers to a local scheduler and forwarded to the global scheduler only if needed (Section 4.2.2). The thickness of each arrow is proportional to its request rate.

The global scheduler considers each node's load and the task's constraints to make scheduling decisions. More precisely, the global scheduler identifies the set of nodes that have enough resources of the type requested by the task, and of these nodes selects the node which provides the lowest estimated waiting time. At a given node, this time is the sum of (i) the estimated time the task will be queued at that node (i.e., task queue size times average task execution time), and (ii) the estimated transfer time of the task's remote inputs (i.e., total size of remote inputs divided by average bandwidth). The global scheduler gets the queue size at each node and the node resource availability via heartbeats, and the location of the task's inputs and their sizes from the GCS. Furthermore, the global scheduler computes the average task execution time and the average transfer bandwidth using simple exponential averaging. If the global scheduler becomes a bottleneck, we can instantiate more replicas, all sharing the same information via the GCS. This makes our scheduler architecture highly scalable.

4.2.3 In-Memory Distributed Object Store

To minimize task latency, we implement an in-memory distributed storage system to store the inputs and outputs of every task, or stateless computation. On each node, we implement the object store via shared memory. This allows zero-copy data sharing between tasks running on the same node. As a data format, we use Apache Arrow [2].

If a task's inputs are not local, the inputs are replicated to the local object store before execution. Also, a task writes its outputs to the local object store. Replication eliminates the potential bottleneck due to hot data objects and minimizes task execution time, as a task only reads and writes data from and to local memory. This increases throughput for computation-bound workloads, a profile shared by many AI applications. For low latency, we keep objects entirely in memory and evict them as needed to disk using an LRU policy.

As with existing cluster computing frameworks, such as Spark [64] and Dryad [28], the object store is limited to immutable data. This obviates the need for complex consistency protocols (as objects are not updated), and simplifies support for fault tolerance. In the case of node failure, Ray recovers any needed objects through lineage re-execution. The lineage stored in the GCS tracks both stateless tasks and stateful actors during initial execution; we use the former to reconstruct objects in the store.

For simplicity, our object store does not support distributed objects, i.e., each object fits on a single node. Distributed objects like large matrices or trees can be implemented at the application level as collections of futures.

4.2.4 Implementation

Ray is an active open source project† developed at the University of California, Berkeley. Ray fully integrates with the Python environment and is easy to install by simply running pip install ray. The implementation comprises ≈ 40K lines of code (LoC), 72% in C++ for the system layer and 28% in Python for the application layer. The GCS uses one Redis [50] key-value store per shard, with entirely single-key operations. GCS tables are sharded by object and task IDs to scale, and every shard is chain-replicated [61] for fault tolerance. We implement both the local and global schedulers as event-driven, single-threaded processes. Internally, local schedulers maintain cached state for local object metadata, tasks waiting for inputs, and tasks ready for dispatch to a worker. To transfer large objects between different object stores, we stripe the object across multiple TCP connections.

† https://github.com/ray-project/ray

4.3 Putting Everything Together

Figure 7 illustrates how Ray works end-to-end with a simple example that adds two objects a and b, which could be scalars or matrices, and returns result c. The remote function add() is automatically registered with the GCS upon initialization and distributed to every worker in the system (step 0 in Figure 7a).

Figure 7a shows the step-by-step operations triggered by a driver invoking add.remote(a, b), where a and b are stored on nodes N1 and N2, respectively. The driver submits add(a, b) to the local scheduler (step 1), which forwards it to a global scheduler (step 2).‡ Next, the global scheduler looks up the locations of add(a, b)'s arguments in the GCS (step 3) and decides to schedule the task on node N2, which stores argument b (step 4). The local scheduler at node N2 checks whether the local object store contains add(a, b)'s arguments (step 5). Since the local store does not contain object a, it looks up a's location in the GCS and replicates a from N1 into N2's object store before invoking add() at a local worker (steps 6–9). While this example involves several GCS lookups and scheduler hops, in many cases this number is much smaller, as most tasks are scheduled locally, and the GCS replies are cached by the global and local schedulers.

‡ Note that N1 could also decide to schedule the task locally.
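Written out as a driver program, the example in Figure 7 is only a few lines. The sketch below assumes ray.init() has connected to the cluster; in the figure, a and b already live in the object stores of N1 and N2, so the two local values here merely stand in for them.

    import ray

    ray.init()

    @ray.remote
    def add(a, b):
        # Registered with the GCS and shipped to every worker (step 0 in Figure 7a).
        return a + b

    # Stand-ins for the objects stored on N1 and N2 in Figure 7.
    a, b = 1, 2

    idc = add.remote(a, b)   # returns a future immediately (steps 1-4)
    c = ray.get(idc)         # blocks until the result is available locally
    print(c)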
Figure 7: An end-to-end example: (a) executing a task remotely and (b) returning the result of a remote task. (The figure shows the driver on N1, a worker on N2, the GCS function and object tables, the local and global schedulers, and the numbered steps referenced in Section 4.3.)

Figure 8: (a) Tasks leverage locality-aware placement. 1000 tasks with a random object dependency are scheduled onto one of two nodes. With the locality-aware policy, task latency remains independent of the size of task inputs, instead of growing by 1–2 orders of magnitude. (b) Near-linear scalability leveraging the GCS and bottom-up distributed scheduler. Ray reaches 1 million tasks per second throughput with 60 nodes. x ∈ {70, 80, 90} omitted due to cost.

Figure 9: Object store write throughput and IOPS. From a single client, throughput exceeds 15 GB/s (red) for large objects and 18K IOPS (cyan) for small objects on a 16-core instance (m4.4xlarge). It uses 8 threads to copy objects larger than 0.5 MB and 1 thread for small objects. Bar plots report throughput with 1, 2, 4, 8, 16 threads. Results are averaged over 5 runs.

Figure 10: (a) A timeline for GCS read and write latencies as viewed from a client submitting tasks. The chain starts with 2 replicas. We manually trigger reconfiguration as follows. At t ≈ 4.2 s, a chain member is killed; immediately after, a new chain member joins, initiates state transfer, and restores the chain to 2-way replication. The maximum client-observed latency is under 30 ms despite reconfigurations.
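A microbenchmark in the spirit of Figure 9 can be written directly against the Ray API. The sketch below is a rough approximation: object sizes, trial counts, and the use of NumPy arrays are arbitrary choices, and timing ray.put/ray.get measures the application-visible path rather than the store's raw write bandwidth.

    import time

    import numpy as np
    import ray

    ray.init()

    def put_get_throughput(num_bytes, trials=5):
        """Time ray.put/ray.get for an object of num_bytes, averaged over trials."""
        data = np.zeros(num_bytes, dtype=np.uint8)
        put_s, get_s = 0.0, 0.0
        for _ in range(trials):
            start = time.time()
            obj_id = ray.put(data)      # write into the local object store
            put_s += time.time() - start
            start = time.time()
            ray.get(obj_id)             # read back via shared memory
            get_s += time.time() - start
        gb = num_bytes * trials / 1e9
        return gb / put_s, gb / get_s

    for size in [10**3, 10**6, 10**9]:  # 1 KB, 1 MB, 1 GB
        put_tp, get_tp = put_get_throughput(size)
        print(f"{size:>10} bytes: put {put_tp:.2f} GB/s, get {get_tp:.2f} GB/s")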
Figure 11: Ray fault tolerance. (a) Task reconstruction: Ray reconstructs lost task dependencies as nodes are removed (dotted line), and recovers to original throughput when nodes are added back. Each task is 100 ms and depends on an object generated by a previously submitted task. (b) Actor reconstruction: actors are reconstructed from their last checkpoint. At t = 200 s, we kill 2 of the 10 nodes, causing 400 of the 2000 actors in the cluster to be recovered on the remaining nodes (t = 200–270 s).

Figure 12: (a) Mean execution time of allreduce on 16 m4.16xl nodes. Each worker runs on a distinct node. Ray* restricts Ray to 1 thread for sending and 1 thread for receiving. (b) Ray's low-latency scheduling is critical for allreduce.

Recovering from task failures. In Figure 11a, we demonstrate Ray's ability to transparently recover from worker node failures and elastically scale, using the durable GCS lineage storage. The workload, run on m4.xlarge instances, consists of linear chains of 100 ms tasks submitted by the driver. As nodes are removed (at 25 s, 50 s, 100 s), the local schedulers reconstruct previous results in the chain in order to continue execution. Overall per-node throughput remains stable throughout.

Recovering from actor failures. By encoding actor method calls as stateful edges directly in the dependency graph, we can reuse the same object reconstruction mechanism as in Figure 11a to provide transparent fault tolerance for stateful computation. Ray additionally leverages user-defined checkpoint functions to bound the reconstruction time for actors (Figure 11b). With minimal overhead, checkpointing enables only 500 methods to be re-executed, versus 10k re-executions without checkpointing. In the future, we hope to further reduce actor reconstruction time, e.g., by allowing users to annotate methods that do not mutate state.

Allreduce. Allreduce is a distributed communication primitive important to many machine learning workloads. Here, we evaluate whether Ray can natively support a ring allreduce [57] implementation with low enough overhead to match existing implementations [53]. We find that Ray completes allreduce across 16 nodes on 100 MB in ∼200 ms and 1 GB in ∼1200 ms, surprisingly outperforming OpenMPI (v1.10), a popular MPI implementation, by 1.5× and 2× respectively (Figure 12a). We attribute Ray's performance to its use of multiple threads for network transfers, taking full advantage of the 25 Gbps connection between nodes on AWS, whereas OpenMPI sequentially sends and receives data on a single thread [22]. For smaller objects, OpenMPI outperforms Ray by switching to a lower-overhead algorithm, an optimization we plan to implement in the future.

Ray's scheduler performance is critical to implementing primitives such as allreduce. In Figure 12b, we inject artificial task execution delays and show that performance drops nearly 2× with just a few ms of extra latency. Systems with centralized schedulers like Spark and CIEL typically have scheduler overheads in the tens of milliseconds [62, 38], making such workloads impractical. Scheduler throughput also becomes a bottleneck, since the number of tasks required by ring reduce scales quadratically with the number of participants.

5.2 Building blocks

End-to-end applications (e.g., AlphaGo [54]) require a tight coupling of training, serving, and simulation. In this section, we isolate each of these workloads to a setting that illustrates a typical RL application's requirements. Due to a flexible programming model targeted to RL, and a system designed to support this programming model, Ray matches and sometimes exceeds the performance of dedicated systems for these individual workloads.
Figure 13: Images per second reached when distributing the training of a ResNet-101 TensorFlow model (from the official TF benchmark). All experiments were run on p3.16xl instances connected by 25 Gbps Ethernet, and workers were allocated 4 GPUs per node as done in Horovod [53]. We note some measurement deviations from previously reported results, likely due to hardware differences and recent TensorFlow performance improvements. We used OpenMPI 3.0, TF 1.8, and NCCL2 for all runs. (The figure compares Horovod + TF, Distributed TF, and Ray + TF at 4–64 V100 GPUs.)

5.2.1 Distributed Training

We implement data-parallel synchronous SGD leveraging the Ray actor abstraction to represent model replicas. Model weights are synchronized via allreduce (5.1) or a parameter server, both implemented on top of the Ray API.

In Figure 13, we evaluate the performance of the Ray (synchronous) parameter-server SGD implementation against state-of-the-art implementations [53], using the same TensorFlow model and synthetic data generator for each experiment. We compare only against TensorFlow-based systems to accurately measure the overhead imposed by Ray, rather than differences between the deep learning frameworks themselves. In each iteration, model replica actors compute gradients in parallel, send the gradients to a sharded parameter server, then read the summed gradients from the parameter server for the next iteration.

Figure 13 shows that Ray matches the performance of Horovod and is within 10% of distributed TensorFlow (in distributed_replicated mode). This is due to the ability to express the same application-level optimizations found in these specialized systems in Ray's general-purpose API. A key optimization is the pipelining of gradient computation, transfer, and summation within a single iteration. To overlap GPU computation with network transfer, we use a custom TensorFlow operator to write tensors directly to Ray's object store.

5.2.2 Serving

Model serving is an important component of end-to-end applications. Ray focuses primarily on the embedded serving of models to simulators running within the same dynamic task graph (e.g., within an RL application on Ray). In contrast, systems like Clipper [19] focus on serving predictions to external clients.

In this setting, low latency is critical for achieving high utilization. To show this, in Table 3 we compare the server throughput achieved using a Ray actor to serve a policy versus using the open source Clipper system over REST. Here, both client and server processes are co-located on the same machine (a p3.8xlarge instance). This is often the case for RL applications but not for the general web serving workloads addressed by systems like Clipper. Due to its low-overhead serialization and shared memory abstractions, Ray achieves an order of magnitude higher throughput for a small fully connected policy model that takes in a large input, and is also faster on a more expensive residual-network policy model, similar to one used in AlphaGo Zero, that takes a smaller input.

System     Small Input               Larger Input
Clipper    4400 ± 15 states/sec      290 ± 1.3 states/sec
Ray        6200 ± 21 states/sec      6900 ± 150 states/sec

Table 3: Throughput comparisons for Clipper [19], a dedicated serving system, and Ray for two embedded serving workloads. We use a residual network and a small fully connected network, taking 10 ms and 5 ms to evaluate, respectively. The server is queried by clients that each send states of size 4 KB and 100 KB respectively in batches of 64.

5.2.3 Simulation

Simulators used in RL produce results with variable lengths ("timesteps") that, due to the tight loop with training, must be used as soon as they are available. The task heterogeneity and timeliness requirements make simulations hard to support efficiently in BSP-style systems. To demonstrate, we compare (1) an MPI implementation that submits 3n parallel simulation runs on n cores in 3 rounds, with a global barrier between rounds§, to (2) a Ray program that issues the same 3n tasks while concurrently gathering simulation results back to the driver. Table 4 shows that both systems scale well, yet Ray achieves up to 1.8× throughput. This motivates a programming model that can dynamically spawn and collect the results of fine-grained simulation tasks.

System, programming model    1 CPU    16 CPUs    256 CPUs
MPI, bulk synchronous        22.6K    208K       2.16M
Ray, asynchronous tasks      22.3K    290K       4.03M

Table 4: Timesteps per second for the Pendulum-v0 simulator in OpenAI Gym [13]. Ray allows for better utilization when running heterogeneous simulations at scale.

§ Note that experts can use MPI's asynchronous primitives to get around barriers, at the expense of increased program complexity; we nonetheless chose such an implementation to simulate BSP.
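The asynchronous-tasks pattern in the second row of Table 4 can be sketched as follows (illustrative only: the rollout task body and the num_steps argument are placeholders, and ray.wait is used to consume each result as soon as its simulation finishes rather than at a global barrier):

    import ray

    ray.init()

    @ray.remote
    def rollout(policy, num_steps):
        # Placeholder simulation: in the benchmark this would step an
        # OpenAI Gym environment (e.g., Pendulum-v0) for num_steps.
        return [policy] * num_steps

    def gather_rollouts(policy, num_tasks, num_steps=100):
        pending = [rollout.remote(policy, num_steps) for _ in range(num_tasks)]
        results = []
        # Consume results one at a time as they become ready, instead of
        # waiting at a BSP-style barrier across all tasks.
        while pending:
            ready, pending = ray.wait(pending, num_returns=1)
            results.extend(ray.get(ready))
        return results

    # e.g., 3n tasks on n = 8 cores, as in the comparison above.
    trajectories = gather_rollouts(policy=0, num_tasks=3 * 8)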
[14] Bykov, S., Geller, A., Kliot, G., Larus, J. R., Pandya, R., and Thelin, J. Orleans: Cloud computing for everyone. In Proceedings of the 2nd ACM Symposium on Cloud Computing (2011), ACM, p. 16.

[15] Carbone, P., Ewen, S., Fóra, G., Haridi, S., Richter, S., and Tzoumas, K. State management in Apache Flink: Consistent stateful distributed stream processing. Proc. VLDB Endow. 10, 12 (Aug. 2017), 1718–1729.

[16] Casado, M., Freedman, M. J., Pettit, J., Luo, J., McKeown, N., and Shenker, S. Ethane: Taking control of the enterprise. SIGCOMM Comput. Commun. Rev. 37, 4 (Aug. 2007), 1–12.

[17] Charousset, D., Schmidt, T. C., Hiesgen, R., and Wählisch, M. Native actors: A scalable software platform for distributed, heterogeneous environments. In Proceedings of the 2013 Workshop on Programming Based on Actors, Agents, and Decentralized Control (2013), ACM, pp. 87–96.

[27] Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. Distributed prioritized experience replay. International Conference on Learning Representations (2018).

[28] Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 59–72.

[29] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014).

[30] Jordan, M. I., and Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349, 6245 (2015), 255–260.

[31] Leibiusky, J., Eisbruch, G., and Simonassi, D. Getting Started with Storm. O'Reilly Media, Inc., 2012.

[32] Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2014), OSDI'14, pp. 583–598.

[33] Looks, M., Herreshoff, M., Hutchins, D., and Norvig, P. Deep learning with dynamic computation graphs. arXiv preprint arXiv:1702.02181 (2017).

[34] Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J. GraphLab: A new framework for parallel machine learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (Arlington, Virginia, United States, 2010), UAI'10, pp. 340–349.

[35] Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2010), SIGMOD '10, ACM, pp. 135–146.

[36] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (2016).

[37] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.

[38] Murray, D. A Distributed Execution Engine Supporting Data-dependent Control Flow. University of Cambridge, 2012.

[39] Murray, D. G., McSherry, F., Isaacs, R., Isard, M., Barham, P., and Abadi, M. Naiad: A timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 439–455.

[40] Murray, D. G., Schwarzkopf, M., Smowton, C., Smith, S., Madhavapeddy, A., and Hand, S. CIEL: A universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2011), NSDI'11, USENIX Association, pp. 113–126.

[41] Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., Maria, A. D., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning, 2015.

[42] Ng, A., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E. Autonomous inverted helicopter flight via reinforcement learning. Experimental Robotics IX (2006), 363–372.

[43] Nishihara, R., Moritz, P., Wang, S., Tumanov, A., Paul, W., Schleier-Smith, J., Liaw, R., Niknami, M., Jordan, M. I., and Stoica, I. Real-time machine learning: The missing pieces. In Workshop on Hot Topics in Operating Systems (2017).

[44] OpenAI. OpenAI Dota 2 1v1 bot. https://openai.com/the-international/, 2017.

[45] Ousterhout, K., Wendell, P., Zaharia, M., and Stoica, I. Sparrow: Distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 69–84.

[46] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch.

[47] Qu, H., Mashayekhi, O., Terei, D., and Levis, P. Canary: A scheduling architecture for high performance cloud computing. arXiv preprint arXiv:1602.01412 (2016).

[48] Rocklin, M. Dask: Parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th Python in Science Conference (2015), K. Huff and J. Bergstra, Eds., pp. 130–136.

[49] Salimans, T., Ho, J., Chen, X., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864 (2017).

[50] Sanfilippo, S. Redis: An open source, in-memory data structure store. https://redis.io/, 2009.

[51] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).

[52] Schwarzkopf, M., Konwinski, A., Abd-El-Malek, M., and Wilkes, J. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems (New York, NY, USA, 2013), EuroSys '13, ACM, pp. 351–364.

[53] Sergeev, A., and Del Balso, M. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).

[54] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.

[55] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML (2014).

[56] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

[57] Thakur, R., Rabenseifner, R., and Gropp, W. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications 19, 1 (2005), 49–66.

[58] Tian, Y., Gong, Q., Shang, W., Wu, Y., and Zitnick, C. L. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. Advances in Neural Information Processing Systems (NIPS) (2017).

[59] Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on (2012), IEEE, pp. 5026–5033.

[60] van den Berg, J., Miller, S., Duckworth, D., Hu, H., Wan, A., Fu, X.-Y., Goldberg, K., and Abbeel, P. Superhuman performance of surgical tasks by robots using iterative learning from human-guided demonstrations. In Robotics and Automation (ICRA), 2010 IEEE International Conference on (2010), IEEE, pp. 2074–2081.

[64] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (2012), USENIX Association, pp. 2–2.