
Ray AIR Technical Whitepaper

Ray Team, August 2022


This doc overviews the technical design and value proposition of Ray AI Runtime (AIR), a
scalable toolkit for ML applications. It should provide an understanding of whether AIR is a good
fit for your use cases, whether you are an individual ML scientist / SWE or ML platform builder.

The intended audience of this document is ML practitioners and ML platform builders who are
familiar with ML and already have a basic understanding of Ray. For an introduction to Ray and
AIR, see the AIR Getting Started Guide. For a deep dive into Ray's system architecture, see the
Ray 2.x Architecture.


Project Overview
    When would you use AIR?
    When to NOT use AIR?
    End-to-End API
    Why AIR?
    Related Systems
AIR Technical Overview
    Stateless Compute
        Usage Patterns
        Example: Simple Batch Inference
        Compute Strategies
        Pipelining Optimization
        Separate GPU Stage Optimization
        Distributed Shuffle
        Memory Management
        Failure Model
    Stateful Compute
        Usage Patterns
        Example: Grid Search for RL Experiment
        Compute Strategies
        Memory Management
        Failure Model
    Advanced Composite Workloads
        Usage Patterns
        Example: Pipelined and Shuffled Data Ingest
        Compute Strategies
        Memory Management
        Failure Model
    Online Serving
        Usage Patterns
        Compute Strategies
        Memory Management
        Failure Model
    Autoscaling
    Cluster Scalability
Ecosystem Map
Benchmarks

Project Overview
The Ray team has worked closely with ML users and advanced infrastructure groups since 2018.
AIR is our effort to synthesize the lessons we have learned into a simple toolkit for the
community.

At a high level, AIR is a scalable and unified toolkit for ML applications. The Ray ML libraries
(Datasets, Train, RLlib, Tune, and Serve) make up this toolkit, and they offer specific integration
points for 3rd party libraries and services (e.g., PyTorch, XGBoost, MLFlow, custom data
sources). The AIR API is designed to take advantage of Ray as a flexible compute layer, but
does not directly expose low-level Ray task and actor APIs to end users.

Evolution of the Ray stack and target users. AIR unifies the previously independent Ray libraries
into a toolkit that works seamlessly with the ML ecosystem, enabling organizations to leverage
Ray with less custom platform and integration work.
When would you use AIR?
Because AIR is a collection of libraries, the first way to use AIR is à la carte: to scale a single
workload such as training, tuning, or inference. Second, AIR libraries also seamlessly compose
with each other for scaling ML end-to-end. Finally, the unified API also enables AIR to be easily
used for platform needs, e.g., a common API for running ecosystem libraries.

Data scientists can scale individual workloads and end-to-end workflows with AIR.

ML engineers can leverage AIR's scalable abstractions for platform needs.

End-to-End API
The following is a condensed (non-runnable) version of the AIR Getting Started guide. It
demonstrates how AIR unifies the usage of different distributed frameworks (e.g., XGBoost,
PyTorch, Horovod; the first two are shown below). The following diagram shows the end-to-end ML
workflow we will be overviewing, which covers model training, tuning, scoring, and serving:
Let's get started by loading some data! We can do that by using the `ray.data` APIs for loading
distributed datasets, here from a mock S3 bucket:

# Create distributed dataset and preprocessing pipeline.


train_dataset = ray.data.read_parquet("s3://bucket/training")
preprocessor = ray.data.preprocessors.StandardScaler()

Above, we've loaded our training dataset, as well as created a simple preprocessor pipeline
consisting of just a single preprocessor. Next, we'll train a model using the framework of our
choice: For frameworks like Torch, a custom training loop can be specified:

XGBoost vs Torch training setup:


trainer = ray.train.xgboost.XGBoostTrainer(
    scaling_config={"num_workers": 4},
    datasets={"train": train_dataset},
    preprocessor=preprocessor,
    **xgboost_params,
)

trainer = ray.train.torch.TorchTrainer(
    torch_train_loop,
    scaling_config={"num_workers": 4},
    datasets={"train": train_dataset},
    preprocessor=preprocessor,
    **torch_params,
)

# Fit the given Trainer on the distributed dataset.


result = trainer.fit()
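
For readers curious what the omitted `torch_train_loop` might contain: the following is a
minimal, hypothetical sketch of such a user-defined training function (the model, data, and
hyperparameters are placeholders, not the guide's actual example):

import torch
from ray.air import session
from ray.train.torch import prepare_model

def torch_train_loop(config=None):
    # Wrap a placeholder model for data-parallel training across Train workers.
    model = prepare_model(torch.nn.Linear(4, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(2):
        # A real loop would iterate over the "train" dataset shard here.
        loss = model(torch.randn(8, 4)).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Report metrics (and optionally checkpoints) back to AIR after each epoch.
        session.report({"loss": float(loss)})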
Note that some important code is omitted from the condensed example (most notably the body of the
Torch training loop; a hypothetical sketch is given above). The takeaway here is that AIR lets you
switch between frameworks with ease, though it doesn't take away the need to write your Torch or
XGBoost model code. Next, we show how to run a hyperparameter sweep across a range of parameters
via random sampling:

# Optionally, use Tune to sweep a range of hyper-params for training.


tuner = Tuner(
trainer,
param_space={"params": {"model_size": tune.randint(1, 9)}},
tune_config=TuneConfig(num_samples=5, metric="train-logloss"),
)
result_grid = tuner.fit()

# Retrieve the best result for batch scoring.


result = result_grid.get_best_result()

The above could easily have been run on either a single node or large cluster. Under the hood,
this simple API is possible due to Ray's native support for stitching together distributed
computations.

Scoring the trained model is next. AIR comes with a simple `BatchPredictor` utility that uses Ray
tasks and actors under the hood to perform efficient inference on any distributed dataset:

XGBoost vs Torch batch predictor setup:


bp = BatchPredictor.from_checkpoint(
    result.checkpoint, XGBoostPredictor)

bp = BatchPredictor.from_checkpoint(
    result.checkpoint, TorchPredictor)

# Load historical data, run batch prediction, and write out results.
historical_dataset = ray.data.read_parquet("s3://bucket/historical")
predict_dataset = bp.predict(historical_dataset)
predict_dataset.write_csv("s3://bucket/predict_out")

Finally, we can serve our model in Ray as well, by passing the model checkpoint to the Serve
library. Here we sketch how to create a trivial serving pipeline, which has no other business logic
beyond the model inference step:

# Start a Serve session to test serving the model.


serve.start()

XGBoost vs Torch serving setup:


deployment = ray.serve.PredictorDeployment.deploy(
    XGBoostPredictor, result.checkpoint)

deployment = ray.serve.PredictorDeployment.deploy(
    TorchPredictor, result.checkpoint)

# Inspect the HTTP endpoint of the Serve deployment.


print(deployment.url)

That's it for the main AIR primitives! For a runnable version of the above overview, check out the
AIR Getting Started guide.

Why AIR?
By leveraging Ray as a flexible compute layer, AIR enables users to build and run ML workflows
of any kind in a single Python script. This solves the following problems:

1. Friction going from development to production.


a. One common pain point for teams is that taking machine learning code from
development to production often requires a handoff from data scientists to ML
engineers, and this conversion can be very expensive and time-intensive.
b. AIR enables ML engineers to scale common ML workloads from a single
machine to large clusters without requiring a separate way of running: the same
code scales up seamlessly and robustly.
2. Stitching together many systems for distributed machine learning.
a. Scaling batch ML workloads, such as training a model and then running batch
inference, typically requires engineers to stitch together a variety of systems
(AirFlow, Spark, Distributed TensorFlow, etc.).
b. With AIR, these workloads can run in a single system and scale out with top
performance. The programmability of AIR also allows you to easily express new
ML workflows such as continual training.
3. Migration fatigue/rapid deprecation due to the rapidly changing nature of the
ecosystem.
a. Many ML infrastructure teams are now experiencing “migration fatigue”. Teams
that previously standardized on TFX or Kubeflow are now unable to support
emerging workloads, while teams that didn’t standardize are getting pushback
from clients that the platform is changing too frequently.
b. Ray AIR is easy to evolve due to its flexible distributed system foundation. As
examples, users are using Ray today for advanced serving pipelines, hyper-scale
model training, and cutting edge hyperparameter tuning algorithms.

Related Systems
AIR's primary value is in scalably distributing and gluing together ML frameworks via simple
Python APIs, bridging previously siloed systems and API surfaces. This means that AIR
complements many offerings in the ML ecosystem in the areas of training, tracking, storage,
etc., as shown in the following figure:

AIR built-in libraries and the integration with ecosystem frameworks. For a full overview of the
AIR ecosystem, check out the AIR documentation.

The following table lists a few common categories of systems and how AIR relates to them.
Note that this table focuses on the ML Ecosystem. For comparisons with more systems, refer
to the related work section of the Ray Architecture Whitepaper.

● ML Tracking and Observability Tools (MLFlow, W&B, Arize), Training Frameworks (Torch,
TensorFlow, Lightning, JAX), and ML Feature Stores (Feast, Tecton). Relationship: Complement.
AIR focuses on AI compute, and hence (1) can support any kind of distributed training framework
via its Train API, and (2) is designed to integrate with other ecosystem tools for data storage,
monitoring, and tracking.

● ML Platforms (SageMaker, Azure ML, Vertex AI, Databricks). Relationship: Complement /
Alternative. As open source systems, Ray and AIR can be leveraged within hosted ML platforms, or
they can be used to build custom ML platforms.

● Data Processing Systems (Spark, Dask). Relationship: Complement / Alternative. AIR includes a
Dataset library for fast and seamless last-mile data preprocessing. However, AIR is not an ETL
tool and has limited relational data processing support.

● Workflow Orchestrators (Argo, AirFlow, Metaflow) and MLOps Frameworks (ZenML, Lightning).
Relationship: Complement / Alternative. As a more powerful general-purpose distributed framework,
Ray Core subsumes the need for DAG orchestration / system orchestration, because the required
distributed steps can run natively in Ray as libraries. However, these orchestrators / frameworks
also provide management and reporting capabilities that complement AIR (e.g., when running Ray as
a step inside AirFlow or Lightning).

● Framework-Specific Toolkits (TorchX, TFX). Relationship: Alternative. AIR is designed to be
framework-agnostic from the ground up and does not create any lock-in.

● Container-Based ML Workflow Frameworks (Kubeflow) and Task-Based ML Workflow Frameworks
(FBLearner Flow, Flyte). Relationship: Alternative. AIR has similar goals to these ML workflow
frameworks, differing in two aspects: (1) Ray supports in-memory distributed data via its object
store for better performance and flexibility, and (2) compared to container-based frameworks, AIR
instead proposes a high-level Pythonic API.

AIR Technical Overview


ML workflows include both stateless and stateful computations. For example, workloads such as
batch inference apply a model inference function over each record in a dataset independently in
parallel. On the other hand, computations like training and tuning rely on updating distributed
model state in training workers. AIR libraries generally implement stateless computations using
Ray tasks, and stateful computations using actors. In some cases AIR will use actors to improve
performance of stateless computations by caching state (e.g., model setup for batch inference).
This section covers examples of how the AIR Dataset, Tune, and Train libraries execute stateless
and stateful workloads using Ray primitives. These primitives are also sometimes coordinated
together in composite workloads (e.g., a job combining distributed training, tuning, and data
loading).

For each computation type, we cover:


1. Usage Patterns: what use cases fall into this category?
2. Compute Strategies: how does Ray execute the workload?
3. Memory Management: how is memory managed?
4. Failure Model: how are faults handled?

Jump to:
1. Stateless Compute
2. Stateful Compute
3. Advanced Composite Workloads
4. Online Serving

Stateless Compute
AIR uses Ray's Dataset library for handling stateless computations: for example, data
preprocessing and inference. Under the hood, Datasets execute most computations using Ray
tasks, but may also use actors to cache state to speed up computations.
Stateless transformations can involve only tasks (left), worker actors holding soft model
state (right), or a combination of both (e.g., CPU read tasks to load data, followed by GPU inference
actors to do the inference; not shown).

Usage Patterns
There are a couple common usage patterns for Datasets in AIR:
● The Dataset is passed along with a model checkpoint into BatchPredictor, to execute
batch inference over the dataset (example above).
● The Dataset is passed into Ray Train to define the training data for the model.

To better understand how Datasets works for stateless compute, let's walk through an example
of using Datasets for batch inference on a large CPU cluster.

Example: Simple Batch Inference

Datasets implements embarrassingly parallel computations. In batch inference, the computation executes
over a distributed dataset and a side input (model checkpoint).

Let's consider the batch inference example from the overview section, which consists of (1)
reading data from S3, (2) loading a model from a checkpoint, (3) executing prediction, and (4)
writing out results. We break down each step below:

(1) historical_dataset = ray.data.read_parquet("s3://bucket/historical")

The first step creates the dataset. Note that the data isn't loaded into memory yet; initially only
the metadata of the dataset is loaded. This metadata loading is done in parallel using Ray tasks.
(2) bp = BatchPredictor.from_checkpoint(result.checkpoint, XGBoostPredictor)

This creates a predictor object from the given checkpoint. The checkpoint is not yet loaded into
memory.

(3) predict_dataset = bp.predict(historical_dataset)

The bulk of the action happens here. When `predict()` is called, the BatchPredictor library calls
`dataset.map_batches(model_cls, compute=ActorPoolStrategy(...))`, where `model_cls` is the
inference model and the compute strategy is configured to be an actor pool. This tells Datasets
to create a pool of worker actors as specified, load the model in each actor, and then use the
pool of actors to perform inference on the given dataset.
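
As a rough, hypothetical sketch of this pattern (not the actual BatchPredictor internals): a
callable class holds the expensive-to-set-up model as soft state, and `map_batches` runs it on a
pool of actors:

import pandas as pd
import ray
from ray.data import ActorPoolStrategy

class ModelCallable:
    def __init__(self):
        # In real usage, a model would be loaded from a checkpoint here, once per actor.
        self.predict = lambda df: df.assign(prediction=df["x"] * 2)

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        return self.predict(batch)

ds = ray.data.from_items([{"x": float(i)} for i in range(1000)])
preds = ds.map_batches(
    ModelCallable,
    compute=ActorPoolStrategy(2, 4),  # pool of 2 to 4 worker actors
    batch_format="pandas",
)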

(4) predict_dataset.write_csv("s3://bucket/predict_out")

The result of inference is another Dataset which is kept in the Ray object store, and can be
written to persistent storage.

In the next sections, we dive into the details of the above example.

Compute Strategies
Datasets can use either Ray tasks or actors to execute transformations on blocks. Tasks are
generally preferred since they can be elastically scheduled, not requiring any pool size to be
specified. However, actors can be used if the transformation requires an expensive setup of soft
state. For example, loading a large model from a checkpoint can be quite costly, so it makes
sense to load it once in an actor, and re-use it multiple times for inference on data batches.

When using tasks, Datasets also uses a SPREAD scheduling strategy to ensure tasks (and
their output objects) are evenly balanced across the cluster. When using actors, the actors are
created ahead of time, and data is moved (prefetched by the scheduler) to the actor prior to
executing the transformation.
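
For intuition, the SPREAD strategy mentioned above is a Ray Core scheduling option; a simplified
standalone sketch (not Datasets internals) looks like this:

import ray

@ray.remote(scheduling_strategy="SPREAD")
def transform_block(block):
    # A stateless per-block transformation; SPREAD balances these tasks
    # (and their output objects) across the nodes of the cluster.
    return [x * 2 for x in block]

results = ray.get([transform_block.remote(list(range(10))) for _ in range(8)])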

Pipelining Optimization
Datasets are stored in object store memory, and large datasets will be spilled to disk. However,
for many stateless transformations there is no need to keep the intermediate data in memory:
it can be streamed to and from storage to improve performance. This can be done by using the
pipelining feature of Datasets.

Both Train and BatchPredictor offer pipelined data loading options. When using pipelined
processing, Datasets will only load a fraction of the data into memory at a time in windows to be
processed. This reduces memory usage and in some cases can considerably speed up
computation by avoiding disk spilling. The following code illustrates:
(1) predict_dataset = bp.predict_pipelined(
historical_dataset, bytes_per_window=10e9)
(2) predict_dataset.write_csv("s3://bucket/predict_out")

In these two lines, we use the pipelined prediction mode of BatchPredictor, specifying a window
size of roughly 10GB (10e9 bytes). This instructs BatchPredictor to load windows of about 10GB,
run inference on each, and write the results out. These stages run concurrently in a pipelined
fashion to avoid idle time in the computation.
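
The pipelining primitive that this mode is assumed to build on can also be used directly on a
Dataset; a rough sketch (the per-batch transformation is a placeholder):

import ray

ds = ray.data.read_parquet("s3://bucket/historical")  # dataset from the example above
pipe = ds.window(bytes_per_window=10e9)               # process roughly 10GB of data at a time
pipe = pipe.map_batches(lambda batch: batch)          # placeholder per-batch transformation
for batch in pipe.iter_batches():                     # consume results window by window
    pass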

Separate GPU Stage Optimization


When using BatchPredictor with GPUs enabled, preprocessing will be executed separately
in CPU tasks unless `separate_gpu_stage=False` is specified. This optimization ensures that
GPU actors don't block on expensive CPU preprocessing, which can be the bottleneck in many
workloads.

Distributed Shuffle
Sometimes, a shuffle operation (e.g., distributed group by or sort) is necessary to preprocess or
re-organize the data. While AIR is not aiming to be a general purpose ETL framework, its
Dataset library does support large-scale data shuffles, up to the scale of 100TiB, for the
purposes of computing preprocessor statistics, or improving the shuffle quality of ML training.
See the performance tips and academic paper on Exoshuffle in Ray for more information.
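
As a brief, hedged sketch of the shuffle-style operations the Dataset library supports (the
column names here are made up):

import ray

ds = ray.data.from_items([{"label": i % 3, "value": float(i)} for i in range(1000)])
shuffled = ds.random_shuffle()          # globally shuffle all rows
counts = ds.groupby("label").count()    # distributed group-by aggregation
ordered = ds.sort("value")              # distributed sort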

Memory Management
Datasets uses Ray objects to represent data loaded into memory. Since each individual Ray
object has to be loaded fully into memory for access, Datasets are partitioned into blocks of Ray
objects (e.g., commonly <1GiB in size). Keeping block sizes under control is important to avoid
out-of-memory errors. Datasets employs the following strategy to right-size blocks:

● When reading data, the parallelism is automatically selected to try to keep blocks less
than 512MiB in size. This balances the overhead of having too many small blocks against
the risk of having blocks that are too large. The read parallelism can also be overridden
manually, as sketched below.
● A warning is produced if blocks are larger than this size.
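
A hedged example of the manual override mentioned above (the path and parallelism value are
hypothetical):

import ray

ds = ray.data.read_parquet("s3://bucket/training", parallelism=200)
print(ds.num_blocks())  # the number of blocks backing the dataset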

When object data doesn't fit into memory, blocks are spilled to local disk via Ray's object spilling
mechanism.
Failure Model
AIR offers transparent fault tolerance for most stateless computations via lineage reconstruction.
This means Ray will reconstruct Dataset blocks if they are lost due to node failures by
re-submitting the necessary tasks, enabling workloads to scale to large clusters.

Limitations:
● Note that BatchPredictor does not currently provide fault tolerance. A fix for this is
targeted for Ray 2.1.
● Similar to systems like Spark, Ray fault tolerance does not apply to driver / head-node
failures; see below.

At the cluster level, a crash of the Ray cluster metadata server (GCS) will kill all jobs in the
cluster unless GCS is deployed in HA mode. We strongly recommend only considering GCS HA
for online serving workloads, since for batch workloads (1) machine failures involving the cluster
head node are rare, (2) a failure of the head node will typically also kill your job driver, and (3)
overload-induced crashes are typically not helped by restarting the GCS.

Stateful Compute
Distributed training and hyperparameter tuning are stateful workloads. These workloads create
actors that hold the current state of the training / tuning job, and placement groups are used for
gang-scheduling of these actors.

If a distributed training job is run under Tune, a tree of actors may be created, consisting of one
“driver” actor for each tuning trial, and sub-actors that implement the parallel training:

A tree of actors implements a nested Tuning / Training job. Each model training trial consists of a driver
actor and multiple worker actors held in a placement group. The Tune driver coordinates the overall
computation, creating and destroying these trials to run the experiment.
Usage Patterns
The three AIR libraries that leverage stateful compute are Tune, Train, and RLlib. Since both
Train and RLlib run on Tune by default (they implement Tune's Trainable interface for execution),
there is really only one pattern to be aware of: how Tune launches a distributed "trial", or group
of actors executing a computation.

Note that Tune also supports running non-distributed (single-threaded) trials, as well as trials
defined as a function instead of as a class. However, under the hood these are treated as
special cases of "distributed trial" and implemented using actors.

Example: Grid Search for RL Experiment


In this example, we'll consider the execution of a grid search over RLTrainer jobs. We use
RLTrainer as an example since it's simple and self-contained. The Composite Workloads
section covers more complex training jobs involving both data ingest and training.

> trainer = RLTrainer(..., scaling_config={"num_workers": 4})


> trainer.fit()

This call to `trainer.fit()` would create a single instance of an RLTrainer trial in the cluster.
Scheduling of this trial would proceed as follows:
1. A placement group request for 1 learner actor and 4 worker actors is created in the
cluster (5 bundles of 1 CPU each); a standalone sketch of such a request is shown after
this list.
2. Once the placement group is scheduled, the actors are created and learning proceeds.
3. When the trial finishes, the actors and placement group are destroyed and the resources
are freed up in the cluster.
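
For intuition, the placement group request from step 1 corresponds roughly to the following Ray
Core call (AIR issues and manages this for you; the sketch is only illustrative):

import ray
from ray.util.placement_group import placement_group

ray.init()
pg = placement_group([{"CPU": 1}] * 5)  # 5 bundles of 1 CPU each
ray.get(pg.ready())                     # blocks until all bundles are reserved in the cluster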

> tuner = Tuner(trainer, tune_config=TuneConfig(num_samples=10))


> tuner.fit()

This snippet shows a tuning sweep layered on top of the singleton trainer. This will create a
nested tree of actors where each subtree is an instance of the RLTrainer trial. Note that
`num_samples=10`, so when the sweep is executed with `tuner.fit()`, we will run 10 Trainer trials
in total instead of just one. Scheduling proceeds similarly:
1. Placement group requests are created for each trial.
2. As placement groups are scheduled, the actors for each trial are created in the
corresponding placement group.
3. As trials finish, placement groups and the actors in them are destroyed.

Compute Strategies
Train and Tune are both structured as a driver that pulls results from a pool of actors that
execute the workload. When using Tune and Train together, the Train driver is executed as an
actor and nested as a subtree of Tune.
Once created, how does a Train/Tune actor execute internally? There are two behaviors to be
aware of: how Tune pulls results from the worker actors, and how each actor internally
implements its computation:

Monitoring trial status: Tune periodically issues actor method calls to the driver actor of each
trial, which blocks and returns periodic metrics.

Internal execution: Worker actors have full freedom on how they implement the computation.
They can choose to leverage Ray fully (i.e., use actor calls between the actors to communicate),
or minimally use Ray as a process scheduler (i.e., use libraries such as NCCL to communicate
out of band between the actors).

Memory Management
Stateful workloads typically use the Ray object store only for setting up the trial, and
communicating back results. For example, Tune may use the object store to send the initial
checkpoint used to set up a trial, and to retrieve the checkpoint returned as the result of a trial
run.

Hence, stateful workloads primarily consume Python heap memory, which is not managed by
Ray. This means that the user should take care not to use too much memory in individual Tune
trials, as this can cause actors to be killed by the OS. If the amount of memory required per
actor is known, this can be expressed in the trial scaling configuration as `memory` resource
requests to the scheduler, or by expressing additional CPU requirements (e.g., scheduling 1
actor per every 2 CPUs).
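
As a hedged illustration, extra CPU slots can be reserved per worker through the trial scaling
configuration, which indirectly bounds how many worker actors share a node's memory (the 2-CPU
figure and the training function are placeholders):

from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    torch_train_loop,  # user-defined training function, as in the earlier sketch
    scaling_config={"num_workers": 4, "resources_per_worker": {"CPU": 2}},
)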

Certain stateful workloads will use the Ray object store internally. For example, RLlib uses Ray
objects to broadcast weights to rollout workers and to collect experience batches from worker to
learner actors. In this case, the object store memory management protocols for stateless
workloads apply as well.

Failure Model
Jobs involving stateful computations primarily rely on checkpoint-based fault tolerance. Tune will
restart distributed trials from their last checkpoint as configured in its failure configuration. With a
configured checkpoint interval, this means that Tune can run trials effectively on clusters
consisting of preemptible / spot instances. In addition, it is possible to resume entire Tune
experiments from the experiment-wide checkpoint in case of whole-cluster failure.
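
A minimal sketch of configuring trial retries, assuming the Tuner setup from the earlier examples
(the retry count is a placeholder):

from ray import air, tune

tuner = tune.Tuner(
    trainer,  # any AIR Trainer, e.g., the TorchTrainer from the earlier examples
    run_config=air.RunConfig(
        failure_config=air.FailureConfig(max_failures=3),  # retry each failed trial up to 3 times
    ),
)
result_grid = tuner.fit()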

Certain libraries (e.g., RLlib) offer further fault tolerance options such as recovering individual
worker failures, which are available as a further optimization.
Advanced Composite Workloads
What if a training workload needs to perform data processing tasks during the training run, such
as for preprocessing or shuffling data between epochs? This is where Ray's support for both
task and actor-based workloads shines--- training actors can seamlessly leverage tasks (i.e.,
Dataset operations) for data processing. In this section, we explore an example of an advanced
Training data ingest workload, and discuss the underlying scheduling strategy for these types of
workloads.

Usage Patterns
A common pattern explicitly supported in AIR is data-parallel training. This is used in model
trainers that subclass DataParallelTrainer. In this pattern, a source Dataset is passed
into a Train trial. The dataset is split into equal-sized pieces and passed to each Train worker
individually. This means that during trial execution, tasks and actor computations may run
concurrently to implement data loading and distributed training.

Example: Pipelined and Shuffled Data Ingest


This example explores the most advanced case of data parallel training, in which data is (1)
ingested in a pipelined fashion, and is also (2) globally shuffled to maximize the randomness of
training batches. Note that very few training systems today support global shuffling. Common
approaches for this outside of Ray typically involve gluing together multiple distributed systems
(i.e., calling into a separate Spark cluster to shuffle data between Horovod epochs, etc.).

Visual depiction of the example we overview here. Per-epoch data shuffles execute concurrently with the
distributed training. Diagram from Pipelining Compute Guide.

Setting up the above pipeline looks as follows (see Configuring Training Datasets for more
information):

> data = ray.data.read_parquet("s3://large-dataset/...")


> trainer = TorchTrainer(
      ...,
      scaling_config={"num_workers": 3},
      datasets={"train": data},
      dataset_config={
          "train": DatasetConfig(
              use_stream_api=True,
              stream_window_size=-1,
              global_shuffle=True,
          ),
      },
  )
> trainer.fit()

Under the hood, AIR generates the necessary Dataset operations to set up the data ingest
pipeline. We set the stream window size to "-1", which tells Datasets to, for each epoch, buffer
the entire dataset in memory as a single window. This window of data will then be shuffled and
split up to send to training workers. The loading, shuffling, and splitting stages all involve
running distributed tasks. Since this work is repeated per epoch, the tasks run concurrently with
the training actor processes in a pipelined fashion.

Compute Strategies
Composite workloads leverage both task and actor-based computation simultaneously. This can
lead to resource reservation challenges, as described below.

While trial actors reserve their resources up-front for a trial, we cannot do so for stateless tasks
since these are meant to be elastic. This presents a conundrum for scheduling: if a trial is using
both actors and tasks, and all resources are reserved by actors via placement groups, how will
the data loading tasks execute?

AIR resolves this by having Tune set a flag on its placement groups allowing them to only
reserve up to 80% of a node's CPUs by default. This configurable option ensures that some
CPU slots will always be available on each node for executing stateless computations, as may
be needed for data-parallel training. Often, for trials that leverage GPUs, there are extra CPU
slots available on machines anyway, and so this behavior is a no-op.

Memory Management
Composite workloads pose additional challenges for the memory management subsystem, as it
must handle stateful actors reading data objects created by stateless tasks. Let's consider both
non-pipelined and pipeline data ingest cases:

Non-pipelined:
If the node hosting the training actors has enough object store capacity to fit all referenced data
objects in memory, then training is relatively straightforward: after preprocessing runs, the data
blocks are downloaded to each node and stay in memory, and training workers iterate over the
in-memory data. If there is not enough memory, the blocks are spilled to disk and re-loaded as
they are requested.

Pipelined:
In this case, data is generated on the fly by tasks as the ingest pipeline executes, and is
downloaded to the right node upon request by training actors. If the task happens to execute on
the same node as the actor reading it, then the fetch is a no-op and the trainer can retrieve the
data from shared memory. Typically, you should try to configure pipelined training to avoid disk
spilling (i.e., make the window size small enough that the pipeline working set fits in the
aggregate object store memory of the cluster).

Failure Model
Composite workloads inherit the fault tolerance strategies of both stateless and stateful
workloads. This means that lineage reconstruction applies to the stateless portion of the
workload, but application-level checkpointing still applies to the overall computation overall,
retaining the best of both worlds.

Online Serving
Finally, AIR makes it easy to go from training and scoring to online serving using the same
infrastructure. This comes for free since AIR checkpoints work out of the box with Ray Serve.
Under the hood, Serve uses Ray actors to serve online RPC requests. Data flow between the
actors is handled via the Ray object store and actor calls.

Usage Patterns
The primary usage of online serving capabilities is to host the predictor behind an HTTP
endpoint, exposing the model as a service. PredictorDeployment is a Ray Serve
Deployment that can load checkpoints and call a predictor to perform online inference. The
PredictorDeployment exposes the model behind an HTTP endpoint, at port 8000 by default.
Users can configure the deployment to scale out to multiple replicas, or utilize Serve's
autoscaling capabilities to adjust the number of replicas based on incoming traffic.
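
As a hedged sketch building on the end-to-end example above, scaling the deployment out to
multiple replicas might look like the following (exact option names may vary across Ray versions):

from ray import serve

serve.start()
deployment = serve.PredictorDeployment.options(
    num_replicas=2,  # or: autoscaling_config={"min_replicas": 1, "max_replicas": 5}
)
deployment.deploy(XGBoostPredictor, result.checkpoint)
print(deployment.url)  # HTTP endpoint, served on port 8000 by default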

Serve also supports model composition. For example, using Serve, you can combine models
trained on different categories of data to boost overall prediction accuracy; or dynamically select
a model out of a pool to perform A/B testing given input attributes.

Online serving on Ray is optimized for low latency, high throughput, and scalability. The desired
workload often consists of multiple replicas of a single model, or multiple models in production
serving a single request.
Compute Strategies
Online serving utilizes Ray actors. Under the hood, Serve manages a pool of stateless actors to
serve requests. Some actors listen on RPC ports to accept incoming requests, and these actors
call other actors to perform the predictions.

Requests are automatically load balanced using a round-robin algorithm to the pool of actors
hosting the models. Load metrics are sent to the Serve component to perform autoscaling.

Serve also performs on-demand micro-batching to improve throughput. This happens within
the actor hosting the model. The first request is buffered while Serve waits for up to
max_batch_size requests to arrive, or for up to batch_wait_timeout_s; once a full batch is
ready (or the timeout expires), the batch is processed by the model. Both parameters are
configurable. This approach greatly improves throughput because ML models are optimized to
utilize hardware parallelism to process multiple vectors at the same time.
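
As a hedged illustration of this micro-batching behavior, Serve's batching decorator can be
applied to a handler method as follows (the doubling "model" is a placeholder):

from ray import serve

@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, inputs):
        # `inputs` is a list of up to max_batch_size items buffered for at most
        # batch_wait_timeout_s seconds; a real model would run vectorized inference here.
        return [x * 2 for x in inputs]

    async def __call__(self, http_request):
        value = float(await http_request.body())
        return await self.handle_batch(value)

serve.start()
BatchedModel.deploy()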

Memory Management
Typically, the models are hosted in an actor’s main process memory. Usually, the memory usage
only scales based on request load, because request payloads are stored in the actor’s process
memory for the duration of the request.

During request handling, the communication between actors is through direct actor calls. This
means small intermediate objects are transferred directly between actors without going through
the Ray Object Store. Large objects are put into the Ray Object Store first, then received via
zero-copy read by the model.

Optionally, model weights themselves can be stored in the Ray Object Store so that replicas can
share physical memory. This method is still experimental and only works for certain frameworks
(e.g. PyTorch).

Failure Model
Online serving workloads typically have high requirements for availability. Serve ensures high
availability through a layered approach: Ray Serve, Ray Core, and the Kubernetes Operator.

In the Ray Serve layer, Serve makes sure application errors are well handled and do not crash
the actor. If the application segfaults or runs out of memory, Serve will create a new actor to
replace the failed one. You can also define a custom health check method for Serve actors.

Ray Serve is built on top of Ray Core abstractions. In particular, Serve leverages Ray Core to
restart failed actors. Serve can take advantage of Ray high availability (HA) deployment modes.
In HA mode, when the Ray head node goes down, Serve will enter a degraded state. In this
state, all actors on the worker nodes should be able to continue to serve traffic, but Serve will no
longer be able to perform autoscaling or health checks until the head node is recovered.
In production, Ray Serve should be hosted on Kubernetes. In particular, the KubeRay operator
ensures that Ray nodes are healthy and it will create new nodes if needed. The operator can
also ensure the Serve application is healthy and it will create a new cluster to replace the
unhealthy one if needed. This enables zero-downtime application upgrades.

Autoscaling
AIR libraries can run on Ray autoscaling clusters. For stateless workloads, Ray will autoscale
automatically if there are queued tasks (or queued Dataset compute actors). For stateful
workloads, Ray will autoscale up if there are pending placement groups (i.e., Tune trials) not yet
scheduled in the cluster.

The following figure shows how the nodes and placement groups interact. Each yellow box
represents a node (machine) in the cluster, and each gray box a placement group or task:

Ray will autoscale down when nodes are idle. A node is considered idle when there is no
resource usage on the node and also no Ray objects in-memory or spilled on-disk on the node.
Since most AIR libraries leverage objects, this means that nodes may be kept if they are holding
objects referenced by workers on other nodes (e.g., Dataset block used by another trial).

The following figure shows how nodes (yellow and red boxes) become eligible for deletion.
Nodes shown in yellow are not eligible for deletion (they contain running tasks, actors, or
objects). Nodes in red have no active resources and are eligible for deletion:
Be aware that autoscaling may result in less than ideal data balancing in the cluster, since
nodes that are started earlier naturally run more tasks over their lifetime. Consider limiting
autoscaling (e.g., by starting from a certain minimum cluster size) or disabling it entirely to
optimize the efficiency of data-intensive workloads.

Cluster Scalability
AIR can scale up to processing hundreds of terabytes of data. This section overviews the
fundamental scalability bottlenecks commonly seen in AIR, which may be useful for practitioners
working at that scale to understand.

The fundamental bottlenecks come from (1) tracking a large number of Dataset block
references, (2) balancing load evenly across the cluster, and (3) the maximum supported Ray
cluster size:
1. When the number of Ray objects in a job increases into the range of millions, this puts
considerable CPU and memory pressure on the owner of the objects (typically the job
driver). This can happen for very large Datasets or when performing shuffle operations
on a Dataset.
2. If load is imbalanced, this can cause hot-spots in network or disk usage in the cluster.
While AIR tries to spread load evenly, this is a common issue to watch out for that can
happen due to misconfigurations or bugs.
3. The Ray max supported cluster size is ~1000 nodes. Beyond this, scalability bottlenecks
in the Ray scheduler are likely to cause problems. Generally speaking, Ray runs better
with fewer large nodes than many small nodes due to its two-level scheduler.
Ecosystem Map
You can find the AIR ecosystem map at the following link:
https://docs.ray.io/en/master/ray-air/air-ecosystem.html#air-ecosystem-map

Benchmarks
We maintain a set of AIR benchmarks here:
https://docs.ray.io/en/master/ray-air/benchmarks.html

These benchmarks are intended to provide guidance on expected performance in common
scenarios. Please reach out to us on GitHub or Discourse if you have workloads that may be
suitable for adding to this page, or if you see worse performance for a workload similar to one of
those included on this page.
