
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale


Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan,
Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He
Microsoft Corporation
{yazdani.reza,samyamr,minjiaz,ammar.awan,chengli1,du.li,elton.zheng,jeff.rasley,shaden.smith,olruwase,yuxhe}@microsoft.com

arXiv:2207.00032v1 [cs.LG] 30 Jun 2022

Abstract—The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically with the largest being of hundred-billion parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts; the target application scenarios can be latency-critical or throughput-oriented; the deployment hardware could be single- or multi-GPU systems with different types of memory and storage, etc. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging.

In this paper, we present DeepSpeed Inference, a comprehensive system solution for transformer model inference to address the above-mentioned challenges. DeepSpeed Inference consists of (1) a multi-GPU inference solution to minimize latency while maximizing the throughput of both dense and sparse transformer models when they fit in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to the GPU memory and compute to enable high inference throughput with large models which do not fit in aggregate GPU memory.

DeepSpeed Inference reduces latency by up to 7.3× over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5× for throughput-oriented scenarios. Moreover, it enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can inference 25× larger models than GPU-only solutions, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).

I. INTRODUCTION

The past several years have witnessed the success of transformer-based models; their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically with the largest being over a trillion parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts technique; the target application scenarios can be latency-critical or throughput-oriented; the deployment hardware could be single- or multi-GPU systems with different types of memory and storage, etc. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging.

Latency Challenges: Using a transformer based model for online scenarios in production requires meeting stringent latency requirements, and thus the batch sizes used are generally small. For small batch sizes, the inference latency of a model is lower bounded by the time it takes to load all the model parameters from memory to registers. Meeting the latency requirements of a transformer model inference therefore is equivalent to achieving adequate overall memory bandwidth. Maximizing effective memory bandwidth at small batch sizes requires reading memory at near peak memory bandwidth for fully-connected (or, linear) layers which contain the majority of the model weights, while also minimizing kernel launch and data movement overhead of other operators like layernorm and softmax. The GeMM implementations and other kernels designed for training primarily focus on maximizing compute utilization at very large batch sizes and are sub-optimal for latency-critical inference.
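As a back-of-envelope illustration of this lower bound (our own numbers, not taken from the paper): a model with P parameters stored in FP16 occupies 2P bytes, so one token-generation step on a single device cannot take less than

    t_min ≈ 2P / B_mem

where B_mem is the attainable memory bandwidth. For a hypothetical 175B-parameter model on a GPU with roughly 1.5-2 TB/s of HBM bandwidth, this gives about 350 GB / (1.5-2 TB/s), i.e. roughly 175-230 ms per step, which is why aggregate bandwidth across many devices is needed to meet tight latency targets.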
In addition, for large models, even the peak memory bandwidth of a single device may not be sufficient to meet inference latency constraints. It requires aggregate memory bandwidth across multiple devices, which needs optimal parallelism strategies for partitioning the model computation across devices that minimize the communication overhead across devices. Such parallelism strategies must cater to the variation in transformer architecture and hardware characteristics.

With respect to transformer architectures, we view them in two broad categories: dense or sparse Mixture-of-Experts (MoE) transformer models. The optimal parallelism strategy depends on the model architecture. For example, tensor and pipeline parallelism work only for dense transformers, while expert parallelism only works for sparse transformers. Moreover, transformer-based MoE models contain both dense and sparse transformer components, requiring a combination of different parallelism techniques to maximize the effective memory bandwidth across devices. Finally, with respect to hardware characteristics, modern clusters have heterogeneous network topology (e.g., intra-node NVLink/NVSwitch and inter-node InfiniBand) which requires further consideration when developing parallelism strategies.

Throughput Challenges: In addition to meeting latency, production workloads also have throughput targets to meet cost budget. At small batch sizes, where the workload is still memory bandwidth bound, the latency of the workload does not increase as long as the computation is entirely overlapped with model weight reads. Therefore, maximizing throughput while meeting the latency SLA requires not only maximizing the memory bandwidth utilization, but also overlapping compute with the model weight reads, and at the same time achieving high compute efficiency at small batch sizes to maximize the batch size whose compute can be overlapped with reading the model weights. Inference kernels must therefore achieve high memory bandwidth utilization and high compute utilization at small batch sizes, whereas training kernels simply need to achieve high compute utilization at much larger batch sizes. This makes developing inference kernels quite challenging.
Moreover, even for throughput bound scenarios with large batch sizes, inference workloads can differ from training workloads in terms of data flow and computation dependencies, requiring novel solutions to achieve high throughput. For example, generative transformers have dependencies between each generated token and the next token, which do not exist during training. As a result, inference incurs a higher memory requirement to keep track of previously generated states. For large models that may require pipeline parallelism to fit the model in memory, this dependency across generated tokens also requires new pipeline schedules to keep all devices busy compared to training scenarios.

Feasibility Challenges under Limited Resources: A model with tens of billions of parameters is simply too large to fit in the memory of a single GPU device, and at hundreds of billions of parameters, it is too large to even fit in the aggregate GPU memory of a single node. For example, inferencing MT-NLG 530B [1] requires about 1TB of GPU memory just to fit the model for inference, requiring over three DGX-2 nodes consisting of over two dozen NVIDIA A100 40GB GPUs. Most data scientists simply do not have access to such GPU resources needed for inference of these massive models.

In this paper, we present DeepSpeed Inference, a comprehensive solution for transformer model inference designed to address the above challenges. DeepSpeed Inference consists of two components:

1) DeepSpeed Transformer: DeepSpeed Transformer is a GPU-only solution, designed to minimize latency while maximizing throughput for both dense and sparse transformer models. It achieves state-of-art latency and throughput for transformer models of all sizes and supports running on a single GPU or scaling to hundreds of GPUs to inference multi-trillion parameter models.

The DeepSpeed Transformer solution is a three-layered system architecture consisting of i) single GPU transformer kernels optimized for memory bandwidth utilization at low batch sizes and high throughput at large batch sizes, ii) a many-GPU dense transformer layer, for scaling dense transformer models across GPUs using tensor-slicing and inference-optimized pipeline parallelism, and iii) a massive-GPU scale sparse transformer layer, designed to scale MoE transformer layers to hundreds of GPUs using a combination of parallelism techniques and communication optimization strategies, while also minimizing single GPU sparse computation overhead using optimized sparse kernels.

By taking this layered approach, where each layer addresses a unique aspect of the latency challenge (batch size, scaling dense models, and scaling sparse models) while being compatible with and built on top of the others, we create a comprehensive system capable of achieving state-of-art latency and throughput at unprecedented scales for both dense and sparse transformer models despite the heterogeneity in batch size, model scale and model characteristics.

2) ZeRO-Inference: ZeRO-Inference is a heterogeneous GPU+CPU+NVMe based solution to address the memory challenge by enabling massive model inference with minimal GPU resources. In contrast to DeepSpeed Transformer, for applications that are less latency sensitive but resource constrained, ZeRO-Inference allows inference of models with hundreds of billions of parameters on a single or multiple GPUs as long as there is enough CPU or NVMe memory to store the model parameters. In addition, even when the model does fit in aggregate GPU memory, ZeRO-Inference delivers better per-GPU efficiency than DeepSpeed Transformer by supporting much larger batch sizes.

The main contributions of the paper are as follows:
• Single GPU transformer kernels for minimizing latency and maximizing throughput via memory-bandwidth-centric fusion schedules and GeMM kernels (Sec. III).
• A many-GPU dense transformer inference system that combines tensor-parallelism to minimize latency with inference-optimized pipeline parallelism schedules and memory optimizations to maximize throughput (Sec. IV).
• A massive-GPU sparse model inference system that combines: i) expert, data, and tensor parallelism, ii) novel communication optimizations and iii) sparse kernel optimizations to scale sparse inference on trillions of parameters across hundreds of GPUs (Sec. V).
• ZeRO-Inference, which leverages CPU, NVMe and GPU memory along with GPU compute to make massive model inference accessible with limited resources (Sec. VI).
• Extensive evaluation of DeepSpeed Inference on a wide range of transformer models covering four aspects: i) For latency sensitive scenarios, DeepSpeed Transformer shows latency reduction over state-of-the-art of up to 1.9× for dense models (up to 175B parameters) and 7.2× for sparse models (a 1T model under 25 ms), while scaling to 256 GPUs at 33% peak memory bandwidth utilization, an unprecedented scale for inference. ii) For throughput oriented scenarios, DeepSpeed Transformer demonstrates over 1.5× gain over state-of-the-art (Sec. VII-C). iii) Evaluation of ZeRO-Inference on GPU resource constrained systems that shows ZeRO-Inference can support inference with 25× larger models than with a GPU-only solution while achieving over 50% of peak hardware performance (Sec. VII-D). iv) Performance analysis and breakdown of the different optimizations discussed throughout the paper (Sec. VII-E).

Despite the diversity in the transformer inference landscape, DeepSpeed Inference offers a versatile solution capable of achieving state-of-art latency and throughput for all variations of transformer model inference: dense or sparse, small or large batches, billions to trillions of parameters, single GPU or across hundreds of GPUs. Furthermore, it democratizes access to large transformer inference by enabling them on systems with limited GPU resources. DeepSpeed Inference is available for everyone to leverage through our open-source repository: https://github.com/microsoft/DeepSpeed.
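For context, the snippet below is a minimal sketch of wrapping a model with the DeepSpeed inference engine described in this paper. It is based on the public deepspeed.init_inference API of that period; argument names and defaults may differ across releases, and the small model, tokenizer, and two-GPU launch are only example assumptions.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Wrap the model with the DeepSpeed inference engine: tensor-slicing degree,
# data type, and optimized-kernel injection are the main knobs discussed here.
engine = deepspeed.init_inference(
    model,
    mp_size=2,                        # tensor-parallel degree (assumes e.g. `deepspeed --num_gpus 2`)
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # swap in the fused transformer kernels of Sec. III
)

input_ids = tokenizer("DeepSpeed is", return_tensors="pt").input_ids.to(
    torch.cuda.current_device())
with torch.no_grad():
    output = engine.module.generate(input_ids, max_new_tokens=8)
print(tokenizer.decode(output[0]))
```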
II. BACKGROUND AND RELATED WORK

a) Dense Transformer Models: The size of transformer-based language models has been increasing by 10× each year for the past few years, from models with a few hundred millions of parameters [2], [3], [4], [5], to models with dozens of billions of parameters [6], [7]. Recently, GPT-3 175B [8], Gopher 280B [9], and MT-NLG 530B [1] further push this limit to hundreds of billions of parameters. As larger models have demonstrated outstanding accuracy on various natural language understanding and generation tasks, this exponential growth in model scale would continue as long as the system and hardware technology can keep up with it.

b) Sparse Transformer Models: The success of scaling dense language models has motivated researchers and practitioners to further propose the Mixture-of-Experts (MoE) technique, which introduces sparsity in transformer models [10]. Typical transformer model [11] architectures have transformer blocks that consist of two consecutive sub-layers, a self-attention sub-layer followed by a position-wise feed-forward (FF) block. MoE models add conditional computation by replacing the feed-forward blocks with a position-wise MoE layer with a variable number of experts and a top-k gating function. Increasing the number of experts allows scaling the MoE model size with only a sublinear increase in computation cost, greatly reducing the training cost of the model. However, MoE models can be up to 8× larger than their quality-equivalent dense models [12], [13], [14], [15], requiring much higher aggregate memory bandwidth to achieve comparable latency during inference.

c) System Technology for Memory and Performance Scaling: The major challenge in scaling model sizes resides in the memory bottleneck. To satisfy the memory requirement, prior works have proposed various parallelism strategies to use the aggregated GPU memory within and across nodes.

Tensor parallelism [16] splits model layers horizontally across GPU devices. As the number of GPUs increases for tensor-slicing, two primary trade-offs show up: (i) lower compute granularity due to the smaller local problem size, and (ii) all-reduce communications in each transformer layer to aggregate the partial activations. When scaling across node boundaries, the inter-node bandwidth is limited compared to the fast intra-node connections, thus tensor parallelism can cause a significant latency degradation. In practice, tensor parallelism is often restricted to groups of GPUs sharing the high-bandwidth interconnect within a node (e.g., NVIDIA NVLink).

Pipeline parallelism [17], [18], [19] splits a model vertically into pipeline stages and uses micro-batching to hide pipeline bubbles. It only requires communication for data aggregation between adjacent pipeline stages, and is thus more efficient to scale across nodes. However, model splitting and micro-batching could pose functionality, performance and convergence related restrictions for pipeline parallelism.

ZeRO [20] takes a different approach and removes the memory redundancies in conventional data parallelism by partitioning model states across the data-parallel processes instead of replicating them. 3D parallelism [21] combines data, tensor, and pipeline parallelism efficiently to scale to models of trillions of parameters.

Expert parallelism [22] places different experts on different GPUs and executes them in parallel. Each expert only processes a subset of tokens based on a learned top-k gating function. The classic all-to-all communication primitive has been used to implement expert parallelism [23], [15], [22].

The above parallelism strategies are mainly designed for maximizing training throughput, and their effectiveness can be limited during inference because of insufficient parallelism with small inference batch sizes. Our work leverages these techniques and applies innovative optimizations to make them effective and performant in inference.

d) Optimized Transformer Kernels: There is also a suite of work focused on accelerating the performance of transformer kernels [24], [25], [26]. A record training time for BERT was accomplished with stochastic transformer kernels that fused operators and reduced activation memory to support large batch sizes [24]. Ianov et al. [25] use transformer dataflow graphs to fuse elementwise and reduction operators and accelerate training. TurboTransformers [26] similarly fuses elementwise and reduction operators for transformer inference. E.T. [27] combines fusion, custom GeMM, and pruning together to accelerate the inference speed of transformers. The kernel optimizations presented in this work fuse a wider variety of operators, such as head-wise transformation that requires additional data layout transformation, and layers beyond the self-attention sublayers, such as the intermediate layers and MoE-specific layers. In addition, the kernels presented in this work also support auto-regressive generative models that require KV-caching [8] to be performant during inference, whereas the above-mentioned works do not consider support for KV-caching.

e) DNN Inference Optimizations: There has also been extensive work on optimizing DNN inference through platforms, libraries, compilation, and compression strategies. Several compilers and runtimes exist to facilitate the deployment of models, such as TVM [28], ONNXRuntime [29] and TensorRT [30]. These platforms have been mostly focused on optimizing DNN models that can fit in a single GPU, such as small transformers with a few hundred millions of parameters. In contrast, our work targets billion-scale or even trillion-scale transformers that do not easily fit on a single GPU device. The most related work to ours is FasterTransformer [31], which supports multi-GPU inference for transformer models and for which we provide a more detailed comparison in Section VII. Finally, there have been numerous works that improve the deployment of DNN models through model compression techniques, such as distillation, quantization, and sparsification, which could reduce the computation time and memory consumption with a small accuracy trade-off. Our work is complementary to these model compression techniques and can be combined with them to boost performance further.
III. INFERENCE-OPTIMIZED TRANSFORMER KERNELS

In this section, we discuss the challenges, design, and optimizations for transformer kernels capable of achieving high-performance inference for both small and large batch sizes.

A. Inference Challenges on Different Batch Sizes

As discussed in Sec. I, small-batch performance is limited by the memory bandwidth utilization in reading model weights. There are three main challenges to optimizing for memory bandwidth at small-batch inference. First, due to the limited work at the different kernels performing the operations of a transformer layer with a small batch, inference performance suffers from kernel-invocation overhead. Second, each kernel invocation writes data to global memory which is read by GPU cores during the next kernel invocation, and this data transfer between GPU cores and global memory adds an additional overhead. Finally, neither cuBLAS nor CUTLASS GeMM libraries are well tuned for extremely small batch sizes, and cannot achieve good memory-bandwidth utilization.

Large-batch inference performance, on the other hand, is limited by compute utilization. While compute-heavy operations like GeMM inside a transformer layer can achieve very good compute utilization using cuBLAS and CUTLASS libraries, the overall utilization can still be limited by the kernel launch overheads and data transfers between GPU cores and global memory across the different kernels other than GeMMs.

To address these challenges, we introduce two techniques: i) Deep-Fusion, to reduce kernel-invocation and data-movement overheads by fusing multiple kernels beyond element-wise operations, and ii) a custom GeMM kernel designed to improve memory bandwidth utilization when the batch size is relatively small, while also allowing it to be fused using Deep-Fusion. We discuss these techniques in detail next.
B. Deep-Fusion

While operator fusion is a common technique used in deep learning to reduce kernel launch and data-movement overhead, it is limited primarily to element-wise operators [32], [28], [29]. In contrast, a transformer consists of operators like data layout transformations, reductions, and GeMMs which create data dependencies across thread blocks, making them difficult to fuse. This is because on a GPU, if data produced by a thread-block is consumed by a different one, a global memory synchronization is needed, which invokes a new kernel.

To avoid the need for a global synchronization, Deep-Fusion tiles the computation space along dimensions of the iteration space which incur no cross-tile data dependencies and executes them in parallel across different thread-blocks. The dimensions of the computation space which do contain data dependencies are not tiled, and are instead processed by the same thread-block. After this tiling, two operators can be fused using Deep-Fusion if each tile of the second operator depends on exactly one output tile of the first operator. By performing fusion at tile granularity, Deep-Fusion can fuse not only element-wise operations but also reductions, data transpositions, and GeMMs as long as there are no cross-tile dependencies. For example, all micro-operations in a layer-norm [33] can be tiled along the token dimension, while the reduction dimensions are processed within a tile. This allows all the micro-operations inside a layernorm to be fused into a single kernel despite consisting of multiple reduction operations. Furthermore, the data produced by each tile is either kept in registers or in shared memory when possible to allow for data reuse across operators without incurring global memory data-transfer overheads.

C. SBI-GeMM: Custom GeMM for Small Batch Size

Our custom GeMM implementation is designed to be fusable with Deep-Fusion while achieving maximum memory bandwidth utilization. Its design can be viewed in three parts: tiling strategies, cooperative-group reduction, and data-layout transformation for better memory bandwidth utilization.

1) Tiling Strategies: Fig. 1(a) depicts our GeMM scheduling for a skinny matrix multiplication. We first tile the computation along the output dimension. That allows us to implement GeMM using a single kernel by keeping the reduction within a tile. For small models, where the output dimension is too small to create enough parallel tiles to achieve good memory bandwidth, we tile the input dimension as well and implement GeMM as two kernels to allow for reduction across tiles.

2) Cooperative-Group Reduction: With the aforementioned tiling strategy, each warp in a thread block is responsible for producing a partially reduced result for a tile of outputs, and a final reduction is needed across all the warps within the thread block. Usually this is implemented as a binary-tree based reduction in shared memory, which requires multiple warp-level synchronizations, thus creating a performance bottleneck. To avoid this, we perform a single data-layout transpose in shared memory such that partial results of the same output element are contiguous in memory and can be reduced by a single warp using cooperative-group collectives directly in registers (see Fig. 1(a)). At the end, the first thread of each warp holds the final result and writes it to shared memory. The results in shared memory are contiguous, allowing for a coalesced write to global memory.

3) Leveraging the Full Cache Line: In GPU architectures, each L1 cache line is 128 bytes; however, a coalesced memory access with a single FP16 or INT8 element per thread in the warp cannot fully consume the full cache line. Reading multiple elements per thread along the output dimension to address this issue reduces the number of parallel tiles, which also hurts memory bandwidth. Therefore, our solution is to transpose the weight matrix during initialization such that M rows for each column are contiguous in memory, allowing each thread to read M elements along the input dimension (see Fig. 1(b)). We set M to 2 for half precision and 4 for the INT8 data type, considering a 128-byte cache line.
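To make the cache-line argument concrete, the snippet below is a minimal host-side sketch (our illustration, not DeepSpeed's kernel code) of re-laying out a weight matrix so that groups of M input-dimension values of the same output column become contiguous:

```python
import torch

def interleave_weight(weight: torch.Tensor, m: int) -> torch.Tensor:
    """Re-layout a (K_in, K_out) weight so that, for every output column,
    groups of `m` consecutive input-dimension values are contiguous in memory.
    Each thread can then read `m` elements along the input dimension and a
    warp consumes a full 128-byte cache line (m=2 for FP16, m=4 for INT8)."""
    k_in, k_out = weight.shape
    assert k_in % m == 0
    w = weight.reshape(k_in // m, m, k_out)   # (K_in/m, m, K_out)
    w = w.permute(0, 2, 1).contiguous()       # (K_in/m, K_out, m): m values per column together
    return w.reshape(k_in // m, k_out * m)

# Example: a hypothetical FP16 weight with hidden size 4096.
w = torch.randn(4096, 4096, dtype=torch.half)
w_fused_layout = interleave_weight(w, m=2)
```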
[Figure 1: GeMM scheduling across thread-blocks, warps and CUDA threads; the 2-D weight partitioning with the new data layout; and the Deep-Fusion regions of a transformer layer (QKV GeMM, attention, intermediate GeMM, bias/residual). The original figure graphics are not reproducible from this extraction.]

Fig. 1. (a) GeMM scheduling at different GPU architectural levels: thread-blocks, warps and CUDA threads. Warps show different cooperative threads (32 threads), Lanes show the thread index at each warp. (b) GeMM modification to support 2-dimensional partitioning of the weight matrix and new data layout. (c) Deep-Fusion strategy for the small-batch inference.

D. Putting It Together

Small-batch Transformer Kernel: Fig. 1(c) shows the different components of a transformer layer and the operations which are considered for Deep-Fusion in the small-batch inference case. As the figure shows, we fuse the operations inside a transformer layer at four main regions: 1) the QKV GeMM and input layer-norm, 2) transposition plus attention, 3) post-attention layer-norm and intermediate GeMM, and 4) bias and residual addition. To support the fusion of GeMM with the rest of the operations in a single kernel for 3), we broadcast the input batch across the SMs and perform the same operations that come before the GeMM, so that there is no need to communicate data between SMs for adding the GeMM schedule. We observe that in spite of replicating the work across SMs, we still gain a performance benefit compared to the non-replicated, non-fused kernel implementation for very small batch sizes.

Large-batch Transformer Kernel: We follow the same fusion strategy as discussed above, with the difference that we use cuBLAS for the GeMM operations and keep them unfused.

Support for Different Data Types: Our kernels support FP32, FP16 and INT8 data types for the GeMM operations. To support INT8, we use the CUTLASS [34] INT8 GeMM implementation tuned for different batch sizes. We also add a quantize operation before GeMM that we fuse using Deep-Fusion, and a de-quantization after GeMM that we fuse using CUTLASS's epilogue functionality.

Eliminating Kernel Invocation Overhead via CUDA Graphs: For small to moderately sized models with small batch sizes, as we reduce the actual execution time of the kernels, the main latency bottleneck shifts from kernel execution to the kernel launch overhead on the CPU side. To address this issue, we add CUDA-Graph [35] support in our inference pipeline. More specifically, we store the trace of the kernels the first time they are launched during the forward computation at inference and create a computation graph that can be reused for the following requests, which largely eliminates the kernel launching overhead and substantially improves performance.
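The snippet below is a minimal sketch of this capture-and-replay idea using PyTorch's public torch.cuda.CUDAGraph API; DeepSpeed's own integration differs in its details, and the tiny single-layer model here is only a stand-in.

```python
import torch

model = torch.nn.Linear(4096, 4096).half().cuda().eval()
static_input = torch.randn(1, 4096, device="cuda", dtype=torch.half)

# Warm up on a side stream so library workspaces are allocated before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# For later requests: copy new data into the captured input buffer and replay,
# skipping the per-kernel CPU launch overhead.
static_input.copy_(torch.randn(1, 4096, dtype=torch.half))
graph.replay()
print(static_output.float().sum())
```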
[Figure 2: a pipeline-parallel token-generation schedule, shown as (a) a baseline schedule with observed data dependencies and (b) a pipeline schedule that hides the data dependencies. Graphics not reproducible from this extraction.]

Fig. 2. A pipeline-parallel schedule for generating the first three tokens of four sequences S0, ..., S3 using four pipeline stages. Sequence colors indicate the token being generated. Data dependencies exist between the first and last pipeline stages; we illustrate the dependencies for only S0. Gray blocks denote pipeline bubbles.

[Figure 3: graphics not reproducible from this extraction.]

Fig. 3. Illustration of three different batch-size combinations for prompt processing and token generation: a small micro-batch count reduces the latency of token generation but prolongs the latency of prompt processing, and vice versa. By using a hybrid scheduling where different micro-batch counts are used for different stages, the latency of both prompt processing and token generation is reduced.

IV. INFERENCE-ADAPTED DENSE TRANSFORMER MODELS ON MANY-GPU SYSTEMS

This section presents the model parallelism techniques that we use on top of the single-GPU transformer kernels discussed in Sec. III with two goals: i) reducing latency further by leveraging aggregate memory bandwidth across GPUs, and ii) increasing memory capacity by leveraging aggregate GPU memory across multiple nodes to fit massive models. While model parallelism is extensively studied in the context of training, there are unique challenges in inference, requiring new solutions.

A. Aggregate Memory Bandwidth via Tensor Parallelism

We leverage the aggregate memory bandwidth across multiple GPU devices via tensor-slicing parallelism (TP) from Megatron-LM [16]. DeepSpeed Inference can automatically scale a dense transformer model to multiple devices by partitioning transformer operators across multiple devices while also adding the appropriate communication operations needed across GPUs. Under the hood, it leverages the single-GPU kernels to maximize per-GPU memory bandwidth utilization, while using NCCL all-reduce collectives to perform the necessary cross-GPU communication as described in [19]. This allows DeepSpeed Inference to achieve excellent aggregate memory bandwidth utilization across several GPUs within a node. However, as discussed in Section II, tensor slicing cannot be scaled efficiently beyond a single node due to significant communication overhead. Thus, to further scale to multi-node systems, DeepSpeed Inference uses pipeline parallelism.
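As a concrete illustration of tensor slicing (a sketch in the style of Megatron-LM [16], not DeepSpeed's actual implementation), a linear layer can be split along its input dimension across tensor-parallel ranks, with an all-reduce summing the partial outputs. It assumes torch.distributed has been initialized under a distributed launcher such as torchrun.

```python
import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):
    """Each rank holds a 1/world_size slice of the weight along the input
    dimension; partial products are summed with an all-reduce."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert in_features % world == 0
        self.local_in = in_features // world
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, self.local_in, dtype=torch.half, device="cuda") * 0.02)

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        # x_shard: (batch, in_features / world_size) slice held by this rank.
        partial = torch.nn.functional.linear(x_shard, self.weight)
        dist.all_reduce(partial)  # one all-reduce per sliced GeMM, as discussed in Sec. II
        return partial
```

Each rank computes only its slice of the GeMM, so both the weights and the memory-bandwidth cost of reading them are spread over all tensor-parallel GPUs.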
B. Aggregate Memory via Pipeline Parallelism: Challenges

As models exceed the memory capacity of a single node, we use pipeline parallelism (PP) [17], [18]. Although PP does not help with aggregate memory bandwidth, since each micro-batch traverses the full depth of the model in sequence across the pipeline stages, it has smaller communication overhead (as discussed in Sec. II) compared to TP and is thus more efficient to scale across nodes. However, applying PP in inference is non-trivial and requires different considerations from training:

First, transformer decoders are autoregressive, i.e., the inputs to the model inference are previously generated outputs. Therefore, when generating a sequence, the next token in the sequence is a function of the previous tokens. Existing training pipelines inference at the granularity of batches, and so batch boundaries are defined by the data dependencies of sequence generation. These data dependencies induce frequent pipeline bubbles that degrade inference performance (see Fig. 2).

Second, autoregressive generation models have two distinct phases: i) a prompt processing phase, where the entire input prompt is processed to generate the first token, and ii) a token generation phase, where the result of the prompt processing is reused via KV-caching and the new computation only depends on a single previously generated token. As the number of tokens processed in each phase is drastically different, they have different performance characteristics requiring different considerations.

Third, autoregressive inferencing caches the key and value activations of each transformer layer in order to avoid recomputation for each token. This activation memory scales with the number of sequences that are concurrently generated. In effect, inference performance for large transformer models can be limited by memory capacity.

C. Inference Optimized Pipeline Parallelism

In order to overcome the inference-specific challenges, our approach includes three important aspects: scheduling, memory footprint reduction, and communication optimization.

1) Hiding data dependencies and hybrid scheduling: Suppose our goal is to inference a batch of B sequences, s1, s2, ..., sB. We divide the sequences into groups of micro-batches, where one micro-batch is the unit of computation provided to each kernel invocation. A micro-batch progresses through the stages of the model pipeline until the next tokens are produced by the last stage of the model. If sequence si does not terminate, the generated token will be used as the input for generating the next token. Fig. 2 illustrates our pipeline-parallel sequence generation schedule. We set the number of micro-batches to the pipeline depth, P. Having at least P micro-batches is critical to utilize all of the pipeline stages, but we avoid additional micro-batches due to the latency and memory costs of a larger batch size. However, we cannot repeatedly inference a batch of P micro-batches without significant pipeline bubble overheads ruining efficiency. We avoid intermediate pipeline bubbles by dynamically queuing micro-batches of generated tokens until the sequences terminate. The resulting schedule amortizes the pipeline bubble over all generated tokens without allocating extra activations from a larger batch size.

However, this is not sufficient to get the best performance. As mentioned in Section IV-B, autoregressive models, such as GPT-3, often consist of two stages that have different performance characteristics, and using the same micro-batch size for both stages is sub-optimal. The prompt processing component of inference has a large number of tokens per sample that can saturate the GPU compute, and the choice of micro-batches only affects the pipeline bubble but not the GPU execution time. However, for token generation, each sample has a single token, the total number of tokens across the entire micro-batch is small, and the kernel execution time is entirely memory bandwidth bound. That means the execution time for a micro-batch does not change much with the size of the micro-batch, as most of the time is spent in fetching model parameters. However, the overall execution time is proportional to the number of micro-batches, as the forward pass on each micro-batch requires fetching the weights all over again. Therefore, efficient token generation requires minimizing the number of micro-batches while keeping it large enough to hide the pipeline bubble.

To address the varying requirements of prompt processing and token generation, we adopt a hybrid scheduling strategy, where we use a different number of micro-batches for prompt processing and token generation. Fig. 3 illustrates how the hybrid scheduling works. We use a larger number of micro-batches during the prompt processing stage to minimize the pipeline bubble, while during the token generation phase, we reduce the number of micro-batches to reduce the overall execution time.
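A minimal sketch of the hybrid choice of micro-batch counts follows; the factor of two for the prompt phase is an arbitrary placeholder of ours, not a value from the paper.

```python
import torch

def split_microbatches(batch: torch.Tensor, count: int):
    """Split a batch of sequences into `count` micro-batches along dim 0."""
    return list(torch.chunk(batch, count, dim=0))

def hybrid_schedule(batch: torch.Tensor, pipeline_depth: int):
    # Prompt phase: compute-saturated, so more micro-batches mainly shrink
    # the relative size of the pipeline bubble.
    prompt_microbatches = split_microbatches(batch, 2 * pipeline_depth)
    # Generation phase: bandwidth-bound, and every micro-batch re-reads all
    # weights, so use just enough micro-batches to keep all stages busy.
    generation_microbatches = split_microbatches(batch, pipeline_depth)
    return prompt_microbatches, generation_microbatches

prompts = torch.randint(0, 50000, (16, 128))  # 16 sequences, 128-token prompts
p_mb, g_mb = hybrid_schedule(prompts, pipeline_depth=4)
print(len(p_mb), len(g_mb))  # 8 micro-batches for prompts, 4 for generation
```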
2) Offloading Activations to CPU Memory: The cached key and value activation tensors have a predictable reuse pattern. The activations of sequence si will not be used again until generating the next token of si. When the allocated activation memory exceeds a threshold, we offload some activations from GPU to CPU memory while they are not in use. The saved GPU memory allows for larger batch sizes and enables better system utilization.

3) Communication Optimization: Inference performance will ultimately degrade if the transformer kernels are stalled on communication for CPU memory offloading over the low-bandwidth PCIe. To avoid the stall, we overlap the communication with computation, and more importantly, we employ an architecture-aware, communication-optimized offloading strategy. Most system architectures do not have a unique PCIe bus for each GPU and share a single link across two GPUs. To avoid contention between GPUs, odd-numbered GPUs offload activations for odd-numbered layers, while even-numbered GPUs offload activations for even-numbered layers. This is crucial to fully leverage the PCIe bandwidth. Scheduling odd and even layer offloading across GPUs prevents contention on the PCIe link, allowing each GPU to fully leverage the PCIe bandwidth when it needs to offload.
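The following is a hedged sketch of this offloading scheme (our illustration, not DeepSpeed's code): cached KV tensors are copied into pinned CPU buffers with non-blocking transfers, and the odd/even layer split decides which of the two GPUs sharing a PCIe link offloads a given layer.

```python
import torch

class KVCacheOffloader:
    def __init__(self, gpu_index: int):
        self.gpu_index = gpu_index
        self.cpu_buffers = {}

    def handles(self, layer_idx: int) -> bool:
        # Odd-numbered GPUs offload odd layers, even-numbered GPUs even layers,
        # so the two GPUs sharing one PCIe link do not offload at the same time.
        return layer_idx % 2 == self.gpu_index % 2

    def offload(self, layer_idx: int, kv: torch.Tensor):
        if not self.handles(layer_idx):
            return kv  # keep this layer's cache on the GPU
        buf = self.cpu_buffers.setdefault(
            layer_idx,
            torch.empty(kv.shape, dtype=kv.dtype, device="cpu", pin_memory=True))
        buf.copy_(kv, non_blocking=True)  # overlaps with compute on the current stream
        return buf

    def fetch(self, kv: torch.Tensor) -> torch.Tensor:
        # Bring an offloaded cache back before the layer needs it again.
        return kv.to("cuda", non_blocking=True) if kv.device.type == "cpu" else kv
```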
[Figure 4: layout of expert-slicing, expert parallelism, tensor-slicing and data parallelism for an MoE transformer layer; the depicted example uses 16 GPUs and 8 experts with expert-slicing degree 2, expert-parallel degree 8, tensor-slicing degree 4 and data-parallel degree 4. Graphics not reproducible from this extraction.]

Fig. 4. Expert, data and tensor parallelism in DeepSpeed-MoE.

[Figure 5: the parallelism-coordinated all-to-all compared against the baseline all-to-all. Graphics not reproducible from this extraction.]

Fig. 5. The parallelism coordinated communication (PCC) optimization follows four steps: 1) local transformation and splitting of the original data, 2) intra-tensor-model-parallel (MP) and inter-MP alltoall, followed by 3) intra-MP allgather, and 4) finally a local transform operation. Despite four steps, it is faster than the baseline alltoall shown in the bottom half of this illustration.
V. MASSIVE SCALE SPARSE MODEL INFERENCE

While the techniques developed so far enable DeepSpeed Inference to achieve state-of-art latency and throughput for dense transformer models, new considerations are necessary for sparse transformer models that consist of both sparse and dense components. The key challenge is that, on one hand, sparse models are much larger than quality-equivalent dense models (Sec. II), requiring much higher aggregate memory bandwidth to achieve latency comparable to quality-equivalent dense models, and on the other hand, they have a different computational structure than dense models, requiring different parallelism approaches compared to dense transformers [23].

In this section, we introduce a massive scale MoE-based transformer model inference system capable of addressing the above challenges. It is built on top of the dense components discussed before and consists of three main components:

A. Orchestration of Tensor, Data, & Expert Parallelism for MoE

We use tensor parallelism, referred to in Fig. 4 as tensor-slicing (for non-expert parameters) and expert-slicing (for expert parameters), to split individual parameters across multiple GPUs to leverage the aggregate memory bandwidth across GPUs. However, tensor parallelism can only scale efficiently to a few GPUs due to communication overhead and fine-grained parallelism. To address this, we use expert parallelism in conjunction with tensor parallelism to scale expert parameters to hundreds of GPUs. Expert parallelism does not reduce the computation granularity of individual operators, therefore allowing our system to leverage aggregate memory bandwidth across hundreds of GPUs. To scale the non-expert computation to the same number of GPUs, we use data parallelism at no communication overhead.

B. PCC: Parallelism Coordinated Communication for MoE

Expert parallelism places expert operators across GPUs and requires all-to-all communication between all expert-parallel GPUs. However, it is not efficient to scale expert parallelism to the hundreds of devices needed for sparse model inference, as the latency increases linearly with the increase in devices. Fortunately, when combining expert parallelism and tensor-slicing within a single model, there are opportunities for communication optimization that can reduce the communication latency. Note that tensor-slicing splits individual operators across GPUs and requires all-reduce between them. The all-reduce operation in tensor-slicing replicates data among the involved devices. When executing tensor-parallel operators followed by expert-parallel operators, this replication allows creating an optimized communication schedule for the all-to-all operator that does not require communicating between all the expert-parallel processes: the all-to-all can happen within just the subset of devices that share the same tensor-slicing rank, since the data across tensor-parallel ranks is replicated (Fig. 5). As a result, the latency of the all-to-all is bounded by O(p/L) instead of O(p), where L is the tensor-slicing parallelism degree and p is the total number of GPU devices.

Similarly, when executing expert-parallel operators followed by tensor-slicing operators, the final all-to-all can be done in the same way, but this time followed by an allgather operator between tensor-parallel ranks to replicate the data needed by tensor-slicing (Fig. 5). This reduces the latency overhead from O(p) to O(p/L) + O(L).

This reduced latency overhead allows better scaling to a large number of devices. For example, when scaling to 128 GPUs with 8-way tensor-slicing and 128-way expert parallelism, this approach reduces the latency overhead of the all-to-all from (128·C1 + C2) to (16·C1 + C2) due to the 8-way tensor-slicing, where C1 and C2 are constants determined by point-to-point latency, message size, and bandwidth.
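A quick numeric sketch of that latency model (the constants below are arbitrary placeholders, not measured values):

```python
def alltoall_overhead(num_gpus: int, tensor_slicing: int, c1: float = 1.0, c2: float = 5.0):
    baseline = num_gpus * c1 + c2                 # all-to-all spans all p expert-parallel GPUs
    pcc = (num_gpus // tensor_slicing) * c1 + c2  # PCC: only p/L devices participate
    return baseline, pcc

print(alltoall_overhead(128, 8))  # (133.0, 21.0) with the placeholder constants
```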
C. Highly Optimized Computation Kernels for MoE

MoE-related computation consists of four major components: (1) a gating function that determines the assignment of tokens to experts, where the result is represented as a sparse tensor (a one-hot vector representing the assigned expert for each token in the sequence); (2) a sequence of sparse operators, including a cumsum operator, to compute an inverse mapping from experts to token IDs (expert-to-token) using the previously mentioned token-to-expert one-hot vector; (3) a scatter operator to distribute tokens to their corresponding experts, implemented as a sparse einsum between the expert-to-token mapping computed in the previous step and the input tokens; and (4) a final sparse-einsum-based gather operation that re-distributes the tokens processed at each expert back to their original ordering.

The sparse tensor representation in the gating function and the sparse einsum operators introduce a significant latency overhead. The gating function includes numerous operations to create token masks, select top-k experts, and perform a cumulative sum (cumsum) to find the token ID going to each expert, as well as sparse matrix multiplies, all of which are not only wasteful due to the sparse tensor representation, but also extremely slow due to many kernel invocations. Moreover, the sparse einsums have a complexity of S × E × M × ce, where S represents the total number of tokens, E represents the number of experts, M represents the model hidden dimension, and ce represents the expert capacity (S, E, and M are the main complexity factors, while ce is normally very small). In this computation, (E − 1) out of E operations for each token are multiplications and additions with zeros, since only one expert is typically selected to process ce tokens. This comes from the fact that generalizing the gating operations results in einsums over several masking matrices or one-hot vectors that produce a lot of unnecessary computation with zeros to select the correct token for each expert. We optimize these operators using a dense representation and kernel fusion.

We optimize each of the four steps of the gating function in the following way: 1) we replace the one-hot representation of the token-to-expert mapping with a table data structure, greatly reducing the memory overhead by eliminating all the zeros in the one-hot vectors; 2) we create the inverse mapping (expert-to-token mapping table) from the token-to-expert mapping table by simply scanning through the token-to-expert table in parallel; 3) we replace the sparse-einsum-based scatter operation with a data-layout transformation that achieves the same result by first identifying the token IDs assigned to an expert using the expert-to-token mapping table created in the previous step, and then copying these tokens to the appropriate expert location; and 4) after the tokens are processed by their corresponding experts, we use a similar data-layout transformation to replace the sparse-einsum-based gather operation.

Using the data-layout transformation instead of sparse einsums reduces the complexity of these operations from S × E × M × ce to S × M × ce. We use shared memory for the data-layout transformations and fuse all but the final data-layout transformation together into a single kernel using basic fusion principles. Combined, these optimizations result in over a 6× reduction in MoE kernel-related latency.
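The PyTorch sketch below illustrates the idea of replacing the one-hot/sparse-einsum scatter and gather with a dense index-based data-layout transformation; it is illustrative only, and the fused GPU kernels in DeepSpeed implement this far more efficiently.

```python
import torch

def dispatch_dense(tokens: torch.Tensor, expert_idx: torch.Tensor, num_experts: int):
    """tokens: (S, M) embeddings; expert_idx: (S,) top-1 expert id per token."""
    order = torch.argsort(expert_idx)                 # group token IDs by expert
    sorted_tokens = tokens.index_select(0, order)     # O(S*M) copy, no S*E*M einsum
    counts = torch.bincount(expert_idx, minlength=num_experts)
    per_expert = torch.split(sorted_tokens, counts.tolist())
    return per_expert, order                          # `order` is reused for the gather

def combine_dense(expert_outputs, order: torch.Tensor, out_shape, dtype):
    out = torch.empty(out_shape, dtype=dtype)
    out.index_copy_(0, order, torch.cat(expert_outputs, dim=0))  # restore original ordering
    return out

tokens = torch.randn(8, 4)                            # 8 tokens, hidden dim 4
expert_idx = torch.tensor([1, 0, 1, 3, 0, 2, 2, 1])   # from a hypothetical gating function
chunks, order = dispatch_dense(tokens, expert_idx, num_experts=4)
restored = combine_dense([c * 1.0 for c in chunks], order, tokens.shape, tokens.dtype)
assert torch.equal(restored, tokens)                  # identity experts round-trip cleanly
```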
VI. DEMOCRATIZATION OF LARGE MODEL INFERENCE

DeepSpeed Transformer needs the model to fit in aggregate GPU memory, requiring a large number of GPUs for large models. This is a barrier for many data scientists who lack access to a large number of GPUs; e.g., dozens of GPUs are required to inference models like MT-NLG-530B. To broaden access to large models, we propose ZeRO-Inference, which enables large model inference using as few as a single GPU. For non-latency-sensitive applications, ZeRO-Inference achieves high performance by leveraging DRAM and NVMe memories in addition to GPU memory and compute. Compared to a CPU-only solution, ZeRO-Inference can achieve orders of magnitude higher throughput by efficiently exploiting the available GPU hardware. Moreover, it offers similar or even better throughput than DeepSpeed Transformer by supporting larger batch sizes. We now discuss the design of ZeRO-Inference and the performance optimizations that make it very efficient for throughput-oriented inference.

A. ZeRO-Inference Design

ZeRO-Inference utilizes available heterogeneous memory (i.e., GPU memory, DRAM, and NVMe) to satisfy the memory requirement of fitting massive models. This is motivated by the observation that environments with limited GPU resources are often equipped with terabytes of aggregate heterogeneous memory, which is sufficient to fit hundreds-of-billion-parameter models. ZeRO-Inference builds on the offloading techniques of ZeRO-Infinity [36] and adapts them to inference.

An important design decision is how to apportion GPU memory among model weights, inference inputs, and intermediate results. One approach is to pin as much of the model weights as possible into GPU memory, and fetch the remainder (from DRAM or NVMe) when needed for computation. A benefit of this approach is avoidance of the latency of fetching weights that are already pinned in GPU memory. However, this approach has two downsides: (i) it allows only small batch sizes, which hurts efficiency, and (ii) the latency savings for hundred-billion-parameter models are negligible, since only a small fraction of the weights can fit in GPU memory anyway.

ZeRO-Inference adopts a different approach that pins the model weights either in DRAM (if large enough) or NVMe, and streams each layer into GPU memory for computation when needed. Despite the latency of fetching model weights over PCIe, ZeRO-Inference is able to achieve high efficiency for two reasons. First, by limiting the GPU memory usage of the model to one or a few layers of weights, ZeRO-Inference is able to use large batch sizes for inference. Second, a large model layer requires a significant amount of compute, especially given the long input sequence lengths (e.g., 2048). For example, one GPT3-175B layer requires about 7 TFLOPs to process an input of batch size 1. Therefore, large batch sizes cause compute time to dominate the latency of fetching model weights, which ultimately improves efficiency. In summary, ZeRO-Inference's strategy of utilizing GPU memory to support large batch sizes results in high-performance inference for large models.
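A hedged sketch of a ZeRO-Inference-style configuration follows, using ZeRO stage-3 parameter offload to NVMe (or CPU) so that only the working layers reside in GPU memory. The field names follow the public DeepSpeed/ZeRO-Infinity config schema of that period; the exact keys, values, and the NVMe path are assumptions, not a verbatim recipe.

```python
zero_inference_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                            # partition and offload parameters
        "offload_param": {
            "device": "nvme",                  # "cpu" if DRAM is large enough
            "nvme_path": "/local_nvme",        # hypothetical NVMe mount point
            "pin_memory": True,
            "buffer_count": 5,                 # a few in-flight buffers for streaming (assumption)
        },
        "stage3_prefetch_bucket_size": 5e8,    # prefetch upcoming layers, in bytes (assumption)
        "stage3_max_live_parameters": 1e9,     # cap parameters resident on the GPU (assumption)
    },
    "train_micro_batch_size_per_gpu": 1,
}
```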
Name                 | # params (B)    | hidden dim (K) | # layers       | # attention heads | Fig. 6 | Fig. 6   | Fig. 8     | Fig. 9
GPT-[2, Neo, J, 13B] | 1.5, 2.7, 6, 13 | 1.6, 2.5, 4, 5 | 48, 32, 28, 40 | 25, 20, 32, 40    | TP=1   | N/A      | N/A        | N/A
GPT-[NeoX, 50B, 87B] | 20, 50, 87      | 6, 8, 12       | 44, 62, 48     | 64, 64, 96        | N/A    | TP=2,4,8 | N/A        | TP=1
LM-175B              | 175             | 12             | 96             | 96                | N/A    | TP=16    | TP=8, PP=2 | TP=1
LM-530B              | 530             | 20             | 105            | 128               | N/A    | N/A      | TP=8, PP=5 | TP=1

TABLE I. Model configurations used for the dense model inference performance evaluation.

Model         | Size (billions) | #Layers | Hidden size | MP degree | EP degree | Expert-slicing | #GPUs
1.3B+MoE-128  | 52              | 24      | 2048        | 1         | 128       | 1              | 128
2.4B+MoE-128  | 107.7           | 16      | 3584        | 1         | 128       | 1              | 128
8B+MoE-128    | 349.0           | 30      | 4096        | 4         | 128       | 1              | 128
24B+MoE-128   | 1064.9          | 40      | 8192        | 8         | 128       | 2              | 256
47B+MoE-128   | 2024.0          | 58      | 8192        | 8         | 128       | 2              | 256

TABLE II. Model configurations used for the sparse model inference performance evaluation. MP stands for model parallelism; EP refers to expert parallelism.

B. Performance Optimizations

ZeRO-Inference implements two optimizations to further mitigate the impact of fetching model weights from DRAM or NVMe for inference computations.

Prefetching: ZeRO-Inference prefetches a configurable number of layers ahead of use, overlapping their transfer with the computation of the current layer. Prefetching gives the flexibility to improve throughput at the cost of a configurable increase in GPU memory consumption.

Multi-GPU PCIe bandwidth utilization: In multi-GPU scenarios, the aggregate PCIe bandwidth is used to reduce the layer transfer time by having each GPU fetch only a partition of the layer and then aggregating the partitions over the much faster GPU-GPU interconnect.

Beyond the above optimizations, ZeRO-Inference also performs several other efficiency optimizations to achieve close to peak NVMe I/O bandwidth, such as bulk read/write requests for asynchronous completion, aggressive parallelization of I/O requests, work scheduling, memory pinning, and avoiding data copies. However, we do not claim novelty on those optimizations, as they were introduced by prior work [36].
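The following is a minimal sketch of the prefetching idea: while layer i computes on the default stream, layer i+1's weights are copied host-to-device on a side stream so the PCIe transfer overlaps with compute. It is illustrative only; the real prefetcher also handles NVMe reads and the partitioned multi-GPU fetches described above, and the single matrix used as a "layer" here is just a stand-in for a transformer layer.

```python
import torch

def run_with_prefetch(cpu_layers, hidden):
    copy_stream = torch.cuda.Stream()
    current = cpu_layers[0].cuda(non_blocking=True)            # stage the first layer
    for i in range(len(cpu_layers)):
        nxt = None
        if i + 1 < len(cpu_layers):
            with torch.cuda.stream(copy_stream):               # prefetch next layer's weights
                nxt = cpu_layers[i + 1].cuda(non_blocking=True)
        hidden = hidden @ current                              # compute the current layer
        torch.cuda.current_stream().wait_stream(copy_stream)   # ensure the prefetch finished
        if nxt is not None:
            current = nxt
    return hidden

layers = [torch.randn(1024, 1024, pin_memory=True) for _ in range(4)]
x = torch.randn(8, 1024, device="cuda")
print(run_with_prefetch(layers, x).shape)
```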
VII. PERFORMANCE EVALUATION

We present an extensive evaluation of DeepSpeed Inference covering four aspects. i) For latency sensitive applications, DeepSpeed Inference achieves up to 1.9× and 7.3× lower latency than state-of-art for a wide range of dense models with hundreds of billions of parameters, and sparse models with trillions of parameters scaling to hundreds of GPUs. ii) For throughput-oriented inference of massive models, DeepSpeed Inference achieves up to 1.5× higher throughput. iii) On resource constrained systems, DeepSpeed Inference enables inference of 25× larger models than a GPU-only solution (530B vs 20B) while achieving over 50% of peak hardware performance, democratizing large-model inference with limited GPU resources. iv) We present a performance breakdown to zoom into the contributions of individual optimizations.

A. Evaluation Methodology

1) Baseline: For dense models, we use FasterTransformer (FT) [31], an efficient implementation of transformer models provided by NVIDIA. For experiments on sparse models, we use a full-featured distributed PyTorch implementation that supports both tensor and expert parallelism [37].

2) Metrics: We use three performance metrics: (i) latency, i.e., end-to-end output generation time for a batch of input prompts, (ii) token throughput, i.e., tokens-per-second processed, and (iii) compute throughput, i.e., TFLOPS per GPU.

3) Workloads: For the performance evaluation, we focus on evaluating GPT-style transformer-based decoder models [8], where we vary the hidden dimension, the number of transformer layers, and the attention heads based on the GPT-3 paper as well as its publicly available variants to cover a wide range of model configurations and different numbers of parameters. Table I elaborates the model architectures. For sparse MoE models, we further vary the expert degree to cover models ranging from 52B parameters to 2 trillion parameters. Sparse model configurations are shown in Table II. Since generative text language models like GPT-3 produce tokens based on a prompt, which is the text given to the model to be completed, we measure the latency of generating 8 tokens with an input prompt of 128 tokens for dense models while varying batch sizes, which reflects scenarios that correspond to more latency-sensitive applications. For the sparse MoE models, we measure the per-token latency by generating 100 tokens at a time with a prompt of 128 tokens and batch size 8. For throughput-oriented applications, we measure the performance with an input prompt of 512 tokens while generating 50 tokens at a time. For resource constrained systems, we measure the compute throughput using the maximum batch size possible for generating a single token.

4) Testbeds: We conduct our experiments on: a cluster of up to 256 NVIDIA Ampere A100 40GB GPUs (32 8×A100 DGX boxes [38]), a Lambda A6000 workstation [39] (2×A6000-48GB GPUs, 256GB DRAM, and 2TB NVMe), and a DGX-2 V100 server [40] (16×V100-32GB-SXM GPUs, 1500GB DRAM, and 30TB NVMe).

B. Evaluation of DeepSpeed Inference for Latency Sensitive Workloads

DeepSpeed Inference provides a comprehensive system solution to support fast inference of dense models with over 530B parameters as well as sparse models that have more than 2 trillion parameters at unprecedented scale.
[Figure 6: eight panels (GPT2, GPT-Neo-2.7B, GPTJ-6B, GPT-13B, GPT-NeoX-20B, GPT-50B, GPT-87B, LM-175B) plotting latency (ms) and throughput (tokens per second) at batch sizes 1, 8 and 16 for FT (FP16) and DS-Inference (FP16 and INT8). Plot data not reproducible from this extraction.]

Fig. 6. Latency and throughput comparison of DeepSpeed Transformer with FasterTransformer [31] for different models and batch sizes.

[Figure 7: graphics not reproducible from this extraction.]

Fig. 7. Latency and throughput improvement offered by DeepSpeed-MoE over baseline on 256 GPUs. Throughput shown is per GPU and the speedup values along the arrows refer to improvement in latency.

1) Dense Model Evaluation: Fig. 6 shows the latency and throughput improvements of DeepSpeed Inference on up to 175B-parameter models running with up to 16-way tensor parallelism (see Tab. I). In particular, we compare both FP16 (DeepSpeed-FP16) and INT8 (DeepSpeed-INT8) implementations of DeepSpeed Inference with the FasterTransformer FP16 baseline (FT-FP16).¹ Both the baseline and DeepSpeed Inference use an identical TP strategy, so all the latency differences in these results come from the differences in kernel implementations described below.

¹As of the time of writing, FasterTransformer only supports INT8 computation for Transformer models with just encoders, e.g., BERT, but not the decoders used in state-of-the-art large-scale Transformer models such as GPT-3 [8].

Small Batch Sizes: For small batch sizes, DeepSpeed-FP16 achieves a speedup of up to 1.55x over the baseline. The performance improvements for both single-GPU and multi-GPU configurations are primarily due to deep-fusion and custom GeMMs. The latency reduction is largest for the smallest models, as they have the largest kernel-launch overhead due to limited work per kernel, and the worst GeMM memory bandwidth utilization from CUBLAS, which is not optimized for small and skinny GeMMs. DeepSpeed-INT8 enables a further performance boost of up to 1.95x over the FP16 baseline by halving the overall size of the parameters compared to FP16.
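Since the INT8 gain here comes from halving the bytes of parameters that must be read, a quick calculation shows why the speedup tops out just below 2x. The model size and bandwidth used below are hypothetical placeholders for illustration, not numbers reported in the paper.

    # Illustrative only: how halving bytes per parameter bounds the INT8 speedup.
    params = 13e9                  # hypothetical dense model size (parameters)
    mem_bw = 1.5e12                # assumed memory bandwidth in bytes/sec (placeholder)
    stream_time = {name: params * b / mem_bw for name, b in [("FP16", 2), ("INT8", 1)]}
    for name, t in stream_time.items():
        print(f"{name}: ~{t * 1e3:.1f} ms to stream the weights once")
    print(f"upper bound on INT8 speedup: {stream_time['FP16'] / stream_time['INT8']:.1f}x")
    # Prints ~17.3 ms vs ~8.7 ms, i.e., at most 2x, consistent with the observed <=1.95x.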
Larger Batch Sizes: For larger batch sizes, DeepSpeed-FP16 reduces the latency by up to 1.57x over the baseline, and by up to 1.93x using DeepSpeed-INT8. The primary source of performance improvement for DeepSpeed-FP16 is the reduction of non-GeMM data-movement overhead via deep-fusion. As the batch size increases, the GeMM becomes much more efficient, and the latency of the GeMM operators increases only sub-linearly with the batch size in this modest batch-size regime. However, the latency of the non-GeMM operations increases linearly due to the proportional increase in data movement from GPU memory, making it a bigger fraction of the overall latency. Deep-fusion reduces this data movement by keeping intermediate data for fused operators in shared memory or registers to achieve higher performance. DeepSpeed-INT8 further improves upon the DeepSpeed-FP16 performance by utilizing the higher peak throughput of the INT8 tensor cores compared to FP16.
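To make the data-movement argument concrete, the sketch below tallies the GPU-memory traffic of a bias-add, GeLU, and residual-add chain executed as separate kernels versus a single fused kernel that keeps intermediates on-chip. The op chain and tensor sizes are examples of our own choosing, not DeepSpeed's actual fusion schedule; the point is only that unfused elementwise traffic grows linearly with the number of tokens.

    # Illustrative HBM traffic for a bias -> GeLU -> residual chain over an activation of
    # shape (batch*seq, hidden), assuming FP16 (2 bytes) and that every unfused kernel
    # reads its inputs from and writes its output to GPU memory.
    def activation_bytes(tokens, hidden, dtype_bytes=2):
        return tokens * hidden * dtype_bytes

    def unfused_traffic(tokens, hidden):
        a = activation_bytes(tokens, hidden)
        bias_add = 2 * a   # read activation, write result (bias vector is negligible)
        gelu     = 2 * a   # read, write
        residual = 3 * a   # read two activations, write one
        return bias_add + gelu + residual

    def fused_traffic(tokens, hidden):
        a = activation_bytes(tokens, hidden)
        return 3 * a       # read activation + residual once, write the final result once

    tokens, hidden = 16 * 128, 12288   # e.g., batch 16, sequence 128, GPT-3-like hidden size
    print(unfused_traffic(tokens, hidden) / fused_traffic(tokens, hidden))  # ~2.3x less traffic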
2) Sparse Model Evaluation: Fig. 7 shows the single-output-token generation latency and throughput of serving 100B to 2T MoE models with up to 256 GPUs, with and without DeepSpeed-MoE. Compared to the baseline, DeepSpeed-MoE achieves better performance than the state-of-the-art, with up to a 7.3x reduction in latency. For a fair comparison, the configuration for data/tensor/expert parallelism is the same for both the baseline and DeepSpeed Inference-MoE. The main differences are optimizations that DeepSpeed Inference has, such as expert-slicing, parallelism-coordinated all-to-all, and MoE-specific kernels, which the PyTorch-MoE baseline lacks. By effectively exploiting hundreds of GPUs in parallel, DeepSpeed-MoE achieves an unprecedented scale for inference at incredibly low latency: a staggering trillion-parameter MoE model can be served under 25ms by leveraging an aggregate GPU memory bandwidth of 128 TB/sec (33% of peak memory bandwidth), making it possible to serve such a massive model even in extremely interactive online applications.

While we realize that 33% compute utilization on 256 GPUs would be fairly low for a compute-bound application such as
training with high arithmetic intensity, a 33% memory bandwidth utilization for enabling low-latency massive-model inference with virtually no arithmetic intensity is an unprecedented result, given the intensive communication required in such scenarios.
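For context on the 33% figure, the back-of-the-envelope check below compares the reported 128 TB/s of achieved aggregate bandwidth against the theoretical aggregate of 256 A100-40GB GPUs, assuming the commonly cited ~1.56 TB/s of HBM bandwidth per GPU (an assumption on our part, not a number taken from the paper).

    # Aggregate memory bandwidth sanity check for the 256-GPU MoE inference result.
    gpus = 256
    hbm_bw_per_gpu_tbs = 1.555          # assumed A100-40GB HBM2 bandwidth, TB/s
    achieved_tbs = 128                  # aggregate bandwidth reported in the text

    peak_tbs = gpus * hbm_bw_per_gpu_tbs        # ~398 TB/s theoretical aggregate
    utilization = achieved_tbs / peak_tbs       # ~0.32, i.e., roughly the 33% quoted
    print(f"peak ~{peak_tbs:.0f} TB/s, utilization ~{utilization:.0%}")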
[Figure 8 (plot): per-GPU throughput (#tokens/sec/GPU) and TFLOPS-per-GPU for LM-175B and LM-530B, comparing FT (TP-only), DS-Inference (TP-only), FT (PP+TP), and DS-Inference (PP+TP).]

Fig. 8. Throughput comparison of DeepSpeed Transformer with FT for 175B and 530B models on 16 and 40 GPUs. We run with batch sizes that give the best performance for each configuration.

C. Throughput Oriented Massive Model Inference

Massive models are capable of processing large input prompts and generating a large number of coherent tokens. In some applications (e.g., offline query rewriting in web-scale search and recommendation systems), this token generation process can be less latency focused and more throughput oriented. In this sub-section we show the throughput improvement of DeepSpeed Inference for massive model inference.

Fig. 8 shows that DeepSpeed Inference achieves a 1.51x throughput improvement over the best FasterTransformer (FT) configuration for the GPT-3 175B model running on two nodes (2 x 8 A100). This improvement comes from our improved pipeline parallelism schedule and the ability to run much larger batch sizes using the memory optimization and communication minimization strategies described in Sec. IV. For the 530B model, we could not run FT using a combination of TP and PP without crashing, but compared to the TP-only version of FT, DeepSpeed Inference achieves over a 1.53x throughput improvement running on 5 nodes.
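One way to see why larger batch sizes matter so much for pipelined inference is the standard pipeline-bubble estimate: with p stages and m micro-batches, roughly (p-1)/(m+p-1) of the time is spent idle in the bubble. This is a generic textbook approximation, not the actual schedule DeepSpeed uses (Sec. IV), but it illustrates why memory optimizations that allow more micro-batches translate into higher throughput.

    # Generic pipeline-utilization estimate (not DeepSpeed's schedule): m / (m + p - 1).
    def pipeline_utilization(num_stages: int, num_microbatches: int) -> float:
        return num_microbatches / (num_microbatches + num_stages - 1)

    for m in (1, 4, 16, 64):
        print(f"{m:3d} micro-batches on 8 stages -> {pipeline_utilization(8, m):.0%} utilization")
    # 1 -> 12%, 4 -> 36%, 16 -> 70%, 64 -> 90%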
D. Democratizing Larger Model Inference with ZeRO-Inference

We evaluate three aspects of ZeRO-Inference.

1) Model Scale: ZeRO-Inference can run inference of a 530B-parameter model on a single A6000 GPU, 25x larger than the largest model that can be inferenced with a GPU-only solution (and 10x larger than with a CPU-only solution), making it possible for data scientists to test massive models on single-GPU workstations without requiring massive GPU clusters or incurring huge cost (see Fig. 9(b)).

2) Inference Throughput: ZeRO-Inference achieves excellent inference throughput of up to 84 TFLOPS, 54% of the theoretical peak (158.4 TFLOPS), for offline inference with very large batch sizes (see Fig. 9(b)). In fact, for models that fit in CPU memory, it offers over 25x higher throughput than the CPU-only solution. Furthermore, even for models that fit in a single GPU's memory, it offers over 50% better throughput than the GPU-only solution. This is possible because ZeRO-Inference can support much larger batch sizes than GPU-only solutions by offloading the parameters to CPU or NVMe and using GPU memory to store activations. The benefit of larger batch sizes is shown in Fig. 9(a).
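The sketch below is a simple analytic model of why these larger batches pay off under offloading: per layer, the weights must be streamed over PCIe or from NVMe regardless of batch size, so increasing the batch amortizes a roughly fixed transfer cost over more tokens until compute becomes the bottleneck. The bandwidth and FLOP numbers are placeholders chosen for illustration, not measurements.

    # Toy throughput model for offloaded inference of one transformer layer.
    def tokens_per_sec(batch, layer_params=1e9, pcie_bw=25e9, gpu_flops=150e12):
        fetch_s   = layer_params * 2 / pcie_bw            # stream FP16 weights over PCIe
        compute_s = 2 * layer_params * batch / gpu_flops  # ~2 FLOPs per parameter per token
        return batch / max(fetch_s, compute_s)            # fetch and compute overlap

    for b in (1, 16, 256, 4096):
        print(f"batch {b:5d}: ~{tokens_per_sec(b):,.0f} tokens/s per layer")
    # Throughput grows ~linearly with batch while weight streaming dominates, then flattens
    # once compute does, which is also why prefetching helps most at small batch sizes.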
3) Scalability: When additional GPUs are available, ZeRO-Inference can leverage them in parallel to achieve near-perfect linear throughput (see Fig. 9(c)) by leveraging the aggregate PCIe bandwidth across GPUs, as described in Sec. VI-B.

E. Performance Breakdown and Analysis

1) Dense GPU kernel performance breakdown: Fig. 10(a) shows that, compared to the PyTorch baseline, deep-fusion offers a significant reduction in latency by reducing kernel launch and data movement overheads, while our custom GeMM implementation offers a further reduction for small batch sizes by increasing the memory bandwidth utilization of GeMM.

2) Throughput breakdown for massive model GPU inference: Fig. 10(b) shows the impact on inference throughput of several optimizations in DeepSpeed Inference, such as the dense optimized kernels, inference-optimized scheduling, memory optimizations that lead to increased batch size, and communication optimizations that reduce PCIe data movement overheads, as described in Sec. IV.

3) Prompt latency improvement with hybrid scheduling: Fig. 13 shows that DeepSpeed Inference with hybrid scheduling achieves 1.18x and 3.06x prompt processing speed-up over FasterTransformer for the GPT-3 175B model with the PP + MP configuration and the MP-only configuration, respectively. This experiment was conducted on two nodes, each with 8 A100 GPUs. We enable both pipeline and tensor parallelism. We set the batch size to 24, because the latency dramatically increases when the batch size is larger than 24; we suspect this is related to an issue in the AllReduce kernel in PyTorch. The results demonstrate that hybrid scheduling has the potential to reduce prompt processing latency, and we leave fixing the AllReduce issue as future work.

4) Memory bandwidth scalability for sparse MoE models: Fig. 11 shows that DeepSpeed Inference achieves much higher per-GPU memory bandwidth than the PyTorch baseline for a 52B MoE model on an 8xA100-GPU node, while also demonstrating significantly better memory bandwidth scalability all the way to 128 GPUs, which leads to faster sparse model inference latency and higher throughput. This is the combined effect of the MoE kernels and all-to-all optimizations presented in Section V.

5) Impact of pre-fetching on ZeRO-Inference throughput: Fig. 10(c) shows that prefetching (Sec. VI-B) improves throughput at small batch sizes, while the benefit diminishes at larger batch sizes due to the higher arithmetic intensity available to hide the CPU/NVMe-to-GPU communication overhead.

6) Comparison with E.T.: We also compared with a state-of-the-art transformer kernel, E.T. [27], on smaller-scale DistilBERT and BERT encoder models on NVIDIA A100 GPUs with batch size 1 and sequence length 128. Fig. 12 shows that DeepSpeed Inference is 1.7x and 1.4x faster than E.T. on those two models. DeepSpeed Inference achieves lower latency because DeepFusion fuses more operators, leading to lower kernel invocation overhead and higher memory bandwidth utilization. In addition to being faster for small encoder models, we remark that the scope of our work is also much broader than E.T.: DeepSpeed Inference supports encoder, decoder, and sparsely gated MoE models at much larger scale.

Fig. 9. (a) Throughput of GPT-NeoX-20B across batch sizes on an A6000 GPU. (b) Throughput across models on an A6000 GPU. (c) Throughput of GPT-50B using up to 16 GPUs over a single GPU (67 TFLOPS, 53% of peak) on the DGX2 V100.

Fig. 10. (a) Benefit of Deep-Fusion and the optimized GeMM over the Megatron baseline for the GPT2 model. (b) Throughput improvement with different pipeline parallelism optimizations for the 530B model. (c) Impact of prefetching on ZeRO-Inference performance on a single V100 GPU.

Fig. 11. Aggregate memory bandwidth scalability of DeepSpeed-MoE compared to baseline.

Fig. 12. Comparison with alternative Transformer kernels (E.T. [27]).

Fig. 13. Prompt processing latency and TFLOPS comparison between DeepSpeed with hybrid scheduling and FasterTransformer.

VIII. CONCLUSION

This paper presents DeepSpeed Inference, a system that enables efficient inference of transformer models at unprecedented scale with respect to model size, the number of GPUs, and performance. With innovations across the entire system stack, DeepSpeed Inference delivers speedy, efficient, and economical inference as the model size grows, the model architecture evolves, or the latency requirements become more stringent, supporting the increasing diversity of transformer models and their application scenarios. DeepSpeed Inference offers previously unattainable low latencies at unprecedented model scales, and makes these gigantic models servable with remarkably few resources. With such capabilities, we hope DeepSpeed Inference will not only facilitate the fast pace of innovation in transformer models but also further the state of using these models in production and research for everyone in need.
REFERENCES

[1] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., “Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model,” arXiv preprint arXiv:2201.11990, 2022.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[3] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[4] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” OpenAI Blog, 2018.
[5] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[6] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, M. Pieler, U. S. Prashanth, S. Purohit, L. Reynolds, J. Tow, B. Wang, and S. Weinbach, “GPT-NeoX-20B: An open-source autoregressive language model,” 2022.
[7] “Turing-NLG: A 17-billion-parameter language model by Microsoft,” https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/, accessed: 2022-03-20.
[8] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[9] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language models: Methods, analysis & insights from training gopher,” arXiv preprint arXiv:2112.11446, 2021.
[10] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with conditional computation and automatic sharding,” arXiv preprint arXiv:2006.16668, 2020.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[12] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” CoRR, vol. abs/2101.03961, 2021.
[13] Google, “More efficient in-context learning with glam,” https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html, 2021.
[14] A. Yang, J. Lin, R. Men, C. Zhou, L. Jiang, X. Jia, A. Wang, J. Zhang, J. Wang, Y. Li, D. Zhang, W. Lin, L. Qu, J. Zhou, and H. Yang, “M6-t: Exploring sparse expert models and beyond,” 2021. [Online]. Available: https://arxiv.org/abs/2105.15082
[15] Y. J. Kim, A. A. Awan, A. Muzio, A. F. Cruz-Salinas, L. Lu, A. Hendy, S. Rajbhandari, Y. He, and H. H. Awadalla, “Scalable and efficient moe training for multitask multilingual models,” CoRR, vol. abs/2109.10465, 2021. [Online]. Available: https://arxiv.org/abs/2109.10465
[16] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using gpu model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
[17] Y. Huang, Y. Cheng, D. Chen, H. Lee, J. Ngiam, Q. V. Le, and Z. Chen, “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” ArXiv, vol. abs/1811.06965, 2018.
[18] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and P. Gibbons, “Pipedream: Fast and efficient pipeline parallel dnn training,” arXiv preprint arXiv:1806.03377, 2018.
[19] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia, “Efficient large-scale language model training on gpu clusters using megatron-lm,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’21. New York, NY, USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3458817.3476209
[20] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16.
[21] D. Team and R. Majumder, “DeepSpeed: Extreme-scale model training for everyone,” https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/, 2020.
[22] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” arXiv preprint arXiv:2101.03961, 2021.
[23] S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Yazdani Aminabadi, A. A. Awan, J. Rasley, and Y. He, “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale,” ArXiv, January 2022. [Online]. Available: https://www.microsoft.com/en-us/research/publication/deepspeed-moe-advancing-mixture-of-experts-inference-and-training-to-power-next-generation-ai-scale/
[24] “Microsoft DeepSpeed achieves the fastest BERT training time,” https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html, accessed: 2022-04-01.
[25] A. Ivanov, N. Dryden, T. Ben-Nun, S. Li, and T. Hoefler, “Data movement is all you need: A case study on optimizing transformers,” Proceedings of Machine Learning and Systems, vol. 3, pp. 711–732, 2021.
[26] J. Fang, Y. Yu, C. Zhao, and J. Zhou, “Turbotransformers: an efficient gpu serving system for transformer models,” in Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 389–402.
[27] S. Chen, S. Huang, S. Pandey, B. Li, G. R. Gao, L. Zheng, C. Ding, and H. Liu, “E.T.: re-thinking self-attention for transformer models on gpus,” in SC ’21: The International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, Missouri, USA, November 14–19, 2021, B. R. de Supinski, M. W. Hall, and T. Gamblin, Eds. ACM, 2021, pp. 25:1–25:18.
[28] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578–594.
[29] ONNX Runtime developers, “ONNX Runtime,” 11 2018. [Online]. Available: https://github.com/microsoft/onnxruntime
[30] “NVIDIA TensorRT,” https://developer.nvidia.com/tensorrt, accessed: 2022-03-20.
[31] “NVIDIA FasterTransformer,” https://github.com/NVIDIA/FasterTransformer, accessed: 2022-03-20.
[32] TensorFlow XLA developers, “XLA: Optimizing compiler for machine learning.” [Online]. Available: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/xla
[33] M. Dehghani, A. Arnab, L. Beyer, A. Vaswani, and Y. Tay, “The efficiency misnomer,” ArXiv, vol. abs/2110.12894, 2021.
[34] “NVIDIA CUTLASS,” accessed: 2022-03-20. [Online]. Available: https://github.com/NVIDIA/cutlass
[35] A. Gray, “Getting started with cuda graphs,” 11.
[36] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, “Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’21, 2021.
[37] Y. J. Kim, A. A. Awan, A. Muzio, A. F. C. Salinas, L. Lu, A. Hendy, S. Rajbhandari, Y. He, and H. H. Awadalla, “Scalable and efficient moe training for multitask multilingual models,” arXiv preprint arXiv:2109.10465, 2021.
[38] “NVIDIA DGX A100,” https://www.nvidia.com/en-us/data-center/dgx-a100/, accessed: 2022-03-20.
[39] “Lambda Vector,” https://lambdalabs.com/gpu-workstations/vector, accessed: 2022-03-20.
[40] “NVIDIA DGX-2,” https://www.nvidia.com/en-us/data-center/dgx-2/, accessed: 2022-03-20.
