DeepSpeed Inference - Enabling Efficient Inference of Transformer Models at Unprecedented Scale
Abstract—The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically, with the largest having hundreds of billions of parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts; the target application scenarios can be latency-critical or throughput-oriented; the deployment hardware could be single- or multi-GPU systems with different types of memory and storage, etc. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging.

In this paper, we present DeepSpeed Inference, a comprehensive system solution for transformer model inference to address the above-mentioned challenges. DeepSpeed Inference consists of (1) a multi-GPU inference solution to minimize latency while maximizing the throughput of both dense and sparse transformer models when they fit in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to the GPU memory and compute to enable high inference throughput with large models which do not fit in aggregate GPU memory.

DeepSpeed Inference reduces latency by up to 7.3× over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5× for throughput-oriented scenarios. Moreover, it enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can run inference on models 25× larger than GPU-only solutions, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).

I. INTRODUCTION

The past several years have witnessed the success of transformer-based models; their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically, with the largest exceeding a trillion parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts technique; the target application scenarios can be latency-critical or throughput-oriented; the deployment hardware could be single- or multi-GPU systems with different types of memory and storage, etc. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging.

Latency Challenges: Using a transformer-based model for online scenarios in production requires meeting stringent latency requirements, and thus the batch sizes used are generally small. For small batch sizes, the inference latency of a model is lower bounded by the time it takes to load all the model parameters from memory to registers. Meeting the latency requirements of transformer model inference is therefore equivalent to achieving adequate overall memory bandwidth. Maximizing effective memory bandwidth at small batch sizes requires reading memory at near peak memory bandwidth for fully-connected (or, linear) layers, which contain the majority of the model weights, while also minimizing the kernel launch and data movement overhead of other operators like layernorm and softmax. The GeMM implementations and other kernels designed for training primarily focus on maximizing compute utilization at very large batch sizes and are sub-optimal for latency-critical inference.

In addition, for large models, even the peak memory bandwidth of a single device may not be sufficient to meet inference latency constraints. This requires aggregate memory bandwidth across multiple devices, which needs optimal parallelism strategies for partitioning the model computation across devices that minimize the communication overhead between them. Such parallelism strategies must cater to the variation in transformer architecture and hardware characteristics.

With respect to transformer architectures, we view them in two broad categories — dense or sparse Mixture-of-Experts (MoE) transformer models. The optimal parallelism strategy depends on the model architecture. For example, tensor and pipeline parallelism work only for dense transformers, while expert parallelism only works for sparse transformers. Moreover, transformer-based MoE models contain both dense and sparse transformer components, requiring a combination of different parallelism techniques to maximize the effective memory bandwidth across devices. Finally, with respect to hardware characteristics, modern clusters have heterogeneous network topology (e.g., intra-node NVLink/NVSwitch and inter-node InfiniBand), which requires further consideration when developing parallelism strategies.
Throughput Challenges: In addition to meeting latency, production workloads also have throughput targets to meet cost budgets. At small batch sizes, where the workload is still memory bandwidth bound, the latency of the workload does not increase as long as the computation is entirely overlapped with model weight reads. Therefore, maximizing throughput while meeting the latency SLA requires not only maximizing the memory bandwidth utilization, but also overlapping compute with the model weight reads, and at the same time achieving high compute efficiency at small batch sizes to maximize the batch size whose compute can be overlapped with reading the model weights. Inference kernels must therefore achieve high memory bandwidth utilization and high compute utilization at small batch sizes, whereas training kernels simply need to achieve high compute utilization at much larger batch sizes. This makes developing inference kernels quite challenging.

Moreover, even for throughput-bound scenarios with large batch sizes, inference workloads can differ from training workloads in terms of data flow and computation dependencies, requiring novel solutions to achieve high throughput. For example, generative transformers have dependencies between each generated token and the next token, which do not exist during training. As a result, inference incurs a higher memory requirement to keep track of previously generated states. For large models that may require pipeline parallelism to fit the model in memory, this dependency across generated tokens also requires new pipeline schedules, compared to training scenarios, to keep all devices busy.

Feasibility Challenges under Limited Resources: A model with tens of billions of parameters is simply too large to fit in the memory of a single GPU device, and at hundreds of billions of parameters, it is too large to even fit in the aggregate GPU memory of a single node. For example, inferencing MT-NLG 530B [1] requires about 1TB of GPU memory just to fit the model for inference, requiring over three DGX-2 nodes with over two dozen NVIDIA A100 40GB GPUs. Most data scientists simply do not have access to such GPU resources needed for inference of these massive models.

In this paper, we present DeepSpeed Inference, a comprehensive solution for transformer model inference designed to address the above challenges. DeepSpeed Inference consists of two components:

1) DeepSpeed Transformer: DeepSpeed Transformer is a GPU-only solution designed to minimize latency while maximizing throughput for both dense and sparse transformer models. It achieves state-of-the-art latency and throughput for transformer models of all sizes and supports running on a single GPU or scaling to hundreds of GPUs to inference multi-trillion parameter models.

The DeepSpeed Transformer solution is a three-layered system architecture consisting of i) single GPU transformer kernels optimized for memory bandwidth utilization at low batch sizes and high throughput at large batch sizes, ii) a many-GPU dense transformer layer, for scaling dense transformer models across GPUs using tensor-slicing and inference-optimized pipeline parallelism, and iii) a massive-GPU scale sparse transformer layer, designed to scale MoE transformer layers to hundreds of GPUs using a combination of parallelism techniques and communication optimization strategies, while also minimizing single GPU sparse computation overhead using optimized sparse kernels.

By taking this layered approach, where each layer addresses a unique aspect of the latency challenge (batch size, scaling dense models, and scaling sparse models) while the layers remain compatible and built on top of each other, we create a comprehensive system capable of achieving state-of-the-art latency and throughput at unprecedented scales for both dense and sparse transformer models despite the heterogeneity in batch size, model scale and model characteristics.

2) ZeRO-Inference: ZeRO-Inference is a heterogeneous GPU+CPU+NVMe based solution to address the memory challenge by enabling massive model inference with minimal GPU resources. In contrast to DeepSpeed Transformer, for applications that are less latency sensitive but resource constrained, ZeRO-Inference allows inference of models with hundreds of billions of parameters on a single or multiple GPUs as long as there is enough CPU or NVMe memory to store the model parameters. In addition, even when the model does fit in aggregate GPU memory, ZeRO-Inference delivers better per-GPU efficiency than DeepSpeed Transformer by supporting much larger batch sizes.

The main contributions of the paper are as follows:
• Single GPU transformer kernels for minimizing latency and maximizing throughput via memory-bandwidth-centric fusion schedules and GeMM kernels (Sec. III).
• A many-GPU dense transformer inference system that combines tensor-parallelism to minimize latency with inference-optimized pipeline parallelism schedules and memory optimizations to maximize throughput (Sec. IV).
• A massive-GPU sparse model inference system that combines: i) expert, data, and tensor parallelism, ii) novel communication optimizations and iii) sparse kernel optimizations to scale sparse inference on trillions of parameters across hundreds of GPUs (Sec. V).
• ZeRO-Inference that leverages CPU, NVMe and GPU memory along with GPU compute to make massive model inference accessible with limited resources (Sec. VI).
• Extensive evaluation of DeepSpeed Inference on a wide range of transformer models covering four aspects: i) For latency-sensitive scenarios, DeepSpeed Transformer shows latency reduction over the state-of-the-art of up to 1.9× for dense models (up to 175B parameters) and 7.2× for sparse models (a 1T model under 25 ms), while scaling to 256 GPUs at 33% peak memory bandwidth utilization, an unprecedented scale for inference. ii) For throughput-oriented scenarios, DeepSpeed Transformer demonstrates over 1.5× gain over the state-of-the-art (Sec. VII-C). iii) Evaluation of ZeRO-Inference on GPU resource constrained systems that shows ZeRO-Inference can support inference with 25× larger models than with a GPU-only solution while achieving over 50% of peak hardware performance (Sec. VII-D). iv) Performance analysis and breakdown of the different optimizations discussed throughout the paper (Sec. VII-E).

Despite the diversity in the transformer inference landscape, DeepSpeed Inference offers a versatile solution capable of achieving state-of-the-art latency and throughput for all variations of transformer model inference: dense or sparse, small or large batches, billions to trillions of parameters, single GPU or across hundreds of GPUs. Furthermore, it democratizes access to large transformer inference by enabling it on systems with limited GPU resources. DeepSpeed Inference is available for everyone to leverage through our open-source repository: https://github.com/microsoft/DeepSpeed.
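To make the entry point concrete, the following is a minimal usage sketch based on the public deepspeed.init_inference API around the time of writing; the Hugging Face model choice, the tensor-parallel degree, and argument names such as mp_size and replace_with_kernel_inject are illustrative and may differ across DeepSpeed versions.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")

# Inject the optimized inference kernels and shard the model with tensor parallelism.
engine = deepspeed.init_inference(
    model,
    mp_size=2,                        # tensor-parallel degree (assumes 2 GPUs)
    dtype=torch.half,                 # FP16 inference
    replace_with_kernel_inject=True,  # swap in DeepSpeed transformer kernels
)

inputs = tok("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tok.decode(outputs[0], skip_special_tokens=True))
```

Such a script is launched with the DeepSpeed launcher (e.g., deepspeed --num_gpus 2 infer.py) so that one process is created per tensor-parallel rank.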
II. BACKGROUND AND RELATED WORK

a) Dense Transformer Models: The size of transformer-based language models has been increasing by 10× each year for the past few years, from models with a few hundred millions of parameters [2], [3], [4], [5], to models with dozens of billions of parameters [6], [7]. Recently, GPT-3 175B [8], Gopher 280B [9], and MT-NLG 530B [1] further push this limit to hundreds of billions of parameters. As larger models have demonstrated outstanding accuracy on various natural language understanding and generation tasks, this exponential growth in model scale is likely to continue as long as system and hardware technology can keep up with it.

b) Sparse Transformer Models: The success of scaling dense language models has motivated researchers and practitioners to further propose the Mixture-of-Experts (MoE) technique, which introduces sparsity in transformer models [10]. Typical transformer model [11] architectures have transformer blocks that consist of two consecutive sub-layers, a self-attention sub-layer followed by a position-wise feed-forward (FF) block. MoE models add conditional computation by replacing the feed-forward blocks with a position-wise MoE layer with a variable number of experts and a top-k gating function. Increasing the number of experts allows scaling the MoE model size with only a sublinear increase in computation cost, greatly reducing the training cost of the model. However, MoE models can be up to 8× larger than their quality-equivalent dense models [12], [13], [14], [15], requiring much higher aggregate memory bandwidth to achieve comparable latency during inference.

c) System Technology for Memory and Performance Scaling: The major challenge in scaling model sizes resides in the memory bottleneck. To satisfy the memory requirement, prior works have proposed various parallelism strategies to use the aggregated GPU memory within and across nodes.

Tensor parallelism [16] splits model layers horizontally across GPU devices. As the number of GPUs used for tensor-slicing increases, two primary trade-offs show up: (i) lower compute granularity due to the smaller local problem size, and (ii) all-reduce communications in each transformer layer to aggregate the partial activations. When scaling across node boundaries, the inter-node bandwidth is limited compared to the fast intra-node connections, thus tensor parallelism can cause a significant latency degradation. In practice, tensor parallelism is often restricted to groups of GPUs sharing the high-bandwidth interconnect within a node (e.g., NVIDIA NVLink).

Pipeline parallelism [17], [18], [19] splits a model vertically into pipeline stages and uses micro-batching to hide pipeline bubbles. It only requires communication for data aggregation between adjacent pipeline stages, and is thus more efficient to scale across nodes. However, model splitting and micro-batching could pose functionality, performance and convergence related restrictions for pipeline parallelism.

ZeRO [20] takes a different approach and removes the memory redundancies in conventional data parallelism by partitioning model states across the data-parallel processes instead of replicating them. 3D parallelism [21] combines data, tensor, and pipeline parallelism efficiently to scale to models of trillions of parameters.

Expert parallelism [22] places different experts on different GPUs and executes them in parallel. Each expert only processes a subset of tokens, selected by a learned top-k gating function. The classic all-to-all communication primitive has been used to implement expert parallelism [23], [15], [22].

The above parallelism strategies are mainly designed for maximizing training throughput, and their effectiveness can be limited during inference because of insufficient parallelism with the small batch sizes used in inference. Our work leverages these techniques and applies innovative optimizations to make them effective and performant for inference.

d) Optimized Transformer Kernels: There is also a suite of work focused on accelerating the performance of transformer kernels [24], [25], [26]. A record training time for BERT was accomplished with stochastic transformer kernels that fused operators and reduced activation memory to support large batch sizes [24]. Ianov et al. [25] use transformer dataflow graphs to fuse elementwise and reduction operators and accelerate training. TurboTransformers [26] similarly fuses elementwise and reduction operators for transformer inference. E.T. [27] combines fusion, custom GeMM, and pruning to accelerate the inference speed of transformers. The kernel optimizations presented in this work fuse a wider variety of operators, such as head-wise transformations that require additional data layout transformation, as well as layers beyond the self-attention sublayers, such as the intermediate layers and MoE-specific layers. In addition, the kernels presented in this work also support auto-regressive generative models that require KV-caching [8] to be performant during inference, whereas the above-mentioned works do not consider support for KV-caching.

e) DNN Inference Optimizations: There has also been extensive work on optimizing DNN inference through platforms, libraries, compilation, and compression strategies. Several compilers and runtimes exist to facilitate the deployment of models, such as TVM [28], ONNXRuntime [29] and TensorRT [30]. These platforms have mostly focused on optimizing DNN models that can fit in a single GPU, such as small transformers with a few hundred millions of parameters. In contrast, our work targets billion-scale or even trillion-scale transformers that do not easily fit on a single GPU device. The most related work to ours is FasterTransformer [31], which supports multi-GPU inference for transformer models; we provide a more detailed comparison in Section VII. Finally, there have been numerous works that improve the deployment of DNN models through model compression techniques, such as distillation, quantization, and sparsification, which could reduce the computation time and memory consumption with a small accuracy trade-off. Our work is complementary to these
model compression techniques and can be combined with them to boost performance further.

III. INFERENCE-OPTIMIZED TRANSFORMER KERNELS

In this section, we discuss the challenges, design, and optimizations for transformer kernels capable of achieving high-performance inference for both small and large batch sizes.

A. Inference Challenges on Different Batch Sizes

As discussed in Sec. I, small-batch performance is limited by the memory bandwidth utilization in reading model weights. There are three main challenges to optimizing for memory bandwidth at small-batch inference. First, due to the limited work in the different kernels performing the operations of a transformer layer at small batch size, inference performance suffers from kernel-invocation overhead. Second, each kernel invocation writes data to global memory which is read by GPU cores during the next kernel invocation, and this data transfer between GPU cores and global memory adds an additional overhead. Finally, neither the cuBLAS nor the CUTLASS GeMM libraries are well tuned for extremely small batch sizes, and cannot achieve good memory-bandwidth utilization.

Large-batch inference performance, on the other hand, is limited by compute utilization, and while compute-heavy operations like the GeMMs inside a transformer layer can achieve very good compute utilization using the cuBLAS and CUTLASS libraries, the overall utilization can still be limited by the kernel launch overheads and data transfers between GPU cores and global memory across the different non-GeMM kernels.

To address these challenges, we introduce two techniques: i) Deep-Fusion, to reduce kernel-invocation and data-movement overheads by fusing multiple kernels beyond element-wise operations, and ii) a custom GeMM kernel designed to improve memory bandwidth utilization when the batch size is relatively small, while also allowing it to be fused using Deep-Fusion. We discuss these techniques in detail next.

B. Deep-Fusion

While operator fusion is a common technique used in deep learning to reduce kernel launch and data-movement overhead, it is limited primarily to element-wise operators [32], [28], [29]. In contrast, transformers consist of operators like data layout transformations, reductions, and GeMMs which create data dependencies across thread blocks, making them difficult to fuse. This is because, on a GPU, if data produced by one thread-block is consumed by a different one, a global memory synchronization is needed, which requires invoking a new kernel.

To avoid the need for a global synchronization, Deep-Fusion tiles the computation space along dimensions of the iteration space which incur no cross-tile data dependencies and executes them in parallel across different thread-blocks. The dimensions of the computation space which do contain data dependencies are not tiled, and are instead processed by the same thread-block. After this tiling, two operators can be fused using Deep-Fusion if each tile of the second operator depends on exactly one output tile of the first operator. By performing fusion at tile granularity, Deep-Fusion can fuse not only element-wise operations but also reductions, data transpositions, and GeMMs, as long as there are no cross-tile dependencies. For example, all micro-operations in a layer-norm [33] can be tiled along the token dimension, while the reduction dimensions are processed within a tile. This allows all the micro-operations inside a layernorm to be fused into a single kernel despite consisting of multiple reduction operations. Furthermore, the data produced by each tile is either kept in registers or in shared memory when possible to allow for data reuse across operators without incurring global memory data-transfer overheads.
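To make the tiling rule concrete, below is a schematic NumPy sketch (not the CUDA implementation) of how a bias-add followed by a layer-norm can be processed tile-by-tile along the token dimension: every reduction stays inside a tile, so the two operators can be fused into one pass per tile with no cross-tile synchronization. Tile size and tensor shapes are illustrative.

```python
import numpy as np

def fused_bias_layernorm_tile(x_tile, bias, gamma, beta, eps=1e-5):
    # One "tile" covers a block of tokens; the reductions below run over the
    # hidden dimension, which lives entirely inside the tile.
    y = x_tile + bias                                  # element-wise op
    mean = y.mean(axis=-1, keepdims=True)              # reduction, intra-tile
    var = y.var(axis=-1, keepdims=True)                # reduction, intra-tile
    return (y - mean) / np.sqrt(var + eps) * gamma + beta

def deep_fusion_schedule(x, bias, gamma, beta, tile_tokens=32):
    # Tiles along the token dimension have no cross-tile dependencies, so on a
    # GPU each tile would map to an independent thread block.
    out = np.empty_like(x)
    for start in range(0, x.shape[0], tile_tokens):
        tile = slice(start, start + tile_tokens)
        out[tile] = fused_bias_layernorm_tile(x[tile], bias, gamma, beta)
    return out

tokens, hidden = 128, 1024
x = np.random.randn(tokens, hidden).astype(np.float32)
bias, gamma, beta = (np.random.randn(hidden).astype(np.float32) for _ in range(3))
np.testing.assert_allclose(
    deep_fusion_schedule(x, bias, gamma, beta),
    fused_bias_layernorm_tile(x, bias, gamma, beta),   # reference: whole tensor at once
    rtol=1e-4, atol=1e-5)
```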
C. SBI-GeMM: Custom GeMM for Small Batch Size

Our custom GeMM implementation is designed to be fusable with Deep-Fusion while achieving maximum memory bandwidth utilization. Its design can be viewed in three parts: tiling strategies, cooperative-group reduction, and a data-layout transformation for better memory bandwidth utilization.

1) Tiling Strategies: Fig. 1(a) depicts our GeMM scheduling for a skinny matrix multiplication. We first tile the computation along the output dimension. This allows us to implement the GeMM as a single kernel by keeping the reduction within a tile. For small models, where the output dimension is too small to create enough parallel tiles to achieve good memory bandwidth, we tile the input dimension as well and implement the GeMM as two kernels to allow for reduction across tiles.

2) Cooperative-Group Reduction: With the aforementioned tiling strategy, each warp in a thread block is responsible for producing a partially reduced result for a tile of outputs, and a final reduction is needed across all the warps within the thread block. Usually this is implemented as a binary-tree-based reduction in shared memory, which requires multiple warp-level synchronizations and thus creates a performance bottleneck. To avoid this, we perform a single data-layout transpose in shared memory such that partial results of the same output element are contiguous in memory and can be reduced by a single warp using cooperative-group collectives directly in registers (see Fig. 1(a)). At the end, the first thread of each warp holds the final result and writes it to shared memory. The results in shared memory are contiguous, allowing for a coalesced write to global memory.

3) Leveraging the Full Cache Line: In the GPU architecture, each L1 cache line is 128 bytes; however, a coalesced memory access with a single FP16 or INT8 element per thread in the warp cannot consume the full cache line. Reading multiple elements per thread along the output dimension to address this issue reduces the number of parallel tiles, which also hurts memory bandwidth. Therefore, our solution is to transpose the weight matrix during initialization such that M rows for each column are contiguous in memory, allowing each thread to read M elements along the input dimension (see Fig. 1(b)). We set M to 2 for half precision and 4 for the INT8 data type, considering a 128-byte cache line.
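The cache-line-aware layout can be illustrated with a small NumPy sketch: the weight matrix is re-laid-out at initialization so that, for every output column, M consecutive input-dimension elements are contiguous, which is what lets each thread of a warp issue one wide, cache-line-friendly read. M = 2 mirrors the FP16 setting described above; the matrix shapes and helper names are illustrative.

```python
import numpy as np

M = 2                      # elements per thread along the input dimension (FP16 case)
K, N = 4096, 4096          # input dim, output dim (illustrative)
W = np.random.randn(K, N).astype(np.float16)

# Re-layout at initialization: group the input dimension into blocks of M so that
# the M values a thread needs for a given output column sit next to each other.
W_packed = W.reshape(K // M, M, N).transpose(0, 2, 1).copy()   # [K/M, N, M], last axis contiguous

def thread_read(block_id, col):
    # The M contiguous values one thread would load with a single wide access.
    return W_packed[block_id, col]

# Sanity check: the packed layout holds the same weights as the original one.
assert np.array_equal(thread_read(0, 7), W[0:M, 7])
```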
D. Putting It Together

Small-batch Transformer Kernel: Fig. 1(c) shows the different components of a transformer layer, and the operations which
[Fig. 1: (a) SBI-GeMM output-dimension tiling with cooperative-group reduction; (b) weight data-layout transformation; (c) fused small-batch transformer layer kernel (QKV GeMM, attention score, softmax, attention context, all-reduce, bias-add, layer-norm, and feed-forward with GELU).]
Fig. 4. Expert, data and tensor parallelism in DeepSpeed-MoE.

Fig. 5. The parallelism coordinated communication (PCC) optimization follows four steps: 1) local transformation and splitting of the original data, 2) intra-tensor-model-parallel (MP) and inter-MP alltoall, followed by 3) intra-MP allgather, and 4) finally a local transform operation. Despite four steps, it is faster than the baseline alltoall shown in the bottom half of this illustration.
To avoid contention between GPUs, odd-numbered GPUs offload activations for odd-numbered layers, while even-numbered GPUs offload activations for even-numbered layers. This is crucial to fully leverage the PCIe bandwidth. Scheduling odd and even layer offloading across GPUs prevents contention on the PCIe link, allowing each GPU to fully leverage the PCIe bandwidth when it needs to offload.

V. MASSIVE SCALE SPARSE MODEL INFERENCE

While the techniques developed so far enable DeepSpeed Inference to achieve state-of-the-art latency and throughput for dense transformer models, new considerations are necessary for sparse transformer models, which consist of both sparse and dense components. The key challenge is that, on one hand, sparse models are much larger than quality-equivalent dense models (Sec. II), requiring much higher aggregate memory bandwidth to achieve latency comparable to quality-equivalent dense models, and on the other hand they have a different computational structure than dense models, requiring different parallelism approaches compared to dense transformers [23].

In this section, we introduce a massive scale MoE-based transformer model inference system capable of addressing the above challenges. It is built on top of the dense components discussed before and consists of three main components.

A. Orchestration of Tensor, Data, & Expert Parallelism for MoE

We use tensor parallelism, referred to in Fig. 4 as tensor-slicing (for non-expert parameters) and expert-slicing (for expert parameters), to split individual parameters across multiple GPUs and leverage the aggregate memory bandwidth across GPUs. However, tensor parallelism can only scale efficiently to a few GPUs due to communication overhead and fine-grained parallelism. To address this, we use expert parallelism in conjunction with tensor parallelism to scale expert parameters to hundreds of GPUs. Expert parallelism does not reduce the computation granularity of individual operators, therefore allowing our system to leverage aggregate memory bandwidth across hundreds of GPUs. To scale the non-expert computation to the same number of GPUs, we use data parallelism with no communication overhead.
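As a concrete illustration of how these parallel groups can be laid out, the sketch below builds tensor-parallel and expert-parallel process groups with torch.distributed, assuming 8 GPUs and 2-way tensor-slicing; the layout (expert-parallel groups formed by ranks that share a tensor-slicing rank) follows the description above, but the exact grouping inside DeepSpeed-MoE may differ.

```python
import torch.distributed as dist

def build_moe_groups(world_size: int, tp_degree: int):
    """Create tensor-parallel (tensor-slicing) and expert-parallel groups.

    Consecutive ranks form a tensor-parallel group; ranks that share the same
    tensor-slicing rank form an expert-parallel group, so expert parallelism
    spans all nodes while tensor-slicing stays within a high-bandwidth group.
    Must be called after dist.init_process_group(), on every rank.
    """
    rank = dist.get_rank()
    tp_group = ep_group = None
    for start in range(0, world_size, tp_degree):          # tensor-parallel groups
        ranks = list(range(start, start + tp_degree))
        g = dist.new_group(ranks)                           # collective call on all ranks
        if rank in ranks:
            tp_group = g
    for tp_rank in range(tp_degree):                        # expert-parallel groups
        ranks = list(range(tp_rank, world_size, tp_degree))
        g = dist.new_group(ranks)
        if rank in ranks:
            ep_group = g
    return tp_group, ep_group

# Example: 8 GPUs with 2-way tensor-slicing gives 4-way expert parallelism per slice.
# tp_group, ep_group = build_moe_groups(dist.get_world_size(), tp_degree=2)
```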
B. PCC: Parallelism Coordinated Communication for MoE

Expert parallelism places expert operators across GPUs and requires all-to-all communication between all expert-parallel GPUs. However, it is not efficient to scale expert parallelism to the hundreds of devices needed for sparse model inference, as the latency increases linearly with the number of devices. Fortunately, when combining expert parallelism and tensor-slicing within a single model, there are opportunities for communication optimization that can reduce the communication latency. Note that tensor-slicing splits individual operators across GPUs and requires an all-reduce between them. The all-reduce operation in tensor-slicing replicates data among the involved devices. When executing tensor-parallel operators followed by expert-parallel operators, this replication allows creating an optimized communication schedule for the all-to-all operator that does not require communicating between all the expert-parallel processes: the all-to-all can happen within just the subset of devices that share the same tensor-slicing rank, since the data across tensor-parallel ranks are replicated (Fig. 5). As a result, the latency of the all-to-all is bounded by O(p/L) instead of O(p), where L is the tensor-slicing parallelism degree and p is the total number of GPU devices.

Similarly, when executing expert-parallel operators followed by tensor-slicing operators, the final all-to-all can be done in the same way, but this time followed by an allgather operator between tensor-parallel ranks to replicate the data needed by tensor-slicing (Fig. 5). This reduces the latency overhead from O(p) to O(p/L) + O(L).

This reduced latency overhead allows better scaling to a large number of devices. For example, when scaling to 128 GPUs with 8-way tensor-slicing and 128-way expert parallelism, this approach reduces the latency overhead of the all-to-all from (128C1 + C2) to (16C1 + C2) due to the 8-way tensor-slicing, where C1 and C2 are constants determined by point-to-point latency, message size, and bandwidth.
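The communication schedule itself can be sketched with torch.distributed collectives, reusing groups like those built above: the all-to-all runs only inside the expert-parallel group (ranks sharing a tensor-slicing rank), and in the expert-to-tensor-parallel direction an allgather over the tensor-parallel group restores the replication that tensor-slicing expects. Tensor shapes, the final local transform, and the group objects are illustrative assumptions, not the exact DeepSpeed implementation.

```python
import torch
import torch.distributed as dist

def pcc_dispatch(tokens: torch.Tensor, ep_group) -> torch.Tensor:
    """Tensor-parallel -> expert-parallel boundary.

    Because tensor-slicing has already replicated `tokens` across the
    tensor-parallel ranks, the all-to-all only needs to run among ranks that
    share the same tensor-slicing rank: p/L devices instead of all p devices.
    """
    out = torch.empty_like(tokens)
    dist.all_to_all_single(out, tokens, group=ep_group)
    return out

def pcc_combine(expert_out: torch.Tensor, ep_group, tp_group) -> torch.Tensor:
    """Expert-parallel -> tensor-parallel boundary.

    A reduced-scope all-to-all, followed by an allgather across tensor-parallel
    ranks to re-create the replication needed by the next tensor-sliced
    operator (O(p/L) + O(L) instead of O(p)); the concat stands in for the
    final local transform of Fig. 5.
    """
    gathered = torch.empty_like(expert_out)
    dist.all_to_all_single(gathered, expert_out, group=ep_group)
    chunks = [torch.empty_like(gathered) for _ in range(dist.get_world_size(tp_group))]
    dist.all_gather(chunks, gathered, group=tp_group)
    return torch.cat(chunks, dim=0)
```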
C. Highly Optimized Computation Kernels for MoE

MoE-related computation consists of four major components: (1) a gating function that determines the assignment of tokens to experts, where the result is represented as a sparse tensor (a one-hot vector representing the assigned expert for each token in the sequence); (2) a sequence of sparse operators, including a cumsum operator, to compute an inverse mapping from experts to token IDs (expert-to-token) using the previously mentioned token-to-expert one-hot vector; (3) a scatter operator to distribute tokens to their corresponding experts, implemented as a sparse einsum between the expert-to-token mapping computed in the previous step and the input tokens; and (4) a final sparse-einsum-based gather operation that re-distributes the tokens processed at each expert back to their original ordering.

The sparse tensor representation in the gating function and the sparse einsum operators introduce a significant latency overhead. The gating function includes numerous operations to create token masks, select the top-k experts, and perform a cumulative sum (cumsum) to find the token ID going to each expert, plus a sparse matrix multiply, all of which are not only wasteful due to the sparse tensor representation, but also extremely slow due to many kernel invocations. Moreover, the sparse einsums have a complexity of S × E × M × c_e, where S represents the total number of tokens, E represents the number of experts, M represents the model hidden dimension, and c_e represents the expert capacity (S, E, and M are the main complexity factors, while c_e is normally very small). In this expression, (E − 1) out of E operations for each token are multiplications and additions with zeros, since only one expert is typically selected to process c_e tokens. This comes from the fact that generalizing the gating operations results in einsums over several masking matrices or one-hot vectors, which produce a lot of unnecessary computation with zeros to select the correct token for each expert. We optimize these operators using a dense representation and kernel fusion.

We optimize each of the four steps in the gating function in the following way: 1) we replace the one-hot representation of the token-to-expert mapping with a table data structure, greatly reducing the memory overhead by eliminating all the zeros in the one-hot vectors; 2) we create the inverse mapping (the expert-to-token mapping table) from the token-to-expert mapping table by simply scanning through the token-to-expert table in parallel; 3) we replace the sparse-einsum-based scatter operation with a data-layout transformation that achieves the same result by first identifying the token IDs assigned to an expert using the expert-to-token mapping table created in the previous step, and then copying these tokens to the appropriate expert location; 4) after the tokens are processed by their corresponding experts, we use a similar data-layout transformation to replace the sparse-einsum-based gather operation.

Using the data-layout transformation instead of sparse einsums reduces the complexity of these operations from S × E × M × c_e to S × M × c_e. We use shared memory for the data-layout transformations and fuse all but the final data-layout transformation into a single kernel using basic fusion principles. Combined, these optimizations result in over 6× reduction in MoE kernel-related latency.
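The following PyTorch sketch illustrates the dense-representation idea under a top-1 gating assumption: the token-to-expert assignment is kept as an index table rather than a one-hot tensor, the expert-to-token table is recovered by sorting that table, and scatter/gather become index-based copies instead of sparse einsums. Capacity handling and the fused CUDA kernels are omitted; function names are illustrative.

```python
import torch

def moe_dispatch_dense(tokens, gate_logits, num_experts):
    """tokens: [S, M]; gate_logits: [S, E]. Top-1 gating, no capacity drop."""
    # 1) token-to-expert table instead of a one-hot [S, E] tensor.
    token_to_expert = gate_logits.argmax(dim=-1)             # [S]
    # 2) expert-to-token table: token ids grouped by their assigned expert.
    order = torch.argsort(token_to_expert, stable=True)      # [S]
    counts = torch.bincount(token_to_expert, minlength=num_experts)
    # 3) scatter as a data-layout transformation (a plain indexed copy).
    dispatched = tokens[order]                                # [S, M], grouped by expert
    return dispatched, order, counts

def moe_combine_dense(expert_out, order):
    # 4) gather back to the original token ordering, again an indexed copy.
    combined = torch.empty_like(expert_out)
    combined[order] = expert_out
    return combined

S, M, E = 8, 4, 4
tokens, logits = torch.randn(S, M), torch.randn(S, E)
dispatched, order, counts = moe_dispatch_dense(tokens, logits, E)
# Identity "experts" for the sanity check: combining must restore the input order.
assert torch.equal(moe_combine_dense(dispatched, order), tokens)
```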
VI. DEMOCRATIZATION OF LARGE MODEL INFERENCE

DeepSpeed Transformer needs the model to fit in aggregate GPU memory, requiring a large number of GPUs for large models. This is a barrier for many data scientists who lack access to a large number of GPUs; e.g., dozens of GPUs are required to inference models like MT-NLG 530B. To broaden access to large models, we propose ZeRO-Inference, which enables large model inference using as few as a single GPU. For non-latency-sensitive applications, ZeRO-Inference achieves high performance by leveraging DRAM and NVMe memory in addition to GPU memory and compute. Compared to a CPU-only solution, ZeRO-Inference can achieve orders of magnitude higher throughput by efficiently exploiting the available GPU hardware. Moreover, it offers similar or even better throughput than DeepSpeed Transformer by supporting larger batch sizes. We now discuss the design of ZeRO-Inference and the performance optimizations that make it very efficient for throughput-oriented inference.

A. ZeRO-Inference Design

ZeRO-Inference utilizes the available heterogeneous memory (i.e., GPU memory, DRAM, and NVMe) to satisfy the memory requirement of fitting massive models. This is motivated by the observation that environments with limited GPU resources are often equipped with terabytes of aggregate heterogeneous memory, which is sufficient to fit hundreds of billion-parameter models. ZeRO-Inference builds on the offloading techniques of ZeRO-Infinity [36], and adapts them to inference.

An important design decision is how to apportion GPU memory among model weights, inference inputs, and intermediate results. One approach is to pin as much of the model weights as possible into GPU memory, and fetch the remainder (from DRAM or NVMe) when needed for computation. A benefit of this approach is avoiding the latency of fetching weights that are already pinned in GPU memory. However, this approach has two downsides: (i) it allows only small batch sizes, which hurts efficiency, and (ii) the latency savings for hundred-billion-parameter models are negligible, since only a small fraction of the weights can fit in GPU memory anyway.

ZeRO-Inference adopts a different approach that pins the model weights either in DRAM (if large enough) or NVMe, and streams each layer into GPU memory for computation when needed. Despite the latency of fetching model weights over PCIe, ZeRO-Inference is able to achieve high efficiency for two reasons. First, by limiting the GPU memory usage of the model to one or a few layers of weights, ZeRO-Inference is able to use large batch sizes for inference. Second, a large model layer requires a significant amount of compute, especially given the long input sequence length (e.g., 2048). For example, one GPT3-175B layer requires about 7 TFlops to process an input of batch size 1. Therefore, large batch sizes cause compute time to dominate the latency of fetching model weights, which ultimately improves efficiency. In summary, ZeRO-Inference's strategy of using GPU memory to support large batch sizes results in high-performance inference for large models.
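A minimal sketch of the layer-streaming loop is shown below, assuming the weights are pinned in CPU memory: while the current layer computes on the GPU, the next layer's weights are copied over PCIe on a separate CUDA stream (the prefetching evaluated later in Sec. VII-E). The NVMe path, weight partitioning, and the overlap heuristics of the real ZeRO-Inference implementation are omitted, and run_layer is a stand-in for a full transformer layer.

```python
import torch

def run_layer(hidden, weights):
    # Stand-in for a full transformer layer; a single matmul keeps the sketch self-contained.
    return hidden @ weights["w"]

def zero_inference_forward(layers_cpu, hidden):
    """layers_cpu: list of dicts of CPU-pinned weight tensors; hidden: activations on the GPU."""
    copy_stream = torch.cuda.Stream()
    fetch = lambda layer: {k: v.to("cuda", non_blocking=True) for k, v in layer.items()}
    next_weights = fetch(layers_cpu[0])                  # prefetch layer 0
    for i in range(len(layers_cpu)):
        current = next_weights
        if i + 1 < len(layers_cpu):
            with torch.cuda.stream(copy_stream):         # overlap PCIe copy of layer i+1
                next_weights = fetch(layers_cpu[i + 1])
        hidden = run_layer(hidden, current)              # compute layer i on the default stream
        torch.cuda.current_stream().wait_stream(copy_stream)
    return hidden

if torch.cuda.is_available():
    d = 1024
    layers = [{"w": torch.randn(d, d).pin_memory()} for _ in range(4)]
    out = zero_inference_forward(layers, torch.randn(8, d, device="cuda"))
```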
Name                 | # params (B)    | hidden dim (K) | # layers       | # attention heads | Fig 6 | Fig 6    | Fig 8      | Fig 9
GPT-[2, Neo, J, 13B] | 1.5, 2.7, 6, 13 | 1.6, 2.5, 4, 5 | 48, 32, 28, 40 | 25, 20, 32, 40    | TP=1  | N/A      | N/A        | N/A
GPT-[NeoX, 50B, 87B] | 20, 50, 87      | 6, 8, 12, 12   | 44, 62, 48     | 64, 64, 96        | N/A   | TP=2,4,8 | N/A        | TP=1
LM-175B              | 175             | 12             | 96             | 96                | N/A   | TP=16    | TP=8, PP=2 | TP=1
LM-530B              | 530             | 20             | 105            | 128               | N/A   | N/A      | TP=8, PP=5 | TP=1

TABLE I: Model configurations used for the dense model inference performance evaluation.

Model        | Size (billions) | #Layers | Hidden size | MP degree | EP degree | Expert-slicing | #GPUs
1.3B+MoE-128 | 52              | 24      | 2048        | 1         | 128       | 1              | 128
2.4B+MoE-128 | 107.7           | 16      | 3584        | 1         | 128       | 1              | 128
8B+MoE-128   | 349.0           | 30      | 4096        | 4         | 128       | 1              | 128
24B+MoE-128  | 1064.9          | 40      | 8192        | 8         | 128       | 2              | 256
47B+MoE-128  | 2024.0          | 58      | 8192        | 8         | 128       | 2              | 256

TABLE II: Model configurations used for the sparse model inference performance evaluation. MP stands for model parallelism; EP refers to expert parallelism.
Fig. 6. Latency and throughput comparison of DeepSpeed Transformer with FasterTransformer [31] for different models and batch sizes.

Fig. 7. Latency and throughput improvement offered by DeepSpeed-MoE over baseline on 256 GPUs. Throughput shown is per GPU and the speedup values along the arrows refer to improvement in latency.
1) Dense Model Evaluation: Fig. 6 shows the latency and throughput improvements of DeepSpeed Inference on up to 175B-parameter models running with up to 16-way tensor parallelism (see Tab. I). In particular, we compare both the FP16 (DeepSpeed-FP16) and INT8 (DeepSpeed-INT8) implementations of DeepSpeed Inference with the FasterTransformer FP16 baseline (FT-FP16)¹. Both the baseline and DeepSpeed Inference use an identical TP strategy, so all the latency differences in these results come from the differences in kernel implementations described below.

¹At the time of writing, FasterTransformer only supports INT8 computation for Transformer models with just the encoders, e.g., BERT, but not the decoders used in state-of-the-art large-scale Transformer models such as GPT-3 [8].

Small Batch Sizes: For small batch sizes, DeepSpeed-FP16 achieves a speedup of up to 1.55× over the baseline. The performance improvements for both single-GPU and multi-GPU configurations are primarily due to Deep-Fusion and the custom GeMMs. The latency reduction is largest for the smallest model sizes, as they have the largest kernel-launch overhead due to limited work per kernel, and the worst GeMM memory bandwidth utilization from cuBLAS, which is not optimized for small and skinny GeMMs. DeepSpeed-INT8 enables a further performance boost of up to 1.95× over the FP16 baseline by reducing the overall size of the parameters by half compared to FP16.

Larger Batch Sizes: For larger batch sizes, DeepSpeed-FP16 reduces the latency by up to 1.57× over the baseline, and up to 1.93× using DeepSpeed-INT8. The primary source of performance improvement for DeepSpeed-FP16 is the reduction of non-GeMM data-movement overhead via Deep-Fusion. As batch size increases, the GeMMs become much more efficient, and the latency of the GeMM operators increases only sub-linearly with the batch size in this modest batch size regime. However, the latency of the non-GeMM operations increases linearly due to the proportional increase in data movement from GPU memory, making it a bigger fraction of the overall latency. Deep-Fusion reduces this data movement by keeping intermediate data for fused operators in shared memory or registers to achieve higher performance. DeepSpeed-INT8 further improves upon the DeepSpeed-FP16 performance by utilizing the higher peak throughput of the INT8 tensor cores compared to FP16.

2) Sparse Model Evaluation: Fig. 7 shows the single output-token generation latency and throughput of serving 100B to 2T MoE models with up to 256 GPUs, with and without DeepSpeed-MoE. Compared to the baseline, DeepSpeed-MoE achieves better performance than the state-of-the-art, with up to 7.3× reduction in latency. To have a fair comparison, the configuration for data/tensor/expert parallelism is the same for both the baseline and DeepSpeed Inference-MoE. The main differences are the optimizations that DeepSpeed Inference has, such as expert-slicing, parallelism coordinated all-to-all and MoE-specific kernels, which the PyTorch-MoE baseline does not. By effectively exploiting hundreds of GPUs in parallel, DeepSpeed-MoE achieves an unprecedented scale for inference at incredibly low latency: a staggering trillion-parameter MoE model can be served under 25 ms by leveraging an aggregate GPU memory bandwidth of 128 TB/sec (33% of peak memory bandwidth), making it possible to serve such a massive model even in extremely interactive online applications.

While we realize that 33% compute utilization on 256 GPUs would be fairly low for a compute-bound application such
as training with high arithmetic intensity, a 33% memory bandwidth utilization for enabling low-latency massive model inference with virtually no arithmetic intensity is an unprecedented result, given the intensive communication required in such scenarios.

Fig. 8. Throughput comparison of DeepSpeed Transformer with FT for 175B and 530B models on 16 and 40 GPUs. We run with batch sizes that give the best performance for each configuration.

C. Throughput Oriented Massive Model Inference

Massive models are capable of processing large input prompts and generating a large number of coherent tokens. In some applications (e.g., offline query rewriting in web-scale search and recommendation systems), this token generation process can be less latency focused and more throughput oriented. In this sub-section we show the throughput improvement of DeepSpeed Inference for massive model inference.

Fig. 8 shows that DeepSpeed Inference achieves 1.51× throughput improvement over the best FasterTransformer (FT) configuration for the GPT-3 175B model running on two nodes (2 × 8 A100). This improvement comes from our improved pipeline parallelism schedule, and the ability to run much larger batch sizes using the memory optimization and communication minimization strategies described in Sec. IV. For the 530B model, we could not run FT using a combination of TP and PP without crashing, but compared to the TP-only version of FT, DeepSpeed Inference achieves over 1.53× throughput improvement running on 5 nodes.

D. Democratizing Larger Model Inference with ZeRO-Inference

We evaluate three aspects of ZeRO-Inference:

1) Model Scale: ZeRO-Inference can run inference on a 530B-parameter model on a single A6000 GPU, 25× larger than the largest model that can be inferenced with a GPU-only solution (and 10× larger compared to the CPU-only solution), making it possible for data scientists to test massive models on single-GPU workstations without requiring massive GPU clusters or incurring huge cost (see Fig. 9(b)).

2) Inference Throughput: ZeRO-Inference achieves excellent inference throughput of up to 84 TFLOPS, 54% of the theoretical peak (158.4 TFLOPS), for offline inference with very large batch sizes (see Fig. 9(b)). In fact, for models that fit in CPU memory, it offers over 25× higher throughput than the CPU-only solution. Furthermore, even for models that fit in single-GPU memory, it offers over 50% better throughput than the GPU-only solution. This is possible because ZeRO-Inference can support much larger batch sizes than a GPU-only solution by offloading the parameters to CPU or NVMe and using GPU memory to store activations. The benefit of larger batch size is shown in Fig. 9(a).

3) Scalability: When additional GPUs are available, ZeRO-Inference can leverage them in parallel to achieve near-perfect linear throughput scaling (see Fig. 9(c)) by leveraging the aggregate PCIe bandwidth across GPUs, as described in Sec. VI-B.

E. Performance Breakdown and Analysis

1) Dense GPU kernel performance breakdown: Fig. 10(a) shows that, compared to the PyTorch baseline, Deep-Fusion offers a significant reduction in latency by reducing kernel launch and data movement overheads, while our custom GeMM implementation offers a further reduction for small batch sizes by increasing the memory bandwidth utilization of the GeMM.

2) Throughput breakdown for massive model GPU inference: Fig. 10(b) shows the impact of several optimizations in DeepSpeed Inference on the inference throughput, such as the dense optimized kernels, inference-optimized scheduling, memory optimizations that enable increased batch sizes, and communication optimizations that reduce PCIe data movement overheads, as described in Sec. IV.

3) Prompt latency improvement with hybrid scheduling: Fig. 13 shows that DeepSpeed Inference with hybrid scheduling achieves 1.18× and 3.06× prompt-processing speed-up over FasterTransformer for the GPT-3 175B model with the PP + MP configuration and the MP-only configuration, respectively. This experiment was conducted on two nodes, each with 8 A100 GPUs. We enable both pipeline and tensor parallelism. We set the batch size to 24, because the latency dramatically increases when the batch size is larger than 24. We suspect this is related to an issue in the AllReduce kernel in PyTorch. The results demonstrate that hybrid scheduling has the potential to reduce prompt-processing latency, and we leave fixing the AllReduce issue as future work.

4) Memory bandwidth scalability for sparse MoE models: Fig. 11 shows that DeepSpeed Inference achieves much higher per-GPU memory bandwidth than the PyTorch baseline for a 52B MoE model on an 8×A100-GPU node, while also demonstrating significantly better memory bandwidth scalability all the way to 128 GPUs, which leads to lower sparse model inference latency and higher throughput. This is the combined effect of the MoE kernels and all-to-all optimizations presented in Section V.

5) Impact of pre-fetching on ZeRO-Inference throughput: Fig. 10(c) shows that prefetching (Sec. VI-B) improves throughput at small batch sizes, while the benefit diminishes at larger batch sizes due to the higher arithmetic intensity available to hide the CPU/NVMe-to-GPU communication overhead.

6) Comparison with E.T.: We also compared with a state-of-the-art transformer kernel, E.T. [27], on smaller-scale DistilBERT and BERT encoder models on NVIDIA A100 GPUs for a batch size of 1 and a sequence length of 128. Fig. 12 shows that DeepSpeed Inference is 1.7× and 1.4× faster than E.T. on those two models. DeepSpeed Inference achieves lower latency because Deep-Fusion fuses more operators, leading to lower
Fig. 9. (a) Throughput of GPT-NeoX-20B across batch sizes on an A6000 GPU. (b) Throughput across models on an A6000 GPU. (c) Throughput of GPT-50B using up to 16 GPUs over a single GPU (67 TFLOPS, 53% of peak) on the DGX-2 V100.