
Quantized Distributed Training of Large Models with

Convergence Guarantees
Ilia Markov∗ Adrian Vladu† Qi Guo‡ Dan Alistarh§
arXiv:2302.02390v1 [cs.LG] 5 Feb 2023

Abstract
Communication-reduction techniques are a popular way to improve scalability in data-parallel
training of deep neural networks (DNNs). The recent emergence of large language models such
as GPT has created the need for new approaches to exploit data-parallelism. Among these,
fully-sharded data parallel (FSDP) training is highly popular, yet it still encounters scalability
bottlenecks. One reason is that applying compression techniques to FSDP is challenging: as
the vast majority of the communication involves the model’s weights, direct compression alters
convergence and leads to accuracy loss. We present QSDP, a variant of FSDP which supports
both gradient and weight quantization with theoretical guarantees, is simple to implement and
has essentially no overheads. To derive QSDP we prove that a natural modification of SGD
achieves convergence even when we only maintain quantized weights, and thus the domain over
which we train consists of quantized points and is, therefore, highly non-convex. We validate
this approach by training GPT-family models with up to 1.3 billion parameters on a multi-node
cluster. Experiments show that QSDP preserves model accuracy, while completely removing the
communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.

1 Introduction
The impressive recent progress of Deep Learning in tasks such as natural language processing and
computer vision has been accompanied by massive increases in parameter counts. For instance, large
language models (LLMs) from the Transformer family, such as GPT [Radford et al., 2018], OPT [Zhang
et al., 2022] and BLOOM [Laurençon et al., 2022], easily count billions of trainable parameters, which
induces tremendous computational and memory costs. Training such models can easily exceed the
memory capacity of a single computational unit, such as a GPU.
As a consequence, standard distribution strategies such as data-parallel training Bottou [2010],
which require each node to be able to keep all parameters in memory, are no longer directly applicable.
Several novel distribution strategies have been proposed to mitigate this challenge, such as model-
parallel training Shoeybi et al. [2019], Raffel et al. [2020], pipeline-parallel training Huang et al.
[2019], Harlap et al. [2018] and model sharding Ren et al. [2021], Rajbhandari et al. [2020], Rasley
et al. [2020], FairScale [2021].
We consider the communication costs of distribution strategies for massive models, and focus
on Fully-Sharded Data-Parallel (FSDP) distributed training, which is among the most popular and

∗ Institute of Science and Technology Austria, [email protected]
† CNRS & IRIF, Université Paris Cité, [email protected]
‡ Max Planck Institute for Informatics, [email protected]
§ Institute of Science and Technology Austria, [email protected]
user-friendly approaches to mitigate per-node memory limitations. FSDP is supported natively by
Pytorch Paszke et al. [2019], Facebook fairscale FairScale [2021], and Microsoft DeepSpeed Ren
et al. [2021], where it is known as ZeRO-3.
The main idea behind FSDP is that both the training data and the model parameters are
partitioned among the P nodes. That is, only a 1/P partition of the parameters of each layer is
stored at a node. Then, both for the forward and for the backward pass, nodes proceed synchronously
layer-by-layer, gathering full weights for the current layer, via all-to-all communication, before
executing its forward or backward operation. After this operation is complete, nodes can discard the
current layer’s received weights partitions, and move to the next layer. (Please see Figure 1 for an
illustration, and Section 4.1 for a detailed description.)
The key advantage of this pattern is that it reduces memory usage linearly in P . Thus, it enables
running models with billions of parameters on small or medium-sized clusters FairScale [2021],
MosaicML [2022]. At the same time, FSDP faces challenges in terms of communication efficiency:
since every forward and backward pass relies on all-to-all weight exchanges, FSDP can put massive
pressure on the network bandwidth, which becomes a bottleneck.
As we will show, all-to-all communication leads to significant communication bottlenecks when
training LLMs on multi-node clusters. Two key challenges to removing this communication bottleneck
are that 1) the majority of FSDP’s communication consists of layer weights: quantizing them naively loses
theoretical convergence guarantees, and can easily lead to practical divergence; 2) the FSDP setting poses
stringent compute and memory constraints, restricting the set of approaches.
Contribution. We propose the first communication-efficient variant of FSDP, called QSDP, which
provides both convergence guarantees and strong practical performance. QSDP is inspired by a
new analysis of SGD convergence with full quantization of transmitted model state. That is, we show
that a simple modified variant of SGD can allow both weights and gradients to be quantized during
training, without additional per-node memory or costly local computation. We find the fact that
this is possible with convergence guarantees surprising, since nodes only observe biased estimators
of the gradients, taken over quantized weights, without any error-correction Karimireddy et al.
[2019]. From the practical perspective, our approach is fast and easy to implement, and completely
removes the communication bottlenecks of FSDP, while recovering accuracy for billion-parameter
GPT models.
At a high level, the QSDP algorithm simply performs weight and gradient quantization before the
corresponding FSDP all-to-all communication steps. While gradient compression can be performed
using standard unbiased compressors, e.g. Alistarh et al. [2017], weight compression is performed
using a carefully-designed unbiased estimator. Our key contribution is in the analysis: we model
the training process as a new instance of sparse recovery Blumensath and Davies [2008], Foucart
[2012], in which 1) the projection step is performed via quantization and not sparsification, and 2)
the gradient step is itself quantized. This connection allows us to prove, under analytic assumptions,
that QSDP converges towards a minimizer of the loss over the set of lattice points corresponding to
the quantization being employed. We believe this is the first instance of such an analysis.
We complement our analysis with an efficient implementation of QSDP in Pytorch Paszke et al.
[2019], which we validate by training LLMs from the GPT family Radford et al. [2018], Zhang et al.
[2022] between 125M and 1.3B parameters, on a multi-node multi-GPU environment on Amazon EC2.
Our experiments first show that communication bottlenecks can significantly impact standard FSDP
in this standard practical setting, and that QSDP essentially removes such bottlenecks, without
impact on accuracy. For example, QSDP can train GPT-1.3B to essentially the same perplexity up

to 2.2x faster than standard FSDP on a 10Gbps network. In addition, we also introduce a “learned”
adaptive weight quantization approach which can further reduce bit-width, without significant
accuracy impact.

[Figure 1 diagram: for each GPU (GPU1, GPU2) holding its own weight partition, the forward pass quantizes and collects the layer weights before computing activations; the backward pass quantizes and collects the weights again, computes gradients, quantizes them, and performs a gradient sync followed by the model update.]
Figure 1: Scheme of the (Quantized) Fully Sharded Data Parallel algorithm. During the forward pass
we collect the missing partitions of the layer’s weights, compute its activations, and discard the
partitions. During the backward pass, we collect the weights again, compute the gradients, and
synchronize the gradients corresponding to our partition.

2 Related Work
Over the past decade, there has been a massive amount of work on communication-efficient variants
of Data-Parallel SGD, e.g. Seide et al. [2014], Dryden et al. [2016], Alistarh et al. [2017], Vogels
et al. [2019], Tang et al. [2019], Wang et al. [2018]. (Please see Ben-Nun and Hoefler [2019] for a
survey.) The vast majority of this work focuses on gradient compression, the main communication
cost of SGD, and is thus mostly orthogonal to our work. The massive scale of recent deep models,
e.g. Chowdhery et al. [2022], Brown et al. [2020] has led to significant work on novel distribution
strategies Ren et al. [2021], Rajbhandari et al. [2020], Rasley et al. [2020], FairScale [2021] adapted
to the requirements of these models, among which FSDP is a standard approach, e.g. Chowdhery
et al. [2022]. While there is recent work on further optimizing FSDP, e.g. Jiang et al. [2022], Miao
et al. [2022], we are the first to investigate and address its communication costs. Our results are part
of a broader line of work using different techniques to make the training of massive models amenable
to standard infrastructure, e.g. Wang et al. [2022], Yuan et al. [2022], Borzunov et al. [2022].
Quantized weight exchange during training has been investigated independently in the context
of decentralized distributed learning. Tang et al. [2018] presents a scheme which supports quantized
weight exchange by having each node extrapolate each of its neighbors’ model values; yet, this would
require unrealistic Θ(P d) extra memory in our case. Similarly, other work in this vein [Koloskova
et al., 2019, Nadiradze et al., 2021, Lu and De Sa, 2020] either requires additional storage, or would
not fit the FSDP algorithm structure. Both our analysis approach and our algorithms’ guarantees
are different relative to this line of work.
Recently, there has been a surge of interest in post-training quantization approaches for large
language models, which reduce the deployment costs of already trained models Yao et al. [2022],
Dettmers et al. [2022], Frantar et al. [2022], Xiao et al. [2022]. Our work is complementary, in
the sense that we show that quantized weights and gradient representations can be applied during
training, without accuracy loss, leading to training speedup. On the other hand, these post-training
approaches would be too expensive to be executed for compression at training time.
A parallel line of work aims to perform fully-quantized training of DNNs Banner et al. [2018],
Zhu et al. [2020]. One general finding from this line of work is that integrating weight and gradient

quantization into training is extremely challenging, even when using 8-bit precision, from both accuracy
and performance perspectives. Specifically, this line of work investigates model modifications via e.g.
parameter tuning and specialized normalization layers, in order to recover accuracy. By contrast, we
preserve model structure, and do not modify hyper-parameter values, although we only quantize
transmitted state.

3 Background and Motivation


3.1 Data-Parallel Training
In this classic SGD distribution pattern Bottou [2010], each node (e.g., GPU) holds a copy of the
model, and the data is partitioned among the nodes. Each training step samples a subset of the data
called a batch, performs a forward pass over the batch to obtain model predictions, and then performs
a backward pass to obtain gradient updates. Finally, nodes communicate their local gradient updates
in all-to-all fashion to keep the model in sync.

3.2 Gradient Compression


Transmitting gradients is the key communication cost of Data-Parallel SGD, and there has been a
tremendous amount of work on addressing the resulting bandwidth bottleneck Seide et al. [2014],
Dryden et al. [2016], Strom [2015]. (As this area is extremely vast, we refer to Ben-Nun and Hoefler
[2019], Liu et al. [2020b] for a full overview.) Of these, gradient quantization is a particularly-popular
technique, which has the advantage that variants of it can be implemented without additional
memory cost. A simple example is the QSGD technique Alistarh et al. [2017], which is essentially a
codebook compression method which maps each gradient value to a point on a uniform grid, via
randomized rounding. For this, values are first scaled to the range [−1, 1], and then each scaled
coordinate v is mapped to one of the endpoints of its quantization interval [q_i, q_{i+1}] via the
following rule:

    q(v) = q_i with probability (q_{i+1} − v)/(q_{i+1} − q_i), and q(v) = q_{i+1} otherwise.
It is easy to see that this gradient estimator is unbiased with respect to the stochastic quantization,
and that its variance can be bounded by the norm of the original gradient. We will revisit this
scheme in Sections 4.3 and 5.
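As an illustration, the following is a minimal NumPy sketch of such a stochastic uniform quantizer (our own example, not the exact QSGD or CGX code): values are scaled by their maximum absolute value, and each coordinate is randomly rounded to one of the two neighboring grid points so that the quantized vector is an unbiased estimate of the input.

    import numpy as np

    def stochastic_quantize(v, num_levels=256, rng=np.random.default_rng()):
        # Scale values by the maximum absolute value, so the scaled vector lies in [-1, 1].
        scale = np.max(np.abs(v)) + 1e-12
        u = v / scale
        step = 2.0 / (num_levels - 1)                # spacing of the uniform grid on [-1, 1]
        lower = np.floor((u + 1.0) / step)           # index of the lower endpoint q_i
        p_up = (u + 1.0) / step - lower              # probability of rounding up to q_{i+1}
        idx = lower + (rng.random(u.shape) < p_up)   # randomized rounding
        return (idx * step - 1.0) * scale            # decode back to the original scale

    # Averaging many independent quantizations recovers the input, illustrating unbiasedness.
    v = np.array([0.3, -1.2, 0.05, 2.0])
    print(np.mean([stochastic_quantize(v) for _ in range(20000)], axis=0))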

3.3 Fully-Sharded Data-Parallel Training


As the name suggests, FSDP starts from the Data-Parallel (DP) approach. The main observation is
that nodes do not necessarily need to store the full set of parameters at every stage of training, in
particular during the backward pass. Specifically, we use the scarce GPU memory to represent only
those layers which are in the forward-backward “working set” at a given moment of time.
Initially, model parameters are partitioned, so that each of the P workers is assigned a distinct
1/P fraction of each layer’s weights. At each optimization step (see Figure 1, ignoring the dashed
quantization operations), before the forward pass on a layer, each worker collects the missing
partitions from other workers, computes the output activations, discards the received partitions
and proceeds to the next layer. For the backward pass, workers again collect all layer weights,

compute the gradients, synchronize them, discard the layer weights and proceed to the next layer.
Technically, each optimization step consists of two AllGather collective operations for weights, and
one Reduce-Scatter to sync gradients (full pseudocode in Appendix A).
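To make the communication pattern concrete, here is a simplified sketch (an illustration, not the PyTorch FSDP implementation) of one sharded linear layer using torch.distributed collectives. It assumes the process group is already initialized, that each rank holds w_shard, a 1/P slice of the layer’s weight rows whose count is divisible by the world size, and it uses a toy squared-error objective so the gradient can be written explicitly.

    import torch
    import torch.distributed as dist

    def gather_full_weight(w_shard, world_size):
        # AllGather: every rank contributes its shard and receives all shards.
        shards = [torch.empty_like(w_shard) for _ in range(world_size)]
        dist.all_gather(shards, w_shard)
        return torch.cat(shards, dim=0)

    def fsdp_layer_step(x, w_shard, world_size):
        # Forward pass: gather the full weight, compute activations, discard the weight.
        w = gather_full_weight(w_shard, world_size)
        out = x @ w.t()
        del w
        # Backward pass: gather the weight again (mirroring FSDP's second AllGather;
        # this toy gradient does not actually need it), then compute the local gradient.
        w = gather_full_weight(w_shard, world_size)
        grad_out = out                               # gradient of the toy loss 0.5 * ||out||^2
        grad_w = grad_out.t() @ x                    # full local weight gradient
        del w
        # Reduce-Scatter: sum gradients across ranks and keep only this rank's shard.
        grad_shards = list(grad_w.chunk(world_size, dim=0))
        my_grad = torch.empty_like(grad_shards[0])
        dist.reduce_scatter(my_grad, grad_shards, op=dist.ReduceOp.SUM)
        return out, my_grad / world_size             # averaged gradient for this rank's shard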
One can easily check that the above approach implements the standard SGD iteration one-to-one,
relative to a sequential execution. If we denote by y_t the model’s parameter vector used at iteration t,
and by g(y_t) the average of the nodes’ stochastic gradients at step t, taken at y_t, then, for learning
rate η, we can model the iteration as

    y_{t+1} = y_t − η g(y_t).    (1)

FSDP with Compression. The natural way to reduce the cost of weight and gradient transmission
in the above scheme would be to simply quantize them before transmission. (Please see the full
Figure 1.) To examine the impact of adding compression on the above SGD iteration, let us consider
abstract quantization operators Q^w applied to the weights, and Q^g applied to the gradients. (We
will specify these quantization functions precisely in Section 4.3, and the exact implementation in
Section 5.) For iteration t ≥ 0, let v_t be a “virtual” view of the model weights at the beginning of
iteration t, obtained by aggregating all the weights, across all the weight partitions, in full precision.
First, notice that, if we apply Q^w before all transmissions, then the algorithm will only observe
the quantized version of v_t, which we denote by Q^w(v_t). Then, we can re-write one iteration of the
algorithm as

    v_{t+1} = Q^w(v_t) − η Q^g(g(Q^w(v_t))).

This formulation inspires the notation x_t = Q^w(v_t), as the algorithm only “sees” the quantized
version of the full-precision weights. Then, we get the following iteration:

    x_{t+1} = Q^w(x_t − η Q^g(g(x_t))),    (2)

which would correspond to an abstractly-quantized version of FSDP. This iteration is the starting
point for our analysis in the next section.

4 SGD with Quantized Weights and Provable Convergence


The cornerstone of our method consists of a stochastic gradient method that provably converges to a
good quantized iterate, under reasonable analytic assumptions. One main novelty is that it converges
despite the fact that the domain is non-convex. At a very high level, it is similar to the iterative
hard thresholding (IHT) method, which achieves provable guarantees despite the fact that it seeks a
good iterate in the set of vectors of bounded sparsity Blumensath and Davies [2008]. Throughout
this section, we abstract away specifics of the system architecture, since they are not relevant to our
analysis. We explain their relationship to the actual implementation in Section 5.

4.1 Background and Assumptions


The main challenge we face is to obtain quantized solutions to optimization problems that seek to
minimize a function f : R^n → R:

    min_{x ∈ G} f(x),    (3)

where the domain G is a lattice that allows for an efficient communication of its elements. We
restrict our attention to shifts of the lattice δZ^n along the direction of the all-ones vector. Formally,
G = {δZ^n + r·1 : r ∈ [−δ/2, δ/2)}.
Overview. Even in the case where f is convex, the non-convex structure of G makes it incredibly
difficult to obtain a good minimizer to f without suffering a large loss. In fact, problems of this
form are generally NP-hard. However, we show that when f is reasonably well-conditioned we can
obtain strong convergence guarantees. The idea consists of alternating stochastic gradient descent
steps with applications of a quantization operator Q^w which projects the new iterate back onto a
certain subset of G. Letting g(x_t) be a stochastic gradient, and δ a parameter which determines the
coarseness of the quantization grid that we project onto, our update at step t + 1 has the following
form:

    x_{t+1} = Q^w_δ(x_t − η g(x_t)).    (4)

This formulation covers the practical case where the stochastic gradient g(x_t) corresponds to a
mini-batch stochastic gradient. Indeed, in practice f takes the form f(x) = (1/(Pm)) Σ_{i=1}^{P} Σ_{j=1}^{m} f_i(x; y_j),
where S = {y_1, . . . , y_m} are data samples and the f_i(x) are loss functions at individual nodes, and the
stochastic gradients obtained via backpropagation take the form (1/|B|) Σ_{j∈B} ∇f_i(x; y_j), where i is a
random node, and B is a sampled mini-batch.
Quantization by Random Shift. For weight quantization, we consider the following unbiased
stochastic quantization method. To quantize a vector, we first sample a single fixed random scalar r,
then shift all the coordinates of the vector by r. For vector encoding, it rounds each coordinate to the
nearest neighbor on the quantization grid, and sends the lattice coordinates of the resulting vector,
together with the scalar r. For decoding, it takes the lattice point, and undoes the shift on the
quantized coordinates. The notable difference between this and more standard quantization methods,
e.g. Alistarh et al. [2017], is that quantization is dependent across coordinates. In exchange for losing
independence, it allows us to provide stronger guarantees in the context of weight quantization. We
define it formally:
Definition 1 (quantization by random shift). Let δ > 0 be a scalar defining the coarseness of the
quantization grid. Let r ∈ [−δ/2, δ/2) be a scalar, and let Q^w_{r,δ} : R → R be the deterministic operator
which rounds to the nearest element in δZ + r:

    Q^w_{r,δ}(x) = δ · ⌊(x − r)/δ⌉ + r.

Define the randomized quantization operator Q^w_δ : R → R via Q^w_δ(x) = Q^w_{r,δ}(x), for a random
r ∼ Unif([−δ/2, δ/2)). We apply Q^w_δ to vectors, with the meaning that it is applied dependently to
each coordinate, with a single random shift r.
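The following short NumPy sketch (an illustration, with names of our choosing, not the paper’s implementation) spells out the encode and decode steps of Definition 1; over the randomness of the shift r, the decoded vector is an unbiased estimate of the input.

    import numpy as np

    def quantize_random_shift(x, delta, rng=np.random.default_rng()):
        # A single shift r is shared by all coordinates, making the rounding dependent.
        r = rng.uniform(-delta / 2, delta / 2)
        lattice = np.rint((x - r) / delta).astype(np.int64)   # integer coordinates to transmit
        return lattice, r

    def dequantize_random_shift(lattice, r, delta):
        return lattice * delta + r                            # nearest point of delta*Z + r

    x = np.random.default_rng(0).standard_normal(5)
    avg = np.mean([dequantize_random_shift(*quantize_random_shift(x, 0.1), 0.1)
                   for _ in range(20000)], axis=0)
    print(np.max(np.abs(avg - x)))   # close to zero: the decoded vector is unbiased over r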
We use this quantization method to show that, for a well-conditioned loss function and
appropriate grid parameters, the iteration (4) converges, under reasonable analytical assumptions,
to a set of weights that are comparable in quality to the best possible quantized weights from a
slightly coarser grid. We note that, to further reduce communication costs, our method also supports
gradient quantization in addition to weight quantization, provided that gradients are quantized using
an (arbitrary) unbiased estimator.
Analytical Assumptions. Formally, our analysis uses the following assumptions on f .

1. Unbiased gradient estimator with variance at most σ²: E[g(x) | x] = ∇f(x).

2. For β > 0, the β-smoothness condition: for all x, Δ,

    f(x + Δ) ≤ f(x) + ⟨∇f(x), Δ⟩ + (β/2)‖Δ‖₂².

3. For α > 0, the Polyak-Łojasiewicz (α-PL) condition:

    (1/2)‖∇f(x)‖₂² ≥ α (f(x) − f*),

where f* = min_x f(x).
The first two assumptions are standard in stochastic optimization (see e.g. Lin et al. [2019]). The
Polyak-Łojasiewicz (PL) condition [Karimi et al., 2016] is common in non-convex optimization, and
versions of it are essential in the analysis of DNN training [Liu et al., 2020a, Allen-Zhu et al., 2019].
In words, it states that small gradient norm, i.e. approximate stationarity, implies closeness to
optimum in function value.

4.2 Main Theoretical Results


We are now ready to state our main analytical result.
Theorem 2. Let α, β, δ*, ε > 0 and σ ≥ 0 be real parameters, and let η = min{3εα/(10σ²), 1}. Let
f : R^n → R be a β-smooth and α-PL function, with access to a stochastic gradient g(x), i.e.
E[g(x) | x] = ∇f(x) with bounded variance E‖g(x) − ∇f(x)‖₂² ≤ σ². For each r ∈ [−δ*/2, δ*/2),
let x*_{r,δ*} be any minimizer of f over δ*Z^n + r·1. Let δ = (η / ⌈16(β/α)²⌉) · δ*. Consider the iteration:

    x_{t+1} = Q^w_δ(x_t − (η/β) g(x_t)).

In T = (10/η) · (β/α) · ln((f(x_0) − E f(x*_{r,δ*}))/ε) iterations we obtain a point x_T satisfying
E f(x_T) − E f(x*_{r,δ*}) ≤ ε.
Discussion. To understand the convergence of this method, let us establish as benchmark the
expected value E f(x*_{r,δ*}) of the best iterate on the lattice δ*Z^n + r·1, where the expectation is taken
over the shift r. Our method finds a point in a slightly finer grid, x_T ∈ δZ^n + r′·1, such that, in
expectation over the randomness in the algorithm, the value of the function is at most ε larger than
our benchmark. The sacrifice we have to make in exchange for this surprisingly strong convergence
is an increase in resolution for the iterates we maintain, which depends, among others, on the
condition number of f.
Since our method works with stochastic gradients, we can additionally quantize gradients to
further reduce communication. In fact any quantization method that compresses gradients to an
unbiased estimator with low variance can be directly plugged into Theorem 2.
We state in the following corollary a generic bound for quantized gradients, which highlights the
trade-off between variance and communication for the quantization method.
Corollary 3 (Gradient Quantization). Let α, β, δ*, ε, b > 0 and σ, σ_∇ ≥ 0 be real parameters. Let
f : R^n → R be a β-smooth and α-PL function, with access to a stochastic gradient estimator g(x),
i.e. E[g(x) | x] = ∇f(x) with bounded variance E‖g(x) − ∇f(x)‖₂² ≤ σ². Let Q^g be a gradient
quantizer which, for any stochastic gradient g(x) encountered during the execution of the algorithm,
ensures:

1. unbiased estimation: E[Q^g(g(x)) | g(x)] = g(x),

2. bounded variance: E[‖Q^g(g(x)) − g(x)‖₂² | g(x)] ≤ σ_∇²,

3. at most b bits are required to communicate Q^g(g(x)).

For each r ∈ [−δ*/2, δ*/2), let x*_{r,δ*} be any minimizer of f over δ*Z^n + r·1. Let
η = min{3εα/(10(σ² + σ_∇²)), 1}, δ = (η / ⌈16(β/α)²⌉) · δ*, and consider the iteration:

    x_{t+1} = Q^w_δ(x_t − (η/β) Q^g(g(x_t))).

In T = (10/η) · (β/α) · ln((f(x_0) − E f(x*_{r,δ*}))/ε) iterations we obtain a point x_T satisfying
E f(x_T) − E f(x*_{r,δ*}) ≤ ε. Furthermore, the entire algorithm requires

    O( b · (σ² + σ_∇²)/(εα) · (β/α) · ln((f(x_0) − E f(x*_{r,δ*}))/ε) )

bits to communicate the quantized gradients.

We notice that since b and σ_∇ are inversely related, we can establish a trade-off between the
number of iterations and the total communication. As a matter of fact, this trade-off kicks in only
at the point where the variance of the quantized gradient estimator becomes as large as that of the
stochastic gradient, as the number of iterations scales linearly with σ² + σ_∇². For example, given a
resolution parameter δ_∇ > 0, a simple gradient quantization scheme such as the one employed by
QSGD quantizes gradient entries to δ_∇Z^n, while guaranteeing σ_∇² ≤ δ_∇ G_{ℓ1}, where G_{ℓ1} ≥ ‖g(x)‖₁,
and requires b = O(G_{ℓ1}/δ_∇ · (ln n + ln G_{ℓ1})) bits for communication. See a more detailed discussion
in Section D.3. By varying δ_∇ we distinguish between the extreme cases, corresponding to the
scenarios where the quantized gradients are dense and sparse, respectively. While the total
communication does not improve by varying δ_∇, by doing so we are able to reduce the communication
performed in each iteration, in practice Alistarh et al. [2017].

Dense gradients: setting δ_∇ = σ²/G_{ℓ1}, we obtain exactly the same number of iterations as in
the basic case without quantized gradients, but the communication per iteration is reduced to
O(G_{ℓ1}²/σ² · (ln n + ln G_{ℓ1})).

Sparse gradients: setting δ_∇ = G_{ℓ1}, the number of iterations scales with max{σ², G_{ℓ1}²} rather
than σ², but the per-step communication is reduced to O(ln n + ln G_{ℓ1}) bits.

4.3 Analysis Overview


Let us briefly explain the intuition behind our theoretical analyses. We view our iteration as a
version of projected gradient descent, where iterates are projected onto the non-convex domain of
quantized vectors. In general, when the domain is convex, projections do not hurt convergence.
But in our setting the distance to the optimal solution can increase and drastically affect the loss.
However, we can show a trade-off between how much this distance increases and the ratio between
the target and optimal resolution δ/δ? .
To understand this better, consider a point x′ obtained by taking a step x′ = Q^w_δ(x − (1/β)∇f(x)).
Using smoothness, we can verify that this significantly decreases the loss, provided that the quanti-
zation operator does not perturb its input by too much in ℓ2 norm. Formally, using Lemma 7 we see
that

    f(x′) ≤ f(x) − (1/(2β))‖∇f(x)‖₂² + (β/2)‖Q^w_δ(x − (1/β)∇f(x)) − (x − (1/β)∇f(x))‖₂².    (5)

Since compared to a vanilla gradient method, this suffers a reduction in the progress made in a
single iteration, we can force this to be significantly smaller, so as not to undo more than a fraction
of the progress we would ideally make. To do so we notice that we can charge the last term in (5) to
the current error in function value, and we can make this dependence be arbitrarily small by using a
finer resolution δ for our quantization grid. This is captured by the following crucial lemma, which
we prove in Section D.4.2.
Lemma 4. Let δ* > δ > 0, such that δ*/δ ∈ Z. Let x ∈ R^n, and for all r ∈ [−δ*/2, δ*/2), let
x*_{r,δ*} ∈ δ*Z^n + r·1 be arbitrary. Then

    E[‖Q^w_δ(x) − x‖₂²] ≤ (δ/δ*) · E_r[‖x*_{r,δ*} − x‖₂²].
Using Lemma 4, together with the α-PL condition, we can charge the extra error term to

    (β/2) · (δ/δ*) · (2/α) · E_r[ f(x − (1/β)∇f(x)) − f(x*_{r,δ*}) ],

where the x*_{r,δ*} are picked to be the best minimizers in δ*Z^n + r·1. To simplify the exposition and
highlight the main ideas, let us assume that E_r[f(x*_{r,δ*})] = f(x*). Since by the α-PL condition we
know that the gradient norm is large compared to the error in function value, we conclude that

    f(x′) − f(x*) ≤ f(x) − f(x*) − (α/β)(f(x) − f(x*)) + (β/α) · (δ/δ*) · ( f(x − (1/β)∇f(x)) − f(x*) )
                 ≤ (f(x) − f(x*)) · (1 − α/β + (β/α) · (δ/δ*)).

This shows that by setting δ ≤ δ* · (α/β)²/2, in each iteration the error contracts by a 1 − Θ(α/β)
factor, which allows us to conclude that this algorithm converges linearly to a minimizer. We provide
full proofs in Section D.
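The following toy NumPy experiment (a sketch under the stated assumptions, not part of our experiments) illustrates this behavior on a strongly convex quadratic, which is β-smooth and α-PL: the quantized iteration steadily decreases the loss until the iterates hit the resolution of the quantization grid.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    A = rng.standard_normal((n, n)) / np.sqrt(n)
    H = A.T @ A + 0.1 * np.eye(n)               # Hessian of the quadratic loss f(x) = 0.5 x^T H x
    beta = np.linalg.eigvalsh(H).max()          # smoothness constant of f
    f = lambda x: 0.5 * x @ H @ x

    def q_random_shift(x, delta):               # the quantizer of Definition 1
        r = rng.uniform(-delta / 2, delta / 2)
        return delta * np.rint((x - r) / delta) + r

    x, delta = rng.standard_normal(n), 1e-3
    for t in range(201):
        g = H @ x + 0.01 * rng.standard_normal(n)    # stochastic gradient of f at x
        x = q_random_shift(x - g / beta, delta)      # quantized step, as in iteration (4)
        if t % 50 == 0:
            print(t, f(x))                           # loss decreases, then plateaus near the grid resolution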

5 QSDP Implementation
5.1 Overview
We implemented a practical version of the QSDP algorithm described in the previous section,
supporting both weight and gradient quantization, in Pytorch Paszke et al. [2019] starting from the
PyTorch FSDP support. Our implementation uses the CGX framework Markov et al. [2022] as a
communication backend, to which we added support for quantized AllGather and Reduce-Scatter
collectives.

In the original FSDP implementation, layers are packed into groups: weights and gradients of
layers in the same group are concatenated before communication. In QSDP, we compress layers
separately, filtering out normalization layers and biases, which are communicated in full precision.
This filtering is implemented at the level of the CGX communication backend. The quantized
AllGather and Reduce-Scatter operations are implemented by leveraging peer-to-peer NVIDIA
NCCL primitives. For multi-node (inter-server) communication, we used hierarchical versions of the
algorithms, to reduce the size of inter-node transmissions.
One important optimization regards the granularity at which quantization is performed. Specifi-
cally, applying quantization over large tensors suffers from scaling issues, which results in accuracy
degradation. To address this, we split each tensor into equally-sized “buckets”
and compress each bucket independently. This approach sacrifices compression by a
negligible amount (as we need to transmit min-max scaling meta-information for each bucket), but
helps avoid loss in terms of model quality.
Bucketing (or grouping) on the weights is known to be necessary for good accuracy when
quantizing pre-trained LLMs Dettmers and Zettlemoyer [2022]. It is also justified theoretically
(Theorem 2), as it both reduces compression variance, and allows us to explore solutions over
finer-grained lattices. We observed experimentally that bucket size 1024 provides a good balance
between compression and accuracy, and use it as a universal hyper-parameter. In the context of this
optimization, we observed that the impact of stochasticity in the quantization becomes minimal.
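For illustration, the following NumPy sketch (a simplified view of the general scheme, not the CGX kernels) shows bucket-wise quantization with min-max scaling: the tensor is split into fixed-size buckets, each bucket is stochastically rounded on its own uniform 8-bit grid, and the per-bucket (min, max) pair is the metadata that must also be transmitted.

    import numpy as np

    def quantize_bucketwise(x, bits=8, bucket_size=1024, rng=np.random.default_rng()):
        orig_size = x.size
        flat = x.reshape(-1).astype(np.float64)
        pad = (-orig_size) % bucket_size
        buckets = np.pad(flat, (0, pad)).reshape(-1, bucket_size)
        lo = buckets.min(axis=1, keepdims=True)                  # per-bucket metadata
        hi = buckets.max(axis=1, keepdims=True)                  # (transmitted alongside the payload)
        scale = (hi - lo) / (2 ** bits - 1) + 1e-12
        u = (buckets - lo) / scale                               # position on the bucket's uniform grid
        q = np.floor(u) + (rng.random(u.shape) < (u - np.floor(u)))   # stochastic rounding
        dequant = (q * scale + lo).reshape(-1)[:orig_size]
        return dequant.reshape(x.shape), (lo, hi)

    w = np.random.default_rng(0).standard_normal(5000)
    w_hat, meta = quantize_bucketwise(w)
    print(np.abs(w_hat - w).max())    # error is bounded by one grid step of the worst bucket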

5.2 Learned Weight Quantization


We now describe an additional (optional) optimization, which allows us to further reduce practical bit-
width, at little to no accuracy loss. The motivating observation behind this optimization is that the
quantization schemes we use for weights and gradients assume uniform locations of the quantization
levels. Yet, this uniform grid does not take the distribution of values into account. The idea of
adapting the locations of quantization levels to the data distribution has already been studied Zhang
et al. [2017], Faghri et al. [2020]. However, existing dynamic-programming approaches Zhang et al.
[2017] have high computational cost (quadratic in the number of data points); thus, we use a fast
version of gradient-descent-based optimization over the quantization levels Faghri et al. [2020].
The goal of the distribution-aware quantizer in Figure 2 is to select new locations for a
fixed number of quantization points, given the weight values, so as to minimize the error introduced by
quantization. The algorithm iterates over all values, finds the closest quantization
point for each value, and updates that quantization point using a gradient-descent update rule.
We run this heuristic periodically after a warmup period, separately for the weights and gradients of
each layer. We save the derived locations of the quantization levels, and use them for quantization
until the next re-computation.
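The following is a simplified NumPy sketch of the procedure in Figure 2 (an illustration only, not the PyTorch implementation we use): each normalized value is assigned to its closest level, and that level is moved a small step toward the value.

    import numpy as np

    def learn_levels(values, init_levels, lr=0.01, epochs=1):
        levels = np.sort(np.asarray(init_levels, dtype=float))
        for _ in range(epochs):
            for v in values:                          # iterate over the (bucket-normalized) values
                i = np.argmin(np.abs(levels - v))     # find the closest current level
                levels[i] -= lr * (levels[i] - v)     # move that level toward the value
        return np.sort(levels)

    # Example: adapt a uniform 3-bit grid to normally distributed (hence non-uniform) weights.
    w = np.random.default_rng(0).standard_normal(4096)
    w = w / np.abs(w).max()                           # bucket-wise normalization to [-1, 1]
    print(learn_levels(w, np.linspace(-1, 1, 8)))     # levels concentrate where values are dense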

6 Experimental Validation
6.1 Experimental setup
Infrastructure. We evaluate QSDP for training GPT-scale LLMs using multiple cloud-grade
Amazon EC2 p3dn.24xlarge machines, with 8 V100 SXM2 GPUs each. Each GPU has 32GB
memory. The inter-GPU interconnect is provided by NVLink at 200 Gbps, while the inter-server
bandwidth is 100 Gbps.

Figure 2: Gradient-based Optimization of the Levels

1: Input: values V, initial levels Q_0, learning rate α.
2: Output: optimized quantization levels Q.
3: Normalize values V bucket-wise.
4: for each value v_i from V do
5:   q_i = find_closest(v_i, Q) // Find the closest current quantization level
6:   q_i = q_i − α(q_i − v_i) // Update the chosen quantization level
7: end for

Environment and Tasks. We use the official NGC PyTorch 22.05-py3 Docker image with PyTorch
1.12, CUDA 11.6.2, NCCL 2.12, and the MosaicML Composer library (version 0.12), as well as a
fork of the CGX communication library Markov et al. [2022]. All experiments were run with the
MosaicML Large Language Models implementation Mos [2022]. The benchmarks run the pre-training
of different versions of GPT-family models Radford et al. [2018], Brown et al. [2020], varying the
sizes of the models, on the C4 dataset Raffel et al. [2019]. Specifically, we examine accuracy on GPT
models with 125M, 350M, and 1.3B parameters. For benchmarks, we use 4 servers with 8 GPUs
each. See Appendix A for training details.
Baselines. As a baseline we use training with default parameters, which is already highly-optimized
by MosaicML Mos [2022]. We note that directly using INT8 quantization, without bucketing, resulted
in very poor accuracy, and therefore we do not use it as a baseline. In terms of communication,
the baseline transmits weights in full (FP32) precision, and gradients in half (FP16) precision. In
QSDP experiments, we do not modify any hyperparameters. We convert gradients to full precision
before quantization. For all timing experiments, the reported numbers are averaged over 50 training
steps after warm-up of 10 iterations. Our main accuracy measure is perplexity, which is known to
be a very stringent accuracy measure in this setting, and correlates extremely well with zero-shot
performance Dettmers et al. [2022].
Accuracy Recovery. We first examine the effect of quantization on model quality, i.e. final
model perplexity, in the end-to-end experiments. The default bit-width for weight and gradient
quantization is 8 bits, with bucket size 1024, which we denote W8G8. We communicate
normalization layers and biases in full precision. We emphasize that straightforward round-to-nearest
or stochastic quantization does not converge to a reasonable final perplexity in this setup: naive
quantization without bucketing loses more than 2 units of perplexity on GPT-125M, a model on
which W8G8 with bucket size 1024 improves perplexity.
The accuracy results are presented in Table 1. The QSDP final perplexity is almost identical to
that of regular training, and QSDP can even slightly improve the baseline accuracy. We stress that
we did not perform any parameter tuning: quantization parameters are the same across all layers.
End-to-end Speedup. To measure end-to-end training speedups, we use multi-node GPT
pretraining under standard hyperparameters. We examine the speedup at different inter-node band-
widths: 10 Gbps, 50 Gbps and 100 Gbps. For that, we artificially reduce the input-output bandwidth
on each node using the UNIX tc tool TC [2001]. The results are presented in Figure 4. First,
notice that standard FSDP training has a non-trivial bandwidth bottleneck even at 100 Gbps
as we increase model size, and that this bandwidth bottleneck can dominate training time
at the lower 10 Gbps bandwidth. Second, the running time of QSDP is essentially constant across all
[Figure 3 plot: evaluation perplexity vs. training time (hours) for FSDP (W32G16) and QSDP (W8G8); QSDP reaches ppl = 18.0 with a 2.2x speedup.]

Figure 3: Perplexity vs time for standard FSDP (FP32 weights and FP16 gradients) and QSDP
(both weights and gradients quantized to 8 bits) for the 1.3B model in the 10Gbps bandwidth setup.

three scenarios, showing that it has essentially removed the bandwidth bottleneck. More precisely,
QSDP outperforms the baseline by up to 15% in the 100Gbps scenario (a non-trivial reduction of 12
hours of training time, or 1.5k$ of cloud costs¹), and by 2.25x in the 10Gbps scenario.
Learned quantization. We examined the performance of learned quantization for the small 125M-
parameter model. We ran the optimization algorithm after 400, 1900 and 3800 training steps, and
noticed that optimizing the locations of quantization levels has no effect for bit-widths higher than
6 bits, but leads to noticeable improvements for lower bit-widths. Please see Table 3. Learned
weight quantization allows us to improve the final model performance for different weight and gradient
quantization parameter pairs, reaching perplexities that are close to the baseline. Specifically, using
learned quantization allows reaching the highest compression ratio for weights and gradients in
training (i.e. 5 and 4 bits, respectively) without substantial accuracy impact. We expand upon these
experiments in Appendix C.

Table 1: Perplexity recovery for end-to-end training of different models using QSDP. Weights and
gradients are quantized to 8 bits, using uniform quantization.

125M 350M 1.3B


Baseline 35.81 23.94 18.00
QSDP 35.58 23.95 18.34

7 Conclusion
Motivated by the efficient distributed training of large language models, we have explored the
feasibility of fully-quantized training for such models, with respect to both weights and gradients.
This led to an interesting new analysis, showing that SGD can indeed converge with strong convergence
guarantees even with quantized iterates, as long as a good quantized solution exists.
Complementing this analysis, we proposed QSDP, a quantized extension of the popular Fully
Sharded Data Parallel (FSDP) distributed training approach, in which all transmitted state is in
¹ Price of 12 hours of training on 4 AWS p3dn.24xlarge instances.

Table 2: Final perplexities of training the 125M GPT-2 model with combinations of low-bit uniform
(not learned) quantization of weights and gradients.

Weights bits \ Gradients bits     6       5       4
6                                 35.74   36.08   35.84
5                                 36.01   35.94   36.36
4                                 37.11   37.38   37.61

[Figure 4 plot: training step time (s) for the 125M, 350M, and 1.3B models at 10, 50, and 100 Gbps inter-node bandwidth, for FSDP (W32G16) and QSDP (W8G8).]

Figure 4: Training step time for different models at various inter-node bandwidth with and without
QSDP enabled. The fact that QSDP step time is constant across considered bandwidths means that
QSDP successfully tackles bandwidth bottlenecks.

quantized form. We also provided a highly-efficient implementation of QSDP in Pytorch, which
we showed to successfully eliminate the bandwidth bottleneck in large-scale distributed training
of modern language models, without sacrificing accuracy. Specifically, our experimental validation
across three model sizes showed that training with QSDP reaches up to 2.2× speedup.
Our results suggest that communication compression can be an effective tool in the context
of novel distribution schemes motivated by large-scale training. Specifically, we believe we are
the first to show both convergence guarantees and strong practical performance for simple weight
compression schemes being applied during SGD-based training, which should motivate further work
in this direction. In particular, an interesting extension of our work would be to examine whether
the lower-precision weight representation can also be exploited for faster runtimes.

Acknowledgements
AV acknowledges the support of the French Agence Nationale de la Recherche (ANR), under grant
ANR-21-CE48-0016 (project COMCOPT), the support of Fondation Hadamard with a PRMO grant,
and the support of CNRS with a CoopIntEER IEA grant (project ALFRED).

References
Mosaicml examples. https://github.com/mosaicml/examples, 2022. 11, 18

Table 3: Final perplexities of low-bit quantization of the 125M GPT-2 model using the learned
quantization levels. Learned quantization in the W6G4 configuration provides lower perplexity than
the baseline.

Baseline (no quantization): 35.81

            w6g4    w5g4    w4g4    w4g32
Uniform     35.81   36.34   37.61   37.11
Learned     35.75   36.01   36.94   36.55

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-
efficient sgd via gradient quantization and encoding. Advances in Neural Information Processing
Systems, 30:1709–1720, 2017. 2, 3, 4, 6, 8, 27, 28

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-
parameterization. pages 242–252. PMLR, 2019. 7

Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of
neural networks. Advances in neural information processing systems, 31, 2018. 3

Tal Ben-Nun and Torsten Hoefler. Demystifying parallel and distributed deep learning: An in-depth
concurrency analysis. ACM Computing Surveys (CSUR), 52(4):1–43, 2019. 3, 4

Thomas Blumensath and Mike E Davies. Iterative thresholding for sparse approximations. Journal
of Fourier analysis and Applications, 14(5-6):629–654, 2008. 2, 5

Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem
Chumachenko, Pavel Samygin, and Colin Raffel. Petals: Collaborative inference and fine-tuning
of large models. arXiv preprint arXiv:2209.01188, 2022. 3

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of
COMPSTAT’2010, pages 177–186. Springer, 2010. 1, 4

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 3, 11

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm:
Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 3

Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: k-bit inference scaling laws. arXiv
preprint arXiv:2212.09720, 2022. 10

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix
multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022. 3, 11

Nikoli Dryden, Sam Ade Jacobs, Tim Moon, and Brian Van Essen. Communication quantization
for data-parallel training of deep neural networks. In Proceedings of the Workshop on Machine
Learning in High Performance Computing Environments, pages 1–8. IEEE Press, 2016. 3, 4

Fartash Faghri, Iman Tabrizian, Ilia Markov, Dan Alistarh, Daniel M Roy, and Ali Ramezani-Kebrya.
Adaptive gradient quantization for data-parallel sgd. Advances in neural information processing
systems, 33:3174–3185, 2020. 10
FairScale. Fairscale: A general purpose modular pytorch library for high performance and large scale
training. https://github.com/facebookresearch/fairscale, 2021. 1, 2, 3
Simon Foucart. Sparse recovery algorithms: sufficient conditions in terms of restricted isometry
constants. In Approximation Theory XIII: San Antonio 2010, pages 65–77. Springer, 2012. 2
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training
quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022. 3
Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger,
and Phil Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint
arXiv:1806.03377, 2018. 1
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong
Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural
networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019.
1
Youhe Jiang, Xupeng Miao, Xiaonan Nie, and Bin Cui. Osdp: Optimal sharded data parallel for
distributed deep learning. arXiv preprint arXiv:2209.13258, 2022. 3
Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-
gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on
Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016. 7, 31
Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback
fixes signsgd and other gradient compression schemes. In International Conference on Machine
Learning, pages 3252–3261. PMLR, 2019. 2
Anastasia Koloskova, Sebastian Stich, and Martin Jaggi. Decentralized stochastic optimization
and gossip algorithms with compressed communication. In International Conference on Machine
Learning, pages 3478–3487. PMLR, 2019. 3
Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral,
Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen,
et al. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. In Thirty-sixth
Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. 1
Tao Lin, Sebastian U Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. Dynamic model pruning
with feedback. 2019. 7
Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Toward a theory of optimization for over-parameterized
systems of non-linear equations: the lessons of deep learning. arXiv preprint arXiv:2003.00307,
2020a. 7
Ji Liu, Ce Zhang, et al. Distributed learning systems with first-order methods. Foundations and
Trends® in Databases, 9(1):1–100, 2020b. 4

Yucheng Lu and Christopher De Sa. Moniqua: Modulo quantized communication in decentralized
sgd. In International Conference on Machine Learning, pages 6415–6425. PMLR, 2020. 3
Ilia Markov, Hamidreza Ramezanikebrya, and Dan Alistarh. Cgx: Adaptive system support for
communication-efficient deep learning, 2022. URL https://arxiv.org/abs/2111.08617. 9, 11
Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui.
Galvatron: Efficient transformer training over multiple gpus using automatic parallelism. arXiv
preprint arXiv:2211.13878, 2022. 3
MosaicML. Mosaic LLMs (part 2): Gpt-3 quality for <$500k, 2022. URL https://www.mosaicml.
com/blog/gpt-3-quality-for-500k. 2
Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Shigang Li, and Dan Alistarh. Asynchronous
decentralized sgd with quantized and local updates. Advances in Neural Information Processing
Systems, 34:6829–6842, 2021. 3
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward
Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner,
Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance
deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035.
Curran Associates, Inc., 2019. 2, 9
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language
understanding by generative pre-training. 2018. 1, 2, 11
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. arXiv e-prints, 2019. 11
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text
transformer. J. Mach. Learn. Res., 21(140):1–67, 2020. 1
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations
toward training trillion parameter models. In SC20: International Conference for High Performance
Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020. 1, 3
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimiza-
tions enable training deep learning models with over 100 billion parameters. In Proceedings of
the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages
3505–3506, 2020. 1, 3
Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia
Zhang, Dong Li, and Yuxiong He. {ZeRO-Offload}: Democratizing {Billion-Scale} model training.
In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021. 1, 2, 3
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and
its application to data-parallel distributed training of speech dnns. In Fifteenth annual conference
of the international speech communication association. Citeseer, 2014. 3, 4

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catan-
zaro. Megatron-lm: Training multi-billion parameter language models using model parallelism.
arXiv preprint arXiv:1909.08053, 2019. 1

Nikko Strom. Scalable distributed dnn training using commodity gpu cloud computing. In Sixteenth
Annual Conference of the International Speech Communication Association, 2015. 4

Hanlin Tang, Ce Zhang, Shaoduo Gan, Tong Zhang, and Ji Liu. Decentralization meets quantization.
arXiv preprint arXiv:1803.06443, 2018. 3

Hanlin Tang, Chen Yu, Xiangru Lian, Tong Zhang, and Ji Liu. Doublesqueeze: Parallel stochastic
gradient descent with double-pass error-compensated compression. In International Conference on
Machine Learning, pages 6155–6165. PMLR, 2019. 3

Unix TC. tc(8) Linux User’s Manual, December 2001. 11

Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient
compression for distributed optimization. Advances In Neural Information Processing Systems 32
(Nips 2019), 32, 2019. 3

Hongyi Wang, Scott Sievert, Zachary Charles, Shengchao Liu, Stephen Wright, and Dimitris
Papailiopoulos. ATOMO: Communication-efficient learning via atomic sparsification. 2018. 3

Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, and
Ce Zhang. Fine-tuning language models over slow networks using activation compression with
guarantees. arXiv preprint arXiv:2206.01299, 2022. 3

Guangxuan Xiao, Ji Lin, Mickael Seznec, Julien Demouth, and Song Han. Smoothquant: Accurate
and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438,
2022. 3

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He.
ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. arXiv
preprint arXiv:2206.01861, 2022. 3

Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang,
Christopher Re, and Ce Zhang. Decentralized training of foundation models in heterogeneous
environments. arXiv preprint arXiv:2206.01288, 2022. 3

Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear
models with end-to-end low precision, and a little bit of deep learning. In International Conference
on Machine Learning, pages 4035–4043. PMLR, 2017. 10

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher
Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language
models. arXiv preprint arXiv:2205.01068, 2022. 1, 2

Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, and
Junjie Yan. Towards unified int8 training for convolutional neural network. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1969–1979, 2020. 3

Figure 5: Pseudocode of QSDP for a Fixed Layer

1: Input: worker p, layer input xp , worker weight partition wp .


2: function ExecuteForwardPass
3: qwp ← QuantizeWeights(wp ) // Quantize p’s weights
4: qw ← AllGather(qwi for all i ) // Collect quantized weights
5: op ← Layer(qw, xp ) // Compute output for p
6: free(qw) // Discard aggregated layer weights
7: end function
8: function ExecuteBackwardPass
9: qwp ← QuantizeWeights(wp ) // Quantize p’s weights
10: qw ← AllGather(qwi for all i ) // Collect quantized weights
11: gp ← Gradient(qw, op ) // Compute gradient for p
12: free(qw) // Discard aggregated layer weights
13: qgp ← QuantizeGradients(gp ) // Quantize p’s gradient
14: qgp ← ReduceScatter(qgi for each i) // Distribute gradients
15: wp ← WeightUpdate(qgp , wp ) // Update p’s weights
16: free(qg) // Discard aggregated gradients
17: end function

A Training details.
For training the GPT-2 models we used the MosaicML examples Mos [2022]. The global batch size
was 256 for the 125M and 350M models and 512 for the 1.3B model, resulting in 4 gradient
accumulation steps at each iteration. The AdamW optimizer was used for all models; the optimizer
parameters are presented in Table 4. The 125M model was trained for 4800 steps, the 350M model
for 13400 steps, and the 1.3B model for 14000 steps.

Table 4: AdamW optimizer parameters.

125M 350M 1.3B


learning rate 6e-4 3e-4 2e-4
betas 0.9, 0.95 0.9, 0.95 0.9, 0.95
epsilon 1e-8 1e-8 1e-8

B Network overhead experiments


In order to evaluate the effect of communication in FSDP training, we conducted a synthetic
experiment which reduces the bandwidth cost of each iteration. Specifically, given a buffer of
size N which is about to be communicated, and a compression ratio γ, we only transmit the first N/γ
elements. The results for our setup (4 x 8 V100-32GB GPUs) at different inter-node bandwidths are
shown in Figure 6; weights and gradients are reduced by the same compression ratio. We see that the
largest effect of compression is reached, as expected, for the largest 1.3B model and at the lowest
bandwidth. However, one can get around 80% speedup at high bandwidth when up to 8x compression
is applied. Also, we notice that 8x compression almost reaches the ideal scaling for the large model,
but has an evident overhead over no-communication training in the case of the small model. This
implies that the large models are bottlenecked by the bandwidth component of the communication,
while the small model is dominated by latency.

To see how compression of weights and gradients contributes separately, we conducted a similar
experiment for different combinations of compression ratios (see Table 5). We observe that weight
compression gives larger performance benefits than gradient compression. This can be naturally
explained by the fact that weights are communicated more frequently than gradients in FSDP (in
this particular experiment weights are communicated 5 times per gradient exchange), while the
amount of data per transmission is similar.

The difference between the synthetic experiment and the QSDP performance numbers at the
same compression ratios can be attributed to the performance inefficiency of the NCCL point-to-point
communication primitives on which QSDP's compressed communication is based; the compression
overhead in our experiments was verified to be negligible (less than 1% per iteration).

[Figure 6 plots: average step time (s) vs. compression ratio (1x to 8x) at 10, 50, and 100 Gbps inter-node bandwidth; one panel per model.]

(a) 125M (b) 350M (c) 1.3B

Figure 6: Compression vs average step time for different models at different inter-node bandwidths with
fake compression (weights and gradients have the same compression ratio). Lower is better. The dashed line
represents ideal scaling - training without communication.

Table 5: Training step timings (in seconds) for the 1.3B model at 100 Gbps bandwidth with various
combinations of weight and gradient compression ratios.

Weights ratio \ Gradients ratio    1       2       4       8
1                                  23.23   21.36   20.62   20.2
2                                  19.27   17.17   16.26   15.95
4                                  17.50   15.35   14.6    14.08
8                                  16.62   14.52   13.66   13.21

C Learned quantization
We implemented the stochastic gradient descent optimization of quantization levels in PyTorch, using learning rate 0.01 and batch size 1024. We run the learning procedure for each layer with more than 1e5 parameters; for the remaining layers uniform quantization is used. We evaluate the quality of the quantization levels by comparing the L2 norm of the compression error introduced by quantizing a buffer with those levels. We conducted such an evaluation for weights quantized to 5 bits and gradients quantized to 4 bits during the training of the GPT 125M model. The results for one of the attention layers and for the LM head layer are shown in Figures 7 and 8. The dashed vertical lines mark the moments at which the level-learning algorithm is run. We see that the compression error of the learned quantization levels is consistently lower than that of uniform quantization, and that the lower the bit-width (for gradients we use 4-bit quantization), the larger the gap between the two methods. We also see that the compression error of the learned quantization only increases in sync with that of uniform quantization over time. This means that the learning algorithm only needs to be run once, at the start of training.
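A minimal sketch of this level-learning procedure, under the assumption that the levels are fit by minimizing the L2 quantization error over random mini-batches of buffer entries (the function and default arguments are our own, e.g. num_levels would be 32 for 5-bit weights; this is not the exact implementation):

import torch

def learn_levels(buffer, num_levels, lr=0.01, batch_size=1024, steps=2000):
    # Fit quantization levels to a flattened weight/gradient buffer by minimizing
    # the squared distance of sampled entries to their nearest level.
    flat = buffer.detach().view(-1)
    levels = torch.linspace(flat.min().item(), flat.max().item(), num_levels).requires_grad_(True)
    opt = torch.optim.SGD([levels], lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, flat.numel(), (batch_size,))
        x = flat[idx]
        # Distance from each sampled entry to every level; the min picks the nearest
        # level, and the gradient flows only into that level.
        d = (x.unsqueeze(1) - levels.unsqueeze(0)).abs()
        loss = d.min(dim=1).values.pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return levels.detach()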
We also measured the overhead of running the learning algorithm for GPT 125M with weights quantized to 5 bits and gradients to 4 bits. The learning algorithm takes around 9 minutes, whereas the full training run lasts 5 hours.
The results of additional experiments with low bit-width quantization are shown in Table 6. The numbers do not show full perplexity recovery, but they demonstrate the improvements achieved by the learned-levels algorithm: with learned quantization levels one can reduce perplexity by up to 3 points.

[Figure 7: two panels, (a) Weights and (b) Gradients, plotting the L2 norm of the compression error relative to the layer norm against the training step, comparing max-min (uniform) and learned quantization levels.]

Figure 7: Compression error (L2 norm of the error relative to the L2 norm of the input) comparison with learned quantization levels for an attention layer of the 125M model, W5G4 quantization.

Table 6: Final perplexities of low bit-width quantization of the 125M GPT-2 model using the learned quantization levels.

           baseline   w3g32   w2g32   w8g3    w8g2
Uniform    35.81      45.53   57.92   39.91   44.79
Learned    35.81      42.31   56.54   37.72   44.65

[Figure 8: two panels, (a) Weights and (b) Gradients, plotting the L2 norm of the compression error relative to the layer norm against the training step, comparing max-min (uniform) and learned quantization levels.]

Figure 8: Compression error (L2 norm of the error relative to the L2 norm of the input) comparison with learned quantization levels for the LM-head layer of the 125M model, W5G4 quantization.

D Convergence Proofs
In this section we provide the convergence analysis for our algorithms.

D.1 Overview
We use the notation and assumptions defined in Section 4.1. As all of our analyses revolve around bounding the progress made in a single iteration, to simplify notation we will generally use x to denote the current iterate, and x' to denote the iterate obtained after the generic update:
$$x' = Q^w_\delta\left(x - \frac{\eta}{\beta}\, g(x)\right),$$

where β is the smoothness parameter of f . In Section D.2 we will first prove convergence for the
deterministic method, where we have direct access to the gradients of f . The analysis precisely
follows the steps we described in Section 4.1. Then, we extend this analysis to the case of stochastic
gradients, and provide the full proof for Theorem 2. Finally, in Section D.3 we show that given
an appropriate gradient quantization method with bounded variance, we can use it on top of our
iteration to further reduce the required amount of communication, and thus prove Corollary 3.
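For concreteness, a small PyTorch sketch of this update, assuming the shift-and-round quantizer Q^w of Section 4.1 (round to the randomly shifted grid δZ^n + r1, with one shift r shared across coordinates); the function names are ours:

import torch

def qw(x: torch.Tensor, delta: float) -> torch.Tensor:
    # Shift-and-round quantizer: draw one shift r ~ Unif([-delta/2, delta/2))
    # and round every coordinate to the nearest point of the grid delta*Z + r.
    r = (torch.rand(()) - 0.5) * delta
    return delta * torch.round((x - r) / delta) + r

def quantized_step(x: torch.Tensor, g: torch.Tensor, eta: float, beta: float, delta: float):
    # The generic update x' = Q^w_delta(x - (eta / beta) * g(x)).
    return qw(x - (eta / beta) * g, delta)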
Before proceeding, we first formally analyze the quantization method we defined in Section 4.1,
and show some additional properties that will be useful later.

Lemma 5. Let v ∈ R^n, and let δ > 0. Then,
$$\mathbb{E}\left[Q^w_\delta(v)\right] = v,$$
$$\mathbb{E}\left[\left\|Q^w_\delta(v) - v\right\|_2^2\right] = \delta^2 \cdot \sum_{i=1}^n \left\{\frac{v_i}{\delta}\right\}\left(1 - \left\{\frac{v_i}{\delta}\right\}\right),$$
$$\mathbb{E}\left[\left\|Q^w_{r,\delta}(v) - r\mathbf{1}\right\|_0\right] \le \|v\|_1/\delta .$$

Since the proofs are technical, we defer them to Section D.4.1. The most important feature of
this quantization scheme is captured by Lemma 4, which is crucial for our convergence proof. We
first restate it, and prove it formally in Section D.4.2.

Lemma 4. Let δ⋆ > δ > 0, such that δ⋆/δ ∈ Z. Let x ∈ R^n, and for all r ∈ [−δ⋆/2, δ⋆/2), let x⋆_{r,δ⋆} ∈ δ⋆Z^n + r1 be arbitrary. Then
$$\mathbb{E}\left[\left\|Q^w_\delta(x) - x\right\|_2^2\right] \le \frac{\delta}{\delta_\star}\, \mathbb{E}_r\left[\left\|x^\star_{r,\delta_\star} - x\right\|_2^2\right].$$
The proof crucially relies on the fact that δ⋆/δ ∈ Z, and is rooted in the following inequality:

Lemma 6. Let y ∈ R and let k ≥ 1 be an integer. Then
$$(1 - \{y\})\,\{y\} \le k\left(1 - \left\{\frac{y}{k}\right\}\right)\left\{\frac{y}{k}\right\} .$$
Proof. It suffices to consider y ∈ [0, k], as both sides of the inequality are periodic with period k. The function {y/k}(1 − {y/k}) is a quadratic in y which is monotonically increasing over [0, k/2] and symmetric around k/2. As (1 − {y}){y} is periodic on intervals of length 1, it suffices to show that (1 − {y}){y} ≤ k(1 − y/k)(y/k) on the interval [0, 1]. At this point we can drop the fractional parts and simply compare two quadratics over [0, 1]. Equivalently, we need to show that k(1 − y/k)(y/k) ≥ y(1 − y) over [0, 1], which after simplifying both sides is equivalent to y²(1 − 1/k) ≥ 0 over this interval, which is true.
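As a quick numerical illustration (an added check, not part of the original argument): for y = 1/2 and k = 2,
$$(1 - \{y\})\{y\} = \tfrac12\cdot\tfrac12 = \tfrac14, \qquad k\left(1 - \left\{\tfrac{y}{k}\right\}\right)\left\{\tfrac{y}{k}\right\} = 2\cdot\tfrac34\cdot\tfrac14 = \tfrac38 \ge \tfrac14 .$$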

Finally, we provide some basic optimization inequalities, which will allow us to prove our theorems.
Optimization Basics. The first lemma bounds the change in function value using smoothness, while the second upper bounds the ℓ2 distance to optimality using the error in function value. We provide the proofs in Sections D.4.3 and D.4.4.

Lemma 7. Let f : R^n → R be a β-smooth function. Then for any ∆ ∈ R^n,
$$f(x + \Delta) \le f(x) + (1-\eta)\,\langle\nabla f(x), \Delta\rangle - \frac{\eta^2}{2\beta}\,\|\nabla f(x)\|_2^2 + \frac{\beta}{2}\left\|\frac{\eta}{\beta}\nabla f(x) + \Delta\right\|_2^2 .$$

Lemma 8. If f : R^n → R satisfies the α-PL condition, then for all x ∈ R^n,
$$f(x) - f^\star \ge \frac{\alpha}{2}\,\|x - x^\star\|_2^2,$$
where x⋆ ∈ arg min_x f(x).

We are now ready to prove the main theorems in this paper.

D.2 SGD with weight quantization


We first prove the stepping lemma for our quantized method in the case where full gradients are available. The steps are essentially the same as those described in Section 4.1.

Lemma 9. Let f : R^n → R be a β-smooth and α-PL function. For each r ∈ [−δ⋆/2, δ⋆/2], let x⋆_{r,δ⋆} be any minimizer of f over δ⋆Z^n + r1. Let δ = δ⋆/⌈4(β/α)²⌉. Then, letting x' = Q^w_δ(x − (1/β)∇f(x)), one has that, in expectation over the random bits used by the quantization operator,
$$\mathbb{E} f(x') - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right) \le \left(1 - \frac{\alpha}{2\beta}\right)\left(\mathbb{E} f(x) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right)\right).$$
Proof. Letting ∆ = x' − x, we write:
$$\left\|\frac{1}{\beta}\nabla f(x) + \Delta\right\|_2^2 = \left\|(x+\Delta) - \left(x - \frac{1}{\beta}\nabla f(x)\right)\right\|_2^2 = \left\|Q^w_\delta\left(x - \frac{1}{\beta}\nabla f(x)\right) - \left(x - \frac{1}{\beta}\nabla f(x)\right)\right\|_2^2 .$$
Also, using Lemma 4, we have that for any x⋆ ∈ arg min_x f(x),
$$\mathbb{E}\left\|Q^w_\delta\left(x - \tfrac{1}{\beta}\nabla f(x)\right) - \left(x - \tfrac{1}{\beta}\nabla f(x)\right)\right\|_2^2 \le \frac{\delta}{\delta_\star}\,\mathbb{E}_r\left[\left\|x - \tfrac{1}{\beta}\nabla f(x) - x^\star_{r,\delta_\star}\right\|_2^2\right] \le 2\,\frac{\delta}{\delta_\star}\left(\mathbb{E}_r\left[\left\|x^\star_{r,\delta_\star} - x^\star\right\|_2^2\right] + \left\|x - \tfrac{1}{\beta}\nabla f(x) - x^\star\right\|_2^2\right).$$
Using the α-PL condition (Lemma 8) we upper bound the distance from x⋆ by the error in function value, i.e.
$$\left\|x^\star_{r,\delta_\star} - x^\star\right\|_2^2 \le \frac{2}{\alpha}\left(f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right), \qquad \left\|x - \tfrac{1}{\beta}\nabla f(x) - x^\star\right\|_2^2 \le \frac{2}{\alpha}\left(f\left(x - \tfrac{1}{\beta}\nabla f(x)\right) - f(x^\star)\right).$$
Combining these with Lemma 7 for η = 1, we conclude that
$$\mathbb{E} f(x') \le f(x) - \frac{1}{2\beta}\|\nabla f(x)\|_2^2 + 2\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\left(f\left(x - \tfrac{1}{\beta}\nabla f(x)\right) - f(x^\star)\right) + 2\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\,\mathbb{E}_r\left(f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right).$$
Again, using the PL condition we lower bound ½‖∇f(x)‖²₂ ≥ α(f(x) − f(x⋆)), and we also use f(x − (1/β)∇f(x)) ≤ f(x), which gives
$$\mathbb{E} f(x') - f(x^\star) \le \left(1 - \frac{\alpha}{\beta} + 2\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\right)\left(f(x) - f(x^\star)\right) + 2\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\,\mathbb{E}_r\left(f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right).$$
Equivalently, rewriting both sides relative to Ef(x⋆_{r,δ⋆}), we obtain
$$\mathbb{E} f(x') - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right) \le \left(1 - \frac{\alpha}{\beta} + 2\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\right)\left(f(x) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right)\right) + \left(4\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha} - \frac{\alpha}{\beta}\right)\mathbb{E}\left(f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right).$$
Since we set δ/δ⋆ = 1/⌈4(β/α)²⌉, the second term is non-positive and 2(δ/δ⋆)(β/α) ≤ α/(2β). Therefore we have
$$\mathbb{E} f(x') - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right) \le \left(1 - \frac{\alpha}{2\beta}\right)\left(f(x) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right)\right),$$
which concludes the proof.
We now generalize the proof of Lemma 9 to the case where only stochastic gradients are available.
The proof is essentially the same, the main difference being that we isolate terms involving the
difference between the stochastic and the true gradient, which we bound separately using our variance
bound.
Lemma 10. Let f : R^n → R be a β-smooth and α-PL function. For each r ∈ [−δ⋆/2, δ⋆/2], let x⋆_{r,δ⋆} be any minimizer of f over δ⋆Z^n + r1. Let δ = (η/⌈16(β/α)²⌉)·δ⋆. Let x' = Q^w_δ(x − (η/β)g(x)), where g(x) is an unbiased estimator for ∇f(x), i.e. E[g(x)|x] = ∇f(x), and 0 < η ≤ 1 is a step size parameter. Furthermore, assume that the variance of g(x) is bounded, E‖g(x) − ∇f(x)‖²₂ ≤ σ², for a real parameter σ > 0. Then, for r ∼ Unif([−δ⋆/2, δ⋆/2)), in expectation over the gradient stochasticity:
$$\mathbb{E}\left[f(x')\,|\,x\right] - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right) \le \left(1 - \frac{3}{4}\,\eta\,\frac{\alpha}{\beta}\right)\left(f(x) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right)\right) + \frac{5\eta^2}{4\beta}\,\sigma^2 .$$
Proof. We follow the analysis from Lemma 9, while moving the stochastic gradients into expressions that involve the stochastic variance. Letting ∆ = x' − x, we write:
$$\left\|\frac{\eta}{\beta}\nabla f(x) + \Delta\right\|_2^2 = \left\|(x+\Delta) - \left(x - \frac{\eta}{\beta}\nabla f(x)\right)\right\|_2^2 = \left\|Q^w_{r,\delta}\left(x - \frac{\eta}{\beta} g(x)\right) - \left(x - \frac{\eta}{\beta}\nabla f(x)\right)\right\|_2^2 \le 2\left\|Q^w_{r,\delta}\left(x - \frac{\eta}{\beta} g(x)\right) - \left(x - \frac{\eta}{\beta} g(x)\right)\right\|_2^2 + 2\left\|\frac{\eta}{\beta}\left(g(x) - \nabla f(x)\right)\right\|_2^2,$$
where we used the inequality ‖a + b‖²₂ ≤ 2‖a‖²₂ + 2‖b‖²₂. Also, using Lemma 4, we have that for any x⋆ ∈ arg min_x f(x),
$$\mathbb{E}\left\|Q^w_{r,\delta}\left(x - \frac{\eta}{\beta} g(x)\right) - \left(x - \frac{\eta}{\beta} g(x)\right)\right\|_2^2 \le \frac{\delta}{\delta_\star}\,\mathbb{E}\left[\left\|x - \frac{\eta}{\beta} g(x) - x^\star_{r,\delta_\star}\right\|_2^2\right] \le 2\,\frac{\delta}{\delta_\star}\left(\mathbb{E}\left[\left\|x^\star_{r,\delta_\star} - x^\star\right\|_2^2\right] + \left\|x - \frac{\eta}{\beta} g(x) - x^\star\right\|_2^2\right) \le 2\,\frac{\delta}{\delta_\star}\left(\mathbb{E}\left[\left\|x^\star_{r,\delta_\star} - x^\star\right\|_2^2\right] + 2\left\|x - \frac{\eta}{\beta}\nabla f(x) - x^\star\right\|_2^2 + 2\left\|\frac{\eta}{\beta}\left(g(x) - \nabla f(x)\right)\right\|_2^2\right).$$
Using the α-PL condition we upper bound the distance from x⋆ by the error in function value, i.e.
$$\left\|x^\star_{r,\delta_\star} - x^\star\right\|_2^2 \le \frac{2}{\alpha}\left(f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right), \qquad \left\|x - \frac{\eta}{\beta}\nabla f(x) - x^\star\right\|_2^2 \le \frac{2}{\alpha}\left(f\left(x - \frac{\eta}{\beta}\nabla f(x)\right) - f(x^\star)\right).$$
Combining these with Lemma 7, we conclude that, in expectation over the random shift:
$$\begin{aligned} f(x') &\le f(x) + (1-\eta)\langle\nabla f(x), \Delta\rangle - \frac{\eta^2}{2\beta}\|\nabla f(x)\|_2^2 + \frac{\beta}{2}\left(2\left\|Q^w_{r,\delta}\left(x - \frac{\eta}{\beta}g(x)\right) - \left(x - \frac{\eta}{\beta}g(x)\right)\right\|_2^2 + 2\left\|\frac{\eta}{\beta}\left(g(x) - \nabla f(x)\right)\right\|_2^2\right) \\ &\le f(x) + (1-\eta)\langle\nabla f(x), \Delta\rangle - \frac{\eta^2}{2\beta}\|\nabla f(x)\|_2^2 + 2\beta\,\frac{\delta}{\delta_\star}\left(\left\|x^\star_{r,\delta_\star} - x^\star\right\|_2^2 + 2\left\|x - \frac{\eta}{\beta}\nabla f(x) - x^\star\right\|_2^2 + 2\left\|\frac{\eta}{\beta}\left(g(x) - \nabla f(x)\right)\right\|_2^2\right) + \frac{\eta^2}{\beta}\left\|g(x) - \nabla f(x)\right\|_2^2 \\ &\le f(x) + (1-\eta)\langle\nabla f(x), \Delta\rangle - \frac{\eta^2}{2\beta}\|\nabla f(x)\|_2^2 + 4\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\left(f\left(x - \frac{\eta}{\beta}\nabla f(x)\right) - f(x^\star)\right) + 4\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\left(f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right) + \frac{\eta^2}{\beta}\left(1 + 4\,\frac{\delta}{\delta_\star}\right)\left\|g(x) - \nabla f(x)\right\|_2^2 . \end{aligned}$$
Again, using the PL condition we lower bound ½‖∇f(x)‖²₂ ≥ α(f(x) − f(x⋆)), which gives that, in expectation over the random shift:
$$\begin{aligned} f(x') - f(x^\star) &\le \left(1 - \eta^2\frac{\alpha}{\beta}\right)\left(f(x) - f(x^\star)\right) + (1-\eta)\langle\nabla f(x), \Delta\rangle + 4\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\left(f\left(x - \frac{\eta}{\beta}\nabla f(x)\right) - f(x^\star)\right) + 4\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\left(f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right) + \frac{\eta^2}{\beta}\left(1 + 4\,\frac{\delta}{\delta_\star}\right)\left\|g(x) - \nabla f(x)\right\|_2^2 \\ &\le \left(1 - \eta^2\frac{\alpha}{\beta} + 4\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\right)\left(f(x) - f(x^\star)\right) + 4\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\left(f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right) + (1-\eta)\langle\nabla f(x), \Delta\rangle + \frac{\eta^2}{\beta}\left(1 + 4\,\frac{\delta}{\delta_\star}\right)\left\|g(x) - \nabla f(x)\right\|_2^2, \end{aligned}$$
and equivalently, in expectation over the random shift:
$$\mathbb{E}\left[f(x')\right] - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right) \le \left(1 - \eta^2\frac{\alpha}{\beta} + 4\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\right)\left(f(x) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right)\right) + \left(8\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha} - \eta^2\frac{\alpha}{\beta}\right)\left(\mathbb{E} f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right) + (1-\eta)\langle\nabla f(x), \mathbb{E}[\Delta]\rangle + \frac{\eta^2}{\beta}\left(1 + 4\,\frac{\delta}{\delta_\star}\right)\left\|g(x) - \nabla f(x)\right\|_2^2 .$$
At this point we use Lemma 5 to write
$$\mathbb{E}[\Delta] = \mathbb{E}\left[Q^w_{r,\delta}\left(x - \frac{\eta}{\beta} g(x)\right) - x\right] = -\frac{\eta}{\beta}\, g(x),$$
and thus
$$(1-\eta)\langle\nabla f(x), \mathbb{E}[\Delta]\rangle = -\frac{\eta}{\beta}(1-\eta)\langle\nabla f(x), g(x)\rangle .$$
Therefore, after taking expectation over both the random shift and the gradient stochasticity, we obtain:
$$\begin{aligned} \mathbb{E}\left[f(x') - f\left(x^\star_{r,\delta_\star}\right)\,\middle|\,x\right] &\le \left(1 - \eta^2\frac{\alpha}{\beta} + 4\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\right)\left(f(x) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right)\right) + \left(8\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha} - \eta^2\frac{\alpha}{\beta}\right)\left(\mathbb{E} f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right) - \frac{\eta}{\beta}(1-\eta)\|\nabla f(x)\|_2^2 + \frac{\eta^2}{\beta}\left(1 + 4\,\frac{\delta}{\delta_\star}\right)\sigma^2 \\ &\le \left(1 - \eta^2\frac{\alpha}{\beta} + 4\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\right)\left(f(x) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right)\right) + \left(8\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha} - \eta^2\frac{\alpha}{\beta}\right)\left(\mathbb{E} f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right) - 2\eta(1-\eta)\frac{\alpha}{\beta}\left(f(x) - f(x^\star)\right) + \frac{\eta^2}{\beta}\left(1 + 4\,\frac{\delta}{\delta_\star}\right)\sigma^2 \\ &= \left(1 - \eta^2\frac{\alpha}{\beta} - 2\eta(1-\eta)\frac{\alpha}{\beta} + 4\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha}\right)\left(f(x) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right)\right) + \left(8\,\frac{\delta}{\delta_\star}\frac{\beta}{\alpha} - \eta^2\frac{\alpha}{\beta} - 2\eta(1-\eta)\frac{\alpha}{\beta}\right)\left(\mathbb{E} f\left(x^\star_{r,\delta_\star}\right) - f(x^\star)\right) + \frac{\eta^2}{\beta}\left(1 + 4\,\frac{\delta}{\delta_\star}\right)\sigma^2 . \end{aligned}$$
Since we set δ/δ⋆ = η/⌈16(β/α)²⌉, the second term is non-positive. Therefore, in this case we have
$$\mathbb{E}\left[f(x')\,\middle|\,x\right] - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right) \le \left(1 - \eta^2\frac{\alpha}{\beta} - 2\eta(1-\eta)\frac{\alpha}{\beta} + \frac{\eta}{4}\frac{\alpha}{\beta}\right)\left(f(x) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right)\right) + \frac{\eta^2}{\beta}\left(1 + \frac{\eta}{4}\left(\frac{\alpha}{\beta}\right)^2\right)\sigma^2 \le \left(1 - \frac{7}{4}\eta\frac{\alpha}{\beta} + \eta^2\frac{\alpha}{\beta}\right)\left(f(x) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right)\right) + \frac{\eta^2}{\beta}\left(1 + \frac{\eta}{4}\left(\frac{\alpha}{\beta}\right)^2\right)\sigma^2 \le \left(1 - \frac{3}{4}\eta\frac{\alpha}{\beta}\right)\left(f(x) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right)\right) + \frac{5\eta^2}{4\beta}\sigma^2,$$
as long as η ≤ 1. This concludes the proof.
Using Lemma 10 the proof of Theorem 2 follows very easily.
Theorem 2. Let α, β, δ⋆, ε > 0 and σ ≥ 0 be real parameters, and let η = min{3εα/(10σ²), 1}. Let f : R^n → R be a β-smooth and α-PL function, with access to a stochastic gradient g(x), i.e. E[g(x)|x] = ∇f(x), with bounded variance E‖g(x) − ∇f(x)‖²₂ ≤ σ². For each r ∈ [−δ⋆/2, δ⋆/2), let x⋆_{r,δ⋆} be any minimizer of f over δ⋆Z^n + r1. Let δ = (η/⌈16(β/α)²⌉)·δ⋆. Consider the iteration:
$$x_{t+1} = Q^w_\delta\left(x_t - \frac{\eta}{\beta}\, g(x_t)\right).$$
In $T = \frac{10}{\eta}\cdot\frac{\beta}{\alpha}\,\ln\frac{f(x_0) - \mathbb{E} f(x^\star_{r,\delta_\star})}{\varepsilon}$ iterations we obtain a point x_T satisfying Ef(x_T) − Ef(x⋆_{r,δ⋆}) ≤ ε.

Proof. Plugging in Lemma 10 and applying it for $T = \frac{10}{\eta}\cdot\frac{\beta}{\alpha}\,\ln\frac{f(x_0) - \mathbb{E} f(x^\star_{r,\delta_\star})}{\varepsilon}$ iterations, we obtain:
$$\mathbb{E} f(x_T) - \mathbb{E} f\left(x^\star_{r,\delta_\star}\right) \le \frac{\varepsilon}{2} + \frac{5\eta^2}{4\beta}\sigma^2\cdot\sum_{k=0}^{T-1}\left(1 - \frac34\,\eta\,\frac{\alpha}{\beta}\right)^k \le \frac{\varepsilon}{2} + \frac{5\eta^2}{4\beta}\sigma^2\cdot\frac{4}{3\eta}\frac{\beta}{\alpha} = \frac{\varepsilon}{2} + \frac{5\eta}{3\alpha}\sigma^2 .$$
Since we set η = min{3εα/(10σ²), 1}, the entire quantity is at most ε, which concludes the proof.
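To make the statement concrete, the following toy sketch runs the iteration of Theorem 2 on a strongly convex quadratic with exact gradients (so σ = 0 and η = 1 is admissible); the grid spacing and step count are arbitrary illustrative choices, not the ones prescribed by the theorem:

import torch

torch.manual_seed(0)
A = torch.randn(64, 16)
b = torch.randn(64)
# f(x) = 0.5 * ||A x - b||^2 is beta-smooth and alpha-PL, with beta and alpha
# the largest and smallest eigenvalues of A^T A.
beta = torch.linalg.eigvalsh(A.T @ A).max().item()

delta = 1e-3   # weight grid spacing (illustrative choice)
eta = 1.0      # exact gradients, sigma = 0

x = torch.zeros(16)
for _ in range(2000):
    g = A.T @ (A @ x - b)                       # exact gradient of f
    r = (torch.rand(()) - 0.5) * delta          # random shift of the grid
    x = delta * torch.round((x - (eta / beta) * g - r) / delta) + r
print(0.5 * ((A @ x - b) ** 2).sum().item())    # close to the least-squares optimum, up to the grid resolution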

D.3 Reducing Communication by Quantizing Gradients


The approach described in Section D.2 maintains quantized weights, but communicating gradients may still be expensive. In this section we show that any reasonable gradient quantization method can be used to reduce communication, at the price of an increased variance. This trade-off is inherent, as reducing the number of bits requires injecting randomness, so that the entropy of the output is not smaller than that of the original message to be communicated.
To do so we use any gradient quantization method Qg, as long as it is an unbiased estimator of its input and has bounded variance. Our formal requirements for Qg are the following.

Definition 11. We say that a gradient quantization operator Qg is a (σ∇, b)-unbiased quantizer if it:

1. is an unbiased estimator: E[Qg(g(x)) | g(x)] = g(x),

2. has bounded variance on the stochastic gradients: E[‖Qg(g(x)) − g(x)‖²₂ | g(x)] ≤ σ∇²,

3. requires b bits to communicate Qg(g(x)).

By Lemma 5, these requirements are automatically satisfied by our shift-and-round quantization operator Qw, and we can show that σ∇ and b are determined by the ℓ1 norm of g(x).
Standard Quantization Schemes and Their Communication Cost. Another standard gra-
dient quantization scheme can be obtained by independently rounding each coordinate to one of the
neighboring points on the quantization grid, with an appropriate probability. An identical scheme
has been previously used in other related works on gradient quantization Alistarh et al. [2017].

Definition 12 (quantization by flipping a coin). Let δ > 0 be a scalar defining the coarseness of the quantization grid. The operator Qδ : R → δZ is defined as
$$Q_\delta(x) = \begin{cases}\delta\left\lfloor\frac{x}{\delta}\right\rfloor & \text{with probability } 1 - \left(\frac{x}{\delta} - \left\lfloor\frac{x}{\delta}\right\rfloor\right),\\ \delta\left(\left\lfloor\frac{x}{\delta}\right\rfloor + 1\right) & \text{with probability } \frac{x}{\delta} - \left\lfloor\frac{x}{\delta}\right\rfloor .\end{cases}$$
We apply Qδ to vectors, with the meaning that it is independently applied to each coordinate.
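A small PyTorch sketch of this coordinate-wise operator (an illustration of Definition 12, not the paper's implementation):

import torch

def q_coinflip(x: torch.Tensor, delta: float) -> torch.Tensor:
    # Round each coordinate of x down or up to the grid delta*Z, choosing "up"
    # with probability equal to the fractional part, so the result is unbiased.
    scaled = x / delta
    low = torch.floor(scaled)
    prob_up = scaled - low
    go_up = (torch.rand_like(x) < prob_up).to(x.dtype)
    return delta * (low + go_up)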

It is fairly easy to prove that this operator satisfies properties very similar to those proved for Qw in Lemma 5; we prove them in Section D.4.5. We note an important difference between these two quantization methods. While Q independently quantizes each coordinate, the quantization in Qw is done dependently across coordinates, and the output is always a vector in δZ^n + r1, for a randomly sampled scalar r. Although morally they are quite similar (in fact, the shift after rounding could just as well be ignored, and we would still have an unbiased estimator), the shift is important if we want to relate the quality of the final solution to the best set of weights from a reasonably chosen grid. This difference becomes apparent when trying to prove bounds of the type of Lemma 4: the attempt falls through in the case of the Q operator.
Since we can naively relate the communication cost of a quantized gradient to its sparsity, it is important to discuss quantitative bounds. In both cases, the sparsity bound depends on the ℓ1 norm of the quantized vector, and it is easy to see that it is tight. By comparison, the bound from Alistarh et al. [2017] is provided in terms of the ℓ2 norm of the vector, but pays an additional √n factor, which is suboptimal when the input is analytically sparse. For Qw and Q, we see that the variance introduced by quantizing a generic vector v is bounded by δ‖v‖₁, while its sparsity is ‖v‖₁/δ. Hence a naive encoding of this quantized gradient requires at most O((‖v‖₁/δ)(ln n + ln ‖v‖₁)) bits of communication.
SGD with Weight and Gradient Quantization. For gradient quantization operators that are
unbiased estimators, we can use them as stochastic gradients inside the scheme we derived in Theorem
2. To do so we crucially use the following identity involving conditional variance:

Lemma 13 (Law of total variance). Given random variables X and Y , one has that

Var [Y ] = E [Var [Y |X]] + Var [E [Y |X]] .

Corollary 14. Consider a stochastic gradient estimator g(x) such that E[g(x) | x] = ∇f(x) and E[‖g(x) − ∇f(x)‖²₂ | x] ≤ σ². Consider a (σ∇, b)-unbiased quantizer Qδ (Definition 11). Then
$$\mathbb{E}\left[Q_\delta(g(x))\,|\,x\right] = \nabla f(x),$$
i.e. it is an unbiased estimator for the gradient, and
$$\mathbb{E}\left[\left\|Q_\delta(g(x)) - \nabla f(x)\right\|_2^2\right] \le \sigma_\nabla^2 + \sigma^2 .$$

Proof. The fact that the quantized gradient is an unbiased estimator for ∇f(x) follows from the law of total expectation, as
$$\mathbb{E}\left[Q_\delta(g(x))\,|\,x\right] = \mathbb{E}\left[\mathbb{E}\left[Q_\delta(g(x))\,|\,x, g(x)\right]\right] = \mathbb{E}\left[\mathbb{E}\left[g(x)\,|\,x\right]\right] = \nabla f(x) .$$
For the variance, we use Lemma 13 to write:
$$\begin{aligned} \mathbb{E}\left[\left\|Q_\delta(g(x)) - \nabla f(x)\right\|_2^2\right] &= \mathrm{Var}\left[Q_\delta(g(x))\right] \\ &= \mathbb{E}\left[\mathrm{Var}\left[Q_\delta(g(x))\,|\,g(x)\right]\right] + \mathrm{Var}\left[\mathbb{E}\left[Q_\delta(g(x))\,|\,g(x)\right]\right] \\ &\le \sigma_\nabla^2 + \mathrm{Var}\left[g(x)\right] \\ &= \sigma_\nabla^2 + \sigma^2 . \end{aligned}$$

Finally, combining Theorem 2 with Corollary 14, we obtain the final result from Corollary 3.

D.4 Deferred Proofs


D.4.1 Proof of Lemma 5
Proof. For both the mean and variance computation, it suffices to prove these bounds for the scalar
operator.

We note that by definition $Q^w_{r,\delta}(x) - r = Q^w_{0,\delta}(x-r) := \delta\cdot\left\lfloor\frac{x-r}{\delta}\right\rceil = \delta\cdot Q^w_{0,1}\left(\frac{x-r}{\delta}\right)$. Also let {x} = x − ⌊x⌋ denote the fractional part of x. We can easily verify that for any scalar 0 ≤ z < 1, we have
$$\mathbb{E}_{u\sim \mathrm{Unif}([-1/2,1/2))}\left[\lfloor z + u\rceil\right] = z . \qquad (6)$$
This is because ⌊z + u⌉ = 1 if and only if z + u ≥ 1/2, i.e. u ≥ 1/2 − z, which happens with probability z. Now we can express the expectation of Q^w_{r,δ}(x) as follows:
$$\begin{aligned} \mathbb{E}\left[Q^w_\delta(x)\right] &= \mathbb{E}_r\left[Q^w_{0,\delta}(x-r) + r\right] \\ &= \mathbb{E}_r\left[Q^w_{0,\delta}(x-r)\right] + \mathbb{E}_r[r] \\ &= \mathbb{E}_r\left[Q^w_{0,\delta}\left(\delta\left\lfloor\tfrac{x}{\delta}\right\rfloor + \delta\left\{\tfrac{x}{\delta}\right\} - r\right)\right] + \mathbb{E}_r[r] \\ &= \delta\left\lfloor\tfrac{x}{\delta}\right\rfloor + \mathbb{E}_r\left[Q^w_{0,\delta}\left(\delta\left\{\tfrac{x}{\delta}\right\} - r\right)\right] + \mathbb{E}_r[r] \\ &= \delta\left\lfloor\tfrac{x}{\delta}\right\rfloor + \mathbb{E}_r\left[\delta\cdot Q^w_{0,1}\left(\left\{\tfrac{x}{\delta}\right\} - \tfrac{r}{\delta}\right)\right] + \mathbb{E}_r[r] . \end{aligned}$$
In the last line we used the fact that $Q^w_{0,\delta}(y) = \delta\cdot Q^w_{0,1}(y/\delta)$. Now we reparameterize by using u := r/δ, so that u ∼ Unif([−1/2, 1/2)). This allows us to write the middle term as
$$\mathbb{E}_r\left[\delta\cdot Q^w_{0,1}\left(\left\{\tfrac{x}{\delta}\right\} - \tfrac{r}{\delta}\right)\right] = \delta\cdot\mathbb{E}_{u\sim\mathrm{Unif}([-1/2,1/2))}\left[Q^w_{0,1}\left(\left\{\tfrac{x}{\delta}\right\} - u\right)\right] = \delta\cdot\left\{\tfrac{x}{\delta}\right\},$$
where we used (6). Plugging back in, we obtain
$$\mathbb{E}\left[Q^w_{r,\delta}(x)\right] = \delta\left\lfloor\tfrac{x}{\delta}\right\rfloor + \delta\cdot\left\{\tfrac{x}{\delta}\right\} + 0 = x .$$
Next we compute the scalar variance:
$$\begin{aligned} \mathbb{E}\left[\left(Q^w_\delta(x) - x\right)^2\right] &= \mathbb{E}_r\left[\left(Q^w_{0,\delta}(x-r) - x\right)^2\right] \\ &= \mathbb{E}_r\left[\left(\delta\left\lfloor\tfrac{x}{\delta}\right\rfloor + \delta\cdot Q^w_{0,1}\left(\left\{\tfrac{x}{\delta}\right\} - \tfrac{r}{\delta}\right) - x\right)^2\right] \\ &= \mathbb{E}_r\left[\delta^2\cdot\left(\left\lfloor\tfrac{x}{\delta}\right\rfloor + Q^w_{0,1}\left(\left\{\tfrac{x}{\delta}\right\} - \tfrac{r}{\delta}\right) - \tfrac{x}{\delta}\right)^2\right] \\ &= \mathbb{E}_{u\sim\mathrm{Unif}([-1/2,1/2))}\left[\delta^2\cdot\left(\left\lfloor\tfrac{x}{\delta}\right\rfloor + Q^w_{0,1}\left(\left\{\tfrac{x}{\delta}\right\} - u\right) - \tfrac{x}{\delta}\right)^2\right] \\ &= \delta^2\cdot\mathbb{E}_{u\sim\mathrm{Unif}([-1/2,1/2))}\left[\left(Q^w_{0,1}\left(\left\{\tfrac{x}{\delta}\right\} - u\right) - \left\{\tfrac{x}{\delta}\right\}\right)^2\right] . \end{aligned}$$
Now we use the fact that for any scalar 0 ≤ z < 1 one has that
$$\mathbb{E}_{u\sim\mathrm{Unif}([-1/2,1/2))}\left[\left(\lfloor z + u\rceil - z\right)^2\right] = z(1-z) .$$
This follows from the fact that ⌊z + u⌉ = 1 iff u ≥ 1/2 − z, which happens with probability z, and makes the expectation equal to
$$\int_{-1/2}^{1/2-z} z^2\, du + \int_{1/2-z}^{1/2} (1-z)^2\, du = z(1-z),$$
which leads us to
$$\mathbb{E}\left[\left(Q^w_\delta(x) - x\right)^2\right] = \delta^2\cdot\left\{\tfrac{x}{\delta}\right\}\left(1 - \left\{\tfrac{x}{\delta}\right\}\right),$$
which gives us what we needed.
Finally, for the sparsity bound, let us understand when a single scalar gets rounded to zero (before shifting back by r). We have that for x ∈ R,
$$\mathbb{P}\left[Q^w_{r,\delta}(x) - r = 0\right] = \mathbb{P}\left[Q^w_{r,1}\left(\tfrac{x}{\delta}\right) - r = 0\right] = \begin{cases}\int_{-1/2}^{1/2} \mathbf{1}_{\lfloor x/\delta - r\rceil = 0}\, dr, & |x| < \delta,\\ 0, & |x|\ge\delta,\end{cases}$$
and for |x| < δ,
$$\int_{-1/2}^{1/2} \mathbf{1}_{-\frac12 \le \frac{x}{\delta} - r \le \frac12}\, dr = \int_{-1/2}^{1/2} \mathbf{1}_{\frac{x}{\delta} - \frac12 \le r \le \frac{x}{\delta} + \frac12}\, dr = \min\left\{\tfrac{x}{\delta} + \tfrac12, \tfrac12\right\} - \max\left\{\tfrac{x}{\delta} - \tfrac12, -\tfrac12\right\} = 1 + \min\left\{\tfrac{x}{\delta},0\right\} - \max\left\{\tfrac{x}{\delta},0\right\} = 1 - \left|\tfrac{x}{\delta}\right| ,$$
which shows that
$$\mathbb{E}\left[\left\|Q^w_{r,\delta}(v) - r\mathbf{1}\right\|_0\right] = \sum_{i=1}^n\left(1 - \mathbb{P}\left[Q^w_{r,\delta}(v_i) - r = 0\right]\right) = \sum_{i=1}^n \min\left\{\frac{|v_i|}{\delta},\, 1\right\} \le \|v\|_1/\delta .$$

This concludes the proof.

D.4.2 Proof of Lemma 4


Proof. It suffices to prove this coordinate-wise. From Lemma 5 we have that for any x ∈ R,
$$\mathbb{E}\left[\left(Q^w_\delta(x) - x\right)^2\right] = \delta^2\left\{\frac{x}{\delta}\right\}\left(1 - \left\{\frac{x}{\delta}\right\}\right),$$
and similarly for δ⋆. Let k = δ⋆/δ. Then
$$\mathbb{E}\left[\left(Q^w_{\delta_\star}(x) - x\right)^2\right] = k^2\delta^2\left\{\frac{x/\delta}{k}\right\}\left(1 - \left\{\frac{x/\delta}{k}\right\}\right).$$
Applying the inequality from Lemma 6, we conclude that
$$\mathbb{E}\left[\left(Q^w_\delta(x) - x\right)^2\right] = \delta^2\left\{\frac{x}{\delta}\right\}\left(1 - \left\{\frac{x}{\delta}\right\}\right) \le \delta^2\cdot k\left(1 - \left\{\frac{x}{k\delta}\right\}\right)\left\{\frac{x}{k\delta}\right\} = \frac{1}{k}\,\mathbb{E}\left[\left(Q^w_{\delta_\star}(x) - x\right)^2\right].$$
Applying this bound to all coordinates we obtain
$$\mathbb{E}\left[\left\|Q^w_\delta(x) - x\right\|_2^2\right] \le \frac{\delta}{\delta_\star}\,\mathbb{E}\left[\left\|Q^w_{r,\delta_\star}(x) - x\right\|_2^2\right].$$
Also, since Q^w_{r,δ⋆} rounds to the nearest point in δ⋆Z + r, clearly ‖Q^w_{r,δ⋆}(x) − x‖₂ ≤ ‖x⋆_{r,δ⋆} − x‖₂ for all r. Taking expectations on both sides and combining with the previous inequality concludes the proof.

D.4.3 Proof of Lemma 7
Proof. Using smoothness we have
$$\begin{aligned} f(x+\Delta) &\le f(x) + \langle\nabla f(x), \Delta\rangle + \frac{\beta}{2}\|\Delta\|_2^2 \\ &= f(x) + (1-\eta)\langle\nabla f(x), \Delta\rangle - \frac{\eta^2}{2\beta}\|\nabla f(x)\|_2^2 + \left(\frac{\eta^2}{2\beta}\|\nabla f(x)\|_2^2 + \eta\langle\nabla f(x), \Delta\rangle + \frac{\beta}{2}\|\Delta\|_2^2\right) \\ &= f(x) + (1-\eta)\langle\nabla f(x), \Delta\rangle - \frac{\eta^2}{2\beta}\|\nabla f(x)\|_2^2 + \frac{\beta}{2}\left\|\frac{\eta}{\beta}\nabla f(x) + \Delta\right\|_2^2 . \end{aligned}$$

D.4.4 Proof of Lemma 8


The proof is standard and can be found in the literature, e.g. Karimi et al. [2016]. However, for completeness we reproduce it here.
Proof. Let $g(x) = \sqrt{f(x) - f^\star}$, for which we have
$$\nabla g(x) = \frac{1}{2\sqrt{f(x) - f^\star}}\,\nabla f(x) .$$
Using the α-PL condition we have
$$\|\nabla g(x)\|^2 = \frac{1}{4(f(x) - f^\star)}\cdot\|\nabla f(x)\|^2 \ge \frac{1}{2(f(x) - f^\star)}\cdot\alpha\cdot(f(x) - f^\star) = \frac{\alpha}{2} .$$
Now, starting at some x₀, we consider the dynamic ẋ = −∇g(x). This always decreases the function value until it reaches some x_T for which ∇g(x_T) = 0, and hence, by the PL inequality, x_T is a minimizer, i.e. f(x_T) = f⋆. Now we can write
$$g(x_T) = g(x_0) + \int_0^T \langle\nabla g(x_t), \dot{x}_t\rangle\, dt = g(x_0) + \int_0^T \langle\nabla g(x_t), -\nabla g(x_t)\rangle\, dt = g(x_0) - \int_0^T \|\nabla g(x_t)\|^2\, dt .$$
Thus
$$g(x_0) - g(x_T) = \int_0^T \|\nabla g(x_t)\|^2\, dt \ge \sqrt{\frac{\alpha}{2}}\cdot\int_0^T \|\nabla g(x_t)\|\, dt = \sqrt{\frac{\alpha}{2}}\cdot\int_0^T \|\dot{x}_t\|\, dt,$$
where we used our lower bound on the norm of ∇g(x). Finally, we use the fact that the last integral lower bounds the total movement of x as it moves from x₀ to x_T. Thus
$$\int_0^T \|\dot{x}_t\|\, dt \ge \|x_0 - x_T\|,$$
so
$$g(x_0) - g(x_T) \ge \sqrt{\frac{\alpha}{2}}\,\|x_0 - x_T\|,$$
which enables us to conclude that
$$f(x_0) - f^\star \ge \frac{\alpha}{2}\,\|x_0 - x_T\|^2,$$
where x_T is some global minimizer of f. This concludes the proof.

D.4.5 Bound for Quantization by Coin Flip
Lemma 15. Let v ∈ R^n, let δ > 0, and let Qδ be the quantization operator from Definition 12. Then,
$$\mathbb{E}\left[Q_\delta(v)\right] = v,$$
$$\mathbb{E}\left[\left\|Q_\delta(v) - v\right\|_2^2\right] = \delta^2\cdot\sum_{i=1}^n\left\{\frac{v_i}{\delta}\right\}\left(1 - \left\{\frac{v_i}{\delta}\right\}\right),$$
$$\mathbb{E}\left[\left\|Q_\delta(v)\right\|_0\right] \le \|v\|_1/\delta .$$

Proof. For the expectation and variance, it suffices to prove that these bounds hold coordinate-wise. Let x ∈ R, and write $x = \delta\left\lfloor\tfrac{x}{\delta}\right\rfloor + \delta\left\{\tfrac{x}{\delta}\right\}$, so that
$$\mathbb{E}\left[Q_\delta(x)\right] = \mathbb{E}\left[Q_\delta\left(\delta\left\lfloor\tfrac{x}{\delta}\right\rfloor + \delta\left\{\tfrac{x}{\delta}\right\}\right)\right] = \delta\left\lfloor\tfrac{x}{\delta}\right\rfloor + \mathbb{E}\left[Q_\delta\left(\delta\left\{\tfrac{x}{\delta}\right\}\right)\right] = \delta\left\lfloor\tfrac{x}{\delta}\right\rfloor + \delta\cdot\left\{\tfrac{x}{\delta}\right\} = x .$$
Similarly, we write the variance as:
$$\begin{aligned} \mathbb{E}\left[\left(Q_\delta(x) - x\right)^2\right] &= \mathbb{E}\left[\left(Q_\delta\left(\delta\left\{\tfrac{x}{\delta}\right\}\right) - \delta\left\{\tfrac{x}{\delta}\right\}\right)^2\right] \\ &= \left(1 - \left\{\tfrac{x}{\delta}\right\}\right)\cdot\left(\delta\left\{\tfrac{x}{\delta}\right\}\right)^2 + \left\{\tfrac{x}{\delta}\right\}\cdot\left(\delta - \delta\left\{\tfrac{x}{\delta}\right\}\right)^2 \\ &= \delta^2\left(\left(1 - \left\{\tfrac{x}{\delta}\right\}\right)\left\{\tfrac{x}{\delta}\right\}^2 + \left\{\tfrac{x}{\delta}\right\}\left(1 - \left\{\tfrac{x}{\delta}\right\}\right)^2\right) \\ &= \delta^2\left\{\tfrac{x}{\delta}\right\}\left(1 - \left\{\tfrac{x}{\delta}\right\}\right) . \end{aligned}$$
For the sparsity bound, we need to understand when a single scalar gets rounded to zero. We have that for x ∈ R,
$$\mathbb{P}\left[Q_\delta(x) = 0\right] = \begin{cases}1 - \frac{|x|}{\delta}, & |x| < \delta,\\ 0, & |x| \ge \delta,\end{cases}$$
which shows that
$$\mathbb{E}\left[\left\|Q_\delta(v)\right\|_0\right] = \sum_{i=1}^n\left(1 - \mathbb{P}\left[Q_\delta(v_i) = 0\right]\right) = \sum_{i=1}^n\min\left\{\frac{|v_i|}{\delta},\, 1\right\} \le \|v\|_1/\delta .$$
