ERQ: Error Reduction for Post-Training Quantization of Vision Transformers

Abstract (excerpt): ...However, existing methods typically overlook the intricate interdependence between quantized weight and activation, leading to considerable quantization error. In this paper, we propose ERQ, a two-step PTQ approach meticulously crafted to sequentially reduce the quantization error arising from activation and weight quantization. ERQ first introduces Activation quantization error reduction (Aqer), which strategically formulates the minimization of the activation quantization error as a Ridge Regression problem, tackling it by updating the weights [...]

[Figure 1: Overview of ERQ. The output error between W × x and (W + δW*) × (x + δx) is first reduced by Aqer via Ridge Regression; the weights are then quantized as [W^s + δW^s, W^r + δW^{r*}] by Wqer, which combines Rounding Refinement and Ridge Regression.]
[...] variant activation, the reparameterization technique (Li et al., 2023) and the power-of-two factor are employed. Additionally, evolutionary search methods (Frumkin et al., 2023) are utilized for determining unstable scale factors. Nevertheless, existing methods typically overlook the intricate interdependence between weight and activation quantization, leading to considerable quantization error when it comes to weight-activation quantization.

In this paper, we introduce ERQ, a two-step post-training quantization method tailored for ViTs that sequentially mitigates the quantization error induced by quantized activations and weights. As shown in Fig. 1, ERQ consists of two steps, i.e., Activation quantization error reduction (Aqer) followed by Weight quantization error reduction (Wqer). Aqer formulates a Ridge Regression problem to mitigate the quantization error induced by activation quantization, which can be solved in closed form via weight updating. Subsequently, Wqer is introduced to mitigate the quantization error caused by weight quantization in an iterative quantization-and-correction manner. In particular, at each iteration, we quantize the first half of the full-precision weights and mitigate the resulting quantization error by first performing Rounding Refinement and then again solving a Ridge Regression problem. The former derives an efficient proxy for the output error, which is used to refine the rounding directions of the quantized weights and thus lower the quantization error. The latter further mitigates the quantization error by updating the remaining full-precision weights. This process continues until all weights are accurately quantized.

The effectiveness of the proposed ERQ is demonstrated through extensive experiments on various ViT variants (ViT, DeiT, and Swin) and tasks (image classification, object detection, and instance segmentation). Notably, on the image classification task, ERQ outperforms GPTQ by 22.36% for W3A4 ViT-S.

2. Related Work

2.1. Vision Transformers (ViTs)

Inspired by the success of transformers in the natural language processing field, ViTs, by treating images as patch tokens, have emerged as a groundbreaking development in computer vision (Dosovitskiy et al., 2021). Addressing the dependency of ViTs on large datasets, DeiT (Touvron et al., 2021) showcases an efficient teacher-student training approach. Then, the Swin Transformer (Liu et al., 2021a) employs a hierarchical structure that integrates a shifted window-based self-attention mechanism, marking further improvements. The applications of ViTs have broadened considerably, including areas such as object detection (Carion et al., 2020; Zhu et al., 2020), image segmentation (Zheng et al., 2021), low-level image processing (Liang et al., 2021), video classification (Arnab et al., 2021), and medical imaging (Shamshad et al., 2023). However, ViTs are accompanied by substantial computational overhead and increased memory requirements, posing challenges for their deployment in resource-constrained environments (Mehta & Rastegari, 2022; Zhang et al., 2022).

2.2. Post-training Quantization for ViTs

Model quantization reduces the numerical precision of weights and activations to mitigate the computational and storage costs of neural networks (Krishnamoorthi, 2018). In contrast to quantization-aware training (QAT) (Li et al., 2022a; Li & Gu, 2023; Xijie Huang & Cheng, 2023), which involves the complete training data and compute-heavy retraining, post-training quantization (PTQ) operates on a tiny dataset with a reduced time overhead, attracting extensive attention (Banner et al., 2019). The unique architecture of ViTs, such as LayerNorm and the attention mechanism, calls for PTQ methods distinct from those used for convolutional neural networks (CNNs) (Li et al., 2021; Wei et al., 2022). Liu et al. (2021b) introduce the first PTQ method for ViTs: to maintain the order of softmax scores and to adapt to the varied quantization sensitivities of different layers, they respectively introduce a ranking loss and a nuclear norm-based mixed-precision scheme. FQ-ViT (Lin et al., 2022) introduces a fully-quantized method, which respectively designs Powers-of-Two Scale and Log-Int-Softmax for post-LayerNorm and post-Softmax activations. In PTQ4ViT (Yuan et al., 2022), a twin uniform quantizer is introduced to handle the long-tail post-Softmax activation and the uneven post-GELU activation. APQ-ViT (Ding et al., 2022) establishes a block-wise error reconstruction scheme and a Matthew-effect preserving quantizer for post-Softmax activation. In Evol-Q (Frumkin et al., 2023), an evolutionary search method is employed to search extremely sensitive quantization parameters. RepQ-ViT (Li et al., 2023) introduces a reparameterization technique to handle the high-variant post-LayerNorm activation, where the channel-wise quantizers are simplified to layer-wise quantizers. Also, a Log√2 quantizer is adopted to accommodate the post-Softmax activation. GPTQ (Frantar et al., 2022) employs OBS (Frantar & Alistarh, 2022) to progressively compensate for the weight quantization error by utilizing Hessian information. Despite sharing a certain similarity with GPTQ, our ERQ introduces Aqer to mitigate the error of quantized activations, while GPTQ does not quantize activations at all. Moreover, ERQ uses a derived proxy for the output error to refine the weight rounding, which has not been proposed before.

3. Preliminaries

Quantizers. For a fair comparison, our quantization settings are aligned with the earlier work (Li et al., 2023).
Specifically, we quantize the weights and activations of all matrix multiplications in ViTs. The channel-wise quantizer and the layer-wise quantizer are adopted for weights and activations, respectively. For weights and all activations except the post-Softmax activation, we adopt the uniform quantizer. Given full-precision values x and the bit-width b, the uniform quantizer is defined as:

    x̄ = Q_un(x, b) = s · (clip(⌊x/s⌉ + z, 0, 2^b − 1) − z),   (1)

where ⌊·⌉ denotes the round function, the clip function constrains the output to [0, 2^b − 1], the scale factor s is grid-searched by minimizing the error before and after quantization, and the zero-point z = −⌊min(x)/s⌉. For the long-tail post-Softmax activation, the log√2 quantizer (Li et al., 2023) is adopted:

    x̄ = Q_log√2(x, b) = s · 2^⌊−x_q/2⌋ · (1(x_q)(√2 − 1) + 1),   (2)
    x_q = clip(⌊−2 log₂(x/s)⌉, 0, 2^b − 1),   (3)

where 1(·) returns 0 for even numbers and 1 for odd numbers, and s is grid-searched by minimizing the error before and after quantization. All scale factors of the above-mentioned quantizers are determined on the calibration datasets.
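To make the two quantizers concrete, here is a minimal NumPy sketch of Eqs. 1-3. It assumes the scale factor s is already given (the paper grid-searches it on the calibration data); the function names and the explicit de-quantization step are ours:

    import numpy as np

    def uniform_quantize(x, s, b):
        """Uniform quantizer of Eq. 1: round onto the integer grid, clip, de-quantize."""
        z = -np.round(np.min(x) / s)                      # zero-point
        xq = np.clip(np.round(x / s) + z, 0, 2 ** b - 1)  # integer code in [0, 2^b - 1]
        return s * (xq - z)                               # de-quantized value x_bar

    def log_sqrt2_quantize(x, s, b):
        """Log-sqrt(2) quantizer of Eqs. 2-3; x is assumed positive (post-Softmax)."""
        xq = np.clip(np.round(-2 * np.log2(x / s)), 0, 2 ** b - 1)
        odd = xq % 2                                      # 1(x_q): 1 for odd, 0 for even
        return s * 2.0 ** np.floor(-xq / 2) * (odd * (np.sqrt(2) - 1) + 1)

As a quick check, an input x = s/√2 maps to x_q = 1 and is reconstructed exactly as s/√2 by Eq. 2.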
Objective. Denote the full-precision activation as x ∼ P(x), x ∈ R^{D_in}, and the weight as W ∈ R^{D_out×D_in}. Here, D_in and D_out are the input and output dimensions, respectively. The quantization errors induced by activation and weight quantization are denoted as δx = x̄ − x and δW = W̄ − W, respectively. For each layer, we aim to minimize the Mean Squared Error (MSE) before and after weight-activation quantization:

    L_MSE = E[∥Wx − W̄x̄∥₂²] = E[∥Wx − (W + δW)(x + δx)∥₂²].   (4)

Eq. 4 indicates that the output error is contributed by both the activation and the weight quantization error.

4. Method

The entangled δx and δW make it challenging to find an optimal solution of Eq. 4 (Li et al., 2021). To make it tractable, we relax Eq. 4 into two sequential sub-problems by respectively minimizing the error from quantized activations and from quantized weights. As shown in Fig. 1, we first perform Activation quantization error reduction (Aqer), followed by Weight quantization error reduction (Wqer), which are respectively detailed in the following.

4.1. Activation Quantization Error Reduction

To mitigate the error induced by activation quantization, we introduce Activation quantization error reduction (Aqer), which formulates the error mitigation as a Ridge Regression problem. Specifically, we keep the weights in full precision and solely consider the MSE caused by the activation quantization error δx:

    L_MSE = E[∥Wx − W(x + δx)∥₂²].   (5)

To minimize Eq. 5, we formulate a Ridge Regression problem, where the minimization is completed by adding an adjustment δW* to W:

    E[∥Wx − (W + δW*)(x + δx)∥₂²] + λ₁∥δW*∥₂²
    = E[∥−δW*(x + δx) − Wδx∥₂²] + λ₁∥δW*∥₂²
    = E[∥δW*x̄ + Wδx∥₂²] + λ₁∥δW*∥₂².   (6)

Here, δW* denotes the adjustment computed by Ridge Regression, x̄ = x + δx is the quantized input, λ₁∥δW*∥₂² acts as the regularization term, and λ₁ is a hyper-parameter that controls the intensity of the regularization. Eq. 6 constitutes the Ridge Regression problem. To minimize it, we first compute its gradient w.r.t. δW*:

    ∂/∂δW* ( E[∥δW*x̄ + Wδx∥₂²] + λ₁∥δW*∥₂² ) = E[2(δW*x̄ + Wδx)x̄^T] + 2λ₁δW*.   (7)

Then, we solve for δW* by setting Eq. 7 to zero:

    E[2(δW*x̄ + Wδx)x̄^T] + 2λ₁δW* = 0  ⇒  δW* = −W E[δx x̄^T] (E[x̄ x̄^T] + λ₁I)^{−1}.   (8)

The regularization term λ₁I ensures that the inverse of E[x̄ x̄^T] + λ₁I always exists, which is crucial for computational stability. In addition, it suppresses outliers, thereby mitigating overfitting and enhancing the model's generalizability. Suppressing outliers is also crucial for the subsequent weight quantization since it restricts the range of the weights. This restriction prevents the quantization points from being distributed in uncovered regions, thus enhancing the expressive ability of quantization (Li et al., 2020). In practice, given the calibration dataset, we estimate E[δx x̄^T] and E[x̄ x̄^T] using (1/N)Σ_{n=1}^N δx_n x̄_n^T and (1/N)Σ_{n=1}^N x̄_n x̄_n^T, respectively. Here, N = B × T ≫ D_in, where B is the size of the calibration dataset and T is the number of tokens of one image. Note that δx and x̄ are determined given the input and the quantization parameters. After obtaining δW*, we incorporate it into the network's weights by W = W + δW*. By doing so, the proposed Aqer explicitly absorbs the quantization error of the quantized activations into the weights.
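For concreteness, the closed form of Eq. 8 can be realized in a few lines of NumPy. This is a sketch under our conventions (calibration activations stacked row-wise; the name aqer_update is ours, not the paper's):

    import numpy as np

    def aqer_update(W, x, x_bar, lam1):
        """Fold the Ridge Regression adjustment dW* (Eq. 8) into the weight.

        W:     (D_out, D_in) full-precision weight.
        x:     (N, D_in) full-precision calibration activations, N = B x T.
        x_bar: (N, D_in) their quantized counterparts.
        """
        N, d_in = x.shape
        dx = x_bar - x                                   # activation quantization error
        E_dx_xbar = dx.T @ x_bar / N                     # estimate of E[dx x_bar^T]
        E_xbar_xbar = x_bar.T @ x_bar / N                # estimate of E[x_bar x_bar^T]
        # Eq. 8: dW* = -W E[dx x_bar^T] (E[x_bar x_bar^T] + lam1 I)^-1
        dW_star = -W @ E_dx_xbar @ np.linalg.inv(E_xbar_xbar + lam1 * np.eye(d_in))
        return W + dW_star                               # W <- W + dW*

The λ₁ term keeps the inverted matrix well-conditioned even when N is small, matching the stability argument above.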
4.2. Weight Quantization Error Reduction

After Aqer, we perform weight quantization and propose Weight quantization error reduction (Wqer) to mitigate the resulting quantization error. [...]
[...] from x̄, where x̄^s and x̄^r respectively contain the entries of x̄ corresponding to W^s_{i,:} and W^r_{i,:}. The quantization error of the quantized W^s_{i,:} is denoted as δW^s_{i,:} = W̄^s_{i,:} − W^s_{i,:}, and the resulting MSE is:

    L^i_MSE = E[∥[W^s_{i,:}, W^r_{i,:}][x̄^s, x̄^r] − [W^s_{i,:} + δW^s_{i,:}, W^r_{i,:}][x̄^s, x̄^r]∥₂²] = E[∥δW^s_{i,:} x̄^s∥₂²].   (10)

Here, W_{i,:} = [W^s_{i,:}, W^r_{i,:}] and x̄ = [x̄^s, x̄^r]. To mitigate Eq. 10, we first introduce Rounding Refinement, in which the rounding direction of the quantized weight is refined, i.e., δW^s_{i,:} is adjusted, to reduce E[∥δW^s_{i,:} x̄^s∥₂²] itself. Then, given E[∥δW^s_{i,:} x̄^s∥₂²] after Rounding Refinement, we formulate a Ridge Regression problem to further mitigate it by adjusting W^r_{i,:}.

Algorithm 1 (excerpt, lines 9-25):
     9:  while 0 ≤ T--:
    10:      Calculate proxy L_old with δW^s_{i,:} by Eq. 12
    11:      Calculate gradients G_{δW^s_{i,:}} by Eq. 14
    12:      Obtain S by Eq. 15
    13:      Obtain adjusted δW^{s'}_{i,:} by Eq. 13
    14:      Calculate proxy L_now with δW^{s'}_{i,:} by Eq. 12
    15:      if L_now > L_old: break
    16:      δW^s_{i,:} = δW^{s'}_{i,:}
    17:  W̄_{i,:} ← W̄_{i,:} ∪ (W^s_{i,:} + δW^s_{i,:})
    18:  /* END Rounding Refinement */
    19:  /* Ridge Regression */
    20:  Calculate δW^{r*}_{i,:} by Eq. 17
    21:  W^r_{i,:} ← W^r_{i,:} + δW^{r*}_{i,:}
    22:  /* END Ridge Regression */
    23:  {x̄_n}_{n=1}^N ← {x̄^r_n}_{n=1}^N
    24:  W̄ ← W̄ ∪ W̄_{i,:}
    25:  Output: Quantized weight W̄
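Reading Alg. 1 linearly, one Wqer pass over a single output channel can be sketched as below. This is a schematic skeleton under our own naming: quantize_row stands in for the quantizer of Eq. 1, while rounding_refinement and ridge_correction correspond to Eqs. 12-15 and Eq. 17 and are sketched later; all three are assumed to be supplied with their hyper-parameters (scale, k, λ₂) already bound, e.g. via functools.partial.

    import numpy as np

    def wqer_row(w, x_bar, quantize_row, rounding_refinement, ridge_correction):
        """Iteratively quantize a weight row w (D_in,) against activations x_bar (N, D_in)."""
        quantized = []
        while w.size > 0:
            half = max(w.size // 2, 1)              # split: first half W^s, rest W^r
            w_s, w_r = w[:half], w[half:]
            xs, xr = x_bar[:, :half], x_bar[:, half:]
            dw_s = quantize_row(w_s) - w_s          # initial rounding error dW^s
            dw_s = rounding_refinement(dw_s, xs)    # lines 9-16: refine rounding
            quantized.append(w_s + dw_s)            # line 17: commit quantized half
            if w_r.size > 0:
                w_r = ridge_correction(w_r, dw_s, xs, xr)  # lines 20-21: Eq. 17
            w, x_bar = w_r, xr                      # line 23: shrink to the remainder
        return np.concatenate(quantized)

With this halving schedule, the loop runs roughly log₂(D_in) times per row.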
[Figure 2: Distribution of channel 10 of the input to blocks.6.mlp.fc1 — full-precision vs. 5-bit and 4-bit quantized.]

[Figure 3: The proxy of Eq. 12 versus the true E[∥δW^s_{i,:} x̄^s∥₂²].]
[...] Then, Eq. 11 becomes:

    (E[δW^s_{i,:} x̄^s])² + Var[δW^s_{i,:} x̄^s]
    = δW^s_{i,:} µ^s µ^{sT} (δW^s_{i,:})^T + δW^s_{i,:} Σ^s (δW^s_{i,:})^T
    = δW^s_{i,:} (µ^s µ^{sT} + Σ^s) (δW^s_{i,:})^T.   (12)

Here, Eq. 12, in which µ^s and Σ^s denote the mean and covariance of x̄^s, is the obtained proxy for E[∥δW^s_{i,:} x̄^s∥₂²]. In practice, we estimate the empirical µ̂^s and Σ̂^s on the given calibration dataset. Note that µ̂^s and Σ̂^s are shared across all output channels and require only a single computation. Fig. 3 presents the relationship between the proxy and E[∥δW^s_{i,:} x̄^s∥₂²]: the proposed proxy is proportional to the real value, demonstrating its fidelity. The computational complexity of using our proxy is [...]

To determine S, we first take the derivative of the proxy (Eq. 12) w.r.t. δW^s_{i,:}:

    G_{δW^s_{i,:}} = ∂( δW^s_{i,:} (µ^s µ^{sT} + Σ^s)(δW^s_{i,:})^T ) / ∂δW^s_{i,:} = 2 δW^s_{i,:} (µ^s µ^{sT} + Σ^s).   (14)

We only select elements whose gradient has the same sign as the error itself, since this is the only way to allow an overturn. For example, given δW^s_{i,j} = δW^↓_{i,j}, replacing it with δW^↑_{i,j} is feasible only if G_{δW^s_{i,j}} has the same sign as δW^s_{i,j}. Thus, the index set S is defined as:

    S = topk_index(M),   M = |G_{δW^s_{i,:}} ⊙ 1(G_{δW^s_{i,:}} ⊙ δW^s_{i,:})| ∈ R^{D_in},   (15)

where 1(·) here denotes the element-wise indicator that is 1 for positive entries and 0 otherwise.
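The following NumPy sketch puts Eqs. 12, 14, and 15 together into the refinement loop of Alg. 1 (lines 9-16). Since Eq. 13 itself is not reproduced in this excerpt, we assume an overturn flips the rounding by one quantization step s of Eq. 1; that assumption, and all names, are ours:

    import numpy as np

    def rounding_refinement(dw_s, xs, step, k=1, max_iter=100):
        """Refine the rounding error dw_s (D_s,) given quantized activations xs (N, D_s)."""
        mu = xs.mean(axis=0)                         # empirical mu^s
        Sigma = np.cov(xs, rowvar=False)             # empirical Sigma^s
        A = np.outer(mu, mu) + Sigma                 # mu mu^T + Sigma, shared per layer

        def proxy(d):
            return d @ A @ d                         # Eq. 12: proxy for E[(d . xs_n)^2]

        for _ in range(max_iter):
            grad = 2.0 * dw_s @ A                    # Eq. 14
            M = np.abs(grad) * ((grad * dw_s) > 0)   # Eq. 15: same-sign entries only
            idx = np.argsort(M)[-k:]
            idx = idx[M[idx] > 0]                    # keep only admissible overturns
            if idx.size == 0:
                break
            cand = dw_s.copy()
            cand[idx] -= step * np.sign(cand[idx])   # assumed Eq. 13: flip by one step
            if proxy(cand) > proxy(dw_s):            # line 15: stop if the proxy worsens
                break
            dw_s = cand
        return dw_s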
Table 2. Results on the ImageNet dataset. The top-1 accuracy (%) is reported as the metric. "W/A" indicates that the bit-widths of the weight and activation are W and A bits, respectively. "*" indicates results reproduced using the official code. More results are provided in the appendix.
[...] the overturn is performed with Eq. 13. The above process iterates until the adjusted δW^s_{i,:} incurs a larger proxy value or the maximum number of iterations is reached. After obtaining δW^s_{i,:}, the quantization is completed by W̄^s_{i,:} = W^s_{i,:} + δW^s_{i,:}. W̄^s_{i,:} is then added into the set of quantized weights. The overall process of Rounding Refinement is presented in Lines 7-18 of Alg. 1. As shown in Tab. 1, Rounding Refinement reduces the time cost by 150×, from 10 hours to 4 minutes, at the cost of an affordable accuracy loss.

4.2.2. Ridge Regression

After Rounding Refinement, we adjust W^r_{i,:} with δW^{r*}_{i,:} to further counteract E[∥δW^s_{i,:} x̄^s∥₂²], which yields the following target:

    E[∥δW^s_{i,:} x̄^s + δW^{r*}_{i,:} x̄^r∥₂²] + λ₂∥δW^{r*}_{i,:}∥₂²,   (16)

where λ₂ is a hyper-parameter that controls the intensity of the regularization term λ₂∥δW^{r*}_{i,:}∥₂². The minimization of Eq. 16 forms a Ridge Regression problem whose solution is:

    δW^{r*}_{i,:} = −δW^s_{i,:} E[x̄^s x̄^{rT}] (E[x̄^r x̄^{rT}] + λ₂I)^{−1}.   (17)

In practice, we estimate E[x̄^s x̄^{rT}] and E[x̄^r x̄^{rT}] using (1/N)Σ_{n=1}^N x̄^s_n x̄^{rT}_n and (1/N)Σ_{n=1}^N x̄^r_n x̄^{rT}_n, respectively. Afterward, W^r_{i,:} = W^r_{i,:} + δW^{r*}_{i,:} mitigates the error. At this point, W^r_{i,:} still remains in full precision and will be processed in the next iteration.
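A matching NumPy sketch of Eq. 17, again with row-stacked calibration activations and our own naming (this completes the ridge_correction helper used in the skeleton above, with λ₂ bound by the caller):

    import numpy as np

    def ridge_correction(w_r, dw_s, xs, xr, lam2):
        """Eq. 17: let the remaining weights w_r (D_r,) absorb the residual error.

        dw_s: (D_s,) refined rounding error; xs: (N, D_s); xr: (N, D_r).
        """
        N, d_r = xr.shape
        E_sr = xs.T @ xr / N                          # estimate of E[xs xr^T]
        E_rr = xr.T @ xr / N                          # estimate of E[xr xr^T]
        # dW^{r*} = -dW^s E[xs xr^T] (E[xr xr^T] + lam2 I)^-1
        dw_r_star = -dw_s @ E_sr @ np.linalg.inv(E_rr + lam2 * np.eye(d_r))
        return w_r + dw_r_star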
Such a process runs continuously until all weights are accurately quantized. The proposed Rounding Refinement and Ridge Regression collectively form Wqer, whose overall process is presented in Alg. 1. In practice, we perform Wqer for multiple output channels in parallel.

5. Experiments

5.1. Implementation details

Models and datasets. We conduct extensive experiments on image classification, object detection, and instance segmentation. For the image classification task, we evaluate ERQ on the ImageNet dataset (Russakovsky et al., 2015) with different ViT variants including ViT (Dosovitskiy et al., 2021), DeiT (Touvron et al., 2021), and Swin (Liu et al., 2021a). As for the object detection and instance segmentation tasks, we evaluate ERQ on the COCO dataset (Lin et al., 2014) with Mask R-CNN (He et al., 2017) and Cascade Mask R-CNN (Cai & Vasconcelos, 2018), both using Swin (Liu et al., 2021a) as their backbone.

Implementation details. Consistent with the previous study (Li et al., 2023), we randomly select 32 images from ImageNet and 1 image from COCO as calibration data. The quantization parameters are determined by forwarding the calibration datasets, and the reparameterization technique is used to initialize the activation quantizer as in (Li et al., 2023). In our experiments, the k and the maximum number of iterations of Rounding Refinement are set to 1 and 100, respectively. We use pulp (a CPU-only LP modeler written in Python) to solve the MIQP. For the image classification task, we set λ₁ = λ₂ = 1e4 for ViT, DeiT-S, DeiT-B, and Swin, and λ₁ = λ₂ = 1e3 for DeiT-T. For the detection and segmentation tasks, we set λ₁ = λ₂ = 1e5 for all models. All experiments are implemented [...]

5.2. Results on ImageNet Dataset

The comparison between ERQ and other PTQ methods for ViTs is presented in Tab. 2. The proposed ERQ shows advantages over the compared methods at all bit-width settings, especially in the low-bit cases. Specifically, due to the small size of the calibration dataset, many methods suffer from overfitting and exhibit unstable performance. For instance, in the W3A4 case, QDrop and PD-Quant obtain 9.77% and 4.56% top-1 accuracy on ViT-S, respectively. In contrast, the proposed ERQ shows stable improvements across all variants. Notably, ERQ demonstrates 22.36% and 9.25% performance gains on ViT-S and ViT-B; 1.84%, 8.68%, and 8.58% gains on DeiT-T, DeiT-S, and DeiT-B; and 1.61% and 1.45% gains on Swin-S and Swin-B. When it comes to the W4A4 case, ERQ respectively obtains 1.32% and 1.51% performance gains on ViT-S and ViT-B; 1.33%, 1.71%, and 2.13% gains on DeiT-T, DeiT-S, and DeiT-B; and 0.57% and 1.36% gains on Swin-S and Swin-B. For the W5A5 case, ERQ also presents the best performance; for example, it improves accuracy by 1.28% on Swin-B.

5.3. Results on COCO Dataset

The results of object detection and instance segmentation are reported in Tab. 3. ERQ improves performance in most cases. For instance, ERQ increases the box AP and mask AP by 0.5 and 0.3, respectively, for W4A4 Mask R-CNN with Swin-T as its backbone. Also, ERQ increases the box AP and mask AP by 0.8 and 0.6 for W4A4 Cascade Mask R-CNN with Swin-T as its backbone. These results further demonstrate the effectiveness and generalization ability of the proposed ERQ.

[Caption fragment: Effect of λ₁ and λ₂ on W... (remainder truncated).]