
ERQ: Error Reduction for Post-Training Quantization of Vision Transformers

Yunshan Zhong 1 2 Jiawei Hu 2 3 You Huang 2 3 Yuxin Zhang 2 3 Rongrong Ji 1 2 3 4 *

arXiv:2407.06794v1 [cs.CV] 9 Jul 2024

Abstract

Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models. However, existing methods typically overlook the intricate interdependence between quantized weight and activation, leading to considerable quantization error. In this paper, we propose ERQ, a two-step PTQ approach meticulously crafted to sequentially reduce the quantization error arising from activation and weight quantization. ERQ first introduces Activation quantization error reduction (Aqer), which strategically formulates the minimization of the activation quantization error as a Ridge Regression problem and tackles it by updating the weights at full precision. Subsequently, ERQ introduces Weight quantization error reduction (Wqer), which adopts an iterative approach to mitigate the quantization error induced by weight quantization. In each iteration, an empirically derived, efficient proxy is employed to refine the rounding directions of the quantized weights, coupled with a Ridge Regression solver to curtail the weight quantization error. Experimental results attest to the effectiveness of our approach. Notably, ERQ surpasses the state-of-the-art GPTQ by 22.36% in accuracy for W3A4 ViT-S.

[Figure 1. Illustration of the proposed ERQ. Top, Activation quantization error reduction (Aqer): the error between (W + δW*)(x + δx) and Wx is reduced via Ridge Regression. Bottom, Weight quantization error reduction (Wqer): the error between [W^s + δW^s, W^r + δW^{r*}] applied to [x̄^s, x̄^r] and Wx is reduced via Rounding Refinement (for δW^s) and Ridge Regression (for δW^{r*}).]

1. Introduction

Vision Transformers (ViTs) (Dosovitskiy et al., 2021) have significantly challenged convolutional neural networks (CNNs), emerging as a new paradigm in the field of computer vision. ViTs leverage multi-head self-attention (MHSA) mechanisms to capture long-range relationships among image patches, demonstrating impressive progress in a variety of vision tasks (Touvron et al., 2021; Carion et al., 2020; Zhu et al., 2020; Arnab et al., 2021).

However, great power comes with considerable complexity. The inherent architectural intricacies of ViTs result in high computational demands and substantial memory requirements, posing challenges for deployment in resource-constrained environments (Tang et al., 2022; Hou & Kung, 2022; Zheng et al., 2023). To mitigate this dilemma, model quantization has garnered sustained attention from both industry and academia (Krishnamoorthi, 2018). Quantization reduces model complexity by enabling low-bit representations of weights and activations, offering a promising pathway for efficient deployment. Recently, researchers have been gravitating towards post-training quantization (PTQ) for ViTs, which seeks to quantize models with a tiny calibration dataset and minor costs (Li et al., 2022b; Lin et al., 2022; Liu et al., 2023b; Frumkin et al., 2023).

Various PTQ methods have been explored to accommodate the ViTs' unique structure. For example, for handling the long-tail post-Softmax activation, the log2/log√2 quantizer (Li et al., 2023; Lin et al., 2022) and the twin uniform quantizer (Yuan et al., 2022) are introduced.

1 Institute of Artificial Intelligence, Xiamen University. 2 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University. 3 Department of Artificial Intelligence, School of Informatics, Xiamen University. 4 Peng Cheng Laboratory. Correspondence to: Rongrong Ji <[email protected]>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).
To manage the high-variant activation, the reparameterization technique (Li et al., 2023) and the power-of-two factor are employed. Additionally, evolutionary search methods (Frumkin et al., 2023) are utilized to determine unstable scale factors. Nevertheless, existing methods typically overlook the intricate interdependence between weight and activation quantization, leading to considerable quantization error when it comes to weight-activation quantization.

In this paper, we introduce ERQ, a two-step post-training quantization method tailored for ViTs that sequentially mitigates the quantization error induced by quantized activations and weights. As shown in Fig. 1, ERQ consists of two steps, i.e., Activation quantization error reduction (Aqer) followed by Weight quantization error reduction (Wqer). Aqer formulates a Ridge Regression problem to mitigate the quantization error induced by activation quantization, which can be solved with a closed-form solution via weight updating. Subsequently, Wqer is introduced to mitigate the quantization error caused by weight quantization in an iterative quantization-and-correction manner. In particular, at each iteration, we quantize the first half of the full-precision weight and mitigate the resulting quantization error by first performing Rounding Refinement and then again solving a Ridge Regression problem. The former derives an efficient proxy for the output error, which is used to refine the rounding directions of the quantized weight to lower the quantization error. The latter further mitigates the quantization error by updating the remaining full-precision weight. This process continues until all weights are accurately quantized.

The effectiveness of the proposed ERQ is demonstrated by extensive experiments on various ViT variants (ViT, DeiT, and Swin) and tasks (image classification, object detection, and instance segmentation). Notably, on the image classification task, ERQ outperforms GPTQ by 22.36% for W3A4 ViT-S.

2. Related Work

2.1. Vision Transformers (ViTs)

Inspired by the success of transformers in the natural language processing field, ViTs, by treating images as patch tokens, have emerged as a groundbreaking development in computer vision (Dosovitskiy et al., 2021). Addressing the dependency of ViTs on large datasets, DeiT (Touvron et al., 2021) showcases an efficient teacher-student training approach. Then, Swin Transformers (Liu et al., 2021a) employ a hierarchical structure that integrates a shifted window-based self-attention mechanism, marking further improvements. The applications of ViTs have broadened considerably, including areas such as object detection (Carion et al., 2020; Zhu et al., 2020), image segmentation (Zheng et al., 2021), low-level image processing (Liang et al., 2021), video classification (Arnab et al., 2021), and medical imaging (Shamshad et al., 2023). However, ViTs are accompanied by substantial computational overhead and increased memory requirements, posing challenges for their deployment in resource-constrained environments (Mehta & Rastegari, 2022; Zhang et al., 2022).

2.2. Post-training Quantization for ViTs

Model quantization reduces the numerical precision of weights and activations to mitigate the computational and storage costs of neural networks (Krishnamoorthi, 2018). In contrast to quantization-aware training (QAT) (Li et al., 2022a; Li & Gu, 2023; Xijie Huang & Cheng, 2023), which involves complete training data and compute-heavy retraining, post-training quantization (PTQ) operates on a tiny dataset with a reduced time overhead, harvesting extensive attention (Banner et al., 2019). The unique architecture of ViTs, such as LayerNorm and the attention mechanism, calls for distinct PTQ methods compared to those used for convolutional neural networks (CNNs) (Li et al., 2021; Wei et al., 2022). Liu et al. (2021b) introduce the first PTQ method for ViTs. To maintain the order of softmax scores and to adapt to the varying quantization sensitivities of different layers, they respectively introduce a ranking loss and a nuclear norm-based mixed-precision scheme. FQ-ViT (Lin et al., 2022) introduces a fully-quantized method, which respectively designs Powers-of-Two Scale and Log-Int-Softmax for post-LayerNorm and post-Softmax activation. In PTQ4ViT (Yuan et al., 2022), a twin uniform quantizer is introduced to handle the long-tail post-Softmax activation and the uneven post-GELU activation. APQ-ViT (Ding et al., 2022) establishes a block-wise error reconstruction and a Matthew-effect preserving quantizer for post-Softmax activation. In Evol-Q (Frumkin et al., 2023), an evolutionary search method is employed to search extremely sensitive quantization parameters. RepQ-ViT (Li et al., 2023) introduces a reparameterization technique to handle the high-variant post-LayerNorm activation, where the channel-wise quantizers are simplified to layer-wise quantizers; a log√2 quantizer is also adopted to accommodate the post-Softmax activation. GPTQ (Frantar et al., 2022) employs OBS (Frantar & Alistarh, 2022) to progressively compensate for the weight quantization error by utilizing Hessian information. Despite sharing a certain similarity with GPTQ, our ERQ introduces Aqer to mitigate the error of quantized activations, whereas GPTQ does not quantize activations. Moreover, ERQ uses a derived proxy for the output error to refine weight rounding, which has not been proposed before.

3. Preliminaries

Quantizers. For a fair comparison, our quantization settings are aligned with the earlier work (Li et al., 2023).
Specifically, we quantize the weights and activations of all matrix multiplications in ViTs. A channel-wise quantizer and a layer-wise quantizer are adopted for weights and activations, respectively. For weights and all activations except the post-Softmax activation, we adopt the uniform quantizer. Given full-precision values x and the bit-width b, the uniform quantizer is defined as:

x̄ = Q_un(x, b) = s · (clip(⌊x/s⌉ + z, 0, 2^b − 1) − z),    (1)

where ⌊·⌉ denotes the round function, the clip function constrains its argument to the range [0, 2^b − 1], the scale factor s is grid-searched by minimizing the error before and after quantization, and the zero-point z = ⌊−min(x)/s⌉. For the long-tail post-Softmax activation, the log√2 quantizer (Li et al., 2023) is adopted:

x̄ = Q_log√2(x, b) = s · 2^⌊−x_q/2⌋ · (1(x_q)(√2 − 1) + 1),    (2)
x_q = clip(⌊−2 log₂(x/s)⌉, 0, 2^b − 1),    (3)

where 1(·) returns 0 for even numbers and 1 for odd numbers, and s is grid-searched by minimizing the error before and after quantization. All scale factors of the above-mentioned quantizers are determined on the calibration dataset.

Objective. Denote the full-precision activation as x ∼ P(x), x ∈ R^{D_in}, and the weight as W ∈ R^{D_out×D_in}. Here, D_in and D_out are the input and output dimensions, respectively. The quantization errors induced by activation and weight quantization are denoted as δx = x̄ − x and δW = W̄ − W, respectively. For each layer, we aim to minimize the Mean Squared Error (MSE) before and after weight-activation quantization:

L_MSE = E[∥Wx − W̄x̄∥²₂] = E[∥Wx − (W + δW)(x + δx)∥²₂].    (4)

Eq. 4 indicates that the output error is contributed by both the activation and the weight quantization error.

4. Method

The entangled δx and δW make it a challenge to find an optimal solution for Eq. 4 (Li et al., 2021). To make it tractable, we relax Eq. 4 into two sequential sub-problems by respectively minimizing the error from quantized activations and from quantized weights. As shown in Fig. 1, we first perform Activation quantization error reduction (Aqer) followed by Weight quantization error reduction (Wqer), which are respectively detailed in the following.

4.1. Activation Quantization Error Reduction

To mitigate the error induced by activation quantization, we introduce Activation quantization error reduction (Aqer), which formulates the error mitigation as a Ridge Regression problem. Specifically, we retain the weight at full precision and solely consider the MSE caused by the activation quantization error δx:

L_MSE = E[∥Wx − W(x + δx)∥²₂].    (5)

To minimize Eq. 5, we formulate a Ridge Regression problem, where the minimization is completed by adding an adjustment δW* to W:

E[∥Wx − (W + δW*)(x + δx)∥²₂] + λ₁∥δW*∥²₂
= E[∥−δW*(x + δx) − Wδx∥²₂] + λ₁∥δW*∥²₂    (6)
= E[∥δW*x̄ + Wδx∥²₂] + λ₁∥δW*∥²₂.

Here, δW* denotes the adjustment computed by Ridge Regression, x̄ = x + δx is the quantized input, λ₁∥δW*∥²₂ acts as the regularization term, and λ₁ is a hyper-parameter that controls the intensity of the regularization. Eq. 6 constitutes the Ridge Regression problem. To minimize it, we first compute its gradient w.r.t. δW*:

∂/∂δW* E[∥δW*x̄ + Wδx∥²₂ + λ₁∥δW*∥²₂] = E[2(δW*x̄ + Wδx)x̄ᵀ] + 2λ₁δW*.    (7)

Then, we solve for δW* by setting Eq. 7 to zero:

E[2(δW*x̄ + Wδx)x̄ᵀ] + 2λ₁δW* = 0  ⇒  δW* = −W E[δx x̄ᵀ](E[x̄x̄ᵀ] + λ₁I)⁻¹.    (8)

The regularization term λ₁I ensures that the inverse of E[x̄x̄ᵀ] + λ₁I always exists, which is crucial for computational stability. In addition, it suppresses outliers, thereby mitigating overfitting and enhancing the model's generalizability. Suppressing outliers is also crucial for the subsequent weight quantization, since it restricts the range of the weight. This restriction prevents the quantization points from being distributed in uncovered regions, thus enhancing the expressive ability of quantization (Li et al., 2020). In practice, given the calibration dataset, we estimate E[δx x̄ᵀ] and E[x̄x̄ᵀ] using (1/N)Σ_{n=1}^{N} δx_n x̄_nᵀ and (1/N)Σ_{n=1}^{N} x̄_n x̄_nᵀ, respectively. Here, N = B × T >> D_in, where B is the size of the calibration dataset and T is the number of tokens per image. Note that δx and x̄ are determined given the input and the quantization parameters. After obtaining δW*, we incorporate it into the network's weight by W = W + δW*. By doing so, the proposed Aqer explicitly mitigates the quantization error from the quantized activation into the weight.
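For concreteness, the following is a minimal PyTorch sketch of the closed-form Aqer update of Eq. 8. It is our illustration rather than the released implementation; the function name, tensor layout, and the use of an explicit matrix inverse are assumptions.

```python
import torch

def aqer_update(W, x_fp, x_quant, lam1=1e4):
    """Sketch of Aqer (Eq. 8): closed-form Ridge Regression update of the
    full-precision weight that absorbs the activation quantization error.

    W       : (D_out, D_in) full-precision weight of the layer.
    x_fp    : (N, D_in) full-precision calibration activations (N = B * T tokens).
    x_quant : (N, D_in) the same activations after quantization (x̄ = x + δx).
    lam1    : regularization strength λ1.
    """
    N = x_fp.shape[0]
    dx = x_quant - x_fp                                    # δx, (N, D_in)
    # Empirical estimates of E[δx x̄ᵀ] and E[x̄ x̄ᵀ] over the calibration tokens.
    E_dx_xq = dx.T @ x_quant / N                           # (D_in, D_in)
    E_xq_xq = x_quant.T @ x_quant / N                      # (D_in, D_in)
    D_in = W.shape[1]
    reg = lam1 * torch.eye(D_in, dtype=W.dtype, device=W.device)
    # δW* = -W E[δx x̄ᵀ] (E[x̄ x̄ᵀ] + λ1 I)^(-1)
    dW_star = -W @ E_dx_xq @ torch.linalg.inv(E_xq_xq + reg)
    return W + dW_star                                     # updated full-precision weight
```

In practice one would likely prefer torch.linalg.solve (or a Cholesky solve) over an explicit inverse for numerical stability; the updated weight is what subsequently enters the weight quantization step of Wqer.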
4.2. Weight Quantization Error Reduction

After Aqer, we perform weight quantization and propose Weight quantization error reduction (Wqer) to mitigate the resulting quantization error. Here, the target is defined as:

L_MSE = E[∥Wx̄ − (W + δW)x̄∥²₂] = Σ_{i}^{D_out} L_MSE^i = Σ_{i}^{D_out} E[∥W_{i,:}x̄ − (W_{i,:} + δW_{i,:})x̄∥²₂].    (9)

Note that after Aqer, the activation is quantized. Eq. 9 indicates that the minimization across output channels operates independently. Consequently, we analyze the minimization of each L_MSE^i individually. Simultaneously quantizing the entire full-precision weight yields unrecoverable quantization error (Frantar et al., 2022). Thus, we adopt an iterative quantization-and-correction manner to gradually minimize the quantization error caused by weight quantization (Zhou et al., 2017). In each iteration, the first half of the unquantized weights is quantized, followed by a mitigation of the resulting quantization error. Specifically, we begin with the current full-precision weight W_{i,:} and the corresponding x̄. We then partition W_{i,:} into two segments: the first half, W^s_{i,:} ∈ R^{1×D^s_in}, is designated for quantization, while the remaining part, W^r_{i,:} ∈ R^{1×D^r_in}, is retained at full precision. Correspondingly, we derive x̄^s ∈ R^{D^s_in} and x̄^r ∈ R^{D^r_in} from x̄, where x̄^s and x̄^r respectively contain the rows of x̄ corresponding to W^s_{i,:} and W^r_{i,:}. The quantization error of the quantized W^s_{i,:} is denoted as δW^s_{i,:} = W̄^s_{i,:} − W^s_{i,:}, and the resulting MSE is:

L_MSE^i = E[∥[W^s_{i,:}, W^r_{i,:}][x̄^s, x̄^r] − [W^s_{i,:} + δW^s_{i,:}, W^r_{i,:}][x̄^s, x̄^r]∥²₂] = E[∥δW^s_{i,:} x̄^s∥²₂].    (10)

Here, W_{i,:} = [W^s_{i,:}, W^r_{i,:}] and x̄ = [x̄^s, x̄^r]. To mitigate Eq. 10, we first introduce Rounding Refinement, in which the rounding direction of the quantized weight is refined, i.e., δW^s_{i,:} is adjusted, to reduce E[∥δW^s_{i,:} x̄^s∥²₂] itself. Then, given E[∥δW^s_{i,:} x̄^s∥²₂] after Rounding Refinement, we formulate a Ridge Regression problem to further mitigate it by adjusting W^r_{i,:}.

Algorithm 1 Weight Quantization Error Reduction
 1: Input: W, W̄ = ∅, {x̄_n}_{n=1}^N, maximum iteration T
 2: for i in range(D_out):
 3:     W̄_{i,:} = ∅, reset {x̄_n}_{n=1}^N to the full calibration inputs
 4:     while |W_{i,:}| > 0:
 5:         Partition W_{i,:} into [W^s_{i,:}, W^r_{i,:}]
 6:         Partition {x̄_n}_{n=1}^N into {[x̄^s_n, x̄^r_n]}_{n=1}^N
 7:         /* Rounding Refinement */
 8:         Obtain μ̂^s μ̂^{sT} + Σ̂^s from cache or calculate it with {x̄^s_n}_{n=1}^N; calculate δW^{s↓}_{i,:}, δW^{s↑}_{i,:} with W^s_{i,:}
 9:         while 0 ≤ T--:
10:             Calculate proxy L_old with δW^s_{i,:} by Eq. 12
11:             Calculate gradients G_{δW^s_{i,:}} by Eq. 14
12:             Obtain S by Eq. 15
13:             Obtain the adjusted δW^s_{i,:} by Eq. 13
14:             Calculate proxy L_now with the adjusted δW^s_{i,:} by Eq. 12
15:             if L_now > L_old: break
16:         Keep the δW^s_{i,:} that attains the lowest proxy value
17:         W̄_{i,:} ← W̄_{i,:} ∪ (W^s_{i,:} + δW^s_{i,:})
18:         /* END Rounding Refinement */
19:         /* Ridge Regression */
20:         Calculate δW^{r*}_{i,:} by Eq. 17
21:         W^r_{i,:} ← W^r_{i,:} + δW^{r*}_{i,:}
22:         /* END Ridge Regression */
23:         {x̄_n}_{n=1}^N ← {x̄^r_n}_{n=1}^N
24:     W̄ ← W̄ ∪ W̄_{i,:}
25: Output: Quantized weight W̄

Table 1. Results of W4A4 DeiT-S with different methods for minimizing E[∥δW^s_{i,:} x̄^s∥²₂]. "Baseline" indicates that only calibration is performed and no error reduction is involved.

Method | Time costs | Acc. (%)
Baseline | ∼1 minute | 68.41
+ MIQP w/o Proxy | ∼130 hours | 69.67
+ MIQP w/ Proxy | ∼10 hours | 69.55
+ Rounding Refinement | ∼4 minutes | 69.24

4.2.1. Rounding Refinement

At first, we aim to adjust the rounding directions of the quantized weights to minimize E[∥δW^s_{i,:} x̄^s∥²₂]. Specifically, for the j-th value in W^s_{i,:}, denoted as W^s_{i,j}, quantization involves rounding it either to the floor or to the ceiling (Nagel et al., 2020a). Thereby, the quantization error for W^s_{i,j}, denoted as δW^s_{i,j}, can be represented as either δW^{s↓}_{i,j} or δW^{s↑}_{i,j}. Here, δW^{s↓}_{i,j} = W^s_{i,j} − Q_un↓(W^s_{i,j}, b) > 0 denotes the error of the rounding-to-floor strategy, and δW^{s↑}_{i,j} = W^s_{i,j} − Q_un↑(W^s_{i,j}, b) < 0 denotes the error of the rounding-to-ceil strategy, where ↓/↑ denote replacing ⌊·⌉ in Eq. 1 with ⌊·⌋/⌈·⌉. The selection of δW^s_{i,:} is an NP-hard problem, whose solution can be searched by a mixed-integer quadratic program (MIQP) (Pia et al., 2017; Kuzmin et al., 2023). However, the high computational complexity of E[∥δW^s_{i,:} x̄^s∥²₂] makes it a challenge to find the solution within reasonable time costs. As shown in Tab. 1, using E[∥δW^s_{i,:} x̄^s∥²₂] as the target of the MIQP consumes a prohibitive ∼130 hours.
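To make the two rounding choices concrete, below is a minimal PyTorch sketch (our illustration, not the paper's code) of the candidate errors δW^{s↓} and δW^{s↑} together with the round-to-nearest initialization used later by Rounding Refinement; it assumes the standard affine dequantization s·(q − z) underlying Eq. 1.

```python
import torch

def rounding_error_candidates(w, s, z, b):
    """Candidate rounding errors for one weight segment W^s_{i,:} under Eq. 1.

    w : (D,) full-precision weights; s : scale; z : zero-point; b : bit-width.
    Returns (dw_nearest, dw_floor, dw_ceil), each of shape (D,).
    """
    lo, hi = 0, 2 ** b - 1

    def dequant(q):
        # Affine dequantization (assumed): clip the integer code, subtract z, rescale.
        return s * (torch.clamp(q, lo, hi) - z)

    q = w / s + z
    w_floor = dequant(torch.floor(q))    # rounding-to-floor candidate  (δW^{s↓} = w - w_floor ≥ 0)
    w_ceil = dequant(torch.ceil(q))      # rounding-to-ceil candidate   (δW^{s↑} = w - w_ceil  ≤ 0)
    w_near = dequant(torch.round(q))     # round-to-nearest initialization
    return w - w_near, w - w_floor, w - w_ceil
```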
Efficient Proxy. Therefore, we aim to find an efficient proxy for E[∥δW^s_{i,:} x̄^s∥²₂]. First, we re-write E[∥δW^s_{i,:} x̄^s∥²₂] as:

E[∥δW^s_{i,:} x̄^s∥²₂] ≜ (E[δW^s_{i,:} x̄^s])² + Var[δW^s_{i,:} x̄^s].    (11)

[Figure 2. Distribution of activation on different channels (channels 10 and 978 of the input of blocks.6.mlp.fc1; full-precision vs. quantized 5-bit vs. quantized 4-bit). Results are extracted from DeiT-S with 32 images.]

[Figure 3. Relationship between the proxy and E[∥δW^s_{i,:} x̄^s∥²₂]: the proxy values are positively correlated with the real values.]

Here, ≜ indicates utilizing E[Z²] = (E[Z])² + Var[Z]. As proved by (Klambauer et al., 2017), according to the central limit theorem, the numerous multiplication and addition operations within neural networks make the activations generally follow a Gaussian distribution, which is also a basic assumption in many previous works in the quantization field (Ding et al., 2019; Sun et al., 2022; Lin et al., 2022; Chmiel et al., 2020). Meanwhile, Fig. 2 illustrates the channel distribution of the full-precision and quantized activation. It can be seen that the quantized activation continues to exhibit an approximately Gaussian distribution (Krishnamoorthi, 2018). Thus, we consider that the channel distribution of x̄^s can still be captured by a Gaussian distribution, and model x̄^s with a D^s_in-dimensional Gaussian distribution N(μ^s, Σ^s), where D^s_in is the dimension of x̄^s, μ^s ∈ R^{D^s_in}, and Σ^s ∈ R^{D^s_in×D^s_in}. Then, Eq. 11 becomes:

(E[δW^s_{i,:} x̄^s])² + Var[δW^s_{i,:} x̄^s]
= δW^s_{i,:} μ^s μ^{sT} (δW^s_{i,:})ᵀ + δW^s_{i,:} Σ^s (δW^s_{i,:})ᵀ    (12)
= δW^s_{i,:} (μ^s μ^{sT} + Σ^s)(δW^s_{i,:})ᵀ.

Here, Eq. 12 is the obtained proxy for E[∥δW^s_{i,:} x̄^s∥²₂]. In practice, we estimate the empirical μ̂^s and Σ̂^s with the given calibration dataset. Note that μ̂^s and Σ̂^s are shared across all output channels and require only a single computation. Fig. 3 presents the relationship between the proxy and E[∥δW^s_{i,:} x̄^s∥²₂]. It can be seen that the proposed proxy is proportional to the real value, demonstrating its fidelity.

The computational complexity of using our proxy is O((D^s_in)²), while the complexity of E[∥δW^s_{i,:} x̄^s∥²₂] is O(N D^s_in), where N >> D^s_in. Thus, the proxy can serve as a low-cost objective for solving δW^s_{i,:}. As shown in Tab. 1, using Eq. 12 as the target of the MIQP reduces the time costs from ∼130 hours to ∼10 hours. However, this still incurs moderate costs, since current open-source MIQP solvers only support CPU and cannot fully exploit the capacity of GPUs. In the following, we introduce Rounding Refinement, a GPU-supported method that uses the gradient of the proxy to adjust δW^s_{i,:} faster.

Rounding Refinement. At first, we initialize δW^s_{i,j} with the rounding-to-nearest strategy. Now, δW^s_{i,j} is equal to either δW^{s↓}_{i,j} or δW^{s↑}_{i,j}. Then, we aim to determine an index set S containing the indices of the elements that necessitate modification, whose rounding direction is overturned:

δW^s_{i,j} = δW^{s↓}_{i,j} if δW^s_{i,j} = δW^{s↑}_{i,j}, and δW^{s↑}_{i,j} otherwise,  for j ∈ S.    (13)

To determine S, we first take the derivative of the proxy (Eq. 12) w.r.t. δW^s_{i,:}:

G_{δW^s_{i,:}} = ∂/∂δW^s_{i,:} [δW^s_{i,:}(μ^s μ^{sT} + Σ^s)(δW^s_{i,:})ᵀ] = 2δW^s_{i,:}(μ^s μ^{sT} + Σ^s).    (14)

We only select elements whose gradient has the same sign as the error, since this is the only way to allow an overturn. For example, given δW^s_{i,j} = δW^{s↓}_{i,j}, replacing it with δW^{s↑}_{i,j} is feasible only if G_{δW^s_{i,j}} has the same sign as δW^s_{i,j}. Thus, the index set S is defined as:

S = topk_index(M),  M = |G_{δW^s_{i,:}} ⊙ 1(G_{δW^s_{i,:}} ⊙ δW^s_{i,:})| ∈ R^{D^s_in}.    (15)

Here, topk_index returns the indices of the top-k elements, 1(·) returns 1 for non-negative input and 0 for negative input, and |·| returns the absolute value of the input.
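A minimal PyTorch sketch of Rounding Refinement for one output channel, combining the proxy of Eq. 12, its gradient (Eq. 14), the candidate mask and top-k selection (Eq. 15), and the overturn of Eq. 13, is given below; the variable names and the early-stopping detail are illustrative assumptions rather than the authors' code.

```python
import torch

def rounding_refinement(dw, dw_floor, dw_ceil, mu_s, Sigma_s, k=1, max_iter=100):
    """Refine the rounding error δW^s_{i,:} of one output channel.

    dw       : (D,) current error, initialized with round-to-nearest.
    dw_floor : (D,) floor candidates δW^{s↓}  (>= 0).
    dw_ceil  : (D,) ceil candidates  δW^{s↑}  (<= 0).
    mu_s     : (D,) empirical mean of x̄^s;  Sigma_s : (D, D) empirical covariance.
    """
    A = torch.outer(mu_s, mu_s) + Sigma_s          # μ^s μ^{sT} + Σ^s, shared across channels

    def proxy(d):                                  # Eq. 12: d (μμᵀ + Σ) dᵀ
        return d @ A @ d

    for _ in range(max_iter):
        loss_old = proxy(dw)
        grad = 2.0 * (dw @ A)                      # Eq. 14
        # Eq. 15: only elements whose gradient has the same sign as δW may be flipped.
        mask = (grad * dw) >= 0
        scores = torch.where(mask, grad.abs(), torch.zeros_like(grad))
        idx = torch.topk(scores, k).indices        # index set S
        flipped = dw.clone()
        # Eq. 13: overturn the rounding direction of the selected elements
        # (entries of dw are exact copies of either dw_floor or dw_ceil, so the
        #  equality test below is exact).
        flipped[idx] = torch.where(dw[idx] == dw_ceil[idx], dw_floor[idx], dw_ceil[idx])
        if proxy(flipped) > loss_old:              # stop once the proxy no longer improves
            break
        dw = flipped
    return dw                                      # refined δW^s_{i,:}; then W̄^s = W^s + δW^s
```

Since μ^s μ^{sT} + Σ^s is shared by all output channels, a full implementation would estimate it once per weight segment and batch the loop over channels.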
Table 2. Results on ImageNet dataset. The top-1 accuracy (%) is reported as the metric. “W/A” indicates that the bit-width of the weight
and activation are W and A bits, respectively. “*” indicates the results are re-produced by using the official code. More results are provided
in the appendix.

Method W/A ViT-S ViT-B DeiT-T DeiT-S DeiT-B Swin-S Swin-B

Full-Precision 32/32 81.39 84.54 72.21 79.85 81.80 83.23 85.27


FQ-ViT* (Lin et al., 2022) 3/4 0.10 0.10 0.10 0.10 0.10 0.10 0.10
PTQ4ViT* (Yuan et al., 2022) 3/4 0.10 0.10 0.20 0.15 0.59 0.64 0.53
GPTQ* (Frantar et al., 2022) 3/4 23.32 44.63 42.25 48.95 61.75 66.71 71.43
RepQ-ViT* (Li et al., 2023) 3/4 15.65 26.98 29.34 45.82 58.92 59.83 44.17
AdaRound* (Nagel et al., 2020b) 3/4 11.04 4.72 36.05 33.56 62.50 68.12 53.92
BRECQ* (Li et al., 2021) 3/4 4.97 1.25 29.23 18.58 40.49 66.93 53.38
QDrop* (Wei et al., 2022) 3/4 9.77 11.87 17.85 30.27 61.12 73.47 74.33
PD-Quant* (Liu et al., 2023a) 3/4 4.56 21.81 41.87 41.65 53.63 70.07 56.48
ERQ (Ours) 3/4 45.68 53.88 44.09 57.63 70.33 75.08 75.78
FQ-ViT (Lin et al., 2022) 4/4 0.10 0.10 0.10 0.10 0.10 0.10 0.10
PTQ4ViT (Yuan et al., 2022) 4/4 42.57 30.69 36.96 34.08 64.39 76.09 74.02
APQ-ViT (Ding et al., 2022) 4/4 47.95 41.41 47.94 43.55 67.48 77.15 76.48
GPTQ* (Frantar et al., 2022) 4/4 67.59 75.12 58.96 70.85 76.10 80.17 81.08
RepQ-ViT (Li et al., 2023) 4/4 65.05 68.48 57.43 69.03 75.61 79.45 78.32
AdaRound* (Nagel et al., 2020b) 4/4 63.09 70.51 55.65 69.24 75.20 76.05 78.12
BRECQ* (Li et al., 2021) 4/4 11.31 3.03 38.41 32.89 59.10 68.40 56.51
QDrop* (Wei et al., 2022) 4/4 17.77 21.72 31.65 35.79 65.47 78.92 80.49
PD-Quant* (Liu et al., 2023a) 4/4 32.64 34.86 58.50 64.85 60.06 77.04 75.84
ERQ (Ours) 4/4 68.91 76.63 60.29 72.56 78.23 80.74 82.44
FQ-ViT* (Lin et al., 2022) 5/5 0.10 0.10 0.10 0.10 0.10 0.10 0.10
PTQ4ViT* (Yuan et al., 2022) 5/5 72.74 72.32 65.00 70.26 72.65 80.90 81.87
GPTQ* (Frantar et al., 2022) 5/5 78.63 82.06 69.05 77.12 80.17 82.19 83.00
RepQ-ViT* (Li et al., 2023) 5/5 78.43 82.03 69.00 77.04 80.08 82.08 83.22
AdaRound* (Nagel et al., 2020b) 5/5 77.53 82.00 68.87 76.22 80.18 82.12 84.09
BRECQ* (Li et al., 2021) 5/5 47.35 43.51 62.12 63.15 75.61 80.66 82.31
QDrop* (Wei et al., 2022) 5/5 56.32 57.92 62.36 70.07 78.41 81.73 83.61
PD-Quant* (Liu et al., 2023a) 5/5 65.06 58.40 68.02 74.94 74.61 81.27 82.12
ERQ (Ours) 5/5 78.83 82.81 69.42 77.58 80.65 82.44 84.50

After obtaining S, the overturn is performed with Eq. 13. The above process iterates until the adjusted δW^s_{i,:} incurs a larger proxy value or the maximum number of iterations is reached. After obtaining δW^s_{i,:}, the quantization can be completed by W̄^s_{i,:} = W^s_{i,:} + δW^s_{i,:}. W̄^s_{i,:} is then added to the set of quantized weights. The overall process of Rounding Refinement is presented in Lines 7-18 of Alg. 1. As shown in Tab. 1, Rounding Refinement significantly reduces the time costs from ∼10 hours to ∼4 minutes, a roughly 150× speedup, at the cost of an affordable accuracy loss.

4.2.2. Ridge Regression

After Rounding Refinement, we suggest adjusting W^r_{i,:} with δW^{r*}_{i,:} to further counteract E[∥δW^s_{i,:} x̄^s∥²₂], which yields the following target:

E[∥δW^s_{i,:} x̄^s + δW^{r*}_{i,:} x̄^r∥²₂] + λ₂∥δW^{r*}_{i,:}∥²₂,    (16)

where λ₂ is a hyper-parameter that controls the intensity of the regularization term λ₂∥δW^{r*}_{i,:}∥²₂. The minimization of Eq. 16 formulates a Ridge Regression problem whose solution is:

δW^{r*}_{i,:} = −δW^s_{i,:} E[x̄^s x̄^{rT}](E[x̄^r x̄^{rT}] + λ₂I)⁻¹.    (17)

In practice, we estimate E[x̄^s x̄^{rT}] and E[x̄^r x̄^{rT}] using (1/N)Σ_{n=1}^{N} x̄^s_n x̄^{rT}_n and (1/N)Σ_{n=1}^{N} x̄^r_n x̄^{rT}_n. Afterward, we set W^r_{i,:} = W^r_{i,:} + δW^{r*}_{i,:} to mitigate the error. At this point, W^r_{i,:} remains at full precision and will be processed in the next iteration.
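The Ridge Regression step of Eq. 17 also admits a direct implementation. A minimal sketch for a single output channel is shown below; the names are illustrative, and the expectations are the empirical token averages described above.

```python
import torch

def wqer_ridge_update(dw_s, x_s, x_r, lam2=1e4):
    """Sketch of the Ridge Regression step of Wqer (Eq. 17) for one output channel.

    dw_s : (D_s,) rounding error δW^s_{i,:} remaining after Rounding Refinement.
    x_s  : (N, D_s) quantized activations x̄^s of the just-quantized segment.
    x_r  : (N, D_r) quantized activations x̄^r of the still full-precision segment.
    Returns δW^{r*}_{i,:} of shape (D_r,) to be added to W^r_{i,:}.
    """
    N = x_s.shape[0]
    E_sr = x_s.T @ x_r / N                                  # E[x̄^s x̄^{rT}], (D_s, D_r)
    E_rr = x_r.T @ x_r / N                                  # E[x̄^r x̄^{rT}], (D_r, D_r)
    reg = lam2 * torch.eye(x_r.shape[1], dtype=x_r.dtype, device=x_r.device)
    # δW^{r*} = -δW^s E[x̄^s x̄^{rT}] (E[x̄^r x̄^{rT}] + λ2 I)^(-1)
    return -(dw_s @ E_sr) @ torch.linalg.inv(E_rr + reg)
```

In the full Wqer loop this update would be batched over output channels, with the two expectation matrices computed once per partition and reused.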
Such a process continuously runs until all weights are accurately quantized. The proposed Rounding Refinement and Ridge Regression collectively form Wqer, whose overall process is presented in Alg. 1. In practice, we perform Wqer for multiple output channels in parallel.

5. Experiments

5.1. Implementation details

Models and datasets. We conduct extensive experiments on image classification, object detection, and instance segmentation. For the image classification task, we evaluate ERQ on the ImageNet dataset (Russakovsky et al., 2015) with different ViT variants, including ViT (Dosovitskiy et al., 2021), DeiT (Touvron et al., 2021), and Swin (Liu et al., 2021a). For the object detection and instance segmentation tasks, we evaluate ERQ on the COCO dataset (Lin et al., 2014) with Mask R-CNN (He et al., 2017) and Cascade Mask R-CNN (Cai & Vasconcelos, 2018), both using Swin (Liu et al., 2021a) as their backbone.

Implementation details. Consistent with the previous study (Li et al., 2023), we randomly select 32 images from ImageNet and 1 image from the COCO dataset. The quantization parameters are determined by forwarding the calibration datasets, and the reparameterization technique is used to initialize the activation quantizer as in (Li et al., 2023). In our experiments, k and the maximum iteration of Rounding Refinement are set to 1 and 100, respectively. We use pulp (a CPU-only LP modeler written in Python) to solve the MIQP. For the image classification task, we set λ₁ = λ₂ = 1e4 for ViT, λ₁ = λ₂ = 1e3 for DeiT-T, λ₁ = λ₂ = 1e4 for DeiT-S and DeiT-B, and λ₁ = λ₂ = 1e4 for Swin. For the detection and segmentation tasks, we set λ₁ = λ₂ = 1e5 for all models. All experiments are implemented using the PyTorch framework (Paszke et al., 2019) with a single NVIDIA 3090 GPU and an Intel Xeon 4214R CPU.

Compared methods. We re-implement BRECQ (Li et al., 2021), QDrop (Wei et al., 2022), PD-Quant (Liu et al., 2023a), and GPTQ (Frantar et al., 2022) with their official code, using the same 32-image calibration dataset as ours. The initial implementation of GPTQ does not involve activation quantization; thus, we employ the same quantizer as our own to quantize the activations for it, including the reparameterization technique and the log√2 quantizer. For other PTQ of ViT methods, including PSAQ-ViT (Li et al., 2022b), Ranking (Liu et al., 2021b), EasyQuant (Wu et al., 2020), PTQ4ViT (Yuan et al., 2022), APQ-ViT (Ding et al., 2022), NoisyQuant (Liu et al., 2023b), and RepQ-ViT (Li et al., 2023), we use the results reported in their papers if available; otherwise, we re-implement them based on their official code. The ablation study on the number of calibration images and comparisons of time costs are presented in the appendix.

5.2. Results on ImageNet Dataset

The comparison between ERQ and other PTQ of ViTs methods is presented in Tab. 2. It can be seen that the proposed ERQ shows advantages over the compared methods in all bit-width settings, especially in the low-bit cases. Specifically, due to the small size of the calibration dataset, many methods typically suffer from overfitting and exhibit unstable performance. For instance, in the W3A4 case, QDrop and PD-Quant obtain 9.77% and 4.56% on ViT-S, respectively. In contrast, the proposed ERQ shows stable improvements across all variants. Notably, ERQ demonstrates 22.36% and 9.25% performance gains on ViT-S and ViT-B, 1.84%, 8.68%, and 8.58% gains on DeiT-T, DeiT-S, and DeiT-B, and 1.61% and 1.45% gains on Swin-S and Swin-B. When it comes to the W4A4 case, ERQ respectively obtains 1.32% and 1.51% performance gains on ViT-S and ViT-B, 1.33%, 1.71%, and 2.13% gains on DeiT-T, DeiT-S, and DeiT-B, and 0.57% and 1.36% gains on Swin-S and Swin-B. For the W5A5 case, ERQ also presents the best performance; for example, ERQ improves the accuracy by 1.28% on Swin-B.

5.3. Results on COCO Dataset

The results of object detection and instance segmentation are reported in Tab. 3. It can be seen that ERQ improves performance in most cases. For instance, ERQ augments the box AP and mask AP by 0.5 and 0.3 for W4A4 Mask R-CNN with Swin-T as its backbone, respectively. Also, ERQ augments the box AP and mask AP by 0.8 and 0.6 for W4A4 Cascade Mask R-CNN with Swin-T as its backbone. The above results further demonstrate the effectiveness and generalization ability of the proposed ERQ.

[Figure 4. Ablation studies of λ₁ and λ₂: Top-1 accuracy (%) on W4A4 DeiT-S for different values of λ₁ = λ₂.]
Table 3. Results on COCO dataset. "AP^box" denotes the box average precision for object detection, and "AP^mask" denotes the mask average precision for instance segmentation. "*" and "†" indicate results reproduced using the official code. Each cell lists AP^box / AP^mask.

Method | W/A | Mask R-CNN w. Swin-T | Mask R-CNN w. Swin-S | Cascade Mask R-CNN w. Swin-T | Cascade Mask R-CNN w. Swin-S
Full-Precision | 32/32 | 46.0 / 41.6 | 48.5 / 43.3 | 50.4 / 43.7 | 51.9 / 45.0
PTQ4ViT (Yuan et al., 2022) | 4/4 | 6.9 / 7.0 | 26.7 / 26.6 | 14.7 / 13.5 | 0.5 / 0.5
APQ-ViT (Ding et al., 2022) | 4/4 | 23.7 / 22.6 | 44.7 / 40.1 | 27.2 / 24.4 | 47.7 / 41.1
GPTQ* (Frantar et al., 2022) | 4/4 | 36.3 / 36.3 | 42.9 / 40.2 | 47.1 / 41.5 | 49.2 / 43.2
RepQ-ViT (Li et al., 2023) | 4/4 | 36.1 / 36.0 | 44.2 (42.7†) / 40.2 (40.1†) | 47.0 / 41.4 | 49.3 / 43.1
AdaRound* (Nagel et al., 2020a) | 4/4 | 16.3 / 19.8 | 22.3 / 22.5 | 34.6 / 33.4 | 35.8 / 34.5
BRECQ* (Li et al., 2021) | 4/4 | 25.2 / 27.3 | 32.4 / 32.9 | 40.4 / 35.9 | 41.5 / 37.2
QDrop* (Wei et al., 2022) | 4/4 | 10.4 / 11.3 | 39.7 / 37.8 | 17.9 / 16.2 | 20.1 / 17.4
PD-Quant* (Liu et al., 2023a) | 4/4 | 15.7 / 16.1 | 30.2 / 28.4 | 34.5 / 30.1 | 38.6 / 34.1
ERQ (Ours) | 4/4 | 36.8 / 36.6 | 43.4 / 40.7 | 47.9 / 42.1 | 50.0 / 43.6

Table 4. Ablations on different components of ERQ. "Baseline" indicates that only calibration is performed and no error reduction is involved. "Aqer" and "Wqer" represent Activation quantization error reduction and Weight quantization error reduction, respectively. "Rounding" and "Ridge" represent the Rounding Refinement and Ridge Regression used in Wqer, respectively. Results are reported with W4A4 DeiT-S on ImageNet.

Aqer | Rounding | Ridge | Top-1 Acc. (%)
(Baseline) | - | - | 68.41
✓ | - | - | 71.45 (+3.04)
- | ✓ | - | 69.24 (+0.83)
- | - | ✓ | 70.06 (+1.65)
- | ✓ | ✓ | 70.49 (+2.08)
✓ | ✓ | - | 71.83 (+3.42)
✓ | - | ✓ | 72.01 (+3.60)
✓ | ✓ | ✓ | 72.56 (+4.15)

Table 5. Ablation studies of different k in Rounding Refinement.

Model | k | Top-1 Acc. (%)
DeiT-S (W4/A4) | 0 | 72.01
DeiT-S (W4/A4) | 1 | 72.56
DeiT-S (W4/A4) | 2 | 72.38
DeiT-S (W4/A4) | 3 | 71.79

5.4. Ablation Study

All ablation studies are conducted on W4A4 DeiT-S. Tab. 4 reports the results of the various components within ERQ. Note that Wqer consists of Rounding Refinement and Ridge Regression. It can be observed that, compared to the baseline, Aqer enhances accuracy by 3.04%. Furthermore, Rounding Refinement and Ridge Regression result in accuracy improvements of 0.83% and 1.65%, respectively. When these two approaches are both employed, i.e., using Wqer, the performance is improved by 2.08%. Ultimately, the combination of Aqer with Wqer showcases the optimal performance, with an accuracy increment of 4.15 percentage points over the baseline, to 72.56%. These results confirm the effectiveness of the components of ERQ. Then, we provide the ablation study of λ₁ of Eq. 6 and λ₂ of Eq. 16. For simplicity, we set λ₁ = λ₂ and search for the best value. Although this may not be the best choice, it yields desirable performance. Fig. 4 presents the results for different values. It can be seen that the best performance is exhibited when λ₁ = λ₂ = 1e4 for W4A4 DeiT-S. Tab. 5 presents the ablation study of different k in Eq. 15 of Rounding Refinement. It can be observed that the best accuracy is achieved when k = 1. Note that when k = 0, Rounding Refinement is disabled. Finally, in Sec. D of the appendix, we demonstrate that each component of ERQ, including Aqer, the Rounding Refinement of Wqer, and the Ridge Regression of Wqer, effectively reduces the quantization error.

6. Discussion

We further discuss several unexplored limitations of the proposed ERQ, which will be our future focus. First, despite achieving considerable performance gains, ERQ currently focuses on layers with weights. A further improvement would be feasible if the error of quantized self-attention could be taken into consideration. Meanwhile, Rounding Refinement deserves further exploration: there is substantial potential in other refinement techniques, such as offering more flexibility in the rounding targets. Finally, limited by computational resources, we are currently unable to apply ERQ to Large Language Models (LLMs). Extending ERQ to LLMs is an imperative task for our future studies.
7. Conclusion

In this paper, we present ERQ, consisting of Activation quantization error reduction (Aqer) and Weight quantization error reduction (Wqer), which respectively mitigate the quantization error induced by activation and weight quantization. In Aqer, the mitigation of the activation quantization error is formulated as a Ridge Regression problem, presenting a closed-form solution that tackles the error by updating the weights at full precision. In Wqer, the weight quantization error is progressively mitigated in a quantization-and-correction manner. At each iteration, the first half of the weights is quantized, and the resulting error is first mitigated by Rounding Refinement and then again by solving a Ridge Regression problem. The former mitigates the quantization error by leveraging an empirically derived, efficient proxy of the output error to refine the rounding directions of the quantized weights; the latter further mitigates the quantization error into the remaining full-precision weights with a closed-form solution. The effectiveness of ERQ is demonstrated by extensive experiments on a variety of ViTs across diverse tasks.

Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2022ZD0118202), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. U22B2051, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, No. 62002305 and No. 62272401), and the Natural Science Foundation of Fujian Province of China (No. 2021J01002, No. 2022J06001).

Impact Statement

This paper presents a PTQ method that aims to boost the performance of quantized ViTs. The model compression community may benefit from this research. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., and Schmid, C. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6836–6846, 2021.

Banner, R., Nahshan, Y., Soudry, D., et al. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 7950–7958, 2019.

Cai, Z. and Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6154–6162, 2018.

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 213–229. Springer, 2020.

Chmiel, B., Banner, R., Shomron, G., Nahshan, Y., Bronstein, A., Weiser, U., et al. Robust quantization: One model to rule them all. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), volume 33, pp. 5308–5317, 2020.

Ding, R., Chin, T.-W., Liu, Z., and Marculescu, D. Regularizing activation distribution for training binarized deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11408–11417, 2019.

Ding, Y., Qin, H., Yan, Q., Chai, Z., Liu, J., Wei, X., and Liu, X. Towards accurate post-training quantization for vision transformer. In Proceedings of the 30th ACM International Conference on Multimedia (ACM MM), pp. 5380–5388, 2022.

Dong, P., Lu, L., Wu, C., Lyu, C., Yuan, G., Tang, H., and Wang, Y. PackQViT: Faster sub-8-bit vision transformers via full and packed quantization on the mobile. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2024.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR). OpenReview.net, 2021.

Frantar, E. and Alistarh, D. Optimal brain compression: A framework for accurate post-training quantization and pruning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 4475–4488, 2022.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training compression for generative pretrained transformers. arXiv preprint arXiv:2210.17323, 2022.

Frumkin, N., Gope, D., and Marculescu, D. Jumping through local minima: Quantization in the loss landscape of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 16978–16988, 2023.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2961–2969, 2017.

Hou, Z. and Kung, S.-Y. Multi-dimensional vision transformer compression via dependency guided gaussian process search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3669–3678, 2022.

Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. Self-normalizing neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.

Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.

Kuzmin, A., Nagel, M., van Baalen, M., Behboodi, A., and Blankevoort, T. Pruning vs quantization: Which is better?, 2023.

Li, Y., Dong, X., and Wang, W. Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.

Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S. BRECQ: Pushing the limit of post-training quantization by block reconstruction. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.

Li, Y., Xu, S., Zhang, B., Cao, X., Gao, P., and Guo, G. Q-ViT: Accurate and fully quantized low-bit vision transformer. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 34451–34463, 2022a.

Li, Z. and Gu, Q. I-ViT: Integer-only quantization for efficient vision transformer inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 17065–17075, 2023.

Li, Z., Ma, L., Chen, M., Xiao, J., and Gu, Q. Patch similarity aware data-free quantization for vision transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 154–170. Springer, 2022b.

Li, Z., Xiao, J., Yang, L., and Gu, Q. RepQ-ViT: Scale reparameterization for post-training quantization of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 17227–17236, 2023.

Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., and Timofte, R. SwinIR: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1833–1844, 2021.

Lin, C., Peng, B., Li, Z., Tan, W., Ren, Y., Xiao, J., and Pu, S. Bit-shrinking: Limiting instantaneous sharpness for improving post-training quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16196–16205, 2023.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755. Springer, 2014.

Lin, Y., Zhang, T., Sun, P., Li, Z., and Zhou, S. FQ-ViT: Post-training quantization for fully quantized vision transformer. In Raedt, L. D. (ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI), pp. 1173–1179, 2022.

Liu, J., Niu, L., Yuan, Z., Yang, D., Wang, X., and Liu, W. PD-Quant: Post-training quantization based on prediction difference metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 24427–24437, 2023a.

Liu, Y., Yang, H., Dong, Z., Keutzer, K., Du, L., and Zhang, S. NoisyQuant: Noisy bias-enhanced post-training activation quantization for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20321–20330, 2023b.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10012–10022, 2021a.

Liu, Z., Wang, Y., Han, K., Zhang, W., Ma, S., and Gao, W. Post-training quantization for vision transformer. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), volume 34, pp. 28092–28103, 2021b.

Mehta, S. and Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.

Nagel, M., Amjad, R. A., Van Baalen, M., Louizos, C., and Blankevoort, T. Up or down? Adaptive rounding for post-training quantization. In Proceedings of the International Conference on Machine Learning (ICML), pp. 7197–7206, 2020a.
Nagel, M., Amjad, R. A., Van Baalen, M., Louizos, C., and Blankevoort, T. Up or down? Adaptive rounding for post-training quantization. In Proceedings of the International Conference on Machine Learning (ICML), pp. 7197–7206. PMLR, 2020b.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 8026–8037, 2019.

Pia, A. D., Dey, S. S., and Molinaro, M. Mixed-integer quadratic programming is in NP. Mathematical Programming, 162:225–240, 2017.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115:211–252, 2015.

Shamshad, F., Khan, S., Zamir, S. W., Khan, M. H., Hayat, M., Khan, F. S., and Fu, H. Transformers in medical imaging: A survey. Medical Image Analysis, pp. 102802, 2023.

Sun, Z., Ge, C., Wang, J., Lin, M., Chen, H., Li, H., and Sun, X. Entropy-driven mixed-precision quantization for deep network design. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 21508–21520, 2022.

Tang, Y., Han, K., Wang, Y., Xu, C., Guo, J., Xu, C., and Tao, D. Patch slimming for efficient vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12165–12174, 2022.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning (ICML), pp. 10347–10357. PMLR, 2021.

Van Baalen, M., Louizos, C., Nagel, M., Amjad, R. A., Wang, Y., Blankevoort, T., and Welling, M. Bayesian bits: Unifying quantization and pruning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), volume 33, pp. 5741–5752, 2020.

Wei, X., Gong, R., Li, Y., Liu, X., and Yu, F. QDrop: Randomly dropping quantization for extremely low-bit post-training quantization. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.

Wu, D., Tang, Q., Zhao, Y., Zhang, M., Fu, Y., and Zhang, D. EasyQuant: Post-training quantization via scale optimization. CoRR, abs/2006.16669, 2020.

Xijie Huang, Z. S. and Cheng, K.-T. Variation-aware vision transformer quantization. arXiv preprint arXiv:2307.00331, 2023.

Yuan, Z., Xue, C., Chen, Y., Wu, Q., and Sun, G. PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 191–207. Springer, 2022.

Zhang, J., Peng, H., Wu, K., Liu, M., Xiao, B., Fu, J., and Yuan, L. MiniViT: Compressing vision transformers with weight multiplexing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12145–12154, 2022.

Zheng, D., Dong, W., Hu, H., Chen, X., and Wang, Y. Less is more: Focus attention for efficient DETR. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6674–6683, 2023.

Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P. H., et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6881–6890, 2021.

Zhou, A., Yao, A., Guo, Y., Xu, L., and Chen, Y. Incremental network quantization: Towards lossless CNNs with low-precision weights. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.

A. More Results on ImageNet Dataset


Tab. 6 presents the comparison between the proposed ERQ and other PTQ methods. It can be observed that for the W6A6 case, ERQ also exhibits the best results, further improving the performance. For example, ERQ incurs only 0.39% and 0.25% accuracy drops compared to the full-precision model for DeiT-B and Swin-B, respectively.

Table 6. More results on ImageNet dataset. The top-1 accuracy (%) is reported as the metric. “W/A” indicates that the bit-width of the
weight and activation are W and A bits, respectively. †/‡ respectively indicates applying NoisyQuant to linear quantization/PTQ4ViT. “*”
indicates the results are re-produced by using the official code.

Method W/A ViT-S ViT-B DeiT-T DeiT-S DeiT-B Swin-S Swin-B

Full-Precision 32/32 81.39 84.54 72.21 79.85 81.80 83.23 85.27


FQ-ViT (Lin et al., 2022) 6/6 4.26 0.10 58.66 45.51 64.63 66.50 52.09
PSAQ-ViT (Li et al., 2022b) 6/6 37.19 41.52 57.58 63.61 67.95 72.86 76.44
Ranking (Liu et al., 2021b) 6/6 - 75.26 - 74.58 77.02 - -
PTQ4ViT (Yuan et al., 2022) 6/6 78.63 81.65 69.68 76.28 80.25 82.38 84.01
APQ-ViT (Ding et al., 2022) 6/6 79.10 82.21 70.49 77.76 80.42 82.67 84.18
NoisyQuant† (Liu et al., 2023b) 6/6 76.86 81.90 - 76.37 79.77 82.78 84.57
NoisyQuant‡ (Liu et al., 2023b) 6/6 78.65 82.32 - 77.43 80.70 82.86 84.68
GPTQ* (Frantar et al., 2022) 6/6 80.44 83.72 71.05 78.95 81.37 82.82 84.89
RepQ-ViT (Li et al., 2023) 6/6 80.43 83.62 70.76 78.90 81.27 82.79 84.57
EasyQuant (Wu et al., 2020) 6/6 75.13 81.42 - 75.27 79.47 82.45 84.30
Bit-shrinking (Lin et al., 2023) 6/6 80.44 83.16 - 78.51 80.47 82.44 -
BRECQ* (Li et al., 2021) 6/6 61.18 71.29 69.62 70.93 79.46 81.85 84.08
QDrop* (Wei et al., 2022) 6/6 68.57 74.38 69.98 76.57 80.66 82.53 84.31
PD-Quant* (Liu et al., 2023a) 6/6 71.38 63.14 70.74 77.63 79.32 82.33 84.38
ERQ (Ours) 6/6 80.48 83.89 71.14 79.03 81.41 82.86 85.02

B. Ablation Study of Image Numbers


Tab. 7 presents the ablation study with different numbers of calibration images. It can be observed that as the number of images increases, the performance increases correspondingly. For example, the accuracy is 71.58% and 72.56% for 4 and 32 images, respectively. After 256 images, the performance reaches a plateau. Although using more images can improve performance, in the main paper we adopt 32 images to align with the previous study (Li et al., 2023) for a fair comparison.

Table 7. Ablation studies of different image numbers.

Model Image Numbers Top-1 Acc. (%)

4 71.58
8 71.87
16 72.54
32 72.56
DeiT-S (W4/A4) 64 72.94
128 73.19
256 73.51
512 73.68
1024 73.69


Table 8. Time costs of different methods. “*” indicates the results are re-produced by using the official code. Experiments are conducted
with W4A4 DeiT-S.
Method Runtime Top-1 Acc. (%)

BRECQ* (Li et al., 2021) ∼48 minutes 32.89


QDrop* (Wei et al., 2022) ∼80 minutes 35.79
PD-Quant* (Liu et al., 2023a) ∼110 minutes 64.85
GPTQ* (Frantar et al., 2022) ∼3 minutes 70.85
RepQ-ViT (Li et al., 2023) ∼1 minute 69.03
ERQ (Ours) ∼4 minutes 72.56

C. Comparisons of Time Costs


Tab. 8 presents comparisons of time costs between the proposed ERQ and other PTQ methods. It can be seen that BRECQ, QDrop, and PD-Quant require a longer time overhead. In contrast, GPTQ, RepQ-ViT, and the proposed ERQ demonstrate significantly reduced time costs. Notably, our ERQ achieves the best Top-1 accuracy of 72.56% with a runtime of only about 4 minutes.

D. Validation of Error Reduction


In the main paper, we demonstrate that the proposed ERQ improves performance. In this section, we validate that each
component of ERQ successfully reduces the quantization error, making the output of quantized computation closer to the
full-precision output.

D.1. Activation quantization error reduction (Aqer)

Figure 5. Illustration of the error reduction ratio brought by Aqer. We plot the error reduction ratio for each layer (panels: W4A4 DeiT-S, DeiT-B, ViT-S, and ViT-B; x-axis: layer index; y-axis: reduction ratio, with the per-model mean annotated).

Fig. 5 presents the error reduction ratio brought by Aqer. The ratio is computed using the value of Eq. 4 before and after
applying Aqer. As can be observed, Aqer generally yields a 13%-17% average error reduction. Therefore, applying Aqer
improves the performance as shown in the result in Tab. 4 of the main paper.

D.2. Weight quantization error reduction (Wqer)


The proposed Wqer consists of Rounding Refinement and Ridge Regression. In the following, we respectively plot the error reduction ratio of applying Rounding Refinement alone, applying Ridge Regression alone, and applying both together, i.e., applying Wqer.

D.2.1. Rounding Refinement


Fig. 6 presents the error reduction ratio brought by Rounding Refinement. The ratio is computed using the value of Eq. 4
before and after applying Rounding Refinement. As can be observed, Rounding Refinement generally yields a 7%-11%
average error reduction. Such results support that applying Rounding Refinement improves the performance as shown in the

result in Tab. 4 of the main paper.

Figure 6. Illustration of the error reduction ratio brought by Rounding Refinement. We plot the error reduction ratio for each layer (panels: W4A4 DeiT-S, DeiT-B, ViT-S, and ViT-B; x-axis: layer index; y-axis: reduction ratio, with the per-model mean annotated).

D.2.2. Ridge Regression

Figure 7. Illustration of the error reduction ratio brought by Ridge Regression. We plot the error reduction ratio for each layer (panels: W4A4 DeiT-S, DeiT-B, ViT-S, and ViT-B; x-axis: layer index; y-axis: reduction ratio, with the per-model mean annotated).
Fig. 7 presents the error reduction ratio brought by the Ridge Regression of Wqer. The ratio is computed using the value of Eq. 4 before and after applying Ridge Regression. As can be observed, Ridge Regression generally yields a 20%-27% average error reduction. Thus, applying Ridge Regression improves the performance, as shown in Tab. 4 of the main paper.

D.2.3. Wqer (Rounding Refinement + Ridge Regression)

Figure 8. Illustration of the error reduction ratio brought by Wqer. We plot the error reduction ratio for each layer (panels: W4A4 DeiT-S, DeiT-B, ViT-S, and ViT-B; x-axis: layer index; y-axis: reduction ratio, with the per-model mean annotated).
When both Rounding Refinement and Ridge Regression are applied, this is equivalent to applying Wqer. Fig. 8 presents the error
reduction ratio brought by Wqer. The ratio is computed using the value of Eq. 4 before and after Wqer. As can be observed,
Wqer generally yields a 21%-28% average error reduction. Note that compared to the ratio of applying Ridge Regression, the
reduction ratio further increases. Specifically, the reduction ratio increases by 0.40%, 0.71%, 1.52%, and 1.19% for W4A4


DeiT-S, DeiT-B, ViT-S, and ViT-B, respectively. Such results demonstrate that the combination of Rounding Refinement
and Ridge Regression brings further error reduction, thereby resulting in a better performance, which is also supported by
the result in Tab. 4 of the main paper.

D.3. ERQ (Aqer + Wqer)

Figure 9. Illustration of the error reduction ratio brought by ERQ (Aqer + Wqer). We plot the error reduction ratio for each layer (panels: W4A4 DeiT-S, DeiT-B, ViT-S, and ViT-B; x-axis: layer index; y-axis: reduction ratio, with the per-model mean annotated).
In this subsection, we provide the error reduction ratio when both Aqer and Wqer are applied, i.e., our full ERQ. Fig. 9 presents the error reduction ratio brought by ERQ. The reduction ratio is computed using the value of Eq. 4. It can be observed that ERQ yields a 32%-47% average error reduction. Meanwhile, the results show that combining Aqer and Wqer presents a higher error reduction ratio, supporting the performance results shown in Tab. 4 of the main paper.
In conclusion, the aforementioned results prove the effectiveness of the proposed ERQ in reducing quantization error, which
narrows the difference between the output of the quantized layer and that of the full-precision layer. These findings also
justify the two-step approach of ERQ, which addresses activation and weight quantization errors sequentially.

E. Comparisons of Latency

Table 9. Model latency and throughput of W8A8 ViTs.

Model | Precision | Latency (ms) | Throughput (img/s)
ViT-S | Full-precision | 184 | 5.43
ViT-S | W8A8 | 104 (1.77x) | 9.62 (1.77x)
ViT-B | Full-precision | 746 | 1.34
ViT-B | W8A8 | 443 (1.68x) | 2.26 (1.68x)
DeiT-T | Full-precision | 54 | 18.51
DeiT-T | W8A8 | 31 (1.74x) | 32.26 (1.74x)
DeiT-S | Full-precision | 163 | 6.13
DeiT-S | W8A8 | 106 (1.54x) | 9.43 (1.54x)
DeiT-B | Full-precision | 745 | 1.34
DeiT-B | W8A8 | 376 (1.98x) | 2.66 (1.98x)
Swin-S | Full-precision | 337 | 2.97
Swin-S | W8A8 | 217 (1.55x) | 4.61 (1.55x)
Swin-B | Full-precision | 683 | 1.46
Swin-B | W8A8 | 461 (1.48x) | 2.17 (1.48x)
Tab. 9 presents the latency and throughput of the W8A8 ViT, DeiT, and Swin variants. The experiments are conducted with the ONNX framework on an Intel i5-10210U CPU. The thread number is set to 1, and the results are evaluated by forwarding a single 224×224 image. We repeat the process 5 times and report the average. It can be seen that the quantized models typically achieve 1.5x to 2x speedups, demonstrating the effectiveness of quantization. Reducing the bit-width further has the potential to yield greater speedups; for example, the W4A4 case could achieve 3x to 6x speedups (Dong et al., 2024).


However, the implementation of bit-widths below 8-bit typically requires specialized hardware (Dong et al., 2024; Li et al., 2021) and is not supported by public software frameworks. At this point, we lack the necessary toolset to reproduce this; nonetheless, we will study it once the toolset becomes available.
It is worth noting that the results in W8A8 prove the effectiveness of our ERQ in facilitating acceleration. Thus, it can be
expected that our ERQ is also able to achieve similar acceleration in the lower bit-width cases if supported by the necessary
hardware and software framework.
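For reference, single-thread CPU latency of this kind can be measured with ONNX Runtime roughly as sketched below. The model path and input name are placeholders, and this is our illustration rather than the exact script used for Tab. 9.

```python
import time
import numpy as np
import onnxruntime as ort

# Single-thread CPU inference, one 224x224 image, averaged over 5 timed runs.
opts = ort.SessionOptions()
opts.intra_op_num_threads = 1
opts.inter_op_num_threads = 1
sess = ort.InferenceSession("deit_s_w8a8.onnx", sess_options=opts,
                            providers=["CPUExecutionProvider"])

x = np.random.randn(1, 3, 224, 224).astype(np.float32)
name = sess.get_inputs()[0].name
sess.run(None, {name: x})                      # warm-up

runs = 5
start = time.perf_counter()
for _ in range(runs):
    sess.run(None, {name: x})
latency_ms = (time.perf_counter() - start) / runs * 1000
print(f"latency: {latency_ms:.1f} ms, throughput: {1000 / latency_ms:.2f} img/s")
```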

Table 10. Model size and FLOPs for different bit-width configurations.

Model Method W/A Size (MB) FLOPs (G)

Baseline FP 88 4.6
ERQ 3/4 8.3 0.054
DeiT-S
ERQ 4/4 11 0.072
ERQ 5/5 13.8 0.11

F. Analysis on model size and computational costs


For the quantized model, ERQ has the same model size and computational costs as the other PTQ methods, since the quantized models have the same bit-width. In Tab. 10, we present the model size and FLOPs of DeiT-S as an example. Here, the FLOPs are converted from Bit Operations (BOPs) (Van Baalen et al., 2020).
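Assuming the common convention BOPs = FLOPs × b_w × b_a, with weights stored at b_w bits, the Tab. 10 entries can be reproduced up to rounding, as the short sketch below illustrates; the convention itself is our assumption, while the exact conversion follows Van Baalen et al. (2020).

```python
# Reproduce the Tab. 10 entries for DeiT-S from its full-precision baseline,
# assuming BOPs = FLOPs * b_w * b_a and weights stored at b_w bits.
fp_flops_g, fp_size_mb = 4.6, 88   # DeiT-S full-precision baseline from Tab. 10

for bw, ba in [(3, 4), (4, 4), (5, 5)]:
    flops_g = fp_flops_g * (bw * ba) / (32 * 32)   # effective GFLOPs
    size_mb = fp_size_mb * bw / 32                 # weight storage
    print(f"W{bw}A{ba}: {size_mb:.2f} MB, {flops_g:.3f} GFLOPs")
# Prints 8.25 MB / 0.054, 11.00 MB / 0.072, 13.75 MB / 0.112,
# which match the Tab. 10 entries (8.3, 11, 13.8 MB; 0.054, 0.072, 0.11 GFLOPs) up to rounding.
```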

Table 11. Performance comparison of W4A4 and W4A8 ViTs on ImageNet Dataset.

Method W/A ViT-S ViT-B DeiT-T DeiT-S DeiT-B Swin-S Swin-B

Full-Precision 32/32 81.39 84.54 72.21 79.85 81.80 83.23 85.27


ERQ 4/4 68.91 76.63 60.29 72.56 78.23 80.74 82.44
ERQ 4/8 77.84 81.98 68.31 77.53 80.22 82.24 84.23

G. Analysis on activation bit-width

Table 12. Performance comparison of W4A4 and W4A8 ViTs on COCO Dataset.

Model Method W/A AP(box) AP(mask)

Full-Precision 32/32 46.0 41.6


Mask R-CNN (Swin-T) ERQ 4/4 36.8 36.6
ERQ 4/8 41.0 39.2
Full-Precision 32/32 48.5 43.3
Mask R-CNN (Swin-S) ERQ 4/4 43.4 40.7
ERQ 4/8 46.1 42.2
Full-Precision 32/32 50.4 43.7
Cascade R-CNN (Swin-T) ERQ 4/4 47.9 42.2
ERQ 4/8 49.5 43.3
Full-Precision 32/32 51.9 45.0
Cascade R-CNN (Swin-S) ERQ 4/4 50.0 43.6
ERQ 4/8 51.3 44.5

Tab. 11 presents the performance comparison of W4A4 and W4A8 ViTs on ImageNet Dataset. It can be seen that using 8-bit
activation leads to considerable gains for ViT variants. Specifically, the gains are respectively 8.93% and 5.35% on ViT-S
and ViT-B, 8.02%, 4.97%, and 1.99% on DeiT-T, DeiT-S, and DeiT-B, 1.50% and 1.79% on Swin-S and Swin-B. Tab. 12


presents the performance comparison of W4A4 and W4A8 ViTs on the detection task. For the detection task, using 8-bit
activation also yields significant improvements. For example, the AP(box) and AP(mask) are respectively improved by 4.2
and 2.6 for Mask R-CNN (Swin-T).
