Boost Vision Transformer with GPU-Friendly Sparsity and Quantization
* Tao Chen and Zhongxue Gan are corresponding authors.

Abstract

The transformer extends its success from the language to the vision domain. Because of the stacked self-attention and cross-attention blocks, accelerated deployment of vision transformers on GPU hardware is challenging and rarely studied. This paper designs a compression scheme that makes maximal use of GPU-friendly 2:4 fine-grained structured sparsity and quantization. Specifically, an original large model with dense weight parameters is first pruned into a sparse one by 2:4 structured pruning, which exploits the GPU's acceleration of the 2:4 structured sparse pattern with the FP16 data type. The floating-point sparse model is then quantized into a fixed-point one by sparse-distillation-aware quantization-aware training, which exploits the extra speedup that the GPU provides for 2:4 sparse calculation with integer tensors. A mixed-strategy knowledge distillation is used during the pruning and quantization process. The proposed compression scheme is flexible enough to support supervised and unsupervised learning styles. Experiment results show that the GPUSQ-ViT scheme achieves state-of-the-art compression by reducing vision transformer models by 6.4-12.7× in model size and 30.3-62× in FLOPs with negligible accuracy degradation on ImageNet classification, COCO detection, and ADE20K segmentation benchmarking tasks. Moreover, GPUSQ-ViT can boost actual deployment performance by 1.39-1.79× in latency and 3.22-3.43× in throughput on A100 GPU, and by 1.57-1.69× in latency and 2.11-2.51× in throughput on AGX Orin.

1. Introduction

Transformer-based neural models [48] have garnered immense interest recently due to their effectiveness and generalization across various applications. Equipped with the attention mechanism [52] as the core of their architecture, transformer-based models specialize in handling long-range dependencies and are also good at extracting non-local features [9] [5] in the computer vision domain. With accuracy comparable or even superior to that of traditional convolutional neural networks (CNN) [12] [49], more vision transformer models have been invented and are gradually replacing CNNs, with state-of-the-art performance on image classification [27] [26], object detection [70] [59], and segmentation [58] [68] tasks. Because vision transformer models generally have a weaker local visual inductive bias [9] than the one inherent in their CNN counterparts, many transformer blocks are stacked for compensation. Moreover, the attention module in the transformer block contains several matrix-to-matrix calculations between the key, query, and value parts [52]. Such designs give naive vision transformers more parameters and higher memory and computational resource requirements, causing high latency and energy consumption during the inference stage. Actual accelerated deployment on GPU hardware is therefore challenging.

Model compression techniques that convert large-scale vision transformer models into lightweight versions can bring more efficient computation with less on-device memory and energy consumption. Some previous studies carry CNN compression methods over to vision transformers, including pruning [43] [15], quantization [28] [23], distillation [61], and architecture search [6]. However, these studies have some drawbacks:

• Most of these common methods aim to reduce the theoretical model size and Floating Point Operations (FLOPs). But it has been shown [33] [37] that smaller model sizes and FLOPs are not directly proportional to better efficiency on deployed hardware.
• The compression patterns do not match hardware characteristics. For example, pruned [43] or searched [6] vision transformer models have an unstructured sparse pattern in their weight parameters, i.e., the distribution of non-zero elements is random. Deployed hardware therefore cannot provide actual speedup, because it lacks support for unstructured sparsity [35].
• How to best preserve accuracy under multiple compression methods, and how to generalize across multiple vision tasks, lacks systematic investigation.
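
To make the contrast with unstructured sparsity concrete, the following is a minimal sketch of the 2:4 fine-grained structured pattern that this paper targets, in which two of every four contiguous values in a row are zero. The magnitude-based selection rule and the helper names `prune_2_4` and `is_2_4_sparse` are illustrative assumptions, not the pruning criterion used by GPUSQ-ViT.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in every group of four.

    Magnitude-based selection is an assumption made for illustration only.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries; ties keep the earlier index.
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def is_2_4_sparse(row):
    """Check that every group of four contiguous values holds at least two zeros."""
    return len(row) % 4 == 0 and all(
        sum(1 for v in row[s:s + 4] if v == 0) >= 2
        for s in range(0, len(row), 4)
    )


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
sparse = prune_2_4(weights)
```

Unlike an unstructured mask, the position of the zeros is constrained within each group of four, which is what allows hardware to store and skip them in a regular way.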
[Figure 1 panels: left, dense M×N×K GEMM (dense operation on Tensor Core; A, B, and C matrices dense); right, sparse M×N×K GEMM (sparse operation on Tensor Core; sparse A matrix stored as non-zero data values plus 2-bit indices, which select the matching K/2 elements out of K elements of the dense B matrix; C matrix dense).]

Figure 1. Comparison of computing an M × N × K GEMM on a Tensor Core. The dense matrix A of size M × K on the left side becomes M × K/2 on the right side after compression with the 2:4 fine-grained structured sparse pattern. The sparse Tensor Core automatically picks only the elements from B that correspond to the non-zero elements in A. In both the dense and the sparse GEMM, B and C are the same dense K × N and M × N matrices, respectively. By skipping the unnecessary multiplications of redundant zeros, the sparse GEMM accelerates the dense GEMM by 2×.

General Matrix Multiplication (GEMM) is the fundamental implementation inside the common parts of vision transformers, such as convolution, linear projection, and transformer blocks. A specific acceleration unit called Tensor Core [39] was first introduced in the NVIDIA Volta GPU [34] to accelerate these GEMM instructions and was further enhanced to support sparse GEMM in the Ampere GPU [35]. To make the GPU hardware efficient for sparse GEMM, a constraint named 2:4 fine-grained structured sparsity [33] is imposed on the allowed sparsity pattern, i.e., two values from every four contiguous elements on rows must be zero. Due to the 2:4 sparsity support on GPU Tensor Core hardware, sparse GEMM can reduce memory storage and bandwidth by almost 2× and provide 2× math throughput compared to dense GEMM by skipping the redundant zero-value computation, as shown in Figure 1. The Ampere GPU supports various numeric precisions for 2:4 sparsity, including FP32, FP16, INT8, and INT4.

Inspired by the GPU's acceleration characteristic for the 2:4 fine-grained structured sparse pattern with various low-precision operators, we design the compression scheme GPUSQ-ViT by utilizing the GPU-friendly Sparsity and Quantization to boost deployment efficacy for Vision Transformer models, especially on GPU platforms. GPUSQ-ViT contains two main workflows. Firstly, 2:4 sparse pruning with knowledge distillation [14] (KD) is proposed to compress the specific structures in the vision transformer architecture, e.g., transformer block and patch embedding, to be GPU-friendly. Secondly, we further quantize the sparse model through sparse-distillation-aware Quantization Aware Training [30] (QAT). To measure the influence of quantization errors, we use the feature-based distillation loss in the sparse pruning workflow as the weight factor. The feature-based KD utilizes the scale factor in the quantization compression workflow, which can best compensate for the final compressed model's accuracy. We demonstrate that GPUSQ-ViT can generally apply to vision transformer models and benchmarking tasks, with state-of-the-art the- [...] efficacy on GPU platforms. Our main contributions include:

• Unlike previous compression methods that only aim at reducing theoretical metrics, we propose GPUSQ-ViT from the perspective of the GPU-friendly 2:4 sparse pattern with low-precision quantization for the first time, achieving GPU acceleration of 4 times over prior arts.
• GPUSQ-ViT combines feature-based KD with sparse pruning and QAT, which can best compensate for the accuracy of sparse and quantized models.
• GPUSQ-ViT can apply to various vision transformer models and benchmarking tasks, with proven state-of-the-art efficacy on model size, FLOPs, and actual deployment performance on multiple GPUs. Moreover, GPUSQ-ViT can work without ground truth label annotations in an unsupervised learning style.

2. Related work

2.1. Sparsity in model compression

Sparsity is a typical pattern [10] in the deep learning paradigm, which can help to save computational power as well as reduce the memory bandwidth and storage burden [33]. Sparsity has different granularities [29], e.g., we can generate filter-level, kernel-level, vector-level, and element-level sparsity [29] in a weight tensor, from coarse to fine granularity. Coarse-grained sparsity has a regular sparse pattern, which can facilitate acceleration with algebra libraries [33]. Fine-grained sparsity leads to a more irregular sparse pattern, which is not friendly for acceleration, but it can achieve a higher sparse ratio without harming model accuracy [60] [63]. Many previous efforts [4] [63] [20] have explored the sparse granularity to balance accuracy influence with real performance benefits.

Several efforts have explored compressing vision transformers with sparsity. Inspired by the phenomenon that vision transformers take effect only according to a subset of the most informative tokens [43], we can generate sparse tokens by pruning the less informative ones. The redundant tokens are pruned based on the inputs, spatial attention mechanism [44], or multi-head interpreter [40] in a dynamical [43] or patch-slimming manner [50].

Other efforts explore how to prune the components inside the basic structure of vision transformers, i.e., the multi-head attention block (MHA) [52]. For example, a successful trial [69] first learns the importance of each component in MHA by training with sparse regularization, then prunes the less important ones to obtain the sparse
MHA. Other strategies aim to sparsify the attention heads and reduce the sequence length in an MHA structure based on specific numerical metrics [54] or a searched optimal policy [15]. A more aggressive approach is pruning entire MHA blocks to generate a sparse Mixture-of-Experts [16] vision transformer or an extremely compact version [66]. Most of the prior arts use model sizes and FLOPs as compression targets without considering the characteristics of the deployed hardware. We find low efficiency when deploying these compressed models on GPUs, which inspires us to design the compression scheme with a GPU-friendly sparse pattern. Based on prior arts, weight multiplexing [66] and knowledge distillation [64] [61] are effective in compensating for the accuracy loss.

2.2. Quantization in model compression

Quantization is another orthogonal technique in the model compression area. It refers to the technique [56] of applying formats other than the standard 32-bit single-precision floating-point (FP32) data type to weight parameters, inputs, and activations when executing a neural model. Quantization can significantly speed up model inference because low-precision formats have higher computational throughput support in many processors [35] [17] [2]. Meanwhile, low-precision representation helps to reduce the memory bandwidth pressure and can save much memory-system operation time through improved cache utilization.

Post Training Quantization (PTQ) [18] and Quantization Aware Training (QAT) [30] are the two main strategies in quantization. PTQ directly calibrates on limited sample inputs [31] to find the optimal clipping threshold and scale factor that minimize the quantization noise [3]. PTQ is preferred [47] when the whole training dataset is not accessible [21]. However, it is a non-trivial effort [28] [65] [25] [23] to ensure that a PTQ-quantized vision transformer model has no apparent accuracy decrease, and the accuracy degradation is more serious when going below 8-bit formats [47]. QAT inserts quantization and de-quantization nodes [37] into the floating-point model structure, then undergoes a fine-tuning process to learn the scale factor adjustment with minimal influence on accuracy [30]. Considering that some activation structures like GeLU [13] and Swish [42] are more sensitive [23] than ReLU [1], some efforts have been made to design specific QAT [23] [22] for vision transformers. Moreover, QAT can provide more quantization robustness for lower-bit formats [23].

Previous efforts to design PTQ and QAT approaches for vision transformers mainly focused on accuracy improvement. Due to the lack of hardware characteristics and acceleration library support, some quantized models that use 6 bits [28] or a floating-point learnable bit-width like 3.7 bits [23] to represent weights and activations cannot get the expected speedup on general acceleration hardware, like GPU [34] [35] and TPU [45]. Moreover, supporting a specific bit-width quantization, like 6 bits, is a non-trivial effort. End-users need to program the FPGA hardware [22] and develop specific bit-width libraries like Basic Linear Algebra Subprograms (BLAS) [19], which is a heavy burden for actual deployment.

3. Boost vision transformer on GPU

GPUSQ-ViT mainly contains 2:4 structured sparse pruning and sparse-distillation-aware QAT workflows. We further explain the 2:4 sparse pattern in Section 3.1, and how to compress each part of a vision transformer model according to the 2:4 sparse pattern in Sections 3.2 and 3.3. Section 3.4 describes the GPUSQ-ViT design as a whole.

3.1. Fine-grained structured sparsity on GPU

As shown in Figure 1, the sparse GEMM performs the sparse matrix × dense matrix = dense matrix operation by skipping the redundant zero-value computation with sparse Tensor Core acceleration. For example, matrix A of size M × K follows the 2:4 fine-grained structured sparse pattern, and the dense matrix B is of size K × N. If we use the dense GEMM to calculate between matrices A and B, the zero values in A are not skipped during computation, and the entire M × N × K dense GEMM calculates the result matrix C of size M × N in T GPU cycles. If we use the sparse GEMM, only the non-zero elements in each row of matrix A and the corresponding elements from matrix B, which the sparse Tensor Core automatically picks out without run-time overhead, are calculated. So the entire M × N × K sparse GEMM calculates the same result matrix C of size M × N but needs only T/2 GPU cycles, leading to a 2× math throughput speedup.

Figure 2. Storage formats for 2:4 fine-grained structured sparse pattern and metadata with FP16, INT8 and INT4 operators. (w, x, y, z denote the non-zero elements.)

The 2:4 sparsity uses 2-bit metadata per non-zero element to indicate the position of two non-zero elements in
every four adjacent elements in a row of matrix A with FP16 and INT8 data formats. The 2:4 sparsity instruction for the INT4 data format differs from FP16 and INT8. Matrix A is defined as pair-wise structured sparse at a granularity of 4:8. In other words, each chunk of eight adjacent elements in a row of matrix A has four zero and four non-zero values. Further, the zero and non-zero values are clustered in sub-chunks of two elements each within the eight-wide chunk, i.e., each two-wide sub-chunk within the eight-wide chunk must be all zeroes or all non-zeroes. Only the four non-zero values are stored in the compressed matrix, and two 2-bit indices in the metadata indicate the position of the two two-wide sub-chunks with non-zero values in the eight-wide chunk of a row of matrix A. In conclusion, the sparse formats for FP16, INT8, and INT4 lead to 43.75%, 37.5%, and 37.5% savings in storage, respectively. GPUSQ-ViT first compresses the model as 2:4 FP16 sparse, then further quantizes it to 2:4 INT8 or INT4 sparse for best deployment efficiency.

Because the 2:4 fine-grained structured sparse pattern is well supported on NVIDIA GPUs and the corresponding libraries for math acceleration and memory saving, we are motivated to design the compression strategy for vision transformer models to meet this sparse pattern. Moreover, the 2:4 sparse GEMM supports low-precision formats like INT8 and INT4. So it is natural to combine sparsity and quantization jointly in the proposed strategy and further boost the actual deployment performance on GPUs.

3.2. Apply structured sparsity in transformer block

Figure 3. Illustration of applying the 2:4 fine-grained structured sparsity in a vision transformer. The target layers include the patch embedding, the final linear projection, as well as the feed-forward and linear projection layers inside each transformer block.

The transformer block [52] is the fundamental building structure in various vision transformers. The majority of the weight parameters and the execution time are taken in stacked transformer blocks. For example, about 96% of the weight parameters and 95% of the inference time are from the transformer blocks in Swin Transformer [27]. So we focus on how to apply the 2:4 fine-grained structured sparsity in the transformer block.

Transformer blocks used in vision transformer models are directly borrowed from [9] [51] or make tiny changes [27] [55] to the standard transformer block introduced in the naive attention mechanism [52]. For example, the transformer block in the Swin Transformer model is built by replacing the standard multi-head attention module with a shifted windows attention module [27], with other layers kept the same as in the standard transformer block. Without loss of generality for the proposed method, we explore the utilization of 2:4 sparsity on a standard transformer block. 2:4 fine-grained structured sparsity accelerates GEMM operations, so the Q, K, and V projection layers, the linear projection layer in the multi-head attention module, and the linear projection layers in the feed-forward module are the proper targets to apply it to, as shown in the zoomed-in parts of Figure 3.

3.3. Apply structured sparsity in patch embedding

The vision transformer paradigm splits each input image into small square patches [9], and each image patch is treated as a token in the same way as in the NLP domain. In vision transformer models, the following trainable linear embedding process is handled by a patch embedding layer and is usually implemented as a strided convolution [9] [27]. Consider input images organized in an N × C × H × W batched data format, where each image is divided into small square patches of shape P × P. Here N refers to the batch size, C to the number of input channels, H and W to the height and width of an input image, and P to the size of each patch. So there will be C × (H × W)/(P × P) patches for each image, and each patch will be flattened as a token with shape 1 × P². Suppose the given embedding dimension is denoted as D_embed. In that case, the patch embedding layer can be implemented with a convolution layer with C as the input channel, D_embed as the output channel, and kernel size and stride step equal to P. The total Floating Point Operations (FLOPs) of the patch embedding layer is 2 × N × C × H × W × D_embed.

The strided-convolution layer is executed as an implicit GEMM [7] [36] on GPUs, which the 2:4 fine-grained structured sparsity can also accelerate, as shown in the leftmost part of Figure 3. The implicit GEMM converts the weight of the strided convolution into matrix A with width C × P × P, which is the target dimension for applying the 2:4 sparsity. It helps to save half of the total FLOPs.

3.4. Overall GPUSQ-ViT compression method

GPUSQ-ViT mainly contains 2:4 structured sparse pruning and sparse-distillation-aware QAT workflows, as shown in Figure 5. KD is applied in each workflow as auxiliary accuracy compensation.

2:4 Structured Sparse Pruning aims to compress the
dense floating-point model M_DF as the sparse floating-point model M_SF. Based on Sections 3.2 and 3.3, we can compress each part of a vision transformer model according to the GPU-friendly 2:4 fine-grained structured sparse pattern. To best compensate for the accuracy of M_SF, we apply KD [14], which can effectively transfer the predicted hard label or soft logits from a teacher model with appealing performance to a student model. If the student model wants to learn more, feature-based KD is applied to mimic the teacher model's feature maps. In the 2:4 structured sparse pruning workflow, the three KD strategies are jointly used.

[Figure panels: No., Input, Stage 1, Stage 2, Stage 3, Stage 4]

input as an Egyptian cat, and the base-sized model classifies it as a Border collie. Different classified labels influence the CAM to pay attention to totally different features of a cat and a collie, respectively. It inspires us to enable mimic feature learning only when the teacher and student models have the same classification labels; otherwise, the mimic behavior is skipped.

Denoting the distillation losses for the hard label, soft logits, and feature maps as L_{hard\_label}^{prune}, L_{soft\_logits}^{prune}, and L_{feature}^{prune}, respectively, and their weight factors as α, β, γ, the overall sparse pruning loss L_{prune} is calculated as follows:

L_{prune} = \alpha * L_{hard\_label}^{prune} + \beta * L_{soft\_logits}^{prune} + \gamma * L_{feature}^{prune}    (1)
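
As a minimal sketch of how the weighted sum in Equation (1) could be evaluated for one sample, the block below combines three terms with the factors α, β, γ and skips the feature term when the teacher and student predict different classes. The concrete loss forms (cross-entropy against the teacher's predicted label, KL divergence on softened logits, mean squared error on flattened features), the `temperature` parameter, and all function names are assumptions for illustration; the paper defers the exact loss items to its Algorithm 1 in the Appendix.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def prune_loss(student_logits, teacher_logits, student_feat, teacher_feat,
               alpha=1.0, beta=10.0, gamma=5.0, temperature=1.0):
    """Weighted sum of hard-label, soft-logits, and feature terms, as in Eq. (1)."""
    t_label = argmax(teacher_logits)
    s_label = argmax(student_logits)
    # Hard-label term: cross-entropy against the teacher's predicted class (assumed form).
    hard = -math.log(softmax(student_logits)[t_label])
    # Soft-logits term: KL(teacher || student) on softened outputs (assumed form).
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    # Feature term: MSE on flattened features, only when both models agree on the class.
    if t_label == s_label:
        feat = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(teacher_feat)
    else:
        feat = 0.0
    return alpha * hard + beta * soft + gamma * feat
```

The default factors follow the values α = 1, β = 10, γ = 5 reported in the experiments. Because the hard-label term here uses the teacher's prediction rather than a ground-truth annotation, the sketch needs no labels, which matches the unsupervised learning style the paper mentions.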
loss value is larger, then we give a smaller weight factor to the corresponding feature-based calibration loss, to indicate that even though the quantization compression leads to a difference between the M_SF and M_SQ models, this difference has a low probability of having a real influence on the quantized model's final accuracy. That is the reason why we name the GPUSQ-ViT quantization workflow sparse-distillation-aware QAT. Denoting the calibration losses for the hard label, soft logits, and feature maps as L_{hard\_label}^{calibrate}, L_{soft\_logits}^{calibrate}, and L_{feature}^{calibrate}, respectively, with the same weight factors α, β, γ, the overall quantization calibration loss L_{calibrate} is calculated as follows:

L_{calibrate} = \alpha * L_{hard\_label}^{calibrate} + \beta * L_{soft\_logits}^{calibrate} + \gamma * L_{feature}^{calibrate}    (2)

The sparse-distillation-aware QAT workflow minimizes the L_{calibrate} loss w.r.t. the weight parameters of the M_SQ model. The details of each loss item in GPUSQ-ViT are provided in Algorithm 1 in the Appendix.

4. Experiments

For the experiments in this paper, we choose PyTorch [41] version 1.12.0 as the framework to implement all algorithms. The results of the dense model training, sparse compression, and QAT experiments are obtained with A100 [35] GPU clusters. The acceleration performance results for deployment are obtained with the A100 GPU and the AGX Orin chip [38] to represent the server and edge device scenarios, respectively. Both A100 and Orin have Tensor Core [39] support for 2:4 structured sparsity and mixed-precision calculation among FP16, INT8, and INT4. All the reference algorithms use the default data type provided in public repositories.

4.1. Compression efficacy for classification task

To evaluate the compression efficacy of GPUSQ-ViT and compare it with prior arts on the image classification task, DeiT [51] (https://github.com/facebookresearch/deit) and Swin Transformer [27] (https://github.com/microsoft/Swin-Transformer) are chosen as the experiment target models. For the state-of-the-art vision transformer compression methods, we choose Dyn-ViT [43], MiniViT [66], UVC [64], PS-ViT [50], IA-RED² [40], MultiViT [15], SViTE [8] and S²ViTE [8] as the reference methods from the sparse pruning category, and
we choose the FQ-ViT [25], Q-ViT [23], PTQ-ViT [28] and PTQ4ViT [65] as the reference methods from quantization category. For GPUSQ-ViT, the loss adjustment factors for hard label, soft logits and feature-based losses apply α = 1, β = 10, and γ = 5, respectively. The model size and FLOPs comparison results are shown in Table 1.

We can apply GPUSQ-ViT to compress each vision model as INT8 and INT4 versions. For INT8 compressed models, GPUSQ-ViT can bring 6.4× reduction for model size and 31× reduction for FLOPs with negligible accuracy drop. For INT4 compressed models, GPUSQ-ViT can get 12.7× and 62× reduction for model size and FLOPs with a small accuracy drop. Compared with both sparse pruning and quantization prior arts, GPUSQ-ViT can steadily provide more reduction for model size and FLOPs.

Model | Input | Method | Format | Params (M) | FLOPs (G) | Top-1 Acc (%) | Top-5 Acc (%)
DeiT-Tiny | 224² | Baseline | FP32 | 5.72 | 1.30 | 72.2 | 91.1
DeiT-Tiny | 224² | S²ViTE | FP32 | 4.21 | 0.99 | 70.1 | 90.1
DeiT-Tiny | 224² | SViTE | FP32 | 3.46 | 0.86 | 71.8 | 90.6
DeiT-Tiny | 224² | MiniViT | FP32 | 3.09 | 1.30 | 72.8 | 91.6
DeiT-Tiny | 224² | PS-ViT | FP32 | 3.08 | 0.70 | 72.0 | 91.0
DeiT-Tiny | 224² | UVC | FP32 | 3.08 | 0.69 | 71.8 | 90.6
DeiT-Tiny | 224² | FQ-ViT | INT8 | 1.43 | 1.27 | 71.6 | 90.6
DeiT-Tiny | 224² | GPUSQ-ViT | INT8 | 0.90 (6.4×) | 0.04 (31×) | 72.4 (+0.2) | 90.9 (-0.2)
DeiT-Tiny | 224² | Q-ViT | INT4 | 0.72 | 0.34 | 71.6 | 90.5
DeiT-Tiny | 224² | GPUSQ-ViT | INT4 | 0.45 (12.7×) | 0.02 (62×) | 71.7 (-0.5) | 90.6 (-0.5)
DeiT-Small | 224² | Baseline | FP32 | 22.05 | 4.60 | 79.9 | 95.0
DeiT-Small | 224² | Dyn-ViT | FP32 | 26.90 | 3.70 | 82.0 | 95.5
DeiT-Small | 224² | MultiViT | FP32 | 16.76 | 2.90 | 79.9 | 94.9
DeiT-Small | 224² | IA-RED² | FP32 | 14.99 | 3.10 | 79.1 | 94.5
DeiT-Small | 224² | S²ViTE | FP32 | 14.60 | 2.12 | 79.2 | 94.6
DeiT-Small | 224² | MiniViT | FP32 | 11.45 | 4.70 | 80.7 | 95.6
DeiT-Small | 224² | PS-ViT | FP32 | 12.46 | 2.59 | 79.4 | 94.7
DeiT-Small | 224² | UVC | FP32 | 12.70 | 2.65 | 79.4 | 94.7
DeiT-Small | 224² | SViTE | FP32 | 8.90 | 1.38 | 79.4 | 94.7
DeiT-Small | 224² | PTQ-ViT | INT8 | 5.51 | 5.67 | 78.1 | 94.2
DeiT-Small | 224² | PTQ4ViT | INT8 | 5.51 | 3.45 | 79.5 | 94.7
DeiT-Small | 224² | FQ-ViT | INT8 | 5.51 | 4.61 | 79.2 | 94.6
DeiT-Small | 224² | GPUSQ-ViT | INT8 | 3.46 (6.4×) | 0.14 (31×) | 80.3 (+0.4) | 95.1 (+0.1)
DeiT-Small | 224² | Q-ViT | INT4 | 2.76 | 1.22 | 80.1 | 94.9
DeiT-Small | 224² | GPUSQ-ViT | INT4 | 1.73 (12.7×) | 0.07 (62×) | 79.3 (-0.6) | 94.8 (-0.2)
DeiT-Base | 224² | Baseline | FP32 | 86.57 | 17.60 | 81.8 | 95.6
DeiT-Base | 224² | MultiViT | FP32 | 64.93 | 11.20 | 82.3 | 96.0
DeiT-Base | 224² | IA-RED² | FP32 | 58.01 | 11.80 | 80.9 | 95.0
DeiT-Base | 224² | S²ViTE | FP32 | 56.80 | 11.77 | 82.2 | 95.8
DeiT-Base | 224² | MiniViT | FP32 | 44.10 | 17.70 | 83.2 | 96.5
DeiT-Base | 224² | PS-ViT | FP32 | 48.22 | 9.80 | 81.5 | 95.4
DeiT-Base | 224² | UVC | FP32 | 39.40 | 8.01 | 80.6 | 94.5
DeiT-Base | 224² | SViTE | FP32 | 34.80 | 7.48 | 81.3 | 95.3
DeiT-Base | 224² | PTQ-ViT | INT8 | 21.64 | 20.10 | 81.3 | 95.2
DeiT-Base | 224² | FQ-ViT | INT8 | 21.64 | 17.48 | 81.2 | 95.2
DeiT-Base | 224² | PTQ4ViT | INT8 | 21.64 | 13.10 | 81.5 | 95.3
DeiT-Base | 224² | GPUSQ-ViT | INT8 | 13.55 (6.4×) | 0.55 (31×) | 82.9 (+1.1) | 96.4 (+0.8)
DeiT-Base | 224² | PTQ4ViT | INT4 | 10.82 | 6.94 | 75.9 | 95.3
DeiT-Base | 224² | GPUSQ-ViT | INT4 | 6.78 (12.7×) | 0.28 (62×) | 81.6 (-0.2) | 95.5 (-0.1)
DeiT-Base | 384² | Baseline | FP32 | 86.86 | 55.60 | 82.9 | 96.2
DeiT-Base | 384² | IA-RED² | FP32 | 54.31 | 34.70 | 81.9 | 95.7
DeiT-Base | 384² | MiniViT | FP32 | 44.39 | 56.90 | 84.7 | 97.2
DeiT-Base | 384² | PTQ4ViT | INT8 | 21.71 | 41.70 | 82.9 | 96.3
DeiT-Base | 384² | GPUSQ-ViT | INT8 | 13.62 (6.4×) | 1.74 (31×) | 82.9 (+0.0) | 96.3 (+0.1)
DeiT-Base | 384² | GPUSQ-ViT | INT4 | 6.81 (12.7×) | 0.87 (62×) | 82.4 (-0.5) | 96.1 (-0.1)
Swin-Tiny | 224² | Baseline | FP32 | 28.29 | 4.49 | 81.2 | 95.5
Swin-Tiny | 224² | Dyn-ViT | FP32 | 19.80 | 4.00 | 80.9 | 95.4
Swin-Tiny | 224² | MiniViT | FP32 | 12.00 | 4.60 | 81.4 | 95.7
Swin-Tiny | 224² | FQ-ViT | INT8 | 7.07 | 4.39 | 80.5 | 95.2
Swin-Tiny | 224² | PTQ4ViT | INT8 | 7.07 | 3.37 | 81.2 | 95.4
Swin-Tiny | 224² | GPUSQ-ViT | INT8 | 4.43 (6.4×) | 0.14 (31×) | 81.2 (+0.0) | 95.5 (+0.0)
Swin-Tiny | 224² | Q-ViT | INT4 | 3.54 | 1.10 | 80.6 | 95.2
Swin-Tiny | 224² | GPUSQ-ViT | INT4 | 2.21 (12.7×) | 0.07 (62×) | 80.7 (-0.5) | 95.3 (-0.2)
Swin-Small | 224² | Baseline | FP32 | 49.61 | 8.75 | 83.2 | 96.2
Swin-Small | 224² | Dyn-ViT | FP32 | 34.73 | 6.90 | 83.2 | 96.3
Swin-Small | 224² | MiniViT | FP32 | 26.46 | 8.93 | 83.6 | 97.0
Swin-Small | 224² | FQ-ViT | INT8 | 12.40 | 8.77 | 82.7 | 96.1
Swin-Small | 224² | PTQ4ViT | INT8 | 12.40 | 6.56 | 83.1 | 96.2
Swin-Small | 224² | GPUSQ-ViT | INT8 | 7.77 (6.4×) | 0.27 (31×) | 83.1 (-0.1) | 96.3 (+0.1)
Swin-Small | 224² | GPUSQ-ViT | INT4 | 3.88 (12.7×) | 0.14 (62×) | 82.8 (-0.4) | 96.2 (+0.0)
Swin-Base | 224² | Baseline | FP32 | 87.77 | 15.44 | 83.5 | 96.5
Swin-Base | 224² | Dyn-ViT | FP32 | 61.44 | 12.10 | 83.4 | 96.4
Swin-Base | 224² | MiniViT | FP32 | 46.44 | 15.71 | 84.3 | 97.3
Swin-Base | 224² | FQ-ViT | INT8 | 21.94 | 15.33 | 83.0 | 96.3
Swin-Base | 224² | PTQ4ViT | INT8 | 21.94 | 11.58 | 83.2 | 96.3
Swin-Base | 224² | GPUSQ-ViT | INT8 | 13.73 (6.4×) | 0.48 (31×) | 83.4 (-0.1) | 96.4 (-0.1)
Swin-Base | 224² | GPUSQ-ViT | INT4 | 6.87 (12.7×) | 0.24 (62×) | 83.2 (-0.3) | 96.3 (-0.2)
Swin-Base | 384² | Baseline | FP32 | 87.90 | 47.11 | 84.5 | 97.0
Swin-Base | 384² | MiniViT | FP32 | 47.00 | 49.40 | 85.5 | 97.6
Swin-Base | 384² | PTQ4ViT | INT8 | 21.98 | 35.33 | 84.3 | 96.8
Swin-Base | 384² | GPUSQ-ViT | INT8 | 13.77 (6.4×) | 1.47 (31×) | 84.4 (-0.1) | 97.0 (+0.0)
Swin-Base | 384² | GPUSQ-ViT | INT4 | 6.88 (12.7×) | 0.74 (62×) | 84.4 (-0.1) | 96.9 (-0.1)

Table 1. Compare the model size and FLOPs of GPUSQ-ViT with state-of-the-art compression methods on classification task.

Moreover, GPUSQ-ViT can greatly boost the compressed models' deployment efficiency on GPUs with TensorRT toolkit [37] support of 2:4 sparsity. For INT8 compressed models, GPUSQ-ViT can bring 1.26-1.47× and 2.4-2.6× improvement for various DeiT and Swin Transformer models of latency and throughput on A100 GPU, and 1.21-1.32× and 1.79-2.07× improvement of latency and throughput on AGX Orin. For INT4 compressed models, GPUSQ-ViT can bring 1.39-1.79× and 3.22-3.43× improvement of latency and throughput on A100 GPU, and 1.57-1.69× and 2.11-2.51× improvement of latency and throughput on AGX Orin, as shown in Table 2.

Model | Input | Method | Format | A100 FPS (BS=1) | A100 FPS (BS=256) | AGX Orin FPS (BS=1) | AGX Orin FPS (BS=64)
DeiT-Tiny | 224² | Baseline | FP32 | 3067 | 14934 | 2671 | 4005
DeiT-Tiny | 224² | GPUSQ-ViT | INT8 | 3864 (1.26×) | 38978 (2.60×) | 3232 (1.21×) | 7329 (1.83×)
DeiT-Tiny | 224² | GPUSQ-ViT | INT4 | 4263 (1.39×) | 51224 (3.43×) | 4193 (1.57×) | 8531 (2.13×)
DeiT-Small | 224² | Baseline | FP32 | 1256 | 5277 | 877 | 1280
DeiT-Small | 224² | GPUSQ-ViT | INT8 | 1629 (1.30×) | 13359 (2.53×) | 1096 (1.25×) | 2291 (1.79×)
DeiT-Small | 224² | GPUSQ-ViT | INT4 | 1809 (1.44×) | 17775 (3.37×) | 1447 (1.65×) | 2701 (2.11×)
DeiT-Base | 224² | Baseline | FP32 | 485 | 1682 | 351 | 513
DeiT-Base | 224² | GPUSQ-ViT | INT8 | 645 (1.33×) | 4136 (2.46×) | 453 (1.29×) | 939 (1.83×)
DeiT-Base | 224² | GPUSQ-ViT | INT4 | 714 (1.47×) | 5643 (3.35×) | 569 (1.62×) | 1206 (2.35×)
DeiT-Base | 384² | Baseline | FP32 | 256 | 689 | 233 | 303
DeiT-Base | 384² | GPUSQ-ViT | INT8 | 350 (1.37×) | 1730 (2.51×) | 308 (1.32×) | 561 (1.85×)
DeiT-Base | 384² | GPUSQ-ViT | INT4 | 394 (1.54×) | 2315 (3.36×) | 371 (1.59×) | 761 (2.51×)
Swin-Tiny | 224² | Baseline | FP32 | 621 | 2907 | 544 | 968
Swin-Tiny | 224² | GPUSQ-ViT | INT8 | 807 (1.30×) | 6975 (2.40×) | 675 (1.24×) | 1946 (2.01×)
Swin-Tiny | 224² | GPUSQ-ViT | INT4 | 910 (1.46×) | 9911 (3.41×) | 892 (1.64×) | 2275 (2.35×)
Swin-Small | 224² | Baseline | FP32 | 330 | 1802 | 309 | 631
Swin-Small | 224² | GPUSQ-ViT | INT8 | 426 (1.29×) | 4411 (2.45×) | 392 (1.27×) | 1306 (2.07×)
Swin-Small | 224² | GPUSQ-ViT | INT4 | 510 (1.55×) | 5942 (3.30×) | 516 (1.67×) | 1521 (2.41×)
Swin-Base | 224² | Baseline | FP32 | 282 | 1261 | 247 | 433
Swin-Base | 224² | GPUSQ-ViT | INT8 | 388 (1.37×) | 3226 (2.56×) | 309 (1.25×) | 842 (1.94×)
Swin-Base | 224² | GPUSQ-ViT | INT4 | 485 (1.72×) | 4071 (3.22×) | 410 (1.66×) | 1063 (2.45×)
Swin-Base | 384² | Baseline | FP32 | 154 | 531 | 140 | 226
Swin-Base | 384² | GPUSQ-ViT | INT8 | 226 (1.47×) | 1310 (2.47×) | 180 (1.28×) | 414 (1.83×)
Swin-Base | 384² | GPUSQ-ViT | INT4 | 369 (1.79×) | 1747 (3.29×) | 238 (1.69×) | 562 (2.48×)

Table 2. Deployment efficiency of GPUSQ-ViT compressed DeiT and Swin Transformer models on NVIDIA GPUs. The latency is measured with batch size 1 on a single A100 GPU and AGX Orin. The throughput is measured with batch size fixed to 256 on a single A100 GPU and with batch size fixed to 64 on a single AGX Orin.

[Figure 6 panels: Input; Swin Transformer Tiny (Baseline, INT8, INT4); Swin Transformer Base (Baseline, INT8, INT4)]

Figure 6. CAM visualization for Swin Transformer baseline dense models and GPUSQ-ViT compressed INT8 and INT4 models.

To compare between dense and GPUSQ-ViT compressed models in visualization, we apply CAM for tiny- and base-sized Swin Transformer models' attention on final norm layer. The results are shown in Figure 6.

4.2. Compression efficacy for detection task

To evaluate the compression efficacy of GPUSQ-ViT on the object detection task, Mask R-CNN [11] (https://github.com/SwinTransformer/Swin-Transformer-Object-Detection), DETR [5] (https://github.com/facebookresearch/detr) and Deformable-DETR [70] (https://github.com/fundamentalvision/Deformable-DETR) are chosen as the target mod-
els. GPUSQ-ViT compression results on COCO dataset [24] are shown in Table 3.

Model | Backbone | Method | Format | Params (M) | FLOPs (G) | bbox mAP | segm mAP
Mask R-CNN | Swin-Tiny | Baseline | FP32 | 48 | 267 | 46.0 | 41.6
Mask R-CNN | Swin-Tiny | GPUSQ-ViT | INT8 | 7.5 (6.4×) | 8.8 (30.5×) | 46.0 (+0.0) | 41.6 (+0.0)
Mask R-CNN | Swin-Tiny | GPUSQ-ViT | INT4 | 3.8 (12.7×) | 4.4 (61.0×) | 45.7 (-0.3) | 41.4 (-0.2)
Mask R-CNN | Swin-Small | Baseline | FP32 | 69 | 359 | 48.5 | 43.3
Mask R-CNN | Swin-Small | GPUSQ-ViT | INT8 | 10.8 (6.4×) | 11.8 (30.5×) | 48.6 (+0.1) | 43.4 (+0.1)
Mask R-CNN | Swin-Small | GPUSQ-ViT | INT4 | 5.4 (12.7×) | 5.9 (61.0×) | 48.3 (-0.2) | 43.2 (-0.1)
Cascade Mask R-CNN | Swin-Tiny | Baseline | FP32 | 86 | 745 | 48.1 | 41.7
Cascade Mask R-CNN | Swin-Tiny | GPUSQ-ViT | INT8 | 13.4 (6.4×) | 24.4 (30.5×) | 48.1 (+0.0) | 41.8 (+0.1)
Cascade Mask R-CNN | Swin-Tiny | GPUSQ-ViT | INT4 | 6.8 (12.7×) | 12.2 (61.0×) | 47.8 (-0.3) | 41.5 (-0.2)
Cascade Mask R-CNN | Swin-Small | Baseline | FP32 | 107 | 838 | 51.9 | 45.0
Cascade Mask R-CNN | Swin-Small | GPUSQ-ViT | INT8 | 16.7 (6.4×) | 27.5 (30.5×) | 52.0 (+0.1) | 45.2 (+0.2)
Cascade Mask R-CNN | Swin-Small | GPUSQ-ViT | INT4 | 8.4 (12.7×) | 13.7 (61.0×) | 51.7 (-0.2) | 44.9 (-0.1)
Cascade Mask R-CNN | Swin-Base | Baseline | FP32 | 145 | 982 | 51.9 | 45.0
Cascade Mask R-CNN | Swin-Base | GPUSQ-ViT | INT8 | 22.7 (6.4×) | 32.2 (30.5×) | 52.1 (+0.2) | 45.3 (+0.3)
Cascade Mask R-CNN | Swin-Base | GPUSQ-ViT | INT4 | 11.4 (12.7×) | 16.1 (61.0×) | 51.8 (-0.1) | 44.9 (-0.1)
DETR | ResNet50 | Baseline | FP32 | 41 | 86 | 42.0 | N/A
DETR | ResNet50 | GPUSQ-ViT | INT8 | 6.4 (6.4×) | 2.8 (30.5×) | 42.0 (+0.0) | N/A
DETR | ResNet50 | GPUSQ-ViT | INT4 | 3.2 (12.7×) | 1.4 (61.0×) | 41.7 (-0.3) | N/A
Deformable DETR | ResNet50 | Baseline | FP32 | 40 | 173 | 44.5 | N/A
Deformable DETR | ResNet50 | GPUSQ-ViT | INT8 | 6.3 (6.4×) | 5.7 (30.5×) | 44.5 (+0.0) | N/A
Deformable DETR | ResNet50 | GPUSQ-ViT | INT4 | 3.1 (12.7×) | 2.8 (61.0×) | 44.1 (-0.4) | N/A

Table 3. Effectiveness of GPUSQ-ViT on object detection task.

4.3. Compression efficacy for segmentation task

To evaluate the compression efficacy of GPUSQ-ViT on the semantic segmentation task, UPerNet [57]⁷ is chosen

Model | Factor α | Factor β | Factor γ | Enable QAT Weight Factor | INT8 Top-1 Acc (%) | INT8 Top-5 Acc (%) | INT4 Top-1 Acc (%) | INT4 Top-5 Acc (%)
DeiT-Base (224²) | 1 | 10 | 5 | ✓ | 82.9 (+1.1) | 96.4 (+0.8) | 81.6 (-0.2) | 95.5 (-0.1)
DeiT-Base (224²) | 1 | 10 | 5 | ✗ | 82.4 (+0.6) | 96.1 (+0.5) | 80.1 (-1.7) | 94.3 (-1.3)
DeiT-Base (224²) | 1 | 0 | 5 | ✓ | 82.7 (+0.9) | 96.2 (+0.6) | 81.3 (-0.5) | 95.2 (-0.4)
DeiT-Base (224²) | 1 | 10 | 0 | ✓ | 82.2 (+0.4) | 95.8 (+0.2) | 80.8 (-1.0) | 94.8 (-0.8)
DeiT-Base (224²) | 1 | 20 | 5 | ✓ | 82.9 (+1.1) | 96.4 (+0.8) | 81.6 (-0.2) | 95.6 (+0.0)
DeiT-Base (224²) | 1 | 30 | 5 | ✓ | 82.9 (+1.1) | 96.5 (+0.9) | 81.6 (-0.2) | 95.6 (+0.0)
DeiT-Base (224²) | 1 | 10 | 10 | ✓ | 82.8 (+1.0) | 96.5 (+0.9) | 81.5 (-0.3) | 95.5 (-0.1)
DeiT-Base (224²) | 1 | 10 | 2.5 | ✓ | 82.8 (+1.0) | 96.5 (+0.9) | 81.5 (-0.3) | 95.6 (+0.0)
Swin-Base (224²) | 1 | 10 | 5 | ✓ | 83.4 (-0.1) | 96.4 (-0.1) | 83.2 (-0.3) | 96.3 (-0.2)
Swin-Base (224²) | 1 | 10 | 5 | ✗ | 82.9 (-0.6) | 96.0 (-0.5) | 81.5 (-2.0) | 94.9 (-1.6)
Swin-Base (224²) | 1 | 0 | 5 | ✓ | 83.2 (-0.3) | 96.2 (-0.3) | 82.9 (-0.6) | 96.0 (-0.5)
Swin-Base (224²) | 1 | 10 | 0 | ✓ | 82.7 (-0.8) | 95.7 (-0.8) | 82.4 (-1.1) | 95.5 (-1.0)
Swin-Base (224²) | 1 | 20 | 5 | ✓ | 83.4 (-0.1) | 96.4 (-0.1) | 83.2 (-0.3) | 96.3 (-0.2)
Swin-Base (224²) | 1 | 30 | 5 | ✓ | 83.4 (-0.1) | 96.4 (-0.1) | 83.2 (-0.3) | 96.4 (-0.1)
Swin-Base (224²) | 1 | 10 | 10 | ✓ | 83.3 (-0.2) | 96.4 (-0.1) | 83.1 (-0.4) | 96.3 (-0.2)
Swin-Base (224²) | 1 | 10 | 2.5 | ✓ | 83.3 (-0.2) | 96.4 (-0.1) | 83.1 (-0.4) | 96.4 (-0.1)

Table 6. Ablation study of the loss adjustment factors and sparse-distillation-aware weight factors of GPUSQ-ViT method.

and feature-based losses (α, β, γ) and enabling sparse-distillation-aware weight factor on GPUSQ-ViT compressed model accuracy is shown in Table 6. From the ablation results, we can find enabling sparse-distillation-aware weight factor has an apparent boost for the compressed models' accuracy. Such a boost effect is more in-
as the target model. GPUSQ-ViT compression results on fluential on INT4 than INT8 model, because disabling this
ADE20K dataset [67] are shown in Table 4. weight factor will see a more significant drop in INT4 com-
Model Backbone Method Format Params (M) FLOPs (G) Mean IoU (%) Pixel Acc. (%) pressed model. The potential reason is sparse-distillation-
Swin-Tiny
Baseline
GPUSQ-ViT
FP32
INT8
60
9.4 (6.4×)
945
31.2 (30.3×)
44.51
44.47 (-0.04)
81.09
81.01 (-0.08)
aware weight factor indicates how much influence the quan-
GPUSQ-ViT
Baseline
INT4
FP32
4.7 (12.7×)
81
15.6 (60.6×)
1038
43.93 (-0.58)
47.64
80.89 (-0.20)
82.45
tization error from each critical layer has on the final accu-
UPerNet Swin-Small GPUSQ-ViT
GPUSQ-ViT
INT8
INT4
12.7 (6.4×)
6.4 (12.7×)
34.3 (30.3×)
17.1 (60.6×)
47.66 (+0.02)
47.15 (-0.49)
82.41 (-0.04)
82.30 (-0.15)
racy. So the distillation process can focus on mimicking the
Swin-Base
Baseline
GPUSQ-ViT
FP32
INT8
121
18.9 (6.4×)
1188
39.2 (30.3×)
48.13
48.18 (+0.05)
82.37
82.43 (+0.06)
layers with more accuracy influence, which is more effec-
GPUSQ-ViT INT4 9.5 (12.7×) 19.6 (60.6×) 47.86 (-0.27) 82.19 (-0.18)
tive for limited quantized bits. Then we can find disabling
Table 4. Effectiveness of GPUSQ-ViT on semantic segmentation. the feature-based distillation will lead to a more severe in-
GPUSQ-ViT provides good compression effects on de- fluence than disabling the soft logits distillation. It indi-
tection and segmentation tasks in Table 3 and 4 with small cates that mimicking feature maps is very helpful for accu-
accuracy gap to the dense baseline models. racy compensation in GPUSQ-ViT compression. Finally,
we can find GPUSQ-ViT is relatively robust to the soft log-
4.4. GPUSQ-ViT with unsupervised learning its and feature-based loss adjustment factors, i.e., within the
Because the compressed model can learn the represen- close range of β = 10 and γ = 5 the accuracy of com-
tation of target from dense model’s prediction when lack- pressed models are stable.
ing ground-truth label annotations, so GPUSQ-ViT can still
work well in unsupervised training, as shown in Table 5. 5. Conclusion and limitation
GPUSQ-ViT (INT8) GPUSQ-ViT (INT4)
Model Input
Top-1 Top-5 Top-1 Top-5 This paper is inspired by GPU’s acceleration characteris-
Acc(%) Acc(%) Acc(%) Acc(%) tic for 2:4 sparse pattern with various low-precision opera-
DeiT-Tiny 2242 72.0 (-0.2) 90.8 (-0.3) 71.4 (-0.8) 90.2 (-0.9) tors to design the GPUSQ-ViT compression method, which
DeiT-Small 2242 79.8 (-0.1) 94.9 (-0.1) 79.2 (-0.7) 94.2 (-0.8) can boost deployment efficiency for various vision trans-
DeiT-Base 2242 82.0 (+0.2) 95.7 (+0.1) 81.1 (-0.7) 95.0 (-0.6)
DeiT-Base 3842 82.5 (-0.4) 95.9 (-0.3) 82.0 (-0.9) 95.7 (-0.5)
former models of benchmarking tasks on NVIDIA GPUs.
Swin-Tiny 2242 80.8 (-0.4) 95.2 (-0.3) 80.3 (-0.9) 94.9 (-0.6) We should notice a potential limitation. If the structured
Swin-Small 2242 82.7 (-0.5) 95.9 (-0.3) 82.3 (-0.9) 95.7 (-0.5) sparse support is changed or extended to support other pat-
Swin-Base 2242 82.9 (-0.6) 96.1 (-0.4) 82.5 (-1.0) 95.7 (-0.8)
Swin-Base 3842 83.9 (-0.6) 96.6 (-0.4) 83.7 (-0.8) 96.4 (-0.6)
terns like 1:4 or 2:16, GPUSQ-ViT needs to make the ac-
cording adjustments to fit the new or more sparse patterns.
Table 5. Effectiveness of GPUSQ-ViT in unsupervised learning.
Acknowledgements This work is supported by Na-
4.5. Ablation study of GPUSQ-ViT tional Natural Science Foundation of China (No. 62071127,
U1909207 and 62101137), Shanghai Municipal Science
The ablation study to measure the influence of the dif-
and Technology Major Project (No.2021SHZDZX0103),
ferent adjustment factors for the hard label, soft logits,
Shanghai Natural Science Foundation (No. 23ZR1402900),
7 https://github.com/SwinTransformer/Swin- Transformer- Semantic- Segmentation Zhejiang Lab Project (No. 2021KH0AB05).
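The limitation noted in the conclusion concerns N:M patterns other than 2:4, such as 1:4 or 2:16. As a minimal illustration of what such a pattern means, the sketch below keeps the n largest-magnitude values in every group of m consecutive weights and zeroes the rest. The magnitude-based selection rule and the name `nm_prune_row` are assumptions made for illustration only; they are not taken from the GPUSQ-ViT implementation, and the sketch says nothing about which patterns a given GPU can accelerate.

```python
from typing import List


def nm_prune_row(row: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Zero all but the n largest-magnitude values in each group of m.

    Hypothetical helper: the magnitude criterion is an assumption,
    not the pruning criterion used by the paper.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Rank positions by magnitude, largest first (ties keep the earlier index).
        order = sorted(range(m), key=lambda i: -abs(group[i]))
        keep = set(order[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.2, 0.8]
pruned_24 = nm_prune_row(row, n=2, m=4)  # 2:4 pattern
pruned_14 = nm_prune_row(row, n=1, m=4)  # 1:4 pattern
print(pruned_24)
print(pruned_14)
```

With n = 2 and m = 4 the sketch keeps half of the weights; 1:4 keeps a quarter and 2:16 keeps one eighth, which is why sparser patterns would call for the adjustments mentioned in the conclusion.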
References

[1] Abien Fred Agarap. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375, 2018.
[2] Mohamed Arafa, Bahaa Fahim, Sailesh Kottapalli, Akhilesh Kumar, Lily P Looi, Sreenivas Mandava, Andy Rudoff, Ian M Steiner, Bob Valentine, Geetha Vedaraman, et al. Cascade lake: Next generation intel xeon scalable processor. IEEE Micro, 39(2):29–36, 2019.
[3] Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. Advances in Neural Information Processing Systems, 32, 2019.
[4] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment. 2020.
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[6] Arnav Chavan, Zhiqiang Shen, Zhuang Liu, Zechun Liu, Kwang-Ting Cheng, and Eric P Xing. Vision transformer slimming: Multi-dimension searching in continuous optimization space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4931–4941, 2022.
[7] Kumar Chellapilla, Sidd Puri, and Patrice Simard. High performance convolutional neural networks for document processing. In Tenth international workshop on frontiers in handwriting recognition. Suvisoft, 2006.
[8] Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, and Zhangyang Wang. Chasing sparsity in vision transformers: An end-to-end exploration. Advances in Neural Information Processing Systems, 34:19974–19988, 2021.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. 2020.
[10] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[15] Zejiang Hou and Sun-Yuan Kung. Multi-dimensional vision transformer compression via dependency guided gaussian process search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3669–3678, 2022.
[16] Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale. arXiv preprint arXiv:2206.03382, 2022.
[17] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th annual international symposium on computer architecture, pages 1–12, 2017.
[18] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
[19] Feng Li, Yunming Ye, Zhaoyang Tian, and Xiaofeng Zhang. Cpu versus gpu: which can perform matrix computation faster—performance comparison for basic linear algebra subprograms. Neural Computing and Applications, 31(8):4353–4365, 2019.
[20] Muyang Li, Ji Lin, Yaoyao Ding, Zhijian Liu, Jun-Yan Zhu, and Song Han. Gan compression: Efficient architectures for interactive conditional gans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5284–5294, 2020.
[21] Zhikai Li, Liping Ma, Mengjuan Chen, Junrui Xiao, and Qingyi Gu. Patch similarity aware data-free quantization for vision transformers. pages 154–170, 2022.
[22] Zhengang Li, Mengshu Sun, Alec Lu, Haoyu Ma, Geng Yuan, Yanyue Xie, Hao Tang, Yanyu Li, Miriam Leeser, Zhangyang Wang, et al. Auto-vit-acc: An fpga-aware automatic acceleration framework for vision transformer with mixed-scheme quantization. pages 109–116, 2022.
[23] Zhexin Li, Tong Yang, Peisong Wang, and Jian Cheng. Q-vit: Fully differentiable quantization for vision transformer. arXiv preprint arXiv:2201.07703, 2022.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[25] Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, and Shuchang Zhou. Fq-vit: Post-training quantization for fully quantized vision transformer. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 1173–1179, 2022.
[26] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022.
[27] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[28] Zhenhua Liu, Yunhe Wang, Kai Han, Wei Zhang, Siwei Ma, and Wen Gao. Post-training quantization for vision transformer. Advances in Neural Information Processing Systems, 34:28092–28103, 2021.
[29] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. Exploring the granularity of sparsity in convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 13–20, 2017.
[30] Jeffrey L McKinstry, Steven K Esser, Rathinakumar Appuswamy, Deepika Bablani, John V Arthur, Izzet B Yildiz, and Dharmendra S Modha. Discovering low-precision networks close to full-precision networks for efficient embedded inference. arXiv preprint arXiv:1809.04191, 2018.
[31] Szymon Migacz. NVIDIA 8-bit Inference with TensorRT. GPU Technology Conference, 2017.
[32] Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 5191–5198, 2020.
[33] Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating sparse deep neural networks. arXiv preprint arXiv:2104.08378, 2021.
[34] NVIDIA. NVIDIA Tesla V100 GPU Architecture, 2017.
[35] NVIDIA. NVIDIA A100 Tensor Core GPU Architecture, 2020.
[36] NVIDIA. NVIDIA CUTLASS, 2022.
[37] NVIDIA. NVIDIA TensorRT, 2022.
[38] NVIDIA-Orin. NVIDIA Jetson Agx Orin series technical brief, 2021.
[39] NVIDIA-TC. NVIDIA Tensor Core, 2020.
[40] Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. IA-RED²: Interpretability-aware redundancy reduction for vision transformers. Advances in Neural Information Processing Systems, 34:24898–24911, 2021.
[41] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In Advances in Neural Information Processing Systems-Autodiff Workshop, 2017.
[42] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
[43] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems, 34:13937–13949, 2021.
[44] Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. Tokenlearner: Adaptive space-time tokenization for videos. Advances in Neural Information Processing Systems, 34:12786–12797, 2021.
[45] K Sato. An in-depth look at google’s first tensor processing unit (tpu). Google Cloud Platform, 2017.
[46] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[47] Gil Shomron, Freddy Gabbay, Samer Kurzum, and Uri Weiser. Post-training sparsity-aware quantization. Advances in Neural Information Processing Systems, 34:17737–17748, 2021.
[48] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27, 2014.
[49] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
[50] Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, and Dacheng Tao. Patch slimming for efficient vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12165–12174, 2022.
[51] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
[52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[53] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 24–25, 2020.
[54] Hanrui Wang, Zhekai Zhang, and Song Han. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021.
[55] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
[56] Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius. Integer quantization for deep learning inference: Principles and empirical evaluation. arXiv preprint arXiv:2004.09602, 2020.
[57] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In European Conference on Computer Vision, pages 418–434. Springer, 2018.
[58] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
[59] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3060–3069, 2021.
[60] Tao Yang, Yunkun Liao, Jianping Shi, Yun Liang, Naifeng Jing, and Li Jiang. A winograd-based cnn accelerator with a fine-grained regular sparsity pattern. In 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), pages 254–261. IEEE, 2020.
[61] Zhendong Yang, Zhe Li, Ailing Zeng, Zexian Li, Chun Yuan, and Yu Li. Vitkd: Practical guidelines for vit feature knowledge distillation. arXiv preprint arXiv:2209.02432, 2022.
[62] Chong Yu. Minimally invasive surgery for sparse neural networks in contrastive manner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3589–3598, 2021.
[63] Chong Yu and Jeff Pool. Self-supervised generative adversarial compression. Advances in Neural Information Processing Systems, 33:8235–8246, 2020.
[64] Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen Yang, Ji Liu, and Zhangyang Wang. Unified visual transformer compression. 2022.
[65] Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. Ptq4vit: Post-training quantization framework for vision transformers. arXiv preprint arXiv:2111.12293, 2021.
[66] Jinnian Zhang, Houwen Peng, Kan Wu, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. Minivit: Compressing vision transformers with weight multiplexing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12145–12154, 2022.
[67] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3):302–321, 2019.
[68] Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, and Jose M Alvarez. Understanding the robustness in vision transformers. In International Conference on Machine Learning, pages 27378–27394. PMLR, 2022.
[69] Mingjian Zhu, Yehui Tang, and Kai Han. Vision transformer pruning. arXiv preprint arXiv:2104.08500, 2021.
[70] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. 2021.