

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Wenhai Wang1*, Jifeng Dai2,1*, Zhe Chen3,1*, Zhenhang Huang1*, Zhiqi Li3,1*, Xizhou Zhu4*,
Xiaowei Hu1, Tong Lu3, Lewei Lu4, Hongsheng Li5, Xiaogang Wang4,5, Yu Qiao1B

1 Shanghai AI Laboratory   2 Tsinghua University   3 Nanjing University   4 SenseTime Research   5 The Chinese University of Hong Kong

https://github.com/OpenGVLab/InternImage

* equal contribution, B corresponding author ([email protected])

Abstract

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs.

Figure 1. Comparisons of different core operators. (a) shows the global aggregation of multi-head self-attention (MHSA) [1], whose computational and memory costs are expensive in downstream tasks that require high-resolution inputs. (b) limits the range of MHSA into a local window [2] to reduce the cost. (c) is a depth-wise convolution with very large kernels to model long-range dependencies. (d) is a deformable convolution, which shares similar favorable properties with MHSA and is efficient enough for large-scale models. We start from it to build a large-scale CNN. Panel summary: (a) global attention: long-range dependence ✓, adaptive spatial aggregation ✓, computation/memory efficient ✗; (b) local attention: ✗, ✓, ✓; (c) large kernel: ✓, ✗, ✓; (d) dynamic sparse kernel (ours): ✓, ✓, ✓.

1. Introduction

With the remarkable success of transformers in large-scale language models [3–8], vision transformers (ViTs) [2, 9–15] have also swept the computer vision field and are becoming the primary choice for the research and practice of large-scale vision foundation models. Some pioneers [16–20] have made attempts to extend ViTs to very large models with over a billion parameters, beating convolutional neural networks (CNNs) and significantly pushing the performance bound for a wide range of computer vision tasks, including basic classification, detection, and segmentation. While these results suggest that CNNs are inferior to ViTs in the era of massive parameters and data, we argue that CNN-based foundation models can also achieve comparable or even better performance than ViTs when equipped with similar operator-/architecture-level designs, scaling-up parameters, and massive data.

To bridge the gap between CNNs and ViTs, we first summarize their differences from two aspects: (1) From the operator level [9, 21, 22], the multi-head self-attention (MHSA) of ViTs has long-range dependencies and adaptive spatial aggregation (see Fig. 1(a)).
Benefiting from the flexible MHSA, ViTs can learn more powerful and robust representations than CNNs from massive data. (2) From the architecture view [9, 22, 23], besides MHSA, ViTs contain a series of advanced components that are not included in standard CNNs, such as Layer Normalization (LN) [24], the feed-forward network (FFN) [1], GELU [25], etc. Although recent works [21, 22] have made meaningful attempts to introduce long-range dependencies into CNNs by using dense convolutions with very large kernels (e.g., 31×31) as shown in Fig. 1(c), there is still a considerable gap with the state-of-the-art large-scale ViTs [16, 18–20, 26] in terms of performance and model scale.

Figure 2. Performance comparison on COCO of different backbones (COCO box AP (%) versus #parameters (B)). The proposed InternImage-H achieves a new record 65.4 box AP on COCO test-dev, significantly outperforming state-of-the-art CNNs and large-scale ViTs.

In this work, we concentrate on designing a CNN-based foundation model that can efficiently extend to large-scale parameters and data. Specifically, we start with a flexible convolution variant—deformable convolution (DCN) [27, 28]. By combining it with a series of tailored block-level and architecture-level designs similar to transformers, we design a brand-new convolutional backbone network, termed InternImage. As shown in Fig. 1, different from recently improved CNNs with very large kernels such as 31×31 [22], the core operator of InternImage is a dynamic sparse convolution with a common window size of 3×3: (1) its sampling offsets are flexible to dynamically learn appropriate receptive fields (which can be long- or short-range) from the given data; (2) the sampling offsets and modulation scalars are adaptively adjusted according to the input data, which can achieve adaptive spatial aggregation like ViTs, reducing the over-inductive bias of regular convolutions; and (3) the convolution window is a common 3×3, avoiding the optimization problems and expensive costs caused by large dense kernels [22, 29].

With the aforementioned designs, the proposed InternImage can efficiently scale to large parameter sizes and learn stronger representations from large-scale training data, achieving comparable or even better performance to large-scale ViTs [2, 11, 19] on a wide range of vision tasks. In summary, our main contributions are as follows:

(1) We present a new large-scale CNN-based foundation model—InternImage. To our best knowledge, it is the first CNN that effectively scales to over 1 billion parameters and 400 million training images and achieves comparable or even better performance than state-of-the-art ViTs, showing that convolutional models are also a worth-exploring direction for large-scale model research.

(2) We successfully scale CNNs to large-scale settings by introducing long-range dependencies and adaptive spatial aggregation using an improved 3×3 DCN operator, and explore the tailored basic block, stacking rules, and scaling strategies centered on the operator. These designs make effective use of the operator, enabling our models to obtain the gains from large-scale parameters and data.

(3) We evaluate the proposed model on representative vision tasks including image classification, object detection, and instance and semantic segmentation, and compare it with state-of-the-art CNNs and large-scale ViTs by scaling the model size from 30 million to 1 billion parameters and the data from 1 million to 400 million images. Specifically, our model with different parameter sizes can consistently outperform prior arts on ImageNet [30]. InternImage-B achieves 84.9% top-1 accuracy trained only on the ImageNet-1K dataset, outperforming CNN-based counterparts [21, 22] by at least 1.1 points. With large-scale parameters (i.e., 1 billion) and training data (i.e., 427 million), the top-1 accuracy of InternImage-H is further boosted to 89.6%, which is close to well-engineered ViTs [2, 19] and hybrid ViTs [20]. In addition, on COCO [31], a challenging downstream benchmark, our best model InternImage-H achieves a state-of-the-art 65.4% box mAP with 2.18 billion parameters, 2.3 points higher than SwinV2-G [16] (65.4 vs. 63.1) with 27% fewer parameters, as shown in Fig. 2.

2. Related Work

Vision foundation models. Convolutional neural networks (CNNs) became the mainstream for visual recognition after large-scale datasets and computation resources were available. Starting from AlexNet [32], lots of deeper and more effective neural network architectures have been proposed, such as VGG [33], GoogleNet [34], ResNet [35], ResNeXt [36], EfficientNet [37, 38], etc. In addition to the architectural design, more sophisticated convolution operations such as depth-wise convolution [39] and deformable convolution [27, 28] were formulated. By considering the advanced designs of transformers, modern CNNs showed promising performance on vision tasks by discovering better components in macro/micro designs and introducing improved convolutions with long-range dependencies [21, 22, 40] or dynamic weights [41].
In recent years, a new line of vision foundation models focuses on transformer-based architectures. ViT [9] is the most representative model, which achieves great success in vision tasks thanks to global receptive fields and dynamic spatial aggregation. However, global attention in ViT suffers from expensive computational/memory complexity, especially on large feature maps, which limits its application in downstream tasks. To address this problem, PVT [10, 11] and Linformer [42] performed global attention on the downsampled key and value maps, DAT [43] employed deformable attention to sparsely sample information from value maps, while HaloNet [44] and Swin transformer [2] developed local attention mechanisms and used haloing and shift operations to transfer information among adjacent local regions.

Large-scale models. Scaling up models is an important strategy to improve feature representation quality, which has been well-studied in the natural language processing (NLP) domain [45]. Inspired by the success in the NLP field, Zhai et al. [19] first extended ViT to 2 billion parameters. Liu et al. [16] enlarged the hierarchical-structure Swin transformer to a deeper and wider model with 3 billion parameters. Some researchers developed large-scale hybrid ViTs [20, 46] by combining the advantages of ViTs and CNNs at different levels. Recently, BEiT-3 [17] further explored stronger representations based on ViT with large-scale parameters using multimodal pre-training. These methods significantly raise the upper bound of basic vision tasks. However, research on CNN-based large-scale models has lagged behind transformer-based architectures in terms of the total number of parameters and performance. Although newly-proposed CNNs [21, 22, 40, 47] introduce long-range dependencies by using convolutions with very large kernels or recursive gated kernels, there is still a considerable gap with state-of-the-art ViTs. In this work, we aim to develop a CNN-based foundation model that can extend efficiently to a large scale comparable to ViTs.

3. Proposed Method

To design a large-scale CNN-based foundation model, we start with a flexible convolution variant, namely deformable convolution v2 (DCNv2) [28], and make some tune-ups based on it to better suit the requirements of large-scale foundation models. Then, we build the basic block by combining the tuned convolution operator with advanced block designs used in modern backbones [16, 19]. Finally, we explore the stacking and scaling principles of DCN-based blocks to build a large-scale convolutional model that can learn strong representations from massive data.

Figure 3. Overall architecture of InternImage, where the core operator is DCNv3, and the basic block is composed of layer normalization (LN) [24] and a feed-forward network (FFN) [1] as in transformers; the stem and downsampling layers follow conventional CNN designs, where "s2" and "p1" mean stride 2 and padding 1, respectively. In the diagram, the stem (3×3 conv, s2, p1 → LN, GELU → 3×3 conv, s2, p1 → LN) is followed by 4 stages of stacked basic blocks with feature maps of H/4×W/4×C1, H/8×W/8×C2, H/16×W/16×C3, and H/32×W/32×C4, separated by downsampling layers (3×3 conv, s2, p1 → LN); each basic block applies DCNv3 (with G_i groups, offsets Δp and modulation m) and an FFN. The stacking rules are: (1) C_i = 2^(i-1) C_1; (2) G_i = C_i / C'; (3) L_1 = L_2 = L_4; (4) L_1 ≤ L_3. Constrained by these rules, only 4 hyper-parameters (C_1, C', L_1, L_3) can decide a model variant.

3.1. Deformable Convolution v3

Convolution vs. MHSA. Previous works [21, 22, 48] have extensively discussed the differences between CNNs and ViTs. Before deciding on the core operator of InternImage, we first summarize the main differences between regular convolution and MHSA.

(1) Long-range dependencies. Although it has long been recognized that models with large effective receptive fields (long-range dependencies) usually perform better on downstream vision tasks [49–51], the de-facto effective receptive field of CNNs [33, 35] stacked by 3×3 regular convolutions is relatively small. Even with very deep models, CNN-based models still cannot acquire long-range dependencies like ViTs, which limits their performance.

(2) Adaptive spatial aggregation. Compared to MHSA, whose weights are dynamically conditioned by the input, regular convolution [52] is an operator with static weights and strong inductive biases such as 2D locality, neighborhood structure, translation equivalence, etc. Because of these highly inductive properties, models composed of regular convolutions might converge faster and require less training data than ViTs, but they also restrict CNNs from learning more general and robust patterns from web-scale data. More robustness experiments are detailed in the supplementary material.
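To make point (1) above concrete, here is a tiny illustrative sketch (ours, not from the paper): the theoretical receptive field of a stack of n k×k convolutions with stride 1 grows only linearly as n(k−1)+1, which is why very deep stacks of 3×3 kernels still cover a modest region, whereas a single deformable sampling step can reach arbitrarily far because its offsets are unbounded.

```python
# Illustrative only: theoretical receptive field of n stacked k x k convolutions
# (stride 1) grows linearly as n*(k-1)+1.
def stacked_conv_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    return num_layers * (kernel_size - 1) + 1

if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, "layers ->", stacked_conv_receptive_field(n), "pixels")
    # 10 -> 21, 50 -> 101, 100 -> 201: linear growth, far from "global".
```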
Revisiting DCNv2. A straightforward way to bridge the gap between convolution and MHSA is to introduce long-range dependencies and adaptive spatial aggregation into regular convolutions. Let us start with DCNv2 [28], which is a general variant of regular convolution. Given an input x ∈ R^{C×H×W} and the current pixel p_0, DCNv2 can be formulated as:

    y(p_0) = \sum_{k=1}^{K} \mathbf{w}_k \, \mathbf{m}_k \, \mathbf{x}(p_0 + p_k + \Delta p_k),    (1)

where K represents the total number of sampling points, and k enumerates the sampling points. w_k ∈ R^{C×C} denotes the projection weights of the k-th sampling point, and m_k ∈ R represents the modulation scalar of the k-th sampling point, which is normalized by the sigmoid function. p_k denotes the k-th location of the pre-defined grid sampling {(−1, −1), (−1, 0), ..., (0, +1), ..., (+1, +1)} as in regular convolutions, and Δp_k is the offset corresponding to the k-th grid sampling location. We see from the equation that (1) for long-range dependencies, the sampling offset Δp_k is flexible and able to interact with short- or long-range features; and (2) for adaptive spatial aggregation, both the sampling offset Δp_k and the modulation scalar m_k are learnable and conditioned by the input x. So it can be found that DCNv2 shares similar favorable properties with MHSA, which motivated us to develop large-scale CNN-based foundation models on the basis of this operator.
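The following is a minimal, didactic sketch of Eqn. (1) for a single query pixel, assuming K = 9 (a 3×3 grid) and leaving out how offsets and modulation are predicted (in DCNv2 they come from an extra convolution over x). The helper names are ours; the released implementation uses a fused CUDA kernel rather than Python loops.

```python
# Didactic sketch of Eqn. (1): y(p0) = sum_k w_k * m_k * x(p0 + p_k + dp_k).
import torch

def bilinear_sample(x, py, px):
    """Bilinearly sample x (C, H, W) at a fractional location (py, px)."""
    C, H, W = x.shape
    y0, x0 = int(torch.floor(py)), int(torch.floor(px))
    dy, dx = py - y0, px - x0
    out = torch.zeros(C)
    for yy, xx, w in [(y0, x0, (1 - dy) * (1 - dx)), (y0, x0 + 1, (1 - dy) * dx),
                      (y0 + 1, x0, dy * (1 - dx)), (y0 + 1, x0 + 1, dy * dx)]:
        if 0 <= yy < H and 0 <= xx < W:
            out += w * x[:, yy, xx]
    return out

def dcnv2_point(x, p0, weights, offsets, modulation):
    """x: (C, H, W); weights: (K, C, C); offsets: (K, 2); modulation: (K,) in [0, 1]."""
    grid = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]   # pre-defined p_k
    y = torch.zeros(weights.shape[-1])
    for k, (gy, gx) in enumerate(grid):
        py = p0[0] + gy + offsets[k, 0]
        px = p0[1] + gx + offsets[k, 1]
        y += weights[k] @ (modulation[k] * bilinear_sample(x, py, px))
    return y

C, H, W, K = 8, 16, 16, 9
x = torch.randn(C, H, W)
y = dcnv2_point(x, torch.tensor([7.0, 7.0]), torch.randn(K, C, C),
                0.5 * torch.randn(K, 2), torch.sigmoid(torch.randn(K)))
print(y.shape)  # torch.Size([8])
```

Note how each of the K sampling points carries its own full C×C projection weight, the inefficiency that modification (1) below removes.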
Extending DCNv2 for Vision Foundation Models. In common practice, DCNv2 is usually used as an extension of regular convolution, loading the pre-trained weights of regular convolutions and fine-tuning for better performance, which is not exactly suitable for large-scale vision foundation models that need to be trained from scratch. In this work, to address this problem, we extend DCNv2 in the following aspects:

(1) Sharing weights among convolutional neurons. Similar to regular convolution, different convolutional neurons¹ in the original DCNv2 have independent linear projection weights, and thus its parameter and memory complexity are linear with the total number of sampling points, which significantly limits the efficiency of the model, especially for large-scale models. To remedy this problem, we borrow the idea from separable convolution [53] and detach the original convolution weights w_k into depth-wise and point-wise parts, where the depth-wise part is handled by the original location-aware modulation scalar m_k, and the point-wise part is the projection weight w shared among sampling points.

¹A 3×3 regular convolution has 9 linear projection neurons.

(2) Introducing a multi-group mechanism. The multi-group (head) design first appeared in group convolution [32], and it is widely used in the MHSA [1] of transformers, where it works with adaptive spatial aggregation to effectively learn richer information from different representation subspaces at different locations. Inspired by this, we split the spatial aggregation process into G groups, each of which has individual sampling offsets Δp_gk and modulation scalars m_gk, so that different groups on a single convolution layer can have different spatial aggregation patterns, resulting in stronger features for downstream tasks.

(3) Normalizing modulation scalars along sampling points. The modulation scalars in the original DCNv2 are element-wise normalized by the sigmoid function. Therefore, each modulation scalar is in the range [0, 1], but the sum of the modulation scalars of all sample points is not stable and varies from 0 to K. This leads to unstable gradients in DCNv2 layers when training with large-scale parameters and data. To alleviate the instability issue, we change the element-wise sigmoid normalization to softmax normalization along the sample points. In this way, the sum of the modulation scalars is constrained to 1, which makes the training process of models at different scales more stable.

Combining the mentioned modifications, the extended DCNv2, marked as DCNv3, can be formulated as Eqn. (2):

    y(p_0) = \sum_{g=1}^{G} \sum_{k=1}^{K} \mathbf{w}_g \, \mathbf{m}_{gk} \, \mathbf{x}_g(p_0 + p_k + \Delta p_{gk}),    (2)

where G denotes the total number of aggregation groups. For the g-th group, w_g ∈ R^{C×C'} denotes the location-irrelevant projection weights of the group, where C' = C/G represents the group dimension; m_gk ∈ R denotes the modulation scalar of the k-th sampling point in the g-th group, normalized by the softmax function along the dimension K; x_g ∈ R^{C'×H×W} represents the sliced input feature map; and Δp_gk is the offset corresponding to the grid sampling location p_k in the g-th group.

In general, DCNv3, as an extension of the DCN series, enjoys three merits: (1) this operator makes up for the deficiencies of regular convolution in terms of long-range dependencies and adaptive spatial aggregation; (2) compared with attention-based operators such as common MHSA and the closely-related deformable attention [43, 54], this operator inherits the inductive bias of convolution, making our model more efficient with less training data and shorter training time; (3) this operator is based on sparse sampling, which is more computation- and memory-efficient than previous methods such as MHSA [1] and re-parameterized large kernels [22]. In addition, due to the sparse sampling, DCNv3 only needs a 3×3 kernel to learn long-range dependencies, which is easier to optimize and avoids extra auxiliary techniques such as re-parameterizing [22] used in large kernels.
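Below is a compact sketch of Eqn. (2), assuming a 3×3 grid (K = 9) and relying on torch.nn.functional.grid_sample for bilinear sampling. It folds the three modifications together: a single shared projection applied to the concatenated group outputs (which equals summing w_g x_g over groups), per-group offsets, and softmax-normalized modulation scalars whose sum over K is 1 (the sigmoid of DCNv2 would let that sum drift anywhere in [0, K]). The module and argument names are ours; the official operator also predicts offsets/modulation with a separable convolution and runs as a fused CUDA kernel.

```python
# Minimal DCNv3-style aggregation sketch (shapes only, not the official operator).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCNv3Sketch(nn.Module):
    def __init__(self, channels, groups, kernel_size=3):
        super().__init__()
        assert channels % groups == 0
        self.C, self.G, self.K = channels, groups, kernel_size ** 2
        pad = kernel_size // 2
        # Sampling offsets (2 per point) and modulation logits (1 per point) per group,
        # predicted from the input (plain convs here, separable conv in the paper).
        self.offset = nn.Conv2d(channels, groups * self.K * 2, kernel_size, padding=pad)
        self.modulation = nn.Conv2d(channels, groups * self.K, kernel_size, padding=pad)
        # Shared projection: one linear map on the concatenated group outputs.
        self.proj = nn.Conv2d(channels, channels, 1)
        dy, dx = torch.meshgrid(torch.arange(-pad, pad + 1),
                                torch.arange(-pad, pad + 1), indexing="ij")
        self.register_buffer("grid", torch.stack([dx, dy], -1).view(1, 1, self.K, 2).float())  # p_k

    def forward(self, x):
        N, C, H, W = x.shape
        Cg = C // self.G
        offsets = self.offset(x).view(N, self.G, self.K, 2, H, W).permute(0, 1, 4, 5, 2, 3)
        # Softmax over the K sampling points: each group's modulation scalars sum to 1.
        m = self.modulation(x).view(N, self.G, self.K, H, W).softmax(dim=2)
        base_y, base_x = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        base = torch.stack([base_x, base_y], -1).float().to(x.device)      # (H, W, 2) in (x, y)
        out = []
        for g in range(self.G):
            xg = x[:, g * Cg:(g + 1) * Cg]                                 # sliced input x_g
            loc = base[None, :, :, None] + self.grid[0] + offsets[:, g]    # p_0 + p_k + dp_gk
            loc = loc / torch.tensor([(W - 1) / 2.0, (H - 1) / 2.0], device=x.device) - 1.0
            loc = loc.permute(0, 3, 1, 2, 4).reshape(N, self.K * H, W, 2)  # grid for grid_sample
            sampled = F.grid_sample(xg, loc, align_corners=True).view(N, Cg, self.K, H, W)
            out.append((sampled * m[:, g:g + 1]).sum(dim=2))               # sum over K per group
        return self.proj(torch.cat(out, dim=1))

x = torch.randn(2, 64, 32, 32)
print(DCNv3Sketch(64, groups=4)(x).shape)  # torch.Size([2, 64, 32, 32])
```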

3.2. InternImage Model

Using DCNv3 as the core operator brings a new problem: how to build a model that can make effective use of the core operator? In this section, we first present the details of the basic block and other integral layers of our model, and then we construct a new CNN-based foundation model termed InternImage by exploring a tailored stacking strategy for these basic blocks. Finally, we study scaling-up rules for the proposed model to obtain the gain from increasing parameters.

Basic block. Unlike the widely used bottlenecks in traditional CNNs [35], the design of our basic block is closer to that of ViTs, which is equipped with more advanced components including LN [24], feed-forward networks (FFN) [1], and GELU [25]. This design is proven to be efficient [2, 10, 11, 21, 22] in various vision tasks. The details of our basic block are illustrated in Fig. 3, where the core operator is DCNv3, and the sampling offsets and modulation scalars are predicted by passing the input feature x through a separable convolution (a 3×3 depth-wise convolution followed by a linear projection). For the other components, we use the post-normalization setting [55] by default and follow the same design as that of the plain transformer [1, 9].

Stem & downsampling layers. To obtain hierarchical feature maps, we use convolutional stem and downsampling layers to resize the feature maps to different scales. As shown in Fig. 3, the stem layer is placed before the first stage to reduce the input resolution by 4 times. It consists of two convolutions, two LN layers, and one GELU layer, where the kernel size of the two convolutions is 3, the stride is 2, the padding is 1, and the output channel number of the first convolution is half that of the second one. Similarly, each downsampling layer is made up of a 3×3 convolution with a stride of 2 and a padding of 1, followed by one LN layer. It sits between two stages and is used to downsample the input feature map by 2 times.

Stacking rules. To clarify the block-stacking process, we first list the hyper-parameters of InternImage as follows:
C_i: the channel number of the i-th stage;
G_i: the group number of the DCNv3 in the i-th stage;
L_i: the number of basic blocks in the i-th stage.
Since our model has 4 stages, a variant is decided by 12 hyper-parameters, whose search space is too large to exhaustively enumerate to find the best variant. To reduce the search space, we summarize the design experiences of prior arts [2, 21, 35] into the 4 rules shown in Fig. 3, where the first rule makes the channel numbers of the last three stages determined by the channel number C_1 of the first stage, and the second rule lets the group number correspond to the channel number of the stages. For the number of stacked blocks in different stages, we simplify the stacking pattern to "AABA", which means the block numbers of stages 1, 2, and 4 are the same and are not greater than that of stage 3, as illustrated in the last two rules. With these rules, an InternImage variant can be defined by using only 4 hyper-parameters (C_1, C', L_1, L_3).

Let us choose a model with 30 million parameters as the origin and discretize C_1 to {48, 64, 80}, L_1 to {1, 2, 3, 4, 5}, and C' to {16, 32}. In this way, the original huge search space is reduced to 30, and we can find the best model from the 30 variants by training and evaluating them on ImageNet [30]. In practice, we use the best hyper-parameter setting (64, 16, 4, 18) to define the origin model and scale it to different sizes.

Scaling rules. Based on the optimal origin model under the aforementioned constraints, we further explore the parameter scaling rules inspired by [37]. Specifically, we consider two scaling dimensions, depth D (i.e., 3L_1 + L_3) and width C_1, and scale the two dimensions using α, β and a composite factor φ. The scaling rules can be written as D' = α^φ D and C_1' = β^φ C_1, where α ≥ 1, β ≥ 1, and αβ^1.99 ≈ 2. Here, 1.99 is specific to InternImage and is calculated by doubling the model width while keeping the depth constant. We experimentally find that the best scaling setting is α = 1.09 and β = 1.36, and based on it we construct InternImage variants with different parameter scales, namely InternImage-T/S/B/L/XL, whose complexities are similar to those of ConvNeXt [21]. To further test the capability, we built a larger InternImage-H with 1 billion parameters, and to accommodate very large model widths, we also change the group dimension C' to 32. The configurations are summarized in Table 1.

model name | C_1 | C' | L_1,2,3,4 | #params
InternImage-T (origin) | 64 | 16 | 4, 4, 18, 4 | 30M
InternImage-S | 80 | 16 | 4, 4, 21, 4 | 50M
InternImage-B | 112 | 16 | 4, 4, 21, 4 | 97M
InternImage-L | 160 | 16 | 5, 5, 22, 5 | 223M
InternImage-XL | 192 | 16 | 5, 5, 24, 5 | 335M
InternImage-H | 320 | 32 | 6, 6, 32, 6 | 1.08B

Table 1. Hyper-parameters for models of different scales. InternImage-T is the origin model, and -S/B/L/XL/H are scaled up from -T. "#params" denotes the number of parameters.
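To make the stacking and scaling rules concrete, here is a small sketch (ours, with hypothetical helper names) that expands the 4 hyper-parameters into per-stage settings and applies the compound scaling; how the official models round the scaled depth back onto the "AABA" pattern is not spelled out here.

```python
# Sketch of the stacking rules in Fig. 3 and the scaling rule D' = alpha**phi * D,
# C1' = beta**phi * C1 (Sec. 3.2). Helper names are ours.
def stage_config(C1, C_prime, L1, L3):
    """Expand (C1, C', L1, L3) into per-stage channels, groups, and depths."""
    channels = [C1 * 2 ** i for i in range(4)]       # rule (1): C_i = 2**(i-1) * C_1, i = 1..4
    groups = [c // C_prime for c in channels]        # rule (2): G_i = C_i / C'
    depths = [L1, L1, L3, L1]                        # rules (3)-(4): "AABA" with L1 <= L3
    return channels, groups, depths

def scale(C1, L1, L3, phi, alpha=1.09, beta=1.36):
    """Compound scaling of depth D = 3*L1 + L3 and width C1."""
    D = 3 * L1 + L3
    return round(beta ** phi * C1), round(alpha ** phi * D)

# InternImage-T origin: (C1, C', L1, L3) = (64, 16, 4, 18); the search space above
# is 3 * 2 * 5 = 30 discretized variants.
print(stage_config(64, 16, 4, 18))
# -> ([64, 128, 256, 512], [4, 8, 16, 32], [4, 4, 18, 4]), matching the -T row of Table 1.
print(1.09 * 1.36 ** 1.99)  # ~2.01, i.e. alpha * beta**1.99 ~= 2
```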

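The basic block and the stem/downsampling layers described above can also be sketched structurally. This is a rough sketch under our own assumptions (an FFN expansion ratio of 4, an NCHW LayerNorm wrapper, and a pluggable DCNv3 stand-in such as the sketch after Eqn. (2)); it is not the official implementation.

```python
# Structural sketch of the stem, downsampling layer, and post-norm basic block.
import torch
import torch.nn as nn

class LayerNorm2d(nn.LayerNorm):
    """LayerNorm over the channel dimension of an NCHW tensor."""
    def forward(self, x):
        return super().forward(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

def stem(C1):
    # Two 3x3 convs, stride 2, padding 1 (4x downsampling overall);
    # the first conv outputs half the channels of the second.
    return nn.Sequential(
        nn.Conv2d(3, C1 // 2, 3, stride=2, padding=1), LayerNorm2d(C1 // 2), nn.GELU(),
        nn.Conv2d(C1 // 2, C1, 3, stride=2, padding=1), LayerNorm2d(C1))

def downsample(C_in):
    # 3x3 conv, stride 2, padding 1, followed by LN: doubles channels, halves resolution.
    return nn.Sequential(nn.Conv2d(C_in, 2 * C_in, 3, stride=2, padding=1), LayerNorm2d(2 * C_in))

class BasicBlock(nn.Module):
    """Post-norm block: x = LN(x + DCNv3(x)); x = LN(x + FFN(x))."""
    def __init__(self, C, groups, dcn_layer, ffn_ratio=4):   # ffn_ratio=4 is our assumption
        super().__init__()
        self.dcn = dcn_layer(C, groups)
        self.norm1 = LayerNorm2d(C)
        self.ffn = nn.Sequential(nn.Conv2d(C, ffn_ratio * C, 1), nn.GELU(),
                                 nn.Conv2d(ffn_ratio * C, C, 1))
        self.norm2 = LayerNorm2d(C)

    def forward(self, x):
        x = self.norm1(x + self.dcn(x))
        return self.norm2(x + self.ffn(x))

x = torch.randn(1, 3, 224, 224)
f = stem(64)(x)                       # (1, 64, 56, 56): H/4 x W/4 x C1
f = downsample(64)(f)                 # (1, 128, 28, 28)
blk = BasicBlock(128, 8, dcn_layer=lambda C, g: nn.Conv2d(C, C, 3, padding=1))  # DCNv3 stand-in
print(blk(f).shape)                   # torch.Size([1, 128, 28, 28])
```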
4. Experiment

We analyze and compare InternImage with leading CNNs and ViTs on representative vision tasks including image classification, object detection, and instance and semantic segmentation. Besides the experiments in the main paper, due to space constraints, more experimental setups and ablation studies are presented in the supplementary materials.

4.1. Image Classification

Settings. We evaluate the classification performance of InternImage on ImageNet [30]. For fair comparisons, following common practices [2, 10, 21, 56], InternImage-T/S/B are trained on ImageNet-1K (∼1.3 million images) for 300 epochs, and InternImage-L/XL are first trained on ImageNet-22K (∼14.2 million images) for 90 epochs and then fine-tuned on ImageNet-1K for 20 epochs. To further explore the capability of our model and match the large-scale private data used in previous methods [16, 20, 57], we adopt M3I Pre-training [58], a unified pre-training approach available for both unlabeled and weakly-labeled data, to pre-train InternImage-H on a 427 million joint dataset of the public Laion-400M [59], YFCC-15M [60], and CC12M [61] for 30 epochs, and then we fine-tune the model on ImageNet-1K for 20 epochs.

method | type | scale | #params | #FLOPs | acc (%)
Swin-T [2] | T | 224² | 29M | 5G | 81.3
CoAtNet-0 [20] | T | 224² | 25M | 4G | 81.6
PVTv2-B2 [11] | T | 224² | 25M | 4G | 82.0
DeiT III-S [62] | T | 224² | 22M | 5G | 81.4
SwinV2-T/8 [16] | T | 256² | 28M | 6G | 81.8
ConvNeXt-T [21] | C | 224² | 29M | 5G | 82.1
InternImage-T (ours) | C | 224² | 30M | 5G | 83.5
Swin-S [2] | T | 224² | 50M | 9G | 83.0
CoAtNet-1 [20] | T | 224² | 42M | 8G | 83.3
PVTv2-B4 [11] | T | 224² | 63M | 10G | 83.6
SwinV2-S/8 [16] | T | 256² | 50M | 12G | 83.7
ConvNeXt-S [21] | C | 224² | 50M | 9G | 83.1
InternImage-S (ours) | C | 224² | 50M | 8G | 84.2
Swin-B [2] | T | 224² | 88M | 15G | 83.5
CoAtNet-2 [20] | T | 224² | 75M | 16G | 84.1
PVTv2-B5 [11] | T | 224² | 82M | 12G | 83.8
DeiT III-B [62] | T | 224² | 87M | 18G | 83.8
SwinV2-B/8 [16] | T | 256² | 88M | 20G | 84.2
RepLKNet-31B [22] | C | 224² | 79M | 15G | 83.5
ConvNeXt-B [21] | C | 224² | 88M | 15G | 83.8
InternImage-B (ours) | C | 224² | 97M | 16G | 84.9
Swin-L‡ [2] | T | 384² | 197M | 104G | 87.3
CoAtNet-4‡ [20] | T | 384² | 275M | 190G | 87.9
DeiT III-L‡ [62] | T | 384² | 304M | 191G | 87.7
SwinV2-L/24‡ [16] | T | 384² | 197M | 115G | 87.6
RepLKNet-31L‡ [22] | C | 384² | 172M | 96G | 86.6
ConvNeXt-L‡ [21] | C | 384² | 198M | 101G | 87.5
ConvNeXt-XL‡ [21] | C | 384² | 350M | 179G | 87.8
InternImage-L‡ (ours) | C | 384² | 223M | 108G | 87.7
InternImage-XL‡ (ours) | C | 384² | 335M | 163G | 88.0
ViT-G/14# [19] | T | 518² | 1.84B | 5160G | 90.5
CoAtNet-6# [20] | T | 512² | 1.47B | 1521G | 90.5
CoAtNet-7# [20] | T | 512² | 2.44B | 2586G | 90.9
Florence-CoSwin-H# [57] | T | − | 893M | − | 90.0
SwinV2-G# [16] | T | 640² | 3.00B | − | 90.2
RepLKNet-XL# [22] | C | 384² | 335M | 129G | 87.8
BiT-L-ResNet152x4# [63] | C | 480² | 928M | − | 87.5
InternImage-H# (ours) | C | 224² | 1.08B | 188G | 88.9
InternImage-H# (ours) | C | 640² | 1.08B | 1478G | 89.6

Table 2. Image classification performance on the ImageNet validation set. "type" refers to model type, where "T" and "C" denote transformer and CNN, respectively. "scale" is the input scale. "acc" is the top-1 accuracy. "‡" indicates the model is pre-trained on ImageNet-22K [30]. "#" indicates pre-training on an extra large-scale private dataset such as JFT-300M [64], FLD-900M [57], or the joint public dataset in this work.

Results. Table 2 shows the classification results of models with different scales. With similar parameters and computational costs, our models are comparable or even superior to the state-of-the-art transformer-based and CNN-based models. For example, InternImage-T achieves 83.5% top-1 accuracy, outperforming ConvNeXt-T [21] with a clear margin of 1.4 points. InternImage-S/B keep the leading position, and InternImage-B surpasses the hybrid ViT CoAtNet-2 [20] by 0.8 points. When pre-trained on ImageNet-22K and the large-scale joint dataset, the top-1 accuracies of InternImage-XL and -H are boosted to 88.0% and 89.6%, respectively, which is better than previous CNNs [22, 63] also trained with large-scale data, and closes the gap with the state-of-the-art large-scale ViTs to about 1 point. This gap may be caused by the discrepancy between large-scale inaccessible private data and the aforementioned joint public data. These results show that our InternImage not only has good performance on the common parameter scale with public training data, but also can effectively extend to large-scale parameters and data.

4.2. Object Detection

Settings. We verify the detection performance of our InternImage on the COCO benchmark [31], on top of two representative object detection frameworks: Mask R-CNN [66] and Cascade Mask R-CNN [67]. We follow common practices [2, 11] to initialize the backbone with pre-trained classification weights and train models with a 1× (12 epochs) or 3× (36 epochs) schedule by default.
14413
method #params #FLOPs | Mask R-CNN 1× schedule: APb APb50 APb75 APm APm50 APm75 | Mask R-CNN 3×+MS schedule: APb APb50 APb75 APm APm50 APm75
Swin-T [2] 48M 267G 42.7 65.2 46.8 39.3 62.2 42.2 46.0 68.1 50.3 41.6 65.1 44.9
ConvNeXt-T [21] 48M 262G 44.2 66.6 48.3 40.1 63.3 42.8 46.2 67.9 50.8 41.7 65.0 44.9
PVTv2-B2 [11] 45M 309G 45.3 67.1 49.6 41.2 64.2 44.4 47.8 69.7 52.6 43.1 66.8 46.7
ViT-Adapter-S [65] 48M 403G 44.7 65.8 48.3 39.9 62.5 42.8 48.2 69.7 52.5 42.8 66.4 45.9
InternImage-T (ours) 49M 270G 47.2 69.0 52.1 42.5 66.1 45.8 49.1 70.4 54.1 43.7 67.3 47.3
Swin-S [2] 69M 354G 44.8 66.6 48.9 40.9 63.4 44.2 48.2 69.8 52.8 43.2 67.0 46.1
ConvNeXt-S [21] 70M 348G 45.4 67.9 50.0 41.8 65.2 45.1 47.9 70.0 52.7 42.9 66.9 46.2
PVTv2-B3 [11] 65M 397G 47.0 68.1 51.7 42.5 65.7 45.7 48.4 69.8 53.3 43.2 66.9 46.7
InternImage-S (ours) 69M 340G 47.8 69.8 52.8 43.3 67.1 46.7 49.7 71.1 54.5 44.5 68.5 47.8
Swin-B [2] 107M 496G 46.9 − − 42.3 − − 48.6 70.0 53.4 43.3 67.1 46.7
ConvNeXt-B [21] 108M 486G 47.0 69.4 51.7 42.7 66.3 46.0 48.5 70.1 53.3 43.5 67.1 46.7
PVTv2-B5 [11] 102M 557G 47.4 68.6 51.9 42.5 65.7 46.0 48.4 69.2 52.9 42.9 66.6 46.2
ViT-Adapter-B [65] 120M 832G 47.0 68.2 51.4 41.8 65.1 44.9 49.6 70.6 54.0 43.6 67.7 46.9
InternImage-B (ours) 115M 501G 48.8 70.9 54.0 44.0 67.8 47.4 50.3 71.4 55.3 44.8 68.7 48.0

method #params #FLOPs | Cascade Mask R-CNN 1× schedule: APb APb50 APb75 APm APm50 APm75 | Cascade Mask R-CNN 3×+MS schedule: APb APb50 APb75 APm APm50 APm75
Swin-L‡ [2] 253M 1382G 51.8 71.0 56.2 44.9 68.4 48.9 53.9 72.4 58.8 46.7 70.1 50.8
ConvNeXt-L‡ [21] 255M 1354G 53.5 72.8 58.3 46.4 70.2 50.2 54.8 73.8 59.8 47.6 71.3 51.7
RepLKNet-31L‡ [22] 229M 1321G − − − − − − 53.9 72.5 58.6 46.5 70.0 50.6
HorNet-L‡ [40] 259M 1358G − − − − − − 56.0 − − 48.6 − −
InternImage-L‡ (ours) 277M 1399G 54.9 74.0 59.8 47.7 71.4 52.1 56.1 74.8 60.7 48.5 72.4 53.0
ConvNeXt-XL‡ [21] 407M 1898G 53.6 72.9 58.5 46.5 70.3 50.5 55.2 74.2 59.9 47.7 71.6 52.2
InternImage-XL‡ (ours) 387M 1782G 55.3 74.4 60.1 48.1 71.9 52.4 56.2 75.0 61.2 48.8 72.5 53.4

Table 3. Object detection and instance segmentation performance on COCO val2017. The FLOPs are measured with 1280×800 inputs. APb and APm represent box AP and mask AP, respectively. "MS" means multi-scale training.

Results. As shown in Table 3, when using Mask R-CNN for object detection, we find that under a comparable number of parameters, our models significantly surpass their counterparts. For example, with the 1× training schedule, the box AP (APb) of InternImage-T is 4.5 points better than Swin-T [2] (47.2 vs. 42.7) and 3.0 points higher than ConvNeXt-T [21] (47.2 vs. 44.2). With the 3× multi-scale training schedule, more parameters, and the more advanced Cascade Mask R-CNN [67], InternImage-XL achieves an APb of 56.2, surpassing ConvNeXt-XL by 1.0 points (56.2 vs. 55.2). Similar results are also seen in the instance segmentation experiments. With the 1× training schedule, InternImage-T yields 42.5 mask AP (i.e., APm), which outperforms Swin-T and ConvNeXt-T by 3.2 points (42.5 vs. 39.3) and 2.4 points (42.5 vs. 40.1), respectively. The best APm of 48.8 is obtained by InternImage-XL with Cascade Mask R-CNN, which is at least 1.1 points higher than its counterparts.

To further push the performance bound of object detection, we follow the advanced setting used in leading methods [16, 17, 26, 70, 74] to initialize the backbone with weights pre-trained on ImageNet-22K or the large-scale joint dataset, and double its parameters via the composite techniques [74] (see Fig. 2). Then, we fine-tune it along with the DINO [70] detector on the Objects365 [75] and COCO datasets one after another for 26 epochs and 12 epochs, respectively. As shown in Table 4, our method achieves the best results of 65.0 APb and 65.4 APb on COCO val2017 and test-dev. Compared to previous state-of-the-art models, we surpass FD-SwinV2-G [26] by 1.2 points (65.4 vs. 64.2), with 27% fewer parameters and without complicated distillation processes, which shows the effectiveness of our models on the detection task.

method | detector | #params | APb (val2017) | APb (test-dev)
Swin-L‡ [2] | HTC++ [2] | 284M | 58.0 | 58.7
Swin-L‡ [2] | Soft-Teacher [68] | 284M | 60.7 | 61.3
Florence-CoSwin-H# [57] | DyHead [69] | 637M | 62.0 | 62.4
ViT-Adapter-L‡ [65] | HTC++ [2] | 401M | 62.6 | 62.6
Swin-L‡ [2] | DINO [70] | 218M | 63.2 | 63.3
FocalNet-H‡ [71] | DINO [70] | 746M | 64.2 | 64.3
ViT-Huge [72] | Group-DETRv2 [72] | 629M | − | 64.5
SwinV2-G# [16] | HTC++ [2] | 3.00B | 62.5 | 63.1
BEiT-3# [17] | ViTDet [73] | 1.90B | − | 63.7
FD-SwinV2-G# [26] | HTC++ [2] | 3.00B | − | 64.2
InternImage-XL‡ (ours) | DINO [70] | 602M | 64.2 | 64.3
InternImage-H# (ours) | DINO [70] | 2.18B | 65.0 | 65.4

Table 4. Comparison of the state-of-the-art detectors on COCO val2017 and test-dev.

4.3. Semantic Segmentation

Settings. To evaluate the semantic segmentation performance of InternImage, we initialize the backbone with pre-trained classification weights and train our models with UperNet [77] on ADE20K [78] for 160k iterations, comparing fairly with previous CNN-based and transformer-based backbones. To further reach top performance, we arm InternImage-H with the more advanced Mask2Former [76] and adopt the same training settings as in [17, 65].

method | crop size | #params | #FLOPs | mIoU (SS) | mIoU (MS)
Swin-T [2] | 512² | 60M | 945G | 44.5 | 45.8
ConvNeXt-T [21] | 512² | 60M | 939G | 46.0 | 46.7
SLaK-T [29] | 512² | 65M | 936G | 47.6 | −
InternImage-T (ours) | 512² | 59M | 944G | 47.9 | 48.1
Swin-S [2] | 512² | 81M | 1038G | 47.6 | 49.5
ConvNeXt-S [21] | 512² | 82M | 1027G | 48.7 | 49.6
SLaK-S [29] | 512² | 91M | 1028G | 49.4 | −
InternImage-S (ours) | 512² | 80M | 1017G | 50.1 | 50.9
Swin-B [2] | 512² | 121M | 1188G | 48.1 | 49.7
ConvNeXt-B [21] | 512² | 122M | 1170G | 49.1 | 49.9
RepLKNet-31B [22] | 512² | 112M | 1170G | 49.9 | 50.6
SLaK-B [29] | 512² | 135M | 1172G | 50.2 | −
InternImage-B (ours) | 512² | 128M | 1185G | 50.8 | 51.3
Swin-L‡ [2] | 640² | 234M | 2468G | 52.1 | 53.5
RepLKNet-31L‡ [22] | 640² | 207M | 2404G | 52.4 | 52.7
ConvNeXt-L‡ [21] | 640² | 235M | 2458G | 53.2 | 53.7
ConvNeXt-XL‡ [21] | 640² | 391M | 3335G | 53.6 | 54.0
InternImage-L‡ (ours) | 640² | 256M | 2526G | 53.9 | 54.1
InternImage-XL‡ (ours) | 640² | 368M | 3142G | 55.0 | 55.3
SwinV2-G# [16] | 896² | 3.00B | − | − | 59.9
InternImage-H# (ours) | 896² | 1.12B | 3566G | 59.9 | 60.3
BEiT-3# [17] | 896² | 1.90B | − | − | 62.8
FD-SwinV2-G# [26] | 896² | 3.00B | − | − | 61.4
InternImage-H# (ours) + Mask2Former [76] | 896² | 1.31B | 4635G | 62.5 | 62.9

Table 5. Semantic segmentation performance on the ADE20K validation set. The FLOPs are measured with 512×2048, 640×2560, or 896×896 inputs according to the crop size. "SS" and "MS" mean single-scale and multi-scale testing, respectively.

Results. As shown in Table 5, when using UperNet [77] for semantic segmentation, our InternImage consistently outperforms prior arts [2, 21, 22, 29]. For example, with almost the same parameter numbers and FLOPs, our InternImage-B reports 50.8 mIoU on the ADE20K val, clearly ahead of strong counterparts such as ConvNeXt-B (50.8 vs. 49.1) and RepLKNet-31B (50.8 vs. 49.9). Furthermore, our InternImage-H yields 60.3 MS mIoU, which is better than SwinV2-G [16], while the parameter number is much smaller (1.12B vs. 3.00B).
It is worth noting that, when using Mask2Former [76] and multi-scale testing, our InternImage-H achieves the best mIoU of 62.9, higher than the current best BEiT-3 [17] on the ADE20K benchmark. These results demonstrate that the CNN-based foundation model can also enjoy the dividends of massive data and challenge the leading position of transformer-based models.

4.4. Ablation Study

shared w | multi-group | softmax | top-1 acc | APb | APm
✗ | ✓ | ✓ | 83.6 | 47.4 | 42.6
✓ | ✗ | ✓ | 82.3 | 43.8 | 40.0
✓ | ✓ | ✗ | 65.7 | 38.7 | 35.6
✓ | ✓ | ✓ | 83.5 | 47.2 | 42.5

Table 6. Ablation comparison of the three modifications in DCNv3. These experiments are based on InternImage-T for classification and Mask R-CNN 1× schedule for detection.

Figure 4. Model parameters and GPU memory usage of shared weights vs. unshared weights among convolution neurons. The left vertical axis indicates the model parameters and the right one indicates the GPU memory usage per image when the batch size is 32 and the input image resolution is 224×224.

Figure 5. Visualization of sampling locations for different groups at different stages (stages 1–4). The blue star indicates the query point (on the left sheep), and the dots with different colors indicate the sampling locations of different groups.

Sharing weights among convolution neurons matters. Large-scale models are sensitive to the parameter and memory cost of the core operator due to hardware limitations. To address this problem, we share weights among the convolution neurons of DCNv3. As shown in Fig. 4, we compare the parameter and memory costs of models based on DCNv3 with shared or unshared weights. We see that the parameter and memory costs of models with unshared weights are much higher than those of the shared ones; in particular, for the -H scale, the ratios of saved parameters and GPU memory are 42.0% and 84.2%, respectively. As shown in Table 6, we also verify that the two models at the -T scale have similar top-1 accuracy on ImageNet (83.5 vs. 83.6) and APb on COCO (47.2 vs. 47.4), even though the model without shared weights has 66.1% more parameters.

Multi-group spatial aggregation brings stronger features. We introduce aggregation groups to allow our model to learn information from different representation subspaces like transformers [9]. As shown in Fig. 5, for the same query pixel, the offsets from different groups are concentrated in different regions, resulting in hierarchical semantic features. We also compare the performance of the model with and without multiple groups. As reported in Table 6, removing the multi-group design drops the model significantly, by 1.2 points on ImageNet and 3.4 points on COCO val2017. In addition, we also see that in the first two stages, the learned effective receptive field (ERF) is relatively small, and as the model goes deeper (i.e., stages 3 and 4), the ERF increases to be global. This phenomenon is different from ViTs [9, 10, 79], whose ERF is usually global. Moreover, the normalization of sampling points improves gradient stability: removing the softmax normalization leads to a 17.8-point drop on ImageNet and an 8.5-point drop on COCO.

5. Conclusion & Limitations

We introduce InternImage, a new large-scale CNN-based foundation model that can provide strong representations for versatile vision tasks, such as image classification, object detection, and semantic segmentation. We tune the flexible DCNv2 operator to satisfy the requirements of foundation models, and develop a series of blocks, stacking rules, and scaling rules centered on the core operator. Extensive experiments on object detection and semantic segmentation benchmarks verify that our InternImage can obtain comparable or better performance than well-designed large-scale vision transformers trained with massive data, showing that CNNs are also a considerable choice for large-scale vision foundation model research. Nonetheless, latency remains an issue for DCN-based operators when adapting to downstream tasks with high-speed requirements. Also, large-scale CNNs are still in their early stages of development, and we hope InternImage can serve as a good starting point.

Acknowledgement

This work is partially supported by the National Key R&D Program of China (No. 2022ZD0160100), the National Natural Science Foundation of China (Grant No. 61672273, 61832008), and the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100).
References [12] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming
Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo.
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- Cswin transformer: A general vision transformer backbone
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia with cross-shaped windows. IEEE Conf. Comput. Vis. Pat-
Polosukhin. Attention is all you need. Adv. Neural Inform. tern Recog., pages 12124–12134, 2022. 1
Process. Syst., 30, 2017. 1, 2, 3, 4, 5
[13] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu,
[2] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing con-
Zhang, Stephen Lin, and Baining Guo. Swin transformer: volutions to vision transformers. In Int. Conf. Comput. Vis.,
Hierarchical vision transformer using shifted windows. In pages 22–31, 2021. 1
Int. Conf. Comput. Vis., pages 10012–10022, 2021. 1, 2, 3,
5, 6, 7 [14] Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bo-
janowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Na-
[3] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick talia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit:
LeGresley, Jared Casper, and Bryan Catanzaro. Megatron- Cross-covariance image transformers. Adv. Neural Inform.
lm: Training multi-billion parameter language models using Process. Syst., 34, 2021. 1
model parallelism. arXiv preprint arXiv:1909.08053, 2019.
1 [15] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu,
and Yunhe Wang. Transformer in transformer. Adv. Neural
[4] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Inform. Process. Syst., 34, 2021. 1
Amodei, Ilya Sutskever, et al. Language models are unsu-
pervised multitask learners. OpenAI blog, 1(8):9, 2019. 1 [16] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie,
Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong,
[5] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, et al. Swin transformer v2: Scaling up capacity and res-
Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and olution. Adv. Neural Inform. Process. Syst., pages 12009–
Peter J Liu. Exploring the limits of transfer learning with a 12019, 2022. 1, 2, 3, 6, 7
unified text-to-text transformer. Journal of Machine Learn-
ing Research, 21:1–67, 2020. 1 [17] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhil-
iang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mo-
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub-
hammed, Saksham Singhal, Subhojit Som, et al. Image as a
biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan-
foreign language: Beit pretraining for all vision and vision-
tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan-
language tasks. arXiv preprint arXiv:2208.10442, 2022. 1,
guage models are few-shot learners. Adv. Neural Inform.
3, 6, 7, 8
Process. Syst., 33:1877–1901, 2020. 1
[18] Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim
[7] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,
Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel
Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul
Keysers, and Neil Houlsby. Scaling vision with sparse mix-
Barham, Hyung Won Chung, Charles Sutton, Sebastian
ture of experts. Adv. Neural Inform. Process. Syst., 34:8583–
Gehrmann, et al. Palm: Scaling language modeling with
8595, 2021. 1, 2
pathways. arXiv preprint arXiv:2204.02311, 2022. 1
[19] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lu-
[8] William Fedus, Barret Zoph, and Noam Shazeer. Switch
cas Beyer. Scaling vision transformers. In IEEE Conf. Com-
transformers: Scaling to trillion parameter models with sim-
put. Vis. Pattern Recog., pages 12104–12113, 2022. 1, 2, 3,
ple and efficient sparsity. Journal of Machine Learning Re-
6
search, 23(120):1–39, 2022. 1
[20] Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
Coatnet: Marrying convolution and attention for all data
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
sizes. Adv. Neural Inform. Process. Syst., 34:3965–3977,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
2021. 1, 2, 3, 6
vain Gelly, et al. An image is worth 16x16 words: Trans-
formers for image recognition at scale. In Int. Conf. Learn. [21] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht-
Represent., 2020. 1, 2, 3, 5, 8 enhofer, Trevor Darrell, and Saining Xie. A convnet for the
2020s. arXiv preprint arXiv:2201.03545, 2022. 1, 2, 3, 5, 6,
[10] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao
7
Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyra-
mid vision transformer: A versatile backbone for dense pre- [22] Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang
diction without convolutions. In Int. Conf. Comput. Vis., Ding. Scaling up your kernels to 31x31: Revisiting large
pages 568–578, 2021. 1, 3, 5, 8 kernel design in cnns. In IEEE Conf. Comput. Vis. Pattern
Recog., pages 11963–11975, 2022. 1, 2, 3, 4, 5, 6, 7
[11] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao
Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt [23] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou,
v2: Improved baselines with pyramid vision transformer. Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer
Computational Visual Media, 8(3):415–424, 2022. 1, 2, 3, is actually what you need for vision. In IEEE Conf. Comput.
5, 6, 7 Vis. Pattern Recog., pages 10819–10829, 2022. 2

[24] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- [38] Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models
ton. Layer normalization. arXiv preprint arXiv:1607.06450, and faster training. In International Conference on Machine
2016. 2, 3, 5 Learning., pages 10096–10106. PMLR, 2021. 2
[25] Dan Hendrycks and Kevin Gimpel. Gaussian error linear [39] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry
units (gelus). arxiv. arXiv preprint arXiv:1606.08415, 2016. Kalenichenko, Weijun Wang, Tobias Weyand, Marco An-
2, 5 dreetto, and Hartwig Adam. Mobilenets: Efficient convolu-
[26] Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, tional neural networks for mobile vision applications. arXiv
Jianmin Bao, Dong Chen, and Baining Guo. Contrastive preprint arXiv:1704.04861, 2017. 2
learning rivals masked image modeling in fine-tuning via [40] Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou,
feature distillation. arXiv preprint arXiv:2205.14141, 2022. Ser-Nam Lim, and Jiwen Lu. Hornet: Efficient high-order
2, 6, 7 spatial interactions with recursive gated convolutions. arXiv
[27] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong preprint arXiv:2207.14284, 2022. 3, 7
Zhang, Han Hu, and Yichen Wei. Deformable convolutional [41] Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Ji-
networks. In Int. Conf. Comput. Vis., pages 764–773, 2017. aying Liu, and Jingdong Wang. On the connection between
2 local attention and dynamic depth-wise convolution. In Int.
[28] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. De- Conf. Learn. Represent., 2021. 3
formable convnets v2: More deformable, better results. In
[42] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and
IEEE Conf. Comput. Vis. Pattern Recog., pages 9308–9316,
Hao Ma. Linformer: Self-attention with linear complexity.
2019. 2, 3, 4
arXiv preprint arXiv:2006.04768, 2020. 3
[29] Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao
[43] Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao
Xiao, Boqian Wu, Mykola Pechenizkiy, Decebal Mocanu,
Huang. Vision transformer with deformable attention. In
and Zhangyang Wang. More convnets in the 2020s: Scal-
IEEE Conf. Comput. Vis. Pattern Recog., pages 4794–4803,
ing up kernels beyond 51x51 using sparsity. arXiv preprint
2022. 3, 4
arXiv:2207.03620, 2022. 2, 7
[30] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, [44] Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas,
and Li Fei-Fei. Imagenet: A large-scale hierarchical image Niki Parmar, Blake Hechtman, and Jonathon Shlens. Scaling
database. In IEEE Conf. Comput. Vis. Pattern Recog., pages local self-attention for parameter efficient visual backbones.
248–255, 2009. 2, 5, 6 In IEEE Conf. Comput. Vis. Pattern Recog., pages 12894–
12904, 2021. 3
[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,
Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence [45] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B
Zitnick. Microsoft coco: Common objects in context. In Eur. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec
Conf. Comput. Vis., pages 740–755, 2014. 2, 6 Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for
neural language models. arXiv preprint arXiv:2001.08361,
[32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
2020. 3
Imagenet classification with deep convolutional neural net-
works. Communications of the ACM, 60(6):84–90, 2017. 2, [46] Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong
4 Wang, and Lu Yuan. Davit: Dual attention vision transform-
ers. arXiv preprint arXiv:2204.03645, 2022. 3
[33] Karen Simonyan and Andrew Zisserman. Very deep convo-
lutional networks for large-scale image recognition. arXiv [47] Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao
preprint arXiv:1409.1556, 2014. 2, 3 Xiao, Boqian Wu, Mykola Pechenizkiy, Decebal Mocanu,
[34] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, and Zhangyang Wang. More convnets in the 2020s: Scal-
Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent ing up kernels beyond 51x51 using sparsity. arXiv preprint
Vanhoucke, and Andrew Rabinovich. Going deeper with arXiv:2207.03620, 2022. 3
convolutions. In IEEE Conf. Comput. Vis. Pattern Recog., [48] Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, and
pages 1–9, 2015. 2 Jifeng Dai. An empirical study of spatial attention mecha-
[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. nisms in deep networks. In Int. Conf. Comput. Vis., pages
Deep residual learning for image recognition. In IEEE Conf. 6688–6697, 2019. 3
Comput. Vis. Pattern Recog., pages 770–778, 2016. 2, 3, 5 [49] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
[36] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image
Kaiming He. Aggregated residual transformations for deep segmentation with deep convolutional nets, atrous convolu-
neural networks. In IEEE Conf. Comput. Vis. Pattern Recog., tion, and fully connected crfs. IEEE Trans. Pattern Anal.
pages 1492–1500, 2017. 2 Mach. Intell., 40(4):834–848, 2017. 3
[37] Mingxing Tan and Quoc Le. Efficientnet: Rethinking [50] L-CCGP Florian and Schroff Hartwig Adam. Rethinking
model scaling for convolutional neural networks. In Interna- atrous convolution for semantic image segmentation. In
tional Conference on Machine Learning., pages 6105–6114. IEEE Conf. Comput. Vis. Pattern Recog., volume 6, 2017.
PMLR, 2019. 2, 5 3

[51] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian [64] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V
Schroff, and Hartwig Adam. Encoder-decoder with atrous Le. Self-training with noisy student improves imagenet clas-
separable convolution for semantic image segmentation. In sification. In IEEE Conf. Comput. Vis. Pattern Recog., pages
Eur. Conf. Comput. Vis., pages 801–818, 2018. 3 10687–10698, 2020. 6
[52] Yann LeCun, Bernhard Boser, John S Denker, Donnie [65] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong
Henderson, Richard E Howard, Wayne Hubbard, and Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for
Lawrence D Jackel. Backpropagation applied to handwritten dense predictions. arXiv preprint arXiv:2205.08534, 2022.
zip code recognition. Neural Computation, 1(4):541–551, 7
1989. 3
[66] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Gir-
[53] François Chollet. Xception: Deep learning with depthwise shick. Mask r-cnn. In Int. Conf. Comput. Vis., pages 2961–
separable convolutions. In IEEE Conf. Comput. Vis. Pattern 2969, 2017. 6
Recog., pages 1251–1258, 2017. 4
[67] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: high
[54] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang quality object detection and instance segmentation. IEEE
Wang, and Jifeng Dai. Deformable detr: Deformable trans- Trans. Pattern Anal. Mach. Intell., 43(5):1483–1498, 2019.
formers for end-to-end object detection. arXiv preprint 6
arXiv:2010.04159, 2020. 4
[68] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan
[55] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin
Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-
Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei
end semi-supervised object detection with soft teacher. In
Wang, and Tieyan Liu. On layer normalization in the trans-
Int. Conf. Comput. Vis., pages 3060–3069, 2021. 7
former architecture. In International Conference on Machine
Learning., pages 10524–10533. PMLR, 2020. 5 [69] Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen,
[56] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Mengchen Liu, Lu Yuan, and Lei Zhang. Dynamic head:
Massa, Alexandre Sablayrolles, and Hervé Jégou. Training Unifying object detection heads with attentions. In IEEE
data-efficient image transformers & distillation through at- Conf. Comput. Vis. Pattern Recog., pages 7373–7382, 2021.
tention. In International Conference on Machine Learning., 7
pages 10347–10357, 2021. 5 [70] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun
[57] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr
Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, with improved denoising anchor boxes for end-to-end object
Boxin Li, Chunyuan Li, et al. Florence: A new detection. arXiv preprint arXiv:2203.03605, 2022. 6, 7
foundation model for computer vision. arXiv preprint [71] Jianwei Yang, Chunyuan Li, and Jianfeng Gao. Focal mod-
arXiv:2111.11432, 2021. 6, 7 ulation networks. arXiv preprint arXiv:2203.11926, 2022.
[58] Weijie Su, Xizhou Zhu, Chenxin Tao, Lewei Lu, Bin Li, Gao 7
Huang, Yu Qiao, Xiaogang Wang, Jie Zhou, and Jifeng Dai. [72] Qiang Chen, Xiaokang Chen, Gang Zeng, and Jingdong
Towards all-in-one pre-training via maximizing multi-modal Wang. Group detr: Fast training convergence with de-
mutual information. In CVPR, 2023. 6 coupled one-to-many label assignment. arXiv preprint
[59] Christoph Schuhmann, Richard Vencu, Romain Beaumont, arXiv:2207.13085, 2022. 7
Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo
[73] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He.
Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m:
Exploring plain vision transformer backbones for object de-
Open dataset of clip-filtered 400 million image-text pairs.
tection. arXiv preprint arXiv:2203.16527, 2022. 7
arXiv preprint arXiv:2111.02114, 2021. 6
[60] Bart Thomee, David A Shamma, Gerald Friedland, Ben- [74] Tingting Liang, Xiaojie Chu, Yudong Liu, Yongtao Wang,
jamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Zhi Tang, Wei Chu, Jingdong Chen, and Haibin Ling. Cb-
Li-Jia Li. Yfcc100m: The new data in multimedia research. net: A composite backbone network architecture for object
Communications of the ACM, 59(2):64–73, 2016. 6 detection. IEEE Trans. Image Process., 2022. 6

[61] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu [75] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang
Soricut. Conceptual 12m: Pushing web-scale image-text pre- Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A
training to recognize long-tail visual concepts. In IEEE Conf. large-scale, high-quality dataset for object detection. In Int.
Comput. Vis. Pattern Recog., pages 3558–3568, 2021. 6 Conf. Comput. Vis., pages 8430–8439, 2019. 6
[62] Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: [76] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan-
Revenge of the vit. arXiv preprint arXiv:2204.07118, 2022. der Kirillov, and Rohit Girdhar. Masked-attention mask
6 transformer for universal image segmentation. arXiv preprint
arXiv:2112.01527, 2021. 7, 8
[63] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan
Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. [77] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and
Big transfer (bit): General visual representation learning. In Jian Sun. Unified perceptual parsing for scene understand-
Eur. Conf. Comput. Vis., pages 491–507. Springer, 2020. 6 ing. In Eur. Conf. Comput. Vis., pages 418–434, 2018. 7

[78] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela
Barriuso, and Antonio Torralba. Scene parsing through
ade20k dataset. In IEEE Conf. Comput. Vis. Pattern Recog.,
pages 633–641, 2017. 7
[79] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar,
Jose M Alvarez, and Ping Luo. Segformer: Simple and ef-
ficient design for semantic segmentation with transformers.
Adv. Neural Inform. Process. Syst., 34, 2021. 8

