InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Wenhai Wang1∗, Jifeng Dai2,1∗, Zhe Chen3,1∗, Zhenhang Huang1∗, Zhiqi Li3,1∗, Xizhou Zhu4∗,
Xiaowei Hu1, Tong Lu3, Lewei Lu4, Hongsheng Li5, Xiaogang Wang4,5, Yu Qiao1B
1 Shanghai AI Laboratory   2 Tsinghua University   3 Nanjing University   4 SenseTime Research   5 The Chinese University of Hong Kong
arXiv:2211.05778v2 [cs.CV] 13 Nov 2022

https://github.com/OpenGVLab/InternImage

∗ equal contribution, B corresponding author ([email protected])

Abstract

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs.

Figure 1. Comparisons of different core operators. (a) shows the global aggregation of multi-head self-attention (MHSA) [1], whose computational and memory costs are expensive in downstream tasks that require high-resolution inputs. (b) limits the range of MHSA into a local window [2] to reduce the cost. (c) is a depth-wise convolution with very large kernels to model long-range dependencies. (d) is a deformable convolution, which shares similar favorable properties with MHSA and is efficient enough for large-scale models. We start from it to build a large-scale CNN. (The figure contrasts query pixels with response pixels using fixed or adaptive weights, and summarizes: (a) global attention: long-range dependence ✓, adaptive spatial aggregation ✓, computation/memory efficient ✗; (b) local attention: long-range dependence ✗, adaptive spatial aggregation ✓, computation/memory efficient ✓; (c) large kernel: long-range dependence ✓, adaptive spatial aggregation ✗, computation/memory efficient ✓; (d) dynamic sparse kernel (ours): long-range dependence ✓, adaptive spatial aggregation ✓, computation/memory efficient ✓.)

1. Introduction

With the remarkable success of transformers in large-scale language models [3–8], vision transformers (ViTs) [2, 9–15] have also swept the computer vision field and are becoming the primary choice for the research and practice of large-scale vision foundation models. Some pioneers [16–20] have made attempts to extend ViTs to very large models with over a billion parameters, beating convolutional neural networks (CNNs) and significantly pushing the performance bound for a wide range of computer vision tasks, including basic classification, detection, and segmentation. While these results suggest that CNNs are inferior to ViTs in the era of massive parameters and data, we argue that CNN-based foundation models can also achieve comparable or even better performance than ViTs when equipped with similar operator-/architecture-level designs, scaling-up parameters, and massive data.
Figure 2. Performance comparison on COCO of different backbones. The proposed InternImage-H achieves a new record 65.4 box AP on COCO test-dev, significantly outperforming state-of-the-art CNNs and large-scale ViTs. (The plot shows COCO box AP (%) versus #parameters (B), with curves for InternImage-H, SwinV2, FD-SwinV2-G, BEiT-3, and Florence-CoSwin-H on test-dev, and for InternImage, Swin, and ConvNeXt on val2017; labeled points include 65.4, 64.2, 63.7, 63.1, and 62.4.)

To bridge the gap between CNNs and ViTs, we first summarize their differences from two aspects: (1) From the operator level [9, 21, 22], the multi-head self-attention (MHSA) of ViTs has long-range dependencies and adaptive spatial aggregation (see Fig. 1(a)). Benefiting from the flexible MHSA, ViTs can learn more powerful and robust representations than CNNs from massive data. (2) From the architecture view [9, 22, 23], besides MHSA, ViTs contain a series of advanced components that are not included in standard CNNs, such as Layer Normalization (LN) [24], the feed-forward network (FFN) [1], GELU [25], etc. Although recent works [21, 22] have made meaningful attempts to introduce long-range dependencies into CNNs by using dense convolutions with very large kernels (e.g., 31×31) as shown in Fig. 1(c), there is still a considerable gap with the state-of-the-art large-scale ViTs [16, 18–20, 26] in terms of performance and model scale.

In this work, we concentrate on designing a CNN-based foundation model that can efficiently extend to large-scale parameters and data. Specifically, we start with a flexible convolution variant, deformable convolution (DCN) [27, 28]. By combining it with a series of tailored block-level and architecture-level designs similar to transformers, we design a brand-new convolutional backbone network, termed InternImage. As shown in Fig. 1, different from recently improved CNNs with very large kernels such as 31×31 [22], the core operator of InternImage is a dynamic sparse convolution with a common window size of 3×3, where (1) its sampling offsets are flexible to dynamically learn appropriate receptive fields (which can be long- or short-range) from the given data; (2) the sampling offsets and modulation scalars are adaptively adjusted according to the input data, which achieves adaptive spatial aggregation like ViTs and reduces the over-inductive bias of regular convolutions; and (3) the convolution window is a common 3×3, avoiding the optimization problems and expensive costs caused by large dense kernels [22, 29].

With the aforementioned designs, the proposed InternImage can efficiently scale to large parameter sizes and learn stronger representations from large-scale training data, achieving comparable or even better performance to large-scale ViTs [2, 11, 30] on a wide range of vision tasks. In summary, our main contributions are as follows:

(1) We present a new large-scale CNN-based foundation model, InternImage. To our best knowledge, it is the first CNN that effectively scales to over 1 billion parameters and 400 million training images and achieves comparable or even better performance than state-of-the-art ViTs, showing that convolutional models are also a worth-exploring direction for large-scale model research.

(2) We successfully scale CNNs to large-scale settings by introducing long-range dependencies and adaptive spatial aggregation using an improved 3×3 DCN operator, and explore the tailored basic block, stacking rules, and scaling strategies centered on the operator. These designs make effective use of the operator, enabling our models to obtain the gains from large-scale parameters and data.

(3) We evaluate the proposed model on representative vision tasks including image classification, object detection, and instance and semantic segmentation, and compare it with state-of-the-art CNNs and large-scale ViTs by scaling the model size from 30 million to 1 billion parameters and the data from 1 million to 400 million images. Specifically, our models with different parameter sizes can consistently outperform prior arts on ImageNet [31]. InternImage-B achieves 84.9% top-1 accuracy trained only on the ImageNet-1K dataset, outperforming CNN-based counterparts [21, 22] by at least 1.1 points. With large-scale parameters (i.e., 1 billion) and training data (i.e., 427 million), the top-1 accuracy of InternImage-H is further boosted to 89.2%, which is close to well-engineered ViTs [2, 30] and hybrid ViTs [20]. In addition, on COCO [32], a challenging downstream benchmark, our best model InternImage-H achieves a state-of-the-art 65.4% box mAP with 2.18 billion parameters, 2.3 points higher than SwinV2-G [16] (65.4 vs. 63.1) with 27% fewer parameters, as shown in Fig. 2.

2. Related Work

Vision foundation models. Convolutional neural networks (CNNs) became the mainstream for visual recognition after large-scale datasets and computation resources became available. Starting from AlexNet [33], many deeper and more effective neural network architectures have been proposed, such as VGG [34], GoogleNet [35], ResNet [36], ResNeXt [37], EfficientNet [38, 39], etc.
In addition to the architectural design, more sophisticated convolution operations such as depth-wise convolution [40] and deformable convolution [28, 41] have also been formulated. By considering the advanced designs of transformers, modern CNNs show promising performance on vision tasks by discovering better components in macro/micro designs and introducing improved convolutions with long-range dependencies [21, 42–44] or dynamic weights [45].

In recent years, a new line of vision foundation models has focused on transformer-based architectures. ViT [9] is the most representative model, which achieves great success in vision tasks thanks to global receptive fields and dynamic spatial aggregation. However, global attention in ViT suffers from expensive computational/memory complexity, especially on large feature maps, which limits its application in downstream tasks. To address this problem, PVT [10, 11] and Linformer [46] performed global attention on the downsampled key and value maps, DAT [47] employed deformable attention to sparsely sample information from value maps, while HaloNet [48] and Swin transformer [2] developed local attention mechanisms and used haloing and shift operations to transfer information among adjacent local regions.

Large-scale models. Scaling up models is an important strategy to improve feature representation quality, which has been well studied in the natural language processing (NLP) domain [49]. Inspired by the success in the NLP field, Zhai et al. [19] first extended ViT to 2 billion parameters. Liu et al. [16] enlarged the hierarchical-structure Swin transformer to a deeper and wider model with 3 billion parameters. Some researchers developed large-scale hybrid ViTs [20, 50] by combining the advantages of ViTs and CNNs at different levels. Recently, BEiT-3 [17] further explored stronger representations based on ViT with large-scale parameters using multimodal pre-training. These methods significantly raise the upper bound of basic vision tasks. However, research on CNN-based large-scale models has lagged behind transformer-based architectures in terms of the total number of parameters and performance. Although newly-proposed CNNs [21, 42–44] introduce long-range dependencies by using convolutions with very large kernels or recursive gated kernels, there is still a considerable gap with state-of-the-art ViTs. In this work, we aim to develop a CNN-based foundation model that can extend efficiently to a large scale comparable to ViTs.

3. Proposed Method

To design a large-scale CNN-based foundation model, we start with a flexible convolution variant, namely deformable convolution v2 (DCNv2) [28], and make some tune-ups based on it to better suit the requirements of large-scale foundation models. Then, we build the basic block by combining the tuned convolution operator with advanced block designs used in modern backbones [16, 19]. Finally, we explore the stacking and scaling principles of DCN-based blocks to build a large-scale convolutional model that can learn strong representations from massive data.

3.1. Deformable Convolution v3

Convolution vs. MHSA. Previous works [21, 22, 51] have extensively discussed the differences between CNNs and ViTs. Before deciding on the core operator of InternImage, we first summarize the main differences between regular convolution and MHSA.

(1) Long-range dependencies. Although it has long been recognized that models with large effective receptive fields (long-range dependencies) usually perform better on downstream vision tasks [52–54], the de-facto effective receptive field of CNNs [34, 36] stacked by 3×3 regular convolutions is relatively small. Even with very deep models, CNN-based models still cannot acquire long-range dependencies like ViTs, which limits their performance.

(2) Adaptive spatial aggregation. Compared to MHSA, whose weights are dynamically conditioned by the input, regular convolution [55] is an operator with static weights and strong inductive biases such as 2D locality, neighborhood structure, translation equivalence, etc. Because of these highly inductive properties, models composed of regular convolutions might converge faster and require less training data than ViTs, but they also restrict CNNs from learning more general and robust patterns from web-scale data.

Revisiting DCNv2. A straightforward way to bridge the gap between convolution and MHSA is to introduce long-range dependencies and adaptive spatial aggregation into regular convolutions. Let us start with DCNv2 [28], which is a general variant of regular convolution. Given an input x ∈ R^{C×H×W} and the current pixel p_0, DCNv2 can be formulated as:

y(\mathbf{p}_0) = \sum_{k=1}^{K} \mathbf{w}_k \, m_k \, \mathbf{x}(\mathbf{p}_0 + \mathbf{p}_k + \Delta\mathbf{p}_k),    (1)

where K represents the total number of sampling points, and k enumerates the sampling points. w_k ∈ R^{C×C} denotes the projection weights of the k-th sampling point, and m_k ∈ R represents the modulation scalar of the k-th sampling point, which is normalized by the sigmoid function. p_k denotes the k-th location of the pre-defined grid sampling {(−1, −1), (−1, 0), ..., (0, +1), ..., (+1, +1)} as in regular convolutions, and Δp_k is the offset corresponding to the k-th grid sampling location. We see from the equation that (1) for long-range dependencies, the sampling offset Δp_k is flexible and able to interact with short- or long-range features; and (2) for adaptive spatial aggregation, both the sampling offset Δp_k and the modulation scalar m_k are learnable and conditioned by the input x. So it can be found that DCNv2 shares similar favorable properties with MHSA, which motivated us to develop large-scale CNN-based foundation models on the basis of this operator.
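To make Eq. (1) concrete, the following is a minimal NumPy reference of the DCNv2 aggregation at a single output pixel, assuming the per-pixel offsets Δp_k, the sigmoid-normalized modulation scalars m_k, and the per-point projection weights w_k have already been produced by a separate prediction branch (as in [28]). Function and variable names are illustrative only; this is a sketch of the equation, not the optimized operator.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Bilinearly sample x (C, H, W) at a fractional location (py, px); zero padding outside."""
    C, H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    out = np.zeros(C)
    for dy in (0, 1):
        for dx in (0, 1):
            yi, xi = y0 + dy, x0 + dx
            if 0 <= yi < H and 0 <= xi < W:
                wy = 1.0 - abs(py - yi)
                wx = 1.0 - abs(px - xi)
                out += wy * wx * x[:, yi, xi]
    return out

def dcnv2_point(x, p0, grid, offsets, modulation, weights):
    """Eq. (1): y(p0) = sum_k w_k * m_k * x(p0 + p_k + dp_k).

    x          : (C, H, W) input feature map
    p0         : (2,) current pixel location (y, x)
    grid       : (K, 2) pre-defined 3x3 grid p_k, i.e. (-1,-1) ... (1,1)
    offsets    : (K, 2) learned offsets dp_k for this pixel
    modulation : (K,)   sigmoid-normalized modulation scalars m_k for this pixel
    weights    : (K, C_out, C) independent projection weights w_k
    """
    K = grid.shape[0]
    y = np.zeros(weights.shape[1])
    for k in range(K):
        py, px = p0 + grid[k] + offsets[k]
        sampled = bilinear_sample(x, py, px)          # x(p0 + p_k + dp_k)
        y += weights[k] @ (modulation[k] * sampled)   # w_k * m_k * x(...)
    return y
```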
Figure 3. Overall architecture of InternImage, where the core operator is DCNv3, and the basic block is composed of layer normalization (LN) [24] and a feed-forward network (FFN) [1] as in transformers; the stem and downsampling layers follow the conventional CNN design, where "s2" and "p1" mean stride 2 and padding 1, respectively. Constrained by the stacking rules, only 4 hyper-parameters (C_1, C', L_1, L_3) can decide a model variant. (The diagram shows: a stem (3×3 conv, s2, p1 → LN, GELU → 3×3 conv, s2, p1 → LN) applied to the H×W×3 input; stage 1 with basic block ×L_1 at resolution H/4×W/4×C_1; a downsampling layer (3×3 conv, s2, p1 → LN) between stages; stages 2–4 at H/8×W/8×C_2, H/16×W/16×C_3, and H/32×W/32×C_4 with L_2, L_3, and L_4 blocks, feeding cls, det, seg, etc.; each basic block applies DCNv3 (G_i) with predicted Δp and m, LN, and an FFN; stacking rules: (1) C_i = 2^{i−1} C_1; (2) G_i = C_i / C'; (3) L_1 = L_2 = L_4; (4) L_1 ≤ L_3.)

Extending DCNv2 for Vision Foundation Models. In common practice, DCNv2 is usually used as an extension to regular convolutions, loading the pre-trained weights of regular convolutions and fine-tuning them for better performance, which is not exactly suitable for large-scale vision foundation models that need to be trained from scratch. In this work, to address this problem, we extend DCNv2 in the following aspects:

(1) Sharing weights among convolutional neurons. Similar to regular convolution, different convolutional neurons¹ in the original DCNv2 have independent linear projection weights, and thus its parameter and memory complexity are linear with the total number of sampling points, which significantly limits the efficiency of the model, especially in large-scale models. To remedy this problem, we borrow the idea from the separable convolution [56] and detach the original convolution weights w_k into depth-wise and point-wise parts, where the depth-wise part is responsible for the original location-aware modulation scalar m_k, and the point-wise part is the shared projection weight w among sampling points.

(2) Introducing the multi-group mechanism. The multi-group (head) design first appeared in group convolution [33], and it is widely used in the MHSA [1] of transformers, where it works with adaptive spatial aggregation to effectively learn richer information from different representation subspaces at different locations. Inspired by this, we split the spatial aggregation process into G groups, each of which has individual sampling offsets Δp_{gk} and modulation scalars m_{gk}, so that different groups on a single convolution layer can have different spatial aggregation patterns, resulting in stronger features for downstream tasks.

(3) Normalizing modulation scalars along sampling points. The modulation scalars in the original DCNv2 are element-wise normalized by the sigmoid function. Therefore, each modulation scalar is in the range [0, 1], and the sum of the modulation scalars over all sample points is not stable and varies from 0 to K. This leads to unstable gradients in DCNv2 layers when training with large-scale parameters and data. To alleviate the instability issue, we change the element-wise sigmoid normalization to softmax normalization along the sample points. In this way, the sum of the modulation scalars is constrained to 1, which makes the training process of models at different scales more stable.

¹ A 3×3 regular convolution has 9 linear projection neurons.

Combining the aforementioned modifications, the extended DCNv2, marked as DCNv3, can be formulated as Eqn. (2):

y(\mathbf{p}_0) = \sum_{g=1}^{G} \sum_{k=1}^{K} \mathbf{w}_g \, m_{gk} \, \mathbf{x}_g(\mathbf{p}_0 + \mathbf{p}_k + \Delta\mathbf{p}_{gk}),    (2)

where G denotes the total number of aggregation groups. For the g-th group, w_g ∈ R^{C×C'} denotes the location-irrelevant projection weights of the group, where C' = C/G represents the group dimension; m_{gk} ∈ R denotes the modulation scalar of the k-th sampling point in the g-th group, normalized by the softmax function along the dimension K; x_g ∈ R^{C'×H×W} represents the sliced input feature map; and Δp_{gk} is the offset corresponding to the grid sampling location p_k in the g-th group.
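As a rough illustration of Eq. (2) and the three modifications above, here is a hedged NumPy sketch of the DCNv3 aggregation at one pixel: the input channels are split into G groups, each group shares a single point-wise projection w_g across all K sampling points, and the modulation scalars are softmax-normalized along K. It reuses the bilinear_sample helper from the previous sketch; names and shapes are illustrative and this is not the optimized CUDA operator used in practice.

```python
import numpy as np

def dcnv3_point(x, p0, grid, offsets, mod_logits, weights):
    """Eq. (2): y(p0) = sum_g sum_k w_g * m_gk * x_g(p0 + p_k + dp_gk).

    x          : (C, H, W) input feature map, with C = G * C_prime
    p0         : (2,) current pixel location (y, x)
    grid       : (K, 2) fixed 3x3 sampling grid p_k
    offsets    : (G, K, 2) per-group offsets dp_gk for this pixel
    mod_logits : (G, K) unnormalized modulation logits for this pixel
    weights    : (G, C, C_prime) shared, location-irrelevant projections w_g
    """
    G, K, _ = offsets.shape
    C = x.shape[0]
    C_prime = C // G

    # modification (3): softmax along the K sampling points of each group,
    # so the modulation scalars of a group sum to 1
    m = np.exp(mod_logits - mod_logits.max(axis=1, keepdims=True))
    m = m / m.sum(axis=1, keepdims=True)

    y = np.zeros(C)
    for g in range(G):
        xg = x[g * C_prime:(g + 1) * C_prime]             # sliced feature map x_g
        agg = np.zeros(C_prime)
        for k in range(K):
            py, px = p0 + grid[k] + offsets[g, k]
            agg += m[g, k] * bilinear_sample(xg, py, px)  # m_gk * x_g(p0 + p_k + dp_gk)
        # modification (1): one point-wise weight w_g shared over all k sampling points
        y += weights[g] @ agg
    return y
```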
In general, DCNv3, as an extension of the DCN series, enjoys three merits: (1) this operator makes up for the deficiencies of regular convolution in terms of long-range dependencies and adaptive spatial aggregation; (2) compared with attention-based operators such as common MHSA and the closely-related deformable attention, this operator inherits the inductive bias of convolution, making our model more efficient with less training data and shorter training time; and (3) this operator is based on sparse sampling, which is more computation- and memory-efficient than previous methods such as MHSA [1] and re-parameterized large kernels [22]. In addition, due to the sparse sampling, DCNv3 only needs a 3×3 kernel to learn long-range dependencies, which is easier to optimize and avoids extra auxiliary techniques such as re-parameterizing [22] used in large kernels.

3.2. InternImage Model

Using DCNv3 as the core operator brings a new problem: how to build a model that can make effective use of the core operator? In this section, we first present the details of the basic block and other integral layers of our model, and then we construct a new CNN-based foundation model termed InternImage by exploring a tailored stacking strategy for these basic blocks. Finally, we study scaling-up rules for the proposed model to obtain the gain from increasing parameters.

Basic block. Unlike the widely used bottlenecks in traditional CNNs [36], the design of our basic block is closer to that of ViTs, which is equipped with more advanced components including LN [24], feed-forward networks (FFN) [1], and GELU [25]. This design has proved to be efficient [2, 10, 11, 21, 22] in various vision tasks. The details of our basic block are illustrated in Fig. 3, where the core operator is DCNv3, and the sampling offsets and modulation scales are predicted by passing the input feature x through a separable convolution (a 3×3 depth-wise convolution followed by a linear projection). For the other components, we use the post-normalization setting [57] by default and follow the same design as the plain transformer [1, 9].
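The following PyTorch-style sketch shows one way such a post-norm block could be organized: a DCNv3 core operator (treated here as a given callable, since the real operator is custom), layer normalization applied to each branch output before the residual addition, and an FFN with GELU. Tensors are assumed channels-last (N, H, W, C) so that nn.LayerNorm and nn.Linear act on the channel dimension; this is a minimal sketch under these assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Post-norm InternImage-style block: x + LN(DCNv3(x)), then x + LN(FFN(x))."""

    def __init__(self, channels, dcnv3_op, ffn_ratio=4):
        super().__init__()
        self.dcnv3 = dcnv3_op            # assumed DCNv3 module taking/returning (N, H, W, C)
        self.norm1 = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(        # FFN [1] with GELU [25]
            nn.Linear(channels, channels * ffn_ratio),
            nn.GELU(),
            nn.Linear(channels * ffn_ratio, channels),
        )
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x):                        # x: (N, H, W, C), channels-last
        x = x + self.norm1(self.dcnv3(x))        # post-norm placement: LN on the branch output
        x = x + self.norm2(self.ffn(x))
        return x
```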
Stem & downsampling layers. To obtain hierarchical feature maps, we use convolutional stem and downsampling layers to resize the feature maps to different scales. As shown in Fig. 3, the stem layer is placed before the first stage to reduce the input resolution by 4 times. It consists of two convolutions, two LN layers, and one GELU layer, where the kernel size of the two convolutions is 3, the stride is 2, the padding is 1, and the output channel of the first convolution is half that of the second one. Similarly, the downsampling layer is made up of a 3×3 convolution with a stride of 2 and a padding of 1, followed by one LN layer. It sits between two stages and is used to downsample the input feature map by 2 times.
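A minimal PyTorch sketch of the stem and downsampling layers as just described (two 3×3 stride-2 convolutions with LN and GELU for the stem, and one 3×3 stride-2 convolution plus LN for downsampling). LayerNorm is applied here by temporarily permuting to channels-last; the layout choice and module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dimension of an (N, C, H, W) tensor."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        x = x.permute(0, 2, 3, 1)       # (N, H, W, C)
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)    # back to (N, C, H, W)

def make_stem(in_channels, c1):
    """Stem: 3x3 conv (s2, p1) -> LN -> GELU -> 3x3 conv (s2, p1) -> LN; 4x downsampling.
    The first conv outputs half the channels of the second one."""
    return nn.Sequential(
        nn.Conv2d(in_channels, c1 // 2, kernel_size=3, stride=2, padding=1),
        LayerNorm2d(c1 // 2),
        nn.GELU(),
        nn.Conv2d(c1 // 2, c1, kernel_size=3, stride=2, padding=1),
        LayerNorm2d(c1),
    )

def make_downsample(in_channels, out_channels):
    """Downsampling between stages: 3x3 conv (s2, p1) followed by LN; 2x downsampling."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1),
        LayerNorm2d(out_channels),
    )
```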
Stacking rules. To clarify the block-stacking process, we first list the integral hyper-parameters of InternImage as follows:

C_i: the channel number of the i-th stage;
G_i: the group number of the DCNv3 in the i-th stage;
L_i: the number of basic blocks in the i-th stage.

Since our model has 4 stages, a variant is decided by 12 hyper-parameters, whose search space is too large to exhaustively enumerate and find the best variant. To reduce the search space, we summarize the design experiences of prior arts [2, 21, 36] into 4 rules as shown in Fig. 3, where the first rule makes the channel numbers of the last three stages determined by the channel number C_1 of the first stage, and the second rule lets the group number correspond to the channel number of the stages. For the number of stacked blocks in different stages, we simplify the stacking pattern to "AABA", which means the block numbers of stages 1, 2, and 4 are the same and are not greater than that of stage 3, as illustrated in the last two rules. With these rules, an InternImage variant can be defined using only 4 hyper-parameters (C_1, C', L_1, L_3).

Let us choose a model with 30 million parameters as the origin and discretize C_1 to {16, 32, 64}, L_1 to {1, 2, 3, 4, 5}, and C' to {16, 32}. In this way, the original huge search space is reduced to 30, and we can find the best model among the 30 variants by training and evaluating them on ImageNet [31]. In practice, we use the best hyper-parameter setting (64, 16, 4, 18) to define the base model and scale it to different scales.

Scaling rules. Based on the optimal origin model under the aforementioned constraints, we further explore the parameter scaling rules inspired by [38]. Specifically, we consider two scaling dimensions, depth D (i.e., 3L_1 + L_3) and width C_1, and scale the two dimensions using α, β, and a composite factor φ. The scaling rules can be written as D' = α^φ D and C_1' = β^φ C_1, where α ≥ 1, β ≥ 1, and α β^{1.99} ≈ 2. Here, 1.99 is specific to InternImage and is calculated by doubling the model width while keeping the depth constant. We experimentally find that the best scaling setting is α = 1.09 and β = 1.36, and based on it we construct InternImage variants with different parameter scales, namely InternImage-T/S/B/L/XL, whose complexities are similar to those of ConvNeXt [21]. To further test the capability, we built a larger InternImage-H with 1 billion parameters. The configurations are summarized in Table 1.

model name | C_1 | C' | L_{1,2,3,4} | #params
InternImage-T (origin) | 64 | 16 | 4, 4, 18, 4 | 30M
InternImage-S | 80 | 16 | 4, 4, 21, 4 | 50M
InternImage-B | 112 | 16 | 4, 4, 21, 4 | 97M
InternImage-L | 160 | 16 | 5, 5, 22, 5 | 223M
InternImage-XL | 192 | 16 | 5, 5, 24, 5 | 335M
InternImage-H | 320 | 16 | 6, 6, 32, 6 | 1.08B

Table 1. Hyper-parameters for models of different scales. InternImage-T is the origin model, and -S/B/L/XL/H are scaled up from -T. "#params" denotes the number of parameters.
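The stacking rules of Fig. 3 and the scaling rule above can be written down directly. The sketch below derives the per-stage widths, groups, and depths from (C_1, C', L_1, L_3) and applies the compound scaling D' = α^φ D, C_1' = β^φ C_1; the rounding at the end is an illustrative assumption, since the exact rounding scheme is not spelled out in the text.

```python
def stage_config(c1, c_prime, l1, l3):
    """Stacking rules from Fig. 3:
    (1) C_i = 2^(i-1) * C_1    (2) G_i = C_i / C'
    (3) L_1 = L_2 = L_4        (4) L_1 <= L_3   ("AABA" pattern)
    """
    assert l1 <= l3, "rule (4): L1 must not exceed L3"
    channels = [c1 * 2 ** i for i in range(4)]   # rule (1)
    groups = [c // c_prime for c in channels]    # rule (2)
    depths = [l1, l1, l3, l1]                    # rules (3) and (4)
    return channels, groups, depths

def scale(depth, c1, phi, alpha=1.09, beta=1.36):
    """Scaling rule: D' = alpha^phi * D, C1' = beta^phi * C1, with alpha * beta^1.99 ~ 2."""
    return round(alpha ** phi * depth), round(beta ** phi * c1)

# Example: the origin InternImage-T setting (C1, C', L1, L3) = (64, 16, 4, 18)
channels, groups, depths = stage_config(64, 16, 4, 18)
print(channels)   # [64, 128, 256, 512]
print(groups)     # [4, 8, 16, 32]
print(depths)     # [4, 4, 18, 4]
print(scale(3 * 4 + 18, 64, phi=1.0))   # depth D = 3*L1 + L3 = 30 before scaling
```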
4. Experiment

We analyze and compare InternImage with the leading CNNs and ViTs on representative vision tasks including image classification, object detection, and instance and semantic segmentation. Besides the experiments in the main paper, due to space constraints, more experimental setups and ablation studies are presented in the supplementary material.

4.1. Image Classification

Settings. We evaluate the classification performance of InternImage on ImageNet [31]. For fair comparisons, following common practices [2, 10, 21, 58], InternImage-T/S/B are trained on ImageNet-1K (∼1.3 million images) for 300 epochs, and InternImage-L/XL are first trained on ImageNet-22K (∼14.2 million images) for 90 epochs and then fine-tuned on ImageNet-1K for 30 epochs. To further explore the capability of our model and match the large-scale private data used in previous methods [16, 20, 59], we adopt M3I Pre-training [60], a unified pre-training approach available for both unlabeled and weakly-labeled data, to pre-train InternImage-H on a 427 million joint dataset of public Laion-400M [61], YFCC-15M [62], and CC12M [63] for 30 epochs, and then we fine-tune the model on ImageNet-22K and ImageNet-1K for 30 epochs each.

Results. Table 2 shows the classification results of models with different scales. With comparable parameters and computational costs, our models are comparable or even superior to the state-of-the-art transformer-based and CNN-based models. For example, InternImage-T achieves 83.5% top-1 accuracy, outperforming ConvNeXt-T [21] by a clear margin of 1.4 points. InternImage-S/B keep the leading position, and InternImage-B surpasses the hybrid-ViT CoAtNet-1 [20] by 0.8 points. When pre-trained on ImageNet-22K and the large-scale joint dataset, the top-1 accuracies of InternImage-XL and -H are boosted to 88.0% and 89.2%, respectively, which is better than previous CNNs [22, 64] also trained with large-scale data, and closes the gap with the state-of-the-art large-scale ViTs to about 1 point. This gap may be caused by the discrepancy between large-scale inaccessible private data and the aforementioned joint public data. These results show that our InternImage not only has good performance on the common parameter scale and the public training data, but also can effectively extend to large-scale parameters and data.

method | type | scale | #params | #FLOPs | acc (%)
DeiT-S [58] | T | 224² | 22M | 5G | 79.9
PVT-S [10] | T | 224² | 25M | 4G | 79.8
Swin-T [2] | T | 224² | 29M | 5G | 81.3
CoAtNet-0 [20] | T | 224² | 25M | 4G | 81.6
CSwin-T [12] | T | 224² | 23M | 4G | 82.7
PVTv2-B2 [11] | T | 224² | 25M | 4G | 82.0
DeiT III-S [65] | T | 224² | 22M | 5G | 81.4
SwinV2-T/8 [16] | T | 256² | 28M | 6G | 81.8
Focal-T [66] | T | 224² | 29M | 5G | 82.2
ConvNeXt-T [21] | C | 224² | 29M | 5G | 82.1
SLaK-T [29] | C | 224² | 30M | 5G | 82.5
HorNet-T [44] | C | 224² | 23M | 4G | 83.0
InternImage-T (ours) | C | 224² | 30M | 5G | 83.5
PVT-L [10] | T | 224² | 61M | 10G | 81.7
Swin-S [2] | T | 224² | 50M | 9G | 83.0
CoAtNet-1 [20] | T | 224² | 42M | 8G | 83.3
PVTv2-B4 [11] | T | 224² | 63M | 10G | 83.6
SwinV2-S/8 [16] | T | 256² | 50M | 12G | 83.7
ConvNeXt-S [21] | C | 224² | 50M | 9G | 83.1
SLaK-S [29] | C | 224² | 55M | 10G | 83.8
HorNet-S [44] | C | 224² | 50M | 9G | 84.0
InternImage-S (ours) | C | 224² | 50M | 8G | 84.2
DeiT-B [58] | T | 224² | 87M | 18G | 83.1
Swin-B [2] | T | 224² | 88M | 15G | 83.5
CoAtNet-2 [20] | T | 224² | 75M | 16G | 84.1
PVTv2-B5 [11] | T | 224² | 82M | 12G | 83.8
DeiT III-B [65] | T | 224² | 87M | 18G | 83.8
SwinV2-B/8 [16] | T | 256² | 88M | 20G | 84.2
RepLKNet-31B [22] | C | 224² | 79M | 15G | 83.5
ConvNeXt-B [21] | C | 224² | 88M | 15G | 83.8
SLaK-B [29] | C | 224² | 95M | 17G | 84.0
HorNet-B [44] | C | 224² | 88M | 16G | 84.3
InternImage-B (ours) | C | 224² | 97M | 16G | 84.9
Swin-L‡ [2] | T | 384² | 197M | 104G | 87.3
CoAtNet-3‡ [20] | T | 384² | 168M | 107G | 87.6
CoAtNet-4‡ [20] | T | 384² | 275M | 190G | 87.9
DeiT III-L‡ [65] | T | 384² | 304M | 191G | 87.7
SwinV2-L/24‡ [16] | T | 384² | 197M | 115G | 87.6
RepLKNet-31L‡ [22] | C | 384² | 172M | 96G | 86.6
HorNet-L‡ [44] | C | 384² | 202M | 102G | 87.7
ConvNeXt-L‡ [21] | C | 384² | 198M | 101G | 87.5
ConvNeXt-XL‡ [21] | C | 384² | 350M | 179G | 87.8
InternImage-L‡ (ours) | C | 384² | 223M | 108G | 87.7
InternImage-XL‡ (ours) | C | 384² | 335M | 163G | 88.0
ViT-G/14# [30] | T | 518² | 1.84B | 5160G | 90.5
CoAtNet-6# [20] | T | 512² | 1.47B | 1521G | 90.5
CoAtNet-7# [20] | T | 512² | 2.44B | 2586G | 90.9
Florence-CoSwin-H# [59] | T | − | 893M | − | 90.0
SwinV2-G# [16] | T | 640² | 3.00B | − | 90.2
RepLKNet-XL# [22] | C | 384² | 335M | 129G | 87.8
BiT-L-ResNet152x4# [64] | C | 480² | 928M | − | 87.5
InternImage-H# (ours) | C | 224² | 1.08B | 188G | 88.5
InternImage-H# (ours) | C | 640² | 1.08B | 1478G | 89.2

Table 2. Image classification performance on the ImageNet validation set. "type" refers to the model type, where "T" and "C" denote transformer and CNN, respectively. "scale" is the input scale. "acc" is the top-1 accuracy. "‡" indicates that the model is pre-trained on ImageNet-22K [31]. "#" indicates pre-training on an extra large-scale private dataset such as JFT-300M [67] or FLD-900M [59], or on the joint public dataset in this work.

4.2. Object Detection

Settings. We verify the detection performance of our InternImage on the COCO benchmark [69], on top of two representative object detection frameworks: Mask R-CNN [70] and Cascade Mask R-CNN [71]. We follow common practices [2, 11] to initialize the backbone with pre-trained classification weights, and train models with a 1× (12 epochs) or 3× (36 epochs) schedule by default.
Mask R-CNN (each schedule column lists APb/APb50/APb75, APm/APm50/APm75):
method | #params | #FLOPs | 1× schedule | 3×+MS schedule
Swin-T [2] | 48M | 267G | 42.7/65.2/46.8, 39.3/62.2/42.2 | 46.0/68.1/50.3, 41.6/65.1/44.9
ConvNeXt-T [21] | 48M | 262G | 44.2/66.6/48.3, 40.1/63.3/42.8 | 46.2/67.9/50.8, 41.7/65.0/44.9
PVTv2-B2 [11] | 45M | 309G | 45.3/67.1/49.6, 41.2/64.2/44.4 | 47.8/69.7/52.6, 43.1/66.8/46.7
ViT-S [9, 68] | 48M | 353G | 44.7/65.8/48.3, 39.9/62.5/42.8 | 48.2/69.7/52.5, 42.8/66.4/45.9
InternImage-T (ours) | 49M | 270G | 47.2/69.0/52.1, 42.5/66.1/45.8 | 49.1/70.3/54.0, 43.7/67.3/47.1
Swin-S [2] | 69M | 354G | 44.8/66.6/48.9, 40.9/63.4/44.2 | 48.2/69.8/52.8, 43.2/67.0/46.1
ConvNeXt-S [21] | 70M | 348G | 45.4/67.9/50.0, 41.8/65.2/45.1 | 47.9/70.0/52.7, 42.9/66.9/46.2
PVTv2-B3 [11] | 65M | 397G | 47.0/68.1/51.7, 42.5/65.7/45.7 | 48.4/69.8/53.3, 43.2/66.9/46.7
InternImage-S (ours) | 69M | 340G | 47.8/69.9/52.8, 43.3/67.1/46.7 | 49.7/71.1/54.5, 44.4/68.5/47.8
Swin-B [2] | 107M | 496G | 46.9/−/−, 42.3/−/− | 48.6/70.0/53.4, 43.3/67.1/46.7
ConvNeXt-B [21] | 108M | 486G | 47.0/69.4/51.7, 42.7/66.3/46.0 | 48.5/70.1/53.3, 43.5/67.1/46.7
PVTv2-B5 [11] | 102M | 557G | 47.4/68.6/51.9, 42.5/65.7/46.0 | 48.4/69.2/52.9, 42.9/66.6/46.2
ViT-B [9, 68] | 120M | 781G | 47.0/68.2/51.4, 41.8/65.1/44.9 | 49.6/70.6/54.0, 43.6/67.7/46.9
InternImage-B (ours) | 115M | 501G | 48.8/71.0/53.9, 44.0/67.8/47.5 | 50.3/71.4/55.3, 44.8/68.7/48.0

Cascade Mask R-CNN (same column format):
method | #params | #FLOPs | 1× schedule | 3×+MS schedule
Swin-L‡ [2] | 253M | 1382G | 51.8/71.0/56.2, 44.9/68.4/48.9 | 53.9/72.4/58.8, 46.7/70.1/50.8
ConvNeXt-L‡ [21] | 255M | 1354G | 53.5/72.8/58.3, 46.4/70.2/50.2 | 54.8/73.8/59.8, 47.6/71.3/51.7
RepLKNet-31L‡ [22] | 229M | 1321G | − | 53.9/72.5/58.6, 46.5/70.0/50.6
HorNet-L‡ [44] | 259M | 1358G | − | 56.0/−/−, 48.6/−/−
InternImage-L‡ (ours) | 277M | 1399G | 54.9/73.8/59.6, 47.7/71.3/52.4 | 56.0/74.7/61.3, 48.4/72.2/53.0
ConvNeXt-XL‡ [21] | 407M | 1898G | 53.6/72.9/58.5, 46.5/70.3/50.5 | 55.2/74.2/59.9, 47.7/71.6/52.2
InternImage-XL‡ (ours) | 387M | 1782G | 55.3/74.5/60.2, 48.0/72.0/52.4 | 56.2/74.9/61.7, 48.8/72.6/53.8

Table 3. Object detection and instance segmentation performance on COCO val2017. The FLOPs are measured with 1280×800 inputs. APb and APm represent box AP and mask AP, respectively. "MS" means multi-scale training.

Results. As shown in Table 3, when using Mask R-CNN for object detection, we find that under a comparable number of parameters, our models significantly surpass their counterparts. For example, with the 1× training schedule, the box AP (APb) of InternImage-T is 4.5 points better than Swin-T [2] (47.2 vs. 42.7), and 3.0 points higher than ConvNeXt-T [21] (47.2 vs. 44.2). With the 3× multi-scale training schedule, more parameters, and the more advanced Cascade Mask R-CNN [71], InternImage-XL achieves an APb of 56.2, surpassing ConvNeXt-XL by 1.0 point (56.2 vs. 55.2). Similar results are also seen in the instance segmentation experiments. With the 1× training schedule, InternImage-T yields 42.5 mask AP (i.e., APm), which outperforms Swin-T and ConvNeXt-T by 3.2 points (42.5 vs. 39.3) and 2.4 points (42.5 vs. 40.1), respectively. The best APm of 48.8 is obtained by InternImage-XL with Cascade Mask R-CNN, which is at least 1.1 points higher than its counterparts.

To further push the performance bound of object detection, we follow the advanced setting used in leading methods [16, 17, 26, 74, 78] to initialize the backbone with weights pre-trained on ImageNet-22K or the large-scale joint dataset, and double its parameters via the composite techniques [78] (see the model with 2 billion parameters in Fig. 2). Then, we fine-tune it along with the DINO [74] detector on the Objects365 [79] and COCO datasets one after the other for 26 epochs and 12 epochs, respectively. As shown in Table 4, our method achieves the best results of 65.0 APb on COCO val2017 and 65.4 APb on test-dev. Compared to previous state-of-the-art models, we surpass FD-SwinV2-G [26] by 1.2 points (65.4 vs. 64.2), with 27% fewer parameters and without complicated distillation processes, which shows the effectiveness of our models on the detection task.

method | detector | #params | APb (val2017) | APb (test-dev)
Swin-L‡ [2] | HTC++ [2] | 284M | 58.0 | 58.7
Swin-L [2] | DyHead [72] | 213M | 56.2 | 58.4
ViT-L‡ [9] | ViT-Adapter [68] | 401M | 60.5 | 60.9
Swin-L‡ [2] | Soft-Teacher [73] | 284M | 60.7 | 61.3
Swin-L‡ [2] | DINO [74] | 218M | 63.2 | 63.3
FocalNet-H‡ [75] | DINO [74] | 746M | 64.2 | 64.3
ViT-Huge [76] | Group-DETRv2 [76] | 629M | − | 64.5
Florence-CoSwin-H# [59] | DyHead [72] | 637M | 62.0 | 62.4
SwinV2-G# [16] | HTC++ [2] | 3.00B | 62.5 | 63.1
BEiT-3# [17] | ViTDet [77] | 1.90B | − | 63.7
FD-SwinV2-G# [26] | HTC++ [2] | 3.00B | − | 64.2
InternImage-XL‡ (ours) | DINO [74] | 602M | 64.2 | 64.3
InternImage-H# (ours) | DINO [74] | 2.18B | 65.0 | 65.4

Table 4. Comparison of the state-of-the-art detectors on COCO val2017 and test-dev.

4.3. Semantic Segmentation

Settings. To evaluate the semantic segmentation performance of InternImage, we initialize the backbone with pre-trained classification weights and train our models with UperNet [81] on ADE20K [82] for 160k iterations, comparing fairly with previous CNN-based and transformer-based backbones. To further reach top performance, we arm InternImage-H with the more advanced Mask2Former [80], and adopt the same training settings as in [17, 68].
method | crop size | #params | #FLOPs | mIoU (SS) | mIoU (MS)
Swin-T [2] | 512² | 60M | 945G | 44.5 | 45.8
ConvNeXt-T [21] | 512² | 60M | 939G | 46.0 | 46.7
SLaK-T [29] | 512² | 65M | 936G | 47.6 | −
InternImage-T (ours) | 512² | 59M | 944G | 47.9 | 48.1
Swin-S [2] | 512² | 81M | 1038G | 47.6 | 49.5
ConvNeXt-S [21] | 512² | 82M | 1027G | 48.7 | 49.6
SLaK-S [29] | 512² | 91M | 1028G | 49.4 | −
InternImage-S (ours) | 512² | 80M | 1017G | 50.1 | 50.9
Swin-B [2] | 512² | 121M | 1188G | 48.1 | 49.7
ConvNeXt-B [21] | 512² | 122M | 1170G | 49.1 | 49.9
RepLKNet-31B [22] | 512² | 112M | 1170G | 49.9 | 50.6
SLaK-B [29] | 512² | 135M | 1172G | 50.2 | −
InternImage-B (ours) | 512² | 128M | 1185G | 50.8 | 51.3
Swin-L‡ [2] | 640² | 234M | 2468G | 52.1 | 53.5
RepLKNet-31L‡ [22] | 640² | 207M | 2404G | 52.4 | 52.7
ConvNeXt-L‡ [21] | 640² | 235M | 2458G | 53.2 | 53.7
ConvNeXt-XL‡ [21] | 640² | 391M | 3335G | 53.6 | 54.0
InternImage-L‡ (ours) | 640² | 256M | 2526G | 53.9 | 54.1
InternImage-XL‡ (ours) | 640² | 368M | 3142G | 55.0 | 55.3
SwinV2-G# [16] | 896² | 3.00B | − | − | 59.9
InternImage-H# (ours) | 896² | 1.12B | 3566G | 59.9 | 60.3
BEiT-3# [17] | 896² | 1.90B | − | − | 62.8
FD-SwinV2-G# [26] | 896² | 3.00B | − | − | 61.3
InternImage-H# (ours) + Mask2Former [80] | 896² | 1.31B | 4635G | 62.5 | 62.9

Table 5. Semantic segmentation performance on the ADE20K validation set. The FLOPs are measured with 512×2048, 640×2560, or 896×896 inputs according to the crop size. "SS" and "MS" mean single-scale and multi-scale testing, respectively.

Results. As shown in Table 5, when using UperNet [81] for semantic segmentation, our InternImage consistently outperforms prior arts [2, 21, 22, 29]. For example, with almost the same parameter numbers and FLOPs, our InternImage-B reports 50.8 mIoU on the ADE20K val set, outperforming strong counterparts such as ConvNeXt-B (50.8 vs. 49.1) and RepLKNet-31B (50.8 vs. 49.9). Furthermore, our InternImage-H yields 60.3 MS mIoU, which is better than SwinV2-G [16], while the parameter number is much smaller (1.12B vs. 3.00B).

It is worth noting that, when using Mask2Former [80] and multi-scale testing, our InternImage-H achieves the best mIoU of 62.9, higher than the current best BEiT-3 [17] on the ADE20K benchmark. These results demonstrate that the CNN-based foundation model can also enjoy the dividends of massive data and challenge the leading position of transformer-based models.

5. Conclusion & Limitations

We introduce InternImage, a new large-scale CNN-based foundation model that can provide strong representations for versatile vision tasks, such as image classification, object detection, and semantic segmentation. We tune the flexible DCNv2 operator to satisfy the requirements of foundation models, and develop a series of blocks, stacking rules, and scaling rules centered on the core operator. Extensive experiments on object detection and semantic segmentation benchmarks verify that our InternImage can obtain comparable or better performance than well-designed large-scale vision transformers trained with massive data, showing that CNNs are also a viable choice for large-scale vision foundation model research. Nonetheless, latency remains an issue for DCN-based operators when adapting to downstream tasks with high-speed requirements. Also, large-scale CNNs are still in their early stages of development, and we hope InternImage can serve as a good starting point.

References

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Adv. Neural Inform. Process. Syst., 30, 2017. 1, 2, 4, 5

[2] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Int. Conf. Comput. Vis., pages 10012–10022, 2021. 1, 2, 3, 5, 6, 7, 8

[3] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. 1

[4] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. 1

[5] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020. 1

[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 1

[7] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 1

[8] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022. 1

[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Int. Conf. Learn. Represent., 2020. 1, 2, 3, 5, 7
[10] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Int. Conf. Comput. Vis., pages 568–578, 2021. 1, 3, 5, 6

[11] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvtv2: Improved baselines with pyramid vision transformer. CVMJ, pages 1–10, 2022. 1, 2, 3, 5, 6, 7

[12] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. IEEE Conf. Comput. Vis. Pattern Recog., pages 12124–12134, 2022. 1, 6

[13] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. In Int. Conf. Comput. Vis., pages 22–31, 2021. 1

[14] Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit: Cross-covariance image transformers. Adv. Neural Inform. Process. Syst., 34, 2021. 1

[15] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. Adv. Neural Inform. Process. Syst., 34, 2021. 1

[16] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. Adv. Neural Inform. Process. Syst., pages 12009–12019, 2022. 1, 2, 3, 6, 7, 8

[17] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022. 1, 3, 7, 8

[18] Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583–8595, 2021. 1, 2

[19] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In IEEE Conf. Comput. Vis. Pattern Recog., pages 12104–12113, 2022. 1, 2, 3

[20] Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34:3965–3977, 2021. 1, 2, 3, 6

[21] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. arXiv preprint arXiv:2201.03545, 2022. 2, 3, 5, 6, 7, 8

[22] Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In IEEE Conf. Comput. Vis. Pattern Recog., pages 11963–11975, 2022. 2, 3, 5, 6, 7, 8

[23] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10819–10829, 2022. 2

[24] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. 2, 4, 5

[25] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016. 2, 5

[26] Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, and Baining Guo. Contrastive learning rivals masked image modeling in fine-tuning via feature distillation. arXiv preprint arXiv:2205.14141, 2022. 2, 7, 8

[27] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Int. Conf. Comput. Vis., pages 764–773, 2017. 2

[28] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In IEEE Conf. Comput. Vis. Pattern Recog., pages 9308–9316, 2019. 2, 3

[29] Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Mykola Pechenizkiy, Decebal Mocanu, and Zhangyang Wang. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv preprint arXiv:2207.03620, 2022. 2, 6, 8

[30] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In IEEE Conf. Comput. Vis. Pattern Recog., pages 12104–12113, 2022. 2, 6

[31] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conf. Comput. Vis. Pattern Recog., pages 248–255, 2009. 2, 5, 6

[32] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1209–1218, 2018. 2

[33] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017. 2, 4

[34] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 3

[35] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1–9, 2015. 3
[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 770–778, 2016. 3, 5

[37] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1492–1500, 2017. 3

[38] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019. 3, 5

[39] Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In International Conference on Machine Learning, pages 10096–10106. PMLR, 2021. 3

[40] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 3

[41] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Int. Conf. Comput. Vis., pages 764–773, 2017. 3

[42] Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In IEEE Conf. Comput. Vis. Pattern Recog., pages 11963–11975, 2022. 3

[43] Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Mykola Pechenizkiy, Decebal Mocanu, and Zhangyang Wang. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv preprint arXiv:2207.03620, 2022. 3

[44] Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser-Nam Lim, and Jiwen Lu. Hornet: Efficient high-order spatial interactions with recursive gated convolutions. arXiv preprint arXiv:2207.14284, 2022. 3, 6, 7

[45] Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, and Jingdong Wang. On the connection between local attention and dynamic depth-wise convolution. In Int. Conf. Learn. Represent., 2021. 3

[46] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020. 3

[47] Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4794–4803, 2022. 3

[48] Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, and Jonathon Shlens. Scaling local self-attention for parameter efficient visual backbones. In IEEE Conf. Comput. Vis. Pattern Recog., pages 12894–12904, 2021. 3

[49] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. 3

[50] Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, and Lu Yuan. Davit: Dual attention vision transformers. arXiv preprint arXiv:2204.03645, 2022. 3

[51] Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, and Jifeng Dai. An empirical study of spatial attention mechanisms in deep networks. In Int. Conf. Comput. Vis., pages 6688–6697, 2019. 3

[52] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017. 3

[53] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., volume 6, 2017. 3

[54] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Eur. Conf. Comput. Vis., pages 801–818, 2018. 3

[55] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989. 3

[56] François Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1251–1258, 2017. 4

[57] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020. 5

[58] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357, 2021. 6

[59] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021. 6, 7

[60] Anonymous. Towards all-in-one pre-training via maximizing multi-modal mutual information. arXiv preprint, 2022. 6

[61] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 6
[62] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 6

[63] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3558–3568, 2021. 6

[64] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. In Eur. Conf. Comput. Vis., pages 491–507. Springer, 2020. 6

[65] Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. arXiv preprint arXiv:2204.07118, 2022. 6

[66] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021. 6

[67] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10687–10698, 2020. 6

[68] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534, 2022. 7

[69] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Eur. Conf. Comput. Vis., pages 740–755, 2014. 6

[70] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Int. Conf. Comput. Vis., pages 2961–2969, 2017. 6

[71] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 43(5):1483–1498, 2019. 6, 7

[72] Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, and Lei Zhang. Dynamic head: Unifying object detection heads with attentions. In IEEE Conf. Comput. Vis. Pattern Recog., pages 7373–7382, 2021. 7

[73] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In Int. Conf. Comput. Vis., pages 3060–3069, 2021. 7

[74] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022. 7

[75] Jianwei Yang, Chunyuan Li, and Jianfeng Gao. Focal modulation networks. arXiv preprint arXiv:2203.11926, 2022. 7

[76] Qiang Chen, Xiaokang Chen, Gang Zeng, and Jingdong Wang. Group detr: Fast training convergence with decoupled one-to-many label assignment. arXiv preprint arXiv:2207.13085, 2022. 7

[77] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527, 2022. 7

[78] Tingting Liang, Xiaojie Chu, Yudong Liu, Yongtao Wang, Zhi Tang, Wei Chu, Jingdong Chen, and Haibin Ling. Cbnet: A composite backbone network architecture for object detection. IEEE Trans. Image Process., 2022. 7

[79] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Int. Conf. Comput. Vis., pages 8430–8439, 2019. 7

[80] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. arXiv preprint arXiv:2112.01527, 2021. 7, 8

[81] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Eur. Conf. Comput. Vis., pages 418–434, 2018. 7, 8

[82] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In IEEE Conf. Comput. Vis. Pattern Recog., pages 633–641, 2017. 7
