Abstract

Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been visited. In this work, we present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to prompt the language model, we are able to facilitate our model to better exploit the pre-trained knowledge. Our method is model-agnostic, which can be applied to various pre-trained visual backbones including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our method on semantic segmentation, object detection, and instance segmentation tasks. Code is available at https://github.com/raoyongming/DenseCLIP.

Figure 1. Comparison of the conventional "pre-training + fine-tuning" paradigm and our proposed DenseCLIP. The pre-training + fine-tuning paradigm directly applies the image pre-trained model as the initialization of the encoder. Differently, DenseCLIP transfers the knowledge learned with image-text contrastive learning to dense prediction models by introducing a new pixel-text matching task and further using the contextual information from images to prompt the pre-trained language model. (a) Conventional pre-training + fine-tuning paradigm. (b) DenseCLIP: CLIP pre-training + language-guided fine-tuning.

1. Introduction
By exploiting the semantic relationships between images and the associated texts, contrastive language-image pre-training (CLIP) [33] benefits from rich and semantic-level supervision from texts while enjoying a broader and cheaper source of data. Thanks to the language supervision, models pre-trained via CLIP achieve impressive results on various visual classification tasks with no or very limited annotations [13, 41, 60].

Figure 2. Results of different pre-training and fine-tuning strategies on the semantic segmentation task. We report the single-scale and multi-scale mIoU on ADE20K [59] of different pre-trained ResNet-50 [18] models, including supervised ImageNet1K [11] (IN1K) and ImageNet21K [11, 36] (IN21K), self-supervised MoCoV2 [9] and DenseCL [43], and the vision-language model CLIP. Equipped with DenseCLIP, we show that large-scale vision-language pre-training can substantially improve the dense prediction performance (+4.9%/+4.1%) over the commonly used ImageNet pre-training.

Very recently, several efforts have been made to adopt prompt engineering from the NLP community [28] to better transfer CLIP models to downstream visual classification tasks. Several learning-based prompting methods [13, 51, 56, 60] have been proposed to modify the output of the language model to better adapt to the new tasks. However, they mainly focus on transferring the CLIP model to classification tasks by performing image-text matching, which is much closer to the original pre-training task. The problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks and a more generic setting has barely been visited.

In this paper, we study how to fine-tune pre-trained CLIP models for dense prediction tasks. Compared to conventional ImageNet pre-trained models, one distinct challenge is the gap between the upstream contrastive pre-training task and the downstream per-pixel prediction task, where the former involves instance-level representations of both images and texts, and the latter is only based on the visual information at the pixel level. To tackle this problem, we present a new language-guided dense prediction framework named DenseCLIP. As shown in Figure 1 (b), it is designed for various dense prediction tasks by implicitly and explicitly leveraging the pre-trained knowledge from CLIP models. An implicit way to exploit the pre-trained knowledge is to directly fine-tune the models on the downstream datasets. But this straightforward way cannot fully exploit the potential of the CLIP models. Inspired by the original contrastive learning framework in CLIP, we propose to convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models explicitly. By further using the contextual information from the image to prompt the language model with a Transformer [40] module, we are able to facilitate our model to better exploit the pre-trained knowledge by optimizing the text embeddings.

Our method can be a plug-and-play module to improve the fine-tuning of CLIP pre-trained models on off-the-shelf dense prediction methods and tasks. By applying our method to the popular semantic segmentation framework Semantic FPN [21] on the challenging ADE20K [59] dataset, we exhibit +4.9%, +4.7%, and +2.3% mIoU improvements compared to ImageNet pre-trained models and +3.9%, +2.4%, and +1.2% mIoU improvements compared to vanilla fine-tuning of CLIP models based on ResNet-50, ResNet-101 [18], and ViT-B [12], respectively. We also observe significant improvements on object detection and instance segmentation tasks. Notably, we show that a ResNet-101 model equipped with our method and a lightweight Semantic FPN decoder can achieve 46.5% mIoU on ADE20K, which outperforms state-of-the-art solutions like DeepLabV3+ [7] and UperNet [45] with only 1/3 of the computation.

Moreover, our framework can also be applied to any backbone model by using the pre-trained language model to guide the training of dense prediction tasks. We observe significant improvements by applying DenseCLIP to ImageNet pre-trained ResNets [18] and recent Swin Transformers [29] with slight computation overhead. We expect our method to be a new and generic paradigm to improve dense prediction models with guidance from pre-trained language models.

2. Related Work

Pre-training and fine-tuning. The revolution of computer vision in the past decade has been driven by the "pre-training + fine-tuning" paradigm. Specifically, it first pre-trains models on large-scale datasets (e.g., ImageNet [11], JFT [38], Kinetics [4], etc.) in a supervised [12, 18, 29, 34] or self-supervised learning manner [3, 8, 15, 16], and then fine-tunes the models on various downstream tasks. In the NLP community, this framework has also been widely used [2] and has recently evolved into a prompt paradigm [28], in which downstream tasks are reformulated to simulate the tasks solved in the original pre-training process. Inspired by these works, we explore how to transfer the knowledge in large-scale vision-language pre-trained models to downstream dense prediction tasks.
Vision-language models. There have been a series of works on the interaction of the computer vision and natural language processing fields, e.g., text-to-image retrieval [44], image captioning [48], visual question answering [1], referring segmentation [19, 49, 50], and so on. Among these works, vision-language pre-training has attracted growing attention during the past few years [24, 32, 37]. As a milestone, Radford et al. devise a large-scale pre-training model, named CLIP [33], which employs a contrastive learning strategy on a huge amount of image-text pairs and shows impressive transferable ability over 30 classification datasets. Motivated by this work, a number of follow-ups have been proposed to improve the training strategy (e.g., CoOp [60], CLIP-Adapter [13], Tip-Adapter [56]) or apply it to other domains (e.g., ActionCLIP [41]). However, there are very few attempts at performing dense prediction tasks via the CLIP model. The work most related to ours is CPT [51], which reformulates dense prediction into a fill-in-the-blank problem by jointly marking co-referential parts of both image and text in color. Differently, we consider a standard dense prediction setting in this paper, where we use pixel-text relationships to guide the training of dense prediction models and further optimize the language embeddings with image context via a context-aware prompting method.

Dense prediction. Compared with the conventional instance-level classification problem, dense prediction tasks (e.g., semantic segmentation [59], instance segmentation [27], object detection [27]) are more challenging as they require modeling finer-grained representations at the pixel or region level. Following the "pre-training + fine-tuning" paradigm, previous literature has developed various dense prediction models like FCN [30], PSPNet [57], FPN [25], UperNet [45], and many others. To alleviate the heavy annotation cost of previous supervised pre-training settings, a number of self-supervised pre-training approaches have been proposed for dense prediction [3, 43, 46, 47]. Orthogonal to these prior arts, we introduce a new fine-tuning strategy that leverages the knowledge in the large-scale vision-language pre-trained model and uses the language information to guide the learning process.

3. Approach

3.1. Preliminaries: Overview of CLIP

We begin by reviewing the Contrastive Language-Image Pre-training (CLIP) [33] framework to illustrate the motivation of our method. CLIP consists of two encoders, including an image encoder (ResNet [18] or ViT [12]) and a text encoder (Transformer [40]). The goal of CLIP is to align the embedding spaces of vision and language during pre-training through a contrastive objective.

To learn more transferable pre-trained knowledge, CLIP collects 400 million image-text pairs for model training. To transfer the knowledge of CLIP to the downstream classification task, a simple yet effective way [33] is to construct a set of text prompts based on a template such as "a photo of a [CLS].", where [CLS] can be replaced by the actual class names. Then, given an image, one can use CLIP to compute the similarities between the image and the text prompts in the embedding space, and the class with the highest score is regarded as the final prediction. Recently, several works [13, 60] have shown that CLIP can obtain strong classification performance with few examples. Therefore, it raises an interesting question: can the impressive ability of CLIP be transferred to more complex vision tasks like dense prediction?
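As a concrete illustration of this zero-shot recipe, the sketch below scores an image against prompt-template text features. It assumes the interface of the publicly released `clip` Python package (`clip.load`, `clip.tokenize`, `encode_image`, `encode_text`); the model variant, image path, and class names are placeholders.

```python
import torch
import clip  # the package released with CLIP [33]
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# One prompt per class from the hand-crafted template "a photo of a [CLS]."
class_names = ["dog", "cat", "car"]  # placeholder class names
text_tokens = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)      # (1, C)
    text_feat = model.encode_text(text_tokens)  # (K, C)
    # Cosine similarity between the image embedding and each class prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = image_feat @ text_feat.t()          # (1, K)

print("prediction:", class_names[scores.argmax(dim=-1).item()])
```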
However, the extension is nontrivial. Firstly, how to leverage the visual-language pre-trained model in dense prediction tasks is a barely visited question. Although a simple solution is to only use the image encoder like a pre-trained 2D backbone, we argue that the language priors contained in the text encoder are also of great importance. Secondly, unlike the classification considered in [13, 60], transferring the knowledge from CLIP to dense prediction is more difficult due to the substantial gap between the upstream contrastive pre-training task and the downstream per-pixel prediction task, where the former considers instance-level representations of both images and texts, and the latter is only based on the visual information but expects pixel-level outputs.

3.2. Language-Guided Dense Prediction

To solve the above issues, we propose our language-guided dense prediction framework, which can better leverage the language priors in CLIP pre-trained models. The pipeline of our framework is shown in Figure 3. One of our important findings is that apart from the global image feature, we can also extract a language-compatible feature map from the last layer of the CLIP image encoder. To show this, we start by describing the architecture of the CLIP image encoder in detail. Taking the ResNet [18] encoder as an example, there are 4 stages in total and we denote the feature maps as $\{x_i\}_{i=1}^{4}$. Different from the original ResNet [18], CLIP makes a small modification [33] by adding an attention pooling layer. Specifically, CLIP first performs global average pooling on $x_4 \in \mathbb{R}^{H_4 W_4 \times C}$ to obtain a global feature $\bar{x}_4 \in \mathbb{R}^{1 \times C}$, where $H_4$, $W_4$, $C$ are the height, width, and number of channels of the feature maps from the 4-th stage of the backbone. The concatenated features $[\bar{x}_4, x_4]$ are then fed into a multi-head self-attention layer [40] (MHSA):

$[\bar{z}, z] = \mathrm{MHSA}([\bar{x}_4, x_4])$.  (1)

In the standard training process of CLIP, the global feature $\bar{z}$ is used as the output of the image encoder while the other outputs $z$ are usually neglected.
However, we find that $z$ has two interesting properties: (1) $z$ still retains sufficient spatial information and thus can serve as a feature map; (2) since the MHSA is symmetric to each input element, $z$ might behave similarly to $\bar{z}$, which aligns well with the language features. Based on the above observations, we can use $z$ as a language-compatible feature map. It is also noted that for architectures like ViT [12], $z$ can be obtained similarly by excluding the class token from the outputs.

Figure 3. The overall framework of DenseCLIP. DenseCLIP first extracts the image embeddings and the K-class text embeddings, and then calculates pixel-text score maps to convert the original image-text matching problem in CLIP to pixel-text matching for dense prediction. These score maps are fed into the decoder and are also supervised using the ground-truth labels. To better exploit the pre-trained knowledge, DenseCLIP uses the contextual information in images to prompt the language model with a Transformer module.
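To make Eq. (1) concrete, the sketch below re-implements a CLIP-style attention pooling layer that returns both the global feature and the spatial outputs. It is a simplified stand-in (a single self-attention layer, without the positional embeddings and output projection of the real CLIP module), intended only to show where the language-compatible feature map comes from.

```python
import torch
import torch.nn as nn

class AttentionPoolWithMap(nn.Module):
    """Simplified CLIP-style attention pooling that also returns the spatial
    outputs z (a sketch; the real CLIP module additionally uses positional
    embeddings and an output projection)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x4: torch.Tensor):
        # x4: (B, C, H4, W4) -> tokens of shape (B, H4*W4, C)
        b, c, h, w = x4.shape
        tokens = x4.flatten(2).permute(0, 2, 1)
        x_bar = tokens.mean(dim=1, keepdim=True)   # global average pooling, (B, 1, C)
        seq = torch.cat([x_bar, tokens], dim=1)    # [x̄4, x4]
        out, _ = self.mhsa(seq, seq, seq)          # Eq. (1): [z̄, z] = MHSA([x̄4, x4])
        z_bar, z = out[:, 0], out[:, 1:]           # z̄: (B, C), z: (B, H4*W4, C)
        return z_bar, z.permute(0, 2, 1).reshape(b, c, h, w)
```

In DenseCLIP, the spatial output is what gets passed to the decoder and to the pixel-text matching described next.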
To obtain the text features, we can construct text prompts from the template "a photo of a [CLS]." with the $K$ class names, and then use the CLIP text encoder to extract the features as $t \in \mathbb{R}^{K \times C}$. We then compute the pixel-text score maps using the language-compatible feature map $z$ and the text features $t$ by:

$s = \hat{z}\hat{t}^{\top}, \quad s \in \mathbb{R}^{H_4 W_4 \times K}$,  (2)

where $\hat{z}$ and $\hat{t}$ are the $\ell_2$-normalized versions of $z$ and $t$ along the channel dimension. The score maps characterize the results of pixel-text matching, which is one of the most crucial ingredients in our framework. Firstly, the score maps can be viewed as segmentation results at a lower resolution, and thus we can use them to compute an auxiliary segmentation loss. Secondly, we can concatenate the score maps to the last feature map to explicitly incorporate language priors, i.e., $x'_4 = [x_4, s] \in \mathbb{R}^{H_4 W_4 \times (C+K)}$. Our framework is model-agnostic because the modified feature maps can be directly used as usual in segmentation or detection with some minor modifications (e.g., the input dimension of FPN [25]).
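A minimal PyTorch sketch of Eq. (2) and the feature concatenation described above; tensor shapes follow the notation in the text, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def pixel_text_score_maps(z: torch.Tensor, t: torch.Tensor, x4: torch.Tensor):
    """Sketch of Eq. (2) and the subsequent concatenation.
    z:  (B, C, H4, W4) language-compatible feature map
    t:  (K, C) text features for the K classes
    x4: (B, C, H4, W4) last-stage feature map of the backbone"""
    z_hat = F.normalize(z, dim=1)    # l2-normalize along the channel dimension
    t_hat = F.normalize(t, dim=-1)
    # s[b, k, h, w] = <z_hat[b, :, h, w], t_hat[k, :]>
    s = torch.einsum("bchw,kc->bkhw", z_hat, t_hat)  # score maps, (B, K, H4, W4)
    x4_prime = torch.cat([x4, s], dim=1)             # x'_4 = [x4, s], (B, C+K, H4, W4)
    return s, x4_prime
```

The returned score maps can additionally be supervised with the auxiliary loss mentioned above, while the concatenated feature map is passed on to the decoder as usual.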
3.3. Context-Aware Prompting

Previous efforts [13, 60] have already proved that mitigating the domain gaps in vision or language can significantly improve the performance of CLIP models on downstream tasks. Therefore, instead of using the vanilla human-defined templates, we seek other methods to improve the text features $t$.

Language-domain prompting. Different from the original CLIP that uses human-designed templates like "a photo of a [CLS]." as text prompts, CoOp [60] introduces learnable textual contexts to achieve better transferability in downstream classification tasks by directly optimizing the contexts using back-propagation. Inspired by CoOp [60], we also use learnable textual contexts in our framework as a baseline, which only includes language-domain prompting. The input of the text encoder then becomes:

$[p, e_k], \quad 1 \le k \le K$,  (3)

where $p \in \mathbb{R}^{N \times C}$ are the learnable textual contexts and $e_k \in \mathbb{R}^{C}$ is the embedding of the name of the $k$-th class.
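The following sketch shows one way to realize Eq. (3): N learnable context vectors shared across classes are prepended to each class-name embedding before the sequence is fed to the text encoder. The number of contexts, the embedding width, and the initialization are illustrative choices, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """Sketch of language-domain prompting (Eq. (3)): learnable contexts p
    are shared across classes and prepended to each class embedding e_k."""

    def __init__(self, num_contexts: int = 8, dim: int = 512):
        super().__init__()
        self.p = nn.Parameter(torch.randn(num_contexts, dim) * 0.02)

    def forward(self, class_embeddings: torch.Tensor) -> torch.Tensor:
        # class_embeddings: (K, C) embeddings e_k of the K class names
        k = class_embeddings.shape[0]
        contexts = self.p.unsqueeze(0).expand(k, -1, -1)                     # (K, N, C)
        return torch.cat([contexts, class_embeddings.unsqueeze(1)], dim=1)   # (K, N+1, C)
```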
Vision-to-language prompting. Including descriptions of visual contexts can make the text more accurate. For example, "a photo of a cat in the grass." is more accurate than "a photo of a cat.". Therefore, we investigate how to use visual contexts to refine the text features. Generally, we can use the cross-attention mechanism in the Transformer decoder [40] to model the interactions between vision and language.

We propose two different strategies of context-aware prompting, which are shown in Figure 4. The first strategy we consider is pre-language-model prompting, or pre-model prompting for short. We pass the features $[\bar{z}, z]$ to a Transformer decoder to encode visual contexts:

$v_{\mathrm{pre}} = \mathrm{TransDecoder}(q, [\bar{z}, z])$,  (4)

where $q \in \mathbb{R}^{N \times C}$ are a set of learnable queries and $v_{\mathrm{pre}} \in \mathbb{R}^{N \times C}$ are the extracted visual contexts. We replace the $p$ in Equation (3) with the visual contexts $v_{\mathrm{pre}}$ to form the input of the text encoder. Since the input of the text encoder is modified, we refer to this version as pre-model prompting.

Another choice is to refine the text features after the text encoder, namely post-model prompting. In this variant, we use CoOp [60] to generate the text features and directly use them as the queries of the Transformer decoder:

$v_{\mathrm{post}} = \mathrm{TransDecoder}(t, [\bar{z}, z])$.  (5)
Figure 4. Two different strategies of context-aware prompting. The pre-model prompting directly uses the image contexts to generate the desired text inputs, while post-model prompting refines the class embeddings instead.

This implementation encourages the text features to find the most related visual clues. We then update the text features through a residual connection:

$t \leftarrow t + \gamma\, v_{\mathrm{post}}$,  (6)

where $\gamma$ is a learnable residual coefficient that controls how much visual context is injected (its initialization is ablated in Table 8 in the appendix).
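A sketch of post-model prompting (Eqs. (5)-(6)): the text features act as queries over the visual tokens $[\bar{z}, z]$ in a Transformer decoder, and the result is folded back with a small learnable residual weight. The decoder configuration and the initial value of gamma are illustrative; a standard `nn.TransformerDecoderLayer` (which also applies self-attention over the queries) is used here for brevity.

```python
import torch
import torch.nn as nn

class PostModelPrompting(nn.Module):
    """Sketch of post-model prompting with a residual update of the text
    features (Eqs. (5)-(6)). Layer sizes and the gamma init are illustrative."""

    def __init__(self, dim: int, num_heads: int = 4, num_layers: int = 3):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.gamma = nn.Parameter(torch.full((dim,), 1e-4))  # residual coefficient

    def forward(self, t: torch.Tensor, visual_tokens: torch.Tensor):
        # t: (B, K, C) text features from the text encoder (with learnable contexts)
        # visual_tokens: (B, 1 + H4*W4, C), i.e. [z̄, z]
        v_post = self.decoder(tgt=t, memory=visual_tokens)  # Eq. (5)
        return t + self.gamma * v_post                      # Eq. (6): residual update
```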
As noted above, the score maps can be supervised with an auxiliary segmentation loss, which also helps the feature map recover its locality faster, which is beneficial to dense prediction tasks for both segmentation and detection.

Object detection & instance segmentation. In this case, we do not have ground-truth segmentation labels. To construct a similar auxiliary loss as in segmentation, we use the bounding boxes and the labels to build a binary target $\tilde{y} \in \{0, 1\}^{H_4 W_4 \times K}$. The auxiliary objective can be defined as a binary cross-entropy loss:

$\mathcal{L}^{\mathrm{det}}_{\mathrm{aux}} = \mathrm{BinaryCrossEntropy}(\mathrm{Sigmoid}(s/\tau), \tilde{y})$.  (8)
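A sketch of the auxiliary detection objective in Eq. (8) for a single image: ground-truth boxes rasterize a binary pixel-text target, which supervises the score maps with a binary cross-entropy loss. The temperature value and the box coordinate convention are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def detection_aux_loss(score_map, boxes, labels, tau=0.07):
    """Sketch of Eq. (8) for one image.
    score_map: (K, H4, W4) pixel-text scores s.
    boxes: (M, 4) ground-truth boxes in feature-map coordinates (x1, y1, x2, y2).
    labels: (M,) class indices in [0, K).
    The temperature tau here is a placeholder, not the paper's setting."""
    k, h, w = score_map.shape
    target = torch.zeros(k, h, w, device=score_map.device)
    for (x1, y1, x2, y2), c in zip(boxes.round().long().clamp(min=0), labels):
        # Mark every location covered by the box as a positive for class c.
        target[c, y1:y2 + 1, x1:x2 + 1] = 1.0
    # BCE(Sigmoid(s / tau), y~) expressed with the numerically stable logits form.
    return F.binary_cross_entropy_with_logits(score_map / tau, target)
```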
Applications to any backbone models. Another interesting usage of our framework is that we can replace the image encoder of CLIP with any backbone (e.g., ImageNet pre-trained models and self-supervised models). Although there might be no strong relation between the outputs of the visual backbone and the text encoder, the backbone can learn better and faster with language guidance. In other words, we can leverage the language priors from the pre-trained text encoder to improve the performance of any pre-trained image backbone, which makes DenseCLIP a more generic framework to improve dense prediction with the natural language priors learned from large-scale pre-training.
Table 1. Semantic segmentation results on ADE20K. We report single-scale (SS) and multi-scale (MS) mIoU, GFLOPs, and the number of parameters.

Backbone | Method | Pre-train | mIoU (SS) | mIoU (MS) | GFLOPs | Params (M)
ResNet-50 | FCN [30] | ImageNet | 36.1 | 38.1 | 793.3 | 49.6
ResNet-50 | EncNet [55] | ImageNet | 40.1 | 41.7 | 565.6 | 36.1
ResNet-50 | PSPNet [57] | ImageNet | 41.1 | 41.9 | 716.2 | 49.1
ResNet-50 | CCNet [20] | ImageNet | 42.1 | 43.1 | 804.0 | 49.9
ResNet-50 | DeeplabV3+ [7] | ImageNet | 42.7 | 43.8 | 711.5 | 43.7
ResNet-50 | UperNet [45] | ImageNet | 42.1 | 42.8 | 953.2 | 66.5
ResNet-50 | DNL [52] | ImageNet | 41.9 | 43.0 | 939.3 | 50.1
ResNet-50 | Semantic FPN [21] | ImageNet | 38.6 | 40.6 | 227.1 | 31.0
ResNet-50 | CLIP + Semantic FPN | CLIP | 39.6 | 41.6 | 248.8 | 31.0
ResNet-50 | DenseCLIP + Semantic FPN | CLIP | 43.5 | 44.7 | 269.2 | 50.3
ResNet-101 | FCN [30] | ImageNet | 39.9 | 41.4 | 1104.4 | 68.6
ResNet-101 | EncNet [55] | ImageNet | 42.6 | 44.7 | 876.8 | 55.1
ResNet-101 | PSPNet [57] | ImageNet | 43.6 | 44.4 | 1027.4 | 68.1
ResNet-101 | CCNet [20] | ImageNet | 44.0 | 45.2 | 1115.2 | 68.9
ResNet-101 | DeeplabV3+ [7] | ImageNet | 44.6 | 46.1 | 1022.7 | 62.7
ResNet-101 | UperNet [45] | ImageNet | 43.8 | 44.8 | 1031.0 | 85.5
ResNet-101 | OCRNet [54] | ImageNet | 45.3 | - | 923.9 | 55.5
ResNet-101 | DNL [52] | ImageNet | 44.3 | 45.8 | 1250.5 | 69.1
ResNet-101 | Semantic FPN [21] | ImageNet | 40.4 | 42.3 | 304.9 | 50.0
ResNet-101 | CLIP + Semantic FPN | CLIP | 42.7 | 44.3 | 326.6 | 50.0
ResNet-101 | DenseCLIP + Semantic FPN | CLIP | 45.1 | 46.5 | 346.3 | 67.8
ViT-B | SETR-MLA-DeiT [58] | ImageNet | 46.2 | 47.7 | - | -
ViT-B | Semantic FPN [21] | ImageNet | 48.3 | 50.9 | 1037.4 | 100.8
ViT-B | Semantic FPN [21] | ImageNet-21K | 49.1 | 50.4 | 1037.4 | 100.8
ViT-B | CLIP + Semantic FPN | CLIP | 49.4 | 50.3 | 1037.4 | 100.8
ViT-B | DenseCLIP + Semantic FPN | CLIP | 50.6 | 51.3 | 1043.1 | 105.3
embeddings to a lower dim (256) before the Transformer module. We empirically find that directly fine-tuning CLIP models for dense prediction with the default training strategies in [10] leads to unsatisfactory results (only 21.9% mIoU on ADE20K, which is 15.6% lower than its ImageNet pre-trained counterpart). Therefore, two key modifications are made compared to the default configurations: (1) we use AdamW [31] instead of the default SGD, inspired by recent progress in vision Transformers [29, 39, 42]; (2) to better preserve the pre-trained weights, we set the learning rate of the image encoder to 1/10 of that of the other parameters. We also adopt the above training strategies for our baselines in the ablation studies for fair comparisons (+1.1% mIoU over the ImageNet pre-trained ResNet-50 with the default settings in [10]).
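A sketch of the two modifications above in code: AdamW with the image encoder placed in a parameter group whose learning rate is 1/10 of the base rate. The attribute name `image_encoder`, the base learning rate, and the weight decay are placeholders rather than the paper's exact configuration.

```python
import torch

def build_optimizer(model, base_lr=1e-4, weight_decay=1e-4):
    """AdamW with a 10x smaller learning rate for the CLIP image encoder."""
    backbone_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (backbone_params if name.startswith("image_encoder") else other_params).append(param)
    return torch.optim.AdamW(
        [
            {"params": backbone_params, "lr": base_lr * 0.1},  # image encoder: 1/10 lr
            {"params": other_params, "lr": base_lr},
        ],
        lr=base_lr,
        weight_decay=weight_decay,
    )
```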
Main results. We report the semantic segmentation results of our DenseCLIP with three different backbones on ADE20K in Table 1. We include the FLOPs, the number of parameters, and the mIoU in both single-scale (SS) and multi-scale (MS) testing. The experimental results show that, for the same backbone, our DenseCLIP with a simple Semantic FPN can outperform state-of-the-art methods that use more sophisticated decoders by large margins. Unlike previous works that use dilated backbones (ResNet-D8 [20, 53, 54, 57]), the ResNet encoder in DenseCLIP is closer to the standard ResNet, and thus our DenseCLIP has much fewer FLOPs. Besides, our DenseCLIP is +4.9%, +4.7%, and +2.3% mIoU (SS) higher than the original ImageNet pre-trained baselines on the ResNet-50, ResNet-101, and ViT-B backbones with acceptable extra computation cost. DenseCLIP is also +3.9%, +2.4%, and +1.2% mIoU higher than the vanilla fine-tuning strategy (CLIP + Semantic FPN).

Table 2. Ablation study. We demonstrate that performing post-model vision-to-language prompting can yield better performance with fewer extra FLOPs and parameters.

Pre-train | Language Prompt | V→L Prompt (pre) | V→L Prompt (post) | mIoU (%) | FLOPs (G) | Params (M)
ImageNet | | | | 38.6 | 227 | 31.0
CLIP | | | | 39.6 (+1.0) | 249 | 31.0
CLIP | ✓ | | | 42.1 (+3.5) | 269 | 46.5
CLIP | ✓ | ✓ | | 42.9 (+4.3) | 368 | 116.9
CLIP | ✓ | | ✓ | 43.5 (+4.9) | 269 | 50.2

Ablation studies. To further demonstrate the effects of different components of our DenseCLIP, we perform detailed ablation studies with the ResNet-50 [18] backbone; the results are shown in Table 2. Firstly, we show that by adopting the better training strategy described above, the ResNet-50 baseline we implemented has a higher mIoU than [10] (38.6% vs. 37.5%). Secondly, we find that the CLIP pre-trained ResNet-50 outperforms the ImageNet pre-trained one by 1%, which indicates that large-scale vision-language pre-trained models can be better transferred to downstream vision tasks.
To better leverage the language priors, we adopt our language-guided framework with language-domain prompting and witness a significant performance boost (+2.5% mIoU). Finally, we compare the two strategies of vision-to-language prompting for incorporating visual contexts. We find that both pre-model and post-model prompting can improve the performance, while post-model prompting is better and more computationally efficient. Therefore, we choose post-model prompting as the default configuration in all the remaining experiments.

Effects of language-guided pre-training and fine-tuning. We compare the performance on ADE20K of different pre-training and fine-tuning strategies to better reveal the potential of the language-guided paradigm, as shown in Figure 2. We consider supervised pre-training on ImageNet1K [11] and ImageNet21K [11, 36], self-supervised pre-training via MoCoV2 [16] and DenseCL [43], and vision-language pre-training. We show that the vision-language pre-trained model (CLIP) can outperform the ImageNet1K pre-trained model by vanilla fine-tuning. Furthermore, through language-guided fine-tuning with context-aware prompting, our DenseCLIP surpasses even the ImageNet21K pre-trained model. These promising results demonstrate that language priors can largely facilitate vision models in downstream dense prediction tasks.

4.2. Object Detection and Instance Segmentation

Setups. We also conduct experiments to apply our DenseCLIP to object detection and instance segmentation tasks on COCO [27], which contains 118K training images and 5K validation images. We adopt two widely used frameworks, RetinaNet [26] and Mask R-CNN [17]. Following [17], we report the standard AP, AP at IoU=0.5/0.75, and cross-scale AP. For Mask R-CNN, we report the mAPs for both object detection and instance segmentation, since these two tasks are performed simultaneously.

Implementation details. For object detection, we adopt ResNet-50 and ResNet-101 as backbones. We train all the models for 12 epochs using the AdamW optimizer with batch size 16, as in [5]. Specifically for RetinaNet, we witness a very large loss at the start of training that makes the model hard to converge. Therefore, we use gradient clipping with a maximum $\ell_2$ norm of 0.1 to protect the pre-trained weights.
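The clipping step can be added to a standard PyTorch training loop as below; everything except the 0.1 maximum norm is a generic placeholder.

```python
import torch

def optimizer_step(model, loss, optimizer, max_norm=0.1):
    """One parameter update with the global l2 gradient norm capped at 0.1
    to protect the pre-trained weights."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
```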
Results analysis. The results using RetinaNet [26] and Mask R-CNN [17] are summarized in Table 3 and Table 4, respectively. For object detection with RetinaNet, we compare DenseCLIP with the ImageNet1K pre-trained model and vanilla CLIP fine-tuning on the detection task. One can observe that DenseCLIP outperforms the ImageNet1K pre-trained model by +1.5% and +2.6% AP. Meanwhile, it also improves the vanilla fine-tuning strategy by +0.9% and +0.6% AP on the ResNet-50 and ResNet-101 backbones, respectively.

Table 3. Object detection on COCO val2017 using the RetinaNet [26] framework. We compare our DenseCLIP framework to the vanilla fine-tuning of ImageNet/CLIP pre-trained models. We find DenseCLIP can better make use of the language priors to facilitate better training.

Model | FLOPs (G) | Params (M) | AP | AP50 | AP75 | APS | APM | APL
RN50-IN1K [18] | 239 | 38 | 36.3 | 55.3 | 38.6 | 19.3 | 40.0 | 48.8
RN50-CLIP [33] | 265 | 38 | 36.9 | 57.7 | 39.1 | 22.5 | 40.7 | 47.1
RN50-DenseCLIP | 285 | 60 | 37.8 | 59.9 | 40.0 | 24.8 | 42.0 | 47.9
RN101-IN1K [18] | 315 | 57 | 38.5 | 57.6 | 41.0 | 21.7 | 42.8 | 50.4
RN101-CLIP [33] | 341 | 57 | 40.5 | 61.6 | 43.4 | 25.6 | 44.6 | 51.3
RN101-DenseCLIP | 360 | 78 | 41.1 | 63.4 | 44.1 | 26.9 | 45.5 | 52.4

For Mask R-CNN, we observe that DenseCLIP achieves consistent improvements on both the object detection and instance segmentation tasks within an affordable computational budget. Especially for instance segmentation, our DenseCLIP outperforms the ImageNet1K pre-trained model by +2.9% and +2.5% mask AP on the ResNet-50 and ResNet-101 backbones, and also outperforms the vanilla fine-tuning strategy by +0.8% and +0.7% mask AP. The significant improvements of DenseCLIP on the instance segmentation task suggest that our pixel-text matching is conceptually suitable for segmentation.

Table 4. Object detection and instance segmentation results on COCO val2017 using the Mask R-CNN [17] framework. Our DenseCLIP outperforms ImageNet/CLIP pre-trained baseline models, especially on the instance segmentation task.

Model | FLOPs (G) | Params (M) | APb | APb50 | APb75 | APbS | APbM | APbL | APm | APm50 | APm75 | APmS | APmM | APmL
RN50-IN1K [18] | 275 | 44 | 38.2 | 58.8 | 41.4 | 21.9 | 40.9 | 49.5 | 34.7 | 55.7 | 37.2 | 18.3 | 37.4 | 47.2
RN50-CLIP [33] | 301 | 44 | 39.3 | 61.3 | 42.7 | 24.6 | 42.6 | 50.1 | 36.8 | 58.5 | 39.2 | 18.6 | 39.9 | 51.8
RN50-DenseCLIP | 327 | 67 | 40.2 | 63.2 | 43.9 | 26.3 | 44.2 | 51.0 | 37.6 | 60.2 | 39.8 | 20.8 | 40.7 | 53.7
RN101-IN1K [18] | 351 | 63 | 40.0 | 60.5 | 44.0 | 22.6 | 44.0 | 52.6 | 36.1 | 57.5 | 38.6 | 18.8 | 39.7 | 49.5
RN101-CLIP [33] | 377 | 63 | 42.2 | 64.2 | 46.5 | 26.4 | 46.1 | 54.0 | 38.9 | 61.4 | 41.8 | 20.5 | 42.3 | 55.1
RN101-DenseCLIP | 399 | 84 | 42.6 | 65.1 | 46.5 | 27.7 | 46.5 | 54.2 | 39.6 | 62.4 | 42.4 | 21.4 | 43.0 | 56.2
4.3. DenseCLIP for Any Visual Backbone

Previous experiments have demonstrated the effectiveness of our DenseCLIP framework. However, since DenseCLIP is specifically designed to leverage the visual-language relation contained in the pre-trained CLIP models, the generalization ability of DenseCLIP might be doubted: is DenseCLIP only suitable for CLIP image encoders? To answer this question, we perform experiments to verify whether our DenseCLIP can also perform well with other backbones. The extension is actually straightforward: we can simply replace the CLIP image encoder with any given 2D pre-trained image model. Although there are no strong correlations between the feature maps of the new backbone and the text features output by the CLIP text encoder, we hypothesize that if we preserve the language priors by freezing the text encoder as before, the text encoder will guide the backbone to better adapt to downstream tasks.
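Preserving the language priors then amounts to freezing the CLIP text encoder while the (arbitrary) image backbone and decoder remain trainable; the attribute name below is illustrative.

```python
# Freeze the pre-trained CLIP text encoder so its language priors stay intact,
# while the replaced image backbone and the decoder are fine-tuned as usual.
for param in model.text_encoder.parameters():
    param.requires_grad = False
model.text_encoder.eval()  # also disable training-time behaviors such as dropout
```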
To verify the above assumption, we choose two representative 2D models: ResNet [18], the most widely used CNN model, and Swin [29], the recent state-of-the-art vision Transformer. Following the standard settings in [21] and [29], we use the Semantic FPN [21] framework for ResNet models and the UperNet [45] framework for Swin models. The experimental results are summarized in Table 5, where we report the mIoU on ADE20K for both single-scale and multi-scale testing. We demonstrate that our DenseCLIP can consistently improve all the baseline models notably. Specifically, DenseCLIP can bring ~2.5% single-scale mIoU improvement for ResNet-50/101 with Semantic FPN [21], and ~0.8% improvement for Swin-T/S with UperNet [45]. These results clearly show that our DenseCLIP can successfully guide any pre-trained 2D backbone with language priors to boost performance. Since the text encoder can be removed after training, our method provides a low-cost solution to improve arbitrary dense prediction models. Although these results still lag behind our models with CLIP image encoders, the findings in this section provide a way to generalize the human knowledge learned from large-scale vision-language pre-training to a wider range of models. We expect this could be an interesting direction to connect vision and language research in the future.

Table 5. Applying DenseCLIP to any backbone. Image backbones (such as ImageNet pre-trained ResNet [18] and Swin [29]) equipped with our DenseCLIP benefit from the language priors and enjoy a significant performance boost. We report mIoU on the ADE20K dataset for both single-scale (SS) and multi-scale (MS) testing.
4.4. Visualization

To better demonstrate the superiority of DenseCLIP, we provide several qualitative results in Figure 5. We compare the segmentation maps of our method and the baselines and find that DenseCLIP is better at identifying holistic objects.

Figure 5. Qualitative comparisons of segmentation results (columns: input, ImageNet, CLIP, DenseCLIP, ground truth).

5. Conclusion and Discussion

In this paper, we have presented a new framework, DenseCLIP, to transfer the knowledge from the vision-language pre-trained model (CLIP) to downstream dense prediction tasks. DenseCLIP is a model-agnostic framework that uses the pre-trained vision-language knowledge with the context-aware prompting strategy. The framework can be applied to various dense prediction tasks including semantic segmentation, object detection, and instance segmentation. We conducted extensive experiments to demonstrate the superior performance of our method.

Limitations & societal impact. Although our method has achieved substantial improvements in segmentation, we find the improvements on detection are not as significant. We conjecture that this is because the pre-trained CLIP image encoder lacks locality, since there is no such constraint during the pre-training of CLIP, while object-centered tasks can only provide less dense supervision. We believe DenseCLIP can be further improved by introducing dense supervision during pre-training or by better recovering the locality after pre-training. We develop a general method for dense prediction in this paper. Since our method is not for a specific application, it does not directly involve societal issues.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 62125603 and Grant U1813218, and in part by a grant from the Beijing Academy of Artificial Intelligence (BAAI).
References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages 2425-2433, 2015.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. Language models are few-shot learners. In NeurIPS, 2020.
[3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, pages 6299-6308, 2017.
[5] Kai Chen, Jiaqi Wang, Jiangmiao Pang, et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 40(4):834-848, 2017.
[7] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801-818, 2018.
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597-1607, 2020.
[9] Xinlei Chen, Haoqi Fan, Ross B. Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. CoRR, abs/2003.04297, 2020.
[10] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark, 2020.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255, 2009.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
[13] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
[14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580-587, 2014.
[15] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
[16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9726-9735, 2020.
[17] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[19] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, pages 108-124, 2016.
[20] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In ICCV, pages 603-612, 2019.
[21] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In CVPR, pages 6399-6408, 2019.
[22] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (BiT): General visual representation learning. In ECCV, pages 491-507. Springer, 2020.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097-1105, 2012.
[24] Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, and Jingjing Liu. Less is more: ClipBERT for video-and-language learning via sparse sampling. In CVPR, pages 7331-7341, 2021.
[25] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, pages 936-944, 2017.
[26] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980-2988, 2017.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740-755, 2014.
[28] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021.
[29] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[30] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431-3440, 2015.
[31] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[32] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pages 13-23, 2019.
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[34] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. In NeurIPS, 2021.
[35] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, pages 91-99, 2015.
[36] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. ImageNet-21K pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.
[37] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In ICLR, 2020.
[38] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, pages 843-852, 2017.
[39] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347-10357, 2021.
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998-6008, 2017.
[41] Mengmeng Wang, Jiazheng Xing, and Yong Liu. ActionCLIP: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021.
[42] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
[43] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In CVPR, pages 3024-3033, 2021.
[44] Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, and Jing Shao. CAMP: Cross-modal adaptive message passing for text-image retrieval. In ICCV, pages 5763-5772, 2019.
[45] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, pages 418-434, 2018.
[46] Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and Han Hu. Self-supervised learning with Swin transformers. arXiv preprint arXiv:2105.04553, 2021.
[47] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In CVPR, pages 16684-16693, 2021.
[48] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 37, pages 2048-2057, 2015.
[49] Zhao Yang, Yansong Tang, Luca Bertinetto, Hengshuang Zhao, and Philip H.S. Torr. Hierarchical interaction network for video object segmentation from referring expressions. In BMVC, 2021.
[50] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H.S. Torr. LAVT: Language-aware vision transformer for referring image segmentation. In CVPR, 2022.
[51] Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. CPT: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797, 2021.
[52] Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, and Han Hu. Disentangled non-local neural networks. In ECCV, 2020.
[53] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[54] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020.
[55] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, pages 7151-7160, 2018.
[56] Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free CLIP-Adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
[57] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, pages 2881-2890, 2017.
[58] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[59] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 127(3):302-321, 2019.
[60] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134, 2021.
Appendix: More Analysis

We provide more analyses of both the design of our model and the training strategies in this section.

Effects of learning rate multipliers. As discussed in Section 4.1, we found that the optimal learning rates for CLIP models and conventional ImageNet pre-trained models are different. Here we further investigate the effects of the learning rate multipliers for the image encoder and the text encoder in Table 6. We see that both fixing the text encoder and using a lower learning rate for the image encoder are beneficial for training the dense prediction model. Note that we observe a much lower performance (<30% mIoU) when directly fine-tuning CLIP models with a 1.0x learning rate for the image encoder, which suggests that our language-guided method can largely stabilize the training process and make the final results less sensitive to the learning rate configuration.

Table 8. Ablation study of the residual coefficient $\gamma$. The configuration used in our final models is the learnable $\gamma$ initialized to $10^{-4}$.

Initial value of $\gamma$ | $\gamma$ learnable | mIoU (%)
$10^{-4}$ | ✗ | 42.6
$10^{-4}$ | ✓ | 43.5
1.0 | ✓ | 42.8