Abstract

Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been visited. In this work, we present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to prompt the language model, we are able to facilitate our model to better exploit the pre-trained knowledge. Our method is model-agnostic, which can be applied to various pre-trained visual backbones including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our method on semantic segmentation, object detection, and instance segmentation tasks. Code is available at https://github.com/raoyongming/DenseCLIP.

Figure 1. Comparison of the conventional "pre-training + fine-tuning" paradigm and our proposed DenseCLIP. The pre-training + fine-tuning paradigm directly applies the image pre-trained model as the initialization of the encoder. Differently, DenseCLIP transfers the knowledge learned with image-text contrastive learning to dense prediction models by introducing a new pixel-text matching task and further using the contextual information from images to prompt the pre-trained language model. (a) Conventional pre-training + fine-tuning paradigm. (b) DenseCLIP: CLIP pre-training + language-guided fine-tuning.

1. Introduction
By exploiting the semantic relationships between images and the associated texts, contrastive language-image pre-training (CLIP) [33] benefits from rich and semantic-level supervision from texts while enjoying a broader and cheaper source of data. Thanks to the language supervision, models pre-trained via CLIP achieve impressive results on various visual classification tasks with no or very limited annotations [13, 41, 60].

Figure 2. Results of different pre-training and fine-tuning strategies on the semantic segmentation task. We report the single-scale and multi-scale mIoU on ADE20K [59] of different pre-trained ResNet-50 [18] models, including supervised ImageNet1K [11] (IN1K) and ImageNet21K [11, 36] (IN21K), self-supervised MoCoV2 [9] and DenseCL [43], and the vision-language model CLIP. Equipped with DenseCLIP, we show that large-scale vision-language pre-training can substantially improve the dense prediction performance (+4.9%/+4.1%) over the commonly used ImageNet pre-training.

Very recently, several efforts have been made to adopt prompt engineering from the NLP community [28] to better transfer CLIP models to downstream visual classification tasks. Several learning-based prompting methods [13, 51, 56, 60] have been proposed to modify the output of the language model to better adapt to the new tasks. However, they mainly focus on transferring the CLIP model to classification tasks by performing image-text matching, which is much closer to the original pre-training task. The problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks and a more generic setting has barely been visited.

In this paper, we study how to fine-tune pre-trained CLIP models for dense prediction tasks. Compared to conventional ImageNet pre-trained models, one distinct challenge is the gap between the upstream contrastive pre-training task and the downstream per-pixel prediction task, where the former involves instance-level representations of both images and texts, and the latter is only based on the visual information at the pixel level. To tackle this problem, we present a new language-guided dense prediction framework named DenseCLIP. As shown in Figure 1 (b), it is designed for various dense prediction tasks by implicitly and explicitly leveraging the pre-trained knowledge from CLIP models. An implicit way to exploit the pre-trained knowledge is to directly fine-tune the models on the downstream datasets. But this straightforward way cannot fully exploit the potential of the CLIP models. Inspired by the original contrastive learning framework in CLIP, we propose to convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models explicitly. By further using the contextual information from the image to prompt the language model with a Transformer [40] module, we are able to facilitate our model to better exploit the pre-trained knowledge by optimizing the text embeddings.

Our method can be a plug-and-play module to improve the fine-tuning of CLIP pre-trained models on off-the-shelf dense prediction methods and tasks. By applying our method to the popular semantic segmentation framework Semantic FPN [21] on the challenging ADE20K [59] dataset, we exhibit +4.9%, +4.7%, and +2.3% mIoU improvements compared to ImageNet pre-trained models and +3.9%, +2.4%, and +1.2% mIoU improvements compared to vanilla fine-tuning of CLIP models based on ResNet-50, ResNet-101 [18], and ViT-B [12], respectively. We also observe significant improvements on object detection and instance segmentation tasks. Notably, we show that a ResNet-101 model equipped with our method and a lightweight Semantic FPN decoder can achieve 46.5% mIoU on ADE20K, which outperforms state-of-the-art solutions like DeepLabV3+ [7] and UperNet [45] with only 1/3 of the computation.

Moreover, our framework can also be applied to any backbone model by using the pre-trained language model to guide the training of dense prediction tasks. We observe significant improvements by applying DenseCLIP to ImageNet pre-trained ResNets [18] and recent Swin Transformers [29] with slight computation overhead. We expect our method to be a new and generic paradigm to improve dense prediction models with guidance from pre-trained language models.

2. Related Work

Pre-training and fine-tuning. The revolution of computer vision in the past decade has been driven by the "pre-training + fine-tuning" paradigm. Specifically, it first pre-trains models on large-scale datasets (e.g., ImageNet [11], JFT [38], Kinetics [4], etc.) in a supervised [12, 18, 29, 34] or self-supervised learning manner [3, 8, 15, 16], and then fine-tunes the models on various downstream tasks. In the NLP community, this framework has also been widely used [2] and has recently evolved into a prompt paradigm [28], in which downstream tasks are reformulated to simulate the tasks solved in the original pre-training process. Inspired by these works, we explore how to transfer the knowledge in large-scale vision-language pre-trained models to downstream dense prediction tasks.
Vision-language models. There have been a series of works on the interaction of the computer vision and natural language processing fields, e.g., text-to-image retrieval [44], image captioning [48], visual question answering [1], referring segmentation [19, 49, 50], and so on. Among these works, vision-language pre-training has attracted growing attention during the past few years [24, 32, 37]. As a milestone, Radford et al. devise a large-scale pre-training model, named CLIP [33], which employs a contrastive learning strategy on a huge amount of image-text pairs and shows impressive transferable ability over 30 classification datasets. Motivated by this work, a number of follow-ups have been proposed to improve the training strategy (e.g., CoOp [60], CLIP-Adapter [13], Tip-Adapter [56]) or apply it to other domains (e.g., ActionCLIP [41]). However, there are very few attempts at performing dense prediction tasks via the CLIP model. The work most related to ours is CPT [51], which reformulates dense prediction into a fill-in-the-blank problem by jointly marking co-referential parts of both image and text in color. Differently, we consider a standard dense prediction setting in this paper, where we use pixel-text relationships to guide the training of dense prediction models and further optimize the language embeddings with image context via a context-aware prompting method.

Dense prediction. Compared with the conventional instance-level classification problem, dense prediction tasks (e.g., semantic segmentation [59], instance segmentation [27], object detection [27]) are more challenging as they require modeling finer-grained representations at the pixel or region level. Following the "pre-training + fine-tuning" paradigm, previous literature has developed various dense prediction models like FCN [30], PSPNet [57], FPN [25], UperNet [45], and many others. To alleviate the heavy annotation cost of previous supervised pre-training settings, a number of self-supervised pre-training approaches have been proposed for dense prediction [3, 43, 46, 47]. Orthogonal to these prior arts, we introduce a new fine-tuning strategy that leverages the knowledge in the large-scale vision-language pre-trained model and uses the language information to guide the learning process.

3. Approach

3.1. Preliminaries: Overview of CLIP

We begin by reviewing the Contrastive Language-Image Pre-training (CLIP) [33] framework to illustrate the motivation of our method. CLIP consists of two encoders, including an image encoder (ResNet [18] or ViT [12]) and a text encoder (Transformer [40]). The goal of CLIP is to align the embedding spaces of vision and language during pre-training through a contrastive objective.

To learn more transferable pre-trained knowledge, CLIP collects 400 million image-text pairs for model training. To transfer the knowledge of CLIP to the downstream classification task, a simple yet effective way [33] is to construct a set of text prompts based on a template such as "a photo of a [CLS].", where [CLS] can be replaced by the actual class names. Then, given an image, one can use CLIP to compute the similarities between the image and the text prompts in the embedding space, and the class with the highest score is regarded as the final prediction. Recently, several works [13, 60] have shown that CLIP can obtain strong classification performance with few examples. Therefore, it raises an interesting question: can the impressive ability of CLIP be transferred to more complex vision tasks like dense prediction?
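As a concrete illustration of this zero-shot recipe, the sketch below scores an image against prompt-template text features. It assumes the interface of the publicly released `clip` Python package (`clip.load`, `clip.tokenize`, `encode_image`, `encode_text`); the model variant, image path, and class names are placeholders.

```python
import torch
import clip  # the package released with CLIP [33]
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# One prompt per class from the hand-crafted template "a photo of a [CLS]."
class_names = ["dog", "cat", "car"]  # placeholder class names
text_tokens = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)      # (1, C)
    text_feat = model.encode_text(text_tokens)  # (K, C)
    # Cosine similarity between the image embedding and each class prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = image_feat @ text_feat.t()          # (1, K)

print("prediction:", class_names[scores.argmax(dim=-1).item()])
```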
However, the extension is nontrivial. Firstly, how to leverage the visual-language pre-trained model in dense prediction tasks is a barely visited question. Although a simple solution is to only use the image encoder like a pre-trained 2D backbone, we argue that the language priors contained in the text encoder are also of great importance. Secondly, unlike the classification considered in [13, 60], transferring the knowledge from CLIP to dense prediction is more difficult due to the substantial gap between the upstream contrastive pre-training task and the downstream per-pixel prediction task, where the former considers instance-level representations of both images and texts, and the latter is only based on the visual information but expects pixel-level outputs.

3.2. Language-Guided Dense Prediction

To solve the above issues, we propose our language-guided dense prediction framework, which can better leverage the language priors in CLIP pre-trained models. The pipeline of our framework is shown in Figure 3. One of our important findings is that apart from the global image feature, we can also extract a language-compatible feature map from the last layer of the CLIP image encoder. To show this, we start by describing the architecture of the CLIP image encoder in detail. Taking the ResNet [18] encoder as an example, there are 4 stages in total and we denote the feature maps as $\{x_i\}_{i=1}^{4}$. Different from the original ResNet [18], CLIP makes a small modification [33] by adding an attention pooling layer. Specifically, CLIP first performs global average pooling on $x_4 \in \mathbb{R}^{H_4 W_4 \times C}$ to obtain a global feature $\bar{x}_4 \in \mathbb{R}^{1 \times C}$, where $H_4$, $W_4$, $C$ are the height, width, and number of channels of the feature maps from the 4-th stage of the backbone. The concatenated features $[\bar{x}_4, x_4]$ are then fed into a multi-head self-attention layer [40] (MHSA):

$[\bar{z}, z] = \mathrm{MHSA}([\bar{x}_4, x_4])$.  (1)

In the standard training process of CLIP, the global feature $\bar{z}$ is used as the output of the image encoder while the other outputs $z$ are usually neglected.
However, we find that $z$ has two interesting properties: (1) $z$ still retains sufficient spatial information and thus can serve as a feature map; (2) since the MHSA is symmetric to each input element, $z$ might behave similarly to $\bar{z}$, which aligns well with the language features. Based on the above observations, we can use $z$ as a language-compatible feature map. It is also noted that for architectures like ViT [12], $z$ can be obtained similarly by excluding the class token from the outputs.

Figure 3. The overall framework of DenseCLIP. DenseCLIP first extracts the image embeddings and the K-class text embeddings, and then calculates pixel-text score maps to convert the original image-text matching problem in CLIP to pixel-text matching for dense prediction. These score maps are fed into the decoder and are also supervised using the ground-truth labels. To better exploit the pre-trained knowledge, DenseCLIP uses the contextual information in images to prompt the language model with a Transformer module.
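To make Eq. (1) concrete, the sketch below re-implements a CLIP-style attention pooling layer that returns both the global feature and the spatial outputs. It is a simplified stand-in (a single self-attention layer, without the positional embeddings and output projection of the real CLIP module), intended only to show where the language-compatible feature map comes from.

```python
import torch
import torch.nn as nn

class AttentionPoolWithMap(nn.Module):
    """Simplified CLIP-style attention pooling that also returns the spatial
    outputs z (a sketch; the real CLIP module additionally uses positional
    embeddings and an output projection)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x4: torch.Tensor):
        # x4: (B, C, H4, W4) -> tokens of shape (B, H4*W4, C)
        b, c, h, w = x4.shape
        tokens = x4.flatten(2).permute(0, 2, 1)
        x_bar = tokens.mean(dim=1, keepdim=True)   # global average pooling, (B, 1, C)
        seq = torch.cat([x_bar, tokens], dim=1)    # [x̄4, x4]
        out, _ = self.mhsa(seq, seq, seq)          # Eq. (1): [z̄, z] = MHSA([x̄4, x4])
        z_bar, z = out[:, 0], out[:, 1:]           # z̄: (B, C), z: (B, H4*W4, C)
        return z_bar, z.permute(0, 2, 1).reshape(b, c, h, w)
```

In DenseCLIP, the spatial output is what gets passed to the decoder and to the pixel-text matching described next.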
To obtain the text features, we can construct text prompts from the template "a photo of a [CLS]." with the $K$ class names, and then use the CLIP text encoder to extract the features as $t \in \mathbb{R}^{K \times C}$. We then compute the pixel-text score maps using the language-compatible feature map $z$ and the text features $t$ by:

$s = \hat{z}\hat{t}^{\top}, \quad s \in \mathbb{R}^{H_4 W_4 \times K}$,  (2)

where $\hat{z}$ and $\hat{t}$ are the $\ell_2$-normalized versions of $z$ and $t$ along the channel dimension. The score maps characterize the results of pixel-text matching, which is one of the most crucial ingredients in our framework. Firstly, the score maps can be viewed as segmentation results at a lower resolution, and thus we can use them to compute an auxiliary segmentation loss. Secondly, we can concatenate the score maps to the last feature map to explicitly incorporate language priors, i.e., $x'_4 = [x_4, s] \in \mathbb{R}^{H_4 W_4 \times (C+K)}$. Our framework is model-agnostic because the modified feature maps can be directly used as usual in segmentation or detection with some minor modifications (e.g., the input dimension of FPN [25]).
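A minimal PyTorch sketch of Eq. (2) and the feature concatenation described above; tensor shapes follow the notation in the text, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def pixel_text_score_maps(z: torch.Tensor, t: torch.Tensor, x4: torch.Tensor):
    """Sketch of Eq. (2) and the subsequent concatenation.
    z:  (B, C, H4, W4) language-compatible feature map
    t:  (K, C) text features for the K classes
    x4: (B, C, H4, W4) last-stage feature map of the backbone"""
    z_hat = F.normalize(z, dim=1)    # l2-normalize along the channel dimension
    t_hat = F.normalize(t, dim=-1)
    # s[b, k, h, w] = <z_hat[b, :, h, w], t_hat[k, :]>
    s = torch.einsum("bchw,kc->bkhw", z_hat, t_hat)  # score maps, (B, K, H4, W4)
    x4_prime = torch.cat([x4, s], dim=1)             # x'_4 = [x4, s], (B, C+K, H4, W4)
    return s, x4_prime
```

The returned score maps can additionally be supervised with the auxiliary loss mentioned above, while the concatenated feature map is passed on to the decoder as usual.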
3.3. Context-Aware Prompting

Previous efforts [13, 60] have already proved that mitigating the domain gaps in vision or language can significantly improve the performance of CLIP models on downstream tasks. Therefore, instead of using the vanilla human-defined templates, we seek other methods to improve the text features $t$.

Language-domain prompting. Different from the original CLIP that uses human-designed templates like "a photo of a [CLS]." as text prompts, CoOp [60] introduces learnable textual contexts to achieve better transferability in downstream classification tasks by directly optimizing the contexts using back-propagation. Inspired by CoOp [60], we also use learnable textual contexts in our framework as a baseline, which only includes language-domain prompting. The input of the text encoder then becomes:

$[p, e_k], \quad 1 \le k \le K$,  (3)

where $p \in \mathbb{R}^{N \times C}$ are the learnable textual contexts and $e_k \in \mathbb{R}^{C}$ is the embedding of the name of the $k$-th class.
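The following sketch shows one way to realize Eq. (3): N learnable context vectors shared across classes are prepended to each class-name embedding before the sequence is fed to the text encoder. The number of contexts, the embedding width, and the initialization are illustrative choices, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """Sketch of language-domain prompting (Eq. (3)): learnable contexts p
    are shared across classes and prepended to each class embedding e_k."""

    def __init__(self, num_contexts: int = 8, dim: int = 512):
        super().__init__()
        self.p = nn.Parameter(torch.randn(num_contexts, dim) * 0.02)

    def forward(self, class_embeddings: torch.Tensor) -> torch.Tensor:
        # class_embeddings: (K, C) embeddings e_k of the K class names
        k = class_embeddings.shape[0]
        contexts = self.p.unsqueeze(0).expand(k, -1, -1)                     # (K, N, C)
        return torch.cat([contexts, class_embeddings.unsqueeze(1)], dim=1)   # (K, N+1, C)
```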
Vision-to-language prompting. Including descriptions of visual contexts can make the text more accurate. For example, "a photo of a cat in the grass." is more accurate than "a photo of a cat.". Therefore, we investigate how to use visual contexts to refine the text features. Generally, we can use the cross-attention mechanism in the Transformer decoder [40] to model the interactions between vision and language.

We propose two different strategies of context-aware prompting, which are shown in Figure 4. The first strategy we consider is pre-language-model prompting, or pre-model prompting for short. We pass the features $[\bar{z}, z]$ to a Transformer decoder to encode visual contexts:

$v_{\mathrm{pre}} = \mathrm{TransDecoder}(q, [\bar{z}, z])$,  (4)

where $q \in \mathbb{R}^{N \times C}$ are a set of learnable queries and $v_{\mathrm{pre}} \in \mathbb{R}^{N \times C}$ are the extracted visual contexts. We replace the $p$ in Equation (3) with the visual contexts $v_{\mathrm{pre}}$ to form the input of the text encoder. Since the input of the text encoder is modified, we refer to this version as pre-model prompting.

Another choice is to refine the text features after the text encoder, namely post-model prompting. In this variant, we use CoOp [60] to generate the text features and directly use them as the queries of the Transformer decoder:

$v_{\mathrm{post}} = \mathrm{TransDecoder}(t, [\bar{z}, z])$.  (5)
Figure 4. Two different strategies of context-aware prompting. The pre-model prompting directly uses the image contexts to generate the desired text inputs, while post-model prompting refines the class embeddings instead.

This implementation encourages the text features to find the most related visual clues. We then update the text features through a residual connection:

$t \leftarrow t + \gamma\, v_{\mathrm{post}}$,  (6)

where $\gamma$ is a learnable residual coefficient that controls how much visual context is injected (its initialization is ablated in Table 8 in the appendix).
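A sketch of post-model prompting (Eqs. (5)-(6)): the text features act as queries over the visual tokens $[\bar{z}, z]$ in a Transformer decoder, and the result is folded back with a small learnable residual weight. The decoder configuration and the initial value of gamma are illustrative; a standard `nn.TransformerDecoderLayer` (which also applies self-attention over the queries) is used here for brevity.

```python
import torch
import torch.nn as nn

class PostModelPrompting(nn.Module):
    """Sketch of post-model prompting with a residual update of the text
    features (Eqs. (5)-(6)). Layer sizes and the gamma init are illustrative."""

    def __init__(self, dim: int, num_heads: int = 4, num_layers: int = 3):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.gamma = nn.Parameter(torch.full((dim,), 1e-4))  # residual coefficient

    def forward(self, t: torch.Tensor, visual_tokens: torch.Tensor):
        # t: (B, K, C) text features from the text encoder (with learnable contexts)
        # visual_tokens: (B, 1 + H4*W4, C), i.e. [z̄, z]
        v_post = self.decoder(tgt=t, memory=visual_tokens)  # Eq. (5)
        return t + self.gamma * v_post                      # Eq. (6): residual update
```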
As noted above, the score maps can be supervised with an auxiliary segmentation loss, which also helps the feature map recover its locality faster, which is beneficial to dense prediction tasks for both segmentation and detection.

Object detection & instance segmentation. In this case, we do not have ground-truth segmentation labels. To construct a similar auxiliary loss as in segmentation, we use the bounding boxes and the labels to build a binary target $\tilde{y} \in \{0, 1\}^{H_4 W_4 \times K}$. The auxiliary objective can be defined as a binary cross-entropy loss:

$\mathcal{L}^{\mathrm{det}}_{\mathrm{aux}} = \mathrm{BinaryCrossEntropy}(\mathrm{Sigmoid}(s/\tau), \tilde{y})$.  (8)
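A sketch of the auxiliary detection objective in Eq. (8) for a single image: ground-truth boxes rasterize a binary pixel-text target, which supervises the score maps with a binary cross-entropy loss. The temperature value and the box coordinate convention are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def detection_aux_loss(score_map, boxes, labels, tau=0.07):
    """Sketch of Eq. (8) for one image.
    score_map: (K, H4, W4) pixel-text scores s.
    boxes: (M, 4) ground-truth boxes in feature-map coordinates (x1, y1, x2, y2).
    labels: (M,) class indices in [0, K).
    The temperature tau here is a placeholder, not the paper's setting."""
    k, h, w = score_map.shape
    target = torch.zeros(k, h, w, device=score_map.device)
    for (x1, y1, x2, y2), c in zip(boxes.round().long().clamp(min=0), labels):
        # Mark every location covered by the box as a positive for class c.
        target[c, y1:y2 + 1, x1:x2 + 1] = 1.0
    # BCE(Sigmoid(s / tau), y~) expressed with the numerically stable logits form.
    return F.binary_cross_entropy_with_logits(score_map / tau, target)
```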
Applications to any backbone models. Another interesting usage of our framework is that we can replace the image encoder of CLIP with any backbone (e.g., ImageNet pre-trained models and self-supervised models). Although there might be no strong relation between the outputs of the visual backbone and the text encoder, the backbone can learn better and faster with language guidance. In other words, we can leverage the language priors from the pre-trained text encoder to improve the performance of any pre-trained image backbone, which makes DenseCLIP a more generic framework to improve dense prediction with the natural language priors learned from large-scale pre-training.
Table 1. Semantic segmentation results on ADE20K. We report single-scale (SS) and multi-scale (MS) mIoU, GFLOPs, and the number of parameters.

Backbone | Method | Pre-train | mIoU (SS) | mIoU (MS) | GFLOPs | Params (M)
ResNet-50 | FCN [30] | ImageNet | 36.1 | 38.1 | 793.3 | 49.6
ResNet-50 | EncNet [55] | ImageNet | 40.1 | 41.7 | 565.6 | 36.1
ResNet-50 | PSPNet [57] | ImageNet | 41.1 | 41.9 | 716.2 | 49.1
ResNet-50 | CCNet [20] | ImageNet | 42.1 | 43.1 | 804.0 | 49.9
ResNet-50 | DeeplabV3+ [7] | ImageNet | 42.7 | 43.8 | 711.5 | 43.7
ResNet-50 | UperNet [45] | ImageNet | 42.1 | 42.8 | 953.2 | 66.5
ResNet-50 | DNL [52] | ImageNet | 41.9 | 43.0 | 939.3 | 50.1
ResNet-50 | Semantic FPN [21] | ImageNet | 38.6 | 40.6 | 227.1 | 31.0
ResNet-50 | CLIP + Semantic FPN | CLIP | 39.6 | 41.6 | 248.8 | 31.0
ResNet-50 | DenseCLIP + Semantic FPN | CLIP | 43.5 | 44.7 | 269.2 | 50.3
ResNet-101 | FCN [30] | ImageNet | 39.9 | 41.4 | 1104.4 | 68.6
ResNet-101 | EncNet [55] | ImageNet | 42.6 | 44.7 | 876.8 | 55.1
ResNet-101 | PSPNet [57] | ImageNet | 43.6 | 44.4 | 1027.4 | 68.1
ResNet-101 | CCNet [20] | ImageNet | 44.0 | 45.2 | 1115.2 | 68.9
ResNet-101 | DeeplabV3+ [7] | ImageNet | 44.6 | 46.1 | 1022.7 | 62.7
ResNet-101 | UperNet [45] | ImageNet | 43.8 | 44.8 | 1031.0 | 85.5
ResNet-101 | OCRNet [54] | ImageNet | 45.3 | - | 923.9 | 55.5
ResNet-101 | DNL [52] | ImageNet | 44.3 | 45.8 | 1250.5 | 69.1
ResNet-101 | Semantic FPN [21] | ImageNet | 40.4 | 42.3 | 304.9 | 50.0
ResNet-101 | CLIP + Semantic FPN | CLIP | 42.7 | 44.3 | 326.6 | 50.0
ResNet-101 | DenseCLIP + Semantic FPN | CLIP | 45.1 | 46.5 | 346.3 | 67.8
ViT-B | SETR-MLA-DeiT [58] | ImageNet | 46.2 | 47.7 | - | -
ViT-B | Semantic FPN [21] | ImageNet | 48.3 | 50.9 | 1037.4 | 100.8
ViT-B | Semantic FPN [21] | ImageNet-21K | 49.1 | 50.4 | 1037.4 | 100.8
ViT-B | CLIP + Semantic FPN | CLIP | 49.4 | 50.3 | 1037.4 | 100.8
ViT-B | DenseCLIP + Semantic FPN | CLIP | 50.6 | 51.3 | 1043.1 | 105.3
embeddings to a lower dim (256) before the Transformer module. We empirically find that directly fine-tuning CLIP models for dense prediction with the default training strategies in [10] leads to unsatisfactory results (only 21.9% mIoU on ADE20K, which is 15.6% lower than its ImageNet pre-trained counterpart). Therefore, two key modifications are made compared to the default configurations: (1) we use AdamW [31] instead of the default SGD, inspired by recent progress in vision Transformers [29, 39, 42]; (2) to better preserve the pre-trained weights, we set the learning rate of the image encoder to 1/10 of that of the other parameters. We also adopt the above training strategies for our baselines in the ablation studies for fair comparisons (+1.1% mIoU over the ImageNet pre-trained ResNet-50 with the default settings in [10]).
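A sketch of the two modifications above in code: AdamW with the image encoder placed in a parameter group whose learning rate is 1/10 of the base rate. The attribute name `image_encoder`, the base learning rate, and the weight decay are placeholders rather than the paper's exact configuration.

```python
import torch

def build_optimizer(model, base_lr=1e-4, weight_decay=1e-4):
    """AdamW with a 10x smaller learning rate for the CLIP image encoder."""
    backbone_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (backbone_params if name.startswith("image_encoder") else other_params).append(param)
    return torch.optim.AdamW(
        [
            {"params": backbone_params, "lr": base_lr * 0.1},  # image encoder: 1/10 lr
            {"params": other_params, "lr": base_lr},
        ],
        lr=base_lr,
        weight_decay=weight_decay,
    )
```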
Main results. We report the semantic segmentation results of our DenseCLIP with three different backbones on ADE20K in Table 1. We include the FLOPs, the number of parameters, and the mIoU in both single-scale (SS) and multi-scale (MS) testing. The experimental results show that, for the same backbone, our DenseCLIP with a simple Semantic FPN can outperform state-of-the-art methods that use more sophisticated decoders by large margins. Unlike previous works that use dilated backbones (ResNet-D8 [20, 53, 54, 57]), the ResNet encoder in DenseCLIP is closer to the standard ResNet, and thus our DenseCLIP has much fewer FLOPs. Besides, our DenseCLIP is +4.9%, +4.7%, and +2.3% mIoU (SS) higher than the original ImageNet pre-trained baselines on the ResNet-50, ResNet-101, and ViT-B backbones with acceptable extra computation cost. DenseCLIP is also +3.9%, +2.4%, and +1.2% mIoU higher than the vanilla fine-tuning strategy (CLIP + Semantic FPN).

Table 2. Ablation study. We demonstrate that performing post-model vision-to-language prompting can yield better performance with fewer extra FLOPs and parameters.

Pre-train | Language Prompt | V→L Prompt (pre) | V→L Prompt (post) | mIoU (%) | FLOPs (G) | Params (M)
ImageNet | | | | 38.6 | 227 | 31.0
CLIP | | | | 39.6 (+1.0) | 249 | 31.0
CLIP | ✓ | | | 42.1 (+3.5) | 269 | 46.5
CLIP | ✓ | ✓ | | 42.9 (+4.3) | 368 | 116.9
CLIP | ✓ | | ✓ | 43.5 (+4.9) | 269 | 50.2

Ablation studies. To further demonstrate the effects of different components of our DenseCLIP, we perform detailed ablation studies with the ResNet-50 [18] backbone; the results are shown in Table 2. Firstly, we show that by adopting the better training strategy described above, the ResNet-50 baseline we implemented has a higher mIoU than [10] (38.6% vs. 37.5%). Secondly, we find that the CLIP pre-trained ResNet-50 outperforms the ImageNet pre-trained one by 1%, which indicates that large-scale vision-language pre-trained models can be better transferred to downstream vision tasks.
To better leverage the language priors, we adopt our language-guided framework with language-domain prompting and witness a significant performance boost (+2.5% mIoU). Finally, we compare the two strategies of vision-to-language prompting for incorporating visual contexts. We find that both pre-model and post-model prompting can improve the performance, while post-model prompting is better and more computationally efficient. Therefore, we choose post-model prompting as the default configuration in all the remaining experiments.

Effects of language-guided pre-training and fine-tuning. We compare the performance on ADE20K of different pre-training and fine-tuning strategies to better reveal the potential of the language-guided paradigm, as shown in Figure 2. We consider supervised pre-training on ImageNet1K [11] and ImageNet21K [11, 36], self-supervised pre-training via MoCoV2 [16] and DenseCL [43], and vision-language pre-training. We show that the vision-language pre-trained model (CLIP) can outperform the ImageNet1K pre-trained model by vanilla fine-tuning. Furthermore, through language-guided fine-tuning with context-aware prompting, our DenseCLIP surpasses even the ImageNet21K pre-trained model. These promising results demonstrate that language priors can largely facilitate vision models in downstream dense prediction tasks.

4.2. Object Detection and Instance Segmentation

Setups. We also conduct experiments to apply our DenseCLIP to object detection and instance segmentation tasks on COCO [27], which contains 118K training images and 5K validation images. We adopt two widely used frameworks, RetinaNet [26] and Mask R-CNN [17]. Following [17], we report the standard AP, AP at IoU=0.5/0.75, and cross-scale AP. For Mask R-CNN, we report the mAPs for both object detection and instance segmentation, since these two tasks are performed simultaneously.

Implementation details. For object detection, we adopt ResNet-50 and ResNet-101 as backbones. We train all the models for 12 epochs using the AdamW optimizer with batch size 16, as in [5]. Specifically for RetinaNet, we witness a very large loss at the start of training that makes the model hard to converge. Therefore, we use gradient clipping with a maximum $\ell_2$ norm of 0.1 to protect the pre-trained weights.
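The clipping step can be added to a standard PyTorch training loop as below; everything except the 0.1 maximum norm is a generic placeholder.

```python
import torch

def optimizer_step(model, loss, optimizer, max_norm=0.1):
    """One parameter update with the global l2 gradient norm capped at 0.1
    to protect the pre-trained weights."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
```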
Results analysis. The results using RetinaNet [26] and Mask R-CNN [17] are summarized in Table 3 and Table 4, respectively. For object detection with RetinaNet, we compare DenseCLIP with the ImageNet1K pre-trained model and vanilla CLIP fine-tuning on the detection task. One can observe that DenseCLIP outperforms the ImageNet1K pre-trained model by +1.5% and +2.6% AP. Meanwhile, it also improves the vanilla fine-tuning strategy by +0.9% and +0.6% AP on the ResNet-50 and ResNet-101 backbones, respectively.

Table 3. Object detection on COCO val2017 using the RetinaNet [26] framework. We compare our DenseCLIP framework to the vanilla fine-tuning of ImageNet/CLIP pre-trained models. We find DenseCLIP can better make use of the language priors to facilitate better training.

Model | FLOPs (G) | Params (M) | AP | AP50 | AP75 | APS | APM | APL
RN50-IN1K [18] | 239 | 38 | 36.3 | 55.3 | 38.6 | 19.3 | 40.0 | 48.8
RN50-CLIP [33] | 265 | 38 | 36.9 | 57.7 | 39.1 | 22.5 | 40.7 | 47.1
RN50-DenseCLIP | 285 | 60 | 37.8 | 59.9 | 40.0 | 24.8 | 42.0 | 47.9
RN101-IN1K [18] | 315 | 57 | 38.5 | 57.6 | 41.0 | 21.7 | 42.8 | 50.4
RN101-CLIP [33] | 341 | 57 | 40.5 | 61.6 | 43.4 | 25.6 | 44.6 | 51.3
RN101-DenseCLIP | 360 | 78 | 41.1 | 63.4 | 44.1 | 26.9 | 45.5 | 52.4

For Mask R-CNN, we observe that DenseCLIP achieves consistent improvements on both the object detection and instance segmentation tasks within an affordable computational budget. Especially for instance segmentation, our DenseCLIP outperforms the ImageNet1K pre-trained model by +2.9% and +2.5% mask AP on the ResNet-50 and ResNet-101 backbones, and also outperforms the vanilla fine-tuning strategy by +0.8% and +0.7% mask AP. The significant improvements of DenseCLIP on the instance segmentation task suggest that our pixel-text matching is conceptually suitable for segmentation.

Table 4. Object detection and instance segmentation results on COCO val2017 using the Mask R-CNN [17] framework. Our DenseCLIP outperforms ImageNet/CLIP pre-trained baseline models, especially on the instance segmentation task.

Model | FLOPs (G) | Params (M) | APb | APb50 | APb75 | APbS | APbM | APbL | APm | APm50 | APm75 | APmS | APmM | APmL
RN50-IN1K [18] | 275 | 44 | 38.2 | 58.8 | 41.4 | 21.9 | 40.9 | 49.5 | 34.7 | 55.7 | 37.2 | 18.3 | 37.4 | 47.2
RN50-CLIP [33] | 301 | 44 | 39.3 | 61.3 | 42.7 | 24.6 | 42.6 | 50.1 | 36.8 | 58.5 | 39.2 | 18.6 | 39.9 | 51.8
RN50-DenseCLIP | 327 | 67 | 40.2 | 63.2 | 43.9 | 26.3 | 44.2 | 51.0 | 37.6 | 60.2 | 39.8 | 20.8 | 40.7 | 53.7
RN101-IN1K [18] | 351 | 63 | 40.0 | 60.5 | 44.0 | 22.6 | 44.0 | 52.6 | 36.1 | 57.5 | 38.6 | 18.8 | 39.7 | 49.5
RN101-CLIP [33] | 377 | 63 | 42.2 | 64.2 | 46.5 | 26.4 | 46.1 | 54.0 | 38.9 | 61.4 | 41.8 | 20.5 | 42.3 | 55.1
RN101-DenseCLIP | 399 | 84 | 42.6 | 65.1 | 46.5 | 27.7 | 46.5 | 54.2 | 39.6 | 62.4 | 42.4 | 21.4 | 43.0 | 56.2
4.3. DenseCLIP for Any Visual Backbone

Previous experiments have demonstrated the effectiveness of our DenseCLIP framework. However, since DenseCLIP is specifically designed to leverage the visual-language relation contained in the pre-trained CLIP models, the generalization ability of DenseCLIP might be doubted: is DenseCLIP only suitable for CLIP image encoders? To answer this question, we perform experiments to verify whether our DenseCLIP can also perform well with other backbones. The extension is actually straightforward: we can simply replace the CLIP image encoder with any given 2D pre-trained image model. Although there are no strong correlations between the feature maps of the new backbone and the text features output by the CLIP text encoder, we hypothesize that if we preserve the language priors by freezing the text encoder as before, the text encoder will guide the backbone to better adapt to downstream tasks.
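Preserving the language priors then amounts to freezing the CLIP text encoder while the (arbitrary) image backbone and decoder remain trainable; the attribute name below is illustrative.

```python
# Freeze the pre-trained CLIP text encoder so its language priors stay intact,
# while the replaced image backbone and the decoder are fine-tuned as usual.
for param in model.text_encoder.parameters():
    param.requires_grad = False
model.text_encoder.eval()  # also disable training-time behaviors such as dropout
```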
To verify the above assumption, we choose two representative 2D models: ResNet [18], the most widely used CNN model, and Swin [29], the recent state-of-the-art vision Transformer. Following the standard settings in [21] and [29], we use the Semantic FPN [21] framework for ResNet models and the UperNet [45] framework for Swin models. The experimental results are summarized in Table 5, where we report the mIoU on ADE20K for both single-scale and multi-scale testing. We demonstrate that our DenseCLIP can consistently improve all the baseline models notably. Specifically, DenseCLIP can bring ~2.5% single-scale mIoU improvement for ResNet-50/101 with Semantic FPN [21], and ~0.8% improvement for Swin-T/S with UperNet [45]. These results clearly show that our DenseCLIP can successfully guide any pre-trained 2D backbone with language priors to boost performance. Since the text encoder can be removed after training, our method provides a low-cost solution to improve arbitrary dense prediction models. Although these results still lag behind our models with CLIP image encoders, the findings in this section provide a way to generalize the human knowledge learned from large-scale vision-language pre-training to a wider range of models. We expect this could be an interesting direction to connect vision and language research in the future.

Table 5. Applying DenseCLIP to any backbone. Image backbones (such as ImageNet pre-trained ResNet [18] and Swin [29]) equipped with our DenseCLIP benefit from the language priors and enjoy a significant performance boost. We report mIoU on the ADE20K dataset for both single-scale (SS) and multi-scale (MS) testing.
4.4. Visualization

To better demonstrate the superiority of DenseCLIP, we provide several qualitative results in Figure 5. We compare the segmentation maps of our method and the baselines and find that DenseCLIP is better at identifying holistic objects.

Figure 5. Qualitative comparisons of segmentation results (columns: input, ImageNet, CLIP, DenseCLIP, ground truth).

5. Conclusion and Discussion

In this paper, we have presented a new framework, DenseCLIP, to transfer the knowledge from the vision-language pre-trained model (CLIP) to downstream dense prediction tasks. DenseCLIP is a model-agnostic framework that uses the pre-trained vision-language knowledge with the context-aware prompting strategy. The framework can be applied to various dense prediction tasks including semantic segmentation, object detection, and instance segmentation. We conducted extensive experiments to demonstrate the superior performance of our method.

Limitations & societal impact. Although our method has achieved substantial improvements in segmentation, we find the improvements on detection are not as significant. We conjecture that this is because the pre-trained CLIP image encoder lacks locality, since there is no such constraint during the pre-training of CLIP, while object-centered tasks can only provide less dense supervision. We believe DenseCLIP can be further improved by introducing dense supervision during pre-training or by better recovering the locality after pre-training. We develop a general method for dense prediction in this paper. Since our method is not for a specific application, it does not directly involve societal issues.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 62125603 and Grant U1813218, and in part by a grant from the Beijing Academy of Artificial Intelligence (BAAI).
References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages 2425-2433, 2015.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. Language models are few-shot learners. In NeurIPS, 2020.
[3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, pages 6299-6308, 2017.
[5] Kai Chen, Jiaqi Wang, Jiangmiao Pang, et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 40(4):834-848, 2017.
[7] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801-818, 2018.
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597-1607, 2020.
[9] Xinlei Chen, Haoqi Fan, Ross B. Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. CoRR, abs/2003.04297, 2020.
[10] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark, 2020.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255, 2009.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
[13] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
[14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580-587, 2014.
[15] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
[16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9726-9735, 2020.
[17] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[19] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, pages 108-124, 2016.
[20] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In ICCV, pages 603-612, 2019.
[21] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In CVPR, pages 6399-6408, 2019.
[22] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (BiT): General visual representation learning. In ECCV, pages 491-507. Springer, 2020.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097-1105, 2012.
[24] Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, and Jingjing Liu. Less is more: ClipBERT for video-and-language learning via sparse sampling. In CVPR, pages 7331-7341, 2021.
[25] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, pages 936-944, 2017.
[26] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980-2988, 2017.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740-755, 2014.
[28] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021.
[29] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[30] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431-3440, 2015.
[31] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[32] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pages 13-23, 2019.
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[34] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. In NeurIPS, 2021.
[35] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, pages 91-99, 2015.
[36] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. ImageNet-21K pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.
[37] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In ICLR, 2020.
[38] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, pages 843-852, 2017.
[39] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347-10357, 2021.
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998-6008, 2017.
[41] Mengmeng Wang, Jiazheng Xing, and Yong Liu. ActionCLIP: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021.
[42] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
[43] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In CVPR, pages 3024-3033, 2021.
[44] Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, and Jing Shao. CAMP: Cross-modal adaptive message passing for text-image retrieval. In ICCV, pages 5763-5772, 2019.
[45] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, pages 418-434, 2018.
[46] Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and Han Hu. Self-supervised learning with Swin transformers. arXiv preprint arXiv:2105.04553, 2021.
[47] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In CVPR, pages 16684-16693, 2021.
[48] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 37, pages 2048-2057, 2015.
[49] Zhao Yang, Yansong Tang, Luca Bertinetto, Hengshuang Zhao, and Philip H.S. Torr. Hierarchical interaction network for video object segmentation from referring expressions. In BMVC, 2021.
[50] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H.S. Torr. LAVT: Language-aware vision transformer for referring image segmentation. In CVPR, 2022.
[51] Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. CPT: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797, 2021.
[52] Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, and Han Hu. Disentangled non-local neural networks. In ECCV, 2020.
[53] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[54] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020.
[55] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, pages 7151-7160, 2018.
[56] Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free CLIP-Adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
[57] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, pages 2881-2890, 2017.
[58] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[59] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 127(3):302-321, 2019.
[60] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134, 2021.
Appendix: More Analysis

We provide more analyses of both the design of our model and the training strategies in this section.

Effects of learning rate multipliers. As discussed in Section 4.1, we found that the optimal learning rates for CLIP models and conventional ImageNet pre-trained models are different. Here we further investigate the effects of the learning rate multipliers for the image encoder and the text encoder in Table 6. We see that both fixing the text encoder and using a lower learning rate for the image encoder are beneficial for training the dense prediction model. Note that we observe a much lower performance (<30% mIoU) when directly fine-tuning CLIP models with a 1.0x learning rate for the image encoder, which suggests that our language-guided method can largely stabilize the training process and make the final results less sensitive to the learning rate configuration.

Table 8. Ablation study of the residual coefficient $\gamma$. The configuration used in our final models is the learnable $\gamma$ initialized to $10^{-4}$.

Initial value of $\gamma$ | $\gamma$ learnable | mIoU (%)
$10^{-4}$ | ✗ | 42.6
$10^{-4}$ | ✓ | 43.5
1.0 | ✓ | 42.8