Discriminative Probing and Tuning For Text-To-Image Generation
Leigang Qu1, Wenjie Wang1*, Yongqi Li2, Hanwang Zhang3,4, Liqiang Nie5, Tat-Seng Chua1
1 National University of Singapore, 2 Hong Kong Polytechnic University, 3 Nanyang Technological University, 4 Skywork AI, 5 Harbin Institute of Technology (Shenzhen)
arXiv:2403.04321v2 [cs.CV] 14 Mar 2024
alignment capabilities of T2I models are still inadequate. By reviewing the link between generative and discriminative modeling, we posit that T2I models’ discriminative abilities may reflect their text-image alignment proficiency during generation. In this light, we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-to-image alignment for generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. As a bonus of the discriminative adapter, a self-correction mechanism can leverage discriminative gradients to better align generated images to text prompts during

[Figure 1 (diagram). Panel (a): Text-Image Misalignment Problems. Labels: Generation from the text y = “Three children and their mother sitting behind are flying kites.” with images x and x̂; Discrimination with the text y' = “The boy sitting in the middle.”; questions “Which matches y better, x or x̂?” and “Where is the object described by y'?”]
the VAE encoder and adds noise ϵ ∼ N(0, 1) to obtain the
ence of different blocks in Sec. 4.3.
[Figure 2 (diagram). Inputs: positive text, negative text, and partially descriptive text (examples: “Three children and their mother sitting behind are flying kites”, “Two girls sitting on the grass”, “The boy sitting in the middle”), with positive and negative images. Components: Text Encoder, U-Net with LoRA and Time Embeddings producing F_t = UNet_l(z_t, CLIP(y), t); discriminative adapter (Transformer Encoder and Transformer Decoder with queries Q, outputs Q*); heads for Global Matching (L_match, projections h_i compared with CLIP(y)) and Local Grounding (L_ground, predicted boxes b̂_j against ground truth b_i). Inference: Self-Correction along the sampling chain z_t → ẑ_t → z_{t−1} → ẑ_{t−1} → ... → z_0 using s(z_t, y) and s(z_{t−1}, y).]
Figure 2. Schematic illustration of the proposed discriminative probing and tuning (DPT) framework. We first extract semantic represen-
tations from the frozen SD and then propose a discriminative adapter to conduct discriminative probing to investigate the global matching
and local grounding abilities of SD. Afterward, we perform parameter-efficient discriminative tuning by introducing LoRA parameters.
During inference, we present the self-correction mechanism to guide the denoising-based text-to-image generation.
Q∗ capture multiple aspects of the semantic representation F̃. Thereafter, Q∗ can be used for various downstream tasks, possibly with a classifier or regressor.

In the following, we introduce two probing tasks, i.e., ITM and REC, and train the discriminative adapter on them to investigate the global matching and local grounding abilities of T2I models, respectively.

• Global Matching. From the view of discriminative modeling, a model with strong text-image alignment should be able to identify subtle alignment differences between various images and a text prompt. In light of this, we utilize the task of Image-Text Matching [18] to probe the discriminative global matching ability. This task is defined to achieve bidirectional matching or retrieval, including text-to-image (T → I) and image-to-text (I → T).

To achieve this, we first collect the first M (M < N) query representations $\{q^*_1, \dots, q^*_M\}$ from Q∗, and then project each of them into a matching space with the same dimension as CLIP and obtain $h_i = g(q^*_i; W_m)$. Intuitively, different query representations may capture different aspects to understand the same image. Inspired by this, we calculate the cross-modal semantic similarity between x and y by comparing the CLIP textual embedding of y with the best-matching projected query representation via $s(z, y) = \max_{i \in \{1, \dots, M\}} \cos(\mathrm{CLIP}(y), h_i)$. Based on pairwise similarities, we optimize the discriminative adapter $f(\cdot; W_a, Q)$ and the projection layer $g(\cdot; W_m)$ using the contrastive learning loss $\mathcal{L}_{match} = \mathcal{L}_{T \to I} + \mathcal{L}_{I \to T}$. The first term optimizes the model to distinguish the correct image matched with a given text from all samples in a batch, i.e.,

$$\mathcal{L}_{T \to I} = -\log \frac{\exp(s(z, y)/\tau)}{\sum_{j=1}^{B} \exp(s(z_j, y)/\tau)}, \quad (3)$$

where B denotes the mini-batch size, and τ is a learnable temperature factor. Similarly, the opposite direction from image to text is computed by

$$\mathcal{L}_{I \to T} = -\log \frac{\exp(s(z, y)/\tau)}{\sum_{j=1}^{B} \exp(s(z, y_j)/\tau)}. \quad (4)$$

With $\mathcal{L}_{match}$ as the optimization objective, the discriminative adapter and the projection layers are enforced to discover discriminative information from the semantic representations for matching, implying the global matching ability of a T2I model.

• Local Grounding. Local grounding requires a model to recognize the referred object from others in an image given a partially descriptive text. We adapt SD to the REC [68] task to evaluate its discriminative local grounding ability. Formally, given a textual expression y′ referring to a specific object with index i in an image x, REC aims to predict the coordinate and the size, i.e., the bounding box $b_i$, of the ground-truth object. To achieve it, we share the same discriminative adapter and employ the other (N − M) learnable queries as object prior queries and obtain the corresponding query representations from the transformer decoder as $\{q^*_j\}_{j \in \{M+1, \dots, N\}}$. We then project each $q^*_j$ into three spaces separately by three different projection layers g(·): 1) the grounding space to get the probability of predicting the correct object, i.e., $p_j = g(q^*_j; W_p) \in \mathbb{R}^1$; 2)
the box space to estimate the bounding box parameters, i.e., $\hat{b}_j = g(q^*_j; W_b) \in \mathbb{R}^4$; and 3) the semantic space to bridge the semantic gap between queries and the text, i.e., $o_j = g(q^*_j; W_s) \in \mathbb{R}^d$.

After projection, we perform maximum matching to discover the most matched query with index ψ(i). The matching cost combines the grounding probability with the L1 and GIoU [49] losses between the predicted and the ground-truth boxes. It is formulated as

$$\psi(i) = \arg\min_{j \in \{M+1, \dots, N\}} \; -p_j + \mathrm{L1}(\hat{b}_j, b_i) + \mathrm{GIoU}(\hat{b}_j, b_i). \quad (5)$$

Besides, we adopt a text-to-object contrastive loss to further drive the model to distinguish the positive object from others at the semantic level:

$$\mathcal{L}_{T \to O} = -\log \frac{\exp(\cos(o_{\psi(i)}, \mathrm{CLIP}(y'))/\tau)}{\sum_{j=1}^{K_x} \exp(\cos(o_j, \mathrm{CLIP}(y'))/\tau)}. \quad (6)$$

We combine all the losses and obtain the grounding loss as

$$\mathcal{L}_{ground} = -\lambda_0 \, p_{\psi(i)} + \lambda_1 \, \mathrm{L1}(\hat{b}_{\psi(i)}, b_i) + \lambda_2 \, \mathrm{GIoU}(\hat{b}_{\psi(i)}, b_i) + \lambda_3 \, \mathcal{L}_{T \to O}, \quad (7)$$

text-image similarities as CLIP and BLIP scores, respectively. We will

It may shed new light on giving full play to the versatility of visual generative foundation models. In this vein, we strive to explain “How can we enhance text-image alignment for T2I models by discriminative tuning?”

In the previous stage, we freeze SD and probe how informative intermediate activations are in global matching and local grounding. Here, we conduct parameter-efficient fine-tuning using LoRA [24] by injecting trainable layers over cross-attention layers and freezing the parameters of the pre-trained SD. We use the same discriminative objective functions as stage 1 to tune the LoRA, discriminative adapter, and task-specific projection layers. Due to the participation of LoRA, we can flexibly manipulate the intermediate activation of T2I models.

3.3. Self-Correction

Equipping the T2I model with the discriminative adapter enables the whole model to execute discriminative tasks. As a bonus of using the discriminative adapter, we propose a self-correction mechanism to guide high-alignment generation during inference. Formally, we update the latent $z_t$ aiming to enhance the semantic similarity between $z_t$ and the prompt y through gradients:
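As a hedged illustration only: assuming the update is plain gradient ascent on the matching similarity $s(z_t, y)$ defined for global matching, scaled by a guidance factor $\eta$ (the symbol used in Fig. 4b), one correction step could be written as below. The exact form and where it is applied inside the sampler are assumptions here, not a statement of the authors' equation.

```latex
\hat{z}_t = z_t + \eta \, \nabla_{z_t} s(z_t, y),
\qquad
s(z_t, y) = \max_{i \in \{1, \dots, M\}} \cos\big(\mathrm{CLIP}(y),\, h_i\big)
```

Under this reading, $\eta = 0$ recovers ordinary sampling, and a larger $\eta$ pushes the latent harder toward the prompt at each denoising step.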
Table 1. Performance comparison for text-to-image generation on COCO-NSS1K, CC-500, and ABC-6K. ID, OOD, and MD refer to in-distribution, out-of-distribution, and mixed-distribution settings, respectively. According to the version of Stable Diffusion, we split methods into two groups, top and bottom for v1.4 and v2.1, respectively. SC denotes self-correction.

Method                               | COCO-NSS1K (ID)                   | CC-500 (OOD)                      | ABC-6K (MD)
                                     | CLIP   BLIP-M BLIP-C IS     FID   | CLIP   BLIP-M BLIP-C GLIP   IS    | CLIP   BLIP-M BLIP-C IS
Stable Diffusion-v1.4 [CVPR22] [50]  | 33.27  67.96  39.48  31.32  54.77 | 34.82  70.95  40.36  31.17  14.28 | 35.33  72.03  40.82  34.47
LayoutLLM-T2I [ACMMM23] [44]         | 32.42  67.42  39.46  25.57  59.26 | -      -      -      -      -     | -      -      -      -
StructureDiffusion [ICLR23] [16]     | -      -      -      -      -     | 33.71  66.71  39.54  31.39  14.14 | 34.95  69.55  40.69  34.97
HN-DiffusionITM [NeurIPS23] [28]     | 33.26  70.06  40.14  31.53  53.26 | 34.15  68.77  40.30  31.54  13.99 | 35.02  72.28  41.12  34.83
DPT (Ours)                           | 33.85  71.84  40.11  31.65  54.96 | 35.97  76.74  41.15  37.07  13.56 | 35.88  75.88  41.26  34.46
-------------------------------------+-----------------------------------+-----------------------------------+---------------------------
Stable Diffusion-v2.1 [CVPR22] [50]  | 34.96  73.32  40.22  30.40  55.35 | 39.24  85.45  43.36  52.09  11.53 | 37.53  81.98  41.77  33.31
Attend-and-Excite [TOG23] [4]        | 34.95  74.68  40.32  30.27  55.16 | 39.43  90.03  44.08  53.29  11.82 | 37.59  82.64  41.83  32.94
HN-DiffusionITM [NeurIPS23] [28]     | 35.14  75.64  40.77  30.34  52.73 | 38.81  85.76  43.22  48.95  12.11 | 37.58  82.33  42.07  34.14
DPT (Ours)                           | 35.83  78.58  41.14  30.83  55.55 | 40.23  90.72  44.55  53.29  11.59 | 38.39  86.19  42.36  32.97
DPT + SC (Ours)                      | 35.75  79.15  41.14  30.50  54.89 | 40.25  91.33  44.69  53.29  11.89 | 38.41  85.63  42.34  33.56
Table 2. Performance comparison for text-to-image generation on TIFA [25] and T2I-CompBench [14]. According to the version of Stable Diffusion, we split methods into two groups, top and bottom for v1.4 and v2.1, respectively. SC denotes self-correction.

Method                 | TIFA  | T2I-CompBench
                       |       | Color  Shape  Texture  Spatial  Non-Spatial  Complex
SD-v1.4 [50]           | 79.15 | 36.82  35.94  42.16    10.64    30.45        28.18
HN-DiffusionITM [28]   | 79.02 | 36.71  35.48  39.84    11.22    30.91        28.05
VPGen [13]             | 77.33 | 32.12  32.36  35.85    19.08    30.07        24.39
LayoutGPT [17]         | 79.31 | 33.86  36.35  44.07    35.06    30.30        26.36
DPT (Ours)             | 82.04 | 48.84  38.93  50.10    14.63    30.83        30.05
DPT + SC (Ours)        | 82.40 | 51.51  39.61  49.38    15.45    30.84        30.29
-----------------------+-------+------------------------------------------------------
SD-v2.1 [50]           | 81.35 | 48.21  40.49  46.83    16.94    30.63        29.96
Attend-and-Excite [4]  | 81.98 | 53.72  43.41  48.53    16.30    30.64        30.38
HN-DiffusionITM [28]   | 82.02 | 46.45  40.09  49.35    15.01    30.99        30.35
DPT (Ours)             | 84.49 | 60.59  48.18  58.24    20.78    30.95        32.44
DPT + SC (Ours)        | 84.63 | 62.59  48.44  57.60    21.04    30.76        32.52

including BLIP-ITM and BLIP-ITC, and GLIP score [16] based on object detection to evaluate text-image alignment, and IS [53] and FID [22] as quality evaluation metrics. As for TIFA and T2I-CompBench, we follow the recommended VQA accuracy or specifically curated protocols.

adopt Image-Text Matching (ITM) and Image-Text Contrastive (ITC) as BLIP scores in the following.

4.2. Performance Comparison

• Text-to-Image Generation. As shown in Tab. 1 and Tab. 2, we have the following observations and discussions: 1) Compared with the base foundation models, i.e., SD [50], the proposed DPT manages to improve the text-image alignment remarkably, which illustrates that enhancing discriminative abilities could benefit the generative semantic alignment for T2I models. 2) DPT achieves superior performance on CC-500 and ABC-6K under the OOD and MD settings, showing its powerful generalization to other prompt distributions. It also reveals its capability to resist the risk of overfitting when tuning T2I models with discriminative tasks. 3) The consistent improvement on both SD-v1.4 and SD-v2.1 demonstrates that the proposed DPT may be parallel with the generative pre-training based on score matching, reflecting the possibility of activating the intrinsic reasoning abilities of T2I models using DPT. And 4) in all, the proposed method consistently achieves the best text-image alignment across comprehensive benchmarks, distribution settings, and evaluation protocols. Besides, the improvement in alignment does not result in a loss of image quality per IS and FID. These results confirm the effectiveness of the proposed paradigm DPT.

• Discriminative Matching and Grounding. In Sec. 3.1, we incorporate a discriminative adapter on top of T2I models and probe and improve its understanding abilities based on ITM and REC. Empirically, we train the adapter in the first stage and introduce LoRA for tuning in the second stage using ITM and REC data, and then evaluate the matching and grounding performance. We show experimental results of baselines including discriminative and generative models under the zero-shot and fine-tuning settings in Tab. 3. See Appendix B for more details on the implementation and settings. From this table, we observe that our method could outperform the existing state-of-the-art generative methods, such as Diffusion Classifier [31] and DiffusionITM [28], by large margins on ITM and REC tasks. It could even achieve competitive performance in the first probing stage, or in the second stage when the model is selected with priority on generation. These results show that the generative representations extracted from the intermediate layers of U-Net convey meaningful semantics, verifying that T2I models have basic discriminative matching and grounding abilities. Besides, it also indicates that such abilities could be further improved by the discriminative tuning introduced in Sec. 3.2.

4.3. In-depth Analysis

To verify the effectiveness of each component in DPT, including discriminative tuning on Global Matching (GM) and Local Grounding (LG) in the 2nd stage, and the Self-
Table 3. Performance comparison for image-text matching and referring expression comprehension to evaluate global matching and local
grounding abilities, respectively. Datasets include MSCOCO-HN for ITM, and RefCOCO, RefCOCO+, and RefCOCOg for REC. All
the methods are grouped into three parts, in which the upper, middle, and lower groups correspond to zero-shot discriminative, zero-shot
generative, and fine-tuning generative methods, respectively. All the generative models are based on Stable Diffusion-v2.1.
Method                                 | MSCOCO-HN               | RefCOCO                 | RefCOCO+                | RefCOCOg
                                       | I-to-T  T-to-I  Overall | val     testA   testB   | val     testA   testB   | val     test
Random Chance                          | 25.00   25.00   25.00   | 16.53   13.51   19.20   | 16.29   13.57   19.60   | 18.12   19.10
CLIP (ViT-B-32) [ICML21] [45]          | 47.63   42.82   45.23   | 44.79   46.12   42.61   | 49.60   51.07   46.04   | 58.31   58.42
OpenCLIP (ViT-B-32) [CVPR23] [6]       | 49.07   47.45   48.26   | 43.22   43.15   44.65   | 48.21   48.60   50.64   | 60.32   60.84
Diffusion Classifier § [ICCV23] [31]   | 34.59   24.12   29.36   | 6.23    2.14    12.11   | 6.07    2.11    12.29   | 8.68    8.45
DiffusionITM § [NeurIPS23] [28]        | 34.59   29.83   32.21   | 28.88   30.16   29.01   | 29.97   31.17   30.25   | 38.07   38.91
Local Denoising                        | -       -       -       | 23.83   21.20   24.85   | 24.07   21.31   25.45   | 28.66   28.59
Diffusion Classifier †§ [ICCV23] [31]  | 37.72   24.03   30.88   | 6.11    2.10    10.91   | 6.04    2.13    11.48   | 8.05    7.54
DiffusionITM †§ [NeurIPS23] [28]       | 37.72   29.88   33.80   | 34.09   32.70   35.29   | 35.86   35.42   38.23   | 49.67   49.05
HN-DiffusionITM †§ [NeurIPS23] [28]    | 37.55   30.37   33.96   | 31.43   28.50   35.47   | 33.47   30.16   37.47   | 47.98   48.20
Local Denoising †                      | -       -       -       | 23.70   21.55   24.81   | 24.01   21.52   25.32   | 28.53   28.77
DPT (Stage1, Ours)                     | 42.29   34.75   38.52   | 48.79   53.28   43.06   | 42.56   47.69   36.14   | 46.56   45.75
DPT (Ours)                             | 42.07   34.97   38.52   | 52.73   57.84   46.73   | 45.34   50.12   38.41   | 48.61   47.45
DPT* (Ours)                            | 43.12   35.25   39.18   | 63.45   66.70   57.90   | 51.56   56.81   42.73   | 54.96   54.80
†: fine-tuning with the denoising objective;
§: cropping an image into blocks and then matching them with the referring text for REC;
∗: model selection with a priority discriminative task, i.e., ITM or REC
Correction (SC) during inference, we conduct several analytic experiments on COCO-NSS1K and CC-500 under ID and OOD settings. The results are summarized in Tab. 4.

Table 4. Ablation study for the influence of two objectives of discriminative tuning including Global Matching (GM) and Local Grounding (LG) in the 2nd stage, and the Self-Correction (SC) during inference on alignment-oriented text-to-image generation. The COCO-NSS1K and CC-500 datasets are used to evaluate in-distribution (ID) and out-of-distribution (OOD) generation. All experiments are based on Stable Diffusion-v2.1.

Index  GM  LG  SC | COCO-NSS1K (ID)       | CC-500 (OOD)
                  | CLIP   BLIP-M BLIP-C  | CLIP   BLIP-M BLIP-C GLIP
0      -   -   -  | 34.96  73.32  40.22   | 39.24  85.45  43.36  52.09
1      ✓   -   -  | 35.14  74.83  40.45   | 39.28  86.23  43.36  49.55
2      -   ✓   -  | 35.94  79.19  41.11   | 40.31  90.63  44.31  57.03
3      ✓   ✓   -  | 35.83  78.58  41.14   | 40.23  90.72  44.55  53.29
4      ✓   ✓   ✓  | 35.75  79.15  41.14   | 40.25  91.33  44.69  53.29

• Effectiveness of Discriminative Tuning. From the compared results in Tab. 4 between different variants, we observe that the two tuning objectives, i.e., GM and LG, could consistently promote the alignment performance for T2I according to CLIP and BLIP scores. It verifies the validity of discriminative tuning on ITM and REC tasks. Compared with GM, LG achieves more remarkable improvement over semantic and object detection metrics. It may be attributed to the enhanced grounding ability brought by the prediction of local concepts based on partial descriptions. Furthermore, combining the two objectives to conduct multi-task learning may contribute to a slight improvement in BLIP scores under the OOD setting, but other metrics are slightly compromised. This phenomenon indicates that some contradictions may exist during model optimization, reflecting that unifying multiple tasks is still challenging.

• Effectiveness of Self-Correction. In Sec. 3.3, we propose to recycle the discriminative adapter in the inference phase by guiding iterative denoising. Comparing the 3rd and 4th variants in Table 4, we can see that the self-correction scheme could consistently improve the alignment for T2I, attesting to its effectiveness.

[Figure 3 (plot). x-axis: U-Net Block (bottom1, bottom2, bottom3, medium, up1, up2, up3); legend and axis labels: CLIP, BLIP-M, IS, GLIP, Matching, Grounding.]

Figure 3. Generative and discriminative results by probing different layers of U-Net in SD-v2.1 and adapting to ITM and REC. We report average CLIP and BLIP-M scores over COCO-NSS1K and CC-500, overall matching performance on MSCOCO-HN, and average grounding performance over all test sets of RefCOCO, RefCOCO+, and RefCOCOg. We conduct model selection based on T2I performance on the validation set of COCO-NSS1K.

• Impact of Probed U-Net Block. Due to the hierarchical structure of the U-Net in SD, we could extract multi-level feature maps from its different blocks. Prior work [62] has shown that different blocks may have different discriminative powers in image classification. To further investigate the matching and grounding abilities empowered by various blocks and the trade-off between discrimination and generation, we probe seven consecutive blocks of the U-Net shown in Fig. 2 from left to right and then tune the whole model based on the probed block. The generative and discriminative results are shown in Fig. 3. It can be observed that the T2I performance improves continuously as the probed block shifts from bottom to up. The reason may be
that more LoRA parameters would be introduced and more layers would be tuned during back-propagation. In contrast, the discriminative performance, for both matching and grounding, first increases and then deteriorates. It may be attributed to two points: 1) the feature maps from those blocks (e.g., up2 and up3) close to final outputs, i.e., pre-

[Figure 4 (plots). Panel (a): x-axis Tuning Step (2k, 4k, 6k, 8k, 10k); series CLIP, BLIP-M, Matching, Grounding. Panel (b): x-axis Guidance Factor η (0, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2); series CLIP and BLIP-M, with CLIP (SD-v2.1) and BLIP-M (SD-v2.1) as references.]

Figure 4. (a) Variation of generation and discrimination performance as tuning progresses and (b) impact of the self-correction strength on the performance of T2I on CC-500.

[Qualitative comparison figure. Columns: SD-v2.1, AaE, HN-DiffITM, DPT + SC (Ours). Categories: Object Appearance, Counting, Spatial Relation, Semantic Relation. Prompts: “A closeup view of a pizza, with a fork near it.”; “The two yellow trains are coming down from the mountain.”; “A girl standing in the grass with no shoes on with a frisbee in one hand and her other hand on her hip.”; “A person guiding a child down a hill on skis.”; “A couple of glasses next to a bottle.”; “A fire is going near four lounge chairs.”]

• Impact of Tuning Step. To further examine how discriminative tuning affects the two aspects of performance over time, we show how performance evolves as the number of tuning steps increases in the 2nd stage in Fig. 4a. We can see that the generative performance gets better with tuning and seems to reach the saturation point at the 8k step. In contrast, there is still potential for grounding performance

on the alignment performance of T2I. The results demonstrate that the proposed self-correction mechanism could alleviate the text-image misalignment issue with a proper range of guidance factor, i.e., (0.05, 1).
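To make the role of the guidance factor concrete, the following is a minimal sketch and not the authors' implementation. It assumes the correction is a gradient-ascent step z ← z + η ∇_z s(z, y) on the matching similarity s(z, y) = max_i cos(CLIP(y), h_i), and it replaces the U-Net and the discriminative adapter with hypothetical random linear maps `W` so that the gradient can be written by hand. The names `similarity`, `similarity_grad`, `self_correct`, `W`, `c`, and the toy dimensions are all illustrative assumptions.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def project(W_i, z):
    # h_i = W_i z: a linear stand-in (assumption) for U-Net features + adapter + projection.
    return [sum(w * x for w, x in zip(row, z)) for row in W_i]


def similarity(z, W, c):
    # s(z, y) = max_i cos(CLIP(y), h_i); c plays the role of the text embedding CLIP(y).
    return max(cosine(c, project(W_i, z)) for W_i in W)


def similarity_grad(z, W, c):
    # Gradient of s with respect to z, taken through the best-matching query only.
    scores = [cosine(c, project(W_i, z)) for W_i in W]
    best = max(range(len(W)), key=scores.__getitem__)
    h = project(W[best], z)
    h_norm, c_norm = norm(h), norm(c)
    # d cos(c, h) / d h = c / (|c||h|) - cos(c, h) * h / |h|^2
    g_h = [ci / (c_norm * h_norm) - scores[best] * hi / h_norm ** 2
           for ci, hi in zip(c, h)]
    # Chain rule through h = W z: d cos / d z = W^T g_h
    return [sum(W[best][r][k] * g_h[r] for r in range(len(h)))
            for k in range(len(z))]


def self_correct(z, W, c, eta):
    # One assumed correction step: move the latent along the similarity gradient.
    g = similarity_grad(z, W, c)
    return [zk + eta * gk for zk, gk in zip(z, g)]


# Toy setup: latent size D, embedding size d, M matching queries (all arbitrary).
random.seed(0)
D, d, M = 16, 8, 4
z = [random.gauss(0, 1) for _ in range(D)]
c = [random.gauss(0, 1) for _ in range(d)]
W = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(d)] for _ in range(M)]

for eta in (0.0, 0.05, 0.1, 0.5, 1.0):
    z_hat = self_correct(z, W, c, eta)
    print(f"eta={eta:<4}  s before={similarity(z, W, c):.4f}  "
          f"s after={similarity(z_hat, W, c):.4f}")
```

In this toy setting η = 0 leaves the latent unchanged and a small η nudges the similarity upward. With the real model the gradient would flow through the U-Net features and the adapter, typically via automatic differentiation, and the useful range of η has to be found empirically as in Fig. 4b.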
5. Conclusion and Future Work

In this work, we tackled the text-image misalignment issue for text-to-image generative models. Toward this end, we revisited the relations between generative and discriminative modeling and presented a two-stage method named DPT. It introduces a discriminative adapter for probing basic discriminative abilities in the first stage and performs discriminative fine-tuning in the second stage. DPT exhibited effectiveness and generalization across five T2I datasets and four ITM and REC datasets.

In the future, we plan to explore the effect of discriminative probing and tuning on more generative models using more conception and understanding tasks. Besides, it is interesting to discuss more complicated relations between discriminative and generative modeling, such as trade-offs and mutual promotion across different tasks.
References

[1] Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, and Mohamed Elhoseiny. Hrs-bench: Holistic, reliable and scalable benchmark for text-to-image models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20041–20053, 2023.
[2] Ryan Burgert, Kanchana Ranasinghe, Xiang Li, and Michael S Ryoo. Peekaboo: Text to image diffusion models are zero-shot segmentors. arXiv preprint arXiv:2211.13224, 2022.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
[4] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
[5] Huanran Chen, Yinpeng Dong, Zhengyi Wang, Xiao Yang, Chengqi Duan, Hang Su, and Jun Zhu. Robust classification via a single diffusion model. arXiv preprint arXiv:2305.15241, 2023.
[6] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
[7] Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3043–3054, 2023.
[8] Kevin Clark and Priyank Jaini. Text-to-image diffusion models are zero-shot classifiers. arXiv preprint arXiv:2303.15233, 2023.
[9] Colin Conwell and Tomer Ullman. Testing relational understanding in text-guided image generation. arXiv preprint arXiv:2208.00005, 2022.
[10] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
[11] Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[13] Cho et al. Visual programming for text-to-image generation and evaluation. In NeurIPS, 2023.
[14] Huang et al. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. In NeurIPS, 2023.
[15] Wan-Cyuan Fan, Yen-Chun Chen, DongDong Chen, Yu Cheng, Lu Yuan, and Yu-Chiang Frank Wang. Frido: Feature pyramid diffusion for complex scene image synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 579–587, 2023.
[16] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022.
[17] Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. arXiv preprint arXiv:2305.15393, 2023.
[18] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. Advances in neural information processing systems, 26, 2013.
[19] Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, and Yezhou Yang. Benchmarking spatial relationships in text-to-image generation. arXiv preprint arXiv:2212.10015, 2022.
[20] Kamal Gupta, Justin Lazarow, Alessandro Achille, Larry S Davis, Vijay Mahadevan, and Abhinav Shrivastava. Layouttransformer: Layout generation and completion with self-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1004–1014, 2021.
[21] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
[22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
[24] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[25] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In ICCV, 2023.
[26] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1780–1790, 2021.
[27] Diederik P Kingma and Max Welling. Auto-encoding varia- [41] Liqiang Nie, Leigang Qu, Dai Meng, Min Zhang, Qi Tian,
tional bayes. arXiv preprint arXiv:1312.6114, 2013. 3 and Alberto Del Bimbo. Search-oriented micro-video cap-
[28] Benno Krojer, Elinor Poole-Dayan, Vikram Voleti, Christo- tioning. In Proceedings of the 30th ACM International Con-
pher Pal, and Siva Reddy. Are diffusion models vision-and- ference on Multimedia, pages 3234–3243, 2022. 14
language reasoners? In Thirty-seventh Conference on Neural [42] Leigang Qu, Meng Liu, Da Cao, Liqiang Nie, and Qi
Information Processing Systems, 2023. 3, 5, 6, 7, 8, 16, 17 Tian. Context-aware multi-view summarization network for
[29] Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion image-text matching. In Proceedings of the 28th ACM In-
models already have a semantic latent space. arXiv preprint ternational Conference on Multimedia, pages 1047–1055,
arXiv:2210.10960, 2022. 2, 3 2020. 15
[30] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, [43] Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, and Liqiang
Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Nie. Dynamic modality interaction modeling for image-text
Ghavamzadeh, and Shixiang Shane Gu. Aligning text- retrieval. In Proceedings of the 44th International ACM SI-
to-image models using human feedback. arXiv preprint GIR Conference on Research and Development in Informa-
arXiv:2302.12192, 2023. 1 tion Retrieval, pages 1104–1113, 2021. 2
[31] Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis [44] Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, and Tat-
Brown, and Deepak Pathak. Your diffusion model is secretly Seng Chua. Layoutllm-t2i: Eliciting layout guidance from
a zero-shot classifier. arXiv preprint arXiv:2303.16203, llm for text-to-image generation. In Proceedings of the 31st
2023. 3, 6, 7, 17 ACM International Conference on Multimedia, pages 643–
654, 2023. 1, 2, 3, 5, 6, 12, 15, 16
[32] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022. 15
[33] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 5
[34] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023. 2
[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 5, 13, 15
[36] Xinyu Lin, Wenjie Wang, Yongqi Li, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. A multi-facet paradigm to bridge large language model and recommendation. arXiv preprint arXiv:2310.06491, 2023. 3
[37] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022. 2
[38] Shilong Liu, Shijia Huang, Feng Li, Hao Zhang, Yaoyuan Liang, Hang Su, Jun Zhu, and Lei Zhang. Dq-detr: Dual query detection transformer for phrase extraction and grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1728–1736, 2023. 15
[39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 16
[40] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. 2
[45] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 7
[46] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021. 2
[47] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. 1, 2
[48] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019. 3
[49] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666, 2019. 5
[50] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2, 3, 6, 16
[51] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015. 3
[52] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022. 1, 2, 15
[53] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016. 6
[54] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015. 1, 2
[55] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. 3
[56] Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, and Anna Rohrbach. Reclip: A strong zero-shot baseline for referring expression comprehension. arXiv preprint arXiv:2204.05991, 2022. 15, 17
[57] Teng Sun, Wenjie Wang, Liqiang Jing, Yiran Cui, Xuemeng Song, and Liqiang Nie. Counterfactual reasoning for out-of-distribution multimodal sentiment analysis. In Proceedings of the 30th ACM International Conference on Multimedia, pages 15–23, 2022. 5, 15
[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 3
[59] Haokun Wen, Xuemeng Song, Xin Yang, Yibing Zhan, and Liqiang Nie. Comprehensive linguistic-visual composition network for image retrieval. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1369–1378. ACM, 2021. 15
[60] Haokun Wen, Xian Zhang, Xuemeng Song, Yinwei Wei, and Liqiang Nie. Target-guided composed image retrieval. In Proceedings of the ACM International Conference on Multimedia, pages 915–923. ACM, 2023. 15
[61] Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, et al. The generative ai paradox: “what it can create, it may not understand”. arXiv preprint arXiv:2311.00059, 2023. 3
[62] Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang. Denoising diffusion autoencoders are unified self-supervised learners. arXiv preprint arXiv:2303.09769, 2023. 5, 7
[63] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023. 3, 5
[64] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018. 2
[65] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 776–791. Springer, 2016. 2
[66] Xingyi Yang and Xinchao Wang. Diffusion model as representation learner. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18938–18949, 2023. 3
[67] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022. 2, 15
[68] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016. 2, 4, 5, 13
[69] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 5907–5915, 2017. 2
[70] Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, and Jiwen Lu. Unleashing text-to-image diffusion models for visual perception. arXiv preprint arXiv:2303.02153, 2023. 3
Discriminative Probing and Tuning for Text-to-Image Generation
Supplementary Material
A. More Results and Analysis

A.1. Ablation Study on DPT
To further explore the effectiveness of the proposed DPT paradigm, we compare it with the traditional denoising tuning method based on the MSE loss function. Concretely, we examine all four combinations of the two objectives and evaluate ID and OOD generation performance on two versions of SD, i.e., SD-v1.4 and SD-v2.1. The experimental results are reported in Tab. 5. We make the following observations. 1) MSE can improve alignment and quality under the ID setting, but it may be unhelpful or even harmful to alignment under the OOD setting. 2) DPT consistently enhances alignment performance across both ID and OOD settings, with minimal impact on the original image quality. 3) By combining the MSE and DPT objectives, we may obtain a good trade-off between alignment and quality, and achieve better performance on the BLIP-ITM evaluation metric, which may focus on more details.

A.2. Ablation Study on Probed Blocks
As shown in Tab. 6, we carry out extensive experiments to study the effect of different probed blocks of U-Net on generation performance. The results show that DPT achieves the best performance when probing the up3 block.

A.3. Comprehensive Evaluation on COCO-NSS1K
The COCO-NSS1K dataset [44] was constructed to evaluate five categories of abilities for T2I models, including counting, spatial relation reasoning, semantic relation reasoning, complicated relation reasoning, and abstract imagination. To delve into these categories, we compare the proposed methods, including DPT and DPT + SC, with SD-v1.4 and SD-v2.1, as shown in Fig. 6. Our method consistently improves the alignment performance in all categories compared with the state-of-the-art SD-v2.1. Besides, the self-correction module could further align the generated images with prompts, especially in the semantic relation category.

A.4. Impact of Discriminative Tuning Steps
As shown in Fig. 7, Fig. 8, and Fig. 9, we study the generation and discrimination performance based on SD-v2.1 and SD-v1.4, and the comparison between the two versions, respectively. We discuss the results from the discrimination and generation aspects as follows.
• Discriminative Tuning for Discrimination. On the one hand, the grounding performance improves continuously as the tuning step increases, while the matching performance improves at the beginning and then fluctuates within a very small range. Compared with the performance at the end of the discriminative probing phase, i.e., at step 0, discriminative tuning with the extra parameters of LoRA further improves both discriminative abilities. On the other hand, the ITM performance of DPT-v2.1 is clearly better than that of DPT-v1.4. Conversely, DPT-v1.4 seems to be slightly stronger than DPT-v2.1 in terms of REC, as shown in Fig. 9.
• Discriminative Tuning for Generation. From Fig. 7 and Fig. 8, we can see that discrimination and generation performance improve together at the beginning of tuning, e.g., [0, 700] for DPT-v2.1 and [0, 800] for DPT-v1.4, which demonstrates the effectiveness of discriminative tuning in enhancing both abilities. Afterward, the generation performance declines, perhaps due to over-fitting or some potential discrepancy between generation and discrimination.

A.5. Impact of Rank Numbers in LoRA
The rank number in LoRA determines the number of extra parameters introduced in the discriminative tuning stage compared with the first stage. To explore the influence of the rank on generation performance, we compare different ranks, from 0 to 128, as shown in Tab. 7. The results show that DPT achieves the best performance with rank 4 on most evaluation metrics. Larger ranks do not bring further improvement, which may be attributable to the scale of the tuning data.

A.6. Impact of Layers of Discriminative Adapter
The total number of parameters of DPT also depends on the number of transformer layers in the discriminative adapter. We conduct experiments with different numbers of layers. Note that we keep the same number of layers in the encoder and the decoder for each experiment. The results are reported in Tab. 8. In general, the best generation performance is achieved with 4 layers. Besides, the alignment performance is always better than that of SD-v2.1 (i.e., the experiment with 0 layers), which further verifies the effectiveness of DPT.

A.7. Impact of Denoising Objective
In the raw SD model, only the denoising objective in MSE form is used to model the data distribution for image synthesis. To further study the interplay of the DPT and MSE objectives, we perform more experiments by combining them and varying the coefficient of MSE. As shown in Tab. 9, we observe that the simultaneous use of
Table 5. Ablation study for the influence of two objectives of discriminative probing and tuning (DPT) and denoising (MSE) on text-to-image generation. The COCO-NSS1K and CC-500 datasets are used to evaluate in-distribution (ID) and out-of-distribution (OOD) generation. Alignment-oriented evaluation metrics include CLIP score ↑, BLIP-ITM score ↑, and GLIP score ↑, while quality-oriented evaluation metrics include IS ↑ and FID ↓. The best results are highlighted in bold.
                     COCO-NSS1K (ID)                 CC-500 (OOD)
Version  MSE  DPT    CLIP   BLIP-ITM  IS     FID     CLIP   BLIP-ITM  GLIP   IS
SD-v1.4  -    -      33.27  67.96     31.32  54.77   34.82  70.95     31.17  14.28
SD-v1.4  ✓    -      32.95  70.48     32.03  52.63   34.00  71.27     34.08  15.02
SD-v1.4  -    ✓      33.85  71.84     31.65  54.96   35.97  76.74     37.07  13.56
SD-v1.4  ✓    ✓      33.83  73.28     30.59  55.02   35.83  77.20     37.89  13.67
SD-v2.1  -    -      34.96  73.32     30.40  55.35   39.24  85.45     52.09  11.53
SD-v2.1  ✓    -      34.20  75.90     30.10  51.85   38.49  84.65     49.48  13.38
SD-v2.1  -    ✓      35.83  78.58     30.83  55.55   40.23  90.72     53.29  11.59
SD-v2.1  ✓    ✓      35.64  79.23     30.18  53.36   39.87  90.41     52.09  12.09
Table 6. Generation performance with different probed blocks of U-Net and the sizes of feature maps (Feat. Size). All the experiments are based on SD-v2.1. We combine multiple feature maps using additive fusion, and perform interpolation if the feature sizes are different. Alignment-oriented evaluation metrics include CLIP score ↑, BLIP-ITM (BLIP-M) score ↑, BLIP-ITC (BLIP-C) score ↑, and GLIP score ↑, while quality-oriented evaluation metrics include IS ↑ and FID ↓. The best results are highlighted in bold.
                                     COCO-NSS1K (ID)                        CC-500 (OOD)
Block                    Feat. Size  CLIP   BLIP-M  BLIP-C  IS     FID      CLIP   BLIP-M  BLIP-C  GLIP   IS
-                        -           34.96  73.32   40.22   30.40  55.35    39.24  85.45   43.36   52.09  11.53
bottom1                  32×32       34.76  73.15   40.12   29.45  55.51    38.81  84.68   43.05   51.57  11.80
bottom2                  16×16       35.03  73.63   40.24   30.38  55.38    39.15  87.08   43.52   49.18  12.04
bottom3                  8×8         35.69  77.57   40.85   30.06  55.52    40.19  90.33   44.34   54.33  11.17
middle                   8×8         35.90  79.25   41.11   30.53  55.12    40.28  90.66   44.40   54.04  11.31
up1                      8×8         35.83  78.58   41.14   30.83  55.55    40.23  90.72   44.55   53.29  11.59
up2                      16×16       35.85  79.19   41.10   30.25  54.92    40.67  92.72   44.85   55.98  11.34
up3                      32×32       35.91  80.39   41.24   31.47  57.12    41.21  94.52   45.46   52.62  11.89
middle + bottom3 + up1   8×8         35.65  78.26   41.13   30.39  55.47    39.99  90.42   44.12   52.47  11.76
bottom2 + up2            16×16       35.91  79.43   41.16   30.41  54.86    40.64  91.98   44.71   53.21  11.77
bottom1 + up1            32×32       35.77  79.31   41.15   30.09  56.63    40.75  93.72   45.02   51.57  12.69
all                      8×8         35.84  79.19   41.10   30.47  55.64    40.61  91.22   44.77   53.14  11.42
all                      32×32       35.48  78.85   41.40   30.57  56.53    40.18  92.74   45.29   48.95  11.88
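The caption of Table 6 mentions additive fusion with interpolation when feature sizes differ. The following is a minimal pure-Python sketch of that idea on single-channel 2D maps. The nearest-neighbour interpolation mode, the choice of target size, and the function names `resize_nearest` and `additive_fusion` are assumptions made for illustration; the source does not specify them.

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2D list (assumed interpolation mode)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def additive_fusion(fmaps, out_h, out_w):
    """Resize every map to (out_h, out_w) and sum them element-wise."""
    fused = [[0.0] * out_w for _ in range(out_h)]
    for fmap in fmaps:
        resized = resize_nearest(fmap, out_h, out_w)
        for r in range(out_h):
            for c in range(out_w):
                fused[r][c] += resized[r][c]
    return fused


# Toy stand-ins for a coarse block and a finer block (hypothetical values).
small = [[1.0, 2.0], [3.0, 4.0]]
large = [[1.0] * 4 for _ in range(4)]
fused = additive_fusion([small, large], 4, 4)
```

Real U-Net features carry a channel dimension and would normally be fused with tensor operations; the nested lists here only illustrate the resize-then-add order.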
these two objectives does not cause significant conflict. Instead, they may complement each other to achieve a competitive trade-off between text-image alignment and image quality.

A.8. Impact of Timesteps on Discriminative Tasks
As shown in Tab. 11, we explore the influence of different timesteps used in DDPM on the ITM and REC performance. The results illustrate that the proposed model achieves the best performance when the timestep is set to 250. The comparison between timesteps 0 and 250 reveals that introducing an appropriate level of noise helps improve the discriminative abilities.

A.9. Impact of Tunable Modules
In the second stage, we have two strategies for discriminative tuning: training only LoRA, and training DA + LoRA. To compare the two tuning approaches, we evaluate their generation performance as shown in Tab. 11. From the results, we find that training DA + LoRA is better, perhaps due to the greater flexibility brought by the additional parameters from DA and LoRA during the discriminative tuning phase.

B. More Experimental Settings

B.1. Benchmark Datasets
• Data used for Training. We evaluate the basic global matching and local grounding abilities of T2I models based on two discriminative tasks, i.e., Image-Text Matching and Referring Expression Comprehension, respectively. For discriminative probing and tuning, we reorganize public benchmarks including MSCOCO [35] for ITM, and RefCOCO [68], RefCOCO+ [68], and RefCOCOg [68] for REC. Specifically, we use the COCO2014 version of MS-COCO, composed of 82,783 images and 414,113 captions
[Figure 6: three panels, (a) CLIP Score, (b) BLIP-ITM Score, and (c) BLIP-ITC Score, each comparing SD-v1.4, SD-v2.1, DPT (Ours), and DPT + SC (Ours) per category.]
Improvement of DPT + SC (Ours) over SD-v2.1, in panel order (a) / (b) / (c):
Imagination   2.60% / 9.84% / 2.02%
Complicated   2.06% / 5.15% / 2.55%
Semantic      2.16% / 6.79% / 1.20%
Spatial       2.04% / 10.87% / 1.05%
Counting      2.81% / 7.28% / 3.03%
Figure 6. Alignment performance improvement of the proposed method compared with SD-v1.4 and SD-v2.1 on five categories of the COCO-NSS1K dataset, including counting, spatial relation, semantic relation, and complicated relation reasoning and imagination abilities evaluation. Results on three evaluation metrics (a) CLIP Score, (b) BLIP-ITM Score, and (c) BLIP-ITC Score are reported. The value on the right of each category denotes the percentage improvement of DPT + SC (Ours) compared to SD-v2.1.
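As a small worked example of a percentage improvement, the sketch below assumes the percentages are relative gains over the baseline score (the source does not spell out the formula). It uses the BLIP-ITM scores on COCO-NSS1K that Table 5 reports for SD-v2.1 and for DPT alone, not the DPT + SC scores plotted in Fig. 6; the function name `relative_improvement` is hypothetical.

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent (assumed definition)."""
    return 100.0 * (ours - baseline) / baseline


# BLIP-ITM on COCO-NSS1K from Table 5: SD-v2.1 = 73.32, DPT (SD-v2.1) = 78.58.
gain = relative_improvement(78.58, 73.32)
print(f"{gain:.2f}%")
```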
Table 7. Generation performance with different numbers of LoRA ranks on the COCO-NSS1K and CC-500 datasets. Alignment-oriented evaluation metrics include CLIP score ↑, BLIP-ITM score ↑, BLIP-ITC score ↑, and GLIP score ↑, while quality-oriented evaluation metrics include IS ↑ and FID ↓. The best results are highlighted in bold.
       COCO-NSS1K (ID)                          CC-500 (OOD)
Rank   CLIP   BLIP-ITM  BLIP-ITC  IS     FID     CLIP   BLIP-ITM  BLIP-ITC  GLIP   IS
0      34.96  73.32     40.22     30.40  55.35   39.24  85.45     43.36     52.09  11.53
4      35.83  78.58     41.14     30.83  55.55   40.23  90.72     44.55     53.29  11.59
8      35.47  76.89     40.94     30.39  55.28   39.78  90.04     44.26     52.69  11.84
16     35.59  77.33     40.77     30.25  54.83   39.85  89.45     43.96     52.69  11.99
32     35.55  76.68     40.80     29.98  54.81   39.98  89.82     44.04     52.76  11.83
64     35.67  77.05     40.79     30.12  55.28   40.22  90.99     44.21     53.44  11.73
128    35.66  77.11     40.84     30.68  55.34   40.05  89.99     44.12     52.39  11.92
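To make the link between LoRA rank and extra parameters concrete, the sketch below counts the parameters one low-rank pair adds to a single weight matrix: a rank-r update factorizes into a d_out × r matrix and an r × d_in matrix, so it adds r · (d_in + d_out) parameters. The 320 × 320 layer size and the function name `lora_extra_params` are hypothetical; the source does not state which layers receive LoRA or their sizes.

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Hypothetical 320 x 320 projection; the ranks are those compared in Table 7.
for rank in (0, 4, 8, 16, 32, 64, 128):
    print(rank, lora_extra_params(320, 320, rank))
```

The count grows linearly with the rank, so rank 128 adds 32 times as many parameters as rank 4 for the same layer.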
[Figures 7 and 8: line plots over the tuning step (0 to 40k) with curves for CLIP, BLIP-ITM, Matching, and Grounding.]
Figure 7. Impact of the discriminative tuning steps in DPT (SD-v2.1) on generation and discrimination performance. CLIP and BLIP-ITM scores on the CC-500 dataset under out-of-distribution setting, averaging performance over I-to-T and T-to-I matching, and averaging precision@1 over test sets of RefCOCO, RefCOCO+, and RefCOCOg are shown. The model achieves the best generation performance on the validation set at step 700.
Figure 8. Impact of the discriminative tuning steps in DPT (SD-v1.4) on generation and discrimination performance. CLIP and BLIP-ITM scores on the CC-500 dataset under out-of-distribution setting, averaging performance over I-to-T and T-to-I matching, and averaging precision@1 over test sets of RefCOCO, RefCOCO+, and RefCOCOg are shown. The model achieves the best generation performance on the validation set at step 700.
in the training set. There are about 5 caption annotations for each image. As for REC, following MDETR [26], we combine the three datasets, i.e., RefCOCO, RefCOCO+, and RefCOCOg, into one, called RefCOCOall in the following. Its training set includes 28,158 images and 321,327 expressions.
To probe global matching and local grounding abilities at the same time, we combine all the above datasets into one. Concretely, we observe that the MSCOCO dataset includes all the raw images in the above three REC datasets. Therefore, we adopt the following data sampling [41] strategy during training: 1) randomly sample an expression y′ from RefCOCOall and get the corresponding image x, 2) randomly sample a caption y from all positive captions of x, 3) randomly sample a hard negative caption y^neg from top-20 hard negative captions⁴ of x from the training set of MS-COCO, 4) randomly sample a hard negative im-
⁴ We use OpenCLIP (ViT-H-14) to calculate image-text similarities and retrieve the top-k hard negative captions (k=20) or hard negative images (k=4).
Table 8. Generation performance with different numbers of layers of encoders and decoders in the discriminative adapter on the COCO-NSS1K and CC-500 datasets. Alignment-oriented evaluation metrics include CLIP score ↑, BLIP-ITM score ↑, BLIP-ITC score ↑, and GLIP score ↑, while quality-oriented evaluation metrics include IS ↑ and FID ↓. The best results are highlighted in bold.
       COCO-NSS1K (ID)                          CC-500 (OOD)
Layer  CLIP   BLIP-ITM  BLIP-ITC  IS     FID     CLIP   BLIP-ITM  BLIP-ITC  GLIP   IS
0      34.96  73.32     40.22     30.40  55.35   39.24  85.45     43.36     52.09  11.53
1      35.83  78.58     41.14     30.83  55.55   40.23  90.72     44.55     53.29  11.59
2      35.51  77.67     40.93     30.52  54.75   40.23  91.14     44.36     52.99  11.65
3      35.55  77.14     40.81     30.27  55.02   40.03  90.55     44.19     53.59  12.00
4      35.91  79.41     41.08     30.95  55.06   40.88  92.92     44.88     54.26  11.53
5      35.80  78.76     41.02     29.53  55.18   40.10  90.89     44.22     55.23  11.97
Table 9. Generation performance with different coefficients (Coeff.) of the denoising MSE loss function at the second stage for the discriminative tuning on the COCO-NSS1K and CC-500 datasets. Alignment-oriented evaluation metrics include CLIP score ↑, BLIP-ITM score ↑, BLIP-ITC score ↑, and GLIP score ↑, while quality-oriented evaluation metrics include IS ↑ and FID ↓. The best results are highlighted in bold.
               COCO-NSS1K (ID)                          CC-500 (OOD)
Coeff. of MSE  CLIP   BLIP-ITM  BLIP-ITC  IS     FID     CLIP   BLIP-ITM  BLIP-ITC  GLIP   IS
0.00           35.83  78.58     41.14     30.83  55.55   40.23  90.72     44.55     53.29  11.59
0.05           35.54  77.78     41.14     29.73  52.84   40.05  90.12     44.41     52.84  11.57
0.10           35.64  79.08     41.34     29.74  52.79   40.11  89.85     44.37     53.36  11.97
0.30           35.53  78.81     41.32     29.95  52.73   40.03  89.65     44.37     54.48  12.20
0.50           35.49  77.51     40.99     30.61  53.44   39.81  88.78     44.05     50.90  11.26
1.00           35.64  79.23     41.49     30.18  53.36   39.87  90.41     44.70     52.09  12.09
color words describe different objects, and the other 3,217 prompts obtained by switching the positions of two color words. TIFA [25] is a recent VQA-based benchmark built to evaluate T2I alignment across 12 categories, consisting of 4,081 prompts from MSCOCO [35], DrawBench [52], PartiPrompt [67], and PaintSkills [7]. The VQA accuracy based on MLLMs [32] is employed to assess text-image alignment. As a contemporary work, T2I-CompBench [14] is constructed to offer a comprehensive benchmark for compositional T2I from 6 categories, including color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions. We use the test set with 1,800 prompts for evaluation.

[Figure 9: two panels, (a) ITM and (b) REC; x-axis: Tuning Step (0 to 40, ×1k); y-axes: Accuracy on ITM and Precision@1 on REC; curves: Matching (DPT-v1.4), Matching (DPT-v2.1), Grounding (DPT-v1.4), Grounding (DPT-v2.1).]
Figure 9. Discriminative performance comparison between DPT-v1.4 and DPT-v2.1 on (a) Image-Text Matching (ITM) and (b) Referring Expression Comprehension (REC) as the tuning progresses.
Table 11. Generation performance comparison with different tuning strategies at the second stage. “LoRA” refers to only training LoRA and freezing other modules including U-Net and Discriminative Adapter, while “DA + LoRA” means training Discriminative Adapter and LoRA with only U-Net frozen.
            COCO-NSS1K        CC-500
Trainable   CLIP   BLIP-ITM   CLIP   BLIP-ITM
LoRA        35.48  76.46      39.75  89.23
DA + LoRA   35.83  78.58      40.23  90.72

B.2. Baselines
To verify the effectiveness of our method on T2I, we carry out experiments based on SD-v1.4 and SD-v2.1, and compare DPT with SD [50] and recent alignment-oriented T2I baselines including LayoutLLM-T2I [44], StructureDiffusion [16], Attend-and-Excite [4], DiffusionITM [28], VPGen [13] and LayoutGPT [17].

B.3. Implementation Details

B.3.1 Settings for Discriminative Probing and Tuning
In the first stage, we perform discriminative probing by pre-training for 60k steps with a batch size of 220 and a learning rate of 1e-4. After that, we perform discriminative tuning in the second stage for 6k steps with a batch size of 8 and a learning rate of 1e-4. We use the AdamW [39] optimizer with betas of (0.9, 0.999) and a weight decay of 0.01. We clip gradients with a maximum norm of 0.1. We select the model checkpoint by using a validation set for alignment-oriented generation. Specifically, […] accumulate gradients every 8 steps in this stage. Using a single A100 (40G), the discriminative probing requires about 3 days for the first stage, and the discriminative tuning requires about 1 day for the second stage.

B.3.2 Settings for Discriminative Adapter
We use a 1-layer Transformer encoder and a 1-layer Transformer decoder to implement the discriminative adapter introduced in Sec. 3.1. The feature map with the shape 1280 × 8 × 8 in the middle layer of U-Net is flattened and fed into the encoder. The hidden dimensions of the attention layers and feedforward layers are set to 256 and 2048, respectively. The number of attention heads is 8. We use 110 learnable queries in total, 10 for global matching and the other 100 for local grounding, i.e., N = 110 and M = 10. The temperature factor τ for contrastive learning in Eqn. (3) and Eqn. (4) is learnable and initialized to 0.07. To balance the different grounding objectives in Eqn. (7), we set λ0, λ1, λ2, and λ3 to 1, 5, 2, and 1, respectively. The same setting is also used for maximum matching in Eqn. (5). During inference, we use 0.5 as the default guidance strength of self-correction, i.e., η = 0.5.

B.3.3 Implementation of Baselines
We run the code of SD-v1.4⁵ and SD-v2.1⁶ from the Hugging Face open-source community. Besides, we run the code of LayoutLLM-T2I⁷ and load the open checkpoint to evaluate its performance on COCO-NSS1K. Because they are training-free, StructureDiffusion⁸ and Attend-and-Excite⁹ can be directly executed for evaluation. As for DiffusionITM [28], we implement it based on the open-sourced code¹⁰ and rename it HN-DiffusionITM, considering that we perform contrastive learning based on hard negative samples. For a fair comparison, we re-train HN-DiffusionITM using our training data and adopt the DDIM sampler with 50 timesteps to synthesize images.

⁵ https://huggingface.co/CompVis/stable-
[…] com/weixi-feng/Structured-Diffusion-Guidance, but it is only implemented based on SD-v1.4.
⁹ https://github.com/yuval-alaluf/Attend-and-Excite.
¹⁰ https://github.com/McGill-NLP/diffusion-itm.
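To illustrate what sampling with 50 DDIM timesteps means in practice, the sketch below picks an evenly spaced subsequence of the training timesteps and visits it from the noisiest step down. The 1,000 training timesteps and the uniform spacing are assumptions (spacing conventions differ between implementations), and `ddim_timesteps` is a hypothetical helper rather than a function from any library.

```python
def ddim_timesteps(num_train_steps=1000, num_inference_steps=50):
    """Evenly spaced timesteps, largest first (uniform spacing is assumed)."""
    stride = num_train_steps // num_inference_steps
    return [i * stride for i in range(num_inference_steps)][::-1]


steps = ddim_timesteps()
print(len(steps), steps[0], steps[-1])
```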
To compare the global matching abilities of different
generative models, we implement Diffusion Classifier [31]
and DiffusionITM [28]. They both depend on ELBO to
compute text-image similarities. Compared with Diffusion
Classifier, DiffusionITM uses a normalization method to
rectify the text-to-image directional matching, dealing with
the modality asymmetry issue to some extent. We call this
version HN-DiffusionITM.
Regarding the local grounding ability, we first evalu-
ate CLIP and OpenCLIP by means of the cropping strat-
egy [56]. In detail, we first crop the raw images into multi-
ple blocks and resize them to the same size according to the
proposals, and then perform expression-to-image matching.
In addition, we also adopt the cropping and expression-to-image
matching strategy to evaluate the grounding performance of
Diffusion Classifier, DiffusionITM, and HN-DiffusionITM, since
these models cannot be directly repurposed for the REC task.
We also propose a local denoising method based on Diffusion
Classifier. Specifically, we calculate the ELBO for each local
proposal region instead of the whole image, and then take the
proposal with the maximum ELBO value as the prediction.
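The cropping-based grounding procedure can be sketched as an argmax over proposals. In this minimal sketch, `score_fn` is a hypothetical callable that stands in for whichever model scores a cropped region against the expression (a CLIP-style similarity, or an ELBO-based score for the local denoising variant). The IoU helper is included only because precision@1 for REC is conventionally computed against an IoU threshold, which the text here does not state.

```python
def ground_by_cropping(proposals, score_fn):
    """Return the proposal box whose crop scores highest for the expression.

    proposals: list of (x0, y0, x1, y1) boxes
    score_fn:  hypothetical callable mapping a box to a matching score
    """
    if not proposals:
        raise ValueError("at least one proposal is required")
    return max(proposals, key=score_fn)


def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


# Toy proposals and scores (hypothetical values) to show the call pattern.
proposals = [(0, 0, 10, 10), (5, 5, 20, 20)]
toy_scores = {(0, 0, 10, 10): 0.21, (5, 5, 20, 20): 0.34}
best = ground_by_cropping(proposals, toy_scores.get)
overlap = iou(proposals[0], proposals[1])
```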
C. More Examples
To intuitively compare the proposed method with recent
baselines, we list some generated images on the CC-500
and ABC-6K datasets, as shown in Fig. 10 and Fig. 11, re-
spectively. To make a fair comparison, we implement all the
methods based on SD-v1.4, considering that StructureDif-
fusion only supports SD-v1.4. Besides, we also show more
diverse examples including SD-v1.4, SD-v2.1, and ours,
on the COCO-NSS1K, CC-500, and ABC-6K datasets, in
Fig. 12 and Fig. 13, Fig. 14, and Fig. 15, respectively.
[Figure 10: image grid with columns SD-v1.4, StructureDiff, AaE, and DPT + SC (Ours); row groups: Object Appearance and Attribute Characterizing. Prompts:]
“a blue backpack and a brown cow”
“a blue bowl and a red car”
“a blue backpack and a brown bear”
“a red giraffe and a brown train”
“a blue chair and a red vase”
“a blue apple and a green boat”
“a blue dog and a brown suitcase”
“a blue horse and a brown vase”
Figure 10. Qualitative results on CC-500. We compare the proposed method with SD-v1.4 and two baselines including StructureDiff [16] and Attend-and-Excite (AaE) [4] regarding object appearance and attribute characterizing.
[Figure 11: image grid with columns SD-v1.4, StructureDiff, AaE, and DPT + SC (Ours). Prompts:]
“A man in a red shirt and black pants resting on a bench.”
“A man in a black shirt and red pants resting on a bench.”
“A kitchen with white tile floor and blue sloped ceiling.”
“A kitchen with blue tile floor and white sloped ceiling.”
“A small black dog sitting in the basket of a white bike.”
“A small white dog sitting in the basket of a black bike.”
“A yellow toy car with lights on sitting on a black mat.”
“A black toy car with lights on sitting on a yellow mat.”
Figure 11. Qualitative results on ABC-6K. We compare the proposed method with SD-v1.4 and two baselines including StructureDiff [16] and Attend-and-Excite (AaE) [4] regarding color attribute characterizing.
Ground Truth Stable Diffusion-v1.4 Stable Diffusion-v2.1 DPT + SC (Ours)
Figure 12. More generated examples on COCO-NSS1K regarding object appearance and counting.
Ground Truth Stable Diffusion-v1.4 Stable Diffusion-v2.1 DPT + SC (Ours)
Figure 13. More generated examples on COCO-NSS1K regarding spatial, semantic, and compositional reasoning.
Stable Diffusion-v1.4 Stable Diffusion-v2.1 DPT + SC (Ours)
“A yellow rocking chair attached to a device with a blue umbrella over the top of it.”
“A blue rocking chair attached to a device with a yellow umbrella over the top of it.”
“An orange and white cat sitting in the grass near some yellow flowers.”
“An orange and yellow cat sitting in the grass near some white flowers.”