
Discriminative Probing and Tuning for Text-to-Image Generation

Leigang Qu1, Wenjie Wang1*, Yongqi Li2, Hanwang Zhang3,4, Liqiang Nie5, Tat-Seng Chua1
1 National University of Singapore, 2 Hong Kong Polytechnic University, 3 Nanyang Technological University, 4 Skywork AI, 5 Harbin Institute of Technology (Shenzhen)

arXiv:2403.04321v2 [cs.CV] 14 Mar 2024

Abstract

Despite advancements in text-to-image generation (T2I), prior methods often face text-image misalignment problems such as relation confusion in generated images. Existing solutions involve cross-attention manipulation for better compositional understanding or integrating large language models for improved layout planning. However, the inherent alignment capabilities of T2I models are still inadequate. By reviewing the link between generative and discriminative modeling, we posit that T2I models' discriminative abilities may reflect their text-image alignment proficiency during generation. In this light, we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-to-image alignment for generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. As a bonus of the discriminative adapter, a self-correction mechanism can leverage discriminative gradients to better align generated images to text prompts during inference. Comprehensive evaluations across three benchmark datasets, including both in-distribution and out-of-distribution scenarios, demonstrate our method's superior generation performance. Meanwhile, it achieves state-of-the-art discriminative performance on the two discriminative tasks compared to other generative models. The code is available at https://dpt-t2i.github.io/.

[Figure 1: (a) three misaligned images generated by SD-v2.1 for the prompts "A blue bird and a brown bowl.", "Five pandas are eating bamboo.", and "A white horse looking through the window of a tall brick building.", illustrating attribute binding, counting error, and relation confusion; (b) improving generation with discrimination, asking "Which matches y better, x or x̂?" and "Where is the object described by y′?"]
Figure 1. Illustration of (a) the text-image misalignment problem and (b) our motivation of enhancing the discriminative abilities of T2I models to promote their generative abilities. We list three wrong generation results produced by SD-v2.1 [50] with regard to attribute binding, counting error, and relation confusion in (a).

1. Introduction

Text-to-image generation (T2I) aims to synthesize high-quality and semantically relevant images for a given free-form text prompt. In recent years, the rapid development of diffusion models [23, 54] has ignited research enthusiasm for content generation, leading to a significant leap in T2I [47, 50, 52]. However, due to weak compositional reasoning capabilities, current T2I models still suffer from the Text-Image Misalignment problem [30], such as attribute binding [16], counting errors [44], and relation confusion [44] (see Fig. 1), especially in complicated multi-object generation scenes.

* Corresponding author. This work is supported by the NExT++ Research Center and the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-RP-2021-022).
Two lines of work have made remarkable progress in improving text-image alignment for T2I models. The first line proposes to intervene in cross-modal attention activations, guided by linguistic structures [16] or test-time optimization [4]. However, these methods heavily rely on inductive biases for manipulating attention structures, often necessitating expertise in vision-language interaction; such expertise is not easily acquired and lacks flexibility. In contrast, another research line [17, 44] borrows LLMs' linguistic comprehension and compositional abilities for layout planning, and then incorporates layout-to-image models (e.g., GLIGEN [34]) for controllable generation. Although these methods mitigate misalignment issues like counting errors, they heavily rely on intermediate states, e.g., bounding boxes, for layout representation. The intermediate states may not adequately capture fine-grained visual attributes and can also accumulate errors in this two-stage paradigm. Furthermore, the intrinsic compositional reasoning abilities of T2I models remain inadequate.

To tackle these issues, we aim to promote text-image alignment by directly catalyzing the intrinsic compositional reasoning of T2I models, without depending on inductive biases for attention manipulation or on intermediate states. Richard Feynman famously stated, "What I cannot create, I do not understand," underscoring the significance of understanding in the process of creation. This motivates us to consider enhancing the understanding abilities of T2I models to facilitate their text-to-image generation. As illustrated in Fig. 1, a T2I model is more likely to generate an image with correct semantics if it can distinguish the alignment difference between the text prompt and two images with minor semantic variations.

In light of this, we propose to examine the understanding abilities of T2I models through two discriminative tasks. First, we probe the discriminative global matching ability1 of T2I models on Image-Text Matching (ITM) [18, 43], a representative task for evaluating fundamental text-image alignment. The second discriminative task inspects the local grounding ability of T2I models. One representative task is Referring Expression Comprehension (REC) [68], which examines the fine-grained expression-object alignment within an image. Based on the two tasks, we aim to 1) probe the discriminative abilities of T2I models, especially compositional semantic alignment, and 2) further improve their discriminative abilities for better text-to-image generation.

Toward this end, we propose a Discriminative Probing and Tuning (DPT) paradigm to examine and improve the text-image alignment of T2I models in a two-stage process. 1) To probe the discriminative abilities, DPT incorporates a Discriminative Adapter that performs the ITM and REC tasks based on the semantic representations [29] of T2I models. For example, DPT may take the feature maps from the U-Net of diffusion models [50] as semantic representations. 2) In the second stage, DPT further improves the text-image alignment by means of parameter-efficient fine-tuning, e.g., LoRA [24]. In addition to the adapter, DPT fine-tunes the foundation T2I model to strengthen its intrinsic compositional reasoning abilities for both discriminative and generative tasks. As an extension, we present a self-correction mechanism that guides T2I models toward better alignment using gradient-based guidance signals from the discriminative adapter. We conduct extensive experiments on three alignment-oriented text-to-image generation benchmarks and four ITM and REC benchmarks under in-distribution and out-of-distribution settings, validating the effectiveness of DPT in enhancing both the generative and discriminative abilities of T2I models. The main contributions of this work are threefold.
• We revisit the relations between generative and discriminative modeling, and propose a simple yet effective paradigm called DPT to probe and improve the basic discriminative abilities of T2I models for better text-to-image generation.
• We present a discriminative adapter to achieve efficient probing and tuning in DPT. Besides, we extend T2I models with a self-correction mechanism guided by the discriminative adapter for alignment-oriented generation.
• We conduct extensive experiments on three text-to-image generation datasets and four discriminative datasets, significantly enhancing the generative and discriminative abilities of representative T2I models.

1 Here we inspect the understanding ability of models with discriminative tasks by considering the taxonomy of discriminative and generative learning in machine learning.

2. Related Work

• Text-to-Image Generation. Over the past decades, great efforts on Variational Autoencoders [65], Generative Adversarial Networks [64, 69], and auto-regressive models [10, 46, 67] have been dedicated to generating high-quality images under text conditions. Recently, there has been a flurry of interest in Diffusion Probabilistic Models (DMs) [23, 54] due to their stability and scalability. To further improve generation quality, large-scale models such as DALL·E 2 [47], Imagen [52], and GLIDE [40] emerged to synthesize photorealistic images. This work mainly focuses on diffusion models and especially takes the open-sourced Stable Diffusion (SD) [50] as the base model.
• Improving Text-Image Alignment. Despite the thrilling success, current T2I models still suffer from text-image misalignment issues [1, 9, 19], especially in complex scenes requiring compositional reasoning [37]. Several pioneering efforts introduce guidance to intervene in the internal features of SD to stimulate high-alignment generation. For example, StructureDiffusion [16] parses prompts into tree structures and incorporates them with cross-attention representations to promote compositional generation. Attend-and-Excite [4] manipulates cross-attention units to attend to all textual subject tokens and enhances the activations in attention maps. Despite the notable
momentum, they are limited to tackling problems such as missing objects and incorrect attributes, and ignore relation enhancement. Another thread of work, e.g., LayoutLLM-T2I [44] and LayoutGPT [17], resorts to two-stage coarse-to-fine frameworks [15, 20, 48], which first induce an explicit intermediate bounding-box-based layout and then synthesize images. However, such an intermediate layout may not be sufficient to represent complex scenes, and these methods largely abandon the intrinsic reasoning abilities of pre-trained T2I models. In this work, we propose a discriminative tuning paradigm that stimulates the discriminative abilities of pre-trained T2I models for high-alignment generation.
• Generative and Discriminative Modeling. The thrilling progress of LLMs enables generative models to complete discriminative tasks, which motivates researchers to exploit understanding abilities [36] with foundation visual generative models in image classification [5, 8, 31, 66], segmentation [2, 63, 70], and image-text matching [28]. Besides, DreamLLM [11] unifies generation and discrimination in a multimodal auto-regressive framework and reveals their potential synergy. On the contrary, a recent work [61] discusses the generative AI paradox and shows that LLMs may not truly understand what they have generated. To the best of our knowledge, we are the first to study discriminative tuning to promote alignment in T2I.

3. Method

In this section, we introduce the DPT paradigm to probe and enhance the discriminative abilities of foundation T2I models. As shown in Fig. 2, DPT consists of two stages, i.e., Discriminative Probing and Discriminative Tuning, as well as a self-correction mechanism described in Sec. 3.3.

3.1. Stage 1 – Discriminative Probing

In the first stage, we aim to develop a probing method to explore "How powerful are the discriminative abilities of recent T2I models?". To this end, we first select representative T2I models and semantic representations, and then consider adapting the T2I models to discriminative tasks.
• Stable Diffusion for Discriminative Probing. Considering that SD is open-sourced and one of the most powerful and popular T2I models, we select its different versions (see Sec. 4.2) as representative models to probe the discriminative abilities. To make generative diffusion models semantically focused and efficient, SD [50] performs denoising in a latent low-dimensional space. It consists of a VAE [27], the text encoder of CLIP [45], and a U-Net [51]. The U-Net serves as the neural backbone for denoising score matching in the latent space and is composed of three parts, i.e., down blocks, mid blocks, and up blocks. During training, given a positive image-text pair (x, y), SD first encodes image x with the VAE encoder and adds noise ϵ ∼ N(0, 1) to obtain the latent z_t = h(x, t) at timestep t. Thereafter, SD employs the U-Net to predict the added noise and optimizes the model parameters by minimizing the L2 loss between the ground-truth noise and the predicted one.
• Semantic Representations. It is non-trivial to leverage T2I models such as SD for discriminative tasks. Fortunately, recent work [29] demonstrates that diffusion models have a meaningful semantic latent space, although they were originally designed for denoising [23] or score estimation [55]. Besides, a series of pioneering works [2, 8, 31, 63] shows the validity and even superiority of representations extracted from the U-Net of SD for discriminative tasks. Inspired by these studies, we consider utilizing semantic representations from the U-Net of SD to perform discriminative tasks via a discriminative adapter.
• Discriminative Adapter. We propose a lightweight discriminative adapter, which relies on the semantic representations of SD to handle discriminative tasks. Inspired by DETR [3], we implement the discriminative adapter with the Transformer [58] structure, including a Transformer encoder and a Transformer decoder. Besides, we adopt a fixed number of randomly initialized, learnable queries to adapt the framework to specific discriminative tasks.

Concretely, given a noisy latent z_t at a sampled timestep t and a prompt y, we first feed them into the U-Net and extract a 2D feature map F_t ∈ R^{h×w×d} from one of its intermediate blocks2, where h, w, and d denote the height, width, and dimension, respectively. Formally, we extract F_t via

F_t = \mathrm{UNet}_l(z_t, \mathrm{CLIP}(y), t),   (1)

where UNet_l refers to the operation of extracting the feature maps from the l-th block of the U-Net. Afterward, we combine F_t with learnable position embeddings [12] and timestep embeddings [50] of t via additive fusion, and then flatten it into the semantic representation F̃_t ∈ R^{hw×d}. For simplicity, we omit the subscript t in the following.

To probe the discriminative abilities, we feed F̃ into the Transformer encoder Enc(·), and then perform interaction between the encoder output and learnable queries Q = {q_1, ..., q_N} with q_i ∈ R^d in the Transformer decoder Dec(·, ·). The whole process is formulated as

Q^* = f(\tilde{F}; W_a, Q) = \mathrm{Dec}(\mathrm{Enc}(\tilde{F}), Q),   (2)

where f(·) denotes the discriminative adapter with parameters W_a and Q, and W_a includes the parameters in Enc and Dec. The queries Q serve as a bridge between visual representations and downstream discriminative tasks, attending to the encoded semantic representation F̃ via the cross-attention [58] of the decoder.

2 We select the medium block by default, and also delve into the influence of different blocks in Sec. 4.3.
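To make the adapter concrete, below is a minimal PyTorch sketch of a DETR-style discriminative adapter corresponding to Eq. (2). The hidden dimension, layer counts, sequence length, and the additive fusion of position and timestep embeddings are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DiscriminativeAdapter(nn.Module):
    """DETR-style adapter over a frozen T2I backbone (Eq. 2).

    It consumes a U-Net feature map F_t of shape (B, h, w, d), fuses learnable
    position embeddings and a timestep embedding additively, flattens the map,
    and decodes N learnable queries against the resulting sequence.
    """
    def __init__(self, d: int = 1280, n_queries: int = 32, hw: int = 64,
                 n_layers: int = 3, n_heads: int = 8):
        super().__init__()
        # hw must equal h * w of the probed U-Net block.
        self.pos_emb = nn.Parameter(torch.zeros(1, hw, d))
        self.time_mlp = nn.Sequential(nn.Linear(d, d), nn.SiLU(), nn.Linear(d, d))
        self.queries = nn.Parameter(torch.randn(n_queries, d))  # Q = {q_1, ..., q_N}
        enc_layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)

    def forward(self, feat_map: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, h, w, d) from UNet_l(z_t, CLIP(y), t); t_emb: (B, d)
        b, h, w, d = feat_map.shape
        seq = feat_map.reshape(b, h * w, d) + self.pos_emb + self.time_mlp(t_emb)[:, None, :]
        memory = self.encoder(seq)                       # Enc(F~)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        return self.decoder(q, memory)                   # Q* = Dec(Enc(F~), Q), shape (B, N, d)
```

The first M decoded queries would then be projected for global matching and the remaining N − M for local grounding, as described in the following subsections.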
[Figure 2: overview of DPT. Positive, negative, and partially-descriptive texts (e.g., "Three children and their mother sitting behind are flying kites", "Two girls sitting on the grass", "The boy sitting in the middle") and positive/negative images are encoded; the feature map F_t = UNet_l(z_t, CLIP(y), t) from the frozen U-Net is fused with position and time embeddings and fed, together with discriminative queries Q, into a Transformer encoder and decoder for global matching and local grounding; LoRA layers over the U-Net are tuned in stage 2; during inference, self-correction uses s(z_t, y) to guide the sampling steps z_t → z_{t−1} → ... → z_0.]
Figure 2. Schematic illustration of the proposed discriminative probing and tuning (DPT) framework. We first extract semantic representations from the frozen SD and then propose a discriminative adapter to conduct discriminative probing, investigating the global matching and local grounding abilities of SD. Afterward, we perform parameter-efficient discriminative tuning by introducing LoRA parameters. During inference, we present the self-correction mechanism to guide the denoising-based text-to-image generation.
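As a complement to Fig. 2 and Eq. (1), the sketch below shows one way to obtain the feature map F_t from a frozen Stable Diffusion U-Net using a standard PyTorch forward hook. The block attribute name (`mid_block`) and the U-Net call signature follow a diffusers-style implementation and are assumptions; the authors' code may differ.

```python
import torch

def extract_feature_map(unet, z_t, t, text_emb, block_name: str = "mid_block"):
    """Run one U-Net forward pass and capture the output of the probed block (Eq. 1).

    unet:     a frozen, diffusers-style UNet2DConditionModel (assumed interface)
    z_t:      noisy latents (B, C, H, W); t: timesteps (B,)
    text_emb: CLIP text-encoder hidden states used as cross-attention conditioning
    """
    captured = {}

    def hook(_module, _inputs, output):
        # Some blocks return tuples; keep only the hidden states.
        captured["feat"] = output[0] if isinstance(output, tuple) else output

    handle = getattr(unet, block_name).register_forward_hook(hook)
    with torch.no_grad():
        unet(z_t, t, encoder_hidden_states=text_emb)
    handle.remove()

    feat = captured["feat"]                        # (B, d, h, w)
    return feat.permute(0, 2, 3, 1).contiguous()   # (B, h, w, d) for the adapter
```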

Thanks to the multiple queries in Q, the query representations Q^* capture multiple aspects of the semantic representation F̃. Thereafter, Q^* can be used for various downstream tasks, possibly with a classifier or regressor.

In the following, we introduce two probing tasks, i.e., ITM and REC, and train the discriminative adapter on them to investigate the global matching and local grounding abilities of T2I models, respectively.
• Global Matching. From the view of discriminative modeling, a model with strong text-image alignment should be able to identify subtle alignment differences between various images and a text prompt. In light of this, we utilize the task of Image-Text Matching [18] to probe the discriminative global matching ability. This task is defined as bidirectional matching or retrieval, including text-to-image (T → I) and image-to-text (I → T).

To achieve this, we first collect the first M (M < N) query representations {q^*_1, ..., q^*_M} from Q^*, and then project each of them into a matching space with the same dimension as CLIP, obtaining h_i = g(q^*_i; W_m). Intuitively, different query representations may capture different aspects of the same image. Inspired by this, we calculate the cross-modal semantic similarity between x and y by comparing the CLIP textual embedding of y with the most matched projected query representation via s(y, z) = max_{i ∈ {1,...,M}} cos(CLIP(y), h_i). Based on pairwise similarities, we optimize the discriminative adapter f(·; W_a, Q) and the projection layer g(·; W_m) using the contrastive learning loss L_match = L_{T→I} + L_{I→T}. The first term optimizes the model to distinguish the correct image matched with a given text from all samples in a batch, i.e.,

\mathcal{L}_{T \to I} = -\log \frac{\exp(s(z, y)/\tau)}{\sum_{j=1}^{B} \exp(s(z_j, y)/\tau)},   (3)

where B denotes the mini-batch size and τ is a learnable temperature factor. Similarly, the opposite direction, from image to text, is computed by

\mathcal{L}_{I \to T} = -\log \frac{\exp(s(z, y)/\tau)}{\sum_{j=1}^{B} \exp(s(z, y_j)/\tau)}.   (4)

With L_match as the optimization objective, the discriminative adapter and the projection layers are enforced to discover discriminative information from the semantic representations for matching, implying the global matching ability of a T2I model.
• Local Grounding. Local grounding requires a model to recognize the referred object among others in an image given a partially descriptive text. We adapt SD to the REC [68] task to evaluate its discriminative local grounding ability.

Formally, given a textual expression y′ referring to a specific object with index i in an image x, REC aims to predict the coordinates and the size, i.e., the bounding box b_i, of the ground-truth object. To achieve this, we share the same discriminative adapter and employ the other (N − M) learnable queries as object prior queries, obtaining the corresponding query representations from the Transformer decoder as {q^*_j}_{j ∈ {M+1,...,N}}. We then project each q^*_j into three spaces separately by three different projection layers g(·): 1) the grounding space, to get the probability of predicting the correct object, i.e., p_j = g(q^*_j; W_p) ∈ R^1; 2) the box space, to estimate the bounding box parameters, i.e., b̂_j = g(q^*_j; W_b) ∈ R^4; and 3) the semantic space, to bridge the semantic gap between queries and the text, i.e., o_j = g(q^*_j; W_s) ∈ R^d.

After projection, we perform maximum matching to discover the most matched query with index ψ(i). The matching cost combines the grounding probability with the L1 and GIoU [49] losses between the prediction and the ground-truth box:

\psi(i) = \arg\min_{j \in \{M+1, \ldots, N\}} \; -p_j + \mathrm{L1}(\hat{b}_j, b_i) + \mathrm{GIoU}(\hat{b}_j, b_i).   (5)

Besides, we adopt a text-to-object contrastive loss to further drive the model to distinguish the positive object from the others at the semantic level:

\mathcal{L}_{T \to O} = -\log \frac{\exp(\cos(o_{\psi(i)}, \mathrm{CLIP}(y'))/\tau)}{\sum_{j=1}^{K_x} \exp(\cos(o_j, \mathrm{CLIP}(y'))/\tau)}.   (6)

We combine all the losses and obtain the grounding loss as

\mathcal{L}_{\mathrm{ground}} = -\lambda_0 p_{\psi(i)} + \lambda_1 \mathrm{L1}(\hat{b}_{\psi(i)}, b_i) + \lambda_2 \mathrm{GIoU}(\hat{b}_{\psi(i)}, b_i) + \lambda_3 \mathcal{L}_{T \to O},   (7)

where {λ_k}_{k ∈ {0,1,2,3}} serve as trade-off factors.

Finally, we optimize the parameters of the whole model, including Q and {W_i}, i ∈ {a, p, b, s}, with the following loss function over the two tasks:

\mathcal{L} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left(\mathcal{L}^t_{\mathrm{match}} + \mathcal{L}^t_{\mathrm{ground}}\right).   (8)

The probing process includes training and inference on the two discriminative tasks. During training, we freeze all parameters of SD and adopt its semantic representations for matching and grounding by optimizing the discriminative adapter and several projection layers. During inference, we obtain the testing performance on the two discriminative tasks, which reflects the discriminative abilities of SD.
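A sketch of the query-object matching of Eq. (5) and the box terms of the grounding loss in Eq. (7) is given below; the text-to-object term L_{T→O} of Eq. (6) can reuse the contrastive form shown for global matching. Here GIoU(·,·) is treated as the GIoU loss 1 − GIoU, boxes are assumed to be in (x1, y1, x2, y2) format, and the λ values are illustrative rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def grounding_loss(p, boxes_pred, gt_box, lambdas=(1.0, 5.0, 2.0)):
    """Select the best object query (Eq. 5) and compute the box terms of Eq. 7.

    p:          (K,) grounding probabilities for the K = N - M object queries
    boxes_pred: (K, 4) predicted boxes, assumed (x1, y1, x2, y2)
    gt_box:     (4,)  ground-truth box of the referred object
    """
    l1 = F.l1_loss(boxes_pred, gt_box.expand_as(boxes_pred), reduction="none").sum(-1)  # (K,)
    giou_loss = 1.0 - generalized_box_iou(boxes_pred, gt_box[None]).squeeze(-1)          # (K,)

    with torch.no_grad():                 # matching is a hard assignment, as in DETR
        cost = -p + l1 + giou_loss        # matching cost of Eq. 5
        k = int(cost.argmin())            # psi(i): index of the most matched query

    lam0, lam1, lam2 = lambdas
    return -lam0 * p[k] + lam1 * l1[k] + lam2 * giou_loss[k]
```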
3.2. Stage 2 – Discriminative Tuning

In the second stage, we propose to improve the generative abilities, especially text-image alignment, by optimizing T2I models in a discriminative tuning manner. Most prior work [2, 63] only views SD as a fixed feature extractor for segmentation tasks due to its fine-grained semantic representation power, but overlooks the potential back-feeding of discrimination to generation. Besides, although recent studies [28, 62] fine-tune the SD model using discriminative objectives, they only pay attention to specific downstream tasks (e.g., ITM) and ignore the effect of tuning on generation; the advancement of discrimination may sacrifice the original generative power. In this stage, we mainly focus on enhancing generation, but also investigate the upper limit of discrimination under the premise of prioritizing generation. This may shed new light on giving full play to the versatility of visual generative foundation models. In this vein, we strive to answer "How can we enhance text-image alignment for T2I models by discriminative tuning?"

In the previous stage, we freeze SD and probe how informative its intermediate activations are for global matching and local grounding. Here, we conduct parameter-efficient fine-tuning using LoRA [24], injecting trainable layers over the cross-attention layers and freezing the parameters of the pre-trained SD. We use the same discriminative objective functions as in stage 1 to tune the LoRA parameters, the discriminative adapter, and the task-specific projection layers. Due to the participation of LoRA, we can flexibly manipulate the intermediate activations of T2I models.
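Below is a minimal sketch of this parameter-efficient setup: wrapping the cross-attention projections of the U-Net with LoRA while keeping the pre-trained weights frozen. The module names (`to_q`, `to_k`, `to_v`) follow a diffusers-style attention implementation and the default rank follows the ablation in Appendix A.5; both are assumptions about the exact setup.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # keep pre-trained SD weights frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # start as an identity update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def inject_lora(unet: nn.Module, r: int = 4) -> None:
    """Replace the q/k/v projections of every attention module with LoRA-wrapped
    linears; only the LoRA factors remain trainable."""
    for module in list(unet.modules()):
        for name in ("to_q", "to_k", "to_v"):
            child = getattr(module, name, None)
            if isinstance(child, nn.Linear):
                setattr(module, name, LoRALinear(child, r=r))
```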
3.3. Self-Correction

Equipping the T2I model with the discriminative adapter enables the whole model to execute discriminative tasks. As a bonus of using the discriminative adapter, we propose a self-correction mechanism to guide high-alignment generation during inference. Formally, we update the latent z_t to enhance the semantic similarity between z_t and the prompt y through gradients:

\hat{z}_t = z_t + \eta \frac{\partial s(z_t, y)}{\partial z_t},   (9)

where the guidance factor η controls the guidance strength and ∂s(z_t, y)/∂z_t represents the gradient from the discriminative adapter with respect to the latent z_t. Afterward, we predict the noise by feeding ẑ_t into the U-Net and then obtain z_{t−1} for generation.
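A sketch of one self-correction update (Eq. 9) is shown below; `score_fn` stands for the adapter-based similarity s(z_t, y) and is a placeholder, and the default η only indicates the range reported in Fig. 4b.

```python
import torch

def self_correct(z_t: torch.Tensor, score_fn, eta: float = 0.1) -> torch.Tensor:
    """One self-correction update (Eq. 9): nudge the latent along the gradient of the
    adapter's text-image similarity s(z_t, y) before the U-Net noise prediction.

    score_fn: callable returning per-sample similarities s(z_t, y) for a batch of latents.
    """
    z = z_t.detach().requires_grad_(True)
    s = score_fn(z).sum()                  # each sample's score depends only on its own latent
    grad = torch.autograd.grad(s, z)[0]    # ds(z_t, y)/dz_t from the discriminative adapter
    return (z + eta * grad).detach()       # z_hat_t, then fed to the U-Net

# In a sampling loop, z_hat_t replaces z_t at each (or selected) denoising step(s).
```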
4. Experiments

We conduct extensive experiments to evaluate the generative and discriminative performance of DPT, justify its effectiveness, and conduct an in-depth analysis.

4.1. Experimental Settings

• Benchmarks. During training, we adopt the training set of MSCOCO [35] for ITM and three commonly used datasets [68], i.e., RefCOCO, RefCOCO+, and RefCOCOg, for REC. To evaluate text-image alignment, we utilize five benchmarks: COCO-NSS1K [44], CC-500 [16], ABC-6K [16], TIFA [25], and T2I-CompBench [14]. According to the distribution differences of textual prompts between the training set and the test sets, we adopt three settings: In-Distribution (ID) and Out-of-Distribution (OOD) [57] on COCO-NSS1K and CC-500, respectively, and Mixed Distribution (MD) on ABC-6K, TIFA, and T2I-CompBench. More details can be found in Appendix B.1.
• Evaluation Metrics. Following the existing baselines [4, 16, 44], we adopt the CLIP score [21] and BLIP scores3 [33], including BLIP-ITM and BLIP-ITC, and the GLIP score [16] based on object detection to evaluate text-image alignment, as well as IS [53] and FID [22] as quality evaluation metrics. For TIFA and T2I-CompBench, we follow the recommended VQA accuracy or the specifically curated protocols.

3 OpenCLIP (ViT-H-14) [6] and BLIP-2 (pretrain) are used to compute text-image similarities as CLIP and BLIP scores, respectively. We will adopt Image-Text Matching (ITM) and Image-Text Contrastive (ITC) as BLIP scores in the following.
Table 1. Performance comparison for text-to-image generation on COCO-NSS1K, CC-500, and ABC-6K. ID, OOD, and MD refer to the in-distribution, out-of-distribution, and mixed-distribution settings, respectively. According to the version of Stable Diffusion, methods are split into two groups, top and bottom for v1.4 and v2.1, respectively. SC denotes self-correction.

Method | COCO-NSS1K (ID): CLIP, BLIP-M, BLIP-C, IS, FID | CC-500 (OOD): CLIP, BLIP-M, BLIP-C, GLIP, IS | ABC-6K (MD): CLIP, BLIP-M, BLIP-C, IS
Stable Diffusion-v1.4 [CVPR22] [50] | 33.27, 67.96, 39.48, 31.32, 54.77 | 34.82, 70.95, 40.36, 31.17, 14.28 | 35.33, 72.03, 40.82, 34.47
LayoutLLM-T2I [ACMMM23] [44] | 32.42, 67.42, 39.46, 25.57, 59.26 | - | -
StructureDiffusion [ICLR23] [16] | - | 33.71, 66.71, 39.54, 31.39, 14.14 | 34.95, 69.55, 40.69, 34.97
HN-DiffusionITM [NeurIPS23] [28] | 33.26, 70.06, 40.14, 31.53, 53.26 | 34.15, 68.77, 40.30, 31.54, 13.99 | 35.02, 72.28, 41.12, 34.83
DPT (Ours) | 33.85, 71.84, 40.11, 31.65, 54.96 | 35.97, 76.74, 41.15, 37.07, 13.56 | 35.88, 75.88, 41.26, 34.46
Stable Diffusion-v2.1 [CVPR22] [50] | 34.96, 73.32, 40.22, 30.40, 55.35 | 39.24, 85.45, 43.36, 52.09, 11.53 | 37.53, 81.98, 41.77, 33.31
Attend-and-Excite [TOG23] [4] | 34.95, 74.68, 40.32, 30.27, 55.16 | 39.43, 90.03, 44.08, 53.29, 11.82 | 37.59, 82.64, 41.83, 32.94
HN-DiffusionITM [NeurIPS23] [28] | 35.14, 75.64, 40.77, 30.34, 52.73 | 38.81, 85.76, 43.22, 48.95, 12.11 | 37.58, 82.33, 42.07, 34.14
DPT (Ours) | 35.83, 78.58, 41.14, 30.83, 55.55 | 40.23, 90.72, 44.55, 53.29, 11.59 | 38.39, 86.19, 42.36, 32.97
DPT + SC (Ours) | 35.75, 79.15, 41.14, 30.50, 54.89 | 40.25, 91.33, 44.69, 53.29, 11.89 | 38.41, 85.63, 42.34, 33.56

Table 2. Performance comparison for text-to-image generation on TIFA [25] and T2I-CompBench [14]. According to the version of Stable Diffusion, methods are split into two groups, top and bottom for v1.4 and v2.1, respectively. SC denotes self-correction.

Method | TIFA | T2I-CompBench: Color, Shape, Text., Sp., Non-Sp., Comp.
SD-v1.4 [50] | 79.15 | 36.82, 35.94, 42.16, 10.64, 30.45, 28.18
HN-DiffusionITM [28] | 79.02 | 36.71, 35.48, 39.84, 11.22, 30.91, 28.05
VPGen [13] | 77.33 | 32.12, 32.36, 35.85, 19.08, 30.07, 24.39
LayoutGPT [17] | 79.31 | 33.86, 36.35, 44.07, 35.06, 30.30, 26.36
DPT (Ours) | 82.04 | 48.84, 38.93, 50.10, 14.63, 30.83, 30.05
DPT + SC (Ours) | 82.40 | 51.51, 39.61, 49.38, 15.45, 30.84, 30.29
SD-v2.1 [50] | 81.35 | 48.21, 40.49, 46.83, 16.94, 30.63, 29.96
Attend-and-Excite [4] | 81.98 | 53.72, 43.41, 48.53, 16.30, 30.64, 30.38
HN-DiffusionITM [28] | 82.02 | 46.45, 40.09, 49.35, 15.01, 30.99, 30.35
DPT (Ours) | 84.49 | 60.59, 48.18, 58.24, 20.78, 30.95, 32.44
DPT + SC (Ours) | 84.63 | 62.59, 48.44, 57.60, 21.04, 30.76, 32.52

4.2. Performance Comparison

• Text-to-Image Generation. As shown in Tab. 1 and Tab. 2, we have the following observations and discussions. 1) Compared with the base foundation models, i.e., SD [50], the proposed DPT improves text-image alignment remarkably, which illustrates that enhancing discriminative abilities can benefit the generative semantic alignment of T2I models. 2) DPT achieves superior performance on CC-500 and ABC-6K under the OOD setting, showing its powerful generalization to other prompt distributions. It also reveals its capability to resist the risk of overfitting when tuning T2I models with discriminative tasks. 3) The consistent improvement on both SD-v1.4 and SD-v2.1 demonstrates that the proposed DPT may be parallel with the generative pre-training based on score matching, reflecting the possibility of activating the intrinsic reasoning abilities of T2I models using DPT. 4) Overall, the proposed method consistently achieves the best generation performance on text-image alignment across comprehensive benchmarks, distribution settings, and evaluation protocols. Besides, the improvement in alignment does not come at a loss of image quality per IS and FID. These results confirm the effectiveness of the proposed DPT paradigm.
• Discriminative Matching and Grounding. In Sec. 3.1, we incorporate a discriminative adapter on top of T2I models and probe and improve their understanding abilities based on ITM and REC. Empirically, we carry out experiments by training the adapter in the first stage and introducing LoRA for tuning in the second stage using ITM and REC data, and then evaluate the matching and grounding performance. We show experimental results of baselines, including discriminative and generative models under the zero-shot and fine-tuning settings, in Tab. 3. See Appendix B for more details on the implementation and settings. From this table, we observe that our method outperforms the existing state-of-the-art generative methods, such as Diffusion Classifier [31] and DiffusionITM [28], by large margins on the ITM and REC tasks. It even achieves competitive performance in the first probing stage, or when model selection prioritizes generation in the second stage. These results show that the generative representations extracted from the intermediate layers of the U-Net convey meaningful semantics, verifying that T2I models have basic discriminative matching and grounding abilities. Besides, they also indicate that such abilities can be further improved by the discriminative tuning introduced in Sec. 3.2.
Table 3. Performance comparison for image-text matching and referring expression comprehension, evaluating global matching and local grounding abilities, respectively. Datasets include MSCOCO-HN for ITM, and RefCOCO, RefCOCO+, and RefCOCOg for REC. All methods are grouped into three parts: the upper, middle, and lower groups correspond to zero-shot discriminative, zero-shot generative, and fine-tuned generative methods, respectively. All generative models are based on Stable Diffusion-v2.1.

Method | MSCOCO-HN: I-to-T, T-to-I, Overall | RefCOCO: val, testA, testB | RefCOCO+: val, testA, testB | RefCOCOg: val, test
Random Chance | 25.00, 25.00, 25.00 | 16.53, 13.51, 19.20 | 16.29, 13.57, 19.60 | 18.12, 19.10
CLIP (ViT-B-32) [ICML21] [45] | 47.63, 42.82, 45.23 | 44.79, 46.12, 42.61 | 49.60, 51.07, 46.04 | 58.31, 58.42
OpenCLIP (ViT-B-32) [CVPR23] [6] | 49.07, 47.45, 48.26 | 43.22, 43.15, 44.65 | 48.21, 48.60, 50.64 | 60.32, 60.84
Diffusion Classifier § [ICCV23] [31] | 34.59, 24.12, 29.36 | 6.23, 2.14, 12.11 | 6.07, 2.11, 12.29 | 8.68, 8.45
DiffusionITM § [NeurIPS23] [28] | 34.59, 29.83, 32.21 | 28.88, 30.16, 29.01 | 29.97, 31.17, 30.25 | 38.07, 38.91
Local Denoising | -, -, - | 23.83, 21.20, 24.85 | 24.07, 21.31, 25.45 | 28.66, 28.59
Diffusion Classifier †§ [ICCV23] [31] | 37.72, 24.03, 30.88 | 6.11, 2.10, 10.91 | 6.04, 2.13, 11.48 | 8.05, 7.54
DiffusionITM †§ [NeurIPS23] [28] | 37.72, 29.88, 33.80 | 34.09, 32.70, 35.29 | 35.86, 35.42, 38.23 | 49.67, 49.05
HN-DiffusionITM †§ [NeurIPS23] [28] | 37.55, 30.37, 33.96 | 31.43, 28.50, 35.47 | 33.47, 30.16, 37.47 | 47.98, 48.20
Local Denoising † | -, -, - | 23.70, 21.55, 24.81 | 24.01, 21.52, 25.32 | 28.53, 28.77
DPT (Stage 1, Ours) | 42.29, 34.75, 38.52 | 48.79, 53.28, 43.06 | 42.56, 47.69, 36.14 | 46.56, 45.75
DPT (Ours) | 42.07, 34.97, 38.52 | 52.73, 57.84, 46.73 | 45.34, 50.12, 38.41 | 48.61, 47.45
DPT* (Ours) | 43.12, 35.25, 39.18 | 63.45, 66.70, 57.90 | 51.56, 56.81, 42.73 | 54.96, 54.80
†: fine-tuned with the denoising objective; §: cropping an image into blocks and then matching them with the referring text for REC; *: model selection with a priority on the discriminative task, i.e., ITM or REC.

4.3. In-depth Analysis

To verify the effectiveness of each component in DPT, including discriminative tuning on Global Matching (GM) and Local Grounding (LG) in the second stage and the Self-Correction (SC) during inference, we conduct several analytic experiments on COCO-NSS1K and CC-500 under the ID and OOD settings. The results are summarized in Tab. 4.

Table 4. Ablation study on the influence of the two discriminative tuning objectives, Global Matching (GM) and Local Grounding (LG), in the second stage, and of the Self-Correction (SC) during inference, on alignment-oriented text-to-image generation. The COCO-NSS1K and CC-500 datasets are used to evaluate in-distribution (ID) and out-of-distribution (OOD) generation. All experiments are based on Stable Diffusion-v2.1.

Index | GM | LG | SC | COCO-NSS1K (ID): CLIP, BLIP-M, BLIP-C | CC-500 (OOD): CLIP, BLIP-M, BLIP-C, GLIP
0 |   |   |   | 34.96, 73.32, 40.22 | 39.24, 85.45, 43.36, 52.09
1 | ✓ |   |   | 35.14, 74.83, 40.45 | 39.28, 86.23, 43.36, 49.55
2 |   | ✓ |   | 35.94, 79.19, 41.11 | 40.31, 90.63, 44.31, 57.03
3 | ✓ | ✓ |   | 35.83, 78.58, 41.14 | 40.23, 90.72, 44.55, 53.29
4 | ✓ | ✓ | ✓ | 35.75, 79.15, 41.14 | 40.25, 91.33, 44.69, 53.29

• Effectiveness of Discriminative Tuning. Comparing the different variants in Tab. 4, we observe that the two tuning objectives, i.e., GM and LG, consistently promote the alignment performance of T2I according to the CLIP and BLIP scores. This verifies the validity of discriminative tuning on the ITM and REC tasks. Compared with GM, LG achieves a more remarkable improvement on the semantic and object-detection metrics. This may be attributed to the enhanced grounding ability brought by predicting local concepts from partial descriptions. Furthermore, combining the two objectives for multi-task learning may contribute to a slight improvement in BLIP scores under the OOD setting, but other metrics are slightly compromised. This phenomenon indicates that some contradictions may exist during model optimization, reflecting that unifying multiple tasks is still challenging.
• Effectiveness of Self-Correction. In Sec. 3.3, we propose to recycle the discriminative adapter in the inference phase by guiding iterative denoising. Comparing the 3rd and 4th variants in Tab. 4, we can see that the self-correction scheme consistently improves the alignment for T2I, attesting to its effectiveness.

[Figure 3: curves of CLIP, BLIP-M, IS, GLIP, matching, and grounding scores versus the probed U-Net block (bottom1, bottom2, bottom3, medium, up1, up2, up3).]
Figure 3. Generative and discriminative results by probing different layers of the U-Net in SD-v2.1 and adapting to ITM and REC. We report average CLIP and BLIP-M scores over COCO-NSS1K and CC-500, overall matching performance on MSCOCO-HN, and average grounding performance over all test sets of RefCOCO, RefCOCO+, and RefCOCOg. We conduct model selection based on T2I performance on the validation set of COCO-NSS1K.

• Impact of Probed U-Net Block. Due to the hierarchical structure of the U-Net in SD, we can extract multi-level feature maps from its different blocks. Prior work [62] has shown that different blocks may have different discriminative power in image classification. To further investigate the matching and grounding abilities empowered by the various blocks and the trade-off between discrimination and generation, we probe seven consecutive blocks of the U-Net, shown in Fig. 2 from left to right, and then tune the whole model based on the probed block. The generative and discriminative results are shown in Fig. 3. It can be observed that the T2I performance improves continuously as the probed block shifts from bottom to up. The reason may be
that more LoRA parameters are introduced and more layers are tuned during back-propagation. In contrast, the discriminative performance, for both matching and grounding, first increases and then deteriorates. This may be attributed to two points: 1) the feature maps from the blocks close to the final outputs (e.g., up2 and up3), i.e., the predicted noises, are less semantic; 2) the feature sequences flattened from these feature maps may be too long, making it difficult for the discriminative adapter to probe.

[Figure 4: (a) CLIP, BLIP-M, matching, and grounding scores versus the tuning step (2k to 10k); (b) CLIP and BLIP-M scores versus the guidance factor η (0 to 2), compared with SD-v2.1.]
Figure 4. Impact of (a) the tuning step on generation and discrimination performance and (b) the self-correction strength on the T2I performance on CC-500.

• Impact of Tuning Step. To further examine the sustained impact of discriminative tuning on the two aspects of performance, we show the dynamics of the performance as the tuning step increases in the second stage in Fig. 4a. We can see that the generative performance improves with tuning and seems to reach a saturation point at the 8k step. In contrast, there is still potential for the grounding performance to increase, while the matching performance remains stable during the tuning stage.
• Impact of Self-Correction Factor. As shown in Fig. 4b, we study the influence of the guidance factor η in Eqn. (9) on the alignment performance of T2I. The results demonstrate that the proposed self-correction mechanism can alleviate the text-image misalignment issue within a proper range of the guidance factor, i.e., (0.05, 1).

4.4. Qualitative Results

To intuitively show the alignment improvement achieved by DPT and SC, we illustrate generated examples with prompts from COCO-NSS1K for object appearance, counting, relation, and compositional reasoning evaluation, as shown in Fig. 5. These cases demonstrate the effectiveness of incorporating discriminative probing and tuning into T2I models.

[Figure 5: images generated by SD-v2.1, AaE, HN-DiffITM, and DPT + SC (Ours) for prompts covering object appearance ("A closeup view of a pizza, with a fork near it."), counting ("The two yellow trains are coming down from the mountain."), spatial relation ("A girl standing in the grass with no shoes on with a frisbee in one hand and her other hand on her hip."), semantic relation ("A person guiding a child down a hill on skis.", "A couple of glasses next to a bottle."), and compositional reasoning ("A fire is going near four lounge chairs.", "Two white and blue vases and a white vase with a white and yellow flower.").]
Figure 5. Qualitative results on COCO-NSS1K. We compare DPT with SD-v2.1 and two baselines, Attend-and-Excite (AaE) [4] and HN-DiffusionITM (HN-DiffITM) [28], regarding object appearance, counting, spatial relation, semantic relation, and compositional reasoning. Categories and the corresponding keywords in the prompts are highlighted with different colors.

5. Conclusion and Future Work

In this work, we tackled the text-image misalignment issue of text-to-image generative models. Toward this end, we revisited the relations between generative and discriminative modeling and presented a two-stage method named DPT. It introduces a discriminative adapter for probing basic discriminative abilities in the first stage and performs discriminative fine-tuning in the second stage. DPT exhibited effectiveness and generalization across five T2I datasets and four ITM and REC datasets.

In the future, we plan to explore the effect of discriminative probing and tuning on more generative models using more perception and understanding tasks. Besides, it is interesting to discuss more complicated relations between discriminative and generative modeling, such as trade-offs and mutual promotion across different tasks.
References [13] Cho et al. Visual programming for text-to-image generation
and evaluation. In NeurIPS, 2023. 6, 16
[1] Eslam Mohamed Bakr, Pengzhan Sun, Xiaogian Shen, [14] Huang et al. T2i-compbench: A comprehensive benchmark
Faizan Farooq Khan, Li Erran Li, and Mohamed Elhoseiny. for open-world compositional text-to-image generation. In
Hrs-bench: Holistic, reliable and scalable benchmark for NeurIPS, 2023. 5, 6, 15
text-to-image models. In Proceedings of the IEEE/CVF In-
[15] Wan-Cyuan Fan, Yen-Chun Chen, DongDong Chen, Yu
ternational Conference on Computer Vision, pages 20041–
Cheng, Lu Yuan, and Yu-Chiang Frank Wang. Frido: Fea-
20053, 2023. 2
ture pyramid diffusion for complex scene image synthesis.
[2] Ryan Burgert, Kanchana Ranasinghe, Xiang Li, and In Proceedings of the AAAI Conference on Artificial Intelli-
Michael S Ryoo. Peekaboo: Text to image diffusion models gence, pages 579–587, 2023. 3
are zero-shot segmentors. arXiv preprint arXiv:2211.13224, [16] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun
2022. 3, 5 Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang,
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas and William Yang Wang. Training-free structured diffusion
Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- guidance for compositional text-to-image synthesis. arXiv
end object detection with transformers. In European confer- preprint arXiv:2212.05032, 2022. 1, 2, 5, 6, 15, 16, 18
ence on computer vision, pages 213–229. Springer, 2020. 3 [17] Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Ar-
[4] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and jun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and
Daniel Cohen-Or. Attend-and-excite: Attention-based se- William Yang Wang. Layoutgpt: Compositional visual plan-
mantic guidance for text-to-image diffusion models. ACM ning and generation with large language models. arXiv
Transactions on Graphics (TOG), 42(4):1–10, 2023. 1, 2, 5, preprint arXiv:2305.15393, 2023. 2, 3, 6, 16
6, 8, 16, 18 [18] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio,
[5] Huanran Chen, Yinpeng Dong, Zhengyi Wang, Xiao Yang, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. De-
Chengqi Duan, Hang Su, and Jun Zhu. Robust clas- vise: A deep visual-semantic embedding model. Advances
sification via a single diffusion model. arXiv preprint in neural information processing systems, 26, 2013. 2, 4
arXiv:2305.15241, 2023. 3 [19] Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vi-
[6] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell neet, Eric Horvitz, Ece Kamar, Chitta Baral, and Yezhou
Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- Yang. Benchmarking spatial relationships in text-to-image
mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- generation. arXiv preprint arXiv:2212.10015, 2022. 2
ing laws for contrastive language-image learning. In Pro- [20] Kamal Gupta, Justin Lazarow, Alessandro Achille, Larry S
ceedings of the IEEE/CVF Conference on Computer Vision Davis, Vijay Mahadevan, and Abhinav Shrivastava. Layout-
and Pattern Recognition, pages 2818–2829, 2023. 5, 7 transformer: Layout generation and completion with self-
[7] Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: attention. In Proceedings of the IEEE/CVF International
Probing the reasoning skills and social biases of text-to- Conference on Computer Vision, pages 1004–1014, 2021. 3
image generation models. In Proceedings of the IEEE/CVF [21] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras,
International Conference on Computer Vision, pages 3043– and Yejin Choi. Clipscore: A reference-free evaluation met-
3054, 2023. 15 ric for image captioning. arXiv preprint arXiv:2104.08718,
[8] Kevin Clark and Priyank Jaini. Text-to-image diffu- 2021. 5
sion models are zero-shot classifiers. arXiv preprint [22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
arXiv:2303.15233, 2023. 3 Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
[9] Colin Conwell and Tomer Ullman. Testing relational un- two time-scale update rule converge to a local nash equilib-
derstanding in text-guided image generation. arXiv preprint rium. Advances in neural information processing systems,
arXiv:2208.00005, 2022. 2 30, 2017. 6
[10] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, [23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif-
Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, fusion probabilistic models. Advances in neural information
Hongxia Yang, et al. Cogview: Mastering text-to-image gen- processing systems, 33:6840–6851, 2020. 1, 2, 3
eration via transformers. Advances in Neural Information [24] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-
Processing Systems, 34:19822–19835, 2021. 2 Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
[11] Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Lora: Low-rank adaptation of large language models. arXiv
Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, preprint arXiv:2106.09685, 2021. 2, 5
Haoran Wei, et al. Dreamllm: Synergistic multimodal com- [25] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Os-
prehension and creation. arXiv preprint arXiv:2309.11499, tendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate
2023. 3 and interpretable text-to-image faithfulness evaluation with
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, question answering. In ICCV, 2023. 5, 6, 15
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, [26] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-
vain Gelly, et al. An image is worth 16x16 words: Trans- modulated detection for end-to-end multi-modal understand-
formers for image recognition at scale. arXiv preprint ing. In Proceedings of the IEEE/CVF International Confer-
arXiv:2010.11929, 2020. 3 ence on Computer Vision, pages 1780–1790, 2021. 14, 15
[27] Diederik P Kingma and Max Welling. Auto-encoding varia- [41] Liqiang Nie, Leigang Qu, Dai Meng, Min Zhang, Qi Tian,
tional bayes. arXiv preprint arXiv:1312.6114, 2013. 3 and Alberto Del Bimbo. Search-oriented micro-video cap-
[28] Benno Krojer, Elinor Poole-Dayan, Vikram Voleti, Christo- tioning. In Proceedings of the 30th ACM International Con-
pher Pal, and Siva Reddy. Are diffusion models vision-and- ference on Multimedia, pages 3234–3243, 2022. 14
language reasoners? In Thirty-seventh Conference on Neural [42] Leigang Qu, Meng Liu, Da Cao, Liqiang Nie, and Qi
Information Processing Systems, 2023. 3, 5, 6, 7, 8, 16, 17 Tian. Context-aware multi-view summarization network for
[29] Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion image-text matching. In Proceedings of the 28th ACM In-
models already have a semantic latent space. arXiv preprint ternational Conference on Multimedia, pages 1047–1055,
arXiv:2210.10960, 2022. 2, 3 2020. 15
[30] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, [43] Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, and Liqiang
Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Nie. Dynamic modality interaction modeling for image-text
Ghavamzadeh, and Shixiang Shane Gu. Aligning text- retrieval. In Proceedings of the 44th International ACM SI-
to-image models using human feedback. arXiv preprint GIR Conference on Research and Development in Informa-
arXiv:2302.12192, 2023. 1 tion Retrieval, pages 1104–1113, 2021. 2
[31] Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis [44] Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, and Tat-
Brown, and Deepak Pathak. Your diffusion model is secretly Seng Chua. Layoutllm-t2i: Eliciting layout guidance from
a zero-shot classifier. arXiv preprint arXiv:2303.16203, llm for text-to-image generation. In Proceedings of the 31st
2023. 3, 6, 7, 17 ACM International Conference on Multimedia, pages 643–
[32] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming 654, 2023. 1, 2, 3, 5, 6, 12, 15, 16
Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng [45] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Cao, et al. mplug: Effective and efficient vision-language Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
learning by cross-modal skip-connections. arXiv preprint Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
arXiv:2205.12005, 2022. 15 transferable visual models from natural language supervi-
[33] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. sion. In International conference on machine learning, pages
Blip-2: Bootstrapping language-image pre-training with 8748–8763. PMLR, 2021. 3, 7
frozen image encoders and large language models. arXiv
[46] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray,
preprint arXiv:2301.12597, 2023. 5
Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.
[34] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian-
Zero-shot text-to-image generation. In International Confer-
wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee.
ence on Machine Learning, pages 8821–8831. PMLR, 2021.
Gligen: Open-set grounded text-to-image generation. In Pro-
2
ceedings of the IEEE/CVF Conference on Computer Vision
[47] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu,
and Pattern Recognition, pages 22511–22521, 2023. 2
and Mark Chen. Hierarchical text-conditional image gen-
[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,
eration with clip latents. arXiv preprint arXiv:2204.06125,
Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
2022. 1, 2
Zitnick. Microsoft coco: Common objects in context. In
Computer Vision–ECCV 2014: 13th European Conference, [48] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Gener-
Zurich, Switzerland, September 6-12, 2014, Proceedings, ating diverse high-fidelity images with vq-vae-2. Advances
Part V 13, pages 740–755. Springer, 2014. 5, 13, 15 in neural information processing systems, 32, 2019. 3
[36] Xinyu Lin, Wenjie Wang, Yongqi Li, Fuli Feng, See-Kiong [49] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir
Ng, and Tat-Seng Chua. A multi-facet paradigm to bridge Sadeghian, Ian Reid, and Silvio Savarese. Generalized in-
large language model and recommendation. arXiv preprint tersection over union: A metric and a loss for bounding
arXiv:2310.06491, 2023. 3 box regression. In Proceedings of the IEEE/CVF conference
[37] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and on computer vision and pattern recognition, pages 658–666,
Joshua B Tenenbaum. Compositional visual generation with 2019. 5
composable diffusion models. In European Conference on [50] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Computer Vision, pages 423–439. Springer, 2022. 2 Patrick Esser, and Björn Ommer. High-resolution image
[38] Shilong Liu, Shijia Huang, Feng Li, Hao Zhang, Yaoyuan synthesis with latent diffusion models. In Proceedings of
Liang, Hang Su, Jun Zhu, and Lei Zhang. Dq-detr: the IEEE/CVF conference on computer vision and pattern
Dual query detection transformer for phrase extraction and recognition, pages 10684–10695, 2022. 1, 2, 3, 6, 16
grounding. In Proceedings of the AAAI Conference on Arti- [51] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-
ficial Intelligence, pages 1728–1736, 2023. 15 net: Convolutional networks for biomedical image segmen-
[39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay tation. In Medical Image Computing and Computer-Assisted
regularization. arXiv preprint arXiv:1711.05101, 2017. 16 Intervention–MICCAI 2015: 18th International Conference,
[40] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Munich, Germany, October 5-9, 2015, Proceedings, Part III
Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and 18, pages 234–241. Springer, 2015. 3
Mark Chen. Glide: Towards photorealistic image generation [52] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
and editing with text-guided diffusion models. arXiv preprint Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour,
arXiv:2112.10741, 2021. 2 Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans,
et al. Photorealistic text-to-image diffusion models with deep on computer vision and pattern recognition, pages 1316–
language understanding. Advances in Neural Information 1324, 2018. 2
Processing Systems, 35:36479–36494, 2022. 1, 2, 15 [65] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee.
[53] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Attribute2image: Conditional image generation from visual
[53] … Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016. 6
[54] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015. 1, 2
[55] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. 3
[56] Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, and Anna Rohrbach. Reclip: A strong zero-shot baseline for referring expression comprehension. arXiv preprint arXiv:2204.05991, 2022. 15, 17
[57] Teng Sun, Wenjie Wang, Liqiang Jing, Yiran Cui, Xuemeng Song, and Liqiang Nie. Counterfactual reasoning for out-of-distribution multimodal sentiment analysis. In Proceedings of the 30th ACM International Conference on Multimedia, pages 15–23, 2022. 5, 15
[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 3
[59] Haokun Wen, Xuemeng Song, Xin Yang, Yibing Zhan, and Liqiang Nie. Comprehensive linguistic-visual composition network for image retrieval. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1369–1378. ACM, 2021. 15
[60] Haokun Wen, Xian Zhang, Xuemeng Song, Yinwei Wei, and Liqiang Nie. Target-guided composed image retrieval. In Proceedings of the ACM International Conference on Multimedia, pages 915–923. ACM, 2023. 15
[61] Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, et al. The generative ai paradox: “What it can create, it may not understand”. arXiv preprint arXiv:2311.00059, 2023. 3
[62] Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang. Denoising diffusion autoencoders are unified self-supervised learners. arXiv preprint arXiv:2303.09769, 2023. 5, 7
[63] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023. 3, 5
[64] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference …
[65] … attributes. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 776–791. Springer, 2016. 2
[66] Xingyi Yang and Xinchao Wang. Diffusion model as representation learner. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18938–18949, 2023. 3
[67] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022. 2, 15
[68] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016. 2, 4, 5, 13
[69] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 5907–5915, 2017. 2
[70] Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, and Jiwen Lu. Unleashing text-to-image diffusion models for visual perception. arXiv preprint arXiv:2303.02153, 2023. 3
Discriminative Probing and Tuning for Text-to-Image Generation
Supplementary Material
A. More Results and Analysis

A.1. Ablation Study on DPT

To further explore the effectiveness of the proposed DPT paradigm, we compare it with the traditional denoising tuning method based on the MSE loss function. Concretely, we examine all four combinations of the two objectives and evaluate the ID and OOD generation performance on the basis of two versions of SD, i.e., SD-v1.4 and SD-v2.1. The experimental results are reported in Tab. 5. We have the following observations. 1) MSE could improve the alignment and quality under the ID setting, but it may not be helpful and can even be harmful to the alignment under the OOD setting. 2) DPT consistently enhances alignment performance across both ID and OOD settings, with minimal impact on the original image quality. And 3) by combining the MSE and DPT objectives, we may get a good trade-off between alignment and quality, and achieve better performance on the BLIP-ITM evaluation metric, which may focus on more details.

A.2. Ablation Study on Probed Blocks

As shown in Tab. 6, we carry out extensive experiments to study the effect of different probed blocks of the U-Net on generation performance. The results show that DPT achieves the best performance when probing the up3 block.

A.3. Comprehensive Evaluation on COCO-NSS1K

The COCO-NSS1K dataset [44] was constructed to evaluate five categories of abilities for T2I models, including counting, spatial relation reasoning, semantic relation reasoning, complicated relation reasoning, and abstract imagination. To delve into these categories, we compare the proposed methods, including DPT and DPT + SC, with SD-v1.4 and SD-v2.1, as shown in Fig. 6. Our method consistently improves the alignment performance in all categories compared with the state-of-the-art SD-v2.1. Besides, the self-correction module could further align the generated images with prompts, especially in the semantic relation category.

A.4. Impact of Discriminative Tuning Steps

As shown in Fig. 7, Fig. 8, and Fig. 9, we study the generation and discrimination performance based on SD-v2.1 and SD-v1.4, and the comparison between the two versions, respectively. We discuss the results from the discrimination and generation aspects as follows.
• Discriminative Tuning for Discrimination. On the one hand, the grounding performance is continuously improved as the tuning step increases, while the matching performance seems to get better at the beginning and then fluctuates within a very small range. Compared with the performance at the end of the discriminative probing phase, i.e., at step 0, discriminative tuning with the extra parameters introduced by LoRA could further improve both discriminative abilities. On the other hand, the ITM performance of DPT-v2.1 is obviously better than that of DPT-v1.4. On the contrary, DPT-v1.4 seems to be slightly stronger than DPT-v2.1 in terms of REC, as shown in Fig. 9.
• Discriminative Tuning for Generation. From Fig. 7 and Fig. 8, we can see that the discrimination and generation performance improve together at the beginning of tuning, e.g., within [0, 700] steps for DPT-v2.1 and [0, 800] steps for DPT-v1.4, which demonstrates the effectiveness of discriminative tuning for enhancing various abilities. Afterward, the generation performance declines, perhaps due to over-fitting or some potential discrepancy between generation and discrimination.

A.5. Impact of Rank Numbers in LoRA

The rank number in LoRA determines the number of extra parameters introduced in the discriminative tuning stage compared with the first stage. To explore the influence of the rank number on the generation performance, we compare different ranks, from 0 to 128, as shown in Tab. 7. The results reflect that DPT achieves the best performance with a rank of 4 on most evaluation metrics. Larger ranks do not bring further improvement, which may be attributable to the scale of the tuning data.

A.6. Impact of Layers of Discriminative Adapter

The total number of parameters of DPT also depends on the number of transformer layers in the discriminative adapter. We conduct experiments using different numbers of layers. Note that we keep the same number of layers in the encoders and decoders for each experiment. The results are reported in Tab. 8. In general, the best generation performance is achieved when using 4 layers. Besides, the alignment performance is always better than SD-v2.1 (i.e., the experiment with 0 layers), which further verifies the effectiveness of DPT.

A.7. Impact of Denoising Objective

In the raw SD model, only the denoising objective in the MSE form is used to model the data distribution for image synthesis. To further study the interplay of the DPT and MSE objectives, we perform more experiments by combining them and taking different values of the coefficient of MSE. As shown in Tab. 9, we observe that the simultaneous use of
Table 5. Ablation study for the influence of two objectives of discriminative probing and tuning (DPT) and denoising (MSE) on text-
to-image generation. The COCO-NSS1K and CC-500 datasets are used to evaluate in-distribution (ID) and out-of-distribution (OOD)
generation. Alignment-oriented evaluation metrics include CLIP score ↑, BLIP-ITM score ↑, and GLIP score ↑, while quality-oriented
evaluation metrics include IS ↑ and FID ↓. The best results are highlighted in bold.
Version | MSE | DPT | COCO-NSS1K (ID): CLIP / BLIP-ITM / IS / FID | CC-500 (OOD): CLIP / BLIP-ITM / GLIP / IS
SD-v1.4 | – | – | 33.27 / 67.96 / 31.32 / 54.77 | 34.82 / 70.95 / 31.17 / 14.28
SD-v1.4 | ✓ | – | 32.95 / 70.48 / 32.03 / 52.63 | 34.00 / 71.27 / 34.08 / 15.02
SD-v1.4 | – | ✓ | 33.85 / 71.84 / 31.65 / 54.96 | 35.97 / 76.74 / 37.07 / 13.56
SD-v1.4 | ✓ | ✓ | 33.83 / 73.28 / 30.59 / 55.02 | 35.83 / 77.20 / 37.89 / 13.67
SD-v2.1 | – | – | 34.96 / 73.32 / 30.40 / 55.35 | 39.24 / 85.45 / 52.09 / 11.53
SD-v2.1 | ✓ | – | 34.20 / 75.90 / 30.10 / 51.85 | 38.49 / 84.65 / 49.48 / 13.38
SD-v2.1 | – | ✓ | 35.83 / 78.58 / 30.83 / 55.55 | 40.23 / 90.72 / 53.29 / 11.59
SD-v2.1 | ✓ | ✓ | 35.64 / 79.23 / 30.18 / 53.36 | 39.87 / 90.41 / 52.09 / 12.09
Table 6. Generation performance with different probed blocks of U-Net and the sizes of feature maps (Feat. Size). All the experiments
are based on SD-v2.1. We combine multiple feature maps using additive fusion, and perform interpolation if the feature sizes are different.
Alignment-oriented evaluation metrics include CLIP score ↑, BLIP-ITM (BLIP-M) score ↑, BLIP-ITC (BLIP-C) score ↑, and GLIP score ↑,
while quality-oriented evaluation metrics include IS ↑ and FID ↓. The best results are highlighted in bold.
Block | Feat. Size | COCO-NSS1K (ID): CLIP / BLIP-M / BLIP-C / IS / FID | CC-500 (OOD): CLIP / BLIP-M / BLIP-C / GLIP / IS
– | – | 34.96 / 73.32 / 40.22 / 30.40 / 55.35 | 39.24 / 85.45 / 43.36 / 52.09 / 11.53
bottom1 | 32×32 | 34.76 / 73.15 / 40.12 / 29.45 / 55.51 | 38.81 / 84.68 / 43.05 / 51.57 / 11.80
bottom2 | 16×16 | 35.03 / 73.63 / 40.24 / 30.38 / 55.38 | 39.15 / 87.08 / 43.52 / 49.18 / 12.04
bottom3 | 8×8 | 35.69 / 77.57 / 40.85 / 30.06 / 55.52 | 40.19 / 90.33 / 44.34 / 54.33 / 11.17
middle | 8×8 | 35.90 / 79.25 / 41.11 / 30.53 / 55.12 | 40.28 / 90.66 / 44.40 / 54.04 / 11.31
up1 | 8×8 | 35.83 / 78.58 / 41.14 / 30.83 / 55.55 | 40.23 / 90.72 / 44.55 / 53.29 / 11.59
up2 | 16×16 | 35.85 / 79.19 / 41.10 / 30.25 / 54.92 | 40.67 / 92.72 / 44.85 / 55.98 / 11.34
up3 | 32×32 | 35.91 / 80.39 / 41.24 / 31.47 / 57.12 | 41.21 / 94.52 / 45.46 / 52.62 / 11.89
middle + bottom3 + up1 | 8×8 | 35.65 / 78.26 / 41.13 / 30.39 / 55.47 | 39.99 / 90.42 / 44.12 / 52.47 / 11.76
bottom2 + up2 | 16×16 | 35.91 / 79.43 / 41.16 / 30.41 / 54.86 | 40.64 / 91.98 / 44.71 / 53.21 / 11.77
bottom1 + up1 | 32×32 | 35.77 / 79.31 / 41.15 / 30.09 / 56.63 | 40.75 / 93.72 / 45.02 / 51.57 / 12.69
all | 8×8 | 35.84 / 79.19 / 41.10 / 30.47 / 55.64 | 40.61 / 91.22 / 44.77 / 53.14 / 11.42
all | 32×32 | 35.48 / 78.85 / 41.40 / 30.57 / 56.53 | 40.18 / 92.74 / 45.29 / 48.95 / 11.88
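For the multi-block rows in Tab. 6, feature maps are combined by additive fusion after being resized to a common spatial size. The snippet below is a minimal sketch of that fusion step only; it assumes the maps already share the channel dimension, and how channel mismatches are handled in the actual implementation is not specified here.

    import torch.nn.functional as F

    def fuse_feature_maps(feature_maps, target_size=(8, 8)):
        # Resize every probed U-Net feature map to a common spatial size, then add them up.
        resized = [F.interpolate(m, size=target_size, mode="bilinear", align_corners=False)
                   for m in feature_maps]
        return sum(resized)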

these two objectives does not cause significant conflict. In- approaches, we evaluate their generation performance as
stead, there may be a possibility that they can collaborate shown in Tab. 11. From the results, we find training DA +
with each other to achieve a competitive trade-off between LoRA is better, perhaps due to the more flexibility brought
text-image alignment and image quality. by more parameters from DA and LoRA during the discrim-
inative tuning phase.
A.8. Impact of Timesteps on Discriminative Tasks
As shown in Tab. 11, we explore the influence of differ- B. More Experimental Settings
ent timesteps used in DDPM on the ITM and REC perfor-
mance. The results illustrate that the proposed model could B.1. Benchmark Datasets
achieve the best performance when the timestep is set to • Data used for Training. We evaluate the basic global
250. The performance comparison between 0 and 250 re- matching and local grounding abilities of T2I models based
veals that it is helpful to improve the discriminative abilities on two discriminative tasks, i.e., Image-Text Matching and
by introducing appropriate levels of noise. Referring Expression Comprehension, respectively. For
discriminative probing and tuning, we reorganize public
A.9. Impact of Tunable Modules
benchmarks including MSCOCO [35] for ITM, and Re-
In the second stage, we have two strategies for discrim- fCOCO [68], RefCOCO+ [68], and RefCOCOg [68] for
inative tuning: only training LoRA and training DA + REC. Specifically, we use the COCO2014 version of MS-
LoRA. To make a comparison between the two tuning COCO, composed of 82,783 images and 414,113 captions
[Figure 6: grouped bar charts comparing SD-v1.4, SD-v2.1, DPT (Ours), and DPT + SC (Ours) per category; panels: (a) CLIP Score, (b) BLIP-ITM Score, (c) BLIP-ITC Score. Improvements of DPT + SC over SD-v2.1 per category (panels a/b/c): Counting 2.81% / 7.28% / 3.03%; Spatial 2.04% / 10.87% / 1.05%; Semantic 2.16% / 6.79% / 1.20%; Complicated 2.06% / 5.15% / 2.55%; Imagination 2.60% / 9.84% / 2.02%.]
Figure 6. Alignment performance improvement of the proposed method compared with SD-v1.4 and SD-v2.1 on the five categories of the COCO-NSS1K dataset: counting, spatial relation reasoning, semantic relation reasoning, complicated relation reasoning, and imagination. Results on three evaluation metrics, (a) CLIP Score, (b) BLIP-ITM Score, and (c) BLIP-ITC Score, are reported. The value on the right of each category denotes the percentage improvement of DPT + SC (Ours) compared to SD-v2.1.
Table 7. Generation performance with different numbers of LoRA ranks on the COCO-NSS1K and CC-500 datasets. Alignment-oriented
evaluation metrics include CLIP score ↑, BLIP-ITM score ↑, BLIP-ITC score ↑, and GLIP score ↑, while quality-oriented evaluation
metrics include IS ↑ and FID ↓. The best results are highlighted in bold.
Rank | COCO-NSS1K (ID): CLIP / BLIP-ITM / BLIP-ITC / IS / FID | CC-500 (OOD): CLIP / BLIP-ITM / BLIP-ITC / GLIP / IS
0 | 34.96 / 73.32 / 40.22 / 30.40 / 55.35 | 39.24 / 85.45 / 43.36 / 52.09 / 11.53
4 | 35.83 / 78.58 / 41.14 / 30.83 / 55.55 | 40.23 / 90.72 / 44.55 / 53.29 / 11.59
8 | 35.47 / 76.89 / 40.94 / 30.39 / 55.28 | 39.78 / 90.04 / 44.26 / 52.69 / 11.84
16 | 35.59 / 77.33 / 40.77 / 30.25 / 54.83 | 39.85 / 89.45 / 43.96 / 52.69 / 11.99
32 | 35.55 / 76.68 / 40.80 / 29.98 / 54.81 | 39.98 / 89.82 / 44.04 / 52.76 / 11.83
64 | 35.67 / 77.05 / 40.79 / 30.12 / 55.28 | 40.22 / 90.99 / 44.21 / 53.44 / 11.73
128 | 35.66 / 77.11 / 40.84 / 30.68 / 55.34 | 40.05 / 89.99 / 44.12 / 52.39 / 11.92
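To relate the rank in Tab. 7 to the number of extra parameters, a generic LoRA linear layer is sketched below. This is a self-contained illustration rather than the exact implementation used in DPT; in practice such adapters are typically attached to the attention projections of the U-Net.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Wrap a frozen linear layer with a rank-r update: W x + (alpha / r) * B A x."""
        def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)                                   # original weights stay frozen
            self.down = nn.Linear(base.in_features, rank, bias=False)     # A: d_in -> r
            self.up = nn.Linear(rank, base.out_features, bias=False)      # B: r -> d_out
            nn.init.zeros_(self.up.weight)                                # start as an identity-preserving update
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * self.up(self.down(x))

Each wrapped layer adds rank × (d_in + d_out) trainable parameters, which is why a rank of 4 introduces only a small fraction of the U-Net's parameters.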

[Figures 7 and 8: line charts of CLIP and BLIP-ITM (generation) and Matching and Grounding (discrimination) scores versus tuning step (0–40k) for DPT (SD-v2.1) and DPT (SD-v1.4), respectively.]

Figure 7. Impact of the discriminative tuning steps in DPT (SD-v2.1) on generation and discrimination performance. CLIP and BLIP-ITM scores on the CC-500 dataset under the out-of-distribution setting, the average performance over I-to-T and T-to-I matching, and the average precision@1 over the test sets of RefCOCO, RefCOCO+, and RefCOCOg are shown. The model achieves the best generation performance on the validation set at step 700.

Figure 8. Impact of the discriminative tuning steps in DPT (SD-v1.4) on generation and discrimination performance. CLIP and BLIP-ITM scores on the CC-500 dataset under the out-of-distribution setting, the average performance over I-to-T and T-to-I matching, and the average precision@1 over the test sets of RefCOCO, RefCOCO+, and RefCOCOg are shown. The model achieves the best generation performance on the validation set at step 800.

in the training set. There are about 5 caption annotations for each image. As for REC, following MDETR [26], we combine the three datasets, i.e., RefCOCO, RefCOCO+, and RefCOCOg, into one, called RefCOCOall in the following. Its training set includes 28,158 images and 321,327 expressions.
To probe global matching and local grounding abilities at the same time, we combine all the above datasets into one. Concretely, we observe that the MSCOCO dataset includes all the raw images in the above three REC datasets. Therefore, we adopt the following data sampling [41] strategy during training: 1) randomly sample an expression y′ from RefCOCOall and get the corresponding image x, 2) randomly sample a caption y from all positive captions of x, 3) randomly sample a hard negative caption y^neg from the top-20 hard negative captions^4 of x from the training set of MS-COCO, 4) randomly sample a hard negative image
^4 We use OpenCLIP (ViT-H-14) to calculate image-text similarities and retrieve the top-k hard negative captions (k=20) or hard negative images (k=4).
Table 8. Generation performance with different numbers of layers of encoders and decoders in the discriminative adapter on the COCO-
NSS1K and CC-500 datasets. Alignment-oriented evaluation metrics include CLIP score ↑, BLIP-ITM score ↑, BLIP-ITC score ↑, and
GLIP score ↑, while quality-oriented evaluation metrics include IS ↑ and FID ↓. The best results are highlighted in bold.
Layer | COCO-NSS1K (ID): CLIP / BLIP-ITM / BLIP-ITC / IS / FID | CC-500 (OOD): CLIP / BLIP-ITM / BLIP-ITC / GLIP / IS
0 | 34.96 / 73.32 / 40.22 / 30.40 / 55.35 | 39.24 / 85.45 / 43.36 / 52.09 / 11.53
1 | 35.83 / 78.58 / 41.14 / 30.83 / 55.55 | 40.23 / 90.72 / 44.55 / 53.29 / 11.59
2 | 35.51 / 77.67 / 40.93 / 30.52 / 54.75 | 40.23 / 91.14 / 44.36 / 52.99 / 11.65
3 | 35.55 / 77.14 / 40.81 / 30.27 / 55.02 | 40.03 / 90.55 / 44.19 / 53.59 / 12.00
4 | 35.91 / 79.41 / 41.08 / 30.95 / 55.06 | 40.88 / 92.92 / 44.88 / 54.26 / 11.53
5 | 35.80 / 78.76 / 41.02 / 29.53 / 55.18 | 40.10 / 90.89 / 44.22 / 55.23 / 11.97
Table 9. Generation performance with different coefficients (Coeff.) of the denoising MSE loss function at the second stage for the
discriminative tuning on the COCO-NSS1K and CC-500 datasets. Alignment-oriented evaluation metrics include CLIP score ↑, BLIP-
ITM score ↑, BLIP-ITC score ↑, and GLIP score ↑, while quality-oriented evaluation metrics include IS ↑ and FID ↓. The best results are
highlighted in bold.
Coeff. of MSE | COCO-NSS1K (ID): CLIP / BLIP-ITM / BLIP-ITC / IS / FID | CC-500 (OOD): CLIP / BLIP-ITM / BLIP-ITC / GLIP / IS
0.00 | 35.83 / 78.58 / 41.14 / 30.83 / 55.55 | 40.23 / 90.72 / 44.55 / 53.29 / 11.59
0.05 | 35.54 / 77.78 / 41.14 / 29.73 / 52.84 | 40.05 / 90.12 / 44.41 / 52.84 / 11.57
0.10 | 35.64 / 79.08 / 41.34 / 29.74 / 52.79 | 40.11 / 89.85 / 44.37 / 53.36 / 11.97
0.30 | 35.53 / 78.81 / 41.32 / 29.95 / 52.73 | 40.03 / 89.65 / 44.37 / 54.48 / 12.20
0.50 | 35.49 / 77.51 / 40.99 / 30.61 / 53.44 | 39.81 / 88.78 / 44.05 / 50.90 / 11.26
1.00 | 35.64 / 79.23 / 41.49 / 30.18 / 53.36 | 39.87 / 90.41 / 44.70 / 52.09 / 12.09
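For clarity, the objective swept in Tab. 9 can be summarized as the discriminative loss plus a weighted denoising term. This is a sketch using the standard ε-prediction MSE; the exact notation in the main paper may differ:

    \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{DPT}} + \lambda_{\text{MSE}} \, \mathbb{E}_{z_0, c, \epsilon, t}\left[ \lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2 \right]

where λ_MSE is the coefficient listed in the first column of Tab. 9, and λ_MSE = 0 recovers pure discriminative tuning.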

[Figure 9: line charts of Accuracy on ITM and Precision@1 on REC versus Tuning Step (0–40k), with curves for Matching (DPT-v1.4), Grounding (DPT-v1.4), Matching (DPT-v2.1), and Grounding (DPT-v2.1); panels: (a) ITM, (b) REC.]
Figure 9. Discriminative performance comparison between DPT-v1.4 and DPT-v2.1 on (a) Image-Text Matching (ITM) and (b) Referring Expression Comprehension (REC) as the tuning progresses.

x^neg from the top-4 hard negative images of y from the training set of MS-COCO, and 5) randomly sample a negative image x^rand and a negative caption y^rand from all other uncorrelated images and captions, respectively, from the training set of MS-COCO. Finally, we have a septuple (x, y′, y, y^neg, x^neg, y^rand, x^rand) in each data instance.
• Benchmarks for Evaluation of Generation. Originating from MSCOCO [35], COCO-NSS1K [44] is specially reorganized to assess counting and relation understanding of generative models in complex scenes, including 943 natural prompts and relevant ground-truth images. CC-500 [16] is built to evaluate compositional generation with 446 template-based prompts that conjoin two concepts. ABC-6K [16] is composed of 6,434 prompts, including 3,217 natural prompts from MSCOCO in which at least two color words describe different objects, and another 3,217 prompts obtained by switching the positions of two color words. TIFA [25] is a recent VQA-based benchmark built to evaluate T2I alignment across 12 categories, consisting of 4,081 prompts from MSCOCO [35], DrawBench [52], PartiPrompt [67], and PaintSkills [7]. The VQA accuracy based on MLLMs [32] is employed to assess text-image alignment. As a contemporary work, T2I-CompBench [14] is constructed to offer a comprehensive benchmark for compositional T2I across 6 categories, including color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions. We use its test set with 1,800 prompts for evaluation.
• Benchmarks for Evaluation of Discrimination. To evaluate the discriminative global matching [42, 57, 59, 60] ability of different T2I models, we reorganize the test sets of MS-COCO for efficient evaluation. Specifically, for the Image-to-Text (I-to-T) matching direction, we first retrieve the top-20 hard negative captions and then randomly sample 3 captions. In this way, we have one positive caption and three negative captions for each image. Similarly, we randomly sample 3 negative images from the top-4 retrieval results for the T-to-I matching direction. We name this hard-negative-based test set HN-MSCOCO. As for local grounding, we evaluate generative models following existing methods [26, 38, 56] for REC.
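As a concrete reference for the Data used for Training sampling strategy above, the following Python sketch assembles one septuple. All argument names are illustrative placeholders (plain dictionaries and lists), not identifiers from the released code.

    import random

    def build_septuple(expressions, captions_by_image, hard_neg_captions,
                       hard_neg_images, all_images, all_captions):
        # 1) sample a referring expression y' and its image x from RefCOCOall
        y_prime, x = random.choice(expressions)              # list of (expression, image_id) pairs
        # 2) sample a positive caption y of x from MS-COCO
        y = random.choice(captions_by_image[x])
        # 3) sample a hard negative caption from the top-20 list of x
        y_neg = random.choice(hard_neg_captions[x][:20])
        # 4) sample a hard negative image from the top-4 list of y
        x_neg = random.choice(hard_neg_images[y][:4])
        # 5) sample an uncorrelated (random) negative image and caption
        x_rand = random.choice([i for i in all_images if i != x])
        y_rand = random.choice([c for c in all_captions if c not in captions_by_image[x]])
        return x, y_prime, y, y_neg, x_neg, y_rand, x_rand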
Table 10. Discriminative performance comparison with different timesteps for image-text matching and referring expression comprehension to evaluate global matching and local grounding abilities, respectively. Larger timesteps mean stronger noise is introduced to contaminate the input images. Datasets include MSCOCO-HN for ITM, and RefCOCO, RefCOCO+, and RefCOCOg for REC.

Timestep | MSCOCO-HN: I-to-T / T-to-I / Overall | RefCOCO: testA / testB | RefCOCO+: testA / testB | RefCOCOg: test | Avg.
Random Chance | 25.00 / 25.00 / 25.00 | 13.51 / 19.20 | 13.57 / 19.60 | 19.10 | 17.00
0 | 44.02 / 34.36 / 39.19 | 60.53 / 49.32 | 54.16 / 42.38 | 52.74 | 51.83
250 | 44.31 / 35.13 / 39.72 | 63.80 / 52.03 | 56.72 / 44.04 | 55.03 | 54.32
500 | 42.07 / 34.97 / 38.52 | 57.84 / 46.73 | 50.12 / 38.41 | 47.45 | 48.11
750 | 38.09 / 32.13 / 35.11 | 35.90 / 27.81 | 28.50 / 21.50 | 28.00 | 28.34
999 | 34.48 / 28.56 / 31.52 | 19.50 / 15.17 | 13.31 / 9.10 | 9.10 | 13.24
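As context for Tab. 10, the probed latent is first noised with the standard DDPM forward process before being passed to the frozen U-Net and the discriminative adapter. A minimal sketch (variable names are illustrative):

    import torch

    def add_noise(z0, t, alphas_cumprod):
        # DDPM forward process: z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps
        a_bar = alphas_cumprod[t]
        eps = torch.randn_like(z0)
        return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps

Probing at t = 250 keeps most of the image content while still matching the noisy inputs the U-Net was trained on, consistent with the peak in Tab. 10; at t = 999 almost all content is destroyed and both tasks approach random chance.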

Table 11. Generation performance comparison with different tuning strategies at the second stage. “LoRA” refers to only training LoRA and freezing other modules including the U-Net and the Discriminative Adapter, while “DA + LoRA” means training the Discriminative Adapter and LoRA with only the U-Net frozen.

Trainable | COCO-NSS1K: CLIP / BLIP-ITM | CC-500: CLIP / BLIP-ITM
LoRA | 35.48 / 76.46 | 39.75 / 89.23
DA + LoRA | 35.83 / 78.58 | 40.23 / 90.72

B.2. Baselines

To verify the effectiveness of our method on T2I, we carry out experiments based on SD-v1.4 and SD-v2.1, and compare DPT with SD [50] and recent alignment-oriented T2I baselines including LayoutLLM-T2I [44], StructureDiffusion [16], Attend-and-Excite [4], DiffusionITM [28], VPGen [13], and LayoutGPT [17].

B.3. Implementation Details

B.3.1 Settings for Discriminative Probing and Tuning

In the first stage, we perform discriminative probing by pre-training for 60k steps with a batch size of 220 and a learning rate of 1e-4. After that, we perform discriminative tuning in the second stage for 6k steps with a batch size of 8 and a learning rate of 1e-4. We use the AdamW [39] optimizer with betas of (0.9, 0.999) and a weight decay of 0.01. We perform gradient clipping with a maximum norm of 0.1. We select the model checkpoint by using a validation set for alignment-oriented generation. Specifically, we collect this validation set from MS-COCO, following COCO-NSS1K [44]. Based on this set, we generate images conditioned on the prompts and then compute the CLIP score between them. The model with the highest CLIP score on this validation set is chosen for testing. Besides, we accumulate gradients every 8 steps in this stage. Using a single A100 (40G), the discriminative probing requires about 3 days for the first stage, and the discriminative tuning requires about 1 day for the second stage.

B.3.2 Settings for Discriminative Adapter

We use a 1-layer Transformer encoder and a 1-layer Transformer decoder to implement the discriminative adapter introduced in Sec. 3.1. The feature map with shape 1280 × 8 × 8 from the medium layer of the U-Net is flattened and fed into the encoder. The hidden dimensions of the attention layers and feedforward layers are set to 256 and 2048, respectively. The number of attention heads is 8. We use 110 learnable queries in total, 10 for global matching and the other 100 for local grounding, i.e., N = 110 and M = 10. The temperature factor τ for contrastive learning in Eqn. (3) and Eqn. (4) is learnable and initialized to 0.07. To balance the different grounding objectives in Eqn. (7), we set λ0, λ1, λ2, and λ3 to 1, 5, 2, and 1, respectively. The same setting is also used for maximum matching in Eqn. (5). During inference, we use 0.5 as the default guidance strength of self-correction, i.e., η = 0.5.

B.3.3 Implementation of Baselines

We run the code of SD-v1.4^5 and SD-v2.1^6 from the Hugging Face open-source community. Besides, we run the code of LayoutLLM-T2I^7 and load the open checkpoint to evaluate its performance on COCO-NSS1K. Because of their training-free property, StructureDiffusion^8 and Attend-and-Excite^9 can be directly executed for evaluation. As for DiffusionITM [28], we implement it based on the open-sourced code^10 and rename it HN-DiffusionITM, considering that we perform contrastive learning based on the hard negative samples. For a fair comparison, we re-train HN-DiffusionITM using our training data and adopt the DDIM sampler with 50 timesteps to synthesize images.

^5 https://huggingface.co/CompVis/stable-diffusion-v1-4
^6 https://huggingface.co/stabilityai/stable-diffusion-2-1
^7 https://github.com/LayoutLLM-T2I/LayoutLLM-T2I
^8 The code of StructureDiffusion can be found at https://github.com/weixi-feng/Structured-Diffusion-Guidance, but it is only implemented based on SD-v1.4.
^9 https://github.com/yuval-alaluf/Attend-and-Excite
^10 https://github.com/McGill-NLP/diffusion-itm
generative models, we implement Diffusion Classifier [31]
and DiffusionITM [28]. They both depend on ELBO to
compute text-image similarities. Compared with Diffusion
Classifier, DiffusionITM uses a normalization method to
rectify the text-to-image directional matching, dealing with
the modality asymmetry issue to some extent. We call this
version as HN-DiffusionITM.
Regarding the local grounding ability, we first evalu-
ate CLIP and OpenCLIP by means of the cropping strat-
egy [56]. In detail, we first crop the raw images into multi-
ple blocks and resize them to the same size according to the
proposals, and then perform expression-to-image matching.
In addition, we also adopt the cropping and expression-
to-image matching strategy to evaluate the grounding per-
formance of Diffusion Classifier, DiffusionITM, and HN-
DiffusionITM, since these models can not be directly re-
purposed to do the REC task. We also propose a local de-
noising method based on Diffusion Classifier. Specifically,
we calculate the ELBO for each local proposal region in-
stead of the whole image, and then take the proposal with
the maximum ELBO values as the prediction.
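A sketch of the local denoising scoring described above is given below. Following Diffusion Classifier, the ELBO of each proposal is approximated by the (negative) noise-prediction error of the text-conditioned U-Net, so the proposal with the smallest error (i.e., the highest approximate ELBO) is predicted. The callables (vae_encode, unet, text_encode) and the box format are assumptions of this sketch, not the released interface.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def pick_proposal(image, boxes, expression, vae_encode, unet, text_encode,
                      alphas_cumprod, timesteps=(250,), n_samples=4):
        cond = text_encode(expression)
        best_idx, best_score = -1, float("-inf")
        for i, (x0, y0, x1, y1) in enumerate(boxes):            # boxes as integer pixel coordinates
            crop = image[:, :, y0:y1, x0:x1]                    # image: (1, 3, H, W)
            crop = F.interpolate(crop, size=(512, 512), mode="bilinear", align_corners=False)
            z0 = vae_encode(crop)                               # latent of the cropped region
            err = 0.0
            for t in timesteps:
                a_bar = alphas_cumprod[t]
                for _ in range(n_samples):
                    eps = torch.randn_like(z0)
                    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
                    err += F.mse_loss(unet(zt, t, cond), eps).item()
            score = -err / (len(timesteps) * n_samples)         # higher score ~ higher approximate ELBO
            if score > best_score:
                best_idx, best_score = i, score
        return best_idx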

C. More Examples
To intuitively compare the proposed method with recent
baselines, we list some generated images on the CC-500
and ABC-6K datasets, as shown in Fig. 10 and Fig. 11, re-
spectively. To make a fair comparison, we implement all the
methods based on SD-v1.4, considering that StructureDif-
fusion only supports SD-v1.4. Besides, we show more diverse examples from SD-v1.4, SD-v2.1, and our method on the COCO-NSS1K, CC-500, and ABC-6K datasets in Fig. 12 and Fig. 13, Fig. 14, and Fig. 15, respectively.
Figure 10. Qualitative results on CC-500. We compare the proposed method with SD-v1.4 and two baselines including StructureDiff [16] and Attend-and-Excite (AaE) [4] regarding object appearance and attribute characterizing.

Figure 11. Qualitative results on ABC-6K. We compare the proposed method with SD-v1.4 and two baselines including StructureDiff [16] and Attend-and-Excite (AaE) [4] regarding color attribute characterizing.
Figure 12. More generated examples on COCO-NSS1K regarding object appearance and counting.
Figure 13. More generated examples on COCO-NSS1K regarding spatial, semantic, and compositional reasoning.
Figure 14. More generated examples on the CC-500 dataset.
Figure 15. More generated examples on the ABC-6K dataset.
