

Dynamic Prompt Optimizing for Text-to-Image Generation


Wenyi Mo1,2, Tianyu Zhang3, Yalong Bai3, Bing Su1,2, Ji-Rong Wen1,2, and Qing Yang3
1 Gaoling School of Artificial Intelligence, Renmin University of China
2 Beijing Key Laboratory of Big Data Management and Analysis Methods
3 Du Xiaoman Technology
{mowenyi, jrwen}@ruc.edu.cn, [email protected]
{zhangtianyu, libai, yangqing}@duxiaoman.com

Abstract

Text-to-image generative models, specifically those based on diffusion models like Imagen and Stable Diffusion, have made substantial advancements. Recently, there has been a surge of interest in the delicate refinement of text prompts. Users assign weights or alter the injection time steps of certain words in the text prompts to improve the quality of generated images. However, the success of fine-control prompts depends on the accuracy of the text prompts and the careful selection of weights and time steps, which requires significant manual intervention. To address this, we introduce the Prompt Auto-Editing (PAE) method. Besides refining the original prompts for image generation, we further employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to dynamic fine-control prompts. The reward function during training encourages the model to consider aesthetic score, semantic consistency, and user preferences. Experimental results demonstrate that our proposed method effectively improves the original prompts, generating visually more appealing images while maintaining semantic alignment. Code is available at this https URL.

1. Introduction

Text-to-image generative models take a user-provided text to generate images matching the description [1, 16, 29, 30]. The input text is called a prompt since it prompts the generative models to follow the user's instructions. However, it has been reported that recent text-to-image models are sensitive to prompts [5, 15, 19]. The organization of the input prompts plays a crucial role in determining the quality and relevance of the generated images. Interestingly, even when two prompts convey identical meanings, different expressions of these prompts may yield vastly different image interpretations. Therefore, it is crucial to craft appropriate prompts that convey the user's intended ideas and establish clear communication with the generative model.

For a given pre-trained text-to-image generative model, it is unclear which type of prompt is the most suitable. Consequently, users rely heavily on heuristic engineering methods [21], repeatedly running the generative model with modified prompt candidates in search of an optimal one. They append modifier words to enhance the art style or emphasize the image quality. These hand-crafted heuristics need to be implemented separately for each design intention and generative model, resulting in a costly, time-consuming, and labor-intensive trial-and-error process. Although there are learning-based methods [9, 44] that aim to enhance the quality of image generation results by rephrasing or appending modifiers to user-input prompts, these methods lack control over the extent to which the added modifier words influence the image generation process.

It is common practice to assign varying levels of importance to specific words when designing a text prompt*. This technique allows for more precise control over the generation process, as illustrated in Fig. 1 (a). Another notable characteristic of diffusion models is the multi-step denoising process. This multi-step design allows us to use different prompts at different time steps, thus achieving better results. By precisely adjusting the effect time range of modifier words during this process, a significant enhancement of the visual aesthetics of the generated image can be achieved, as shown in Fig. 1 (b). Therefore, to achieve more precise and detailed control over various aspects of the generated image, we propose a novel prompt format called the Dynamic Fine-control Prompt (DF-Prompt). It consists of several triples of tokens, effect ranges, and importance levels.

* https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#attentionemphasis
† Corresponding authors.

[Figure 1 panels: "a red horse on the yellow grass" generated with the triples ⟨anime, 1↦0, 1⟩ vs. ⟨anime, 1↦0, 1.5⟩ in (a), and ⟨detailed, 1↦0, 1⟩ vs. ⟨detailed, 1↦0.85, 1⟩ in (b).]
Figure 1. Generation results with the same seed using dynamic fine-control prompts (one plain token is extended into a triple of ⟨token, effect range, weight⟩). It can be seen that (a) increasing the weight of anime to 1.5 can amplify the sense of anime; (b) applying the word detailed in the first 15% of denoising timesteps can generate more natural texture details than applying it in all timesteps.
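To make the triple notation concrete, the following is a minimal sketch (not the paper's released code) of how an inference loop could decide, at each denoising step, which modifier tokens are active and with what weight. The helper names and the normalized-time convention (t runs from 1 at the first denoising step to 0 at the last) are our own assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class DFToken:
    token: str      # modifier word, e.g. "detailed"
    start: float    # effect range start b_i (normalized, 1 = first denoising step)
    end: float      # effect range end e_i (normalized, 0 = last denoising step)
    weight: float   # scalar multiplier w_i applied to the token embedding

def active_modifiers(df_tokens, t):
    """Return the (token, weight) pairs whose effect range covers normalized time t."""
    return [(m.token, m.weight) for m in df_tokens if m.end <= t <= m.start]

# Fig. 1 (b): "detailed" applied only during the first 15% of denoising.
df_prompt = [DFToken("detailed", start=1.0, end=0.85, weight=1.0)]
num_steps = 10
for step in range(num_steps):
    t = 1.0 - step / num_steps          # normalized time for this denoising step
    mods = active_modifiers(df_prompt, t)
    # The base prompt plus the currently active modifiers (with their weights)
    # would be re-encoded here and fed to the diffusion model for this step.
    print(step, t, mods)
```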

Traditional hand-crafted heuristic prompt engineering approaches struggle to handle such intricate and granular adjustments. Hence, it is necessary to develop an automated method for fine-grained optimization of prompts.

In this study, we propose a method called Prompt Auto-Editing (PAE). The primary aim of PAE is to optimize user-provided plain prompts into DF-Prompts for generating high-quality images. This optimization is achieved through reinforcement learning. PAE involves a two-stage training process. In the first stage, we introduce an automated method to overcome the dependency on manually constructed training samples. We define a confidence score to automatically filter publicly available prompt-image data, ensuring that the selected images are both visually pleasing and semantically consistent with the corresponding text. We then use this filtered dataset to fine-tune a pre-trained language model. The result is a tailored model that can enhance a given prompt with suitable modifiers. The second stage of PAE builds on this tailored model. We use online reinforcement learning to encourage the model to explore better combinations of prompts and extra parameters, i.e., the effect range and weight of each modifier. To support this, we build a multidimensional reward that takes into account aesthetic ratings, consistency between image and text semantics, and user preferences. Through this process, PAE can automatically find appropriate dynamic fine-grained prompt tokens. To demonstrate the effectiveness of our approach, we apply PAE to optimize text prompts from several public datasets, including Lexica.art†, DiffusionDB [38], and COCO [14]. The experimental results show that our method can greatly improve human preference and aesthetic scores while maintaining semantic consistency between the generated images and the original prompts. The contributions are as follows.
• Dynamic fine-control prompt editing framework: We introduce a framework that enhances prompt editing flexibility. By integrating the effect range and weight of modifier tokens into a reinforcement learning framework, we enable fine-grained control and precise adjustments in image generation.
• Effective results: Our method's effectiveness is thoroughly validated through experiments on several datasets. The results show that our approach improves image aesthetics, ensures semantic consistency between prompts and generated images, and aligns more closely with human preferences.
• Insightful findings: Our research reveals that artist names and texture-related modifiers enhance the artistic quality of generated images while preserving the original semantics. It is more effective to introduce these terms in the latter half, rather than the initial half, of the diffusion process. Assigning a lower weight to complex terms promotes more balanced image generation. These findings hold significant implications for creative work and future research.

2. Related work

Content generation. AI-generated content (AIGC) [3, 22, 26, 28–30, 36, 42] has made revolutionary progress in recent years, particularly in natural language processing. Large language models such as BERT [6], GPT-1 to GPT-4 [2, 17, 23, 24], and ChatGPT‡ have demonstrated exceptional text understanding and generation ability. Their advancements have greatly influenced the generation of text-to-image content. With the development of generative models [7, 32–34] and multi-modal pre-training techniques [25], text-to-image generative models such as DALL·E 2 [28], Imagen [30], Stable Diffusion [29], and Versatile Diffusion [42] have showcased impressive performance in generating high-quality images. These breakthroughs have captured the attention of both academia and industry due to their potential impact on content production and applications in open creative scenarios.

† https://lexica.art/
‡ https://chat.openai.com/

[Figure 2 schematic. Stage 1 (supervised fine-tuning): prompt logs are filtered with the confidence score and used to train EReP, which appends modifiers to a plain prompt s, giving sReP = s ⊕ {x1, ..., xn}. Stage 2 (online reinforcement learning): the policy model EDFP predicts triples ⟨xi, τi, wi⟩, giving sDFP = s ⊕ {⟨x1, τ1, w1⟩, ..., ⟨xn, τn, wn⟩}, e.g. "portrait of a beautiful forest goddess, [beauty:0.5↦0:0.75], [esoteric:1↦0.5:0.5]"; the DF-prompt changes the text injection of the diffusion model M, and the resulting images are scored to compute the reward R.]
Figure 2. The training process of PAE. (Stage 1) We select the training prompts based on a confidence score S as shown in Eq. (1), then fine-tune a pre-trained language model. The result is EReP, a model that produces refined prompts. (Stage 2) We initialize the policy model EDFP using EReP. We add two linear heads to this model. These heads, along with the one predicting word tokens, use the same model's intermediate representation for their predictions. We then transform these predictions into DF-prompts. These DF-prompts modify the text injection mode of the diffusion model M, which in turn affects the output images. During the online exploration, we use the original plain prompt s, the optimized DF-prompt sDFP, and their respective images I and IDFP to compute the reward R. Finally, we update the policy model by minimizing the loss function defined in Eq. (3).
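As a rough illustration of the Stage 2 architecture described in the caption, the sketch below adds two extra linear heads on top of a causal language-model backbone so that the token head, effect-range head, and weight head all read the same intermediate representation. The class name, hyper-parameters, and backbone loading are our own assumptions, not the authors' released implementation.

```python
import torch.nn as nn
from transformers import GPT2Model

class DFPromptPolicy(nn.Module):
    """Policy-model sketch: one shared backbone, three prediction heads."""
    def __init__(self, vocab_size, num_ranges=3, num_weights=5, backbone="gpt2-medium"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(backbone)
        hidden = self.backbone.config.hidden_size
        self.token_head = nn.Linear(hidden, vocab_size)    # next modifier token
        self.range_head = nn.Linear(hidden, num_ranges)    # tau in {0.5->0, 1->0, 1->0.5}
        self.weight_head = nn.Linear(hidden, num_weights)  # w in {0.5, 0.75, 1, 1.25, 1.5}

    def forward(self, input_ids, attention_mask=None):
        h = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last = h[:, -1, :]                                 # representation of the current step
        return (self.token_head(last),
                self.range_head(last),
                self.weight_head(last))
```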

In this paper, the proposed dynamic prompt editing framework utilizes a language generation model to assist text-to-image generation.

Text-to-image prompt collection and analysis. In recent years, several studies have explored the generative ability of text-to-image models. Some researchers collect prompt-image pairs from online communities or expert users [37, 38, 40, 41]. DiffusionDB [38], containing 2 million images, is collected from online public Stable Diffusion servers. It provides a valuable resource for researchers to study and improve the performance of text-to-image models. More recently, Xu et al. [41] build an expert comparison dataset, including 137K prompt-image pairs from text-to-image models. These pairs are evaluated in terms of aesthetics, text-image alignment, toxicity, and biases. With this wealth of data, we aim to develop an automatic prompt editing method that can improve the performance of text-to-image models and generate high-quality images that satisfy users' demands.

Prompt design. Text-to-image generative models [4, 10, 16, 27–30] are currently experiencing significant advancements, resulting in impressive visual effects in the generated images. However, these models only yield satisfactory images given appropriate input prompts, leading users to invest considerable time in modifying the prompts to ensure the generated images are aesthetically pleasing. In pursuit of higher-quality images, both researchers and online communities contribute creatively to prompt engineering for text-to-image generation [18, 21, 38]. For instance, Pavlichenko et al. [21] employ a genetic algorithm [8] to select a range of prompt keywords that enhance the quality of the images. Concurrently, Oppenlaender [19] utilizes auto-ethnographic research to understand the prompt design of online communities and categorizes existing prompt modifiers into six categories. Additionally, Liu et al. [15] collect over a thousand prompts for multiple group comparison experiments and propose design guidelines for text-to-image prompt engineering. Recently, Hao et al. [9] propose a learning-based prompt optimization method using reinforcement learning. These approaches primarily focus on modifications to plain prompts and fail to achieve fine-control information injection. In this paper, we introduce a novel prompt editing framework to achieve fine-control prompt optimization. A reinforcement learning strategy is used to develop the capability of extending modifiers, adjusting weights, and adaptively fitting the effect step ranges of the modifier tokens, with aesthetics, text-image semantic consistency, and human preferences serving as the reward.

3. Method

In this section, we introduce the novel prompt format for diffusion-based text-to-image generative models. To achieve automated prompt editing, we design a two-stage training process, called Prompt Auto-Editing (PAE). PAE includes a supervised fine-tuning stage for refined prompt generation and an online reinforcement learning stage for dynamic fine-control prompt generation.

3.1. Definitions of Dynamic Fine-control Prompt

Given a pre-trained text-to-image generative model M and user input text s, our goal is to produce a modified prompt sm with fine-grained control so that the generated image, Im ∼ M(sm), exhibits enhanced visual effects while remaining faithful to the semantics of the initial prompt s. The modified prompt sm contains the initial prompt s and a set of predicted modifiers A = {x1, · · · , xi, · · · , xn}, i.e., sm = s ⊕ A. The ⊕ symbol indicates the append operation.

We hereby define a new prompt format that enriches the information of the initial prompt, named Dynamic Fine-Control Prompt (DF-Prompt). Within this paradigm, each token xi of the modifier set A is coupled with an

effect range τi and a specific weight wi, resulting in a triple ai = ⟨xi, τi, wi⟩, where wi is a float number that weights the token embeddings to control the overall influence of token xi during image generation. The range τi = [bi ↦ ei] (1 ≥ bi ≥ ei ≥ 0) is the normalized range that delineates the start and end steps of the iterative denoising process of the text-to-image model. We define the DF-Prompt token set as ADFP = {⟨x1, τ1, w1⟩, · · · , ⟨xn, τn, wn⟩}, and the DF-Prompt as sDFP = s ⊕ ADFP. The essence of the DF-Prompt lies in facilitating a more precise and controlled generation, ensuring the refined prompts are optimally structured for M to process. To facilitate demonstration and code implementation, we also define a plain-text format in which the triples are written within square brackets, [token:range:weight]. For instance, as shown in Fig. 2, a DF-Prompt is written as "portrait of a beautiful forest goddess, [beauty:0.5↦0:0.75], [esoteric:1↦0.5:0.5]".
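For concreteness, here is a small sketch of a parser for the plain-text [token:range:weight] format defined above; the regular expression and the return structure are our own illustration and may not match the authors' implementation exactly.

```python
import re

# Matches appended segments such as "[beauty:0.5↦0:0.75]".
TRIPLE = re.compile(r"\[\s*([^:\[\]]+?)\s*:\s*([\d.]+)\s*(?:↦|->)\s*([\d.]+)\s*:\s*([\d.]+)\s*\]")

def parse_df_prompt(text):
    """Split a DF-Prompt string into the base prompt and a list of
    (token, (start, end), weight) triples."""
    triples = [(m.group(1), (float(m.group(2)), float(m.group(3))), float(m.group(4)))
               for m in TRIPLE.finditer(text)]
    base = re.sub(r"[\s,]+$", "", TRIPLE.sub("", text))   # drop trailing separators
    return base, triples

base, triples = parse_df_prompt(
    "portrait of a beautiful forest goddess, [beauty:0.5↦0:0.75], [esoteric:1↦0.5:0.5]")
# base    -> "portrait of a beautiful forest goddess"
# triples -> [("beauty", (0.5, 0.0), 0.75), ("esoteric", (1.0, 0.5), 0.5)]
```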
3.2. Overview of PAE

We formulate the prompt editing problem as a reinforcement learning task and propose a Prompt Auto-Editing method named PAE. PAE enhances the user-provided prompt by adding modifiers in an auto-regressive manner while assigning corresponding effect ranges and weights. As illustrated in Fig. 2, PAE operates in two distinct training stages. Stage 1: To enrich simple prompts, we fine-tune a pre-trained language model on a curated prompt-image dataset, selected based on a confidence score S. The result of this stage is a refined prompt model EReP. Stage 2: This stage involves an online reinforcement learning process. We implement a policy model EDFP initialized from EReP. The policy model interacts with the environment (the text-to-image model M) through the current policy (the model-derived mapping from the input prompt to the dynamic fine-control prompt). A reward function is defined to evaluate the aesthetic appeal of the generated image, its semantic similarity to the input text, and its alignment with human preference. The policy model EDFP is then optimized based on a defined loss function.

3.3. Finetuning for Plain Prompt Refinement

In the first stage, we utilize the selected data to fine-tune the GPT-2 [24] model to obtain a plain prompt refining model EReP. The model EReP predicts suffix modifiers one by one, and this process repeats until the model outputs the stop sign, i.e., <|endoftext|>. Given a prompt s, we construct the refined prompt as sReP = s ⊕ A, where A ∼ EReP(s).

Data Selection. Different from previous methods that depend on human-in-the-loop annotation datasets [21], we collect training data from public text-image datasets and online communities. Given the inconsistent quality of images in publicly available text-image pairs, not all prompts are suitable for model training. Therefore, we devise an automated process for data filtration and training sample construction. The rule for data filtration stipulates that only instances that demonstrate an improvement in aesthetics and maintain semantic relevance after the addition of modifiers are retained. As depicted on the left of Fig. 2, we start with a given prompt s′ from publicly available prompt logs. The original prompt s′ is split at a division point p ∈ {1, · · · , N}, where N represents the number of tokens in s′. The text preceding the division point is considered to contain the primary information, describing the main theme of the image; the text following the division point is regarded as secondary, providing supplementary suffixes as modifier words. Following [38], we select the first comma in s′ as the division point. We thus obtain the short prompt s = {s1, ..., sp}, i.e., the first p tokens joined together. The remaining tokens form the modifier set A = {x1, · · · , xn | x1 = sp+1, · · · , xn = sN}. Lastly, we define a confidence score S(s′, s) and use it to construct the training samples as follows:

\begin{aligned}
\mathbb{D} &= \left\{ \left\langle \mathbf{s}, \mathbf{A} \right\rangle \mid \mathcal{S}(\mathbf{s}', \mathbf{s}) > 0 \right\}, \\
\mathcal{S}(\mathbf{s}', \mathbf{s}) &= \mathbb{E}_{\mathbf{I}' \sim \mathcal{M}(\mathbf{s}'),\, \mathbf{I} \sim \mathcal{M}(\mathbf{s})} \big[ u\left( g_\mathrm{aes}(\mathbf{I}') - g_\mathrm{aes}(\mathbf{I}) \right) \times u\left( g_\mathrm{CLIP}(\mathbf{s}, \mathbf{I}') - g_\mathrm{CLIP}(\mathbf{s}, \mathbf{I}) + \gamma \right) \big],
\end{aligned} \tag{1}

where gCLIP measures image-text relevance using a pre-trained CLIP model [25] and gaes returns the aesthetic score§. The parameter γ acts as a tolerance constant. Additionally, u(z) is a characteristic function that returns 1 if z > 0 and 0 otherwise.

§ https://github.com/christophschuhmann/improved-aesthetic-predictor

We train the language model on the training dataset D using teacher forcing [39], with an auto-regressive negative log-likelihood loss on the next token:

\mathcal{L}_\mathrm{ReP} = -\mathbb{E}_{\left\langle \mathbf{s}, \mathbf{A} \right\rangle \sim \mathbb{D}} \left[ \log P(\mathbf{A} \mid \mathbf{s}, \mathcal{E}_\mathrm{ReP}) \right]. \tag{2}

In this way, the trained model EReP is proficient in handling brief prompt inputs, i.e., simple text describing the image theme, and predicting appropriate modifiers to formulate refined prompts sReP, thereby elevating the aesthetic quality of the generated image.
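The data filtration rule of Eq. (1) can be written as a short filter. The sketch below assumes hypothetical scorer functions (aesthetic_score, clip_score) and a generate wrapper around the text-to-image model, so it only mirrors the logic of the confidence score rather than reproducing the authors' actual pipeline; the tolerance value follows the γ = 0.01 setting reported in the ablations.

```python
def keep_pair(short_prompt, full_prompt, generate, aesthetic_score, clip_score,
              gamma=0.01, num_samples=2):
    """Return True if the modifier-extended prompt improves aesthetics without
    losing semantic relevance to the short prompt, following Eq. (1)."""
    votes = 0
    for _ in range(num_samples):                      # expectation over sampled images
        img_short = generate(short_prompt)
        img_full = generate(full_prompt)
        better_aes = aesthetic_score(img_full) > aesthetic_score(img_short)
        still_relevant = (clip_score(short_prompt, img_full)
                          > clip_score(short_prompt, img_short) - gamma)
        votes += int(better_aes and still_relevant)   # u(.) * u(.) term
    return votes / num_samples > 0                    # S(s', s) > 0

# dataset = [(s, modifiers) for (s, s_full, modifiers) in prompt_logs
#            if keep_pair(s, s_full, generate, aesthetic_score, clip_score)]
```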

3.4. RL for DF-Prompt Generation

In the second training stage, we aim to explore better prompt configurations by specifying effect ranges and weights for the additional modifier suffixes.

Online reinforcement learning. We utilize the PPO algorithm [31], a popular reinforcement learning method known for its effectiveness and stability. The aim is to maximize the expected cumulative reward over the training set D. We add two head layers on EReP to predict the effect range and weight corresponding to each token, and initialize the parameters of the additional layers to output τi = [1↦0] and wi = 1 for every token xi. After that, EReP is used to initialize the policy model EDFP. During an episode of prompt optimization, we set the initial state to the initial text s = {s1, ..., sp}. The action space is tripartite: the word space V, the discrete time range space T = {0.5↦0, 1↦0, 1↦0.5}, and the discrete weight space W = {0.5, 0.75, 1, 1.25, 1.5}. At each step t of online exploration, the model selects an action at = ⟨xt, τt, wt | xt ∈ V, τt ∈ T, wt ∈ W⟩ in accordance with the policy model, at ∼ EDFP(s<t). To be consistent with the input format of the language model, we define the state at the t-th step with tokens only, i.e., s<t = s ⊕ {x1, x2, · · · , xt−1}.

During training, the policy model EDFP interacts with the text-to-image model M. We make adjustments to the text encoder module of the model, with the specific implementation details outlined in the supplementary materials. These modifications allow for weighting individual tokens and customizing the effective time range during the denoising process. The predicted action set ADFP = {⟨x1, τ1, w1⟩, · · · , ⟨xT, τT, wT⟩} is used to generate images. Using the generated images, we compute the reward R(s, ADFP). We define a loss function LDFP, which is used to optimize the policy model:

\mathcal{L}_\mathrm{DFP} = -\mathbb{E}_{\mathbf{s} \sim \mathbb{D},\, \mathbf{A}^\mathrm{DFP} \sim \mathcal{E}_\mathrm{DFP}} \left[ R(\mathbf{s}, \mathbf{A}^\mathrm{DFP}) - \eta D_\mathrm{KL} \right], \tag{3}

where DKL computes the Kullback–Leibler divergence [13]. It serves as a regularization constraint to minimize differences between the output modifiers of the policy model EDFP and those of the initial model EReP [20]. We also use Gaussian distributions to supervise the effect range probability distribution and the weight distribution predicted by EDFP. More implementation details are in Sec. 4.2.
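A simplified sketch of the training signal in Eq. (3): the per-episode objective combines the reward with a KL penalty between the policy and the frozen initial model. The tensor shapes and helper name are assumptions, and a full PPO implementation (clipping, advantages, value loss) is omitted; the coefficient default follows the reported η = 0.02.

```python
import torch.nn.functional as F

def dfp_loss(reward, policy_logits, init_logits, eta=0.02):
    """Negative of (reward - eta * KL) for one episode, mirroring Eq. (3).

    policy_logits / init_logits: [T, vocab] logits over modifier tokens from
    EDFP and the frozen EReP; reward: scalar R(s, A_DFP) for the episode."""
    log_p = F.log_softmax(policy_logits, dim=-1)
    log_q = F.log_softmax(init_logits, dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1).sum()  # KL(EDFP || EReP)
    return -(reward - eta * kl)
```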
Another component of PPO is the value model. Its role is to estimate the expected cumulative reward from the current state, directed by the policy model's actions. Its optimization objective is to minimize the difference between the predicted and actual rewards. In the optimization process, the policy model and the value model are optimized alternately, so that they promote each other to maximize the expected cumulative reward. We initialize the value model with EReP and replace its initial linear layer with a regression head for better performance.

Reward definition. We construct the reward R(s, ADFP) using the CLIP score, the Aesthetic score, and PickScore [12]:

\begin{aligned}
R(\mathbf{s}, \mathbf{A}^\mathrm{DFP}) = \mathbb{E}_{\mathbf{I} \sim \mathcal{M}(\mathbf{s}),\, \mathbf{I}^\mathrm{DFP} \sim \mathcal{M}(\mathbf{s} \oplus \mathbf{A}^\mathrm{DFP})} \big[ &\min\left( g_\mathrm{CLIP}(\mathbf{s}, \mathbf{I}^\mathrm{DFP}) - \zeta, 0 \right) \\
{}+{} &\min\left( g_\mathrm{PKS}(\mathbf{s}, \mathbf{I}^\mathrm{DFP}) - \kappa, 0 \right) \\
{}+{} &\alpha \cdot \left( g_\mathrm{aes}(\mathbf{I}^\mathrm{DFP}) - \beta \cdot g_\mathrm{aes}(\mathbf{I}) \right) \big],
\end{aligned} \tag{4}

where gPKS denotes the learned human preference evaluation metric of PickScore. The symbols ζ and κ set minimum thresholds for the CLIP score and PickScore contributions to the reward, while α and β scale the impact of the aesthetic score.
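The reward of Eq. (4) composes the three scorers with thresholds and scales. The sketch below again treats clip_score, pick_score, and aesthetic_score as assumed scorer interfaces, with defaults taken from the hyper-parameters reported later (ζ = 0.28, κ = 18, α = 1, β = 0).

```python
def df_prompt_reward(short_prompt, image_plain, image_dfp,
                     clip_score, pick_score, aesthetic_score,
                     zeta=0.28, kappa=18.0, alpha=1.0, beta=0.0):
    """Reward of Eq. (4) for one pair of images generated from the plain
    prompt (image_plain) and the DF-Prompt (image_dfp)."""
    clip_term = min(clip_score(short_prompt, image_dfp) - zeta, 0.0)
    pick_term = min(pick_score(short_prompt, image_dfp) - kappa, 0.0)
    aes_term = alpha * (aesthetic_score(image_dfp) - beta * aesthetic_score(image_plain))
    return clip_term + pick_term + aes_term
```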
4. Experiments

4.1. Experimental Setup

Data Collection. The public text-image pair sources include Lexica.art and DiffusionDB [38]. NSFW images are recognized with an image classification model and removed from the training data. After that, we conduct the data selection described in Sec. 3.3. Finally, we obtain about 450,000 prompts. We randomly select 500 ⟨s, s′⟩ pairs from DiffusionDB for validation and extract 1,000 prompts each from Lexica.art and DiffusionDB for evaluation. In addition, we use 1,000 prompts randomly selected from the COCO [14] dataset for out-of-domain evaluation. The training set, validation set, and test set are independent of each other.

Comparison to other methods. We compare the prompts edited with our method to four types of prompts: the short primary prompts s, the original human-written prompts s′, and the prompts generated from the same short prompt s by the pre-trained GPT-2 [24] and by Promptist [9]. Human-written prompts are randomly chosen from user-provided prompt datasets like Lexica.art and DiffusionDB, while short prompts are the texts before the first commas.

Metrics. We utilize four metrics to evaluate the results of edited prompts: the Aesthetic score, the CLIP score [25], PickScore [12], and the CMMD score [11]. The Aesthetic score reflects the visual attractiveness of an image; higher values indicate better visual quality. The CLIP score evaluates the alignment between the generated image and the prompt. PickScore is an automatic measure used to comprehensively assess the visual quality and text alignment of images; larger values indicate greater consistency between the generated image and human preferences. CMMD offers a more accurate and consistent measure of image quality by not assuming a normal distribution of the data and by being efficient with small sample sizes; lower CMMD values indicate more realistic images. In our evaluation, we report the Aesthetic scores of the corresponding images and the CLIP scores between the short prompt and the images generated by the edited prompt. For PickScore, we report the relative pairwise comparison E[gPKS(s, Im) ≥ gPKS(s, I)] between the edited prompt sm and the short prompt s. We report CMMD between the generated images and the real images corresponding to the prompts in the COCO dataset.
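For the PickScore comparison described above, the reported number is the fraction of prompts whose edited-prompt image is preferred at least as much as the short-prompt image. A small sketch of that computation, with pick_score left as an assumed scorer:

```python
def pairwise_pickscore_win_rate(pairs, pick_score):
    """pairs: list of (short_prompt, image_from_edited_prompt, image_from_short_prompt).
    Returns E[g_PKS(s, I_m) >= g_PKS(s, I)] as a percentage."""
    wins = sum(pick_score(s, img_edited) >= pick_score(s, img_short)
               for s, img_edited, img_short in pairs)
    return 100.0 * wins / len(pairs)
```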

[Figure 3 panels: four short prompts ("illustration studio portrait of three friends dancing at a concert", "a beautiful portrait of a symmetric woman face", "cats in suits smoking cigars together", "a black broken tv sitting in the desert"), each shown with the corresponding Promptist prompt and our DF-Prompt.]

Figure 3. Generated images using Stable Diffusion v1.4 with short prompts, Promptist [9], and our method. In each column, the images
are generated using the same random seed. Our method shows the ability to moderately expand the semantic content, such as “in a scenic
environment”, “with gorgeous hair face illustration”, “on a ship deck” and “for 50 years.” These expansions stimulate users’ imagination
while enhancing the comprehensiveness and aesthetic quality of the image.

[Figure 4 panels. Short Prompt: "biblic mecha cyberpunk soldier". Refined Prompt: "biblic mecha cyberpunk soldier, sci - fi, fantasy, intricate, elegant, highly detailed, digital painting, artstation". DF-Prompt: "biblic mecha cyberpunk soldier, [sci - fi:0.5↦0:0.5], [fantasy:1↦0.5:1.0], [intricate:1↦0:1.0], [elegant:1↦0.5:1.0], [highly detailed:1↦0.5:1.0], [digital painting:1↦0:0.75], [artstation:0.5↦0:0.75]".]

Figure 4. Our method generates the DF-Prompt, which corresponds to generated images with more detailed textures and a richer background, yielding a better visual effect than the refined prompt. The images in each column are generated using the same random seed.

4.2. Implementation Details

For data collection, model training, and evaluation, we use Stable Diffusion v1.4 [29] with the UniPC solver [43] and set the number of inference time steps to 10.

Supervised fine-tuning. Empirically, we find that when training with the default settings for both effect range and weight (τi = [1↦0] and wi = 1) as a one-point distribution, the policy model is prone to overfitting to these settings. To address this, we apply a strategy similar to Label Smoothing [35] in the first stage to enhance the model's learning process. This strategy involves sampling discrete values from Gaussian distributions. The means of these distributions are consistent with the default settings for effect range and weight, and they share a uniform variance σ. The frequency of different joint settings is shown by the dotted line marked "label" in Fig. 5 (b∼d). This random sampling from Gaussian distributions aims to diversify the training signal in the first stage, thereby enabling better generalization in the second stage. For the model structure of EReP, we load the pre-trained GPT-2 Medium [24] weights and add two linear heads directly to approximate the distributions. The distributions predicted by these heads are used to supervise the effect range probability distribution and weight distribution predicted by EDFP. We train the model for 50k steps, using a batch size of 64 and a learning rate of 5 × 10−5, with the Adam optimizer. The block size is 256. To avoid the model learning fixed patterns, we introduce variability by randomly altering the case of the prompt's first letter and replacing commas with periods with 50% probability. In our implementation, phrases separated by commas share the same effect range and weight, calculated using the mode of the range and weight among these phrases.
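A minimal sketch of the Gaussian sampling described above: rather than always assigning the default effect range [1↦0] and weight 1, each training triple samples its discrete setting with probabilities given by a Gaussian centered on the default. The index ordering of the discrete spaces and the use of numpy are our own assumptions; σ = 0.5 follows the ablation setting.

```python
import numpy as np

RANGES = ["0.5->0", "1->0", "1->0.5"]     # discrete effect ranges, default "1->0"
WEIGHTS = [0.5, 0.75, 1.0, 1.25, 1.5]     # discrete weights, default 1.0

def gaussian_probs(values, mean_index, sigma):
    """Discrete probabilities proportional to a Gaussian density around the default index."""
    logits = -0.5 * ((np.arange(len(values)) - mean_index) / sigma) ** 2
    p = np.exp(logits)
    return p / p.sum()

def sample_stage1_setting(sigma=0.5, rng=None):
    """Sample a (range, weight) label for one modifier during Stage 1 training."""
    rng = rng or np.random.default_rng()
    r = rng.choice(len(RANGES), p=gaussian_probs(RANGES, mean_index=1, sigma=sigma))   # centered on "1->0"
    w = rng.choice(len(WEIGHTS), p=gaussian_probs(WEIGHTS, mean_index=2, sigma=sigma)) # centered on 1.0
    return RANGES[r], WEIGHTS[w]
```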

Online reinforcement learning. In our experiments, we follow the approach of Hao et al. [9] and set ζ = 0.28 in the reward function. The stability of the rewards is crucial; to ensure it, we calculate the reward by generating two images per prompt. We train both the policy and value models for 3,000 episodes, each with a batch size of 32. For optimization, we set the learning rate to 5 × 10−5 and employ the Adam optimizer for both models, setting its hyper-parameters β1 to 0.9 and β2 to 0.95. The KL coefficient η is 0.02. To save memory, we use a simplified version of the PPO algorithm that processes one PPO epoch per batch.

Figure 5. (a) The 15 most frequently generated modifiers. (b∼d) The frequency of different combinations of settings.

Method         PickScore (↑)   CLIP Score (↑)   Aes Score (↑)
Short Prompt   -               0.28             5.58
GPT-2          47.9%           0.25             5.38
Human          72.5%           0.26             6.07
Promptist      68.4%           0.27             6.11
PAE (Ours)     73.9%           0.26             6.12
Table 1. Quantitative comparison on Lexica.art.

Method         CMMD (↓)
Promptist      1.147
PAE (Ours)     1.125
Table 2. Quantitative comparison using the CMMD metric.

Method         PickScore (↑)   CLIP Score (↑)   Aes Score (↑)
Short Prompt   -               0.28             5.58
GPT-2          48.1%           0.25             5.40
Human          70.5%           0.26             5.84
Promptist      62.3%           0.27             6.06
PAE (Ours)     64.4%           0.26             6.07
Table 3. Quantitative comparison on DiffusionDB.

Method         PickScore (↑)   CLIP Score (↑)   Aes Score (↑)
Short Prompt   -               0.27             5.37
GPT-2          51.2%           0.25             5.24
Promptist      53.4%           0.25             6.15
PAE (Ours)     53.8%           0.25             6.09
Table 4. Quantitative comparison on COCO.

4.3. Evaluation and Analysis

Qualitative analysis. As shown in Fig. 3, starting from the short prompts, PAE adds texture-related terms like "highly detailed", artist names such as "justin gerard and artgerm", and highly aesthetics-related words such as "elegant" and "artstation" to enhance the aesthetic quality of the generated images. As shown in Fig. 4, the DF-Prompt generated by our method provides finer control than the refined prompt.

Quantitative comparison. We evaluate PAE on two in-domain datasets: Lexica.art and DiffusionDB. As shown in Tab. 1 and Tab. 3, PAE surpasses the other methods in terms of Aesthetic Score, and it achieves a human preference PickScore that closely mirrors that of the human-written prompts. This suggests that PAE aligns well with human aesthetic preferences. Additionally, we evaluate PAE on the out-of-domain dataset COCO. As shown in Tab. 4, PAE outperforms the other methods in terms of PickScore. This consistent performance across various datasets demonstrates the robustness and versatility of PAE. Furthermore, as shown in Tab. 2, PAE outperforms Promptist with lower CMMD scores [11]. This shows that the prompts edited by our method generate images of superior quality and enhanced realism.

Statistical analysis of text. We apply our method to 3,500 prompts, gathering DF-Prompt tokens from the policy model. The 15 most frequently generated modifiers are displayed in Fig. 5 (a). They mainly pertain to art trends such as "artstation", artist names like "WLOP", art styles and types such as "digital painting" and "illustration", and texture-related terms like "highly detailed" and "smooth". These modifiers subtly boost the artistic vibe without significantly altering the prompt's semantics. In Fig. 5 (b∼d), the red dotted lines indicate the frequency of the label case detailed in Sec. 4.2. We observe several phenomena and attempt to interpret them: 1) In (c), most of the terms mentioned above appear more frequently than the label case under the 1↦0 and 0.5↦0 settings. This suggests that these effect ranges yield higher rewards during training when the weight is 1.0, so the policy model leans towards selecting them. 2) Also in (c), the 0.5↦0 setting outperforms the 1↦0.5 setting. This suggests that injecting texture-related terms and art styles (except "smooth" and "illustration") into the final 50% of the diffusion time steps is more effective than injecting them into the first 50%. The latter half of the diffusion time steps is typically when image details and structure start to form; hence, it is optimal to introduce texture-related terms and art styles at this stage, as they can directly impact the image's details and structure. Conversely, introducing these elements in the initial 50% of the diffusion time steps may not significantly influence the final image, as they could be overwhelmed by subsequent diffusion steps while the image is still relatively unstructured. 3) Comparing the 1↦0 setting in (b) with that in (d), the setting with weight = 0.75 occurs more frequently than weight = 1.25.

By assigning a lower weight (0.75), the prompt effectively instructs the generative model to pay less attention to these tokens. This could lead the model to consider all tokens more evenly when generating images, resulting in a more balanced and potentially superior outcome. Furthermore, these elements (like "digital painting", "concept art", "artstation", etc.) are inherently complex and can be interpreted in various ways. If the model focuses excessively on such tokens (due to the higher weight of 1.25), it might struggle to generate coherent images because of the concepts' complexity and ambiguity. Note that the aforementioned observations merely reflect trends; different prompts may have different optimal choices, which is why our method is necessary.

4.4. Ablation Study

We conduct ablation experiments on the DiffusionDB validation set to examine the effects of different data settings, training settings, and prompt types.

Data Settings. The main parameters associated with the training data are the variance σ in Sec. 4.2 and the tolerance constant γ in Eq. (1). As shown in Tab. 5, the setting σ = 0.5, γ = 0.01 obtains the highest aesthetic score, so we choose it as the parameter setting for the other experiments.

Data Settings          CLIP Score (↑)   Aes Score (↑)
σ = 0.5, γ = 0.00      0.26             6.01
σ = 0.5, γ = 0.01      0.26             6.03
σ = 1.0, γ = 0.00      0.26             5.95
σ = 1.0, γ = 0.01      0.26             5.94
Table 5. Ablation experiments on data hyperparameters. We validate the results of the first-stage model EReP at 50k steps on the DiffusionDB validation set.

Training Settings. In our method, the reward is primarily influenced by three main parameters: α, β, and κ, as outlined in Eq. (4). In Tab. 6, we observe that κ = 18 achieves a higher PickScore, while the CLIP score and Aes Score remain relatively consistent compared to other values of κ. Comparing (1) and (2), setting β = 1 results in a significant increase in the CLIP score but leads to a decrease in both the aesthetic score and PickScore, compared to β = 0. Furthermore, comparing (2) and (3), we find that increasing α boosts both of the latter scores, albeit at the cost of the CLIP score. Given that our task is primarily aimed at enhancing human preferences and aesthetics without causing significant semantic deviations, we choose α = 1, β = 0, κ = 18 for the other experiments. We also demonstrate the improvement brought by the second stage of training: in Tab. 7, compared with EReP, the policy model EDFP brings comprehensive improvement.

Reward Settings              Pick* (↑)   CLIP (↑)   Aes (↑)
(1) α = 1, β = 0, κ = 16     53.8%       0.26       6.01
    α = 1, β = 0, κ = 18     58.0%       0.26       6.04
    α = 1, β = 0, κ = 20     56.4%       0.26       6.05
(2) α = 1, β = 1, κ = 16     3.8%        0.28       5.56
    α = 1, β = 1, κ = 18     9.6%        0.28       5.54
    α = 1, β = 1, κ = 20     5.2%        0.28       5.56
(3) α = 5, β = 1, κ = 18     52.0%       0.26       5.97
    α = 10, β = 1, κ = 18    57.0%       0.26       5.93
* To highlight the disparity, we report the measure E[gPKS(s, Im) > gPKS(s, I)].
Table 6. Ablation experiments on different parameters of the reward. The second-stage model EDFP is trained for 1,000 episodes.

Method   PickScore* (↑)   CLIP Score (↑)   Aes Score (↑)   Reward (↑)
EReP     53.8%            0.26             6.03            4.49
EDFP     57.8%            0.26             6.07            4.58
Table 7. Comparison between the initial model EReP and the second-stage model EDFP trained over 3,000 episodes.

Ablation experiments on different episodes. As shown in Fig. 6 (a), the policy model achieves its peak reward after 3,000 episodes of training. Consequently, we adopt 3,000 episodes as the standard setting for the other experiments.

DF-prompt format. As shown in Fig. 6 (b), when the other settings remain the same, the reward increases when the policy model outputs the DF-Prompt format instead of the plain prompt format. This indicates that, compared to plain prompts, DF-Prompts enhance the aesthetic appeal of the generated images. They also strengthen the alignment between the image and the prompt, making the image more in line with human preferences.

Figure 6. (a) The relationship between episode and reward. (b) Ablation experiments with different prompt types.

5. Conclusion

In this paper, we propose PAE, a novel method for automatically editing prompts to improve the quality of images generated by a pre-trained text-to-image model. Unlike existing methods that require heuristic human engineering of prompts, PAE automatically edits input prompts and provides more flexible and fine-grained control. Experimental evaluations demonstrate the effectiveness and efficiency of PAE, which exhibits strong generalization abilities and performs well on both in-domain and out-of-domain data.

Acknowledgment. This work was supported in part by the National Natural Science Foundation of China No. 62376277, Beijing Outstanding Young Scientist Program No. BJJWZYJH012019100020098, and Public Computing Cloud, Renmin University of China.

References

[1] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT backbone for diffusion models. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22669–22679, 2023.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv, abs/2005.14165, 2020.
[3] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res., 24:240:1–240:113, 2023.
[4] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: Open domain image generation and editing with natural language guidance. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXVII, pages 88–105. Springer, 2022.
[5] Nassim Dehouche and Kullathida Dehouche. What is in a text-to-image prompt: The potential of stable diffusion in visual arts education. CoRR, abs/2301.01902, 2023.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
[7] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12868–12878, 2020.
[8] David E. Goldberg. Genetic Algorithms in Search Optimization and Machine Learning. Addison-Wesley, 1989.
[9] Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation. CoRR, abs/2212.09611, 2022.
[10] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47:1–47:33, 2022.
[11] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking FID: Towards a better evaluation metric for image generation. CoRR, abs/2401.09603, 2024.
[12] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569, 2023.
[13] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
[15] Vivian Liu and Lydia B. Chilton. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022.
[16] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
[17] OpenAI. GPT-4 technical report. arXiv, abs/2303.08774, 2023.
[18] Jonas Oppenlaender. The creativity of text-to-image generation. In 25th International Academic Mindtrek Conference, Academic Mindtrek 2022, Tampere, Finland, November 16-18, 2022, pages 192–202, 2022.
[19] Jonas Oppenlaender. A taxonomy of prompt modifiers for text-to-image generation. arXiv, abs/2204.13988, 2022.
[20] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In NeurIPS, 2022.
[21] Nikita Pavlichenko and Dmitry Ustalov. Best prompts for text-to-image models and how to find them. arXiv, abs/2209.11711, 2022.
[22] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
[23] Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. OpenAI, 2018.
[24] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, pages 8748–8763. PMLR, 2021.
[26] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, pages 8821–8831. PMLR, 2021.
[27] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. CoRR, abs/2204.06125, 2022.
[28] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv, abs/2204.06125, 2022.
[29] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2022.
[30] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
[31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
[32] Jascha Narain Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv, abs/1503.03585, 2015.
[33] Xingzhe Su, Wenwen Qiang, Jie Hu, Fengge Wu, Changwen Zheng, and Fuchun Sun. Intriguing property and counterfactual explanation of GAN for remote sensing image generation, 2023.
[34] Xingzhe Su, Wenwen Qiang, Zeen Song, Hang Gao, Fengge Wu, and Changwen Zheng. A unified GAN framework regarding manifold alignment for remote sensing images generation. arXiv preprint arXiv:2305.19507, 2023.
[35] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2015.
[36] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
[37] Kailas Vodrahalli and James Zou. ArtWhisperer: A dataset for characterizing human-AI interactions in artistic creations. CoRR, abs/2306.08141, 2023.
[38] Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 893–911, Toronto, Canada, 2023. Association for Computational Linguistics.
[39] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Comput., 1(2):270–280, 1989.
[40] Yutong Xie, Zhaoying Pan, Jinge Ma, Luo Jie, and Qiaozhu Mei. A prompt log analysis of text-to-image generation systems. In Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023, pages 3892–3902. ACM, 2023.
[41] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. arXiv, abs/2304.05977, 2023.
[42] Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile Diffusion: Text, images and variations all in one diffusion model. CoRR, abs/2211.08332, 2022.
[43] Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. arXiv, abs/2302.04867, 2023.
[44] Wanrong Zhu, Xinyi Wang, Yujie Lu, Tsu-Jui Fu, Xin Eric Wang, Miguel P. Eckstein, and William Yang Wang. Collaborative generative AI: Integrating GPT-k for efficient editing in text-to-image generation. CoRR, abs/2305.11317, 2023.
