Training Diffusion Models With Reinforcement Learning
Figure 1 (Reinforcement learning for diffusion models) We propose a reinforcement learning algorithm, DDPO, for
optimizing diffusion models on downstream objectives such as compressibility, aesthetic quality, and prompt-image
alignment as determined by vision-language models. Each row shows a progression of samples for the same prompt and
random seed over the course of training.
align with a variety of user-specified objectives. We compare reward-weighted regression approaches, denoted RWR, to our proposed policy gradient approaches, denoted DDPO. We evaluate four reward functions: compressibility and incompressibility, as determined by the JPEG compression algorithm; aesthetic quality, as determined by the LAION aesthetic quality predictor (Schuhmann, 2022); and prompt-image alignment, as determined by the LLaVA VLM (Liu et al., 2023). Full details of the algorithms and reward functions are provided in Appendices C and D, respectively. Additional experiments studying zero-shot generalization and reward overoptimization are provided in Appendices E.1 and E.2, respectively.

2.1. Algorithm Comparisons

We begin by evaluating all methods on the compressibility, incompressibility, and aesthetic quality tasks, as these tasks isolate the effectiveness of the RL approach from considerations relating to automated VLM reward evaluation. We use Stable Diffusion v1.4 (Rombach et al., 2022) as the base model for all experiments. Compressibility and incompressibility prompts are sampled uniformly from all 398 animals in the ImageNet-1000 (Deng et al., 2009) categories. Aesthetic quality prompts are sampled uniformly from a smaller set of 45 common animals.

As shown qualitatively in Figure 2, DDPO is able to effectively adapt a pretrained model with only the specification of a reward function and without any further data curation. The strategies found to optimize each reward are nontrivial; for example, to maximize LAION-predicted aesthetic quality, DDPO transforms a model that produces naturalistic images into one that produces stylized line drawings. To maximize compressibility, DDPO removes backgrounds and applies a Gaussian blur to what remains. To maximize incompressibility, DDPO finds artifacts that are difficult for the JPEG compression algorithm to encode, such as high-frequency noise and sharp edges, and occasionally produces multiple entities. Samples from RWR are provided in Appendix H for comparison.

We provide a quantitative comparison of all methods in Figure 3. We plot the attained reward as a function of the number of reward queries.
[Figure 2 panels: Pretrained, Aesthetic Quality, Compressibility, Incompressibility]

Figure 2 (DDPO samples) Qualitative depiction of the effects of RL fine-tuning on different reward functions. DDPO transforms naturalistic images into stylized line drawings to maximize predicted aesthetic quality, removes background content and applies a foreground blur to maximize compressibility, and adds artifacts and high-frequency noise to maximize incompressibility.

(Zhang et al., 2020) could correspond to large differences in quality. It is important to note that some of the prompts in the finetuning set, such as “a dolphin riding a bike”, had zero success rate from the base model; if trained in isolation, this prompt would be unlikely to ever improve because there would be no reward signal. It was only via transfer between prompts that these particular prompts could improve.

Nearly all of the samples become more cartoon-like or artistic during finetuning. This was not optimized for directly. We hypothesize that this is a function of the pretraining distribution; though it would be extremely rare to see a photorealistic image of a bear washing dishes, it would be much less unusual to see the scene depicted in a children’s book. As a result, in the process of satisfying the content of the prompt, the style of the samples also changes.
[Figure 3 plots: Aesthetic Quality (y-axis: LAION Aesthetic Score), JPEG Compressibility and JPEG Incompressibility (y-axis: Filesize (kb)); x-axis: Reward Queries]
Figure 3 (Finetuning effectiveness) The relative effectiveness of different RL algorithms on three reward functions. We
find that the policy gradient variants, denoted DDPO, are more effective optimizers than both RWR variants.
[Figure 4: (L) sample progression for the prompt “a bear washing dishes”; (R) score vs. Reward Queries curves for the activities “. . . riding a bike”, “. . . playing chess”, and “. . . washing dishes”]
Figure 4 (Prompt alignment) (L) Progression of samples for the same prompt and random seed over the course of training.
The images become significantly more faithful to the prompt. The samples also adopt a cartoon-like style, which we
hypothesize is because the prompts are more likely depicted as illustrations than realistic photographs in the pretraining
distribution. (R) Quantitative improvement of prompt alignment. Each thick line is the average score for an activity, while
the faint lines show average scores for a few randomly selected individual prompts.
References

Ajay, A., Du, Y., Gupta, A., Tenenbaum, J., Jaakkola, T., and Agrawal, P. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.

Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion Policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.

Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Neural Information Processing Systems, 2017.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, 2009.

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., and Salimans, T. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

Janner, M., Du, Y., Tenenbaum, J., and Levine, S. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, 2022.
Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pp. 267–274, 2002.

Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. In Neural Information Processing Systems, 2021.

Knox, W. B. and Stone, P. TAMER: Training an agent manually via evaluative reinforcement. In International Conference on Development and Learning, 2008.

Lee, K., Liu, H., Ryu, M., Watkins, O., Du, Y., Boutilier, C., Abbeel, P., Ghavamzadeh, M., and Gu, S. S. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.

Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. arXiv preprint arXiv:2206.01714, 2022.

Menick, J., Trebacz, M., Mikulik, V., Aslanides, J., Song, F., Chadwick, M., Glaese, M., Young, S., Campbell-Gillingham, L., Irving, G., and McAleese, N. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147, 2022.

Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. Monte Carlo gradient estimation in machine learning. The Journal of Machine Learning Research, 21(1):5183–5244, 2020.

Nair, A., Dalal, M., Gupta, A., and Levine, S. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.

Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button, K., Knight, M., Chess, B., and Schulman, J. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.

Nguyen, K., Daumé III, H., and Boyd-Graber, J. Reinforcement learning for bandit neural machine translation with simulated human feedback. In Empirical Methods in Natural Language Processing, 2017.

Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, 2021.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. CoRR, abs/1910.00177, 2019. URL https://arxiv.org/abs/1910.00177.

Peters, J. and Schaal, S. Reinforcement learning by reward-weighted regression for operational space control. In International Conference on Machine Learning, 2007.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.

Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.

Schneuing, A., Du, Y., Harris, C., Jamasb, A., Igashov, I., Du, W., Blundell, T., Lió, P., Gomes, C., Welling, M., Bronstein, M., and Correia, B. Structure-based drug design with equivariant diffusion models. arXiv preprint arXiv:2210.13695, 2022.

Schuhmann, C. LAION aesthetics, Aug 2022. URL https://laion.ai/blog/laion-aesthetics/.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP.

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. In Neural Information Processing Systems, 2020.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Solla, S., Leen, T., and Müller, K. (eds.), Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf.

Zhou, L., Du, Y., and Wu, J. 3D shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5826–5835, 2021.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
A. Related Work
Denoising diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) have emerged as an effective class of generative
models for modalities including images (Ramesh et al., 2021; Saharia et al., 2022), videos (Ho et al., 2022; Singer et al.,
2022), 3D shapes (Zhou et al., 2021; Zeng et al., 2022), and robotic trajectories (Janner et al., 2022; Ajay et al., 2022; Chi
et al., 2023). While the denoising objective is conventionally derived as an approximation to likelihood, the training of
diffusion models typically departs from maximum likelihood in several ways that improve sample quality in practice (Ho
et al., 2020). Modifying the objective to more strictly optimize likelihood (Nichol & Dhariwal, 2021; Kingma et al., 2021)
often leads to worsened image quality, as likelihood is not a faithful proxy for visual quality. In this paper, we show how
diffusion models can be optimized directly for downstream objectives.
Recent progress in text-to-image diffusion models (Ramesh et al., 2021; Saharia et al., 2022) has enabled fine-grained high-
resolution image synthesis. To further improve the controllability and quality of diffusion models, recent approaches have
investigated finetuning on limited user-provided data (Ruiz et al., 2022), optimizing text embeddings for new concepts (Gal
et al., 2022), composing models (Du et al., 2023; Liu et al., 2022), adapters for additional input constraints (Zhang &
Agrawala, 2023), and inference-time techniques such as classifier (Dhariwal & Nichol, 2021) and classifier-free (Ho &
Salimans, 2021) guidance.
A number of works have studied using human feedback to optimize models in settings such as simulated robotic control
(Christiano et al., 2017), game-playing (Knox & Stone, 2008), machine translation (Nguyen et al., 2017), citation retrieval
(Menick et al., 2022), browsing-based question-answering (Nakano et al., 2021), summarization (Stiennon et al., 2020;
Ziegler et al., 2019), instruction-following (Ouyang et al., 2022), and alignment with specifications (Bai et al., 2022a).
Recently, Lee et al. (2023) studied the alignment of text-to-image diffusion models to human preferences using a method
based on reward-weighted likelihood maximization, and posited that finetuning with RL is a promising direction for
future work. In our comparisons, their method roughly corresponds to one iteration of the RWR method, though precise
implementation details are likely different. Our results demonstrate that DDPO significantly outperforms even multiple
iterations of weighted likelihood maximization (RWR-style) optimization. More generally, our aim is not to study learning from human feedback per se, but to develop general algorithms compatible with a variety of reward functions.
B. Preliminaries
In this section, we provide a brief background on diffusion models and the RL problem formulation.

Diffusion models. Conditional denoising diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) represent a distribution p(x0 | c) over samples x0 given contexts c. Training corrupts samples with Gaussian noise and learns a predictor µθ(xt, t, c) that reverses the corruption by minimizing a denoising objective of the form

LDDPM(θ) = E[ ∥µ̃(x0, xt) − µθ(xt, t, c)∥² ],   (1)

where µ̃ is a weighted average of x0 and xt. This objective is justified as maximizing a variational lower bound on the
model log-likelihood (Ho et al., 2020).
Sampling from a diffusion model begins with sampling xT ∼ N (0, I) and using the reverse process pθ (xt−1 | xt , c) to
produce a trajectory {xT , xT −1 , . . . , x0 } ending with a sample x0 . The reverse process depends not only on the predictor
µθ but also the choice of sampler. Most popular samplers (Ho et al., 2020; Song et al., 2021) use an isotropic Gaussian
reverse process with a fixed timestep-dependent variance:
pθ(xt−1 | xt, c) = N(xt−1 | µθ(xt, t, c), σt² I).   (2)

Markov decision processes. A Markov decision process (MDP) is a formalization of sequential decision-making problems, defined by a tuple (S, A, ρ0, P, R), in which S is the state space, A is the action space, ρ0 is the initial state distribution, P is the
transition kernel, and R is the reward function. At each timestep t, the agent observes a state st ∈ S, takes an action at ∈ A,
receives a reward R(st , at ), and transitions to a new state st+1 ∼ P (· | st , at ). An agent acts according to a policy π(a | s).
As the agent acts in the MDP, it produces trajectories, which are sequences of states and actions τ =
(s0 , a0 , s1 , a1 , . . . , sT , aT ). The reinforcement learning (RL) objective for the agent is to maximize JRL (π), the expected
cumulative reward over trajectories sampled from its policy:
JRL(π) = Eτ∼p(·|π) [ Σ_{t=0}^{T} R(st, at) ].
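For concreteness, a Monte Carlo estimate of this objective averages cumulative rewards over trajectories sampled from the policy. The sketch below is purely illustrative; the function and variable names are ours and not taken from any particular implementation.

def estimate_rl_objective(trajectories, reward_fn):
    # Each trajectory is a sequence of (state, action) pairs sampled by running the policy;
    # J_RL is estimated as the mean cumulative reward across sampled trajectories.
    returns = [sum(reward_fn(s, a) for s, a in traj) for traj in trajectories]
    return sum(returns) / len(returns)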
C. Algorithm Details
We now describe how RL algorithms can be used to train diffusion models. We present two classes of methods, one based
on prior work and one novel, and show that each corresponds to a different mapping of the denoising process to the MDP
framework.
wRWR(x0, c) = (1/Z) exp(β R(x0, c)),
where β is an inverse temperature and Z is a normalization constant. We also consider a simplified weighting scheme that
uses binary weights,
wsparse(x0, c) = 1[R(x0, c) ≥ C],
where C is a reward threshold determining which samples are used for training. The sparse weights may be desirable
because they eliminate the need to retain every sample from the model.
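As a minimal sketch of the two weighting schemes (assuming, as one common choice rather than a detail taken from this work, that the normalization constant Z is estimated over the sampled batch):

import numpy as np

def rwr_weights(rewards, beta):
    # Exponentiated rewards with a softmax-style normalization over the batch.
    logits = beta * np.asarray(rewards, dtype=np.float64)
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def sparse_weights(rewards, threshold):
    # Binary weights: keep only samples whose reward clears the threshold C.
    return (np.asarray(rewards) >= threshold).astype(np.float64)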
Within the RL formalism, the RWR procedure corresponds to the following one-step MDP:

s ≜ c        a ≜ x0        π(a | s) ≜ pθ(x0 | c)        ρ0(s) ≜ p(c)        R(s, a) ≜ r(x0, c),

with a transition kernel P that immediately leads to an absorbing termination state. Therefore, maximizing JDDRL(θ) is
equivalent to maximizing JRL (π) in this MDP.
Weighting a maximum likelihood objective by wRWR approximately optimizes JRL (π) subject to a KL divergence constraint
on the policy (Nair et al., 2020). However, LDDPM is not an exact maximum likelihood objective, but is derived from a
reweighted variational bound. Therefore, RWR algorithms applied to LDDPM optimize JDDRL via two levels of approximation.
Thus, this methodology provides us with a starting point, but might underperform for complex objectives.
9
C.3. Denoising Diffusion Policy Optimization
RWR relies on an approximate maximum likelihood objective because it ignores the sequential nature of the denoising
process, only using the final samples x0 . In this section, we show that when the sampler is fixed, the denoising process can
be reframed as a multi-step MDP. This allows us to directly optimize JDDRL using policy gradient estimators. We refer to
the resulting class of algorithms as denoising diffusion policy optimization (DDPO) and present two variants.
Denoising as a multi-step MDP. We map the iterative denoising procedure to the following MDP:
st ≜ (c, t, xt)        at ≜ xt−1        π(at | st) ≜ pθ(xt−1 | xt, c)
ρ0(s0) ≜ (p(c), δT, N(0, I))        P(st+1 | st, at) ≜ (δc, δt−1, δxt−1)
R(st, at) ≜ r(x0, c) if t = 0, and 0 otherwise,
in which δy is the Dirac delta distribution with nonzero density only at y. Trajectories consist of T timesteps, after which
P leads to a termination state. The cumulative reward of each trajectory is equal to r(x0 , c), so maximizing JDDRL (θ) is
equivalent to maximizing JRL (π) in this MDP.
The benefit of this formulation is that, if we use a standard sampler parameterized as in Equation 2, the policy π becomes an
isotropic Gaussian as opposed to an arbitrarily complicated distribution induced by the entire denoising procedure. This
simplification allows for the evaluation of exact action likelihoods and gradients of these likelihoods with respect to the
diffusion model parameters.
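For example, the exact log-likelihood of a denoising action under this Gaussian policy can be computed as follows (a PyTorch-style sketch; the function name and the assumption that the sampler exposes its per-step mean and standard deviation are ours):

import torch

def denoising_step_logprob(x_prev, mean, sigma_t):
    # Exact log-likelihood of one reverse step, log N(x_prev | mean, sigma_t^2 I),
    # summed over all non-batch dimensions.
    dist = torch.distributions.Normal(loc=mean, scale=sigma_t)
    return dist.log_prob(x_prev).sum(dim=tuple(range(1, x_prev.ndim)))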
Policy gradient estimation. With access to likelihoods and likelihood gradients, we can make Monte Carlo estimates
of the policy gradient ∇θ JDDRL . DDPO alternates collecting trajectories {xT , xT −1 , . . . , x0 } via sampling and updating
parameters via gradient ascent on JDDRL .
The first variant of DDPO, which we call DDPOSF , uses the score function policy gradient estimator, also known as the
likelihood ratio method or REINFORCE (Williams, 1992; Mohamed et al., 2020):

ĝSF = E[ Σ_{t=0}^{T} ∇θ log pθ(xt−1 | c, t, xt) r(x0, c) ],   (3)
where the expectation is taken over denoising trajectories generated by the current policy pθ .
This estimator is unbiased. However, it only allows for one step of optimization per round of data collection, as the gradients
must be estimated using data from the current policy. To perform multiple steps of optimization, we may use an importance
sampling estimator (Kakade & Langford, 2002):

ĝIS = E[ Σ_{t=0}^{T} (pθ(xt−1 | c, t, xt) / pθold(xt−1 | c, t, xt)) ∇θ log pθ(xt−1 | c, t, xt) r(x0, c) ],   (4)
where θold are the parameters used to collect the data, and the expectation is taken over denoising trajectories generated
by the corresponding policy pθold . This estimator also becomes inaccurate if pθ deviates too far from pθold , which can be
addressed using trust regions (Schulman et al., 2015) to constrain the size of the update. In practice, we implement the trust
region by clipping the importance weights, as introduced in proximal policy optimization (Schulman et al., 2017). We call
this variant DDPOIS .
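To make the clipped importance-sampling update concrete, a per-timestep loss can be sketched as below (PyTorch-style; the tensor shapes, the clip range, and the use of the raw reward as the advantage are simplifying assumptions rather than the reference implementation):

import torch

def ddpo_is_loss(logp_new, logp_old, rewards, clip_range=1e-4):
    # logp_new, logp_old: (batch, T) per-step log-likelihoods log p_theta(x_{t-1} | c, t, x_t).
    # rewards: (batch,) final rewards r(x_0, c), broadcast to every timestep.
    ratio = torch.exp(logp_new - logp_old)
    adv = rewards[:, None]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * adv
    # PPO-style pessimistic bound; negated so that minimizing the loss ascends the objective.
    return -torch.minimum(unclipped, clipped).mean()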
[Figure 5 diagram: prompt “a monkey washing dishes...” → Diffusion Model → generated image → LLaVA (“what is happening in this image?”) → description “a monkey is...” → similarity-based reward via BERTScore]
Figure 5 (VLM reward function) Illustration of the VLM-based reward function for prompt-image alignment. LLaVA
(Liu et al., 2023) provides a short description of a generated image; the reward is the similarity between this description and
the original prompt as measured by BERTScore (Zhang et al., 2020).
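A minimal sketch of this pipeline is shown below. Here query_llava stands in for whatever interface is used to obtain a LLaVA description (it is not a real API), while bert_score is the public BERTScore package; using the recall component is one reasonable way to reduce BERTScore to a scalar, not necessarily the exact choice made here.

from bert_score import score as bert_score

def vlm_alignment_reward(image, prompt, query_llava):
    # Ask the VLM to describe the generated image.
    description = query_llava(image, "what is happening in this image?")
    # Similarity between the description (candidate) and the original prompt (reference).
    precision, recall, f1 = bert_score([description], [prompt], lang="en")
    return recall.item()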
E. Additional Experiments
E.1. Generalization
RL finetuning on large language models has been shown to produce interesting generalization properties; for example,
instruction finetuning almost entirely in English has been shown to improve capabilities in other languages (Ouyang et al.,
2022). It is difficult to reconcile this phenomenon with our current understanding of generalization; it would a priori seem
more likely for finetuning to have an effect only on the finetuning prompt set or distribution. In order to investigate the same
phenomenon with diffusion models, Figure 6 shows a set of DDPO-finetuned model samples corresponding to prompts
that were not seen during finetuning. In concordance with instruction-following transfer in language modeling, we find
that the effects of finetuning do generalize, even with prompt distributions as narrow as 45 animals. We find evidence of
generalization both to animals outside of the training distribution and to non-animal everyday objects.
[Figure 6 panels: Pretrained (New Animals and Activities) | Alignment (New Animals and Activities)]
Figure 6 (Generalization) For aesthetic quality, finetuning on a limited set of 45 animals generalizes to both new animals
and non-animal everyday objects. For prompt alignment, finetuning on the same set of animals and only three activities
generalizes to new animals, new activities, and even combinations of the two. The prompts for the bottom row (left
to right) are: “a capybara washing dishes”, “a crab playing chess”, “a parrot driving a car”, and “a horse typing on a
keyboard”. More samples are provided in Appendix H.
E.2. Overoptimization
Section 2.1 highlights the optimization problem: given a reward function, how well can an RL algorithm maximize
that reward? However, finetuning on a reward function, especially a learned one, has been observed to lead to reward
overoptimization or exploitation (Gao et al., 2022) in which the model learns to achieve high reward while moving too far
away from the pretraining distribution to be useful.
Our setting is no exception, and we provide two examples of reward exploitation in Figure 7. When optimizing the
incompressibility objective, the model eventually stops producing semantically meaningful content, degenerating into
high-frequency noise. Similarly, we observed that VLM reward pipelines are susceptible to typographic attacks (Goh et al.,
2021). When optimizing for alignment with respect to prompts of the form “n animals”, DDPO exploited deficiencies in the
VLM by instead generating text loosely resembling the specified number. There is currently no general-purpose method for
preventing overoptimization (Gao et al., 2022). We highlight this problem as an important area for future work.
[Figure 7 panels: Incompressibility | Counting Animals; sample rows labeled DDPO and RWR]
Figure 7 (Reward model overoptimization) Examples of RL overoptimizing reward functions. (L) The diffusion model
eventually loses all recognizable semantic content and produces noise when optimizing for incompressibility. (R) When
optimized for prompts of the form “n animals”, the diffusion model exploits the VLM with a typographic attack (Goh et al.,
2021), writing text that is interpreted as the specified number n instead of generating the correct number of animals.
F. Implementation Details
For all experiments, we use Stable Diffusion v1.4 (Rombach et al., 2022) as the base model and finetune only the UNet
weights while keeping the text encoder and autoencoder weights frozen.
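A minimal sketch of this setup using the diffusers library (an assumption about tooling on our part; any interface that exposes the UNet, text encoder, and autoencoder separately works the same way):

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
# Only the UNet receives gradients; the text encoder and VAE (autoencoder) stay frozen.
pipe.text_encoder.requires_grad_(False)
pipe.vae.requires_grad_(False)
pipe.unet.requires_grad_(True)
trainable_params = [p for p in pipe.unet.parameters() if p.requires_grad]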
# JPEG-based reward: encode the image and measure the resulting file size in kilobytes.
img = Image.fromarray(x)
buffer = io.BytesIO()
img.save(buffer, format="JPEG", quality=quality)
jpeg = buffer.getvalue()
num_bytes = np.frombuffer(jpeg, dtype=np.uint8)
return len(num_bytes) / 1000
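For completeness, the fragment above can be wrapped into self-contained reward functions roughly as follows (the function names and fixed quality setting are ours; negating the file size for compressibility matches the "Negative Filesize" convention plotted in Figure 8):

import io
from PIL import Image

def jpeg_filesize_kb(x, quality=95):
    # x: uint8 RGB image array of shape (H, W, 3); returns the JPEG-encoded size in kilobytes.
    img = Image.fromarray(x)
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=quality)
    return len(buffer.getvalue()) / 1000

def compressibility_reward(x):
    return -jpeg_filesize_kb(x)  # smaller files score higher

def incompressibility_reward(x):
    return jpeg_filesize_kb(x)   # larger files score higher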
G. Classifier-free Guidance Training

Text-to-image diffusion models are typically trained with classifier-free guidance (Ho & Salimans, 2021), in which conditional and unconditional ϵ-predictions are learned jointly and mixed at sampling time using a guidance weight w:

ϵ̃θ(xt, t, c) = w ϵθ(xt, t, c) + (1 − w) ϵθ(xt, t),
where ϵθ is the ϵ-prediction parameterization of the diffusion model (Ho et al., 2020) and ϵ̃θ is the guided ϵ-prediction that
is used to compute the next denoised sample.
For reinforcement learning, it does not make sense to train on the unconditional objective since the reward may depend on
the context. However, we found that when only training on the conditional objective, performance rapidly deteriorated after
the first round of finetuning. We hypothesized that this is due to the guidance weight becoming miscalibrated each time the
model is updated, leading to degraded samples, which in turn impair the next round of finetuning, and so on. Our solution
was to choose a fixed guidance weight and use the guided ϵ-prediction during training as well as sampling. We call this
procedure CFG training. Figure 8 shows the effect of CFG training on RWRsparse ; it has no effect after a single round of
finetuning, but becomes essential for subsequent rounds.
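A minimal sketch of one CFG-training loss term (PyTorch-style; the model call signature and the use of a plain MSE against the noise target are assumptions about the surrounding training code):

import torch

def cfg_training_loss(model, x_t, t, cond_emb, uncond_emb, eps_target, w):
    # Compute the guided epsilon-prediction with a fixed guidance weight w and regress it
    # against the usual noise target, matching the prediction that will be used at sampling time.
    eps_cond = model(x_t, t, cond_emb)
    eps_uncond = model(x_t, t, uncond_emb)
    eps_tilde = w * eps_cond + (1.0 - w) * eps_uncond
    return torch.mean((eps_tilde - eps_target) ** 2)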
[Figure 8 plot: JPEG Compressibility; y-axis: Negative Filesize (kb), ranging from −200 to 0]
Figure 8 (CFG training) We run the RWRsparse algorithm while optimizing only the conditional ϵ-prediction (without CFG
training), and while optimizing the guided ϵ-prediction (with CFG training). Each point denotes a diffusion model update.
We find that CFG training is essential for methods that do more than one round of interleaved sampling and training.
H. More Samples
Figure 9 shows qualitative samples from the baseline RWR method. Figure 10 shows more samples on seen prompts from
DDPO finetuning with the image-prompt alignment reward function. Figure 11 shows more examples of generalization
to unseen animals and everyday objects with the aesthetic quality reward function. Figure 12 shows more examples of
generalization to unseen subjects and activities with the image-prompt alignment reward function.
[Sample-figure pages (Figures 9–12). Panel labels: Pretrained, Aesthetic Quality, Compressibility, Incompressibility, Pretrained (New Animals), Aesthetic Quality (New Animals). Example prompts: “a capybara washing dishes”, “a snail playing chess”.]