
Training Diffusion Models with Reinforcement Learning

Kevin Black*1, Michael Janner*1, Yilun Du2, Ilya Kostrikov1, Sergey Levine1

Abstract

Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation.

1. Introduction

Diffusion probabilistic models (Sohl-Dickstein et al., 2015) have recently emerged as the de facto standard for generative modeling in continuous domains. Their flexibility in representing complex, high-dimensional distributions has led to the adoption of diffusion models in applications including image and video synthesis (Ramesh et al., 2021; Saharia et al., 2022; Ho et al., 2022), drug and material design (Xu et al., 2021; Xie et al., 2021; Schneuing et al., 2022), and continuous control (Janner et al., 2022; Wang et al., 2022; Hansen-Estruch et al., 2023). The key idea behind diffusion models is to iteratively transform a simple prior distribution into a target distribution by applying a sequential denoising process. This procedure is conventionally motivated as a maximum likelihood estimation problem, with the objective derived as a variational lower bound on the model log-likelihood.

However, most use cases of diffusion models are not explicitly concerned with likelihoods, but instead with a downstream objective such as human-perceived image quality or drug effectiveness. In this paper, we consider the problem of training diffusion models to satisfy such objectives directly, as opposed to matching a data distribution. This problem is challenging because exact likelihood computation with diffusion models is intractable, making it difficult to apply many conventional reinforcement learning (RL) algorithms. We instead propose to frame denoising as a multi-step decision-making task, using the exact likelihoods at each denoising step in place of the approximate likelihoods induced by a full denoising process. We then devise a policy gradient algorithm, which we refer to as denoising diffusion policy optimization (DDPO), that can optimize a diffusion model for downstream tasks using only a black-box reward function.

We apply our algorithm to the finetuning of large pretrained text-to-image diffusion models. Our initial evaluation focuses on tasks that are difficult to specify via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. However, because many reward functions of interest are difficult to specify programmatically, finetuning procedures often rely on large-scale human labeling efforts to obtain a reward signal. In the case of text-to-image diffusion, we propose a method for replacing such labeling with feedback from a vision-language model (VLM). Similar to RLAIF finetuning for language models (Bai et al., 2022b), the resulting procedure allows diffusion models to be adapted to reward functions that would otherwise require additional human annotations. We use this procedure to improve prompt-image alignment for unusual subject-setting compositions.

*Equal contribution. 1University of California, Berkeley. 2Massachusetts Institute of Technology. Correspondence to: Kevin Black <[email protected]>. Accepted to ICML workshop on Structured Probabilistic Inference & Generative Modeling.

Figure 1 (Reinforcement learning for diffusion models) We propose a reinforcement learning algorithm, DDPO, for optimizing diffusion models on downstream objectives such as compressibility, aesthetic quality, and prompt-image alignment as determined by vision-language models. Each row shows a progression of samples for the same prompt and random seed over the course of training. Row prompts: compressibility ("llama"), aesthetic quality ("lion"), prompt alignment ("a raccoon washing dishes").

2. Experimental Evaluation

The purpose of our experiments is to evaluate the effectiveness of RL algorithms for finetuning diffusion models to align with a variety of user-specified objectives. We compare reward-weighted regression approaches, denoted RWR, to our proposed policy gradient approaches, denoted DDPO. We evaluate four reward functions: compressibility and incompressibility, as determined by the JPEG compression algorithm; aesthetic quality, as determined by the LAION aesthetic quality predictor (Schuhmann, 2022); and prompt-image alignment, as determined by the LLaVA VLM (Liu et al., 2023). Full details of the algorithms and reward functions are provided in Appendices C and D, respectively. Additional experiments studying zero-shot generalization and reward overoptimization are provided in Appendices E.1 and E.2, respectively.

2.1. Algorithm Comparisons

We begin by evaluating all methods on the compressibility, incompressibility, and aesthetic quality tasks, as these tasks isolate the effectiveness of the RL approach from considerations relating to automated VLM reward evaluation. We use Stable Diffusion v1.4 (Rombach et al., 2022) as the base model for all experiments. Compressibility and incompressibility prompts are sampled uniformly from all 398 animals in the ImageNet-1000 (Deng et al., 2009) categories. Aesthetic quality prompts are sampled uniformly from a smaller set of 45 common animals.

As shown qualitatively in Figure 2, DDPO is able to effectively adapt a pretrained model with only the specification of a reward function and without any further data curation. The strategies found to optimize each reward are nontrivial; for example, to maximize LAION-predicted aesthetic quality, DDPO transforms a model that produces naturalistic images into one that produces stylized line drawings. To maximize compressibility, DDPO removes backgrounds and applies a Gaussian blur to what remains. To maximize incompressibility, DDPO finds artifacts that are difficult for the JPEG compression algorithm to encode, such as high-frequency noise and sharp edges, and occasionally produces multiple entities. Samples from RWR are provided in Appendix H for comparison.

Figure 2 (DDPO samples) Qualitative depiction of the effects of RL fine-tuning on different reward functions (pretrained, aesthetic quality, compressibility, and incompressibility). DDPO transforms naturalistic images into stylized line drawings to maximize predicted aesthetic quality, removes background content and applies a foreground blur to maximize compressibility, and adds artifacts and high-frequency noise to maximize incompressibility.

We provide a quantitative comparison of all methods in Figure 3. We plot the attained reward as a function of the number of queries to the reward function, as reward evaluation becomes the limiting factor in many practical applications. DDPO shows a clear advantage over RWR on all tasks, demonstrating that formulating the denoising process as an MDP and estimating the policy gradient directly is more effective than optimizing a reward-weighted lower bound on likelihood. Within the DDPO class, the importance sampling estimator slightly outperforms the score function estimator, likely due to the increased number of optimization steps. Within the RWR class, the performance of the weighting schemes is comparable, making the sparse weighting scheme preferable on these tasks due to its simplicity and reduced resource requirements.

2.2. Automated Prompt Alignment

We next evaluate the ability of VLMs, in conjunction with DDPO, to automatically improve the image-prompt alignment of the pretrained model without additional human labels. We focus on DDPOIS for this experiment, as we found it to be the most effective algorithm in Section 2.1. The prompts for this task all have the form "a(n) [animal] [activity]", where the animal comes from the same list of 45 common animals used in Section 2.1 and the activity is chosen from a list of 3 activities: "riding a bike", "playing chess", and "washing dishes".

The progression of finetuning is depicted in Figure 4. Qualitatively, the samples come to depict the prompts much more faithfully throughout the course of training. This trend is also reflected quantitatively, though it is less salient, as we found that even small changes in average BERTScore (Zhang et al., 2020) could correspond to large differences in quality. It is important to note that some of the prompts in the finetuning set, such as "a dolphin riding a bike", had zero success rate from the base model; if trained in isolation, this prompt would be unlikely to ever improve because there would be no reward signal. It was only via transfer between prompts that these particular prompts could improve.

Nearly all of the samples become more cartoon-like or artistic during finetuning. This was not optimized for directly. We hypothesize that this is a function of the pretraining distribution; though it would be extremely rare to see a photorealistic image of a bear washing dishes, it would be much less unusual to see the scene depicted in a children's book. As a result, in the process of satisfying the content of the prompt, the style of the samples also changes.

Figure 3 (Finetuning effectiveness) The relative effectiveness of different RL algorithms on three reward functions, plotted as attained reward versus the number of reward queries: aesthetic quality (LAION aesthetic score), JPEG compressibility (negative file size in kB), and JPEG incompressibility (file size in kB). The methods compared are DDPOIS, DDPOSF, RWR, and RWRsparse. We find that the policy gradient variants, denoted DDPO, are more effective optimizers than both RWR variants.

Figure 4 (Prompt alignment) (L) Progression of samples for the prompts "a dolphin riding a bike", "an ant playing chess", and "a bear washing dishes", each with the same random seed, over the course of training. The images become significantly more faithful to the prompt. The samples also adopt a cartoon-like style, which we hypothesize is because the prompts are more likely depicted as illustrations than realistic photographs in the pretraining distribution. (R) Quantitative improvement of prompt alignment, measured as BERTScore versus the number of reward queries. Each thick line is the average score for an activity ("riding a bike", "playing chess", "washing dishes"), while the faint lines show average scores for a few randomly selected individual prompts.

References

Ajay, A., Du, Y., Gupta, A., Tenenbaum, J., Jaakkola, T., and Agrawal, P. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., Mann, B., and Kaplan, J. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., Mercado, N., DasSarma, N., Lasenby, R., Larson, R., Ringer, S., Johnston, S., Kravec, S., Showk, S. E., Fort, S., Lanham, T., Telleen-Lawton, T., Conerly, T., Henighan, T., Hume, T., Bowman, S. R., Hatfield-Dodds, Z., Mann, B., Amodei, D., Joseph, N., McCandlish, S., Brown, T., and Kaplan, J. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022b.

Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion Policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Neural Information Processing Systems, 2017.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, 2009.

Dhariwal, P. and Nichol, A. Q. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, 2021.

Du, Y., Durkan, C., Strudel, R., Tenenbaum, J. B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC. arXiv preprint arXiv:2302.11552, 2023.

Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.

Gao, L., Schulman, J., and Hilton, J. Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760, 2022.

Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A., and Olah, C. Multimodal neurons in artificial neural networks. Distill, 2021. https://distill.pub/2021/multimodal-neurons.

Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G., and Levine, S. IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023.

Ho, J. and Salimans, T. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., and Salimans, T. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

Janner, M., Du, Y., Tenenbaum, J., and Levine, S. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, 2022.

Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pp. 267–274, 2002.

Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. In Neural Information Processing Systems, 2021.

Knox, W. B. and Stone, P. TAMER: Training an Agent Manually via Evaluative Reinforcement. In International Conference on Development and Learning, 2008.

Lee, K., Liu, H., Ryu, M., Watkins, O., Du, Y., Boutilier, C., Abbeel, P., Ghavamzadeh, M., and Gu, S. S. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. 2023.

Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. arXiv preprint arXiv:2206.01714, 2022.

Menick, J., Trebacz, M., Mikulik, V., Aslanides, J., Song, F., Chadwick, M., Glaese, M., Young, S., Campbell-Gillingham, L., Irving, G., and McAleese, N. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147, 2022.

Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. Monte Carlo gradient estimation in machine learning. The Journal of Machine Learning Research, 21(1):5183–5244, 2020.

Nair, A., Dalal, M., Gupta, A., and Levine, S. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.

Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button, K., Knight, M., Chess, B., and Schulman, J. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.

Nguyen, K., Daumé III, H., and Boyd-Graber, J. Reinforcement learning for bandit neural machine translation with simulated human feedback. In Empirical Methods in Natural Language Processing, 2017.

Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, 2021.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. CoRR, abs/1910.00177, 2019. URL https://arxiv.org/abs/1910.00177.

Peters, J. and Schaal, S. Reinforcement learning by reward-weighted regression for operational space control. In International Conference on Machine Learning, 2007.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.

Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.

Schneuing, A., Du, Y., Harris, C., Jamasb, A., Igashov, I., Du, W., Blundell, T., Lió, P., Gomes, C., Welling, M., Bronstein, M., and Correia, B. Structure-based drug design with equivariant diffusion models. arXiv preprint arXiv:2210.02303, 2022.

Schuhmann, C. LAION aesthetics, Aug 2022. URL https://laion.ai/blog/laion-aesthetics/.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP.

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. In Neural Information Processing Systems, 2020.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Solla, S., Leen, T., and Müller, K. (eds.), Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf.

Wang, Z., Hunt, J. J., and Zhou, M. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement Learning, pp. 5–32, 1992.

Xie, T., Fu, X., Ganea, O.-E., Barzilay, R., and Jaakkola, T. S. Crystal diffusion variational autoencoder for periodic material generation. In International Conference on Learning Representations, 2021.

Xu, M., Yu, L., Song, Y., Shi, C., Ermon, S., and Tang, J. GeoDiff: A geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations, 2021.

Zeng, X., Vahdat, A., Williams, F., Gojcic, Z., Litany, O., Fidler, S., and Kreis, K. LION: Latent point diffusion models for 3D shape generation. arXiv preprint arXiv:2210.06978, 2022.

Zhang, L. and Agrawala, M. Adding conditional control to text-to-image diffusion models, 2023.

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020.

Zhou, L., Du, Y., and Wu, J. 3D shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5826–5835, 2021.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

A. Related Work
Denoising diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) have emerged as an effective class of generative
models for modalities including images (Ramesh et al., 2021; Saharia et al., 2022), videos (Ho et al., 2022; Singer et al.,
2022), 3D shapes (Zhou et al., 2021; Zeng et al., 2022), and robotic trajectories (Janner et al., 2022; Ajay et al., 2022; Chi
et al., 2023). While the denoising objective is conventionally derived as an approximation to likelihood, the training of
diffusion models typically departs from maximum likelihood in several ways that improve sample quality in practice (Ho
et al., 2020). Modifying the objective to more strictly optimize likelihood (Nichol & Dhariwal, 2021; Kingma et al., 2021)
often leads to worsened image quality, as likelihood is not a faithful proxy for visual quality. In this paper, we show how
diffusion models can be optimized directly for downstream objectives.
Recent progress in text-to-image diffusion models (Ramesh et al., 2021; Saharia et al., 2022) has enabled fine-grained high-
resolution image synthesis. To further improve the controllability and quality of diffusion models, recent approaches have
investigated finetuning on limited user-provided data (Ruiz et al., 2022), optimizing text embeddings for new concepts (Gal
et al., 2022), composing models (Du et al., 2023; Liu et al., 2022), adapters for additional input constraints (Zhang &
Agrawala, 2023), and inference-time techniques such as classifier (Dhariwal & Nichol, 2021) and classifier-free (Ho &
Salimans, 2021) guidance.
A number of works have studied using human feedback to optimize models in settings such as simulated robotic control
(Christiano et al., 2017), game-playing (Knox & Stone, 2008), machine translation (Nguyen et al., 2017), citation retrieval
(Menick et al., 2022), browsing-based question-answering (Nakano et al., 2021), summarization (Stiennon et al., 2020;
Ziegler et al., 2019), instruction-following (Ouyang et al., 2022), and alignment with specifications (Bai et al., 2022a).
Recently, Lee et al. (2023) studied the alignment of text-to-image diffusion models to human preferences using a method
based on reward-weighted likelihood maximization, and posited that finetuning with RL is a promising direction for
future work. In our comparisons, their method roughly corresponds to one iteration of the RWR method, though precise
implementation details are likely different. Our results demonstrate that DDPO significantly outperforms even multiple
iterations of weighted likelihood maximization (RWR-style) optimization. More generally, our aim is not to study learning
from human feedback per se, but general algorithms compatible with a variety of reward functions.

B. Preliminaries
In this section, we provide a brief background on diffusion models and the RL problem formulation.

B.1. Diffusion Models


In this work, we consider conditional diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020), which
represent a distribution over data x0 conditioned on context c as the result of sequential denoising. The denoising procedure is
trained to reverse a Markovian forward process q(xt | xt−1 ), which iteratively adds noise to the data. Reversing the forward
process can be accomplished by training a forward process posterior mean predictor µθ (xt , t, c) for all t ∈ {0, 1, . . . , T }
with the following simplified objective:

    \mathcal{L}_{\mathrm{DDPM}}(\theta) = \mathbb{E}\left[\, \lVert \tilde{\mu}(x_t, x_0) - \mu_\theta(x_t, t, c) \rVert^2 \,\right]    \tag{1}

where µ̃ is a weighted average of x0 and xt . This objective is justified as maximizing a variational lower bound on the
model log-likelihood (Ho et al., 2020).
Sampling from a diffusion model begins with sampling xT ∼ N (0, I) and using the reverse process pθ (xt−1 | xt , c) to
produce a trajectory {xT , xT −1 , . . . , x0 } ending with a sample x0 . The reverse process depends not only on the predictor
µθ but also the choice of sampler. Most popular samplers (Ho et al., 2020; Song et al., 2021) use an isotropic Gaussian
reverse process with a fixed timestep-dependent variance:

    p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\!\left(x_{t-1} \mid \mu_\theta(x_t, t, c),\, \sigma_t^2 I\right).    \tag{2}
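
For illustration, the sampling procedure above can be written as a short loop. This is a minimal sketch rather than the paper's implementation: the mean predictor mu_theta and the fixed per-timestep standard deviations sigmas are assumed to be supplied by the trained model and the chosen sampler.

import torch

def ancestral_sample(mu_theta, sigmas, context, shape, T):
    """Ancestral sampling with the fixed-variance Gaussian reverse process of Eq. (2).

    mu_theta: callable (x_t, t, context) -> predicted posterior mean
    sigmas:   tensor of length T + 1 holding the fixed per-timestep std. deviations
    """
    x = torch.randn(shape)  # x_T ~ N(0, I)
    trajectory = [x]
    for t in reversed(range(1, T + 1)):
        mean = mu_theta(x, t, context)
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + sigmas[t] * noise  # x_{t-1} ~ N(mu_theta(x_t, t, c), sigma_t^2 I)
        trajectory.append(x)
    return x, trajectory  # the final sample x_0 and the full denoising trajectory
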

B.2. Markov Decision Processes and Reinforcement Learning


A Markov decision process (MDP) is a formalization of sequential decision-making problems. An MDP is defined by a
tuple (S, A, ρ0 , P, R), in which S is the state space, A is the action space, ρ0 is the distribution of initial states, P is the transition kernel, and R is the reward function. At each timestep t, the agent observes a state st ∈ S, takes an action at ∈ A,
receives a reward R(st , at ), and transitions to a new state st+1 ∼ P (· | st , at ). An agent acts according to a policy π(a | s).
As the agent acts in the MDP, it produces trajectories, which are sequences of states and actions τ =
(s0 , a0 , s1 , a1 , . . . , sT , aT ). The reinforcement learning (RL) objective for the agent is to maximize JRL (π), the expected
cumulative reward over trajectories sampled from its policy:
    J_{\mathrm{RL}}(\pi) = \mathbb{E}_{\tau \sim p(\cdot \mid \pi)}\left[\, \textstyle\sum_{t=0}^{T} R(s_t, a_t) \,\right].

C. Algorithm Details
We now describe how RL algorithms can be used to train diffusion models. We present two classes of methods, one based
on prior work and one novel, and show that each corresponds to a different mapping of the denoising process to the MDP
framework.

C.1. Problem Statement


We assume a pre-existing diffusion model, which may be pretrained or randomly initialized. If we choose a fixed sampler,
the diffusion model induces a sample distribution pθ (x0 | c). The denoising diffusion RL objective is to maximize a reward
signal r defined on the samples and contexts:

    J_{\mathrm{DDRL}}(\theta) = \mathbb{E}_{c \sim p(c),\; x_0 \sim p_\theta(\cdot \mid c)}\left[\, r(x_0, c) \,\right]

for some context distribution p(c) of our choosing.

C.2. Reward-Weighted Regression


To optimize JDDRL with minimal changes to standard diffusion model training, we can use the denoising objective LDDPM
(Equation 1), but with training data sampled from the model itself and a per-sample loss weighting that depends on the
reward r(x0 , c). Lee et al. (2023) describe a single-round version of this procedure for diffusion models, but in general this
approach can be performed for multiple rounds of alternating sampling and training, leading to a simple RL method. We
refer to this general class of algorithms as reward-weighted regression (RWR) (Peters & Schaal, 2007).
A standard weighting scheme uses exponentiated rewards to ensure nonnegativity,

    w_{\mathrm{RWR}}(x_0, c) = \frac{1}{Z} \exp\!\bigl(\beta\, r(x_0, c)\bigr),
where β is an inverse temperature and Z is a normalization constant. We also consider a simplified weighting scheme that
uses binary weights,

    w_{\mathrm{sparse}}(x_0, c) = \mathbb{1}\bigl[\, r(x_0, c) \geq C \,\bigr],

where C is a reward threshold determining which samples are used for training. The sparse weights may be desirable
because they eliminate the need to retain every sample from the model.
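
To make the two weighting schemes concrete, the following sketch computes both over a batch of rewards. It is an illustrative example under the definitions above rather than the authors' code; the softmax-style normalization and the percentile-based threshold mirror the descriptions in Appendix F.2.

import numpy as np

def rwr_weights(rewards, beta=0.2):
    """Exponentiated weights w = exp(beta * r) / Z, computed as a softmax for stability."""
    logits = beta * np.asarray(rewards, dtype=np.float64)
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

def sparse_weights(rewards, percentile=90):
    """Binary weights that keep only samples whose reward clears a percentile threshold C."""
    rewards = np.asarray(rewards, dtype=np.float64)
    C = np.percentile(rewards, percentile)
    return (rewards >= C).astype(np.float64)
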
Within the RL formalism, the RWR procedure corresponds to the following one-step MDP:

    s \triangleq c \qquad a \triangleq x_0 \qquad \pi(a \mid s) \triangleq p_\theta(x_0 \mid c) \qquad \rho_0(s) \triangleq p(c) \qquad R(s, a) \triangleq r(x_0, c)

with a transition kernel P that immediately leads to an absorbing termination state. Therefore, maximizing JDDRL (θ) is
equivalent to maximizing JRL (π) in this MDP.
Weighting a maximum likelihood objective by wRWR approximately optimizes JRL (π) subject to a KL divergence constraint
on the policy (Nair et al., 2020). However, LDDPM is not an exact maximum likelihood objective, but is derived from a
reweighted variational bound. Therefore, RWR algorithms applied to LDDPM optimize JDDRL via two levels of approximation.
Thus, this methodology provides us with a starting point, but might underperform for complex objectives.

C.3. Denoising Diffusion Policy Optimization
RWR relies on an approximate maximum likelihood objective because it ignores the sequential nature of the denoising
process, only using the final samples x0 . In this section, we show that when the sampler is fixed, the denoising process can
be reframed as a multi-step MDP. This allows us to directly optimize JDDRL using policy gradient estimators. We refer to
the resulting class of algorithms as denoising diffusion policy optimization (DDPO) and present two variants.
Denoising as a multi-step MDP. We map the iterative denoising procedure to the following MDP:

    s_t \triangleq (c, t, x_t) \qquad \pi(a_t \mid s_t) \triangleq p_\theta(x_{t-1} \mid x_t, c) \qquad P(s_{t+1} \mid s_t, a_t) \triangleq \bigl(\delta_c,\, \delta_{t-1},\, \delta_{x_{t-1}}\bigr)

    a_t \triangleq x_{t-1} \qquad \rho_0(s_0) \triangleq \bigl(p(c),\, \delta_T,\, \mathcal{N}(0, I)\bigr) \qquad R(s_t, a_t) \triangleq \begin{cases} r(x_0, c) & \text{if } t = 0 \\ 0 & \text{otherwise} \end{cases}

in which δy is the Dirac delta distribution with nonzero density only at y. Trajectories consist of T timesteps, after which
P leads to a termination state. The cumulative reward of each trajectory is equal to r(x0 , c), so maximizing JDDRL (θ) is
equivalent to maximizing JRL (π) in this MDP.
The benefit of this formulation is that, if we use a standard sampler parameterized as in Equation 2, the policy π becomes an
isotropic Gaussian as opposed to an arbitrarily complicated distribution induced by the entire denoising procedure. This
simplification allows for the evaluation of exact action likelihoods and gradients of these likelihoods with respect to the
diffusion model parameters.
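
Concretely, the quantity this simplification makes available is the exact log-likelihood of each denoising action. A minimal sketch under the fixed-variance Gaussian policy of Equation 2 is shown below; mu_theta and sigmas are assumed to come from the diffusion model and sampler, as in the earlier sampling sketch.

import torch

def denoising_log_prob(mu_theta, sigmas, x_t, x_tm1, t, context):
    """log p_theta(x_{t-1} | x_t, c) for an isotropic Gaussian reverse step."""
    mean = mu_theta(x_t, t, context)
    dist = torch.distributions.Normal(mean, sigmas[t])
    # Sum per-dimension log-densities over all non-batch dimensions.
    return dist.log_prob(x_tm1).sum(dim=tuple(range(1, x_tm1.ndim)))
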
Policy gradient estimation. With access to likelihoods and likelihood gradients, we can make Monte Carlo estimates
of the policy gradient ∇θ JDDRL . DDPO alternates collecting trajectories {xT , xT −1 , . . . , x0 } via sampling and updating
parameters via gradient ascent on JDDRL .
The first variant of DDPO, which we call DDPOSF , uses the score function policy gradient estimator, also known as the
likelihood ratio method or REINFORCE (Williams, 1992; Mohamed et al., 2020):
" T #
X
ĝSF = E ∇θ log pθ (xt−1 | c, t, xt ) r(x0 , c) (3)
t=0

where the expectation is taken over denoising trajectories generated by the current policy pθ .
This estimator is unbiased. However, it only allows for one step of optimization per round of data collection, as the gradients
must be estimated using data from the current policy. To perform multiple steps of optimization, we may use an importance
sampling estimator (Kakade & Langford, 2002):
" T #
X pθ (xt−1 | c, t, xt )
ĝIS = E ∇θ log pθ (xt−1 | c, t, xt ) r(x0 , c) (4)
p (xt−1 | c, t, xt )
t=0 θold

where θold are the parameters used to collect the data, and the expectation is taken over denoising trajectories generated
by the corresponding policy pθold . This estimator also becomes inaccurate if pθ deviates too far from pθold , which can be
addressed using trust regions (Schulman et al., 2015) to constrain the size of the update. In practice, we implement the trust
region by clipping the importance weights, as introduced in proximal policy optimization (Schulman et al., 2017). We call
this variant DDPOIS .
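
A sketch of the resulting clipped importance-sampling update is given below. It assumes per-step log-probabilities old_logps stored at sampling time (for example via a helper like denoising_log_prob above) and normalized rewards broadcast to each denoising step; it illustrates Equation 4 with PPO-style clipping and is not a reproduction of the authors' training code.

import torch

def ddpo_is_loss(new_logps, old_logps, advantages, clip_range=1e-4):
    """Clipped surrogate loss for one minibatch of denoising steps.

    new_logps:  log p_theta(x_{t-1} | x_t, c) under the current parameters
    old_logps:  the same log-probabilities under the parameters used for sampling
    advantages: normalized per-trajectory rewards, repeated for every step of the trajectory
    """
    ratio = torch.exp(new_logps - old_logps)  # importance weights
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Gradient ascent on the reward-weighted ratio = gradient descent on its negation.
    return -torch.mean(torch.minimum(unclipped, clipped))
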

D. Reward Function Details


In this work, we evaluate our methods on text-to-image diffusion. Text-to-image diffusion serves as a valuable test
environment for reinforcement learning experiments due to the availability of large pretrained models and the versatility of
using diverse and visually interesting reward functions.
The choice of reward function is one of the most important decisions in practical applications of RL. In this section, we
outline our selection of reward functions for text-to-image diffusion models. We study a spectrum of reward functions of
varying complexity, ranging from those that are straightforward to specify and evaluate to those that capture the complexity
of real-world downstream tasks.


Figure 5 (VLM reward function) Illustration of the VLM-based reward function for prompt-image alignment. LLaVA
(Liu et al., 2023) provides a short description of a generated image; the reward is the similarity between this description and
the original prompt as measured by BERTScore (Zhang et al., 2020).

D.1. Compressibility and Incompressibility


The capabilities of text-to-image diffusion models are limited by the co-occurrences of text and images in their training
distribution. For instance, images are rarely captioned with their file size, making it impossible to specify a desired file
size via prompting. This limitation makes reward functions based on file size a convenient case study: they are simple to
compute, but not controllable through the conventional workflow of likelihood maximization and prompt engineering.
We fix the resolution of diffusion model samples at 512x512, such that the file size is determined solely by the compressibility
of the image. We define two tasks based on file size: compressibility, in which the file size of the image after JPEG
compression is minimized, and incompressibility, in which the same measure is maximized.
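
Given a JPEG encoder that reports the encoded size in kilobytes, such as the helper in Appendix F.4, one natural way to write the two rewards is sketched below; this is an illustrative pairing, not the exact reward code.

def compressibility_reward(image, prompt=None):
    # encode_jpeg is the helper defined in Appendix F.4; smaller files receive higher reward.
    return -encode_jpeg(image)

def incompressibility_reward(image, prompt=None):
    # Larger files receive higher reward.
    return encode_jpeg(image)
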

D.2. Aesthetic Quality


To capture a reward function that would be useful to a human user, we define a task based on perceived aesthetic quality. We
use the LAION aesthetics predictor (Schuhmann, 2022), which is trained on 176,000 human image ratings. The predictor is
implemented as a linear model on top of CLIP embeddings (Radford et al., 2021). Annotations range between 1 and 10, with
the highest-rated images mostly containing artwork. Since the aesthetic quality predictor is trained on human judgments,
this task constitutes reinforcement learning from human feedback (Ouyang et al., 2022; Christiano et al., 2017; Ziegler et al.,
2019).
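
A rough sketch of this style of reward is shown below. The embedding function clip_embed and the pretrained linear head are placeholders standing in for the LAION predictor's components, not its actual code; only the overall structure, a linear model on CLIP embeddings producing a score of roughly 1 to 10, follows the description above.

import torch

def aesthetic_reward(image, clip_embed, linear_head):
    """Score an image with a linear head on top of a (normalized) CLIP image embedding.

    clip_embed:  callable image -> CLIP embedding of shape (1, d); assumed to be provided
    linear_head: torch.nn.Linear(d, 1) loaded with pretrained predictor weights (assumed)
    """
    with torch.no_grad():
        emb = clip_embed(image)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        return linear_head(emb).item()  # scalar aesthetic score, roughly in [1, 10]
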

D.3. Automated Prompt Alignment with Vision-Language Models


A very general-purpose reward function for training a text-to-image model is prompt-image alignment. However, specifying
a reward that captures generic prompt alignment is difficult, conventionally requiring large-scale human labeling efforts. We
propose using an existing VLM to replace additional human annotation. This design is inspired by recent work on RLAIF
(Bai et al., 2022b), in which language models are improved using feedback from themselves.
We use LLaVA (Liu et al., 2023), a state-of-the-art VLM, to describe an image. The finetuning reward is the BERTScore
(Zhang et al., 2020) recall metric, a measure of semantic similarity, using the prompt as the reference and the VLM
description as the candidate. Samples that more faithfully include all of the details of the prompt receive higher rewards, to
the extent that those visual details are legible to the VLM.
In Figure 5, we show one simple question: “what is happening in this image?”. While this captures the general task of
prompt-image alignment, in principle any question could be used to specify complex or hard-to-define reward functions for
a particular use case. One could even employ a language model to automatically generate candidate questions and evaluate
responses based on the prompt. This framework provides a flexible interface where the complexity of the reward function is
only limited by the capabilities of the vision and language models involved.
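
A sketch of this reward pipeline is shown below. The describe_image callable stands in for whatever LLaVA inference wrapper is available (it is not a real API), and the scoring uses the recall output of the bert_score package; both are assumptions about the surrounding setup rather than the authors' exact implementation.

from bert_score import score as bert_score

def vlm_alignment_reward(image, prompt, describe_image,
                         question="what is happening in this image?"):
    """Ask a VLM to describe the image, then reward semantic similarity to the prompt.

    describe_image: callable (image, question) -> str, e.g. a LLaVA inference wrapper.
    """
    description = describe_image(image, question)
    # BERTScore recall with the prompt as reference and the VLM description as candidate.
    _, recall, _ = bert_score([description], [prompt], lang="en")
    return recall.item()
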

E. Additional Experiments
E.1. Generalization
RL finetuning on large language models has been shown to produce interesting generalization properties; for example,
instruction finetuning almost entirely in English has been shown to improve capabilities in other languages (Ouyang et al.,
2022). It is difficult to reconcile this phenomenon with our current understanding of generalization; it would a priori seem
more likely for finetuning to have an effect only on the finetuning prompt set or distribution. In order to investigate the same
phenomenon with diffusion models, Figure 6 shows a set of DDPO-finetuned model samples corresponding to prompts
that were not seen during finetuning. In concordance with instruction-following transfer in language modeling, we find
that the effects of finetuning do generalize, even with prompt distributions as narrow as 45 animals. We find evidence of
generalization to both animals outside of the training distribution and to non-animal everyday objects.

Pretrained (New Animals) Aesthetic Quality (New Animals)

Pretrained (Non-Animals) Aesthetic Quality (Non-Animals)

Pretrained (New Animals and Activities) Alignment (New Animals and Activities)

Figure 6 (Generalization) For aesthetic quality, finetuning on a limited set of 45 animals generalizes to both new animals and non-animal everyday objects. For prompt alignment, finetuning on the same set of animals and only three activities generalizes to new animals, new activities, and even combinations of the two. The prompts for the bottom row (left to right) are: “a capybara washing dishes”, “a crab playing chess”, “a parrot driving a car”, and “a horse typing on a keyboard”. More samples are provided in Appendix H.

E.2. Overoptimization
Section 2.1 highlights the optimization problem: given a reward function, how well can an RL algorithm maximize
that reward? However, finetuning on a reward function, especially a learned one, has been observed to lead to reward
overoptimization or exploitation (Gao et al., 2022) in which the model learns to achieve high reward while moving too far
away from the pretraining distribution to be useful.
Our setting is no exception, and we provide two examples of reward exploitation in Figure 7. When optimizing the
incompressibility objective, the model eventually stops producing semantically meaningful content, degenerating into
high-frequency noise. Similarly, we observed that VLM reward pipelines are susceptible to typographic attacks (Goh et al.,
2021). When optimizing for alignment with respect to prompts of the form “n animals”, DDPO exploited deficiencies in the
VLM by instead generating text loosely resembling the specified number. There is currently no general-purpose method for
preventing overoptimization (Gao et al., 2022). We highlight this problem as an important area for future work.

Incompressibility Counting Animals

DDPO

RWR

Figure 7 (Reward model overoptimization) Examples of RL overoptimizing reward functions. (L) The diffusion model
eventually loses all recognizable semantic content and produces noise when optimizing for incompressibility. (R) When
optimized for prompts of the form “n animals”, the diffusion model exploits the VLM with a typographic attack (Goh et al.,
2021), writing text that is interpreted as the specified number n instead of generating the correct number of animals.

F. Implementation Details
For all experiments, we use Stable Diffusion v1.4 (Rombach et al., 2022) as the base model and finetune only the UNet
weights while keeping the text encoder and autoencoder weights frozen.

F.1. DDPO Implementation


We collect 256 samples per training iteration. For DDPOSF , we accumulate gradients across all 256 samples and perform
one gradient update. For DDPOIS , we split the samples into 4 minibatches and perform 4 gradient updates. Gradients are
always accumulated across all denoising timesteps for a single sample. For DDPOIS , we use the same clipped surrogate
objective as in proximal policy optimization (Schulman et al., 2017), but find that we need to use a very small clip range
compared to standard RL tasks. We use a clip range of 1e-4 for all experiments.

F.2. RWR Implementation


We compute the weights for a training iteration using the entire dataset of samples collected for that training iteration. For
wRWR , the weights are computed using the softmax function. For wsparse , we use a percentile-based threshold, meaning C is
dynamically selected such that the bottom p% of a given pool of samples are discarded and the rest are used for training.

F.3. Reward Normalization


In practice, rewards are rarely used as-is, but instead are normalized to have zero mean and unit variance. Furthermore, this
normalization can depend on the current state; in the policy gradient context, this is analogous to a value function baseline
(Sutton et al., 1999), and in the RWR context, this is analogous to advantage-weighted regression (Peng et al., 2019). In
our experiments, we normalize the rewards on a per-context basis. For DDPO, this is implemented as normalization by
a running mean and standard deviation that is tracked for each prompt independently. For RWR, this is implemented by
computing the softmax over rewards for each prompt independently. For RWRsparse , this is implemented by computing the
percentile-based threshold C for each prompt independently.
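
One way to implement the per-prompt running statistics described above is sketched here; it is an illustrative implementation, not the authors' code.

from collections import defaultdict

class PerPromptNormalizer:
    """Track a running mean and variance of rewards for each prompt and normalize by them."""

    def __init__(self, eps=1e-6):
        self.stats = defaultdict(lambda: {"n": 0, "mean": 0.0, "m2": 0.0})
        self.eps = eps

    def update_and_normalize(self, prompt, reward):
        s = self.stats[prompt]
        # Welford's online update of the mean and (unnormalized) second moment.
        s["n"] += 1
        delta = reward - s["mean"]
        s["mean"] += delta / s["n"]
        s["m2"] += delta * (reward - s["mean"])
        var = s["m2"] / s["n"] if s["n"] > 1 else 1.0
        return (reward - s["mean"]) / (var ** 0.5 + self.eps)
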

F.4. JPEG Encoding Code


import io

import numpy as np
from PIL import Image


def encode_jpeg(x, quality=95):
    """
    x: np array of shape (H, W, 3) and dtype uint8
    Returns the size of the JPEG-encoded image in kilobytes.
    """
    img = Image.fromarray(x)
    buffer = io.BytesIO()
    img.save(buffer, "JPEG", quality=quality)
    jpeg = buffer.getvalue()
    num_bytes = np.frombuffer(jpeg, dtype=np.uint8)
    return len(num_bytes) / 1000

F.5. Resource Details


RWR experiments were conducted on a v3-128 TPU pod, and took approximately 4 hours to reach 50k samples. DDPO
experiments were conducted on a v4-64 TPU pod, and took approximately 4 hours to reach 50k samples. For the VLM-based
reward function, LLaVA inference was conducted on a DGX machine with eight 80GB A100 GPUs.

F.6. Full Hyperparameters


                                               DDPOIS     DDPOSF     RWR        RWRsparse
Diffusion     Sampler                          Ancestral  Ancestral  Ancestral  Ancestral
              Denoising steps (T)              50         50         50         50
              Guidance weight (w)              5.0        5.0        5.0        5.0
Optimization  Optimizer                        AdamW      AdamW      AdamW      AdamW
              Learning rate                    1e-5       1e-5       1e-5       1e-5
              Weight decay                     1e-4       1e-4       1e-4       1e-4
              β1                               0.9        0.9        0.9        0.9
              β2                               0.999      0.999      0.999      0.999
              ϵ                                1e-8       1e-8       1e-8       1e-8
              Gradient clip norm               1.0        1.0        1.0        1.0
RWR           Inverse temperature (β)          -          -          0.2        -
              Percentile                       -          -          -          0.9
              Batch size                       -          -          128        128
              Gradient updates per iteration   -          -          400        400
              Samples per iteration            -          -          10k        10k
DDPO          Batch size                       64         256        -          -
              Samples per iteration            256        256        -          -
              Gradient updates per iteration   4          1          -          -
              Clip range                       1e-4       -          -          -

F.7. List of 45 Common Animals


This list was used for experiments with the aesthetic quality reward function and the VLM-based reward function.

cat dog horse monkey rabbit zebra spider bird sheep


deer cow goat lion tiger bear raccoon fox wolf
lizard beetle ant butterfly fish shark whale dolphin squirrel
mouse rat snake turtle frog chicken duck goose bee
pig turkey fly llama camel bat gorilla hedgehog kangaroo

G. Additional Design Decisions


G.1. CFG Training
Recent text-to-image diffusion models rely critically on classifier-free guidance (CFG) (Ho & Salimans, 2021) to produce
perceptually high-quality results. CFG involves jointly training the diffusion model on conditional and unconditional
objectives by randomly masking out the context c during training. The conditional and unconditional predictions are then mixed at sampling time using a guidance weight w:

    \tilde{\epsilon}_\theta(x_t, t, c) = w\, \epsilon_\theta(x_t, t, c) + (1 - w)\, \epsilon_\theta(x_t, t)    \tag{5}

where ϵθ is the ϵ-prediction parameterization of the diffusion model (Ho et al., 2020) and ϵ̃θ is the guided ϵ-prediction that
is used to compute the next denoised sample.
For reinforcement learning, it does not make sense to train on the unconditional objective since the reward may depend on
the context. However, we found that when only training on the conditional objective, performance rapidly deteriorated after
the first round of finetuning. We hypothesized that this is due to the guidance weight becoming miscalibrated each time the
model is updated, leading to degraded samples, which in turn impair the next round of finetuning, and so on. Our solution
was to choose a fixed guidance weight and use the guided ϵ-prediction during training as well as sampling. We call this
procedure CFG training. Figure 8 shows the effect of CFG training on RWRsparse ; it has no effect after a single round of
finetuning, but becomes essential for subsequent rounds.
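
As a rough illustration, CFG training amounts to computing the DDPM loss on the guided ϵ-prediction of Equation 5 rather than on the conditional prediction alone. The sketch below assumes an eps_model for which passing context=None yields the unconditional prediction; it is a schematic, not the training code used in the paper.

import torch

def cfg_training_loss(eps_model, x_t, t, context, target_eps, w=5.0):
    """DDPM-style loss computed on the guided epsilon-prediction (Eq. 5)."""
    eps_cond = eps_model(x_t, t, context)   # conditional prediction
    eps_uncond = eps_model(x_t, t, None)    # unconditional prediction (context masked out)
    eps_guided = w * eps_cond + (1.0 - w) * eps_uncond
    return torch.mean((eps_guided - target_eps) ** 2)
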

Figure 8 (CFG training) We run the RWRsparse algorithm on the JPEG compressibility task while optimizing only the conditional ϵ-prediction (without CFG training), and while optimizing the guided ϵ-prediction (with CFG training). Each point denotes a diffusion model update, plotted as negative file size (kB) against the number of reward queries. We find that CFG training is essential for methods that do more than one round of interleaved sampling and training.

H. More Samples
Figure 9 shows qualitative samples from the baseline RWR method. Figure 10 shows more samples on seen prompts from
DDPO finetuning with the image-prompt alignment reward function. Figure 11 shows more examples of generalization
to unseen animals and everyday objects with the aesthetic quality reward function. Figure 12 shows more examples of
generalization to unseen subjects and activities with the image-prompt alignment reward function.

Pretrained Aesthetic Quality

Compressibility Incompressibility

Figure 9 (RWR samples)

a hedgehog riding a bike a dog riding a bike

a lizard riding a bike a shark washing dishes

a frog washing dishes a monkey washing dishes

Figure 10 (More image-prompt alignment samples)

Pretrained (New Animals) Aesthetic Quality (New Animals)

Pretrained (Non-Animals) Aesthetic Quality (Non-Animals)

Figure 11 (Aesthetic quality generalization)

a capybara washing dishes a snail playing chess

a dog doing laundry a giraffe playing basketball

a parrot driving a car a duck taking an exam

a robot fishing in a lake a horse typing on a keyboard

a rabbit sewing clothes a tree riding a bike

a car eating a sandwich an apple playing soccer

Figure 12 (Image-prompt alignment generalization)
