DIRE For Diffusion-Generated Image Detection
Table 2: Comprehensive comparisons of our DIRE and other generated image detectors. The previous detectors, including CNNDet [48], GANDet [25], Patchfor [4], and SBI [40], are tested on our DiffusionForensics benchmark using their provided models. * denotes our reproduction trained on the ADM subset of DiffusionForensics with the official codes. All the diffusion generation models [10, 23, 32] used for preparing training data are unconditional models pre-trained on LSUN-Bedroom (LSUN-B.) [49]. Generated images from StyleGAN [20] on LSUN-Bedroom are downloaded from the official repository. All the testing images produced by text-to-image generators (SD-v1 [37], SD-v2 [37], LDM [37], VQDiffusion [14]) are prompted by "A photo of bedroom". We report ACC (%) and AP (%) (ACC/AP in the table).
Table 3: Cross-dataset evaluation on ImageNet (IN) [9] and LSUN-Bedroom (LSUN-B.) [49]. Each testing generator is pre-trained on the corresponding dataset. Images generated by Stable Diffusion-v1 (SD-v1) are prompted by "A photo of {class}", in which the classes are from [9]. ACC (%) and AP (%) are reported (ACC/AP in the table).

Training dataset   Generation model   ADM (IN)    SD-v1 (IN)   StyleGAN (LSUN-B.)
LSUN-B.            ADM                90.2/97.9   97.2/99.8    99.9/100
LSUN-B.            iDDPM              90.2/97.9   93.7/99.3    99.9/100
LSUN-B.            StyleGAN           76.9/94.4   89.7/99.0    100/100

performance when dealing with diffusion-generated images, with ACC results lower than 60%. We also include diffusion-generated images (ADM [10]) as training data and re-train CNNDetection [48], Patchforensics [4], and F3Net [34], whose training codes are publicly available. The resulting models gain a significant improvement on images generated by the same diffusion models as used in training, but still perform unsatisfactorily on unseen diffusion models. In contrast, our method, DIRE, shines with excellent generalization performance. Concretely, DIRE with both the generation model and the reconstruction model set to ADM achieves an average of 99.9% ACC and 100% AP on detecting images generated by various diffusion models.

4.3. Generalization Capability Evaluation

Effect of choice of generation and reconstruction models. We evaluate the impact of different choices of the generation and reconstruction models on the generalization capability. We employ the ADM [10] model as the reconstruction model and apply different models for generating images. After generation, the ADM model converts these images to their DIREs for training a binary classifier. In this evaluation, we select three different generation models: PNDM [23] and iDDPM [32] (diffusion models) and StyleGAN [20] (a GAN model). The results are reported in Table 2. Despite the inconsistent use of generation and reconstruction models during training, DIRE still keeps a strong generalization capability. Specifically, when pairing iDDPM [32] as the generation model with ADM [10] as the reconstruction model, DIRE achieves 98.6% ACC and 100% AP on average, highlighting its adaptation to images generated by different diffusion models. It is worth noting that when the generation model is StyleGAN, DIRE still exhibits excellent performance. This might be attributed to DIRE's capability of incorporating the generation properties of generation models other than diffusion models.

Cross-dataset evaluation. We further design a more challenging scenario, i.e., training the detector with images generated by models pre-trained on LSUN-Bedroom [49] and then testing it on images produced by models pre-trained on ImageNet [9]. We choose three different generators for producing training images: ADM [10], iDDPM [32], and StyleGAN [20]. The evaluation results on ADM (IN) are shown in Table 3. The comparison indicates that DIRE maintains a satisfactory generalization capability even when facing unseen datasets, i.e., ACC/AP of 90.2%/97.9% when training on images generated by ADM and iDDPM. For StyleGAN [20], DIRE still achieves 94.4% AP, even though there is a huge domain gap in that both the dataset and the generation model differ between training and testing. This evaluation further validates that the proposed DIRE is a general image representation for this task.

Unseen text-to-image generation evaluation. Furthermore, we seek to verify whether DIRE can detect images generated by unseen text-to-image models. We adopt Stable
Figure 4: Robustness to unseen perturbations. The top row shows robustness to Gaussian blur (AP (%) vs. sigma from 0 to 3) and the bottom row shows robustness to JPEG compression (AP (%) vs. quality from 100 to 30), with one panel each for ADM, DDPM, iDDPM, and PNDM. Compared methods: CNNDetection, GANDetection, SBI, CNNDetection*, F3Net*, Patchforensics*, and Ours. * denotes our reproduction trained on the ADM subset of DiffusionForensics; AP (%) is reported for robustness comparison.
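To make the robustness protocol behind Figure 4 concrete, the following sketch applies the two unseen perturbations, Gaussian blur with increasing sigma and JPEG re-compression with decreasing quality, to a test image before it is scored by a detector. This is a minimal illustration using Pillow under our own assumptions (the file name is a placeholder and the parameter grids are read off the figure axes); the paper's exact evaluation pipeline may differ.

```python
# Minimal sketch of the perturbations used in the robustness evaluation:
# Gaussian blur with sigma in {0, 1, 2, 3} and JPEG compression with
# quality in {100, 65, 30}. Assumes Pillow; the official pipeline may differ.
import io
from PIL import Image, ImageFilter

def gaussian_blur(img: Image.Image, sigma: float) -> Image.Image:
    """Blur the image with a Gaussian kernel of the given sigma (0 = no blur)."""
    if sigma <= 0:
        return img.copy()
    return img.filter(ImageFilter.GaussianBlur(radius=sigma))

def jpeg_compress(img: Image.Image, quality: int) -> Image.Image:
    """Re-encode the image as JPEG at the given quality and decode it again."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()

if __name__ == "__main__":
    img = Image.open("test_image.png")  # placeholder path
    blurred = [gaussian_blur(img, s) for s in (0, 1, 2, 3)]
    compressed = [jpeg_compress(img, q) for q in (100, 65, 30)]
    # Each perturbed image would then be converted to its DIRE and scored
    # by the trained detector to obtain the AP curves shown in Figure 4.
```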
Table 5: Influence of different input information. All the models in this experiment are trained on the ADM subset. ACC (%) and AP (%) are reported (ACC/AP in the table).

Input      ADM       DDPM       iDDPM      PNDM       SD-v1
REC        100/100   57.1/57.7  49.7/92.6  87.1/98.7  46.9/57.0
RGB        100/100   87.3/99.6  100/100    77.8/99.1  77.4/85.8
RGB&DIRE   100/100   99.8/100   99.9/100   99.2/100   62.4/92.4
DIRE       100/100   100/100    100/100    99.7/100   99.7/100

Table 6: Effect of computing the absolute value (ABS) when obtaining DIRE. All the models in this experiment are trained on the ADM subset. ACC (%) and AP (%) are reported (ACC/AP in the table).

           ADM       DDPM       iDDPM      PNDM       SD-v1
w/o ABS    100/100   99.4/100   100/100    98.2/100   87.0/93.0
w/ ABS     100/100   100/100    100/100    99.7/100   99.7/100

generated image detection. Note that the steps in reconstruction are the same as in the inversion by default. The results are reported in Table 4. We observe that more steps in DDIM benefit the detection performance of DIRE. Considering the computational cost, we choose 20 steps by default.

Is DIRE really better than the original RGB for detecting diffusion-generated images? We conduct an experiment on various forms of input for detection, including RGB images, reconstructed images (REC), DIRE, and the combination of RGB and DIRE (RGB&DIRE). The results displayed in Table 5 reveal that REC performed much worse than RGB, suggesting that reconstructed images are not suitable as input information for detection. One possible explanation is the loss of essential information during reconstruction by a pre-trained diffusion model. The comparison between RGB and DIRE also demonstrates that DIRE serves as a stronger image representation, contributing to a more generalizable detector than simply training on RGB images. Furthermore, we find that combining RGB with DIRE hurts the generalization compared to pure DIRE. Therefore, we use DIRE as the default input for detection.

Effect of different calculation of DIRE. After computing the residual between the reconstructed image and the source image, whether to take the absolute value must be considered. As reported in Table 6, we find that the absolute operation is critical for achieving a strong diffusion-generated image detector, particularly on SD-v1 [37], where it improves ACC/AP from 87.0%/93.0% to 99.7%/100%. By default, the absolute operation is therefore applied in all our models.

Qualitative Analysis of DIRE. The above quantitative experiments have indicated the effectiveness of the proposed DIRE. As analyzed before, the key motivation behind DIRE is that generated images can be approximately reconstructed by a pre-trained diffusion model while real images cannot. DIRE makes use of the residual characteristic of an input image and its reconstruction for discrimination. To gain a better understanding of its intrinsic properties, we conduct a further qualitative analysis of DIRE, utilizing noise pattern and frequency analysis for visualization.

When images are acquired, various factors from hardware facilities, such as lenses and sensors, and software algorithms, such as compression and demosaicing, can impact image quality at the low level. One typical low-level analysis of images is noise pattern analysis¹, which is usually regular and corresponds to the shape of objects in real scenarios. In addition to low-level analysis, frequency analysis can provide frequency information about images. To compute the frequency information of DIRE, we use FFT algorithms.

1 https://29a.ch/photo-forensics/#noise-analysis

We visualize the results of the aforementioned two analysis tools in Figure 5. The visual comparison of noise patterns highlights significant differences between the DIRE of real and diffusion-generated images from the low-level perspective: the DIRE of real images tends to be regular and to correspond to the shape of objects, while that of diffusion-generated images tends to be messy. By comparing the FFT spectra of DIRE from real and diffusion-generated images, we observe that the FFT spectrum of real images is usually more abundant than that of diffusion-generated images, which confirms that real images are more difficult for a pre-trained diffusion model to reconstruct.
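To make the ablations in Tables 5 and 6 and the frequency analysis above concrete, the sketch below computes DIRE as the (optionally absolute) residual between a source image and its reconstruction, and derives the log-magnitude FFT spectrum used for visualization. It is a minimal NumPy illustration based on our reading of the text; the function names and array conventions are ours, not the official implementation.

```python
# Sketch: DIRE as the residual between an image and its diffusion reconstruction,
# plus the FFT spectrum used for the qualitative frequency analysis.
# Assumes images are float arrays in [0, 1] of shape (H, W, 3); names are illustrative.
import numpy as np

def compute_dire(x: np.ndarray, x_recon: np.ndarray, use_abs: bool = True) -> np.ndarray:
    """Residual between the source image and its reconstruction.

    Table 6 indicates that taking the absolute value (use_abs=True) matters,
    e.g. on SD-v1 it lifts ACC/AP from 87.0/93.0 to 99.7/100.
    """
    residual = x - x_recon
    return np.abs(residual) if use_abs else residual

def fft_log_spectrum(dire: np.ndarray) -> np.ndarray:
    """Log-magnitude FFT spectrum of a (grayscale) DIRE map for visualization."""
    gray = dire.mean(axis=-1)                 # collapse channels
    spec = np.fft.fftshift(np.fft.fft2(gray)) # center the zero frequency
    return np.log1p(np.abs(spec))             # log scale for display
```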
5. Conclusion

In this paper, we focus on building a generalizable detector for discriminating diffusion-generated images. We find that previous generated-image detectors show limited performance when detecting images generated by diffusion models. To address the issue, we present an image representation called DIRE based on reconstruction errors of images inverted and reconstructed by DDIM. Furthermore, we create a new dataset, DiffusionForensics, which includes images generated by unconditional, conditional, and text-to-image diffusion models to facilitate the evaluation of diffusion-generated image detection. Extensive experiments indicate that the proposed image representation DIRE contributes to a strong diffusion-generated image detector. We hope that our work can serve as a solid baseline for diffusion-generated image detection.

References

[1] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In CVPR, 2022.
[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[3] Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. arXiv preprint arXiv:2301.13188, 2023.
[4] Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? understanding properties that generalize. In ECCV, 2020.
[5] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022.
[6] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[7] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[8] Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. arXiv preprint arXiv:2211.00680, 2022.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
[11] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
[12] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. In ICML, 2020.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[14] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, 2022.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
[18] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[19] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[20] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[21] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[22] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. Face x-ray for more general face forgery detection. In CVPR, 2020.
[23] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
[24] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
[25] Sara Mandelli, Nicolò Bonettini, Paolo Bestagini, and Stefano Tubaro. Detecting gan-generated images by orthogonal training of multiple cnns. In ICIP, 2022.
[26] Francesco Marra, Diego Gragnaniello, Davide Cozzolino, and Luisa Verdoliva. Detection of gan-generated fake images over social networks. In MIPR, 2018.
[27] Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and Giovanni Poggi. Do gans leave artificial fingerprints? In MIPR, 2019.
[28] Scott McCloskey and Michael Albright. Detecting gan-generated imagery using color cues. arXiv preprint arXiv:1812.08247, 2018.
[29] Scott McCloskey and Michael Albright. Detecting gan-generated imagery using saturation cues. In ICIP, 2019.
[30] Lakshmanan Nataraj, Tajuddin Manhar Mohammed, Shivkumar Chandrasekaran, Arjuna Flenner, Jawadul H Bappy, Amit K Roy-Chowdhury, and BS Manjunath. Detecting gan generated fake images using co-occurrence matrices. arXiv preprint arXiv:1903.06836, 2019.
[31] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[32] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.
[33] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027, 2023.
[34] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In ECCV, 2020.
[35] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[36] Jonas Ricker, Simon Damm, Thorsten Holz, and Asja Fischer. Towards the detection of diffusion model deepfakes. arXiv preprint arXiv:2210.14571, 2022.
[37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[38] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[39] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[40] Kaede Shiohara and Toshihiko Yamasaki. Detecting deepfakes with self-blended images. In CVPR, 2022.
[41] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[42] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[43] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[44] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[45] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
[46] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
[47] Sheng-Yu Wang, Oliver Wang, Andrew Owens, Richard Zhang, and Alexei A Efros. Detecting photoshopped faces by scripting photoshop. In ICCV, 2019.
[48] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot...for now. In CVPR, 2020.
[49] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[50] Ning Yu, Larry S Davis, and Mario Fritz. Attributing fake images to gans: Learning and analyzing gan fingerprints. In ICCV, 2019.
[51] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and simulating artifacts in gan fake images. In WIFS, 2019.
[52] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. Learning rich features for image manipulation detection. In CVPR, 2018.
[53] Derui Zhu, Dingfan Chen, Jens Grossklags, and Mario Fritz. Data forensics in diffusion models: A systematic analysis of membership privacy. arXiv preprint arXiv:2302.07801, 2023.
A. More Details of DiffusionForensics

In this section, we give more details about the proposed DiffusionForensics dataset. The real images in the LSUN-Bedroom and ImageNet subsets are from the source LSUN-Bedroom [49] and ImageNet [9] datasets, respectively. To generate DIREs of all the real and generated images in our DiffusionForensics, we use the unconditional ADM [10] model pre-trained on LSUN-Bedroom as the reconstruction model, with the DDIM [42] scheduler applied for 20 steps in total. As for the ImageNet subset, we also provide the DIREs computed by using the unconditional ADM [10] model pre-trained on ImageNet as the reconstruction model, with the DDIM [42] scheduler applied for 20 steps in total.

LSUN-Bedroom-ADM. We download the pre-trained LSUN-Bedroom model of ADM [10] from the official repository². We then sample 42k images for training (40k), validation (1k), and testing (1k), with the DDIM [42] scheduler applied for better sampling with 50 steps.

LSUN-Bedroom-DDPM. We download the provided 1k images generated by the model pre-trained on LSUN-Bedroom from the official repository³.

LSUN-Bedroom-iDDPM. We sample 42k images using the official codes and the pre-trained LSUN-Bedroom model (lr=2e-5)⁴, with the DDIM [42] scheduler applied for better sampling with 50 steps.

LSUN-Bedroom-PNDM. We sample 42k images using the official codes and the pre-trained LSUN-Bedroom model⁵, with their PNDM [23] scheduler applied for better sampling with 50 steps.

LSUN-Bedroom-LDM. The pipeline code for sampling is downloaded from diffusers [46]⁶. The version of the Latent Diffusion model (LDM) [37] we used is "CompVis/ldm-text2im-large-256". We give the prompt "A photo of bedroom" for generating 1k bedroom images.

LSUN-Bedroom-SD-v1. The pipeline code for sampling is downloaded from diffusers [46]. The version of Stable Diffusion (SD) v1 [37] we used is "runwayml/stable-diffusion-v1-5". We give the prompt "A photo of bedroom" for generating 1k bedroom images.

LSUN-Bedroom-SD-v2. The pipeline code for sampling is downloaded from diffusers [46]. The version of Stable Diffusion (SD) v2 [37] we used is "stabilityai/stable-diffusion-2". We give the prompt "A photo of bedroom" for generating 1k bedroom images.

LSUN-Bedroom-VQDiffusion. We sample 1k images using the official codes and the pre-trained ITHQ model⁷ with the prompt "A photo of bedroom".

ImageNet-ADM. We sample 50k images for 1k classes from ImageNet [9] using the pre-trained conditional ADM model and the provided ImageNet classifier from the official repository, with the DDIM [42] scheduler applied for better sampling with 50 steps. The images are divided in the ratio of 8:1:1 for training:validation:testing.

ImageNet-SD-v1. We sample 10k images using the pre-trained Stable Diffusion v1.5 model with code provided by diffusers [46]. The prompt for generation is "A photo of {class}", in which the class is chosen from the 1k classes of ImageNet [9], resulting in ten generated images for each class.

2 https://github.com/openai/guided-diffusion
3 https://github.com/hojonathanho/diffusion
4 https://github.com/openai/improved-diffusion
5 https://github.com/luping-liu/PNDM
6 https://github.com/huggingface/diffusers
7 https://github.com/microsoft/VQ-Diffusion

B. More Explanation of DIRE

We have demonstrated the key motivation of our DIRE. But one may wonder why DIREs of diffusion-generated images are not zero-value images. Here, we explain this from the perspective of the approximation made when solving the corresponding ordinary differential equations (ODEs).

The deterministic reverse process (reconstruction) in DDIM [42] is as follows,

    x_{t-1} = \sqrt{\alpha_{t-1}}\,\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}} + \sqrt{1-\alpha_{t-1}}\cdot\epsilon_\theta(x_t, t).    (11)

When the total number of steps T is large enough (e.g., T = 1000), Eqn. (11) can be seen as Euler integration for solving ordinary differential equations (ODEs):

    \frac{x_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}} = \frac{x_t}{\sqrt{\alpha_t}} + \left(\sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}} - \sqrt{\frac{1-\alpha_t}{\alpha_t}}\right)\epsilon_\theta(x_t, t).    (12)

Suppose \sigma = \sqrt{1-\alpha}/\sqrt{\alpha} and \bar{x} = x/\sqrt{\alpha}; the corresponding ODE becomes:

    \mathrm{d}\bar{x}(t) = \epsilon_\theta\!\left(\frac{\bar{x}(t)}{\sqrt{\sigma^2+1}}, t\right)\mathrm{d}\sigma(t).    (13)

Then the forward process (inversion) (from x_t to x_{t+1}) in DDIM can be the reversion of the reconstruction process:

    \frac{x_{t+1}}{\sqrt{\alpha_{t+1}}} = \frac{x_t}{\sqrt{\alpha_t}} + \left(\sqrt{\frac{1-\alpha_{t+1}}{\alpha_{t+1}}} - \sqrt{\frac{1-\alpha_t}{\alpha_t}}\right)\epsilon_\theta(x_t, t).    (14)

It is worth noting that during the approximation of Eqn. (11) by Eqn. (12), there is a deviation since T is usually not infinitely large (e.g., T = 1000). The deviation is more prominent for real images than for diffusion-generated images due to the more complex characteristics of real images. The deviation caused by the approximation actually leads to our key idea of DIRE.
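To relate Eqns. (11)-(14) to an implementation, the sketch below performs the deterministic DDIM inversion step of Eqn. (14) followed by the reconstruction step of Eqn. (11) with a pre-trained noise predictor, and takes the absolute residual as DIRE. This is a simplified PyTorch illustration of the update rules only (uniform step schedule, no guidance); `eps_model` and `alphas` are placeholders for a pre-trained network and its cumulative alpha schedule, and the official code may differ in details.

```python
# Sketch of deterministic DDIM inversion (Eqn. 14) followed by reconstruction
# (Eqn. 11); DIRE is the absolute residual between the input and its reconstruction.
# `eps_model(x, t)` stands in for the pre-trained noise predictor epsilon_theta,
# and `alphas` holds the cumulative products alpha_t indexed by timestep.
import torch

@torch.no_grad()
def ddim_invert_and_reconstruct(x0, eps_model, alphas, timesteps):
    """timesteps: increasing list of DDIM steps, e.g. 20 evenly spaced steps."""
    x = x0
    # Inversion, x_t -> x_{t+1} (Eqn. 14), rearranged from the x / sqrt(alpha) form.
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur, a_next = alphas[t_cur], alphas[t_next]
        eps = eps_model(x, t_cur)
        x = (a_next.sqrt() / a_cur.sqrt()) * x + (
            (1 - a_next).sqrt() - (a_next / a_cur).sqrt() * (1 - a_cur).sqrt()
        ) * eps
    # Reconstruction, x_t -> x_{t-1} (Eqn. 11).
    for t_cur, t_prev in zip(reversed(timesteps[1:]), reversed(timesteps[:-1])):
        a_cur, a_prev = alphas[t_cur], alphas[t_prev]
        eps = eps_model(x, t_cur)
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    dire = (x0 - x).abs()  # elementwise absolute reconstruction error
    return x, dire
```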
C. More Visualization About DIRE
We visualize more examples of source images, their reconstructions, and DIREs of real images and generated images from different diffusion models in Figures 6, 7, 8, 9, and 10. The DIREs of real images tend to have larger values compared to those of diffusion-generated images.
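As a rough numerical companion to this observation, the snippet below compares the average DIRE magnitude over a folder of real images and a folder of generated images. This is our own illustrative check with example folder names, not part of the paper's protocol.

```python
# Illustrative check of the observation that DIREs of real images tend to have
# larger values than those of diffusion-generated images.
# Assumes DIRE maps are saved as PNG files in two folders; the paths are examples.
from pathlib import Path
import numpy as np
from PIL import Image

def mean_dire_magnitude(folder: str) -> float:
    """Average per-pixel DIRE value over all images in a folder, scaled to [0, 1]."""
    values = []
    for path in Path(folder).glob("*.png"):
        dire = np.asarray(Image.open(path), dtype=np.float32) / 255.0
        values.append(dire.mean())
    return float(np.mean(values))

print("real     :", mean_dire_magnitude("dire/real"))
print("generated:", mean_dire_magnitude("dire/adm"))
```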
Figure 6: The DIRE representation of real images and generated images from DDPM [17] and iDDPM [32] pre-trained on LSUN-Bedroom [49]. For each source image, we visualize its corresponding reconstruction image and DIRE.

Figure 7: The DIRE representation of generated images from ADM [10] and PNDM [23] pre-trained on LSUN-Bedroom [49]. For each source image, we visualize its corresponding reconstruction image and DIRE.

Figure 8: The DIRE representation of generated images from Stable Diffusion v1 and v2 [37] with the prompt "A photo of bedroom". For each source image, we visualize its corresponding reconstruction image and DIRE.

Figure 9: The DIRE representation of generated images from Latent Diffusion [37] and VQ-Diffusion [14] with the prompt "A photo of bedroom". For each source image, we visualize its corresponding reconstruction image and DIRE.

Figure 10: The DIRE representation of real images and generated images from ADM and Stable Diffusion v1 with the prompt "A photo of {class}" (class from ImageNet [9]). For each source image, we visualize its corresponding reconstruction image and DIRE.