
DIRE for Diffusion-Generated Image Detection

Zhendong Wang1* Jianmin Bao2* Wengang Zhou1,3
Weilun Wang1 Hezhen Hu1 Hong Chen4 Houqiang Li1,3

1 CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China
2 Microsoft Research Asia
3 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
4 Merchants Union Consumer Finance Company

{zhendongwang,wwlustc,alexhu}@mail.ustc.edu.cn
[email protected], {zhwg,lihq}@ustc.edu.cn, [email protected]

arXiv:2303.09295v1 [cs.CV] 16 Mar 2023

Abstract

Diffusion models have shown remarkable success in visual synthesis, but have also raised concerns about potential abuse for malicious purposes. In this paper, we seek to build a detector for telling apart real images from diffusion-generated images. We find that existing detectors struggle to detect images generated by diffusion models, even if we include generated images from a specific diffusion model in their training data. To address this issue, we propose a novel image representation called DIffusion Reconstruction Error (DIRE), which measures the error between an input image and its reconstruction counterpart by a pre-trained diffusion model. We observe that diffusion-generated images can be approximately reconstructed by a diffusion model while real images cannot, which hints that DIRE can serve as a bridge to distinguish generated and real images. DIRE provides an effective way to detect images generated by most diffusion models, generalizes to generated images from unseen diffusion models, and is robust to various perturbations. Furthermore, we establish a comprehensive diffusion-generated benchmark including images generated by eight diffusion models to evaluate the performance of diffusion-generated image detectors. Extensive experiments on our collected benchmark demonstrate that DIRE exhibits superiority over previous generated-image detectors. The code and dataset are available at https://github.com/ZhendongWang6/DIRE.

Figure 1: The DIRE representation of a real image and four generated images from diffusion models: DDPM [17], iDDPM [32], ADM [10], and PNDM [23], respectively (columns: Real, DDPM, iDDPM, ADM, PNDM; rows: Source, Recon., DIRE). The DIREs of real images tend to have larger values compared to diffusion-generated images.

1. Introduction

Recently, Denoising Diffusion Probabilistic Models (DDPMs) [17, 41] have set up a new paradigm in image generation due to their strong ability to generate high-quality images. Plenty of studies [32, 10, 42, 23, 37] explore improvements to the network architecture, acceleration of sampling, and so on. As users enjoy the strong generation capability of diffusion models, there are concerns about potential privacy problems. For example, diffusion models may memorize individual images from their training data and emit them at the generation stage [3, 53]. Moreover, some attackers may develop new deepfake techniques based on diffusion models. Therefore, there is an urgent demand for a diffusion-generated image detector.

Our focus in this work is to develop a general diffusion-generated image detector. We notice that various detectors for generated images are available. However, despite the fact that most diffusion models employ CNNs as the network, the generation processes of diffusion models and previous generators (e.g., GANs, VAEs) are entirely different, rendering previous generated-image detectors ineffective. A naïve thought is to train a CNN binary classifier on diffusion-generated and real images. However, we find that such a naïve scheme suffers limited generalization to unseen diffusion-generated images.

* Equal contribution.
Figure 2: Illustration of the difference between a real sample and a generated sample from the DIRE perspective. pg(x) represents the distribution of generated images while pr(x) represents the distribution of real images. xg and xr represent a generated sample and a real sample, respectively. Using the inversion and reconstruction process of DDIM [42], xg and xr become x′g and x′r, respectively. After the reconstruction, x′r actually falls within pg(x), which leads to a noticeably different DIRE for real samples compared to generated samples.

In this paper, we propose a novel image representation, called DIffusion REconstruction Error (DIRE), for detecting diffusion-generated images. The hypothesis behind DIRE is that images produced by diffusion processes can be reconstructed more accurately by a pre-trained diffusion model than real images. The diffusion reconstruction process involves two steps: (1) inverting the input image x and mapping it to a noise vector xT in the noise space N(0, I), and (2) reconstructing the image x′ from xT using a denoising process. The DIRE is calculated as the difference between x and x′. Since a sample xg from the generated distribution pg(x) and its reconstruction x′g belong to the same distribution, the DIRE value for xg would be relatively low. Conversely, the reconstruction of a real image xr is likely to differ significantly from itself, resulting in a high amplitude in DIRE. This concept is depicted in Figure 2.

DIRE offers a reliable method for differentiating between real and diffusion-generated images. By training a simple binary classifier on DIRE, it becomes possible to detect diffusion-generated images with ease. DIRE is general and flexible since it can generalize to images generated by unseen diffusion models at inference time; it only assumes the distinct reconstruction errors of real and generated images, as shown in Figure 1.

To evaluate diffusion-generated image detectors, we create a comprehensive diffusion-generated dataset, the DiffusionForensics dataset, including images generated by eight different diffusion models trained on LSUN-Bedroom [49] and ImageNet [9]. DiffusionForensics involves unconditional, conditional, and text-to-image diffusion generation models. We will release the dataset to facilitate a good benchmark for diffusion-generated image detection.

Extensive experiments show that the DIRE representation significantly enhances generalization ability. We show that our framework achieves remarkably high detection accuracy and average precision on generated images from unseen diffusion models, as well as robustness to various perturbations. In comparison with existing generated-image detectors, our framework largely exceeds the competitive state-of-the-art methods.

Our main contributions are three-fold:

• We propose a novel image representation called DIRE for detecting diffusion-generated images.

• We set up a new dataset, DiffusionForensics, for benchmarking diffusion-generated image detectors.

• Extensive experiments demonstrate that the proposed DIRE sets a state-of-the-art performance in diffusion-generated image detection.

2. Related Work

Since our focus is to detect diffusion-generated images and the proposed DIRE representation is based on the reconstruction error of a pre-trained diffusion model, we briefly introduce recent diffusion models in image generation and generalizable generated-image detection in this section.

2.1. Diffusion Models for Image Generation

Inspired by nonequilibrium thermodynamics [41], Ho et al. [17] propose a new generation paradigm, denoising diffusion probabilistic models (DDPMs), which achieves competitive performance compared to PGGAN [19] on 256 × 256 LSUN [49]. Since then, more and more researchers have turned their attention to diffusion models, improving the architectures [10, 37], accelerating sampling [32, 42, 23, 24], exploring downstream tasks [16, 31, 1, 33], etc. Nichol et al. [32] find that learning variances of the reverse process in DDPMs can contribute to an order of magnitude fewer sampling steps. Song et al. [42] generalize DDPMs via a class of non-Markovian diffusion processes into denoising diffusion implicit models (DDIMs), which leads to higher-quality samples with fewer sampling steps. A later work, ADM [10], finds a much more effective architecture and further achieves state-of-the-art performance compared to other generative models with classifier guidance. From the perspective that DDPMs can be treated as solving differential equations on manifolds, Liu et al. [23] propose pseudo numerical methods for diffusion models (PNDMs), which further improve sampling efficiency and generation quality.

Besides unconditional image generation, there are also plenty of text-to-image generation works based on diffusion models [37, 14, 35, 39, 5, 38]. Among them, VQ-Diffusion [14] is based on a VQ-VAE [45] and models the latent space by a conditional variant of DDPMs. Another typical work is LDM [37], which conditions the diffusion model on the input via a cross-attention mechanism and proposes
latent diffusion models by introducing a latent space [11]. The recent popular Stable Diffusion v1 and v2 are based on LDM [37] and further improved to achieve surprising generation performance.

2.2. Generalizable Generated Image Detection

Generated image detection has been widely explored over the past years. Earlier researchers focus on detecting generated images leveraging hand-crafted features, such as color cues [28], saturation cues [29], blending artifacts [22], and co-occurrence features [30]. Marra et al. [26] study several classical deep CNN classifiers [18, 43, 7] to detect images generated by image-to-image translation networks. However, they do not consider the generalization capability to unseen generation models. In another work, Wang et al. [48] notice this challenge and claim that training a simple classifier on ProGAN-generated images can generalize well to other unseen GAN-generated images. However, their strong generalization capability relies on large-scale training and 20 different models, each trained on a different LSUN [49] object category.

Besides detection by spatial artifacts, there are also frequency-based methods [12, 51]. Frank et al. [12] present that in the frequency domain, GAN-generated images are more likely to expose severe artifacts, mainly caused by upsampling operations in previous GAN architectures. Zhang et al. [51] propose a GAN simulator, AutoGAN, to simulate the artifacts produced by standard GAN pipelines, and then train a detector on the spectrum of the synthesized images. It can generalize to unseen generation models to some extent. Marra et al. [27] and Yu et al. [50] suggest detecting generated images by fingerprints that are often produced during GAN generation. A recent work [25] proposes a detector based on an ensemble of EfficientNet-B4 [44] to alleviate the generalization problem.

However, with the rapid development of diffusion models, a general and robust detector for images generated by diffusion models has not been explored. We note that some recent works also notice the diffusion-generated image detection problem [36, 8]. Different from them, the focus of our work is exploring a detector that generalizes across a wide range of diffusion models.

3. Method

In this paper, we present a novel representation named DIffusion Reconstruction Error (DIRE) for diffusion-generated image detection. DIRE measures the error between an input image and its reconstruction by a pre-trained diffusion model. We observe that diffusion-generated images can be more accurately reconstructed by a pre-trained diffusion model than real images. Based on this, DIRE provides discriminative properties for distinguishing diffusion-generated images from real images. The rest of this section is organized as follows. We begin by reviewing DDPMs and the inversion and reconstruction process of DDIM [42]. Then we present details of DIRE for diffusion-generated image detection. Finally, we introduce a new dataset, i.e., DiffusionForensics, for evaluating diffusion-generated image detectors.

3.1. Preliminaries

Denoising Diffusion Probabilistic Models (DDPMs). Diffusion models are first proposed in [41], inspired by nonequilibrium thermodynamics, and achieve strong performance in image generation [17, 32, 10, 37]. They define a Markov chain of diffusion steps that slowly adds Gaussian noise to data until degenerating it into an isotropic Gaussian distribution (forward process), and then learn to reverse the diffusion process to generate samples from the noise (reverse process). The Markov chain in the forward process is defined as:

q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{\tfrac{\alpha_t}{\alpha_{t-1}}}\, x_{t-1},\ \left(1 - \tfrac{\alpha_t}{\alpha_{t-1}}\right) I\right),   (1)

in which x_t is the noisy image at the t-th step, \alpha_1, \dots, \alpha_T is a predefined schedule, and T denotes the total number of steps.

An important property brought by the Markov chain is that we can obtain x_t from x_0 directly via:

q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\alpha_t}\, x_0,\ (1 - \alpha_t) I\right).   (2)

The reverse process in [17] is also defined as a Markov chain:

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right).   (3)

Diffusion models use a network p_\theta(x_{t-1} \mid x_t) to fit the real distribution q(x_{t-1} \mid x_t). The overall simplified optimization target is a sampling and denoising process as follows:

L_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta\left(\sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, \epsilon,\ t\right) \right\|^2 \right],   (4)

where \epsilon \sim \mathcal{N}(0, I).

Denoising Diffusion Implicit Models (DDIMs). DDIM [42] proposes a deterministic method for accelerating the iterative process without the Markov hypothesis. The new reverse process in DDIM is as follows:

x_{t-1} = \sqrt{\alpha_{t-1}} \left( \frac{x_t - \sqrt{1 - \alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}} \right) + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t.   (5)

If \sigma_t = 0, the reverse process becomes deterministic (the reconstruction process), in which one noise sample determines one generated image.
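To make the deterministic case concrete, below is a minimal PyTorch sketch of a single DDIM update with \sigma_t = 0. It is an illustrative sketch, not the paper's released code: the noise-prediction network eps_model(x, t) and the tensor alphas holding the cumulative schedule written as \alpha_t in Eqn. (2) are assumptions.

```python
import torch

@torch.no_grad()
def ddim_reverse_step(x_t, t, t_prev, eps_model, alphas):
    """One deterministic DDIM update (Eqn. (5) with sigma_t = 0).

    `eps_model(x, t)` is an assumed noise-prediction network; `alphas` is an
    assumed 1-D tensor holding the cumulative schedule this paper calls alpha_t.
    """
    a_t, a_prev = alphas[t], alphas[t_prev]
    eps = eps_model(x_t, t)
    # Clean image implied by the current noisy sample.
    x0_pred = (x_t - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)
    # Step to the previous (less noisy) timestep along the deterministic path.
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt(1 - a_prev) * eps
```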
Furthermore, when T is large enough (e.g., T = 1000), Eqn. (5) can be seen as Euler integration for solving ordinary differential equations (ODEs):

\frac{x_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}} = \frac{x_t}{\sqrt{\alpha_t}} + \left( \sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}} - \sqrt{\frac{1-\alpha_t}{\alpha_t}} \right) \epsilon_\theta(x_t, t).   (6)

Suppose \sigma = \sqrt{1-\alpha}/\sqrt{\alpha} and \bar{x} = x/\sqrt{\alpha}; the corresponding ODE becomes:

d\bar{x}(t) = \epsilon_\theta\left( \frac{\bar{x}(t)}{\sqrt{\sigma^2 + 1}},\ t \right) d\sigma(t).   (7)

Then the inversion process (from x_t to x_{t+1}) can be the reversion of the reconstruction process:

\frac{x_{t+1}}{\sqrt{\alpha_{t+1}}} = \frac{x_t}{\sqrt{\alpha_t}} + \left( \sqrt{\frac{1-\alpha_{t+1}}{\alpha_{t+1}}} - \sqrt{\frac{1-\alpha_t}{\alpha_t}} \right) \epsilon_\theta(x_t, t).   (8)

This process obtains the corresponding noisy sample x_T for an input image x_0. However, it is very slow to invert or sample step by step. To speed up diffusion model sampling, DDIM [42] permits us to sample a subset of S steps \tau_1, \dots, \tau_S, so that the neighboring x_t and x_{t+1} become x_{\tau_t} and x_{\tau_{t+1}}, respectively, in Eqn. (8) and Eqn. (5).

Figure 3: Illustration of the process of computing DIRE given an input image x_0. The input image x_0 is first gradually inverted into a noise image x_T by DDIM inversion [42] (Eqn. (8)), and then denoised step by step by DDIM reconstruction (Eqn. (5)) until a reconstruction x′_0 is obtained. DIRE (Eqn. (9)) is simply defined as the residual image obtained from x_0 and x′_0.

Table 1: Composition of the DiffusionForensics dataset. It includes real images from LSUN-Bedroom [49] and ImageNet [9], and generated images from the corresponding pre-trained diffusion models. According to the class of diffusion models, the contained images are divided into three classes: unconditional, conditional, and text2image.

Image Source | Condition | Generator | # of images (real/generated)
LSUN-Bedroom [49] | Unconditional | ADM [10] | 42k/42k
LSUN-Bedroom [49] | Unconditional | DDPM [17] | 42k/42k
LSUN-Bedroom [49] | Unconditional | iDDPM [32] | 42k/42k
LSUN-Bedroom [49] | Unconditional | PNDM [23] | 42k/42k
LSUN-Bedroom [49] | Text2Image | LDM [37] | 1k/1k
LSUN-Bedroom [49] | Text2Image | SD-v1 [37] | 1k/1k
LSUN-Bedroom [49] | Text2Image | SD-v2 [37] | 1k/1k
LSUN-Bedroom [49] | Text2Image | VQ-Diffusion [14] | 1k/1k
ImageNet [9] | Conditional | ADM [10] | 50k/50k
ImageNet [9] | Text2Image | SD-v1 [37] | 10k/10k

3.2. DIRE

Due to the intrinsic differences between diffusion models and previous generative models (i.e., GANs, flow-based models, VAEs), existing generated-image detectors experience dramatic performance drops when facing images generated by diffusion models. To avoid the abuse of diffusion models, it is urgent to develop a detector for diffusion-generated image detection. A straightforward approach would be to train a binary classifier using a dataset of both real and diffusion-generated images. However, it is difficult for such a method to guarantee generalization to diffusion models that have not been previously encountered.

Our research takes note of the fact that images generated by diffusion models are essentially sampled from the distribution of the diffusion generation space (pg(x)), while real images are sampled from another distribution (pr(x)), which may be near pg(x) but is not exactly the same. Our core motivation is that samples from the diffusion generation space pg(x) are more likely to be reconstructed by a pre-trained diffusion model, while real images are not.

So the key idea of our work is to make use of the diffusion model itself to detect diffusion-generated images. We find that images generated by diffusion models are more likely to be reconstructed by a pre-trained diffusion model. On the other hand, due to the complex characteristics of real images, real images cannot be well reconstructed. As shown in Figure 1, the reconstruction errors of real and diffusion-generated images show dramatically different properties.

Given an input image x_0, we wish to judge whether it is synthesized by diffusion models. Take a pre-trained diffusion model \epsilon_\theta(x_t, t). As shown in Figure 3, we apply the DDIM [42] inversion process to gradually add Gaussian noise to x_0 via Eqn. (8). After S steps, x_0 becomes a point x_T in the isotropic Gaussian noise distribution. The inversion process finds the corresponding point in the noisy space; then the DDIM [42] generation process (Eqn. (5)) is employed to reconstruct the input image and produce a recovered version x′_0. The differences between x_0 and x′_0 help to distinguish real from generated. The DIRE is then defined as:

DIRE(x_0) = |x_0 - R(I(x_0))|,   (9)

where | · | denotes computing the absolute value, I(·) is a series of inversion steps with Eqn. (8), and R(·) is a series of reconstruction steps with Eqn. (5).
Then, for real images and diffusion-generated images, we obtain their DIRE representations and train a binary classifier to distinguish their DIREs with a simple binary cross-entropy loss, formulated as follows:

L(y, y') = - \sum_{i=1}^{N} \left( y_i \log(y'_i) + (1 - y_i) \log(1 - y'_i) \right),   (10)

where N is the mini-batch size, y is the ground-truth label, and y' is the corresponding prediction by the detector. In the inference stage, we first apply a diffusion model to reconstruct the image and get the DIRE. Subsequently, we input the DIRE into the binary classifier, which then classifies the source image as either real or generated.

3.3. DiffusionForensics: A Dataset for Evaluating Diffusion-Generated Image Detectors

To better evaluate the performance of diffusion-generated image detectors, we collect a dataset, DiffusionForensics, which is comprised of ten subsets for comprehensive experiments. Its composition is shown in Table 1. The images can be roughly divided into two classes by their source: LSUN-Bedroom [49] and ImageNet [9].

LSUN-Bedroom. We collect images generated by eight diffusion models trained on LSUN-Bedroom, in which four subsets (ADM [10], DDPM [17], iDDPM [32], PNDM [23]) are generated by unconditional diffusion models and the other four (LDM [37], SD-v1 [37], SD-v2 [37], VQ-Diffusion [14]) are generated by text2image diffusion models. The text prompt for all text2image generation is "A photo of bedroom". For the images generated by unconditional diffusion models, the 42k generated images for ADM [10], iDDPM [32], and PNDM [23] are split into 40k (training), 1k (validation), and 1k (testing). The 1k images generated by DDPM [17] are used for testing. Further, for generalization evaluation on text2image generation, 1k images are generated with each of the pre-trained LDM [37], SD-v1 [37], SD-v2 [37], and VQ-Diffusion [14] models or their provided APIs.

ImageNet. We further collect images from ImageNet for evaluating detectors facing more universal image generation and for cross-dataset evaluation. To be specific, we collect images from a conditional diffusion model (ADM [10]) and a text2image diffusion model (SD-v1 [37]) in which the text prompt for generation is "A photo of {class}" (1k classes from ImageNet [9]). Applying the pre-trained ADM model [10] with classifier guidance, we generate 50k images in total (50 images for each class in ImageNet), i.e., 40k for training, 5k for validation, and 5k for testing. Images generated by the text2image model [37] are used only for testing. The split of real images for training/validation/testing is the same as for the corresponding generated images. Besides, all the data in our dataset come in triplets, i.e., source image, reconstructed image, and corresponding DIRE image. In general, the proposed DiffusionForensics dataset contains unconditional, conditional, and text2image generated images, which supports evaluation from various aspects.

4. Experiment

In this section, we first introduce the experimental setups and then provide extensive experimental results to demonstrate the superiority of our approach.

4.1. Experimental Setup

Data pre-processing and augmentation. All the experiments are conducted on our DiffusionForensics dataset. To calculate the DIRE for each image, we use the ADM [10] network pre-trained on LSUN-Bedroom as the reconstruction model, with the DDIM [42] inversion and reconstruction process using S = 20 steps by default. We employ ResNet-50 [15] as our forensics classifier. The size of most images (ADM [10], DDPM [17], iDDPM [32], PNDM [23], VQ-Diffusion [14], LDM [37]) in the dataset is 256 × 256. For Stable Diffusion [37] v1 and v2, the generated images are resized to 256 × 256 with bicubic interpolation. During training, the images fed into the network are randomly cropped to 224 × 224 and horizontally flipped with a probability of 0.5. During testing, the images are center-cropped to 224 × 224.

Evaluation metrics. Following previous generated-image detection methods [48, 47, 52], we mainly report accuracy (ACC) and average precision (AP) in our experiments to evaluate the detectors. The threshold for computing accuracy is set to 0.5 following [48].

Baselines. 1) CNNDetection [48] proposes a CNN-generated image detection model that can be trained on one CNN dataset and then generalized to other CNN-synthesized images. 2) GANDetection [25] applies an ensemble of EfficientNet-B4 [44] to increase the detection performance. 3) SBI [40] trains a general synthetic-image detector on images generated by blending pseudo source and target images from single pristine images. 4) Patchforensics [4] employs a patch-wise classifier, which is claimed to be better than simple classifiers for fake image detection. 5) F3Net [34] proposes that the frequency information of images is essential for fake image detection.
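The setup above maps onto a fairly standard PyTorch training and evaluation loop. The sketch below assumes the DIRE images have already been computed and stored; the 224 × 224 crop, flip, 0.5 threshold, and BCE objective follow Sec. 4.1 and Eqn. (10), while the optimizer choice and learning rate are assumptions not stated in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from sklearn.metrics import average_precision_score

# Augmentation for training on pre-computed DIRE images (Sec. 4.1).
train_tf = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

model = models.resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, 1)          # one logit: real (0) vs. generated (1)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed optimizer and learning rate

def train_step(dire_batch, labels):
    """One optimization step of the binary classifier on a batch of DIRE images (Eqn. (10))."""
    logits = model(dire_batch).squeeze(1)
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def evaluate(probs, labels, threshold=0.5):
    """ACC at the fixed 0.5 threshold (following [48]) and AP over the score ranking."""
    preds = (torch.as_tensor(probs) >= threshold).long()
    acc = (preds == torch.as_tensor(labels)).float().mean().item()
    ap = average_precision_score(labels, probs)
    return acc, ap
```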
4.2. Comparison to Existing Detectors

Diffusion models [17, 10] are claimed to exhibit better generation ability than previous generation models (e.g., GAN [13], VAE [21]). We notice that previous detectors achieve surprising performance on images generated by CNNs [20, 6, 2], but their generalization ability when facing recent diffusion-generated images has not been well explored. Here, we evaluate CNNDetection [48], GANDetection [25], Patchforensics [4], and SBI [40] on the proposed DiffusionForensics dataset using the pre-trained weights downloaded from their official repositories.

Table 2: Comprehensive comparison of our DIRE and other generated-image detectors. The previous detectors, including CNNDet [48], GANDet [25], Patchfor [4], and SBI [40], are tested on our DiffusionForensics benchmark using their provided models. * denotes our reproduction trained on the ADM subset of DiffusionForensics with the official codes. All the diffusion-generation models [10, 23, 32] used for preparing training data are unconditional models pre-trained on LSUN-Bedroom (LSUN-B.) [49]. Generated images from StyleGAN [20] on LSUN-Bedroom are downloaded from the official repository. All the testing images produced by text-to-image generators (SD-v1 [37], SD-v2 [37], LDM [37], VQDiffusion [14]) are prompted by "A photo of bedroom". We report ACC (%) and AP (%) (ACC/AP in the table).

Method | Training dataset | Generation model | Reconstruction model | ADM | DDPM | iDDPM | PNDM | SD-v1 | SD-v2 | LDM | VQDiffusion | Total Avg.
CNNDet [48] | LSUN | ProGAN | – | 50.1/63.4 | 56.7/74.6 | 50.1/77.6 | 50.3/82.9 | 50.2/70.9 | 50.8/80.4 | 50.1/60.2 | 50.1/70.6 | 51.1/72.6
GANDet [25] | LSUN | ProGAN | – | 54.2/43.6 | 52.2/47.3 | 45.7/57.3 | 42.1/77.6 | 68.1/78.5 | 61.5/52.7 | 79.2/57.1 | 64.8/52.3 | 58.5/58.3
Patchfor [4] | FF++ | Multiple | – | 50.4/74.8 | 56.8/67.4 | 50.3/69.5 | 55.1/78.5 | 49.9/84.7 | 50.0/52.8 | 54.0/92.0 | 92.8/99.7 | 57.4/77.4
SBI [40] | FF++ | Multiple | – | 53.6/57.7 | 55.8/47.4 | 54.0/58.2 | 46.7/44.8 | 65.6/75.9 | 55.0/59.8 | 81.0/88.3 | 59.6/66.6 | 58.9/62.3
CNNDet* [48] | LSUN-B. | ADM | – | 100/100 | 87.3/99.6 | 100/100 | 77.8/99.1 | 77.4/85.8 | 83.4/98.2 | 96.0/99.9 | 70.3/96.1 | 86.5/97.3
Patchfor* [4] | LSUN-B. | ADM | – | 100/100 | 72.9/100 | 100/100 | 96.6/100 | 63.2/71.3 | 97.2/100 | 97.3/100 | 100/100 | 90.9/96.4
F3Net* [34] | LSUN-B. | ADM | – | 94.3/99.6 | 93.7/99.6 | 94.8/99.9 | 94.1/99.2 | 86.1/95.3 | 83.6/91.7 | 92.5/97.8 | 93.4/98.7 | 91.6/97.7
DIRE (ours) | LSUN-B. | ADM | ADM | 100/100 | 100/100 | 100/100 | 99.7/100 | 99.7/100 | 100/100 | 100/100 | 100/100 | 99.9/100
DIRE (ours) | LSUN-B. | PNDM | ADM | 100/100 | 100/100 | 100/100 | 100/100 | 89.4/99.9 | 100/100 | 100/100 | 100/100 | 98.7/100
DIRE (ours) | LSUN-B. | iDDPM | ADM | 99.6/100 | 100/100 | 100/100 | 89.7/99.8 | 99.7/100 | 100/100 | 99.9/100 | 99.9/100 | 98.6/100
DIRE (ours) | LSUN-B. | StyleGAN | ADM | 98.8/100 | 99.8/100 | 99.9/100 | 89.6/100 | 95.2/100 | 100/100 | 100/100 | 100/100 | 97.9/100

Table 3: Cross-dataset evaluation on ImageNet (IN) [9] and LSUN-Bedroom (LSUN-B.) [49]. Each testing generator is pre-trained on the corresponding dataset. Images generated by Stable Diffusion-v1 (SD-v1) are prompted by "A photo of {class}", in which the classes are from [9]. ACC (%) and AP (%) are reported (ACC/AP in the table).

Training dataset | Generation model | ADM (IN) | SD-v1 (IN) | StyleGAN (LSUN-B.)
LSUN-B. | ADM | 90.2/97.9 | 97.2/99.8 | 99.9/100
LSUN-B. | iDDPM | 90.2/97.9 | 93.7/99.3 | 99.9/100
LSUN-B. | StyleGAN | 76.9/94.4 | 89.7/99.0 | 100/100

The quantitative results can be found in Table 2. We find that existing detectors have a significant drop in performance when dealing with diffusion-generated images, with ACC results lower than 60%. We also include diffusion-generated images (ADM [10]) as training data and re-train CNNDetection [48], Patchforensics [4], and F3Net [34], whose training codes are publicly available. The resulting models get a significant improvement on images generated by the same diffusion models as used in training, but still perform unsatisfactorily on unseen diffusion models. In contrast, our method, DIRE, shines with excellent generalization performance. Concretely, DIRE with the generation model and the reconstruction model set to ADM achieves an average of 99.9% ACC and 100% AP on detecting images generated by various diffusion models.

4.3. Generalization Capability Evaluation

Effect of the choice of generation and reconstruction models. We evaluate the impact of different choices of the generation and reconstruction models on the generalization capability. We employ the ADM [10] model as the reconstruction model and apply different models for generating images. After generation, the ADM model converts these images to their DIREs for training a binary classifier. In this evaluation, we select three different generation models: PNDM [23] and iDDPM [32] (diffusion models) and StyleGAN [20] (a GAN model). The results are reported in Table 2. Despite the inconsistent use of generation and reconstruction models during training, DIRE still keeps a strong generalization capability. Specifically, when pairing iDDPM [32] as the generation model and ADM [10] as the reconstruction model, DIRE achieves 98.6% ACC and 100% AP on average, highlighting its adaptation to images generated by different diffusion models. It is worth noting that when the generation model is StyleGAN, DIRE still exhibits excellent performance. This might be attributed to DIRE's capability of incorporating the generation properties of generation models other than diffusion models.

Cross-dataset evaluation. We further design a more challenging scenario, i.e., training the detector with images generated by models pre-trained on LSUN-Bedroom [49] and then testing it on images produced by models pre-trained on ImageNet [9]. We choose three different generators for generating training images: ADM [10], iDDPM [32], and StyleGAN [20]. The evaluation results on ADM (IN) are shown in Table 3. The comparison indicates that DIRE maintains a satisfactory generalization capability even when facing unseen datasets, i.e., ACC/AP of 90.2%/97.9% when training on images generated by ADM and iDDPM. And for StyleGAN [20], DIRE still achieves 94.4% in AP, although there is a huge domain gap in that both the dataset and the generation models differ between training and testing. This evaluation further validates that the proposed DIRE is a general image representation for this task.
[Figure 4 plots AP (%) for CNNDetection, GANDetection, SBI, CNNDetection*, F3Net*, Patchforensics*, and Ours on the ADM, DDPM, iDDPM, and PNDM subsets, under Gaussian blur (sigma from 0 to 3) and JPEG compression (quality 100, 65, 30).]

Figure 4: Robustness to unseen perturbations. The top row shows the robustness to Gaussian blur, and the bottom row shows the robustness to JPEG compression. * denotes our reproduction trained on the ADM subset of DiffusionForensics, with AP (%) reported for the robustness comparison.

Table 4: Influence of different inversion steps. All the models in this experiment are trained on the ADM subset. ACC (%) and AP (%) are reported (ACC/AP in the table).

S | ADM | DDPM | iDDPM | PNDM | SD-v1
5 | 100/100 | 100/100 | 100/100 | 97.5/100 | 87.5/99.8
10 | 100/100 | 100/100 | 100/100 | 99.4/100 | 98.2/100
20 | 100/100 | 100/100 | 100/100 | 99.7/100 | 99.7/100
50 | 100/100 | 100/100 | 100/100 | 100/100 | 99.9/100

Unseen text-to-image generation evaluation. Furthermore, we seek to verify whether DIRE can detect images generated by unseen text-to-image models. We adopt Stable Diffusion v1 (SD-v1) as the generation model and generate images based on the class labels of ImageNet [9] for evaluating detectors. The results are shown in Table 3. Our detector, trained with images generated by ADM pre-trained on LSUN-Bedroom, achieves 97.2% ACC and 99.8% AP, demonstrating the strong generalization capability of DIRE to text-to-image generation models.

Unseen GAN generation evaluation. Besides generalization between diffusion models, we further evaluate the performance of DIRE on images generated by GANs. We test the performance of three generation models: ADM, iDDPM, and StyleGAN pre-trained on LSUN-Bedroom. The results are reported in Table 3. Although the classifier never encounters any GAN-generated image during training, it achieves surprising performance when detecting GAN-generated images, i.e., 99% in ACC and 100% in AP. This indicates that DIRE is not only an effective image representation for diffusion-generated image detection but may also be beneficial for detecting GAN-generated images.

4.4. Robustness to Unseen Perturbations

Besides the generalization to unseen generation models, the robustness to unseen perturbations is also a common concern, since in real-world applications images are usually perturbed by various degradations. Here, we evaluate the robustness of detectors to two classes of degradations, i.e., Gaussian blur and JPEG compression, following [48]. The perturbations are added at three levels for Gaussian blur (σ = 1, 2, 3) and two levels for JPEG compression (quality = 65, 30). We explore the robustness of our baselines CNNDetection [48], GANDetection [25], SBI [40], F3Net [34], Patchforensics [4], and our DIRE. The results are shown in Figure 4. We observe that at each level of blur and JPEG compression, our DIRE achieves perfect performance without any drop. It is worth noting that our reproductions of CNNDetection* [48] and Patchforensics* [4] trained on the ADM subset of DiffusionForensics also get satisfactory performance, while they experience a dramatic performance drop facing JPEG compression, which further reveals that training on RGB images may not be robust.
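The two degradations above can be simulated with Pillow. The sketch below is an assumption about how the perturbations are implemented (the paper only names the blur sigmas and JPEG qualities); it applies them to test images before the detection pipeline.

```python
import io
from PIL import Image, ImageFilter

def gaussian_blur(img: Image.Image, sigma: float) -> Image.Image:
    """Gaussian blur at sigma in {1, 2, 3}, matching the blur levels of Sec. 4.4."""
    return img.filter(ImageFilter.GaussianBlur(radius=sigma))

def jpeg_compress(img: Image.Image, quality: int) -> Image.Image:
    """JPEG re-compression at quality in {65, 30}, matching the JPEG levels of Sec. 4.4."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```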
4.5. More Analysis of the Proposed DIRE

How do the inversion steps in DDIM affect the detection performance? Recent diffusion models [42, 10] find that more steps contribute to higher-quality images, and DDIM [42] sampling can improve the generation performance compared to the original DDPM [17] sampling. Here, we explore the influence of different inversion steps on diffusion-generated image detection. Note that the number of reconstruction steps is the same as the number of inversion steps by default. The results are reported in Table 4. We observe that more steps in DDIM benefit the detection performance of DIRE. Considering the computational cost, we choose 20 steps by default.

Is DIRE really better than the original RGB for detecting diffusion-generated images? We conduct an experiment on various forms of input for detection, including RGB images, reconstructed images (REC), DIRE, and the combination of RGB and DIRE (RGB&DIRE). The results displayed in Table 5 reveal that REC performs much worse than RGB, suggesting that reconstructed images are not suitable as input for detection. One possible explanation is the loss of essential information during reconstruction by a pre-trained diffusion model. The comparison between RGB and DIRE also demonstrates that DIRE serves as a stronger image representation, contributing to a more generalizable detector than simply training on RGB images. Furthermore, we find that combining RGB with DIRE hurts generalization compared to pure DIRE. Therefore, we use DIRE as the default input for detection.

Table 5: Influence of different input information. All the models in this experiment are trained on the ADM subset. ACC (%) and AP (%) are reported (ACC/AP in the table).

Input | ADM | DDPM | iDDPM | PNDM | SD-v1
REC | 100/100 | 57.1/57.7 | 49.7/92.6 | 87.1/98.7 | 46.9/57.0
RGB | 100/100 | 87.3/99.6 | 100/100 | 77.8/99.1 | 77.4/85.8
RGB&DIRE | 100/100 | 99.8/100 | 99.9/100 | 99.2/100 | 62.4/92.4
DIRE | 100/100 | 100/100 | 100/100 | 99.7/100 | 99.7/100

Effect of different calculations of DIRE. After computing the residual between the reconstructed image and the source image, whether to take the absolute value should be considered. As reported in Table 6, we find that the absolute operation is critical for achieving a strong diffusion-generated image detector, particularly on SD-v1 [37], where it improves ACC/AP from 87.0%/93.0% to 99.7%/100%. So, by default, the absolute operation is applied in all our models.

Table 6: Effect of computing the absolute value (ABS) when obtaining DIRE. All the models in this experiment are trained on the ADM subset. ACC (%) and AP (%) are reported (ACC/AP in the table).

Variant | ADM | DDPM | iDDPM | PNDM | SD-v1
w/o ABS | 100/100 | 99.4/100 | 100/100 | 98.2/100 | 87.0/93.0
w/ ABS | 100/100 | 100/100 | 100/100 | 99.7/100 | 99.7/100

Qualitative analysis of DIRE. The above quantitative experiments have indicated the effectiveness of the proposed DIRE. As analyzed before, the key motivation behind DIRE is that generated images can be approximately reconstructed by a pre-trained diffusion model while real images cannot. DIRE makes use of the residual characteristic of an input image and its reconstruction for discrimination. To gain a better understanding of its intrinsic properties, we conduct a further qualitative analysis of DIRE, utilizing noise pattern and frequency analysis for visualization.

When images are acquired, various factors from hardware, such as lenses and sensors, and from software algorithms, such as compression and demosaicing, can impact image quality at the low level. One typical low-level analysis of images is noise pattern analysis¹, which is usually regular and corresponds to the shape of objects in real scenes. In addition to low-level analysis, frequency analysis can provide frequency information about images. To compute the frequency information of DIRE, we use FFT algorithms.

We visualize the results of the aforementioned two analysis tools in Figure 5. The visual comparison of noise patterns highlights significant differences between the DIRE of real and diffusion-generated images from the low-level perspective, with real images tending to be regular and corresponding to the shape of objects while diffusion-generated images tend to be messy. By comparing the FFT spectra of DIRE from real and diffusion-generated images, we observe that the FFT spectrum of real images is usually more abundant than that of diffusion-generated images, which confirms that real images are more difficult to reconstruct with a pre-trained diffusion model.

Figure 5: Noise pattern and frequency analysis of DIRE of real and generated images (panels: (a) Real, (b) ADM, (c) DDPM, (d) iDDPM, (e) PNDM). The noise pattern is regular and portrays the shape of objects in the DIRE of real images, while it is messy in the DIRE of diffusion-generated images. For the frequency analysis, the frequency bands in the DIRE of real images are more abundant than those of diffusion-generated images, i.e., the white regions in the frequency domain are larger.

¹ https://29a.ch/photo-forensics/#noise-analysis
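For the frequency analysis, the paper only states that FFT algorithms are used; a minimal NumPy sketch of the centered log-magnitude spectrum (the channel averaging, centering, and log scaling are our assumptions about the visualization) is:

```python
import numpy as np

def fft_log_spectrum(dire_img: np.ndarray) -> np.ndarray:
    """Centered log-magnitude FFT spectrum of a DIRE image, as visualized in Figure 5."""
    if dire_img.ndim == 3:                       # collapse channels to grayscale
        dire_img = dire_img.mean(axis=2)
    spec = np.fft.fftshift(np.fft.fft2(dire_img))
    return np.log(np.abs(spec) + 1e-8)
```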
5. Conclusion

In this paper, we focus on building a generalizable detector for discriminating diffusion-generated images. We find that previous generated-image detectors show limited performance when detecting images generated by diffusion models. To address the issue, we present an image representation called DIRE based on the reconstruction errors of images inverted and reconstructed by DDIM. Furthermore, we create a new dataset, DiffusionForensics, which includes images generated by unconditional, conditional, and text-to-image diffusion models to facilitate the evaluation of diffusion-generated images. Extensive experiments indicate that the proposed image representation DIRE contributes to a strong diffusion-generated image detector, which is very effective for this task. We hope that our work can serve as a solid baseline for diffusion-generated image detection.

References

[1] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In CVPR, 2022.
[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[3] Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. arXiv preprint arXiv:2301.13188, 2023.
[4] Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? Understanding properties that generalize. In ECCV, 2020.
[5] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-Imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022.
[6] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[7] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[8] Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. arXiv preprint arXiv:2211.00680, 2022.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
[11] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
[12] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. In ICML, 2020.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[14] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, 2022.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
[18] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[19] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[20] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[21] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[22] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. Face X-ray for more general face forgery detection. In CVPR, 2020.
[23] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
[24] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
[25] Sara Mandelli, Nicolò Bonettini, Paolo Bestagini, and Stefano Tubaro. Detecting GAN-generated images by orthogonal training of multiple CNNs. In ICIP, 2022.
[26] Francesco Marra, Diego Gragnaniello, Davide Cozzolino, and Luisa Verdoliva. Detection of GAN-generated fake images over social networks. In MIPR, 2018.
[27] Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and Giovanni Poggi. Do GANs leave artificial fingerprints? In MIPR, 2019.
[28] Scott McCloskey and Michael Albright. Detecting GAN-generated imagery using color cues. arXiv preprint arXiv:1812.08247, 2018.
[29] Scott McCloskey and Michael Albright. Detecting GAN-generated imagery using saturation cues. In ICIP, 2019.
[30] Lakshmanan Nataraj, Tajuddin Manhar Mohammed, Shivkumar Chandrasekaran, Arjuna Flenner, Jawadul H Bappy, Amit K Roy-Chowdhury, and BS Manjunath. Detecting GAN generated fake images using co-occurrence matrices. arXiv preprint arXiv:1903.06836, 2019.
[31] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[32] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.
[33] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027, 2023.
[34] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In ECCV, 2020.
[35] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[36] Jonas Ricker, Simon Damm, Thorsten Holz, and Asja Fischer. Towards the detection of diffusion model deepfakes. arXiv preprint arXiv:2210.14571, 2022.
[37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[38] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[39] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[40] Kaede Shiohara and Toshihiko Yamasaki. Detecting deepfakes with self-blended images. In CVPR, 2022.
[41] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[42] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[43] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[44] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[45] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
[46] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
[47] Sheng-Yu Wang, Oliver Wang, Andrew Owens, Richard Zhang, and Alexei A Efros. Detecting photoshopped faces by scripting Photoshop. In ICCV, 2019.
[48] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. CNN-generated images are surprisingly easy to spot... for now. In CVPR, 2020.
[49] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[50] Ning Yu, Larry S Davis, and Mario Fritz. Attributing fake images to GANs: Learning and analyzing GAN fingerprints. In ICCV, 2019.
[51] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and simulating artifacts in GAN fake images. In WIFS, 2019.
[52] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. Learning rich features for image manipulation detection. In CVPR, 2018.
[53] Derui Zhu, Dingfan Chen, Jens Grossklags, and Mario Fritz. Data forensics in diffusion models: A systematic analysis of membership privacy. arXiv preprint arXiv:2302.07801, 2023.
A. More Details of DiffusionForensics

In this section, we give more details about the proposed DiffusionForensics dataset. The real images in the LSUN-Bedroom and ImageNet subsets are from the source LSUN-Bedroom [49] and ImageNet [9] datasets, respectively. To generate the DIREs of all the real and generated images in DiffusionForensics, we use the unconditional ADM [10] model pre-trained on LSUN-Bedroom, with the DDIM [42] scheduler applied for 20 steps in total, as the reconstruction model. For the ImageNet subset, we also provide the DIREs computed using the unconditional ADM [10] model pre-trained on ImageNet as the reconstruction model, again with the DDIM [42] scheduler applied for 20 steps in total.

LSUN-Bedroom-ADM. We download the pre-trained LSUN-Bedroom model of ADM [10] from the official repository². We then sample 42k images for training (40k), validation (1k), and testing (1k), with the DDIM [42] scheduler applied for better sampling with 50 steps.

LSUN-Bedroom-DDPM. We download the provided 1k images generated by the model pre-trained on LSUN-Bedroom from the official repository³.

LSUN-Bedroom-iDDPM. We sample 42k images using the official codes and the pre-trained LSUN-Bedroom model (lr=2e-5)⁴, with the DDIM [42] scheduler applied for better sampling with 50 steps.

LSUN-Bedroom-PNDM. We sample 42k images using the official codes and the pre-trained LSUN-Bedroom model⁵, with their PNDM [23] scheduler applied for better sampling with 50 steps.

LSUN-Bedroom-LDM. The pipeline code for sampling is downloaded from diffusers [46]⁶. The version of the Latent Diffusion model (LDM) [37] we used is "CompVis/ldm-text2im-large-256". We give the prompt "A photo of bedroom" to generate 1k bedroom images.

LSUN-Bedroom-SD-v1. The pipeline code for sampling is downloaded from diffusers [46]. The version of Stable Diffusion (SD) v1 [37] we used is "runwayml/stable-diffusion-v1-5". We give the prompt "A photo of bedroom" to generate 1k bedroom images.

LSUN-Bedroom-SD-v2. The pipeline code for sampling is downloaded from diffusers [46]. The version of Stable Diffusion (SD) v2 [37] we used is "stabilityai/stable-diffusion-2". We give the prompt "A photo of bedroom" to generate 1k bedroom images.

LSUN-Bedroom-VQDiffusion. We sample 1k images using the official codes and the pre-trained ITHQ model⁷ with the prompt "A photo of bedroom".

ImageNet-ADM. We sample 50k images for the 1k classes of ImageNet [9] using the pre-trained conditional ADM model and the provided ImageNet classifier from the official repository, with the DDIM [42] scheduler applied for better sampling with 50 steps. The images are divided in the ratio of 8:1:1 for training:validation:testing.

ImageNet-SD-v1. We sample 10k images using the pre-trained Stable Diffusion v1.5 model with code provided by diffusers [46]. The prompt for generation is "A photo of {class}", in which the class is chosen from the 1k classes of ImageNet [9], resulting in ten generated images for each class.

² https://github.com/openai/guided-diffusion
³ https://github.com/hojonathanho/diffusion
⁴ https://github.com/openai/improved-diffusion
⁵ https://github.com/luping-liu/PNDM
⁶ https://github.com/huggingface/diffusers
⁷ https://github.com/microsoft/VQ-Diffusion

B. More Explanation of DIRE

We have demonstrated the key motivation of our DIRE. But one may wonder why the DIREs of diffusion-generated images are not zero-valued images. Here, we explain this from the perspective of the approximation made when solving ordinary differential equations (ODEs).

The deterministic reverse process (reconstruction) in DDIM [42] is as follows:

x_{t-1} = \sqrt{\alpha_{t-1}} \left( \frac{x_t - \sqrt{1 - \alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}} \right) + \sqrt{1 - \alpha_{t-1}} \cdot \epsilon_\theta(x_t, t).   (11)

When the total number of steps T is large enough (e.g., T = 1000), Eqn. (11) can be seen as Euler integration for solving ordinary differential equations (ODEs):

\frac{x_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}} = \frac{x_t}{\sqrt{\alpha_t}} + \left( \sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}} - \sqrt{\frac{1-\alpha_t}{\alpha_t}} \right) \epsilon_\theta(x_t, t).   (12)

Suppose \sigma = \sqrt{1-\alpha}/\sqrt{\alpha} and \bar{x} = x/\sqrt{\alpha}; the corresponding ODE becomes:

d\bar{x}(t) = \epsilon_\theta\left( \frac{\bar{x}(t)}{\sqrt{\sigma^2 + 1}},\ t \right) d\sigma(t).   (13)

Then the forward process (inversion), from x_t to x_{t+1}, in DDIM can be the reversion of the reconstruction process:

\frac{x_{t+1}}{\sqrt{\alpha_{t+1}}} = \frac{x_t}{\sqrt{\alpha_t}} + \left( \sqrt{\frac{1-\alpha_{t+1}}{\alpha_{t+1}}} - \sqrt{\frac{1-\alpha_t}{\alpha_t}} \right) \epsilon_\theta(x_t, t).   (14)

It is worth noting that during the approximation of Eqn. (11) by Eqn. (12), there is a deviation, since T is usually not infinitely large (e.g., T = 1000). The deviation is more prominent for real images than for diffusion-generated images due to the more complex characteristics of real images. The deviation caused by the approximation actually leads to our key idea of DIRE.
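As a side note to Appendix A, the text-to-image subsets can be reproduced with the diffusers [46] pipelines named there. A minimal sketch for the LSUN-Bedroom-SD-v1 subset follows; the model id and prompt come from Appendix A, while the device, image count, and file naming are assumptions.

```python
from diffusers import StableDiffusionPipeline

# Model id and prompt follow Appendix A (LSUN-Bedroom-SD-v1).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

for i in range(1000):
    image = pipe("A photo of bedroom").images[0]
    image.save(f"sd_v1_bedroom_{i:04d}.png")
```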
C. More Visualization About DIRE
We visualize more examples of source images, their re-
constructions, and DIREs of real images and generated im-
ages from different diffusion models in Figures 6, 7, 8, 9,
10. The DIREs of real images tend to have larger values
compared to diffusion-generated images.
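The triplet layout of Figures 6-10 can be reproduced with a simple matplotlib grid. The helper below is an illustrative sketch and assumes the source, reconstruction, and DIRE images are already available as arrays.

```python
import matplotlib.pyplot as plt

def show_triplets(sources, recons, dires):
    """Plot Source / Recon. / DIRE rows for a few images, mirroring the layout of Figures 6-10."""
    n = len(sources)
    fig, axes = plt.subplots(3, n, figsize=(2 * n, 6), squeeze=False)
    for row, (name, imgs) in enumerate([("Source", sources), ("Recon.", recons), ("DIRE", dires)]):
        for col in range(n):
            ax = axes[row][col]
            ax.imshow(imgs[col])
            ax.set_xticks([]); ax.set_yticks([])
        axes[row][0].set_ylabel(name)
    plt.tight_layout()
    plt.show()
```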
Figure 6: The DIRE representation of real images and generated images from DDPM [17] and iDDPM [32] pre-trained on
LSUN-Bedroom [49]. For each source image, we visualize its corresponding reconstruction image and DIRE.
Figure 7: The DIRE representation of generated images from ADM [10] and PNDM [23] pre-trained on LSUN-Bedroom [49].
For each source image, we visualize its corresponding reconstruction image and DIRE.
Figure 8: The DIRE representation of generated images from Stable Diffusion v1 and v2 [37] with the prompt “A photo of
bedroom”. For each source image, we visualize its corresponding reconstruction image and DIRE.
Figure 9: The DIRE representation of generated images from Latent Diffusion [37] and VQ-Diffusion [14] with the prompt “A
photo of bedroom”. For each source image, we visualize its corresponding reconstruction image and DIRE.
Figure 10: The DIRE representation of real images and generated images from ADM and Stable Diffusion v1 with the prompt
“A photo of {class}” (class from ImageNet [9]). For each source image, we visualize its corresponding reconstruction image and
DIRE.
