Diffusion-Generated Image Detection
Figure 2: Illustration of the difference between a real sample and a generated sample from the DIRE perspective. p_g(x) represents the distribution of generated images while p_r(x) represents the distribution of real images. x_g and x_r represent a generated sample and a real sample, respectively. Using the inversion and reconstruction process of DDIM [42], x_g and x_r become x'_g and x'_r, respectively. After the reconstruction, x'_r actually falls within p_g(x), which leads to a noticeably different DIRE for real samples compared to generated samples.

In this paper, we propose a novel image representation, called DIffusion REconstruction Error (DIRE), for detecting diffusion-generated images. The hypothesis behind DIRE is that images produced by diffusion processes can be reconstructed more accurately by a pre-trained diffusion model than real images. The diffusion reconstruction process involves two steps: (1) inverting the input image x and mapping it to a noise vector x_T in the noise space N(0, I), and (2) reconstructing the image x' from x_T using a denoising process. The DIRE is calculated as the difference between x and x'. As a sample x_g from the generated distribution p_g(x) and its reconstruction x'_g belong to the same distribution, the DIRE value for x_g would be relatively low. Conversely, the reconstruction of a real image x_r is likely to differ significantly from itself, resulting in a high amplitude in DIRE. This concept is depicted in Figure 2.

DIRE offers a reliable method for differentiating between real and diffusion-generated images. By training a simple binary classifier based on DIRE, it becomes possible to detect diffusion-generated images with ease. DIRE is general and flexible since it can generalize to images generated by unseen diffusion models at inference time. It only assumes the distinct reconstruction errors of real images and generated ones, as shown in Figure 1.

To evaluate diffusion-generated image detectors, we create a comprehensive diffusion-generated dataset, the DiffusionForensics dataset, including images from three domains (LSUN-Bedroom [48], ImageNet [9], and CelebA-HQ [19]) generated by eleven different diffusion models. DiffusionForensics involves unconditional, conditional, and text-to-image diffusion generation models.

Extensive experiments show that the DIRE representation significantly enhances generalization ability. We show that our framework achieves remarkably high detection accuracy and average precision on generated images from unseen diffusion models, as well as robustness to various perturbations. In comparison with existing generated image detectors, our framework largely exceeds the competitive state-of-the-art methods.

Our main contributions are three-fold as follows.
• We propose a novel image representation called DIRE for detecting diffusion-generated images.
• We set up a new dataset, DiffusionForensics, for benchmarking diffusion-generated image detectors.
• Extensive experiments demonstrate that the proposed DIRE sets a state-of-the-art performance in diffusion-generated image detection.

2. Related Work

Since our focus is to detect diffusion-generated images and the proposed DIRE representation is based on the reconstruction error of a pre-trained diffusion model, we briefly introduce recent diffusion models for image generation and generalizable generated image detection in this section.

2.1. Diffusion Models for Image Generation

Inspired by nonequilibrium thermodynamics [41], Ho et al. [17] propose a new generation paradigm, denoising diffusion probabilistic models (DDPMs), which achieves competitive performance compared to PGGAN [19] on 256 × 256 LSUN [48]. Since then, more and more researchers have turned their attention to diffusion models, improving architectures [10, 37], accelerating sampling [32, 42, 23, 24], exploring downstream tasks [16, 31, 1, 33], etc. Nichol et al. [32] find that learning the variances of the reverse process in DDPMs can contribute to an order of magnitude fewer sampling steps. Song et al. [42] generalize DDPMs via a class of non-Markovian diffusion processes into denoising diffusion implicit models (DDIMs), which leads to higher-quality samples with fewer sampling steps. A later work, ADM [10], finds a much more effective architecture and further achieves state-of-the-art performance compared to other generative models with classifier guidance. From the perspective that DDPMs can be treated as solving differential equations on manifolds, Liu et al. [23] propose pseudo numerical methods for diffusion models (PNDMs), which further improve sampling efficiency and generation quality.

Besides unconditional image generation, there are also plenty of text-to-image generation works based on diffusion models [37, 14, 35, 39, 5, 38]. Among them, VQ-Diffusion [14] is based on a VQ-VAE [45] and models the latent space with a conditional variant of DDPMs. Another typical work is LDM [37], which conditions the diffusion model on the input through a cross-attention mechanism and proposes latent diffusion models by introducing a latent space [11].
The recent popular Stable Diffusion v1 and v2 are based on LDM [37] and are further improved to achieve surprising generation performance.

2.2. Generalizable Generated Image Detection

Generated image detection has been widely explored over the past years. Earlier researchers focus on detecting generated images leveraging hand-crafted features, such as color cues [28], saturation cues [29], blending artifacts [22], and co-occurrence features [30]. Marra et al. [26] study several classical deep CNN classifiers [18, 43, 7] to detect images generated by image-to-image translation networks. However, they do not consider the generalization capability to unseen generation models. In another work, Wang et al. [47] notice this challenge and claim that training a simple classifier on ProGAN-generated images can generalize well to other unseen GAN-generated images. However, their strong generalization capability relies on large-scale training and 20 different models, each trained on a different LSUN [48] object category.

Besides detection by spatial artifacts, there are also frequency-based methods [12, 50]. Frank et al. [12] show that in the frequency domain, GAN-generated images are more likely to expose severe artifacts, mainly caused by upsampling operations in previous GAN architectures. Zhang et al. [50] propose a GAN simulator, AutoGAN, to simulate the artifacts produced by standard GAN pipelines. They then train a detector on the spectrum of the synthesized images. It can generalize to unseen generation models to some extent. Marra et al. [27] and Yu et al. [49] suggest detecting generated images by fingerprints that are often produced during GAN generation. A recent work [25] proposes a detector based on an ensemble of EfficientNet-B4 [44] to alleviate the generalization problem.

However, with the rapid development of diffusion models, a general and robust detector for images generated by diffusion models has not been explored. We note that some recent works also notice the diffusion-generated image detection problem [36, 8]. Different from them, the focus of our work is exploring a generalizable detector for a wide range of diffusion models.

3. Method

In this section, we first introduce the preliminaries of diffusion models, including DDPMs, and the inversion and reconstruction process of DDIM [42]. Then we present details of DIRE for diffusion-generated image detection. Finally, we introduce a new dataset, i.e., DiffusionForensics, for evaluating diffusion-generated image detectors.

3.1. Preliminaries

Denoising Diffusion Probabilistic Models (DDPMs). Diffusion models are first proposed in [41], inspired by nonequilibrium thermodynamics, and achieve strong performance in image generation [17, 32, 10, 37]. They define a Markov chain of diffusion steps that slowly adds Gaussian noise to data until it degenerates into an isotropic Gaussian distribution (forward process), and then learn to reverse the diffusion process to generate samples from the noise (reverse process). The Markov chain in the forward process is defined as:

q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\Big(\mathbf{x}_t;\, \sqrt{\tfrac{\alpha_t}{\alpha_{t-1}}}\,\mathbf{x}_{t-1},\, \big(1 - \tfrac{\alpha_t}{\alpha_{t-1}}\big)\mathbf{I}\Big),   (1)

in which x_t is the noisy image at the t-th step, α_1, ..., α_T is a predefined schedule, and T denotes the total number of steps. An important property brought by the Markov chain is that we can obtain x_t from x_0 directly via:

q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{\alpha_t}\,\mathbf{x}_0,\, (1 - \alpha_t)\mathbf{I}\big).   (2)

The reverse process in [17] is also defined as a Markov chain:

p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big).   (3)

Diffusion models use a network p_θ(x_{t−1} | x_t) to fit the real distribution q(x_{t−1} | x_t). The overall simplified optimization target is a sampling and denoising process as follows:

L_\mathrm{simple}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\Big[\big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon},\, t)\big\|^2\Big],   (4)

where ε ∼ N(0, I).

Denoising Diffusion Implicit Models (DDIMs). DDIM [42] proposes a new deterministic method for accelerating the iterative process without the Markov assumption. The new reverse process in DDIM is as follows:

\mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}}\left(\frac{\mathbf{x}_t - \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\alpha_t}}\right) + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) + \sigma_t \epsilon_t.   (5)
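To make the DDPM preliminaries concrete, the following is a minimal PyTorch sketch of the closed-form forward sampling of Eq. (2) and the simplified loss of Eq. (4). It is an illustration rather than the authors' code: `alphas` stands for the cumulative schedule (the α_t above) and `eps_model` for any noise-prediction network ε_θ(x_t, t).

```python
import torch

def q_sample(x0, t, alphas, noise=None):
    """Eq. (2): draw x_t ~ q(x_t | x_0) in closed form."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_t = alphas[t].view(-1, 1, 1, 1)              # cumulative schedule value per sample
    return a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * noise, noise

def simple_loss(eps_model, x0, alphas):
    """Eq. (4): sample a timestep, add noise, and regress the noise with an MSE loss."""
    t = torch.randint(0, len(alphas), (x0.shape[0],), device=x0.device)
    x_t, noise = q_sample(x0, t, alphas)
    eps_pred = eps_model(x_t, t)                   # assumed signature: eps_model(x_t, t)
    return torch.mean((noise - eps_pred) ** 2)
```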
Figure 3: Illustration of the process of computing DIRE given an input image x_0. The input image x_0 is first gradually inverted into a noise image x_T by DDIM inversion [42] (Eqn. (8)), and then denoised step by step until a reconstruction x'_0 is obtained (Eqn. (5)). DIRE is simply defined as the residual image obtained from x_0 and x'_0 (Eqn. (9)).

Table 1: Composition of the DiffusionForensics dataset. It includes real images from LSUN-Bedroom [48], ImageNet [9], and CelebA-HQ [19], and generated images from pre-trained diffusion models. According to the class of diffusion models, the contained images are divided into three categories: unconditional, conditional, and text-to-image.

Image Source       | Condition      | Generator          | # of images
LSUN-Bedroom [48]  | real           | -                  | 42,000
                   | unconditional  | ADM [10]           | 42,000
                   | unconditional  | DDPM [17]          | 42,000
                   | unconditional  | iDDPM [32]         | 42,000
                   | unconditional  | PNDM [23]          | 42,000
                   | text-to-image  | LDM [37]           | 42,000
                   | text-to-image  | SD-v1 [37]         | 42,000
                   | text-to-image  | SD-v2 [37]         | 42,000
                   | text-to-image  | VQ-Diffusion [14]  | 42,000
                   | text-to-image  | IF [39]            | 1,000
                   | text-to-image  | DALLE-2 [35]       | 500
                   | text-to-image  | Midjourney         | 100
ImageNet [9]       | real           | -                  | 50,000
                   | conditional    | ADM [10]           | 50,000
                   | text-to-image  | SD-v1 [37]         | 50,000
CelebA-HQ [19]     | real           | -                  | 42,000
                   | text-to-image  | SD-v2 [37]         | 42,000
                   | text-to-image  | IF [39]            | 1,000
                   | text-to-image  | DALLE-2 [35]       | 500
                   | text-to-image  | Midjourney         | 100

Suppose σ = √(1−α)/√α and x̄ = x/√α; the corresponding ODE becomes:

\mathrm{d}\bar{\mathbf{x}}(t) = \boldsymbol{\epsilon}_\theta\left(\frac{\bar{\mathbf{x}}(t)}{\sqrt{\sigma^2+1}},\, t\right)\mathrm{d}\sigma(t).   (7)

Then the inversion process (from x_t to x_{t+1}) can be taken as the reversal of the reconstruction process:

\frac{\mathbf{x}_{t+1}}{\sqrt{\alpha_{t+1}}} = \frac{\mathbf{x}_t}{\sqrt{\alpha_t}} + \left(\sqrt{\frac{1-\alpha_{t+1}}{\alpha_{t+1}}} - \sqrt{\frac{1-\alpha_t}{\alpha_t}}\right)\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t).   (8)

This process is used to obtain the corresponding noisy sample x_T for an input image x_0. However, it is very slow to invert or sample step by step. To speed up diffusion model sampling, DDIM [42] permits us to sample a subset of S steps τ_1, ..., τ_S, so that the neighboring x_t and x_{t+1} become x_{τ_t} and x_{τ_{t+1}}, respectively, in Eqn. (8) and Eqn. (5).

3.2. DIRE

Due to the intrinsic differences between diffusion models and previous generative models (i.e., GANs, flow-based models, VAEs), existing generated image detectors experience dramatic performance drops when facing images generated by diffusion models. To avoid the abuse of diffusion models, it is urgent to develop a detector for diffusion-generated image detection. A straightforward approach would be to train a binary classifier using a dataset of both real and diffusion-generated images. However, it is difficult for such a method to guarantee generalization to diffusion models that have not been previously encountered.

Our research takes note of the fact that images generated by diffusion models are essentially sampled from the distribution of the diffusion generation space (p_g(x)), while real images are sampled from another distribution (p_r(x)), which may be near to p_g(x) but is not exactly the same. Our core motivation is that samples from the diffusion generation space p_g(x) are more likely to be reconstructed by a pre-trained diffusion model, while real images cannot. So the key idea of our work is to make use of the diffusion model to detect diffusion-generated images. We find that images generated by diffusion models are more likely to be reconstructed by a pre-trained diffusion model. On the other hand, due to the complex characteristics of real images, real images cannot be well reconstructed. As shown in Figure 1, the reconstruction errors of real and diffusion-generated images show dramatically different properties.

Given an input image x_0, we wish to judge whether it is synthesized by diffusion models. Take a pre-trained diffusion model ε_θ(x_t, t). As shown in Figure 3, we apply the DDIM [42] inversion process to gradually add Gaussian noise into x_0 via Eqn. (8). After S steps, x_0 becomes a point x_T in the isotropic Gaussian noise distribution. The inversion process finds the corresponding point in the noise space; then the DDIM [42] generation process (Eqn. (5)) is employed to reconstruct the input image and produce a recovered version x'_0. The difference between x_0 and x'_0 helps to distinguish real from generated images. The DIRE is then defined as:

\mathrm{DIRE}(\mathbf{x}_0) = |\mathbf{x}_0 - \mathbf{R}(\mathbf{I}(\mathbf{x}_0))|,   (9)

where |·| denotes computing the absolute value, I(·) is a series of inversion steps with Eqn. (8), and R(·) is a series of reconstruction steps with Eqn. (5).
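The following is a simplified sketch of how Eqns. (8), (5) (with σ_t = 0), and (9) compose into the DIRE computation. It is not the released implementation: `eps_model` denotes the pre-trained network ε_θ, `alphas` the cumulative noise schedule indexed by timestep, and the evenly spaced S-step schedule is an assumption.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas, timesteps):
    """Eq. (8): deterministically map an input image to its DDIM latent x_T."""
    x = x0
    for i in range(len(timesteps) - 1):
        t, t_next = int(timesteps[i]), int(timesteps[i + 1])
        a_t, a_next = alphas[t], alphas[t_next]
        eps = eps_model(x, torch.full((x.shape[0],), t, device=x.device))
        x = a_next.sqrt() * (x / a_t.sqrt()
                             + (((1 - a_next) / a_next).sqrt()
                                - ((1 - a_t) / a_t).sqrt()) * eps)
    return x

@torch.no_grad()
def ddim_reconstruct(xT, eps_model, alphas, timesteps):
    """Eq. (5) with sigma_t = 0: denoise the latent back to an image x'_0."""
    x = xT
    for i in reversed(range(len(timesteps) - 1)):
        t, t_prev = int(timesteps[i + 1]), int(timesteps[i])
        a_t, a_prev = alphas[t], alphas[t_prev]
        eps = eps_model(x, torch.full((x.shape[0],), t, device=x.device))
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x

@torch.no_grad()
def dire(x0, eps_model, alphas, num_steps=20):
    """Eq. (9): DIRE(x_0) = |x_0 - R(I(x_0))|."""
    # S evenly spaced DDIM timesteps (20 by default, matching Sec. 4.1).
    timesteps = torch.linspace(0, len(alphas) - 1, num_steps + 1).long()
    x_T = ddim_invert(x0, eps_model, alphas, timesteps)
    x0_rec = ddim_reconstruct(x_T, eps_model, alphas, timesteps)
    return (x0 - x0_rec).abs()
```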
Then, for real images and diffusion-generated images, we can compute their DIRE representations, and we train a binary classifier to distinguish their DIREs with a simple binary cross-entropy loss, which is formulated as follows:

\mathcal{L}(\mathbf{y}, \mathbf{y}') = -\sum_{i=1}^{N}\big(\mathbf{y}_i\log(\mathbf{y}'_i) + (1-\mathbf{y}_i)\log(1-\mathbf{y}'_i)\big),   (10)

where N is the mini-batch size, y is the ground-truth label, and y' is the corresponding prediction by the detector. In the inference stage, we first apply a diffusion model to reconstruct the image and obtain the DIRE. Subsequently, we input the DIRE into the binary classifier, which then classifies the source image as either real or generated.
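A minimal sketch of such a classifier is given below, assuming DIRE images are pre-computed and a dataloader yields (DIRE, label) pairs; the ResNet-50 backbone follows Sec. 4.1, while the optimizer, learning rate, and label convention (1 = generated) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(num_classes=1)                    # binary head: one logit per image
criterion = nn.BCEWithLogitsLoss()                 # numerically stable form of Eq. (10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(dire_batch, labels):
    """dire_batch: (N, 3, 224, 224) DIRE crops; labels: 1 = generated, 0 = real (assumed)."""
    logits = model(dire_batch).squeeze(1)
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict(dire_batch, threshold=0.5):
    """Inference: DIRE -> classifier probability, thresholded at 0.5 (Sec. 4.1)."""
    probs = torch.sigmoid(model(dire_batch).squeeze(1))
    return (probs > threshold).long()
```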
3.3. DiffusionForensics: A Dataset for Evaluating Diffusion-Generated Image Detectors

To better evaluate the performance of diffusion-generated image detectors, we establish a dataset, DiffusionForensics, which is comprised of images generated by various diffusion models for comprehensive experiments. Its composition is shown in Table 1. The images can be roughly divided into three classes by their source: LSUN-Bedroom [48], ImageNet [9], and CelebA-HQ [19].

LSUN-Bedroom. We collect bedroom images generated by 11 diffusion models, in which four subsets (ADM [10], DDPM [17], iDDPM [32], PNDM [23]) are generated by unconditional diffusion models and the other seven (LDM [37], SD-v1 [37], SD-v2 [37], VQ-Diffusion [14], IF [39]¹, DALLE-2 [35], and Midjourney²) are generated by text-to-image diffusion models. The text prompt for all the text-to-image generation is "A photo of bedroom". The numbers of real and generated images in each image domain are listed in Table 1. Subsets containing 42,000 images are divided into 40,000 for training, 1,000 for validation, and 1,000 for testing. The remaining subsets (IF, DALLE-2, Midjourney) are only used for testing.

¹ Reproduced version of Imagen by DeepFloyd Lab at StabilityAI: https://github.com/deep-floyd/IF
² https://www.midjourney.com

ImageNet. We further collect images from ImageNet for evaluating detectors facing more universal image generation and for cross-dataset evaluation. To be specific, we collect images from a conditional diffusion model (ADM [10]) with class conditioning. Applying the pre-trained ADM model [10], we generate 50,000 images in total (50 images for each class in ImageNet), i.e., 40,000 for training, 5,000 for validation, and 5,000 for testing. For text-to-image diffusion generation, we employ SD-v1 [37], in which the text prompt for generation is "A photo of {class}" (1,000 classes from ImageNet [9]). The number and split of images are the same as for conditional ADM.

CelebA-HQ. Besides the bedroom and universal ImageNet scenarios, one may be curious about the face domain. We further collect 42,000 real images from CelebA-HQ [19] and sample 42,000 face images using the pre-trained SD-v2 model with the prompt "A professional photograph of face". The 40,000/1,000/1,000 images in this SD-v2 subset are used as the face-domain training/validation/testing dataset. Further, we collect 1,000 IF images, 500 DALLE-2 images, and 100 Midjourney images only for face-domain evaluation.

The split of real images for training/validation/testing is 40,000/1,000/1,000 when the number of real images is 42,000, and 40,000/5,000/5,000 when the number of real images is 50,000. Besides, all the data in the DiffusionForensics dataset are triplets, i.e., source image, reconstructed image, and corresponding DIRE image. In general, the proposed DiffusionForensics dataset contains unconditional, conditional, and text-to-image generated images, which makes it fertile ground for evaluation from various aspects.

4. Experiment

In this section, we first introduce the experimental setups and then provide extensive experimental results to demonstrate the superiority of our approach.

4.1. Experimental Setup

Data pre-processing and augmentation. All the experiments are conducted on our DiffusionForensics dataset. To calculate DIRE for each image, we use the ADM [10] network pre-trained on LSUN-Bedroom as the reconstruction model, and the DDIM [42] inversion and reconstruction process in which the number of diffusion steps is S = 20 by default. We employ ResNet-50 [15] as our forensics classifier. The size of most images (ADM [10], DDPM [17], iDDPM [32], PNDM [23], VQ-Diffusion [14], LDM [37]) in the dataset is 256 × 256. For Stable Diffusion [37] v1 and v2, IF, DALLE-2, and Midjourney, the generated images are resized to 256 × 256 with bicubic interpolation. During training, the images fed into the network are randomly cropped to 224 × 224 and horizontally flipped with a probability of 0.5. During testing, the images are center-cropped to 224 × 224.
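Under the stated settings, a plausible torchvision pipeline is sketched below; only the resize, crop, and flip parameters come from the text, while applying the resize to every image and omitting normalization are simplifying assumptions.

```python
from torchvision import transforms

# Training: random 224x224 crop plus horizontal flip with probability 0.5.
train_transform = transforms.Compose([
    transforms.Resize((256, 256), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# Testing: deterministic 224x224 center crop.
test_transform = transforms.Compose([
    transforms.Resize((256, 256), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```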
Evaluation metrics. Following previous generated-image detection methods [47, 46, 51], we report accuracy (ACC) and average precision (AP) in our experiments to evaluate the detectors. The threshold for computing accuracy is set to 0.5 following [47].

Baselines. 1) CNNDetection [47] proposes a CNN-generated image detection model that can be trained on one CNN dataset and then generalized to other CNN-synthesized images. 2) GANDetection [25] applies an ensemble of EfficientNet-B4 [44] to increase the detection performance. 3) SBI [40] trains a general synthetic-image detector on images generated by blending pseudo source and target images from single pristine images. 4) Patchforensics [4] employs a patch-wise classifier, which is claimed to be better than simple classifiers for fake image detection. 5) F3Net [34] proposes that the frequency information of images is essential for fake image detection.
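For reference, ACC at the 0.5 threshold and AP can be computed with scikit-learn roughly as follows (array names are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

def evaluate(y_true, y_score, threshold=0.5):
    """y_true: 0/1 labels; y_score: predicted probability of being generated."""
    y_pred = (np.asarray(y_score) > threshold).astype(int)
    acc = accuracy_score(y_true, y_pred)           # ACC with the 0.5 threshold
    ap = average_precision_score(y_true, y_score)  # AP over the full score range
    return acc, ap
```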
Table 2: Comprehensive comparisons of our DIRE and other generated image detectors on the LSUN-Bedroom split of DiffusionForensics. The previous detectors, including CNNDet [47], GANDet [25], Patchfor [4], and SBI [40], are evaluated with their provided weights. * denotes our reproduced training with the official codes. All the used diffusion-generation models [10, 23, 32] for preparing training data are unconditional models pre-trained on LSUN-Bedroom (LSUN-B.) [48]. Generated images from StyleGAN [20] trained on LSUN-Bedroom are downloaded from the official repository. All the testing images produced by text-to-image generators (SD-v1 [37], SD-v2 [37], LDM [37], VQDiffusion [14], IF, DALLE-2, Midjourney) are prompted by "A photo of bedroom". We report ACC (%) and AP (%) (ACC/AP in the Table).

Method | Training dataset | Generation model | Recon. model | ADM | DDPM | iDDPM | PNDM | SD-v1 | SD-v2 | LDM | VQD | IF | DALLE-2 | Mid. | Total Avg.
CNNDet [47] | LSUN | ProGAN | - | 50.1/63.4 | 56.7/74.6 | 50.1/77.6 | 50.3/82.9 | 50.2/70.9 | 50.8/80.4 | 50.1/60.2 | 50.1/70.6 | 51.3/79.7 | 68.4/78.9 | 90.8/11.1 | 56.3/68.2
GANDet [25] | LSUN | ProGAN | - | 54.2/43.6 | 52.2/47.3 | 45.7/57.3 | 42.1/77.6 | 68.1/78.5 | 61.5/52.7 | 79.2/57.1 | 64.8/52.3 | 90.6/16.1 | 95.7/11.8 | 92.3/24.1 | 67.9/47.1
Patchfor [4] | FF++ | Multiple | - | 50.4/74.8 | 56.8/67.4 | 50.3/69.5 | 55.1/78.5 | 49.9/84.7 | 50.0/52.8 | 54.0/92.0 | 92.8/99.7 | 55.3/88.1 | 66.9/65.1 | 90.9/81.5 | 61.1/77.6
SBI [40] | FF++ | Multiple | - | 53.6/57.7 | 55.8/47.4 | 54.0/58.2 | 46.7/44.8 | 65.6/75.9 | 55.0/59.8 | 81.0/88.3 | 59.6/66.6 | 70.8/78.1 | 67.7/52.5 | 76.5/9.6 | 62.4/58.1
CNNDet* [47] | LSUN-B. | ADM | - | 100/100 | 83.7/99.5 | 100/100 | 71.2/98.6 | 77.4/85.8 | 85.9/98.4 | 98.9/100 | 72.9/97.2 | 99.8/100 | 90.9/95.1 | 92.4/52.0 | 88.5/93.3
Patchfor* [4] | LSUN-B. | ADM | - | 100/100 | 72.9/100 | 100/100 | 96.6/100 | 63.2/71.3 | 97.2/100 | 97.3/100 | 100/100 | 99.8/100 | 100/100 | 99.4/100 | 93.3/97.4
F3Net* [34] | LSUN-B. | ADM | - | 96.0/99.7 | 95.5/99.6 | 96.4/99.9 | 96.0/99.7 | 86.1/95.3 | 81.1/91.5 | 93.8/98.4 | 90.1/96.7 | 89.4/96.6 | 92.9/95.8 | 86.9/23.1 | 91.3/90.6
DIRE (ours) | LSUN-B. | ADM | ADM | 100/100 | 100/100 | 100/100 | 99.7/100 | 99.7/100 | 100/100 | 100/100 | 100/100 | 100/100 | 100/100 | 100/100 | 99.9/100
DIRE (ours) | LSUN-B. | PNDM | ADM | 100/100 | 100/100 | 100/100 | 100/100 | 89.4/99.9 | 100/100 | 100/100 | 100/100 | 100/100 | 100/100 | 100/100 | 99.0/100
DIRE (ours) | LSUN-B. | iDDPM | ADM | 99.6/100 | 100/100 | 100/100 | 89.7/99.8 | 99.7/100 | 100/100 | 99.9/100 | 99.9/100 | 99.9/100 | 99.6/100 | 100/100 | 98.9/100
DIRE (ours) | LSUN-B. | StyleGAN | ADM | 98.8/100 | 99.8/100 | 99.9/100 | 89.6/100 | 95.2/100 | 100/100 | 100/100 | 100/100 | 100/100 | 99.9/100 | 100/100 | 98.5/100

Table 3: Face domain evaluation. All detectors are trained on CelebA-HQ [19] and diffusion images generated by SD-v2 [37]. * denotes our reproduced training with the official codes. When generating images using SD-v2 and IF, the prompt used is "A professional photograph of face". ACC (%) and AP (%) are reported (ACC/AP in the Table).

Method | SD-v2 | IF | DALLE-2 | Midjourney | StarGAN
CNNDet* [47] | 95.7/99.8 | 71.1/82.7 | 64.8/33.7 | 90.4/69.3 | 30.7/45.3
F3Net* [34] | 89.9/99.1 | 75.2/84.9 | 75.2/69.8 | 82.5/87.9 | 27.0/45.2
DIRE (ours) | 96.7/100 | 96.8/99.9 | 95.6/99.9 | 99.1/100 | 97.9/99.8

4.2. Comparison to Existing Detectors

Diffusion models [17, 10] are claimed to exhibit better generation ability than previous generative models (e.g., GAN [13], VAE [21]). We notice that previous detectors achieve surprising performance on images generated by CNNs [20, 6, 2], but the generalization ability to recent diffusion-generated images has not been well explored. Here, we evaluate CNNDetection [47], GANDetection [25], Patchforensics [4], and SBI [40] on the proposed DiffusionForensics dataset using the pre-trained weights downloaded from their official repositories.

First, we conduct experiments on the LSUN-Bedroom split of DiffusionForensics. The quantitative results can be found in Table 2. We find that existing detectors have a significant performance drop when dealing with diffusion-generated images, with ACC results lower than 70%. We also include diffusion-generated images (ADM [10]) as training data and re-train CNNDetection [47], Patchforensics [4], and F3Net [34], whose training codes are publicly available. The resulting models get a significant improvement on images generated by the same diffusion models as used in training, but still perform unsatisfactorily when facing unseen diffusion models. In contrast, our DIRE shines with excellent generalization performance. Concretely, DIRE with both the generation model used to prepare training data and the reconstruction model used to compute DIRE set to ADM achieves an average of 99.9% ACC and 100% AP when detecting bedroom images generated by various diffusion models.

Besides the comprehensive comparisons on bedroom images, we further conduct a comparison of DIRE and previous detectors on the CelebA-HQ split of DiffusionForensics. The results are reported in Table 3. Our DIRE, CNNDet [47], and F3Net [34] are trained with images generated by SD-v2 and evaluated with images generated by SD-v2, IF, DALLE-2, Midjourney, and StarGAN. The results demonstrate that our DIRE has a much stronger capability when detecting generated face images.
Table 4: Cross-dataset evaluation on the ImageNet (IN) split using the detectors trained on the LSUN-Bedroom split of DiffusionForensics. * denotes our reproduced training with the official codes. Each testing set is generated by the corresponding generation model pre-trained on the corresponding dataset. Images generated by SD-v1 are prompted by "A photo of {class}", in which the classes are from [9]. ACC (%) and AP (%) are reported (ACC/AP in the Table).

Method | Generation model | ADM | SD-v1
CNNDet* [47] | ADM | 66.2/82.3 | 47.0/80.4
F3Net* [34] | ADM | 69.5/87.3 | 44.8/82.6
DIRE (ours) | ADM | 98.4/99.9 | 97.2/99.6
DIRE (ours) | iDDPM | 93.4/99.4 | 92.5/98.8
DIRE (ours) | StyleGAN | 85.6/98.4 | 85.4/98.1

Table 5: GAN evaluation on the LSUN-Bedroom split. * denotes our reproduced training on the LSUN-Bedroom-ADM subset of DiffusionForensics with the official codes. Each testing set is generated by the corresponding GAN model pre-trained on the corresponding dataset. ACC (%) and AP (%) are reported (ACC/AP in the Table).

Method | StyleGAN | ProjGAN | Diff-StyleGAN | Diff-ProjGAN
CNNDet* [47] | 94.3/99.8 | 62.2/93.2 | 68.1/91.4 | 60.0/92.6
F3Net* [34] | 88.1/95.5 | 74.4/86.0 | 85.5/94.4 | 70.2/83.0
DIRE (ours) | 99.8/100 | 100/100 | 100/100 | 100/100

Specifically, when pairing iDDPM [32] as the generation model with ADM [10] as the reconstruction model, DIRE achieves 98.9% ACC and 100% AP on average, highlighting its adaptability to images generated by different diffusion models. It is worth noting that when the generation model is StyleGAN, DIRE still exhibits excellent performance. This might be attributed to DIRE's capability of incorporating the generation properties of other generation models besides diffusion models.

Cross-dataset evaluation. We further design a more challenging scenario, i.e., training the detector with images generated by models pre-trained on LSUN-Bedroom [48] and then testing it on images produced by models pre-trained on ImageNet [9]. We choose three different generators for generating training images: ADM [10], iDDPM [32], and StyleGAN [20]. The evaluation results on ADM (IN) are shown in Table 4. We find that CNNDet [47] and F3Net [34] suffer a dramatic performance drop, but DIRE still maintains a satisfactory generalization capability even when facing unseen datasets, i.e., ACC/AP of 98.4%/99.9% and 93.4%/99.4% when training on images generated by ADM and iDDPM, respectively. This evaluation further validates that the proposed DIRE is a general image representation for detecting diffusion-generated images.

Unseen text-to-image generation evaluation. Furthermore, we seek to verify whether DIRE can detect images generated by unseen text-to-image models. We adopt SD-v1 as the generation model and generate images based on the class labels of ImageNet [9]. The results are shown in Table 4. Our detector DIRE, trained with images generated by ADM pre-trained on LSUN-Bedroom, achieves 97.2% ACC and 99.6% AP, demonstrating the strong generalization capability of DIRE to text-to-image generation models.

Unseen GAN evaluation. Besides generalization between diffusion models, we further evaluate the performance of DIRE on images generated by GANs. We evaluate the performance of our DIRE, CNNDet [47], and F3Net [34] trained on the ADM subset of the LSUN-Bedroom split. The results are reported in Table 5. In this setting, none of the detectors is trained with any GAN images. Our reproduced CNNDet and F3Net experience a significant performance drop, which suggests that previous generated-image detectors fail to transfer across diffusion and GAN models. In contrast, DIRE achieves surprising performance when detecting GAN-generated images. This indicates that DIRE is not only an effective image representation for diffusion-generated image detection but may also be beneficial for detecting GAN-generated images, even though DIRE is built upon a mathematical formulation of the diffusion forward and reverse processes.

4.4. Robustness to Unseen Perturbations

Besides the generalization to unseen generation models, the robustness to unseen perturbations is also a common concern, since in real-world applications images are usually perturbed by various degradations. Here, we evaluate the robustness of detectors to two classes of degradations, i.e., Gaussian blur and JPEG compression, following [47]. The perturbations are added under three levels for Gaussian blur (σ = 1, 2, 3) and two levels for JPEG compression (quality = 65, 30). We explore the robustness of our baselines CNNDetection [47], GANDetection [25], SBI [40], F3Net [34], Patchforensics [4], and our DIRE. The results are shown in Figure 4. We observe that at each level of blur and JPEG compression, our DIRE achieves perfect performance without any drop. It is worth noting that our reproductions of CNNDetection* [47] and Patchforensics* [4] trained on the LSUN-Bedroom-ADM subset of DiffusionForensics also get satisfactory performance, while they experience a dramatic performance drop when facing JPEG compression, which further reveals that training on RGB images may not be robust.
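The two degradations can be reproduced with standard image libraries; a sketch is given below, assuming Pillow for JPEG re-encoding and torchvision for Gaussian blur (the kernel size is an assumption, since only σ is specified in the text).

```python
import io
from PIL import Image
from torchvision import transforms

def gaussian_blur(img: Image.Image, sigma: float) -> Image.Image:
    """Gaussian blur with sigma in {1, 2, 3}; the 9x9 kernel size is chosen here, not given in the paper."""
    return transforms.GaussianBlur(kernel_size=9, sigma=sigma)(img)

def jpeg_compress(img: Image.Image, quality: int) -> Image.Image:
    """JPEG re-compression with quality in {65, 30}."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```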
4.5. More Analysis of the Proposed DIRE

In this subsection, we conduct experiments on the LSUN-Bedroom split of DiffusionForensics to help better understand DIRE.

How do the inversion steps in DDIM affect the detection performance? Recent diffusion models [42, 10] find that more steps contribute to higher-quality images, and DDIM [42] sampling can improve the generation performance compared to the original DDPM [17] sampling.
Figure 4: Robustness to unseen perturbations. Each panel plots AP (%) for CNNDetection, GANDetection, SBI, CNNDetection*, F3Net*, Patchforensics*, and Ours on the ADM, DDPM, iDDPM, and PNDM subsets. The top row shows the robustness to Gaussian blur (sigma from 0 to 3), and the bottom row shows the robustness to JPEG compression (quality 100, 65, 30). * denotes our reproduced training on the LSUN-Bedroom-ADM subset of DiffusionForensics, with AP (%) reported for robustness comparison.

Table 6: Influence of different inversion steps. All the models in this experiment are trained on the ADM subset and tested on other subsets of LSUN-Bedroom. ACC (%) and AP (%) are reported (ACC/AP in the Table).

S | ADM | DDPM | iDDPM | PNDM | SD-v1
5 | 100/100 | 100/100 | 100/100 | 97.5/100 | 87.5/99.8
10 | 100/100 | 100/100 | 100/100 | 99.4/100 | 98.2/100
20 | 100/100 | 100/100 | 100/100 | 99.7/100 | 99.7/100
50 | 100/100 | 100/100 | 100/100 | 100/100 | 99.9/100

Table 7: Influence of different input information. All the models in this experiment are trained on the ADM subset and tested on other subsets of LSUN-Bedroom. ACC (%) and AP (%) are reported (ACC/AP in the Table).

Input | ADM | DDPM | iDDPM | PNDM | SD-v1
REC | 100/100 | 57.1/57.7 | 49.7/92.6 | 87.1/98.7 | 46.9/57.0
RGB | 100/100 | 87.3/99.6 | 100/100 | 77.8/99.1 | 77.4/85.8
RGB&DIRE | 100/100 | 99.8/100 | 99.9/100 | 99.2/100 | 62.4/92.4
DIRE | 100/100 | 100/100 | 100/100 | 99.7/100 | 99.7/100

Table 8: Effect of computing the absolute value (ABS) when obtaining DIRE. All the models in this experiment are trained on the ADM subset and tested on other subsets of LSUN-Bedroom. ACC (%) and AP (%) are reported (ACC/AP in the Table).

Variant | ADM | DDPM | iDDPM | PNDM | SD-v1
w/o ABS | 100/100 | 99.4/100 | 100/100 | 98.2/100 | 87.0/93.0
w/ ABS | 100/100 | 100/100 | 100/100 | 99.7/100 | 99.7/100

Here, we explore the influence of different inversion steps in diffusion-generated image detection. Note that the number of steps in reconstruction is the same as in the inversion by default. The results are reported in Table 6. We observe that more steps in DDIM benefit the detection performance of DIRE. Considering the computational cost, we choose 20 steps by default.

Is DIRE really better than the original RGB for detecting diffusion-generated images? We conduct an experiment on various forms of input for detection, including RGB images, reconstructed images (REC), DIRE, and the combination of RGB and DIRE (RGB&DIRE). The results displayed in Table 7 reveal that REC performs much worse than RGB, suggesting that reconstructed images are not suitable as input information for detection. One possible explanation is the loss of essential information during reconstruction by a pre-trained diffusion model. The comparison between RGB and DIRE also demonstrates that DIRE serves as a stronger image representation, contributing to a more generalizable detector than simply training on RGB images. Furthermore, we find that combining RGB with DIRE hurts generalization compared to pure DIRE. Therefore, we use DIRE as the input for detection by default.

Effect of different calculations of DIRE. After computing the residual between the reconstructed image and the source image, whether to compute the absolute value should be considered. As reported in Table 8, we find that the absolute operation is critical for achieving a strong diffusion-generated image detector, particularly on SD-v1 [37], where it improves ACC/AP from 87.0%/93.0% to 99.7%/100%. So by default, the absolute operation is applied in all our models.
Qualitative Analysis of DIRE. The above quantitative experiments have indicated the effectiveness of the proposed DIRE. As analyzed before, the key motivation behind DIRE is that generated images can be approximately reconstructed by a pre-trained diffusion model while real images cannot. DIRE makes use of the residual between an input image and its reconstruction for discrimination. To gain a better understanding of its intrinsic properties, we conduct a further qualitative analysis of DIRE, utilizing noise pattern and frequency analysis for visualization.

When images are acquired, various factors from hardware, such as lenses and sensors, and from software algorithms, such as compression and demosaicing, can impact image quality at the low level. One typical low-level analysis of images is noise pattern analysis³, which is usually regular and corresponds to the shape of objects in real scenarios. In addition to low-level analysis, frequency analysis can provide frequency information about images. To compute the frequency information of DIRE, we use FFT algorithms.

³ https://29a.ch/photo-forensics/#noise-analysis
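A rough sketch of both visualizations with NumPy/SciPy is given below. The high-frequency residual used here as a noise-pattern proxy is an assumption (the paper points to an external forensics tool), whereas the frequency view is simply the centered log-magnitude 2-D FFT of a single-channel DIRE map.

```python
import numpy as np
from scipy.ndimage import median_filter

def noise_pattern(dire_gray: np.ndarray, size: int = 3) -> np.ndarray:
    """High-frequency residual as a simple stand-in for a dedicated noise-analysis tool."""
    return dire_gray - median_filter(dire_gray, size=size)

def fft_spectrum(dire_gray: np.ndarray) -> np.ndarray:
    """Centered log-magnitude FFT spectrum of an (H, W) DIRE map."""
    f = np.fft.fftshift(np.fft.fft2(dire_gray))
    return np.log1p(np.abs(f))
```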
We visualize the results of the aforementioned two analysis tools in Figure 5. The visual comparison of noise patterns highlights significant differences between the DIRE of real and diffusion-generated images from the low-level perspective, with real images tending to be regular and corresponding to the shape of objects, while diffusion-generated images tend to be messy. By comparing the FFT spectra of DIRE from real and diffusion-generated images, we observe that the FFT spectrum of real images is usually more abundant than that of diffusion-generated images, which confirms that real images are more difficult to reconstruct with a pre-trained diffusion model.

5. Conclusion

In this paper, we focus on building a generalizable detector for discriminating diffusion-generated images. We find that previous generated-image detectors show limited performance when detecting images generated by diffusion models. To address the issue, we present an image representation called DIRE based on the reconstruction errors of images inverted and reconstructed by DDIM. Furthermore, we create a new dataset, DiffusionForensics, which includes images generated by unconditional, conditional, and text-to-image diffusion models to facilitate the evaluation of diffusion-generated image detectors. Extensive experiments indicate that the proposed image representation DIRE contributes to a strong diffusion-generated image detector, which is very effective for this task. We hope that our work can serve as a solid baseline for diffusion-generated image detection.

Acknowledgement. This work was supported by NSFC under Contract 61836011 and 62021001. It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC, and the Supercomputing Center of the USTC.

References

[1] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In CVPR, 2022.
[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[3] Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. arXiv preprint arXiv:2301.13188, 2023.
[4] Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? understanding properties that generalize. In ECCV, 2020.
[5] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022.
[6] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[7] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[8] Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. arXiv preprint arXiv:2211.00680, 2022.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
[11] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
[12] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. In ICML, 2020.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[14] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, 2022.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
[18] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[19] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[20] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[21] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[22] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. Face x-ray for more general face forgery detection. In CVPR, 2020.
[23] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
[24] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
[25] Sara Mandelli, Nicolò Bonettini, Paolo Bestagini, and Stefano Tubaro. Detecting gan-generated images by orthogonal training of multiple cnns. In ICIP, 2022.
[26] Francesco Marra, Diego Gragnaniello, Davide Cozzolino, and Luisa Verdoliva. Detection of gan-generated fake images over social networks. In MIPR, 2018.
[27] Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and Giovanni Poggi. Do gans leave artificial fingerprints? In MIPR, 2019.
[28] Scott McCloskey and Michael Albright. Detecting gan-generated imagery using color cues. arXiv preprint arXiv:1812.08247, 2018.
[29] Scott McCloskey and Michael Albright. Detecting gan-generated imagery using saturation cues. In ICIP, 2019.
[30] Lakshmanan Nataraj, Tajuddin Manhar Mohammed, Shivkumar Chandrasekaran, Arjuna Flenner, Jawadul H Bappy, Amit K Roy-Chowdhury, and BS Manjunath. Detecting gan generated fake images using co-occurrence matrices. arXiv preprint arXiv:1903.06836, 2019.
[31] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[32] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.
[33] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027, 2023.
[34] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In ECCV, 2020.
[35] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[36] Jonas Ricker, Simon Damm, Thorsten Holz, and Asja Fischer. Towards the detection of diffusion model deepfakes. arXiv preprint arXiv:2210.14571, 2022.
[37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[38] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[39] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[40] Kaede Shiohara and Toshihiko Yamasaki. Detecting deepfakes with self-blended images. In CVPR, 2022.
[41] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[42] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[43] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[44] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[45] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
[46] Sheng-Yu Wang, Oliver Wang, Andrew Owens, Richard Zhang, and Alexei A Efros. Detecting photoshopped faces by scripting photoshop. In ICCV, 2019.
[47] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot...for now. In CVPR, 2020.
[48] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[49] Ning Yu, Larry S Davis, and Mario Fritz. Attributing fake images to gans: Learning and analyzing gan fingerprints. In ICCV, 2019.
[50] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and simulating artifacts in gan fake images. In WIFS, 2019.
[51] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. Learning rich features for image manipulation detection. In CVPR, 2018.
[52] Derui Zhu, Dingfan Chen, Jens Grossklags, and Mario Fritz. Data forensics in diffusion models: A systematic analysis of membership privacy. arXiv preprint arXiv:2302.07801, 2023.