Community Forensics: Using Thousands of Generators to Train Fake Image Detectors
Abstract

One of the key challenges of detecting AI-generated images is spotting images that have been created by previously unseen generative models. We argue that the limited diversity of the training data is a major obstacle to addressing this problem, and we propose a new dataset that is significantly larger and more diverse than prior work. As part of creating this dataset, we systematically download thousands of text-to-image latent diffusion models and sample images from them. We also collect images from dozens of popular open source and commercial models. The resulting dataset contains 2.7M images that have been sampled from 4803 different models. These images collectively capture a wide range of scene content, generator architectures, and image processing settings. Using this dataset, we study the generalization abilities of fake image detectors. Our experiments suggest that detection performance improves as the number of models in the training set increases, even when these models have similar architectures. We also find that detection performance improves as the diversity of the models increases, and that our trained detectors generalize better than those trained on other datasets.1

[Figure 1: mAP vs. number of latent diffusion models in the training set (log scale, 10¹ to 10³), with curves for latent diffusion, pixel diffusion, GAN, commercial, and other test models.]

Figure 1. Performance vs. model diversity. We use images sampled from different numbers of open source latent diffusion models in the Community Forensics dataset to train fake image detectors (shown in Fig. 2a). As the number of models increases, so does the detector's performance, even though these models have similar designs and the same total number of images. This improvement is largest for test images from out-of-distribution generative model classes, such as pixel-based diffusion models or GANs. For each data point, we sample 10 random model subsets with 100K training images each and report the mean and standard error values.
[Figure 2: grids of example images sampled from the dataset, labeled with their model names (e.g., Dreamshaper v6, Pony Diffusion, LEOSAM, SoteMix, Nitro Diffusion, PhotoSomnia vFinal, AnalogMadness, 526Mix).]
Figure 2. The Community Forensics dataset. Our dataset contains images sampled from three types of generative models. (a) We
systematically download open-source latent diffusion models from a model-sharing community [37, 124]. (b) We select popular open source
generators with a variety of architectures and training procedures. (c) We sample from both closed and open state-of-the-art commercial
models. We present example images and their corresponding model names.
performance, since it is easy for cues that work well on one set of generators to fail on others.

To address these problems, we propose Community Forensics, a dataset that is significantly more diverse and comprehensive than those in prior works (Fig. 2). Our dataset contains images generated by: (a) thousands of systematically downloaded open-source latent diffusion models, (b) hand-selected open source models with various architectures, and (c) state-of-the-art commercial models. We use this dataset to conduct a study of generalization in image forensics.

To acquire large numbers of models, we sample images from thousands of text-to-image diffusion models hosted on a popular model-sharing website, Hugging Face [37]. We exploit the fact that these models use a common programming library [124] and thus can be sampled in a standardized way. A large fraction of them are extensions of Stable Diffusion [102], but collectively they capture a variety of common model variations, such as in the architecture, image processing, and image content. We also sample images from many other open source models, including GANs [40], autoregressive models [42], and consistency models [81, 117]. To help study how image content affects classification performance, we provide a corresponding set of real images that are designed to resemble the generated images. For example, we condition the text-to-image models using text obtained by captioning our real images.

Our dataset contains 4803 distinct models, approximately 250× more than previous forensics datasets that sample images from generative models [9, 18, 35, 91, 126, 133], and covers a variety of recent model designs (Fig. 2).

We use this dataset to study generalization in the generated image detection problem. Our experiments support the hypothesis that increasing the diversity of generative models used in training is important for generalization. Through experiments, we find:
• Classifiers trained on our dataset obtain strong performance, both on our newly proposed evaluations and on multiple previously-proposed benchmarks.
• Adding more generative models improves performance. Fig. 1 demonstrates the performance of fake image detection when trained on samples from varying numbers of diffusion models. Notably, the performance improves as more models are added, even across different architectures.
• Including diverse generative model architectures significantly improves results, since classifiers do not fully generalize between generator architectures. Likewise, the performance gain from including large numbers of images from any particular architecture is relatively marginal.
• Standard classifiers perform well. In contrast to observations from recent work, we find that end-to-end training of classifiers based on CNNs or ViTs generalizes well, with behavior qualitatively similar to that of other recognition problems.
2. Related work

Datasets for detecting generated images. A number of datasets have been proposed specifically for detecting "deepfake" images containing manipulated faces [28, 64, 68, 72, 103, 105, 134]. Rather than focusing on face manipulation, we address the problem of creating general-purpose methods that can detect images that have been directly produced by generative models. Wang et al. [126] proposed a widely-used dataset of CNN-generated images, mixing images from GANs [10, 14, 59, 60, 94, 132] with other models [11, 12, 21, 71, 104]. This work showed that forensics models generalize between generative models, providing motivation for training on large datasets of diverse generators. However, their classifier was trained on images from a single GAN and was highly sensitive to data augmentation parameters, and more recent work shows that it does not generalize to newer models [17, 91]. Ojha et al. [91] introduced a dataset of recent diffusion models and found that training a linear classifier on CLIP features [100] extracted from ProGAN-generated images performed well. Cozzolino et al. [18] extend this work by studying the performance of CLIP-based detectors on various generative models and datasets. Epstein et al. [35] simulated detecting fake images in an online way by training a detector up to a certain year and testing it on generators released after that year. Zhu et al. [133] collected 1.4M generated images from 8 different generators. These datasets, however, only consider a handful of models (fewer than 20 each), limiting the generalization of their detectors. We improve upon these works by collecting a much more diverse set of generative models to improve the performance and generalization of the detector. In concurrent work, Hong et al. [50] acquire user-created images from Midjourney and CivitAI. This strategy is complementary to ours: while it aims to collect in-the-wild fake images, its distribution is centered on images that users share, and the models are not necessarily identifiable, making it challenging to rigorously analyze the dataset's contents and to interpret experiments conducted on it.

Fingerprint-based image forensics methods. Classic work on image forensics relied on methods based on image statistics [99] and physical constraints [57], rather than learning. A number of datasets have been created for detecting images that have been manipulated using traditional methods, such as with photo editors [24, 29, 54, 65, 88]. Recent works focus on detecting synthetic images by inspecting generator fingerprints. Zhang et al. [131] and Marra et al. [82] proposed identifying the spatial fingerprints left by the generator to detect synthetic images. Others focus on spectral anomalies to detect synthetic images. Durall et al. [32] and Dzanic et al. [33] identified that CNN-generated images fail to reproduce certain spectral properties of real images. Corvi et al. [17] study the frequency fingerprints of generated images and analyze the cross-architecture generalization of the detector. Bammey [9] uses high-frequency artifacts to detect generated images. However, these approaches may be brittle, since the artifacts they rely on can be eliminated by post-processing [18]. We instead approach this problem in a data-driven manner, scaling the number of models, images, and architectures. Recent work has created ensembles of fake image classifiers [51]. In parallel, researchers have detected text generated by language models using supervised learning and heuristics [8, 38, 56, 66, 86, 106, 116, 122], problems that closely resemble those in visual forensics. However, no existing techniques that we are aware of aim to collect comprehensive datasets of community-created generators.

Out-of-distribution generalization. Our work is related to the out-of-distribution recognition problem, as it involves generalizing to unseen generators and image processing. A variety of approaches have been proposed for this problem, based on likelihood ratios [73, 101, 128], self-supervision [48, 87, 112, 125], internal model statistics [47, 107], temperature scaling [6, 74], and energy-based models [31, 34, 76]. Work by Schuhmann et al. [111] and Hendrycks et al. [49] shows that diverse training data and data augmentation are important for improving robustness to out-of-distribution samples. Our results are in line with these conclusions, as we find that a diverse set of generative models and stronger augmentations improve generalization.

3. The Community Forensics Dataset

To support our goal of studying generalization in generated image detection, we collect a dataset of images sampled from a wide range of models (Fig. 2). Our dataset consists of: (a) a large and systematically collected set of "in-the-wild" text-to-image latent diffusion models obtained from a model-sharing website, (b) hand-selected models from other open source architectures, and (c) closed and open state-of-the-art commercial models. We also pair these generated images with real images from other datasets. For all images in our dataset, we preserve the original image format whenever possible, without any additional compression or resizing. This mitigates potential bias and performance degradation in out-of-domain settings due to unwanted artifacts [43]. Our dataset contains significantly more models than previous works (Tab. 1) and spans a wider range of architectures, processing pipelines, and semantic contents.

3.1. Systematically collecting generative models

We perform our systematic collection using publicly available, open source2 models that use the Hugging Face diffusers library [37, 124] because: 1) it is a popular library for creating text-to-image models and is widely used
Dataset | Models | Images | Architectures | Training setup
Wang et al. [126] | 11 | 362K | GAN, Perceptual, Deepfake, ... | ProGAN [59] vs. LSUN [129]
Ojha et al. [91] | 4∗ | 10K∗ | GAN, Perceptual, Diffusion, ... | ProGAN [59] vs. LSUN [129]
Epstein et al. [35] | 14 | 570K | Diffusion | Diffusion vs. LAION [111]
Cozzolino et al. [18] | 18 | 26K | Diffusion | LDM [102] vs. MS-COCO [75]
Synthbuster [9] | 9 | 10K | Diffusion | Diffusion vs. Dresden [39]
GenImage [133] | 8 | 1.4M | Diffusion, GAN | Diffusion, GAN vs. Diffusion, GAN
Ours | 4803 | 2.7M | Diffusion, GAN, Autoregressive, ... | Many vs. Many

Table 1. Comparison with existing forensics datasets. We compare the size of our dataset with existing datasets containing identifiable generative models. We only count the number of generated images. Our dataset contains significantly more generative models than prior works. ∗: Only counting the unique evaluation set by Ojha et al. [91], as their dataset is based on Wang et al. [126].
by hobbyists, 2) thousands of such models are publicly indexed, and 3) it provides a standard interface by which we can sample images. We process them in the order of popularity, as indicated by the number of downloads. Our pipeline downloads each model and extracts relevant hyperparameters (e.g., number of diffusion steps), sampling pipeline configurations, and metadata from the model-sharing webpage [37, 124]. We sample images using a distribution of text prompts obtained from real images (Sec. 3.3). Since experiments suggest that there are diminishing returns for sampling large numbers of images from any given model, we sample a few hundred images from each one. Images with NSFW content are filtered out using a safety checker [16]. We obtain 4763 models with approximately 403 images each from this process.

2 We use the term "open source" to refer to models with public weights and source code, even if the models may be closed in some respects (e.g., private training data).
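The collection loop can be sketched as follows. This is a minimal illustration rather than our exact implementation: it assumes the public huggingface_hub and diffusers APIs, and the caption file, per-model image budget, and output paths are placeholders (hyperparameter extraction and the NSFW safety checker are omitted for brevity).

```python
# Sketch: enumerate public diffusers models by download count and
# sample a few hundred images from each using real-image captions.
import os
import torch
from huggingface_hub import HfApi
from diffusers import AutoPipelineForText2Image

prompts = open("captions.txt").read().splitlines()  # captions from real images (Sec. 3.3)

api = HfApi()
# Models are processed in order of popularity (download count).
for info in api.list_models(library="diffusers", sort="downloads", direction=-1):
    try:
        pipe = AutoPipelineForText2Image.from_pretrained(
            info.id, torch_dtype=torch.float16
        ).to("cuda")
    except Exception:
        continue  # record incompatible models for the OOD / manual sets (Sec. 3.4)
    name = info.id.replace("/", "_")
    os.makedirs(f"samples/{name}", exist_ok=True)
    for i, prompt in enumerate(prompts[:400]):  # roughly 400 images per model
        image = pipe(prompt).images[0]
        image.save(f"samples/{name}/{i:04d}.png")  # PNG: no extra recompression
```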
While the lack of documentation in each model and the scale of data collection make it challenging to exactly characterize the model designs in this set, they are either entirely (or almost entirely) based on latent diffusion. More specifically, we categorize models as being based on latent diffusion if they perform a denoising process on a latent representation.3 Based on this criterion and the self-reported tags, all models in our systematically collected set appear to be based on latent diffusion. While pixel-based diffusion models also use the diffusers library (e.g., DeepFloyd [25]), they were incompatible with our automated generation pipeline. We record incompatible models such as these and manually sample a portion of them to construct an out-of-distribution test set (Sec. 3.4), or as manually-chosen models used for training data (Sec. 3.2).

We show examples of sampled images in Fig. 2. In Appendix D, we provide examples of models and information from their project pages. These models generate a variety of different types of images, with various types of preprocessing. For example, a large fraction of these models adapt variations of a popular pretrained latent diffusion model, Stable Diffusion [102], to different downstream applications, and use a number of adaptation strategies (e.g., LoRA [52]). We provide the model metadata with each image to enable other possible forensics applications. We discuss these in Appendix B and provide information about image and model licenses.

3 We note that this definition includes latent consistency models [80, 117], which are present in our dataset.

3.2. Collecting images from other architectures

Images from manually chosen models. To ensure that our dataset contains a broader range of models, we manually select 19 models from public repositories and sample 40,738 images per model on average. We note that this number is itself on par with (or more than) prior datasets with identifiable generative models. We include several GANs (e.g., StyleGANs [61–63, 109], BigGAN [10], StyleSwin [130], GigaGAN [58], ProGAN [59], ProjectedGAN [108], GANsformer [53], SAN [118], and CIPS [7]), pixel-based diffusion models (e.g., GLIDE [89], ADM [27], and DeepFloyd [25]), latent diffusion models (e.g., VQ-Diffusion [44], Diffusion Transformers [96], and Latent Flow Matching [23]), and an autoregressive model (Taming Transformers [36]).

Images from commercial models. We sample 15K images from 11 commercial models using LAION-based captions to evaluate the generalization to state-of-the-art models with typically unknown architectures: DALL·E 2, 3 [92, 93], Ideogram V1, V2 [5], Midjourney V5, V6 [85], Firefly Image 2, 3 [4], FLUX.1-dev, schnell [69], and Imagen 3 [41].

3.3. Collecting real images

To help study how real images influence forensics models, we source real images from a variety of existing datasets: LAION [110], ImageNet [26], COCO [75], FFHQ [60], CelebA [77], MetFaces [61], AFHQ [15], Forchheim [45], IMD2020 [90], Landscapes HQ [115], and VISION [114].4

4 Following common convention in visual forensics, we refer to these images as real images, even though they may be synthetic (e.g., containing graphic design). More precisely, our goal is to distinguish "AI-generated" versions of images from the originals.

3.4. Curating the evaluation set

We construct our evaluation set using the incompatible models from our automated sampling pipeline, commercial models (Sec. 3.2), and manually collected open source
models. The evaluation set comprises 26K images sampled from 21 models not included in the training set. This includes our commercial models set and an additional 11K images from 10 models: Deci Diffusion V2 [121], GALIP [120], Kandinsky V2.2 [113], Kvikontent [67], LCM-LoRA-SDv1.5, LCM-LoRA-SDXL, LCM-LoRA-SSD1B [81], Stable Cascade [97], DF-GAN [119], and HDiT [19], sampled using RAISE [22], ImageNet [26], FFHQ [60], and COCO [75]-based captions.

The generated images are paired with the source real data that are used to prompt the generators. However, since some of the real datasets do not have appropriate licenses for redistribution (e.g., LAION [110, 111]), we created a public version of our evaluation set by pairing the generated images with openly licensed COCO [75] and FFHQ [60] images, which allow redistribution for non-commercial purposes. The public version of our evaluation set will serve as an easily reproducible and shareable evaluation set that complements our default set. We will refer to our default set as the comprehensive evaluation set. We also release instructions to reconstruct our comprehensive set. However, note that it may not be possible to exactly reconstruct this set in the future due to link rot.

3.5. Generating images

Unconditional models are sampled until we reach the desired number of images. For class-conditional models, we sample an equal number of images per class. To sample from text-conditional models, we gather prompts from multiple sources to ensure semantic diversity. We obtain captions from real images (Sec. 3.3). We either use captions that are already present in the dataset (when available), or we use BLIP [70] to generate them. The captions are then used to sample synthetic images. Some models, such as GigaGAN [58] and HDiT [19], do not provide a pretrained model, so we instead use their pre-generated images. Generated images are saved in PNG format to avoid compression artifacts. However, Firefly [4] images are saved in JPEG format, as their web UI does not allow downloading in PNG.
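For datasets without captions, the prompt-generation step can be sketched with an off-the-shelf BLIP captioner. The checkpoint name and file paths below are illustrative assumptions, not necessarily our exact configuration:

```python
# Sketch: derive text prompts from real images with BLIP, then reuse
# them to condition text-to-image generators (Sec. 3.5).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def caption(path: str) -> str:
    """Return a BLIP caption for one real image."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

prompt = caption("real_images/000001.jpg")  # e.g. "a dog sitting on a couch"
```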
4. Experiments

We use our dataset to conduct a study of generalization in visual forensics, asking a number of questions: (1) How well do forensics models trained on our dataset generalize to unseen models? (2) Does adding more models improve detection performance? (3) How does the diversity of the training data affect performance? (4) What architectures and data augmentation schemes are most successful?

4.1. Training image forensics models

We train binary classifiers that detect generated images using our dataset to study generalization in image forensics. We construct our training set of 5.4M images by pairing 2.7M generated images with 2.7M real images.

Training and evaluation setup. We evaluate the models trained on our dataset and compare them with prior works [91, 126, 133]. Following prior works [91, 126], we use the threshold-independent mean average precision (mAP) and accuracy (Acc.) as our evaluation metrics. We compute the mAP and accuracy by averaging the results over each generative model. We use five evaluation sets: Wang et al. [126], Ojha et al. [91], Synthbuster [9], GenImage [133], and our evaluation set. All evaluation sets apart from GenImage [133] evaluate out-of-distribution performance for all classifiers. The GenImage [133] evaluation set, however, contains the same set of generators used in training, and is an in-distribution evaluation set for their classifiers. Concretely, the evaluation sets by Wang et al. [126] and Ojha et al. [91, 126] contain models such as DALL·E [92], DeepFake [28], CycleGAN [132], StarGAN [14], CRN [12], IMLE [71], SITD [11], and SAN [21], which are unseen by both their and our classifiers. The Synthbuster [9] evaluation set is comprised of RAISE [22]-based synthetic images from DALL·E [92, 93], Firefly [4], Midjourney [85], Glide [89], and Stable Diffusion [98, 102], and is mostly out of distribution for all classifiers. The GenImage [133] evaluation set is a validation split of their training set; the exact same set of models is used to train their classifier: Midjourney [85], Stable Diffusion [102], ADM [27], Glide [89], Wukong [3], VQ-Diffusion [44], and BigGAN [10].
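The per-generator averaging can be sketched as follows (using scikit-learn's average_precision_score; the data layout is an assumption). Averaging over generators weights each generator equally, regardless of how many test images it contributes.

```python
# Sketch: compute mAP by averaging average precision over generators,
# where each generator's fake images are scored against the real images.
import numpy as np
from sklearn.metrics import average_precision_score

def dataset_map(real_scores, fake_scores_by_gen):
    """real_scores: detector outputs on real images.
    fake_scores_by_gen: {generator name -> outputs on its fake images}."""
    aps = []
    for scores in fake_scores_by_gen.values():
        y_true = np.concatenate([np.zeros_like(real_scores), np.ones_like(scores)])
        y_score = np.concatenate([real_scores, scores])
        aps.append(average_precision_score(y_true, y_score))
    return float(np.mean(aps))
```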
detection performance? (3) How does diversity of the train-
ing data affect performance? (4) What architectures and data Data augmentation. Prior work considered augmenta-
augmentation schemes are most successful? tions that were designed to simulate postprocessing, such as
flipping, cropping, Gaussian blur, and JPEG recompression
4.1. Training image forensics models to train their detectors [9, 18, 91, 126]. We propose an aug-
We train binary classifiers that detect generated images using mentation scheme that extends this approach and compare it
our dataset to study the generalization in image forensics. with previously proposed augmentation methods. We expand
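In code, this classifier setup amounts to swapping the pretrained head for a one-logit output. A sketch using timm follows; the specific checkpoint name is an assumption, and the training hyperparameters are those detailed in Appendix C.

```python
# Sketch: a binary fake-image detector from a pretrained timm backbone,
# with the classification head replaced by a single sigmoid output.
import timm
import torch
import torch.nn as nn

# Pretrained ViT backbone from timm; num_classes=1 swaps in a one-logit
# linear head. The backbone is NOT frozen (end-to-end training).
model = timm.create_model(
    "vit_small_patch16_224.augreg_in21k_ft_in1k",  # assumed checkpoint name
    pretrained=True,
    num_classes=1,
)
criterion = nn.BCEWithLogitsLoss()  # sigmoid folded into the loss

x = torch.randn(8, 3, 224, 224)               # batch of (augmented) images
labels = torch.randint(0, 2, (8, 1)).float()  # 1 = generated, 0 = real
loss = criterion(model(x), labels)
loss.backward()
```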
Evaluation Set (mAP):
Model | Wang et al. [126] | Ojha et al. [91] | SB [9] | GenImage [133] | Ours Comp. | Ours Public
Wang et al. [126] | 0.897 | 0.696 | 0.516 | 0.642 | 0.535 | 0.600
Ojha et al. [91] | 0.939 | 0.957 | 0.620 | 0.797 | 0.630 | 0.656
GenImage [133] | 0.929 | 0.984 | 0.813 | 0.999 | 0.938 | 0.968
Ours | 0.964 | 0.991 | 0.904 | 0.990 | 0.979 | 0.977
Ours - High res. | 0.967 | 0.996 | 0.974 | 0.998 | 0.991 | 0.994

Evaluation Set (Acc):
Model | Wang et al. [126] | Ojha et al. [91] | SB [9] | GenImage [133] | Ours Comp. | Ours Public
Wang et al. [126] | 0.714 | 0.527 | 0.508 | 0.533 | 0.509 | 0.517
Ojha et al. [91] | 0.791 | 0.821 | 0.532 | 0.641 | 0.543 | 0.548
GenImage [133] | 0.795 | 0.966 | 0.719 | 0.990 | 0.857 | 0.886
Ours | 0.873 | 0.950 | 0.818 | 0.946 | 0.895 | 0.888
Ours - High res. | 0.901 | 0.970 | 0.908 | 0.957 | 0.925 | 0.912

Table 2. Generalization of AI-generated image detectors across datasets. We evaluate the classifiers trained on our dataset on several datasets, including our own. We also evaluate several previously released classifiers. Our Comprehensive set (abbreviated as Comp.) pairs the generated images with original real data; the Public set pairs them with openly licensed COCO [75] and FFHQ [60] for license-compliant redistribution of the evaluation set (Sec. 3.4). We use a plain CLIP-ViT-S [30, 55, 100] architecture with 224² and 384² (High res.) input resolutions; Wang et al. [126] and GenImage [133] use ResNet-50 [46] with 224² input resolution, and Ojha et al. [91] uses CLIP-ViT-L with 224² input resolution as the backbone. Our classifiers show robust performance across all evaluation sets, outperforming all baselines in out-of-distribution evaluations ([9, 91, 126] and Ours) while nearly matching GenImage [133] on its in-distribution evaluation set.
Data augmentation. Prior work considered augmentations designed to simulate postprocessing, such as flipping, cropping, Gaussian blur, and JPEG recompression, to train detectors [9, 18, 91, 126]. We propose an augmentation scheme that extends this approach and compare it with previously proposed augmentation methods. We expand the set of augmentations to handle additional transformations that can occur in the wild, such as padding, resizing, rotation, and shear, and integrate them into a framework that can apply complex sequences of transformations. We introduce a modified version of RandAugment [20] that applies a randomly-ordered sequence of augmentations to the images. Specifically, our modified RandAugment samples a random number n between 0 and n_max for each augmentation type. Then, it applies the augmentations in random order until n augmentations have been applied for each augmentation type. We use various augmentations, including in-memory JPEG compression, random resizing with random interpolation methods, cropping, flipping, rotation, translation, shear, padding, and cutout.
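A sketch of this modified RandAugment is shown below with two representative augmentation types; the full augmentation set, magnitude ranges, and n_max value here are illustrative assumptions.

```python
# Sketch: modified RandAugment. For each augmentation type, draw a count
# n in [0, n_max]; then apply all drawn augmentations in a random order.
import io
import random
from PIL import Image

def jpeg_recompress(img):
    """In-memory JPEG compression with a random quality factor."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(30, 95))
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def random_rotate(img):
    return img.rotate(random.uniform(-30, 30), expand=True)

AUGMENTATIONS = [jpeg_recompress, random_rotate]  # ..., resize, shear, cutout, etc.

def modified_randaugment(img, n_max=3):
    ops = []
    for aug in AUGMENTATIONS:
        n = random.randint(0, n_max)   # per-type count, possibly zero
        ops.extend([aug] * n)
    random.shuffle(ops)                # randomly-ordered sequence
    for op in ops:
        img = op(img)
    return img
```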
sion, commercial models, GANs, and other architecture type
4.2. Generalization to other datasets (Stable Cascade [97]). Our classifiers show strong perfor-
We first evaluate how well classifiers trained on our dataset mance across all generator types, unlike prior works which
transfer to other benchmarks. In Table 2, we observe that struggle to generalize to diverse architectures.
our models outperform the prior works [91, 126, 133] in all For the following experiments, we use our best-
evaluation sets except GenImage. This is expected since performing model (High res.) unless stated otherwise.
the GenImage evaluation set is a validation split of their
training set; all of the generators are already seen by their 4.3. Impact of model diversity
classifier. On all unseen evaluation sets, our classifiers out- Next, we examine the impact of the number of models in
perform all prior works. Notably, our classifiers achieve training data. We train classifiers with images sampled from
very high performance (0.991 mAP and 92.5% accuracy) on 3 to 3333 generators and evaluate them (Fig. 1). To ensure
[Figure 5: bar charts of mAP and Acc for (a) classifiers trained on the Systematic, Manual, and Full subsets and (b) ViT vs. ConvNeXt backbones trained on Ours, GenImage, and Wang et al.]

Figure 5. (a) Performance and model diversity. We compare detection performance for commercial models using classifiers trained on different subsets of the dataset: the systematically collected latent diffusion models, the manually chosen models containing diverse generator types, and both. As diversity increases, so does performance. (b) Classifier backbone comparison. We compare the architectures across datasets: ours, GenImage [133], and Wang et al. [126]. Performance is similar between architectures.

[Figure 6: bar charts of mAP and Acc for (a) ViT and ConvNeXt with frozen vs. unfrozen backbones and (b) the four pairings of R (real training data) and F (caption source): LAION vs. other datasets.]

Figure 6. (a) Evaluating frozen backbones. Freezing the pretrained backbone, a common practice in prior works [18, 91], consistently decreases the performance. (b) Analyzing source and generated data alignment. We evaluate how the pairing of the real datasets affects performance. R denotes the real dataset used in training, and F indicates the source dataset used to obtain the captions for prompting the generators. The results suggest that pairing the source data (i.e., real data used to prompt the generators) with the generated images is not essential for performance.
4.3. Impact of model diversity

Next, we examine the impact of the number of models in the training data. We train classifiers with images sampled from 3 to 3333 generators and evaluate them (Fig. 1). To ensure that the gains are not due to simply sampling qualitatively different architectures, we only use our systematically collected latent diffusion models. We use an extended evaluation set that includes non-latent diffusion generators from our training set, which allows us to comprehensively assess the generalization capability of the classifiers trained exclusively on latent diffusion models. We find that the performance steadily increases with the number of models. However, the performance begins to flatten out beyond 1000 models, suggesting diminishing returns. Interestingly, the performance improves even on out-of-distribution architectures such as GANs and pixel-based diffusion models, even though the classifier is only trained on latent diffusion models.

In Figure 4, we vary the number of images from two sets: 1000 randomly chosen models and 10 popular models (as ranked by their number of downloads) from our systematically collected diffusion models. While the results show that performance improves with more training images, it begins to plateau at approximately 27K images. Moreover, the classifier trained on 1000 models outperforms the 10 models in all cases, indicating that model diversity is important for strong performance. We also note that the accuracy gap is noticeably wider than that of mAP, which may suggest that model diversity is crucial in calibrating the accuracy thresholds of the classifiers.

Our experiments show that the performance improvements from increasing the number of models may plateau when they are limited to a single generator type (Fig. 1). In Figure 5a, we show that the diversity of the generator types also plays a major role in generalization. We train classifiers on three different sets of training data: our systematically collected set, our manually chosen set, and a full set consisting of both subsets. The systematic set is comprised entirely of latent diffusion models, while the manual set contains numerous generator types, including GANs, latent and pixel-based diffusion, and autoregressive models (Sec. 3.2). The classifier trained on the manual set, with its more diverse generator types, shows stronger performance compared to the one trained on the systematic set. Additionally, we find that the two sets are complementary; the performance is further improved when we train using both sets.

4.4. Analysis of design choices

We examine the impact of various design choices, including some suggested in earlier works. In particular, we investigate the choice of backbone models, freezing the backbone, semantic alignment between the real and generated data, and robustness to transformations.

Classifier backbone comparison. We compare the performance of classifiers trained using CLIP-ViT [30, 55, 100] and ConvNeXt [78] backbones, following our training procedure, in Figure 5b. We examine three datasets: ours, GenImage [133], and Wang et al. [126]. We observe similar performance between the architectures across all datasets.

Frozen backbone. Prior works [18, 91] suggested using a frozen CLIP-ViT backbone for training the classifiers. We investigate this practice by training classifiers with both frozen and unfrozen pretrained backbones, using CLIP-ViT [30, 55, 100] and ConvNeXt [78]. As shown in Figure 6a, freezing the backbone consistently leads to poorer performance, indicating that end-to-end training is crucial to achieving high performance.
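The frozen-backbone baseline differs from our end-to-end setup only in which parameters receive gradients; a sketch (for a hypothetical timm model, as above):

```python
# Sketch: frozen-backbone variant. Only the linear head is trained;
# the unfrozen (end-to-end) setup skips this step entirely.
def freeze_backbone(model):
    for param in model.parameters():
        param.requires_grad = False
    for param in model.get_classifier().parameters():  # timm's head accessor
        param.requires_grad = True
    return model
```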
Semantic alignment. Existing works often pair the generated images with the source dataset (i.e., the real dataset used to prompt or generate the images), arguing that misaligned data can introduce bias [9, 18, 91, 126]. We test this practice in Fig. 6b by examining the performance with both semantically aligned and misaligned real datasets. Specifically, we consider two real datasets: one comprised exclusively of LAION [110] and another combining ImageNet [26], MS-COCO [75], LandscapesHQ [115], Forchheim [45], VISION [114], and IMD2020 [90]. We sample our systemati-
[Figure: robustness to shear, padding, and Gaussian blur transformations (mAP vs. transformation strength).]

While we only focus on generated image detection in our paper, our dataset may enable further forensics studies that
Conference on Machine Learning, pages 32–47. PMLR, 2023. 3
[7] Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, and Denis Korzhenkov. Image generators with conditionally-independent pixel synthesis. arXiv preprint arXiv:2011.13775, 2020. 4
[8] Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc'Aurelio Ranzato, and Arthur Szlam. Real or fake? learning to discriminate machine from human generated text. arXiv preprint arXiv:1906.03351, 2019. 3
[9] Quentin Bammey. Synthbuster: Towards detection of diffusion model generated images. IEEE Open Journal of Signal Processing, 2023. 2, 3, 4, 5, 6, 7
[10] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. 3, 4, 5
[11] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3291–3300, 2018. 3, 5
[12] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE international conference on computer vision, pages 1511–1520, 2017. 3, 5
[13] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. arXiv preprint arXiv:2212.07143, 2022. 5
[14] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 3, 5
[15] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. 4
[16] CompVis. Stable diffusion safety checker. https://huggingface.co/CompVis/stable-diffusion-safety-checker, 2022. 4
[17] Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023. 3
[18] Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, and Luisa Verdoliva. Raising the bar of ai-generated image detection with clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4356–4366, 2024. 2, 3, 4, 5, 7
[19] Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Proceedings of the 41st International Conference on Machine Learning, pages 9550–9575. PMLR, 2024. 5, 15
[20] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. Randaugment: Practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems, pages 18613–18624. Curran Associates, Inc., 2020. 6
[21] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11065–11074, 2019. 3, 5
[22] Duc-Tien Dang-Nguyen, Cecilia Pasquini, Valentina Conotter, and Giulia Boato. Raise: A raw images dataset for digital image forensics. In Proceedings of the 6th ACM multimedia systems conference, pages 219–224, 2015. 5, 14
[23] Quan Dao, Hao Phung, Binh Nguyen, and Anh Tran. Flow matching in latent space. arXiv preprint arXiv:2307.08698, 2023. 4
[24] Tiago José De Carvalho, Christian Riess, Elli Angelopoulou, Helio Pedrini, and Anderson de Rezende Rocha. Exposing digital image forgeries by illumination color classification. IEEE Transactions on Information Forensics and Security, 8(7):1182–1194, 2013. 3
[25] DeepFloyd. Deepfloyd. https://huggingface.co/DeepFloyd/IF-I-L-v1.0, 2024. 4
[26] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009. 4, 5, 7, 14, 15
[27] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021. 4, 5
[28] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge (dfdc) dataset. arXiv preprint arXiv:2006.07397, 2020. 3, 5
[29] Jing Dong, Wei Wang, and Tieniu Tan. Casia image tampering detection evaluation database. In 2013 IEEE China summit and international conference on signal and information processing, pages 422–426. IEEE, 2013. 3
[30] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021. 5, 6, 7
[31] Yilun Du, Shuang Li, Joshua Tenenbaum, and Igor Mordatch. Improved contrastive divergence training of energy-based models. In International Conference on Machine Learning, pages 2837–2848. PMLR, 2021. 3
[32] Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7890–7899, 2020. 3
[33] Tarik Dzanic, Karan Shah, and Freddie Witherden. Fourier spectrum discrepancies in deep network generated images. Advances in neural information processing systems, 33:3022–3032, 2020. 3
[34] Sven Elflein, Bertrand Charpentier, Daniel Zügner, and Stephan Günnemann. On out-of-distribution detection with energy-based models, 2021. 3
[35] David C Epstein, Ishan Jain, Oliver Wang, and Richard Zhang. Online detection of ai-generated images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 382–392, 2023. 2, 3, 4
[36] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 4
[37] Hugging Face. Hugging face diffusers library. https://huggingface.co/models?library=diffusers, accessed on June 05, 2022, 2022. 2, 3, 4, 15
[38] Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. Gltr: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043, 2019. 3
[39] Thomas Gloe and Rainer Böhme. The 'Dresden image database' for benchmarking digital image forensics. In Proceedings of the 2010 ACM symposium on applied computing, pages 1584–1590, 2010. 4
[40] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014. 2
[41] Google. Imagen 3. https://deepmind.google/technologies/imagen-3, 2024. 4
[42] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In International Conference on Machine Learning, pages 1242–1250. PMLR, 2014. 2
[43] Patrick Grommelt, Louis Weiss, Franz-Josef Pfreundt, and Janis Keuper. Fake or jpeg? revealing common biases in generated image detection datasets. arXiv preprint arXiv:2403.17608, 2024. 3
[44] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022. 4, 5
[45] Benjamin Hadwiger and Christian Riess. The forchheim image database for camera identification in the wild. In Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part VI, pages 500–515. Springer, 2021. 4, 7
[46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 5, 6
[47] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2016. 3
[48] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019. 3
[49] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8340–8349, 2021. 3
[50] Yan Hong and Jianfu Zhang. Wildfake: A large-scale challenging dataset for ai-generated images detection. arXiv preprint arXiv:2402.11843, 2024. 3
[51] Shuwei Hou, Yan Ju, Chengzhe Sun, Shan Jia, Lipeng Ke, Riky Zhou, Anita Nikolich, and Siwei Lyu. Deepfake-o-meter v2.0: An open platform for deepfake detection. arXiv preprint arXiv:2404.13146, 2024. 3
[52] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 4
[53] Drew A Hudson and Larry Zitnick. Generative adversarial transformers. In International conference on machine learning, pages 4487–4499. PMLR, 2021. 4
[54] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A Efros. Fighting fake news: Image splice detection via learned self-consistency. In Proceedings of the European conference on computer vision (ECCV), 2018. 3
[55] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, 2021. 5, 6, 7
[56] Ganesh Jawahar, Muhammad Abdul-Mageed, and Laks VS Lakshmanan. Automatic detection of machine generated text: A critical survey. arXiv preprint arXiv:2011.01314, 2020. 3
[57] Micah K Johnson and Hany Farid. Exposing digital forgeries in complex lighting environments. IEEE Transactions on Information Forensics and Security, 2(3):450–461, 2007. 3
[58] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 4, 5
[59] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. 3, 4
[60] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019. 3, 4, 5, 6, 14, 15
[61] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Advances in neural information processing systems, 33:12104–12114, 2020. 4
[62] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
[63] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in neural information processing systems, 34:852–863, 2021. 4
[64] Hasam Khalid, Shahroz Tariq, Minha Kim, and Simon Woo. Fakeavceleb: A novel audio-video multimodal deepfake dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021. 3
[65] P. Korus and J. Huang. Multi-scale analysis strategies in prnu-based tampering localization. IEEE Trans. on Information Forensics & Security, 2017. 3
[66] Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36, 2024. 3
[67] Kvikontent. Kvikontent-midjourney v6. https://huggingface.co/Kvikontent/midjourney-v6, 2023. 5
[68] Patrick Kwon, Jaeseong You, Gyuhyeon Nam, Sungwoo Park, and Gyeongsu Chae. Kodf: A large-scale korean deepfake detection dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. 3
[69] Black Forest Labs. Flux. https://blackforestlabs.ai, 2024. 4
[70] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022. 5
[71] Ke Li, Tianhao Zhang, and Jitendra Malik. Diverse image synthesis from semantic layouts via conditional imle. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4220–4229, 2019. 3, 5
[72] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020. 3
[73] Yewen Li, Chaojie Wang, Xiaobo Xia, Tongliang Liu, Bo An, et al. Out-of-distribution detection with an adaptive likelihood ratio on informative hierarchical vae. Advances in Neural Information Processing Systems, 35:7383–7396, 2022. 3
[74] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018. 3
[75] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 4, 5, 6, 7, 14
[76] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. In Advances in Neural Information Processing Systems, pages 21464–21475. Curran Associates, Inc., 2020. 3
[77] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015. 4, 14
[78] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022. 5, 7
[79] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. 15
[80] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023. 4
[81] Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556, 2023. 2, 5
[82] Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and Giovanni Poggi. Do gans leave artificial fingerprints? In 2019 IEEE conference on multimedia information processing and retrieval (MIPR), pages 506–511. IEEE, 2019. 3
[83] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. Umap: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 2018. 14
[84] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. In International Conference on Learning Representations, 2018. 15
[85] Midjourney, Inc. Midjourney. https://www.midjourney.com/home, 2022. 4, 5
[86] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. Detectgpt: Zero-shot machine-generated text detection using probability curvature. In International Conference on Machine Learning, pages 24950–24962. PMLR, 2023. 3
[87] Sina Mohseni, Mandar Pitale, JBS Yadawa, and Zhangyang Wang. Self-supervised learning for generalizable out-of-distribution detection. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):5216–5223, 2020. 3
[88] Tian-Tsong Ng, Shih-Fu Chang, and Q Sun. A data set of authentic and spliced image blocks. Columbia University, ADVENT Technical Report, 4, 2004. 3
[89] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pages 16784–16804. PMLR, 2022. 4, 5
[90] Adam Novozamsky, Babak Mahdian, and Stanislav Saic. Imd2020: A large-scale annotated dataset tailored for detecting manipulated images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, pages 71–80, 2020. 4, 7
[91] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24480–24489, 2023. 1, 2, 3, 4, 5, 6, 7, 8, 14
[92] OpenAI. Dall-e 2. https://openai.com/index/dall-e-2, 2022. 4, 5
[93] OpenAI. Dall-e 3. https://openai.com/index/dall-e-3, 2023. 4, 5
[94] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 3
[95] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019. 5
[96] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023. 4
[97] Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learning Representations, 2023. 5, 6, 14
[98] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2023. 5
[99] Alin C Popescu and Hany Farid. Exposing digital forgeries by detecting traces of resampling. IEEE Transactions on signal processing, 2005. 3
[100] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 5, 6, 7
[101] Jie Ren, Peter J Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. Advances in neural information processing systems, 32, 2019. 3
[102] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 2, 4, 5, 8
[103] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179, 2018. 1, 3
[104] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. In International Conference on Computer Vision (ICCV), 2019. 3
[105] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, 2019. 1, 3
[106] Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can ai-generated text be reliably detected? arXiv preprint arXiv:2303.11156, 2023. 3
[107] Chandramouli Shama Sastry and Sageev Oore. Detecting out-of-distribution examples with Gram matrices. In Proceedings of the 37th International Conference on Machine Learning, pages 8491–8501. PMLR, 2020. 3
[108] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. In Advances in Neural Information Processing Systems (NeurIPS), 2021. 4
[109] Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pages 1–10, 2022. 4
[110] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 4, 5, 7, 14
[111] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. 3, 4, 5
[112] Vikash Sehwag, Mung Chiang, and Prateek Mittal. Ssd: A unified framework for self-supervised outlier detection. In International Conference on Learning Representations, 2020. 3
[113] Arseniy Shakhmatov, Anton Razzhigaev, Aleksandr Nikolich, Vladimir Arkhipkin, Igor Pavlov, Andrey Kuznetsov, and Denis Dimitrov. Kandinsky 2.2. https://github.com/ai-forever/Kandinsky-2, 2023. 5
[114] Dasara Shullani, Marco Fontani, Massimo Iuliani, Omar Al Shaya, and Alessandro Piva. Vision: a video and image dataset for source identification. EURASIP Journal on Information Security, 2017:1–16, 2017. 4, 7
12
[115] Ivan Skorokhodov, Grigorii Sotnikov, and Mohamed El- Systems, pages 20685–20696. Curran Associates, Inc., 2020.
hoseiny. Aligning latent and image spaces to connect the 3
unconnectable. arXiv preprint arXiv:2104.06954, 2021. 4, [129] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianx-
7, 14 iong Xiao. Lsun: Construction of a large-scale image dataset
[116] Irene Solaiman, Miles Brundage, Jack Clark, Amanda using deep learning with humans in the loop. arXiv preprint
Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen arXiv:1506.03365, 2015. 4
Krueger, Jong Wook Kim, Sarah Kreps, et al. Release [130] Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong
strategies and the social impacts of language models. arXiv Chen, Fang Wen, Yong Wang, and Baining Guo. Styleswin:
preprint arXiv:1908.09203, 2019. 3 Transformer-based gan for high-resolution image generation.
[117] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya In Proceedings of the IEEE/CVF conference on computer
Sutskever. Consistency models. In Proceedings of the vision and pattern recognition, pages 11304–11314, 2022. 4
40th International Conference on Machine Learning, pages [131] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detect-
32211–32252, 2023. 2, 4 ing and simulating artifacts in gan fake images. In 2019
[118] Yuhta Takida, Masaaki Imaizumi, Takashi Shibuya, Chieh- IEEE international workshop on information forensics and
Hsin Lai, Toshimitsu Uesaka, Naoki Murata, and Yuki Mit- security (WIFS), pages 1–6. IEEE, 2019. 3
sufuji. SAN: Inducing metrizability of GAN with discrimi- [132] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A
native normalized linear layer. In The Twelfth International Efros. Unpaired image-to-image translation using cycle-
Conference on Learning Representations, 2024. 4 consistent adversarial networks. In Computer Vision (ICCV),
[119] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun 2017 IEEE International Conference on, 2017. 3, 5
Bao, and Changsheng Xu. Df-gan: A simple and effec- [133] Mingjian Zhu, Hanting Chen, Qiangyu YAN, Xudong
tive baseline for text-to-image synthesis. In Proceedings of Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu,
the IEEE/CVF conference on computer vision and pattern and Yunhe Wang. Genimage: A million-scale benchmark
recognition, pages 16515–16525, 2022. 5 for detecting ai-generated image. In Advances in Neural
[120] Ming Tao, Bing-Kun Bao, Hao Tang, and Changsheng Xu. Information Processing Systems Dataset and Benchmarks
Galip: Generative adversarial clips for text-to-image synthe- Track, pages 77771–77782, 2023. 2, 3, 4, 5, 6, 7, 8
sis. In Proceedings of the IEEE/CVF Conference on Com- [134] Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and
puter Vision and Pattern Recognition, pages 14214–14223, Yu-Gang Jiang. Wilddeepfake: A challenging real-world
2023. 5 dataset for deepfake detection. In Proceedings of the 28th
[121] DeciAI Research Team. Decidiffusion 2.0, 2024. 5 ACM international conference on multimedia, 2020. 3
[122] Adaku Uchendu, Thai Le, Kai Shu, and Dongwon Lee. Au-
thorship attribution for neural text generation. In Proceed-
ings of the 2020 Conference on Empirical Methods in Natu-
ral Language Processing (EMNLP), pages 8384–8395, 2020.
3
[123] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete
representation learning. Advances in neural information
processing systems, 30, 2017. 8
[124] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro
Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj,
Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven
Liu, and Thomas Wolf. Diffusers: State-of-the-art dif-
fusion models. https://github.com/huggingface/
diffusers, 2022. 2, 3, 4
[125] Apoorv Vyas, Nataraj Jammalamadaka, Xia Zhu, Dipankar
Das, Bharat Kaul, and Theodore L. Willke. Out-of-
distribution detection using an ensemble of self supervised
leave-out classifiers. In Proceedings of the European Con-
ference on Computer Vision (ECCV), 2018. 3
[126] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew
Owens, and Alexei A. Efros. Cnn-generated images are sur-
prisingly easy to spot... for now. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2020. 1, 2, 3, 4, 5, 6, 7, 8
[127] Ross Wightman. Pytorch image models. https://github.
com/huggingface/pytorch-image-models, 2019. 5
[128] Zhisheng Xiao, Qing Yan, and Yali Amit. Likelihood regret:
An out-of-distribution detection score for variational auto-
encoder. In Advances in Neural Information Processing
13
A. Other applications

[Figure: confusion matrices of ground truth vs. predicted generator class (GAN, LatDiff, PixDiff, Real). Recoverable diagonal entries show in-distribution classes are attributed reliably (GAN 0.93, LatDiff 0.95), while a separate panel shows commercial models spread across classes (GAN 0.22, LatDiff 0.40, PixDiff 0.03, Real 0.36).]

[Figure: counts of model licenses in the dataset; axis labels include openrail, creativeml/bigscience/bigcode openrail variants, mit, apache-2.0, gpl/agpl/lgpl-3.0, cc-by license variants, research and non-commercial licenses (SDXL 0.9 Research License, DeepFloyd-IF, stable-cascade-nc, FLUX.1-dev-nc, sai-nc, NVIDIA-Source-NC), and commercial terms of service (OpenAI, Google, Midjourney, Adobe, Ideogram TOS).]
Model | Architecture | License | RealSource | HF_pipeline_tag | HF_diffusers_tag
danbochman/ccxl | LatentDiff | None | coco, forchheim, imagenet, imd2020, laion, landscapesHQ, vision | StableDiffusionXLPipeline | StableDiffusionXLPipeline
livingbox/modern-style-v3 | LatentDiff | creativeml-openrail-m | coco, forchheim, imagenet, imd2020, laion, landscapesHQ, vision | StableDiffusionPipeline | stable-diffusion
...
DeepFloyd | PixelDiff | DeepFloyd-IF | coco | N/A | N/A
BigGAN | GAN | MIT | imagenet | N/A | N/A
...

Table 3. Example model metadata. We log both the author and model names for the Hugging Face [37] models and only the model names for others. We also log the generator type (i.e., architecture), model license, source real dataset, and Hugging Face tags if available.
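For concreteness, a minimal sketch of how one such metadata record could be represented in Python (the dataclass, its field names, and the example values are illustrative, adapted from Table 3; this is not the dataset's actual schema):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ModelMetadata:
    """One row of Table 3: provenance information logged per generator."""
    model: str                       # author/model name for Hugging Face models
    architecture: str                # generator type, e.g. "LatentDiff", "GAN"
    license: Optional[str]           # model license; None if unspecified
    real_source: List[str]           # real datasets paired with this generator
    hf_pipeline_tag: Optional[str]   # Hugging Face pipeline tag, if available
    hf_diffusers_tag: Optional[str]  # Hugging Face diffusers tag, if available

example = ModelMetadata(
    model="livingbox/modern-style-v3",
    architecture="LatentDiff",
    license="creativeml-openrail-m",
    real_source=["coco", "forchheim", "imagenet", "imd2020",
                 "laion", "landscapesHQ", "vision"],
    hf_pipeline_tag="StableDiffusionPipeline",
    hf_diffusers_tag="stable-diffusion",
)
```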
Percentage of models per architecture: 99.67% | 0.25% | 0.06% | 0.02%

Table 4. Model counts per architecture in the training set. A vast majority of the generators are latent diffusion models.

[Figure: mAP (roughly 0.950–0.975) vs. training iterations (0.8K–104K).]

Figure 13. Impact of training iterations. The performance of the classifier plateaus beyond 3K iterations.
C. Training settings

For training our classifiers, we use the AdamW optimizer [79] with a learning rate of 2e-5, a weight decay of 1e-2, a batch size of 512, and mixed precision [84]. We use a cosine learning rate schedule.
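As a concrete illustration, here is a minimal PyTorch sketch of this optimization setup. The classifier head, input features, and total iteration count are placeholders; the paper does not specify them here, so this is a sketch under those assumptions rather than the exact training code:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder fake-vs-real classification head; the paper's actual backbone
# and feature dimensionality are not specified in this section.
model = torch.nn.Linear(768, 2).cuda()

# Optimizer hyperparameters as stated in the text: lr 2e-5, weight decay 1e-2.
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=1e-2)

# Cosine schedule; T_max (total iterations) is an assumption taken from the
# upper end of the iteration axis in Figure 13.
scheduler = CosineAnnealingLR(optimizer, T_max=104_000)

# Loss scaling for mixed-precision training [84].
scaler = torch.cuda.amp.GradScaler()

def training_step(features, labels):
    """One optimization step over a batch (batch size 512 in the paper)."""
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # run the forward pass in mixed precision
        logits = model(features)
        loss = torch.nn.functional.cross_entropy(logits, labels)
    scaler.scale(loss).backward()    # backprop with scaled loss
    scaler.step(optimizer)           # unscale gradients, then update weights
    scaler.update()
    scheduler.step()                 # advance the cosine schedule per iteration
    return loss.item()
```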
Figure 14 shows a project page from Hugging Face [2, 37]. We can see the tags associated with the model (e.g., Text-to-image, pipeline type, license), number of downloads, and sample images.
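To give a sense of how this page-level metadata can be read programmatically, a small sketch using the huggingface_hub client is shown below. The repo id is illustrative, and the attribute names follow the current huggingface_hub ModelInfo API, which may differ from whatever tooling was actually used to build the dataset:

```python
from huggingface_hub import HfApi

api = HfApi()
# Illustrative repo id; any public text-to-image model page works similarly.
info = api.model_info("stabilityai/stable-diffusion-xl-base-1.0")

print(info.pipeline_tag)  # pipeline type shown on the page, e.g. "text-to-image"
print(info.tags)          # tags associated with the model (license, diffusers, ...)
print(info.downloads)     # download count displayed on the project page
```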