0% found this document useful (0 votes)

55 views15 pages

Community Forensics Using Thousands Generators To TrainTrain Fake Image

Uploaded by

aliseren86

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

55 views15 pages

Community Forensics Using Thousands Generators To TrainTrain Fake Image

Uploaded by

aliseren86

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

Community Forensics: Using Thousands of Generators to Train

Fake Image Detectors

Jeongsoo Park Andrew Owens

University of Michigan
arXiv:2411.04125v1 [cs.CV] 6 Nov 2024

Abstract 1.00
One of the key challenges of detecting AI-generated im-
0.95
ages is spotting images that have been created by previously 0.90
unseen generative models. We argue that the limited diver-

mAP
sity of the training data is a major obstacle to addressing 0.85
this problem, and we propose a new dataset that is signif- 0.80
icantly larger and more diverse than prior work. As part
of creating this dataset, we systematically download thou- 0.75
sands of text-to-image latent diffusion models and sample
images from them. We also collect images from dozens of 101 102 103
popular open source and commercial models. The resulting
Number of Latent Diffusion Models in Training Set
dataset contains 2.7M images that have been sampled from Latent Diffusion Commercial Other
Pixel Diffusion GAN
4803 different models. These images collectively capture a
wide range of scene content, generator architectures, and Figure 1. Performance vs. model diversity. We use images sam-
image processing settings. Using this dataset, we study the pled from different numbers of open source latent diffusion models
generalization abilities of fake image detectors. Our experi- in the Community Forensics dataset to train fake image detectors
ments suggest that detection performance improves as the (shown in Fig. 2a). As the number of models increases, so does
number of models in the training set increases, even when the detector’s performance, even though these models have similar
these models have similar architectures. We also find that de- designs and same total number of images. This improvement is
tection performance improves as the diversity of the models largest for test images from out-of-distribution generative model
increases, and that our trained detectors generalize better classes, such as pixel-based diffusion models or GANs. For each
data point, we sample 10 random model subsets with 100K training
than those trained on other datasets.1
images each and report the mean and standard error values.

1. Introduction low-level image processing details, such as the ways that

training images are resized or compressed, can strongly in-
Our ability to automatically generate realistic images is fluence detection accuracy [126]. As a result of these model-
quickly outpacing our ability to detect them, potentially specific idiosyncrasies, a generator’s images may evade de-
leading to a state of affairs in which neither humans nor tection, even when images from architecturally similar mod-
machines can reliably tell real from fake. While the field els exist in the training set. This issue has been exacerbated
of image forensics has been developing methods to address by the thousands of open source models that are now avail-
this problem, existing fake image detectors still struggle able online, many of which extend pretrained base models
with generalization. These methods often excel at detecting in complex ways.
images from generators that were present in their training We hypothesize that the lack of diversity in training
sets, but fail when given images sampled from unseen mod- datasets is a major source of these shortcomings. Although
els [91, 103, 105, 126]. today’s datasets often contain millions of fake images, they
A core challenge is dealing with the large amounts of come from a relatively small number of generators (e.g.,
variation between models. Each generator has a potentially fewer than 20 models in any given previous work that we are
unique combination of the architecture, loss function, and aware of). As a result, this data fails to capture many sources
training distribution. Even seemingly minor differences in of variation that one might encounter in the wild. These
1 https://jespark.net/projects/2024/community_forensics limitations also make it challenging to accurately benchmark

1
Dreamshaper v6 Fn Samenamemodel5 Ggm LEOSAM s ilmGirl l 91

DeepFloyd VQDiffusion StyleGAN2 DiT

Pony diffusion 87 Attn Maps mist mas Plat diffusion SoteMix First avartar

StyleGANXL GuidedDiffusion Glide CIPS

Ncadg dreamshaper PhotoSomnia vFinal Nitro diffusion Textual inversion Technoalbum AnalogMadness real

(b) Manually chosen open source models (19 models)

Dalle2 MidjourneyV6 1 MidjourneyV5 2 Dalle3

Uploadmodel Lyrieldiff Vxlnsiya Nightvisionxl 0791 Albedobase xl v21 DucHaitenAnimated

FLUX dev Firefly Image2 Firefly Image3 IdeogramV1

526Mix Textual inversion Cburnett helmet co Enasswhidaly Msart 541 A can thr

(c) Commercial models (11 models)

(a) Systematically collected diffusion models (4763 models)

Figure 2. The Community Forensics dataset. Our dataset contains images sampled from three types of generative models. (a) We
systematically download open-source latent diffusion models from a model-sharing community [37, 124]. (b) We select popular open source
generators with a variety of architectures and training procedures. (c) We sample from both closed and open state-of-the-art commercial
models. We present example images and their corresponding model names.

performance, since it is easy for cues that work well on one Our dataset contains 4803 distinct models, approximately
set of generators to fail on others. 250× more than the previous forensics datasets that sample
To address these problems, we propose Community Foren- images from generative models [9, 18, 35, 91, 126, 133],
sics, a dataset that is significantly more diverse and compre- and covers a variety of recent model designs (Fig. 2).
hensive than those in prior works (Fig. 2). Our dataset con- We use this dataset to study generalization in the gener-
tains images generated by: (a) thousands of systematically ated image detection problem. Our experiments support the
downloaded open-source latent diffusion models, (b) hand- hypothesis that increasing the diversity of generative models
selected open source models with various architectures, and used in training is important for generalization. Through
(c) state-of-the-art commercial models. We use this dataset experiments, we find:
to conduct a study of generalization in image forensics. • Classifiers trained on our dataset obtain strong perfor-
To acquire large numbers of models, we sample images mance, both on our newly proposed evaluations and on
from thousands of text-to-image diffusion models hosted on multiple previously-proposed benchmarks.
a popular model-sharing website, Hugging Face [37]. We • Adding more generative models improves performance.
exploit the fact that these models use a common program- Fig. 1 demonstrates the performance of fake image detec-
ming library [124] and thus can be sampled in a standardized tion when trained on samples from varying numbers of
way. A large fraction of them are extensions of Stable Dif- diffusion models. Notably, the performance improves as
fusion [102], but collectively capture a variety of common more models are added, even across different architectures.
model variations, such as in the architecture, image process- • Including diverse generative model architectures signifi-
ing, and image content. We also sample images from many cantly improves results, since classifiers do not fully gen-
other open source models, including GANs [40], autoregres- eralize between generator architectures. Likewise, the
sive models [42], and consistency models [81, 117]. To help performance gain from including large numbers of images
study how image content affects classification performance, from any particular architecture is relatively marginal.
we provide a corresponding set of real images that are de- • Standard classifiers perform well. In contrast to observa-
signed to resemble the generated images. For example, we tions from recent work, we find that end-to-end training of
condition the text-to-image models using text obtained by classifiers based on CNNs or ViTs generalizes well, with
captioning our real images. qualitatively similar to that of other recognition problems.

2
2. Related work the detector. Bammey [9] uses high-frequency artifacts to
Datasets for detecting generated images. A number detect generated images. However, these approaches may be
of datasets have been proposed for specifically detecting brittle since the artifacts they rely on can be eliminated by
“deepfake” images containing manipulated faces [28, 64, post-processing [18]. We instead approach this problem in a
68, 72, 103, 105, 134]. Rather than focusing on face ma- data-driven manner, scaling the number of models, images,
nipulation, we address of creating general-purpose meth- and architectures. Recent work has created ensembles of
ods that can detect images that have been directly pro- fake image classifiers [51]. In parallel, researchers have de-
duced by generative models. Wang et al. [126] proposed tected text generated by language models using supervised
a widely-used dataset of CNN-generated images, mixing learning and heuristics [8, 38, 56, 66, 86, 106, 116, 122],
images from GANs [10, 14, 59, 60, 94, 132] with other mod- which closely resemble those in visual forensics. However,
els [11, 12, 21, 71, 104]. This work showed that forensics no existing techniques that we are aware of aim to collect
models generalize between generative models, providing comprehensive datasets of community-created generators.
motivation for training on large datasets of diverse genera- Out-of-distribution generalization. Our work is related
tors. However, their classifier was trained on images from a to the out-of-distribution recognition problem as it in-
single GAN and was highly sensitive to data augmentation volves generalizing to unseen generators and image pro-
parameters, and more recent work shows that it does not cessing. A variety of approaches have been proposed for
generalize to newer models [17, 91]. Ojha et al. [91] intro- this problem, based on likelihood ratios [73, 101, 128],
duced a dataset of recent diffusion models and found that self-supervision [48, 87, 112, 125], internal model statis-
training a linear classifier on CLIP features [100] extracted tics [47, 107], temperature scaling [6, 74], and via energy-
from ProGAN-generated images performed well. Cozzolino based models [31, 34, 76]. Work by Schuhmann et al. [111]
et al. [18] extends this work by studying the performance and Hendrycks et al. [49] show that diverse training data and
of CLIP-based detectors on various generative models and data augmentation is important to improving the robustness
datasets. Epstein et al. [35] simulated detecting fake images to out-of-distribution samples. Our results are in line with
in an online way by training a detector up to a certain year these conclusions, as we find that a diverse set of generative
and testing it on generators released after that year. Zhu et models and stronger augmentations improve generalization.
al. [133] collected 1.4M generated images from 8 different
generators. These datasets, however, only consider a handful
of models (less than 20 each), limiting the generalization of 3. The Community Forensics Dataset
their detectors. We improve upon these works by collecting
To support our goal of studying generalization in generated
much more diverse generative models to improve the per-
image detection, we collect a dataset of images sampled
formance and generalization of the detector. In concurrent
from a wide range of models (Fig. 2). Our dataset consists
work, Hong et al. [50] acquires user-created images from
of: (a) a large and systematically collected set of “in-the-
Midjourney and CivitAI. This strategy is complementary
wild” text-to-image latent diffusion models obtained from a
to ours: while it aims to collect in-the-wild fake images,
model-sharing website, (b) hand-selected models from other
its distribution is centered on images that users share, and
open source architectures, and (c) closed and open state-of-
the models are not necessarily identifiable, making it chal-
the-art commercial models. We also pair these generated
lenging to rigorously analyze the dataset’s contents and to
images with real images from other datasets. For all im-
interpret experiments conducted on it.
ages in our dataset, we preserve the original image format
Fingerprint-based image forensics methods. Classic whenever possible, without any additional compression or
work on image forensics relied on methods based on image resizing. This is to mitigate potential bias and performance
statistics [99] and physical constraints [57], rather than learn- degradation in out-of-domain settings due to unwanted arti-
ing. A number of datasets have been created for detecting facts [43]. Our dataset contains significantly more models
images that have been manipulated using traditional methods, than previous works (Tab. 1) and spans a wider range of
such as with photo editors [24, 29, 54, 65, 88]. Recent works architectures, processing pipelines, and semantic contents.
focus on detecting synthetic images by inspecting the gen-
erator fingerprints. Zhang et al. [131] and Marra et al. [82]
3.1. Systematically collecting generative models
proposed identifying the spatial fingerprints left by the gen- We perform our systematic collection using publicly avail-
erator to detect synthetic images. Others focus on spectral able, open source2 models that use the Hugging Face
anomalies to detect synthetic images. Durall et al. [32] and diffusers library [37, 124] because: 1) it is a popular
Dzanic et al. [33] identified that CNN-generated images fail library for creating text-to-image models and is widely used
to reproduce certain spectral properties of real images. Corvi 2We use the term “open source” to refer to models with public weights
et al. [17] studies the frequency fingerprints of the generated and source code, even if the models may be closed in some respects (e.g.,
images and analyzes the cross-architecture generalization of private training data).

3
Dataset Models Images Architectures Training setup
Wang et al. [126] 11 362K GAN, Perceptual, Deepfake, ... ProGAN [59] vs. LSUN [129]
Ojha et al. [91] 4∗ 10K∗ GAN, Perceptual, Diffusion, ... ProGAN [59] vs. LSUN [129]
Epstein et al. [35] 14 570K Diffusion Diffusion vs. LAION [111]
Cozzolino et al. [18] 18 26K Diffusion LDM [102] vs. MS-COCO [75]
Synthbuster [9] 9 10K Diffusion Diffusion vs. Dresden [39]
GenImage [133] 8 1.4M Diffusion, GAN Diffusion, GAN vs. Diffusion, GAN
Ours 4803 2.7M Diffusion, GAN, Autoregressive, ... Many vs. Many

Table 1. Comparison with existing forensics datasets. We compare the size of the dataset with existing datasets containing identifiable
generative models. We only count the number of generated images. Our dataset contains significantly more generative models than prior
works. ∗: Only counting the unique evaluation set by Ojha et al. [91] as their dataset is based on Wang et al. [126].

by hobbyists, 2) thousands of such models are publicly in- We provide the model metadata with each image to enable
dexed, and 3) it provides a standard interface by which we other possible forensics forensics applications. We discuss
can sample images. We process them in the order of popular- these in Appendix B and provide information about image
ity, as indicated by the number of downloads. Our pipeline and model licenses.
downloads each model and extracts relevant hyperparam-
3.2. Collecting images from other architectures
eters (e.g., number of diffusion steps), sampling pipeline
configurations, and metadata from the model-sharing web- Images from manually chosen models. To ensure that
page [37, 124]. We sample images using a distribution of our dataset contains a broader range of models, we manually
text prompts obtained from real images (Sec. 3.3). Since select 19 models from public repositories and sample 40,738
experiments suggest that there are diminishing returns for images per model on average. We note that this number is
sampling large numbers of images from any given model, we itself on par with (or more than) prior datasets with identi-
sample a few hundred images from each one. Images with fiable generative models. We include several GANs (e.g.,
NSFW content are filtered out using a safety checker [16]. StyleGANs [61–63, 109], BigGAN [10], StyleSwin [130],
We obtain 4763 models with approximately 403 images each GigaGAN [58], ProGAN [59], ProjectedGAN [108], GANs-
from this process. former [53], SAN [118], and CIPS [7]), pixel-based diffusion
While the lack of documentation in each model and the models (e.g., GLIDE [89], ADM [27], and DeepFloyd [25]),
scale of data collection make it challenging to exactly charac- latent diffusion models (e.g., VQ-Diffusion [44], Diffusion
terize the model designs in this set, they are either entirely (or Transformers [96], and Latent Flow Matching [23]), and an
almost entirely) based on latent diffusion. More specifically, autoregressive model (Taming Transformers [36])
we categorize models as being based on latent diffusion if Images from commercial models. We sample 15K im-
they perform a denoising process on a latent representation. ages from 11 commercial models using LAION-based cap-
3
Based on this criterion and the self-reported tags, all mod- tions to evaluate the generalization to state-of-the-art models
els in our systematically collected set appear to be based with typically unknown architectures: DALL·E 2, 3 [92, 93],
on latent diffusion. While pixel-based diffusion models Ideogram V1, V2 [5], Midjourney V5, V6 [85], Firefly Im-
also use the diffusers library (e.g., DeepFloyd [25]), they age 2, 3 [4], FLUX.1-dev, schnell [69], and Imagen 3 [41].
were incompatible with our automated generation pipeline.
We record incompatible models such as these and manually 3.3. Collecting real images
sample a portion of them to construct an out-of-distribution To help study how real images influence forensics models,
test set (Sec. 3.4), or as manually-chosen models used for we source real images from a variety of existing datasets:
training data (Sec. 3.2). LAION [110], ImageNet [26], COCO [75], FFHQ [60],
We show examples of sampled images in Fig. 2. In Ap- CelebA [77], MetFaces [61], AFHQ [15], Forchheim [45],
pendix D, we provide examples of models and information IMD2020 [90], Landscapes HQ [115], and VISION [114].4
from their project pages. These models generate a variety of 3.4. Curating the evaluation set
different types of images, with various types of preprocess-
ing. For example, a large fraction of these models adapt vari- We construct our evaluation set using the incompatible
ations of a popular pretrained latent diffusion model, Stable models from our automated sampling pipeline, commer-
Diffusion [102], to different downstream applications, and cial models (Sec. 3.2), and manually collected open source
use a number of adaptation strategies (e.g., using LoRA [52]). 4 Following common convention in visual forensics, we refer to these
images as real images, even though they may be synthetic (e.g., containing
3We note that this definition includes latent consistency models [80, 117], graphic design). More precisely, our goal is to distinguish “AI-generated”
which are present in our dataset. versions of images from the originals.

4
models. The evaluation set comprises 26K images sam- We construct our training set of 5.4M images by pairing
pled from 21 models not included in the training set. 2.7M generated images with 2.7M real images.
This includes our commercial models set and an ad-
Training and evaluation setup. We evaluate the mod-
ditional 11K images from 10 models: Deci Diffusion
els trained on our dataset and compare them with prior
V2 [121], GALIP [120], KandinskyV2.2 [113], Kvikon-
works [91, 126, 133]. Following prior works [91, 126],
tent [67], LCM-LoRA-SDv1.5, LCM-LoRA-SDXL, LCM-
we use the threshold-independent mean average precision
LoRA-SSD1B [81], Stable Cascade [97], DF-GAN [119],
(mAP) and accuracy (Acc.) as our evaluation metrics. We
and HDiT [19], sampled using RAISE [22], ImageNet [26],
compute the mAP and accuracy by averaging the results of
FFHQ [60], and COCO [75]-based captions.
each generative model. We use five evaluation sets: Wang et
The generated images are paired with the source real
al. [126], Ojha et al. [91], Synthbuster [9], GenImage [133],
data that are used to prompt the generators. However, since
and our evaluation set. All evaluation sets apart from Gen-
some of the real datasets do not have appropriate licenses
Image [133] evaluate out-of-distribution performance for all
for redistribution (e.g., LAION [110, 111]), we created a
classifiers. GenImage [133] evaluation set, however, con-
public version of our evaluation set by pairing the generated
tains the same set of generators used in training, and is
images with openly licensed COCO [75] and FFHQ [60]
an in-distribution evaluation set for their classifiers. Con-
which allow redistribution for non-commercial purposes.
cretely, the evaluation set by Wang et al. [126] and Ojha et
The public version of our evaluation set will serve as an
al. [91, 126] contains models such as DALL·E [92], Deep-
easily reproducible and shareable evaluation set that will
Fake [28], CycleGAN [132], StarGAN [14], CRN [12],
complement our default set. We will refer to our default set
IMLE [71], SITD [11], and SAN [21] which are unseen
as the comprehensive evaluation set. We also release the
by both their and our classifiers. Synthbuster [9] evaluation
instructions to reconstruct our comprehensive set. However,
set is comprised of RAISE [22]-based synthetic images of
note that it may not be possible to exactly reconstruct this
DALL·E [92, 93], Firefly [4], Midjourney [85], Glide [89],
set in the future due to link rot.
and Stable Diffusion [98, 102], and is mostly out of distri-
3.5. Generating images bution for all classifiers. GenImage [133] evaluation set is
a validation split of their training set; an exact same set of
Unconditional models are sampled until we reach the desired models are used in training their classifier: Midjourney [85],
number of images. For class conditional models, we sam- Stable Diffusion [102], ADM [27], Glide [89], Wukong [3],
ple an equal number of images per class. To sample from VQ-Diffusion [44], and BigGAN [10].
text-conditional models, we gather prompts from multiple
sources to ensure semantic diversity. We obtain captions Model architecture. Building on prior works which
from real images (Sec. 3.3). We either use captions that are mainly used CLIP-ViT [30, 55, 100] and ResNet-50 [46],
already present in the dataset (when available), or we use we consider ViT [30] and ConvNeXt [78] pretrained models
BLIP [70] to generate them. The captions are then used for our classifiers. We use a plain ViT-S backbone [30] pre-
to sample synthetic images. Some models such as Giga- trained on CLIP objective [55, 100] using LAION-2B [111],
GAN [58] and HDiT [19] do not provide a pretrained model, ImageNet 21K, and ImageNet 1K datasets [26]. We also
so we instead use their pre-generated images. Generated im- experiment with a ConvNeXt-S model [78] pretrained on Im-
ages are saved in PNG format to avoid compression artifacts. ageNet 21K and ImageNet 1K datasets [26]. We replace the
However, Firefly [4] generated images are saved in JPEG classification head with a linear layer with sigmoid activation
format as their web UI does not allow downloading in PNG. that outputs the probability of the image being generated. Un-
like prior works [18, 91] that freeze the CLIP-ViT backbone,
we train the backbone end-to-end. The models are obtained
4. Experiments through timm [13, 127] library on Hugging Face. We exper-
iment with two input resolutions, 2242 and 3842 , to evaluate
We use our dataset to conduct a study of generalization in
the impact of the input resolution on the detector’s perfor-
visual forensics, asking a number of questions: (1) How
mance. We denote the detector with 3842 input resolution
well do forensics models trained on our dataset generalize
as High res. We implement the models using PyTorch [95].
to unseen models? (2) Does adding more models improve
The hyperparameters are detailed in Appendix C.
detection performance? (3) How does diversity of the train-
ing data affect performance? (4) What architectures and data Data augmentation. Prior work considered augmenta-
augmentation schemes are most successful? tions that were designed to simulate postprocessing, such as
flipping, cropping, Gaussian blur, and JPEG recompression
4.1. Training image forensics models to train their detectors [9, 18, 91, 126]. We propose an aug-
We train binary classifiers that detect generated images using mentation scheme that extends this approach and compare it
our dataset to study the generalization in image forensics. with previously proposed augmentation methods. We expand

5
Evaluation Set (mAP) Evaluation Set (Acc)
Model Wang et al. Ojha et al. SB GenImage Ours Wang et al. Ojha et al. SB GenImage Ours
[126] [91] [9] [133] Comp. Public [126] [91] [9] [133] Comp. Public
Wang et al. [126] 0.897 0.696 0.516 0.642 0.535 0.600 0.714 0.527 0.508 0.533 0.509 0.517
Ojha et al. [91] 0.939 0.957 0.620 0.797 0.630 0.656 0.791 0.821 0.532 0.641 0.543 0.548
GenImage [133] 0.929 0.984 0.813 0.999 0.938 0.968 0.795 0.966 0.719 0.990 0.857 0.886
Ours 0.964 0.991 0.904 0.990 0.979 0.977 0.873 0.950 0.818 0.946 0.895 0.888
Ours - High res. 0.967 0.996 0.974 0.998 0.991 0.994 0.901 0.970 0.908 0.957 0.925 0.912

Table 2. Generalization of AI-generated image detectors across datasets. We evaluate the classifiers trained on our dataset on several
datasets, including our own. We also evaluate several previously released classifiers. Our Comprehensive set (abbreviated as Comp.) pairs
the generated images with original real data; the Public set pairs them with openly licensed COCO [75] and FFHQ [60] for license-compliant
redistribution of the evaluation set (Sec. 3.4). We use plain CLIP-ViT-S [30, 55, 100] architecture with 2242 and 3842 (High res.) input
resolutions, Wang et al. [126] and GenImage [133] use ResNet-50 [46] with 2242 input resolution, and Ojha et al. [91] uses CLIP-ViT-L
with 2242 input resolution as the backbone. Our classifiers show robust performance across all evaluation sets, outperforming all baselines
in out-of-distribution evaluations ([9, 91, 126] and Ours) while nearly matching GenImage [133] on its in-distribution evaluation set.
1.00 mAP Acc
0.75 0.95 0.85
0.90
mAP

0.50
0.80
0.25 0.85
0.00 0.80 1000 Random Models 0.75 1000 Random Models
Latent Diff. Pixel Diff. Commercial GAN Other 10 Popular Models 10 Popular Models
Wang et al. Ojha et al. GenImage Ours Ours - High res.
3K 27K 243K 3K 27K 243K
Figure 3. Performance across generator types. We evaluate the Number of Images in Training Set Number of Images in Training Set
classifier performance across five generator types – latent diffusion, Figure 4. Performance with increasing number of images. We
pixel diffusion, commercial models, GANs, and other architecture train a classifier with varying numbers of images from two sets:
(Stable Cascade [97]). Our classifiers show robust performance 1000 randomly chosen models and 10 popular (highly downloaded)
across all generator types, where prior works struggle to generalize. models in the systematically collected subset. The classifier trained
from 1000 random models outperforms 10 popular models in all
the set of augmentations to handle additional transformations cases. Notably, the accuracy gap is wider than that of mAP, which
that can occur in the wild, such as padding, resizing, rota- may suggest that having diversity in the models improves accuracy
tion, and shear, and integrate them into a framework that threshold calibration. We report the mean and standard error values
can apply complex sequences of transformations. We intro- for each data point across 4 randomly sampled subsets.
duce a modified version of RandAugment [20] that applies a
randomly-ordered sequence of augmentations to the images. our comprehensive evaluation set with a significant margin
Specifically, our modified RandAugment samples a random compared to prior works. This gap in performance can be
number n between 0 and nmax for each augmentation type. traced to our training data which incorporates a substantially
Then, it applies the augmentations in random order until n richer variety of generators compared to prior works. Conse-
augmentations are applied for each augmentation type to the quently, our classifiers demonstrate robust generalization to
image. We use various augmentations, including in-memory out-of-distribution data, where prior works often struggle.
JPEG compression, random resizing with random interpola- To better illustrate the generalization of the classifiers, we
tion methods, cropping, flipping, rotation, translation, shear, show the performance per each generator type in Figure 3.
padding, and cutout. We group them into five subsets: latent diffusion, pixel diffu-
sion, commercial models, GANs, and other architecture type
4.2. Generalization to other datasets (Stable Cascade [97]). Our classifiers show strong perfor-
We first evaluate how well classifiers trained on our dataset mance across all generator types, unlike prior works which
transfer to other benchmarks. In Table 2, we observe that struggle to generalize to diverse architectures.
our models outperform the prior works [91, 126, 133] in all For the following experiments, we use our best-
evaluation sets except GenImage. This is expected since performing model (High res.) unless stated otherwise.
the GenImage evaluation set is a validation split of their
training set; all of the generators are already seen by their 4.3. Impact of model diversity
classifier. On all unseen evaluation sets, our classifiers out- Next, we examine the impact of the number of models in
perform all prior works. Notably, our classifiers achieve training data. We train classifiers with images sampled from
very high performance (0.991 mAP and 92.5% accuracy) on 3 to 3333 generators and evaluate them (Fig. 1). To ensure

6
1.00 1.00 1.00 1.00
0.75 0.75 0.75 0.75
0.50 0.50 0.50 0.50
0.25 0.25 0.25 0.25
mAP Acc mAP Acc mAP Acc mAP Acc
Systematic Manual Full ViT (Ours) ConvNeXt (Ours)
ViT (GenImage) ConvNeXt (GenImage) ViT ConvNeXt R:LAION F:LAION R:Others F:LAION
ViT (Wang) ConvNeXt (Wang) ViT (Frozen) ConvNeXt (Frozen) R:LAION F:Others R:Others F:Others

(a) Impact of generator type diversity (b) Classifier backbone comparison (a) Impact of frozen backbones (b) Semantic alignment analysis

Figure 5. (a) Performance and model diversity. We compare de- Figure 6. (a) Evaluating frozen backbones. Freezing the pre-
tection performance for commercial models using classifiers trained trained backbone, a common practice in prior works ([18, 91]),
on different subsets of the dataset: the systematically collected consistently decreases the performance. (b) Analyzing source and
latent diffusion models, the manually chosen models containing generated data alignment. We evaluate how the pairing of the
diverse generator types, and both. As diversity increases, so does real datasets affects performance. R denotes the real dataset used in
performance. (b) Classifier backbone comparison. We compare training, and F indicates the source dataset used to obtain the cap-
the architectures across datasets: ours, GenImage [133], and Wang tions for prompting the generators. The results suggest that pairing
et al. [126]. Performance is similar between architectures. the source data (i.e., real data used to prompt the generators) with
the generated images is not essential for performance.
that the gains are not due to simply sampling qualitatively
different architectures, we only use our systematically col- shows stronger performance compared to the one trained on
lected latent diffusion models. We use an extended evalua- the systematic set. Additionally, we find that the two sets are
tion set that includes non-latent diffusion generators from our complementary; the performance is further improved when
training set, which allows us to comprehensively assess the we train using both sets.
generalization capability of the classifiers trained exclusively 4.4. Analysis of design choices
on latent diffusion models. We find that the performance
steadily increases with the number of models. However, the We examine the impact of various design choices, including
performance begins to flatten out beyond 1000 models, sug- some suggested in earlier works. In particular, we investi-
gesting diminishing returns. Interestingly, the performance gate the choice of backbone models, freezing the backbone,
improves even on out-of-distribution architectures such as semantic alignment between the real and generated data, and
GANs and pixel-based diffusion models, even though the robustness to transformations.
classifier is only trained on latent diffusion models. Classifier backbone comparison. We compare the perfor-
In Figure 4, we vary the number of images from two sets: mance of the classifier trained using CLIP-ViT [30, 55, 100]
1000 randomly chosen models and 10 popular models (as and ConvNeXt [78] backbones following our training pro-
denoted by their number of downloads) downloaded from cedure in Figure 5b. We examine three datasets: ours, Gen-
our systematically collected diffusion models. While the re- Image [133], and Wang et al. [126]. We observe similar
sults show that the performance improves with more training performance between architectures across all datasets.
images, it begins to plateau at approximately 27K images.
Frozen backbone. Prior works [18, 91] suggested using
Moreover, the classifier trained on 1000 models outperforms
a frozen CLIP-ViT backbone for training the classifiers.
the 10 models in all cases, indicating that model diversity
We investigate this practice by training the classifiers with
is important for strong performance. We also note that the
both frozen and unfrozen pretrained backbones, using CLIP-
accuracy gap is noticeably wider than that of mAP, which
ViT [30, 55, 100] and ConvNeXt [78]. As shown in Fig-
may suggest that model diversity is crucial in calibrating the
ure 6a, freezing the backbone consistently leads to poorer
accuracy thresholds of the classifiers.
performance, indicating that end-to-end training is crucial to
Our experiments show that the performance improve-
achieving high performance.
ments from increasing the number of models may plateau
when they are limited to a single generator type (Fig. 1). In Semantic alignment. Existing works often pair the gener-
Figure 5a, we show that the diversity of the generator type ated images with the source dataset (i.e., the real dataset used
also plays a major role in generalization. We train classifiers to prompt or generate the images) arguing that misaligned
on three different sets of training data: our systematically data can introduce bias [9, 18, 91, 126]. We test this practice
collected set, manually chosen set, and a full set consisting in Fig. 6b by examining the performance with both seman-
of both subsets. The systematic set comprises entirely of la- tically aligned and misaligned real datasets. Specifically,
tent diffusion models, and the manual set contains numerous we consider two real datasets: one comprised exclusively
generator types, including GANs, latent and pixel-based dif- from LAION [110] and another combining ImageNet [26],
fusion, and autoregressive models (Sec. 3.2). The classifier MS-COCO [75], LandscapesHQ [115], Forchheim [45], VI-
trained on the manual set with more diverse generator types SION [114], and IMD2020 [90]. We sample our systemati-

7
Shear Padding Gaussian Blur While we only focus on generated image detection in our
1.00 1.00 1.00
paper, our dataset may enable further forensics studies that
0.75 0.75 0.75
mAP

can take advantage of our diverse array of generators. For

0.50 0.50 0.50
example, in Appendix A, we show a preliminary result in
15 30 45 60 75 0.25 0.50 0.75 1.00 1.25 0.5 0.9 1.3 1.7 2.0
fraction identifying the generator used to synthesize a given image.
JPEG Rotation Resize We do not intend for our dataset to be used to train classi-
1.00 1.00 1.00
fiers that are directly used in the wild. Detecting in-the-wild
0.75 0.75 0.75
mAP

synthetic images remains a challenging open problem, and

0.50 0.50 0.50
detection errors can have severe consequences (e.g., falsely
100 84 68 52 36 15 30 45 60 75 0.15 0.30 0.45 0.60 0.75
quality fraction accusing an author of creating fake images or allowing mis-
Ours GenImage Wang et al. information to be certified as real). We hope that our work
Ours - High Res. Ojha et al.
will serve as a stepping stone for future research in this area
Figure 7. Robustness to various transformations. Our classifiers by providing tools and insights for studying generalization
display robust performance across transformations. Other works and data collection strategies.
generally show more sensitivity to these factors.
Limitations. While our dataset is diverse, a large portion
of the data is diffusion-based, especially models fine-tuned
cally collected latent diffusion models using these two sets
on Stable Diffusion [102]. However, we note that Figure 1
and categorize the generated images by their source dataset.
shows that classifiers trained only on diffusion models still
The resulting changes in the performance of all pairs were
generalize reasonably well to other generator types as they
marginal, suggesting that strict alignment may not be as
represent different semantic content. Future work may con-
critical as previously believed.
sider collecting more diverse models, including more GANs,
Robustness to transformations. Figure 7 illustrates the VQ-VAEs [123], and autoregressive models. We also note
robustness of the classifiers against various transformations. that the generative models sourced from the community may
Following prior works [91, 126], we test robustness to JPEG contain inappropriate content. While in many contexts it is
compression and Gaussian blur. Additionally, we examine important to detect such images (and we have removed im-
robustness to rotation, resizing, padding, and shear, as they ages flagged by a NSFW detector), these models may require
commonly occur in real-world scenarios. For padding, we further scrutiny before being used in other downstream appli-
randomly pad the width or height of the image with a given cations. Finally, although our experiments suggest that our
fraction and scale it back to the original size. Similarly, in forensics classifiers generalize to unseen models better than
resize, we randomly upsample or downsample the height and those of previous work, the error rates for forensics models
width of the image by a given fraction and resize it back to are still too high to be used in many important applications.
the original size (e.g., if fraction is 0.3, resize the height and Acknowledgements. We thank the creators of the many
width to 0.7× or 1.3× and then scale it back to the original open source models that we used to collect the Community
size). The results demonstrate that our models are more Forensics dataset. We thank Chenhao Zheng, Cameron John-
robust to transformations than existing models. Specifically, son, Matthias Kirchner, Daniel Geng, Ziyang Chen, Ayush
GenImage [133] is notably more sensitive to Gaussian blur, Shrivastava, Yiming Dou, Chao Feng, Zihao Wei, Zixuan
JPEG compression, and resizing artifacts; classifier by Ojha Pan, Inbum Park, Rohit Banerjee, and Ang Cao for the valu-
et al. [91] displays sensitivity to shear transforms, and the able discussions and feedback. This research was developed
one by Wang et al. [126] performs poorly overall. with funding from the Defense Advanced Research Projects
Agency (DARPA) under Contract No. HR001120C0123.
5. Discussion
In this paper, we studied the problem of generalizing to un- References
seen generative models in synthetic image detection. We [1] Creativeml openrail-m license. https://huggingface.
proposed a new dataset, Community Forensics, which con- co/spaces/CompVis/stable-diffusion-license. 14
tains 4803 models and 2.7M images collected from various [2] Hugging face. https://huggingface.co/, 2016. 15
public sources. We studied the impact of model diversity on [3] Wukong. https://xihe.mindspore.cn/modelzoo/
the generalization performance of the classifiers and demon- wukong, 2022. 5
strated that diverse data is a crucial factor in training a robust, [4] Adobe. Firefly. https://www.adobe.com/products/
generalizable forensics model. We trained classifiers on our firefly, 2023. 4, 5
dataset, demonstrated their ability to generalize on various [5] Ideogram AI. Ideogram. https://ideogram.ai, 2024. 4
settings, and evaluated previously proposed architecture and [6] Rushil Anirudh and Jayaraman J Thiagarajan. Out of dis-
training practices. tribution detection via neural network anchoring. In Asian

8
Conference on Machine Learning, pages 32–47. PMLR, of the 41st International Conference on Machine Learning,
2023. 3 pages 9550–9575. PMLR, 2024. 5, 15
[7] Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb [20] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le.
Sterkin, Victor Lempitsky, and Denis Korzhenkov. Image Randaugment: Practical automated data augmentation with
generators with conditionally-independent pixel synthesis. a reduced search space. In Advances in Neural Information
arXiv preprint arXiv:2011.13775, 2020. 4 Processing Systems, pages 18613–18624. Curran Associates,
[8] Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Inc., 2020. 6
Marc’Aurelio Ranzato, and Arthur Szlam. Real or fake? [21] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and
learning to discriminate machine from human generated text. Lei Zhang. Second-order attention network for single image
arXiv preprint arXiv:1906.03351, 2019. 3 super-resolution. In Proceedings of the IEEE/CVF confer-
[9] Quentin Bammey. Synthbuster: Towards detection of diffu- ence on computer vision and pattern recognition, pages
sion model generated images. IEEE Open Journal of Signal 11065–11074, 2019. 3, 5
Processing, 2023. 2, 3, 4, 5, 6, 7 [22] Duc-Tien Dang-Nguyen, Cecilia Pasquini, Valentina Conot-
[10] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large ter, and Giulia Boato. Raise: A raw images dataset for digital
scale GAN training for high fidelity natural image synthesis. image forensics. In Proceedings of the 6th ACM multimedia
In International Conference on Learning Representations, systems conference, pages 219–224, 2015. 5, 14
2019. 3, 4, 5 [23] Quan Dao, Hao Phung, Binh Nguyen, and Anh Tran. Flow
[11] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. matching in latent space. arXiv preprint arXiv:2307.08698,
Learning to see in the dark. In Proceedings of the IEEE con- 2023. 4
ference on computer vision and pattern recognition, pages [24] Tiago José De Carvalho, Christian Riess, Elli Angelopoulou,
3291–3300, 2018. 3, 5 Helio Pedrini, and Anderson de Rezende Rocha. Exposing
[12] Qifeng Chen and Vladlen Koltun. Photographic image syn- digital image forgeries by illumination color classification.
thesis with cascaded refinement networks. In Proceedings IEEE Transactions on Information Forensics and Security,
of the IEEE international conference on computer vision, 8(7):1182–1194, 2013. 3
pages 1511–1520, 2017. 3, 5 [25] DeepFloyd. Deepfloyd. https://huggingface.co/
[13] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell DeepFloyd/IF-I-L-v1.0, 2024. 4
Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- [26] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible and Li Fei-Fei. Imagenet: A large-scale hierarchical image
scaling laws for contrastive language-image learning. arXiv database. In 2009 IEEE conference on computer vision and
preprint arXiv:2212.07143, 2022. 5 pattern recognition, pages 248–255. Ieee, 2009. 4, 5, 7, 14,
[14] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, 15
Sunghun Kim, and Jaegul Choo. Stargan: Unified generative [27] Prafulla Dhariwal and Alexander Nichol. Diffusion models
adversarial networks for multi-domain image-to-image trans- beat gans on image synthesis. Advances in neural informa-
lation. In Proceedings of the IEEE Conference on Computer tion processing systems, 34:8780–8794, 2021. 4, 5
Vision and Pattern Recognition, 2018. 3, 5 [28] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ
[15] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Howes, Menglin Wang, and Cristian Canton Ferrer. The
Stargan v2: Diverse image synthesis for multiple domains. deepfake detection challenge (dfdc) dataset. arXiv preprint
In Proceedings of the IEEE Conference on Computer Vision arXiv:2006.07397, 2020. 3, 5
and Pattern Recognition, 2020. 4 [29] Jing Dong, Wei Wang, and Tieniu Tan. Casia image tam-
[16] CompVis. Stable diffusion safety checker. https : pering detection evaluation database. In 2013 IEEE China
/ / huggingface . co / CompVis / stable - diffusion - summit and international conference on signal and informa-
safety-checker, 2022. 4 tion processing, pages 422–426. IEEE, 2013. 3
[17] Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Gio- [30] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
vanni Poggi, Koki Nagano, and Luisa Verdoliva. On the Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
detection of synthetic images generated by diffusion models. Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
In ICASSP 2023-2023 IEEE International Conference on vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is
Acoustics, Speech and Signal Processing (ICASSP), pages worth 16x16 words: Transformers for image recognition at
1–5. IEEE, 2023. 3 scale. ICLR, 2021. 5, 6, 7
[18] Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, [31] Yilun Du, Shuang Li, Joshua Tenenbaum, and Igor Mor-
Matthias Nießner, and Luisa Verdoliva. Raising the bar datch. Improved contrastive divergence training of energy-
of ai-generated image detection with clip. In Proceedings of based models. In International Conference on Machine
the IEEE/CVF Conference on Computer Vision and Pattern Learning, pages 2837–2848. PMLR, 2021. 3
Recognition, pages 4356–4366, 2024. 2, 3, 4, 5, 7 [32] Ricard Durall, Margret Keuper, and Janis Keuper. Watch
[19] Katherine Crowson, Stefan Andreas Baumann, Alex Birch, your up-convolution: Cnn based generative deep neural net-
Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico works are failing to reproduce spectral distributions. In Pro-
Shippole. Scalable high-resolution pixel-space image syn- ceedings of the IEEE/CVF conference on computer vision
thesis with hourglass diffusion transformers. In Proceedings and pattern recognition, pages 7890–7899, 2020. 3

9
[33] Tarik Dzanic, Karan Shah, and Freddie Witherden. Fourier [48] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and
spectrum discrepancies in deep network generated images. Dawn Song. Using self-supervised learning can improve
Advances in neural information processing systems, 33:3022– model robustness and uncertainty. In Advances in Neural
3032, 2020. 3 Information Processing Systems. Curran Associates, Inc.,
[34] Sven Elflein, Bertrand Charpentier, Daniel Zügner, and 2019. 3
Stephan Günnemann. On out-of-distribution detection with [49] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Ka-
energy-based models, 2021. 3 davath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler
[35] David C Epstein, Ishan Jain, Oliver Wang, and Richard Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Stein-
Zhang. Online detection of ai-generated images. In Proceed- hardt, and Justin Gilmer. The many faces of robustness:
ings of the IEEE/CVF International Conference on Com- A critical analysis of out-of-distribution generalization. In
puter Vision, pages 382–392, 2023. 2, 3, 4 Proceedings of the IEEE/CVF International Conference on
[36] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming Computer Vision (ICCV), pages 8340–8349, 2021. 3
transformers for high-resolution image synthesis. In Pro- [50] Yan Hong and Jianfu Zhang. Wildfake: A large-scale chal-
ceedings of the IEEE/CVF conference on computer vision lenging dataset for ai-generated images detection. arXiv
and pattern recognition, pages 12873–12883, 2021. 4 preprint arXiv:2402.11843, 2024. 3
[37] Hugging Face. Hugging face diffusers library. https: [51] Shuwei Hou, Yan Ju, Chengzhe Sun, Shan Jia, Lipeng Ke,
//huggingface.co/models?library=diffusers, ac- Riky Zhou, Anita Nikolich, and Siwei Lyu. Deepfake-o-
cessed on June 05, 2022, 2022. 2, 3, 4, 15 meter v2. 0: An open platform for deepfake detection. arXiv
[38] Sebastian Gehrmann, Hendrik Strobelt, and Alexander M preprint arXiv:2404.13146, 2024. 3
Rush. Gltr: Statistical detection and visualization of gener- [52] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-
ated text. arXiv preprint arXiv:1906.04043, 2019. 3 Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
[39] Thomas Gloe and Rainer Böhme. The’dresden image Lora: Low-rank adaptation of large language models. arXiv
database’for benchmarking digital image forensics. In Pro- preprint arXiv:2106.09685, 2021. 4
ceedings of the 2010 ACM symposium on applied computing,
[53] Drew A Hudson and Larry Zitnick. Generative adversar-
pages 1584–1590, 2010. 4
ial transformers. In International conference on machine
[40] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing learning, pages 4487–4499. PMLR, 2021. 4
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
[54] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A
Yoshua Bengio. Generative adversarial nets. Advances in
Efros. Fighting fake news: Image splice detection via
Neural Information Processing Systems, 27, 2014. 2
learned self-consistency. In Proceedings of the European
[41] Google. Imagen 3. https : / / deepmind . google /
conference on computer vision (ECCV), 2018. 3
technologies/imagen-3, 2024. 4
[55] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade
[42] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blun-
Gordon, Nicholas Carlini, Rohan Taori, Achal Dave,
dell, and Daan Wierstra. Deep autoregressive networks.
Vaishaal Shankar, Hongseok Namkoong, John Miller, Han-
In International Conference on Machine Learning, pages
naneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Open-
1242–1250. PMLR, 2014. 2
clip, 2021. If you use this software, please cite it as below.
[43] Patrick Grommelt, Louis Weiss, Franz-Josef Pfreundt, and
5, 6, 7
Janis Keuper. Fake or jpeg? revealing common biases
in generated image detection datasets. arXiv preprint [56] Ganesh Jawahar, Muhammad Abdul-Mageed, and Laks VS
arXiv:2403.17608, 2024. 3 Lakshmanan. Automatic detection of machine generated
[44] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo text: A critical survey. arXiv preprint arXiv:2011.01314,
Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector 2020. 3
quantized diffusion model for text-to-image synthesis. In [57] Micah K Johnson and Hany Farid. Exposing digital forgeries
Proceedings of the IEEE/CVF Conference on Computer Vi- in complex lighting environments. IEEE Transactions on
sion and Pattern Recognition, pages 10696–10706, 2022. 4, Information Forensics and Security, 2(3):450–461, 2007. 3
5 [58] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park,
[45] Benjamin Hadwiger and Christian Riess. The forchheim im- Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up
age database for camera identification in the wild. In Pattern gans for text-to-image synthesis. In Proceedings of the IEEE
Recognition. ICPR International Workshops and Challenges: Conference on Computer Vision and Pattern Recognition
Virtual Event, January 10–15, 2021, Proceedings, Part VI, (CVPR), 2023. 4, 5
pages 500–515. Springer, 2021. 4, 7 [59] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.
[46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Progressive growing of gans for improved quality, stability,
Deep residual learning for image recognition. In Proceed- and variation. In International Conference on Learning
ings of the IEEE conference on computer vision and pattern Representations, 2018. 3, 4
recognition, pages 770–778, 2016. 5, 6 [60] Tero Karras, Samuli Laine, and Timo Aila. A style-based
[47] Dan Hendrycks and Kevin Gimpel. A baseline for detect- generator architecture for generative adversarial networks.
ing misclassified and out-of-distribution examples in neural In Proceedings of the IEEE/CVF conference on computer
networks. In International Conference on Learning Repre- vision and pattern recognition, pages 4401–4410, 2019. 3,
sentations, 2016. 3 4, 5, 6, 14, 15

10
[61] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Computer Vision–ECCV 2014: 13th European Conference,
Jaakko Lehtinen, and Timo Aila. Training generative ad- Zurich, Switzerland, September 6-12, 2014, Proceedings,
versarial networks with limited data. Advances in neural Part V 13, pages 740–755. Springer, 2014. 4, 5, 6, 7, 14
information processing systems, 33:12104–12114, 2020. 4 [76] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li.
[62] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Energy-based out-of-distribution detection. In Advances
Jaakko Lehtinen, and Timo Aila. Analyzing and improv- in Neural Information Processing Systems, pages 21464–
ing the image quality of stylegan. In Proceedings of the 21475. Curran Associates, Inc., 2020. 3
IEEE/CVF conference on computer vision and pattern recog- [77] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang.
nition, pages 8110–8119, 2020. Deep learning face attributes in the wild. In Proceedings of
[63] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, International Conference on Computer Vision (ICCV), 2015.
Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias- 4, 14
free generative adversarial networks. Advances in neural [78] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht-
information processing systems, 34:852–863, 2021. 4 enhofer, Trevor Darrell, and Saining Xie. A convnet for the
[64] Hasam Khalid, Shahroz Tariq, Minha Kim, and Simon Woo. 2020s. In Proceedings of the IEEE/CVF conference on com-
Fakeavceleb: A novel audio-video multimodal deepfake puter vision and pattern recognition, pages 11976–11986,
dataset. In Proceedings of the Neural Information Process- 2022. 5, 7
ing Systems Track on Datasets and Benchmarks, 2021. 3 [79] Ilya Loshchilov and Frank Hutter. Decoupled weight decay
[65] P. Korus and J. Huang. Multi-scale analysis strategies in regularization. In International Conference on Learning
prnu-based tampering localization. IEEE Trans. on Informa- Representations, 2019. 15
tion Forensics & Security, 2017. 3 [80] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang
[66] Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Zhao. Latent consistency models: Synthesizing high-
Wieting, and Mohit Iyyer. Paraphrasing evades detectors resolution images with few-step inference. arXiv preprint
of ai-generated text, but retrieval is an effective defense. arXiv:2310.04378, 2023. 4
Advances in Neural Information Processing Systems, 36, [81] Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von
2024. 3 Platen, Apolinário Passos, Longbo Huang, Jian Li, and Hang
Zhao. Lcm-lora: A universal stable-diffusion acceleration
[67] Kvikontent. Kvikontent-midjourney v6.
module. arXiv preprint arXiv:2311.05556, 2023. 2, 5
https://huggingface.co/Kvikontent/midjourney-v6, 2023. 5
[82] Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and
[68] Patrick Kwon, Jaeseong You, Gyuhyeon Nam, Sungwoo
Giovanni Poggi. Do gans leave artificial fingerprints? In
Park, and Gyeongsu Chae. Kodf: A large-scale korean
2019 IEEE conference on multimedia information process-
deepfake detection dataset. In Proceedings of the IEEE/CVF
ing and retrieval (MIPR), pages 506–511. IEEE, 2019. 3
International Conference on Computer Vision, 2021. 3
[83] Leland McInnes, John Healy, Nathaniel Saul, and Lukas
[69] Black Forst Labs. Flux. https://blackforestlabs.ai,
Großberger. Umap: Uniform manifold approximation and
2024. 4
projection. Journal of Open Source Software, 3(29), 2018.
[70] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 14
Blip: Bootstrapping language-image pre-training for unified [84] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gre-
vision-language understanding and generation. In ICML, gory Diamos, Erich Elsen, David Garcia, Boris Ginsburg,
2022. 5 Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al.
[71] Ke Li, Tianhao Zhang, and Jitendra Malik. Diverse image Mixed precision training. In International Conference on
synthesis from semantic layouts via conditional imle. In Learning Representations, 2018. 15
Proceedings of the IEEE/CVF International Conference on [85] Inc. Midjourney. Midjourney. https://www.midjourney.
Computer Vision, pages 4220–4229, 2019. 3, 5 com/home, 2022. 4, 5
[72] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. [86] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christo-
Celeb-df: A large-scale challenging dataset for deepfake pher D Manning, and Chelsea Finn. Detectgpt: Zero-shot
forensics. In Proceedings of the IEEE/CVF conference on machine-generated text detection using probability curva-
computer vision and pattern recognition, 2020. 3 ture. In International Conference on Machine Learning,
[73] Yewen Li, Chaojie Wang, Xiaobo Xia, Tongliang Liu, Bo pages 24950–24962. PMLR, 2023. 3
An, et al. Out-of-distribution detection with an adaptive [87] Sina Mohseni, Mandar Pitale, JBS Yadawa, and Zhangyang
likelihood ratio on informative hierarchical vae. Advances Wang. Self-supervised learning for generalizable out-of-
in Neural Information Processing Systems, 35:7383–7396, distribution detection. Proceedings of the AAAI Conference
2022. 3 on Artificial Intelligence, 34(04):5216–5223, 2020. 3
[74] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the [88] Tian-Tsong Ng, Shih-Fu Chang, and Q Sun. A data set of
reliability of out-of-distribution image detection in neural authentic and spliced image blocks. Columbia University,
networks. In International Conference on Learning Repre- ADVENT Technical Report, 4, 2004. 3
sentations, 2018. 3 [89] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh,
[75] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya
Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Sutskever, and Mark Chen. Glide: Towards photorealis-
Zitnick. Microsoft coco: Common objects in context. In tic image generation and editing with text-guided diffusion

11
models. In International Conference on Machine Learning, synthesis with latent diffusion models. In Proceedings of
pages 16784–16804. PMLR, 2022. 4, 5 the IEEE/CVF Conference on Computer Vision and Pattern
[90] Adam Novozamsky, Babak Mahdian, and Stanislav Saic. Recognition (CVPR), pages 10684–10695, 2022. 2, 4, 5, 8
Imd2020: A large-scale annotated dataset tailored for detect- [103] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Chris-
ing manipulated images. In Proceedings of the IEEE/CVF tian Riess, Justus Thies, and Matthias Nießner. Faceforen-
Winter Conference on Applications of Computer Vision Work- sics: A large-scale video dataset for forgery detection in
shops, pages 71–80, 2020. 4, 7 human faces. arXiv preprint arXiv:1803.09179, 2018. 1, 3
[91] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards uni- [104] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Chris-
versal fake image detectors that generalize across generative tian Riess, Justus Thies, and Matthias Nießner. FaceForen-
models. In Proceedings of the IEEE/CVF Conference on sics++: Learning to detect manipulated facial images. In
Computer Vision and Pattern Recognition (CVPR), pages International Conference on Computer Vision (ICCV), 2019.
24480–24489, 2023. 1, 2, 3, 4, 5, 6, 7, 8, 14 3
[92] OpenAI. Dall-e 2. https://openai.com/index/dall- [105] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Chris-
e-2, 2022. 4, 5 tian Riess, Justus Thies, and Matthias Nießner. Faceforen-
[93] OpenAI. Dall-e 3. https://openai.com/index/dall- sics++: Learning to detect manipulated facial images. In
e-3, 2023. 4, 5 Proceedings of the IEEE/CVF international conference on
computer vision, 2019. 1, 3
[94] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-
Yan Zhu. Semantic image synthesis with spatially-adaptive [106] Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubra-
normalization. In Proceedings of the IEEE Conference on manian, Wenxiao Wang, and Soheil Feizi. Can ai-generated
Computer Vision and Pattern Recognition, 2019. 3 text be reliably detected? arXiv preprint arXiv:2303.11156,
2023. 3
[95] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer,
[107] Chandramouli Shama Sastry and Sageev Oore. Detecting
James Bradbury, Gregory Chanan, Trevor Killeen, Zeming
out-of-distribution examples with Gram matrices. In Pro-
Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An
ceedings of the 37th International Conference on Machine
imperative style, high-performance deep learning library. Ad-
Learning, pages 8491–8501. PMLR, 2020. 3
vances in neural information processing systems, 32, 2019.
[108] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas
5
Geiger. Projected gans converge faster. In Advances in
[96] William Peebles and Saining Xie. Scalable diffusion models
Neural Information Processing Systems (NeurIPS), 2021. 4
with transformers. In Proceedings of the IEEE/CVF Interna-
[109] Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-
tional Conference on Computer Vision, pages 4195–4205,
xl: Scaling stylegan to large diverse datasets. In ACM SIG-
2023. 4
GRAPH 2022 conference proceedings, pages 1–10, 2022.
[97] Pablo Pernias, Dominic Rampas, Mats Leon Richter, 4
Christopher Pal, and Marc Aubreville. Würstchen: An ef- [110] Christoph Schuhmann, Richard Vencu, Romain Beaumont,
ficient architecture for large-scale text-to-image diffusion Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo
models. In The Twelfth International Conference on Learn- Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m:
ing Representations, 2023. 5, 6, 14 Open dataset of clip-filtered 400 million image-text pairs.
[98] Dustin Podell, Zion English, Kyle Lacey, Andreas arXiv preprint arXiv:2111.02114, 2021. 4, 5, 7, 14
Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and [111] Christoph Schuhmann, Romain Beaumont, Richard Vencu,
Robin Rombach. Sdxl: Improving latent diffusion models Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo
for high-resolution image synthesis. In The Twelfth Inter- Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts-
national Conference on Learning Representations, 2023. man, Patrick Schramowski, Srivatsa R Kundurthy, Katherine
5 Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia
[99] Alin C Popescu and Hany Farid. Exposing digital forgeries Jitsev. LAION-5b: An open large-scale dataset for training
by detecting traces of resampling. IEEE Transactions on next generation image-text models. In Thirty-sixth Confer-
signal processing, 2005. 3 ence on Neural Information Processing Systems Datasets
[100] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya and Benchmarks Track, 2022. 3, 4, 5
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, [112] Vikash Sehwag, Mung Chiang, and Prateek Mittal. Ssd:
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- A unified framework for self-supervised outlier detection.
ing transferable visual models from natural language super- In International Conference on Learning Representations,
vision. In International conference on machine learning, 2020. 3
pages 8748–8763. PMLR, 2021. 3, 5, 6, 7 [113] Arseniy Shakhmatov, Anton Razzhigaev, Aleksandr
[101] Jie Ren, Peter J Liu, Emily Fertig, Jasper Snoek, Ryan Nikolich, Vladimir Arkhipkin, Igor Pavlov, Andrey
Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshmi- Kuznetsov, and Denis Dimitrov. Kandinsky 2.2. https:
narayanan. Likelihood ratios for out-of-distribution detec- //github.com/ai-forever/Kandinsky-2, 2023. 5
tion. Advances in neural information processing systems, [114] Dasara Shullani, Marco Fontani, Massimo Iuliani, Omar Al
32, 2019. 3 Shaya, and Alessandro Piva. Vision: a video and image
[102] Robin Rombach, Andreas Blattmann, Dominik Lorenz, dataset for source identification. EURASIP Journal on In-
Patrick Esser, and Björn Ommer. High-resolution image formation Security, 2017:1–16, 2017. 4, 7

12
[115] Ivan Skorokhodov, Grigorii Sotnikov, and Mohamed El- Systems, pages 20685–20696. Curran Associates, Inc., 2020.
hoseiny. Aligning latent and image spaces to connect the 3
unconnectable. arXiv preprint arXiv:2104.06954, 2021. 4, [129] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianx-
7, 14 iong Xiao. Lsun: Construction of a large-scale image dataset
[116] Irene Solaiman, Miles Brundage, Jack Clark, Amanda using deep learning with humans in the loop. arXiv preprint
Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen arXiv:1506.03365, 2015. 4
Krueger, Jong Wook Kim, Sarah Kreps, et al. Release [130] Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong
strategies and the social impacts of language models. arXiv Chen, Fang Wen, Yong Wang, and Baining Guo. Styleswin:
preprint arXiv:1908.09203, 2019. 3 Transformer-based gan for high-resolution image generation.
[117] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya In Proceedings of the IEEE/CVF conference on computer
Sutskever. Consistency models. In Proceedings of the vision and pattern recognition, pages 11304–11314, 2022. 4
40th International Conference on Machine Learning, pages [131] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detect-
32211–32252, 2023. 2, 4 ing and simulating artifacts in gan fake images. In 2019
[118] Yuhta Takida, Masaaki Imaizumi, Takashi Shibuya, Chieh- IEEE international workshop on information forensics and
Hsin Lai, Toshimitsu Uesaka, Naoki Murata, and Yuki Mit- security (WIFS), pages 1–6. IEEE, 2019. 3
sufuji. SAN: Inducing metrizability of GAN with discrimi- [132] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A
native normalized linear layer. In The Twelfth International Efros. Unpaired image-to-image translation using cycle-
Conference on Learning Representations, 2024. 4 consistent adversarial networks. In Computer Vision (ICCV),
[119] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun 2017 IEEE International Conference on, 2017. 3, 5
Bao, and Changsheng Xu. Df-gan: A simple and effec- [133] Mingjian Zhu, Hanting Chen, Qiangyu YAN, Xudong
tive baseline for text-to-image synthesis. In Proceedings of Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu,
the IEEE/CVF conference on computer vision and pattern and Yunhe Wang. Genimage: A million-scale benchmark
recognition, pages 16515–16525, 2022. 5 for detecting ai-generated image. In Advances in Neural
[120] Ming Tao, Bing-Kun Bao, Hao Tang, and Changsheng Xu. Information Processing Systems Dataset and Benchmarks
Galip: Generative adversarial clips for text-to-image synthe- Track, pages 77771–77782, 2023. 2, 3, 4, 5, 6, 7, 8
sis. In Proceedings of the IEEE/CVF Conference on Com- [134] Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and
puter Vision and Pattern Recognition, pages 14214–14223, Yu-Gang Jiang. Wilddeepfake: A challenging real-world
2023. 5 dataset for deepfake detection. In Proceedings of the 28th
[121] DeciAI Research Team. Decidiffusion 2.0, 2024. 5 ACM international conference on multimedia, 2020. 3
[122] Adaku Uchendu, Thai Le, Kai Shu, and Dongwon Lee. Au-
thorship attribution for neural text generation. In Proceed-
ings of the 2020 Conference on Empirical Methods in Natu-
ral Language Processing (EMNLP), pages 8384–8395, 2020.
3
[123] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete
representation learning. Advances in neural information
processing systems, 30, 2017. 8
[124] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro
Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj,
Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven
Liu, and Thomas Wolf. Diffusers: State-of-the-art dif-
fusion models. https://github.com/huggingface/
diffusers, 2022. 2, 3, 4
[125] Apoorv Vyas, Nataraj Jammalamadaka, Xia Zhu, Dipankar
Das, Bharat Kaul, and Theodore L. Willke. Out-of-
distribution detection using an ensemble of self supervised
leave-out classifiers. In Proceedings of the European Con-
ference on Computer Vision (ECCV), 2018. 3
[126] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew
Owens, and Alexei A. Efros. Cnn-generated images are sur-
prisingly easy to spot... for now. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2020. 1, 2, 3, 4, 5, 6, 7, 8
[127] Ross Wightman. Pytorch image models. https://github.
com/huggingface/pytorch-image-models, 2019. 5
[128] Zhisheng Xiao, Qing Yan, and Yali Amit. Likelihood regret:
An out-of-distribution detection score for variational auto-
encoder. In Advances in Neural Information Processing

13
Predicted
A. Other applications GAN LatDiff PixDiff Real
GAN 0.93 0.04 0.01 0.01 Predicted
20 20 GAN LatDiff PixDiff Real

Ground Truth
15 15 LatDiff 0.02 0.95 0.00 0.03 1 Commercial 0.22 0.40 0.03 0.36 1

10 10 PixDiff 0.20 0.03 0.26 0.52 Stable

0 Cascade 0.02 0.96 0.00 0.01 0
5 5
Real 0.00 0.00 0.00 0.99
0 0
5 5 (a) Known architectures (b) Unknown architectures
10 10
5 0 5 10 15 20 5 0 5 10 15 20 Figure 9. Generator type classification. We classify the generator
Fake Real Commercial GAN FFHQ type of a given image using k-nearest-neighbor. (a) Confusion
LatDiff LAION CelebA
PixDiff
Other
COCO
ImageNet
RAISE matrix of “known” generator types. We observe high accuracy in
LandscapesHQ
GANs, latent diffusions, and real data. (b) Classification results on
(a) Fake vs. real visualization (b) Generator type visualization “unknown” architectures. Commercial models are predominantly
Figure 8. Feature space visualization. We visualize a feature classified as latent diffusion and GANs (disregarding ‘real’). Stable
space of our trained classifier using 10% of our training data and Cascade [97], which we categorized as Other generator type, shows
the evaluation set. For better visibility, only a subset of our real similarity to latent diffusion models.
datasets are visualized and the labels for real datasets are italicized.
We observe a good separation between fake vs. real data, and
between different generator types and real datasets. nantly classified as latent diffusion or GANs, while Stable
Cascade [97] displays similarity to latent diffusion models
Other applications, beyond the “real-or-fake” image foren- despite their unique three-stage sampling process.
sics task, could potentially be supported by our dataset. In
particular, a diverse array of generators and their correspond- B. Dataset composition
ing images in our dataset may be valuable for addressing
the generator attribution problem, where the goal is to iden-
tify the characteristics of the underlying generator that is Model License Counts
responsible for synthesizing a given image.
4000
Figure 8 presents a UMAP [83] visualization of the fea-
ture space of our trained classifier. We use the activation
3000
of the penultimate layer for visualization following Ojha
et al. [91]. The feature space reveals interesting struc-
2000
ture: GANs form a clearly separated cluster; most com-
mercial models are distributed closely to latent diffusion
1000
models; real datasets such as LAION [110], ImageNet [26],
COCO [75], and RAISE [22] are closely distributed, whereas
0
CelebA [77], FFHQ [60], and Landscapes HQ [115] appear
creativeml-openrail-m

playground-v2
NVIDIA-Source-NC

bigcode-openrail-m
openrail++
gpl-3.0

bigscience-openrail-m

agpl-3.0

yodayno-v2
None
mit
openrail
apache-2.0
faipl-1.0-sd

sai-nc

Ideogram TOS

lgpl-3.0

cc
SDXL 0.9 Research License

FFXL Research License

bigscience-bloom-rail-1.0

OpenAI TOS
artistic-2.0
wtfpl

stable-cascade-nc
DeepFloyd-IF

Google TOS
cc-by-nc-sa-4.0
cc-by-nc-nd-4.0
cc-by-nc-4.0

overall-license

FLUX.1-dev-nc
Midjourney TOS
Adobe TOS

cc-by-4.0
cc-by-sa-4.0

HelpingAI License v2.0

fluently-license

to be more isolated. It is important to note that these separa-

tions emerge naturally without explicit training. A targeted
learning objective may further enhance these separations.
Building on the feature space observations, we use a k-
nearest-neighbor classifier with k = 5 using 10% of our train-
ing data to identify the generator types in our evaluation set.
We separate generators as “known” (i.e., GANs, latent and Figure 10. Histogram of model licenses in our dataset. A vast ma-
pixel diffusions, and real data) and “unknown” (commercial jority of the models use the CreativeML OpenRAIL-M license [1].
models and Stable Cascade [97]) generator types and com-
pute the confusion matrices as shown in Figure 9. Note that Generator licenses. In Figure 10, we report the gener-
none of these generators are seen during training. Figure 9a ator licenses in our dataset. Most of the models use the
demonstrates strong performance in identifying GANs, la- CreativeML OpenRAIL-M license [1].
tent diffusion models, and real data. However, pixel-based
diffusion models show lower performance, possibly due to Model metadata. We show an example model metadata in
their limited representation (only 3 models) in our training Tab. 3. It contains the name of the models, their categorized
set. The classification result for the “unknown” set is shown architectures, licenses, source real datasets, and the Hugging
in Figure 9b. Interestingly, commercial models are predomi- Face tags if available.

14
Model Architecture License RealSource HF_pipeline_tag HF_diffusers_tag
danbochman/ coco,forchheim,imagenet,imd2020,laion, StableDiffusionXL- StableDiffusionXL-
LatentDiff None
ccxl landscapesHQ,vision Pipeline Pipeline
livingbox/
creativeml- coco,forchheim,imagenet,imd2020,laion, StableDiffusion-
modern- LatentDiff stable-diffusion
openrail-m landscapesHQ,vision Pipeline
style-v3
...
DeepFloyd PixelDiff DeepFloyd-IF coco N/A N/A
BigGAN GAN MIT imagenet N/A N/A
...

Table 3. Example model metadata. We log both the author and model names for the Hugging Face [37] models and only the model names
for others. We also log the generator type (i.e., architecture), model license, source real dataset, and Hugging Face tags if available.

Latent Diff. GAN Pixel Diff. Other 1.000

Models 4766 12 3 1

mAP
Percentage 99.67% 0.25% 0.06% 0.02% 0.975
Table 4. Model counts per architecture in the training set. A vast
majority of the generators are latent diffusion models. 0.950
0.8K 3.2K 13K 36K 104K
Training Iterations
Figure 13. Impact of training iterations. The performance of the
classifier plateaus beyond 3K iterations.

weight decay with a warmup of 20% of the total iterations.

22% 73% We train our models for 52K iterations using this setting. For
the models in Figures 1 and 4, we employ shorter training
Latent Diff. iterations (3K) due to the computational overhead associated
3% GAN with training a substantial number of models for statistical
2% Pixel Diff. analysis. We chose this number of iterations since we found
Other that classifier performance begins to plateau with approxi-
mately this amount of training (Figure 13).
Figure 11. Number of images per generator type in the training set.
D. Example model project page
Commercial Latent Diff. GAN Pixel Diff. Other
Models 11 6 2 1 1
Images 14918 6000 2000 2000 1000

Figure 12. Evaluation set composition.

Model composition. The composition of the training set

of Community Forensics is detailed in Table 4 and Fig. 11.
A vast majority of the models and generated images are
latent diffusion. Figure 12 illustrates the composition of the
evaluation set, which includes two variants of HDiT [19]:
one trained on FFHQ [60] and another on ImageNet [26].
For computing metrics such as mAP and accuracy, these
HDiT variants are treated as separate entities due to their
distinct training data and model weights. However, when
reporting the number of models in our dataset, we count
them as a single model. Figure 14. Example model project page from Hugging Face [2, 37].

C. Training settings Figure 14 shows a project page from Hugging Face [2, 37].
For training our classifiers, we use AdamW optimizer [79] We can see the tags associated with the model (e.g., Text-
with a learning rate of 2e-5, a weight decay of 1e-2, a to-image, pipeline type, license), number of downloads, and
batch size of 512, and mixed precision [84]. We use a cosine sample images.

Sketch2face: Conditional Generative Adversarial Networks For Transforming Face Sketches Into Photorealistic Images
No ratings yet
Sketch2face: Conditional Generative Adversarial Networks For Transforming Face Sketches Into Photorealistic Images
9 pages
A13、A13A ServiceManual PDF
50% (2)
A13、A13A ServiceManual PDF
1,333 pages
Exploring Painting Synthesis With Diffusion Models 2
No ratings yet
Exploring Painting Synthesis With Diffusion Models 2
10 pages
Fake Image Detection Research Paper
No ratings yet
Fake Image Detection Research Paper
10 pages
Epstein Online Papers 2024
No ratings yet
Epstein Online Papers 2024
11 pages
Detecting Facial Image Forgeries With Transfer Learning Techniques
No ratings yet
Detecting Facial Image Forgeries With Transfer Learning Techniques
13 pages
Detection of AI Generated Images
No ratings yet
Detection of AI Generated Images
6 pages
A Simplified Generative Model Based On Gradient Descent and Mean Square Error
No ratings yet
A Simplified Generative Model Based On Gradient Descent and Mean Square Error
8 pages
CIFAKE Image Classification and Explainable Identification of AI-Generated Synthetic Images
No ratings yet
CIFAKE Image Classification and Explainable Identification of AI-Generated Synthetic Images
10 pages
AMC Junior 2020
No ratings yet
AMC Junior 2020
12 pages
Methods and Trends in Detecting Generated Images: A Comprehensive Review
No ratings yet
Methods and Trends in Detecting Generated Images: A Comprehensive Review
30 pages
A Survey of Diffusion Based Image Generation Models: Issues and Their Solutions
No ratings yet
A Survey of Diffusion Based Image Generation Models: Issues and Their Solutions
11 pages
1 s2.0 S0262885623001452 Main
No ratings yet
1 s2.0 S0262885623001452 Main
21 pages
Full Text 01
No ratings yet
Full Text 01
39 pages
Image To Imag e Translation
No ratings yet
Image To Imag e Translation
19 pages
Today 2
No ratings yet
Today 2
49 pages
Baseline Paper
No ratings yet
Baseline Paper
7 pages
Wa0004.
No ratings yet
Wa0004.
59 pages
Image Classification and Generation of Images
No ratings yet
Image Classification and Generation of Images
21 pages
MD Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, Zaber Ibn Abdul Hakim Shaikh Anowarul Fattah
No ratings yet
MD Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, Zaber Ibn Abdul Hakim Shaikh Anowarul Fattah
5 pages
Block Based Deepfake Detection Main
No ratings yet
Block Based Deepfake Detection Main
15 pages
CIFAKE Image Classification and Explainable Identification of AI-Generated Synthetic Images (1) Chat GPT
No ratings yet
CIFAKE Image Classification and Explainable Identification of AI-Generated Synthetic Images (1) Chat GPT
9 pages
Unsupervised Generative Fake Image Detector
No ratings yet
Unsupervised Generative Fake Image Detector
14 pages
Accuracy and Fidelity Comparison of Luna and DALL-E 2 Diffusion-Based Image Generation Systems
No ratings yet
Accuracy and Fidelity Comparison of Luna and DALL-E 2 Diffusion-Based Image Generation Systems
5 pages
Zero-Shot Detection of AI-Generated Images: (Davide - Cozzolino, Poggi, Verdoliv) @unina - It Niessner@tum - de
No ratings yet
Zero-Shot Detection of AI-Generated Images: (Davide - Cozzolino, Poggi, Verdoliv) @unina - It Niessner@tum - de
24 pages
CIFAKE Image Classification and Explainable Identification of AI-Generated Synthetic Images
No ratings yet
CIFAKE Image Classification and Explainable Identification of AI-Generated Synthetic Images
18 pages
Synthetic Image Verification in The Era of Generative AI: What Works and What Isn't There Yet
No ratings yet
Synthetic Image Verification in The Era of Generative AI: What Works and What Isn't There Yet
11 pages
Tan Rethinking The Up-Sampling Operations in CNN-based Generative Network For Generalizable CVPR 2024 Paper
No ratings yet
Tan Rethinking The Up-Sampling Operations in CNN-based Generative Network For Generalizable CVPR 2024 Paper
10 pages
浙大论文
No ratings yet
浙大论文
23 pages
Corel DESIGNER 12 User Guide PDF
No ratings yet
Corel DESIGNER 12 User Guide PDF
460 pages
Instagen: Enhancing Object Detection by Training On Synthetic Dataset
No ratings yet
Instagen: Enhancing Object Detection by Training On Synthetic Dataset
13 pages
Advancing GANDeepfake Detection
No ratings yet
Advancing GANDeepfake Detection
24 pages
Wang CNN-Generated Images Are Surprisingly Easy To Spot... For Now CVPR 2020 Paper
No ratings yet
Wang CNN-Generated Images Are Surprisingly Easy To Spot... For Now CVPR 2020 Paper
10 pages
Peerj Cs 2127
No ratings yet
Peerj Cs 2127
19 pages
CNN-generated Images Are Surprisingly Easy To Spot... For Now
No ratings yet
CNN-generated Images Are Surprisingly Easy To Spot... For Now
13 pages
Final Report
No ratings yet
Final Report
7 pages
Stable Signature
No ratings yet
Stable Signature
21 pages
Towards The Detection of Diffusion Model Deepfakes
No ratings yet
Towards The Detection of Diffusion Model Deepfakes
27 pages
Detection Methods For AI Generated Visual Content (2020-2025)
No ratings yet
Detection Methods For AI Generated Visual Content (2020-2025)
10 pages
Zero-Shot Detection of AI-Generated Images
No ratings yet
Zero-Shot Detection of AI-Generated Images
19 pages
Jimaging 09 00199 With Cover
No ratings yet
Jimaging 09 00199 With Cover
19 pages
Generative Adversarial Networks For Visible To Infrared Video Con
No ratings yet
Generative Adversarial Networks For Visible To Infrared Video Con
14 pages
Visual Gans
No ratings yet
Visual Gans
19 pages
IDS705 Final Report
No ratings yet
IDS705 Final Report
27 pages
RND Report
No ratings yet
RND Report
10 pages
Synthetic Image Transformation
No ratings yet
Synthetic Image Transformation
13 pages
SSRN 4779382
No ratings yet
SSRN 4779382
11 pages
Applsci 13 10637 v2
No ratings yet
Applsci 13 10637 v2
29 pages
Deepfake Image Detection
No ratings yet
Deepfake Image Detection
7 pages
Are GAN Generated Images Easy To Detect A Critical Analysis of The State-Of-The-Art.
No ratings yet
Are GAN Generated Images Easy To Detect A Critical Analysis of The State-Of-The-Art.
7 pages
Tan Et Al - 2023 - Rethinking the Up-Sampling Operations in CNN-based Generative Network For - 副本
No ratings yet
Tan Et Al - 2023 - Rethinking the Up-Sampling Operations in CNN-based Generative Network For - 副本
10 pages
Fake Image Identification Using CNN
No ratings yet
Fake Image Identification Using CNN
3 pages
IJRPR11629
No ratings yet
IJRPR11629
7 pages
Fusing Global and Local Features For Generalized AI-synthesized Image Detection
No ratings yet
Fusing Global and Local Features For Generalized AI-synthesized Image Detection
5 pages
Synthetic Data Generation For Scarce Road Scene Detection Scenarios
No ratings yet
Synthetic Data Generation For Scarce Road Scene Detection Scenarios
10 pages
Do GANs Leave Artificial Pingerprints
No ratings yet
Do GANs Leave Artificial Pingerprints
6 pages
A Review of Generative Adversarial Networks For Computer Vision TasksElectronics Switzerland
No ratings yet
A Review of Generative Adversarial Networks For Computer Vision TasksElectronics Switzerland
17 pages
Fig. 1: Synthetic Images Generated Using Recent Text-To-Image Mod-Els: DALL E 2 (3), Stable Diffusion (4) and GLIDE
No ratings yet
Fig. 1: Synthetic Images Generated Using Recent Text-To-Image Mod-Els: DALL E 2 (3), Stable Diffusion (4) and GLIDE
6 pages
Roofpak Units With Energy Recovery Wheels: Packaged Heating and Cooling Units
No ratings yet
Roofpak Units With Energy Recovery Wheels: Packaged Heating and Cooling Units
74 pages
Deepfake Detector
No ratings yet
Deepfake Detector
6 pages
Fake Image Detection PDF
No ratings yet
Fake Image Detection PDF
19 pages
Continuous Concrete Beam Design To Bs 81101997 Table 3.5
No ratings yet
Continuous Concrete Beam Design To Bs 81101997 Table 3.5
8 pages
Belt Drives 1
No ratings yet
Belt Drives 1
41 pages
Butter Making Equipment
No ratings yet
Butter Making Equipment
8 pages
Solutions: Unit - 1
No ratings yet
Solutions: Unit - 1
38 pages
ADMT ZN Detector Manuel 2023
No ratings yet
ADMT ZN Detector Manuel 2023
29 pages
Practical Analytical 1 ,,chemistry
No ratings yet
Practical Analytical 1 ,,chemistry
45 pages
Physics 00111 - Exam Nov 2017 - Tharaka
No ratings yet
Physics 00111 - Exam Nov 2017 - Tharaka
3 pages
Application Development Using Flutter
No ratings yet
Application Development Using Flutter
5 pages
Ijrdet 0115 02
No ratings yet
Ijrdet 0115 02
6 pages
Class 12 Accounts, Partnership Admission New Ratio Treatment of GW Revaluation Account Test Ansx
No ratings yet
Class 12 Accounts, Partnership Admission New Ratio Treatment of GW Revaluation Account Test Ansx
4 pages
Preliminary Chemistry: Stathfield Girls High School Year 11 Yearly Exam 2004
No ratings yet
Preliminary Chemistry: Stathfield Girls High School Year 11 Yearly Exam 2004
15 pages
IEC 61508 - Techniques and Measures - 1315236692 - 2 PDF
No ratings yet
IEC 61508 - Techniques and Measures - 1315236692 - 2 PDF
3 pages
SNK Series - 2023
No ratings yet
SNK Series - 2023
4 pages
Computer System Servicing 9
No ratings yet
Computer System Servicing 9
6 pages
How To Create Cartoons by Frank Tashlin
96% (24)
How To Create Cartoons by Frank Tashlin
43 pages
Science8 Q1 M2 Grade 10
No ratings yet
Science8 Q1 M2 Grade 10
25 pages
Gke Product Catalog 1
No ratings yet
Gke Product Catalog 1
22 pages
Math 501 Problem Set Topic: Fraction: Submitted To: Ms. Rosyl S. Matin-Ao, MAT-Math Submitted By: Buscagan, Valleree T
No ratings yet
Math 501 Problem Set Topic: Fraction: Submitted To: Ms. Rosyl S. Matin-Ao, MAT-Math Submitted By: Buscagan, Valleree T
7 pages
Brosur Isolation DRY HEXTA
No ratings yet
Brosur Isolation DRY HEXTA
4 pages
Information Asymmetry and Capital Structure Around The World
No ratings yet
Information Asymmetry and Capital Structure Around The World
45 pages
Olefini Catalogue 2015 - Industrial
No ratings yet
Olefini Catalogue 2015 - Industrial
2 pages
Stability Castelated Beams
No ratings yet
Stability Castelated Beams
142 pages
SAP HANA Innovations and Challenges: Unit 1: Introduction
No ratings yet
SAP HANA Innovations and Challenges: Unit 1: Introduction
1 page
Tribhuvan University Faculty of Management Office of The Dean 2011
No ratings yet
Tribhuvan University Faculty of Management Office of The Dean 2011
2 pages
Packet Tracer - Who Hears The Broadcast?: Objectives
No ratings yet
Packet Tracer - Who Hears The Broadcast?: Objectives
2 pages
IR Receiver AX 1838HS
No ratings yet
IR Receiver AX 1838HS
6 pages
AI for Everyone: An Intermediate Guide to Artificial Intelligence
From Everand
AI for Everyone: An Intermediate Guide to Artificial Intelligence
Nova Clarke
No ratings yet
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
From Everand
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
Fouad Sabry
No ratings yet