CIFAKE Image Classification and Explainable Identification of AI-Generated Synthetic Images
ABSTRACT Recent advances in synthetic data have enabled the generation of images with such high
quality that human beings cannot distinguish the difference between real-life photographs and Artificial
Intelligence (AI) generated images. Given the critical necessity of data reliability and authentication, this
article proposes to enhance our ability to recognise AI-generated images through computer vision. Initially,
a synthetic dataset is generated that mirrors the ten classes of the already available CIFAR-10 dataset with
latent diffusion, providing a contrasting set of images for comparison to real photographs. The model is
capable of generating complex visual attributes, such as photorealistic reflections in water. The two sets of
data present as a binary classification problem with regard to whether the photograph is real or generated
by AI. This study then proposes the use of a Convolutional Neural Network (CNN) to classify the images
into two categories: Real or Fake. Following hyperparameter tuning and the training of 36 individual
network topologies, the optimal approach could correctly classify the images with 92.98% accuracy.
Finally, this study implements explainable AI via Gradient Class Activation Mapping to explore which
features within the images are useful for classification. Interpretation reveals interesting concepts within
the image, in particular, noting that the actual entity itself does not hold useful information for
classification; instead, the model focuses on small visual imperfections in the background of the images.
The complete dataset engineered for this study, referred to as the CIFAKE dataset, is made publicly
available to the research community for future work.
INDEX TERMS AI-generated images, generative AI, image classification, latent diffusion.
I. INTRODUCTION
The field of synthetic image generation by Artificial Intelligence (AI) has developed rapidly in recent years, and the ability to detect AI-generated photos has become a critical necessity to ensure the authenticity of image data. Within recent memory, generative technology often produced images with major visual defects that were noticeable to the human eye, but now we are faced with the possibility of AI models generating high-fidelity and photorealistic images in a matter of seconds. AI-generated images are now at the quality level needed to compete with humans and win art competitions [1].

Latent Diffusion Models (LDMs), a type of generative model, have emerged as a powerful tool for generating synthetic imagery [2]. These recent developments have caused a paradigm shift in our understanding of creativity, authenticity, and truth. This has led to a situation where consumer-level technology is available that could quite easily be used to violate privacy and to commit fraud. These philosophical and societal implications are at the forefront of the current state of the art, raising fundamental questions about the nature of trustworthiness and reality. Recent technological advances have enabled the generation of images with such high quality that human beings cannot tell the difference between a real-life photograph and an image that is no more than a hallucination of an artificial neural network's weights and biases.

Generative imagery that is indistinguishable from photographic data raises questions both ontological, those which concern the nature of being, and epistemological, surrounding the theories of methods, validity, and scope. Ontologically, given that humans cannot tell the difference between images from cameras and those generated by AI models such as an Artificial Neural Network, in terms of digital information, what is real and what is not?
The epistemological reality is that there are serious questions surrounding the reliability of human knowledge and the ethical implications that surround the misuse of these types of technology. The implications suggest that we are in growing need of a system that can aid us in the recognition of real images versus those generated by AI.
This study explores the potential of using computer vision to compensate for our newfound inability to recognise the difference between real photographs and those that are AI-generated. Given that there are many years' worth of photographic datasets available for image classification, these provide examples of real images for a model. Following the generation of a synthetic equivalent to such data, we will then explore the output of the model before finally implementing methods of differentiation between the two types of image.
There are several scientific contributions with multidisciplinary and social implications that arise from this study. First, a dataset, called CIFAKE, is generated with latent diffusion and released to the research community. The CIFAKE dataset provides a contrasting set of real and fake photographs and contains 120,000 images: 60,000 images from the existing CIFAR-10 dataset (a collection of images commonly used to train machine learning and computer vision algorithms, available from https://www.cs.toronto.edu/~kriz/cifar.html) and 60,000 images generated for this study, making it a valuable resource for researchers in the field. Second, this study proposes a method to improve our waning human ability to recognise AI-generated images through computer vision, using the CIFAKE dataset for classification. Finally, this study proposes the use of Explainable AI (XAI) to further our understanding of the complex processes involved in synthetic image recognition, as well as visualisation of the important features within those images. These scientific contributions provide important steps forward in addressing the modern challenges posed by rapid developments of modern technology and have important implications for ensuring the authenticity and trustworthiness of data.
The remainder of this article is as follows: the state-of-the-art research background is initially explored in Section II with a discussion of relevant related studies in the field. Following this, the methodology followed by this study is detailed in Section III, which provides the technical implementation and the method followed for the binary classification of real versus AI-generated imagery. The results of these experiments are presented with discussion in Section IV, before this work is finally concluded and future work is proposed in Section V.
II. BACKGROUND
The ability to distinguish between real imagery and that generated by machine learning models is important for a number of reasons. Identification of real data provides confirmation of the authenticity and originality of the image; for example, a fine-tuned Stable Diffusion Model (SDM) could be used to generate a synthetic photograph of
an individual committing a crime or vice versa, providing false
evidence of an alibi for a person who was, in reality, elsewhere. Misinformation and fake news are a significant modern problem, and machine-generated images could be used to manipulate public opinion [3], [4]. The use of synthetic imagery in fake news can lend it false credibility and lead to serious consequences [5]. Cybersecurity is another
major concern, with research noting that synthetically generated
human faces can be used in false acceptance attacks and have
the potential to gain unauthorised access to digital systems [6],
[7]. In [8], it was observed that synthetically generated
signatures could overcome signature verification systems with
ease.
Latent Diffusion Models are a new approach for generating images, which use attention mechanisms and a U-Net to reverse a process of Gaussian noising and, ultimately, use text conditioning to generate novel images from random noise. Details on the methodological implementation of LDMs can be found in Section III. The approach is young and rapidly developing; thus, literature on the subject is currently scarce and few applications have been explored. Examples of notable models include Dall-E by OpenAI [9], Imagen from Google [10], and the open source equivalent, SDM from StabilityAI [2]. These models have pushed the boundaries of image quality, both in realism and arguably in artistic ability. This has led to much debate about the professional, social, ethical, and legal considerations of the technology [1].
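To make the noising-and-denoising idea above concrete, the following is a toy, self-contained sketch in pixel space. It is an illustration only: actual LDMs operate on a learned latent representation, and the noise schedule here is an invented example rather than one from any of the models named above.

import numpy as np

# Toy illustration of forward diffusion: an image is progressively mixed
# with Gaussian noise until no visual information remains. A diffusion
# model is trained to predict and remove that noise, so running the
# process in reverse generates an image from pure noise.
rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=(32, 32, 3))  # stand-in for a real image

alphas = np.linspace(0.999, 0.95, num=50)     # illustrative noise schedule
alpha_bar = np.cumprod(alphas)                # cumulative product over steps

for t in range(len(alphas)):
    eps = rng.standard_normal(x0.shape)
    # Closed-form forward step (cf. Eq. (1) in Section III-B):
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# By the final timestep, the signal-to-noise ratio is near zero and x_t
# is statistically indistinguishable from Gaussian noise.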
The majority of research in the field is cutting-edge and
is presented as preprints and recent theses. In [11], researchers
proposed to train SDM on medical imaging data, achieving
higher-quality images that could potentially lead to increased
model abilities through data augmentations. It is worth
mentioning that in [12] and [13], diffusion models were found
to have the ability to generate audio and images. In 2021, the
results of Yi et al. [14] suggested that diffusion models were
highly capable of generating realistic artworks, fooling human
subjects into believing that the works were created by human
beings. Given this, researchers have noted that diffusion models have a promising capacity for co-creating with human artists [15].
DE-FAKE, proposed by Sha et al. [16], shows that images
generated by various latent diffusion approaches may contain
digital fingerprints to suggest they are synthetic. Although
visual glitches are increasingly rare given the advances in
model quality, it may be possible that computer vision
approaches will detect these attributes within images that the
human eye cannot. The Fourier transforms presented in [17]
show visual examples of these digital fingerprints.
When discussing the topic of vision, the results in [18]
suggest that optical flow techniques could detect synthetic
human faces within the FaceForensics dataset with around
81.61% accuracy. Extending to the temporal domain, [19]
proposes recurrence in AI-generated video recognition
achieving 97.1% accuracy over 80 frames due to minor visual
glitches at the pixel scale. In Wang et al. [20], EfficientNets and Vision Transformers are proposed within a system that can detect forged images by adversarial models at an F1 score of 0.88 and an AUC of 0.95, competing with the state of the art on the DeepFake Detection Challenge dataset while remaining efficient. In the aforementioned study, a Convolutional Neural Network was used to extract features, similarly to the approach proposed in this study, prior to processing using attention-based approaches. Similarly, convolutional and temporal techniques were proposed in [21] to achieve 66.26% to 91.21% accuracy on a mixed set of synthetic data detection datasets. Chrominance components CbCr within a digital image were noted in [22] as a promising route for the detection of minor pixel disparities that are sometimes present within synthetic images.

Human recognition of manipulation within images is waning as a direct result of image generation methods improving. A study by Nightingale et al. [23] in 2017 suggested that humans have difficulty recognising photographs that have been edited using image processing techniques. Since this study, there has been nearly five years of rapid development in the field.

Reviewing the relevant literature has highlighted rapid developments within AI-generated imagery and the challenges posed today with respect to its detection. Generative models have enabled the generation of high-fidelity, photorealistic images within a matter of seconds that humans often cannot distinguish from reality. This conclusion sets the stage for the studies presented by this work and argues the need to fill the knowledge gap when it comes to the availability of examples of synthetic data.
III. METHOD
This section describes the methods followed by this study in terms of their technical implementation and application for the detection of synthetic images. This section first describes the collection of the real data, and then the methodology followed to generate the synthetic equivalent for comparison. Sections III-A and III-B describe how 60,000 images are collected and 60,000 images are synthetically generated, respectively, forming the overall dataset of 120,000 images. Section III-C then describes the machine learning model engineering approach which aims to recognise the authenticity of the images, before Section III-D notes the approach for Explainable AI to interpret model predictions.

A. COLLECTION OF REAL DATA
The real photographs are sourced from the CIFAR-10 dataset [24], which is divided into 50,000 images for training and 10,000 for testing, i.e., a testing dataset of 16.6%. In this study, all images from the training dataset are used for the training of the positive class ‘‘REAL’’; therefore, 50,000 are used for training and 10,000 for testing. Samples of images within the CIFAR-10 dataset that form the ‘‘REAL’’ class can be observed in Figure 1.

B. SYNTHETIC DATA GENERATION
The synthetic images generated for this study use CompVis SD (https://huggingface.co/CompVis/stable-diffusion-v1-4), an open source LDM. The goal is to model the diffusion of image data through a latent space given a textual context. If noise, such as that of a Gaussian distribution, is iteratively added to an image, the image ultimately becomes noise and all prior visual information is lost. To generalise, the reverse of this process is, therefore, to generate a synthetic image from noise. The method of reverse diffusion can be put simply as: given an image x at timestep t, denoted x_t, output a prediction of x_{t-1} through the prediction of noise and its subsequent removal by classical means. A noisy image x_t is generated from the original image x_0 by the following:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \tag{1}$$

where $\varepsilon$ is sampled Gaussian noise and $\bar{\alpha}_t$ is the cumulative product of the noise schedule.
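For illustration, the following is a minimal sketch of how class-conditioned synthetic images could be produced with the CompVis checkpoint named above, assuming the Hugging Face diffusers library. The prompt wording, output layout, and resizing step are illustrative assumptions rather than the exact generation script used for CIFAKE.

import torch
from diffusers import StableDiffusionPipeline

# Load the open source LDM referenced in Section III-B.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

for label in CIFAR10_CLASSES:
    # One simple prompt per CIFAR-10 class; CIFAKE contains 6,000
    # synthetic images per class.
    image = pipe(f"a photograph of a {label}").images[0]
    image = image.resize((32, 32))   # match the CIFAR-10 resolution
    image.save(f"FAKE_{label}.png")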
C. IMAGE CLASSIFICATION
Image classification is an algorithm that predicts a class label given an input image. The learnt features are extracted from the image and processed in order to provide an output, in this case, whether the image is real or synthetic. This subsection describes the selected approach to classification.

In this study, the Convolutional Neural Network (CNN) [26], [27], [28] is employed to learn from the input images.
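As a sketch of the kind of network described here, the following (assuming TensorFlow/Keras) combines the lowest-loss feature extractor reported in Section B of the results (two layers of 32 filters) with the highest-F1 dense configuration (a single layer of 64 rectified linear units). Kernel sizes, pooling, and pixel scaling are assumptions where the text does not specify them.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),            # CIFAR-sized RGB input
    layers.Rescaling(1.0 / 255),                # scale pixels to [0, 1]
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),        # single dense layer
    layers.Dense(1, activation="sigmoid"),      # REAL vs FAKE output
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])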
FIGURE 1. Examples of images from the CIFAR-10 image classification dataset [24].

FIGURE 2. Examples of AI-generated images within the dataset contributed by this study, selected at random with regard to their real CIFAR-10 equivalent labels.

FIGURE 3. Examples of visual defects found within the synthetic image dataset.

TABLE 4. Observed validation precision for the filters within the convolutional neural network.
A. DATASET EXPLORATION
Random samples of images used in this study and within the dataset provided can be observed in Figure 2. Five images are presented for each class label, and all of the images within this figure are synthetic, having been generated by the SDM. Note within this sample that the images are high-quality and, for the most part, seem to be difficult to discern as synthetic by the human eye. Synthetic photographs are representative of their counterparts from reality and feature complex attributes such as depth of field, reflections, and motion blur.

It can also be observed that there are visual imperfections within some of the images. Figure 3 shows a number of examples within the dataset in which the model has output images with visual glitches. Given that the LAION dataset provides physical descriptions of the image content, little to no information on text is provided, and thus it can be seen that the model produces shapes similar to alphabetic characters. Also observed here is a lack of important detail, such as the case of a jet aircraft that has no cockpit window. It seems that this image has been produced by combining the knowledge of jet aircraft (in particular, the engines) along with the concept of an Unmanned Aerial Vehicle's chassis. Finally, there are also some cases of anatomical errors for living creatures, seen in these examples through the cat's limbs and eyes.

Complex visual concepts are present within much of the dataset, with examples shown in Figure 4. Observe that the ripples in the water and reflections of the entities are highly realistic and match what would be expected within a photograph. In addition to complex lighting, there is also evidence of depth of field and photographic framing.

B. CLASSIFICATION RESULTS
In this subsection, we present the results of the computer vision experiments for image classification. The problem faced by the CNN is that of binary classification: whether the image is a real photograph or the output of an LDM.

The validation accuracy and loss metrics for the feature extractors can be found in Tables 2 and 3, respectively. All feature extractors scored relatively well without the need for dense layers to process feature maps, with an average classification accuracy of 91.79%. The lowest-loss feature extractor was found to use two layers of 32 filters, which led to an overall classification accuracy of 92.93% and a binary cross-entropy loss of 0.18. The highest-accuracy model, two layers of 128 filters, scored 92.98% with a loss of 0.221.

Extended validation metrics are presented in Tables 4, 5, and 6, which detail validation precision, recall, and F1 scores, respectively. The F1 score, which is a unification of precision and recall, had a mean value of 0.929, with the highest being 0.936. A small standard deviation of 0.003 was observed.

TABLE 6. Observed validation F1-Score for the filters within the convolutional neural network.

Following these experiments, the lowest-loss feature extractor is selected for further engineering of the network topology. This was the model that had two layers of 32 convolutional filters.
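A minimal sketch of how this selected configuration might then be trained and validated on the dataset follows. The directory layout (REAL/FAKE class folders), batch size, and epoch count are assumptions for illustration, not the paper's reported setup.

import tensorflow as tf

# Assumes CIFAKE images stored as cifake/train/{REAL,FAKE} and
# cifake/test/{REAL,FAKE}; 'model' is the CNN sketched in Section III-C.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "cifake/train", image_size=(32, 32), batch_size=64, label_mode="binary")
test_ds = tf.keras.utils.image_dataset_from_directory(
    "cifake/test", image_size=(32, 32), batch_size=64, label_mode="binary")

history = model.fit(train_ds, validation_data=test_ds, epochs=20)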
FIGURE 4. A selection of AI-generated images within the dataset. Examples of complex visual attributes
generated by the latent diffusion model that include realistic water and reflections.
TABLE 7. Observed validation accuracy for the dense layers within the convolutional neural network.

TABLE 9. Observed validation precision for the dense layers within the convolutional neural network.
The results of the general network engineering are presented in Tables 7 and 8, which contain the validation accuracy and loss, respectively. The lowest loss observed was 0.177 binary cross-entropy, when the CNN was followed by three layers of 64 rectified linear units. The highest accuracy, on the other hand, was 93.55%, which was achieved by implementing a single layer of 64 neurons.

Additional validation metrics for precision, recall, and F1 score are also provided in Tables 9, 10, and 11, respectively. Similarly to the prior experiments, the standard deviation of F1 scores was relatively low at 0.003. The highest F1 score was achieved by the network that used a single dense layer of 64 rectified linear units, with a value of 0.936. This highest-F1 model is graphically detailed in Figure 5 to provide a visual example of the network topology.
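For reference, the F1 score reported in these tables is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

so a value of 0.936 indicates that precision and recall are both high and closely balanced.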
TABLE 11. Observed validation F1-Score for the dense layers within the
convolutional neural network.
FIGURE 5. An example of one of the final model architectures following hyperparameter search for the classification of
real or AI-generated images.
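As noted in the abstract, Gradient Class Activation Mapping (Grad-CAM) is then applied to interpret predictions such as those of the model in Figure 5. A minimal sketch of the technique follows, assuming TensorFlow/Keras and the CNN sketched earlier; the layer name is a placeholder, not one taken from the paper.

import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    # Map the input to both the chosen conv layer's feature maps and
    # the model's sigmoid output.
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, prediction = grad_model(image[np.newaxis, ...])
        score = prediction[:, 0]               # predicted class score
    grads = tape.gradient(score, conv_maps)
    # Weight each feature map by its spatially averaged gradient.
    weights = tf.reduce_mean(grads, axis=(1, 2))
    cam = tf.reduce_sum(conv_maps[0] * weights[0], axis=-1)
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()  # low-resolution heatmap; upsample for overlay

# Example usage (hypothetical layer name):
# heatmap = grad_cam(model, image, "conv2d_1")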