CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images
ABSTRACT Recent advances in synthetic data have enabled the generation of images with such high quality
that human beings cannot distinguish the difference between real-life photographs and Artificial Intelligence
(AI) generated images. Given the critical necessity of data reliability and authentication, this article proposes
to enhance our ability to recognise AI-generated images through computer vision. Initially, a synthetic dataset
is generated that mirrors the ten classes of the already available CIFAR-10 dataset with latent diffusion,
providing a contrasting set of images for comparison to real photographs. The model is capable of generating
complex visual attributes, such as photorealistic reflections in water. The two sets of data present as a binary
classification problem with regard to whether the photograph is real or generated by AI. This study then
proposes the use of a Convolutional Neural Network (CNN) to classify the images into two categories:
Real or Fake. Following hyperparameter tuning and the training of 36 individual network topologies, the
optimal approach could correctly classify the images with 92.98% accuracy. Finally, this study implements
explainable AI via Gradient Class Activation Mapping to explore which features within the images are useful
for classification. Interpretation reveals interesting concepts within the image, in particular, noting that the
actual entity itself does not hold useful information for classification; instead, the model focuses on small
visual imperfections in the background of the images. The complete dataset engineered for this study, referred
to as the CIFAKE dataset, is made publicly available to the research community for future work.
INDEX TERMS AI-generated images, generative AI, image classification, latent diffusion.
epistemological reality is that there are serious questions surrounding the reliability of human knowledge and the ethical implications that surround the misuse of these types of technology. The implications suggest that we are in growing need of a system that can aid us in the recognition of real images versus those generated by AI.

This study explores the potential of using computer vision to address our newfound inability to recognise the difference between real photographs and those that are AI-generated. Given that there are many years' worth of photographic datasets available for image classification, these provide examples of real images for a model. Following the generation of a synthetic equivalent to such data, we will then explore the output of the model before finally implementing methods of differentiation between the two types of image.

There are several scientific contributions with multidisciplinary and social implications that arise from this study. First, a dataset, called CIFAKE, is generated with latent diffusion and released to the research community. The CIFAKE dataset provides a contrasting set of real and fake photographs and contains 120,000 images: 60,000 images from the existing CIFAR-10 dataset (a collection of images commonly used to train machine learning and computer vision algorithms, available from https://www.cs.toronto.edu/~kriz/cifar.html) and 60,000 images generated for this study, making it a valuable resource for researchers in the field. Second, this study proposes a method to improve our waning human ability to recognise AI-generated images through computer vision, using the CIFAKE dataset for classification. Finally, this study proposes the use of Explainable AI (XAI) to further our understanding of the complex processes involved in synthetic image recognition, as well as visualisation of the important features within those images. These scientific contributions provide important steps forward in addressing the challenges posed by the rapid development of modern technology and have important implications for ensuring the authenticity and trustworthiness of data.

The remainder of this article is as follows: the state-of-the-art research background is initially explored in Section II with a discussion of relevant related studies in the field. Following this, the methodology followed by this study is detailed in Section III, which provides the technical implementation and the method followed for the binary classification of real versus AI-generated imagery. The results of these experiments are presented with discussion in Section IV before this work is finally concluded and future work is proposed in Section V.

II. BACKGROUND
The ability to distinguish between real images and those generated by machine learning models is important for a number of reasons. Identification of real data provides confirmation of the authenticity and originality of the image; for example, a fine-tuned Stable Diffusion Model (SDM) could be used to generate a synthetic photograph of an individual committing a crime or vice versa, providing false evidence of an alibi for a person who was, in reality, otherwise elsewhere. Misinformation and fake news are a significant modern problem, and machine-generated images could be used to manipulate public opinion [3], [4]. Situations where synthetic imagery is used in fake news can promote its false credibility and have serious consequences [5]. Cybersecurity is another major concern, with research noting that synthetically generated human faces can be used in false acceptance attacks and have the potential to gain unauthorised access to digital systems [6], [7]. In [8], it was observed that synthetically generated signatures could overcome signature verification systems with ease.

Latent Diffusion Models (LDMs) are a new approach for generating images, which use attention mechanisms and a U-Net to reverse the process of Gaussian noising and, ultimately, use text conditioning to generate novel images from random noise. Details on the methodological implementation of LDMs can be found in Section III. The approach is rapidly developing but young; the literature on the subject is currently scarce and few applications have been explored. Examples of notable models include Dall-E by OpenAI [9], Imagen from Google [10], and the open source equivalent, SDM from StabilityAI [2]. These models have pushed the boundaries of image quality, both in realism and arguably in artistic ability. This has led to much debate about the professional, social, ethical, and legal considerations of the technology [1].

The majority of research in the field is cutting-edge and is presented as preprints and recent theses. In [11], researchers proposed to train SDM on medical imaging data, achieving higher-quality images that could potentially lead to increased model abilities through data augmentation. It is worth mentioning that in [12] and [13], diffusion models were found to have the ability to generate audio and images. In 2021, the results of Yi et al. [14] suggested that diffusion models were highly capable of generating realistic artworks, fooling human subjects into believing that the works were created by human beings. Given this, researchers have noted that diffusion models have a promising capacity for co-creating with human artists [15].

DE-FAKE, proposed by Sha et al. [16], shows that images generated by various latent diffusion approaches may contain digital fingerprints that suggest they are synthetic. Although visual glitches are increasingly rare given the advances in model quality, it may be possible that computer vision approaches will detect these attributes within images where the human eye cannot. The Fourier transforms presented in [17] show visual examples of these digital fingerprints.

When discussing the topic of vision, the results in [18] suggest that optical flow techniques could detect synthetic human faces within the FaceForensics dataset with around 81.61% accuracy. Extending to the temporal domain, [19] proposes recurrence in AI-generated video recognition, achieving 97.1% accuracy over 80 frames due to minor visual glitches at the pixel scale. In Wang et al. [20],
EfficientNets and Vision Transformers are proposed within a system that can detect forged images by adversarial models at an F1 score of 0.88 and an AUC of 0.95, competing with the state of the art on the DeepFake Detection Challenge dataset while remaining efficient. In the aforementioned study, a Convolutional Neural Network was used to extract features, similarly to the approach proposed in this study, prior to processing using attention-based approaches. Similarly, convolutional and temporal techniques were proposed in [21] to achieve 66.26% to 91.21% accuracy on a mixed set of synthetic data detection datasets. Chrominance components CbCr within a digital image were noted in [22] as a promising route for the detection of minor pixel disparities that are sometimes present within synthetic images.

Human recognition of manipulation within images is waning as a direct result of image generation methods improving. A study by Nightingale et al. [23] in 2017 suggested that humans have difficulty recognising photographs that have been edited using image processing techniques. Since that study, there have been nearly five years of rapid development in the field.

Reviewing the relevant literature has highlighted rapid developments within AI-generated imagery and the challenges posed today with respect to its detection. Generative models have enabled the generation of high-fidelity, photorealistic images within a matter of seconds, which humans often cannot distinguish from reality. This conclusion sets the stage for the studies presented by this work and argues the need to fill the knowledge gap when it comes to the availability of examples of synthetic data.

III. METHOD
This section describes the methods followed by this study in terms of their technical implementation and application for the detection of synthetic images. This section first describes the collection of the real data, and then the methodology followed to generate the synthetic equivalent for comparison. Sections III-A and III-B will describe how 60,000 images are collected and 60,000 images are synthetically generated, respectively. This forms the overall dataset of 120,000 images. Section III-C will then describe the machine learning model engineering approach which aims to recognise the authenticity of the images, before Section III-D notes the approach for Explainable AI to interpret model predictions.

A. REAL DATA COLLECTION
For the class label ''REAL'', interpreted as a positive class value ''1'', data is collected from the CIFAR-10 dataset [24]. It is a dataset of 60,000 32 × 32 RGB images of real subjects divided into ten classes. Classes within the dataset are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. There are 6,000 images per class. For each class, 5,000 images are used for training and 1,000 for testing, i.e., a testing dataset of 16.6%. In this study, all images from the training dataset are used for the training of the positive class ''REAL''. Therefore, 50,000 images are used for training and 10,000 for testing.

Samples of images within the CIFAR-10 dataset that form the ''REAL'' class can be observed in Figure 1.

B. SYNTHETIC DATA GENERATION
The synthetic images generated for this study use CompVis SD (https://huggingface.co/CompVis/stable-diffusion-v1-4), an open source LDM. The goal is to model the diffusion of image data through a latent space given a textual context. If noise, such as that of a Gaussian distribution, is iteratively added to an image, the image ultimately becomes noise and all prior visual information is lost. The reverse of this process is, therefore, to generate a synthetic image from noise. The method of reverse diffusion can be put simply as, given an image x_t at timestep t, output the prediction of x_{t-1} through the prediction of the noise and its subsequent removal by classical means.

A noisy image x_t is generated from the original x_0 by the following:

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon,   (1)

where the noise is \varepsilon and the adjustment according to the timestep t is \bar{\alpha}_t. The method of this study is to make use of the reverse process over 50 noising steps, which from x_50 will ultimately form x_0, the synthetic image. The neural network \varepsilon_\theta thus minimises the following loss function:

Loss = \mathbb{E}_{t, x_0, \varepsilon}\left[ \lVert \varepsilon - \varepsilon_\theta(x_t, t) \rVert^2 \right].   (2)

Further technical details on the approach can be obtained from [2]. The model chosen for this approach is Stable Diffusion 1.4, which is trained on the LAION2B-en, LAION-high-resolution, and LAION-aesthetics v2 5+ datasets (https://laion.ai/blog/laion-5b/). The aforementioned datasets are a cleaned subset of the original LAION-5B dataset [25], which contains 5.85 billion text-image pairs.

SDM is used to generate a synthetic equivalent to the CIFAR-10 dataset, which contains 6,000 images for each of its ten classes. The classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Following observations from the CIFAR-10 dataset, this study implements prompt modifiers to increase the diversity of the synthetic dataset, which can be observed in Table 1. As in the real dataset, 50,000 images are used for training data and 10,000 for testing data, provided with a class label to indicate that the image is not real.
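To make the generation procedure concrete, the following is a minimal sketch (not the authors' original script) of how CIFAKE-style synthetic images could be produced with the Hugging Face diffusers library, assuming the CompVis/stable-diffusion-v1-4 checkpoint and the 50-step reverse process described above; the prompt template follows Table 1, but the modifier and output path shown are illustrative placeholders.

```python
# Illustrative sketch: generating CIFAKE-style synthetic images with Stable Diffusion 1.4.
# Assumes the diffusers and torch packages; the modifier and file path are hypothetical,
# not the exact prompt modifiers listed in Table 1.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

def generate(label, modifier=""):
    """Generate one synthetic image for a CIFAR-10 class label."""
    article = "an" if label[0] in "aeiou" else "a"          # the "{a/an}" of the template
    prompt = f"a photograph of {article} {label} {modifier}".strip()
    image = pipe(prompt, num_inference_steps=50).images[0]  # 50-step reverse diffusion
    return image.resize((32, 32))                           # downscale to CIFAR-10 resolution

generate("horse", modifier="in a field").save("FAKE/horse_0000.png")  # hypothetical example
```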
FIGURE 1. Examples of images from the CIFAR-10 image classification dataset [24].
TABLE 1. Latent diffusion prompt modifiers for generating the 10-class synthetic dataset. All prompts are preceded by ''a photograph of {a/an}'' and modifiers are used equally for the 6,000 images.

C. IMAGE CLASSIFICATION
Image classification is an algorithm that predicts a class label given an input image. The learnt features are extracted from the image and processed in order to provide an output, in this case, whether the image is real or synthetic. This subsection describes the selected approach to classification.

In this study, the Convolutional Neural Network (CNN) [26], [27], [28] is employed to learn from the input images. It is the concatenation of two main networks with intermediate operations: the convolutional layers and the fully connected layers. The initial convolutional network within the overall model can be operationally generalised, for an image x and a filter matrix w, as follows:

(x * w)(i, j) = \sum_{m=1}^{M} \sum_{n=1}^{N} x(i + m - 1, j + n - 1)\, w(m, n),   (3)

where (i, j) is the location in the output feature map and (m, n) represents the location within the filter w. The output is derived by applying convolutional operations to the input x with each of the (learnable) filters and applying an activation function f, which, in the context of this study, is the Rectified Linear Unit (ReLU) f(x) = max(0, x).

For an image of (height, width) dimensions and a filter kernel of (height_kernel, width_kernel), with a stride of 1 and no padding for simplicity, the output has dimensions:

(height - height_kernel + 1, width - width_kernel + 1).   (4)

Then, a pooling operation is performed to reduce the spatial dimensions, and the output is flattened so it can be entered into densely connected layers. For L = HWD (the product of the height, width, and depth of the feature maps), the flattened one-dimensional vector is simply x = [x_1, x_2, ..., x_L]. The output vector y is ultimately the output of the dense layer(s) as y = f(Wx + b), for the weight matrix W and the bias b. The activation function f in this study, as in the CNN, is the ReLU activation function f(x) = max(0, x).

The goal of the network in this study is to classify whether the image is a real photograph or an image generated by an LDM, and it is thus a problem of binary classification. Therefore, the output of the network is a single neuron with the S-shaped Sigmoid activation function:

\sigma(x) = \frac{1}{1 + e^{-x}}.   (5)

The ''FAKE'' class is 0 and the ''REAL'' class is 1; therefore, a value closer to either of the two values represents the likelihood of that class. Although this aids learning because it is differentiable, the values are rounded to the closest class value for inference.

Although the goal of the network is to use backpropagation to reduce the binary cross-entropy loss, this study also notes an extended number of classification metrics. These are the Precision, which is a measure of how many of the predicted positive cases are truly positive, a metric which allows for the analysis of false positives:

Precision = \frac{\text{True positives}}{\text{True positives} + \text{False positives}};   (6)

the Recall, which is a measure of how many positive cases are correctly predicted, and which enables the analysis of false-negative predictions:

Recall = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}.   (7)

This measure is particularly important in this case, as it is in fraud detection, since a false negative would falsely accuse the author of generating their image with AI. Finally, the F1 score is considered:

F1 score = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},   (8)

which is a unified metric of precision and recall.

The dataset that forms the classification problem is the collection of real images and the equivalent synthetic images generated, detailed in Sections III-A and III-B, respectively. 100,000 images are used for training (50,000 real images and 50,000 synthetic images), and 20,000 are used for testing (10,000 real and 10,000 synthetic).
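As a concrete illustration of the classifier just described, the following is a minimal sketch of one candidate topology from the search space, written with Keras (an assumed framework; the paper does not state its implementation, and the kernel sizes, pooling, and optimiser shown here are also assumptions): a two-layer convolutional feature extractor, a single dense layer of rectified linear units, and a single Sigmoid output neuron trained on binary cross-entropy with the precision and recall metrics defined above.

```python
# Illustrative sketch of one candidate CIFAKE classifier topology (assumed Keras code;
# 3x3 kernels, 2x2 max pooling, and the Adam optimiser are assumptions).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(conv_filters=32, conv_layers=2, dense_units=64):
    model = models.Sequential([layers.Input(shape=(32, 32, 3))])  # CIFAR-10-sized RGB input
    for _ in range(conv_layers):
        model.add(layers.Conv2D(conv_filters, (3, 3), activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))  # 0 = "FAKE", 1 = "REAL"
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy",
                           tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall()])
    return model

model = build_classifier()  # two layers of 32 filters, one dense layer of 64 units
model.summary()
```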
FIGURE 2. Examples of AI-generated images within the dataset contributed by this study, selected at random with regard to their real CIFAR-10 equivalent labels.
Initially, CNN architectures are benchmarked as a lone feature extractor. That is, filters of {16, 32, 64, 128} are benchmarked in layers of {1, 2, 3}, flattened, and connected directly to the output neurone. The topology of the highest performing feature extractor is then used to compare the highest performing dense network featuring {32, 64, 128, 256, ..., 4096} rectified linear units in layers of {1, 2, 3}. These 36 artificial neural networks are then compared with regard to classification metrics to derive the topology that performs best.

D. EXPLAINABLE AI
For the interpretation of model predictions, this study uses Gradient Class Activation Mapping (Grad-CAM), which weights the feature maps A^k of the final convolutional layer as

L^c_{\text{Grad-CAM}} = \text{ReLU}\left( \sum_k \alpha_k^c A^k \right),

where \alpha_k^c is the global average pooling \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{i,j}} over the Z spatial locations, and \frac{\partial y^c}{\partial A^k_{i,j}} are the gradients of the model. The approach is used for interpretation in the final step of this study, given random data selected from the two classes. Due to the nature of heatmapping, the results of the algorithm are visually interpreted with discussion.
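The Grad-CAM interpretation step could be sketched as follows (an assumed TensorFlow implementation, not the authors' code), computing the gradient of the single Sigmoid output with respect to the final convolutional feature maps and weighting those maps by their pooled gradients to produce the heatmap:

```python
# Illustrative Grad-CAM sketch (assumed TensorFlow code) for the binary REAL/FAKE model.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    """Heatmap over the named conv layer for one 32x32x3 float image in [0, 1]."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_maps, prediction = grad_model(image[np.newaxis, ...])
        score = prediction[:, 0]                     # the single Sigmoid output y^c
    grads = tape.gradient(score, conv_maps)          # dy^c / dA^k_{i,j}
    alphas = tf.reduce_mean(grads, axis=(1, 2))      # global average pooling over i, j
    heatmap = tf.nn.relu(
        tf.reduce_sum(alphas[:, None, None, :] * conv_maps, axis=-1)
    )[0]                                             # keep positive contributions only
    return (heatmap / (tf.reduce_max(heatmap) + 1e-8)).numpy()
```

Brighter regions of the returned heatmap correspond to the areas that contributed most strongly to the prediction, as visualised later in Figure 6.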
TABLE 3. Observed validation loss for the filters within the convolutional
neural network.
IV. RESULTS
A. DATASET EXPLORATION
Random samples of the images used in this study and within the dataset provided can be observed in Figure 2. Five images are presented for each class label, and all of the images within this figure are synthetic, having been generated by the SDM. Note within this sample that the images are high-quality and, for the most part, seem difficult to discern as synthetic by the human eye. The synthetic photographs are representative of their counterparts from reality and feature complex attributes such as depth of field, reflections, and motion blur.

It can also be observed that there are visual imperfections within some of the images. Figure 3 shows a number of examples within the dataset in which the model has output images with visual glitches. Given that the LAION dataset provides physical descriptions of the image content, little to no information on text is provided, and thus it can be seen that the model produces shapes similar to alphabetic characters. Also observed here is a lack of important detail, such as the case of a jet aircraft that has no cockpit window. It seems that this image has been produced by combining the knowledge of jet aircraft (in particular, the engines) with the concept of an Unmanned Aerial Vehicle's chassis. Finally, there are also some cases of anatomical errors for living creatures, seen in these examples through the cat's limbs and eyes.

Complex visual concepts are present within much of the dataset, with examples shown in Figure 4. Observe that the ripples in the water and the reflections of the entities are highly realistic and match what would be expected within a photograph. In addition to complex lighting, there is also evidence of depth of field and photographic framing.

B. CLASSIFICATION RESULTS
In this subsection, we present the results of the computer vision experiments for image classification. The problem faced by the CNN is that of binary classification: whether the image is a real photograph or the output of an LDM.

The validation accuracy and loss metrics for the feature extractors can be found in Tables 2 and 3, respectively. All feature extractors scored relatively well without the need for dense layers to process the feature maps, with an average classification accuracy of 91.79%. The lowest-loss feature extractor was found to use two layers of 32 filters, which led to an overall classification accuracy of 92.93% and a binary cross-entropy loss of 0.18. The highest-accuracy model, two layers of 128 filters, scored 92.98% with a loss of 0.221.

TABLE 6. Observed validation F1-Score for the filters within the convolutional neural network.

Extended validation metrics are presented in Tables 4, 5, and 6, which detail validation precision, recall, and F1 scores, respectively. The F1 score, which is a unification of precision and recall, had a mean value of 0.929, with the highest being 0.936. A small standard deviation of 0.003 was observed.

Following these experiments, the lowest-loss feature extractor is selected for further engineering of the network topology. This was the model that had two layers of 32 convolutional filters.
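The feature-extractor benchmark summarised above amounts to a simple grid search over filter counts and layer depths. The sketch below is an assumed, self-contained Keras illustration of that search (kernel sizes, pooling, optimiser, and epoch count are assumptions; train_ds and val_ds stand for the CIFAKE training and testing splits), retaining the configuration with the lowest validation loss.

```python
# Illustrative sketch of the lone feature-extractor grid search (assumed Keras code).
# Filters {16, 32, 64, 128} in {1, 2, 3} layers, flattened directly to the output neuron.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_extractor(filters, n_layers):
    model = models.Sequential([layers.Input(shape=(32, 32, 3))])
    for _ in range(n_layers):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

def search(train_ds, val_ds, epochs=10):
    """train_ds / val_ds are assumed tf.data.Dataset objects over the CIFAKE splits."""
    best = None
    for filters in (16, 32, 64, 128):
        for n_layers in (1, 2, 3):
            model = build_extractor(filters, n_layers)
            history = model.fit(train_ds, validation_data=val_ds,
                                epochs=epochs, verbose=0)
            val_loss = min(history.history["val_loss"])
            if best is None or val_loss < best[0]:
                best = (val_loss, filters, n_layers)
    return best  # the paper reports two layers of 32 filters as the lowest-loss extractor
```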
FIGURE 4. A selection of AI-generated images within the dataset. Examples of complex visual attributes
generated by the latent diffusion model that include realistic water and reflections.
TABLE 7. Observed validation accuracy for the dense layers within the convolutional neural network.

TABLE 8. Observed validation loss for the dense layers within the convolutional neural network.

TABLE 9. Observed validation precision for the dense layers within the convolutional neural network.

TABLE 10. Observed validation recall for the dense layers within the convolutional neural network.
TABLE 11. Observed validation F1-Score for the dense layers within the convolutional neural network.

The results of the general network engineering are presented in Tables 7 and 8, which contain the validation accuracy and loss, respectively. The lowest loss observed was 0.177 binary cross-entropy, when the CNN was followed by three layers of 64 rectified linear units. The highest accuracy, on the other hand, was 93.55%, which was achieved by implementing a single layer of 64 neurons.

Additional validation metrics for precision, recall, and F1 score are also provided in Tables 9, 10, and 11, respectively. Similarly to the prior experiments, the standard deviation of F1 scores was relatively low at 0.003. The highest F1 score was achieved by the network that used a single dense layer of 64 rectified linear units, with a value of 0.936. This highest F1 score model is graphically detailed in Figure 5 to provide a visual example of the network topology.

Figure 6 shows examples of the interpretation of predictions via Grad-CAM. Brighter pixels in the image represent
FIGURE 5. An example of one of the final model architectures following hyperparameter search for the classification of
real or AI-generated images.