CIFAKE Image Classification and Explainable Identification of AI-Generated Synthetic Images
ABSTRACT Recent advances in synthetic data have enabled the generation of images with such high
quality that human beings cannot tell the difference between real-life photographs and Artificial Intelligence
(AI) generated images. Given the critical necessity of data reliability and authentication, this article proposes
to enhance our ability to recognise AI-generated images through computer vision. Initially, a synthetic
dataset is generated that mirrors the ten classes of the already available CIFAR-10 dataset with latent
diffusion, providing a contrasting set of images for comparison to real photographs. The model is capable
of generating complex visual attributes, such as photorealistic reflections in water. The two sets of data
present as a binary classification problem with regard to whether the photograph is real or generated by
AI. This study then proposes the use of a Convolutional Neural Network (CNN) to classify the images into
two categories: Real or Fake. Following hyperparameter tuning and the training of 36 individual network
topologies, the optimal approach could correctly classify the images with 92.98% accuracy. Finally, this
study implements explainable AI via Gradient Class Activation Mapping to explore which features within
the images are useful for classification. Interpretation reveals interesting concepts within the image, in
particular, noting that the actual entity itself does not hold useful information for classification; instead,
the model focuses on small visual imperfections in the background of the images. The complete dataset
engineered for this study, referred to as the CIFAKE dataset, is made publicly available to the research
community for future work.
INDEX TERMS AI-generated Images, Generative AI, Image Classification, Latent Diffusion
mation, what is real and what is not? The epistemological reality is that there are serious questions surrounding the reliability of human knowledge and the ethical implications that surround the misuse of these types of technology. The implications suggest that we are in growing need of a system that can aid us in the recognition of real images versus those generated by AI.

This study explores the potential of using computer vision to address our newfound inability to recognise the difference between real photographs and those that are AI-generated. Given that there are many years' worth of photographic datasets available for image classification, these provide examples for a model of real images. Following the generation of a synthetic equivalent to such data, we will then explore the outputs of the model before finally implementing methods of differentiation between the two types of image.

There are several scientific contributions with multidisciplinary and social implications that arise from this study. First, a dataset, called CIFAKE, is generated with latent diffusion and released to the research community. The CIFAKE dataset provides a contrasting set of real and fake photographs and contains 120,000 images: 60,000 images from the existing CIFAR-10 dataset (a collection of images commonly used to train machine learning and computer vision algorithms, available from https://www.cs.toronto.edu/~kriz/cifar.html) and 60,000 images generated for this study, making it a valuable resource for researchers in the field. Second, this study proposes a method to improve our waning human ability to recognise AI-generated images through computer vision, using the CIFAKE dataset for classification. Finally, this study proposes the use of Explainable AI (XAI) to further our understanding of the complex processes involved in synthetic image recognition, as well as visualisation of the important features within those images. These scientific contributions provide important steps forward in addressing the modern challenges posed by rapid developments in technology, and have important implications for ensuring the authenticity and trustworthiness of data.

The remainder of this article is as follows: the state-of-the-art research background is initially explored in Section II with a discussion of relevant related studies in the field. Following this, the methodology followed by this study is detailed in Section III, which provides the technical implementation and the method followed for binary classification of real versus AI-generated imagery. The results of these experiments are presented with discussion in Section IV before this work is finally concluded and future work is proposed in Section V.

II. BACKGROUND

The ability to distinguish between real imagery and imagery generated by machine learning models is important for a number of reasons. Identification of real data provides confirmation of the authenticity and originality of the image; for example, a fine-tuned Stable Diffusion Model (SDM) could be used to generate a synthetic photograph of an individual committing a crime or, vice versa, to provide false evidence of an alibi for a person who was, in reality, elsewhere. Misinformation and fake news are a significant modern problem, and machine-generated images could be used to manipulate public opinion [3], [4]. Situations where synthetic imagery is used in fake news can promote its false credibility and have serious consequences [5]. Cybersecurity is another major concern, with research noting that synthetically generated human faces can be used in false acceptance attacks and have the potential to gain unauthorised access to digital systems [6], [7]. In [8], it was observed that synthetically generated signatures could overcome signature verification systems with ease.

Latent Diffusion Models (LDMs) are a new approach to generating images which use attention mechanisms and a U-Net to reverse a process of Gaussian noising and, ultimately, use text conditioning to generate novel images from random noise. Details on the methodological implementation of LDMs can be found in Section III. The approach is rapidly developing but young; the literature on the subject is currently scarce and few applications have been explored. Examples of notable models include DALL-E by OpenAI [9], Imagen from Google [10], and the open-source equivalent, SDM from StabilityAI [2]. These models have pushed the boundaries of image quality, both in realism and, arguably, artistic ability. This has led to much debate on the professional, social, ethical, and legal considerations of the technology [1].

The majority of research in the field is cutting-edge and is presented as preprints and recent theses. In [11], researchers proposed to train SDM on medical imaging data, achieving higher-quality images that could potentially lead to increased model abilities through data augmentation. It is worth mentioning that in [12], [13], diffusion models were found to have the ability to generate audio as well as images. In 2021, the results of Yi et al. [14] suggested that diffusion models were highly capable of generating realistic artworks, fooling human subjects into believing that the works were created by human beings. Given this, researchers have noted that diffusion models have a promising capacity for co-creation with human artists [15].

DE-FAKE, proposed by Sha et al. [16], shows that images generated by various latent diffusion approaches may contain digital fingerprints which suggest that they are synthetic. Although visual glitches are increasingly rare given the advances in model quality, it may be possible that computer vision approaches can detect these attributes within images where the human eye cannot. The Fourier transforms presented in [17] show visual examples of these digital fingerprints.

When discussing the topic of vision, the results in [18] suggest that optical flow techniques could detect synthetic human faces within the FaceForensics dataset with around 81.61% accuracy. Extending to the temporal domain, [19] proposes recurrence in AI-generated video recognition,
achieving 97.1% accuracy over 80 frames due to minor visual glitches at the pixel scale. In Wang et al. [20], EfficientNets and Vision Transformers are proposed within a system that can detect images forged by adversarial models at an F1 score of 0.88 and an AUC of 0.95, competing with the state of the art on the DeepFake Detection Challenge dataset while remaining efficient. In the aforementioned study, a Convolutional Neural Network was used to extract features, similarly to the approach proposed in this study, prior to processing with attention-based approaches.

Similarly, convolutional and temporal techniques were proposed in [21] to achieve 66.26% to 91.21% accuracy on a mixed set of synthetic data detection datasets. The chrominance components CbCr of a digital image were noted in [22] as a promising route for the detection of minor pixel disparities that are sometimes present within synthetic images.

Human recognition of manipulation within images is waning as a direct result of image generation methods improving. A study by Nightingale et al. [23] in 2017 suggested that humans have difficulty recognising photographs that have been edited by image-processing techniques, and there has since been nearly five years of rapid development in the field.

Reviewing the relevant literature has highlighted rapid developments within AI-generated imagery and the challenges posed today with respect to its detection. Generative models have enabled the generation of high-fidelity, photorealistic images within a matter of seconds that humans often cannot distinguish from reality. This conclusion sets the stage for the studies presented by this work and argues the need to fill the knowledge gap when it comes to the availability of examples of synthetic data.

III. METHOD

This section describes the methods followed by this study in terms of their technical implementation and application for the detection of synthetic images. It first describes the collection of the real data and then the methodology followed to generate the synthetic equivalent for comparison. Sections III-A and III-B describe how 60,000 images are collected and 60,000 images are synthetically generated, respectively, forming the overall dataset of 120,000 images. Section III-C then describes the machine learning model engineering approach, which aims to recognise the authenticity of the images, before Section III-D notes the approach for Explainable AI to interpret model predictions.

A. REAL DATA COLLECTION

For the class label "REAL", interpreted as the positive class value "1", data is collected from the CIFAR-10 dataset [24]. It is a dataset of 60,000 32x32 RGB images of real subjects divided into ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. There are 6,000 images per class. For each class, 5,000 images are used for training and 1,000 for testing, i.e. a testing split of 16.6%. Within this study, all images from the training dataset are used for the training of the positive class "REAL"; therefore, 50,000 images are used for training and 10,000 for testing.

Samples of images within the CIFAR-10 dataset that form the "REAL" class can be observed in Figure 1.

B. SYNTHETIC DATA GENERATION

The synthetic images generated for this study use CompVis SD (https://huggingface.co/CompVis/stable-diffusion-v1-4), an open-source LDM. The goal is to model the diffusion of image data through a latent space given a textual context. If noise, such as that of a Gaussian distribution, is iteratively added to an image, the image ultimately becomes noise and all prior visual information is lost. To generalise, the reverse of this process is therefore to generate a synthetic image from noise. The method of reverse diffusion can be put simply as: given an image x at timestep t, x_t, output the prediction of x_{t-1} through the prediction of the noise and its subsequent removal by classical means.

A noisy image x_t is generated from the original x_0 by the following:

x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \varepsilon,   (1)

where \varepsilon is the noise and \bar{\alpha}_t is the adjustment according to the time step t. The method of this study is to make use of the reverse process with 50 noising steps, which from x_50 will ultimately form x_0, the synthetic image. The neural network \varepsilon_\theta thus minimises the following loss function:

Loss = \mathbb{E}_{t, x_0, \varepsilon} \left[ \lVert \varepsilon - \varepsilon_\theta(x_t, t) \rVert^2 \right].   (2)
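To make Equations (1) and (2) concrete, the forward-noising step and the noise-prediction objective can be sketched in a few lines of Python. This is a minimal NumPy illustration in which the noise-prediction network is replaced by a placeholder; it is not the LDM training code itself.

    import numpy as np

    def forward_diffuse(x0, alpha_bar_t, rng):
        """Equation (1): produce a noisy image x_t from the clean image x_0."""
        eps = rng.standard_normal(x0.shape)                 # Gaussian noise
        x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
        return x_t, eps

    def noise_prediction_loss(eps, eps_pred):
        """Equation (2): squared error between the true and predicted noise."""
        return np.mean((eps - eps_pred) ** 2)

    rng = np.random.default_rng(0)
    x0 = rng.random((32, 32, 3))            # a toy RGB image with values in [0, 1]
    x_t, eps = forward_diffuse(x0, alpha_bar_t=0.5, rng=rng)
    eps_pred = np.zeros_like(eps)           # stand-in for the U-Net prediction
    print(noise_prediction_loss(eps, eps_pred))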
Further technical details on the approach can be obtained from [2]. The model chosen for this approach is Stable Diffusion 1.4, which is trained on the LAION2B-en, LAION-high-resolution, and LAION-aesthetics v2.5+ datasets (https://laion.ai/blog/laion-5b/). The aforementioned datasets are a cleaned subset of the original LAION-5B dataset [25], which contains 5.85 billion text-image pairs.

SDM is used to generate a synthetic equivalent to the CIFAR-10 dataset, containing 6,000 images of each of the ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Following observations from the CIFAR-10 dataset, this study implements prompt modifiers to increase the diversity of the synthetic dataset, which can be observed in Table 1. As in the real dataset, 50,000 images are used for training data and 10,000 for testing data, provided with a class label to indicate that the image is not real.
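As an illustration of how the synthetic class could be produced, the sketch below generates one image with the CompVis/stable-diffusion-v1-4 checkpoint through the HuggingFace diffusers library. The prompt template follows Table 1, while the 50 inference steps, the Euler Ancestral scheduler, and the 512px-to-32px bilinear resize follow the implementation notes later in this section; the exact generation script used for CIFAKE is not reproduced here, so treat this as an approximation.

    import torch
    from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler
    from PIL import Image

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
    pipe = pipe.to("cuda")

    # Prompt template from Table 1: "a photograph of {a/an}" + class + modifier.
    # The exact joining of class and modifier is an assumption.
    prompt = "a photograph of a bird, in a tree"

    image = pipe(prompt, num_inference_steps=50, height=512, width=512).images[0]

    # Downscale to the CIFAR-10 resolution by bilinear interpolation.
    image.resize((32, 32), resample=Image.BILINEAR).save("fake_bird_0001.png")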
C. IMAGE CLASSIFICATION

Image classification is the task of predicting a class label for a given input image: learnt features are extracted from the image and processed in order to provide an output, in this case whether the image is real or synthetic. This subsection describes the selected approach to classification.
FIGURE 1: Examples of images from the CIFAR-10 image classification dataset [24], one column per class: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

TABLE 1: Latent Diffusion prompt modifiers for generating the 10-class synthetic dataset. All prompts are preceded by "a photograph of {a/an}" and modifiers are used equally across the 6,000 images of each class.

Class Label     Prompt Modifiers
Airplane        aircraft, airplane, fighter, flying, jet, plane
Automobile      family, new, sports, vintage
Bird            flying, in a tree, indoors, on water, outdoors, walking
Cat             indoors, outdoors, walking, running, eating, jumping, sleeping, sitting
Deer            herd, in a field, in the forest, outdoors, running, wildlife photography
Dog             indoors, outdoors, walking, running, eating, jumping, sleeping, sitting
Frog            European, in the forest, on a tree, on the ground, swimming, tropical, wildlife photography
Horse           herd, in a field, in the forest, outdoors, running, wildlife photography
Ship            at sea, boat, cargo, cruise, on the water, river, sailboat, tug
Truck           18-wheeler, car transport, fire, garbage, heavy goods, lorry, mining, tanker, tow

In this study, the Convolutional Neural Network (CNN) [26]–[28] is employed to learn from the input images. It is the concatenation of two main networks with intermediate operations: the convolutional layers and the fully connected layers. The convolutional part of the model can be operationally generalised, for an image x and a filter matrix w, as follows:

(x * w)(i, j) = \sum_{m=1}^{M} \sum_{n=1}^{N} x(i + m - 1, j + n - 1) \, w(m, n),   (3)

where (i, j) is the location in the output feature map and (m, n) indexes the filter w. The output is derived by applying convolutional operations to the input x with each of the (learnable) filters and applying an activation function f, which, in the context of this study, is the Rectified Linear Unit (ReLU) f(x) = max(0, x).

For an image of (height, width) dimensions and a filter kernel of (height_kernel, width_kernel) dimensions, with a stride of 1 and no padding for simplicity, the output has dimensions:

(height - height_kernel + 1, width - width_kernel + 1).   (4)

For example, a 32x32 input convolved with a 3x3 kernel produces a 30x30 feature map.

Then, a pooling operation is performed to reduce the spatial dimensions, and the output is flattened for input into the densely connected layers. For L = HWD (the height, width, and depth of the feature maps), the flattened one-dimensional vector is simply x = [x_1, x_2, ..., x_L]. The output vector y is ultimately produced by the dense layer(s) as y = f(Wx + b), for the weight matrix W and the bias b. The activation function f in this study, as in the convolutional layers, is the ReLU activation function f(x) = max(0, x).

The goal of the network in this study is to classify whether the image is a real photograph or an image generated by an LDM, and thus the problem is one of binary classification. Therefore, the output of the network is a single neuron with the S-shaped Sigmoid activation function:

\sigma(x) = \frac{1}{1 + e^{-x}}.   (5)

The "FAKE" class is 0 and the "REAL" class is 1; a value closer to either of the two extremes therefore represents the likelihood of that class. Although this continuous output aids learning, as it is differentiable, values are rounded to the closest class for inference.

Although the goal of the network is to use backpropagation to reduce the binary cross-entropy loss, this study also notes an extended number of classification metrics. These are the Precision, a measure of how many of the predicted positive cases are positive, which allows for the analysis of false positives:

Precision = \frac{\text{True positives}}{\text{True positives} + \text{False positives}};   (6)

the Recall, a measure of how many positive cases are correctly predicted, which enables analysis of false-negative predictions:

Recall = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}},   (7)

a measure that is particularly important in this case, as in fraud detection, since a false negative would falsely accuse an author of generating their image with AI; and finally, the F-1 score:

F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall},   (8)

which is a unified metric of precision and recall.
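The architecture described above maps directly onto a small Keras model. The sketch below is one plausible instantiation (two convolutional layers of 32 filters, 3x3 kernels with 2x2 max pooling, and a single dense layer are assumptions made here), rather than the exact code of this study; the filter counts and layer depths are precisely the hyperparameters searched next.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_classifier(n_filters=32, n_conv_layers=2, dense_units=64):
        """Binary real-vs-fake CNN; kernel and pooling sizes are assumptions."""
        model = models.Sequential([layers.Input(shape=(32, 32, 3)),
                                   layers.Rescaling(1.0 / 255)])
        for _ in range(n_conv_layers):
            model.add(layers.Conv2D(n_filters, (3, 3), activation="relu"))
            model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Flatten())
        model.add(layers.Dense(dense_units, activation="relu"))
        model.add(layers.Dense(1, activation="sigmoid"))        # Equation (5)
        model.compile(optimizer="adam",
                      loss="binary_crossentropy",
                      metrics=["accuracy",
                               tf.keras.metrics.Precision(),    # Equation (6)
                               tf.keras.metrics.Recall()])      # Equation (7)
        return model

    model = build_classifier()
    model.summary()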
The dataset that forms the classification task is the collection of real images and the equivalent synthetic images generated, detailed in Sections III-A and III-B, respectively. 100,000 images are used for training (50,000 real images and 50,000 synthetic images), and 20,000 are used for testing (10,000 real and 10,000 synthetic).

Initially, CNN architectures are benchmarked as a lone feature extractor. That is, filter counts of {16, 32, 64, 128} are benchmarked in {1, 2, 3} layers, flattened, and connected directly to the output neuron. The topology of the highest-performing feature extractor is then used to compare dense networks featuring {32, 64, 128, 256, ..., 4096} rectified linear units in {1, 2, 3} layers. These 36 artificial neural networks are then compared with regard to classification metrics to derive the topology that performs best.
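The 36-network search described above amounts to two nested sweeps: twelve feature extractors (four filter counts by three depths), followed by twenty-four dense configurations (eight widths by three depths) built on the best extractor. A compact sketch of the first sweep is given below; the kernel and pooling sizes and the training schedule (epochs, batch size, optimiser settings) are not specified in the text and are assumptions here.

    import itertools
    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_feature_extractor(n_filters, n_layers):
        """Convolutional stack flattened straight into the sigmoid output neuron."""
        model = models.Sequential([layers.Input(shape=(32, 32, 3))])
        for _ in range(n_layers):
            model.add(layers.Conv2D(n_filters, (3, 3), activation="relu"))
            model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Flatten())
        model.add(layers.Dense(1, activation="sigmoid"))
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model

    results = {}
    for n_filters, n_layers in itertools.product([16, 32, 64, 128], [1, 2, 3]):
        model = build_feature_extractor(n_filters, n_layers)
        # model.fit(train_ds, validation_data=test_ds, epochs=...)  # data and epochs omitted
        results[(n_filters, n_layers)] = model.count_params()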
D. EXPLAINABLE AI

While deep learning approaches often lead to impressive predictive ability, many algorithms are black boxes that provide no reasoning for their classification. The aim of Explainable AI (XAI) is to extract meaning from algorithms and provide readable interpretations of why a prediction or decision is being made [29]. Regarding the experiments in this work, the CNN simply predicts that an image is real or synthetic, and XAI is then used to provide interpretations as to why the image is real or synthetic.

Given that the literature shows that humans have major difficulty in recognising synthetic imagery, it is important to display and visualise the minor defects within an image that could suggest that it is not real.

The method selected for explainable AI and interpretation is Gradient Class Activation Mapping (Grad-CAM) [30]. Grad-CAM interprets the gradients of the predicted class with respect to the CNN feature maps, which can therefore be spatially localised with respect to the original input (the image) to produce a heatmap. This is generated through the Rectified Linear Unit (ReLU) function as:

L^{c}_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\left( \sum_{k} \alpha_k A^k \right),   (9)

where \alpha_k is the global average pooling \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}} of the gradients of the class score y^c with respect to the feature map activations A^k.
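Equation (9) can be computed directly from a trained Keras model with a gradient tape. The sketch below assumes a model of the kind outlined in Section III-C and takes the name of its last convolutional layer as an argument; it is an illustrative re-implementation of Grad-CAM rather than the exact XAI code used in this study.

    import numpy as np
    import tensorflow as tf

    def grad_cam(model, image, conv_layer_name):
        """Equation (9): Grad-CAM heatmap for one 32x32x3 image scaled to [0, 1]."""
        grad_model = tf.keras.Model(
            model.inputs, [model.get_layer(conv_layer_name).output, model.output])
        x = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.float32)
        with tf.GradientTape() as tape:
            conv_maps, score = grad_model(x)         # feature maps A^k and sigmoid output y^c
        grads = tape.gradient(score, conv_maps)      # dy^c / dA^k_ij
        alpha = tf.reduce_mean(grads, axis=(1, 2))   # global average pooling over i, j
        cam = tf.nn.relu(tf.reduce_sum(
            alpha[:, tf.newaxis, tf.newaxis, :] * conv_maps, axis=-1))
        cam = cam / (tf.reduce_max(cam) + 1e-8)      # normalise to [0, 1] for display
        return cam.numpy()[0]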
For the LDM, the Euler Ancestral scheduler was used, and synthetic images were rendered at a resolution of 512px before being resized to 32px by bilinear interpolation to match the resolution of CIFAR-10. All algorithms in this study were executed using a single Nvidia RTX 3080Ti GPU, which has 10,240 CUDA cores, a clock speed of 1.67 GHz, and 12GB of GDDR6X VRAM.

IV. RESULTS AND OBSERVATIONS

This section presents examples of the dataset followed by the findings of the planned computer vision experiments. The dataset is also released to the public research community for use in future studies, given the important implications of detecting AI-generated imagery.

A. DATASET EXPLORATION

Random samples of the images used in this study and within the dataset provided can be observed in Figure 2. Five images are presented for each class label, and all of the images within this figure are synthetic, having been generated by the SDM. Note within this sample that the images are high quality and, for the most part, seem difficult to discern as synthetic by the human eye. The synthetic photographs are representative of their counterparts from reality and feature complex attributes such as depth of field, reflections, and motion blur.

It can also be observed that there are visual imperfections within some of the images. Figure 3 shows a number of examples within the dataset where the model has output images with visual glitches. Given that the LAION dataset provides physical descriptions of image content, little to no information on text is provided, and thus it can be seen that the model produces shapes that are merely similar to alphabetic characters. Also observed here is a lack of important detail, such as the case of a jet aircraft that has no cockpit window; it seems that this image has been produced by combining the knowledge of jet aircraft (in particular, the engines) with the concept of an Unmanned Aerial Vehicle's chassis. Finally, there are also some cases of anatomical errors for living creatures, seen in these examples through the cat's limbs and eyes.

Complex visual concepts are present within much of the dataset, such as the realistic water and reflections shown in Figure 4.
FIGURE 2: Examples of AI-generated images within the dataset contributed by this study, selected at random with regards to their real CIFAR-10 equivalent labels (one column per class: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck).
FIGURE 3: Examples of visual defects found within the synthetic image dataset.

The validation accuracy and loss metrics for the feature extractors can be found in Tables 2 and 3, respectively. All feature extractors scored relatively well without the need for dense layers to process feature maps, with an average classification accuracy of 91.79%. The lowest-loss feature extractor was found to use two layers of 32 filters, which led to an overall classification accuracy of 92.93% and a binary cross-entropy loss of 0.18. The highest-accuracy model, two layers of 128 filters, scored 92.98% with a loss of 0.221.

TABLE 2: Observed classification accuracy (%) for feature extraction networks.

Filters \ Layers    1        2        3
16                  90.06    91.46    91.63
32                  90.38    92.93    92.54
64                  90.94    92.71    92.38
128                 91.39    92.98    92.07

Extended validation metrics are presented in Tables 4, 5, and 6, which detail the validation precision, recall, and F1-scores, respectively. The F1 score, which is a unification of precision and recall, had a mean value of 0.929, with the highest being 0.936. A small standard deviation of 0.003 was observed.

TABLE 4: Observed validation precision for the filters within the Convolutional Neural Network.

Filters \ Layers    1        2        3
16                  0.903    0.941    0.921
32                  0.878    0.923    0.937
64                  0.908    0.947    0.936
128                 0.92     0.948    0.94

TABLE 5: Observed validation recall for the filters within the Convolutional Neural Network.

Filters \ Layers    1        2        3
16                  0.897    0.885    0.911
32                  0.938    0.936    0.912
64                  0.92     0.904    0.91
128                 0.906    0.909    0.898

TABLE 6: Observed validation F1-Score for the filters within the Convolutional Neural Network.

Filters \ Layers    1        2        3
16                  0.9      0.912    0.916
32                  0.907    0.93     0.924
64                  0.91     0.925    0.923
128                 0.913    0.928    0.919

Following these experiments, the lowest-loss feature extractor is selected for further engineering of the network topology. This was the model which had two layers of 32 convolutional filters.

FIGURE 4: A selection of AI-generated images within the dataset. Examples of complex visual attributes generated by the latent diffusion model that include realistic water and reflections.

TABLE 8: Observed validation loss for the dense layers within the Convolutional Neural Network.

Neurons \ Layers    1        2        3
32                  0.186    0.182    0.187
64                  0.182    0.193    0.177
128                 0.187    0.183    0.178
256                 0.187    0.192    0.194
512                 0.188    0.193    0.184
1024                0.199    0.194    0.192
2048                0.194    0.2      0.219
4096                0.234    0.204    0.19

TABLE 9: Observed validation precision for the dense layers within the Convolutional Neural Network.

Neurons \ Layers    1        2        3
32                  0.932    0.916    0.929
64                  0.925    0.92     0.93
128                 0.948    0.942    0.935
256                 0.939    0.926    0.931
512                 0.944    0.924    0.946
1024                0.933    0.94     0.939
2048                0.942    0.922    0.929
4096                0.926    0.914    0.923
The results for the engineering of the overall network are presented in Tables 7 and 8, which contain the validation accuracy and loss, respectively. The lowest loss observed was [...] detailed in Figure 5, which provides a visual example of the network topology.

FIGURE 5: An example of one of the final model architectures following the hyperparameter search for the classification of real or AI-generated images (input images are rescaled 32 x 32 x 3 tensors).

TABLE 10: Observed validation recall for the dense layers within the Convolutional Neural Network.

Neurons \ Layers    1        2        3
32                  0.932    0.943    0.93
64                  0.948    0.936    0.935
128                 0.91     0.923    0.973
256                 0.919    0.932    0.926
512                 0.915    0.928    0.919
1024                0.925    0.916    0.915
2048                0.912    0.934    0.925
4096                0.926    0.939    0.936

TABLE 11: Observed validation F1-Score for the dense layers within the Convolutional Neural Network.

Neurons \ Layers    1        2        3
32                  0.932    0.929    0.93
64                  0.936    0.928    0.933
128                 0.928    0.932    0.932
256                 0.929    0.929    0.929
512                 0.929    0.926    0.932
1024                0.929    0.928    0.927
2048                0.927    0.928    0.927
4096                0.926    0.926    0.929

Figure 6 shows examples of the interpretation of predictions via Grad-CAM. Brighter pixels in the image represent areas that contribute more to the decision of the CNN. It can be observed that there is a significantly different distribution of features between the two sides of the binary classification problem. Firstly, the classification of real images can be interpreted as a more holistic approach in which the majority of the contents of the image are useful for recognition. However, the classification of synthetic images is somewhat more atomistic and sparse. Note that for the recognition of AI-generated imagery, activation occurs in select parts of the image that are more likely to present visual glitches that are difficult to recognise with the human eye. An example of this can be seen for the image of the frog, where an out-of-focus bokeh is the only attribute that suggests the image is not real. For the truck, only the radiator grille pattern is considered useful for classification.

FIGURE 6: Gradient class activation maps (Grad-CAM) overlays and raw heatmaps for prediction interpretation. Top examples show real images and bottom examples show AI-generated images. Brighter pixels represent features contributing to the output class label.

The XAI approach also reveals an interesting mechanic in a more general sense. Given the examples of airplane, bird, frog, horse, and ship, note that the object within the image has little to no class activation overlay whatsoever. This suggests that the actual focus of the image itself, the entity, contains almost no useful features for synthetic image recognition, and that the model is often able to produce a near-perfect representation of the entity.

V. CONCLUSION AND FUTURE WORK

This study has proposed a method to improve our waning ability to recognise AI-generated images through the use of computer vision and to provide insight into predictions with visual cues. To achieve this, this study proposed the generation of a synthetic dataset with Latent Diffusion, recognition with Convolutional Neural Networks, and interpretation through Gradient Class Activation Mapping. The results showed that the synthetic images were high quality and featured complex visual attributes, and that binary classification could be achieved with around 92.98% accuracy. Grad-CAM interpretation revealed interesting concepts within the images that were useful for predictions.

In addition to the method proposed in this study, a significant contribution is made through the release of the CIFAKE dataset. The dataset contains a total of 120,000 images (60,000 real images from CIFAR-10 and 60,000 synthetic images generated for this study). The CIFAKE dataset provides the research community with a valuable resource for future work on the social problems posed by AI-generated imagery, and a significant expansion of the resources available for the development and testing of applied computer vision approaches to this problem.

The reality of AI generating images that are indistinguishable from real-life photographs raises fundamental questions about the limits of human perception, and thus this study proposed to enhance that ability by fighting fire with fire. The proposed approach addresses the challenges of ensuring the authenticity and trustworthiness of visual data.

Future work could involve exploring other techniques for classification of the dataset provided. For example, the implementation of attention-based approaches is a promising new direction that could provide increased ability and an alternative method of explainable AI. Furthermore, with even further improvements to synthetic imagery in the future, it is important to consider updating the dataset with images generated by these newer approaches. Considering the generation of images from other domains, such as human faces and clinical scans, would also provide additional datasets for this type of study and expand the applicability of the proposed approach to other fields of research.

Finally, in conclusion, this study provides contributions to the ongoing implications of AI-generated images. The proposed approach supports the important goals of ensuring data authenticity and trustworthiness, providing not only a system that can recognise synthetic images but also data and interpretation. The public release of the CIFAKE dataset generated within this study provides a valuable resource for interdisciplinary research.

VI. AVAILABILITY OF DATA AND MATERIALS

The datasets generated and analysed during the current study are available in the CIFAKE repository, https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images.
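For completeness, the released images can be consumed with standard Keras utilities. The snippet below assumes the archive has been unpacked into train/ and test/ folders, each containing one sub-directory per class (REAL and FAKE); this folder layout is an assumption about the distribution format rather than something specified above.

    import tensorflow as tf

    # Assumed layout: cifake/train/{REAL,FAKE}/*.png and cifake/test/{REAL,FAKE}/*.png
    train_ds = tf.keras.utils.image_dataset_from_directory(
        "cifake/train", label_mode="binary", image_size=(32, 32),
        batch_size=128, shuffle=True, seed=0)
    test_ds = tf.keras.utils.image_dataset_from_directory(
        "cifake/test", label_mode="binary", image_size=(32, 32),
        batch_size=128, shuffle=False)

    print(train_ds.class_names)   # class order is alphabetical, e.g. ['FAKE', 'REAL']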
REFERENCES
[1] K. Roose, "An AI-generated picture won an art prize. Artists aren't happy," The New York Times, 2022.
[2] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
[3] G. Pennycook and D. G. Rand, "The psychology of fake news," Trends in Cognitive Sciences, vol. 25, no. 5, pp. 388–402, 2021.
[4] B. Singh and D. K. Sharma, "Predicting image credibility in fake news over social media using multi-modal approach," Neural Computing and Applications, vol. 34, no. 24, pp. 21503–21517, 2022.
[5] N. Bonettini, P. Bestagini, S. Milani, and S. Tubaro, "On the use of Benford's law to detect GAN-generated images," in 2020 25th International Conference on Pattern Recognition (ICPR), pp. 5495–5502, IEEE, 2021.
[6] D. Deb, J. Zhang, and A. K. Jain, "AdvFaces: Adversarial face synthesis," in 2020 IEEE International Joint Conference on Biometrics (IJCB), pp. 1–10, IEEE, 2020.
[7] M. Khosravy, K. Nakamura, Y. Hirose, N. Nitta, and N. Babaguchi, "Model inversion attack: analysis under gray-box scenario on deep learning based face recognition system," KSII Transactions on Internet and Information Systems (TIIS), vol. 15, no. 3, pp. 1100–1118, 2021.
[8] J. J. Bird, A. Naser, and A. Lotfi, "Writer-independent signature verification; evaluation of robotic and generative adversarial attacks," Information Sciences, vol. 633, pp. 170–181, 2023.
[9] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, "Zero-shot text-to-image generation," in International Conference on Machine Learning, pp. 8821–8831, PMLR, 2021.
[10] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al., "Photorealistic text-to-image diffusion models with deep language understanding," arXiv preprint arXiv:2205.11487, 2022.
[11] P. Chambon, C. Bluethgen, C. P. Langlotz, and A. Chaudhari, "Adapting pretrained vision-language foundational models to medical imaging domains," arXiv preprint arXiv:2210.04133, 2022.
[12] F. Schneider, Z. Jin, and B. Schölkopf, "Moûsai: Text-to-music generation with long-context latent diffusion," arXiv preprint arXiv:2301.11757, 2023.
[13] F. Schneider, "ArchiSound: Audio generation with diffusion," Master's thesis, ETH Zurich, 2023.
[14] D. Yi, C. Guo, and T. Bai, "Exploring painting synthesis with diffusion models," in 2021 IEEE 1st International Conference on Digital Twins and Parallel Intelligence (DTPI), pp. 332–335, IEEE, 2021.
[15] C. Guo, Y. Dou, T. Bai, X. Dai, C. Wang, and Y. Wen, "Artverse: A paradigm for parallel human–machine collaborative painting creation in metaverses," IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023.
[16] Z. Sha, Z. Li, N. Yu, and Y. Zhang, "DE-FAKE: Detection and attribution of fake images generated by text-to-image diffusion models," arXiv preprint arXiv:2210.06998, 2022.
[17] R. Corvi, D. Cozzolino, G. Zingarini, G. Poggi, K. Nagano, and L. Verdoliva, "On the detection of synthetic images generated by diffusion models," arXiv preprint arXiv:2211.00680, 2022.
[18] I. Amerini, L. Galteri, R. Caldelli, and A. Del Bimbo, "Deepfake video detection through optical flow based CNN," in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
[19] D. Güera and E. J. Delp, "Deepfake video detection using recurrent neural networks," in 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6, IEEE, 2018.
[20] J. Wang, Z. Wu, W. Ouyang, X. Han, J. Chen, Y.-G. Jiang, and S.-N. Li, "M2TR: Multi-modal multi-scale transformers for deepfake detection," in Proceedings of the 2022 International Conference on Multimedia Retrieval, pp. 615–623, 2022.
[21] P. Saikia, D. Dholaria, P. Yadav, V. Patel, and M. Roy, "A hybrid CNN-LSTM model for video deepfake detection by leveraging optical flow features," in 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1–7, IEEE, 2022.
[22] H. Li, B. Li, S. Tan, and J. Huang, "Identification of deep network generated images using disparities in color components," Signal Processing, vol. 174, p. 107616, 2020.
[23] S. J. Nightingale, K. A. Wade, and D. G. Watson, "Can people identify original and manipulated photos of real-world scenes?," Cognitive Research: Principles and Implications, vol. 2, no. 1, pp. 1–21, 2017.
[24] A. Krizhevsky, G. Hinton, et al., "Learning multiple layers of features from tiny images," 2009.
[25] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al., "LAION-5B: An open large-scale dataset for training next generation image-text models," arXiv preprint arXiv:2210.08402, 2022.
[26] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[27] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, et al., "Recent advances in convolutional neural networks," Pattern Recognition, vol. 77, pp. 354–377, 2018.
[28] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, "A survey of convolutional neural networks: analysis, applications, and prospects," IEEE Transactions on Neural Networks and Learning Systems, 2021.
[29] D. Gunning, M. Stefik, J. Choi, T. Miller, S. Stumpf, and G.-Z. Yang, "XAI—Explainable artificial intelligence," Science Robotics, vol. 4, no. 37, p. eaay7120, 2019.
[30] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626, 2017.
[31] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015.