
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3356122.

CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images

JORDAN J. BIRD1 AND AHMAD LOTFI2
1 Department of Computer Science, Nottingham Trent University, Nottingham, United Kingdom (e-mail: [email protected])
2 Department of Computer Science, Nottingham Trent University, Nottingham, United Kingdom (e-mail: [email protected])
Corresponding author: Jordan J. Bird (e-mail: [email protected]).

ABSTRACT Recent advances in synthetic data have enabled the generation of images with such high
quality that human beings cannot tell the difference between real-life photographs and Artificial Intelligence
(AI) generated images. Given the critical necessity of data reliability and authentication, this article proposes
to enhance our ability to recognise AI-generated images through computer vision. Initially, a synthetic
dataset is generated that mirrors the ten classes of the already available CIFAR-10 dataset with latent
diffusion, providing a contrasting set of images for comparison to real photographs. The model is capable
of generating complex visual attributes, such as photorealistic reflections in water. The two sets of data
present as a binary classification problem with regard to whether the photograph is real or generated by
AI. This study then proposes the use of a Convolutional Neural Network (CNN) to classify the images into
two categories: Real or Fake. Following hyperparameter tuning and the training of 36 individual network
topologies, the optimal approach could correctly classify the images with 92.98% accuracy. Finally, this
study implements explainable AI via Gradient Class Activation Mapping to explore which features within
the images are useful for classification. Interpretation reveals interesting concepts within the image, in
particular, noting that the actual entity itself does not hold useful information for classification; instead,
the model focuses on small visual imperfections in the background of the images. The complete dataset
engineered for this study, referred to as the CIFAKE dataset, is made publicly available to the research
community for future work.

INDEX TERMS AI-generated Images, Generative AI, Image Classification, Latent Diffusion

I. INTRODUCTION
THE field of synthetic image generation by Artificial Intelligence (AI) has developed rapidly in recent years, and the ability to detect AI-generated photos has also become a critical necessity to ensure the authenticity of image data. Within recent memory, generative technology often produced images with major visual defects that were noticeable to the human eye, but now we are faced with the possibility of AI models generating high-fidelity and photorealistic images in a matter of seconds. The AI-generated images are now at the quality level needed to compete with humans and win art competitions [1].

Latent Diffusion Models (LDMs), a type of generative model, have emerged as a powerful tool to generate synthetic imagery [2]. These recent developments have caused a paradigm shift in our understanding of creativity, authenticity, and truth. This has led to a situation where there is consumer-level technology available that could quite easily be used for the violation of privacy and to commit fraud. These philosophical and societal implications are at the forefront of the current state of the art, raising fundamental questions about the nature of trustworthiness and reality. Recent technological advances have enabled the generation of images with such high quality that human beings cannot tell the difference between a real-life photograph and an image which is no more than a hallucination of an artificial neural network's weights and biases.

Generative imagery that is indistinguishable from photographic data raises questions both ontological, those which concern the nature of being, and epistemological, surrounding the theories of methods, validity, and scope. Ontologically, given that humans cannot tell the difference between images from cameras and those generated by AI models such as an Artificial Neural Network, in terms of digital


information, what is real and what is not? The epistemological reality is that there are serious questions surrounding the reliability of human knowledge and the ethical implications that surround the misuse of these types of technology. The implications suggest that we are in growing need of a system that can aid us in the recognition of real images versus those generated by AI.

This study explores the potential of using computer vision to enhance our newfound inability to recognise the difference between real photographs and those that are AI-generated. Given that there are many years' worth of photographic datasets available for image classification, these provide examples for a model of real images. Following the generation of a synthetic equivalent to such data, we will then explore the outputs of the model before finally implementing methods of differentiation between the two types of image.

There are several scientific contributions with multidisciplinary and social implications that arise from this study. First, a dataset, called CIFAKE, is generated with latent diffusion and released to the research community. The CIFAKE dataset provides a contrasting set of real and fake photographs and contains 120,000 images: 60,000 images from the existing CIFAR-10 dataset (a collection of images commonly used to train machine learning and computer vision algorithms, available from https://www.cs.toronto.edu/~kriz/cifar.html) and 60,000 images generated for this study, making it a valuable resource for researchers in the field. Second, this study proposes a method to improve our waning human ability to recognise AI-generated images through computer vision, using the CIFAKE dataset for classification. Finally, this study proposes the use of Explainable AI (XAI) to further our understanding of the complex processes involved in synthetic image recognition, as well as visualisation of the important features within those images. These scientific contributions provide important steps forward in addressing the modern challenges posed by rapid developments of modern technology, and have important implications for ensuring the authenticity and trustworthiness of data.

The remainder of this article is as follows: the state-of-the-art research background is initially explored in Section II with a discussion of relevant related studies in the field. Following this, the methodology followed by this study is detailed in Section III, which provides the technical implementation and the method followed for binary classification of real versus AI-generated imagery. The results of these experiments are presented with discussion in Section IV before this work is finally concluded and future work is proposed in Section V.

II. BACKGROUND
The ability to distinguish between real imagery and images generated by machine learning models is important for a number of reasons. Identification of real data provides confirmation of the authenticity and originality of the image; for example, a fine-tuned Stable Diffusion Model (SDM) could be used to generate a synthetic photograph of an individual committing a crime or, vice versa, provide false evidence of an alibi for a person who was, in reality, elsewhere. Misinformation and fake news are a significant modern problem, and machine-generated images could be used to manipulate public opinion [3], [4]. Situations where synthetic imagery is used in fake news can promote its false credibility and have serious consequences [5]. Cybersecurity is another major concern, with research noting that synthetically generated human faces can be used in false acceptance attacks and have the potential to gain unauthorised access to digital systems [6], [7]. In [8], it was observed that synthetically generated signatures could overcome signature verification systems with ease.

Latent Diffusion Models are a new approach to generating images which use attention mechanisms and a U-Net to reverse a process of Gaussian noising and, ultimately, use text conditioning to generate novel images from random noise. Details on the methodological implementation of LDMs can be found in Section III. The approach is rapidly developing but young; the literature on the subject is currently scarce and few applications have been explored. Examples of notable models include DALL-E by OpenAI [9], Imagen from Google [10], and the open-source equivalent, SDM from StabilityAI [2]. These models have pushed the boundaries of image quality, both in realism and, arguably, artistic ability. This has led to much debate on the professional, social, ethical, and legal considerations of the technology [1].

The majority of research in the field is cutting-edge and is presented as preprints and recent theses. In [11], researchers proposed to train SDM on medical imaging data, achieving higher-quality images that could potentially lead to increased model abilities through data augmentation. It is worth mentioning that in [12], [13], diffusion models were found to have the ability to generate audio as well as images. In 2021, the results of Yi et al. [14] suggested that diffusion models were highly capable of generating realistic artworks, fooling human subjects into believing that the works were created by human beings. Given this, researchers have noted that diffusion models have a promising capacity for co-creating with human artists [15].

DE-FAKE, proposed by Sha et al. [16], shows that images generated by various latent diffusion approaches may contain digital fingerprints to suggest they are synthetic. Although visual glitches are increasingly rare given the advances in model quality, it may be possible that computer vision approaches will detect these attributes within images where the human eye cannot. The Fourier transforms presented in [17] show visual examples of these digital fingerprints.

When discussing the topic of vision, the results in [18] suggest that optical flow techniques could detect synthetic human faces within the FaceForensics dataset with around 81.61% accuracy. Extending to the temporal domain, [19] proposes recurrence in AI-generated video recognition,

achieving 97.1% accuracy over 80 frames due to minor visual glitches at the pixel scale. In Wang et al. [20], EfficientNets and Vision Transformers are proposed within a system that can detect images forged by adversarial models at an F1 score of 0.88 and an AUC of 0.95, competing with the state of the art on the DeepFake Detection Challenge dataset while remaining efficient. In the aforementioned study, a Convolutional Neural Network was used to extract features, similarly to the approach proposed in this study, prior to processing using attention-based approaches.

Similarly, convolutional and temporal techniques were proposed in [21] to achieve 66.26% to 91.21% accuracy on a mixed set of synthetic data detection datasets. The chrominance components CbCr within a digital image were noted in [22] as a promising route for the detection of minor pixel disparities that are sometimes present within synthetic images.

Human recognition of manipulation within images is waning as a direct result of image generation methods improving. A study by Nightingale et al. [23] in 2017 suggested that humans have difficulty recognising photographs that have been edited by image-processing techniques. Since this study, there have been nearly five years of rapid development in the field.

Reviewing the relevant literature has highlighted rapid developments within AI-generated imagery and the challenges posed today with respect to its detection. Generative models have enabled the generation of high-fidelity, photorealistic images within a matter of seconds that humans often cannot distinguish from reality. This conclusion sets the stage for the studies presented by this work and argues the need to fill the knowledge gap when it comes to the availability of examples of synthetic data.

III. METHOD
This section describes the methods followed by this study in terms of their technical implementation and application for the detection of synthetic images. It first describes the collection of the real data and then the methodology followed to generate the synthetic equivalent for comparison. Sections III-A and III-B describe how 60,000 images are collected and 60,000 images are synthetically generated, respectively; this forms the overall dataset of 120,000 images. Section III-C then describes the machine learning model engineering approach which aims to recognise the authenticity of the images, before Section III-D notes the approach for Explainable AI to interpret model predictions.

A. REAL DATA COLLECTION
For the class label "REAL", interpreted as a positive class value "1", data is collected from the CIFAR-10 dataset [24]. It is a dataset of 60,000 32×32 RGB images of real subjects divided into ten classes. The classes within the dataset are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. There are 6,000 images per class. For each class, 5,000 images are used for training and 1,000 for testing, i.e. a testing split of 16.6%. Within this study, all images from the training dataset are used for the training of the positive class "REAL". Therefore, 50,000 are used for training and 10,000 for testing. Samples of images within the CIFAR-10 dataset that form the "REAL" class can be observed in Figure 1.

FIGURE 1: Examples of images from the CIFAR-10 image classification dataset [24], one column per class: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.

B. SYNTHETIC DATA GENERATION
The synthetic images generated for this study use CompVis Stable Diffusion (https://huggingface.co/CompVis/stable-diffusion-v1-4), an open-source LDM. The goal is to model the diffusion of image data through a latent space given a textual context. If noise, such as that of a Gaussian distribution, is iteratively added to an image, the image ultimately becomes noise and all prior visual information is lost. To generalise, the reverse of this process is, therefore, to generate a synthetic image from noise. The method of reverse diffusion can be put simply as: given an image x at timestep t, x_t, output the prediction of x_{t-1} through the prediction of the noise and its subsequent removal by classical means.

A noisy image x_t is generated from the original x_0 by the following:

x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \varepsilon,   (1)

where the noise is \varepsilon and the adjustment according to the timestep t is \bar{\alpha}_t. The method of this study is to make use of the reverse process over 50 noising steps, which from x_{50} will ultimately form x_0, the synthetic image. The neural network \varepsilon_\theta thus minimises the following loss function:

Loss = \mathbb{E}_{t, x_0, \varepsilon} \left[ \| \varepsilon - \varepsilon_\theta(x_t, t) \|^2 \right].   (2)

Further technical details on the approach can be obtained from [2]. The model chosen for this approach is Stable Diffusion 1.4, which is trained on the LAION2B-en, LAION-high-resolution, and LAION-aesthetics v2.5+ datasets (https://laion.ai/blog/laion-5b/). The aforementioned datasets are a cleaned subset of the original LAION-5B dataset [25], which contains 5.85 billion text-image pairs.

SDM is used to generate a synthetic equivalent to the CIFAR-10 dataset, which contains 6,000 images for each of ten classes. The classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Following observations from the CIFAR-10 dataset, this study implements prompt modifiers to increase the diversity of the synthetic dataset, which can be observed in Table 1. As in the real dataset, 50,000 images are used for training data and 10,000 for testing data, provided with a class label to indicate that the image is not real.
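To make the generation procedure concrete, the sketch below shows how one image of the "FAKE" class could be produced. It is a minimal illustration only, assuming the Hugging Face diffusers StableDiffusionPipeline as the interface to Stable Diffusion 1.4; the prompt template follows Table 1, and the 50-step denoising, Euler Ancestral scheduler, fixed seed, and 512 px to 32 px bilinear resize match the settings reported in Sections III-B and III-E, but this is not the authors' published code.

```python
# Minimal sketch (not the authors' code): generate one 32x32 "FAKE" image
# with Stable Diffusion 1.4 via the diffusers library.
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler
from PIL import Image

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# Euler Ancestral scheduler and 50 denoising steps, as described in Section III-E.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

prompt = "a photograph of a horse, in a field"  # class prompt plus a modifier from Table 1
generator = torch.Generator("cuda").manual_seed(1)  # fixed seed for replicability
image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]  # 512x512 output

# Downscale to CIFAR-10 resolution by bilinear interpolation (Section III-E).
image_32 = image.resize((32, 32), Image.BILINEAR)
image_32.save("fake_horse_0001.png")
```

In practice such a loop would run over all ten classes and their prompt modifiers until 6,000 synthetic images per class are produced.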

TABLE 1: Latent Diffusion prompt modifiers for generating the 10-class synthetic dataset. All prompts are preceded by "a photograph of {a/an}", and the modifiers are used equally across the 6,000 images per class.

Class Label — Prompt Modifiers
Airplane: aircraft, airplane, fighter, flying, jet, plane
Automobile: family, new, sports, vintage
Bird: flying, in a tree, indoors, on water, outdoors, walking
Cat: indoors, outdoors, walking, running, eating, jumping, sleeping, sitting
Deer: herd, in a field, in the forest, outdoors, running, wildlife photography
Dog: indoors, outdoors, walking, running, eating, jumping, sleeping, sitting
Frog: European, in the forest, on a tree, on the ground, swimming, tropical, wildlife photography
Horse: herd, in a field, in the forest, outdoors, running, wildlife photography
Ship: at sea, boat, cargo, cruise, on the water, river, sailboat, tug
Truck: 18-wheeler, car transport, fire, garbage, heavy goods, lorry, mining, tanker, tow

C. IMAGE CLASSIFICATION
Image classification is an algorithm that predicts a class label given an input image. The learnt features are extracted from the image and processed in order to provide an output, in this case, whether the image is real or synthetic. This subsection describes the selected approach to classification.

In this study, a Convolutional Neural Network (CNN) [26]–[28] is employed to learn from the input images. It is the concatenation of two main networks with intermediate operations: the convolutional layers and the fully connected layers. The convolutional part of the overall model can be operationally generalised for an image of dimensions x and a filter matrix w as follows:

(x * w)(i, j) = \sum_{m=1}^{M} \sum_{n=1}^{N} x(i+m-1, j+n-1) \, w(m, n),   (3)

where (i, j) is the output location in the feature map and (m, n) represents the location within the filter w. The output is derived by applying convolutional operations to the input x with each of the (learnable) filters and applying an activation function f, which, in the context of this study, is the Rectified Linear Unit (ReLU), f(x) = max(0, x).

For an image of (height, width) dimensions and a filter kernel of (height_kernel, width_kernel), with a stride of 1 and no padding for simplicity, the output has dimensions:

(height - height_kernel + 1, \; width - width_kernel + 1).   (4)

For example, a 32×32 input convolved with a 3×3 kernel under these settings produces a 30×30 feature map.

Then, a pooling operation is performed to reduce the spatial dimensions, and the output is flattened to be input into the densely connected layers. For L = H × W × D (height, width, and depth of the feature maps), the flattened one-dimensional vector is simply x = [x_1, x_2, ..., x_L]. The output vector y is ultimately produced by the dense layer(s) as y = f(Wx + b), for the weight matrix W and the bias b. The activation function f in this study, as in the convolutional layers, is the ReLU activation function f(x) = max(0, x).

The goal of the network in this study is to classify whether the image is a real photograph or an image generated by an LDM, and thus it is a problem of binary classification. Therefore, the output of the network is a single neuron with the S-shaped Sigmoid activation function:

\sigma(x) = \frac{1}{1 + e^{-x}}.   (5)

The "FAKE" class is 0 and the "REAL" class is 1; therefore, a value closer to either of the two represents the likelihood of that class. Although this continuous output aids learning, as it is differentiable, the values are rounded to the closest class for inference.

Although the goal of the network is to use backpropagation to reduce binary cross-entropy loss, this study also notes an extended number of classification metrics. These are the Precision, which is a measure of how many of the predicted positive cases are truly positive, a metric which allows for the analysis of false positives:

Precision = \frac{True\ positives}{True\ positives + False\ positives};   (6)

the Recall, which is a measure of how many positive cases are correctly predicted, enabling analysis of false-negative predictions:

Recall = \frac{True\ positives}{True\ positives + False\ negatives}.   (7)

This measure is particularly important in this case, as it is in fraud detection, since a false negative would falsely accuse an author of generating their image with AI. Finally, the F1 score is considered:

F1\ score = 2 \times \frac{Precision \times Recall}{Precision + Recall},   (8)

which is a unified metric of precision and recall.
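As an illustration of the classifier just described, the following sketch builds one candidate topology in TensorFlow/Keras (the library reported in Section III-E): two convolutional layers of 32 filters with max pooling, a flattened representation, one dense layer of 64 ReLU units, and a single Sigmoid output trained with binary cross-entropy. The layer sizes correspond to the best-performing configuration reported later in the results; the optimiser, kernel size, and batch size are assumptions rather than values stated by the authors.

```python
# Minimal sketch (assumed details) of the binary real-vs-fake CNN in TensorFlow/Keras.
import tensorflow as tf
from tensorflow.keras import layers, models

tf.random.set_seed(1)  # seeds fixed to 1 for replicability (Section III-E)

model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(32, 32, 3)),   # scale RGB to [0, 1]
    layers.Conv2D(32, (3, 3), activation="relu"),            # kernel size assumed
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                   # 0 = FAKE, 1 = REAL
])

model.compile(
    optimizer="adam",                                        # optimiser assumed
    loss="binary_crossentropy",
    metrics=["accuracy",
             tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall")],
)
model.summary()
```

Training then proceeds with model.fit on the 100,000-image training split, and the Sigmoid output is rounded to the nearest class at inference time.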

The dataset that forms the classification task is the collection of real images and the equivalent synthetic images generated, detailed in Sections III-A and III-B, respectively. 100,000 images are used for training (50,000 real images and 50,000 synthetic images), and 20,000 are used for testing (10,000 real and 10,000 synthetic).

Initially, CNN architectures are benchmarked as a lone feature extractor. That is, filters of {16, 32, 64, 128} are benchmarked in layers of {1, 2, 3}, flattened, and connected directly to the output neuron. The topology of the highest-performing feature extractor is then used to compare dense networks featuring {32, 64, 128, 256, ..., 4096} rectified linear units in layers of {1, 2, 3}. These 36 artificial neural networks are then compared with regard to classification metrics to derive the topology that performs best.

D. EXPLAINABLE AI
While deep learning approaches often lead to impressive predictive ability, many algorithms are black boxes that provide no reasoning for their classification. The aim of Explainable AI (XAI) is to extract meaning from algorithms and provide readable interpretations of why a prediction or decision is being made [29]. Regarding the experiments in this work, the CNN simply predicts that an image is real or synthetic, and XAI is then used to provide interpretations as to why the image is predicted to be real or synthetic.

Given that the literature shows that humans have major difficulty in recognising synthetic imagery, it is important to display and visualise minor defects within the image that could suggest that it is not real.

The method selected for XAI and interpretation is Gradient Class Activation Mapping (Grad-CAM) [30]. Grad-CAM interprets the gradients of the predicted class along with the CNN feature maps, which can therefore be spatially localised with respect to the original input (the image) to produce a heatmap. This is generated with the Rectified Linear Unit (ReLU) function as:

L^c_{Grad\text{-}CAM} = \mathrm{ReLU}\left( \sum_k \alpha_k A^k \right),   (9)

where \alpha_k is the global average pooling \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{i,j}} over the spatial locations Z, and \frac{\partial y^c}{\partial A^k_{i,j}} are the gradients of the model.

The approach is used for interpretation in the final step of this study, given random data selected from the two classes. Due to the nature of heatmapping, the results of the algorithm are visually interpreted with discussion.

E. EXPERIMENTAL HARDWARE AND SOFTWARE
The neural networks used for the detection of AI-generated images were engineered with the TensorFlow library [31]. All TensorFlow seeds were set to 1 for replicability. The Latent Diffusion model used for the generation of synthetic data was Stable Diffusion version 1.4 [2]; random seed vectors were denoised for a total of 50 steps to form images, and the Euler Ancestral scheduler was used. Synthetic images were rendered at a resolution of 512 px before resizing to 32 px by bilinear interpolation to match the resolution of CIFAR-10.

All algorithms in this study were executed using a single Nvidia RTX 3080Ti GPU, which has 10,240 CUDA cores, a clock speed of 1.67 GHz, and 12 GB of GDDR6X VRAM.
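To make the interpretation step of Section III-D concrete before moving to the results, the following is a minimal Grad-CAM sketch in TensorFlow following Equation (9). It assumes the Keras model sketched in Section III-C, with its last convolutional layer identified by name; the layer name and normalisation details are illustrative assumptions rather than the authors' published code.

```python
# Minimal Grad-CAM sketch (assumed details) for the binary real-vs-fake CNN.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    """Return a heatmap (H, W) per Eq. (9): ReLU of the alpha-weighted feature maps."""
    conv_layer = model.get_layer(conv_layer_name)
    grad_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])

    x = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.float32)  # add batch dim
    with tf.GradientTape() as tape:
        feature_maps, prediction = grad_model(x)
        score = prediction[:, 0]              # single Sigmoid output (likelihood of REAL)

    grads = tape.gradient(score, feature_maps)            # dy^c / dA^k
    alphas = tf.reduce_mean(grads, axis=(1, 2))            # global average pooling -> alpha_k
    heatmap = tf.nn.relu(tf.einsum("bijk,bk->bij", feature_maps, alphas))[0]
    return (heatmap / (tf.reduce_max(heatmap) + 1e-8)).numpy()  # normalise to [0, 1]

# Usage (layer name assumed): heatmap = grad_cam(model, test_image, "conv2d_1")
# The heatmap is then upscaled to 32x32 and overlaid on the input image for inspection.
```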


IV. RESULTS AND OBSERVATIONS
This section presents examples of the dataset, followed by the findings of the planned computer vision experiments. The dataset is also released to the public research community for use in future studies, given the important implications of detecting AI-generated imagery (the dataset can be downloaded from https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images).

A. DATASET EXPLORATION
Random samples of the images used in this study and within the dataset provided can be observed in Figure 2. Five images are presented for each class label, and all of the images within this figure are synthetic, having been generated by the SDM. Note within this sample that the images are high-quality and, for the most part, seem to be difficult to discern as synthetic by the human eye. The synthetic photographs are representative of their counterparts from reality and feature complex attributes such as depth of field, reflections, and motion blur.

FIGURE 2: Examples of AI-generated images within the dataset contributed by this study, selected at random with regard to their real CIFAR-10 equivalent labels (one column per class: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).

It can also be observed that there are visual imperfections within some of the images. Figure 3 shows a number of examples within the dataset in which the model has output images with visual glitches. Given that the LAION dataset provides physical descriptions of the image content, little to no information on text is provided, and thus it can be seen that the model produces shapes similar to alphabetic characters. Also observed here is a lack of important detail, such as the case of a jet aircraft that has no cockpit window; it seems that this image has been produced by combining the knowledge of jet aircraft (in particular, the engines) with the concept of an Unmanned Aerial Vehicle's chassis. Finally, there are also some cases of anatomical errors for living creatures, seen in these examples through the cat's limbs and eyes.

FIGURE 3: Examples of visual defects found within the synthetic image dataset.

Complex visual concepts are present within much of the dataset, with examples shown in Figure 4. Observe that the ripples in the water and the reflections of the entities are highly realistic and match what would be expected within a photograph. In addition to complex lighting, there is also evidence of depth of field and photographic framing.

B. CLASSIFICATION RESULTS
In this subsection, we present the results of the computer vision experiments for image classification. The problem faced by the CNN is that of binary classification: whether the image is a real photograph or the output of an LDM.

The validation accuracy and loss metrics for the feature extractors can be found in Tables 2 and 3, respectively. All feature extractors scored relatively well without the need for dense layers to process feature maps, with an average classification accuracy of 91.79%. The lowest-loss feature extractor was found to use two layers of 32 filters, which led to an overall classification accuracy of 92.93% and a binary cross-entropy loss of 0.18. The highest-accuracy model, two layers of 128 filters, scored 92.98% with a loss of 0.221.

Extended validation metrics are presented in Tables 4, 5, and 6, which detail the validation precision, recall, and F1-scores, respectively. The F1 score, which is a unification of precision and recall, had a mean value of 0.929, with the highest being 0.936. A small standard deviation of 0.003 was observed.

Following these experiments, the lowest-loss feature extractor is selected for further engineering of the network topology. This was the model with two layers of 32 convolutional filters.

TABLE 2: Observed classification accuracy (%) for the feature extraction networks.
Filters   1 layer   2 layers   3 layers
16        90.06     91.46      91.63
32        90.38     92.93      92.54
64        90.94     92.71      92.38
128       91.39     92.98      92.07

TABLE 3: Observed validation loss for the filters within the Convolutional Neural Network.
Filters   1 layer   2 layers   3 layers
16        0.254     0.222      0.21
32        0.237     0.18       0.193
64        0.226     0.196      0.219
128       0.234     0.221      0.259

TABLE 4: Observed validation precision for the filters within the Convolutional Neural Network.
Filters   1 layer   2 layers   3 layers
16        0.903     0.941      0.921
32        0.878     0.923      0.937
64        0.908     0.947      0.936
128       0.92      0.948      0.94
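The feature-extractor benchmark summarised in Tables 2–6 can be reproduced with a simple grid over the filter counts and depths described in Section III-C. The sketch below is an assumed reconstruction of that loop: the directory layout, epoch count, batch size, and optimiser are not specified by the authors and are placeholders.

```python
# Assumed reconstruction of the feature-extractor grid search (Section III-C).
import tensorflow as tf
from tensorflow.keras import layers, models

# Directory layout assumed: cifake/train/{FAKE,REAL} and cifake/test/{FAKE,REAL}.
# Alphabetical folder order gives FAKE = 0 and REAL = 1, matching the paper's convention.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "cifake/train", image_size=(32, 32), batch_size=64, label_mode="binary")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "cifake/test", image_size=(32, 32), batch_size=64, label_mode="binary")

def build_extractor(n_filters, n_layers):
    """CNN feature extractor connected directly to the Sigmoid output neuron."""
    model = models.Sequential([layers.Rescaling(1.0 / 255, input_shape=(32, 32, 3))])
    for _ in range(n_layers):
        model.add(layers.Conv2D(n_filters, (3, 3), activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

results = {}
for n_filters in (16, 32, 64, 128):
    for n_layers in (1, 2, 3):
        tf.random.set_seed(1)
        model = build_extractor(n_filters, n_layers)
        history = model.fit(train_ds, validation_data=val_ds, epochs=10)  # epochs assumed
        results[(n_filters, n_layers)] = history.history["val_accuracy"][-1]
```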

FIGURE 4: A selection of AI-generated images within the dataset: examples of complex visual attributes generated by the latent diffusion model, including realistic water and reflections.

The results for the engineering of the overall network are presented in Tables 7 and 8, which contain the validation accuracy and loss, respectively. The lowest loss observed was 0.177 binary cross-entropy, when the CNN was followed by three layers of 64 rectified linear units. The highest accuracy, on the other hand, was 93.55%, which was achieved when implementing a single layer of 64 neurons.

Additional validation metrics for precision, recall, and F1 score are also provided in Tables 9, 10, and 11, respectively. Similarly to the prior experiments, the standard deviation of F1-scores was relatively low at 0.003. The highest F1 score was achieved by the network that used a single dense layer of 64 rectified linear units, with a value of 0.936. This highest-F1 model is graphically detailed in Figure 5 to provide a visual example of the network topology.

TABLE 5: Observed validation recall for the filters within the Convolutional Neural Network.
Filters   1 layer   2 layers   3 layers
16        0.897     0.885      0.911
32        0.938     0.936      0.912
64        0.92      0.904      0.91
128       0.906     0.909      0.898

TABLE 6: Observed validation F1-Score for the filters within the Convolutional Neural Network.
Filters   1 layer   2 layers   3 layers
16        0.9       0.912      0.916
32        0.907     0.93       0.924
64        0.91      0.925      0.923
128       0.913     0.928      0.919

TABLE 7: Observed validation accuracy (%) for the dense layers within the Convolutional Neural Network.
Neurons   1 layer   2 layers   3 layers
32        93.2      92.84      92.96
64        93.55     92.73      93.26
128       92.99     93.29      93.18
256       92.97     92.88      92.88
512       93.05     92.58      93.33
1024      92.9      92.91      92.75
2048      92.78     92.76      92.7
4096      92.62     92.52      92.88

TABLE 8: Observed validation loss for the dense layers within the Convolutional Neural Network.
Neurons   1 layer   2 layers   3 layers
32        0.186     0.182      0.187
64        0.182     0.193      0.177
128       0.187     0.183      0.178
256       0.187     0.192      0.194
512       0.188     0.193      0.184
1024      0.199     0.194      0.192
2048      0.194     0.2        0.219
4096      0.234     0.204      0.19

TABLE 9: Observed validation precision for the dense layers within the Convolutional Neural Network.
Neurons   1 layer   2 layers   3 layers
32        0.932     0.916      0.929
64        0.925     0.92       0.93
128       0.948     0.942      0.935
256       0.939     0.926      0.931
512       0.944     0.924      0.946
1024      0.933     0.94       0.939
2048      0.942     0.922      0.929
4096      0.926     0.914      0.923

FIGURE 5: An example of one of the final model architectures following the hyperparameter search for the classification of real or AI-generated images: Input (32×32×3) → Rescale → Conv2D (32 filters, ReLU) → Max Pool → Conv2D (32 filters, ReLU) → Max Pool → Flatten → Dense (64, ReLU) → Dense (Sigmoid).
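As a consistency check of Equation (8) using the tabulated values: the best dense configuration (a single layer of 64 units) has precision 0.925 (Table 9) and recall 0.948 (Table 10), giving

F1 = 2 \times \frac{0.925 \times 0.948}{0.925 + 0.948} \approx 0.936,

which matches the value reported in Table 11.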

TABLE 10: Observed validation recall for the dense layers within the Convolutional Neural Network.
Neurons   1 layer   2 layers   3 layers
32        0.932     0.943      0.93
64        0.948     0.936      0.935
128       0.91      0.923      0.973
256       0.919     0.932      0.926
512       0.915     0.928      0.919
1024      0.925     0.916      0.915
2048      0.912     0.934      0.925
4096      0.926     0.939      0.936

TABLE 11: Observed validation F1-Score for the dense layers within the Convolutional Neural Network.
Neurons   1 layer   2 layers   3 layers
32        0.932     0.929      0.93
64        0.936     0.928      0.933
128       0.928     0.932      0.932
256       0.929     0.929      0.929
512       0.929     0.926      0.932
1024      0.929     0.928      0.927
2048      0.927     0.928      0.927
4096      0.926     0.926      0.929

Figure 6 shows examples of the interpretation of predictions via Grad-CAM. Brighter pixels in the image represent areas that contribute more to the decision of the CNN. It can be observed that there is a significantly different distribution of features given the binary classification problem. Firstly, the classification of real images can be interpreted as a more holistic approach, in which the majority of the contents of the image are useful for recognition. However, the classification of synthetic images is somewhat more atomistic and sparse. Note that for the recognition of AI-generated imagery, activation occurs in select parts of the image that are more likely to present visual glitches that are difficult to recognise with the human eye. An example of this can be seen for the image of the frog, where an out-of-focus bokeh is the only attribute that suggests the image is not real. For the truck, only the radiator grille pattern is considered useful for classification.

The XAI approach also shows an interesting mechanic in a more general sense. Given the examples of airplane, bird, frog, horse, and ship, note that the object within the image has little to no class activation overlay whatsoever. This suggests that the actual focus of the image itself, the entity, contains almost no useful features for synthetic image recognition, and that the model is often able to produce a near-perfect representation of the entity.

FIGURE 6: Gradient class activation map (Grad-CAM) overlays and raw heatmaps for prediction interpretation. Top examples show real images and bottom examples show AI-generated images. Brighter pixels represent features contributing to the output class label.

V. CONCLUSION AND FUTURE WORK
This study has proposed a method to improve our waning ability to recognise AI-generated images through the use of computer vision, and to provide insight into predictions with visual cues. To achieve this, this study proposed the generation of a synthetic dataset with Latent Diffusion, recognition with Convolutional Neural Networks, and interpretation through Gradient Class Activation Mapping. The results showed that the synthetic images were high quality and featured complex visual attributes, and that binary classification could be achieved with around 92.98% accuracy. Grad-CAM interpretation revealed interesting concepts within the images that were useful for predictions.

In addition to the method proposed in this study, a significant contribution is made through the release of the CIFAKE dataset. The dataset contains a total of 120,000 images (60,000 real images from CIFAR-10 and 60,000 synthetic images generated for this study). The CIFAKE dataset provides the research community with a valuable resource for future work on the social problems raised by AI-generated imagery, and a significant expansion of the resources available for the development and testing of applied computer vision approaches to this problem.

The reality of AI generating images that are indistinguishable from real-life photographic images raises fundamental questions about the limits of human perception, and thus this study proposed to enhance that ability by fighting fire with fire. The proposed approach addresses the challenges of ensuring the authenticity and trustworthiness of visual data.

Future work could involve exploring other techniques for classification of the dataset provided. For example, the implementation of attention-based approaches is a promising new direction that could provide increased ability and an alternative method of explainable AI. Furthermore, with even further improvements to synthetic imagery in the future, it is important to consider updating the dataset with images generated by these newer approaches. In addition, generating images from other domains, such as human faces and clinical scans, would provide additional datasets for this type of study and expand the applicability of the proposed approach to other fields of research.

Finally, in conclusion, this study provides contributions regarding the ongoing implications of AI-generated images. The proposed approach supports the important goals of ensuring data authenticity and trustworthiness, providing not only a system that can recognise synthetic images but also data and interpretation. The public release of the CIFAKE dataset generated within this study provides a valuable resource for interdisciplinary research.

VI. AVAILABILITY OF DATA AND MATERIALS
The datasets generated and analysed during the current study are available in the CIFAKE repository, https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images.

REFERENCES
[1] K. Roose, "An AI-generated picture won an art prize. Artists aren't happy," The New York Times, 2022.

[2] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
[3] G. Pennycook and D. G. Rand, "The psychology of fake news," Trends in Cognitive Sciences, vol. 25, no. 5, pp. 388–402, 2021.
[4] B. Singh and D. K. Sharma, "Predicting image credibility in fake news over social media using multi-modal approach," Neural Computing and Applications, vol. 34, no. 24, pp. 21503–21517, 2022.
[5] N. Bonettini, P. Bestagini, S. Milani, and S. Tubaro, "On the use of Benford's law to detect GAN-generated images," in 2020 25th International Conference on Pattern Recognition (ICPR), pp. 5495–5502, IEEE, 2021.
[6] D. Deb, J. Zhang, and A. K. Jain, "AdvFaces: Adversarial face synthesis," in 2020 IEEE International Joint Conference on Biometrics (IJCB), pp. 1–10, IEEE, 2020.
[7] M. Khosravy, K. Nakamura, Y. Hirose, N. Nitta, and N. Babaguchi, "Model inversion attack: analysis under gray-box scenario on deep learning based face recognition system," KSII Transactions on Internet and Information Systems (TIIS), vol. 15, no. 3, pp. 1100–1118, 2021.
[8] J. J. Bird, A. Naser, and A. Lotfi, "Writer-independent signature verification; evaluation of robotic and generative adversarial attacks," Information Sciences, vol. 633, pp. 170–181, 2023.
[9] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, "Zero-shot text-to-image generation," in International Conference on Machine Learning, pp. 8821–8831, PMLR, 2021.
[10] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al., "Photorealistic text-to-image diffusion models with deep language understanding," arXiv preprint arXiv:2205.11487, 2022.
[11] P. Chambon, C. Bluethgen, C. P. Langlotz, and A. Chaudhari, "Adapting pretrained vision-language foundational models to medical imaging domains," arXiv preprint arXiv:2210.04133, 2022.
[12] F. Schneider, Z. Jin, and B. Schölkopf, "Moûsai: Text-to-music generation with long-context latent diffusion," arXiv preprint arXiv:2301.11757, 2023.
[13] F. Schneider, "ArchiSound: Audio generation with diffusion," Master's thesis, ETH Zurich, 2023.
[14] D. Yi, C. Guo, and T. Bai, "Exploring painting synthesis with diffusion models," in 2021 IEEE 1st International Conference on Digital Twins and Parallel Intelligence (DTPI), pp. 332–335, IEEE, 2021.
[15] C. Guo, Y. Dou, T. Bai, X. Dai, C. Wang, and Y. Wen, "Artverse: A paradigm for parallel human–machine collaborative painting creation in metaverses," IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023.
[16] Z. Sha, Z. Li, N. Yu, and Y. Zhang, "DE-FAKE: Detection and attribution of fake images generated by text-to-image diffusion models," arXiv preprint arXiv:2210.06998, 2022.
[17] R. Corvi, D. Cozzolino, G. Zingarini, G. Poggi, K. Nagano, and L. Verdoliva, "On the detection of synthetic images generated by diffusion models," arXiv preprint arXiv:2211.00680, 2022.
[18] I. Amerini, L. Galteri, R. Caldelli, and A. Del Bimbo, "Deepfake video detection through optical flow based CNN," in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
[19] D. Güera and E. J. Delp, "Deepfake video detection using recurrent neural networks," in 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6, IEEE, 2018.
[20] J. Wang, Z. Wu, W. Ouyang, X. Han, J. Chen, Y.-G. Jiang, and S.-N. Li, "M2TR: Multi-modal multi-scale transformers for deepfake detection," in Proceedings of the 2022 International Conference on Multimedia Retrieval, pp. 615–623, 2022.
[21] P. Saikia, D. Dholaria, P. Yadav, V. Patel, and M. Roy, "A hybrid CNN-LSTM model for video deepfake detection by leveraging optical flow features," in 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1–7, IEEE, 2022.
[22] H. Li, B. Li, S. Tan, and J. Huang, "Identification of deep network generated images using disparities in color components," Signal Processing, vol. 174, p. 107616, 2020.
[23] S. J. Nightingale, K. A. Wade, and D. G. Watson, "Can people identify original and manipulated photos of real-world scenes?," Cognitive Research: Principles and Implications, vol. 2, no. 1, pp. 1–21, 2017.
[24] A. Krizhevsky, G. Hinton, et al., "Learning multiple layers of features from tiny images," 2009.
[25] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al., "LAION-5B: An open large-scale dataset for training next generation image-text models," arXiv preprint arXiv:2210.08402, 2022.
[26] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[27] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, et al., "Recent advances in convolutional neural networks," Pattern Recognition, vol. 77, pp. 354–377, 2018.
[28] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, "A survey of convolutional neural networks: analysis, applications, and prospects," IEEE Transactions on Neural Networks and Learning Systems, 2021.
[29] D. Gunning, M. Stefik, J. Choi, T. Miller, S. Stumpf, and G.-Z. Yang, "XAI—Explainable artificial intelligence," Science Robotics, vol. 4, no. 37, p. eaay7120, 2019.
[30] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626, 2017.
[31] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.

JORDAN J. BIRD received his BSc and PhD in Computer Science from Aston University, UK. Currently, Dr. Bird is a Senior Lecturer in Computer Science at Nottingham Trent University, UK. His research interests are primarily in Artificial Intelligence (AI), focusing on Human-Robot Interaction (HRI), Machine Learning (ML), Deep Learning, Transfer Learning, and Data Augmentation. His professional academic contributions include roles as a Technical Programme Committee member and Deep Learning Session Chair at several international conferences. Dr. Bird has received significant external grant funding towards his research projects, which involve applications of Artificial Intelligence in the real world.

AHMAD LOTFI (M'96-SM'08) received his BSc and MTech in control systems from Isfahan University of Technology, Iran, and the Indian Institute of Technology Delhi, India, respectively. He received his PhD degree in Learning Fuzzy Systems from the University of Queensland, Australia, in 1995. He is currently a Professor of Computational Intelligence at Nottingham Trent University, where he leads the research group in Computational Intelligence and Applications. His research focuses on the identification of progressive changes in the behaviour of older individuals suffering from dementia or other cognitive impairments; accurate identification of progressive changes through the utilisation of unobtrusive sensor networks or robotic platforms will enable carers (formal and informal) to intervene when deemed necessary. Research collaboration is established with world-leading researchers, and he has worked with many healthcare commercial organisations and end-users. He has received external funding from Innovate UK, the EU, and industrial companies to support his research. He has authored and co-authored over 200 scientific papers in the areas of computational intelligence, the Internet of Things, abnormal behaviour recognition, and ambient intelligence in highly prestigious journals and international conferences. He has been invited as an Expert Evaluator and Panel Member for many European and international research programmes.
