
Received 15 December 2023, accepted 17 January 2024, date of publication 19 January 2024, date of current version 1 February 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3356122

CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images
JORDAN J. BIRD AND AHMAD LOTFI, (Senior Member, IEEE)
Department of Computer Science, Nottingham Trent University, NG1 4FQ Nottingham, U.K.
Corresponding author: Jordan J. Bird ([email protected])

ABSTRACT Recent advances in synthetic data have enabled the generation of images with such high quality
that human beings cannot distinguish the difference between real-life photographs and Artificial Intelligence
(AI) generated images. Given the critical necessity of data reliability and authentication, this article proposes
to enhance our ability to recognise AI-generated images through computer vision. Initially, a synthetic dataset
is generated that mirrors the ten classes of the already available CIFAR-10 dataset with latent diffusion,
providing a contrasting set of images for comparison to real photographs. The model is capable of generating
complex visual attributes, such as photorealistic reflections in water. The two sets of data present as a binary
classification problem with regard to whether the photograph is real or generated by AI. This study then
proposes the use of a Convolutional Neural Network (CNN) to classify the images into two categories:
Real or Fake. Following hyperparameter tuning and the training of 36 individual network topologies, the
optimal approach could correctly classify the images with 92.98% accuracy. Finally, this study implements
explainable AI via Gradient Class Activation Mapping to explore which features within the images are useful
for classification. Interpretation reveals interesting concepts within the image, in particular, noting that the
actual entity itself does not hold useful information for classification; instead, the model focuses on small
visual imperfections in the background of the images. The complete dataset engineered for this study, referred
to as the CIFAKE dataset, is made publicly available to the research community for future work.

INDEX TERMS AI-generated images, generative AI, image classification, latent diffusion.

I. INTRODUCTION

The field of synthetic image generation by Artificial Intelligence (AI) has developed rapidly in recent years, and the ability to detect AI-generated photos has also become a critical necessity to ensure the authenticity of image data. Within recent memory, generative technology often produced images with major visual defects that were noticeable to the human eye, but now we are faced with the possibility of AI models generating high-fidelity and photorealistic images in a matter of seconds. The AI-generated images are now at the quality level needed to compete with humans and win art competitions [1].

Latent Diffusion Models (LDMs), a type of generative model, have emerged as a powerful tool to generate synthetic imagery [2]. These recent developments have caused a paradigm shift in our understanding of creativity, authenticity, and truth. This has led to a situation where consumer-level technology is available that could quite easily be used for the violation of privacy and to commit fraud. These philosophical and societal implications are at the forefront of the current state of the art, raising fundamental questions about the nature of trustworthiness and reality. Recent technological advances have enabled the generation of images with such high quality that human beings cannot tell the difference between a real-life photograph and an image that is no more than a hallucination of an artificial neural network's weights and biases.

Generative imagery that is indistinguishable from photographic data raises questions both ontological, those which concern the nature of being, and epistemological, surrounding the theories of methods, validity, and scope. Ontologically, given that humans cannot tell the difference between images from cameras and those generated by AI models such as an Artificial Neural Network, in terms of digital information, what is real and what is not?

(The associate editor coordinating the review of this manuscript and approving it for publication was Yiqi Liu.)

The epistemological reality is that there are serious questions surrounding the reliability of human knowledge and the ethical implications that surround the misuse of these types of technology. The implications suggest that we are in growing need of a system that can aid us in the recognition of real images versus those generated by AI.

This study explores the potential of using computer vision to counteract our newfound inability to recognise the difference between real photographs and those that are AI-generated. Given that there are many years' worth of photographic datasets available for image classification, these provide examples for a model of real images. Following the generation of a synthetic equivalent to such data, we will then explore the output of the model before finally implementing methods of differentiation between the two types of image.

There are several scientific contributions with multidisciplinary and social implications that arise from this study. First, a dataset, called CIFAKE, is generated with latent diffusion and released to the research community. The CIFAKE dataset provides a contrasting set of real and fake photographs and contains 120,000 images: 60,000 images from the existing CIFAR-10 dataset (a collection of images commonly used to train machine learning and computer vision algorithms, available from https://www.cs.toronto.edu/~kriz/cifar.html) and 60,000 images generated for this study, making it a valuable resource for researchers in the field. Second, this study proposes a method to improve our waning human ability to recognise AI-generated images through computer vision, using the CIFAKE dataset for classification. Finally, this study proposes the use of Explainable AI (XAI) to further our understanding of the complex processes involved in synthetic image recognition, as well as visualisation of the important features within those images. These scientific contributions provide important steps forward in addressing the modern challenges posed by the rapid development of modern technology and have important implications for ensuring the authenticity and trustworthiness of data.

The remainder of this article is as follows: the state-of-the-art research background is initially explored in Section II with a discussion of relevant related studies in the field. Following this, the methodology followed by this study is detailed in Section III, which provides the technical implementation and the method followed for the binary classification of real versus AI-generated imagery. The results of these experiments are presented with discussion in Section IV before this work is finally concluded and future work is proposed in Section V.

II. BACKGROUND
The ability to distinguish between real imagery and images generated by machine learning models is important for a number of reasons. Identification of real data provides confirmation of the authenticity and originality of the image; for example, a fine-tuned Stable Diffusion Model (SDM) could be used to generate a synthetic photograph of an individual committing a crime or, vice versa, provide false evidence of an alibi for a person who was, in reality, otherwise elsewhere. Misinformation and fake news are a significant modern problem, and machine-generated images could be used to manipulate public opinion [3], [4]. Situations where synthetic imagery is used in fake news can promote its false credibility and have serious consequences [5]. Cybersecurity is another major concern, with research noting that synthetically generated human faces can be used in false acceptance attacks and have the potential to gain unauthorised access to digital systems [6], [7]. In [8], it was observed that synthetically generated signatures could overcome signature verification systems with ease.

Latent Diffusion Models are a new approach to generating images, which use attention mechanisms and a U-Net to reverse a Gaussian noising process and, ultimately, use text conditioning to generate novel images from random noise. Details on the methodological implementation of LDMs can be found in Section III. The approach is rapidly developing but young, and thus the literature on the subject is currently scarce and few applications have been explored. Examples of notable models include Dall-E by OpenAI [9], Imagen from Google [10], and the open source equivalent, SDM from StabilityAI [2]. These models have pushed the boundaries of image quality, both in realism and arguably in artistic ability, and this has led to much debate about the professional, social, ethical, and legal considerations of the technology [1].

The majority of research in the field is cutting-edge and is presented as preprints and recent theses. In [11], researchers proposed to train SDM on medical imaging data, achieving higher-quality images that could potentially lead to increased model abilities through data augmentation. It is worth mentioning that in [12] and [13], diffusion models were found to have the ability to generate audio and images. In 2021, the results of Yi et al. [14] suggested that diffusion models were highly capable of generating realistic artworks, fooling human subjects into believing that the works were created by human beings. Given this, researchers have noted that diffusion models have a promising capacity for co-creating with human artists [15].

DE-FAKE, proposed by Sha et al. [16], shows that images generated by various latent diffusion approaches may contain digital fingerprints that suggest they are synthetic. Although visual glitches are increasingly rare given the advances in model quality, it may be possible that computer vision approaches will detect these attributes within images where the human eye cannot. The Fourier transforms presented in [17] show visual examples of these digital fingerprints.

When discussing the topic of vision, the results in [18] suggest that optical flow techniques could detect synthetic human faces within the FaceForensics dataset with around 81.61% accuracy. Extending to the temporal domain, [19] proposes recurrence in AI-generated video recognition, achieving 97.1% accuracy over 80 frames due to minor visual glitches at the pixel scale. In Wang et al. [20],

EfficientNets and Vision Transformers are proposed within a system that can detect forged images by adversarial models at an F1 score of 0.88 and AUC of 0.95, competing with the state of the art on the DeepFake Detection Challenge dataset while remaining efficient. In the aforementioned study, a Convolutional Neural Network was used to extract features, similarly to the approach proposed in this study, prior to processing using attention-based approaches. Similarly, convolutional and temporal techniques were proposed in [21] to achieve 66.26% to 91.21% accuracy in a mixed set of synthetic data detection datasets. Chrominance components CbCr within a digital image were noted in [22] as a promising route for the detection of minor pixel disparities that are sometimes present within synthetic images.

Human recognition of manipulation within images is waning as a direct result of image generation methods improving. A study by Nightingale et al. [23] in 2017 suggested that humans have difficulty recognising photographs that have been edited using image processing techniques. Since this study, there have been nearly five years of rapid development in the field.

Reviewing the relevant literature has highlighted rapid developments within AI-generated imagery and the challenges posed today with respect to its detection. Generative models have enabled the generation of high-fidelity, photorealistic images within a matter of seconds that humans often cannot distinguish from reality. This conclusion sets the stage for the studies presented by this work and argues the need to fill the knowledge gap when it comes to the availability of examples of synthetic data.

III. METHOD
This section describes the methods followed by this study in terms of their technical implementation and application for the detection of synthetic images. The section first describes the collection of the real data, and then the methodology followed to generate the synthetic equivalent for comparison. Sections III-A and III-B will describe how 60,000 images are collected and 60,000 images are synthetically generated, respectively. This forms the overall dataset of 120,000 images. Section III-C will then describe the machine learning model engineering approach which aims to recognise the authenticity of the images, before Section III-D notes the approach for Explainable AI to interpret model predictions.

A. REAL DATA COLLECTION
For the class label ''REAL'', interpreted as a positive class value ''1'', data is collected from the CIFAR-10 dataset [24]. It is a dataset of 60,000 32 × 32 RGB images of real subjects divided into ten classes. Classes within the dataset are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. There are 6,000 images per class. For each class, 5,000 images are used for training and 1,000 for testing, i.e., a testing dataset of 16.6%. In this study, all images from the training dataset are used for the training of the positive class ''REAL''. Therefore, 50,000 are used for training and 10,000 for testing. Samples of images within the CIFAR-10 dataset that form the ''REAL'' class can be observed in Figure 1.

B. SYNTHETIC DATA GENERATION
The synthetic images generated for this study use CompVis SD (https://huggingface.co/CompVis/stable-diffusion-v1-4), an open source LDM. The goal is to model the diffusion of image data through a latent space given a textual context. If noise, such as that of a Gaussian distribution, is iteratively added to an image, the image ultimately becomes noise and all prior visual information is lost. To generalise, the reverse of this process is, therefore, to generate a synthetic image from noise. The method of reverse diffusion can be put simply as: given an image $x$ at timestep $t$, $x_t$, output the prediction of $x_{t-1}$ through the prediction of noise and its subsequent removal by classical means.

A noisy image $x_t$ is generated from the original $x_0$ by the following:

$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$, (1)

where the noise is $\varepsilon$, and the adjustment according to the time step $t$ is $\bar{\alpha}_t$. The method of this study is to make use of the reverse process over 50 noising steps, which from $x_{50}$ will ultimately form $x_0$, the synthetic image. The neural network $\varepsilon_\theta$ thus minimises the following loss function:

$\mathrm{Loss} = \mathbb{E}_{t, x_0, \varepsilon}\left[\, \lVert \varepsilon - \varepsilon_\theta(x_t, t) \rVert^2 \,\right]$. (2)

Further technical details on the approach can be obtained from [2]. The model chosen for this approach is Stable Diffusion 1.4, which is trained on the LAION2B-en, LAION-high-resolution, and LAION-aesthetics v2.5+ datasets (https://laion.ai/blog/laion-5b/). The aforementioned datasets are a cleaned subset of the original LAION-5B dataset [25], which contains 5.85 billion text-image pairs.

SDM is used to generate a synthetic equivalent to the CIFAR-10 dataset, which contains 6,000 images for each of the ten classes. The classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Following observations from the CIFAR-10 dataset, this study implements prompt modifiers to increase the diversity of the synthetic dataset, which can be observed in Table 1. As in the real dataset, 50,000 images are used for training data and 10,000 for testing data, provided with a class label to indicate that the image is not real.
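To make this generation step concrete, the following is a minimal sketch of how one CIFAKE-style image could be produced. It assumes the Hugging Face diffusers library (the paper does not state which implementation of Stable Diffusion v1.4 was used) and applies the settings reported in Section III-E: 50 denoising steps, the Euler Ancestral scheduler, and bilinear downsampling of the 512px output to 32 × 32. The prompt string is a hypothetical example of the ''a photograph of {a/an}'' template with a Table 1-style modifier.

```python
# Sketch: generating one CIFAKE-style synthetic image with Stable Diffusion v1.4.
# Assumes the Hugging Face diffusers library; not necessarily the authors' implementation.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# Euler Ancestral scheduler and 50 denoising steps, as reported in Section III-E.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

prompt = "a photograph of a horse in a field"       # hypothetical prompt/modifier example
generator = torch.Generator("cuda").manual_seed(1)  # fixed seed for replicability
image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]  # 512x512 PIL image

# Downsample to the CIFAR-10 resolution by bilinear interpolation.
image_32 = image.resize((32, 32), resample=Image.BILINEAR)
image_32.save("fake_horse_0001.png")
```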

FIGURE 1. Examples of images from the CIFAR-10 image classification dataset [24].

TABLE 1. Latent diffusion prompt modifiers for generating the 10-class synthetic dataset. All prompts are preceded by ''a photograph of {a/an}'' and modifiers are used equally for the 6,000 images.

C. IMAGE CLASSIFICATION
Image classification is an algorithm that predicts a class label given an input image. The learnt features are extracted from the image and processed in order to provide an output, in this case, whether or not the image is real or synthetic. This subsection describes the selected approach to classification.

In this study, the Convolutional Neural Network (CNN) [26], [27], [28] is employed to learn from the input images. It is the concatenation of two main networks with intermediate operations: the convolutional layers and the fully connected layers. The initial convolutional network within the overall model can be operationally generalised for an image $x$ and a filter matrix $w$ as follows:

$(x * w)(i, j) = \sum_{m=1}^{M} \sum_{n=1}^{N} x(i + m - 1, j + n - 1)\, w(m, n)$, (3)

where $(i, j)$ is the output location in the feature map, and $(m, n)$ represents the location within the filter $w$. The output is derived by applying convolutional operations to the input $x$ with each of the (learnable) filters and applying an activation function $f$, which, in the context of this study, is the Rectified Linear Unit (ReLU) $f(x) = \max(0, x)$.

For an image of dimensions (height, width) and a filter kernel of dimensions ($\text{height}_{kernel}$, $\text{width}_{kernel}$), with a stride of 1 and no padding for simplicity, the output would have dimensions:

$(\text{height} - \text{height}_{kernel} + 1,\ \text{width} - \text{width}_{kernel} + 1)$. (4)

Then, a pooling operation is performed to reduce the spatial dimensions, and the output is flattened so it can be entered into densely connected layers. For $L = HWD$ (height, width, and depth), the flattened one-dimensional output vector is simply $x = [x_1, x_2, \ldots, x_L]$. The output vector $y$ of the dense layer(s) is ultimately $y = f(Wx + b)$, for the weight matrix $W$ and the bias $b$. The activation function $f$ in this study, as in the CNN, is the ReLU activation function $f(x) = \max(0, x)$.

The goal of the network in this study is to classify whether the image is a real photograph or an image generated by an LDM, and thus this is a problem of binary classification. Therefore, the output of the network is a single neuron with the S-shaped Sigmoid activation function:

$\sigma(x) = \frac{1}{1 + e^{-x}}$. (5)

The ''FAKE'' class is 0 and the ''REAL'' class is 1; therefore, a value closer to either of the two values represents the likelihood of that class. Although this aids in learning, because it is differentiable, the values are rounded to the closest value for inference.

Although the goal of the network is to use backpropagation to reduce binary cross-entropy loss, this study also notes an extended number of classification metrics. These are the Precision, which is a measure of how many of the predicted positive cases are positive, a metric which allows for the analysis of false positives:

$\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}}$; (6)

the Recall, which is a measure of how many positive cases are correctly predicted, enabling analysis of false-negative predictions:

$\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$. (7)

This measure is particularly important in this case, as it is in fraud detection, since a false negative would falsely accuse the author of generating their image with AI. Finally, the F-1 score is considered:

$\text{F1 score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$, (8)

which is a unified metric of precision and recall.

The dataset that forms the classification is the collection of real images and the equivalent synthetic images generated, detailed in Sections III-A and III-B, respectively. 100,000 images are used for training (50,000 real images and 50,000 synthetic images), and 20,000 are used for testing (10,000 real and 10,000 synthetic).

Initially, CNN architectures are benchmarked as a lone feature extractor. That is, filter counts of {16, 32, 64, 128} are benchmarked in layers of {1, 2, 3}, flattened, and connected directly to the output neuron. The topology of the highest performing feature extractor is then used to compare dense networks featuring {32, 64, 128, 256, . . . , 4096} rectified linear units in layers of {1, 2, 3}. These 36 artificial neural networks are then compared with regard to classification metrics to derive the topology that performs best.
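As a concrete illustration of this search, the sketch below enumerates the convolutional configurations in TensorFlow/Keras (the library reported in Section III-E); the dense-layer search that follows it takes the same form. The directory layout, 2 × 2 pooling, optimiser, and epoch count are assumptions made for illustration, as the paper does not report them.

```python
# Sketch: benchmarking CNN feature extractors over the {16, 32, 64, 128} filter and
# {1, 2, 3} layer grid described in Section III-C. Folder layout, pooling size,
# optimiser and epoch count are assumptions, not reported settings.
import tensorflow as tf

tf.keras.utils.set_random_seed(1)  # seeds fixed to 1, as stated in Section III-E

train_ds = tf.keras.utils.image_dataset_from_directory(
    "cifake/train", label_mode="binary", image_size=(32, 32), batch_size=128)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "cifake/test", label_mode="binary", image_size=(32, 32), batch_size=128)

def build_feature_extractor(filters: int, conv_layers: int) -> tf.keras.Model:
    """Convolutional blocks flattened directly onto a single sigmoid output neuron."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(32, 32, 3)),
                                 tf.keras.layers.Rescaling(1.0 / 255)])
    for _ in range(conv_layers):
        model.add(tf.keras.layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(tf.keras.layers.MaxPooling2D((2, 2)))          # assumed 2x2 pooling
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))    # FAKE = 0, REAL = 1
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

val_loss = {}
for filters in [16, 32, 64, 128]:
    for depth in [1, 2, 3]:
        model = build_feature_extractor(filters, depth)
        history = model.fit(train_ds, validation_data=test_ds, epochs=20, verbose=0)
        val_loss[(filters, depth)] = min(history.history["val_loss"])

best_filters, best_depth = min(val_loss, key=val_loss.get)  # lowest-loss configuration
```

In the experiments reported in Section IV, this procedure selected two layers of 32 filters as the lowest-loss feature extractor.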

FIGURE 2. Examples of AI-generated images within the dataset contributed by this study, selected at random with regards to their real
CIFAR-10 equivalent labels.

D. EXPLAINABLE AI
While deep learning approaches often lead to impressive predictive ability, many algorithms are black boxes that provide no reasoning for their classification. The aim of Explainable AI (XAI) is to extract meaning from algorithms and provide readable interpretations of why a prediction or decision is being made [29]. Regarding the experiments in this work, the CNN simply predicts whether an image is real or synthetic, and XAI is then used to provide interpretations as to why the image is real or synthetic.

Given that the literature shows that humans have major difficulty in recognising synthetic imagery, it is important to display and visualise minor defects within the image that could suggest that it is not real.

The method selected for explainable AI and interpretation is Gradient Class Activation Mapping (Grad-CAM) [30]. Grad-CAM interprets the gradients of the predicted class with respect to the CNN feature maps, which can therefore be spatially localised with respect to the original input (the image) to produce a heatmap. This is generated through the Rectified Linear Unit (ReLU) function as:

$L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_{k} \alpha_k A^k\right)$, (9)

where $\alpha_k$ is the global average pooling $\frac{1}{Z}\sum_{i}\sum_{j} \frac{\partial y^c}{\partial A^k_{i,j}}$ over the $Z$ spatial locations, and $\frac{\partial y^c}{\partial A^k_{i,j}}$ are the gradients of the model.

The approach is used for interpretation in the final step of this study, given random data selected from the two classes. Due to the nature of heatmapping, the results of the algorithm are visually interpreted with discussion.

E. EXPERIMENTAL HARDWARE AND SOFTWARE
The neural networks used for the detection of AI-generated images were engineered with the TensorFlow library [31]. All TensorFlow seeds were set to 1 for replicability. The Latent Diffusion model used for the generation of synthetic data was Stable Diffusion version 1.4 [2]; random seed vectors were denoised for a total of 50 steps to form images, and the Euler Ancestral scheduler was used. Synthetic images were rendered at a resolution of 512px before resizing to 32px by bilinear interpolation to match the resolution of CIFAR-10.

All algorithms in this study were executed using a single Nvidia RTX 3080Ti GPU, which has 10,240 CUDA cores, a clock speed of 1.67 GHz, and 12GB of GDDR6X VRAM.
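The Grad-CAM computation of Section III-D can be sketched directly against the trained Keras model. The implementation below is a minimal illustration of Equation (9); the layer name passed in for the last convolutional layer is an assumed placeholder rather than a name taken from the paper.

```python
# Sketch: Grad-CAM heatmap for the binary real/fake CNN, following Eq. (9).
# The convolutional layer name is an assumed placeholder for the model's last conv layer.
import numpy as np
import tensorflow as tf

def grad_cam(model: tf.keras.Model, image: np.ndarray, conv_layer_name: str) -> np.ndarray:
    """Return a heatmap in [0, 1] over the feature-map grid for one 32x32x3 image."""
    conv_layer = model.get_layer(conv_layer_name)
    grad_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])

    x = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        feature_maps, prediction = grad_model(x)
        score = prediction[:, 0]                    # single sigmoid output; closer to 1 => "REAL"

    grads = tape.gradient(score, feature_maps)      # dy^c / dA^k over the feature maps
    alpha = tf.reduce_mean(grads, axis=(1, 2))      # global average pooling over i, j
    cam = tf.nn.relu(tf.reduce_sum(alpha[:, None, None, :] * feature_maps, axis=-1))
    cam = cam[0] / (tf.reduce_max(cam) + 1e-8)      # normalise to [0, 1]
    return cam.numpy()

# Usage sketch: heatmap = grad_cam(trained_model, test_image, "conv2d_1"), after which the
# heatmap is upsampled to 32x32 and overlaid on the input to produce Figure 6 style plots.
```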


IV. RESULTS AND OBSERVATIONS
This section presents examples of the dataset followed by the findings of the planned computer vision experiments. The dataset is also released to the public research community for use in future studies, given the important implications of detecting AI-generated imagery (the dataset can be downloaded from: https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images).

A. DATASET EXPLORATION
Random samples of images used in this study and within the dataset provided can be observed in Figure 2. Five images are presented for each class label, and all of the images within this figure are synthetic, which have been generated by the SDM. Note within this sample that the images are high-quality and, for the most part, seem to be difficult to discern as synthetic by the human eye. Synthetic photographs are representative of their counterparts from reality and feature complex attributes such as depth of field, reflections, and motion blur.

FIGURE 3. Examples of visual defects found within the synthetic image dataset.

It can also be observed that there are visual imperfections within some of the images. Figure 3 shows a number of examples within the dataset in which the model has output images with visual glitches. Given that the LAION dataset provides physical descriptions of the image content, little to no information on text is provided, and thus it can be seen that the model produces shapes similar to alphabetic characters. Also observed here is a lack of important detail, such as the case of a jet aircraft that has no cockpit window. It seems that this image has been produced by combining the knowledge of jet aircraft (in particular, the engines) with the concept of an Unmanned Aerial Vehicle's chassis. Finally, there are also some cases of anatomical errors for living creatures, seen in these examples through the cat's limbs and eyes.

Complex visual concepts are present within much of the dataset, with examples shown in Figure 4. Observe that the ripples in the water and reflections of the entities are highly realistic and match what would be expected within a photograph. In addition to complex lighting, there is also evidence of depth of field and photographic framing.

B. CLASSIFICATION RESULTS
In this subsection, we present the results for the computer vision experiments for image classification. The problem faced by the CNN is that of binary classification, whether or not the image is a real photograph or the output of an LDM.

TABLE 2. Observed classification accuracy metrics for feature extraction networks.
TABLE 3. Observed validation loss for the filters within the convolutional neural network.

The validation accuracy of the results and the loss metrics for the feature extractors can be found in Tables 2 and 3, respectively. All feature extractors scored relatively well without the need for dense layers to process feature maps, with an average classification accuracy of 91.79%. The lowest loss feature extractor was found to use two layers of 32 filters, which led to an overall classification accuracy of 92.93% and a binary cross-entropy loss of 0.18. The highest accuracy model, two layers of 128 filters, scored 92.98% with a loss of 0.221.

TABLE 4. Observed validation precision for the filters within the convolutional neural network.
TABLE 5. Observed validation recall for the filters within the convolutional neural network.
TABLE 6. Observed validation F1-Score for the filters within the convolutional neural network.

Extended validation metrics are presented in Tables 4, 5, and 6, which detail validation precision, recall, and F1 scores, respectively. The F1 score, which is a unification of precision and recall, had a mean value of 0.929, with the highest being 0.936. A small standard deviation of 0.003 was observed.

Following these experiments, the lowest-loss feature extractor is selected for further engineering of the network topology. This was the model that had two layers of 32 convolutional filters.
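As an illustration of how the extended validation metrics reported in Tables 2–6 could be gathered for a trained model, the sketch below evaluates precision, recall, and F1 on the held-out test split. The saved-model path and the REAL/FAKE folder layout are assumptions; pixel scaling is assumed to be handled inside the saved model (e.g., a Rescaling layer).

```python
# Sketch: evaluating a trained real/fake classifier with the metrics of Eqs. (6)-(8)
# on the 20,000-image test split. Paths and folder layout are assumed, not reported.
import tensorflow as tf

model = tf.keras.models.load_model("cifake_cnn.keras")   # assumed saved-model path

test_ds = tf.keras.utils.image_dataset_from_directory(
    "cifake/test",            # assumed layout: cifake/test/{FAKE,REAL}/*.png
    label_mode="binary",      # alphabetical order gives FAKE = 0, REAL = 1
    image_size=(32, 32),
    batch_size=128,
    shuffle=False,
)

precision = tf.keras.metrics.Precision()
recall = tf.keras.metrics.Recall()
for images, labels in test_ds:
    probs = model(images, training=False)   # sigmoid outputs in [0, 1]
    preds = tf.round(probs)                 # round to the closest class, as in Section III-C
    precision.update_state(labels, preds)
    recall.update_state(labels, preds)

p, r = precision.result().numpy(), recall.result().numpy()
f1 = 2 * p * r / (p + r)                    # Eq. (8)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```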


FIGURE 4. A selection of AI-generated images within the dataset. Examples of complex visual attributes
generated by the latent diffusion model that include realistic water and reflections.

TABLE 7. Observed validation accuracy for the dense layers within the convolutional neural network.
TABLE 8. Observed validation loss for the dense layers within the convolutional neural network.
TABLE 9. Observed validation precision for the dense layers within the convolutional neural network.
TABLE 10. Observed validation recall for the dense layers within the convolutional neural network.

TABLE 11. Observed validation F1-Score for the dense layers within the convolutional neural network.

The results of the general network engineering are presented in Tables 7 and 8, which contain the validation accuracy and loss, respectively. The lowest loss observed was 0.177 binary cross-entropy when the CNN was followed by three layers of 64 rectified linear units. The highest accuracy, on the other hand, was 93.55%, which was achieved by implementing a single layer of 64 neurons.

Additional validation metrics for precision, recall, and F-1 score are also provided in Tables 9, 10, and 11, respectively. Similarly to the prior experiments, the standard deviation of F1-scores was relatively low at 0.003. The highest F-1 score was achieved by the network that used a single dense layer of 64 rectified linear units, with a value of 0.936. The aforementioned highest F1 score model is graphically detailed in Figure 5 to provide a visual example of the network topology.
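For illustration, the highest-F1 topology described above and depicted in Figure 5 (two convolutional layers of 32 filters followed by a single dense layer of 64 rectified linear units and a sigmoid output) might be written as follows; the kernel and pooling sizes are assumptions, since the paper does not list them.

```python
# Sketch: the highest-F1 topology of Section IV-B (two conv layers of 32 filters,
# one dense layer of 64 ReLU units, single sigmoid output). Kernel/pooling sizes assumed.
import tensorflow as tf

final_model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # FAKE = 0, REAL = 1
])
final_model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)
```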


FIGURE 5. An example of one of the final model architectures following hyperparameter search for the classification of
real or AI-generated images.

FIGURE 6. Gradient class activation maps (Grad-CAM) overlays and raw heatmaps for prediction interpretation. Top examples show real images and bottom examples show AI-generated images. Brighter pixels represent features contributing to the output class label.

Figure 6 shows examples of the interpretation of predictions via Grad-CAM. Brighter pixels in the image represent areas that contribute more to the decision of the CNN. It can be observed that there is a significantly different distribution of features given the binary classification problem. Firstly, the classification of real images can be interpreted as a more holistic approach in which the majority of the contents of the image are useful for recognition. However, the classification of synthetic images is somewhat more atomistic and sparse. Note that for the recognition of AI-generated imagery, activation occurs in select parts of the image that are more likely to present visual glitches that are difficult to recognise with the human eye. An example of this can be seen for the image of the frog, where an out-of-focus bokeh is the only attribute that suggests the image is not real. For the truck, only the radiator grille pattern is considered useful for classification.

The XAI approach also shows an interesting mechanic in a more general sense. Given the examples of airplane, bird, frog, horse, and ship, note that the object within the image has little to no class activation overlay whatsoever. This suggests that the actual focus of the image itself, the entity, contains almost no useful features for synthetic image recognition, and that the model is often able to produce a near-perfect representation of the entity.

V. CONCLUSION AND FUTURE WORK
This study has proposed a method to improve our waning ability to recognise AI-generated images through the use of Computer Vision and to provide insight into predictions with visual cues. To achieve this, this study proposed the generation of a synthetic dataset with Latent Diffusion, recognition with Convolutional Neural Networks, and interpretation through Gradient Class Activation Mapping. The results showed that the synthetic images were of high quality and featured complex visual attributes, and that binary classification could be achieved with around 92.98% accuracy. Grad-CAM interpretation revealed interesting concepts within the images that were useful for predictions.

In addition to the method proposed in this study, a significant contribution is made through the release of the CIFAKE dataset. The dataset contains a total of 120,000 images (60,000 real images from CIFAR-10 and 60,000 synthetic images generated for this study). The CIFAKE dataset provides the research community with a valuable resource for future work on the social problems posed by AI-generated imagery, and significantly expands the resources available for the development and testing of applied computer vision approaches to this problem.

The reality of AI generating images that are indistinguishable from real-life photographic images raises fundamental questions about the limits of human perception, and thus this study proposed to enhance that ability by fighting fire with fire. The proposed approach addresses the challenges of ensuring the authenticity and trustworthiness of visual data.

Future work could involve exploring other techniques to classify the provided dataset. For example, the implementation of attention-based approaches is a promising new field that could provide increased ability and an alternative method of explainable AI. With further improvements to synthetic imagery in the future, it is also important to consider updating the dataset with images generated by these approaches. Furthermore, generating images from other domains, such as human faces and clinical scans, would provide additional datasets for this type of study and expand the applicability of our proposed approach to other fields of research.

Finally, in conclusion, this study provides contributions to addressing the ongoing implications of AI-generated images. The proposed approach supports the important goals of ensuring data authenticity and trustworthiness, providing not only a system that can recognise synthetic images, but also data and interpretation. The public release of the CIFAKE dataset generated within this study provides a valuable resource for interdisciplinary research.

VI. AVAILABILITY OF DATA AND MATERIALS
The datasets generated and analysed during the current study are available in the CIFAKE repository, https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images.


REFERENCES
[1] K. Roose, ''An AI-generated picture won an art prize. Artists aren't happy,'' New York Times, vol. 2, p. 2022, Sep. 2022.
[2] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, ''High-resolution image synthesis with latent diffusion models,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 10684–10695.
[3] G. Pennycook and D. G. Rand, ''The psychology of fake news,'' Trends Cogn. Sci., vol. 25, no. 5, pp. 388–402, May 2021.
[4] B. Singh and D. K. Sharma, ''Predicting image credibility in fake news over social media using multi-modal approach,'' Neural Comput. Appl., vol. 34, no. 24, pp. 21503–21517, Dec. 2022.
[5] N. Bonettini, P. Bestagini, S. Milani, and S. Tubaro, ''On the use of Benford's law to detect GAN-generated images,'' in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), Jan. 2021, pp. 5495–5502.
[6] D. Deb, J. Zhang, and A. K. Jain, ''AdvFaces: Adversarial face synthesis,'' in Proc. IEEE Int. Joint Conf. Biometrics (IJCB), Sep. 2020, pp. 1–10.
[7] M. Khosravy, K. Nakamura, Y. Hirose, N. Nitta, and N. Babaguchi, ''Model inversion attack: Analysis under gray-box scenario on deep learning based face recognition system,'' KSII Trans. Internet Inf. Syst., vol. 15, no. 3, pp. 1100–1118, Mar. 2021.
[8] J. J. Bird, A. Naser, and A. Lotfi, ''Writer-independent signature verification; evaluation of robotic and generative adversarial attacks,'' Inf. Sci., vol. 633, pp. 170–181, Jul. 2023.
[9] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, ''Zero-shot text-to-image generation,'' in Proc. Int. Conf. Mach. Learn., 2021, pp. 8821–8831.
[10] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, ''Photorealistic text-to-image diffusion models with deep language understanding,'' 2022, arXiv:2205.11487.
[11] P. Chambon, C. Bluethgen, C. P. Langlotz, and A. Chaudhari, ''Adapting pretrained vision-language foundational models to medical imaging domains,'' 2022, arXiv:2210.04133.
[12] F. Schneider, O. Kamal, Z. Jin, and B. Schölkopf, ''Moûsai: Text-to-music generation with long-context latent diffusion,'' 2023, arXiv:2301.11757.
[13] F. Schneider, ''ArchiSound: Audio generation with diffusion,'' M.S. thesis, ETH Zurich, Zürich, Switzerland, 2023.
[14] D. Yi, C. Guo, and T. Bai, ''Exploring painting synthesis with diffusion models,'' in Proc. IEEE 1st Int. Conf. Digit. Twins Parallel Intell. (DTPI), Jul. 2021, pp. 332–335.
[15] C. Guo, Y. Dou, T. Bai, X. Dai, C. Wang, and Y. Wen, ''ArtVerse: A paradigm for parallel human–machine collaborative painting creation in Metaverses,'' IEEE Trans. Syst., Man, Cybern., Syst., vol. 53, no. 4, pp. 2200–2208, Apr. 2023.
[16] Z. Sha, Z. Li, N. Yu, and Y. Zhang, ''DE-FAKE: Detection and attribution of fake images generated by text-to-image generation models,'' 2022, arXiv:2210.06998.
[17] R. Corvi, D. Cozzolino, G. Zingarini, G. Poggi, K. Nagano, and L. Verdoliva, ''On the detection of synthetic images generated by diffusion models,'' 2022, arXiv:2211.00680.
[18] I. Amerini, L. Galteri, R. Caldelli, and A. Del Bimbo, ''Deepfake video detection through optical flow based CNN,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1205–1207.
[19] D. Güera and E. J. Delp, ''Deepfake video detection using recurrent neural networks,'' in Proc. 15th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Nov. 2018, pp. 1–6.
[20] J. Wang, Z. Wu, W. Ouyang, X. Han, J. Chen, Y.-G. Jiang, and S.-N. Li, ''M2TR: Multi-modal multi-scale transformers for Deepfake detection,'' in Proc. Int. Conf. Multimedia Retr., Jun. 2022, pp. 615–623.
[21] P. Saikia, D. Dholaria, P. Yadav, V. Patel, and M. Roy, ''A hybrid CNN-LSTM model for video Deepfake detection by leveraging optical flow features,'' in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2022, pp. 1–7.
[22] H. Li, B. Li, S. Tan, and J. Huang, ''Identification of deep network generated images using disparities in color components,'' Signal Process., vol. 174, Sep. 2020, Art. no. 107616.
[23] S. J. Nightingale, K. A. Wade, and D. G. Watson, ''Can people identify original and manipulated photos of real-world scenes?'' Cognit. Res., Princ. Implications, vol. 2, no. 1, pp. 1–21, Dec. 2017.
[24] A. Krizhevsky and G. Hinton, ''Learning multiple layers of features from tiny images,'' 2009.
[25] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, ''LAION-5B: An open large-scale dataset for training next generation image-text models,'' 2022, arXiv:2210.08402.
[26] Y. LeCun, Y. Bengio, and G. Hinton, ''Deep learning,'' Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[27] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen, ''Recent advances in convolutional neural networks,'' Pattern Recognit., vol. 77, pp. 354–377, May 2018.
[28] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, ''A survey of convolutional neural networks: Analysis, applications, and prospects,'' IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 12, pp. 6999–7019, Dec. 2022.
[29] D. Gunning, M. Stefik, J. Choi, T. Miller, S. Stumpf, and G.-Z. Yang, ''XAI—Explainable artificial intelligence,'' Sci. Robot., vol. 4, no. 37, Dec. 2019, Art. no. eaay7120.
[30] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, ''Grad-CAM: Visual explanations from deep networks via gradient-based localization,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 618–626.
[31] M. Abadi et al., ''TensorFlow: Large-scale machine learning on heterogeneous systems,'' 2015. [Online]. Available: https://www.tensorflow.org

JORDAN J. BIRD received the B.Sc. and Ph.D. degrees in computer science from Aston University, U.K. He is currently a Senior Lecturer of computer science with Nottingham Trent University, U.K. He has received significant external grant funding toward his research projects, which involve applications of artificial intelligence in the real world. His research interests include artificial intelligence (AI), focusing on human–robot interaction (HRI), machine learning (ML), deep learning, transfer learning, and data augmentation. His professional academic contributions include roles as a technical program committee member and the deep learning session chair of several international conferences.

AHMAD LOTFI (Senior Member, IEEE) received the B.Sc. degree in control systems from the Isfahan University of Technology, Iran, the M.Tech. degree in control systems from the Indian Institute of Technology Delhi, India, and the Ph.D. degree in learning fuzzy systems from The University of Queensland, Australia, in 1995. He is currently a Professor of computational intelligence with Nottingham Trent University, where he is leading the research group in computational intelligence and applications. He has authored and coauthored over 200 scientific papers in the areas of computational intelligence, the Internet of Things, abnormal behavior recognition, and ambient intelligence in highly prestigious journals and international conferences. He has received external funding from Innovate UK, the EU, and industrial companies to support his research. His research interests include the identification of progressive changes in the behavior of older individuals suffering from dementia or other cognitive impairments; accurate identification of progressive changes through the utilization of unobtrusive sensor networks or robotics platforms will enable carers (formal and informal) to intervene when deemed necessary. Research collaboration is established with world-leading researchers, and he has been involved in collaboration with many healthcare commercial organizations and end-users. He has been invited as an expert evaluator and a panel member of many European and international research programs.
