An Efficient Deepfake Detection Using Robust Deep Learning Approach
Abdul Qadir
UET Taxila
Rabbia Mahum ( [email protected] )
UET Taxila
Mohammed A. El-Meligy
King Saud University
Adham E. Ragab
King Saud University
Abdulmalik AlSalman
King Saud University, P.O. Box 800, Riyadh 11421
Haseeb Hassan
Shenzhen Technology University (SZTU)
Research Article
DOI: https://doi.org/10.21203/rs.3.rs-3103257/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
The creation and manipulation of synthetic images have evolved rapidly, raising serious concerns about their effects on society. Although there have been various attempts to identify deepfake videos, these approaches are not universal. Identifying these misleading deepfakes is the first step in preventing them from spreading on social media sites. We introduce a unique deep-learning technique to identify fraudulent clips. Most current deepfake detectors focus on identifying face swapping, lip synchronization, expression modification, puppeteering, and other specific manipulations. However, finding a consistent basis for all forms of fake videos and images in real-time forensics is challenging. We propose a hybrid technique that takes successive targeted frames from input videos and feeds them to ResNet-Swish-BiLSTM, an optimized convolutional BiLSTM-based residual network, for training and classification. The proposed method helps identify artifacts in deepfake images that do not look real. To assess the robustness of our proposed model, we used the open DeepFake Detection Challenge (DFDC) dataset and the FaceForensics++ (FF++) deepfake collection. We achieved 96.23% accuracy on the FF++ dataset and 78.33% accuracy on the aggregated records from FF++ and DFDC. We performed extensive experiments and believe that our proposed method provides more significant results than existing techniques.
1. Introduction
In the modern world, falsified images and videos threaten society. Any audio or video clip can be artificially created. Owing to artificial intelligence (AI), especially machine learning (ML) methods, it is no longer easy to distinguish altered photos and videos from the originals [1]. A few conventional approaches are employed to alter photos and videos. Some are content-altering methods, whilst others rely on image-editing tools such as GIMP, Photoshop, and Canva. Deepfake, which is based on Deep Learning (DL) [2–5], is a strong competitor among the procedures for customizing video content. Deepfake is a term that evolved from the notions of deep learning and fakery. Deep neural networks (DNNs) have accelerated and simplified the process of creating compelling synthetic images and videos. Deepfake generation involves replacing a person's face in a video or image with another person's face using DL algorithms such as the generative adversarial network (GAN) [6][7]. Face swapping is a crucial step in establishing a fake video, as it exchanges a source face with a victim's face while preserving the victim's original movements and speech.
Generative Adversarial Networks (GANs) [8] are the primary driving force behind facial modification methods. As the use of StyleGAN [9] and StyleGAN2 [10] to synthesize images increases, it becomes increasingly difficult for the human visual system to distinguish them from genuine ones. Many parodies, pranks, and other social channels on YouTube, Instagram, Twitter, TikTok, and other streaming websites use GAN-based face-swapping techniques. Commercial smartphone apps like ZAO [11] and FaceApp [12] significantly speed up the adoption of these deepfake capabilities by making it natural for ordinary internet users to produce fake picture frames and clips. Deepfake records were initially easy to spot with the human eye, but as the technology advances, they have become increasingly difficult to differentiate from real images [13, 14]. Because of this, the frequency of their inappropriate use has increased, leading to the creation of numerous pornographic images of politicians and celebrities to disseminate propaganda and fake news, which in turn causes many societal issues. Although they should not, such forgeries can damage a person's social standing and mental health. It has become so easy that mobile cameras, tiny devices, or other hidden recorders can capture pictures or videos at any time, which can then be altered with professional image-processing software to create bogus pictures or films. Some companies even specialize in offering such deepfake services.
Various techniques have been proposed for deepfake detection; however, they are still unreliable for unseen videos. Moreover, the applicability of advanced DL methods for deepfake generation has made detection very difficult under environmental changes such as lighting, compression, and changes in scale and position. Hence, the increasing fraudulent activities using deepfakes have forced researchers to propose more reliable methods for deepfake detection. Therefore, to overcome the above shortcomings of previous efforts, we provide a robust approach for deepfake detection. We propose an innovative DL technique: ResNet-Swish-BiLSTM. We adopt the Swish activation function for its smoothness, as it combines properties of the ReLU and sigmoid functions. It has proven more effective than ReLU and helps in learning complex patterns from videos, achieving a high recall rate. Furthermore, the features extracted from ResNet's layers are classified using BiLSTM [15] layers that capture the meaningful attributes from the input features and classify the frames effectively. The main contributions are as follows:
a. For deepfake detection, we offer ResNet-Swish-BiLSTM, a distinctive architecture based on Bi-LSTM and a residual network. It is more robust than previous deepfake detection methods due to its reliable feature extraction.
b. We analyzed the proposed model using the DFDC, FF++, and Celeb-DF collections, including a cross-corpus check along with a confirmatory assessment, to evaluate the suitability and generalizability of the proposed ResNet-Swish-BiLSTM network for visual fraud detection.
c. The presented work remains reliable against perturbations such as compression, noise, blurring, translation, and rotation variations, as ResNet-Swish and BiLSTM propagate a more meaningful set of visual features within the neurons and perform the classification task.
The remainder of the paper is organized as follows: Section 2 outlines the relevant deepfake detection techniques. Section 3 describes the methodology, Section 4 explains the results and experiments used to analyze the performance, and Section 5 concludes the article.
2. Related Work
The three types of modern deepfake records are Face Swap (FS), Lip Sync, and Puppeteer. In FS deepfakes, the face of the target person is exchanged with the face of a source person to develop a fictional clip of the addressee. Deepfakes that synchronize the lips with an audio track are called lip-sync deepfakes, so the person appears to be saying the scripted words. In puppet-master deepfakes, the source person's facial expressions are transferred to the target face, which is inserted into the original video to create a more convincing impersonation. Most current detection methods focus on specific types of deepfake, whereas general methods capable of defeating all deepfakes are studied less often. A technique for detecting lip-sync deepfakes has been released by Agarwal et al. [16]. This method exploited the discrepancies between a phoneme (spoken sound) and the corresponding viseme (mouth shape). Viseme-phoneme mapping was calculated in this study using manual and CNN-based approaches. This model works well only for a specific collection of visual data.
Existing deepfake detection methods can be broadly divided into two groups: those based on handcrafted features [17] and those using DL [18, 19]. Yang et al. [19] trained an SVM (Support Vector Machine) classifier for recognition using 68-D face-landmark data. This approach worked well with high-quality videos from the UADFV [19] and DARPA MediFor datasets but had difficulties with low-quality videos. In addition, not all deepfakes were considered in the analysis of this study. Using 16-dimensional texture-based eye and teeth characteristics, Masood et al. [20] were able to exploit visual flaws and detect deepfakes like FS and F2F. The most crucial aspect of this study was determining the eye colors of a POI (point of interest) to detect FS deepfakes, for example by using the reflection in the eye color. This study also uses attributes like the top of the nose, the rim of the face, and the eye retina color to distinguish F2F deepfakes. Unfortunately, only faces with clean teeth and open eyes can be identified with this approach [20], which is a disadvantage. Finally, only the FF++ collection of deepfake examples was used for this work's assessment. Masood [20] also proposed exploiting the affine warping artifacts that arise during the production of deepfakes; by focusing on specific regions, the training cost was reduced. However, as it becomes difficult to detect a deepfake with some advanced transformation characteristics, this artifact selection could weaken the resilience of the approach.
Agarwal et al. [21] used the open-source OpenFace2 toolkit [11] to extract facial landmark features. Several features were derived from the retrieved landmarks. A binary SVM was then trained for deepfake detection using these derived features and action-unit traits. Five POIs were used in this approach, and a T-SNE plot showed that each POI could be linearly separated. However, the performance of this approach was severely impacted by the increased number of POIs introduced by new datasets. In addition to some professional apps for altering facial content, DL-based approaches are also used for identifying deepfakes. To identify deepfakes, Doke et al. [18] used a DL-based approach in which a long short-term memory (LSTM) network learns from features extracted by a CNN. The key contribution of this study was the use of sequence and pattern inconsistencies in deepfakes for categorization. Still, this approach lacks accuracy for the top three deepfake levels. Another DL model (MesoNet) was employed by Xia et al. [22] to identify deepfake and F2F video manipulations; this unified architecture uses convolution and pooling layers for feature extraction, followed by dense layers for classification. Instead of using a standard dataset, these algorithms [22] were tested on digital media collected from different websites, raising questions about their sustainability over a sizeable and heterogeneous standard dataset. A capsule network was reported by Nguyen et al. [23] to uncover different types of manipulations in images and video clips. This framework is designed to identify computer-generated images, FS, and Neural Textures (NT). The performance of the proposed model was improved by using dynamic routing and expectation-maximization methods. The capsule network of Nguyen et al. uses VGG-19 to extract latent facial attributes, which are then categorized as legitimate or fake. Although the framework is computationally sophisticated, it effectively detects FS deepfakes in the FF++ dataset. Still, it has not been tested on lip-sync and puppet-master deepfakes. Table 1 summarizes some recent developments in the domain of deepfakes.
Table 1
A comparison of prior research on deepfake video recognition

| Research | DF Collections | CNN Attributes |
| A. Jaiswal [24] | FF++ | Bidirectional recurrent neural networks (RNN) with DenseNet/ResNet50 are used to analyze the spatiotemporal properties of video streams. |
| P. Dongare [18] | Hollywood-2 Human Actions | Takes into consideration the deepfake video's temporal irregularities; Inception-V3 + LSTM. |
| A. Irtaza [14] | Fusion of datasets | Differences in facial structure, missing detail in the eyes and mouth, and a neural network plus logistic regression model. |
| Hashmi [26] | Whole DFDC dataset | CNN + LSTM using facial landmarks and convolution features. |
3. Proposed Methodology
The details of the proposed system are provided in this section. The proposed ResNet-Swish-BiLSTM deepfake video identification technique is shown in Fig. 1. The approach utilizes the Swish activation function, which limits the propagation of negative values across the network while its smoothness improves the model's learning behavior. To extract facial landmark characteristics from the input video, the OpenFace 2.0 [11] toolkit was used. The Residual Blocks (RBs) extract the face's key features to categorize whether frames are authentic or fake. We also utilized BiLSTM layers that focus on capturing the most useful features for classifying deepfake frames.
In this step, the system detects faces in the frames after extracting them. Visual modifications are mainly made in the facial area of frames; therefore, the procedure described here focuses primarily on the facial region. We used OpenFace 2.0, a face-detector toolkit, to collect the faces. The method locates the facial region using 2D and 3D facial landmarks. As illustrated in Fig. 1, we have chosen seven POIs: the outer right eye edge (OE), the outer left eye edge (LE), the chin (C), the forehead (FH), the outer right cheek (PC), the left cheek (OC), and the center of the face (MF). The primary justification for using OpenFace 2.0 for face identification is that it works well even when there are changes in face orientation, intensity fluctuations, and the position of the capturing device. The OpenFace toolkit can reliably extract face images from the collection of clips even when they undergo considerable transformation changes.
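The paper performs this face localization with the OpenFace 2.0 toolkit. As a rough illustration of the cropping step only, the sketch below substitutes the MTCNN detector from the facenet-pytorch package, so the detector choice and all function names here are assumptions rather than the authors' exact pipeline.

```python
# Illustrative sketch only: the paper uses OpenFace 2.0 for face localization;
# here the MTCNN detector from facenet-pytorch stands in for it.
import cv2
import torch
from facenet_pytorch import MTCNN

device = "cuda" if torch.cuda.is_available() else "cpu"
detector = MTCNN(image_size=224, margin=20, post_process=False, device=device)

def extract_face_crops(video_path, every_nth=5):
    """Read a video, detect the largest face in every n-th frame,
    and return the 224x224 crops as uint8 tensors."""
    crops = []
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_nth == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            face = detector(rgb)          # cropped face tensor, or None if no face
            if face is not None:
                crops.append(face.to(torch.uint8))
        idx += 1
    cap.release()
    return crops
```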
3.3 Standardization and Segmentation
Different methods are used for the standardization of features. The extracted attributes are normalized during data loading by semi-training the distributed facial attributes. We use Eq. (1) for the standardization:

z = (X − μ) / S    (1)

where μ and S represent the mean and standard deviation of the feature columns, respectively, and X is the input face attribute vector. Our solution works at both the frame and the segment levels. We created segments with a duration of 100 frames and an overlap of 25 frames so that the slices align smoothly. The projected frame rate is 25 frames per second to keep the computational complexity manageable in our case.
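A minimal sketch of Eq. (1) and the segmentation step is given below, assuming the extracted facial attributes are held in NumPy arrays; the function and variable names are illustrative, not taken from the paper.

```python
# Sketch of column-wise standardization (Eq. 1) and overlapping segmentation.
import numpy as np

def standardize(feature_matrix: np.ndarray) -> np.ndarray:
    """Column-wise z-score: z = (X - mu) / S (Eq. 1)."""
    mu = feature_matrix.mean(axis=0)
    s = feature_matrix.std(axis=0) + 1e-8   # guard against zero variance
    return (feature_matrix - mu) / s

def segment(frames, seg_len=100, overlap=25):
    """Split a frame sequence into 100-frame windows that overlap by 25 frames."""
    step = seg_len - overlap
    stop = max(len(frames) - seg_len + 1, 1)
    return [frames[i:i + seg_len] for i in range(0, stop, step)]
```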
3.4 Feature computation
The next challenge is to compute features from the video frames after extracting the faces. To solve this, we adjusted the pre-trained ResNet-18 CNN by introducing the Swish-based suppression of low (negative) values into the current framework. The rationale behind the Swish activation design is that it restrains the model from scattering negative values through the neural network, helping to capture the complicated underlying patterns in the visual content. In addition, the supplementary recurrent (BiLSTM) layers help select a representative set of facial features that is passed on for classification. The initial layers of a CNN are responsible for learning essential, generic visual information, while the layers at the end focus more on computing the information required for the specific task. Therefore, using a pre-trained CNN for deepfake detection improves the accuracy of forgery recognition while accelerating learning. The primary purpose of using a pre-trained model is to compute a more accurate set of image attributes, because it has previously been trained on the extensive ImageNet database. Figure 2 demonstrates a visual representation of this task.
Y = F(x_i) + x_i    (2)

In the above expression, x_i denotes the input attributes, F is the residual function, and Y represents the result obtained after applying the residual function to x_i and adding x_i back.
Similarly, integrating the attention-based RB with a set of filters of kernel size 3×3 can capture and extract different key points, which is very useful for classification during the final training, making the proposed model efficient and intelligent at locating the POIs. After that, by using the BiLSTM block, the network can accept a series of convolutional feature vectors from the input frames, and a two-node output layer predicts whether the sequence comes from a deepfake or an original video clip. The main problem we must address is developing a logical model to process a sequence of features. To achieve this, we experimented with a BiLSTM, which produced positive results, using a BiLSTM unit with a 9216-dimensional input and a dropout probability of 0.5. This addition of parameters has the positive outcome we want. More specifically, 9216-D spatial feature vectors are fed into the BiLSTM model to train it for evaluation. Further, we added a fully connected layer with 512 units and a dropout probability of 0.5 at the end of the proposed network to determine the probability that a frame sequence is either authentic or deepfake. We used a softmax layer for the final classification.
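A simplified PyTorch sketch of this architecture is given below. It is not the authors' exact configuration: the 512-D pooled ResNet-18 descriptor stands in for the 9216-D spatial feature vector described above, and the hidden size and layer arrangement are assumptions.

```python
# Simplified sketch of the ResNet-Swish-BiLSTM idea (layer sizes are assumptions).
import torch
import torch.nn as nn
from torchvision.models import resnet18

def relu_to_swish(module: nn.Module) -> None:
    """Recursively replace every ReLU in the backbone with Swish (SiLU)."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.SiLU(inplace=True))
        else:
            relu_to_swish(child)

class ResNetSwishBiLSTM(nn.Module):
    def __init__(self, hidden=256, num_classes=2):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")   # ImageNet pre-training
        relu_to_swish(backbone)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop FC head
        self.bilstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 512), nn.SiLU(), nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, clips):                         # clips: (B, T, 3, 224, 224)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).flatten(1)   # per-frame 512-D vectors
        seq, _ = self.bilstm(feats.view(b, t, -1))          # (B, T, 2*hidden)
        return self.head(seq[:, -1])   # real-vs-deepfake logits (softmax in the loss)
```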
Table 2
ResNet-Swish-BiLSTM structural description

| Layer Name | Activations | Learnables |
| Feature input | 9216 | - |
| SoftMax | 2 | - |
| Classification | 2 | - |
4. Experimental Results and Discussions
This section describes the experiments used to measure the detection and classification accuracy of the proposed model. We also highlight the salient aspects of the datasets used. Extensive experiments are described to elucidate the suitability of our strategy. Subsection 4.1 describes the evaluation metrics we used to measure the accuracy, and Subsection 4.2 discusses the datasets in depth. The details of the experimental design are presented in Subsection 4.3. Subsections 4.4 to 4.7 cover the various trials we conducted to evaluate the effectiveness of the suggested technique.
4.1. Evaluation Metrics

Ac = (D′ + Ꞃ) / (D′ + Ꞃ + r + R)    (3)

Pr = D′ / (D′ + r)    (4)

Re = D′ / (D′ + R)    (5)

F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (6)
where D′ denotes the correctly detected positives (deepfakes) and Ꞃ refers to the correctly detected negatives. Similarly, r stands for the mistakenly positive results (genuine frames flagged as fake), and R stands for the mistakenly negative results (missed deepfakes). Some evaluation measurements obtained with these equations on the proposed classification model are detailed in the following subsections.
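The following minimal sketch computes Eqs. (3)-(6) from the four counts defined above; the numeric example at the end is illustrative only.

```python
# Sketch of Eqs. (3)-(6) using the paper's notation: D' = true deepfake detections,
# N = true negatives, r = false positives, R = false negatives.
def detection_metrics(d_true: int, n_true: int, r_fp: int, r_fn: int):
    accuracy  = (d_true + n_true) / (d_true + n_true + r_fp + r_fn)   # Eq. (3)
    precision = d_true / (d_true + r_fp)                              # Eq. (4)
    recall    = d_true / (d_true + r_fn)                              # Eq. (5)
    f1        = 2 * precision * recall / (precision + recall)         # Eq. (6)
    return accuracy, precision, recall, f1

# Illustrative counts: 950 deepfakes caught, 900 real frames kept, 50 false alarms, 100 misses
print(detection_metrics(950, 900, 50, 100))
```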
4.2. Dataset
Our proposed model is evaluated using two datasets: the DFDC provided by Facebook [32] and FF++ [33]. The Facebook dataset contains 4119 manipulated videos and 1131 original recordings. The DFDC data can be obtained from the Kaggle competition website as an open and free online dataset [32]. A variety of AI-based techniques, including Deepfakes, FS [20], F2F [34], and NT [35], were used to create the FF++ dataset; 1000 authentic and 1000 altered recordings are included for each manipulation method. Modifying recordings using raw, light, and high compression settings is also common nowadays. Table 3 lists some of the available collections used for deepfake detection. The datasets we used during the experiments are randomly split into 70% and 30% ratios to train and test the proposed model, respectively. Since the manipulated recordings in the FF++ sample set were created with an advanced deepfake algorithm, it is specifically used for the evaluation [36]. This algorithm creates high-quality visual records that closely resemble the real-world environment. The datasets contain both genuine and fake examples, as shown in Fig. 6.
Table 3
A list of collections that include modified and original videos.

| DF Collections with Origin | Year | Pristine vs. Modified |
a. The OpenFace2 toolbox is used to perform standardization, segmentation, and feature extraction of facial features.
b. As the model requires, the detected face patches are resized to 224 × 224 dimensions.
c. We trained the model for 60 epochs with a learning rate of 0.000001, a batch size of 35, and stochastic gradient descent (SGD) as the optimizer; a training sketch under these settings is given below.
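A minimal PyTorch training sketch under these hyperparameters follows; the momentum value and the dataset/model objects are placeholder assumptions, not taken from the paper.

```python
# Training-loop sketch: 60 epochs, learning rate 1e-6, batch size 35, SGD.
import torch
from torch import nn
from torch.utils.data import DataLoader

def train(model, train_set, device="cuda"):
    loader = DataLoader(train_set, batch_size=35, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)  # momentum assumed
    criterion = nn.CrossEntropyLoss()
    model.to(device).train()
    for epoch in range(60):
        for clips, labels in loader:
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()
    return model
```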
The above results demonstrate the resilience of our proposed method. The ResNet-Swish-BiLSTM model is effective at picking and selecting the features that enable the framework to represent complex visual patterns and utilize them effectively for classification.
Table 4
Comparative analysis of Swish and other activation functions over DFDC and FF++

| Activation Function | Accuracy (%) | Avg training time (sec) | Avg classification time (sec) | Remarks |
In the early epochs, ReLU is better than Leaky ReLU (L-ReLU), but the difference in accuracy decreases significantly as the number of epochs grows, and L-ReLU eventually performs better than ReLU. For longer training, L-ReLU performed better than ReLU, and Swish and Mish performed considerably better than the other activation functions. Swish is more accurate than Mish in this type of application. Based on these observations, we can conclude that Swish outperformed the other activation functions in the list. Another main reason for the selection is the non-monotonic nature of the Swish activation, which allows the computed values to decrease even when the input rises. As a result, it improves the representational capacity of the proposed approach. Similarly, employing the Swish activation optimizes the model's behavior by improving the proposed approach's feature-selection power and recall ability. Hence, due to these factors, the proposed ResNet-Swish-BiLSTM model presents the highest performance for classifying visual manipulations.
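For reference, the two best-performing activations in this comparison can be written directly; the short sketch below (plain PyTorch, illustrative only) also demonstrates the non-monotonic dip of Swish mentioned above.

```python
# Definitions of Swish and Mish as compared in Table 4.
import torch
import torch.nn.functional as F

def swish(x: torch.Tensor) -> torch.Tensor:
    return x * torch.sigmoid(x)            # equivalent to torch.nn.SiLU()

def mish(x: torch.Tensor) -> torch.Tensor:
    return x * torch.tanh(F.softplus(x))   # equivalent to torch.nn.Mish()

x = torch.linspace(-4, 4, 9)
print(swish(x))   # values dip below zero for negative inputs (non-monotonic)
print(mish(x))
```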
4.6. Comparing the improved model with the base ResNet-18
In this subsection, we compare the effectiveness of the proposed deepfake detection method with the original ResNet-18 model. Table 5 illustrates the comparison, and Fig. 9 shows the confusion matrix of the proposed deepfake detection model on the FF++ test set, clearly showing its effectiveness. We attained the best results for FS, F2F, and NT deepfakes across all performance metrics compared with the ResNet-18 model. We attained performance gains of 4.72%, 5.57%, 5.16%, and 6.24% in PR, RC, F1, and AC, respectively, for FS deepfake detection. In the PR, RC, F1, and AC measurements for F2F, we gained improvements of 5.81%, 4.73%, 3.93%, and 6.99%, respectively. For the NT deepfakes' PR, RC, F1, and AC measurements, we achieved gains of 4.72%, 5.75%, 4.33%, and 6.33%, respectively. In addition, we compared the AUC (area under the curve) with the ROC (receiver operating characteristic) curves for the FS, F2F, and NT deepfakes in Fig. 8 (a, b, and c), which clearly shows how robust our method is compared with the base model. The analysis shows that the enhanced version, ResNet-Swish-BiLSTM, outperforms the base model over the FF++ dataset, although our proposed technique has more parameters (18.2 million) than ResNet-18 (13.5 million). It is quite clear from the presented results that the proposed technique performs reliably for deepfake detection. Integrating the Swish activation into the model, which produces more precise feature computation, is the main factor contributing to the improved recognition performance of the proposed solution. Furthermore, the additional Bi-LSTM layers help the model mitigate overfitting more effectively, ultimately improving performance.
Table 5
Comparison of the base and proposed models over the FF++ deepfake collections

| CNN Model | PR (%) | RC (%) | F1 (%) | AC (%) |
| ResNet-18 | 94.15 / 89.42 / 94.33 | 90.19 / 92.5 / 94.22 | 91.1 / 88.5 / 85.23 | 89.2 / 92.5 / 86.6 |
| Proposed Model | 99.25 / 98.23 / 95.48 | 97.89 / 97.8 / 97.66 | 96.3 / 92.9 / 89.56 | 98.9 / 98.0 / 96.2 |
We first compared the accuracy of our deepfake detector with VGG16 [30], VGG-19 [42], ResNet101 [43], InceptionV3 [44], InceptionResV2 [45], XceptionNet [46], DenseNet-169 [15], MobileNetV2, EfficientNet [47], and NasNet-Mobile [48]. The comparative findings are shown in Fig. 10. The results show that the proposed technique works better than the alternatives. VGG16 obtained the lowest AC at 88%, and the VGG-19 model took the second-lowest place with 88.92%. The developed ResNet-Swish-BiLSTM model achieved the best AC with 98.06%. The average accuracy of the compared models is 93.84%, whereas our strategy scores 98.06%; as a result, a 4.22% average performance increase in the accuracy metric is achieved. Figure 10 shows a comparison on the DFDC dataset based on accuracy, recall, precision, and F1-score with the existing DL techniques. Additionally, using the FF++ dataset, we evaluated the proposed technique against existing DL models including XceptionNet [46], VGG16 [30], ResNet34 [49], InceptionV3 [44], VGGFace, and EfficientNet, and the results are shown in Fig. 11. From the results, it is clear that the ResNet-Swish-BiLSTM model achieved more accurate classification results than the other DL methods for all classes: FS, F2F, and NT. Detection algorithms often show a decrease in effectiveness on low-quality videos compared with high-quality videos, as shown in Table 6.
Hence, Fig. 10, Fig. 11, Table 6, and Table 7 demonstrate that the ResNet-Swish-BiLSTM model is more robust at detecting visual tampering than previous DL techniques for all specified assessment parameters. The proposed method works consistently well due to its improved feature-computation capability, which allows it to learn the key points more effectively. Furthermore, the ResNet-18 model's ability to deal with visual changes in the videos, such as lighting and background variations, has been improved by employing the Swish activation, which results in superior deepfake detection performance.
Table 6
Comparative analysis with existing methods using various DF datasets.

| Study | Method | Dataset | Performance (AC) |
|  |  | FF++, FS | 96.3% |
|  |  | FF++, NT | 99.09% |
Table 7
Performance comparison of DL networks over the DFDC dataset

| DL Models | AC | PR | RC | F1 |
4.7. Cross-Validation
We conducted an experiment to measure the effectiveness of the proposed deepfake detection method in a cross-corpus scenario. For this challenge, we chose the Celeb-DF dataset. The collection includes 950 manipulated videos and 475 original videos, and its subtle visual distortions make tampering difficult to detect.
The ResNet-Swish-BiLSTM model was tested under the scenarios presented in Table 8, where it was trained on the FF++ (FS, F2F, and NT) deepfake collections and assessed comparatively on the FF++, DFDC, and Celeb-DF deepfakes. The statistics in Fig. 12 show unequivocally that, in a cross-corpus scenario, the performance of the proposed technique decreases compared with the within-dataset evaluation. The main reason for this degradation is that the method does not account for the temporal variations that evolve across frames during training, which could have helped it capture the underlying manipulation biases more accurately. Table 8 clarifies that, in the first case, the AUC values of the suggested DL network for the DFDC and Celeb-DF collections are 71.56% and 70.04%, respectively, when trained on the FF++ (FS, F2F, and NT) deepfake collections. In the second case, the proposed ResNet-Swish-BiLSTM network produced AUC values of 70.12% and 65.23% for the FF++ and Celeb-DF deepfake data sources, respectively, when trained on the DFDC dataset. These results show how well the model generalizes to unseen cases from entirely new deepfake collections.
Table 8
Cross-dataset validation accuracy of the suggested convolutional network.

| Training Dataset | Class | Proposed Model AUC |
5. Conclusion
This study has provided a comprehensive strategy to identify all three forms of deepfakes based on the fusion of unique facial features. Unlike many other systems, the proposed model is easy to use, understandable, efficient, and resilient at the same time. We introduced a novel ResNet-Swish-BiLSTM deepfake detection model. The utility of the proposed visual tampering detection technique was extensively tested on the FF++, DFDC, and Celeb-DF deepfake data sources. We evaluated the proposed method across deepfake collections to demonstrate its generalizability to unseen scenarios. We found that the suggested modelling technique can distinguish between modified and unmodified digital footage with a high recall rate and can recognize various visual modifications, including FS, NT, and F2F. The proposed technique is evaluated in detail on the FF++ and DFDC datasets, which yield the highest AUC values of 0.9623 and 0.9876, respectively. In the cross-corpus setting, our approach experienced some performance loss, yet the performance is still quite good. Therefore, after thoroughly examining ResNet-Swish-BiLSTM at the statistical and digital-media levels, we conclude that our work is potentially beneficial to advanced digital investigation fields such as criminal forensics. Since the model cannot yet capture the temporal patterns of the faked material over time, our ultimate aim in future work is to add temporal pattern analytics and reasoning to further improve the detection, inference, and adaptability capabilities of the proposed approach.
Declarations
Acknowledgement: This research project was supported by a grant from the “Research Center of College of Computer and Information Sciences”, Deanship of Scientific Research, King Saud University.
Data Availability Statement: Data sharing does not apply to this article as authors have used publicly available
datasets, whose details are included in the “experimental results and discussions” section of this article. Please
contact the authors for further requests.
References
1. P.S.Q. Yeoh, K.W. Lai, S.L. Goh, K. Hasikin, Y.C. Hum, et al., "Emergence of deep learning in knee osteoarthritis
diagnosis." Computational intelligence and neuroscience, vol. 2021, pp. 2021.
2. K. Bjerge, H.M. Mann and T.T. Høye, "Real‐time insect tracking and monitoring with computer vision and deep
learning." Remote Sensing in Ecology and Conservation, vol. 8, no.3, pp. 315-327, 2022.
3. N. Le, V.S. Rathour, K. Yamazaki, K. Luu and M. Savvides, "Deep reinforcement learning in computer vision: a
comprehensive survey." Artificial Intelligence Review, vol. pp. 1-87, 2022.
4. A. Bouguettaya, H. Zarzour, A.M. Taberkit and A. Kechida, "A review on early wildfire detection from
unmanned aerial vehicles using deep learning-based computer vision algorithms." Signal Processing, vol.
190, pp. 108309, 2022.
5. P. Shukla, R. Aluvalu, S. Gite and U. Maheswari, Computer Vision: Applications of Visual AI and Image
Processing. Vol. 15. 2023: Walter de Gruyter GmbH & Co KG.
6. T.T. Nguyen, C.M. Nguyen, D.T. Nguyen, D.T. Nguyen and S. Nahavandi, "Deep learning for deepfakes creation
and detection." arXiv preprint arXiv:1909.11573, vol. 1, no.2, pp. 2, 2019.
7. Marek Kowalski, Faceswap, Jan 2020. [Online]. Available: https://github.com/deepfakes/faceswap. Accessed 19 Jan 2021.
8. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, et al., "Generative adversarial networks."
Communications of the ACM, vol. 63, no.11, pp. 139-144, 2020.
9. C. Bravo-Prieto, J. Baglio, M. Cè, A. Francis, D.M. Grabowska, et al., "Style-based quantum generative
adversarial networks for Monte Carlo events." Quantum, vol. 6, pp. 777, 2022.
10. P. Zhu, R. Abdal, Y. Qin, J. Femiani and P. Wonka, "Improved stylegan embedding: Where are the good latents?
." arXiv preprint arXiv:2012.09036, vol. pp. 2020.
11. T. Baltrusaitis, A. Zadeh, Y.C. Lim and L.-P. Morency. Openface 2.0: Facial behavior analysis toolkit. in 2018
13th IEEE international conference on automatic face & gesture recognition (FG 2018). 2018. IEEE.
12. F. Schroff, D. Kalenichenko and J. Philbin. Facenet: A unified embedding for face recognition and clustering.
in Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
13. A. Radford, L. Metz and S. Chintala, "Unsupervised representation learning with deep convolutional
generative adversarial networks." arXiv preprint arXiv:1511.06434, vol. pp. 2015.
14. S. Suwajanakorn, S.M. Seitz and I. Kemelmacher-Shlizerman, "Synthesizing obama: learning lip sync from
audio." ACM Transactions on Graphics (ToG), vol. 36, no.4, pp. 1-13, 2017.
15. R. Mahum and S. Aladhadh, "Skin Lesion Detection Using Hand-Crafted and DL-Based Features Fusion and
LSTM." Diagnostics, vol. 12, no.12, pp. 2974, 2022.
16. S. Agarwal, H. Farid, O. Fried and M. Agrawala. Detecting deep-fake videos from phoneme-viseme
mismatches. in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
workshops. 2020.
17. S. Kolagati, T. Priyadharshini and V.M.A. Rajam, "Exposing deepfakes using a deep multilayer perceptron–
convolutional neural network model." International Journal of Information Management Data Insights, vol. 2,
no.1, pp. 100054, 2022.
18. Y. Doke, P. Dongare, V. Marathe, M. Gaikwad and M. Gaikwad, "Deep Fake Video Detection Using Deep Learning." Journal homepage: www.ijrpr.com, ISSN 2582-7421.
19. X. Yang, Y. Li and S. Lyu. Exposing deep fakes using inconsistent head poses. in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2019. IEEE.
20. M. Masood, M. Nawaz, K.M. Malik, A. Javed, A. Irtaza, et al., "Deepfakes Generation and Detection: State-of-
the-art, open challenges, countermeasures, and way forward." Applied Intelligence, vol. pp. 1-53, 2022.
21. S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, et al. Protecting World Leaders Against Deep Fakes. in CVPR
workshops. 2019.
22. Z. Xia, T. Qiao, M. Xu, X. Wu, L. Han, et al., "Deepfake Video Detection Based on MesoNet with Preprocessing
Module." Symmetry, vol. 14, no.5, pp. 939, 2022.
23. H.H. Nguyen, J. Yamagishi and I. Echizen. Capsule-forensics: Using capsule networks to detect forged
images and videos. in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). 2019. IEEE.
24. E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, et al., "Recurrent convolutional strategies for face
manipulation detection in videos." Interfaces (GUI), vol. 3, no.1, pp. 80-87, 2019.
25. Y. Li and S. Lyu, "Exposing deepfake videos by detecting face warping artifacts." arXiv preprint
arXiv:1811.00656, vol. pp. 2018.
26. M.F. Hashmi, B.K.K. Ashish, A.G. Keskar, N.D. Bokde, J.H. Yoon, et al., "An exploratory analysis on visual
counterfeits using conv-lstm hybrid architecture." IEEE Access, vol. 8, pp. 101293-101308, 2020.
27. I. Ganiyusufoglu, L.M. Ngô, N. Savov, S. Karaoglu and T. Gevers, "Spatio-temporal features for generalized
detection of deepfake videos." arXiv preprint arXiv:2010.11844, vol. pp. 2020.
28. K. He, X. Zhang, S. Ren and J. Sun. Identity mappings in deep residual networks. in Computer Vision–ECCV
2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV
14. 2016. Springer.
29. M.J. Akhtar, R. Mahum, F.S. Butt, R. Amin, A.M. El-Sherbeeny, et al., "A Robust Framework for Object Detection
in a Traffic Surveillance System." Electronics, vol. 11, no.21, pp. 3425, 2022.
30. M. Nawaz, Z. Mehmood, M. Bilal, A.M. Munshi, M. Rashid, et al., "Single and multiple regions duplication
detections in digital images with applications in image forensic." Journal of Intelligent & Fuzzy Systems, vol.
40, no.6, pp. 10351-10371, 2021.
31. A. Graves, "Generating sequences with recurrent neural networks." arXiv preprint arXiv:1308.0850, vol. pp.
2013.
32. B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, et al., "The deepfake detection challenge (dfdc) dataset."
arXiv preprint arXiv:2006.07397, vol. pp. 2020.
33. A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, et al., "Faceforensics: A large-scale video dataset for
forgery detection in human faces." arXiv preprint arXiv:1803.09179, vol. pp. 2018.
34. A. Kohli and A. Gupta, "Detecting DeepFake, FaceSwap and Face2Face facial forgeries using frequency
CNN." Multimedia Tools and Applications, vol. 80, pp. 18461-18478, 2021.
35. J. Thies, M. Zollhöfer and M. Nießner, "Deferred neural rendering: Image synthesis using neural textures."
ACM Transactions on Graphics (TOG), vol. 38, no.4, pp. 1-12, 2019.
36. T. Jung, S. Kim and K. Kim, "Deepvision: Deepfakes detection using human eye blinking pattern." IEEE
Access, vol. 8, pp. 83144-83154, 2020.
37. P. Korshunov and S. Marcel, "Deepfakes: a new threat to face recognition? assessment and detection." arXiv
preprint arXiv:1812.08685, vol. pp. 2018.
38. A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, et al., "FaceForensics++: Learning to Detect
Manipulated Facial Images." arXiv preprint arXiv:1901.08971, vol. pp. 2019.
39. Y. Hua, R. Shi, P. Wang and S. Ge, "Learning Patch-Channel Correspondence for Interpretable Face Forgery
Detection." IEEE Transactions on Image Processing, vol. pp. 2023.
40. Y. Li, P. Sun, H. Qi and S. Lyu, Toward the creation and obstruction of deepfakes, in Handbook of Digital Face
Manipulation and Detection: From DeepFakes to Morphing Attacks. 2022, Springer International Publishing
Cham. p. 71-96.
41. H. Chi and M. Peng, Toward Robust Deep Learning Systems against Deepfake for Digital Forensics, in
Cybersecurity and High-Performance Computing Environments. 2022, Chapman and Hall/CRC. p. 309-331.
42. Y.S. Taspinar, M. Dogan, I. Cinar, R. Kursun, I.A. Ozkan, et al., "Computer vision classification of dry beans
(Phaseolus vulgaris L.) based on deep transfer learning techniques." European Food Research and
Technology, vol. 248, no.11, pp. 2707-2725, 2022.
43. D. Theckedath and R. Sedamkar, "Detecting affect states using VGG16, ResNet50 and SE-ResNet50
networks." SN Computer Science, vol. 1, no.2, pp. 1-7, 2020.
44. G. Chugh, A. Sharma, P. Choudhary and R. Khanna, "Potato leaf disease detection using inception V3." Int.
Res. J. Eng. Technol (IRJET), vol. 7, no.11, pp. 1363-1366, 2020.
45. M.M. Rahman, A.A. Biswas, A. Rajbongshi and A. Majumder, "Recognition of local birds of Bangladesh using
MobileNet and Inception-v3." International Journal of Advanced Computer Science and Applications, vol. 11,
no.8, pp. 2020.
46. A. Biswas, D. Bhattacharya and K.A. Kumar, "DeepFake Detection using 3D-Xception Net with Discrete Fourier
Transformation." Journal of Information Systems and Telecommunication (JIST), vol. 3, no.35, pp. 161,
2021.
47. G. Marques, D. Agarwal and I. de la Torre Díez, "Automated medical diagnosis of COVID-19 through
EfficientNet convolutional neural network." Applied soft computing, vol. 96, pp. 106691, 2020.
48. F. Saxen, P. Werner, S. Handrich, E. Othman, L. Dinges, et al. Face attribute detection with mobilenetv2 and
nasnet-mobile. in 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA).
2019. IEEE.
49. R. Roy, I. Joshi, A. Das and A. Dantcheva, 3D CNN Architectures and Attention Mechanisms for Deepfake
Detection, in Handbook of Digital Face Manipulation and Detection. 2022, Springer, Cham. p. 213-234.
50. N. Bonettini, E.D. Cannas, S. Mandelli, L. Bondi, P. Bestagini, et al., "Video Face Manipulation Detection
Through Ensemble of CNNs." arXiv e-prints, vol. pp. arXiv: 2004.07676, 2020.
51. J.C. Neves, R. Tolosana, R. Vera-Rodriguez, V. Lopes, H. Proença, et al., "Ganprintr: Improved fakes and
evaluation of the state of the art in face manipulation detection." IEEE Journal of Selected Topics in Signal
Processing, vol. 14, no.5, pp. 1038-1048, 2020.
52. U.A. Ciftci, I. Demir and L. Yin, "Fakecatcher: Detection of synthetic portrait videos using biological signals."
IEEE transactions on pattern analysis and machine intelligence, vol. pp. 2020.
53. W. Zhang, C. Zhao and Y. Li, "A novel counterfeit feature extraction technique for exposing face-swap images
based on deep learning and error level analysis." Entropy, vol. 22, no.2, pp. 249, 2020.
54. A. Keramatfar, H. Amirkhani and A. Jalaly Bidgoly, "Multi-thread hierarchical deep model for context-aware
sentiment analysis." Journal of Information Science, vol. 49, no.1, pp. 133-144, 2023.
55. Y. Nirkin, L. Wolf, Y. Keller and T. Hassner, "Deepfake detection based on discrepancies between faces and
their context." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no.10, pp. 6111-6121,
2021.
Figures
Figure 1
Figure 2
Figure 3
ResNet-18's architecture
Figure 4
Architectural description of RB.
Figure 5
Figure 6
Samples from the used collections; the first row shows pristine instances and the second row shows modified ones.
Figure 7
Precision, recall, and accuracy of the proposed approach on the FF++ dataset for the detection of (I) FS, (II) F2F, and (III) NT deepfakes.
Figure 8
AUC with ROC plots against the proposed and underlying CNN network on the FF++ deepfake collections (FS, F2F,
and NT, respectively).
Figure 9
Chart showing the efficiency of the different DL networks on the DFDC dataset.
Figure 10
Comparison with existing DL techniques on the DFDC dataset in terms of accuracy, recall, precision, and F1-score.
Figure 11
Assessing the accuracy of the DL techniques using the FF++ dataset.
Figure 12
Cross-dataset assessment of the proposed approach using multiple datasets.