Comparative Analysis and Evaluation of CNN Models For Deepfake Detection
Andry Chowanda
2023 4th International Conference on Artificial Intelligence and Data Sciences (AiDAS) | 979-8-3503-1843-2/23/$31.00 ©2023 IEEE | DOI: 10.1109/AIDAS60501.2023.10284611
Abstract—Deepfake technology has become a significant concern due to its ability to create highly realistic fake videos and images, leading to the potential deception of individuals. Detecting deepfakes has become a critical research area in computer vision and multimedia forensics. This paper presents a comparative analysis of deepfake detection models, focusing on evaluating their accuracy and robustness. Four CNN models, namely ResNet-152, MobileNetV3, ConvNeXt Large, and EfficientNetB7, were implemented and trained using a custom dataset obtained from FaceForensics++. The models were evaluated based on training accuracy, average loss, and testing accuracy. An LSTM layer was also incorporated into each model’s architecture to leverage sequential information. The results demonstrate varying performance among the models, with EfficientNetB7 achieving the highest testing accuracy of 75%. The findings of this study provide insights for future research in this critical area.

Index Terms—deepfake detection, CNN models, comparative analysis, accuracy, LSTM layer

I. INTRODUCTION

Deepfake refers to the alteration of digital media, such as photos and videos, through manipulations that replace the appearance of one person with that of another. Deepfake has raised significant concerns as it enables the creation of highly realistic fake videos and images that can deceive individuals. The detection of deepfake media has become a critical research area in computer vision and multimedia forensics.

Convolutional neural networks (CNN) have been widely used in deepfake detection methods because they effectively extract and analyze visual features from images and videos [1]. These methods utilize the hierarchical structure of CNN to identify patterns and anomalies in manipulated media, thus enhancing deepfake detection accuracy. However, the rapid evolution of deepfake generation techniques presents challenges for existing deepfake detection approaches [2].

Several studies have shown the effectiveness of CNN models in various computer vision tasks, including image classification [3] [4] [5]. In this study, we aim to evaluate the performance of these models, namely EfficientNetB7, MobileNetV3, and ConvNeXt, in the specific context of deepfake detection. As a baseline model, we will utilize ResNet-152, a widely adopted CNN model in deep-learning research [6].

The study objective is to assess the accuracy of these models in detecting various types of deepfakes. The findings of this research will contribute to mitigating the possible threats associated with deepfakes [7]. By conducting a comprehensive evaluation of state-of-the-art models, this study aims to advance deepfake detection techniques and provide guidance for future research in this critical area.

II. LITERATURE REVIEW

Deepfake technology has become a growing concern in recent years due to its potential to manipulate and deceive individuals through the creation of realistic fake videos or images. Initially developed for legitimate applications in the entertainment industry, deepfake technology utilizes deep learning techniques, particularly Convolutional Neural Networks (CNN) and Generative Adversarial Networks (GAN) [8], to generate highly convincing and manipulative media content. However, this technology’s misuse for spreading misinformation, fake news, and malicious propaganda has raised significant alarms, necessitating research efforts to detect and combat deepfake media.

A. Deepfake Detection Methods

Deepfake forensics, one of the deepfake detection methods, is a technique that analyzes various media features such as patterns, noise, and confusion matrix. This method involves the use of machine learning approaches, specifically deep learning and neural networks. The objective is to develop a model that can differentiate between real and fake media. After being trained on a set of real and fake media, the model is tested
using a test set. If the accuracy falls below the expected level, the model is retrained until it achieves the desired accuracy.

One of the more common detection methods that use deep learning techniques involves binary classification. Binary classification is a task in which the network assigns each input to one of two class labels. In the context of deepfake video detection, binary classification is typically used with the two labels real and fake (0 or 1). While most deepfake detection methods utilize deep-learning techniques, several methods have been proposed that do not. Successive subspace learning (SSL), features distilled using spatial dimension reduction and channel-wise soft classification, and a multi-region, multi-frame ensemble have been combined to produce a lightweight, highly efficient deepfake detector model without using traditional deep-learning-based methods [9].

Various deepfake detection models, primarily based on Convolutional Neural Networks (CNN), have been proposed in the literature. EfficientNetB7, MobileNetV3, ResNet-152, and ConvNeXt are notable models with promising performance in deepfake detection. EfficientNetB7, a state-of-the-art CNN model, has demonstrated exceptional accuracy in various computer vision tasks [3]. MobileNetV3 has also shown promising results in deepfake detection while maintaining computational efficiency [4]. ResNet-152 is a widely adopted baseline CNN model in deepfake detection research [6]. ConvNeXt, designed to capture spatial and temporal features, has shown excellent performance in deepfake detection [5].

B. Convolutional Neural Network (CNN)

CNN, or Convolutional Neural Network, is a deep-learning architecture used to recognize features and patterns, including detecting deepfakes. The main advantage of CNN is that it can automatically learn valuable features from raw data using the convolutional layers. These convolutional layers collect spatial information from the inputs, which is then used for feature extraction. The extracted features are then used by the fully connected layers in the neural network to identify the deepfake.

Although it has the benefit of self-learning, CNN is still vulnerable to attacks specifically created to avoid CNN-based model detection. Therefore, additional research is required to increase the generalization and robustness of CNN-based detection models, even to detect a deepfake created to avoid the CNN detection model.

While it can be used to detect deepfakes, CNN can also be used to generate deepfakes, sometimes leaving a trace. Research conducted by Guarnera et al. [10] used convolutional traces, a unique identifier, to detect deepfake media and even identify the GAN architecture that produced the deepfake. They used the Expectation-Maximization algorithm to extract the convolutional traces left by the CNN. Their results indicate that the proposed model has better accuracy than other models such as FakeSpotter [11] and AutoGAN [12].

C. Long Short-Term Memory

Long Short-Term Memory, or LSTM, is a modified version of a Recurrent Neural Network [13]. LSTM allows models to learn temporal information that would otherwise be lost with regular RNNs, which cannot preserve long-term dependencies. It does so by extending the constant error carousel (CEC) with input and output gates connected to the input layer, which addresses the problem of conflicts during weight updates [14].

LSTM has been proposed as an excellent addition to most machine learning tasks due to its ability to preserve temporal information [15], which can be used to add more parameters that the model can learn and improve the outcomes of the model.

In 2019, Tzuu-Hseng S. Li et al. conducted an experiment to create a facial recognition model that detects human emotions, enhancing human-computer interaction for integrating robots into daily life [16]. The study highlights the superiority of LSTM and CNN-LSTM architectures in capturing temporal and contextual facial expression information compared to MLP and a single CNN. LSTM’s ability to retain temporal context makes it a valuable addition to deep-learning-based classification models.

D. Vulnerabilities and Challenges

Deepfake detection models are known to be vulnerable to adversarial attacks. Adversaries can manipulate deepfake videos to evade or even fool the detection models into misclassifying fake content as real [17] [18]. Developing generalized models capable of detecting different types of deepfakes remains a challenge. Deepfakes can vary in quality, manipulation techniques, and characteristics, making it challenging to develop a one-size-fits-all detection solution.

Most current deepfake detection methods focus on analyzing the facial features contained in videos, prioritizing visual elements. This reliance on facial features raises the concern that strong anti-forensic measures applied to facially manipulated images, alongside the use of other non-visual deepfake media, might cause current deepfake detection techniques to become highly inefficient [19].

E. Comparative Review

In this section, we review a comparative study that evaluates the performance of deep CNNs in detecting distracted drivers. A comparative review involves analyzing and comparing multiple models or methods to determine their effectiveness. The goal of the comparative review is to identify the best-performing approach. Srinivasan et al. compared deep convolutional neural networks in 2021 [20]: three deep CNN models were given a set of pictures of distracted drivers. The paper notes that several road accidents have happened due to humans not paying attention while driving, and suggests that developing a system capable of accurately predicting driver distraction can potentially reduce road accidents.
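
As a concrete reference for the LSTM gating discussed in Section II-C, the sketch below implements one time step of a scalar LSTM cell in plain Python. It is illustrative only and is not the implementation used in this paper: it follows the now-common formulation that also includes a forget gate, and the parameter names (`wi`, `ui`, `bi`, and so on) are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used by the LSTM gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x: float, h_prev: float, c_prev: float, w: dict) -> tuple:
    """Advance a scalar LSTM cell by one time step.

    `w` maps hypothetical parameter names to floats. For each of the
    input (i), forget (f), and output (o) gates and the candidate (g):
    `w<k>` weights the input, `u<k>` weights the previous hidden state,
    and `b<k>` is a bias.
    """
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g   # cell state: the path that carries long-term memory
    h = o * math.tanh(c)     # hidden state passed on to the next layer
    return h, c


def run_lstm(sequence, w, h0=0.0, c0=0.0):
    """Feed a sequence of scalars through the cell and return the final (h, c)."""
    h, c = h0, c0
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
    return h, c
```

When the forget gate saturates near 1 and the input gate near 0, the cell state is carried across time steps almost unchanged, which is the mechanism that lets an LSTM retain context over many video frames.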
Authorized licensed use limited to: K K Wagh Inst of Engg Education and Research. Downloaded on August 21,2024 at 16:22:36 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. Research Methodology

A. Identifying Models for Experimentation

Based on the literature review and existing work, we imported our models from PyTorch libraries. First, ResNet-152 was chosen as the baseline, as it is one of the older models on the roster. The remaining models, MobileNetV3, ConvNeXt Large, and EfficientNetB7, are much newer and are considered much more efficient and accurate than the baseline model.

B. Collection and Analysis

We utilized the FaceForensics++ dataset [21]. FaceForensics++ is a comprehensive public set comprising 1000 original videos from the public internet and 1000 manipulated videos generated through advanced video editing techniques. By using this established set, we can ensure the inclusion of diverse and realistic deepfake scenarios. Fig. 2 shows an example of a FaceForensics++ video.

To ensure a fair comparison, each deepfake detection model was given the same set and underwent similar preprocessing methods. The preprocessing steps involved extracting frames

C. Preprocessing Methods

Each randomly chosen video was split into frames for preprocessing. These frames were labeled based on metadata to determine whether the video was real or fake. The videos were resized to 112x112 before preprocessing. The mean and standard deviation were standardized for each video in the training and test sets.

The experiments for this research paper were conducted in Google Colaboratory Pro using a preprocessed custom dataset obtained from the FaceForensics++ set, and the source code is a modified version of that in [22]. The set consisted of real and fake videos in a 1:4 ratio to ensure a balanced representation. The architecture consisted of a deep CNN model followed by one LSTM layer, incorporated to enhance efficiency and leverage sequential information; the LSTM layer comes after the data has passed through the deep CNN model.

The experiments employed four CNN models: ResNet-152, MobileNetV3 Large, ConvNeXt Large, and EfficientNetB7. Each model was trained separately to ensure independent training processes and reliable results. The Adam optimizer with a learning rate of 0.00001 was utilized, and the training was performed over 20 epochs. Following the training phase, the models were tested to evaluate their performance regarding training accuracy, average loss, and testing accuracy.

Finally, we wrote additional code to create a test dataset, and each newly trained model was tested on this test set to determine its accuracy.

By employing this experimental setup, the research aimed to compare the performance of the selected models in deepfake detection. Using custom sets derived from FaceForensics++
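
The preprocessing described in Section C can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments (which is adapted from [22]): frames are treated as 2-D grayscale grids rather than RGB images, resizing uses nearest-neighbour sampling, "standardized" is read as zero mean and unit variance per video, fake is mapped to label 1, and the `metadata_label` field with values "REAL" or "FAKE" is hypothetical.

```python
from statistics import mean, pstdev

TARGET_SIZE = 112  # frame side length stated in the text


def resize_nearest(frame, size=TARGET_SIZE):
    """Resize a 2-D frame (list of pixel rows) to size x size by nearest neighbour."""
    src_h, src_w = len(frame), len(frame[0])
    return [
        [frame[r * src_h // size][c * src_w // size] for c in range(size)]
        for r in range(size)
    ]


def standardize_video(frames):
    """Shift and scale all pixels of one video to zero mean and unit variance."""
    pixels = [p for frame in frames for row in frame for p in row]
    mu, sigma = mean(pixels), pstdev(pixels)
    sigma = sigma or 1.0  # avoid division by zero for a constant video
    return [[[(p - mu) / sigma for p in row] for row in frame] for frame in frames]


def preprocess_video(frames, metadata_label):
    """Return (standardized 112x112 frames, binary label) for one video.

    Assumption: `metadata_label` is "FAKE" or "REAL"; fake maps to 1, real to 0.
    """
    resized = [resize_nearest(f) for f in frames]
    label = 1 if metadata_label == "FAKE" else 0
    return standardize_video(resized), label
```

Applying the same function to every video in both the training and test sets keeps the inputs comparable across the four models, which is the fairness condition stated in Section B.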
A. ResNet-152 Results
precision and recall, even if ResNet-152 yielded better results during training.

However, when evaluated on both the validation and test sets (Table 1), the model’s performance was significantly worse, with a testing accuracy of 61.33%, much lower than the training accuracy. This suggests the presence of overfitting, where the model may have memorized the training data too well and struggled to generalize to unseen data. Despite this, the model still exhibited moderate performance.

Comparing the performance of MobileNetV3 to ResNet-152, it appears that MobileNetV3 had a slightly higher testing accuracy. However, the difference in accuracy between the two models is minimal, making them practically identical in performance.

C. ConvNeXt Results

ResNet-152 and MobileNetV3. From Table 1, it achieved a testing accuracy of 67.17%, indicating that it could correctly classify over half of the test set. This performance surpassed the other two models, suggesting that ConvNeXt had a higher capability to generalize to unseen data.

D. EfficientNetB7 Results
binary classification. One factor that contributed to the success of EfficientNetB7 is its utilization of the compound scaling method, which played a role in enhancing its performance.

The second best-performing model in the experiment was ConvNeXt, which exhibited the second-highest testing accuracy. On the other hand, both ResNet-152 and MobileNetV3 demonstrated similar results with comparable accuracies.

An important observation from the experiment is that all models exhibited signs of overfitting. This can be seen from the testing accuracy scores being lower than the training accuracy scores. This indicates that the models struggled to generalize well beyond the training data.

To address the issue of overfitting and improve the generalization ability of the models, further enhancements can be implemented. One approach could involve utilizing additional sets beyond FaceForensics++ to introduce more diverse features. Increasing the size of the training and testing sets can also be beneficial, as it provides more information for the models to learn from and improve their performance.

V. CONCLUSION

This study conducted a comparative analysis of deepfake detection models (ResNet-152, MobileNetV3, ConvNeXt, and EfficientNetB7) on the FaceForensics++ dataset. EfficientNetB7 achieved the highest testing accuracy, outperforming the other models and making it the best of the four for detecting deepfakes. However, all models showed signs of overfitting, indicating the need for further improvements in generalization ability. To enhance deepfake detection, future research should explore techniques such as data augmentation, regularization, and larger datasets. Additionally, incorporating advanced techniques like attention mechanisms, ensemble learning, and adversarial training can further improve the accuracy and robustness of deepfake detection systems. This study emphasizes the significance of deepfake detection and provides insights for selecting appropriate models and addressing challenges in the field.

REFERENCES

[1] H. S. Shad, M. M. Rizvee, N. T. Roza, S. Hoq, M. Monirujjaman Khan, A. Singh, A. Zaguia, S. Bourouis, et al., “Comparative analysis of deepfake image detection method using convolutional neural network,” Computational Intelligence and Neuroscience, vol. 2021, 2021.
[2] A. Beckmann, A. Hilsmann, and P. Eisert, “Fooling state-of-the-art deepfake detection with high-quality deepfakes,” 2023.
[3] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning, pp. 6105–6114, PMLR, 2019.
[4] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., “Searching for mobilenetv3,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 1314–1324, 2019.
[5] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986, 2022.
[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
[7] M. Westerlund, “The emergence of deepfake technology: A review,” Technology innovation management review, vol. 9, no. 11, 2019.
[8] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” 2014.
[9] H. G. Hong-Shuo Chen, Mozhdeh Rouhsedaghat, “Defakehop: A light-weight high-performance deepfake detector,” in IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2021.
[10] L. Guarnera, O. Giudice, and S. Battiato, “Fighting deepfake by exposing the convolutional traces on images,” IEEE Access, vol. 8, pp. 165085–165098, 2020.
[11] R. Wang, F. Juefei-Xu, L. Ma, X. Xie, Y. Huang, J. Wang, and Y. Liu, “Fakespotter: A simple yet robust baseline for spotting ai-synthesized fake faces,” 2020.
[12] X. Gong, S. Chang, Y. Jiang, and Z. Wang, “Autogan: Neural architecture search for generative adversarial networks,” 2019.
[13] A. Sherstinsky, “Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network,” Physica D: Nonlinear Phenomena, vol. 404, p. 132306, Mar. 2020.
[14] R. C. Staudemeyer and E. R. Morris, “Understanding lstm – a tutorial into long short-term memory recurrent neural networks,” 2019.
[15] P. Saikia, D. Dholaria, P. Yadav, V. Patel, and M. Roy, “A hybrid cnn-lstm model for video deepfake detection by leveraging optical flow features,” in 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1–7, IEEE, 2022.
[16] T.-H. S. Li, P.-H. Kuo, T.-N. Tsai, and P.-C. Luan, “Cnn and lstm based facial expression analysis model for a humanoid robot,” IEEE Access, vol. 7, pp. 93998–94011, 2019.
[17] N. Carlini and H. Farid, “Evading deepfake-image detectors with white- and black-box attacks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 658–659, 2020.
[18] S. Hussain, P. Neekhara, M. Jere, F. Koushanfar, and J. McAuley, “Adversarial deepfakes: Evaluating vulnerability of deepfake detectors to adversarial examples,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 3348–3357, 2021.
[19] S. Lyu, “Deepfake detection: Current challenges and next steps,” in IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 4–5, 2020.
[20] D. D. Kathiravan Srinivasan, Lalit Garg, “Performance comparison of deep cnn models for detecting driver’s distraction,” Tech Science, vol. 68, 2021.
[21] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, “Faceforensics++: Learning to detect manipulated facial images,” 2019.
[22] A. Jadhav, A. Patange, J. Patel, H. Patil, and M. Mahajan, “Deepfake video detection using neural networks,” International Journal for Scientific Research and Development, vol. 8, no. 1, pp. 1016–1019, 2020.