
2022 2nd Asian Conference on Innovation in Technology (ASIANCON)
Pune, India, Aug 26-28, 2022
DOI: 10.1109/ASIANCON55314.2022.9908862

CNN Based Deep Learning Model for Deepfake Detection

1st Vedant Jolly, 2nd Mayur Telrandhe, 3rd Aditya Kasat, 4th Atharva Shitole, 5th Kiran Gawande
Computer Engineering Department, Sardar Patel Institute of Technology, Mumbai, India
[email protected], [email protected], [email protected], [email protected], kiran [email protected]

Abstract—In recent years there has been massive progress in synthetic image generation and manipulation, which significantly raises concerns about its harmful applications in society. This can result in the spread of false information, leading to a loss of trust in digital content. This paper introduces an automated and effective approach to analysing facial expressions in videos, focused especially on the latest method used to produce hyper-realistic fake videos: Deepfake. Using the FaceForensics++ dataset to train our model, we achieved a detection rate of more than 99% for Deepfake, Face2Face, FaceSwap and NeuralTextures manipulations. Regular image forensics techniques are usually not very useful here because of the strong degradation of the data due to compression. This paper therefore follows a layered approach: first detecting the subject with the help of existing facial recognition networks, then extracting facial features using a CNN, and then passing them through an LSTM layer, where we exploit the temporal sequence of face manipulations between frames. Finally, we use Recycle-GAN, which internally makes use of generative adversarial networks, to merge spatial and temporal data.

Index Terms—Face Detection, FaceForensics++, DeepFake, Face2Face, FaceSwap, Neural Texture, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM)

I. INTRODUCTION

Over the past decades, the popularity of smartphones and the growth of social media have made digital photos and videos the most popular digital assets. On YouTube alone, 300 hours of video are uploaded every minute; every day, 5 billion videos are viewed and 1 billion hours are streamed across Facebook and Netflix combined. This widespread use of digital photography has been accompanied by a rise in photo-editing techniques, with editing software such as Photoshop as a prominent example. The proliferation of deepfakes in recent years raises serious concerns about the authenticity of digital content in the media and other online forums. Deepfake (a blend of "deep learning" and "fake") is a method that can superimpose one person's facial expressions from a source video onto a video of a target person, and it has demonstrated how computer graphics and visual effects can be used to defame people by making their faces look like those of different persons. A basic way to create deepfakes is with deep learning models such as autoencoders and generative adversarial networks, which are widely used in the field of computer vision. These models are used to capture a person's facial expressions and movements and to synthesize images of another person's face making similar expressions and movements [1]. Deepfake methods often require large amounts of image and video data to train models that produce realistic photos and videos. Because public figures such as celebrities and politicians have a large number of videos and photos available online, they were the first deepfake victims [1]. Many politicians and actors have become victims of deepfakes; for criminal purposes, videos are manipulated using methods such as FaceSwap and faceswap-GAN. To address this issue, several methods for detecting manipulated images have been proposed; most of them either analyse inconsistencies with respect to the conventional camera pipeline or rely on artifacts left by the manipulation in the resulting image. For video forgery detection, algorithms using hand-crafted features, deep learning algorithms, and, more recently, GAN-based methods are being investigated. For example, hand-crafted approaches include steganalysis methods and the detection of 3D head-pose inconsistencies. However, there is still room for improvement in state-of-the-art deepfake detection, especially on challenging data such as the FaceForensics++ (FF++) database. In this paper we select ResNet18 as the base architecture of our model. The reason for choosing ResNet18 is that it takes care of major problems such as vanishing and exploding gradients. It does this by using skip connections, and the advantage of adding this type of connection is that any layer that hurts the performance of the architecture can effectively be skipped.

The biggest challenge in detecting a deepfake image is how well we pre-process our image data, since this determines which features we highlight. Our model uses a CNN for feature extraction, followed by passing the extracted features to the LSTM layer.

II. LITERATURE REVIEW

We cover the most important related research on deepfakes in the following paragraphs.

1) Deepfake Methods: In the last couple of decades, interest in virtual face manipulation has increased greatly. Deepfake methods can be divided into different types according to the kind of face manipulation. The StyleGAN model synthesizes an entirely non-existent face through a GAN. These approaches produce incredible outcomes, such as high-resolution facial images with a great degree of realism. The identity swap technique, also called the face-swap method, is very popular for replacing the face of one person in an image or video with that of another person. This can be achieved through two different approaches: graphics-based approaches such as FaceSwap, and deep learning based approaches such as DeepFakes. Attribute manipulation, also known as face editing or face retouching, entails changing aspects of the face, such as hair or skin colour, gender, age, and the addition of spectacles. This manipulation is usually carried out with a GAN, such as the StarGAN approach. Expression swap, also known as face reenactment, modifies the facial expression of a person: Face2Face changes the facial expression of the person in the video based on the expression input given by another person. Further techniques exist for face morphing, which creates biometric face samples that resemble the given biometric information.

2) Deepfake Detection: DeepFake detection dominates research on monitoring multimedia information and has the positive intention of improving the confidentiality and integrity of multimedia content. In recent years, CNN based detection of generated multimedia has become more popular. A novel photo-response non-uniformity (PRNU) analysis method has been tested for its effectiveness at detecting DeepFake video manipulation; this PRNU analysis reveals a statistically significant difference in mean normalized cross-correlation scores between real and DeepFake videos [2]. Lugstein designed a novel pipeline to detect DeepFakes using PRNU, a technique well known for detecting facial retouching and face morphing attacks. Afchar proposed two mesoscopic models (Meso-4 and MesoInception-4), compact facial video forgery detection networks, to classify hyper-realistic forged videos based on DeepFake and Face2Face. Since compressed videos are severely degraded, microscopic analyses based on image noise are not applicable to them. Moreover, these models are efficient at detecting hyper-realistic forged videos at a low computational cost; the average detection rate was found to be 98% for DeepFake videos and 95% for Face2Face videos under real conditions of diffusion on the internet [3]. FaceForensics++ introduced a novel large-scale dataset of manipulated facial imagery composed of more than 1.8 million images from 1,000 videos, with pristine (i.e., real) sources and target ground truth to enable supervised learning [4]. The FaceForensics++ paper, published in 2019, used this dataset to train a CNN model tailored to detect face manipulations. In Lips Don't Lie, Haliassos suggested a generalizable and robust approach based on ResNet-18, known as LipForensics, which detects face forgery in videos using semantic irregularities of lip movements. Jeon proposed a transferable GAN-image detection framework (T-GD) that efficiently detects DeepFake images; the model works with a teacher-student relationship in which the two networks mutually improve detection performance.

III. SYSTEM ALGORITHM

The overall algorithm consists of the following major components:

1) Dataset Used:
The dataset which we have used is FaceForensics++. It consists of more than 1,000 manipulated YouTube videos as well as over a million images derived from them, and was provided by Google and Jigsaw. The novelty of this dataset is that, along with the data itself, it also provides an automated benchmark for facial manipulation detection. In particular, the benchmark is based on DeepFakes, Face2Face, FaceSwap and NeuralTextures as prominent representatives of facial manipulation at random compression levels and sizes [4]. Another unique aspect of the dataset is that FaceShifter has also been applied to the manipulated videos, so that there is no visible lag when the deepfake is applied over the video, which makes detecting the deepfake even more difficult.
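The dataset ships as videos, so frame-level training data has to be extracted first. Below is a minimal, illustrative sketch of such a frame-sampling step (not the authors' exact pipeline); the directory names ffpp_videos and ffpp_frames, the sampling stride and the per-video frame cap are assumptions made for this example.

```python
import os
import cv2  # OpenCV for video decoding and image I/O

def extract_frames(video_path, out_dir, every_n=10, max_frames=40):
    """Sample every n-th frame from a video and save it as a JPEG."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while saved < max_frames:
        ok, frame = cap.read()
        if not ok:            # end of video
            break
        if idx % every_n == 0:
            stem = os.path.splitext(os.path.basename(video_path))[0]
            cv2.imwrite(os.path.join(out_dir, f"{stem}_{idx:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Hypothetical FaceForensics++-style layout: one folder of real videos plus
# one folder per manipulation method.
if __name__ == "__main__":
    for label in ["original", "Deepfakes", "Face2Face", "FaceSwap", "NeuralTextures"]:
        src = os.path.join("ffpp_videos", label)
        if not os.path.isdir(src):
            continue
        for fname in os.listdir(src):
            if fname.endswith(".mp4"):
                extract_frames(os.path.join(src, fname),
                               os.path.join("ffpp_frames", label))
```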

2) DeepFake Methodology:
Over the years, deepfake models have improved steadily, to the point that it is now visually almost impossible to tell the difference between a deepfake video and the original one. First, let us look at some of the techniques that deepfakes use for manipulating videos and images.
a) Face2Face: Also known as facial reenactment, the main purpose of this method is to transfer expressions from the source to the target photo.
b) FaceSwap: This is used for facial identity manipulation. It is a graphics-based approach that uses each frame to build a model of the source face and then projects this model onto the target by minimizing the distance between the two frames.
c) NeuralTextures: This technique makes use of GANs and learned neural textures for facial reenactment.

Fig. 1. Residual Block [6]

The base architecture of our model is ResNet18. The reason for choosing the ResNet model is that it takes care of major problems such as the vanishing and exploding gradient. It makes this possible by using skip connections. The basic idea behind a skip connection is that it skips training for some layers and connects their input directly to a later layer's output. This helps the model learn the underlying mapping and allows the network to fit a residual mapping instead. The advantage of adding this type of skip connection is that if any layer hurts the performance of the architecture, it will be skipped by regularization [5].
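As a concrete illustration of the skip connection described above (cf. Fig. 1), here is a minimal PyTorch sketch of a basic residual block for the same-dimension case; ResNet18 additionally uses strided, downsampling variants that are omitted here.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic ResNet-style block: two 3x3 convolutions plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                           # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                   # residual addition: the convs only learn F(x)
        return self.relu(out)

# Example: a 64-channel feature map passes through with unchanged dimensions.
block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```

Because the block's output is x plus a learned correction, a block whose convolutions contribute nothing simply passes its input through, which is the "skipping" behaviour the text refers to.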
IV. PROPOSED SYSTEM

The system consists of two major components:

1) Detecting a DeepFake Image:
The main challenge in detecting a deepfake image is how well we pre-process our image data, since this determines which aspects of the image we highlight. The primary task of our model is to detect the subject that is going to be analysed further; for this we make use of existing facial recognition networks. After the face has been detected, the next step is to fine-tune the facial features of the current image so that most of the noise can be removed from the image.
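A minimal sketch of this face detection and clean-up step is given below. OpenCV's stock Haar-cascade detector stands in for the unnamed "existing facial recognition network", and the crop margin, output size and light Gaussian blur are illustrative choices rather than the paper's exact settings.

```python
import cv2

# Bundled Haar-cascade frontal-face detector shipped with OpenCV.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(image_bgr, size=224, margin=0.2):
    """Detect the largest face, enlarge the box by a margin, crop, smooth and resize it."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])   # keep the largest detection
    pad_w, pad_h = int(margin * w), int(margin * h)
    x0, y0 = max(x - pad_w, 0), max(y - pad_h, 0)
    x1 = min(x + w + pad_w, image_bgr.shape[1])
    y1 = min(y + h + pad_h, image_bgr.shape[0])
    face = image_bgr[y0:y1, x0:x1]
    face = cv2.GaussianBlur(face, (3, 3), 0)   # light smoothing to suppress high-frequency noise
    return cv2.resize(face, (size, size))
```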
We use a CNN-based deep learning model so that we can detect even the deepfake images produced by the manipulation methods in the forensics dataset. The CNN pipeline applies Gaussian blur and Gaussian noise so that noise and high-frequency components that are irrelevant to face detection can be suppressed. The advantage of applying this model is that we can eventually recognize more meaningful characteristics, thus increasing the accuracy of our model.

Fig. 2. Two Stream Neural Network Architecture [7]

2) Detecting the DeepFake Video:
We started our research by analysing a single frame at a time; although the accuracy achieved by this method was on par, the time it took was not justifiable for the accuracy achieved. We then tried a variety of techniques for processing frames in parallel for detection. We made use of the temporal sequence between frames, which helped our model detect deepfake videos. Our model uses a CNN for feature extraction; after the features are extracted, we pass them to our LSTM layer, where we exploit the temporal sequence of face manipulations between frames. The last layer of our model is a softmax function, which classifies the videos into the correct categories. We also made use of Recycle-GAN, which internally uses generative adversarial networks to merge spatial and temporal data [7]. The benefit of using Recycle-GAN is that, during evaluation, it passes the results back towards the start of the network so that the model can analyse its mistakes and adjust the relevant factors accordingly.

Fig. 3. ConvNet for spatial and temporal features analysis [7]
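To make the proposed pipeline concrete, the following is a hedged PyTorch sketch of a CNN-plus-LSTM video classifier of the kind described above: ResNet18 features per frame, an LSTM over the frame sequence, and a softmax output. The hidden size, clip length and other hyperparameters are illustrative, and the Recycle-GAN component is omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnLstmDetector(nn.Module):
    """Frame-level CNN features (ResNet18 backbone) -> LSTM over time -> softmax."""
    def __init__(self, hidden_size=256, num_classes=2):
        super().__init__()
        backbone = models.resnet18(weights=None)   # pretrained weights could be used instead
        self.feature_dim = backbone.fc.in_features  # 512 for ResNet18
        backbone.fc = nn.Identity()                 # keep the pooled 512-d features
        self.cnn = backbone
        self.lstm = nn.LSTM(self.feature_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                       # clips: (batch, frames, 3, H, W)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.view(b * t, c, h, w))   # per-frame CNN features
        feats = feats.view(b, t, self.feature_dim)
        _, (h_n, _) = self.lstm(feats)                 # temporal modelling across frames
        logits = self.classifier(h_n[-1])              # last hidden state of the sequence
        return torch.softmax(logits, dim=1)            # real-vs-fake probabilities

model = CnnLstmDetector()
probs = model(torch.randn(2, 8, 3, 224, 224))          # 2 clips of 8 frames each
print(probs.shape)                                      # torch.Size([2, 2])
```

During training one would normally return the raw logits and use a cross-entropy loss; the explicit softmax is kept here only to mirror the description above.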
V. RESULT ANALYSIS

1) FaceForensics++ dataset for trained models: For models trained only on the paper's databases, we observe that the model learns to detect only the manipulation strategies covered in those databases and misses manipulations found in real-world data. The model is fitted to the target image by minimizing the difference between the projected shape and the local landmarks, using the texture of the input image. At the end, the database model is blended with the image, with the appropriate colour correction applied accurately. We apply these steps to all individual and targeted pairs until the video ends. The frames selected for output are then used to build a surface populated with high concentration and density (refer Fig. 4a, 4b). These collected frames are then matched with the dataset faces under various facial expressions and lighting conditions. To analyse the dataset videos precisely, we used the Face2Face method to reproduce the frames and achieve the required result. We process each video through a pre-processing stage; here, we use the first frames to obtain a temporary face identity (i.e., a 3D model) and track it over the remaining frames.

Fig. 4. Result Set 1

2) YouTube dataset for trained models: Models trained one-to-one with YouTube data learn to find real-world deepfakes, and also learn to find the simpler deepfakes in the paper's databases. These models, however, failed to detect other types of manipulation (such as NeuralTextures). The large FaceForensics++ database enables us to train a state-of-the-art forgery detector in a supervised manner (refer Fig. 5a, 5b). In this case, we use the three default facial expressions that are used in our database. To mimic real-life situations, we chose to collect videos from anywhere online and from YouTube. Initial testing with the methods mentioned above made us realize that the dataset faces must be re-acquired with minimal delay for the tests not to fail and thereby produce accurate results. So we manually reviewed the resulting clips to ensure the selection of high-quality video and to avoid videos with occluded faces. We selected approximately 300,000 images for our dataset to be used by the three algorithms mentioned above. All tests are performed using the dataset videos from this set. The NeuralTextures method is based on the geometry used during training and test times. The Face2Face module was used to produce the gathered information; it was used to identify and correct the expressions using the mouth region only. Other parts, such as the eye area, were not modified, since modifying them would additionally require extra input based on the movements of the eyes.

Fig. 5. Result set 2

TABLE I
ACCURACY OF DIFFERENT ALGORITHMS TESTED

Method | Train | Validation | Test | Raw | HQ | LQ
Bayar and Stamm | 280374 | 52359 | 56382 | 98.74% | 82.97% | 66.84%
Rahmouni et al. | 280342 | 52356 | 56371 | 97.03% | 79.08% | 61.18%
MesoNet | 295164 | 55317 | 60540 | 95.23% | 83.10% | 70.47%
XceptionNet | 295578 | 55384 | 60614 | 99.26% | 95.73% | 81.00%
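For reference, per-compression accuracies such as those reported in Table I can be computed with a simple evaluation loop over a held-out split for each setting. In this sketch the data loaders for the Raw, HQ and LQ splits (raw_loader, hq_loader, lq_loader) are assumed to be prepared separately and are not part of the paper.

```python
import torch

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Fraction of correctly classified samples in a DataLoader of (input, label) pairs."""
    model.eval()
    correct = total = 0
    for inputs, labels in loader:
        probs = model(inputs.to(device))
        preds = probs.argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

# Hypothetical usage: one test loader per compression setting.
# for name, loader in {"Raw": raw_loader, "HQ": hq_loader, "LQ": lq_loader}.items():
#     print(name, f"{accuracy(model, loader):.2%}")
```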

VI. CONCLUSION

Deepfakes have led people to trust the media less and to see its content as less trustworthy and reliable. They may cause distress and harm to those who are targeted, spread misinformation and hate speech, and they may provoke political unrest, inflame society, or incite violence or war. This is especially important these days, as the technology for creating deepfakes is readily accessible and social media can spread such untrue content quickly. Sometimes deepfakes do not even need to be distributed to a large audience to create harmful effects: people who build deepfakes with malicious intent only need to bring them to the target audience as part of their destructive strategy, without using a social media platform. As new methods of deception emerge by the day, it is necessary to develop methods that can detect fakes with minimal training data. Our benchmark website is already being used for this transfer learning process, where the knowledge of one source of forgery is transferred to another target domain. We hope that the dataset and benchmark will be a stepping stone for future research in the field of digital media forensics, especially with a focus on facial forgery. To summarize, we were able to propose an automated benchmark for facial manipulation detection under random compression for standardized comparison, including a human baseline. We also presented a comprehensive evaluation of state-of-the-art hand-crafted and learned forgery detectors in a variety of settings, together with a state-of-the-art method for detecting forgeries designed for facial manipulation.

REFERENCES

[1] T. Nguyen, C. M. Nguyen, T. D. Nguyen and S. Nahavandi, "Deep Learning for Deepfakes Creation and Detection: A Survey," 2019.
[2] P. Korus and J. Huang, "Multi-Scale Analysis Strategies in PRNU-Based Tampering Localization," IEEE Transactions on Information Forensics and Security, pp. 1-1, 2016, doi: 10.1109/TIFS.2016.2636089.
[3] D. Afchar, V. Nozick, J. Yamagishi and I. Echizen, "MesoNet: a Compact Facial Video Forgery Detection Network," 2018 IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1-7, doi: 10.1109/WIFS.2018.8630761.
[4] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies and M. Niessner, "FaceForensics++: Learning to Detect Manipulated Facial Images," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1-11, doi: 10.1109/ICCV.2019.00009.
[5] K. Zhang, M. Sun, T. X. Han, X. Yuan, L. Guo and T. Liu, "Residual Networks of Residual Networks: Multilevel Residual Networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 6, pp. 1303-1314, June 2018, doi: 10.1109/TCSVT.2017.2654543.
[6] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
[7] A. Almars, "Deepfakes Detection Techniques Using Deep Learning: A Survey," Journal of Computer and Communications, vol. 9, pp. 20-35, 2021, doi: 10.4236/jcc.2021.95003.
[8] B. Malolan, A. Parekh and F. Kazi, "Explainable Deep-Fake Detection Using Visual Interpretability Methods," 2020 3rd International Conference on Information and Computer Technologies (ICICT), 2020, pp. 289-293, doi: 10.1109/ICICT50521.2020.00051.
[9] S. Agarwal, N. Girdhar and H. Raghav, "A Novel Neural Model based Framework for Detection of GAN Generated Fake Images," 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), 2021, pp. 46-51, doi: 10.1109/Confluence51648.2021.9377150.
[10] N. S. Ivanov, A. V. Arzhskov and V. G. Ivanenko, "Combining Deep Learning and Super-Resolution Algorithms for Deep Fake Detection," 2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus), 2020, pp. 326-328, doi: 10.1109/EIConRus49466.2020.9039498.

