Deep-Fake Detection Using Deep Learning
Abstract: Deep-Fake technology is a new technology that has become extremely popular in the present generation.
Deep-Fakes now pose serious threats: spreading misinformation to the world, damaging the reputations of political
figures, and blackmailing individuals. Because this technology has taken over the internet in a very short span of
time, and numerous ready-made apps are available for producing Deep-Fake content, many researchers have built
systems for detecting whether content is fake or real. Good results can be attained with deep learning (DL) based
approaches, and this paper mainly presents the results of our current study with reference to traditional machine
learning (ML) techniques. This project aims at the detection of video deepfakes using deep learning techniques such
as ResNext and LSTM. We have achieved deepfake detection by using transfer learning, where a pretrained ResNext CNN
is used to obtain a feature vector, and the LSTM layer is then trained using these features.
How to Cite: Nagashree K T; Shristi; Sania Firdaushi; Shweta B Patil; Shristi Singh (2025). Deep-Fake Detection Using Deep
Learning. International Journal of Innovative Science and Research Technology, 10(1), 1700-1706.
https://doi.org/10.5281/zenodo.14808073
in real-world scenarios such as social media monitoring, digital forensics, and content verification.

As deepfake technology continues to evolve, the need for sophisticated detection systems becomes increasingly
critical. The use of ResNext and LSTM models in deepfake analysis represents a cutting-edge solution that combines
advanced feature extraction and temporal analysis capabilities. By ensuring the integrity of digital media, this
approach helps combat misinformation and protects privacy, playing a vital role in maintaining trust in the digital age.

II. LITERATURE SURVEY

Jaiswal et al. (2020) proposed a deepfake detection system using ResNeXt for face image analysis. Their system used
a pretrained ResNeXt model on face images to classify whether the face was real or synthetic. The results
demonstrated that ResNeXt provided higher classification accuracy than other models such as ResNet and VGGNet.

Cozzolino et al. (2018) explored the use of ResNeXt as part of a larger model to detect manipulated facial features
in images. By training ResNeXt on a dataset containing both real and manipulated faces, they achieved robust
performance in distinguishing deepfakes from genuine images.

Xue et al. (2020) extended the idea of ResNeXt-based feature extraction and incorporated spatial attention
mechanisms to focus on critical face regions, improving detection accuracy on deepfake facial images.

Dolhansky et al. (2020) evaluated deepfake video detection using the ResNeXt model. They trained the network on a
combination of frame-level CNNs and temporal modeling techniques, achieving high detection performance, especially
when fine-tuned with deepfake-specific datasets like FaceForensics.

Dang et al. (2019) employed an LSTM-based model to detect temporal inconsistencies across the frames of a video.
Their system involved training LSTM units on sequences of frames, learning the temporal dynamics of human faces,
and identifying irregularities induced by deepfake generation processes.

Afchar et al. (2018) introduced the first LSTM-based framework for detecting deepfakes by exploiting temporal
features. The model used a combination of a CNN for feature extraction and an LSTM for sequence learning. It was
shown to be effective in detecting videos with synthetic faces, even when high-quality GAN models were used for
deepfake generation.

Yang et al. (2020) explored the use of an LSTM model in conjunction with autoencoders for deepfake detection. The
LSTM was tasked with detecting anomalies in facial movements, such as unnatural blinking or lip-sync errors, which
are frequently present in deepfake videos.

“Enabling Robots to Understand Incomplete Natural Language Instructions Using Commonsense Reasoning” by Haonan Chen
et al. (2020)
This paper introduces a new approach termed Language-Model-based Commonsense Reasoning (LMCR), which empowers a
robot to comprehend natural language instructions from humans, assess its surrounding environment, and autonomously
infer any missing details from the instruction by utilizing contextual information and a new commonsense reasoning
framework. The research was presented at the 2020 IEEE International Conference on Robotics and Automation (ICRA).
Natural language is inherently unstructured and frequently depends on common sense for interpretation, posing
significant challenges for robots in accurately and effectively understanding such language. For instance, in a
domestic environment where a robot is holding a bottle of water alongside scissors, a plate, bell peppers, and a mug
on a table, a human might instruct the robot to “pour me some water.” From the robot's viewpoint, this command lacks
specificity regarding the destination for the water, whereas a human would presumably infer that the water should be
poured into the mug. A robot equipped with the capability to independently resolve such ambiguity in natural
language instructions, akin to human reasoning, would facilitate more natural interactions with humans and enhance
its overall functionality. The LMCR approach enables a robot to listen to human instructions, observe its
environment, resolve any ambiguity, and subsequently execute the designated task autonomously.

III. METHODOLOGY

Proposed Solution
Deepfake detection has become a critical task in the age of advanced artificial intelligence, where synthetic media
is increasingly convincing and widespread. Leveraging deep learning models such as ResNeXt and Long Short-Term
Memory (LSTM) networks for deepfake detection combines the strengths of both spatial and temporal feature
extraction. This methodology focuses on utilizing these two deep learning models to distinguish between real and
fake content in videos, aiming to improve the accuracy and robustness of detection.

Deepfake videos are created using advanced generative models like Generative Adversarial Networks (GANs), which can
manipulate visual and auditory content to make it appear authentic. As deepfake technology evolves, so too must the
methods used to detect it. In this context, a combination of convolutional neural networks (CNNs) like ResNeXt for
spatial feature extraction and recurrent neural networks (RNNs) like LSTMs for temporal feature learning offers a
potent approach.
A. Data Collection and Preprocessing:-
The first step in deepfake detection involves collecting a comprehensive dataset containing both real and deepfake
videos. Popular datasets for this task include the DeepFake Detection Challenge (DFDC), Celeb-DF, and
FaceForensics++. These datasets provide labeled video data, with the real videos representing authentic content and
the deepfakes containing manipulated media. Preprocessing of this data is essential for ensuring that the model
receives high-quality input.

For each video, the frames are extracted and then preprocessed to facilitate effective model learning. Frame
extraction involves converting the video into individual frames at a certain frame rate, typically 30 frames per
second, so that each frame is treated as an independent image for analysis. Face detection algorithms such as MTCNN
or Dlib are employed to locate and align faces in each frame, ensuring that the model focuses on the relevant
portions of the video. After face alignment, the frames are normalized to a consistent size and pixel value range,
typically rescaled between 0 and 1 or standardized to have zero mean and unit variance.
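The two normalization options mentioned above, together with frame sampling, can be illustrated with a minimal sketch. It uses only the Python standard library and a toy flattened grayscale crop. The function names, the 8-bit pixel assumption, and the sampling rule are illustrative assumptions, not the authors' implementation.

```python
from statistics import fmean, pstdev


def sample_frame_indices(total_frames, native_fps, target_fps):
    """Pick frame indices so that roughly `target_fps` frames per second are kept."""
    step = max(1, round(native_fps / target_fps))
    return list(range(0, total_frames, step))


def rescale_unit(pixels):
    """Map 8-bit pixel values (0-255) into the range [0, 1]."""
    return [p / 255.0 for p in pixels]


def standardize(pixels):
    """Shift and scale values so they have zero mean and unit variance."""
    mean = fmean(pixels)
    std = pstdev(pixels)
    if std == 0:
        # A constant image has no variance; return zeros instead of dividing by zero.
        return [0.0 for _ in pixels]
    return [(p - mean) / std for p in pixels]


# Toy data: a flattened 2x2 grayscale face crop (assumed 8-bit values).
face_crop = [0, 64, 128, 255]
unit = rescale_unit(face_crop)
z = standardize(face_crop)

# A 3-second clip recorded at 30 fps, sampled down to about 10 fps.
indices = sample_frame_indices(total_frames=90, native_fps=30, target_fps=10)
```

In practice these operations would be applied per channel to the aligned face crops, usually with an array library rather than plain lists.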
Additionally, data augmentation techniques, such as random rotations, flips, and scaling, are often applied to
increase the variety of training examples and prevent overfitting. This ensures that the model learns to generalize
better and does not become overly sensitive to specific facial angles or positions.

B. Feature Extraction Using ResNeXt:-
Once the data is preprocessed, the next step is to extract relevant features from the frames using a deep
convolutional neural network (CNN). In this case, the ResNeXt architecture is used for feature extraction. ResNeXt
is a variant of the traditional ResNet model that incorporates a concept known as "cardinality": the number of
independent paths through the network. By increasing the cardinality, ResNeXt is able to capture richer, more
diverse features without a significant increase in computational complexity.

The ResNeXt model is pretrained on large datasets like ImageNet and then fine-tuned for deepfake detection.
Fine-tuning involves updating the weights of the network to better adapt it to the specific characteristics of real
and fake video frames. During this process, the model learns to identify subtle discrepancies between authentic and
manipulated faces, such as unnatural facial expressions, lighting inconsistencies, and irregularities in facial
texture and details that are often present in deepfakes.

At the end of the ResNeXt network, the output is typically a high-dimensional feature vector that encodes spatial
information about each frame. These feature vectors are crucial as they contain the learned representation of the
frame’s content, which is then passed on to the temporal model for further analysis.

C. Temporal Feature Learning Using LSTM:-
While ResNeXt excels at extracting spatial features, detecting deepfakes in videos requires understanding how these
features evolve over time. Deepfake videos often contain subtle artifacts in the way faces move or interact with
their surroundings. To capture these temporal dynamics, a Long Short-Term Memory (LSTM) network is employed.

LSTMs are a special type of recurrent neural network (RNN) designed to capture long-range dependencies in sequential
data. This is particularly useful for video analysis, where the relationships between consecutive frames hold
important clues to the authenticity of the video. Unlike traditional RNNs, LSTMs are equipped with gates that
regulate the flow of information, allowing the network to remember or forget information over long sequences, which
is essential when analyzing the entire video.

In this methodology, the feature vectors produced by ResNeXt from each frame are fed into the LSTM network as a
sequence. The LSTM processes these sequential features, learning the temporal dependencies between frames. It is
through this analysis that the LSTM can detect anomalies such as unnatural movement patterns, inconsistencies in
facial expressions, or other temporal artifacts that are characteristic of deepfakes.

To enhance the temporal modeling, a bidirectional LSTM can be used. A bidirectional LSTM processes the sequence of
frames in both forward and backward directions, which helps capture both past and future context in the video. This
further improves the model’s ability to detect subtle inconsistencies that may not be evident when only considering
frames in a single direction.
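To make the gating mechanism concrete, the following is a minimal sketch of a single LSTM cell applied to a short sequence of per-frame feature vectors. It is written in plain Python for readability. The tiny sizes, random weights, and helper names (`lstm_step`, `init_params`, `run_lstm`) are assumptions for illustration only. A real system would use a trained LSTM layer from a deep learning framework and much larger feature vectors.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    """Compute weights @ vector + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step: gates decide what to forget, store, and expose."""
    z = x + h_prev  # concatenate frame features with the previous hidden state
    f = [sigmoid(a) for a in affine(params["Wf"], params["bf"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["Wi"], params["bi"], z)]    # input gate
    o = [sigmoid(a) for a in affine(params["Wo"], params["bo"], z)]    # output gate
    g = [math.tanh(a) for a in affine(params["Wg"], params["bg"], z)]  # candidate values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def init_params(input_size, hidden_size, seed=0):
    """Random small weights; a trained model would load learned values instead."""
    rng = random.Random(seed)
    params = {}
    for gate in "fiog":
        params["W" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
            for _ in range(hidden_size)
        ]
        params["b" + gate] = [0.0] * hidden_size
    return params


def run_lstm(sequence, params, hidden_size):
    """Feed the frame features in order and return the final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h


# Toy sequence: three frames, each described by a 4-dimensional feature vector.
feature_size, hidden_size = 4, 3
frames = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1], [0.5, 0.4, 0.3, 0.2]]
params = init_params(feature_size, hidden_size)
final_hidden = run_lstm(frames, params, hidden_size)
```

A bidirectional variant would run a second cell over `frames[::-1]` with its own parameters and concatenate the two final hidden states.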
D. Classification Layer and Model Output:-
After processing the video sequence through both ResNeXt and LSTM, the next step is to classify the video as either
real or fake. The output of the LSTM network is a sequence of hidden states that summarize the temporal information
learned from the frames. These hidden states are passed through a fully connected layer, which is followed by a
softmax or sigmoid activation function to produce a final classification.

In binary classification tasks, such as deepfake detection, the output layer typically generates two classes: one
indicating the video is real and the other indicating it is a deepfake. The model is trained using a loss function
like binary cross-entropy, which measures the difference between the predicted and actual labels. Optimization
techniques such as Adam or stochastic gradient descent (SGD) are used to minimize the loss function and update the
model’s parameters during training.

E. Model Evaluation and Deployment:-
Once the model is trained, it is important to evaluate its performance. Common evaluation metrics for binary
classification tasks include accuracy, precision, recall, F1-score, and the area under the Receiver Operating
Characteristic curve (AUC-ROC). These metrics give insights into how well the model distinguishes between real and
fake videos.

To ensure the model generalizes well to unseen data, techniques like cross-validation or hold-out validation are
used. Cross-validation involves splitting the dataset into multiple subsets and training the model on different
subsets while testing on others. This ensures that the model is not overfitting to any particular portion of the
data.

After evaluation, the trained model can be deployed for real-time deepfake detection. The system can be used to
automatically analyze incoming video content, providing a label for each video, indicating whether it is real or
fake.

IV. RESULTS AND DISCUSSIONS

Existing Methods
In the realm of deepfake detection, leveraging advanced machine learning techniques such as ResNeXt and Long
Short-Term Memory (LSTM) networks has demonstrated significant promise. Both of these models are chosen for their
unique strengths in handling the specific challenges posed by deepfake content, which involves subtle manipulations
of image or video data that often require the consideration of both spatial and temporal patterns.

ResNeXt is a deep convolutional neural network (CNN) architecture that is designed to be both highly efficient and
effective in capturing spatial features from images. Its primary strength lies in its ability to learn features from
input data with a high degree of accuracy while maintaining computational efficiency. By introducing a cardinality
dimension to the network architecture, ResNeXt can handle more complex feature extraction tasks with fewer
parameters than traditional CNNs, thus improving both its performance and generalization capabilities. In the
context of deepfake detection, ResNeXt is particularly adept at recognizing inconsistencies in facial features,
textures, and other visual artifacts that are commonly introduced in manipulated images or videos. These artifacts
may be subtle and difficult for the human eye to detect, but deep learning models like ResNeXt can efficiently
identify them, thus improving the accuracy of deepfake detection.
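The relationship between cardinality and parameter count can be checked with a short calculation. The sketch below compares the weight count of a standard ResNet bottleneck block (256 to 64 to 256 channels) with a ResNeXt block built from 32 parallel paths of width 4, the configuration described in the original ResNeXt paper. Biases and batch-normalization parameters are ignored. The function names are hypothetical, and this is not a measurement of the model used in this study.

```python
def bottleneck_params(in_channels, width, out_channels, kernel=3):
    """Weights in a 1x1 -> kxk -> 1x1 bottleneck (biases and batch norm ignored)."""
    reduce = in_channels * width              # 1x1 convolution that narrows the channels
    spatial = kernel * kernel * width * width  # kxk convolution at the narrow width
    expand = width * out_channels             # 1x1 convolution that restores the channels
    return reduce + spatial + expand


def resnext_block_params(in_channels, cardinality, path_width, out_channels, kernel=3):
    """Weights in `cardinality` parallel bottleneck paths whose outputs are summed."""
    return cardinality * bottleneck_params(in_channels, path_width, out_channels, kernel)


resnet_block = bottleneck_params(256, 64, 256)
resnext_block = resnext_block_params(256, cardinality=32, path_width=4, out_channels=256)
print(resnet_block, resnext_block)  # 69632 70144
```

Under these assumptions the 32-path block has almost the same number of weights as the single-path bottleneck, which is why cardinality can be increased without a significant change in model size.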
On the other hand, LSTM networks, a type of recurrent neural network (RNN), are designed to model sequential data
and capture long-range dependencies within such data. LSTMs are particularly useful in deepfake detection when
analyzing video content, as they can learn the temporal relationships between consecutive video frames. Deepfakes
often exhibit abnormalities in facial movements, lip synchronization, or eye motion that become evident only when
considering the flow of video data over time. LSTMs are capable of identifying these temporal inconsistencies by
modeling the sequence of frames in a video, detecting irregular patterns that signal potential manipulation.

[...] identifying advanced, high-quality deepfakes, though challenges in generalization remain.
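The training loss and the threshold-based evaluation metrics named in the methodology (binary cross-entropy, accuracy, precision, recall, and F1-score) can be computed as in the following minimal sketch. The labels and predicted probabilities are made-up toy values, not results from this study. The 0.5 decision threshold and the convention that label 1 means "fake" are assumptions.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted fake-probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def classification_metrics(labels, probs, threshold=0.5):
    """Accuracy, precision, recall, and F1 for the positive ("fake") class."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, q in zip(labels, preds) if y == 1 and q == 1)
    tn = sum(1 for y, q in zip(labels, preds) if y == 0 and q == 0)
    fp = sum(1 for y, q in zip(labels, preds) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(labels, preds) if y == 1 and q == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example: 1 = fake, 0 = real; probabilities are hypothetical model outputs.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1]
loss = binary_cross_entropy(labels, probs)
metrics = classification_metrics(labels, probs)
```

AUC-ROC is omitted here because it depends on the ranking of scores across all thresholds rather than on a single cut-off.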