
DEEPFAKE DETECTION SYSTEM

USING
DEEP LEARNING

CS6611 – CREATIVE AND INNOVATIVE PROJECT

Submitted by

AMRITHA R (2021103506)

INFANCY P (2021103532)

POOJAA S (2021103555)

in partial fulfillment for the award

of the degree of

BACHELOR OF ENGINEERING
IN

COMPUTER SCIENCE AND ENGINEERING

COLLEGE OF ENGINEERING, GUINDY

ANNA UNIVERSITY :: CHENNAI 600 025

MAY 2024
ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report titled DeepFake Detection System using Deep Learning is the bonafide work of Amritha R (2021103506), Infancy P (2021103532) and Poojaa S (2021103555), who carried out the project work under my supervision, in partial fulfillment of the requirements of CS6611 – Creative and Innovative Project.

Dr. S. VALLI
PROFESSOR & HEAD OF THE DEPARTMENT
Department of Computer Science and Engineering
Anna University, Chennai – 600 025

Dr. P. PRABHAVATHY
ASSISTANT PROFESSOR, SUPERVISOR
Department of Computer Science and Engineering
Anna University, Chennai – 600 025

Dr. S. RENUGADEVI
ASSISTANT PROFESSOR, SUPERVISOR
Department of Computer Science and Engineering
Anna University, Chennai – 600 025


ABSTRACT

Deepfake technology has surged as a significant threat to the credibility of online media, prompting extensive research into detection methods. Prior endeavors predominantly concentrated on identifying intra-modal artifacts within either audio or visual data. However, real-world deepfake content frequently intertwines both audio and visual manipulation, necessitating a more comprehensive approach. In response, our project introduces a pioneering method that addresses this multi-modal challenge. By leveraging advanced techniques such as spectrogram analysis for audio feature extraction and a Vision Transformer for visual feature extraction, we capture the intricacies of manipulated content. Moreover, our approach incorporates temporal-spatial information through a dedicated encoder, enhancing the model's understanding of dynamic alterations within the data. To fuse multi-modal features and discern their underlying relationships, we employ a sophisticated Multi-Modal Joint Decoder. Crucially, our method includes a Cross-Modal Classifier, which scrutinizes discrepancies between audio and visual elements, effectively detecting deepfake manipulation. To facilitate rigorous evaluation, we introduce DefakeAVMiT, a benchmark dataset meticulously crafted to encompass a diverse array of visuals and corresponding audios. Experimental validation on DefakeAVMiT, alongside established datasets like FakeAVCeleb and DFDC, demonstrates superior performance compared to existing methodologies. Our approach exhibits robust generalization across various forgery techniques.

ACKNOWLEDGEMENT

Foremost, we would like to express our sincere gratitude to our project


guide, Dr. P. Prabhavathy, Assistant Professor, Department of Computer Science
and Engineering, College of Engineering Guindy, Chennai, who has been a constant source
of inspiration. We thank her for the continuous support and guidance, which were
instrumental in taking the project to successful completion.

We are grateful to Dr. S. Valli, Professor and Head, Department of


Computer Science and Engineering, College of Engineering Guindy, Chennai for
her support and for providing the necessary facilities to carry out the project.

We would also like to thank our friends and family for their encouragement
and continued support. We would also like to thank the Almighty for giving us the
moral strength to accomplish our task.

AMRITHA R INFANCY P POOJAA S

TABLE OF CONTENTS

CHAPTER TITLE PAGE


NO. NO.
ABSTRACT iii

LIST OF TABLES viii

LIST OF FIGURES ix

1. INTRODUCTION 1

1.1 OBJECTIVES 2

1.2 PROBLEM STATEMENT 3

1.3 NEED FOR THE SYSTEM 3

1.4 CHALLENGES IN THE SYSTEM 4

1.5 SCOPE OF THE PROJECT 4

2. LITERATURE SURVEY 5

2.1 DEEPFAKE MANIPULATION & DETECTION 5

2.2 MULTI-MODAL REPRESENTATION LEARNING 6

2.3 GAPS IDENTIFIED 7

3. SYSTEM DESIGN 8

3.1 SYSTEM ARCHITECTURE 8

3.2 SYSTEM REQUIREMENTS 9

3.2.1 FUNCTIONAL REQUIREMENTS 9

3.2.1.1 HARDWARE REQUIREMENTS 9

3.2.1.2 SOFTWARE REQUIREMENTS 9

3.2.2 NON-FUNCTIONAL REQUIREMENTS 11

3.2.2.1 PERFORMANCE 11

3.2.2.2 RELIABILITY 11

3.2.2.3 SECURITY 11

3.2.2.4 USABILITY 12

3.2.2.5 MAINTAINABILITY 12

4. MODULE DESCRIPTION 13

4.1.1 PRE-PROCESSING MODULE 13

4.1.2 AUDIO FEATURE EXTRACTION MODULE 16

4.1.3 VIDEO FEATURE EXTRACTION MODULE 17

4.1.4 MULTI MODAL FUSION MODULE 20

4.1.5 FRONT END INTEGRATION MODULE 22

4.1.6 INNOVATION 23

5. IMPLEMENTATION AND RESULTS 25

5.1 DATASET DESCRIPTION 25

5.2 DATA PREPROCESSING 26

5.3 AUDIO FEATURE EXTRACTION 28

5.4 VIDEO FEATURE EXTRACTION 30

5.5 MULTI MODAL FUSION 32

5.6 FRONT END INTEGRATION 34

6. TEST CASES AND PERFORMANCE EVALUATION 40

6.1 TEST CASES 40

6.2 PERFORMANCE EVALUATION 43

6.2.1 ACCURACY 43

6.2.2 PRECISION 43

6.2.3 RECALL 43

6.2.4 F1-SCORE 44

6.2.5 INTERSECTION OVER UNION (IOU) 44

6.2.6 AREA UNDER CURVE 44

7. CONCLUSION AND FUTURE WORK 46
7.1 CONCLUSION 46

7.2 FUTURE WORK 46

REFERENCES 47

LIST OF TABLES

TABLE NO. TITLE PAGE NO.


6.1.1 Test cases for Preprocessing Module 40
6.1.2 Test cases for Audio Feature Extraction Module 40
6.1.3 Test cases for Video Feature Extraction Module 41
6.1.4 Test cases for Multi Modal Fusion Module 41
6.1.5 Test cases for FrontEnd Integration Module 42
6.2 Model results 45

LIST OF FIGURES

FIGURE NO. TITLE PAGE NO.


1 The motivation for detecting deepfake with audio-visual 1
3.1 Architecture diagram 8
4.1 Data Preprocessing module 15
4.3 Audio and Video Feature Extraction module 18
4.4 Multi Modal Fusion module 21
4.5 Frontend Integration module 23
5.1.1 Fake Videos 25
5.1.2 Real Videos 25
5.2.1 Bounding box coordinates in format [xmin, ymin, xmax, ymax] 26
5.2.2.1 Extracted face images cropped using box coordinates for real video 27
5.2.2.2 Extracted face images cropped using box coordinates for fake video 27
5.3 Extracted audio features using spectrogram analysis 29
5.4 Extracted video features using ViT pretrained model 31
5.5 Obtained results using Cross Modal Classifier 33
5.6.1 Home Page 36
5.6.2 Upload a file to get the result 36
5.6.3 Handling HTTP requests 38
5.6.4 Result along with Trust Score (FAKE video) 38
5.6.5 Result along with trust score (REAL video) 39

CHAPTER 1

INTRODUCTION
The rise of deepfake technology has sparked widespread concern in recent years, as it presents a
dual nature of both innovation and peril. While these techniques hold promise in areas like
filmmaking, computer graphics, and virtual reality, they also pose a grave threat to public trust and
security. Deepfakes, which seamlessly manipulate both audio and visual elements, have been
weaponized by malicious actors to deceive, defraud, and manipulate. This technology enables
attackers to bypass authentication systems, impersonate prominent figures such as celebrities and
politicians, and perpetrate financial scams with alarming ease. Moreover, the accessibility of
deepfake tools, such as Face Swap and Lip-Sync, empowers individuals with little expertise to
produce convincing forgeries, amplifying the potential for widespread harm.

Figure 1: The motivation for detecting deepfakes with audio-visual data. Attackers aim to generate forged videos with both visual and audio content maliciously modified. The uploaded videos can pose a looming threat to everyone on the platform. Our method focuses on exploiting the audio-visual inconsistency among temporal and spatial information for multi-modal deepfake detection. The detector can effectively distinguish whether the media contains forged visual or audio content to help defend against deepfake attacks.

As shown in Figure 1, a poignant example of this threat is the viral video featuring the
likeness and voice of Meta CEO Mark Zuckerberg, which circulated on Instagram, misleading
unsuspecting viewers. Such instances not only endanger the reputation and privacy of the
counterfeited individual but also erode public trust in the authenticity of digital media.
Compounding the challenge is the fact that deepfakes often involve simultaneous manipulation of
both audio and visual components, making them exceedingly difficult to detect using traditional methods.

In response to this evolving threat landscape, researchers have developed various techniques
for deepfake detection. However, many of these approaches focus on analyzing either audio or
visual data in isolation, overlooking the nuanced interplay between the two modalities. While some
efforts have explored multi-modal detection, these methods have not fully harnessed the potential
synergies between audio and visual signals. Moreover, existing datasets for training and evaluating
deepfake detectors often lack diversity in terms of forgery methods and modalities, limiting their
effectiveness in real-world scenarios.
To address these gaps, we introduce DefakeAVMiT, a comprehensive multi-modal dataset
comprising over 6,000 visuals paired with corresponding audios, each potentially modified using a
range of forgery techniques. This dataset provides a rich training environment for developing robust
multi-modal deepfake detection systems. Additionally, we propose our novel approach, Audio-
Visual Joint Learning for Detecting Deepfake, which leverages the complementary nature of
audio and visual signals to identify inconsistencies indicative of manipulation. By jointly analyzing
temporal and spatial cues across both modalities, our method achieves superior detection
performance compared to traditional uni-modal systems and existing multi-modal approaches.

1.1 OBJECTIVES
The following are the objectives of the project:
1. High Accuracy: Develop algorithms and models that can accurately identify deepfake
content across various media types (images, videos, audio) and manipulation techniques (face
swapping, voice synthesis, etc.), minimizing false positives and negatives.
2. Scalability: Design the system to efficiently process large volumes of data, enabling it to
handle the vast amount of content generated and shared on online platforms in real-time or near-
real-time.
3. Robustness: Ensure the system's effectiveness against sophisticated deepfake techniques
by continuously updating and refining the detection methods to stay ahead of evolving manipulation
strategies.
4. Ethical and Legal Compliance: Establish guidelines and protocols to ensure the ethical
use of the system, respecting privacy rights, and adhering to legal regulations regarding data
handling and content moderation. Additionally, provide transparency on how the system operates
and its limitations to foster trust among users and stakeholders.

1.2 PROBLEM STATEMENT

With the proliferation of deepfake technology, the ability to generate highly realistic
synthetic media has raised significant concerns regarding its potential misuse for spreading
misinformation and manipulating public opinion. Deepfake videos pose a serious threat to the
integrity of visual and auditory content on the internet, potentially leading to widespread
misinformation campaigns, privacy breaches, and social unrest. The aim of this project is to develop
an effective deepfake detection system using deep learning networks to combat the spread of
synthetic media manipulation. The system will aim to accurately identify and flag deepfake
content. The primary challenge lies in distinguishing between genuine and manipulated video,
considering the rapid advancements in deepfake generation techniques that continually evolve to
deceive detection algorithms.

1.3 NEED FOR THE SYSTEM

The rise of deepfake technology has underscored the urgent need for a robust detection
system built upon deep learning networks. These systems play a pivotal role in countering the
spread of misinformation by identifying and flagging fabricated videos that pose a threat to the
integrity of digital media. By preserving trust and authenticity in visual and auditory content,
deepfake detection systems help mitigate the erosion of confidence in online information sources.
Moreover, they serve as a crucial defense mechanism against potential harm to individuals and
public figures, protecting their reputations, privacy, and security from malicious exploitation.
Beyond individual safety, these systems contribute to broader societal efforts in preventing fraud,
cybercrime, and identity theft. Additionally, by promoting media literacy and awareness, they
empower users to critically evaluate the authenticity of digital content and make informed
decisions. Integrated into content moderation workflows, deepfake detection capabilities support
platforms in their governance efforts, enabling them to maintain safer online environments by
identifying and removing illicit or harmful deepfake content. In essence, the development and
deployment of deepfake detection systems using deep learning networks address critical societal
needs by combating misinformation, preserving trust, protecting individuals, promoting media
literacy, and supporting content moderation efforts in the digital age.

1.4 CHALLENGES IN THE SYSTEM

Developing a deepfake detection system using deep learning networks presents a host of
challenges that demand innovative solutions. Adversarial attacks continuously evolve, posing a
formidable hurdle as creators tweak their techniques to outsmart detection algorithms. Acquiring
diverse datasets containing both genuine and deepfake media is critical, yet challenging due to
privacy concerns and data scarcity. Ensuring models generalize well to unseen variations and
sources while maintaining interpretability remains a balancing act. Scalability is essential for real-
time processing of vast media volumes, requiring optimizations in inference speed and resource
efficiency. Data privacy and security concerns necessitate robust measures to safeguard sensitive
information. Ethical considerations, including fairness and bias mitigation, are paramount to
ensure responsible deployment. Additionally, resource constraints demand efficient utilization of
computational resources. Addressing these multifaceted challenges requires interdisciplinary
collaboration and innovative methodologies to develop reliable deepfake detection systems that
uphold ethical principles while mitigating the risks associated with synthetic media manipulation.

1.5 SCOPE OF THE PROJECT

The scope of a deepfake detection system project is multifaceted, encompassing technical, ethical,
and practical considerations. Technically, the project involves collecting and annotating diverse datasets,
developing and implementing deep learning models tailored for deepfake detection, and evaluating their
performance against standard metrics. Ethically, the project must prioritize privacy preservation,
transparency, and compliance with legal regulations, while also establishing guidelines for responsible use
and addressing potential biases. Practically, the project entails designing a user-friendly interface, optimizing
scalability and performance, implementing feedback mechanisms for continuous improvement, and
managing computational resources efficiently. Additionally, educational initiatives and collaboration efforts
are essential for raising awareness about deepfakes, fostering media literacy, and sharing insights and best
practices within the community. By addressing these aspects comprehensively, a deepfake detection system
project can effectively contribute to the mitigation of synthetic media manipulation and the preservation of
trust and authenticity in digital content.

CHAPTER 2

LITERATURE SURVEY

2.1 Deepfake Manipulation and Detection:

Deepfakes take their name from the fact that they use deep learning technology to create
fake visuals or audios. Recent years have witnessed the rapid development of deepfake techniques,
which enable attackers to manipulate videos at a more highly detailed and multi-modal level. For
example, Zhou et al. [15] proposed a deepfake system to generate pose-controllable talking faces
with accurate lip synchronization, which was extremely vivid in detail. Similarly, Ji et al. [16]
focused on synthesizing high-quality video portraits with emotional dynamics driven by audios.
Deng et al. [17] presented an unsupervised approach to retarget the speech of any unknown speaker
to an audio-visual stream of a known speaker. These powerful video manipulation methods are more
diverse, more complex, and more difficult to detect and thus present a more challenging problem
for deepfake detection.
For image or visual deepfake detection, some efforts focus on special artifacts caused by
face forging. For example, Zhao et al. [18] formulated deepfake detection as a fine-grained
classification problem and proposed a new multiattentional deepfake detection network. Since face
swapping methods would leave partial context unchanged, Nirkin et al. [19] detected deepfake by
comparing two image-derived identity embeddings. Some recent works sought intra-modal
inconsistency by using source images and forged images [1], [20]. However, these methods were
not suitable for detecting unseen spoofing techniques. To address this, some approaches aimed to
detect deepfake with the help of lip movement and mouth shape, which were more fine-grained
features. For example, LipForensics [8] was proposed, which targeted high-level semantic
irregularities in mouth movements for detecting generated videos. Yu et al. [21] proposed a
commonality learning strategy to learn the common forgery features from different forgery
databases. A motion-based fundamental feature extraction network [7] was proposed to extract
information about talking habits for authentication. To further improve the transferability, Liu et al.
[22] presented a novel Spatial-Phase Shallow Learning method, which combined spatial image and
phase spectrum to better capture the up-sampling facial artifacts.
For audio deepfake detection, ASVspoof [14], [23], [24] are a series of competitions that
aim to protect automatic speaker verification systems from manipulation. Some methods have been
proposed for ASVspoof challenges such as AASIST [25] and wav2vec 2.0 [26]. Based on the
LCNN framework, Monteiro et al. [27] proposed a setting for the detection of logical and physical
presentation attackers of audio deepfake. Jung and Kim et al. [6], [28] proposed back-end
classification models to utilize raw waveforms for speaker verification. Recently, transfer-learning
based approach was adopted for unseen audio forgeries [29]. In [2], a capsule network was applied
to enhance the generalization of audio spoofing attacks detection. Although [30] proposed a
framework for both audio spoof and visual deepfake detection, they still viewed audio and visual as
independent tasks and ignored the correlation between audio and visual.
There are also some other works for multi-modal deepfake detection. Wang et al. [31]
exploited frequency information as a complementary modality to reveal artifacts that were not
perceptible in the RGB domain for deepfake detection. Reference [32] proposed unified audio-
visual deepfake detection frameworks based on modality dissonance and dense hierarchical
features. Mittal et al. [9] extracted perceived emotion cues between audio and visual modalities for
deepfake detection. Cheng et al. [33] addressed the deepfake detection from a voice-face matching
view via the intrinsic correlation of facial and audio. Cai et al. [34] proposed a boundary aware
temporal forgery detection method for detecting content driven audiovisual manipulations. Agarwal
et al. [35] detected the artifacts by comparing the dynamics of the mouth shape with a spoken
phoneme. However, this work focused on the explicit representational information between multi-
modalities, while ignoring the implicit feature non-synergy. Moreover, these multi-modal methods
regarded audios as additional supervision signals and neglected the possibility that audio can also be
forged, which is quite common in the real world. More importantly, compared to the explicit audio-
visual inconsistency, the implicit potential correlation between multi-modalities is still under-
explored and under-utilized.

2.2 Multi-Modal Representation Learning

Vision and sound are the two most popular modalities for representation learning, and
there has been a lot of related multi-modal learning work. Recently, audio-visual representations
learning has been explored for video captioning, localization, and action recognition, due to the
pervasive concurrency property of two modalities. For example, Iashin et al. [36] introduced audio-
visual multi-modal to dense video captioning. Wu et al. [37] aimed to localize sound sources in a
visual scene and designed a novel Binaural Audio-Visual Network (BAVNet). To temporally parse
a video into audio or visual event categories, Lin et al. [38] exploited the crossmodality co-
occurrence of audio-visual to localize segments of target events. A novel audio-visual transformer
[39] framework has been proposed to localize audio-visual events with audio features jointly
observed over visual features. For the action recognition task, audio has ideal properties to aid
efficient recognition in long untrimmed videos. Based on this, Gao et al. [40] devised a novel
framework that focuses on clip-level recognition by distilling from the lighter modalities, a single
frame and its accompanying audio. Contrastive learning [41] has also been exploited for action
recognition and video understanding by transferring knowledge across heterogeneous modalities
between audio-visual. Recently, some weakly-supervised approaches have been explored for multi-
modal representation learning with audio-visual such as Audio-visual Transformer (AV-
transformer) [39], AudioVisual Interacting Network (AVIN) [42], and Multi-Modality Self-
Distillation (MMSD) [43].
The recent success of audio-visual representations learning proves the inherent and
pervasive correlations of multimodalities, which can be used as extra self-supervision signals.
However, due to the heterogeneity of modalities, the combination of audio-visual could also
encounter many difficulties. Qian et al. [44] proved that directly transferring the information from
one modal to another may lead to modal conflicts and redundancy. Despite the current audio-visual
learning systems having obtained huge progress, their performance could degrade dramatically
under the more challenging conditions, such as fine-grained retrieval and deepfake detection. We
believe that simple concatenation or fusion operations are not the optimal choice for exploiting the
information in audio-visual signals. A more inherent relationship and more common inconsistency
of two modalities are waiting to be further explored for multi-modal deepfake detection.
Inappropriate true-false alignment between multi-modalities could lead to more false positives, thus
bringing more noise interference to the detector. Accordingly, detecting deepfake with potential
inconsistencies of audio-visual in the hidden feature space is a more difficult and practical task.

2.3 GAPS IDENTIFIED

a) LACK OF MULTI MODAL FORGERY DETECTION


Integrating multi-modal detection is essential for comprehensive protection against
increasingly complex and convincing deepfake forgeries, addressing vulnerabilities in
current models lacking this capability.
b) LACK OF TEMPORAL FORGERY LOCALISATION
Crucial for pinpointing specific instances of manipulation within videos, enhancing
detection precision and mitigating the risk of false positives or missed detections, ultimately
improving the reliability of detection systems.
c) LIMITED GENERALISATION TO UNSEEN ATTACKS
Ensures adaptability to evolving manipulation techniques, bolstering resilience against
emerging threats by training models with diverse datasets and employing robust feature
extraction methods.

CHAPTER 3

SYSTEM DESIGN

3.1 SYSTEM ARCHITECTURE

The project comprises five essential modules: Preprocessing ensures data quality and
readiness; Audio and Video Feature Extraction extract pertinent features from respective
sources; Multi-Modal Fusion integrates these features for comprehensive analysis;
Frontend Integration utilizes React.js to create a user-friendly interface, while Flask
handles the deployment of the trained deep learning model, enabling efficient predictions via
APIs.
Each module contributes uniquely to the system's functionality, from data preparation
to model deployment, ensuring a robust and user-accessible solution.

Figure 3.1: Architecture diagram

3.2 SYSTEM REQUIREMENTS
Before diving into hardware requirements, it's crucial to outline the complexity and
demands of deepfake detection systems. These systems employ deep learning algorithms to discern
authentic content from manipulated or synthetic media, requiring substantial computational resources
for both training and inference. Deep learning models for video analysis are particularly resource-
intensive due to the high-dimensional nature of multimedia data. Moreover, the scale and diversity of
deepfake datasets necessitate efficient data preprocessing pipelines and substantial memory resources
during training. Thus, while hardware forms the backbone of these systems, it's equally essential to
consider software frameworks, model architectures, and data management strategies to ensure the
effectiveness and scalability of deepfake detection solutions.

3.2.1 FUNCTIONAL REQUIREMENTS:

3.2.1.1 Hardware Requirements

• Processor: The system should possess a processor with ample processing power to execute
the deep learning algorithms involved in deepfake detection. A multi-core processor like
Intel Core i5 or i7 is recommended to handle the computational load efficiently.

• Memory: Sufficient memory is essential for storing and manipulating the large datasets and
model parameters required for deep learning. A minimum of 16GB of RAM is
recommended to ensure smooth operation.

• Storage: Adequate storage capacity is necessary to store the datasets, trained models, and
intermediate results generated during deepfake detection processes. A minimum of 512GB
of storage is recommended to accommodate the large volumes of data effectively.

• Connectivity: The system should support various connectivity options such as Wi-Fi,
Bluetooth, or Ethernet to facilitate data transfer from external sources and communication
with other devices or networks.

3.2.1.2 Software Requirements:

• Dataset: A diverse and comprehensive dataset containing both genuine and deepfake media
samples is essential for training, testing, and validating the deepfake detection system.
Access to a curated deepfake dataset, along with relevant annotations, is critical for
assessing the system's performance accurately.

• Programming language: The project should be implemented using a programming
language suitable for deep learning such as Python.

• Deep Learning Framework: The system should utilize a deep learning framework such as
TensorFlow, PyTorch, or Keras for model development, training, and inference. The
framework should provide flexibility, scalability, and support for state-of-the-art deep
learning techniques.

• Development Environment: Availability of a suitable development environment, such as


Anaconda or Jupyter Notebook, is essential for coding, experimentation, and collaborative
development of the deepfake detection system.

• Video Processing Library: The deepfake detection system should utilize a video processing
library such as FFmpeg or OpenCV for implementing various video analysis and
manipulation techniques. These libraries provide comprehensive functionality for processing
video streams, extracting frames, performing temporal analysis, and applying deep learning-
based algorithms for deepfake detection.

• Version control: The project should use a version control system such as Git to manage the
codebase and track changes.

• Operating System: The system should be compatible with popular operating systems like
Linux, Windows, or macOS, depending on the preferred development environment and
deployment platform.

• Documentation: The deepfake detection system project should comprise comprehensive


documentation outlining installation, configuration, and execution procedures, along with
implementation specifics for the deep learning-based detection model and any preprocessing
techniques employed, such as image normalization and augmentation.
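
As a quick sanity check on the stack described above, a short script such as the following can confirm that the core dependencies are importable before development begins. The package choices here are illustrative, assuming a PyTorch-based setup with OpenCV and librosa.

import sys

def check_environment():
    """Print the versions of the core libraries assumed for this project."""
    import torch      # deep learning framework
    import cv2        # OpenCV for video and frame processing
    import librosa    # audio loading and spectrogram analysis

    print("Python  :", sys.version.split()[0])
    print("PyTorch :", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    print("OpenCV  :", cv2.__version__)
    print("librosa :", librosa.__version__)

if __name__ == "__main__":
    check_environment()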

3.2.2 NON-FUNCTIONAL REQUIREMENTS:

The following are some potential non-functional requirements for the project
"DeepFake Detection using Deep Learning":

3.2.2.1 Performance:

• Real-time Processing: The system should demonstrate real-time or near-real-time


processing capabilities to efficiently analyze and classify media content, ensuring timely
detection of deepfake videos.

• Scalability: The system should be scalable to handle increasing volumes of media content
and user requests without compromising performance, allowing for seamless operation in
high-demand scenarios.

• Accuracy: The deepfake detection system should exhibit high accuracy in distinguishing
between genuine and manipulated media, minimizing false positives and negatives to
maintain reliability.

3.2.2.2 Reliability:

• Robustness: The system should be robust against variations in deepfake techniques and
sources, ensuring consistent performance across different types of manipulation and media
modalities.

• Fault Tolerance: The system should be resilient to failures or disruptions, with mechanisms
in place to recover gracefully from errors and maintain uninterrupted operation.

3.2.2.3 Security:

• Data Privacy: The system should adhere to strict data privacy regulations, safeguarding
sensitive information processed during deepfake detection, including personal media content
and user data.

• Authentication: Access to the system and its functionalities should be securely


authenticated and authorized, preventing unauthorized access and misuse.

3.2.2.4 Usability:
• User Interface: The system should feature an intuitive and user-friendly interface, allowing
users to interact with the detection system effortlessly and interpret results effectively.

• Accessibility: The system should be accessible to users with diverse technical backgrounds
and abilities, providing clear instructions and support for user training and assistance.

3.2.2.5 Maintainability:

• Modularity: The system architecture should be modular and well-structured, facilitating


ease of maintenance, updates, and enhancements without disrupting core functionalities.

• Documentation: Comprehensive documentation should accompany the system, including


installation guides, user manuals, and developer documentation, to support ongoing
maintenance and knowledge transfer.

By addressing these non-functional requirements, the deepfake detection system can deliver
optimal performance, reliability, security, usability, and maintainability, ensuring effective
mitigation of synthetic media manipulation and safeguarding the integrity of digital content.

CHAPTER 4

MODULE DESCRIPTION

4.1.1 Pre-processing Module


The first phase of the project is the pre-processing stage, which involves preparing the dataset
and videos for training and testing. The following steps are included in this module:
Dataset
The dataset consists of videos used for deepfake detection, with two potential categories:
a) DFDC (DeepFake Detection Challenge):
This dataset contains about 9,000 videos from the DeepFake Detection Challenge, which is
a benchmark dataset for deepfake detection research. The videos in this dataset include both
real and manipulated (fake) videos.
b) FAKEAVCELEB:
This dataset includes 12,000 videos from the FakeAVCeleb dataset, another popular
dataset for deepfake detection. Similar to DFDC, it contains both original and manipulated
videos.

I) Video Preprocessing Steps:


1. Face Detection:
Purpose: Identify and localize faces within each frame of the input videos.
Implementation: Utilizes the FacenetDetector class, which internally employs the MTCNN
(Multi-task Cascaded Convolutional Networks) model for face detection.
Output: Bounding boxes around detected faces stored in JSON format.

2. Face Crop Extraction:


Purpose: Extract face crops from the original video frames based on the detected bounding boxes.
Implementation: Reads the bounding box information from the JSON files generated in the
previous step and extracts the corresponding face crops from the video frames.
Output: Individual face crops saved as images in the specified output directory.

3. Data Loading and Organization:


Purpose: Organize the dataset for efficient processing.
Implementation:
 Identifies the paths to the videos in the dataset directory.
 Handles the distinction between different datasets (DFDC, FAKEAVCELEB) and their
internal structures.
 Filters out videos that have already been processed to avoid redundant computation.
4. Utility Functions:
Purpose: Provide various utility functionalities required during preprocessing.
Implementation:
 Functions for fetching video paths, resizing images, extracting method information from
video names, etc.
 These functions assist in managing and manipulating data throughout the preprocessing
pipeline.

Overall, the preprocessing pipeline involves detecting faces in input videos, extracting face
crops based on the detected faces, organizing the dataset, and utilizing utility functions for various
tasks. The pipeline ensures that the data is properly processed and prepared for subsequent steps in
the deepfake detection process.

II) Audio Preprocessing Steps:

1. Audio Extraction:
Module: Audio loading functions or libraries (e.g., Librosa, FFmpeg).
Description: Extract audio streams from the provided video files. This step involves separating
audio tracks from video files to obtain standalone audio files. The extracted audio files will be used
for further analysis and feature extraction.

2. Audio Conversion:
Module: Audio conversion utilities or libraries (e.g., FFmpeg).
Description: Convert the extracted audio files into a common audio format suitable for analysis.
Common formats include WAV, MP3, or FLAC. Conversion ensures uniformity across the dataset
and compatibility with audio processing libraries and tools.
These preprocessing steps ensure that the audio data is properly extracted, converted, and
prepared for further analysis and feature extraction. By standardizing the audio files and removing
unwanted noise or silence, the dataset becomes suitable for training deep learning models or other
machine learning algorithms for deepfake detection.

Figure 4.1: Data Preprocessing module

Input: DFDC+FakeAVCeleb Dataset

Output: Preprocessed Audio Waveforms and Video frames

PseudoCode
Step 1: Face Detection

def detect_faces(video_paths, output_dir):
    detector = FacenetDetector()
    for video_path in video_paths:
        frames = load_frames(video_path)
        bounding_boxes = detector.detect_faces(frames)
        save_bounding_boxes(bounding_boxes, video_path, output_dir)

Step 2: Face Crop Extraction

def extract_face_crops(video_paths, bounding_boxes_dir, output_dir):
    for video_path in video_paths:
        frames = load_frames(video_path)
        bounding_boxes = load_bounding_boxes(video_path, bounding_boxes_dir)
        face_crops = crop_faces(frames, bounding_boxes)
        save_face_crops(face_crops, video_path, output_dir)

Step 3: Data Loading and Organization

def load_and_organize_dataset(dataset_dir):
    video_paths = get_video_paths(dataset_dir)
    filtered_video_paths = filter_processed_videos(video_paths)
    return filtered_video_paths

Step 4: Audio Extraction

import os
import subprocess

def extract_audio(video_paths, output_dir):
    for video_path in video_paths:
        audio_file = os.path.join(output_dir, os.path.basename(video_path).split('.')[0] + '.wav')
        command = f"ffmpeg -i {video_path} -vn -acodec pcm_s16le -ar 44100 -ac 2 {audio_file}"
        subprocess.run(command, shell=True)

Step 5: Audio Conversion

def convert_audio(audio_files, output_format='wav'):
    for audio_file in audio_files:
        output_file = os.path.splitext(audio_file)[0] + '.' + output_format
        command = f"ffmpeg -i {audio_file} -acodec pcm_s16le -ar 44100 -ac 2 {output_file}"
        subprocess.run(command, shell=True)

Overall, these data pre-processing steps will help in preparing the data for efficient model
training and testing and will also contribute to the overall robustness of the model.

4.1.2 Audio Feature Extraction Module

Spectrograms capture the frequency content of audio signals over time and serve as essential
features for analyzing and detecting anomalies or inconsistencies in audio data, such as those
introduced by deepfake manipulation. The following steps demonstrate the process of extracting
audio features, specifically spectrograms, from video files for deepfake detection.

1. Load Video and Extract Audio Segments:


Description: Load the video file and extract audio segments with a specified duration. Iterate over
the video segments and use the subclip function to extract audio segments. These segments will be
used for subsequent audio feature extraction.

2. Convert Audio to WAV Format and Load:


Description: Convert the extracted audio segments into WAV format. Use
moviepy.editor.AudioFileClip to temporarily save the audio segments as WAV files. Load the
WAV audio files using librosa.load to obtain the audio samples (y) and the sampling rate (sr).

3. Calculate Spectrogram:
Description: Compute the spectrogram of each audio segment using librosa.stft. Spectrogram
represents the frequency content of the audio signal over time. The magnitude of the Short-Time
Fourier Transform (STFT) is computed and used as the spectrogram.

4. Store Spectrograms and Save to JSON:
Description: Store the computed spectrograms as lists of lists. Each inner list represents a frame of
the spectrogram, containing the magnitude values for different frequency bins. Finally, save the
extracted spectrograms to JSON files. Each JSON file corresponds to a video file and contains the
spectrograms of its audio segments.
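
As a concrete illustration of steps 2–4, the minimal sketch below computes and stores a magnitude spectrogram for a single extracted audio segment. The file names and STFT parameters (n_fft, hop_length) are illustrative assumptions, not values taken from the project code.

import json
import numpy as np
import librosa

# Load one extracted audio segment ("segment.wav" is a placeholder path).
y, sr = librosa.load("segment.wav", sr=None)

# Magnitude of the Short-Time Fourier Transform; shape is (freq_bins, frames).
spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Store one inner list per frame, as described above, by transposing first.
with open("segment_spectrogram.json", "w") as f:
    json.dump(spec.T.tolist(), f)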

4.1.3 Video Feature Extraction Module


The following steps outline the process of extracting features from video frames using a
pretrained Vision Transformer (ViT) model. The extracted features capture important visual
information from the images, which can be used as input for deepfake detection algorithms or
combined with other modalities (e.g., audio features) for multimodal analysis.

1. Loading Pretrained Vision Transformer (ViT) Model:


The code defines a wrapper class ViTWrapper to encapsulate a pretrained ViT model from
the transformers library. This model is initialized with a specified model name (model_name) and
number of classes (num_classes). The model is loaded and set to evaluation mode
(vit_model.eval()).

2. Image Processing and Feature Extraction:


For each image in the dataset, the code performs the following steps:
Image Loading: The image is loaded using PIL (Image.open(image_path)).
Image Transformation: The image is transformed using transforms.Compose() to resize it
to (224, 224) pixels and convert it to a PyTorch tensor.
Feature Extraction: The transformed image is passed through the ViT model (vit_model) to
extract features. The output is the last hidden state of the ViT model.

3. Saving Features to JSON:


After extracting features for each image, the code constructs a filename for storing the
features in JSON format. This filename is derived from the original image filename by appending
"_features.json". The features are stored in the same directory as the image file, allowing for easy
association between image data and extracted features.

4. Iteration Over Subfolders:


The code iterates through all subfolders in the specified data path (data_path). Each
subfolder is assumed to contain images for feature extraction. For each subfolder, the code
processes all images.
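
A minimal, self-contained sketch of this per-image feature extraction is shown below. The checkpoint name google/vit-base-patch16-224-in21k, the face-crop path, and the resize-only preprocessing are assumptions for illustration rather than the project's exact configuration.

import torch
from PIL import Image
from torchvision import transforms
from transformers import ViTModel

# Load a pretrained ViT backbone and switch it to evaluation mode.
vit_model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
vit_model.eval()

# Resize to 224x224 and convert to a tensor, as described in step 2.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

image = Image.open("face_crop.png").convert("RGB")   # placeholder face-crop path
pixel_values = transform(image).unsqueeze(0)         # add batch dimension: (1, 3, 224, 224)

with torch.no_grad():
    outputs = vit_model(pixel_values=pixel_values)

# Last hidden state: one embedding per image patch plus the [CLS] token.
features = outputs.last_hidden_state.squeeze(0)      # shape (197, 768) for ViT-Base/16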

Figure 4.3: Audio and Video Feature Extraction module

Input: Preprocessed video and audio waveforms

Output: Extracted features of audio and video

PseudoCode for Audio Feature Extraction:


FUNCTION extract_spectrogram(video_path, segment_duration, overlap):
    video = LoadVideo(video_path)                           // Load the video file
    segment_duration_sec = segment_duration * video.fps     // Calculate segment duration in seconds
    overlap_sec = overlap * segment_duration_sec            // Calculate overlap duration in seconds
    spectrograms = []                                       // Initialize list to store spectrograms

    FOR i FROM 0 TO video.duration STEP (segment_duration_sec - overlap_sec):
        audio_segment = video.audio.subclip(i, min(i + segment_duration_sec, video.duration))  // Extract audio segment
        SaveAudioSegment(audio_segment, temp_audio_file)    // Save audio segment to a temporary WAV file
        y, sr = LoadAudio(temp_audio_file)                  // Load audio waveform and sampling rate
        S = CalculateSpectrogram(y)                         // Calculate spectrogram from the audio waveform
        spectrograms.append(S)                              // Store spectrogram in the list
        DeleteTemporaryFile(temp_audio_file)                // Delete the temporary audio file

    RETURN spectrograms

FUNCTION main():
    video_folder = GetVideoFolder()                         // Path to the folder containing video files
    FOR EACH video_file IN video_folder:
        IF video_file is .mp4 OR .avi:
            video_path = video_folder + video_file
            spectrograms = extract_spectrogram(video_path, segment_duration, overlap)  // Extract spectrograms from video
            json_filename = GetJSONFilename(video_file)     // Construct JSON filename
            SaveSpectrogramsToJSON(spectrograms, json_filename)  // Save spectrograms to JSON

PseudoCode for Video Feature Extraction:

FUNCTION extract_features_vit(model, image_path):
    # Define image transformation pipeline
    transform = DefineTransformations()

    TRY:
        # Open image and convert to RGB
        image = OpenImage(image_path).ConvertToRGB()

        # Apply transformations and unsqueeze to add batch dimension
        image_tensor = ApplyTransformations(image).Unsqueeze()

        # Extract features using ViT model
        features = model(image_tensor)

        # Return the last hidden state of the ViT model
        RETURN features.last_hidden_state.squeeze()

    EXCEPT Exception AS e:
        PRINT "Error:", e
        RETURN None

FUNCTION main():
    # Define data path, model name, and number of classes
    data_path = DefineDataPath()
    model_name = DefineModelName()
    num_classes = DefineNumClasses()

    # Load pretrained ViT model
    vit_model = LoadPretrainedViTModel(model_name, num_classes)

    # Iterate through subfolders in data path
    FOR EACH subfolder IN GetSubfolders(data_path):
        # Iterate through image files in the subfolder
        FOR EACH image_file IN GetImageFiles(subfolder):
            # If the image file is in PNG format
            IF ImageFileIsPNG(image_file):
                # Extract features using ViT for the image file
                features = extract_features_vit(vit_model, image_file)

                # If features are successfully extracted
                IF features is not None:
                    # Define features file path and name
                    features_file_path = DefineFeaturesFilePath(image_file)

                    # Store features in JSON format
                    StoreFeaturesInJSON(features, features_file_path)

4.1.4 MULTI MODAL FUSION MODULE

The following are the steps involved in this module

1. Initialization:
The MultiModalJointDecoder module is initialized with input and output sizes, tailored to the task
at hand. Input size is determined by the combined features from video and audio modalities, while
output size represents the number of target classes (e.g., real or fake).

2. Forward Pass:
During the forward pass, combined features from the TemporalSpatialEncoder are fed into the
MultiModalJointDecoder.

3. Linear Transformation and Activation:


The input features undergo a linear transformation (self.fc1) followed by a ReLU activation
function. This step enables the model to capture nonlinear relationships in the data.

4. Output Layer:
Subsequently, another linear transformation (self.fc2) is applied to map the transformed features to
the desired output size. This layer prepares raw predictions for each class.

5. Prediction Generation:
For tasks like binary classification (e.g., real vs. fake), a sigmoid activation function is applied to
the final output, producing probabilities for each class. Thresholding these probabilities yields
binary predictions, with values above the threshold indicating one class and values below indicating
the other.
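
For illustration, a minimal PyTorch sketch of this prediction step is shown below; the 0.5 threshold, the single-logit output, and the label convention are assumptions.

import torch

# Raw decoder outputs for a batch of two samples (placeholder values).
logits = torch.tensor([[1.7], [-0.4]])

probs = torch.sigmoid(logits)      # probabilities in [0, 1]
preds = (probs > 0.5).long()       # assumed convention: 1 = fake, 0 = real
print(probs, preds)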

This streamlined explanation encapsulates the key operations of the MultiModalJointDecoder, from
initialization to prediction generation, crucial for its role in deepfake detection systems.

Figure 4.4: Multi Modal Fusion module

Input: Extracted features of audio and video

Output: Trained Model and Predicted Labels

PseudoCode:

CLASS MultiModalJointDecoder:
    METHOD __init__(input_size, output_size):
        INITIALIZE fc1, fc2 as fully connected layers

    METHOD forward(x):
        x = Apply fc1 and ReLU to x
        x = Apply fc2 to x
        RETURN x

FUNCTION load_audio_features(audio_folder, video_filename):
    TRY:
        RETURN audio_features from audio_folder with filename video_filename
    EXCEPT Exception AS e:
        RETURN None

FUNCTION load_video_features(video_folder, video_filename):
    TRY:
        RETURN video_features from video_folder with filename video_filename
    EXCEPT Exception AS e:
        RETURN []

audio_folder = DefineAudioFolder()
video_folder = DefineVideoFolder()
temporal_spatial_encoder = InstantiateTemporalSpatialEncoder()
output_size = DefineOutputSize()
video_filenames = GetVideoFilenames(video_folder)

FOR EACH video_filename IN video_filenames:
    audio_features = load_audio_features(audio_folder, video_filename)
    video_features = load_video_features(video_folder, video_filename)

    IF audio_features is not None AND video_features is not empty:
        TRY:
            combined_features = temporal_spatial_encoder(video_features, audio_features)
            multi_modal_joint_decoder = MultiModalJointDecoder(SizeOf(combined_features), output_size)
            binary_predictions = Threshold(Sigmoid(multi_modal_joint_decoder(combined_features)))
            StorePrediction(video_filename, AggregatePredictions(binary_predictions))
        EXCEPT Exception AS e:
            PRINT "Error during forward pass:", e

PrintPredictions()

4.1.5 FRONT END INTEGRATION MODULE

The steps for integrating the deep learning model with a front-end module:

1. Save Developed DL Model:


After training and evaluating the deep learning model, save the trained model weights and
architecture to disk using frameworks like PyTorch's torch.save() or TensorFlow's model.save().

2. Integrate Flask:
Set up a Flask application to serve as the backend server for handling requests from the front end.
Create routes within Flask to handle model inference requests, such as receiving input data, running
inference using the trained model, and returning the results.
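
A minimal sketch of such a Flask backend is given below; the /predict route, the temporary upload path, and the run_inference helper are illustrative placeholders rather than the project's actual implementation.

import os
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_inference(video_path):
    # Hypothetical helper: load the saved model weights, run preprocessing,
    # feature extraction, and fusion, then return a prediction.
    return {"label": "FAKE", "trust_score": 0.12}   # placeholder result

@app.route("/predict", methods=["POST"])
def predict():
    uploaded = request.files.get("video")
    if uploaded is None:
        return jsonify({"error": "no video file provided"}), 400
    video_path = os.path.join("/tmp", uploaded.filename)
    uploaded.save(video_path)
    return jsonify(run_inference(video_path))

if __name__ == "__main__":
    app.run(debug=True)

The React front end (or any other client) can then POST a video to this endpoint, for example with a fetch/AJAX call or curl -F "video=@sample.mp4" http://localhost:5000/predict.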

3. Design UI:
Design the user interface (UI) for the front end module using web development technologies like
HTML, CSS, and JavaScript. Consider the user experience and design aesthetic to create an
intuitive and visually appealing interface for interacting with the deepfake detection system.

4. Connect Backend:
Establish communication between the front end and back end by making AJAX requests from the
UI to the Flask backend. Define API endpoints in Flask to receive data from the front end, process it
using the deep learning model, and send back the results.

5. Test Functionality:
Test the functionality of the integrated system to ensure proper communication between the front
end and back end. Test various scenarios, including uploading different types of videos or images,
triggering model inference, and handling responses.

Figure 4.5: Frontend Integration module

INPUT: Trained model

OUTPUT: DeepFake Detection Web Application

4.1.6 INNOVATION

Vision + Time-Series Transformer for Video Feature Extraction:


Our innovation combines the capabilities of vision transformers, such as ViT, with time-series
processing techniques to extract comprehensive features from video data. While the vision
transformer attends to spatial patterns within each frame, capturing intricate visual details, the time-
series transformer focuses on temporal dynamics, discerning motion and sequential patterns over
time. This fusion allows our model to capture both spatial and temporal cues inherent in video
sequences, empowering it to discern subtle manipulations characteristic of deepfake content.

Spectrogram Analysis for Audio Feature Extraction:


In our approach, we utilize spectrogram analysis to extract rich features from audio data,
providing a complementary perspective to the visual cues extracted from video frames.
Spectrograms offer detailed representations of frequency and time-domain characteristics,
encapsulating nuances in speech, ambient noise, and other audio attributes. By transforming audio
signals into spectrograms, our model gains access to a wealth of auditory information, enhancing its
ability to discern genuine audio signals from those manipulated in deepfake content.

Multi-Modal Joint Decoder for Integrating Features:


Our model incorporates a multi-modal joint decoder, which seamlessly integrates features
extracted from both visual and audio modalities into a unified representation. By jointly decoding
spatial-temporal features from video transformers and frequency-domain features from audio
spectrograms, our model learns to fuse information across modalities, capturing intricate
correlations and interactions between visual and auditory cues. This integrated representation
enables our model to make more informed decisions regarding the authenticity of media content,
bolstering its effectiveness in detecting deepfake manipulations across multiple dimensions.
Comparison

By integrating vision and time-series transformers for video feature extraction alongside
spectrogram analysis for audio feature extraction, our deepfake detection system achieves superior
performance compared to existing models. This integration allows for the extraction of rich spatial
and temporal features from video data, capturing intricate visual patterns and dynamic motion cues,
while also capturing subtle nuances in audio characteristics such as speech patterns and background
noise. Through a multi-modal joint decoder, these features are seamlessly integrated into a unified
representation, enabling holistic understanding across modalities and enhancing the model's
discriminative power. This comprehensive analysis across visual and audio modalities enables the
detection of deepfake manipulations that may attempt to deceive singular detection methods,
leading to more accurate and robust detection outcomes overall.

CHAPTER 5

IMPLEMENTATION AND RESULTS

5.1 DATASET DESCRIPTION

The DFDC (DeepFake Detection Challenge) dataset and FakeAVCeleb dataset are two
prominent resources in the realm of deepfake detection research. DFDC comprises a diverse array
of about 9,000 videos, including both genuine and deepfake content, meticulously crafted to
challenge algorithms in distinguishing between real and manipulated media. With a focus on
fostering advancements in deepfake detection, DFDC serves as a foundational dataset for training
and evaluating detection models. Similarly, the FakeAVCeleb dataset offers a rich collection of
about 12,000 videos featuring celebrities and public figures, encompassing both authentic
and manipulated footage. This dataset provides researchers with a robust benchmark for assessing
the efficacy of detection techniques against sophisticated deepfake manipulations. Both datasets
represent pivotal resources for the development of strategies to combat the proliferation of deepfake
content, contributing to the preservation of trust and integrity in digital media environments.

Figure 5.1.1: Fake Videos

Figure 5.1.2: Real videos

5.2 DATA PREPROCESSING

Input: DFDC+FakeAVCeleb dataset

Output: Preprocessed audio waveforms and video frames

a) Face Detection: Detect and localize faces in video frames using the MTCNN model.
Code: Utilizes the FacenetDetector class, which internally implements the MTCNN model for face
detection.
class FacenetDetector(VideoFaceDetector):
    def _detect_faces(self, frames) -> List:
        batch_boxes, *_ = self.detector.detect(frames, landmarks=False)
        return [b.tolist() if b is not None else None for b in batch_boxes]
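
For reference, a standalone sketch of the underlying MTCNN call via the facenet-pytorch package is shown below; the frame path is a placeholder.

from PIL import Image
from facenet_pytorch import MTCNN

# Detect all faces in a single frame; boxes come back as [xmin, ymin, xmax, ymax].
mtcnn = MTCNN(keep_all=True, device="cpu")
frame = Image.open("frame_0.png").convert("RGB")   # placeholder frame path
boxes, probs = mtcnn.detect(frame)                 # ndarray of shape (num_faces, 4), or None
print(boxes)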

Figure 5.2.1: Bounding box coordinates in format [xmin, ymin, xmax, ymax].

b) Face Crop Extraction: Extract face crops from video frames based on detected bounding boxes.
Code: Reads bounding box information from JSON files and extracts corresponding face crops.

def extract_video(video, root_dir, dataset, opt):
    # Extracting face crops based on detected bounding boxes
    for i in range(frames_num):
        if not success or str(i) not in bboxes_dict:
            continue
        # Crop faces based on bounding boxes and save as images
        for j, crop in enumerate(crops):
            cv2.imwrite(os.path.join(opt.output_path, id, "{}_{}.png".format(i, j)), crop)
Figure 5.2.2.1: Extracted face images cropped using box coordinates for real video

Figure 5.2.2.2: Extracted face images cropped using box coordinates for fake video

c) Data Loading and Organization: Fetch and organize video paths for efficient processing.
Code: Identifies video paths, handles dataset distinctions, and filters out processed videos.

def get_video_paths(data_path, dataset, excluded_videos=[]):
    # Fetch video paths based on dataset and excluded videos
    videos_paths = []
    # Loop through folders to get video paths
    for folder in videos_folders:
        # Filter out banned folders
        if any(banned_folder in folder for banned_folder in banned_folders):
            continue
        # Add video paths to list
        videos_paths.append(os.path.join(folder_path, video))
    return videos_paths

d) Utility Functions: Provide various utility functionalities for data manipulation and management.
Code: Includes functions for resizing images, extracting method information, etc.

def resize(image, image_size):
    # Resize images to specified dimensions
    return cv2.resize(image, dsize=(image_size, image_size))

Each step plays a crucial role in preprocessing the data for deepfake detection, ensuring that the
subsequent analysis is accurate and efficient.

5.3 AUDIO FEATURE EXTRACTION

Input: Preprocessed audio waveforms and video frames

Output: Extracted audio features

a) Load Video and Extract Audio Segments: Load the video file using MoviePy and extract
audio segments of specified duration.
Code: Utilizes moviepy.editor.VideoFileClip to load the video and subclip function to extract audio
segments.
video = mp.VideoFileClip(video_path)
audio = video.audio.subclip(i, min(i + segment_duration_sec, video.duration))

b) Convert Audio to WAV Format and Load: Convert extracted audio segments into WAV
format and load them using librosa.

Code: Temporarily save audio segments as WAV files and load them using librosa.load.
audio_temp_file = "temp_audio.wav"
audio.write_audiofile(audio_temp_file, codec='pcm_s16le', fps=44100)
y, sr = librosa.load(audio_temp_file)

c) Calculate Spectrogram: Compute the spectrogram of each audio segment using librosa's Short-
Time Fourier Transform (STFT).
Code: Utilizes librosa.stft to compute the STFT and obtain the magnitude values representing the
spectrogram.

S = np.abs(librosa.stft(y))

d) Store Spectrograms and Save to JSON: Store computed spectrograms as lists and save them to
JSON files corresponding to each video.
Code: Saves spectrograms as lists and writes them to JSON files for further analysis.

json_filename = os.path.splitext(video_file)[0] + ".json"


with open(json_filename, 'w') as json_file:
json.dump(spectrograms, json_file)

These steps outline the process of audio feature extraction, from loading the video to saving
the extracted spectrograms as JSON files, facilitating subsequent analysis and model training for
deepfake detection.
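For illustration, the four steps can be combined into a single routine as in the sketch below. The segment length, sampling rate, temporary file name, and output location are assumptions made for demonstration and do not reproduce the exact project code.

import json
import os
import librosa
import moviepy.editor as mp
import numpy as np

def extract_audio_spectrograms(video_path, segment_duration_sec=1.0):
    video = mp.VideoFileClip(video_path)
    spectrograms = []
    t = 0.0
    while t < video.duration:
        # Take one audio segment and write it to a temporary WAV file
        segment = video.audio.subclip(t, min(t + segment_duration_sec, video.duration))
        segment.write_audiofile("temp_audio.wav", codec="pcm_s16le", fps=44100)
        # Load the segment and compute its magnitude spectrogram via STFT
        y, sr = librosa.load("temp_audio.wav")
        S = np.abs(librosa.stft(y))
        spectrograms.append(S.tolist())
        t += segment_duration_sec
    # Save all spectrograms next to the video, one JSON file per video
    json_filename = os.path.splitext(video_path)[0] + ".json"
    with open(json_filename, "w") as json_file:
        json.dump(spectrograms, json_file)
    return json_filename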

Figure 5.3: Extracted audio features using spectrogram analysis

5.4 VIDEO FEATURE EXTRACTION

Input: Preprocessed audio waveforms and video frames

Output: Extracted video features

a) Loading Pretrained ViT Model: The code initializes a ViT model wrapper (ViTWrapper) with
a specified model name and number of classes, loading the pretrained ViT model using the
transformers library.
Code:
class ViTWrapper(nn.Module):
    def __init__(self, model_name, num_classes):
        super().__init__()
        # config is a ViTConfig for the named checkpoint (construction not shown)
        self.model = ViTModel(config)
        self.model.config.num_classes = num_classes

b) Image Processing and Feature Extraction: Each image undergoes loading, transformation, and
feature extraction using the ViT model to capture visual representations.
Code:
def extract_features_vit(model, image_path):
    # Load and transform the image, then extract features with the ViT model
    ...  # image loading and preprocessing (not shown)
    features = model(image)
    return features

c) Saving Features to JSON: Extracted features are serialized and stored in JSON format with
filenames derived from the original image filenames, facilitating easy association between image
data and features.
Code:
# Store features in JSON format with filenames derived from the original image filenames
with open(features_file_path, 'w') as f:
    json.dump(features.tolist(), f)

d) Iteration Over Subfolders: The code iterates through subfolders within the specified data path,
processing all image files with the ".png" extension.
Code:
# Iterate through subfolders and process image files with the ".png" extension
for root, dirs, _ in os.walk(data_path):
    for subdir in dirs:
        subdir_path = os.path.join(root, subdir)
        # Iterate through image files in each subfolder
        for file in os.listdir(subdir_path):
            # Process only files with the ".png" extension
            if file.lower().endswith(".png"):
                image_path = os.path.join(subdir_path, file)
                # Extract features for each image file
                features = extract_features_vit(vit_model, image_path)

These steps collectively outline the process of extracting visual features from images using a
pretrained Vision Transformer (ViT) model, facilitating subsequent analysis for deepfake detection
tasks.
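A compact illustrative sketch of the visual pipeline is given below. The google/vit-base-patch16-224-in21k checkpoint, the torchvision preprocessing, the use of the [CLS] token as the feature vector, and the file paths are assumptions for demonstration only.

import json
import torch
from PIL import Image
from torchvision import transforms
from transformers import ViTModel

# Illustrative checkpoint; any compatible ViT checkpoint could be used instead
vit_model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
vit_model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

def extract_features(image_path):
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        outputs = vit_model(pixel_values=image)
    # Use the [CLS] token embedding as the frame-level feature vector
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

features = extract_features("face_crop.png")  # hypothetical face-crop path
with open("face_crop_features.json", "w") as f:
    json.dump(features.tolist(), f)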

Figure 5.4: Extracted video features using ViT pretrained model

5.5 MULTI MODAL FUSION

Input: Extracted audio and video features

Output: Trained Model

a) Initialization: The MultiModalJointDecoder module is initialized with input and output sizes,
defining its architecture for combining and processing multimodal features.
Code: Defines the MultiModalJointDecoder class with specified input and output sizes using linear
layers.

class MultiModalJointDecoder(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        # Two-layer MLP: halve the feature dimension, then map to the output size
        self.fc1 = nn.Linear(input_size, input_size // 2)
        self.fc2 = nn.Linear(input_size // 2, output_size)

b) Forward Pass: During the forward pass, combined features from the TemporalSpatialEncoder
are propagated through the MultiModalJointDecoder for prediction generation.
Code: Implements the forward method in MultiModalJointDecoder, passing input features through
linear layers.

def forward(self, x):
    x = F.relu(self.fc1(x))
    x = self.fc2(x)
    return x

c) Linear Transformation and Activation: Input features undergo a linear transformation followed by a ReLU activation function to introduce nonlinearity and capture complex relationships.
Code: Applies linear transformation followed by ReLU activation to input features in the forward
method.

x = F.relu(self.fc1(x))

d) Output Layer: The final output layer of the MultiModalJointDecoder maps transformed
features to the desired output size, preparing raw predictions for each class.
Code: Defines the output layer using another linear transformation in the MultiModalJointDecoder class.

self.fc2 = nn.Linear(input_size // 2, output_size)

e) Prediction Generation: For binary classification tasks, probabilities for each class are generated
using a sigmoid activation function applied to the final output, facilitating binary prediction
generation.
Code: Applies sigmoid activation to the final output for binary classification and thresholds
probabilities for binary predictions.

probabilities = torch.sigmoid(output)
binary_predictions = (probabilities >= threshold).float()

These steps illustrate the progression of operations within the MultiModalJointDecoder, from initialization to prediction generation, crucial to its role in processing multimodal features for deepfake detection.
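The following illustrative sketch ties these steps together end to end. The feature dimensions, the simple concatenation used in place of the TemporalSpatialEncoder output, and the 0.5 decision threshold are assumptions made for demonstration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalJointDecoder(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, input_size // 2)
        self.fc2 = nn.Linear(input_size // 2, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Illustrative feature sizes: a 768-dim visual embedding and a 256-dim audio embedding
video_feat = torch.randn(8, 768)   # batch of 8 visual feature vectors
audio_feat = torch.randn(8, 256)   # batch of 8 audio feature vectors
fused = torch.cat([video_feat, audio_feat], dim=1)   # concatenation fusion (assumed)

decoder = MultiModalJointDecoder(input_size=fused.size(1), output_size=1)
logits = decoder(fused)

probabilities = torch.sigmoid(logits)                 # per-video probability of being fake
binary_predictions = (probabilities >= 0.5).float()   # 1 = fake, 0 = real (assumed labeling)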

Figure 5.5: Obtained results using Cross Modal Classifier

5.6 FRONT END INTEGRATION

Input: Trained Model

Output: UI interface to test the video

a) Create Model: Define the MultiModalJointDecoder model in Python using PyTorch, as described in the previous section, and save the trained model weights to disk.
Code:
# Save the trained model
torch.save(model.state_dict(), 'model/df_model.pt')
b) Integrate Flask:
• Set up a Flask application to serve as the backend server for handling requests and responses.
• Create routes for receiving input data, processing it with the model, and sending back predictions.

Code:
from flask import Flask, render_template, request, jsonify
import torch

app = Flask(__name__, template_folder="templates")

@app.route('/', methods=['GET', 'POST'])
def homepage():
    if request.method == 'GET':
        return render_template('index.html')
    elif request.method == 'POST':
        # Handle a model inference request: load the saved model weights
        model = Model(2)
        model.load_state_dict(torch.load('model/df_model.pt',
                                         map_location=torch.device('cpu')))
        model.eval()
        # Perform inference using the loaded model and return the result
        return jsonify({'result': 'fake'})

if __name__ == "__main__":
    app.run(port=3000)
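The endpoint can also be smoke-tested without a browser using Flask's built-in test client, as in the illustrative snippet below; sample_video.mp4 is a hypothetical file and the field name video matches the upload form used in the next step.

# Quick smoke test of the '/' endpoint using Flask's test client
with app.test_client() as client:
    with open("sample_video.mp4", "rb") as f:   # hypothetical test file
        response = client.post("/", data={"video": f})
    print(response.get_json())                  # expected: {'result': 'fake'} for this stub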

c) Design UI - React.js:
• Design a user interface using React.js to interact with the deepfake detection model (a minimal HTML template rendered by Flask is shown below).
• Create components for uploading audio and video files, setting thresholds, and displaying results.
Code:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width,
initial-scale=1" />
<title>DeepFake Detection</title>
</head>
<body>
<h1>DeepFake Detection System</h1>
<form id="uploadForm">
<input type="file" name="video" accept="video/*">
<button type="submit">Upload Video</button>
</form>
<div id="result"></div>
</body>
</html>

Figure 5.6.1: Home Page

Figure 5.6.2: Upload a file to get the result

d) Connect Backend - Flask:
• Utilize AJAX requests in React.js to communicate with the Flask backend.
• Implement functions to send input data to Flask routes and receive predictions in JSON format.
• Handle HTTP requests with AJAX: send data asynchronously to Flask endpoints for processing, and handle the responses from Flask to update the UI with prediction results.
Code:
document.getElementById('uploadForm').addEventListener('submit', function(event) {
    event.preventDefault();
    var formData = new FormData(this);
    fetch('/', {
        method: 'POST',
        body: formData
    })
    .then(response => response.json())
    .then(data => {
        // Handle the response from the backend and update the UI accordingly
        document.getElementById('result').innerText =
            'Detection Result: ' + data.result;
    })
    .catch(error => console.error('Error:', error));
});

Figure 5.6.3: Handling HTTP requests

e) Test Functionality:
• Conduct unit tests for each component to ensure proper functionality.
• Test file upload, prediction generation, and result display; a minimal test sketch for the upload endpoint is shown below.
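An illustrative pytest sketch for such a test is given below. It assumes the Flask application from step (b) can be imported as app from a module named app, which is an assumption about the project layout rather than the actual code.

# test_app.py -- illustrative unit test for the upload endpoint
import io
import pytest
from app import app  # assumes the Flask application lives in app.py

@pytest.fixture
def client():
    app.config["TESTING"] = True
    with app.test_client() as client:
        yield client

def test_homepage_renders(client):
    assert client.get("/").status_code == 200

def test_upload_returns_label(client):
    # A tiny dummy payload; a real test would post an actual video file
    data = {"video": (io.BytesIO(b"fake bytes"), "sample.mp4")}
    response = client.post("/", data=data, content_type="multipart/form-data")
    assert response.status_code == 200
    assert response.get_json()["result"] in {"real", "fake"}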

Figure 5.6.4: Result along with Trust Score (FAKE video)

Figure 5.6.5: Result along with trust score (REAL video)

CHAPTER 6
TEST CASES AND PERFORMANCE EVALUATION

6.1 TEST CASES

MODULE 1: PREPROCESSING
TC_01
Input: Video file (e.g., .mp4)
Pre-condition: The video file exists.
Description: Verify that video preprocessing correctly extracts frames from the video.
Steps: 1. Provide a video file to the module. 2. Execute the video preprocessing.
Expected Output: Extracted frames from the video.
Actual Output: Extracted frames match the expected frames.

TC_02
Input: Video file with low resolution
Pre-condition: A low-resolution video file exists.
Description: Ensure that the preprocessing module handles low-resolution videos appropriately.
Steps: 1. Provide a low-resolution video file to the module. 2. Execute the video preprocessing.
Expected Output: Preprocessed frames with acceptable quality.
Actual Output: Preprocessed frames maintain acceptable quality despite the low resolution.
Table 6.1.1: Test cases for Preprocessing Module

MODULE 2: AUDIO FEATURE EXTRACTION


TC_03
Input: Video file (e.g., .mp4)
Pre-condition: The video file exists.
Description: Ensure that audio features are accurately extracted through spectrogram analysis.
Steps: 1. Provide video files to the audio feature extraction module. 2. Execute audio feature extraction using spectrogram analysis.
Expected Output: Extracted audio features in JSON format.
Actual Output: Extracted audio features match the expected features.
Table 6.1.2: Test cases for Audio Feature Extraction Module

MODULE 3: VIDEO FEATURE EXTRACTION


TC_04
Input: Pre-processed frames (e.g., frames extracted during video preprocessing)
Pre-condition: Pre-processed frames exist.
Description: Verify that visual features are accurately extracted using the Vision Transformer.
Steps: 1. Provide pre-processed frames to the visual feature extraction module. 2. Execute visual feature extraction using the Vision Transformer.
Expected Output: Extracted visual features (embeddings) in JSON format.
Actual Output: Extracted visual features match the expected features.

Table 6.1.3: Test cases for Video Feature Extraction Module

MODULE 4: MULTI MODAL FUSION


TC_05
Input: Video features from the video feature extraction module; audio features from the audio feature extraction module; a metadata JSON file containing the label of every video
Pre-condition: Video and audio features have been extracted by their respective modules, and the metadata file containing the classifications is available.
Description: Verify that the multimodal fusion module accurately classifies videos as fake or real using the extracted video and audio features along with the metadata.
Steps: 1. Provide video features, audio features, and the metadata file to the multimodal fusion module. 2. Execute multimodal fusion for each video using the provided features and metadata. 3. Generate classification results for each video.
Expected Output: Classification results for all videos indicating whether they are fake or real.
Actual Output: Classification results match the expected classification for all videos.

Table 6.1.4: Test cases for Multi Modal Fusion Module

MODULE 5: FRONT END INTEGRATION


TC_06
Input: Video file uploaded through the front-end interface
Pre-condition: The front-end interface is accessible and functional.
Description: Verify that the front end correctly sends video files to the backend for processing.
Steps: 1. Access the front-end interface. 2. Upload a video file using the provided upload functionality. 3. Monitor the network requests to ensure the video file is sent to the backend. 4. Check for any error messages or feedback provided by the front end.
Expected Output: Successful upload of the video file with no errors reported by the front end.
Actual Output: The video file is successfully sent to the backend, with no errors or issues reported by the front end.

TC_07
Input: Classification result received from the backend
Pre-condition: Backend integration is established and the front end is capable of receiving responses from the backend.
Description: Ensure that the front end correctly displays the classification result received from the backend.
Steps: 1. Submit a video file for processing through the front-end interface. 2. Wait for the classification result to be returned from the backend. 3. Verify that the front end correctly displays the classification result (e.g., "Real" or "Fake").
Expected Output: The front-end interface accurately displays the classification result received from the backend.
Actual Output: The classification result is correctly displayed on the front-end interface.

Table 6.1.5: Test cases for Front End Integration Module

6.2 PERFORMANCE EVALUATION

6.2.1 ACCURACY
Accuracy is commonly used for deepfake detection as it measures the overall correctness of the
model's predictions, providing a straightforward assessment of its performance. It's calculated by
dividing the number of correctly classified samples by the total number of samples.
Formula:
Accuracy = (Number of correctly classified samples) / (Total number of samples)

6.2.2 PRECISION

Precision is used in deepfake detection systems to measure the proportion of true positive
detections among all positive detections made by the model. It helps in understanding the reliability of
the model in correctly identifying deepfake instances without falsely labeling genuine content as fake.
Formula:
Precision = TP / (TP + FP)

where:
TP (True Positives) are the instances correctly identified as deepfakes.
FP (False Positives) are the instances incorrectly identified as deepfakes.

6.2.3 RECALL

Recall is utilized in deepfake detection systems to measure the ability of the model to correctly
identify true deepfake instances among all actual deepfakes present in the dataset. It quantifies the ratio
of true positive predictions to the total number of actual deepfakes. High recall indicates that the model
is effectively capturing most of the deepfake instances.
Formula:

Recall = True Positives / (True Positives + False Negatives)

6.2.4 F1-SCORE
The F1-score is used for deepfake detection because it balances precision and recall, providing a
single metric to evaluate a model's performance in distinguishing between real and fake videos. It is
especially useful when dealing with imbalanced datasets commonly encountered in deepfake detection
tasks.
Formula:
F1-score = (2 × Precision × Recall) / (Precision + Recall)
6.2.5 INTERSECTION OVER UNION (IOU)
Intersection over Union (IoU) is used for deepfake detection because it quantifies the spatial
overlap between the predicted and ground truth bounding boxes. It measures the ratio of the intersection
area to the union area of the predicted and ground truth bounding boxes, providing a robust evaluation of
object localization accuracy.
Formula:
IoU = Area of Intersection / Area of Union.
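A small helper for computing IoU between two boxes in the same [xmin, ymin, xmax, ymax] format used by the face detector is sketched below; it is an illustrative example rather than the project implementation.

def iou(box_a, box_b):
    # Boxes are [xmin, ymin, xmax, ymax]
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted box that largely overlaps the ground-truth box
print(iou([10, 10, 110, 110], [20, 20, 120, 120]))  # roughly 0.68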

6.2.6 AREA UNDER CURVE

The area under the curve (AUC) performance metric is used for deepfake detection because it
provides a comprehensive measure of classifier performance across all possible decision thresholds.
AUC summarizes the classifier's ability to discriminate between classes (e.g., real vs. fake) regardless of
the threshold chosen.
Formula:
AUC = ∫ TPR(FPR) d(FPR), integrated over FPR from 0 to 1,
where TPR is the true positive rate (sensitivity) and FPR is the false positive rate (1 - specificity).
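As a reference, the label-based metrics above can be computed from the model's outputs with scikit-learn, as in the illustrative snippet below. The label and probability arrays are placeholders, and IoU is omitted because it is computed from bounding boxes rather than from labels.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Illustrative placeholders: 1 = fake, 0 = real
y_true = [1, 0, 1, 1, 0, 1, 0, 0]                     # ground-truth labels
y_prob = [0.9, 0.2, 0.8, 0.6, 0.4, 0.7, 0.1, 0.3]     # predicted fake probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]       # thresholded predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))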

MODEL RESULTS:

EVALUATION METRIC    PERCENTAGE
ACCURACY             92.00%
PRECISION            94.02%
RECALL               90.00%
F1-SCORE             92.67%
IOU                  98.28%
AUC                  97.57%

Table 6.2: Model results
CHAPTER 7

CONCLUSION AND FUTURE WORK

7.1 CONCLUSION

In conclusion, the development of deepfake detection systems utilizing deep learning networks represents a promising avenue in combating the proliferation of manipulated media. These
systems leverage advanced algorithms to analyze subtle inconsistencies within images and videos,
enabling the identification of forged content with increasing accuracy. While significant progress has
been made, ongoing research and development are necessary to enhance the robustness and scalability
of these systems, particularly in the face of evolving deepfake techniques. Additionally,
interdisciplinary collaboration between researchers, policymakers, and industry stakeholders is crucial
to address the ethical, legal, and societal implications surrounding deepfake technology. Despite the
challenges ahead, the continued refinement of deepfake detection systems holds great potential in
preserving the integrity of digital content and safeguarding against misinformation in an increasingly
interconnected world.

7.2 FUTURE WORK


In the ongoing battle against deepfake proliferation, the evolution of detection systems must
keep pace with advancing synthetic media technology. Future work in deepfake detection could focus
on enhancing model robustness through adversarial training, augmenting datasets with diverse and
challenging examples to improve generalization, and exploring novel architectures like graph neural
networks or capsule networks for more nuanced feature extraction. Additionally, integrating
multimodal cues such as audio and text alongside visual information could offer more comprehensive
detection capabilities. Continuous refinement of anomaly detection techniques, leveraging
unsupervised learning to detect subtle artifacts indicative of manipulation, is crucial. Furthermore,
developing real-time detection systems capable of flagging deepfakes in live streams or social media
platforms will be imperative to curb their dissemination. Collaborative efforts to establish
standardized benchmarks and evaluation protocols could foster progress and interoperability among
different detection solutions. Finally, research into the interpretability of deepfake detection models is
essential for building trust and understanding their decision-making processes, facilitating effective
human oversight and intervention.


