DEEPFAKE DETECTION SYSTEM
USING
DEEP LEARNING
Submitted by
AMRITHA R (2021103506)
INFANCY P (2021103532)
POOJAA S (2021103555)
in partial fulfillment for the award of the degree of
BACHELOR OF ENGINEERING
IN
ANNA UNIVERSITY: CHENNAI 600 025
MAY 2024
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
ABSTRACT
online media, prompting extensive research into detection methods. Prior endeavors
response, our project introduces a pioneering method that addresses this multi-modal
audio feature extraction and Vision Transformer for visual feature extraction, we
ACKNOWLEDGEMENT
We would also like to thank our friends and family for their encouragement
and continued support. We would also like to thank the Almighty for giving us the
moral strength to accomplish our task.
TABLE OF CONTENTS
LIST OF FIGURES
1. INTRODUCTION
1.1 OBJECTIVES
2. LITERATURE SURVEY
3. SYSTEM DESIGN
3.2.2.1 PERFORMANCE
3.2.2.2 RELIABILITY
3.2.2.3 SECURITY
3.2.2.4 USABILITY
3.2.2.5 MAINTAINABILITY
4. MODULE DESCRIPTION
4.1.6 INNOVATION
6.2.1 ACCURACY
6.2.2 PRECISION
6.2.3 RECALL
6.2.4 F1-SCORE
7. CONCLUSION AND FUTURE WORK
7.1 CONCLUSION
REFERENCES
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
The rise of deepfake technology has sparked widespread concern in recent years, as it presents a
dual nature of both innovation and peril. While these techniques hold promise in areas like
filmmaking, computer graphics, and virtual reality, they also pose a grave threat to public trust and
security. Deepfakes, which seamlessly manipulate both audio and visual elements, have been
weaponized by malicious actors to deceive, defraud, and manipulate. This technology enables
attackers to bypass authentication systems, impersonate prominent figures such as celebrities and
politicians, and perpetrate financial scams with alarming ease. Moreover, the accessibility of
deepfake tools, such as Face Swap and Lip-Sync, empowers individuals with little expertise to
produce convincing forgeries, amplifying the potential for widespread harm.
As shown in Figure 1, a poignant example of this threat is the viral video featuring the
likeness and voice of Meta CEO Mark Zuckerberg, which circulated on Instagram, misleading
unsuspecting viewers. Such instances not only endanger the reputation and privacy of the
counterfeited individual but also erode public trust in the authenticity of digital media.
Compounding the challenge is the fact that deepfakes often involve simultaneous manipulation of
both audio and visual components, making them exceedingly difficult to detect using traditional
methods.
In response to this evolving threat landscape, researchers have developed various techniques
for deepfake detection. However, many of these approaches focus on analyzing either audio or
visual data in isolation, overlooking the nuanced interplay between the two modalities. While some
efforts have explored multi-modal detection, these methods have not fully harnessed the potential
synergies between audio and visual signals. Moreover, existing datasets for training and evaluating
deepfake detectors often lack diversity in terms of forgery methods and modalities, limiting their
effectiveness in real-world scenarios.
To address these gaps, we introduce DefakeAVMiT, a comprehensive multi-modal dataset
comprising over 6,000 visuals paired with corresponding audios, each potentially modified using a
range of forgery techniques. This dataset provides a rich training environment for developing robust
multi-modal deepfake detection systems. Additionally, we propose our novel approach, Audio-
Visual Joint Learning for Detecting Deepfake, which leverages the complementary nature of
audio and visual signals to identify inconsistencies indicative of manipulation. By jointly analyzing
temporal and spatial cues across both modalities, our method achieves superior detection
performance compared to traditional uni-modal systems and existing multi-modal approaches.
1.1 OBJECTIVES
The following are the objectives of the project:
1. High Accuracy: Develop algorithms and models that can accurately identify deepfake
content across various media types (images, videos, audio) and manipulation techniques (face
swapping, voice synthesis, etc.), minimizing false positives and negatives.
2. Scalability: Design the system to efficiently process large volumes of data, enabling it to
handle the vast amount of content generated and shared on online platforms in real-time or near-
real-time.
3. Robustness: Ensure the system's effectiveness against sophisticated deepfake techniques
by continuously updating and refining the detection methods to stay ahead of evolving manipulation
strategies.
4. Ethical and Legal Compliance: Establish guidelines and protocols to ensure the ethical
use of the system, respecting privacy rights, and adhering to legal regulations regarding data
handling and content moderation. Additionally, provide transparency on how the system operates
and its limitations to foster trust among users and stakeholders.
1.2 PROBLEM STATEMENT
With the proliferation of deepfake technology, the ability to generate highly realistic
synthetic media has raised significant concerns regarding its potential misuse for spreading
misinformation and manipulating public opinion. Deepfake videos pose a serious threat to the
integrity of visual and auditory content on the internet, potentially leading to widespread
misinformation campaigns, privacy breaches, and social unrest. The aim of this project is to develop
an effective deepfake detection system using deep learning networks to combat the spread of
synthetic media manipulation. The system will aim to accurately identify and flag deepfake
content. The primary challenge lies in distinguishing between genuine and manipulated video,
considering the rapid advancements in deepfake generation techniques that continually evolve to
deceive detection algorithms.
The rise of deepfake technology has underscored the urgent need for a robust detection
system built upon deep learning networks. These systems play a pivotal role in countering the
spread of misinformation by identifying and flagging fabricated videos that pose a threat to the
integrity of digital media. By preserving trust and authenticity in visual and auditory content,
deepfake detection systems help mitigate the erosion of confidence in online information sources.
Moreover, they serve as a crucial defense mechanism against potential harm to individuals and
public figures, protecting their reputations, privacy, and security from malicious exploitation.
Beyond individual safety, these systems contribute to broader societal efforts in preventing fraud,
cybercrime, and identity theft. Additionally, by promoting media literacy and awareness, they
empower users to critically evaluate the authenticity of digital content and make informed
decisions. Integrated into content moderation workflows, deepfake detection capabilities support
platforms in their governance efforts, enabling them to maintain safer online environments by
identifying and removing illicit or harmful deepfake content. In essence, the development and
deployment of deepfake detection systems using deep learning networks address critical societal
needs by combating misinformation, preserving trust, protecting individuals, promoting media
literacy, and supporting content moderation efforts in the digital age.
1.4 CHALLENGES IN THE SYSTEM
Developing a deepfake detection system using deep learning networks presents a host of
challenges that demand innovative solutions. Adversarial attacks continuously evolve, posing a
formidable hurdle as creators tweak their techniques to outsmart detection algorithms. Acquiring
diverse datasets containing both genuine and deepfake media is critical, yet challenging due to
privacy concerns and data scarcity. Ensuring models generalize well to unseen variations and
sources while maintaining interpretability remains a balancing act. Scalability is essential for real-
time processing of vast media volumes, requiring optimizations in inference speed and resource
efficiency. Data privacy and security concerns necessitate robust measures to safeguard sensitive
information. Ethical considerations, including fairness and bias mitigation, are paramount to
ensure responsible deployment. Additionally, resource constraints demand efficient utilization of
computational resources. Addressing these multifaceted challenges requires interdisciplinary
collaboration and innovative methodologies to develop reliable deepfake detection systems that
uphold ethical principles while mitigating the risks associated with synthetic media manipulation.
The scope of a deepfake detection system project is multifaceted, encompassing technical, ethical,
and practical considerations. Technically, the project involves collecting and annotating diverse datasets,
developing and implementing deep learning models tailored for deepfake detection, and evaluating their
performance against standard metrics. Ethically, the project must prioritize privacy preservation,
transparency, and compliance with legal regulations, while also establishing guidelines for responsible use
and addressing potential biases. Practically, the project entails designing a user-friendly interface, optimizing
scalability and performance, implementing feedback mechanisms for continuous improvement, and
managing computational resources efficiently. Additionally, educational initiatives and collaboration efforts
are essential for raising awareness about deepfakes, fostering media literacy, and sharing insights and best
practices within the community. By addressing these aspects comprehensively, a deepfake detection system
project can effectively contribute to the mitigation of synthetic media manipulation and the preservation of
trust and authenticity in digital content.
CHAPTER 2
LITERATURE SURVEY
Deepfakes take their name from the fact that they use deep learning technology to create
fake visuals or audios. Recent years have witnessed the rapid development of deepfake techniques,
which enable attackers to manipulate videos on a more highly-detailed and multi-modal level. For
example, Zhou et al. [15] proposed a deepfake system to generate pose-controllable talking faces
with accurate lip synchronization, which was extremely vivid in detail. Similarly, Ji et al. [16]
focused on synthesizing high-quality video portraits with emotional dynamics driven by audios.
Deng et al. [17] presented an unsupervised approach to retarget the speech of any unknown speaker
to an audio-visual stream of a known speaker. These powerful video manipulation methods are more
diverse, more complex, and more difficult to detect and thus present a more challenging problem
for deepfake detection.
For image or visual deepfake detection, some efforts focus on special artifacts caused by
face forging. For example, Zhao et al. [18] formulated deepfake detection as a fine-grained
classification problem and proposed a new multiattentional deepfake detection network. Since face
swapping methods would leave partial context unchanged, Nirkin et al. [19] detected deepfakes by
comparing two image-derived identity embeddings. Some recent works sought intra-modal
inconsistency by using source images and forged images [1], [20]. However, these methods were
not suitable for detecting unseen spoofing techniques. To address this, some approaches aimed to
detect deepfakes with the help of lip movements and mouth shape, which were more fine-grained
features. For example, LipForensics [8] was proposed, which targeted high-level semantic
irregularities in mouth movements for detecting generated videos. Yu et al. [21] proposed a
commonality learning strategy to learn the common forgery features from different forgery
databases. A motion-based fundamental feature extraction network [7] was proposed to extract
information about talking habits for authentication. To further improve the transferability, Liu et al.
[22] presented a novel Spatial-Phase Shallow Learning method, which combined spatial image and
phase spectrum to better capture the up-sampling facial artifacts.
For audio deepfake detection, ASVspoof [14], [23], [24] is a series of competitions that
aim to protect automatic speaker verification systems from manipulation. Some methods have been
proposed for ASVspoof challenges such as AASIST [25] and wav2vec 2.0 [26]. Based on the
LCNN framework, Monteiro et al. [27] proposed a setting for the detection of logical and physical
presentation attacks in audio deepfakes. Jung et al. and Kim et al. [6], [28] proposed back-end
classification models to utilize raw waveforms for speaker verification. Recently, a transfer-learning-
based approach was adopted for unseen audio forgeries [29]. In [2], a capsule network was applied
to enhance the generalization of audio spoofing attack detection. Although [30] proposed a
framework for both audio spoof and visual deepfake detection, it still viewed audio and visual as
independent tasks and ignored the correlation between the two modalities.
There are also some other works for multi-modal deepfake detection. Wang et al. [31]
exploited frequency information as a complementary modality to reveal artifacts that were not
perceptible in the RGB domain for deepfake detection. Reference [32] proposed unified audio-
visual deepfake detection frameworks based on modality dissonance and dense hierarchical
features. Mittal et al. [9] extracted perceived emotion cues between audio and visual modalities for
deepfake detection. Cheng et al. [33] addressed the deepfake detection from a voice-face matching
view via the intrinsic correlation of facial and audio. Cai et al. [34] proposed a boundary aware
temporal forgery detection method for detecting content driven audiovisual manipulations. Agarwal
et al. [35] detected the artifacts by comparing the dynamics of the mouth shape with a spoken
phoneme. However, this work focused on the explicit representational information between multi-
modalities, while ignoring the implicit feature non-synergy. Moreover, these multi-modal methods
regarded audios as additional supervision signals and neglected the possibility that audio can also be
forged, which is quite common in the real world. More importantly, compared to the explicit audio-
visual inconsistency, the implicit potential correlation between multi-modalities is still under-
explored and under-utilized.
Visual and sound are the two most popular modalities for representation learning and
there has been a great deal of related multi-modal learning work. Recently, audio-visual representation
learning has been explored for video captioning, localization, and action recognition, due to the
pervasive concurrency property of the two modalities. For example, Iashin et al. [36] introduced audio-
visual multi-modal features into dense video captioning. Wu et al. [37] aimed to localize sound sources in a
visual scene and designed a novel Binaural Audio-Visual Network (BAVNet). To temporally parse
a video into audio or visual event categories, Lin et al. [38] exploited the cross-modality co-
occurrence of audio-visual to localize segments of target events. A novel audio-visual transformer
[39] framework has been proposed to localize audio-visual events with audio features jointly
observed over visual features. For the action recognition task, audio has ideal properties to aid
efficient recognition in long untrimmed videos. Based on this, Gao et al. [40] devised a novel
framework that focuses on clip-level recognition by distilling from the lighter modalities, a single
frame and its accompanying audio. Contrastive learning [41] has also been exploited for action
recognition and video understanding by transferring knowledge across heterogeneous modalities
between audio-visual. Recently, some weakly-supervised approaches have been explored for multi-
modal representation learning with audio-visual such as Audio-visual Transformer (AV-
transformer) [39], AudioVisual Interacting Network (AVIN) [42], and Multi-Modality Self-
Distillation (MMSD) [43].
The recent success of audio-visual representation learning proves the inherent and
pervasive correlations of multi-modalities, which can be used as extra self-supervision signals.
However, due to the heterogeneity of modalities, the combination of audio-visual could also
encounter many difficulties. Qian et al. [44] proved that directly transferring the information from
one modality to another may lead to modality conflicts and redundancy. Although current audio-visual
learning systems have made substantial progress, their performance can degrade dramatically
under more challenging conditions, such as fine-grained retrieval and deepfake detection. We
believe that simple concatenation or fusion operations are not the optimal choice for exploiting the
information in audio-visual signals. More inherent relationships and more common inconsistencies
between the two modalities remain to be explored for multi-modal deepfake detection.
Inappropriate true-false alignment between multi-modalities could lead to more false positives, thus
bringing more noise interference to the detector. Accordingly, detecting deepfake with potential
inconsistencies of audio-visual in the hidden feature space is a more difficult and practical task.
CHAPTER 3
SYSTEM DESIGN
The project comprises five essential modules: Preprocessing ensures data quality and
readiness; Audio and Video Feature Extraction extract pertinent features from respective
sources; Multi-Modal Fusion integrates these features for comprehensive analysis;
Frontend Integration utilizes React.js to create a user-friendly interface, while Flask
handles the deployment of the trained deep learning model, enabling efficient predictions via
APIs.
Each module contributes uniquely to the system's functionality, from data preparation
to model deployment, ensuring a robust and user-accessible solution.
3.2 SYSTEM REQUIREMENTS
Before diving into hardware requirements, it's crucial to outline the complexity and
demands of deepfake detection systems. These systems employ deep learning algorithms to discern
authentic content from manipulated or synthetic media, requiring substantial computational resources
for both training and inference. Deep learning models for video analysis are particularly resource-
intensive due to the high-dimensional nature of multimedia data. Moreover, the scale and diversity of
deepfake datasets necessitate efficient data preprocessing pipelines and substantial memory resources
during training. Thus, while hardware forms the backbone of these systems, it's equally essential to
consider software frameworks, model architectures, and data management strategies to ensure the
effectiveness and scalability of deepfake detection solutions.
Processor: The system should possess a processor with ample processing power to execute
the deep learning algorithms involved in deepfake detection. A multi-core processor like
Intel Core i5 or i7 is recommended to handle the computational load efficiently.
Memory: Sufficient memory is essential for storing and manipulating the large datasets and
model parameters required for deep learning. A minimum of 16GB of RAM is
recommended to ensure smooth operation.
Storage: Adequate storage capacity is necessary to store the datasets, trained models, and
intermediate results generated during deepfake detection processes. A minimum of 512GB
of storage is recommended to accommodate the large volumes of data effectively.
Connectivity: The system should support various connectivity options such as Wi-Fi,
Bluetooth, or Ethernet to facilitate data transfer from external sources and communication
with other devices or networks.
Dataset: A diverse and comprehensive dataset containing both genuine and deepfake media
samples is essential for training, testing, and validating the deepfake detection system.
Access to a curated deepfake dataset, along with relevant annotations, is critical for
assessing the system's performance accurately.
Programming language: The project should be implemented using a programming
language suitable for deep learning such as Python.
Deep Learning Framework: The system should utilize a deep learning framework such as
TensorFlow, PyTorch, or Keras for model development, training, and inference. The
framework should provide flexibility, scalability, and support for state-of-the-art deep
learning techniques.
Video Processing Library: The deepfake detection system should utilize a video processing
library such as FFmpeg or OpenCV for implementing various video analysis and
manipulation techniques. These libraries provide comprehensive functionality for processing
video streams, extracting frames, performing temporal analysis, and applying deep learning-
based algorithms for deepfake detection.
Version control: The project should use a version control system such as Git to manage the
codebase and track changes.
Operating System: The system should be compatible with popular operating systems like
Linux, Windows, or macOS, depending on the preferred development environment and
deployment platform.
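As a quick sanity check of the software stack listed above, the short script below (a minimal sketch, assuming a PyTorch-based setup with OpenCV, Librosa, and Flask) prints the versions of the key libraries; the exact package set should be adapted to the frameworks chosen for the project.

# Minimal environment check for an assumed PyTorch + OpenCV + Librosa + Flask stack.
# Run once after installing the dependencies to confirm the versions in use.
import sys

import torch
import cv2
import librosa
import flask

print("Python        :", sys.version.split()[0])
print("PyTorch       :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("OpenCV        :", cv2.__version__)
print("Librosa       :", librosa.__version__)
print("Flask         :", flask.__version__)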
3.2.2 NON-FUNCTIONAL REQUIREMENTS:
The following are some potential non-functional requirements for the
project “DeepFake Detection using Deep Learning”:
3.2.2.1 Performance:
Scalability: The system should be scalable to handle increasing volumes of media content
and user requests without compromising performance, allowing for seamless operation in
high-demand scenarios.
Accuracy: The deepfake detection system should exhibit high accuracy in distinguishing
between genuine and manipulated media, minimizing false positives and negatives to
maintain reliability.
3.2.2.2 Reliability:
Robustness: The system should be robust against variations in deepfake techniques and
sources, ensuring consistent performance across different types of manipulation and media
modalities.
Fault Tolerance: The system should be resilient to failures or disruptions, with mechanisms
in place to recover gracefully from errors and maintain uninterrupted operation.
3.2.2.3 Security:
Data Privacy: The system should adhere to strict data privacy regulations, safeguarding
sensitive information processed during deepfake detection, including personal media content
and user data.
3.2.2.4 Usability:
User Interface: The system should feature an intuitive and user-friendly interface, allowing
users to interact with the detection system effortlessly and interpret results effectively.
Accessibility: The system should be accessible to users with diverse technical backgrounds
and abilities, providing clear instructions and support for user training and assistance.
3.2.2.5 Maintainability:
By addressing these non-functional requirements, the deepfake detection system can deliver
optimal performance, reliability, security, usability, and maintainability, ensuring effective
mitigation of synthetic media manipulation and safeguarding the integrity of digital content.
CHAPTER 4
MODULE DESCRIPTION
Overall, the preprocessing pipeline involves detecting faces in input videos, extracting face
crops based on the detected faces, organizing the dataset, and utilizing utility functions for various
tasks. The pipeline ensures that the data is properly processed and prepared for subsequent steps in
the deepfake detection process.
1. Audio Extraction:
Module: Audio loading functions or libraries (e.g., Librosa, FFmpeg).
Description: Extract audio streams from the provided video files. This step involves separating
audio tracks from video files to obtain standalone audio files. The extracted audio files will be used
for further analysis and feature extraction.
2. Audio Conversion:
Module: Audio conversion utilities or libraries (e.g., FFmpeg).
Description: Convert the extracted audio files into a common audio format suitable for analysis.
Common formats include WAV, MP3, or FLAC. Conversion ensures uniformity across the dataset
and compatibility with audio processing libraries and tools.
These preprocessing steps ensure that the audio data is properly extracted, converted, and
prepared for further analysis and feature extraction. By standardizing the audio files and removing
unwanted noise or silence, the dataset becomes suitable for training deep learning models or other
machine learning algorithms for deepfake detection.
Figure 4.1: Data Preprocessing module
PseudoCode
Step 1: Face Detection
def detect_faces(video_paths, output_dir):
    # FacenetDetector wraps the MTCNN face detector (see Section 5.2)
    detector = FacenetDetector()
    for video_path in video_paths:
        frames = load_frames(video_path)                 # read the video frames
        bounding_boxes = detector.detect_faces(frames)   # per-frame face boxes
        save_bounding_boxes(bounding_boxes, video_path, output_dir)
Step 4: Audio Extraction
import os, subprocess

def extract_audio(video_paths, output_dir):
    for video_path in video_paths:
        base_name = os.path.splitext(os.path.basename(video_path))[0]
        audio_file = os.path.join(output_dir, base_name + '.wav')
        # Extract the audio track as 16-bit PCM WAV (44.1 kHz, stereo)
        command = ["ffmpeg", "-i", video_path, "-vn", "-acodec", "pcm_s16le",
                   "-ar", "44100", "-ac", "2", audio_file]
        subprocess.run(command, check=True)
Overall, these data pre-processing steps will help in preparing the data for efficient model
training and testing and will also contribute to the overall robustness of the model.
Spectrograms capture the frequency content of audio signals over time and serve as essential
features for analyzing and detecting anomalies or inconsistencies in audio data, such as those
introduced by deepfake manipulation. The following steps demonstrate the process of extracting
audio features, specifically spectrograms, from video files for deepfake detection.
3. Calculate Spectrogram:
Description: Compute the spectrogram of each audio segment using librosa.stft. The spectrogram
represents the frequency content of the audio signal over time. The magnitude of the Short-Time
Fourier Transform (STFT) is computed and used as the spectrogram.
4. Store Spectrograms and Save to JSON:
Description: Store the computed spectrograms as lists of lists. Each inner list represents a frame of
the spectrogram, containing the magnitude values for different frequency bins. Finally, save the
extracted spectrograms to JSON files. Each JSON file corresponds to a video file and contains the
spectrograms of its audio segments.
1
91
Figure 4.3: Audio and Video Feature Extraction module
    RETURN spectrograms

FUNCTION main():
    video_folder = GetVideoFolder()                      // Path to the folder containing video files
    FOR EACH video_file IN video_folder:
        IF video_file is .mp4 OR .avi:
            video_path = video_folder + video_file
            spectrograms = extract_spectrogram(video_path)       // Extract spectrograms from video
            json_filename = GetJSONFilename(video_file)          // Construct JSON filename
            SaveSpectrogramsToJSON(spectrograms, json_filename)  // Save spectrograms to JSON
    TRY:
        # Open image and convert to RGB
        image = OpenImage(image_path).ConvertToRGB()
    EXCEPT Exception AS e:
        PRINT "Error:", e
        RETURN None

FUNCTION main():
    # Define data path, model name, and number of classes
    data_path = DefineDataPath()
    model_name = DefineModelName()
    num_classes = DefineNumClasses()
4.1.4 MULTI MODAL FUSION MODULE
1. Initialization:
The MultiModalJointDecoder module is initialized with input and output sizes, tailored to the task
at hand. Input size is determined by the combined features from video and audio modalities, while
output size represents the number of target classes (e.g., real or fake).
2. Forward Pass:
During the forward pass, combined features from the TemporalSpatialEncoder are fed into the
MultiModalJointDecoder.
4. Output Layer:
Subsequently, another linear transformation (self.fc2) is applied to map the transformed features to
the desired output size. This layer prepares raw predictions for each class.
5. Prediction Generation:
For tasks like binary classification (e.g., real vs. fake), a sigmoid activation function is applied to
the final output, producing probabilities for each class. Thresholding these probabilities yields
binary predictions, with values above the threshold indicating one class and values below indicating
the other.
This streamlined explanation encapsulates the key operations of the MultiModalJointDecoder, from
initialization to prediction generation, crucial for its role in deepfake detection systems.
Figure 4.4: Multi Modal Fusion module
PseudoCode:
CLASS MultiModalJointDecoder:
    METHOD __init__(input_size, output_size):
        INITIALIZE fc1, fc2 as fully connected layers
    METHOD forward(x):
        x = Apply fc1 and ReLU to x
        x = Apply fc2 to x
        RETURN x
audio_folder = DefineAudioFolder()
video_folder = DefineVideoFolder()
temporal_spatial_encoder = InstantiateTemporalSpatialEncoder()
output_size = DefineOutputSize()
video_filenames = GetVideoFilenames(video_folder)
PrintPredictions()
The steps for integrating the deep learning model with a front-end module:
2. Integrate Flask:
Set up a Flask application to serve as the backend server for handling requests from the front end.
Create routes within Flask to handle model inference requests, such as receiving input data, running
inference using the trained model, and returning the results.
3. Design UI:
Design the user interface (UI) for the front end module using web development technologies like
HTML, CSS, and JavaScript. Consider the user experience and design aesthetic to create an
intuitive and visually appealing interface for interacting with the deepfake detection system.
4. Connect Backend:
Establish communication between the front end and back end by making AJAX requests from the
UI to the Flask backend. Define API endpoints in Flask to receive data from the front end, process it
using the deep learning model, and send back the results.
5. Test Functionality:
Test the functionality of the integrated system to ensure proper communication between the front
end and back end. Test various scenarios, including uploading different types of videos or images,
triggering model inference, and handling responses.
Figure 4.5: Frontend Integration module
4.1.6 INNOVATION
By integrating vision and time-series transformers for video feature extraction alongside
spectrogram analysis for audio feature extraction, our deepfake detection system achieves superior
performance compared to existing models. This integration allows for the extraction of rich spatial
and temporal features from video data, capturing intricate visual patterns and dynamic motion cues,
while also capturing subtle nuances in audio characteristics such as speech patterns and background
noise. Through a multi-modal joint decoder, these features are seamlessly integrated into a unified
representation, enabling holistic understanding across modalities and enhancing the model's
discriminative power. This comprehensive analysis across visual and audio modalities enables the
detection of deepfake manipulations that may attempt to deceive singular detection methods,
leading to more accurate and robust detection outcomes overall.
CHAPTER 5
The DFDC (DeepFake Detection Challenge) dataset and FakeAVCeleb dataset are two
prominent resources in the realm of deepfake detection research. DFDC comprises a diverse array
of about 9,000 videos, including both genuine and deepfake content, meticulously crafted to
challenge algorithms in distinguishing between real and manipulated media. With a focus on
fostering advancements in deepfake detection, DFDC serves as a foundational dataset for training
and evaluating detection models. Similarly, the FakeAVCeleb dataset offers a rich collection of
about 12,000 videos featuring celebrities and public figures, encompassing both authentic
and manipulated footage. This dataset provides researchers with a robust benchmark for assessing
the efficacy of detection techniques against sophisticated deepfake manipulations. Both datasets
represent pivotal resources for the development of strategies to combat the proliferation of deepfake
content, contributing to the preservation of trust and integrity in digital media environments.
5.2 DATA PREPROCESSING
a) Face Detection: Detect and localize faces in video frames using the MTCNN model.
Code: Utilizes the FacenetDetector class, which internally implements the MTCNN model for face
detection.
from typing import List

class FacenetDetector(VideoFaceDetector):
    # self.detector is assumed to be an MTCNN instance (facenet-pytorch),
    # initialized in the VideoFaceDetector base class or in __init__.
    def _detect_faces(self, frames) -> List:
        batch_boxes, *_ = self.detector.detect(frames, landmarks=False)
        return [b.tolist() if b is not None else None for b in batch_boxes]
Figure 5.2.1: Bounding box coordinates in format [xmin, ymin, xmax, ymax].
b) Face Crop Extraction: Extract face crops from video frames based on detected bounding boxes.
Code: Reads bounding box information from JSON files and extracts corresponding face crops (a
minimal sketch is given after this list).
Figure 5.2.2.2: Extracted face images cropped using box coordinates for fake video
c) Data Loading and Organization: Fetch and organize video paths for efficient processing.
Code: Identifies video paths, handles dataset distinctions, and filters out processed videos.
d) Utility Functions: Provide various utility functionalities for data manipulation and management.
Code: Includes functions for resizing images, extracting method information, etc.
Each step plays a crucial role in preprocessing the data for deepfake detection, ensuring that the
subsequent analysis is accurate and efficient.
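To illustrate step (b), the following is a minimal sketch of face-crop extraction, assuming the bounding boxes from step (a) are stored as a JSON file that maps frame indices to lists of [xmin, ymin, xmax, ymax] boxes; the file layout and function name are assumptions, not the project's exact implementation.

import json
import os
import cv2

def extract_face_crops(video_path, boxes_json_path, output_dir):
    """Crop detected faces from each video frame using previously saved bounding boxes."""
    with open(boxes_json_path) as f:
        # Assumed layout: {"0": [[xmin, ymin, xmax, ymax], ...], "1": [...], ...}
        boxes_per_frame = json.load(f)
    os.makedirs(output_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        for face_idx, box in enumerate(boxes_per_frame.get(str(frame_idx)) or []):
            xmin, ymin, xmax, ymax = [int(round(v)) for v in box]
            crop = frame[max(ymin, 0):ymax, max(xmin, 0):xmax]
            if crop.size > 0:
                cv2.imwrite(os.path.join(output_dir, f"{frame_idx}_{face_idx}.png"), crop)
        frame_idx += 1
    capture.release()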
a) Load Video and Extract Audio Segments: Load the video file using MoviePy and extract
audio segments of specified duration.
Code: Utilizes moviepy.editor.VideoFileClip to load the video and subclip function to extract audio
segments.
import moviepy.editor as mp

video = mp.VideoFileClip(video_path)
audio = video.audio.subclip(i, min(i + segment_duration_sec, video.duration))
b) Convert Audio to WAV Format and Load: Convert extracted audio segments into WAV
format and load them using librosa.
Code: Temporarily save audio segments as WAV files and load them using librosa.load.
audio_temp_file = "temp_audio.wav"
audio.write_audiofile(audio_temp_file, codec='pcm_s16le', fps=44100)
y, sr = librosa.load(audio_temp_file)
c) Calculate Spectrogram: Compute the spectrogram of each audio segment using librosa's Short-
Time Fourier Transform (STFT).
Code: Utilizes librosa.stft to compute the STFT and obtain the magnitude values representing the
spectrogram.
S = np.abs(librosa.stft(y))
d) Store Spectrograms and Save to JSON: Store computed spectrograms as lists and save them to
JSON files corresponding to each video.
Code: Saves spectrograms as lists and writes them to JSON files for further analysis.
These steps outline the process of audio feature extraction, from loading the video to saving
the extracted spectrograms as JSON files, facilitating subsequent analysis and model training for
deepfake detection.
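Putting steps (a) to (d) together, the following is a minimal end-to-end sketch of the audio feature extraction described above; the one-second segment length, the temporary file name, and the helper name extract_spectrograms are illustrative assumptions rather than the project's exact implementation.

import json
import os

import numpy as np
import librosa
import moviepy.editor as mp

def extract_spectrograms(video_path, output_dir, segment_duration_sec=1):
    """Split a video's audio track into fixed-length segments and save their spectrograms as JSON."""
    video = mp.VideoFileClip(video_path)
    spectrograms = []
    for i in range(0, int(video.duration), segment_duration_sec):
        # (a) extract one audio segment; (b) write it as a temporary WAV and load it with librosa
        audio = video.audio.subclip(i, min(i + segment_duration_sec, video.duration))
        audio_temp_file = "temp_audio.wav"
        audio.write_audiofile(audio_temp_file, codec='pcm_s16le', fps=44100)
        y, sr = librosa.load(audio_temp_file)
        # (c) magnitude of the Short-Time Fourier Transform serves as the spectrogram
        S = np.abs(librosa.stft(y))
        spectrograms.append(S.tolist())
    # (d) one JSON file per video, holding the spectrograms of all its segments
    json_path = os.path.join(output_dir, os.path.splitext(os.path.basename(video_path))[0] + ".json")
    with open(json_path, "w") as f:
        json.dump(spectrograms, f)
    return spectrograms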
5.4 VIDEO FEATURE EXTRACTION
a) Loading Pretrained ViT Model: The code initializes a ViT model wrapper (ViTWrapper) with
a specified model name and number of classes, loading the pretrained ViT model using the
transformers library.
Code:
import torch.nn as nn
from transformers import ViTModel

class ViTWrapper(nn.Module):
    def __init__(self, model_name, num_classes):
        super().__init__()
        # Load the pretrained ViT backbone from the transformers library
        self.model = ViTModel.from_pretrained(model_name)
        self.model.config.num_classes = num_classes
b) Image Processing and Feature Extraction: Each image undergoes loading, transformation, and
feature extraction using the ViT model to capture visual representations.
Code:
def extract_features_vit(model, image_path):
    # Load and transform the image, then extract features with the ViT model
    image = load_and_transform_image(image_path)  # assumed preprocessing helper
    features = model(image)
    return features
c) Saving Features to JSON: Extracted features are serialized and stored in JSON format with
filenames derived from the original image filenames, facilitating easy association between image
data and features.
Code:
# Store features in JSON format, with filenames based on the original image filenames
with open(features_file_path, 'w') as f:
    json.dump(features.tolist(), f)
d) Iteration Over Subfolders: The code iterates through subfolders within the specified data path,
processing all image files with the ".png" extension.
Code:
# Iterate through subfolders and process image files with ".png"
extension
for root, dirs, _ in os.walk(data_path):
for subdir in dirs:
# Iterate through image files in each subfolder
for file in os.listdir(subdir_path):
# Process image files with ".png" extension
if file.lower().endswith(".png"):
# Extract features for each image file
features = extract_features_vit(vit_model,
image_path)
These steps collectively outline the process of extracting visual features from images using a
pretrained Vision Transformer (ViT) model, facilitating subsequent analysis for deepfake detection
tasks.
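For completeness, the following is a compact, self-contained sketch of this feature-extraction pipeline; the checkpoint name google/vit-base-patch16-224-in21k, the use of ViTImageProcessor (older transformers releases expose the same preprocessing as ViTFeatureExtractor), and the choice of the [CLS] token as the frame-level feature are illustrative assumptions rather than the project's exact configuration.

import json

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Assumed pretrained checkpoint; substitute the checkpoint actually used by the project.
MODEL_NAME = "google/vit-base-patch16-224-in21k"
processor = ViTImageProcessor.from_pretrained(MODEL_NAME)
vit_model = ViTModel.from_pretrained(MODEL_NAME)
vit_model.eval()

def extract_features_vit(image_path):
    """Return a visual feature vector for one face crop."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")  # resize, normalize, convert to tensor
    with torch.no_grad():
        outputs = vit_model(**inputs)
    # Use the [CLS] token embedding as the frame-level representation
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

features = extract_features_vit("face_crop.png")
with open("face_crop_features.json", "w") as f:
    json.dump(features.tolist(), f)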
5.5 MULTI MODAL FUSION
a) Initialization: The MultiModalJointDecoder module is initialized with input and output sizes,
defining its architecture for combining and processing multimodal features.
Code: Defines the MultiModalJointDecoder class with specified input and output sizes using linear
layers.
class MultiModalJointDecoder(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, input_size // 2)
        self.fc2 = nn.Linear(input_size // 2, output_size)
b) Forward Pass: During the forward pass, combined features from the TemporalSpatialEncoder
are propagated through the MultiModalJointDecoder for prediction generation.
Code: Implements the forward method in MultiModalJointDecoder, passing input features through
linear layers.
x = F.relu(self.fc1(x))
d) Output Layer: The final output layer of the MultiModalJointDecoder maps transformed
features to the desired output size, preparing raw predictions for each class.
Code: Defines the output layer using another linear transformation in the MultiModalJointDecoder
class.
e) Prediction Generation: For binary classification tasks, probabilities for each class are generated
using a sigmoid activation function applied to the final output, facilitating binary prediction
generation.
Code: Applies sigmoid activation to the final output for binary classification and thresholds
probabilities for binary predictions.
probabilities = torch.sigmoid(output)
binary_predictions = (probabilities >= threshold).float()
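Combining the snippets above, the following is a minimal runnable sketch of the fusion decoder; the input size of 1536 (768-dimensional video features concatenated with 768-dimensional audio features), the single output logit, and the 0.5 threshold are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalJointDecoder(nn.Module):
    """Maps fused audio-visual features to a real/fake prediction."""
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, input_size // 2)
        self.fc2 = nn.Linear(input_size // 2, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))   # hidden transformation
        return self.fc2(x)        # raw logits

# Example usage with assumed sizes: 768-dim video + 768-dim audio features, one output logit.
decoder = MultiModalJointDecoder(input_size=1536, output_size=1)
combined_features = torch.randn(4, 1536)              # a batch of 4 fused feature vectors
logits = decoder(combined_features)
probabilities = torch.sigmoid(logits)                 # probability of "fake"
binary_predictions = (probabilities >= 0.5).float()   # 1 = fake, 0 = real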
5.6 FRONT END INTEGRATION
Code:
from flask import Flask, render_template, request, jsonify
import torch

app = Flask(__name__, template_folder="templates")

@app.route('/', methods=['GET', 'POST'])
def homepage():
    if request.method == 'GET':
        return render_template('index.html')
    elif request.method == 'POST':
        # Handle model inference request
        # Load the saved model (Model is the project's detection network, defined elsewhere)
        model = Model(2)
        model.load_state_dict(torch.load('model/df_model.pt',
                                         map_location=torch.device('cpu')))
        model.eval()
        # Perform inference on the uploaded video using the loaded model
        # Return the inference result
        return jsonify({'result': 'fake'})

if __name__ == "__main__":
    app.run(port=3000)
c) Design UI - React.js:
Design a user interface using React.js to interact with the deepfake detection model.
Create components for uploading audio and video files, setting thresholds, and displaying
results.
Code:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DeepFake Detection</title>
</head>
<body>
<h1>DeepFake Detection System</h1>
<form id="uploadForm">
<input type="file" name="video" accept="video/*">
<button type="submit">Upload Video</button>
</form>
<div id="result"></div>
</body>
</html>
Figure 5.6.1: Home Page
d) Connect Backend - Flask:
Utilize AJAX requests in React.js to communicate with the Flask backend.
Implement functions to send input data to Flask routes and receive predictions in JSON
format.
Handle HTTP Requests with AJAX:
Use AJAX requests to send data asynchronously to Flask endpoints for processing.
Handle responses from Flask to update the UI with prediction results.
Code:
document.getElementById('uploadForm').addEventListener('submit', function(event) {
    event.preventDefault();
    var formData = new FormData(this);
    fetch('/', {
        method: 'POST',
        body: formData
    })
    .then(response => response.json())
    .then(data => {
        // Handle response from backend and update UI accordingly
        document.getElementById('result').innerText = 'Detection Result: ' + data.result;
    })
    .catch(error => console.error('Error:', error));
});
Figure 5.6.3: Handling HTTP requests
e) Test Functionality:
Conduct unit tests for each component to ensure proper functionality.
Test file upload, prediction generation, and result display.
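One way to automate these checks is with Flask's built-in test client, as sketched below; the module name app and the form field name video mirror the snippets above but are assumptions about the project's layout.

import io
import json

# Assumed: the Flask application object from the backend snippet is importable as app.app
from app import app

def test_homepage_loads():
    client = app.test_client()
    response = client.get('/')
    assert response.status_code == 200

def test_video_upload_returns_prediction():
    client = app.test_client()
    # Send a tiny dummy payload in place of a real video file
    data = {'video': (io.BytesIO(b'dummy video bytes'), 'sample.mp4')}
    response = client.post('/', data=data, content_type='multipart/form-data')
    result = json.loads(response.data)
    assert result['result'] in ('real', 'fake')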
Figure 5.6.5: Result along with trust score (REAL video)
CHAPTER 6
TEST CASES AND PERFORMANCE EVALUATION
MODULE 1: PREPROCESSING
Test Case ID: TC_01
Input: Video file (e.g., .mp4)
Pre-condition: Video file exists
Description: Verify that video preprocessing correctly extracts frames from the video.
Steps: 1. Provide a video file to the module. 2. Execute the video preprocessing.
Expected Output: Extracted frames from the video.
Actual Output: Extracted frames match the expected frames.

Test Case ID: TC_02
Input: Video file with low resolution
Pre-condition: Video file with low resolution exists
Description: Ensure that the preprocessing module handles videos with low resolution appropriately.
Steps: 1. Provide a low-resolution video file to the module. 2. Execute the video preprocessing.
Expected Output: Preprocessed frames with acceptable quality.
Actual Output: Preprocessed frames maintain acceptable quality despite the low resolution.
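TC_01 can also be exercised automatically. The sketch below uses an extract_frames helper as a stand-in for the project's preprocessing entry point (the helper and the sample file name are assumptions) and checks that frames are actually produced.

import os
import cv2

def extract_frames(video_path, output_dir):
    """Stand-in for the preprocessing module: write every frame of the video as a PNG."""
    os.makedirs(output_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    count = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(output_dir, f"frame_{count:05d}.png"), frame)
        count += 1
    capture.release()
    return count

def test_tc_01_frames_are_extracted():
    # Step 1: provide a video file; Step 2: execute the preprocessing
    extracted = extract_frames("sample.mp4", "frames_out")
    # Expected output: frames are extracted from the video
    assert extracted > 0
    assert len(os.listdir("frames_out")) == extracted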
6.2.1 ACCURACY
Accuracy is commonly used for deepfake detection as it measures the overall correctness of the
model's predictions, providing a straightforward assessment of its performance. It's calculated by
dividing the number of correctly classified samples by the total number of samples.
Formula:
Accuracy = Number of correctly classified samples
           ---------------------------------------
                 Total number of samples
6.2.2 PRECISION
Precision is used in deepfake detection systems to measure the proportion of true positive
detections among all positive detections made by the model. It helps in understanding the reliability of
the model in correctly identifying deepfake instances without falsely labeling genuine content as fake.
Formula:
Precision = TP
---------------
TP + FP
where:
TP (True Positives) are the instances correctly identified as deepfakes.
FP (False Positives) are the instances incorrectly identified as deepfakes.
6.2.3 RECALL
Recall is utilized in deepfake detection systems to measure the ability of the model to correctly
identify true deepfake instances among all actual deepfakes present in the dataset. It quantifies the ratio
of true positive predictions to the total number of actual deepfakes. High recall indicates that the model
is effectively capturing most of the deepfake instances.
Formula:
Recall = TP
         ---------
         TP + FN
where:
TP (True Positives) are the instances correctly identified as deepfakes.
FN (False Negatives) are the deepfake instances incorrectly classified as genuine.
6.2.4 F1-SCORE
The F1-score is used for deepfake detection because it balances precision and recall, providing a
single metric to evaluate a model's performance in distinguishing between real and fake videos. It is
especially useful when dealing with imbalanced datasets commonly encountered in deepfake detection
tasks.
Formula:
F1-score = 2 x precision x recall
-------------------------------
precision + recall
6.2.5 INTERSECTION OVER UNION (IOU)
Intersection over Union (IoU) is used for deepfake detection because it quantifies the spatial
overlap between the predicted and ground truth bounding boxes. It measures the ratio of the intersection
area to the union area of the predicted and ground truth bounding boxes, providing a robust evaluation of
object localization accuracy.
Formula:
IoU = Area of Intersection / Area of Union.
6.2.6 AREA UNDER THE CURVE (AUC)
The area under the curve (AUC) performance metric is used for deepfake detection because it
provides a comprehensive measure of classifier performance across all possible decision thresholds.
AUC summarizes the classifier's ability to discriminate between classes (e.g., real vs. fake) regardless of
the threshold chosen.
Formula:
AUC = ∫(TPR(FPR)) d(FPR)
where TPR is the true positive rate (sensitivity) and FPR is the false positive rate (1 - specificity).
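The metrics above can be computed directly from the model's predictions; the following is a minimal sketch using scikit-learn for the classification metrics together with a small helper for IoU (the labels, scores, and boxes shown are placeholders, not the project's results).

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Placeholder ground-truth labels (1 = fake, 0 = real), predicted labels, and predicted scores
y_true  = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.92, 0.08, 0.85, 0.40, 0.22, 0.77, 0.61, 0.15]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))

def iou(box_a, box_b):
    """Intersection over Union of two [xmin, ymin, xmax, ymax] boxes."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

print("IoU      :", iou([10, 10, 110, 110], [20, 20, 120, 120]))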
MODEL RESULTS:
ACCURACY 92.00%
PRECISION 94.02%
RECALL 90.00%
F1-SCORE 92.67%
IOU 98.28%
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 CONCLUSION
REFERENCES
1. L. Huang, L. Wang, and H. Li, “Multi-modality self-distillation for weakly supervised temporal action localization,” IEEE Trans. Image Process., vol. 31, pp. 1504–1519, 2022.
2. J.-W. Jung et al., “AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” 2021, arXiv:2110.01200.
3. C. Zhang, A. Gupta, and A. Zisserman, “Temporal query networks for fine-grained video understanding,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 4486–4496.
4. P. Korshunov and S. Marcel, “DeepFakes: A new threat to face recognition? Assessment and detection,” 2018, arXiv:1812.08685.
5. Y. Qian, Z. Chen, and S. Wang, “Audio-visual deep neural network for robust person verification,” IEEE Trans. Audio Speech Lang. Process., Oct. 2021, pp. 1079–1092.
6. J. Ramaswamy, “What makes the sound? A dual-modality interacting network for audio-visual event localization,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 4372–4376.
7. Y. B. Lin and Y. C. F. Wang, “Audiovisual transformer with instance attention for audio-visual event localization,” in Proc. Asian Conf. Comput. Vis., 2020, pp. 1–17.