
DIGITAL MEDIA MANIPULATION DETECTION
USING HYBRID ENSEMBLING MODEL
A PROJECT REPORT
Submitted by
KALVA BHAGEERATH [RA2111026010072]
GARLA DHEERAJ [RA2111026010113]
Under the Guidance of
Dr. MEENAKSHI M
Assistant Professor, Department of Computational Intelligence
in partial fulfillment of the requirements for the degree
of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
with specialization in Artificial Intelligence and
Machine Learning

DEPARTMENT OF COMPUTATIONAL INTELLIGENCE


COLLEGE OF ENGINEERING AND TECHNOLOGY
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR- 603 203
MAY 2025
Department of Computational Intelligence
SRM Institute of Science & Technology
Own Work* Declaration Form

This sheet must be filled in (each box ticked to show that the condition has been met). It must be
signed and dated along with your student registration number and included with all assignments
you submit – work will not be marked unless this is done.
To be completed by the student for all assessments

Degree/ Course : B.Tech Computer Science and Engineering with specialization in
Artificial Intelligence and Machine Learning

Student Name : Kalva Bhageerath, Garla Dheeraj

Registration Number : RA2111026010072, RA2111026010113

Title of Work : Digital Media Manipulation Detection using Hybrid
Ensembling Model

I / We hereby certify that this assessment complies with the University’s Rules and Regulations
relating to Academic misconduct and plagiarism**, as listed in the University Website,
Regulations, and the Education Committee guidelines.

I / We confirm that all the work contained in this assessment is my / our own except where
indicated, and that I / We have met the following conditions:

• Clearly referenced / listed all sources as appropriate


• Referenced and put in inverted commas all quoted text (from books, web, etc.)
• Given the sources of all pictures, data, etc. that are not my own
• Not made any use of the report(s) or essay(s) of any other student(s), either past or present
• Acknowledged in appropriate places any help that I have received from others (e.g.
fellow students, technicians, statisticians, external sources)
• Complied with any other plagiarism criteria specified in the Course handbook /
University website

I understand that any false claim for this work will be penalized in accordance with the
University policies and regulations.

DECLARATION:
I am aware of and understand the University’s policy on Academic misconduct and plagiarism and I certify
that this assessment is my / our own work, except where indicated by referring, and that I have followed
the good academic practices noted above.

Kalva Bhageerath (RA2111026010072), Garla Dheeraj (RA2111026010113)


SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203
BONAFIDE CERTIFICATE

Certified that 18CSP109L – Major Project report titled “DIGITAL MEDIA
MANIPULATION DETECTION USING HYBRID ENSEMBLING
MODEL” is the bonafide work of “KALVA BHAGEERATH
[RA2111026010072], GARLA DHEERAJ [RA2111026010113]”, who carried
out the project work under my supervision. Certified further, that to the best of
my knowledge, the work reported herein does not form part of any other project
report or dissertation on the basis of which a degree or award was conferred on
an earlier occasion on this or any other candidate.

SIGNATURE                                      SIGNATURE

Dr. M. Meenakshi                               Dr. Annie Uthra R
SUPERVISOR                                     PROFESSOR & HEAD OF THE DEPARTMENT
Assistant Professor                            Department of Computational Intelligence
Department of Computational Intelligence

EXAMINER 1                                     EXAMINER 2
ACKNOWLEDGEMENTS
We express our humble gratitude to Dr. C. Muthamizhchelvan, Vice-Chancellor,
SRM Institute of Science and Technology, for the facilities extended for the project work
and his continued support. We extend our sincere thanks to Dr. T. V. Gopal, Dean-CET,
SRM Institute of Science and Technology, for his invaluable support. We wish to thank
Dr. Revathi Venkataraman, Professor and Chairperson, School of Computing, SRM
Institute of Science and Technology, for her support throughout the project work.
We extend our sincere thanks to Dr. M. Pushpalatha, Professor and Associate
Chairperson, School of Computing, and Dr. C. Lakshmi, Professor and Associate
Chairperson, School of Computing, SRM Institute of Science and Technology, for their
invaluable support. We are incredibly grateful to our Head of the Department, Dr. R.
Annie Uthra, Professor, Department of Computational Intelligence, SRM Institute of
Science and Technology, for her suggestions and encouragement at all the stages of the
project work.
We want to convey our thanks to our Project Coordinators, Panel Head, and Panel
Members, Department of Computational Intelligence, SRM Institute of Science and
Technology, for their inputs during the project reviews and support. We register our
immeasurable thanks to our Faculty Advisor Dr. S. Amudha, Department of
Computational Intelligence, SRM Institute of Science and Technology, for leading and
helping us to complete our course.
Our inexpressible respect and thanks go to our guide, Dr. M. Meenakshi, Department of
Computational Intelligence, SRM Institute of Science and Technology, for providing us
with an opportunity to pursue our project under her mentorship. She provided us with
the freedom and support to explore the research topics of our interest. Her passion for
solving problems and making a difference in the world has always been inspiring.
We sincerely thank all the staff and students of the Department of Computational
Intelligence, School of Computing, SRM Institute of Science and Technology, for their
help during our project. Finally, we would like to thank our parents, family members,
and friends for their unconditional love, constant support, and encouragement.
ABSTRACT
This project addresses the growing problem of digital media manipulation, in
particular the detection of manipulated media: synthetically modified videos and
photographs that undermine trust in digital content. Such media can be used to
spread fake news, impersonate public figures, or commit online fraud, creating a
pressing need for reliable detection mechanisms.

The proposed work applies state-of-the-art deep learning algorithms to detect
manipulated media with high accuracy. We evaluated the architectures
InceptionResNetV2, VGG19, a baseline CNN, Xception, and NASNetMobile on a
Kaggle-based manipulated media dataset. A hybrid ensembling approach was then
developed, combining Xception and NASNetMobile to exploit the strengths of both
models.

The ensemble model outperformed every individual architecture, achieving an
accuracy of 97.02% along with high precision, recall, and F1-score. These results
demonstrate the effectiveness of hybrid deep learning strategies for detecting forged
digital content, and they highlight both a promising path for hybrid detection
techniques and the need for continued innovation in detection methods.

v
TABLE OF CONTENTS
Chapter No. Title Page No.
ABSTRACT v
TABLE OF CONTENTS vi
LIST OF FIGURES viii
LIST OF TABLES ix
ABBREVIATIONS x
1 INTRODUCTION 1
1.1 Introduction to Manipulated Media Detection 1
1.2 Motivation 2
1.3 Sustainable Development Goal of the Project 2
2 LITERATURE SURVEY 4
2.1 Overview of the research area 4
2.2 Existing Models and Frameworks 4
2.3 Limitations of Existing Systems 5
2.4 Research Objectives 6
2.5 Product Backlog 6
2.6 Plan of Action 7
3 SYSTEM ARCHITECTURE AND DESIGN 9
4 SPRINT PLANNING AND METHODOLOGY 12
4.1 SPRINT I 12
4.2 SPRINT II 14
4.3 SPRINT III 16
4.4 Methodology 17
5 RESULTS AND DISCUSSION 19
5.1 Distribution of Data 19
5.2 Evaluation Metrics of Each Model 20
5.2.1 Evaluation using Xception for manipulated media detection 20
5.2.2 Evaluation using InceptionResnetV2 for manipulated media detection 22
5.2.3 Evaluation using VGG19 for manipulated media detection 24
5.2.4 Evaluation using NasNetMobile for manipulated media detection 26
5.2.5 Evaluation using Proposed Hybrid Ensembling Model (Xception+NasNetMobile) for manipulated media detection 28

vi
5.3 Comparative Analysis of Each Model 30
6 CONCLUSION AND FUTURE ENHANCEMENTS 31
REFERENCES 32
APPENDIX
A CODING AND IMPLEMENTATION 33
B RESEARCH PAPER 53
C PAPER SUBMISSION FORM 54
D PLAGIARISM REPORT 55

vii
LIST OF FIGURES
Figure No. Title Page No.
3.1 Proposed System Architecture 9
3.2 Dataflow Diagram 10
5.1 Data Distribution 19
5.2 Evaluation Graphs for Xception Model 20
5.3 Confusion Matrix for Xception Model 21
5.4 Evaluation Graphs for InceptionResnetV2 Model 22
5.5 Confusion Matrix for InceptionResnetV2 Model 23
5.6 Evaluation Graphs for VGG19 Model 24
5.7 Confusion Matrix for VGG19 Model 25
5.8 Evaluation Graphs for NasNetMobile Model 26
5.9 Confusion Matrix for NasNetMobile Model 27
5.10 Evaluation Graphs for Proposed Hybrid Ensembling Model 28
5.11 Confusion Matrix for Proposed Hybrid Ensembling Model 29

viii
LIST OF TABLES

Table No. Title Page No.


2.1 Product Backlog 6
5.1 Comparative analysis for each Model 30

ix
ABBREVIATIONS
Abbreviation Full Form

CNN Convolutional Neural Network

DL Deep Learning

DNN Deep Neural Network

MAE Mean Absolute Error

MSE Mean Squared Error

BCE Binary Cross-Entropy

CE Categorical Cross-Entropy

FPR False Positive Rate

FNR False Negative Rate

TPR True Positive Rate (Sensitivity/Recall)

TNR True Negative Rate (Specificity)

ReLU Rectified Linear Unit

TL Transfer Learning

NAS Neural Architecture Search

VGG Visual Geometry Group

TF TensorFlow

DFD Deepfake Detection

DFDNet Deepfake Detection Network

DFD Data Flow Diagram

UI User Interface

FPS Frames Per Second

x
CHAPTER 1

INTRODUCTION

1.1 Introduction to Manipulated Media Detection


The tools with which we create, edit, and share media content have evolved rapidly, making it
easy to fabricate convincing visual and auditory material in today's digital world. Alongside
the benefits of these developments, however, has come the rise of manipulated videos and
photographs. Manipulated media is generally understood as visual content produced or altered
artificially through advanced technologies, often for the purpose of misleading, deceiving, or
enabling malicious action. While some of this content is harmless entertainment, its misuse for
disinformation, impersonation, and fraud has become a critical challenge.

This project, Digital Media Manipulation Detection Using a Hybrid Ensembling Model,
focuses on the detection of manipulated videos and images. The objective is to build an
intelligent system able to classify media as genuine or manipulated. Using deep learning
techniques, we train models to pick out the minute changes and concealed patterns that remain
beyond the reach of the human eye.

The project implemented and tested several deep learning architectures: InceptionResNetV2,
VGG19, CNN, Xception, and NASNetMobile. Single models, however, often struggled to
handle the complex variations inherent in manipulated visual content, so a hybrid ensemble of
Xception and NASNetMobile was proposed. The ensemble performed strongly, achieving an
accuracy of 97.02%.

With powerful editing tools and AI manipulation techniques now widely available, it has
become imperative to verify the authenticity of videos and images. This project seeks to help
safeguard digital spaces by providing a trustworthy and efficient means of detecting
manipulated visual content, thereby fostering trust and safety across platforms.

1
1.2 Motivation
The motivation for this project springs from the ever-growing danger that manipulated videos
and images pose to individuals, communities, and institutions. In a world where visual content
helps shape opinions and dictate decisions, control over what people see confers great influence
over how they construct their realities. The effects can be far-reaching: a fake political speech,
a non-existent news event, or a social media rumour built on faked imagery can mislead
millions and cause havoc.

This scenario made developing a system that could detect manipulated videos and pictures feel
both worthwhile and urgent. Existing detection techniques kept pace with fairly simple
manipulations, but as manipulation methods grew more sophisticated, typical detection
approaches were becoming obsolete. Our aim was to design a deep-learning-based solution, a
hybrid ensemble combining multiple detection models, capable of catching even tiny traces of
manipulation. We believe technology should serve the truth, rebuild trust in digital media, and
protect people from the perils of misinformation and visual deception.

This project is grounded in the view that technology should solve problems rather than create
them. Manipulated media originates in technological advances, and technology can likewise
provide the response. By building a better detection system, we can help restore trust online
and ensure that media remains a force for good rather than a tool for deception.

1.3 Sustainable Development Goal of the Project


This project aligns closely with United Nations Sustainable Development Goal 16: "Peace,
Justice, and Strong Institutions." One of the fundamental targets of this goal is to prevent all
forms of violence and to promote the rule of law at national and international levels.
Manipulated media, left unchecked, can provoke violence, spread disinformation, tarnish
reputations, and ultimately undermine trust in institutions and public figures. By establishing
a credible system for detecting and flagging manipulated media, this project works towards
more transparent information systems and, ultimately, the protection of communities from the

2
detrimental consequences of false narratives. Trustworthy digital media is indispensable for
social peace, public safety, and informed choice, which are all key pillars of strong institutions.

This project also supports Goal 9, "Industry, Innovation and Infrastructure", by encouraging
the application of advanced technology for the betterment of society. Through its hybrid
ensembling approach built on deep learning models, the project shows how innovation can
address modern problems of cybersecurity and digital ethics. By adding strong protection for
digital infrastructures against abusive manipulation, it also encourages responsible innovation
while protecting the industries most dependent on authentic media: journalism, legal systems,
education, and entertainment. With digital communication at the center of almost every sector
today, mechanisms that preserve the integrity of media are not merely a technological
achievement but a step towards sustainable, resilient digital infrastructure.

3
CHAPTER 2

LITERATURE SURVEY

2.1 Overview of the Research Area


Advances in the tools used to create digital content have made the manipulation of videos and
images simple and effective. These manipulated media artifacts threaten the authenticity and
credibility of information shared across many platforms. Detection is critical, since human
observation can fail to spot some kinds of manipulation. Research in this area has generally
been geared towards developing datasets, proposing deep learning-based detection
frameworks, or finding inherent inconsistencies in manipulated media. This project focuses
specifically on detecting manipulated videos and images through a hybrid deep learning
ensembling approach that targets improved accuracy with realistic applicability.

2.2 Existing Models and Frameworks


B. Zi et al. (2020), in "WildDeepfake: A Challenging Real-World Dataset for Deepfake
Detection", gave the research community an important landmark with their release of the
WildDeepfake dataset. Earlier datasets had few or no samples of deepfakes created outside
very controlled environments. This dataset was built for real-world situations: it contains
deepfakes collected from online sources, where compression artifacts, low resolutions, and
natural variations make detection much more difficult. It posed a serious challenge to existing
detection models, providing evidence that models trained on synthetic datasets perform poorly
in real-life scenarios. The paper placed special focus on training and testing detection systems
with varied, naturally occurring manipulated media, and compelled the research community to
develop more generalized and resilient detection techniques.

Building further on the theme of robustness, T. Zhao et al. (2021), in "Learning Self-
Consistency for Deepfake Detection", proposed a new mechanism. Their method rests on the
observation that many manipulated videos contain internal inconsistencies which, however
minute, a well-designed model can learn to detect. Rather than relying only on manipulation
artifacts or simple manipulation traces, their models were trained to spot logical
inconsistencies within frames and across sequences. Exploiting internal coherence made the
detection scheme robust to realistic manipulations that foil traditional artifact-based detectors.
The work thus opened the way for intelligent systems that probe deeper than surface heuristics,
and it inspired this project's use of ensemble approaches to capture inconsistencies in
manipulated media.

In another major contribution to the area, S. Fernandes et al. (2020) introduced an attribution-
based confidence metric for the detection of manipulated videos. Their approach is not limited
to binary classification (real or fake) but also measures how confident the model is about each
decision. By assessing the attribution maps, the zones within the image that had the greatest
influence on the model's output, they can flag cases where the model is not confident in its
decision, thereby reducing false positives and negatives. This approach is valuable in settings
where manipulated media detection must be both highly reliable and explainable, which aligns
with this project's motivation for adopting a hybrid ensembling technique aimed not only at
high accuracy but also at greater stability and trustworthiness of the model.

2.3 Limitations of Existing Systems


1. Existing systems rely largely on conventional heuristic rules, which have limited
ability to analyze complicated patterns and may not reliably detect advanced
deepfakes produced with modern deep learning techniques.

2. Simple deep learning models such as CNNs, RNNs, and LSTMs lack the flexibility
and robustness needed for complex deepfake alterations, which limits detection
performance.

3. Existing systems do not use ensemble models that combine several algorithms to
improve accuracy and robustness in deepfake detection, resulting in overall
underperformance compared with advanced techniques.

4. Classic detection methods leave existing systems relatively static against emerging
manipulation trends, requiring timely revision and rework to keep pace with new
techniques.

5

2.4 Research Objectives

Based on the review of existing literature and the gaps identified, the following research
objectives have been formulated:

1. To develop a hybrid ensembling model combining the Xception and NASNetMobile
architectures for improved detection of manipulated videos and images.

2. To achieve higher detection accuracy, with sensitivity to subtle manipulation
artifacts and inconsistencies.

3. To make the proposed model resilient across real-world conditions such as low
resolution, compression artifacts, and environmental noise.

4. To build a system whose outputs are interpretable, with reduced false
positives/negatives, promoting confidence in model predictions.

2.5 Product Backlog

US1: As a user, I want to collect and preprocess a dataset of videos and images so that
the model has high-quality input for training.
Desired outcome: An orderly, clean, and high-grade dataset prepared for model
training, improving the precision and dependability of the model.

US2: As a user, I want to train separate models for images and videos to detect
manipulated media so that individual predictions are accurate.
Desired outcome: Dedicated models for detecting manipulation in videos and in
images individually, enhancing overall performance.

US3: As a user, I want to combine predictions from both models into a hybrid
ensemble so that overall detection accuracy improves.
Desired outcome: A hybrid ensemble of the image and video models that combines
their strengths to raise detection accuracy.

6
US4: As a user, I want to create signup and login pages so that users can securely
access the application.
Desired outcome: A trusted access-control mechanism that allows users to register
accounts and log in to the application safely.

US5: As a user, I want to build a database for user authentication so that users'
credentials are securely stored.
Desired outcome: A secure, encrypted back-end database for storing user credentials,
preserving the privacy of the data.

US6: As a user, I want to create an upload page for users to upload videos or images
so that the application can analyze their authenticity.
Desired outcome: An intuitive upload interface that lets users submit material for
analysis and receive concise results on manipulation detection.

US7: As a user, I want to test the application so that it is free of bugs and meets
performance standards.
Desired outcome: A thoroughly tested, dependable application, free of bugs and
performing reliably for users.

Table 2.1 Product Backlog for the Project

2.6 PLAN OF ACTION (Road Map of Project)

The successful implementation of a manipulated media detection system for videos and
images requires a phased, stepwise approach. This action plan details the imperative phases
of the project: conceptualization, dataset preparation, model building, system design, testing,
and final deployment. Each step is planned so that methodological solidity, technical quality,
and practical usability together yield a scalable, precise, user-friendly manipulated media
detection system.

2.6.1 Project Initiation and Data Gathering (Weeks 1-3)

This first phase lays the groundwork by defining the objectives and scope of the project.
It starts with collecting a relevant Kaggle dataset containing original and manipulated
videos. Each video is converted to frames, and only the facial regions are extracted for
model training. The data is carefully annotated and preprocessed to provide quality
input to the model.

2.6.2 Model Identification and Development (Weeks 4-6)

7
In this phase the model architectures are chosen and training begins. Convolutional
neural networks, and an ensemble of upstream models distinguishing manipulated
from original media, are deployed. Hyperparameters such as the learning rate, batch
size, and number of epochs are optimized for the best detection accuracy.
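The role of these three hyperparameters can be seen in a toy SGD loop. This is an illustrative sketch in plain Python, not the project's actual Keras training code; the function name `train_toy` and the one-weight model are ours.

```python
import random

def train_toy(data, lr=0.01, batch_size=32, epochs=50):
    """Toy SGD loop showing the three tuned hyperparameters:
    learning rate (lr), batch size, and number of epochs.
    Fits a single weight w in the model y = w * x by minimising MSE."""
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)                  # new batch order each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # gradient of mean squared error with respect to w
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                    # learning-rate-scaled update
    return w

# With noiseless data y = 3x, the loop should recover w close to 3.
data = [(x, 3 * x) for x in range(1, 11)]
w = train_toy(list(data))
```

Too large a learning rate makes the update overshoot and diverge; too few epochs leaves w short of the optimum, which is the same trade-off tuned in the real models.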

2.6.3 System Designing and Architecture Implementation (Weeks 7-8)

After the model is developed, the system is built to incorporate it into a working
pipeline. It includes a front end for uploading videos or images and back-end services
to perform the analysis. The design emphasizes modularity, scalability, and real-time
performance.

2.6.4 Testing and Result Validation (Weeks 9-10)

Finally, the models are tested on unseen data to validate their performance. The
evaluation metrics include accuracy, precision, recall, and F1-score. Test conditions
also ensure that the models are robust across varied types of videos and images.
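All four metrics follow directly from the confusion-matrix counts. A minimal sketch (the function name and example counts are ours, not the project's actual results):

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Derive the reported evaluation metrics from confusion-matrix counts:
    tp/fp/fn/tn = true/false positives and true/false negatives."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)    # of predicted fakes, how many are fake
    recall = tp / (tp + fn)       # of actual fakes, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts only (hypothetical, not measured values)
m = metrics_from_confusion(tp=90, fp=5, fn=5, tn=100)
```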

2.6.5 Documentation and Future Recommendations (Weeks 11-12)

The project is extensively documented, from dataset preparation through model
configurations, system architecture, and test results. Moving forward, the
recommendations are to expand the dataset, include temporal analysis of videos, and
improve the generalization of the models against more sophisticated manipulation
techniques.

8
CHAPTER 3

SYSTEM ARCHITECTURE AND DESIGN

Fig 3.1. Proposed System Architecture

The architecture governing the Manipulated Media Detection platform is a modular, orderly
flow that starts with data preprocessing. The input is a dataset of real and fake videos. During
preprocessing, each video is split into frames, faces are detected and cropped, and only the
facial regions are saved for further processing. This yields a processed dataset of face images,
free of background noise and irrelevant frames. The dataset is then split into training and
testing subsets using a controlled data-splitting procedure for a clean evaluation setup. The
training subset is fed to a hybrid ensembling model that channels the strengths of the Xception
and NASNetMobile architectures for high-performance manipulation detection.

9
Model performance is evaluated from the confusion matrix, which yields accuracy, precision,
recall, and related metrics. Once trained, the hybrid model is saved for future predictions. In
practice, when a user uploads a new video, the same preprocessing steps are executed within
the system: splitting into frames, face detection, and cropping, with the extracted face images
passed to the pre-trained ensemble model. This path, indicated in the architecture diagram,
classifies the uploaded media as 'Real' or 'Fake'. The modular setup also allows easy
retraining and scaling, with prediction speeds fast enough for real-time manipulated media
detection.
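One simple way to fuse the two backbones at prediction time is to average their per-frame 'fake' probabilities. The report does not specify the exact fusion rule, so the weighted average below is an assumption, and the function names are ours:

```python
def ensemble_fake_probability(p_xception, p_nasnet, w_xception=0.5):
    """Weighted average of the 'fake' probability predicted by the
    Xception branch and the NASNetMobile branch (weights sum to 1)."""
    return w_xception * p_xception + (1.0 - w_xception) * p_nasnet

def classify(p_fake, threshold=0.5):
    """Map the fused probability to the final 'Real' / 'Fake' label."""
    return "Fake" if p_fake >= threshold else "Real"

# Example: both branches lean towards 'fake', so the fused label is 'Fake'
label = classify(ensemble_fake_probability(0.9, 0.7))
```

An equal weighting of 0.5 treats both backbones as equally reliable; the weight could instead be tuned on a validation split.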

Fig 3.2 Data Flow Diagram

The data flow diagram depicts the entire pipeline of the Manipulated Media Detection project,
from basic setup to final prediction.

10

It starts with importing the required libraries. Once the environment is set up, the next step
verifies that all libraries and prerequisites are properly loaded; if they are not, the process
terminates with the message "NO PROCESS". After validation, the system proceeds to
dataset ingestion and image processing, including frame extraction, cropping of face images,
and normalization. Data visualization follows preprocessing, mapping the distribution and
quality of the dataset and confirming that the data is in a state appropriate for model building.

The next phase is model construction, which involves implementing multiple architectures:
InceptionResNetV2, VGG19, CNN, Xception, and NASNetMobile, as well as an Ensemble
(Xception + NASNetMobile) model. Once constructed, these models are trained on the
preprocessed dataset. In the later part of the system, a web-based interface is introduced
through which users can register and log in to access the service. Users upload a video or
picture to be processed by the model, which returns a final prediction on whether the medium
is fake or real. Finally, the output is presented to the user, completing the entire cycle of
media manipulation detection.

11
CHAPTER 4

SPRINT PLANNING AND METHODOLOGY

4.1 SPRINT I

4.1.1 Objectives with User Stories of Sprint I

The Sprint I objective was to set up a robust and clean dataset pipeline for training models
used in manipulated media detection. This sprint particularly targeted the collection of
deepfake videos, frame extraction, dataset cleaning, and preprocessing methods such as
resizing, face cropping, and normalization. The intent was to provide high-quality, properly
labeled datasets for training the models to establish deep features that can discriminate
between real and fake media.

The user stories underline the need for:


1. Smooth extraction of frames from videos.
2. Accurate classification into real and fake categories.
3. Standardized image dimensions and normalization compatible with models.
4. A good division of the dataset into training and test sets (80:20).

4.1.2 Functional Documents


The functional document for Sprint I detailed the tools and workflows used for dataset
preparation:
• Data Source: Deepfake datasets such as the Deepfake Detection Challenge and
Manipulated and Original Sequences
• Frame Extraction: OpenCV, sampling 30 random frames from each video
• Preprocessing:
1. Resizing images to 128×128 pixels
2. Cropping around the face regions to focus on important features
3. Normalizing pixel values to the range 0–1
• Dataset Organization:
1. /data/train/real/, /data/train/fake/

12
2. /data/test/real/, /data/test/fake/
• Splitting: 80% training, 20% testing split
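The sampling, splitting, and normalization steps above can be sketched in plain Python. This is an illustrative sketch: the actual pipeline uses OpenCV and DeepFace for frame extraction and face cropping, and the helper names here are our own.

```python
import random

def sample_frame_indices(total_frames, n_samples=30, seed=None):
    """Pick which frames to extract from a video; cv2.VideoCapture
    would then seek to each returned index."""
    rng = random.Random(seed)
    n = min(n_samples, total_frames)
    return sorted(rng.sample(range(total_frames), n))

def split_dataset(paths, train_ratio=0.8, seed=42):
    """Shuffle the image paths and split them 80:20 into train/test lists."""
    rng = random.Random(seed)
    shuffled = list(paths)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def normalize_pixel(value):
    """Rescale an 8-bit pixel value into the 0-1 range."""
    return value / 255.0
```

Fixing the shuffle seed keeps the train/test split reproducible across runs, which matters for fair comparison between the five architectures.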

4.1.3 Architecture Document


The architecture document of Sprint I is centered around the Data Preparation Pipeline:
Input: Video (real/fake)
1. Frame Extraction (random selection)
2. Face Detection & Cropping (using DeepFace)
3. Image Resizing (128×128)
4. Normalization (rescale pixel values)
5. Organization of Train/Test Folders
The architecture diagrams illustrate:
1. The flow from raw videos to processed frames
2. The directory structure for the training pipelines
3. The input-output requirements of the deep learning models

4.1.4 Outcome of Objectives


At the end of Sprint I:
1. A well-organized, clean, and balanced image dataset was ready.
2. Frames were categorized and preprocessed for training compatibility.
3. Class balance and feature quality were confirmed through an initial exploratory data
analysis (EDA).
4. A scalable data pipeline was developed for future additions or refinements.
This phase laid the foundation for effective and efficient model training in Sprint II.

4.1.5 Sprint Retrospective


The retrospective for Sprint I underscored its successes:
1. A seamless pipeline for video-to-frame conversion and preprocessing
2. Uniformity in image formats and directory structures
The challenges encountered were occasional failures in face detection and missing frames.
Solutions included fallback logic for non-detected frames and post-processing quality
checks. Overall, Sprint I provided a solid dataset for deep learning model experiments.
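The fallback logic for non-detected faces can be sketched like this. The `detect_face` argument is a placeholder for whatever detector is used (the report names DeepFace); it is assumed to return a bounding box or None.

```python
import numpy as np

def crop_face_or_fallback(frame, detect_face):
    """Crop to the detected face region; if the detector finds nothing,
    fall back to the full frame so the sample is not lost."""
    box = detect_face(frame)          # expected: (x, y, w, h) or None
    if box is None:
        return frame                  # fallback for non-detected frames
    x, y, w, h = box
    return frame[y:y + h, x:x + w]
```

The fallback keeps the pipeline robust: a frame where detection fails is still resized and normalized downstream rather than being dropped.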

4.2 SPRINT II

4.2.1 Objectives with User Stories of Sprint II


The highlight of Sprint II was training different deep learning models to classify frames
as real or fake.
The key user stories included:
1. Build models that distinguish manipulated media accurately.
2. Achieve high generalization with low overfitting.
3. Experiment with different architectures to optimize performance.
Thus, the sprint emphasized training separate models first (InceptionResNetV2,
Xception, NASNetMobile, VGG19) before combining them into an ensemble model.

4.2.2 Functional Document


This document concerned itself with:
Architectures of the Models:
1. Transfer learning with InceptionResNetV2, Xception, NASNetMobile, and
VGG19.
2. Ensemble model: Averaging the predictions of Xception and NASNetMobile.
Loss Function: Categorical Cross-Entropy.
Optimizer: Adam Optimizer.
Metrics: Accuracy, F1 Score, Precision, Recall, Specificity, Sensitivity, MAE, MSE.
Improvements on Training:
1. Early stopping to prevent overfitting.
2. A data generator for memory-efficient data loading.
Hyperparameters:
1. Batch size: 32 (optimally tuned for performance).
2. Image height and width at input: 128×128.
3. Number of epochs: Maximum 50 (with early stopping after patience=3).
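The early-stopping rule (patience = 3 on validation loss, capped at 50 epochs) can be illustrated with a small stand-alone sketch; this mimics the behaviour of Keras's `EarlyStopping` callback rather than reproducing the report's training code.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training would halt: stop once the
    validation loss has failed to improve for `patience` epochs."""
    best, wait = float('inf'), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch          # halt here
    return len(val_losses)            # ran through all epochs
```

With the report's settings this bounds training at 50 epochs while cutting it short once the model stops improving.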

4.2.3 Architecture Document


The architecture document included the following:
Base models:
Pretrained on ImageNet, modified with a flatten layer and a dense softmax classification
head.
Ensemble Model:
1. Averaging the outputs of the Xception and NASNetMobile models.
2. Final decisions are made by majority voting or by averaging prediction
probabilities.
Training pipeline:
1. Data loaded from different directories
2. Rescaled
3. Passed through the models in sequential order
Evaluation Metrics:
Precision, recall, F1 score, sensitivity, specificity, etc. were evaluated through their
detailed formulae.
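Those formulae reduce to simple expressions over the confusion-matrix counts; a compact sketch (function and key names are ours):

```python
def metrics_from_confusion(tp, tn, fp, fn):
    """Standard classification metrics from confusion-matrix counts."""
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)              # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    f1          = 2 * precision * recall / (precision + recall)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall,
            'specificity': specificity, 'f1': f1}
```

For example, 45 true positives, 45 true negatives, and 5 of each error type give 0.9 for every metric.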
Block diagrams illustrated the flow from data input → model training → evaluation.

4.2.4 Outcome of Objectives


At the end of Sprint II:
1. All four models achieved an acceptable level of validation accuracy: over 92–95%
on the validation set.
2. The ensemble model reduced bias and variance, thereby outperforming its
individual models.
3. The class-wise confusion matrices showed excellent generalization.
4. Early stopping avoided unnecessary training time while guaranteeing high accuracy.

4.2.5 Sprint Retrospective


Sprint II delivered high-performance classification models. Challenges that had to be
dealt with:
1. Long training times, especially with heavy models such as InceptionResNetV2
(efficient data loading and callbacks were used to improve GPU/TPU utilization)
2. Imbalance in predictions, which was solved through weighted averaging
3. The experiments fed directly into the final deployment preparations.

4.3 SPRINT III

4.3.1 Objectives with User Stories of Sprint III


Sprint III involved deploying the best-performing model behind an intuitive web
interface for real-life usage. The user stories were:
1. Allow users to upload videos or images.
2. Automatically detect whether the media is fake or real.
3. For video files, predict frame by frame, aggregate the results, and give a final
verdict.
4. Display prediction confidence clearly.
The top priority in this sprint was delivering an easy, fast, and consumer-friendly user
experience.

4.3.2 Functional Document


The functional document outlined:
Frontend:
1. HTML/CSS for file upload UI
2. Progress bar during upload and processing
Backend:
1. Flask framework
2. Load trained model weights at server start
3. Accept video or image
4. If video: Sample frames at regular intervals, predict each frame, majority voting
to decide
5. Display prediction ("Real" or "Fake") along with confidence percentage
Libraries: OpenCV, TensorFlow/Keras, Flask, NumPy
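The per-frame aggregation in step 4 can be sketched as a small helper (our naming; the Flask backend in Appendix A follows the same majority-vote idea):

```python
from collections import Counter

def aggregate_verdict(frame_labels):
    """Majority vote over per-frame labels ('Real'/'Fake');
    returns the winning label and its vote share as a confidence score."""
    counts = Counter(frame_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(frame_labels)
```

For example, `aggregate_verdict(['Fake', 'Fake', 'Real'])` yields the label `'Fake'` with a confidence of about 0.67, which is what the result page displays as a percentage.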

4.3.3 Architecture Document

The architecture document covered:


1. Client uploads media via Web Form
2. Flask API receives media
3. If video: Extract frames using OpenCV
4. Each frame/image → Resized → Normalized

5. Pass to trained model → Collect predictions
6. If video: Majority voting for final decision
7. Result returned on the frontend with visualization (confidence bars)

Diagrams explained input, processing, and output clearly.

4.3.4 Outcome of Objectives


The major achievements of Sprint III include:
1. Creation of a robust, functional web application with fast uploads and real-time
predictions.
2. High user satisfaction (fast inference and an intuitive interface).
3. Frame-level prediction aggregation for better video-classification stability.
4. Testing confirmed that the model predictions closely aligned with known fake and
real samples.

4.3.5 Sprint Retrospective


Several key achievements were noted in the retrospective of Sprint III:
1. An effective deployment with minimal backend infrastructure requirements
2. A user-oriented design that keeps the app simple while retaining full functionality
Issues faced:
1. Large video files delayed processing time (solved by sampling frames)
2. Memory management had to be optimized for frame extraction
Learning: the focus must remain on data-handling speed and user feedback.

4.4 METHODOLOGY

4.4.1 InceptionResnetV2

InceptionResNetV2 combines the depth of Inception modules with the efficiency of
residual connections, maintaining the delicate trade-off between speed and accuracy. It
enabled deeper feature extraction, capturing the fine textures that differentiate real from
fake faces. Multiple parallel convolutional paths efficiently analysed manipulation at
different scales. Transfer learning allowed fast convergence on the manipulated media
classification task by utilizing pre-trained weights.

4.4.2 Xception

The Xception architecture uses depthwise separable convolutions, making it very
parameter-efficient without loss of accuracy. It was effective at revealing subtle deepfake
artifacts by modelling spatial and cross-channel correlations independently.
The lightweight nature of Xception facilitated faster training and high accuracy, which was
a necessary requirement for ensemble learning.
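The parameter savings can be made concrete with a quick count (ignoring biases): for a k×k kernel over C_in input and C_out output channels, a standard convolution learns one full k×k×C_in filter per output channel, while a depthwise separable convolution learns one k×k filter per input channel plus a 1×1 pointwise mixing step.

```python
def standard_conv_params(k, c_in, c_out):
    # one k x k x c_in filter per output channel
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # depthwise: one k x k filter per input channel,
    # pointwise: a 1x1 convolution mixing the channels
    return k * k * c_in + c_in * c_out
```

For k=3, C_in=64, C_out=128 the standard convolution needs 73,728 weights against 8,768 for the separable one, roughly an 8x reduction per layer.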

4.4.3 NasNetMobile

Neural Architecture Search (NAS) was used to evolve NASNetMobile, optimizing its
architecture for accuracy and speed. Its modular building blocks allowed enhanced
feature extraction, which proved crucial for detecting imperfections in fake images.
NASNetMobile thus provided a lightweight yet powerful solution, offering mobile
deployment capability alongside competitive performance in deepfake detection.

4.4.4 VGG19

The depth and simplicity of VGG19 made it a reliable baseline for manipulated media
detection. Its hierarchical feature learning, although computationally heavy, was beneficial
for noticing slight differences in facial textures and backgrounds associated with deepfakes.
Fine-tuning VGG19 on the preprocessed dataset set strong baseline benchmarks against
which the newer architectures were compared.

4.4.5 Ensemble Model (Xception+NasNetMobile)

The ensemble model combined the outputs of Xception and NASNetMobile to maximize
their complementary strengths. The cooperation between Xception's sensitivity to local
features and NASNetMobile's effectiveness on global patterns ensured more robust
predictions. The multi-model approach reduces variance and increases generalization,
making the ensemble more applicable to real-life fake media detection, where
inconsistencies can vary widely.
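Averaging the two models' class probabilities and taking the argmax amounts to the following minimal NumPy sketch; the probability vectors would come from the trained Xception and NASNetMobile models, and the label tuple is an assumption matching the report's two classes.

```python
import numpy as np

CLASS_LABELS = ('Fake', 'Real')

def ensemble_predict(probs_xception, probs_nasnet, labels=CLASS_LABELS):
    """Average the two models' softmax outputs, then pick the argmax."""
    avg = (np.asarray(probs_xception) + np.asarray(probs_nasnet)) / 2.0
    return labels[int(np.argmax(avg))], avg
```

For instance, with `[0.8, 0.2]` from Xception and `[0.6, 0.4]` from NASNetMobile, the averaged vector is `[0.7, 0.3]` and the verdict is 'Fake'. Averaging smooths out cases where one model is overconfident and wrong, which is where the variance reduction comes from.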

CHAPTER 5

RESULTS AND DISCUSSION

5.1 Distribution of the Data

Fig 5.1 Distribution of Data after processing the videos

After preprocessing and extracting 30 random frames from each video, the dataset was split
into two distinct classes, representing real and fake media. Of the total count of images, the
training set contained 16,415 images and the validation set 4,109. This distribution shows
the datasets are robustly sized, so models can learn the complex patterns that differentiate
real from manipulated media. A large training sample size aids model generalization, while
the validation set serves as a useful monitor of performance and helps mitigate overfitting.
Such careful frame extraction and dataset structuring create a balanced environment for
training deep learning models like InceptionResNetV2, VGG19, Xception, NASNetMobile,
and ensemble methods, bringing accuracy and robustness to fake media detection.

5.2 Evaluation Metrics of Each Model

5.2.1 Evaluation using Xception for manipulated media detection

Fig 5.2 Evaluation Graphs for Xception Model

Confusion Matrix for Xception

Fig 5.3 Confusion matrix for Xception Model

Xception is an advanced convolutional neural network architecture that extends the
Inception model by introducing depthwise separable convolutions. This structure
improves efficiency by reducing computational complexity without sacrificing
model performance. Xception's design is based on the hypothesis that depthwise
separable convolutions can achieve higher accuracy and faster training compared to
traditional convolutions. It is primarily used for image classification, object detection,
and video analysis, offering enhanced performance through more effective
feature extraction and parameter optimization.

5.2.2 Evaluation using InceptionResnetV2 for manipulated
media detection

Fig 5.4 Evaluation Graphs for InceptionResnetV2 Model

Confusion Matrix for InceptionResnetV2

Fig 5.5 Confusion Matrix for InceptionResnetV2

InceptionResNetV2 is a CNN architecture that combines the strengths of Inception
modules and residual connections to improve the efficiency of deep learning models. Its
hybrid structure incorporates both deep residual learning and multi-level feature
extraction, allowing it to effectively capture complex patterns in visual data. Its
purpose is to enhance classification performance, especially for tasks involving large,
high-dimensional datasets, by enabling faster convergence and more accurate
predictions in image recognition and manipulation detection tasks.

5.2.3 Evaluation using VGG19 for manipulated media
detection

Fig 5.6 Evaluation Graphs for VGG19 Model

Confusion Matrix for VGG19

Fig 5.7 Confusion Matrix for VGG19

VGG19 is a deep CNN architecture well known for its simplicity and effectiveness in
image classification tasks. It consists of 19 layers, including convolutional, pooling,
and fully connected (FC) layers, making it capable of extracting hierarchical features
from input images. Its usage focuses on high-level image classification tasks, where
the network is trained to recognize various objects and patterns. The purpose of VGG19
is to offer a highly interpretable and scalable model for feature extraction in visual
recognition tasks.

5.2.4 Evaluation using NasNetMobile for manipulated media
detection

Fig 5.8 Evaluation graphs for NasNetMobile Model

Confusion Matrix for NasNetMobile

Fig 5.9 Confusion Matrix for NasNetMobile

NasNetMobile is a mobile-optimized deep learning architecture developed using neural
architecture search (NAS) to identify the most efficient network configuration for mobile
devices. It is designed to balance model performance and computational efficiency,
making it ideal for resource-constrained environments. The purpose of NasNetMobile is
to provide accurate image classification while minimizing the model's size and
computational requirements, ensuring fast inference times. Its usage spans mobile
applications and embedded systems where efficient, real-time deep learning predictions
are necessary, such as in visual recognition tasks.

5.2.5 Evaluation using Proposed Hybrid Ensembling Model
(Xception+NasNetMobile)

Fig 5.10 Evaluation Graphs for Proposed Hybrid Ensembling Model (Xception+NasNetMobile)

Confusion Matrix for the Proposed Hybrid Model

Fig 5.11 Confusion matrix for the proposed model (Xception+NasNetMobile)

The ensemble model combining Xception and NasNetMobile leverages the strengths of
both architectures to boost prediction accuracy and model robustness. By combining the
outputs of these models, the ensemble approach reduces the chance of overfitting and
enhances generalization across different datasets. This method is used for tasks requiring
high accuracy, such as deepfake detection, by exploiting the complementary features
learned by both models. The purpose of the ensemble is to capitalize on the diversity of
Xception's feature extraction and NasNetMobile's mobile efficiency, achieving optimal
performance.

5.3 Comparative Analysis of Each Model

Model                       Accuracy  Precision  Recall  F1 Score  MSE     MAE
Xception                    93.55     93.55      93.55   93.55     0.0465  0.0927
InceptionResNetV2           95.35     95.35      95.35   95.35     0.0339  0.0689
NASNetMobile                95.32     95.32      95.32   95.32     0.0346  0.0701
VGG19                       91.75     91.75      91.75   91.75     0.0604  0.1206
Proposed Ensembling Model   97.02     97.02      97.02   97.02     0.0239  0.0534

Table 5.1 Comparison of each model's performance on the required task

The ensemble model obtained the highest accuracy at 97.02%, along with the best
F1-score and the lowest errors (MSE and MAE). Both InceptionResNetV2 (95.35%) and
NASNetMobile (95.32%) performed better than Xception (93.55%) and VGG19 (91.75%),
but none came close to the performance of the ensembling model. VGG19 performed worst,
with an accuracy of 91.75% and the highest error rates (MSE = 0.0604 and MAE = 0.1206),
which suggests that such older architectures may not be sufficiently reliable for media
manipulation detection. The hybrid ensemble approach successfully combined multiple
architectures, achieving better classification through their combined strengths.

CHAPTER 6

CONCLUSION AND FUTURE ENHANCEMENTS

In conclusion, the deep fake detection system developed in this study demonstrates
significant promise in identifying manipulated media using advanced deep learning algorithms.
Among the various models tested, the ensemble approach combining Xception and
NasNetMobile emerged as the highest-performing algorithm, achieving a remarkable accuracy
of 97.02%, with corresponding precision, recall, and F1-score values that further highlight its
efficacy. The ensemble method's superior performance can be attributed to its ability to harness
the strengths of both models, resulting in enhanced generalization and robust detection
capabilities. This high accuracy suggests that the proposed system is well-suited to address the
growing concerns surrounding deep fake content and can be deployed in real-world scenarios
to safeguard against the risks posed by digital manipulations. By focusing on optimizing
detection methods, the research offers a practical solution to mitigate the harmful impact of
deep fakes in various sectors, including media, politics, and cybersecurity.

Future work will focus on enhancing the system's ability to identify different forms
of digital media manipulation across diverse datasets, e.g., images and videos of varying
resolutions, lighting, and intricate backgrounds. This will involve fine-tuning the hybrid
ensembling approach through model selection optimization, weight adjustment, and feature
fusion techniques to attain optimal detection accuracy. Additionally, exploration of more
advanced data augmentation and adversarial training can enhance the model's robustness
against newly emerging manipulation techniques. Future work can also involve multi-modal
analysis (audio, textual, and visual input fusion) to further improve detection performance
on different types of media. Periodic benchmarking against newly emerging manipulation
techniques will keep the system effective in real-world applications.

REFERENCES
[1]. B. Zi, M. Chang, J. Chen, X. Ma, and Y.-G. Jiang, "WildDeepfake: A challenging real-
world dataset for deepfake detection," in Proceedings of the 28th ACM International
Conference on Multimedia, 2020, pp. 2382–2390.
[2]. T. Zhao, X. Xu, M. Xu, H. Ding, Y. Xiong, and W. Xia, "Learning self-consistency for
deepfake detection," in Proceedings of the IEEE/CVF International Conference on
Computer Vision, 2021, pp. 15023–15033.
[3]. S. Fernandes, S. Raj, R. Ewetz, J. S. Pannu, S. K. Jha, E. Ortiz, I. Vintila, and M. Salter,
"Detecting deepfake videos using attribution-based confidence metric," in Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops,
2020, pp. 308–309.
[4]. P. Yang, R. Ni, and Y. Zhao, "Recapture image forensics based on Laplacian
convolutional neural networks," in International Workshop on Digital Watermarking.
Springer, 2016, pp. 119–128.
[5]. B. Bayar and M. C. Stamm, "A deep learning approach to universal image manipulation
detection using a new convolutional layer," in Proceedings of the 4th ACM Workshop
on Information Hiding and Multimedia Security, 2016, pp. 5–10.
[6]. J. Luttrell, Z. Zhou, Y. Zhang, C. Zhang, P. Gong, B. Yang, and R. Li, "A deep transfer
learning approach to fine-tuning facial recognition models," in 2018 13th IEEE
Conference on Industrial Electronics and Applications (ICIEA). IEEE, 2018, pp. 2671–
2676.
[7]. S. Tariq, S. Lee, H. Kim, Y. Shin, and S. S. Woo, "Detecting both machine and human
created fake face images in the wild," in Proceedings of the 2nd International Workshop
on Multimedia Privacy and Security, 2018, pp. 81–87.
[8]. D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, "MesoNet: a compact facial video
forgery detection network," in 2018 IEEE International Workshop on Information
Forensics and Security (WIFS). IEEE, 2018, pp. 1–7.
[9]. Y. Li, M.-C. Chang, and S. Lyu, "In ictu oculi: Exposing AI created fake videos by
detecting eye blinking," in 2018 IEEE International Workshop on Information Forensics
and Security (WIFS). IEEE, 2018, pp. 1–7.
[10]. Y. Li and S. Lyu, "Exposing deepfake videos by detecting face warping
artifacts," arXiv preprint arXiv:1811.00656, 2018.

APPENDIX A

CODING AND IMPLEMENTATION

Importing Necessary Libraries

Function for converting videos to frames

Processing Manipulated Sequences

Processing Original Sequences

Creating Directories required

Preprocessing and Splitting the Dataset

Checking Data Distribution of each Class

Visualizing Train Data

Writing Functions for Evaluation Metrics

Training With Xception Model

Saving the trained Model

Plotting Metrics Graphs

Confusion Matrix

Training InceptionResnetV2

Saving the trained model

Evaluation Metric History

Training VGG19 Model

Save the trained model

Evaluation Metric History

Training NasNetMobile Model

Save the trained Model

Evaluation Metric History

Training the Hybrid Ensemble Model

Model Description

Evaluation Metric History

Creating a function to take user input and Detect the class

Function to detect if it is an image

Function to detect if it is a video

Sample Prediction I (video as input)

Sample Prediction II (video as input)

Sample Prediction III (image as input)

Creating a webpage for uploading the media

1. Home Page

2. Result Page

Style Sheet Used

Result Page After Predicting the Uploaded Media

Backend for the WebPage


from flask import Flask, render_template, request
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import img_to_array
import numpy as np
import cv2
import os
from werkzeug.utils import secure_filename

app = Flask(__name__)
# Use a raw string so backslashes in the Windows path are not treated as escapes
model = load_model(r"D:\Major Code\kaggle_output\EnsembleModel.h5",
                   compile=False)
class_labels = ['Fake', 'Real']
UPLOAD_FOLDER = 'uploads'
os.makedirs(UPLOAD_FOLDER, exist_ok=True)
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER

def preprocess_frame(frame, target_size=(128, 128)):
    # Resize, scale pixel values to [0, 1], and add a batch dimension
    frame = cv2.resize(frame, target_size)
    frame = frame.astype('float32') / 255.0
    frame = img_to_array(frame)
    frame = np.expand_dims(frame, axis=0)
    return frame

def predict_image(image_path):
    image = cv2.imread(image_path)
    processed_image = preprocess_frame(image)
    prediction = model.predict(processed_image)
    return class_labels[np.argmax(prediction)]

def predict_video(video_path, frame_step=10):
    # Sample frames at regular intervals (as described in Section 4.3.2)
    # and majority-vote the per-frame predictions
    cap = cv2.VideoCapture(video_path)
    results = []
    idx = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if idx % frame_step == 0:
            pred = model.predict(preprocess_frame(frame))
            results.append(class_labels[np.argmax(pred)])
        idx += 1
    cap.release()
    return max(set(results), key=results.count)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/upload', methods=['POST'])
def upload():
    if 'media' not in request.files:
        return "No file part"
    file = request.files['media']
    if file.filename == '':
        return "No selected file"

    filename = secure_filename(file.filename)
    filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
    file.save(filepath)

    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = predict_image(filepath)
    elif filename.lower().endswith(('.mp4', '.avi', '.mov')):
        result = predict_video(filepath)
    else:
        return "Unsupported file type"

    return render_template('result.html', result=result)

if __name__ == '__main__':
    app.run(debug=True)

APPENDIX B

RESEARCH PAPER

APPENDIX C

PAPER SUBMISSION FORM

APPENDIX D

PLAGIARISM REPORT

