Batch 18
ON
Deepfake Detection on Social Media: Leveraging Deep Learning and Fast Text
Embeddings for Identifying Machine-Generated Tweets
Submitted by
BACHELOR OF TECHNOLOGY
NOVEMBER-2024
MALLA REDDY ENGINEERING COLLEGE FOR WOMEN
(Autonomous Institution-UGC, Govt. of India)
Programmes Accredited by NBA
Accredited by NAAC with A+ Grade
Affiliated to JNTUH, Approved by AICTE, ISO 9001:2015 Certified Institute
Maisammaguda (V), Dhullapally (Post), (Via) Kompally, Medchal Malkajgiri Dist. T.S-500100
CERTIFICATE
This is to certify that the Innovative product Development-3 entitled “DEEPFAKE
DETECTION ON SOCIAL MEDIA: LEVERAGING DEEP LEARNING AND FAST TEXT
EMBEDDINGS FOR IDENTIFYING MACHINE-GENERATED TWEETS” is being
submitted by
EXTERNAL EXAMINER
DECLARATION
We feel honoured and privileged to offer our warm salutations to our college, Malla Reddy
Engineering College for Women, and to the Department of Information Technology, which gave us the
opportunity to gain engineering expertise and profound technical knowledge.
We would like to deeply thank our Honorable Member of Legislative Assembly, Sri Ch. Malla
Reddy Garu, founder chairman of MRGI, the largest cluster of institutions in the state of Telangana, for
providing us with all the resources in the college to make our project a success.
We wish to convey our gratitude to our Principal, Dr. Y. Madhavee Latha, for providing us with the
environment and means to enrich our skills, for motivating us in our endeavor, and for helping us to
realize our full potential.
We express our sincere gratitude to Dr. M. Vanitha, Professor and Head, Department of
Information Technology, for the kind encouragement and overall guidance given to us throughout
this program.
We would like to thank our internal guide, Mrs. B. Vasantha, Assistant Professor, and all the
faculty members for their valuable guidance and encouragement towards the completion of our
project work.
ABSTRACT
INTRODUCTION
CHAPTER 2
LITERATURE SURVEY
CHAPTER 3
SYSTEM ANALYSIS
3.1 Existing System
3.2 Proposed System
3.3 System architecture
3.4 Software Requirements & hardware requirements
3.5 Process model
3.6 System study
CHAPTER 4
SYSTEM DESIGN
4.1 UML diagrams
4.2 data flow
4.3 Modules
CHAPTER 5
IMPLEMENTATION
5.1 Python
CHAPTER 6
SYSTEM TESTING
CHAPTER 7
SCREEN SHOTS
CHAPTER 8
The proliferation of deepfake technology has raised concerns about the spread of
misinformation on social media platforms. In this paper, we propose a deep learning-based
approach for detecting deepfake tweets, specifically those generated by machines, to help mitigate
the impact of misinformation online. Our approach leverages FastText embeddings to represent
tweet text and combines them with deep learning models for classification. We first preprocess the
tweet text and then use FastText embeddings to convert it into dense vector representations.
These embeddings capture semantic information about the tweet content, which is crucial for
distinguishing between genuine and machine-generated tweets. We then feed these embeddings
into a deep learning model, such as a Convolutional Neural Network (CNN) or a Long Short-Term
Memory (LSTM) network, to classify the tweets as genuine or machine-generated. The model is
trained on a labeled dataset of tweets, where machine-generated tweets are synthesized using state-
of-the-art text generation models. Experimental results on a real-world dataset of tweets
demonstrate the effectiveness of our approach in detecting machine-generated tweets. Our
approach achieves high accuracy and outperforms existing methods for deepfake detection on
social media. Overall, our proposed approach provides a promising solution for identifying
machine-generated tweets and combating the spread of misinformation on social media platforms.
1. INTRODUCTION
2. LITERATURE SURVEY:
3. SYSTEM ANALYSIS
ADVANTAGES:
Our proposed system for deepfake detection on social media leveraging deep
learning and FastText embeddings offers several advantages over existing
systems:
1. Improved Accuracy: By leveraging deep learning models and FastText
embeddings, our system can achieve higher accuracy in identifying machine-
generated tweets compared to existing methods.
2. Robustness: The use of adversarial training techniques improves the robustness
of our model against adversarial attacks, making it more reliable in real-world
scenarios.
3. Scalability: Our system is designed to be scalable, allowing it to handle large
volumes of tweets posted on social media platforms.
HARDWARE REQUIREMENTS:
System : i3 or above
SOFTWARE REQUIREMENTS:
3.5. MODULES:
We have implemented this project as REST-based web services consisting of the following
modules:
1) User Login: the user logs in to the system with the username ‘admin’ and the password
‘admin’.
2) Load Design Patterns Code: after login, the user runs this module to upload the
dataset to the application.
3) Code to Numeric Vector: all code is converted to a numeric vector in which
each word occurrence is replaced with its average frequency.
4) Train ML Algorithms: the processed numeric vectors are split into train and test sets
with a ratio of 80:20. The 80% portion is used to train a
model, and the model is then applied to the 20% test portion to calculate accuracy.
5) Predict Design Patterns: the user uploads test source code files, and the ML
algorithms rank each test file to predict its design patterns.
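The 80:20 split and the accuracy calculation described in step 4 can be sketched in plain Python as follows. This is a minimal illustration and not the project's actual code: the function names `split_train_test` and `accuracy`, the fixed seed, and the shuffle-before-split strategy are all assumptions.

```python
import random


def split_train_test(rows, labels, test_ratio=0.2, seed=42):
    """Shuffle row indices once, then cut them into test and train parts.

    Hypothetical helper: the names, the seed and the shuffling are
    assumptions made for illustration only.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```

Shuffling before the cut avoids a test set made up only of the last rows of the file, which matters when a dataset is stored grouped by label.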
ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on the
organization. The amount of funds that the company can pour into the research and
development of the system is limited, so the expenditures must be justified. The
developed system is well within the budget, which was achieved because most of the
technologies used are freely available. Only the customized products had to be
purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on the
available technical resources, since that would in turn place high demands on the client.
The developed system must have modest requirements, so that only minimal or no changes
are required for implementing it.
SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the
user. This includes the process of training the user to use the system efficiently. The
user must not feel threatened by the system, but must instead accept it as a necessity. The
level of acceptance by the users depends solely on the methods that are employed to
educate users about the system and to make them familiar with it. Their level of
confidence must be raised so that they are also able to offer constructive criticism,
which is welcomed, as they are the final users of the system.
4. SYSTEM DESIGN
4.1 UML DIAGRAMS
The goal is for UML to become a common language for creating models of
object-oriented computer software. In its current form, UML comprises two major
components: a meta-model and a notation. In the future, some form of method or
process may also be added to, or associated with, UML.
The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the
software development process. The UML uses mostly graphical notations to express
the design of software projects.
GOALS:
The Primary goals in the design of the UML are as follows:
[Diagram: actor USER with the actions Load Dataset, Run All Algorithms and Logout]
In this literature survey, we review key studies and methodologies related to deepfake
detection on social media, with a focus on leveraging deep learning and FastText
embeddings for identifying machine-generated tweets. This survey provides a
comprehensive overview of existing research, highlighting the strengths and
limitations of various approaches.
Transformer models, such as BERT (Devlin et al., 2019) and GPT (Radford et al.,
2019), have revolutionized natural language processing (NLP) by enabling better
understanding and generation of human-like text. These models leverage self-attention
mechanisms to capture contextual relationships in data, making them effective for
tasks like text classification and generation. Transformer-based models have been
employed for detecting machine-generated text due to their superior performance in
capturing nuanced linguistic patterns.
Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) are traditional
word embedding techniques that represent words in continuous vector spaces. These
embeddings capture semantic relationships between words, which can be useful for
various NLP tasks. However, these models have limitations in handling
out-of-vocabulary words and fail to capture subword information.
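The subword information mentioned above is what FastText adds: following Bojanowski et al. (2017), a word is wrapped in boundary markers and broken into character n-grams, so an unseen word can still be represented through the n-grams it shares with known words. The sketch below only extracts the n-grams. The function name `char_ngrams` is an assumption, and the 3 to 6 range follows the setting described in that paper. Learning and summing the n-gram vectors is not shown.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in '<' and '>'.

    Hypothetical helper for illustration; it does not call the
    fastText library.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# With n fixed at 3, "where" yields: <wh, whe, her, ere, re>
print(char_ngrams("where", 3, 3))
```

The boundary markers let the model tell a prefix or suffix apart from the same letters in the middle of a word. For example, the n-gram `her` taken from inside `where` differs from the wrapped word `<her>`.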
2.2 FastText
Kumar et al. (2021) explored the use of machine learning models for detecting AI-
generated fake news. They demonstrated that advanced models, when trained on
diverse datasets, could effectively identify fake news articles. Their research
emphasized the importance of using robust training data and sophisticated models to
combat the evolving nature of AI-generated content.
Zellers et al. (2019) proposed a novel approach for defending against neural fake news.
They developed the GROVER model, which both generates and detects fake news
articles. By leveraging large-scale language models, their method achieved
state-of-the-art results in identifying machine-generated news, highlighting the potential of
transformer-based models in deepfake detection.
Schuster et al. (2020) discussed the limitations of current neural network models in
modeling human behavior in language. They pointed out that while deep learning
models have achieved significant progress, they still struggle with capturing the
complexity of human language and behavior. This underscores the need for continuous
advancements in model architectures and training techniques to improve deepfake
detection.
Conclusion
The literature survey reveals that leveraging deep learning and FastText embeddings
holds significant promise for detecting machine-generated tweets on social media.
Transformer models, in particular, have shown remarkable success in capturing
linguistic patterns and contextual information. However, challenges remain, such as
the need for large-scale and diverse training data, as well as the ability to adapt to
rapidly evolving fake content generation techniques. Future research should focus on
enhancing the robustness and generalizability of detection models, incorporating
multimodal data, and developing real-time detection systems to effectively combat the
spread of deepfakes on social media.
This literature survey provides an overview of the key research areas relevant
to this study, setting a foundation for understanding the current state of deepfake
detection and identifying avenues for future research.
CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language (UML) is
a type of static structure diagram that describes the structure of a system by showing
the system's classes, their attributes, operations (or methods), and the relationships
among the classes. It explains which class contains information.
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is a
construct of a Message Sequence Chart. Sequence diagrams are sometimes called
event diagrams, event scenarios, and timing diagrams.
[Sequence diagram: Load Dataset, Fast Text, Run All Algorithms, Predict Deep Fake, Logout]
COLLABORATION DIAGRAM:
[Collaboration diagram between USER and DATABASE: 1: Load Dataset, 2: Fast Text Embedding, 3: Run All Algorithms, 4: Predict Deep Fake, 5: Logout]
4.2. DATA FLOW:
FLOW CHART
ACTIVITY DIAGRAM:
4.3. MODULES:
To implement this project we have designed the following modules:
1) User Login: the user logs in to the system with the username ‘admin’ and the password
‘admin’.
2) Load Dataset: after login, the user clicks this link to load the dataset into the application.
3) Fast Text Embedding: the loaded dataset is cleaned by removing stop words and
special symbols and by applying other text processing techniques, and is then input to
the FastText algorithm to generate numeric vectors.
4) Run All Algorithms: the numeric vectors are normalized and then split into train
and test sets. The training data is input to all algorithms to train the models, and
these models are applied to the test data to calculate prediction accuracy.
5) Predict Deep Fake: in this module the user enters some tweet text, and the CNN
algorithm predicts whether the tweet was written by a Human or a BOT.
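The cleaning and vectorization in module 3 can be sketched as below. This is an illustration under stated assumptions, not the project's code: the stop-word list is a tiny made-up subset, the names `clean_tweet` and `average_vector` are hypothetical, and the `embeddings` dictionary is a toy stand-in for a trained FastText model. A real FastText model can also produce vectors for unseen words from their character n-grams, whereas this toy lookup simply skips them.

```python
import re

# Illustrative subset only; a real stop-word list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on"}


def clean_tweet(text):
    """Lowercase, strip links and non-letter symbols, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits and symbols
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens into one fixed-size vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy two-dimensional embedding table (made-up values).
toy_embeddings = {"good": [1.0, 3.0], "bot": [3.0, 5.0]}
tokens = clean_tweet("The GOOD bot!!! https://example.com")
print(tokens, average_vector(tokens, toy_embeddings, dim=2))
```

Averaging gives every tweet a vector of the same length regardless of how many words it contains, which suits classifiers that expect fixed-size inputs. The pattern `[^a-z\s]` also removes non-English letters, so it would need adjusting for multilingual tweets.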
5. IMPLEMENTATION
5.SOFTWARE ENVIRONMENT
What is Python :-
Programmers have to type relatively little, and the indentation requirement of the language
keeps programs readable.
Python language is being used by almost all tech-giant companies like – Google,
Amazon, Facebook, Instagram, Dropbox, Uber… etc.
The biggest strength of Python is its large standard library, which can be used
for the following –
Machine Learning
Test frameworks
Multimedia
Quality of data − Having good-quality data for ML algorithms is one of the biggest
challenges. Use of low-quality data leads to the problems related to data preprocessing
and feature extraction.
No clear objective for formulating business problems − Having no clear objective and
well-defined goal for business problems is another key challenge for ML because this
technology is not that mature yet.
Emotion analysis
Sentiment analysis
Speech synthesis
Speech recognition
Customer segmentation
Object recognition
Fraud detection
Fraud prevention
Unsupervised Learning – This involves using unlabelled data and then finding
the underlying structure in the data in order to learn more about the
data.
TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used
for machine learning applications such as neural networks. It is used for both research
and production at Google.
TensorFlow was developed by the Google Brain team for internal Google use. It was
released under the Apache 2.0 open-source license on November 9, 2015.
NumPy
Pandas
Matplotlib
Scikit-learn
6. SYSTEM TEST
The purpose of testing is to discover errors. Testing is the process of trying to
discover every conceivable fault or weakness in a work product. It provides a way to
check the functionality of components, sub-assemblies, assemblies and/or a finished
product. It is the process of exercising software with the intent of ensuring that the
software system meets its requirements and user expectations and does not fail in an
unacceptable manner. There are various types of tests. Each test type addresses a
specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly, and that program inputs produce valid outputs.
All decision branches and internal code flow should be validated. It is the testing of
individual software units of the application. It is done after the completion of an
individual unit and before integration. This is structural testing that relies on knowledge
of the unit's construction and is invasive. Unit tests perform basic tests at component level
and test a specific business process, application, and/or system configuration. Unit
tests ensure that each unique path of a business process performs accurately to the
documented specifications and contains clearly defined inputs and expected results.
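A unit test of the kind described above might look like the following sketch, written with Python's standard `unittest` module. The function under test, `label_from_score`, is a hypothetical helper invented for this example, as is the 0.5 threshold. It is not taken from the project code.

```python
import unittest


def label_from_score(score, threshold=0.5):
    """Hypothetical helper: map a model score in [0, 1] to a class label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return "BOT" if score >= threshold else "Human"


class LabelFromScoreTests(unittest.TestCase):
    """Each test exercises one decision branch of label_from_score."""

    def test_high_score_is_bot(self):
        self.assertEqual(label_from_score(0.9), "BOT")

    def test_low_score_is_human(self):
        self.assertEqual(label_from_score(0.1), "Human")

    def test_out_of_range_score_is_rejected(self):
        with self.assertRaises(ValueError):
            label_from_score(1.5)


# Build and run the suite explicitly so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(LabelFromScoreTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The three tests cover the three branches of the function (the two labels and the invalid input), matching the requirement that all decision branches be validated.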
Integration testing
Functional test
System Test
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An
example of system testing is the configuration oriented system integration test. System
testing is based on process descriptions and flows, emphasizing pre-driven process
links and integration points.
White Box Testing is testing in which the software tester has knowledge of
the inner workings, structure and language of the software, or at least its purpose.
Black Box Testing is testing the software without any knowledge of the
inner workings, structure or language of the module being tested. Black box tests, like
most other kinds of tests, must be written from a definitive source document, such as
a specification or requirements document.
Unit Testing
Field testing will be performed manually and functional tests will be written in
Integration Testing
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
7. SCREENSHOTS
To run the code, double-click the ‘run.bat’ file to start the Python server and get the page below.
In the above screen the Python server has started. Now open a browser, enter the URL
http://127.0.0.1:8000/index.html and press the Enter key to get the page below.
In the above screen, click the ‘User Login Here’ link to get the page below.
In the above screen the user logs in, and after login gets the page below.
In the above screen, click the ‘Load Dataset’ link to load the dataset and get the page below.
In the above screen the dataset is loaded. Now click the ‘Fast Text Embedding’ link to convert
all text to numeric vectors and get the page below.
In the above screen all tweets have been converted to numeric vectors, and some
values from the vectors are displayed. Now click the ‘Run All ML Algorithms’ link to train all
algorithms and get the page below.
In the above screen the results of all algorithms are shown in tabular and graph format; the
proposed CNN and the extension hybrid CNN achieved high accuracy. Now click
the ‘Predict Deep Fake’ link to get the page below.
In the above screen, enter some tweet text in the text field and press the button to get the
values below. If you want, you can use the sample tweets given in the ‘test_tweets.txt’ file.
In the above screen the given tweet is predicted as ‘Deep Bot’, which means it is a fake tweet
spread by a BOT. The screen below shows another example.
In the above screen some other tweet text is entered, and the output is shown below.
In the above screen the tweet is detected as normal, which means the tweet was written by a human.
Similarly, you can enter other tweets and get the output.
Conclusion
In this study, we explored the efficacy of deep learning techniques combined
with FastText embeddings to detect machine-generated tweets, commonly known as
deepfakes. Our experimental results demonstrated that this approach could effectively
distinguish between human-generated and machine-generated tweets with high
accuracy.
4. **Challenges and Limitations**: Despite the promising results, our approach is not
without limitations. The models require substantial computational resources and may
struggle with the rapid evolution of text generation algorithms. Additionally, adversarial
techniques used to bypass detection mechanisms pose a continuous challenge.
References
1. **Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017).** Enriching Word
Vectors with Subword Information. *Transactions of the Association for Computational
Linguistics, 5*, 135-146. https://doi.org/10.1162/tacl_a_00051
2. **Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019).** BERT: Pre-training
of Deep Bidirectional Transformers for Language Understanding. *Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)*,
4171-4186. https://doi.org/10.18653/v1/N19-1423
3. **Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
... & Bengio, Y. (2014).** Generative Adversarial Nets. *Advances in Neural Information
Processing Systems, 27*, 2672-2680.
4. **Kumar, M., Rajput, N., Aggarwal, A., Bali, R. K., & Sharma, S. (2021).**
Detecting AI-Generated Fake News Using Machine Learning. *Journal of Big Data,
8*(1), 1-24. https://doi.org/10.1186/s40537-021-00473-5
5. **Lample, G., Conneau, A., Denoyer, L., & Ranzato, M. (2017).** Unsupervised
Machine Translation Using Monolingual Corpora Only. *arXiv preprint
arXiv:1711.00043*.
6. **Nguyen, T. T., Nguyen, T. N., Nguyen, D. N., & Le, A. C. (2022).** Detecting
Machine-Generated Text Using Transformer Models. *Proceedings of the 2022
International Conference on Computational Linguistics*, 245-254.
7. **Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019).**
Language Models are Unsupervised Multitask Learners. *OpenAI Blog, 1*(8), 9.
8. **Schuster, T., Elazar, Y., & Goldberg, Y. (2020).** Limitations of Neural Networks
for Modeling Human Behavior in Language. *Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP)*, 6155-6168.
https://doi.org/10.18653/v1/2020.emnlp-main.498