
FYP Copy

The document discusses the rise of misinformation, particularly during the coronavirus pandemic, and the challenges it poses to public trust and journalistic standards. It outlines a project aimed at enhancing fake news detection using machine learning models on a comprehensive dataset of 44,898 records, focusing on the specific context of India. The project employs various algorithms to improve the identification of false information and aims to contribute to a more reliable digital information environment.

Uploaded by

Gajera Dipen
Copyright
© All Rights Reserved

Abstract

The widespread adoption of information technology and the proliferation
of social media platforms have significantly increased access to
various types of news—ranging from political and economic to medical
and social—via these channels. However, the rapid expansion of news
dissemination and the growing demand for information have obscured
the distinction between authentic and fabricated news, contributing to
the spread of misinformation. During the coronavirus pandemic, the
global awareness of the virus’s risks coincided with a sharp rise in fake
news and rumors. This situation left individuals uncertain about the
credibility of the information they encountered, fostering an atmosphere
of confusion where rumors, unverified claims, and misleading content led
to public panic and misinformation. Such circumstances eroded trust in
media, compromised freedom of expression, and undermined journalistic
standards.
Various research efforts have employed diverse datasets to achieve
notable accuracy in detecting fake news. Nevertheless, the specific issue
of fake news pertaining to the coronavirus pandemic remains under-
explored. Existing studies in this area often rely on limited datasets
or focus on specific subsets of information. To bridge this gap, this
project focuses on enhancing fake news detection using machine learn-
ing models such as Support Vector Machines (SVM), Random Forest
(RF), XGBoost, and Logistic Regression. It leverages a comprehensive
dataset comprising 44,898 records to improve the identification of false
information.

Index
1 Introduction
1.1 Applications
1.2 Motivation
1.3 Objectives
1.4 Organization of the Report

2 Literature Survey
2.1 FakeNewsIndia: A Benchmark Dataset of Fake News Incidents
in India, Collection Methodology and Impact Assessment in Social Media
2.2 Fake News Detection Using Machine Learning Approaches
2.3 Web Scraping for Data Analytics: A BeautifulSoup Implementation
2.4 Fake News Detection: A Deep Learning Approach
2.5 Large Language Model Based Fake News Detection
2.6 Attention Is All You Need
2.7 Large Language Model Agent for Fake News Detection

3 Proposed Algorithm
3.1 Problem Statement
3.1.1 Challenges and Limitations
3.2 Flowchart
3.3 Data Collection
3.4 Data Processing and Feature Extraction
3.5 Machine Learning Models
3.5.1 Support Vector Machines (SVMs)
3.5.2 Naïve Bayes
3.5.3 Extreme Gradient Boosting
3.5.4 Feed-forward Neural Network
3.5.5 Long Short-Term Memory
3.6 Transformer-Based Large Language Model Approach
3.6.1 Introduction
3.6.2 The Transformer Architecture
3.6.3 Pipeline
3.6.4 Pre-trained Models
3.6.5 Fine-Tuning Process

4 Simulation and Results
4.1 Support Vector Machine (SVM)
4.2 XGBoost
4.3 Naive Bayes
4.4 Long Short-Term Memory (LSTM)
4.5 Large Language Model (LLM)
4.6 Comparative Analysis
4.7 Comparisons

5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work

List of Tables
1 Performance metrics for different models for fake news detection.

List of Figures
1 Working Flowchart
2 Transformer Architecture [12]
3 Confusion Matrices for SVM, Naive Bayes, XGBoost, LSTM, and LLM [5]

1 Introduction
The growing prevalence of fake news in the digital era poses a serious chal-
lenge, undermining public trust, social cohesion, and democratic principles.
Fake news consists of intentionally false or misleading information presented
as truth, often created to manipulate readers for political, social, or financial
purposes. The rapid spread of news across social media platforms has made
detecting and curbing fake news increasingly complex, as such content fre-
quently employs exaggerated language and eye-catching headlines to attract
attention.
This project focuses on building an automated system for detecting fake
news using machine learning and natural language processing (NLP) tech-
niques. The system classifies news articles as either fake or genuine by analyz-
ing textual data from a dataset obtained through Kaggle and web scraping.
It employs features such as TF-IDF, cosine similarity, and n-grams to identify
patterns indicative of misinformation. The system compares several algorithms,
including Naive Bayes, Support Vector Machines (SVM), Neural Networks, Long
Short-Term Memory (LSTM), and BERT, to support accurate detection across
diverse news categories.
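As a minimal illustration of the n-gram features mentioned above, the sketch below extracts word-level n-grams from a headline. The function name `word_ngrams`, the whitespace tokenizer, and the sample headline are assumptions made for illustration; the report does not specify its exact tokenization.

```python
def word_ngrams(text, n):
    """Return the list of word-level n-grams in `text`.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption for this sketch rather than the project's actual method.
    """
    tokens = text.lower().split()
    # Slide a window of n tokens across the text; texts shorter than n
    # tokens yield an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical headline used only to show the output shape.
BIGRAMS = word_ngrams("Breaking news shocks nation", 2)
```

Bigrams and trigrams of this kind can be counted and weighted in the same way as single words, which lets a classifier pick up short sensational phrases rather than isolated terms.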
By delivering a scalable and precise solution, the project aims to combat
misinformation effectively, thereby contributing to improved media credibility
and heightened public awareness in the digital realm.

1.1 Applications
Fake news detection technology has found widespread applications across var-
ious sectors, helping to mitigate the harmful impact of misinformation on so-
ciety. In the media and journalism industries, these technologies play a crucial
role in maintaining the integrity of information. By verifying the accuracy of
stories before publication, they help prevent the spread of misleading narra-
tives. Automated fact-checking tools, powered by advanced machine learning
algorithms, assist journalists in cross-referencing information, ensuring that
only credible content is shared with the public.
By swiftly identifying and addressing misinformation, detection systems not
only protect individual users from false narratives but also prevent the
amplification of misinformation on a global scale.
Furthermore, in global security and emergency response scenarios, fake
news detection is critical for ensuring accurate and timely information.
Misinformation during emergencies can lead to chaos, hamper rescue efforts,
and put lives at risk.

1.2 Motivation
The pervasive impact of misinformation on individuals, societies, and gover-
nance in India highlights the critical need for effective fake news detection
systems tailored to the Indian context. In a nation characterized by its lin-
guistic diversity and cultural complexity, the rapid spread of false information
often exacerbates communal tensions, incites violence, and perpetuates stereo-
types, frequently targeting vulnerable groups. Politically, fake news has been
weaponized to influence elections, polarize public opinion, and erode trust in
democratic processes. Economically, misinformation has disrupted markets
and influenced consumer behavior, harming businesses that depend on credi-
bility and public trust.
This project addresses the specific challenges of detecting fake news in the
Indian context by focusing on text-based data. A dataset compiled from Indian
fact-checking platforms and web scraping is utilized to examine instances of
fake news that reflect the varied narratives within India’s digital landscape.
Advanced machine learning algorithms and natural language processing (NLP)
techniques are employed to identify patterns and linguistic traits unique to
misinformation in this region. This effort aims to deepen insights into the
nature of fake news, empower Indian citizens to make well-informed decisions,
restore confidence in credible information sources, and bolster the resilience of
democratic and social systems against the growing threat of misinformation.

1.3 Objectives
The increasing spread of fake news in the digital era presents critical chal-
lenges to maintaining information integrity, public trust, and societal stability.
The rapid dissemination of misinformation through social media platforms and
news aggregators underscores the urgent need for automated systems to iden-
tify and counteract false information. This project addresses these challenges
by employing a variety of machine learning models to develop a scalable and
effective Fake News Detection System.
The system utilizes algorithms such as Support Vector Machines (SVM),
Naive Bayes, Random Forest (RF), XGBoost, Neural Networks (built with
TensorFlow and Keras), Long Short-Term Memory (LSTM) networks, and
Bidirectional Encoder Representations from Transformers (BERT). By ana-
lyzing and comparing the performance of these models, the project seeks to
determine the most accurate and efficient approach for detecting fake news in
real-time scenarios. In addition to accuracy, the focus extends to scalability
and computational efficiency, ensuring practical deployment on platforms like
social media and news aggregators.
The project also investigates the role of data preprocessing techniques,
including text summarization and feature selection, in improving detection
accuracy and reducing false positives. By addressing these factors, the initia-
tive contributes to ongoing efforts to fortify information ecosystems, mitigate
the impact of fake news, and foster a more reliable and trustworthy digital
environment.

1.4 Organization of the Report


This report is designed to present a clear and concise summary of our research
on detecting fake news. It begins with an Introduction, which underscores the
importance of the issue, its practical implications, motivations, and objectives.
The Literature Survey section provides a detailed review of previous studies,
highlighting methodological gaps and laying the groundwork for our proposed
approach. The Proposed Algorithm section explains the steps involved in data
collection, preprocessing, and feature extraction, followed by a detailed
discussion of the machine learning models employed. These models include
Support Vector Machines (SVM), Naive Bayes, Extreme Gradient Boosting
(XGBoost), and transformer-based techniques like BERT.
The Simulation and Results section assesses the performance of these models,
comparing their accuracy and computational efficiency. Finally, the Conclusion
and Future Work section summarizes the key findings, highlights the
contributions of the research, and explores opportunities for future improvements,
such as incorporating real-time detection capabilities and integrating multi-
modal data. This structured approach ensures a comprehensive and systematic
analysis of the research, emphasizing its relevance in addressing the challenges
posed by fake news.

2 Literature Survey

2.1 FakeNewsIndia: A Benchmark Dataset of Fake News
Incidents in India, Collection Methodology and Impact
Assessment in Social Media
The widespread issue of fake news has prompted extensive research aimed at
understanding its effects and finding ways to counteract them, particularly
in an era where misinformation spreads rapidly across digital platforms. Re-
searchers have explored a variety of techniques for detecting fake news, employ-
ing both content-based and context-based strategies. Content-based methods
focus on examining the textual, visual, and linguistic characteristics of news
articles, often utilizing advanced machine learning models such as Support
Vector Machines (SVM), Naive Bayes, and deep learning frameworks like Re-
current Neural Networks (RNNs) and Transformers such as BERT. These ap-
proaches have shown significant effectiveness in capturing the subtle linguistic
and contextual patterns that differentiate fake news from authentic reports.
In contrast, context-based methods analyze user interactions, patterns of
news propagation on social media, and network-related features to gain insights
into the dissemination and influence of fake news. Techniques such as graph-
based analysis and emotion-driven models have proven useful for identifying
misinformation by examining user comments, sharing patterns, and behavioral
dynamics on social platforms.
The introduction of datasets tailored for fake news detection has also played
a pivotal role in advancing this field. While datasets like BuzzFeedNews and
LIAR have been widely used, they predominantly cater to Western socio-
political contexts and are less relevant to countries like India, which has a rich
cultural and linguistic diversity. To address this gap, initiatives like the Fake-
NewsIndia dataset have emerged, offering a repository of over 4,800 fake news
incidents sourced from credible Indian fact-checking platforms. This dataset
provides a unique opportunity to analyze fake news across various modalities,
including text, images, and videos, reflecting the multifaceted nature of mis-
information in India. Despite its utility, challenges persist, such as the limited
representation of regional languages and real-time applicability, which remain
critical areas for further exploration.
Fake news presents considerable challenges to society, shaping public opinion
and, in severe instances, triggering violence and social disruption. Although
machine learning and natural language processing advancements have
enhanced the accuracy of fake news detection, several obstacles remain. These
include biases in datasets, the substantial computational demands of deep
learning models, and the intricacies of analyzing content across multiple lan-
guages. While lightweight models such as Naive Bayes offer efficiency, their
performance often lags behind more resource-intensive architectures like Long
Short-Term Memory (LSTM) networks and Transformers. Furthermore, the
ever-changing nature of misinformation requires detection models to be flexi-
ble and capable of adapting to emerging deceptive content. Overcoming these
challenges is crucial to developing effective and scalable systems that can ad-
dress misinformation in a constantly evolving digital environment.

2.2 Fake News Detection Using Machine Learning Approaches
The widespread and rapid spread of fake news, especially through digital plat-
forms, has emerged as a global concern, driving significant research to address
its societal and political impacts. Efforts in this area have concentrated on
applying machine learning (ML) and natural language processing (NLP) tech-
niques to efficiently detect and categorize fake news. This review examines
notable contributions, methodologies, and the challenges associated with this
domain.[5]
Supervised machine learning models have demonstrated considerable ef-
fectiveness in detecting fake news through text analysis. Algorithms such as
Naïve Bayes, Random Forest, Support Vector Machines (SVM), and Decision
Trees are commonly employed due to their simplicity and effectiveness. For
example, Naïve Bayes achieved 74% accuracy in experiments using datasets
derived from social media, although its assumption of feature independence
limits its ability to handle intricate patterns. Conversely, models like Random
Forest and SVM often exceed 90% accuracy by exploiting non-linear relation-
ships and ensemble learning strategies.
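The feature-independence assumption mentioned above can be made concrete with a minimal multinomial Naïve Bayes sketch. The toy headlines, labels, and function names below are invented for illustration and are not taken from the reviewed study; add-one (Laplace) smoothing is an assumed choice.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(text, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior.

    Each word adds its own log-likelihood term to the score. Summing the
    terms independently is exactly the feature-independence assumption.
    """
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids log(0) for words unseen in a class.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, used only to exercise the functions.
TOY_DOCS = [
    "miracle cure shocks doctors",
    "shocking secret cure revealed",
    "government publishes health budget",
    "ministry publishes vaccination schedule",
]
TOY_LABELS = ["fake", "fake", "real", "real"]
MODEL = train_nb(TOY_DOCS, TOY_LABELS)
```

Because every word is scored on its own, the model cannot tell whether two words appear together or far apart, which is one reason it struggles with more intricate patterns.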
NLP techniques play a crucial role in improving feature extraction and
classification accuracy. Preprocessing steps like tokenization, stemming, and
lemmatization enable more effective text representation. Techniques such as
Term Frequency-Inverse Document Frequency (TF-IDF) and Bag-of-Words
(BoW) are widely used for converting textual data into numerical vectors.[6]
Additionally, tools like sentiment analysis and Part-of-Speech (POS) tagging
provide insights into the semantic and syntactic aspects of text. Advanced
word embeddings like Word2Vec and GloVe further enhance performance by
capturing deeper contextual and semantic relationships.
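To show how TF-IDF turns text into numerical vectors, here is a minimal sketch written with the Python standard library only. It uses the plain definition (term frequency multiplied by log(N / document frequency)); library implementations often apply extra smoothing and normalization, so their numbers will differ. The function names and the two sample sentences are assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, list of TF-IDF vectors) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        vectors.append([counts[tok] / length * idf[tok] for tok in vocab])
    return vocab, vectors


# Hypothetical sample sentences.
VOCAB, VECTORS = tfidf_vectors([
    "fake news spreads fast",
    "real news spreads slowly",
])
```

In this example the words "news" and "spreads" occur in both sentences, so their weight is zero, while "fake" and "real" receive positive weights in the sentence that contains them.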
The quality and variety of datasets significantly impact the effectiveness of
fake news detection. Benchmark datasets such as LIAR, which include labeled
statements and metadata, are widely used for these tasks. Datasets from plat-
forms like Kaggle and PolitiFact also provide valuable resources but may be
influenced by domain-specific biases. Creating balanced datasets with equal
representation of fake and real news remains a persistent challenge. Incorpo-
rating real-time data from social media adds another layer of complexity due
to the noisy and unstructured nature of such text.
Recent advancements in the field include the use of deep learning mod-
els like Recurrent Neural Networks (RNNs) and transformers such as BERT.
These models are particularly adept at capturing long-term dependencies and
contextual subtleties, enabling them to identify nuanced misinformation. How-
ever, their high computational demands and dependency on large datasets
present practical challenges.
Ensemble approaches combining multiple classifiers, such as Random Forest
and Naïve Bayes or K-Nearest Neighbors (KNN), have shown potential in
enhancing precision and recall. Hybrid models that merge linguistic features
with statistical techniques offer a balanced solution, with some achieving over
95% accuracy in specific applications. Despite these advancements, ensuring
scalability and adaptability across diverse datasets and languages remains an
open area for development.
Although significant progress has been made in fake news detection, several
challenges persist, including handling multilingual content, countering adver-
sarial attacks, and mitigating biases in algorithms. Future directions in this
field include integrating real-time detection capabilities with social media mon-
itoring tools and developing cross-domain, multilingual models. By leveraging
advances in machine learning, NLP, and data science, researchers aim to create
robust solutions to combat the ongoing challenge of fake news.

2.3 Web Scraping for Data Analytics: A BeautifulSoup
Implementation
Web scraping is a vital tool in data analytics, enabling the efficient
extraction of structured information from web pages. The reviewed work uses
Python’s BeautifulSoup library to build a web scraper that gathers specific
product-related data from Amazon, including names, prices, ratings, reviews,
and links. This
implementation combines BeautifulSoup with Python’s Requests library to
perform HTML parsing and employs PySimpleGUI and Matplotlib for cre-
ating an interactive interface and data visualizations. The approach enables
quick analysis and presentation of data, such as price trends and review dis-
tributions, making it suitable for small-scale analytics tasks. BeautifulSoup
provides robust capabilities for parsing HTML, constructing parse trees, and
extracting targeted data. Its lightweight design and focus on efficiency make
it an ideal choice for static content scraping.[1] Unlike more complex tools
like Selenium or Puppeteer, which are used for dynamic content extraction or
circumventing sophisticated security measures, BeautifulSoup simplifies data
gathering while minimizing computational demands. This enables the tool to
process data quickly, making it particularly useful for scenarios that do not
require extensive browser automation or handling dynamic web interactions.
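The kind of static-HTML extraction described above can be sketched as follows. To keep the example runnable without third-party packages, it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup, and it parses an inline string instead of fetching a page. The HTML snippet, the CSS class names, and the `ProductParser` class are all invented for illustration and do not reflect Amazon's markup or the reviewed implementation.

```python
from html.parser import HTMLParser

# Hypothetical static markup; real product pages are structured differently.
SAMPLE_HTML = """
<div class="product"><span class="name">Kettle</span><span class="price">19.99</span></div>
<div class="product"><span class="name">Toaster</span><span class="price">24.50</span></div>
"""


class ProductParser(HTMLParser):
    """Collect the text of <span class="name"> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # field being read, if any
        self.products = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "product":
            self.products.append({})  # start a new product record
        elif tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        # Store text only while inside one of the targeted spans.
        if self.current_field and self.products:
            self.products[-1][self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def parse_products(html):
    """Return a list of {'name': str, 'price': float} records."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return [{"name": p["name"], "price": float(p["price"])} for p in parser.products]


PRODUCTS = parse_products(SAMPLE_HTML)
```

The parser matches on specific tag and class names, so a change in the page's markup would silently yield no records. A BeautifulSoup version would express the same selection more compactly but would share this dependence on page structure.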
The scraper showcases its ability to navigate and extract data from mul-
tiple Amazon pages, providing a streamlined method for analyzing product
information. It generates visualizations, such as bar charts and histograms, to
display price frequencies and review counts, facilitating decision-making pro-
cesses. By focusing on static web pages, the implementation ensures efficient
performance, achieving its objectives in under 11 seconds for a typical run,
including data scraping, analysis, and visualization.
Despite its strengths, the scraper has some limitations. It relies on ex-
act product names to initiate searches, which restricts its usability for generic
queries. The implementation also depends heavily on the existing structure
of the website’s HTML, making it vulnerable to changes in the DOM (Docu-
ment Object Model). Additionally, the tool is not designed to handle dynamic
web pages, limiting its scope to static content. Enhancements could involve
enabling broader search functionalities, adapting the scraper to different web-
sites, and improving the interface with advanced visualization libraries like
Seaborn for greater flexibility and interactivity.
The integration of data visualization with web scraping enhances the tool’s
utility by transforming raw data into actionable insights. This feature is par-
ticularly valuable in analytics projects where clear and concise data represen-
tation is crucial. By visualizing key metrics, such as price distributions and
customer feedback patterns, the tool simplifies complex datasets, making them
easier to interpret and apply to decision-making. Further developments could
expand the tool’s capabilities to handle dynamic content or integrate advanced
scraping techniques for greater adaptability. Additionally, comparisons with
other web scraping frameworks, such as Selenium or JavaScript-based tools,
could provide valuable insights into optimizing performance and efficiency.
Adapting the scraper for use across different domains, such as e-commerce,
social media, or research platforms, would broaden its applicability and make
it a versatile solution for diverse data analytics needs. The implementation
of this web scraper highlights the potential of BeautifulSoup for creating effi-
cient, lightweight solutions for data extraction and visualization. By focusing
on specific tasks with clear objectives, it provides a practical framework for
handling small to medium-scale analytics projects while offering avenues for
further enhancement and application.

2.4 Fake News Detection: A Deep Learning Approach


The widespread proliferation of fake news on digital and social media platforms
has driven extensive efforts to develop automated detection methods. Fake
news, characterized by fabricated content intended to mislead audiences, poses
a significant threat to democratic processes, public trust, and open discourse.
Social media platforms like Facebook and Twitter have accelerated its spread,
with research indicating that false information disseminates more rapidly than
verified news. These concerns underscore the urgent need for advanced systems
to accurately identify and classify fake news.
Detecting fake news is a challenging task due to the complexities of lan-
guage and the subjective nature of content interpretation. Early machine
learning approaches laid the foundation for this field by utilizing features such
as word frequency and sentiment analysis. However, deep learning techniques
have demonstrated superior capabilities by effectively capturing contextual re-
lationships within text.[11]
Stance detection has emerged as an effective strategy for addressing this
problem. By evaluating the relationship between a headline and its correspond-
ing article body, stance detection classifies their alignment into categories such
as agree, disagree, discuss, or unrelated. The Fake News Challenge (FNC-1)
dataset used in this study offers a rich resource, containing 49,973 headline-
article pairs with labeled stances. However, the dataset is heavily imbalanced,
with most pairs categorized as unrelated, increasing the difficulty of accurate
classification. Preprocessing steps such as removing stop words, stemming,
and eliminating punctuation were applied to prepare the data. Text
representation was achieved using feature extraction methods like Term
Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW). Among
these, TF-IDF was particularly effective, capturing both local and global word
importance to enhance the model’s ability to distinguish between fake and
genuine news.
Three deep learning configurations were explored: dense neural networks
(DNN) fed with TF-IDF vectors, models built on BoW features, and models built
on pre-trained Word2Vec embeddings. The TF-IDF with DNN approach achieved the
highest accuracy
of 94.31%, outperforming BoW and Word2Vec-based models.[6] The combi-
nation of TF-IDF’s simplicity and the capacity of dense networks to learn
semantic relationships proved to be a robust solution for stance detection.
Additionally, the inclusion of cosine similarity between headline-article pairs
enriched the feature set by providing extra contextual information.
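The cosine-similarity feature between a headline and an article body can be sketched as below. For brevity this example uses raw word-count vectors; the reviewed study computes the similarity on its own text representation, so treat the vectorization, the function name, and the sample texts as assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(vec_a[tok] * vec_b[tok] for tok in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


# Hypothetical headline and article bodies.
HEADLINE = "vaccine trial shows strong results"
RELATED = "the vaccine trial results were strong according to the report"
UNRELATED = "local football club wins the match tonight"
```

A headline and a body on the same topic share vocabulary and therefore score higher than an unrelated pair, which is the signal that helps separate the "unrelated" stance from the other three.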
Despite these advancements, some challenges remain. The model struggled
with the disagree stance, achieving only 44.38% accuracy, which highlights the
difficulty in interpreting subtle contextual differences. To address overfitting
and enhance generalization, techniques such as dropout, L2 regularization,
and early stopping were employed. However, the reliance on textual data
alone limited the model’s ability to identify context-dependent or satirical
misinformation.
This study illustrates the potential of combining TF-IDF with deep learn-
ing architectures for effective fake news detection. Future research will aim
to extend this approach to platforms like Twitter and Facebook, which pose
unique linguistic challenges. Enhancing feature engineering and incorporating
multimodal data, including images and metadata, will be key steps in building
comprehensive systems to combat fake news.

2.5 Large Language Model Based Fake News Detection


The growing prevalence of social media has transformed communication, en-
abling rapid dissemination of information while simultaneously fostering the
spread of disinformation. False information, often diffused faster and more
widely than factual content, has become a global risk, particularly exacer-
bated by advancements in deepfake technology. To address this challenge,
a large language model-based approach was developed, leveraging the Llama
framework.[2] The method integrates alignment and task-specific instructions
to enhance the detection of fake news, particularly in formats such as gener-
ated videos and photos. This approach aims to align the model’s capabilities
with human judgment while overcoming computational constraints during
fine-tuning.
The methodology involves implementing a self-instruct structure that di-
vides instructions into alignment and fake news detection tasks. These are pro-
cessed using a parameter-efficient fine-tuning technique on the Llama model,
prioritizing computational feasibility without sacrificing accuracy. Techniques
such as mixed precision and quantization are employed to optimize memory
usage, reducing hardware requirements significantly. The model is designed
to handle binary classification tasks, distinguishing between true and false in-
formation with logical reasoning and text comprehension capabilities. This
method builds upon existing techniques by addressing key limitations. For
instance, earlier approaches, such as those employing n-gram linguistic analy-
sis or feature fusion, faced challenges in scalability and accuracy. Supervised
learning models, although effective in certain contexts, still required man-
ual fact-checking, while hybrid deep learning frameworks combined multiple
features but relied heavily on dataset characteristics. The proposed model
advances the field by focusing on parameter efficiency and alignment with
human input, ensuring a balance between performance and practicality. In
experimental evaluations, the fine-tuned Llama model demonstrated its abil-
ity to accurately classify fake news, achieving high precision and recall metrics
across datasets. The results highlight its potential in understanding complex
language patterns and detecting disinformation in multimodal formats. By in-
corporating additional features, such as sentiment analysis and logical reason-
ing, the model enhances its ability to interpret nuanced information, making it
a valuable tool in combating the spread of fake news. Challenges remain, par-
ticularly in distinguishing increasingly sophisticated deepfake content, where
human reviewers often struggle. Addressing these challenges involves integrat-
ing multimodal inputs, such as combining text and visual data, and further
refining the model’s capabilities. Future work includes scaling the framework
for larger datasets and improving generalizability across platforms. Exploring
adapter methods and alternative architectures for larger models could further
optimize resource usage and enhance effectiveness.
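To give a sense of why quantization reduces memory usage, the sketch below applies a generic symmetric 8-bit scheme to a short list of weights. This is an illustrative assumption, not the scheme used in the reviewed work, whose exact quantization settings are not stated here; the function names and sample weights are likewise hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights; a real model holds millions or billions of values.
WEIGHTS = [1.27, -0.64, 0.0]
QUANTIZED, SCALE = quantize_int8(WEIGHTS)
RESTORED = dequantize(QUANTIZED, SCALE)
```

Each weight is stored as a single small integer plus one shared scale, instead of a 16- or 32-bit float, at the cost of a rounding error of at most half the scale per weight.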
This framework underscores the importance of aligning AI systems with
ethical considerations and user needs while emphasizing the critical role of
advanced natural language processing in safeguarding the information ecosys-
tem. By leveraging the power of large language models with innovative tun-
ing strategies, this approach provides a foundation for scalable, accurate, and
resource-efficient fake news detection systems.

2.6 Attention Is All You Need
The Transformer model, introduced as a groundbreaking architecture for se-
quence transduction tasks, represents a major advancement in natural lan-
guage processing by relying exclusively on attention mechanisms, eliminating
the need for traditional recurrent or convolutional neural networks. This ap-
proach has led to significant improvements in both computational efficiency
and performance, especially in machine translation tasks. Its key innovation,
self-attention, allows the model to capture global dependencies in sequences
without relying on sequential processing. By removing the constraints of re-
currence and convolution, the Transformer facilitates extensive parallelization,
drastically reducing training time while delivering high accuracy.
A fundamental aspect of the Transformer is its encoder-decoder structure
and the use of scaled dot-product attention. The encoder converts input se-
quences into continuous representations, while the decoder utilizes these repre-
sentations to generate output sequences. Both components incorporate multi-
head self-attention and position-wise feed-forward layers.[12] Multi-head at-
tention enables the model to simultaneously focus on different parts of the
input, effectively capturing a variety of linguistic relationships across tokens.
To handle sequence order, positional encodings are added to input embeddings
using sinusoidal functions, which efficiently encode relative positions within the
sequence.
The architecture’s efficiency stems from the fact that a self-attention layer
connects all positions in a sequence with a constant number of sequentially
executed operations, in contrast to recurrent layers, which require a number
of sequential operations proportional to sequence length.
This feature, combined with the ability to model long-range dependencies, has
enabled the Transformer to achieve state-of-the-art performance on translation
benchmarks. Notably, it achieved BLEU scores of 28.4 and 41.0 for English-to-
German and English-to-French translation tasks, respectively, outperforming
previous models while incurring significantly lower computational costs.
Training the Transformer involves advanced techniques such as the Adam
optimizer with custom learning rate schedules and regularization methods like
dropout and label smoothing. These strategies mitigate overfitting and im-
prove generalization. Parameter sharing between embedding layers and pre-
softmax linear transformations further reduces the model’s memory usage and
computational requirements.
Beyond translation, the Transformer framework exhibits versatility across
a range of applications, including text summarization, question answering, and
even non-text tasks like image and audio processing. Its modular design, fea-
turing components like attention mechanisms and feed-forward layers, allows
for adaptation to diverse domains. Additionally, studies on attention mech-
anisms have revealed that individual attention heads specialize in capturing
distinct syntactic and semantic structures, enhancing the model’s interpretabil-
ity.
Despite its strengths, the Transformer faces challenges, such as processing
very long sequences and scaling to multimodal tasks. Future research could
explore techniques like restricted self-attention and localized attention mecha-
nisms to address these limitations. The open-source release of the Transformer
model has encouraged widespread adoption and innovation, paving the way for
further advancements in deep learning.
By reimagining sequence processing with attention-focused principles, the
Transformer has become a transformative force in machine learning. It drives
efficiency and accuracy in complex sequence modeling tasks, marking a signif-
icant milestone in the evolution of artificial intelligence.

2.7 Large Language Model Agent for Fake News Detection
To tackle the widespread issue of fake news propagation on online platforms,
a new methodology named FactAgent has been introduced, utilizing large lan-
guage models (LLMs) in an agent-like framework. FactAgent mimics the be-
havior of human experts by systematically verifying news claims through a
well-organized workflow. Unlike traditional methods that depend on super-
vised learning or basic prompt-based strategies, FactAgent combines the inher-
ent knowledge of LLMs with external tools to decompose complex verification
tasks into smaller, more manageable steps.[7] This approach improves the ac-
curacy, efficiency, and interpretability of fake news detection while eliminating
the need for annotated training datasets.
FactAgent employs a range of tools categorized into content-based and
evidence-based approaches. Content-based tools analyze elements like writing
style, grammar, and sensational language, identifying traits commonly asso-
ciated with fake news. Evidence-based tools, such as external search engines
and URL verification systems, validate claims by cross-referencing them with
credible external databases. For instance, the Search Tool identifies conflicting
information in online resources, while the URL Tool evaluates the reliability
of news sources based on their domain history. These tools work in tandem
to provide a comprehensive evaluation of news veracity. The methodology of
FactAgent relies on a self-instruct framework, where LLMs execute predefined
workflows designed using domain knowledge. This structured approach enables
the system to address specific challenges, such as misinformation in politically
charged content, by employing targeted tools like the Standing Tool, which
identifies political biases and partisan narratives. This modularity makes Fac-
tAgent highly adaptable, allowing updates and customization for various news
domains and contexts.
Experimental results demonstrate FactAgent’s superiority over traditional
methods like LSTM, BERT, and TextCNN, as well as hierarchical prompt-
ing frameworks like HiSS. FactAgent achieves higher accuracy and F1 scores
across multiple datasets, including PolitiFact, GossipCop, and Snopes, un-
derscoring its effectiveness in identifying misinformation. The integration of
external search tools within the workflow significantly enhances performance
by mitigating the limitations of LLMs’ internal knowledge, such as hallucina-
tions or biases. Another critical advantage of FactAgent is its transparency. At
every step, the system provides explicit explanations and reasoning, enabling
users to understand the basis of its decisions. This interpretability is vital for
building trust in automated fact-checking systems, as it allows stakeholders to
verify the rationale behind veracity assessments.
Future enhancements for FactAgent could include incorporating multi-
modal data—such as visual or design elements of web content—and analyzing
full-text articles rather than just headlines. Additionally, integrating social
context data, like retweet patterns, could further improve its detection capabil-
ities. As misinformation evolves, FactAgent’s flexible workflow design ensures
it remains a robust and scalable solution for addressing the global challenge of
fake news. By leveraging LLMs’ reasoning and contextual abilities, combined
with external evidence retrieval, FactAgent sets a new standard for automated,
explainable, and efficient fake news detection.

3 Proposed Algorithm
3.1 Problem Statement
This project addresses the pervasive issue of fake news and the complexities in-
volved in effectively identifying and mitigating its impact. The rapid expansion
of digital platforms has provided unparalleled access to information, making
it increasingly challenging to differentiate between authentic news and false
or misleading content. Fake news exploits the speed and reach of these plat-
forms, allowing misinformation to spread rapidly to diverse audiences. This
widespread dissemination can significantly influence public perception, disrupt
decision-making processes, and lead to serious social, political, and economic
repercussions. Accurately detecting and limiting the spread of fake news is
essential, as unchecked propagation erodes trust in credible sources, deepens
societal divides, and creates widespread confusion. Tackling this issue requires
a comprehensive approach that integrates advanced technological solutions
with a thorough understanding of the underlying factors driving the creation
and spread of fake news.

3.1.1 Challenges and Limitations


The complexity of addressing fake news lies in the constantly evolving strate-
gies used by those who create and disseminate it. Traditional verification
methods are often unable to keep up with the volume and sophistication of
deceptive content. Moreover, the subjective nature of news, diverse linguistic
styles, and nuanced contextual elements further complicate detection efforts.
Existing solutions frequently struggle to adapt to these dynamic factors, un-
derscoring the need for more advanced and flexible approaches to fake news
detection. This project seeks to contribute to this effort by exploring a variety
of machine learning models aimed at improving the accuracy and efficiency of
fake news detection systems.

3.2 Flowchart

Figure 1: Working Flowchart

3.3 Data Collection
The data collection process serves as the cornerstone of our fake news
detection system, as the dataset’s quality and variety significantly influence
model performance. For this project, we employed two main sources to build
a comprehensive and diverse dataset: Kaggle and web scraping.
From Kaggle, we obtained a well-curated dataset comprising 21,417 real
news articles and 23,481 fake news articles (44,898 in total), offering a
roughly balanced representation of both types.[4] This dataset included
essential attributes for each
news article, facilitating detailed analysis and feature extraction. Each record
in the dataset contained the following attributes:

1. ID: A unique identifier for each news article.
2. Title: The headline or title of the news article.
3. Author: The author of the news article.
4. Text: The body text of the article, which may be incomplete in some
instances.
5. Label: This attribute categorizes the article as potentially unreliable, with
possible values being:
- 1: Represents unreliable articles.
- 0: Represents reliable articles.

In addition to the Kaggle dataset, we augmented the data by web scraping
news articles from Times of India using the BeautifulSoup library in
Python.[1] This allowed us to collect an additional 3,000 articles for a broader
dataset. The web scraping process involved extracting the title, author, and
body text of news articles from the website, followed by manual labeling of
each article based on its reliability. This additional dataset helped incorporate
recent news trends and diversified the corpus, improving the model’s ability
to generalize across various sources.
The combined dataset thus consists of 47,898 articles, including a sig-
nificant volume of fake and real news examples. By leveraging both curated
datasets and real-world news articles, we ensured a diverse and realistic repre-
sentation of the fake news problem. This comprehensive dataset enabled us to
train a robust detection model capable of distinguishing between reliable and
unreliable news sources effectively.

3.4 Data Processing and Feature Extraction
The data processing and feature extraction phase was essential for transforming
raw text into a structured and meaningful format, enabling effective machine
learning analysis. This phase began with comprehensive data preprocessing
to ensure the dataset was clean, consistent, and ready for feature extraction.
All text was converted to lowercase to maintain uniformity, avoiding distinctions
between variations like "Fake" and "fake." Common stop words such as
"is," "the," and "and," which provide little contextual value, were removed
using Python’s NLTK library. Punctuation was stripped to simplify the text
while retaining its core meaning. The cleaned text was then tokenized using
the SpaCy library, breaking it into smaller units (tokens), which allowed
the model to process individual words effectively.[9] Further, stemming and
lemmatization were applied to reduce words to a common base form, so that
inflected variants like "running" and "ran" were treated uniformly as
"run."
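The cleaning steps above can be sketched in a few lines. This is a minimal illustration only: it uses the Python standard library, and the tiny `STOP_WORDS` set and the `preprocess` helper are hypothetical stand-ins for the NLTK stop-word list and SpaCy tokenizer used in the project. Stemming and lemmatization are left out.

```python
import string

# Hypothetical miniature stop-word list; the project used NLTK's English list.
STOP_WORDS = {"is", "the", "and", "a", "an", "of", "to", "in"}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]


print(preprocess("The Senate is voting, and the FAKE story spreads!"))
```

Whitespace splitting is a deliberate simplification here; a real tokenizer handles contractions, hyphens, and non-ASCII punctuation more carefully.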
After preprocessing the data, advanced feature extraction techniques were
utilized to transform the textual information into numerical representations
for analysis. Term Frequency-Inverse Document Frequency (TF-IDF) was em-
ployed to measure the significance of words within an article in relation to
their frequency across the dataset. Unigrams (single words), bigrams
(pairs of words), and trigrams (three-word sequences) were extracted so that
the features captured both individual terms and the short contextual patterns
indicative of fake news tendencies. Features such as word and character counts
were also calculated to evaluate verbosity, a trait commonly associated with
fake news content. Furthermore, cosine similarity was computed to assess the
semantic consistency between the headline and the article body. Higher simi-
larity scores often indicated coherence characteristic of genuine news, whereas
lower scores highlighted potential misinformation.
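To make the TF-IDF and headline/body similarity features concrete, the following sketch computes them with the standard library only. It is not the project's implementation: the helper names (`ngrams`, `tfidf_vectors`, `cosine`) are hypothetical, and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 is one common variant chosen here as an assumption.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


headline = "senate passes budget bill".split()
body = "the senate passes the budget bill after a long debate".split()
unrelated = "aliens land in city park".split()

# Use unigrams plus bigrams as features for every text.
docs = [toks + ngrams(toks, 2) for toks in (headline, body, unrelated)]
vec_headline, vec_body, vec_unrelated = tfidf_vectors(docs)

print(cosine(vec_headline, vec_body) > cosine(vec_headline, vec_unrelated))
```

In this toy corpus the headline shares several unigrams and bigrams with its body and none with the unrelated text, so the first similarity is the larger one.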
To address the mild imbalance in the dataset, where fake news articles
outnumbered real ones, stratified sampling was employed so that each class
was represented in the training data in proportion to its share of the dataset.
This helped reduce the risk of bias and ensured that the model could perform
well across all types of data. Through meticulous preprocessing and feature
extraction, the dataset
was transformed into a structured and enriched format, retaining essential lin-
guistic and contextual information. This comprehensive preparation provided
the foundation for accurate and reliable predictions in distinguishing fake news
from real news.
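Stratified sampling can be illustrated with a short, self-contained sketch. The function name `stratified_split`, the 20% hold-out fraction, and the seed are assumptions made for the example; the report does not state the split actually used.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps its share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy label vector: 60 fake (1) and 40 real (0) articles.
labels = [1] * 60 + [0] * 40
train, test = stratified_split(labels)
print(len(train), len(test), sum(labels[i] for i in test))
```

Because the split is done per class, the 60/40 ratio of the toy labels is preserved in both the training and the held-out indices.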

3.5 Machine Learning Models

3.5.1 Support Vector Machines (SVMs)


Support Vector Machines are supervised machine learning algorithms widely
used for classification tasks. Initially, SVMs were designed to solve linear clas-
sification problems.[8] However, real-world data often contains non-linear rela-
tionships, limiting the applicability of the original linear SVM. This challenge
was addressed through the introduction of the kernel trick, enabling SVMs
to handle non-linear classification problems. Kernel functions transform the
input data into a higher-dimensional space where complex patterns can be
effectively classified.
One of the most commonly used kernels is the Radial Basis Function (RBF)
kernel, also known as the Gaussian Kernel. The RBF kernel is particularly
effective in capturing non-linear relationships by mapping data points into a
higher-dimensional space, making them linearly separable. The formula for
the RBF kernel is as follows:

\[ K(x, x') = \exp\left( -\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}} \right) \tag{1} \]
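As a quick numerical check of the RBF kernel in Equation (1), the sketch below evaluates it for two small vectors. The function name `rbf_kernel` and the default `sigma=1.0` are illustrative assumptions, not values taken from the project.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical points
print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))  # squared distance of 2
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=3.0))
```

Identical points give a kernel value of 1, and the value decays toward 0 as the points move apart; a larger sigma makes that decay slower.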
In document classification tasks, transforming text into numerical feature

vectors is crucial. The Doc2Vec model, an extension of Word2Vec, enables the


transformation of entire documents into fixed-length vectors. This is useful
because these vectors represent the semantic similarity between documents,
which is important when feeding data into machine learning models like SVM.
By using the Doc2Vec technique, we can generate meaningful feature vec-
tors for documents. Then, by applying the RBF kernel in an SVM, we can
measure the similarity between these vectors in a way that reflects their orig-
inal semantic relationships. The distance between feature vectors, computed
through the kernel, ensures that documents with similar content are correctly
classified together.
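At its core, SVM operates on a fundamental principle: the creation of a
maximal "street" (margin) that distinctly separates the data classes.
Maximizing this margin leads to the following optimization problem:
\[ \arg\max_{w,b} \left\{ \frac{1}{\lVert w \rVert} \min_{n} \left[ t_n \left( w^{T} \phi(x_n) + b \right) \right] \right\} \tag{2} \]
where $t_n \in \{-1, +1\}$ is the class label of training point $x_n$ and $\phi$
is the feature mapping induced by the kernel. Rescaling $w$ and $b$ so that the
point closest to the decision boundary satisfies $t_n (w^{T} \phi(x_n) + b) = 1$
yields the constraints
\[ t_n \left( w^{T} \phi(x_n) + b \right) \geq 1, \quad n = 1, \dots, N \tag{3} \]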
Under these constraints, maximizing $1 / \lVert w \rVert$ is equivalent to
minimizing $\frac{1}{2} \lVert w \rVert^{2}$. Introducing one Lagrange multiplier
$a_n$ per constraint gives the Lagrangian
\[ L(w, b, a) = \frac{1}{2} \lVert w \rVert^{2} - \sum_{n=1}^{N} a_n \left\{ t_n \left( w^{T} \phi(x_n) + b \right) - 1 \right\} \tag{4} \]
where
\[ a_n \geq 0, \quad n = 1, \dots, N \]

3.5.2 Naïve Bayes


To establish a baseline accuracy for our dataset, we employed a Naive Bayes
classifier, specifically the Gaussian Naive Bayes implementation available in
scikit-learn.[8] Gaussian Naive Bayes is a simple yet effective classification
method that uses a probabilistic framework based on the assumption of con-
ditional independence between features given the class label. This assumption
simplifies computations and is a key characteristic of the Naive Bayes approach.
We incorporated Doc2Vec embeddings into the Naive Bayes classifier, fol-
lowing the same procedure as with other models. These embeddings enhance
the classifier’s ability to differentiate between text data by providing a more
robust representation of the input, improving classification accuracy.
The Naive Bayes classifier operates on Bayes’ theorem, a foundational prin-
ciple in probability theory. This theorem enables the estimation of the prob-
ability of a particular class based on observed features, guiding classification
decisions. By implementing the Naive Bayes classifier at this stage, we estab-
lished a benchmark accuracy, serving as a reference point for evaluating the
performance of more advanced models and techniques.

\[ P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \tag{5} \]
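The following from-scratch sketch shows how Equation (5) turns into a Gaussian Naive Bayes classifier. It is an illustration, not the scikit-learn implementation used in the project: the function names, the toy data, and the small variance floor `var_floor` are assumptions. The denominator P(x) is omitted because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)

    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Pick the class maximizing log P(c) + sum_j log P(x_j | c)."""
    def log_posterior(params):
        prior, means, variances = params
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (xj - m) ** 2 / (2 * var)
            for xj, m, var in zip(x, means, variances)
        )
        return math.log(prior) + log_likelihood

    return max(model, key=lambda label: log_posterior(model[label]))


# Toy 2-D feature vectors standing in for document embeddings.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]), predict_gaussian_nb(model, [3.1, 0.5]))
```

Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied together.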

3.5.3 Extreme Gradient Boosting
Extreme Gradient Boosting (XGBoost) is a prominent ensemble learning algorithm
commonly used for machine learning tasks, particularly regression and
classification.[8] It leverages the boosting technique to enhance accuracy by
sequentially improving weak models. XGBoost starts by building an initial
decision tree and computing the residuals, that is, the differences between the
current predictions and the true targets. Each subsequent tree is fit to the
residuals (more generally, the gradients of the loss) left by the trees before
it, correcting prior errors and refining the prediction accuracy. The process
continues until the loss function stops improving or the specified number of
trees is reached, ultimately improving the model’s forecasting capability
through gradient boosting.
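The residual-fitting loop can be sketched with one-split trees (decision stumps) on a one-dimensional toy problem. This is a deliberately simplified squared-error booster, not XGBoost itself, which additionally uses second-order gradient information and regularization; the names `fit_stump` and `boost`, the learning rate, and the number of trees are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single-threshold split on 1-D inputs with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit each new stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
base, stumps, predictions = boost(xs, ys)
print([round(p, 3) for p in predictions])
```

Each round shrinks the remaining residuals, so the ensemble's predictions move steadily toward the targets.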

3.5.4 Feed-forward Neural Network


Our exploration of Natural Language Processing (NLP) involved designing
and implementing two advanced feed-forward neural network models utilizing
TensorFlow and Keras, both of which are highly versatile and efficient deep
learning frameworks.[3] These models were developed to analyze complex pat-
terns in textual data, facilitating accurate classification of fake and real news.
Neural networks have revolutionized NLP by addressing the limitations of tra-
ditional linear models, such as Support Vector Machines (SVMs) and logistic
regression. Unlike these earlier approaches, neural networks are capable of
capturing nonlinear relationships and modeling the intricate structures of lan-
guage, making them particularly effective for tasks like fake news detection.
The TensorFlow-based neural network featured an architecture with three
hidden layers, each consisting of 300 neurons. This uniform layer configuration
provided sufficient depth and capacity for the model to learn both shallow and
deep patterns in the data. The network processed input data represented as
TF-IDF vectors and word embeddings, preserving both the semantic meaning
of words and their contextual relationships. Mathematically, for a given layer
l, the computation can be expressed as:

\[ z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} \tag{6} \]

where $z^{(l)}$ is the weighted input, $W^{(l)}$ is the weight matrix, $a^{(l-1)}$ is the
activation from the previous layer, and $b^{(l)}$ is the bias vector. The activation
function applied to this weighted input is:

\[ a^{(l)} = \mathrm{ReLU}(z^{(l)}) \tag{7} \]

where $\mathrm{ReLU}(x) = \max(0, x)$. The final output is computed as:

\[ \hat{y} = f_{\mathrm{out}}\left( W^{(L)} a^{(L-1)} + b^{(L)} \right) \tag{8} \]

where $\hat{y}$ is the predicted output, $f_{\mathrm{out}}$ is the activation function for the
output layer, and $L$ is the total number of layers. The Adam optimizer was
employed for training, providing adaptive learning rates that improved conver-
gence and ensured stable learning. Additionally, batch normalization was
applied to standardize inputs to each layer, enhancing stability and reducing
training time.
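The layer computations in Equations (6) to (8) amount to a short forward pass. The sketch below uses plain Python lists and tiny hand-picked weights; the sigmoid output activation is an assumption, since the report does not name the output function, and batch normalization is omitted.

```python
import math


def relu(vec):
    return [max(0.0, v) for v in vec]


def sigmoid(vec):
    return [1.0 / (1.0 + math.exp(-v)) for v in vec]


def dense(weights, bias, activations):
    """Compute z = W a + b for one layer; weights is a list of rows."""
    return [
        sum(w * a for w, a in zip(row, activations)) + b
        for row, b in zip(weights, bias)
    ]


def forward(layers, x, f_out=sigmoid):
    """layers: list of (W, b). ReLU on hidden layers, f_out on the last one."""
    a = x
    for W, b in layers[:-1]:
        a = relu(dense(W, b, a))
    W_out, b_out = layers[-1]
    return f_out(dense(W_out, b_out, a))


# One hidden layer with two units, one output unit (illustrative weights).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[2.0, -1.0]], [0.0]),
]
print(forward(layers, [2.0, 1.0]))
```

For the input above, the hidden activations are [1.0, 0.5], the output pre-activation is 1.5, and the sigmoid maps it to a probability of roughly 0.82.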
The Keras-based neural network, while leveraging the same input data and
overall objectives, adopted a different architecture with hidden layers contain-
ing 256, 256, and 80 neurons, respectively. This gradual reduction in neuron
count allowed the network to refine and distill the learned features at each
stage, focusing on high-level abstractions as the data passed through the layers.
To address the common challenge of overfitting, dropout layers were strate-
gically introduced after each hidden layer.[10] Dropout works by randomly
deactivating a fraction of neurons during training, ensuring the network learns
more generalized patterns. A dropout rate of 0.1 was selected to balance reg-
ularization and model capacity. The ReLU activation function was similarly
applied to the Keras model, enhancing its ability to model nonlinearity and
adapt to the diverse patterns in the input data.
Both neural network implementations were meticulously designed to com-
plement each other’s strengths. The TensorFlow model excelled in handling
larger feature spaces and capturing deeper patterns, while the Keras model
demonstrated superior adaptability and generalization. Together, these mod-
els addressed the complexities of fake news detection, achieving high accuracy
and reliability.
The integration of these feed-forward neural networks underscores their piv-
otal role in addressing the challenges of misinformation in the digital age. By
leveraging thoughtfully designed architectures, optimizing training processes,
and employing advanced activation and regularization techniques, these net-
works effectively captured the nuanced relationships within the dataset. This
highlights the potential of deep learning to tackle real-world problems in NLP,
paving the way for further advancements in the field.

3.5.5 Long Short-Term Memory
The introduction of Long Short-Term Memory (LSTM) networks, pioneered
by Hochreiter and Schmidhuber, marked a significant advancement in sequence
modeling by enabling the effective handling of long-term dependencies in data.
LSTMs are specifically designed to retain relevant past inputs and combine
them with current inputs to make accurate predictions. This capability is
particularly advantageous for tasks involving serialized data, such as natural
language processing (NLP). In the domain of fake news detection, where word
order and sentence structure play a crucial role, LSTMs are well-suited to
capture the intricate relationships inherent in textual data.
To maximize the potential of LSTMs, a robust preprocessing strategy was
implemented. Unlike methods that aggregate entire documents into a single
vector, such as Doc2Vec, our pipeline emphasized preserving word order to
retain the sequential nature of the data. Word embeddings were selected as
the primary method for numerical representation due to their ability to main-
tain the relative positions of words while encoding semantic meanings. This
choice ensured that critical contextual information was preserved for effective
processing by the LSTM.
The preprocessing phase began with comprehensive text cleaning, system-
atically removing irrelevant characters, symbols, and extraneous content. Only
meaningful components, such as letters and numbers, were retained for further
analysis. The dataset’s most frequent words were then identified and ranked
based on their occurrence within the training data. The top 5,000 most com-
mon words were selected and assigned unique integer IDs, providing a compact
yet informative vocabulary that balanced computational efficiency with ade-
quate textual coverage. Rare words were excluded, as their contribution to the
overall context was minimal.
To meet the LSTM’s requirement for fixed-length input vectors, each article
was converted into a numerical sequence of integers. Articles exceeding a
predefined length of 500 words were truncated, while shorter ones were padded
with zeros at the beginning to ensure uniform vector dimensions. Articles with
insufficient word counts were omitted to maintain the robustness of the training
data.
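The vocabulary and fixed-length encoding steps can be sketched as follows. The helper names are hypothetical, and two details are assumptions because the text does not specify them: out-of-vocabulary words are simply dropped, and long articles keep their first `max_len` tokens.

```python
from collections import Counter


def build_vocab(token_lists, max_words=5000):
    """Map the most frequent words to integer IDs starting at 1 (0 is padding)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {
        word: idx
        for idx, (word, _) in enumerate(counts.most_common(max_words), start=1)
    }


def encode(tokens, vocab, max_len=500):
    """Drop unknown words, truncate to max_len IDs, left-pad with zeros."""
    ids = [vocab[tok] for tok in tokens if tok in vocab][:max_len]
    return [0] * (max_len - len(ids)) + ids


docs = [["fake", "news", "spreads"], ["news", "spreads", "fast", "news"]]
vocab = build_vocab(docs, max_words=2)

print(vocab)
print(encode(docs[0], vocab, max_len=5))
print(encode(["news"] * 7, vocab, max_len=5))
```

Padding at the beginning keeps the real tokens at the end of the sequence, closest to the LSTM's final hidden state.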
Word embeddings were further employed to map each word ID to a 32-
dimensional vector, enriching the numerical representation by encoding seman-
tic relationships between words. This approach allowed the model to interpret
contextual similarities and nuances more effectively. Words frequently occurring
in similar contexts were represented by vectors closer in the embedding
space, enabling the LSTM to understand linguistic relationships beyond the
immediate sequence of words.
The resulting dataset, transformed into structured and fixed-length matri-
ces, was optimized for the LSTM architecture’s requirements. These numerical
representations captured both the sequential and semantic properties of the
text, ensuring that the intrinsic meaning was preserved. This processed data
was then fed into the LSTM network for training.

3.6 Transformer-Based Large Language Model Approach

3.6.1 Introduction
Fake news, defined as deliberately fabricated or misleading information pre-
sented as fact, has become a pervasive issue in the digital era. Detecting fake
news is critical to preserving the integrity of public discourse and mitigating
the harmful effects of misinformation. However, the intricate nature of natural
language, with its subtle nuances, contextual variations, and hidden meanings,
presents significant challenges for traditional machine learning approaches.
To address these challenges, transformer-based architectures have emerged
as a groundbreaking solution in Natural Language Processing (NLP). Trans-
formers have set new standards in language understanding and generation by
effectively modeling long-range dependencies, bidirectional context, and com-
plex relationships between words in a text. This section outlines the function-
ality of transformers, their key architectural components, and their integration
into our fake news detection pipeline to enhance accuracy and performance.

3.6.2 The Transformer Architecture


The Transformer, introduced by Vaswani et al. (2017) in Attention Is All
You Need, revolutionizes sequence modeling by eliminating the sequential de-
pendencies of Recurrent Neural Networks (RNNs) through the introduction of
the self-attention mechanism, which allows input sequences to be processed in
parallel.[12] This section explores the key structural components of the Trans-
former and their roles.
A fundamental element of the Transformer is the input embeddings, which
convert raw textual input (e.g., a news headline or article) into dense vector
representations that encapsulate the semantic properties of the text. Addi-
tionally, the Transformer employs positional encodings to preserve the order
of tokens within the sequence. These encodings are added to the embeddings
and are calculated using sinusoidal functions:
 
\[ PE(pos, 2i) = \sin\left( \frac{pos}{10000^{2i/d}} \right), \tag{9} \]
\[ PE(pos, 2i + 1) = \cos\left( \frac{pos}{10000^{2i/d}} \right), \tag{10} \]

Figure 2: Transformer Architecture[12]

where $pos$ represents the token position, $i$ indexes the sine/cosine pairs of
embedding dimensions, and $d$ is the embedding dimension.
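The sinusoidal encodings in Equations (9) and (10) can be generated with a few lines of standard-library Python. The function name `positional_encoding` and the small sizes used in the example are illustrative assumptions.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `even` plays the role of 2i in the equations.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            table[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                table[pos][even + 1] = math.cos(angle)
    return table


pe = positional_encoding(max_len=3, d_model=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1 (sin 0 and cos 0), and higher dimension pairs oscillate more slowly with position.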
The Transformer operates on an encoder-decoder framework. The en-
coder processes the input sequence and generates context-aware representa-
tions. Each encoder layer comprises a multi-head self-attention mechanism,
which captures dependencies within the sequence, and a feed-forward network
(FFN), which introduces non-linearity and enhances feature representation.
For classification tasks such as fake news detection, the decoder—primarily
used in sequence generation tasks like translation—is typically omitted.
At the core of the Transformer is the self-attention mechanism, which eval-
uates the significance of each word in the sequence relative to others. This is
calculated as:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^{\top}}{\sqrt{d_k}} \right) V \tag{11} \] [12]
where $Q$, $K$, and $V$ are the Query, Key, and Value matrices derived from input
embeddings, and $d_k$ is the dimensionality of the key vectors. By applying
attention weights, the model focuses on relevant words, enabling contextual
understanding. Multi-head attention extends the self-attention mechanism by
allowing multiple attention heads to focus on different aspects of the input:

\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O} \tag{12} \] [12]

where each head operates independently, enhancing the model’s expressive
power. Residual connections add each sub-layer’s input to its output,
facilitating better gradient flow during backpropagation, while layer
normalization stabilizes training by normalizing the activations within each
layer. Each feed-forward network (FFN) consists of two linear transformations
with a ReLU activation in between:
\[ \mathrm{FFN}(x) = \max(0, xW_1 + b_1)\, W_2 + b_2 \tag{13} \] [12]
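To illustrate the scaled dot-product attention of Equation (11), the sketch below implements a single head with plain Python lists. It is a teaching example under simplifying assumptions: there are no learned projection matrices, no masking, and no batching, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    weights = [
        softmax([
            sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in K
        ])
        for q_row in Q
    ]
    output = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(Q, K, V)
print(weights)
print(output)
```

Each query attends most strongly to the key it matches, so each output row is a weighted average of the value rows that leans toward the matching one.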

3.6.3 Pipeline
Using transformers for fake news detection involves adapting pre-trained lan-
guage models to classify news articles as either fake or real. This process en-
compasses multiple steps, ranging from data preparation to model inference,
ensuring both high accuracy and computational efficiency.

3.6.4 Pre-trained Models


Pre-trained transformer models such as BERT, RoBERTa, and GPT form the
foundation of our fake news detection system. These models leverage extensive
prior knowledge to comprehend complex relationships within text, which is
essential for distinguishing between authentic and fabricated news.
BERT (Bidirectional Encoder Representations from Transformers) excels
in understanding bidirectional context, enabling it to detect subtle linguistic
patterns and manipulations often present in fake news. RoBERTa (Robustly
Optimized BERT Approach) builds upon BERT by training on larger datasets
and employing optimized training strategies, thereby enhancing its capability
to differentiate between fake and real news.[?]

GPT (Generative Pre-trained Transformer), primarily designed as a gener-
ative model, can be fine-tuned for fake news detection by framing the task as
a sequence classification problem. This adaptation allows GPT to effectively
classify news content as either fake or real.

3.6.5 Fine-Tuning Process


Dataset preparation involves gathering data through web scraping from the
Times of India website, which provides real-time news headlines and articles.
This scraped data is then combined with the Kaggle Fake News Challenge
Dataset, which contains labeled news articles categorized as either fake or
real. The preprocessing steps for the combined dataset include tokenization,
cleaning (removal of noise such as special characters), lowercasing, stopword
removal, and padding/truncation to ensure uniform input size.
The pre-trained transformer model is extended by adding a classification
head. This classification head consists of a fully connected neural network that
takes the embeddings from the final transformer layer and outputs a binary
probability distribution (fake or real) using the softmax activation function.
The model is trained to minimize classification errors. During training, to-
kenized text is input into the transformer model, contextual embeddings are
obtained from the final layer, and these embeddings are passed through the
classification head. The Cross-Entropy Loss is calculated as:
\[ \mathrm{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right), \tag{14} \]

where $y_i$ is the true label and $\hat{y}_i$ is the predicted probability of the
positive class. The AdamW
optimizer is used to update the model weights, which improves convergence
by combining adaptive learning rates with weight decay.
For inference, the input text is preprocessed (tokenization and padding),
passed through the fine-tuned transformer model, and a probability distri-
bution is generated. The label with the higher probability is chosen. If
P (fake) > P (real), the text is classified as fake; otherwise, it is classified
as real.
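The loss in Equation (14) and the decision rule described above can be sketched as follows. The ordering of the two logits as (real, fake), the clipping constant `eps`, and the function names are assumptions made for the illustration; an actual fine-tuning run would rely on a deep learning framework's built-in loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def classify(logits):
    """logits = (real_score, fake_score); ties fall back to 'real'."""
    p_real, p_fake = softmax(logits)
    label = "fake" if p_fake > p_real else "real"
    return label, p_fake


print(classify([0.0, 2.0]))
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

Confident correct predictions give a loss close to 0, while predicting 0.5 for every article gives a loss of ln 2, roughly 0.693.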

4 Simulation and Results
The models were evaluated on accuracy together with Precision, Recall, and
F1 Score. Each metric gives a different view of how well a model separates
fake news from real news. Precision, Recall, and F1 Score are calculated
using the following equations:

1. Precision: Precision quantifies the proportion of true positive predictions
out of all positive predictions made by the model. It reflects how accurately the
model identifies fake news when it predicts an article as fake. High precision
indicates reliability in positive predictions. The formula for precision is [13]:

\[
\text{Precision} = \frac{TP}{TP + FP}, \tag{15}
\]

where:
• $TP$ is the number of true positives (fake news correctly identified as
fake),
• $FP$ is the number of false positives (real news incorrectly identified as
fake).

2. Recall: Recall, also referred to as Sensitivity or True Positive Rate,
measures the proportion of actual positive cases that the model correctly
identifies. High recall indicates that the model captures most instances of
fake news. Recall alone does not penalize false positives, so a model tuned
for high recall may misclassify more real news as fake. The formula for recall
is [13]:

\[
\text{Recall} = \frac{TP}{TP + FN}, \tag{16}
\]

where:
• $FN$ is the number of false negatives (fake news incorrectly identified as
real).

3. F1 Score: The F1 Score represents the harmonic mean of precision and
recall, offering a balanced measure that considers both metrics. It is
particularly useful in scenarios with imbalanced class distributions, as it
accounts for the impact of both false positives and false negatives. The
formula for the F1 score is [13]:

\[
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{17}
\]

where Precision and Recall are calculated as defined above.

Taken together, these metrics give a fuller picture of model performance than
any one of them alone. Precision and recall often trade off against each
other: improving one can reduce the other. The F1 Score summarizes this
trade-off in a single number that is high only when both precision and recall
are high.
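The three formulas can be checked with a short Python sketch. The confusion-matrix counts below are hypothetical and are not taken from the experiments in this report. The zero-division guards are an assumption added so the function stays defined when a model makes no positive predictions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 fake articles caught, 20 real articles wrongly
# flagged as fake, 10 fake articles missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a simple average would.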

4.1 Support Vector Machine (SVM)


SVM is known for its ability to handle high-dimensional data. In our
experiments, SVM reached a high accuracy (A = 0.89), but its precision and
recall were somewhat imbalanced. A precision of P = 0.86 indicates that
relatively few real articles were flagged as fake, while a recall of R = 0.79
shows that about one in five fake news articles was missed. The F1 score was:

\[
F1 = 2 \times \frac{0.86 \times 0.79}{0.86 + 0.79} \approx 0.82,
\]

indicating good performance with room for improvement.

4.2 XGBoost
XGBoost, a gradient boosting algorithm, showed a marked improvement in recall
(R = 0.92), meaning it detected a higher proportion of fake news articles than
SVM. The precision (P = 0.90) was also higher, resulting in a better F1 score:

\[
F1 = 2 \times \frac{0.90 \times 0.92}{0.90 + 0.92} \approx 0.91.
\]

Thus, XGBoost proved to be the most well-rounded of the non-LLM models for
fake news detection, demonstrating a strong balance between precision and
recall. It achieved an accuracy of A = 0.93, second only to the LLM (A = 0.95).

4.3 Naive Bayes
The Naive Bayes classifier showed acceptable results, but it did not perform
as well as SVM or XGBoost. The model achieved an accuracy of A = 0.77, with
precision P = 0.74 and recall R = 0.72. The F1 score was calculated as:

\[
F1 = 2 \times \frac{0.74 \times 0.72}{0.74 + 0.72} \approx 0.73.
\]

Although Naive Bayes is computationally efficient, its lower recall means that
it misses more than a quarter of the fake news articles, which limits its
usefulness compared with the more complex models.

4.4 Long Short-Term Memory (LSTM)


LSTM, a recurrent deep learning model, showed promise due to its ability to
capture contextual information from sequences. The model achieved an accuracy
of A = 0.87, with precision P = 0.84 and recall R = 0.88. The F1 score was:

\[
F1 = 2 \times \frac{0.84 \times 0.88}{0.84 + 0.88} \approx 0.86.
\]

LSTM had a higher recall than SVM (0.88 versus 0.79) but a slightly lower
precision (0.84 versus 0.86). This suggests that LSTM identified more of the
fake news articles at the cost of somewhat more false positives.

4.5 Large Language Model (LLM)


The LLM, a fine-tuned pre-trained transformer, provided the best results across
all metrics. With an accuracy A = 0.95, precision P = 0.93, and recall R = 0.94,
the LLM outperformed all other models. Its F1 score was:

\[
F1 = 2 \times \frac{0.93 \times 0.94}{0.93 + 0.94} \approx 0.935,
\]

reported as 0.94. Despite requiring considerable computational resources, the
LLM was the most accurate model for fake news detection. Its ability to
understand the semantic context of news articles contributed to its high
precision and recall.

4.6 Comparative Analysis
The models were compared on accuracy, precision, recall, and F1 score.
XGBoost and the LLM stood out as the most effective models for fake news
detection. XGBoost, with its balanced performance, achieved F1 = 0.91, while
the LLM was the top performer overall with F1 = 0.94. SVM, while solid, had a
lower recall, resulting in F1 = 0.82. LSTM performed well with F1 = 0.86,
reflecting its ability to capture contextual information. Naive Bayes, though
fast and efficient, was the least effective, with F1 = 0.73.
In conclusion, XGBoost and LLM are the most promising models for fake
news detection. XGBoost achieved a strong balance between precision and
recall while requiring fewer computational resources. On the other hand, LLM
offered the highest accuracy and F1 score, showcasing its superior semantic
understanding. The choice of model depends on the trade-off between com-
putational resources and performance. Future work can focus on further tun-
ing these models to enhance their robustness and applicability to real-world
datasets.

Figure 3: Confusion matrices for (a) SVM, (b) Naive Bayes, (c) XGBoost,
(d) LSTM, and (e) LLM [5].

4.7 Comparisons

Model         Accuracy   Recall   Precision   F1 score
SVM           0.89       0.79     0.86        0.82
XGBoost       0.93       0.92     0.90        0.91
Naive Bayes   0.77       0.72     0.74        0.73
LSTM          0.87       0.88     0.84        0.86
LLM           0.95       0.94     0.93        0.94

Table 1: Performance metrics for different models for fake news detection.
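As a consistency check, the F1 scores in Table 1 can be recomputed from the listed precision and recall values. The sketch below copies the numbers from the table. The dictionary name `REPORTED` and the helper `f1_score` are introduced only for this illustration. Because the table values are rounded to two decimals, small differences in the last digit are to be expected.

```python
# (precision, recall, reported F1) for each model, copied from Table 1.
REPORTED = {
    "SVM": (0.86, 0.79, 0.82),
    "XGBoost": (0.90, 0.92, 0.91),
    "Naive Bayes": (0.74, 0.72, 0.73),
    "LSTM": (0.84, 0.88, 0.86),
    "LLM": (0.93, 0.94, 0.94),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {model: f1_score(p, r) for model, (p, r, _) in REPORTED.items()}

for model, (p, r, reported) in REPORTED.items():
    print(f"{model:<12} recomputed={recomputed[model]:.3f} reported={reported:.2f}")
```

All five recomputed values agree with the table to within rounding. For the LLM row the recomputed value is about 0.935, so the last digit of the reported 0.94 depends on how the unrounded precision and recall were rounded.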

5 Conclusion and Future work
5.1 Conclusion
This study provides a comprehensive analysis of fake news detection using
various machine learning models, highlighting their strengths, limitations, and
potential for practical deployment. Among the models evaluated, XGBoost
and Large Language Models (LLMs) emerged as the most effective solutions.
XGBoost demonstrated a commendable balance between precision, recall, and
computational efficiency, making it suitable for real-time applications. Mean-
while, LLMs achieved the highest F1 score (0.94), excelling in capturing the
semantic nuances of news articles, albeit at a higher computational cost. The
findings underscore the critical role of model selection in addressing the press-
ing issue of fake news.
While these models show promise, the study also emphasizes the impor-
tance of adapting to the evolving nature of misinformation. By continuously
updating datasets, optimizing model parameters, and addressing dataset bi-
ases, the robustness and generalizability of fake news detection systems can be
significantly enhanced. This research sets the stage for future innovations, aim-
ing to develop more accurate, adaptive, and scalable solutions that combine
traditional machine learning approaches with advanced deep learning tech-
niques.

5.2 Future Work


The future scope of this research focuses on enhancing the practicality, adapt-
ability, and accuracy of fake news detection systems through several innovative
approaches. A key direction involves the development of a Chrome extension to
provide real-time fake news detection directly within users’ browsers. This tool
would utilize lightweight models like XGBoost for quick analysis while option-
ally connecting to server-based Large Language Models (LLMs) for more in-
depth semantic evaluations. Additionally, improving model accuracy remains
a central goal, which will be pursued through advanced hyperparameter tuning
methods, ensemble techniques, and fine-tuning of pre-trained models. Efforts
will also be made to create dynamic and robust datasets that are continuously
updated with emerging patterns of misinformation from diverse platforms, en-
suring the models remain adaptable to evolving challenges. Addressing biases
in datasets and improving model generalization across languages, regions, and
cultural contexts will further enhance the robustness of the system. Exploring
advanced deep learning architectures, such as transformer-based models and
graph neural networks, offers the potential for better semantic understanding
and relational insights. Furthermore, integrating real-time updates into the
training pipeline through techniques like online learning and active learning
will ensure the models stay relevant in the rapidly changing landscape of mis-
information. By combining these efforts, the future work aims to develop a
scalable, adaptive, and user-friendly solution that effectively tackles fake news
across various platforms while empowering end-users with reliable tools to
identify misinformation.

References
[1] Ayat Abodayeh, Reem Hejazi, Ward Najjar, Leena Shihadeh, and Rabia
Latif. Web scraping for data analytics: A beautifulsoup implementation.
In 2023 Sixth International Conference of Women in Data Science at
Prince Sultan University (WiDS PSU), pages 65–69. IEEE, 2023.

[2] Mussa Aman. Large language model based fake news detection. Procedia
Computer Science, 231:740–745, 2024.

[3] Carbonati. Understanding feedforward neural networks.
https://carbonati.github.io/posts/understanding-feedforward-neural-networks/,
2024. Accessed: 2024-11-24.

[4] Kaggle. Fake news detection dataset. https://www.kaggle.com/c/fake-news/data,
2024. Accessed: 2024-11-24.

[5] Zeba Khanam, BN Alwasel, H Sirafi, and Mamoon Rashid. Fake news
detection using machine learning approaches. In IOP conference series:
materials science and engineering, volume 1099, page 012040. IOP Pub-
lishing, 2021.

[6] Vadim Andreevich Kozhevnikov and Evgeniya Sergeevna Pankratova.
Research of the text data vectorization and classification algorithms of
machine learning. ISJ Theoretical & Applied Science, 5(85):574–585, 2020.

[7] Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. Large language
model agent for fake news detection. arXiv preprint arXiv:2405.01593,
2024.

[8] FY Osisanwo, JET Akinsola, O Awodele, JO Hinmikaiye, O Olakanmi,
J Akinjobi, et al. Supervised machine learning algorithms: classification
and comparison. International Journal of Computer Trends and Technology
(IJCTT), 48(3):128–138, 2017.

[9] Deepa Rani, Rajeev Kumar, and Naveen Chauhan. Study and compari-
sion of vectorization techniques used in text classification. In 2022 13th
International Conference on Computing Communication and Networking
Technologies (ICCCNT), pages 1–6. IEEE, 2022.

[10] Daniel Svozil, Vladimir Kvasnicka, and Jiri Pospichal. Introduction to
multi-layer feed-forward neural networks. Chemometrics and intelligent
laboratory systems, 39(1):43–62, 1997.

[11] Aswini Thota, Priyanka Tilak, Simrat Ahluwalia, and Nibrat Lohia. Fake
news detection: a deep learning approach. SMU Data Science Review,
1(3):10, 2018.

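[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you
need. Advances in Neural Information Processing Systems, 30, 2017.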

[13] Analytics Vidhya. Confusion matrix in machine learning. Analytics
Vidhya, 2024. Accessed: 2024-11-27.
