Index

1 Introduction
  1.1 Applications
  1.2 Motivation
  1.3 Objectives
  1.4 Organization of the Report
2 Literature Survey
  2.1 FakeNewsIndia: A Benchmark Dataset of Fake News Incidents in India, Collection Methodology and Impact Assessment in Social Media
  2.2 Fake News Detection Using Machine Learning Approaches
  2.3 Web Scraping for Data Analytics: A BeautifulSoup Implementation
  2.4 Fake News Detection: A Deep Learning Approach
  2.5 Large Language Model Based Fake News Detection
  2.6 Attention Is All You Need
  2.7 Large Language Model Agent for Fake News Detection
3 Proposed Algorithm
  3.1 Problem Statement
    3.1.1 Challenges and Limitations
  3.2 Flowchart
  3.3 Data Collection
  3.4 Data Processing and Feature Extraction
  3.5 Machine Learning Models
    3.5.1 Support Vector Machines (SVMs)
    3.5.2 Naïve Bayes
    3.5.3 Extreme Gradient Boosting
    3.5.4 Feed-forward Neural Network
    3.5.5 Long Short-Term Memory
  3.6 Transformer-Based Large Language Model Approach
    3.6.1 Introduction
    3.6.2 The Transformer Architecture
    3.6.3 Pipeline
    3.6.4 Pre-trained Models
    3.6.5 Fine-Tuning Process
4 Simulation and Results
  4.1 Support Vector Machine (SVM)
  4.2 XGBoost
  4.3 Naive Bayes
  4.4 Long Short-Term Memory (LSTM)
  4.5 Large Language Model (LLM)
  4.6 Comparative Analysis
  4.7 Comparisons

List of Tables
1 Performance metrics for different models for fake news detection.

List of Figures
1 Working Flowchart
2 Transformer Architecture [12]
3 Confusion Matrices for SVM, Naive Bayes, XGBoost, LSTM, and LLM [5]
1 Introduction
The growing prevalence of fake news in the digital era poses a serious chal-
lenge, undermining public trust, social cohesion, and democratic principles.
Fake news consists of intentionally false or misleading information presented
as truth, often created to manipulate readers for political, social, or financial
purposes. The rapid spread of news across social media platforms has made
detecting and curbing fake news increasingly complex, as such content fre-
quently employs exaggerated language and eye-catching headlines to attract
attention.
This project focuses on building an automated system for detecting fake
news using machine learning and natural language processing (NLP) tech-
niques. The system classifies news articles as either fake or genuine by analyz-
ing textual data from a dataset obtained through Kaggle and web scraping.
It employs features such as TF-IDF, cosine similarity, and n-grams to iden-
tify patterns indicative of misinformation. The model incorporates various
algorithms, including Naive Bayes, Support Vector Machines (SVM), Neural
Networks, Long Short-Term Memory (LSTM), and BERT NLP, to ensure ac-
curate detection across diverse news categories.
By delivering a scalable and precise solution, the project aims to combat
misinformation effectively, thereby contributing to improved media credibility
and heightened public awareness in the digital realm.
1.1 Applications
Fake news detection technology has found widespread applications across var-
ious sectors, helping to mitigate the harmful impact of misinformation on so-
ciety. In the media and journalism industries, these technologies play a crucial
role in maintaining the integrity of information. By verifying the accuracy of
stories before publication, they help prevent the spread of misleading narra-
tives. Automated fact-checking tools, powered by advanced machine learning
algorithms, assist journalists in cross-referencing information, ensuring that
only credible content is shared with the public.
By swiftly identifying and addressing misinformation, they not only protect
individual users from false narratives but also prevent the amplification of
misinformation on a global scale.
Furthermore, in global security and emergency response scenarios, fake
news detection is critical for ensuring accurate and timely information. Mis-
information during emergencies can lead to chaos, hamper rescue efforts, and
put lives at risk.
1.2 Motivation
The pervasive impact of misinformation on individuals, societies, and gover-
nance in India highlights the critical need for effective fake news detection
systems tailored to the Indian context. In a nation characterized by its lin-
guistic diversity and cultural complexity, the rapid spread of false information
often exacerbates communal tensions, incites violence, and perpetuates stereo-
types, frequently targeting vulnerable groups. Politically, fake news has been
weaponized to influence elections, polarize public opinion, and erode trust in
democratic processes. Economically, misinformation has disrupted markets
and influenced consumer behavior, harming businesses that depend on credi-
bility and public trust.
This project addresses the specific challenges of detecting fake news in the
Indian context by focusing on text-based data. A dataset compiled from Indian
fact-checking platforms and web scraping is utilized to examine instances of
fake news that reflect the varied narratives within India’s digital landscape.
Advanced machine learning algorithms and natural language processing (NLP)
techniques are employed to identify patterns and linguistic traits unique to
misinformation in this region. This effort aims to deepen insights into the
nature of fake news, empower Indian citizens to make well-informed decisions,
restore confidence in credible information sources, and bolster the resilience of
democratic and social systems against the growing threat of misinformation.
1.3 Objectives
The increasing spread of fake news in the digital era presents critical chal-
lenges to maintaining information integrity, public trust, and societal stability.
The rapid dissemination of misinformation through social media platforms and
news aggregators underscores the urgent need for automated systems to iden-
tify and counteract false information. This project addresses these challenges
by employing a variety of machine learning models to develop a scalable and
effective Fake News Detection System.
The system utilizes algorithms such as Support Vector Machines (SVM),
Naive Bayes, Random Forest (RF), XGBoost, Neural Networks (built with
TensorFlow and Keras), Long Short-Term Memory (LSTM) networks, and
7
Bidirectional Encoder Representations from Transformers (BERT). By ana-
lyzing and comparing the performance of these models, the project seeks to
determine the most accurate and efficient approach for detecting fake news in
real-time scenarios. In addition to accuracy, the focus extends to scalability
and computational efficiency, ensuring practical deployment on platforms like
social media and news aggregators.
The project also investigates the role of data preprocessing techniques,
including text summarization and feature selection, in improving detection
accuracy and reducing false positives. By addressing these factors, the initia-
tive contributes to ongoing efforts to fortify information ecosystems, mitigate
the impact of fake news, and foster a more reliable and trustworthy digital
environment.
2 Literature Survey
Although machine learning and natural language processing advancements have
enhanced the accuracy of fake news detection, several obstacles remain. These
include biases in datasets, the substantial computational demands of deep
learning models, and the intricacies of analyzing content across multiple lan-
guages. While lightweight models such as Naive Bayes offer efficiency, their
performance often lags behind more resource-intensive architectures like Long
Short-Term Memory (LSTM) networks and Transformers. Furthermore, the
ever-changing nature of misinformation requires detection models to be flexi-
ble and capable of adapting to emerging deceptive content. Overcoming these
challenges is crucial to developing effective and scalable systems that can ad-
dress misinformation in a constantly evolving digital environment.
2.2 Fake News Detection Using Machine Learning Approaches
Commonly used linguistic and statistical features provide insights into the semantic and syntactic aspects of text. Advanced
word embeddings like Word2Vec and GloVe further enhance performance by
capturing deeper contextual and semantic relationships.
The quality and variety of datasets significantly impact the effectiveness of
fake news detection. Benchmark datasets such as LIAR, which include labeled
statements and metadata, are widely used for these tasks. Datasets from plat-
forms like Kaggle and PolitiFact also provide valuable resources but may be
influenced by domain-specific biases. Creating balanced datasets with equal
representation of fake and real news remains a persistent challenge. Incorpo-
rating real-time data from social media adds another layer of complexity due
to the noisy and unstructured nature of such text.
Recent advancements in the field include the use of deep learning mod-
els like Recurrent Neural Networks (RNNs) and transformers such as BERT.
These models are particularly adept at capturing long-term dependencies and
contextual subtleties, enabling them to identify nuanced misinformation. How-
ever, their high computational demands and dependency on large datasets
present practical challenges.
Ensemble approaches combining multiple classifiers, such as Random Forest
and Naïve Bayes or K-Nearest Neighbors (KNN), have shown potential in
enhancing precision and recall. Hybrid models that merge linguistic features
with statistical techniques offer a balanced solution, with some achieving over
95% accuracy in specific applications. Despite these advancements, ensuring
scalability and adaptability across diverse datasets and languages remains an
open area for development.
Although significant progress has been made in fake news detection, several
challenges persist, including handling multilingual content, countering adver-
sarial attacks, and mitigating biases in algorithms. Future directions in this
field include integrating real-time detection capabilities with social media mon-
itoring tools and developing cross-domain, multilingual models. By leveraging
advances in machine learning, NLP, and data science, researchers aim to create
robust solutions to combat the ongoing challenge of fake news.
2.3 Web Scraping for Data Analytics: A BeautifulSoup Implementation
Using the BeautifulSoup library, a web scraper was designed to gather specific product-related
data from Amazon, including names, prices, ratings, reviews, and links. This
implementation combines BeautifulSoup with Python’s Requests library to
perform HTML parsing and employs PySimpleGUI and Matplotlib for cre-
ating an interactive interface and data visualizations. The approach enables
quick analysis and presentation of data, such as price trends and review dis-
tributions, making it suitable for small-scale analytics tasks. BeautifulSoup
provides robust capabilities for parsing HTML, constructing parse trees, and
extracting targeted data. Its lightweight design and focus on efficiency make
it an ideal choice for static content scraping.[1] Unlike more complex tools
like Selenium or Puppeteer, which are used for dynamic content extraction or
circumventing sophisticated security measures, BeautifulSoup simplifies data
gathering while minimizing computational demands. This enables the tool to
process data quickly, making it particularly useful for scenarios that do not
require extensive browser automation or handling dynamic web interactions.
The scraper showcases its ability to navigate and extract data from mul-
tiple Amazon pages, providing a streamlined method for analyzing product
information. It generates visualizations, such as bar charts and histograms, to
display price frequencies and review counts, facilitating decision-making pro-
cesses. By focusing on static web pages, the implementation ensures efficient
performance, achieving its objectives in under 11 seconds for a typical run,
including data scraping, analysis, and visualization.
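As an illustration of the requests-plus-BeautifulSoup pattern described here, the following Python sketch fetches a static page and extracts elements by CSS class; the URL and the class names are placeholders chosen for the example rather than the selectors used against Amazon, whose page structure varies and is protected by anti-scraping measures.

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"        # placeholder for a static product-listing page
HEADERS = {"User-Agent": "Mozilla/5.0"}     # many sites reject requests without a user agent

response = requests.get(URL, headers=HEADERS, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Extract a name and a price from each product card (the class names are illustrative)
for card in soup.find_all("div", class_="product-card"):
    name_tag = card.find("h2")
    price_tag = card.find("span", class_="price")
    name = name_tag.get_text(strip=True) if name_tag else "n/a"
    price = price_tag.get_text(strip=True) if price_tag else "n/a"
    print(name, price)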
Despite its strengths, the scraper has some limitations. It relies on ex-
act product names to initiate searches, which restricts its usability for generic
queries. The implementation also depends heavily on the existing structure
of the website’s HTML, making it vulnerable to changes in the DOM (Docu-
ment Object Model). Additionally, the tool is not designed to handle dynamic
web pages, limiting its scope to static content. Enhancements could involve
enabling broader search functionalities, adapting the scraper to different web-
sites, and improving the interface with advanced visualization libraries like
Seaborn for greater flexibility and interactivity.
The integration of data visualization with web scraping enhances the tool’s
utility by transforming raw data into actionable insights. This feature is par-
ticularly valuable in analytics projects where clear and concise data represen-
tation is crucial. By visualizing key metrics, such as price distributions and
customer feedback patterns, the tool simplifies complex datasets, making them
easier to interpret and apply to decision-making. Further developments could
expand the tool’s capabilities to handle dynamic content or integrate advanced
scraping techniques for greater adaptability. Additionally, comparisons with
other web scraping frameworks, such as Selenium or JavaScript-based tools,
could provide valuable insights into optimizing performance and efficiency.
Adapting the scraper for use across different domains, such as e-commerce,
social media, or research platforms, would broaden its applicability and make
it a versatile solution for diverse data analytics needs. The implementation
of this web scraper highlights the potential of BeautifulSoup for creating effi-
cient, lightweight solutions for data extraction and visualization. By focusing
on specific tasks with clear objectives, it provides a practical framework for
handling small to medium-scale analytics projects while offering avenues for
further enhancement and application.
2.4 Fake News Detection: A Deep Learning Approach
This work evaluates feature extraction techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW). Among
these, TF-IDF was particularly effective, capturing both local and global word
importance to enhance the model’s ability to distinguish between fake and
genuine news.
Three deep learning architectures were explored: dense neural networks
(DNN) with TF-IDF vectors, BoW-based models, and pre-trained Word2Vec
embeddings. The TF-IDF with DNN approach achieved the highest accuracy
of 94.31%, outperforming BoW and Word2Vec-based models.[6] The combi-
nation of TF-IDF’s simplicity and the capacity of dense networks to learn
semantic relationships proved to be a robust solution for stance detection.
Additionally, the inclusion of cosine similarity between headline-article pairs
enriched the feature set by providing extra contextual information.
Despite these advancements, some challenges remain. The model struggled
with the disagree stance, achieving only 44.38% accuracy, which highlights the
difficulty in interpreting subtle contextual differences. To address overfitting
and enhance generalization, techniques such as dropout, L2 regularization,
and early stopping were employed. However, the reliance on textual data
alone limited the model’s ability to identify context-dependent or satirical
misinformation.
This study illustrates the potential of combining TF-IDF with deep learn-
ing architectures for effective fake news detection. Future research will aim
to extend this approach to platforms like Twitter and Facebook, which pose
unique linguistic challenges. Enhancing feature engineering and incorporating
multimodal data, including images and metadata, will be key steps in building
comprehensive systems to combat fake news.
2.5 Large Language Model Based Fake News Detection
This work adapts a large language model to fake news detection through parameter-efficient fine-tuning.
The methodology involves implementing a self-instruct structure that di-
vides instructions into alignment and fake news detection tasks. These are pro-
cessed using a parameter-efficient fine-tuning technique on the Llama model,
prioritizing computational feasibility without sacrificing accuracy. Techniques
such as mixed precision and quantization are employed to optimize memory
usage, reducing hardware requirements significantly. The model is designed
to handle binary classification tasks, distinguishing between true and false in-
formation with logical reasoning and text comprehension capabilities. This
method builds upon existing techniques by addressing key limitations. For
instance, earlier approaches, such as those employing n-gram linguistic analy-
sis or feature fusion, faced challenges in scalability and accuracy. Supervised
learning models, although effective in certain contexts, still required man-
ual fact-checking, while hybrid deep learning frameworks combined multiple
features but relied heavily on dataset characteristics. The proposed model
advances the field by focusing on parameter efficiency and alignment with
human input, ensuring a balance between performance and practicality. In
experimental evaluations, the fine-tuned Llama model demonstrated its abil-
ity to accurately classify fake news, achieving high precision and recall metrics
across datasets. The results highlight its potential in understanding complex
language patterns and detecting disinformation in multimodal formats. By in-
corporating additional features, such as sentiment analysis and logical reason-
ing, the model enhances its ability to interpret nuanced information, making it
a valuable tool in combating the spread of fake news. Challenges remain, par-
ticularly in distinguishing increasingly sophisticated deepfake content, where
human reviewers often struggle. Addressing these challenges involves integrat-
ing multimodal inputs, such as combining text and visual data, and further
refining the model’s capabilities. Future work includes scaling the framework
for larger datasets and improving generalizability across platforms. Exploring
adapter methods and alternative architectures for larger models could further
optimize resource usage and enhance effectiveness.
This framework underscores the importance of aligning AI systems with
ethical considerations and user needs while emphasizing the critical role of
advanced natural language processing in safeguarding the information ecosys-
tem. By leveraging the power of large language models with innovative tun-
ing strategies, this approach provides a foundation for scalable, accurate, and
resource-efficient fake news detection systems.
2.6 Attention Is All You Need
The Transformer model, introduced as a groundbreaking architecture for se-
quence transduction tasks, represents a major advancement in natural lan-
guage processing by relying exclusively on attention mechanisms, eliminating
the need for traditional recurrent or convolutional neural networks. This ap-
proach has led to significant improvements in both computational efficiency
and performance, especially in machine translation tasks. Its key innovation,
self-attention, allows the model to capture global dependencies in sequences
without relying on sequential processing. By removing the constraints of re-
currence and convolution, the Transformer facilitates extensive parallelization,
drastically reducing training time while delivering high accuracy.
A fundamental aspect of the Transformer is its encoder-decoder structure
and the use of scaled dot-product attention. The encoder converts input se-
quences into continuous representations, while the decoder utilizes these repre-
sentations to generate output sequences. Both components incorporate multi-
head self-attention and position-wise feed-forward layers.[12] Multi-head at-
tention enables the model to simultaneously focus on different parts of the
input, effectively capturing a variety of linguistic relationships across tokens.
To handle sequence order, positional encodings are added to input embeddings
using sinusoidal functions, which efficiently encode relative positions within the
sequence.
The architecture’s efficiency is underscored by its ability to link all posi-
tions in a sequence through constant-time operations, in contrast to recurrent
layers that require sequential computation proportional to sequence length.
This feature, combined with the ability to model long-range dependencies, has
enabled the Transformer to achieve state-of-the-art performance on translation
benchmarks. Notably, it achieved BLEU scores of 28.4 and 41.0 for English-to-
German and English-to-French translation tasks, respectively, outperforming
previous models while incurring significantly lower computational costs.
Training the Transformer involves advanced techniques such as the Adam
optimizer with custom learning rate schedules and regularization methods like
dropout and label smoothing. These strategies mitigate overfitting and im-
prove generalization. Parameter sharing between embedding layers and pre-
softmax linear transformations further reduces the model’s memory usage and
computational requirements.
Beyond translation, the Transformer framework exhibits versatility across
a range of applications, including text summarization, question answering, and
even non-text tasks like image and audio processing. Its modular design, fea-
turing components like attention mechanisms and feed-forward layers, allows
for adaptation to diverse domains. Additionally, studies on attention mech-
anisms have revealed that individual attention heads specialize in capturing
distinct syntactic and semantic structures, enhancing the model’s interpretabil-
ity.
Despite its strengths, the Transformer faces challenges, such as processing
very long sequences and scaling to multimodal tasks. Future research could
explore techniques like restricted self-attention and localized attention mecha-
nisms to address these limitations. The open-source release of the Transformer
model has encouraged widespread adoption and innovation, paving the way for
further advancements in deep learning.
By reimagining sequence processing with attention-focused principles, the
Transformer has become a transformative force in machine learning. It drives
efficiency and accuracy in complex sequence modeling tasks, marking a signif-
icant milestone in the evolution of artificial intelligence.
2.7 Large Language Model Agent for Fake News Detection
FactAgent equips a large language model with a structured set of tools, including one that assesses the credibility
of news sources based on their domain history. These tools work in tandem
to provide a comprehensive evaluation of news veracity. The methodology of
FactAgent relies on a self-instruct framework, where LLMs execute predefined
workflows designed using domain knowledge. This structured approach enables
the system to address specific challenges, such as misinformation in politically
charged content, by employing targeted tools like the Standing Tool, which
identifies political biases and partisan narratives. This modularity makes Fac-
tAgent highly adaptable, allowing updates and customization for various news
domains and contexts.
Experimental results demonstrate FactAgent’s superiority over traditional
methods like LSTM, BERT, and TextCNN, as well as hierarchical prompt-
ing frameworks like HiSS. FactAgent achieves higher accuracy and F1 scores
across multiple datasets, including PolitiFact, GossipCop, and Snopes, un-
derscoring its effectiveness in identifying misinformation. The integration of
external search tools within the workflow significantly enhances performance
by mitigating the limitations of LLMs’ internal knowledge, such as hallucina-
tions or biases. Another critical advantage of FactAgent is its transparency. At
every step, the system provides explicit explanations and reasoning, enabling
users to understand the basis of its decisions. This interpretability is vital for
building trust in automated fact-checking systems, as it allows stakeholders to
verify the rationale behind veracity assessments.
Future enhancements for FactAgent could include incorporating multi-
modal data—such as visual or design elements of web content—and analyzing
full-text articles rather than just headlines. Additionally, integrating social
context data, like retweet patterns, could further improve its detection capabil-
ities. As misinformation evolves, FactAgent’s flexible workflow design ensures
it remains a robust and scalable solution for addressing the global challenge of
fake news. By leveraging LLMs’ reasoning and contextual abilities, combined
with external evidence retrieval, FactAgent sets a new standard for automated,
explainable, and efficient fake news detection.
3 Proposed Algorithm
3.1 Problem Statement
This project addresses the pervasive issue of fake news and the complexities in-
volved in effectively identifying and mitigating its impact. The rapid expansion
of digital platforms has provided unparalleled access to information, making
it increasingly challenging to differentiate between authentic news and false
or misleading content. Fake news exploits the speed and reach of these plat-
forms, allowing misinformation to spread rapidly to diverse audiences. This
widespread dissemination can significantly influence public perception, disrupt
decision-making processes, and lead to serious social, political, and economic
repercussions. Accurately detecting and limiting the spread of fake news is
essential, as unchecked propagation erodes trust in credible sources, deepens
societal divides, and creates widespread confusion. Tackling this issue requires
a comprehensive approach that integrates advanced technological solutions
with a thorough understanding of the underlying factors driving the creation
and spread of fake news.
3.2 Flowchart
Figure 1: Working Flowchart
3.3 Data Collection
The data collection process serves as the cornerstone of our fake news
detection system, as the dataset’s quality and variety significantly influence
model performance. For this project, we employed two main sources to build
a comprehensive and diverse dataset: Kaggle and web scraping.
From Kaggle, we obtained a well-curated dataset comprising 21,417 real
news articles and 23,481 fake news articles, offering a balanced repre-
sentation of both types.[4] This dataset included essential attributes for each
news article, facilitating detailed analysis and feature extraction. Each record
in the dataset contained attributes describing the article, such as its title, body
text, subject category, and publication date.
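For illustration, loading and labelling such a dataset typically looks like the short pandas sketch below; the file names Fake.csv and True.csv are assumptions based on the usual layout of this Kaggle dataset rather than a confirmed detail of the project.

import pandas as pd

# Assumed file layout of the Kaggle dataset: one CSV of fake and one CSV of real articles
fake_df = pd.read_csv("Fake.csv")
real_df = pd.read_csv("True.csv")

fake_df["label"] = 1   # 1 = fake
real_df["label"] = 0   # 0 = real

# Combine and shuffle so that the two classes are interleaved before training
df = pd.concat([fake_df, real_df], ignore_index=True).sample(frac=1, random_state=42)
print(df.shape)
print(df["label"].value_counts())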
3.4 Data Processing and Feature Extraction
The data processing and feature extraction phase was essential for transforming
raw text into a structured and meaningful format, enabling effective machine
learning analysis. This phase began with comprehensive data preprocessing
to ensure the dataset was clean, consistent, and ready for feature extraction.
All text was converted to lowercase to maintain uniformity, avoiding distinctions
between variations like “Fake” and “fake.” Common stop words such as
“is,” “the,” and “and,” which provide little contextual value, were removed
using Python’s NLTK library. Punctuation was stripped to simplify the text
while retaining its core meaning. The cleaned text was then tokenized using
the SpaCy library, breaking it into smaller units (tokens), which allowed
the model to process individual words effectively.[9] Further, stemming and
lemmatization were applied to standardize words to their root forms, ensuring
variations like “running,” “runner,” and “ran” were treated uniformly as “run.”
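A minimal Python sketch of these preprocessing steps is shown below; it assumes the NLTK stopword corpus and the spaCy model en_core_web_sm are installed, and it is a simplified illustration rather than the exact pipeline used in the project.

import string
import nltk
import spacy
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)                          # stop-word list used for filtering
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])   # tokenizer and lemmatizer only

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, strip punctuation, remove stop words, then tokenize and lemmatize."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    doc = nlp(text)
    return [tok.lemma_ for tok in doc if tok.text not in STOP_WORDS and not tok.is_space]

print(preprocess("The runners were running while a runner ran."))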
After preprocessing the data, advanced feature extraction techniques were
utilized to transform the textual information into numerical representations
for analysis. Term Frequency-Inverse Document Frequency (TF-IDF) was em-
ployed to measure the significance of words within an article in relation to
their frequency across the dataset. Both unigrams (single words) and bigrams
(pairs of words) were extracted to identify meaningful patterns, while n-grams,
including trigrams (three-word sequences), captured contextual relationships
indicative of fake news tendencies. Features such as word and character counts
were also calculated to evaluate verbosity, a trait commonly associated with
fake news content. Furthermore, cosine similarity was computed to assess the
semantic consistency between the headline and the article body. Higher simi-
larity scores often indicated coherence characteristic of genuine news, whereas
lower scores highlighted potential misinformation.
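The feature-extraction step can be illustrated with scikit-learn as follows; the n-gram range, vocabulary size, and example texts are assumptions chosen for the sketch, not the project's exact configuration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

headlines = ["Government announces new policy on fuel prices"]
bodies = ["The government on Monday announced a revised policy on fuel prices across the country."]

# TF-IDF over unigrams, bigrams and trigrams, fitted on headlines and bodies together
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=50000)
vectorizer.fit(headlines + bodies)

X_head = vectorizer.transform(headlines)
X_body = vectorizer.transform(bodies)

# Cosine similarity between each headline and its article body
similarity = cosine_similarity(X_head, X_body).diagonal()

# Simple verbosity features: word and character counts of the article body
word_counts = [len(b.split()) for b in bodies]
char_counts = [len(b) for b in bodies]
print(similarity, word_counts, char_counts)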
To address the imbalance in the dataset, where fake news articles out-
numbered real ones, stratified sampling was employed to ensure proportional
representation of both classes during training. This helped reduce the risk
of bias and ensured that the model could perform well across all types of
data. Through meticulous preprocessing and feature extraction, the dataset
was transformed into a structured and enriched format, retaining essential lin-
guistic and contextual information. This comprehensive preparation provided
the foundation for accurate and reliable predictions in distinguishing fake news
from real news.
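A stratified split of this kind can be expressed in a single call with scikit-learn; the feature matrix and labels below are placeholders standing in for the extracted features.

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and labels (1 = fake, 0 = real)
X = np.random.rand(100, 20)
y = np.array([1] * 55 + [0] * 45)

# stratify=y keeps the fake/real proportions identical in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())   # roughly equal class ratios in both splits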
3.5 Machine Learning Models
3.5.1 Support Vector Machines (SVMs)
Support Vector Machines classify data by finding a separating hyperplane and capture non-linear structure through a kernel function. A common choice is the radial basis function (RBF) kernel:

K(x, x′) = exp( −∥x − x′∥² / (2σ²) )    (1)
In document classification tasks, transforming text into numerical feature vectors allows the SVM to seek the maximum-margin separating hyperplane, which leads to the following optimization formulation.
arg max_{w,b} { (1/∥w∥) min_n [ t_n ( w⊤ϕ(x_n) + b ) ] }    (2)

t_n ( w⊤ϕ(x_n) + b ) ≥ 1,   n = 1, …, N    (3)
By leveraging this formulation, the SVM optimization problem can be solved flexibly and the two classes separated effectively. Introducing Lagrange multipliers a_n for the margin constraints gives the Lagrangian:
L(w, b, a) = (1/2)∥w∥² − Σ_{n=1}^{N} a_n { t_n ( w⊤ϕ(x_n) + b ) − 1 }    (4)

where

a_n ≥ 0,   n = 1, …, N
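In practice, the kernelised SVM described above can be trained with scikit-learn's SVC, whose RBF kernel corresponds to equation (1) with gamma playing the role of 1/(2σ²); the pipeline below is a hedged sketch with illustrative texts and hyperparameters rather than the project's tuned configuration.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

texts = ["breaking: miracle cure found overnight", "parliament passes the annual budget"]
labels = [1, 0]   # 1 = fake, 0 = real

# TF-IDF features followed by an RBF-kernel SVM (gamma corresponds to 1 / (2 * sigma^2))
svm_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SVC(kernel="rbf", C=1.0, gamma="scale"),
)
svm_clf.fit(texts, labels)
print(svm_clf.predict(["shocking secret they do not want you to know"]))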
3.5.2 Naïve Bayes
The Naïve Bayes classifier applies Bayes’ theorem under the assumption that features are conditionally independent given the class. For a class c and a feature vector x,

P(c | x) = P(x | c) P(c) / P(x)    (5)
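Applied to text, equation (5) is commonly realised as a multinomial Naïve Bayes over word counts; the snippet below is a minimal sketch along those lines, with illustrative example texts.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["aliens endorse political candidate", "central bank keeps interest rates unchanged"]
labels = [1, 0]   # 1 = fake, 0 = real

# Models P(c | x) proportional to P(x | c) P(c), with word counts as the feature vector x
nb_clf = make_pipeline(CountVectorizer(), MultinomialNB())
nb_clf.fit(texts, labels)
print(nb_clf.predict_proba(["bank announces new interest rates"]))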
3.5.3 Extreme Gradient Boosting
Extreme Gradient Boosting is a prominent ensemble learning algorithm com-
monly used for machine learning tasks, particularly regression and classification.[8]
It leverages the boosting technique to enhance accuracy by sequentially im-
proving weak models. XGBoost starts by building a decision tree, followed by
retaining residuals from previous trees. These residuals are used as inputs for
subsequent trees to correct prior errors, thereby refining the prediction accu-
racy. The process continues until the loss function is minimized or the specified
number of trees is reached, ultimately improving the model’s forecasting ca-
pability through gradient boosting.
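A boosted ensemble of this kind can be built with the xgboost package; the hyperparameters below (number of trees, depth, learning rate) are illustrative defaults rather than the values tuned for the experiments, and the data is a random placeholder.

import numpy as np
from xgboost import XGBClassifier

# Placeholder feature matrix and labels (1 = fake, 0 = real) standing in for the TF-IDF features
X = np.random.rand(200, 50)
y = np.random.randint(0, 2, size=200)

# Each new tree is fitted to the residual errors of the current ensemble
xgb_clf = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    objective="binary:logistic",
)
xgb_clf.fit(X, y)
print(xgb_clf.predict_proba(X[:2]))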
3.5.4 Feed-forward Neural Network
Each hidden layer applies a linear transformation followed by a non-linear activation,
where ReLU(x) = max(0, x). The final output is computed as

ŷ = f_out( W^(L) h^(L−1) + b^(L) ),

where ŷ is the predicted output, f_out is the activation function for the out-
put layer, and L is the total number of layers. The Adam optimizer was
employed for training, providing adaptive learning rates that improved conver-
gence and ensured stable learning. Additionally, batch normalization was
applied to standardize inputs to each layer, enhancing stability and reducing
training time.
The Keras-based neural network, while leveraging the same input data and
overall objectives, adopted a different architecture with hidden layers contain-
ing 256, 256, and 80 neurons, respectively. This gradual reduction in neuron
count allowed the network to refine and distill the learned features at each
stage, focusing on high-level abstractions as the data passed through the layers.
To address the common challenge of overfitting, dropout layers were strate-
gically introduced after each hidden layer.[10] Dropout works by randomly
deactivating a fraction of neurons during training, ensuring the network learns
more generalized patterns. A dropout rate of 0.1 was selected to balance reg-
ularization and model capacity. The ReLU activation function was similarly
applied to the Keras model, enhancing its ability to model nonlinearity and
adapt to the diverse patterns in the input data.
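The Keras variant described above can be sketched as follows; the 256-256-80 layer widths, ReLU activations, and 0.1 dropout rate follow the description in the text, while the input dimensionality and the sigmoid output layer are assumptions made for the example.

from tensorflow import keras
from tensorflow.keras import layers

INPUT_DIM = 5000   # placeholder size of the TF-IDF feature vector

model = keras.Sequential([
    layers.Input(shape=(INPUT_DIM,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.1),                       # regularization after each hidden layer
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(80, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(1, activation="sigmoid"),     # probability that the article is fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()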
Both neural network implementations were meticulously designed to com-
plement each other’s strengths. The TensorFlow model excelled in handling
larger feature spaces and capturing deeper patterns, while the Keras model
demonstrated superior adaptability and generalization. Together, these mod-
els addressed the complexities of fake news detection, achieving high accuracy
and reliability.
The integration of these feed-forward neural networks underscores their piv-
otal role in addressing the challenges of misinformation in the digital age. By
leveraging thoughtfully designed architectures, optimizing training processes,
and employing advanced activation and regularization techniques, these net-
works effectively captured the nuanced relationships within the dataset. This
highlights the potential of deep learning to tackle real-world problems in NLP,
paving the way for further advancements in the field.
3.5.5 Long Short-Term Memory
The introduction of Long Short-Term Memory (LSTM) networks, pioneered
by Hochreiter and Schmidhuber, marked a significant advancement in sequence
modeling by enabling the effective handling of long-term dependencies in data.
LSTMs are specifically designed to retain relevant past inputs and combine
them with current inputs to make accurate predictions. This capability is
particularly advantageous for tasks involving serialized data, such as natural
language processing (NLP). In the domain of fake news detection, where word
order and sentence structure play a crucial role, LSTMs are well-suited to
capture the intricate relationships inherent in textual data.
To maximize the potential of LSTMs, a robust preprocessing strategy was
implemented. Unlike methods that aggregate entire documents into a single
vector, such as Doc2Vec, our pipeline emphasized preserving word order to
retain the sequential nature of the data. Word embeddings were selected as
the primary method for numerical representation due to their ability to main-
tain the relative positions of words while encoding semantic meanings. This
choice ensured that critical contextual information was preserved for effective
processing by the LSTM.
The preprocessing phase began with comprehensive text cleaning, system-
atically removing irrelevant characters, symbols, and extraneous content. Only
meaningful components, such as letters and numbers, were retained for further
analysis. The dataset’s most frequent words were then identified and ranked
based on their occurrence within the training data. The top 5,000 most com-
mon words were selected and assigned unique integer IDs, providing a compact
yet informative vocabulary that balanced computational efficiency with ade-
quate textual coverage. Rare words were excluded, as their contribution to the
overall context was minimal.
To meet the LSTM’s requirement for fixed-length input vectors, each article
was converted into a numerical sequence of integers. Articles exceeding a
predefined length of 500 words were truncated, while shorter ones were padded
with zeros at the beginning to ensure uniform vector dimensions. Articles with
insufficient word counts were omitted to maintain the robustness of the training
data.
Word embeddings were further employed to map each word ID to a 32-
dimensional vector, enriching the numerical representation by encoding seman-
tic relationships between words. This approach allowed the model to interpret
contextual similarities and nuances more effectively. Words frequently occur-
ring in similar contexts were represented by vectors closer in the embedding
space, enabling the LSTM to understand linguistic relationships beyond the
immediate sequence of words.
The resulting dataset, transformed into structured and fixed-length matri-
ces, was optimized for the LSTM architecture’s requirements. These numerical
representations captured both the sequential and semantic properties of the
text, ensuring that the intrinsic meaning was preserved. This processed data
was then fed into the LSTM network for training.
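Putting these steps together, a minimal Keras sketch of the LSTM pipeline might look as follows; the 5,000-word vocabulary, 500-token sequence length, left-padding, and 32-dimensional embeddings follow the description above, while the number of LSTM units and the toy training data are illustrative.

import numpy as np
from collections import Counter
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 5000, 500, 32

train_texts = ["example cleaned article text about an election", "another cleaned article about sports results"]
train_labels = np.array([1, 0])   # 1 = fake, 0 = real

# Rank words by frequency and keep integer IDs for the most common ones (0 is reserved for padding)
counts = Counter(word for text in train_texts for word in text.split())
word_ids = {w: i + 1 for i, (w, _) in enumerate(counts.most_common(VOCAB_SIZE - 1))}
sequences = [[word_ids[w] for w in text.split() if w in word_ids] for text in train_texts]

# Truncate long articles and left-pad short ones with zeros to a fixed length of 500 tokens
X = keras.utils.pad_sequences(sequences, maxlen=MAX_LEN, padding="pre", truncating="post")

model = keras.Sequential([
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),   # 32-dimensional word vectors
    layers.LSTM(64),                                                # illustrative number of units
    layers.Dense(1, activation="sigmoid"),                          # probability that the article is fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, train_labels, epochs=1, verbose=0)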
3.6 Transformer-Based Large Language Model Approach
3.6.1 Introduction
Fake news, defined as deliberately fabricated or misleading information pre-
sented as fact, has become a pervasive issue in the digital era. Detecting fake
news is critical to preserving the integrity of public discourse and mitigating
the harmful effects of misinformation. However, the intricate nature of natural
language, with its subtle nuances, contextual variations, and hidden meanings,
presents significant challenges for traditional machine learning approaches.
To address these challenges, transformer-based architectures have emerged
as a groundbreaking solution in Natural Language Processing (NLP). Trans-
formers have set new standards in language understanding and generation by
effectively modeling long-range dependencies, bidirectional context, and com-
plex relationships between words in a text. This section outlines the function-
ality of transformers, their key architectural components, and their integration
into our fake news detection pipeline to enhance accuracy and performance.
3.6.2 The Transformer Architecture

Figure 2: Transformer Architecture [12]

Since the model contains no recurrence, token order is encoded by adding sinusoidal positional encodings to the input embeddings:

PE(pos, 2i) = sin( pos / 10000^(2i/d_model) ),   PE(pos, 2i+1) = cos( pos / 10000^(2i/d_model) ),

where pos represents the token position and i represents the embedding dimension.
The Transformer operates on an encoder-decoder framework. The en-
coder processes the input sequence and generates context-aware representa-
tions. Each encoder layer comprises a multi-head self-attention mechanism,
which captures dependencies within the sequence, and a feed-forward network
(FFN), which introduces non-linearity and enhances feature representation.
For classification tasks such as fake news detection, the decoder—primarily
used in sequence generation tasks like translation—is typically omitted.
At the core of the Transformer is the self-attention mechanism, which eval-
uates the significance of each word in the sequence relative to others. This is
calculated as:
Attention(Q, K, V) = softmax( QK⊤ / √d_k ) V    [12]   (11)
where Q, K, and V are the Query, Key, and Value matrices derived from input
embeddings, and d_k is the dimensionality of K. By applying attention weights,
the model focuses on relevant words, enabling contextual understanding.
Multi-head attention extends the self-attention mechanism by allowing multiple
attention heads to focus on different aspects of the input:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,   where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).   [12]
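For concreteness, scaled dot-product attention as given in equation (11) can be written in a few lines of NumPy; multi-head attention simply runs several such projections in parallel and concatenates the results.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for single-head, unbatched inputs."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V

# Toy example: a sequence of 4 tokens with 8-dimensional queries, keys, and values
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)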
3.6.3 Pipeline
Using transformers for fake news detection involves adapting pre-trained lan-
guage models to classify news articles as either fake or real. This process en-
compasses multiple steps, ranging from data preparation to model inference,
ensuring both high accuracy and computational efficiency.
3.6.4 Pre-trained Models
GPT (Generative Pre-trained Transformer), primarily designed as a gener-
ative model, can be fine-tuned for fake news detection by framing the task as
a sequence classification problem. This adaptation allows GPT to effectively
classify news content as either fake or real.
3.6.5 Fine-Tuning Process
The pre-trained model is fine-tuned for binary classification by minimizing the cross-entropy loss

L = −(1/N) Σ_{i=1}^{N} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ],

where y_i is the true label and ŷ_i is the predicted probability. The AdamW
optimizer is used to update the model weights, which improves convergence
by combining adaptive learning rates with weight decay.
For inference, the input text is preprocessed (tokenization and padding),
passed through the fine-tuned transformer model, and a probability distri-
bution is generated. The label with the higher probability is chosen. If
P (fake) > P (real), the text is classified as fake; otherwise, it is classified
as real.
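A hedged sketch of this inference step using the Hugging Face transformers library is given below; the checkpoint path is a placeholder for whichever fine-tuned model is used, and the assumption that index 1 corresponds to the fake class is made purely for illustration.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "path/to/fine-tuned-checkpoint"   # placeholder for the fine-tuned BERT/GPT model

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR, num_labels=2)
model.eval()

def classify(text):
    # Tokenize with truncation and padding, run the model, and compare the two class probabilities
    inputs = tokenizer(text, truncation=True, padding=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    return "fake" if probs[1] > probs[0] else "real"   # assumes label index 1 means fake

print(classify("Scientists confirm chocolate cures all known diseases"))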
4 Simulation and Results
In the results and evaluation phase, various performance metrics were em-
ployed to evaluate the effectiveness of the models in fake news detection. These
metrics include Precision, Recall, and F1 Score. Each metric provides valu-
able insights into the model’s performance in correctly classifying fake and real
news. They are calculated using the following equations:

Precision = TP / (TP + FP),   Recall = TP / (TP + FN),

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. The F1 score combines precision and recall into a single measure, balancing
the impact of both false positives and false negatives. The formula for the F1
score is:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall),  [13]    (17)
where the Precision and Recall are calculated as defined above.
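These metrics can be computed directly from model predictions with scikit-learn; the label arrays below are placeholders rather than results from the experiments.

from sklearn.metrics import precision_score, recall_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # placeholder ground-truth labels (1 = fake, 0 = real)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # placeholder model predictions

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
print(confusion_matrix(y_true, y_pred))       # the basis of the matrices shown in Figure 3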
4.2 XGBoost
XGBoost, a gradient boosting algorithm, showed a remarkable improvement in
recall R = 0.92, meaning it detected a higher proportion of fake news articles
compared to SVM. The precision P = 0.90 was also higher, resulting in a
better F1 score:
F1 = 2 × (0.90 × 0.92) / (0.90 + 0.92) ≈ 0.91.
Thus, XGBoost proved to be the most well-rounded model for fake news
detection, demonstrating a strong balance between precision and recall. It
achieved an accuracy of A = 0.93, the highest among all the models evaluated.
4.3 Naive Bayes
The Naive Bayes classifier showed acceptable results, but its performance was
not as high as SVM or XGBoost. The model achieved an accuracy A = 0.77,
with precision P = 0.74 and recall R = 0.72. The F1 score was calculated as:
F1 = 2 × (0.74 × 0.72) / (0.74 + 0.72) ≈ 0.73.
Although Naive Bayes is computationally efficient, its lower recall suggests
that it misses a significant portion of fake news articles, thus limiting its per-
formance in comparison to more complex models.
4.6 Comparative Analysis
The models were compared based on accuracy, precision, recall, and F1 score.
XGBoost and LLM stood out as the most effective models for fake news de-
tection. XGBoost, with its balanced performance, achieved an F1 score of
F1 = 0.91, while LLM, with a high F1 score of F1 = 0.94, was the top per-
former overall. SVM, while solid, had a lower recall, resulting in an F1 score of
F1 = 0.82. Naive Bayes, though fast and efficient, was the least effective, with
an F1 score of F1 = 0.73. LSTM performed well with an F1 score of F1 = 0.86,
demonstrating its ability to capture contextual information effectively.
In conclusion, XGBoost and LLM are the most promising models for fake
news detection. XGBoost achieved a strong balance between precision and
recall while requiring fewer computational resources. On the other hand, LLM
offered the highest accuracy and F1 score, showcasing its superior semantic
understanding. The choice of model depends on the trade-off between com-
putational resources and performance. Future work can focus on further tun-
ing these models to enhance their robustness and applicability to real-world
datasets.
Figure 3: Confusion Matrices for SVM, Naive Bayes, XGBoost, LSTM, and
LLM [5]
4.7 Comparisons
Table 1: Performance metrics for different models for fake news detection.
5 Conclusion and Future Work
5.1 Conclusion
This study provides a comprehensive analysis of fake news detection using
various machine learning models, highlighting their strengths, limitations, and
potential for practical deployment. Among the models evaluated, XGBoost
and Large Language Models (LLMs) emerged as the most effective solutions.
XGBoost demonstrated a commendable balance between precision, recall, and
computational efficiency, making it suitable for real-time applications. Mean-
while, LLMs achieved the highest F1 score (0.94), excelling in capturing the
semantic nuances of news articles, albeit at a higher computational cost. The
findings underscore the critical role of model selection in addressing the press-
ing issue of fake news.
While these models show promise, the study also emphasizes the impor-
tance of adapting to the evolving nature of misinformation. By continuously
updating datasets, optimizing model parameters, and addressing dataset bi-
ases, the robustness and generalizability of fake news detection systems can be
significantly enhanced. This research sets the stage for future innovations, aim-
ing to develop more accurate, adaptive, and scalable solutions that combine
traditional machine learning approaches with advanced deep learning tech-
niques.
5.2 Future Work
Future work will focus on extending the dataset to additional languages and
cultural contexts, which will further enhance the robustness of the system. Exploring
advanced deep learning architectures, such as transformer-based models and
graph neural networks, offers the potential for better semantic understanding
and relational insights. Furthermore, integrating real-time updates into the
training pipeline through techniques like online learning and active learning
will ensure the models stay relevant in the rapidly changing landscape of mis-
information. By combining these efforts, the future work aims to develop a
scalable, adaptive, and user-friendly solution that effectively tackles fake news
across various platforms while empowering end-users with reliable tools to
identify misinformation.
References
[1] Ayat Abodayeh, Reem Hejazi, Ward Najjar, Leena Shihadeh, and Rabia
Latif. Web scraping for data analytics: A beautifulsoup implementation.
In 2023 Sixth International Conference of Women in Data Science at
Prince Sultan University (WiDS PSU), pages 65–69. IEEE, 2023.
[2] Mussa Aman. Large language model based fake news detection. Procedia
Computer Science, 231:740–745, 2024.
[5] Zeba Khanam, BN Alwasel, H Sirafi, and Mamoon Rashid. Fake news
detection using machine learning approaches. In IOP conference series:
materials science and engineering, volume 1099, page 012040. IOP Pub-
lishing, 2021.
[7] Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. Large language
model agent for fake news detection. arXiv preprint arXiv:2405.01593,
2024.
[9] Deepa Rani, Rajeev Kumar, and Naveen Chauhan. Study and compari-
sion of vectorization techniques used in text classification. In 2022 13th
International Conference on Computing Communication and Networking
Technologies (ICCCNT), pages 1–6. IEEE, 2022.
[10] Daniel Svozil, Vladimir Kvasnicka, and Jiri Pospichal. Introduction to
multi-layer feed-forward neural networks. Chemometrics and intelligent
laboratory systems, 39(1):43–62, 1997.
[11] Aswini Thota, Priyanka Tilak, Simrat Ahluwalia, and Nibrat Lohia. Fake
news detection: a deep learning approach. SMU Data Science Review,
1(3):10, 2018.

[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you
need. In Advances in Neural Information Processing Systems, volume 30, 2017.