Conference Fake News Detection Techniques

Manpreet Kaur, Ishita Srejath, Nikhil Chaudhary
Dept. of Mathematics and Computing
Delhi Technological University
New Delhi, India
Abstract—The rapid spread of fake news across digital platforms poses significant risks to society, influencing public opinion, disrupting democratic processes, and fueling misinformation. This study explores the application of machine learning techniques for effective and scalable fake news detection. We evaluate the performance of traditional classifiers such as Logistic Regression, Support Vector Machines, and Random Forest, alongside deep learning models like BERT. Furthermore, we propose a hybrid incremental approach that integrates semantic encoding with real-time learning capabilities to adapt to evolving content. Our findings demonstrate that hybrid models outperform standalone algorithms in both accuracy and adaptability, achieving up to 99.31% accuracy.

Index Terms—Fake news detection, machine learning, deep learning, BERT, hybrid models, misinformation
I. INTRODUCTION

In today's hyper-connected world, the internet and social media have become dominant platforms for disseminating news and information. While this democratization has empowered individuals and accelerated access to knowledge, it has also led to a surge in the spread of misinformation and disinformation, commonly referred to as fake news. The consequences of fake news can be severe—manipulating public perception, influencing elections, undermining trust in institutions, and even inciting violence. Traditional fake news detection relies heavily on human fact-checkers and manual verification, which are time-consuming, costly, and unscalable in the face of the high volume and velocity of online content. As a result, there is an urgent need for automated, intelligent systems that can detect fake news efficiently and accurately.

In this research, we explore the application of various machine learning (ML) and deep learning (DL) techniques to detect fake news articles based on their textual content. We implement and evaluate a range of models—from conventional classifiers such as Logistic Regression (LR), Support Vector Machines (SVM), and Random Forests (RF), to advanced deep learning models like BERT (Bidirectional Encoder Representations from Transformers), known for their contextual understanding of language. Furthermore, we propose a hybrid incremental approach that combines the semantic strength of BERT with the adaptability of LightGBM to enable real-time learning. This approach addresses key challenges such as concept drift, adversarial text manipulation, and the need for continuous model updates in dynamic environments.

Our contributions are threefold: (1) a comprehensive comparative analysis of traditional and deep learning models for fake news detection; (2) the introduction of a hybrid incremental model capable of adapting to real-time data; and (3) empirical validation using multiple benchmark datasets and simulated streaming environments, showing improved performance and robustness. By integrating semantic, statistical, and adaptive learning components, this research aims to build a resilient and scalable framework for combating the growing threat of fake news in the digital age.

II. EASE OF USE

A. Implementation Simplicity

The fake news detection system has been designed with a strong emphasis on usability and accessibility for both novice and experienced users. The goal is to ensure that the system can be set up and executed with minimal effort, while maintaining flexibility for customization and expansion. The system is primarily developed in Python 3.10, a widely used language in the data science community, ensuring high compatibility and ease of maintenance.

The solution is cross-platform and has been tested on major operating systems including Ubuntu 22.04 LTS, Windows 10/11, and macOS Ventura, offering a consistent user experience across environments. It utilizes a modular architecture that separates major tasks into distinct modules — including data ingestion, preprocessing, feature extraction, model training, prediction, and evaluation — thereby promoting code clarity and facilitating individual component testing or replacement. Each module is well-commented and adheres to PEP 8 coding standards, aiding in readability and future collaboration.
Installation and setup are greatly simplified using a requirements.txt file, which automatically installs all necessary dependencies using the command pip install -r requirements.txt. Key libraries required include:
• scikit-learn==1.3.0 – for machine learning model implementation and evaluation.
• pandas==2.0.2 – for structured data manipulation and preprocessing.
• nltk==3.8.1 – for text tokenization, stopword removal, and other NLP utilities.
• joblib==1.3.2 – for model serialization and persistence.

The system is designed for minimal setup time — a new user can clone the repository, install dependencies, and start training or predicting with minimal configuration. Default configuration files provide commonly used parameters such as test/train splits, vectorizer settings, and classifier type, eliminating the need for manual parameter tuning for baseline runs.
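As an illustration of what such baseline defaults might contain, the following is a hypothetical configuration fragment; the key names and values are assumptions made for this sketch rather than the project's actual settings.

# Hypothetical default configuration; key names and values are illustrative only.
DEFAULT_CONFIG = {
    "test_size": 0.2,                     # 80/20 train/test split
    "vectorizer": {
        "max_features": 50_000,           # TF-IDF vocabulary cap
        "ngram_range": (1, 2),
        "stop_words": "english",
    },
    "classifier": "logistic_regression",  # e.g. "random_forest" or "svm"
    "random_state": 42,
}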
The entire training and classification pipeline can be executed through a single Python script or shell command. For example, running python main.py --train triggers the complete training cycle, from loading the dataset to outputting the trained model and classification metrics. Similarly, python main.py --predict "Your text here" provides an immediate prediction and confidence score.
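The following is a minimal, self-contained sketch of what such an entry point could look like, assuming a CSV dataset with "text" and "label" columns and the TF-IDF plus logistic regression baseline described in this paper; the file name, column names, and model path are illustrative assumptions rather than the project's actual code.

# Sketch of a train/predict entry point; paths and column names are assumed.
import argparse

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

MODEL_PATH = "fake_news_model.joblib"  # assumed output location


def train(csv_path: str = "news.csv") -> None:
    # Load the labeled dataset (assumed columns: "text", "label").
    df = pd.read_csv(csv_path)
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42
    )
    # TF-IDF features feeding a logistic regression classifier,
    # mirroring the baseline pipeline described in the text.
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english", max_features=50_000)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipeline.fit(X_train, y_train)
    print(classification_report(y_test, pipeline.predict(X_test)))
    joblib.dump(pipeline, MODEL_PATH)


def predict(text: str) -> None:
    pipeline = joblib.load(MODEL_PATH)
    label = pipeline.predict([text])[0]
    confidence = pipeline.predict_proba([text]).max()
    print(f"{label} (confidence: {confidence:.2f})")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Fake news detection CLI")
    parser.add_argument("--train", action="store_true", help="run the training cycle")
    parser.add_argument("--predict", metavar="TEXT", help="classify a single article")
    args = parser.parse_args()
    if args.train:
        train()
    elif args.predict:
        predict(args.predict)

Under these assumptions, python main.py --train persists the fitted pipeline, and python main.py --predict "Your text here" reloads it and prints a label with its probability.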
Performance benchmarking reveals that the full training cycle on a dataset of 20,000 labeled news articles (approx. 3 MB CSV file) completes in under 90 seconds on a standard mid-range laptop with an Intel Core i7 (11th Gen) processor and 16 GB RAM. On more advanced systems, such as those equipped with AMD Ryzen 9 CPUs or Apple Silicon chips, runtimes are further reduced by 20–35%.

The scalability of the system allows it to handle larger datasets with ease. When scaled to 100,000 records, the training time increased linearly to approximately 6 minutes without degradation in system responsiveness or model accuracy.
This is due to the efficient use of sparse matrix operations during TF-IDF vectorization and optimized training routines within scikit-learn's logistic regression and random forest classifiers.
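A quick, illustrative check of this property: scikit-learn's TfidfVectorizer returns a SciPy sparse matrix rather than a dense array, which is what keeps memory usage modest as the corpus grows.

# Illustrative check that TF-IDF features stay sparse.
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["fake news spreads fast", "verified reporting takes time"]
X = TfidfVectorizer().fit_transform(docs)
print(type(X), issparse(X), X.shape)  # compressed sparse row matrix, not a dense array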
Moreover, the pipeline includes built-in error handling and logging through Python's logging module. In the event of missing data, incorrect formats, or failed predictions, the system alerts the user with clear, human-readable error messages and writes diagnostic details to a log file for troubleshooting.
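A small sketch of this style of error handling follows; the logger name and log file path are assumptions for illustration, and the project's actual configuration may differ.

# Minimal sketch of file-based logging around a prediction call.
import logging

logging.basicConfig(
    filename="fake_news_detector.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("fake_news_detector")


def safe_predict(pipeline, text: str):
    """Return a prediction, logging diagnostics instead of crashing."""
    try:
        if not text or not text.strip():
            raise ValueError("Input text is empty")
        return pipeline.predict([text])[0]
    except Exception:
        # Human-readable message for the user, full traceback in the log file.
        logger.exception("Prediction failed for input of length %d", len(text or ""))
        print("Could not classify this text; see fake_news_detector.log for details.")
        return None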
Overall, the system's clean interface, minimal configuration requirements, modular architecture, and robust performance make it exceptionally easy to use and adapt, even for those with limited prior experience in machine learning or natural language processing.
B. Compatibility and Extensibility

The system has been architected with interoperability in mind, supporting seamless integration with a wide array of modern machine learning frameworks and natural language processing tools. By default, it is built upon the scikit-learn library, which is well-known for its simple, consistent API and robust performance in classical machine learning tasks. However, its modular and loosely coupled codebase makes it easy to replace or extend any individual component without impacting the functionality of others.

Users seeking to incorporate deep learning techniques can easily substitute the classification module with models from TensorFlow or PyTorch. For instance, integrating a fine-tuned BERT model from Hugging Face's transformers library can be accomplished by replacing the TF-IDF vectorizer and logistic regression classifier with a pre-trained transformer pipeline. The surrounding logic — including data loading, token preprocessing, and metric computation — remains largely unchanged due to the system's adherence to common input/output interfaces and clean abstraction layers.
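A hedged sketch of such a substitution using Hugging Face's transformers pipeline is shown below; the checkpoint name is a stand-in, and in practice a BERT model fine-tuned on fake news data would be dropped in the same way.

# Sketch of swapping the classifier for a pre-trained transformer pipeline.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder checkpoint
)

result = classifier("Breaking: scientists confirm the moon is made of cheese.")[0]
print(result["label"], result["score"])

Because the surrounding code only expects a callable that maps raw text to a label and a score, this swap leaves the rest of the pipeline untouched.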
Fig. 1. High-level architecture showing pluggable components for preprocessing, feature extraction, and classification.

The preprocessing module supports the use of external NLP libraries such as spaCy, Stanza, or TextBlob, enabling users to experiment with different tokenization, lemmatization, and named entity recognition techniques. For example, a user could switch from nltk to spaCy for faster tokenization and more advanced dependency parsing with just a few lines of modification.
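An illustrative sketch of such a swap, assuming spaCy's small English model has been installed (python -m spacy download en_core_web_sm):

# Replacing nltk-based tokenization with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")


def spacy_tokenize(text: str) -> list[str]:
    # Lemmatized, lowercased tokens with stop words and punctuation removed.
    doc = nlp(text)
    return [t.lemma_.lower() for t in doc if not t.is_stop and not t.is_punct]


print(spacy_tokenize("The senators were quietly discussing the leaked report."))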
Furthermore, the system can interact with third-party APIs such as Google Cloud Natural Language, Microsoft Azure Text Analytics, or AWS Comprehend to enrich analysis with sentiment scores, entity recognition, or language detection, enabling hybrid systems that combine local models with cloud-based intelligence.

Multilingual Support: While the initial deployment is tailored for English-language fake news detection, the system is inherently capable of supporting multiple languages due to its reliance on Unicode (UTF-8) encoding and Unicode-aware string processing. This allows the system to be adapted for non-English languages such as Spanish, Hindi, or Arabic with relatively minor changes.
Language extension typically requires replacing the stopword list, tokenizer, and training dataset. For example:
• Replace the default English stopword set from nltk.corpus.stopwords with a target language
• Use multilingual embeddings like XLM-RoBERTa or

Web Dashboard (Flask): To broaden accessibility and appeal to non-technical users such as journalists, educators, or fact-checking volunteers, a web-based dashboard is under active development using the Flask microframework. The web interface will offer the following features (a minimal endpoint sketch follows the feature list):
• A clean text box for manual entry of news content
• File upload functionality (CSV or TXT formats)
• Interactive charts showing word importance and model
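As a sketch of what the dashboard's prediction endpoint might look like, assuming the trained pipeline is persisted with joblib as described earlier; the route name, form field, and model path are assumptions, since the dashboard itself is still under development.

# Hypothetical Flask endpoint serving predictions from the saved pipeline.
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("fake_news_model.joblib")  # assumed path of the trained pipeline


@app.route("/predict", methods=["POST"])
def predict():
    # Text submitted from the dashboard's input box.
    text = request.form.get("text", "")
    if not text.strip():
        return jsonify({"error": "No text provided"}), 400
    label = model.predict([text])[0]
    confidence = float(model.predict_proba([text]).max())
    return jsonify({"label": str(label), "confidence": confidence})


if __name__ == "__main__":
    app.run(debug=True)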
Fig. 9. Confusion matrix from a training run on the Random Forest dataset.