0% found this document useful (0 votes)
15 views6 pages

Conference FakeNews Detection

This document discusses a machine learning-based system for detecting fake news, highlighting the use of traditional classifiers and deep learning models like BERT. It introduces a hybrid incremental approach that adapts to real-time data, demonstrating improved accuracy and robustness. The system is designed for ease of use, compatibility with various platforms, and extensibility for future research and multilingual support.

Uploaded by

kartik010902
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views6 pages

Conference FakeNews Detection

This document discusses a machine learning-based system for detecting fake news, highlighting the use of traditional classifiers and deep learning models like BERT. It introduces a hybrid incremental approach that adapts to real-time data, demonstrating improved accuracy and robustness. The system is designed for ease of use, compatibility with various platforms, and extensibility for future research and multilingual support.

Uploaded by

kartik010902
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

Fake News Detection Using Machine Learning

Techniques
Manpreet Kaur Ishita Srejath Nikhil Chaudhary
Dept. of Mathematics and Computing Dept. of Mathematics and Computing Dept. of Mathematics and Computing
Delhi Technological University Delhi Technological University Delhi Technological University
New Delhi, India New Delhi, India New Delhi, India

Dr. Anshul Arora


Supervisor, Dept. of Mathematics and Computing
Delhi Technological University
New Delhi, India

Abstract—The rapid spread of fake news across digital plat- BERT with the adaptability of LightGBM to enable real-
forms poses significant risks to society, influencing public opinion, time learning. This approach addresses key challenges such
disrupting democratic processes, and fueling misinformation. as concept drift, adversarial text manipulation, and the need
This study explores the application of machine learning tech-
niques for effective and scalable fake news detection. We evaluate for continuous model updates in dynamic environments. Our
the performance of traditional classifiers such as Logistic Regres- contributions are threefold: A comprehensive comparative
sion, Support Vector Machines, and Random Forest, alongside analysis of traditional and deep learning models for fake news
deep learning models like BERT. Furthermore, we propose a detection. The introduction of a hybrid incremental model
hybrid incremental approach that integrates semantic encoding capable of adapting to real-time data. Empirical validation
with real-time learning capabilities to adapt to evolving content.
Our findings demonstrate that hybrid models outperform stan- using multiple benchmark datasets and simulated streaming
dalone algorithms in both accuracy and adaptability, achieving environments, showing improved performance and robustness.
up to 99.31 By integrating semantic, statistical, and adaptive learning
Index Terms—Fake news detection, machine learning, deep components, this research aims to build a resilient and scalable
learning, BERT, hybrid models, misinformation framework for combating the growing threat of fake news in
the digital age.
I. I NTRODUCTION
In today’s hyper-connected world, the internet and social II. E ASE OF U SE
media have become dominant platforms for disseminating
A. Implementation Simplicity
news and information. While this democratization has em-
powered individuals and accelerated access to knowledge, The fake news detection system has been designed with
it has also led to a surge in the spread of misinformation a strong emphasis on usability and accessibility for both
and disinformation, commonly referred to as fake news. novice and experienced users. The goal is to ensure that
The consequences of fake news can be severe—manipulating the system can be set up and executed with minimal effort,
public perception, influencing elections, undermining trust in while maintaining flexibility for customization and expansion.
institutions, and even inciting violence. Traditional fake news The system is primarily developed in Python 3.10, a widely-
detection relies heavily on human fact-checkers and manual used language in the data science community, ensuring high
verification, which are time-consuming, costly, and unscalable compatibility and ease of maintenance.
in the face of the high volume and velocity of online content. The solution is cross-platform and has been tested on major
As a result, there is an urgent need for automated, intelligent operating systems including Ubuntu 22.04 LTS, Windows
systems that can detect fake news efficiently and accurately. 10/11, and macOS Ventura, offering a consistent user expe-
In this research, we explore the application of various ma- rience across environments. It utilizes a modular architecture
chine learning (ML) and deep learning (DL) techniques to that separates major tasks into distinct modules — includ-
detect fake news articles based on their textual content. We ing data ingestion, preprocessing, feature extraction, model
implement and evaluate a range of models—from conventional training, prediction, and evaluation — thereby promoting
classifiers such as Logistic Regression (LR), Support Vector code clarity and facilitating individual component testing or
Machines (SVM), and Random Forests (RF), to advanced replacement. Each module is well-commented and adheres
deep learning models like BERT (Bidirectional Encoder Rep- to PEP 8 coding standards, aiding in readability and future
resentations from Transformers), known for their contextual collaboration.
understanding of language. Furthermore, we propose a hybrid Installation and setup are greatly simplified using a
incremental approach that combines the semantic strength of requirement.txt file, which automatically installs all
necessary dependencies using the command pip install scikit-learn library, which is well-known for its simple,
-r requirements.txt. Key libraries required include: consistent API and robust performance in classical machine
• scikit-learn==1.3.0 – for machine learning learning tasks. However, its modular and loosely coupled
model implementation and evaluation. codebase makes it easy to replace or extend any individual
• pandas==2.0.2 – for structured data manipulation component without impacting the functionality of others.
and preprocessing. Users seeking to incorporate deep learning techniques can
• nltk==3.8.1 – for text tokenization, stopword re- easily substitute the classification module with models from
moval, and other NLP utilities. TensorFlow or PyTorch. For instance, integrating a fine-
• joblib==1.3.2 – for model serialization and persis- tuned BERT model from Hugging Face’s transformers
tence. library can be accomplished by replacing the TF-IDF vec-
The system is designed for minimal setup time — a new torizer and logistic regression classifier with a pre-trained
user can clone the repository, install dependencies, and start transformer pipeline. The surrounding logic — including data
training or predicting with minimal configuration. Default loading, token preprocessing, and metric computation — re-
configuration files provide commonly used parameters such mains largely unchanged due to the system’s adherence to
as test/train splits, vectorizer settings, and classifier type, common input/output interfaces and clean abstraction layers.
eliminating the need for manual parameter tuning for baseline
runs.
The entire training and classification pipeline can be ex-
ecuted through a single Python script or shell command.
For example, running python main.py --train trig-
gers the complete training cycle, from loading the dataset
to outputting the trained model and classification metrics.
Similarly, python main.py --predict "Your text
here" provides an immediate prediction and confidence
score.
Performance benchmarking reveals that the full training
cycle on a dataset of 20,000 labeled news articles (approx.
3MB CSV file) completes in under 90 seconds on a standard
mid-range laptop with an Intel Core i7 (11th Gen) processor
and 16 GB RAM. On more advanced systems, such as those
equipped with AMD Ryzen 9 CPUs or Apple Silicon chips,
runtimes are further reduced by 20–35%.
The scalability of the system allows it to handle larger
datasets with ease. When scaled to 100,000 records, the train-
ing time increased linearly to approximately 6 minutes without
degradation in system responsiveness or model accuracy. This
is due to the efficient use of sparse matrix operations during Fig. 1. High-level architecture showing pluggable components for prepro-
TF-IDF vectorization and optimized training routines within cessing, feature extraction, and classification.
scikit-learn’s logistic regression and random forest clas-
sifiers. The preprocessing module supports the use of external NLP
Moreover, the pipeline includes built-in error handling and libraries such as spaCy, Stanza, or TextBlob, enabling
logging through Python’s logging module. In the event of users to experiment with different tokenization, lemmatization,
missing data, incorrect formats, or failed predictions, the sys- and named entity recognition techniques. For example, a user
tem alerts the user with clear, human-readable error messages could switch from nltk to spaCy for faster tokenization and
and writes diagnostic details to a log file for troubleshooting. more advanced dependency parsing with just a few lines of
Overall, the system’s clean interface, minimal configuration modification.
requirements, modular architecture, and robust performance Furthermore, the system can interact with third-party APIs
make it exceptionally easy to use and adapt, even for those such as Google Cloud Natural Language, Microsoft Azure
with limited prior experience in machine learning or natural Text Analytics, or AWS Comprehend to enrich analysis with
language processing. sentiment scores, entity recognition, or language detection,
enabling hybrid systems that combine local models with cloud-
B. Compatibility and Extensibility based intelligence.
The system has been architected with interoperability in Multilingual Support: While the initial deployment is tai-
mind, supporting seamless integration with a wide array lored for English-language fake news detection, the system
of modern machine learning frameworks and natural lan- is inherently capable of supporting multiple languages due to
guage processing tools. By default, it is built upon the its reliance on Unicode (UTF-8) encoding and Unicode-aware
string processing. This allows the system to be adapted for Web Dashboard (Flask): To broaden accessibility and ap-
non-English languages such as Spanish, Hindi, or Arabic with peal to non-technical users such as journalists, educators,
relatively minor changes. or fact-checking volunteers, a web-based dashboard is under
Language extension typically requires replacing the stop- active development using the Flask microframework. The web
word list, tokenizer, and training dataset. For example: interface will offer the following features:
• Replace the default English stopword set from • A clean text box for manual entry of news content

nltk.corpus.stopwords with a target language • File upload functionality (CSV or TXT formats)

(e.g., Spanish: stopwords.words(’spanish’)). • Display of predictions and confidence scores in real-time

• Use multilingual embeddings like XLM-RoBERTa or • Interactive charts showing word importance and model

mBERT for consistent vector representation across lan- uncertainty


guages. • Support for session-based history and exporting results

• Train or fine-tune the classifier using a multilingual or


translated dataset of news articles labeled as real or fake.
Empirical tests show that extending the pipeline to handle
Hindi using transliterated datasets and a Hindi stopword list
yields accuracy scores above 80%, demonstrating its cross-
lingual adaptability.
Future-Proofing: Because of its extensible design, the sys-
tem can incorporate new machine learning paradigms such as
few-shot or zero-shot learning. This means it can be adapted to
work in low-data scenarios using models like TARS or GPT-4
via API wrappers. Additionally, the modular structure enables
easy experimentation with ensemble methods, active learning,
or adversarial training techniques — critical for improving
robustness in real-world deployment.
In summary, the system’s compatibility with major ML
ecosystems, support for multilingual processing, and pluggable
architecture make it highly extensible for future research,
commercial applications, and cross-domain use cases.

C. User Interface Considerations


To enhance usability for non-technical stakeholders, the Fig. 2. Mockup of planned web interface for user interaction.
system includes a user-friendly interface layer designed for
both command-line interaction and eventual web-based access. The dashboard is designed to be mobile-responsive and is
This layer abstracts away the complexities of the underlying intended for deployment on local servers or cloud platforms
machine learning pipeline and enables smooth, real-time in- such as Heroku, AWS EC2, or PythonAnywhere. Security
teraction with the classifier. features such as input sanitization and basic authentication are
Command-Line Interface (CLI): The CLI provides an ef- also planned for future releases.
ficient and lightweight mechanism for users to classify indi- Accessibility and Internationalization: To make the inter-
vidual text samples or bulk files. Users can pass input text face usable across demographics, future updates will intro-
directly via command-line arguments or specify the path to a duce multilingual support for UI elements, including interface
text or CSV file containing multiple news entries. The output text in Hindi, Spanish, and Arabic. Additionally, accessibility
consists of: improvements such as screen reader compatibility, keyboard
navigation, and color-blind-friendly themes will be considered
• A predicted label: Fake or Real in alignment with WCAG 2.1 guidelines.
• A model-generated confidence score, e.g., Fake, Usability Testing and Feedback Loop: A usability testing
92.4% phase is scheduled during deployment, where feedback from
• Optional verbose mode showing tokenized input, ex- target user groups (e.g., college students, journalists) will
tracted features, and top influencing terms inform refinements. Metrics such as task completion time,
The CLI can be executed with a simple command prediction clarity, and navigation ease will be collected via
such as: python classify.py --text "Breaking anonymous feedback forms integrated into the dashboard.
news: Celebrity cloned in lab" --verbose In summary, these user interface considerations bridge the
On a standard machine (Intel i7, 16 GB RAM), classification gap between machine learning complexity and end-user ac-
latency is under 0.2 seconds per article, making it feasible for cessibility, making the system practical for real-world, non-
batch analysis of large datasets. technical use.
D. Automation and Maintenance
To facilitate scalability, reproducibility, and ease of long-
term use, the fake news detection system incorporates robust
automation scripts and lightweight maintenance utilities. These
tools minimize manual intervention and ensure consistent
performance monitoring across iterations.
Automated Workflow: The system is built around modular
automation scripts that orchestrate the entire machine learning
workflow:
• train.py handles dataset ingestion, preprocessing,
model training, and evaluation.
• test.py evaluates saved models on custom datasets or
test partitions.
• evaluate.py generates and saves performance met-
rics, including confusion matrices and classification re-
ports.
These scripts can be chained together in a single shell
pipeline or integrated into a CI/CD workflow using tools like
GitHub Actions or Jenkins.
Fig. 5. Confusion matrix from a training run on the LR dataset.
Model Performance and Logging: Every training session
automatically logs key performance indicators to timestamped
log files in CSV and JSON formats. A sample model evalua-
tion on the LIAR dataset yielded the following metrics:
• Accuracy: 91.6%
• Precision: 92.3%
• Recall: 90.5%
• F1-Score: 91.4%
These results are also visualized as graphs (accuracy vs.
epoch, confusion matrix, precision-recall curve), enabling
side-by-side comparison of model improvements over time.

Fig. 9. Confusion matrix from a training run on the Random Forest dataset.

Model Serialization and Reusability: The final trained mod-


els are serialized using joblib, which preserves all prepro-
cessing pipelines and model weights. This ensures immediate
reusability without the need for retraining. The serialized
model files (.pkl) are versioned using timestamp-based naming
conventions, for example:
model_fake_news_2025-05-11_15-32.pkl
Scheduled Retraining and Maintenance: To maintain rele-
Fig. 4. Confusion matrix from a training run on the SVM dataset. vance as language and misinformation patterns evolve, a cron-
based scheduler (on UNIX-like systems) or Task Scheduler (on
Windows) can be configured to: • Research exploration involving model bias, multilingual
• Periodically retrain the model using newly acquired la- detection, or misinformation propagation
beled data To quantify the system’s transparency and speed for educa-
• Recompute performance metrics and update dashboards tional use, Table I provides a breakdown of execution time by
• Trigger email or webhook notifications if accuracy drops module, tested on a standard dataset of 20,000 news articles
below a threshold on a machine with Intel i7 CPU and 16GB RAM.
Future releases will include a graphical monitoring dash-
board showing historical trends in accuracy, F1-score, and TABLE I
E XECUTION T IME BY M ODULE (S AMPLE I NPUT: 20,000 A RTICLES )
model drift, as well as alerts for retraining triggers.
Extensibility: The automation framework is designed to be Module Execution Time (s)
extensible, allowing plug-and-play integration with additional Data Cleaning 18.2
Feature Extraction (TF-IDF) 24.5
tools such as MLflow for experiment tracking or Docker for Model Training 29.7
containerized deployment. It is also compatible with cloud Testing and Evaluation 16.1
platforms such as Google Colab and AWS SageMaker, making Total Runtime 88.5
it adaptable to a wide range of research or production settings.
In summary, the system’s automation and maintenance
Future Educational Extensions: Planned updates include
features provide a solid foundation for continuous model im-
the integration of a quiz-based CLI to test conceptual under-
provement, efficient deployment, and long-term sustainability.
standing and a set of guided challenges for each module (e.g.,
E. Educational Value replacing TF-IDF with transformer embeddings). Additionally,
instructors can leverage provided templates for grading and
Beyond its primary role as a functional fake news detection
feedback generation based on system logs and outputs.
system, this tool is purposefully designed to be pedagogically
In summary, the system not only serves as a reliable tool
valuable. Its architecture and documentation make it highly
for fake news detection but also functions as a comprehensive
accessible for students, educators, and researchers seeking
educational platform that bridges theoretical knowledge with
to understand and experiment with real-world applications
practical application in machine learning and NLP.
of machine learning (ML) and natural language processing
(NLP). ACKNOWLEDGMENT
Each module in the system—ranging from data preprocess-
The authors—Ishita Sreejeth, Manpreet Kaur, and Nikhil
ing to final classification—contains thorough inline comments
Chaudhary—would like to extend their heartfelt gratitude to
and is supplemented with external Markdown documentation
Dr. Anshul Arora for his invaluable guidance, mentorship, and
that outlines the logic, function, and dependencies of each
continuous support throughout the course of this research. His
code segment. This enables learners to grasp not only how
insights and encouragement were instrumental in shaping the
the code works but also why certain algorithms or structures
direction and quality of this project. The authors also thank
were chosen.
the Department of Mathematics and Computing for providing
Learning Resources and Jupyter Notebooks: To enhance the
essential resources and infrastructure. Sincere appreciation is
learning experience, a series of Jupyter Notebooks are included
extended to the developers and maintainers of the public
that demonstrate:
datasets and APIs utilized in this study. Finally, the authors
• Text normalization, tokenization, stopword removal, and
acknowledge the unwavering support of their families and
stemming using nltk and spaCy peers, whose motivation greatly contributed to the successful
• Feature engineering with TF-IDF and exploratory visual-
completion of this work.
izations of word distributions
• Step-by-step model training using logistic regression and R EFERENCES
decision trees [1] H. Ahmed, I. Traore, and S. Saad, “Detecting opinion spams and fake
• Evaluation techniques including confusion matrices, ROC news using text classification,” Security and Privacy, vol. 1, no. 1, p.
curves, and classification reports e9, 2018.
[2] X. Zhou and R. Zafarani, “Fake news: A survey of research, detection
These notebooks serve as interactive guides, ideal for use methods, and opportunities,” arXiv preprint arXiv:1812.00315, 2018.
in university courses, coding bootcamps, or independent study [3] R. K. Kaliyar, A. Goswami, and P. Narang, “FakeBERT: Fake news
programs. detection in social media with a BERT-based deep learning approach,”
Multimedia Tools and Applications, vol. 80, pp. 11765–11788, 2021.
Classroom and Capstone Project Integration: The system’s [4] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake news detection on
modularity and straightforward logic make it well-suited for: social media: A data mining perspective,” ACM SIGKDD Explorations
Newsletter, vol. 19, no. 1, pp. 22–36, 2017.
• Classroom demonstrations of NLP pipelines [5] N. Ruchansky, S. Seo, and Y. Liu, “CSI: A hybrid deep model for fake
• Group assignments focusing on algorithm comparison news detection,” in Proc. 2017 ACM Conf. Information and Knowledge
• Capstone projects where students extend functionality Management (CIKM), pp. 797–806, 2017.
[6] J. Thorne and A. Vlachos, “Automated fact checking: Task formulations,
(e.g., add sentiment analysis or integrate social media methods and future directions,” in Proc. 27th Int. Conf. Computational
data) Linguistics (COLING), pp. 3346–3359, 2018.
[7] M. A. Wani and R. Mahajan, “Fake news detection using incremental
learning and stream classification techniques,” Journal of Information
Security and Applications, vol. 73, p. 103470, 2023.
[8] W. Y. Wang, “Liar, liar pants on fire: A new benchmark dataset for fake
news detection,” in Proc. ACL, pp. 422–426, 2017.
[9] S. Volkova, K. Shaffer, J. Y. Jang, and N. Hodas, “Separating facts
from fiction: Linguistic models to classify suspicious and trusted news
posts on Twitter,” in Proc. 55th Annu. Meet. Assoc. Comput. Linguistics
(ACL), pp. 647–653, 2017.
[10] R. Baly, G. Karadzhov, D. Alexandrov, J. Glass, and P. Nakov, “Pre-
dicting factuality of reporting and bias of news media sources,” in
Proc. 2018 Conf. Empirical Methods in Natural Language Processing
(EMNLP), pp. 3528–3539, 2018.
[11] N. J. Conroy, V. L. Rubin, and Y. Chen, “Automatic deception detection:
Methods for finding fake news,” in Proc. Assoc. Information Science and
Technology, vol. 52, no. 1, pp. 1–4, 2015.
[12] M. Al-Rakhami and A. Al-Amri, “Lieskill: A benchmark dataset for fake
news detection in Arabic,” IEEE Access, vol. 8, pp. 158648–158661,
2020.
[13] R. Oshikawa, J. Qian, and W. Y. Wang, “A survey on natural language
processing for fake news detection,” in Proc. 12th Lang. Resour. Eval.
Conf. (LREC), pp. 6086–6093, 2020.

You might also like