Spam News Detection Report: Manikiran
Manikiran
Internship Project
1 Introduction
2 Problem Statement
2.1 Objectives
3 Dataset Overview
4 Data Preprocessing
4.1 Tokenization
4.2 Lowercasing
4.3 Stop Word Removal
4.4 TF-IDF Vectorization
6 Model Evaluation
6.1 Confusion Matrix
6.2 Accuracy
6.3 Precision and Recall
6.4 F1 Score
8 Conclusion
9 References
Chapter 1
Introduction
In the age of information, digital platforms have become primary sources of news for millions.
However, with the rapid dissemination of information, there is a growing threat of spam or
fake news, which can mislead readers and spread false information. Detecting and mitigating
this threat has become a significant challenge for both researchers and developers.
In this report, we develop a machine learning model capable of detecting spam news
articles. The model uses logistic regression to classify each article as either true or spam.
This report outlines the problem, the dataset used, the preprocessing techniques applied, and
the evaluation of the model’s performance.
Chapter 2
Problem Statement
Spam news, or fake news, is a pervasive issue in modern media. It has the potential to
distort public opinion and misinform large populations, leading to serious societal
consequences. The challenge lies in distinguishing between legitimate news and fake news,
as the latter is often designed to appear authentic.
The main objective is to use machine learning techniques to classify news articles as “spam”
or “true” based on the content of the text. The problem can be framed as a binary
classification task, where the target variable indicates whether an article is spam or true.
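To make the binary framing concrete, the following is a minimal sketch of how a logistic regression classifier turns a feature vector into a spam/true decision. The weights, bias, feature values, and the 0.5 threshold are hypothetical placeholders chosen for illustration; they are not the parameters learned in this project.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_spam(features, weights, bias, threshold=0.5):
    """Return (probability of spam, spam decision) for one article."""
    # Linear score: weighted sum of the features plus a bias term.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, probability >= threshold


# Hypothetical two-feature example (illustrative values only).
weights = [1.2, -0.8]
bias = -0.1
prob, is_spam = predict_spam([1.0, 0.0], weights, bias)
print(f"P(spam) = {prob:.2f}, classified as spam: {is_spam}")
```

In practice the weights would be learned from the labeled dataset rather than set by hand.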
2.1 Objectives
The primary objectives of this project are:
• To develop a machine learning model that can accurately classify news articles as spam
or true.
• To evaluate the performance of the model using standard evaluation metrics.
• To explore different preprocessing techniques for improving the model’s accuracy.
Chapter 3
Dataset Overview
The dataset used for this project contains news articles, each labeled as either true or spam.
The dataset comprises two key columns:
Chapter 4
Data Preprocessing
Before applying machine learning algorithms, the text data needs to be preprocessed. The
steps involved in preprocessing the text are as follows:
4.1 Tokenization
Tokenization is the process of splitting the text into individual words (tokens). This lets
later steps, such as stop word removal and TF-IDF vectorization, operate on one word at a time.
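As an illustration, a simple regular-expression tokenizer could look like the sketch below. The function name and the token pattern (runs of letters, digits, or apostrophes) are assumptions made for this example; the report does not specify which tokenizer was used.

```python
import re


def tokenize(text):
    """Split text into word tokens, discarding punctuation and whitespace."""
    # Assumed pattern: a token is a run of letters, digits, or apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text)


tokens = tokenize("Breaking: scientists discover water on Mars!")
print(tokens)
# ['Breaking', 'scientists', 'discover', 'water', 'on', 'Mars']
```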
4.2 Lowercasing
All text is converted to lowercase to ensure uniformity. For example, “News” and “news”
are treated as the same word after this step.
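A minimal sketch of this step, assuming the text has already been split into a list of tokens (the helper name is hypothetical):

```python
def lowercase_tokens(tokens):
    """Return a new list with every token converted to lowercase."""
    return [token.lower() for token in tokens]


normalized = lowercase_tokens(["News", "news", "NEWS"])
print(normalized)  # ['news', 'news', 'news']
print(len(set(normalized)))  # 1: all three variants collapse to one word
```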
Chapter 6
Model Evaluation
To evaluate the performance of the model, we use various metrics such as accuracy, precision,
recall, and the F1 score.
6.2 Accuracy
Accuracy is the proportion of correctly classified news articles (both true and spam) out of
the total number of articles.
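The definition can be written directly in terms of confusion-matrix counts. In the sketch below, spam is treated as the positive class, and the counts are hypothetical values for illustration only; they are not the project's results.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct.

    tp/tn: correctly classified spam/true articles.
    fp/fn: true articles flagged as spam / spam articles missed.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Hypothetical counts: 85 of 100 articles classified correctly.
print(accuracy(tp=45, tn=40, fp=5, fn=10))  # 0.85
```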
6.4 F1 Score
The F1 score is the harmonic mean of precision and recall, providing a balanced measure of
the model’s performance.
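The sketch below shows how precision, recall, and the F1 score relate when computed from confusion-matrix counts, again treating spam as the positive class. The counts are hypothetical and serve only to illustrate the formula.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=45, fp=5, fn=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.900 recall=0.818 f1=0.857
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 score by maximizing precision or recall alone.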
Chapter 7
The Logistic Regression model achieved the following results on the test set:
• Accuracy: 92.5%
• Precision: 91.8%
• Recall: 93.2%
• F1 Score: 92.5%
These results indicate that the model is effective in detecting spam news, with high accuracy
and closely matched precision and recall.
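As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall using the harmonic-mean definition:

```python
# Reported test-set values, expressed as fractions.
precision = 0.918
recall = 0.932

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # F1 = 0.925
```

The recomputed value rounds to 92.5%, which agrees with the reported F1 score.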
Chapter 8
Conclusion
In this report, we developed a logistic regression model to classify news articles as either
spam or true. The model was trained on a labeled dataset and evaluated using standard
performance metrics. The results suggest that logistic regression is an effective method
for detecting spam news on this dataset.
Future work could involve testing more advanced models, such as deep learning, or exploring
additional text preprocessing techniques to further improve accuracy. Additionally, expanding
the dataset to include more diverse sources of news could make the model more generalizable.
Chapter 9
References