Spam Filtering
A Project Report
BACHELOR OF TECHNOLOGY
In
CSE (ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING)
By
HEMANTH (21481A4215)
LEENA KHAMAR SULTHANA (21481A4237)
HARI KRISHNA (21481A4240)
RAGHU (21481A4248)
DEPARTMENT OF
CSE (ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING)
SESHADRI RAO GUDLAVALLERU ENGINEERING COLLEGE
(An Autonomous Institute with Permanent Affiliation to JNTUK, Kakinada)
SESHADRIRAO KNOWLEDGE VILLAGE
GUDLAVALLERU – 521356
ANDHRA PRADESH
2022-2023
CERTIFICATE
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of any task would be
incomplete without the mention of people who made it possible and whose constant
guidance and encouragement crown all the efforts with success.
We would like to express our deep sense of gratitude and sincere thanks to Dr. Y.
ADILAKSHMI, Professor & Head of the Department, CSE (Artificial Intelligence
and Machine Learning) for his constant guidance, supervision and motivation in
completing the project work.
We feel elated to express our heartfelt gratitude and sincere thanks to Dr. Y.
ADILAKSHMI, Head of the Department, CSE (Artificial Intelligence and Machine
Learning) for his encouragement throughout the analysis of the project. His
annotations, suggestions and criticisms are the key behind the successful completion
of the project work.
We would like to take this opportunity to thank our beloved principal Dr. B.
KARUNA KUMAR for providing great support to us in completing our project and
giving us the opportunity to carry out the project.
Our special thanks to the faculty of our department and programmers of our computer
lab. Finally, we thank our family members, non-teaching staff and our friends, who
directly or indirectly helped and supported us in completing our project in time.
Team members
HEMANTH (21481A4215)
LEENA KHAMAR SULTHANA (21481A4237)
HARI KRISHNA (21481A4240)
RAGHU (21481A4248)
INDEX
TITLE
LIST OF FIGURES
ABSTRACT
CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
CHAPTER 2: PROPOSED METHODS
2.1 METHODOLOGY
2.2 IMPLEMENTATION
BIBLIOGRAPHY
List of Program Outcomes and Program Specific Outcomes
Mapping of Program Outcomes with Graduated POs and PSOs
LIST OF SYMBOLS AND ABBREVIATIONS
ML Machine Learning
ps PorterStemmer
LIST OF FIGURES
ABSTRACT
Spam Detection using SVM
CHAPTER 1
1 INTRODUCTION
1.1 Introduction
The number of email users is growing in tandem with the spread of the internet.
Spam, that is, unsolicited bulk email, is a well-known consequence of email's
expanding popularity. As people adapt their daily routines to incorporate the
internet, email use is expected to continue increasing, and email has become a
fundamental means of communication.
Spam emails typically contain advertisements and are often harmful. They are
unwanted by the recipient and frequently go unopened. Spammers bombard large
numbers of recipients with identical messages. Spam usually begins after an
email address is disclosed to deceitful websites or unauthorized parties. The
adverse impacts of spam are manifold. Among them are slower internet speeds, the
loss of significant data, and search engines yielding less accurate results due to
the influx of spam content. Spam also wastes users' valuable time and floods
them with frustrating messages. Recognizing spammers and their tactics is
pivotal for appropriate countermeasures. Despite extensive research, identifying
spam content remains challenging, and there is still scope for improvement in
distinguishing genuine surveys from unsolicited contact attempts.
1
Seshadri Rao Gudlavalleru Engineering College
CHAPTER 2
2 PROPOSED METHODS
2.1 Methodology
Spam detection is a classification problem: each message must be assigned to one
of two classes, spam or ham. In principle, any classification algorithm can be
applied, and most will work. However, these algorithms operate on numerical data
and not on raw text, so the words must first be converted into numeric features.
For this, we use CountVectorizer from scikit-learn, which converts each message
into a vector of word counts.
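The conversion from text to counts can be illustrated with a minimal sketch. The three messages below are invented for illustration and are not taken from the project's data set; the default CountVectorizer settings are assumed, which lowercase the text and ignore single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages invented for illustration; not from the project's data set.
messages = [
    "win a free prize now",
    "are we meeting for lunch",
    "free free prize",
]

vectorizer = CountVectorizer()          # default settings (assumption)
X = vectorizer.fit_transform(messages)  # sparse matrix: one row per message

# vocabulary_ maps each learned word to its column index in X.
free_column = vectorizer.vocabulary_["free"]
counts = X.toarray()

print(X.shape)                 # (number of messages, vocabulary size)
print(counts[2, free_column])  # how often "free" occurs in the third message
```

Each row of the resulting matrix is the numeric representation of one message and can be passed directly to a classifier.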
For this project, we use support vector machines (SVM). We chose the SVM because
it generally performs well on text classification, where the feature vectors are
high-dimensional and sparse.
[Figure: workflow of the proposed method, with the stages Data Set, Classification, Test, Performance evaluation, and Classification Result]
We undertake a thorough data cleaning process. The first step in this process
involves removing any duplicate rows. Duplicate rows can skew the model's
learning process, leading to biased or inaccurate predictions. By using Pandas'
After removing duplicate rows, the data set contains 5169 rows and 2 columns.
Of these 5169 samples, 86.6% are ham messages and 13.4% are spam messages.
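The duplicate removal and the class proportions can be sketched on a small invented table. The column names `label` and `text` are assumptions made for this example, and pandas' `drop_duplicates` is used here as one straightforward way to drop repeated rows.

```python
import pandas as pd

# Toy table invented for illustration; the column names are assumptions.
df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam"],
    "text": ["see you at 5", "win cash now", "see you at 5", "claim your prize"],
})

# Keep the first copy of every repeated row and discard the rest.
df = df.drop_duplicates(keep="first")

# Fraction of each class among the remaining rows.
class_share = df["label"].value_counts(normalize=True)

print(df.shape)
print(class_share)
```

Checking the class shares after cleaning shows how imbalanced the data is, which matters when interpreting accuracy later.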
After addressing duplicates, the next crucial step is to handle the categorical labels.
In our dataset, emails are labelled as either 'spam' or 'not spam' (ham). Machine
learning algorithms, including Support Vector Machines (SVM), require
numerical input for processing. Therefore, we need to convert these categorical
labels into numerical values. This transformation is accomplished using the
LabelEncoder from the Scikit-Learn library. The LabelEncoder assigns an integer
to each category in sorted order of the label names, so 'ham' is encoded as 0 and
'spam' as 1. This encoding lets the model process the labels while preserving the
categorical information in a numerical format that the algorithm can interpret.
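A minimal sketch of the label encoding step, using a short invented list of labels. LabelEncoder sorts the class names before numbering them, which is why 'ham' receives 0 and 'spam' receives 1.

```python
from sklearn.preprocessing import LabelEncoder

# Invented labels for illustration.
labels = ["ham", "spam", "ham", "ham", "spam"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)   # integer code per label

classes = list(encoder.classes_)          # class names in code order
decoded = list(encoder.inverse_transform(encoded))  # back to the original strings

print(classes)
print(list(encoded))
```

The fitted encoder can also map predictions back to readable labels with `inverse_transform`.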
Preparing text data for the Support Vector Machine (SVM) model involves
several preprocessing steps to convert raw email content into a suitable format
for machine learning. This transformation enhances the model's ability to
accurately classify emails as spam or not spam. Here are the detailed steps
involved in text data preparation:
4. Removing Stop Words: Stop words are common words such as "the," "is,"
"in," and "and" that appear frequently in text but do not carry significant
meaning in the context of spam detection. Removing these stop words
reduces noise in the data and improves model performance. Libraries like
NLTK provide predefined lists of stop words that can be used for this
purpose.
5. Applying Stemming: Stemming is the process of reducing words to their
root form. For example, "running," "runs," and "run" are all reduced
to the root word "run." This normalization helps in treating different forms
of a word as the same, thus improving the model's ability to generalize. The
Porter Stemmer from NLTK is a commonly used tool for this task. Because it
works by stripping suffixes, irregular forms such as "ran" are left unchanged.
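Stop word removal and stemming can be sketched together as follows. The function name `transform_text` is hypothetical, and the small `STOP_WORDS` set is a hand-written stand-in for NLTK's predefined English list, used here so that the example runs without downloading NLTK data.

```python
import string

from nltk.stem import PorterStemmer

# Stand-in for NLTK's stop word list (assumption, kept tiny for the example).
STOP_WORDS = {"the", "is", "in", "and", "a", "to", "you", "are"}

ps = PorterStemmer()


def transform_text(text):
    """Lowercase, strip punctuation, drop stop words, and stem each word."""
    tokens = text.lower().split()
    tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return " ".join(ps.stem(t) for t in tokens)


cleaned = transform_text("The runner is running in the park, and runs daily!")
print(cleaned)
```

In a full pipeline, a function like this would be applied to every message before the text is vectorized.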
To effectively evaluate the performance of our SVM model for spam detection,
we must divide the dataset into training and testing sets. This division is essential
for assessing how well our model generalizes to unseen data, ensuring that it
performs well not only on the training data but also on new, unseen examples.
The training set is used to train the model, while the testing set is reserved for
evaluating its performance. We utilize the train_test_split function from the
In the context of spam detection, SVM works by learning patterns and features
from labeled examples of emails. Features extracted from emails, such as word
frequencies, presence of specific keywords, or other text characteristics, are used
to train the SVM model. During training, the SVM fits a decision boundary (a
hyperplane) that separates spam from legitimate emails in this feature space with
the largest possible margin.
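The split into training and testing sets and the SVM training step can be sketched end to end. The messages, the 80/20 split, the `random_state` value, and the linear kernel are all assumptions chosen for this example and are not taken from the report.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data invented for illustration: 1 = spam, 0 = ham.
texts = [
    "win a free prize now", "free cash claim now", "urgent free offer",
    "claim your free reward", "cheap loans win cash",
    "see you at lunch", "are you coming home", "call me after class",
    "meeting moved to monday", "thanks for the notes",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Hold out 20% for testing; stratify keeps the class ratio in both parts.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=2, stratify=labels
)

# Learn the vocabulary from the training messages only, then reuse it for the test set.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

model = SVC(kernel="linear")  # kernel chosen for this sketch (assumption)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)  # fraction of test messages classified correctly
print(accuracy)
```

Fitting the vectorizer after the split keeps words that occur only in the test messages from influencing training.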
advantages
2.2 IMPLEMENTATION
For this project, the code imports the libraries (NumPy, pandas, nltk, sklearn)
used for text pre-processing and SVM modelling, and initializes a PorterStemmer
(ps) for word stemming. Key functionalities include text pre-processing, feature
extraction using CountVectorizer and TfidfVectorizer, and model selection via
GridSearchCV for SVM parameter optimization.
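Parameter optimization with GridSearchCV might look like the sketch below. The parameter grid, the two-fold cross-validation, the pipeline step names `tfidf` and `svc`, and the toy messages are assumptions for illustration; the report does not state which values were searched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data invented for illustration: 1 = spam, 0 = ham.
texts = [
    "win a free prize now", "free cash claim now", "urgent free offer",
    "claim your free reward", "cheap loans win cash", "free entry win prize",
    "see you at lunch", "are you coming home", "call me after class",
    "meeting moved to monday", "thanks for the notes", "lunch at home today",
]
labels = [1] * 6 + [0] * 6

# Vectorizer and classifier in one pipeline, so each fold refits both.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])

# Candidate values are illustrative, not the ones used in the project.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated accuracy of the best setting
print(best_params, best_score)
```

Putting the vectorizer inside the pipeline means the vocabulary is learned separately in each cross-validation fold.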
This code reads a CSV file named 'spam.csv' into a Pandas DataFrame (df),
typically containing data for spam detection tasks. It prepares the dataset for
further analysis and model training in a machine learning project, focusing on
email classification as spam or not spam.
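A minimal sketch of the loading step is given here. So that it runs on its own, it first writes a two-row stand-in file to a temporary directory; the column names `label` and `text` and the file contents are assumptions, and the real 'spam.csv' may use different column names or need an explicit `encoding` argument.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spam.csv")

    # Write a tiny stand-in file (invented rows, assumed column names).
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("label,text\nham,see you at five\nspam,win cash now\n")

    # Load the CSV into a DataFrame, as the project does with its own file.
    df = pd.read_csv(path)

print(df.shape)
print(df.head())
```

Inspecting `df.shape` and `df.head()` right after loading is a quick check that the columns were parsed as expected.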
This code snippet transforms the categorical labels ('ham' and 'spam') into
numerical values (0 and 1) using LabelEncoder, so that the machine learning model
can process them. Additionally, it removes duplicate rows from the DataFrame df,
ensuring data integrity and reducing potential biases during model training and
evaluation.
CHAPTER 3
CHAPTER 4
Looking ahead, there are several avenues for expanding and refining this spam
detection system. Future research could explore more advanced techniques in
natural language processing (NLP) and machine learning to further enhance the
model's performance. This includes investigating deep learning architectures such
as Recurrent Neural Networks (RNNs) or Transformer models like BERT, which
can capture intricate patterns and semantic relationships in textual data.
Department of CSE (ARTIFICIAL INTELLIGENCE AND MACHINE
LEARNING)
Mapping Table
CS3509 : MACHINE LEARNING
Program Outcomes and Program Specific Outcome
Course Outcomes PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO1 1 1 1
CO2 1 1
CO3 2 3 2 2 1
CO4 2 2 3 2 2 2
CO5 1 2 3 1 2 1
Note: Map each course outcome with POs and PSOs with
either 1 or 2 or 3 based on level of mapping as follows: