Volume 6, Issue 9, September – 2021 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Spam Message Detection Using Logistic Regression


NIKHIL KUDUPUDI (1), SHILPA NAIR (2)
(1) U.G. Student, School of Engineering, Ajeenkya DY Patil University, Pune, India - 412105
(2) U.G. Student, School of Engineering, Ajeenkya DY Patil University, Pune, India - 412105

Abstract:- The use of the internet is increasing day by day, and so is the number of spammers who consistently try to spam people by sending fraudulent mails and SMS. Mails and SMS are among the most important and most used means of communication; 2.4 billion messages are sent every second. With the rise of such exchange of emails and messages, some find it an opportunity to fill others' inboxes with preposterous messages that reduce internet speed and plunder personal data. However, due to recent advancements in technology, it is possible to find solutions to such problems. With the help of Natural Language Processing (NLP) and Machine Learning, we can quickly detect spam messages. NLP is one of the crucial areas of research in machine learning applications. In this paper, we propose a model that classifies emails into the categories of Spam or Ham.

Keywords:- Spam-Detector, Natural Language Processing, Logistic Regression.

I. INTRODUCTION

Technology is advancing at a high rate. A few decades ago, the only means of communication was the letter, which gave way to the telegram, and today communication takes various forms such as emails, phone calls, and SMS. An average person sends 72 messages per day, as texting is also the most common cell phone activity. Almost 300 billion emails are exchanged per day, and half of them are spam. 'Spam mail' is undesired, unwanted email that is sent to many recipients and simply fills up their inboxes. Most of these messages contain product-buying links, which could consume our personal data, or other links and attachments. Sometimes carelessness on the part of users can cause significant damage to their personal data. Spam mails not only fill the inbox with junk but also add to email traffic: spam messages accounted for 45.1% of email traffic in March 2021. In short, such mails can be frustrating and dangerous at the same time.

Inboxes are 85% filled with spam mails, because of which valuable and important emails are ignored. Many researchers are developing techniques to solve such problems and to secure communication. Unsolicited emails are termed 'Spam', while important and valuable ones are termed 'Ham'.

Many techniques have been developed to classify spam and ham mails. One such technique uses Natural Language Processing and Machine Learning. With the help of text classification methods like stemming, lemmatization, vectorization, etc., it is possible to classify the mails and train a model that is able to detect unwanted mails.

In this study, we present a model that classifies emails and messages as either spam or ham. Performance evaluation metrics such as accuracy were considered in evaluating the proposed study. The results obtained from experiments confirmed that the proposed approach achieved high accuracy.

II. LITERATURE SURVEY

In [1] (Omay, 2010), the author recounted the history and explained the concept of logistic regression. He also described the types of logistic regression, namely binary, multinomial, and ordinal logistic regression, giving detailed information on binary logistic regression. The primary purpose of that paper is to assess the combined influence of independent variables on a dependent variable. For this, the author conducted a study on 200 students from Ankara University, with critical thinking as the dependent (target) variable. The author found that an increase of one unit in scientific thinking led directly to a 14.4 percent increase in critical thinking, and a rise of one unit in epistemological belief resulted in a 4.9 percent increase in high critical thinking.

In [2] (Lei, 2018), the author, Liu Lei, showed how logistic regression could be used quickly and efficiently to detect breast cancer. He applied a logistic regression model to a breast cancer dataset. The most accurate results, with an accuracy of 96.5%, were obtained when 'Maximum Texture' and 'Maximum Perimeter' were chosen as inputs to the model. In contrast, the accuracy was 90.48% with 'Mean Texture' and 'Mean Radius' as inputs. Therefore, choosing a better feature combination gives more accurate results.

In [3] (Radulescu, M.Dinsoreanu, & R.Potolea, 2014), the main goal is to detect spam comments. This was achieved by considering unclear comments with increased punctuation marks, stop words, non-ASCII characters, new lines, capital letters, and offensive words, and converting them into vectors to classify them as spam or non-spam comments. Next, they added the word duplication ratio, as spam comments tend to have repeated words, and the stop word ratio, which is the count of stop words divided by the total count of words in the comment. This increased the accuracy of classification. Finally, they added
post-comment similarity and topic similarity to remove comments unrelated to the specific context. The authors also showed that a decision tree classifier works better with their spam detection model.

In [4] (Qaiser, Shahzad, & Ali, 2018), the authors explained what Term Frequency-Inverse Document Frequency (TF-IDF) is and how it works. They also discussed the strengths and weaknesses of TF-IDF and how to overcome them. First, they collected data from different domains and removed stop words from it; then they applied TF-IDF to the processed data and displayed the results, which showed keywords and their TF-IDF values for the different domains. The top keywords from the '.biz', '.com', '.edu' and '.org' domains were 'parts', 'presidential', 'years' and 'Marketing', respectively.

In [5] (Sjarif, Nila, & Amir, 2019), the authors used TF-IDF and Random Forest to detect spam messages. The data was collected from the UCI Machine Learning Repository. Before applying TF-IDF, they performed preprocessing such as removing stop words, as these messages contain special symbols, pronouns, and prepositions, which do not help in spam identification. After applying TF-IDF, the authors tried multiple classification algorithms and found that Random Forest gave better accuracy, precision and F-measure than the other algorithms.

In [6] (Pandey & Yadav, 2020), the authors proposed a model in which deep neural networks are used to detect spam mails using TensorFlow. This model uses a linguistic approach, demonstrating the advantage of neural networks. The paper also surveyed various publicly available datasets and noted the basic structure of the model. It also identified plenty of open research problems related to spam filters.

The sole purpose of spam filters is to sort incoming data into unwanted (Spam) or wanted (Ham). Many researchers have come up with various types of filters. The model proposed in [7] (Shankar, 2018) uses Natural Language Processing and Naïve Bayes. This Bayesian spam filter is trained, and a database is maintained to store and track the spam and ham messages. The messages are split into tokens, and messages can be analyzed once the token database has been created by the filter. The model also introduces a threshold counter that helps maintain the spam filter's efficiency.

Different spam classification methods are used to classify data into groups [8] (Emmanuel, Gbengadada, & Joseph, 2016). These include Random Decision Tree, probabilistic methods, Support Vector Machine, Artificial Neural Networks, etc. These classification techniques have been shown in the literature to be useful for spam mail filtering when combined with a content-based filtering strategy that recognizes specific features (keywords frequently used in spam emails). The likelihood for each feature in the email is determined by the frequency with which these features appear in emails, which is then compared to a threshold value. Spam is defined as email messages that exceed a specified number of recipients.

In [9] (Pragna & RamaBai, 2019), the dataset was subjected to a series of experiments based on Natural Language Processing (NLP) principles such as label encoding, tokenization, stemming, stop word removal, and feature generation before being passed to an ensemble approach, a voting classifier. All of the trials in the model correctly categorize the data set. The algorithms used in that study produced good accuracy results. However, the Support Vector Classifier, with an accuracy of 98.49 percent, is the best predictor of spam messages among the numerous trials conducted. Other methods have a comparable level of precision, with a variance of around 3%.

III. MATERIALS AND METHODS

Dataset
The main aim of our project was to detect spam messages accurately. For this, we took the "SMS Spam Collection Dataset" from Kaggle.com. The dataset contains 5574 messages, each tagged as either legitimate (ham) or spam: 4827 legitimate messages and 747 spam messages. The text messages were compiled from various accessible research sources: 425 spam messages were manually selected from the "Grumbletext" website; 3375 legitimate messages were chosen at random from the National University of Singapore SMS Corpus (NSC); 450 ham messages were collected from Caroline Tagg's Ph.D. thesis; and 1324 messages were gathered from "SMS Spam Corpus v.0.1 Big", of which 1002 were legitimate messages and 322 were spam messages.

Packages
To work on our project, we imported several packages. The "pandas" package was imported to read the dataset and to convert categorical data into indicator variables (0 and 1) using the "get_dummies" function. The "nltk" package provided "stopwords" and "PorterStemmer" for text processing, while "TfidfVectorizer", which is part of "sklearn" rather than "nltk", was used for vectorization. The "re" package (Regular Expression Operations) was also used for processing text data. The "sklearn" package was imported to get the "train_test_split" function and the "LogisticRegression" class: "train_test_split" was used to split the data into training and testing sets, while "LogisticRegression" served as the prediction model. The "seaborn" and "matplotlib" packages were imported to plot the confusion matrix of our final result. The "joblib" package was imported to save the model so that it can be reused to make predictions without repeating every step.

IV. PROPOSED ANALYSIS APPROACH AND RESULTS

1. Data Preprocessing
The dataset had five columns, three of which had no values, and none of the columns had a proper name. We removed those three columns as they were of no use and gave the other two columns proper names.

The column with "spam/ham" categorical values was converted into numeric values, as machine learning algorithms work well with numeric data. This was done using the "get_dummies" function of the "pandas" package. The "get_dummies" function converts a given column into two or more new columns with values of 0 and 1, based on the categorical values present in the old column.

Label encoding refers to changing the values into numeric form so that they are machine-readable. After conversion, machine learning algorithms can decide how to operate on those labels. This is an essential step in supervised machine learning.

1.1. Stop Words Removal
For the machine to understand, analyze and apply Natural Language Processing to the data, the texts (emails in the dataset) should be readable. Machines do not understand human language, so we need to preprocess the data to make it understandable by machines. To keep the data clean, we need to clear out useless words from the dataset. Such useless words are known as 'stopwords'.

Some common examples of stopwords are 'is', 'are', 'a', 'as', etc. Stopword removal is commonly used in NLP and in text mining to eliminate useless information.

1.2. Stemming
Stemming refers to reducing a word to its root form, mostly by removing the suffix. It shrinks the vocabulary space, which in turn helps to speed up processing. It is one more method of normalizing sentences for machines.

Fig 1: Stemming

1.3. TF-IDF
Now we need to convert the text data into vectors, as machine learning algorithms work only on numeric data. For this, we use Term Frequency-Inverse Document Frequency (TF-IDF).

Term Frequency (TF) measures the frequency of a word in a document. It is found by dividing the number of occurrences of a word by the total number of words in that document. Suppose we want to find the TF of the word 'Health', which occurs 20 times in a document 1000 words long. The TF of 'Health' in that document is then 20/1000 = 0.02. Term Frequency alone does not give a good picture, as some insignificant words might occur many times in a document without carrying much weight. Term Frequency treats every word equally, but words differ in significance, so Inverse Document Frequency (IDF) is used to tackle this issue. IDF reduces the weight of terms that are very common across a set of documents. IDF is calculated by taking the log of the total number of documents divided by the number of documents in which that specific term is present. Suppose a word A1 is present in 10 documents out of 100 and a word A2 is present in 60 documents out of 100; the IDF values of A1 and A2 are then log(100/10) = 1 and log(100/60) = 0.22, respectively (using the base-10 logarithm). TF-IDF is obtained by multiplying Term Frequency (TF) and Inverse Document Frequency (IDF).

2. Implementation of Algorithm
Once the dataset has been cleaned and preprocessed, we can use the "train_test_split" function to divide it into training and testing data. To fit the model on the training data and predict whether a text is spam or not, we import the Logistic Regression algorithm and the performance metrics from the "scikit-learn" library. In our project, we used the Logistic Regression algorithm for classification. Logistic Regression is a predictive modeling algorithm that models probabilities for classification problems with two or more possible outcomes. Logistic Regression is similar to Linear Regression, except that instead of a straight line it fits an S-shaped curve whose output is mapped to either 0 or 1. To obtain this S-shaped curve, Logistic Regression uses the sigmoid function, which gives probabilities between 0 and 1. In our model, logistic regression tells us whether the message is spam or not: a value of 1 means spam, and a value of 0 means ham.

Fig 2: Logistic Regression

Suppose you get the following message on your phone:
"CONGRATULATIONS!! Your email address has won a lottery sum of USD 2,500,000.00. To claim your prize, please contact our office via email [email protected] or call +44 704 675 12446"
Here the keywords are [lottery, prize, office, email].
The given weight vector is $w = [0.3, 0.3, -0.1, -0.04]^T$.
The vector $x$ counts how often each keyword occurs in the message ('email' appears twice), and the probability that the message is spam is then:

$$x = [1, 1, 1, 2]^T$$

$$w^T x = 0.3 \cdot 1 + 0.3 \cdot 1 - 0.1 \cdot 1 - 0.04 \cdot 2 = 0.42 > 0$$

$$\Pr(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + e^{-0.42}} = 0.603$$
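The figures quoted in the TF-IDF and logistic regression examples of this section can be checked with a few lines of Python. The sketch below is only a numeric check, not the authors' implementation: the function names are hypothetical, the base-10 logarithm is assumed because it is what reproduces the quoted IDF values, and the 0.5 cut-off used to turn the probability into a label is an assumption that corresponds to the $w^T x > 0$ test in the text.

```python
import math


def term_frequency(term_count, total_words):
    """TF: occurrences of a term divided by the document length."""
    return term_count / total_words


def inverse_document_frequency(total_docs, docs_with_term):
    """IDF using a base-10 log (assumed; it matches the quoted 1 and 0.22)."""
    return math.log10(total_docs / docs_with_term)


def sigmoid(z):
    """Logistic function: maps any real score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# TF and IDF figures from the TF-IDF example
tf_health = term_frequency(20, 1000)
idf_a1 = inverse_document_frequency(100, 10)
idf_a2 = inverse_document_frequency(100, 60)

# Logistic regression example: weights w and keyword counts x from the text
keywords = ["lottery", "prize", "office", "email"]
w = [0.3, 0.3, -0.1, -0.04]
x = [1, 1, 1, 2]  # "email" occurs twice in the sample message

score = sum(w_i * x_i for w_i, x_i in zip(w, x))  # w^T x
p_spam = sigmoid(score)
label = "spam" if p_spam > 0.5 else "ham"  # assumed 0.5 cut-off

print(f"TF(Health) = {tf_health:.2f}")
print(f"IDF(A1) = {idf_a1:.2f}, IDF(A2) = {idf_a2:.2f}")
print(f"w^T x = {score:.2f}, Pr(spam) = {p_spam:.3f}, label = {label}")
```

Rounded to the precision used in the text, the computed values agree with the quoted 0.02, 1, 0.22, 0.42 and 0.603, and the sample message is labelled as spam.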
Fig 3: Working of Model

In order to test the accuracy of our model, an accuracy score metric is used. This metric compares the predicted results with the actual results. After running the code, we obtained 96% accuracy. We also plotted a heat map to see how the predicted values compare with the actual values.

Fig 4: Heat Map

V. CONCLUSION AND FUTURE SCOPE

In this study, we looked into the general applications of spam detection using NLP. We also reviewed the step-by-step process of the algorithm and how it classifies mail into spam and ham. The dataset we used in this paper is publicly available, and performance metrics were also implemented to check the model's accuracy. In the future, neural network and deep learning models can be used to predict whether a given message is spam or not. Deep learning works very well for natural language processing; however, it requires a vast amount of data to give accurate results and to outperform traditional machine learning algorithms. Since Natural Language Processing is a relatively underdeveloped area of research, further enhancements can be made to the proposed system for spam detection and email filtering in the field of online security.

REFERENCES

[1]. Omay, C. (2010). Logistic Regression: Concept and application. 2-3.
[2]. Lei, L. (2018). Research on Logistic Regression Algorithm of Breast Cancer Diagnose Data by Machine Learning. ICRIS, (pp. 3-4).
[3]. Radulescu, C., M.Dinsoreanu, & R.Potolea. (2014). Identification of Spam Comments using Natural Language Processing Techniques. ICCP.
[4]. Qaiser, Shahzad, & Ali, R. (2018). Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents. International Journal of Computer Applications, 25-29.
[5]. Sjarif, Nila, & Amir, N. (2019). SMS Spam Message Detection using Term Frequency-Inverse Document Frequency and Random Forest Algorithm. Procedia Computer Science, 509-515.
[6]. Pandey, S., & Yadav, R. (2020). Email Spam Detection using Machine Learning and Deep Learning. IJRASET.
[7]. Shankar, S. (2018). Advanced Detection of Spam And Email Filtering using NLP algorithms. IJARIT.
[8]. Emmanuel, Gbengadada, & Joseph. (2016). Machine learning for email spam filtering: review, approaches and open research problems. Heliyon.
[9]. Pragna, B., & RamaBai, M. (2019). Spam Detecting using NLP Techniques. IJRTE.

IJISRT21SEP728 www.ijisrt.com 818
