Spam Message Detection Using Logistic Regression
ISSN No:-2456-2165
Abstract:- The use of the internet is increasing day by day, and spammers consistently try to spam people by sending fraudulent mails and SMS. Mails and SMS are among the most important and most used means of communication, because of which 2.4 billion messages are sent every second. With the rise of such exchange of emails and messages, some find it an opportunity to fill others' inboxes with preposterous messages that reduce internet speed and plunder personal data. However, due to recent advancements in technology, it is possible to find solutions to such problems easily. With the help of Natural Language Processing and Machine Learning, we can quickly detect spam messages. One of the crucial aspects of research in the world of machine learning applications is "NLP". In this paper, we have proposed a model where emails would be classified into the categories of Spam or Ham.

Keywords:- Spam-Detector, Natural Language Processing, Logistic Regression.

I. INTRODUCTION

Technology is advancing at a high rate. A few decades ago, the only means of communication was letters, which gave way to telegrams; today communication takes various forms such as emails, phone calls, and SMS. An average person sends 72 messages per day, as texting is also the most common cell phone activity. Almost 300 billion emails are exchanged per day, and half of them are spam emails. 'Spam mail' refers to undesired and unwanted emails that are sent to many recipients and simply fill up their inboxes. Most of these messages contain product-buying links, which would consume our personal data, or other links and attachments. Sometimes carelessness on the part of users can cause significant damage to their personal data. Spam not only fills the inbox with junk mail but also adds to email traffic. Spam messages accounted for 45.1% of email traffic in March 2021. In short, such mails can be frustrating and dangerous at the same time.

Inboxes are 85% filled with spam mails, due to which valuable and important emails are ignored. Many researchers are developing techniques to solve such problems and secure communication. Unsolicited emails are termed 'Spam', while important and valuable ones are termed 'Ham'.

Many techniques have been developed to classify spam and ham mails. One such technique uses Natural Language Processing and Machine Learning. With the help of text classification methods like stemming, lemmatization, and vectorization, it is possible to classify the mails and train a model that will be able to detect unwanted mails.

In this study, we have come up with our own model that classifies emails and messages as either spam or ham. Performance metrics such as accuracy were considered in evaluating the proposed study. The results obtained from experiments confirmed that the proposed research achieved high accuracy.

II. LITERATURE SURVEY

In this paper [1], (Omay, 2010) the author described the history and explained the concept of logistic regression. He also explained the types of logistic regression, namely Binary Logistic Regression, Multinomial Logistic Regression, and Ordinal Logistic Regression; however, he gave detailed information only on binary logistic regression. The primary purpose of the paper is to assess the combined influence of independent variables on a dependent variable. For this, the author conducted a study on 200 students from Ankara University, and the dependent/target variable was critical thinking. The author found that an increase of one unit in scientific thinking led directly to a 14.4 percent increase in critical thinking, and a rise of one unit in epistemological belief resulted in a 4.9 percent increase in high critical thinking.

In this paper [2], (Lei, 2018) the author, Liu Lei, showed how logistic regression could be used quickly and efficiently to detect breast cancer. He applied a logistic regression model to the breast cancer dataset. The author got the most accurate results, with an accuracy of 96.5%, when 'Maximum Texture' and 'Maximum Perimeter' were chosen as input to the model. In contrast, he got an accuracy of 90.48% when he took 'Mean Texture' and 'Mean Radius' as input to the model. Therefore, choosing a better feature combination will give more accurate results.

In this paper [3], (Radulescu, M.Dinsoreanu, & R.Potolea, 2014) the main goal is to detect spam comments. This was achieved by considering unclear comments with increased punctuation marks, stop words, non-ASCII characters, new lines, capital letters, and offensive words, and converting them into vectors to classify them into spam or non-spam comments. Next, they added the word duplication ratio, since spam comments tend to have repeated words, and the stop-word ratio, which is the count of stop words divided by the total count of words in the comment. This increased the accuracy of classification. Finally, they added