Email Based Spam Detection
Abstract— Nowadays, a large share of people rely on email and on messages sent by strangers. Because anybody can send an email or a message, spammers have a golden opportunity to write spam messages that target our different interests. Spam fills the inbox with unwanted emails, degrades Internet speed to a great extent, and steals useful information such as the details in our contact lists. Identifying spammers and spam content is an active research topic and a laborious task. Email spam is the practice of sending messages in bulk by email. Since the expense of spam is borne mostly by the recipient, it is effectively postage-due advertising. Spam email is a kind of commercial advertising that is economically viable because email is a very cost-effective medium for the sender. With the proposed model, a given message can be classified as spam or not spam using Bayes’ theorem and a Naive Bayes classifier, and the IP address of the sender can also be detected.
Keywords— Term Frequency, Inverse Document Frequency, language tool kit.
I. INTRODUCTION
In recent years, the Internet has become an integral part of life. With the increased use of the Internet, the number of email users is growing day by day. This growing use of email has created problems caused by unsolicited bulk email messages, commonly referred to as spam. Email has become one of the best channels for advertising, which is why spam emails are generated. Spam emails are emails that the receiver does not wish to receive: a large number of identical messages sent to many recipients. Spam usually arises as a result of giving out our email address on an unauthorized or unscrupulous website.
Spam has many effects. It fills our inbox with unwanted emails, degrades our Internet speed to a great extent, steals useful information such as the details in our contact list, and alters our search results. Spam is a huge waste of everybody’s time and can quickly become very frustrating when received in large amounts.
Identifying spammers and spam content is a laborious task. Even though an extensive number of studies have been done, the methods proposed so far still struggle to distinguish spam, and none of them demonstrates the benefit of each extracted feature type. Besides increasing network traffic and wasting a lot of memory space, spam messages are also used for attacks. Spam emails, also known as non-self, are unsolicited commercial or malicious emails sent to affect a single individual, a corporation, or a group of people. Besides advertising, they may contain links to phishing or malware-hosting websites set up to steal confidential information. Different spam filtering techniques are used to solve this problem; they protect our mailbox from spam mails.
II. LITERATURE SURVEY
In paper [1], the authors highlight several features contained in the email header that can be used to identify and classify spam messages efficiently. These features are selected based on their performance in detecting spam messages. The paper also identifies the features common to Yahoo Mail, Gmail, and Hotmail so that a generic spam detection mechanism can be proposed for all major email providers.
In paper [2], a new approach based on how frequently words are repeated is used. The key sentences of an incoming email, that is, those containing the keywords, are tagged, and the grammatical roles of all the words in each sentence are then determined. Finally, these are put together in a vector in order to measure the similarity between received emails. The K-Means algorithm is used to classify the received email, and vector determination is the method used to decide which category the email belongs to.
In paper [3], the authors describe cyber attacks. Phishers and malicious attackers frequently use email services to send fraudulent messages through which the targeted user can lose money and social reputation. Such attacks result in the attacker gaining personal credentials such as credit card numbers, passwords, and other confidential data. The authors use Bayesian classifiers, which consider every single word in the mail and constantly adapt to new forms of spam.
In paper [4], the proposed system uses machine learning techniques to detect patterns of repetitive keywords that are classified as spam. The system also proposes classifying emails based on various other parameters contained in their structure, such as Cc/Bcc, domain, and header. Each parameter is treated as a feature when it is supplied to the machine learning algorithm. The machine learning model is a pre-trained model with a feedback mechanism to distinguish between a proper output and an ambiguous output. This method provides an alternative architecture by which a spam filter can be implemented. The paper also takes into consideration the email body, with its commonly used keywords and punctuation.
In paper [5], the authors investigate the use of string matching algorithms for spam email detection. In particular, the work examines and compares the efficiency of six well-known string matching algorithms, namely Longest Common Subsequence (LCS), Levenshtein Distance (LD), Jaro, Jaro-Winkler, Bi-gram, and TF-IDF, on two datasets: the Enron corpus and the CSDMC2010 spam dataset. They observed that the Bi-gram algorithm performs best in spam detection on both datasets.
III. PROPOSED SYSTEM
In this system, a spam classification system is created to distinguish spam from non-spam. Since spammers may send spam messages many times, it is difficult to identify them manually every time, so the proposed system uses several strategies to detect spam. The proposed solution not only identifies spam words but also identifies the IP address of the system from which the spam message is sent, so that the next time a spam message arrives from the same system, it is directly identified as blacklisted based on the IP address.
In the proposed model, the web application is built using .NET and spam detection is done using machine learning. The web application consists of the following modules:
1. User Management
Registration: A first-time user of the website must register. Registration maintains a separate account for each user and is required before logging in.
Login: The user logs in to the main page with his registered name and password. On a successful login the authorized page is displayed; otherwise an error message is shown. Login is compulsory.
2. Compose
Input: The sender composes a new email, adding the address of the recipient, the subject, and the message.
Output: The email is sent to the recipient address entered by the sender.
3. Inbox
This page stores all the mails received by the user, listed in order of date.
Input: The inbox page accepts all incoming emails sent to an individual.
Output: The receiver can open and read the emails received at their address.
4. Sent
This folder stores all the mails sent by the user.
Input: The sender composes an email and sends it to the recipient.
Output: The sent email can be read.
5. Trash
This folder stores all the mails deleted by the user.
Input: The user selects and deletes the unwanted emails.
Output: All the deleted emails are added to the trash bin.
6. Voice Message
Input: The sender sends the email as a text message.
Output: The email is read out to the receiver as a voice note.
7. Offline notification
Input: The sender sends an email.
Output: The receiver gets an offline notification in text format as an SMS.
8. Delete For everyone
Input: The sender deletes an email which he has sent.
Output: The email is deleted for both the sender and the receiver.
9. Read Message
Input: The receiver reads the email.
Output: The sender gets a notification stating that the receiver has read the message.
When a message is received in the inbox, it is exported to the dataset and then classified as spam or not spam using the Naïve Bayes classifier. Before a received message can be classified, the model has to be trained, as explained in the section below.
IV. SPAM DETECTION USING MACHINE LEARNING
1. A dataset from Kaggle, shown below, is used to train the algorithm.
Fig.1. Dataset
2. The dataset has many fields, some of which are not required. The columns that are not required are removed, and the remaining columns are renamed.
Fig.2. Classification dataset
Fig.3. Packages
Fig.6. Spam word cloud
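
Step 2 of Section IV, which drops the columns that are not required and renames the rest, can be sketched as follows. This is only a minimal illustration using the Python standard library: the in-memory sample, the original column names (`v1`, `v2`, `extra`) and the new names (`label`, `message`) are assumptions, since the actual schema of the Kaggle file is not listed in the text.

```python
import csv
import io

# Hypothetical sample standing in for the Kaggle file; the real column
# names are not given in the paper, so "v1", "v2" and "extra" are assumptions.
RAW_CSV = """v1,v2,extra
ham,See you at lunch,
spam,Urgent! Please call 09062703810,
"""

# Assumed mapping from original column names to descriptive names.
# Any column that is not listed here is dropped.
RENAME = {"v1": "label", "v2": "message"}


def clean_rows(csv_text, rename):
    """Keep only the columns listed in `rename` and give them new names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{new: row[old] for old, new in rename.items()} for row in reader]


rows = clean_rows(RAW_CSV, RENAME)
for row in rows:
    print(row)
```

Keeping the mapping in one dictionary means that the list of retained columns and their new names are defined in a single place.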
3. Split the data into training and testing sets as shown below. A percentage of the dataset is used as the training set and the rest as the test set.
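
The split in step 3 can be sketched as below. The text does not state which percentage is used, so the 75/25 ratio, the fixed random seed, and the helper name `split_dataset` are all assumptions made for illustration.

```python
import random


def split_dataset(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of `rows` and return (train, test) lists.

    `test_fraction` and `seed` are illustrative defaults, not values
    taken from the paper.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                   # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up labelled messages, for illustration only.
messages = [("ham", f"message {i}") for i in range(8)]
train, test = split_dataset(messages)
print(len(train), len(test))
```

Shuffling before the split avoids a test set that contains only the last rows of the file, which matters if the file happens to be ordered by label.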
Fig.4. Train dataset
6. The text is then split into small pieces and the punctuation is removed. This tokenization process both removes punctuation and splits each message into tokens.
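
A minimal sketch of the tokenization step is given below. It uses only the Python standard library rather than any particular toolkit; converting the text to lower case and removing only ASCII punctuation are assumptions that go beyond what the step states.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(message):
    """Lower-case the message, strip punctuation, and split on whitespace."""
    return message.lower().translate(_PUNCT_TABLE).split()


tokens = tokenize("Urgent! Please call 09062703810")
print(tokens)  # ['urgent', 'please', 'call', '09062703810']
```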
$$P(\text{word}) = \frac{\text{Total occurrences of word}}{\text{Total number of words}}$$
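
The word probability above, computed separately for spam and ham messages, is the quantity a Naive Bayes classifier combines with Bayes’ theorem. The sketch below shows one possible way to do this. The tiny training set is made up, and the add-one (Laplace) smoothing and the use of log-probabilities are assumptions added so that unseen words and long messages do not cause zero or underflowing products; the text does not say how these cases are handled.

```python
import math
from collections import Counter


def count_words(messages):
    """Return (word counts, total number of words) for tokenized messages."""
    counts = Counter(word for tokens in messages for word in tokens)
    return counts, sum(counts.values())


def word_probability(word, counts, total):
    """P(word) = total occurrences of word / total number of words."""
    return counts[word] / total


def classify(tokens, spam_msgs, ham_msgs):
    """Label `tokens` as 'spam' or 'ham' with a Naive Bayes decision rule."""
    spam_counts, spam_total = count_words(spam_msgs)
    ham_counts, ham_total = count_words(ham_msgs)
    vocab_size = len(set(spam_counts) | set(ham_counts))
    n_msgs = len(spam_msgs) + len(ham_msgs)

    def log_posterior(counts, total, n_class_msgs):
        # log P(class) + sum of log P(word | class). Add-one smoothing is
        # an assumption, used so unseen words do not give zero probability.
        score = math.log(n_class_msgs / n_msgs)
        for word in tokens:
            score += math.log((counts[word] + 1) / (total + vocab_size))
        return score

    spam_score = log_posterior(spam_counts, spam_total, len(spam_msgs))
    ham_score = log_posterior(ham_counts, ham_total, len(ham_msgs))
    return "spam" if spam_score > ham_score else "ham"


# Tiny made-up training set (already tokenized), for illustration only.
spam_train = [["urgent", "call", "now"], ["win", "cash", "call"]]
ham_train = [["see", "you", "at", "lunch"], ["please", "send", "the", "report"]]

print(word_probability("call", *count_words(spam_train)))
print(classify(["urgent", "please", "call", "09062703810"], spam_train, ham_train))
```

With this toy data the message is labelled spam because "urgent" and "call" occur only in the spam examples. A real model would be trained on the full training set.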
Fig.8. Exported Dataset
The exported message is classified as spam or not spam using Bayes’ theorem and the Naive Bayes classifier, following all the steps discussed above and using the probabilities of words in spam and ham messages. The figures below show messages that were detected as spam and as ham.
If “Urgent! Please call 09062703810” is a message exported from the inbox to the dataset, then, based on the trained dataset and using Bayes’ theorem and the Naive Bayes classifier, the message is detected as spam, as shown below.
VI. CONCLUSION
Email is the most important medium of communication nowadays; through Internet connectivity, any message can be delivered all over the world. More than 270 billion emails are exchanged daily, and about 57% of these are spam. Spam emails, also known as non-self, are undesired commercial or malicious emails that target personal information, such as bank or other financial details, or anything else whose loss causes harm to a single individual, a corporation, or a group of people. Besides advertising, they may contain links to phishing or malware-hosting websites set up to steal confidential information. Spam is a serious issue that is not just annoying to end users but also financially damaging and a security risk. Hence this system is designed to detect unsolicited and unwanted emails and prevent them, helping to reduce spam, which would be of great benefit to individuals as well as to companies. In the future, the system can be implemented using different algorithms, and more features can be added to the existing system.
REFERENCES