Email Based Spam Detection

This document summarizes research from 5 papers on email spam detection techniques. The papers explored using features from email headers, identifying frequently repeated words, using Bayesian classifiers to adapt to new spam patterns, identifying repetitive keywords as indicators of spam, and comparing string matching algorithms to detect spam. The proposed system in this research uses machine learning classifiers and tracks IP addresses to blacklist senders and detect spam more accurately over time based on sending patterns.
Published by : International Journal of Engineering Research & Technology (IJERT)

http://www.ijert.org ISSN: 2278-0181


Vol. 9 Issue 06, June-2020

Email based Spam Detection


Thashina Sultana, K A Sapnaz, Fathima Sana, Mrs. Jamedar Najath
Dept. of Computer Science and Engineering
Yenepoya Institute of Technology
Moodbidri, India

Abstract: Nowadays, a large share of people rely on email, including messages sent by strangers. Because anybody can send an email or a message, spammers have a golden opportunity to write spam messages that target our different interests. Spam fills the inbox with a number of ridiculous emails, degrades Internet speed to a great extent, and steals useful information such as the details in our contact list. Identifying these spammers and the spam content is a hot topic of research and a laborious task. Email spam is the practice of sending messages in bulk by email. Since the expense of spam is borne mostly by the recipient, it is effectively postage-due advertising. Spam email is a kind of commercial advertising that is economically viable because email is a very cost-effective medium for the sender. With the proposed model, a given message can be classified as spam or not using Bayes' theorem and the Naive Bayes classifier, and the IP address of the sender can also be detected.

Keywords: Term Frequency, Inverse Document Frequency, Natural Language Tool Kit.

I. INTRODUCTION

In recent years, the Internet has become an integral part of life. With increased use of the Internet, the number of email users is increasing day by day. This increasing use of email has created problems caused by unsolicited bulk email messages, commonly referred to as spam. Email has become one of the best channels for advertising, which is why spam emails are generated. Spam emails are emails that the receiver does not wish to receive: a large number of identical messages are sent to several recipients. Spam usually arises as a result of giving out our email address on an unauthorized or unscrupulous website.

Spam has many effects. It fills our inbox with a number of ridiculous emails, degrades our Internet speed to a great extent, steals useful information such as the details in our contact list, and alters our search results on any computer program. Spam is a huge waste of everybody's time and can quickly become very frustrating if you receive large amounts of it. Identifying these spammers and the spam content is a laborious task. Even though an extensive number of studies have been done, the methods proposed so far still struggle to identify spam, and none of them demonstrates the benefit of each extracted feature. Besides increasing network traffic and wasting a lot of memory space, spam messages are also used to carry out attacks. Spam emails, also known as non-self, are unsolicited commercial or malicious emails sent to affect a single individual, a corporation or a group of people. Besides advertising, these may contain links to phishing or malware-hosting websites set up to steal confidential information. Different spam filtering techniques are used to solve this problem. The available spam filtering techniques are used to protect our mailbox from spam mails.

II. LITERATURE SURVEY

In paper [1], the authors highlight several features contained in the email header that can be used to identify and classify spam messages efficiently. Those features are selected based on their performance in detecting spam messages. The paper also identifies the features common to Yahoo Mail, Gmail and Hotmail so that a generic spam detection mechanism can be proposed for all major email providers.

In paper [2], a new approach based on how frequently words are repeated is used. The key sentences of the incoming emails, those containing the keywords, are tagged, and the grammatical roles of all the words in each sentence are then determined. Finally, these are put together in a vector in order to measure the similarity between received emails. The K-Means algorithm is used to classify the received email. Vector determination is the method used to determine which category the email belongs to.

In paper [3], the authors describe cyber attacks. Phishers and malicious attackers frequently use email services to send fraudulent messages through which targeted users can lose their money and social reputation, and through which attackers gain personal credentials such as credit card numbers, passwords and other confidential data. The authors use Bayesian classifiers, which consider every single word in the mail and constantly adapt to new forms of spam.

In paper [4], the proposed system attempts to use machine learning techniques to detect a pattern of repetitive keywords that are classified as spam. The system also proposes classifying emails based on various other parameters contained in their structure, such as Cc/Bcc, domain and header. Each parameter is treated as a feature when applying the machine learning algorithm. The machine learning model is a pre-trained model with a feedback mechanism to distinguish between a proper output and an ambiguous output. This method provides an alternative architecture by which a spam filter can be implemented. The paper also takes into consideration the email body, with commonly used keywords and punctuation.

In paper [6], the authors investigate the use of string matching algorithms for spam email detection. In particular, the work examines and compares the efficiency of six well-known string matching algorithms, namely Longest Common Subsequence (LCS), Levenshtein Distance (LD), Jaro, Jaro-Winkler, Bi-gram, and TF-IDF, on two datasets: the Enron corpus and the CSDMC2010 spam dataset. They observed that the Bi-gram algorithm performs best in spam detection on both datasets.

III. PROPOSED SYSTEM

In this system, a spam classification system is created to separate spam from non-spam. Since spammers may send spam messages many times, it is difficult to identify them manually every time, so the proposed system uses several strategies to detect spam. The proposed solution not only identifies the spam words but also identifies the IP address of the system from which the spam message is sent, so that the next time a spam message is sent from the same system, the proposed system directly identifies it as blacklisted based on the IP address.

In the proposed model, the web application is built using .NET and spam detection is done using machine learning. The web application consists of the following modules:

1. User Management
Login: The user logs in to the main page with his registered name and password. Once the user logs in successfully, the authorized page is displayed; otherwise an error message is shown. Login is compulsory.
Registration: When using the website for the first time, the user must register. Registration maintains a separate account for each user and is required before logging in.

2. Compose
Input: the sender composes a new email, adding the address of the recipient, the subject and the message.
Output: the email is sent to the recipient address given by the sender.

3. Inbox
This page stores all the mails received by the user. All received mails are listed in order of date.
Input: the inbox page accepts all the incoming emails sent to an individual.
Output: the receiver can open and read the emails received at their address.

4. Sent
This folder stores all the mails sent by the user.
Input: the sender composes an email and sends it to the recipient.
Output: sent emails can be read.

5. Trash
This folder stores all the mails deleted by the user.
Input: select and delete all the unwanted emails.
Output: all the deleted emails are added to the trash bin, which stores all the deleted emails.

6. Voice Message
Input: the sender sends the email as a text message.
Output: the email is read to the receiver as a voice note.

7. Offline notification
Input: the sender sends an email.
Output: the receiver gets an offline notification in text format as an SMS.

8. Delete For Everyone
Input: the sender deletes an email that he has sent.
Output: the email is deleted for both the sender and the receiver.

9. Read Message
Input: the receiver reads the email.
Output: the sender gets a notification stating that the receiver has read the message.

When we receive a message in the inbox, that message is exported to the dataset. The message is then classified as spam or not using the Naïve Bayes classifier. Before a received message can be classified, the model has to be trained, as explained in the next section.

IV. SPAM DETECTION USING MACHINE LEARNING

1. A dataset from Kaggle is used to train the algorithm (Fig.1).

Fig.1. Dataset

2. The dataset has many fields, and some of its columns are not required, so those columns are removed. The names of the remaining columns also need to be changed.

Fig.2. Classification dataset
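The column cleanup in step 2 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the raw column names (`v1`, `v2`, `extra`), the new names (`label`, `message`) and the two sample rows are assumptions, since the text does not list the fields of the Kaggle file.

```python
import csv
import io

# Hypothetical raw export; the real Kaggle column names are not given in the text.
RAW_CSV = """v1,v2,extra
ham,Thanx,
spam,Urgent! Please call 09062703810,
"""

# Assumed mapping: columns to keep, and the clearer names they are renamed to.
KEEP_AND_RENAME = {"v1": "label", "v2": "message"}


def clean_rows(csv_text, rename=KEEP_AND_RENAME):
    """Drop every column not listed in `rename` and rename the ones that remain."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{new: row[old] for old, new in rename.items()} for row in reader]


rows = clean_rows(RAW_CSV)
for row in rows:
    print(row["label"], "|", row["message"])
```

With pandas, which the paper lists among its packages, the same step would typically be a column selection followed by `DataFrame.rename`.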


The following packages are used in the proposed model (Fig.3): NLTK (Natural Language Tool Kit) for text processing; Matplotlib for plotting graphs, histograms and bar plots; Word Cloud for presenting text data; pandas for data manipulation and analysis; and NumPy for mathematical and scientific operations.

Fig.3. Packages

3. Split the data into training and testing sets. Some percentage of the dataset is used as the training dataset and the rest as the test dataset (Fig.4).

Fig.4. Train dataset

4. Reset the train and test indices (Fig.5).

5. We need to find the most repeated words in the spam and ham messages, so the Word Cloud library is used (Fig.6 and Fig.7).

Fig.6. Spam word cloud

Fig.7. Ham word cloud
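The split, the index reset and the word counting behind the word clouds can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the 75% training fraction, the fixed seed and the toy rows are hypothetical (the text only says "some percentage"), and a word cloud is approximated here by the raw word counts it would visualise.

```python
import random
from collections import Counter

# Hypothetical (label, message) rows standing in for the cleaned dataset.
ROWS = [
    ("spam", "free prize call now"),
    ("spam", "call now to claim your prize"),
    ("spam", "urgent call this number"),
    ("ham", "see you at lunch"),
    ("ham", "thanx see you later"),
    ("ham", "lunch at noon"),
]


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle the rows and split them; the new list positions act as the reset index."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def word_counts(rows, label):
    """Count how often each word occurs in the messages carrying the given label."""
    counts = Counter()
    for row_label, message in rows:
        if row_label == label:
            counts.update(message.lower().split())
    return counts


train_rows, test_rows = train_test_split(ROWS)
spam_counts = word_counts(ROWS, "spam")
ham_counts = word_counts(ROWS, "ham")
print(len(train_rows), len(test_rows))
print(spam_counts.most_common(1), ham_counts["see"])
```

Fixing the seed keeps the split reproducible between runs, which makes later accuracy figures comparable.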

5. Whenever a message arrives, it must first be preprocessed. All input characters are converted to lowercase.

6. The text is then split into small pieces and the punctuation is removed. This tokenization step both removes punctuation and splits the message.

7. The Porter Stemming Algorithm is used for stemming. Stemming is the process of reducing words to their root word.
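The preprocessing described above (lowercasing, tokenization, stemming) can be sketched as below. The regular expression and the function names are assumptions made for illustration. The sketch uses NLTK's `PorterStemmer` when NLTK is installed; otherwise it falls back to a crude suffix stripper that is only a stand-in and is not the Porter algorithm.

```python
import re

try:
    # Porter stemmer from NLTK, the toolkit named in the text.
    from nltk.stem import PorterStemmer
    _default_stem = PorterStemmer().stem
except ImportError:
    def _default_stem(word):
        """Crude stand-in used only when NLTK is missing; NOT the Porter algorithm."""
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

# Assumed token pattern: runs of lowercase letters or digits; everything else is dropped.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(message):
    """Lowercase the message, strip punctuation and split it into tokens."""
    return TOKEN_RE.findall(message.lower())


def preprocess(message, stem=_default_stem):
    """Tokenize the message and reduce every token to its stem."""
    return [stem(token) for token in tokenize(message)]


print(tokenize("Urgent! Please call 09062703810"))
print(preprocess("calls"))
```

Passing the stemmer in as a parameter keeps the tokenizer testable on its own and makes it easy to swap in a different stemmer later.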

8. We need to find the probability of each word in the spam and ham messages.

Fig.5. Reset train and test index


$$P(\text{word}) = \frac{\text{Total occurrences of the word}}{\text{Total number of words}}$$

Eqn.1. Frequency of word

Then the spam word frequency is calculated as follows:

$$P(\text{word} \mid \text{spam}) = \frac{\text{Total occurrences of the word in spam messages}}{\text{Total number of words in the spam messages}}$$

Eqn.2. Spam word frequency
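A minimal sketch of how the per-class word frequencies of Eqn.2 feed a Naive Bayes decision is given below. It is not the authors' implementation: the toy training messages are hypothetical, and the add-one (Laplace) smoothing and log-probabilities are assumptions added so that unseen words do not produce zero probabilities. With `alpha=0` the function `word_probability` reduces exactly to Eqn.2.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized training messages.
TRAINING = [
    ("spam", ["urgent", "call", "now"]),
    ("spam", ["free", "prize", "call"]),
    ("ham", ["thanx", "see", "you"]),
    ("ham", ["call", "me", "later"]),
]


def train(messages):
    """Collect per-class word counts and per-class message counts."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    message_counts = Counter()
    for label, tokens in messages:
        word_counts[label].update(tokens)
        message_counts[label] += 1
    return word_counts, message_counts


def word_probability(word, label, word_counts, vocab_size, alpha=1.0):
    """P(word | label) as in Eqn.2, with add-alpha smoothing (an assumption)."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + alpha) / (total + alpha * vocab_size)


def classify(tokens, word_counts, message_counts):
    """Return the label with the larger log posterior under Bayes' theorem."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    n_messages = sum(message_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(message_counts[label] / n_messages)  # class prior
        for token in tokens:
            score += math.log(word_probability(token, label, word_counts, len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


word_counts, message_counts = train(TRAINING)
print(classify(["urgent", "please", "call"], word_counts, message_counts))
print(classify(["thanx"], word_counts, message_counts))
```

Summing logarithms instead of multiplying raw probabilities avoids numerical underflow on long messages; the decision is the same because the logarithm is monotonic.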


9. TF-IDF (term frequency-inverse document frequency) has to be calculated.
TF: Term Frequency, which measures how often a term occurs in a document.

$$\mathrm{TF}(t) = \frac{\text{Number of times } t \text{ appears in a document}}{\text{Total number of terms in the document}}$$

IDF: Inverse Document Frequency, which measures the significance of the term.

$$\mathrm{IDF}(t) = \log_e\left(\frac{\text{Total number of documents}}{\text{Number of documents containing } t}\right)$$

10. See how well the model performs by evaluating the Naïve Bayes classifier and reporting the accuracy score.

V. RESULTS AND DISCUSSIONS

When we receive a message in the inbox, that message is exported to the dataset (Fig.8) and is then classified as spam or not.

Fig.8. Exported Dataset

The exported message is classified as spam or not using Bayes' theorem and the Naive Bayes classifier, following all the steps discussed above, including finding the probability of words in spam and ham messages. Fig.9 and Fig.10 show messages that were detected as spam and ham.

If "Urgent! Please call 09062703810" is a message exported from the inbox to the dataset, then based on the trained dataset and using Bayes' theorem and the Naive Bayes classifier, the message is detected as spam (Fig.9).

Fig.9. Spam message

If "Thanx" is a message exported from the inbox to the dataset, then using Bayes' theorem and the Naive Bayes classifier, the message is detected as ham (Fig.10).

Fig.10. Ham message

The IP address of the sender can also be detected (Fig.11).

Fig.11. IP address of the sender

VI. CONCLUSION

Email is one of the most important media of communication nowadays; through Internet connectivity, any message can be delivered all over the world. More than 270 billion emails are exchanged daily, and about 57% of these are spam. Spam emails, also known as non-self, are undesired commercial or malicious emails that target personal information such as banking and other financial details, or otherwise cause harm to a single individual, a corporation or a group of people. Besides advertising, these may contain links to phishing or malware-hosting websites set up to steal confidential information. Spam is a serious issue that is not just annoying to end users but also financially damaging and a security risk. Hence this system is designed to detect unsolicited and unwanted emails and prevent them, helping to reduce spam, which would be of great benefit to individuals as well as to companies. In the future, this system can be implemented using different algorithms, and more features can be added to the existing system.


REFERENCES

[1] Shukor Bin Abd Razak, Ahmad Fahrulrazie Bin Mohamad, "Identification of Spam Email Based on Information from Email Header", 13th International Conference on Intelligent Systems Design and Applications (ISDA), 2013.
[2] Mohammed Reza Parsei, Mohammed Salehi, "E-Mail Spam Detection Based on Part of Speech Tagging", 2nd International Conference on Knowledge Based Engineering and Innovation (KBEI), 2015.
[3] Sunil B. Rathod, Tareek M. Pattewar, "Content Based Spam Detection in Email using Bayesian Classifier", presented at the IEEE ICCSP 2015 conference.
[4] Aakash Atul Alurkar, Sourabh Bharat Ranade, Shreeya Vijay Joshi, Siddhesh Sanjay Ranade, Piyush A. Sonewa, Parikshit N. Mahalle, Arvind V. Deshpande, "A Proposed Data Science Approach for Email Spam Classification using Machine Learning Techniques", 2017.
[5] Kriti Agarwal, Tarun Kumar, "Email Spam Detection using integrated approach of Naïve Bayes and Particle Swarm Optimization", Proceedings of the Second International Conference on Intelligent Computing and Control Systems (ICICCS), 2018.
[6] Cihan Varol, Hezha M. Tareq Abdulhadi, "Comparison of String Matching Algorithms on Spam Email Detection", International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism, Dec. 2018.
[7] Duan, Lixin, Dong Xu, and Ivor Wai-Hung Tsang. "Domain adaptation from multiple sources: A domain-dependent regularization approach." IEEE Transactions on Neural Networks and Learning Systems 23.3 (2012).
[8] Mujtaba, Ghulam, et al. "Email classification research trends: Review and open issues." IEEE Access 5 (2017).
[9] Trivedi, Shrawan Kumar. "A study of machine learning classifiers for spam detection." Computational and Business Intelligence (ISCBI), 2016 4th International Symposium on. IEEE, 2016.
[10] You, Wanqing, et al. "Web Service-Enabled Spam Filtering with Naïve Bayes Classification." 2015 IEEE First International Conference on Big Data Computing Service and Applications (BigDataService). IEEE, 2015.
[11] Rathod, Sunil B., and Tareek M. Pattewar. "Content based spam detection in email using Bayesian classifier." International Conference on. IEEE, 2015.
[12] Sahın, Esra, Murat Aydos, and Fatih Orhan. "Spam/ham e-mail classification using machine learning methods based on bag of words technique." 2018 26th Signal Processing and Communications Applications Conference (SIU). IEEE, 2018.

IJERTV9IS060087 www.ijert.org 139


(This work is licensed under a Creative Commons Attribution 4.0 International License.)