Harnessing the Power of Machine Learning for Email Spam Classification

Utkarsh Gupta
(6th semester, Section 'G', Roll No. 67)
Computer Science and Engineering

Graphic Era Hill University

Dehradun, Uttarakhand


Abstract— The volume of email sent and received in day-to-day life keeps growing, and spam email has grown with it, becoming one of the biggest problems affecting our globally integrated communication system. Earlier solutions for filtering and hiding spam included blacklisting the specific domains that send spam and manually detecting specific keywords. A lot of research has been done to make spam filtering more accurate at classifying emails as ham (real or valid email) or spam by using ML classifiers. This system uses machine learning techniques to detect patterns of repetitive keywords that are classified as spam. Even so, we still receive a lot of spam email in our daily life. This is not a failure of the filters; it happens because spammers keep adopting new technology. Among the approaches that have been developed to reduce email spam, filtration is an important technique, and research in the field of spam classification remains important.

Keywords: ML algorithm, email spam classifier, spam, spam filter

I. INTRODUCTION

Nowadays email has become an essential part of our lives, and internet usage has increased drastically over the past few years. Social media usage has also increased. As the amount of mail arriving in our inboxes every day has grown, email spam has grown as well. Data collected from the internet shows that the number of emails sent and received per day was 347.3 billion in 2023, a 4.3% increase over the 2022 figure of 333.2 billion, which shows how the usage of email is increasing with the
passing years. With the increase in emails, it is difficult to differentiate between a real email and a spam email, which raises cyber security concerns.

Our email addresses are collected by spammers through chatrooms, websites, and newsgroups and are sold to other spammers. Through this, the number of spam messages increases rapidly. According to 2023 data, 3.4 billion spam emails are sent every day. Google itself blocks approximately 100 million spam emails daily, and over 45% of the emails sent in 2022 were spam. So, to reduce spam, we need a technology that distinguishes between spam email and real email. The implementation of a system that delays the transmission of some Gmail messages for a short period of time has improved Google's performance in detecting phishing attacks, since these attacks are easier to spot when they are examined all at once. Delaying the distribution of some of these questionable emails allows for a more thorough investigation while waiting for the arrival of additional messages and real-time algorithm updates. This intentional delay affects only 0.05 percent of emails.

Machine learning has revolutionized the way complex problems are solved. It provides powerful algorithms and techniques that learn from real-time or historical data and make accurate predictions, which helps to tackle the challenge of email spam.

Leading email providers like Gmail, Yahoo Mail, and Outlook have combined a variety of machine learning (ML) techniques, including neural networks, in their spam filters to tackle the danger posed by email spam. These machine learning approaches can learn to recognise spam emails and phishing communications by examining a large number of these messages across a large network of computers. Gmail and Yahoo Mail spam filters go beyond simply scanning for spam using pre-existing rules, since machine learning can adapt to changing circumstances. As they continue their spam filtering activity, they create new rules on their own using what they have learned.

In this paper, we set off on a tour of machine learning-based email spam classification. We investigate this technology from its fundamental ideas to its actual uses, going into detail about the significance of feature engineering, data preparation, model selection, and assessment metrics, all of which are crucial for creating a reliable email spam classifier. A reliable classifier for email spam would have far-reaching effects. By clearing out the clutter in our inboxes, individuals may increase productivity, safeguard their privacy, and lessen the threats brought on by harmful email. We discuss a number of topics related to classifying email spam, such as the different machine learning techniques that are frequently employed, the difficulties encountered in real-world situations, and methods for enhancing classifier performance over time. This paper offers insightful information and useful skills to anyone interested in learning more about the inner workings of this technology, whether they are aspiring data scientists, cybersecurity enthusiasts, or just curious.

II. LITERATURE SURVEY

One of the major problems of today's internet is spam email, which causes financial damage to individual users and to companies. Among the approaches developed to stop spam, filtering techniques are the most important. Filtering removes unrequested emails from the user's inbox. Unrequested mail already causes the problem of filling up mailboxes and wasting users' time [1]. Two different methods were classified in paper [2]. In the first method, some rules are defined manually; one example is a rule-based expert system. This method is suitable when all the classes are static and can easily be separated according to a few features. The second method uses machine learning techniques. Paper [3] uses a collection of criterion functions to define the clustering of spam messages, which finds similar keywords between the statements or messages in a cluster and can also be defined with the help of the K-nearest neighbor (KNN) algorithm. Paper [4] compares four different classifiers: Neural Network, SVM, Naïve Bayesian, and J48. The implementation is performed on different data and attribute sizes. In the final result, an output of '1' means spam and an output of '0' means not spam.

Papers [5]-[7] describe automatic anti-spam filtering as becoming an important feature of the internet for the
growing family of junk-filtering tools. The researchers separated the numeric distance measure from the nominal variables and afterwards combined them into an overall distance measure. In a second method, the nominal variables are converted to numeric variables, and the distance measure is then calculated from these variables. In paper [8], the researchers analyzed the computational complexity of the algorithm and tested its application on a number of data sets taken from different domains. The main idea in [9] is that spammers send either only phishing emails or no phishing emails at all; [10] shows that most communities of spammers send only phishing emails or no phishing emails at all; and [11] states that many different groups of spammers exhibit related behavior within their communities or share the same IP addresses [12]. It has been explained that both methods generalize little from small examples and that the methods show similar generalization behavior on this type of problem, even with large training sets [13].

Goodman et al. summarized the different approaches, apart from machine learning, used in email spam filtration. They state that email spam filtration was under the control of the user, while the real conflict was between the generators of spam and the researchers working against it [14].

A client can easily send or receive an email with just a single click through an ISP. Client-level spam filtering provides a framework for an individual client to secure his or her mail transmission system. A client can do this by installing one of several existing frameworks on their PC or system. The installed framework interacts directly with the mail user agent (MUA) and filters the inbox by accepting and managing particular messages [15].

Although spam filtering methods lighten the burden on the receiver, there is still a need to develop an email spam detection system that gives results more efficiently and accurately. Along with this, a system that gives user-specific results has long been desired. This ensures that a user-friendly system is developed.

III. METHODOLOGY

Most of the email spam cleaning techniques developed so far are based purely on text classification. The filtration of spam has thus turned into a set of multiple problems. In this paper, work is done to extract an attribute vector from the statements in an email. Three machine learning algorithms, SVC, Multinomial Naïve Bayes, and Decision Tree Classifier, are used to train the model.

[Flowchart boxes: Dataset; Pre-process data; Split Data; Train data; Test data; Make classification; Check for accuracy; Spam email; Not spam]

Figure 1: Architecture of the proposed system

1. Collection and preprocessing of data:

Dataset: Collection of a labeled dataset that contains spam and non-spam emails.

Preprocessing:

• Tokenization: Breaking the emails into separate words.
• Lowercasing: Converting each statement into lowercase to ensure uniformity.
• Removing of Stop Words: Eliminating common words such as "and" or "the".
• Feature Extraction: Converting the text data into numeric attributes such as TF-IDF (term frequency-inverse document frequency) attributes.
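The preprocessing steps listed in step 1 can be sketched in a few lines of standard-library Python. This is only a minimal illustration, not the implementation used in this paper: the stop-word list, the tokenizing regular expression, the function names, and the sample emails are all assumptions made for the example, and the TF-IDF weighting shown (term count divided by document length, times log(N / document frequency)) is just one common variant.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"and", "the", "a", "to", "is", "of", "for", "at"}

def preprocess(email_text):
    """Lowercase the text, tokenize it into words, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", email_text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(documents):
    """Return (vocabulary, one TF-IDF vector per document).

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    token_lists = [preprocess(doc) for doc in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty emails
        vectors.append([
            (counts[t] / total) * math.log(n_docs / doc_freq[t])
            for t in vocabulary
        ])
    return vocabulary, vectors

# Made-up sample emails, for illustration only.
emails = [
    "Win a FREE prize, call now!",
    "Meeting at noon and the agenda is attached",
    "Call me about the meeting",
]
vocab, vectors = tfidf_vectors(emails)
for email, vector in zip(emails, vectors):
    print(email, [round(v, 3) for v in vector])
```

A token that appears in every document gets an IDF of zero under this variant, so it contributes nothing to the feature vector, which is the intended effect of down-weighting uninformative words.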
Figure 2. Common keywords in non-spam mails

These keywords indicate non-spam emails, which are safe and do not contain any wrong information or any breach of cyber security.

This graph shows the specific keywords used in most recent emails that are not spam.

Figure 2. Common keywords in spam mails

These keywords indicate spam emails, which are not safe and contain wrong or false information or viruses that may harm the system.

2. Model Selection:

Support Vector Classifier (SVC):

Kernel Selection: Choosing an appropriate SVC kernel (linear, radial basis function, etc.) based on implementation and performance.

Hyperparameter Tuning: Using techniques such as grid search or randomized search to find the optimal hyperparameters for the SVC model.

Multinomial Naive Bayes (MultinomialNB):

Text Vectorization: Converting the textual data into a numeric format suitable for MultinomialNB, using a technique such as TF-IDF.

Only two categories are required here: spam and non-spam (ham). Almost every statistics-based spam filter uses Bayesian probability to combine the statistics of individual tokens into an overall score and makes its decision based on that score. Usually, these filters first go through a training stage that collects each token's statistics. The statistic we are mostly interested in for a token T is its spamminess, which is calculated as follows:

$$S[T] = \frac{C_{spam}(T)}{C_{ham}(T) + C_{spam}(T)}$$

where $C_{spam}(T)$ and $C_{ham}(T)$ are the numbers of spam and ham statements that contain token T. An easy way to make a classification is to calculate the result for the spam tokens and compare it with the result for the ham tokens.

The mail is classified as spam if the overall spamminess product S[M] is greater than the hamminess product H[M].

Stage 1. Training - Resolve every email into its constituent tokens and produce a probability for each individual token W:

$$S[W] = \frac{C_{spam}(W)}{C_{ham}(W) + C_{spam}(W)}$$

Save the spamminess feature to a database for Stage 2.

Stage 2. Filtering - For every statement M, scan the statement for each incoming token Ti and query the database for its spamminess S[Ti].

Now calculate the accumulated statement probabilities S[M] and H[M].

Now calculate the overall filtering indication for the statement, given by: I[M] = f(S[M], H[M])

f is a filter-dependent function.

If I[M] > threshold

    statement is declared as spam

else

    statement is declared as non-spam
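The two stages above can be sketched as follows. This is a hedged illustration rather than the filter used in this paper: the function names, the whitespace tokenization, and the toy training emails are assumptions. Because the text leaves f and the hamminess of a single token unspecified, the sketch assumes H[T] = 1 - S[T], takes f to be the log-ratio log(S[M] / H[M]), and clamps each S[T] away from 0 and 1 so that one token cannot zero out a product.

```python
import math
from collections import Counter

def train(spam_emails, ham_emails):
    """Stage 1: compute S[W] = C_spam(W) / (C_ham(W) + C_spam(W)) for each token."""
    # C_spam(W) and C_ham(W) count the emails that contain W, hence set() per email.
    c_spam = Counter(t for e in spam_emails for t in set(e.lower().split()))
    c_ham = Counter(t for e in ham_emails for t in set(e.lower().split()))
    return {t: c_spam[t] / (c_ham[t] + c_spam[t]) for t in set(c_spam) | set(c_ham)}

def classify(email, spamminess, threshold=0.0, eps=0.01):
    """Stage 2: accumulate S[M] and H[M], then compare I[M] with a threshold."""
    log_s = log_h = 0.0
    for token in set(email.lower().split()):
        if token not in spamminess:
            continue  # unseen tokens carry no statistics
        # Assumption: clamp to [eps, 1 - eps] so no single token zeroes a product.
        s = min(max(spamminess[token], eps), 1.0 - eps)
        log_s += math.log(s)        # log of the spamminess product S[M]
        log_h += math.log(1.0 - s)  # log of the hamminess product H[M], with H[T] = 1 - S[T]
    indicator = log_s - log_h       # assumed f: I[M] = log(S[M] / H[M])
    return "spam" if indicator > threshold else "non-spam"

# Toy training data, made up for illustration.
spam = ["win free prize now", "free cash call now"]
ham = ["meeting at noon", "call me about the meeting"]
model = train(spam, ham)

print(classify("free prize inside", model))
print(classify("meeting about noon", model))
```

Working in log space avoids numeric underflow when many token probabilities are multiplied, and a threshold of 0 on the log-ratio corresponds to the rule that a mail is spam when S[M] is greater than H[M].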

Decision Tree Classifier:

Tree Depth: Experimenting with multiple tree depths to find a suitable balance between underfitting and overfitting.
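The effect of tree depth can be illustrated with a small, self-contained decision tree. This is a simplified sketch and not the Decision Tree Classifier used in this paper: it greedily picks the threshold split with the lowest weighted Gini impurity, and the two features (counts of two hypothetical keywords) and all data points are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def build_tree(rows, labels, max_depth):
    """Grow a binary tree on numeric features, stopping at max_depth."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority  # leaf: predict the majority class
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left], max_depth - 1),
            build_tree([rows[i] for i in right], [labels[i] for i in right], max_depth - 1))

def predict(tree, row):
    """Walk from the root to a leaf; internal nodes are (feature, threshold, left, right)."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)

# Invented features: [count of "free", count of "meeting"]; label 1 = spam, 0 = ham.
train_X = [[3, 0], [2, 0], [4, 1], [2, 2], [0, 2], [0, 1], [1, 3], [1, 0], [1, 1], [3, 2]]
train_y = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
test_X = [[3, 1], [0, 3], [2, 1], [1, 2]]
test_y = [1, 0, 1, 0]

for depth in range(0, 5):
    tree = build_tree(train_X, train_y, depth)
    print(depth, accuracy(tree, train_X, train_y), accuracy(tree, test_X, test_y))
```

Training accuracy can only stay the same or rise as the depth limit grows, so the depth is usually chosen by watching accuracy on held-out data: a tree that is too shallow underfits, while one deep enough to memorize noisy training points tends to overfit.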

3. Training and Evaluation:

Training:

The dataset is split into training and testing sets, for example 70% for training and the remaining 30% for testing.

Every classifier (SVC, Multinomial Naïve Bayes, and Decision Tree Classifier) is trained on the training set.

Evaluation Metrics:

Each model's performance is assessed using metrics such as accuracy, precision, and the confusion matrix.

This graph shows the specific keywords used in most recent emails that are spam; as we can see, the keyword "call" is used in most of these emails.

This graph shows the comparison between the SVC, K Neighbors Classifier, Multinomial NB, Decision Tree Classifier, Logistic Regression, Random Forest Classifier, AdaBoost Classifier, Bagging Classifier, Extra Trees Classifier, Gradient Boosting Classifier, and XGB Classifier algorithms. Among them, Extra Trees Classifier, Multinomial NB, and SVC show the best accuracy and precision.

IV. RESULTS

In this paper, a model is trained with machine learning algorithms to detect whether a received mail is spam or not. The spam base dataset was used. After selection, the dataset was cleaned and processed so that no null attributes were present. The attributes of the email dataset were scaled using min_max_scaler to suit the training of the model with the three machine learning algorithms. The dataset was then divided into x and y attributes, and these x and y variables were further divided into x_train, x_test, y_train, and y_test. The three machine learning algorithms were then trained and evaluated on these training and test sets.

The algorithms used in this approach are SVC, Multinomial NB, and Extra Trees Classifier. The accuracies achieved by these algorithms are 97.29%, 95.93%, and 97.77% respectively, and the overall combined accuracy is 97.87% with a precision of 93.28%.
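The data handling described above (min-max scaling, a 70/30 split into x_train, x_test, y_train, and y_test, and evaluation with accuracy, precision, and a confusion matrix) can be sketched in plain Python. The helper names, the random seed, and the numbers below are illustrative assumptions and are not taken from the experiments in this paper; the sketch follows the order described in the text, scaling first and splitting afterwards.

```python
import random

def min_max_scale(rows):
    """Rescale every feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def train_test_split(x, y, test_fraction=0.3, seed=0):
    """Shuffle the sample indices, then hold out test_fraction of them for testing."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(x) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([x[i] for i in train_idx], [x[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])

def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp), treating spam (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    return tn, fp, fn, tp

def accuracy(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    return (tp + tn) / (tn + fp + fn + tp)

def precision(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Made-up feature rows and labels (1 = spam, 0 = not spam).
x = [[i, 10 * i] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x_train, x_test, y_train, y_test = train_test_split(min_max_scale(x), y)

# Made-up predictions, only to show how the metrics are computed.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(confusion_matrix(y_true, y_pred), accuracy(y_true, y_pred), precision(y_true, y_pred))
```

Precision matters for a spam filter because a false positive sends a legitimate email to the spam folder. One design point worth noting: computing the scaling minimum and maximum on the training portion only, and then applying them to the test portion, would keep information about the test set out of training.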

After combining SVC, Multinomial NB, and Extra Trees Classifier, the accuracy and precision show the best result.

Accuracy   0.9816247582205029
Precision  0.9917355371900827

V. CONCLUSION

Today, email spam or email fraud has become a pressing internet issue in the world of communication. Spam emails are generated by spammers, who misuse them, and they can affect an organization or any individual. Many email spam filtering tools already exist. Because of the persistence of spammers and the development of new technology, filtering spam email remains a challenging topic for researchers. These techniques can be used by a mail server or at the mail client to decrease the rate of spam messages and to reduce the risk of future loss and storage usage. This system specifically focuses on separating emails into two categories, spam and non-spam. This has many implications for both organizations and individual users.

REFERENCES

[1] Kh. Ahmed, "An overview of content-based spam filtering techniques," Informatica, vol. 31, no. 3, pp. 269–277, 2007.

[2] I. Biro, J. Szabo, and A. A. Benczur, "Latent Dirichlet Allocation in Web Spam Filtering," in Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.

[3] A. Perkins, "The classification of search engine spam." http://www. ebrand management.Com/ white papers/spam classification.

[4] Seongwook Youn and Dennis McLeod, "A Comparative Study for Email Classification," Los Angeles, CA 90089, USA, 2006.

[5] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos, "An Evaluation of Naive Bayesian Anti-Spam Filtering," in Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, pp. 9–17, 2000.

[6] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos, "An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Encrypted Personal Messages," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 2000.

[7] C. Apte and F. Damerau, "Automated Learning of Decision Rules for Text Categorization," ACM Transactions on Information Systems, vol. 12, no. 3, pp. 233–251, 1994.

[8] X. Li and N. Ye, "A supervised clustering and classification algorithm for mining data with mixed variables," IEEE Transactions on Systems, Man, and Cybernetics Part A, vol. 36, no. 2, pp. 396–406, 200

[9] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos, "An Evaluation of Naive Bayesian Anti-Spam Filtering," in Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, pp. 9–17, 2000.

[10] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos, "An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Encrypted Personal Messages," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 2000.

[11] C. Apte and F. Damerau, "Automated Learning of Decision Rules for Text Categorization," ACM Transactions on Information Systems, vol. 12, no. 3, pp. 233–251, 1994.

[12] A. Bratko, B. Filipic, G. Cormack, T. Lynam, and B. Zupan, "Spam Filtering Using Statistical Data Compression Models," The Journal of Machine Learning Research, pp. 2673–2698, 2006.

[13] W. Cohen, "Learning rules that classify e-mail," in Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Palo Alto, California, 1996.

[14] K. Tretyakov, "Machine learning techniques in spam filtering," in Data Mining Problem-oriented Seminar, MTAT, vol. 3, no. 177, pp. 60–79, May 2004.

[15] O. Saad, A. Darwish, and R. Faraj, "A survey of machine learning techniques for Spam filtering," International Journal of Computer Science and Network Security (IJCSNS), vol. 12, no. 2, p. 66, 2012.
