Email Spam Detection Using Machine Learning

1) The document discusses an email spam detection framework that uses machine learning algorithms like Naive Bayes to classify emails as spam or not spam. 2) It analyzes a dataset containing over 5000 emails to train and evaluate the Naive Bayes classifier. The classifier achieves an accuracy of 97% according to the evaluation. 3) Spam emails pose security and privacy risks and can spread malware. An effective spam detection system is needed to filter out unwanted spam and protect users.

Email Spam Detection using Machine Learning

Mahtab Chelani

Department of Software Engineering


Mehran University of Engineering and Technology, Jamshoro
Jamshoro, Pakistan
[email protected]

Abstract— The email spam detection framework is a machine learning project that uses artificial intelligence and machine learning algorithms to filter out malicious and fraudulent emails. The spam classifier identifies unwanted, malicious, and virus-infected messages and separates them from non-spam messages. Creating a fake profile and email account is easy for spammers, who pose as genuine individuals in their spam messages and target people who are unaware of such fraud. A filtering system that can identify these spam emails is therefore needed. This system recognizes spam using machine learning methods: several machine learning algorithms are examined and applied to our dataset, and the algorithm with the best accuracy and precision is chosen for email spam detection.

Keywords—spam, email, naïve bayes, spam detection, fake, fraudulent, malicious

I. INTRODUCTION

Email spam is a huge problem in today’s world, where almost everything is carried out over electronic mail and media. According to research, an estimated 319.6 billion emails were sent and received daily in 2021, and in December 2021, 45.37% of all emails were deemed spam. From 2020 to 2021, the global spam volume was highest in July 2021, when 283 billion out of 336.41 billion emails were spam [1]. In the new era of technical advancement, electronic mail (e-mail) has gathered a significant number of users for professional, commercial, and personal communication. In 2019, on average, every person received 130 emails each day, and overall, 296 billion emails were sent in that year.

Classifiers that filter spam mail are now widely used to help users avoid fraudulent mail. The email spam detection system is a project that uses artificial intelligence and machine learning algorithms to filter out the fake and fraudulent emails commonly termed spam. The spam detector detects unwanted, malicious, and virus-infected messages and separates them from non-spam messages. Spam emails are commonly sent from fake profiles to attack victims who are unaware of spam. They usually contain links and phishing lures intended to hook the victim and steal confidential data.

Spam emails are also harmful in another way: they can lead to breaches of sensitive data and can spread malware such as trojans, worms, unblockable ads, and cryptocurrency miners. Handling spam is therefore essential, because it can lead to critical situations. Besides, spam emails are quite annoying to the user.

II. LITERATURE REVIEW

Email spam is unwanted or fraudulent bulk mail sent from an account or an automated system. Spam emails are increasing day by day and have become a common problem over the last decade. Machine learning has played a fundamental part in the detection of spam emails, and many researchers are working on new ways to detect and filter them.

Blanzieri and Bryl [2] described multiple spam filtering approaches in their paper, which reviews learning-based spam filtering. The study also discusses various ethical, economic, and general issues and explains their effects. It suggests the Naïve Bayes algorithm for future spam detectors because it is efficient and precise.

Ferrag, Maglaras, Moschoyiannis and Janicke [3], in their review of deep learning, presented a comprehensive review of intrusion detection algorithms and email spam datasets. They evaluated multiple deep learning models and their effectiveness on those spam datasets, and concluded that deep learning models can perform substantially better than traditional models, specifically for intrusion detection and spam filtering.

Saleh, Karim and Shanmugam [4] surveyed email spam, its datasets, and its detection. They analyzed the security risks, the scope of spam analysis, and different machine learning and non-machine learning techniques to filter out spam. They
concluded that all spam email detection research, specifically phishing email detection, depended on word-based classification or clustering methodology.

III. METHODOLOGY

This classifier uses the Naïve Bayes algorithm, with CountVectorizer for feature extraction and the MultinomialNB method as the Naïve Bayes classifier. The dataset [5] used for this has two columns, which hold the email body and its type: spam or ham. The type is later converted into 0s and 1s, and the body is fed to CountVectorizer to generate a matrix of token counts. Further details on each step are as follows:

A. Data Collection

The first step in training a model is to find and obtain an error-free dataset. The dataset by M. Faisal Qureshi [5], available on Kaggle, is the best fit for our purpose. It consists of two columns, data and category. The data column contains the actual text of the email (its body), and the category column has the value either “spam” or “ham”. This dataset has 5157 total rows, of which 13% are spam and 87% are non-spam, a.k.a. ham.

B. Label Encoding

Label encoding is the process of converting labels into a machine-readable format such as a numerical type. We convert spam to 0 and ham to 1 for later use.

C. Feature Extraction

The data in the spam dataset is then split into training data and testing data, and feature extraction is done using CountVectorizer, which transforms the text into a matrix of token counts, a numerical representation that can be used to fit machine learning algorithms for prediction.

D. Model Training

In this model, we employ a Naïve Bayes classifier to predict spam mail. The Naïve Bayes classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to each class.

E. Model Evaluation

Model evaluation is the process of using different evaluation metrics to understand a machine learning model’s performance, as well as its strengths and weaknesses. Model evaluation is vital to assess the efficacy of a model during initial research phases, and it also plays a role in model monitoring.

F. Results

IV. PSEUDO CODE OF THE METHODOLOGY

1. import pandas, numpy, sklearn
2. read csv using pandas (data collection)
3. data preprocessing and label encoding
4. feature extraction using CountVectorizer
5. apply MultinomialNB algorithm of Naïve Bayes
6. evaluating model
7. concluding results

V. CONCLUSION

From the above results, we conclude that the Naïve Bayes classifier outperforms all other classifiers. At present, spam emails are increasing rapidly, and a better model for identifying spam emails is needed to handle that situation. Our proposed model uses the Naïve Bayes classifier, which provides the probabilistic statistics that identify whether an email is spam. Our proposed model achieves a mean accuracy of 97 percent.

REFERENCES

[1] “Spam e-mail traffic share,” Statista, 29-Jul-2022. [Online]. Available: https://www.statista.com/statistics/420391/spam-email-traffic-share/. [Accessed: 13-Nov-2022].
[2] E. Blanzieri and A. Bryl, “A survey of learning-based techniques of email spam filtering - artificial intelligence review,” SpringerLink, 10-Jul-2009. [Online]. Available: https://link.springer.com/article/10.1007/s10462-009-9109-6. [Accessed: 13-Nov-2022].
[3] A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, “Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study,” Journal of Information Security and Applications, vol. 50, p. 102440, 2020.
[4] A. J. Saleh et al., “An Intelligent Spam Detection Model Based on Artificial Immune System,” Information, vol. 10, no. 6, p. 209, Jun. 2019, doi: 10.3390/info10060209. [Online]. Available: http://dx.doi.org/10.3390/info10060209.
[5] F. Qureshi, “Spam email dataset,” Kaggle, 21-Jun-2021. [Online]. Available: https://www.kaggle.com/datasets/mfaisalqureshi/spam-email. [Accessed: 15-Nov-2022].
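
As an illustration of the methodology (label encoding, token-count feature extraction, Naïve Bayes training, and evaluation), the following is a minimal sketch in plain Python. It is not the implementation used in this work, which relies on scikit-learn's CountVectorizer and MultinomialNB; it only mimics their behaviour on a small scale. The toy rows, the helper names (tokenize, train, predict, accuracy), and the smoothing value alpha=1.0 are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical toy rows standing in for the Kaggle dataset [5], which has
# an email-body column and a "spam"/"ham" category column.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free cash offer click now", "spam"),
    ("claim your free prize today", "spam"),
    ("are we meeting for lunch today", "ham"),
    ("please review the project report", "ham"),
    ("see you at the meeting tomorrow", "ham"),
]
TEST = [("free prize click now", "spam"), ("project meeting tomorrow", "ham")]

# Label encoding as described in the paper: spam -> 0, ham -> 1.
LABELS = {"spam": 0, "ham": 1}


def tokenize(text):
    """Lowercase the text and keep word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def train(rows, alpha=1.0):
    """Fit a multinomial Naive Bayes model with additive (Laplace) smoothing."""
    class_docs = Counter()  # number of emails per class
    token_counts = {label: Counter() for label in LABELS.values()}
    for text, category in rows:
        label = LABELS[category]
        class_docs[label] += 1
        token_counts[label].update(tokenize(text))  # per-class token counts

    vocab = set().union(*token_counts.values())
    total_docs = sum(class_docs.values())
    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for label, counts in token_counts.items():
        model["log_prior"][label] = math.log(class_docs[label] / total_docs)
        denom = sum(counts.values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            tok: math.log((counts[tok] + alpha) / denom) for tok in vocab
        }
    return model


def predict(model, text):
    """Return the encoded label (0 = spam, 1 = ham) with the highest log score."""
    tokens = [t for t in tokenize(text) if t in model["vocab"]]
    scores = {
        label: prior + sum(model["log_likelihood"][label][t] for t in tokens)
        for label, prior in model["log_prior"].items()
    }
    return max(scores, key=scores.get)


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(model, text) == LABELS[cat] for text, cat in rows)
    return hits / len(rows)


model = train(TRAIN)
print(predict(model, "free prize click now"))  # 0, i.e. spam
print(accuracy(model, TEST))                   # 1.0 on this toy test set
```

The accuracy printed here applies only to the toy rows and says nothing about the 97 percent figure reported for the real dataset. With scikit-learn, tokenize and the per-class counting would be handled by CountVectorizer, and train and predict by MultinomialNB.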
