Email Spam Detection Using Machine Learning

1) The document discusses an email spam detection framework that uses machine learning algorithms like Naive Bayes to classify emails as spam or not spam. 2) It analyzes a dataset containing over 5000 emails to train and evaluate the Naive Bayes classifier. The classifier achieves an accuracy of 97% according to the evaluation. 3) Spam emails pose security and privacy risks and can spread malware. An effective spam detection system is needed to filter out unwanted spam and protect users.

Uploaded by

Milton

Email Spam Detection using Machine Learning

Mahtab Chelani

Department of Software Engineering

Mehran University of Engineering and Technology, Jamshoro
Jamshoro, Pakistan
[email protected]

Abstract—The email spam detection framework is a machine learning project that uses artificial intelligence and machine learning algorithms to filter out malicious and fraudulent emails. The spam classifier identifies unwanted, malicious, and virus-infected texts and separates them from non-spam texts. Creating a fake profile and email account is easy for spammers, who pose as genuine individuals in their spam messages and target people who are unaware of these frauds. It is therefore necessary to develop a filtering system that can identify spam emails. This system recognizes spam using machine learning methods: we examine machine learning algorithms, apply them to our dataset, and select the algorithm with the best accuracy and precision for email spam detection.

Keywords—spam, email, naïve bayes, spam detection, fake, fraudulent, malicious

I. INTRODUCTION

Email spam is a huge problem in today’s world, where almost everything is carried out over electronic mail and media. According to research, an estimated 319.6 billion emails were sent and received daily in 2021, and in December 2021, 45.37% of all emails were deemed spam. From 2020 to 2021, the global spam volume was highest in July 2021, when 283 billion out of 336.41 billion emails were spam [1]. In the new era of technical advancement, electronic mail (e-mail) has gathered a significant number of users for professional, commercial, and personal communications. In 2019, on average, every person received 130 emails each day, and overall, 296 billion emails were sent in that year.

Classifiers that filter these spam mails are now commonly used to help users avoid fraudulent mails. The email spam detection system is a project that uses artificial intelligence and machine learning algorithms to filter out the fake and fraudulent emails commonly termed spam. The spam detector detects unwanted, malicious, and virus-infected texts and helps separate them from non-spam texts. Spam emails are commonly sent from fake profiles to attack victims who are unaware of spam. They usually contain links, phishing attempts, and similar lures to trap the victim and steal confidential data.

Spam emails are also harmful in another way: they can lead to breaches of very sensitive data and deliver malware such as trojans, worms, unblockable ads, and cryptocurrency miners. Handling spam email is essential because it can lead to critical situations. Spam emails are also quite annoying to the user.

II. LITERATURE REVIEW

Email spam is phony or unwanted bulk mail sent through any account or an automated system. Spam emails are increasing day by day and have become a common problem over the last decade. Machine learning has been playing a fundamental part in the detection of spam emails, and many researchers are focusing on finding new ways to detect and filter them.

Blanzieri and Bryl [2] described multiple spam filtering approaches in their paper, which reviews learning-based spam filtering. The study discusses various ethical, economic, and general issues and explains their effects. It suggests the Naïve Bayes algorithm for future spam detectors because it is efficient and precise.

Ferrag, Maglaras, Moschoyiannis and Janicke [3], in their review of deep learning, presented a comprehensive review of intrusion detection algorithms and email spam datasets. They evaluated multiple deep learning models and their effectiveness on those spam datasets. They concluded that deep learning models can perform considerably better than traditional models, specifically for intrusion detection and spam filtering.

Saleh, Karim and Shanmugam [4] surveyed email spam, its datasets, and its detection. They analyzed the security risks, the scope of spam analysis, and different machine learning and non-machine learning techniques to filter out spam. They concluded that all spam email detection research, specifically phishing email detection, depended on word-based classification or clustering methodology.

III. METHODOLOGY

This classifier uses the Naïve Bayes algorithm, specifically CountVectorizer and the MultinomialNB method from the Naïve Bayes classifier. The dataset [5] used here has two columns, which hold the email body and its type: spam or ham. The type is later converted into 0s and 1s, and the body is fed to CountVectorizer to generate a matrix of token counts. Further details on each step are as follows:

A. Data Collection

The first step in training a model is to find and obtain an error-free dataset. The dataset by M. Faisal Qureshi [5], available on Kaggle, is the best fit for our purpose. It consists of two columns, data and category. The data column contains the actual text of the email (its body); the category column has the value either “spam” or “ham”. This dataset has 5157 total rows, containing 13% spam and 87% non-spam, a.k.a. ham.

B. Label Encoding

Label encoding is the process of converting labels into a machine-readable format such as a numerical type. We convert spam to 0 and ham to 1 for later use.

C. Feature Extraction

The data in the spam dataset is now split into training data and testing data, and feature extraction is then done using CountVectorizer, which transforms the text into a matrix of token counts, a meaningful numerical representation that is used to fit machine learning algorithms for prediction.

D. Model Training

In this model, we employ a Naïve Bayes classifier to predict spam mail. The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to each class.

E. Model Evaluation

Model evaluation is the process of using different evaluation metrics to understand a machine learning model’s performance, as well as its strengths and weaknesses. Model evaluation is vital for assessing the efficacy of a model during initial research phases, and it also plays a role in model monitoring.

F. Results

IV. PSEUDO CODE OF THE METHODOLOGY

1. import pandas, numpy, sklearn
2. read csv using pandas (data collection)
3. data preprocessing and label encoding
4. feature extraction using CountVectorizer
5. apply MultinomialNB algorithm of Naïve Bayes
6. evaluate the model
7. conclude results

V. CONCLUSION

From the above results, we conclude that the Naïve Bayes classifier outperforms all other classifiers. At present, spam emails are increasing rapidly, and a better model is needed to identify them. Our proposed model uses the Naïve Bayes classifier, which provides the probabilistic statistics that identify whether an email is spam. Our proposed model achieves a mean accuracy of 97 percent.

REFERENCES

[1] “Spam e-mail traffic share,” Statista, 29-Jul-2022. [Online]. Available: https://www.statista.com/statistics/420391/spam-email-traffic-share/. [Accessed: 13-Nov-2022].
[2] E. Blanzieri and A. Bryl, “A survey of learning-based techniques of email spam filtering,” Artificial Intelligence Review, SpringerLink, 10-Jul-2009. [Online]. Available: https://link.springer.com/article/10.1007/s10462-009-9109-6. [Accessed: 13-Nov-2022].
[3] A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, “Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study,” Journal of Information Security and Applications, vol. 50, p. 102440, 2020.
[4] A. J. Saleh et al., “An Intelligent Spam Detection Model Based on Artificial Immune System,” Information, vol. 10, no. 6, p. 209, Jun. 2019, doi: 10.3390/info10060209. [Online]. Available: http://dx.doi.org/10.3390/info10060209.
[5] F. Qureshi, “Spam email dataset,” Kaggle, 21-Jun-2021. [Online]. Available: https://www.kaggle.com/datasets/mfaisalqureshi/spam-email. [Accessed: 15-Nov-2022].
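
The label encoding, token counting, and multinomial Naïve Bayes steps of the methodology can be illustrated with a small from-scratch sketch. It is not the scikit-learn implementation behind CountVectorizer and MultinomialNB; it only mirrors, under simplifying assumptions, what those components compute. The tokenizer rule, the smoothing value alpha=1.0, the function names, and the four sample messages are all hypothetical and stand in for the Kaggle dataset [5].

```python
import math
import re
from collections import Counter

# Label encoding as described in the paper: spam -> 0, ham -> 1.
LABELS = {"spam": 0, "ham": 1}


def tokenize(text):
    """Lowercase the text and return its word tokens (2+ letters or digits).

    This simple rule is an assumption, not CountVectorizer's exact tokenizer.
    """
    return re.findall(r"[a-z0-9]{2,}", text.lower())


def train(messages, categories, alpha=1.0):
    """Fit multinomial Naive Bayes with add-alpha (Laplace) smoothing."""
    y = [LABELS[c] for c in categories]
    token_lists = [tokenize(m) for m in messages]
    vocab = sorted({tok for toks in token_lists for tok in toks})

    # Per-class token counts: the class-wise column sums of a token-count matrix.
    counts = {c: Counter() for c in set(y)}
    for toks, c in zip(token_lists, y):
        counts[c].update(toks)

    log_prior = {c: math.log(y.count(c) / len(y)) for c in counts}
    log_likelihood = {}
    for c, ctr in counts.items():
        denom = sum(ctr.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((ctr[w] + alpha) / denom) for w in vocab}
    return log_prior, log_likelihood


def predict(model, message):
    """Return the encoded class (0 or 1) with the highest log-posterior score."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok]
            for tok in tokenize(message)
            if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data in the same two-column shape (text, category).
messages = [
    "Win a free prize now, click the link",
    "Free cash offer, claim your prize",
    "Are we still meeting for lunch today",
    "Please review the project report before the meeting",
]
categories = ["spam", "spam", "ham", "ham"]

model = train(messages, categories)
print(predict(model, "claim your free prize"))           # 0 (spam)
print(predict(model, "lunch meeting about the report"))  # 1 (ham)
```

Working in log-probabilities avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that appears in only one class from forcing the other class's probability to zero.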

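The model evaluation step refers to evaluation metrics without naming them. Because the dataset is imbalanced (13% spam, 87% ham), accuracy alone can hide weak spam detection, so precision and recall for the spam class are worth reporting alongside it. The sketch below computes all three from a list of true and predicted labels. The function names and both sets of predictions are hypothetical illustrations, not the paper's results.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Count TP, FP, FN, TN, treating `positive` (0 = spam) as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive=0):
    """Return accuracy, plus precision and recall for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical test set of 100 emails with the dataset's class balance:
# 13 spam (encoded 0) followed by 87 ham (encoded 1).
y_true = [0] * 13 + [1] * 87

# A degenerate classifier that labels every email as ham.
always_ham = [1] * 100
print(evaluate(y_true, always_ham))
# {'accuracy': 0.87, 'precision': 0.0, 'recall': 0.0}

# A hypothetical classifier that catches 10 of the 13 spam emails
# and never flags a ham email.
partial = [0] * 10 + [1] * 3 + [1] * 87
result = evaluate(y_true, partial)
print(result["accuracy"], result["precision"], round(result["recall"], 2))
# 0.97 1.0 0.77
```

The first case shows that a classifier which never detects spam still reaches 87% accuracy on this class balance. The second shows that 97% accuracy can coexist with roughly a quarter of spam being missed.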