
SPAM EMAIL

Classifier
Outline

1. Problem Statement
2. Preprocessing
3. Models
4. Performance
5. Conclusion
1. Problem Statement

Email spam, often referred to as junk email, consists of unsolicited messages sent in bulk by email. These messages can contain advertising, scams, or malicious content intended to harm or deceive the recipient. The goal of this project is to create a machine learning model capable of distinguishing between spam and non-spam (ham) emails.
Dataset

The dataset used for this project is sourced from Kaggle and contains emails labeled as spam or ham. It includes various features such as the email text, subject lines, and other metadata, and serves as the foundation for training and testing our model.
Example

Ham:
“Did you catch the bus ? Are you frying an egg ? Did you make a tea? Are you eating your mom's left over dinner ? Do you feel my Love ?”

Spam:
“Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by replying YES or NO. If you reply NO you will not be charged.”
2. Data preprocessing

We clean and tokenize data by :

Remove stopwords, markups, punctuation marks


Remove all strings that contain a non-letter
Convert to lower
Reduce words to their root form
Remove empty emails
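As a hedged illustration, here is a minimal Python sketch of these cleaning steps using NLTK; the function name, the choice of PorterStemmer, and the markup-stripping regex are assumptions for illustration, not taken from the original slides.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)        # fetch stopword list (first run only)
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_email(text: str) -> str:
    """Apply the cleaning steps listed above to a single email."""
    text = re.sub(r"<[^>]+>", " ", text)                               # strip markup
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = text.lower().split()                                      # lowercase, tokenize
    tokens = [t for t in tokens if t.isalpha()]                        # drop non-letter strings
    tokens = [t for t in tokens if t not in STOPWORDS]                 # remove stopwords
    return " ".join(stemmer.stem(t) for t in tokens)                   # stem to root form
```

Emails that come back empty after cleaning can then be dropped.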

Term frequency–inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
2. Data preprocessing

Balance the spam and ham classes by over-sampling the minority class, as sketched below.
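A minimal sketch of random over-sampling with scikit-learn's resample utility; the DataFrame name and the 'text'/'label' column layout are assumptions for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# df is a hypothetical DataFrame with 'text' and 'label' ('ham'/'spam') columns
ham = df[df["label"] == "ham"]
spam = df[df["label"] == "spam"]

# Randomly duplicate minority-class (spam) rows until the classes are balanced
spam_upsampled = resample(
    spam,
    replace=True,            # sample with replacement
    n_samples=len(ham),      # match the majority-class count
    random_state=42,         # reproducibility
)
balanced = pd.concat([ham, spam_upsampled])
```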
2. Data preprocessing

TF-IDF (term frequency–inverse document frequency) evaluates how relevant a word is to a document in a collection of documents.
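A minimal sketch of TF-IDF feature extraction with scikit-learn; the vocabulary size and variable names are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Classic form: tf-idf(t, d) = tf(t, d) * log(N / df(t));
# scikit-learn uses a smoothed variant, idf(t) = log((1 + N) / (1 + df(t))) + 1
vectorizer = TfidfVectorizer(max_features=5000)   # keep the 5000 strongest terms
X = vectorizer.fit_transform(balanced["text"])    # sparse document-term matrix
y = balanced["label"]
```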
3. Models

1. SVM

2. XGBoost

3. Random Forest

4. Logistic Regression
3.1. SVM

The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional space that separates the data points of different classes in the feature space.

When the data is not perfectly separable or contains outliers, SVM employs a soft-margin technique by introducing slack variables. This softens the strict margin requirement, permitting some misclassifications or margin violations, and strikes a balance between maximizing the margin and minimizing classification errors.
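A hedged sketch of fitting a soft-margin SVM on the TF-IDF features with scikit-learn, reusing X and y from the preprocessing sketches; the train/test split and the value of C (which trades margin width against misclassifications) are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Smaller C -> softer margin (more violations tolerated); larger C -> stricter fit
svm = SVC(kernel="linear", C=1.0)
svm.fit(X_train, y_train)
```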
3.1. SVM

PROS
- Excellent accuracy
- Effective in high dimensions
- Robust to overfitting
- Handles non-linear data with kernel tricks

CONS
- Sensitive to parameter tuning
- Memory intensive due to support vectors
- Computationally expensive for large datasets
3.2. XGBoost

XGBoost is an ensemble learning method.

Boosting is a technique that combines multiple weak learners sequentially, with each one correcting the errors of its predecessor.
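A minimal sketch using the xgboost library's scikit-learn-compatible classifier, reusing the split from the SVM sketch; the hyperparameter values are assumptions, not the project's tuned settings.

```python
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# XGBoost expects numeric labels, so encode 'ham'/'spam' as 0/1
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)

xgb = XGBClassifier(
    n_estimators=200,    # number of boosting rounds (sequential weak learners)
    learning_rate=0.1,   # how strongly each tree corrects its predecessor
    max_depth=6,         # depth of each weak learner
)
xgb.fit(X_train, y_train_enc)
```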
3.2. XGBoost

PROS
- Handles missing values automatically
- Optimized for parallel processing

CONS
- Time-consuming parameter tuning
- Significant memory usage
3.3. Random Forest

An ensemble learning method (a code sketch follows the list).

Bagging technique
- Combines multiple weak learners in parallel.
- Reduces overfitting and improves accuracy.

Decision trees
- Constructs numerous decision trees during training.
- Each tree contributes to the final prediction:
  - Regression tasks: averaging the results.
  - Classification tasks: majority vote.
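A minimal Random Forest sketch with scikit-learn; the number of trees is an illustrative assumption.

```python
from sklearn.ensemble import RandomForestClassifier

# Each tree is trained on a bootstrap sample of the data (bagging);
# for classification, the final label is a majority vote across trees.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
```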
3.3. Random Forest

PROS
- High accuracy
- Reduces overfitting
- Versatile for classification and regression
- Tolerant of missing data

CONS
- More complex
- Longer training time
- Memory intensive
- Slower for real-time predictions
3.4. Logistic Regression

Logistic Regression is a statistical method for analyzing datasets in which one or more independent variables determine an outcome.

Sigmoid function: σ(z) = 1 / (1 + e^(−z)), which maps any real-valued input to a probability in (0, 1).
3.4. Logistic Regression

The loss function in logistic regression with L2 regularization: similar to linear regression, we can handle overfitting by adding a regularization term to the error function:

J(w) = −(1/N) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ] + λ‖w‖²

where ŷᵢ = σ(w·xᵢ) is the sigmoid output for example i.
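A hedged sketch of L2-regularized logistic regression with scikit-learn; note that scikit-learn parameterizes the penalty as C = 1/λ, and the value used here is an assumption.

```python
from sklearn.linear_model import LogisticRegression

# penalty='l2' adds the λ‖w‖² term; C is the inverse regularization strength (1/λ)
logreg = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
logreg.fit(X_train, y_train)

# predict_proba returns the sigmoid output σ(w·x) as a spam probability
probs = logreg.predict_proba(X_test)
```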
3.4. Logistic Regression

PROS
- Fast and efficient training
- Requires few assumptions about the data
- Provides useful probability predictions

CONS
- Sensitive to linearly inseparable features
- Prone to overfitting with many features
- Ineffective with datasets containing many missing values
4. Model Evaluation
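The slide's original results are not reproduced here; as a hedged sketch, the trained models could be compared with scikit-learn's standard metrics as follows (XGBoost predictions would first need le.inverse_transform to restore the string labels).

```python
from sklearn.metrics import accuracy_score, classification_report

for name, model in [("SVM", svm), ("Random Forest", rf), ("LogReg", logreg)]:
    preds = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, preds))
    print(classification_report(y_test, preds))  # precision, recall, F1 per class
```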
Conclusion

Our project on email spam classification using machine learning has successfully demonstrated the effectiveness of advanced algorithms in identifying and filtering out spam and unwanted emails.

By leveraging techniques such as natural language processing and supervised learning, we have developed a robust model that can distinguish between legitimate emails and spam with high accuracy.

In conclusion, the developed classifier shows significant promise in reducing the volume of spam emails received by users, thereby improving their overall email experience and productivity.
Thank You

for listening
