
SPAM EMAIL

Classifier
Outline

1. Problem Statement
2. Preprocessing
3. Models
4. Performance
5. Conclusion
1. Problem Statement

Email spam, often referred to as junk email, consists of unsolicited messages sent in bulk by email. These messages can contain advertising, scams, or malicious content intended to harm or deceive the recipient. The goal of this project is to create a machine learning model capable of distinguishing between spam and non-spam (ham) emails.
Dataset

The dataset used for this project is sourced from Kaggle and contains emails labeled as spam or ham. It includes various features such as the email text, subject lines, and other metadata, and serves as the foundation for training and testing our model.
Example

Ham:
“Did you catch the bus ? Are you frying an egg ? Did you make a tea? Are you eating your mom's left over dinner ? Do you feel my Love ?”

Spam:
“Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by replying YES or NO. If you reply NO you will not be charged.”
2. Data preprocessing

We clean and tokenize data by :

Remove stopwords, markups, punctuation marks


Remove all strings that contain a non-letter
Convert to lower
Reduce words to their root form
Remove empty emails
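As a hedged illustration, here is a minimal Python sketch of these cleaning steps using NLTK; the function name, the choice of PorterStemmer, and the markup-stripping regex are assumptions for illustration, not taken from the original slides.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)        # fetch stopword list (first run only)
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_email(text: str) -> str:
    """Apply the cleaning steps listed above to a single email."""
    text = re.sub(r"<[^>]+>", " ", text)                               # strip markup
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = text.lower().split()                                      # lowercase, tokenize
    tokens = [t for t in tokens if t.isalpha()]                        # drop non-letter strings
    tokens = [t for t in tokens if t not in STOPWORDS]                 # remove stopwords
    return " ".join(stemmer.stem(t) for t in tokens)                   # stem to root form
```

Emails that come back empty after cleaning can then be dropped.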

Term frequency–inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
2. Data preprocessing

Balance the spam and ham classes by over-sampling the minority class, as sketched below.
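A minimal sketch of random over-sampling with scikit-learn's resample utility; the DataFrame name and the 'text'/'label' column layout are assumptions for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# df is a hypothetical DataFrame with 'text' and 'label' ('ham'/'spam') columns
ham = df[df["label"] == "ham"]
spam = df[df["label"] == "spam"]

# Randomly duplicate minority-class (spam) rows until the classes are balanced
spam_upsampled = resample(
    spam,
    replace=True,            # sample with replacement
    n_samples=len(ham),      # match the majority-class count
    random_state=42,         # reproducibility
)
balanced = pd.concat([ham, spam_upsampled])
```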
2. Data preprocessing

TF-IDF (term frequency–inverse document frequency) evaluates how relevant a word is to a document in a collection of documents.
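A minimal sketch of TF-IDF feature extraction with scikit-learn; the vocabulary size and variable names are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Classic form: tf-idf(t, d) = tf(t, d) * log(N / df(t));
# scikit-learn uses a smoothed variant, idf(t) = log((1 + N) / (1 + df(t))) + 1
vectorizer = TfidfVectorizer(max_features=5000)   # keep the 5000 strongest terms
X = vectorizer.fit_transform(balanced["text"])    # sparse document-term matrix
y = balanced["label"]
```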
3. Models

1. SVM

2. XGBoost

3. Random Forest

4. Logistic Regression
3.1. SVM

The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional space that separates the data points of different classes in the feature space.

When the data is not perfectly separable or contains outliers, SVM employs a soft-margin technique by introducing slack variables. This softens the strict margin requirement, permitting some misclassifications or margin violations, and strikes a balance between maximizing the margin and minimizing classification errors.
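A hedged sketch of fitting a soft-margin SVM on the TF-IDF features with scikit-learn, reusing X and y from the preprocessing sketches; the train/test split and the value of C (which trades margin width against misclassifications) are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Smaller C -> softer margin (more violations tolerated); larger C -> stricter fit
svm = SVC(kernel="linear", C=1.0)
svm.fit(X_train, y_train)
```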
3.1. SVM

PROS
- Excellent accuracy
- Effective in high dimensions
- Robust to overfitting
- Handles non-linear data with kernel tricks

CONS
- Sensitive to parameter tuning
- Memory intensive due to support vectors
- Computationally expensive for large datasets
3.2. XGBoost

XGBoost is an ensemble learning method.

Boosting is a technique that combines multiple weak learners sequentially, with each one correcting the errors of its predecessor.
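A minimal sketch using the xgboost library's scikit-learn-compatible classifier, reusing the split from the SVM sketch; the hyperparameter values are assumptions, not the project's tuned settings.

```python
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# XGBoost expects numeric labels, so encode 'ham'/'spam' as 0/1
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)

xgb = XGBClassifier(
    n_estimators=200,    # number of boosting rounds (sequential weak learners)
    learning_rate=0.1,   # how strongly each tree corrects its predecessor
    max_depth=6,         # depth of each weak learner
)
xgb.fit(X_train, y_train_enc)
```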
3.2. XGBoost

PROS
- Handles missing values automatically
- Optimized for parallel processing

CONS
- Time-consuming parameter tuning
- Significant memory usage
3.3. Random Forest

An ensemble learning method (a code sketch follows the list).

Bagging technique
- Combines multiple weak learners in parallel.
- Reduces overfitting and improves accuracy.

Decision trees
- Constructs numerous decision trees during training.
- Each tree contributes to the final prediction:
  - Regression tasks: averaging the results.
  - Classification tasks: majority vote.
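A minimal Random Forest sketch with scikit-learn; the number of trees is an illustrative assumption.

```python
from sklearn.ensemble import RandomForestClassifier

# Each tree is trained on a bootstrap sample of the data (bagging);
# for classification, the final label is a majority vote across trees.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
```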
3.3. Random Forest

PROS
- High accuracy
- Reduces overfitting
- Versatile for classification and regression
- Tolerant of missing data

CONS
- More complex
- Longer training time
- Memory intensive
- Slower for real-time predictions
3.4. Logistic Regression

Logistic Regression is a statistical method for analyzing datasets in which one or more independent variables determine an outcome.

Sigmoid function: σ(z) = 1 / (1 + e^(−z)), which maps any real-valued input to a probability in (0, 1).
3.4. Logistic Regression

The loss function in logistic regression with L2 regularization: similar to linear regression, we can handle overfitting by adding a regularization term to the error function:

J(w) = −(1/N) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ] + λ‖w‖²

where ŷᵢ = σ(w·xᵢ) is the sigmoid output for example i.
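A hedged sketch of L2-regularized logistic regression with scikit-learn; note that scikit-learn parameterizes the penalty as C = 1/λ, and the value used here is an assumption.

```python
from sklearn.linear_model import LogisticRegression

# penalty='l2' adds the λ‖w‖² term; C is the inverse regularization strength (1/λ)
logreg = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
logreg.fit(X_train, y_train)

# predict_proba returns the sigmoid output σ(w·x) as a spam probability
probs = logreg.predict_proba(X_test)
```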
3.4. Logistic Regression

PROS
- Fast and efficient training
- Requires few assumptions about the data
- Provides useful probability predictions

CONS
- Sensitive to linearly inseparable features
- Prone to overfitting with many features
- Ineffective with datasets containing many missing values
4. Model Evaluation
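The slide's original results are not reproduced here; as a hedged sketch, the trained models could be compared with scikit-learn's standard metrics as follows (XGBoost predictions would first need le.inverse_transform to restore the string labels).

```python
from sklearn.metrics import accuracy_score, classification_report

for name, model in [("SVM", svm), ("Random Forest", rf), ("LogReg", logreg)]:
    preds = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, preds))
    print(classification_report(y_test, preds))  # precision, recall, F1 per class
```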
Conclusion

Our project on email spam classification using machine learning has successfully demonstrated the effectiveness of advanced algorithms in identifying and filtering out spam and unwanted emails.

By leveraging techniques such as natural language processing and supervised learning, we have developed a robust model that can distinguish between legitimate emails and spam with high accuracy.

In conclusion, the developed classifier shows significant promise in reducing the volume of spam emails received by users, thereby improving their overall email experience and productivity.
Thank You

for listening
