
Spam Filter Project Report

Made by: Ahannach Yassine and EL Garte Mouhcine

1. Introduction
The Spam Filter Project utilizes machine learning techniques to classify email
messages as either spam or ham (non-spam). With the increasing volume of
email communication, managing unwanted spam emails has become critical for
productivity and security. This project aims to provide an efficient and accurate
solution to manage unwanted emails using Logistic Regression as the primary
model. It incorporates advanced feature engineering, hyperparameter tuning,
and visualization tools to ensure optimal performance.

The report documents the entire pipeline, including data preprocessing, feature
extraction, model training, evaluation, and deployment. Each component of the
pipeline is explained in detail to offer a comprehensive understanding of the
implementation.

2. Data Loading and Preprocessing


The dataset used for this project contains labeled email messages with two
categories: "ham" for legitimate emails and "spam" for unwanted emails. The
raw dataset undergoes preprocessing to clean and prepare the text for analysis.

2.1 Data Overview

The dataset consists of the following:

 Messages: Textual content of the emails.
 Labels: Binary labels indicating whether the email is spam or ham.

Example of Dataset:

Label   Message
Ham     "Go to the meeting at 10 AM."
Spam    "Congratulations! You won $1,000. Click here to claim now."

2.2 Preprocessing Steps

The preprocessing pipeline ensures the dataset is cleaned and structured for
feature extraction. The key steps are:

 Text Cleaning: Removing punctuation, numbers, and special characters.
 Lowercasing: Converting all text to lowercase for consistency.
 Tokenization: Splitting messages into individual words.
 Stopword Removal: Eliminating common words (e.g., "and," "the") that do not
contribute to classification.
 Lemmatization: Reducing words to their base forms (e.g., "running" → "run").
 Label Encoding: Converting the labels "spam" and "ham" to numerical values (1 and
0, respectively).

Code Snippet:
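A minimal sketch of the preprocessing pipeline using pandas and NLTK; the file
name spam.csv and the column names label and message are assumptions, not
details taken from the original report:

import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

# Load the labeled dataset (file name and column names are assumed)
df = pd.read_csv('spam.csv', names=['label', 'message'])

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, then strip punctuation, numbers, and special characters
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    # Tokenize on whitespace, drop stopwords, and lemmatize the rest
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    return ' '.join(tokens)

df['clean'] = df['message'].apply(preprocess)
# Label encoding: spam -> 1, ham -> 0
df['y'] = (df['label'].str.lower() == 'spam').astype(int)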

3. Feature Extraction and Engineering


Feature extraction converts textual data into numerical representations, which
are then used by the machine learning model. This project leverages both
standard techniques like TF-IDF and custom feature engineering.

3.1 TF-IDF Vectorization

TF-IDF (Term Frequency-Inverse Document Frequency) assigns a weight to each
word based on its importance in a message relative to the entire dataset. This
results in a sparse matrix where each row represents an email, and each column
represents a unique word.

Advantages:
 Captures the importance of rare words in spam detection.
 Efficient for high-dimensional text data.

Code Snippet:
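A possible implementation with scikit-learn's TfidfVectorizer; the
max_features cap of 5,000 is an assumed setting:

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the cleaned messages from the preprocessing step
vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(df['clean'])
# X_tfidf is a sparse matrix: one row per email, one column per word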
3.2 Custom Feature Engineering

To improve model performance, additional features are engineered:

 Email Length: Total number of characters in the email.
 Special Character Count: Frequency of characters like @, !, and $.
 Uppercase Words: Number of fully capitalized words.

Code Snippet:
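One way the three features above could be computed and appended to the TF-IDF
matrix; combining them via scipy.sparse.hstack is an assumption about how the
features were joined:

import numpy as np
from scipy.sparse import hstack, csr_matrix

# Compute the three engineered features from the raw (uncleaned) messages
email_length = df['message'].str.len()
special_chars = df['message'].str.count(r'[@!$]')
uppercase_words = df['message'].apply(lambda m: sum(w.isupper() for w in m.split()))

extra = np.column_stack([email_length, special_chars, uppercase_words])
# Append the dense features as extra columns of the sparse TF-IDF matrix
X = hstack([X_tfidf, csr_matrix(extra.astype(float))])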

4. Data Augmentation
Data augmentation helps address class imbalance by generating
synthetic spam samples. Two primary techniques are used:

1. Synonym Replacement: Replacing words in spam messages with synonyms from a
predefined dictionary or WordNet.
2. Noise Introduction: Adding typos and slight variations to simulate real-world spam
content.

Code Snippet:
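A sketch of both techniques, assuming WordNet synonyms and random character
substitutions; the replacement probabilities are illustrative values, not
settings from the original report:

import random
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

def synonym_replace(text, p=0.2):
    # Replace roughly a fraction p of words with a WordNet synonym, if one exists
    words = text.split()
    for i, w in enumerate(words):
        if random.random() < p:
            synsets = wordnet.synsets(w)
            if synsets:
                lemmas = [l.name().replace('_', ' ') for l in synsets[0].lemmas()]
                words[i] = random.choice(lemmas)
    return ' '.join(words)

def add_noise(text, p=0.05):
    # Swap a small fraction of letters for random ones to simulate typos
    chars = [random.choice('abcdefghijklmnopqrstuvwxyz')
             if c.isalpha() and random.random() < p else c
             for c in text]
    return ''.join(chars)

# Generate synthetic spam samples from the existing ones
spam_messages = df[df['y'] == 1]['clean']
augmented = [add_noise(synonym_replace(m)) for m in spam_messages]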

Augmentation ensures the model is exposed to diverse spam patterns, improving
its generalization ability.

5. Model Training and Hyperparameter Tuning


The model used in this project is Logistic Regression, which is well-suited for
binary classification problems. Key steps include training, hyperparameter
tuning, and evaluation.

5.1 Hyperparameter Tuning

Grid Search is used to optimize the following hyperparameters:

 C: Regularization strength.
 max_iter: Maximum number of iterations for convergence.
Code Snippet:
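A sketch using scikit-learn's GridSearchCV over the two hyperparameters named
above; the train/test split ratio and the candidate values are assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a test set, stratified to preserve the spam/ham ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, df['y'], test_size=0.2, random_state=42, stratify=df['y'])

# Search over regularization strength and iteration budget
param_grid = {'C': [0.01, 0.1, 1, 10], 'max_iter': [100, 500, 1000]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)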

5.2 Training and Evaluation

The Logistic Regression model is trained using the optimal hyperparameters.
Key evaluation metrics include:

 Accuracy: Overall correctness of predictions.
 Precision and Recall: Focused on spam detection performance.
 ROC-AUC: Measures the model's ability to differentiate between classes.

Code Snippet:
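A minimal evaluation sketch computing the three metrics listed above, reusing
the fitted grid search from Section 5.1:

from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# GridSearchCV refits the best model on the full training set automatically
model = grid.best_estimator_
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the spam class

print('Accuracy:', accuracy_score(y_test, y_pred))
print('ROC-AUC:', roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))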

Classification Report:

6. Visualization
6.1 Confusion Matrix

A confusion matrix highlights the counts of true positives, false positives, true
negatives, and false negatives.

Code Snippet:
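One way to produce the plot with scikit-learn and matplotlib; the display
labels and color map are assumptions:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot counts of true/false positives and negatives for the test predictions
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=['HAM', 'SPAM'], cmap='Blues')
plt.title('Confusion Matrix')
plt.show()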

6.2 Feature Importance


The top features influencing the classification are visualized to provide
interpretability.

Code Snippet:
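A sketch of how the "Top 10 Most Important Words for Spam" plot (Section 8.5)
could be produced from the Logistic Regression coefficients; slicing off the
three appended custom features is an assumption tied to the feature layout in
Section 3.2:

import numpy as np
import matplotlib.pyplot as plt

# In Logistic Regression, large positive coefficients mark words that push
# a message toward the spam class
feature_names = np.array(vectorizer.get_feature_names_out())
word_coefs = model.coef_[0][:len(feature_names)]  # drop the 3 custom features
top = np.argsort(word_coefs)[-10:]

plt.barh(feature_names[top], word_coefs[top])
plt.xlabel('Coefficient weight')
plt.title('Top 10 Most Important Words for Spam')
plt.tight_layout()
plt.show()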

6.3 Training Metrics

The loss and accuracy trends during training are plotted to ensure the model is
learning effectively.

7. Interactive Classification
The project includes a real-time email classification tool. Users can input email
messages and receive predictions with confidence scores.

Code Snippet:
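A simple command-line version of the tool, reusing the preprocessing function,
vectorizer, and model from the earlier steps; the loop and prompt wording are
assumptions:

from scipy.sparse import hstack, csr_matrix

def classify_email(text):
    # Apply the same preprocessing and feature layout used at training time
    vec = vectorizer.transform([preprocess(text)])
    extra = csr_matrix([[len(text),
                         sum(text.count(c) for c in '@!$'),
                         sum(w.isupper() for w in text.split())]], dtype=float)
    prob_spam = model.predict_proba(hstack([vec, extra]))[0, 1]
    label = 'SPAM' if prob_spam >= 0.5 else 'HAM'
    confidence = prob_spam if prob_spam >= 0.5 else 1 - prob_spam
    return label, confidence

while True:
    message = input('Enter an email message (or "quit" to exit): ')
    if message.lower() == 'quit':
        break
    label, confidence = classify_email(message)
    print(f'Prediction: {label} (confidence: {confidence:.1%})')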

8. Results and Analysis

8.1 Performance Metrics


 Accuracy: 95%
 Precision (Spam): 96%
 Recall (Spam): 85%
 ROC-AUC: 0.988

8.2 Observations

 TF-IDF captures word importance effectively for spam detection.
 Custom features improve recall for spam classification.
 Data augmentation ensures robustness to diverse spam patterns.

Example Dataset Preview:

8.3 Training Log Output

• Output Content:

 Shows the loss and training/test accuracies at different epochs. For example:

▪ Epoch 0: Loss = 0.6931, Train Accuracy = 0.71, Test Accuracy = 0.72
▪ Epoch 8640: Loss = 0.0344, Train Accuracy = 0.99, Test Accuracy = 0.97
▪ Final Test Accuracy: 0.97

 Also shows detailed metrics:

▪ Precision: 0.942
▪ Recall: 0.945
▪ F1 Score: 0.944
• Analysis:

 The loss decreases significantly over the epochs, indicating that the model is
learning and improving its predictions.
 The final test accuracy of 0.97 is quite high, suggesting that the model
performs well on unseen data.
 The precision, recall, and F1 score values are all relatively high (above 0.94),
which further indicates that the model has a good balance between correctly
identifying positive cases (spam emails) and minimizing false positives and
false negatives.

8.4 Training vs Test Accuracy Plot

• Plot Content:

 The training accuracy starts around 0.70, quickly rises to close to 1.00
within the first few epochs, and remains at that level.
 The test accuracy starts around 0.72, gradually increases to around 0.97,
and stabilizes.

• Analysis:

 The training accuracy reaches a very high level (close to 1.00) early in the
training process and remains stable. This indicates that the model is able to
fit the training data very well.
 The test accuracy also increases but stabilizes at a slightly lower level
(around 0.97) compared to the training accuracy. This suggests that there
might be a small degree of overfitting, as the model performs slightly worse
on the unseen test data compared to the training data. However, the gap
between the training and test accuracies is not very large, indicating that the
overfitting is not severe.

8.5 Top 10 Most Important Words for Spam Plot

• Analysis:

 Words like "money", "million", and "prices" are commonly associated with
spam emails, as they often relate to financial offers or promotions. The
presence of these words with relatively high weights indicates that the
model has learned to recognize these as important features for spam
classification.

 The word "2004" having the highest weight might be an artifact of the
dataset or could potentially be related to some context within the spam
emails in the dataset that is not immediately obvious.
8.6 Confusion Matrix Plot

• Plot Content:
 The x-axis represents the predicted label (HAM or SPAM).
 The y-axis represents the true label (HAM or SPAM).
 The values in the matrix are:
▪ True HAM predicted as HAM: 953
▪ True HAM predicted as SPAM: 13
▪ True SPAM predicted as HAM: 19
▪ True SPAM predicted as SPAM: 131

• Analysis:
 The model has a high number of true positives (131) and true negatives
(953), indicating that it is correctly classifying a large number of emails.
 The number of false positives (13) and false negatives (19) is relatively low,
which is a good sign. This means that the model is not making a large
number of incorrect classifications.
 The overall performance of the model based on the confusion matrix appears
to be quite good, with a high accuracy in classifying both spam and non-spam
emails.

9. Conclusion
This project demonstrates the effectiveness of machine learning in email spam
detection. By leveraging TF-IDF, custom features, and Logistic Regression, the
model achieves high accuracy and interpretability. The tools and visualizations
developed make this solution practical for real-world applications.
