
Spam Filter Project Report

Made by: Ahannach Yassine and EL Garte Mouhcine

1. Introduction
The Spam Filter Project utilizes machine learning techniques to classify email
messages as either spam or ham (non-spam). With the increasing volume of
email communication, managing unwanted spam emails has become critical for
productivity and security. This project aims to provide an efficient and accurate
solution to manage unwanted emails using Logistic Regression as the primary
model. It incorporates advanced feature engineering, hyperparameter tuning,
and visualization tools to ensure optimal performance.

The report documents the entire pipeline, including data preprocessing, feature
extraction, model training, evaluation, and deployment. Each component of the
pipeline is explained in detail to offer a comprehensive understanding of the
implementation.

2. Data Loading and Preprocessing


The dataset used for this project contains labeled email messages with two
categories: "ham" for legitimate emails and "spam" for unwanted emails. The
raw dataset undergoes preprocessing to clean and prepare the text for analysis.

2.1 Data Overview

The dataset consists of the following:

 Messages: Textual content of the emails.
 Labels: Binary labels indicating whether the email is spam or ham.

Example of Dataset:

Label   Message
Ham     "Go to the meeting at 10 AM."
Spam    "Congratulations! You won $1,000. Click here to claim now."

2.2 Preprocessing Steps

The preprocessing pipeline ensures the dataset is cleaned and structured for
feature extraction. The key steps are:

 Text Cleaning: Removing punctuation, numbers, and special characters.
 Lowercasing: Converting all text to lowercase for consistency.
 Tokenization: Splitting messages into individual words.
 Stopword Removal: Eliminating common words (e.g., "and," "the") that do not
contribute to classification.
 Lemmatization: Reducing words to their base forms (e.g., "running" → "run").
 Label Encoding: Converting the labels "spam" and "ham" to numerical values (1 and
0, respectively).

Code Snippet:
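A minimal sketch of the preprocessing pipeline using pandas and NLTK; the file
name spam.csv and the column names label and message are assumptions, not
details taken from the original report:

import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

# Load the labeled dataset (file name and column names are assumed)
df = pd.read_csv('spam.csv', names=['label', 'message'])

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, then strip punctuation, numbers, and special characters
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    # Tokenize on whitespace, drop stopwords, and lemmatize the rest
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    return ' '.join(tokens)

df['clean'] = df['message'].apply(preprocess)
# Label encoding: spam -> 1, ham -> 0
df['y'] = (df['label'].str.lower() == 'spam').astype(int)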

3. Feature Extraction and Engineering


Feature extraction converts textual data into numerical representations, which
are then used by the machine learning model. This project leverages both
standard techniques like TF-IDF and custom feature engineering.

3.1 TF-IDF Vectorization

TF-IDF (Term Frequency-Inverse Document Frequency) assigns a weight to each
word based on its importance in a message relative to the entire dataset. This
results in a sparse matrix where each row represents an email, and each column
represents a unique word.

Advantages:
 Captures the importance of rare words in spam detection.
 Efficient for high-dimensional text data.

Code Snippet:
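A possible implementation with scikit-learn's TfidfVectorizer; the
max_features cap of 5,000 is an assumed setting:

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the cleaned messages from the preprocessing step
vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(df['clean'])
# X_tfidf is a sparse matrix: one row per email, one column per word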
3.2 Custom Feature Engineering

To improve model performance, additional features are engineered:

 Email Length: Total number of characters in the email.
 Special Character Count: Frequency of characters like @, !, and $.
 Uppercase Words: Number of fully capitalized words.

Code Snippet:
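One way the three features above could be computed and appended to the TF-IDF
matrix; combining them via scipy.sparse.hstack is an assumption about how the
features were joined:

import numpy as np
from scipy.sparse import hstack, csr_matrix

# Compute the three engineered features from the raw (uncleaned) messages
email_length = df['message'].str.len()
special_chars = df['message'].str.count(r'[@!$]')
uppercase_words = df['message'].apply(lambda m: sum(w.isupper() for w in m.split()))

extra = np.column_stack([email_length, special_chars, uppercase_words])
# Append the dense features as extra columns of the sparse TF-IDF matrix
X = hstack([X_tfidf, csr_matrix(extra.astype(float))])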

4. Data Augmentation
Data augmentation helps address class imbalance by generating
synthetic spam samples. Two primary techniques are used:

1. Synonym Replacement: Replacing words in spam messages with synonyms from a
predefined dictionary or WordNet.
2. Noise Introduction: Adding typos and slight variations to simulate real-world spam
content.

Code Snippet:
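A sketch of both techniques, assuming WordNet synonyms and random character
substitutions; the replacement probabilities are illustrative values, not
settings from the original report:

import random
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

def synonym_replace(text, p=0.2):
    # Replace roughly a fraction p of words with a WordNet synonym, if one exists
    words = text.split()
    for i, w in enumerate(words):
        if random.random() < p:
            synsets = wordnet.synsets(w)
            if synsets:
                lemmas = [l.name().replace('_', ' ') for l in synsets[0].lemmas()]
                words[i] = random.choice(lemmas)
    return ' '.join(words)

def add_noise(text, p=0.05):
    # Swap a small fraction of letters for random ones to simulate typos
    chars = [random.choice('abcdefghijklmnopqrstuvwxyz')
             if c.isalpha() and random.random() < p else c
             for c in text]
    return ''.join(chars)

# Generate synthetic spam samples from the existing ones
spam_messages = df[df['y'] == 1]['clean']
augmented = [add_noise(synonym_replace(m)) for m in spam_messages]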

Augmentation ensures the model is exposed to diverse spam patterns, improving
its generalization ability.

5. Model Training and Hyperparameter Tuning


The model used in this project is Logistic Regression, which is well-suited for
binary classification problems. Key steps include training, hyperparameter
tuning, and evaluation.

5.1 Hyperparameter Tuning

Grid Search is used to optimize the following hyperparameters:

 C: Regularization strength.
 max_iter: Maximum number of iterations for convergence.
Code Snippet:
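A sketch using scikit-learn's GridSearchCV over the two hyperparameters named
above; the train/test split ratio and the candidate values are assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a test set, stratified to preserve the spam/ham ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, df['y'], test_size=0.2, random_state=42, stratify=df['y'])

# Search over regularization strength and iteration budget
param_grid = {'C': [0.01, 0.1, 1, 10], 'max_iter': [100, 500, 1000]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)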

5.2 Training and Evaluation

The Logistic Regression model is trained using the optimal hyperparameters.
Key evaluation metrics include:

 Accuracy: Overall correctness of predictions.
 Precision and Recall: Focused on spam detection performance.
 ROC-AUC: Measures the model's ability to differentiate between classes.

Code Snippet:
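A minimal evaluation sketch computing the three metrics listed above, reusing
the fitted grid search from Section 5.1:

from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# GridSearchCV refits the best model on the full training set automatically
model = grid.best_estimator_
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the spam class

print('Accuracy:', accuracy_score(y_test, y_pred))
print('ROC-AUC:', roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))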

Classification Report:

6. Visualization
6.1 Confusion Matrix

A confusion matrix highlights the counts of true positives, false positives, true
negatives, and false negatives.

Code Snippet:
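One way to produce the plot with scikit-learn and matplotlib; the display
labels and color map are assumptions:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot counts of true/false positives and negatives for the test predictions
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=['HAM', 'SPAM'], cmap='Blues')
plt.title('Confusion Matrix')
plt.show()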

6.2 Feature Importance


The top features influencing the classification are visualized to provide
interpretability.

Code Snippet:
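A sketch of how the "Top 10 Most Important Words for Spam" plot (Section 8.5)
could be produced from the Logistic Regression coefficients; slicing off the
three appended custom features is an assumption tied to the feature layout in
Section 3.2:

import numpy as np
import matplotlib.pyplot as plt

# In Logistic Regression, large positive coefficients mark words that push
# a message toward the spam class
feature_names = np.array(vectorizer.get_feature_names_out())
word_coefs = model.coef_[0][:len(feature_names)]  # drop the 3 custom features
top = np.argsort(word_coefs)[-10:]

plt.barh(feature_names[top], word_coefs[top])
plt.xlabel('Coefficient weight')
plt.title('Top 10 Most Important Words for Spam')
plt.tight_layout()
plt.show()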

6.3 Training Metrics

The loss and accuracy trends during training are plotted to ensure the model is
learning effectively.

7. Interactive Classification
The project includes a real-time email classification tool. Users can input email
messages and receive predictions with confidence scores.

Code Snippet:
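A simple command-line version of the tool, reusing the preprocessing function,
vectorizer, and model from the earlier steps; the loop and prompt wording are
assumptions:

from scipy.sparse import hstack, csr_matrix

def classify_email(text):
    # Apply the same preprocessing and feature layout used at training time
    vec = vectorizer.transform([preprocess(text)])
    extra = csr_matrix([[len(text),
                         sum(text.count(c) for c in '@!$'),
                         sum(w.isupper() for w in text.split())]], dtype=float)
    prob_spam = model.predict_proba(hstack([vec, extra]))[0, 1]
    label = 'SPAM' if prob_spam >= 0.5 else 'HAM'
    confidence = prob_spam if prob_spam >= 0.5 else 1 - prob_spam
    return label, confidence

while True:
    message = input('Enter an email message (or "quit" to exit): ')
    if message.lower() == 'quit':
        break
    label, confidence = classify_email(message)
    print(f'Prediction: {label} (confidence: {confidence:.1%})')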

8. Results and Analysis

8.1 Performance Metrics


 Accuracy: 95%
 Precision (Spam): 96%
 Recall (Spam): 85%
 ROC-AUC: 0.988

8.2 Observations

 TF-IDF captures word importance effectively for spam detection.
 Custom features improve recall for spam classification.
 Data augmentation ensures robustness to diverse spam patterns.

Example Dataset Preview:

8.3 Training Log Output

• Output Content:

 Shows the loss and training/test accuracies at different epochs. For example:

▪ Epoch 0: Loss = 0.6931, Train Accuracy = 0.71, Test Accuracy = 0.72
▪ Epoch 8640: Loss = 0.0344, Train Accuracy = 0.99, Test Accuracy = 0.97
▪ Final Test Accuracy: 0.97

 Also shows detailed metrics:

▪ Precision: 0.942
▪ Recall: 0.945
▪ F1 Score: 0.944
• Analysis:

 The loss decreases significantly over the epochs, indicating that the model is
learning and improving its predictions.
 The final test accuracy of 0.97 is quite high, suggesting that the model
performs well on unseen data.
 The precision, recall, and F1 score values are all relatively high (above 0.94),
which further indicates that the model has a good balance between correctly
identifying positive cases (spam emails) and minimizing false positives and
false negatives.

8.4 Training vs Test Accuracy Plot

• Plot Content:

 The training accuracy starts around 0.70, quickly rises to close to 1.00
within the first few epochs, and remains at that level.
 The test accuracy starts around 0.72, gradually increases to around 0.97,
and stabilizes.

• Analysis:

 The training accuracy reaches a very high level (close to 1.00) early in the
training process and remains stable. This indicates that the model is able to
fit the training data very well.
 The test accuracy also increases but stabilizes at a slightly lower level
(around 0.97) compared to the training accuracy. This suggests that there
might be a small degree of overfitting, as the model performs slightly worse
on the unseen test data compared to the training data. However, the gap
between the training and test accuracies is not very large, indicating that the
overfitting is not severe.

8.5 Top 10 Most Important Words for Spam Plot

• Analysis:

 Words like "money", "million", and "prices" are commonly associated with
spam emails, as they often relate to financial offers or promotions. The
presence of these words with relatively high weights indicates that the
model has learned to recognize these as important features for spam
classification.

 The word "2004" having the highest weight might be an artifact of the
dataset or could potentially be related to some context within the spam
emails in the dataset that is not immediately obvious.
8.6 Confusion Matrix Plot

• Plot Content:
 The x-axis represents the predicted label (HAM or SPAM).
 The y-axis represents the true label (HAM or SPAM).
 The values in the matrix are:
▪ True HAM predicted as HAM: 953
▪ True HAM predicted as SPAM: 13
▪ True SPAM predicted as HAM: 19
▪ True SPAM predicted as SPAM: 131

• Analysis:
 The model has a high number of true positives (131) and true negatives
(953), indicating that it is correctly classifying a large number of emails.
 The number of false positives (13) and false negatives (19) is relatively low,
which is a good sign. This means that the model is not making a large
number of incorrect classifications.
 The overall performance of the model based on the confusion matrix appears
to be quite good, with a high accuracy in classifying both spam and non-spam
emails.

9. Conclusion
This project demonstrates the effectiveness of machine learning in email spam
detection. By leveraging TF-IDF, custom features, and Logistic Regression, the
model achieves high accuracy and interpretability. The tools and visualizations
developed make this solution practical for real-world applications.
