
Spam Email Classifier

Uploaded by

moammadmc23048
Copyright
© All Rights Reserved

Email Spam Filtering

Using Machine Learning


Project Overview

Under the guidance of
Miss Naina Devi

Kavya Jaiswal
Mohammad Afzal
(2301220140044, 2301220140049)
What is Email Spam Filtering?
• Email spam filtering is a technique used to detect and block unwanted or
malicious emails (spam) from entering a user’s inbox.
Why is it Important?
• Reduces security risks, improves productivity, and prevents exposure to
harmful content.
Problem Statement
The Challenge:
• As email volume grows, reliably separating
legitimate (ham) emails from spam becomes
crucial.
Need for Automation:
• Manual email filtering is inefficient; machine
learning offers a scalable, automated
solution.
Technologies Used
• Python: Primary programming language for data processing
and model training.
• Scikit-learn: For machine learning algorithms (Logistic
Regression).
• Natural Language Processing (NLP): Techniques like TF-IDF for
text feature extraction.
• Pandas, NumPy: For data handling and manipulation.
• Matplotlib, Seaborn: For data visualization.
• Streamlit: An open-source Python framework for building
interactive data apps in only a few lines of code.
Brief Intro of Key Technologies
Logistic Regression:
• A supervised learning algorithm for binary
classification tasks, predicting whether an email is
spam or ham.
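As a rough illustration of the idea, logistic regression passes a weighted sum of the features through the sigmoid function and compares the resulting probability with a threshold. The sketch below uses NumPy only; the weights, bias, feature values, and the predict_spam helper are hypothetical and chosen purely for demonstration, not taken from the project.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_spam(features, weights, bias, threshold=0.5):
    """Return (spam probability, spam decision) for one feature vector."""
    probability = sigmoid(np.dot(features, weights) + bias)
    return probability, bool(probability >= threshold)


# Hypothetical model: feature 0 pushes toward spam, feature 1 toward ham.
weights = np.array([2.0, -1.5])
bias = -0.5

p_spam_like, is_spam_a = predict_spam(np.array([1.0, 0.0]), weights, bias)
p_ham_like, is_spam_b = predict_spam(np.array([0.0, 1.0]), weights, bias)
print(p_spam_like, is_spam_a)
print(p_ham_like, is_spam_b)
```

In a trained model the weights and bias are learned from labelled emails rather than set by hand.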
TF-IDF (Term Frequency-Inverse Document Frequency):
• A technique that converts text into numerical
features by weighting each term by how often it
appears in an email and down-weighting terms that
appear in many emails, so distinctive words count more.
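A minimal sketch of TF-IDF in practice, assuming scikit-learn's TfidfVectorizer with default settings. The three short messages are made up for illustration. It shows that a word appearing in fewer documents ("lunch") receives a higher inverse document frequency than one appearing in more documents ("free").

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: two spam-like messages and one ordinary one.
docs = [
    "claim your free prize now",
    "free prize call now",
    "see you at lunch",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

vocab = vectorizer.vocabulary_  # term -> column index
idf = vectorizer.idf_           # one IDF weight per column

print("matrix shape:", tfidf_matrix.shape)
print("idf('free') :", idf[vocab["free"]])
print("idf('lunch'):", idf[vocab["lunch"]])
```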
Methodology
Data Collection:
• Email data collected with labels: "spam" or "ham".
Data Preprocessing:
• Cleaning, handling missing values, and label encoding.
Feature Extraction:
• TF-IDF Vectorizer to convert text into features.
Model Training:
• Logistic Regression model trained on the extracted
features.
Model Evaluation:
• Accuracy evaluated on test data.
Prediction:
• The system predicts whether a new email is spam or
ham.
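The steps above can be sketched end to end as follows. This is an illustrative outline, not the project's actual code: the tiny inline dataset, the column names, the split ratio, and the sample email are all assumptions, and a real run would load the labelled email data instead.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Data collection: hypothetical stand-in for the labelled email data.
data = pd.DataFrame({
    "label": ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"],
    "text": [
        "Win a free prize now, call today",
        "Are we still meeting for lunch tomorrow?",
        "Urgent! Claim your cash reward now",
        "Can you send me the report by Friday?",
        "Free entry to win cash, text WIN now",
        "Thanks for your help with the move",
        None,  # a missing value, dropped during preprocessing
        "See you at the gym tonight",
    ],
})

# Data preprocessing: drop missing rows and encode labels (ham=0, spam=1).
data = data.dropna(subset=["text"]).reset_index(drop=True)
data["target"] = data["label"].map({"ham": 0, "spam": 1})

X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["target"],
    test_size=0.3, random_state=42, stratify=data["target"],
)

# Feature extraction + model training in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Model evaluation on held-out data.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("test accuracy:", accuracy)

# Prediction for a new (hypothetical) email.
new_prediction = model.predict(["Claim your free reward today"])[0]
print("spam" if new_prediction == 1 else "ham")
```

Wrapping the vectorizer and classifier in a single pipeline keeps the TF-IDF vocabulary fitted on the training split only, so the test score is not inflated by information from the test emails.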
Application of Spam Filtering
• Personal Use: Automatic filtering of spam in
email accounts.
• Enterprise Use: Enhances corporate security
by preventing phishing attacks and spam
emails.
• Email Service Providers: Used by Gmail,
Outlook, and other email services to reduce
spam for users.
Advantages of the Model
• High Accuracy: Achieved approximately 96% accuracy on training
and test data.
• Automation: Reduces manual effort in filtering out spam.
• Scalability: Can handle large volumes of email data.
• Efficiency: Quick predictions using machine learning techniques.
Disadvantages and Limitations
Limited Feature Extraction:
• TF-IDF doesn’t capture word context (e.g., meaning or
sequence of words).
Imbalance Issue:
• If the dataset is imbalanced, the model may have biased
predictions.
Static Learning:
• The model doesn’t adapt to new types of spam unless retrained
periodically.
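One common way to soften the imbalance issue, shown here only as a sketch, is to weight classes inversely to their frequency and to report precision and recall for the spam class alongside accuracy. The example assumes scikit-learn's class_weight="balanced" option; the ten messages are invented, and the metrics are computed on the training data purely to keep the example short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced data: 8 ham messages, 2 spam messages.
texts = [
    "see you at lunch", "meeting moved to friday", "thanks for the notes",
    "call me when you land", "happy birthday", "dinner at eight?",
    "report attached", "running late, sorry",
    "win a free prize now", "claim your cash reward now",
]
labels = np.array([0] * 8 + [1] * 2)  # ham=0, spam=1

# "balanced" weights: n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=labels
)
print(dict(zip(["ham", "spam"], weights)))

features = TfidfVectorizer().fit_transform(texts)
model = LogisticRegression(class_weight="balanced").fit(features, labels)
predicted = model.predict(features)

# Per-class metrics expose problems that overall accuracy can hide.
spam_precision = precision_score(labels, predicted, pos_label=1, zero_division=0)
spam_recall = recall_score(labels, predicted, pos_label=1, zero_division=0)
print("spam precision:", spam_precision, "spam recall:", spam_recall)
```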
Future Scope
Advanced NLP Techniques:
• Using models like Word2Vec or BERT to
better understand the context of emails.
Improved Models:
• Experimenting with Random Forests, SVMs,
or deep learning models (e.g., LSTM).
Real-time Spam Detection:
• Deploying the model in real-time email
systems for dynamic spam filtering.
Multiclass Classification:
• Extending beyond spam and ham to detect
promotional, social, and update emails.
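As a starting point for the model experiments mentioned above, alternative classifiers can be swapped into the same TF-IDF pipeline and compared with cross-validation. The sketch below assumes scikit-learn's LinearSVC and RandomForestClassifier; the eight messages, the two-fold split, and the hyperparameters are placeholders, so the scores it prints say nothing about real performance.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, alternating spam (1) and ham (0).
texts = [
    "win a free prize now", "see you at lunch",
    "claim your cash reward", "meeting moved to friday",
    "free entry text win now", "thanks for the notes",
    "urgent prize call now", "dinner at eight tonight",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

candidates = {
    "logistic_regression": LogisticRegression(),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, classifier in candidates.items():
    # Same feature extraction for every candidate, so only the model differs.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    fold_scores = cross_val_score(pipeline, texts, labels, cv=2)
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(name, "mean accuracy:", score)
```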
Test Cases
Case 1:
• WINNER!! As a valued network customer you have been selected to receive £900 prize reward! To claim
call 09061701461. Claim code KL341. Valid 12 hours only.
Case 2:
• Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by
replying YES or NO. If you reply NO you will not be charged
Case 3:
• Hi. Wk been ok - on hols now! Yes on for a bit of a run. Forgot that i have hairdressers appointment at
four so need to get home n shower beforehand. Does that cause prob for u?
Case 4:
• I've been searching for the right words to thank you for this breather. I promise i wont take your help for
granted and will fulfil my promise. You have been wonderful and a blessing at all times.
Conclusion
• The project successfully demonstrated how
machine learning can be applied to email spam
filtering.
• Logistic Regression combined with TF-IDF yielded
a high-accuracy model.
• The project lays the groundwork for future
enhancements, such as using advanced NLP and
deploying the system in real-time environments.
