
Email Spam Filtering Using Machine Learning.1

The document outlines a project on email spam filtering using machine learning, specifically employing Logistic Regression and TF-IDF for feature extraction. It highlights the importance of automation in distinguishing spam from legitimate emails, achieving approximately 96% accuracy. Future enhancements include advanced NLP techniques and real-time spam detection capabilities.

Uploaded by

Mohammad Afzal

Email Spam Filtering Using Machine Learning

Project Overview

Mohammad Afzal
Kavya Jaiswal

Under guidance of Ms. Naina Devi


INTRODUCTION
What is Email Spam Filtering?
Email spam filtering is a technique used to detect and
block unwanted or malicious emails (spam) from entering
a user’s inbox.

Why is it Important?
Helps reduce security risks, enhances productivity, and
prevents exposure to harmful content.
PROBLEM STATEMENT
• As email volume grows, reliably distinguishing legitimate (not spam)
emails from spam becomes crucial.

NEED FOR AUTOMATION

• Manual email filtering is inefficient; machine learning offers a scalable,
automated solution.
TECHNOLOGY USED

• Python: Primary programming language for data processing and model training.
• Scikit-learn: For machine learning algorithms (Logistic Regression).
• Natural Language Processing (NLP): Techniques like TF-IDF for text feature extraction.
• Pandas, NumPy: For data handling and manipulation.
• Matplotlib, Seaborn: For data visualization.
• Streamlit: An open-source Python framework that lets data scientists and AI/ML
engineers deliver interactive data apps in only a few lines of code.
BRIEF OF TECHNOLOGY USED

Logistic Regression:
• A supervised learning algorithm for binary classification, used here to predict whether an
email is spam or ham (not spam).
TF-IDF (Term Frequency-Inverse Document Frequency):
• A technique that converts text into numerical values by weighting each word by how often it
appears in an email and down-weighting words that are common across all emails, so the
machine learning model can gauge the importance of terms.
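
To make the TF-IDF idea concrete, here is a minimal sketch in plain Python. It assumes the common textbook weighting tf × log(N / df); the function name `tf_idf` and the three sample messages are made up for illustration and are not from the project. Scikit-learn's TfidfVectorizer applies a smoothed, normalized variant, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumes the textbook formula: tf = count / len(doc), idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample messages, not taken from the project's dataset.
docs = [
    "claim your free prize now",
    "are you free for lunch",
    "claim code inside",
]
scores = tf_idf(docs)
```

In the first message, "prize" occurs in only one of the three messages while "free" occurs in two, so `scores[0]["prize"]` is larger than `scores[0]["free"]`. A word that appeared in every message would receive a weight of zero under this formula.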
METHODOLOGY
Data Collection:
• Email data collected with labels: "spam" or "not spam".
Data Preprocessing:
• Cleaning, handling missing values, and label encoding.
Feature Extraction:
• TF-IDF Vectorizer to convert text into features.
Model Training:
• Logistic Regression model trained on the extracted features.
Model Evaluation:
• Accuracy evaluated on test data.
Prediction:
• The system predicts whether a new email is spam or not.
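
The steps above can be sketched with scikit-learn roughly as follows. This is an illustrative outline, not the project's actual code: the eight sample messages, the 75/25 split, and the variable names are assumptions, and the cleaning and missing-value handling are omitted because the tiny toy data needs none.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the labelled email dataset.
texts = [
    "win a free prize call now",
    "urgent claim your cash reward today",
    "free ringtone subscription reply yes",
    "you have been selected for a free holiday",
    "are we still meeting for lunch tomorrow",
    "please review the attached project report",
    "thanks for your help with the presentation",
    "can you call me when you get home",
]
labels = ["spam"] * 4 + ["not spam"] * 4

# Label encoding: spam -> 1, not spam -> 0.
y = [1 if label == "spam" else 0 for label in labels]

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.25, stratify=y, random_state=42
)

# Feature extraction and model training chained in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Model evaluation on the held-out data.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")

# Prediction for a new, unseen message.
prediction = model.predict(["free prize waiting, call now to claim"])[0]
print("spam" if prediction == 1 else "not spam")
```

Wrapping the vectorizer and the classifier in a single pipeline means the TF-IDF vocabulary is learned from the training split only, which keeps test data from leaking into feature extraction. With so few samples the reported accuracy is not meaningful; the sketch only shows how the stages connect.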
APPLICATION OF SPAM FILTERING

• Personal Use: Automatic filtering of spam in email accounts.
• Enterprise Use: Enhances corporate security by preventing phishing attacks and spam
emails.
• Email Service Providers: Used by Gmail, Outlook, and other email services to reduce
spam for users.
ADVANTAGES
• High Accuracy: Achieved approximately 96% accuracy on training and test data.
• Automation: Reduces manual effort in filtering out spam.
• Scalability: Can handle large volumes of email data.
• Efficiency: Quick predictions using machine learning techniques.
DISADVANTAGES AND LIMITATIONS

• Limited Feature Extraction:
TF-IDF doesn’t capture word context (e.g., meaning or sequence of words).
• Imbalance Issue:
If the dataset is imbalanced, the model may have biased predictions.
• Static Learning:
The model doesn’t adapt to new types of spam unless retrained periodically.
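
The imbalance issue is easiest to see with numbers. The sketch below uses a made-up test set of 95 legitimate emails and 5 spam emails, plus a degenerate model that labels everything as not spam; the helper name `confusion_counts` is hypothetical. It suggests why accuracy alone can hide biased predictions and why a metric such as recall on the spam class is worth checking as well.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

# Hypothetical imbalanced test set: 95 legitimate emails (0), 5 spam (1).
y_true = [0] * 95 + [1] * 5
# A degenerate model that labels every email as not spam.
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Recall: share of actual spam that the model caught.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f}, spam recall={recall:.2f}")
```

Here accuracy is 0.95 even though no spam is caught (recall 0.00). One common mitigation, not stated in the original project, is to pass `class_weight="balanced"` to scikit-learn's LogisticRegression so that errors on the rarer class count more during training.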
FUTURE SCOPE
Advanced NLP Techniques:
• Using models like Word2Vec or BERT to better understand the context of emails.
Improved Models:
• Experimenting with Random Forests, SVMs, or deep learning models (e.g., LSTM).
Real-time Spam Detection:
• Deploying the model in real-time email systems for dynamic spam filtering.
Multiclass Classification:
• Extending beyond spam and not spam to detect promotional, social, and update
emails.
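
As a rough illustration of the "Improved Models" idea, the sketch below compares Logistic Regression with a linear SVM and a Random Forest on the same TF-IDF features using cross-validation. The twelve sample messages, the choice of `LinearSVC`, and the hyperparameters are assumptions made for this example; scores on such a small toy set say nothing about real performance.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 6 spam messages followed by 6 legitimate ones.
texts = [
    "win a free prize call now",
    "urgent claim your cash reward today",
    "free ringtone subscription reply yes",
    "you have been selected for a free holiday",
    "claim your free reward code now",
    "call now to win cash prize",
    "are we still meeting for lunch tomorrow",
    "please review the attached project report",
    "thanks for your help with the presentation",
    "can you call me when you get home",
    "see you at the meeting tomorrow",
    "the report is attached for your review",
]
y = [1] * 6 + [0] * 6

candidates = {
    "logistic_regression": LogisticRegression(),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, classifier in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    # 3-fold stratified cross-validation; accuracy is the default score.
    scores = cross_val_score(pipeline, texts, y, cv=3)
    results[name] = scores.mean()

for name, mean_accuracy in results.items():
    print(f"{name}: {mean_accuracy:.2f}")
```

Because each candidate sits behind the same vectorizer in its own pipeline, swapping models requires changing only one dictionary entry, and every model is scored on identical folds.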
TEST CASES
Case 1:
• WINNER!! As a valued network customer you have been selected to receive £900 prize
reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
Case 2:
• Thanks for your subscription to Ringtone UK your mobile will be
charged £5/month Please confirm by replying YES or NO. If you
reply NO you will not be charged.
Case 3:
• Hi. Wk been ok - on hols now! Yes on for a bit of a run. Forgot that i
have hairdressers appointment at four so need to get home n shower
beforehand. Does that cause prob for u?
Case 4:
• I've been searching for the right words to thank you for this breather. I
promise i wont take your help for granted and will fulfil my promise.
You have been wonderful and a blessing at all times.
CONCLUSION

• The project successfully demonstrated how machine learning can be
applied to email spam filtering.
• Logistic Regression combined with TF-IDF yielded a high-accuracy model.
• The project lays the groundwork for future enhancements, such as using
advanced NLP and deploying the system in real-time environments.
Thank You
