Spam Email Classifier
Under the guidance of
Miss Naina Devi
Kavya Jaiswal, Mohammad Afzal
(2301220140044, 2301220140049)
What is Email Spam Filtering?
• Email spam filtering is a technique used to detect and block unwanted or
malicious emails (spam) from entering a user’s inbox.
Why is it Important?
• Reduces security risks, improves productivity, and prevents exposure to
harmful content.
Problem Statement
The Challenge:
• With the growing volume of email,
distinguishing legitimate (ham) emails
from spam is crucial.
Need for Automation:
• Manual email filtering is inefficient; machine
learning offers a scalable, automated
solution.
Technologies Used
• Python: Primary programming language for data processing
and model training.
• Scikit-learn: For machine learning algorithms (Logistic
Regression).
• Natural Language Processing (NLP): Techniques like TF-IDF for
text feature extraction.
• Pandas, NumPy: For data handling and manipulation.
• Matplotlib, Seaborn: For data visualization.
• Streamlit: Open-source Python framework for building
interactive data apps in a few lines of code.
Brief Intro of Key Technologies
Logistic Regression:
• A supervised learning algorithm for binary
classification tasks, predicting whether an email is
spam or ham.
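The idea behind logistic regression can be sketched in a few lines of plain Python. This is an illustration only, not the project's trained model: the two features, the weights, and the bias below are made-up values chosen to show how a weighted sum is turned into a spam probability and then into a label.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (label, spam probability) for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    p_spam = sigmoid(score)
    return ("spam" if p_spam >= threshold else "ham"), p_spam


# Hypothetical weights for two made-up features:
# [email contains "prize", email contains "thanks"]
weights = [2.0, -1.5]
bias = -0.5

label, p_spam = predict_label([1.0, 0.0], weights, bias)
print(label, round(p_spam, 3))
```

During training, the algorithm learns the weights and bias from labelled emails; the 0.5 threshold is a common default and can be adjusted.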
TF-IDF (Term Frequency-Inverse Document Frequency):
• A technique that converts text into numerical values:
each term is weighted by how often it appears in an
email, discounted by how many emails contain it, so
distinctive terms count more than common ones.
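To make the weighting concrete, here is a minimal hand-rolled sketch using the textbook form tf × log(N / df). The three sample messages are invented for illustration. Note that scikit-learn's TfidfVectorizer, by default, uses a smoothed IDF and L2-normalises each row, so its numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up sample messages
docs = ["free prize call now", "call me now", "see you at lunch"]
scores = tf_idf(docs)

# "prize" appears in one message, "call" in two, so "prize" scores higher
print(scores[0]["prize"] > scores[0]["call"])
```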
Methodology
Data Collection:
• Email data collected with labels: "spam" or "ham".
Data Preprocessing:
• Cleaning, handling missing values, and label encoding.
Feature Extraction:
• TF-IDF Vectorizer to convert text into features.
Model Training:
• Logistic Regression model trained on the extracted
features.
Model Evaluation:
• Accuracy evaluated on test data.
Prediction:
• The system predicts whether a new email is spam or
ham.
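The steps above can be sketched end to end with scikit-learn. This is a minimal sketch under stated assumptions, not the project's actual code: the nine toy messages are invented stand-ins for the real dataset, and the split ratio, stop-word setting, and random seed are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Data collection: toy stand-in data (None marks a missing message)
raw = [
    ("spam", "Win a free prize now, call to claim"),
    ("ham", "Are we still meeting for lunch today?"),
    ("spam", "Urgent! You have won cash, reply to claim"),
    ("ham", "Can you send me the notes from class"),
    ("spam", "Free ringtone offer, text YES to subscribe"),
    ("ham", None),
    ("ham", "I'll call you when I get home"),
    ("spam", "Claim your free reward code today"),
    ("ham", "Thanks for your help with the project"),
]

# Preprocessing: drop missing messages, encode labels (ham -> 0, spam -> 1)
rows = [(label, text.strip()) for label, text in raw if text]
texts = [text for _, text in rows]
labels = [1 if label == "spam" else 0 for label, _ in rows]

# Hold out a stratified test set so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Feature extraction and model training in one pipeline
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)
model.fit(X_train, y_train)

# Model evaluation on the held-out data
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")

# Prediction for a new, unseen email (1 = spam, 0 = ham)
prediction = model.predict(["Congratulations, you won a free prize"])[0]
print("spam" if prediction == 1 else "ham")
```

Wrapping the vectorizer and classifier in one pipeline ensures the TF-IDF vocabulary is learned from the training split only and then reused unchanged at prediction time.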
Application of Spam Filtering
• Personal Use: Automatic filtering of spam in
email accounts.
• Enterprise Use: Enhances corporate security
by preventing phishing attacks and spam
emails.
• Email Service Providers: Used by Gmail,
Outlook, and other email services to reduce
spam for users.
Advantages of the Model
• High Accuracy: Achieved approximately 96% accuracy on training
and test data.
• Automation: Reduces manual effort in filtering out spam.
• Scalability: Can handle large volumes of email data.
• Efficiency: Quick predictions using machine learning techniques.
Disadvantages and Limitations
Limited Feature Extraction:
• TF-IDF treats each email as a bag of words, so it ignores
word order and meaning.
Imbalance Issue:
• If one class dominates the dataset, the model may be biased
toward it, and accuracy alone can hide poor spam detection.
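The imbalance issue is easy to demonstrate with made-up numbers. In this sketch the labels are invented (95 ham, 5 spam) and the "model" is a degenerate one that always answers ham; it still reaches high accuracy while catching no spam at all, which is why precision and recall are worth reporting alongside accuracy.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 95 ham (0) and 5 spam (1)
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts ham
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.95 precision=0.00 recall=0.00
```

One common mitigation, offered here as a suggestion rather than something the project is known to use, is passing class_weight="balanced" to scikit-learn's LogisticRegression.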
Static Learning:
• The model doesn’t adapt to new types of spam unless retrained
periodically.
Future Scope
Advanced NLP Techniques:
• Using models like Word2Vec or BERT to
better understand the context of emails.
Improved Models:
• Experimenting with Random Forests, SVMs,
or deep learning models (e.g., LSTM).
Real-time Spam Detection:
• Deploying the model in real-time email
systems for dynamic spam filtering.
Multiclass Classification:
• Extending beyond spam and ham to detect
promotional, social, and update emails.
Test Cases
Case 1:
• WINNER!! As a valued network customer you have been selected to receive £900 prize reward! To claim
call 09061701461. Claim code KL341. Valid 12 hours only.
Case 2:
• Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by
replying YES or NO. If you reply NO you will not be charged
Case 3:
• Hi. Wk been ok - on hols now! Yes on for a bit of a run. Forgot that i have hairdressers appointment at
four so need to get home n shower beforehand. Does that cause prob for u?
Case 4:
• I've been searching for the right words to thank you for this breather. I promise i wont take your help for
granted and will fulfil my promise. You have been wonderful and a blessing at all times.
Conclusion
• The project successfully demonstrated how
machine learning can be applied to email spam
filtering.
• Logistic Regression combined with TF-IDF yielded
a high-accuracy model.
• The project lays the groundwork for future
enhancements, such as using advanced NLP and
deploying the system in real-time environments.