Email Spam Filtering Using Machine Learning
Why is it Important?
Spam filtering reduces security risks, improves productivity, and
prevents exposure to harmful content.
PROBLEM STATEMENT
• As email volume grows, reliably distinguishing legitimate (not spam)
emails from spam is crucial.
• Python: Primary programming language for data processing and model training.
• Scikit-learn: For machine learning algorithms (Logistic Regression).
• Natural Language Processing (NLP): Techniques like TF-IDF for text feature extraction.
• Pandas, NumPy: For data handling and manipulation.
• Matplotlib, Seaborn: For data visualization.
• Streamlit: Open-source Python framework for building interactive data apps
in only a few lines of code.
BRIEF OF TECHNOLOGY USED
Logistic Regression:
• A supervised learning algorithm for binary classification tasks, predicting whether an
email is spam or ham (not spam).
TF-IDF (Term Frequency-Inverse Document Frequency):
• A technique that converts text into numerical features by weighting each word by how
often it appears in an email and how rare it is across all emails, helping the machine
learning model gauge the importance of terms.
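As a rough illustration of the two ideas above, the sketch below computes TF-IDF weights by hand for a toy corpus and then passes one weighted email through the logistic (sigmoid) function. The corpus, the weights, and the bias are made-up values for illustration, not learned parameters. The formula is the textbook tf × log(N / df); scikit-learn's TfidfVectorizer uses a smoothed, normalized variant, so its numbers would differ.

```python
import math
from collections import Counter

# Toy corpus (illustrative only, not the project's dataset).
docs = [
    "win a free prize now",
    "free entry win cash",
    "meeting agenda for monday",
    "lunch on monday",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Textbook TF-IDF: (count / doc length) * log(N / df).

    Only valid for tokens that occur somewhere in the toy corpus.
    """
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


def sigmoid(z):
    """Logistic function mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights; a trained model would learn these from data.
weights = {"free": 4.0, "win": 4.0, "prize": 3.0, "monday": -4.0}
bias = 0.0


def spam_probability(tokens):
    """Weighted sum of TF-IDF features passed through the sigmoid."""
    features = tfidf(tokens)
    z = bias + sum(weights.get(t, 0.0) * v for t, v in features.items())
    return sigmoid(z)


p_spam = spam_probability(tokenized[0])  # "win a free prize now"
p_ham = spam_probability(tokenized[3])   # "lunch on monday"
```

A probability above 0.5 would be read as spam and below 0.5 as ham; with these made-up weights the first email lands above the threshold and the last one below it.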
METHODOLOGY
Data Collection:
• Email data collected with labels: "spam" or "not spam".
Data Preprocessing:
• Cleaning, handling missing values, and label encoding.
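The preprocessing step might look roughly like the following with Pandas. This is a minimal sketch: the column names `label` and `text`, the sample rows, and the specific cleaning choices (lowercasing, trimming whitespace, dropping incomplete rows) are assumptions for illustration, not the project's actual data or rules.

```python
import pandas as pd

# Hypothetical raw data; the column names "label" and "text" are assumptions.
raw = pd.DataFrame({
    "label": ["spam", "not spam", "spam", None, "not spam"],
    "text": ["WIN a FREE prize!!!", "Lunch at noon?", None,
             "Hello", "  See you Monday  "],
})

# Handle missing values: drop rows lacking either a label or a text.
clean = raw.dropna(subset=["label", "text"]).copy()

# Basic cleaning: lowercase and trim surrounding whitespace.
clean["text"] = clean["text"].str.lower().str.strip()

# Label encoding: spam -> 1, not spam -> 0.
clean["target"] = clean["label"].map({"spam": 1, "not spam": 0})
```

Dropping incomplete rows is only one way to handle missing values; filling missing text with an empty string is another common choice.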
Feature Extraction:
• TF-IDF Vectorizer to convert text into features.
Model Training:
• Logistic Regression model trained on the extracted features.
Model Evaluation:
• Accuracy evaluated on test data.
Prediction:
• The system predicts whether a new email is spam or not.
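The steps above can be sketched end to end with scikit-learn. The tiny in-memory dataset, the 75/25 split, and the default hyperparameters are assumptions made so the example runs on its own; a real project would load a much larger labeled corpus, and accuracy on a dataset this small is not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical labeled emails: 1 = spam, 0 = not spam.
texts = [
    "win a free prize now", "claim your free cash today",
    "cheap loans approved instantly", "you won a lottery prize",
    "free entry to win cash", "urgent offer claim your reward",
    "meeting agenda for monday", "lunch at noon tomorrow",
    "please review the attached report", "see you at the office",
    "project deadline moved to friday", "notes from the team call",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out a quarter of the data for evaluation, keeping classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Feature extraction and model training chained in a single pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Model evaluation: accuracy on the held-out test emails.
accuracy = accuracy_score(y_test, model.predict(X_test))

# Prediction for a new, unseen email (1 = spam, 0 = not spam).
prediction = model.predict(["free cash prize, claim now"])[0]
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is fitted on the training emails only, which avoids leaking information from the test set into the features.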
APPLICATION OF SPAM FILTERING