EISystems Report On Email Spam Detection (SRI HARI)
On
Email Spam Detection using Multinomial Naive Bayes
Submitted by:
[PULIPATI SRI HARI]
[University Roll No:- 22195A0506]
[College Name:- JNTUA College of Engineering Pulivendula]
Submitted to:
Mallika Srivastava
Head, Training Delivery
EISystems Services
&
EISYSTEMS SERVICES
FF-110, Express Greens Plaza, Sector 1
Vaishali – Delhi NCR – India 201010
W: www.eisystems.in | E: [email protected] | T: +91 92122-51000
Student’s Declaration
I, PULIPATI SRI HARI, a student of the BTech program, Roll No. 22195A0506, of the
Department of CSE, JNTUA College of Engineering Pulivendula, do hereby
declare that I have completed the mandatory internship at EiSystems Technologies
under the guidance of MALLIKA SRIVASTAVA, Head of the Department of
Training Delivery, EISystems Services.
P. Sri hari
06/05/2024
(Signature and Date)
Endorsements
SIGNATURE
[Mallika Srivastava]
[Head, Training Delivery]
[EISystems Services]
SIGNATURE
[Mayur Dev Sewak]
[Head, Internships & Trainings]
[EISystems Services]
Table of Contents
Serial No | Title of the Content
1 | Executive Summary
2 | Overview of Organization
3 | Project Summary
9 | References
List of Figures
Serial No Image Caption Page No
List of Tables
Serial No | Table Name | Page No
Serial No | Notations | Description
1 | Data Flow Diagram | The Data Flow Diagram (DFD) for the Email Spam Detection Model illustrates the flow of data from email input to prediction result through processes such as text preprocessing, feature extraction, model training, and prediction, facilitating a clear understanding of data flow and system operations.
Executive Summary
The EISystems Data Science internship equipped participants with a robust foundation in Python
programming and machine learning concepts, fostering practical skills through hands-on projects. The
internship aimed to achieve proficiency in data analysis, machine learning algorithms, and project
management, resulting in tangible learning outcomes and valuable experiences.
Learning Objectives:
1. Attain proficiency in Python programming for data manipulation, analysis, and visualization.
2. Understand core machine learning algorithms and their application in solving real-world problems.
3. Develop skills in data preprocessing, feature engineering, model training, and evaluation.
4. Enhance project management, collaboration, and communication skills within a team environment.
Learning Outcomes:
1. Mastered Python programming for data science tasks, including data cleaning, exploration, and
visualization using Pandas, NumPy, and Matplotlib.
2. Demonstrated proficiency in core machine learning algorithms, applying techniques such as
regression, classification, and ensemble methods to real-world datasets.
3. Implemented effective data preprocessing techniques, handling missing values, encoding
categorical variables, and scaling features.
4. Successfully managed and executed data science projects, from project planning to presentation of
results, within specified timelines.
Summary of Activities:
1. Engaged in comprehensive training sessions covering Python programming, data manipulation, and
machine learning concepts.
2. Completed hands-on exercises and assignments to reinforce learning and practical skills.
3. Worked collaboratively on real-world data science projects, conducting data preprocessing,
exploratory data analysis, and predictive modeling.
4. Presented project findings and results in team meetings, fostering collaboration and knowledge
sharing.
5. Pursued continuous learning through self-study and exploration of additional resources to deepen
understanding and expand skill set in data science and Python programming.
Overview of Organization
EISystems Services is a leading technology solutions provider offering software development, data
analytics, and digital transformation services. Committed to innovation and excellence, we empower
organizations worldwide with cutting-edge technology solutions. Our vision is to be a global leader in
technology, driving innovation and delivering transformative results. We value excellence, integrity,
collaboration, innovation, and customer focus. Through our internship program, we provide hands-on
learning experiences and mentorship to nurture talent and foster innovation.
Project Summary
Idea Behind Making This Project:
The idea behind this project is to develop a machine learning model capable of distinguishing between
spam and non-spam (ham) emails. By leveraging the Multinomial Naive Bayes algorithm, the project aims
to create an effective spam detection system that can automatically filter out unwanted emails, saving
time and improving inbox management.
About Project:
The project involves building a spam email classifier using the Multinomial Naive Bayes algorithm. It utilizes
a dataset of labeled email messages, where each message is categorized as spam or ham. The model is
trained on features extracted from the text content of emails, allowing it to learn patterns and
characteristics associated with spam emails.
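The classification step described above can be illustrated with a minimal sketch of Multinomial Naive Bayes written in plain Python. The four training messages are invented for illustration and are not taken from the project's dataset. The function names fit and predict are hypothetical, and Laplace smoothing with alpha = 1 is an assumption. The project itself relies on scikit-learn's MultinomialNB rather than hand-written code.

```python
import math
from collections import Counter

# Toy training data, invented for illustration only (not the project's dataset).
train = [
    ("win a free prize now", "spam"),
    ("free offer just for you", "spam"),
    ("are you coming to class today", "ham"),
    ("see you at the meeting", "ham"),
]


def fit(messages, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods per class."""
    word_counts = {}          # class label -> Counter of word occurrences
    class_counts = Counter()  # class label -> number of messages
    for text, label in messages:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    model = {}
    for label, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_counts[label] / len(messages)),
            "log_likelihood": {
                word: math.log((counts[word] + alpha) / total) for word in vocab
            },
        }
    return model


def predict(model, text):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word in text.lower().split():
            if word in params["log_likelihood"]:
                score += params["log_likelihood"][word]
        scores[label] = score
    return max(scores, key=scores.get)


model = fit(train)
print(predict(model, "free prize for you"))
print(predict(model, "see you in class today"))
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow when a message contains many words.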
Python: Programming language used for data preprocessing, model training, and evaluation.
Streamlit: Web application framework used for building the user interface.
Scikit-learn: Python library used for implementing the Multinomial Naive Bayes classifier and other
machine learning functionalities.
Python programming skills for data manipulation, analysis, and machine learning model implementation.
Familiarity with machine learning algorithms, particularly the Multinomial Naive Bayes algorithm.
Understanding of text preprocessing techniques, including tokenization, stemming, and vectorization.
Basic knowledge of web development for building the user interface using Streamlit.
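The preprocessing steps named above (tokenization, stemming, vectorization) can be sketched in plain Python as follows. This is an illustrative sketch only: the suffix-stripping crude_stem function is a toy stand-in for a real stemmer such as the Porter stemmer, all function names are hypothetical, and the project itself uses scikit-learn's CountVectorizer for vectorization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vectorize(texts):
    """Build a sorted vocabulary and one word-count vector per text."""
    stemmed = [[crude_stem(tok) for tok in tokenize(text)] for text in texts]
    vocabulary = sorted({tok for doc in stemmed for tok in doc})
    vectors = []
    for doc in stemmed:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = vectorize(["Winning prizes!", "You won a prize"])
print(vocabulary)
print(vectors)
```

Each vector has one entry per vocabulary word, so "prizes" and "prize" end up in the same column after stemming.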
Research Done:
Research was conducted to explore various machine learning algorithms suitable for text classification
tasks, with a focus on the Multinomial Naive Bayes algorithm for spam detection. Experimentation was
carried out to optimize model performance through hyperparameter tuning and feature selection
techniques. Additionally, research was conducted on best practices for data preprocessing, including text
cleaning and feature engineering, to improve the effectiveness of the spam detection system.
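A hedged sketch of what such tuning might look like with scikit-learn's GridSearchCV is given below. The report does not state which hyperparameters or values were actually tried, so the grid (smoothing parameter alpha and n-gram range), the toy messages, and the 2-fold cross-validation are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy messages invented for illustration; 1 = spam, 0 = ham.
messages = [
    "win a free prize now",
    "free offer just for you",
    "claim your free reward today",
    "exclusive discount offer, do not miss it",
    "are you coming to class today",
    "see you at the meeting",
    "please send me the lecture notes",
    "lunch at noon tomorrow?",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Chain vectorization and classification so both are tuned together.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Hypothetical search space: unigrams vs. unigrams+bigrams, three alpha values.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 0.5, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(messages, labels)

print(search.best_params_)
print(search.predict(["free prize offer"]))
```

Putting the vectorizer inside the pipeline means each cross-validation fold learns its vocabulary from its own training split only, which avoids leaking test words into training.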
Data Flow Diagram:
User Input → Data Preprocessing (Text Cleaning, Feature Extraction) → Model Training (MultinomialNB) → Model Prediction → Output (Spam/Ham Classification)
PROGRAM :-
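import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    ConfusionMatrixDisplay,
)
import pickle

# Assumption: the report does not name the dataset file. "spam.csv" is a
# placeholder for a CSV with the 'Category' and 'Message' columns used below.
dataset = pd.read_csv("spam.csv")

# Count the messages per class (assumes the labels are 'ham' and 'spam').
category_counts = dataset['Category'].value_counts()
number_of_ham = category_counts.get('ham', 0)
number_of_spam = category_counts.get('spam', 0)

# Pie chart of the class distribution.
plt.figure(figsize=(15, 6))
mail_categories = [number_of_ham, number_of_spam]
labels = [f"Ham = {number_of_ham}", f"Spam = {number_of_spam}"]
explode = [.2, 0]
plt.pie(mail_categories, labels=labels, explode=explode, autopct="%.2f %%")
plt.title("Ham vs Spam")
plt.show()

# Encode the text labels as integers (LabelEncoder sorts classes alphabetically).
encoder = LabelEncoder()
dataset['spam'] = encoder.fit_transform(dataset['Category'])
x = dataset['Message']
y = dataset['spam']

# Hold out 30% of the messages for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Learn the vocabulary from the training messages only and count word occurrences.
vectorizer = CountVectorizer()
x_train_counts = vectorizer.fit_transform(x_train)

# Train the Multinomial Naive Bayes classifier on the word counts.
classifier = MultinomialNB()
classifier.fit(x_train_counts, y_train)

# Evaluate on the held-out messages, reusing the vocabulary learned above.
x_test_counts = vectorizer.transform(x_test)
y_pred = classifier.predict(x_test_counts)
print(classification_report(y_test, y_pred))

# Classify a few unseen messages.
emails = [
    "Hey jessica, I'm at the Ms.Salahshor class waiting for you, where are you?",
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!',
    '''Join us on Saturday, February 24 at 14:00 UTC on our YouTube channel to take this
interactive lesson, taught by Tutor Darryl.'''
]
emails_count = vectorizer.transform(emails)
print(emails_count)
print(classifier.predict(emails_count))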
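The classification_report call in the program prints precision, recall, and F1 score per class. A minimal sketch of how these figures follow from the confusion matrix is given below, in plain Python. The label lists are made up for illustration (1 = spam, 0 = ham) and are not results from the project; the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the chosen positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for illustration: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

print(confusion_counts(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred))
```

For a spam filter, precision on the spam class matters in particular, because a false positive means a legitimate email is hidden from the user.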
Figure 2:- importing the required libraries
Figure 4:- Data visualization using Pie chart
Figure 6:- Fitting the Training data to MultinomialNB
CODE FOR USER INTERFACE OF THE MODEL
Figure 8:- Importing streamlit library and loading the pickle files
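The original figure is not reproduced here. As a rough sketch of the step the caption describes, the code below pickles a fitted vectorizer and classifier, loads them back, and wraps prediction in a helper that a user-interface callback could call. The file names vectorizer.pkl and classifier.pkl, the helper names, and the toy training messages are all assumptions; the Streamlit widgets themselves are omitted.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data invented for illustration; 1 = spam, 0 = ham.
messages = [
    "win a free prize now",
    "free offer just for you",
    "are you coming to class today",
    "see you at the meeting",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(messages), labels)


def save_artifacts(directory):
    """Pickle the fitted vectorizer and classifier (file names are hypothetical)."""
    directory = Path(directory)
    with open(directory / "vectorizer.pkl", "wb") as f:
        pickle.dump(vectorizer, f)
    with open(directory / "classifier.pkl", "wb") as f:
        pickle.dump(classifier, f)


def load_artifacts(directory):
    """Load the pickled vectorizer and classifier back from disk."""
    directory = Path(directory)
    with open(directory / "vectorizer.pkl", "rb") as f:
        loaded_vectorizer = pickle.load(f)
    with open(directory / "classifier.pkl", "rb") as f:
        loaded_classifier = pickle.load(f)
    return loaded_vectorizer, loaded_classifier


def predict_label(message, loaded_vectorizer, loaded_classifier):
    """Return 'Spam' or 'Ham' for a single message."""
    counts = loaded_vectorizer.transform([message])
    return "Spam" if loaded_classifier.predict(counts)[0] == 1 else "Ham"


# Round-trip through a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    save_artifacts(tmp)
    loaded_vectorizer, loaded_classifier = load_artifacts(tmp)

print(predict_label("free prize offer", loaded_vectorizer, loaded_classifier))
```

The vectorizer has to be saved alongside the classifier, because new messages must be mapped onto the same vocabulary the model was trained with. Pickle files should only be loaded from trusted sources.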
Figure 10:- Adding Title of the interface and Components
Input Dataset of Email Spam and Ham Messages:
Email Spam Detection interface by predicting the message as Spam
VIDEO LINK :-
https://drive.google.com/file/d/1n9DTSmHNeamZhfzEYSfqkx86rMGTbhLq/view?usp=sharing
References
1. Dataset Source:
Kaggle: https://www.kaggle.com/datasets
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
2. Research Papers:
"Machine Learning Techniques in Spam Email Detection": Link to paper
"Email Spam Filtering: A Review": Link to paper
3. Books:
"Machine Learning Yearning" by Andrew Ng: Link to book
"Natural Language Processing with Python" by Steven Bird, Ewan Klein, and Edward Loper: Link to book
4. Online Articles and Tutorials:
Towards Data Science: https://towardsdatascience.com/
Analytics Vidhya: https://www.analyticsvidhya.com/
Medium: https://medium.com/
5. Official Documentation:
Scikit-learn documentation for Multinomial Naive Bayes: Link to documentation
Streamlit documentation for building web applications: Link to documentation
6. GitHub Repositories:
GitHub: https://github.com/ (Search for email spam detection projects)
1) Oral communication 1 2 3 4 5
2) Written communication 1 2 3 4 5
3) Initiative 1 2 3 4 5
5) Attitude 1 2 3 4 5
6) Dependability 1 2 3 4 5
7) Ability to learn 1 2 3 4 5
9) Professionalism 1 2 3 4 5
10) Creativity 1 2 3 4 5
12) Productivity 1 2 3 4 5
P. Sri hari
Signature of the Student
Annexure 1
Daily Activity Report
WEEK - 1
Day & Date | Brief Description of Daily Activity | Learning Outcome | Person In-Charge
Day 1 | Introduction to Python basics | Understanding basic Python syntax and data types | Mallika Srivastava
Day 2 | Variables, data types, and operators | Familiarity with Python variables, data types, and operators | Mallika Srivastava
Day 3 | Control flow and loops | Knowledge of conditional statements and loops in Python | Mallika Srivastava
Day 4 | Lists, tuples, dictionaries | Understanding Python data structures like lists, tuples, and dictionaries | Mallika Srivastava
Day 5 | Functions and file handling | Learning to define functions and work with files in Python | Mallika Srivastava
WEEK - 2
Day & Date | Brief Description of Daily Activity | Learning Outcome | Person In-Charge
Day 1 | Introduction to Python functions | Understanding function syntax and usage in Python | Mallika Srivastava
Day 2 | Parameters, arguments, and return values | Familiarity with function parameters, arguments, and return values | Mallika Srivastava
Day 3 | Scope of variables and built-in modules | Knowledge of variable scope and Python built-in modules | Mallika Srivastava
Day 4 | Creating and using custom modules | Learning to create and import custom modules in Python | Mallika Srivastava
Day 5 | Error handling and exception handling | Understanding error types and how to handle exceptions in Python | Mallika Srivastava
WEEK - 3
Day & Date | Brief Description of Daily Activity | Learning Outcome | Person In-Charge
WEEK - 4
Day & Date | Brief Description of Daily Activity | Learning Outcome | Person In-Charge
Day 1 | Introduction to data visualization | Understanding the importance of data visualization | Mallika Srivastava
Day 2 | Basic plotting with Matplotlib | Familiarity with basic plotting techniques in Matplotlib | Mallika Srivastava
Day 3 | Advanced plotting with Seaborn | Knowledge of advanced data visualization techniques with Seaborn | Mallika Srivastava
Day 4 | Interactive visualization with Plotly | Learning to create interactive plots with Plotly | Mallika Srivastava
Day 5 | Dashboard creation with Streamlit | Understanding how to create interactive dashboards with Streamlit | Mallika Srivastava
WEEK - 5
Day & Date | Brief Description of Daily Activity | Learning Outcome | Person In-Charge
learning
Day 2 | Supervised learning and unsupervised learning | Familiarity with different types of machine learning techniques | Mallika Srivastava
Day 3 | Model evaluation metrics | Knowledge of metrics used to evaluate machine learning models | Mallika Srivastava
Day 4 | Model selection and hyperparameter tuning | Learning to select appropriate models and tune hyperparameters | Mallika Srivastava
Day 5 | Model deployment considerations and techniques | Understanding techniques and considerations for deploying machine learning models | Mallika Srivastava
WEEK - 6
Day & Date | Brief Description of Daily Activity | Learning Outcome | Person In-Charge
Day 1 | Introduction to machine learning algorithms | Understanding various machine learning algorithms | Mallika Srivastava
Day 2 | Linear regression and logistic regression | Familiarity with linear regression and logistic regression algorithms | Mallika Srivastava
Day 3 | Decision trees and ensemble methods | Knowledge of decision tree algorithms and ensemble methods | Mallika Srivastava
Day 4 | Support vector machines and k-nearest neighbors | Learning about support vector machines and k-nearest neighbors algorithms | Mallika Srivastava
Day 5 | Clustering algorithms and dimensionality reduction techniques | Understanding clustering algorithms and dimensionality reduction techniques | Mallika Srivastava
WEEK - 7
Day & Date | Brief Description of Daily Activity | Learning Outcome | Person In-Charge
Day 1 | Understanding email data | Familiarity with email | Mallika Srivastava
Day 2 | Feature extraction and selection | Knowledge of extracting relevant features from email data | Mallika Srivastava
Day 3 | Model selection and evaluation | Understanding how to select and evaluate models for email spam detection | Mallika Srivastava
Day 4 | Hyperparameter tuning and performance optimization | Learning to tune hyperparameters and optimize model performance | Mallika Srivastava
Day 5 | Fine-tuning the model and finalizing the email spam detection model | Applying final adjustments and optimizations to the model | Mallika Srivastava
WEEK - 8
Day & Date | Brief Description of Daily Activity | Learning Outcome | Person In-Charge
Day 1 | Designing the user interface for the email spam detection app | Understanding user interface design principles | Mallika Srivastava
Day 2 | Developing the frontend components | Familiarity with frontend development tools and frameworks | Mallika Srivastava
Day 3 | Integrating the frontend with the backend | Knowledge of integrating frontend and backend components | Mallika Srivastava
Day 4 | Testing and debugging the application | Learning to identify and fix bugs in the application | Mallika Srivastava
Day 5 | Deploying the application on GitHub Pages | Understanding the deployment process and hosting the app | Mallika Srivastava
Annexure 2
Weekly Progress Report
Week No: ______
(1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16)
Week(s) | Summary of Weekly Activity
Week 1 | Introduction to Python basics, covering variables, data types, control flow, and functions.
Week 2 | Delving into Python functions and modules, including parameters, return values, and custom module creation.
Week 3 | Exploring Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn for data manipulation and visualization.
Week 6 | Learning various machine learning algorithms like linear regression, logistic regression, decision trees, and ensemble methods.
Week 8 | Developing the user interface for the email spam detection app and deploying it on GitHub Pages after integrating frontend and backend components.