0% found this document useful (0 votes)
13 views30 pages

Email Spam Detection Edited

This project report details the development of an email spam detection system using machine learning, specifically a Naive Bayes classifier trained on labeled email data. The system employs TF-IDF vectorization for feature extraction and aims to automate spam filtering to enhance user experience in email communication. The project demonstrates effective spam classification and highlights potential future enhancements, such as integrating with real-time email clients and supporting multimedia content.

Uploaded by

23urcs107
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views30 pages

Email Spam Detection Edited

This project report details the development of an email spam detection system using machine learning, specifically a Naive Bayes classifier trained on labeled email data. The system employs TF-IDF vectorization for feature extraction and aims to automate spam filtering to enhance user experience in email communication. The project demonstrates effective spam classification and highlights potential future enhancements, such as integrating with real-time email clients and supporting multimedia content.

Uploaded by

23urcs107
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 30

E-MAIL SPAM DETECTION USING

MACHINE LEARNING

A MINI PROJECT REPORT

Submitted by

SUDHARSAN VIJAY N (REG.NO:953723104104)


SELVAPANDI G (REG.NO:953723104097)
SANJAY P (REG.NO:953723104091)

In partial fulfillment for the award of the degree


of
BACHELOR OF ENGINEERING

IN
COMPUTER SCIENCE AND ENGINEERING

AAA COLLEGE OF ENGINEERING AND TECHNOLOGY


SIVAKASI

ANNA UNIVERSITY:: CHENNAI 600 025


MAY 2025
BONAFIDE CERTIFICATE

Certified that this project report “E-MAIL SPAM


DETECTION USING MACHINE LEARNING” is the Bonafide
work of “SUDHARSAN VIJAY N (Reg.No:953723104104),
SELVAPANDI G (Reg.No:953723104097), SANJAY P
(953723104091)” who carried out the project work under my
supervision.Certified further that to the best of my knowledge
the work reported herein does not form part of any other thesis or
dissertation on the basis of which a degree or award was conferred
on an earlier occasion on this or any other candidate.

SIGNATURE SIGNATURE
Dr. J. Hemalatha, M.E., Ph.D., Mrs. S. Rajeswari, M.Tech.,
HEAD OF THE DEPARTMENT SUPERVISIOR
Professor & HOD Assistant Professor
Computer Science & Engineering Computer Science & Engineering
AAA College of Engg. & Tech., AAA College of Engg. & Tech.,
Sivakasi - 626 005 Sivakasi - 626 005
Virudhunagar District Virudhunagar District

Submitted for the project viva-voce examination to be held on

INTERNAL EXAMINAR EXTERNAL EXAMINAR


ii
ABSTRACT

This project presents an intelligent email spam detection system using


machine learning techniques. With the increasing volume of email
communication, filtering unwanted and potentially harmful spam
emails has become a critical need. The system is built using a
supervised learning approach, where a Naive Bayes classifier is trained
on a labeled dataset of emails categorized as spam or non-spam. Text
preprocessing and feature extraction are performed using the TF-IDF
vectorizer to convert the raw email content into numerical form. The
model is then evaluated for its accuracy and performance in detecting
spam messages. This project demonstrates the practical use of machine
learning in natural language processing to enhance cyber security and
improve user experience in email communication.

iii
TABLE OF CONTENT

CHAPETER.NO TITLE PAGE NO

ABSTRACT iii
LIST OF ABBREVIATIONS v
1. INTRODUCTION
1.1 Background 1
1.2 Objectives 1
1.3 Target Audience 1
2. PROJECT OVERVIEW
2.1 Purpose 2
2.2 Scope 2
2.3 Benefits 2
3. SYSTEM ARCHITECTURE
3.1 Components 3
3.2 Data Flow 3
4. SYSTEM DESIGN
4.1 E-Mail spam detection 4
4.2 Steps to implement 4
4.3 Pseudo-code 4
5. IMPLEMENTATION DETAILS
5.1 Languages used 5
5.2 Key functions used 5
5.3 Libraries used 5
6. DEPENDENCIES AND SETUP
6.1 Hardware requirements 6
6.2 Software requirements 6
7. TESTING AND VALIDATION
7.1 Test environment 7
7.2 Test cases 7
8. LIMITATIONS AND FUTURE ENHANCEMENTS
8.1 Limitations 8
8.2 Future enhancements 8
9. CODE IMPLEMENTATION 9
10. RESULTS 11
11. CONCLUSION
11.1 Conclusion 12

iv
LIST OF ABBREVIATIONS

GUI - Graphical User Interface

Ms - Milliseconds

PQ - Priority Queue

CPU - Central Processing Unit

inf - Infinity

dist - Distance

exec_time_ms - Execution Time in Milliseconds

v
CHAPTER 1
INTRODUCTION

1.1 BACKGROUND

Email spam has been a persistent issue affecting individuals and


organizations by cluttering inboxes with irrelevant or harmful messages. With
the rise of digital communication, the need for intelligent systems that can
automatically filter out spam has become essential. Traditional rule-based
spam filters often fail to adapt to evolving spam tactics, making machine
learning a more effective solution.

1.2 OBJECTIVES
The main objective of this project is to build a machine learning
model that can accurately detect spam emails using natural language
processing techniques. This includes training a classifier on labeled email data,
extracting relevant features using TF-IDF, and deploying a system that can
predict whether a given email is spam or not.

1.3 TARGET AUDIENCE


The project is aimed at computer science students, machine learning
enthusiasts, educators, developers, and organizations interested in email
security. It also serves as a practical project for those learning AI concepts and
seeking to apply them to real-world problems.

1
CHAPTER 2

PROJECT OVERVIEW

2.1 PURPOSE

The purpose of this project is to develop a simple yet effective


solution for spam detection that can help users avoid unwanted emails,
save time, and enhance their email experience. It demonstrates the
application of machine learning in cybersecurity and digital
communication.

2.2 SCOPE

The project is limited to classifying email content based on a


sample dataset using supervised learning. It includes data
preprocessing, feature extraction, model training, and prediction.
While the dataset and model are basic, the system can be extended to
handle large-scale email data and be integrated with email platforms.

2.3 BENEFITS
This system automates spam filtering, reduces human effort,
and improves email efficiency. It provides a learning opportunity to
understand text classification, enhances user privacy by flagging
potentially harmful content, and contributes to a safer digital
environment.

2
CHAPTER 3

SYSTEM ARCHITECTURE

3.1 COMPONENETS

The major components include a sample dataset of labeled emails,


a text vectorization method (TF-IDF), a Naive Bayes classification
algorithm, and a user-friendly interface to input and analyze emails. These
components work together to build a functional AI-based spam detector.

3.2 DATA FLOW

The data flow begins with the user providing email text. The
input is transformed into numerical features using TF-IDF vectorization.
These features are then passed into a trained Naive Bayes model, which
processes the input and outputs a prediction—indicating whether the email
is spam or not. This process runs in a loop until the user exits.

3
CHAPTER 4

SYSTEM DESIGN

4.1 E-MAIL SPAM DETECTION

Input: Email text data


Output: Classification result – Spam or Not Spam

4.2 STEPS TO IMPLEMENT

1. Collect and prepare email dataset with labeled spam and non-spam
messages.
2. Pre-process the email text (remove stop words, convert to
lowercase, etc.).
3. Convert text data into numerical format using TF-IDF
vectorization.
4. Split the dataset into training and testing sets.
5. Train a Naive Bayes classifier using the training data.
6. Evaluate the model using the test data.
7. Accept user input and classify the email as spam or not.
8. Repeat the input classification process until the user chooses to
exit.

4.3 PSEUDO-CODE

function spam_detector(email_text):
1. Load and pre-process email dataset
2. Apply TF-IDF vectorization
3. Train Naive Bayes classifier on training data
4. Loop until user exits:
a. Display sample emails
b. Take user input (email text)
c. Preprocess and vectorize user input
d. Predict using trained model
e. Output: Spam or Not Spam
4
CHAPTER 5

IMPLEMENTATION DETAILS

5.1 LANGUAGES USED

The project is developed using Python, a widely-used programming


language in the field of machine learning and data science due to its
simplicity and vast collection of libraries.

5.2 KEY FUCTIONS USED

Important functions used in this project include fit() and predict()


from the machine learning model for training and testing,
TfidfVectorizer() for transforming text data into numerical format, and
custom input/output logic using input() and print() for user interaction.
Additionally, string pre-processing functions like lower(), re.sub(), and
stop words removal play a vital role in cleaning the email text.

5.2 LIBRARIES USED

The primary libraries used in the project are pandas for data
manipulation, sklearn (scikit-learn) for building and evaluating the
machine learning model, and re for regular expressions used in pre-
processing. The nltk library is also used for natural language text
processing such as removing stop words.

5
CHAPTER 6

DEPENDENCIES AND SETUP

6.1 HARDWARE REQUIREMENTS

PROCESSOR: Intel Core i3 or higher


RAM: Minimum 2 GB (4 GB recommended)
STORAGE: At least 100 MB free space
DISPLAY: Standard display (800×600 or higher)

6.2 SOFTWARE REQUIREMENTS

OPERATING SYSTEM: Windows 10/11, Linux, or macOS


PYTHON VERSION: Python 3.6 or higher
TEXT EDITOR/IDE: Any (e.g., IDLE, VS Code, PyCharm, or Notepad)
COMMAND PROMPT: To run the Python script

6
CHAPTER 7
TESTING AND VALIDATION

7.1 TEST ENVIRONMENT

The Email Spam Detection using Machine Learning project was


developed and tested on a system running Windows 11, with Python 3.12 set
up via the Anaconda distribution. The project was executed using the Spyder
IDE, which provided a suitable environment for writing, running, and
debugging the code. The system used for testing had a reliable configuration
with a multi-core processor and at least 8GB RAM, allowing for efficient
processing of text data and machine learning operations. All required Python
libraries such as scikit-learn, pandas, and numpy were installed and managed in
a dedicated virtual environment to ensure compatibility and isolation from
other projects.

7.2 TEST CASES

Multiple test cases were executed to validate the performance and


accuracy of the spam detection model. These included typical spam emails like
"Congratulations! You've won a free iPhone!" and "Get rich quick with this
investment", which were expected to be flagged as spam. Non-spam examples
such as "Meeting rescheduled to 3 PM" or "Happy birthday! Let’s catch up
soon." were tested to ensure they were correctly marked as legitimate.
Additionally, edge cases like empty input, long text with mixed content, and
emails with special characters or numbers were also tested to evaluate how the
model handles unexpected or irregular input. The system demonstrated high
accuracy in correctly classifying most inputs, confirming the reliability of the
model under a variety of real-world conditions.

7
CHAPTER 8

LIMITATIONS AND FUTURE ENHANCEMENTS

8.1 LIMITATIONS

The project is limited by the size and quality of the dataset used. A
small or imbalanced dataset can lead to biased predictions. Also, the model
may fail to understand complex linguistic nuances, sarcasm, or evolving spam
patterns. Another limitation is that the project doesn't handle attachments or
HTML-based emails.

8.2 FUTURE ENHANCEMENTS

Future improvements include using deep learning models like LSTM


or BERT for better accuracy, integrating the system with real-time email
clients, supporting multiple languages, handling multimedia content in emails,
and continuously updating the spam detection model with new training data to
keep up with modern spam techniques.

8
CHAPTER 9

CODE IMPLEMENTATION
#Python modules to implement spam email detection
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Sample email dataset
emails = [
"Congratulations! You've won a $1,000 Walmart gift card. Click here to
claim now.",
"Hi, just checking if we're still on for lunch tomorrow.",
"Earn money from home. Work just 2 hours a day and make
$5,000/month!",
"Dear friend, I need your urgent help to transfer $10 million to your
account.",
"Meeting rescheduled to 3 PM tomorrow. Please confirm your availability.",
"This is not spam. Just wanted to share a funny video with you.",
"You've been selected for a free cruise to the Bahamas!",
"Can you please send the final report by evening?",
"Act now! Limited-time offer for free trial subscription.",
"Looking forward to seeing you at the seminar next week."
]
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0] # 1 = Spam, 0 = Not Spam
# Train the model
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(emails)
y = labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
model = MultinomialNB()

9
model.fit(X_train, y_train)
# Start loop
while True:
print("\n----- Sample Emails -----")
for i, email in enumerate(emails, start=1):
print(f"{i}. {email}\n")
user_input = input("Type or paste an email to check if it's spam (or type
'exit' to quit):\n")
if user_input.strip().lower() == 'exit':
print("Exiting the program. Goodbye!")
break
input_vector = vectorizer.transform([user_input])
prediction = model.predict(input_vector)
print("\nYou entered:")
print(user_input)
print("\nPrediction:")
print("Spam Email" if prediction[0] == 1 else "Not Spam")

10
CHAPTER 10

RESULTS

Figure 10.1 Output for E-mail Spam detection using Machine Learning

Figure 10.2 Output for E-mail Spam detection using Machine Learning

11
CHAPTER 11

CONCLUSION

11.1 CONCLUSION

In conclusion, the Email Spam Detection project


successfully implements a machine learning-based approach to
identify and classify email messages as spam or legitimate. The use
of TF-IDF vectorization for feature extraction, combined with the
Naive Bayes classifier, has proven to be effective for this task. The
system demonstrates good accuracy in distinguishing spam from
non-spam emails based on textual content, offering a reliable tool to
reduce unwanted messages and potential threats. The project
highlights the efficiency of machine learning in automating email
filtering and provides a foundation for further enhancements such as
real-time filtering, integration with mail servers, and support for
multiple languages or attachments.

12

You might also like