Email Spam Detection Edited
MACHINE LEARNING
Submitted by
IN
COMPUTER SCIENCE AND ENGINEERING
SIGNATURE
Dr. J. Hemalatha, M.E., Ph.D.,
HEAD OF THE DEPARTMENT
Professor & HOD
Computer Science & Engineering
AAA College of Engg. & Tech.,
Sivakasi - 626 005
Virudhunagar District

SIGNATURE
Mrs. S. Rajeswari, M.Tech.,
SUPERVISOR
Assistant Professor
Computer Science & Engineering
AAA College of Engg. & Tech.,
Sivakasi - 626 005
Virudhunagar District
TABLE OF CONTENTS
ABSTRACT
LIST OF ABBREVIATIONS
1. INTRODUCTION
1.1 Background
1.2 Objectives
1.3 Target Audience
2. PROJECT OVERVIEW
2.1 Purpose
2.2 Scope
2.3 Benefits
3. SYSTEM ARCHITECTURE
3.1 Components
3.2 Data Flow
4. SYSTEM DESIGN
4.1 E-Mail spam detection
4.2 Steps to implement
4.3 Pseudo-code
5. IMPLEMENTATION DETAILS
5.1 Languages used
5.2 Key functions used
5.3 Libraries used
6. DEPENDENCIES AND SETUP
6.1 Hardware requirements
6.2 Software requirements
7. TESTING AND VALIDATION
7.1 Test environment
7.2 Test cases
8. LIMITATIONS AND FUTURE ENHANCEMENTS
8.1 Limitations
8.2 Future enhancements
9. CODE IMPLEMENTATION
10. RESULTS
11. CONCLUSION
11.1 Conclusion
LIST OF ABBREVIATIONS
Ms - Milliseconds
PQ - Priority Queue
inf - Infinity
dist - Distance
CHAPTER 1
INTRODUCTION
1.1 BACKGROUND
1.2 OBJECTIVES
The main objective of this project is to build a machine learning
model that can accurately detect spam emails using natural language
processing techniques. This includes training a classifier on labeled email data,
extracting relevant features using TF-IDF, and deploying a system that can
predict whether a given email is spam or not.
CHAPTER 2
PROJECT OVERVIEW
2.1 PURPOSE
2.2 SCOPE
2.3 BENEFITS
This system automates spam filtering, reduces human effort,
and improves email efficiency. It provides a learning opportunity to
understand text classification, enhances user privacy by flagging
potentially harmful content, and contributes to a safer digital
environment.
CHAPTER 3
SYSTEM ARCHITECTURE
3.1 COMPONENTS
3.2 DATA FLOW
The data flow begins with the user providing email text. The
input is transformed into numerical features using TF-IDF vectorization.
These features are then passed to a trained Naive Bayes model, which
outputs a prediction indicating whether the email is spam or not.
This process repeats in a loop until the user exits.
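The flow described above (text in, TF-IDF features, Naive Bayes prediction out) can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the project's actual code: the four training sentences, the `spam_pipeline` name, and the `classify` helper are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (assumed for this sketch): 1 = spam, 0 = not spam
train_texts = [
    "Win a free prize now, click here",
    "Free cash offer, claim your prize",
    "Lunch tomorrow at noon?",
    "Please send the report by evening",
]
train_labels = [1, 1, 0, 0]

# Stage 1 (TF-IDF vectorization) and stage 2 (Naive Bayes) chained into one object
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    MultinomialNB(),
)
spam_pipeline.fit(train_texts, train_labels)


def classify(email_text):
    """Run one email through the whole flow and return a readable verdict."""
    prediction = spam_pipeline.predict([email_text])[0]
    return "Spam" if prediction == 1 else "Not Spam"


print(classify("Claim your free prize"))
```

Chaining the two stages means raw text can be passed straight to `predict`, so the vectorizer fitted during training is always the one applied to new input.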
CHAPTER 4
SYSTEM DESIGN
4.2 STEPS TO IMPLEMENT
1. Collect and prepare email dataset with labeled spam and non-spam
messages.
2. Pre-process the email text (remove stop words, convert to
lowercase, etc.).
3. Convert text data into numerical format using TF-IDF
vectorization.
4. Split the dataset into training and testing sets.
5. Train a Naive Bayes classifier using the training data.
6. Evaluate the model using the test data.
7. Accept user input and classify the email as spam or not.
8. Repeat the input classification process until the user chooses to
exit.
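Steps 1 to 6 above might look roughly like the sketch below. The eight sentences are a hypothetical toy dataset, and the variable names are illustrative. One deliberate difference from the listed order: the split (step 4) is done before vectorization (step 3), so that the held-out emails do not influence the TF-IDF vocabulary or weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Step 1: a toy labeled dataset (hypothetical; 1 = spam, 0 = not spam)
texts = [
    "Win a free gift card, click here to claim",
    "Earn money from home, limited offer",
    "You have been selected for a free cruise",
    "Act now for a free trial subscription",
    "Are we still on for lunch tomorrow?",
    "Meeting rescheduled to 3 PM, please confirm",
    "Can you send the final report by evening?",
    "Looking forward to the seminar next week",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 4: hold out 25% of the emails, keeping both classes in each split
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Steps 2-3: lowercase, drop English stop words, and build TF-IDF features
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary on training text only
X_test = vectorizer.transform(test_texts)        # reuse that vocabulary for test text

# Step 5: train the Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 6: evaluate on the held-out emails
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {accuracy:.2f}")
```

With only two held-out emails the accuracy figure is not meaningful in itself; the sketch only shows where the evaluation step fits.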
4.3 PSEUDO-CODE
function spam_detector():
    1. Load and pre-process email dataset
    2. Apply TF-IDF vectorization
    3. Train Naive Bayes classifier on training data
    4. Loop until user exits:
        a. Display sample emails
        b. Take user input (email text)
        c. Pre-process and vectorize user input
        d. Predict using trained model
        e. Output: Spam or Not Spam
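One way to make the pseudo-code easier to test is to separate training (steps 1 to 3) from the classification loop (step 4) and feed the loop a list of lines instead of live keyboard input. The sketch below assumes that design; `train`, `run_session`, and the toy sentences are hypothetical names and data introduced only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def train(texts, labels):
    """Steps 1-3: vectorize the labeled texts and fit the classifier."""
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
    return vectorizer, model


def run_session(lines, vectorizer, model):
    """Step 4: classify each line until the sentinel 'exit' is seen."""
    verdicts = []
    for line in lines:
        if line.strip().lower() == "exit":
            break
        label = model.predict(vectorizer.transform([line]))[0]
        verdicts.append("Spam" if label == 1 else "Not Spam")
    return verdicts


# Hypothetical toy data for illustration only (1 = spam, 0 = not spam)
texts = [
    "Free prize, click now",
    "Claim your free cash prize",
    "Are we still on for lunch?",
    "Send me the final report",
]
vectorizer, model = train(texts, [1, 1, 0, 0])

# A scripted session stands in for interactive input()
session = ["free prize inside", "lunch report", "exit", "never reached"]
print(run_session(session, vectorizer, model))
```

In an interactive program the same function could be driven by `iter(input, "")` or a similar source of lines; the loop logic itself would not change.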
CHAPTER 5
IMPLEMENTATION DETAILS
5.3 LIBRARIES USED
The primary libraries used in the project are pandas for data
manipulation, scikit-learn (imported as sklearn) for building and
evaluating the machine learning model, and re for the regular expressions
used in pre-processing. The nltk library is also used for natural language
text processing, such as removing stop words.
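As an illustration of the pre-processing role that re and stop-word removal play, here is a minimal sketch. It is an assumption about what such a step could look like, not the project's own function: the `preprocess` name is hypothetical, and scikit-learn's built-in `ENGLISH_STOP_WORDS` set is swapped in for the nltk stop-word list so that the example runs without downloading a corpus.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def preprocess(text):
    """Lowercase the text, strip non-letters, and drop English stop words."""
    text = text.lower()
    # Replace digits, punctuation, and symbols with spaces
    text = re.sub(r"[^a-z\s]", " ", text)
    # Keep only tokens that are not in the stop-word set (stand-in for nltk's list)
    tokens = [tok for tok in text.split() if tok not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)


print(preprocess("The PRIZE is waiting!!! 100%"))
```

A function like this could be applied to every email before vectorization, or passed to `TfidfVectorizer` through its `preprocessor` argument.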
CHAPTER 6
DEPENDENCIES AND SETUP
CHAPTER 7
TESTING AND VALIDATION
CHAPTER 8
LIMITATIONS AND FUTURE ENHANCEMENTS
8.1 LIMITATIONS
The project is limited by the size and quality of the dataset used. A
small or imbalanced dataset can lead to biased predictions. The model
may also fail to capture complex linguistic nuances, sarcasm, or evolving
spam patterns. In addition, the project does not handle attachments or
HTML-based emails.
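The imbalance concern can be made concrete with a small check. The sketch below uses a hypothetical 8-to-2 label distribution and placeholder texts; it counts the classes and uses the `stratify` argument of `train_test_split` so that the minority class appears in both splits. Stratifying does not fix a biased dataset, it only keeps the class ratio consistent between training and test data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 8 legitimate emails (0) and 2 spam emails (1)
labels = [0] * 8 + [1] * 2
texts = [f"placeholder email {i}" for i in range(len(labels))]

print("Full dataset:", Counter(labels))

# stratify=labels preserves the 4:1 class ratio in both halves
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

print("Train:", Counter(y_train))
print("Test: ", Counter(y_test))
```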
CHAPTER 9
CODE IMPLEMENTATION
# Python modules to implement spam email detection
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample email dataset
emails = [
    "Congratulations! You've won a $1,000 Walmart gift card. Click here to "
    "claim now.",
    "Hi, just checking if we're still on for lunch tomorrow.",
    "Earn money from home. Work just 2 hours a day and make "
    "$5,000/month!",
    "Dear friend, I need your urgent help to transfer $10 million to your "
    "account.",
    "Meeting rescheduled to 3 PM tomorrow. Please confirm your availability.",
    "This is not spam. Just wanted to share a funny video with you.",
    "You've been selected for a free cruise to the Bahamas!",
    "Can you please send the final report by evening?",
    "Act now! Limited-time offer for free trial subscription.",
    "Looking forward to seeing you at the seminar next week.",
]
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # 1 = Spam, 0 = Not Spam

# Convert the emails to TF-IDF features, ignoring English stop words
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(emails)
y = labels

# Hold out 20% of the samples; only the training split is used to fit the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train the model
model = MultinomialNB()
model.fit(X_train, y_train)

# Start loop: keep classifying user input until the user types 'exit'
while True:
    print("\n----- Sample Emails -----")
    for i, email in enumerate(emails, start=1):
        print(f"{i}. {email}\n")
    user_input = input("Type or paste an email to check if it's spam "
                       "(or type 'exit' to quit):\n")
    if user_input.strip().lower() == 'exit':
        print("Exiting the program. Goodbye!")
        break
    # Vectorize the input with the already-fitted vectorizer, then predict
    input_vector = vectorizer.transform([user_input])
    prediction = model.predict(input_vector)
    print("\nYou entered:")
    print(user_input)
    print("\nPrediction:")
    print("Spam Email" if prediction[0] == 1 else "Not Spam")
CHAPTER 10
RESULTS
Figure 10.1 Output for E-mail Spam detection using Machine Learning
Figure 10.2 Output for E-mail Spam detection using Machine Learning
CHAPTER 11
CONCLUSION
11.1 CONCLUSION