Final Report
Final Report
Machine Learning
Final Report
Abstract
The project purpose is to provide users with a secure email spam filter tool using machine
learning models. Four models were implemented in this project. The project has two main
components, the backend mail server which uses a scrape function to store all the data into
the SQL database and then the flask application which is the front-end interface. This is the
interface each user will see when viewing their emails.
P a g e 2 | 27
Email Spam Filter using Machine Learning Final Report
Table of Contents
Abstract ...................................................................................................................................... 2
Introduction ................................................................................................................................ 5
Project description ..................................................................................................................... 6
System components ................................................................................................................... 9
Hardware ................................................................................................................................ 9
Software ............................................................................................................................................. 9
System architecture .................................................................................................................... 9
Web Application User Interface .............................................................................................. 11
Registration .......................................................................................................................... 11
Login .................................................................................................................................... 13
General Usage ...................................................................................................................... 14
Inbox mail ....................................................................................................................................... 15
Sent mail .......................................................................................................................................... 15
Spam mail ........................................................................................................................................ 15
Account information ...................................................................................................................... 16
Spam report ..................................................................................................................................... 16
Registered Users ............................................................................................................................. 17
Learning outcomes ................................................................................................................... 17
Technical achievements ....................................................................................................... 17
HTML and CSS .............................................................................................................................. 17
Bootstrap.......................................................................................................................................... 18
MySQL ............................................................................................................................................ 18
Python .............................................................................................................................................. 18
Flask ................................................................................................................................................. 18
Project Review ......................................................................................................................... 19
Positive aspects/aspects achieved......................................................................................... 19
Aspects not achieved ............................................................................................................ 20
Aspects gone wrong ............................................................................................................. 20
Things I would change ......................................................................................................... 20
Problems encountered .......................................................................................................... 22
Testing...................................................................................................................................... 22
Comparison to Original Design and Specification .................................................................. 24
Future features ......................................................................................................................... 24
Acknowledgments.................................................................................................................... 25
P a g e 3 | 27
Email Spam Filter using Machine Learning Final Report
P a g e 4 | 27
Email Spam Filter using Machine Learning Final Report
Introduction
The final report purpose is to describe in detail the end-product produced of the email spam
filter tool using machine learning algorithms. This document has various sections about the
project.
Section 1: Describes the final product in detail. It covers the tools and technologies used and
how they were integrated together. I have also included a diagram of the system architecture
to allow readers to visualize the project.
Section 2: This section displays each screen of the web application with a short description.
Section 3: The document contains learning outcomes from the project and project review.
Section 4: This section covers in detail the testing which took place for my project. Testing
on both and back-end and front-end was performed.
Section 5: Finally, I have included the future features for my project going forward.
P a g e 5 | 27
Email Spam Filter using Machine Learning Final Report
Project description
The email spam filter was implemented as follows:
I installed hMailServer on my local system for the purpose of creating a local mail server.
The “@test.com” domain was created in the hMailServer along with test accounts to send and
receive mails between accounts.
Thunderbird is an email client software application. I used this to send mails between
accounts I set up on my local server. I logged into each account on the local server using the
following configuration:
P a g e 6 | 27
Email Spam Filter using Machine Learning Final Report
Emails sent via thunderbird are stored in the hMailServer directory on the PC. For these
emails to be stored in the SQL database I had to implement a scrape function. This function
scans through the directories of the local mail server, and for each email which has not
already been stored it transforms the emails data for storage and then calls upon each of
the model’s functions created and inserts the spam classification boolean data for each
model into the database. The four models, Naïve Bayes, SVM, Random Forest and Logistic
Regression were trained using the following dataset: https://www.kaggle.com/uciml/sms-
spam-collection-dataset. I then split the data set into a training set and a testing set. The
training set consist of 4457 entries where 603 are spam and 3854 are ham while the test
dataset consists of 1116 entries where 146 are spam and 971 are ham. I used the scikit library
to train each of the models: Naïve Bayes, SVM, Random Forest and Logistic Regression. The
four models were added to pickle files to allow for quick access to the models. I used the
scikit learn metrics library to obtain the confusion matrix, accuracy, f1, precision & recall and
P a g e 7 | 27
Email Spam Filter using Machine Learning Final Report
sensitivity and specificity scores which helped me decide which model to use with my front-
end web application.
Flask was used to create the web application. This web app houses the functions of my
system. It is an easy-to-use application. When the user successfully registered with the
system they then will need to login to their account. The application will display the user’s
emails once logged in. The emails are classified using classification algorithms. The four
algorithms I implemented are Naive Bayes, SVM, Random Forest and Logistic Regression.
The outcome of the algorithm determines whether the email is spam or not spam. However,
in the end-product I chose Naïve Bayes to classify the emails as spam or ham. I choose Naïve
Bayes over the other three algorithms because the accuracy score was 0.9856, this score is
calculated by the number of items categorized correctly divided by the total number of items.
It is the fraction of the time the classifier is correct. The f1 score helped me in making the
decision. The F1 score is to measure a trade-off between precision and recall. A low F1 score
means the spam filtering is pickier which may lead to fewer real spam emails being marked
as spam and allow it into your inbox. The naïve bayes f1 score for ham is 0.9917 and for
spam is 0.9448 while comparing to Random Forest’s f1 score is 0.9862 for ham and 0.8988
for spam. The figure which made me decide the final decision also was the false negative
score. Naïve Bayes predict only 8% of emails are not spam but they are spam whereas SVM
predicts 13% of emails are not spam but they are spam. This means using SVM to classify
your emails will lead to more spam emails reaching your inbox and increase the risk of
phishing, malspam etc.
If the email is spam it will display in the spam folder for the user otherwise the email will be
in their inbox folder. The administrator is the only user who will have access to the spam
reports and have access to view all the registered users for the web application. The system
only has one administrator account. The flask app queries the SQL database to display the
emails, generate the spam reports and display all other information on the web application.
P a g e 8 | 27
Email Spam Filter using Machine Learning Final Report
System components
Hardware
Software
• Flask application
✓ Created to display all the emails
✓ The systems front-end
System architecture
The following is a high level three tier diagram to show how this project is implemented. It
shows the sender's client sending an email which reaches hMailServer which acts as an
SMTP, MTA and deals with IMAP and POP3 protocols. This information is stored in a
message repository in .eml format. Our web application is hosted on a local web server,
which has access to a local MySQL database. The flask application hosts code to periodically
check for new emails, parse the stored email messages into storable string format and stores it
in the MySQL database. Spam scoring takes place in the flask application and the spam score
is stored in the MySQL database. The front end of the flask application provides the user a
way to log in, when logged in there many pages which separate mails like sent, inbox and
spam.
P a g e 9 | 27
Email Spam Filter using Machine Learning Final Report
P a g e 10 | 27
Email Spam Filter using Machine Learning Final Report
Before you can receive and view your emails you must register with the application. You
must complete the registration form as seen below. You will have to provide a username,
email address and password.
If all details entered are authenticated, a success message will appear to the user as seen
below.
P a g e 11 | 27
Email Spam Filter using Machine Learning Final Report
If the username and/or email address is already in use an error message will display to notify
the user.
P a g e 12 | 27
Email Spam Filter using Machine Learning Final Report
If the password field and the confirm password field do not match an error message will be
displayed to the user.
Login
Once you have registered successfully, you can then login to the application. To login you
will need to provide the email address and password used when registering.
P a g e 13 | 27
Email Spam Filter using Machine Learning Final Report
If your credentials do not match an error message will display for the user.
Once you have logged in successfully you will be greeted with the below screen.
General Usage
Once you have registered and logged into the application you have access to your emails.
You can view your inbox, sent, spam mails and account information. However, if you have
logged into the application as the administrator you have additional features such as view
spam reports and view all the registered users of the application.
P a g e 14 | 27
Email Spam Filter using Machine Learning Final Report
Inbox mail
Sent mail
Spam mail
P a g e 15 | 27
Email Spam Filter using Machine Learning Final Report
Account information
Spam report
P a g e 16 | 27
Email Spam Filter using Machine Learning Final Report
Registered Users
The admin user only has access to this page. This page displayed all the registered users of
the application.
Learning outcomes
From completing this project, I have learnt end-less amounts over the past 6 months. I have
been learning since day one of researching. I took on this project to push myself and put
myself out of my comfort zone. Prior to working on this project, I had zero experience with
python and machine learning. Because of this research for the project took me a lot longer
than originally planned. However, as the research was completed the implementation started
smoothly and at a steady pace. This project allowed me to learn about new tools and
technologies such as hMailServer, machine learning algorithms, flask, and python. Also, this
was my first time creating and implementing an end-to-end web application and
understanding how the database interacts. My supervisor guided me with great examples and
guidance whenever needed throughout the project duration.
As stated above, I had zero experience with python and machine learning when I took on this
project. From extensive research and learning the new tools and technologies was a steep
learning curve than I had imagined.
➢ HTML
I used HTML within the flask application to create a professional and positive look for my
web application. I have a basic knowledge of the HMTL language as it was covered in the
P a g e 17 | 27
Email Spam Filter using Machine Learning Final Report
Web Development module, I studied in 2nd year of my degree. By implemented this language
in greater details than thought in 2nd year I feel will be of very high value to me in future
careers.
➢ Bootstrap
The CSS I implemented was bootstrap. I used bootstrap to design how my web application
would look like to users. I had never used bootstrap in the past this means it was a completely
new framework for me to use. Bootstrap is a popular framework and use quite often which
means it great experience to gain for entering the workforce.
➢ MySQL
Incorporating MySQL with databases was covered at a high level in the Web development
module in 2nd year of my degree. I used MySQL frequently in my project which has allowed
me to further my skills in this area.
➢ Python
➢ Flask
I implemented flask as my front-end. Like python, flask took me a lot longer than planned to
grasp. One of the main reasons I used flask was because it is specifically for python and
python was the core language used for the project. I followed many tutorials online to help
guide me through setting up the flask application. One online tutorial I would highly
recommend is the following by Corey Schafer:
https://www.youtube.com/watch?v=MwZwr5Tvyxo
P a g e 18 | 27
Email Spam Filter using Machine Learning Final Report
Project Review
This section of the document discusses aspects achieved and not achieved, aspects that went
wrong during implementation, thing I would change if starting again and finally any
problems I encountered throughout.
The development of the email spam filter tool was successful. The tool achieved all the
functions initialized at the very start of this project which means users can safely send and
receive emails and any emails containing spam will be placed in a separate folder for the
user. This project has ended at a very good stage. The end-product is presentable and a usable
tool.
The table below is the metric table from the functional specification document. Every
initialize goal was successful.
Criteria Description
Spam In the process of selecting and evaluating the machine learning model the use
detection of confusion matrix to compare the results of the models against a dataset
where we already predict the answer. For some scoring models like logistic
regression which was also tested the 0.5 score is likely a threshold that gives
a classifier with reasonably good accuracy. The spam filter has a specificity
of 0.9448, which means that it marks about 0.5662% of non-spam emails as
spam.
Spam In the process of selecting and evaluating the machine learning model we are
detection seeking a low level of Type I error (false positive), to stop important emails
being classified as spam which are not spam. Because the precision score is
0.997 this means 0.1% of emails maybe wrongly classified.
Spam In the process of selecting and evaluating the machine learning model we are
detection seeking a low level of Type II error (false negatives), to stop spam being
classified as a normal mail. Because the accuracy score is 0.9856 means the
Naïve Bayes models is very accurate at classifying the emails correctly. The
False Negative result from the confusion matrix is 8.
Spam Ensure the Recall (True positive rate) of the machine learning model to
detection ensure the correct proportion of actual positives was identified correctly. The
recall score for the Naïve Bayes model I implemented is 0.99174. I have high
recall and precision is emphasized over recall. This is appropriate for a spam
filter, because it is more important to not lose non-spam email than it is to
P a g e 19 | 27
Email Spam Filter using Machine Learning Final Report
filter every single piece of spam out of our inbox. The true positive rate is
962.
Spam Ensure the Precision of the machine learning model is sufficient. This score
detection should be at least 0.91. The precision score for the Naïve Bayes models I
implemented is average 0.9682. The higher the precision means less emails
are incorrectly classified.
Security The web application is not vulnerable to SQL injection attacks and uses an
object relational mapper.
Security The user’s passwords are stored using a hashing key to encrypt them.
Reliability The system works and does not throw exceptions during standard usage
Due to time constraints, not all functions for the flask application were achieved. I would
implement the function of allowing a user to send mails from the web app, request to change
their password and finally allow the admin to remove accounts if requested by a user.
However, the focus on this project was spam classification, as a result, I am very happy in
what was achieved over the six months.
When testing the spam classification using the mail merge technique the subject lines were
too long which was causing issues with the scrape function and the emails were not being
stored in the database, so I had to delete every email in the hMailServer directory for that user
and reduce the subject line length and re-start the testing.
While coding the web application the jinja templates were picking up python variables when
they were inside the html comments. Also, they were not html this python code was
displaying and as a result hindered me from getting data displayed.
P a g e 20 | 27
Email Spam Filter using Machine Learning Final Report
When I was storing the models to the pickle files, I was not storing the count vectorizer as
well as the models. So, I would initiate a new count vectorizer every time and the models
would not work as the vectors for the words did not match what the model had stored. I
solved this issue by also storing the count vectorizer in a pickle file.
If I were given the option to begin this project again from the start, I would change the
following to improve the project:
I feel if I had spent more time researching flask, I could have produced a more professional
design for my application.
➢ Time management
For my final year of college, managing time was critical as there was extensive amounts of
assignments due throughout the year. However, at some points throughout the year this
project was pushed back due to other modules exams and reports due date approaching and
taking me longer than expected to complete. This meant any lunch breaks or spare time in the
evenings/weekends was spent as wisely as possible on the project. I learnt a large amount of
time should be spent on researching the required tools and technologies needed and how they
work before trying to implement them without any knowledge. Also, research in-dept the best
fitted language to use to implement with the tools and technologies. I feel in the long-term
this strategy is the key to success.
➢ Development inconsistency
I would finish one task at a time if I were starting the project again. At the beginning I started
another task before the task I was on was finished. This became very confusing and more
difficult to track the progress of my project. As some tasks I was working on were backend
tasks such as implementing the machine learning models while also working on the GUI for
the web app development.
P a g e 21 | 27
Email Spam Filter using Machine Learning Final Report
Problems encountered
I thought that the functions for the four models (Naïve Bayes, SVM, Random Forest and
logistic Regression) had scoring functions I could use with them. But in fact, I had to use a
combination of using confusion matrix values and metric functions from the sklearn kit.
Testing
Machine learning models test
After conducting the below tests I have decided to implement the Naïve Bayes machine
learning model on the web application for the following reasons:
- The accuracy score was 0.9856, this score is calculated by the number of items
categorized correctly divided by the total number of items. It is the fraction of the
time the classifier is correct.
- The F1 score is to measure a trade-off between precision and recall. A low F1 score
means the spam filtering is pickier which may lead to fewer real spam emails being
marked as non-spam and allow it into your inbox. The naïve bayes f1 score for ham is
0.9917 and for spam is 0.9448 while comparing to Random Forest F1 score is 0.9862
for ham and 0.8988 for spam.
- The spam filter has a specificity of 0.9448, which means that it marks 0.5662% of
non-spam email as spam.
- The recall score for the Naïve Bayes model I implemented is 0.99174. I have high
recall and precision is emphasized over recall. This is appropriate for a spam filter,
because it is more important to not lose non-spam email than it is to filter every single
piece of spam out of our inbox.
- The precision score for the Naïve Bayes models I implemented is 0.9917. The higher
the precision means less emails are incorrectly classified.
- Naïve Bayes predict only 8% of emails are not spam but they are spam whereas SVM
predicts 13% of emails are not spam but they are spam. This means using SVM to
P a g e 22 | 27
Email Spam Filter using Machine Learning Final Report
classify your emails will lead to more spam emails reaching your inbox and increase
the risk of phishing, malspam etc.
Functionality testing
✓ Spam Report
To test the spam report, a batch of 1000 emails was sent to each registered user inbox. The
emails sent had documented predicted spam classification values. The spam report was then
compared to the predicted values and any discrepancies mitigated.
P a g e 23 | 27
Email Spam Filter using Machine Learning Final Report
Exploratory testing
Using exploratory testing students were asked to use the system, to test functionalities such as
login, logout, view spam, view inbox etc. Any issues found were noted and resolved before
submission.
Future features
Due to time constraints, some features would not have been implemented within the deadline.
Therefore, there are also future features in my project that could be implemented. I would
like to implement a fully functioning web application to allow users to send emails from the
web application instead of using hMailServer and allowing them to deal with their emails for
example, delete an email, forward an email etc. The spam classification was only the tip of
the iceberg with my interest in how it all worked.
I would also like to make a mobile application version for this project as many people use
their mobile to send and check emails.
P a g e 24 | 27
Email Spam Filter using Machine Learning Final Report
Acknowledgments
There are many people I would like to thank and acknowledge for all their kindness and help
throughout the year. Completing my final year in IT Carlow during a pandemic was a
challenge but I would not have achieved this without the help from all my lectures and
classmates. The support from my classmates this year will be certainly unforgettable. I would
also like to thank my close friends Katie Brophy and Sine Dohney. These two girls have been
extremely supportive since I met them at the start of second year and every year since. All the
IT Carlow staff are approachable and love sharing their knowledge with every student. All
my lectures from first year have always guided me and helped me solved any issues I faced
along the way especially Richard Butler, Keara Barrett, and James Egan.
Finally, I would like to thank my project supervisor, James Egan. James supported me from
day one with my project and set me on the right track. He always provided me with guidance,
support, and advice. James answered any questions I had promptly and always had solutions.
Thank you very much James for all your time, effort and help you have put into my project to
get the project completed.
P a g e 25 | 27
Email Spam Filter using Machine Learning Final Report
Plagiarism Declaration
I declare all submitted work is my own work. I have cited using the institutes standards, any
sources of quotations, paraphrases, tables, diagrams, or other material where intellectual
property rights may reside. Bibliography is provided in each document where needed. I
understand it is serious offence if I fail to obey with the Institute’s regulations governing
plagiarism.
P a g e 26 | 27
Email Spam Filter using Machine Learning Final Report
Bibliography
Scikit-learn.org. 2021. sklearn.metrics.recall_score — scikit-learn 0.24.2 documentation.
[online] Available at: <https://scikit-
learn.org/stable/modules/generated/sklearn.metrics.recall_score.html> [Accessed 12 April
2021].
Scikit-learn.org. 2021. sklearn.metrics.precision_score — scikit-learn 0.24.2 documentation.
[online] Available at: <https://scikit-
learn.org/stable/modules/generated/sklearn.metrics.precision_score.html> [Accessed 12 April
2021].
Scikit-learn.org. 2021. sklearn.metrics.f1_score — scikit-learn 0.24.2 documentation.
[online] Available at: <https://scikit-
learn.org/stable/modules/generated/sklearn.metrics.f1_score.html> [Accessed 12 April 2021].
Scikit-learn.org. 2021. sklearn.metrics.confusion_matrix — scikit-learn 0.24.2
documentation. [online] Available at: <https://scikit-
learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html> [Accessed 12
April 2021].
Statinfer | Data Science starts here. 2021. 204.4.2 Calculating Sensitivity and Specificity in
Python | Statinfer. [online] Available at: <https://statinfer.com/204-4-2-calculating-
sensitivity-and-specificity-in-python/> [Accessed 12 April 2021].
P a g e 27 | 27