Final CPP Project
PROJECT REPORT
ON
“EMAIL SPAM DETECTION USING MACHINE LEARNING
ALGORITHMS”
in partial fulfillment of the Diploma in Information Technology of
Maharashtra State Board of Technical Education, Pune, during the academic year 2022-23
By
Ayush Kunwar (2007110056)
Sourav Kamble (2007110267)
Guided by
(Ms. B.D.Kamble)
HADAPSAR, PUNE-411028.
JSPM’s
HADAPSAR PUNE-28
CERTIFICATE
This is to certify that Mr. Kunwar Ayush Arun of Jayawantrao Sawant Polytechnic,
having Enrollment No. 2007110056, has completed the Capstone Project Execution Report
titled “Email Spam Detection Using Machine Learning Algorithms” in a group
consisting of two candidates under the guidance of the faculty guide.
Mr. M.B.Shinde
(Head of Department)
(External Examiner)
Dr. Deokar S. M.
(Principal)
JSPM’s
HADAPSAR PUNE-28
CERTIFICATE
This is to certify that Mr. Kamble Sourav Ram of Jayawantrao Sawant Polytechnic,
having Enrollment No. 2007110267, has completed the Capstone Project Execution Report
titled “Email Spam Detection Using Machine Learning Algorithms” in a group
consisting of two candidates under the guidance of the faculty guide.
Mr. M.B.Shinde
(Head of Department)
(External Examiner)
Dr. Deokar S. M.
(Principal)
ACKNOWLEDGEMENT
It is a matter of great pleasure for me to submit this seminar report on Email Spam Detection
Using Machine Learning as a part of the curriculum for the award of the Maharashtra State Board
of Technical Education's Diploma in Information Technology from JSPM’s Jayawantrao Sawant
Polytechnic.
Firstly, I would like to express my gratitude to my guide, Ms. B.D.Kamble, for her inspiration,
adroit guidance, constant supervision, direction and discussion in the successful completion of
this seminar.
I am thankful to Ms. B.D.Kamble, Project Coordinator, for guiding and helping me right from
the beginning, and also to the Head of Department, Mr. M.B.Shinde, for his valuable support and
guidance.
I also extend my thanks to all my colleagues who have helped me directly or indirectly
in the completion of this seminar. Last but not least, I am thankful to my parents, who have
inspired me with their blessings.
SUBMITTED BY-
INDEX
List of Figures
ABSTRACT
Email is widely used as a means of contact for personal and professional purposes. Information
shared through email, such as banking information, credit reports, and login details, is often
sensitive and confidential. This makes email attractive to cyber criminals, who can exploit the
data for malicious purposes. Phishing is a technique that fraudsters use to acquire confidential
data from individuals by claiming to be a trusted source. In a phishing email, the sender
persuades the recipient to provide personal information under false pretenses. Phishing website
detection is an intelligent and efficient model based on data mining algorithms for
classification and association. These algorithms are used to identify and characterize the rules
and factors that mark a phishing website and the relationships among them, and they are compared
by efficiency, accuracy, number of generated rules, and speed. The proposed system integrates
both classification and association algorithms, which makes it more effective and faster than
the current system. Using these two algorithms with several protocols reduces the error rate of
the current system by 30%. While no system can detect every phishing website, these methods
offer a more effective way to detect them.
Chapter 01
Introduction
Phishing is a lucrative type of fraud in which the criminal deceives recipients and obtains
confidential information from them under false pretenses. Phishing emails may direct users
to click on a link to a website or open an attachment, where they are asked to provide
confidential information such as passwords and credit card details. The phisher sends the
messages to thousands of users, and although usually only a small percentage of recipients fall
into the trap, this can result in high profits for the sender. In the mid-1990s, hackers in
America used emails as “baits” to steal the usernames and passwords of America Online
accounts. Since then, phishing techniques have evolved, making fraudulent emails harder to
identify.
Email (electronic mail) spam refers to the use of email to send unsolicited or advertising
messages to a group of recipients. Unsolicited means that the recipient has not granted
permission to receive those emails. The use of spam email has been increasing over the last
decade, and spam has become a major nuisance on the internet. Spam wastes storage and time and
slows message delivery. Automatic email filtering may be the most effective method of detecting
spam, but nowadays spammers can easily bypass many spam filtering applications. Several years
ago, most spam could be blocked manually because it came from certain email addresses.
Chapter 02
Literature Survey
Several related works apply machine learning methods to email spam detection.
A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti and M. Alazab [ii] describe a
focused literature survey of Artificial Intelligence (AI) and machine learning
methods for email spam detection. Related work also includes K. Agarwal and T. Kumar [3].
Harisinghaney et al. (2014) [4] and Mohamad & Selamat (2015) [v] have used image and textual
datasets for email spam detection with various methods. Harisinghaney et al. (2014) [IV]
applied the KNN algorithm, Naïve Bayes, and the Reverse DBSCAN algorithm in
experiments on a dataset. For text recognition, an OCR library [iii] is employed, but this
OCR does not perform well.
Mohamad & Selamat (2015) [v] use a hybrid feature selection approach combining TF-IDF
(Term Frequency-Inverse Document Frequency) and rough set theory.
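To make the TF-IDF weighting mentioned above concrete, the following is a minimal sketch in plain Python. It assumes the common textbook form, term frequency multiplied by log(N / document frequency); the cited work may use a different variant, and the function name `tf_idf` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # term frequency * inverse document frequency
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, invented for illustration only.
docs = [
    "win a free prize now".split(),
    "meeting moved to friday".split(),
    "free entry win cash".split(),
]
scores = tf_idf(docs)
for tokens, row in zip(docs, scores):
    print(tokens, {term: round(w, 3) for term, w in row.items()})
```

A term that appears in fewer documents, such as "prize" here, receives a higher weight than one that appears in several, such as "free".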
Chapter 03
Requirement Analysis
Hardware Requirements
Processor: Intel Pentium 4 or above
Memory: 2 GB or above
Other peripheral: Printer
Hard Disk: 500 GB
Chapter 04
System Architecture
First, the system collects data from the Internet, such as synthetic and real-time spam email
data, and applies k-fold cross-validation. Pre-processing is applied in both the training and
testing phases, followed by feature extraction and selection. The system is then trained with
different machine learning algorithms to generate training rules. All test data is classified
as normal or spam based on the weight achieved for each test sample. Finally, the accuracy
of the entire system is evaluated using confusion matrices.
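The cross-validation step in this architecture can be sketched as follows. This is a minimal illustration in plain Python, not the project's actual code: the function name `k_fold_indices` and the sample count are hypothetical, and in practice the samples would usually be shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical example: 10 emails split into 3 folds.
folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each email is used for testing exactly once and for training in the remaining folds, so the reported accuracy does not depend on a single train/test split.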
Chapter 05
Result and Output
Our model has been trained using multiple classifiers so that their results can be checked and
compared for greater accuracy. Each classifier returns its evaluated result to the user. After
all the classifiers have returned their results, the user can compare them to see
whether the data is “spam” or “ham”. Each classifier's result is shown in graphs and
tables for better understanding. The training dataset, named “spam.csv”, is obtained from the
Kaggle website. To test the trained model, a separate CSV file named “emails.csv” is prepared
with unseen data, i.e. data that is not used for training.
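The comparison of classifier results described in this chapter can be sketched as below. The labels, predictions, and classifier names are invented for illustration and are not the project's results; the sketch only shows how a confusion matrix and accuracy could be computed for each classifier on the same unseen data.

```python
def confusion_matrix(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def accuracy(cm):
    """Fraction of samples that were classified correctly."""
    total = cm["tp"] + cm["fp"] + cm["tn"] + cm["fn"]
    return (cm["tp"] + cm["tn"]) / total if total else 0.0


# Made-up ground truth and predictions, for illustration only.
y_true = ["spam", "ham", "ham", "spam", "ham", "spam"]
predictions = {
    "classifier_a": ["spam", "ham", "spam", "spam", "ham", "ham"],
    "classifier_b": ["spam", "ham", "ham", "spam", "ham", "ham"],
}

results = {name: confusion_matrix(y_true, y_pred)
           for name, y_pred in predictions.items()}
for name, cm in results.items():
    print(f"{name}: {cm} accuracy={accuracy(cm):.2f}")
```

Keeping the four counts, rather than accuracy alone, also shows how many legitimate ("ham") messages each classifier wrongly flags as spam.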
Graphs
Chapter 06
Methodology
Data preprocessing:
Data is often thought of as very large data sets with a large number of rows and
columns, but this is not always the case. Data can come in many forms, such as images,
audio and video files, and structured tables. A machine does not understand images, video,
or text data as it is; it only understands 1s and 0s.
Steps in Data Preprocessing:
Data cleaning: In this step, missing values are filled in, noisy data is smoothed, outliers
are identified or removed, and inconsistencies are resolved.
Data integration: In this step, several databases, information files, or data sets are
combined.
Data transformation: Aggregation and normalization are performed to scale the data to a
specific range.
Data reduction: This step obtains a reduced representation of the dataset that is much
smaller in size yet produces the same analytical result.
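A minimal sketch of the cleaning and transformation steps for email text is given below. The stop-word list, the sample messages, and the helper names `clean_text` and `min_max_scale` are assumptions made for illustration; the report does not specify the exact preprocessing used.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "to", "is", "and", "of", "you", "your"}


def clean_text(text):
    """Lowercase, strip URLs and non-letters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def min_max_scale(values):
    """Normalize numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up raw messages; None stands for a missing value.
raw = ["WIN a FREE prize!!! Visit http://example.com now",
       None,
       "Meeting moved to 3pm"]

# Data cleaning: fill missing values with an empty string.
messages = [m if m is not None else "" for m in raw]
tokens = [clean_text(m) for m in messages]

# Data transformation: scale a simple numeric feature (token count).
lengths = min_max_scale([len(t) for t in tokens])
print(tokens)
print(lengths)
```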
Classic classifiers
Classification is a form of data analysis that extracts models describing important
data classes. A classifier, or model, is constructed to predict class labels, for
example labelling a loan application as risky or safe.
Data classification is a two-step process: a learning step, in which the classification
model is constructed, and a classification step, in which the model is used to predict
class labels for given data.
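The two steps can be illustrated with a small multinomial Naïve Bayes classifier written in plain Python. This is a sketch under stated assumptions, not the project's implementation: the class name `NaiveBayes`, the Laplace (add-one) smoothing, and the four training messages are chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        # Learning step: count classes and the words seen in each class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        # Classification step: pick the class with the highest log-probability.
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / n_docs)  # log prior
            total = sum(self.word_counts[label].values())
            for tok in tokens:
                numerator = self.word_counts[label][tok] + 1
                score += math.log(numerator / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, for illustration only.
train_docs = [
    "win free prize now".split(),
    "free cash win".split(),
    "meeting at noon".split(),
    "lunch at noon tomorrow".split(),
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("free prize".split()))
print(model.predict("meeting tomorrow".split()))
```

Here `fit` corresponds to the learning step and `predict` to the classification step.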
Chapter 07
Future Scope
This work proposes a model for improving the recognition of malicious spam in email. Our model
will employ a novel dataset for the feature selection process and then validate
the set of chosen features using three classifiers identified in spam detection using deep
learning. Feature selection is expected to improve both training time and accuracy for the
classifiers.
Chapter 08
Conclusion
Ensemble methods, on the other hand, have proven useful because they use multiple classifiers
for class prediction. Nowadays a very large number of emails are sent and received, which makes
the task difficult, as our project can only test emails against a limited corpus. Our spam
detector filters mail according to the content of the email and not according to domain names
or any other criteria. At present, therefore, it examines only the body of the email. There is
wide scope for improvement in our project. The following improvements can be made:
Filtering of spam can be done on the basis of trusted and verified domain names.
Spam email classification is very significant in categorizing emails and in distinguishing
emails that are spam from those that are not.
This method can be used by large organizations to single out legitimate emails, which are the
only emails they wish to receive.
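The first proposed improvement, filtering by trusted domain names, could look roughly like the sketch below. The allowlist contents and the helper names `sender_domain` and `is_trusted` are hypothetical. The check only reads the sender address and does not verify that the address is genuine, so it would need to be combined with the content-based classifier.

```python
from email.utils import parseaddr

# Hypothetical allowlist of trusted domains, for illustration only.
TRUSTED_DOMAINS = {"example.edu", "example.org"}


def sender_domain(from_header):
    """Extract the lowercase domain from a From header value."""
    _, address = parseaddr(from_header)
    if "@" not in address:
        return ""
    return address.rpartition("@")[2].lower()


def is_trusted(from_header):
    """Return True when the sender's domain is on the allowlist."""
    return sender_domain(from_header) in TRUSTED_DOMAINS


print(is_trusted("Alice <alice@Example.edu>"))
print(is_trusted("promo@unknown-offers.example"))
```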
Chapter 09
References
[1] Yaseen, Yaseen Khather, Alaa Khudhair Abbas, and Ahmed M. Sana. "Image spam
detection using machine learning and natural language processing." Journal of Southwest
Jiaotong University 55.2 (2020).
[2] Mohammed, Mazin Abed, et al. "An anti-spam detection model for emails of multi-
natural language." Journal of Southwest Jiaotong University 54.3 (2019).
[3] Gibson, Simran, et al. "Detecting spam email with machine learning optimized with bio-
inspired metaheuristic algorithms." IEEE Access 8 (2020): 187914-187932.
[4] Nandhini, S., and Jeen Marseline KS. "Performance evaluation of machine learning
algorithms for email spam detection." 2020 International Conference on Emerging Trends
in Information Technology and Engineering (ic-ETITE). IEEE, 2020.
[5] Govil, Nikhil, et al. "A machine learning based spam detection mechanism." 2020
Fourth International Conference on Computing Methodologies and Communication
(ICCMC). IEEE, 2020.
[6] Chandra, J. Vijaya, Narasimham Challa, and Sai Kiran Pasupuletti. "Machine learning
framework to analyze against spear phishing." Int. J. Innov. Technol. Exploring Eng.
(IJITEE) 8 (2019): 12.
[7] Bibi, Asma, et al. "Spam mail scanning using machine learning algorithm." J. Comput.
15.2