
Machine Learning Based Classification for Spam Detection

The article discusses the use of machine learning algorithms for spam detection in emails, comparing the effectiveness of Random Forest, Logistic Regression, Naive Bayes, Support Vector Machine, and Artificial Neural Network. A dataset of 5,558 emails was analyzed, with the Random Forest algorithm achieving the highest accuracy of 98.83%. The study highlights the importance of effective spam classification to protect users from unsolicited and potentially harmful emails.



Machine Learning Based Classification for Spam Detection

Article in Sakarya University Journal of Science · April 2024


DOI: 10.16984/saufenbilder.1264476


2 authors:

Serkan Keskin Onur Sevli


Isparta University of Applied Sciences Burdur Mehmet Akif Ersoy University

All content following this page was uploaded by Onur Sevli on 22 April 2024.



Sakarya University Journal of Science
ISSN : 2147-835X Vol. 28, No. 2, 270-282, 2024
Publisher : Sakarya University DOI: https://doi.org/10.16984/saufenbilder.1264476

Research Article

Machine Learning Based Classification for Spam Detection

Serkan Keskin1* , Onur Sevli2

1 Burdur Mehmet Akif Ersoy University, Institute of Science and Technology, Department of Computer Engineering, Burdur, Türkiye, [email protected]
2 Burdur Mehmet Akif Ersoy University, Faculty of Engineering and Architecture, Department of Computer Engineering, Burdur, Türkiye, [email protected]
* Corresponding author

ARTICLE INFO

Keywords: Artificial Intelligence, Email Classification, Machine Learning, Spam Detection

Article History:
Received: 13.03.2023
Accepted: 08.12.2023
Online Available: 22.04.2024

ABSTRACT

Electronic messages, i.e. e-mails, are a communication tool frequently used by individuals or organizations. While e-mail is extremely practical to use, it is necessary to consider its vulnerabilities. Spam e-mails are unsolicited messages created to promote a product or service, often sent frequently. It is very important to classify incoming e-mails in order to protect against malware that can be transmitted via e-mail and to reduce possible unwanted consequences. Spam email classification is the process of identifying and distinguishing spam emails from legitimate emails. This classification can be done through various methods such as keyword filtering, machine learning algorithms and image recognition. The goal of spam email classification is to prevent unwanted and potentially harmful emails from reaching the user's inbox. In this study, Random Forest (RF), Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM) and Artificial Neural Network (ANN) algorithms are used to classify spam emails and the results are compared. Algorithms with different approaches were used to determine the best solution for the problem. 5558 spam and non-spam e-mails were analyzed and the performance of the algorithms was reported in terms of accuracy, precision, sensitivity and F1-Score metrics. The most successful result was obtained with the RF algorithm with an accuracy of 98.83%. In this study, high success was achieved by classifying spam emails with machine learning algorithms. In addition, it has been proved by experimental studies that better results are obtained than similar studies in the literature.

Cite as: S. Keskin, O. Sevli (2024). Machine Learning Based Classification for Spam Detection, Sakarya University Journal of Science, 28(2), 270-282. https://doi.org/10.16984/saufenbilder.1264476

This is an open access paper distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License.

1. Introduction

With the widespread use of the Internet, electronic communication has become increasingly preferred. One of the most important tools of electronic communication is the electronic message, which we call e-mail. Today, individuals and organizations have one or more e-mail accounts. Instant delivery of messages, no cost and ease of use increase the importance and prevalence of e-mail [1]. According to Statista Research Department data, the number of actively used e-mail accounts in 2020 was more than 4 billion. This number is estimated to increase to 4.6 billion in 2025. In 2020, 306 billion e-mails were sent and received every day, and this number is expected to exceed 376 billion in 2025 [2].

The use of e-mail is not only practical but also has various vulnerabilities. It is possible for an e-mail account to be hijacked in various ways, for e-mails containing advertisements to take over a computer by installing software when the advertisement is clicked, and for the installed software to disrupt communication by sometimes filling the
bandwidth. Such unsolicited e-mails are characterized as "spam". Between October 2020 and September 2021, the global daily spam volume peaked in July 2021 with approximately 283 billion spam emails out of a total of 336.41 billion emails. By August 2021, this number had fallen to 65.50 billion. By September, the average spam volume had again increased by 36 percent, reaching 88.88 billion out of a total of 105.67 billion emails sent worldwide [3].

Email providers are expected to stop spam emails before they reach users. Many email providers include mechanisms that attempt to filter spam by comparing the sender address of emails against so-called blacklists of known spammers. However, since spammers frequently change their sender addresses, the success of these programs has not reached the desired level [4]. At this point, a more effective and flexible solution is needed. Generally, spam e-mails contain messages such as "easy money", "adult entertainment", etc. in their headers or content, which can deceive individuals. The process of classifying emails by interpreting messages is based on the keyword detection rule. Keyword detection algorithms compensate for the inadequacy of address-based filtering of spam e-mails. Machine learning techniques, which have recently gained popularity and are used in many different fields, provide alternative solutions for filtering spam e-mails much more successfully.

1.1. Methods used to detect spam emails

Unsolicited emails (spam) are usually fake emails sent for advertising or fraudulent purposes and often contain content that users do not want or are not interested in. Such emails can put users in difficult situations or reduce work efficiency. Therefore, it is important to detect and filter spam emails.

1.1.1. Traditional spam detection systems

Such spam detection systems, which are not based on artificial intelligence, usually use simple algorithms that distinguish spam based on the content of the message, the sender's address or the content of its links. The effectiveness and accuracy of these systems is lower than that of AI-based systems. They are less flexible and adaptive than AI-based systems. The main methods used in traditional spam detection systems are as follows:

• Email authentication: This method is used to verify who the sender of an email is. It verifies the authenticity of the sender using standards such as DomainKeys Identified Mail (DKIM) and Sender Policy Framework (SPF). This makes it possible to detect fake emails or spam emails sent from fake accounts [5].

• List of email addresses: This method enables the detection of spam emails using a predefined list of email addresses. This list may include email addresses with a high probability of spam [6]. This method can be effective in preventing spam emails, but it also involves the risk of false positives, i.e., legitimate email addresses being falsely flagged as spam.

• Content filtering: This method is used to detect spam emails based on the content in the emails. For example, words and phrases related to advertisements, product sales or illegal content can be detected in emails and these emails can be marked as spam. This method can be effective in preventing spam emails, but it also involves the risk of false positives [7].

• Sharing a list of email addresses: This method enables the detection of spam emails by sharing a list of spam email addresses between different users and organizations [7].

1.1.2. Artificial intelligence-based spam detection systems

Artificial intelligence-based spam detection systems are software used to detect spam messages that are common in electronic communication networks. These systems use various artificial intelligence techniques to search for and detect specific characteristics of spam messages. Spam messages are usually marketing messages with a high content of advertisements and promotions. These messages are often sent to many people and are often unsolicited or unnecessary. Sending too many
spam messages wastes the time and effort of email users. Artificial intelligence-based spam detection systems are designed to reduce these problems. These systems examine the content, headers and other features of e-mail messages and classify spam messages according to certain criteria [8].

• Systems based on biological intelligence: Systems based on biological intelligence are artificial intelligence systems that mimic the structure and functioning of the human brain. Such systems have a high degree of adaptive and learning capability, mimicking the learning, remembering and problem-solving abilities of the human brain. In particular, they transmit signals from inputs to outputs using structures called neural networks, which can have learning and adaptive properties much like the human brain [9].

• Machine learning-based systems: Machine learning-based spam systems help to automatically detect spam emails. These systems usually identify spam emails using features such as keywords and phrases found in the content of the emails. They also take into account that spam emails are usually sent regularly and that they fit a certain profile of email addresses and domains used. Spam systems developed using machine learning learn from pre-labeled datasets and discover which features in these datasets are more effective in identifying spam emails [10]. These features may include keywords and phrases in the content of the emails, the sender's email address and domain, the email header, and the format of the email. The learned features are used to detect spam emails, and new incoming emails are evaluated according to these features. The advantage of machine learning-based spam systems is that they have high detection rates as they learn from pre-labeled datasets [11]. Furthermore, these systems can improve themselves through dynamic learning processes and become more accurate classifiers over time. However, the disadvantages of machine learning-based spam systems include lower correct detection rates if the datasets are not large and diverse enough, and the mistaken identification of non-spam emails as spam [12].

2. Literature Review

When we examine the studies in the literature that use artificial intelligence techniques for the detection of spam e-mails, it is seen that e-mail classification is performed with different algorithms. Some of these studies used traditional machine learning algorithms, while others used algorithms inspired by biological systems such as Artificial Neural Networks (ANN).

In a study classifying comments in different languages obtained from social media, an accuracy of 96% was achieved using the Naive Bayes (NB) algorithm [13]. In another study to classify e-mails, a dataset containing 5574 English messages was classified with 95.48% accuracy using the NB algorithm and 97.83% accuracy using the Support Vector Machine (SVM) algorithm [14]. Another study on filtering short messages (SMS) attempted to distinguish unwanted advertisements. The highest scores obtained in the classification process were reported as 98.61% with SVM and 97.55% with NB [15].

In some studies, classification is performed with messages sent via social media. In the result obtained by classifying 1383 tweets, the accuracy rate of RF was 92.95% [16]. The same algorithm is not always the most successful across studies, because different data sets are used. For example, in another spam e-mail detection study, 600 e-mails were classified. As a result of this classification, Naive Bayes achieved a success rate of 95.5% and SVM 93.5% [17]. In another study, 6000 emails were classified and Naive Bayes was 94.6% and SVM 98.5% successful [18]. Another of the algorithms examined is LR. In this study, LR was used to classify incoming emails as ham and spam. Dedekurt et al. presented a new spam approach by combining LR and artificial bee colony [19].
In another study, the ABC-LR algorithm was more successful than the classical LR algorithm [20]. Janez-Martino used the LR algorithm on a spam dataset to evaluate the combination of LR with a bag of words [21]. Apart from this, it has been observed that certain algorithms such as Naive Bayes-based algorithms and SVM have been used more than other machine learning algorithms [22].

It was revealed that the NB algorithm was 96.31% successful in the classification of 310 e-mails using the similar word suggestion feature of the Zemberek library [23]. In a study conducted on a data set of 4327 e-mails with simulated neural networks (SNN), the success rate was found to be 95.82% [24]. In a study with the nearest neighbor (KNN) algorithm, the highest success rate was 97.50% on a dataset of 4601 e-mails taken from the UCI machine learning repository website [25]. In another study on the same data set, the SVM algorithm was 93.07% successful.

In their study, Jain et al. used a data set consisting of 5572 messages labelled as ham and spam. As a result of the classification, they achieved a success rate of 98.79% with the SVM algorithm [26]. On the same data set, Gadde et al. used an LSTM model with TF-IDF and Hashing Vectoriser and achieved a success rate of 98.5% [27]. Reddy and Reddy achieved a 95.32% success rate by using the SVM algorithm on a spam SMS dataset of 5572 messages [28]. In another study, a 98.56% success rate was achieved by using the NB algorithm [29]. In the study conducted by Abayomi et al. on the same data set, a 98.6% success rate was obtained with the BiLSTM model using a deep learning method [30].

3. Material and Method

In this study, classification was carried out on a data set consisting of 5558 samples in order to distinguish spam e-mails. After natural language processing, the results of the classifications performed with 5 different machine learning algorithms (Random Forest, Logistic Regression, Naive Bayes, Support Vector Machine and Artificial Neural Network) are reported in terms of different metrics.

3.1. Data set

In this study, a dataset consisting of 5558 samples and two attributes was used to detect spam e-mails. The first attribute is the English content text of the email message and the second attribute is the target label that indicates whether the email is spam or not. This csv file (spam.csv, 480.13 kB), prepared by Faisal Qureshi, contains 5558 unique instances of ham (87%) and spam (13%) messages [31].

Of the instances in the dataset, 747 are marked as spam and 4811 are marked as non-spam. The graph showing the class distribution rates in the dataset is given in Figure 1.

Figure 1. Class distribution rates (pie chart: Spam 13%, Ham 87%)

When the class distribution rates are analyzed, it is seen that the data set is not balanced. For this reason, cross-validation was applied in classification processes and detailed measurement metrics obtained through confusion matrices are reported.

3.2. Natural language processing (NLP)

Natural Language Processing (NLP) enables computers to communicate and process data using natural language. It is a sub-branch that uses technologies such as artificial intelligence and machine learning and typically works with text and audio data. NLP comprises artificial intelligence technologies that give computers the ability to understand and use natural language. NLP is divided into two main parts: text processing and audio processing.

Text processing works with text data and performs operations such as reading,
understanding and summarizing texts. Voice processing, on the other hand, works with voice data and performs operations such as recognizing voices, generating text from voices and translating texts into voice. In recent years, there has been a rapid development of NLP in areas such as question answering, machine translation and machine reading comprehension. NLP can be divided into three parts: modeling, learning and reasoning [32].

TF-IDF (Term Frequency-Inverse Document Frequency) is a natural language processing technique used to measure word importance in texts. TF-IDF calculates how often a word occurs in a text (Term Frequency, TF) and how rare the texts containing that word are among all texts (Inverse Document Frequency, IDF). The product of these two values indicates the importance of the word. TF-IDF is used to better understand the meaning of texts. It is widely used for measuring word distributions in texts and can be used in applications such as determining the similarity of texts, classifying texts or making connections between texts [33].

Each word in the dataset used in this study is associated with a numerical index value and those that carry spam flags are labeled. During model training, the textual expressions in the dataset were separated word by word and subjected to numerical transformations, yielding a completely numerical dataset. The dataset was classified with 5 different machine learning algorithms. The algorithms were written in the Python programming language in the Spyder environment, and tests were carried out using various libraries. The performance of the algorithms applied to the dataset was evaluated according to precision, sensitivity, accuracy and F1 scores. All algorithms were subjected to 5-fold cross-validation.

3.3. Classification algorithms used

The data set used in the study was classified using 5 different machine learning algorithms: Support Vector Machine, Logistic Regression, Naive Bayes, Random Forest and Artificial Neural Network.

3.3.1. Support vector machine (SVM)

SVM is widely used in many studies because it produces significant accuracy with less computational power. SVM is one of the most popular supervised learning algorithms used to solve regression and classification problems. The goal of the SVM algorithm is to construct the best line or decision boundary that distinctly separates the data points in a multidimensional space [34]. This boundary is called the hyperplane. The SVM selects the extreme points or vectors that help form the hyperplane. These selected points are called support vectors [35]. The SVM algorithm is used in many different fields such as image classification, text classification and face detection.

3.3.2. Logistic regression (LR)

LR, like SVM, is one of the important machine learning algorithms among those that use supervised learning techniques. It is used to predict a categorical dependent variable using a set of independent variables. The result is a discrete or categorical value, such as true or false, 0 or 1. Instead of giving an exact value, LR gives a probabilistic value between 0 and 1. Instead of a straight line, LR fits an "S"-shaped function bounded by the two extreme values. This function curve gives the probability of whether a state exists or not [36]. LR is a highly successful machine learning algorithm that calculates probabilities using discrete and continuous data and classifies newly entered data.

3.3.3. Naive bayes (NB)

NB is the first filtering algorithm used as a probabilistic classifier [37]. The NB algorithm is a supervised learning algorithm for solving classification problems based on Bayes' theorem. It is used for text classification with a high-dimensional training data set. The NB algorithm can make predictions quickly. It makes predictions by calculating the probability of the object. Due to their simplicity and high performance, these approaches are the most widely used in open-source systems proposed for spam filtering [38]. This algorithm is also used in
areas such as article classification and sentiment analysis.

3.3.4. Random forest (RF)

The RF algorithm is a machine learning algorithm created by combining many decision trees. This algorithm can be used for classification and regression problems. The RF algorithm is a combination of many decision tree models, each trained with different subsets of data. Each decision tree makes decisions on specific features and data points using a set of decision tree nodes. Decision trees work by dividing the data into small subsets and classifying the data points in these subsets with a set of decision nodes [39].

This algorithm allows each decision tree to make predictions individually and eventually produces a result by combining all the predictions. This improves accuracy and consistency, giving better results than a single decision tree. A large number of trees in the forest provides higher accuracy [40]. Training time is less compared to other algorithms. It can maintain accuracy even if a certain part of the data is missing. It is generally used in banking, medicine, land use and marketing sectors.

3.3.5. Artificial neural network (ANN)

An Artificial Neural Network (ANN) is a machine learning model that works like the brain. Like a network of nerve cells in the brain, an ANN is made up of many nerve cells (neurons). Neurons are connected and process information by sending signals to each other. The ANN learns by using the connections between neurons and adjusting their weights [41]. Information is transmitted to the network from the input layer. It is then processed in the intermediate layer and sent to the output layer. The information coming into the network is converted into output using the weight values of the network. To produce the correct outputs, the weights must be determined correctly. The process in ANN is to calculate the parameters w (weight) and b (bias) that will give the model the best score [42]. ANN is a method that offers successful solutions to many problems we encounter in daily life such as classification, prediction and modeling.

3.4. Model performance measurement

A confusion matrix was used to express the performance of the classifiers used. The confusion matrix is a table used to evaluate how well the classes are distinguished from each other. It allows us to see how well the algorithm can predict the correct class. The rows of the matrix represent the predicted class and the columns represent the true class [43]. For a binary classification problem where the classes are "positive" and "negative", the general structure of the confusion matrix looks like Figure 2.

Figure 2. Confusion matrix

In machine learning, true positive (TP) refers to the number of instances where the model correctly identifies a positive instance as positive. True negative (TN) refers to the number of instances where the model correctly identifies a negative instance as negative. False positive (FP) refers to the number of instances where the model predicts a positive instance when it is actually negative. False negative (FN) refers to the number of instances where the model predicts a negative instance when it is actually positive.

Different evaluation metrics can be calculated from a confusion matrix. These metrics are useful for understanding the performance of a classification algorithm and comparing the performance of different models. The formulas for deriving these measures from the confusion matrix are given in Table 1.
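The four counts described above and the measures derived from them can be sketched in a few lines of Python. This is a minimal illustration using the standard definitions of accuracy, precision, sensitivity and F1 score; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the four reported measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (e.g. no positive predictions at all).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }


# Hypothetical counts for illustration only, not the study's results.
example = classification_metrics(tp=90, tn=5, fp=3, fn=2)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Because the F1 score ignores true negatives, it is less affected than accuracy when one class dominates the data set, which is why it is useful for unbalanced classes.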


Table 1. Formulation of measurements
Measure      Description                                  Formula
Accuracy     Overall performance of model                 (TP + TN) / (TP + TN + FP + FN)
Precision    How accurate the positive predictions are    TP / (TP + FP)
Sensitivity  Coverage of actual positive sample           TP / (TP + FN)
F1 Score     Hybrid metric useful for unbalanced classes  2TP / (2TP + FP + FN)

Accuracy: The proportion of correct predictions. It is calculated as the sum of true positives and true negatives divided by the total number of predictions.

Precision: The proportion of correct positive predictions. It is calculated by dividing the number of true positives by the total number of true positives and false positives.

Sensitivity: The proportion of true positive cases that are correctly predicted. It is calculated by dividing the number of true positives by the total number of true positives and false negatives.

F1 Score: The harmonic mean of the precision and sensitivity (recall) values. The F1 score takes values between 0 and 1, with higher values indicating better classification performance.

4. Experimental Study and Findings

In the study conducted for spam detection, the dataset consisting of 5558 samples was classified using 5 different machine learning algorithms: SVM, LR, NB, RF and ANN. Before the classification process, the e-mail message texts in the dataset were subjected to natural language processing. The texts were first parsed into sentences and then segmented into words according to the determined delimiters. Word vectors were created and Term Frequency / Inverse Document Frequency was calculated. The mathematically transformed e-mail messages were classified with the specified algorithms using 5-fold cross-validation. The averages of the measurements obtained with each algorithm are reported.

The confusion matrix obtained as a result of the classification process performed with the SVM algorithm is given in Figure 3.

Figure 3. SVM results

In the confusion matrix of the SVM algorithm, it is understood that the model distinguishes between spam and non-spam emails with overall success. The values of the metrics calculated from the confusion matrix of the model are given in Table 2.

Table 2. Calculated metrics for SVM
SVM Metrics  Ratios (%)
Accuracy     98.74
Precision    98.86
Sensitivity  99.89
F1 Score     99.29

In Table 2, the accuracy value showing the overall success of the model is 98.74%. The precision and sensitivity values showing the discrimination of the classes were obtained as 98.86% and 99.89%. The F1 Score value, which expresses the balance of these two values, was obtained as 99.29%.

The confusion matrix obtained as a result of the classification process performed with the LR algorithm is given in Figure 4.
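The preprocessing and validation steps described at the start of this section (word segmentation, TF-IDF weighting and 5-fold cross-validation) can be illustrated with a minimal standard-library sketch. It rests on several assumptions: the weighting uses the textbook form tf * log(N / df), whitespace is the only delimiter, and the folds are contiguous and unstratified. The paper does not state which TF-IDF variant, tokenizer or fold strategy was used, and the names `tf_idf`, `k_fold_indices` and the toy messages are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical toy messages, not drawn from the study's dataset.
messages = ["win easy money now", "meeting at noon", "easy money offer", "lunch at noon"]
vectors = tf_idf(messages)
folds = list(k_fold_indices(10, k=5))
```

In this sketch a word that occurs in fewer messages receives a larger weight than an equally frequent word that occurs in many messages. Each fold serves once as the test set while the remaining folds form the training set, and the per-fold measurements would then be averaged.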


Figure 4. LR results

In the confusion matrix of the LR algorithm, it is seen that the model distinguishes spam and non-spam emails with general success. The values of the metrics calculated from the confusion matrix of the model are given in Table 3.

Table 3. Calculated metrics for LR
LR Metrics   Ratios (%)
Accuracy     97.66
Precision    97.75
Sensitivity  99.89
F1 Score     98.68

In Table 3, the accuracy value showing the overall success of the model is 97.66%. The precision and sensitivity values showing the discrimination of the classes were obtained as 97.75% and 99.89%. The F1 Score value, which expresses the balance of these two values, was obtained as 98.68%.

The confusion matrix obtained as a result of the classification process performed with the NB algorithm is given in Figure 5.

Figure 5. NB results

In the confusion matrix of the NB algorithm, it is understood that the model produces a considerable number of false negatives (FN). This lowers the success of the model. The values of the metrics calculated from the confusion matrix of the model are given in Table 4.

Table 4. Calculated metrics for NB
NB Metrics   Ratios (%)
Accuracy     90.49
Precision    98.16
Sensitivity  90.49
F1 Score     94.17

In Table 4, the accuracy value showing the overall success of the model is 90.49%. The precision and sensitivity values showing the discrimination of the classes were obtained as 98.16% and 90.49%. The F1 Score value, which expresses the balance of these two values, was obtained as 94.17%.

The confusion matrix obtained as a result of the classification process performed with the RF algorithm is given in Figure 6.

Figure 6. RF results

In the confusion matrix of the RF algorithm, it appears that the model distinguishes spam and non-spam emails with overall success. The values of the metrics calculated from the confusion matrix of the model are given in Table 5.

Table 5. Calculated metrics for RF
RF Metrics   Ratios (%)
Accuracy     98.83
Precision    98.78
Sensitivity  99.89
F1 Score     99.34

In Table 5, the accuracy value showing the overall success of the model is 98.83%. The precision and sensitivity values showing the discrimination of the classes were obtained as 98.78% and 99.89%. The F1 Score value, which
expresses the balance of these two values, was
obtained as 99.34%.

The confusion matrix obtained as a result of the
classification process performed with the ANN
algorithm is given in Figure 7.

Figure 7. ANN results

In the confusion matrix of the ANN algorithm,
it is understood that the model successfully
distinguishes between spam and non-spam
emails in general. The values of the metrics
calculated over the confusion matrix of the
model are given in Table 6.

Table 6. Calculated metrics for ANN

ANN Metrics   Ratios (%)
Accuracy      97.04
Precision     97.00
Sensitivity   99.69
F1 Score      98.32

In Table 6, the accuracy value showing the
overall success of the model is 97.04%. The
precision and sensitivity values showing the
discrimination of the classes were obtained as
97.00% and 99.69%. The F1 Score value, which
expresses the balance of these two values, was
obtained as 98.32%.

The measurements obtained as a result of the
classification processes performed with 5
different algorithms are summarized in Table 7.

Table 7. Calculated measurements of the algorithms used

Machine Learning Algorithm   Accuracy   Precision   Sensitivity   F1 Score
SVM                          98.74      98.86       99.89         99.29
LR                           97.66      97.75       99.89         98.68
NB                           90.49      98.16       90.49         94.17
RF                           98.83      98.78       99.89         99.34
ANN                          97.04      97.00       99.69         98.32

When Table 7, which shows the classification
performance of the algorithms, is analysed, it is
revealed that the RF algorithm ranks first with
98.83% accuracy in terms of overall success. The
NB algorithm showed the lowest performance
with 90.49% accuracy. In terms of F1 score,
which expresses the balance in distinguishing the
classes, the most successful algorithm was RF
with 99.34%, while the lowest success was the NB
algorithm with 94.17%. The general success
ranking is RF > SVM > LR > ANN > NB.

The comparison of the findings obtained in the
classification process performed in this study
with other similar studies in the literature is given
in Table 8. In this table, the most successful
algorithm and accuracy rates are given.

The last row in Table 8 is the result of this study.
Some studies in this table report accuracy rates
close to those of our study because their data sets
and data set sizes are similar. The comparison
shows experimentally that this study reaches a
higher accuracy than the other studies listed. This
is attributed to the natural language processing
steps of the study being more effective than those
of other similar studies.

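
The ranking read from Table 7 can be checked mechanically. The sketch below copies the accuracy and F1 values from Table 7 into a dictionary and sorts the algorithms by each metric. The variable and function names are illustrative choices, not part of the study.

```python
# Accuracy and F1 score (in percent) copied from Table 7.
table7 = {
    "SVM": {"accuracy": 98.74, "f1": 99.29},
    "LR": {"accuracy": 97.66, "f1": 98.68},
    "NB": {"accuracy": 90.49, "f1": 94.17},
    "RF": {"accuracy": 98.83, "f1": 99.34},
    "ANN": {"accuracy": 97.04, "f1": 98.32},
}


def rank_by(metric):
    """Return algorithm names ordered from best to worst on the given metric."""
    return sorted(table7, key=lambda name: table7[name][metric], reverse=True)


accuracy_ranking = rank_by("accuracy")
f1_ranking = rank_by("f1")
print(accuracy_ranking)  # ['RF', 'SVM', 'LR', 'ANN', 'NB']
print(f1_ranking)        # ['RF', 'SVM', 'LR', 'ANN', 'NB']

# Gap between the best and worst accuracy, in percentage points.
gap = round(table7["RF"]["accuracy"] - table7["NB"]["accuracy"], 2)
print(gap)  # 8.34
```

Both metrics give the same order, and the gap between RF and SVM (0.09 percentage points of accuracy) is far smaller than the gap between ANN and NB.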
5. Conclusion

E-mail is one of the most widely used
communication tools, and one of the biggest
problems in the use of this tool is spam messages.
Spam messages are e-mails that are intended to
advertise or deceive and their detection is of great
importance. Various techniques and algorithms
have been proposed to detect spam e-mails.

In the present study, 5 different machine learning
algorithms were used to classify spam e-mails
using a dataset of 5558 samples consisting of
spam and non-spam e-mail messages. With 5-fold
cross-validation, the results of the classification
processes are reported with accuracy, precision,
sensitivity and F1 score metrics.

In the study, the RF algorithm produced the most
successful result with 98.83% accuracy. The use of
natural language processing, which distinguishes
this study from others, contributed to this high
success. It is concluded that this score is higher
than those of similar studies in the literature. This
study sets an example for a machine learning-based
infrastructure that will consistently filter spam
content in e-mail servers. In future studies, it is
aimed to obtain higher performance results with
different algorithms on datasets to be prepared
for different natural languages.
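
The 5-fold cross-validation procedure mentioned above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names, the fixed shuffling seed, the toy messages and the majority-class placeholder model are all hypothetical. Any of the five classifiers could be plugged in through the `fit` and `predict` arguments.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_val_accuracy(texts, labels, fit, predict, k=5):
    """Mean accuracy over k folds for any fit/predict pair."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(texts), k):
        model = fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, texts[i]) == labels[i] for i in test_idx)
        fold_scores.append(hits / len(test_idx))
    return mean(fold_scores)


# Placeholder model (hypothetical): always predicts the majority training label.
def fit_majority(train_texts, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, text):
    return model


# Fold sizes for a dataset of 5558 samples, the size used in the study.
fold_sizes = [len(test) for _, test in k_fold_indices(5558, k=5)]
print(fold_sizes)  # [1112, 1112, 1112, 1111, 1111]

# Toy messages and labels, invented for illustration only.
toy_texts = ["win a prize now", "meeting at noon", "lunch tomorrow?",
             "free offer inside", "project update", "see you tonight",
             "call me later", "invoice attached", "weekly report", "happy birthday"]
toy_labels = ["spam", "ham", "ham", "spam", "ham",
              "ham", "ham", "ham", "ham", "ham"]
baseline = cross_val_accuracy(toy_texts, toy_labels, fit_majority, predict_majority)
print(round(baseline, 2))  # 0.8
```

The majority-class baseline reaches 80% accuracy on the toy data without detecting a single spam message, which is one reason to report precision, sensitivity and F1 score alongside accuracy.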

Table 8. Comparison table of the most successful accuracy rates on the same and different data sets

Study Name                          Data Set Used                                   Most Successful Algorithm     Highest Accuracy (%)
Kumar et al., 2023 [29]             Spam Dataset                                    NB                            98.56
Jain et al., 2022 [26]              Spam Dataset                                    SVM                           98.79
Abayomi et al., 2022 [30]           Spam Dataset                                    BiLSTM                        98.60
Reddy and Reddy, 2021 [28]          Spam Dataset                                    SVM                           95.32
Gadde et al., 2021 [27]             Spam Dataset                                    LSTM                          98.50
Junnarkar et al., 2021 [4]          Data set containing 5574 e-mails                SVM                           97.83
Ma et al., 2020 [21]                Data set containing 6000 e-mails                SVM                           95.5
Salihi, 2019 [16]                   Data set of 1183 units obtained from Twitter    RF                            92.95
Karamollaoglu and Dogru, 2018 [6]   TurkishMail dataset consisting of 600 e-mails   NB                            95.5
Nazlı, 2018 [44]                    Data set consisting of 300 e-mails              SVM                           98.33
Kale, 2018 [45]                     Data set of 4,709 e-mails                       Gradient Boosted Tree (GBT)   94.97
Yıldız, 2017 [31]                   Data set of 310 Turkish e-mails                 NB                            96.31
Alkaht et al., 2016 [28]            CSDMC 2010, SpamAssassin, Tarassul              SNN                           95.82
Sharma and Suryawanshi, 2016 [29]   Spambase                                        KNN                           97.50
Zavvar et al., 2016 [46]            Spambase                                        SVM                           93.07
This study                          Spam Dataset                                    SVM 98.74                     98.83
                                                                                    LR 97.66
                                                                                    NB 90.49
                                                                                    RF 98.83
                                                                                    ANN 97.04

Article Information Form

Funding
The author(s) received no financial support for
the research, authorship or publication of this
study.

Authors' Contribution
All authors have contributed equally to the
experimental study and the writing of the
manuscript.

The Declaration of Conflict of Interest/Common Interest
No conflict of interest or common interest has
been declared by the authors.

The Declaration of Ethics Committee Approval
This study does not require ethics committee
permission or any special permission.

The Declaration of Research and Publication Ethics
The authors of the paper declare that they comply
with the scientific, ethical and quotation rules of
SAUJS in all processes of the paper and that they
do not make any falsification on the data
collected. In addition, they declare that Sakarya
University Journal of Science and its editorial
board have no responsibility for any ethical
violations that may be encountered, and that this
study has not been evaluated in any academic
publication environment other than Sakarya
University Journal of Science.

Copyright Statement
Authors own the copyright of their work
published in the journal and their work is
published under the CC BY-NC 4.0 license.

References

[1] E. G. Dada, J. S. Bassi, H. Chiroma, A. O. Adetunmbi, & O. E. Ajibuwa, "Machine learning for email spam filtering: review, approaches and open research problems." Heliyon, 5(6), e01802, 2019.

[2] L. Ceci (2022, Nov. 14). Number of e-mail users worldwide [online]. Available: https://www.statista.com/statistics/255080/number-of-e-mail-users-worldwide/

[3] S. Dixon (2022, Apr. 28). Daily spam volume worldwide. Available: https://www.statista.com/statistics/1270424/daily-spam-volume-global/

[4] P. Pantel, D. L. Spamcop, "A Spam Classification and Organization Program." Learning for Text Categorization, 2006.

[5] S. Zeadally, E. Adi, Z. Baig, & I. A. Khan, "Harnessing artificial intelligence capabilities to improve cybersecurity." IEEE Access 8, 23817-23837, 2020.

[6] A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti, & M. Alazab, "A comprehensive survey for intelligent spam email detection." IEEE Access 7, 168261-168295, 2019.

[7] T. Dogan, "On Term Weighting for Spam SMS Filtering." Sakarya University Journal of Computer and Information Sciences 3.3, 239-249, 2020.

[8] S. Douzi, F. A. AlShahwan, M. Lemoudden, & B. El Ouahidi, "Hybrid email spam detection model using artificial intelligence." International Journal of Machine Learning and Computing 10.2, 2020.

[9] E. M. Onyema, S. Dalal, C. A. T. Romero, B. Seth, P. Young, & M. A. Wajid, "Design of intrusion detection system based on cyborg intelligence for security of cloud network traffic of smart cities." Journal of Cloud Computing 11.1, 1-20, 2022.

[10] A. Bhowmick, S. M. Hazarika, "E-mail spam filtering: a review of techniques and trends." Advances in Electronics, Communication and Computing: ETAEERE-2016, 583-590, 2018.

[11] D. Abidin, "The Effect of Derived Features on Art Genre Classification with Machine Learning." Sakarya University Journal of Science, 25(6), 1275-1286, 2021.

[12] P. Sharma, U. Bhardwaj, "Machine learning based spam e-mail detection." International Journal of Intelligent Engineering and Systems 11.3, 1-10, 2018.

[13] Ö. Şahinaslan, H. Dalyan, E. Şahinaslan, "Naive bayes sınıflandırıcısı kullanılarak youtube verileri üzerinden çok dilli duygu analizi." Bilişim Teknolojileri Dergisi 15.2, 221-229, 2022.

[14] A. Junnarkar, S. Adhikari, J. Fagania, P. Chimurkar, D. Karia, "E-mail spam classification via machine learning and natural language processing." 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV). IEEE, 2021.

[15] Y. S. Bozan, Ö. Çoban, G. T. Özyer, & B. Özyer, "SMS spam filtering based on text classification and expert system." 2015 23nd Signal Processing and Communications Applications Conference (SIU). IEEE, 2015.

[16] A. K. A. Salihi, Spam detection by using word-vector learning algorithm in online social networks. MS thesis. Fen Bilimleri Enstitüsü, 2019.

[17] H. Karamollaoglu, İ. A. Dogru, M. Dorterler, "Detection of Spam E-mails with Machine Learning Methods." 2018 Innovations in Intelligent Systems and Applications Conference (ASYU). IEEE, 2018.

[18] M. T. Ma, K. Yamamori, A. Thida, "A comparative approach to Naïve Bayes classifier and support vector machine for email spam classification." 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE). IEEE, 2020.

[19] B. K. Dedeturk, B. Akay, "Spam filtering using a logistic regression model trained by an artificial bee colony algorithm." Applied Soft Computing 91, 106229, 2020.

[20] N. Baktır, A. Yılmaz, "Makine Öğrenmesi Yaklaşımlarının Spam-Mail Sınıflandırma Probleminde Karşılaştırmalı Analizi." Bilişim Teknolojileri Dergisi 15.3: 349-364, 2022.

[21] F. Jánez-Martino, E. Fidalgo, S. González-Martínez, J. Velasco-Mata, "Classification of spam emails through hierarchical clustering and supervised learning." arXiv preprint arXiv:2005.08773, 2020.

[22] R. Mansoor, N. D. Jayasinghe, M. M. A. Muslam, "A comprehensive review on email spam classification using machine learning algorithms." 2021 International Conference on Information Networking (ICOIN). IEEE, 2021.

[23] A. Yıldız, M. Demirci, Kurumsal e-posta sınıflandırma sistemi. Diss. Yüksek Lisans Tezi, Gazi Üniversitesi Fen Bilimleri Enstitüsü, 82, Ankara, 2017.

[24] I. J. Alkaht, B. Al-Khatib, "Filtering spam using several stages neural networks." Int. Rev. Comp. Softw 11.2, 2016.

[25] A. Sharma, A. Suryawanshi, "A novel method for detecting spam email using KNN classification with spearman correlation as distance measure." International Journal of Computer Applications 136.6, 28-35, 2016.

[26] T. Jain, P. Garg, N. Chalil, A. Sinha, V. K. Verma, & R. Gupta, "SMS spam classification using machine learning techniques." 2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence), 273-279. IEEE, 2022.

[27] S. Gadde, A. Lakshmanarao, & S. Satyanarayana, "SMS spam detection using machine learning and deep learning techniques." 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), Vol. 1, 358-362. IEEE, 2021.

[28] G. A. Reddy, & B. I. Reddy, "Classification of Spam Text using SVM." Journal of University of Shanghai for Science and Technology, 23(8), 616-624, 2021.

[29] R. Kumar, K. S. R. Murthy, J. Ramesh Babu, & A. Shaik, "Live Text Analyzer to Detect Unsolicited Messages Using Count Vectorizer." Journal of Engineering Sciences, 14(06), 2023.

[30] O. Abayomi-Alli, S. Misra, & A. Abayomi-Alli, "A deep learning method for automatic SMS spam classification: Performance of learning algorithms on indigenous dataset." Concurrency and Computation: Practice and Experience, 34(17), e6989, 2022.

[31] 'Email Spam Detection 98% Accuracy | Kaggle'. https://www.kaggle.com/code/mfaisalqureshi/email-spam-detection-98-accuracy/data (accessed Aug. 21, 2023).

[32] M. Zhou, N. Duan, S. Liu, H. Y. Shum, "Progress in neural NLP: modeling, learning, and reasoning." Engineering 6.3, 275-290, 2020.

[33] I. Yahav, O. Shehory, D. Schwartz, "Comments mining with TF-IDF: the inherent bias and its removal." IEEE Transactions on Knowledge and Data Engineering 31.3, 437-450, 2018.

[34] Y. Altuntaş, A. F. Kocamaz, A. M. Ülkgün, "Determination of Individual Investors' Financial Risk Tolerance by Machine Learning Methods." 2020 28th Signal Processing and Communications Applications Conference (SIU). IEEE, 2020.

[35] R. Gürfidan, M. Ersoy, "Classification of death related to heart failure by machine learning algorithms." Advances in Artificial Intelligence Research 1.1, 13-18, 2021.

[36] S. Şenel, B. Alatli, "Lojistik regresyon analizinin kullanıldığı makaleler üzerine bir inceleme." Journal of Measurement and Evaluation in Education and Psychology 5.1, 35-52, 2014.

[37] A. McCallum, K. Nigam, "A comparison of event models for naive bayes text classification." AAAI-98 workshop on learning for text categorization. Vol. 752. No. 1. 1998.

[38] V. Metsis, I. Androutsopoulos, G. Paliouras, "Spam filtering with naive bayes-which naive bayes?" CEAS. Vol. 17. 2006.

[39] F. M. Avcu, "Az Veri Setli Çalışmalarında Derin Öğrenme Ve Diğer Sınıflandırma Algoritmalarının Karşılaştırılması: Agonist Ve Antagonist Ligand Örneği." İnönü Üniversitesi Sağlık Hizmetleri Meslek Yüksek Okulu Dergisi 10.1, 356-371, 2022.

[40] Ö. Akar, O. Güngör, "Rastgele orman algoritması kullanılarak çok bantlı görüntülerin sınıflandırılması." Jeodezi ve Jeoinformasyon Dergisi 106, 139-146, 2012.

[41] A. Arı, M. E. Berberler, "Yapay sinir ağları ile tahmin ve sınıflandırma problemlerinin çözümü için arayüz tasarımı." Acta Infologica 1.2, 55-73, 2017.

[42] O. I. Abiodun, A. Jantan, A. E. Omolara, K. V. Dada, A. M. Umar, O. U. Linus, M. U. Kiru, "Comprehensive review of artificial neural network applications to pattern recognition." IEEE Access 7, 158820-158846, 2019.

[43] Z. K. Şentürk, "Artificial neural networks based decision support system for the detection of diabetic retinopathy." Sakarya Üniversitesi Fen Bilimleri Enstitüsü Dergisi 24.2, 424-431, 2020.

[44] N. Nazlı, Analysis of machine learning-based spam filtering techniques. MS thesis. 2018.

[45] B. Kale, Veri madenciliği sınıflandırma algoritmaları ile e-posta önemliliğinin belirlenmesi. MS thesis. Fen Bilimleri Enstitüsü, 2018.

[46] M. Zavvar, M. Rezaei, S. Garavand, "Email spam detection using combination of particle swarm optimization and artificial neural network and support vector machine." International Journal of Modern Education and Computer Science 8.7, 68, 2016.
