Machine Learning Based Classification for Spam Detection
Research Article
1 Burdur Mehmet Akif Ersoy University, Institute of Science and Technology, Department of Computer Engineering, Burdur, Türkiye, [email protected]
2 Burdur Mehmet Akif Ersoy University, Faculty of Engineering and Architecture, Department of Computer Engineering, Burdur, Türkiye, [email protected]
* Corresponding author
Electronic messages, i.e. e-mails, are a communication tool frequently used by individuals and organizations. While e-mail is extremely practical to use, its vulnerabilities must be considered. Spam e-mails are unsolicited messages created to promote a product or service, and they are often sent repeatedly. Classifying incoming e-mails is very important in order to protect against malware that can be transmitted via e-mail and to reduce possible unwanted consequences. Spam email classification is the process of identifying and distinguishing spam emails from legitimate emails. This classification can be done through various methods such as keyword filtering, machine learning algorithms and image recognition. The goal of spam email classification is to prevent unwanted and potentially harmful emails from reaching the user's inbox. In this study, Random Forest (RF), Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM) and Artificial Neural Network (ANN) algorithms are used to classify spam emails and the results are compared. Algorithms with different approaches were used to determine the best solution for the problem. A total of 5558 spam and non-spam e-mails were analyzed and the performance of the algorithms was reported in terms of accuracy, precision, sensitivity and F1-Score metrics. The most successful result was obtained with the RF algorithm, with an accuracy of 98.83%. In this study, high success was achieved by classifying spam emails with machine learning algorithms. In addition, the experiments indicate that the results are better than those of similar studies in the literature.

Keywords: Artificial Intelligence, Email Classification, Machine Learning, Spam Detection

Article History: Received: 13.03.2023; Accepted: 08.12.2023; Online Available: 22.04.2024
This is an open access paper distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License.
Serkan Keskin, Onur Sevli
Sakarya University Journal of Science, 28(2) 2024, 270-282

bandwidth. Such unsolicited e-mails are characterized as "spam". Between October 2020 and September 2021, the global daily spam volume peaked in July 2021 with approximately 283 billion spam emails out of a total of 336.41 billion emails. By August 2021, this number had fallen to 65.50 billion. By September, the average spam volume had again increased by 36 percent, reaching 88.88 billion out of a total of 105.67 billion emails sent worldwide [3]. Email providers are expected to stop spam emails before they reach users. Many email providers include mechanisms that attempt to filter spam by comparing the sender address of emails against so-called blacklists of known spammers. However, since spammers frequently change their sender addresses, the success of these programs has not reached the desired level [4]. At this point, a more effective and flexible solution is needed. Generally, spam e-mails contain messages such as "easy money", "adult entertainment", etc. in their headers or content, which can deceive individuals. Classifying emails by interpreting their messages is based on keyword detection rules. Keyword detection algorithms have compensated for the inadequacy of address-based filtering of spam e-mails. Machine learning techniques, which have recently gained popularity and are used in many different fields, provide alternative solutions that filter spam e-mails much more successfully.

Methods used to detect spam emails

Unsolicited emails (spam) are usually fake emails sent for advertising or fraudulent purposes and often contain content that users do not want or are not interested in. Such emails can put users in difficult situations or reduce work efficiency. Therefore, it is important to detect and filter spam emails.

1.1.1. Traditional spam detection systems

Such spam detection systems, which are not based on artificial intelligence, usually use simple algorithms that distinguish spam based on the content of the message, the sender's address or the content of its links. The effectiveness and accuracy of these systems are lower than those of AI-based systems. They are also less flexible and adaptive than AI-based systems. The main methods used in traditional spam detection systems are as follows:

• Email authentication: This method is used to verify who the sender of an email is. It verifies the authenticity of the sender using standards such as DomainKeys Identified Mail (DKIM) and Sender Policy Framework (SPF). This makes it possible to detect fake emails or spam emails sent from fake accounts [5].

• List of email addresses: This method enables the detection of spam emails using a predefined list of email addresses. This list may include email addresses with a high probability of spam [6]. This method can be effective in preventing spam emails, but it also involves the risk of false positives, i.e., legitimate email addresses being wrongly flagged as spam.

• Content filtering: This method is used to detect spam emails based on the content of the emails. For example, words and phrases related to advertisements, product sales or illegal content can be detected in emails, and these emails can be marked as spam. This method can be effective in preventing spam emails, but it also involves the risk of false positives [7].

• Sharing a list of email addresses: This method enables the detection of spam emails by sharing a list of spam email addresses between different users and organizations [7].

1.1.2. Artificial intelligence-based spam detection systems

Artificial intelligence-based spam detection systems are software used to detect spam messages that are common in electronic communication networks. These systems use various artificial intelligence techniques to search for and detect specific characteristics of spam messages. Spam messages are usually marketing messages with a high content of advertisements and promotions. These messages are often sent to many people and are often unsolicited or unnecessary. Sending too many
spam messages wastes the time and effort of email users. Artificial intelligence-based spam detection systems are designed to reduce these problems. These systems examine the content, headers and other features of e-mail messages and classify spam messages according to certain criteria [8].

• Systems based on biological intelligence: Systems based on biological intelligence are artificial intelligence systems that mimic the structure and functioning of the human brain, including its learning, remembering and problem-solving abilities. In particular, they have a network structure that transmits signals from inputs to outputs using structures called neural networks. These neural networks can have learning and adaptive properties, much like the human brain. By mimicking the natural structure and functioning of the human brain, such systems can have a very high degree of adaptive and learning capabilities [9].

• Machine learning-based systems: Machine learning-based spam systems help to detect spam emails automatically. These systems usually identify spam emails using features such as keywords and phrases found in the content of the emails. They also take into account that spam emails are usually sent regularly and that they fit a certain profile of sender email addresses and domains. Spam systems developed using machine learning learn from pre-labeled datasets and discover which features in these datasets are more effective in identifying spam emails [10]. These features may include keywords and phrases in the content of the emails, the sender's email address and domain, the email header, and the format of the email. The learned features are used to detect spam emails, and new incoming emails are evaluated according to these features. An advantage of machine learning-based spam systems is that they have high detection rates because they learn from pre-labeled datasets [11]. Furthermore, these systems can improve themselves through dynamic learning processes and become more accurate classifiers over time. However, the disadvantages of machine learning-based spam systems include lower correct detection rates when the datasets are not large and diverse enough, and the misidentification of non-spam emails as spam [12].

2. Literature Review

An examination of the studies in the literature that use artificial intelligence techniques to detect spam e-mails shows that e-mail classification is performed with different algorithms. Some of these studies used traditional machine learning algorithms, while others used algorithms inspired by biological systems such as Artificial Neural Networks (ANN).

In a study classifying comments in different languages obtained from social media, an accuracy of 96% was achieved using the Naive Bayes (NB) algorithm [13]. In another study to classify e-mails, a dataset containing 5574 English messages was classified with 95.48% accuracy using the NB algorithm and 97.83% accuracy using the Support Vector Machine (SVM) algorithm [14]. Another study on filtering short messages (SMS) attempted to distinguish unwanted advertisements. The highest scores obtained in the classification process were reported as 98.61% with SVM and 97.55% with NB [15].

In some studies, classification is performed on messages sent via social media. In a study classifying 1383 tweets, the accuracy rate of RF was 92.95% [16]. The same algorithm is not always the most successful across studies, because different data sets are used. For example, in another spam e-mail detection study, 600 e-mails were classified, and the accuracy was 95.5% for Naive Bayes and 93.5% for SVM [17]. In another study, 6000 emails were classified, and the accuracy was 94.6% for Naive Bayes and 98.5% for SVM [18]. Another algorithm examined in the literature is LR, which has been used to classify incoming emails as ham (legitimate) or spam. Dedekurt et al. presented a new spam detection approach by combining LR and artificial bee colony [19].
understanding and summarizing texts. Voice processing, on the other hand, works with voice data and performs operations such as recognizing voices, generating text from speech and converting text into speech. In recent years, NLP has developed rapidly in areas such as question answering, machine translation and machine reading comprehension. NLP can be divided into three parts: modeling, learning and reasoning [32]. TF-IDF (Term Frequency-Inverse Document Frequency) is a natural language processing technique used to measure word importance in texts. TF-IDF calculates how often a word occurs in a text (Term Frequency, TF) and how rare the texts containing that word are among all texts (Inverse Document Frequency, IDF). The product of these two values indicates the importance of the word. TF-IDF is used to better understand the meaning of texts. It is widely used for measuring word distributions in texts and can be used in applications such as determining the similarity of texts, classifying texts or making connections between texts [33].

Each word in the dataset used in this study is associated with a numerical index value, and the messages that carry spam flags are labeled. During model training, the textual expressions in the dataset were split word by word and subjected to numerical transformations, producing a completely numerical dataset. The dataset was classified with 5 different machine learning algorithms. The algorithms were written in the Python programming language in the Spyder environment, and the tests were carried out using various libraries. The algorithms applied to the dataset were evaluated according to precision, sensitivity, accuracy and F1 scores. All algorithms were subjected to 5-fold cross-validation.

Classification algorithms used

The data set used in the study was classified using 5 different machine learning algorithms: Support Vector Machine, Logistic Regression, Naive Bayes, Random Forest and Artificial Neural Network.

3.3.1. Support vector machine (SVM)

SVM is widely used in many studies because it produces significant accuracy with less computational power. SVM is one of the most popular supervised learning algorithms used to solve regression and classification problems. The goal of the SVM algorithm is to construct the best line or decision boundary that distinctly separates the classes of data points in a multidimensional space [34]. This boundary is called the hyperplane. The SVM selects the extreme points or vectors that define the hyperplane. These selected points are called the support vectors [35]. The SVM algorithm is used in many different fields such as image classification, text classification and face detection.

3.3.2. Logistic regression (LR)

LR, like SVM, is one of the important machine learning algorithms that use supervised learning techniques. It is used to predict a categorical dependent variable from a given set of independent variables. The predicted outcome is discrete or categorical, such as true or false, 0 or 1. Instead of giving an exact value, however, LR gives a probabilistic value between 0 and 1. Instead of a straight line, LR fits an "S" shaped function bounded by these two extreme values. This curve gives the probability of whether a state exists or not [36]. LR is a highly successful machine learning algorithm that calculates probabilities using discrete and continuous data and classifies newly entered data.

3.3.3. Naive bayes (NB)

It is the first filtering algorithm used as a probabilistic classifier [37]. The NB algorithm is a supervised learning algorithm for solving classification problems based on Bayes' theorem. It is used for text classification with a high-dimensional training data set. The NB algorithm can make predictions quickly. It makes predictions by calculating the probability that an object belongs to each class. Due to their simplicity and high performance, these approaches are the most widely used in open-source systems proposed for spam filtering [38]. This algorithm is also used in
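As a concrete illustration of the TF-IDF weighting described earlier in this section, the following minimal sketch computes the weights for a toy corpus. The study does not state which TF-IDF variant or library was used, so the formulas here (TF as the relative frequency of a word in a document, IDF as the natural logarithm of the number of documents divided by the number of documents containing the word) are an assumption, and the function name `tf_idf` and the three sample messages are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {word: weight} dict per tokenized document.

    Assumed variant (not taken from the paper):
    tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


# Toy corpus, invented for illustration only.
corpus = [
    "free money now".split(),
    "meeting at noon".split(),
    "free lunch at noon".split(),
]
for doc, doc_weights in zip(corpus, tf_idf(corpus)):
    print(doc, {word: round(value, 3) for word, value in doc_weights.items()})
```

In the first message, "money" occurs in only one document and therefore receives a higher weight than "free", which occurs in two. A word that appears in every document receives a weight of zero under this variant; library implementations often add smoothing terms to avoid that.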
expresses the balance of these two values, was obtained as 99.34%.

The confusion matrix obtained as a result of the classification process performed with the ANN algorithm is given in Figure 7.

In Table 6, the accuracy value showing the overall success of the model is 97.04%. The precision and sensitivity values showing the discrimination of the classes were obtained as 97.00% and 99.69%. The F1 Score value, which expresses the balance of these two values, was obtained as 98.32%.

The measurements obtained as a result of the classification processes performed with 5 different algorithms are summarized in Table 7.

Table 7. Calculated measurements of the algorithms used (columns: Machine Learning Algorithm, Accuracy, Precision, Sensitivity, F1 Score)

The last row in Table 8 is the result of this study. The accuracy rates of some studies in this table are close to those of our study because the data sets and their sizes are similar. The comparison shows experimentally that this study achieved a higher accuracy than the other studies listed. This is attributed to the natural language processing steps of the study being more successful than those of similar studies.
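The four metrics reported in this section can all be derived from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the class name `ConfusionMatrix`, the helper `f1_from` and the example counts are hypothetical and are not the study's data. The last lines check that the precision and sensitivity quoted above for the ANN model (97.00% and 99.69%) are consistent with the reported F1 Score of about 98.3%.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Counts of a binary confusion matrix for the class treated as positive."""
    tp: int  # positives predicted as positive
    fp: int  # negatives predicted as positive
    fn: int  # positives predicted as negative
    tn: int  # negatives predicted as negative

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def sensitivity(self) -> float:
        # Also called recall or true-positive rate.
        return self.tp / (self.tp + self.fn)

    def f1(self) -> float:
        return f1_from(self.precision(), self.sensitivity())


def f1_from(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts, chosen only for illustration (not the paper's data).
cm = ConfusionMatrix(tp=90, fp=10, fn=5, tn=95)
print(f"accuracy={cm.accuracy():.4f} precision={cm.precision():.4f} "
      f"sensitivity={cm.sensitivity():.4f} f1={cm.f1():.4f}")

# Consistency check on the ANN values quoted in the text.
print(f"F1 from reported ANN precision and sensitivity: {f1_from(0.9700, 0.9969):.4f}")
```

Because F1 is the harmonic mean of precision and sensitivity, it always lies between the two values and is pulled toward the lower one.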
5. Conclusion
In the present study, 5 different machine learning algorithms were used to classify spam e-mails using a dataset of 5558 samples consisting of spam and non-spam e-mail messages. With 5-fold cross-validation, the results of the classification processes are reported with accuracy, precision, sensitivity and F1 score metrics.

In the study, the RF algorithm produced the most successful result with 98.83% accuracy. This high success is attributed to the use of natural language processing, which distinguishes this study from other studies. It is concluded that this score is higher than those of similar studies in the literature. This study sets an example for a machine learning-based infrastructure that will consistently filter spam content in e-mail servers. Future studies aim to obtain higher performance results with different algorithms on datasets to be prepared for different natural languages.
Table 8. Comparison of the most successful accuracy rates on the same and different data sets

Study Name | Data Set Used | Most Successful Algorithm | Highest Accuracy (%)
--- | --- | --- | ---
Kumar et al., 2023 [29] | Spam Dataset | NB | 98.56
Jain et al., 2022 [26] | Spam Dataset | SVM | 98.79
Abayomi et al., 2022 [30] | Spam Dataset | BILSTM | 98.60
Reddy and Reddy, 2021 [28] | Spam Dataset | SVM | 95.32
Gadde et al., 2021 [27] | Spam Dataset | LSTM | 98.50
Junnarkar et al., 2021 [4] | Data set containing 5574 e-mails | SVM | 97.83
Ma et al., 2020 [21] | Data set containing 6000 e-mails | SVM | 95.5
Salihi, 2019 [16] | Data set of 1183 items obtained from Twitter | RF | 92.95
Karamollaoglu and Dogru, 2018 [6] | TurkishMail dataset consisting of 600 e-mails | NB | 95.5
Nazlı, 2018 [44] | Data set consisting of 300 e-mails | SVM | 98.33
Kale, 2018 [45] | Data set of 4,709 e-mails | Gradient Boosted Tree (GBT) | 94.97
Yıldız, 2017 [31] | Data set of 310 Turkish e-mails | NB | 96.31
Alkaht et al., 2016 [28] | CSDMC 2010, SpamAssassin, Tarassul | SNN | 95.82
Sharma and Suryawanshi, 2016 [29] | Spambase | KNN | 97.50
Zavvar et al., 2016 [46] | Spambase | SVM | 93.07
This study | Spam Dataset | SVM: 98.74; LR: 97.66; NB: 90.49; RF: 98.83; ANN: 97.04 | 98.83 (RF)
The Declaration of Conflict of Interest/Common Interest

No conflict of interest or common interest has been declared by the authors.

[2] L. Ceci (2022, Nov. 14). Number of e-mail users worldwide [online]. Available: https://www.statista.com/statistics/255080/number-of-e-mail-users-worldwide/

[3] S. Dixon (2022, Apr. 28). Daily spam volume worldwide. Available: https://www.statista.com/statistics/1270424/daily-spam-volume-global/

[4] P. Pantel, D. L. Spamcop, "A Spam Classification and Organization Program." Learning for Text Categorization, 2006.

[5] S. Zeadally, E. Adi, Z. Baig, I. A. Khan, "Harnessing artificial intelligence capabilities to improve cybersecurity." IEEE Access 8, 23817-23837, 2020.

[11] D. Abidin, "The Effect of Derived Features on Art Genre Classification with Machine Learning." Sakarya University Journal of Science 25(6), 1275-1286, 2021.

[12] P. Sharma, U. Bhardwaj, "Machine learning based spam e-mail detection." International Journal of Intelligent Engineering and Systems 11.3, 1-10, 2018.

[13] Ö. Şahinaslan, H. Dalyan, E. Şahinaslan, "Naive bayes sınıflandırıcısı kullanılarak youtube verileri üzerinden çok dilli duygu analizi." Bilişim Teknolojileri Dergisi 15.2, 221-229, 2022.
[33] I. Yahav, O. Shehory, D. Schwartz, "Comments mining with TF-IDF: the inherent bias and its removal." IEEE Transactions on Knowledge and Data Engineering 31.3, 437-450, 2018.

[34] Y. Altuntaş, A. F. Kocamaz, A. M. Ülkgün, "Determination of Individual Investors' Financial Risk Tolerance by Machine Learning Methods." 2020 28th Signal Processing and Communications Applications Conference (SIU). IEEE, 2020.

[35] R. Gürfidan, M. Ersoy, "Classification of death related to heart failure by machine learning algorithms." Advances in Artificial Intelligence Research 1.1, 13-18, 2021.

[36] S. Şenel, B. Alatli, "Lojistik regresyon analizinin kullanıldığı makaleler üzerine bir inceleme." Journal of Measurement and Evaluation in Education and Psychology 5.1, 35-52, 2014.

[37] A. McCallum, K. Nigam, "A comparison of event models for naive bayes text classification." AAAI-98 workshop on learning for text categorization. Vol. 752. No. 1. 1998.

[38] V. Metsis, I. Androutsopoulos, G. Paliouras, "Spam filtering with naive

[41] A. Arı, M. E. Berberler, "Yapay sinir ağları ile tahmin ve sınıflandırma problemlerinin çözümü için arayüz tasarımı." Acta Infologica 1.2, 55-73, 2017.

[42] O. I. Abiodun, A. Jantan, A. E. Omolara, K. V. Dada, A. M. Umar, O. U. Linus, M. U. Kiru, "Comprehensive review of artificial neural network applications to pattern recognition." IEEE Access 7, 158820-158846, 2019.

[43] Z. K. Şentürk, "Artificial neural networks based decision support system for the detection of diabetic retinopathy." Sakarya Üniversitesi Fen Bilimleri Enstitüsü Dergisi 24.2, 424-431, 2020.

[44] N. Nazlı, Analysis of machine learning-based spam filtering techniques. MS thesis. 2018.

[45] B. Kale, Veri madenciliği sınıflandırma algoritmaları ile e-posta önemliliğinin belirlenmesi. MS thesis. Fen Bilimleri Enstitüsü, 2018.

[46] M. Zavvar, M. Rezaei, S. Garavand, "Email spam detection using combination of particle swarm optimization and artificial neural network and support vector machine." International Journal of Modern Education and Computer Science 8.7, 68, 2016.