An Analysis of Machine Learning Algorithms and Deep Neural Networks For Email Spam Classification U
Computer Science(Software Engineering) Computer Science & Engineering Computer Science & Engineering
University of Hertfordshire BRAC University BRAC University
London, UK Dhaka, Bangladesh Dhaka, Bangladesh
[email protected] [email protected] [email protected]
Abstract—Due to the extensive use of technology in our daily lives, email has become essential for online correspondence between individuals from all walks of life. As a result, certain individuals have weaponized this service by bulk mailing malicious emails to recipients with the goal of retrieving some form of classified information. Thus, email classification has become a major area of research, as it enables the identification and isolation of such malicious emails. The objectives of this paper include a robust comparison of several traditional machine learning (ML) algorithms, an exploration of transfer learning with static (non-trainable) pretrained GLOVE (Global Vectors for Word Representation) embedding, and a comparison of several deep learning models trained with GLOVE and keras embedding separately. Among ML classifiers, XGBoost achieved the highest evaluation scores. Among deep learning algorithms, keras embedding based models outperformed GLOVE embedding based models by a small margin, which shows the efficiency of transfer learning in downstream NLP tasks (e.g., part-of-speech tagging).
Index Terms—XGBoost, Transfer Learning, Bi-directional Long Short Term Memory, Artificial Neural Network & Convolutional Neural Network.
I. INTRODUCTION
Electronic mails (emails) play a significant role in day-to-day communication for a wide variety of professionals and businesses alike. Approximately 319 billion emails were sent and received per day in 2021, and this number is likely to grow to over 376 billion by the end of 2025 according to the Email Statistics Report 2021 by the RADICATI group [1]. As such, malicious actors have begun using unsolicited emails to exploit users, customers or professionals of particular businesses.
Despite the use of several spam email detection systems, the proportion of spam emails in total email traffic remains enormous [1], [2]. Statista states that 45.1% of all emails exchanged in March 2021 were identified as spam [3]. Traditionally, rule-based and protocol-based systems were employed to identify spam and phishing emails [4], [5]. These rule-based systems were static in nature, rendering them ineffective against modern spam and phishing attempts [6]. Malicious attackers are growing more versatile in circumventing existing email filters as computational resources become more widely available. As such, various machine learning based spam email detection systems have been proposed in the existing literature.
The primary contributions of this paper include exploring transfer learning in training deep learning models (GLOVE embedding) as well as a comparison of the ML classifiers and deep learning models using appropriate performance metrics.
II. LITERATURE REVIEW
A. Related Work
In their paper [7], I. AbdulNabi et al. trained a K-NN (K-nearest neighbour), NB (Naive Bayes), Bi-LSTM (Bi-directional Long Short-Term Memory) and Google BERT model for email classification. These models were evaluated using accuracy and f1-score. The results show that BERT outperformed all the other models with an accuracy of 97.30% and an F1-score of 96.96%.
S. Srinivasan et al. in their paper [8] explored 3 Deep Convolutional Neural Network (DCNN) architectures as well as popular pretrained CNN architectures such as VGG29 and Xception to classify spam images. The authors used several image spam data-sets, namely the Image spam hunter data-set, an improved data-set developed by the authors of [9] and the Dredze ImageSpam data-set. These models were evaluated using accuracy and f1-score.
In their paper [10], S. Ishik et al. explored several Recurrent Neural Network architectures for email classification in an agglutinative language, Turkish. The data-set was collected from [11]. The authors trained an Artificial Neural Network (ANN), LSTM and Bi-LSTM with MI (Mutual Information) and WMI (Weighted Mutual Information) as feature selection methods. The authors report that Bi-LSTM achieved an accuracy of 100%.
Authorized licensed use limited to: REVA UNIVERSITY. Downloaded on March 24,2024 at 16:41:49 UTC from IEEE Xplore. Restrictions apply.
Ankit Narendrakumar et al. in their paper [12] explored the
efficacy of DCNN algorithms for email classification. The
authors used the Enron and SpamAssassin data-sets to train the
DCNN models. Finally, the authors proposed THEMIS, an
email classification model based on a mathematical approach
whereby the emails are divided into several sections and
complex functions are employed to extract and classify the
email signature. The models were evaluated using accuracy
and f1-score. The proposed THEMIS model achieved an
accuracy of 99.84%.
Alia. Barushka et al. in their paper [13] reviewed spam classification models based on ANN, CNN, NB, SVM and Random Forest with n-gram and skip-gram word representation models respectively. The data-sets used by the authors include the Cornell University positive hotel review spam and negative hotel review spam data-sets and TripAdvisor (Amazon Mechanical Turk). These models were evaluated using accuracy, AU-ROC, FN and FP. ANN and CNN models with a combination of n-gram and skip-gram word representation outperformed all the other models with an accuracy of 88.38% on the Negative data-set and 89.75% on the Positive data-set.
In their paper [14], Feng Wei et al. proposed a Bi-LSTM with GLOVE word embedding to detect Twitter bots. The Cresci-2017 Twitter data-set was used by the authors in their work. The model evaluation metrics include Precision, Recall, Specificity, Accuracy, F-Measure and MCC. The proposed model achieved an accuracy of 96.1%, a recall score of 97.6%, a precision score of 94% and a specificity score of 93.5%.
Ismaila Idris in his paper [15] proposed an ANN with a negative selection algorithm (genetic algorithm) to classify spam and non-spam emails. The author contrasted the proposed model with an SVM classifier. The models were evaluated using train and test accuracy. The proposed model received a train accuracy score of 94.30% and a test accuracy score of 91.37%.
Sarit Chakraborty et al. in their paper [16] employed several variations of the Decision tree classifier to filter spam emails from non-spam emails. The authors specifically used the NBTree classifier, the C 4.5 / J48 Decision Tree algorithm and the Logistic Model Tree Induction (LMT) classifier. These models were evaluated using accuracy. The authors show that LMT outperforms all the other classifiers with an accuracy of over 85%, followed by NBTree with an accuracy of over 82% and lastly J48 with an accuracy of 78%.
Yoon Kim in his paper [17] employs variations of a CNN model along with a pretrained word2vec word embedding to classify sentences. The CNN variations considered are CNN-rand, CNN-static, CNN-nonstatic and CNN-multichannel. The author concludes that simple CNN based architectures perform quite well in natural language classification tasks when using a pretrained word2vec word representation model.
B. Observation
Most existing research lacks appropriate performance metrics for model evaluation; as such, it is difficult to conclude whether these models are generalizing to the trained corpus or over-fitting. The use of transfer learning and data wrangling in NLP is quite limited in existing literature. Our work incorporates these techniques to provide a clear, concise and updated analysis of machine learning models in email classification.
III. METHODOLOGY
Supervised classification tasks generally consist of six steps. These steps are defined as Data Acquisition and Pre-processing, Feature Extraction, Model Selection, Model Evaluation and Model Deployment. Figure 1 illustrates the steps performed within our work.
Fig. 1. Workflow diagram
A. Data Acquisition and Pre-processing
The data set used in our work was acquired from the Enron data-set [18], a well-known publicly available benchmark corpus dedicated to spam email classification. Only the Kaminski folder of the data-set was used to generate the .csv file. We divided the dataset into a training set (80 percent) and a test set (20 percent). The resulting training set had 4396 email samples and the test set had 1099 email samples.
Several data wrangling/pre-processing steps were performed on the raw emails to optimise the data-set for the purpose of spam email classification. These steps include Normalization (removing repeating emails, stopwords, words shorter than 3 characters and punctuation), Tokenization and Lemmatization. Stopwords and punctuation usually hold negligible value when it comes to classification of texts or documents, but they may be very useful in predicting words, completing sentences and other similar tasks.
Raw texts usually contain empty spaces, new line characters or other document specific symbols. For the purpose of our
task, the raw emails were converted into a list of words, where each word is referred to as a token (Tokenization). Words in a document are used in varying forms due to grammatical requirements, for example: Democracy - Democratic - Democratization. Lemmatization converts these words to their base, root or dictionary form. This allows optimal feature extraction, as words used in varying contexts will have the same base form. The tokens were lemmatized and converted back to sentences/emails.
B. Feature Extraction
Machines are unable to process natural language such as text in English directly. For this reason, linguistic data is required to be transformed into a numeric representation which concisely encapsulates the statistical properties (distribution, frequencies) of the data as well as, in many cases, its contextual and semantic meaning. This numeric representation is used as features for training ML models.
1) TF-IDF (Term Frequency - Inverse Document Frequency): TF-IDF was used to train the traditional ML classifiers (not neural network based) within our work. A TF-IDF score is assigned to a word based on the frequency of the word in a document and the number of documents it exists in. Generally, within linguistic data or a corpus, certain words are used more frequently despite having lower significance or relevance, in contrast to certain other words that are used rarely despite holding higher relevance to the meaning of the message. The TF-IDF score is used to balance out the weights assigned to words such that frequent non-relevant words hold lower values compared to infrequent highly relevant words. That is, the TF-IDF score gives more weight to rare terms in the corpus and penalizes more commonly occurring terms [19].
2) Word Embedding: Word embeddings are able to represent linguistic items or words in a low dimensional vector space. These numeric vectors (of words) are grouped together within the vector space/word space based on semantic similarity, for instance, Boat - Ship. There are primarily two ways to obtain word embeddings, namely: learnable embedding (Keras Embedding) and pretrained word embedding (GLOVE). The DNN models within this work were trained using keras and GLOVE embedding separately with varying architectures.
a) Keras Embedding: Keras Embedding requires specific input and output dimensions as arguments. The input texts/words must first be converted into integer indices (conceptually equivalent to one hot encoded vectors) before the Embedding matrix is trained. The parameters of the Keras Embedding matrix are updated during Gradient Descent. During our work a 100-dimension word embedding was trained using the Keras Embedding layer, meaning that each word from the corpus/emails was transformed to a 100-dimensional vector.
b) GLOVE Embedding: GLOVE is a pre-trained word embedding model developed by [20]. GLOVE combines the global statistics used by matrix factorization methods like LSA (Latent Semantic Analysis) with the local context approach of the word2vec model. Pennington et al. have published several GLOVE embedding matrices of varying dimensions (50, 100, 200, 300). For this study we have employed the 100-dimensional GLOVE embedding matrix. The GLOVE Embedding layer was static, that is, the GLOVE embedding matrix was not fine-tuned (not updated during Gradient Descent) towards email classification.
C. Model Selection
1) Machine Learning Classifiers (ML): The traditional machine learning classifiers trained within this work include Multinomial Naive Bayes, Random Forest, Decision Tree, Gradient Boosting, XGBoost, Logistic Regression, K-nearest neighbors, SVM and SVM(RBF).
2) Deep Neural Network Classifiers (DNN):
a) ANN – Artificial Neural Network: ANN is a feed forward neural network that can identify patterns within data. An ANN comprises several interconnected layers of nodes. The connections between the nodes have adjustable parameters. These parameters, along with the connections among the nodes, determine the output of the ANN.
b) Bi-LSTM – Bi-Directional Long Short-Term Memory: RNNs (Recurrent Neural Networks) specialize in processing sequential or time-dependent data because of their ability to utilize context (retained memory across inputs) when making final predictions. Bi-LSTM is a type of RNN that processes sequential data in forward (past) and backward (future) directions, making it more efficient in sequential learning tasks (e.g., machine translation).
c) CNN – Convolutional Neural Network: CNNs employ convolution operations on the data matrix to reduce its dimensions while retaining important features of the data-set. CNNs take in sequential data (text) as a 1-dimensional matrix and, consequently, perform a 1-dimensional convolution operation.
D. Model Evaluation
The classifiers and neural network models trained within this work were evaluated using the following metrics.
True Positives (TP): Total number of spam emails correctly recognized.
True Negatives (TN): Total number of benign/ham emails correctly recognized.
False Positives (FP): Total number of ham emails falsely recognized as spam emails.
False Negatives (FN): Total number of spam emails falsely recognized as ham emails.
Precision: In the context of this work, precision is defined as the ratio of correctly predicted spam emails to all emails predicted as spam.
$$\mathrm{Precision} = \frac{tp}{tp + fp} \tag{1}$$
High precision means the model produces few false positives relative to true positives.
Recall: In the context of this work, recall is defined as the ratio of correctly predicted spam emails to all actual spam emails.
$$\mathrm{Recall} = \frac{tp}{tp + fn} \tag{2}$$
The higher the recall, the larger the share of positive events (spam emails) the model identifies and labels correctly.
F1 score: The F1 score is the harmonic mean of precision and recall.
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}$$
The best performing models have an F1 score close to 1.
Accuracy: Accuracy is the ratio of correctly classified emails (both spam and ham) among all emails in the test or train set.
$$\mathrm{Accuracy} = \frac{tp + tn}{tp + fp + fn + tn} \tag{4}$$
AU-ROC: The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate across classification thresholds. The area under the curve (AUC) of the ROC is a measure of separability, that is, the ability of a model to distinguish between classes correctly.
IV. RESULT ANALYSIS AND DISCUSSION
A. Experimental Setup:
The traditional machine learning models were trained on a laptop using a Jupyter notebook environment. The deep learning models (ANN, CNN, Bi-LSTM) were trained using a Google Colab GPU and high RAM configuration. Libraries used within this work include: Pandas, Numpy, Seaborn, Matplotlib, WordCloud, Scikit-learn, Keras, NLTK and Tensorflow.
B. Comparison of traditional machine learning classifiers:
Table I illustrates a comparison of all ML classifiers trained in our work. The table shows that XGBoost achieved the best scores for recall, f1 score, accuracy and AU-ROC. SVM (RBF) achieved the best precision score. SVM (Linear) achieved the second-best evaluation scores. KNN and Decision tree achieved the lowest evaluation scores. All the classifiers received evaluation scores of over 95 percent, which was expected and is an improvement over [4], [6], [7], [16], [21].
[16] shows that word embedding with CNN-LSTM achieves an accuracy of 95.9%, recall of 1.0, precision of 0.936, f1 score of 0.967 and a G-mean of 96.7%. This paper also shows that FastText email representation in conjunction with CNN-LSTM achieves the same evaluation scores as word embedding, with the exception of precision, which is 93.5%. [6] shows that Text CNN achieves an accuracy of 97.54% and an f1 score of 0.97. [7] shows a Bert based model (best performing model) with an accuracy of 0.9730 and f1 score of 0.9696 on the training set and 0.9867 and 0.9866 respectively on the holdout set. [16] shows a Logistic model tree classifier with an accuracy of 85.9%. [21] shows that their proposed model QUAGGA produces a precision, recall and accuracy score of 0.98 each.
The top 3 classifiers (XGBoost, SVM-Linear, SVM-RBF) within our work outperform all the models from [4], [6], [7], [16], [21].
C. Comparison of Deep Neural Network (DNN) models
Table II illustrates a comparison of all deep learning models (ANN, CNN, Bi-LSTM) with keras embedding and pretrained GLOVE embedding trained in our work. The table shows that keras embedding based DNN models outperform pretrained GLOVE embedding based models by a very small margin in terms of evaluation scores.
The GLOVE embeddings were static in nature, which means they were used in conjunction with the DNN models and were not updated or fine-tuned during Gradient Descent (training). As such, these models had a significantly lower number of trainable parameters compared to keras embedding based models. Despite being static in nature, GLOVE based models achieved very high evaluation scores. The reason is that pre-trained word embedding models (word2vec, GLOVE) encapsulate word similarities off the shelf.
Among GLOVE based models, ANN GLOVE-1 and ANN GLOVE-2 have the lowest evaluation metrics, which was expected due to the complexity of the problem, the structure of the data and the general working principle of artificial neural networks (ANN).
Overall, the DNN models trained in this work outperform all other models proposed within the literature, specifically [4], [6], [7], [16], [21].
Figure 2 shows heat-maps of the classification report for the DNN models (both keras and GLOVE embedding based). GLOVE based ANN models have the highest false positives and false negatives, while keras embedding based models have the lowest.
Figure 3 shows the AU-ROC curves for the DNN models. ANN GLOVE-1 and ANN GLOVE-2 incurred the lowest AU-ROC scores because of low precision and recall as well as high false negatives and false positives. All other DNN models have an AU-ROC score of over 0.95, which means these models have generalized to the imbalanced data-set and were able to distinguish spam and ham emails with moderately high accuracy.
V. CONCLUSION
The primary objective of this paper was threefold. We have provided a concise comparison of traditional machine learning algorithms (Naive Bayes, SVM) for email classification using the Enron corpus. XGBoost achieved the highest evaluation scores among these classifiers. We have trained six DNN models using pre-trained GLOVE embedding and three DNN models using keras embedding. We have provided a rigorous comparison of these nine DNN models. Keras embedding based DNN models, due to their large number of trainable parameters (Table II), have outperformed other models and classifiers. We have also observed that pretrained GLOVE embedding based DNN models (CNN GLOVE-1, CNN GLOVE-2, Bi-LSTM GLOVE-1 and Bi-LSTM GLOVE-2) have achieved extremely high evaluation scores. This shows that transfer learning can be extremely useful and, in many cases, better for downstream NLP tasks like text classification.
Future work includes using the entire Enron directory to generate a balanced data-set of approximately 0.5 million messages or emails, generating adversarial attacks to evaluate the robustness of the trained models, implementing a Google Bert embedding layer and using Bert models.
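To make the trainable-parameter argument concrete, the following minimal sketch counts the parameters of a single embedding matrix when it is trained (as with the keras embedding) and when it is frozen (as with the static GLOVE embedding). The vocabulary size of 20,000 is a hypothetical value chosen for illustration, since it is not reported here; only the 100-dimensional embedding size comes from this work. The helper name `embedding_parameter_counts` is likewise an assumption and not part of any library.

```python
def embedding_parameter_counts(vocab_size, embedding_dim, trainable):
    """Return (trainable, frozen) parameter counts for one embedding matrix."""
    # The matrix stores one embedding_dim-sized vector per vocabulary entry.
    total = vocab_size * embedding_dim
    return (total, 0) if trainable else (0, total)


# Hypothetical vocabulary size (assumption; not reported in the paper).
VOCAB_SIZE = 20_000
# Both embeddings used in this work are 100-dimensional.
EMBEDDING_DIM = 100

keras_counts = embedding_parameter_counts(VOCAB_SIZE, EMBEDDING_DIM, trainable=True)
glove_counts = embedding_parameter_counts(VOCAB_SIZE, EMBEDDING_DIM, trainable=False)

print("keras embedding (trainable, frozen):", keras_counts)
print("GLOVE embedding (trainable, frozen):", glove_counts)
```

Under these assumed sizes the embedding matrix is the same size in both cases; the only difference is whether Gradient Descent is allowed to update it, which is what lowers the trainable-parameter count of the GLOVE based models.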
Fig. 2. Heat-map of the classification report for Deep learning models
TABLE I
COMPARISON OF MACHINE LEARNING CLASSIFIERS
TABLE II
COMPARISON OF DEEP LEARNING MODELS
Model                        Precision  Recall  f1 score  Train Accuracy (%)  Test Accuracy (%)  Error  Loss  AU-ROC  Train Time (s)
ANN(Keras embedding)         0.9888     0.9894  0.9899    99.91               99.18              0.81   0.02  0.9888  48.53
CNN(Keras embedding)         0.9865     0.9893  0.9922    100                 99.18              0.81   0.03  0.9865  54.47
Bi-LSTM(Keras embedding)     0.9789     0.9833  0.9879    99.82               98.73              1.27   0.05  0.9789  119.11
CNN(GLOVE embedding-1)       0.9799     0.9788  0.9777    100                 98.36              1.63   0.69  0.9799  942.53
Bi-LSTM(GLOVE embedding-1)   0.9775     0.9764  0.9754    97.52               98.18              1.81   0.07  0.9775  3592.29
Bi-LSTM(GLOVE embedding-2)   0.9630     0.9678  0.9728    98.27               97.54              2.45   0.05  0.9630  4142.67
CNN(GLOVE embedding-2)       0.9601     0.9665  0.9733    99.36               97.45              2.54   0.13  0.9601  202.71
ANN(GLOVE embedding-1)       0.9109     0.8810  0.8623    88.81               90.17              9.82   0.25  0.9109  806.91
ANN(GLOVE embedding-2)       0.7085     0.7461  0.8954    81.30               84.53              15.46  0.34  0.7085  318.35
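As an illustration of how the metric columns in the tables relate to the definitions in Section III-D, the sketch below computes precision, recall, F1 score and accuracy from the four confusion counts. The counts are hypothetical values chosen only so that they sum to the 1099 test emails; they are not results from this work, and `classification_metrics` is an illustrative helper name rather than a library function.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1 score and accuracy from confusion counts."""
    # Guard against empty denominators (e.g. a model that never predicts spam).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical confusion counts (assumption): 290 + 800 + 5 + 4 = 1099 emails.
metrics = classification_metrics(tp=290, tn=800, fp=5, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 is a harmonic mean, its value always lies between the precision and the recall it is computed from, which is a quick sanity check when reading a table of reported scores.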