Decision Tree Model For Email Classification: Ivana Čavor
A. Email dataset
The dataset used for the classification consists of 4000 entries [10]. It contains 3465 ham and 535 spam messages. The dataset is divided into two subsets: a training set and a testing set. The size of the dataset assigned for training can affect the system's performance, as will be shown further on.
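As a concrete illustration, the loading and splitting step might look like the sketch below; the SMSSpamCollection file name (the distribution format of the UCI corpus in [10]) and the 80/20 split ratio are assumptions rather than details taken from the paper.

```python
import random

def load_dataset(path="SMSSpamCollection"):
    """Read (label, text) pairs from a tab-separated file: 'ham\t<text>' or 'spam\t<text>'."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, _, text = line.rstrip("\n").partition("\t")
            if text:
                examples.append((label, text))
    return examples

def train_test_split(examples, train_fraction=0.8, seed=42):
    """Shuffle and split the labeled messages; the 80/20 ratio is an assumption."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = train_test_split(load_dataset())
```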
B. Preprocessing of dataset
The email dataset needs to be preprocessed before feature selection can be performed. It is well known that spam mails usually contain phone numbers, email addresses, website URLs, money amounts, and a lot of whitespace and punctuation. Instead of removing these terms, for each training example they are replaced with specific placeholder strings, such as the 'numbr', 'httpaddr' and 'moneysymb' tokens that appear as feature names in Table I.

The task is to understand whether there are any specific words or sequences of words that determine whether an email is spam or not. For this purpose, the Term Frequency (TF) method is used. TF can be defined as a numerical statistic intended to reflect how important a word is to a document in a corpus. The TF value is directly proportional to the number of times a word appears in a document. Fig. 2 illustrates a word cloud of common words in spam email; the size of a word in Fig. 2 is proportional to its occurrence in spam emails. Words like 'free', 'txt' and 'call' have large TF weights, which makes them good indicators of spam.
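The replacement step described in Section B could be implemented roughly as in the following sketch. The placeholder tokens 'numbr', 'httpaddr' and 'moneysymb' mirror the feature names in Table I; the 'emailaddr' token and the exact regular expressions are assumptions, since the paper's replacement list is not reproduced in this excerpt.

```python
import re

def preprocess(text):
    """Replace spam-typical entities with placeholder tokens instead of removing them."""
    text = text.lower()
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", "emailaddr", text)  # email addresses (assumed token)
    text = re.sub(r"(https?://\S+|www\.\S+)", "httpaddr", text)        # website URLs
    text = re.sub(r"[£$€]\s*\d+(?:[.,]\d+)?", "moneysymb", text)       # money amounts
    text = re.sub(r"\b\d{5,}\b", "numbr", text)                        # phone numbers / long digit runs
    text = re.sub(r"[^\w\s]", " ", text)                               # strip punctuation
    text = re.sub(r"\s+", " ", text).strip()                           # collapse whitespace
    return text
```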
TABLE I
FEATURE MATRIX: EACH ROW REPRESENTS AN EMAIL WITH THE FEATURES PRESENTED IN COLUMNS

EMAIL     Numbr   Call   Txt   Free   Claim   Httpaddr   Moneysymb   Total_spam_words   DECISION/CLASS
Email_1   0       1      0     0      0       0          0           1                  Ham
Email_2   2       0      0     1      1       1          0           4                  Spam
Email_3   1       0      0     3      0       0          0           2                  Spam
Email_4   1       0      0     0      0       0          0           0                  Ham
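A sketch of how a feature row in the format of Table I could be assembled from a preprocessed message. The per-token counts are plain term frequencies; the spam-word list used for total_spam_words is illustrative, since the exact list behind that column is not given here.

```python
FEATURE_TOKENS = ["numbr", "call", "txt", "free", "claim", "httpaddr", "moneysymb"]
# Illustrative spam-word list for total_spam_words (assumed, not taken from the paper).
SPAM_WORDS = {"free", "call", "txt", "claim", "win", "prize", "urgent", "cash"}

def feature_vector(preprocessed_text):
    """Return term-frequency counts for each Table I column plus total_spam_words."""
    words = preprocessed_text.split()
    counts = [words.count(token) for token in FEATURE_TOKENS]
    total_spam_words = sum(1 for w in words if w in SPAM_WORDS)
    return counts + [total_spam_words]

# Example: produces a feature row in the same format as Table I.
row = feature_vector("free entry claim your prize now call numbr httpaddr")
```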
D. Decision tree
A decision tree uses a tree-like model to represent a number of possible decision paths as well as their potential outcomes [13]. Each decision tree node represents a feature, each branch represents a decision, and each leaf represents an outcome (class or decision). Decision trees can be used to predict the class of an unknown query instance by building a model trained on a set of labeled data. Each training example should be characterized by a number of descriptive features or attributes. The features can have either nominal or continuous values.
A decision tree consists of a root node, internal nodes and leaf nodes. Internal nodes represent the conditions based on which the tree splits into branches, and the leaf nodes represent the possible outcomes for each path. Each node typically has two or more nodes extending from it. "When classifying an unknown instance, the unknown instance is routed down the tree according to the values of the attributes in the successive nodes and when a leaf is reached the instance is classified according to class assigned to the leaf" [14]. The main advantage of using a decision tree is that it is easy to follow and understand. Fig. 4 presents an example of a typical decision tree. The words "free" and "money" are typical spam words and they are used as features. If the word "free" appears more than two times in an email, then the email is classified as spam. Otherwise, we ask whether the email contains the word "money". If the word "money" appears more than three times, then the email is classified as spam; otherwise it is ham.
Information gain is calculated to split the attributes further in the tree. The attribute with the highest information gain is always preferred first. Entropy and information gain are related by (2):

gain(S, A_i) = Entropy(S) - Entropy_{A_i}(S)    (2)

where Entropy_{A_i}(S) is the expected entropy if attribute A_i is used to partition the data.
The algorithm was implemented according to the following steps:
1. Create a root node.
2. Calculate the entropy of the whole (sub)dataset.
3. Calculate the information gain for each feature and select the feature with the largest information gain.
4. Assign the (root) node the label of the feature with the maximum information gain. Grow an outgoing branch for each feature value and add unlabeled nodes at the end.
5. Split the dataset along the values of the maximum-information-gain feature and remove this feature from the dataset.
6. For each sub-dataset, repeat steps 3 to 5 until a stopping criterion is satisfied.
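A minimal sketch of the entropy and information-gain computations used in steps 2 and 3 above, written for a nominal (already discretized) feature; the data layout, a list of (feature_dict, label) pairs, is an assumption for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels ('spam' / 'ham')."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, feature):
    """gain(S, A) = Entropy(S) - Entropy_A(S), where Entropy_A(S) is the
    expected entropy after partitioning S on feature A."""
    labels = [label for _, label in examples]
    partitions = {}
    for features, label in examples:
        partitions.setdefault(features[feature], []).append(label)
    expected = sum(len(part) / len(examples) * entropy(part) for part in partitions.values())
    return entropy(labels) - expected
```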
Since the chosen features have continuous values, a binary split requires converting continuous values to nominal ones. This is done using a threshold value: the threshold is the value that offers the maximum information gain for that attribute. For example, the information gain is maximized when the threshold is equal to two for the total_spam_words feature from Table I.
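The threshold search can be sketched as follows: every distinct value of a continuous feature is tried as a candidate split point and the one with the highest information gain is kept. Applied to the total_spam_words column of the four Table I rows, the sketch already settles on a threshold of two (spam for values of two or more), consistent with the example in the text.

```python
import math
from collections import Counter

def entropy(labels):
    # Same helper as in the previous sketch.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Find the binary-split threshold of a continuous feature that maximizes
    information gain; instances with value >= threshold go to one branch."""
    base = entropy(labels)
    best = (None, 0.0)
    for t in sorted(set(values)):
        below = [lab for v, lab in zip(values, labels) if v < t]
        at_or_above = [lab for v, lab in zip(values, labels) if v >= t]
        if not below or not at_or_above:
            continue
        expected = (len(below) * entropy(below) + len(at_or_above) * entropy(at_or_above)) / len(labels)
        gain = base - expected
        if gain > best[1]:
            best = (t, gain)
    return best

# total_spam_words and classes from Table I (Email_1 .. Email_4):
values = [1, 4, 2, 0]
labels = ["ham", "spam", "spam", "ham"]
print(best_threshold(values, labels))  # -> (2, 1.0)
```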
accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)

precision = TP / (TP + FP)    (4)

recall = TP / (TP + FN)    (5)
For a classifier, accuracy is the proportion of the total testing examples that the classifier predicted correctly, precision is the ratio of the number of correctly classified spam emails to the total number of emails predicted as spam, and recall is the proportion of emails correctly classified as spam among all spam emails. The performance of the proposed SD system is measured against the dataset size and the feature size. The results are presented in Table III.
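For completeness, (3)-(5) translate directly into code; the sketch below counts TP, TN, FP and FN with 'spam' treated as the positive class (function and variable names are illustrative).

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives, with 'spam' as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    """Accuracy, precision and recall as defined in (3), (4) and (5)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall
```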
TABLE III
CLASSIFICATION RESULTS BASED ON DATASET SIZE AND FEATURE SIZE

Dataset size   Feature size   Accuracy [%]   Precision [%]   Recall [%]
1000           7              97.4           92.01           87.21
1000           3              96.63          85.61           88.51
1500           7              97.32          92.28           86.21
1500           3              96.56          85.62           87.77
3000           7              97.2           91.52           85.71
3000           3              96.3           83.96           87.30
Datasets of different sizes are used for measuring the performance. For example, when 1000 emails and 7 features are used for the training process, the decision tree classifier achieves an accuracy of 97.4%. The precision and recall values are 92.01% and 87.21%, respectively. Reducing the number of features to 3 decreases the accuracy to 96.63%, with precision and recall values of 85.61% and 88.51%, respectively. The dataset size only slightly affects accuracy: the accuracy for 1500 training examples and 3000 training examples was 97.32% and 97.2%, respectively.
IV. CONCLUSION
In this paper, decision tree-based classification is employed for spam email detection. A novel approach for feature selection and reduction is also presented. It is shown that the system achieves high accuracy with a few features and with a relatively small training dataset. In the near future, it is planned to incorporate other classifiers and to compare their performance with the proposed approach.

REFERENCES
[1] P. Sharma and U. Bhardwaj, "Machine Learning based Spam E-Mail Detection," International Journal of Intelligent Engineering & Systems, vol. 11, no. 3, 2017.
[2] A. S. Rajput, J. S. Sohal and V. Athavale, "Email Header Feature Extraction using Adaptive and Collaborative approach for Email Classification," International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, vol. 8, no. 7S, May 2019.
[3] P. Kulkarni, J. R. Saini and H. Acharya, "Effect of Header-based Features on Accuracy of Classifiers for Spam Email Classification," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 11, no. 3, 2020.
[4] E. G. Dada, S. B. Joseph, H. Chiroma, S. Abdulhamid, A. Adetunmbi and O. E. Ajibuwa, "Machine learning for email spam filtering: review, approaches and open research problems," Heliyon, June 2019.
[5] E. M. Bahgat, S. Rady, W. Gad and I. F. Moawad, "Efficient email classification approach based on semantic methods," Ain Shams Engineering Journal, vol. 9, no. 4, pp. 3259-3269, December 2018.
[6] F. Ruskanda, "Study on the Effect of Preprocessing Methods for Spam Email Detection," Indonesian Journal on Computing (Indo-JC), vol. 4, p. 109, March 2019.
[7] A. Sharma, Manisha, D. Manisha and D. R. Jain, "Data Pre-Processing in Spam Detection," International Journal of Science Technology & Engineering (IJSTE), vol. 1, no. 11, May 2015.
[8] L. Shi, Q. Wang, X. Ma, M. Weng and H. Qiao, "Spam Email Classification Using Decision Tree Ensemble," Journal of Computational Information Systems, vol. 8, March 2012.
[9] S. Balamurugan and R. Rajaram, "Suspicious E-mail Detection via Decision Tree: A Data Mining Approach," January 2007.
[10] T. A. Almeida and J. M. Gómez Hidalgo, SMS Spam Collection, UCI Machine Learning Repository, viewed 12 September 2020, https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
[11] C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
[12] A. Bhowmick and S. M. Hazarika, "Machine Learning for E-mail Spam Filtering: Review, Techniques and Trends," 2016.
[13] J. Grus, Data Science from Scratch: First Principles with Python, O'Reilly Media, Inc., April 2015.
[14] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, 2000.
[15] T. Kristensen and G. Kumar, "Entropy based disease classification of proteomic mass spectrometry data of the human serum by a support vector machine," Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, 2005.