
University of Maiduguri

Faculty of Engineering Seminar Series


Volume 9 number 1, July 2018
Random Forests Machine Learning Technique for Email Spam Filtering
E. G. Dada and S. B. Joseph
Department of Computer Engineering, University of Maiduguri, Maiduguri – Borno State,
Nigeria.
[email protected]; +2349084222298
Abstract
Email spam is one of the major challenges faced daily by every email user in the world. On a
daily basis, email users receive hundreds of spam mails with new content, from anonymous
addresses which are automatically generated by robot software agents. Traditional methods of
spam filtering, such as black lists and white lists (of domains, IP addresses and mailing
addresses), have proven grossly ineffective in curtailing the menace of spam messages. This has
brought to the fore the need for highly reliable email spam filters. Recently, machine learning
approaches have been successfully applied to detecting and filtering spam emails. This paper
proposes the use of the random forest machine learning algorithm for efficient classification of
email spam messages. The main purpose is to develop a spam email filter with better prediction
accuracy and fewer features. From the Enron public dataset, consisting of 5180 ham, spam and
normal emails, a set of prominent spam email features (from the literature) was extracted and
applied by the random forests algorithm, with a resultant classification accuracy of 99.92%, a
very low false positive rate (0.01) and a very high true positive rate of 0.999. All experiments
were conducted in the WEKA data mining and machine learning simulation environment.
Keywords: Machine learning, Spam filtering, Random Forests, Neural Networks, Support
Vector Machines, Naïve Bayes

1.0 Introduction
Recently, unsolicited commercial bulk emails, popularly referred to as spam, have constituted a
big problem on the internet. Spammers sending fraudulent emails harvest email addresses from
various websites and through viruses and malware (Awad and Foqaha, 2016). Spam hinders
internet users from maximizing storage capacity and network bandwidth. The presence of a large
volume of spam mails in computer networks is detrimental to the effective usage of email
servers' memory, bandwidth, CPU processing time and user time (Fonseca et al., 2016). Reports
showed that spam mails accounted for more than 77% of global email traffic (Kaspersky, 2017).
Spam emails are very annoying and inimical to users who have fallen victim to 419 internet
mails and other fraudulent practices that aim to lure unsuspecting persons into releasing
confidential information such as usernames and passwords, Bank Verification Numbers (BVN) and
credit card numbers. Several works proposing various approaches to email spam filtering have
been published in the literature, and these have been successfully applied to classify emails
as either spam or non-spam.
These techniques include probabilistic, decision tree, artificial immune system (Bahgat et al.,
2016), support vector machine (SVM) (Bouguila and Amayri, 2009), artificial neural networks
(ANN) (Cao et al., 2004), and case-based techniques (Fdez-Riverola et al., 2007). It has been
demonstrated that it is possible to use these machine learning techniques to filter out spam
mails by employing a content-based filtering approach that has the ability to identify particular

features in email messages (usually keywords frequently used in spam emails). The frequency
at which these features occur in an email determines the likelihood that the email will be
classified as spam when measured against a threshold value. Email messages that exceed the
threshold value are classified as spam (Mason, 2003). Karthika and Visalakshi (2015) compared the
performance of hybridized ACO and SVM with KNN, NB and SVM algorithms on spambase
dataset taken from UCI repository. Awad and Foqaha (2016) evaluated the performance of
PSO, RBFNN, MLP and ANN using the UCI spambase dataset. Sharma and Suryawanshi
(2016) compared the performance of kNN with spearman and kNN with Euclidean using the
spambase dataset taken from the UCI repository. Awad and Elseuofi (2011) reviewed six
state-of-the-art machine learning methods (Bayesian classification, k-NN, ANNs, SVMs, artificial
immune systems and rough sets) and their applicability to the problem of spam email
classification. Alkaht and Al-Khatib (2016) compared the performance of NN, MLP and Perceptron
on a dataset of randomly collected emails. Dhanaraj and Palaniswami (2014) evaluated
the performance of Firefly, NB, NN and PSO algorithm on CSDMC2010 spam corpus dataset.
Palanisamy, Kumaresan and Varalakshmi (2016) compared the performance of NSA, PSO,
SVM, NB and DFS-SVM using Ling spam dataset. Zavvar, Rezaei and Garavand (2016)
compared the performance of PSO, SOM, kNN and SVM on spambase datasets retrieved from
the UCI repository. Sosa (2010) evaluated the performance of Sinespam, a machine-learning spam
classification technique, on a corpus of 2200 e-mails from several senders to various receivers
gathered by an ISP. Akshita (2016) applied a deep learning technique to content-based spam
classification, using a DL4J deep network on the PU1, PU2, PU3, PUA and Enron spam datasets.
The main problem with many of the techniques discussed above is the low performance of the
filters; there is a need to increase their classification accuracy. In addition, many of them
are not robust and find it difficult to cope with the evolving nature of spam.

2.0 Materials and Methods


The majority of email spam filtering methods use text categorization approaches. Consequently,
many spam filters perform poorly and cannot efficiently prevent spam mails from getting to
users' inboxes. This work employs the Random Forests (RFs) algorithm to extract important
features from emails and classify the emails as ham, spam or normal. The Enron spam dataset was
used as the benchmark dataset. The Random Forests machine learning algorithm was simulated
using WEKA (Wang, 2005). WEKA provides a set of machine learning algorithms that can be used
for data preprocessing, classification, regression, clustering and association rules. Machine
learning techniques implemented in WEKA are helpful in solving different real-world problems.
The toolkit provides a well-defined structure for researchers and developers to experiment with
different machine learning algorithms and to build and evaluate their models. All experiments
were conducted on a machine with an AMD A10-7300 Radeon R6 (10 Compute Cores, 4C+6G, 1.90 GHz)
and 8.00 GB of RAM.

2.1 Random Forests


Random forests (RFs) is a classic ensemble learning technique for classification and regression,
well suited to solving data classification problems (Akinyelu and Adewumi, 2016). Breiman and
Cutler (2007) proposed the RFs algorithm. The algorithm classifies data into different classes
using decision trees. During the training phase, a number of decision trees are created and
later used for the classification tasks. Classification works by taking the class predicted by
each individual tree; the class with the highest number of votes is taken as the final result.
The RF algorithm has become very popular over the years and has been applied to related
problems in various fields of human endeavor (Fette et al., 2007; Koprinska et al., 2007;
Whittaker et al., 2010). Random forests have several advantages: they achieve reduced
classification error and better F-scores than many other machine learning techniques, and
their performance is generally as good as, or even superior to, that of SVMs. They can
efficiently handle unbalanced data sets with missing values, and serve as an effective method
for estimating missing values while maintaining accuracy even when a significant proportion of
the data is missing. The training time for RFs is usually shorter than that of SVMs and neural
networks (though this depends on the implementation). RF is better than many existing machine
learning algorithms in terms of accuracy, performs very well on large databases, and can
efficiently process hundreds of thousands of input variables. RF computes an internal unbiased
estimate of the generalization error as the forest is grown, and provides methods for balancing
error in class-imbalanced data sets. RFs can also process unlabeled data effectively, which
makes them an appropriate technique for clustering unlabeled data. Random forests are not
complicated to use and require few parameters relative to the number of observations, and the
user can grow as many trees as desired at high speed. RFs classify a new instance by passing
its input vector down each individual tree in the forest. Each tree carries out its own
classification, known as the tree's "vote" for a class; the forest then selects the class with
the highest overall number of votes. The steps for growing the trees are outlined below:
1. Let N be the number of training instances. Randomly sample N instances, with replacement,
   from the existing data. These instances are used as the training set for growing the tree.
2. Let P be the number of input variables. A number p << P is specified such that, at each
   node, p variables are selected at random from the P variables and the best split on these
   p variables is used to partition the node. The value of p is held constant while the forest
   is grown.
3. No pruning is performed; each tree is grown to the largest extent possible.
A tree is referred to as a strong classifier when it has a low error rate. The error rate of
the forest decreases as the strength of the individual trees increases and the correlation
between them decreases. Reducing the value of p decreases both the correlation between trees
and the strength of each tree, while increasing p increases both; somewhere in between is an
optimal range of p, which is usually quite wide. Using the out-of-bag (OOB) error (also known
as the out-of-bag estimate), a value of p within this range can be found quickly. This is the
only tuning parameter to which random forests are somewhat sensitive.
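As a concrete illustration of the OOB estimate described above, here is a hedged sketch using scikit-learn's random forest on synthetic stand-in data (the paper's own tuning is done inside WEKA, not with this library):

```python
# Grow a forest with bootstrap sampling and report the out-of-bag (OOB)
# error estimate: each tree is scored on the instances it never saw.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=25, random_state=0)
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0).fit(X, y)
print("OOB accuracy estimate: %.3f" % forest.oob_score_)
```

Because the OOB estimate comes for free during training, no separate validation split is needed to compare candidate values of p (here, `max_features`).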
The algorithm below concisely outlines the steps required to create the forest of trees.
Start RF Algorithm
Input:  D: the training data
        N: number of features
        X: maximum number of nodes per tree
        Y: number of trees to be grown
Output: G: the class with the highest number of votes
For i = 1 to Y do
    Select a bootstrap sample S randomly, with replacement, from the training data D
    Create tree Ri from the bootstrap sample S using the steps below:
        (1) Select n features randomly from the N features, where n << N
        (2) Calculate the best splitting point for node d among the n features
        (3) Split the parent node into two child nodes using the optimal split
        (4) Repeat steps (1)-(3) until the maximum number of nodes (X) is reached
EndFor
To classify a new sample, pass it down each tree R1, ..., RY starting from the root node
In each tree, assign the sample the class of the leaf node it reaches (the tree's "vote")
Merge the votes of all the trees
Output the class with the highest number of votes (G)
End RF Algorithm
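The forest-growing loop above can be sketched in Python, using scikit-learn's decision trees as the individual learners (an illustrative sketch on synthetic data, not the authors' WEKA implementation; the names Y_TREES and G mirror the pseudocode):

```python
# Minimal random forest: Y bootstrap samples, one unpruned tree per sample
# with a random feature subset at each node, then a majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

Y_TREES = 25
trees = []
for _ in range(Y_TREES):
    # Bootstrap sample S: N instances drawn with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Grow an unpruned tree; at each node the best split is chosen
    # among a random subset of sqrt(N) features (n << N)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Each tree votes; the class G with the most votes wins
votes = np.stack([t.predict(X) for t in trees])   # shape (Y_TREES, samples)
G = (votes.mean(axis=0) > 0.5).astype(int)        # majority vote (2 classes)
print("accuracy of the voted forest on the training set:", (G == y).mean())
```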

2.2 Dataset Used for Experiment


The Enron spam dataset, from the Enron Corporation, was used for our experiment (Koprinska et
al., 2007). The dataset consists of 5180 emails in three folders: norm for normal, ham for
non-spam and spam for spam emails. Of the 5180 instances, 3672 are ham, 8 are norm, and 1500
are spam emails. The dataset features are as follows:
i. Specific words or characters that recur in the emails.
ii. Run-length attributes (55-57) that measure the length of sequences of consecutive
capital letters.

2.3 Data Normalization Process


The original dataset used in our experiments consists of 5180 text files. The data contained in
those files are not normalized, so they have to be normalized before they can serve as input to
WEKA. All the data must be converted to a single .arff file before it can be given to WEKA for
training. To achieve this, we use the following command in the WEKA command-line interface:
“java weka.core.converters.TextDirectoryLoader -dir D:/Enron > D:/Spam_mails.arff”
After the normalization process, the normalized file was given to WEKA for pre-processing.
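For readers without WEKA, the same directory-to-dataset step can be approximated in Python with scikit-learn's load_files, which, like TextDirectoryLoader, treats each subfolder name as a class label (a sketch on a tiny throwaway corpus, not the actual Enron data):

```python
# Build a tiny two-folder corpus in a temp directory, then load it the way
# TextDirectoryLoader would: subfolder names become the class labels.
import tempfile
from pathlib import Path
from sklearn.datasets import load_files

root = Path(tempfile.mkdtemp())
for label, text in {"ham": "meeting moved to noon", "spam": "win cash now"}.items():
    (root / label).mkdir()
    (root / label / "0.txt").write_text(text)

corpus = load_files(root, encoding="utf-8")
print(len(corpus.data), "emails; classes:", list(corpus.target_names))
```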

2.4 Feature Extraction


After the pre-processing phase comes feature extraction. Feature extraction is the process of
choosing a subset of the terms occurring in the training set and using only this subset as
features in text classification. This is achieved using a set of rules. Feature extraction
makes training and applying a classifier more efficient by decreasing the size of the effective
vocabulary, and it usually also enhances classification accuracy by removing noise features.
Some of the important email features we used for our spam filtering include: message body and
subject, volume of the message, occurrence count of words, number of semantic discrepancy
patterns in the message, recipient age, sex and country, whether the recipient replied, adult
content, bag of words from the message content, domain name, IP address, and number of blank
lines in the body.
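One common realisation of the bag-of-words features listed above can be sketched as follows (an assumption for illustration; the paper uses WEKA's own filters, not this library):

```python
# Turn raw email text into word-count features with a bag-of-words model.
from sklearn.feature_extraction.text import CountVectorizer

emails = ["win cash now now", "meeting moved to noon"]
vec = CountVectorizer()
X = vec.fit_transform(emails)      # rows: emails, columns: vocabulary words
print(sorted(vec.vocabulary_))     # the effective vocabulary
print(X.toarray())                 # occurrence count of each word per email
```

Feature selection then amounts to keeping only the most informative columns of this matrix, which shrinks the effective vocabulary as described above.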

3.0 Results and Discussion


This section presents the results of the experiments performed. The Random Forest algorithm was
applied to classify and evaluate the dataset. We used the 10-fold cross-validation test, an
approach for appraising predictive models that divides the original set into training samples
to train the model and a test set for its evaluation. First, the training of the dataset was
performed with the feature vectors extracted by analyzing each message header and checking
keywords and whitelists/blacklists. The trained model was then evaluated for classification
accuracy using 10-fold cross-validation. Classification accuracy is one of the performance
metrics for email spam classification; it is measured as the ratio of the number of correctly
classified instances in the test dataset to the total number of test cases. In spam filtering,
false negatives mean that some spam mails were wrongly classified as non-spam and allowed to
enter the user's inbox. False positives mean that non-spam emails were mistakenly classified as
spam and moved to the spam folder or discarded. For most users, erroneously classifying valid
emails as spam is far more costly than receiving spam mails in their inbox. The false positive
rate is therefore also one of the performance metrics used in evaluating the effectiveness of
an email spam filter. Depicted in Figure 1 below is a screen shot of our output in the WEKA
simulation environment.

Fig. 1: Screen shot of Random Forests classification output for Enron spam emails datasets
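The 10-fold cross-validation protocol just described can be sketched as follows (synthetic stand-in data; the actual experiments were run in WEKA):

```python
# 10-fold cross-validated accuracy for a random forest classifier: the data
# are split into 10 folds, each fold serving once as the held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30, random_state=1)
scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=10)
print("mean 10-fold accuracy: %.4f" % scores.mean())
```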
3.1 Effectiveness
In this section, we evaluate the effectiveness of the RFs classifier in terms of the time taken
to create the model, correctly classified instances, incorrectly classified instances and
classification accuracy. The results are shown in Table 1.
Table 1: Performance Evaluation of the RFs Algorithm

Evaluation Criteria                     RFs
Time taken to create model (s)          17.75
Correctly classified instances          5176
Incorrectly classified instances        4
Accuracy (%)                            99.92

To give a fair and more complete performance evaluation of the algorithm under consideration,
simulation error is also taken into account in this work. The effectiveness is assessed using
the following measures: Kappa statistic (KS), Mean Absolute Error (MAE), Root Mean Squared
Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE). The KS,
MAE and RMSE are numeric values; RAE and RRSE are percentages. The results are shown in Table 2.

Table 2. Training and Simulation Error of the RFs Algorithm

Evaluation Criteria                        RFs
Kappa Statistic (KS)                       0.9981
Mean Absolute Error (MAE)                  0.0296
Root Mean Squared Error (RMSE)             0.06
Relative Absolute Error (RAE) %            10.7404
Root Relative Squared Error (RRSE) %       16.1506
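For clarity, the MAE and RMSE reported in Table 2 are computed as follows (toy numbers, not the paper's predictions):

```python
# MAE and RMSE between true 0/1 labels and predicted spam probabilities.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.1, 0.8, 0.95, 0.2])   # predicted P(spam)
mae = np.abs(y_true - p_pred).mean()
rmse = np.sqrt(((y_true - p_pred) ** 2).mean())
print("MAE = %.3f, RMSE = %.3f" % (mae, rmse))
```

RAE and RRSE divide these quantities by the corresponding errors of a trivial predictor (the mean of the training labels), which is why they are reported as percentages.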

3.2 Efficiency
After creating the predictive model, the efficiency of the RFs algorithm was evaluated as
shown in table 3 below.

Table 3. Performance evaluation of the RFs algorithm based on TPR, FPR, Precision, and F-Score

Technique   TPR     FPR     Precision   F-Score   Class
RFs         0.999   0.001   1.000       0.998     Ham
            1.000   0.000   1.000       1.000     Norm
            0.999   0.001   0.998       0.998     Spam

From Table 1, it took RFs about 17.75 s to create its model, and RFs achieved a classification
accuracy of 99.92%. It is also clear from the results that RFs performed excellently in terms
of a very high number of correctly classified instances and a very low number of incorrectly
classified instances. The training and simulation errors depicted in Table 2 show that RFs
produced excellent agreement (a Kappa statistic of 0.9981) and a very low mean absolute error
(0.0296). Once the model has been created, the next step is to analyse the results generated to
determine the efficiency of the algorithm under consideration. Table 3 indicates that RFs
achieves very good results in terms of TPR, FPR, Precision and F-Score for the ham, norm and
spam classes. Table 4 below shows the confusion matrix of the RFs algorithm, which also
provides a practical way of assessing the performance of the classifier; each row of the table
denotes actual counts for a class, whereas each column indicates the predictions.

Table 4. Confusion Matrix for the RFs Algorithm

            Ham    Norm   Spam   Class
RFs         3669   0      3      Ham
            0      8      0      Norm
            1      0      1499   Spam

From Table 4 above, for the ham class RFs correctly predicts 3669 out of 3672 instances, with
3 ham instances wrongly predicted as spam. For the spam class, it correctly predicts 1499 out
of 1500 instances, with 1 spam instance wrongly predicted as ham. All 8 norm instances are
correctly classified. From our experiments it is clear that RFs performed excellently in terms
of effectiveness and efficiency, considering its classification accuracy, TPR, FPR, precision
and F-score.
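The per-class rates in Table 3 can be recovered directly from the confusion matrix in Table 4:

```python
# Derive TPR, FPR and precision for each class from the confusion matrix
# (rows: actual ham/norm/spam; columns: predicted).
import numpy as np

cm = np.array([[3669, 0,    3],
               [0,    8,    0],
               [1,    0, 1499]])
for i, cls in enumerate(["ham", "norm", "spam"]):
    tp = cm[i, i]
    fn = cm[i].sum() - tp            # actual cls, predicted as something else
    fp = cm[:, i].sum() - tp         # predicted cls, actually something else
    tn = cm.sum() - tp - fn - fp
    print(f"{cls}: TPR={tp/(tp+fn):.3f} FPR={fp/(fp+tn):.3f} "
          f"precision={tp/(tp+fp):.3f}")
```

Running this reproduces the 0.999/0.001 figures of Table 3, and the four off-diagonal entries sum to the 4 incorrectly classified instances of Table 1.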

4.0 Conclusion
Many of the existing email spam filtering techniques cannot effectively handle some of the spam
being sent daily by spammers. This is because spammers keep inventing more sophisticated
techniques for evading detection by spam filters. With the continuous adoption of new
techniques by spammers, email spam filtering has become a hot research area. In this study, we
proposed the Random Forests algorithm for effective and efficient email spam filtering, and
evaluated the performance of the RFs algorithm on the Enron spam dataset using accuracy, TPR,
FPR, precision and F-measure to determine the effectiveness and efficiency of the algorithm.
We conclude by stating that RFs is a promising algorithm that can be adopted either at the mail
server or at the mail client side to further decrease the volume of spam messages in email
users' inboxes.

References
Akinyelu A. A., and Adewumi A.O. (2016). Classification of Phishing Email Using
Random Forest Machine Learning Technique. Journal of Applied Mathematics, 2014, 6,
Article ID 425731, Retrieved on July 12, 2017 from
http://dx.doi.org/10.1155/2014/425731
Akshita T. (2016). Content Based Spam Classification- A Deep Learning Approach. A
Thesis Submitted to The Faculty Of Graduate Studies, University Of Calgary, Alberta,
Canada.
Alkaht I.J., Al-Khatib B. (2016). Filtering SPAM Using Several Stages Neural Networks.
International Review on Computers and Software, 11, 2.
Awad M. and Foqaha M. (2016). Email Spam Classification Using Hybrid Approach of RBF
Neural Network and Particle Swarm Optimization. July 2016 International Journal of
Network Security & Its Applications 8(4):17-28. DOI: 10.5121/ijnsa.2016.8402
Awad W.A. and Elseuofi S.M. (2011). Machine Learning Methods for Spam E-mail
Classification. International Journal of Computer Science and Information Technology,
3(1):173–184.
Bahgat E.M., Rady S. and Gad W. (2016). An e-mail filtering approach using classification
techniques. In The 1st International Conference on Advanced Intelligent System and
Informatics (AISI2015), November 28-30, 2015, BeniSuef, Egypt, Springer International
Publishing, 321-331.
Bouguila N. and Amayri O. (2009). A discrete mixture-based kernel for SVMs: application
to spam and image categorization, Information Processing & Management, 45(6): 631-
642.
Breiman L, Cutler A (2007). Random forests-classification description, Department of
Statistics Homepage, 2007, http://www.stat.berkeley.edu/∼breiman/RandomForests
/cchome.htm.
Cao Y, Liao X, Li Y (2004). An e-mail filtering approach using neural network, In
International Symposium on Neural Networks, Springer Berlin Heidelberg, 688-694.
Dhanaraj KR, Palaniswami V (2014). Firefly and Bayes Classifier for Email Spam
Classification in a Distributed Environment. Australian Journal of Basic and Applied
Sciences, 8(17):118-130.
Fdez-Riverola F, Iglesias EL, Diaz F, Méndez JR, Corchado JM (2007). SpamHunting: An
instance-based reasoning system for spam labelling and filtering, Decision Support
Systems, 43(3):722-736.
Fette I, Sadeh N, Tomasic A (2007). Learning to detect phishing emails, in Proceedings of
the 16th International World Wide Web Conference (WWW ’07), 649–656, Alberta,
Canada, May 2007.
Fonseca DM, Fazzion OH, Cunha E, Las-Casas I, Guedes PD, Meira W, Chaves M (2016).
Measuring, Characterizing, and Avoiding Spam Traffic Costs. IEEE Internet
Computing, 99.
Karthika R, Visalakshi P (2015). A Hybrid ACO Based Feature Selection Method for Email
Spam Classification. WSEAS Transaction on Computers, 14, pp. 171-177.
Kaspersky Lab Spam Report (2017). Visited on May 15, 2018.
https://www.securelist.com/en/analysis/204792230/Spam_Report_April_2012
Koprinska I., Poon J., Clark J., Chan J. (2007). Learning to classify e-mail, Information
Sciences, 177(10): 2167–2187.
Mason S (2003). New Law Designed to Limit Amount of Spam in E-Mail.
http://www.wral.com/technolog
Sharma A. and Suryawanshi A. (2016). A Novel Method for Detecting Spam Email using
KNN Classification with Spearman Correlation as Distance Measure. International
Journal of Computer Applications, 136(6): 28-34.
Sosa J.N. (2010). Spam Classification using Machine Learning Techniques – Sinespam.
Master of Science Thesis. Master in Artificial Intelligence (UPC-URV-UB).
Wang X. (2005). Learning to classify email: A survey. Proceedings of 2005 International
Conference on Machine Learning and Cybernetics.
Whittaker C., Ryner B., Nazif M. (2010). Large-scale automatic classification of phishing
pages. In: Proceedings of the 17th Annual Network & Distributed System Security
Symposium (NDSS ’10), The Internet Society, San Diego, Calif., USA.
Zavvar M., Rezaei M., Garavand S. (2016) Email Spam Detection using Combination of
Particle Swarm Optimization and Artificial Neural Network and Support Vector
Machine. International Journal of Modern Education and Computer Science, pp. 68-
74
