Random Forests Machine Learning Technique For Email Spam Filtering E. G. Dada and S. B. Joseph
1.0 Introduction
Recently, unsolicited commercial bulk email, popularly referred to as spam, has become a
big problem on the internet. Spammers who send fraudulent emails harvest email
addresses using various websites, viruses and malware (Awad and Foqaha, 2016). Spam
prevents internet users from making full use of storage capacity and network bandwidth. A
large volume of spam in computer networks wastes email server memory, bandwidth, CPU
processing time and user time (Fonseca et al., 2016). Reports showed that spam accounts
for more than 77% of global email traffic (Kaspersky, 2017). Spam emails are annoying and
harmful to users who have fallen victim to 419 internet scams and other fraudulent emails
sent to lure unsuspecting persons into releasing confidential information such as user
names and passwords, Bank Verification Numbers (BVN) and credit card numbers. Several
works in the literature have proposed approaches to email spam filtering that have been
successfully applied to classify emails as either spam or non-spam. These techniques
include probabilistic methods, decision trees, artificial immune systems (Bahgat et al.,
2016), support vector machines (SVM) (Bouguila and Amayri, 2009), artificial neural networks
(ANN) (Cao et al., 2004), and case-based techniques (Fdez-Riverola et al., 2007). It has been
demonstrated that these machine learning techniques can filter out spam mails by employing
a content-based filtering approach that has the ability to identify particular
Fig. 1: Screen shot of Random Forests classification output for Enron spam emails datasets
3.1 Effectiveness
In this section, we evaluate the effectiveness of all machine learning classifiers in terms of time
taken to create the model, correctly classified instances, incorrectly classified instances and
classification accuracy. The results are shown in Table 1.
Seminar Series, Volume 9(1), 2018 Page 33
Dada & Joseph: Random Forests Machine Learning Technique for Email Spam Filtering
Table 1: Performance Evaluation of RFs Algorithm
Evaluation Criteria                  RFs
Time taken to create model (s)       17.75
Correctly classified instances       5176
Incorrectly classified instances     4
Accuracy (%)                         99.92
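The accuracy in Table 1 follows directly from the two instance counts. A minimal sketch of that arithmetic, assuming accuracy is defined as correctly classified instances divided by all classified instances (the function name `accuracy_percent` is ours, for illustration only):

```python
def accuracy_percent(correct: int, incorrect: int) -> float:
    """Return classification accuracy as a percentage of all instances."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("at least one classified instance is required")
    return 100.0 * correct / total


# Counts taken from Table 1: 5176 correct, 4 incorrect (5180 instances in total).
acc = accuracy_percent(5176, 4)
print(f"{acc:.2f}%")  # prints 99.92%
```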
For a fairer and more complete performance evaluation of the algorithm under
consideration, simulation error is also taken into account in this work. Effectiveness is
assessed using the following measures: Kappa statistic (KS), Mean Absolute Error
(MAE), Root Mean Squared Error (RMSE), Relative Absolute Error (RAE) and Root Relative
Squared Error (RRSE). KS, MAE and RMSE are numeric values; RAE and RRSE are
percentages. The results are shown in Table 2.
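The error measures listed above can be sketched as follows. This is an illustrative implementation under our own assumptions, not the tool's exact procedure: targets and predictions are taken to be numeric (for a classifier, typically 0/1 class indicators against predicted class probabilities), and RAE and RRSE are measured relative to a baseline that always predicts the mean of the actual values. The names `error_metrics` and `kappa` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def error_metrics(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMSE, RAE (%) and RRSE (%) for paired numeric values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Baseline errors: a predictor that always outputs the mean of the actual values.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    if base_abs == 0 or base_sq == 0:
        raise ValueError("relative errors are undefined when all actual values are equal")
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE_percent": 100.0 * abs_err / base_abs,
        "RRSE_percent": 100.0 * math.sqrt(sq_err / base_sq),
    }


def kappa(confusion: List[List[int]]) -> float:
    """Cohen's kappa from a square confusion matrix (rows actual, columns predicted)."""
    total = sum(sum(row) for row in confusion)
    k = len(confusion)
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (total ** 2)
    return (observed - expected) / (1 - expected)


# Toy data, invented purely to exercise the functions.
toy = error_metrics([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
print(toy)
print(kappa([[2, 0], [0, 2]]))
```

A kappa near 1 indicates agreement well beyond chance, while MAE and RMSE near 0 indicate predictions close to the true labels.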
3.2 Efficiency
After creating the predictive model, the efficiency of the RFs algorithm was evaluated as
shown in Table 3 below.
Table 3: Performance evaluation of RFs algorithm based on TPR, FPR, Precision, and F-Score
Technique   TPR     FPR     Precision   F-Score   Class
RFs         0.999   0.001   1.000       0.998     Ham
            1.000   0.000   1.000       1.000     Norm
            0.999   0.001   0.998       0.998     Spam
From Table 1, RFs took about 17.75 s to create its model and achieved a classification
accuracy of 99.92%. The results also show that RFs performed excellently, with a very high
number of correctly classified instances and a very low number of incorrectly classified
instances. The training and simulation errors in Table 2 show that RFs produced an
excellent classification result (0.9992) and a very low error rate (0.0296). Once the model has
been created, the next step is to analyse the results to determine the efficiency of the
algorithm under consideration. Table 3 indicates that RFs gives very good results in terms of
TPR, FPR, Precision and F-Score for the ham, norm and spam classes. Table 4 gives the
confusion matrix of the RFs algorithm, which provides a practical way of assessing the
performance of a classifier: each row of the table denotes the actual class, whereas
each column denotes the predicted class.
From Table 4, RFs correctly predicts 3669 of the 3672 ham instances (ham instances that are
truly ham) and wrongly predicts 3 (ham instances predicted as spam). It also correctly
predicts 1499 of the 1500 spam instances (spam instances that are truly spam) and wrongly
predicts 1 (a spam instance predicted as ham). From our experiments it is clear that RFs
performed excellently in terms of effectiveness and efficiency, considering its classification
accuracy, TPR, FPR, precision and F-score.
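The per-class measures can be derived from a confusion matrix whose rows are actual classes and whose columns are predictions. The sketch below assumes a two-class matrix built from the counts quoted above (3669 of 3672 ham and 1499 of 1500 spam correct, with the misclassified counts taken as the row remainders, 3 and 1). The norm class is left out because its counts are not given in the text, so the output is illustrative and not a reproduction of Table 3. The function name `per_class_metrics` is hypothetical.

```python
from typing import Dict, List


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def per_class_metrics(confusion: List[List[int]], labels: List[str]) -> Dict[str, Dict[str, float]]:
    """One-vs-rest TPR, FPR, precision and F-score for each class."""
    total = sum(sum(row) for row in confusion)
    results = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                 # actual class i, predicted otherwise
        fp = sum(row[i] for row in confusion) - tp  # predicted class i, actually otherwise
        tn = total - tp - fn - fp
        tpr = _ratio(tp, tp + fn)
        precision = _ratio(tp, tp + fp)
        results[label] = {
            "TPR": tpr,
            "FPR": _ratio(fp, fp + tn),
            "Precision": precision,
            "F-Score": _ratio(2 * precision * tpr, precision + tpr),
        }
    return results


# Rows: actual ham, actual spam. Columns: predicted ham, predicted spam.
confusion = [
    [3669, 3],
    [1, 1499],
]
metrics = per_class_metrics(confusion, ["ham", "spam"])
for label, values in metrics.items():
    print(label, {name: round(value, 4) for name, value in values.items()})
```

Treating each class in turn as the positive class is what makes the FPR of one class depend on the misclassifications of the other.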
4.0 Conclusion
Many existing email spam filtering techniques cannot effectively handle some of the spam
sent daily by spammers, because spammers keep inventing more sophisticated techniques for
evading detection by spam filters. With the continuous adoption of new techniques by
spammers, email spam filtering has become an active research area. In this study, we
proposed the Random Forests algorithm for effective and efficient email spam filtering, and
evaluated its performance on the Enron spam datasets using accuracy, TPR, FPR, precision
and F-measure. We conclude that RFs is a promising algorithm that can be adopted either at
the mail server or at the mail client side to further decrease the volume of spam messages
in email users' inboxes.
References
Akinyelu A. A., and Adewumi A.O. (2016). Classification of Phishing Email Using
Random Forest Machine Learning Technique. Journal of Applied Mathematics, 2014, 6,
Article ID 425731, Retrieved on July 12, 2017 from
http://dx.doi.org/10.1155/2014/425731
Akshita T. (2016). Content Based Spam Classification- A Deep Learning Approach. A
Thesis Submitted to The Faculty Of Graduate Studies, University Of Calgary, Alberta,
Canada.
Alkaht I.J., Al-Khatib B. (2016). Filtering SPAM Using Several Stages Neural Networks.
International Review on Computers and Software, 11, 2.
Awad M. and Foqaha M. (2016). Email Spam Classification Using Hybrid Approach of RBF
Neural Network and Particle Swarm Optimization. July 2016 International Journal of
Network Security & Its Applications 8(4):17-28. DOI: 10.5121/ijnsa.2016.8402
Awad W.A. and Elseuofi S.M. (2011). Machine Learning Methods for Spam E-mail
Classification. International Journal of Computer Science and Information Technology,
3(1):173–184.
Bahgat E.M., Rady S. and Gad W. (2016). An e-mail filtering approach using classification
techniques. In The 1st International Conference on Advanced Intelligent System and
Informatics (AISI2015), November 28-30, 2015, BeniSuef, Egypt, Springer International
Publishing, 321-331.
Bouguila N. and Amayri O. (2009). A discrete mixture-based kernel for SVMs: application
to spam and image categorization, Information Processing & Management, 45(6): 631-642.
Breiman L, Cutler A (2007). Random forests-classification description, Department of
Statistics Homepage, 2007,
http://www.stat.berkeley.edu/~breiman/RandomForests/cchome.htm.
Cao Y, Liao X, Li Y (2004). An e-mail filtering approach using neural network, In
International Symposium on Neural Networks, Springer Berlin Heidelberg, 688-694.
Dhanaraj KR, Palaniswami V (2014). Firefly and Bayes Classifier for Email Spam
Classification in a Distributed Environment. Australian Journal of Basic and Applied
Sciences, 8(17):118-130.
Fdez-Riverola F, Iglesias EL, Diaz F, Méndez JR, Corchado JM (2007). SpamHunting: An
instance-based reasoning system for spam labelling and filtering, Decision Support
Systems, 43(3):722-736.
Fette I, Sadeh N, Tomasic A (2007). Learning to detect phishing emails, in Proceedings of
the 16th International World Wide Web Conference (WWW ’07), 649–656, Alberta,
Canada, May 2007.
Fonseca DM, Fazzion OH, Cunha E, Las-Casas I, Guedes PD, Meira W, Chaves M (2016).
Measuring Characterizing, and Avoiding Spam Traffic Costs. IEEE Internet
Computing, 99.
Karthika R, Visalakshi P (2015). A Hybrid ACO Based Feature Selection Method for Email
Spam Classification. WSEAS Transaction on Computers, 14, pp. 171-177.
Kaspersky Lab Spam Report (2017). Visited on May 15, 2018,
https://www.securelist.com/en/analysis/204792230/Spam_Report_April_2012, 2012.
Koprinska I., Poon J., Clark J., Chan J. (2007). Learning to classify e-mail, Information
Sciences, 177(10): 2167–2187.
Mason S (2003). New Law Designed to Limit Amount of Spam in E-Mail.
http://www.wral.com/technolog
Sharma A. and Suryawansi A. (2016). A Novel Method for Detecting Spam Email using
KNN Classification with Spearman Correlation as Distance Measure. International
Journal of Computer Applications, 136 (6):28-34
Sosa J.N. (2010). Spam Classification using Machine Learning Techniques – Sinespam.
Master of Science Thesis. Master in Artificial Intelligence (UPC-URV-UB).
Wang X. (2005). Learning to classify email: A survey. Proceedings of 2005 International
Conference on Machine Learning and Cybernetics.
Whittaker C., Ryner B., Nazif M. (2010). Large-scale automatic classification of phishing
pages. In: Proceedings of the 17th Annual Network & Distributed System Security
Symposium (NDSS ’10), The Internet Society, San Diego, Calif., USA.
Zavvar M., Rezaei M., Garavand S. (2016). Email Spam Detection using Combination of
Particle Swarm Optimization and Artificial Neural Network and Support Vector
Machine. International Journal of Modern Education and Computer Science, pp. 68-74.