Statistical Performance Assessment of Supervised Machine Learning Algorithms for Intrusion Detection System
Corresponding Author:
Abdurazzag A. Aburas
School of Electrical, Electronic and Computer Engineering, University of KwaZulu-Natal
Durban, South Africa
Email: [email protected]
1. INTRODUCTION
Businesses, manufacturing sectors, and financial institutions are becoming increasingly reliant on technology, and the risk of cyberattacks is therefore extremely high. Protecting these devices is thus one of the main issues facing researchers today [1]. Intrusion detection is a topic that is the subject of extensive
research globally [2], [3]. Based on the mechanism used for detection, intrusion detection systems (IDSs) are
divided into three classes: signature-based, anomaly-based, and specification-based. This is illustrated in Figure 1, which depicts intrusion detection systems classified by detection strategy, deployment, architecture, and detection behaviour or response.
Anomaly-based IDS [4] is preferable to signature-based and specification-based IDS due to its ability
to detect novel threats. The problems of misclassification of intrusions in intrusion detection systems can be
solved by a supervised learning algorithm [5]. Many supervised machine learning methods, including individual classifiers and ensemble classifiers, have been used to develop anomaly detection systems. Although combining several classifiers has helped machine learning research advance over the past decade, selecting the appropriate ones to combine can be challenging, especially when employing a stacking-type ensemble.
It is necessary to compare the performances of the best machine learning (ML) classifiers in ML studies. This is a major challenge, especially when comparing across various datasets [6]. An algorithm could perform well on one dataset while failing to achieve the same results on another, owing to the existence of outliers, the feature distribution, or the algorithm's properties. As a result, comparing several algorithms to one another becomes rather challenging.
Several studies on the performance evaluation of ML classifiers, particularly in network attack
detection, have been conducted. Subbiah et al. [5] presented a novel framework for intrusion detection that is
enabled by the Boruta feature selection with grid search random forest (BFS-GSRF) algorithm. The proposed work was evaluated on the NSL-KDD dataset, and its performance was compared to linear discriminant analysis (LDA) and classification and regression tree (CART). According to the results obtained in their study, the proposed BFS-GSRF outperforms LDA, CART, and other existing algorithms, detecting attacks with an accuracy of 99%. The study in [7] presented a comparative review of several ML methods used in IDSs for a variety of applications such as big data, fog computing, the internet of things (IoT), smart cities, and 5G networks. Furthermore, the authors classified intrusions using classifiers such as CART, LDA, and RF implemented on the knowledge discovery and data mining tools competition (KDD-CUP'99) dataset, and their efficiency was measured and compared against recent research. Zaman and Lung [8]
used ensemble methods, fuzzy c-means, naïve Bayes, radial basis function, and support vector machine (SVM) to build an IDS on the Kyoto 2006+ dataset. They obtained promising results in terms of accuracy, with ensemble methods reaching 96.72%.
A gradient boosted machine (GBM) was suggested as the detection engine of an anomaly-based IDS by Tama and Rhee [9], using various datasets, namely general packet radio service (GPRS), NSL-KDD, and University of New South Wales network-based attacks (UNSW-NB15). The proposed IDS's performance was assessed with hold-out and cross-fold validation techniques, and the optimal GBM parameters were obtained using grid search. The proposed method outperformed fuzzy classifiers and tree-based ensembles in terms of all the metrics considered. Kilincer et al.
[10] conducted a thorough literature review in 2021 to compare the performance of SVM, k-nearest neighbors
and decision tree (DT). The communications security establishment and the canadian institute for cybersecurity
intrusion detection evaluation dataset (CSE-CIC-IDS2018), UNSW-NB15, information security centre of
excellence intrusion detection evaluation dataset (ISCX-2012), NSL-KDD, and the cyber intrusion detection
system dataset 001 (CIDDS-001) datasets were all used for the comparative analysis. Except for the UNSW-
NB15 dataset, the study found that the accuracy of the models varied from 95% to 100%. DT consistently
outperformed all other implemented models, irrespective of dataset. The ability to detect unknown threats is
also a concern when evaluating the performance of the IDS. Hindy et al. [11] studied the effectiveness of ML-
based IDSs in detecting unknown attacks in 2020. The study proposed an IDS that could detect zero-day threats with high recall rates while maintaining a low false positive rate. In addition, they implemented a one-class SVM for comparison with the proposed model. The canadian institute of cybersecurity
intrusion detection evaluation dataset (CICIDS2017) and the NSL-KDD dataset were used for model training
and evaluation in this study. To achieve the zero-day attack detection setting, only normal traffic was used
when training the model, and all attack samples were used to simulate zero-day attacks. The study found that
both models had a low false positive rate when it came to detecting zero-day attacks. Pranto et al. [12]
demonstrated a method to classify records in the NSL-KDD dataset as normal or malicious using several machine learning techniques such as k-nearest neighbor, decision tree, naïve Bayes, logistic regression, random forest, and their ensemble approach. They obtained a highest accuracy of 99.5% with a 0.6% false alarm rate. Making a choice
about which algorithm is superior to others is therefore difficult. Statistical validation of the performance
outcomes is required to address this problem.
The performance of ML classifiers for the design of an IDS is evaluated in this work. The accuracy,
precision, recall, and f-score of classifiers such as logistic regression (LR), stochastic gradient descent classifier
(SGDC), deep neural network (DNN), random forest (RF), adaptive boosting (AB), extreme gradient boosting
(XGB), gradient boosting machine (GBM), and extra tree classifier (ETC), are all measured. All classifiers,
excluding DNN, have their hyper-parameters tuned via random search [13]. Using well-known statistical tests,
the major differences between classifiers are statistically examined. Our major contributions are summarized as follows:
i) Using well-known validation methods, the performance of various ML classifiers on the network-based anomaly internet of things (N-BaIoT) and internet of things intrusion detection dataset (IoTID20) datasets is evaluated; ii) Statistical evaluation of the performance results is carried out with the widely used Friedman and Dunn's post-hoc tests. The significance of classifier differences was assessed with Friedman's test, and Dunn's test was used for pairwise comparisons between the ML classifiers; iii) Classifiers that exhibit a better trade-off between the metrics considered and evaluation time are recommended for the design of IoT-specific anomaly-based IDSs. The remainder of the paper is structured as follows: section 2 briefly discusses the materials and methods adopted in our work; section 3 presents the experimental setup and classifier performance, together with the statistical evaluations carried out; section 4 summarizes and concludes the paper.
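As an illustration of the random search tuning mentioned above, the following minimal sketch uses scikit-learn's RandomizedSearchCV; the search space and the synthetic data are illustrative assumptions rather than the exact settings used in this work.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in data; the experiments in this work use N-BaIoT and IoTID20
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# illustrative search space, not the exact grid used in this study
param_dist = {"n_estimators": [100, 200, 400],
              "max_depth": [None, 10, 20],
              "max_features": ["sqrt", "log2"]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=param_dist, n_iter=10,
                            cv=5, scoring="accuracy", random_state=42)
search.fit(X, y)
print(search.best_params_)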
\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n \qquad (2)

y = \sigma(x_1 w_1 + x_2 w_2 + b) \qquad (3)
where y denotes the output, \sigma denotes the activation function, w_1 and w_2 denote the connection weights, and b denotes the bias. An activation function's role is to introduce nonlinearity and enable the model to learn complex nonlinear functions.
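As a minimal numerical illustration of (3), assuming a sigmoid activation and arbitrary example values for the inputs, weights, and bias:

import numpy as np

def sigmoid(z):
    # logistic activation: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.5, -1.2           # example inputs (arbitrary values)
w1, w2, b = 0.8, 0.3, 0.1    # connection weights and bias (arbitrary values)
y = sigmoid(x1 * w1 + x2 * w2 + b)  # equation (3)
print(y)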
2.2. Datasets
Dataset availability is important in the domain of intrusion detection systems because ML problems
are heavily reliant on data. Additionally, the quality of the datasets available is also important because the
higher the data quality, the better the algorithms perform. In this study, we have chosen two groups of some of the most extensively applied and popular supervised machine learning algorithms and evaluated their performance in terms of well-known metrics and validation methods using two internet of things (IoT) intrusion detection datasets, namely N-BaIoT and IoTID20.
The N-BaIoT dataset includes files that each contain 115 statistically engineered independent features derived from the raw network packet capture (pcap) files, as well as a class label. All 115 features are self-explanatory. The
class instance occurrence in the N-BaIoT dataset is shown in Table 1.
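As a minimal sketch of how one of these files could be loaded, assuming the pandas library and a hypothetical file name for one of the dataset's CSV files:

import pandas as pd

# hypothetical file name; each CSV holds the 115 statistical features per instance
df = pd.read_csv("benign_traffic.csv")
print(df.shape)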
\text{precision} = \frac{TP}{TP + FP} \qquad (6)

\text{recall} = TPR = DR = \frac{TP}{TP + FN} \qquad (7)

F\text{-measure} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (8)
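These metrics can be computed directly from the predictions; a brief sketch with scikit-learn is given below, where the label vectors are illustrative and average="weighted" mirrors the weighted averaging used in this section.

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # illustrative ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # illustrative predictions
print(precision_score(y_true, y_pred, average="weighted"))  # equation (6)
print(recall_score(y_true, y_pred, average="weighted"))     # equation (7)
print(f1_score(y_true, y_pred, average="weighted"))         # equation (8)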
The number of folds is considered to be 10. All performance results given in this section are the weighted averages of the outputs from 10 iterations of every repeated validation approach, to avoid bias. The process flow of the methodology is illustrated in Figure 2. The classifier evaluation time is measured during the testing phase, i.e., the time from when the classification process starts until it stops, using the following commands:
import time  # standard-library timer
start_time = time.time()
y_pred = clf.predict(X_test)            # classify the test set
test_time = time.time() - start_time    # elapsed evaluation time in seconds
Secondly, the study examined the hold-out validation performance results. Figures 3 and 4 illustrate the results of the two experiments conducted with hold-out validation on the N-BaIoT and IoTID20 datasets, respectively. The mean performance achieved with hold-out validation across both datasets is shown in Figure 5, which illustrates that the classifiers outperform one another on different metrics.
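A minimal sketch of such a hold-out split, assuming scikit-learn's train_test_split (the split ratio is an assumption, not necessarily the one used in this work):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic stand-in data for the IoT datasets
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)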
Among the ensemble algorithms, LGBM performs best in terms of accuracy (97.31%), AB performs best in terms of precision (97.18%), and GBM outperforms the others in terms of recall (97.82%) and F-measure (97.49%). Furthermore, when only single classifiers are considered, DNN outperforms LR and SGDC on all metrics considered (96.42%, 96.23%, 96.25%, and 96.24% for accuracy, precision, recall, and F-measure, respectively). SGDC, on the other hand, achieves the lowest values of accuracy, precision, recall, and F-measure among all classifiers considered, with 87.84%, 88.21%, 89.99%, and 89.09%, respectively.
Figures 6 and 7 illustrate the results of the experiments conducted with 10-fold (10f) validation on the N-BaIoT and IoTID20 datasets, respectively. The average values of all notable metrics obtained with 10f validation across both datasets are shown in Figure 8. All of the classifiers perform better with 10f validation than with hold-out validation. This is attributable to the sampling effect of hold-out validation, which selects random instances and can result in unsatisfactory classification.
This phenomenon encourages the adoption of 10f validation rather than hold-out validation. The 10f validation results indicate that each classifier achieves a promising level of performance. DNN outperforms all other methods in terms of accuracy (98.76%). AB has the highest average precision (98.30%). GBM performs best in terms of recall and F-measure, with scores of 98.83% and 98.53%, respectively. As with hold-out validation, SGDC still yields the lowest results when all metrics are taken into consideration.
Figure 6. 10f results on the N-BaIoT dataset

Figure 7. 10f results on the IoTID20 dataset
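A minimal sketch of the repeated 10-fold scheme, assuming scikit-learn's RepeatedStratifiedKFold with 10 folds and 10 repetitions (the classifier and the synthetic data are placeholders):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())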
Classifier test times are listed in Table 4. Given that resource utilization is a key criterion for devices with limited resources, it is crucial to take a classifier's evaluation time into consideration, since it helps establish a suitable trade-off between a classifier's classification performance and its resource utilization. LGBM training takes about 11 and 3 seconds on the N-BaIoT and IoTID20 datasets, respectively. For both datasets, DNN requires the most time for model evaluation. The test time of each classifier is reported solely for 10-fold validation.
According to the results presented in Tables 5 and 6, the classifiers' performance differs significantly across all of the evaluated assessment metrics. As a result, it can be said that at least one classifier performs significantly better than the others. Therefore, the alternative hypothesis H_A is accepted, whereas the null hypothesis H_0 is rejected. The mean ranks of each classifier for hold-out validation are shown in Table 7.
Dunn's post-hoc test is used to determine which classifier pairs perform significantly differently. For this purpose, the p-values of all pairwise comparisons are checked against the considered significance level of 0.05. Assuming that C1 and C2 are the test classifiers, the results of Dunn's test (pairwise comparison) for hold-out validation are shown in Table 8.
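A minimal sketch of this two-stage procedure, assuming SciPy's friedmanchisquare for the Friedman test and the third-party scikit-posthocs package for Dunn's test; the score matrix is illustrative, not the values obtained in this work.

import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# illustrative accuracy scores: rows = evaluation settings, columns = 3 classifiers
scores = np.array([[0.97, 0.98, 0.96],
                   [0.95, 0.97, 0.94],
                   [0.96, 0.99, 0.95],
                   [0.94, 0.96, 0.93]])

stat, p = friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])
print(stat, p)  # reject H0 (equal performance) if p < 0.05

# pairwise Dunn's post-hoc test on the same three groups
print(sp.posthoc_dunn(scores.T.tolist(), p_adjust="bonferroni"))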
According to Table 8, the pairs SGDC and LR versus (ETC, XGB, AB, LGBM, and GBM) are statistically significantly different (p < 0.05), whereas the RF-LGBM, RF-GBM, DNN-LGBM, and DNN-GBM pairs are less significant. Note that the red-shaded pairs of classifiers are not significantly different. Tables 9 and 10 show the Friedman test statistics and the mean ranks of each classifier, respectively, for the 10-fold validation results.
Dunn's post-hoc test is also used to determine which classifiers perform statistically differently for 10-fold validation. The outcomes are displayed in Table 11. The differences are statistically significant (p < 0.05) for SGDC versus DNN, ETC, XGB, AB, LGBM, and GBM; for LR versus XGB, AB, LGBM, and GBM; and for the RF-LGBM, RF-GBM, and RF-XGB pairs. Note that the red-shaded cells indicate the pairs of classifiers that are not significantly different for 10-fold validation.
ACKNOWLEDGEMENTS
We would like to thank the members of the University of KwaZulu-Natal's research office and the open-source software community for the support provided during this research.
REFERENCES
[1] A. Aris, S. F. Oktug, and S. B. O. Yalcin, “Internet-of-things security: Denial of service attacks,” in 2015 23rd Signal Processing
and Communications Applications Conference, SIU 2015 - Proceedings, May 2015, pp. 903–906, doi: 10.1109/SIU.2015.7129976.
[2] M. Baykara and R. Das, “A novel hybrid approach for detection of web-based attacks in intrusion detection systems,” International
Journal of Computer Networks And Applications, vol. 4, no. 2, p. 62, Apr. 2017, doi: 10.22247/ijcna/2017/48968.
[3] B. B. Zarpelão, R. S. Miani, C. T. Kawakani, and S. C. de Alvarenga, “A survey of intrusion detection in Internet of Things,”
Journal of Network and Computer Applications, vol. 84, pp. 25–37, Apr. 2017, doi: 10.1016/j.jnca.2017.02.009.
[4] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, “Anomaly-based network intrusion detection:
Techniques, systems and challenges,” Computers and Security, vol. 28, no. 1–2, pp. 18–28, Feb. 2009,
doi: 10.1016/j.cose.2008.08.003.
[5] S. Subbiah, K. S. M. Anbananthen, S. Thangaraj, S. Kannan, and D. Chelliah, “Intrusion detection technique in wireless sensor
network using grid search random forest with Boruta feature selection algorithm,” Journal of Communications and Networks,
vol. 24, no. 2, pp. 264–273, Apr. 2022, doi: 10.23919/jcn.2022.000002.
[6] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, vol. 7, pp. 1–30, Dec. 2006.
[7] T. Saranya, S. Sridevi, C. Deisy, T. D. Chung, and M. K. A. A. Khan, “Performance analysis of machine learning algorithms in
intrusion detection system: A review,” Procedia Computer Science, vol. 171, pp. 1251–1260, 2020,
doi: 10.1016/j.procs.2020.04.133.
[8] M. Zaman and C. H. Lung, “Evaluation of machine learning techniques for network intrusion detection,” in IEEE/IFIP Network
Operations and Management Symposium: Cognitive Management in a Cyber World, NOMS 2018, Apr. 2018, pp. 1–5,
doi: 10.1109/NOMS.2018.8406212.
[9] B. A. Tama and K. H. Rhee, “An in-depth experimental study of anomaly detection using gradient boosted machine,” Neural
Computing and Applications, vol. 31, no. 4, pp. 955–965, Jul. 2019, doi: 10.1007/s00521-017-3128-z.
[10] I. F. Kilincer, F. Ertam, and A. Sengur, “Machine learning methods for cyber security intrusion detection: Datasets and comparative
study,” Computer Networks, vol. 188, p. 107840, Apr. 2021, doi: 10.1016/j.comnet.2021.107840.
[11] H. Hindy, R. Atkinson, C. Tachtatzis, J. N. Colin, E. Bayne, and X. Bellekens, “Utilising deep learning techniques for effective
zero-day attack detection,” Electronics (Switzerland), vol. 9, no. 10, pp. 1–16, Oct. 2020, doi: 10.3390/electronics9101684.
[12] M. B. Pranto, M. H. A. Ratul, M. M. Rahman, I. J. Diya, and Z. Bin Zahir, “Performance of machine learning techniques in anomaly
detection with basic feature selection strategy-a network intrusion detection system,” Journal of Advances in Information
Technology, vol. 13, no. 1, pp. 36–44, 2022, doi: 10.12720/jait.13.1.36-44.
[13] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13,
pp. 281–305, 2012.
[14] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016. [Online]. Available: http://deeplearning.net/.
[15] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2016, pp. 785–794, doi: 10.1145/2939672.2939785.
[16] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, vol. 29, no. 5, pp. 1189–1232,
Oct. 2001, doi: 10.1214/aos/1013203451.
[17] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Machine Learning, vol. 63, no. 1, pp. 3–42, Mar. 2006,
doi: 10.1007/s10994-006-6226-1.
[18] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324.
[19] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of
Computer and System Sciences, vol. 55, no. 1, pp. 119–139, Aug. 1997, doi: 10.1006/jcss.1997.1504.
[20] Y. Meidan et al., “N-BaIoT-network-based detection of IoT botnet attacks using deep autoencoders,” IEEE Pervasive Computing,
vol. 17, no. 3, pp. 12–22, Jul. 2018, doi: 10.1109/MPRV.2018.03367731.
[21] Y. Mirsky, T. Doitshman, Y. Elovici, and A. Shabtai, “Kitsune: An ensemble of autoencoders for online network intrusion detection,” in Network and Distributed System Security Symposium (NDSS), 2018.
BIOGRAPHIES OF AUTHORS