CSE-CIC-IDS2018 Dataset
https://doi.org/10.1186/s40537-020-00382-x
*Correspondence: [email protected]
Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA

Abstract
The exponential growth in computer networks and network applications worldwide has been matched by a surge in cyberattacks. For this reason, datasets such as CSE-CIC-IDS2018 were created to train predictive models on network-based intrusion detection.
These datasets are not meant to serve as repositories for signature-based detection
systems, but rather to promote research on anomaly-based detection through various
machine learning approaches. CSE-CIC-IDS2018 contains about 16,000,000 instances
collected over the course of ten days. It is the most recent intrusion detection data-
set that is big data, publicly available, and covers a wide range of attack types. This
multi-class dataset has a class imbalance, with roughly 17% of the instances compris-
ing attack (anomalous) traffic. Our survey work contributes several key findings. We
determined that the best performance scores for each study, where available, were
unexpectedly high overall, which may be due to overfitting. We also found that most
of the works did not address class imbalance, the effects of which can bias results in a
big data study. Lastly, we discovered that information on the data cleaning of CSE-CIC-IDS2018 was inadequate across the board, a finding that may indicate problems
with reproducibility of experiments. In our survey, major research gaps have also been
identified.
Keywords: CSE-CIC-IDS2018, Big data, Intrusion detection, Machine learning, Class
imbalance
Introduction
Intrusion detection is the accurate identification of various attacks capable of damaging
or compromising an information system. An Intrusion Detection System (IDS) can be host-based, network-based,
or a combination of both. A host-based IDS is primarily concerned with the internal
monitoring of a computer. Windows registry monitoring, log analysis, and file integrity
checking are some of the tasks performed by a host-based IDS [1]. A network-based IDS
monitors and analyzes network traffic to detect threats that include Denial-of-Service
(DoS) attacks, SQL injection attacks, and password attacks [2]. The rapid growth of
computer networks and network applications worldwide has encouraged an increase in
cyberattacks [3]. In 2019, business news channel CNBC reported that the average cost of
a cyberattack was $200,000 [4].
Class distribution of CSE-CIC-IDS2018 (% of instances):
Benign         83.070
DDoS            7.786
DoS             4.031
Brute force     2.347
Botnet          1.763
Infiltration    0.997
Web attack      0.006
CSE-CIC-IDS2018 is distributed as ten CSV files that are downloadable from the cloud.1 Nine files consist of 79 independent features, and the remaining file consists of 83 independent features.
Our exhaustive search for relevant, peer-reviewed papers ended on September 22,
2020. To the best of our knowledge, this is the first survey to exclusively present and
analyze intrusion detection research on CICIDS2018 in such detail. CICIDS2018 is
the most recent intrusion detection dataset that is big data, publicly available, and cov-
ers a wide range of attack types. The contribution of our survey centers around three
important findings. In general, we observed that the best performance scores for each
study, where provided, were unusually high. This may be a consequence of overfitting.
The second finding deals with the apparent lack of concern in most studies for the class
imbalance of CICIDS2018. Finally, we note that for all works, the data cleaning of CIC-
IDS2018 has been given little attention, a shortcoming that could hinder reproducibility
of experiments. Data cleaning involves the modification, formatting, and removal of data
to enhance dataset usability [12].
The remainder of this paper is organized as follows: "Research papers using CIC-
IDS2018" section describes and analyzes the compiled works; "Discussion of surveyed
works" section discusses survey findings, identifies gaps in the current research, and
explains the performance metrics used in the curated works; and "Conclusion" section
concludes with the main points of the paper and offers suggestions for future work.
1 https://www.unb.ca/cic/datasets/ids-2018.html.
These scores may be valuable for future comparative research. Table 3 provides an alpha-
betical listing by author of the papers discussed in this section, along with the proposed
respective model(s) for CICIDS2018, and Table 4 shows the same ordered listing by
author coupled with the associated computing environment(s) for CICIDS2018.
Atefinia and Ahmadi [14] (Network intrusion detection using multi‑architectural modular
deep neural network)
Using an aggregator module [33] to integrate four network architectures, the authors
aimed to obtain a higher precision rate than any of those produced in the related works
described in their paper. The four network components included a restricted Boltz-
mann machine [34], a deep feed-forward neural network [35], and two Recurrent Neu-
ral Networks (RNNs) [36], specifically a Long Short-term Memory (LSTM) [37] and a
Gated Recurrent Unit (GRU) [38]. The models were implemented with Python and the
Scikit-learn2 library. Data preprocessing involved the removal of source and destina-
tion IP addresses and also source port numbers. Labels with string values were one-hot
encoded, and feature scaling was used to normalize the feature space of all the attrib-
utes between a range of 0 and 1. Rows with missing values and columns with too many
missing values were dropped from CICIDS2018. However, no information is provided
on how many rows and columns were removed. Stratified sampling with a train to test
2 https://scikit-learn.org.
ratio of 80-20 was performed on each of the four modules. Information was then fed to
the aggregator module, which used a weighted averaging technique to produce the out-
put for the modular network. The highest accuracies (100%) were obtained for the DoS,
DDoS, and brute force attack types. These accuracies were associated with precision and
recall scores of 100%. One drawback is the authors’ comparison of results from their
study with results from the related works. A better approach is to empirically evaluate
at least two models, one of which would be the proposed modular network. Another
shortcoming relates to the non-availability of performance scores that cover the collec-
tive attack types. In other words, the scores of precision, recall, etc. for the combination
of attacks could provide additional insight. This does not detract from the usefulness of
reporting precision, recall, etc. for each attack type.
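For readers who wish to reproduce preprocessing of this kind, the sketch below strings the described steps together with pandas and Scikit-learn: dropping identifier columns, removing rows with missing values, one-hot encoding the string labels, scaling features to [0, 1], and a stratified 80-20 split. The file path and column names are placeholders rather than details taken from the paper.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder path and column names; CSE-CIC-IDS2018 CSV headers vary by file.
df = pd.read_csv("cse-cic-ids2018-day.csv")

# Drop identifier columns removed in the study (source/destination IPs, source port).
df = df.drop(columns=["Src IP", "Dst IP", "Src Port"], errors="ignore")

# Remove rows with missing values (columns with too many NaNs could be dropped first).
df = df.dropna()

# One-hot encode the string-valued label column.
y = pd.get_dummies(df.pop("Label"))

# Scale all remaining (numeric) attributes to the range [0, 1].
X = MinMaxScaler().fit_transform(df)

# Stratified 80-20 train/test split, stratifying on the original class labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y.values.argmax(axis=1), random_state=42)
```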
Basnet et al. [15] (Towards detecting and classifying network intrusion traffic using deep
learning frameworks)
The authors experimented with various deep learning frameworks (fast.ai,3 Keras,4
PyTorch,5 TensorFlow,6 Theano7) to detect network intrusion traffic and classify attack
types. For preprocessing, samples with “Infinity”, “NaN”, or missing values were dropped
and timestamps converted to Unix epoch numeric values (number of seconds since Jan-
uary 1, 1970). About 20,000 samples were dropped after the data cleanup process. The
destination port and protocol features were treated as categorical data, and the remain-
der were treated as numerical data. Ten-fold cross validation with either an 80–20 or
70–30 split was used for training and testing. Both binary class and multi-class classifi-
cation [39] were considered. A Multilayer Perceptron (MLP) [40] served as the only clas-
sifier. With the aid of GPU acceleration, the authors observed that fast.ai outperformed
the other frameworks consistently among all the experiments, yielding an optimum
accuracy of 99% for binary classification. The main limitation of this study is the use of
only one classifier.
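A minimal sketch of this cleanup, assuming a pandas DataFrame whose "Timestamp" column is in a parseable string format (the exact column names vary across the CSE-CIC-IDS2018 files):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("cse-cic-ids2018-day.csv")  # placeholder path

# Treat "Infinity" strings as infinite values, then drop rows containing inf or NaN.
df = df.replace(["Infinity", "infinity"], np.inf)
df = df.replace([np.inf, -np.inf], np.nan).dropna()

# Convert timestamps to Unix epoch seconds (seconds since January 1, 1970).
df["Timestamp"] = pd.to_datetime(df["Timestamp"]).astype("int64") // 10**9

# Destination port and protocol are treated as categorical features.
df["Dst Port"] = df["Dst Port"].astype("category")
df["Protocol"] = df["Protocol"].astype("category")
```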
Catillo et al. [16] (2L‑ZED‑IDS: a two‑level anomaly detector for multiple attack classes)
Based on an extension of previous research with CICIDS2017, this study trained a deep
autoencoder [41] on CICIDS2017 and CICIDS2018. In the preprocessing stage, the
Flow_ID and Timestamp features of the datasets were not selected because they were
deemed not relevant to the study. The autoencoder was implemented with Python,
Keras, and TensorFlow and trained on normal and DoS attack traffic. The train to test
ratio was 80–20 for both datasets. The highest accuracy for CICIDS2018 (99.2%) was
obtained for the botnet attack type, corresponding to a precision of 95.0% and a recall
of 98.9%. The highest accuracy (99.3%) of the entire study was obtained for CICIDS2017
(botnet attack type), coupled with a precision of 94.8% and a recall of 98.6%. One draw-
back of the study is the non-availability of an accuracy score for the collective attack
types. Another disadvantage is the use of only one classifier.
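The paper's exact architecture is not restated here, but a generic Keras deep autoencoder of the kind described, trained to reconstruct traffic and flagging inputs with high reconstruction error, might look like the sketch below; the layer sizes, training data, and threshold rule are illustrative assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 78  # illustrative; depends on the columns kept after preprocessing
X_train = np.random.rand(1000, n_features)  # stand-in for scaled benign/DoS traffic

# Symmetric encoder/decoder; layer sizes are assumptions, not the authors' configuration.
inputs = keras.Input(shape=(n_features,))
h = layers.Dense(32, activation="relu")(inputs)
h = layers.Dense(16, activation="relu")(h)
h = layers.Dense(32, activation="relu")(h)
outputs = layers.Dense(n_features, activation="sigmoid")(h)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=5, batch_size=256, verbose=0)

# Flows whose reconstruction error exceeds a threshold are flagged as anomalous.
errors = np.mean((autoencoder.predict(X_train) - X_train) ** 2, axis=1)
is_anomaly = errors > np.percentile(errors, 95)  # threshold rule is an assumption
```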
3 https://docs.fast.ai/.
4 https://github.com/keras-team/keras.
5 https://pytorch.org/.
6 https://www.tensorflow.org/.
7 http://deeplearning.net/software/theano/.
Chadza et al. [17] (Contemporary sequential network attacks prediction using hidden
Markov Model)
By way of MATLAB software, two conventional Hidden Markov Model (HMM)
training algorithms, namely Baum Welch [42] and Viterbi [43], were applied to CIC-
IDS2018. HMM is a probabilistic machine learning framework that generates states
and observations. For this study, information is clearly lacking on data preprocess-
ing. About 457,550 records (selection criteria set in Snort Intrusion Detection Sys-
tem [44]) were selected from CICIDS2018. From that sample of records, 70% were
allocated to training and the remainder to testing. The authors found that the highest
accuracy of about 97% was achieved by both the Baum Welch and Viterbi algorithms.
This paper is only three pages in length. The main shortcoming of this work is the lack
of detail on the experiments performed.
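For orientation, the hmmlearn library exposes both algorithms: fitting a GaussianHMM runs Baum-Welch (expectation-maximization), and decode() recovers the most likely hidden state sequence with Viterbi. The sketch below uses synthetic observations because the Snort-based record selection cannot be reproduced from the paper's description.

```python
import numpy as np
from hmmlearn import hmm

# Synthetic 1-D observation sequence standing in for network-flow features.
rng = np.random.default_rng(0)
observations = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)]).reshape(-1, 1)

# Baum-Welch training: fit() runs EM to estimate transition and emission parameters.
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100, random_state=0)
model.fit(observations)

# Viterbi decoding: recover the most likely hidden state sequence.
log_prob, states = model.decode(observations, algorithm="viterbi")
print(log_prob, states[:10])
```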
Chastikova and Sotnikov [18] (Method of analyzing computer traffic based on recurrent
neural networks)
This highly theoretical study, which was submitted to the Journal of Physics Con-
ference series, does not give any empirical results and is extremely short on details.
It merely proposes an LSTM model to analyze computer network traffic using CIC-
IDS2018. The authors note that their use of the Focal Loss function [45] (initially
developed by Facebook AI research) addresses class imbalance. The fact that no met-
rics have been used and no computing environment was provided is a major draw-
back of this six-page paper.
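Since no implementation is given, the sketch below shows a common form of the binary focal loss, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), written as a Keras-compatible loss function; the alpha and gamma values are typical defaults rather than the authors' settings.

```python
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples so training focuses on the
    hard (often minority-class) ones. gamma and alpha here are common defaults."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # p_t is the model's predicted probability for the true class.
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss

# Usage, e.g.: model.compile(optimizer="adam", loss=binary_focal_loss(), metrics=["accuracy"])
```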
D'hooge et al. [19] (Inter-dataset generalization strength of supervised machine learning methods for intrusion detection)
is not expected to generalize well to a related dataset with feature 'X' of {car, boat, train, plane}.
Ferrag et al. [20] (Deep learning for cyber security intrusion detection: approaches,
datasets, and comparative study)
Seven deep learning models were evaluated on CICIDS2018 and the Bot-IoT data-
set [58]. The models included RNNs, deep neural networks [59], restricted Boltzmann
machines, deep belief networks [60], Convolutional Neural Networks (CNNs) [61],
deep Boltzmann machines [62], and deep autoencoders. The Bot-IoT dataset is a 2018
creation from the University of New South Wales (UNSW) that incorporates about
72,000,000 normal and botnet Internet of Things (IoT) instances with 46 features. The
experiment was performed on Google Colaboratory8 using Python and TensorFlow
with GPU acceleration. Only 5% of the instances were used in this study, as determined
by [58]. The highest accuracy for the Bot-IoT dataset (98.39%) was obtained with a deep
autoencoder, while the highest accuracy for CICIDS2018 (97.38%) was obtained with an
RNN. The highest recall for the Bot-IoT dataset (97.01%) came from a CNN, whereas the
highest recall for CICIDS2018 (98.18%) came from a deep autoencoder. The bulk of this
paper deals with classifying 35 cyber datasets and describing the seven deep learning
models. Only the last three pages discuss the actual experimentation, which is lacking in
detail. This is a major shortcoming of the study.
Filho et al. [21] (Smart detection: an online approach for DoS/DDoS attack detection using
machine learning)
The authors used a customized dataset and four well-known ones (CIC-DoS [63], ISCX2012,
CICIDS2017, and CICIDS2018) to obtain online random samples of network traffic and
classify them as DoS attacks or normal. There were 33 features obtained for each dataset.
These features were derived from source and destination ports, transport layer protocol, IP
packet size, and TCP flags. The individual datasets were combined into one unit contain-
ing normal traffic (23,088 instances), TCP flood attacks (14,988 instances), UDP flood (6,894
instances), HTTP flood (347 instances), and HTTP slow (183 instances). For the combined
dataset, the authors noted that ISCX2012 was only used to provide data traffic with normal
activity behavior. Recursive Feature Elimination with Cross Validation [64] was performed
on six learners (RF, DT, Logistic Regression, SGDClassifier [65], Adaboost, MLP). The learn-
ers were built with Scikit-learn. With regard to the combined dataset, Random Forest (20
features selected) had the highest accuracy among the learners. For CICIDS2018, the preci-
sion and recall for RF were both 100%. One shortcoming of this study is the use of ISCX2012
to provide normal traffic for the combined dataset. ISCX2012 is outdated, and as we previ-
ously pointed out, it is limited to only six traffic protocols.
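A sketch of Recursive Feature Elimination with Cross Validation as it might be run in Scikit-learn with a Random Forest base learner follows; the synthetic data, scoring metric, and forest size are assumptions rather than the study's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the 33 flow-derived features described above.
X, y = make_classification(n_samples=2000, n_features=33, n_informative=10, random_state=0)

# RFECV repeatedly drops the least important features, using cross-validation
# to pick the feature count that maximizes the chosen score.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    step=1, cv=5, scoring="f1", n_jobs=-1)
selector.fit(X, y)

print("Selected features:", selector.n_features_)
print("Feature mask:", selector.support_)
```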
Fitni and Ramli [22] (Implementation of ensemble learning and feature selection
for performance improvements in anomaly‑based intrusion detection systems)
Adopting an ensemble model approach, this study compared seven single learners to
evaluate the top performers for integration into a classifier unit. The seven learners are
as follows: RF, Gaussian Naive Bayes [66], DT, Quadratic Discriminant Analysis [67],
8 https://colab.research.google.com/.
Gradient Boosting, and Logistic Regression. The models were built with Python and
Scikit-learn. During preprocessing, samples with missing values and infinity were
removed. Records that were actually a repetition of the header rows were also removed.
The dataset was then divided into training and testing validation sets in an 80-20 ratio.
Feature selection [68], a technique for selecting the most important features of a predic-
tive model, was performed using the Spearman’s rank correlation coefficient [69] and
Chi-squared test [70], resulting in the selection of 23 features. After the evaluation of
the seven learners with these features, Gradient Boosting, Logistic Regression, and DT
emerged as the top performers for use in the ensemble model. Accuracy, precision, and
recall scores for this model were 98.80%, 98.80%, and 97.10%, respectively, along with
an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.94. We believe
that the performance of the ensemble model could be improved by substituting a Cat-
boost [71], LightGBM [72], or XGBoost classifier for the Gradient Boosting classifier.
The three possible substitutions are enhanced gradient boosting variants [73].
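One plausible way to assemble such a pipeline in Scikit-learn is sketched below: chi-squared feature selection (which requires non-negative inputs, hence the min-max scaling) feeding a soft-voting ensemble of Gradient Boosting, Logistic Regression, and Decision Tree learners. The Spearman-correlation filtering step is omitted, and the voting scheme and hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=70, n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ensemble = VotingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="soft")

pipeline = Pipeline([
    ("scale", MinMaxScaler()),            # chi2 needs non-negative feature values
    ("select", SelectKBest(chi2, k=23)),  # 23 features, as in the study
    ("model", ensemble)])

pipeline.fit(X_train, y_train)
print("Test accuracy:", pipeline.score(X_test, y_test))
```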
Gamage and Samarabandu [23] (Deep learning methods in network intrusion detection:
a survey and an objective comparison)
This work introduces a taxonomy of deep learning models for intrusion detection and
summarizes relevant research papers. Four deep learning models (feed-forward neural
network, autoencoder, deep belief network, LSTM) were then evaluated on the KDD
Cup 1999 [74], NSL-KDD [75], CICIDS2017, and CICIDS2018 datasets. The KDD Cup
1999 dataset, which was developed by Defense Advanced Research Project Agency
(DARPA), contains four categories of attacks, including the DoS category. Preprocessing
of the datasets consisted of removing invalid flow records (missing values, strings, etc.)
and feature scaling. One-hot encoding was done for the protocol type, service, and flag
features, three attributes that are categorical and non-numerical. The full datasets were
split into train and test sets, with hyperparameter tuning applied. Results show that the
feed-forward neural networks performed well on all four datasets. For this classifier, the
highest accuracy (99.58%) was obtained on the CICIDS2017 dataset. This score is asso-
ciated with a precision of 99.43% and a recall of 99.45%. With respect to CICIDS2018,
the highest accuracy for the feed-forward network was 98.4%, corresponding to a preci-
sion of 97.79% and a recall of 98.27%. GPU acceleration was used on some of the PCs
involved in the experiments. One shortcoming of this study stems from the use of KDD
Cup 1999 and NSL-KDD, both of which are outdated and have known issues. The main
problem with KDD Cup 1999 is its significant number of duplicate records [75]. NSL-
KDD is an improved version that does not have the issue of redundant instances, but it
is far from ideal. For example, there are some attack classes without records in the test
dataset of NSL-KDD [12].
Hua [24] (An efficient traffic classification scheme using embedded feature selection
and LightGBM)
To tackle the class imbalance of CICIDS2018, the author incorporates an undersam-
pling and embedded feature selection approach with a LightGBM classifier. LightGBM
contains algorithms that address high numbers of instances and features in datasets.
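A rough sketch of pairing undersampling with a LightGBM classifier is given below; the author's embedded feature selection is not reproduced, and the sampling ratio and model parameters are assumptions.

```python
from imblearn.under_sampling import RandomUnderSampler
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for benign-heavy CSE-CIC-IDS2018 traffic.
X, y = make_classification(n_samples=20000, n_features=40, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Undersample the majority (benign) class in the training set only.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)

clf = LGBMClassifier(n_estimators=200, random_state=0)
clf.fit(X_res, y_res)
print("Test accuracy:", clf.score(X_test, y_test))
```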
Huancayo Ramos et al. [25] (Benchmark‑based reference model for evaluating Botnet
detection tools driven by traffic‑flow analytics)
Five learners were evaluated on two datasets (CICIDS2018 and ISOT HTTP Botnet [78])
to determine the best botnet classifier. The ISOT HTTP Botnet dataset contains mali-
cious and benign instances of Domain Name System (DNS) traffic. The learners in the
study include RF, DT, k-NN, Naive Bayes, and SVM. Feature selection was performed
using various techniques, including the feature importance method [79] of RF. After fea-
ture selection, CICIDS2018 had 19 independent attributes while ISOT HTTP had 20,
with destination port number, source port number, and transport protocol among the
selected features. The models were implemented with Python and Scikit-learn. Five-fold
cross-validation was applied to a training set comprising 80% of the botnet instances.
The remainder of the botnet instances served as the testing set. For optimization, the
Grid Search algorithm [80] was used. With regard to CICIDS2018, the RF and DT learn-
ers scored an accuracy of 99.99%. Tied to this accuracy, the precision was 100% and the
recall was 99.99% for both learners. The RF and DT learners also had the highest accu-
racy for ISOT HTTP (99.94% for RF and 99.90% for DT). One limitation of this paper is
the inadequate information provided on data preprocessing.
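The Random Forest feature importance method can be applied for feature selection in Scikit-learn as sketched below; the importance threshold and synthetic data are illustrative, not the study's choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=60, n_informative=15, random_state=0)

# Fit a forest and rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]

# Keep only features whose importance exceeds the mean importance (threshold is an assumption).
keep = importances > importances.mean()
X_selected = X[:, keep]

print("Kept", X_selected.shape[1], "of", X.shape[1], "features")
print("Top 5 feature indices:", ranking[:5])
```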
Kanimozhi and Jacob [26] (Artificial Intelligence based network intrusion detection
with hyper‑parameter optimization tuning on the realistic cyber dataset CSE‑CIC‑IDS2018
using cloud computing)
The authors trained a two-layer MLP, implemented with Python and Scikit-learn, to
detect botnet attacks. GridSearchCV [81] performed hyper-parameter optimization,
and L2 regularization [82] was used to prevent overfitting. Overfitting refers to a model
that has memorized training data instead of learning to generalize it [83]. The MLP clas-
sifier was trained only on the botnet instances of CICIDS2018, with ten-fold cross vali-
dation [84] implemented. For this study the AUC was 1, which is a perfect score. Related
accuracy, precision, and recall scores were all 100%. The paper is four pages long (with
two references), and one major shortcoming is an obvious lack of detail. Another draw-
back is the use of only one classifier to evaluate performance.
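A sketch of this style of hyper-parameter optimization, using GridSearchCV around a two-hidden-layer MLPClassifier whose alpha parameter controls the L2 penalty, is shown below; the grid values and layer sizes are assumptions, and the synthetic data stands in for the botnet subset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)),
])

# alpha is the L2 regularization strength used to limit overfitting.
grid = GridSearchCV(
    pipe,
    param_grid={"mlp__alpha": [1e-4, 1e-3, 1e-2]},
    cv=10, scoring="accuracy", n_jobs=-1)
grid.fit(X, y)
print("Best parameters:", grid.best_params_)
print("Best 10-fold CV accuracy:", grid.best_score_)
```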
Kanimozhi and Jacob [27] (Calibration of various optimized machine learning classifiers
in network intrusion detection system on the realistic cyber dataset CSE‑CIC‑IDS2018 using
cloud computing)
The purpose of this study was to determine the best classifier out of six candidates
(MLP, RF, k-NN, SVM, Adaboost, Naive Bayes). The models were developed with
Python and Scikit-learn. A calibration curve was used, which is a graph showing
the deviation of classifiers from a perfectly calibrated plot. Botnet instances of CIC-
IDS2018 were split into train and test instances, with no information provided on the
ratio of train to test instances. The MLP model emerged as the top choice with an
AUC of 1. Accuracy, precision, and recall scores associated with this perfect AUC
score were 99.97%, 99.96%, and 100%, respectively. No information was provided on
the MLP classifier, but it is most likely the same two-layer network as in [26]. The
main shortcoming of this paper is the lack of detail.
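Scikit-learn's calibration_curve produces the kind of reliability plot described; the sketch below compares two of the candidate classifiers on synthetic data, with binning and model settings chosen arbitrarily.

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
for name, clf in [("Random Forest", RandomForestClassifier(random_state=0)),
                  ("Naive Bayes", GaussianNB())]:
    prob = clf.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    # Fraction of positives vs. mean predicted probability per bin.
    frac_pos, mean_pred = calibration_curve(y_test, prob, n_bins=10)
    plt.plot(mean_pred, frac_pos, marker="o", label=name)
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```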
Kim et al. [29] (CNN‑based network intrusion detection against denial‑of‑service attacks)
In this study, the authors trained a CNN on DoS datasets from KDD Cup 1999 and
CICIDS2018. The model was implemented with Python and TensorFlow. For both
datasets, the train to test ratio was 70–30. In the case of KDD, the authors used about
283,000 samples, and for CICIDS2018, about 11,000,000. Image datasets were sub-
sequently generated, and binary and multi-class classification was performed. The
authors established that for the two datasets, the accuracy was about 99% for binary
classification, which corresponded to precision and recall scores of 81.75% and
82.25%, respectively. An RNN model was subsequently introduced into the study for
comparative purposes. The main drawback of this work arises from the use of the
KDD Cup 1999 dataset, which, as previously discussed, is an outdated dataset with a
high number of redundant instances.
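The flow-to-image conversion is not specified here, so the sketch below simply zero-pads each scaled feature vector into an 8 x 10 single-channel grid and trains a small Keras CNN for binary classification; the grid shape and every architectural choice are assumptions rather than the authors' design.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, grid = 78, (8, 10)  # pad 78 scaled features into an 8x10 single-channel "image"
X = np.random.rand(2000, n_features)    # stand-in for scaled flow features
y = np.random.randint(0, 2, size=2000)  # stand-in binary labels (benign vs. DoS)

X_img = np.zeros((len(X), grid[0] * grid[1]))
X_img[:, :n_features] = X
X_img = X_img.reshape(-1, grid[0], grid[1], 1)

model = keras.Sequential([
    keras.Input(shape=(grid[0], grid[1], 1)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_img, y, epochs=3, batch_size=128, verbose=0)
```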
Li et al. [30] (Building auto‑encoder intrusion detection system based on random forest
feature selection)
In this online real-time detection study, unsupervised clustering and feature selection
play a major role. For preprocessing, “Infinity” and “NaN” values were replaced by 0,
and the data was then divided into sparse and dense matrices, normalized by L2 regu-
larization. A sparse matrix has a majority of elements with value 0, while a dense matrix
has a majority of elements with non-zero values. The model was built within a Python
environment. The best features were selected by RF, and the train to test ratio was set as
85–15. The Affinity Propagation (AP) clustering [87] algorithm was subsequently used
on 25% of the training dataset to group features into subsets, which were relayed to the
autoencoder. Recall rates for all attack types for the proposed model were compared
with those of another autoencoder model called Kitnet [88]. Several attack types for both
models had a recall of 100%. Only the proposed model was evaluated with the AUC met-
ric, with several attack types yielding a score of 1. Based on detection time results, the
authors showed that their model has a faster detection time than KitNet. The authors
provided performance scores for AUC and recall for each attack type of CICIDS2018.
This is a deficiency of the study as scores covering the collective attack types could pro-
vide additional insight. The absence of AUC values for Kitnet is another shortcoming.
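As a rough illustration of the clustering step, Scikit-learn offers both L2 normalization and Affinity Propagation; because the study clusters features rather than samples, one plausible reading is to pass the transposed feature matrix to the clustering algorithm, as in the sketch below.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.preprocessing import normalize

X = np.random.rand(1000, 30)   # stand-in for cleaned, numeric flow features
X = normalize(X, norm="l2")    # L2-normalize each sample

# Cluster the transposed matrix: each row passed to AP is one feature,
# described by its values across all samples.
ap = AffinityPropagation(random_state=0).fit(X.T)
print("Number of feature clusters:", len(ap.cluster_centers_indices_))
print("Feature-to-cluster assignment:", ap.labels_)
```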
Lin et al. [31] (Dynamic network anomaly detection system by using deep learning
techniques)
The authors investigated the use of Attention Mechanism (AM) [89] with LSTM to
improve performance. Attention mechanism imitates the focus mechanism of the
human brain, extracting and representing information most relevant to the target
through an automatic weighing scheme. The model was built with TensorFlow and fur-
ther optimized with Adam Gradient Descent [90], a replacement algorithm for Stochas-
tic Gradient Descent [91]. Seven other learners (DT, Gaussian Naive Bayes, RF, k-NN,
SVM, MLP, LSTM without AM) were also evaluated. Preprocessing of a CICIDS2018
subset (about 50% of the original size) involved removing the timestamp feature and
IP address feature. The dataset was then divided into training, test, and validation sets
in the ratios of 90%, 9%, and 1%. Normal dataset traffic was randomly undersampled
to obtain 2,000,000 records, while Web and infiltration attacks were oversampled with
SMOTE to address class imbalance. The LSTM model with AM outperformed the other
learners with an accuracy of 96.2% and a precision and recall of 96%. The contribution
of this useful study is limited by the inadequate information provided on data cleaning.
Another shortcoming is the omission of the oversampling rate for SMOTE.
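The combination of random undersampling of normal traffic and SMOTE oversampling of minority attacks can be expressed with the imbalanced-learn library as sketched below; the target class sizes are placeholders, since the paper omits its sampling rates.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Imbalanced synthetic data: class 0 plays the role of normal (benign) traffic.
X, y = make_classification(n_samples=50000, n_classes=3, n_informative=8,
                           weights=[0.90, 0.08, 0.02], random_state=0)
print("Before:", Counter(y))

# Undersample the majority class to a fixed size (placeholder count).
under = RandomUnderSampler(sampling_strategy={0: 20000}, random_state=0)
X_mid, y_mid = under.fit_resample(X, y)

# Oversample the minority classes with SMOTE (placeholder counts).
over = SMOTE(sampling_strategy={1: 10000, 2: 10000}, random_state=0)
X_res, y_res = over.fit_resample(X_mid, y_mid)
print("After:", Counter(y_res))
```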
Performance metrics
In order to explain the metrics provided in this survey, it is necessary to start with the
fundamental metrics and then build on the basics. Our list of applicable performance
metrics is explained as follows:
• True Positive (TP) is the number of positive instances correctly identified as positive.
• True Negative (TN) is the number of negative instances correctly identified as nega-
tive.
• False Positive (FP), also known as Type I error, is the number of negative instances
incorrectly identified as positive.
• False Negative (FN), also known as Type II error, is the number of positive instances
incorrectly identified as negative.
Based on these fundamental metrics, the other performance metrics are derived as
follows:
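The derived metrics most commonly reported in the surveyed works take the standard forms below (individual papers may differ slightly in notation):

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, &
\text{Precision} &= \frac{TP}{TP + FP},\\[4pt]
\text{Recall} = \text{TPR} &= \frac{TP}{TP + FN}, &
\text{TNR} &= \frac{TN}{TN + FP},\\[4pt]
\text{FPR} &= \frac{FP}{FP + TN}, &
\text{FNR} &= \frac{FN}{FN + TP}.
\end{aligned}
```

AUC, which several of the surveyed works also report, is the area under the curve obtained by plotting TPR against FPR as the classification threshold is varied.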
Conclusion
A marked increase in cyberattacks has shadowed the rapid growth of computer net-
works and network applications. In light of this, several intrusion detection datasets,
including CICIDS2018, have been created to train predictive models. CICIDS2018 is
multi-class, contains about 16,000,000 instances, and is class-imbalanced. As late as Sep-
tember 22, 2020, we aggressively searched for relevant studies based on this dataset.
For the most part, we observed that the best performance scores for each study, where
provided, were unusually high. This may be attributable to overfitting. Furthermore, we
note that only a few of the surveyed works explored treatment for the class imbalance
of CICIDS2018. Class imbalance, particularly for big data, can skew the results of an
experiment. As a final point, we emphasize that the level of detail paid to the data clean-
ing of CICIDS2018 failed to meet our expectations. This concern has a bearing on the
reproducibility of experiments.
Several gaps have been identified in the current research. Topics such as big data pro-
cessing frameworks, concept drift, and transfer learning are missing from the literature.
Future work should address these gaps.
Abbreviations
AM: Attention Mechanism; ANOVA: ANalysis Of VAriance; AP: Affinity Propagation; AUC: Area Under the Receiver Operat-
ing Characteristic Curve; CIC: Canadian Institute of Cybersecurity; CNN: Convolutional Neural Network; CSE: Communica-
tions Security Establishment; DARPA: Defense Advanced Research Project Agency; DNS: Domain Name System; DDoS:
Distributed Denial-of-Service; DoS: Denial-of-Service; DT: Decision Tree; FN: False Negative; FNR: False Negative Rate; FP:
False Positive; FPR: False Positive Rate; GRU: Gated Recurrent Unit; HMM: Hidden Markov Model; HSD: Honestly Significant
Difference; IDS: Intrusion Detection System; IoT: Internet of Things; ISCX: Information Security Centre of Excellence; k-NN:
k-Nearest Neighbor; LSTM: Long Short-term Memory; MLP: Multilayer Perceptron; NSF: National Science Foundation;
RF: Random Forest; RNN: Recurrent Neural Network; ROC: Receiver Operating Characteristic; SMOTE: Synthetic Minority
Oversampling Technique; SRBMM: Synthetic Minority Oversampling Technique; SVM: Support Vector Machine; TN: True
Negative; TNR: True Negative Rate; TP: True Positive; TPR: True Positive Rate; UNB: University of New Brunswick; UNSW:
University of New South Wales.
Acknowledgements
We would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University.
Additionally, we acknowledge partial support by the National Science Foundation (NSF) (CNS-1427536). Opinions, find-
ings, conclusions, or recommendations in this paper are the authors’ and do not reflect the views of the NSF.
Authors’ contributions
JLL performed the literature review and drafted the manuscript. TMK worked with JLL to develop the article’s framework
and focus. TMK introduced this topic to JLL. All authors read and approved the final manuscript.
Funding
Not applicable.
Competing interests
The authors declare that they have no competing interests.
References
1. Singh AP, Singh MD. Analysis of host-based and network-based intrusion detection system. IJ Comput Netw Inf
Secur. 2014;8:41–7.
2. Patil A, Laturkar A, Athawale S, Takale R, Tathawade P. A multilevel system to mitigate ddos, brute force and sql
injection attack for cloud security. In: International Conference on Information, Communication, Instrumentation
and Control (ICICIC), 2017. p. 1–7. IEEE.
3. Saxena AK, Sinha S, Shukla P. General study of intrusion detection system and survey of agent based intrusion
detection system. In: 2017 International Conference on Computing, Communication and Automation (ICCCA),
2017. p. 471–421. IEEE.
4. CNBC: Cyberattacks now cost companies $200,000 on average, putting many out of business. https://www.cnbc.com/2019/10/13/cyberattacks-cost-small-companies-200k-putting-many-out-of-business.html.
5. Sharafaldin I, Lashkari AH, Ghorbani AA. Toward generating a new intrusion detection dataset and intrusion traffic
characterization. In: ICISSP, 2018. p. 108–116.
6. D’hooge L, Wauters T, Volckaert B, De Turck F. In-depth comparative evaluation of supervised machine learning
approaches for detection of cybersecurity threats. In: Proceedings of the 4th International Conference on Internet
of Things, Big Data and Security; 2019.
7. Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA. Toward developing a systematic approach to generate benchmark
datasets for intrusion detection. Computers Secur. 2012;31(3):357–74.
8. Bouteraa I, Derdour M, Ahmim A. Intrusion detection using data mining: A contemporary comparative study. In:
2018 3rd International Conference on Pattern Analysis and Intelligent Systems (PAIS), 2018. p. 1–8. IEEE.
9. Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big
Data. 2018;5(1):42.
10. He H, Garcia EA. Learning from imbalanced data. IEEE Trans knowl Data Eng. 2009;21(9):1263–84.
11. Thakkar A, Lohiya R. A review of the advancement in intrusion detection datasets. Procedia Comput Sci.
2020;167:636–45.
12. Groff Z, Schwartz S. Data preprocessing and feature selection for an intrusion detection system dataset. In: 34th
Annual Conference of The Pennsylvania Association of Computer and Information Science Educators, 2019. p.
103–110.
13. Menon AK, Williamson RC. The cost of fairness in binary classification. In: Conference on Fairness, Accountability
and Transparency, 2018. p. 107–118.
14. Atefinia R, Ahmadi M. Network intrusion detection using multi-architectural modular deep neural network. J
Supercomput. 2020. https://doi.org/10.1007/s11227-020-03410-y
15. Basnet RB, Shash R, Johnson C, Walgren L, Doleck T. Towards detecting and classifying network intrusion traffic
using deep learning frameworks. J Internet Serv Inf Secur. 2019;9(4):1–17.
16. Catillo M, Rak M, Villano U. 2l-zed-ids: A two-level anomaly detector for multiple attack classes. In: Workshops of
the International Conference on Advanced Information Networking and Applications. 2020. p. 687–696.
17. Chadza T, Kyriakopoulos KG, Lambotharan S. Contemporary sequential network attacks prediction using hidden
markov model. In: 2019 17th International Conference on Privacy, Security and Trust (PST), 2019. p. 1–3.
18. Chastikova V, Sotnikov V. Method of analyzing computer traffic based on recurrent neural networks. J Phys.
2019;1353:012133.
19. D’hooge L, Wauters T, Volckaert B, De Turck F. Inter-dataset generalization strength of supervised machine learning
methods for intrusion detection. J Inf Secur Appl. 2020;54:102564.
20. Ferrag MA, Maglaras L, Moschoyiannis S, Janicke H. Deep learning for cyber security intrusion detection:
approaches, datasets, and comparative study. J Inf Secur Appl. 2020;50:102419.
21. Lima Filho FSd, Silveira FA, de Medeiros Brito Junior A, Vargas-Solar G, Silveira LF. Smart detection: an online
approach for dos/ddos attack detection using machine learning. Security and Communication Networks 2019;
2019.
22. Fitni QRS, Ramli K. Implementation of ensemble learning and feature selection for performance improvements in
anomaly-based intrusion detection systems. In: 2020 IEEE International Conference on Industry 4.0, Artificial Intel-
ligence, and Communications Technology (IAICT), 2020. p. 118–124.
23. Gamage S, Samarabandu J. Deep learning methods in network intrusion detection: a survey and an objective
comparison. J Netw Comput Appl. 2020;169:102767.
24. Hua Y. An efficient traffic classification scheme using embedded feature selection and lightgbm. In: 2020 Informa-
tion Communication Technologies Conference (ICTC), 2020. p. 125–130.
25. Huancayo Ramos KS, Sotelo Monge MA, Maestre Vidal J. Benchmark-based reference model for evaluating botnet
detection tools driven by traffic-flow analytics. Sensors. 2020;20(16):4501.
26. Kanimozhi V, Jacob TP. Artificial intelligence based network intrusion detection with hyper-parameter optimiza-
tion tuning on the realistic cyber dataset cse-cic-ids2018 using cloud computing. In: 2019 International Confer-
ence on Communication and Signal Processing (ICCSP), 2019, p. 0033–0036.
27. Kanimozhi V, Jacob TP. Calibration of various optimized machine learning classifiers in network intrusion detec-
tion system on the realistic cyber dataset cse-cic-ids2018 using cloud computing. Int J Eng Appl Sci Technol.
2019;4(6):2143–455.
28. Karatas G, Demir O, Sahingoz OK. Increasing the performance of machine learning-based idss on an imbalanced
and up-to-date dataset. IEEE Access. 2020;8:32150–62.
29. Kim J, Kim J, Kim H, Shim M, Choi E. Cnn-based network intrusion detection against denial-of-service attacks.
Electronics. 2020;9(6):916.
30. Li X, Chen W, Zhang Q, Wu L. Building auto-encoder intrusion detection system based on random forest feature
selection. Comput Secur. 2020;95:101851.
31. Lin P, Ye K, Xu C-Z. Dynamic network anomaly detection system by using deep learning techniques. In: Interna-
tional Conference on Cloud Computing. Springer; 2019, 161–176.
32. Zhao F, Zhang H, Peng J, Zhuang X, Na S-G. A semi-self-taught network intrusion detection system. Neural Com-
put Appl. 2020;32:17169–79.
33. Happel BL, Murre JM. Design and evolution of modular neural network architectures. Neural Netw.
1994;7(6–7):985–1004.
34. Lu N, Li T, Ren X, Miao H. A deep learning scheme for motor imagery classification based on restricted Boltzmann
machines. IEEE Trans Neural Syst Rehab Eng. 2016;25(6):566–76.
35. Varsamopoulos S, Criger B, Bertels K. Decoding small surface codes with feedforward neural networks. Quantum
Sci Technol. 2017;3(1):015004.
36. De Mulder W, Bethard S, Moens M-F. A survey on the application of recurrent neural networks to statistical lan-
guage modeling. Comput Speech Lang. 2015;30(1):61–98.
37. Madan A, George AM, Singh A, Bhatia M. Redaction of protected health information in ehrs using crfs and bi-
directional lstms. In: 2018 7th International Conference on Reliability, Infocom Technologies and Optimization
(Trends and Future Directions)(ICRITO), 2018. p. 513–517.
38. Lee K, Filannino M, Uzuner Ö. An empirical test of grus and deep contextualized word representations on de-
identification. Stud Health Technol Inform. 2019;264:218–22.
39. Chaudhary A, Kolhe S, Kamal R. An improved random forest classifier for multi-class classification. Inf Process Agric.
2016;3(4):215–22.
40. Rynkiewicz J. Asymptotic statistics for multilayer perceptron with Relu hidden units. Neurocomputing.
2019;342:16–23.
41. Chen J, Xie B, Zhang H, Zhai J. Deep autoencoders in pattern recognition: A survey. Bio-inspired Computing Mod-
els And Algorithms. World Scientific;2019. 229–55.
42. Joshi J, Kumar T, Srivastava S, Sachdeva D. Optimisation of hidden Markov model using Baum-Welch algorithm for
prediction of maximum and minimum temperature over Indian Himalaya. J Earth Syst Sci. 2017;126(1):3.
43. Lember J, Sova J. Regenerativity of viterbi process for pairwise markov models. J Theor Probab. 2020. https://doi.org/10.1007/s10959-020-01022-z.
44. Shah SAR, Issac B. Performance comparison of intrusion detection systems and application of machine learning to
snort system. Future Gener Comput Syst. 2018;80:157–70.
45. Pasupa K, Vatathanavaro S, Tungjitnob S. Convolutional neural networks based focal loss for class imbalance prob-
lem: A case study of canine red blood cells morphology classification. J Ambient Intell Human Comput. 2020;.
https://doi.org/10.1007/s12652-020-01773-x.
46. Chen W, Zhang S, Li R, Shahabi H. Performance evaluation of the gis-based data mining techniques of best-
first decision tree, random forest, and naïve bayes tree for landslide susceptibility modeling. Sci Total Environ.
2018;644:1006–188.
47. Ahmad I, Basheri M, Iqbal MJ, Rahim A. Performance comparison of support vector machine, random forest, and
extreme learning machine for intrusion detection. IEEE Access. 2018;6:33789–95.
48. Taşer PY, Birant KU, Birant D. Comparison of ensemble-based multiple instance learning approaches. In: 2019 IEEE
International Symposium on INnovations in Intelligent SysTems and Applications (INISTA), 2019. p. 1–5.
49. Ayyadevara VK. Gradient boosting machine. In: Pro Machine Learning Algorithms. Berkeley, CA: Apress; 2018. https://doi.org/10.1007/978-1-4842-3564-5_6.
50. Wang R, Zeng S, Wang X, Ni J. Machine learning for hierarchical prediction of elastic properties in fe-cr-al system.
Comput Mater Sci. 2019;166:119–23.
51. Baig MM, Awais MM, El-Alfy E-SM. Adaboost-based artificial neural network learning. Neurocomputing.
2017;248:120–6.
52. Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd Acm Sigkdd Interna-
tional Conference on Knowledge Discovery and Data Mining, 2016. p. 785–794.
53. Vajda S, Santosh K. A fast k-nearest neighbor classifier using unsupervised clustering. In: International Conference
on Recent Trends in Image Processing and Pattern Recognition, 2016. p. 185–193.
54. Saikia T, Brox T, Schmid C. Optimized generic feature learning for few-shot classification across domains. arXiv
preprint arXiv:2001.07926 2020.
55. Sulaiman S, Wahid RA, Ariffin AH, Zulkifli CZ. Question classification based on cognitive levels using linear svc. Test
Eng Manag. 2020;83:6463–70.
56. Rahman MA, Hossain MA, Kabir MR, Sani MH, Awal MA et al.. Optimization of sleep stage classification using
single-channel eeg signals. In: 2019 4th International Conference on Electrical Information and Communication
Technology (EICT), 2019. p. 1–6.
57. Rymarczyk T, Kozłowski E, Kłosowski G, Niderla K. Logistic regression for machine learning in process tomography.
Sensors. 2019;19(15):3400.
58. Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the inter-
net of things for network forensic analytics: Bot-iot dataset. Future Gener Comput Syst. 2019;100:779–96.
59. Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE. A survey of deep neural network architectures and their applica-
tions. Neurocomputing. 2017;234:11–26.
60. Li J, Xi B, Li Y, Du Q, Wang K. Hyperspectral classification based on texture feature enhancement and deep belief
networks. Remote Sensing. 2018;10(3):396.
61. Zhao Y, Li H, Wan S, Sekuboyina A, Hu X, Tetteh G, Piraud M, Menze B. Knowledge-aided convolutional neural
network for small organ segmentation. IEEE J Biomed Health Inform. 2019;23(4):1363–73.
62. Taherkhani A, Cosma G, McGinnity TM. Deep-fs: A feature selection algorithm for deep boltzmann machines.
Neurocomputing. 2018;322:22–37.
63. Jazi HH, Gonzalez H, Stakhanova N, Ghorbani AA. Detecting http-based application layer dos attacks on web serv-
ers in the presence of sampling. Comput Netw. 2017;121:25–36.
64. Akhtar F, Li J, Pei Y, Xu Y, Rajput A, Wang Q. Optimal features subset selection for large for gestational age classifica-
tion using gridsearch based recursive feature elimination with cross-validation scheme. In: International Confer-
ence on Frontier Computing, 2019. p. 63–71.
65. Scikit-learn: SGDClassifier. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html.
66. Fadlil A, Riadi I, Aji S. Ddos attacks classification using numeric attribute based Gaussian Naive Bayes. Int J Adv
Comput Sci Appl. 2017;8(8):42–50.
67. Elkhalil K, Kammoun A, Couillet R, Al-Naffouri TY, Alouini M-S. Asymptotic performance of regularized quadratic
discriminant analysis based classifiers. In: 2017 IEEE 27th International Workshop on Machine Learning for Signal
Processing (MLSP), 2017. p. 1–6.
68. Abd Elrahman SM, Abraham A. A review of class imbalance problem. J Netw Innov Comput. 2013;1(2013):332–40.
69. Zhang W-Y, Wei Z-W, Wang B-H, Han X-P. Measuring mixing patterns in complex networks by spearman rank cor-
relation coefficient. Phys A Stat Mech Appl. 2016;451:440–50.
70. Shi D, DiStefano C, McDaniel HL, Jiang Z. Examining chi-square test statistics under conditions of large model size
and ordinal data. Struct Equ Model. 2018;25(6):924–45.
71. Hancock J, Khoshgoftaar TM. Medicare fraud detection using catboost. In: 2020 IEEE 21st International Conference
on Information Reuse and Integration for Data Science (IRI), 2020. p. 97–103. IEEE Computer Society.
72. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y. Lightgbm: A highly efficient gradient boosting deci-
sion tree. In: Advances in Neural Information Processing Systems, 2017. p. 3146–3154.
73. Bentéjac C, Csörgő A, Martínez-Muñoz G. A comparative analysis of gradient boosting algorithms. Artif Int Rev.
2020;1–31.
74. KDD: KDD Cup. https://kdd.ics.uci.edu/databases/kddcup99/task.html/.
75. Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In: 2009 IEEE Symposium
on Computational Intelligence for Security and Defense Applications, 2009. p. 1–6. IEEE.
76. Yap BW, Abd Rani K, Abd Rahman HA, Fong S, Khairudin Z, Abdullah NN. An application of oversampling,
undersampling, bagging and boosting in handling imbalanced datasets. In: Proceedings of the First International
Conference on Advanced Data and Information Engineering (DaEng-2013), 2014. p. 13–22. Springer.
77. Saritas MM, Yasar A. Performance analysis of ann and Naive Bayes classification algorithm for data classification. Int
J Intell Syst Appl Eng. 2019;7(2):88–91.
78. Alenazi A, Traore I, Ganame K, Woungang I. Holistic model for http botnet detection based on dns traffic analysis.
In: International Conference on Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environ-
ments, 2017. p. 1–18.
79. Gupta V, Bhavsar A. Random forest-based feature importance for hep-2 cell image classification. In: Annual Confer-
ence on Medical Image Understanding and Analysis, 2017. p. 922–934. Springer.
80. Yuanyuan S, Yongming W, Lili G, Zhongsong M, Shan J. The comparison of optimizing svm by ga and grid search.
In: 2017 13th IEEE International Conference on Electronic Measurement & Instruments (ICEMI), 2017. p. 354–360.
81. Ranjan G, Verma AK, Radhika S. K-nearest neighbors and grid search cv based real time fault monitoring system for
industries. In: 2019 IEEE 5th International Conference for Convergence in Technology (I2CT), 2019. p. 1–5.
82. Bilgic B, Chatnuntawech I, Fan AP, Setsompop K, Cauley SF, Wald LL, Adalsteinsson E. Fast image reconstruction
with l2-regularization. J Magn Reson Imaging. 2014;40(1):181–91.
83. Meyer H, Reudenbach C, Hengl T, Katurji M, Nauss T. How to detect and avoid overfitting in spatio-temporal
machine learning applications. In: EGU General Assembly Conference Abstracts, vol. 20, 2018. p. 8365.
84. Yadav S, Shukla S. Analysis of k-fold cross-validation over hold-out validation on colossal datasets for quality clas-
sification. In: 2016 IEEE 6th International Conference on Advanced Computing (IACC), 2016. p. 78–83.
85. Fernández A, Garcia S, Herrera F, Chawla NV. Smote for learning from imbalanced data: progress and challenges,
marking the 15-year anniversary. J Artif Intell Res. 2018;61:863–905.
86. Negi S, Kumar Y, Mishra V. Feature extraction and classification for emg signals using linear discriminant analysis.
In: 2016 2nd International Conference on Advances in Computing, Communication, & Automation (ICACCA)(Fall),
2016. p. 1–6.
87. Wei Z, Wang Y, He S, Bao J. A novel intelligent method for bearing fault diagnosis based on affinity propagation
clustering and adaptive feature selection. Knowl Based Syst. 2017;116:1–12.
88. Mirsky Y, Doitshman T, Elovici Y, Shabtai A. Kitsune: an ensemble of autoencoders for online network intrusion
detection. arXiv preprint arXiv:1802.09089 2018.
89. Chorowski JK, Bahdanau D, Serdyuk D, Cho K, Bengio Y. Attention-based models for speech recognition. In:
Advances in Neural Information Processing Systems, 2015. p. 577–585.
90. Zhang Z. Improved adam optimizer for deep neural networks. In: 2018 IEEE/ACM 26th International Symposium
on Quality of Service (IWQoS), 2018. p. 1–2.
91. Sharma A. Guided stochastic gradient descent algorithm for inconsistent datasets. Appl Soft Comput.
2018;73:1068–80.
92. Chiang H-T, Hsieh Y-Y, Fu S-W, Hung K-H, Tsao Y, Chien S-Y. Noise reduction in ECG signals using fully convolutional
denoising autoencoders. IEEE Access. 2019;7:60806–133.
93. Deng Z-H, Qiao H-H, Song Q, Gao L. A complex network community detection algorithm based on label propaga-
tion and fuzzy c-means. Phys A Stat Mech Appl. 2019;519:217–26.
94. Zhu X, Wu X, Chen Q. Eliminating class noise in large datasets. In: Proceedings of the 20th International Confer-
ence on Machine Learning (ICML-03), 2003. p. 920–927.
95. Lee J-S. AUC4.5: AUC-based C4.5 decision tree algorithm for imbalanced data classification. IEEE Access. 2019;7:106034–42.
96. Sulam J, Ben-Ari R, Kisilev P. Maximizing auc with deep learning for classification of imbalanced mammogram
datasets. In: VCBM, 2017. p. 131–135.
97. Buda M, Maki A, Mazurowski MA. A systematic study of the class imbalance problem in convolutional neural
networks. Neural Netw. 2018;106:249–59.
98. Iversen GR, Wildt AR, Norpoth H, Norpoth HP. Analysis of Variance. Thousand Oaks: Sage; 1987.
99. Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949;5:99–114.
100. Del Río S, López V, Benítez JM, Herrera F. On the use of map reduce for imbalanced big data using random forest.
Inf Sci. 2014;285:112–37.
101. Triguero I, Galar M, Merino D, Maillo J, Bustince H, Herrera F. Evolutionary undersampling for extremely imbalanced
big data classification under apache spark. In: 2016 IEEE Congress on Evolutionary Computation (CEC), 2016. p.
640–647. IEEE.
102. Moreno-Torres JG, Raeder T, Alaiz-RodríGuez R, Chawla NV, Herrera F. A unifying view on dataset shift in classifica-
tion. Pattern Recogn. 2012;45(1):521–30.
103. Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, Xiong H, He Q. A comprehensive survey on transfer learning. Proceed-
ings of the IEEE. 2020.
104. Singla A, Bertino E, Verma D. Overcoming the lack of labeled data: training intrusion detection models using
transfer learning. In: 2019 IEEE International Conference on Smart Computing (SMARTCOMP), 2019. p. 69–74.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.