1 Introduction
With novel cyber attacks constantly emerging, and with the rapid spread of new communication protocols that encrypt not only the user payload but also scramble packet header information such as IP addresses and port numbers [1], traditional intrusion detection methodologies, which rely on finding and matching patterns in packet header information (especially IP addresses and port numbers), are gradually losing their effectiveness. Machine learning is therefore increasingly believed to be the future of intrusion detection [2–8]. As an artificial intelligence technology, machine learning is well known for its capability to grasp hidden patterns in massive datasets and provide accurate predictions.
1 https://www.ll.mit.edu/r-d/datasets
2 http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
3 https://registry.opendata.aws/cse-cic-ids2018/
2 Dataset
The features in the CIC-AWS dataset are described in Table 1. There are 80 features in the dataset, providing statistical information on flows in both the uplink and downlink directions. Compared to the raw TCP/IP header information provided by previous datasets such as DARPA and KDD, flow-based statistics are widely believed to provide more useful information for intrusion detection [16].
To get a general idea of the dataset, we plot a bar chart of each feature against the label. Since redundant features increase computational expense, introduce noise, and reduce accuracy, we remove the features that show no difference between benign and malicious traffic.
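A minimal sketch of this screening step with pandas, where a per-class mean comparison stands in for the visual bar-chart inspection; the file name, the `Label` column name, and the threshold are illustrative assumptions, not the paper's actual code:

```python
import pandas as pd

# Load the flow statistics exported by CICFlowMeter (path is illustrative).
df = pd.read_csv("flows.csv")

# Mean of every numeric feature for benign vs. malicious flows.
is_benign = df["Label"] == "Benign"
means = df.groupby(is_benign).mean(numeric_only=True)

# Drop features whose per-class means are indistinguishable, i.e. features
# that show no difference between benign and malicious traffic.
redundant = [c for c in means.columns
             if abs(means.loc[True, c] - means.loc[False, c]) < 1e-9]
df = df.drop(columns=redundant)
```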
We use real-life traffic data as the source of the test dataset. The test datasets are collected from two different sources: benign data and intrusion data.
Benign test data Benign data are collected from our real-life online traffic on a typical research production network, generated during the following daily online activities: emailing, searching (mainly on Google), reading news, watching videos (on Netflix and YouTube), and downloading papers from Google Scholar.
The data were collected for a week on our office desktop in a daily research routine environment, and then converted into a flow-based statistical dataset of 12,681 MB.
Intrusion test data To evaluate the ability of the machine learning models to detect attacks they have not seen before (in other words, Zero-Day attacks), we collect novel real-life attack traffic containing eight new attack types that do not overlap with the training CIC-AWS-2018 Dataset. The attack types in the test data are listed in Table 2. This dataset is collected from recent real-life attacks or abnormal traffic that humans failed to detect and prevent, most of which are still active today, such as ransomware, the Darkness DDoS bot, the Google Docs Makadocs malware, and Bitcoin miners (the last is abnormal traffic rather than an intrusion to many people).
Before training, we clean the datasets as follows (see the sketch below):
– To reduce the size of the datasets, remove the unnecessary precision of the floating-point numbers by dropping the digits after the decimal point;
– Replace noisy, machine-unprocessable characters with underscores;
– Replace “Infinity” and “NaN” values with suitable numbers.
After this cleaning, the total size of the training dataset drops from 6,886 MB to 4 MB, without losing valuable information.
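A sketch of these cleaning steps in pandas; the file names, the character whitelist, and the replacement values are assumptions, not the exact choices made in the paper:

```python
import re
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")

# 1) Drop the digits after the decimal point to shrink the file.
float_cols = df.select_dtypes(include="float").columns
df[float_cols] = df[float_cols].round(0)

# 2) Replace noisy, machine-unprocessable characters with underscores.
df.columns = [re.sub(r"[^0-9A-Za-z]+", "_", c) for c in df.columns]

# 3) Replace "Infinity" and NaN values with suitable numbers.
df = df.replace(["Infinity", np.inf, -np.inf], np.finfo(np.float32).max)
df = df.fillna(0)

df.to_csv("train_clean.csv", index=False)
```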
3 Methodology
In our problem model, the task is: given a set of statistical information about a flow, identify whether the flow is benign traffic or an intrusion, based on learning from a set of already labelled data containing both benign and intrusion traffic. This makes our problem a supervised classification problem. To find the classification model that best suits our problem, we run an exhaustive test of the six most commonly used machine learning classification models on the training dataset, comparing their performance using the criteria of precision, recall, F1 score, and time expense.
The supervised machine learning classification models tested are listed below.
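A sketch of such a comparison with scikit-learn; the six models shown below are assumptions standing in for the tested set, and the labels are assumed binary (1 = intrusion, 0 = benign):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# X, y: the cleaned flow features and binary labels from the training set.
models = {
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
    "naive_bayes": GaussianNB(),
    "k_nearest_neighbours": KNeighborsClassifier(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

# Cross-validate each model and report the criteria used in the paper:
# precision, recall, F1 score, and time expense (mean fit time per fold).
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5,
                            scoring=("precision", "recall", "f1"))
    print(f"{name}: precision={scores['test_precision'].mean():.3f} "
          f"recall={scores['test_recall'].mean():.3f} "
          f"f1={scores['test_f1'].mean():.3f} "
          f"fit_time={scores['fit_time'].mean():.1f}s")
```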
4 Evaluation
This section presents the evaluation experiment setup and the evaluation results for intrusion detection. The evaluation experiments include: 1) cross validation of the training CIC-AWS-2018 Dataset on each of the attack types, and 2) prediction on the testing data using the chosen model.
4.1 Environment
The software tools used in the evaluation experiments are scikit-learn, NumPy, and pandas. All the evaluation experiments are carried out on a Dell server with an 8-core AMD64 CPU working at 3.4 GHz, 16 GB of memory, and a 1 TB hard disk.
In real life, if an intrusion detection system generates too many false positives, its alarms will not be taken seriously, or the system will even be shut down by human users, which can cause greater danger than a low true-positive rate. As shown in Fig 2, for most of the intrusions (more specifically, any intrusion other than Infilteration), the false-positive rate of all the common machine learning classification models is as low as 0%; Infilteration is the exception, with rates between 10.00% and 17.00%.
The overhead in terms of time expense is illustrated in Fig 3 to demonstrate the efficiency of each machine learning classification model. The model that performs best in accuracy, decision tree classification, also costs less time than its peer classification models, consuming less than 20 seconds on all attack types.
As analysed above, in the cross-validation experiment the decision tree classification fitted best among the six common classifier models while consuming the least time. Thus, we choose it as the training model for the intrusion detection of the testing data, as discussed in the following section.
Our goal is to detect intrusions with the highest possible true-positive rate and the lowest possible false-positive rate, especially for intrusions that have not been seen before (in other words, Zero-Day attacks). To this end, we normalised the intrusion types in the training CIC-AWS-2018 Dataset and the test datasets into one single type, “Evil”. Thus there are only two types of traffic in the datasets, “Benign” and “Evil”, and the goal is to detect “Evil” traffic, especially Zero-Day attacks, with the highest possible accuracy, no matter what type of intrusion it is. The ability to identify the exact type of a detected intrusion is also highly desirable.
Fig. 1. The True-Positive rate for different classification models on different attack types.
Fig. 2. The False-Positive rate for different classification models on different attack types.
Fig. 3. The machine learning time consumption for different attack types.
However, we focus only on detecting intrusions in this paper and leave type identification as future work, so we do not discuss it here.
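A sketch of the normalisation step described above; the `Label` column name follows the CIC convention and is an assumption here:

```python
import pandas as pd

def normalise_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse all attack types into the single class "Evil"."""
    df = df.copy()
    # Keep "Benign" rows as they are; every other label becomes "Evil".
    df["Label"] = df["Label"].where(df["Label"] == "Benign", "Evil")
    return df

# Applied to both the training and the test datasets:
# train_df = normalise_labels(train_df)
# test_df = normalise_labels(test_df)
```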
We evaluate the test datasets on the chosen decision tree classification model in three steps: 1) test on the “Benign” test data only; 2) test on the “Evil” test data only; 3) test on the combined and shuffled benign and intrusion test dataset, with different maximum depths of the decision tree.
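A sketch of steps 1) and 2), which yields the per-class precision, recall, and F1 scores of the kind reported in Table 14; `clf` is assumed to be the decision tree fitted on the training data, and all identifiers here are illustrative:

```python
from sklearn.metrics import classification_report

# Evaluate the benign-only and evil-only test sets separately.
for name, (X_part, y_part) in {
    "Benign data only": (X_benign, y_benign),
    "Evil data only": (X_evil, y_evil),
}.items():
    print(name)
    print(classification_report(y_part, clf.predict(X_part), zero_division=0))
```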
The test results of steps 1) and 2) are shown in Table 14. As the bold numbers show, the model detects the tested “Evil” and “Benign” data with 100% precision. Since there is no “Benign” data in the “Evil data only” test and no “Evil” data in the “Benign data only” test, the corresponding entries are trivially 0.00%.
At step 3), we merge the intrusion test data, containing the Zero-Day intrusions (relative to the training data) listed in Table 2, into the benign dataset and shuffle them. The depth of a decision tree is the length of the longest path from the root to a leaf; it determines the size of the tree and affects the performance of tree models. To find the best size of the decision tree, we fit the test datasets to a series of decision tree classifiers with maximum depths of 2, 3, 4, 5, and 6. The detection efficiency is evaluated in terms of false-positive rate and true-positive rate in Fig 4, and the time expenses of the different tree models are given in Fig 5. In Fig 4, the x-axis denotes the maximum depth of the different decision tree classifiers, and the y-axis denotes the percentage rates. The true-positive rate is denoted by orange solid bars for every model, while the false-positive rate is denoted by striped black bars. As shown in Fig 4, the true-positive rate peaks at 96% when the maximum depth is 5, and slightly deteriorates to 92% and 90% at maximum depths of 3 and 4. The false-positive rate reaches its lowest, 5%, at maximum depths of 3 and 4, and deteriorates to about 10% when the maximum depth moves lower or higher, to 2 or 5. In summary, the intrusion detector performs best with a tree depth of 3, 4, or 5; either increasing or decreasing the maximum depth beyond this range deteriorates the performance. Further experiments will be carried out to find a better model to fit the flow-based statistical data for intrusion detection, but they will be reported elsewhere.
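A sketch of the maximum-depth sweep of step 3), assuming binary labels (1 = “Evil”, 0 = “Benign”) and the train/test splits named below:

```python
import time
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

for depth in (2, 3, 4, 5, 6):
    clf = DecisionTreeClassifier(max_depth=depth)
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    # Derive the true- and false-positive rates from the confusion matrix.
    tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
    print(f"max_depth={depth}: TPR={tp / (tp + fn):.2%} "
          f"FPR={fp / (fp + tn):.2%} fit_time={elapsed:.1f}s")
```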
The time expenses of the decision tree classifiers with different maximum depths are shown in Fig 5, in terms of real (striped blue bars), sys (dotted orange bars), and user (solid green bars) time in seconds. As shown by the trend line of real time (the total elapsed time), the time expense grows exponentially with the maximum depth. This is expected, since a full decision tree of depth $d$ has $2^{d+1} - 1$ nodes, and the computation time is positively related to the size of the tree.
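For instance, over the depths tested here, the worst-case node count grows as
\[
2^{2+1}-1 = 7, \qquad 2^{4+1}-1 = 31, \qquad 2^{6+1}-1 = 127,
\]
roughly doubling with every extra level.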
Fig. 4. The true-positive and false-positive rate of the test process with decision tree classifier under different max depth.
Fig. 5. The time expenses of the test process with decision tree classifier under different max depth, in terms of real, sys, and user time.
Table 14. Classification Results of the Benign and Intrusion Test Datasets on Decision Tree Classification Model.

                               precision   recall   f1-score
  Evil data only     Benign     0.00%       0.00      0.00
                     Evil       100%        0.40      0.57
  Benign data only   Benign     100%        1.00      1.00
                     Evil       0.00%       0.00      0.00
5 Conclusions
In this paper, we present an intensive analysis of intrusion detection using flow-based statistical data generated from network traffic packets with CICFlowMeter, using machine learning classification models. Six common machine learning classification models are tested on datasets generated from real-life attacks and production networks. The CIC-AWS-2018 Dataset, collected on Amazon cluster networks and containing benign traffic and fourteen different types of intrusions, is used as the training dataset; eight different types of intrusion traffic collected online and benign traffic collected from our research production network are used as the testing dataset. Cross validations over the training dataset are carried out on the six common machine learning classification models, and the model with the best performance in general adaptability, precision, and time consumption, decision tree classification, is chosen for the testing experiment. The testing results demonstrate acceptable performance for intrusion detection, with 100% accuracy in detecting Zero-Day intrusion or benign data alone, 96% accuracy on shuffled Zero-Day intrusion and benign data, and a false-positive rate as low as 5%.
Much is left to do in the future, for example, finding better models to fit the
statistical flow data, and testing on more novel types of intrusions in real time.
Bibliography