

International Journal of Machine Learning and Computing, Vol. 12, No. 2, March 2022

A Hybrid IDS Using GA-Based Feature Selection Method and Random Forest
Zhiqiang Liu and Yucheng Shi


Abstract: In recent years, the rapid development of internet technology has brought many severe network security problems linked to malicious intrusions. The Intrusion Detection System is considered one of the significant techniques for safeguarding the network from both external and internal attacks. However, with the fast expansion of the IoT network, cyberattacks are also changing quickly, and many unknown types are showing up in the contemporary network environment. Consequently, the efficiency of traditional signature-based and anomaly-based Intrusion Detection Systems is insufficient. We propose a novel Intrusion Detection System, which uses a feature selection approach based on an evolutionary technique and a Random Forest-based classifier. The evolution-based feature selector uses an innovative Fitness Function to select the important features and reduce the dimensions of the data, which raises the True Positive Rate and reduces the False Positive Rate at the same time. With exceptionally high accuracy in multi-classification tasks and outstanding capabilities of handling noise in massive data scenarios, the Random Forest technique is widely used in anomaly detection. This research proposes a framework that can select more stable features and improve the classification results compared with other technologies. The proposed framework is tested on the UNSW-NB15 and NSL-KDD datasets. Various statistical results and a detailed comparison with other methods are presented in this article.

Index Terms: Genetic algorithm, network security, NSL-KDD, random forest decision tree, UNSW-NB15.

Manuscript received October 18, 2019; revised April 11, 2021.
The authors are with the School of Software and Microelectronic, Northwestern Polytechnical University, Shaanxi, China (e-mail: [email protected], [email protected]).
doi: 10.18178/ijmlc.2022.12.2.1077

I. INTRODUCTION
As a result of the world's third industrial revolution, computers and networking technologies have exploded into our daily life. Along with the convenience they brought, these technologies also left us with concerns and risks. Viruses, Trojans, and worms can easily be injected into our systems. Sensitive information can be leaked or hijacked by cyberattacks. All these threats continue to escalate with the development of information technology. The traditional defense system can identify some attacks, but once the attacks vary slightly, they can hardly be recognized. Thus, the whole industry is seeking new mechanisms that can accurately capture and block those threats and keep our systems in a working and safe environment.

Defense mechanisms can be categorized into Intrusion Prevention Systems (IPS) and Intrusion Detection Systems (IDS). An Intrusion Detection System, as an entrance, usually works at the frontier of the network. According to the techniques used, misuse detection and anomaly detection are the two main categories in intrusion detection. Misuse detection uses known attack methods that have been defined in advance. The system determines the existence of these attacks to achieve the detection process, which is also called feature detection [1]. Misuse detection is built on an existing feature library or feature database. It can detect the intrusion patterns recorded in the signature database with high accuracy. However, misuse detection fails to detect zero-day attacks. In other words, when an attack does not exist in the signature database, this detection system can hardly capture it. A raised alarm means that a recorded signature has been detected, though note that the set of signatures could contain ambiguous outlines that can be caused by an attacker as well as by a legitimate user. Anomaly detection does not rely on the signature database. It analyses the network traffic by calculating the deviation of the user's behavior from the normal profile. Anomaly detection can address the reliance on the signature database, but this method may flag normal network behavior as an intrusion, and the false alarm rate is relatively high.

Typically, an intrusion detection system consists of two components that work together. The first component selects only the necessary features, and the second component performs classification and makes efficient decisions. To achieve the best performance, these components must cooperate to deliver a result with low time consumption and high accuracy.

Data pre-processing is a vital step at the beginning of the whole detection process. Selecting the significant information and features from the dataset can reduce the dimensions of the raw dataset, which always leads to better performance. The Genetic Algorithm (GA) is inspired by the natural evolutionary theory put forward by Darwin. The Genetic Algorithm is a commonly used method for finding an optimized and high-quality solution to a search problem. Core operators in GA are inspired by biological processes such as crossover, mutation, and selection [2]. The Fitness Function is considered to be the most important part of Genetic Algorithms. The Fitness Function evaluates every offspring chromosome, and then only the highest scored one can survive to the next evolutionary round. The disadvantage of previously proposed Fitness Functions in GA-based feature selection models is that they simply use the accuracy and the number of selected features as parameters. Ignoring the high False Positive Rate (FPR) of the intrusion detection commonly results in a low True Positive Rate (TPR).

In recent decades, machine learning has been increasingly used as another vital component of the modern Intrusion Detection System. Generally, machine learning can be divided into supervised and unsupervised learning. Supervised learning is a powerful tool in analyzing the

high-dimensional data and figuring out the hidden pattern behind these statistics. Supervised learning also has a strong capability of classifying high-dimensional data into specific classes. Thus, this technology can be used to recognize malicious behaviors in network traffic. Because traffic data is massive, high-dimensional, and strongly non-linear, some classical machine learning methods, for instance, probability-based Bayesian methods, Decision Tree, and Support Vector Machine (SVM), have proven to be less effective in the classification task. The results have low accuracy but a high False Positive Rate, and the “dimensional explosion” problem is prone to occur.

Random Forest (RF) is a supervised learning algorithm. After the training process with given features and classification results, an RF model can be obtained to classify new datasets. Among all kinds of supervised learning algorithms, Random Forest has certain advantages in accuracy and training speed. Also, good noise processing ability and high stability make Random Forest a popular choice in Intrusion Detection Systems. There are a number of factors for evaluating the performance of Random Forest, including accuracy, recall rate, running time, etc. We propose a model that combines the Genetic Algorithm with the Random Forest algorithm to reach the best results. Besides, a newly designed Fitness Function, which adds FPR as a penalty parameter, aims to cut the false alarm rate (FAR) and increase the TPR concurrently. Furthermore, F1-score is also used for balancing the weight of the precision rate and the recall rate. We mainly optimize the accuracy and time complexity of Random Forest through parameter adjustment and data dimensionality reduction. A stable number of selected features and low decision time are also treated as important performance indicators.

The rest of the paper is organized as follows. In Section II, related works are reviewed. In Section III, the details of the proposed Intrusion Detection System are given. Section IV discusses the experimental results on the UNSW-NB15 [2] dataset compared to the NSL-KDD [3] dataset. Conclusions and some possible future enhancements of this work are presented in Section V.

II. RELATED WORK
There is a great deal of previous research in the literature that discusses the Intrusion Detection System. D. E. Denning [4] initially proposed the abstract model of the intrusion detection system in 1987. That paper was the first to use intrusion detection as a security defense technique for computer systems. The model is independent of any specific operating system, application environment, system vulnerability, and intrusion type. It is a framework that can serve as an excellent example for designing intrusion detection application systems. However, the audit criteria in the proposed model can be triggered by other unknown factors that are not anomalous behaviors, and whether the model can detect most intrusions before severe damage is done still needs to be proved. Wu et al. [5] work intensively on database intrusion, especially on anomaly detection based on data mining; the authors also apply association rules to a forward implementation based on a Trie tree. Aumreesh et al. [6] give a review that emphasizes various types of Intrusion Detection System, such as misuse-based, anomaly-based, host-based, network-based, and hybrid-based. It mainly focuses on anomaly-based and behavior-based detection along with agent-based technology in real network traffic. S. Northcutt et al. [7] compare the pros and cons of the anomaly detection approach and the misuse detection approach. The authors point out that the drawback of the anomaly detection approach is that when the Intrusion Detection System experiences a new behavior for the first time, it raises an alarm, which may be a false positive. Also, the false negative rate and the False Positive Rate of anomaly detection are much higher than those of misuse detection.

L. Haripriya and M. A. Jabbar [8] give a review of using Machine Learning (ML) technologies in the Intrusion Detection System. They also discuss applications of ML in such systems and give a detailed comparison of various ML-based approaches for the Intrusion Detection System. The paper indicates that it is relatively hard to train ML models when traffic data is insufficient or not available. A useful intrusion detection system model that uses an Artificial Neural Network (ANN) is presented by Basant Subba et al. [9]. One limitation of their approach is that the proposed model requires a large training time. However, the overall detection performance of the neural network will not be degraded by the failure of adding new agents to the previous ones. Pan-Shi Tang et al. [10] describe Filter and Wrapper, the most common feature selection algorithms, in their work. A combination of the two algorithms is also compared with the Genetic Algorithm based selection method, leading to the conclusion that GA has much higher efficiency than the Filter and Wrapper algorithms in selecting features. S. Aksoy et al. [11] and B. Kavitha et al. [12] describe an essential method of selecting the required subset of features by using the Genetic Algorithm. They believe feature selection can discard redundant items and has a considerable effect on building an efficient classification system in further steps. Ketan Sanjay Desale and Roshani Ade [13] propose an innovative feature selection technique that uses a method based on the mathematical intersection principle and the genetic algorithm. Besides, a range of feature selection techniques, for instance, IG, CAE, and CFS, are tested. Their outcomes with two other regularly used classifiers, J48 and Naive Bayes (NB), are compared. These articles give good examples of using the Genetic Algorithm as a feature selector.

Yi Yi Aung et al. [14] develop an IDS for identifying network behavior by using K-Means and RF. Decreasing the CPU and memory consumption is also one of their focuses. Moreover, the hybrid model is shown to be superior to a system using only a single Random Forest algorithm, specifically in detection correctness and classification accuracy. In this work, 10% of the KDDCUP99 [15] dataset is used to verify the model accuracy. Yaping Chang et al. [16] apply Random Forest to select important features and SVM to improve the classification result. Only 14 features (out of 41 features in total) are selected to reach a higher attack detection rate, also using the KDDCUP99 [16] dataset. A data mining based intrusion detection framework combining misuse and anomaly detection, which also applies RF, is proposed by Mohammad Zulkernine and Jiong Zhang [17]. They utilize sampling techniques and optimal arguments in their


framework to increase the detection correctness of minority intrusions. However, the first shortcoming of their work is that the hybrid system can be undermined if intrusions greatly outnumber normal data in a dataset. Second, some highly similar intrusions cannot be correctly detected as outliers by the system. Third, their tests and experiments still rely on the KDDCUP99 [15] dataset, which is outdated and cannot truly represent modern comprehensive network traffic. M. Zhao et al. [18] use GA to optimize the parameters of the Support Vector Machine simultaneously. The model selects optimized features and the best SVM parameters by concatenating them into one chromosome. However, the Fitness Function in their evolutionary process only uses the accuracy and the True Positive Rate to assess every chromosome. Extra computing time is also required in every evolutionary step.

III. PROPOSED GA-RF IDS FRAMEWORK
The overall system architecture of the proposed GA-RF IDS framework is shown in Fig. 1.

We use the Genetic Algorithm based feature selection method to select useful features. In the Genetic Algorithm, different combinations of features are called chromosomes, and every chromosome is evaluated by the Fitness Function. According to the fitness value, only the highest scored chromosome can survive to the next evolution round. The new chromosome replaces the old one in the total chromosome pool, which is called the initial population. When the evolutionary loop stops, relatively characteristic features are selected as the output of the Genetic Algorithm. On top of that, Random Forest is used for further feature selection and for classifying the results. Random Forest is considered to be a powerful tool when dealing with complex data, whether in binary classification or multi-class classification.

Fig. 1. The architecture of the proposed GA-RF IDS. (Flow: datasets (NSL-KDD and UNSW-NB15), data pre-processing, feature selection based on GA, training dataset and testing dataset, RF classifier, classification results.)

A. Brief Comparison of NSL-KDD and UNSW-NB15
According to the UNSW-NB15 work [2], the NSL-KDD [3] dataset is regarded as an upgraded version of the KDDCUP99 [16] dataset. The NSL-KDD [3] dataset removes the unnecessary items in KDDCUP99 [16] and addresses the imbalance issue among all records in both the training dataset and the testing dataset, which makes the detection results more reliable. The NSL-KDD [3] training dataset covers 22 types of cyberattacks divided into four classes: Denial of Service (DoS), Probing Attack (PROBE), User to Root (U2R), and Remote to User (R2L). Table I presents the detailed categories of all attack types and also gives a brief description of the different classes. Fig. 2 shows the distribution of normal traffic and the 4 types of abnormal traffic. It clearly illustrates that the percentage of records in the dataset is inversely proportional to the number of records in each difficulty level.

TABLE I: CATEGORIES OF VARIOUS ATTACKS IN KDD
Class | Description | Attack Subclass
DoS | Restrict or deny a legitimate user request to a system | ‘smurf’, ‘back’, ‘neptune’, ‘pod’, ‘teardrop’, ‘land’
PROBE | Identify and gather vulnerabilities exposed in a system or a network device | ‘ipsweep’, ‘nmap’, ‘portsweep’, ‘satan’
U2R | Pretend to be a legitimate user or gain unauthorized Root access to a system | ‘loadmodule’, ‘buffer_overflow’, ‘rootkit’, ‘perl’
R2L | Gain unofficial local access from a remote machine | ‘warezmaster’, ‘guess_passwd’, ‘imap’, ‘phf’, ‘spy’, ‘multihop’, ‘ftp_write’, ‘warezclient’

Fig. 2. Distribution chart of category in KDD training dataset.

Fig. 3. Attack distribution in UNSW-NB15 (training dataset).

However, according to the UNSW-NB15 work [2], the NSL-KDD [3] dataset does not represent current low-footprint attack scenarios. The UNSW-NB15 [2] dataset was created to provide an all-inclusive environment of contemporary network traffic by establishing a synthetic network using the IXIA tool, which can generate real current regular traffic and synthetic abnormal traffic. The UNSW-NB15 [2] dataset has 49 features, whereas the NSL-KDD [3] dataset only has 41 features. Moreover, the extra features can be regarded as key features and have shown benefits in previous work. All of the records in UNSW-NB15 [2] are categorized into ten groups, which are


Normal, Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. Table II gives a detailed description of all attack types in UNSW-NB15 [2], and Fig. 3 illustrates the distribution of the training dataset.

In this paper, the experiment is executed on each dataset, and results are presented in Section IV.

TABLE II: DETAILED INFORMATION OF ATTACK TYPES IN UNSW-NB15
Class | Description | Number of Records
Fuzzers | Attempts to suspend a program or network by providing randomly generated data. | 24246
Analysis | Contains different attacks of port scan, spam and html files penetrations. | 2677
Backdoors | A technique that bypasses system security to access a computer or its data. | 2329
DoS | Restrict or deny a legitimate user request to a system. | 16353
Exploits | An attacker knows about a security problem in an operating system or software and uses the vulnerability to exploit that knowledge. | 44525
Generic | One technique applies to all block ciphers with a given block and key size, regardless of the structure of the block ciphers. | 215481
Reconnaissance | Includes all strikes that can simulate attacks that collect information. | 13987
Shellcode | A small piece of code that exploits software weaknesses as payloads. | 1511
Worms | Uses security failures to replicate itself in order to infect other computers. | 174

B. Data Pre-processing
Pre-processing transforms the data into a uniform format. It is also used to remove useless data, which is not required for the proposed method, and to complete missing data.

1) 1-N encoding
To evaluate a model, UNSW-NB15 [2] and NSL-KDD [3] are used as benchmark datasets. All the relevant experiments are performed using these datasets. Moreover, using only the material and crucial features to classify the data source is essential. For better feature selection results, the NSL-KDD [3] and UNSW-NB15 [2] datasets cannot be used for training directly because of the non-numeric features in the datasets. To overcome this problem, non-numeric features are converted into numeric features by using 1-N numeric coding. In this paper, all the non-numeric features, like protocol, service, and flag, have been converted into numeric features. For example, the protocol type feature in NSL-KDD [3] consists of 3 nominal values, which are tcp, udp, and icmp: the string value ‘tcp’ is replaced by 1, ‘udp’ by 2, and ‘icmp’ by 3.

2) Normalization
Features in both datasets like “src-bytes”, “dst-bytes”, “duration”, etc. range from 0 to 500000, which makes the dataset unbalanced and unfit to be processed. Such records in the dataset will mislead the classifier and result in an inexact outcome. Therefore, these values or features should be normalized by using the following Max-Min function (1):

$$\frac{x_i - Min}{Max - Min} \qquad (1)$$

The above equation represents the normalization process, where $Min$ and $Max$ are the minimum and maximum values over all available data and $x_i$ represents each data point.

3) SMOTE algorithm
Because some specific cyberattack types are in the minority, such as R2L and U2R in the NSL-KDD [3] dataset and Worms and Shellcode in the UNSW-NB15 [2] dataset, a standard classifier always detects those cyberattacks with very low accuracy. The Synthetic Minority Oversampling Technique (SMOTE) is used to overcome this problem. SMOTE is considered to be an improved approach based on the Random Oversampling algorithm. Primarily, the SMOTE algorithm utilizes the K-Nearest Neighbor (KNN) method to generate new samples from a relatively small number of samples, mapped to the original dataset. The algorithm steps are shown below:

1) For every sample $x$ in the minority class $S_{min}$, calculate the Euclidean distance to each of the other samples in $S_{min}$ to obtain its K-Nearest Neighbors (KNN).
2) For each minority sample $x$, randomly select several samples from its K-Nearest Neighbors, assuming that the selected neighbor is $x_n$.
3) For each $x_n$, construct a new sample using the following formula:

$$x_{new} = x + rand(0,1) \times (x - x_n) \qquad (2)$$

C. Genetic Algorithm Based Feature Selection Method
In the proposed method, we use the Genetic Algorithm [19] as the base of the feature selection method. Fig. 4 illustrates the workflow of our proposed feature selection process.

Fig. 4. Workflow chart of feature selection. (Flow: the original dataset is split by K-Fold validation into training and testing datasets; an initial population of feature chromosomes drives feature selection on the training dataset; the RF Decision Tree is used to calculate the fitness function; if the termination criteria are not satisfied, mutation and crossover update the population with new chromosomes; otherwise the most important features are returned, leading to the results of intrusion detection.)

The initial population consists of the feature chromosomes. Features in the NSL-KDD [3] and UNSW-NB15 [2] datasets are coded in binary form, such as 110110111…00101101. The chromosomes are generated randomly. On the other hand, to include as many categories of attack as possible in both datasets, the size of the initial population is restricted to between 100 and 150. According to previous research, the larger the initial population is, the more complex the algorithm is, and the more computing time is needed. On the contrary, if the initial population is too small, the optimal performance of the


algorithm will be reduced, and it is easy to fall into a local optimal solution. Both original datasets are separated into training and testing datasets by using the K-Fold validation method during the training process. The mutation rate and crossover rate are kept constant in the experiments. Based on the classification results of the RF, the Fitness Function evaluates every chromosome at the end of the iteration. When any of the following conditions is satisfied, the feature extraction algorithm terminates:
1) The maximum number of preset iterations is reached, so the search is complete.
2) The maximum fitness value does not change for 10 successive generations.

D. The Fitness Function
The Fitness Function (8) is considered to be the most vital and fundamental part of the genetic algorithm, as it evaluates whether a chromosome survives. At the end of every evolutionary step, the highest scored chromosome evaluated by the Fitness Function replaces the lower scored one. A proper Fitness Function should preserve chromosomes with high fitness values and speed up the iterative process of the genetic algorithm. Moreover, in the Intrusion Detection System scenario, notably, not only the accuracy and the True Positive Rate should be considered, but the False Positive Rate should also be included in the Fitness Function. Previously, researchers selected subsets with higher classification accuracy and fewer features. However, they did not take false detections into account, so those feature subsets would result in higher false alarm rates, and the performance of the Intrusion Detection System would degrade.

E. Random Forest Decision Tree
Random Forest is considered to be an integrated learning method based on decision trees. Random Forest was proposed by Leo Breiman in 2001 to combine the bagged integrated learning theory [20] with the random subspace method [21]. RF is a well-known classifier for supervised learning. In the RF decision tree, each node is split on the basis of optimal feature selection. This process continues until the termination criteria are reached. Each node groups data of relatively the same kind. The number of votes determines the classification result. The most voted leaf node is considered to be the category of the sample. The voting process is determined by the path from the root node to the leaf node. The resistance of RF to noise and outliers not only solves many performance issues but also gives good stability. The non-parametric nature of RF makes it a better choice for the classification of high-dimensional data.

F. Proposed Fitness Function
We propose an innovative Fitness Function (8) which uses three parameters, named Accuracy, F1-score, and False Positive Rate (FPR), to evaluate each chromosome. The underlying quantities are defined as follows:
1) True Positive (TP): Classify the samples that originally belong to positive categories into positive categories.
2) True Negative (TN): Classify the samples that originally belong to negative categories into negative categories.
3) False Positive (FP): Incorrectly classify the samples that originally belong to negative categories into positive categories.
4) False Negative (FN): Incorrectly classify the samples that originally belong to positive categories into negative categories.

Accuracy (3) is the percentage of data that is correctly predicted. Accuracy is calculated as below:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$

The F1-score, Eq. (4), is calculated as follows:

$$F1\text{-}score = 2 \times \frac{precision \times recall}{precision + recall} \qquad (4)$$

Among the data predicted to be positive, the ratio of actually positive data is called precision, Eq. (5). Among the actually positive data, the ratio of data predicted to be positive is called recall, Eq. (6). The formulas of precision and recall are shown below:

$$precision = \frac{TP}{TP + FP} \qquad (5)$$

$$recall = \frac{TP}{TP + FN} \qquad (6)$$

FPR, Eq. (7), is the rate of false positive detections, calculated by:

$$FP\ rate = \frac{FP}{FP + TN} \qquad (7)$$

The formula of the Fitness Function, Eq. (8), is as below:

$$Fitness(c) = w_a \times RF_{Accuracy} + w_b \times F1\text{-}score - w_c \times FPR \qquad (8)$$

In the proposed Fitness Function, Eq. (8), $w_a$ is the weight for the accuracy of the Random Forest Decision Tree, $w_b$ is the weight for the F1-score, and $w_c$ is the weight for the False Positive Rate. The F1-score is a measure of test accuracy. It is the harmonic mean of precision and recall, so it takes both the precision and the recall of the classification model into account. The F1-score reaches its best value at 1 (perfect precision and recall) and its worst at 0.

A high False Positive Rate leads to false alarms, which means that the Intrusion Detection System judges normal network traffic to be malicious. We aim to increase the TPR and decrease the FPR simultaneously. Therefore, we treat the False Positive Rate as a penalty parameter in our Fitness Function, which means a high False Positive Rate lowers the value of the whole Fitness Function. Every chromosome is evaluated by the proposed Fitness Function at the end of every loop shown in Fig. 4, and only highly scored chromosomes can survive to the next evolutionary round.

IV. EXPERIMENTS AND RESULTS
The testbed of our proposed method is a Windows-based computer with an 8th-generation Intel Core i7 at 2.3 GHz and 8 GB RAM. The DEAP framework (version 1.28) was used to run the Genetic Algorithm under Python. Detailed parameters of the


Genetic Algorithm and Fitness Function are shown in Table III.

TABLE III: DETAILED PARAMETERS IN GA AND FITNESS FUNCTION
Evolution parameters
Parameter | Value
Initial population | 150
Mutation rate | 0.01
Crossover rate | 0.75
Selection type | Roulette wheel selection
Crossover type | Two-point crossover
Fitness Function parameters
Parameter | Value
$w_a$ | 0.6
$w_b$ | 0.4
$w_c$ | 100

According to the Fitness Function (8), the fitness score can be influenced by different values of the parameters. After many experiments, we found that the mean fitness value reached its peak when $w_a = 0.6$ and $w_b = 0.4$. We defined the task in the DEAP framework as a maximization problem and set $w_c = 100$ to amplify the weight of FPR to achieve the best result. The overall mean fitness values on the NSL-KDD [3] dataset (a) and the UNSW-NB15 [2] dataset (b) are shown in Fig. 5. The X-axis represents the N×10th generation of the loop, and the Y-axis represents the mean fitness value of each generation. As shown in Fig. 5, as chromosome selection proceeds, the curve shows an upward trend and then gradually flattens out, which means that well-scored chromosomes are preserved in the population and the selected important features are slowly becoming stable.

F1-score, Accuracy, Recall, Precision, and FPR for both the NSL-KDD [3] training dataset and the UNSW-NB15 [2] training dataset in binary classification are shown in Fig. 6, and the ROC curves for both datasets are shown in Fig. 7.

Fig. 5. (a) Mean fitness in NSL-KDD. (b) Mean fitness in UNSW-NB15.
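The evolutionary loop configured by Table III can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' DEAP code: the `evaluate` callback, the `max_gen` and `seed` values, and the shifting of scores inside the roulette wheel are hypothetical choices made here. The minus sign on the FPR term follows the text's description of FPR as a penalty in a maximization problem.

```python
import random

# Weights from Table III. The minus sign on the FPR term is an assumption that
# follows the text's description of FPR as a penalty in a maximization problem.
W_A, W_B, W_C = 0.6, 0.4, 100.0


def fitness(tp, tn, fp, fn):
    """Score a chromosome from the confusion-matrix counts of its classifier."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return W_A * accuracy + W_B * f1 - W_C * fpr


def roulette_select(population, scores, rng):
    """Roulette wheel selection; scores are shifted to be positive (assumption)."""
    low = min(scores)
    weights = [s - low + 1e-9 for s in scores]
    return rng.choices(population, weights=weights, k=len(population))


def two_point_crossover(a, b, rng):
    """Swap the segment between two random cut points."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


def evolve(evaluate, n_features, pop_size=150, cx_rate=0.75, mut_rate=0.01,
           max_gen=100, patience=10, seed=0):
    """Run the GA; `evaluate` maps a 0/1 list to a fitness score (hypothetical hook)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best, best_score, stale = None, float("-inf"), 0
    for _ in range(max_gen):
        scores = [evaluate(c) for c in population]
        top = max(range(len(population)), key=scores.__getitem__)
        if scores[top] > best_score:
            best, best_score, stale = population[top][:], scores[top], 0
        else:
            stale += 1
            if stale >= patience:  # best fitness unchanged for `patience` generations
                break
        parents = roulette_select(population, scores, rng)
        population = []
        for a, b in zip(parents[::2], parents[1::2]):  # assumes an even pop_size
            if rng.random() < cx_rate:
                a, b = two_point_crossover(a, b, rng)
            population += [mutate(a, mut_rate, rng), mutate(b, mut_rate, rng)]
    return best, best_score
```

In a setting like the one described here, `evaluate` would train a Random Forest on the columns selected by the chromosome and pass the resulting confusion-matrix counts to `fitness`. With $w_c = 100$ the score can become negative, which is why this sketch shifts the scores before building the roulette wheel.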

Fig. 6. Evaluation indices for NSL-KDD and UNSW-NB15.

Combining the feature selection results from the Genetic Algorithm and the Random Forest, Table IV collects the important features for binary classification and multi-class classification in the NSL-KDD [3] dataset and the UNSW-NB15 [2] dataset.
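The feature lists in Table IV are sets of feature indices, while Section III-C describes chromosomes as binary strings. A small sketch of the mapping between the two forms is given below. It assumes 1-based feature numbering read from left to right, which the text does not state explicitly; the helper names `decode` and `encode` are hypothetical.

```python
def decode(chromosome):
    """Return the 1-based indices of the features whose bit is 1."""
    return [i for i, bit in enumerate(chromosome, start=1) if bit == "1"]


def encode(indices, n_features):
    """Build the bit string that selects exactly the given 1-based indices."""
    chosen = set(indices)
    return "".join("1" if i in chosen else "0" for i in range(1, n_features + 1))


# 'Normal' row of Table IV for NSL-KDD (41 features in total).
normal = [1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 30, 36]
mask = encode(normal, 41)
print(mask)
print(decode(mask) == normal)
```

Under these assumptions the round trip is lossless, so a chromosome returned by the Genetic Algorithm can be reported directly as an index list of the kind shown in Table IV.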

Fig. 7. (a) ROC Curve for NSL-KDD. (b) ROC Curve for UNSW-NB15.
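ROC curves such as those in Fig. 7 are commonly summarized by the area under the curve (AUC). The sketch below computes AUC through the equivalent pairwise formulation: the fraction of (attack, normal) pairs in which the attack sample receives the higher score, with ties counted as one half. The toy labels and scores are illustrative only; for a multi-class task, one class would be treated as positive and all others as negative, which is an assumption about how per-class values would be obtained.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]           # 1 = attack, 0 = normal (toy data)
scores = [0.1, 0.4, 0.35, 0.8]  # classifier scores for the attack class
print(auc(labels, scores))      # 0.75
```

Because this measure depends only on how positives rank against negatives, it is not inflated by a large majority class in the way plain accuracy can be.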

Accuracy and AUC can reflect the classification ability of the classifier. Due to the imbalance problem in NSL-KDD [3] testing dataset and UNSW-NB15 [2] testing dataset, AUC


number can show the classification ability of the framework more objectively. Performance on the NSL-KDD [3] dataset and the UNSW-NB15 [2] dataset is shown in Table V.

TABLE IV: SELECTED FEATURES FOR NSL-KDD AND UNSW-NB15
Result with NSL-KDD dataset
Class | Numbers | Selected Features
Normal | 12 | 1,2,3,4,5,6,7,10,11,12,30,36
DoS | 14 | 29,30,23,5,4,38,6,35,25,24,36,26,39,2
PROBE | 15 | 36,5,35,33,12,2,40,37,6,3,32,27,41,30,26
R2L | 11 | 23,3,5,33,12,24,10,36,32,37,6
U2R | 12 | 1,24,33,32,36,23,6,10,14,17,5,13
Result with UNSW-NB15 dataset
Class | Numbers | Selected Features
Normal | 9 | 27,3,41,35,36,10,31,2,18
Reconnaissance | 14 | 41,36,27,31,8,7,28,33,10,34,40,6,15,13
Exploits | 8 | 41,31,27,28,7,2,13,14
Fuzzers | 11 | 10,3,4,41,36,31,28,29,45,46,47
Worms | 9 | 41,36,7,3,39,27,29,31,10
Generic | 9 | 35,7,3,2,27,9,11,33,46
Shellcode | 7 | 36,44,33,34,8,10,45
DoS | 12 | 2,27,41,36,31,7,12,3,10,43,45,47
Analysis | 7 | 27,2,35,7,12,28,36
Backdoor | 10 | 35,27,2,33,14,9,17,25,23,42

TABLE V: PERFORMANCE IN NSL-KDD AND UNSW-NB15
Result with NSL-KDD Testing dataset
Class | Accuracy (%) | FPR (%) | AUC
Normal | 96.12 | 2.91 | 0.96
DoS | 97.31 | 1.49 | 0.98
PROBE | 94.58 | 1.39 | 0.96
R2L | 90.79 | 0.07 | 0.92
U2R | 88.21 | 0.11 | 0.85
Result with UNSW-NB15 Testing dataset
Class | Accuracy (%) | FPR (%) | AUC
Normal | 92.06 | 1.60 | 0.95
Reconnaissance | 91.24 | 0.60 | 0.94
Exploits | 94.69 | 1.62 | 0.95
Fuzzers | 86.04 | 2.10 | 0.91
Worms | 98.81 | 1.14 | 0.98
Generic | 99.25 | 0.39 | 0.99
Shellcode | 95.43 | 2.49 | 0.97
DoS | 94.03 | 2.06 | 0.90
Analysis | 90.35 | 0.82 | 0.87
Backdoor | 86.92 | 2.81 | 0.82

Compared with other technologies, our proposed GA-RF Intrusion Detection System shows more effectiveness testing

Random Forest classifier. This evolutionary algorithm is used to select optimal features for the intrusion dataset. A new Fitness Function for the Genetic Algorithm is designed to achieve high TPR and low FPR at the same time. We also propose an optimized Random Forest classifier, which is combined with the Genetic Algorithm based feature selection method and shows higher accuracy and AUC in both binary classification and multi-class classification. FPR is also lower than with other techniques. Two benchmark datasets, the NSL-KDD [3] dataset and the UNSW-NB15 [2] dataset, are used in the experiments, though the UNSW-NB15 [2] dataset is considered a more effective representation of modern network traffic. The SMOTE algorithm is used for both the NSL-KDD [3] training dataset and the UNSW-NB15 [2] training dataset, which can remarkably improve the detection correctness of minority attacks. The main advantage of our proposed framework is that it improves the detection accuracy of the classic Random Forest by selecting essential features and reducing training time.

Future work will focus on GPU computing to shorten training time. Some deep learning algorithms will also be considered to improve detection accuracy further.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

AUTHOR CONTRIBUTIONS
ZhiQiang Liu conducted the research; YuCheng Shi analyzed data and wrote the paper.

REFERENCES
[1] W. K. Lee, S. J. Stolfo, and K. W. Mok, “A data mining framework for building intrusion detection models,” in Proc. the 1999 IEEE Symposium on Security and Privacy, 1999, pp. 120-132.
[2] N. Moustafa, “UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set),” in Proc. Military Communications and Information Systems Conference (MilCIS), 2015.
[3] S. Revathi and A. Malathi, “A detailed analysis on NSL-KDD dataset using various machine learning techniques for intrusion detection,” International Journal of Engineering Research & Technology (IJERT), vol. 2, pp. 1848-1853, 2013.
[4] D. E. Denning, “An intrusion detection model,” IEEE Transactions on Software Engineering, vol. 13, no. 2, pp. 222-232, 1987.
[5] W. Gongxing and H. Yimin, “Design of a new intrusion detection system based on database,” in Proc. 2009 International Conference
on NSL-KDD [3] dataset and the UNSW-NB15 [2] dataset, on Signal Processing Systems, 2009, pp. 814-817.
[6] A. K. Saxena, S. Sinha, and P. Shukla, “General study of intrusion
which can highly represent the current network traffic state. detection system and survey of agent based intrusion detection system,”
The performance comparison is shown in Table VI. in Proc. 2017 International Conference on Computing,
Communication and Automation (ICCCA), 2017, pp. 421-471.
TABLE VI: PERFORMANCE COMPARED WITH OTHER METHODS [7] S. Northcutt and J. Novak, “Network intrusion detection,” IEEE
Method Accuracy (%) FPR (%) DATASET Network, vol. 8, no. 3, pp. 26-41, 2003.
ANN [9] 98.86(three layer) - NSL-KDD [8] L. Haripriya and M. A. Jabbar, “Role of machine learning in intrusion
GA-based J48[13] 91.86 - NSL-KDD detection system: Review,” in Proc. 2018 Second International
GA-based NB [13] 89.5 - NSL-KDD Conference on Electronics, Communication and Aerospace
K-mean RF [14] 99.8 - 10% of KDD’99 Technology (ICECA), 2018, pp. 925-929.
RS-GA-SVM [16] 88.2 2 KDD’99 [9] M. B. Subba, S. Biswas, and S. Karmakar, “A neural network based
system for intrusion detection and attack classification,” in Proc. 2016
RF-based IDS [17] 94.7 2 KDD’99
Twenty Second National Conference on Communication (NCC), 2016,
GA-RF (Proposed) 96.12 2.91 NSL-KDD pp. 1-6.
GA-RF (Proposed) 92.06 1.60 UNSW-NB15 [10] P. S. Tang, X. L. Tang, and Z. Y. Tao, “Research on feature selection
algorithm based on mutual information and genetic algorithm,” in Proc.
2014 11th International Computer Conference on Wavelet Active
Media Technology and Information Processing, 2014.
V. CONCLUSIONS [11] S. Aksoy, “Feature reduction and selection,” Department of Computer
In this paper, we propose a novel Genetic Algorithm based Engineering, Bilkent University, 2008.
[12] B. Kavitha, S. Karthikeyan, and B. Chitra, “Efficient intrusion
feature selection Intrusion Detection System which uses the detection with reduced dimension using data mining classification

49
International Journal of Machine Learning and Computing, Vol. 12, No. 2, March 2022

methods and their performance comparison,” in Proc. International Conference on Business Administration and Information Processing, 2010, pp. 96-101.
[13] K. S. Desale and R. Ade, “Genetic algorithm based feature selection approach for effective intrusion detection system,” in Proc. 2015 International Conference on Computer Communication and Informatics (ICCCI), 2015, pp. 1-6.
[14] Y. Y. Aung and M. M. Min, “An analysis of random forest algorithm based network intrusion detection system,” in Proc. 2017 18th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2017, pp. 127-132.
[15] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A detailed analysis of the KDDCUP99 dataset,” in Proc. IEEE International Conference on Computational Intelligence for Security & Defense Applications, 2009.
[16] Y. Chang, W. Li, and Z. Yang, “Network intrusion detection based on random forest and support vector machine,” in Proc. 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), 2017, pp. 635-638.
[17] J. Zhang, M. Zulkernine, and A. Haque, “Random-forests-based network intrusion detection systems,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 38, no. 5, pp. 649-659, Sept. 2008.
[18] M. Zhao, C. Fu, L. Ji, K. Tang, and M. Zhou, “Feature selection and parameter optimization for support vector machines: A new approach based on genetic algorithm with feature chromosomes,” Expert Systems with Applications, vol. 38, no. 5, pp. 5197-5204, 2011.
[19] K. Deb, An Introduction to Genetic Algorithms, pp. 293-315, 1999.
[20] S. W. Kwok and C. Carter, “Multiple decision trees,” Machine Intelligence & Pattern Recognition, vol. 4, pp. 327-335, 2013.
[21] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, Aug. 1998.

Copyright © 2022 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

Zhiqiang Liu received the B.S. and Ph.D. degrees in computer science from Northwestern Polytechnical University, Xi'an, China. From December 2012 to January 2014, he visited the University of Illinois at Urbana-Champaign (UIUC) and Portland State University (PSU). He is currently focusing on the application of artificial intelligence in network security, simulation experiments, and data analysis.

Yucheng Shi was born in Shanxi Province, China, in 1994. He received the B.S. degree from the Taiyuan University of Technology (TYUT) in 2017. He is currently pursuing the M.S. degree with the School of Software Engineering, Northwestern Polytechnical University (NPU), Xi'an City, Shaanxi Province, China. His research interests include network security, software engineering, and artificial intelligence.
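
As an illustration only, per-class accuracy, FPR, and AUC values of the kind reported in Table V can be computed in a one-vs-rest fashion, as in the minimal Python sketch below. The function names, the toy labels, and the scores are assumptions made for this example; they are not the experimental code or data used in this paper. AUC is computed here as the fraction of positive/negative pairs that the classifier ranks correctly, which is why it is less sensitive to class imbalance than accuracy.

```python
from typing import Dict, Hashable, List, Sequence


def one_vs_rest(labels: Sequence[Hashable], target: Hashable) -> List[int]:
    """Map multi-class labels to 0/1, where 1 marks the target class."""
    return [1 if label == target else 0 for label in labels]


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, TPR, and FPR for 0/1 labels (1 = target class)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,  # true positive rate
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,  # false positive rate
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive sample is scored
    higher than a random negative sample (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true classes, predicted classes, and the
# classifier's score for the "dos" class on six samples.
true_labels = ["normal", "dos", "normal", "probe", "dos", "normal"]
pred_labels = ["normal", "dos", "dos", "probe", "normal", "normal"]
dos_scores = [0.1, 0.9, 0.6, 0.2, 0.4, 0.3]

dos_true = one_vs_rest(true_labels, "dos")
dos_pred = one_vs_rest(pred_labels, "dos")
dos_metrics = binary_metrics(dos_true, dos_pred)
dos_auc = auc(dos_true, dos_scores)
print(dos_metrics, dos_auc)
```

Repeating the same computation with each class in turn as the target yields one row per class, in the layout of Table V.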
