
Cluster Computing (2022) 25:3299–3311
https://doi.org/10.1007/s10586-022-03564-9

Improving network intrusion detection by identifying effective features based on probabilistic dependency trees and evolutionary algorithm

Mahdi Ajdani¹ · Hamidreza Ghaffary¹

Received: 6 April 2021 / Revised: 29 December 2021 / Accepted: 12 February 2022 / Published online: 28 February 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract
With the expansion of computer networks, attacks and intrusions into these networks have also increased. To achieve complete security in a computer system, in addition to firewalls and other intrusion prevention equipment, systems called intrusion detection systems (IDS) are also required. This paper presents a new method for selecting features effective for network intrusion detection, based on a distribution estimation algorithm that employs a probability dependency tree to identify the interactions between the features. To evaluate the performance of this algorithm, a dataset was used in which the packets were divided into normal and attack categories. The performance of the proposed algorithm was investigated alone and in combination with other feature selection algorithms, including forward selection, backward selection, random forest, and the genetic algorithm. The effect of algorithm parameters, such as population size, on the accuracy of intrusion detection was also explored. By combining the results of this analysis with the per-category accuracy results obtained using different feature selection algorithms, a subset of features effective for intrusion detection was identified.

Keywords Data mining · NSL-KDD · KDD-cup99 · Genetic algorithm · Security · Classification
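The evaluation scheme the abstract describes is a wrapper: candidate feature subsets are scored by the accuracy a classifier achieves when trained on them alone. The following is a minimal, self-contained sketch of that idea; the toy dataset, the 1-nearest-neighbour classifier, and the feature masks are illustrative stand-ins, not the paper's NSL-KDD/SVM pipeline.

```python
import random

random.seed(0)

def make_dataset(n):
    """Toy data: feature 0 tracks the class label; features 1-2 are noise."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        x = (label + random.uniform(-0.3, 0.3),  # informative feature
             random.uniform(0.0, 1.0),           # noise
             random.uniform(0.0, 1.0))           # noise
        data.append((x, label))
    return data

def accuracy(train, test, mask):
    """1-NN accuracy using only the features where mask[i] == 1."""
    def dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in range(len(mask)) if mask[i])
    hits = 0
    for x, y in test:
        nearest = min(train, key=lambda sample: dist(sample[0], x))
        hits += (nearest[1] == y)
    return hits / len(test)

train, test = make_dataset(80), make_dataset(40)
# Wrapper scoring: the subset containing the informative feature should
# score far higher than the subset containing only the noise features.
informative_only = accuracy(train, test, (1, 0, 0))
noise_only = accuracy(train, test, (0, 1, 1))
```

A feature selection algorithm then searches over masks, using such an accuracy score as its evaluator function.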

Correspondence: Hamidreza Ghaffary ([email protected]); Mahdi Ajdani ([email protected])
1 Department of Computer Science, Ferdows Branch, Islamic Azad University, Ferdows, Iran

1 Introduction

The purpose of intrusion detection is to prevent unauthorized use, misuse of databases, and damage to network resources by both internal users and external intruders. Intrusion detection systems, a key element of the security infrastructure, are applied in many organizations. These systems include hardware and software models as well as templates that automate the network event monitoring process to address security issues. Meanwhile, the purpose of data mining is to discover or generate relationships between initial observations as well as to predict observations using the obtained patterns. The four main stages of data mining to detect intrusion include collecting data from the network with sensors of monitoring systems, converting raw data into data that can be used in data mining models, creating data mining models, and analyzing the results. Unsupervised [1] and supervised [2] data mining are two common methods of data mining. In the unsupervised method, the answer is discovered from the data themselves, while in the supervised method the response variable is known and the response to future observations must be predicted. The data mining method in this paper falls into the category of supervised algorithms [1]. After collecting data from the network, a large set of samples is examined with data mining models, through which a training set is created. The accuracy of this model is then evaluated with a test set. Several methods have been proposed for classification, such as k nearest neighbor [3], decision tree [4], support vector machine [5], Bayesian networks, and neural networks [2, 3, 4]. In this study, a standard data set in the field of intrusion detection called NSL-KDD has been used. This dataset contains 41 attributes and 5 different classes to characterize the behavior of packets in the network. These

classes include a normal class along with the four intrusion classes of DoS, U2R, R2L, and Probe attacks. The purpose of feature selection, as a data simplification step, is to identify and use the basic features. Feature selection is employed in many areas, including text classification, data mining, pattern recognition, signal processing, intrusion detection, etc. The importance of feature selection is examined in two aspects. The first is eliminating inappropriate and ineffective features, while the second treats feature selection as an optimization problem whose goal is the optimal subset of features that best meets the intended purpose [5].

In this paper, a multi-class support vector machine is used to detect intrusion using previously collected data. In order to identify the effective features in intrusion detection, a type of distribution estimation algorithm [6] has been used which models the relationships between the features. The performance of this algorithm in feature selection for this problem has been compared with that of basic algorithms such as forward selection, backward selection, and the standard genetic algorithm. Next, to improve the performance of the algorithm used, a memetic distribution estimation algorithm is proposed which employs local random search. The results of evaluating the performance of the algorithms based on overall and partial (within-category) performance indicate the good ability of this method in identifying effective features for intrusion detection. Finally, by collecting the results obtained from different feature selection algorithms, a subset of the most important features for intrusion detection is identified and introduced. In the second part of this paper, the background of the research and the literature are reviewed. The third section describes the method of applying the dependency tree distribution estimation algorithm [7] to identify effective features. In the fourth section, the results of the performed experiments are presented and explained. Finally, conclusions and suggestions for future work are presented in the fifth section. Through timely detection of intrusions and timely application of countermeasures, security can hopefully be established in the network. To achieve this goal, one of the tools that can be used is data mining. So far, various data mining methods have been used to learn the behavioral patterns of ordinary users and disruptive users. This paper explores the speed and accuracy of intrusion detection using data mining methods implemented on the KDD-Cup-99 and NSL-KDD sample data. Also, this research focuses on machine learning (ML) and data mining (DM) techniques for cybersecurity, with an emphasis on ML/DM methods to be compared with the proposed method. For most ML procedures, there should be three steps, not two: training, validation, and testing. ML and DM methods often have parameters such as the number of layers and nodes. After completing the training, there are usually several models (for example, ANNs) available. To decide which one to use and to have a good estimate of the error on a test set, there must be a separate third set, the validation dataset. The model that performs best on the validation data should be the one used; the choice should not be based on its accuracy on the test data set. Otherwise, the reported accuracy is optimistic and may not reflect the accuracy on another similar but slightly different test set. There are three major types of ML/DM approaches: supervised, semi-supervised, and unsupervised. In unsupervised learning, the main task is to find patterns, structures, and knowledge in unlabeled data. When part of the data is labeled during data acquisition or by human experts, the problem is called semi-supervised learning. Adding labeled data is of significant help in solving the problem. If the data are fully labeled, the learning problem is supervised, and it is usually the task of finding a function or model that explains the data. Methods such as curve fitting are used to fit the data to the underlying problem. In general, this label is a variable that problem experts assume is related to the data collected. Unfortunately, the methods that are most effective for cyber applications have not yet been established. Also, given the richness and complexity of the methods, it is impossible to make a single recommendation for each method based on the type of attack the system is expected to detect. There is no single criterion for determining the effectiveness of methods, but several criteria that must be considered. There are some features of this problem that make ML and DM methods difficult to use. They are particularly related to the fact that the model often needs to be retrained. An effective area for research is the study of rapid incremental learning techniques that can be used for daily updates of models against abnormal use and intrusion, as introduced in this paper.

2 Literature review

2.1 Feature selection for classification

Detection of intrusion is essentially a matter of classification. In this regard, feature selection is important: although there is no linear relationship between the number of features and the performance of a classifier, exceeding a certain number of attributes will change the performance of the classifier. Feature selection for large-scale data improves classifier performance, while also reducing detection time and cost [6, 7]. In problems with a large number of features, feature selection is a common step in machine learning methods. One of the most common stepwise feature selection schemes is to generate a subset of features, evaluate the subset, check a termination criterion, and validate the results. In order to select a

subset of features, the process of selecting and evaluating the subset is repeated until the termination condition is met. Generating a subset of features is essentially an exploratory search process in a search space. The nature of this process is determined by two basic issues. First, the search point or starting points, which affect the search direction, must be determined. The second issue is determining a search strategy. Various strategies, including complete, sequential, and random search, are applied to find the optimal subset. Each newly created subset must be evaluated against a criterion. The evaluation criteria are classified into two groups, independent and dependent, according to their dependence on the learning algorithm. In the independent model, the feature subset is selected independently of the learning algorithm. In the dependent model, however, a learning algorithm is used as an evaluator function [9] to select the appropriate subset. The stop criterion determines when the feature selection process should stop. Some stop criteria include setting limits that should not be violated, setting a maximum number of iterations, stopping the selection process when adding attributes does not result in a better subset, or stopping when a subset has been selected that is good enough. A simple way to validate results is to draw conclusions using prior knowledge; however, in real-world applications, there is usually no such knowledge; thus, some indirect method should be used to monitor the change in performance.

The two basic algorithms for feature selection are forward and backward selection. The feature selection process in the forward selection method starts with an empty set; in each iteration of the algorithm, a feature is added to the answer set and evaluated using the evaluator function. This is repeated until the required number of attributes is selected. The problem with the forward selection algorithm is that an added feature is never removed from the answer set, even if it turns out to be inappropriate. The backward selection algorithm, unlike the forward selection algorithm, starts with a set containing all attributes, and in each iteration of the algorithm, the attribute chosen by the evaluator function is removed from the attribute set. This continues until the removal of any further feature yields no improvement. The features removed from the set in this way are never added back, even if they are appropriate.

2.2 Support vector machine

The support vector machine is a binary classifier that separates two classes using a linear boundary. The goal in linear data segmentation is to find the function that determines the maximum-margin hyperplanes. Assume the training data set contains n samples as follows:

D = {(x_i, y_i) | x_i ∈ R^p, y_i ∈ {1, −1}}, i = 1, …, n    (1)

The value of y_i is equal to 1 or −1, and each x_i is a real vector of dimension p. The training data closest to the separating hyperplanes are called support vectors. Accordingly, the goal is to find the separating hyperplane with the longest distance from the support vectors of the two classes. As displayed in Fig. 1, as the margin of the hyperplane is maximized, the separation between the classes is also maximized. Each hyperplane can be written as the set of points x satisfying a condition of the form w · x − b = 0, where w is a normal vector perpendicular to the hyperplane. w and b should be selected so as to create the maximum distance between the parallel hyperplanes that separate the data. These hyperplanes are described by Eq. (2):

w · x − b = 1
w · x − b = −1    (2)

w is the weight vector orthogonal to the separating hyperplanes and b determines the offset of each hyperplane from the origin. The first equation is the separating hyperplane of the positive samples and the second that of the negative samples. In a support vector machine, a set of points can be separated in two ways, linearly or nonlinearly. If the training data are linearly separable, the two hyperplanes at the margin of the points can be chosen so that no points lie between them; then, attempts are made to maximize the spacing between those hyperplanes. When the data can be separated linearly, the support vector machine obtains an optimal hyperplane with a maximum margin by solving an optimization problem over the training data set.

Fig. 1 Hyperplane separating the two classes +1 and −1
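The two greedy searches described in Sect. 2.1 can be sketched as follows. The per-feature scores and the additive evaluator are hypothetical placeholders; in the paper's setting the evaluator would be the accuracy of the trained classifier on the candidate subset.

```python
def forward_selection(features, evaluate, k):
    """Start empty; each round, add the feature that maximises the evaluator.
    Added features are never reconsidered, which is the weakness noted in
    the text."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

def backward_selection(features, evaluate, k):
    """Start with all features; each round, drop the feature whose removal
    costs the least. Dropped features are never re-added."""
    selected = list(features)
    while len(selected) > k:
        victim = max(selected,
                     key=lambda f: evaluate([g for g in selected if g != f]))
        selected.remove(victim)
    return selected

# Illustrative, hypothetical per-feature scores and an additive evaluator.
scores = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.5}

def evaluate(subset):
    return sum(scores[f] for f in subset)

fwd = forward_selection(list(scores), evaluate, 2)
bwd = backward_selection(list(scores), evaluate, 2)
```

With an additive evaluator both directions agree; they diverge precisely when features interact, which is the motivation for modeling feature dependencies explicitly in Sect. 2.4.3.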


If the data are linearly inseparable and the classes overlap, the separation of the classes with a linear boundary is always accompanied by some error. In order to solve this problem, the data are first transferred from the initial space to a higher-dimensional space using a nonlinear transformation, so that in the new space the classes interfere with each other less. The transition to higher dimensions is done through kernel functions. Different kernel functions have been introduced for this purpose, such as the polynomial function, the radial basis function, etc. [8, 9].

2.3 Previous studies

Most intrusion detection systems mainly use a classification algorithm for intrusion detection; however, such systems are successful only when they provide the best possible intrusion detection with a low false alarm rate. Various studies have been conducted on the use of data mining techniques to detect intrusion. Each of these studies seeks to provide better results in discovering useful patterns for intrusion detection systems. The data mining techniques applied in this regard include the following.

Wang et al. [13] proposed a method combining neural networks and fuzzy clustering in order to enhance the accuracy and stability of detecting low-frequency attacks, achieved by reducing the false-positive rate [28]. Initially, the entire learning set is broken down into smaller subsets using fuzzy clustering, and a suitable neural network is applied to each subset. Each neural network can learn its subset faster and more accurately, and finally, using the fuzzy aggregation method [29], the main output is obtained from the outputs of all the neural networks. Chen and Abraham [14] utilized the distribution estimation algorithm to train a feedforward neural network classifier for intrusion detection. The weights, biases, and function parameters used in the neural network, such as those of the Gaussian or sigmoid function, were optimized by the distribution estimation algorithm. In that work, the neural network classifier was also trained with the particle swarm optimization algorithm [31]. The comparison of the results revealed the high accuracy and favorable false-positive rate of the neural network training method based on the distribution estimation algorithm. Also, [15] proposed two neural network-based methods for detecting misuse-based intrusion. The first method was to use a neural network with fewer data and the principal component analysis technique [6], while the second method involved applying a neural network with all the features of the database. According to the results, the use of fewer features in the KDDCUP99 database would improve the time and memory requirements of intrusion detection. In another similar work [16], an intrusion detection system was introduced using a principal component analysis algorithm to reduce the number of features, thereby reducing system complexity, and using a support vector machine to classify samples. The proposed system sped up intrusion detection processing and greatly reduced the amount of memory required. Clustering was used to identify normal behavior in [17]. Normal behaviors were grouped in a normal cluster, with normal clusters used as signatures to detect intrusion, whereby any deviation from them was considered an intrusion. In [18], a hybrid method called FWP-SVM-GA was proposed. In this algorithm, initially the probabilities of the combination and mutation operators in the genetic algorithm were calculated according to the evolutionary status of the population and the optimal fitness value, which was used to select the features. Its innovation was the calculation method of the fitness function, which combined three parameters for each subset of features: true positive rate (TPR), error rate (Error), and number of selected features (NumF(S)). Finally, given the subset of optimal features, the feature weights and SVM parameters were optimized simultaneously.

In a similar work, [19] applied a recursive feature elimination algorithm, akin to the backward selection method, to identify the relevant features in intrusion detection. They used two classifiers, a Gaussian kernel support vector machine and the nearest neighbor, to decide whether to omit variables. To improve the classification accuracy with each subset of features, suitable values for the parameters of the classification algorithms were obtained via a parameter tuning method. In another work [20], first, 17 and 35 related features were selected from the total of 41 features in the NSL-KDD database, respectively, using correlation-based feature selection and Chi-square methods. Then, support vector machine classification models and intrusion detection neural networks based on the selected features were employed. In [21], using the particle swarm optimization algorithm, appropriate values for the parameters C (cost) and g (gamma) of the support vector classifier were found, and the same algorithm was used simultaneously to select a subset of related features in the detection problem. In [8], the authors proposed a hybrid principal component analysis (PCA)-firefly based machine learning model to classify intrusion detection system (IDS) datasets. The dataset utilized in that study was collected from Kaggle. The XGBoost algorithm was implemented on the reduced dataset for classification. In [32], a Crow-Search based ensemble classifier was used to classify the IoT-based UNSW-NB15 dataset. Initially, the most significant features were selected from the dataset using the Crow-Search algorithm, after which these features were fed to an ensemble classifier based on Linear Regression, Random Forest, and XGBoost algorithms for training. In [33], the NSL-KDD, ISCXIDS2012, and CICIDS2017 datasets were used for training and testing purposes. The J48Consolidated classifier provided the highest accuracy of 99.868%, a


misclassification rate of 0.1319%, and a Kappa value of 0.998. Thus, this classifier was proposed as the ideal classifier for designing IDSs. Finally, in [34], the prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), were investigated on four popular machine learning (ML) algorithms, Decision Tree Induction, Support Vector Machine (SVM), Naive Bayes Classifier, and Random Forest Classifier, using the publicly available Cardiotocography (CTG) dataset from the University of California, Irvine Machine Learning Repository. The experimentation results demonstrated that PCA outperformed LDA in all the measures. The performance of the classifiers Decision Tree and Random Forest was not affected much by PCA and LDA. To further analyze the performance of PCA and LDA, the experimentation was carried out on Diabetic Retinopathy (DR) and Intrusion Detection System (IDS) datasets. The experimentation results proved that ML algorithms with PCA yielded better results when the dimensionality of the datasets was high. However, when the dimensionality of the datasets was low, it was observed that the ML algorithms without dimensionality reduction generated better results. Table 1 summarizes the dataset feature values in previous studies.

Special emphasis was placed on finding sample articles describing the application of various ML and DM techniques in cyberspace, both for cyber-attacks and for intrusion detection. Unfortunately, the methods that are most effective for cyber applications have not yet been established, and given the richness and complexity of the methods, it is impossible to make a single recommendation for each method based on the type of attack that the system is expected to detect. The relevant criteria include, among others, the time needed to classify an unknown sample with a trained model and the understandability of the final solution (classification) of each ML or DM method; depending on the particular case, some of them may be more important than others. Another vital aspect of ML and DM for cyber intrusion detection is the importance of the datasets for training and testing the systems. ML and DM methods cannot work without representative data, and it is difficult as well as time-consuming to create such datasets. If an IDS could have access to network- and kernel-level data, it would be able to detect both anomalies and misuse. If only NetFlow data are available, these data should be augmented by network-level data that provide additional features for the dataset. If possible, the network dataset should be backed up with additional data.

The main gap observed is the availability of labeled data, and as such it will certainly be a worthwhile investment to collect data and label some of them. With such a new dataset, several types of ML and DM methods can be used to develop models, as opposed to limiting the list of ML methods effective for cyber applications. Significant improvements can be made to cybersecurity using ML and DM methods with such a dataset.

There are some features of this cyber problem that make ML and DM methods difficult to use, but in this paper

Table 1 Summary of feature values in previous studies

Feature name                   Value type
same_srv_rate                  Float
dst_host_same_srv_rate         Float
dst_host_srv_serror_rate       Float
count                          Integer
logged_in                      Integer
dst_host_srv_diff_host_rate    Float
src_bytes                      Integer
service                        Nominal
dst_host_rerror_rate           Float
dst_host_diff_srv_rate         Float
wrong_fragment                 Integer
num_compromised                Integer
dst_bytes                      Integer
diff_srv_rate                  Float
rerror_rate                    Float
is_guest_login                 Integer
land                           Integer
num_failed_logins              Integer
num_root                       Integer
num_shells                     Integer
num_outbound_cmds              Integer


we can solve these problems using the NSL-KDD, CIDDS-001, KDD99, and VIRUSTOTAL datasets, which are normalized before use, together with many of the methods described in [6, 7, 8, 13–34], introducing a new method that draws on the strengths of all of them.

2.4 Evolutionary algorithms

Evolutionary algorithms are a stochastic, population-based approach to solving optimization problems. The genetic algorithm is one of the most basic types of evolutionary algorithms. This algorithm is inspired by natural selection theory [10] and genetic recombination [11]. In this algorithm, by recombining promising answers [12], the optimal solution to the problem can be found. Use of this method on various problems has led to favorable results, but in some cases, simple selection and recombination are not effective in reaching the optimal answer. This is more likely to be the case when the structural blocks [13] of the optimal solution are poorly distributed in the search space. This is due to the lack of effective conservation of the structural blocks, or partial solutions, that arise in candidate solutions. The answers to sub-problems, representing the knowledge and relationships that govern the dimensions of the problem, are called structural blocks. The search for a technique to better protect structural blocks has led to the emergence of a new class of evolutionary algorithms called distribution estimation algorithms [10]. In the third section, these types of algorithms are further explained.

2.4.1 Genetic algorithm

The genetic algorithm uses genetic operators during the reproductive phase. The selection [14], combination [15], and mutation [16] operators are those mostly used in genetic algorithms. The use of these operators on a population can cause the dispersion, or genetic diversity, of the population to disappear. The general process of the genetic algorithm is as follows. In the first stage, an initial population is generated from candidate solutions called chromosomes. The parent population is then selected from the initial population using an evaluator function. Next, applying the combination and mutation operators to the parent population generates a population of new solutions called the child population. Finally, the population of new solutions is combined with the initial population, forming the initial population of the next generation [14]. This process continues until one of the stopping conditions is met.

2.4.2 Distribution estimation algorithms

Distribution estimation algorithms prevent the loss of partial solutions on chromosomes as much as possible. Indeed, by giving the structural blocks a high probability, an attempt is made for these blocks to appear in the offspring. For this purpose, instead of using the standard genetic operators, the probability distribution of promising answers is estimated in order to generate candidate answers. At each stage of the algorithm, a probabilistic model is constructed based on the selected population of solutions, with the next generation of candidate solutions [17] generated by sampling this model. Thus, distribution estimation algorithms can be called probabilistic model-based genetic algorithms, in which the two operators, combination and mutation, are replaced by constructing a probabilistic model and sampling the constructed model [11]. In general, distribution estimation algorithms are divided, based on the probabilistic model used and the number of dependencies between genes, into three categories: univariate [18], bivariate [19], and multivariate [20]. The difference between these three models lies in the number of dependencies of each variable on the other variables. The algorithms in the univariate category do not consider any interdependence between genes. Indeed, the structural blocks are of the first order, and the probability distribution is calculated by multiplying the marginal probabilities of all variables in each individual; the most popular of these algorithms are the univariate marginal distribution algorithm [21], the population-based incremental learning algorithm [22], and the compact genetic algorithm [23]. In many cases, most variables are somehow related. In the bivariate model, the algorithm is able to record some binary interactions between variables using structures such as trees. In tree-based models, one variable may be related to more than one other variable, whose offspring are placed in a tree structure. These algorithms are able to model second-order relationships between the problem genes. Thus, the probability distribution model will be slightly more complex than the univariate model, creating a form similar to a probability network between variables. The mutual information maximizing input clustering algorithm [24] (MIMIC) and the combining optimizers with mutual information trees algorithm [25] (COMIT) are examples of these algorithms. Experimental results have shown better performance of the COMIT algorithm compared to MIMIC, PBIL, and GA. The dependency tree distribution estimation algorithm used in this research is similar to COMIT.

In distribution estimation algorithms based on multivariate models, it is possible to model higher degrees of relationships between variables. The extended compact

genetic algorithm [26] and the Bayesian optimization algorithm are examples of such algorithms [11, 12]. The major problem of these algorithms is the high complexity and time-consuming nature of the modeling, due to the complexity of the probabilistic models used. As the process of evaluating selected feature subsets with the classification method is itself very costly, in this study a bivariate, dependency tree-based algorithm is employed to search the space of possible subsets, as explained below.

2.4.3 Probabilistic dependency tree model

A probabilistic model gives the probability of the different values of the problem variables. Each probabilistic model contains a structure and a number of parameters. The structure of the probabilistic model specifies the dependencies of the variables, and the parameters of the probabilistic model specify the probability values of these dependencies. If the variables are binary, the probabilistic model gives, for each variable, the probability of the values zero and one. The probabilities are calculated from the solutions available in the parent population. The dependency tree is a bivariate model of the tree type that captures the relationship of each variable conditioned on one other variable. This tree is a directed tree whose nodes are the problem variables. The probability distribution defined by the tree structure is according to Eq. (3):

P(x) = ∏_{i=1..n} P(x_i | x_j)    (3)

In the above relation, x_j refers to the parent of x_i; when x_i is the root, the factor is equal to p(x_i). To learn the dependency tree from the population, first the entropy of each of the variables is obtained and the most irregular variable is identified as the root. Then, based on the criterion of mutual information [26], a dependency matrix is formed between the problem variables, with rows and columns according to the number of problem variables; in each of its cells, the mutual information between the two corresponding variables is calculated. Subsequently, with a maximum spanning tree construction algorithm, at each step the variable is selected that is most related to the variables already added to the tree. To sample a new solution, the root variable comes first: its distribution is specified in the parameters section of the probabilistic model, according to which a value is randomly generated and assigned to that variable in the new instance. This process is repeated for all other variables to generate a value for every variable. To generate more solutions, the above process must be repeated starting from the root [11].

2.4.4 Random forest algorithm

The random forest algorithm was first introduced in 2001 by Breiman. The random forest is a supervised classification algorithm that comprises a set of decision trees. The decision tree algorithm is one of the most widely used classification algorithms and data mining methods. In a decision tree, the class or category is determined by following a set of questions related to the properties of the data and examining the current data item to make a decision. CART is the binary tree algorithm in the decision tree family. A random forest is a collection of CART trees and is built in four stages [7]:

1. K subsets of the training samples (D1, D2, …, Dk) are selected from the training set (D) by the Bootstrap sampling method; from these, K decision trees will be created.
2. At each node of a classification tree, m features are chosen at random, and according to the minimum impurity of the node, the best feature among these candidate splits is appointed. The trees grow accordingly.
3. This step is the repetition of the second step until the K decision trees are produced.
4. The fully grown decision trees form the random forest. A new sample is classified at the final stage of the random forest by a majority vote of the trees.

2.5 Identifying effective features

In this research, the dependency tree distribution estimation algorithm is used to select the features. In this algorithm,
tree and is added to the dependency tree with an edge. This each person in the solution population represents a subset
operation continues until all the variables are added to the of features. The support vector machine classification is
tree. Finally, using a Monte Carlo estimation method, the used to evaluate each solution from the population. After
marginal probability of the root variable and the condi- selecting promising population solutions, they are used as a
tional probability of other variables at the condition of their data set to train a possible dependency tree model with the
parents are calculated according to the population of method described in Sect. 2.3.3. After being evaluated in
promising solutions. New solutions (children) are gener- the main population, the new solutions produced replaces
ated independently of each other using the probabilistic the previous worse solutions. The details of how to display
model learned. For this purpose, the structure of the the feature subset in the algorithm and how to use the
probabilistic model is used. The root is the first variable for support vector machine to evaluate solutions are described
which the value is generated, as the root does not depend below.
on any variable. The probability of different values of the
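To make the tree-learning and sampling procedure of Sect. 2.4.3 concrete, the following is a minimal, illustrative Python sketch, not the authors' implementation (which is written in C and called from R): the highest-entropy variable becomes the root, a maximum spanning tree is grown over pairwise mutual information, the probability tables are estimated from the promising solutions, and new children are sampled root-first. The toy population and all function names are hypothetical.

```python
# Illustrative sketch of a dependency-tree EDA model over binary variables.
import math
import random
from collections import Counter

def entropy(column):
    """Shannon entropy of one variable's values across the population."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def mutual_information(a, b):
    """Mutual information between two variables' value columns."""
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    return sum((c / n) * math.log2((c * n) / (pa[x] * pb[y]))
               for (x, y), c in Counter(zip(a, b)).items())

def learn_dependency_tree(population):
    """population: list of equal-length binary tuples (the promising solutions)."""
    columns = list(zip(*population))          # one value column per variable
    n_vars = len(columns)
    root = max(range(n_vars), key=lambda i: entropy(columns[i]))
    parent = {root: None}
    # Prim-style maximum spanning tree over the mutual-information matrix
    while len(parent) < n_vars:
        i, j = max(((i, j) for i in parent for j in range(n_vars) if j not in parent),
                   key=lambda e: mutual_information(columns[e[0]], columns[e[1]]))
        parent[j] = i
    # parameters: P(root = 1) and P(x_i = 1 | value of x_i's parent)
    tables = {}
    for v, p in parent.items():
        if p is None:
            tables[v] = sum(columns[v]) / len(population)
        else:
            tables[v] = {val: (sum(1 for s in population if s[p] == val and s[v] == 1)
                               / max(1, sum(1 for s in population if s[p] == val)))
                         for val in (0, 1)}
    return parent, tables

def sample_solution(parent, tables, rng):
    """Generate one child: root first, then each variable given its parent's value."""
    order, pending = [], list(parent)
    while pending:                            # visit parents before children
        for v in list(pending):
            if parent[v] is None or parent[v] in order:
                order.append(v)
                pending.remove(v)
    child = {}
    for v in order:
        p1 = tables[v] if parent[v] is None else tables[v][child[parent[v]]]
        child[v] = 1 if rng.random() < p1 else 0
    return tuple(child[v] for v in range(len(parent)))
```

For ties in mutual information the sketch simply keeps the first candidate edge; the paper does not specify a tie-breaking rule.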


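Putting Sects. 2.4.3 and 2.5 together, the overall feature-selection loop might look like the sketch below. For brevity it substitutes a univariate probability model for the dependency tree and a synthetic stand-in fitness for the one-vs-all SVM accuracy, so every name and constant here (N_FEATURES, RELEVANT, the 0.1 size penalty, population size and generation count) is an illustrative assumption rather than the paper's configuration.

```python
# Hypothetical sketch of an EDA feature-selection loop with binary masks.
import random

N_FEATURES = 8
RELEVANT = {0, 2, 5}   # assumed ground truth, used only by the toy fitness

def fitness(mask):
    """Stand-in for the SVM accuracy: reward relevant features,
    lightly penalize subset size."""
    return sum(mask[i] for i in RELEVANT) - 0.1 * sum(mask)

def eda_feature_selection(pop_size=30, generations=15, seed=1):
    rng = random.Random(seed)
    # each individual is a binary mask over the features
    population = [tuple(rng.randint(0, 1) for _ in range(N_FEATURES))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        promising = population[: pop_size // 2]
        # probability model learned from the promising masks (univariate here,
        # where the paper uses the dependency tree of Sect. 2.4.3)
        probs = [sum(m[i] for m in promising) / len(promising)
                 for i in range(N_FEATURES)]
        children = [tuple(1 if rng.random() < p else 0 for p in probs)
                    for _ in range(pop_size - len(promising))]
        population = promising + children   # children replace the worst half
    return max(population, key=fitness)

best_mask = eda_feature_selection()
```

Because the promising half is kept each generation, the best mask found never degrades; in the paper this elitist replacement is combined with the local search of Sect. 2.7.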
2.6 Evaluation and coding function of solutions

Each subset of features is evaluated with a multi-class support vector machine classifier. For this purpose, the classifier is first trained on a training set filtered according to the given feature subset; it is then evaluated on test data filtered in the same way. A separate classifier is trained for each class (one-vs-all approach). Finally, the average performance of the different support vector machine classifiers on the test set is the criterion for evaluating the feature subset. To encode each subset of attributes, as in previous work, a binary string with one position per attribute is used: a zero means that the corresponding attribute is not selected, while a one means that the attribute is included in the subset.

2.7 Combination with local search

Memetic algorithms are a class of meta-heuristic algorithms obtained by combining heuristic methods such as local search with basic search engines such as evolutionary algorithms. They can improve the performance of the basic search algorithm, for example by reducing the time needed to reach the optimal solution [22]. Evolutionary algorithms are usually designed to explore the entire search space; local neighborhood search, by contrast, examines the surroundings of the solutions found by the evolutionary algorithm in order to find better ones. The choice of generation operators in a memetic algorithm, as well as the type of local search method used in it, leads to very different execution results. Accordingly, in this paper a local search algorithm is used which, after receiving the solution obtained by the distribution estimation algorithm, examines its neighborhood: it repeatedly moves to the most suitable adjacent subset, going as far as possible, and finally replaces the current solution with the best solution found.

This study was conducted to detect network intrusion. To achieve this goal, a more comprehensive method was used, combining either the random forest or the genetic algorithm. Combining learners with the evolutionary algorithm specializes their tasks; by focusing the learners, it helps distribute the error around the goal and ultimately enhances the accuracy of the decision-making system. Thus, by combining their opinions, more accurate results are obtained than in comparable studies.

3 Results

3.1 Dataset

In the NSL-KDD dataset, each record contains 43 fields. Attribute 41 is the label field, which specifies normal behavior or the type of intrusion, and the last field represents the degree of difficulty of detecting the intrusion. The label column has 5 categories: one normal class and four intrusion classes, namely DoS, U2R, R2L, and Probe. In a denial of service (DoS) attack, saturating the target machine with connection requests creates high overhead on the server and prevents it from responding to legitimate network traffic. In a user-to-root (U2R) attack, the attacker uses a normal user account to exploit a vulnerability of the system in order to gain root privileges. In a remote-to-local (R2L) attack, the attacker is able to send packets to a machine but has no account on it and cannot access the system as a user. In a probing attack, the attacker scans the machine to identify vulnerabilities that may be exploited later; this yields a list of potential weaknesses of a machine that can be used to carry out an attack.

The features in this data set are divided into numerical and textual data in three categories: basic, content, and traffic [23]. Basic features are those extracted from a TCP/IP connection; these features delay the intrusion detection process. Examples are the connection duration, the type of protocol and service used, and the bytes sent and received on a connection. Content features: unlike many denial-of-service and scanning attacks, R2L and U2R attacks do not show a sequential pattern of anomalous repetition. This is because, unlike denial-of-service and scanning attacks, which make multiple connections to hosts over a short period of time, R2L and U2R penetrations are embedded in the data portion of network packets and generally involve a single connection. To detect this type of attack, features capable of searching for intrusive behavior in the packet data section are required, such as the number of failed login attempts; these are called content attributes. Examples include the total number of operations performed on a connection, the number of failed logins on a connection, and whether the user accessed the system as an administrator. Traffic features are calculated over a window and are divided into two groups: (i) connections, called time-based, that had the same service and host as the current connection within the last two seconds; and (ii) features designed to evaluate attacks that span more than two seconds, which determine the percentage of past connections to the current connection with the same service and host and are called machine-based. Also, the CIDDS-001, KDD99, and VIRUS TOTAL datasets are used for comparing the algorithms with each other.

Table 2 Simulation assumptions

Feature fitness strategy: SVM
Standardize: 0–1
population_size = 50, 100, 150
problem_size = 42
max_generations = 10
Selection operator: binary tournament selection
Data set: NSL-KDD, CIDDS-001, KDD99 and VIRUS TOTAL
SVM type: one vs. all
Kernel function: MLP
Number of packets in train data set: 100,000
Number of packets in test data set: 12,000

3.2 Simulation assumptions

The dependency tree distribution estimation algorithm has been implemented in the C language; as a mex function, it can then be executed in the R version 3.4.2 environment. The relevant sections have also been added to call the evaluation function. The other default simulation values are reported in Table 2. The NSL-KDD, CIDDS-001, KDD99, and VIRUS TOTAL datasets are preprocessed and the data are normalized before the tests. Also, non-numerical data need to be converted to numerical data to train the support vector machine classifier. Then the feature selection algorithms (genetic algorithm, distribution estimation, distribution estimation with local search, forward selection, and backward selection) are implemented.

3.3 Results of simulation

In this section, the results of the experiments on the NSL-KDD database are presented. Figure 2 compares the performance of the five feature selection methods, using the support vector machine as the evaluation function, for population sizes 50, 100, and 150. As can be seen, the forward selection and backward selection algorithms are population-free, so population growth has no effect on their performance. According to the obtained results, at smaller population sizes the genetic algorithm outperformed the distribution estimation algorithm, and the difference in accuracy diminished as the population size increased. Combining the distribution estimation algorithm with local search also significantly improved its performance at small population sizes.

Fig. 2 Comparison of intrusion detection accuracy with the implementation of feature selection algorithms

Since the packets in the NSL-KDD, CIDDS-001, KDD99, and VIRUS TOTAL databases are classified into five different classes, the accuracy within the obtained categories, using different feature selection algorithms and different population sizes, is presented in Fig. 3. Intrusion classes that have only a small number of samples to learn from in the training database are detected with far less accuracy, reducing the overall detection accuracy.

3.4 Comparison

The results obtained by the proposed method for selecting the effective features for intrusion detection are compared in Table 3 with other similar work on the NSL-KDD, CIDDS-001, KDD99, and VIRUS TOTAL databases, using the support vector classifier. As can be observed, the average accuracy obtained with the proposed method is better than that of previous methods in some cases. For example, compared with the particle swarm optimization algorithm [21], the proposed method achieved better accuracy; however, in the methods used in references [19] and [21], in addition to feature selection, appropriate values of the support vector machine classification parameters were also obtained by optimization or manually. The present study only focused on feature selection, where

common parameters for classification were considered. Note also that the figure reported for the proposed algorithm is its average performance, whereas for the other methods the performance is used as reported.

Fig. 3 Comparison of accuracy within categories resulting from implementing feature selection algorithms

Table 3 Comparison with previous studies

81.5   PSO-SVM (feature selection) [21]
49.4   PSO-SVM (parameter tuning) [21]
82.3   Filter (35 features) + SVM [20]
81.27  RFE-SVM [19]
84.93  Local&EDA50-SVM
84     GA100-SVM
85.62  FS-SVM
86.68  GA-RF

3.4.1 Effective features in intrusion detection

In Fig. 4, the results of the different algorithms studied are combined, and the frequency with which each feature was selected over five different runs is shown. Each column represents a feature, and the brightness of each cell in the image reflects the number of times that feature was selected in the five runs of each algorithm (rows). The white squares indicate that the corresponding feature was selected in every one of the five runs of the algorithm.

Based on the combined results, the following features are effective in intrusion detection and have been selected by most feature selection algorithms:

– Protocol_type (type of protocol used for the connection)
– Wrong_fragment (number of packets with an incorrect checksum in a connection)
– Count (number of connections that have the same destination IP address)
– Is_guest_login (whether the user accesses the system as a guest or moderator)
– Dst_host_srv_serror_rate (percentage of connections that have the same destination port number and whose flag attribute value is S0, S1, S2 or S3)
– Difficulty of detecting the intrusion

Note that some features do not help much in detecting intrusion and are considered irrelevant or redundant by most feature selection algorithms:

– Duration (connection duration)
– Flag (connection status)
– Hot (total operations performed on a connection)
– Num_failed_logins (number of failed logins in a connection)
– Logged_in (takes the value one if the login is successful)
– Srv_diff_host_rate (percentage of feature-24 connections that have different destination machines)
– Dst_host_count (total number of connections that have the same destination IP address)
– Dst_host_diff_srv_count (percentage of feature-32 connections that have different services)
– Dst_host_serror_rate (percentage of feature-32 connections whose flag attribute values are S0, S1, S2 or S3)
– Dst_host_rerror_rate (percentage of feature-32 connections whose flag attribute value is REJ)

Compared to the previously reported algorithm [23], which did not use the concept of information gain and was based on the ordinary decision tree, the method used in this

study has shown about 60.6% improvement in correct allocation.

Fig. 4 Selected properties obtained from five executions of each algorithm

4 Conclusions

Intrusion detection is essentially a classification problem, and feature selection is one of the issues addressed in classification. For large data, feature selection reduces the detection time and cost while also improving classifier efficiency. This paper compared the performance of the genetic feature selection algorithm, distribution estimation, hybrid distribution estimation with local search, forward selection, and backward selection, with the support vector machine classifier used as the fitness function of these algorithms. In this regard, a general experimental framework was designed on two benchmark datasets, namely NSL-KDD and VIRUS TOTAL, with different state-of-the-art machine learning classifiers, including KNN-RF, PSO-RF, PSO-SVM_GA, and GA, to investigate the influence of the four feature evaluation measures on the classification accuracy of an IDS. Under optimized parameter settings, all classifiers provided competitive results, with RF offering better detection accuracy for all feature evaluation measures. On the other hand, all the other classifiers gave their best detection accuracy with the consistency measure for most of the attack classes. Further, the effect of the feature evaluation measures on IDS detection performance was analyzed; all the feature evaluation measures revealed a good detection rate for most classes of attacks.

According to the obtained results, the genetic algorithm with a population size of 100 maximized the detection accuracy of normal packets. Also, the local search distribution estimation algorithm with a population size of 50 led to the highest accuracy in detecting DoS-type attacks. Further, in detecting U2R attacks, which have the fewest samples in the training database, the forward selection algorithm was better than the other algorithms; the best detection of R2L-type attacks resulted from the backward selection algorithm, and of Probe-type attacks from the local search distribution estimation algorithm with a population of 150, which led to more accurate intrusion detection.

Future research can use other distribution estimation algorithms, including the Bayesian optimization algorithm, compare them with the results obtained in this paper, and then draw a parametric map of the algorithms used, extending the current research. It should be emphasized that for tree-based decision-making methods, when a nominal attribute has a large number of unique values, learning will be very slow. With the proposed method, the learning speed greatly increased and the accuracy was acceptable. In problems with a large number of features, feature selection is a common step in machine learning methods; a common approach involves generating candidate subsets, evaluating them, applying termination criteria, and validating the results. A simple way to validate the results is to draw conclusions using prior knowledge; nevertheless, in real-world applications there is usually no such knowledge, so indirect methods must be used to monitor the change in performance. Finally, data mining models can be used as a powerful tool in intrusion detection to predict the type of attack in networks. The model presented in this study can be used alongside firewalls as a software assistant.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Fenanir, S., Semchedine, F., Baadache, A.: A machine learning-based lightweight intrusion detection system for the internet of things. Rev. Intell. Artif. 33(3), 203–211 (2019)
2. Koroniotis, N., Moustafa, N., Sitnikova, E., Turnbull, B.: Towards the development of realistic botnet dataset in the
internet of things for network forensic analytics: bot-iot dataset. Futur. Gener. Comput. Syst. 100, 779–796 (2019)
3. Ghosh, J., Kumar, D., Tripathi, R.: Features extraction for network intrusion detection using genetic algorithm (GA). In: Gunjan, V.K. (ed.) Modern Approaches in Machine Learning and Cognitive Science: A Walkthrough, pp. 13–25. Springer, Cham (2020)
4. Gao, J., Chai, S., Zhang, B., Xia, Y.: Research on network intrusion detection based on incremental extreme learning machine and adaptive principal component analysis. Energies 12(7), 1223 (2019)
5. Abualigah, L., Jamal Dulaimi, A.: A novel feature selection method for data mining tasks using hybrid sine cosine algorithm and genetic algorithm. Clust. Comput. 24, 1–16 (2021)
6. Tang, J., Alelyani, S., Liu, H.: Feature selection for classification: a review. In: Aggarwal, C. (ed.) Data Classification: Algorithms and Applications, Data Mining and Knowledge Discovery Series. CRC Press, London (2014)
7. Chandrashekar, G., Sahin, F.: A survey on feature selection methods. Comput. Electr. Eng. 40(1), 16–28 (2014)
8. Ravinder, R., Kavya, R.B., Ramadevi, Y.: A survey on SVM classifiers for intrusion detection. Int. J. Comput. Appl. 98(19), 38–44 (2014)
9. Bhavsar, Y.B., Waghmare, K.C.: Intrusion detection system using data mining technique: support vector machine. Int. J. Emerg. Technol. Adv. Eng. 3, 581–586 (2013)
10. She, W., Li, D., Xia, Y., Tian, S.: Parameter estimation of P-III distribution based on GA using rejection and interpolation mechanism. Clust. Comput. 22(1), 2159–2167 (2019)
11. Hauschild, M., Pelikan, M.: An introduction and survey of estimation of distribution algorithms. Missouri Estimation of Distribution Algorithms Laboratory Report No. 2011004, Department of Mathematics and Computer Science, University of Missouri–St. Louis (2011)
12. Bharathisindhu, P., Selva Brunda, S.: An improved model based on genetic algorithm for detecting intrusion in mobile ad hoc network. Clust. Comput. 22(1), 265–275 (2019)
13. Wang, G., Hao, J., Ma, J., Huang, J.: A new approach to intrusion detection using artificial neural networks and fuzzy clustering. Expert Syst. Appl. 37, 6225–6232 (2010)
14. Chen, Y., Abraham, A.: Estimation of distribution algorithm for optimization of neural networks for intrusion detection system. In: International Conference on Artificial Intelligence and Soft Computing (ICAISC), pp. 9–18 (2006)
15. Sheikhan, M., Jadidi, Z., Farrokhi, A.: Intrusion detection using reduced-size RNN based on feature grouping. Neural Comput. Appl. 25, 1185–1190 (2010)
16. Praneeth, N.S.K.H., Varma, N.M., Ramakrishna, N.R.: Principle component analysis based intrusion detection system using support vector machine. In: IEEE International Conference on Recent Trends in Electronics, Information and Communication Technology, pp. 1344–1350 (2016)
17. Kumar, G., Kumar, K., Sachdeva, M.: The use of artificial intelligence based techniques for intrusion detection: a review. Int. Sci. Eng. J. Artif. Intell. Rev. 34, 369–387 (2010)
18. Tao, P., Sun, Z., Sun, Z.: An improved intrusion detection algorithm based on GA and SVM. In: Zhao, W. (ed.) Human-Centered Smart Systems and Technologies, vol. 6. IEEE Access, Piscataway (2018)
19. Jonathan, A., Mandala, S.: Increasing feature selection accuracy through recursive method in intrusion detection system. IJOICT 4(2), 43–50 (2018)
20. Taher, K., Jisan, B., Rahman, M.: Network intrusion detection using supervised machine learning technique with feature selection. In: International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST) (2019)
21. Manekar, V., Waghmare, K.: Intrusion detection system using support vector machine (SVM) and particle swarm optimization (PSO). Int. J. Adv. Comput. Res. 4(3), 6 (2014)
22. Acampora, G., Iorio, C., Pandolfo, G., Siciliano, R., Vitiello, A.: A memetic algorithm for solving the rank aggregation problem. In: Hošková-Mayerová, S. (ed.) Algorithms as a Basis of Modern Applied Mathematics, pp. 447–460. Springer, Cham (2021)
23. Tavallaee, M., Stakhanova, N., Ghorbani, A.A.: Towards credible evaluation of anomaly based intrusion detection methods. IEEE Trans. Syst. Man Cybernet. Part C 40(5), 516–524 (2010)
24. Pelikan, M.: Genetic algorithms. Missouri Estimation of Distribution Algorithms Laboratory Report No. 2010007, Department of Mathematics and Computer Science, University of Missouri–St. Louis (2010)
25. Lee, S.M., Kim, D.S., Park, J.S.: A survey and taxonomy of lightweight intrusion detection systems. J. Internet Serv. Inf. Sec. (2012). https://doi.org/10.22667/JISIS.2012.02.31.119
26. Mukkamala, S., Sung, A.H.: Identifying significant features for network forensic analysis using artificial intelligent techniques. Int. J. Digital Evid. 1(4), 1–16 (2003)
27. Tidke, S.M., Vishnu, S.: Intrusion detection system using genetic algorithm and data mining: an overview. Int. J. Comput. Sci. Inf. 1, 91–95 (2012)
28. Sonawane, H.A., Pattewar, T.M.: A comparative performance evaluation of intrusion detection based on neural network and PCA. In: IEEE ICCSP Conference, pp. 841–845 (2015)
29. Ding, Y., Zhou, K., Bi, W.: Feature selection based on hybridization of genetic algorithm and competitive swarm optimizer. Soft Comput. 24, 11663–11672 (2020)
30. Ajdani, M., Ghaffary, H.: Introduced a new method for enhancement of intrusion detection with random forest and PSO algorithm. Sec. Privacy 4(2), 1147 (2021)
31. Bhattacharya, S., Reddy Maddikunta, P.K., Kaluri, R., Singh, S., Reddy Gadekallu, T., Alazab, M., Tariq, U.: A novel PCA-firefly based XGBoost classification model for intrusion detection in networks using GPU. Electronics 9(2), 219 (2020)
32. Srivastava, G., et al.: An ensemble model for intrusion detection in the Internet of Softwarized Things. In: Adjunct Proceedings of the 2021 International Conference on Distributed Computing and Networking (2021)
33. Panigrahi, R., et al.: Performance assessment of supervised classifiers for designing intrusion detection systems: a comprehensive review and recommendations for future research. Mathematics 9(6), 690 (2021)
34. Reddy, G.T., et al.: Analysis of dimensionality reduction techniques on big data. IEEE Access 8, 54776–54788 (2020)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Mahdi Ajdani I am 35 years old. I am a very motivated individual and I am not working right now, so I have enough time and energy to do my PhD exercises. The main reason that I want to do a PhD course is that I love researching and I am very eager to improve and develop my knowledge and skills. Therefore, I believe that I am a good candidate for being a PhD student.

Hamidreza Ghaffary completed his bachelor's degree in computer science at Sharif University of Technology, his master's degree at the University of South Tehran, and his doctorate at Ferdowsi University. He is currently a faculty member and assistant professor at the Faculty of Computer Engineering, Ferdows Azad University. His interests are machine learning, pattern recognition and image processing.
