International Journal of Wireless Information Networks (2021) 28:262–275

https://doi.org/10.1007/s10776-021-00520-z

Research on Network Intrusion Detection Technology Based on Machine Learning

Fei Wu¹ · Ting Li¹ · Zhen Wu¹ · ShuLin Wu¹ · ChuanQi Xiao¹

Received: 6 April 2021 / Revised: 18 May 2021 / Accepted: 4 June 2021 / Published online: 9 July 2021
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021

Abstract
To address the low accuracy, low efficiency, and high false alarm rate of current network intrusion detection methods, this paper proposes a network intrusion detection method based on machine learning. Machine learning algorithms are used to classify network packets, and information gain is used as the attribute selection measure when training and testing on multiple samples. Based on the principle of structural risk minimization, the optimal interval is found and the difference function is derived. The intrusion detection framework is built by collecting network traffic data, system logs, user behavior information, and host information. The paper establishes evaluation indices for machine-learning-based network intrusion detection, analyzes and constructs the detection model, preprocesses its data, trains a random forest on the data, calculates feature importance, and reduces the data dimension to realize network intrusion detection. The experimental results show that the method achieves high accuracy and efficiency and effectively reduces the false alarm rate of network intrusion detection.

Keywords Machine learning · Network intrusion · Anomaly detection · Support vector machine (SVM) · K-nearest
neighbor (KNN)

1 Introduction

With the rapid development of networks, the network environment has become diverse and complex, and traditional network intrusion detection technology struggles to resist the full range of network attacks. The advent of the big data era forces network intrusion detection systems to adopt more efficient algorithms. Most machine learning algorithms are used to solve classification problems, which here means dividing network behavior into different classes. In machine learning, classification is a predictive modelling problem in which a class label is predicted for a given sample of input data. Examples of classification problems include deciding whether a given message is spam and assigning a handwritten character to one of a known set of characters. Moreover, machine learning algorithms are good at making agents simulate human behavior and improve their performance through their own learning, which suits the current diverse and complex network environment [1]. Introducing machine learning algorithms into a network intrusion detection system makes the system more intelligent and efficient, improves the accuracy of intrusion detection, and protects network security.

Network intrusion detection methods are divided into anomaly detection and misuse detection. Anomaly detection first constructs a model of normal behavior, and any access that does not conform to the model is defined as an intrusion [2]. Misuse detection, by contrast, establishes an intrusion model based on unacceptable behavior, and any access that conforms to the model is an intrusion. The difference between the two is that misuse detection must be supplied with a well-defined set of attack signatures stored in a database, whereas anomaly detection must describe a detailed and precise profile of the normal behavior of the network and its hosts. Misuse detection is thus a method for detecting computer attacks, and the use of attack signatures in an intrusion detection system is a typical instance of it. More broadly, the term also refers to techniques that cover all types of computer misuse.

* Fei Wu
1 State Grid Fujian Electric Power Company, Fuzhou 350000, China


However, in an increasingly complex network environment, traditional intrusion detection systems struggle to cope, and their defects are increasingly prominent: they consume too many resources; they monitor unknown network behavior poorly; both anomaly detection and misuse detection suffer from high false alarm rates; abnormal data are analyzed insufficiently; and manual intervention is required. An intrusion detection system must therefore learn by itself from external attacks [3]. When applied to IoT networks, traditional intrusion detection systems are further limited by resource constraints and complexity. Machine learning algorithms aim to make agents simulate human behavior: as experience accumulates, an agent can improve its performance, learn new knowledge, and acquire new skills.

As network applications become widespread, network security problems follow, and network attacks are becoming more diverse, complex, and distributed. An intrusion detection system is the second line of defense for discovering attacks. It can monitor network events in real time, and it is an active defense technology that makes up for the shortcomings of firewalls [4]. In recent years, the development of machine learning has given intrusion detection and processing more effective mechanisms. With large-scale network data, however, the defects and instability of machine learning algorithms themselves still cause problems such as low detection accuracy, high false alarm rates, and low detection efficiency. Studying network intrusion detection technology based on machine learning therefore has both theoretical significance and application value. The application of machine learning to network intrusion detection has attracted wide attention, and many scholars and research institutions at home and abroad have invested in related research and achieved representative results in basic theory, key technologies, and architecture.

2 Network Intrusion Detection Based on Machine Learning

2.1 Network Intrusion Data Sample Training Algorithm

A machine learning algorithm is a program (mathematics and logic) that adjusts itself to perform better as it is exposed to more data. "Learning" here means that the program changes how it processes data over time, much as humans change how they process data by learning. The machine learning algorithm is used to classify network packets, and multiple samples are trained and tested. A network intrusion detection algorithm is used to create a baseline of network utilization at various time intervals, an approach that was proposed for identifying unknown attacks. The proposed method uses machine learning to build a model of regular activity and compares recent behavior against that model.

In the process of constructing the machine learning model, the most important element is the split attribute. A tuple has multiple attributes, and the construction process must decide which attributes to split on [5]. Tuples are immutable sequences used to store collections of heterogeneous data; objects that can be changed are called mutable, and objects that cannot be changed are called immutable. Attribute selection measures include information gain, gain ratio, and so on. In this paper, information gain is used as the attribute selection measure. The information gain of attribute $A$ is:

$$\mathrm{gain}(A) = \operatorname{inf}(T) - \sum_{i=1}^{S} \frac{|T_i|}{|T|} \times \operatorname{inf}(T_i) \tag{1}$$

In formula (1), $T$ is the whole data set, $T_i$ is the subset of $T$ on which the attribute takes its $i$-th value, and $\operatorname{inf}(T)$ is the entropy function, calculated as follows:

$$\operatorname{inf}(T) = -\mathrm{gain}(A) \times \sum_{j=1}^{N_{class}} \frac{\mathrm{freq}(C_j, T)}{|T|} \times \log_2\left(\frac{\mathrm{freq}(C_j, T)}{|T|}\right) \tag{2}$$

For a given sample, the $k$ nearest samples in the training set are found with some distance measure, and the information from these $k$ nearest neighbors is used for prediction. The KNN distance measure is computed against the training samples. In classification, the KNN algorithm usually predicts by voting: the class that occurs most often among the $k$ training samples is selected as the final prediction [6]. KNN, the k-nearest neighbors algorithm, is a simple machine learning algorithm that can be used for both classification and regression problems. It is easy to implement and interpret; its main disadvantage is that it slows down markedly as the size of the data grows. Besides voting, averaging is a common prediction rule in regression tasks: the mean of the output values of the $k$ training samples is used as the prediction. Distance-weighted voting or distance-weighted averaging can also be used. Given samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the squared Euclidean distance, scaled here by $\operatorname{inf}(T)$, is usually used as the distance measure for the similarity between two vectors, with the sum running over the $a$ components of each vector:

$$d^2(x_i, x_j) = \operatorname{inf}(T)\,\|x_i - x_j\|^2 = \operatorname{inf}(T) \sum_{k=1}^{a} (x_{ik} - x_{jk})^2 \tag{3}$$
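As a rough illustration of the attribute selection measure in formulas (1) and (2), the following Python sketch computes entropy and information gain on a toy data set. It is only a sketch under stated assumptions: it uses the textbook Shannon entropy, without the extra gain(A) factor that appears in formula (2), and the function names `entropy` and `information_gain` as well as the four example records are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy of the whole set minus the weighted entropy of each subset T_i."""
    subsets = {}
    for row, label in zip(rows, labels):
        # Group the class labels by the value the attribute takes in each row.
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical toy packets: (protocol, flag) with a normal/attack label.
rows = [("tcp", "SF"), ("tcp", "S0"), ("udp", "SF"), ("udp", "SF")]
labels = ["normal", "attack", "normal", "normal"]
gains = [information_gain(rows, labels, i) for i in range(2)]
best_attribute = max(range(2), key=lambda i: gains[i])
```

On these toy records the flag attribute separates the two classes completely, so it has the larger gain and would be preferred as the split attribute.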

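The voting form of KNN described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it uses the plain squared Euclidean distance and leaves out the inf(T) factor of formula (3), on the assumption that a constant positive factor does not change which neighbors are nearest. The names `squared_distance` and `knn_predict` and the two-feature records are hypothetical.

```python
from collections import Counter

def squared_distance(u, v):
    """Plain squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: squared_distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature traffic records (for example, scaled duration and byte count).
train_x = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.9, 0.8), (0.85, 0.95)]
train_y = ["normal", "normal", "normal", "attack", "attack"]
prediction_near_normal = knn_predict(train_x, train_y, (0.12, 0.18), k=3)
prediction_near_attack = knn_predict(train_x, train_y, (0.88, 0.9), k=3)
```

Sorting the whole training set for every query is the simplest possible design and reflects the drawback noted in the text: prediction cost grows with the size of the training data.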

In K-nearest neighbor classification, the classification effect is best when the training samples are distributed in a hyperspherical or elliptical shape. If the boundary of the sample distribution is nonlinear, the classification effect may be reduced [7]. Complex, linearly inseparable samples in the low-dimensional feature space can be mapped to a high-dimensional space, where classification can be more effective.

Assuming that the influence of each feature on a given class is independent, the classification result is determined from known prior and conditional probabilities. $P(Y)$ is the prior probability of the outcome, which can be computed from the training data set. The conditional probability $P(Y|X)$ is the probability of the outcome given the evidence, that is, when the value of $X$ is known. Suppose there is a sample $X(a_1, a_2, \ldots, a_m)$, where each $a$ is a feature of $x$, the features are independent of each other, and there is a category set $C(y_1, y_2, \ldots, y_n)$. Bayes' theorem is the formula that defines how to update the probabilities of hypotheses when evidence is provided. It follows from the axioms of conditional probability, and it can be used to reason about a wide range of problems that involve belief updates. According to Bayes' theorem, the probability of each feature belonging to a category is calculated as follows:

$$P(y_i \mid a_j) = \frac{P(a_j \mid y_i)}{d^2(x_i, x_j)} \times P(y_i)\,P(a_j) \tag{4}$$

Since the sample features are independent of each other, the probability that the sample belongs to class $y_i$ is calculated as follows:

$$P(y_i \mid x) = \prod_{j=1}^{m} P(y_i \mid a_j) \tag{5}$$

According to the above formula, the probability of the sample belonging to each category is calculated, and the category with the highest probability is the one it belongs to:

$$P(y_k \mid x) = \max\{P(y_1 \mid x), P(y_2 \mid x), \ldots, P(y_n \mid x)\}, \quad x \in y_k \tag{6}$$

Structural risk minimization (SRM) is an inductive principle used in machine learning. In general, a model must be chosen from a finite data set, with the attendant problem of overfitting: the model becomes too closely tailored to the particular training set and generalizes poorly to new data. The SRM principle addresses this problem by balancing the complexity of the model against its success at fitting the training data. Based on the principle of structural risk minimization, an optimal interval is found that divides the examples into two categories. For the training sample set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the objective function is:

$$L = P(y_i \mid x)\,\min \frac{1}{2}\|w\| + c \sum_{i=1}^{n} (\xi_i + \xi_i^*) \tag{7}$$

Its constraints are:

$$y_i - (w\phi(x_i) + b) \le \varepsilon - \xi_i \tag{8}$$

$$(w\phi(x_i) + b) - y_i \le \varepsilon - \xi_i^* \tag{9}$$

$$\xi_i, \xi_i^* \ge 0, \quad i = 1, 2, \ldots, n; \quad c > 0 \tag{10}$$

Then the final difference function is:

$$f(x) = \sum_{i=1}^{n} (a_i - a_i^*)\,K(x_i, x_j) + Lb \tag{11}$$

where $a_i$ is determined by the penalty factor, $K(x_i, x_j)$ is a kernel function satisfying Mercer's condition, and $b$ is the threshold. Mercer's theorem determines which functions can be used as kernel functions: it states that a symmetric, positive-definite function can be represented as a sum of a convergent sequence of product functions.

Ensemble learning builds and combines multiple learners to accomplish a learning task [8]. Generally, a group of individual learners is generated first and then combined through some strategy. Individual learners are usually also called base learners. By combining multiple learners, ensemble learning can usually achieve better generalization ability than a single learner. The data integration structure of network intrusion features is shown in Fig. 1.

Compared with traditional intrusion detection methods, the classification efficiency of an intrusion detection system based on machine learning is significantly improved. An intrusion detection system collects network traffic data, system logs, user behavior information, and information provided by the host, analyzes them to detect abnormal behavior or signs of attack in the network, and generates warnings [9]. The data involved include system logs, historical behavior, network packets, and so on. The intrusion detection system extracts the valuable information, and the collected data are passed to the data preprocessing stage. There the data are converted into a unified data format; the next stage is intrusion detection analysis.
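The Bayesian classification rule of formulas (4) to (6) can be illustrated with a small categorical naive Bayes classifier. The sketch below is hedged in several ways: it uses the textbook form, scoring each class by P(y) multiplied by the product of P(a_j | y), and does not include the distance term that appears in formula (4); the add-one smoothing, the `n_values` parameter (an assumed number of distinct values per feature), the function names, and the six toy records are all hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies within each class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> Counter of values
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
    return class_counts, value_counts

def predict(model, row, n_values=2):
    """Return the class maximising P(y) * prod_j P(a_j | y), with add-one smoothing."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(y)
        for j, value in enumerate(row):
            # Smoothed estimate of P(a_j | y); n_values is an assumed constant.
            score *= (value_counts[(j, label)][value] + 1) / (count + n_values)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical records: (protocol, flag) with a normal/attack label.
rows = [("tcp", "S0"), ("tcp", "S0"), ("tcp", "SF"),
        ("udp", "SF"), ("udp", "SF"), ("tcp", "SF")]
labels = ["attack", "attack", "normal", "normal", "normal", "normal"]
model = train_naive_bayes(rows, labels)
label, scores = predict(model, ("tcp", "S0"))
```

Taking the class with the largest score corresponds to the maximum in formula (6); the scores are unnormalised, which does not affect which class wins.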

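The combination step of ensemble learning, in which several individual learners feed a combination module as in Fig. 1, can be sketched with simple majority voting. Majority voting is only one possible combination strategy and is an assumption here, since the text says only that the learners are combined "through some strategy". The three rule-based base learners, their thresholds, and the feature names are hypothetical.

```python
from collections import Counter

def majority_vote(base_learners, sample):
    """Combination module: each base learner votes and the most common label wins."""
    votes = Counter(learner(sample) for learner in base_learners)
    return votes.most_common(1)[0][0]

# Three hypothetical rule-based base learners over a dict of traffic features.
def by_failed_logins(s):
    return "attack" if s["failed_logins"] > 3 else "normal"

def by_packet_rate(s):
    return "attack" if s["packets_per_s"] > 1000 else "normal"

def by_port(s):
    return "attack" if s["dst_port"] == 23 else "normal"

learners = [by_failed_logins, by_packet_rate, by_port]
sample = {"failed_logins": 5, "packets_per_s": 40, "dst_port": 23}
decision = majority_vote(learners, sample)
```

Using an odd number of base learners on a two-class problem avoids tied votes, which is why three learners are used in this sketch.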

Fig. 1 Data integration schematic of network intrusion characteristics (individual learners 1, 2, ..., T feed a combination module, which produces the output)

This analysis checks the behavior of the data against normal behavior; the result is then recorded and a response is generated. The overall framework of intrusion detection is shown in Fig. 2 and mainly comprises the knowledge base, data collection, data preprocessing, intrusion detection analysis, and response processing.

Fig. 2 Overall framework optimization of intrusion detection (components shown: current access behavior, data collection, data preprocessing, intrusion detection analysis, a knowledge base holding historical behavior, system log, specific rules of conduct, and other information, an "Invasion or not" decision with Y and N branches, behavior sequence extraction, record the evidence, and response processing)

The data of an intrusion detection system come mainly from different network segments and hosts; information about network traffic and user activity state is the main data source. In general, the data mainly consist of system logs, network packets, and so on. Finding the valuable information available in these data is the key to realizing an intrusion detection system [10]. The reliability of the data ensures that the


intrusion detection system can achieve its maximum effect and respond quickly and effectively to attack behavior. In general, the data in the system logs of the host and in the packets on the network are diverse and disorganized, which makes them difficult to analyze [11, 12]. Therefore, to ensure that the data can be recognized by the detection algorithm, they must be preprocessed. Preprocessing converts the data into a unified data format in order to obtain valuable information. For example, data should be quantified, non-numerical data should be converted to numerical values, and data should be standardized so that large values do not swamp small ones. Intrusion detection analysis is the core of the intrusion detection system: through the identification and statistics of features and reasonable analysis, it identifies the behavior of the data and detects whether anomalies exist.

2.2 Machine Learning Evaluation Index Construction

Random subsampling performs K data splits of the whole sample set. For each split, a fixed number of observations is selected without replacement from the sample and held out for testing. After clustering-based under-sampling, the number of majority-class samples is reduced but the spatial characteristics of the majority class are retained, which to some extent avoids the randomness and blindness of random subsampling and the damage it does to the distribution characteristics of the majority class. Coverage refers to the proportion of identified samples in the total sample, and accuracy refers to the proportion of correctly classified samples in the total sample [13]. Accuracy is calculated as follows, where $y_i$ is the true category of the $i$-th sample and $y'_i$ is its predicted value:

$$\mathrm{accuracy} = \frac{1}{n} \sum_{i=1}^{n} f(x)\,\mathrm{sum}(y'_i = y_i) \tag{12}$$

Recall is the number of correct results divided by the number of results that should have been returned. In binary classification it is also known as sensitivity. In other words, it is the probability that a relevant item is retrieved by the query. The recall rate measures the ratio of correctly classified positive samples to all positive samples. Precision measures the proportion of correctly classified positive samples among all samples classified as positive. Recall is thus the proportion of correctly identified positive observations among all observations in the actual class, whereas accuracy is appropriate when false negatives and false positives carry similar costs.

Assume that A is a positive sample set and B is a negative sample set. The precision rate and recall rate are calculated as follows:

$$\mathrm{precision}(A, B) = \frac{|A \cap B|}{|A|} \tag{13}$$

$$\mathrm{recall}(A, B) = \frac{|A \cap B|}{|B|} \tag{14}$$

Support is a measure of how frequently events occur together. A high frequency indicates a strong correlation between the two events; otherwise, the correlation is weak. Confidence is a measure of the probability that event B occurs after event A has occurred. A high confidence indicates that the occurrence of event B is closely related to the occurrence of event A; otherwise, the two have little relationship.

$$\mathrm{support}(A \rightarrow B) = P(A \cup B) \tag{15}$$

$$\mathrm{confidence}(A \rightarrow B) = P(B \mid A) \tag{16}$$

The F1-score is the harmonic mean of the recall rate and the precision rate, and it tends toward the smaller of the two:

$$F1\text{-}\mathrm{Score} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{17}$$

2.2.1 Mean Square Error

The mean square error (MSE) is a statistic that quantifies how far estimates are from the true values. Put simply, it measures the difference between the observed parameter and the predicted parameter. The expectation of the squared difference between the estimated parameter value and the true value is expressed as follows:

$$\mathrm{MSE} = \frac{1}{N} \sum_{t=1}^{N} (\mathrm{observed}_t - \mathrm{predicted}_t)^2 \tag{18}$$

The absolute error is the absolute value of the difference between the predicted value and the original value. The mean absolute error is the mean of the absolute errors, and it reflects well how far the predicted values actually are from the true values:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |f_i - y_i| \tag{19}$$

Depending on the combination of a sample's true category and the category predicted by the classifier, the classification results can be


divided into four cases. A true positive (TP) is a sample that the classifier predicts as positive and that is actually positive [14]. A false positive (FP) is a sample that the classifier predicts as positive but that is actually negative. A true negative (TN) is a sample that the classifier predicts as negative and that is actually negative. A false negative (FN) is a sample that the classifier predicts as negative but that is actually positive. Obviously, TP + FP + TN + FN = total number of samples, and this arrangement is called the confusion matrix, shown in Table 1.

Table 1 Confusion matrix of classification results

True category    Classification result: positive    Classification result: negative
Positive         TP (true positive)                 FN (false negative)
Negative         FP (false positive)                TN (true negative)

Random subsampling randomly samples the majority class in the data set, that is, it randomly deletes some majority-class samples. Random sampling is a sampling technique in which every sample has an equal probability of being selected, so a randomly selected sample is meant to be an unbiased representation of the whole population; an unbiased random sample is important for drawing conclusions. This random method, however, has great uncertainty: it may break the original distribution of outliers, shifting the decision boundary and ultimately degrading the classification effect of the classifier [15]. In addition, it assumes that each majority-class sample contributes equally to the training of the classifier, whereas sample points at different locations in the spatial distribution influence training differently [16]. Random subsampling may therefore discard many samples that contain important information. To avoid the randomness and blindness of random subsampling, the distribution information of the majority class should be kept as far as possible while the number of majority-class samples is reduced.

2.3 The Realization of Network Intrusion Detection

Intrusion refers to behavior that destroys the confidentiality, availability, and integrity of a system; in general, intrusion means attack. An intrusion detection system is a device or software application that monitors a network for malicious activity or policy violations. Any malicious activity or violation is typically reported or collected centrally using a security information and event management system. Intrusion detection (IDS) provides an active, in-depth protection mechanism that recognizes illegal attacks and malicious use from system audit data or network packet information. When an attack on or damage to the protected system is found, system security is maintained through the intrusion detection response. Compared with the first layer of defense, which establishes a safe and reliable system or network environment, intrusion detection works proactively: it automatically and comprehensively inspects the protected system and guarantees its security by raising alarms on and controlling suspected attack behavior [17]. Intrusion detection systems are now widely used to secure network systems and computer hosts.

Intrusion detection by machine learning differs from traditional network intrusion detection, so the network intrusion detection problem is analyzed and modeled as a classification problem in machine learning [18]. The process of network intrusion detection by machine learning is then briefly introduced. Network intrusion detection is an important application scenario of machine learning and follows a basically fixed process, which can be divided into five stages: data cleaning, feature selection, imbalance processing, classifier training, and evaluation of classification results. The details are shown in Fig. 3.

Before the detection algorithm is run, the node's local logs, communication packets, network behavior, and so on must be analyzed and features extracted for intrusion detection [19]. Detection features are the basis of wireless sensor network intrusion detection: the detection algorithm recognizes attacks by finding anomalies in the detection features. To lay a foundation for further research, the characteristics of wireless sensor networks are summarized in terms of the application layer, network layer, link layer, and physical layer. Given the security requirements and malicious attacks mentioned above, research on security mechanisms for sensor networks has become a hot spot and has made considerable achievements. Because the environment of wireless sensor nodes and wireless channels is unstable, random errors (such as noise) arise easily and affect the normal operation of the system [20, 21]. Furthermore, a method named CFS + Ensemble Classifiers (Bagging and AdaBoost) has been introduced to improve intrusion detection systems. With this method, high accuracy and a high packet detection rate are achieved and the false alarm rate is reduced. The KDD99 and NSL-KDD data sets are used for multiclass classification of each attack type, covering both anomalous and normal traffic [22]. Intrusion detection is one of the most important tasks of any network security tool.
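For the imbalance processing stage of the five-stage process, the random subsampling of the majority class discussed in Section 2.2 can be sketched as follows. This is a minimal illustration, not the paper's procedure: the function name `random_undersample`, its `keep` and `seed` parameters, and the ten toy records are hypothetical, and the fixed seed is used only to make the sketch repeatable.

```python
import random

def random_undersample(samples, labels, majority_label, keep, seed=0):
    """Keep every minority sample and a random subset of `keep` majority samples."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    # Draw majority indices without replacement, then restore the original order.
    chosen = sorted(rng.sample(majority, keep) + minority)
    return [samples[i] for i in chosen], [labels[i] for i in chosen]

# Hypothetical imbalanced data: 8 normal records and 2 attack records.
samples = list(range(10))
labels = ["normal"] * 8 + ["attack"] * 2
balanced_x, balanced_y = random_undersample(samples, labels, "normal", keep=2)
```

Because the majority samples to delete are picked purely at random, which records survive depends on the seed; this is the uncertainty that the text identifies as the weakness of random subsampling.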

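The last of the five stages, evaluation of classification results, relies on the confusion-matrix counts of Table 1. The sketch below derives precision, recall, F1 (as the harmonic mean of precision and recall), and accuracy from those counts. It assumes a binary problem with "attack" as the positive class; the function names and the eight toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Count TP, FP, TN, FN for a binary problem with the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, tn, fn

def evaluate(y_true, y_pred, positive="attack"):
    """Derive the evaluation measures from the four confusion-matrix counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical ground truth and predictions for eight traffic records.
y_true = ["attack", "attack", "attack", "normal", "normal", "normal", "normal", "normal"]
y_pred = ["attack", "attack", "normal", "attack", "normal", "normal", "normal", "normal"]
report = evaluate(y_true, y_pred)
```

The guards against zero denominators are a design choice for this sketch: when a classifier predicts no positives at all, precision is reported as 0.0 rather than raising an error.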

Fig. 3 Optimization of network intrusion data collection process for machine learning (stages shown: input (cic-ids); data cleaning, with missing value processing, duplicate data processing, and outlier handling; feature selection, with CFS-BFSLA, information gain (IG), and information gain rate (GR); unbalance treatment, with double boundary down sampling and random down sampling; classifier training, with naive Bayes, decision tree, random forest, and K-NN; evaluation of classification results)

It is also tied to the performance and efficiency issues of the various intrusion prediction and protection methods deployed in networks. An IDS is designed around its accuracy in detecting network anomalies while producing few false positive alarms. Several researchers have tried to solve the performance problems by applying machine learning algorithms to IDS data sets [23]. A malicious attack often aims to disrupt the


operation of the system and can even lead to network collapse, so a sound defense mechanism is needed to deal with it. Both random errors and malicious attacks can cause exceptions in wireless sensor networks and damage their security and reliability, as shown in Fig. 4. A malicious attack is an attempt to forcefully abuse or take advantage of someone's computer, whether through computer viruses, phishing, or other forms of social engineering.

2.3.1 Application Layer Attack

An application layer attack targets the application layer itself: it focuses on particular vulnerabilities or issues and leaves the application unable to deliver content to the user.

2.3.2 Transport Layer Attack

Attacks on the transport layer include TCP sequence identification and UDP and TCP flooding.

2.3.3 Network Layer Attack

A network layer attack is an unauthorized action against digital assets inside an organizational network. Network attacks are of two major kinds, passive and active. In a passive attack, malicious parties gain unauthorized access to networks in order to monitor them and steal private information without making any changes.

Fig. 4 Wireless sensor network security protection methods (random errors and malicious attacks, the latter comprising application layer, transport layer, network layer, link layer, and physical layer attacks, generate WSN exceptions that damage safety and reliability; tier 1 defense provides prevention through key management, authentication, secure routing, redundancy, speed limit, data fusion security, and other measures; layer 2 defense is intrusion detection)

It can be seen from Fig. 4 that security defense can be divided into two layers. The first layer mainly covers key management, authentication, secure routing, data fusion security, redundancy, speed limiting, spread spectrum, and so on.

Current network intrusion detection data sets contain many features and therefore inevitably many redundant and irrelevant ones. Redundant and irrelevant features reduce the efficiency of machine learning algorithms and can cause problems that are hard to explain. The first problem to solve is therefore feature selection: reduce the feature dimension of the data set and select a representative feature subset. Network intrusion detection relies on the characteristics of network traffic, but traffic has many features and not every feature contributes to the detection result. Involving too many features in training reduces training efficiency; yet different types of intrusion have different characteristics, so if too few features are selected, the detector applies to only a narrow range of attacks. It is therefore necessary to study how to select network traffic features. While preserving detection accuracy, as few features as possible should be selected


to improve the efficiency of training. Data cleaning is the first step of machine learning, and it plays a crucial role: without data cleaning, it is very likely that no effective model can be trained. In machine learning, data cleaning means filtering and modifying data to make it easier to process and model. Unnecessary parts are filtered out so that they do not affect the final training results. In addition, useful parts of the data may have formatting problems such as mismatches, and need to be corrected before they can be put to use. After data preprocessing is completed, the data usually cannot be input directly into the learner because of its large dimension. Therefore, it is necessary to select meaningful features and models for machine learning training.

In general, features are chosen from two aspects. The first is whether the feature diverges. If the feature does not diverge, its variance is close to zero, that is, the samples hardly differ in this feature, so the feature contributes little to distinguishing samples. The second is the correlation between the feature and the target: features that are highly correlated with the target should be given priority. Besides the variance method there are other methods, and the typical feature selection process is shown in Fig. 5.

In real networks the proportions of normal traffic and attack traffic differ greatly, so data collected from a real network usually suffer from class imbalance, and an unbalanced data set affects the performance of the trained classifier. The class imbalance problem therefore has to be dealt with first. The training set is then fed to a particular machine learning algorithm, whose parameters are adjusted iteratively until the optimal parameters are obtained. The result is a mapping between traffic features and traffic categories, and this mapping is known as the classifier. The main purpose of this training stage is to improve some evaluation index of the classifier, such as the overall classification accuracy of samples.

Feature selection focuses on two issues: the first is the criterion, that is, what standard is used to measure whether a feature is important; the second is the selection algorithm, that is, how to find the optimal subset in the current feature space. Intrusion detection is generally a classification problem, used to distinguish abnormal data streams. In designing an intrusion detection algorithm, two aspects are usually considered: the recognition performance of the model, and whether the input data contain valid features. Once the classification algorithm is selected, the scale of the data determines the efficiency of the subsequent model, so feature selection for the original data set is a crucial step.

In a random forest (RF), a sample to be classified is input and the final output is determined by the votes of the individual decision trees. For each decision tree, a bootstrap subset of samples is obtained by bootstrap resampling: each training set consists of N samples drawn randomly, with replacement, from the original data set. Because sampling is with replacement, a sample can appear repeatedly in a bootstrap subset. In general, about 2/3 of the data are used to train each tree, and the roughly 1/3 of the data that do not appear become the out-of-bag (OOB) data of that tree. Feature importance can be calculated from the OOB data, and the steps of optimizing network intrusion detection are shown in Fig. 6. Here OOB denotes out-of-bag data. It should not be confused with out-of-band deployment of an IDS, in which the IDS identifies threats while sitting outside the real-time communication path between the source and the destination of data.

According to the overall design structure of intrusion detection, three modules of the intrusion detection model are designed: data preprocessing, feature selection, and intrusion detection identification. In this stage, the KDDCUP99 data set is preprocessed, including

[Flowchart labels: raw data; subset generation; feature subset; subset evaluation; yes/no decision; validation of results]

Fig. 5  Basic structure framework of network intrusion feature detection
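
The bootstrap sampling behind the random forest, and the idea of measuring feature importance as the drop in classification accuracy after a feature is disturbed, can be illustrated with a short sketch. It is written under assumptions: `bootstrap_split`, `permutation_importance` and `toy_predict` are hypothetical names, the "tree" is a hand-written rule rather than a trained model, and shuffling one column is used here as the disturbance. For large N the share of samples left out of a bootstrap subset approaches 1/e, about 36.8%, which is the "roughly 1/3" quoted in the text.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction equals the label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, feature, rng):
    """Accuracy drop after shuffling one feature column across the rows."""
    baseline = accuracy(predict, rows, labels)
    column = [row[feature] for row in rows]
    rng.shuffle(column)
    perturbed = [row[:feature] + [value] + row[feature + 1:]
                 for row, value in zip(rows, column)]
    return baseline - accuracy(predict, perturbed, labels)


rng = random.Random(0)

# Share of samples that never enter one bootstrap subset.
n = 10000
in_bag, out_of_bag = bootstrap_split(n, rng)
oob_fraction = len(out_of_bag) / n


def toy_predict(row):
    """Hypothetical stand-in for a trained tree: it only looks at feature 0."""
    return 1 if row[0] > 0.5 else 0


# Held-out rows with two features; labels follow the toy rule exactly.
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [toy_predict(row) for row in rows]
importance_0 = permutation_importance(toy_predict, rows, labels, 0, rng)
importance_1 = permutation_importance(toy_predict, rows, labels, 1, rng)
```

Because the toy rule ignores feature 1, shuffling that feature leaves the accuracy unchanged and its importance is zero. Features with the lowest importance are the candidates for removal.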

the preprocessing and normalization of character attribute data. The KDDCUP99 data set was used in the Third International Knowledge Discovery and Data Mining Tools Competition. The task of the competition was to build a network intrusion detector, a predictive model able to distinguish between “bad” connections, called intrusions or attacks, and “good” normal connections. The data set is a standard set of data to be audited that contains a wide variety of intrusions simulated in a military network environment. In this stage, the random forest algorithm was used to learn the data. The classification accuracy before and after feature disturbance and the average value of the area under the curve were calculated and used to compute the feature importance. The features with the lowest scores were excluded to reduce the dimension of the data. The data after dimensionality reduction were then classified and verified on the test set.

[Flowchart labels: start; the m-th bootstrap training set; random feature subspace; select the optimal splitting feature and splitting value; stop building (Y/N); the m-th decision tree; end]

Fig. 6  Optimization of network intrusion detection steps

Table 2  Analysis of various attack samples in the experiment

Attack type   Training set   Test set
DOS           12,443         2309
U2R           52             252
R2L           1026           739
PROBE         4247           1078
NORMAL        14,391         4382

3 Analysis of Experimental Results

Three different input data sets were selected from the KDDCUP 10% data set for the experiment. The original data set without feature selection was compared with feature subsets processed by feature engineering, so as to verify the practicability and necessity of feature selection. Wrapper and Embedded feature selection methods were used to compare the performance of feature selection methods against each other. The experimental system environment is Windows 7 with an i5 processor and 8 GB of memory. The simulation software is Python, and the data set used is KDDCUP99. The distribution of the various attack samples is shown in Table 2.

The cleaned training set, called training set T, is input into four classifiers (naive Bayes, decision tree, random forest and KNN) for training, and the test set is then used to obtain the classification performance of each classifier before feature selection.

Next, the CFS-BSFLA algorithm is applied to training set T for feature selection. A feature subset is selected, only the relevant features are kept, and the other features are deleted from the data set. The reduced data set is input into the same four classifiers (naive Bayes, decision tree, random forest and KNN) for training, and the test set is used to obtain the classification performance of each classifier with the features selected by CFS-BSFLA.

Likewise, the information gain and information gain ratio feature selection algorithms are each applied to training set T. A feature subset is selected, the other features are deleted, and the reduced data set is input into the four classifiers for training. The test set is then used to obtain the classification performance of each classifier with the features selected by information gain and by information gain ratio.

In the experiment, the accuracy rate and the false positive rate are mainly used to evaluate the effectiveness of the algorithm proposed in this section. TP records the number of positive samples predicted to be positive, FP records the number of negative samples predicted to be positive, TN records the number of negative samples predicted to be negative, and FN records the number


of positive samples predicted to be negative. The support vector machine (SVM) is a supervised machine learning method that provides a classification algorithm for two-group classification problems. Once an SVM model has been given labelled training data for each category, it is able to classify new data. When the SVM algorithm is applied to classification, a large number of features easily leads to excessive computation, which affects the classification performance. Sample redundancy is also one of the important factors that reduce the efficiency and accuracy of intrusion detection. There are 41 features in the KDDCUP99 data set, some of which carry little information and have no impact on the classification results. If they are kept, the efficiency of the algorithm may be reduced and the classification results may be biased. The intrusion detection accuracy of different feature subsets after filtering sample attributes is compared in Fig. 7.

As can be seen from Fig. 7, there is a large amount of redundant data in the KDDCUP99 data set. When the number of feature attributes increases from 5 to 10, the intrusion detection rate grows rapidly. When the number of feature attributes reaches 10, the intrusion detection model has achieved a high accuracy. When the number of feature attributes increases from 10 to 23, the detection accuracy gradually stabilizes. When the number of feature attributes is greater than 23, the detection accuracy fluctuates, and the detection model is affected by redundant feature attributes. In order to verify the effectiveness of the proposed IRF-SVM (improved random forest and support vector machine) model, it was compared in simulation experiments with the classical SVM model, and the experimental results are shown in Table 3.

It can be seen from the experimental results in the table that the detection accuracy of the IRF-SVM algorithm for Normal, DOS, Probe, R2L and U2R is 98.35%, 98.72%, 97.63%, 93.64% and 96.85% respectively. The detection accuracy of the IRF-SVM algorithm is significantly higher than that of the classical SVM model and the IPOS-SVM model. Meanwhile, the false positive rate also decreases. Compared with the classical SVM algorithm and the IPS-SVM model, the proposed IRF-SVM algorithm improves the overall performance of network intrusion detection to different degrees. In the face of an increasingly complex network environment, traditional network intrusion detection technology is increasingly inadequate, and new technology is needed to improve the defense performance of intrusion detection systems. Based on the IRF-SVM model, the accuracy

[Line chart: accuracy (%), axis from 70 to 100, against feature set size, axis from 5 to 35; two curves, "Method of this paper" and "traditional method"]

Fig. 7  Detection accuracy of different feature subsets
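
The information gain and information gain ratio criteria used for feature selection in this section can be computed for a discrete feature as in the sketch below. This is an illustrative example, not the code used in the experiments: the function names and the toy records are hypothetical, and continuous KDDCUP99 attributes would first need to be discretized, which is not shown.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(values).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / total * entropy(group)
                      for group in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature_values, labels):
    """Information gain normalized by the entropy of the feature itself."""
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0
    return information_gain(feature_values, labels) / split_info


# Toy records (made up): "protocol" separates the classes, "flag" does not.
protocol = ["tcp", "tcp", "udp", "udp"]
flag = ["a", "b", "a", "b"]
labels = ["attack", "attack", "normal", "normal"]

gain_protocol = information_gain(protocol, labels)
gain_flag = information_gain(flag, labels)
ratio_protocol = gain_ratio(protocol, labels)
```

Here `protocol` splits the classes perfectly and receives a gain of 1 bit, while `flag` receives a gain of 0. Ranking features by either score and keeping the top k is one way to produce feature subsets of increasing size such as those compared in Fig. 7.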


Table 3  Comparison of detection accuracy and false alarm rate (%)

Model       Evaluating indicator   Normal   DOS     Probe   R2L     U2R
IRF-SVM     Accuracy               98.35    98.72   97.63   93.64   96.85
IRF-SVM     False positive rate    0.51     0.24    0.93    1.21    1.11
SVM         Accuracy               92.52    89.26   93.85   88.71   90.22
SVM         False positive rate    4.43     11.24   8.12    6.74    10.32
Reference   Accuracy               93.30    90.52   87.70   90.93   92.75
Reference   False positive rate    3.04     9.95    15.56   4.73    2.27
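
The two indicators reported in Table 3 can be computed from the four confusion counts as sketched below. The paper does not state the formulas, so the sketch assumes the usual definitions, accuracy = (TP + TN) / (TP + FP + TN + FN) and false positive rate = FP / (FP + TN). The counts in the example are made up for illustration and are not taken from the experiments.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(fp, tn):
    """Share of negative (normal) samples wrongly flagged as positive."""
    return fp / (fp + tn)


# Hypothetical counts: 100 attack samples and 100 normal samples.
tp, fn = 90, 10   # attacks detected / attacks missed
tn, fp = 95, 5    # normal traffic passed / normal traffic flagged

acc = accuracy(tp, fp, tn, fn)        # 185 correct out of 200
fpr = false_positive_rate(fp, tn)     # 5 false alarms out of 100 normal samples
```

With these counts the accuracy is 92.5% and the false positive rate is 5%.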

has been improved as compared to the existing methods. Applying machine learning algorithms to the intrusion detection system improves the detection efficiency, makes the system more intelligent, and optimizes the performance of the whole system.

4 Conclusion

This paper summarizes the application of machine learning algorithms in network intrusion detection, briefly introduces the common algorithms in machine learning, and lists the applications of different algorithms in intrusion detection. However, the most effective method in network intrusion detection has not been found, and considering the diversity and complexity of methods, it is impossible to recommend only one method according to the attack type to be detected by the system. A number of factors need to be considered in determining the effectiveness of a method, including accuracy, complexity, the time it takes to classify an unknown instance with a trained model, and the comprehensibility of the final solution. The detection accuracy has been raised to 98.35%, 98.72%, 97.63%, 93.64% and 96.85% for Normal, DOS, Probe, R2L and U2R respectively. In some specific intrusion detection systems, the special environment has to be considered carefully, and the establishment of a robust network intrusion detection system remains to be further studied.

Data Availability All data, models, and code generated or used during the study appear in the submitted article.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Fei Wu participated in the following projects: the research and application of key technologies for intelligent video analysis of massive monitoring data (second prize for scientific and technological progress, Fujian province); the development and application of an intelligent electrical command system for major activities based on the digital power grid (science and technology progress prize of Fujian Electric Power Co., Ltd.); the development and application of a grid geographic information service platform based on cloud computing (second prize for scientific and technological progress, Fujian province); the research and application of a new-generation, full-dimension integrated network management information system (first prize for scientific and technological progress of Fujian Electric Power Co., Ltd.); the research and development of typical applications of big data for enterprise operation and management (third prize of the Science and Technology Progress Award of State Grid Fujian Electric Power Co., Ltd.); the research, development and application of key technologies for a power big data platform (third prize for science and technology progress, Fujian province); the development and implementation of the intelligent protection electrical command system of the power supply company in Xiamen (high-quality informatization construction project of the State Grid company); and “Research and Application of PMS2.0 Application Level Key Technology for Remote Disaster Preparation” (second prize of State Grid Fujian Electric Power Co., Ltd.).

Ting Li (1) Research and application of key technology of application-grade remote disaster preparedness (PMS2.0), Fujian Provincial Science and Technology Award (2017), ranking 2nd; (2) Research on Power Distribution Equipment and State Intelligent Identification Technology, 2018 State Grid Information Communication New Technology Innovation Award, ranking 4th; (3) R&D and implementation of collective enterprise management and control module and business application platform (Phase I), 2017 Achievement Promotion and Application Award of State Grid Fujian Electric Power Co., Ltd.; (4) Integrated power and line loss management system, third prize of the Science and Technology Award of Fujian Province of Electric Power, ranking 2nd; (5) Research and Application of Massive Surveillance Video Management Technology Based on Cloud Storage, third prize of Science and Technology Progress of State Grid Fujian Electric Power Co., Ltd. in 2016; (6) Function Improvement, Promotion and Implementation of the Grid Unified Video Monitoring Platform, first prize of Achievements Promotion and Application of State Grid Fujian Electric Power Co., Ltd.; (7) “Big Data Power Line Loss Analysis”, 2017 State Grid Fujian Electric Power Co., Ltd., Bronze Award; (8) Fujian Electric Power of State Grid Corporation of China, Pilot Construction of Data Analysis Domain of Unified Data Center of the Whole Industry, Design, Development and Implementation Project, awarded as a Quality Project of Information Construction of State Grid Corporation in 2019; (9) Research and Application of Intelligent Supply Chain Management Technology for Power Supply Enterprises, third prize of Excellent Achievement of ICOM New Technology Innovation and Development Action Plan in 2019.

Zhen Wu participated in the project “Research on Performance Optimization of Information System Based on Memory Computing Technology”, which won the science and technology progress prize of State Grid Fujian Electric Power Co., Ltd., and in the project “Data Governance and Deepening Application of Companies in Seven System County”, which won the first prize of achievement promotion and application of State Grid Fujian Electric Power Co., Ltd.

ShuLin Wu 1. Led and completed “Research on Key Technologies of Health Assessment and Real-time Warning for Information System Oriented to Huge Diaries”, a project of State Grid Technology Co., Ltd.; 2. Participated in the completion of the State Grid Science and Technology Project “Research and Application of Key Technologies for User-side Comprehensive Demand Response Oriented to Multi-energy Collaboration”; 3. “Research on Image Fog Removing Technology Based on Guided Filtering in Power Grid”, first prize of Fujian Electrical Machinery Society.

ChuanQi Xiao Achievements: published “Improving the Accuracy of Power Information Equipment Standing Book Based on Big Data” in “China High-tech Zone”.