IET Communications - 2020 - Safara - Improved Intrusion Detection Method For Communication Networks Using Association Rule
Research Article
Abstract: Detecting anomalous events in communication networks currently receives considerable attention from researchers.
In a large communication network, traffic is massive, which leads to a larger amount of data travelling through the network and
also to the growth of noise. Therefore, extracting meaningful data for anomaly detection is very challenging. Each attack has its
own behaviour that determines the type of attack. However, some attacks may have similar behaviours and differ only in some
features. Extracting such meaningful features is of special importance. In this study, an association rule mining algorithm, in
particular the Apriori algorithm, is employed to extract appropriate features, including rules and repetitive patterns, from the
raw data. The extracted features are then used for classifying the data and detecting anomalies in communication networks.
A hybrid of artificial neural network and AdaBoost classification algorithms is employed to classify the detected events into
normal behaviour and attack events. The proposed method is compared with previous methods reported in this field, such as
CART, CHAID, multiple linear regression and logistic regression, on the KDDCUP99 data set. The results showed that the
proposed method outperformed the other classifiers examined. A reinforcement learning strategy based on the Max vote
strategy is used to combine the classifiers' results.
Fig. 1 Flowchart of the proposed method
José et al. in 2015 and Park et al. [12] used the random forest classification method to protect against malware in the
intrusion detection system. The method proposed in that research is based on two general operators: ordered weighted feature
selection on the Kyoto 2006+ data set and a label segmentation method on the final normalised attributes.
Taher et al. [13] presented a hybrid classification method based on ANN and support vector machine algorithms to predict
intrusion error in network traffic data.
Finally, Riyaz and Ganapathy [14] combined supervised learning methods, namely a classification method and a fuzzy-based
feature selection method, for the intrusion detection system. They applied the proposed method to the NSL-KDD data set. The
experimental results showed that the proposed fuzzy feature selection yields better classification results than other
classification algorithms.
3 Proposed method
In this section, the proposed association rule mining is described. Then, the proposed classification algorithms are presented.
The flowchart of the proposed data mining method is illustrated in Fig. 1. According to the flowchart, the process of detecting
intrusions in communication networks proposed in this paper is as follows. First, general preprocessing is performed on the
data. In the preprocessing, outliers are removed and missing values are filled. Then, using a popular association rule mining
method, the Apriori algorithm, a set of repetitive rules and patterns is extracted from the data set as features. After that, the
extracted features are sent to the next step for creating and evaluating the model. The data is divided into train data and test
data. Train data is used for creating the model and test data is used for evaluating the proposed method for detecting
intrusions from newly entered
(i) Min Support: This parameter shows the minimum value that a set of features must have to be considered as an optimal and
preferable rule. In other words, samples which have a probability greater than or equal to this parameter are considered as
regular rules.
(ii) Max-Support: This parameter shows the maximum value of probability. A sample can be a regular rule with 100% probability.
(iii) Confidence: This parameter is the frequency with which the if-then rules are found to be true.
The main goal of the Apriori algorithm is to generate association rules to probe the KDDCUP99 data set and finally extract the
important rules, which are in fact the repeating rules. Each extracted rule contains some of the features of KDDCUP99. In
addition, each rule has a confidence level. According to the confidence level, the optimal rules are extracted. In other words,
the higher the confidence level of a rule, the greater the importance of a feature. Fig. 2 illustrates a part of the rules
extracted by the Apriori algorithm.
Some of the most important rules, with a confidence level >90%, are shown in Fig. 2. The symbol | indicates a nested if for
each rule. The output of any rule is either 0 or 1. A normal instance is represented by 0 and an attack instance is represented
by 1. Therefore, the Apriori algorithm employed in this paper receives the complete KDDCUP99 data set and extracts a set of
rules based on MinSup, MaxSup and Confidence. After running the Apriori algorithm, the following properties are extracted:
Duration, src_bytes, dst_bytes, count, srv_count, serror_rate, srv_serror_rate, same_srv_rate, diff_srv_rate and
dst_host_srv_rerror_rate. These features are fed into the classifier phase to be used for detecting intrusions.
3.2 Hybrid of Adaboost and ANN for classification
The proposed classification method is a hybrid of two of the most popular classification methods, namely neural network and
Adaboost, to detect the abnormal behaviours of intruders. In order to create a classification model that operates on training
data, it is necessary to separate the train and test data. In this paper, 70% of the data is used for training and the remaining
30% is dedicated to testing. Training data is used to train the proposed algorithm and create the model, and testing samples
are used for evaluating the
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$
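
As a minimal sketch of how (1)-(3) can be evaluated, the following Python snippet counts TP, TN, FP and FN from two label lists and applies the three formulas. The function names `confusion_counts` and `classification_metrics`, the convention of returning 0.0 for an empty denominator, and the sample labels are assumptions made for illustration; they are not taken from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, TN, FP, FN) for binary labels, where 1 = attack and 0 = normal."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Apply equations (1)-(3); an empty denominator yields 0.0 (an assumption)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels, used only to exercise the formulas.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)   # (4, 2, 1, 1)
print(classification_metrics(*counts))
```

For these made-up labels, TP = 4, TN = 2, FP = 1 and FN = 1, so accuracy is 6/8 and precision and recall are both 4/5.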
proposed method. Then the results are compared with those of the other machine learning methods. The results of the above
methods are combined with each other, and in every learning iteration the best answer is selected and the final result is
determined.
The strategy of reinforcement learning is used to combine the classifiers' results, which is based on the Max vote strategy.
Suppose an instance (the features extracted by the Apriori algorithm for one record of the data set) that is an attack
(Class = 1) is logged into the vote system. ANN and Adaboost work on the instance separately. Suppose ANN recognises the
entered instance as an attack (Class = 1). The same example is identified by the Adaboost algorithm as an attack. Given that
both algorithms have identified the target sample as ‘1’ and the strategy of the vote-based system is Max, the final answer will
be ‘1’. In the other case, if both algorithms in the vote system identify the entered sample as ‘0’, then the output will be ‘0’.
In the final case, if one algorithm detects ‘1’ and the other identifies the entered sample as ‘0’, the decision is most
difficult. In the proposed vote system, a penalty is imposed if an algorithm makes a classification error; otherwise, the
algorithm is rewarded. Therefore, if one algorithm identifies a value of ‘0’ and the other one identifies a value of ‘1’, then
the response of the algorithm that has received the most rewards is considered as the final answer.
Different voting methods are employed in previous studies, such as weighted sum, median and product [16]. In this paper, the
weighted sum is used for voting. In the next section, the results obtained by the proposed method are compared with those of
other methods.
RMSE = Sqr Error (6)
4 Experimental results
In this section, the evaluation results of the classification algorithms, obtained using RapidMiner software, are presented. Our
proposed algorithm is compared with some well-known classification algorithms from previous studies, including CHAID, CART,
multiple linear regression and logistic regression. Table 1 lists the characteristics of the system used to implement the
proposed method and evaluate its results.
4.1 Data set
The KDDCUP99 data set is one of the popular data sets for evaluating intrusion detection algorithms. It was compiled from a
United States Air Force data set. The KDDCUP99 data set is composed of 42 features, of which 41 are used to describe each
intrusion and the 42nd is a label representing the type of intrusion. The names and types of the features are given in Table 2.
The value of each feature can be discrete or continuous, represented as Dis and Con, respectively.
4.2 Comparing proposed method with four other methods
The training data, which is 70% of the whole data set, is used for training and generating the model, and the validation data,
which is 30% of the whole data set, is used for assessing the generated model. Then ANN and AdaBoost are applied to create
models. The models are tested with the test data, which is unseen data. Therefore, each model in the proposed algorithm has its
own output, which expresses its prediction. Finally, voting is performed to select the best results of the two classifiers as
the final results for detecting anomalies.
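
The reward-and-penalty vote between the two classifiers described in Section 3.2 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name `RewardVoter`, the reward and penalty values of 1, and the fallback to ‘1’ (attack) when the classifiers disagree and their accumulated rewards are equal are all hypothetical choices, since the paper does not specify them.

```python
class RewardVoter:
    """Combine binary predictions (1 = attack, 0 = normal) from named classifiers."""

    def __init__(self, names, reward=1, penalty=1):
        # reward/penalty magnitudes are assumptions, not values from the paper.
        self.scores = {name: 0 for name in names}
        self.reward = reward
        self.penalty = penalty

    def update(self, predictions, true_label):
        """Reward each classifier that was correct and penalise each that erred."""
        for name, predicted in predictions.items():
            if predicted == true_label:
                self.scores[name] += self.reward
            else:
                self.scores[name] -= self.penalty

    def vote(self, predictions):
        """Return the agreed label, or the label of the most-rewarded classifier."""
        labels = set(predictions.values())
        if len(labels) == 1:          # both classifiers agree
            return labels.pop()
        best = max(self.scores[name] for name in predictions)
        leader_labels = {predictions[n] for n in predictions if self.scores[n] == best}
        if len(leader_labels) == 1:   # a single best-rewarded answer exists
            return leader_labels.pop()
        return 1                      # assumption: flag as attack when rewards are tied


voter = RewardVoter(["ann", "adaboost"])
voter.update({"ann": 1, "adaboost": 0}, true_label=1)  # ann rewarded, adaboost penalised
voter.update({"ann": 0, "adaboost": 0}, true_label=0)  # both rewarded
print(voter.vote({"ann": 0, "adaboost": 1}))           # ann has more rewards, so 0
```

In this sketch the reward totals would be accumulated on labelled data, and `vote` is then applied to unseen records; agreement between the classifiers always decides the label directly.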
Fig. 3 Comparing the precision of the proposed method with other methods
Fig. 4 Comparing the recall of the proposed method with other methods
The same training and testing process is also performed for four other classifiers, CART [20], CHAID [21], linear regression
[15] and logistic regression [22], to compare the results and confirm the better results achieved by the proposed hybrid method.
In Fig. 3, the precision of the proposed method is compared with other methods.
Fig. 5 Comparing the error of the proposed method with other methods
Precision and recall of the proposed method are compared with those of four other popular machine learning methods. The
proposed method outperformed the other methods with 99.55% precision for predicting intrusions in communication networks.
Also, Fig. 4 presents the recall of the proposed algorithm and the other algorithms on the applied data set, with a recall of
95% for detecting intrusions in communication networks. It confirms that most intrusions are successfully captured.
In Fig. 5, the classification error of the proposed method is compared to that of the other methods mentioned above.
In the proposed method, the error in detecting attacks is only 0.45%, that is, fewer than one error in every 100 cases, which
is acceptable, particularly in comparison with the error of the other classification algorithms examined.
The RMSE of the proposed method is shown in Fig. 6; because it is directly related to the error of the proposed method, it
represents the sum of the errors generated by the proposed method.
Owing to the importance of the subject, the proposed method is tested with over five popular data sets and evaluated by its
precision, recall, classification error and RMSE, after which the results are analysed as mentioned below.
Fig. 10 Comparing the recall of the proposed method on different data sets