A New Data-Mining Based Approach For Network Intrusion Detection
I. INTRODUCTION
Awareness of the risk posed by network attacks from criminals or terrorists has grown recently, as information systems are now more open to the Internet than ever before. Records made available by the Pentagon showed that it logged over 79,000 attempted intrusions in 2005, about 1,300 of them successful. Deploying sophisticated firewalls or authentication systems is no longer enough to build a secure information system. In addition, most intrusion detection systems today rely on handcrafted signatures, just like anti-virus software, and these signatures have to be updated continuously to remain effective against new attacks. There is now a need to focus on detecting unknown intrusions instead of relying on this signature-based approach. This has led to another approach to intrusion detection, which consists of detecting anomalies on the network. Anomaly detection attempts to quantify usual or acceptable behavior and flags other, irregular behavior as potentially intrusive [1]. Unfortunately, the number of false
978-0-7695-3649-1/09 $25.00 2009 IEEE DOI 10.1109/CNSR.2009.64 372
positives generated by existing systems of this kind is often too high [2], which forces network administrators to review too many entries and reduces their efficiency. It is generally believed that intrusions exhibit patterns that differ from normal traffic, and that any unknown intrusion will present patterns more similar to known intrusions than to normal data. This observation suggests that intrusion detection can be treated as a data analysis problem: by gathering network traffic, computing the right features, and using the right classification algorithm, a system should be able to detect known intrusions as well as new ones.
In this paper, we present a new data-mining based technique for intrusion detection that uses an ensemble of binary classifiers with feature selection and multiboosting simultaneously. Our model employs feature selection so that the binary classifier for each type of attack can be more accurate, which improves the detection of attacks that occur less frequently in the training data. Building on these accurate binary classifiers, our model applies a new ensemble approach that aggregates each binary classifier's decision for the same input and decides which class is most suitable for that input. During this process, the potential bias of a given binary classifier can be alleviated by the decisions of the other binary classifiers. Our model also makes use of multiboosting to reduce both variance and bias. In addition, we evaluate our model through various experiments.
This paper is organized as follows. Section II presents our motivation for using feature selection and multiboosting. Section III describes our proposed method. Section IV analyzes the results of our experiments. Finally, Section V presents our conclusion and future work.
II. MOTIVATION
Several research efforts have examined how Knowledge Discovery and Data Mining (KDD) tasks can help improve Intrusion Detection Systems (IDSs): classification, sequential analysis, time series analysis, prediction, clustering, and association rules [1][3][4][5][6]. In addition, previous work on the KDD cup 99 dataset [7][8], which is the dataset we use for our experiments, has applied various data mining techniques, e.g., by varying the classification algorithm, focusing on feature selection, and even combining techniques. Among these approaches, we are interested in leveraging feature selection to improve the detection of attacks that
occur infrequently in the training data, and multiboosting for reducing both variance and bias.
A. Feature Selection
Feature selection is an important asset in building classification models, as some data may hinder the classification process in a complex domain. Moreover, eliminating useless features enhances detection accuracy while speeding up computation. Thus, feature selection improves the overall performance of the detection mechanism.
A few data mining techniques have used feature selection. The simplest approach consists of removing one feature at a time and testing the performance of a classification algorithm without the removed feature. This approach was used by Mukkamala and Sung [9] and was tested with two different classification algorithms: Support Vector Machines (SVMs) and Artificial Neural Networks (NNs). Another, more efficient approach to feature selection was proposed by Chebrolu, Abraham, and Thomas [10], who put forward two different techniques: Bayesian networks and Classification and Regression Trees (CARTs).
In our own experiments with feature selection, we observed that it improved overall accuracy, reduced the number of false positives, and improved the detection of instances with low frequency in the training data. The last point is the main reason why we introduce feature selection in our proposed model.
B. Multiboosting
The effect of combining different classifiers can be explained with the theory of bias-variance decomposition. Bias refers to an error due to a learning algorithm, while variance refers to an error due to the learned model. The total expected error of a classifier is the sum of the bias and the variance. In order to reduce bias and variance, several ensemble approaches have been introduced: Adaptive Boosting (AdaBoost) [11], Bootstrap Aggregating (Bagging) [12], Wagging [13][14], and Multiboosting [15].
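As a hedged illustration of the decomposition mentioned above: the text does not say which formal variant it has in mind, so the following is only the familiar squared-error form, given for orientation. It assumes a target y = f(x) + ε with noise variance σ² and a model \hat{f}_D trained on a random training set D; all of these symbols are introduced here and are not from the paper.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Analogous decompositions exist for zero-one classification loss, but their terms are defined differently, so this form should be read as an analogy rather than as the exact quantity the ensemble methods below are measured against.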
Multiboosting uses a decision committee technique that combines AdaBoost with wagging. As described by Webb [15], multiboosting can be viewed as wagging committees formed by AdaBoost. There is evidence that AdaBoost reduces both bias and variance, whereas bagging, which employs a voting concept for classification, only reduces variance [11][12]. This motivated the idea of combining the two in order to profit from the advantages of both algorithms and obtain an overall error reduction. The idea initially used a set of bagging subcommittees formed by AdaBoost, but since bagging would reduce the number of training examples available to each subcommittee (bagging samples with replacement), wagging was employed instead. Webb also reported that multiboosting offers a greater reduction in variance than other boosting methods, as well as a reduction in bias [15]. To benefit from these properties, we decided to use multiboosting in our proposed model after verifying that it performed better than bagging, AdaBoost,
and some other optimization methods found in the literature when tested with the current dataset.
III. OUR APPROACH
A. Dataset and Attack Types
The dataset used in our experiments is the KDD 99 cup version of the 1998 Defense Advanced Research Projects Agency (DARPA) intrusion evaluation dataset [8] from the MIT Lincoln Laboratory. The DARPA dataset is used as a test bed for intrusion-detection evaluation and was generated from raw TCP/IP dump data by simulating a typical U.S. Air Force LAN. We are aware that the KDD 99 cup dataset has been criticized because it is outdated and artificially synthesized [16], and that there is a new Protected Repository for the Defense of Infrastructure against Cyber Threats (PREDICT) dataset [17]. However, for the PREDICT dataset, we could not find other experimental results for performance comparison. Additionally, our approach is designed to be less sensitive to a particular dataset (it uses uniformly sampled training data and introduces no bias in the design process in terms of data use, arbitration, and multiboosting), so it is expected to adapt well to other datasets. Thus, we decided to use the KDD 99 cup dataset in our experiments, and we will use the PREDICT dataset and real network data in our future work.
The training dataset contains 24 attack types that can be classified into four main categories: Probing (Probe), Denial of Service (DOS), User to Root (U2R), and Remote to Local (R2L). Tables I and II show the attack types and their distribution in the datasets.
B. Features and Challenges
Although the original data contained 744 MB of data with 4,900,000 records, the KDD cup dataset uses a subset of 494,021 records. For each record, 41 quantitative and qualitative features were extracted.
TABLE I. ATTACK TYPES
Type   Attacks in the Training Data                 Additional Attacks in the Testing Data
Probe  Ipsweep, Nmap, Portsweep, Satan              Mscan, Saint
DOS    Back, Land, Neptune, Pod, Smurf, Teardrop    Apache2, Mailbomb, Processtable, Udpstorm
U2R    Buffer_overflow, Loadmodule, Perl, Rootkit   Httptunnel, Ps, Worm, Xterm
R2L    Ftp_write, Guess_passwd, Imap, Multihop,     Named, Sendmail, Snmpgetattack, Snmpguess,
       Phf, Spy, Warezclient, Warezmaster           Sqlattack, Xlock, Xsnoop
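For readers who want to apply the grouping in Table I programmatically, the mapping can be sketched as a simple lookup. This is only an illustration: the attack names are copied from Table I (lowercased), while the dictionary layout, the helper names, and the handling of a trailing dot on raw labels are assumptions made here, not part of the paper.

```python
# Attack names follow Table I (lowercased). The structure and helper names
# below are illustrative assumptions, not the authors' code.
TRAINING_ATTACKS = {
    "probe": ["ipsweep", "nmap", "portsweep", "satan"],
    "dos": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "u2r": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "r2l": ["ftp_write", "guess_passwd", "imap", "multihop",
            "phf", "spy", "warezclient", "warezmaster"],
}

TEST_ONLY_ATTACKS = {
    "probe": ["mscan", "saint"],
    "dos": ["apache2", "mailbomb", "processtable", "udpstorm"],
    "u2r": ["httptunnel", "ps", "worm", "xterm"],
    "r2l": ["named", "sendmail", "snmpgetattack", "snmpguess",
            "sqlattack", "xlock", "xsnoop"],
}


def build_category_lookup(*tables):
    """Flatten {category: [attack, ...]} tables into {attack: category}."""
    lookup = {"normal": "normal"}
    for table in tables:
        for category, names in table.items():
            for name in names:
                lookup[name] = category
    return lookup


CATEGORY_OF = build_category_lookup(TRAINING_ATTACKS, TEST_ONLY_ATTACKS)


def to_category(label):
    """Map a raw label to one of the five classes.

    Assumption: raw labels may carry a trailing dot (e.g. "smurf."),
    which is stripped before the lookup.
    """
    return CATEGORY_OF[label.strip().rstrip(".").lower()]
```

With this lookup, a label such as `"smurf."` maps to `"dos"`, and a test-only attack such as `"Mscan"` maps to `"probe"`, which is how unseen attack names can still be scored against the five-class problem.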
TABLE II.
Dataset    Normal   Probe   DOS      U2R     R2L
Training   19.69%   0.83%   79.24%   0.01%   0.23%
Testing    19.48%   1.34%   73.90%   0.07%   5.20%
Intrinsic features: These features are common to any network connection; they are basic features such as the duration of the connection, protocol type, service type, and total bytes sent.
Time-based features: These are the "same host" and "same service" features, which together are called time-based traffic features. Combined with the basic features, they can be used to detect DOS and fast probing attacks within a two-second time window.
Host-based traffic features: These features were constructed using a window of 100 connections to the same destination host instead of a time window. They are useful for detecting slow probing attacks that require several minutes to execute.
Content-based features: Unlike DOS and Probe attacks, which require many connections in a short amount of time, R2L and U2R attacks do not have any sequential patterns. Thus, the content features were computed by examining packet contents; e.g., the number of failed logins could help detect a U2R attack.
The main challenge in using the KDD cup dataset is the distribution of the attacks in the training and testing datasets. Pattern-recognition and machine-learning techniques trained with the KDD cup training data and tested with its testing data failed to detect the majority of U2R and R2L attacks with an acceptable false positive rate. As Table I shows, some attacks that do not exist in the training set appear in the testing set, e.g., Mscan and Saint in Probe. Since attacks within a category exhibit similar behavior in the feature space with respect to the features relevant to that category, we should be able to detect some of these new attacks if we have enough training data for the category. The main problem is the small amount of training data available for the U2R and R2L categories, which account for 0.01% and 0.23% of the training set, respectively, as shown in Table II.
C. Utilizing Feature Selection and Multiboosting
In our model, we aim to increase overall detection accuracy as well as decrease bias and variance by leveraging an ensemble of individual binary classifiers with feature selection and multiboosting simultaneously.
First, we generate a binary classifier for each type of event, applying different features for different classes to generate more accurate results. The effectiveness of feature selection for binary classifiers has been shown in the experiments of Chebrolu, Abraham, and Thomas [10]. We also found in our experiments that feature selection is useful for detecting low-frequency instances such as U2R and R2L. Thus, it has helped to increase the overall accuracy of our model.
Second, along with the accurate binary classifiers based on feature selection, our method also applies an ensemble approach, i.e., we combine the results of each binary classifier to generate a less biased result. Each of the five binary classifiers makes a binary decision independently for its corresponding type of event based on the same input. The ensemble then aggregates the results and produces the output
result using the confidence criterion of each classifier. In [10], the authors used an ensemble approach to make a single binary decision, whereas we employ the ensemble approach to decide which class is most suitable for a given input. Thus, the potential bias of a given binary classifier can be alleviated by the results of the other binary classifiers. In addition, we make use of the feedback mechanism of multiboosting. As a result, our model can achieve lower bias and variance.
D. Proposed Model
Our model is illustrated in Fig. 1 and described as follows. For each trial i, i = 1, ..., T, where T is the total number of trials:
(1) A sample training set is generated by a multibooster using wagging (as specified in Webb's multiboosting algorithm [15]).
(2) Binary classifiers are generated for each class of event using the features relevant to that class and the C4.5 classification algorithm [13]. Binary classifiers are derived from the training sample by treating all classes other than the current class as "other"; e.g., C_normal considers two classes: normal and other. The purpose of this phase is to select different features for different classes by applying the information gain [18] or gain ratio [13] in order to identify the relevant features for each binary classifier. Applying the information gain or gain ratio returns the features that carry the most information for separating the current class from all other classes. The output of this ensemble of binary classifiers is decided by an arbitration function based on the confidence level of the output of the individual binary classifiers (e.g., see Fig. 2).
(3) The ensemble classifier is used by the multibooster to calculate the classification error and derive the next training set.
(4) After T trials, the final committee is formed, and it is used by our intrusion detection system.
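The per-class feature selection in step (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles only discrete feature values (continuous KDD features would first need discretization), it uses plain information gain rather than gain ratio, and the function names, the toy records, and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def one_vs_rest(labels, target):
    """Relabel every class except `target` as "other"."""
    return [y if y == target else "other" for y in labels]


def select_features(rows, labels, target, threshold):
    """Keep features whose gain against the target-vs-other labels
    reaches `threshold`. `rows` is a list of {feature_name: value} dicts."""
    binary_labels = one_vs_rest(labels, target)
    selected = []
    for name in rows[0]:
        gain = information_gain([row[name] for row in rows], binary_labels)
        if gain >= threshold:
            selected.append(name)
    return selected


# Toy records (hypothetical values, not taken from the dataset).
rows = [
    {"protocol_type": "icmp", "flag": "SF"},
    {"protocol_type": "icmp", "flag": "SF"},
    {"protocol_type": "tcp", "flag": "SF"},
    {"protocol_type": "tcp", "flag": "SF"},
]
labels = ["dos", "dos", "normal", "probe"]
dos_features = select_features(rows, labels, "dos", threshold=0.05)
```

In this toy example `protocol_type` separates "dos" from "other" perfectly while `flag` is constant and carries no information, so only `protocol_type` is kept for the DOS classifier. Running the same selection with a different target class can yield a different feature subset, which is the point of building one binary classifier per class.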
IV.
A. Evaluation and the Performance of Learning Algorithms
A misclassification cost matrix was provided to evaluate the results of the KDD 99 contest, as shown in Table III. We used the same cost matrix as a basis for comparison with other methods using the same dataset. In order to choose a learning algorithm for our experiments, we compared the performance of different algorithms using the whole training set. Based on overall performance, training time, overall accuracy, and correct classification rate on normal data, we selected C4.5 for the binary classifier of each class in our experiments.
B. Experiments
In our experiments, we use the training and test data from the KDD 99 cup dataset. Because of the computational cost of using the whole KDD cup training set, we had to use a 2% subset generated randomly using a uniform distribution. Our results can be compared with other work on the subject, since we used the whole testing set of 311,028 records and the same cost matrix. Moreover, unlike the KDD 99 winner's method, which samples with a bias toward rare classes and thereby changes the class distribution [19], we used a uniform distribution in our sampling. Thus, we believe that, overall, our approach is less biased than the method of the KDD 99 winner. We compare the performance of two different methods: one using an ensemble of individual binary classifiers with feature selection, and the other combining multiboosting with feature selection.
Experimental setting 1: ensemble of binary classifiers with feature selection
In our first experiment, we created a system of binary classifiers, which for our dataset means five classifiers trained separately, one for each class. We then use an arbitration strategy to determine the class of an instance (Fig. 2).
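One simple way to picture a confidence-based arbitration step is sketched below. The actual rule of Fig. 2 is not reproduced in the text, so the rule used here is an assumption made purely for illustration: among the classifiers that claim the instance, the most confident one wins; if none claims it, the class whose classifier was least confident in its "other" decision is chosen. The function name and input format are hypothetical.

```python
CLASSES = ["normal", "probe", "dos", "u2r", "r2l"]


def arbitrate(decisions):
    """Pick one class from five independent binary decisions.

    `decisions` maps each class name to a pair (is_member, confidence),
    where `is_member` is the binary classifier's verdict for its own class
    and `confidence` in [0, 1] is how sure it is of that verdict.
    Illustrative rule only; not the paper's arbitration function.
    """
    claimed = {cls: conf for cls, (is_member, conf) in decisions.items()
               if is_member}
    if claimed:
        # Several classifiers may claim the instance: trust the most confident.
        return max(claimed, key=claimed.get)
    # No classifier claims the instance: fall back to the class whose
    # "other" verdict was the weakest.
    return min(decisions, key=lambda cls: decisions[cls][1])


example = {
    "normal": (True, 0.60),
    "probe": (False, 0.90),
    "dos": (True, 0.80),
    "u2r": (False, 0.70),
    "r2l": (False, 0.95),
}
predicted = arbitrate(example)
```

Here both the normal and DOS classifiers claim the instance, and the DOS classifier is more confident, so the arbitration returns "dos". A learned arbiter could replace this hand-written rule by taking the same five confidence values as input.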
TABLE III. MISCLASSIFICATION COST MATRIX OF KDD 99 CONTEST
Probe  1  0  1  2  2
DOS    2  2  0  2  2
U2R    2  2  2  0  2
R2L    2  2  2  2  0
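The cost figures reported in the experiments are average misclassification costs per test record, which can be computed from a confusion matrix as sketched below. Two assumptions are made here and should be checked against the contest documentation: the matrix is indexed as rows = actual class and columns = predicted class in the order normal, probe, DOS, U2R, R2L, and the "predicted normal" column (0, 1, 2, 3, 4), which does not appear in Table III as printed above, is taken from the published KDD Cup 99 task description. The remaining four columns match the values listed in Table III. The function name and the toy confusion matrix are hypothetical.

```python
CLASSES = ["normal", "probe", "dos", "u2r", "r2l"]

# COST[actual][predicted]. The first column (predicted "normal") is an
# assumption taken from the KDD Cup 99 task description; the other four
# columns match the values listed in Table III.
COST = [
    [0, 1, 2, 2, 2],  # actual normal
    [1, 0, 2, 2, 2],  # actual probe
    [2, 1, 0, 2, 2],  # actual dos
    [3, 2, 2, 0, 2],  # actual u2r
    [4, 2, 2, 2, 0],  # actual r2l
]


def average_cost(confusion):
    """Average misclassification cost per record.

    `confusion[i][j]` is the number of records of actual class i that were
    predicted as class j, using the same class order as COST.
    """
    total = sum(sum(row) for row in confusion)
    weighted = sum(confusion[i][j] * COST[i][j]
                   for i in range(len(CLASSES))
                   for j in range(len(CLASSES)))
    return weighted / total


# Toy confusion matrix (hypothetical counts): 8 normal records classified
# correctly, 1 DOS record predicted as probe, 1 U2R record predicted as normal.
toy = [
    [8, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
toy_cost = average_cost(toy)
```

For the toy matrix the weighted cost is 1 + 3 = 4 over 10 records, i.e. 0.4. Because missed U2R and R2L attacks are penalized most heavily, a classifier can have high overall accuracy and still a poor cost, which is why the paper reports both numbers.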
The main purpose of training a separate binary classifier per class is to use different features for each class: some features may be relevant for one class and not for another. In order to determine the important features for each class, each of our five classifiers first applies a feature selection module based on information gain. We varied the information gain threshold for attribute selection in each binary classifier in our experiments. In each of our classifiers we opted to use C4.5 because of its accuracy in previous experiments. Fig. 3 shows the results obtained using different information gain thresholds. In our best case, 98.3% of normal data is correctly classified, against 97.0% when all features are considered. In addition, for the U2R attack, feature selection seems to contribute significantly to detection accuracy, since we go from 51.8% to 65.4% in the best case for this class. We also performed the same experiments using the information gain ratio criterion for feature selection (Fig. 4). These experiments show that, overall, varying the threshold gives better results with the information gain than with the information gain ratio. Thus, we decided to combine all five models, with information gain thresholds varying from 0.001 to 0.05, into a single model with an arbitration strategy using a trained Radial
Figure 6. Multiboosting + ensemble of binary classifiers with feature selection using information gain
Basis Function (RBF) neural network. The results are shown in Fig. 5. The best overall accuracy in this setting was obtained using this method, with 92.30% overall accuracy and a cost of 0.2184. This cost is better than that of the winning entry of the KDD cup.
Experimental setting 2: multiboosting + ensemble of binary classifiers with feature selection
The results obtained using multiboosting with no feature selection were quite promising, since we obtained an overall accuracy of 93.44% and a cost of 0.194. However, the detection rate of normal data (97.9%) was not high enough. In our previous experiments, we observed that an ensemble of binary classifiers with attribute selection helps increase the number of correctly classified normal records, which means fewer false positives. In our second experimental setting, we used the system described in the previous section, again varying the information gain threshold, with a number of trials (committee size) of three. We obtained a lower overall accuracy of 92.55%, but the system still performed better than the previous one. As expected, the detection rate of normal data is increased with the ensemble of binary classifiers with feature selection,
reaching 98.2% with a threshold of 0.05 (Fig. 6). We also compare the accuracy of our experiment with that of the KDD cup winner in Fig. 7. In addition, with a threshold of 0.05, we obtained a cost of 0.2025 against 0.2331 for the KDD cup winner.
Our last experiment in this section uses multiboosting together with gain ratio feature selection instead of information gain. The results are shown in Fig. 8. The result with a gain ratio threshold of 0.01 is the best we obtained in terms of the trade-off between false positives and overall accuracy. Up to 99% of normal data is correctly classified, which leaves us with a low false positive rate. The overall accuracy, 93.41%, is one of the best obtained. Additionally, the result with a gain ratio threshold of 0.04 has the highest overall accuracy (93.8128%) and the lowest cost (0.1853) of all our experiments.
V. CONCLUSION
In this paper, we propose a new data-mining based approach by combining multiboosting and an ensemble of
Figure 7. Performance of multiboosting + ensemble of binary classifiers with feature selection using information gain 0.05
Figure 8. Performance of multiboosting + ensemble of binary classifiers with feature selection using information gain ratio
binary classifiers with feature selection using either the information gain or the gain ratio criterion. This approach consists of three major functions: 1) generating accurate binary classifiers by applying different features for different types of attacks, 2) a new ensemble approach over the binary classifiers for removing bias, and 3) applying multiboosting to reduce both bias and variance. The model performs well: we obtain a 93.8128% detection rate using the gain ratio criterion, as well as high detection rates for U2R and R2L compared to other work. The proposed system performs better than the winning entry of the KDD cup in terms of accuracy and cost. Our experiments are based solely on the KDD/DARPA dataset, so we still need to verify our results in other network environments. Future work will extend the analysis to the PREDICT dataset and real network data.
REFERENCES
[1] S. Kumar, "Classification and detection of computer intrusions", Ph.D. thesis, Purdue Univ., West Lafayette, IN, 1995.
[2] W. Lee and D. Xiang, "Information-theoretic measures for anomaly detection", Proc. of the 2001 IEEE Symp. on Security and Privacy, Oakland, CA, May 2001, pp. 130-143.
[3] A. K. Ghosh, A. Schwartzbard, and M. Schatz, "Learning program behavior profiles for intrusion detection", Proc. of 1st USENIX Workshop on Intrusion Detection and Network Monitoring, Santa Clara, CA, April 1999, pp. 51-62.
[4] W. Lee and S. J. Stolfo, "Data mining approaches for intrusion detection", Proc. of the 7th USENIX Security Symp., San Antonio, TX, 1998.
[5] W. Lee, S. J. Stolfo, and K. W. Mok, "A data mining framework for building intrusion detection models", Proc. of the 1999 IEEE Symp. on Security and Privacy, Oakland, CA, May 1999, pp. 120-132.
[6] W. Lee, R. A. Nimbalkar, K. K. Yee, S. B. Patil, P. H. Desai, T. T. Tran, and S. J. Stolfo, "A data mining and CIDF based approach for detecting novel and distributed intrusions", Lecture Notes in Computer Science, Vol. 1907, pp. 49-54, 2000.
[7] The UCI KDD Archive, "KDD cup 1999 data", http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[8] MIT Lincoln Laboratory, "DARPA intrusion detection evaluation", http://www.ll.mit.edu/IST/ideval/, MA, USA.
[9] S. Mukkamala and A. H. Sung, "Identifying significant features for network forensic analysis using artificial intelligent techniques", International Journal of Digital Evidence, Vol. 1, Issue 4, Winter 2003.
[10] S. Chebrolu, A. Abraham, and J. P. Thomas, "Feature deduction and ensemble design of intrusion detection systems", Computers & Security, Vol. 24, Issue 4, June 2005, pp. 295-307.
[11] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting", Journal of Computer and System Sciences, Vol. 55, Issue 1, August 1997, pp. 119-139.
[12] L. Breiman, "Bagging predictors", Technical Report No. 421, Department of Statistics, University of California Berkeley, September 1994.
[13] J. R. Quinlan, "C4.5: programs for machine learning", Morgan Kaufmann Publishers, 1993.
[14] E. Bauer and R. Kohavi, "An empirical comparison of voting classification algorithms: bagging, boosting and variants", Machine Learning, Vol. 36, Nos. 1-2, 1999, pp. 105-139.
[15] G. I. Webb, "Multiboosting: a technique for combining boosting and wagging", Machine Learning, Vol. 40, 2000, pp. 159-196.
[16] T. Brugger, "KDD Cup '99 dataset (network intrusion) considered harmful", http://www.kdnuggets.com/news/2007/n18/4i.html
[17] PREDICT Coordinating Center, "PREDICT overview", https://www.predict.org/Portals/0/files/Documentation/MANUAL%20OF%20OPERATIONS/PREDICT_Overview_final.pdf?
[18] S. Kullback, "The Kullback-Leibler distance", The American Statistician, 1987, pp. 340-341.
[19] B. Pfahringer, "Winning the KDD99 classification cup: Bagged Boosting", ACM SIGKDD Explorations Newsletter, Vol. 1, Issue 2, 2000, pp. 65-66.