On the combination of naive Bayes and decision trees for intrusion detection

Salem Benferhat
CRIL-CNRS, Université d'Artois
Rue Jean SOUVRAZ SP18
62307 Lens Cedex, France
Email: [email protected]

Karim Tabia
Université Mouloud Mammeri
Département informatique
Tizi-Ouzou, Algeria
Email: [email protected]

Abstract

Decision trees and naive Bayes have recently been used as classifiers for intrusion detection problems. They show good complementarity in detecting different kinds of attacks. However, both generate a high number of false negatives. This paper proposes a hybrid classifier that exploits the complementarities between decision trees and naive Bayes. In order to reduce the false negative rate, we propose to re-examine decision tree and naive Bayes outputs with an anomaly-based detection system.

1. Introduction

Intrusion detection is the process of detecting unauthorized use of, or attack upon, a computer or network [3]. IDSs (Intrusion Detection Systems) can detect attempts to compromise the confidentiality, integrity, and availability of a computer or network.

There are two general approaches to intrusion detection: misuse detection and anomaly detection. In misuse detection, the IDS analyzes the information it gathers and compares it to large databases of attack signatures. In anomaly detection, the system administrator defines the baseline, or normal, state of the network's traffic load, breakdown, protocol, and typical packet size. The anomaly detector monitors network segments to compare their state to the normal baseline and look for anomalies.

Recently, decision trees [2] and naive Bayes [11, 1] have been used as classifiers for intrusion detection problems. They show good complementarity in detecting different kinds of attacks. However, both generate a high number of false negatives. This paper proposes a hybrid misuse-anomaly intrusion detection approach. More precisely, we propose a hybrid misuse-anomaly multiple classifier system for network intrusion detection.

Our system is evaluated on the well-known KDD'99 dataset. These data cause serious problems for intrusion detection approaches [10] in detecting rare attacks. This is particularly true for R2L and U2R attacks, where the detection rates are very low [10]. In fact, R2L and U2R attack proportions are very small in the KDD'99 learning data. In addition, these two attack categories are characterized by a high degree of variance (for instance, most R2L and U2R attacks are new, i.e. present only in the testing data).

Our hybrid approach is based on the combination of naive Bayes and decision tree classifiers for intrusion detection systems. These two classifiers use different methodologies and have good complementarity/error diversity. More specifically, we are concerned with reducing false negative rates by using an anomaly detection system.

The rest of this paper is organised as follows. The next section provides a brief refresher on decision trees and naive Bayes. Then we present their application to intrusion detection systems. Lastly, we present our hybrid approach.

2 A brief refresher on decision trees and naive Bayes

2.1 Decision trees

Decision trees are among the best-known machine learning techniques. A decision tree is composed of three basic elements:

- A decision node specifying a test attribute.

- An edge or branch corresponding to one of the possible attribute values, i.e. one of the test attribute outcomes.
Proceedings of the 2005 International Conference on Computational Intelligence for Modelling, Control and Automation, and International Conference on
Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC’05)
0-7695-2504-0/05 $20.00 © 2005 IEEE
- A leaf, also named an answer node, containing the class to which the object belongs.

In decision trees, two major phases should be ensured:

1. Building the tree. Based on a given training set, a decision tree is built. This consists of selecting for each decision node the 'appropriate' test attribute and also defining the class labeling each leaf.

2. Classification. In order to classify a new instance, we start at the root of the decision tree, then test the attribute specified by this node. The result of this test lets us move down the tree branch corresponding to the attribute value of the given instance. This process is repeated until a leaf is encountered. The instance is then classified in the class labeling the reached leaf.

Several algorithms have been developed to build decision trees and use them for classification. The ID3 and C4.5 algorithms developed by Quinlan [9, 8] are probably the most popular ones.

A generic decision tree algorithm is characterized by the following properties:

- The attribute selection measure, which chooses the attribute that generates partitions where objects are distributed least randomly. In other words, this measure should reflect the ability of each attribute Ak to determine the training objects' classes. The measure used is Quinlan's gain ratio, based on the Shannon entropy, where for an attribute Ak and a set of objects T it is defined as follows:

Gain(T, Ak) = Info(T) - Info_Ak(T)    (1)

where

Info(T) = - Σ_{i=1}^{n} (freq(c_i, T) / |T|) · log2(freq(c_i, T) / |T|)    (2)

Info_Ak(T) = Σ_{ak ∈ D(Ak)} (|T_ak| / |T|) · Info(T_ak)    (3)

Here freq(c_i, T) denotes the number of objects in the set T belonging to the class c_i, and T_ak is the subset of objects for which the attribute Ak has the value ak (ak belonging to the domain of Ak, denoted D(Ak)).

Then Split_Info(T, Ak) is defined as the information content of the attribute Ak itself [9]:

Split_Info(T, Ak) = - Σ_{ak ∈ D(Ak)} (|T_ak| / |T|) · log2(|T_ak| / |T|)    (4)

So the gain ratio is the information gain calibrated by Split_Info:

GainRatio(T, Ak) = Gain(T, Ak) / Split_Info(T, Ak)    (5)

- The partitioning strategy, which divides the current training set into several subsets according to the values of the selected test attribute.

- The stopping criterion, which stops the growth of a part of the decision tree and consequently declares the corresponding training subset a leaf.

2.2 Naive Bayes

Bayes networks are one of the most widely used graphical models to represent and handle uncertain information [4, 7]. Bayes networks are specified by two components:

- A graphical component composed of a directed acyclic graph (DAG) where vertices represent events and edges are relations between events.

- A numerical component consisting in a quantification of the different links of the DAG by a conditional probability distribution of each node in the context of its parents.

Naive Bayes are very simple Bayes networks composed of a DAG with only one root node (called the parent), representing the unobserved node, and several children, corresponding to observed nodes, with the strong assumption of independence among the child nodes in the context of their parent.

Classification is performed by considering the parent node to be a hidden variable stating the class to which each object in the testing set should belong, while the child nodes represent the different attributes specifying the object.

Hence, in the presence of a training set we only have to compute the conditional probabilities, since the structure is unique. Once the network is quantified, it is possible to classify any new object given its attributes' values, using Bayes' rule.
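The attribute-selection measure of Section 2.1 (equations (1)-(5)) is straightforward to state in code. The following sketch is illustrative only; the function names and the toy attribute/label values are ours, not part of the paper or of an actual C4.5 implementation:

```python
import math
from collections import Counter

def info(labels):
    """Shannon entropy Info(T) of a list of class labels (equation (2))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain ratio of one attribute (equations (1), (3), (4), (5)).

    `values` holds the attribute value of each object, `labels` its class.
    """
    n = len(labels)
    # Partition T by attribute value: one subset T_ak per value ak in D(Ak).
    partitions = {}
    for v, c in zip(values, labels):
        partitions.setdefault(v, []).append(c)
    # Info_Ak(T): entropy of the partitions weighted by |T_ak| / |T|, eq. (3).
    info_ak = sum(len(p) / n * info(p) for p in partitions.values())
    gain = info(labels) - info_ak                        # equation (1)
    split = -sum((len(p) / n) * math.log2(len(p) / n)    # equation (4)
                 for p in partitions.values())
    return gain / split if split > 0 else 0.0            # equation (5)

# Toy example: the attribute separates the two classes perfectly,
# so the gain ratio is 1.0.
print(gain_ratio(["tcp", "udp", "tcp", "udp"],
                 ["normal", "attack", "normal", "attack"]))
```

Dividing by Split_Info is what distinguishes C4.5's gain ratio from ID3's plain information gain: it penalizes attributes with many distinct values, which would otherwise look artificially informative.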
Bayes' rule is expressed by:

p(c_i | A) = p(A | c_i) · p(c_i) / p(A)    (6)

where c_i is a possible value of the session class and A is the total evidence on the attribute nodes. The evidence A can be dispatched into pieces of evidence, say a1, a2, ..., an, relative to the attributes A1, A2, ..., An, respectively. Since naive Bayes works under the assumption that these attributes are independent (given the parent node), their combined probability is obtained as follows:

p(A | c_i) = Π_{j=1}^{n} p(a_j | c_i)    (7)

3 Decision trees and naive Bayes in intrusion detection

This section briefly describes the application of decision trees and naive Bayes to intrusion detection problems (for more details see [1, 2]).

3.1 A brief description of the KDD'99 data set

The data used in this paper are those proposed in KDD'99 for intrusion detection [5], which are generally used for benchmarking intrusion detection problems. They were obtained by setting up an environment to collect raw TCP/IP dumps from a host located on a simulated military network. Each TCP/IP connection is described by 41 discrete and continuous features (e.g. duration, protocol type, flag, etc.) and labeled either as normal or as an attack, with exactly one specific attack type (e.g. Smurf, Perl, etc.). Attacks fall into four main categories:

- Denial of Service attacks (DoS), in which an attacker overwhelms the victim host with a huge number of requests.

- User to Root attacks (U2R), in which an attacker or hacker tries to get access rights from a normal host in order, for instance, to gain root access to the system.

- Remote to User attacks (R2L), in which the intruder tries to exploit system vulnerabilities in order to control the remote machine through the network as a local user.

- Probing, in which an attacker attempts to gather useful information about machines and services available on the network in order to look for exploits.

The features characterizing each connection are divided into:

- basic features of individual TCP connections,

- content features within a connection suggested by domain knowledge,

- time-based traffic features computed using a two-second time window, and

- host-based traffic features computed using a window of 100 connections, used to characterize attacks that scan hosts (or ports) over a much larger time interval than two seconds.

3.2 Evaluation of naive Bayes and decision trees

The experimental results of the naive Bayes and decision tree classifiers on the KDD'99 data set are reported in the table of Figure 1. It shows that the two classifiers have nearly the same efficiency in terms of detection rates and false alarm rates. The two classifiers fail to detect the same categories and detect well nearly the same ones. More particularly, both naive Bayes and decision trees miss most R2L and U2R attacks. However, naive Bayes achieves better detection rates for R2L, Probing and some new attacks.

Figure 1. Naive Bayes and decision trees evaluation on the KDD'99 testing data
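The decision rule of equations (6) and (7) amounts to scoring each class by its prior times the product of the conditional probabilities of the observed attribute values; p(A) in equation (6) is a common factor and can be ignored when comparing classes. A minimal sketch, with made-up probability tables (the names and numbers are illustrative, not estimates from KDD'99):

```python
def classify(priors, cond, observation):
    """Naive Bayes decision rule (equations (6)-(7)).

    priors:      p(ci) for each class ci.
    cond:        cond[ci][j][aj] = p(Aj = aj | ci), from training data.
    observation: the attribute values (a1, ..., an) of one connection.
    Returns the class maximizing p(ci) * prod_j p(aj | ci).
    """
    best_class, best_score = None, -1.0
    for ci, p_ci in priors.items():
        score = p_ci
        for j, aj in enumerate(observation):
            # Independence assumption of equation (7): multiply the
            # per-attribute conditional probabilities.
            score *= cond[ci][j].get(aj, 0.0)
        if score > best_score:
            best_class, best_score = ci, score
    return best_class

# Hypothetical tables for two attributes (protocol type, flag).
priors = {"normal": 0.8, "attack": 0.2}
cond = {"normal": [{"tcp": 0.9, "udp": 0.1}, {"SF": 0.95, "REJ": 0.05}],
        "attack": [{"tcp": 0.5, "udp": 0.5}, {"SF": 0.2, "REJ": 0.8}]}

print(classify(priors, cond, ("tcp", "REJ")))  # prints "attack"
print(classify(priors, cond, ("tcp", "SF")))   # prints "normal"
```

In the first call the rejected connection scores 0.2 * 0.5 * 0.8 = 0.08 for "attack" against 0.8 * 0.9 * 0.05 = 0.036 for "normal", so the rarer class wins despite its lower prior.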
3.3 Complementarity/error diversity of naive Bayes and decision trees on the KDD'99 testing data

Complementarity/error diversity is a key factor in the effectiveness of hybrid classifier systems [12]. The matrix of Figure 2 sums up this aspect for naive Bayes and decision trees on the KDD'99 data set.

Figure 2. Error diversity characterizing naive Bayes and decision trees on the KDD'99 testing data

This table shows the strong correlation between naive Bayes and decision tree outputs on the KDD'99 data set. The two classifiers fail at the same time in classifying 7.20% of the testing connections, among which we find most of the new attacks. The probability that the two classifiers output the same class is about 0.98 on the testing connections. From the confusion matrices of naive Bayes and decision trees on the KDD'99 testing data, the majority of misclassified connections are false negatives (attacks detected as normal connections) that come mostly from R2L attacks.

4 Hybrid misuse-anomaly approach for intrusion detection

Given that neither a decision-tree-based IDS nor a naive Bayes IDS can by itself improve the overall classification rate (PCC) (and the R2L and U2R detection rates in particular), we suggest developing a hybrid model that takes advantage of the complementarity/error diversity of naive Bayes and decision trees. Other mechanisms are specifically needed to improve R2L and U2R detection rates, because these two attack categories are in most cases misclassified by both first-level classifiers at the same time. Our approach is hybrid in the sense that, on one hand, it combines misuse and anomaly detection approaches and, on the other hand, it combines machine learning approaches with expert knowledge.

The main remarks taken into account when building our hybrid approach are:

- Most misclassified connections are false negatives coming, respectively, from the R2L, DoS, Probing and U2R categories.

- Most new attacks are misclassified and predicted as normal connections.

- The two classifiers have very correlated outputs (in the sense of predicting the same class with a high probability).

Moreover, as mentioned above, the greater part of detection errors are false negatives. Therefore, dealing with true/false positives will not really improve the PCC. Hence, our aim is to show how to deal with true/false negatives. This is presented in the next section.

5 Dealing with true/false negatives

5.1 A general schema

Distinguishing between normal connections and abnormal ones is by definition the objective of an anomaly detection approach. To implement anomaly detection in our system, a similarity measure is used to estimate the "normality" of a connection. If the connection is considered abnormal, a second mechanism (not detailed in this paper), based on expert knowledge, is used to determine the attack category of each false negative.

The mechanism that deals with true/false negatives proceeds as follows:

1. Given that the major part of detection errors are false negatives, the first thing to do is to check whether a connection is normal, using a similarity measure.

2. If the connection is identified as normal, then a final decision is made for this connection. Otherwise the following cases are considered:

- Anomaly detection approaches are known to generate a high false alarm rate. Accordingly, we provide a mechanism to verify that the suspected connection is not a normal one. This may bring evidence of the connection's normality, in which case it is again recognized
as a normal connection and no more action will be undertaken.

- Otherwise, the connection is an actual attack and further work is needed to determine its exact attack category.

5.2 Evaluating connection normality

A similarity measure is used to differentiate between true and false negatives. This similarity measure takes into account the following constraints:

- some attributes are not of numeric type,

- numeric attributes are not normalized,

- attributes have as yet unknown weights in determining the normal/abnormal character of a connection.

Evaluating connection normality through a similarity measure requires a profile of the normal connections category. Our profile is statistical and is built on quantitative magnitudes such as means and standard deviations for numeric attributes and frequencies for symbolic ones, computed over the normal connections in the KDD'99 training data. The similarity of a connection with the normal profile is calculated as follows:

Normality(connection) = f(connection, Normal_Profile)    (8)

and

f(connection, Normal_Profile) = Σ_{i=0}^{40} w_i · distance(A_i, Â_i)    (9)

Note that w_i denotes the weight associated with the attribute A_i within a KDD'99 connection, while distance(A_i, Â_i) denotes a kind of distance measure between the attribute A_i and its corresponding attribute Â_i in the normal connections model. The attributes Protocol_type, Service and Flag are given greater weights within the similarity measure because, in an anomaly detection context, every new value occurrence in these three attributes must be considered anomalous. For example, a new service (a value never seen within the attribute Service) undoubtedly constitutes anomalous behavior. Only these three attributes are given greater weights, because their semantics in the anomaly detection context is obvious.

To compute the distance between a given attribute A_i and its corresponding attribute Â_i in the normal model, one of the two following cases is considered:

- If A_i is a numeric attribute, then the distance is calculated as follows:

distance(A_i, Â_i) = |a_i - m_i| / σ_i    (10)

where m_i and σ_i denote respectively the mean (average) and the standard deviation of the attribute A_i over the normal connections in the training data.

- If A_i is a symbolic attribute, then the distance is estimated according to the improbability of the value of A_i among the normal connections of the training data. Let a_i be the value taken by A_i; then

distance(A_i, Â_i) = 1 - p(a_i)    (11)

Note that p(a_i) denotes the probability (frequency) of A_i = a_i in normal training connections. Intuitively, the rarer the value a_i in the training data, the smaller the probability p(a_i).

After having estimated the connection's normality, we need to fix a normality threshold in order to declare connections actually normal or abnormal. A connection is declared an actual false negative (abnormal) if

Normality(connection) > Normal_threshold    (12)

Normal_threshold is obtained from the mean normality of the normal connections in the KDD'99 training data set.

If a connection is declared not normal, we need to determine its category (among DoS, R2L, U2R and Probing). This mechanism is based on properties of the four attack categories. For instance, DoS and Probing connections are distinguished from R2L or U2R ones according to connection duration and number. Such information is available in the time- and traffic-based attributes characterizing a KDD'99 connection [6]. DoS and Probing attacks generally last a very short time and involve several connections, while R2L and U2R attacks are in most cases enclosed in one connection [6].

Similarly, to differentiate between R2L and U2R connections, we use the fact that R2L attacks are initiated by a remote outsider who has not authorized
access on the victim, whereas U2R attacks try to get root privileges by an authorized user or by an intruder who has gained local access [6]. The key details are remote/local access, user/root privileges and commands, etc. The content features are particularly suited to revealing signs characteristic of R2L and U2R attacks.

Lastly, DoS connections differ from Probing ones in the fact that most DoS attacks overwhelm their victim (the same destination) with a huge number of queries, while Probing gathers information by scanning several destination hosts and service ports [6]. This information is obtained from the appropriate traffic- and time-based features.

6 Conclusion

This paper has proposed a hybrid approach for intrusion detection systems. It is based on a combination of naive Bayes and decision tree classifiers. The outputs of naive Bayes and decision trees are re-analysed by an anomaly detection system to evaluate whether a connection is normal or not. Actual experimental results show that our hybrid approach gives better results than existing systems. Future work is to investigate the integration of expert knowledge for attack classification.

Acknowledgments

This work is supported by the French national project ACI (Action Concertée Incitative) Sécurité Informatique entitled DADDi (Dependable Anomaly Detection with Diagnosis).

References

[1] N. Ben Amor, S. Benferhat, and Z. Elouedi. Naive Bayesian networks in intrusion detection systems. In Probabilistic Graphical Models for Classification, pages 11–23, Cavtat-Dubrovnik, Croatia, 2003.

[2] N. Ben Amor, S. Benferhat, and Z. Elouedi. Naive Bayes vs decision trees in intrusion detection systems. In ACM Symposium on Applied Computing (SAC 2004), pages 420–424, Nicosia, Cyprus, 2004.

[3] S. Axelsson. Intrusion detection systems: a survey and taxonomy. Technical report 99-15, 2000.

[4] F. V. Jensen. Introduction to Bayesian networks. UCL Press, University College, London, 1996.

[5] KDD. http://kdd.ccs.uci.edu/databases/kddcup99, 1999.

[6] K. Kendall. A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems. Master's thesis, Massachusetts Institute of Technology, 1999.

[7] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, San Francisco, California, 1988.

[8] J. R. Quinlan. Decision trees as probabilistic classifiers. In Proceedings of the Fourth International Workshop on Machine Learning, pages 31–37. Morgan Kaufmann, 1987.

[9] J. R. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, San Mateo, California, 1993.

[10] M. Sabhnani and G. Serpen. Why machine learning algorithms fail in misuse detection on KDD intrusion detection data set. Journal of Intelligent Data Analysis, 2004.

[11] A. Valdes and K. Skinner. Adaptive, model-based monitoring for cyber attack detection. In Recent Advances in Intrusion Detection (RAID 2000), pages 80–92, Toulouse, France, 2000.

[12] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22(3):418–435, 1992.