Comparison of Clustering Methods
0.1 Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Introduction 2
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Goal of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 6
2.1 Introduction to computer network security . . . . . . . . . . . 6
2.1.1 Network security . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Network intrusion detection systems . . . . . . . . . . 7
2.1.3 Network anomaly detection . . . . . . . . . . . . . . . 8
2.1.4 Computer attacks . . . . . . . . . . . . . . . . . . . . . 9
2.2 Introduction to clustering . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Notation and definitions . . . . . . . . . . . . . . . . . 12
2.2.2 The clustering problem . . . . . . . . . . . . . . . . . . 12
2.2.3 The clustering process . . . . . . . . . . . . . . . . . . 13
2.2.4 Feature selection . . . . . . . . . . . . . . . . . . . . . 13
2.2.5 Choice of clustering algorithm . . . . . . . . . . . . . . 13
2.2.6 Cluster validity . . . . . . . . . . . . . . . . . . . . . . 16
2.2.7 Clustering tendency . . . . . . . . . . . . . . . . . . . . 17
2.2.8 Clustering of network traffic data . . . . . . . . . . . . 18
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 Experiments 62
4.1 Design of the experiments . . . . . . . . . . . . . . . . . . . . 62
4.2 Data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.1 Choice of data set . . . . . . . . . . . . . . . . . . . . . 63
4.2.2 Description of the feature set . . . . . . . . . . . . . . 65
4.3 Implementation issues . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6 Conclusion 86
6.1 Resume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A Definitions 95
A.1 Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
A.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
B Feature set 98
B.1 The feature set of the KDD Cup 99 data set . . . . . . . . . . 98
D Theorems 105
D.1 Algorithm: Hill climbing . . . . . . . . . . . . . . . . . . . . . 105
D.2 Theorem: Jensen’s inequality . . . . . . . . . . . . . . . . . . 105
D.3 Theorem: The Lagrange method . . . . . . . . . . . . . . . . 106
Abstract
0.1 Preface
This thesis has been written by Koffi Bruno Yao at the Department of Computer
Science of the University of Copenhagen (DIKU). The thesis was written in
the period 19/04/2005 to 01/03/2006 and was supervised by Peter Johansen,
professor at DIKU. I would like to thank my supervisor for his support. The
primary audience of this thesis is researchers in anomaly detection; however,
any reader with an interest in clustering will find the thesis useful. The reader
is expected to have a basic understanding of computer networks and some
basic mathematical knowledge.
Chapter 1
Introduction
1.1 Motivation
It is important for companies to keep their computer systems secure because
their economic activities rely on them. Despite the existence of attack
prevention mechanisms such as firewalls, most company computer networks
still fall victim to attacks. According to the statistics of CERT [44], the
number of reported incidents against computer networks increased from
252 in 1990 to 21,756 in 2000, and to 137,529 in 2003. This happens because
firewalls are misconfigured or because malicious activities are cleverly
designed to circumvent firewall policies. It is therefore crucial to
have another line of defence in order to detect and stop malicious activities.
This line of defence is the intrusion detection system (IDS).
normal or as a specific attack type. The problem with this approach is that
labelling the data is time-consuming. Unsupervised anomaly detection, on the
other hand, operates on unlabeled data. The advantage of using unlabeled
data is that it is easy and inexpensive to obtain. The main
challenge in performing unsupervised anomaly detection is distinguishing
normal data patterns from attack data patterns.
[9] provides a good review of data mining approaches for intrusion detection.
Much work has been done in the area of unsupervised anomaly detection
[7, 4, 6]. In [4], Eskin uses clustering to group normal data; intrusions
are considered to be outliers. Eskin follows a probability-based approach to
outlier detection. In this approach, the data space has an unknown probability
distribution; anomalies are located in the sparse regions of this space
while normal data are found in dense regions.
Chapter 2
Background
Agents generally gather network traffic data by sniffing the network. Sniff-
ing the network involves the agent having access to all the network traffic.
In an Ethernet-based network one computer can play the role of an agent.
Agents generally process the gathered data into a format that is easy for the
detector to use. The detector can use different techniques for the detection
of intrusions. The two main techniques are misuse detection and anomaly
detection. Misuse detection detects attacks by matching the current network
traffic against a database of known attack signatures. Anomaly detection,
on the other hand, finds attacks by identifying traffic patterns that deviate
significantly from the normal traffic.
The data set used in this thesis is an example of data obtained from
network intrusion detection agents. The output of the clustering serves to
define or enrich models used by the detector.
In the next section, we will look at network anomaly detection, which is
the detection technique this thesis focuses on.
because some software fails to check the size of the inputs entered by
users.
• Definition: Partition of a set: Let S be a set and {S_i, i ∈ {1, ..., N}}
be N non-empty subsets of S.
The family of subsets {S_i, i ∈ {1, ..., N}} is a partition of the set S if
and only if:
∀(i, j) ∈ {1, ..., N} × {1, ..., N} with i ≠ j: S_i ∩ S_j = ∅, and ∪_{i=1}^{N} S_i = S.
• Note: In this thesis, the terms data points, data patterns, data items
and data instances all refer to the instances of a data set.
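The partition definition above can be checked mechanically. The following is a minimal sketch (the helper name and its use of Python sets are illustrative, not from the thesis):

```python
# Hypothetical helper illustrating the partition definition above:
# the subsets S_i must be non-empty, pairwise disjoint, and their
# union must equal S.
def is_partition(S, subsets):
    """Return True if `subsets` is a partition of the set S."""
    if any(len(Si) == 0 for Si in subsets):
        return False                     # every S_i must be non-empty
    union = set()
    for Si in subsets:
        if union & Si:                   # overlap => not pairwise disjoint
            return False
        union |= Si
    return union == S                    # the union must cover S exactly

print(is_partition({1, 2, 3, 4}, [{1, 2}, {3}, {4}]))   # True
print(is_partition({1, 2, 3, 4}, [{1, 2}, {2, 3, 4}]))  # False (overlap)
```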
Similarity measures
The definition of the similarity between data items depends on the type
of the data. Two main types of data exist: continuous data and categorical
data¹. Examples of similarity measures for each of these types of data will
be presented in the following.
S(x, x) = 1, ∀x ∈ D (2.4)
S(x, y) = S(y, x), ∀x, y ∈ D (2.5)
Similarity indices can, in principle, be used on arbitrary data types. However,
they are generally used for measuring similarity in categorical data. They
are seldom applied to continuous data because distance measures are more
¹ Sometimes binary data, which is essentially categorical data with two categories, is
considered as a separate category. In this thesis, no distinction is made between categorical
data and binary data.
suitable for continuous data than similarity indices are. Different similarity
indices for binary or categorical data are found in the literature. Here are
three examples of similarity indices. In the following expressions, a is the
number of positive matches, d is the number of negative matches, and b and
c are the numbers of mismatches between two instances A and B.
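The three indices themselves are cut off in this excerpt. The sketch below shows two common binary similarity indices built from the counts a, b, c, d defined above; which three indices the thesis actually presents is an assumption here:

```python
# Common binary similarity indices built from a (positive matches),
# b and c (mismatches) and d (negative matches).
def match_counts(A, B):
    """Count a, b, c, d for two equal-length binary vectors."""
    a = sum(1 for x, y in zip(A, B) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(A, B) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(A, B) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(A, B) if x == 0 and y == 0)
    return a, b, c, d

def simple_matching(A, B):
    # (a + d) / (a + b + c + d): counts negative matches as agreement
    a, b, c, d = match_counts(A, B)
    return (a + d) / (a + b + c + d)

def jaccard(A, B):
    # a / (a + b + c): ignores negative matches
    a, b, c, _ = match_counts(A, B)
    return a / (a + b + c)

A = [1, 1, 0, 0, 1]
B = [1, 0, 0, 1, 1]
print(simple_matching(A, B))  # 0.6
print(jaccard(A, B))          # 0.5
```

Both indices are symmetric and equal 1 on identical vectors, matching properties (2.4) and (2.5) above.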
Objective functions
Objective functions are used by clustering methods that approach the clus-
tering problem as an optimization problem. An objective function defines
the criterion to be optimised by a clustering algorithm in order to obtain an
optimal clustering of the data set. Different objective functions are found in
the clustering literature. Each of them is based on implicit or explicit assumptions
about the data set. A good choice of objective function helps reveal a
meaningful structure in the data set. The most widely used objective function
is the sum of squared-errors. Given a data set D = {x1, x2, ..., xn} and
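The definition is cut off at this point in the excerpt, but the sum of squared-errors can be sketched as follows; it assumes K cluster centres and an assignment of each point to one cluster:

```python
# A minimal sketch of the sum of squared-errors (SSE) objective:
# each point contributes the squared distance to its cluster centre.
def sse(points, centres, assignment):
    """points: list of vectors; assignment[i] = centre index of point i."""
    total = 0.0
    for x, k in zip(points, assignment):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centres[k]))
    return total

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
centres = [(0.5, 0.0), (10.0, 10.0)]
print(sse(points, centres, [0, 0, 1]))  # 0.5
```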
Internal validity: Internal validity only makes use of the data involved
in the clustering to assess the quality of the clustering result. An example of
such data is the proximity matrix: an N × N matrix whose entry (i, j)
represents the similarity between data patterns i and j.
Relative validity:
The purpose of relative clustering validity is to evaluate the partition pro-
duced by a clustering algorithm by comparing it with other partitions pro-
duced by the same algorithm, initialised with different parameters.
External validity is independent of the clustering algorithms used. It is
therefore appropriate for the comparison of different clustering algorithms.
Cluster validation by visualization:
This cluster validation is carried out by evaluating the quality of the clustering
result with the human eye. This requires an appropriate representation
of the clusters so that they are easy to visualize. The approach is impractical
for large data sets and for high-dimensional data: it only works in 2 to 3
dimensions because the human eye cannot visualize higher dimensions.
For visualizing high-dimensional data, the dimension of the data
has to be reduced to 2 or 3. SOM, one of the clustering algorithms we will
study later, is often used as a tool for reducing the dimensions of the data
for visualization. Cluster validation by visualization will not be considered
in this thesis, for two reasons: first, the size of the
data set is large and the dimension of the data is high; second,
the visualization cannot be quantified, and we need to be able to quantify the
quality of the partitions in order to compare the algorithms on this basis.
2. Generate two sub-samples s and t of the data set, each of size α · (the size of
the data set).
2.3 Summary
In this chapter, aspects of network security and clustering relevant for the
rest of the thesis have been introduced. Network intrusion detection has been
briefly presented. Because of the sophistication of network attack techniques
and the weaknesses in attack prevention mechanisms, network intrusion de-
tection systems are important for ensuring the security of computer networks.
The clustering problem has been defined and steps of the clustering process
have been presented. The main steps of the clustering process are: feature
selection, choice of clustering algorithms and cluster validity. In the next
chapter, clustering methods will be discussed more deeply.
Chapter 3
Clustering methods and algorithms
Element  a  b  c  d  e  f
a        0  1  1  3  2  5
b        1  0  2  2  1  4
c        1  2  0  3  2  5
d        3  2  3  0  1  4
e        2  1  2  1  0  3
f        5  4  5  4  3  0

[Figure: dendrogram over the elements a-f built from this distance matrix]
until the entire data set falls into a single cluster. At this point the root of
the dendrogram is known.
Hierarchical divisive clustering uses the dendrogram from the root to
the leaves. HDC starts with a single cluster representing the entire data
set. It then proceeds by iteratively dividing large clusters at the current
level i into smaller clusters at level i + 1. This process stops when each of the
current clusters consists of a single element. At this point the leaves of the
dendrogram are known.
The following are the main steps by which HAC organizes the data
instances into a hierarchy of clusters. How HDC proceeds can trivially be
deduced from the steps of HAC.
1. Compute the distance between all the items and store them in a dis-
tance matrix.
2. Identify and merge the two most similar clusters.
3. Update the distance matrix by computing the distance between the
new cluster and all the other clusters.
4. Repeat steps 2 and 3 until the desired number of clusters is obtained or
until all the items fall in a single cluster.
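The four steps above can be sketched in code as follows. Single linkage is assumed for the inter-cluster distance; it is only one of the possible choices, not necessarily the thesis's default:

```python
# Minimal hierarchical agglomerative clustering (HAC) sketch,
# using single linkage as the inter-cluster distance (an assumption).
def hac(dist, n_clusters):
    """dist: symmetric matrix of pairwise distances; returns clusters
    as lists of item indices."""
    clusters = [[i] for i in range(len(dist))]   # step 1: one cluster per item
    while len(clusters) > n_clusters:
        # step 2: find the two closest clusters (single linkage)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # steps 3-4: merge and repeat until the desired number of clusters
        clusters[i] += clusters.pop(j)
    return clusters

# The 6x6 distance matrix over the elements a..f from the table above:
D = [[0, 1, 1, 3, 2, 5],
     [1, 0, 2, 2, 1, 4],
     [1, 2, 0, 3, 2, 5],
     [3, 2, 3, 0, 1, 4],
     [2, 1, 2, 1, 0, 3],
     [5, 4, 5, 4, 3, 0]]
print(hac(D, 2))  # f (index 5) ends up alone; a..e merge together
```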
In order to merge clusters, the distance between pairs of clusters needs to be
computed. Below are some examples of inter-cluster distances.
CHAPTER 3. CLUSTERING METHODS AND ALGORITHMS 23
• Average distance:
the average of the distances over all pairs of elements (p1 ∈ C1, p2 ∈ C2).

Hierarchical clustering requires O(N²) for creating a partition. Because of its
high computation time, hierarchical clustering is not suitable for clustering
large data sets.
Hierarchical clustering algorithms do not aim at maximizing a global
objective function. At each step of the clustering process, they make local
decisions in order to find the best way of clustering the data.
In this section, hierarchical clustering has been briefly discussed. Hierar-
chical clustering is impractical for large data sets. The next section is about
partitioning clustering.
1. Initialisation
(a) Specify the number of clusters and assign arbitrarily each instance
of the data set to a cluster
(b) Compute the centre for each cluster
2. Iterations: Repeat steps 2a, 2b and 2c until the difference of two con-
secutive iterations is below a specified threshold.
Proof: It is clear that f(x0) ≥ f(x1) ≥ ... ≥ f(xt). The stopping
criterion, xt = xt−1, is satisfied exactly at the point where f(xt) = f(xt−1). This
means that the inequalities that hold before the stopping criterion is met
are all strict, so the algorithm makes progress at every step. It stops at some
point in time because D is a finite set. The convergence is local, not globally
optimal, because the algorithm operates locally; only a subset of the solution
space is investigated.
The kmeans-algorithm
Kmeans is an iterative clustering algorithm that moves items among clusters
until a specified convergence criterion is met. Convergence is reached
when only very small changes are observed between two consecutive iterations.
The convergence criterion can be expressed in terms of the sum of
squared-errors, but it does not need to be.
Algorithm: kmeans-algorithm
Input: A data set D of size N and the number of clusters K,
Output: a set of K clusters with minimal sum of squared-error.
1. Randomly choose K instances from D as the initial cluster centres;
2. Assign each instance to the cluster whose centre it is closest to;
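The remaining steps of the algorithm are cut off in this excerpt; the sketch below completes them with the standard kmeans loop (recompute each centre as the mean of its cluster, repeat until nothing changes), which is an assumption about the exact formulation used here:

```python
# Minimal kmeans sketch following the algorithm above.
import random

def kmeans(D, K, max_iter=100, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(D, K)               # step 1: random initial centres
    clusters = []
    for _ in range(max_iter):
        # step 2: assign each instance to its closest centre
        clusters = [[] for _ in range(K)]
        for x in D:
            k = min(range(K),
                    key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centres[k])))
            clusters[k].append(x)
        # step 3 (assumed): recompute each centre as the mean of its cluster
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centres[k]
               for k, cl in enumerate(clusters)]
        if new == centres:                   # convergence: centres unchanged
            break
        centres = new
    return centres, clusters

D = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centres, clusters = kmeans(D, 2)
print(sorted(centres))  # [(0.0, 0.5), (10.0, 10.5)]
```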
Figure 3.2 shows that the sum of squared-errors decreases very slowly
after the 10th iteration. This indicates that convergence of the kmeans is
reached around the 10th iteration.
• Probabilistic clustering
[Figure 3.2: Variation of the sum of squared-errors (SSE) with the iterations of kmeans ("sse.dat"); the SSE falls from about 60000 to about 25000 over 40 iterations.]
M(D|Θ) = ∏_{n=1}^{N} p(x_n|Θ) = λ(Θ|D), (3.5)

p(x|Θ_k) and α_k are respectively the density function and the mixture
proportion of the k-th mixture component.
For the experiments in this thesis, the model used is the mixture
of isotropic gaussians. This model is also known as the mixture
of spherical gaussians. In this model, each component of the mix-
ture is a spherical gaussian. The mixture of isotropic gaussians
has been chosen because of its simplicity, efficiency and scalability
to higher dimension.
The EM algorithm is a general method used for estimating the
parameters of the mixture model. It is an iterative procedure that
consists of two steps: the expectation step and the maximization
step. The expectation step is commonly called the E-step and the
maximization step is called the M-step. The E-step estimates the
extent to which instances belong to clusters. The M-step computes
the new parameters of the model on the basis of the estimates of
the E-step. In the case of the mixture of isotropic gaussians, the
model parameters are the means, standard deviations and the
weights of the clusters. This step is called the maximization step
because it finds the values of the parameters that maximize the
likelihood function.
The E and M steps are repeated until convergence of the parame-
ters is reached. Convergence is reached when the parameter values
of two consecutive iterations get very close. At the end of the it-
erations, a partitioning of the data set is obtained by assigning
each data instance to the cluster to which the instance has high-
est membership degree. This way of assigning instances to clusters
is called the maximum a posteriori (MAP) assignment. MAP as-
signment gives a crisp or hard clustering of the data set. A soft
clustering -also called fuzzy clustering- can be obtained by using
the cluster membership degrees computed in the E-step.
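The E and M steps for the mixture of isotropic gaussians, followed by the MAP assignment, can be sketched as follows. The deterministic initialisation is an assumption of this sketch (the thesis's exact initialisation is not shown in this excerpt):

```python
# Minimal EM sketch for a mixture of isotropic (spherical) gaussians.
import math

def em_isotropic(X, K, iters=50):
    d = len(X[0])
    # deterministic init (an assumption): spread initial means over the data
    mu = [list(X[(i * len(X)) // K]) for i in range(K)]
    sigma = [1.0] * K                    # standard deviations
    alpha = [1.0 / K] * K                # mixture proportions
    R = []
    for _ in range(iters):
        # E-step: responsibility P(k|x_n) of each component for each point
        R = []
        for x in X:
            s = [alpha[k]
                 * math.exp(-sum((xi - mi) ** 2 for xi, mi in zip(x, mu[k]))
                            / (2 * sigma[k] ** 2))
                 / (math.sqrt(2 * math.pi) * sigma[k]) ** d
                 for k in range(K)]
            tot = sum(s)
            R.append([sk / tot for sk in s])
        # M-step: re-estimate means, standard deviations and weights
        for k in range(K):
            w = sum(r[k] for r in R)
            mu[k] = [sum(r[k] * x[i] for r, x in zip(R, X)) / w for i in range(d)]
            var = sum(r[k] * sum((xi - mi) ** 2 for xi, mi in zip(x, mu[k]))
                      for r, x in zip(R, X)) / (d * w)
            sigma[k] = max(math.sqrt(var), 1e-6)   # floor to avoid collapse
            alpha[k] = w / len(X)
    # MAP assignment: each point goes to its highest-responsibility cluster
    labels = [max(range(K), key=lambda k: r[k]) for r in R]
    return mu, sigma, alpha, labels

X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
     (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
mu, sigma, alpha, labels = em_isotropic(X, 2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Using the full responsibility matrix `R` instead of `labels` would give the soft (fuzzy) clustering mentioned above.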
4. M-step: Re-estimation of the parameter set Θ of the model:

µ_k^(new) = ( ∑_n P(k|x_n, Θ) x_n ) / ( ∑_n P(k|x_n, Θ) ),

σ_k^(new) = sqrt( (1/d) · ( ∑_n P(k|x_n, Θ) ||x_n − µ_k^(new)||² ) / ( ∑_n P(k|x_n, Θ) ) ),

α_k^(new) = (1/N) ∑_n P(k|x_n, Θ).
δ(Θ|D) ≥ ∑_n ∑_k P^(t)(k|x_n, Θ) log( s(k, n) / P^(t)(k|x_n, Θ) ) = B_t(Θ). (3.8)
From the E-step, the second term of the right side of this expression
is known; therefore the maximization of B_t(Θ) reduces to the
maximization of
b_t(Θ) = ∑_n ∑_k P^(t)(k|x_n, Θ) log(s(k, n)). (3.10)
new function:

f_t(Θ) = b_t(Θ) + λ ( ∑_{k=1}^{K} α_k − 1 ), (3.14)
where

s(k, n) = α_k γ(x_n|Θ_k) = α_k · ( 1 / (√(2π) σ_k^(t))^d ) · exp( −||x_n − µ_k^(t)||² / (2 (σ_k^(t))²) ). (3.16)
Which gives:

α_k = − ( ∑_{n=1}^{N} P^(t)(k|x_n, Θ) ) / λ. (3.18)
By taking into account the constraint ∑_{k=1}^{K} α_k = 1, we get:

1 = ∑_{k=1}^{K} α_k = − ( ∑_{k=1}^{K} ∑_{n=1}^{N} P^(t)(k|x_n, Θ) ) / λ. (3.19)
Since ∑_k P^(t)(k|x_n, Θ) = 1 for each n, the double sum equals N, which means

λ = −N. (3.21)
Replacing λ by its value in equation 3.18 gives the estimate of
the mixing probability:

α_k^(t+1) = (1/N) ∑_n P^(t)(k|x_n, Θ). (3.22)
1. Initialisation
Start from an initial partition P 0 of the data set.
Figure 3.3: Variation of the log-likelihood with the iterations of the classifi-
cation maximum likelihood
The two previous examples are examples of model-based clustering that
use a probabilistic approach. The next method, which is also an example
of model-based clustering, uses an artificial neural network (ANN) approach.
ANNs are used for both classification and clustering. They can be
competitive or non-competitive. In competitive learning, the output nodes
compete and only one of them wins. A commonly used competitive
approach for clustering is the self-organizing map (SOM). The term self-organizing
refers to the ability of the nodes of the network to organize
themselves into clusters.
A SOM is represented by a single-layered neural network in which each
output node is connected to all input nodes. This is illustrated in figure
3.4. When an input vector is presented to the input layer, only a single
output node is activated. This activated node is called the winner.
When the winner has been identified, its weights are adjusted. At the
end of the learning process, similar items become associated with the same
output node. The most popular examples of SOM are the Kohonen
self-organizing maps [47].
[Figure 3.4: A SOM network; the input nodes X1, X2, X3 are each connected to all output nodes.]
4. Decrease the learning rate and reduce the size of the neighbour-
hood of output nodes.
Initialisation of the SOM algorithm: The weights of the network can
be initialised randomly. But with random initialisation, some of the
output nodes may never win the competition. This problem can be
avoided by randomly choosing instances of the data set as the initial
values of the weights.
Choice of distance measure: The dot product and the euclidean
distance are commonly used as distance measures. The dot product is
used in situations where the input patterns and the network weights
are normalized.
Learning rate: The learning rate controls the amount by which the
weights of the winner node and those of its neighbours are adjusted. The
initial learning rate is specified at initialisation; it then decreases
as the number of iterations increases. Decreasing the learning rate
ensures that the learning process stops at some point in time. This
is important because the convergence criterion is usually defined in
terms of very small changes in the weights between two consecutive iterations;
competitive learning gives no guarantee that this convergence
criterion will eventually be satisfied.
Defining the neighbourhood: Initially, the neighbourhood is set to
a large value which then decreases with the iterations. This corresponds
to assigning instances to nodes with more precision as the number of
iterations increases.
The time complexity of SOM is O(M ∗ N), where M is the size of the
grid and N is the size of the data set. The justification of this time
complexity is the following: during training, each iteration performs a
number of operations (finding the winner and updating the neighbourhood)
that is at most twice the size of the grid, and the maximum number
of iterations is equal to the size of the data set. So the time complexity
of the training is O(M ∗ N). As the assignment only takes O(N), this
gives a total complexity of O(M ∗ N).
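The training procedure described above can be sketched as follows. The 1-D grid topology, the rectangular neighbourhood, and the decay schedules are common choices assumed for this sketch, not necessarily the thesis's exact settings:

```python
# Minimal 1-D Kohonen SOM sketch: winner search, neighbourhood update,
# decaying learning rate and neighbourhood radius.
import random

def train_som(X, grid_size, epochs=20, lr0=0.5, seed=0):
    rng = random.Random(seed)
    d = len(X[0])
    # initialise weights with randomly chosen data instances (see text)
    weights = [list(rng.choice(X)) for _ in range(grid_size)]
    radius0 = grid_size / 2
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)              # decreasing learning rate
        radius = max(1, radius0 * (1 - t / epochs))  # shrinking neighbourhood
        for x in X:
            # find the winner: the node closest to the input vector
            win = min(range(grid_size),
                      key=lambda j: sum((xi - wi) ** 2
                                        for xi, wi in zip(x, weights[j])))
            # update the winner and its neighbours on the grid
            for j in range(grid_size):
                if abs(j - win) <= radius:
                    for i in range(d):
                        weights[j][i] += lr * (x[i] - weights[j][i])
    return weights

def assign(X, weights):
    """Map each item to its closest output node."""
    return [min(range(len(weights)),
                key=lambda j: sum((xi - wi) ** 2
                                  for xi, wi in zip(x, weights[j])))
            for x in X]

X = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
W = train_som(X, 4)
labels = assign(X, W)
```

After training, items from the two separated groups in `X` map to different output nodes, illustrating the topology preservation discussed below.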
One of the main strengths of SOM is its ability to preserve the topology
of the input data: items that are close to each other in the input space
remain close in the output space. This makes SOM a valuable tool for
visualizing high-dimensional data in low dimensions. SOM also supports
parallel processing, which can speed up the learning process.
Some of the limitations of the Kohonen SOM are: it is most appropriate
for detecting hyperspherical clusters, and the choice of initial parameter
values - the initial weights of the connections, the learning rate, and the
the experiment as the data to be used does not capture the spatial
relationship between items.
1. Initialisation
Choose a threshold, and initialise the first cluster centre µ0.
Generally the first item of the data set is chosen.
other hand, each instance belongs to more than one cluster with some
degree of membership. The degree of membership of a data instance
x_i to a cluster C_k is a real value z_ik ∈ [0, 1], where ∑_k z_ik = 1.
b is called the fuzzifier and it controls the degree of fuzziness. When the
fuzzifier b is close to 1, the clustering tends to be crisp; when the
fuzzifier b becomes very large, the degrees of membership approach
1/K, which means each data instance is a member of all the clusters to
the same degree. Generally, the value of the fuzzifier b is chosen to be
2.
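A minimal fuzzy kmeans iteration can be sketched as follows. The membership update is the standard fuzzy c-means formula, assumed here since it is not shown in this excerpt, and the deterministic initialisation is also an assumption of the sketch:

```python
# Minimal fuzzy kmeans sketch: alternate membership and centre updates.
def fuzzy_kmeans(X, K, b=2.0, iters=50):
    d = len(X[0])
    # deterministic init (assumption): spread initial centres over the data
    centres = [list(X[(i * len(X)) // K]) for i in range(K)]
    Z = []
    for _ in range(iters):
        # membership update: z_ik = 1 / sum_j (d_ik / d_jk)^(2/(b-1))
        Z = []
        for x in X:
            dist = [max(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) ** 0.5, 1e-12)
                    for c in centres]
            Z.append([1.0 / sum((dist[k] / dist[j]) ** (2 / (b - 1))
                                for j in range(K))
                      for k in range(K)])
        # centre update: mu_k = sum_i z_ik^b x_i / sum_i z_ik^b
        for k in range(K):
            w = sum(z[k] ** b for z in Z)
            centres[k] = [sum(z[k] ** b * x[i] for z, x in zip(Z, X)) / w
                          for i in range(d)]
    return centres, Z

X = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centres, Z = fuzzy_kmeans(X, 2)
```

With the default fuzzifier b = 2, each row of the membership matrix `Z` sums to 1, and well-separated points get a membership close to 1 in their own cluster.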
Figure 3.6: Variation of the fuzzy sum of squared-errors in the fuzzy kmeans
algorithm
Finding the formula for the means of the clusters is simple because no
constraints have to be satisfied. The formula is obtained by differentiating Q
with respect to µ_k and setting the derivative to zero:

∂Q/∂µ_k = −2 ∑_{i=1}^{N} z_ik^b (x_i − µ_k) = 0, (3.34)

which gives:

µ_k = ( ∑_{i=1}^{N} z_ik^b x_i ) / ( ∑_{i=1}^{N} z_ik^b ). (3.35)
Figure 3.6 shows how the fuzzy sum of squared-errors varies with the
number of iterations of fuzzy kmeans. The figure shows that the
fuzzy sum of squared-errors decreases very slowly after the 11th iteration,
which indicates that convergence of the fuzzy kmeans is reached
around the 11th iteration.
The purpose of the proof is to show that the minimization of the SSE
is equivalent to the maximization of a special case of the classification
likelihood criterion. This special case corresponds to the situation where
the model is a mixture of isotropic gaussians with identical standard
deviations and identical mixture proportions; in this special case
CEM aims at finding clusters that are spheres of the same size. The
expression of the classification likelihood criterion, shown earlier, is as
follows:
κ(Θ|D) = ∑_{k=1}^{K} ∑_{x_ik ∈ C_k} log( α_k p(x_ik|µ_k, σ_k) ), where C_k is the k-th
cluster and µ_k, σ_k, α_k are respectively its mean, standard deviation and
mixture proportion.
In the case where the mixture proportions and standard deviations are
identical for all the clusters, we have:
α_k = 1/K and σ_k = σ for all k, 1 ≤ k ≤ K. So,

κ(Θ|D) = ∑_{k=1}^{K} ∑_{x_ik ∈ C_k} log( (1/K) p(x_ik|µ_k, σ) ), (3.36)
where R is a constant.
Using the expression of the isotropic gaussian, that is:

p(x_ik|µ_k, σ) = ( 1 / (√(2π) σ)^d ) exp( −||x_ik − µ_k||² / (2σ²) ),

we get:
κ(Θ|D) = ∑_{k=1}^{K} ∑_{x_ik ∈ C_k} ( (−1/(2σ²)) ||x_ik − µ_k||² − d log(√(2π) σ) ) + R. (3.38)
[Figure: Classification accuracy (×100) as a function of the number of basic clusters ("accuracy2levels.dat"); the accuracy lies between about 92 and 96 for up to 800 basic clusters.]
[Figure: Variation of the SSE with the number of clusters for kmeans ("kmeansVariationOfSSE.dat"); the SSE falls from about 22000 to below 4000 as the number of clusters grows to 400.]
3.5 Summary
In this chapter, different clustering methods have been discussed. A
distinction has been made between clustering methods and clustering
algorithms. A clustering method defines the general concept and theory
the clustering is based on, while a clustering algorithm is a particular
implementation of a clustering method. Examples of each of the con-
sidered clustering methods have been discussed. Most of the classical
clustering algorithms considered in this thesis approach the clustering
problem as an optimisation problem. They aim at optimising a global
objective function. They make use of an iterative process to solve the
problem. Another group of algorithms does not approach the clustering
problem as an optimisation problem: they view clusters as dense regions
in the data space and identify clusters by merging small units of
dense regions. This is the case for density-based clustering and grid-based
clustering. A clustering architecture inspired by some properties
of the clustering algorithms that use an optimisation approach and of the
clustering algorithms that construct clusters by making local decisions
has been proposed. This architecture takes into consideration the
characteristics of the data set at hand. The discussed clustering algorithms,
with the exception of DBSCAN, DENCLUE and STING, are
used for our experiments, which are discussed in the next chapter.
Chapter 4
Experiments
This chapter describes and discusses the design and the execution of
the experiments. This discussion is important in order to understand
and explain the aspects of the experiments that have an impact on the
performance of the clustering algorithms.
(IDS) [17, 41]. Currently the DARPA data set is the most widely used
data set for testing IDSs.
The DARPA project was prepared and executed by the Massachusetts
Institute of Technology (MIT) Lincoln Laboratory. MIT Lincoln Labs
set up an environment on a local area network that simulated a military
network under intensive attack. The simulated network consisted
of hundreds of users on thousands of hosts. Working in a simulated
environment made it possible for the experimenters to have complete
control over the data generation process. The experiment was carried
out over nine weeks, and raw network traffic data, also called raw
tcpdump data, was collected during this period.
The raw tcpdump data was then processed into the connection records
used in the KDD Cup 99 data set. The KDD Cup 99 data set contains
a rich variety of computer attacks. The full size of the KDD Cup 99 is
about five million network connection records. Each connection record
is described by 41 features and is labelled either as normal or as a
specific network attack. One of the reasons for choosing this data set
is that it is standard, which makes it easy to compare the
results of our work with other similar works. Another reason is that
it is difficult to get another data set which contains as rich a variety of
attacks as the one used here.
Some criticisms have been made about the generation of the DARPA
data set. One of the strongest criticisms was made by J. McHugh in
[42]. The network traffic generated in the DARPA data set has two
components: the background traffic data, which consists of network
traffic generated during the normal usage of the network, and the
attack data. According to McHugh, the generation process of the
background traffic data has not been described explicitly by the
experimenters. Therefore, there is no direct evidence that the background
traffic matches the normal usage pattern of the network to be
simulated. He made similar criticisms about the generation of the attack
data: the intensive attacks the network has been submitted to do not
reflect a real-world attack scenario.
Although some of these criticisms are important and can be useful for
future generations of off-line intrusion evaluation data sets, the DARPA
data set has many strengths which still make it the best publicly available
In order to construct the feature set, the raw tcpdump data has been
pre-processed into connection records. The basic features are directly
obtained from the connection records. The derived features fall into
two groups: the content-based features and the traffic based features.
Content-based features are used for the description of attacks that are
embedded in the data portion of the IP packet. The description of
these types of attacks requires some domain knowledge and cannot be
done only on the basis of information available in the packet header.
Most of these attacks are R2L and U2R attacks. Traffic based features
have been computed automatically; they are effective for the detection
of DOS and probe attacks. The different types of attack contained in
the data set are described in appendix C.
In order to derive the traffic features, Stolfo et al. made use of an
algorithm that identifies frequent sequential patterns. The algorithm
takes the network connection records, described by the current basic
features, as input and computes the frequent sequential patterns. The
frequent episodes algorithm is executed on two different data sets: an
intrusion-free data set and a data set with intrusions. Then these two
results are compared in order to identify intrusion-patterns.
The derived features are constructed on the basis of patterns that only
appear in intrusion data records. Therefore, they are able to discriminate
between normal and intrusion connection records. Although experience
shows that the feature set considered here discriminates well
between normal and intrusive patterns, it has some limitations when it
is used for anomaly detection. Because the feature set has been derived
on the basis of the intrusions in the training data set, the derived feature
set cannot describe attacks not included in the training data set. The
feature set is, therefore, more suitable for misuse detection than for
anomaly detection. Another limitation of the feature set is that it may
not discriminate well between normal data and attacks embedded in
the data portion of the data packet. The reason for this is that the
feature set has been constructed primarily on the basis of information
The advantage of the linear scale compared to the other two scaling
schemes is its simplicity. Furthermore, the linear scale normalizes the
feature values. For these reasons, the linear scale has been used for
scaling the feature values.
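The exact linear scheme used is not shown in this excerpt; min-max scaling, assumed below, is the common linear scaling that also normalizes the values to [0, 1]:

```python
# Hypothetical sketch of a linear (min-max) scaling of one feature column
# to [0, 1]; the thesis's exact linear scale is an assumption here.
def linear_scale(column):
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)      # constant feature: map to 0
    return [(v - lo) / (hi - lo) for v in column]

print(linear_scale([0, 5, 10]))  # [0.0, 0.5, 1.0]
```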
This section describes how the data set is used for the experiments.
A 10% version of the KDD Cup data set is also available at [16]; we
use this version. The 10% version of the KDD Cup data set contains
the same attack labels as the full version.
It has been constructed by selecting, from the original data set, 10% of
each of the most frequent attack categories and by keeping the smaller
attack categories unchanged. The advantage of using this version of the
data set is that it is smaller and therefore faster to process.
Working with the original data set would have made the execution of
About eighty percent of the data are attacks, most of them DOS attacks: neptune and smurf. A large percentage of this data consists of duplicates. In order to reduce the size of the data set, we keep only a small percentage of the smurf and neptune records. This new distribution of attack and normal labels is closer to a real-life scenario. Most research in unsupervised anomaly detection makes some assumptions about the data set; without such assumptions the task of unsupervised anomaly detection is not possible. The subset selected for the experiments consists of 10% attacks and 90% normal data. Table 4.1 shows the distribution of attack categories for this data set.
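A minimal sketch of this subsampling step; the label set, sampling rate, and random seed are illustrative assumptions:

```python
import random

def downsample(records, heavy_labels, rate, seed=0):
    """Keep every record whose label is not in heavy_labels;
    keep each heavy-label record (e.g. smurf, neptune) only
    with probability `rate`."""
    rng = random.Random(seed)
    return [(label, features) for label, features in records
            if label not in heavy_labels or rng.random() < rate]

# Toy example: 10 normal records and 100 smurf records.
records = [("normal", i) for i in range(10)] + [("smurf", i) for i in range(100)]
subset = downsample(records, {"smurf", "neptune"}, rate=0.1, seed=1)
```

With an appropriately chosen rate, this brings the attack proportion down toward the 10% attacks / 90% normal split used in the experiments.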
For each of the 10 phases in the ten-fold cross validation, each clustering algorithm is run 3 times. We proceed in this way because most of the algorithms are randomly initialised and the result of clustering depends on the initial values. As mentioned above, the instances of the data set are labelled, either as normal or as a specific attack category. The labels are not used during clustering; they are only used during the evaluation of the clustering algorithms.
For each of the clustering algorithms, various tests have been performed in order to select the best parameter values, and the experiments have been performed with the best values identified.
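The evaluation protocol just described (ten folds, three runs per algorithm, labels withheld from the clustering step) can be sketched as follows; the function names and the scoring callback are illustrative assumptions:

```python
import random

def evaluate(data, cluster_fn, score_fn, n_folds=10, n_runs=3, seed=0):
    """Ten-fold protocol sketch: for each fold, run the clustering
    algorithm n_runs times with different random initialisations and
    average the scores.  `data` is a list of (label, features) pairs;
    labels are withheld from cluster_fn and used only by score_fn."""
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    scores = []
    for fold in folds:
        points = [data[i][1] for i in fold]   # feature vectors only
        labels = [data[i][0] for i in fold]   # kept aside for scoring
        for run in range(n_runs):
            assignment = cluster_fn(points, seed=run)
            scores.append(score_fn(assignment, labels))
    return sum(scores) / len(scores)
```

Averaging over the three differently seeded runs per fold smooths out the dependence of the result on the random initialisation.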
4.4 Summary
This chapter has covered the design and execution of our experiments.
Special attention has been paid to the data set and feature set used.
– The data set used is a slightly modified version of the KDD data
set. The feature values have been scaled and normalized using a
linear scale. The categorical feature values have been transformed
to numeric values using a frequency encoding.
– For each of the clustering algorithms, different tests have been performed in order to choose the best set of parameters.
– The limitations of the feature set for unsupervised anomaly detection have been discussed. Firstly, the algorithm used for the construction of the features relies on the existence of an attack-free data set, but the difficulty of obtaining an attack-free data set is the main motivation for performing unsupervised anomaly detection; for this purpose we therefore need some other method to compute the feature set. Secondly, for anomaly detection it is the normal traffic patterns we want to describe, not the attacks, so it would be more appropriate to construct features that describe the normal patterns rather than the attacks.
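The frequency encoding mentioned in the summary can be sketched as follows, assuming it maps each categorical value to its relative frequency in the data (the exact encoding used is not spelled out here):

```python
from collections import Counter

def frequency_encode(values):
    """Replace each categorical value by its relative frequency."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

# Example with a protocol-type feature:
print(frequency_encode(["tcp", "tcp", "udp", "tcp"]))
# -> [0.75, 0.75, 0.25, 0.75]
```

The result is already in [0, 1], so it combines directly with the linearly scaled numeric features.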
Chapter 5
Evaluation of clustering methods
The cluster entropy has been introduced in [37]. This measure captures the homogeneity of the clusters: clusters that contain data from many different attack classes have a high entropy, while clusters that contain only a few attack classes have a low entropy, close to zero. The overall cluster entropy is the weighted mean of the cluster entropies.
Classification accuracy
Cluster entropy
The entropy at the cluster level captures the homogeneity of the cluster. The entropy of a cluster is defined as:

$E_{cluster_i} = -\sum_j \frac{N_{ji}}{n_i} \log\frac{N_{ji}}{n_i}$

where $n_i$ is the size of the $i$-th cluster and $N_{ji}$ is the number of instances of cluster $i$ which belong to class label $j$.
The overall cluster entropy is the weighted sum of the cluster entropies:

$E_{cluster} = \sum_i \frac{n_i}{N} E_{cluster_i}$

where $N$ is the total size of the data set and $n_i$ is the number of instances in cluster $i$.
The cluster entropy is lowest when each cluster consists of a single data type and highest when the proportion of each data category in the clusters is the same.
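A minimal implementation of the two entropy formulas above, with each cluster represented simply as the list of class labels of its members:

```python
import math
from collections import Counter

def cluster_entropy(labels_in_cluster):
    """Entropy of one cluster: -sum_j (N_ji/n_i) * log(N_ji/n_i)."""
    n = len(labels_in_cluster)
    counts = Counter(labels_in_cluster)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def overall_entropy(clusters):
    """Weighted mean of the cluster entropies: sum_i (n_i/N) * E_i."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# A pure cluster has entropy 0; a 50/50 mix of two labels has entropy log(2).
```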
CHAPTER 5. EVALUATION OF CLUSTERING METHODS 76
When the number of clusters is 23, the figures 5.1, 5.3 and 5.2 show
that, the two-level clustering(KHAC) and SOM or kmeans, initialised
with the clustering results of the leader clustering algorithm give the
CHAPTER 5. EVALUATION OF CLUSTERING METHODS 78
When the number of clusters is 49, figures 5.4, 5.6 and 5.5 show that KHAC, leader clustering, and the combination of leader clustering with any of the other algorithms (except EM) have the best classification accuracies, and the clusters found by these algorithms represent a larger variety of attack categories. Although initialising any of the other algorithms with the leader clustering improves the performance of that algorithm, these combinations do not perform significantly better than the leader clustering alone. This holds for the classification accuracies, the cluster entropies, and the number of cluster categories. Because most of the studied clustering algorithms except leader clustering are slow, using leader clustering alone seems more appropriate than using any of the other algorithms, either alone or in combination with the leader clustering algorithm.
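The leader clustering algorithm discussed above can be sketched as a single pass over the data; the Euclidean distance and the threshold parameter are assumptions of this sketch:

```python
import math

def leader_clustering(points, threshold):
    """Single-pass leader clustering: assign each point to the first
    leader within `threshold` (Euclidean distance); otherwise the
    point becomes a new leader.  Returns (leaders, assignment)."""
    leaders = []
    assignment = []
    for p in points:
        for idx, leader in enumerate(leaders):
            if math.dist(p, leader) <= threshold:
                assignment.append(idx)
                break
        else:  # no leader close enough: start a new cluster
            leaders.append(p)
            assignment.append(len(leaders) - 1)
    return leaders, assignment
```

Because each point is compared only against the current leaders in one pass, the cost is O(nk) for n points and k leaders, which is what makes the algorithm fast and scalable to large data sets.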
The homogeneity of the clusters produced by kmeans is slightly better than that of any of the other algorithms. The homogeneity of the clusters produced by fuzzy kmeans, EM-based clustering and CEM clustering is poorer than that of the other algorithms.
For both 23 and 49 clusters, each of the clustering algorithms outperforms random clustering.
The performance of the EM-based clustering algorithm is not impressive.
Some of the conclusions that can be drawn from these results are the following:
The good performance of KHAC can also be related to the fact that it is the only one of the studied algorithms able to detect clusters of arbitrary shape. The penalty for incorrectly approximating the shape of clusters is higher for large clusters than for small clusters, which could explain why KHAC performs well when the number of clusters is low.
The EM-based algorithm did not produce good results compared to most of the other algorithms. This was surprising, because most of the other algorithms can be explained as special cases of EM-based clustering. One possible explanation for the poor performance of EM-based clustering may be that the mixture of isotropic gaussians does not match the underlying model of the data. But this explanation does not seem to hold, because the classification EM clustering, which also assumes that the components of the model are non-overlapping isotropic gaussians, gives better results. Nor could we relate the poor performance of EM-based clustering to its assumption of overlapping clusters, because the fuzzy kmeans algorithm, which makes the same assumption, performs much better.
We conclude that the poor performance of EM-based clustering is related to some parameters of the algorithm that may not have been chosen correctly. For example, the numbers of clusters considered in our experiments may not be optimal for the EM-based clustering. Alternatively, it may simply be that this clustering algorithm is not appropriate for this task. EM-based clustering is also less attractive for this task because of its high computation time.
Figure 5.2: The number of different cluster categories found by the algorithms
when the number of clusters is 23. The total number of labels contained in
the data set is 23.
Figure 5.3: The cluster entropies when the number of clusters is 23. The cluster entropy measures the homogeneity of the clusters: the lower the cluster entropy, the more homogeneous the clusters are.
Figure 5.5: The number of different cluster categories found by the algorithms
when the number of clusters is 49. The total number of labels contained in
the data set is 23.
Figure 5.6: The cluster entropies when the number of clusters is 49.
Chapter 6
Conclusion
6.1 Resume
In this thesis, we have:
On the basis of our results, we can say that clustering can be successfully used for unsupervised anomaly detection. Some of the clustering algorithms are more appropriate for this task than others. We investigated the potential of the leader clustering algorithm: it is very simple and fast, and it produces good clustering results compared to most of the other studied algorithms. When leader clustering is used to initialise the other clustering algorithms included in this thesis, the clustering results of these algorithms improve significantly.
6.2 Achievements
The main goal of the thesis has been to investigate the efficiency of dif-
ferent classical clustering algorithms in clustering network traffic data
for unsupervised anomaly detection. The clusters obtained by cluster-
ing the network traffic data set are intended to be used by a security
expert for manual labelling. A second goal has been to study some
possible ways of combining these algorithms in order to improve their
performance. We can say that these goals have been achieved. The results of our experiments have given us an indication of which clustering algorithms are suitable for this task and which are less so. Furthermore, we have studied ways of combining clustering ideas in order to solve the problem efficiently. We have found that, when the number of clusters is low, KHAC, a combination of clustering concepts that we have proposed, produces better results than most of the other studied algorithms. Our data shows the potential of the leader clustering algorithm for this task. Clustering algorithms similar to the leader clustering algorithm have been successfully used in some earlier works [6, 30] for clustering network traffic data.
The reasons for using this particular algorithm have not been explicitly stated in those works. In conclusion, we can say that leader clustering is to be preferred, not only because it is fast but also because it performs better than most of the other clustering algorithms. Leader-like clustering algorithms could therefore be investigated further in future research on unsupervised anomaly detection. What makes them especially attractive is their scalability to large data sets. And KHAC seems attractive when the number of clusters is low.
6.3 Limitations
One limitation of this thesis is that it has not been possible to validate the conclusions of the experiments against a real-life data set, because of the difficulty of acquiring such a data set.
Bibliography
[10] S.B. Kotsiantis and P.E. Pintelas. Recent Advances in Clustering: A Brief Review.
[11] W. Lee, S. Stolfo, and K. Mok. Mining in a data-flow environment: Experience in network intrusion detection. In Proc. 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 114-124, 1999.
[12] J. He, A.H. Tan, C.L. Tan, and S.Y. Sung. On Quantitative Evaluation
of Clustering Systems. In W.Wu, H. Xiong and S. Shekhar Clustering and
Information retrieval(pp. 105-133), Kluwer Academic Publishers, 2004.
[13] K. Kendall. A database of Computer Attacks for the Evaluation of Intrusion
Detection Systems, Master thesis, Massachusetts Institute of Technology,
1999.
[14] W. Lee and S.J. Stolfo, A Framework for constructing features and models for
intrusion detection systems, ACM Transactions on Information and System
Security, Vol.3 No.4, November 2000, pages 227-261.
[15] The internet traffic archive ( 2000): http://ita.ee.lbl.gov
[16] KDD cup 99. Located at: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[17] DARPA. Located at: http://www.ll.mit.edu/IST/ideval/
[18] D.P. Mercer. Clustering large datasets, October 2003.
[19] I. Costa, F. de Carvalho, and Marcilio C.P. de Souto. Comparative analysis of clustering methods for gene expression time course data.
[20] Boris Mirkin. Mathematical Classification and Clustering, Kluwer Academic Publishers, 1996.
[21] Robert Rosenthal and Ralph L. Rosnow, Essentials of Behavioral Research,
Methods and Data Analysis, second edition, 1991.
[22] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[23] Anil K. Jain, Richard C. Dubes. Algorithms for clustering data, Prentice
Hall, 1988.
[24] Giles Celeux and Gerard Govaert. A classification EM algorithm for cluster-
ing and two stochastic versions, INRIA, 1991.
[25] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, second edition, Prentice Hall, 2003.
[39] Wenke Lee and S. J. Stolfo. Data Mining Approaches for Intrusion Detection,
1998.
[40] Stefano Zanero and Sergio M. Savaresi. Unsupervised learning techniques for
an intrusion detection system, ACM March 2004.
[41] R. Lippmann, J.W. Haines, D.J. Fried, J. Korba and K. Das. The 1999 DARPA Off-line Intrusion Detection Evaluation, Lincoln Laboratory MIT, 2000.
[42] John McHugh. Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory, ACM Transactions on Information and System Security, Vol. 3, No. 4, November 2000, pages 262-294.
[43] E. Eskin, M. Miller, Z. Zhong, G. Yi, W. Lee, and S. Stolfo. Adaptive model generation for intrusion detection systems.
[44] http://www.cert.org/stats/cert stats.html#incidents
[45] https://www.cert.dk/artikler/artikler/CW30122005.shtml
[46] Martin Ester, Hans-Peter Kriegel, Jorg Sander and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.
[47] Teuvo Kohonen. Self-Organizing Maps, 2nd edition, Springer, 1997.
[48] A. Ultsch and C. Vetter. Self-organizing-feature-maps versus statistical clustering methods: A benchmark. University of Marburg, Research Report 0994. Located at: http://www.mathematik.uni-marburg.de/~databionics/de//downloads/papers/ultsch94benchmark.pdf [accessed 15/02/2006]
[49] Ross J. Anderson. Security Engineering: A guide to building dependable
distributed systems. John Wiley & Sons, 2001.
[50] A. Wespi, G. Vigna and L.Deri. Recent Advances in Intrusion Detection.
5th International Symposium, Raid 2002 Zurich, Switzerland, October 2002
Proceedings. Springer.
[51] D. Gollmann. Computer Security. John Wiley & Sons, 1999.
[52] P. Giudici. Applied Data Mining: Statistical Methods for Business and Industry. Wiley, 2003.
[53] Bjarne Stroustrup. The C++ programming language, third edition, Addison-
Wesley, 1997.
[54] http://www-iepm.slac.stanford.edu/monitoring/passive/tcpdump.html
Appendix A
Definitions
A.1 Acronyms
DOS: Denial of service attack.
OS: Operating system.
IDS: Intrusion detection system.
NIDS: Network intrusion detection system.
pod: Ping of Death.
IP: Internet Protocol.
TCP: Transmission Control Protocol.
UDP: User Datagram Protocol.
ICMP: Internet Control Message Protocol.
HTTP: Hypertext Transfer Protocol.
FTP: File Transfer Protocol.
A.2 Definitions
Network Traffic
In this thesis network traffic refers to transfer of IP packets through
network communication channels.
Firewalls
To broadcast a message
To broadcast a message consists in delivering that message to every host on a given network.
Ping
A program that is used to test if a connection can be established to a
remote host.
Protocol
A protocol is a specifies how modules running on different hosts should
communicate with each other.
Host
A host is a synonym for computer.
CGI scripts
A CGI (common gateway interface) script is a program running on a server which can be invoked by a client through the CGI interface.
TCP connection
A TCP connection is a sequence of IP packets flowing from the packet
sender to the packet receiver under the control of a specific protocol.
The duration of the connection is limited in time.
Tcpdump
A tcpdump is a log obtained by monitoring network traffic. Different tools exist for sniffing network traffic; one such tool, which has been used for collecting the network traffic data used in this thesis, is the program called TCPDUMP [54].
Data mining
Data mining is the process of extracting useful models from large volumes of data.
Appendix B
Feature set
Appendix C
Computer attacks
– Portsweep
probes a host to find available services on that host.
– Nmap
is a complete and flexible tool for scanning a network either ran-
domly or sequentially.
– Satan
is an administration tool; it gathers information about the net-
work. This information can be used by an attacker.
– Back
is a denial of service attack against Apache web servers. The attacker sends requests containing many forward slashes, the processing of which is time-consuming.
– Land:
A spoofed SYN packet is sent to the victim host, causing that host to repeatedly synchronize with itself.
– Smurf
A broadcast of ping requests with a spoofed sender address, which results in the victim being bombarded with a huge number of ping responses.
– Neptune:
The attacker half opens a number of TCP connections to the vic-
tim host making it impossible for the victim host to accept new
TCP connections from other hosts.
– Teardrop:
Confuses the victim host by sending it overlapping IP fragments:
overlapping IP fragments are incorrectly dealt with by some older
operating systems.
– Perl:
Exploits a bug in some PERL implementations on some earlier systems: these implementations improperly handle their root privileges, leading to a situation where any user can obtain root privileges.
– Buffer overflow
Consists in overflowing input buffers in order to overwrite memory
locations containing security relevant information.
– Ftp write
This attack exploits a misconfiguration affecting write privileges
of anonymous accounts on an FTP server.
This allows any ftp user to add arbitrary files to the FTP server.
– Phf
Is an example of a badly written CGI script distributed with the Apache server. Exploiting this flaw allows the attacker to execute code with the privileges of the HTTP server.
– Warezmaster
The warezmaster attack is possible in a situation where write permissions are improperly assigned on an FTP server. When this is the case, the attacker can upload copies of illegal software.
– Warezclient
The warezclient attack consists in downloading illegal software previously uploaded during a warezmaster attack.
– Guessing passwords
– Multihop attack
This attack first compromises a host on a network and then uses that host to attack other hosts on the network.
Appendix D
Theorems
$$f\left(\sum_{i=1}^{n} \alpha_i x_i\right) \le \sum_{i=1}^{n} \alpha_i f(x_i) \qquad \text{(D.1)}$$
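Inequality (D.1) is Jensen's inequality for a convex function $f$ and weights $\alpha_i$ summing to 1. A minimal numerical check, using f(x) = x^2 as an assumed example of a convex function:

```python
def jensen_check(f, alphas, xs):
    """Return (f of the weighted mean, weighted mean of f),
    for weights alphas summing to 1."""
    mix = sum(a * x for a, x in zip(alphas, xs))
    return f(mix), sum(a * f(x) for a, x in zip(alphas, xs))

lhs, rhs = jensen_check(lambda x: x * x, [0.5, 0.5], [1.0, 3.0])
# lhs = f(2) = 4, rhs = (1 + 9)/2 = 5, and indeed lhs <= rhs
```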
Appendix E
Results of the experiments