Machine Learning Technical Report
Abstract—As network attacks have increased in number and severity over the past few years, the intrusion detection system (IDS) is increasingly becoming a critical component in securing the network. Due to the large volume of security audit data, as well as the complex and dynamic properties of intrusion behavior, optimizing the performance of IDSs has become an important open problem that is receiving more and more attention from the research community. Intrusion poses a serious security risk in a network environment, and the ever-growing number of new intrusion types poses a serious problem for their detection. Existing intrusion detection systems rely heavily on human analysts to differentiate intrusive from non-intrusive network traffic. Human labeling of the available network audit data instances is usually tedious, time-consuming, and expensive, and the large and growing amount of data confronts analysts with an overwhelming task, making the automation of aspects of this task necessary. Whether complete automation is possible, or even desirable, is debatable.
Pattern matching in intrusion detection has also traditionally been simple, looking for exploitive activity such as connections from certain IP addresses with histories of intrusive behavior. However, an intrusion into a computer network can be more complex, with the complexity being both spatial and temporal. An example of this type of intrusion is a 'low and slow' attack consisting of intrusive behavior over hours, days, or weeks that may originate from multiple network sources.
Machine learning can be applied to this problem to extend human pattern recognition. Automated techniques are ideal for this application because they can monitor and correlate vast numbers of intrusive signatures.
This paper compares the performance of Intrusion Detection System (IDS) classifiers using various feature reduction techniques. To enhance learning capability and reduce the computational intensity of competitive learning, different dimension reduction techniques have been proposed, and the performance of the resulting algorithms is compared. These include classification and clustering algorithms (Naïve Bayes, simple k-means, and the J48 decision tree) together with Linear Discriminant Analysis and Independent Component Analysis. Many intrusion detection systems are based on neural networks; however, they are computationally very demanding. To mitigate this problem, dimension reduction techniques are applied to a given dataset to extract important features. This paper provides a review of current trends in intrusion detection together with a study of technologies implemented by some researchers in this research area. It also gives a brief overview of an application named NEDAA (Network Exploitation Detection Analyst Assistant), developed at the Applied Research Laboratories of the University of Texas at Austin (ARL:UT), which helps network analysts by using machine learning to create rules for an intrusion detection expert system. That work describes some of the successes and problems encountered in applying these techniques to computer network intrusion detection.

Keywords— Supervised Machine Learning, Unsupervised Machine Learning, Network Intrusion Detection, Network Security

I. INTRODUCTION
The field of intrusion detection has received increasing attention in recent years. One reason for this is the explosive growth of the Internet and the large number of networked systems that exist in all types of organizations. The increase in the number of networked machines has led to an increase in unauthorized activity, not only from external attackers but also from internal attackers, such as disgruntled employees and people abusing their privileges for personal gain. Security is a major issue for all networks in today's enterprise environment. Both commercial and open source products are now available for this purpose, and many vulnerability assessment tools are available on the market to assess the different types of security holes present in a network. A secure network must provide the following:
• Confidentiality: Data being transferred through the network should be accessible only to those who have been properly authorized.
• Integrity: Data should maintain their integrity from the moment they are transmitted to the moment they are actually received. No corruption or data loss is accepted, whether from random events or malicious activity.
• Availability: The network should be resilient to Denial of Service attacks.
The organization of this paper is as follows. Section II provides an overview of machine learning. Section III gives a brief summary of intrusion detection. Section IV provides an overview of anomaly detection. Section V describes how machine learning can be used in anomaly detection. Section VI gives a detailed description of unsupervised anomaly detection and the work that has been done in anomaly detection using unsupervised learning. Section VII provides an overview of other machine learning techniques. Section VIII discusses future plans, and finally Section IX highlights challenges and deployment issues.

II. MACHINE LEARNING
Machine Learning is "a field of study that gives computers the ability to learn without being explicitly programmed" (Arthur Samuel, 1959). Another definition of machine learning is: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
Machine learning is a vast domain of computer science, but this paper focuses only on anomaly-based intrusion detection and prevention systems.
Machine learning = algorithms + predictions + data. What a database calls records are called data instances in machine learning, and what a database calls columns are called features; a data instance is a vector of features, and machine learning makes predictions over feature vectors. Another definition of machine learning is:
"Machine Learning is about building a compressed representation of your data: expressing what is present in the data in compressed format and making predictions about new data."
Machine learning develops algorithms for making predictions from data; these are classified into two broad categories.
Supervised learning: In this type of learning, the algorithm is "trained" on a pre-defined set of "training examples", which then facilitates its ability to reach an accurate conclusion when given new data. In supervised learning the idea is to teach the computer how to do something: the term refers to the fact that we give the algorithm a data set in which all the "right answers" are provided. Suppose, for example, that house prices in a particular city are to be predicted from available data, plotted as price (in $1000s) against size (in square feet).

Figure 1: House price prediction graph (price in $1000s vs. size in square feet)

Here the learning algorithm can help by fitting a straight line through the data; based on that line, the price of a house of a given size (here 750 square feet) can be predicted. Instead of a straight line there could be a quadratic function or any other function. This is an example of a regression problem, in which a continuous-valued output is predicted; if a discrete-valued output is to be predicted, the problem is referred to as classification.
Consider another graph, which shows two types of breast cancer tumor, (i) malignant and (ii) benign; based on the size of a tumor, it can be assigned to one of the two categories. This is an example of classification.

Figure 2: Cancer classification graph

Unsupervised learning is the other broad class of machine learning algorithms, in which the program is given a collection of data and must find patterns and relationships therein. In unsupervised learning we allow the computer to learn by itself; we do not label the data. The clustering of articles in Google News is an example of unsupervised learning.

III. INTRUSION DETECTION
An intrusion attempt or a threat is a deliberate and unauthorized attempt to (i) access information, (ii) manipulate information, or (iii) render a system unreliable or unusable. For example, (a) Denial of Service (DoS) is an attempt to make a machine or network resource unavailable to the intended users who need it to function correctly; (b) worms and viruses exploit other hosts through the network; and (c) compromises obtain privileged access to a host by taking advantage of known vulnerabilities [1].
Every attack leaves its signature. A signature, in the world of cyber-crime, can be defined as a set of rules that an IDS or an IPS uses to detect typical intrusive activity, such as DoS attacks. Some of the methods that can be used to identify signatures are:
1. Connection attempt from a reserved IP address. This is easily identified by checking the source address field in an IP header.
2. Packet with an illegal TCP flag combination. This can be found by comparing the flags set in a TCP header against known good or bad flag combinations.
3. Email containing a particular virus. The IDS can compare the subject of each email to the subject associated with the virus-laden email, or it can look for an attachment with a particular name.
4. DNS buffer overflow attempt contained in the payload of a query. By parsing the DNS fields and checking the length of each of them, the IDS can identify an attempt to perform a
buffer overflow using a DNS field. A different method would be to look for exploit shell code sequences in the payload.
5. Denial of service attack on a POP3 server caused by issuing the same command thousands of times. One signature for this attack would be to keep track of how many times the command is issued and to alert when that number exceeds a certain threshold.
6. File access attack on an FTP server by issuing file and directory commands to it without first logging in. A state-tracking signature could be developed that monitors FTP traffic for a successful login and alerts if certain commands are issued before the user has authenticated properly.

IV. ANOMALY BASED DETECTION
Anomaly-based detection, also called "behavior-based detection", describes a network IDS that models the behavior of the network, users, and computer systems, and raises an alarm whenever there is a deviation from this normal behavior. Anomaly detection is assumed to be able to detect a new attack or any new potential attacks. It works on the principle of distance measurement: profiles representing normal usage are constructed and then compared with the current behavior of the data to find a likely mismatch. It detects any action that significantly deviates from normal behavior, based on the normal-behavior profile and calculated thresholds. An anomaly-based detection mechanism detects any traffic that is new or unusual, and is particularly good at identifying sweeps and probes against network hardware. It can therefore give early warning of potential intrusions, because probes and scans are the predecessors of all attacks; this applies equally to any new service installed on any item of hardware. An anomaly-based IDS is a suitable mechanism for detecting anything from port anomalies and web anomalies to malformed attacks in which the URL is deliberately mistyped. The problem occurs when intrusion (noise) data in the training data cause misclassifications, which generate a large number of false-positive alarms.
The general architecture of an anomaly-based methodology is shown in Fig. 3. It consists of preprocessing the network traffic to extract information, building a pattern of each connection, and detecting outliers, raising an alert when a connection deviates from normal behavior.

Figure 3: Anomaly Detection Block Diagram (network traffic → preprocessing → pattern building → outlier detection → alert)

However, based on the literature survey, each of these techniques has some limitations. Signature-based detection techniques fail to detect zero-day attacks, and continuous updating of the signature database is a major technical concern. In anomaly-based detection techniques, the different extracted notions of anomaly for different application domains produce high false-alarm rates. In the literature, a number of anomaly detection systems have been developed based on many different machine learning techniques. For example, some studies apply single learning techniques, such as neural networks, genetic algorithms, or support vector machines. Other systems are based on combining different learning techniques, such as hybrid or ensemble techniques. In particular, these techniques are developed as classifiers, which are used to classify or recognize whether incoming Internet traffic is normal or an attack.
Machine learning techniques have the ability to implement a system that can learn from data. For example, a machine learning system could be trained on incoming packets to learn to distinguish between intrusive and normal packets. After learning, it can then be used to classify new incoming packets as intrusive or normal.

V. MACHINE LEARNING IN ANOMALY DETECTION
Machine learning approaches attempt to obtain an anomaly detector that adapts to measurements, changing network conditions, and unseen anomalies. What is a machine learning view of network-anomaly detection? Let X(t) ∈ R^n (X in short) be an n-dimensional random feature vector at time t ∈ [0, T]. Consider the simplest scenario, in which there are two underlying states of a network, ω_i, i = 0, 1, where ω_0 = 0 corresponds to normal network operation and ω_1 = 1 corresponds to an "unusual or anomalous" network state. Detecting an anomaly can then be viewed as deciding whether a given observation x of the random feature vector X is a symptom of underlying network state ω_0 or ω_1. That is, a mapping needs to be obtained between X and ω_i for i = 0, 1. Note that an anomalous network state may or may not correspond to abnormal network operation. For example, a flash crowd, which is a surge of user activity, may result either from the installation of new software under normal network operation or from an abnormal network state due to DDoS or worm attacks.
Due to the complexity of networks, such a mapping is usually unknown, but it can be learned from measurements. Assume that a set D of m measurements is collected from a network as observations on X, i.e., D = {x_i(t)}, i = 1, …, m, where x_i(t) is the i-th observation for t ∈ [0, T]; x_i(t) is called a training sample in machine learning. In addition, another set Dl = {y_q}, q = 1, …, k, of k measurements is generally assumed to be available, consisting of samples (y_q = 0, 1) on the ω_i's; the y_q's are called labels in machine learning. A pair (x_i(t), y_i) is called a labeled measurement, where the observation x_i(t) is obtained when the network is in a known state. For example, if measurement x_i(t) is taken when the network is known to be operating normally, then y_i = 0 and (x_i(t), 0) is considered a sample of normal network operation. If x_i(t) is taken when the network is known to be operating abnormally, then y_i = 1 and (x_i(t), 1) is considered a "signature" in anomaly detection. In general, x_i(t) is considered an unlabeled measurement, meaning the observation x_i(t) occurs when the network state is unknown. Hence, there are three types of network measurements: (a) normal data Dn = {(x_i(t), 0)}, i = 1, …, k−u; (b) unlabeled data D = {x_j(t)}, j = 1, …, m; and (c) anomalous data Dl = {(x_r(t), 1)}, r = 1, …, u. A training set in general consists of all three types of data, although a frequent scenario is that only D and Dn are available. Examples of D include raw measurements on end-to-end flows, packet traces, and data from MIB variables. Examples of Dn and Dl can be such measurements obtained under normal or anomalous network conditions, respectively.
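To make the three types of measurements concrete, the sketch below builds toy versions of Dn, D, and Dl and a deliberately simple detector; the two-feature encoding (packets/s, bytes/s), all numeric values, and the distance threshold are invented for illustration and are not taken from any cited system:

```python
import math

# Illustrative (hypothetical) feature vectors x_i, e.g. [packets/s, kbytes/s].
D_n = [([10.0, 1.2], 0), ([11.5, 1.0], 0), ([9.8, 1.3], 0)]   # normal data, labels y = 0
D_l = [([95.0, 40.0], 1), ([120.0, 55.0], 1)]                 # anomalous data, labels y = 1
D   = [[10.4, 1.1], [110.0, 48.0]]                            # unlabeled measurements

# A minimal mapping from an observation x to a state {0, 1}: flag x as anomalous
# (omega_1) when its Euclidean distance from the mean of the normal data Dn
# exceeds a threshold. The threshold value is an arbitrary illustrative choice.
mean_n = [sum(x[i] for x, _ in D_n) / len(D_n) for i in range(2)]

def f(x, threshold=25.0):
    dist = math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, mean_n)))
    return 1 if dist > threshold else 0

labels = [f(x) for x in D]
```

A real detector would learn the mapping from far richer features; the point here is only the shape of the data: labeled normal pairs, labeled anomalous pairs, and unlabeled observations.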
Given a set of training samples, the machine learning view of anomaly detection is to learn a mapping f(.) using the training set, where f(.) : Training Set → ω_i, so that a desired performance can be achieved in assigning a new sample x to one of the two categories. Fig. 4 illustrates the learning problem.

Fig 4: Machine Learning View of Anomaly Detection (training set → learning of f(.) → ω_i)

A training set determines the amount of information available, and thus categorizes the different types of learning algorithms for anomaly detection. Specifically, when only D is available, learning, and thus anomaly detection, is unsupervised. When D and Dn are available, learning/anomaly detection can also be viewed as unsupervised. When Dl and Dn are both available, learning/anomaly detection becomes supervised, since we have labels or signatures. f(.) determines the architecture of the "learning machine", which can be either a model with an analytical expression or a computational algorithm. When f(.) is a model, learning algorithms for anomaly detection are parametric, i.e., model-based; otherwise, learning algorithms are non-parametric, i.e., non-model-based.

VI. UNSUPERVISED ANOMALY DETECTION
We now focus on unsupervised learning for anomaly detection, where Dl is not available and a mapping f(.) is learned using raw measurements D and normal data Dn. Unsupervised learning is the most common scenario in anomaly detection because of its practicality: networks provide a rich variety and a huge amount of raw measurements and normal data. One unsupervised learning approach to anomaly detection is behavioral: Dn together with D is used to learn a characterization of normal network behavior, and a deviation from the norm is considered an anomaly.

A. Clustering
Clustering is one class of algorithms used in unsupervised anomaly detection. Clustering algorithms [18] characterize anomalies based on dissimilarities. The premise is that measurements that correspond to normal operations are similar and thus cluster together; measurements that do not fall in the "normal clusters" are considered anomalies.
A similarity measure is a key parameter of a clustering algorithm, distinguishing normal from anomalous clusters. "Similarity" measures distances among the inputs x_i. Euclidean distance is a commonly used similarity measure for network measurements: when samples lie within distances below a chosen threshold, they are grouped into one cluster [18]. Such an approach has been applied in the following examples: (a) to cluster network flows from the Abilene network [17] to detect abrupt changes in the traffic distribution over an entire network, and (b) to cluster BGP update messages and detect unstable prefixes that experience frequent route changes [19]. More sophisticated algorithms have also been applied, such as hierarchical clustering to group BGP measurements based on coarse-to-fine distances [21]. An advantage of hierarchical clustering is computational efficiency, especially when applied to large network data sets [21]. Other, more sophisticated clustering algorithms improve local convergence using fuzzy k-means and swarm k-means algorithms, and have been applied to detecting network attacks [12].
Mixture models provide a model-based approach to clustering [18]. A mixture model is a linear combination of basis functions; Gaussian basis functions with different parameters are commonly used to represent individual clusters. The basis functions are weighted and combined to form the mapping, and the parameters of the basis functions and the weighting coefficients can be learned from measurements [18]. For example, [21] uses Gaussian mixture models to characterize utilization measurements. Parameters of the model are estimated using the Expectation-Maximization (EM) algorithm, and the detected anomalies correspond to network failure events. One limitation of clustering algorithms is that they do not provide an explicit representation of the statistical dependence exhibited in raw measurements. Such a representation is important for correlating multiple variables in the detection and diagnosis of anomalies.

B. Bayesian Belief Networks
Bayesian Belief Networks (as the mapping f(.)) provide the capability to capture statistical dependence or causal-performance relations among variables and anomalies.
An early work [9] applies Bayesian Belief Networks to MIB variables in proactive network fault detection. The premise is that many variables in a network may exhibit anomalous behavior upon the occurrence of an event, and these can be combined to yield a network-wide view of anomalies that may be more robust and result in more accurate detection. Specifically, the Bayesian Belief Network of [9] first combines MIB variables within a protocol layer, and then aggregates the intermediate variables of the protocol layers to form a network-wide view of anomalies. The combinations are done through conditional probabilities in the Bayesian Belief Network, with the conditional probabilities determined a priori.
Recent work [10] uses Bayesian Belief Networks to combine and correlate different types of measurements such as traffic volumes, ingress/egress packets, and bit rates. Parameters in the Bayesian Belief Network are learned from measurements. The performance of the Bayesian Belief Network compares favorably with that of wavelet models and time-series detection, especially in lowering false alarm rates [10].
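The idea of combining per-variable evidence through conditional probabilities can be illustrated with a deliberately simplified, naive-Bayes-style combination (not the actual belief network structure of [9] or [10]); the MIB-style variable names are only examples, and every probability value below is invented for illustration:

```python
# Prior probability of an anomalous network state, and per-variable conditional
# probabilities P(variable is high | state). All values are invented for illustration.
p_anom = 0.1
cond = {                       # (P(high | anomalous), P(high | normal))
    "ifInOctets":  (0.9, 0.2),
    "ifOutOctets": (0.8, 0.3),
    "tcpRetrans":  (0.7, 0.1),
}

def posterior_anomaly(observed_high):
    """Combine binary indicators from several variables via Bayes' rule,
    assuming conditional independence given the network state."""
    like_a, like_n = p_anom, 1.0 - p_anom
    for var, (pa, pn) in cond.items():
        if observed_high[var]:
            like_a *= pa
            like_n *= pn
        else:
            like_a *= 1.0 - pa
            like_n *= 1.0 - pn
    return like_a / (like_a + like_n)

p = posterior_anomaly({"ifInOctets": True, "ifOutOctets": True, "tcpRetrans": True})
```

When all three indicators fire, the posterior probability of an anomalous state is high even though the prior is only 0.1; a full belief network would additionally model dependence between the variables rather than assuming independence.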
[11] provides an example of Bayesian networks in unsupervised anomaly detection at the application layer. In this work, a node in a graph represents a service and an edge represents a dependency between services. The edge weights vary with time and are estimated from measurements. Anomaly detection is conducted on a time sequence of graphs with different link weights. The paper shows that service anomalies are detected with little information on normal network behavior.

1) Bayesian Network
A Bayesian network is a model that encodes probabilistic relationships among the variables of interest. This technique is generally used for intrusion detection in combination with statistical schemes. The naïve Bayesian (NB) algorithm is used for the learning task, where a training set with a target class is provided. The training set is described by attributes A1 through An, and each example is described by attribute values a1, a2, …, an associated with a class C. The objective is to classify an unseen example whose class value is unknown but whose attribute values are known. The Bayesian approach to classifying the unseen example is to assign the most probable target class, C_MAP, given the attribute values (a1, a2, …, an) that describe the example:

    C_MAP = argmax_{Cj} P(Cj | a1, a2, …, an)

Using Bayes' theorem, this expression can be rewritten as

    C_MAP = argmax_{Cj} P(a1, a2, …, an | Cj) P(Cj)    (1)

It is easy to estimate each of the P(Cj) simply by counting the frequency with which each target class Cj occurs in the training set. The naïve Bayesian algorithm is based on the simplifying assumption that, given the target class of the example, the probability of observing the conjunction a1, a2, …, an is just the product of the probabilities of the individual attributes: P(a1, a2, …, an | Cj) = Π_i P(ai | Cj). Substituting this into equation (1), we get

    C_NB = argmax_{Cj} P(Cj) Π_i P(ai | Cj)    (2)

where C_NB denotes the target class predicted by the naïve Bayesian classifier. In the naïve Bayesian algorithm, the probability values of equation (2) are estimated from the given training data; these estimated values are then used to classify unknown examples. This procedure yields several advantages, including the capability of encoding interdependencies between variables and of predicting events, as well as the ability to incorporate both prior knowledge and data [12][13].

VII. OTHER MACHINE LEARNING APPROACHES
Using information-theoretic measures is another approach to unsupervised anomaly detection. Information-theoretic measures are based on probability distributions; hence, information from distributions estimated from the data is used in anomaly detection.
[15] provides several information-theoretic measures for anomaly detection of network attacks. The measures include entropy and conditional entropy, as well as information gain/cost, for anomaly detection. The entropy measures are applied to several data sets for network security, including call data and "tcpdump" data; they also assist in selecting parameters for detection algorithms. [13] applies information-theoretic measures in behavioral anomaly detection. First, the maximum entropy principle is applied to estimate the distribution of normal network operation from raw data. Second, the relative entropy of the network traffic is computed with respect to the distribution of normal behavior. The approach detects anomalies that cause not only abrupt but also gradual changes in network traffic. [22] applies entropy measures to flow data and develops algorithms for estimating the entropy of streaming data.
The Hidden Markov Model is another approach in unsupervised anomaly detection. For example, [23] uses a Hidden Markov Model to correlate observation sequences and state transitions so that the most probable intrusion sequences can be predicted. The approach is applied to intrusion detection data sets (KDDCUP99) and is shown to reduce false alarm rates effectively.
The current implementation of NEDAA contains rule generation modules that interface with two ARL:UT artificial intelligence (AI) software packages: a genetic algorithm tool set and a decision tree generator. These modules can be customized for specific applications and data sources. The choice of AI techniques employed stemmed from previous experience with genetic algorithms and from literature research into other applicable methods. The genetic algorithm software package was a logical platform from which to tackle the difficult problem of intrusion detection; the use of decision trees for rule generation was chosen to provide a deterministic alternative to genetic algorithms. VulcanRG, the machine learning component of the NEDAA system, generates rules for compilation into intrusion detection systems. These rules are generated by the genetic algorithm and decision tree packages developed at ARL:UT. Currently VulcanRG is used to generate rules for one deployed IDS and one experimental system. Many of the examples presented in this paper are derived from actual runs of the machine learning components of NEDAA.

A. Genetic Algorithms
Genetic algorithms are a family of problem-solving techniques based on evolution and natural selection. They are essentially a type of search algorithm that finds an approximate solution to an optimization task, inspired by the biological evolutionary process and natural genetics, and proposed by Holland (1975). A GA uses a hill-climbing method from an arbitrarily selected number of genes, and has four operators (initialization, selection, crossover, and mutation); as such, it can be used to solve a wide variety of problems. This section gives a brief overview of genetic algorithms. The goal of genetic algorithms is to create optimal solutions to problems. Potential solutions to the problem to be solved are encoded as sequences of bits, characters, or numbers. The unit of encoding (usually a single bit in traditional genetic algorithms) is called a gene, and the encoded sequence is called a chromosome. The genetic algorithm begins with a set (population) of these chromosomes and an evaluation function that measures the fitness of each chromosome, i.e., the "goodness" of the problem solution it represents.
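The four operators can be sketched on a toy bit-string problem as follows; the target pattern, population size, and rates are arbitrary illustrative choices, not NEDAA's actual parameters:

```python
import random

random.seed(1)
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]          # illustrative optimum; genes are bits

def fitness(chrom):                         # number of genes matching the target
    return sum(g == t for g, t in zip(chrom, TARGET))

def evolve(pop_size=20, generations=60, p_mut=0.05):
    # initialization: a population of random chromosomes
    pop = [[random.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        # selection: keep the fitter half of the population as parents
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(TARGET))     # one-point crossover
            child = a[:cut] + b[cut:]
            # mutation: flip each gene with small probability
            child = [g ^ 1 if random.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

Selection here is a simple elitist truncation; real systems often use roulette-wheel or tournament selection, and, as described below for NEDAA, the fitness of a rule chromosome is measured against a pre-classified set of connections rather than a fixed target string.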
called a chromosome. The genetic algorithm begins with a set
(population) of these chromosomes and an evaluation function (4,2,2,2,14,5,11,12,1,5,11,-1,6,14,7,6,0,4,7,0,5,1,2,3,2,0,17)
that measures the fitness of each chromosome, i.e. the
Figure 5: Chromosome for rule in Table 1
„goodness‟ of the problem
GA has been used in different ways in IDS. There are
many researchers that used evolutionary algorithms and The evolutionary process of the genetic algorithm requires
especially GAs in IDS to detect malicious intrusion from some form of fitness measure on individual population
normal use. Genetic algorithm based intrusion detection members. In NEDAA, this fitness measure is based on the
system is used to detect intrusion based on past behavior. A actual performance of each rule on a pre-classified data set.
profile is created for the normal behavior based on that genetic Each rule is used to filter a data set comprised of connections
algorithm learns and takes the decision for the unseen patterns. marked as either anomalous or normal by an analyst, and the
Genetic algorithms also used to develop rules for network fitness function rewards partial matches of training
intrusion detection. A chromosome in an individual contains connections that have been designated as anomalous. If a rule
genes corresponding to attributes such as the service, flags, completely matches an anomalous connection it is awarded a
logged in or not, and super-user attempts. These attacks that bonus; if it matches a normal connection it is penalized. As a
are common can be detected more accurately compared to result, succeeding populations are biased toward rules that
uncommon attributes. Genetic algorithms are capable of match intrusive connections only. After a certain number of
deriving classification rules and selecting optimal parameters generations, the genetic algorithm is stopped, and the best
for detection process. unique rules are selected.
Genetic Algorithms are applied in the current version of The classic genetic algorithm described earlier tends to
NEDAA, to evolve simple rules for monitoring network converge on a single „best‟ solution to any given problem. In
traffic. These rules are simple single-connection patterns for the case of a rule set, however, it is generally not sufficient to
differentiating normal from abnormal network connections. find a single rule. Multiple rules are generally needed to
The rules are evolved by creating patterns which match a set identify unrelated types of anomalies; several „good‟ rules are
of anomalous connections and a set of normal connections (as reported by an analyst). These patterns are used as the firing conditions of rules of the form if <pattern matched> then <generate alert>. The goal is to develop rules which match only the anomalous connections. Such rules can then be used to filter historical records and new connections so that the analyst can concentrate on those which are suspicious.
The patterns created by the genetic algorithm correspond to the format of incoming connections and are combinations of specific connection attribute values and 'wild cards'. Examples of attributes used in the current version of NEDAA include: source IP address, destination IP address, source IP port, destination IP port, and network protocol.
The initial population is comprised of random rules; an example is shown in Table 1. The chromosome for a rule is comprised of 29 genes: 8 for the source IP (2 hexadecimal digits per address field), 8 for the destination IP, 6 for the source port, 6 for the destination port, and 1 for the protocol. Each gene can be either a specific numeric value in the appropriate range for the field or a wild card. The chromosome for the rule in Table 1 is shown in Figure 5. When a rule is used to filter connections, each connection is converted to a 29-field format corresponding to the gene structure of a rule as described above. A rule matches a connection if and only if every non-wildcard gene in the rule matches the corresponding field in the connection.

Attribute          Value
Source IP          42.22.e5.bc (66.34.229.188)
Destination IP     15.b*.6e.76 (21.176+.110.118)
Source Port        047051
Destination Port   912320
Protocol           TCP

Table 1: Sample rule

For this application, a set of diverse rules is more effective than a single 'best' one. In mathematical terms, this requirement translates to the concept of finding local maxima as opposed to the global maximum of the fitness function. Genetic algorithms evolve solutions which maximize (or minimize) the fitness function. A traditional genetic algorithm will attempt to find a global maximum of the fitness function, and will continue until all solutions in the population have converged to this maximum. In contrast, the problem of discovering multiple rules to filter incoming connections based on different criteria is essentially that of discovering multiple local maxima of the fitness function.
In order to find local maxima of the fitness function, and hence multiple rules, we employed niching techniques. Conceptually, niching in genetic algorithms is similar to niching in nature: different species that share an environment generally inhabit different niches in order to exploit resources and minimize competition. In genetic algorithms, niching strategies attempt to create subpopulations which converge on local maxima. The two standard niching methods are sharing and crowding [4]. Sharing degrades the fitness of solutions based on the number of other solutions which are nearby (similar). When the fitness is degraded in this manner, overcrowded niches become less hospitable, forcing solutions to other local maxima which may be less populated. In crowding, solutions which are generated by the crossover of two 'parent' solutions replace the nearest (most similar) solutions in the population.
Both sharing and crowding use the concept of nearness, or similarity, in order to maintain population diversity. A distance metric must be imposed either on the space of solutions or on the space of chromosomes in order to use either of these niching methods. Unless a domain-specific distance metric is determined for the problem, the Hamming distance is used [4]. The Hamming distance between two chromosomes is the number of genes which differ between the two chromosomes. We use the Hamming distance and a variation of crowding in order to generate a diverse rule set.
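As a concrete illustration of the matching and distance computations just described, the following sketch uses toy 5-gene chromosomes rather than NEDAA's full 29-gene layout; the `None` wildcard marker and the function names are our assumptions, not NEDAA's code:

```python
# Sketch of wildcard-rule matching, Hamming distance, and a crowding
# replacement step. Real NEDAA chromosomes have 29 genes (8 source-IP
# hex digits, 8 destination-IP hex digits, 6 + 6 port digits, and a
# protocol gene); toy 5-gene chromosomes are used here, with None as
# the wild card.
WILD = None

def matches(rule, connection):
    """A rule matches iff every non-wildcard gene equals the
    corresponding connection field."""
    return all(g == WILD or g == c for g, c in zip(rule, connection))

def hamming(a, b):
    """Number of genes that differ between two chromosomes."""
    return sum(x != y for x, y in zip(a, b))

def crowding_replace(population, child):
    """Crowding: the child replaces the most similar (smallest
    Hamming distance) member of the population."""
    nearest = min(range(len(population)),
                  key=lambda i: hamming(population[i], child))
    population[nearest] = child

rule = ['4', WILD, 'e', WILD, 'tcp']
conn = ['4', '2', 'e', '5', 'tcp']
print(matches(rule, conn), hamming(rule, conn))   # True 2
```

Crowding keeps the rule set diverse because each new child can only displace the population member it most resembles, leaving other niches intact.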
B. Neural Networks
An artificial neural network consists of a collection of highly interconnected processing elements that transform a set of inputs into a set of desired outputs, in a way inspired by how biological nervous systems, such as the brain, process information. The Multilayer Perceptron (MLP) is the most widely used neural network for intrusion detection. MLPs are capable of approximating, to arbitrary accuracy, any continuous function as long as they contain enough hidden units. This means that such models can form any classification decision boundary in feature space and thus act as a non-linear discriminant function. When the NN is used for packet classification, there is one input node for each element of the feature vector and usually one output node for each class to which a feature may be assigned (shown in Fig. 6). The hidden nodes are connected to the input nodes, and some initial weight is assigned to each of these connections; these weights are adjusted during the training process. One learning algorithm used for the MLP is the back-propagation rule. This is a gradient descent method based on an error function that represents the difference between the network's calculated output and the desired output. The error function is defined in terms of the Mean Squared Error (MSE), which can be summed over the entire training set. In order to learn successfully, the network's actual output should approach the desired output by continuously reducing the value of this error. The back-propagation rule calculates the error for a particular input and then back-propagates the error from each layer to the previous one. The connection weights between the nodes are adjusted according to the back-propagated error so that the error is reduced and the network learns. The numbers of input, output, and hidden-layer neurons are variable: input/output neurons are set according to the input/output vectors, while hidden-layer neurons are adjusted to meet performance requirements; the more hidden neurons, the more complex the MLP. The neural network intrusion detection system consists of three phases:
- Parsing: automated parsers process the raw TCP/IP dump data into machine-readable form, or a benchmark dataset containing parsed connection records is used.
- Training: the neural network is trained on different types of attacks and on normal data. The input has 41 features and the output assumes one of two values: intrusion (22 different attack types) or normal data.
- Testing: performed on the test dataset.

Figure 6: Simple Architecture of MLP

A neural network based intrusion detection system is intended to classify normal and attack patterns, as well as the type of attack. It has been observed that the long training time of the neural network is mostly due to the huge number of training vectors rather than to the computation facilities. However, once the neural network parameters have been determined by training, classification of a single record takes negligible time. Therefore, the neural network based IDS can operate as an online classifier for the attack types it has been trained for. The only factor that keeps the neural network off-line is the time needed to gather the information necessary to compute the features [3][6].

C. Support Vector Machine
In classification and regression, the Support Vector Machine (SVM) is among the most common and popular machine learning methods. In this method, a set of training examples is given, with each example marked as belonging to one of two categories. Then, using the SVM algorithm, a model is built that can predict whether a new example falls into one category or the other [4]. For classification using SVM we need to select the training samples and carry out attribute extraction on them; generally, a network connection is selected as a sample, or benchmark datasets such as KDD99, which contain collections of network connection attributes captured from various networks, are used. An input space x must be defined: for each network connection, n attribute characteristics are chosen, and the one-dimensional vector x = {x1, x2, ..., xn} expresses a network connection, in which xi, i = 1, 2, ..., n, refers to the i-th characteristic value of the sample x. Y is defined as the output domain; since we only judge whether each network connection is normal, two states are enough to express the problem, so we can define Y = {+1, -1}. When Y = +1 it is a normal connection, and when Y = -1 it is an abnormal connection. A simple SVM classification diagram is shown in Fig. 7.
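The back-propagation procedure described above (forward pass, MSE-based error, error propagated from the output layer back to the hidden layer, gradient-descent weight updates) can be sketched as follows. This is a toy illustration on synthetic two-feature data, not the 41-feature KDD input; layer sizes and the learning rate are assumptions:

```python
import numpy as np

# Toy multilayer perceptron trained with the back-propagation rule:
# gradient descent on the mean squared error (MSE).
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic "connections": label 1 when the feature sum is positive.
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float).reshape(-1, 1)

# One hidden layer: 2 inputs -> 5 hidden units -> 1 output.
W1 = rng.normal(scale=0.5, size=(2, 5)); b1 = np.zeros(5)
W2 = rng.normal(scale=0.5, size=(5, 1)); b2 = np.zeros(1)

mse0 = float(((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2).mean())

lr = 1.0
for epoch in range(2000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Error (derivative of the MSE) propagated from the output layer
    # back to the hidden layer.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates of the connection weights.
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0)

out = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
mse = float(((out - y) ** 2).mean())
acc = float(((out > 0.5) == (y > 0.5)).mean())
print(f"MSE {mse0:.3f} -> {mse:.3f}, training accuracy {acc:.2f}")
```

As the text notes, training is the slow part; once W1 and W2 are fixed, classifying a single record is just two small matrix products.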
Here there are only two classes, +1 and -1, for normal and abnormal connections. The SVM maps the sample space to a higher-dimensional, even infinite-dimensional, feature space by the use of a nonlinear mapping function φ(xi). The Support Vector Machine is a classification method possessing good learning ability for small samples, and it has been widely applied in many fields such as network intrusion detection, web page identification and face identification. SVM applied to intrusion detection offers such advantages as a high training rate and decision rate, insensitivity to the dimension of the input data, and continuous correction of the various parameters as the training data grows, which endows the system with a self-learning ability. Besides, it is also capable of solving many practical classification problems, such as problems involving small samples and non-linear problems. Therefore, the Support Vector Machine will become increasingly popular in network security.
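A minimal sketch of the Y = {+1, -1} classification just described, using a linear SVM trained by sub-gradient descent on the hinge loss. The synthetic two-feature data and the training constants are assumptions for illustration; a real system would train on extracted connection attributes such as those in KDD99:

```python
import numpy as np

# Linear SVM trained by sub-gradient descent on the hinge loss, with
# labels Y in {+1, -1} as in the text (+1 = normal, -1 = abnormal).
rng = np.random.default_rng(1)

X = np.vstack([rng.normal(loc=+1.5, size=(100, 2)),    # "normal"
               rng.normal(loc=-1.5, size=(100, 2))])   # "abnormal"
y = np.hstack([np.ones(100), -np.ones(100)])

w = np.zeros(2)
b = 0.0
lam, lr = 0.01, 0.1
for epoch in range(300):
    margins = y * (X @ w + b)
    viol = margins < 1                     # points violating the margin
    # Sub-gradient of  lam*||w||^2 + mean(max(0, 1 - y*(w.x + b)))
    grad_w = 2 * lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(X)
    grad_b = -y[viol].sum() / len(X)
    w -= lr * grad_w
    b -= lr * grad_b

acc = float((np.sign(X @ w + b) == y).mean())
print(f"training accuracy: {acc:.2f}")
```

Replacing the inner products with a kernel would give the nonlinear mapping φ(xi) mentioned above; the linear case is kept here for brevity.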
D. Decision Trees
A decision tree consists of nodes, leaves, and edges; an edge connects either two nodes or a node and a leaf. Leaves are labeled with a decision value for categorization of the data. Example intrusion data is shown in Table 2.

IP Port   System Name   Category
004020    Artemis       Normal
004020    Apollo        Intrusion
002210    Artemis       Normal
002210    Apollo        Intrusion
000010    Artemis       Normal
000010    Apollo        Normal

Table 2: Example Intrusion Data

intrusion = false;
if (System Name == Apollo) {
    if ((IP Port == 002210) || (IP Port == 004020)) {
        intrusion = true;
    }
}

Figure 10: Intrusion Detection Decision Tree

The rules produced by the decision trees are of a slightly different nature than those produced by genetic algorithms. The genetic algorithms use niching to create a population of unique rules, ideally each rule at its own local maximum. Decision trees, on the other hand, create a single rule with a number of different clauses. Each clause is equivalent to a single rule in the genetic algorithm population. Each element of training data falls into one of the clauses in the decision tree and hence is represented in the final rule. The 'completeness' of the final rule with respect to the training data is an advantage that decision trees have over genetic algorithms, and dispenses with the need for niching in the decision tree component.
The rules developed by the decision tree component, like the genetic algorithm rules, are sensor-level rules. These rules are used to filter single connections. Currently, the generated decision trees provide an elaboration of the 'Hot IP' list, a list which enumerates IP addresses with previous anomalous activity. This extension of the 'Hot IP' list allows an intrusion detection system to differentiate between normal and anomalous activity from a Hot IP address.
The NEDAA machine learning approach uses analyst-created training sets for rule development and analyst decision support. In the current implementation the training information is comprised of database views queried from the archived network events (Table 3). From this training data the machine learning modules generate rules of the form:

if <condition> then <action>

These rules can be compiled into the expert system for intrusive event detection, or used to simplify the analyst's task by summarizing large sets of training data into simple rule sets. It is important to note that the quality of the synthesized rules depends on the quality of the training set. Any classification errors in the training set will propagate into the resulting rule set.

Source IP        Dest IP          Source Port  Dest Port  Protocols  Intrusion
123.202.72.109   225.142.187.12   001360       000080     IP         True
123.202.72.109   225.142.187.12   001425       000080     IP         True
123.202.72.109   225.142.187.12   001488       000080     IP         True
123.202.72.109   225.142.187.12   001559       000080     IP         True
123.202.72.109   225.142.187.12   001624       000080     IP         True
255.142.147.75   150.216.191.119  001624       000080     IP         True
255.142.187.19   125.250.187.19   004207       000025     IP         True
233.167.15.65    255.142.187.12   004607       000025     IP         False
233.167.15.65    255.142.187.12   004690       000025     IP         False
139.61.51.70     255.142.187.12   001052       000021     IP         False
142.142.5.113    255.142.187.12   001572       000080     IP         False

Table 3: Example Training Data

The following table describes a comparison of the various machine learning techniques discussed so far, along with their pros and cons.
1. Neural Networks
   Pros: Ability to generalize from limited, noisy and incomplete data. Does not need expert knowledge and can find unknown or novel intrusions.
   Cons: Slow training process, so not suitable for real-time detection. Over-fitting may happen during neural network training.
2. Bayesian Networks
   Pros: Encodes probabilistic relationships among the variables of interest. Ability to incorporate both prior knowledge and data.
   Cons: Harder to handle continuous features. May not contain any good classifiers if the prior knowledge is wrong.
3. Support Vector Machine
   Pros: Better learning ability for small samples. High training rate and decision rate; insensitivity to the dimension of the input data.
   Cons: Training takes a long time. Mostly used as a binary classifier, which cannot give additional information about the detected type of attack.
4. Genetic Algorithm
   Pros: Capable of deriving best classification rules and selecting optimal parameters. Biologically inspired and employs an evolutionary algorithm.
   Cons: Genetic algorithms cannot assure constant optimization response times. Over-fitting.
5. Fuzzy Logic
   Pros: Reasoning is approximate rather than precise. Effective, especially against port scans and probes.
   Cons: High resource consumption involved. Identifying a reduced, relevant rule subset and updating rules dynamically at runtime is a difficult task.
6. Decision Trees
   Pros: Works well with huge data sets. High detection accuracy.
   Cons: Building a decision tree is a computationally intensive task.

Table 4: Comparison of Machine Learning Techniques

VIII. FUTURE PLANS
The goal of machine learning as applied to network intrusion detection is to generate a minimal rule set which can detect intrusion signatures generalized from previous activity. We want a minimal number of rules for rapid response and efficiency in the expert system. Rule complexity must mirror the complexity of attacks; as hackers become more skilled, our AI techniques need to create correspondingly complex rules. Our primary near-term goal is to extend the machine learning components to correlate and filter sets of connections as opposed to single connections. We will combine connection filtering with other information the IDS records (events, strings matched, etc.) to create complex rules based on data annotated by analysts. More complex rules will look for connection patterns which are extensive in both space and time. Ideally these rules will be able to detect the 'low and slow' attack. There are many ways to build rules that are based on a number of connections, events, etc. One way is to create rules which can cause other rules to be activated. This method, known as rule chaining, allows complex sequences of events to be detected [3]. Given a number of rules of the form if <predicate> then <action>, the execution of one rule's action during a cycle may trigger the successful evaluation of another rule's predicate on subsequent cycles. In this way rules may communicate with each other to detect complex behavior. Incoming connections may activate certain rules, which in turn may activate other rules. This process may continue indefinitely until a message triggers an alarm to the intrusion detection system, or until no new messages are created. We are investigating the alteration of the genetic algorithm component to create rules which chain in this manner.
The ID3 algorithm builds decision trees from an annotated data set. If the data set is augmented, a new decision tree must be built to encompass the changes. Building a decision tree is computationally intensive. In order to avoid this computation, new algorithms have been developed to update existing decision trees based on new information [2]. This allows us to build scalable decision trees, and thus continually refine the rule set as new information becomes known to the analyst. We will investigate scalable decision tree construction as an enhancement to our current system. Decision trees can also be used to cluster network events into similar categories based on common attributes.

IX. CHALLENGES AND DEPLOYMENT ISSUES
From the above study of the different anomaly detection approaches that are available today, it is clear that a black box anomaly detector may indeed be a utopian dream [50], for two main reasons: (1) the nature of the information that is fed to the anomaly detector could vary both in format and range, and (2) the nature of the anomaly, its frequency of occurrence and resource constraints clearly dictate the detection method of choice. In [25] the authors propose an initial prototype anomaly detector that transforms the input data into some common format before choosing the appropriate detection methodology. This is clearly an area where further research is an important contribution, especially for deployment in service provider environments where it is necessary to build multiple anomaly detectors to address the myriad monitoring requirements.
One of the challenges encountered when employing machine learning or statistical approaches is the multiple time scales on which different network events of interest occur. Capturing the characteristics of multi-time-scale anomalies is difficult, since the time scale of interest could be different for different anomaly types and could also vary within an anomaly type depending on the network conditions. In [38] the authors describe the influence of the regularity of data on the performance of a probabilistic detector. It was observed that false alarms increase as a function of the regularity of the data. The authors also show that the regularity of the data is not merely a function of user type or environment but also
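The rule-chaining idea above can be sketched as a tiny forward-chaining loop, where firing one rule's action posts a message that may satisfy another rule's predicate on a later cycle. This is an illustrative toy, not NEDAA's expert-system syntax; the rule conditions and message names are invented:

```python
# Tiny forward-chaining engine: each rule is an (if-predicate,
# then-action) pair over a working memory of messages. Firing one
# rule's action may post messages that let other rules fire on
# subsequent cycles, until an alarm message appears or no new
# messages are produced (quiescence).
rules = [
    # Several connections from one host -> record a scan message.
    (lambda m: sum(msg.startswith("connection:10.0.0.9") for msg in m) >= 3,
     lambda m: m.add("scan:10.0.0.9")),
    # A scan followed by a login attempt -> raise an alarm.
    (lambda m: "scan:10.0.0.9" in m and "login:10.0.0.9" in m,
     lambda m: m.add("alarm:10.0.0.9")),
]

def run(memory):
    while not any(msg.startswith("alarm:") for msg in memory):
        before = set(memory)
        for predicate, action in rules:
            if predicate(memory):
                action(memory)
        if memory == before:               # no new messages: stop
            break
    return memory

result = run({"connection:10.0.0.9:80", "connection:10.0.0.9:443",
              "connection:10.0.0.9:8080", "login:10.0.0.9"})
print(sorted(m for m in result if m.startswith("alarm:")))
```

Here the first rule's message enables the second rule's predicate, which is exactly the cross-rule communication the text describes.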
differs within user sessions and among users. Designing anomaly detectors that can adapt to the changing nature of input data is an extremely challenging task. Most anomaly detectors employed today are affected by inherent changes in the structure of the data being input to the detector, and this affects performance parameters such as the probability of hits and misses, and false alarm rates. Sampling strategies for multi-time-scale events under resource constraints are another area where improved scientific understanding is needed to aid the design of anomaly detection modules. The authors of [26] discovered that most sampling methods employed today introduce significant bias into measured data, thus possibly deteriorating the effectiveness of anomaly detection. Specifically, Mai et al. use packet traces obtained from a Tier-1 IP backbone with four sampling methods, including random and smart sampling. The sampled data is then used to detect volume anomalies and port scans with different algorithms, such as wavelet models and hypothesis testing. Significant bias is discovered in these commonly-used sampling techniques, suggesting possible bias in anomaly detection. Often, the detection of network anomalies requires the correlation of events across multiple correlated input data sets. Using statistical approaches it is challenging to capture the statistical dependencies observed in the raw data. When using streaming algorithms it is also impossible to capture these statistical dependencies unless there are rule-based engines that can correlate or couple queries from multiple streaming algorithms. Despite the challenges, the representation of these dependencies across multiple input data streams is necessary for the detailed diagnosis of network anomalies. To sum up, several open issues remain in improving the efficiency and feasibility of anomaly detection. One of the most urgent issues is to understand what information can best facilitate network anomaly detection. A second issue is to investigate the fundamental tradeoffs between the amount and complexity of the information available and the detection performance, so that computationally efficient real-time anomaly detection is feasible in practice. Another interesting problem is to systematically investigate each anomaly detection method and understand when and in what problem domains these methods perform well.
The task of finding attacks is fundamentally different from other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We believe that a significant part of the problem already originates in the premise, found in virtually any relevant textbook, that anomaly detection is suitable for finding novel attacks; we argue that this premise does not hold with the generality commonly implied. Rather, the strength of machine-learning tools is finding activity that is similar to something previously seen, without the need, however, to precisely describe that activity up front (as misuse detection must). We identify further characteristics of our domain that are not well aligned with the requirements of machine learning. These include: (i) a very high cost of errors; (ii) lack of training data; (iii) a semantic gap between results and their operational interpretation; (iv) enormous variability in input data; and (v) fundamental difficulties in conducting sound evaluation. While these challenges may not be surprising to those who have been working in the domain for some time, they can be easily lost on newcomers. To address them, we deem it crucial for any effective deployment to acquire deep, semantic insight into a system's capabilities and limitations, rather than treating the system as a black box, as is unfortunately often seen.

REFERENCES
[1] Roger Gallion, Daniel C. St. Clair, Chaman L. Sabharwal, W.E. Bond (1993). "Dynamic ID3: A Symbolic Learning Algorithm for Many-Valued Attribute Domains," SAC: 14-20.
[2] Johannes Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, Wei-Yin Loh (1999). "BOAT - Optimistic Decision Tree Construction," in Proceedings of the 1999 SIGMOD Conference.
[3] David E. Goldberg (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.
[4] Brad L. Miller, Michael J. Shaw (1996). "Genetic Algorithms with Dynamic Niche Sharing for Multimodal Function Optimization," IEEE International Conference on Evolutionary Computation: 786-791.
[5] Lyn Pierce, Stan Young (1998). "YAGATS: A Toolset for Genetic Manipulation of Finite-State Machines," Applied Research Laboratories Technical Report No. 99-1 (ARL-TD-99-1), Applied Research Laboratories, The University of Texas at Austin.
[6] Lyn Pierce, Chris Sinclair (1999). "YAGATS IR&D Report," Applied Research Laboratories Technical Report No. 98-1 (ARL-TD-98-1), Applied Research Laboratories, The University of Texas at Austin.
[7] J. Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
[8] Lane B. Warshaw, Lance Obermeyer, Daniel P. Miranker, Sara P. Matzner (1999). "VenusIDS: An Active Database Component for Intrusion Detection," submitted to the 1999 Annual Computer Security Applications Conference.
[9] Hood C.S., Ji C.: Proactive Network Fault Detection. IEEE Trans. Reliability, Vol. 46, No. 3, 333-341 (1997)
[10] Kline K., Nam S., Barford P., Plonka D., Ron A.: Traffic Anomaly Detection at Fine Time Scales with Bayes Nets. International Conference on Internet Monitoring and Protection (2008)
[11] Ide T., Kashima H.: Eigenspace-Based Anomaly Detection in Computer Systems. Proc. of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, 440-449 (2004)
[12] Ensafi R., Dehghanzadeh S., Mohammad R., Akbarzadeh T.: Optimizing Fuzzy K-Means for Network Anomaly Detection Using PSO. Computer Systems and Applications, IEEE/ACS International Conference, 686-693 (2008)
[13] Gu Y., McCallum A., Towsley D.: Detecting Anomalies in Network Traffic Using Maximum Entropy Estimation. Proc. of IMC (2005)
[14] Lall S., Sekar V., Ogihara M., Xu J., Zhang H.: Data Streaming Algorithms for Estimating Entropy of Network Traffic. Proc. of ACM SIGMETRICS (2006)
[15] Lee W., Xiang D.: Information-Theoretic Measures for Anomaly Detection. Proc. of IEEE Symposium on Security and Privacy (2001)
[16] Yang Y., Deng F., Yang H.: An Unsupervised Anomaly Detection Approach Using Subtractive Clustering and Hidden Markov Model. Communications and Networking in China, 313-316 (2007)
[17] Ahmed T., Coates M., Lakhina A.: Multivariate Online Anomaly Detection Using Kernel Recursive Least Squares. Proc. of the 26th IEEE International Conference on Computer Communications (2007)
[18] Duda R.O., Hart P., Stork D.: Pattern Classification, 2nd edn. John Wiley and Sons (2001)
[21] Andersen D., Feamster N., Bauer S., Balakrishnan H.: Topology Inference from BGP Routing Dynamics. Proc. SIGCOMM Internet Measurement Workshop, Marseille, France (2002)
[22] Lall S., Sekar V., Ogihara M., Xu J., Zhang H.: Data Streaming Algorithms for Estimating Entropy of Network Traffic. Proc. of ACM SIGMETRICS (2006)
[23] Yang Y., Deng F., Yang H.: An Unsupervised Anomaly Detection Approach Using Subtractive Clustering and Hidden Markov Model. Communications and Networking in China, 313-316 (2007)
[25] Venkataraman S., Caballero J., Song D., Blum A., Yates J.: Black-box Anomaly Detection: Is it Utopian? Proc. of the Fifth Workshop on Hot Topics in Networking (HotNets-V), Irvine, CA (2006)
[26] Mai J., Chuah C., Sridharan A., Ye T., Zang H.: Is Sampled Data Sufficient for Anomaly Detection? Proc. of the 6th ACM SIGCOMM Conference on Internet Measurement, Rio de Janeiro, Brazil, 165-176 (2006)