
System Anomaly Detection: Mining Firewall Logs

Robert Winding, Timothy Wright, and Michael Chapple

Manuscript received June 20, 2006. Robert Winding, University of Notre Dame, 232 Info Technology Ctr, Notre Dame, IN 46556, [email protected]. Timothy Wright, University of Notre Dame, 402 Grace Hall, Notre Dame, IN 46556, [email protected]. Michael Chapple, University of Notre Dame, 233 Info Technology Ctr, Notre Dame, IN 46556, [email protected].
1-4244-0423-1/06/$20.00 ©2006 IEEE

Abstract-This paper describes an application of data mining and machine learning to discovering network traffic anomalies in firewall logs. A variety of issues and problems can occur with systems that are protected by firewalls: these systems can be improperly configured, operate unexpected services, or fall victim to intrusion attempts. Firewall logs often contain hundreds of thousands of audit entries per day. It is often easy to use these records for forensics if one knows that something happened and when; however, it can be burdensome to review the logs manually for anomalies. This paper uses data mining techniques to analyze network traffic, based on firewall audit logs, to determine whether statistical analysis of the logs can be used to identify anomalies.

Index Terms-Data mining, Firewall log analysis, Intrusion detection

I. INTRODUCTION

This paper describes an application of data mining and machine learning to discovering network traffic anomalies by analyzing firewall logs. Network security systems like firewalls and Intrusion Detection Systems (IDS) often produce huge amounts of log data. Hidden in this data may be valuable knowledge regarding the configuration, management, and security of protected devices and systems.

Security issues also arise from mis-configured machines and machines that run undocumented services. An initial step in a network-based attack is to seek out such machines through some level of reconnaissance. A target may be interesting for a variety of reasons depending on the attacker's motivations: a system may appear vulnerable to an exploit the attacker knows, or it may contain sensitive data that has some theft value. Reconnaissance often takes the form of a scanning process or failed access attempts. Through the discovery of network traffic relationships and patterns in log data, a firewall analyst can detect possible intrusion attempts, system mis-configurations, and so on. In turn, this results in more effective management of protected systems and controls.

II. RELATED WORK

There has been considerable work on anomaly detection with IDSs and firewalls. Stolfo [1] applies data mining techniques to capture system audit logs and analyze them for indications of misuse. Eskin [2] describes techniques for misuse and anomaly detection in network traffic and system call stacks; some of the features utilized are based on TCP parameters such as protocol, bytes transferred, and connection duration. (In this paper, we look at some of the same features, but in aggregation across many connections, with the traffic governed by a firewall.) Caruso and Malerba [6] describe using data mining with clustering on firewall audit logs. Their work sticks to accepted traffic, performing analysis via features derived from different aggregations; the goal is to model regularities in the traffic and then use that model to flag deviations.

Although this project was motivated by, and adapted, some of the approaches investigated in this prior work, we offer some potentially unique ideas as well. Specifically, we investigate the way firewall data is reduced and how the aggregate data is used as input to clustering and machine learning algorithms.

III. METHODOLOGY

This research employs data mining techniques to identify anomalous system traffic caused by mis-configuration or possible intrusion activity. The technique uses the following method to apply machine learning to the analysis of firewall logs:

* Acquire, prepare, and clean data
* Identify methods of aggregating and reducing the data
* Select data features
* Utilize machine learning techniques such as clustering and classification
* Analyze and verify results to draw conclusions
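The five steps above can be sketched end to end in a few lines of Python. This is a minimal illustration only, not the authors' actual PERL/SQL/WEKA tooling; the function and field names are ours, and the records come from the sanitized sample in Section IV.

```python
from collections import defaultdict

# Record layout (from Section IV):
# datetime,action,src_ip,src_port,dst_ip,dst_port,proto,bytes_srv,bytes_cli
SAMPLE = [
    "03/10/2006 02:00:35,PERMIT,10.10.222.11,16285,10.10.224.61,80,6,33,3467",
    "03/10/2006 02:00:35,PERMIT,10.10.222.11,16288,10.10.234.49,443,6,440,11671",
    "03/10/2006 02:00:35,DENY,192.168.250.172,4212,192.168.4.164,80,6,0,423",
]

def parse(line):                     # step 1: acquire, prepare, clean
    f = line.split(",")
    return {"action": f[1], "src": f[2], "dst": f[4], "dport": f[5]}

def aggregate(records):              # step 2: aggregate and reduce
    dsts = defaultdict(set)          # source IP -> set of destination IPs
    for r in records:
        dsts[r["src"]].add(r["dst"])
    return dsts

def features(dsts):                  # step 3: select a feature
    return {src: len(ips) for src, ips in dsts.items()}

feats = features(aggregate(parse(l) for l in SAMPLE))
# Steps 4 and 5 (clustering/classification and analysis) operate on
# feature vectors like `feats`.
print(feats)   # {'10.10.222.11': 2, '192.168.250.172': 1}
```

A scanning host would stand out in `feats` as a source IP with an abnormally large destination count, which is exactly the first boxplot analysis performed later in the paper.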
Clustering and classification techniques are used to see whether features and/or their aggregations (e.g., source IP, destination IP, destination port) can be used to identify machines exhibiting anomalous behavior. If we assume that systems are configured properly and behave in predictable ways, then outliers such as network scans, probes, and mis-configurations will be detectable by clustering.

The data used for this experiment was taken from a production university datacenter firewall. The datacenter firewall employs extensive logging and captures audit data on every connection and attempted connection.

IV. DATA ACQUISITION AND PREPARATION

A PERL script was used to extract the following data from the firewall logs: date and time of connection attempt, permit/deny, source IP, source port, destination IP, destination port, protocol (e.g., TCP/UDP/ICMP), bytes transferred to server, and bytes transferred to client. Below is a sanitized sample of the data records.

03/10/2006 02:00:35,PERMIT,10.10.222.11,16285,10.10.224.61,80,6,33,3467
03/10/2006 02:00:35,PERMIT,10.10.222.11,16288,10.10.234.49,443,6,440,11671
03/10/2006 02:00:35,DENY,192.168.250.172,4212,192.168.4.164,80,6,0,423
03/10/2006 02:00:36,PERMIT,192.168.250.172,4210,10.10.224.45,8080,6,0,0

This raw data and/or its aggregations are used to determine features that, in turn, can detect anomalous patterns. The log data required some preparation to get to this point, which is to be expected. For example, bugs were discovered in the firewall's audit logging system, which affected data acquisition. Also, redundant log entries were generated for denied traffic. Finally, there was an issue with time formats not being compatible with our database's date/time data type. Fortunately, our PERL filter program and other simple techniques were able to deal with each of these issues.

With the log extraction and preparation, the data acquisition step is complete. To analyze the relationships between data records, we experimented with loading subsets of data into a relational database. Using SQL, a search for various aggregate relationships was carried out. This proved to be well aligned with the goals of detecting anomalous traffic. For example, it was fairly easy to observe port scan activity by reviewing source IPs associated with a large number of destination IPs over a span of time. This kind of statistical analysis is also used to determine useful features from aggregations of the data records.

V. IDENTIFICATION, DISCOVERY, AND ANALYSIS OF FEATURES

Little information can be gained from analyzing individual records. Often, the presence of intrusion behavior or other anomalous activity can only be detected by looking at the aggregate behavior of related records. For this reason, the following aggregate records were prepared for analysis as feature candidates:

* Repeated attempts of access by a single IP
* Number of source IPs per destination IP
* Number of destination IPs per source IP
* Number of destination ports on a given source/destination IP pair
* Unique IPs
* Maximum activity from a single IP
* Failed and successful connections from the same IP
* Attempts to access invalid IPs
* Inbound/outbound bytes per unit time

Firewall logs for one day were processed into the flow-like data format described in Section IV. This yielded just over one million records, which were imported into a database for further analysis. SQL queries were developed to aggregate the data in accordance with the feature candidates. The following feature vectors were derived from these candidates:

* Source IP address, number of destination IP addresses
* Destination IP, number of failed access attempts
* Source IP, destination IP
* Destination perspective vector (destination IP, count of source IPs, number of successful accesses, number of failed accesses, count of destination ports, number of bytes transferred inbound, number of bytes transferred outbound)

These feature vectors formed the basis for the analysis of anomalies. Clustering and classification models were used to explore the utility of these vectors. The process and results are described below in Section VI.
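The SQL aggregation described above, producing one destination-perspective row per destination IP, can be sketched with an in-memory SQLite database. The schema, table, and column names here are illustrative assumptions, not the paper's actual queries, and the rows are from the sanitized sample plus one invented DENY record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fw_log (
    action TEXT, src_ip TEXT, src_port INT,
    dst_ip TEXT, dst_port INT, bytes_in INT, bytes_out INT)""")
rows = [
    ("PERMIT", "10.10.222.11",    16285, "10.10.224.61",  80,  33,  3467),
    ("PERMIT", "10.10.222.11",    16288, "10.10.234.49",  443, 440, 11671),
    ("DENY",   "192.168.250.172", 4212,  "192.168.4.164", 80,  0,   423),
    ("DENY",   "192.168.250.172", 4213,  "192.168.4.164", 22,  0,   0),
]
conn.executemany("INSERT INTO fw_log VALUES (?,?,?,?,?,?,?)", rows)

# Destination perspective vector: one aggregate row per destination IP
# (count of sources, successes, failures, distinct ports, byte totals).
cur = conn.execute("""
    SELECT dst_ip,
           COUNT(DISTINCT src_ip)   AS n_src,
           SUM(action = 'PERMIT')   AS n_ok,
           SUM(action = 'DENY')     AS n_fail,
           COUNT(DISTINCT dst_port) AS n_ports,
           SUM(bytes_in), SUM(bytes_out)
    FROM fw_log GROUP BY dst_ip""")
for row in cur:
    print(row)   # one destination-perspective vector per destination IP
```

SQLite evaluates the comparison `action = 'DENY'` as 0 or 1, so `SUM` over it counts matching rows; this is how a single GROUP BY query can reduce a day of raw records to one feature vector per destination IP.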
VI. ANALYSIS

Several analysis techniques were employed to analyze the firewall log data. Some features were analyzed with boxplots to look at their distributions and to spot and analyze outliers. Clustering was performed on the destination perspective vector, which also led to creating a classification model using JRIP in WEKA. WEKA is a data mining tool from the University of Waikato; JRIP is one of the many machine learning algorithms it supports. The techniques, and how to apply them with WEKA, can be found in the book "Data Mining: Practical Machine Learning Tools and Techniques" [3].

The first analysis was to boxplot the count of destination IP addresses per source IP address. Experience has often shown machines that are scanning or compromised to be associated with an abnormal number of destination addresses. The boxplot indicated the relationship between source IP and number of destination IP addresses for the sample data. An extreme outlier with 1499 destination IPs was removed (investigation showed that this was a monitoring system probing/scanning some 1499 devices for availability); this made the other outliers easier to see in the plot.

[Figure: boxplot of destination IP address count per source IP; plot details not recoverable from this copy.]

This distribution proves useful by immediately identifying six IPs from which port scans were taking place; the remaining IPs appear to be benign traffic. Investigation revealed that the hosts associated with the outliers were, from the most extreme:

* A system performing web port scans
* Two systems performing SSH port scans
* A legitimate monitoring server
* Three workstations which likely had a worm/virus
* A load balancing switch's health check probes
* A web statistics logging server that was running a recursive DNS server to avoid impacting the performance of general University DNS service

This simple analysis draws attention to fewer than a dozen machines, some of which were scanning the network and some of which may have been compromised. Recall that these IPs were identified from a pool of 13606 IP addresses.

A second boxplot analysis examined repeated, denied access attempts to destination IPs.

[Figure: boxplot and sample of denied records; details not recoverable from this copy.]

Once more, investigation revealed the hosts associated with the outliers, from most extreme to least extreme:

* A defunct monitoring service
* Two Active Directory servers
* Three well known services
* A monitoring system
* A gateway address that should not be known
* Two well known services

A further inspection of the first outlier revealed two internal servers (privately addressed) attempting to talk to the defunct monitoring service. The remaining activity was found to correspond with the previously detected SSH and web scanners. Additional analysis indicated that most of these outliers were due to mis-configuration, although well known services operating on these hosts may have been probed for reconnaissance purposes; further investigation is required to confirm this.

The aggregation queries used in the prior analysis were combined to form a destination IP perspective vector. This data was reduced from over one million raw data records to 2401 aggregate records (the number of unique destination IPs).
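The boxplot analyses above amount to flagging values beyond the upper whisker, conventionally Q3 + 1.5 * IQR. A stdlib-only sketch of that rule on the "destination IPs per source IP" feature follows; the IP addresses and counts are invented for illustration, not the paper's data.

```python
import statistics

# Invented per-source destination counts; the last two play the role of
# the scanning/monitoring hosts the paper found.
counts = {"10.10.1.5": 1, "10.10.1.9": 2, "10.10.2.7": 2, "10.10.3.1": 3,
          "10.10.4.2": 3, "10.10.5.3": 3, "10.10.6.4": 4, "10.10.7.6": 5,
          "10.10.9.9": 212, "10.10.8.8": 1499}

values = sorted(counts.values())
q1, _, q3 = statistics.quantiles(values, n=4)   # quartile cut points
iqr = q3 - q1
fence = q3 + 1.5 * iqr                          # upper boxplot whisker

outliers = {ip: c for ip, c in counts.items() if c > fence}
print(outliers)   # flags the heavy-fan-out hosts
```

Note that `statistics.quantiles` uses the "exclusive" method by default, which interpolates slightly differently from some plotting packages; the fence value shifts a little with the method, but the extreme scanners are flagged either way.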
Clustering was then performed on the destination IP perspective vectors to see what kinds of clusters might be detected. Using k=4, the following cluster distribution was observed:

Cluster 0: 5 instances (0%)
Cluster 1: 2016 instances (84%)
Cluster 2: 36 instances (1%)
Cluster 3: 344 instances (14%)

[Figure: visualizations of two key cluster views; details not recoverable from this copy.]

Initial clustering resulted in greater than one million raw log records being filtered down to a handful of outlier IPs. Investigation of the individual IPs found that several were identified because of architectural mis-configuration. Some may have been the target of intrusion activity, but further analysis is required.

Only IPs from cluster0 and cluster2 were investigated; these clusters represent the most discernable outliers from the destination IP perspective. Unfortunately, these results are not immediately as compelling as some of the boxplot analysis. Some of the IPs identified in these clusters were also present in the boxplot analysis of repeated, denied access attempts to destination IPs.

Given the promise of some of these results, we investigated using WEKA to develop a classifier for the clusters that were identified in the prior analysis. Conveniently, using WEKA to do cluster analysis of an ARFF file allows you to save the result as a classified ARFF file, where the cluster label is the class. We used this file as the training data and used the JRIP rule generator in WEKA to build a classifier. The rules and accuracy on the training data are as follows:
Classifier model (full training set)

JRIP rules:

(Number repeated denied access >= 572) => Cluster=cluster0 (6.0/1.0)
(Number dst ports >= 4) and (Number dst ports >= 5) => Cluster=cluster2 (28.0/0.0)
(Bytes to Client >= 92010658) and (Number dst ports >= 4) => Cluster=cluster2 (7.0/1.0)
(Bytes to Client >= 1091343901) => Cluster=cluster2 (2.0/0.0)
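The JRIP rule list above can be re-expressed as an ordinary decision function, which makes its behavior easy to inspect. The fall-through label below is our assumption: JRIP emits an implicit default rule, and cluster1, the majority cluster, is the plausible default here.

```python
def classify(denied, dst_ports, bytes_to_client):
    """Assign a destination-IP perspective vector to a cluster,
    applying the paper's JRIP rules in order."""
    if denied >= 572:
        return "cluster0"
    if dst_ports >= 4 and dst_ports >= 5:   # as printed; collapses to >= 5
        return "cluster2"
    if bytes_to_client >= 92010658 and dst_ports >= 4:
        return "cluster2"
    if bytes_to_client >= 1091343901:
        return "cluster2"
    return "cluster1"                       # assumed default (majority) rule

print(classify(denied=600, dst_ports=2, bytes_to_client=0))     # cluster0
print(classify(denied=0, dst_ports=6, bytes_to_client=1000))    # cluster2
print(classify(denied=0, dst_ports=1, bytes_to_client=5000))    # cluster1
```

Note the oddity preserved from the printed rules: the second rule tests `dst_ports >= 4` and `dst_ports >= 5` together, which is logically just `>= 5`; rule learners sometimes emit such redundant conjuncts.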
Evaluation on training set

=== Summary ===

Correctly Classified Instances 2399 (99.9167 %)
Incorrectly Classified Instances 2 (0.0833 %)
Kappa statistic 0.997
Mean absolute error 0.0007
Root mean squared error 0.0188
Relative absolute error 0.512 %
Root relative squared error 7.1658 %
Total Number of Instances 2401

It is interesting to note that there is apparently a high correlation between the number of destination ports and the number of source IPs connecting to a destination IP (as indicated by the cluster visualization). We were somewhat surprised to see this rule. Without accurately labeled data, a verifiable analysis against a testing set cannot be accomplished. However, since some of the IPs (i.e., hosts) have systemic problems, it is reasonable to predict that they will be classified in the same way using a different day's firewall logs. Given that notion, we took the next day's logs (about 650K records), converted them into destination IP perspective feature vectors, and performed the same analysis. This was done by assigning all the records the same cluster label (for WEKA compatibility) and using WEKA to generate JRIP classification predictions on the test data from the model generated from the training data. On the two adjacent days' logs, five IPs were identified as cluster0, and four out of five were identically classified in the test and training data. The fifth IP in the test data proved to be a host identical in function to one of the others (a redundant server for high availability). This host's IP was correctly classified in the test data set but was misclassified in the training set: the clustering algorithm had mislabeled it.

The results of these experiments are promising but not entirely conclusive. While more analysis is needed to determine the accuracy and relevance of the classifications regarding the flagging of interesting IPs, it seems that there is promise to using this technique.

VII. CONCLUSIONS AND FUTURE WORK

The techniques described in this paper were used to rapidly identify a number of machines associated with anomalous network activity. A significant number of the identified machines were victims of nefarious activity or mis-configuration. The ability to control these issues is a challenge for network, server, and device management, in part because of the vast amount of log data involved. Through the use of data mining processes and techniques, we have demonstrated an accurate way of handling burdensome firewall logs.

In the scope of this effort a full analysis of the techniques could not be performed. However, additional analysis is warranted and may lead to promising results. No effort was made to identify false negatives by correlating classifications with other systems that may detect similar activity, such as IDS or netflow systems. This would be an important part of future investigation into the utility of applying these techniques to firewall log data.

There may also be significant benefit to correlating the features derived from the firewall logs with other features obtained from netflow and IDS data. This may lead to a richer feature vector that has greater utility than the data acquired during this project.

ACKNOWLEDGMENTS

Dr. Nitesh Chawla, Research Assistant Professor, Computer Science and Engineering, University of Notre Dame.
David Alan Cieslak, Teaching Assistant, CSE 60647 - Data Mining, University of Notre Dame.

REFERENCES

[1] Salvatore J. Stolfo, Wenke Lee, Philip K. Chan, Wei Fan, and Eleazar Eskin, "Data mining-based intrusion detectors: an overview of the Columbia IDS project," ACM, 2001.
[2] Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Salvatore Stolfo, "A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data," Data Mining for Security Applications, Kluwer, 2002.
[3] Ian H. Witten and Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques," Elsevier, Inc., 2005.
[4] Stuart Berman, http://www.itarchitect.com/..., IT Architect, 03/2006.
[5] Kevin Liston, SANS Institute.
[6] C. Caruso and D. Malerba, http://www.di.uniba.it/~caruso/documenti/04dmV.pdf, Dipartimento di Informatica, University of Bari, Italy.
