Anomaly Detection using Visualization and Machine Learning

Fumio Mizoguchi
Information Media Center, Science University of Tokyo
Noda, Chiba, 278-8510, Japan
mizo@imc.sut.ac.jp

0-7695-0798-9/00 $10.00 © 2000 IEEE

Abstract

Unauthorized access from inside or outside an organization has become a social problem in the last few years, making a system that can detect such accesses desirable. We therefore monitor normal activities using inductive logic programming (ILP), which is one form of machine learning, and detect anomalies. To ensure effective monitoring, we think the following two points must be considered. One point is automation of detection by the ILP system, which is a rule generation engine that always induces and updates effective rules. The other point is providing a visualization tool that reflects the induced rules in the detection system. This tool enables an administrator to understand detection situations. For automated detection, we provide the ILP system with an automatic parameter adjustment function. For the visualization tool, we apply the visualization technology of a hyperbolic tree.

1 Introduction

Unauthorized accesses to computers or networks are increasing every year. In addition, according to an investigation by CSI-FBI in 1999 [1], many organizations have been greatly damaged by internal and external crime. For this reason, there is increasing interest in intrusion detection. One type of intrusion detection is anomaly detection, which determines the normal use state using statistics or machine learning and detects exceptions. This includes research that turns usual telephone-call records into rules by machine learning and monitors such records to detect telephone fraud [2], and research that analyzes the sequence of command logs by using machine learning [3].

However, the existing anomaly detection systems are still in the examination stage regarding automatic generation of an optimum rule, automatic renewal of rules, and their method of application. We have resolved this problem by using an inductive logic programming (ILP) system with a function to automatically adjust parameters as the rule generation engine [4]. This paper also examines anomaly detection using the system.

It is difficult for an administrator to comprehend the voluminous audit data when trying to detect anomalies. Therefore, we propose a visual browser to support an administrator browsing analyzed results or raw data. Furthermore, we seek to detect anomalies visually through interaction between the browser and a visual tool, and to bring out the structure of audit data. As the visual tool, we use WebMap [5], a visual tool that adds the concept of height to a hyperbolic tree. This paper presents research relevant to the Cuckoo Egg Project [6], which seeks to detect intrusion by machine learning and visualization techniques.

This paper is organized as follows: Section 2 describes the architecture of our intrusion detection system, Section 3 explains the Visual browser that plays the most important role in our system, Section 4 describes the method of generating user profiles by using ILP, Section 5 provides the experimental results, and the final section contains our conclusions.

2 Intrusion Detection System with Visual browser

We designed the Intrusion Detection System with Visual browser (IDS-V). Figure 1 shows the system architecture of IDS-V, which consists of the Log database, Profiling engine, Log checker and Visual browser.

The Log checker monitors user events, which are collected in the Log database as an audit log. The Log database is the center of IDS-V. The log data is referred to by the Profiling engine and the Log browser.

The Profiling engine generates profiles of each user. We adopt Inductive Logic Programming (ILP) to induce rules from audit data. The generated profiles are sent to the Log checker and the Visual browser. The Log checker filters the user events using the profiles, and we can edit the profiles using the Visual browser. The Log checker detects anomalous events based on the profiles and notifies the Visual browser when an anomalous event occurs.
The Visual browser also gets the log data from the Log database and shows us the log structure and classification based on the profiles from the Profiling engine.

Figure 1: System architecture of IDS-V

The Log checker is able to reduce the log data; however, it is hard to generate perfect profiles for detecting anomalous events using machine learning. So, we should use rough rules rather than strict rules so as not to miss anomalous events. Therefore, there are some cases in which the Log checker reacts to a normal log as an anomalous event. In these cases, an administrator must judge whether the detected log is a real anomaly or not. The Visual browser helps an administrator to check the logs that leak through the Log checker.

Our IDS-V deals with command logs as user events and collects the following data set:

• User name (who)
• Host name (where)
• Executed command name (what)
• Parameters of the executed command (how)
• Date and time (when)

Thus, IDS-V can monitor user behavior in the network. The data set is stored in the Log database.

3 Visual browser

The Visual browser helps an administrator browse logs and clarifies the structure of audit data visually by using a hyperbolic tree as the visualization, in order to make browsing effective and to help with the troublesome task of finding the track of an intruder.

Figure 2 shows our Visual browser¹. The text field is for entering a query to get log data from the Log database, and there are three browsing buttons: "Get" gets the log data corresponding to the query, "Clear" clears the text field, and "Back" shows the previous log data.

Figure 2: Visual browser (labelled parts: Browsing Buttons, Visualize, Hyperbolic tree)

There are four tabbed panels in this system: Log browser, Visualize, Profile, and Property. The Property panel, which is not shown in Figure 2, is for setting properties such as the hostname of each server (Log database, Log checker and Profiling engine) and the timing for getting log data, such as getting the logs from the 60 seconds before the current time. The other panels are explained in the following sections.

¹ It is implemented in the Java language (JDK 1.2.2).

3.1 Log browser

The Log browser is for browsing logs, which are listed on the spreadsheet with the newest log first. Each displayed log is a user command: the first column is the user name, the second is the name of the host where the user executed the command, the third is the executed command, and the last column is the time when the user executed the command.

When an anomalous event is notified by the Log checker, the cuckoo icon starts moving and lets us know about the detection by the call of a cuckoo. The log associated with the anomalous event is listed in red.

The logs corresponding to the query in the text field are returned by the Log database. For example, to get the logs related to the user "hiraishi" AND the machine "imct451", we just set the following query:

usr(hiraishi) & hst(imct451)

We can also use "|" for the "OR" operation. If we leave the query empty, the database returns all logs within the configured time range.

When we click on text in the spreadsheet, the item is appended to the query and the matching logs are downloaded from the database automatically. So, we can browse and zoom in on the intended logs.

3.2 Visualize

The most important role of visualization is to enable us to easily understand the usage status of devices and networks. Therefore, it is important to design a browser for visualizing audit data. "Audit work bench [?]", developed at the University of California, is one such attempt. Although the objective of this paper is close to theirs in using a visual browser to audit data, we focus on extracting and visualizing information from data, as in data mining.

In order to visualize audit logs, we adopted the WebMap visualization technique [5], which maps Web URLs onto a hyperbolic plane. Our visualization basically builds the hierarchy in the following order: "User name", "Host name", "Command name" and "Parameters". However, at the "Command name" level, we divide the commands into four categories according to their characteristics:

• Un-ruled command
  This category is the most dangerous. Commands that do not conform to the profile are classified into this category. So, they may be commands that an intruder executes.

• Danger command
  Commands that imply the use of the network and commands that system administrators frequently use (su, finger, ftp, telnet, rlogin, alias, configure, nmap, etc.) tend to be used to intrude into the network and steal password data. We make this category in order to become aware of the execution of such commands.

• Safety command
  This category includes the commands corresponding to the profile and plainly safe commands like ls, cd, jlatex, java, etc.

• Unknown command
  Commands that belong to none of the above fall into this category. This includes many cases of typing mistakes and also includes users' own original commands.
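The four command categories can be illustrated with a short Python sketch. It is only one possible reading, not the system's actual implementation (which the paper says is written in Java): the paper does not state the precedence between categories, so the order used here is an assumption chosen to agree with the "su" example in Section 5, where "su" is un-ruled while absent from the profile and a danger command once a profile rule covers it. The names `LogEntry`, `matches_profile`, `categorize`, and the two command sets are hypothetical, and the profile is simplified to (host, command) pairs without parameters.

```python
from dataclasses import dataclass

# Hypothetical lists: the paper gives these commands only as examples.
DANGER_COMMANDS = {"su", "finger", "ftp", "telnet", "rlogin",
                   "alias", "configure", "nmap"}
PLAIN_SAFE_COMMANDS = {"ls", "cd", "jlatex", "java"}


@dataclass(frozen=True)
class LogEntry:
    user: str
    host: str
    command: str


def matches_profile(entry, profile):
    """Return True if a (host, command) rule covers the entry.

    `profile` is a set of (host, command) pairs; "*" as the host
    is treated as a wildcard that matches any machine.
    """
    return ((entry.host, entry.command) in profile
            or ("*", entry.command) in profile)


def categorize(entry, profile):
    """Assign one of the four categories (assumed precedence)."""
    in_profile = matches_profile(entry, profile)
    if entry.command in DANGER_COMMANDS:
        # Assumption: a listed dangerous command is "danger" when the
        # profile covers it and "un-ruled" when it does not.
        return "danger" if in_profile else "un-ruled"
    if in_profile or entry.command in PLAIN_SAFE_COMMANDS:
        return "safety"
    return "unknown"
```

With a profile such as `{("imct-r03", "kterm"), ("*", "java")}`, an `su` entry would be reported as un-ruled, `ls` as safety, and a mistyped `sl` as unknown. Counting entries per category for each user gives the branch counts that the administrator inspects in the tree.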

We should first check the un-ruled commands. If many branches have grown from the node of un-ruled commands, we should regard the user as an intruder. Next, if the danger command node has many branches and the user is not an administrator, it indicates that the user might be making illegal use of the system, and there is a possibility that an intruder is pretending to be the user. Even if there are many branches under safety command, it is not necessary to worry about it; this is quite normal. Instead, only a few commands there would indicate an anomaly. When many commands are under unknown command, the administrator should check whether the commands are mistyped or not.

A hyperbolic tree can be changed arbitrarily by mouse operations. The focus is changed by clicking the mouse on a node. Mouse dragging is accepted at any position, making it easy for users to change the viewpoint of the tree. We can also cut off the branches above the clicked node by using the "Cutoff" button. A new hyperbolic tree consisting only of the branches under the clicked node is then reconstructed. This allows us to make a tree focused on a special purpose, such as the command categories mentioned above. In this way, the administrator can browse and understand huge logs visually.

3.3 Profiling

The profiles generated by the Profiling engine are also shown in the text area of our Visual browser. The profiles are listed for each user in the following form:

--- hiraishi ---
imct-r03,kterm,&,1
*,java,

The first line indicates the user name, and the following lines contain the profiles of the user. The second line means that the user "hiraishi" uses the command "kterm" on the machine "imct-r03" and sets the parameter "&" as the first argument. The "*" is a wildcard; the third line indicates that he uses the "java" command on any machine.

Since the profiles generated by the Profiling engine are not complete for anomaly detection, we can edit and create profiles in the text area of our browser.

4 Generating Profiles

In order to induce the rules, we apply an inductive logic programming (ILP) system and use it as the rule generation engine, in other words, the profile generation engine. The ILP system learns from examples, enables relational learning, and produces results that can be used for incremental learning. The framework of ILP is represented as follows:

$B \land H \models E^{+}$
$B \land H \land E^{-} \not\models \Box$

where $E^{+}$ is the set of positive examples, $E^{-}$ is the set of negative examples, and $B$ is the background knowledge. Thus, the purpose of ILP is to induce the hypothesis $H$.

With methods such as decision trees, which are generally used in data mining, we must combine the target tables to generate relational rules. However, ILP can generate rules from distributed tables. It is natural for logs to be distributed, just as computers are distributed in the network. So, ILP is more suitable for log data analysis.

This learning system is implemented in the Java language and draws on both Muggleton's Inverse Entailment [7] and the GKS algorithm [8]. The system uses an objective function that includes the positive examples and the hypothesis length, takes the rate of negative examples (error rate), which the user sets, as a constraint, and determines the best hypothesis (rules) starting from the most specific case.

However, existing ILP systems, including this system, require search parameters to be set. Therefore, when users induce rules from data with an ILP system, they must adjust the parameters while considering the performance of the rules. This becomes an obstacle when we incorporate a learning system into the detection system and automate it. We therefore define a function by which the system automatically adjusts a search parameter by observing the value of a performance measure (parameter). We call this function "parameter tuning". When we implement this function, we use the results of re-sampling methods, namely Cross-validation and the Bootstrap method [9]. Our system then adjusts the parameter by learning from them. Thus, we seek to perform effective rule generation for large-scale audit data by harnessing the re-sampling technique. Furthermore, we use the induced rules as profiles.

4.1 Automatic parameter adjustment method

The ILP system that we apply requires an error rate to be set, which determines how far a hypothesis may cover negative examples. The performance (Table 1) of a rule is influenced by this setting, so we automatically adjust the error rate so that it maximizes the performance measure specified by a user.
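The role of the error rate as a constraint on hypothesis selection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's Java implementation: the error rate is taken here to be the fraction of negative examples a hypothesis covers, and the objective is assumed to be "positive examples covered minus hypothesis length", since the paper says only that the objective involves these two quantities. The names `Candidate`, `error_rate`, `score`, and `select_best` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    rule: str          # textual form of the hypothesis
    pos_covered: int   # number of positive examples covered
    neg_covered: int   # number of negative examples covered
    length: int        # number of literals in the hypothesis


def error_rate(c: Candidate, n_negatives: int) -> float:
    """Fraction of negative examples covered (assumed definition)."""
    return c.neg_covered / n_negatives if n_negatives else 0.0


def score(c: Candidate) -> int:
    """Assumed objective: reward coverage, penalize long rules."""
    return c.pos_covered - c.length


def select_best(candidates: Iterable[Candidate], n_negatives: int,
                max_error_rate: float) -> Optional[Candidate]:
    """Pick the best-scoring candidate that satisfies the constraint."""
    admissible = [c for c in candidates
                  if error_rate(c, n_negatives) <= max_error_rate]
    return max(admissible, key=score, default=None)
```

Raising `max_error_rate` admits more general rules that cover more positive examples at the cost of covering more negative ones, which is why the choice of this value affects the performance of the induced rule.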

Table 1: Performance of a rule

        | Hypothesis regards as a positive example | Hypothesis regards as a negative example
   E+   | True Positive                            | False Negative
   E-   | False Positive                           | True Negative

Sensitivity = True Positive (TP) / E+
Specificity = True Negative (TN) / E-
Accuracy = (TP + TN) / (E+ + E-)

At this time, the remaining parameters, which are not selected for maximization, can be given as constraint conditions. The parameters that the system can maximize are accuracy, sensitivity, and specificity. These parameters are obtained from the re-sampling technique. The parameters are such that the error rate tends to increase when sensitivity increases and decrease when specificity increases. The error rate that yields the best parameter values is obtained by binary search, considering the relation between the error rate and the parameter.

A practical processing procedure is as follows. First, a user selects a parameter (Table 1) to maximize and sets the remaining parameters as constraint conditions. The system then generates a training set and a test set by a re-sampling process such as Cross-validation or Bootstrap. Next, it learns from the training set, evaluates on the test set, computes the performance evaluation shown in Table 1, and checks whether the ILP system satisfies the constraints. If the constraints are satisfied, the system updates the error rate so that the target parameter may be maximized (maximization process). If the constraints are not satisfied, the system updates the error rate so that the constraints are satisfied (adjustment process). The system then updates the error rate repeatedly until the parameters become constant.

5 Experimental results

In order to clarify the effectiveness of our visualization and profiling, we simulated the situation in which an intruder pretends to be a user. Using the logs of two users A and B, we added logs of user A to the logs of user B little by little.

Users A and B use about 900 commands in one month. The main work of user A is editing text and Java programming. In contrast, user B is the administrator of our network. He often uses commands for system administration and writes programs in the C language. So, user A tends to behave like an intruder.

We derived profiles with ILP in the three methods:

• Logs of the target user are set as the positive examples; in contrast, logs of other users are set as the negative examples, and we gave the usage of commands and the login status as the background knowledge. In this case, rules emphasizing how the target user uses commands are derived. For example,

  userA(X) :- login(X, "imctX", Y), type(Y, "mule,&,1").

  On the machine "imctX", user A uses the command "mule" with the parameter "&" as the first argument.

• Rules are induced from only the logs of the target user. We gave the usage of commands as the background knowledge. In this case, rules that describe what the target user frequently uses are derived. For example,

  userA(X) :- type(X, "ls,-al,1").

  This means that user A uses the command "ls" with the parameter "-al" as the first argument.

When a rule that incorporates the login status is generated, we can learn that a user logs in from a specific host to another certain host in a fixed manner, such as telnet login. When we analyzed command logs, we found that the order of the arguments and the way commands are used are unique to the user.

Figure 3 shows the change of the tree structure. First we generated profiles of the user A using 1000 commands. The tree structure of the first step is (A) in Figure 3. (B) is the structure after adding 20 commands. At this time, we can easily find the change of the tree structure: the number of branches at the un-ruled command node has increased. Since the user B is not a system administrator, there is no such command in his profiles and the command "su" is categorized as an un-ruled command.

(C) is the result after adding 100 commands of the user A. The "su" command is now in the danger command category. This means that a profile stating that the user B uses the "su" command has been generated by ILP. If an administrator does not notice the intrusion and ILP induces rules from logs that contain anomalous events, ILP may generate rules infected by the anomalous logs. As a result, the profiles can no longer work effectively.

In our experiment, the 100 logs correspond to the logs collected over 3 days. The administrator should check the logs at least once every 3 days.

However, since we set the four command categories, even if rules infected by the anomalous logs are generated, the commands associated with such rules are classified as danger command or unknown command. We can find the trace of intrusion by checking their branches.

Figure 3: Changing of tree structure

6 Conclusion

We designed an intrusion detection system with a visual browser that helps an administrator browse logs and clarifies the structure of audit data visually by using a hyperbolic tree as the visualization, in order to make browsing effective and to help with the troublesome task of finding the track of an intruder. We implemented the following two functions:

• Visualization function that reflects rules induced by the ILP system
• Automatic parameter adjustment for search in the ILP system

By implementing these two functions, we attempted to automate the system and facilitate visual perusal of a situation. In short, we proposed a framework of anomaly detection using such technologies.

References

[1] CSI-FBI, 1999 CSI-FBI Survey Results, http://www.gocsi.com/summary.htm, 1999.
[2] T. Fawcett and F. Provost, Adaptive fraud detection, Data Mining and Knowledge Discovery 1, 291-316, 1997.
[3] B.D. Davison and H. Hirsh, Predicting sequences of user actions, In Predicting the Future: AI Approaches to Time Series Problems, AAAI Press, pp.5-12, WS-98-02, 1998.
[4] F. Mizoguchi, H. Ohwada, J. Tuchiya, Development of agent type datamining system, Information-Technology Promotion Agency, Japan, 1999.
[5] H. Sawai, H. Ohwada, F. Mizoguchi, Incorporating a navigation tool into a browser for mining WWW information, The First International Conference on Discovery Science, 1998.
[6] F. Mizoguchi, Visual browser for intrusion detection - Cuckoo egg project, CSS'99, 1999.
[7] S. Muggleton, Inverse Entailment and Progol, New Generation Computing, 13:245-286, 1995.
[8] F. Mizoguchi and H. Ohwada, Using Inductive Logic Programming for Constraint Acquisition in Constraint-based Problem Solving, Proc. of the 5th International Workshop on ILP, 297-322, 1995.
[9] M. Weiss and A. Kulikowski, Computer Systems That Learn, Morgan Kaufmann Publishers, 1991.