IDS MachineLearning
Abstract- Many computer-based devices are now connected to the Internet. These devices are widely used to manage critical infrastructure such as energy, aviation, mining, banking and transportation. The data and information transmitted over the Internet infrastructure have very high strategic and economic value. As the value of the data and information increases, so do the threats and attacks against them. Statistical data show a significant increase in threats to cyber security. The Government is aware of these threats and is responding with a cyber security system that can perform early detection of threats and attacks on the Internet.

The success of a nation's cyber security system depends on the extent to which the nation is able to produce its own cyber defense system independently. Independence is manifested in the ability to process, analyze and act to prevent threats or attacks originating from within and outside the country. One of the systems that can be developed independently is an Intrusion Detection System (IDS), which is very useful for early detection of cyber threats and attacks.

The advantage of an IDS is determined by its ability to detect cyber attacks with few false alarms. This study examines how to implement a combination of machine-learning methods in the IDS to improve the accuracy of attack detection. The study is expected to produce a prototype IDS equipped with a combination of machine-learning methods to improve the accuracy in detecting various attacks. The addition of the machine-learning feature is expected to identify the specific characteristics of the attacks that occur in the Indonesian Internet network. The novel methods and implementation techniques used, together with the national strategic value, are the unique value and advantages of this research.

Keywords- intrusion detection system, machine-learning, support vector machine, threat, attack.

I. INTRODUCTION

The Internet has become one of the most powerful and widely available communications mediums on earth, and our reliance on it increases daily. Governments, corporations, banks, and schools conduct their day-to-day business over the Internet. With such widespread use, the data that resides on and flows across the network varies from banking and securities transactions to medical records, proprietary data, and personal correspondence.

The Internet is easy and cheap to access, but the systems attached to it lack a corresponding ease of administration. As a result, many Internet systems are not securely configured. Additionally, the underlying network protocols that support Internet communication are insecure, and few applications make use of the limited security protections that are currently available.

The combination of the data available on the network and the difficulties involved in protecting the data securely make Internet systems vulnerable attack targets. It is not uncommon to see articles in the media referring to Internet intruder activities.

Internet security/protection has become necessary because attacks frequently cause the compromise of critical/sensitive data. Incidents involving viruses, worms, Trojan horses, spyware, and other forms of malicious code have disrupted or damaged millions of systems and networks around the world. Heightened concerns about national security and exposure of personally identifiable information (PII) are also raising awareness of the possible effects of computer-based attacks. These events, and many more, make the case daily for responding quickly and efficiently when computer security defenses are breached.

In response to these computer-based attacks, a number of security devices and software such as firewalls, Intrusion Detection Systems (IDS), and Security Information and Event Management (SIEM) are used to monitor the network in conjunction with the Computer Security Incident Response Team (CSIRT), which is responsible for ensuring the Confidentiality, Integrity and Availability of network services. Attack or action that could jeopardize the Confidentiality, Integrity, and Availability in
negatives on the IDS make it difficult for network administrators to decide on or handle the IDS reports. Considering the workload of a network administrator, it is unlikely that they can keep up with new types of attacks and write a new rule/signature every time. Therefore, there was an idea to develop our own IDS system that is able to increase the level

One approach that is widely used to improve the effectiveness of an IDS, in terms of attack characterization and adaptation, is machine learning. With machine-learning techniques, an IDS can learn to recognize signatures and profiles automatically from a training dataset and can then predict impending attacks. With the application of machine-learning techniques, the IDS is expected to be more accurate in recognizing attacks and to adapt easily to a changing network environment. Several machine-learning methods can be used for the classification and clustering of attack data in an IDS. Widely known methods include Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Artificial Neural Network (ANN). Some researchers have used these methods to classify and cluster attack data with fairly good results. Because each classification method has strengths and weaknesses, the accuracy of classification can be improved by taking advantage of the strengths of each method while avoiding its weaknesses.

Several previous studies have shown that each machine-learning method has advantages in detecting certain types of attacks. For this reason, in this research we design a combination of several machine-learning methods to obtain an optimal design to be implemented in an IDS to detect various types of attacks. However, this paper presents only the results of the research done so far using the SVM method and selected features of the network dataset.

In this paper we present the results of our initial research on implementing the SVM method in an IDS to detect R2L attacks. In this research we propose a combination of 8 features from Alimudin and 24 features from Natesan for R2L and U2R detection, with the redundant features removed. This research uses the SVM method because Mukkamala's research stated that SVM is better than Artificial Neural Network and other machine-learning methods. The method used in this research differs from the previous research studies,

SVM and Dempster Shafer theory. First, KNN and SVM are used to classify attack data, and then the outputs of those two methods are combined with Dempster Shafer theory, using 8 features from KDDCUP as training data. He stated that the performance of the classification process is good overall, but the results are not optimal for detecting R2L and U2R

In 1999, Mukkamala researched intrusion detection using SVM and Neural Network methods with KDDCUP 1999 data as the training data. The results of this research show that the SVM method is more accurate than the Neural Network method [4].

P. Natesan researched a Multi Stage Filter Using Enhanced Adaboost for Network Intrusion Detection, which divides intrusion detection into three stages: DoS detection, Probe detection, and R2L and U2R detection. If no attack is detected by any of these stages, the packet data is considered normal. Each detection stage uses features that differ in number and kind [3].

Agarwal and Joshi proposed a two-stage general-to-specific framework for learning a rule-based model (PNrule) to learn classifier models on a data set that has widely different class distributions in the training data set. The PNrule technique was evaluated with the KDD testing data set, which contains many new R2L attacks that are not present in the KDD training data set. The proposed model was able to detect only 10.7% of attacks in the R2L attack category, although an insignificant number of false alarms were generated. The obvious disadvantage of this algorithm is that the rules are automatically generated, leading to data set dependency with negligible generalization capability [5].

Levin used the Kernel Miner tool on the KDD data set. Kernel Miner is a data-mining tool for classifying data and predicting new cases using automatically generated decision trees. By using this tool and the KDD data set, Levin created a set of locally optimal decision trees (called the decision forest), from which an optimal subset of trees (called the sub-forest) was selected for predicting new cases. Levin used only 10% of the KDD training data set, randomly sampled from the entire training data set. A multi-class detection approach was used to detect different attack categories in the KDD data set. The final trees gave very high detection rates for all classes including the R2L in the entire
training data set. The proposed classifier achieved only 7.32% detection and 2.5% false alarm rates for the R2L attacks in the KDD testing data set [6].

Yeung and Chow proposed a novelty detection approach using non-parametric density estimation based on Parzen-window estimators with Gaussian kernels to build an intrusion detection system using normal data only. This novelty detection approach was employed to detect attack categories in the KDD data set. 30,000 randomly sampled normal records from the KDD training data set were used to estimate the density of the model. Another 30,000 randomly sampled normal records from the KDD training data set were used to form the threshold determination set. It is important to note that this model detects whether a record is intrusive or not. For the R2L attack category, 31.17% of R2L records in the KDD testing data set were detected as intrusive patterns; the authors did not report any information on false alarm rates. Even so, this technique also failed to identify R2L attacks with a high detection rate [7].

Lee and Stolfo used data mining techniques to collect KDD features from the DARPA 1998 dataset. RIPPER rules were created using these data mining techniques to detect R2L attacks. The proposed model could detect only 20% of R2L attacks with a false alarm rate of 0.01 for the DARPA 1998 testing dataset [8].

III. STATE OF THE ART

The Intrusion Detection System is a software or hardware tool that is used to detect unauthorized access to computer systems or networks [9]. Some researchers have conducted studies to categorize/classify attack data in the IDS with a specific classification method to improve the detection accuracy. As known from previous studies, attacks are divided into the following four groups: DoS, Probe, R2L, and U2R. The classification methods in previous research have high accuracy in detecting all groups of attack except R2L attacks.

We propose a new combination of features for the classification method we use, using 8 features from Mukkamala et al. and 24 features from Natesan et al. The combination of the 8 and 24 features yields 28 features because 4 features appear in both sets. This technique is expected to detect R2L attacks more accurately than the previous research.

In this work we observed and compared the classification results between the 8 and 24 features with payloads. Combining and selecting from these 8 and 24 features to obtain 28 features is expected to increase the detection accuracy. We also observed the importance of payload in detecting R2L attacks.

The data used in this research is GureKddCup [11]. This data has an extra feature, Payload, as the 42nd feature. Payload is the content area of network packets as a sequence of ASCII characters. In order to detect R2L attacks, we need Payload to increase detection rates.

In this research not all the features of this data are used. We used 24 features from Natesan [3] and 8 features from Mukkamala [4], described in detail below:

The 8 features used in previous work by Mukkamala are: src_bytes, dst_bytes, Count, srv_count, dst_host_count, dst_host_srv_count, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate.

The 24 features used in previous work by Natesan [3] are: Duration, protocol_type, Service, Flag, src_bytes, dst_bytes, Hot, num_failed_logins, logged_in, num_compromised, root_shell, num_root, num_file_creations, num_shells, num_access_files, is_host_login, is_guest_login, Count, serror_rate, rerror_rate, diff_srv_rate, dst_host_count, dst_host_diff_srv_rate, dst_host_srv_serror_rate.

The 8 features from Mukkamala and the 24 features from Natesan are combined, and from them we select 28 features. The selected features are described in detail in Table I below:

TABLE I. PROPOSED FEATURES

No  Feature Name                  Description
1   Duration                      Length of the connection in seconds.
2   protocol_type                 Type of the protocol used, e.g. TCP.
3   Service                       Network service on the destination, like HTTP, Telnet.
4   Flag                          Normal or error status of the network connection.
5   src_bytes                     Number of data bytes from source to destination.
6   dst_bytes                     Number of data bytes from destination to source.
10  Hot                           Number of "hot" indicators.
11  num_failed_logins             Number of failed login attempts.
12  logged_in                     1 if successfully logged in; 0 otherwise.
13  num_compromised               Number of compromised conditions.
14  root_shell                    1 if root shell is obtained; 0 otherwise.
16  num_root                      Number of "root" accesses.
17  num_file_creations            Number of file creation operations.
18  num_shells                    Number of shell prompts.
19  num_access_files              Number of operations on access control files.
21  is_host_login                 1 if the login belongs to the "hot" list; 0 otherwise.
22  is_guest_login                1 if the login is a guest login; 0 otherwise.
23  Count                         Number of connections to the same host as the current connection in the past two seconds.
25  serror_rate                   Percentage of connections that have "SYN" errors.
27  rerror_rate                   Percentage of connections that have "REJ" errors.
30  diff_srv_rate                 Percentage of connections to different services.
32  dst_host_count                Count for destination host.
35  dst_host_diff_srv_rate        diff_srv_rate for destination host.
39  dst_host_srv_serror_rate      srv_serror_rate for destination host.
32  dst_host_srv_count            Sum of connections to the same destination port number.
33  dst_host_same_src_port_rate   Percentage of connections that were to the same source port, among the connections aggregated in dst_host_srv_count.
36  dst_host_srv_diff_host_rate   Percentage of connections that were to different destination machines, among the connections aggregated in dst_host_srv_count.
24  srv_count                     Sum of connections to the same destination port.

A. GureKddCup Dataset

GureKddcup contains the network connections of kddcup99 (the database of the UCI repository). It also adds one feature, payload (the content of network packets), to each of the connections. This allows information to be extracted directly from the payload of each connection for use in the machine learning processes.

The basic idea of SVM is to find the separating hyperplane with the maximal margin (maximal margin hyperplane), as illustrated in Figure 1. In Figure 1 (a) there are several hyperplane choices for the data set, and Figure 1 (b) shows the maximal margin. Although any of the hyperplanes in Figure 1 (a) could be used, the hyperplane with the maximal margin gives better generalization for the classification method.
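As a hedged illustration of the maximal-margin idea behind SVM, the sketch below scores a few candidate separating hyperplanes on toy 2-D data and keeps the one with the largest margin. The data points, the candidate hyperplanes and the helper name geometric_margin are hypothetical and are not taken from this paper; a real SVM solves an optimization problem instead of comparing a fixed list of candidates.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any sample to the hyperplane w.x + b = 0.

    A positive result means every sample lies on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Hypothetical toy data: two samples of Class +1 and two of Class -1.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Hypothetical candidate hyperplanes (w, b); each one separates the toy data.
candidates = {
    "vertical": ((1.0, 0.0), -1.5),
    "horizontal": ((0.0, 1.0), -0.5),
    "diagonal": ((1.0, 1.0), -2.0),
}

margins = {
    name: geometric_margin(w, b, points, labels)
    for name, (w, b) in candidates.items()
}
best = max(margins, key=margins.get)  # candidate with the widest margin
```

All three candidates classify the toy data correctly, but only the widest-margin one sits midway between the closest samples of the two classes, which is the property the maximal margin hyperplane is chosen for.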
Class -1. The data that belong to Class -1 are shown as squares, whereas the data in Class +1 are shown as circles.

The best separating hyperplane between the two classes can be found by measuring the margin of the hyperplane and finding its maximum. The margin is the distance between the hyperplane and the closest data point from each class. These closest data points are called support vectors. The solid line in Figure 1 (b) shows the best hyperplane, which is located exactly in the middle between the two classes, whereas the two circles and two squares that lie on the margin boundary (dashed lines) are the support vectors. The method to find this hyperplane location is the core of the learning process of SVM.

C. System Design

The system process design in this part shows the data processing in this system. Data will be processed in order as illustrated in Figure 2 below:

described in the appendix at the end of this paper. Then we split the dataset into two data sets: gure.tr for training data and gure.te for testing data. After the data are ready, we classify the data with 8, 24, and 28 features, with payload and without payload, so we get 6 feature sets: 8 features with payload, 8 features without payload, 24 features with payload, 24 features without payload, 28 features with payload, and 28 features without payload. After the experiment, we obtain the feature set that is best for detecting R2L attacks, to be implemented in the simulation phase.

2) Simulation phase:

[Figure: simulation diagram; recoverable labels: Attacker, Send Attack, Data, Engine IDS, Web Report Interface]
To examine the results of this research, we developed a simulation system that illustrates the implementation of the method for improving the accuracy of detecting R2L attacks. The system design of the simulation is as follows:

[Figure: simulation system design; recoverable labels: Database, gure_libsvm]

A. Experiment 1

The results of classification with 8, 24 and 28 features using the SVM method without payload are shown in Figure 5.

B. Experiment 2

The results of classification with 8, 24 and 28 features using SVM with payload are shown in Figure 6.

[Figure: R2L Using Payload; accuracy (0.94 to 0.99) over experiments 1 to 10 for 8 Feature, 24 Feature and 28 Feature]

C. Experiment 3

The accuracy of detection using a 10-fold validation model with 28 features, with payload and without payload, is shown in Figure 7.
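The feature sets compared in these experiments can be sketched as plain lists. This is a minimal illustration, assuming the KDD Cup 99 spelling of the feature names and a column called payload for the extra GureKddCup feature; the helper ordered_union and the variant labels are hypothetical and only show how the 8 and 24 features merge into 28 and how the six with/without-payload variants arise.

```python
# Feature names use KDD Cup 99 spelling (an assumption about exact naming).
FEATURES_8 = [
    "src_bytes", "dst_bytes", "count", "srv_count", "dst_host_count",
    "dst_host_srv_count", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate",
]
FEATURES_24 = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell",
    "num_root", "num_file_creations", "num_shells", "num_access_files",
    "is_host_login", "is_guest_login", "count", "serror_rate", "rerror_rate",
    "diff_srv_rate", "dst_host_count", "dst_host_diff_srv_rate",
    "dst_host_srv_serror_rate",
]


def ordered_union(first, second):
    """Merge two feature lists, keeping first-seen order and dropping repeats."""
    seen = set()
    merged = []
    for name in first + second:
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged


FEATURES_28 = ordered_union(FEATURES_24, FEATURES_8)
shared = sorted(set(FEATURES_8) & set(FEATURES_24))  # features in both sets

# Six variants: each base set with and without the payload column.
variants = {}
for label, base in (("8", FEATURES_8), ("24", FEATURES_24), ("28", FEATURES_28)):
    variants[label + " features without payload"] = list(base)
    variants[label + " features with payload"] = list(base) + ["payload"]
```

The four shared names are why 8 + 24 collapses to 28 rather than 32.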
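The evaluation reports accuracy and the Matthews Correlation Coefficient under 10-fold validation. The sketch below is one hedged way to compute these quantities in plain Python. It assumes the class coding Normal = 1 and R2L = -1, uses made-up toy labels, and splits folds by simple interleaving without shuffling; the function names are illustrative and do not describe the tooling actually used in the experiments.

```python
import math


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists so every sample is tested exactly once."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for i in range(k):
        train = [idx for j in range(k) if j != i for idx in folds[j]]
        yield train, folds[i]


def confusion_counts(y_true, y_pred, attack=-1):
    """Return (tp, tn, fp, fn), treating the attack label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == attack and p == attack)
    tn = sum(1 for t, p in pairs if t != attack and p != attack)
    fp = sum(1 for t, p in pairs if t != attack and p == attack)
    fn = sum(1 for t, p in pairs if t == attack and p != attack)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)); 0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels (hypothetical): -1 = R2L, 1 = Normal.
y_true = [-1, -1, -1, 1, 1, 1, 1, 1]
y_pred = [-1, -1, 1, 1, 1, 1, 1, -1]

counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
mcc = matthews_corrcoef(*counts)
splits = list(k_fold_indices(25, k=10))  # 10 train/test splits of 25 samples
```

In a full run, a classifier would be trained on each train list and scored on the matching test list, and the per-fold accuracy and MCC values would then be plotted or averaged.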
[Figure: Matthews Correlation Coefficient Using 10-Fold Validations; MCC (95.00% to 99.00%) for K = 1 to 10; series: MCC NOP, MCC USP]

Fig. 8. Matthews Correlation Coefficient result of classifying 8, 24 and 28 features using payloads and without payloads

We also evaluated the classification with the Matthews Correlation Coefficient, and the result shows that it was also better using payloads. Therefore, we can conclude that the detection of R2L attacks by the IDS can be increased by using payload features.

VI. CONCLUSION

The accuracy of detecting R2L attacks using SVM based on 8 features without payload is 95.78% and the accuracy of detecting R2L attacks using SVM based on 8 features with payload is 95.99%. The accuracy of detecting R2L attacks using SVM based on 24 features without payload is 95.73% and the accuracy of detecting R2L attacks using SVM based on 24 features with payload is 95.91%, better than 8. The accuracy of detecting R2L attacks using SVM based on 28 features without payload is 95.91% and the accuracy of detecting R2L attacks using SVM based on 28 features with payload is 96.08%, better than all. From the results we can conclude that the best of the 8, 24, and 28 feature sets is the 28-feature set (our proposal).

Acknowledgment

This work is supported by Id-SIRTII/CC (Indonesia Security Incident Response Team on Internet Infrastructure/Coordination Center), Department of Electrical Engineering Universitas Indonesia, and Department of Informatics and Computer Engineering, Electronic Engineering Polytechnic Institute of Surabaya in the joint development of National Cyber Security Situational Awareness Systems "Mata Garuda". We gratefully appreciate this support and would like to thank Id-SIRTII/CC, Universitas Indonesia, and Electronic Engineering Polytechnic Institute of Surabaya.

References

[1] H. Dreger, A. Feldmann, V. Paxson, and R. Sommer, "Operational Experiences with High-Volume Network Intrusion Detection", In Proceedings of ACM Conference on Computer and Communications Security (CCS '04), Washington, D.C., October 2004.
[2] Alimudin Akhmad, Hariadi Mochammad (2012). "Integrasi IDS menggunakan SVM dan KNN Dempster-Shafer". ITS JAVA Press.
[3] Natesan P, Balasubramanie P. (2012). "Multi Stage Filter Using Enhanced Adaboost for Network Intrusion Detection", International Journal of Network Security & Its Applications (IJNSA), Tamilnadu, India.
[4] Mukkamala Srinivas, Guadalupe, Sung Andrew. (2002). "Intrusion Detection Using Neural Network and Support Vector Machines", IEEE.
[5] R. Agarwal and M. V. Joshi, "PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-Study in Network Intrusion Detection)", Technical Report TR 00-015, Department of Computer Science, University of Minnesota, 2000.
[6] I. Levin, "KDD-99 Classifier Learning Contest LLSoft's Results Overview", SIGKDD Explorations, ACM SIGKDD, January 2000, Vol. 1 (2), pp. 67-75.
[7] D. Y. Yeung and C. Chow, "Parzen-window Network Intrusion Detectors", In Proceedings of the Sixteenth International Conference on Pattern Recognition, Quebec City, Canada, August 2002, Vol. 4, pp. 385-388.
[8] W. Lee and S. Stolfo, "A Framework for Constructing Features and Models for Intrusion Detection Systems", ACM Transactions on Information and System Security, November 2000, Vol. 3 (4), pp. 227-261.
[9] Manoj Sharma, Keshav Jindal, Aishish Kumar, "Intrusion Detection System using Bayesian Approach for Wireless Network", International Journal of Computer Applications (0975-888), Volume 48, No. 5, June 2012.
[10] Maheshkumar Sabhnani, Gursel Serpen, "KDD Feature Set Complaint Heuristic Rules for R2L Attack Detection", Security and Management, pp. 310-316, CSREA Press, 2003.
[11] GureKddcup a database description, http://www.sc.ehu.es/acwaldap/
APPENDIX

TABLE III. PROTOCOL TYPE COLUMN TRANSFORMATION

Protocol Type   Value
Tcp             0
Udp             1
Icmp            2

TABLE IV. SERVICES COLUMN TRANSFORMATION

Service         Value
netbios_dmg     0
tim             1
urh             2
ircu            3
http_alt        4
ntp             5
aol             6
auth            7
snmp            8
urp             9
bgp             10
courier         11
csnet_ns        12
ctf             13
daytime         14
discard         15
domain          16
domain_u        17
echo            18
eco             19
eco_i           20
ecr_i           21
efs             22
exec            23
finger          24
ftp             25
ftp_data        26
gopher          27
harvest         28
hostnames       29
http            30
http_2784       31
http_443        32
http_8001       33
imap4           34
IRC             35
iso_tsap        36
klogin          37
kshell          38
ldap            39
link            40
login           41
mtp             42
name            43
netbios_dgm     44
netbios_ns      45
netbios_ssn     46
netstat         47
nnsp            48
nntp            49
ntp_u           50
other           51
pm_dump         52
pop_2           53
pop_3           54
printer         55
private         56
red_i           57
remote_job      58
rje             59
shell           60
smtp            61
sql_net         62
ssh             63
sunrpc          64
supdup          65
systat          66
telnet          67
tftp_u          68
tim_i           69
time            70
urh_i           71
urp_i           72
uucp            73
uucp_path       74
vmnet           75
whois           76
X11             77
Z39_50          78

TABLE V. FLAG COLUMN TRANSFORMATION

Flag            Value
OTH             0
REJ             1
SHR             2
RSTRH           3
RSTO            4
RSTOS0          5
RSTR            6
S0              7
S1              8
S2              9
S3              10
SF              11
SH              12

TABLE VI. PAYLOAD COLUMN TRANSFORMATION

Payload         Value
EHLO            0
GET             1
USER            2
$#dum           3
p2p2p           4
p2p2E           5
ftpft           6
301             7
total           8
lusrl           9
<191            10
The             11
Others above    12
TABLE VII. R2L ATTACK COLUMN TRANSFORMATION

R2L             Value
ftp-write       0
Guest           1
Dict            2
dict_simple     3
guess_password  4
ftp_write       5
Imap            6
Phf             7
Multihop        8
warezmaster     9
warezclient     10
Xlock           11
Xsnoop          12
snmpguess       13
snmpgetattack   14
Httptunnel      15
Sendmail        16
Named           17

TABLE VIII. CLASS COLUMN TRANSFORMATION

Class           Value
Normal          1
R2L             -1
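The column transformations in this appendix amount to lookup tables from categorical values to integers. The sketch below applies a few of them to one made-up connection record and writes the result as a sparse "label index:value" line in the style of the LIBSVM text format. Using that format is an assumption (the paper does not state the file layout), only a subset of the payload table is included, and the record, the four-feature layout and the function names are hypothetical.

```python
# Lookup tables copied from the appendix (payload table shown only in part).
PROTOCOL_TYPE = {"tcp": 0, "udp": 1, "icmp": 2}
FLAG = {
    "OTH": 0, "REJ": 1, "SHR": 2, "RSTRH": 3, "RSTO": 4, "RSTOS0": 5,
    "RSTR": 6, "S0": 7, "S1": 8, "S2": 9, "S3": 10, "SF": 11, "SH": 12,
}
PAYLOAD = {"EHLO": 0, "GET": 1, "USER": 2}
PAYLOAD_OTHER = 12  # the "Others above" row
CLASS = {"normal": 1, "r2l": -1}


def encode_record(record):
    """Map the categorical columns of one record to integers (hypothetical layout)."""
    label = CLASS[record["class"]]
    values = [
        PROTOCOL_TYPE[record["protocol_type"].lower()],
        FLAG[record["flag"]],
        record["src_bytes"],
        PAYLOAD.get(record["payload"], PAYLOAD_OTHER),
    ]
    return label, values


def to_libsvm_line(label, values):
    """Format as 'label index:value ...' with 1-based indices, omitting zeros."""
    pairs = [
        f"{index}:{value}"
        for index, value in enumerate(values, start=1)
        if value != 0
    ]
    return " ".join([str(label)] + pairs)


# Made-up example record, for illustration only.
record = {
    "protocol_type": "tcp", "flag": "SF", "src_bytes": 215,
    "payload": "GET", "class": "r2l",
}
line = to_libsvm_line(*encode_record(record))
```

Because tcp maps to 0, the first feature is omitted from the sparse line; a payload value missing from the table falls back to code 12.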