Study on Implementation of Machine Learning Methods Combination for Improving Attacks Detection Accuracy on Intrusion Detection System (IDS)
Bisyron Wahyudi Masduki*, Kalamullah Ramli*, Ferry Astika Saputra**, Dedy Sugiarto**
*Department of Electrical Engineering, Universitas Indonesia, Jakarta, Indonesia
**Department of Informatics and Computer Engineering, Electronic Engineering Polytechnic Institute of Surabaya, Surabaya, Indonesia
[email protected] [email protected]

Abstract- Many computer-based devices are now connected to the Internet. These devices are widely used to manage critical infrastructure such as energy, aviation, mining, banking and transportation. The data and information transmitted over the Internet infrastructure have very high strategic and economic value. As the value of the data and information increases, so do the threats and attacks against them. Statistical data show a significant increase in threats to cyber security. The Government is aware of these threats and is responding with a cyber security system that can perform early detection of threats and attacks on the Internet.

The success of a nation's cyber security system depends on the extent to which the nation is able to produce its cyber defense system independently. This independence is manifested in the ability to process, analyze and act to prevent threats or attacks originating from within and outside the country. One of the systems that can be developed independently is the Intrusion Detection System (IDS), which is very useful for early detection of cyber threats and attacks.

The quality of an IDS is determined by its ability to detect cyber attacks with few false detections. This study examines how to implement a combination of machine-learning methods in an IDS to improve the accuracy of attack detection. The study is expected to produce a prototype IDS equipped with a combination of machine-learning methods to improve the accuracy in detecting various attacks. The added machine-learning capability is expected to identify the specific characteristics of attacks occurring in the Indonesian Internet network. The novel methods and implementation techniques, together with the national strategic value, are the distinguishing contributions of this research.

Keywords-intrusion detection system, machine-learning, support vector machine, threat, attack.

978-1-4799-6551-9/15/$31.00 ©2015 IEEE. 2015 International Conference on Quality in Research.

I. INTRODUCTION

The Internet has become one of the most powerful and widely available communications mediums on earth, and our reliance on it increases daily. Governments, corporations, banks, and schools conduct their day-to-day business over the Internet. With such widespread use, the data that resides on and flows across the network varies from banking and securities transactions to medical records, proprietary data, and personal correspondence.

The Internet is easy and cheap to access, but the systems attached to it lack a corresponding ease of administration. As a result, many Internet systems are not securely configured. Additionally, the underlying network protocols that support Internet communication are insecure, and few applications make use of the limited security protections that are currently available.

The combination of the data available on the network and the difficulties involved in protecting the data securely make Internet systems vulnerable attack targets. It is not uncommon to see articles in the media referring to Internet intruder activities.

Internet security/protection has become necessary because attacks frequently cause the compromise of critical/sensitive data. Incidents involving viruses, worms, Trojan horses, spyware, and other forms of malicious code have disrupted or damaged millions of systems and networks around the world. Heightened concerns about national security and exposure of personally identifiable information (PII) are also raising awareness of the possible effects of computer-based attacks. These events, and many more, make the case daily for responding quickly and efficiently when computer security defenses are breached.

In response to these computer-based attacks, a number of security devices and software such as firewalls, Intrusion Detection Systems (IDS), and Security Information and Event Management (SIEM) are used to monitor the network in conjunction with the Computer Security Incident Response Team (CSIRT), which is responsible for ensuring the Confidentiality, Integrity and Availability of network services. An attack or action that could jeopardize the Confidentiality, Integrity, and Availability in
network security, including Critical Information Infrastructure (CII), is often referred to as an intrusion. Intrusion detection is an attempt to seek and identify intrusions in a network so that action can be taken in anticipation. An Intrusion Detection System (IDS) is a system made specifically to look for and detect the presence of an intrusion on the network. An IDS is usually placed on a specific machine or distributed in a network.

Many IDSs have been developed to detect network attacks, but a recurring problem is false detection, either false positives or false negatives. A false detection is a mistake by the IDS in recognizing attack signatures; it gives network administrators a wrong picture of the situation and makes it hard to decide how to respond. Too many false positives and false negatives make it difficult for network administrators to decide on or handle the reports of the IDS. Considering the workload of a network administrator, it is unlikely that they can keep up with new attack types and write new rules/signatures all the time. Therefore, the idea arose to develop our own IDS that increases the accuracy of recognizing attack signatures.

One widely used approach to improving the effectiveness and adaptability of an IDS is machine learning. With machine-learning techniques, the IDS can learn to recognize signatures and profiles automatically from a training dataset and can then predict impending attacks. An IDS that applies machine learning is expected to recognize attacks more accurately and to adapt easily to a changing network environment. Several machine-learning methods can be used for classification and clustering of attack data in an IDS. Widely known methods include Support Vector Machine (SVM), K-Nearest Neighbor (K-NN), and Artificial Neural Network (ANN). Some researchers have used these methods to classify and cluster attack data with fairly good results. Because each classification method has strengths and weaknesses, classification accuracy can be improved by exploiting the strengths of each method while avoiding its weaknesses.

Several previous studies have shown that each machine-learning method has advantages in detecting certain types of attacks. For this reason, in this research we design a combination of several machine-learning methods in order to obtain the optimal design to be implemented in an IDS to detect various types of attacks. However, this paper presents only the results obtained using the SVM method and selected features of the network dataset.

In this paper we present the results of our initial research on implementing the SVM method in an IDS to detect R2L-type attacks. We propose a combination of 8 features from Alimudin and 24 features from Natesan for R2L and U2R detection, with the redundant features removed. This research uses the SVM method because Mukkamala's research found that SVM performs better than the Artificial Neural Network method. The method used in this research differs from previous studies in that we combine the selected features, which results in 28 features, and we also add payload as a feature in our classification.

II. RELATED WORKS

An IDS is a system, either network based or host based, that monitors data packets in the network and detects potential attacks. The goal of a Network Intrusion Detection System (NIDS) is to alert a system administrator each time an intruder tries to penetrate the network. A signature-based NIDS defines penetration through malicious signatures: if an ongoing activity matches a signature, an alarm is raised [1].

Akhmad Alimudin used the machine-learning algorithms KNN and SVM together with Dempster-Shafer theory. KNN and SVM were first used to classify attack data, and the outputs of the two methods were then combined with Dempster-Shafer theory, using 8 features from KDDCUP as training data. He stated that the overall classification performance is good, but the results are not optimal for detecting the R2L and U2R attack categories [2].

Mukkamala researched intrusion detection using SVM and Neural Network methods with the KDDCUP 1999 data as the training data. The results show that the SVM method is more accurate than the Neural Network method [4].

P. Natesan researched a multi-stage filter using enhanced AdaBoost for network intrusion detection that divides intrusion detection into three stages: DoS detection, Probe detection, and R2L and U2R detection. If no attack is detected by any of these stages, the data packet is considered normal. Each detection stage uses a different number and kind of features [3].

Agarwal and Joshi proposed a two-stage general-to-specific framework for learning a rule-based model (PNrule) to learn classifier models on a data set that has widely different class distributions in the training data set. The PNrule technique was evaluated with the KDD testing data set, which contains many new R2L attacks that are not present in the KDD training data set. The proposed model was able to detect only 10.7% of attacks in the R2L attack category, although an insignificant number of false alarms was generated. The obvious disadvantage of this algorithm is that the rules are automatically generated, leading to data set dependency with negligible generalization capability [5].

Levin used the Kernel Miner tool on the KDD data set. Kernel Miner is a data-mining tool for classifying data and predicting new cases using automatically generated decision trees. Using this tool and the KDD data set, Levin created a set of locally optimal decision trees (called the decision forest), from which an optimal subset of trees (called the sub-forest) was selected for predicting new cases. Levin used only 10% of the KDD training data set, randomly sampled from the entire training data set. A multi-class detection approach was used to detect different attack categories in the KDD data set. The final trees gave very high detection rates for all classes, including R2L, in the entire
training data set. The proposed classifier achieved only 7.32% detection and 2.5% false alarm rates for the R2L attacks in the KDD testing data set [6].

Yeung and Chow proposed a novelty detection approach using non-parametric density estimation based on Parzen-window estimators with Gaussian kernels to build an intrusion detection system using normal data only. This novelty detection approach was employed to detect attack categories in the KDD data set. 30,000 randomly sampled normal records from the KDD training data set were used to estimate the density of the model. Another 30,000 randomly sampled normal records from the KDD training data set were used to form the threshold determination set. It is important to note that this model detects only whether a record is intrusive or not. For the R2L attack category, 31.17% of R2L records in the KDD testing data set were detected as intrusive patterns; the authors did not report any information on false alarm rates. Even so, this technique also failed to identify R2L attacks with a high detection rate [7].

Lee and Stolfo used data mining techniques to collect KDD features from the DARPA 1998 dataset. RIPPER rules were created using these data mining techniques to detect R2L attacks. The proposed model could detect only 20% of R2L attacks with a false alarm rate of 0.01 for the DARPA 1998 testing dataset [8].

III. STATE OF THE ART

An Intrusion Detection System is a software or hardware tool used to detect unauthorized access to computer systems or networks [9]. Some researchers have conducted studies to categorize/classify attack data in the IDS with a specific classification method to improve detection accuracy. As established in previous studies, attacks are divided into four groups: DoS, Probe, R2L, and U2R. Classification methods in previous research achieve high accuracy in detecting every attack group except R2L.

We propose a new combination of features for our classification method, using 8 features from Mukkamala et al. and 24 features from Natesan et al. Combining the 8 and 24 features yields 28 features because 4 features appear in both sets. This technique is expected to detect R2L attacks more accurately than previous research.

In this work we observed and compared the classification results of the 8 and 24 features with payloads. Combining and selecting from these 8 and 24 features to obtain 28 features is expected to increase the detection accuracy. We also observed the importance of payload in detecting R2L attacks.

The data used in this research is GureKddCup [11]. This data set has an extra feature, Payload, as the 42nd feature. Payload is the content area of network packets as a sequence of ASCII characters. To detect R2L attacks, we need Payload to increase detection rates.

In this research not all the features of this data set are used. We used 24 features from Natesan [3] and 8 features from Mukkamala [4], as detailed below.

The 8 features used in previous work by Mukkamala are: src_bytes, dst_bytes, Count, srv_count, dst_host_count, dst_host_srv_count, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate.

The 24 features used in previous work by Natesan [3] are: Duration, protocol_type, Service, Flag, src_bytes, dst_bytes, Hot, num_failed_logins, logged_in, num_compromised, root_shell, num_root, num_file_creations, num_shells, num_access_files, is_host_login, is_guest_login, Count, serror_rate, rerror_rate, diff_srv_rate, dst_host_count, dst_host_diff_srv_rate, dst_host_srv_serror_rate.

The 8 features from Mukkamala and the 24 features from Natesan are combined, and from them we select 28 features. The selected features are described in Table I below:

TABLE I. PROPOSED FEATURES

No  Feature Name                 Description
1   Duration                     Length of the connection in seconds.
2   protocol_type                Type of the protocol used, e.g. TCP.
3   Service                      Network service on the destination, e.g. HTTP, Telnet.
4   Flag                         Normal or error status of the network connection.
5   src_bytes                    Number of data bytes from source to destination.
6   dst_bytes                    Number of data bytes from destination to source.
10  Hot                          Number of "hot" indicators.
11  num_failed_logins            Number of failed login attempts.
12  logged_in                    1 if successfully logged in; 0 otherwise.
13  num_compromised              Number of compromised conditions.
14  root_shell                   1 if root shell is obtained; 0 otherwise.
16  num_root                     Number of "root" accesses.
17  num_file_creations           Number of file creation operations.
18  num_shells                   Number of shell prompts.
19  num_access_files             Number of operations on access control files.
21  is_host_login                1 if the login belongs to the "hot" list; 0 otherwise.
22  is_guest_login               1 if the login is a guest login; 0 otherwise.
23  Count                        Number of connections to the same host as the current connection in the past two seconds.
25  serror_rate                  Percentage of connections that have "SYN" errors.
27  rerror_rate                  Percentage of connections that have "REJ" errors.
30  diff_srv_rate                Percentage of connections to different services.
32  dst_host_count               Count for destination host.
35  dst_host_diff_srv_rate       diff_srv_rate for destination host.
39  dst_host_srv_serror_rate     srv_serror_rate for destination host.
32  dst_host_srv_count           Sum of connections to the same destination port number.
33  dst_host_same_src_port_rate  Percentage of connections that were to the same source port, among the connections aggregated in dst_host_srv_count.
36  dst_host_srv_diff_host_rate  Percentage of connections that were to different destination machines, among the connections aggregated in dst_host_srv_count.
24  srv_count                    Sum of connections to the same destination port.

IV. RESEARCH DESIGN

A. GureKddCup Dataset

GureKddcup contains the network connections of kddcup99 (the database of the UCI repository). It also adds one feature, payload (the content of network packets), to each connection. This allows information to be extracted directly from the payload of each connection for use in the machine-learning process.

The Information System Technology (IST) group of Lincoln Laboratory at MIT, under contract with DARPA and in collaboration with AFRL, created a managed network. In this network, they simulated real traffic with normal and attack connections and then captured the network traffic with tcpdump (Linux command). The experiment lasted 7 weeks of 5 days each. The generated tcpdump files, ps outputs and log files are known as the darpa99 database.

After obtaining the tcpdump data, they extracted connection data from the tcpdump files and represented it as a tabular dataset in UCI repository format, so that each instance of the dataset corresponds to a connection. They extracted 41 attributes for each connection plus the class attribute. These attributes are divided into three main groups: intrinsic features (extracted from the header area of the network packets), content features (extracted from the content area of the network packets), and traffic features (extracted from information about previous connections). This dataset is known as kddcup99 [11].

B. Support Vector Machine (SVM) Classifications

Support Vector Machine (SVM) is currently one of the most widely developed and implemented classification methods. It originates from statistical learning theory and is reported to perform very well, often giving better results than other methods. SVM also works well on large data sets; an SVM that uses a kernel method maps the data from its current dimension into a higher dimension. In an Artificial Neural Network (ANN), all training data is used during the learning process. SVM differs in that only selected data points that contribute to the model are used for classification. This is an advantage of SVM, because not all training data is used in every training iteration. The contributing data points are called support vectors, hence the name Support Vector Machine.

The basic idea of SVM is to maximize the margin of the separating hyperplane (the maximal margin hyperplane), as illustrated in Figure 1. Figure 1 (a) shows several candidate hyperplanes for a data set, and Figure 1 (b) shows the maximal margin hyperplane. Although any of the hyperplanes in Figure 1 (a) could be used, the hyperplane with the maximal margin gives better generalization for the classifier.

Fig. 1. Concept of SVM. Panel (a): discrimination boundaries; panel (b): margin. Legend: Class -1, Class +1.

Classification with SVM can be described simply as searching for the best hyperplane that separates two classes in the input space. Figure 1 shows patterns that are members of two classes: Class +1 and
Class -1. The data in Class -1 are shown as squares, whereas the data in Class +1 are shown as circles.

The best separating hyperplane between the two classes can be found by measuring the hyperplane margin and finding its maximum. The margin is the distance between the hyperplane and the closest data point of each class. These closest data points are called support vectors. The solid line in Figure 1 (b) shows the best hyperplane, located exactly midway between the two classes, whereas the two circles and two squares that lie on the margin boundaries (dashed lines) are the support vectors. Finding the location of this hyperplane is the core of the SVM learning process.

C. System Design

The system process design in this part shows the data processing in this system. Data is processed in the order illustrated in Figure 2 below:

Fig. 2. System Design.

This research is divided into two main phases. The first phase is research, and the second is simulation. In the research phase we conducted an experiment to find the better detection using 8, 24 and 28 features with or without payloads and developed a model engine for the IDS simulation. The research phase identifies the feature set that is the best combination for detecting R2L. The second phase implements the model engine in a simulation, using the feature set that is most accurate for detecting R2L. These phases are described in detail below.

1) Research phase (Model Engine Development): In this phase, we classify the data set using 8, 24, and 28 features, with and without payload. First we preprocess the data to clean it and convert it from the GureKDDCup data format to the LIBSVM data format required by LIBSVM classification. Before converting the data into LIBSVM format, we must first convert it into numerical format. There are 5 columns that must be converted into numerical format; the transformation indices are described in the appendix at the end of this paper. Then we split the data set into two data sets: gure.tr for training data and gure.te for testing data. Once the data is ready, we classify it using the 8, 24, and 28 features with and without payload, giving 6 feature sets: 8 features with payload, 8 features without payload, 24 features with payload, 24 features without payload, 28 features with payload, and 28 features without payload. From the experiment we obtain the feature set that is best for detecting R2L attacks, to be implemented in the simulation phase.

2) Simulation phase:

Fig. 3. Simulation Process Diagram (labels: Data, Attacker, Attacker Send Attack, Engine IDS, Web Report Interface).

After we have the best result of the experiment, we choose the best detection method for the simulation. In the simulation phase, we used the model engine developed in the research phase together with a web-based simulation engine. In the simulation, we set how many attack and normal data packets are simulated. Once the simulation starts, it fetches attack or normal packet data from the database and classifies it with our engine every second. By the end of the simulation, the engine has classified all the data used in the simulation.

V. EXPERIMENT ANALYSIS

The GureKDDCup data set is too large to be used in any learning process. Most experiments with the KDDCUP database are based on the 10% subset of the database provided in UCI. That database contains 494,021 connections. It has flood, no-flood and normal connections. Analogously, in this research we decided to generate a reduced sample, gureKddcup6percent, that contains 159,321 records. The data is detailed in Table II below:

TABLE II. PORTION DATASET

Class   100% from 6%  80% from 6%  20% from 6%
Normal  157421        125937       31484
R2L     1136          909          227
Total   158557        126846       31711

From Table II above, 80% of the data is used for training and 20% of the data is used for testing.
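
As a quick check of Table II, the 80/20 partition can be reproduced per class. The sketch below is only illustrative: the function name `split_counts` and the rounding rule (round the training share to the nearest record, give the remainder to the test set) are assumptions rather than the authors' documented procedure, although this rule does reproduce the counts printed in Table II.

```python
def split_counts(class_sizes, train_fraction=0.8):
    """Return {class: (train, test)} for a per-class split.

    Assumption: the training share is rounded to the nearest record and
    the remaining records go to the test set.
    """
    result = {}
    for name, size in class_sizes.items():
        train = round(size * train_fraction)
        result[name] = (train, size - train)
    return result


# Class sizes taken from the first numeric column of Table II.
table_ii = {"Normal": 157421, "R2L": 1136}
counts = split_counts(table_ii)
total_train = sum(train for train, _ in counts.values())
total_test = sum(test for _, test in counts.values())

for name, (train, test) in counts.items():
    print(f"{name}: train={train} test={test}")
print(f"Total: train={total_train} test={total_test}")
```

With these counts, R2L makes up well under 1% of both partitions, so the two classes are heavily imbalanced in training and in testing alike.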

To examine the results of this research, we developed a simulation system that illustrates how the method is implemented to improve the accuracy of detecting R2L attacks. The system design of the simulation is as follows:

Fig. 4. Simulation System Design (labels: Database, gure_libsvm).

In this research, we processed the data with 10 subsets, and the results are described as follows:

A. Experiment 1

The results of classification with 8, 24 and 28 features using the SVM method without payload are shown in Figure 5.

Fig. 5. Result of classifying 8, 24 and 28 Features without Payloads (chart title: R2L Without Payload; x-axis: i-Experiments, 1 to 10; y-axis: 0.94 to 0.99; series: 8 Feature, 24 Feature, 28 Feature).

Figure 5 above shows the results for the feature sets without payload; we can conclude that the 28 features achieve the best detection rate. Why do the 28 features give the best result? Because combining the 8 and 24 features into 28 adds 4 features that are not among the 24: dst_host_srv_count, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate and srv_count. These features are very useful for detecting the top three web attacks.

B. Experiment 2

The results of classification with 8, 24 and 28 features using SVM with payload are shown in Figure 6.

Fig. 6. Result of classifying 8, 24 and 28 Features with Payloads (chart title: R2L Using Payload; x-axis: i-Experiments, 1 to 10; y-axis: 0.94 to 0.99; series: 8 Feature, 24 Feature, 28 Feature).

Figure 6 above shows that with payload, the detection rate of attacks increases for every feature set.

C. Experiment 3

The accuracy obtained with 10-fold validation of the model with 28 features, with and without payloads, is shown in Figure 7.

Fig. 7. Accuracy Result of classifying 8, 24 and 28 Features using Payloads and without payloads. Note: NOP = No Payload, USP = Use Payload, K = number of K from K-Fold validation. (Chart title: Accuracy Using 10-Fold Validation; x-axis: K, 1 to 10; y-axis: 99.93% to 99.99%.)

We classified our proposed feature set with 10-fold validation, and the results show that it performs better when using payloads.

The Matthews Correlation Coefficient obtained with 10-fold validation of the model with 28 features, with and without payloads, is shown in Figure 8.
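
For reference, both metrics used in Experiment 3 can be computed from the four confusion-matrix counts. The sketch below uses the standard definitions of accuracy and the Matthews Correlation Coefficient; the function names and the example counts are hypothetical and are not results from this paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all records that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal sum is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for one imbalanced test fold (R2L = positive class).
tp, tn, fp, fn = 200, 31400, 84, 27
print(f"accuracy = {accuracy(tp, tn, fp, fn):.4f}")
print(f"MCC      = {mcc(tp, tn, fp, fn):.4f}")
```

Because the Normal class dominates the data set, accuracy alone can stay high even when many R2L records are missed; MCC takes all four counts into account, which is one reason to report it alongside accuracy.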

Fig. 8. Matthews Correlation Coefficient Result of classifying 8, 24 and 28 Features using Payloads and without payloads (chart title: Matthews Correlation Coefficient Using 10-Fold Validations; x-axis: K, 1 to 10; y-axis: 95.00% to 99.00%; series: MCC NOP, MCC USP).

We also evaluated the classification with the Matthews Correlation Coefficient, and the results show that it was also better using payloads. Therefore, we conclude that payload features increase the detection of R2L attacks by the IDS.

VI. CONCLUSION

The accuracy of detecting R2L attacks using SVM based on 8 features without payload is 95.78%, and the accuracy using SVM based on 8 features with payload is 95.99%. The accuracy of detecting R2L attacks using SVM based on 24 features without payload is 95.73%, and the accuracy using SVM based on 24 features with payload is 95.91%. The accuracy of detecting R2L attacks using SVM based on 28 features without payload is 95.91%, and the accuracy using SVM based on 28 features with payload is 96.08%, better than all the others. From these results we conclude that the best of the 8-, 24-, and 28-feature sets is the 28-feature set (our proposal).

In future work, it would be better to study all four attack classes: DoS, Probe, U2R, and R2L. We can also try to use real training data (pcap or tcpdump files), and we can use other algorithms, for example PCA, to find a better feature set.

Acknowledgment

This work is supported by Id-SIRTII/CC (Indonesia Security Incident Response Team on Internet Infrastructure/Coordination Center), Department of Electrical Engineering Universitas Indonesia, and Department of Informatics and Computer Engineering, Electronic Engineering Polytechnic Institute of Surabaya in the joint development of National Cyber Security Situational Awareness Systems "Mata Garuda". We gratefully appreciate this support and would like to thank Id-SIRTII/CC, Universitas Indonesia, and Electronic Engineering Polytechnic Institute of Surabaya.

References

[1] H. Dreger, A. Feldmann, V. Paxson, and R. Sommer, "Operational Experiences with High-Volume Network Intrusion Detection", In Proceedings of ACM Conference on Computer and Communications Security (CCS '04), Washington, D.C., October 2004.
[2] Alimudin Akhmad, Hariadi Mochammad (2012). "Integrasi IDS menggunakan SVM dan KNN Dempster-Shafer". ITS JAVA Press.
[3] Natesan P, Balasubramanie P. (2012). "Multi Stage Filter Using Enhanced Adaboost for Network Intrusion Detection", International Journal of Network Security & Its Applications (IJNSA), Tamilnadu, India.
[4] Mukkamala Srinivas, Guadalupe, Sung Andrew. (2002). "Intrusion Detection Using Neural Network and Support Vector Machines", IEEE.
[5] R. Agarwal and M. V. Joshi, "PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-Study in Network Intrusion Detection)", Technical Report TR 00-015, Department of Computer Science, University of Minnesota, 2000.
[6] I. Levin, "KDD-99 Classifier Learning Contest LLSoft's Results Overview", SIGKDD Explorations, ACM SIGKDD, January 2000, Vol. 1 (2), pp. 67-75.
[7] D. Y. Yeung, and C. Chow, "Parzen-window Network Intrusion Detectors", In Proceedings of the Sixteenth International Conference on Pattern Recognition, Quebec City, Canada, August 2002, Vol. 4, pp. 385-388.
[8] W. Lee, and S. Stolfo, "A Framework for Constructing Features and Models for Intrusion Detection Systems", ACM Transactions on Information and System Security, November 2000, Vol. 3 (4), pp. 227-261.
[9] Manoj Sharma, Keshav Jindal, Aishish Kumar, "Intrusion Detection System using Bayesian Approach for Wireless Network", International Journal of Computer Applications (0975-888), Volume 48, No. 5, June 2012.
[10] Maheshkumar Sabhnani, Gursel Serpen, "KDD Feature Set Complaint Heuristic Rules for R2L Attack Detection", Security and Management, page 310-316. CSREA Press, (2003)
[11] GureKddcup a database description, http://www.sc.ehu.es/acwaldap/

APPENDIX

TABLE III. PROTOCOL TYPE COLUMN TRANSFORMATION

Protocol Type  Value
Tcp            0
Udp            1
Icmp           2

TABLE IV. SERVICES COLUMN TRANSFORMATION

Service      Value
netbios_dmg  0
tim          1
urh          2
ircu         3
http_alt     4
ntp          5
aol          6
auth         7
snmp         8
urp          9
bgp          10
courier      11
csnet_ns     12
ctf          13
daytime      14
discard      15
domain       16
domain_u     17
echo         18
eco          19
eco_i        20
ecr_i        21
efs          22
exec         23
finger       24
ftp          25
ftp_data     26
gopher       27
harvest      28
hostnames    29
http         30
http_2784    31
http_443     32
http_8001    33
imap4        34
IRC          35
iso_tsap     36
klogin       37
kshell       38
ldap         39
link         40
login        41
mtp          42
name         43
netbios_dgm  44
netbios_ns   45
netbios_ssn  46
netstat      47
nnsp         48
nntp         49
ntp_u        50
other        51
pm_dump      52
pop_2        53
pop_3        54
printer      55
private      56
red_i        57
remote_job   58
rje          59
shell        60
smtp         61
sql_net      62
ssh          63
sunrpc       64
supdup       65
systat       66
telnet       67
tftp_u       68
tim_i        69
time         70
urh_i        71
urp_i        72
uucp         73
uucp_path    74
vmnet        75
whois        76
X11          77
Z39_50       78

TABLE V. FLAG COLUMN TRANSFORMATION

Flag    Value
OTH     0
REJ     1
SHR     2
RSTRH   3
RSTO    4
RSTOS0  5
RSTR    6
S0      7
S1      8
S2      9
S3      10
SF      11
SH      12

TABLE VI. PAYLOAD COLUMN TRANSFORMATION

Payload       Value
EHLO          0
GET           1
USER          2
$#dum         3
p2p2p         4
p2p2E         5
ftpft         6
301           7
total         8
lusrl         9
<191          10
The           11
Others above  12
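
The transformation tables above can be applied as simple lookups when preparing LIBSVM input. The sketch below is a minimal illustration, not the authors' preprocessing code: it assumes lowercase protocol names, the KDD spelling of flag names (for example S0 and RSTOS0), a small hypothetical subset of five features, and 1-based feature indices in the `label index:value` text format that LIBSVM reads. The record layout and the function names `encode` and `to_libsvm_line` are invented for the example.

```python
# Lookup tables following Tables III and V of the appendix.
PROTOCOL = {"tcp": 0, "udp": 1, "icmp": 2}
FLAG = {"OTH": 0, "REJ": 1, "SHR": 2, "RSTRH": 3, "RSTO": 4, "RSTOS0": 5,
        "RSTR": 6, "S0": 7, "S1": 8, "S2": 9, "S3": 10, "SF": 11, "SH": 12}


def encode(record):
    """Turn one connection record (a dict) into a list of numbers."""
    return [
        float(record["duration"]),
        float(PROTOCOL[record["protocol_type"].lower()]),
        float(FLAG[record["flag"]]),
        float(record["src_bytes"]),
        float(record["dst_bytes"]),
    ]


def to_libsvm_line(label, values):
    """Format as '<label> <index>:<value> ...' with 1-based indices.

    Zero values are omitted, which LIBSVM's sparse format allows.
    """
    pairs = [f"{i}:{v:g}" for i, v in enumerate(values, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


# Hypothetical record; label 1 marks an R2L connection, -1 a normal one.
record = {"duration": 2, "protocol_type": "tcp", "flag": "SF",
          "src_bytes": 105, "dst_bytes": 146}
print(to_libsvm_line(1, encode(record)))
```

One consequence of this encoding is that a category mapped to 0 (such as Tcp) simply disappears from the sparse line, so the feature index, not the position in the line, identifies each value.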