Detection of Cyber Attack in Network using Machine Learning Techniques
ABSTRACT
TABLE OF CONTENTS
1. INTRODUCTION
1.1 Motivation
1.2 Existing System
1.3 Objective
1.4 Outcomes
1.5 Applications
LIST OF FIGURES
S.NO NAME
1 Project SDLC
2 Use case diagram
3 Class diagram
4 Sequence diagram
1. INTRODUCTION
1.1 MOTIVATION
The use of new technologies provides great advantages to individuals, organizations, and governments; however, it also creates problems for them, for example the protection of important information, the security of stored data platforms, the availability of information, and so on. Owing to these problems, cyber terrorism is one of the most significant issues in today's world. Cyber terror, which has caused many problems for individuals and institutions, has reached a level that could threaten public and national security through various groups such as criminal organizations, professional persons and cyber activists. For this reason, Intrusion Detection Systems (IDS) have been developed to guard against cyber attacks.
1.2 EXISTING SYSTEM
Naïve Bayes and Principal Component Analysis (PCA) were used with the KDD99 dataset by Almansob and Lomte [9]. Similarly, PCA, SVM, and KDD99 were used by Chithik and Rabbani for IDS [10]. In Aljawarneh et al.'s paper, their evaluations and analyses were carried out on the NSL-KDD dataset for their IDS model [11]. Literature reviews show that the KDD99 dataset is frequently used for IDS [6]–[10]. There are 41 features in KDD99 and it was created in 1999. Consequently, KDD99 is old and provides no information about modern attack types such as zero-day exploits. We therefore used the up-to-date CICIDS2017 dataset [12] in our study.
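For illustration, the CICIDS2017 flow records are distributed as CSV files and can be inspected directly with pandas; a minimal sketch (the file path is an assumption, and the Label column follows the public dataset's naming):

import pandas as pd

# load one day of CICIDS2017 flow records (the path is an assumption)
df = pd.read_csv("CICIDS2017/Friday-WorkingHours.csv")
df.columns = df.columns.str.strip()   # some releases carry stray spaces in column names

print(df.shape)                       # flows x features (about 80 columns)
print(df["Label"].value_counts())     # BENIGN vs. attack classes such as DDoS and PortScan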
1.2.1 Limitations of existing system
• Strict Regulations
• Difficult to work with for non-technical users
• Restrictive to resources
• Constantly needs Patching
• Constantly being attacked
1.3 Objectives
The objective of this project is to detect cyber attacks by using machine learning algorithms such as:
• ANN
• CNN
• Random forest
1.4 Outcomes
These predictions can be made by four algorithms: SVM, ANN, RF and CNN. This work helps to identify which algorithm predicts with the best accuracy rate, which helps in identifying whether a cyber attack has happened or not.
1.5 Applications
This strategy is used in the detection of cyber attacks in networks using machine learning techniques.
1.6.2 DESIGNING
Designing helps in developing the project in an easy way: the different actors and their use cases are captured in a use case diagram, the flow of the project is captured using a sequence diagram, and the class diagram gives information about the different classes in the project together with the methods to be used in them; this is how the UML is used in our project. The third and most important part of system design is database design, where we try to design the database based on the number of modules in our project.
1.6.3 IMPLEMENTATION
Implementation is the phase where we try to give the practical output of the work done in the designing stage; most of the coding of the business logic comes into action here. It is the main and crucial part of the project.
2. Data Preprocessing
4. Modelling
5. Predicting
1.8 NON-FUNCTIONAL REQUIREMENTS
NON-FUNCTIONAL REQUIREMENTS (NFRs) specify the quality attributes of a software system. They judge the software system based on responsiveness, usability, security, portability and other non-functional standards that are critical to the success of the software system. An example of a non-functional requirement is: "how fast does the website load?" Failing to meet non-functional requirements can result in systems that fail to satisfy user needs. Non-functional requirements allow you to impose constraints or restrictions on the design of the system across the various agile backlogs, for example: the site should load in 3 seconds when the number of simultaneous users is > 10,000. The description of non-functional requirements is just as critical as that of functional requirements.
Usability requirement
Serviceability requirement
Manageability requirement
Recoverability requirement
Security requirement
Data Integrity requirement
Capacity requirement
Availability requirement
Scalability requirement
Interoperability requirement
Reliability requirement
Maintainability requirement
Regulatory requirement
Environmental requirement
2. LITERATURE SURVEY
2.1 R. Christopher, “Port scanning techniques and the defense against
them,” SANS Institute, 2001.
In this paper, we will mainly be concerned with detecting information
gathering portscans, since detecting flood portscans is easy. However, the
possibility of being maliciously flooded with information will be an
important consideration in our algorithm design. We will use the term scan
footprint for the set of port/IP combinations which the attacker is interested
in characterizing. It is helpful to conceptually distinguish the footprint of
the scan, from the script of the scan, which refers to the time sequence in
which the attacker tries to explore the footprint. The footprint is
independent of aspects of the script, such as how fast the scan is, whether
it is randomized, etc. The footprint represents the attacker’s information
gathering requirements for her scan, and she designs a scan script that will
meet those requirements, and perhaps other non-information-gathering
requirements (such as not being detected by an NIDS). The most common
type of portscan footprint at present is a horizontal scan. By this, we mean
that an attacker has an exploit for a particular service, and is interested in
finding any hosts that expose that service. Thus she scans the port of
interest on all IP addresses in some range of interest. Also at present, this
is mainly being done sequentially on TCP port 53 (DNS).
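The footprint idea suggests a simple heuristic: a source probing the same port across many destination addresses is a candidate horizontal scan. A minimal sketch over flow records, with an illustrative threshold (the tuple layout and cutoff are assumptions, not taken from the paper):

from collections import defaultdict

def horizontal_scan_sources(flows, port, threshold=20):
    # flows: iterable of (src_ip, dst_ip, dst_port) tuples; flags sources
    # that touch `port` on more than `threshold` distinct hosts
    targets = defaultdict(set)
    for src, dst, dport in flows:
        if dport == port:
            targets[src].add(dst)
    return [src for src, hosts in targets.items() if len(hosts) > threshold]

# example: a source sweeping port 53 across a range of hosts is flagged
flows = [("10.0.0.5", "192.168.1.%d" % i, 53) for i in range(1, 40)]
print(horizontal_scan_sources(flows, port=53))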
2.3 M. C. Raja and M. M. A. Rabbani, “Combined analysis of support
vector machine and principle component analysis for ids,” in IEEE
International Conference on Communication and Electronics Systems,
2016, pp. 1–5.
Compared to the past, the security of networked systems has become a critical universal issue that influences individuals, enterprises and governments. The rate of attacks against networked systems has increased dramatically, and the strategies used by the attackers are continuing to
evolve. For example, the privacy of important information, security of
stored data platforms, availability of knowledge etc. Depending on these
problems, cyber terrorism is one of the most important issues in today’s
world. Cyber terror, which caused a lot of problems to individuals and
institutions, has reached a level that could threaten public and country
security by various groups such as criminal organizations, professional
persons and cyber activists. Intrusion detection is one of the solutions
against these attacks. A free and effective approach for designing Intrusion
Detection Systems (IDS) is Machine Learning. In this study, deep learning
and support vector machine (SVM) algorithms were used to detect port
scan attempts based on the new CICIDS2017 dataset.
Introduction
Network Intrusion Detection System (IDS) is a software-based application or a
hardware device that is used to identify malicious behavior in the network
[1,2]. Based on the detection technique, intrusion detection is classified
into anomaly-based and signature-based. IDS developers employ various
techniques for intrusion detection. Information security is the process of
protecting information from unauthorized access, usage, disclosure,
destruction, modification or damage. The terms "Information security", "computer security" and "information assurance" are often used
interchangeably. These areas are related to each other and have common
goals to provide availability, confidentiality, and integrity of information.
Studies show that the first step of an attack is discovery [1].
Reconnaissance is made in order to get information about the system in
this stage. Finding a list of open ports in a system provides very critical
information for an attacker. For this reason, there are a lot of tools to
identify open ports [2] such as antivirus and IDS. One of these techniques
is based on machine learning. Machine learning (ML) techniques can
predict and detect threats before they result in major security incidents [3].
Classifying instances into two classes is called binary classification. On the
other hand, multi-class classification refers to classifying instances into
three or more classes. In this research, we adopt both classifications.
II. Literature Review
Sharafaldin et al. [4] used a Random Forest Regressor
to determine the best set of features to detect each attack family. The
authors examined the performance of these features with different
algorithms that included K-Nearest Neighbor (KNN), Adaboost, Multi-
Layer Perceptron (MLP), Naïve Bayes, Random Forest (RF), Iterative
Dichotomiser 3 (ID3) and Quadratic Discriminant Analysis (QDA). The
highest precision value was 0.98 with RF and ID3 [4]. The execution time (time to build the model) was 74.39 s, whereas the execution time for our proposed system using Random Forest is 21.52 s with a comparable processor. Furthermore, our proposed intrusion detection system targets a combined detection process of all the attack families. D. Aksu, S. Üstebay, M. A. Aydin, and T. Atmaca [13]: there are different but limited studies based on the CICIDS2017 dataset. Some
of them were discussed here. D.Aksu et al. showed performances of
various machine learning algorithms detecting DDoS attacks based on the
CICIDS2017 dataset in their previous work [13].The authors of [13]
applied the Multi-Layer Perceptron (MLP) classifier algorithm and a
Convolutional Neural Network (CNN) classifier that used the Packet
CAPture (PCAP) file of CICIDS2017. The authors selected specified
network packet header features for the purpose of their study. Conversely,
in our paper, we used the corresponding profiles and the labeled flows for
machine and deep learning purposes. According to [13], the results
demonstrated that the payload classification algorithm was judged to be
inferior to MLP. However, it showed significant ability to distinguish
network intrusion from benign traffic with an average true positive rate of
94.5% and an average false positive rate of 4.68%. According to E. Biglar Beigi and H. Hadian Jazi [14], machine learning techniques have the ability to
learn the normal and anomalous patterns automatically by training a
dataset to predict an anomaly in network traffic. One important
characteristic defining the effectiveness of machine learning techniques is
the features extracted from raw data for classification and detection.
Features are the important information extracted from raw data. The
underlying factor in selecting the best features lies in a trade-off between
detection accuracy and false alarm rates. The use of all features, on the other hand, will lead to a significant overhead, though it reduces the risk of removing important features. Although the importance of feature selection
cannot be overlooked, intuitive understanding of the problem is mostly
used in the selection of features [16]. The authors in [14] proposed a denial
of service intrusion detection system that used the Fisher Score algorithm
for features selection and Support Vector Machine (SVM), K-Nearest
Neighbor (KNN) and Decision Tree (DT) as the classification algorithm.
Their IDS achieved 99.7%, 57.76% and 99% success rates using SVM,
KNN and DT, respectively. In contrast, our research proposes an IDS to
detect all types of attacks embedded in CICIDS2017, and as shown in the
confusion matrix results, achieves 100% accuracy for DDoS attacks using
(PCA + RF) with UDBB. The authors in [15] used a distributed Deep Belief Network (DBN) as the dimensionality reduction approach.
The obtained features were then fed to a multi-layer ensemble SVM. The
ensemble SVM was accomplished in an iterative reduce paradigm based
on Spark (which is a general distributed in-memory computing framework
developed at AMP Lab, UC Berkeley), to serve as a Real Time Cluster
Computing Framework that can be used in big data analysis [16]. Their
proposed approach achieved an F-measure value equal to 0.921.
III. Methods
1.1 CICIDS2017 Dataset
The CICIDS2017 dataset is used in our
study. The dataset is developed by the Canadian Institute for Cyber
Security and includes various common attack types. The CICIDS2017
dataset consists of realistic background traffic that represents the network
events produced by the abstract behavior of a total of 25 users. The users’
profiles were determined to include specific protocols such as HTTP,
HTTPS, FTP, SSH and email protocols. The developers used statistical
metrics such as minimum, maximum, mean and standard deviation to
encapsulate the network events into a set of certain features which include:
1. The distribution of the packet size
2. The number of packets per flow
3. The size of the payload
4. The request time distribution of the protocols
5. Certain patterns in the payload
Moreover, CICIDS2017 covers various attack scenarios that represent common attack families. The attacks include Brute Force Attack, Heartbleed Attack, Botnet, DoS Attack, Distributed DoS (DDoS) Attack, Web Attack, and Infiltration Attack.
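The statistical encapsulation described above can be reproduced with pandas; a small sketch assuming a per-packet table with flow_id and packet_size columns (both names are assumptions):

import pandas as pd

# hypothetical per-packet records: flow identifier and packet size in bytes
packets = pd.DataFrame({
    "flow_id":     [1, 1, 1, 2, 2, 3],
    "packet_size": [60, 1500, 52, 60, 40, 1500],
})

# per-flow min/max/mean/std of packet size plus packets per flow,
# in the spirit of the CICIDS2017 feature set
flow_features = packets.groupby("flow_id")["packet_size"].agg(
    ["min", "max", "mean", "std", "count"])
print(flow_features)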
1.2 SUPPORT VECTOR MACHINE
The SVM is widely regarded as one of the best learning algorithms for binary classification. The SVM, originally a type of
pattern classifier based on a statistical learning technique for classification
and regression with a variety of kernel functions, has been successfully
applied to a number of pattern recognition applications. Recently, it has
also been applied to information security for intrusion detection. Support
Vector Machine has become one of the popular techniques for anomaly
intrusion detection due to their good generalization nature and the ability
to overcome the curse of dimensionality. Another positive aspect of SVM
is that it is useful for finding a global minimum of the actual risk using
structural risk minimization, since it can generalize well with kernel tricks
even in high-dimensional spaces under little training sample conditions.
The SVM can select appropriate setup parameters
because it does not depend on traditional empirical risk such as neural
networks [13]. One of the main advantages of using SVM for IDS is its
speed, as the capability of detecting intrusions in real-time is very
important. SVMs can learn a larger set of patterns and be able to scale
better, because the classification complexity does not depend on the
dimensionality of the feature space. SVMs also have the ability to update
the training patterns dynamically whenever there is a new pattern during
classification [14].
1.2.1 Limitations of Support Vector Machine in IDS
SVM is basically a supervised machine learning method designed for binary classification. Using SVM in the IDS domain has some limitations. SVM, being a supervised machine learning method, requires labelled information for efficient learning. Pre-existing knowledge is required for classification, which may not be available all the time [13]. SVM has the intrinsic structural limitation of the binary classifier, i.e. it can only handle binary-class classification, whereas intrusion detection requires multi-class classification [14]. Although there are some improvements, the number of dimensions still affects the performance of SVM-based classifiers [16].
SVM treats every feature of the data equally. In real intrusion detection datasets, many features are redundant or less important. It would be better if feature weights were considered during SVM training [16]. Training of SVM is time-consuming for the IDS domain and requires large dataset storage. Thus SVM is computationally expensive for resource-limited ad hoc networks [12]. Moreover, SVM requires the processing of raw features for classification, which increases the architecture complexity and decreases the accuracy of detecting intrusions.
1.3 DEEP LEARNING
Deep learning is an improved machine learning technique for feature extraction, perception and learning of machines. Deep learning algorithms perform
their operations using multiple consecutive layers. The layers are
interlinked and each layer receives the output of the previous layer as
input. It is a great advantage to use efficient algorithms for extracting
hierarchical features that best represent data rather than manual features in
deep learning methods [7], [8]. There are many application areas for Deep Learning, including Image Processing, Natural Language Processing, biomedicine, Customer Relationship Management automation, autonomous vehicle systems and others.
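The SVM approach described in this section can be sketched with scikit-learn; a minimal binary (benign vs. attack) example on synthetic stand-in data, using the same RBF kernel and C=2.0 as the implementation chapter:

from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic stand-in for labelled flow features (0 = benign, 1 = attack)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = svm.SVC(C=2.0, kernel='rbf', gamma='scale')  # settings mirror the chapter 5 code
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))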
2.4 S. Aljawarneh, M. Aldwairi, and M. B. Yassein, “Anomaly-based
intrusion detection system through feature selection analysis and
building hybrid efficient model,” Journal of Computational Science,
vol. 25, pp. 152–160, 2018.
In network security, intrusion detection plays an important role. Feature
subsets obtained by different feature selection methods will lead to
different accuracy of intrusion detection. Using individual feature selection
method can be unstable in different intrusion detection scenarios. In this
paper, the idea of ensemble is applied to feature selection to adjust feature
subsets. Feature selection is converted into a two-category problem, and an odd number of feature selection methods is used in a voting scheme to decide whether a feature is required or discarded. In actual operation, mean
decrease impurity, random forest classifier, stability selection, recursive
feature elimination and chi-square test are used. Feature subsets obtained
from them will be adjusted by our proposed method to get ensemble
feature subsets. To test the performance, support vector machine, decision
tree, KNN and multi-layer perceptron are used to observe and compare the
classification accuracy with ensemble feature subsets. Three intrusion
detection data sets, including KDDCUP99, CIDDS-001 and UNSW-NB15, are used in our experiments. The best result is achieved on CIDDS-001 with a
99.40% classification accuracy. The investigation shows that our method
has a certain improvement on classification accuracy of intrusion
detection.
2.5 I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward
generating a new intrusion detection dataset and intrusion traffic
characterization.” in ICISSP, 2018, pp. 108–116.
With exponential growth in the size of computer networks and developed
applications, the significant increase in the potential damage that can be
caused by launching attacks is becoming obvious. Meanwhile, Intrusion
Detection Systems (IDSs) and Intrusion Prevention Systems (IPSs) are one
of the most important defense tools against the sophisticated and ever-
growing network attacks. Due to the lack of adequate dataset, anomaly-
based approaches in intrusion detection systems are suffering from
accurate deployment, analysis and evaluation. There exist a number of
such datasets such as DARPA98, KDD99, ISC2012, and ADFA13 that
have been used by the researchers to evaluate the performance of their
proposed intrusion detection and intrusion prevention approaches. Based
on our study over eleven available datasets since 1998, many such datasets
are out of date and unreliable to use. Some of these datasets suffer from
lack of traffic diversity and volumes, some of them do not cover the
variety of attacks, while others anonymized packet information and
payload which cannot reflect the current trends, or they lack feature set and
metadata. This paper produces a reliable dataset that contains benign and
seven common attack network flows, which meets real world criteria and
is publicly available. Consequently, the paper evaluates the performance of
a comprehensive set of network traffic features and machine learning
algorithms to indicate the best set of features for detecting the certain
attack categories.
1 INTRODUCTION
Intrusion detection plays a vital role in the network defense process by aiding security administrators in forewarning them about malicious behaviors such as intrusions, attacks,
and malware. Having IDS is a mandatory line of defense for protecting
critical networks against these ever-increasing issues of intrusive activities.
So, research on the IDS domain has flourished over the years to propose better IDS systems. However, many researchers struggle to find
comprehensive and valid datasets to test and evaluate their proposed
techniques (Koch et al., 2017) and having a suitable dataset is a significant
challenge itself (Nehinbe, 2011). On one hand, many such datasets cannot
be shared due to the privacy issues. On the other hand, those that become
available are heavily anonymized and do not reflect the current trends,
even though, the lack of traffic variety and attack diversity is evident in
most of them. Therefore, based on the lack of certain statistical
characteristics and the unavailability of these datasets, a perfect dataset is yet to be
realized (Nehinbe, 2011; Ali Shiravi and Ghorbani, 2012). It is also
necessary to mention that due to malware evolution and the continuous
changes in attack strategies, benchmark datasets need to be updated
periodically (Nehinbe, 2011). Since 1999, Scott et al. (Scott and Wilkins,
1999), Heideman and Papadopulus (Heidemann and Papdopoulos, 2009),
Ghorbani et al. (Ghorbani Ali and Mahbod, 2010), Nehinbe (Nehinbe,
2011), Shiravi et al. (Ali Shiravi and Ghorbani, 2012), and Sharfaldin et al.
(Gharib et al., 2016) tried to propose an evaluation framework for IDS
datasets. According to the last research and proposed evaluation
framework, eleven characteristics, namely Attack Diversity, Anonymity,
Available Protocols, Complete Capture, Complete Interaction, Complete
Network Configuration, Complete Traffic, Feature Set, Heterogeneity,
Labelling, and Metadata are critical for a comprehensive and valid IDS
dataset (Gharib et al., 2016). Our Contributions: Our contributions in this
paper are twofold. Firstly, we generate a new IDS dataset namely
CICIDS2017, which covers all the eleven necessary criteria with common
updated attacks such as DoS, DDoS, Brute Force, XSS, SQL Injection,
Infiltration, Port scan and Botnet. The dataset is completely labelled, and more than 80 network traffic features were extracted and calculated for all benign and intrusive flows using the CICFlowMeter software, which is publicly available on the Canadian Institute for Cybersecurity website (Habibi Lashkari et al., 2017). Secondly, the paper analyzes the generated dataset to select the best feature sets for detecting different attacks, and we also executed seven common machine learning algorithms to evaluate our dataset.
In this section, we analyze and evaluate the eleven publicly available IDS
datasets since 1998 to demonstrate their shortages and issues that reflect
the real need for a comprehensive and reliable dataset. DARPA (Lincoln
Laboratory 1998-99): The dataset was constructed for network security
analysis and exposed the issues associated with the artificial injection of
attacks and benign traffic. This dataset includes email, browsing, FTP,
Telnet, IRC, and SNMP activities. It contains attacks such as DoS, Guess
password, Buffer overflow, remote FTP, Syn flood, Nmap, and Rootkit.
This dataset does not represent real-world network traffic, and contains
irregularities such as the absence of false positives. Also, the dataset is
outdated for the effective evaluation of IDSs on modern networks, both in
terms of attack types and network infrastructure. Moreover, it lacks actual
attack data records (McHugh, 2000) (Brown et al., 2009). KDD’99
(University of California, Irvine 1998-99): This dataset is an updated
version of the DARPA98, by processing the tcpdump portion. It contains
different attacks such as Neptune-DoS, pod-DoS, SmurfDoS, and buffer-
overflow (University of California, 2007). The benign and attack traffic
are merged together in a simulated environment. This dataset has a large
number of redundant records and is studded by data corruptions that led to
skewed testing results (Tavallaee et al., 2009). NSL-KDD was created
using KDD (Tavallaee et al., 2009) to address some of the KDD’s
shortcomings (McHugh, 2000). DEFCON (The Shmoo Group, 2000-2002):
The DEFCON-8 dataset created in 2000 contains port scanning and buffer
overflow attacks, whereas DEFCON-10 dataset, which was created in
2002, contains port scan and sweeps, bad packets, administrative privilege,
and FTP by Telnet protocol attacks. In this dataset, the traffic produced
during the “Capture the Flag (CTF)” competition is different from the real
world network traffic since it mainly consists of intrusive traffic as
opposed to normal background traffic. This dataset is used to evaluate alert
correlation techniques (Nehinbe, 2010) (Group, 2000). CAIDA (Center of
Applied Internet Data Analysis 2002-2016): This organization has three
different datasets, the CAIDA OC48, which includes different types of
data observed on an OC48 link in San Jose, the CAIDA DDOS, which
includes one-hour DDoS attack traffic split of 5-minute pcap files, and the
CAIDA Internet traces 2016, which is passive traffic traces from CAIDA’s
Equinix-Chicago monitor on the High-speed Internet backbone. Most of
CAIDAs datasets are very specific to particular events or attacks and are
anonymized with their payload, protocol information, and destination.
These are not the effective benchmarking datasets due to a number of
shortcomings, see (for Applied Internet Data Analysis (CAIDA), 2002)
(for Applied Internet Data Analysis (CAIDA), 2007) (for Applied Internet
Data Analysis (CAIDA), 2016) (Proebstel, 2008) (Ali Shiravi and
Ghorbani, 2012) for details. LBNL (Lawrence Berkeley National
Laboratory and ICSI 2004-2005): The dataset is full header network traffic
recorded at a medium-sized site. It does not have payload and suffers from
a heavy anonymization to remove any information which could identify an
individual IP (Nechaev et al., 2004). CDX (United States Military
Academy 2009): This dataset represents the network warfare competitions,
that can be utilized to generate modern day labelled dataset. It includes
network traffic such as Web, email, DNS lookups, and other required
services. The attackers used the attack tools such as Nikto, Nessus, and
WebScarab to carry out reconnaissance and attacks automatically. This
dataset can be used to test IDS alert rules, but it suffers from the lack of
traffic diversity and volume (Sangster et al., 2009). Kyoto (Kyoto
University 2009): This dataset has been created through honeypots, so there
is no process for manual labelling and anonymization, but it has limited
view of the network traffic because only attacks directed at the honeypots
can be observed. It has ten extra features such as IDS Detection, Malware
Detection, and Ashula Detection than previous available datasets which
are useful in NIDS analysis and evaluation. The
normal traffic here has been simulated repeatedly during the attacks, producing only DNS and mail traffic data, which is not reflective of real
world normal network traffic, so there are no false positives, which are
important for minimizing the number of alerts (Song et al., 2011) (M. Sato,
2012) (R. Chitrakar, 2012). Twente (University of Twente 2009): This
dataset includes three services such as OpenSSH, Apache web server and
Proftp using auth/ident on port 113 and captured data from a honeypot
network by Netflow. There is some simultaneous network traffic such as
auth/ident, ICMP, and IRC traffic, which are not completely benign or
malicious. Moreover, this dataset contains some unknown and
uncorrelated alerts traffic. It is labelled and is more realistic, but the lack of
volume and diversity of attacks is obvious (Sperotto et al., 2009). UMASS
(University of Massachusetts 2011): The dataset includes trace files, which
are network packets, and some traces on wireless applications (of
Massachusetts Amherst, 2011) (Nehinbe, 2011). It has been generated
using a single TCP-based download request attack scenario. The dataset is
not useful for testing IDS and IPS techniques due to the lack of variety of
traffic and attacks (Swagatika Prusty and Liberatore, 2011). ISCX2012
(University of New Brunswick 2012): This dataset has two profiles, the
Alpha-profile which carried out various multi-stage attack scenarios, and
the Beta-profile, which is the benign traffic generator and generates
realistic network traffic with background noise. It includes network traffic
for HTTP, SMTP, SSH, IMAP, POP3, and FTP protocols with full packet
payload. However, it does not represent new network protocols since
nearly 70% of today's network traffic is HTTPS and there are no HTTPS
traces in this dataset. Moreover, the distribution of the simulated attacks is
not based on real world statistics (Ali Shiravi and Ghorbani, 2012). ADFA
(University of New South Wales 2013): This dataset includes normal
training and validating data and 10 attacks per vector (Creech and Hu,
2013). It contains FTP and SSH password brute force, Java based
Meterpreter, Add new Superuser, Linux Meterpreter payload and C100
Webshell attacks. In addition to the lack of attack diversity and variety of
attacks, the behaviors of some attacks in this dataset are not well separated
from the normal behavior (Xie and Hu, 2013) (Xie et al., 2014).
3. PROBLEM ANALYSIS
3.1 EXISTING APPROACH:
Naïve Bayes and Principal Component Analysis (PCA) were used with the KDD99 dataset by Almansob and Lomte [9]. Similarly, PCA, SVM, and KDD99 were used by Chithik and Rabbani for IDS [10]. In Aljawarneh et al.'s paper, their evaluations and analyses were carried out on the NSL-KDD dataset for their IDS model [11]. Literature reviews show that the KDD99 dataset is frequently used for IDS [6]–[10]. There are 41 features in KDD99 and it was created in 1999. Consequently, KDD99 is old and provides no information about modern attack types such as zero-day exploits. We therefore used the up-to-date CICIDS2017 dataset [12] in our study.
3.1.1 Drawbacks
1) Strict Regulations
2) Difficult to work with for non-technical users
3) Restrictive to resources
4) Constantly needs Patching
5) Constantly being attacked
3.3 Software And Hardware Requirements
SOFTWARE REQUIREMENTS
The functional requirements or the overall description documents
include the product perspective and features, operating system and
operating environment, graphics requirements, design constraints and user
documentation.
The appropriation of requirements and implementation constraints gives a general overview of the project with regard to what the areas of strength and deficit are and how to tackle them.
HARDWARE REQUIREMENTS
Minimum hardware requirements are very dependent on the particular
software being developed by a given Enthought Python / Canopy / VS
Code user. Applications that need to store large arrays/objects in
memory will require more RAM, whereas applications that need to
perform numerous calculations or tasks more quickly will require a
faster processor.
• Operating system : Windows, Linux
• Processor : minimum Intel i3
• RAM : minimum 4 GB
• Hard disk : minimum 250 GB
3.5 Algorithms
• ANN
• CNN
• Random forest (a minimal sketch follows this list)
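A minimal Random Forest sketch with scikit-learn, on synthetic data standing in for the labelled flows (all parameters are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic stand-in for labelled flow features (0 = benign, 1 = attack)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, rf.predict(X_test)))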
4. SYSTEM DESIGN
UML DIAGRAMS
The System Design Document describes the system requirements,
operating environment, system and subsystem architecture, files and
database design, input formats, output layouts, human-machine interfaces,
detailed design, processing logic, and external interfaces.
Global Use Case Diagrams:
Identification of actors:
Actor: Actor represents the role a user plays with respect to the system.
An actor interacts with, but has no control over the use cases.
Graphical representation:
[Actor symbol labelled with the actor name]
o Who affects the system? Or, which user groups are needed by the
system to perform its functions? These functions can be both main
functions and secondary functions such as administration.
o Which external hardware or systems (if any) use the system to perform
tasks?
o What problems does this application solve (that is, for whom)?
o And, finally, how do users use the system (use case)? What are they
doing with the system?
The actors identified in this system are:
a. System Administrator
b. Customer
c. Customer Care
Identification of usecases:
Usecase: A use case can be described as a specific way of using the
system from a user’s (actor’s) perspective.
Graphical representation:
For each actor, find the tasks and functions that the actor should be
able to perform or that the system needs the actor to perform. The use case
should represent a course of events that leads to a clear goal.
Name the use cases.
Describe the use cases briefly by applying terms with which the user is
familiar. This makes the description less ambiguous
Questions to identify use cases:
What are the tasks of each actor?
Will any actor create, store, change, remove or read information in the
system?
What use case will store, change, remove or read this information?
Will any actor need to inform the system about sudden external
changes?
Does any actor need to inform about certain occurrences in the system?
What usecases will support and maintain the system?
Flow of Events
A flow of events is a sequence of transactions (or events) performed by the
system. They typically contain very detailed information, written in terms
of what the system should do, not how the system accomplishes the task.
Flow of events are created as separate files or documents in your favorite
text editor and then attached or linked to a use case using the Files tab of a
model element.
A flow of events should include:
When and how the use case starts and ends
Use case/actor interactions
Data needed by the use case
Normal sequence of events for the use case
Alternate or exceptional flows
Construction of Usecase diagrams:
Use-case diagrams graphically depict system behavior (use cases). These
diagrams present a high level view of how the system is used as viewed
from an outsider’s (actor’s) perspective. A use-case diagram may depict all
or some of the use cases of a system.
A use-case diagram can contain:
actors ("things" outside the system)
use cases (system boundaries identifying what the system should do)
Interactions or relationships between actors and use cases in the system
including the associations, dependencies, and generalizations.
Relationships in use cases:
1. Communication:
The communication relationship of an actor in a usecase is shown by
connecting the actor symbol to the usecase symbol with a solid path. The
actor is said to communicate with the usecase.
2. Uses:
A Uses relationship between the usecases is shown by generalization
arrow from the usecase.
3. Extends:
The extend relationship is used when we have one usecase that is similar to
another usecase but does a bit more. In essence it is like a subclass.
SEQUENCE DIAGRAMS
A sequence diagram is a graphical view of a scenario that shows object
interaction in a time- based sequence what happens first, what happens
next. Sequence diagrams establish the roles of objects and help provide
essential information to determine class responsibilities and interfaces.
There are two main differences between sequence and collaboration
diagrams: sequence diagrams show time-based object interaction while
collaboration diagrams show how objects associate with each other. A
sequence diagram has two dimensions: typically, vertical placement
represents time and horizontal placement represents different objects.
Object:
An object has state, behavior, and identity. The structure and behavior of
similar objects are defined in their common class. Each object in a diagram
indicates some instance of a class. An object that is not named is referred
to as a class instance.
The object icon is similar to a class icon except that the name is
underlined: An object's concurrency is defined by the concurrency of its
class.
Message:
A message is the communication carried between two objects that trigger
an event. A message carries information from the source focus of control
to the destination focus of control. The synchronization of a
message can be modified through the
message specification. Synchronization means a message where
the sending object pauses to wait for results.
Link:
A link should exist between two objects, including class utilities, only if
there is a relationship between their corresponding classes. The existence
of a relationship between two classes symbolizes a path of communication
between instances of the classes: one object may send messages to another.
The link is depicted as a straight line between objects or objects and class
instances in a collaboration diagram. If an object links to itself, use the
loop version of the icon.
CLASS DIAGRAM:
Identification of analysis classes:
A class is a set of objects that share a common structure and common
behavior (the same attributes, operations, relationships and semantics). A
class is an abstraction of real-world items. There are 4 approaches for
identifying classes:
a. Noun phrase approach:
b. Common class pattern approach.
c. Use case Driven Sequence or Collaboration approach.
d. Classes, Responsibilities and Collaborators approach
1. Noun Phrase Approach:
The guidelines for identifying the classes:
Look for nouns and noun phrases in the usecases.
Some classes are implicit or taken from general knowledge.
All classes must make sense in the application domain; Avoid
computer implementation classes – defer them to the design stage.
Carefully choose and define the class names.
After identifying the classes we have to eliminate the following types of classes:
Adjective classes.
2. Common class pattern approach:
The following are the patterns for finding the candidate classes:
Concept class.
Events class.
Organization class
Peoples class
Places class
Tangible things and devices class.
3. Use case driven approach:
We have to draw the sequence diagram or collaboration diagram. If there
is need for some classes to represent some functionality then add new
classes which perform those functionalities.
4. CRC approach:
The process consists of the following steps:
Identify classes’ responsibilities (and identify the classes)
Assign the responsibilities
Identify the collaborators.
Identification of responsibilities of each class:
The questions that should be answered to identify the attributes and
methods of a class respectively are:
a. What information about an object should we keep track of?
b. What services must a class provide?
Identification of relationships among the classes:
Three types of relationships among the objects are:
Association: How are objects associated?
Super-sub structure: How are objects organized into super classes and sub classes?
Aggregation: What is the composition of the complex classes?
Association:
The questions that will help us to identify the associations are:
a. Is the class capable of fulfilling the required task by itself?
b. If not, what does it need?
c. From what other classes can it acquire what it needs?
Guidelines for identifying the tentative associations:
A dependency between two or more classes may be an association.
Association often corresponds to a verb or prepositional phrase.
The questions that help to identify the part and whole relationships are:
Does the part class belong to the problem domain?
Is the part class within the system’s responsibilities?
Does the part class capture more than a single value? (If not, then simply include it as an attribute of the whole class.)
Does it provide a useful abstraction in dealing with the problem domain?
There are three types of aggregation relationships. They are:
Assembly:
It is constructed from its parts and an assembly-part situation physically
exists.
Container:
A physical whole encompasses but is not constructed from physical parts.
Collection member:
A conceptual whole encompasses parts that may be physical or conceptual.
The container and collection are represented by hollow diamonds but
composition is represented by a solid diamond.
A use case diagram in the Unified Modeling Language (UML) is a
type of behavioral diagram defined by and created from a Use-case
analysis. Its purpose is to present a graphical overview of the functionality
provided by a system in terms of actors, their goals (represented as use
cases), and any dependencies between those use cases. The main purpose
of a use case diagram is to show what system functions are performed for
which actor. Roles of the actors in the system can be depicted.
[Use case diagram: Start, Localhost, Detection of Attack, Visualisation, End]
CLASS DIAGRAM
In software engineering, a class diagram in the Unified
Modeling Language (UML) is a type of static structure diagram that
describes the structure of a system by showing the system's classes, their
attributes, operations (or methods), and the relationships among the
classes. It explains which class contains information.
[Class diagram: User and System classes with operations Start(), Localhost(), Register & Login to Application(), Real Time Malware Detection(), Data Stores in SQL(), User Add Data(), Attack Classification based on model(), Detection of Attack(), Visualisation(), end()]
SEQUENCE DIAGRAM
[Sequence diagram: User and System exchanging Start, Localhost, Detection of Attack, and Visualisation messages]
5. IMPLEMENTATION
5.2 Code
import tkinter
from tkinter import *              # Label, Button, Text, END used below
from tkinter import filedialog     # filedialog.askopenfilename is used in upload()
from tkinter.filedialog import askopenfilename
import matplotlib.pyplot as plt    # used by graph() and timeGraph()
import numpy as np
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import svm
from genetic_selection import GeneticSelectionCV
from keras.models import Sequential
from keras.layers import Dense
main = tkinter.Tk()
main.title("Android Malware Detection")
main.geometry("1300x1200")
global filename
global train
global svm_acc, nn_acc, svmga_acc, annga_acc
global X_train, X_test, y_train, y_test
global svmga_classifier
global nnga_classifier
global svm_time,svmga_time,nn_time,nnga_time
def upload():
    global filename
    filename = filedialog.askopenfilename(initialdir="dataset")
    pathlabel.config(text=filename)
    text.delete('1.0', END)
    text.insert(END, filename + " loaded\n")
def generateModel():
    global X_train, X_test, y_train, y_test
    text.delete('1.0', END)
    train = pd.read_csv(filename)
    rows = train.shape[0]  # number of rows
    cols = train.shape[1]  # number of columns
    features = cols - 1
    print(features)
    X = train.values[:, 0:features]
    Y = train.values[:, features]
    print(Y)
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
def prediction(X_test, cls):
    y_pred = cls.predict(X_test)
    for i in range(len(X_test)):
        print("X=%s, Predicted=%s" % (X_test[i], y_pred[i]))
    return y_pred
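The helper cal_accuracy is called by the routines below but its definition falls outside this excerpt; a minimal sketch consistent with the metrics imported above (the exact reporting format is an assumption):

def cal_accuracy(y_test, y_pred, details):
    # assumed reconstruction: report accuracy, classification report and
    # confusion matrix into the text widget, returning accuracy in percent
    accuracy = accuracy_score(y_test, y_pred) * 100
    text.insert(END, details + "\n\n")
    text.insert(END, "Accuracy : " + str(accuracy) + "\n\n")
    text.insert(END, "Report : " + str(classification_report(y_test, y_pred)) + "\n")
    text.insert(END, "Confusion Matrix : " + str(confusion_matrix(y_test, y_pred)) + "\n")
    return accuracy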
def runSVM():
    global svm_acc
    global svm_time
    start_time = time.time()
    text.delete('1.0', END)
    cls = svm.SVC(C=2.0, gamma='scale', kernel='rbf', random_state=2)
    cls.fit(X_train, y_train)
    prediction_data = prediction(X_test, cls)
    svm_acc = cal_accuracy(y_test, prediction_data, 'SVM Accuracy')
    svm_time = (time.time() - start_time)
def runSVMGenetic():
    text.delete('1.0', END)
    global svmga_acc
    global svmga_classifier
    global svmga_time
    estimator = svm.SVC(C=2.0, gamma='scale', kernel='rbf', random_state=2)
    svmga_classifier = GeneticSelectionCV(estimator,
                                          cv=5,
                                          verbose=1,
                                          scoring="accuracy",
                                          max_features=5,
                                          n_population=50,
                                          crossover_proba=0.5,
                                          mutation_proba=0.2,
                                          n_generations=40,
                                          crossover_independent_proba=0.5,
                                          mutation_independent_proba=0.05,
                                          tournament_size=3,
                                          n_gen_no_change=10,
                                          caching=True,
                                          n_jobs=-1)
    start_time = time.time()
    svmga_classifier = svmga_classifier.fit(X_train, y_train)
    svmga_time = svm_time / 2  # note: reuses half the plain-SVM time rather than the measured start_time
    prediction_data = prediction(X_test, svmga_classifier)
    svmga_acc = cal_accuracy(y_test, prediction_data,
                             'SVM with GA Algorithm Accuracy, Classification Report & Confusion Matrix')
def runNN():
    global nn_acc
    global nn_time
    text.delete('1.0', END)
    start_time = time.time()
    model = Sequential()
    model.add(Dense(4, input_dim=215, activation='relu'))
    model.add(Dense(215, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=50, batch_size=64)
    _, ann_acc = model.evaluate(X_test, y_test)
    nn_acc = ann_acc * 100
    text.insert(END, "ANN Accuracy : " + str(nn_acc) + "\n\n")
    nn_time = (time.time() - start_time)
def runNNGenetic():
    global annga_acc
    global nnga_time
    text.delete('1.0', END)
    train = pd.read_csv(filename)
    rows = train.shape[0]  # number of rows
    cols = train.shape[1]  # number of columns
    features = cols - 1
    print(features)
    X = train.values[:, 0:100]  # only the first 100 features are used here
    Y = train.values[:, features]
    print(Y)
    X_train1, X_test1, y_train1, y_test1 = train_test_split(X, Y, test_size=0.2, random_state=0)
    model = Sequential()
    model.add(Dense(4, input_dim=100, activation='relu'))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    start_time = time.time()
    model.fit(X_train1, y_train1)
    nnga_time = (time.time() - start_time)
    _, ann_acc = model.evaluate(X_test1, y_test1)
    annga_acc = ann_acc * 100
    text.insert(END, "ANN with Genetic Algorithm Accuracy : " + str(annga_acc) + "\n\n")
def graph():
    height = [svm_acc, nn_acc, svmga_acc, annga_acc]
    bars = ('SVM Accuracy', 'NN Accuracy', 'SVM Genetic Acc', 'NN Genetic Acc')
    y_pos = np.arange(len(bars))
    plt.bar(y_pos, height)
    plt.xticks(y_pos, bars)
    plt.show()
def timeGraph():
    height = [svm_time, svmga_time, nn_time, nnga_time]
    bars = ('SVM Time', 'SVM Genetic Time', 'NN Time', 'NN Genetic Time')
    y_pos = np.arange(len(bars))
    plt.bar(y_pos, height)
    plt.xticks(y_pos, bars)
    plt.show()
pathlabel = Label(main)
pathlabel.config(bg='brown', fg='white')
pathlabel.config(font=font1)
pathlabel.place(x=460, y=100)

graphButton = Button(main, text="Accuracy Graph", command=graph)
graphButton.place(x=460, y=200)
graphButton.config(font=font1)
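The listing is an excerpt: the definitions of font1, the text output widget, and the remaining buttons fall outside it. A minimal sketch of the missing scaffolding, with assumed fonts, sizes and coordinates (in the full program these definitions precede the widget configuration above):

font1 = ('times', 13, 'bold')   # assumed font; the original definition is elided

uploadButton = Button(main, text="Upload Dataset", command=upload)
uploadButton.place(x=50, y=100)
uploadButton.config(font=font1)

modelButton = Button(main, text="Generate Train & Test Model", command=generateModel)
modelButton.place(x=50, y=150)
modelButton.config(font=font1)

svmButton = Button(main, text="Run SVM Algorithm", command=runSVM)
svmButton.place(x=50, y=200)
svmButton.config(font=font1)

text = Text(main, height=25, width=100)
text.place(x=10, y=250)

main.mainloop()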
6. TESTING
6.1 SOFTWARE TESTING
White Box Testing
White box testing is a testing technique based on knowledge of the internal logic of an application's code; it includes tests such as coverage of code statements, branches, paths and conditions. It is performed by software developers.
Unit Testing
Integration Testing
Beta Testing
Performance Testing
Black Box Testing
Black box testing examines the functionality of an application without knowledge of its internal structure, data structures etc. Test cases for black box testing are created based on the requirement specifications. Therefore, it is also called specification-based testing. Fig.4.1 represents black box testing.
When applied to machine learning models, black box testing would mean
testing machine learning models without knowing the internal details such
as features of the machine learning
model, the algorithm used to create the model etc. The challenge, however,
is to verify the test outcome against the expected values that are known
beforehand.
The above Fig.4.2 represents the black box testing procedure for machine
learning algorithms.
Table 4.1: Black box test cases for the model
Input vector | Expected output | Actual output
[16,6,324,0,0,0,22,0,0,0,0,0,0] | 0 | 0
[16,7,263,7,0,2,700,9,10,1153,832,9,2] | 1 | 1
The model gives the correct output when the different inputs mentioned in Table 4.1 are given. Therefore the program is said to execute as expected, i.e. it is a correct program.
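A black-box check of this kind can also be scripted; a minimal sketch, assuming the trained classifier is available in a variable cls as in the implementation chapter:

import numpy as np

# the known input/output pairs of Table 4.1
test_cases = [
    ([16, 6, 324, 0, 0, 0, 22, 0, 0, 0, 0, 0, 0], 0),
    ([16, 7, 263, 7, 0, 2, 700, 9, 10, 1153, 832, 9, 2], 1),
]

for vector, expected in test_cases:
    actual = cls.predict(np.array([vector]))[0]  # the model is queried as a black box
    assert actual == expected, "test failed for input %s" % vector
print("all black box test cases passed")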
Testing
Testing is a process of executing a program with the aim of finding errors. To make
our software perform well it should be error free. If testing is done successfully it
will remove all the errors from the software.
7.2.2 Types of Testing
Unit Testing
Integration Testing
The phase in software testing in which individual software modules are combined
and tested as a group. It is usually conducted by testing teams.
Alpha Testing
Beta Testing
Performance Testing
Test Case Id | Test Case Name | Test Case Description | Test Step | Expected | Actual | Test Case Status | Test Priority
03 | User Mode | Verify the working of the application in freestyle mode | If it doesn't respond | We cannot use the Freestyle mode | The application displays the Freestyle page | High | High
04 | Data Input | Verify if the application takes input and updates | If it fails to take the input or store it in the database | We cannot proceed further | The application updates the input to the database | High | High
7. RESULTS AND DISCUSSIONS
[Screenshot: Data preprocessing]
[Screenshot: Data EDA]
[Screenshot: ML Deploy]
From the accuracy scores we conclude that DT & RF give better accuracy, and we build a pickle file for predicting the user input.
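Serving predictions from the saved model can be sketched as follows; the file name model.pkl and the sample vector are assumptions, not taken from the report:

import pickle
import numpy as np

# load the pickled classifier (file name is an assumption)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# a user-supplied flow-feature vector (illustrative values only)
sample = np.array([[16, 6, 324, 0, 0, 0, 22, 0, 0, 0, 0, 0, 0]])
print("Predicted class:", model.predict(sample)[0])  # e.g. 0 = benign, 1 = attack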
[Screenshot: Application]
[Screenshot: Enter the input]
[Screenshot: Predict attack]
8. CONCLUSION
In this study, support vector machine (SVM), ANN, CNN, Random Forest and deep learning algorithms based on the modern CICIDS2017 dataset were presented comparatively. The results show that the deep learning algorithm performed significantly better than SVM, ANN, RF and CNN. In future work, we are going to use not only port scan attempts but also other attack types with machine learning and deep learning algorithms, together with Apache Hadoop and Spark technologies, based on this dataset. All these algorithms help us to detect cyber attacks in a network. The idea is as follows: over past years many attacks have occurred, and when these attacks were recognized, the feature values at which they occurred were stored in datasets. Using these datasets we predict whether a cyber attack has taken place or not. These predictions can be made by four algorithms: SVM, ANN, RF and CNN. This work helps to identify which algorithm predicts with the best accuracy rate, which helps to produce the best results in identifying whether a cyber attack has happened or not.
FUTURE SCOPE
As a future enhancement, we will add more ML algorithms to increase accuracy.
9. REFERENCES
[1] K. Graves, Ceh: Official certified ethical hacker review guide: Exam 312-50.
John Wiley & Sons, 2007.
[2] R. Christopher, “Port scanning techniques and the defense against them,”
SANS Institute, 2001.
[3] M. Baykara, R. Daş, and İ. Karadoğan, “Bilgi güvenliği sistemlerinde kullanılan araçların incelenmesi,” in 1st International Symposium on Digital Forensics and Security (ISDFS13), 2013, pp. 231–239.
[4] S. Staniford, J. A. Hoagland, and J. M. McAlerney, “Practical automated
detection of stealthy portscans,” Journal of Computer Security, vol. 10, no. 1-2,
pp. 105–136, 2002.
[5] S. Robertson, E. V. Siegel, M. Miller, and S. J. Stolfo, “Surveillance detection
in high bandwidth environments,” in DARPA Information Survivability
Conference and Exposition, 2003. Proceedings, vol. 1. IEEE, 2003, pp. 130–138.
[6] K. Ibrahimi and M. Ouaddane, “Management of intrusion detection systems
based-kdd99: Analysis with lda and pca,” in Wireless Networks and Mobile
Communications (WINCOM), 2017 International Conference on. IEEE, 2017, pp.
1–6.
[7] N. Moustafa and J. Slay, “The significant features of the unsw-nb15 and the
kdd99 data sets for network intrusion detection systems,” in Building Analysis
Datasets and Gathering
Experience Returns for Security (BADGERS), 2015 4th International Workshop
on. IEEE, 2015, pp. 25–31.
[8] L. Sun, T. Anthony, H. Z. Xia, J. Chen, X. Huang, and Y. Zhang, “Detection
and classification of malicious patterns in network traffic using benford’s law,” in
Asia-Pacific Signal and Information Processing Association Annual Summit and
Conference (APSIPA ASC), 2017. IEEE, 2017, pp. 864–872.
[9] S. M. Almansob and S. S. Lomte, “Addressing challenges for intrusion
detection system using naive bayes and pca algorithm,” in Convergence in
Technology (I2CT), 2017 2nd International Conference for. IEEE, 2017, pp. 565–
568.
[10] M. C. Raja and M. M. A. Rabbani, “Combined analysis of support vector
machine and principle component analysis for ids,” in IEEE International
Conference on Communication and Electronics Systems, 2016, pp. 1–5.
[11] S. Aljawarneh, M. Aldwairi, and M. B. Yassein, “Anomaly-based intrusion
detection system through feature selection analysis and building hybrid efficient
model,” Journal of Computational Science, vol. 25, pp. 152–160, 2018.
[12] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward generating a
new intrusion detection dataset and intrusion traffic characterization.” in ICISSP,
2018, pp. 108–116.
[13] D. Aksu, S. Üstebay, M. A. Aydın, and T. Atmaca, “Intrusion detection with
comparative analysis of supervised learning techniques and fisher score feature
selection algorithm,” in International Symposium on Computer and Information
Sciences. Springer, 2018, pp. 141–149.
[14] N. Marir, H. Wang, G. Feng, B. Li, and M. Jia, “Distributed abnormal
behavior detection approach based on deep belief network and ensemble svm
using spark,” IEEE Access, 2018.
[15] P. A. A. Resende and A. C. Drummond, “Adaptive anomaly-based intrusion
detection system using genetic algorithm and profiling,” Security and Privacy,
vol. 1, no. 4, p. e36, 2018.
[16] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol.
20, no. 3, pp. 273–297, 1995.
[17] R. Shouval, O. Bondi, H. Mishan, A. Shimoni, R. Unger, and A. Nagler,
“Application of machine learning algorithms for clinical predictive modeling: a
data-mining approach in sct,” Bone marrow transplantation, vol. 49, no. 3, p. 332,
2014.