Network Intrusion Detection System Report
Submitted to
Master of Science
in
Computer Science
Submitted by:
Hareesha H M
(P18UZ21S0145)
Under the Guidance of
B G Shivakumar
Assistant Professor
Department of Computer Science
Bengaluru City University
Central College Campus
Bengaluru – 560001
OCTOBER 2023
BENGALURU CITY UNIVERSITY, BENGALURU
Dr. AMBEDKAR VEEDHI, CENTRAL COLLEGE CAMPUS, BENGALURU – 560001
CERTIFICATE
External Viva
Name of the Examiners Signature with Date
1.
2.
DECLARATION
The successful completion of any work is not complete until all those who helped in carrying it out have been thanked. I take this opportunity to thank all those who helped me carry out this project work effectively and on time.
I also thank all the faculty members of our department who supported me, directly and indirectly, in the successful completion of this project work.
I express my gratitude to my beloved family members and friends for their valuable support and for encouraging me to complete my project successfully.
Hareesha H M
(P18UZ21S0145)
ABSTRACT
The steady growth in the use of technology has significantly increased the amount of data processed over the Internet. With so much data flowing over the Internet comes the need to secure it, and this is where a Network Intrusion Detection System (NIDS) comes into the picture and helps detect security threats. An NIDS is a system that monitors and analyzes data to detect any intrusion into a system or network. Intruders find different ways to penetrate a network.
The proposed NIDS is implemented using machine learning algorithms to classify attacks and detect them as they happen, and also to determine which machine learning algorithm is best suited to identifying attacks.
Keywords:
Intrusion, Network Intrusion Detection System, Denial of Service, User to Root attacks, Remote to User attacks, Local Area Network, Principal Component Analysis, Naïve Bayes, Decision Tree, KNN Algorithm, Logistic Regression, Alerts, False Positives, False Negatives
CONTENTS
7 CONCLUSION
FUTURE SCOPE
8 REFERENCES
LIST OF FIGURES
Network Intrusion Detection System
1. INTRODUCTION
NIDS is a device or software application that monitors a network or systems for malicious activity or
policy violations. Any intrusion activity or violation is typically reported either to an administrator or
collected centrally using a security information and event management (SIEM) system. A SIEM system
combines outputs from multiple sources and uses alarm filtering techniques to distinguish malicious activity
from false alarms.
Although Network Intrusion Detection Systems monitor networks for potentially malicious activity, they are also prone to false alarms. Hence, organizations need to fine-tune their NIDS products when they first install them. This means properly setting up the intrusion detection systems to recognize what normal traffic on the network looks like as compared to malicious activity.
Network intrusion detection systems (NIDS) are set up at a planned point within the network to examine traffic from all devices on the network. They observe passing traffic on the entire subnet and match it against a collection of known attacks. Once an attack is identified or abnormal behavior is observed, an alert can be sent to the administrator. An example of NIDS usage is installing it on the subnet where firewalls are located in order to see if someone is trying to crack the firewall.
Host intrusion detection systems (HIDS) run on independent hosts or devices on the network. A HIDS monitors the incoming and outgoing packets of its own device only and alerts the administrator if suspicious or malicious activity is detected. It takes a snapshot of existing system files and compares it with the previous snapshot. If critical system files were edited or deleted, an alert is sent to the administrator to investigate. An example of HIDS usage can be seen on mission-critical machines, which are not expected to change their layout.
A protocol-based intrusion detection system (PIDS) comprises a system or agent that consistently resides at the front end of a server, monitoring and interpreting the protocol between a user/device and the server. It tries to secure a web server by regularly monitoring the HTTPS protocol stream and the related HTTP traffic. Because HTTPS traffic is encrypted on the wire, the system needs to reside at the interface where the traffic is decrypted, before it enters the web presentation layer.
An application protocol-based intrusion detection system (APIDS) is a system or agent that generally resides within a group of servers. It identifies intrusions by monitoring and interpreting the communication on application-specific protocols. For example, it would monitor the SQL protocol specific to the middleware as it transacts with the database in the web server.
A hybrid intrusion detection system combines two or more intrusion detection approaches. In a hybrid system, host agent or system data is combined with network information to develop a complete view of the network system. A hybrid intrusion detection system is generally considered more effective than the other intrusion detection systems. Prelude is an example of a hybrid IDS.
1. Signature-based Method
A signature-based NIDS detects attacks on the basis of specific patterns in the network traffic, such as the number of bytes or the number of 1s or 0s. It also detects attacks on the basis of already known malicious instruction sequences used by malware. The patterns detected by the NIDS are known as signatures.
A signature-based NIDS can easily detect attacks whose pattern (signature) already exists in the system, but it has great difficulty detecting new malware attacks, as their pattern (signature) is not known.
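The signature-matching idea can be illustrated with a minimal Python sketch. The signature names and byte patterns below are hypothetical and chosen only for illustration; a real signature database is far larger and uses richer rules than plain substring matching.

```python
from typing import List

# Hypothetical signature database (illustrative only): name -> byte pattern.
SIGNATURES = {
    "shell_spawn": b"/bin/sh",
    "path_traversal": b"../../",
}


def match_signatures(payload: bytes) -> List[str]:
    """Return the names of all known signatures found in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]


# A payload containing a known pattern is flagged.
known = match_signatures(b"GET /../../etc/passwd HTTP/1.1")
# A payload with no known pattern passes, even if it were a new attack.
unknown = match_signatures(b"GET /index.html HTTP/1.1")
print(known, unknown)
```

The second call shows the weakness described above: traffic that matches no stored signature produces no alert.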
2. Anomaly-based Method
Anomaly-based NIDS was introduced to detect unknown malware attacks, as new malware is developed rapidly. An anomaly-based NIDS uses machine learning to create a model of trustworthy activity; incoming traffic is compared with that model and declared suspicious if it does not fit the model. Machine learning based methods generalize better than signature-based NIDS, as these models can be trained according to the applications and hardware configurations.
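As a rough sketch of the anomaly-based idea, the following Python example builds a very simple model of normal activity and flags values that deviate strongly from it. It assumes a single hypothetical feature (packets per second) and uses a mean and standard deviation baseline as a stand-in for the trained machine learning model. The 3-sigma threshold is also an assumption.

```python
import statistics


def build_baseline(normal_values):
    """Summarise trusted traffic as (mean, standard deviation)."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def is_suspicious(value, baseline, threshold=3.0):
    """Flag a value that lies more than `threshold` deviations from the mean."""
    mean, std = baseline
    return abs(value - mean) > threshold * std


# Hypothetical packets-per-second readings observed during normal operation.
normal_rates = [98, 102, 101, 99, 100, 97, 103]
baseline = build_baseline(normal_rates)

print(is_suspicious(101, baseline))  # close to the baseline
print(is_suspicious(500, baseline))  # far from the baseline
```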
An NIDS and a firewall are both related to network security, but an NIDS differs from a firewall in that a firewall looks outward for intrusions in order to stop them from happening. Firewalls restrict access between networks to prevent intrusion and do not signal an attack that comes from inside the network. An NIDS describes a suspected intrusion once it has happened and then signals an alert.
Building a reliable network is a very difficult task considering all different possible types of attacks.
Nowadays, computer networks and their services are widely used in industry, business, and all arenas of
life. Security personnel and everyone who has a responsibility for providing protection for a network and its
users, have serious concerns about intruder attacks.
Network administrators and security officers try to provide a protected environment for user accounts, network resources, personal files and passwords. Attackers generally pursue one of two goals: making a network service unavailable to its users, or violating personal information. Denial of service (DoS) is one of the most frequent kinds of attack on network resources, making network services unavailable to their users. There are many types of DoS attacks, and each type consumes network resources in its own way to achieve the intruder's aim, which is to render the network unavailable to its users. Remote to Local (R2L) is a type of network attack in which an intruder sends a set of packets to another computer or server over a network on which he/she does not have permission to access as a local user. User to Root (U2R) is a second type of attack, in which the intruder starts with access as a normal user and, after several attempts, gains full (root) access. Probing is a third type of attack, in which the intruder scans network devices to find weaknesses in the topology design or open ports, and then uses them later for illegal access to personal information. Examples of probing over a network include nmap, portsweep and ipsweep.
An NIDS has become an essential part of any computer network, because it can capture these kinds of attacks at an early stage. It uses classification techniques to decide whether each packet passing through the network is a normal packet or an attack packet (i.e. DoS, U2R, R2L or Probe). Software that detects network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.
Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks. The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.
A connection is a sequence of TCP packets starting and ending at some well defined times, between
which data flows to and from a source IP address to a target IP address under some well defined protocol.
Each connection is labeled as either normal, or as an attack, with exactly one specific attack type. Each
connection record consists of about 100 bytes.
Denial of Service (DoS)
An attacker tries to prevent legitimate users from using a service. For example, SYN flood, Smurf and teardrop.
User to Root (U2R)
An attacker has local access to the victim machine and tries to gain super-user privilege. For example, buffer overflow attacks.
Remote to Local (R2L)
An attacker tries to gain access to the victim machine without having an account on it. For example, a password guessing attack.
Probe
An attacker tries to gain information about the target host. For example, port-scan and ping-sweep.
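The four attack categories can be attached to individual attack labels with a simple lookup. The sketch below is illustrative only: it covers just a handful of the attack names mentioned in this report, and it assumes that labels may carry a trailing period, as they do in the raw KDD files.

```python
# Partial, illustrative mapping from attack label to attack category.
ATTACK_CATEGORY = {
    "neptune": "DoS",          # SYN flood
    "smurf": "DoS",
    "teardrop": "DoS",
    "buffer_overflow": "U2R",
    "guess_passwd": "R2L",     # password guessing
    "portsweep": "Probe",
    "ipsweep": "Probe",
    "nmap": "Probe",
}


def to_category(label):
    """Map a raw connection label to normal, DoS, U2R, R2L, Probe or unknown."""
    label = label.strip().rstrip(".")  # assumed: raw labels may end with "."
    if label == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")


for raw_label in ["normal.", "smurf.", "guess_passwd", "nmap."]:
    print(raw_label, "->", to_category(raw_label))
```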
The motivation for this work is to propose a security system that detects malicious behaviors launched toward a system at the system call (SC) level. The NIDS uses data mining approaches, namely Decision Tree, Logistic Regression and KNN, to identify attacks. The attack features are learned by the machine learning algorithms.
The contributions of the proposed work are: 1) identifying the attack class by applying machine learning algorithms, and 2) identifying which algorithm is best suited to the NIDS problem in order to resist insider attacks effectively.
An intrusion detection system uses classification techniques to decide whether each packet passing through the network is a normal packet or an attack. Our objective is to classify attacks into multiple attack types, namely DoS, U2R, R2L and Probe.
Network intrusion detection begins where the firewall ends. Preventing unauthorized entry is best, but not always possible. It is important that the system is reliable, accurate and secure. Intrusion detection is defined as real-time monitoring and analysis of network activity and data for potential vulnerabilities and attacks in progress.
One major limitation of current Network intrusion detection system (NIDS) technologies is the
requirement to filter false alarms. NIDS is defined as a system that tries to detect and alert of attempted
intrusions into a system or a network. NIDSs are classified into two major approaches. Network Intrusion
detection is the process of monitoring the events occurring in a computer system or network and analyzing
them for signs of possible incidents, which are violations or imminent threats of violation of computer
security policies, acceptable use policies, or standard security practices. Network Intrusion prevention is the
process of performing intrusion detection and attempting to stop detected possible incidents.
Network Intrusion Detection and Prevention Systems (NIDPS) are primarily focused on identifying
possible incidents, logging information about them, attempting to stop them, and reporting them to security
administrators. In addition, organizations use IDPSs for other purposes, such as identifying problems with
security policies, documenting existing threats and deterring individuals from violating security policies.
IDPSs have become a necessary addition to the security infrastructure of nearly every organization. Thus, it is necessary to propose a strong detection mechanism to identify attacks.
Computer forensics science, which views computer systems as crime scenes, aims to identify, preserve,
recover, analyze, and present facts and opinions on information collected for a security event. It analyzes
what attackers have done, such as spreading computer viruses, malware, and malicious code, and
conducting DDoS attacks.
A developer used a self-developed packet sniffer to collect network packets with which to discriminate network attacks with the help of network states and packet distribution. O'Shaughnessy and Gray acquired network intrusion and attack patterns from system log files. These files contain traces of computer misuse, which means that, from synthetically generated log files, these traces or patterns of misuse can be more accurately reproduced. Wu and Banzhaf reviewed the research progress in applying methods of computational intelligence, including artificial neural networks, fuzzy systems, evolutionary computation, artificial immune systems, and swarm intelligence, to detect malicious behaviors. The authors systematically summarized and compared different intrusion detection methods, giving a clear view of the existing research challenges. These techniques and applications contribute substantially to network security.
In existing work, a security system was developed that collects forensic features for users at the command level rather than at the SC level, by invoking data mining and forensic techniques. Moreover, if attackers use many sessions to issue attacks, e.g., multistage attacks, or launch DDoS attacks, then it is not easy for that system to identify attack patterns. Hu et al. presented an intelligent lightweight NIDS that utilizes a forensic technique to profile user behaviors and a data mining technique to identify cooperative attacks. The authors claimed that the system could detect intrusions effectively and efficiently in real time. However, they did not mention the SC filter.
Giffin et al. provided another example of integrating computer forensics with a knowledge-based system. The system adopts a predefined model which allows SC sequences to be executed normally and is employed by a detection system to restrict program execution, ensuring the security of the protected system. This is helpful in detecting applications that issue a series of malicious SCs and in identifying attack sequences that have been collected in knowledge bases. When an undetected attack is presented, the system frequently finds the attack sequence in 2 s as its computation overhead. Fiore et al. explored the effectiveness of a detection approach based on machine learning, using the Discriminative Restricted Boltzmann Machine to combine the expressive power of generative models with good classification accuracy, and to infer part of its knowledge from incomplete training data, so that the network anomaly detection scheme can provide an adequate degree of protection from both external and internal menaces. Faisal et al. analyzed the possibility of using data stream mining to enhance the security of advanced metering infrastructure through an NIDS. The advanced metering infrastructure, which is one of the most crucial components of the smart grid, serves as a bridge providing bidirectional information flow between the user domain and the utility domain. The authors treat an NIDS as a second-line security measure after the first line of primary advanced metering infrastructure security techniques such as encryption, authorization, and authentication.
Disadvantages
Most intrusion detection techniques focus on how to find malicious network behaviors and acquire the characteristics of attack packets, i.e., attack patterns, based on the histories recorded in log files. Real-time attack detection at the incoming traffic rate is still challenging.
Existing techniques cannot easily authenticate remote-login users and detect specific types of intrusions, e.g., when an unauthorized user logs in to a system with a valid user ID and password.
A smart network intrusion detection system (NIDS) is viewed as an effective solution for network security and protection against external threats. However, existing NIDS often have a lower detection rate under new attacks and a high overhead when working with audit data, and thus machine learning methods have been widely applied in intrusion detection.
In our proposed method, Decision Tree, Logistic Regression, Random Forest and KNN are used as learning methods for solving the classification problem of pattern recognition and intrusion identification. These algorithms were chosen because they are well suited to problems with small samples, nonlinearity and high dimensionality.
Advantages
1. Improved Accuracy: Machine learning algorithms such as Decision Tree, Logistic Regression and KNN can adapt and learn from the network's behavior, which can result in more accurate detection of intrusions. This can lead to a reduced false positive rate, making the system more efficient in identifying actual threats.
2. Dynamic Adaptability: NIDS based on machine learning can adapt to new and evolving threats. They can continuously learn and update their detection models, making them well suited to detecting previously unseen attack patterns.
3. Reduced Dimensionality: The use of Rough Set Theory for data preprocessing can help in reducing the dimensionality of the data. This can make the system more efficient and reduce the computational resources required.
4. Customization: Machine learning-based NIDS can be customized for specific network environments and security requirements. They can be fine-tuned to the unique needs of an organization.
True Positive
False Positive
False Negative
True Negative
NIDS are typically evaluated based on the following standard performance measures:
True Positive Rate (TPR): It is calculated as the ratio between the number of correctly predicted attacks and the total number of attacks. If all intrusions are detected then the TPR is 1, which is extremely rare for an NIDS. TPR is also called Detection Rate (DR) or Sensitivity.
False Positive Rate (FPR): It is calculated as the ratio between the number of normal instances incorrectly classified as an attack and the total number of normal instances.
False Negative Rate (FNR): A false negative occurs when a detector fails to identify an anomaly and classifies it as normal.
Classification Rate (CR): The CR measures how accurately the NIDS detects normal or anomalous traffic behavior. It is the percentage of correctly predicted instances out of all instances.
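The measures above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch with hypothetical counts. It treats the false negative rate (FNR) as the share of attacks classified as normal, which is the usual definition and the complement of the TPR.

```python
def nids_metrics(tp, fp, fn, tn):
    """Compute the standard NIDS measures from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),                   # detection rate / sensitivity
        "FPR": fp / (fp + tn),                   # normal traffic flagged as attack
        "FNR": fn / (fn + tp),                   # attacks missed (assumed definition)
        "CR": (tp + tn) / (tp + fp + fn + tn),   # classification rate (accuracy)
    }


# Hypothetical counts: 100 attacks (90 detected) and 100 normal instances (5 misflagged).
metrics = nids_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the TPR is 0.9, the FPR is 0.05, the FNR is 0.1 and the CR is 0.925.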
2. LITERATURE SURVEY
Das, Vipin & Pathak, Vijaya & Sharma, Sattvik & Sreevathsan & Srikanth, MVVNS & Kumar T, Gireesh. (2010), International Journal of Computer Science & Information Technology.
Network and system security is of paramount importance in the present data communication environment. Hackers and intruders can make many successful attempts to crash networks and web services through unauthorized intrusion. New threats, and the solutions to prevent them, are emerging together with the evolution of secured systems. Network Intrusion Detection Systems (NIDS) are one of these solutions. The main function of an Intrusion Detection System is to protect resources from threats. It analyzes and predicts the behaviours of users, and these behaviours are then considered either an attack or normal behaviour. The authors use Rough Set Theory (RST) and Support Vector Machine (SVM) to detect network intrusions. First, packets are captured from the network, and RST is used to pre-process the data and reduce the dimensions. The features selected by RST are then sent to the SVM model for learning and testing respectively. The method is effective in decreasing the space density of data. The experiments compare the results with Principal Component Analysis (PCA) and show that the RST and SVM schema could reduce the false positive rate and increase the accuracy.
Techniques Used
The authors used a Support Vector Machine (SVM) to detect network intrusions. They proposed an intrusion detection method using an SVM-based system on top of RST, which reduces the number of features from 41 to 29. They also compared the performance of Rough Set Theory (RST) with PCA.
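A reduce-then-classify pipeline of this kind can be sketched with scikit-learn. Note the assumptions: scikit-learn does not provide Rough Set Theory, so the sketch uses PCA (the baseline the paper compares against) to go from 41 to 29 dimensions, and the data is synthetic rather than captured network traffic. It is not the authors' implementation.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a 41-feature intrusion dataset (assumption).
X, y = make_classification(n_samples=400, n_features=41, n_informative=12,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale, reduce 41 features to 29 components with PCA, then classify with an SVM.
model = make_pipeline(StandardScaler(), PCA(n_components=29), SVC(kernel="rbf"))
model.fit(X_train, y_train)

reduced_dim = model.named_steps["pca"].n_components_
accuracy = model.score(X_test, y_test)
print(f"components kept: {reduced_dim}, test accuracy: {accuracy:.3f}")
```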
Disadvantages
1. Training and Maintenance: Machine learning-based NIDS systems require significant initial training with
labeled data. Keeping the models up to date and maintaining their accuracy over time can be resource-
intensive.
2. False Negatives: While machine learning can reduce false positives, it can also lead to false negatives.
Some attacks might go undetected if they do not match known patterns in the training data.
2.2 Detecting Web based DDoS Attack using MapReduce operations in Cloud Computing Environment
Choi, J & Choi, Chang & Ko, Byeongkyu & Choi, D & Kim, P. (2013), Journal of
Internet Services and Information Security. 3. 28-37.
Distributed denial of service (DDoS) attacks are the most serious factor among network security risks in the cloud computing environment. This study proposes a method that integrates detection of HTTP GET flooding, one kind of DDoS attack, with MapReduce processing for fast attack detection in a cloud computing environment. The method aims to ensure the availability of the target system through accurate and reliable detection of HTTP GET flooding. In the experiments, the processing time of pattern detection on attack features is compared with that of Snort detection. The proposed method outperforms Snort in the experimental results because its processing time is shorter as congestion increases.
Keywords: DDoS Attack, HTTP GET Flooding Attack, Web Security, MapReduce
Techniques used
They proposed integrating HTTP GET flooding detection with MapReduce processing for fast attack detection in a cloud computing environment. This method can help ensure the availability of the target system through accurate and reliable detection of HTTP GET flooding.
Detection of a DDoS attack starts by analyzing the normal state of each network. The next step is to define the parameters for network attack analysis. The final step is to define a threshold for each parameter based on its normal state. The analyzed parameters include CPU usage, load, packet size, the distribution of information in packet headers, the protocol distribution used to classify network services, the maximum and minimum traffic values, monitoring of flows that use spoofed addresses, and so on. Traffic analysis can be improved by combining these parameters.
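The threshold idea can be illustrated with a small map and reduce style count of HTTP GET requests per source address. This is only a single-machine sketch of the general technique, not the paper's MapReduce job. The log format ("ip method path") and the threshold value are hypothetical.

```python
from collections import Counter


def map_phase(log_lines):
    """Map step: emit (source_ip, 1) for every HTTP GET request."""
    for line in log_lines:
        ip, method, _path = line.split()
        if method == "GET":
            yield ip, 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted counts per source IP."""
    counts = Counter()
    for ip, n in pairs:
        counts[ip] += n
    return counts


def detect_flooders(log_lines, threshold):
    """Return source IPs whose GET count exceeds the normal-state threshold."""
    counts = reduce_phase(map_phase(log_lines))
    return sorted(ip for ip, n in counts.items() if n > threshold)


# Hypothetical access log: one client floods, another behaves normally.
log = (["10.0.0.1 GET /"] * 50
       + ["10.0.0.2 GET /index.html"] * 3
       + ["10.0.0.2 POST /login"])
flooders = detect_flooders(log, threshold=20)
print(flooders)
```

In a real deployment the threshold would be derived from the measured normal state of the network, as the paper describes, rather than fixed by hand.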
The paper suggests using MapReduce operations for detecting web-based Distributed Denial
of Service (DDoS) attacks in a cloud computing environment, specifically focusing on HTTP GET flooding.
While this approach has its advantages, there are also some disadvantages to consider:
Disadvantages
2. Learning Curve: Organizations may need to invest in training and expertise to effectively implement and
manage MapReduce operations for DDoS detection.
2.3 A Pattern Recognition Scheme for Distributed Denial of Service (DDoS) Attacks in Wireless Sensor Networks
Baig, Zubair & Baqer, M & Khan, Asad. (2006), 1050 - 1054. 10.1109/ICPR.2006.147.
This paper defines distinct attack patterns depicting Distributed Denial of Service (DDoS) attacks
against target nodes within wireless sensor networks for three most commonly used network topologies. A
Graph Neuron (GN)-based, decentralized pattern recognition scheme is proposed for attack detection. The
scheme does analysis of internal traffic flow of the network for DDoS attack patterns. We stipulate that the
attack patterns depend on both the current energy levels, as well as the energy consumption rates of
individual target nodes. The results of varying pattern update rates on the pattern recognition accuracies for
the three network topologies are included in the end to test the effectiveness of our implementation.
Techniques used
The Graph Neuron (GN) is an in-network, distributed pattern recognition algorithm which can form an associative memory overlay on the physical sensor network by interconnecting sensor nodes in a graph-like structure called the GN array. The GN was used as a light-weight pattern recognition application to detect DDoS attack patterns in WSNs.
Disadvantages
1. False Negatives: While machine learning can reduce false positives, it can also lead to false negatives.
Some attacks might go undetected if they do not match known patterns in the training data.
2. Resource Consumption: The scheme likely requires significant computational resources, which can be
problematic for resource-constrained wireless sensor networks. High resource consumption may lead to
increased energy consumption and reduced network lifespan.
3. SYSTEM REQUIREMENTS
The system configuration includes the software and hardware requirements, which are provided below.
Python
Python features a dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional and procedural, and has a large
and comprehensive standard library.
Python interpreters are available for many operating systems. CPython, the reference implementation of
Python, is open source software and has a community-based development model, as do nearly all of its
variant implementations. CPython is managed by the non-profit Python Software Foundation.
Scikit-learn
Scikit-learn provides simple and efficient tools for data mining and data analysis. It is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Cross Validation: for estimating the performance of supervised models on unseen data.
Datasets: for test datasets and for generating datasets with specific properties for investigating model behavior.
Dimensionality Reduction: for reducing the number of attributes in data for summarization, visualization and feature selection, such as Principal Component Analysis.
Feature Selection: for identifying meaningful attributes from which to create supervised models.
Supervised Models: a vast array, including but not limited to generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines and decision trees.
RAM: 4 GB
Hard Disk: 4 GB
4. METHODOLOGY
[Figure: methodology flow diagram with the stages Intrusion dataset, Training and Test datasets, Classification Techniques, Evaluation of Accuracy, Compare Result]
The proposed work is implemented in Python 3.6.4 with the libraries scikit-learn, pandas, matplotlib and their dependencies. We downloaded the dataset from kdd.ics.uci.edu. The downloaded data contains a separate train set and test set, with four different classes of intrusions. The file traindataset is used as the train set and testdataset as the test set.
We took the intrusion detection data from the KDD dataset and applied machine learning algorithms such as decision tree, logistic regression, random forest and KNN.
Data collection
The data collection process involves the selection of quality data for analysis. Here we used KDD
intrusion dataset taken from uci.edu for machine learning implementation. The job of a data analyst is to
find ways and sources of collecting relevant and comprehensive data, interpreting it, and analyzing results
with the help of statistical techniques.
Data visualization
A large amount of information represented in graphic form is easier to understand and analyze. Some
companies specify that a data analyst must know how to create slides, diagrams, charts, and templates. In
our approach, the detection rates of intrusion are shown as data visualization part.
Data preprocessing
The purpose of preprocessing is to convert raw data into a form that fits machine learning.
Structured and clean data allows a data scientist to get more precise results from an applied machine
learning model. The technique includes data formatting, cleaning, and sampling.
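A minimal preprocessing sketch with pandas and scikit-learn is given below. The tiny table is made up for illustration; only the column names duration, protocol_type and src_bytes follow the KDD feature naming. The sketch cleans the data, one-hot encodes the categorical column and standardises the numeric ones.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample rows (assumption); a real run would load the KDD files instead.
raw = pd.DataFrame({
    "duration": [0, 2, 0, 5],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "src_bytes": [181, 239, 0, 1032],
    "label": ["normal", "normal", "attack", "attack"],
})

# Cleaning: drop exact duplicates and rows with missing values.
raw = raw.drop_duplicates().dropna()

# Formatting: one-hot encode the categorical feature.
features = pd.get_dummies(raw.drop(columns="label"), columns=["protocol_type"])

# Scaling: standardise numeric features to zero mean and unit variance.
numeric_cols = ["duration", "src_bytes"]
features = features.astype({col: "float64" for col in numeric_cols})
features[numeric_cols] = StandardScaler().fit_transform(features[numeric_cols])

target = raw["label"]
print(features.columns.tolist())
```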
Dataset splitting
A dataset used for machine learning should be partitioned into three subsets: training, test, and validation sets.
Training set: A data scientist uses a training set to train a model and define the optimal parameters it has to learn from data.
Test set: A test set is needed to evaluate the trained model and its capability for generalization, that is, the model's ability to identify patterns in new, unseen data after having been trained on the training data. It is crucial to use different subsets for training and testing to avoid overfitting, which is the inability to generalize described above.
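A three-way split can be sketched with two calls to scikit-learn's train_test_split. The 60/20/20 proportions and the synthetic data are assumptions made for illustration; the KDD data used in this work already ships as separate train and test files.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First hold out 20% as the test set, keeping class proportions (stratify).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Then take 25% of the remainder (20% of the total) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```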
Model training
After a data scientist has preprocessed the collected data and split it into train and test sets, he or she can proceed with model training. This process entails "feeding" the algorithm with training data. The algorithm processes the data and outputs a model that is able to find a target value (attribute) in new data, that is, the answer you want to get with predictive analysis. The purpose of model training is to develop a model.
The goal of this step is to develop the simplest model that is able to produce a target value quickly and accurately enough. A data scientist can achieve this goal through model tuning, which is the optimization of model parameters to achieve an algorithm's best performance. It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data. This makes the task more realistic. Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants. The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only. The dataset variable names are described below:
Intrusion detection is performed with four different machine learning algorithms: Decision Tree, Logistic Regression, KNN and Bernoulli Naive Bayes (NB). The predictions of each model are compared in terms of accuracy and error.
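The comparison can be sketched as follows with scikit-learn. The synthetic data, the default hyperparameters and the 70/30 split are assumptions, so the printed numbers say nothing about the results obtained on the KDD dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the intrusion data (assumption).
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Bernoulli NB": BernoulliNB(),
}

# Train each model on the same split and record accuracy and error.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    results[name] = {"accuracy": accuracy, "error": 1.0 - accuracy}

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, error={scores['error']:.3f}")
```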
Logistic Regression is a predictive modelling technique in which the output (target) variable takes discrete values for a given set of features or inputs (X). It is a simple yet powerful supervised classification algorithm in machine learning, and it is widely used for classification problems.
Logistic Regression is one of the most common algorithms used for binary classification. It predicts the probability of occurrence of a binary outcome using a logit function. It is closely related to linear regression: it passes a linear combination of the features through the logistic function, so that the log-odds of the outcome are linear in the inputs.
Put simply, linear regression predicts scores on one variable from the scores on a second variable. The variable being predicted is called the criterion variable. The variable the predictions are based on is called the predictor variable. When there is only one predictor variable, the prediction method is called simple regression.
We use the activation function (sigmoid) to convert the outcome into a categorical value. Logistic Regression can be used in many applications, for example fraud detection, spam detection and cancer detection.
Sigmoid Function
It is a mathematical function that takes any real value and maps it to a value between 0 and 1, with a curve
shaped like the letter 'S'. The sigmoid function is also called the logistic function.
If the output of the sigmoid function is more than 0.5, we classify the sample as Class 1 (positive class); if
it is less than 0.5, we classify it as Class 0 (negative class).
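The thresholding rule above can be illustrated with a short sketch. The function names sigmoid and classify are chosen here for illustration only. A value of exactly 0.5 is assigned to Class 0 in this sketch, which is an assumption because the text does not cover that case.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real value z to a value between 0 and 1."""
    # Two branches avoid overflow in math.exp for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 (positive class) if sigmoid(z) exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 4), classify(score))
```

Here z stands for the linear combination of the input features computed by the model; positive scores map to probabilities above 0.5 and negative scores to probabilities below 0.5.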
Pros:
Fast
No tuning required
Highly interpretable
Well-understood
Cons:
A decision tree is a type of supervised learning algorithm that is mostly used in classification problems. It
works for both categorical and continuous input and output variables. In this technique, we split the sample
into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator
among the input variables. In a decision tree, each internal node represents a test on an attribute, each
branch depicts an outcome of the test, and each leaf represents the decision made after evaluating the
attributes.
The general motive for using a decision tree is to create a training model that can be used to predict the
class or value of target variables by learning decision rules inferred from prior data (training data).
The Decision Tree algorithm is easy to understand compared with other classification algorithms. It tries to
solve the problem by using a tree representation. Each internal node of the tree corresponds to an attribute,
and each leaf node corresponds to a class label. The algorithm proceeds as follows:
1. Place the best attribute of the dataset at the root of the tree.
2. Split the training set into subsets. Subsets should be made in such a way that each subset contains data
with the same value for an attribute.
3. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.
In decision trees, to predict a class label for a record we start from the root of the tree. We then compare
the value of the root attribute with the record's attribute. On the basis of the comparison, we follow the
branch corresponding to that value and jump to the next node, repeating this until a leaf node is reached.
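The prediction walk described above can be sketched with a small hand-written tree. The tree below is hypothetical: it is not learned from the KDD data, and the attribute values and labels are chosen only to illustrate the traversal.

```python
# A toy tree (hypothetical). Internal nodes test one attribute;
# leaf nodes hold a class label.
toy_tree = {
    "attribute": "protocol_type",
    "branches": {
        "icmp": {"label": "attack"},
        "tcp": {
            "attribute": "flag",
            "branches": {
                "SF": {"label": "normal"},
                "S0": {"label": "attack"},
            },
        },
        "udp": {"label": "normal"},
    },
}


def predict(node, record):
    """Walk from the root to a leaf, following the branch that matches the record."""
    while "label" not in node:
        value = record[node["attribute"]]  # value of the tested attribute
        node = node["branches"][value]     # jump to the next node
    return node["label"]


print(predict(toy_tree, {"protocol_type": "tcp", "flag": "S0"}))
print(predict(toy_tree, {"protocol_type": "udp", "flag": "SF"}))
```

A record whose attribute value has no matching branch raises a KeyError in this sketch; a learned tree would normally cover every value seen in training.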
A decision tree is a non-parametric supervised learning algorithm for classification and regression
tasks. It is a hierarchical model, also used in decision support, that depicts decisions and their potential
outcomes, incorporating chance events, resource expenses, and utility. The model uses conditional control
statements, and its tree structure consists of a root node, branches, internal nodes, and leaf nodes.
It is a tool that has applications spanning several different areas. As the name suggests, it uses a
flowchart-like tree structure to show the predictions that result from a series of feature-based splits. It
starts with a root node and ends with leaf nodes.
Before learning more about decision trees, let us get familiar with some of the terminology:
Root Node: The initial node at the beginning of a decision tree, where the entire population or dataset starts
to be split.
Decision Nodes: Nodes resulting from the splitting of root nodes are known as decision nodes.
Leaf Nodes: Nodes where further splitting is not possible, often indicating the final classification.
Pruning: The process of removing or cutting down specific nodes in a decision tree to prevent overfitting.
Branch / Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a subsection of the entire
decision tree is referred to as a branch or sub-tree.
Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is known as a parent node,
and the sub-nodes emerging from it are referred to as child nodes. The parent node represents a decision or
condition, while the child nodes represent the potential outcomes or further decisions based on that
condition.
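Pruning and limiting tree growth can be sketched with scikit-learn as shown below. This is an illustration on synthetic data, and the values max_depth=3 and ccp_alpha=0.01 are arbitrary assumptions, not settings used in this project. It assumes a scikit-learn version that provides the ccp_alpha parameter.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with many features, where an unrestricted tree tends to overfit.
X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# Unrestricted tree: grows until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=1).fit(X, y)

# Growth stopped early by limiting the depth.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# Cost-complexity pruning: weak branches are cut back after the tree is grown.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1).fit(X, y)

for name, tree in [("full", full_tree), ("shallow", shallow_tree), ("pruned", pruned_tree)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

A smaller tree usually fits the training data less closely, which is the intended trade-off when the aim is to reduce overfitting.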
Pros
You can build bigger classifiers out of decision trees.
Cons
Prone to overfitting
o Especially if you have a lot of features
o Stop the growth of the tree at the appropriate time to limit this
The K Nearest Neighbors (KNN) algorithm measures the distance between a query scenario and a set of
scenarios in the data set. We can compute the distance between two scenarios using some distance function
d(x,y), where x and y are scenarios composed of N features, such that x={x1,…,xN}, y={y1,…,yN}.
The model for KNN is the entire training dataset. When a prediction is required for an unseen data instance,
the KNN algorithm searches through the training dataset for the k most similar instances. The prediction
attribute of the most similar instances is summarized and returned as the prediction for the unseen instance.
The similarity measure depends on the type of data. For real-valued data, the Euclidean distance can be
used. For other types of data, such as categorical or binary data, the Hamming distance can be used.
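The two distance functions mentioned above can be written as a short sketch. The function names and the sample values are chosen for illustration only.

```python
import math


def euclidean_distance(x, y):
    """Distance d(x, y) for real-valued feature vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def hamming_distance(x, y):
    """Number of positions at which two categorical or binary vectors differ."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    return sum(a != b for a, b in zip(x, y))


print(euclidean_distance([0.0, 3.0], [4.0, 0.0]))                     # 5.0
print(hamming_distance(["tcp", "http", "SF"], ["tcp", "ftp", "S0"]))  # 2
```

In both cases a smaller value means that the two scenarios are more similar.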
Instance-based algorithms are those algorithms that model the problem using data instances (or rows) in
order to make predictive decisions. The KNN algorithm is an extreme form of instance-based method
because all training observations are retained as part of the model.
It is a competitive learning algorithm, because it internally uses competition between model elements (data
instances) in order to make a predictive decision. The objective similarity measure between data instances
causes each data instance to compete to "win", that is, to be most similar to a given unseen data instance
and contribute to a prediction.
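A minimal from-scratch sketch of the KNN prediction step is given below. It assumes Euclidean distance, a majority vote among the k nearest neighbors, and a tiny made-up training set; none of these details are taken from the project's implementation.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of query by majority vote among its k nearest rows."""
    def distance(row):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(row, query)))

    # Every training instance "competes": rank all of them by distance.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0]))
    nearest_labels = [label for _, label in ranked[:k]]
    # Summarize the k winners by taking the most frequent label.
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up two-feature training data (assumption).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]

print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.0, 5.0)))
```

Because the whole training set is scanned for every query, prediction time grows with the size of the training data, which is the cost of retaining all observations as the model.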
5. DESIGN
Unified Modeling Language (UML) is a general-purpose modeling language. The main aim of UML is to
define a standard way to visualize the way a system has been designed. It is quite similar to the blueprints
used in other fields of engineering.
UML is not a programming language; it is rather a visual language. We use UML diagrams to portray the
behavior and structure of a system. UML helps software engineers, business people and system architects
with modeling, design and analysis. The Object Management Group (OMG) adopted the Unified Modeling
Language as a standard in 1997, and it has been managed by OMG ever since. The International
Organization for Standardization (ISO) published UML as an approved standard in 2005. UML has been
revised over the years and is reviewed periodically.
UML is linked with object-oriented design and analysis. UML makes use of elements and forms
associations between them to form diagrams. Diagrams in UML can be broadly classified as:
1. Structural Diagrams – Capture static aspects or structure of a system. Structural Diagrams include:
Component Diagrams,
Object Diagrams.
2. Behavior Diagrams – Capture dynamic aspects or behavior of the system. Behavior Diagrams include:
State Diagrams,
Interaction Diagrams.
Advantages of UML
UML is a widely recognized and understood notation for software design. It is a standard notation among
software developers. You can safely assume that most software professionals will be at least acquainted
with, if not well-versed in, UML diagrams, which makes it the go-to choice for explaining software design
models.
A Use Case Diagram captures the system's functionality and requirements by using actors and use cases. Use
cases model the services, tasks and functions that a system needs to perform. Use cases represent high-level
functionalities and how a user will interact with the system. Use cases are core concepts of Unified Modeling
Language modeling.
Sequence Diagrams display the time sequence of the objects participating in an interaction. A sequence diagram
has a vertical dimension (time) and a horizontal dimension (the different objects).
Object: An object can be viewed as an entity at a particular point in time with a specific value, and as a holder
of identity that has different values over time.
Actor: An actor represents a coherent set of roles that users of a system play when interacting with the use cases
of the system.
Message: A message is the sending of a signal from one sender object to one or more receiver objects.
An activity diagram shows the flow from activity to activity. An activity is an ongoing non-atomic execution
within a state machine. An activity results in some action, which in turn results in a change of state or the
return of a value.
Activity states and action states: An executable atomic computation is called an action state, which cannot be
decomposed. An activity state is non-atomic, decomposable and takes some duration to execute.
Transition: It is a path from one state to the next state, represented as a simple directed line.
Branching: When an alternate path exists, branching arises, which is represented by an open diamond. It has
one incoming transition and two or more outgoing transitions.
Forking and Joining: When a synchronization bar splits one flow into two or more flows, it is called a fork.
When two or more flows are combined at a synchronization bar, the bar is called a join.
Swim Lanes: Groups of workflow activities are called swim lanes. The groups are partitioned by vertical solid
lines. Each swim lane specifies a locus of activities and has a unique name. Each swim lane is implemented by
one or more classes. Transitions may occur between objects across swim lanes.
A state chart diagram describes a state machine that shows the behavior of classes. It shows the actual
changes in state, not the processes or commands that create those changes, and captures the dynamic behavior
of objects over time by modeling the lifecycle of objects of each class.
It describes how an object changes from one state to another. There are two special states in a state
chart diagram:
Initial state
Final state
State: It is a condition or situation in the lifecycle of an object during which it satisfies some condition,
performs some activity or waits for some event.
Transition: It is a relationship between two states indicating that an object in the first state performs some
action and enters the next state.
Event: An event is the specification of a significant occurrence that has a location in time and space.
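The notions of state, transition and event can be illustrated with a small transition table. The states and events below are hypothetical and are not taken from the project's state chart diagram; "Idle" plays the role of the initial state and "Stopped" the role of the final state.

```python
# (current state, event) -> next state. All names are hypothetical.
TRANSITIONS = {
    ("Idle", "start_capture"): "Monitoring",
    ("Monitoring", "intrusion_detected"): "Alerting",
    ("Alerting", "alert_acknowledged"): "Monitoring",
    ("Monitoring", "stop_capture"): "Stopped",
}


def run(events, state="Idle"):
    """Apply each event in order and return the state that is reached."""
    for event in events:
        # A KeyError here means the event is not allowed in the current state.
        state = TRANSITIONS[(state, event)]
    return state


print(run(["start_capture", "intrusion_detected"]))
print(run(["start_capture", "intrusion_detected", "alert_acknowledged", "stop_capture"]))
```

Each dictionary entry corresponds to one arrow in a state chart diagram: the key gives the source state and the triggering event, and the value gives the target state.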
The proposed work is implemented in Python 3.6.4 with the libraries scikit-learn, pandas, matplotlib and
other required libraries. The KDD training dataset contains 125973 rows and the test dataset contains
22543 rows. Four machine learning algorithms are applied: Decision Tree, Logistic Regression, KNN and
Naive Bayes. These algorithms are used to identify intrusions.
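As a hedged sketch of how pandas might be used to prepare such records for the classifiers, the example below one-hot encodes a categorical column and rescales the numeric columns. The four rows are invented for illustration, the label values are simplified, and the actual preprocessing used in this project may differ.

```python
import pandas as pd

# Invented sample rows; the column names follow the KDD feature naming.
records = pd.DataFrame(
    {
        "duration": [0, 2, 0, 10],
        "protocol_type": ["tcp", "udp", "icmp", "tcp"],
        "src_bytes": [181, 105, 1032, 0],
        "label": ["normal", "normal", "attack", "attack"],
    }
)

features = records.drop(columns=["label"])
target = records["label"]

# One-hot encode the categorical column so that classifiers receive numbers.
features = pd.get_dummies(features, columns=["protocol_type"])

# Min-max scale the numeric columns to the range [0, 1].
for column in ["duration", "src_bytes"]:
    col_min = features[column].min()
    col_max = features[column].max()
    features[column] = (features[column] - col_min) / (col_max - col_min)

print(features)
```

Scaling matters most for distance-based methods such as KNN, where a feature with a large numeric range would otherwise dominate the distance.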
The deployment of a Network Intrusion Detection System (NIDS) plays a pivotal role in enhancing the
security posture of organizations. These systems are essential for real-time monitoring and analysis of
network activity, identifying potential vulnerabilities, and responding to ongoing threats. While NIDS
technologies may encounter challenges in filtering false alarms, their significance in safeguarding against
unauthorized intrusions and potential security incidents cannot be overstated. The system includes a web
page through which the user is alerted when an unauthorized person or intruder tries to gain access.
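One common way to quantify false alarms is the false positive rate, FP / (FP + TN), where FP counts normal records flagged as attacks and TN counts normal records correctly left alone. The sketch below is an illustration with invented labels and is not part of the project's evaluation.

```python
def false_positive_rate(y_true, y_pred, normal_label="normal"):
    """Share of truly normal records that were wrongly flagged as attacks."""
    false_positives = 0
    true_negatives = 0
    for truth, prediction in zip(y_true, y_pred):
        if truth == normal_label:
            if prediction == normal_label:
                true_negatives += 1
            else:
                false_positives += 1
    total_normal = false_positives + true_negatives
    return false_positives / total_normal if total_normal else 0.0


# Invented example: four normal records, one of which raises a false alarm.
y_true = ["normal", "normal", "normal", "normal", "dos", "probe"]
y_pred = ["normal", "dos", "normal", "normal", "dos", "normal"]

print(false_positive_rate(y_true, y_pred))  # 0.25
```

Missed attacks, such as the last record in the example, do not affect this measure; they are counted by the detection rate instead.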
The results show that machine learning classifiers can detect intrusions effectively. Logistic Regression
achieves 98.67% accuracy and KNN achieves around 98.86% accuracy, while Bernoulli Naive Bayes achieves
around 95.40% accuracy and Decision Tree achieves around 53.46% accuracy.
The following table shows the accuracy obtained in our experimental analysis.
ALGORITHM                 ACCURACY (%)
Decision Tree             53.46
Logistic Regression       98.67
KNN                       98.86
Bernoulli Naive Bayes     95.40
Fig 6.2.1 Main page of the Network Intrusion Detection System
Fig 6.2.2 Screen showing that the traffic is classified as a DOS attack
Fig 6.2.3 Screen showing that the traffic is classified as NORMAL
Fig 6.2.4 Screen showing that the traffic is classified as a PROBE attack
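The class names shown in the screens (DOS, NORMAL, PROBE) are categories that group individual attack names in the KDD data, which also defines the categories R2L and U2R. The sketch below shows how such a grouping might be expressed. The dictionary lists only a few well-known attack names, it is not the complete mapping, and it is not taken from the project's code.

```python
# Partial, illustrative mapping from KDD attack names to categories.
ATTACK_CATEGORY = {
    "normal": "NORMAL",
    "neptune": "DOS",
    "smurf": "DOS",
    "back": "DOS",
    "teardrop": "DOS",
    "ipsweep": "PROBE",
    "nmap": "PROBE",
    "portsweep": "PROBE",
    "satan": "PROBE",
    "guess_passwd": "R2L",
    "warezclient": "R2L",
    "buffer_overflow": "U2R",
    "rootkit": "U2R",
}


def to_category(attack_name):
    """Return the category for an attack name, or UNKNOWN if it is not listed."""
    return ATTACK_CATEGORY.get(attack_name, "UNKNOWN")


for name in ("neptune", "portsweep", "normal", "some_new_attack"):
    print(name, "->", to_category(name))
```

Names that are missing from the dictionary are reported as UNKNOWN rather than guessed, which is relevant because the test data contains attack types that do not occur in the training data.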
CONCLUSION
The main aim of an Intrusion Detection System is to detect the attacks and malicious activities that occur
within a network and to reduce the rate of false positives. By using machine learning algorithms, the
output of the NIDS can be made more accurate and reliable. This system also reports the accuracy of each of
the machine learning algorithms that have been implemented. The steady increase in the use of technology
has led to a huge amount of data that needs to be processed and stored securely for users. Security is a major
concern for any user. The more secure a system is, the better it protects the user's privacy and the more
reliable it is. An Intrusion Detection System can therefore be judged by how well it secures users' data.
FUTURE SCOPE
The proposed system can be made more reliable and efficient by implementing other machine learning
algorithms alongside the ones already implemented, so that intrusions can be detected more easily. Other
types of attacks can also be added as intrusion classes, so that more attacks are identified and greater
security and reliability are provided.
8. REFERENCES
[1] Network Traffic Analysis and Intrusion Detection Using Packet Sniffer
Qadeer, Mohammed & Iqbal, Arshad & Zahid, Mohammad & Siddiqui, Misbahur, Communication
Software and Networks, International Conference on. 313-317. 10.1109/ICCSN.2010.104.
[2] Analyzing Log Files for Post-mortem Intrusion Detection
Gamboa, Karen & Monroy, Raúl & Trejo, Luis & Aguirre Bermúdez, Eduardo & Mex-Perera,
Carlos. (2012), IEEE Transactions on Systems Man and Cybernetics Part C (Applications and
Reviews). 42. 10.1109/TSMCC.2012.2217325.
[3] A Pattern Recognition Scheme for Distributed Denial of Service (DDoS) Attacks in Wireless
Sensor Networks, Baig, Zubair & Baqer, M & Khan, Asad. (2006), 1050 - 1054.
10.1109/ICPR.2006.147.
[4] Detecting web based Ddos attack using mapreduce operations in cloud computing environment
Choi, J & Choi, Chang & Ko, Byeongkyu & Choi, D & Kim, P. (2013), Journal of Internet Services
and Information Security. 3. 28-37.
[5] Network Intrusion Detection System Based On Machine Learning Algorithms
Vipin, Das & Vijaya, Pathak & Sattvik, Sharma & , Sreevathsan & , MVVNS.Srikanth & Kumar T,
Gireesh. (2010), International Journal of Computer Science & Information Technology.
https://ieeexplore.ieee.org
https://www.researchgate.net
https://scholar.google.com
https://youtoube.com