A Journey Through Intrusion Detection Systems: Techniques, Datasets and Challenges
Contributors
• Ahmed Salah Elbehairy
• Ahmed Sabry Mohamed
2
References
3
Introduction: Cyberattacks. It's a Jungle Out There
• Malicious attacks are becoming more sophisticated
• The number of zero-day attacks is increasing
• According to the 2017 Symantec Internet Security Threat Report:
o Three billion zero-day attacks were reported in 2016
• According to the Data Breach Statistics in 2017:
o Nine billion data records have been lost or stolen by hackers since 2013
• Hackers are more ambitious
o Targets have shifted from bank customers to the banks themselves
4
Intrusion Detection Systems (IDS)
• There is a high need for an efficient mechanism
o To detect such sophisticated, novel and high-volume intrusions
o This mechanism is the IDS
• Intrusion: unauthorized activity that damages an information system's
o Confidentiality
o Integrity
o Availability
• IDS: software or hardware that identifies malicious actions on computer systems in order to maintain their security
• IDS taxonomies:
o Detection method
o Data source
o Behavior on detection
5
Intrusion Detection Systems (IDS) (Cont'd)
• We will focus on the first two taxonomies.
• Detection method
o Signature-based IDS (SIDS)
o Anomaly-based IDS (AIDS)
§ Statistical based
§ Knowledge based
§ Machine learning based
• Data source
o Host-based IDS (HIDS)
o Network-based IDS (NIDS)
6
Signature-based Intrusion Detection Systems (SIDS)
7
Overview
Purpose
• Find and identify a known attack.
• Raise an alarm if a match happens.
Basis
• Uses a memory of known patterns (signatures).
• A predefined database built from previous knowledge.
Usage
• General monitoring at the network and/or host level using:
o Identified malware.
o Identified suspicious network packets.
8
Key Points on SIDS
Advantages
• Highly effective against known threats.
• Easy to implement and manage due to minimal configuration.
Disadvantages
• Ineffective against new attacks such as zero-days.
• Potential for false positives due to shared characteristics.
9
Functionality
• A very basic comparison against history, where:
o Build a database of intrusion signatures
o Compare the current set of activities against the existing signatures
o IF match: ALARM
o ELSE (no match): Normal
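The compare-and-alarm flow above can be sketched in a few lines of Python. This is only an illustration: the signature names and byte patterns are hypothetical placeholders, not entries from any real signature database, and real SIDS engines use far richer rule languages.

```python
# Hypothetical signature database: name -> byte pattern seen in known attacks.
SIGNATURES = {
    "sql_injection": b"' OR '1'='1",
    "path_traversal": b"../../",
}

def match_signatures(payload: bytes) -> list:
    """Return the names of all stored signatures found in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]

def inspect(payload: bytes) -> str:
    """IF match -> ALARM, ELSE -> Normal."""
    hits = match_signatures(payload)
    return "ALARM: " + ", ".join(hits) if hits else "Normal"

print(inspect(b"GET /index.html"))
print(inspect(b"GET /../../etc/passwd"))
```

A payload that matches no stored pattern is reported as normal, which is exactly why this approach misses zero-day attacks.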
10
If Not Signatures, Then What?
• SIDS still has its uses and applications
o It identifies sequences of bytes or specific protocol behaviors that match known attack signatures in a network.
• Future proofing
o Zero-days are priority one.
o Detecting them requires other approaches.
• Proactive
o Anomaly-based detection
11
Anomaly-based Intrusion Detection Systems (AIDS) with Statistical Techniques
13
Introduction
14
Overview of Statistical Techniques
15
Types of Statistical Techniques
• Univariate Models
• Multivariate Models
• Time Series Model
16
Univariate Models
• Analyze a single variable at a time
• Create a statistical profile for one measure of behavior
• Example Application:
o Monitoring the number of failed login attempts
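As a minimal sketch of a univariate model, the failed-login example can be profiled with a mean and standard deviation, flagging observations that sit too many standard deviations away. The baseline counts and the threshold of 3 are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def is_anomalous(history, observed, z_threshold=3.0):
    """Flag `observed` if it deviates from the profile built on `history`."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A constant profile: anything different is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold

# Hypothetical failed-login counts per hour during normal operation.
baseline = [2, 1, 3, 2, 0, 1, 2, 3, 1, 2]

print(is_anomalous(baseline, 3))
print(is_anomalous(baseline, 25))
```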
17
Multivariate Model
• Analyze relationships between two or more variables
• Effective in capturing complex interactions
• Example Applications:
o Correlating CPU usage and network traffic
o Detecting anomalies through correlation
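One possible way to use the correlation between two variables is the Mahalanobis distance, which measures how far a point lies from the baseline while accounting for covariance. The slides do not name this technique, so treat it as a swapped-in illustration; the CPU and traffic figures are invented.

```python
from statistics import mean

def mahalanobis_2d(samples, point):
    """Distance of `point` from the 2-D baseline `samples`, scaled by covariance."""
    xs = [s[0] for s in samples]
    ys = [s[1] for s in samples]
    mx, my = mean(xs), mean(ys)
    n = len(samples)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    det = sxx * syy - sxy ** 2  # assumed non-zero for this sketch
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form with the inverse of the 2x2 covariance matrix.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 ** 0.5

# Hypothetical (CPU %, network Mbps) pairs: the two normally rise together.
baseline = [(10, 12), (20, 19), (30, 33), (40, 38), (50, 52), (60, 59)]

print(mahalanobis_2d(baseline, (35, 36)))  # follows the usual relationship
print(mahalanobis_2d(baseline, (55, 10)))  # high CPU with almost no traffic
```

Each value in the second point is within the range seen in the baseline, so a univariate check on either variable alone would not flag it; only the combination is unusual.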
18
Time Series Model
• Analyze Data Points
o Collected or recorded at specific time intervals
• Detect Anomalies
o Based on the temporal sequence of data
• Example Applications:
o Monitoring network traffic over time
o Identify unusual spikes
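A simple sketch of spike detection over time compares each new sample with a sliding window of recent values. The window size, threshold and traffic numbers are assumptions for illustration only.

```python
from collections import deque
from statistics import mean, stdev

def find_spikes(series, window=5, z_threshold=3.0):
    """Return indices whose value is far above the recent sliding-window average."""
    recent = deque(maxlen=window)
    spikes = []
    for i, value in enumerate(series):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and (value - mu) / sigma > z_threshold:
                spikes.append(i)
                continue  # keep the spike out of the baseline window
        recent.append(value)
    return spikes

# Hypothetical packets-per-second samples taken at regular intervals.
traffic = [100, 102, 98, 101, 99, 100, 103, 400, 101, 99]
print(find_spikes(traffic))
```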
19
Example Techniques in Statistical AIDS
Statistical Process Control (SPC)
• Uses control charts to monitor system behavior
• Detects deviations from the norm
Principal Component Analysis (PCA)
• Reduces data dimensionality
• Facilitates easier identification of anomalies
Clustering
• Groups similar data points together
• Anomalies are those that do not fit into any cluster
20
Advantages of Statistical Techniques in AIDS
21
Challenges of Statistical Techniques in AIDS
23
Finite State Machine
• A computation model used to represent and control execution flow, based on a mathematical representation.
• Represented in the form of states, transitions, and activities.
• A state checks the history data: any variation in the input is noted, and a transition happens based on the detected variation.
• An FSM can represent legitimate system behaviour, and any observed deviation from this FSM is regarded as an attack.
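The idea of modelling legitimate behaviour as an FSM and treating any deviation as an attack can be sketched as a transition table. The states and events below describe a hypothetical user session and are not taken from the slides.

```python
# Allowed transitions of a hypothetical legitimate session:
# (current state, event) -> next state.
TRANSITIONS = {
    ("idle", "login"): "authenticated",
    ("authenticated", "read"): "authenticated",
    ("authenticated", "write"): "authenticated",
    ("authenticated", "logout"): "idle",
}

def first_deviation(events, start="idle"):
    """Return the index of the first event the FSM does not allow, or None."""
    state = start
    for i, event in enumerate(events):
        next_state = TRANSITIONS.get((state, event))
        if next_state is None:
            return i  # no legitimate transition: regarded as an attack
        state = next_state
    return None

print(first_deviation(["login", "read", "write", "logout"]))
print(first_deviation(["login", "logout", "write"]))
```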
24
Description Languages
• Defines the syntax of rules which can be used to specify the characteristics of a
defined attack.
• Rules could be built by description languages such as N-grammars and UML
[Figure labels: State Transition, Initializations]
25
Expert System
• Comprises a number of rules that define attacks.
• Evaluates user behavior by checking events (audit records) against this set of rules.
• These rules describe suspicious behavior based on knowledge of past intrusions, known system vulnerabilities, and the installation-specific security policy.
• The user's behavior is analyzed without reference to whether it matches past behavior patterns (simultaneously, as in the figure)
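A rule-based evaluation of audit records might look like the following sketch. The rule descriptions, record fields and example values are all hypothetical; a real expert system would encode the site's own security policy.

```python
# Each rule pairs a description with a predicate over one audit record.
RULES = [
    ("root login from an external address",
     lambda r: r["user"] == "root" and not r["internal"]),
    ("access to the password file",
     lambda r: r["resource"] == "/etc/shadow"),
]

def evaluate(record):
    """Return the descriptions of every rule the audit record triggers."""
    return [description for description, predicate in RULES if predicate(record)]

normal = {"user": "alice", "internal": True, "resource": "/home/alice/report.txt"}
suspicious = {"user": "root", "internal": False, "resource": "/etc/shadow"}

print(evaluate(normal))
print(evaluate(suspicious))
```

Note that each record is judged only against the rules, never against the user's own past behaviour, which matches the last point above.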
26
Machine Learning AIDS
27
Machine Learning: What is it?
• Extracting knowledge from large quantities of data
o Learn from training data
o Generalize to unseen data
• Why use machine learning to build AIDS?
o Improved accuracy
o Less reliance on human knowledge
28
Machine Learning AIDS
• Various AIDSs based on machine learning exist
• Two main kinds of machine learning are used to build AIDSs:
o Supervised learning
o Unsupervised learning
29
Supervised Learning
• Algorithms are trained using labeled datasets (inputs (features) + outputs (classes))
o Network traffic monitoring example:

Login Attempts   Origin     Time of Day   Intrusion?
5                External   Night         Yes
1                Internal   Day           No
2                Internal   Night         No
10               External   Night         Yes
3                External   Day           No

• Algorithms learn the inherent relationship between input data and classes during training
• Algorithms are then tested to measure their accuracy
• Once the accuracy is acceptable, the model is released to handle real scenarios (seen and unseen events)
30
Supervised Learning (Cont'd)
• Numerous supervised techniques are available:
o Decision Tree
o Naïve Bayes Networks
o Genetic Algorithms (GA)
o Artificial Neural Networks (ANN)
o Fuzzy Logic
o Support Vector Machines
o Hidden Markov Model
o K-Nearest Neighbours
31
Decision Tree
• Structure
o Node = Rule or Test
o Branch = Possible Decision
o Leaf = Class
• Tree is constructed during the
training phase with the help of
labeled data.
• Numerous algorithms are
available:
o ID3
o C4.5
o CART
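To illustrate how a tree-building algorithm such as ID3 chooses the test for a node, the sketch below computes the information gain of three candidate tests on the five-row login example from the supervised learning slide. The threshold "attempts >= 5" is an assumption introduced here; it does not appear in the slides.

```python
from collections import Counter
from math import log2

# The labeled example from the supervised learning slide.
ROWS = [
    {"attempts": 5, "origin": "External", "time": "Night", "intrusion": "Yes"},
    {"attempts": 1, "origin": "Internal", "time": "Day", "intrusion": "No"},
    {"attempts": 2, "origin": "Internal", "time": "Night", "intrusion": "No"},
    {"attempts": 10, "origin": "External", "time": "Night", "intrusion": "Yes"},
    {"attempts": 3, "origin": "External", "time": "Day", "intrusion": "No"},
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, test):
    """Entropy reduction obtained by splitting `rows` on a yes/no test."""
    gain = entropy([r["intrusion"] for r in rows])
    for outcome in (True, False):
        subset = [r["intrusion"] for r in rows if test(r) == outcome]
        if subset:
            gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# Candidate node tests (the numeric threshold is a hypothetical choice).
TESTS = {
    "attempts >= 5": lambda r: r["attempts"] >= 5,
    "origin == External": lambda r: r["origin"] == "External",
    "time == Night": lambda r: r["time"] == "Night",
}

best = max(TESTS, key=lambda name: information_gain(ROWS, TESTS[name]))
print(best)
```

On this tiny dataset the login-attempt test separates the two classes completely, so it would become the root node and both branches would end in leaves.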
32
Naïve Bayes
• Naïve:
o Simplifying assumption: "All features are conditionally independent of each other given the class label"
• Bayes principle
• Target:
o "What is the probability that a particular kind of attack is occurring, given the observed system activities?"
o P(Intrusion = Y | Login attempts = 5, Origin = Internal, Time of Day = Day) ?
33
Naïve Bayes (Cont'd)
• Let's first build the model using the labeled data (Features + Classes):

Probability                              Value
P(Intrusion = Y)                         (2 / 5) = 0.4
P(Intrusion = N)                         (3 / 5) = 0.6
P(Origin = Internal | Intrusion = Y)     (0 / 2) = 0
P(Origin = External | Intrusion = Y)     (2 / 2) = 1
34
Naïve Bayes (Cont'd)
• Now let's evaluate whether the observed event is an intrusion or not:

Probability                          Value
P(Time = Day | Intrusion = Y)        (0 / 2) = 0
P(Time = Night | Intrusion = Y)      (2 / 2) = 1

• Do your calculation:
o If the probability > 0.5, then it is an intrusion; otherwise it is a normal event
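The calculation can be reproduced with a short sketch built on the same five labeled rows. As a simplifying assumption, only the two categorical features (origin and time of day) are used, mirroring the probability tables above; the numeric login-attempts feature is left out.

```python
# (origin, time of day, intrusion?) rows from the labeled example.
ROWS = [
    ("External", "Night", "Yes"),
    ("Internal", "Day", "No"),
    ("Internal", "Night", "No"),
    ("External", "Night", "Yes"),
    ("External", "Day", "No"),
]

def posterior_intrusion(origin, time):
    """P(Intrusion = Yes | origin, time) under the naive independence assumption."""
    scores = {}
    for cls in ("Yes", "No"):
        in_class = [r for r in ROWS if r[2] == cls]
        prior = len(in_class) / len(ROWS)
        p_origin = sum(r[0] == origin for r in in_class) / len(in_class)
        p_time = sum(r[1] == time for r in in_class) / len(in_class)
        scores[cls] = prior * p_origin * p_time
    total = sum(scores.values())
    return scores["Yes"] / total if total else 0.0

print(posterior_intrusion("Internal", "Day"))    # the event asked about above
print(posterior_intrusion("External", "Night"))
```

For the internal daytime event, both conditional probabilities given an intrusion are zero, so the posterior is 0 and the event is classified as normal. Practical implementations usually smooth the counts so that a single unseen value cannot force the result to zero.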
36
Naïve Bayes (Cont'd)
37
Unsupervised Learning
• Algorithms learn patterns from an unlabeled dataset
• Data is grouped into various classes through the learning process
• A joint density model can help to categorize
o Large clusters are labeled as normal
o Small clusters are labeled as intrusion
38
Unsupervised Learning (Cont'd)
• Numerous clustering techniques:
o K-means
o Hierarchical Clustering
39
K-means
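A minimal K-means sketch, applied the way the unsupervised learning slide describes: cluster unlabeled points, then label the large cluster as normal and the small one as intrusion. The data points and the choice of initial centroids are assumptions made for the example.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain K-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Hypothetical unlabeled feature vectors (e.g. scaled duration, scaled bytes sent).
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8), (1.0, 1.1),
          (8.0, 8.0), (8.2, 7.9)]

clusters = kmeans(points, centroids=[points[0], points[-1]])
largest = max(clusters, key=len)
labels = {p: ("normal" if p in largest else "intrusion") for p in points}
print(labels)
```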
40
Mixing of supervised and unsupervised techniques
• Numerous algorithms:
o Fuzzy-based semi-supervised
o Expectation Maximization
41
Ensemble methods
• Multiple machine learning models are used together
o Synergy vs Energy
• Numerous ensemble methods:
• Boosting
o Sequential ensemble method
o Each model is trained to correct the mistakes of its predecessor
• Bagging
o Multiple models are trained independently in parallel on different subsets of the training data
o Their predictions are combined (e.g. voting, averaging, etc.)
• Stacking
o Multiple models (base learners) are trained on the same dataset
o A meta-model (a blender) is trained on the outputs of the base models
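Bagging can be sketched with very simple base models: each one is a threshold learned from a bootstrap sample, and the ensemble takes a majority vote. The data, the threshold rule and the number of models are assumptions chosen to keep the example small.

```python
import random

def train_stump(sample):
    """Base model: threshold halfway between the largest normal and smallest attack value."""
    normal = [x for x, y in sample if y == 0]
    attack = [x for x, y in sample if y == 1]
    threshold = (max(normal) + min(attack)) / 2
    return lambda x: int(x > threshold)

def bootstrap(data, rng):
    """Sample with replacement, redrawing until both classes are present."""
    while True:
        sample = [rng.choice(data) for _ in data]
        if {y for _, y in sample} == {0, 1}:
            return sample

def bagging(data, n_models=15, seed=0):
    """Train independent stumps on bootstrap samples and combine them by voting."""
    rng = random.Random(seed)
    models = [train_stump(bootstrap(data, rng)) for _ in range(n_models)]

    def predict(x):
        votes = sum(model(x) for model in models)
        return int(votes > n_models / 2)

    return predict

# Hypothetical (login attempts, label) pairs: 0 = normal, 1 = intrusion.
data = [(1, 0), (2, 0), (3, 0), (1, 0), (2, 0), (3, 0),
        (9, 1), (10, 1), (11, 1), (12, 1)]

predict = bagging(data)
print(predict(2), predict(10))
```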
42
Intrusion Data Sources
43
Classification of IDS
• The previous two sections categorised IDS based on the methods used to identify intrusions
o Signature
o Anomaly
• IDS can also be classified by input data source
So, what does this mean?
44
Data Sources
• Host-based
• Network-based
HIDS
46
NIDS
47
Two Good Methods: Integrate?
Yes! Together with HIDS and firewalls, NIDS can provide multi-tier protection against both external and insider attacks.
49
Evaluation of IDS
• Performance Metrics
Performance Metrics for IDS
51
Inefficiency of Accuracy metric
52
Inefficiency of Accuracy metric
53
Inefficiency of Accuracy metric
54
Important concepts for Performance metrics
55
Important concepts for IDS evaluation
• True Positive
• True Negative
• False Positive
• False Negative
56
Important concepts for IDS evaluation
TPR = TP / (TP + FN)
TNR = TN / (FP + TN)
FNR = FN / (FN + TP)
FPR = FP / (FP + TN)
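The four rates can be computed directly from confusion-matrix counts. The sketch below also adds accuracy, (TP + TN) / total, to show why it is a weak metric on imbalanced traffic; the counts are invented, and the function assumes both classes occur at least once.

```python
def rates(tp, tn, fp, fn):
    """Compute the evaluation rates from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (fp + tn),
        "FNR": fn / (fn + tp),
        "FPR": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical detector that labels every event as normal:
# 990 normal events and 10 attacks, none of which are caught.
lazy = rates(tp=0, tn=990, fp=0, fn=10)
print(lazy)
```

The detector scores 99% accuracy while detecting no attacks at all (TPR = 0, FNR = 1), which is why TPR and FPR are reported alongside or instead of accuracy.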
57
Important concepts for IDS evaluation
• Model thresholds
• Receiver Operating Characteristic (ROC) curve
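The link between model thresholds and the ROC curve can be sketched by sweeping a threshold over the detector's scores and recording one (FPR, TPR) point per threshold. The scores and labels below are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold step by step."""
    positives = labels.count(1)
    negatives = labels.count(0)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical anomaly scores with ground truth (1 = intrusion, 0 = normal).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold catches more intrusions but also raises more false alarms, so each threshold corresponds to one trade-off point on the curve.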
58
Model thresholds
59
Threshold and ROC Curve
61
IDS Datasets and Dataset Quality
62
Difference from other Surveys
63
DARPA
• Earliest effort to create an IDS dataset
• Defense Advanced Research Projects Agency (DARPA)
• MIT Lincoln Labs
• 4 GB of data
• Basis for the KDD Cup99 and NSL-KDD datasets
• Has been widely criticized
64
DARPA Setup
Collected using multiple computers connected to the Internet to model a small US Air Force base with restricted personnel
65
Relation Between DARPA, KDD Cup99 and NSL-KDD
• The KDD Cup99 dataset was created from the DARPA dataset by extracting
connection records and organizing them into a format suitable for machine
learning competitions, specifically for the KDD (Knowledge Discovery in
Databases) competition.
66
Relation Between DARPA, KDD Cup99 and NSL-KDD
• The 4 attack categories are:
o Denial of service
o Remote to Local
o User to root
o Probe
67
Relation Between DARPA, KDD Cup99 and NSL-KDD
Features Extracted
68
DARPA / KDD Cup99 Criticism
69
CAIDA
• Collected in 2007
• Contains network traffic traces from Distributed Denial-of-Service (DDoS) attacks
• A DDoS attack overwhelms the target with a flood of network packets, preventing regular traffic from reaching its legitimate destination computer
70
CAIDA criticism
• Lack of attack diversity: focuses on DoS, with no R2L, U2R or probing attacks
• Absence of full network context:
o Topology information: the structure and interconnections of devices in the network (e.g., routers, switches, hosts).
o Traffic characteristics: details of network traffic patterns across various devices and times.
o Application layer details: the protocols and applications being used (e.g., HTTP, FTP, SSH).
71
NSL-KDD
• Developed from the earlier KDD Cup99 dataset
• In KDD Cup99, approximately 78% and 75% of the records are duplicated in the training and testing datasets, respectively (Tavallaee et al., 2009); NSL-KDD removes these duplicates
• Duplicated records bias the training of ML models toward the most frequent records
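Checking a dataset for duplicated records before training is straightforward. The sketch below measures the share of redundant rows and removes them; the three-field records are a made-up stand-in for real connection records.

```python
def duplicate_ratio(records):
    """Fraction of records that are repeats of an earlier identical record."""
    return 1 - len(set(records)) / len(records)

def deduplicate(records):
    """Drop repeated records while keeping the original order."""
    return list(dict.fromkeys(records))

# Hypothetical (protocol, service, flag) connection records.
records = [
    ("tcp", "http", "SF"),
    ("tcp", "http", "SF"),
    ("udp", "dns", "SF"),
    ("tcp", "http", "SF"),
]

print(duplicate_ratio(records))
print(deduplicate(records))
```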
72
• Real network traffic traces of HTTP, SMTP, SSH, IMAP, POP3, and FTP protocols
73
ADFA-LD and ADFA-WD
• Australian Defence Force Academy
• Contain records from both Linux and Windows operating systems; they are created from the evaluation of system-call-based HIDS
• Include zero-day attacks
• Suitable for modern intrusion detection challenges.
74
ADFA-LD and ADFA-WD Features
75
ADFA-LD vs ADFA-WD
CICIDS
• Covers both benign and malicious traffic, including advanced attacks like Heartbleed (OpenSSL) and botnets (DDoS).
77
Bot-IoT
• Created in 2018
• Specifically targets IoT networks, featuring realistic and diverse attack scenarios.
• Includes modern threats like data exfiltration (probing) and privilege escalation (User-to-Root).
78
Dataset Comparison - Deep dive
• Realistic traffic: ADFA, CICIDS, Bot-IoT.
• Zero-day attacks: ADFA, CICIDS, Bot-IoT.
• Full packet capture: All except CAIDA.
79
Dataset Comparison - Summary
80
Types of Computer attacks
• Denial-of-Service (DoS) attacks have the objective of blocking or restricting services delivered by the network or computer to the users. (Botnets)
• Probing attacks have the objective of acquiring information about the network or the computer system.
• User-to-Root (U2R) attacks have the objective of a non-privileged user acquiring root or admin-user access on a specific computer or system on which the intruder had user-level access.
• Remote-to-Local (R2L) attacks involve sending packets to the victim machine. The cybercriminal
81
Feature selection for IDS
• 1. Filter Methods
• 2. Wrapper Methods
82
Filter Methods
• Advantages:
o Computationally efficient.
o Can be used as a preprocessing step for any machine learning model.
• Disadvantages:
o May not account for interactions between features.
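One common kind of filter method scores each feature independently against the label and keeps the top-ranked ones. The sketch below uses the absolute Pearson correlation as the score; the scoring function, feature names and values are assumptions for illustration.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def rank_features(features, labels):
    """Rank feature names by |correlation with the label|, strongest first."""
    scores = {name: abs(pearson(values, labels)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical feature columns and labels (1 = intrusion, 0 = normal).
features = {
    "login_attempts": [5, 1, 2, 10, 3],
    "packet_ttl": [64, 64, 64, 64, 64],
    "session_minutes": [3, 4, 2, 5, 4],
}
labels = [1, 0, 0, 1, 0]

print(rank_features(features, labels))
```

Each feature is scored on its own, which is what makes the method cheap and model-agnostic, and also why it can miss features that are only informative in combination.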
83
Wrapper Methods
84
Challenges of Intrusion Detection Systems (IDS)
85
Evasion Techniques
86
High False Positive Rates
• Anomaly-based IDS
o Flag normal behavior as malicious
o Variability in defining 'normal' behavior
• High false alarms
o Due to variability in normal behavior
o Leads to many false positives
87
Dataset Limitations
• Outdated Datasets
o Commonly used datasets like KDD Cup 99
o Do not reflect current attack methods
and malware
o Challenges in accurate IDS evaluation
• Lack of Realism
o Some datasets fail to capture real-world
network traffic complexity
o Do not represent the diversity of real-
world attacks
88
Complexity and Resource Intensity
• Statistical Models
o Building and maintaining accurate
models is complex
o Resource-intensive process
• Machine Learning Models
o Training requires significant
computational resources
o Updating models needs expertise
89
Adaptability
• Dynamic Environments
o IDS must adapt to changes in
network behavior
o New types of attacks present
challenges
o Effective implementation is difficult
90
Integration with Existing Systems
• Compatibility Challenges
o Ensuring seamless integration with existing
network infrastructure
o Difficulty in integrating with current security
tools
• Performance Impact
o Potential introduction of latency
o Performance overhead needs to be
minimized
o Avoiding disruption to normal operations
91
Conclusion
• Attacks are becoming more sophisticated, higher in volume and more strongly motivated
• Business continuity, and even the national security of states, requires strong cybersecurity processes and measures
• Part of this is designing highly efficient IDSs that are capable of detecting zero-day intrusions
• Combining different IDS categories and algorithms gives higher detection rates with fewer false alarms (higher efficiency + higher cost >> a compromise)
• Digging deep into the hacker's mind by studying evasion techniques is important for building defensive measures
• Datasets are a crucial part of building and evaluating more reliable IDSs; further collaboration between different entities is needed to enhance this area
92