
A Journey through Intrusion Detection Systems: Techniques, Datasets and Challenges

Contributors
• Ahmed Salah Elbehairy
• Ahmed Sabry Mohamed
• Khalid Abdelrahman Abdelwahab
• Ahmed Mahmoud Elkhateeb

2
References
• Ansam Khraisat, Iqbal Gondal, Peter Vamplew and Joarder Kamruzzaman, "Survey of intrusion detection systems: techniques, datasets and challenges."

3
Introduction: Cyberattacks, It's a Jungle Out There
• Malicious attacks are becoming more sophisticated.
• The number of zero-day attacks is increasing.
• According to the 2017 Symantec Internet Security Threat Report:
  o Three billion zero-day attacks were reported in 2016.
• According to the 2017 Data Breach Statistics:
  o Nine billion data records have been lost or stolen by hackers since 2013.
• Hackers are becoming more ambitious, targeting banks themselves rather than just bank customers.
4
Intrusion Detection Systems (IDS)
• There is a high need for an efficient mechanism to detect such sophisticated, novel, high-volume intrusions: the IDS.
• Intrusion: unauthorized activities that damage an information system's:
  o Confidentiality
  o Integrity
  o Availability
• IDS: software or hardware that identifies malicious actions on computer systems in order to maintain their security.
• IDS taxonomies:
  o Detection method
  o Data source
  o Behavior on detection
5
Intrusion Detection Systems (IDS) (Cont'd)
• We will focus on the first two taxonomies.
• Detection method:
  o Signature-based IDS (SIDS)
  o Anomaly-based IDS (AIDS)
    § Statistical-based
    § Knowledge-based
    § Machine learning
• Data source:
  o Host-based IDS (HIDS)
  o Network-based IDS (NIDS)

6
Signature-based Intrusion Detection Systems (SIDS)

7
Overview
Purpose
• Find and identify known attacks.
• Raise an alarm when a match occurs.

Basis
• Uses a memory of known attack patterns.
• A predefined signature database built from previous knowledge.

Usage
• General monitoring at the network and/or host level using:
  o Identified malware.
  o Identified suspicious network packets.
8
Key Points on SIDS

Advantages
• Highly effective against known threats.
• Easy to implement and manage due to minimal configuration.

Disadvantages
• Ineffective against new attacks such as zero-day exploits.
• Potential for false positives due to shared characteristics.

9
Functionality
• A very basic comparison against history (see the sketch below):
  1. Build a database of intrusion signatures.
  2. Compare the current set of activities against the existing signatures.
  3. If a match is found, raise an ALARM; otherwise, treat the activity as normal.
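A minimal sketch of this matching loop, assuming a toy in-memory signature database; the signature strings and the sample payload are hypothetical, not taken from any real signature set:

```python
# Minimal signature-based detection sketch (hypothetical signatures and payload).
SIGNATURE_DB = {
    "etc/passwd": "path traversal attempt",
    "' OR '1'='1": "SQL injection attempt",
    "\x90\x90\x90\x90": "NOP sled (possible buffer overflow)",
}

def inspect(payload: bytes) -> list[str]:
    """Compare an activity (packet payload) against the signature database."""
    text = payload.decode("latin-1", errors="ignore")
    return [desc for sig, desc in SIGNATURE_DB.items() if sig in text]

alerts = inspect(b"GET /../../etc/passwd HTTP/1.1")
print(alerts or "normal")   # -> ['path traversal attempt']
```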

10
If Not Signature, Then What?
• SIDS still has its uses:
  o It remains valuable for identifying sequences of bytes or specific protocol behaviors that match known attack signatures on a network.
• Future proofing:
  o Zero-day attacks are priority one, so other approaches are needed.
• Proactive alternative:
  o Anomaly-based detection.

11
Anomaly-based Intrusion Detection Systems (AIDS) with Statistical Techniques

13
Introduction

14
Overview of Statistical Techniques

Anomaly Detection
• Creates a model of normal behavior for a system.
• Flags significant deviations as anomalies.
• Deviations are a potential indication of intrusions.

Statistical Techniques
• Involve collecting and analyzing data.
• Build statistical models representing normal behavior.
• Detect anomalies based on deviations from these models.
15
Types of Statistical Techniques
• Univariate Models
• Multivariate Models
• Time Series Models
16
Univariate Models
• Analyze a single variable at a time
• Create a statistical profile for one measure of behavior

• Example Application:
o Monitoring the number of failed login attempts
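As an illustration, a univariate profile for this example can be as simple as a z-score over a window of historical counts; the training window and the threshold of 3 below are assumed values, not from the survey:

```python
# Univariate anomaly detection sketch: z-score on failed-login counts
# (the training window and the threshold of 3 are illustrative assumptions).
import statistics

history = [2, 1, 0, 3, 2, 1, 2, 0, 1, 2]      # failed logins per hour (normal profile)
mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_anomalous(failed_logins: int, threshold: float = 3.0) -> bool:
    z = (failed_logins - mean) / stdev
    return abs(z) > threshold

print(is_anomalous(2))    # False: within the normal profile
print(is_anomalous(40))   # True: flagged as a potential intrusion
```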

17
Multivariate Model
• Analyze relationships between two or more variables
• Effective in capturing complex interactions

• Example Applications:
  o Correlating CPU usage and network traffic
  o Detecting anomalies through their joint correlation
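A minimal sketch of a multivariate profile, using the Mahalanobis distance over a hypothetical (CPU usage, network traffic) baseline; the data points and any cutoff you would apply to the distance are illustrative assumptions:

```python
# Multivariate anomaly detection sketch: Mahalanobis distance over
# (CPU usage %, network traffic MB/s). Data and interpretation are illustrative.
import numpy as np

normal = np.array([[20, 5], [25, 6], [30, 8], [22, 5], [28, 7], [35, 9]], float)
mu = normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

print(mahalanobis(np.array([27, 7])))    # small distance: consistent with the profile
print(mahalanobis(np.array([25, 60])))   # large distance: CPU/traffic correlation broken
```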

18
Time Series Model
• Analyze Data Points
o Collected or recorded at specific time intervals
• Detect Anomalies
o Based on the temporal sequence of data

• Example Applications:
  o Monitoring network traffic over time
  o Identifying unusual spikes
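A small sketch of this idea: compare each new traffic sample against a moving-average baseline and flag large spikes; the window size, spike factor, and traffic samples are assumed for illustration:

```python
# Time-series anomaly sketch: flag points far above a moving-average baseline
# (window size, spike factor, and the traffic samples are illustrative).
traffic = [100, 110, 95, 105, 102, 98, 400, 104, 99]   # requests per minute
window = 5
for i in range(window, len(traffic)):
    baseline = sum(traffic[i - window:i]) / window
    if traffic[i] > 2 * baseline:                        # spike rule
        print(f"t={i}: spike {traffic[i]} vs baseline {baseline:.0f}")
```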

19
Example Techniques in Statistical AIDS

Statistical Process Control (SPC)
• Uses control charts to monitor system behavior.
• Detects deviations from the norm.

Principal Component Analysis (PCA)
• Reduces data dimensionality.
• Facilitates easier identification of anomalies.

Clustering
• Groups similar data points together.
• Anomalies are those that do not fit into any cluster.
20
Advantages of Statistical Techniques
in AIDS

21
Challenges of Statistical Techniques in AIDS

High False Positive Rates
• Normal behavior varies widely.
• This leads to false alarms.

Complexity
• Building accurate models is complex.
• Maintenance is resource-intensive.
22
Types of Knowledge-Based Techniques
• Finite State Machines
• Description Languages
• Expert Systems
23
Finite State Machine
• A computation model used to represent and control execution flow, based on a mathematical representation.
• Represented in the form of states, transitions, and activities.
• A state checks the historical data; for instance, any variation in the input is noted, and based on the detected variation a transition happens.
• An FSM can represent legitimate system behaviour, and any observed deviation from this FSM is regarded as an attack.
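A minimal sketch of FSM-based detection, assuming a simplified TCP-handshake-like state machine; the states and events are illustrative, not a full protocol model:

```python
# Minimal FSM sketch: legitimate handshake-like behaviour as allowed transitions;
# any observed event outside the FSM is treated as an attack.
# The states and events here are simplified, illustrative assumptions.
ALLOWED = {
    ("CLOSED", "SYN"): "SYN_RECEIVED",
    ("SYN_RECEIVED", "ACK"): "ESTABLISHED",
    ("ESTABLISHED", "FIN"): "CLOSED",
}

def check_session(events):
    state = "CLOSED"
    for event in events:
        nxt = ALLOWED.get((state, event))
        if nxt is None:
            return f"attack suspected: '{event}' not allowed in state {state}"
        state = nxt
    return "legitimate behaviour"

print(check_session(["SYN", "ACK", "FIN"]))    # legitimate behaviour
print(check_session(["ACK", "ACK", "ACK"]))    # deviation from the FSM
```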

24
Description Languages
• Define the syntax of rules which can be used to specify the characteristics of a defined attack.
• Rules can be built with description languages such as N-grammars and UML.

(Figure: rule structure showing initializations and state transitions)

25
Expert System
• A set of rules that define attacks.
• User behavior is evaluated by checking events (audit records) against this set of rules.
• The rules describe suspicious behavior based on knowledge of past intrusions, known system vulnerabilities, and the installation-specific security policy.
• The user's behavior is analyzed without reference to whether it matches past behavior patterns (simultaneously, as in the figure).

26
Machine Learning AIDS

27
Machine Learning: What is it?
• Extracting knowledge from large
quantities of data
o Learn from training data
o Generalize to unseen data
• Why use machine learning to build
AIDS?
o Improved accuracy
o Less requirement for human
knowledge

28
Machine Learning AIDS
• Various AIDSs based on machine learning exist
• Two main kinds of machine
learning are used to build AIDSs:
o Supervised learning
o Unsupervised learning

29
Supervised Learning
• Algorithms are trained using labeled datasets: inputs (features) + outputs (classes).
• Network traffic monitoring example:

  Login Attempts | Origin   | Time of Day | Intrusion?
  5              | External | Night       | Yes
  1              | Internal | Day         | No
  2              | Internal | Night       | No
  10             | External | Night       | Yes
  3              | External | Day         | No

• Algorithms learn the inherent relationship between the input data and the classes during training.
• Algorithms are then tested to measure their accuracy.
• Once the accuracy is acceptable, the model is released to handle real scenarios (seen and unseen events).

30
Supervised
Learning (Cont'd)
• Numerous supervised techniques
available:
• Decision Tree
• Naïve Bayes Networks
• Genetic Algorithms (GA)
• Artificial Neural Networks (ANN)
• Fuzzy Logic
• Support Vector Machines
• Hidden Markov Model
• K-Nearest Neighbours

31
Decision Tree
• Structure
o Node = Rule or Test
o Branch = Possible Decision
o Leaf = Class
• Tree is constructed during the
training phase with the help of
labeled data.
• Numerous algorithms are
available:
o ID3
o C4.5
o CART
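A small sketch training a decision tree on the toy table from the Supervised Learning slide, using scikit-learn (assumed available); the numeric encoding of Origin and Time of Day is an illustrative choice:

```python
# Decision-tree sketch on the toy table from the Supervised Learning slide.
# scikit-learn is assumed to be installed; feature encoding is illustrative.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [login_attempts, origin (0=Internal, 1=External), time (0=Day, 1=Night)]
X = [[5, 1, 1], [1, 0, 0], [2, 0, 1], [10, 1, 1], [3, 1, 0]]
y = ["Yes", "No", "No", "Yes", "No"]      # Intrusion?

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["attempts", "origin", "time"]))
print(tree.predict([[7, 1, 1]]))          # classify an unseen event
```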

32
Naïve Bayes
• Naïve:
  o Simplifying assumption: "All features are conditionally independent of each other given the class label."
• Based on the Bayes principle.
• Target question:
  "What is the probability that a particular kind of attack is occurring, given the observed system activities?"
  o Example: P(Intrusion = Y | Login Attempts = 5, Origin = Internal, Time of Day = Day) = ?

33
Naïve Bayes (Cont'd)
• Let's first build the model from the labeled data (features + classes):

  Probability                              | Value
  P(Intrusion = Y)                         | 2/5 = 0.4
  P(Intrusion = N)                         | 3/5 = 0.6
  P(Origin = Internal | Intrusion = Y)     | 0/2 = 0
  P(Origin = External | Intrusion = Y)     | 2/2 = 1
  P(Origin = Internal | Intrusion = N)     | 2/3 = 0.67
  P(Origin = External | Intrusion = N)     | 1/3 = 0.33

34
Naïve Bayes (Cont'd)
• Now let's evaluate whether the observed event is an intrusion or not, using the remaining conditional probabilities:

  Probability                                            | Value
  P(Time = Day | Intrusion = Y)                          | 0/2 = 0
  P(Time = Night | Intrusion = Y)                        | 2/2 = 1
  P(Time = Day | Intrusion = N)                          | 2/3 = 0.67
  P(Time = Night | Intrusion = N)                        | 1/3 = 0.33
  P(Login Attempts = x | Intrusion = Y), x in {5, 10}    | 1/2 = 0.5
  P(Login Attempts = x | Intrusion = Y), otherwise       | 0
  P(Login Attempts = x | Intrusion = N), x in {1, 2, 3}  | 1/3 = 0.33
  P(Login Attempts = x | Intrusion = N), otherwise       | 0
35
Naïve Bayes (Cont'd)
• Applying Bayes' rule with the naïve independence assumption:

  P(Intrusion = Y | Login Attempts = 5, Origin = Internal, Time of Day = Day)
    = [P(Login Attempts = 5, Origin = Internal, Time of Day = Day | Intrusion = Y) * P(Intrusion = Y)]
      / P(Login Attempts = 5, Origin = Internal, Time of Day = Day)
    = [P(Login Attempts = 5 | Intrusion = Y) * P(Origin = Internal | Intrusion = Y) * P(Time of Day = Day | Intrusion = Y) * P(Intrusion = Y)]
      / P(Login Attempts = 5, Origin = Internal, Time of Day = Day)

• Note: because some events have zero frequency, the Laplace smoothing technique should be used.
• Do your calculation:
  o If the probability > 0.5, it is an intrusion; otherwise it is a normal event.
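As a worked sketch of this calculation, the code below applies Bayes' rule with add-one (Laplace) smoothing to the toy table from the earlier slides; treating the login-attempt count as a categorical feature is a simplifying assumption made here for illustration:

```python
# Naïve Bayes sketch on the toy table, with add-one (Laplace) smoothing.
# Treating "login attempts" as categorical is an illustrative simplification.
data = [  # (attempts, origin, time, label)
    (5, "External", "Night", "Yes"), (1, "Internal", "Day", "No"),
    (2, "Internal", "Night", "No"), (10, "External", "Night", "Yes"),
    (3, "External", "Day", "No"),
]

def likelihood(feature_idx, value, label):
    rows = [r for r in data if r[3] == label]
    domain = {r[feature_idx] for r in data}                 # feature's value set
    count = sum(1 for r in rows if r[feature_idx] == value)
    return (count + 1) / (len(rows) + len(domain))          # Laplace smoothing

def posterior(attempts, origin, time):
    scores = {}
    for label in ("Yes", "No"):
        prior = sum(1 for r in data if r[3] == label) / len(data)
        scores[label] = (prior * likelihood(0, attempts, label)
                         * likelihood(1, origin, label) * likelihood(2, time, label))
    return scores["Yes"] / (scores["Yes"] + scores["No"])   # normalize over classes

p = posterior(5, "Internal", "Day")
print(f"P(Intrusion=Y | 5, Internal, Day) = {p:.2f}")       # ~0.21 -> normal event
```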

36
Naïve Bayes (Cont'd)

• Naïve Bayes has reduced efficiency with:
  o Systems whose attributes are dependent on each other
  o Large, high-dimensional datasets
• A more sophisticated technique, Hidden Naïve Bayes (HNB), can resolve these challenges efficiently.

37
Unsupervised
Learning
• Algorithms learn patterns from
unlabeled dataset
• Data is grouped into various
classes through the learning
process
• A joint density model can help to categorize the clusters:
  o Large clusters are labeled as normal
  o Small clusters are labeled as intrusions

38
Unsupervised
Learning (Cont'd)
• Numerous clustering
techniques:
o K-means
o Hierarchical Clustering

39
K-means
• Categorizes N data points into K clusters.
  o e.g. (number of attempts, origin, time of day) ==> Intrusion or Normal
• The user defines the number of clusters.
  o e.g. two classes, Intrusion and Normal ==> K = 2
• Distance-based clustering technique: it applies the Euclidean metric.
• Algorithm (see the sketch below):
  1. Choose K random points as the centers of the K clusters.
  2. Assign each data point to the cluster whose center has the minimum Euclidean distance to it.
  3. Update each center as the mean of the data points assigned to its cluster.
  4. Repeat the assignment and update steps until there is no significant change in the cluster centers.
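A minimal sketch of the algorithm above on two-dimensional points (failed logins, megabytes transferred); the data, K = 2, and the random seed are illustrative assumptions:

```python
# Minimal K-means sketch (K = 2) on 2-D points; data and K are illustrative.
import random

def kmeans(points, k=2, iters=100, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)                  # 1. random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                # 2. assign to nearest center
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        new_centers = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]  # 3. recompute centers
        if new_centers == centers:                      # 4. stop when centers settle
            break
        centers = new_centers
    return centers, clusters

# (failed logins, MB transferred): one dense "normal" group plus a small outlier group
data = [(1, 5), (2, 6), (1, 7), (2, 5), (30, 80), (28, 75), (31, 82)]
centers, clusters = kmeans(data)
print(centers)      # large cluster -> normal, small cluster -> potential intrusions
```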

40
Mixing of Supervised and Unsupervised Techniques

Semi-supervised learning
• The idea is to train the model with a small set of labeled data and a large amount of unlabeled data.
• Why do this?
  o Acquiring labeled data is expensive or time-consuming.
• Numerous algorithms exist, e.g.:
  o Fuzzy-based semi-supervised learning
  o Expectation Maximization
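A small sketch of the semi-supervised idea using scikit-learn's self-training wrapper as a generic stand-in (not the fuzzy-based or Expectation Maximization algorithms named above); unlabeled samples are marked with -1, and the data is synthetic:

```python
# Semi-supervised sketch: a few labeled events plus many unlabeled ones
# (label -1 marks "unlabeled"). Self-training is used as a generic illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Features: [failed logins, MB transferred]
X = np.array([[1, 5], [2, 6], [30, 80], [28, 75],     # labeled
              [1, 4], [2, 7], [31, 82], [29, 78]])    # unlabeled
y = np.array([0, 0, 1, 1, -1, -1, -1, -1])            # 0=normal, 1=intrusion, -1=unlabeled

model = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
print(model.predict([[3, 6], [27, 70]]))              # expected: [0 1]
```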

41
Ensemble Methods
• Multiple machine learning models are combined (synergy over individual effort).
• Numerous ensemble methods exist (a comparison sketch follows below):
• Boosting
  o Sequential ensemble method.
  o Each model is trained to correct the mistakes of its predecessor.
• Bagging
  o Multiple models are trained independently, in parallel, on different subsets of the training data.
  o Their predictions are combined (e.g. voting, averaging).
• Stacking
  o Multiple models (base learners) are trained on the same dataset.
  o A meta-model (a blender) is trained on the outputs of the base models.
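A short sketch comparing the three ensemble styles on the same synthetic data with scikit-learn (assumed available); the dataset and default parameters are illustrative:

```python
# Ensemble sketch: bagging, boosting, and stacking on the same toy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "bagging": BaggingClassifier(random_state=0),       # parallel, bootstrap subsets
    "boosting": AdaBoostClassifier(random_state=0),     # sequential error correction
    "stacking": StackingClassifier(                     # meta-model over base learners
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression()),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```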
42
Intrusion Data Sources

43
Classification of IDS Based on Input
• The previous two sections categorised IDS based on the methods used to identify intrusions:
  o Signature
  o Anomaly
• IDS can also be classified based on the input data sources used to detect abnormal activities.

So, what does this mean?

44
Data Sources
• Host-based
• Network-based

45
HIDS
• HIDS inspect data that originates from the host system and its audit sources:
  o Operating system
  o Windows server logs
  o Firewall logs
  o Application system audits or database logs
• HIDS can detect insider attacks that do not involve network traffic.

46
NIDS
• NIDS monitor network traffic extracted from a network through packet capture.
  o They monitor many computers that are joined to a network.
  o NIDS are able to detect external malicious activities initiated by an external threat at an earlier phase.
  o NIDS have limited ability to inspect all data in a high-bandwidth network because of the volume of data passing through modern high-speed communication networks.

47
Two Good Methods: Integrate?
• Yes! Together with HIDS and firewalls, NIDS can provide multi-tier protection against both external and insider attacks.
• The main idea is to apply a semantic structure to kernel-level system calls in order to understand anomalous program behavior.
• AIDS have also been applied in both NIDS and HIDS to increase detection performance, using the machine learning, knowledge-based, and statistical schemes discussed before.
48
Introduction to IDS Evaluation
• Why evaluate intrusion detection systems?
  o To ensure effective anomaly and attack detection.
  o To assess real-world applicability.
  o To compare models against each other.

49
Evaluation of IDS
• Performance Metrics

Performance Metrics for IDS
• Why not use accuracy to measure model effectiveness?
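One key reason is class imbalance: most traffic is normal, so a model that never raises an alarm can still score a very high accuracy. A small worked illustration (the 990/10 traffic split is an assumed figure, not from the survey):

```python
# Why raw accuracy misleads for IDS: with heavily imbalanced traffic, a model
# that never raises an alarm still scores high. The 990/10 split is illustrative.
normal, attacks = 990, 10
tp, fn = 0, attacks            # "always predict normal": misses every attack
tn, fp = normal, 0

accuracy = (tp + tn) / (normal + attacks)
detection_rate = tp / (tp + fn)
print(f"accuracy = {accuracy:.1%}, detection rate = {detection_rate:.1%}")
# -> accuracy = 99.0%, detection rate = 0.0%
```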

51
Inefficiency of Accuracy metric

52
Inefficiency of Accuracy metric

53
Inefficiency of Accuracy metric

54
Important Concepts for Performance Metrics

55
Important Concepts for IDS Evaluation
• True Positive (TP): an attack correctly flagged as an attack.
• True Negative (TN): normal activity correctly identified as normal.
• False Positive (FP): normal activity incorrectly flagged as an attack.
• False Negative (FN): an attack missed and labeled as normal.

56
Important Concepts for IDS Evaluation

TPR = TP / (TP + FN)
TNR = TN / (FP + TN)
FNR = FN / (FN + TP)
FPR = FP / (FP + TN)
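These rates can be computed directly from the four confusion-matrix counts; the counts below are illustrative, and precision/F1 are included as a preview of the extra metric discussed on a later slide:

```python
# Computing the IDS evaluation metrics from raw confusion-matrix counts.
# The counts below are illustrative, not taken from any dataset in the survey.
tp, tn, fp, fn = 80, 900, 30, 20

tpr = tp / (tp + fn)        # true positive rate (detection rate / recall)
tnr = tn / (fp + tn)        # true negative rate
fpr = fp / (fp + tn)        # false positive rate (false alarm rate)
fnr = fn / (fn + tp)        # false negative rate
precision = tp / (tp + fp)
f1 = 2 * precision * tpr / (precision + tpr)   # F1 score (see later slide)

print(f"TPR={tpr:.2f} TNR={tnr:.2f} FPR={fpr:.2f} FNR={fnr:.2f} F1={f1:.2f}")
```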
57
Important Concepts for IDS Evaluation
• Model thresholds
• Receiver Operating Characteristic (ROC) curve
58
Model Thresholds
• Probabilistic models (e.g., logistic regression, neural networks) output a probability score for each prediction instead of directly saying "Class 1" or "Class 2".
• Example: if a model outputs P = 0.8 for a sample, the model believes there is an 80% chance the sample belongs to Class 1.

59
Threshold and ROC Curve
• The ROC curve is created by changing the threshold systematically and calculating:
  o True Positive Rate (TPR): how many actual positives are correctly identified (sensitivity). Y-axis.
  o False Positive Rate (FPR): how many actual negatives are incorrectly classified as positive. X-axis.
• Each threshold gives one point on the ROC curve (see the sketch below):
  o Low threshold: almost all predictions are positive, leading to a high TPR and FPR.
  o High threshold: fewer predictions are positive, leading to a low TPR and FPR.
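A small sketch of tracing the ROC points by sweeping the threshold, using scikit-learn's roc_curve (assumed available); the labels and scores are made-up examples:

```python
# ROC sketch: sweep the decision threshold over predicted probabilities and
# trace (FPR, TPR) points. Labels and scores are illustrative.
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]                        # 1 = intrusion
y_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.7, 0.8, 0.85, 0.9]  # model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print("AUC =", round(roc_auc_score(y_true, y_scores), 2))
```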
60
Other Metrics Not Mentioned in the Paper
• F1 Score: the harmonic mean of precision and recall, F1 = 2 * (Precision * Recall) / (Precision + Recall).

61
IDS Datasets and Dataset Quality

62
Difference from other Surveys

63
DARPA
• The earliest effort to create an IDS dataset.
• Created by the Defense Advanced Research Projects Agency (DARPA) and the MIT Lincoln Labs.
• About 4 GB of data.
• Basis for the KDD Cup99 and NSL-KDD datasets.
• Has been widely criticized.

64
DARPA Setup
• Collected using multiple computers connected to the Internet, modeling a small US Air Force base with restricted personnel.

65
Relation Between DARPA, KDD Cup99 and NSL-KDD
• The KDD Cup99 dataset was created from the DARPA dataset by extracting connection records and organizing them into a format suitable for machine learning competitions, specifically the KDD (Knowledge Discovery in Databases) Cup competition.

66
Relation Between DARPA, KDD Cup99 and NSL-KDD (Cont'd)
• The four attack categories are:
  o Denial of Service (DoS)
  o Remote to Local (R2L)
  o User to Root (U2R)
  o Probe

67
Relation Between DARPA, KDD Cup99 and NSL-KDD: Features Extracted

68
DARPA / KDD Cup99 Criticism
• Creech & Hu, 2014b:
  o Realism: the dataset does not accurately reflect real-world network conditions. The simulated attacks and traffic do not represent the complexity and diversity found in modern networks.
  o Redundancy: the dataset contains a high level of redundancy in network packet records, which can bias machine learning models and limit their ability to generalize to real-world scenarios.
  o Limited attack variety: it does not include a diverse range of attacks, which makes it less effective for evaluating the capability of an IDS to handle modern or emerging threats.
  o Environment constraints: the dataset was generated in a controlled environment that modeled a military network, which is not representative of more dynamic, heterogeneous real-world network environments.

69
CAIDA
• Collected in 2007.
• Contains network traffic traces from Distributed Denial-of-Service (DDoS) attacks.
• A DDoS attack overwhelms the target with a flood of network packets, preventing regular traffic from reaching its legitimate destination.

70
CAIDA Criticism
• Lack of attack diversity: the focus is on DoS; there are no R2L, U2R, or probing attacks.
• Absence of full network context:
  o Topology information: the structure and interconnections of devices in the network (e.g., routers, switches, hosts).
  o Traffic characteristics: details of network traffic patterns across various devices and times.
  o Application-layer details: the protocols and applications being used (e.g., HTTP, FTP, SSH).

71
NSL-KDD
• Developed from the earlier KDD Cup99 dataset.
• In KDD Cup99, approximately 78% and 75% of the records are duplicated in the training and testing sets, respectively (Tavallaee et al., 2009).
• Such duplication affects the training of ML models; NSL-KDD was built to address it.

72
ISCX 2012
• Contains real network traffic traces of the HTTP, SMTP, SSH, IMAP, POP3, and FTP protocols.
• Well labeled, but does not simulate zero-day attacks.

73
ADFA-LD and ADFA-WD
• Created at the Australian Defence Force Academy.
• Contain records from both Linux (ADFA-LD) and Windows (ADFA-WD) operating systems.
• Created for the evaluation of system-call-based host-based IDS.
• Include zero-day attacks.
• Suitable for modern intrusion detection challenges.

74
ADFA-LD and ADFA-WD Features

75
ADFA-LD vs ADFA-WD
CICIDS 2017
• Covers both benign and malicious traffic, including advanced attacks like Heartbleed (OpenSSL) and botnets (DDoS).
• Provides 80 network flow features with a realistic topology setup.

77
Bot-IoT
• Created in 2018.
• Specifically targets IoT networks, featuring realistic and diverse attack scenarios.
• Includes modern threats like data exfiltration (probing) and privilege escalation (user-to-root).

78
Dataset Comparison: Deep Dive
• Realistic traffic: ADFA, CICIDS, Bot-IoT.
• Zero-day attacks: ADFA, CICIDS, Bot-IoT.
• Full packet capture: all except CAIDA.

79
Dataset Comparison: Summary

80
Types of Computer Attacks
• Denial-of-Service (DoS): attacks with the objective of blocking or restricting services delivered by the network or computer to its users (e.g., botnets).
• Probing: attacks with the objective of acquiring information about the network or the computer system.
• User-to-Root (U2R): attacks with the objective of a non-privileged user acquiring root or admin access on a specific computer or system on which the intruder previously had only user-level access.
• Remote-to-Local (R2L): attacks that involve sending packets to the victim machine; the cybercriminal, who has no account on that machine, exploits a vulnerability to gain local access remotely.

81
Feature selection for IDS

1. Filter Methods
2. Wrapper Methods

82
Filter Methods

• Evaluate the relevance of features based on statistical metrics,


independently of the machine learning algorithm.

• Advantages:
• Computationally efficient.
• Can be used as a preprocessing step for any machine learning model.
• Disadvantages:
• May not account for interactions between features.
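A minimal filter-method sketch using mutual information scores with scikit-learn (assumed available); the synthetic dataset and k = 5 are illustrative choices:

```python
# Filter-method sketch: rank features by mutual information with the label,
# independently of any classifier. Data is synthetic; k=5 is illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
X_reduced = selector.transform(X)            # keep only the top-5 scored features
print("reduced shape:", X_reduced.shape)     # (500, 5)
```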

83
Wrapper Methods

• Evaluate subsets of features by training and testing a model on them,


optimizing for performance metrics like accuracy or F1-score.
• Advantages:
• Takes feature interactions into account.
• Produces better-performing subsets.
• Disadvantages:
• Computationally expensive, especially for large datasets.
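A minimal wrapper-method sketch using recursive feature elimination (RFE) around a logistic regression, with scikit-learn assumed available; the synthetic data and the target of 5 features are illustrative:

```python
# Wrapper-method sketch: recursive feature elimination (RFE) repeatedly trains
# a model and drops the weakest features. Data is synthetic and illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5).fit(X, y)
print("selected feature indices:", [i for i, kept in enumerate(rfe.support_) if kept])
print("model score on the selected subset:", round(rfe.score(X, y), 3))
```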

84
Challenges of Intrusion Detection Systems (IDS)

85
Evasion Techniques
• Fragmentation: attackers split malicious payloads into smaller packets to avoid detection.
• Flooding: overwhelming the IDS with excessive traffic to hide malicious activities.
• Obfuscation: concealing malicious code to make it difficult for the IDS to recognize it.
• Encryption: encrypting malicious traffic to prevent the IDS from inspecting its content.

86
High False Positive
Rates
• Anomaly-based IDS
o Flag normal behavior as malicious
o Variability in defining 'normal'
behavior
• High False Alarms
o Due to variability in normal behavior
o Leads to many false positives

87
Dataset Limitations
• Outdated Datasets
o Commonly used datasets like KDD Cup 99
o Do not reflect current attack methods
and malware
o Challenges in accurate IDS evaluation
• Lack of Realism
o Some datasets fail to capture real-world
network traffic complexity
o Do not represent the diversity of real-
world attacks

88
Complexity and
Resource Intensity
• Statistical Models
o Building and maintaining accurate
models is complex
o Resource-intensive process
• Machine Learning Models
o Training requires significant
computational resources
o Updating models needs expertise

89
Adaptability

• Dynamic Environments
o IDS must adapt to changes in
network behavior
o New types of attacks present
challenges
o Effective implementation is difficult

90
Integration with
Existing Systems
• Compatibility Challenges
o Ensuring seamless integration with existing
network infrastructure
o Difficulty in integrating with current security
tools
• Performance Impact
o Potential introduction of latency
o Performance overhead needs to be
minimized
o Avoiding disruption to normal operations

91
Conclusion
• Attackers are motivated to launch increasingly sophisticated, high-volume attacks.
• Business continuity, and even states' national security, requires incorporating strong cybersecurity processes and measures.
• Part of this is designing highly efficient IDSs capable of detecting zero-day intrusions.
• Combining different IDS categories and algorithms gives higher intrusion detection with fewer false alarms (higher efficiency but higher cost, so a compromise is needed).
• Digging deep into the hacker's mind by studying evasion techniques is important for building defensive measures.
• Datasets are a crucial part of building and evaluating more reliable IDSs; further collaboration between different entities is needed to enhance this area.

92
