
AIMLCZG567: AI & ML Techniques for Cyber Security

Jagdish Prasad
BITS Pilani, Pilani Campus (WILP)

Session: 06
Title: Machine Learning for Anomaly Detection
Agenda
• Anomaly Detection
• Machine Learning Algorithms for Anomaly Detection
• Association Rules
• Artificial Neural Network
• Random Forest Approach
• Clustering Approach
• Deep Learning Techniques

• Ref: https://ff12.fastforwardlabs.com
Anomaly Detection



Misuse Detection v/s Anomaly Detection



What is Anomaly Detection?
• Anomaly detection defines a profile of normal behaviour, which reflects the health and sensitivity of a cyberinfrastructure.
• Anomalous behaviour is a pattern in data that does not conform to the expected behaviour, including outliers, contaminants, noise, etc.
• Anomaly detection requires a clear boundary between normal and anomalous behaviour.
• A profile must robustly characterize the normal behaviour of an entity, such as a host/IP address or VLAN segment, and must be able to track the normal behaviours of the target environment.
• The following information can be used to define a profile:
• Occurrence patterns of specific commands in application protocols
• Association of content types with different fields of application protocols
• Connectivity patterns between protected servers and the outside world
• Rate and burst length distributions for all types of traffic



What is Anomaly Detection?
• Anomaly detection flags malicious behaviours such as:
• Segmentation of binary code in a user password
• Stealth reconnaissance attempts
• Backdoor services on well-known standard ports
• Natural failures in the network
• New buffer overflow attacks
• HTTP traffic on a nonstandard port
• Intentionally stealthy attacks
• Example: If a user who usually logs in around 10 am from a university dormitory logs in at 5:30 am from an IP address in China, then an anomaly has occurred.
• Business use cases:
• Medical diagnostics
• Network intrusion detection
• Manufacturing defect detection
• Fraud detection
Challenges in Anomaly Detection
• Data Volume: Huge volumes of data with high-dimensional feature spaces are difficult to analyse and monitor manually.
• Computational Complexity: Analysis and monitoring require highly efficient computational algorithms for data processing and pattern learning.
• Quantum of Malicious Data: The amount of malicious data is much smaller than the amount of normal data in the huge volume of network data.
• Threshold Selection: The imbalanced distribution of normal and anomalous data induces a high FAR (False Alarm Rate).
• Streaming Data: Network data arrives continuously as streams and requires online analysis.
• Definition of Anomaly: The concept of an anomaly/outlier varies among application domains.
• Contaminated Data Sets: Training and testing data may contain unknown noise.



Types of Anomaly

Global Outliers



Why Machine Learning for Anomaly Detection?
• Makes scaling anomaly detection easier by automating the identification of patterns and anomalies without requiring explicit programming.
• ML algorithms are highly adaptable to changing data patterns, making them efficient and robust over time.
• Easily handles large and complex data sets, keeping anomaly detection efficient despite data complexity.
• Enables early detection by identifying anomalies as they happen, saving time and resources.
• ML-based anomaly detection systems achieve higher accuracy than traditional methods.



Why Machine Learning for Anomaly Detection?

Supervised Learning



ML Process for Anomaly Detection
• An anomaly detection system consists of five steps:
• Data collection
• Data pre-processing
• Normal behaviour learning phase
• Identification of abnormal behaviours using dissimilarity detection techniques
• Security responses
• Data volume is extremely large and requires data reduction during pre-processing.
• Most of the data in the network is streaming data and requires further data reduction.
• Data pre-processing includes feature selection, feature extraction, and dimensionality reduction.



Steps to Build Anomaly Detection Model
• Step 1: Train the model
• Build a model of normal behaviour using training data.
• Depending on the specific anomaly detection method, training data may contain both normal and abnormal data points, or only normal data points.
• Step 2: Test the model
• Assign scores to behaviour patterns.
• Select a threshold rule that defines the anomaly tagging process, e.g. scores above a given threshold are tagged as anomalies, while those below it are tagged as normal (see the sketch below).
• Adjust the threshold value to fine-tune the "sensitivity" of the anomaly tagging process.
• Step 3: Validate the model results
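A minimal Python sketch of the Step 2 scoring-and-thresholding logic; the score values and threshold here are illustrative, not from the deck:

```python
import numpy as np

def tag_anomalies(scores, threshold):
    """Tag each behaviour pattern: True = anomaly, False = normal."""
    return np.asarray(scores) > threshold

# Illustrative anomaly scores from a trained model. Raising the threshold
# lowers sensitivity (fewer alerts); lowering it raises sensitivity
# (more alerts, at the cost of more false positives).
scores = [0.05, 0.10, 0.92, 0.07, 0.88]
print(tag_anomalies(scores, threshold=0.5))   # [False False  True False  True]
```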



Machine Learning for Anomaly Detection
• Machine-learning algorithms group normal patterns by using similarity measures between the patterns of input events and predefined normal behaviours.
• Outliers are flagged as abnormal candidates.
• Disadvantages:
• May trigger high rates of false alarms, as the method flags any significant deviation from the baseline as an intrusion.
• Non-intrusive behaviour that falls outside the normal range will also be labelled as an intrusion (false positive).
• Hackers often modify malicious code or data to make them look similar to normal patterns, increasing the probability of being labelled as normal (false negative).
• Anomaly detection does not differentiate between attacks.
• Both false negative and false positive rates need to be minimized.



ML Algorithms for Anomaly Detection

Clustering
• Assumptions: Normal data points belong to a cluster (or lie close to its centroid) in the data, while anomalies do not belong to any cluster.
• Anomaly scoring: Distance from the nearest cluster centroid.
• Notable examples: Self-Organizing Maps (SOMs), K-Means Clustering, Expectation Maximization (EM).

K-Nearest Neighbour
• Assumptions: Normal data instances occur in dense neighbourhoods, while anomalous data are far from their nearest neighbours.
• Anomaly scoring: Distance from the kth nearest neighbour.
• Notable examples: K-Nearest Neighbours (KNN).

Classification
• Assumptions: A classifier can be learned that distinguishes between normal and anomalous points in the given feature space; labelled data exists for normal and abnormal classes.
• Anomaly scoring: A measure of the classifier's estimate (likelihood) that a data point belongs to the normal class.
• Notable examples: One-Class Support Vector Machines (OCSVMs).

Statistical
• Assumptions: Given an assumed stochastic model, normal data instances fall in high-probability regions of the model, while abnormal data points lie in low-probability regions.
• Anomaly scoring: Probability that a data point lies in a high-probability region of the assumed distribution.
• Notable examples: Regression Models (ARMA, ARIMA).

Deep learning
• Assumptions: As for statistical methods, normal data instances fall in high-probability regions of the learned model, while abnormal data points lie in low-probability regions.
• Anomaly scoring: Probability that a data point lies in a high-probability region of the assumed distribution.
• Notable examples: Autoencoders, Sequence-to-Sequence Models, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs).
ML Techniques for Anomaly Detection

# | Technique Used | Input Data Format | Level | References
C1 | Statistical Methods | Sequence of system calls, offline | Host | Ye et al. (2001), Feinstein et al. (2003), Smaha (1988), Ye et al. (2002)
C2 | Statistical Methods | TCP/IP data, online | Network | Yamanishi and Takeuchi (2001), Yamanishi et al. (2000), Mahoney and Chan (2002, 2003), Soule et al. (2005)
C3 | Unsupervised Clustering Algorithms | TCP/IP data, offline | Network | Portnoy et al. (2001), Leung and Leckie (2005), Warrender et al. (1999), Zhang and Zulkernine (2006a,b)
C4 | Subspace | TCP/IP data, offline | Network | Li et al. (2006)
C5 | Information Theoretic | TCP/IP data, online | Network | Lakhina et al. (2005)
C6 | Association Rules | Frequency of system calls, online | Host | Lee and Stolfo (1998), Abraham et al. (2007a,b), Su et al. (2009), Lee et al. (1999)
C7 | Kalman Filter | TCP/IP data, online | Network | Soule et al. (2005)
C8 | Hidden Markov Model (HMM) | Sequence of system calls, offline | Host | Warrender et al. (1999)
C9 | ANN | Sequence of system calls, offline | Host | Ghosh et al. (1998, 1999), Liu et al. (2002)
C10 | Principal Component Analysis (PCA) | TCP/IP data, online | Network | Lakhina et al. (2004), Ringberg et al. (2007)
C11 | KNN | Frequency of system calls, offline | Host | Liao and Vemuri (2002)
C12 | SVM | TCP/IP data, offline | Network | Hu et al. (2003), Chen et al. (2005)


Rule Based Anomaly Detection



Rule Based Anomaly Analysis
• Rules are defined to describe normal profiles of users, programs, and other resources in cyberinfrastructures.
• The anomaly detection method identifies a potential attack if users or programs act inconsistently with the defined rules.
• Constructing anomaly detection models using association rules is performed in two steps:
• System audit data is mined for consistent and useful patterns of program and user behaviour.
• Inductively learned classifiers are trained, using the relevant features present in the patterns, to recognize anomalies.



Threshold Rules
• Test events or flows for activity that is greater than or less than a specified range.
• Threshold rules are used to detect bandwidth usage changes in applications, failed services, the number of users connected to a VPN, and large outbound transfers.
• Example:
• A user who was involved in a previous incident makes a large outbound transfer.
• When a user is involved in a previous offense, automatically set the rule response to add the user to the reference set.
• If you have a watch list of users, add them to the reference set.
• Tune acceptable limits within the threshold rule.



Behavioural Rules
• Test events or flows for volume changes that occur in regular patterns, to detect outliers, e.g. a mail server that has an open relay and suddenly communicates with many hosts, or an IPS that starts to generate numerous alerts.
• A behavioural rule learns the rate or volume of a property over a pre-defined season.
• The season defines the baseline comparison timeline for what you are evaluating.
• When you set a season of 1 week, the behaviour of the property over that 1 week is learned, and then you use rule tests to alert you to changes.
• After a behavioural rule is set, the seasons adjust automatically.
• The data in the season is learned and continually evaluated, so that business growth is profiled within the season and you do not have to change your rules.
• The longer a behavioural rule runs, the more accurate it becomes over time.
• You can then adjust the rule responses to capture more subtle changes.



Behavioral Rules
• Detect changes in traffic or properties that are always present, such as mail traffic, firewall traffic, bytes transferred by common protocols (e.g. traffic on port 443), or applications that are common within the network.
• Define a pattern, traffic type, or data type that can be tracked to generate an overall trend or historical analysis.
• Assign rule tests against that pattern to alert on special conditions.
• Example:
• When the importance of the current traffic level (on a scale of 0 to 100) is set to 70 in the rule test against learned traffic trends and behaviour, the system sends an alert when the traffic deviates by +70 or -70 from the learned behaviour.



Application of Association Rule in Audit Data for Anomaly Detection
• Lee and Stolfo (1998) detected insider attacks using audit data.
• Patterns were aggregated into one normal profile for each user: two rules were merged if they had the same functions on the left- and right-hand sides and their support and confidence values were within 5% of each other, respectively.
• Similarity score = m/n, where m is the number of new patterns that match historical normal profile patterns, and n is the number of all new patterns for detection (see the sketch below).
• A higher similarity score indicates that the user performs normal behaviours.
• Different user roles were identified, and each user had three profiles for time segments: morning, afternoon, and night.
• Patterns from the first 4 weeks of data were merged into each user's profile.
• Data from the fifth week was used as training data, recording the normal range of similarity scores by comparing the patterns from these data to the recorded profiles of the first 4 weeks.
• Data from the sixth week included anomalous behaviours; their patterns were measured against the normal profile, and the resulting similarity score (anomaly) was compared with the normal range (normal).
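A minimal sketch of the m/n similarity computation, assuming patterns are represented as hashable rule signatures; the example signatures are hypothetical:

```python
def similarity_score(new_patterns, profile_patterns):
    """m/n: fraction of new patterns that match the historical profile."""
    profile = set(profile_patterns)
    m = sum(1 for p in new_patterns if p in profile)
    n = len(new_patterns)
    return m / n if n else 0.0

# Illustrative rule signatures; a score near 1 suggests normal behaviour.
profile = ["login->read_mail", "read_mail->browse"]
new = ["login->read_mail", "login->exfiltrate"]
print(similarity_score(new, profile))   # 0.5
```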

Application of Fuzzy Rule in Audit Data for Anomaly Detection
• At times, normal data deviate from the rules by a small margin.
• This requires improving the flexibility of rule-based techniques using fuzzy logic.
• Audit records include many ordinal and categorical features, which bring fuzziness into the rules.
• Example: a rule may constrain the connection duration of a user's process using an expression such as "connection duration = 3 min" or "1 min ≤ connection duration ≤ 4 min".
• Fuzzy frequent episodes can avoid such borderline cases being categorised as anomalies.



Application of Fuzzy Rule for Anomaly Detection
• Luo and Bridges investigated fuzzy rule-based anomaly detection using a real-world data set and a simulated data set.
• Real network traffic data was collected by the Department of Computer Science at Mississippi State University using tcpdump (http://www.tcpdump.org).
• They extracted four features from the data: SN, FN, RN, and PN.
• They divided each feature into three fuzzy sets: LOW, MEDIUM, and HIGH.
• They derived fuzzy association rules among the first three features, and fuzzy frequency episode rules for the last feature.
• They used the traffic data from the afternoon as training data to build the normal-pattern fuzzy rules.
• They used the traffic data from the afternoon, evening, and night as testing (anomaly detection) data.
• During testing, they introduced a similarity function to compare the normal patterns with the testing patterns.



Application of Fuzzy Rule for Anomaly Detection
• Rules derived from the testing data in the afternoon, evening, and night were very similar, less similar, and least similar, respectively, to the rules derived from the training data.
• Three hours of afternoon data were selected as training data, and nine 3-hour data sets from the afternoon, evening, and night were used as testing data.
• Fuzzy rules derived from the testing data in the same time slots as the training data were more similar to the rules generated by the training data than to the rules generated from any other data sets.
• Simulated data, comprising three network traffic data sets, were collected by the Institute for Visualization & Perception Research at the University of Massachusetts Lowell.
• The first data set had normal patterns; the other two data sets, called network1 and network3, contained IP-spoofing intrusions and port-scanning intrusions.
• The normal data set was split into training and testing parts, and the normal training data was used to train rule sets.
• Testing rule sets were derived from the normal testing data and from network1 and network3.
• The similarity between these rule sets was compared, demonstrating that the fuzzy episode rules could detect anomalies.
Artificial Neural Networks



Artificial Neural Network
• Machine learning methods can be used to create two types of profiles:
• User profiles, based on the sequences of individual normal commands.
• Software profiles, based on the sequences of system calls.
• Software profiles can abstract away the vagaries of users and catch users who slowly change their behaviours to fool the profiling system.
• An ANN can work on incomplete data and classify it as normal or anomalous.
• An ANN is trained as follows (see the sketch below):
• Input data is fed to the network and the activations of each layer of neurons are cascaded forward (feed forward).
• By comparing the output with the ground truth, the weights in different layers are updated layer by layer (back propagation).
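A minimal feed-forward/back-propagation training loop sketched in PyTorch; the layer sizes, placeholder data, and five-class output are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Toy network: 22 input features -> one hidden layer -> 5 traffic classes.
model = nn.Sequential(nn.Linear(22, 64), nn.ReLU(), nn.Linear(64, 5))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(256, 22)           # placeholder feature vectors
y = torch.randint(0, 5, (256,))    # placeholder ground-truth labels

for epoch in range(20):
    logits = model(X)              # feed forward: cascade activations
    loss = loss_fn(logits, y)      # compare output with ground truth
    optimizer.zero_grad()
    loss.backward()                # back propagation: gradients layer by layer
    optimizer.step()               # update the weights
```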



Artificial Neural Network: Detection Steps
• Feature Selection: Key attributes are packet source and destination addresses, source and destination ports, and protocol (TCP, UDP, ICMP).
• Ranking: The principle of ranking is used to categorize the features into three groups:
• Preliminary features represent the most useful features.
• Secondary features have less impact on the monitoring process.
• Less important features have the slightest impact on the detection process.
• This categorization can accelerate the whole operation of the system.
• Example: consider only 22 of the 41 KDD'99 features, experimenting with 5 preliminary, 7 secondary, and 10 less important features.
• Encoding: Encoding is done without using any keys, and the non-numerical data (records in the database) are converted into serial numbers.
• Normalization: The process of efficiently organising the data, with three goals: eliminate redundant, partially dependent, and transitively dependent data.
• Anomaly Detection: Use an ANN algorithm for classification into Normal, Probing, R2L (remote to local), U2R (user to root), and DoS attacks (a pipeline sketch follows below).
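A hedged scikit-learn sketch of the encoding, normalization, and classification steps; the column names, sample records, and network size are hypothetical, not taken from the original system:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Illustrative KDD-style records: categorical fields become serial numbers
# (encoding), numeric fields are scaled (normalization), then an ANN classifies.
df = pd.DataFrame({
    "protocol": ["tcp", "udp", "tcp", "icmp"],
    "service":  ["http", "dns", "ftp", "echo"],
    "duration": [1.2, 0.3, 45.0, 0.1],
    "label":    ["Normal", "Normal", "R2L", "Probing"],
})

preprocess = ColumnTransformer([
    ("encode", OrdinalEncoder(), ["protocol", "service"]),
    ("scale", StandardScaler(), ["duration"]),
])
clf = Pipeline([
    ("prep", preprocess),
    ("ann", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)),
])
clf.fit(df.drop(columns="label"), df["label"])
print(clf.predict(df.drop(columns="label")))
```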



Artificial Neural Network: Attack Examples
• Probing: ipsweep, mscan, nmap, portsweep, sendsaint, satan, etc.
• R2L (Remote to Local): ftp_write, guess_password, imap, multihop, named, worm, xsnoop, snmpgetattack, etc.
• U2R (User to Root): buffer_overflow, httptunnel, rootkit, sqlattack, xterm, etc.
• DoS: mailbomb, smurf, teardrop, udpstorm, neptune, etc.



Application of ANN for Anomaly Detection
• Ghosh et al. (1998, 1999) used ANNs as the anomaly detection model by analyzing program behaviours.
• System calls were captured using Sun Microsystems' Basic Security Module (BSM) auditing facility for Solaris.
• Normal software profiles were built by capturing the frequencies of system calls, monitoring the behaviour of programs by noting irregularities in program calls.
• The model had a 3% FPR (1998) and a 0% FPR (1999), with 77% of attacks detected.
• Liu et al. (2002) investigated three ANN methods: Back-Propagation (BP), Radial Basis Function (RBF) networks, and Self-Organizing Map (SOM) networks.
• Using two encoding techniques (binary and decimal), the neural networks generated high true-positive rates and low false-positive rates.
• Binary encoding had lower error rates than decimal encoding.
• Decimal encoding handled noise better, and the classifier could be trained with less data.



Support Vector Machine



Support Vector Machine for Anomaly Detection
• A supervised SVM can be used for anomaly detection by training the SVM with both attack data sets and normal data sets.
• SVMs can also be applied as unsupervised machine learning for anomaly detection.
• SVMs can outperform ANNs by fine-tuning the support vectors that separate the data, to:
• Achieve the global optimum
• Easily control the overfitting problem
• A one-class classification method is used to detect outliers and anomalies in a data set.
• Built on the SVM formulation, the one-class SVM applies this one-class classification method for novelty detection.



Support Vector Machine for Anomaly Detection
• The distance between the hyperplane and the nearest data points of each class is referred to as the margin.
• The SVM determines the best, or optimal, hyperplane with the maximum margin, to ensure that the distance between the two classes is as wide as possible.
• For anomaly detection, the SVM computes the margin of a new observation from the hyperplane to classify it (see the sketch below).
• If the margin exceeds the set threshold, the new observation is classified as an anomaly.
• If the margin is less than the threshold, the observation is classified as normal.
• SVM algorithms are highly efficient in handling high-dimensional and complex data sets.
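A minimal one-class SVM sketch with scikit-learn on synthetic data; the nu and gamma values are illustrative, and decision_function plays the role of the signed margin discussed above (negative values fall on the anomalous side):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 2))         # normal behaviour only
X_test = np.vstack([rng.normal(0, 1, size=(5, 2)),
                    [[6.0, 6.0]]])                # one obvious outlier

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
margins = ocsvm.decision_function(X_test)   # signed distance from the boundary
print(margins)
print(ocsvm.predict(X_test))                # +1 = normal, -1 = anomaly
```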



Application of SVM for Anomaly Detection
• Chen et al. (2005) used BSM audit data from the 1998 DARPA intrusion detection evaluation data sets.
• They applied a supervised SVM to this data set and compared the results with those obtained using an ANN in the same workflow.
• System call information was collected over processes, and frequencies were extracted for system calls and processes.
• The 10 days in which the most attacks appeared were selected from the 7-week training data.
• The attack data sets were divided into two halves: one for training and one for testing.
• By replacing "word" and "document" with "system call" and "process", a tf-idf-based encoding scheme was applied to mine the frequency of system calls.
• The Gaussian kernel k(x, y) = exp(−‖x − y‖²/δ²) was chosen as the kernel function.
• The parameter δ² and the margin parameter of the SVM classifier were learned by 10-fold cross-validation (CV) on a training data set (see the sketch below).
• SVM classification was implemented over the testing data using the two parameters obtained.
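A sketch of kernel-parameter selection by 10-fold cross-validation, using scikit-learn's RBF-kernel SVC as a stand-in; gamma corresponds to 1/δ², and the placeholder data and parameter grid are illustrative:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # placeholder tf-idf-style features
y = rng.integers(0, 2, size=200)      # placeholder normal/attack labels

grid = GridSearchCV(
    SVC(kernel="rbf"),                # k(x, y) = exp(-gamma * ||x - y||^2)
    param_grid={"gamma": [1e-3, 1e-2, 1e-1], "C": [1, 10, 100]},
    cv=10,                            # 10-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_)              # parameters then used on the testing data
```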
Application of SVM for Misuse Detection
• The detection results were evaluated using ROC curves of the intrusion detection rate against the FAR.
• SVM outperformed ANN under both the simple frequency-based method and the tf-idf-based encoding scheme.
• The performance of Robust Support Vector Machines (RSVMs) was compared with that of conventional SVMs and Nearest Neighbour classifiers in separating normal profiles from the intrusive profiles of computer programs.
• The results indicated the superiority of RSVMs in terms of high intrusion detection accuracy, low false positives, and the ability to generalize in the presence of noise.



Random Forests



Random Forests
• The random forests algorithm has better prediction accuracy and efficiency on large data sets in high-dimensional feature spaces.
• Network traffic flows form large data sets with high-dimensional feature spaces.
• Random forest algorithms are suitable for detecting outliers in network traffic data sets without attack-free training data.
• Results show that the random forests approach is comparable to unsupervised anomaly detection approaches.
• The accuracy of random forests depends on the strength of the individual tree classifiers and a measure of the dependence between them.
• The number of randomly selected features at each node is critical for the quality of anomaly detection.



Random Forests
• Two key parameters are required to build the network traffic model: the number of random features used to split tree nodes (Nf) and the number of trees in a forest (Nt).
• Combinations of these two variables are selected corresponding to the optimal prediction accuracy of the random forests.
• Random forests use a proximity measure between pairs of data points to find outliers.
• If a data point has low proximity to all the other data points in a given data set, it is likely to be an outlier.
• The proximity prox(x_enquiry, x_j) between x_enquiry and x_j ∈ C_j is incremented by one whenever both data points fall in the same leaf of a tree.
• The final sum is divided by the number of trees to normalize the results.
• The proximity and the degree of outlierness in any class are obtained for each data point extracted from a network traffic data set, and a decision is made using a threshold (see the sketch below).
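A sketch of proximity-based outlier scoring with scikit-learn's random forest; apply() returns each sample's leaf index per tree, and the data, labels, and parameter values (Nt = 100, Nf = 2) are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # placeholder traffic features
y = rng.integers(0, 2, size=100)       # placeholder class labels
X[0] += 8                              # plant one outlier

rf = RandomForestClassifier(n_estimators=100, max_features=2).fit(X, y)
leaves = rf.apply(X)                   # (n_samples, n_trees) leaf indices

# Proximity = fraction of trees in which two samples land in the same leaf
# (the per-tree increments, normalized by the number of trees).
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Low average proximity to all other points suggests an outlier.
avg_prox = (prox.sum(axis=1) - 1) / (len(X) - 1)
print(np.argmin(avg_prox))             # most outlying point (likely index 0)
```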



Isolation Forests
• The principle of isolation forests is that outliers are points with features considerably different from the rest of the data.
• Inliers lie closer together, while outliers lie farther apart.
• A decision tree is constructed by making random cuts on randomly chosen features.
• The tree is allowed to grow until all points have been isolated.
• Fewer splits are required to isolate outliers, so outlier nodes reside closer to the root node.
• The process of constructing a tree with random cuts is repeated to create an ensemble, hence the term forest in the algorithm's name.
• Two key hyperparameters are the number of estimators, which gives the number of trees used in the ensemble, and the contamination, which gives the expected fraction of outliers in the data set (see the sketch below).
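A minimal isolation forest sketch with scikit-learn; n_estimators and contamination are the two hyperparameters named above, with illustrative values and synthetic data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),    # inliers, close together
               [[8.0, 8.0], [-9.0, 7.0]]])         # outliers, far apart

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = iso.fit_predict(X)            # -1 = outlier (isolated in few splits)
print(np.where(labels == -1)[0])
```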



Application of Random Forests in Anomaly Detection
• Zhang and Zulkernine (2006) applied the random forests algorithm to the DARPA MIT KDD Cup 1999 data set.
• Five services were selected as pattern labels for the random forests algorithm: ftp, http, pop, smtp, and telnet.
• Four groups of data sets were generated by combining normal data and attack data at ratios of 99:1, 98:2, 95:5, and 90:10.
• A total of 47,426 normal traffic flow records were selected from the ftp, pop, and telnet services, 5% of http, and 10% of smtp normal services.
• The performance of the system was evaluated using ROC curves.
• They reported a better detection rate while keeping the FP rate lower than other unsupervised anomaly detection systems.
• The detection performance on minority attacks was much lower than on majority intrusions.
• The detection system was improved by using random forests in a hybrid system of misuse detection and anomaly detection.
Clustering Algorithm



Clustering Techniques
• Clustering-based outlier detection methods assume that normal data objects belong to large, dense clusters, whereas outliers belong to small or sparse clusters, or to no cluster at all.
• Clustering-based approaches detect outliers by examining the relationship between objects and clusters.
• An object is an outlier if:
• It does not belong to any cluster.
• There is a large distance between the object and the cluster to which it is closest.
• It is part of a small or sparse cluster; in that case, all the objects in that cluster are outliers.



Clustering Techniques
• Chandola et al. (2006) categorized clustering-based techniques into three groups according to their assumptions.
• The categorization method is similar to assigning specific patterns or characteristics to the groups of normal and anomalous data.
• Clustering-based anomaly detection can also be categorized into two groups: distance-based clustering and density-based clustering.
• The first group includes k-means clustering, Expectation Maximization, and SOM.
• The second group includes CLIQUE and MAFIA.
• We focus on the first group, as the second group has fewer research results and does not achieve anomaly detection results as good as the first group.



Clustering Techniques
• The most widely deployed distance-based clustering method is adapted from k-means clustering.
• Instead of defining K in advance, these algorithms constrain the clustering hyper-spheres by a threshold r.
• Given a data set X = {x_1, …, x_m} and a cluster set C = {C_1, …, C_K}, the distance metric dist(x_i, C_j) measures the closeness between data point x_i, i = 1, …, m, and cluster C_j.
• To implement distance-based clustering, follow the steps below (see the sketch after the next slide):
• Step 1: Initialize the cluster set C = {C_1, …, C_K}.
• Step 2: Assign each data point x_i in X to the closest cluster C* ∈ {C_1, …, C_K} if dist(x_i, C*) ≤ r; otherwise, create a new cluster C′ for this data point and update the cluster set C.
• Step 3: Iterate until all data points are assigned to a cluster.



Clustering Techniques…
• The most commonly used distance metric is Euclidean distance.
• If we use the distance between a data point x_i and cluster C_j as dist(x_i, C_j), the clustering algorithm is similar to k-means clustering, except for the additional clustering threshold constraint r.
• As all training data are unlabelled, we cannot determine which clusters belong to normal or anomalous types.
• Each cluster may include mixed instances of normal data and different types of attacks.
• Normal data outnumber anomalous data, so clusters that constitute more than a fraction α of the training data set are labelled as "normal" groups.
• The other clusters are labelled as "attack".
• Determining abnormal clusters by class size can lead to small normal data groups being misclassified as anomaly clusters, especially when the normal data contain multiple types of behaviour.
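A minimal sketch of the threshold-r clustering steps and the size-based α labelling described above; the values of r and α and the synthetic data are illustrative:

```python
import numpy as np

def threshold_cluster(X, r):
    """Assign each point to the nearest centroid within r, else start a new cluster."""
    centroids, assign = [], []
    for x in X:
        if centroids:
            d = np.linalg.norm(np.asarray(centroids) - x, axis=1)
            j = int(d.argmin())
            if d[j] <= r:              # Step 2: within threshold -> existing cluster
                assign.append(j)
                continue
        centroids.append(x)            # otherwise create a new cluster
        assign.append(len(centroids) - 1)
    return np.array(assign), np.asarray(centroids)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(95, 2)),   # dense normal traffic
               rng.normal(6, 0.5, size=(5, 2))])   # sparse anomalies
assign, centroids = threshold_cluster(X, r=2.0)

# Clusters holding more than a fraction alpha of the data are "normal".
alpha = 0.15
sizes = np.bincount(assign) / len(X)
labels = np.where(sizes[assign] > alpha, "normal", "attack")
print(dict(zip(*np.unique(labels, return_counts=True))))
```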



Clustering Techniques…
• The threshold r also affects the clustering result.
• When r is large, the number of clusters decreases; when r is small, the number of clusters increases.
• The selection of r depends on knowledge of the normal data distribution.
• For instance, we know statistically that r should be greater than the intra-cluster distance and smaller than the inter-cluster distance.
• Jiang et al. (2006) selected r by computing the mean and standard deviation of the distances between pairs of sample data points from the training data set.
• Once the training data have been clustered and labelled, testing data can be grouped according to their shortest distance to any cluster in the cluster set.



Application of Clustering Technique for Anomaly Detection
• Portnoy et al. (2001) applied the clustering anomaly detection method to the DARPA MIT Knowledge Discovery and Data Mining (KDD) Cup 1999 data set.
• The data set has 4,900,000 data points with 24 attack types and normal activity in the background.
• Each data point is a vector of feature values extracted from the connection record obtained between IP addresses during simulated intrusions.
• The entire KDD data set was partitioned into 10 subsets.
• Subsets containing only one type of attack, or consisting entirely of attacks, were removed, leaving four subsets for CV.
• The training data sets were filtered from the KDD data so that the attack data and normal data had a proportion of 1:99 in the resulting training data set.
• The performance of the training and testing phases was evaluated using ROC curves (FP versus TP).



Application of Clustering Technique for Anomaly Detection…
• 10% of the KDD data was used to measure performance when choosing the sensitive parameters: the threshold r and the percentage 1 − α.
• r = 40 and α = 0.85 were selected after balancing TP against FP in the ROC, achieving a higher TP at an acceptable FP.
• Training and testing were performed several times with different selections of the combinational subsets for training and testing.
• Clustering with unlabelled data resulted in a lower detection rate for attacks than clustering with supervised learning.
• However, unlabelled data can potentially detect unknown attacks through an automated or semi-automated process, enabling administrators to concentrate on the most likely attack data.



Deep Learning



Deep Learning
• Deep learning is designed to work with multivariate, high-dimensional data.
• This makes it easy to integrate information from multiple sources, and eliminates the challenges associated with individually modelling anomalies for each variable and aggregating the results.
• Deep learning approaches are well adapted to jointly modelling the interactions between multiple variables with respect to a given task; beyond the specification of generic hyperparameters (number of layers, units per layer, etc.), deep learning models require minimal tuning to achieve good results.
• Deep learning methods offer the opportunity to model complex, nonlinear relationships within data, and to leverage this for the anomaly detection task.
• The performance of deep learning models can potentially scale with the availability of appropriate training data, making them suitable for data-rich problems.



Autoencoders
• Autoencoders are neural networks designed to learn a low-dimensional representation of some input data.
• They consist of two components:
• An encoder that learns to map input data to a low-dimensional representation (termed the bottleneck).
• A decoder that learns to map this low-dimensional representation back to the original input data.
• The encoder network learns an efficient "compression" function that maps input data to a salient lower-dimensional representation, such that the decoder network can successfully reconstruct the original input data.
• The model is trained by minimizing the reconstruction error: the difference (mean squared error) between the original input and the reconstructed output produced by the decoder.
• In practice, autoencoders have been applied as a dimensionality reduction technique, as well as in other use cases such as noise removal from images, image colorization, unsupervised feature extraction, and data compression.
Autoencoders
• A semi-supervised approach is used to train the model on normal behaviour.
• The autoencoder is trained on normal data samples only.
• The model learns a mapping function that reconstructs normal data samples with a very small reconstruction error.
• This behaviour carries over to test time, where the reconstruction error is small for normal data samples and large for abnormal data samples.
• To identify anomalies, we use the reconstruction error as an anomaly score and flag samples with reconstruction errors above a given threshold (see the sketch below).
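A compact PyTorch sketch of this semi-supervised scheme, trained on normal samples only; the dimensions, epochs, and 99th-percentile threshold are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Encoder maps 30 features to an 8-dim bottleneck; decoder maps back.
autoencoder = nn.Sequential(
    nn.Linear(30, 8), nn.ReLU(),     # encoder -> bottleneck
    nn.Linear(8, 30),                # decoder -> reconstruction
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

X_normal = torch.randn(512, 30)      # placeholder normal training samples
for _ in range(200):                 # minimize reconstruction error (MSE)
    loss = nn.functional.mse_loss(autoencoder(X_normal), X_normal)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Anomaly score = per-sample reconstruction error; threshold set on normal data.
with torch.no_grad():
    train_err = ((autoencoder(X_normal) - X_normal) ** 2).mean(dim=1)
    threshold = train_err.quantile(0.99)
    X_test = torch.cat([torch.randn(5, 30), 10 * torch.ones(1, 30)])
    test_err = ((autoencoder(X_test) - X_test) ** 2).mean(dim=1)
    print(test_err > threshold)      # True flags the abnormal sample
```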
Variational Autoencoders
• A VAE consists of an encoder and a decoder network trained by variational inference.
• A VAE learns a mapping from an input to a distribution, and learns to reconstruct the original data by sampling from this distribution using a latent code.
• In Bayesian terms, the prior is the distribution of the latent code, the likelihood is the distribution of the input given the latent code, and the posterior is the distribution of the latent code given our input.
• The encoder learns the parameters (mean and variance) of a distribution that outputs a latent code vector Z given the input data (the posterior), typically a Gaussian or Bernoulli.
• The decoder learns a distribution that outputs the original input data point (or something very close to it) given a latent bottleneck sample (the likelihood), typically an isotropic Gaussian distribution.
• The VAE model is trained by minimizing the difference between the estimated distribution produced by the model and the real distribution of the data.
Modelling Behavior for Variational Autoencoders
• Train the VAE on normal data samples.
• There are two methods to flag anomalies (Method 1 is sketched below):
• Method 1:
• Draw samples of the latent code Z from the encoder given the input data, sample reconstructed values from the decoder using Z, and compute a mean reconstruction error.
• Anomalies are flagged based on a threshold on the reconstruction error.
• Method 2:
• Output a mean and a variance parameter from the decoder.
• Compute the probability that the new data point belongs to the distribution of normal data on which the model was trained.
• If the data point lies in a low-density region (below some threshold), flag it as an anomaly.
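A compact PyTorch sketch of Method 1: a small Gaussian VAE trained on normal samples only, scored by the mean reconstruction error over several latent draws (the architecture and sizes are illustrative):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal Gaussian VAE: 30 input features <-> 8-dim latent code."""
    def __init__(self, d_in=30, d_z=8):
        super().__init__()
        self.enc = nn.Linear(d_in, 16)
        self.mu = nn.Linear(16, d_z)       # posterior mean
        self.logvar = nn.Linear(16, d_z)   # posterior log-variance
        self.dec = nn.Sequential(nn.Linear(d_z, 16), nn.ReLU(),
                                 nn.Linear(16, d_in))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample latent code
        return self.dec(z), mu, logvar

vae = VAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
X_normal = torch.randn(512, 30)            # placeholder normal samples

for _ in range(200):                       # ELBO-style objective
    recon, mu, logvar = vae(X_normal)
    recon_loss = nn.functional.mse_loss(recon, X_normal)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
    loss = recon_loss + kl
    opt.zero_grad(); loss.backward(); opt.step()

# Method 1: mean reconstruction error over several samples of Z.
with torch.no_grad():
    x_new = 10 * torch.ones(1, 30)         # far from the normal distribution
    errs = torch.stack([((vae(x_new)[0] - x_new) ** 2).mean()
                        for _ in range(20)])
    print(errs.mean())                     # large score -> flag as anomaly
```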



Advantages of Variational Autoencoders
• VAEs enable Bayesian inference:
• We can sample from the learned latent distribution and decode samples that do not explicitly exist in the original data set but belong to the same data distribution.
• VAEs learn a disentangled representation of a data distribution:
• A single unit in the latent code is sensitive to only a single generative factor.
• This allows some interpretability of the output of VAEs, as we can vary units in the latent code for controlled generation of samples.
• VAEs provide true probability measures that offer a principled approach to quantifying uncertainty in practice:
• e.g. "the probability that a new data point belongs to the distribution of normal data is 80%."



Generative Adversarial Networks (GAN)
• A GAN comprises a pair of neural networks termed a generator (G) and a discriminator (D).
• Both networks are trained jointly, playing a competitive game with the end goal of learning the distribution of the input data (X).
• The generator (G) learns a mapping from random noise of a fixed dimension (Z) to samples (X_) that closely resemble members of the input data distribution.
• The discriminator (D) learns to correctly discern real samples that originated in the source data (X) from fake samples (X_) generated by G.
• At each epoch during training, the parameters of G are updated to maximize its ability to generate samples that are indistinguishable to D, while the parameters of D are updated to maximize its ability to correctly discern true samples X from generated samples X_ (see the sketch below).
• As training progresses, G becomes proficient at producing samples similar to X, and D improves at distinguishing real from fake samples.
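A minimal GAN training loop sketched in PyTorch; the dimensions, learning rates, and placeholder data stand in for the real input distribution X:

```python
import torch
import torch.nn as nn

Z_DIM, X_DIM = 8, 30
G = nn.Sequential(nn.Linear(Z_DIM, 32), nn.ReLU(), nn.Linear(32, X_DIM))
D = nn.Sequential(nn.Linear(X_DIM, 32), nn.ReLU(),
                  nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

X_real = torch.randn(512, X_DIM)         # placeholder input distribution X

for _ in range(200):
    z = torch.randn(512, Z_DIM)          # fixed-dimension random noise Z
    X_fake = G(z)                        # generated samples X_

    # D update: discern real (label 1) from generated (label 0) samples.
    d_loss = bce(D(X_real), torch.ones(512, 1)) + \
             bce(D(X_fake.detach()), torch.zeros(512, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # G update: fool D into scoring generated samples as real.
    g_loss = bce(D(G(z)), torch.ones(512, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```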



Bi-Directional Generative Adversarial Networks (BiGAN)
• Generating the most representative latent noise vector for an arbitrary sample is compute-intensive and very slow.
• To address this issue, new formulations of GANs (BiGANs) enable controlled adversarial inference by introducing an encoder network E.
• The encoder learns the reverse mapping of the generator: it learns to generate a fixed vector Z_ given a sample.
• The discriminator takes pairs of inputs that include the latent representations (Z and Z_) in addition to the data samples (X and X_).
• G learns an induced distribution that outputs samples of X given a latent code Z, while E learns an induced distribution that outputs Z given a sample X.
• Example: the generator component of a GAN trained on images of cars will always output an image that looks like a car, given any latent code.
• At test time, we can leverage this property to infer how different a given input sample is from the data distribution on which the model was trained.
Sequence to Sequence Model
• Neural networks designed to learn mappings between data represented as sequences.
• Each token in a sequence may have some form of temporal dependence on other tokens, i.e. a relationship that has to be modelled to achieve correct results.
• Example: the task of language translation, where a sequence of words in one language needs to be mapped to a sequence of words in a different language.
• To excel at such a task, a model must take into consideration the location of each word/token within the broader sentence to generate an appropriate translation.
• The model has an encoder that generates a hidden representation of the input tokens, and a decoder that takes in the encoder representation and sequentially generates a set of output tokens (see the sketch below).
• The encoder and decoder are composed of long short-term memory (LSTM) blocks, which are particularly suitable for modelling temporal relationships within input data tokens.
• Sequence-to-sequence models can be slow during inference: each individual token in the model output is generated sequentially at each time step, where the total number of steps is the length of the output sequence.
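A minimal LSTM encoder-decoder sketch in PyTorch, illustrating the sequential, step-by-step decoding described above; the feature and hidden sizes are illustrative:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal LSTM encoder-decoder over sequences of feature vectors."""
    def __init__(self, n_feat=4, hidden=16):
        super().__init__()
        self.encoder = nn.LSTM(n_feat, hidden, batch_first=True)
        self.decoder = nn.LSTM(n_feat, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_feat)

    def forward(self, x):
        _, state = self.encoder(x)            # hidden representation of input
        token = torch.zeros_like(x[:, :1])    # start token
        outputs = []
        for _ in range(x.size(1)):            # one output token per time step
            h, state = self.decoder(token, state)
            token = self.out(h)               # fed back as the next input
            outputs.append(token)
        return torch.cat(outputs, dim=1)

model = Seq2Seq()
x = torch.randn(2, 10, 4)      # batch of 2 sequences, 10 steps, 4 features
print(model(x).shape)          # torch.Size([2, 10, 4])
```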



Thank You
