CH 03 - 11 - Unsupervised Learning - Anomaly Detection

This document discusses unsupervised learning techniques for anomaly detection. It explains DBSCAN clustering and how it can identify outliers without specifying the number of clusters. It then covers estimators in scikit-learn for point anomaly detection, including KernelDensity, OneClassSVM, IsolationForest and LocalOutlierFactor, and provides code examples applying these to synthetic data to identify outliers.

Ch 03: Unsupervised Learning - Anomaly Detection (AD)

Prof. Dr. Rashid A. Saeed

Introduction

q Anomaly detection is the process of finding the outliers in your data.
q An outlier is a sample that is inconsistent with the other, regular samples and therefore raises suspicion about its validity.
q The presence of outliers can also impact the performance of ML algorithms when performing supervised tasks.
q Outliers can also interfere with data scaling, which is a common data preprocessing step.
q We'll be discussing estimators available in scikit-learn which can help with identifying outliers in data.


Applications

q Network intrusion detection
q Insurance / Credit card fraud detection
q Healthcare Informatics / Medical diagnostics
q Industrial Damage Detection
q Image Processing / Video surveillance
q Novel Topic Detection in Text Mining
q Lots more!

DBSCAN

q DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
q It is a clustering technique.
q A main benefit of DBSCAN is that it does not require the user to set the number of clusters a priori (as is the case in k-means).
q DBSCAN can capture clusters of complex shapes, and it can identify points that are not part of any cluster.
q DBSCAN scales to large datasets.

q Points that are within a dense region are called core samples (or core points).
q There are two parameters in DBSCAN: min_samples and eps.
q If there are at least min_samples data points within a distance of eps of a given data point, that data point is classified as a core sample.

q The algorithm works by picking an arbitrary point to start with.
q It then finds all points within distance eps of that point.
q If there are fewer than min_samples points within distance eps of the starting point, this point is labeled as noise, meaning that it doesn't belong to any cluster.
q If there are at least min_samples points within a distance of eps, the point is labeled a core sample and assigned a new cluster label.

q The cluster grows until there are no more core samples within distance eps of the cluster.
q Then another point that hasn't yet been visited is picked, and the same procedure is repeated.
q In the end, there are three kinds of points: core points, points that are within distance eps of core points (called boundary points), and noise.
q When the DBSCAN algorithm is run on a particular dataset multiple times, the clustering of the core points and the noise points is always the same. However, a boundary point might be a neighbor of core samples of more than one cluster, so its cluster membership can depend on the order in which points are visited.
q Let's apply DBSCAN on the make_blobs synthetic dataset.
q DBSCAN does not allow predictions on new test data, so we will use the fit_predict method to perform clustering and return the cluster labels in one step, as shown below:
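A minimal sketch of this step (the dataset size and random seed are assumptions for illustration):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Small synthetic dataset (size and seed are illustrative choices)
X, y = make_blobs(random_state=0, n_samples=12)

# fit_predict clusters the data and returns the labels in one step;
# the label -1 marks noise points.
dbscan = DBSCAN()
clusters = dbscan.fit_predict(X)
print("Cluster memberships:\n{}".format(clusters))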

q Here, with a sample this small and the default settings, all data points were assigned the label -1, which stands for noise.

q A small eps means many points are labeled as noise; a very large eps results in all points forming a single cluster.
q Increasing min_samples means that fewer points will be core points, and more points will be labeled as noise.

[Figure: cluster assignments for varying eps and min_samples. Noise points are shown in white, core samples as large markers, and boundary points as smaller markers.]

q While DBSCAN doesn't require setting the number of clusters explicitly, setting eps implicitly controls how many clusters will be found.
q Finding a good setting for eps is sometimes easier after scaling the data using StandardScaler or MinMaxScaler, as these scaling techniques ensure that all features have similar ranges (see the sketch below).
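A sketch of this idea (the dataset and parameter values are illustrative assumptions):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(random_state=0, n_samples=500)

# Rescale all features to zero mean and unit variance so that a
# single eps value is meaningful across features.
X_scaled = StandardScaler().fit_transform(X)

clusters = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
print("Unique labels:", set(clusters))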

Classification
Point Anomaly Detection

q scikit-learn estimators
Ø KernelDensity
Ø OneClassSVM
Ø IsolationForest
Ø LocalOutlierFactor

Source: https://coderzcolumn.com/tutorials/machine-learning/scikit-learn-sklearn-anomaly-detection-outliers-detection

Make_blobs

• The Blobs dataset has 3 clusters, with 500 samples and 2 features per sample; the examples that follow use it (see the sketch below).
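A sketch of generating such a dataset (the random seed is an assumption):

from sklearn.datasets import make_blobs

# 500 samples, 2 features, 3 cluster centers, as described above.
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=1)
print(X.shape)   # (500, 2)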

KernelDensity

KernelDensity estimates the density of the samples, which can then be used to flag outliers (samples in low-density regions).

Fitting Model to Data
Fit the KernelDensity estimator (KDE) to the data, as sketched below.
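A minimal sketch (the kernel and bandwidth values are assumptions; bandwidth is a tuning parameter):

from sklearn.neighbors import KernelDensity

# Gaussian kernel; bandwidth controls the smoothness of the estimate.
kde = KernelDensity(kernel='gaussian', bandwidth=0.5)
kde.fit(X)   # X is the make_blobs data from above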


Calculate Log Density Evaluations for Each Sample

q The KernelDensity estimator has a method named score_samples() which accepts a dataset and returns the log density evaluation for each sample.
q We'll treat 95% of the samples as valid data and 5% as outliers based on the output of score_samples(), as sketched below.
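A sketch of this step; mquantiles from SciPy is one way to compute the 5% cutoff (the variable names kde_X and tau_kde follow the next slide):

from scipy.stats.mstats import mquantiles

# Log density evaluation for every sample; low values mean
# the sample lies in a low-density region.
kde_X = kde.score_samples(X)

# tau_kde: the 5% quantile of the log densities, used as the cutoff
# between outliers (bottom 5%) and valid samples (top 95%).
tau_kde = mquantiles(kde_X, 0.05)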

Dividing Dataset into Valid Samples and Outliers

All the values in the kde_X array which are less than tau_kde will be outliers, and values greater than or equal to it will be qualified as valid samples.


Filter the data to divide it into outliers and valid samples:
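A sketch using boolean masks over the threshold computed above:

# Samples whose log density falls below the cutoff are outliers;
# the rest are kept as valid samples.
outliers = X[kde_X < tau_kde]
valid = X[kde_X >= tau_kde]
print(outliers.shape, valid.shape)   # roughly 5% / 95% of the samples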

Plot Outliers with Valid Samples for Comparison
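A minimal matplotlib sketch (colors and marker sizes are arbitrary choices):

import matplotlib.pyplot as plt

plt.scatter(valid[:, 0], valid[:, 1], s=20, label='valid samples')
plt.scatter(outliers[:, 0], outliers[:, 1], s=40, c='red', marker='x', label='outliers')
plt.legend()
plt.title('KernelDensity: outliers vs valid samples')
plt.show()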


OneClassSVM

The OneClassSVM estimator learns a boundary around the training samples and is used behind the scenes to decide whether a sample is an outlier or not.

rbf: Radial basis function kernel
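A sketch of fitting the estimator; the nu value is an assumption chosen to target roughly the same 5% outlier fraction as above:

from sklearn.svm import OneClassSVM

# nu bounds the fraction of training samples treated as outliers;
# 0.05 targets roughly a 5% outlier rate (an illustrative choice).
svm = OneClassSVM(kernel='rbf', nu=0.05)
svm.fit(X)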

Predict Sample Class (Outlier vs Normal)

q OneClassSVM provides a predict() method which accepts samples and returns an array consisting of the values 1 or -1.
q Here 1 represents a valid sample and -1 represents an outlier.
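A sketch of using predict() to split the data:

preds = svm.predict(X)            # array of 1 (valid) and -1 (outlier)
outliers_svm = X[preds == -1]
valid_svm = X[preds == 1]
print(outliers_svm.shape, valid_svm.shape)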


Plot Outliers with Valid Samples for Comparison
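The comparison plot can reuse the same pattern as before, now with the SVM masks:

import matplotlib.pyplot as plt

plt.scatter(valid_svm[:, 0], valid_svm[:, 1], s=20, label='valid samples')
plt.scatter(outliers_svm[:, 0], outliers_svm[:, 1], s=40, c='red', marker='x', label='outliers')
plt.legend()
plt.title('OneClassSVM: outliers vs valid samples')
plt.show()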

