CH 03 - 11 - Unsupervised Learning - Anomaly Detection

This document discusses unsupervised learning techniques for anomaly detection. It explains DBSCAN clustering and how it can identify outliers without specifying the number of clusters. It then covers estimators in scikit-learn for point anomaly detection, including KernelDensity, OneClassSVM, IsolationForest and LocalOutlierFactor, and provides code examples applying these to synthetic data to identify outliers.

Ch 03: Unsupervised Learning - Anomaly Detection (AD)

Prof. Dr. Rashid A. Saeed

Introduction

q Anomaly detection is the process of finding the outliers in your data.
q An outlier is a sample that is inconsistent with the other, regular samples and therefore raises suspicion about its validity.
q The presence of outliers can also impact the performance of ML algorithms when performing supervised tasks.
q Outliers can also interfere with data scaling, which is a common data preprocessing step.
q We'll be discussing estimators available in scikit-learn which can help with identifying outliers in data.


Applications

q Network intrusion detection
q Insurance / Credit card fraud detection
q Healthcare Informatics / Medical diagnostics
q Industrial Damage Detection
q Image Processing / Video surveillance
q Novel Topic Detection in Text Mining
q Lots more!

DBSCAN

q DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
q It is a clustering technique.
q A main benefit of DBSCAN is that it does not require the user to set the number of clusters a priori (as is the case in k-means).
q DBSCAN can capture clusters of complex shapes, and it can identify points that are not part of any cluster.
q DBSCAN scales to large datasets.

q Points that are within a dense region are called core samples (or core points).
q There are two parameters in DBSCAN: min_samples and eps.
q If there are at least min_samples data points within a distance of eps of a given data point, that data point is classified as a core sample.

q The algorithm works by picking an arbitrary point to start with.
q It then finds all points within distance eps of that point.
q If there are fewer than min_samples points within distance eps of the starting point, this point is labeled as noise, meaning that it doesn't belong to any cluster.
q If there are at least min_samples points within a distance of eps, the point is labeled a core sample and assigned a new cluster label.

q The cluster grows until there are no more core samples within distance eps of the cluster.
q Then another point that hasn't yet been visited is picked, and the same procedure is repeated.
q In the end, there are three kinds of points: core points, points that are within distance eps of core points (called boundary points), and noise.
q When the DBSCAN algorithm is run on a particular dataset multiple times, the clustering of the core points and the noise points is always the same. However, a boundary point might be a neighbor of core samples of more than one cluster, so its cluster membership can depend on the order in which points are visited.
q Let's apply DBSCAN on the make_blobs synthetic dataset.
q DBSCAN does not allow predictions on new test data, so we will use the fit_predict method to perform clustering and return the cluster labels in one step, as shown below:
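A minimal sketch of this step (the dataset size and random seed are assumptions for illustration):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Small synthetic dataset (size and seed are illustrative choices)
X, y = make_blobs(random_state=0, n_samples=12)

# fit_predict clusters the data and returns the labels in one step;
# the label -1 marks noise points.
dbscan = DBSCAN()
clusters = dbscan.fit_predict(X)
print("Cluster memberships:\n{}".format(clusters))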

q Here, with a sample this small and the default settings, all data points were assigned the label -1, which stands for noise.

q A small eps means many points are labeled as noise; a very large eps results in all points forming a single cluster.
q Increasing min_samples means that fewer points will be core points, and more points will be labeled as noise.

[Figure: cluster assignments for varying eps and min_samples. Noise points are shown in white, core samples as large markers, and boundary points as smaller markers.]

q While DBSCAN doesn't require setting the number of clusters explicitly, setting eps implicitly controls how many clusters will be found.
q Finding a good setting for eps is sometimes easier after scaling the data using StandardScaler or MinMaxScaler, as these scaling techniques ensure that all features have similar ranges (see the sketch below).
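A sketch of this idea (the dataset and parameter values are illustrative assumptions):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(random_state=0, n_samples=500)

# Rescale all features to zero mean and unit variance so that a
# single eps value is meaningful across features.
X_scaled = StandardScaler().fit_transform(X)

clusters = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
print("Unique labels:", set(clusters))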

Classification
Point Anomaly Detection

q scikit-learn estimators
Ø KernelDensity
Ø OneClassSVM
Ø IsolationForest
Ø LocalOutlierFactor

Source: https://coderzcolumn.com/tutorials/machine-learning/scikit-learn-sklearn-anomaly-detection-outliers-detection

Make_blobs

• The Blobs dataset has 3 clusters, with 500 samples and 2 features per sample; the examples that follow use it (see the sketch below).
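A sketch of generating such a dataset (the random seed is an assumption):

from sklearn.datasets import make_blobs

# 500 samples, 2 features, 3 cluster centers, as described above.
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=1)
print(X.shape)   # (500, 2)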

KernelDensity

KernelDensity estimates the density of the samples, which can then be used to flag outliers (samples in low-density regions).

Fitting Model to Data
Fit the KernelDensity estimator (KDE) to the data, as sketched below.
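A minimal sketch (the kernel and bandwidth values are assumptions; bandwidth is a tuning parameter):

from sklearn.neighbors import KernelDensity

# Gaussian kernel; bandwidth controls the smoothness of the estimate.
kde = KernelDensity(kernel='gaussian', bandwidth=0.5)
kde.fit(X)   # X is the make_blobs data from above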


Calculate Log Density Evaluations for Each Sample

q The KernelDensity estimator has a method named score_samples() which accepts a dataset and returns the log density evaluation for each sample.
q We'll treat 95% of the samples as valid data and 5% as outliers based on the output of score_samples(), as sketched below.
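A sketch of this step; mquantiles from SciPy is one way to compute the 5% cutoff (the variable names kde_X and tau_kde follow the next slide):

from scipy.stats.mstats import mquantiles

# Log density evaluation for every sample; low values mean
# the sample lies in a low-density region.
kde_X = kde.score_samples(X)

# tau_kde: the 5% quantile of the log densities, used as the cutoff
# between outliers (bottom 5%) and valid samples (top 95%).
tau_kde = mquantiles(kde_X, 0.05)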

Dividing Dataset into Valid Samples and Outliers

All the values in the kde_X array which are less than tau_kde will be outliers, and values greater than or equal to it will be qualified as valid samples.


Filter the data to divide it into outliers and valid samples:
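A sketch using boolean masks over the threshold computed above:

# Samples whose log density falls below the cutoff are outliers;
# the rest are kept as valid samples.
outliers = X[kde_X < tau_kde]
valid = X[kde_X >= tau_kde]
print(outliers.shape, valid.shape)   # roughly 5% / 95% of the samples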

Plot Outliers with Valid Samples for Comparison
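A minimal matplotlib sketch (colors and marker sizes are arbitrary choices):

import matplotlib.pyplot as plt

plt.scatter(valid[:, 0], valid[:, 1], s=20, label='valid samples')
plt.scatter(outliers[:, 0], outliers[:, 1], s=40, c='red', marker='x', label='outliers')
plt.legend()
plt.title('KernelDensity: outliers vs valid samples')
plt.show()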


OneClassSVM

The OneClassSVM estimator learns a boundary around the training samples and is used behind the scenes to decide whether a sample is an outlier or not.

rbf: Radial basis function kernel
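A sketch of fitting the estimator; the nu value is an assumption chosen to target roughly the same 5% outlier fraction as above:

from sklearn.svm import OneClassSVM

# nu bounds the fraction of training samples treated as outliers;
# 0.05 targets roughly a 5% outlier rate (an illustrative choice).
svm = OneClassSVM(kernel='rbf', nu=0.05)
svm.fit(X)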

Predict Sample Class (Outlier vs Normal)

q OneClassSVM provides a predict() method which accepts samples and returns an array consisting of the values 1 or -1.
q Here 1 represents a valid sample and -1 represents an outlier.
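A sketch of using predict() to split the data:

preds = svm.predict(X)            # array of 1 (valid) and -1 (outlier)
outliers_svm = X[preds == -1]
valid_svm = X[preds == 1]
print(outliers_svm.shape, valid_svm.shape)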


Plot Outliers with Valid Samples for Comparison
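The comparison plot can reuse the same pattern as before, now with the SVM masks:

import matplotlib.pyplot as plt

plt.scatter(valid_svm[:, 0], valid_svm[:, 1], s=20, label='valid samples')
plt.scatter(outliers_svm[:, 0], outliers_svm[:, 1], s=40, c='red', marker='x', label='outliers')
plt.legend()
plt.title('OneClassSVM: outliers vs valid samples')
plt.show()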

