
HITEC UNIVERSITY

Heavy Industries Taxila Education City

DEPARTMENT OF COMPUTER SCIENCE

Introduction to Data Mining


“MAM FAIZA JEHANGIR”

Assignment 05

Name:
Aliza Wajid

Roll No:
20-cs-112

TABLE OF CONTENTS
QUESTION NO # 1:
1. Proximity based outlier detection:
1.1 Cluster-based:
1.2 Distance-based:
1.3 Density-based:
The KNN approach:
Numerical Example:
QUESTION NO # 02:
1. Density Based Approach:
Mathematical Representation:
2. Distance Based Approach:
Mathematical Representation:
3. Grid Based Approach:
Mathematical Implementation:
Reachability Distance:
Local Outlier Factor (LOF):
4. Deviation Based Approach:
Mathematical Implementation:

QUESTION NO # 1:

1. Proximity based outlier detection:


Proximity-based techniques define a data point as an outlier when its locality (or proximity) is
sparsely populated. The proximity of a data point may be defined in a variety of ways, which are
subtly different from one another but are similar enough to merit unified treatment within a single
chapter. The most common ways of defining proximity for outlier analysis are as follows:

1.1 Cluster-based: The non-membership of a data point in any of the clusters, its distance from
other clusters, the size of the closest cluster, or a combination of these factors is used to quantify
the outlier score. The clustering problem has a complementary relationship to the outlier detection
problem: each point either belongs to a cluster or is considered an outlier.
1.2 Distance-based: The distance of a data point to its k-nearest neighbor (or other variant) is
used in order to define proximity. Data points with large k-nearest neighbor distances are defined
as outliers. Distance-based algorithms typically perform the analysis at a much more detailed
granularity than the other two methods. On the other hand, this greater granularity often comes at
a significant computational cost.

1.3 Density-based: The number of other points within a specified local region (grid region or
distance-based region) of a data point, is used in order to define local density. These local density
values may be converted into outlier scores. Other kernel-based methods or statistical methods for
density estimation may also be used. The major difference between clustering and density-based
methods is that clustering methods partition the data points, whereas density-based methods
partition the data space.

A simple definition of the outlier score may be constructed by using the distances of data points to
cluster centroids. Specifically, the distance of a data point to its closest cluster centroid may be
used as a proxy for the outlier score of a data point. Since clusters may be of different shapes and
orientations, an excellent distance measure to use is the Mahalanobis distance, which scales the
distance values by local cluster variances along the directions of correlation. Consider a data set
containing k clusters. Assume that the rth cluster in d-dimensional space has a corresponding d-
dimensional row vector μr of attribute-wise means, and a d × d co-variance matrix Σr. The (i, j)th
entry of this matrix is the local covariance between the dimensions i and j in that cluster. Then, the
squared Mahalanobis distance MB(X, μr, Σr)² between a data point X (expressed as a row vector)
and the cluster distribution with centroid μr and covariance matrix Σr is defined as follows:

MB(X, μr, Σr)² = (X − μr) Σr⁻¹ (X − μr)ᵀ

After the data points have been scored with the local Mahalanobis distance, any form of extreme-
value analysis can be applied to these scores to convert them to binary labels. One can also view
the Mahalanobis distance as an adjusted Euclidean distance between a point and the cluster
centroid after some transformation and scaling. Specifically, the point and the centroid are
transformed into an axis system defined by the principal component directions (of the cluster
points). Subsequently, the squared distance between the candidate outlier point and cluster centroid
is computed along each of the new axes defined by these principal components, and then divided
by the variance of the cluster points along that component.
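The scoring scheme described above can be sketched in Python. This is an illustrative sketch, not code from the assignment: the function name is made up, and the cluster centroids and covariance matrices are assumed to come from a prior clustering step such as k-means.

```python
import numpy as np

def mahalanobis_outlier_scores(X, centroids, covariances):
    """Score each point by its squared Mahalanobis distance to the
    closest cluster centroid, MB(X, mu_r, Sigma_r)^2."""
    inv_covs = [np.linalg.inv(S) for S in covariances]
    scores = []
    for x in X:
        dists = []
        for mu, S_inv in zip(centroids, inv_covs):
            diff = x - mu
            dists.append(diff @ S_inv @ diff)  # squared Mahalanobis distance
        scores.append(min(dists))              # distance to the closest cluster
    return np.array(scores)

# Two tight clusters plus one far-away point appended at the end
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2)),
               [[10.0, 10.0]]])
centroids = [X[:50].mean(axis=0), X[50:100].mean(axis=0)]
covariances = [np.cov(X[:50].T), np.cov(X[50:100].T)]
scores = mahalanobis_outlier_scores(X, centroids, covariances)
print(scores.argmax())  # the appended point receives the largest score
```

Extreme-value analysis (for example, a percentile cutoff) can then be applied to `scores` to convert them into binary labels, as the text describes.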
The KNN approach:
Numerical Example:
Let's consider a 2-dimensional dataset.

• Let X represent the dataset, where each row i corresponds to a data point with two features, Xi = [xi1, xi2].
• The introduction of an outlier at index 85 can be expressed as:
  Xoutlier = [5, 5]
• The scatter plot visualization shows the data points Xi and highlights the outlier Xoutlier in red.

Let X still represent the dataset. The k-NN algorithm calculates the Euclidean distance (d) between
data points.

d(Xi, Xj) = √((xi1 − xj1)² + (xi2 − xj2)²)
For each data point Xi , the distances to its k nearest neighbors are computed.

distances𝑖 = [𝑑(𝑋𝑖 , 𝑋1 ), 𝑑(𝑋𝑖 , 𝑋2 ), … , 𝑑(𝑋𝑖 , 𝑋𝑘 )]

Outliers are then identified based on a distance threshold, set here as the 95th percentile of the
maximum distances.

Outliers = { i ∣ max(distancesi) > outlier_threshold }
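The steps above can be sketched with NumPy. This is a minimal version under stated assumptions: the function name is illustrative, and the 95th-percentile threshold mirrors the description in the text rather than any fixed library default.

```python
import numpy as np

def knn_outliers(X, k=5, percentile=95):
    """Flag points whose distance to their k-th nearest neighbor
    exceeds the given percentile of all such distances."""
    # Pairwise Euclidean distances between all points
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)              # exclude self-distance
    knn_dist = np.sort(D, axis=1)[:, k - 1]  # distance to the k-th neighbor
    threshold = np.percentile(knn_dist, percentile)
    return np.where(knn_dist > threshold)[0], knn_dist

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (100, 2))
X = np.vstack([X, [[5.0, 5.0]]])  # planted outlier, analogous to Xoutlier above
outliers, knn_dist = knn_outliers(X, k=5)
print(outliers)  # index 100 is among the flagged points
```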

QUESTION NO # 02:

1. Density Based Approach:


I have applied the DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
algorithm to a synthetic dataset. DBSCAN is a density-based clustering algorithm that groups
together data points that are close to each other and separates regions of lower point density.

It is a robust algorithm widely employed in unsupervised machine learning for cluster analysis. Its
distinctive feature lies in its ability to identify clusters based on the density of data points, making
it particularly adept at handling datasets with irregularly shaped clusters and varying point
densities. In my application, I generated a synthetic dataset with a deliberate design, featuring two
well-defined clusters and the inclusion of two outlier points. This synthetic dataset serves as an
ideal testing ground to assess DBSCAN's performance, as it exhibits characteristics that challenge
traditional clustering algorithms. The clusters are intended to simulate regions of high point
density, while the outliers introduce noise and test the algorithm's capability to distinguish less
dense areas.

Mathematical Representation:
The DBSCAN algorithm defines clusters as dense regions of points separated by areas of lower
point density. The key parameters are:

• ε (eps): The maximum distance between two samples for one to be considered as being in
the neighborhood of the other.

• min_samples: The number of samples in a neighborhood for a point to be considered a
core point.
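A minimal sketch of this setup using scikit-learn's DBSCAN follows. The dataset and parameter values here are illustrative stand-ins for the synthetic data described above, not the exact values used in the assignment.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense clusters plus two isolated outliers, mirroring the setup above
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.3, (50, 2)),
               [[8.0, 8.0], [-4.0, 6.0]]])

# eps and min_samples as described above; DBSCAN labels noise points as -1
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(labels[-2:])  # the two appended outliers are labeled as noise
```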
2. Distance Based Approach:
For a distance-based approach, we can use the Isolation Forest algorithm. The Isolation Forest is
an algorithm that identifies anomalies by isolating them in the feature space. It does this by
randomly selecting a feature and then randomly selecting a split value between the maximum and
minimum values of that feature. Anomalies are more likely to be isolated in fewer splits than
normal points.
In this visualization, the color of each point represents its anomaly score. Darker colors indicate
higher anomaly scores, implying that those points are more likely to be outliers. The colorbar on
the side provides a reference for the anomaly scores.

Mathematical Representation:
The Isolation Forest algorithm works by recursively partitioning the data based on randomly
selected features until the outliers are isolated. The key parameter is:

• contamination: The proportion of outliers in the dataset.

The algorithm labels data points as:

• -1: Outliers

• 1: Inliers (normal points)
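This labeling can be sketched with scikit-learn's IsolationForest. The dataset and the contamination value are illustrative assumptions, not the assignment's actual data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               [[8.0, 8.0], [-8.0, 8.0]]])  # two planted anomalies

# contamination: expected proportion of outliers in the dataset
clf = IsolationForest(contamination=0.02, random_state=0)
labels = clf.fit_predict(X)  # -1 for outliers, 1 for inliers
print(labels[-2:])           # the planted anomalies are labeled -1
```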

3. Grid Based Approach:


For a grid-based approach, we can use the Local Outlier Factor (LOF) algorithm, which is a
popular method for detecting local outliers based on the density of data points within their local
neighborhoods. LOF computes a score for each data point, and higher scores indicate higher
likelihood of being an outlier.
In the visualization above, we explore the decision boundary of the Local Outlier Factor (LOF)
algorithm. The scatter plot represents our synthetic dataset, where points are color-coded based on
their LOF scores. Darker colors indicate higher LOF scores, suggesting a higher likelihood of
being an outlier. The red dashed contour lines represent the decision boundary created by the LOF
algorithm.

Mathematical Implementation:
• Xi : The data point for which we are calculating the LOF.
• Nk (Xi ): The set of k nearest neighbors of Xi .
• d(Xi ,Xj ): The distance between data points Xi and Xj .
• RDk (Xi ): The reachability distance of Xi with respect to Nk (Xi ).
• LOFk (Xi ): The Local Outlier Factor of Xi with respect to Nk (Xi ).

Reachability Distance:

RDk (Xi ) = max{ d(Xi , Xj ), core-distk (Xj ) }


Local Outlier Factor (LOF):

𝐿𝑂𝐹𝑘 (𝑋𝑖 ) = (1/𝑘) ∗ Σ𝑋𝑗∈𝑁𝑘 (𝑋𝑖) (𝑅𝐷𝑘 (𝑋𝑗 )/𝑅𝐷𝑘 (𝑋𝑖 ))
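Scikit-learn's LocalOutlierFactor carries out this computation. A minimal sketch with an illustrative dataset: after fitting, negative_outlier_factor_ holds the negated LOF scores, so negating it again recovers scores where higher means more outlying, as described above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               [[5.0, 5.0]]])          # a locally isolated point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)            # -1 for outliers, 1 for inliers
scores = -lof.negative_outlier_factor_ # higher score = more outlying
print(labels[-1], scores.argmax())     # the isolated point has the top LOF score
```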

4. Deviation Based Approach:


A deviation-based approach can be implemented using the Z-score as a measure of how far each
data point is from the mean in terms of standard deviations. Points with high Z-scores (indicating
significant deviations from the mean) can be considered potential outliers.

Mathematical Implementation:
• Z-scores are calculated for each feature in the dataset.
• The overall Z-score for each data point is computed as the Euclidean norm of its
individual feature Z-scores.
• Points with Z-scores above a certain threshold (commonly 2 or 3) are considered outliers
and are highlighted in red in the second plot.
• The original data points are shown in blue, while the identified outliers are marked in red.
Zi = (Xi − mean) / (std deviation)

Znorm = √(Zi1² + Zi2² + ⋯ + Zin²)
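The bulleted steps above can be sketched directly in NumPy (the function name and threshold value of 3 are illustrative choices, matching the "commonly 2 or 3" guidance in the text):

```python
import numpy as np

def zscore_outliers(X, threshold=3.0):
    """Flag points whose overall Z-score (the Euclidean norm of the
    per-feature Z-scores) exceeds the threshold."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # per-feature Z-scores
    z_norm = np.linalg.norm(Z, axis=1)        # overall Z-score per point
    return np.where(z_norm > threshold)[0], z_norm

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               [[10.0, 10.0]]])  # a point far from the mean
outliers, z_norm = zscore_outliers(X, threshold=3.0)
print(z_norm.argmax())  # the planted point has the largest overall Z-score
```

In a plot, the points in `outliers` would be the ones marked in red, as described above.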
