0% found this document useful (0 votes)

37 views11 pages

20 Cs 112

Uploaded by

Aliza Wajid

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

37 views11 pages

20 Cs 112

Uploaded by

Aliza Wajid

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 11

HITEC UNIVERSITY

Heavy Industries Taxila Education City

Introduction to Data Mining

“MAM FAIZA JEHANGIR”

Assignment 04

Name: Aliza wajid

Roll No: 20-cs-112

QUESTION NO # 1:

1. Proximity based outlier detection:

Proximity-based techniques define a data point as an outlier when its locality (or proximity) is
sparsely populated. The proximity of a data point may be defined in a variety of ways, which are
subtly different from one another but are similar enough to merit unified treatment within a
single chapter. The most common ways of defining proximity for outlier analysis are as follows:

1.1 Cluster-based: The non-membership of a data point in any of the clusters, its
distance from other clusters, the size of the closest cluster, or a combination of these
factors are used to quantify the outlier score. The clustering problem has a
complementary relationship to the outlier detection problem in which points either belong
to clusters or they should be considered outliers.

1.2 Distance-based: The distance of a data point to its k-nearest neighbor (or other
variant) is used in order to define proximity. Data points with large k-nearest neighbor
distances are defined as outliers. Distance-based algorithms typically perform the
analysis at a much more detailed granularity than the other two methods. On the other
hand, this greater granularity often comes at a significant computational cost.

1.3 Density-based: The number of other points within a specified local region (grid
region or distance-based region) of a data point, is used in order to define local density.
These local density values may be converted into outlier scores. Other kernel-based
methods or statistical methods for density estimation may also be used. The major
difference between clustering and density-based methods is that clustering methods
partition the data points, whereas density-based methods partition the data space.

A simple definition of the outlier score may be constructed by using the distances of data points
to cluster centroids. Specifically, the distance of a data point to its closest cluster centroid may be
used as a proxy for the outlier score of a data point. Since clusters may be of different shapes and
orientations, an excellent distance measure to use is the Mahalanobis distance, which scales the
distance values by local cluster variances along the directions of correlation. Consider a data set
containing k clusters. Assume that the rth cluster in d-dimensional space has a corresponding
ddimensional row vector μr of attribute-wise means, and a d × d co-variance matrix Σr. The (i,
j)th entry of this matrix is the local covariance between the dimensions i and j in that cluster.
Then, the squared Mahalanobis distance MB(X, μr, Σr)2 between a data point X (expressed as
row vector) and the cluster distribution with centroid μr and covariance matrix Σr is defined as
follows:

𝑀𝐵(𝑋, 𝜇𝑟, Σ𝑟) = (𝑋 − 𝜇𝑟)Σ−1(𝑋 − 𝜇𝑟)ᵀ

After the data points have been scored with the local Mahalanobis distance, any form of
extremevalue analysis can be applied to these scores to convert them to binary labels. One can
also view the Mahalanobis distance as an adjusted Euclidean distance between a point and the
cluster centroid after some transformation and scaling. Specifically, the point and the centroid are
transformed into an axis system defined by the principal component directions (of the cluster
points). Subsequently, the squared distance between the candidate outlier point and cluster
centroid is computed along each of the new axes defined by these principal components, and
then divided by the variance of the cluster points along that component.

The KNN approach:

Numerical Example:
Let's consider a 2-dimensional dataset

• Let X represent the dataset, where each row i corresponds to a data point with two
features
(= [1,2]𝑋𝑖 = [𝑥𝑖1, 𝑥𝑖2]).
• The introduction of an outlier at index 8585 can be expressed as:
• outlier=[5,5] Xoutlier =[5,5]
• The scatter plot visualization simply shows the data points (Xi ) and highlights the outlier
(outlier Xoutlier ) in red.
Let X still represent the dataset. The k-NN algorithm calculates the Euclidean distance (d)
between data points.

2 2
𝑑(𝑋𝑖, 𝑋𝑗) = (𝑥𝑖1 − 𝑥𝑗1) + (𝑥𝑖2 − 𝑥𝑗2)

For each data point Xi , the distances to its k nearest neighbors are computed.

distances𝑖 = [𝑑(𝑋𝑖, 𝑋1), 𝑑(𝑋𝑖, 𝑋2), … , 𝑑(𝑋𝑖, 𝑋𝑘)]

Outliers are then identified based on a distance threshold, set here as the 95th percentile of the
maximum distances.

𝑂𝑢𝑡𝑙𝑖𝑒𝑟𝑠 = 𝑖 ∣ 𝑑𝑖𝑠 tan 𝑐 𝑒𝑠𝑖, 𝑚𝑎𝑥 > 𝑜𝑢𝑡𝑙𝑖𝑒𝑟𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑

QUESTION NO # 02:

1. Density Based Approach:

I have applied the DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
algorithm, to a synthetic dataset. DBSCAN is a density-based clustering algorithm that groups
together data points that are close to each other and separates regions of lower point density.

It is a robust algorithm widely employed in unsupervised machine learning for cluster analysis.
Its distinctive feature lies in its ability to identify clusters based on the density of data points,
making it particularly adept at handling datasets with irregularly shaped clusters and varying
point densities. In my application, I generated a synthetic dataset with a deliberate design,
featuring two well-defined clusters and the inclusion of two outlier points. This synthetic dataset
serves as an ideal testing ground to assess DBSCAN's performance, as it exhibits characteristics
that challenge traditional clustering algorithms. The clusters are intended to simulate regions of
high point density, while the outliers introduce noise and test the algorithm's capability to
distinguish less dense areas.
Mathematical Representation:
The DBSCAN algorithm defines clusters as dense regions of points separated by areas of lower
point density. The key parameters are:

• ε (eps): The maximum distance between two samples for one to be considered as being in
the neighborhood of the other.

• min_samples: The number of samples in a neighborhood for a point to be considered

as a core point.
2. Distance Based Approach:
For a distance-based approach, we can use the isolation forest algorithm. The Isolation Forest is
an algorithm that identifies anomalies by isolating them in the feature space. It does this by
randomly selecting a feature and then randomly selecting a split value between the maximum
and minimum values of that feature. Anomalies are more likely to be isolated in fewer splits than
normal points.
In this visualization, the color of each point represents its anomaly score. Darker colors indicate
higher anomaly scores, implying that those points are more likely to be outliers. The colorbar on
the side provides a reference for the anomaly scores.

Mathematical Representation:
The Isolation Forest algorithm works by recursively partitioning the data based on randomly
selected features until the outliers are isolated. The key parameter is:

• contamination: The proportion of outliers in the dataset.

The algorithm labels data points as:

• -1: Outliers
• 1: Inliers (normal points)

3. Grid Based Approach:

For a grid-based approach, we can use the Local Outlier Factor (LOF) algorithm, which is a
popular method for detecting local outliers based on the density of data points within their local
neighborhoods. LOF computes a score for each data point, and higher scores indicate higher
likelihood of being an outlier.
In the visualization above, we explore the decision boundary of the Local Outlier Factor (LOF)
algorithm. The scatter plot represents our synthetic dataset, where points are color-coded based
on their LOF scores. Darker colors indicate higher LOF scores, suggesting a higher likelihood of
being an outlier. The red dashed contour lines represent the decision boundary created by the
LOF algorithm.

Mathematical Implementation:
• Xi : The data point for which we are calculating the LOF.
• Nk (Xi ): The set of k nearest neighbors of Xi .
• d(Xi ,Xj ): The distance between data points Xi and Xj .
• RDk (Xi ): The reachability distance of Xi with respect to Nk (Xi ).
• LOFk (Xi ): The Local Outlier Factor of i with respect to Nk (Xi ).

Reachability Distance:

𝑅𝐷𝑘(𝑋𝑖) = max 𝑑 𝑖𝑠𝑡(𝑋𝑖, 𝑋𝑗), 𝑐𝑜𝑟𝑒 − 𝑑𝑖𝑠𝑡𝑘(𝑋𝑗)

Local Outlier Factor (LOF):

𝐿𝑂𝐹𝑘(𝑋𝑖) = (1/𝑘) ∗ Σ𝑋𝑗∈𝑁𝑘(𝑋𝑖) (𝑅𝐷𝑘(𝑋𝑗)/𝑅𝐷𝑘(𝑋𝑖))

4. Deviation Based Approach:

A deviation-based approach can be implemented using the Z-score as a measure of how far each
data point is from the mean in terms of standard deviations. Points with high Z-scores (indicating
significant deviations from the mean) can be considered potential outliers.

Mathematical Implementation:
• Z-scores are calculated for each feature in the dataset.
• The overall Z-score for each data point is computed as the Euclidean norm of its
individual feature Z-scores.
• Points with Z-scores above a certain threshold (commonly 2 or 3) are considered outliers
and are highlighted in red in the second plot.
• The original data points are shown in blue, while the identified outliers are marked in red.
𝑋𝑖 − mean
𝑍𝑖 =
std deviation

𝑍norm = √𝑍𝑖21 + 𝑍𝑖22 + ⋯ + 𝑍𝑖𝑛2

Outlier Detection & Analysis 03
No ratings yet
Outlier Detection & Analysis 03
32 pages
Lecture-8 Outlier Detection
No ratings yet
Lecture-8 Outlier Detection
72 pages
Outlier Detection Using Robust Mahalanobis Distance: Aparna Bhide, M. PALB 7187
No ratings yet
Outlier Detection Using Robust Mahalanobis Distance: Aparna Bhide, M. PALB 7187
46 pages
Anomoly Detection - Ensemble - Classifiers
No ratings yet
Anomoly Detection - Ensemble - Classifiers
68 pages
Discovering Cluster-Based Local Outliers: Zengyou He, Xiaofei Xu, Shengchun Deng
No ratings yet
Discovering Cluster-Based Local Outliers: Zengyou He, Xiaofei Xu, Shengchun Deng
10 pages
ADII11 Metode Deteksi Outlier
No ratings yet
ADII11 Metode Deteksi Outlier
50 pages
Applied Sciences: Outlier Detection Based Feature Selection Exploiting Bio-Inspired Optimization Algorithms
No ratings yet
Applied Sciences: Outlier Detection Based Feature Selection Exploiting Bio-Inspired Optimization Algorithms
28 pages
Duan
No ratings yet
Duan
18 pages
Datawarehousing and Data Mining
No ratings yet
Datawarehousing and Data Mining
119 pages
Subspace Histograms For Outlier Detection in Linear Time: Saket Sathe Charu C. Aggarwal
No ratings yet
Subspace Histograms For Outlier Detection in Linear Time: Saket Sathe Charu C. Aggarwal
25 pages
Data Mining Chapter 6 Anomaly & Fraud Detection
No ratings yet
Data Mining Chapter 6 Anomaly & Fraud Detection
41 pages
Anomaly-Fraud-Detection
No ratings yet
Anomaly-Fraud-Detection
50 pages
BITS-WASE-DATA MINING-Session-07-2015 PDF
No ratings yet
BITS-WASE-DATA MINING-Session-07-2015 PDF
25 pages
Data Mining Slide Contents
No ratings yet
Data Mining Slide Contents
22 pages
Outlier Analysis in Data Mining
No ratings yet
Outlier Analysis in Data Mining
5 pages
Lecture 8 Data Prepration Techniques
No ratings yet
Lecture 8 Data Prepration Techniques
4 pages
8ad59658 1701235711480
No ratings yet
8ad59658 1701235711480
36 pages
17 dm2 Anomaly Detection 2022 23
No ratings yet
17 dm2 Anomaly Detection 2022 23
113 pages
Outlier Detection
No ratings yet
Outlier Detection
17 pages
Anomaly or Outlier Detection
No ratings yet
Anomaly or Outlier Detection
14 pages
Be A 65 Ads Exp 7
No ratings yet
Be A 65 Ads Exp 7
7 pages
Distance-Based Outlier Detection: Consolidation and Renewed Bearing
No ratings yet
Distance-Based Outlier Detection: Consolidation and Renewed Bearing
12 pages
Outlier Detection
No ratings yet
Outlier Detection
22 pages
20 Cs 112
No ratings yet
20 Cs 112
11 pages
5 Anomaly Detection Annotated Section 100 300
No ratings yet
5 Anomaly Detection Annotated Section 100 300
48 pages
Anomaly Detection
No ratings yet
Anomaly Detection
22 pages
Distance Based Outlier Detection
No ratings yet
Distance Based Outlier Detection
40 pages
A Stabilized Approach of Outlier Detection On Time Series Data
No ratings yet
A Stabilized Approach of Outlier Detection On Time Series Data
10 pages
Anomaly Detection
No ratings yet
Anomaly Detection
15 pages
Data Minning Unit 4-1
No ratings yet
Data Minning Unit 4-1
10 pages
Dmbi Iat-2 Imp Ques Soln
No ratings yet
Dmbi Iat-2 Imp Ques Soln
43 pages
Outlier
No ratings yet
Outlier
2 pages
ADS Ut2
No ratings yet
ADS Ut2
23 pages
Unit 5-2
No ratings yet
Unit 5-2
41 pages
Outlier Detection
No ratings yet
Outlier Detection
20 pages
Anomaly Detection Algorithms For RapidMiner
No ratings yet
Anomaly Detection Algorithms For RapidMiner
12 pages
Algorithms For Spatial Outlier Detection: Chang-Tien Lu Dechang Chen Yufeng Kou
No ratings yet
Algorithms For Spatial Outlier Detection: Chang-Tien Lu Dechang Chen Yufeng Kou
4 pages
Feature Engineering
No ratings yet
Feature Engineering
66 pages
Outlier Detection
No ratings yet
Outlier Detection
36 pages
Anomalies Detection Algorithms Report
No ratings yet
Anomalies Detection Algorithms Report
3 pages
Outlierfin
No ratings yet
Outlierfin
19 pages
Outliers
No ratings yet
Outliers
3 pages
Adsl Exp 8 2024
No ratings yet
Adsl Exp 8 2024
10 pages
Experiment 6
No ratings yet
Experiment 6
7 pages
参考文献3
No ratings yet
参考文献3
9 pages
Outlier Detection: Univariate and Multivariate
No ratings yet
Outlier Detection: Univariate and Multivariate
13 pages
6735367a5d6e24a5f185bf9c 99512104437
No ratings yet
6735367a5d6e24a5f185bf9c 99512104437
2 pages
Make 05 00042 v3
No ratings yet
Make 05 00042 v3
21 pages
K - Means Clustering With Outlier Removal
No ratings yet
K - Means Clustering With Outlier Removal
7 pages
Ads 7
No ratings yet
Ads 7
6 pages
ISAT 600 Progress Report 3
No ratings yet
ISAT 600 Progress Report 3
4 pages
6anomaly Fraud Detection
No ratings yet
6anomaly Fraud Detection
5 pages
Ads Lab6
No ratings yet
Ads Lab6
4 pages
Anomaly Detection RapidMiner
No ratings yet
Anomaly Detection RapidMiner
12 pages
Minor Project Presentation
No ratings yet
Minor Project Presentation
19 pages
Unit 4-2
No ratings yet
Unit 4-2
7 pages
Bhowate, 2014, Outlier Detection Method For Data Set Based On Clustering and EDA Technique
No ratings yet
Bhowate, 2014, Outlier Detection Method For Data Set Based On Clustering and EDA Technique
3 pages
Data Mining-Outlier Analysis
No ratings yet
Data Mining-Outlier Analysis
6 pages
Wafers: Basic Wafer Types
No ratings yet
Wafers: Basic Wafer Types
7 pages
Gallup Test
No ratings yet
Gallup Test
25 pages
Explore 5
No ratings yet
Explore 5
233 pages
Chapter 03
100% (2)
Chapter 03
16 pages
Formulation of Objective
No ratings yet
Formulation of Objective
16 pages
Big Data in Healthcare Systems and Research
No ratings yet
Big Data in Healthcare Systems and Research
4 pages
Term 2 Basic 3 Week 3 Lesson Plan
No ratings yet
Term 2 Basic 3 Week 3 Lesson Plan
20 pages
Marks Oriented Notes For IGCSE O Level Physics v37
No ratings yet
Marks Oriented Notes For IGCSE O Level Physics v37
76 pages
API 653 Notes
No ratings yet
API 653 Notes
3 pages
CPAP-HFNC - Medin - NC3 Ops - Manual Book
No ratings yet
CPAP-HFNC - Medin - NC3 Ops - Manual Book
59 pages
Module of Applied Entomology Only Agricultural Part
No ratings yet
Module of Applied Entomology Only Agricultural Part
53 pages
IOSA Checklist: ISM Edition 9 - Effective September 1, 2015
No ratings yet
IOSA Checklist: ISM Edition 9 - Effective September 1, 2015
253 pages
Agam
No ratings yet
Agam
12 pages
Gs33k50e10 50e
No ratings yet
Gs33k50e10 50e
5 pages
Biotechnology and It's Application by Hare Krishna Deepak
No ratings yet
Biotechnology and It's Application by Hare Krishna Deepak
42 pages
O Level Forces
No ratings yet
O Level Forces
16 pages
Contest1 Tasks
No ratings yet
Contest1 Tasks
9 pages
Module Body Fluids For Board Exam
No ratings yet
Module Body Fluids For Board Exam
8 pages
Evidence Claim Assessment Worksheet
No ratings yet
Evidence Claim Assessment Worksheet
3 pages
Huawei S3700 Switch Datasheet (22-Oct-2012)
No ratings yet
Huawei S3700 Switch Datasheet (22-Oct-2012)
12 pages
manual-KVL-c304i (D1) Öá W0208
No ratings yet
manual-KVL-c304i (D1) Öá W0208
8 pages
Tan ChineseLiteratureEssays 2016
No ratings yet
Tan ChineseLiteratureEssays 2016
5 pages
Short Story
No ratings yet
Short Story
2 pages
Job Network Transfer
No ratings yet
Job Network Transfer
4 pages
Lesson Plan (Thai Son)
No ratings yet
Lesson Plan (Thai Son)
8 pages
Java Lab Cycle Programs 2022
No ratings yet
Java Lab Cycle Programs 2022
2 pages
TOC of My Recently Published Book: Air Bearings: Theory, Design and Applications Wiley
No ratings yet
TOC of My Recently Published Book: Air Bearings: Theory, Design and Applications Wiley
11 pages
Pengaruh Feed Rate Terhadap Sifat Mekanik Pada Pengelasan Friction Stir Welding Alumunium 6110
No ratings yet
Pengaruh Feed Rate Terhadap Sifat Mekanik Pada Pengelasan Friction Stir Welding Alumunium 6110
10 pages
Visual Rhetoric PresentationWorksheet
No ratings yet
Visual Rhetoric PresentationWorksheet
2 pages
Fern Complex: Operational Summary For Vegetation Management
No ratings yet
Fern Complex: Operational Summary For Vegetation Management
8 pages
K Nearest Neighbor Algorithm: Fundamentals and Applications
From Everand
K Nearest Neighbor Algorithm: Fundamentals and Applications
Fouad Sabry
No ratings yet
Competitive Learning: Fundamentals and Applications for Reinforcement Learning through Competition
From Everand
Competitive Learning: Fundamentals and Applications for Reinforcement Learning through Competition
Fouad Sabry
No ratings yet

20 Cs 112

Uploaded by

20 Cs 112

Uploaded by

HITEC UNIVERSITY

Heavy Industries Taxila Education City

Introduction to Data Mining

Name: Aliza wajid

Roll No: 20-cs-112

1. Proximity based outlier detection:

𝑀𝐵(𝑋, 𝜇𝑟, Σ𝑟) = (𝑋 − 𝜇𝑟)Σ−1(𝑋 − 𝜇𝑟)ᵀ

The KNN approach:

distances𝑖 = [𝑑(𝑋𝑖, 𝑋1), 𝑑(𝑋𝑖, 𝑋2), … , 𝑑(𝑋𝑖, 𝑋𝑘)]

𝑂𝑢𝑡𝑙𝑖𝑒𝑟𝑠 = 𝑖 ∣ 𝑑𝑖𝑠 tan 𝑐 𝑒𝑠𝑖, 𝑚𝑎𝑥 > 𝑜𝑢𝑡𝑙𝑖𝑒𝑟𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑

1. Density Based Approach:

• min_samples: The number of samples in a neighborhood for a point to be considered

• contamination: The proportion of outliers in the dataset.

The algorithm labels data points as:

3. Grid Based Approach:

𝑅𝐷𝑘(𝑋𝑖) = max 𝑑 𝑖𝑠𝑡(𝑋𝑖, 𝑋𝑗), 𝑐𝑜𝑟𝑒 − 𝑑𝑖𝑠𝑡𝑘(𝑋𝑗)

𝐿𝑂𝐹𝑘(𝑋𝑖) = (1/𝑘) ∗ Σ𝑋𝑗∈𝑁𝑘(𝑋𝑖) (𝑅𝐷𝑘(𝑋𝑗)/𝑅𝐷𝑘(𝑋𝑖))

4. Deviation Based Approach:

𝑍norm = √𝑍𝑖21 + 𝑍𝑖22 + ⋯ + 𝑍𝑖𝑛2

You might also like