20 Cs 112
Assignment 05
Name:
Aliza wajid
Roll No:
20-cs-112
TABLE OF CONTENTS
QUESTION NO # 1:
1. Proximity based outlier detection:
1.1 Cluster-based:
1.2 Distance-based:
1.3 Density-based:
The KNN approach:
Numerical Example:
QUESTION NO # 02:
1. Density Based Approach:
Mathematical Representation:
2. Distance Based Approach:
Mathematical Representation:
3. Grid Based Approach:
Mathematical Implementation:
Reachability Distance:
Local Outlier Factor (LOF):
4. Deviation Based Approach:
Mathematical Implementation:
QUESTION NO # 1:
1. Proximity based outlier detection:
1.1 Cluster-based: The non-membership of a data point in any of the clusters, its distance from
other clusters, the size of the closest cluster, or a combination of these factors is used to quantify
the outlier score. Clustering and outlier detection are complementary problems: a point either
belongs to a cluster or should be considered an outlier.
1.2 Distance-based: The distance of a data point to its k-nearest neighbor (or other variant) is
used in order to define proximity. Data points with large k-nearest neighbor distances are defined
as outliers. Distance-based algorithms typically perform the analysis at a much more detailed
granularity than the other two methods. On the other hand, this greater granularity often comes at
a significant computational cost.
1.3 Density-based: The number of other points within a specified local region (grid region or
distance-based region) of a data point is used to define local density. These local density
values may be converted into outlier scores. Other kernel-based methods or statistical methods for
density estimation may also be used. The major difference between clustering and density-based
methods is that clustering methods partition the data points, whereas density-based methods
partition the data space.
A simple definition of the outlier score may be constructed by using the distances of data points to
cluster centroids. Specifically, the distance of a data point to its closest cluster centroid may be
used as a proxy for the outlier score of a data point. Since clusters may be of different shapes and
orientations, an excellent distance measure to use is the Mahalanobis distance, which scales the
distance values by local cluster variances along the directions of correlation. Consider a data set
containing k clusters. Assume that the rth cluster in d-dimensional space has a corresponding d-
dimensional row vector μr of attribute-wise means, and a d × d co-variance matrix Σr. The (i, j)th
entry of this matrix is the local covariance between the dimensions i and j in that cluster. Then, the
squared Mahalanobis distance $MB(X, \mu_r, \Sigma_r)^2$ between a data point X (expressed as row vector)
and the cluster distribution with centroid μr and covariance matrix Σr is defined as follows:
$$MB(X, \mu_r, \Sigma_r)^2 = (X - \mu_r)\, \Sigma_r^{-1}\, (X - \mu_r)^T$$
After the data points have been scored with the local Mahalanobis distance, any form of extreme-
value analysis can be applied to these scores to convert them to binary labels. One can also view
the Mahalanobis distance as an adjusted Euclidean distance between a point and the cluster
centroid after some transformation and scaling. Specifically, the point and the centroid are
transformed into an axis system defined by the principal component directions (of the cluster
points). Subsequently, the squared distance between the candidate outlier point and cluster centroid
is computed along each of the new axes defined by these principal components, and then divided
by the variance of the cluster points along that component.
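As a rough illustration of the cluster-based score described above, the following sketch computes the squared Mahalanobis distance of a point to one cluster. It is restricted to two dimensions so the covariance matrix can be inverted in closed form. The function names, the use of the population covariance, and the four sample points are assumptions made for the example, not part of the original text.

```python
def cluster_stats_2d(points):
    """Mean vector and population covariance matrix of a 2-D cluster (assumed estimator)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))


def squared_mahalanobis_2d(x, mu, cov):
    """(x - mu) * inverse(cov) * (x - mu)^T for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c  # must be non-zero for the inverse to exist
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


# Hypothetical cluster that is stretched along the first axis.
cluster = [(-2.0, 0.0), (2.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
mu, cov = cluster_stats_2d(cluster)

# Two candidate points at the same Euclidean distance (2) from the centroid.
along_spread = squared_mahalanobis_2d((2.0, 0.0), mu, cov)
across_spread = squared_mahalanobis_2d((0.0, 2.0), mu, cov)
print(along_spread, across_spread)
```

The point lying across the direction of spread receives the larger score even though both points are equally far from the centroid in Euclidean terms, which is the scaling effect the paragraph above describes.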
The KNN approach:
Numerical Example:
Let's consider a 2-dimensional dataset.
• Let X represent the dataset, where each row i corresponds to a data point with two features
($X_i = [x_{i1}, x_{i2}]$).
• The introduction of an outlier at index 85 can be expressed as:
• $X_{\text{outlier}} = [5, 5]$
• The scatter plot visualization simply shows the data points ($X_i$) and highlights the outlier
($X_{\text{outlier}}$) in red.
Let X still represent the dataset. The k-NN algorithm calculates the Euclidean distance (d) between
data points.
$$d(X_i, X_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2}$$
For each data point Xi, the distances to its k nearest neighbors are computed, and the largest of
them (the distance to the k-th nearest neighbor) is kept as that point's score. Outliers are then
identified with a distance threshold, set here as the 95th percentile of these maximum distances.
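The procedure above can be sketched in plain Python as follows. The outlier at index 85 with value [5, 5] and the 95th-percentile threshold come from the text. The dataset size of 100, the standard normal inliers, the seed, and k = 5 are assumptions chosen only to make the example runnable.

```python
import math
import random


def kth_neighbor_distance(points, k):
    """For each point, the Euclidean distance to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def percentile(values, q):
    """Linearly interpolated q-th percentile, with 0 <= q <= 100."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])


rng = random.Random(0)  # assumed seed, for reproducibility only
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
X[85] = [5.0, 5.0]  # the injected outlier from the example

k = 5  # assumed neighborhood size
max_dists = kth_neighbor_distance(X, k)
threshold = percentile(max_dists, 95)
outliers = [i for i, d in enumerate(max_dists) if d > threshold]
print(outliers)
```

Because the threshold is a percentile of the scores themselves, roughly 5% of the points are always flagged, whether or not they are true anomalies. The injected point should be among them because its k-th neighbor is far away.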
QUESTION NO # 02:
1. Density Based Approach:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a robust algorithm widely
employed in unsupervised machine learning for cluster analysis. Its
distinctive feature lies in its ability to identify clusters based on the density of data points, making
it particularly adept at handling datasets with irregularly shaped clusters and varying point
densities. In my application, I generated a synthetic dataset with a deliberate design, featuring two
well-defined clusters and the inclusion of two outlier points. This synthetic dataset serves as an
ideal testing ground to assess DBSCAN's performance, as it exhibits characteristics that challenge
traditional clustering algorithms. The clusters are intended to simulate regions of high point
density, while the outliers introduce noise and test the algorithm's capability to distinguish less
dense areas.
Mathematical Representation:
The DBSCAN algorithm defines clusters as dense regions of points separated by areas of lower
point density. The key parameters are:
• ε (eps): The maximum distance between two samples for one to be considered as being in
the neighborhood of the other.
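A minimal sketch of the DBSCAN idea, assuming a second parameter `min_samples` (the number of points, including the point itself, that must lie within ε for a point to count as a core point). That parameter, the function names, and the sample coordinates are assumptions for illustration. The layout of two clusters plus two outliers follows the synthetic dataset described above.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


cluster_a = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.1, 0.1)]
cluster_b = [(3.0, 3.0), (3.2, 3.0), (3.0, 3.2), (3.2, 3.2), (3.1, 3.1)]
noise_points = [(1.5, -2.0), (-2.0, 3.0)]
labels = dbscan(cluster_a + cluster_b + noise_points, eps=0.5, min_samples=3)
print(labels)
```

Points labeled -1 lie in regions too sparse to belong to any cluster, so DBSCAN reports them as noise rather than forcing them into a cluster.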
2. Distance Based Approach:
Mathematical Representation:
The Isolation Forest algorithm works by recursively partitioning the data on randomly selected
features and split values until each point is isolated; outliers tend to be isolated after fewer
splits than inliers. The key parameter is:
• -1: Outliers
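The isolation idea can be illustrated with a simplified sketch that measures how many random axis-aligned splits are needed to separate a point from the rest of the data. It is not a full Isolation Forest: it draws a fresh random path per point instead of building shared trees on subsamples, and it reports the raw average depth rather than a normalized anomaly score. The dataset, seeds, number of trials, and depth limit are assumptions.

```python
import math
import random


def isolation_depth(point, data, rng, max_depth, depth=0):
    """Number of random splits applied before `point` is alone or max_depth is hit."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))
    values = [row[feature] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as `point`.
    if point[feature] < split:
        same_side = [row for row in data if row[feature] < split]
    else:
        same_side = [row for row in data if row[feature] >= split]
    return isolation_depth(point, same_side, rng, max_depth, depth + 1)


def average_depths(data, n_trials=100, seed=0):
    """Average isolation depth of every point over n_trials random partitionings."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(data)))  # assumed height limit
    return [
        sum(isolation_depth(p, data, rng, max_depth) for _ in range(n_trials)) / n_trials
        for p in data
    ]


data_rng = random.Random(1)
points = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(50)]
points.append([5.0, 5.0])  # hypothetical outlier, stored last
depths = average_depths(points)
print(depths[-1], sum(depths) / len(depths))
```

The isolated point needs noticeably fewer splits than the average point. Thresholding the average depth (or a score derived from it) is what would turn these values into the -1 outlier label listed above.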
3. Grid Based Approach:
Mathematical Implementation:
• Xi : The data point for which we are calculating the LOF.
• Nk (Xi ): The set of k nearest neighbors of Xi .
• d(Xi ,Xj ): The distance between data points Xi and Xj .
• RDk (Xi ): The reachability distance of Xi with respect to Nk (Xi ).
• LOFk (Xi ): The Local Outlier Factor of Xi with respect to Nk (Xi ).
Reachability Distance:
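For reference, the symbols listed above are commonly combined as follows in the standard LOF formulation. The k-distance $d_k$ and the local reachability density $\mathrm{lrd}_k$ are helper symbols introduced here, and writing the reachability distance for a pair of points is an assumption about the intended definition.

```latex
\begin{align*}
d_k(X_j) &= \text{distance from } X_j \text{ to its } k\text{-th nearest neighbor} \\
RD_k(X_i, X_j) &= \max\{\, d_k(X_j),\; d(X_i, X_j) \,\} \\
\mathrm{lrd}_k(X_i) &= \left( \frac{1}{|N_k(X_i)|} \sum_{X_j \in N_k(X_i)} RD_k(X_i, X_j) \right)^{-1} \\
LOF_k(X_i) &= \frac{1}{|N_k(X_i)|} \sum_{X_j \in N_k(X_i)} \frac{\mathrm{lrd}_k(X_j)}{\mathrm{lrd}_k(X_i)}
\end{align*}
```

Under these definitions, a value of $LOF_k(X_i)$ close to 1 means the point is about as dense as its neighbors, while a value well above 1 means it lies in a sparser region than its neighbors and is a candidate outlier.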
4. Deviation Based Approach:
Mathematical Implementation:
• Z-scores are calculated for each feature in the dataset.
• The overall Z-score for each data point is computed as the Euclidean norm of its
individual feature Z-scores.
• Points with Z-scores above a certain threshold (commonly 2 or 3) are considered outliers
and are highlighted in red in the second plot.
• The original data points are shown in blue, while the identified outliers are marked in red.
$$Z_i = \frac{X_i - \text{mean}}{\text{std deviation}}$$
$$Z_{\text{norm}} = \sqrt{Z_{i1}^2 + Z_{i2}^2 + \cdots + Z_{in}^2}$$
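The Z-score steps above can be sketched as follows, assuming the population standard deviation is used for each feature and a threshold of 3. The ten sample rows, including the point at (10, 10), are made up for the example.

```python
import math
import statistics


def z_score_norms(rows):
    """Euclidean norm of the per-feature Z-scores of every row."""
    n_features = len(rows[0])
    columns = [[r[j] for r in rows] for j in range(n_features)]
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # population std (assumed)
    norms = []
    for r in rows:
        z = [(r[j] - means[j]) / stds[j] for j in range(n_features)]
        norms.append(math.sqrt(sum(v * v for v in z)))
    return norms


rows = [
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5],
    [0.0, 0.5], [1.0, 0.5], [0.5, 0.0], [0.5, 1.0],
    [10.0, 10.0],  # hypothetical outlier
]
threshold = 3.0
norms = z_score_norms(rows)
outliers = [i for i, n in enumerate(norms) if n > threshold]
print(outliers)
```

Since the norm sums squared Z-scores over all features, it tends to grow with the number of features, so a cutoff of 2 or 3 that suits a single feature may need to be raised for wider datasets.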