
Data Mining:
K-Nearest Neighbor (KNN) Algorithm and Clustering Analysis of Machine Learning
Module 5

Prof. (Dr.) Soumen Paul
Professor, Dept. of Information Technology
Haldia Institute of Technology
Haldia
Instance-based learning

• Machine Learning systems categorized as instance-based learning are systems that learn the
training examples by heart and then generalize to new instances based on some similarity
measure. The approach is called instance-based because it builds its hypotheses directly from
the training instances. It is also known as memory-based learning or lazy learning, because
processing is delayed until a new instance must be classified. The time complexity of the
algorithm depends on the size of the training data. Each time a new query is encountered, the
previously stored data is examined and a target function value is assigned to the new instance.
K-Nearest Neighbor (KNN) Algorithm for Machine Learning

• K-Nearest Neighbor is one of the simplest Machine Learning algorithms, based on the
Supervised Learning technique.

• The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.

• The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm.

• The K-NN algorithm can be used for Regression as well as Classification, but it is mostly
used for Classification problems.
K-Nearest Neighbor (KNN) Algorithm for
Machine Learning (Cont..)
• K-NN is a non-parametric algorithm, which means it makes no assumption about the
underlying data.

• It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and performs an action on it at classification time.

• At the training phase the KNN algorithm just stores the dataset, and when it gets new data it
classifies that data into the category most similar to the new data.
Example:
• Suppose we have an image of a creature that looks similar to both a cat and a dog, and we
want to know whether it is a cat or a dog. For this identification we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the features of the
new image that are similar to the cat and dog images and, based on the most similar features,
will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?

• Suppose there are two categories, Category A and Category B, and we have a new data
point x1: in which of these categories will this data point lie? To solve this type of problem
we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or
class of a particular data point.
K-NN Algorithm
Working of KNN Algorithm
The K-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the values of new
data points, which means that a new data point is assigned a value based on how closely it
matches the points in the training set. We can understand its working with the help of the
following steps (a short code sketch follows the list):
Step 1 − To implement any algorithm we need a dataset, so during the first step of KNN we
load the training as well as the test data.
Step 2 − Next, choose the value of K, i.e. the number of nearest data points. K can be any
integer.
Step 3 − For each point in the test data do the following:
3.1 − Calculate the distance between the test point and each row of the training data using
any of the distance measures, namely Euclidean, Manhattan or Hamming distance. The most
commonly used measure is Euclidean distance.
3.2 − Sort the distances in ascending order.
3.3 − Choose the top K rows from the sorted array.
3.4 − Assign a class to the test point based on the most frequent class of these rows.
Step 4 − End
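
The steps above map almost line for line onto code. The following is a minimal sketch (not
from the slides) of a single-query KNN classifier using Euclidean distance and majority
voting; the function and variable names are illustrative.

import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest
    training points, following Steps 1-4 above (Euclidean distance)."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Step 3.1: distance from the query to every training row
    distances = np.linalg.norm(train_X - query, axis=1)
    # Steps 3.2-3.3: sort ascending and keep the indices of the top K rows
    nearest = np.argsort(distances)[:k]
    # Step 3.4: the most frequent class among those K neighbours wins
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]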
Example
• The following is an example to understand the concept of K and
working of KNN algorithm −
• Suppose we have a dataset which can be plotted as follows −
Example (Cont...)
Now we need to classify the new data point shown with a black dot (at point (60,60)) into the
blue or red class. We assume K = 3, i.e. the algorithm finds the three nearest data points, as
shown in the next diagram.
Example (Cont...)
We can see in the above diagram the three nearest neighbours of the data point marked with
the black dot. Among those three, two of them lie in the Red class, hence the black dot will
also be assigned to the Red class.
Numerical Examples of K-NN
• Let us consider data from a questionnaire survey and objective testing with two attributes
(acid durability and strength) to classify whether a special paper tissue is good or not. Four
training samples are listed in Table 1.

Table 1
X1 = Acid Durability   X2 = Strength (kg/m²)   Y = Classification
7                      7                       Bad
7                      4                       Bad
3                      4                       Good
1                      4                       Good

Now the factory produces a new paper tissue with X1 = 3 and X2 = 7. Without another
expensive survey, can we classify the new tissue?
Solution
We can classify the new tissue with the following five steps.
Step 1: Determine the parameter K = number of nearest neighbors. Let us consider K = 3.
Step 2: Calculate the distance between the query instance (3,7) and all the training samples,
as shown in Table 2.
Step 3: Sort the distances and determine the nearest neighbors based on the K-th minimum
distance (Table 3).
Step 4: Gather the category Y of the nearest neighbors. The categories are available in
Table 1; the result is shown in Table 4.
Computation of Squared Distance to (3,7)
Table 2
X1 = Acid Durability   X2 = Strength (kg/m²)   Squared distance to (3,7)
7                      7                       (7-3)² + (7-7)² = 16
7                      4                       (7-3)² + (4-7)² = 25
3                      4                       (3-3)² + (4-7)² = 9
1                      4                       (1-3)² + (4-7)² = 13
Determination of Nearest Neighbors Based on the K-th Minimum Distance
Table 3
X1 = Acid Durability   X2 = Strength (kg/m²)   Squared distance to (3,7)   Rank based on distance   Included in neighborhood
7                      7                       16                          3                        Yes
7                      4                       25                          4                        No
3                      4                       9                           1                        Yes
1                      4                       13                          2                        Yes
Determination of Category Y of the Nearest Neighbors
Table 4
X1 = Acid Durability   X2 = Strength (kg/m²)   Squared distance to (3,7)   Rank based on distance   Included in neighborhood   Category
7                      7                       16                          3                        Yes                        Bad
7                      4                       25                          4                        No                         -----
3                      4                       9                           1                        Yes                        Good
1                      4                       13                          2                        Yes                        Good
Solution (Cont..)
• Step 5: Simply use the majority category of the nearest neighbors as the prediction for the
query instance. We have two "Good" and one "Bad" category. Since 2 > 1, by voting the new
tissue is classified into the "Good" category.
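
To make the worked example concrete, here is a small self-contained Python check (an
illustrative sketch, not part of the slides) that reproduces Tables 2-4 and the Step 5 vote:

from collections import Counter

# Training samples from Table 1: (acid durability, strength) -> class
samples = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
K = 3

# Squared Euclidean distance to the query, as in Table 2
scored = sorted(samples,
                key=lambda s: (s[0][0] - query[0]) ** 2 + (s[0][1] - query[1]) ** 2)
votes = Counter(label for _, label in scored[:K])   # keep the K nearest (Tables 3-4)
print(votes.most_common(1)[0][0])                   # prints "Good", matching Step 5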
Pros and Cons of K-NN:
Pros

• It is a very simple algorithm to understand and interpret.

• It is very useful for nonlinear data because it makes no assumption about the data.

• It is a versatile algorithm, as we can use it for classification as well as regression.

• It has relatively high accuracy, although there are better supervised learning models than
KNN.
Cons
• It is a computationally somewhat expensive algorithm because it stores all the training data.

• It requires high memory storage compared to other supervised learning algorithms.

• Prediction is slow when the number of training samples N is large.

• It is very sensitive to the scale of the data as well as to irrelevant features.
Applications of KNN

The following are some of the areas in which KNN can be applied successfully:
• Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan
approval, i.e. whether that individual has characteristics similar to those of previous
defaulters.
• Calculating Credit Ratings
KNN algorithms can be used to find an individual's credit rating by comparing with persons
having similar traits.

Other areas in which the KNN algorithm can be used are Speech Recognition, Handwriting
Detection, Image Recognition and Video Recognition.
Cluster analysis
Clustering is the assignment of objects into
groups (called clusters) so that objects from
the same cluster are more similar to each
other than objects from different clusters.
Often similarity is assessed according to a
distance measure. Clustering is a common
technique for statistical data analysis, which
is used in many fields, including machine
learning, data mining, pattern recognition,
image analysis and bioinformatics.
Types of Clustering Algorithm
1. Partitioning Algorithm
2. Hierarchical Clustering Algorithm
3. Clustering Algorithms for Categorical Data
Different Types of Partitioning Algorithms
1. K-Means Algorithm
2. K-Medoid Algorithm
3. PAM
4. CLARA
5. CLARANS
K-Means & K-Medoid Algorithm

• K-means algorithms, where each cluster is represented by the center of gravity (mean) of
the cluster.

• K-medoid algorithms, where each cluster is represented by one of the objects of the cluster
located near its center.
K-Means
• Initial set of clusters randomly chosen.
• Iteratively, items are moved among sets of
clusters until the desired set is reached.
• High degree of similarity among elements in a
cluster is obtained.
• Given a cluster Ki={ti1,ti2,…,tim}, the cluster
mean is mi = (1/m)(ti1 + … + tim)
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k = 2
• Randomly assign means: m1 = 3, m2 = 4
• K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16
• K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18
• K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6
• K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25
• A further pass leaves the assignments unchanged, so the algorithm stops (see the sketch
below).
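
The iteration above can be reproduced with a few lines of Python. The following is a minimal
one-dimensional k-means sketch (illustrative, not from the slides); the function name and the
fixed iteration cap are assumptions.

def k_means_1d(points, means, max_iters=20):
    """Assign each point to its nearest mean, recompute the means, and
    repeat until the means stop changing (basic k-means loop)."""
    for _ in range(max_iters):
        clusters = [[] for _ in means]
        for p in points:
            idx = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[idx].append(p)
        new_means = [sum(c) / len(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:           # converged: no mean moved
            break
        means = new_means
    return clusters, means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = k_means_1d(data, means=[3, 4])
print(clusters, means)   # final clusters {2,3,4,10,11,12} and {20,30,25}, means 7 and 25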
K-Means Algorithm
K-Medoid
• K-medoid is a classical partitioning technique of clustering that clusters a data set of n
objects into k clusters, with k known a priori.

• It is more robust to noise and outliers than k-means.

• A medoid can be defined as the object of a cluster whose average dissimilarity to all the
objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster.
K-medoids
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the
medoid shift algorithm. Both the k-means and k-medoids algorithms are partitional (they
break the dataset up into groups) and both attempt to minimize the distance between the
points labeled to be in a cluster and a point designated as the center of that cluster. In
contrast to the k-means algorithm, k-medoids chooses actual data points as centers (medoids
or exemplars).
k-medoid clustering algorithm
1. The algorithm begins with an arbitrary selection of k objects as medoid points out of the n
data points (n > k).
2. After selection of the k medoid points, associate each data object in the given data set with
its most similar medoid. Similarity here is defined using a distance measure, which can be
Euclidean, Manhattan or Minkowski distance.
3. Randomly select a non-medoid object O′.
4. Compute the total cost S of swapping an initial medoid object with O′.
5. If S < 0, swap the initial medoid with the new one (if S < 0 there will be a new set of
medoids).
6. Repeat steps 2 to 5 until there is no change in the medoids. (A code sketch of these steps
is given below.)
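
A compact way to see these steps is the random-swap sketch below (illustrative Python, not
from the slides; the function names, the Manhattan metric and the fixed number of trial swaps
are assumptions). A production implementation would keep trying swaps until none of them
lowers the cost.

import random

def manhattan(a, b):
    # distance measure used for the 'similarity' in step 2
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    # sum of each point's distance to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, trials=200, seed=0):
    """Steps 1-6 above: pick k medoids, then repeatedly try swapping one
    medoid with a random non-medoid and keep the swap only if it lowers
    the total cost."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)                    # step 1
    cost = total_cost(points, medoids)                 # cost of the initial choice (step 2)
    for _ in range(trials):                            # steps 3-6
        candidate = rng.choice([p for p in points if p not in medoids])
        trial = medoids[:]
        trial[rng.randrange(k)] = candidate            # swap one medoid with O'
        trial_cost = total_cost(points, trial)
        if trial_cost < cost:                          # S < 0: keep the improvement
            medoids, cost = trial, trial_cost
    return medoids, cost

Calling k_medoids on the ten-object data set in the demonstration below is expected to settle
on a low-cost pair of medoids such as (3, 4) and (7, 4), the pair reached in the walkthrough.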
Demonstration of algorithm

• Cluster the following data set of ten objects into two clusters, i.e. k = 2.
• Consider the data set of ten objects as follows:

X1  (2, 6)
X2  (3, 4)
X3  (3, 8)
X4  (4, 7)
X5  (6, 2)
X6  (6, 4)
X7  (7, 3)
X8  (7, 4)
X9  (8, 5)
X10 (7, 6)
Distribution of the data
Step 1 of Algorithm
• Step 1
• Initialize k centres.
• Let us assume c1 = (3,4) and c2 = (7,4).
• So here c1 and c2 are selected as the medoids.
• Calculate the distances so as to associate each data object with its nearest medoid. Cost is
calculated using the Minkowski distance metric with r = 1 (i.e. Manhattan distance).
Minkowski distance from c1 – Step 1
c1        Data object Xi   Cost (distance)
(3, 4)    (2, 6)           3
(3, 4)    (3, 8)           4
(3, 4)    (4, 7)           4
(3, 4)    (6, 2)           5
(3, 4)    (6, 4)           3
(3, 4)    (7, 3)           5
(3, 4)    (8, 5)           6
(3, 4)    (7, 6)           6
Minkowski distance from c2 – Step 1
c2        Data object Xi   Cost (distance)
(7, 4)    (2, 6)           7
(7, 4)    (3, 8)           8
(7, 4)    (4, 7)           6
(7, 4)    (6, 2)           3
(7, 4)    (6, 4)           1
(7, 4)    (7, 3)           1
(7, 4)    (8, 5)           2
(7, 4)    (7, 6)           2
Formation of Cluster
• So the clusters become:
• Cluster 1 = {(3,4), (2,6), (3,8), (4,7)}
• Cluster 2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
• Since the points (2,6), (3,8) and (4,7) are closest to c1, they form one cluster, while the
remaining points form the other cluster.
• The total cost involved is 20, where the cost between any two points is found using the
formula below.
Cost Calculation
• cost(x, c) = Σ_{i=1..d} |x_i − c_i|
• where x is any data object, c is the medoid, and d is the dimension of the objects, which in
this case is 2.
• The total cost is the sum of the costs of the data objects from the medoids of their clusters,
so here:
• Total cost = cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7)) + cost((7,4),(6,2)) +
cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6))
• = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2
• = 20
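
As a quick check of the Step 1 cost, the short Python snippet below (an illustrative sketch,
not from the slides) recomputes the total cost for the medoids c1 = (3,4) and c2 = (7,4):

# All ten objects and the Step 1 medoids
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
medoids = [(3, 4), (7, 4)]

def cost(x, c):
    # Minkowski distance with r = 1 (Manhattan), as in the formula above
    return sum(abs(xi - ci) for xi, ci in zip(x, c))

total = sum(min(cost(p, m) for m in medoids) for p in points if p not in medoids)
print(total)   # prints 20, matching the hand calculation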
Cluster Formed
Step 2 of Algorithm
• Step 2
• Randomly select a non-medoid object O′.
• Let us assume O′ = (7,3).
• So now the candidate medoids are c1 = (3,4) and O′ = (7,3).
• With c1 and O′ as the new medoids, calculate the total cost involved using the formula from
Step 1.
Minkowski distance from c1 – Step 2
c1        Data object Xi   Cost (distance)
(3, 4)    (2, 6)           3
(3, 4)    (3, 8)           4
(3, 4)    (4, 7)           4
(3, 4)    (6, 2)           5
(3, 4)    (6, 4)           3
(3, 4)    (7, 4)           4
(3, 4)    (8, 5)           6
(3, 4)    (7, 6)           6
Minkowski distance from O′ – Step 2
O′        Data object Xi   Cost (distance)
(7, 3)    (2, 6)           8
(7, 3)    (3, 8)           9
(7, 3)    (4, 7)           7
(7, 3)    (6, 2)           2
(7, 3)    (6, 4)           2
(7, 3)    (7, 4)           1
(7, 3)    (8, 5)           3
(7, 3)    (7, 6)           3
Cluster Formed after Step 2
Cost Calculation in Step 2
• Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22
• So the cost of swapping the medoid from c2 to O′ is
• S = current total cost − past total cost
  = 22 − 20
  = 2 > 0
– Since S > 0, moving to O′ would be a bad idea, so the previous choice was good and the
algorithm terminates here (i.e. there is no change in the medoids).
– Some data points may shift from one cluster to another depending on their closeness to the
medoids.
K-Medoid Clustering
• There are three well-known algorithms for K-Medoids clustering:
• PAM (Partitioning Around Medoids)
• CLARA (Clustering Large Applications)
• CLARANS (Clustering Large Applications based on Randomized Search)
• PAM is the most powerful of the three algorithms but has the disadvantage of its time
complexity. The following K-Medoids example is performed using PAM. In the later parts
we'll see what CLARA and CLARANS are.
PAM Algorithm
Given the value of k and unlabelled data:

1. Choose k random points from the data and assign them to k clusters. These are the initial
medoids.
2. For all the remaining data points, calculate the distance from each medoid and assign each
point to the cluster with the nearest medoid.
3. Calculate the total cost (the sum of the distances of all the data points from their medoids).
4. Select a random point as a new medoid and swap it with a previous medoid. Repeat steps 2
and 3.
5. If the total cost with the new medoid is less than that with the previous medoid, make the
new medoid permanent and repeat step 4.
6. If the total cost with the new medoid is greater than the cost with the previous medoid,
undo the swap and repeat step 4.
7. The repetitions continue until no change in the medoids improves the assignment of the
data points.
An example of PAM Algorithm:
      x   y
0     5   4
1     7   7
2     1   3
3     8   6
4     4   9
Scatter plot:
PAM Clustering
If k is given as 2, we need to break down the
data points into 2 clusters.

Initial medoids: M1(1, 3) and M2(4, 9)


Calculation of distances
Manhattan Distance: |x1 - x2| + |y1 - y2|
PAM Clustering (Cont..)
      x   y   From M1(1, 3)   From M2(4, 9)
0     5   4   5               6
1     7   7   10              5
2     1   3   -               -
3     8   6   10              7
4     4   9   -               -
PAM Clustering (Cont..)
Cluster 1: 0
Cluster 2: 1, 3
Calculation of total cost:
(5) + (5 + 7) = 17
Random medoid: (5, 4)
M1(5, 4) and M2(4, 9):
PAM Clustering (Cont..)
      x   y   From M1(5, 4)   From M2(4, 9)
0     5   4   -               -
1     7   7   5               5
2     1   3   5               9
3     8   6   5               7
4     4   9   -               -
PAM Clustering(Cont..)
Cluster 1: 2, 3
Cluster 2: 1
Calculation of total cost:
(5 + 5) + 5 = 15
Less than the previous cost
New medoid: (5, 4).
Random medoid: (7, 7)
M1(5, 4) and M2(7, 7)
PAM Clustering(Cont..)
      x   y   From M1(5, 4)   From M2(7, 7)
0     5   4   -               -
1     7   7   -               -
2     1   3   5               10
3     8   6   5               2
4     4   9   6               5
PAM Clustering(Cont..)
Cluster 1: 2
Cluster 2: 3, 4
Calculation of total cost:
(5) + (2 + 5) = 12
Less than the previous cost
New medoid: (7, 7).
Random medoid: (8, 6)
M1(7, 7) and M2(8, 6)
PAM Clustering(Cont..)
      x   y   From M1(7, 7)   From M2(8, 6)
0     5   4   5               5
1     7   7   -               -
2     1   3   10              10
3     8   6   -               -
4     4   9   5               7
PAM Clustering(Cont..)
Cluster 1: 4
Cluster 2: 0, 2
Calculation of total cost:
(5) + (5 + 10) = 20
Greater than the previous cost
UNDO

Hence, the final medoids: M1(5, 4) and M2(7, 7)

Cluster 1: 2
Cluster 2: 3, 4
Total cost: 12
Clusters:
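
The whole walkthrough can be verified with a few lines of Python (an illustrative sketch, not
part of the slides), recomputing the total cost for each pair of candidate medoids that was
tried:

# The five points of the example and the Manhattan distance used above
points = [(5, 4), (7, 7), (1, 3), (8, 6), (4, 9)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(m1, m2):
    # sum of each non-medoid point's distance to its nearest medoid
    return sum(min(manhattan(p, m1), manhattan(p, m2))
               for p in points if p not in (m1, m2))

for m1, m2 in [((1, 3), (4, 9)), ((5, 4), (4, 9)), ((5, 4), (7, 7)), ((7, 7), (8, 6))]:
    print(m1, m2, total_cost(m1, m2))   # prints costs 17, 15, 12 and 20, as in the example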
CLARA (Clustering Large Applications)
• The difference between the PAM and CLARA algorithms is that the latter is based upon
sampling. Only a small portion of the real data is chosen as a representative of the data, and
medoids are chosen from this sample using PAM.

• The idea is that if the sample is selected in a fairly random manner, then it correctly
represents the whole dataset, and therefore the representative objects (medoids) chosen will
be similar to those that would be chosen from the whole dataset.

• CLARA draws several samples and outputs the best clustering among these samples.
CLARA can deal with larger datasets than PAM.
CLARA Algorithm
Input: database D of objects.
repeat m times:
    draw a random sample S (a subset of D) from D
    call PAM(S, k) to get k medoids
    classify the entire data set D into C1, C2, ..., Ck
    calculate the quality of the clustering as the average dissimilarity
end
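
In Python, the CLARA loop can be sketched as follows (illustrative only, not from the slides).
It assumes some pam(sample, k) function that returns k medoids for a sample, for example the
k_medoids sketch shown earlier, and uses Manhattan distance for the dissimilarity.

import random

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def clara(points, k, pam, num_samples=5, sample_size=40, seed=0):
    """Run PAM on several random samples of the data and keep the medoids
    that give the lowest average dissimilarity over the full data set."""
    rng = random.Random(seed)
    best_medoids, best_quality = None, float("inf")
    for _ in range(num_samples):                      # repeat m times
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k)                      # PAM sees only the sample
        # classify the entire data set against these medoids and score it
        quality = sum(min(manhattan(p, m) for m in medoids) for p in points) / len(points)
        if quality < best_quality:
            best_medoids, best_quality = medoids, quality
    return best_medoids, best_quality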
CLARANS (Clustering Large Applications
Based on Randomized Search)
It is similar to PAM and CLARA, but it applies a randomized
Iterative-Optimization for the determination of medoids. It is
easy to see that in PAM, at every iteration we examine k(N-k)
swaps to determine the pair corresponding to the minimum
cost. On the other hand, CLARA tries to examine fewer
elements by restricting its search to a smaller sample of the
database. Thus, if the sample size is S<=N, it examines at most
k(S-k) pairs at every iteration. Thus, in CLARA, an object can
be a medoid only if it is in the randomly selected sample. But
CLARANS does not restrict the search to any particular subset
of objects.
CLARANS (Clustering Large Applications
Based on Randomized Search)
Nor does it search the entire data set. It proceeds like PAM but randomly selects a few pairs
(i, h) for swapping at the current state, instead of examining all pairs.
CLARANS draws a sample with some randomness in every phase of the search.
The clustering process can be presented as searching a graph where each node is a possible
solution, i.e. a set of k medoids. The clustering obtained after replacing a single medoid is
called a neighbor of the current clustering.
Hierarchical Clustering Algorithm
The construction of a hierarchical agglomerative classification can be achieved by the
following general algorithm.

1. Find the two closest objects and merge them into a cluster.

2. Find and merge the next two closest points, where a point is either an individual object or
a cluster of objects.

3. If more than one cluster remains, return to step 2.

Hierarchical Agglomerative Algorithm
• The agglomerative algorithm follows a bottom-up strategy, treating each object as its own
cluster and iteratively merging clusters until a single cluster is formed or a terminating
condition is satisfied.
• According to some similarity measure, the merging is done by choosing the closest clusters
first.
• A dendrogram, which is a tree-like structure, is used to represent hierarchical clustering.
• Individual objects are represented by leaf nodes and clusters by internal nodes of the tree.
Linkage Criteria
A linkage criterion determines the distance between sets of observations as a function of the
pairwise distances between observations.

In Single Linkage, the distance between two clusters is the minimum distance between
members of the two clusters.

In Complete Linkage, the distance between two clusters is the maximum distance between
members of the two clusters.

In Average Linkage, the distance between two clusters is the average of all distances between
members of the two clusters.

In Centroid Linkage, the distance between two clusters is the distance between their
centroids.
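
The four criteria differ only in how they reduce the set of pairwise distances to a single
number. A small illustrative Python sketch (not from the slides; function names are
assumptions) makes the difference explicit:

import math
from itertools import product

def dist(a, b):
    return math.dist(a, b)                       # Euclidean distance between two points

def single_link(A, B):                           # minimum pairwise distance
    return min(dist(a, b) for a, b in product(A, B))

def complete_link(A, B):                         # maximum pairwise distance
    return max(dist(a, b) for a, b in product(A, B))

def average_link(A, B):                          # mean of all pairwise distances
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def centroid_link(A, B):                         # distance between cluster centroids
    ca = [sum(c) / len(A) for c in zip(*A)]
    cb = [sum(c) / len(B) for c in zip(*B)]
    return dist(ca, cb)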
Agglomerative Algorithm: Single Link
Sample No X y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30
Distance Matrix
P1 P2 P3 P4 P5 P6

P1 0

P2 0.23 0

P3 0.22 0.14 0

P4 0.37 0.19 0.13 0

P5 0.34 0.14 0.28 0.23 0

P6 0.24 0.24 0.10 0.22 0.39 0


Merging the objects
Merging the two closest members of the two clusters and finding
the minimum element in distance matrix, we get:
Merging the objects
Here the minimum value is 0.10, so we combine P3 and P6. Now form the cluster
corresponding to this minimum value and update the distance matrix. To update the distance
matrix:

min((P3,P6),P1) = min((P3,P1),(P6,P1)) = 0.22
min((P3,P6),P2) = min((P3,P2),(P6,P2)) = 0.14
min((P3,P6),P4) = min((P3,P4),(P6,P4)) = 0.13
min((P3,P6),P5) = min((P3,P5),(P6,P5)) = 0.28
Merging the objects
P1 P2 P3,P6 P4 P5

P1 0

P2 0.23 0

P3,P6 0.22 0.14 0

P4 0.37 0.19 0.13 0

P5 0.34 0.14 0.28 0.23 0


Merging the objects
• Merge the two closest members of the two clusters by finding the minimum element in the
distance matrix.
• Here the minimum value is 0.13, so we combine {P3, P6} and P4. Now form the cluster
corresponding to this minimum value and update the distance matrix:
min(((P3,P6),P4),P1) = min(((P3,P6),P1),(P4,P1)) = 0.22
min(((P3,P6),P4),P2) = min(((P3,P6),P2),(P4,P2)) = 0.14
min(((P3,P6),P4),P5) = min(((P3,P6),P5),(P4,P5)) = 0.23
Merging the objects…(Cont)
P1 P2 P3,P6,P4 P5

P1 0

P2 0.23 0

P3,P6,P4 0.22 0.14 0

P5 0.34 0.14 0.23 0


Merging the objects…(Cont)
• Merge the two closest members of the two clusters by finding the minimum element in the
distance matrix.
• Here the minimum value is 0.14, so we combine P2 and P5. Now form the cluster
corresponding to this minimum value and update the distance matrix:
• min((P2,P5),P1) = min((P2,P1),(P5,P1)) = 0.23
• min((P2,P5),(P3,P6,P4)) = min((P2,(P3,P6,P4)),(P5,(P3,P6,P4))) = 0.14
Merging the objects…(Cont)
P1 P2,P5 P3,P6,P4

P1 0

P2,P5 0.23 0

P3,P6,P4 0.22 0.14 0


Merging the objects…(Cont)
• Here the minimum value is 0.14, so we combine {P2, P5} and {P3, P6, P4}. Now form the
cluster corresponding to this minimum value and update the distance matrix:

• min((P2,P5,P3,P6,P4),P1) = min(((P2,P5),P1),((P3,P6,P4),P1)) = min(0.23, 0.22) = 0.22
Merging the objects…(Cont)
P1 P2,P5,P3,P6,P4

P1 0

P2,P5,P3,P6,P4 0.22 0
Dendrogram
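
If SciPy is available, the whole single-link computation (and the dendrogram above) can be
reproduced from the distance matrix in a few lines. The snippet below is an illustrative
sketch, not part of the slides; it feeds in the rounded distance matrix from the table above.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Pairwise distance matrix for P1..P6 from the table above (rounded values)
D = np.array([
    [0.00, 0.23, 0.22, 0.37, 0.34, 0.24],
    [0.23, 0.00, 0.14, 0.19, 0.14, 0.24],
    [0.22, 0.14, 0.00, 0.13, 0.28, 0.10],
    [0.37, 0.19, 0.13, 0.00, 0.23, 0.22],
    [0.34, 0.14, 0.28, 0.23, 0.00, 0.39],
    [0.24, 0.24, 0.10, 0.22, 0.39, 0.00],
])

Z = linkage(squareform(D), method="single")   # single-link agglomerative clustering
print(np.round(Z, 2))
# The merge heights 0.10, 0.13, 0.14, 0.14 and 0.22 reproduce the hand computation
# above (the two merges at height 0.14 may occur in either order).
# dendrogram(Z) draws the tree shown on the Dendrogram slide (requires matplotlib).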
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
It is an unsupervised machine learning algorithm that forms clusters based upon the density
of the data points, i.e. how close the data points are to one another. Points that lie outside
the dense regions are excluded and treated as noise or outliers.

This characteristic of the DBSCAN algorithm makes it a good fit for outlier detection and for
finding clusters of arbitrary shape. Algorithms like K-Means lack this property: they make
only spherical clusters and are very sensitive to outliers. By sensitivity it is meant that the
sphere-shaped clusters made by K-Means can easily be distorted by the introduction of a
single outlier, because outliers are included in the clusters too.
A two-dimensional data set for easy visualization
• A two-dimensional data set is presented for easy visualization and understanding, though
DBSCAN can handle multi-dimensional data too. The possible clusters in the data have been
marked in the graph to visualize the clusters that we want. The points (1,5), (4,3) and (5,6)
fall outside the markings and hence should be treated as outliers. The DBSCAN algorithm
should make exactly these clusters and exclude the outliers, as we did in the graph. Let's first
understand the algorithm and the various steps involved in it.
Data Visualisation
Logic and Steps:
• The DBSCAN algorithm takes two input parameters: the radius around each point (eps)
and the minimum number of data points that should lie around a point within that radius
(MinPts). For example, consider the point (1.5, 2.5): if we take eps = 0.3, then the circle
around the point with radius 0.3 contains only one other point, (1.2, 2.5), as shown below.
Logic and Steps(cont..)
DBSCAN Algorithm
• Hence for (1.5, 2.5) with eps = 0.3, the number of neighbourhood points is just one. In
DBSCAN each point is checked against these two parameters and the clustering decision is
made through the steps below (a code sketch follows the list):
1. Choose values for eps and MinPts.
2. For a particular data point x, calculate its distance from every other data point.
3. Find all the neighbourhood points of x which fall inside the circle of radius eps, i.e. whose
distance from x is smaller than or equal to eps.
4. Mark x as visited. If the number of neighbourhood points around x is greater than or equal
to MinPts, treat x as a core point; if it is not yet assigned to any cluster, create a new cluster
and assign x to it.
5. If the number of neighbourhood points around x is less than MinPts and x has a core point
in its neighbourhood, treat x as a border point.
6. Include all the density-connected points in a single cluster. (What density-connected points
are is described later.)
7. Repeat the above steps for every unvisited point in the data set and find all core, border
and outlier points.
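
In practice these steps are usually run through a library. The snippet below is an illustrative
sketch using scikit-learn (assumed to be installed) on a few made-up 2-D points, not the exact
data set of the walkthrough; note that scikit-learn's min_samples counts the point itself, so
MinPts = 4 neighbours corresponds to min_samples = 5.

import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative 2-D points: two dense groups plus a few isolated points
X = np.array([[1.0, 2.0], [1.0, 2.5], [1.2, 2.5], [1.5, 2.5], [1.1, 2.2],
              [2.8, 4.5], [3.0, 4.4], [2.9, 4.8], [2.6, 4.6], [2.7, 4.3],
              [1.0, 5.0], [4.0, 3.0], [5.0, 6.0]])

# eps = 0.6 as in the walkthrough; min_samples = MinPts + 1 because the
# point itself is counted by scikit-learn
model = DBSCAN(eps=0.6, min_samples=5).fit(X)
print(model.labels_)   # one cluster index per point; -1 marks noise/outliers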
DBSCAN Algorithm
Algorithm in action
• Let's now apply the DBSCAN algorithm to the above dataset to find the clusters. We first
have to choose the values for eps and MinPts: let's choose eps = 0.6 and MinPts = 4. Let's
consider the first data point in the dataset, (1, 2), and calculate its distance from every other
data point in the data set. The calculated values are shown below:
Algorithm in action (Cont..)
As evident from the above table, the point (1, 2) has only two other points in its
neighbourhood, (1, 2.5) and (1.2, 2.5), for the assumed value of eps. As this is less than
MinPts, we cannot declare it a core point. Let's repeat the above process for every point in
the dataset and find the neighbourhood of each. The calculations, when repeated, can be
summarized as below:
Algorithm in action (Cont..)
Algorithm in action (Cont..)
• Observe the above table carefully: the left-most column contains all the points in our data
set. To the right of each point are the data points in its neighbourhood, i.e. the points whose
distance from it is less than or equal to the eps value. There are three points in the data set,
(2.8, 4.5), (1.2, 2.5) and (1, 2.5), that have 4 neighbourhood points around them, hence they
are called core points, and, as already mentioned, if a core point is not assigned to any
cluster, a new cluster is formed. Hence (2.8, 4.5) is assigned to a new cluster, Cluster 1, and
so is the point (1.2, 2.5), to Cluster 2. Also observe that the core points (1.2, 2.5) and
(1, 2.5) share at least one common neighbourhood point, (1, 2), so they are assigned to the
same cluster. The table below shows the categorization of all the data points into core,
border and outlier points.
Algorithm in action (Cont..)
Algorithm in action (Cont..)
• There are three types of points in the dataset as detected by the DBSCAN algorithm: core,
border and outlier points. Every core point is assigned to a new cluster, unless it shares
neighbourhood points with another core point, in which case both are included in the same
cluster. Every border point is assigned to a cluster based upon the core point in its
neighbourhood; e.g. the first point (1, 2) is a border point and has a core point (1.2, 2.5) in
its neighbourhood, which is included in Cluster 2, hence the point (1, 2) is included in
Cluster 2 too. The whole categorization can be summarized as below:
Categorization of Clusters
Reachability
• Directly density-reachable: An object (or instance) q is directly density-reachable from an
object p if q is within the ε-neighborhood of p and p is a core object.
• Here direct density-reachability is not symmetric: object p is not directly density-reachable
from object q, as q is not a core object.
Reachability (Cont..)
Density-reachable: An object q is density-reachable from p w.r.t. ε and MinPts if there is a
chain of objects q1, q2, ..., qn, with q1 = p and qn = q, such that q(i+1) is directly
density-reachable from qi w.r.t. ε and MinPts for all 1 <= i <= n-1.
Here density-reachability is not symmetric: as q is not a core point, q(n-1) is not directly
density-reachable from q, so object p is not density-reachable from object q.
Connectivity
Density-connectivity: Object q is density-connected to object p w.r.t. ε and MinPts if there is
an object o such that both p and q are density-reachable from o w.r.t. ε and MinPts.
Density-connectivity is symmetric: if object q is density-connected to object p, then object p
is also density-connected to object q.
