
Data Mining:
K-Nearest Neighbor (KNN) Algorithm and Clustering Analysis of Machine Learning
Module 5

Prof. (Dr.) Soumen Paul
Professor, Dept. of Information Technology
Haldia Institute of Technology
Haldia
Instance-based learning

• Machine Learning systems categorized as instance-based learning are systems that learn the
training examples by heart and then generalize to new instances based on some similarity
measure. The approach is called instance-based because it builds its hypotheses directly from
the training instances. It is also known as memory-based learning or lazy learning, because
processing is delayed until a new instance must be classified. The time complexity of the
algorithm depends on the size of the training data. Each time a new query is encountered, the
previously stored data is examined and a target function value is assigned to the new instance.
K-Nearest Neighbor (KNN) Algorithm for Machine Learning

• K-Nearest Neighbor is one of the simplest Machine Learning algorithms, based on the
Supervised Learning technique.

• The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.

• The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm.

• The K-NN algorithm can be used for Regression as well as Classification, but it is mostly
used for Classification problems.
K-Nearest Neighbor (KNN) Algorithm for
Machine Learning (Cont..)
• K-NN is a non-parametric algorithm, which means it makes no assumption about the
underlying data.

• It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and performs an action on it at classification time.

• At the training phase the KNN algorithm just stores the dataset, and when it gets new data it
classifies that data into the category most similar to the new data.
Example:
• Suppose we have an image of a creature that looks similar to both a cat and a dog, and we
want to know whether it is a cat or a dog. For this identification we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the features of the
new image that are similar to the cat and dog images and, based on the most similar features,
will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?

• Suppose there are two categories, Category A and Category B, and we have a new data
point x1: in which of these categories will this data point lie? To solve this type of problem
we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or
class of a particular data point.
K-NN Algorithm
Working of KNN Algorithm
The K-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the values of new
data points, which means that a new data point is assigned a value based on how closely it
matches the points in the training set. We can understand its working with the help of the
following steps (a short code sketch follows the list):
Step 1 − To implement any algorithm we need a dataset, so during the first step of KNN we
load the training as well as the test data.
Step 2 − Next, choose the value of K, i.e. the number of nearest data points. K can be any
integer.
Step 3 − For each point in the test data do the following:
3.1 − Calculate the distance between the test point and each row of the training data using
any of the distance measures, namely Euclidean, Manhattan or Hamming distance. The most
commonly used measure is Euclidean distance.
3.2 − Sort the distances in ascending order.
3.3 − Choose the top K rows from the sorted array.
3.4 − Assign a class to the test point based on the most frequent class of these rows.
Step 4 − End
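
The steps above map almost line for line onto code. The following is a minimal sketch (not
from the slides) of a single-query KNN classifier using Euclidean distance and majority
voting; the function and variable names are illustrative.

import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest
    training points, following Steps 1-4 above (Euclidean distance)."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Step 3.1: distance from the query to every training row
    distances = np.linalg.norm(train_X - query, axis=1)
    # Steps 3.2-3.3: sort ascending and keep the indices of the top K rows
    nearest = np.argsort(distances)[:k]
    # Step 3.4: the most frequent class among those K neighbours wins
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]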
Example
• The following is an example to understand the concept of K and
working of KNN algorithm −
• Suppose we have a dataset which can be plotted as follows −
Example (Cont...)
Now we need to classify the new data point shown with a black dot (at point (60,60)) into the
blue or red class. We assume K = 3, i.e. the algorithm finds the three nearest data points, as
shown in the next diagram.
Example (Cont...)
We can see in the above diagram the three nearest neighbours of the data point marked with
the black dot. Among those three, two of them lie in the Red class, hence the black dot will
also be assigned to the Red class.
Numerical Examples of K-NN
• Let us consider data from a questionnaire survey and objective testing with two attributes
(acid durability and strength) to classify whether a special paper tissue is good or not. Four
training samples are listed in Table 1.

Table 1
X1 = Acid Durability   X2 = Strength (kg/m²)   Y = Classification
7                      7                       Bad
7                      4                       Bad
3                      4                       Good
1                      4                       Good

Now the factory produces a new paper tissue with X1 = 3 and X2 = 7. Without another
expensive survey, can we classify the new tissue?
Solution
We can classify the new tissue with the following five steps.
Step 1: Determine the parameter K = number of nearest neighbors. Let us consider K = 3.
Step 2: Calculate the distance between the query instance (3,7) and all the training samples,
as shown in Table 2.
Step 3: Sort the distances and determine the nearest neighbors based on the K-th minimum
distance (Table 3).
Step 4: Gather the category Y of the nearest neighbors. The categories are available in
Table 1; the result is shown in Table 4.
Computation of Squared Distance to (3,7)
Table 2
X1 = Acid Durability   X2 = Strength (kg/m²)   Squared distance to (3,7)
7                      7                       (7-3)² + (7-7)² = 16
7                      4                       (7-3)² + (4-7)² = 25
3                      4                       (3-3)² + (4-7)² = 9
1                      4                       (1-3)² + (4-7)² = 13
Determination of Nearest Neighbors Based on the K-th Minimum Distance
Table 3
X1 = Acid Durability   X2 = Strength (kg/m²)   Squared distance to (3,7)   Rank based on distance   Included in neighborhood
7                      7                       16                          3                        Yes
7                      4                       25                          4                        No
3                      4                       9                           1                        Yes
1                      4                       13                          2                        Yes
Determination of Category Y of the Nearest Neighbors
Table 4
X1 = Acid Durability   X2 = Strength (kg/m²)   Squared distance to (3,7)   Rank based on distance   Included in neighborhood   Category
7                      7                       16                          3                        Yes                        Bad
7                      4                       25                          4                        No                         -----
3                      4                       9                           1                        Yes                        Good
1                      4                       13                          2                        Yes                        Good
Solution (Cont..)
• Step 5: Simply use the majority category of the nearest neighbors as the prediction for the
query instance. We have two "Good" and one "Bad" category. Since 2 > 1, by voting the new
tissue is classified into the "Good" category.
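
To make the worked example concrete, here is a small self-contained Python check (an
illustrative sketch, not part of the slides) that reproduces Tables 2-4 and the Step 5 vote:

from collections import Counter

# Training samples from Table 1: (acid durability, strength) -> class
samples = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
K = 3

# Squared Euclidean distance to the query, as in Table 2
scored = sorted(samples,
                key=lambda s: (s[0][0] - query[0]) ** 2 + (s[0][1] - query[1]) ** 2)
votes = Counter(label for _, label in scored[:K])   # keep the K nearest (Tables 3-4)
print(votes.most_common(1)[0][0])                   # prints "Good", matching Step 5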
Pros and Cons of K-NN:
Pros

• It is a very simple algorithm to understand and interpret.

• It is very useful for nonlinear data because it makes no assumption about the data.

• It is a versatile algorithm, as we can use it for classification as well as regression.

• It has relatively high accuracy, although there are better supervised learning models than
KNN.
Cons
• It is a computationally somewhat expensive algorithm because it stores all the training data.

• It requires high memory storage compared to other supervised learning algorithms.

• Prediction is slow when the number of training samples N is large.

• It is very sensitive to the scale of the data as well as to irrelevant features.
Applications of KNN

The following are some of the areas in which KNN can be applied successfully:
• Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan
approval, i.e. whether that individual has characteristics similar to those of previous
defaulters.
• Calculating Credit Ratings
KNN algorithms can be used to find an individual's credit rating by comparing with persons
having similar traits.

Other areas in which the KNN algorithm can be used are Speech Recognition, Handwriting
Detection, Image Recognition and Video Recognition.
Cluster analysis
Clustering is the assignment of objects into
groups (called clusters) so that objects from
the same cluster are more similar to each
other than objects from different clusters.
Often similarity is assessed according to a
distance measure. Clustering is a common
technique for statistical data analysis, which
is used in many fields, including machine
learning, data mining, pattern recognition,
image analysis and bioinformatics.
Types of Clustering Algorithm
1. Partitioning Algorithm
2. Hierarchical Clustering Algorithm
3. Clustering Algorithms for Categorical Data
Different Types of Partitioning Algorithms
1. K-Means Algorithm
2. K-Medoid Algorithm
3. PAM
4. CLARA
5. CLARANS
K-Means & K-Medoid Algorithm

• K-means algorithms, where each cluster is represented by the center of gravity (mean) of
the cluster.

• K-medoid algorithms, where each cluster is represented by one of the objects of the cluster
located near its center.
K-Means
• Initial set of clusters randomly chosen.
• Iteratively, items are moved among sets of
clusters until the desired set is reached.
• High degree of similarity among elements in a
cluster is obtained.
• Given a cluster Ki={ti1,ti2,…,tim}, the cluster
mean is mi = (1/m)(ti1 + … + tim)
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k = 2
• Randomly assign means: m1 = 3, m2 = 4
• K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16
• K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18
• K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6
• K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25
• A further pass leaves the assignments unchanged, so the algorithm stops (see the sketch
below).
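
The iteration above can be reproduced with a few lines of Python. The following is a minimal
one-dimensional k-means sketch (illustrative, not from the slides); the function name and the
fixed iteration cap are assumptions.

def k_means_1d(points, means, max_iters=20):
    """Assign each point to its nearest mean, recompute the means, and
    repeat until the means stop changing (basic k-means loop)."""
    for _ in range(max_iters):
        clusters = [[] for _ in means]
        for p in points:
            idx = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[idx].append(p)
        new_means = [sum(c) / len(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:           # converged: no mean moved
            break
        means = new_means
    return clusters, means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = k_means_1d(data, means=[3, 4])
print(clusters, means)   # final clusters {2,3,4,10,11,12} and {20,30,25}, means 7 and 25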
K-Means Algorithm
K-Medoid
• K-medoid is a classical partitioning technique of clustering that clusters a data set of n
objects into k clusters, with k known a priori.

• It is more robust to noise and outliers than k-means.

• A medoid can be defined as the object of a cluster whose average dissimilarity to all the
objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster.
K-medoids
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the
medoid shift algorithm. Both the k-means and k-medoids algorithms are partitional (they
break the dataset up into groups) and both attempt to minimize the distance between the
points labeled to be in a cluster and a point designated as the center of that cluster. In
contrast to the k-means algorithm, k-medoids chooses actual data points as centers (medoids
or exemplars).
k-medoid clustering algorithm
1. The algorithm begins with an arbitrary selection of k objects as medoid points out of the n
data points (n > k).
2. After selection of the k medoid points, associate each data object in the given data set with
its most similar medoid. Similarity here is defined using a distance measure, which can be
Euclidean, Manhattan or Minkowski distance.
3. Randomly select a non-medoid object O′.
4. Compute the total cost S of swapping an initial medoid object with O′.
5. If S < 0, swap the initial medoid with the new one (if S < 0 there will be a new set of
medoids).
6. Repeat steps 2 to 5 until there is no change in the medoids. (A code sketch of these steps
is given below.)
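
A compact way to see these steps is the random-swap sketch below (illustrative Python, not
from the slides; the function names, the Manhattan metric and the fixed number of trial swaps
are assumptions). A production implementation would keep trying swaps until none of them
lowers the cost.

import random

def manhattan(a, b):
    # distance measure used for the 'similarity' in step 2
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    # sum of each point's distance to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, trials=200, seed=0):
    """Steps 1-6 above: pick k medoids, then repeatedly try swapping one
    medoid with a random non-medoid and keep the swap only if it lowers
    the total cost."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)                    # step 1
    cost = total_cost(points, medoids)                 # cost of the initial choice (step 2)
    for _ in range(trials):                            # steps 3-6
        candidate = rng.choice([p for p in points if p not in medoids])
        trial = medoids[:]
        trial[rng.randrange(k)] = candidate            # swap one medoid with O'
        trial_cost = total_cost(points, trial)
        if trial_cost < cost:                          # S < 0: keep the improvement
            medoids, cost = trial, trial_cost
    return medoids, cost

Calling k_medoids on the ten-object data set in the demonstration below is expected to settle
on a low-cost pair of medoids such as (3, 4) and (7, 4), the pair reached in the walkthrough.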
Demonstration of algorithm

• Cluster the following data set of ten objects into two clusters, i.e. k = 2.
• Consider the data set of ten objects as follows:

X1  (2, 6)
X2  (3, 4)
X3  (3, 8)
X4  (4, 7)
X5  (6, 2)
X6  (6, 4)
X7  (7, 3)
X8  (7, 4)
X9  (8, 5)
X10 (7, 6)
Distribution of the data
Step 1 of Algorithm
• Step 1
• Initialize k centres.
• Let us assume c1 = (3,4) and c2 = (7,4).
• So here c1 and c2 are selected as the medoids.
• Calculate the distances so as to associate each data object with its nearest medoid. Cost is
calculated using the Minkowski distance metric with r = 1 (i.e. Manhattan distance).
Minkowski distance from c1 – Step 1
c1        Data object Xi   Cost (distance)
(3, 4)    (2, 6)           3
(3, 4)    (3, 8)           4
(3, 4)    (4, 7)           4
(3, 4)    (6, 2)           5
(3, 4)    (6, 4)           3
(3, 4)    (7, 3)           5
(3, 4)    (8, 5)           6
(3, 4)    (7, 6)           6
Minkowski distance from c2 – Step 1
c2        Data object Xi   Cost (distance)
(7, 4)    (2, 6)           7
(7, 4)    (3, 8)           8
(7, 4)    (4, 7)           6
(7, 4)    (6, 2)           3
(7, 4)    (6, 4)           1
(7, 4)    (7, 3)           1
(7, 4)    (8, 5)           2
(7, 4)    (7, 6)           2
Formation of Cluster
• So the clusters become:
• Cluster 1 = {(3,4), (2,6), (3,8), (4,7)}
• Cluster 2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
• Since the points (2,6), (3,8) and (4,7) are closest to c1, they form one cluster, while the
remaining points form the other cluster.
• The total cost involved is 20, where the cost between any two points is found using the
formula below.
Cost Calculation
• cost(x, c) = Σ_{i=1..d} |x_i − c_i|
• where x is any data object, c is the medoid, and d is the dimension of the objects, which in
this case is 2.
• The total cost is the sum of the costs of the data objects from the medoids of their clusters,
so here:
• Total cost = cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7)) + cost((7,4),(6,2)) +
cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6))
• = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2
• = 20
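
As a quick check of the Step 1 cost, the short Python snippet below (an illustrative sketch,
not from the slides) recomputes the total cost for the medoids c1 = (3,4) and c2 = (7,4):

# All ten objects and the Step 1 medoids
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
medoids = [(3, 4), (7, 4)]

def cost(x, c):
    # Minkowski distance with r = 1 (Manhattan), as in the formula above
    return sum(abs(xi - ci) for xi, ci in zip(x, c))

total = sum(min(cost(p, m) for m in medoids) for p in points if p not in medoids)
print(total)   # prints 20, matching the hand calculation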
Cluster Formed
Step 2 of Algorithm
• Step 2
• Randomly select a non-medoid object O′.
• Let us assume O′ = (7,3).
• So now the candidate medoids are c1 = (3,4) and O′ = (7,3).
• With c1 and O′ as the new medoids, calculate the total cost involved using the formula from
Step 1.
Minkowski distance from c1 – Step 2
c1        Data object Xi   Cost (distance)
(3, 4)    (2, 6)           3
(3, 4)    (3, 8)           4
(3, 4)    (4, 7)           4
(3, 4)    (6, 2)           5
(3, 4)    (6, 4)           3
(3, 4)    (7, 4)           4
(3, 4)    (8, 5)           6
(3, 4)    (7, 6)           6
Minkowski distance from O′ – Step 2
O′        Data object Xi   Cost (distance)
(7, 3)    (2, 6)           8
(7, 3)    (3, 8)           9
(7, 3)    (4, 7)           7
(7, 3)    (6, 2)           2
(7, 3)    (6, 4)           2
(7, 3)    (7, 4)           1
(7, 3)    (8, 5)           3
(7, 3)    (7, 6)           3
Cluster Formed after Step 2
Cost Calculation in Step 2
• Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22
• So the cost of swapping the medoid from c2 to O′ is
• S = current total cost − past total cost
  = 22 − 20
  = 2 > 0
– Since S > 0, moving to O′ would be a bad idea, so the previous choice was good and the
algorithm terminates here (i.e. there is no change in the medoids).
– Some data points may shift from one cluster to another depending on their closeness to the
medoids.
K-Medoid Clustering
• There are three well-known algorithms for K-Medoids clustering:
• PAM (Partitioning Around Medoids)
• CLARA (Clustering Large Applications)
• CLARANS (Clustering Large Applications based on Randomized Search)
• PAM is the most powerful of the three algorithms but has the disadvantage of its time
complexity. The following K-Medoids example is performed using PAM. In the later parts
we'll see what CLARA and CLARANS are.
PAM Algorithm
Given the value of k and unlabelled data:

1. Choose k random points from the data and assign them to k clusters. These are the initial
medoids.
2. For all the remaining data points, calculate the distance from each medoid and assign each
point to the cluster with the nearest medoid.
3. Calculate the total cost (the sum of the distances of all the data points from their medoids).
4. Select a random point as a new medoid and swap it with a previous medoid. Repeat steps 2
and 3.
5. If the total cost with the new medoid is less than that with the previous medoid, make the
new medoid permanent and repeat step 4.
6. If the total cost with the new medoid is greater than the cost with the previous medoid,
undo the swap and repeat step 4.
7. The repetitions continue until no change in the medoids improves the assignment of the
data points.
An example of PAM Algorithm:
      x   y
0     5   4
1     7   7
2     1   3
3     8   6
4     4   9
Scatter plot:
PAM Clustering
If k is given as 2, we need to break down the
data points into 2 clusters.

Initial medoids: M1(1, 3) and M2(4, 9)


Calculation of distances
Manhattan Distance: |x1 - x2| + |y1 - y2|
PAM Clustering (Cont..)
      x   y   From M1(1, 3)   From M2(4, 9)
0     5   4   5               6
1     7   7   10              5
2     1   3   -               -
3     8   6   10              7
4     4   9   -               -
PAM Clustering (Cont..)
Cluster 1: 0
Cluster 2: 1, 3
Calculation of total cost:
(5) + (5 + 7) = 17
Random medoid: (5, 4)
M1(5, 4) and M2(4, 9):
PAM Clustering (Cont..)
      x   y   From M1(5, 4)   From M2(4, 9)
0     5   4   -               -
1     7   7   5               5
2     1   3   5               9
3     8   6   5               7
4     4   9   -               -
PAM Clustering(Cont..)
Cluster 1: 2, 3
Cluster 2: 1
Calculation of total cost:
(5 + 5) + 5 = 15
Less than the previous cost
New medoid: (5, 4).
Random medoid: (7, 7)
M1(5, 4) and M2(7, 7)
PAM Clustering(Cont..)
      x   y   From M1(5, 4)   From M2(7, 7)
0     5   4   -               -
1     7   7   -               -
2     1   3   5               10
3     8   6   5               2
4     4   9   6               5
PAM Clustering(Cont..)
Cluster 1: 2
Cluster 2: 3, 4
Calculation of total cost:
(5) + (2 + 5) = 12
Less than the previous cost
New medoid: (7, 7).
Random medoid: (8, 6)
M1(7, 7) and M2(8, 6)
PAM Clustering(Cont..)
      x   y   From M1(7, 7)   From M2(8, 6)
0     5   4   5               5
1     7   7   -               -
2     1   3   10              10
3     8   6   -               -
4     4   9   5               7
PAM Clustering(Cont..)
Cluster 1: 4
Cluster 2: 0, 2
Calculation of total cost:
(5) + (5 + 10) = 20
Greater than the previous cost
UNDO

Hence, the final medoids: M1(5, 4) and M2(7, 7)

Cluster 1: 2
Cluster 2: 3, 4
Total cost: 12
Clusters:
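
The whole walkthrough can be verified with a few lines of Python (an illustrative sketch, not
part of the slides), recomputing the total cost for each pair of candidate medoids that was
tried:

# The five points of the example and the Manhattan distance used above
points = [(5, 4), (7, 7), (1, 3), (8, 6), (4, 9)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(m1, m2):
    # sum of each non-medoid point's distance to its nearest medoid
    return sum(min(manhattan(p, m1), manhattan(p, m2))
               for p in points if p not in (m1, m2))

for m1, m2 in [((1, 3), (4, 9)), ((5, 4), (4, 9)), ((5, 4), (7, 7)), ((7, 7), (8, 6))]:
    print(m1, m2, total_cost(m1, m2))   # prints costs 17, 15, 12 and 20, as in the example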
CLARA (Clustering Large Applications)
• The difference between the PAM and CLARA algorithms is that the latter is based upon
sampling. Only a small portion of the real data is chosen as a representative of the data, and
medoids are chosen from this sample using PAM.

• The idea is that if the sample is selected in a fairly random manner, then it correctly
represents the whole dataset, and therefore the representative objects (medoids) chosen will
be similar to those that would be chosen from the whole dataset.

• CLARA draws several samples and outputs the best clustering among these samples.
CLARA can deal with larger datasets than PAM.
CLARA Algorithm
Input: database D of objects.
repeat m times:
    draw a random sample S (a subset of D) from D
    call PAM(S, k) to get k medoids
    classify the entire data set D into C1, C2, ..., Ck
    calculate the quality of the clustering as the average dissimilarity
end
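
In Python, the CLARA loop can be sketched as follows (illustrative only, not from the slides).
It assumes some pam(sample, k) function that returns k medoids for a sample, for example the
k_medoids sketch shown earlier, and uses Manhattan distance for the dissimilarity.

import random

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def clara(points, k, pam, num_samples=5, sample_size=40, seed=0):
    """Run PAM on several random samples of the data and keep the medoids
    that give the lowest average dissimilarity over the full data set."""
    rng = random.Random(seed)
    best_medoids, best_quality = None, float("inf")
    for _ in range(num_samples):                      # repeat m times
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k)                      # PAM sees only the sample
        # classify the entire data set against these medoids and score it
        quality = sum(min(manhattan(p, m) for m in medoids) for p in points) / len(points)
        if quality < best_quality:
            best_medoids, best_quality = medoids, quality
    return best_medoids, best_quality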
CLARANS (Clustering Large Applications
Based on Randomized Search)
It is similar to PAM and CLARA, but it applies a randomized
Iterative-Optimization for the determination of medoids. It is
easy to see that in PAM, at every iteration we examine k(N-k)
swaps to determine the pair corresponding to the minimum
cost. On the other hand, CLARA tries to examine fewer
elements by restricting its search to a smaller sample of the
database. Thus, if the sample size is S<=N, it examines at most
k(S-k) pairs at every iteration. Thus, in CLARA, an object can
be a medoid only if it is in the randomly selected sample. But
CLARANS does not restrict the search to any particular subset
of objects.
CLARANS (Clustering Large Applications
Based on Randomized Search)
Nor does it search the entire data set. It proceeds like PAM but randomly selects a few pairs
(i, h) for swapping at the current state, instead of examining all pairs.
CLARANS draws a sample with some randomness in every phase of the search.
The clustering process can be presented as searching a graph where each node is a possible
solution, i.e. a set of k medoids. The clustering obtained after replacing a single medoid is
called a neighbor of the current clustering.
Hierarchical Clustering Algorithm
The construction of a hierarchical agglomerative classification can be achieved by the
following general algorithm.

1. Find the two closest objects and merge them into a cluster.

2. Find and merge the next two closest points, where a point is either an individual object or
a cluster of objects.

3. If more than one cluster remains, return to step 2.

Hierarchical Agglomerative Algorithm
• The agglomerative algorithm follows a bottom-up strategy, treating each object as its own
cluster and iteratively merging clusters until a single cluster is formed or a terminating
condition is satisfied.
• According to some similarity measure, the merging is done by choosing the closest clusters
first.
• A dendrogram, which is a tree-like structure, is used to represent hierarchical clustering.
• Individual objects are represented by leaf nodes and clusters by internal nodes of the tree.
Linkage Criteria
A linkage criterion determines the distance between sets of observations as a function of the
pairwise distances between observations.

In Single Linkage, the distance between two clusters is the minimum distance between
members of the two clusters.

In Complete Linkage, the distance between two clusters is the maximum distance between
members of the two clusters.

In Average Linkage, the distance between two clusters is the average of all distances between
members of the two clusters.

In Centroid Linkage, the distance between two clusters is the distance between their
centroids.
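
The four criteria differ only in how they reduce the set of pairwise distances to a single
number. A small illustrative Python sketch (not from the slides; function names are
assumptions) makes the difference explicit:

import math
from itertools import product

def dist(a, b):
    return math.dist(a, b)                       # Euclidean distance between two points

def single_link(A, B):                           # minimum pairwise distance
    return min(dist(a, b) for a, b in product(A, B))

def complete_link(A, B):                         # maximum pairwise distance
    return max(dist(a, b) for a, b in product(A, B))

def average_link(A, B):                          # mean of all pairwise distances
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def centroid_link(A, B):                         # distance between cluster centroids
    ca = [sum(c) / len(A) for c in zip(*A)]
    cb = [sum(c) / len(B) for c in zip(*B)]
    return dist(ca, cb)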
Agglomerative Algorithm: Single Link
Sample No X y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30
Distance Matrix
P1 P2 P3 P4 P5 P6

P1 0

P2 0.23 0

P3 0.22 0.14 0

P4 0.37 0.19 0.13 0

P5 0.34 0.14 0.28 0.23 0

P6 0.24 0.24 0.10 0.22 0.39 0


Merging the objects
Merging the two closest members of the two clusters and finding
the minimum element in distance matrix, we get:
Merging the objects
Here the minimum value is 0.10, so we combine P3 and P6. Now form the cluster
corresponding to this minimum value and update the distance matrix. To update the distance
matrix:

min((P3,P6),P1) = min((P3,P1),(P6,P1)) = 0.22
min((P3,P6),P2) = min((P3,P2),(P6,P2)) = 0.14
min((P3,P6),P4) = min((P3,P4),(P6,P4)) = 0.13
min((P3,P6),P5) = min((P3,P5),(P6,P5)) = 0.28
Merging the objects
P1 P2 P3,P6 P4 P5

P1 0

P2 0.23 0

P3,P6 0.22 0.14 0

P4 0.37 0.19 0.13 0

P5 0.34 0.14 0.28 0.23 0


Merging the objects
• Merge the two closest members of the two clusters by finding the minimum element in the
distance matrix.
• Here the minimum value is 0.13, so we combine {P3, P6} and P4. Now form the cluster
corresponding to this minimum value and update the distance matrix:
min(((P3,P6),P4),P1) = min(((P3,P6),P1),(P4,P1)) = 0.22
min(((P3,P6),P4),P2) = min(((P3,P6),P2),(P4,P2)) = 0.14
min(((P3,P6),P4),P5) = min(((P3,P6),P5),(P4,P5)) = 0.23
Merging the objects…(Cont)
P1 P2 P3,P6,P4 P5

P1 0

P2 0.23 0

P3,P6,P4 0.22 0.14 0

P5 0.34 0.14 0.23 0


Merging the objects…(Cont)
• Merge the two closest members of the two clusters by finding the minimum element in the
distance matrix.
• Here the minimum value is 0.14, so we combine P2 and P5. Now form the cluster
corresponding to this minimum value and update the distance matrix:
• min((P2,P5),P1) = min((P2,P1),(P5,P1)) = 0.23
• min((P2,P5),(P3,P6,P4)) = min((P2,(P3,P6,P4)),(P5,(P3,P6,P4))) = 0.14
Merging the objects…(Cont)
P1 P2,P5 P3,P6,P4

P1 0

P2,P5 0.23 0

P3,P6,P4 0.22 0.14 0


Merging the objects…(Cont)
• Here the minimum value is 0.14, so we combine {P2, P5} and {P3, P6, P4}. Now form the
cluster corresponding to this minimum value and update the distance matrix:

• min((P2,P5,P3,P6,P4),P1) = min(((P2,P5),P1),((P3,P6,P4),P1)) = min(0.23, 0.22) = 0.22
Merging the objects…(Cont)
P1 P2,P5,P3,P6,P4

P1 0

P2,P5,P3,P6,P4 0.22 0
Dendrogram
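
If SciPy is available, the whole single-link computation (and the dendrogram above) can be
reproduced from the distance matrix in a few lines. The snippet below is an illustrative
sketch, not part of the slides; it feeds in the rounded distance matrix from the table above.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Pairwise distance matrix for P1..P6 from the table above (rounded values)
D = np.array([
    [0.00, 0.23, 0.22, 0.37, 0.34, 0.24],
    [0.23, 0.00, 0.14, 0.19, 0.14, 0.24],
    [0.22, 0.14, 0.00, 0.13, 0.28, 0.10],
    [0.37, 0.19, 0.13, 0.00, 0.23, 0.22],
    [0.34, 0.14, 0.28, 0.23, 0.00, 0.39],
    [0.24, 0.24, 0.10, 0.22, 0.39, 0.00],
])

Z = linkage(squareform(D), method="single")   # single-link agglomerative clustering
print(np.round(Z, 2))
# The merge heights 0.10, 0.13, 0.14, 0.14 and 0.22 reproduce the hand computation
# above (the two merges at height 0.14 may occur in either order).
# dendrogram(Z) draws the tree shown on the Dendrogram slide (requires matplotlib).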
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
It is an unsupervised machine learning algorithm that forms clusters based upon the density
of the data points, i.e. how close the data points are to one another. Points that lie outside
the dense regions are excluded and treated as noise or outliers.

This characteristic of the DBSCAN algorithm makes it a good fit for outlier detection and for
finding clusters of arbitrary shape. Algorithms like K-Means lack this property: they make
only spherical clusters and are very sensitive to outliers. By sensitivity it is meant that the
sphere-shaped clusters made by K-Means can easily be distorted by the introduction of a
single outlier, because outliers are included in the clusters too.
A two-dimensional data set for easy visualization
• A two-dimensional data set is presented for easy visualization and understanding, though
DBSCAN can handle multi-dimensional data too. The possible clusters in the data have been
marked in the graph to visualize the clusters that we want. The points (1,5), (4,3) and (5,6)
fall outside the markings and hence should be treated as outliers. The DBSCAN algorithm
should make exactly these clusters and exclude the outliers, as we did in the graph. Let's first
understand the algorithm and the various steps involved in it.
Data Visualisation
Logic and Steps:
• The DBSCAN algorithm takes two input parameters: the radius around each point (eps)
and the minimum number of data points that should lie around a point within that radius
(MinPts). For example, consider the point (1.5, 2.5): if we take eps = 0.3, then the circle
around the point with radius 0.3 contains only one other point, (1.2, 2.5), as shown below.
Logic and Steps(cont..)
DBSCAN Algorithm
• Hence for (1.5, 2.5) with eps = 0.3, the number of neighbourhood points is just one. In
DBSCAN each point is checked against these two parameters and the clustering decision is
made through the steps below (a code sketch follows the list):
1. Choose values for eps and MinPts.
2. For a particular data point x, calculate its distance from every other data point.
3. Find all the neighbourhood points of x which fall inside the circle of radius eps, i.e. whose
distance from x is smaller than or equal to eps.
4. Mark x as visited. If the number of neighbourhood points around x is greater than or equal
to MinPts, treat x as a core point; if it is not yet assigned to any cluster, create a new cluster
and assign x to it.
5. If the number of neighbourhood points around x is less than MinPts and x has a core point
in its neighbourhood, treat x as a border point.
6. Include all the density-connected points in a single cluster. (What density-connected points
are is described later.)
7. Repeat the above steps for every unvisited point in the data set and find all core, border
and outlier points.
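
In practice these steps are usually run through a library. The snippet below is an illustrative
sketch using scikit-learn (assumed to be installed) on a few made-up 2-D points, not the exact
data set of the walkthrough; note that scikit-learn's min_samples counts the point itself, so
MinPts = 4 neighbours corresponds to min_samples = 5.

import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative 2-D points: two dense groups plus a few isolated points
X = np.array([[1.0, 2.0], [1.0, 2.5], [1.2, 2.5], [1.5, 2.5], [1.1, 2.2],
              [2.8, 4.5], [3.0, 4.4], [2.9, 4.8], [2.6, 4.6], [2.7, 4.3],
              [1.0, 5.0], [4.0, 3.0], [5.0, 6.0]])

# eps = 0.6 as in the walkthrough; min_samples = MinPts + 1 because the
# point itself is counted by scikit-learn
model = DBSCAN(eps=0.6, min_samples=5).fit(X)
print(model.labels_)   # one cluster index per point; -1 marks noise/outliers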
DBSCAN Algorithm
Algorithm in action
• Let's now apply the DBSCAN algorithm to the above dataset to find the clusters. We first
have to choose the values for eps and MinPts: let's choose eps = 0.6 and MinPts = 4. Let's
consider the first data point in the dataset, (1, 2), and calculate its distance from every other
data point in the data set. The calculated values are shown below:
Algorithm in action (Cont..)
As evident from the above table, the point (1, 2) has only two other points in its
neighbourhood, (1, 2.5) and (1.2, 2.5), for the assumed value of eps. As this is less than
MinPts, we cannot declare it a core point. Let's repeat the above process for every point in
the dataset and find the neighbourhood of each. The calculations, when repeated, can be
summarized as below:
Algorithm in action (Cont..)
Algorithm in action (Cont..)
• Observe the above table carefully: the left-most column contains all the points in our data
set. To the right of each point are the data points in its neighbourhood, i.e. the points whose
distance from it is less than or equal to the eps value. There are three points in the data set,
(2.8, 4.5), (1.2, 2.5) and (1, 2.5), that have 4 neighbourhood points around them, hence they
are called core points, and, as already mentioned, if a core point is not assigned to any
cluster, a new cluster is formed. Hence (2.8, 4.5) is assigned to a new cluster, Cluster 1, and
so is the point (1.2, 2.5), to Cluster 2. Also observe that the core points (1.2, 2.5) and
(1, 2.5) share at least one common neighbourhood point, (1, 2), so they are assigned to the
same cluster. The table below shows the categorization of all the data points into core,
border and outlier points.
Algorithm in action (Cont..)
Algorithm in action (Cont..)
• There are three types of points in the dataset as detected by the DBSCAN algorithm: core,
border and outlier points. Every core point is assigned to a new cluster, unless it shares
neighbourhood points with another core point, in which case both are included in the same
cluster. Every border point is assigned to a cluster based upon the core point in its
neighbourhood; e.g. the first point (1, 2) is a border point and has a core point (1.2, 2.5) in
its neighbourhood, which is included in Cluster 2, hence the point (1, 2) is included in
Cluster 2 too. The whole categorization can be summarized as below:
Categorization of Clusters
Reachability
• Directly density-reachable: An object (or instance) q is directly density-reachable from an
object p if q is within the ε-neighborhood of p and p is a core object.
• Here direct density-reachability is not symmetric: object p is not directly density-reachable
from object q, as q is not a core object.
Reachability (Cont..)
Density-reachable: An object q is density-reachable from p w.r.t. ε and MinPts if there is a
chain of objects q1, q2, ..., qn, with q1 = p and qn = q, such that q(i+1) is directly
density-reachable from qi w.r.t. ε and MinPts for all 1 <= i <= n-1.
Here density-reachability is not symmetric: as q is not a core point, q(n-1) is not directly
density-reachable from q, so object p is not density-reachable from object q.
Connectivity
Density-connectivity: Object q is density-connected to object p w.r.t. ε and MinPts if there is
an object o such that both p and q are density-reachable from o w.r.t. ε and MinPts.
Density-connectivity is symmetric: if object q is density-connected to object p, then object p
is also density-connected to object q.
