K-NN Algorithm and Clustering Analysis
Prof. (Dr.) Soumen Paul
Professor, Dept. of Information Technology
Haldia Institute of Technology
Haldia
Instance-based learning
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on this similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
Example: A factory has survey data for four paper tissues. Each tissue is described by two attributes, X1 = acid durability and X2 = strength (kg/m²), and is classified as Good or Bad.
Table 1
X1 = Acid Durability   X2 = Strength (kg/m²)   Y = Classification
7                      7                       Bad
7                      4                       Bad
3                      4                       Good
1                      4                       Good
Now the factory produces a new paper tissue that has X1 = 3 and X2 = 7.
Without another expensive survey, can we classify the new tissue?
Solution
We can classify the new tissue by carrying out the following five steps.
Step 1: Determine the parameter K = the number of nearest neighbors. Let us take K = 3.
Step 2: Calculate the distance between the query instance (3, 7) and all the training samples, as shown in Table 2.
Step 3: Sort the distances and determine the nearest neighbors based on the K-th minimum distance, as shown in Table 3.
Step 4: Gather the category Y of the nearest neighbors. The categories are available in Table 1. The result is shown in Table 4.
Computation of Square Distance to (3, 7)
Table 2
X1 = Acid Durability   X2 = Strength (kg/m²)   Square distance to (3, 7)
7                      7                       (7-3)² + (7-7)² = 16
7                      4                       (7-3)² + (4-7)² = 25
3                      4                       (3-3)² + (4-7)² = 9
1                      4                       (1-3)² + (4-7)² = 13
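The squared distances in Table 2 can be reproduced with a few lines of Python (a minimal sketch; the names are illustrative, not from the slides):

```python
# Training samples: (X1 = acid durability, X2 = strength, class label)
samples = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
query = (3, 7)  # the new tissue

# Squared Euclidean distance of each training sample to the query instance
sq_dist = [(x1 - query[0]) ** 2 + (x2 - query[1]) ** 2 for x1, x2, _ in samples]
print(sq_dist)  # [16, 25, 9, 13]
```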
Determination of nearest neighbors based on the K-th minimum distance
Table 3
X1 = Acid Durability   X2 = Strength (kg/m²)   Square distance to (3, 7)   Rank based on distance   Included in neighborhood
7                      7                       16                          3                        Yes
7                      4                       25                          4                        No
3                      4                       9                           1                        Yes
1                      4                       13                          2                        Yes
Determination of Category Y of the nearest neighbors
Table 4
X1 = Acid Durability   X2 = Strength (kg/m²)   Square distance to (3, 7)   Rank based on distance   Included in neighborhood   Category
7                      7                       16                          3                        Yes                        Bad
7                      4                       25                          4                        No                         -----
3                      4                       9                           1                        Yes                        Good
1                      4                       13                          2                        Yes                        Good
Solution (Cont..)
• Step 5: Use the simple majority of the categories of the nearest neighbors as the prediction for the query instance. We have 2 "Good" neighbors and 1 "Bad" neighbor. Since 2 > 1, by majority voting the new tissue is classified into the "Good" category.
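Putting the five steps together, a minimal K-NN sketch in Python might look like the following (function and variable names are illustrative, not from the slides):

```python
from collections import Counter

def knn_classify(samples, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.
    `samples` is a list of (x1, x2, label) tuples."""
    # Step 2: squared distance from the query to every training sample
    dists = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, label)
             for x1, x2, label in samples]
    # Step 3: sort by distance and keep the k nearest neighbours
    neighbours = sorted(dists)[:k]
    # Steps 4-5: gather their categories and take the majority vote
    labels = [label for _, label in neighbours]
    return Counter(labels).most_common(1)[0][0]

samples = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
print(knn_classify(samples, (3, 7), k=3))  # prints "Good"
```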
Pros and Cons of K-NN
Pros: K-NN is simple to implement, there is no explicit training phase, and new training data can be added at any time.
Cons: a suitable value of K has to be chosen, and classifying a query is computationally expensive because its distance to every stored sample must be computed.

The PAM (K-Medoids) Algorithm:
1. Choose k random points from the data and assign each of these k points to one of the k clusters. These are the initial medoids.
2. For each of the remaining data points, calculate its distance from every medoid and assign it to the cluster with the nearest medoid.
3. Calculate the total cost (the sum of the distances from all the data points to their nearest medoids).
4. Select a random non-medoid point as a candidate new medoid and swap it with one of the previous medoids. Repeat steps 2 and 3.
5. If the total cost with the new medoid is less than that with the previous medoid, make the new medoid permanent and repeat step 4.
6. If the total cost with the new medoid is greater than that with the previous medoid, undo the swap and repeat step 4.
7. The repetitions continue until no swap of medoids changes the assignment of the data points any further, i.e., no swap reduces the total cost. (A Python sketch of these steps follows this list.)
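A compact Python sketch of these steps, assuming Manhattan distance and a fixed number of random swap attempts rather than an explicit convergence test (all names are illustrative):

```python
import random

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(points, medoids):
    # Step 3: each point contributes its distance to the nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k, max_swaps=100):
    medoids = random.sample(points, k)            # Step 1: random initial medoids
    cost = total_cost(points, medoids)
    for _ in range(max_swaps):                    # Steps 4-7: try random swaps
        candidate = random.choice([p for p in points if p not in medoids])
        i = random.randrange(k)
        trial = medoids[:i] + [candidate] + medoids[i + 1:]
        trial_cost = total_cost(points, trial)
        if trial_cost < cost:                     # Step 5: keep a cheaper swap
            medoids, cost = trial, trial_cost     # Step 6: otherwise undo (do nothing)
    # Step 2: final assignment of every point to its nearest medoid
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: manhattan(p, m))
        clusters[nearest].append(p)
    return medoids, clusters, cost

points = [(5, 4), (7, 7), (1, 3), (8, 6), (4, 9)]
print(pam(points, k=2))
```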
An example of the PAM Algorithm:
Point   x   y
0       5   4
1       7   7
2       1   3
3       8   6
4       4   9
Scatter plot:
PAM Clustering
If k is given as 2, we need to break the data points into 2 clusters. Let the initial medoids be M1(1, 3) and M2(4, 9); all distances below are Manhattan distances.
Point   x   y   From M1(1, 3)   From M2(4, 9)
0       5   4   5               6
1       7   7   10              5
2       1   3   -               -
3       8   6   10              7
4       4   9   -               -
PAM Clustering (Cont..)
Cluster 1: 0
Cluster 2: 1, 3
Calculation of total cost:
(5) + (5 + 7) = 17
Randomly selected candidate medoid: (5, 4), swapped in for M1(1, 3).
New medoids to test: M1(5, 4) and M2(4, 9):
PAM Clustering (Cont..)
Point   x   y   From M1(5, 4)   From M2(4, 9)
0       5   4   -               -
1       7   7   5               5
2       1   3   5               9
3       8   6   5               7
4       4   9   -               -
PAM Clustering (Cont..)
Cluster 1: 2, 3
Cluster 2: 1
Calculation of total cost:
(5 + 5) + 5 = 15
This is less than the previous cost (17), so the swap is kept.
New medoid: (5, 4).
Next randomly selected candidate medoid: (7, 7), swapped in for M2(4, 9).
New medoids to test: M1(5, 4) and M2(7, 7):
PAM Clustering (Cont..)
Point   x   y   From M1(5, 4)   From M2(7, 7)
0       5   4   -               -
1       7   7   -               -
2       1   3   5               10
3       8   6   5               2
4       4   9   6               5
PAM Clustering (Cont..)
Cluster 1: 2
Cluster 2: 3, 4
Calculation of total cost:
(5) + (2 + 5) = 12
This is less than the previous cost (15), so the swap is kept.
New medoid: (7, 7).
Next randomly selected candidate medoid: (8, 6), swapped in for the medoid (5, 4).
New medoids to test: M1(7, 7) and M2(8, 6):
PAM Clustering (Cont..)
Point   x   y   From M1(7, 7)   From M2(8, 6)
0       5   4   5               5
1       7   7   -               -
2       1   3   10              10
3       8   6   -               -
4       4   9   5               7
PAM Clustering (Cont..)
Cluster 1: 4
Cluster 2: 0, 2
Calculation of total cost:
(5) + (5 + 10) = 20
This is greater than the previous cost (12), so the swap is undone (UNDO).
Final result with medoids M1(5, 4) and M2(7, 7):
Cluster 1: 2
Cluster 2: 3, 4
Total cost: 12
Clusters:
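As a quick check, a small sketch (again assuming Manhattan distance) reproduces the final cost of 12 for the medoids (5, 4) and (7, 7):

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

points = [(5, 4), (7, 7), (1, 3), (8, 6), (4, 9)]
medoids = [(5, 4), (7, 7)]
# Each point contributes its distance to the nearest medoid; the medoids themselves contribute 0
cost = sum(min(manhattan(p, m) for m in medoids) for p in points)
print(cost)  # 5 + 2 + 5 = 12
```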
CLARA (Clustering Large Applications)
• The difference between the PAM and CLARA algorithms is that the latter is based upon sampling: only a small portion of the real data is chosen as a representative of the whole data set, and medoids are chosen from this sample utilizing PAM.
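A rough sketch of the CLARA idea, reusing the pam() and total_cost() helpers from the earlier sketch (the number of samples and the sample size are illustrative choices):

```python
import random

def clara(points, k, n_samples=5, sample_size=40):
    """Run PAM on several random samples and keep the medoids that are
    cheapest when evaluated on the full data set."""
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = random.sample(points, min(sample_size, len(points)))
        medoids, _, _ = pam(sample, k)          # PAM runs on the small sample only
        cost = total_cost(points, medoids)      # ...but the cost uses all the data
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```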
Hierarchical Clustering (Single Linkage): An Example
The distance matrix between the objects (partial; only the first three rows are shown):
      P1     P2     P3
P1    0
P2    0.23   0
P3    0.22   0.14   0

The two closest objects, P3 and P6, are merged first. Under single linkage, the distance from the merged cluster (P3, P6) to every other object is the minimum of the individual distances:
d((P3,P6), P1) = min(d(P3,P1), d(P6,P1)) = 0.22
d((P3,P6), P2) = min(d(P3,P2), d(P6,P2)) = 0.14
d((P3,P6), P4) = min(d(P3,P4), d(P6,P4)) = 0.13
d((P3,P6), P5) = min(d(P3,P5), d(P6,P5)) = 0.28
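A tiny sketch of this single-linkage update rule; only d(P3, P1) = 0.22 and d(P3, P2) = 0.14 come from the matrix above, while the distances involving P6 are placeholders chosen for illustration:

```python
def single_linkage_distance(dist, cluster_a, cluster_b):
    """Single-linkage distance between two clusters: the smallest pairwise
    distance between an object of cluster_a and an object of cluster_b.
    `dist` maps a frozenset of two object names to their distance."""
    return min(dist[frozenset({p, q})] for p in cluster_a for q in cluster_b)

dist = {
    frozenset({"P3", "P1"}): 0.22, frozenset({"P6", "P1"}): 0.24,  # 0.24 is a placeholder
    frozenset({"P3", "P2"}): 0.14, frozenset({"P6", "P2"}): 0.24,  # 0.24 is a placeholder
}
print(single_linkage_distance(dist, {"P3", "P6"}, {"P1"}))  # 0.22
print(single_linkage_distance(dist, {"P3", "P6"}, {"P2"}))  # 0.14
```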
Merging the objects
After merging P3 and P6, the objects are P1, P2, (P3, P6), P4, P5 and the distance matrix is updated with the single-linkage distances computed above (only the first rows are shown):
      P1     P2
P1    0
P2    0.23   0

The merging continues in the same way: P4 joins the cluster (P3, P6), and P2 and P5 are merged into the cluster (P2, P5):
        P1     P2,P5
P1      0
P2,P5   0.23   0

• When the clusters (P2, P5) and (P3, P6, P4) are merged, the distance from the resulting cluster to P1 is
d((P2,P5,P3,P6,P4), P1) = min(d((P2,P5), P1), d((P3,P6,P4), P1)) = min(0.23, 0.22) = 0.22.

Merging the objects (Cont..)
                  P1     P2,P5,P3,P6,P4
P1                0
P2,P5,P3,P6,P4    0.22   0
Dendrogram
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
It is an unsupervised machine learning algorithm that forms clusters based upon the density of the data points, i.e., how close the data points are to one another. Points that lie outside the dense regions are excluded and treated as noise or outliers.
The algorithm proceeds in the following steps:
1. Decide the values of the parameters eps and MinPts.
2. Pick an arbitrary unvisited point x in the data set.
3. Find all the neighbourhood points of x, i.e., the points that fall inside the circle of radius eps around x, or simply whose distance from x is smaller than or equal to eps.
4. Treat x as visited. If the number of neighbourhood points around x is greater than or equal to MinPts, treat x as a core point and, if it is not yet assigned to any cluster, create a new cluster and assign x to it.
5. If the number of neighbourhood points around x is less than MinPts but x has a core point in its neighbourhood, treat x as a border point.
6. Include all the density-connected points in a single cluster. (What density-connected points means is described later.)
7. Repeat the above steps for every unvisited point in the data set and find out all the core, border and outlier points. (A Python sketch of these steps follows this list.)
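A minimal, non-optimized Python sketch of these steps (the names are illustrative, not a reference implementation):

```python
import math

def region_query(points, x, eps):
    """Step 3: all points whose distance from x is <= eps (including x itself)."""
    return [p for p in points if math.dist(p, x) <= eps]

def dbscan(points, eps, min_pts):
    labels = {}                        # point -> cluster id, or -1 for noise
    cluster_id = 0
    for p in points:                   # Step 7: visit every point
        if p in labels:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:  # not a core point: mark as noise for now
            labels[p] = -1             # (may later be relabelled as a border point)
            continue
        cluster_id += 1                # Step 4: new cluster around a core point
        labels[p] = cluster_id
        seeds = list(neighbours)
        while seeds:                   # Step 6: absorb all density-connected points
            q = seeds.pop()
            if labels.get(q, -1) == -1:            # unvisited or currently noise
                labels[q] = cluster_id             # border point or new core point
                q_neighbours = region_query(points, q, eps)
                if len(q_neighbours) >= min_pts:   # q is itself a core point:
                    seeds.extend(q_neighbours)     # keep expanding the cluster
    return labels
```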
DBSCAN Algorithm
Algorithm in action
• Let's now apply the DBSCAN algorithm to the above dataset to find the clusters. First we have to choose the values of eps and MinPts. Let's choose eps = 0.6 and MinPts = 4. Let's consider the first data point in the dataset, (1, 2), and calculate its distance from every other data point in the data set. The calculated values are shown below:
Algorithm in action (Cont..)
As evident from the above table, the point (1, 2) has only two other points in its neighbourhood, (1, 2.5) and (1.2, 2.5), for the assumed value of eps. As this is less than MinPts, we cannot declare it a core point. Let's repeat the above process for every point in the dataset and find the neighbourhood of each. The calculations, when repeated, can be summarized as below:
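For example, the neighbourhood of the first point (1, 2) can be checked directly; the point list below contains only the coordinates quoted in the text, not the whole dataset:

```python
import math

some_points = [(1, 2), (1, 2.5), (1.2, 2.5), (2.8, 4.5)]  # only the points quoted above
eps = 0.6
neighbours = [p for p in some_points if math.dist(p, (1, 2)) <= eps]
print(neighbours)  # [(1, 2), (1, 2.5), (1.2, 2.5)] -> only two other points, so (1, 2) is not core
```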
Algorithm in action (Cont..)
• Observe the above table carefully: the left-most column contains all the points in our data set. To the right of each point are the data points in its neighbourhood, i.e., the points whose distance from it is less than or equal to the eps value. There are three points in the data set, (2.8, 4.5), (1.2, 2.5) and (1, 2.5), that have 4 neighbourhood points around them, hence they are called core points, and, as already mentioned, if a core point is not assigned to any cluster, a new cluster is formed. Hence, (2.8, 4.5) is assigned to a new cluster, Cluster 1, and so is the point (1.2, 2.5), Cluster 2. Also observe that the core points (1.2, 2.5) and (1, 2.5) share at least one common neighbourhood point, (1, 2), so they are assigned to the same cluster. The table below shows the categorization of all the data points into core, border and outlier points:
Algorithm in action (Cont..)
• There are three types of points in the dataset as detected by the DBSCAN algorithm: core, border and outlier points. Every core point is assigned to a new cluster unless it shares neighbourhood points with another core point, in which case both are included in the same cluster. Every border point is assigned to a cluster based upon the core point in its neighbourhood; e.g., the first point (1, 2) is a border point and has a core point, (1.2, 2.5), in its neighbourhood, which is included in Cluster 2, hence the point (1, 2) is included in Cluster 2 too. The whole categorization can be summarized as below:
Categorization of Clusters
Reachability
• Directly density-reachable: An object (or instance) q is directly density-reachable from object p if q is within the ε-neighborhood of p and p is a core object.
Here, direct density-reachability is not symmetric: object p is not directly density-reachable from object q, as q is not a core object.
Reachability (Cont..)
• Density-reachable: An object q is density-reachable from p w.r.t. ε and MinPts if there is a chain of objects q1, q2, ..., qn, with q1 = p and qn = q, such that qi+1 is directly density-reachable from qi w.r.t. ε and MinPts for all 1 <= i < n.
Here density-reachability is not symmetric. As q is not a core point, qn-1 is not directly density-reachable from q, so object p is not density-reachable from object q.
Connectivity
• Density-connectivity: Object q is density-connected to object p w.r.t. ε and MinPts if there is an object o such that both p and q are density-reachable from o w.r.t. ε and MinPts.
Here density-connectivity is symmetric: if object q is density-connected to object p, then object p is also density-connected to object q.
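These definitions translate almost directly into code; a small illustrative sketch (the helper names are made up for this note):

```python
import math

def region_query(points, x, eps):
    # ε-neighbourhood of x (including x itself)
    return [p for p in points if math.dist(p, x) <= eps]

def is_core(p, points, eps, min_pts):
    return len(region_query(points, p, eps)) >= min_pts

def directly_density_reachable(q, p, points, eps, min_pts):
    """q is directly density-reachable from p if p is a core object and
    q lies inside the ε-neighbourhood of p (not symmetric in general)."""
    return is_core(p, points, eps, min_pts) and q in region_query(points, p, eps)

def density_reachable(q, p, points, eps, min_pts):
    """There is a chain p = q1, q2, ..., qn = q in which each object is
    directly density-reachable from the previous one (simple graph search)."""
    frontier, seen = [p], {p}
    while frontier:
        current = frontier.pop()
        if current == q:
            return True
        for nxt in points:
            if nxt not in seen and directly_density_reachable(nxt, current, points, eps, min_pts):
                seen.add(nxt)
                frontier.append(nxt)
    return False

def density_connected(q, p, points, eps, min_pts):
    """q and p are density-connected if some object o reaches both (symmetric)."""
    return any(density_reachable(q, o, points, eps, min_pts) and
               density_reachable(p, o, points, eps, min_pts) for o in points)
```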