

SOLVED NUMERICAL EXAMPLES IN MACHINE LEARNING

S.No.  Topic

1      KNN
2      Decision tree
3      METRICS
4      Naïve Bayes
5      K-means clustering
6      Hierarchical clustering
7      Dimensionality reduction techniques (PCA)

KNN Numerical Example (hand computation)



Numerical Example of K Nearest Neighbor Algorithm


Here is a step-by-step procedure for the K-nearest neighbors (KNN) algorithm:
1. Determine the parameter K = the number of nearest neighbors.
2. Calculate the distance between the query instance and all the training samples.
3. Sort the distances and determine the nearest neighbors based on the K-th minimum distance.
4. Gather the category Y of the nearest neighbors.
5. Use the simple majority of the categories of the nearest neighbors as the predicted value for the query instance.

Example
We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:
X1 = Acid Durability   X2 = Strength        Y = Classification
(seconds)              (kg/square meter)

7                      7                    Bad
7                      4                    Bad
3                      4                    Good
1                      4                    Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess what the classification of this new tissue is?

1. Determine the parameter K = the number of nearest neighbors

Suppose we use K = 3.

2. Calculate the distance between the query instance and all the training samples

The coordinates of the query instance are (3, 7). Instead of calculating the Euclidean distance we compute the squared distance, which is faster to calculate (no square root is needed).

X1 = Acid Durability   X2 = Strength        Squared distance to query instance (3, 7)
(seconds)              (kg/square meter)

7                      7                    (7-3)² + (7-7)² = 16
7                      4                    (7-3)² + (4-7)² = 25
3                      4                    (3-3)² + (4-7)² = 9
1                      4                    (1-3)² + (4-7)² = 13

3. Sort the distance and determine nearest neighbors based on the K-th minimum distance

X1 = Acid Durability   X2 = Strength        Squared distance to        Rank (minimum   Included in the 3
(seconds)              (kg/square meter)    query instance (3, 7)      distance)       nearest neighbors?

7                      7                    (7-3)² + (7-7)² = 16       3               Yes
7                      4                    (7-3)² + (4-7)² = 25       4               No
3                      4                    (3-3)² + (4-7)² = 9        1               Yes
1                      4                    (1-3)² + (4-7)² = 13       2               Yes

4. Gather the category Y of the nearest neighbors. Notice in the second row, last column, that the category (Y) of that neighbor is not included because its rank is greater than 3 (= K).

X1 = Acid Durability   X2 = Strength        Squared distance to        Rank (minimum   Included in the 3       Y = Category of
(seconds)              (kg/square meter)    query instance (3, 7)      distance)       nearest neighbors?      nearest neighbor

7                      7                    (7-3)² + (7-7)² = 16       3               Yes                     Bad
7                      4                    (7-3)² + (4-7)² = 25       4               No                      -
3                      4                    (3-3)² + (4-7)² = 9        1               Yes                     Good
1                      4                    (1-3)² + (4-7)² = 13       2               Yes                     Good

5. Use the simple majority of the categories of the nearest neighbors as the predicted value for the query instance

We have 2 Good and 1 Bad. Since 2 > 1, we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 belongs to the Good category.
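The same hand computation can be reproduced with a short Python sketch (a minimal illustration using the four training samples above; the variable names are my own, not part of the original example):

from collections import Counter

# Training samples from the table above: (X1 acid durability, X2 strength, class)
training = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
query = (3, 7)   # the new tissue: X1 = 3, X2 = 7
K = 3

# Step 2: squared distance to every training sample (no square root needed)
distances = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, label)
             for x1, x2, label in training]

# Step 3: sort by distance and keep the K nearest neighbors
neighbors = sorted(distances)[:K]

# Steps 4-5: majority vote over the neighbor labels
prediction = Counter(label for _, label in neighbors).most_common(1)[0][0]
print(neighbors)    # [(9, 'Good'), (13, 'Good'), (16, 'Bad')]
print(prediction)   # Good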

Decision Tree

Data set
For instance, the following table records the factors that influenced the decision to play tennis outside over the previous 14 days.

Day Outlook Temp. Humidity Wind Decision

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

7 Overcast Cool Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No

We can summarize the ID3 algorithm as illustrated below

Entropy(S) = ∑ – p(I) . log2p(I)

Gain(S, A) = Entropy(S) – ∑ [ p(S|A) . Entropy(S|A) ]



Entropy

We need to calculate the entropy first. The Decision column consists of 14 instances with two labels: Yes and No. There are 9 decisions labeled Yes and 5 labeled No.

Entropy(Decision) = – p(Yes) . log2p(Yes) – p(No) . log2p(No)

Entropy(Decision) = – (9/14) . log2(9/14) – (5/14) . log2(5/14) = 0.940

Now, we need to find the most dominant attribute for the decision.

Wind factor on decision


Gain(Decision, Wind) = Entropy(Decision) – ∑ [ p(Decision|Wind) . Entropy(Decision|Wind) ]

Wind attribute has two labels: weak and strong. We would reflect it to the formula.

Gain(Decision, Wind) = Entropy(Decision)
– [ p(Decision|Wind=Weak) . Entropy(Decision|Wind=Weak) ]
– [ p(Decision|Wind=Strong) . Entropy(Decision|Wind=Strong) ]

Now, we need to calculate (Decision|Wind=Weak) and (Decision|Wind=Strong) respectively.

Weak wind factor on decision

Day Outlook Temp. Humidity Wind Decision

1 Sunny Hot High Weak No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

13 Overcast Hot Normal Weak Yes

There are 8 instances for weak wind. The decision is No for 2 of them and Yes for 6, as illustrated below.

1- Entropy(Decision|Wind=Weak) = – p(No) . log2p(No) – p(Yes) . log2p(Yes)

2- Entropy(Decision|Wind=Weak) = – (2/8) . log2(2/8) – (6/8) . log2(6/8) = 0.811

Strong wind factor on decision

Day Outlook Temp. Humidity Wind Decision

2 Sunny Hot High Strong No

6 Rain Cool Normal Strong No

7 Overcast Cool Normal Strong Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

14 Rain Mild High Strong No

Here, there are 6 instances for strong wind. Decision is divided into two equal parts.

1- Entropy(Decision|Wind=Strong) = – p(No) . log2p(No) – p(Yes) . log2p(Yes)

2- Entropy(Decision|Wind=Strong) = – (3/6) . log2(3/6) – (3/6) . log2(3/6) = 1

Now, we can return to the Gain(Decision, Wind) equation.

Gain(Decision, Wind) = Entropy(Decision)
– [ p(Decision|Wind=Weak) . Entropy(Decision|Wind=Weak) ]
– [ p(Decision|Wind=Strong) . Entropy(Decision|Wind=Strong) ]
= 0.940 – [ (8/14) . 0.811 ] – [ (6/14) . 1 ] = 0.048

The calculation for the Wind column is done. Now we need to apply the same calculation to the other columns to find the most dominant factor on the decision.

Other factors on decision

We have applied the same calculation to the other columns.

1- Gain(Decision, Outlook) = 0.246

2- Gain(Decision, Temperature) = 0.029

3- Gain(Decision, Humidity) = 0.151



As seen, the Outlook attribute produces the highest gain. That is why Outlook appears at the root node of the tree.
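These entropy and gain values can be checked with a small Python sketch that applies the Entropy(S) and Gain(S, A) formulas given earlier to the 14-row table (a minimal illustration; function and variable names are my own):

import math
from collections import Counter

# Play-tennis data from the table above: (Outlook, Temp, Humidity, Wind, Decision)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    # Entropy(S) = sum over labels of -p * log2(p)
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(rows, index):
    # Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)
    total, remainder = len(rows), 0.0
    for value in set(r[index] for r in rows):
        subset = [r for r in rows if r[index] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

print(f"Entropy(Decision) = {entropy(data):.3f}")              # 0.940
for name, index in attributes.items():
    print(f"Gain(Decision, {name}) = {gain(data, index):.3f}")
# Outlook ~0.247, Temp ~0.029, Humidity ~0.152, Wind ~0.048; the tiny differences from
# the figures quoted above come from rounding the intermediate entropies. Outlook is largest.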

Root decision on the tree

Now, we need to repeat the process on the subsets of the data set defined by each value of the Outlook attribute.

Overcast outlook on decision

Basically, the decision is always Yes when the outlook is Overcast.

Day Outlook Temp. Humidity Wind Decision

3 Overcast Hot High Weak Yes

7 Overcast Cool Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

Sunny outlook on decision

Day Outlook Temp. Humidity Wind Decision

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

11 Sunny Mild Normal Strong Yes



Here, there are 5 instances for sunny outlook. The decision is No for 3 of them and Yes for 2 (probabilities 3/5 and 2/5).

1- Gain(Outlook=Sunny|Temperature) = 0.570

2- Gain(Outlook=Sunny|Humidity) = 0.970

3- Gain(Outlook=Sunny|Wind) = 0.019

Now, Humidity is chosen as the next split because it produces the highest gain when the outlook is Sunny.

At this point, the decision is always No if the humidity is High.

Day Outlook Temp. Humidity Wind Decision

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

8 Sunny Mild High Weak No

On the other hand, the decision is always Yes if the humidity is Normal.

Day Outlook Temp. Humidity Wind Decision

9 Sunny Cool Normal Weak Yes

11 Sunny Mild Normal Strong Yes

Finally, this means that when the outlook is Sunny, we need to check the humidity to decide.

Rain outlook on decision

Day Outlook Temp. Humidity Wind Decision

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

10 Rain Mild Normal Weak Yes

14 Rain Mild High Strong No



1- Gain(Outlook=Rain | Temperature) = 0.01997309402197489

2- Gain(Outlook=Rain | Humidity) = 0.01997309402197489

3- Gain(Outlook=Rain | Wind) = 0.9709505944546686

Here, Wind produces the highest gain when the outlook is Rain. That is why we need to check the Wind attribute at the second level when the outlook is Rain.

So, it is revealed that the decision is always Yes if the wind is Weak and the outlook is Rain.

Day Outlook Temp. Humidity Wind Decision

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

What's more, the decision is always No if the wind is Strong and the outlook is Rain.

Day Outlook Temp. Humidity Wind Decision

6 Rain Cool Normal Strong No

14 Rain Mild High Strong No

So, the decision tree construction is complete. We can use the following rules for making decisions.

Final version of the decision tree
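The rules of the final tree can also be written as a small function; a minimal sketch (the function name is illustrative, and the rules are the ones derived above):

def play_tennis(outlook, humidity, wind):
    # Rules read off the final tree: Outlook at the root, Humidity under Sunny,
    # Wind under Rain, and Overcast always Yes.
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook: {outlook}")

print(play_tennis("Sunny", "High", "Weak"))    # No  (matches day 1 in the table)
print(play_tennis("Rain", "High", "Strong"))   # No  (matches day 14 in the table)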



METRICS IN ML

Naive Bayes Example


Say you have 1000 fruits which could be either ‘banana’, ‘orange’ or ‘other’.
These are the 3 possible classes of the Y variable.

We have data for the following X variables, all of which are binary (1 or 0):

• Long
• Sweet
• Yellow

The first few rows of the training dataset look like this:

Fruit Long (x1) Sweet (x2) Yellow (x3)


Orange 0 1 0
Banana 1 0 1
Banana 1 1 1
Other 1 1 0

For the sake of computing the probabilities, let’s aggregate the training data
to form a counts table like this.

So the objective of the classifier is to predict if a given fruit is a ‘Banana’ or


‘Orange’ or ‘Other’ when only the 3 features (long, sweet and yellow) are
known.

Let's say you are given a fruit that is Long, Sweet and Yellow; can you predict what fruit it is?

This is the same as predicting Y when only the X variables in the testing data are known. Let's solve it by hand using Naive Bayes.

The idea is to compute the 3 probabilities, that is, the probability of the fruit being a banana, an orange or other. Whichever fruit type gets the highest probability wins.

All the information to calculate these probabilities is present in the above


tabulation.
Step 1: Compute the 'Prior' probabilities for each class of fruit.

That is, the proportion of each fruit class out of all the fruits from the
population. You can provide the ‘Priors’ from prior information about the
population. Otherwise, it can be computed from the training data.

For this case, let’s compute from the training data. Out of 1000 records in
training data, you have 500 Bananas, 300 Oranges and 200 Others. So the
respective priors are 0.5, 0.3 and 0.2.

P(Y=Banana) = 500 / 1000 = 0.50



P(Y=Orange) = 300 / 1000 = 0.30

P(Y=Other) = 200 / 1000 = 0.20

Step 2: Compute the probability of the evidence that goes in the denominator.

This is simply the product of P(X) for all the X features. This is an optional step because the denominator is the same for all the classes and so does not affect the relative probabilities.

P(x1=Long) = 500 / 1000 = 0.50

P(x2=Sweet) = 650 / 1000 = 0.65

P(x3=Yellow) = 800 / 1000 = 0.80

Step 3: Compute the likelihood of the evidence that goes in the numerator.

It is the product of the conditional probabilities of the 3 features. If you refer back to the formula, it says P(X1 | Y=k). Here X1 is 'Long' and k is 'Banana', so this is the probability that the fruit is Long given that it is a Banana. In the above table, you have 500 Bananas; out of those, 400 are long. So, P(Long | Banana) = 400/500 = 0.8.

Here, I have done it for Banana alone.

Probability of Likelihood for Banana

P(x1=Long | Y=Banana) = 400 / 500 = 0.80

P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70

P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90

So, the overall probability of Likelihood of evidence for Banana = 0.8 * 0.7 *
0.9 = 0.504

Step 4: Substitute all the 3 equations into the Naive Bayes formula, to get the
probability that it is a banana.

Similarly, you can compute the probabilities for ‘Orange’ and ‘Other fruit’.
The denominator is the same for all 3 cases, so it’s optional to compute.

Clearly, Banana gets the highest probability, so that will be our predicted
class.
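Using only the numbers given above, the Banana posterior can be checked with a few lines of Python (a minimal sketch; computing the Orange and Other posteriors would need the corresponding counts from the aggregated counts table):

# Numbers taken from the worked example above
prior_banana = 500 / 1000                                      # P(Y = Banana)
likelihood_banana = (400 / 500) * (350 / 500) * (450 / 500)    # P(Long|B) * P(Sweet|B) * P(Yellow|B) = 0.504
evidence = (500 / 1000) * (650 / 1000) * (800 / 1000)          # P(Long) * P(Sweet) * P(Yellow) = 0.26

posterior_banana = prior_banana * likelihood_banana / evidence
print(round(posterior_banana, 3))   # 0.969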

What is Laplace Correction?


The value of P(Orange | Long, Sweet and Yellow) was zero in the above example because P(Long | Orange) was zero; that is, there were no 'Long' oranges in the training data.

It makes sense, but when you have a model with many features, the entire probability can become zero because a single feature's conditional probability is zero. To avoid this, we increase the count of the zero-count value by a small amount (usually 1) in the numerator, so that the overall probability does not become zero.

This correction is called ‘Laplace Correction’. Most Naive Bayes model


implementations accept this or an equivalent form of correction as a
parameter.
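As an illustration, here is a minimal sketch of standard add-one (Laplace) smoothing; the helper function and its parameters are illustrative, not part of the original text:

def smoothed_likelihood(feature_and_class_count, class_count, n_values=2, alpha=1):
    # Add-alpha (Laplace) estimate of P(feature value | class).
    # n_values is the number of values the feature can take (2 for a binary feature).
    return (feature_and_class_count + alpha) / (class_count + alpha * n_values)

# With no 'Long' oranges out of 300, the raw estimate is 0 but the smoothed one is not:
print(0 / 300)                       # 0.0
print(smoothed_likelihood(0, 300))   # 1/302, roughly 0.0033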

ANOTHER TEXT EXAMPLE (NAÏVE BAYES)



K-Means Clustering

ANOTHER EXAMPLE FOR


K-Means Clustering

ANOTHER EXAMPLE OF K-MEANS (One Dimension)

Suppose we want to group the visitors to a website using just their age (one-dimensional space) as follows:
n = 19
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65

Initial clusters (random centroid or average):


k=2
c1 = 16
c2 = 22

Iteration 1 (assignments use c1 = 16 and c2 = 22; the new centroids computed below are):
c1 = 15.33
c2 = 36.25
xi c1 c2 Distance 1 Distance 2 Nearest Cluster New Centroid
15 16 22 1 7 1
15 16 22 1 7 1 15.33
16 16 22 0 6 1
19 16 22 3 3 2
19 16 22 3 3 2
20 16 22 4 2 2
20 16 22 4 2 2
21 16 22 5 1 2
22 16 22 6 0 2
28 16 22 12 6 2
35 16 22 19 13 2
36.25
40 16 22 24 18 2
41 16 22 25 19 2
42 16 22 26 20 2
43 16 22 27 21 2
44 16 22 28 22 2
60 16 22 44 38 2
61 16 22 45 39 2
65 16 22 49 43 2

Iteration 2 (assignments use c1 = 15.33 and c2 = 36.25; the new centroids computed below are):
c1 = 18.56
c2 = 45.90
xi c1 c2 Distance 1 Distance 2 Nearest Cluster New Centroid
15 15.33 36.25 0.33 21.25 1
15 15.33 36.25 0.33 21.25 1
16 15.33 36.25 0.67 20.25 1
19 15.33 36.25 3.67 17.25 1
19 15.33 36.25 3.67 17.25 1 18.56
20 15.33 36.25 4.67 16.25 1
20 15.33 36.25 4.67 16.25 1
21 15.33 36.25 5.67 15.25 1
22 15.33 36.25 6.67 14.25 1
28 15.33 36.25 12.67 8.25 2
35 15.33 36.25 19.67 1.25 2
40 15.33 36.25 24.67 3.75 2
41 15.33 36.25 25.67 4.75 2
42 15.33 36.25 26.67 5.75 2
45.9
43 15.33 36.25 27.67 6.75 2
44 15.33 36.25 28.67 7.75 2
60 15.33 36.25 44.67 23.75 2
61 15.33 36.25 45.67 24.75 2
65 15.33 36.25 49.67 28.75 2

Iteration 3 (assignments use c1 = 18.56 and c2 = 45.90; the new centroids computed below are):
c1 = 19.50
c2 = 47.89
xi c1 c2 Distance 1 Distance 2 Nearest Cluster New Centroid
15 18.56 45.9 3.56 30.9 1
15 18.56 45.9 3.56 30.9 1
16 18.56 45.9 2.56 29.9 1
19 18.56 45.9 0.44 26.9 1
19 18.56 45.9 0.44 26.9 1
19.50
20 18.56 45.9 1.44 25.9 1
20 18.56 45.9 1.44 25.9 1
21 18.56 45.9 2.44 24.9 1
22 18.56 45.9 3.44 23.9 1
28 18.56 45.9 9.44 17.9 1

35 18.56 45.9 16.44 10.9 2


40 18.56 45.9 21.44 5.9 2
41 18.56 45.9 22.44 4.9 2
42 18.56 45.9 23.44 3.9 2
43 18.56 45.9 24.44 2.9 2 47.89
44 18.56 45.9 25.44 1.9 2
60 18.56 45.9 41.44 14.1 2
61 18.56 45.9 42.44 15.1 2
65 18.56 45.9 46.44 19.1 2

Iteration 4 (assignments use c1 = 19.50 and c2 = 47.89; the centroids remain unchanged):
c1 = 19.50
c2 = 47.89
xi c1 c2 Distance 1 Distance 2 Nearest Cluster New Centroid
15 19.5 47.89 4.50 32.89 1
15 19.5 47.89 4.50 32.89 1
16 19.5 47.89 3.50 31.89 1
19 19.5 47.89 0.50 28.89 1
19 19.5 47.89 0.50 28.89 1
19.50
20 19.5 47.89 0.50 27.89 1
20 19.5 47.89 0.50 27.89 1
21 19.5 47.89 1.50 26.89 1
22 19.5 47.89 2.50 25.89 1
28 19.5 47.89 8.50 19.89 1
35 19.5 47.89 15.50 12.89 2
40 19.5 47.89 20.50 7.89 2
41 19.5 47.89 21.50 6.89 2
42 19.5 47.89 22.50 5.89 2
43 19.5 47.89 23.50 4.89 2 47.89
44 19.5 47.89 24.50 3.89 2
60 19.5 47.89 40.50 12.11 2
61 19.5 47.89 41.50 13.11 2
65 19.5 47.89 45.50 17.11 2

No change in the cluster assignments occurs between iterations 3 and 4, so the algorithm stops. Two groups have been identified: ages 15-28 and ages 35-65. The initial choice of centroids can affect the output clusters, so the algorithm is often run multiple times with different starting conditions in order to get a fair view of what the clusters should be.
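The whole iteration can be reproduced with a short Python sketch on the same ages (a minimal one-dimensional K-means; the tie for age 19 in the first pass may be broken differently than in the table, but the final centroids and clusters are the same):

# Ages from the example above, with the same initial centroids c1 = 16, c2 = 22
ages = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]
centroids = [16.0, 22.0]

for _ in range(20):                      # iterate until the centroids stop moving
    clusters = [[], []]
    for x in ages:                       # assign each age to the nearest centroid
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:
        break
    centroids = new_centroids

print([round(c, 2) for c in centroids])   # [19.5, 47.89]
print(clusters)                           # ages 15-28 in one group, 35-65 in the other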

Video Tutorials
1. K Means Clustering Algorithm: URL: https://www.youtube.com/watch?v=1XqG0kaJVHY&feature=emb_logo
2. K Means Clustering Algorithm: URL: https://www.youtube.com/watch?v=EItlUEPCIzM

Hierarchical Clustering
Hierarchical clustering is another unsupervised learning algorithm that is used to group together unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the following two categories.

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is initially treated as a single cluster, and pairs of clusters are then successively merged (agglomerated) in a bottom-up fashion. The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data
points are treated as one big cluster and the process of clustering involves dividing (Top-down approach)
the one big cluster into various small clusters.

Agglomerative clustering
In the agglomerative or bottom-up clustering method we first assign each observation to its own cluster. Then we compute the similarity (e.g., distance) between each pair of clusters and join the two most similar clusters. Finally, we repeat the previous two steps until only a single cluster is left. The related algorithm is shown below.

Before any clustering is performed, it is required to determine the proximity matrix containing the distance between each pair of points, using a distance function. Then, the matrix is updated to display the distance between each cluster. The following three methods differ in how the distance between each cluster is measured.

An example: working of the algorithm

Measuring the distance of two clusters

There are a few ways to measure the distance between two clusters, and they result in different variations of the algorithm:

o Single link
o Complete link
o Average link
o Centroids
o …

Single link method


In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest
distance between two points in each cluster. For example, the distance between clusters “r” and “s” to
the left is equal to the length of the arrow between their two closest points.

Complete link method


In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest
distance between two points in each cluster. For example, the distance between clusters “r” and “s” to
the left is equal to the length of the arrow between their two furthest points.

Average link method:

In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters "r" and "s" to the left is equal to the average length of the arrows connecting the points of one cluster to the other.

Centroid method:
• In this method, the distance between two clusters is the distance between their centroids.

The complexity
• All the algorithms are at least O(n²), where n is the number of data points.
• Single link can be done in O(n²).
• Complete and average links can be done in O(n² log n).
• Due to the complexity, these methods are hard to use for large data sets; common workarounds are:

o Sampling
o Scale-up methods (e.g., BIRCH).

An Example

Let’s now see a simple example: a hierarchical clustering of distances in kilometers between some Italian
cities. The method used is single-linkage.

Input distance matrix (L = 0 for all the clusters):

     BA   FI   MI   NA   RM   TO
BA    0  662  877  255  412  996
FI  662    0  295  468  268  400
MI  877  295    0  754  564  138
NA  255  468  754    0  219  869
RM  412  268  564  219    0  669
TO  996  400  138  869  669    0

The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called
"MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.
Then we compute the distance from this new compound object to all other objects. In single link
clustering the rule is that the distance from the compound object to another object is equal to the
shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to
RM is chosen to be 564, which is the distance from MI to RM, and so on.

After merging MI with TO we obtain the following matrix:

BA FI MI/TO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MI/TO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0



min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m=2

BA FI MI/TO NA/RM
BA 0 662 877 255
FI 662 0 295 268
MI/TO 877 295 0 564
NA/RM 255 268 564 0

min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m=3

BA/NA/RM FI MI/TO
BA/NA/RM 0 268 564
FI 268 0 295
MI/TO 564 295 0

min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m=4

BA/FI/NA/RM MI/TO
BA/FI/NA/RM 0 295
MI/TO 295 0

Finally, we merge the last two clusters at level 295.



The process is summarized by the following hierarchical tree
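The same sequence of single-linkage merges (at levels 138, 219, 255, 268 and 295) can be reproduced with SciPy, assuming NumPy and SciPy (and Matplotlib for the plot) are available; a minimal sketch using the distance matrix above:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

labels = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

# linkage() expects a condensed distance matrix; method='single' gives single linkage
Z = linkage(squareform(D), method="single")
print(Z[:, 2])                  # merge levels: [138. 219. 255. 268. 295.]
dendrogram(Z, labels=labels)    # draws the hierarchical tree (needs Matplotlib)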

ANOTHER HIERARCHICAL EXAMPLE

What is hierarchical clustering (agglomerative) ?


Clustering is a data mining technique to group a set of objects in a way such that objects in the
same cluster are more similar to each other than to those in other clusters.
In hierarchical clustering, we assign each object (data point) to a separate cluster. Then compute
the distance (similarity) between each of the clusters and join the two most similar clusters. Let’s
understand further by solving an example.

Objective: For the one-dimensional data set {7, 10, 20, 28, 35}, perform hierarchical clustering and plot the dendrogram to visualize it.
Solution: First, let's visualize the data.

Observing the plot above, we can intuitively conclude that:

The first two points (7 and 10) are close to each other and should be in the same cluster.
Also, the last two points (28 and 35) are close to each other and should be in the same cluster.
It is not obvious which cluster the center point (20) should join.

Let's solve the problem by hand using both types of agglomerative hierarchical clustering:
1. Single Linkage: In single-link hierarchical clustering, at each step we merge the two clusters whose two closest members have the smallest distance.

Using single linkage two clusters are formed :


Cluster 1 : (7,10)
Cluster 2 : (20,28,35)

2. Complete Linkage: In complete-link hierarchical clustering, at each step we merge the two clusters whose maximum pairwise distance is smallest.

Using complete linkage two clusters are formed :


Cluster 1 : (7,10,20)
Cluster 2 : (28,35)
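Both results can be reproduced with SciPy, assuming it is available; a minimal sketch that cuts the tree into two clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.array([[7.0], [10.0], [20.0], [28.0], [35.0]])   # the one-dimensional data set

for method in ("single", "complete"):
    Z = linkage(pdist(points), method=method)
    groups = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    print(method, groups)
# Single linkage groups (7, 10) vs (20, 28, 35); complete linkage groups (7, 10, 20) vs (28, 35).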
Conclusion: Hierarchical clustering is mostly used when the application requires a hierarchy, e.g. the creation of a taxonomy. However, it is expensive in terms of its computational and storage requirements.

Video:

1. Hierarchical Clustering: URL: https://www.youtube.com/watch?v=tlIv3IT_hHk&feature=emb_logo
2. Hierarchical Clustering: URL: https://www.youtube.com/watch?v=9U4h6pZw6f8

PRINCIPAL COMPONENT ANALYSIS



EXTERNAL RESOURCE:
1. https://youtu.be/FgakZw6K1QQ
2. https://builtin.com/data-science/step-step-explanation-principal-component-analysis
3. https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643
