SOLVED NUMERICAL EXAMPLES in Machine Learning
K-Nearest Neighbor (KNN) Example
We have data from a questionnaire survey (asking people's opinion) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Y = Classification
7                                7                                 Bad
7                                4                                 Bad
3                                4                                 Good
1                                4                                 Good
Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess what the classification of this new tissue is?
1. Determine the parameter K, the number of nearest neighbors. Suppose we use K = 3.
2. Calculate the distance between the query instance and all the training samples.
The coordinates of the query instance are (3, 7). Instead of the distance itself we compute the squared distance, which is faster to calculate (no square root needed).
X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Squared distance to query instance (3, 7)
7                                7                                 (7-3)^2 + (7-7)^2 = 16
7                                4                                 (7-3)^2 + (4-7)^2 = 25
3                                4                                 (3-3)^2 + (4-7)^2 = 9
1                                4                                 (1-3)^2 + (4-7)^2 = 13
3. Sort the distances and determine the nearest neighbors based on the K-th minimum distance.
X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Squared distance to (3, 7)   Rank (by minimum distance)   Included in the 3 nearest neighbors?
7                                7                                 16                           3                            Yes
7                                4                                 25                           4                            No
3                                4                                 9                            1                            Yes
1                                4                                 13                           2                            Yes
4. Gather the category Y of the nearest neighbors. The three nearest neighbors have categories Good (rank 1), Good (rank 2) and Bad (rank 3); the sample in the second row, (7, 4), is not included because its rank is greater than 3 (= K).
5. Use the simple majority of the categories of the nearest neighbors as the prediction for the query instance.
We have 2 Good and 1 Bad. Since 2 > 1, we conclude that the new paper tissue that passed the laboratory test with X1 = 3 and X2 = 7 belongs to the Good category.
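The same working can be reproduced in a few lines of code. The sketch below is an added illustration (the variable and function names are ours, not from the example); it performs 3-nearest-neighbor majority voting on the four training samples above.

from collections import Counter

# Training samples: (acid durability, strength) -> class label
train_X = [(7, 7), (7, 4), (3, 4), (1, 4)]
train_y = ["Bad", "Bad", "Good", "Good"]

def knn_predict(query, X, y, k=3):
    # Squared Euclidean distance is enough for ranking (no square root needed)
    dists = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, label)
             for (x1, x2), label in zip(X, y)]
    dists.sort(key=lambda d: d[0])
    # Majority vote among the k closest samples
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((3, 7), train_X, train_y, k=3))  # prints: Good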
Decision Tree
Data set
For instance, the following table informs about the decision-making factors for playing tennis outside over the previous 14 days.
Entropy
We need to calculate the entropy first. The Decision column consists of 14 instances and includes two labels: yes and no. There are 9 decisions labeled yes and 5 decisions labeled no.
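Written out (an added reference line that follows directly from the 9 yes / 5 no counts):

Entropy(Decision) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940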
The Wind attribute has two labels: weak and strong. We plug these counts into the information gain formula.
There are 8 instances with weak wind; the decision is no for 2 of them and yes for 6.
Here, there are 6 instances for strong wind, and the decisions are divided into two equal parts (3 yes and 3 no).
The calculations for the wind column are now complete. Next, we need to apply the same calculations to the other columns to find the most dominant factor on the decision.
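Before moving on, the wind calculation can also be checked with a short sketch (added for illustration; it uses only the counts stated above: 9 yes / 5 no overall, weak wind 6 yes / 2 no, strong wind 3 yes / 3 no).

from math import log2

def entropy(counts):
    # Entropy of a label distribution given as a list of class counts
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

H_decision = entropy([9, 5])                      # entropy of the Decision column, ~0.940

# Wind attribute: weak -> 6 yes / 2 no, strong -> 3 yes / 3 no
H_weak, H_strong = entropy([6, 2]), entropy([3, 3])
gain_wind = H_decision - (8 / 14) * H_weak - (6 / 14) * H_strong

print(round(H_decision, 3), round(gain_wind, 3))  # 0.94 0.048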
As can be seen, the outlook factor produces the highest score on the decision. That is why outlook will appear at the root node of the tree.
Here, there are 5 instances for sunny outlook. The decision is no for 3 of the 5 and yes for 2 of the 5.
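As an added reference point, the entropy of this sunny subset follows directly from the 3 no / 2 yes split:

Entropy(Outlook=Sunny) = -(3/5) log2(3/5) - (2/5) log2(2/5) ≈ 0.971

This matches Gain(Outlook=Sunny | Humidity) ≈ 0.970 listed below, which indicates that humidity splits the sunny subset into pure yes/no groups (the small difference is rounding).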
1- Gain(Outlook=Sunny|Temperature) = 0.570
2- Gain(Outlook=Sunny|Humidity) = 0.970
3- Gain(Outlook=Sunny|Wind) = 0.019
Now, humidity becomes the decision node here because it produces the highest score when the outlook is sunny.
On the other hand, the decision will always be yes if the humidity is normal.
Finally, this means that we need to check the humidity and decide whenever the outlook is sunny.
Here, wind produces the highest score if the outlook is rain. That is why we need to check the wind attribute at the second level when the outlook is rain.
So, it is revealed that the decision will always be yes if the wind is weak and the outlook is rain.
What's more, the decision will always be no if the wind is strong and the outlook is rain.
So, the decision tree construction is over. We can use the following rules for decision making.
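One way to express these rules in code is sketched below. The sunny and rain branches follow the text above; the "high humidity means no" rule and the overcast branch are not spelled out in the text and are assumed here from the usual play-tennis example.

def play_tennis(outlook, humidity, wind):
    # Root node: outlook
    if outlook == "Sunny":
        # From the text: under sunny, check humidity; normal humidity -> yes.
        # High humidity -> no is the complementary rule (assumed).
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rain":
        # From the text: under rain, check wind; weak -> yes, strong -> no.
        return "Yes" if wind == "Weak" else "No"
    # Assumed: the remaining outlook value (Overcast) always leads to yes.
    return "Yes"

print(play_tennis("Sunny", "High", "Weak"))     # No
print(play_tennis("Rain", "Normal", "Strong"))  # No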
METRICS IN ML
Naive Bayes
We have data for the following X variables, all of which are binary (1 or 0):
Long
Sweet
Yellow
The first few rows of the training dataset look like this:
For the sake of computing the probabilities, let’s aggregate the training data
to form a counts table like this.
Let's say you are given a fruit that is Long, Sweet and Yellow. Can you predict what fruit it is?
This is the same as predicting Y when only the X variables in the testing data are known. Let's solve it by hand using Naive Bayes.
The idea is to compute the 3 probabilities, that is, the probability of the fruit being a banana, an orange or another fruit. Whichever fruit type gets the highest probability wins.
Step 1: Compute the 'prior' probability for each class of fruit. That is, the proportion of each fruit class out of all the fruits in the population. You can provide the priors from prior information about the population; otherwise, they can be computed from the training data.
For this case, let’s compute from the training data. Out of 1000 records in
training data, you have 500 Bananas, 300 Oranges and 200 Others. So the
respective priors are 0.5, 0.3 and 0.2.
Step 2: Compute the probability of evidence that goes in the denominator. This is nothing but the product of the probabilities P(X) of all the X features. It is an optional step because the denominator is the same for all the classes and so does not affect which class wins.
Step 3: Compute the likelihood of the evidence that goes in the numerator. For Banana, the overall likelihood of the evidence is P(Long|Banana) × P(Sweet|Banana) × P(Yellow|Banana) = 0.8 × 0.7 × 0.9 = 0.504.
Step 4: Substitute all the 3 equations into the Naive Bayes formula, to get the
probability that it is a banana.
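Using only the numbers already computed above (prior 0.5, likelihood of evidence 0.504) and dropping the common denominator:

P(Banana | Long, Sweet, Yellow) ∝ 0.504 × 0.5 = 0.252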
Similarly, you can compute the probabilities for ‘Orange’ and ‘Other fruit’.
The denominator is the same for all 3 cases, so it’s optional to compute.
Clearly, Banana gets the highest probability, so that will be our predicted
class.
This makes sense, but when a model has many features, the entire product can become zero simply because one feature's conditional probability is zero. To avoid this, we increase the count of the value with zero occurrences to a small number (usually 1) in the numerator, so that the overall probability does not become zero. This correction is known as Laplace smoothing.
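A minimal sketch of this correction (added for illustration; the additive constant alpha = 1 used here is also added, once per possible feature value, to the denominator so that the smoothed probabilities stay normalized):

def smoothed_likelihood(feature_count, class_count, n_feature_values, alpha=1):
    # Laplace-smoothed estimate of P(feature value | class)
    # feature_count    : times this feature value occurred within the class
    # class_count      : total number of samples of the class
    # n_feature_values : number of distinct values the feature can take
    return (feature_count + alpha) / (class_count + alpha * n_feature_values)

# A feature value never seen for a class no longer gives probability 0
print(smoothed_likelihood(0, 300, 2))  # 1/302 instead of 0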
K-Means Clustering
Suppose we want to group the visitors to a website using just their age (one-dimensional space) as follows:
n = 19
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
Iteration 1 (initial centroids c1 = 16, c2 = 22):

xi    Distance to c1   Distance to c2   Nearest cluster
15          1                7                 1
15          1                7                 1
16          0                6                 1
19          3                3                 2
19          3                3                 2
20          4                2                 2
20          4                2                 2
21          5                1                 2
22          6                0                 2
28         12                6                 2
35         19               13                 2
40         24               18                 2
41         25               19                 2
42         26               20                 2
43         27               21                 2
44         28               22                 2
60         44               38                 2
61         45               39                 2
65         49               43                 2

For xi = 19 the two distances tie; the tie is broken in favor of cluster 2.
New centroids: c1 = mean(15, 15, 16) = 15.33; c2 = mean of the remaining 16 points = 36.25.
Iteration 2 (centroids c1 = 15.33, c2 = 36.25):

xi    Distance to c1   Distance to c2   Nearest cluster
15         0.33             21.25              1
15         0.33             21.25              1
16         0.67             20.25              1
19         3.67             17.25              1
19         3.67             17.25              1
20         4.67             16.25              1
20         4.67             16.25              1
21         5.67             15.25              1
22         6.67             14.25              1
28        12.67              8.25              2
35        19.67              1.25              2
40        24.67              3.75              2
41        25.67              4.75              2
42        26.67              5.75              2
43        27.67              6.75              2
44        28.67              7.75              2
60        44.67             23.75              2
61        45.67             24.75              2
65        49.67             28.75              2

New centroids: c1 = mean(15, ..., 22) = 18.56; c2 = mean(28, ..., 65) = 45.90.
Iteration 3 (centroids c1 = 18.56, c2 = 45.90):

xi    Distance to c1   Distance to c2   Nearest cluster
15         3.56             30.90              1
15         3.56             30.90              1
16         2.56             29.90              1
19         0.44             26.90              1
19         0.44             26.90              1
20         1.44             25.90              1
20         1.44             25.90              1
21         2.44             24.90              1
22         3.44             23.90              1
28         9.44             17.90              1
35        16.44             10.90              2
40        21.44              5.90              2
41        22.44              4.90              2
42        23.44              3.90              2
43        24.44              2.90              2
44        25.44              1.90              2
60        41.44             14.10              2
61        42.44             15.10              2
65        46.44             19.10              2

The point 28 now moves from cluster 2 to cluster 1.
New centroids: c1 = mean(15, ..., 28) = 19.50; c2 = mean(35, ..., 65) = 47.89.
Iteration 4 (centroids c1 = 19.50, c2 = 47.89):

xi    Distance to c1   Distance to c2   Nearest cluster
15         4.50             32.89              1
15         4.50             32.89              1
16         3.50             31.89              1
19         0.50             28.89              1
19         0.50             28.89              1
20         0.50             27.89              1
20         0.50             27.89              1
21         1.50             26.89              1
22         2.50             25.89              1
28         8.50             19.89              1
35        15.50             12.89              2
40        20.50              7.89              2
41        21.50              6.89              2
42        22.50              5.89              2
43        23.50              4.89              2
44        24.50              3.89              2
60        40.50             12.11              2
61        41.50             13.11              2
65        45.50             17.11              2

New centroids: c1 = 19.50 and c2 = 47.89, i.e., unchanged.
No change occurs between iterations 3 and 4, so the algorithm stops. By using clustering, two groups have been identified: ages 15-28 and ages 35-65. The initial choice of centroids can affect the output clusters, so the algorithm is often run multiple times with different starting conditions in order to get a fair view of what the clusters should be.
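The iterations above can be reproduced with a short script. The following is an added sketch of one-dimensional k-means on the same data, using the same starting centroids (16 and 22) and breaking distance ties in favor of the second cluster, as in the tables above.

ages = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]

def kmeans_1d(data, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: each point joins the nearest centroid
        # (ties go to the later centroid, matching the worked example).
        clusters = [[] for _ in centroids]
        for x in data:
            best = min(range(len(centroids)),
                       key=lambda i: (abs(x - centroids[i]), -i))
            clusters[best].append(x)
        # Update step: recompute each centroid as the mean of its cluster
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged: no change between iterations
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans_1d(ages, [16, 22])
print([round(c, 2) for c in centroids])  # [19.5, 47.89]
print(clusters)                          # the 15-28 group and the 35-65 group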
Videos Tutorials
1. K Means Clustering Algorithm: URL:
https://www.youtube.com/watch?v=1XqG0kaJVHY&feature=emb_logo
2. K Means Clustering Algorithm: URL: https://www.youtube.com/watch?v=EItlUEPCIzM
Hierarchical Clustering
Hierarchical clustering is another unsupervised learning algorithm that is used to group together the unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the following two categories.
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is initially treated as its own cluster, and pairs of clusters are then merged (bottom-up approach) until all points end up in one big cluster.
Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data points are treated as one big cluster and the process of clustering involves dividing (top-down approach) the one big cluster into various small clusters.
Agglomerative clustering
In the agglomerative or bottom-up clustering method, we proceed as follows:
1. Assign each observation to its own cluster.
2. Compute the similarity (e.g., distance) between every pair of clusters.
3. Join the two most similar clusters.
4. Repeat steps 2 and 3 until there is only a single cluster left.
A sketch of the algorithm is shown below.
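The sketch below (added for illustration) is a naive single-link version of this procedure, working directly on a precomputed distance matrix.

def single_link_agglomerative(dist, labels):
    # dist   : symmetric matrix (list of lists) of pairwise distances
    # labels : names of the initial clusters (one per data point)
    # Returns the list of merges as (cluster_a, cluster_b, merge_distance).
    clusters = [[i] for i in range(len(labels))]   # step 1: one cluster per point
    names = list(labels)
    merges = []
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters; with single link the
        # inter-cluster distance is the minimum distance between any member
        # of one cluster and any member of the other.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 3: join the two most similar clusters.
        merges.append((names[a], names[b], d))
        clusters[a] = clusters[a] + clusters[b]
        names[a] = names[a] + "/" + names[b]
        del clusters[b]
        del names[b]
    return merges

Applied to the distance matrix of the Italian cities in the example below, this reproduces the same sequence of merges: MI and TO at 138, NA and RM at 219, then BA at 255, FI at 268, and a final merge at 295.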
Before any clustering is performed, it is required to determine the proximity matrix containing the distance between each pair of points, using a distance function. Then, the matrix is updated to display the distance between each cluster. The following linkage methods differ in how the distance between two clusters is measured.
o Single link
o Complete link
o Average link
o Centroids
o …
Centroid method:
In this method, the distance between two clusters is the distance between their centroids
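In symbols (an added note), with x̄i and x̄j denoting the centroid (mean) vectors of clusters Ci and Cj:

d(Ci, Cj) = || x̄i − x̄j ||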
The complexity
All the algorithms are at least O(n²), where n is the number of data points. A common way to reduce this cost on large data sets is sampling, i.e., clustering only a subset of the points.
An Example
Let’s now see a simple example: a hierarchical clustering of distances in kilometers between some Italian
cities. The method used is single-linkage.
       BA    FI    MI    NA    RM    TO
BA      0    662   877   255   412   996
FI     662    0    295   468   268   400
MI     877   295    0    754   564   138
NA     255   468   754    0    219   869
RM     412   268   564   219    0    669
TO     996   400   138   869   669    0
The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called
"MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.
Then we compute the distance from this new compound object to all other objects. In single link
clustering the rule is that the distance from the compound object to another object is equal to the
shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to
RM is chosen to be 564, which is the distance from MI to RM, and so on.
        BA    FI   MI/TO   NA    RM
BA       0    662    877   255   412
FI      662    0     295   468   268
MI/TO   877   295     0    754   564
NA      255   468    754    0    219
RM      412   268    564   219    0
min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m=2
        BA    FI   MI/TO  NA/RM
BA       0    662    877    255
FI      662    0     295    268
MI/TO   877   295     0     564
NA/RM   255   268    564     0
min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m=3
           BA/NA/RM   FI   MI/TO
BA/NA/RM       0      268    564
FI            268      0     295
MI/TO         564     295     0
min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m=4
              BA/FI/NA/RM   MI/TO
BA/FI/NA/RM        0          295
MI/TO             295          0

min d(i,j) = d(BA/FI/NA/RM, MI/TO) = 295 => merge the last two clusters into a single cluster containing all six cities
L(BA/FI/MI/NA/RM/TO) = 295
m = 5
Objective: For the one-dimensional data set {7, 10, 20, 28, 35}, perform hierarchical clustering and plot the dendrogram to visualize it.
Solution: First, let's visualize the data.
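A quick way to do this is the added sketch below, which performs the clustering and draws the dendrogram with SciPy; single linkage is used here, and the complete linkage discussed next only requires changing the method argument.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

data = np.array([7, 10, 20, 28, 35]).reshape(-1, 1)

# Build the merge hierarchy; switch method to "complete" for complete linkage
Z = linkage(data, method="single")

dendrogram(Z, labels=["7", "10", "20", "28", "35"])
plt.title("Hierarchical clustering of {7, 10, 20, 28, 35}")
plt.ylabel("Merge distance")
plt.show()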
2. Complete Linkage: In complete-link hierarchical clustering, at each step we merge the two clusters whose merger gives the smallest maximum pairwise distance.
VIDEO AND EXTERNAL RESOURCES:
1. https://youtu.be/FgakZw6K1QQ
2. https://builtin.com/data-science/step-step-explanation-principal-component-analysis
3. https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643