DWM Question Bank Solution

The document describes the k-means clustering algorithm applied to a dataset with 15 data points (A1-A15). It initializes the algorithm with 3 centroids and assigns each data point to the cluster with the closest centroid. It then recalculates the centroids and reassigns points in successive iterations until cluster assignments stabilize. The final centroids are (4, 4.6) for cluster 1, (4.143, 9.571) for cluster 2, and (10, 11.333) for cluster 3.


Q1

We select A2 (2,6), A7 (5,10), and A15 (6,11) as the centroids of the initial clusters.

 Centroid 1 = (2,6) is associated with cluster 1.
 Centroid 2 = (5,10) is associated with cluster 2.
 Centroid 3 = (6,11) is associated with cluster 3.

Now we will find the Euclidean distance between each point and the centroids. Based on the minimum distance of each point from the centroids, we will assign the points to a cluster.
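As a quick check of the first row of the table, the Euclidean distance of A1 (2,10) from each initial centroid can be computed with a minimal Python sketch:

```python
import math

def euclidean(p, q):
    # Straight-line distance between two 2-D points
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

a1 = (2, 10)
centroids = [(2, 6), (5, 10), (6, 11)]
dists = [euclidean(a1, c) for c in centroids]
print([round(d, 6) for d in dists])  # → [4.0, 3.0, 4.123106]
# The minimum is 3 (centroid 2), so A1 is assigned to cluster 2.
```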

Point | Distance from Centroid 1 (2,6) | Distance from Centroid 2 (5,10) | Distance from Centroid 3 (6,11) | Assigned Cluster
A1 (2,10) | 4 | 3 | 4.123106 | Cluster 2
A2 (2,6) | 0 | 5 | 6.403124 | Cluster 1
A3 (11,11) | 10.29563 | 6.082763 | 5 | Cluster 3
A4 (6,9) | 5 | 1.414214 | 2 | Cluster 2
A5 (6,4) | 4.472136 | 6.082763 | 7 | Cluster 1
A6 (1,2) | 4.123106 | 8.944272 | 10.29563 | Cluster 1
A7 (5,10) | 5 | 0 | 1.414214 | Cluster 2
A8 (4,9) | 3.605551 | 1.414214 | 2.828427 | Cluster 2
A9 (10,12) | 10 | 5.385165 | 4.123106 | Cluster 3
A10 (7,5) | 5.09902 | 5.385165 | 6.082763 | Cluster 1
A11 (9,11) | 8.602325 | 4.123106 | 3 | Cluster 3
A12 (4,6) | 2 | 4.123106 | 5.385165 | Cluster 1
A13 (3,10) | 4.123106 | 2 | 3.162278 | Cluster 2
A14 (3,8) | 2.236068 | 2.828427 | 4.242641 | Cluster 1
A15 (6,11) | 6.403124 | 1.414214 | 0 | Cluster 3

At this point, we have completed the first iteration of the k-means clustering algorithm and assigned each point to a cluster.

Now, we will calculate the new centroid for each cluster.

 In cluster 1, we have 6 points, i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), A12 (4,6), and A14 (3,8). To calculate the new centroid for cluster 1, we find the mean of the x and y coordinates of each point in the cluster. Hence, the new centroid for cluster 1 is (3.833, 5.167).
 In cluster 2, we have 5 points, i.e. A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), and A13 (3,10). Hence, the new centroid for cluster 2 is (4, 9.6).
 In cluster 3, we have 4 points, i.e. A3 (11,11), A9 (10,12), A11 (9,11), and A15 (6,11). Hence, the new centroid for cluster 3 is (9, 11.25).
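The centroid update is just the coordinate-wise mean; for instance, the new centroid of cluster 1 can be verified with a small Python sketch:

```python
def centroid(points):
    # Mean of the x coordinates and mean of the y coordinates
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

cluster1 = [(2, 6), (6, 4), (1, 2), (7, 5), (4, 6), (3, 8)]  # A2, A5, A6, A10, A12, A14
cx, cy = centroid(cluster1)
print(round(cx, 3), round(cy, 3))  # → 3.833 5.167
```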

Now, we will calculate the distance of each data point from the new centroids.

Point | Distance from Centroid 1 (3.833, 5.167) | Distance from Centroid 2 (4, 9.6) | Distance from Centroid 3 (9, 11.25) | Assigned Cluster
A1 (2,10) | 5.169 | 2.040 | 7.111 | Cluster 2
A2 (2,6) | 2.013 | 4.118 | 8.750 | Cluster 1
A3 (11,11) | 9.241 | 7.139 | 2.016 | Cluster 3
A4 (6,9) | 4.403 | 2.088 | 3.750 | Cluster 2
A5 (6,4) | 2.461 | 5.946 | 7.846 | Cluster 1
A6 (1,2) | 4.249 | 8.171 | 12.230 | Cluster 1
A7 (5,10) | 4.972 | 1.077 | 4.191 | Cluster 2
A8 (4,9) | 3.837 | 0.600 | 5.483 | Cluster 2
A9 (10,12) | 9.204 | 6.462 | 1.250 | Cluster 3
A10 (7,5) | 3.171 | 5.492 | 6.562 | Cluster 1
A11 (9,11) | 7.792 | 5.192 | 0.250 | Cluster 3
A12 (4,6) | 0.850 | 3.600 | 7.250 | Cluster 1
A13 (3,10) | 4.904 | 1.077 | 6.129 | Cluster 2
A14 (3,8) | 2.953 | 1.887 | 6.824 | Cluster 2
A15 (6,11) | 6.223 | 2.441 | 3.010 | Cluster 2

Now, we have completed the second iteration of the k-means clustering algorithm and assigned each point to an updated cluster.

Now, we will calculate the new centroid for each cluster for the third
iteration.

 In cluster 1, we have 5 points, i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To calculate the new centroid for cluster 1, we find the mean of the x and y coordinates of each point in the cluster. Hence, the new centroid for cluster 1 is (4, 4.6).
 In cluster 2, we have 7 points, i.e. A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), A13 (3,10), A14 (3,8), and A15 (6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571).
 In cluster 3, we have 3 points, i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new centroid for cluster 3 is (10, 11.333).

Now, we will calculate the distance of each data point from the new
centroids.

Point | Distance from Centroid 1 (4, 4.6) | Distance from Centroid 2 (4.143, 9.571) | Distance from Centroid 3 (10, 11.333) | Assigned Cluster
A1 (2,10) | 5.758 | 2.186 | 8.110 | Cluster 2
A2 (2,6) | 2.441 | 4.165 | 9.615 | Cluster 1
A3 (11,11) | 9.485 | 7.004 | 1.054 | Cluster 3
A4 (6,9) | 4.833 | 1.943 | 4.631 | Cluster 2
A5 (6,4) | 2.088 | 5.872 | 8.353 | Cluster 1
A6 (1,2) | 3.970 | 8.197 | 12.966 | Cluster 1
A7 (5,10) | 5.492 | 0.958 | 5.175 | Cluster 2
A8 (4,9) | 4.400 | 0.589 | 6.438 | Cluster 2
A9 (10,12) | 9.527 | 6.341 | 0.667 | Cluster 3
A10 (7,5) | 3.027 | 5.390 | 7.008 | Cluster 1
A11 (9,11) | 8.122 | 5.063 | 1.054 | Cluster 3
A12 (4,6) | 1.400 | 3.574 | 8.028 | Cluster 1
A13 (3,10) | 5.492 | 1.221 | 7.126 | Cluster 2
A14 (3,8) | 3.544 | 1.943 | 7.753 | Cluster 2
A15 (6,11) | 6.705 | 2.343 | 4.014 | Cluster 2

Now, we have completed the third iteration of the k-means clustering algorithm.

Now, we will calculate the new centroid for each cluster for the fourth iteration.

 In cluster 1, we have 5 points, i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To calculate the new centroid for cluster 1, we find the mean of the x and y coordinates of each point in the cluster. Hence, the new centroid for cluster 1 is (4, 4.6).
 In cluster 2, we have 7 points, i.e. A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), A13 (3,10), A14 (3,8), and A15 (6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571).
 In cluster 3, we have 3 points, i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new centroid for cluster 3 is (10, 11.333).

Here, you can observe that no point has changed its cluster compared to the previous iteration. Due to this, the centroids also remain constant. Therefore, we can say that the clusters have stabilized.
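The whole procedure above can be reproduced with a short k-means sketch in Python (assuming Python 3.8+ for math.dist); starting from the same initial centroids A2, A7, and A15, it converges to the same final centroids:

```python
import math

points = {"A1": (2, 10), "A2": (2, 6), "A3": (11, 11), "A4": (6, 9), "A5": (6, 4),
          "A6": (1, 2), "A7": (5, 10), "A8": (4, 9), "A9": (10, 12), "A10": (7, 5),
          "A11": (9, 11), "A12": (4, 6), "A13": (3, 10), "A14": (3, 8), "A15": (6, 11)}

def kmeans(points, centroids):
    while True:
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points.values():
            d = [math.dist(p, c) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: each centroid moves to the mean of its cluster
        new = [(sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
               for cl in clusters]
        if new == centroids:  # assignments (and hence centroids) have stabilized
            return new
        centroids = new

final = kmeans(points, [points["A2"], points["A7"], points["A15"]])
print([(round(x, 3), round(y, 3)) for x, y in final])
# → [(4.0, 4.6), (4.143, 9.571), (10.0, 11.333)]
```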

Q.2
Following are two points from the dataset that we have selected as medoids.

 M1 = (3, 4)
 M2 = (7, 3)
Iteration 1

Now, we will calculate the distance between each data point and the medoids
using the Manhattan distance measure.

Point | Coordinates | Distance from M1 (3,4) | Distance from M2 (7,3) | Assigned Cluster
A1 | (2, 6) | 3 | 8 | Cluster 1
A2 | (3, 8) | 4 | 9 | Cluster 1
A3 | (4, 7) | 4 | 7 | Cluster 1
A4 | (6, 2) | 5 | 2 | Cluster 2
A5 | (6, 4) | 3 | 2 | Cluster 2
A6 | (7, 3) | 5 | 0 | Cluster 2
A7 | (7, 4) | 4 | 1 | Cluster 2
A8 | (8, 5) | 6 | 3 | Cluster 2
A9 | (7, 6) | 6 | 3 | Cluster 2
A10 | (3, 4) | 0 | 5 | Cluster 1

The clusters made with medoids (3, 4) and (7, 3) are as follows.

 Points in cluster 1 = {(2, 6), (3, 8), (4, 7), (3, 4)}
 Points in cluster 2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}

After assigning clusters, we will calculate the cost for each cluster and find
their sum. The cost is nothing but the sum of distances of all the data points
from the medoid of the cluster they belong to.

Hence, the cost for the current cluster will be 3+4+4+2+2+0+1+3+3+0=22.


Iteration 2
Now, we will select another non-medoid point (7, 4) and make it a temporary
medoid for the second cluster. Hence,

 M1 = (3, 4)
 M2 = (7, 4)

Now, let us calculate the distance between all the data points and the current
medoids.

Point | Coordinates | Distance from M1 (3,4) | Distance from M2 (7,4) | Assigned Cluster
A1 | (2, 6) | 3 | 7 | Cluster 1
A2 | (3, 8) | 4 | 8 | Cluster 1
A3 | (4, 7) | 4 | 6 | Cluster 1
A4 | (6, 2) | 5 | 3 | Cluster 2
A5 | (6, 4) | 3 | 1 | Cluster 2
A6 | (7, 3) | 5 | 1 | Cluster 2
A7 | (7, 4) | 4 | 0 | Cluster 2
A8 | (8, 5) | 6 | 2 | Cluster 2
A9 | (7, 6) | 6 | 2 | Cluster 2
A10 | (3, 4) | 0 | 4 | Cluster 1

The data points haven’t changed in the clusters after changing the medoids.
Hence, clusters are:

 Points in cluster 1: {(2, 6), (3, 8), (4, 7), (3, 4)}
 Points in cluster 2: {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
Now, let us again calculate the cost for each cluster and find their sum. The
total cost this time will be 3+4+4+3+1+1+0+2+2+0=20.

Here, the current cost is less than the cost calculated in the previous
iteration. Hence, we will make the swap permanent and make (7,4) the
medoid for cluster 2. If the cost this time was greater than the previous
cost i.e. 22, we would have to revert the change. New medoids after this
iteration are (3, 4) and (7, 4) with no change in the clusters.

Iteration 3
Now, let us again change the medoid of cluster 2, this time to (6, 4). Hence, the new medoids for the clusters are M1 = (3, 4) and M2 = (6, 4).

Let us calculate the distance between the data points and the above medoids
to find the new cluster.

Point | Coordinates | Distance from M1 (3,4) | Distance from M2 (6,4) | Assigned Cluster
A1 | (2, 6) | 3 | 6 | Cluster 1
A2 | (3, 8) | 4 | 7 | Cluster 1
A3 | (4, 7) | 4 | 5 | Cluster 1
A4 | (6, 2) | 5 | 2 | Cluster 2
A5 | (6, 4) | 3 | 0 | Cluster 2
A6 | (7, 3) | 5 | 2 | Cluster 2
A7 | (7, 4) | 4 | 1 | Cluster 2
A8 | (8, 5) | 6 | 3 | Cluster 2
A9 | (7, 6) | 6 | 3 | Cluster 2
A10 | (3, 4) | 0 | 3 | Cluster 1

Again, the clusters haven’t changed. Hence, clusters are:

 Points in cluster 1: {(2, 6), (3, 8), (4, 7), (3, 4)}
 Points in cluster 2: {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}

Now, let us again calculate the cost for each cluster and find their sum. The
total cost this time will be 3+4+4+2+0+2+1+3+3+0=22.

The current cost is 22 which is greater than the cost in the previous iteration
i.e. 20. Hence, we will revert the change and the point (7, 4) will again be
made the medoid for cluster 2.

So, the clusters after this iteration will be cluster1 = {(2, 6), (3, 8), (4, 7), (3, 4)}
and cluster 2= {(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}. The medoids are (3,4) and
(7,4).
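The cost comparison that drives the medoid swaps can be sketched in Python; using the ten points A1-A10 from the table, it reproduces the totals 22, 20, and 22 computed above:

```python
# Points A1..A10 from the Q.2 dataset
points = [(2, 6), (3, 8), (4, 7), (6, 2), (6, 4),
          (7, 3), (7, 4), (8, 5), (7, 6), (3, 4)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(medoids):
    # Each point contributes its distance to the nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

print(total_cost([(3, 4), (7, 3)]))  # → 22  (initial medoids)
print(total_cost([(3, 4), (7, 4)]))  # → 20  (swap accepted: cost decreased)
print(total_cost([(3, 4), (6, 4)]))  # → 22  (swap rejected: cost increased)
```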

Q.3

To obtain the new distance matrix, we remove the entries for items 3 and 5 and replace them with a single entry "35".
This gives us the new distance matrix. The items with the smallest distance get clustered next; this will be 2 and 4.

Students, you need to solve all six iterations here. Continuing in this way, after 6 steps, everything is clustered.

Complete Linkage

Q. 4
Using single linkage, the distance between two clusters is the minimum distance between any pair of original objects from the two clusters.
Using the input distance matrix, the distance between cluster (D, F) and cluster A is computed as
d((D, F), A) = min(d(D, A), d(F, A))
The distance between cluster (D, F) and cluster B is
d((D, F), B) = min(d(D, B), d(F, B))
Similarly, the distance between cluster (D, F) and cluster C is
d((D, F), C) = min(d(D, C), d(F, C))
Finally, the distance between cluster E and cluster (D, F) is calculated as
d((D, F), E) = min(d(D, E), d(F, E))

Then, the updated distance matrix becomes


Looking at the lower triangular part of the updated distance matrix, we find that the closest pair is cluster A and cluster B, at distance 0.71. Thus, we group cluster A and cluster B into a single cluster named (A, B).
Now we update the distance matrix. Aside from the first row and first column, the distance between cluster C and cluster (D, F) is computed as
d(C, (D, F)) = min(d(C, D), d(C, F))
The distance between cluster (D, F) and cluster (A, B) is the minimum distance between all objects involved in the two clusters:
d((D, F), (A, B)) = min(d(D, A), d(D, B), d(F, A), d(F, B))
Similarly, the distance between cluster E and (A, B) is
d(E, (A, B)) = min(d(E, A), d(E, B))
Then the updated distance matrix is


Observing the lower triangular part of the updated distance matrix, we can see that the closest distance between clusters is between cluster E and (D, F), at distance 1.00. Thus, we cluster them together into cluster ((D, F), E).
The updated distance matrix is given below.

The distance between cluster ((D, F), E) and cluster (A, B) is calculated as
d(((D, F), E), (A, B)) = min(d(D, A), d(D, B), d(F, A), d(F, B), d(E, A), d(E, B))
The distance between cluster ((D, F), E) and cluster C yields the minimum distance of 1.41. This distance is computed as
d(((D, F), E), C) = min(d(D, C), d(F, C), d(E, C)) = 1.41

After that, we merge cluster ((D, F), E) and cluster C into a new cluster name (((D, F),
E), C).
The updated distance matrix is shown in the figure below

The minimum distance of 2.5 is the result of the following computation:
d((((D, F), E), C), (A, B)) = minimum distance over all pairs of objects from the two clusters = 2.50

Now, if we merge the remaining two clusters, we will get only a single cluster containing all 6 objects. Thus, our computation is finished. We summarize the results of the computation as follows:

1. In the beginning, we have 6 clusters: A, B, C, D, E, and F.
2. We merge clusters D and F into cluster (D, F) at distance 0.50.
3. We merge cluster A and cluster B into (A, B) at distance 0.71.
4. We merge cluster E and (D, F) into ((D, F), E) at distance 1.00.
5. We merge cluster ((D, F), E) and C into (((D, F), E), C) at distance 1.41.
6. We merge cluster (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50.
7. The last cluster contains all the objects, which concludes the computation.
Using this information, we can now draw the final dendrogram.
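The single-linkage update rule used at every step above can be sketched in Python. The original distance matrix for objects A-F is not included in this excerpt, so the values below are hypothetical and serve only to illustrate the rule d(C1 ∪ C2, C3) = min over all cross-cluster pairs:

```python
# Hypothetical symmetric distances for illustration only -- the actual
# input matrix for objects A..F is not reproduced in this excerpt.
dist = {("D", "F"): 0.50, ("A", "B"): 0.71, ("A", "D"): 3.0,
        ("B", "D"): 2.5, ("A", "F"): 3.2, ("B", "F"): 2.9}

def d(x, y):
    # Look up a symmetric pairwise distance
    return dist.get((x, y)) or dist.get((y, x))

def single_link(c1, c2):
    # Single linkage: minimum distance over all cross-cluster object pairs
    return min(d(x, y) for x in c1 for y in c2)

# Distance between cluster (D, F) and cluster (A, B) under these made-up values:
print(single_link(["D", "F"], ["A", "B"]))  # → 2.5
```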

Module 5
Q.2
The Apriori algorithm makes the given assumptions:

 All subsets of a frequent itemset must be frequent.
 The subsets of an infrequent itemset must be infrequent.

We fix a threshold support level; in our case, we have fixed it at 50 percent.

Step 1

Make a frequency table of all the products that appear in the transactions. Now, shortlist only those products that meet the threshold support level of 50 percent. We get the given frequency table.
Product Frequency (Number of transactions)

Rice (R) 4

Pulse(P) 5

Oil(O) 4

Milk(M) 4

Step 2

Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given
frequency table.

Itemset Frequency (Number of transactions)

RP 4

RO 3

RM 2

PO 4

PM 3

OM 2

Step 3

Apply the same threshold support of 50 percent and keep only the itemsets whose frequency meets it. In our case, that means a frequency of 3 or more.

Thus, we get RP, RO, PO, and PM.
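The pruning in Steps 1-3 is just a support filter. Using the pair counts from the table above and the minimum support count of 3 that this solution works with:

```python
# Pair frequencies from the Step 2 table
pair_counts = {"RP": 4, "RO": 3, "RM": 2, "PO": 4, "PM": 3, "OM": 2}
min_support_count = 3  # the 50 percent threshold as applied in Step 3

frequent_pairs = {itemset for itemset, count in pair_counts.items()
                  if count >= min_support_count}
print(sorted(frequent_pairs))  # → ['PM', 'PO', 'RO', 'RP']
```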

Step 4
Now, look for a set of three products that the customers buy together. We get the
given combination.

1. RP and RO give RPO
2. PO and PM give POM

Step 5

Calculate the frequency of the two itemsets, and you will get the given frequency
table.

Itemset Frequency (Number of transactions)

RPO 4

POM 3

Students, you can now find some of the association rules from this table.
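As one worked example of the rule mining left to the reader, confidence can be computed directly from the counts already tabulated, using the standard definition confidence(X → Y) = support(X ∪ Y) / support(X):

```python
# Counts taken from the Step 1 and Step 2 tables above
count = {"R": 4, "P": 5, "RP": 4}

def confidence(antecedent, combined):
    # confidence(X -> Y) = support(X union Y) / support(X)
    return count[combined] / count[antecedent]

print(confidence("R", "RP"))  # → 1.0  (every transaction with Rice also has Pulse)
print(confidence("P", "RP"))  # → 0.8
```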

Q.3
Solution:

Step-1: Calculating C1 and L1:

D and E will be pruned, because the given minimum support count is 3.
From here onwards, students should do the calculations and solve the complete problem.

Q.4
Support threshold=50%, Confidence= 60%

Table 1:

Transaction List of items

T1 I1,I2,I3

T2 I2,I3,I4

T3 I4,I5

T4 I1,I2,I4

T5 I1,I2,I3,I5

T6 I1,I2,I3,I4

Solution: Support threshold=50% => 0.5*6= 3 => min_sup=3


Table 2: Count of each item

Item Count

I1 4

I2 5

I3 4

I4 4

I5 2

Table 3: Sort the itemsets in descending order of count. (I5 is dropped here, as its count of 2 is below min_sup = 3.)

Item Count

I2 5

I1 4

I3 4

I4 4

Build FP Tree

1. Consider the root node null.
2. The first scan, of transaction T1: I1, I2, I3, contains three items {I2:1}, {I1:1}, {I3:1}, where I2 is linked as a child of the root, I1 is linked to I2, and I3 is linked to I1.
3. T2: I2, I3, I4, where I2 is linked to the root, I3 is linked to I2, and I4 is linked to I3. This branch shares the I2 node, as it is already used in T1.
4. Increment the count of I2 by 1; I3 is linked as a child to I2, and I4 is linked as a child to I3. The counts are {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch is created, with I5 linked to I4 as a child.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node, hence its count is incremented by 1. Similarly, I1 is incremented by 1, as it is already linked with I2 from T1. Thus {I2:3}, {I1:2}, {I4:1}.
7. T5: I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.
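The build steps above can be checked with a short sketch that reorders each transaction by descending item count (using the order of Table 3, with the below-support item I5 kept at the end, as this solution does) and inserts it into a dictionary-based tree:

```python
order = ["I2", "I1", "I3", "I4", "I5"]  # descending counts from Table 3, I5 last
transactions = [["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
                ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"],
                ["I1", "I2", "I3", "I4"]]

root = {}  # each node: {item: [count, children_dict]}
for t in transactions:
    node = root
    for item in sorted(t, key=order.index):  # reorder as in the steps above
        child = node.setdefault(item, [0, {}])
        child[0] += 1                        # shared prefixes increment counts
        node = child[1]

print(root["I2"][0])                    # → 5  (matches {I2:5})
print(root["I2"][1]["I1"][0])           # → 4  (matches {I1:4})
print(root["I2"][1]["I1"][1]["I3"][0])  # → 3  (matches {I3:3})
```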

Item | Conditional Pattern Base | Conditional FP-tree | Frequent Patterns Generated
I4 | {I2,I1,I3:1}, {I2,I3:1} | {I2:2, I3:2} | {I2,I4:2}, {I3,I4:2}, {I2,I3,I4:2}
I3 | {I2,I1:3}, {I2:1} | {I2:4, I1:3} | {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}
I1 | {I2:4} | {I2:4} | {I2,I1:4}

Module 6
Q.3
PageRank formula:
PR(A) = (1-β) + β * [PR(B)/Cout(B) + PR(C)/Cout(C) + ... + PR(N)/Cout(N)]
Here, β is the damping factor, i.e. 0.8 (so the teleportation probability is 1-β = 0.2), B, C, ..., N are the pages linking to A, and Cout(X) is the number of outgoing links from page X. The number of incoming and outgoing links of each node is used to calculate the rank.

Let us create a table of the 0th Iteration, 1st Iteration, and 2nd Iteration.

NODES | ITERATION 0 | ITERATION 1 | ITERATION 2
A | 1/6 ≈ 0.16 | 0.3 | 0.392
B | 1/6 ≈ 0.16 | 0.32 | 0.3568
C | 1/6 ≈ 0.16 | 0.32 | 0.3568
D | 1/6 ≈ 0.16 | 0.264 | 0.2714
E | 1/6 ≈ 0.16 | 0.264 | 0.2714
F | 1/6 ≈ 0.16 | 0.392 | 0.4141

Iteration 0:
For iteration 0, assume that each page has page rank = 1 / (total number of nodes).
Therefore, PR(A) = PR(B) = PR(C) = PR(D) = PR(E) = PR(F) = 1/6 ≈ 0.16
Iteration 1:
By using the above-mentioned formula,
PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.16/4 + 0.16/2)
= 0.3
What we have done here is: for node A, we look at its incoming links, which come from B and C. For each incoming link, we divide the source page's rank by its number of outgoing links, i.e. PR(B) is divided by its 4 outgoing links and PR(C) by its 2 outgoing links. The same procedure is applicable for the remaining nodes and iterations.
NOTE: USE THE UPDATED PAGE RANK FOR FURTHER
CALCULATIONS.

PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
PR(C) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
PR(D) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.32/4
= 0.264
PR(E) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.32/4
= 0.264
PR(F) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.32/4 + 0.32/2)
= 0.392
This was for iteration 1, now let us calculate iteration 2.

Iteration 2:
By using the above-mentioned formula,
PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.32/4 + 0.32/2)
= 0.392
NOTE: USE THE UPDATED PAGE RANK FOR FURTHER
CALCULATIONS.
PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.392/2
= 0.3568
PR(C) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.392/2
= 0.3568
PR(D) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.3568/4
= 0.2714
PR(E) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.3568/4
= 0.2714
PR(F) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.3568/4 + 0.3568/2)
= 0.4141
So, the final PAGE RANK for the above-given question is,
NODES | ITERATION 0 | ITERATION 1 | ITERATION 2
A | 1/6 ≈ 0.16 | 0.3 | 0.392
B | 1/6 ≈ 0.16 | 0.32 | 0.3568
C | 1/6 ≈ 0.16 | 0.32 | 0.3568
D | 1/6 ≈ 0.16 | 0.264 | 0.2714
E | 1/6 ≈ 0.16 | 0.264 | 0.2714
F | 1/6 ≈ 0.16 | 0.392 | 0.4141
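The two iterations can be reproduced programmatically. The sketch below follows the solution's convention of reusing already-updated ranks within an iteration (so nodes are processed in the order A..F); the link structure (A→B, A→C; B→A, D, E, F; C→A, F) is inferred from the formulas above:

```python
beta = 0.8
# in_links[X] lists the pages linking to X; out_degree[X] is the number of
# outgoing links of X, as used in the hand calculations above.
in_links = {"A": ["B", "C"], "B": ["A"], "C": ["A"],
            "D": ["B"], "E": ["B"], "F": ["B", "C"]}
out_degree = {"A": 2, "B": 4, "C": 2}

pr = {n: 1 / 6 for n in "ABCDEF"}  # iteration 0
for _ in range(2):                  # iterations 1 and 2
    for n in "ABCDEF":              # updated ranks are reused immediately
        pr[n] = (1 - beta) + beta * sum(pr[s] / out_degree[s] for s in in_links[n])

print({n: round(v, 4) for n, v in pr.items()})
# → {'A': 0.392, 'B': 0.3568, 'C': 0.3568, 'D': 0.2714, 'E': 0.2714, 'F': 0.4141}
```

Starting from 1/6 exactly (rather than the truncated 0.16), the code reproduces the iteration 2 row of the table above.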
