Solution For DWDM Problems


UNIT-3:

Given a training data set Y:


A B C Class
15 1 A C1
20 3 B C2
25 2 A C1
30 4 A C1
35 2 B C2
25 4 A C1
15 2 B C2
20 3 B C2
Find the best split point for decision tree for attribute A.
Solution:

To find the best split point for a decision tree on attribute A, you evaluate each candidate
split point and choose the one that minimizes the impurity (here, the Gini index) of the
resulting partitions. The idea is to find a threshold value for attribute A that best separates
the data into the different classes.

Here's a step-by-step process to find the best split point for attribute A:

1. Sort the data by the values of attribute A (keeping the class labels):

A	Class
15	C1
15	C2
20	C2
20	C2
25	C1
25	C1
30	C1
35	C2

2. Take the midpoint between each pair of distinct consecutive values of A as a candidate split point.

Candidate split points: 17.5, 22.5, 27.5, 32.5

3. Calculate the weighted Gini index (or another impurity measure) for each candidate split.
 Gini index of a node = 1 - Σ(p_i)^2, where p_i is the proportion of
samples of class i in the node.
For example, for the candidate split A ≤ 22.5:
 Left partition (A ≤ 22.5): 1 × C1, 3 × C2, so Gini = 1 − (1/4)^2 − (3/4)^2 = 0.375
 Right partition (A > 22.5): 3 × C1, 1 × C2, so Gini = 1 − (3/4)^2 − (1/4)^2 = 0.375
 Weighted Gini = (4/8)(0.375) + (4/8)(0.375) = 0.375
Repeating this calculation for every candidate gives weighted Gini values of 0.5 (at 17.5),
0.375 (at 22.5), 0.5 (at 27.5) and about 0.43 (at 32.5).
4. Choose the split point that results in the lowest weighted impurity (Gini index, in this
case).
In this example, the split A ≤ 22.5 has the lowest weighted Gini index (0.375),
so 22.5 is chosen as the best split point for attribute A.

Note: The above process is a simplified explanation, and in practice, decision tree
algorithms may use other impurity measures (such as entropy or misclassification
rate) and may consider multiple attributes simultaneously. The specific details can
vary depending on the algorithm being used.
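
The steps above can be cross-checked with a small Python sketch (standard library only); the data and the midpoint procedure are exactly those used in the worked solution:

from collections import Counter

# (A, class) pairs from training set Y
data = [(15, "C1"), (20, "C2"), (25, "C1"), (30, "C1"),
        (35, "C2"), (25, "C1"), (15, "C2"), (20, "C2")]

def gini(labels):
    # Gini index of a node: 1 - sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    # Try the midpoint between every pair of distinct consecutive values
    # and return the threshold with the lowest weighted Gini index.
    values = sorted({a for a, _ in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [c for a, c in rows if a <= t]
        right = [c for a, c in rows if a > t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or w < best[1]:
            best = (t, w)
    return best

print(best_split(data))   # (22.5, 0.375)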

Make a decision tree for the following database using Gini Index. Indicate all
intermediate steps.
Example Colour Shape Size Class
1 Red Square Big +
2 Blue Square Big +
3 Red Circle Big +
4 Red Circle Small –
5 Green Square Small –
6 Green Square Big –
Solution:

To create a decision tree using the Gini Index, we'll go through the process of
selecting the best split points for each attribute at each level of the tree. Here are the
steps:

Step 1: Calculate the Gini Index for the root node (considering all data points):

The Gini Index for a node is given by the formula:

Gini Index = 1 − Σ (p_i)^2

where p_i is the proportion of samples of class i in the node.

For the root node (3 '+' and 3 '−'): Gini Index = 1 − ((3/6)^2 + (3/6)^2) = 0.5

Step 2: Consider each attribute and compute the weighted Gini Index of the split it induces:

 Colour:
 Red branch (examples 1, 3, 4 → 2 '+', 1 '−'): Gini = 1 − ((2/3)^2 + (1/3)^2) ≈ 0.44
 Blue branch (example 2 → 1 '+'): Gini = 0
 Green branch (examples 5, 6 → 2 '−'): Gini = 0
 Weighted Gini (Colour) = (3/6)(0.44) + (1/6)(0) + (2/6)(0) ≈ 0.22
 Shape:
 Square branch (examples 1, 2, 5, 6 → 2 '+', 2 '−'): Gini = 1 − ((2/4)^2 + (2/4)^2) = 0.5
 Circle branch (examples 3, 4 → 1 '+', 1 '−'): Gini = 1 − ((1/2)^2 + (1/2)^2) = 0.5
 Weighted Gini (Shape) = 0.5
 Size:
 Big branch (examples 1, 2, 3, 6 → 3 '+', 1 '−'): Gini = 1 − ((3/4)^2 + (1/4)^2) = 0.375
 Small branch (examples 4, 5 → 2 '−'): Gini = 0
 Weighted Gini (Size) = (4/6)(0.375) + (2/6)(0) = 0.25

Step 3: Choose the attribute with the lowest weighted Gini Index:

Colour has the lowest weighted Gini Index (≈ 0.22), so Colour is chosen as the root attribute.

Step 4: Repeat the process for each branch:

 For the Blue branch, all data points are of the same class (+), so no further split is needed.
 For the Green branch, all data points are of the same class (−), so no further split is needed.
 For the Red branch (examples 1, 3, 4: 2 '+', 1 '−') a further split is needed. Splitting on Size gives
Big = {1, 3} (both '+') and Small = {4} ('−'), i.e. a weighted Gini of 0, whereas splitting on Shape
gives a weighted Gini of ≈ 0.33, so Size is chosen.

Here's the resulting decision tree:

Colour
 /    |     \
Red   Blue   Green
 |     |      |
Size   +      -
 /  \
Big  Small
 |     |
 +     -
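
The weighted Gini values above can be checked with a small Python sketch (standard library only; attribute columns are indexed 0-2 as in the table):

from collections import Counter

# (Colour, Shape, Size, Class) for examples 1-6
rows = [("Red", "Square", "Big", "+"), ("Blue", "Square", "Big", "+"),
        ("Red", "Circle", "Big", "+"), ("Red", "Circle", "Small", "-"),
        ("Green", "Square", "Small", "-"), ("Green", "Square", "Big", "-")]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, attr):
    # Weighted Gini index of the multiway split on the attribute at index `attr`
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[-1])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

for name, idx in [("Colour", 0), ("Shape", 1), ("Size", 2)]:
    print(name, round(weighted_gini(rows, idx), 3))
# Colour 0.222, Shape 0.5, Size 0.25 -> Colour becomes the root

red = [r for r in rows if r[0] == "Red"]
print("Red branch:", round(weighted_gini(red, 1), 3), round(weighted_gini(red, 2), 3))
# Shape 0.333, Size 0.0 -> the Red branch is split on Size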

Given data set, D, the number of attributes, n, and the number of training tuples, |D|, show
that the computational cost of growing a tree is at most n × |D| × log(|D|).
Solution:

To understand the computational cost of growing a decision tree, let's break down
the key components involved in the process. The cost is influenced by the number of
attributes (n), the number of training tuples (|D|), and the structure of the
tree. The commonly used algorithms for growing decision trees, such as CART
(Classification and Regression Trees) or ID3 (Iterative Dichotomiser 3), involve
recursive splitting of the data based on attributes.

Here are the main steps in growing a decision tree:

1. Selecting the best attribute to split on: For each node in the tree, we need
to evaluate each attribute and find the one that minimizes a certain impurity
measure (such as Gini Index or Entropy). This involves going through all
attributes and evaluating potential split points.
2. Splitting the data: Once the best attribute is selected, the data is split into
subsets based on the values of that attribute. This is done for each branch of
the tree.
3. Recursive growth: Steps 1 and 2 are repeated for each branch until a
stopping criterion is met, such as reaching a maximum depth, achieving a
minimum number of samples per leaf, or having all data points in a leaf node
belonging to the same class.

Now, let's analyze the computational cost:


 Selecting the best attribute: This step involves evaluating each attribute, and
for each attribute, considering potential split points. The cost is proportional
to n × |D|, where n is the number of attributes and |D| is the
number of training tuples.
 Splitting the data: For each split, we need to partition the data into subsets.
This involves iterating through the data, and the cost is proportional to |D|.
 Recursive growth: The tree is grown recursively. The depth of the tree is at
most log(|D|), where log is the logarithm base 2.

Since every training tuple appears in at most one node at each level of the tree, the total
work done at any one level is at most n × |D|, and with at most log(|D|) levels the overall
computational cost of growing the tree is at most n × |D| × log(|D|). This analysis provides
an upper bound on the computational complexity of decision tree growth, and the actual cost
may vary based on specific implementations and optimizations.
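
The same argument can be written compactly as a math sketch (assuming each tuple is examined a constant number of times per attribute at each of at most log|D| levels of a roughly balanced tree):

% work per level (all nodes together): each tuple is inspected once per attribute
% number of levels: at most \log_2 |D|
\text{Cost}(n, |D|)
  \;\le\; \sum_{\ell = 1}^{\log_2 |D|} \underbrace{c \cdot n \cdot |D|}_{\text{work at level } \ell}
  \;=\; c \cdot n \cdot |D| \cdot \log_2 |D|
  \;=\; O\!\bigl(n \times |D| \times \log |D|\bigr)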

Calculate the gain in the Gini Index when splitting on A and B. Which attribute would the
decision tree induction algorithm choose?
A B Class Label
T F +
T T +
T T +
T F –
T T +
F F –
F F –
F F –
T T –
T F –
Solution:

To calculate the Gini Index and gain for each attribute (A and B), we first need to
compute the Gini Index for the initial dataset and then calculate the Gini Index for
each possible split on A and B. The gain is calculated by subtracting the weighted
sum of the Gini Indices for the resulting subsets from the Gini Index of the original
dataset.

Let's calculate the Gini Index for the initial dataset:

Gini Index = 1 − Σ_{i=1}^{m} (p_i)^2

where p_i is the proportion of samples of class i in the node, and m is the
number of classes.
The dataset contains 4 '+' and 6 '−' records, so:
Gini Index = 1 − ((4/10)^2 + (6/10)^2) = 0.48

Now, let's calculate the Gini Index for each split on A and B:

Split on A (Attribute A):

 For A = T (7 records):
 P(Class = +) = 4/7
 P(Class = -) = 3/7
 Gini Index = 1 − ((4/7)^2 + (3/7)^2) ≈ 0.49
 For A = F (3 records):
 P(Class = +) = 0/3
 P(Class = -) = 3/3
 Gini Index = 1 − ((0/3)^2 + (3/3)^2) = 0

Gain for A:
Gain(A) = Gini Index (Initial) − (7/10 × Gini Index (A = T) + 3/10 × Gini Index (A = F))
        = 0.48 − (0.7 × 0.49 + 0.3 × 0) ≈ 0.137

Split on B (Attribute B):

 For B = T (4 records):
 P(Class = +) = 3/4
 P(Class = -) = 1/4
 Gini Index = 1 − ((3/4)^2 + (1/4)^2) = 0.375
 For B = F (6 records):
 P(Class = +) = 1/6
 P(Class = -) = 5/6
 Gini Index = 1 − ((1/6)^2 + (5/6)^2) ≈ 0.278

Gain for B:
Gain(B) = Gini Index (Initial) − (4/10 × Gini Index (B = T) + 6/10 × Gini Index (B = F))
        = 0.48 − (0.4 × 0.375 + 0.6 × 0.278) ≈ 0.163

Since Gain(B) ≈ 0.163 is larger than Gain(A) ≈ 0.137, attribute B would be
chosen by the decision tree induction algorithm.
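
A short Python sketch (standard library only) that reproduces these gains from the raw table:

from collections import Counter

# (A, B, Class) records from the table
records = [("T", "F", "+"), ("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
           ("T", "T", "+"), ("F", "F", "-"), ("F", "F", "-"), ("F", "F", "-"),
           ("T", "T", "-"), ("T", "F", "-")]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(records, attr):
    # gain = Gini(parent) - weighted Gini of the children after splitting on `attr`
    parent = [r[-1] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r[-1])
    weighted = sum(len(g) / len(records) * gini(g) for g in groups.values())
    return gini(parent) - weighted

print("Gain(A) =", round(gini_gain(records, 0), 3))   # 0.137
print("Gain(B) =", round(gini_gain(records, 1), 3))   # 0.163 -> split on B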
Identify the attribute that will act as the root node of a decision tree to predict golf play for
following database with Gini Index. Indicate all the intermediate steps.
Outlook Wind PlayGolf
rain strong no
sunny weak yes
overcast weak yes
rain weak yes
sunny strong yes
rain strong no
overcast strong no
Solution:

To find the attribute that will act as the root node of a decision tree using
the Gini Index, we calculate the weighted Gini Index of the split produced by
each attribute and choose the attribute with the highest Gini gain
(equivalently, the lowest weighted Gini Index). The Gini gain is computed by
subtracting the weighted sum of the Gini Indices of the resulting subsets from
the Gini Index of the original dataset.

Here are the steps:

Step 1: Calculate the Gini Index for the initial dataset:

Gini Index = 1 − Σ_{i=1}^{m} (p_i)^2

where p_i is the proportion of samples of class i in the node, and m is the number of classes.

For the initial dataset (4 'yes' and 3 'no'):

Gini Index = 1 − ((4/7)^2 + (3/7)^2) ≈ 0.49

Step 2: Calculate the Gini Index for each attribute (Outlook and Wind):

Outlook:

 For Outlook = Sunny (2 records):
 P(PlayGolf = Yes) = 2/2
 P(PlayGolf = No) = 0/2 (all samples are in the 'Yes' class)
 Gini Index = 1 − ((2/2)^2 + (0/2)^2) = 0
 For Outlook = Overcast (2 records):
 P(PlayGolf = Yes) = 1/2
 P(PlayGolf = No) = 1/2
 Gini Index = 1 − ((1/2)^2 + (1/2)^2) = 0.5
 For Outlook = Rain (3 records):
 P(PlayGolf = Yes) = 1/3
 P(PlayGolf = No) = 2/3
 Gini Index = 1 − ((1/3)^2 + (2/3)^2) ≈ 0.44

Wind:

 For Wind = Strong (4 records):
 P(PlayGolf = Yes) = 1/4
 P(PlayGolf = No) = 3/4
 Gini Index = 1 − ((1/4)^2 + (3/4)^2) = 0.375
 For Wind = Weak (3 records):
 P(PlayGolf = Yes) = 3/3
 P(PlayGolf = No) = 0/3
 Gini Index = 1 − ((3/3)^2 + (0/3)^2) = 0

Step 3: Calculate the Gini gain for each attribute:

Gini Gain = Gini Index (Initial) − Σ (Number of samples in subset / Total number of samples × Gini Index (Subset))

Gini gain for Outlook:

Gain (Outlook) ≈ 0.49 − (2/7 × 0 + 2/7 × 0.5 + 3/7 × 0.44) ≈ 0.49 − 0.33 ≈ 0.16

Gini gain for Wind:

Gain (Wind) ≈ 0.49 − (4/7 × 0.375 + 3/7 × 0) ≈ 0.49 − 0.21 ≈ 0.28

Comparing the two, Wind has the higher Gini gain (i.e. the lower weighted Gini
Index), so Wind will be chosen as the root node of the decision tree.
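
These numbers can be reproduced with a short Python sketch (standard library only):

from collections import Counter

# (Outlook, Wind, PlayGolf) records
rows = [("rain", "strong", "no"), ("sunny", "weak", "yes"), ("overcast", "weak", "yes"),
        ("rain", "weak", "yes"), ("sunny", "strong", "yes"), ("rain", "strong", "no"),
        ("overcast", "strong", "no")]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(rows, attr):
    parent = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[-1])
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return gini(parent) - weighted

print("Gain(Outlook) =", round(gini_gain(rows, 0), 3))  # 0.156
print("Gain(Wind)    =", round(gini_gain(rows, 1), 3))  # 0.276 -> Wind is the root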

UNIT-4

1. Can we design a method that mines the complete set of frequent item sets without
candidate generation? If yes, explain it with the following table:
TID List of items
001 milk, dal, sugar, bread
002 dal, sugar, wheat, jam
003 milk, bread, curd, paneer
004 wheat, paneer, dal, sugar
005 milk, paneer, bread
006 wheat, dal, paneer, bread

Yes. The FP-growth (frequent-pattern growth) algorithm mines the complete set of
frequent item sets without candidate generation. Instead of generating and testing
candidate item sets the way Apriori does, FP-growth compresses the database into a
frequent-pattern tree (FP-tree) and then mines the tree recursively using a
divide-and-conquer strategy.

The Apriori property (every subset of a frequent item set must itself be frequent)
is still used, but only to discard infrequent single items before the tree is built;
no candidate k-item sets are ever generated or counted against the database.

Here are the steps, applied to the given table with an assumed minimum support
count of 3 (50% of the 6 transactions):

Step 1: Scan the database once and count each item.

milk: 3, dal: 4, sugar: 3, bread: 4, wheat: 3, jam: 1, curd: 1, paneer: 4

Items below the support threshold (jam, curd) are discarded. The remaining frequent
items are sorted in descending order of support:

L = {dal: 4, bread: 4, paneer: 4, milk: 3, sugar: 3, wheat: 3}

Step 2: Scan the database a second time, reorder each transaction according to L,
and insert it into the FP-tree so that common prefixes share nodes:

001: dal, bread, milk, sugar
002: dal, sugar, wheat
003: bread, paneer, milk
004: dal, paneer, sugar, wheat
005: bread, paneer, milk
006: dal, bread, paneer, wheat

Each tree node stores an item and a count; a header table links all nodes of the same item.

Step 3: Mine the FP-tree. Starting from the least frequent item in L, collect its
conditional pattern base (the prefix paths leading to that item), build a conditional
FP-tree from it, and mine that tree recursively. Every frequent item set is formed by
appending the suffix item to the frequent items found in its conditional tree, so no
candidates are generated.

For example, the conditional pattern base of wheat is {dal, sugar}: 1,
{dal, paneer, sugar}: 1 and {dal, bread, paneer}: 1. Only dal reaches the support
count of 3 in this base, so {dal, wheat}: 3 is frequent.

Step 4: Collect the results. With a minimum support count of 3 the complete set of
frequent item sets is:

1-item sets: {dal}, {bread}, {paneer}, {milk}, {sugar}, {wheat}
2-item sets: {dal, sugar}, {dal, wheat}, {milk, bread}, {bread, paneer}
(no larger item set reaches the threshold)

Because the FP-tree is usually much smaller than the original database and each
recursive step works only on a conditional tree, FP-growth avoids the repeated
candidate generation and database scans that make Apriori expensive.
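
If a library implementation is acceptable, the result can be reproduced with the sketch below; it assumes the third-party mlxtend package is installed and uses the same 50% support threshold as above:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["milk", "dal", "sugar", "bread"],
    ["dal", "sugar", "wheat", "jam"],
    ["milk", "bread", "curd", "paneer"],
    ["wheat", "paneer", "dal", "sugar"],
    ["milk", "paneer", "bread"],
    ["wheat", "dal", "paneer", "bread"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine all itemsets with support >= 50% (3 of 6 transactions), no candidate generation
print(fpgrowth(df, min_support=0.5, use_colnames=True))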

Consider the transaction data-set:


TransID Items
T1 {a,b}
T2 {b,c,d}
T3 {a,c,d,e}
T4 {a,d,e}
T5 {a,b,c}
T6 {a,b,c,d}
T7 {a}
T8 {a,b,c}
T9 {a,b,d}
T10 {b,c,e}
To find the frequent item sets with the Apriori algorithm for this transaction
dataset, we follow the standard generate-and-test procedure. Since no threshold is
stated, assume a minimum support count of 3 (30% of the 10 transactions). Here's a
step-by-step breakdown:

Step 1: Find frequent 1-item sets (singletons). Count the occurrences of each
individual item in the dataset. Items that meet the minimum support threshold are
frequent 1-item sets.

Frequent 1-item sets (with support counts):

{a}: 8, {b}: 7, {c}: 6, {d}: 5, {e}: 3

Step 2: Generate candidate 2-item sets by joining the frequent 1-item sets found in Step 1.

Candidate 2-item sets:

{a, b}, {a, c}, {a, d}, {a, e}, {b, c}, {b, d}, {b, e}, {c, d}, {c, e}, {d, e}

Step 3: Count the support of the candidate 2-item sets in the dataset and prune those
below the threshold.

Frequent 2-item sets (with support counts):

{a, b}: 5, {a, c}: 4, {a, d}: 4, {b, c}: 5, {b, d}: 3, {c, d}: 3

Repeat Steps 2 and 3 for higher item sets. Candidate 3-item sets are generated from the
frequent 2-item sets; a candidate is kept only if all of its 2-item subsets are frequent.

Candidate 3-item sets: {a, b, c}, {a, b, d}, {a, c, d}, {b, c, d}

Counting their support gives {a, b, c}: 3, {a, b, d}: 2, {a, c, d}: 2, {b, c, d}: 2,
so the only frequent 3-item set is:

{a, b, c}

No candidate 4-item set can be formed whose 3-item subsets are all frequent, so the
algorithm terminates.

The complete set of frequent item sets (minimum support count 3) is therefore
{a}, {b}, {c}, {d}, {e}, {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d} and {a, b, c}.
A lower or higher minimum support threshold would change this result accordingly.
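
The candidate-generation-and-pruning loop described above can be sketched in plain Python (standard library only; min_count = 3 is the assumed threshold from the solution):

from itertools import combinations

transactions = [set("ab"), set("bcd"), set("acde"), set("ade"), set("abc"),
                set("abcd"), set("a"), set("abc"), set("abd"), set("bce")]
min_count = 3  # assumed minimum support count (30% of 10 transactions)

def apriori(transactions, min_count):
    # Returns {itemset (frozenset): support count} for every frequent itemset.
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Join step: unions of two frequent (k-1)-itemsets that form a k-itemset
        candidates = {a | b for i, a in enumerate(prev)
                      for b in prev[i + 1:] if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Support counting
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

for itemset, count in sorted(apriori(transactions, min_count).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
# ends with {'a', 'b', 'c'} 3 -- the only frequent 3-itemset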
Consider the following table to find frequent item sets using vertical data format. Support
threshold 30%
Tid List of items
T01 Milk, biscuits, surf powder, teabags
T02 Teabags, sugar, soap
T03 Milk, sugar, bread, soap
T04 Bread, teabags, biscuits
T05 Chocolates, milk, biscuits
T06 Milk, teabags, bread
T07 Bread, biscuits, chocolate
T08 Milk, surf powder, bread

To find frequent item sets using the vertical data format, each item is associated
with the set of transaction IDs (TIDs) in which it occurs, and support is computed
by intersecting these TID sets instead of rescanning the transactions. Here's a
step-by-step breakdown:

Step 1: Transform the data into vertical format. List, for every item, the TID set
of the transactions that contain it:

Item         | TID set
-------------|--------------------------------
Milk         | {T01, T03, T05, T06, T08}
Biscuits     | {T01, T04, T05, T07}
Surf powder  | {T01, T08}
Teabags      | {T01, T02, T04, T06}
Sugar        | {T02, T03}
Soap         | {T02, T03}
Bread        | {T03, T04, T06, T07, T08}
Chocolate    | {T05, T07}

Step 2: Calculate the support of each item. The support is simply the size of the
item's TID set divided by the total number of transactions (8).

Support for each item:

Milk: 5/8 = 62.5%

Biscuits: 4/8 = 50%

Surf powder: 2/8 = 25%

Teabags: 4/8 = 50%

Sugar: 2/8 = 25%

Soap: 2/8 = 25%

Bread: 5/8 = 62.5%

Chocolate: 2/8 = 25%

Step 3: Identify frequent 1-item sets. Keep the items whose support is at least the
specified threshold of 30% (i.e. at least 3 of the 8 transactions).

Frequent 1-item sets:

{Milk}, {Biscuits}, {Teabags}, {Bread}

Step 4: Generate candidate 2-item sets by combining the frequent 1-item sets.

Candidate 2-item sets:

{Milk, Biscuits}, {Milk, Teabags}, {Milk, Bread}, {Biscuits, Teabags}, {Biscuits, Bread}, {Teabags, Bread}

Step 5: Calculate the support of each candidate 2-item set by intersecting the TID
sets of its items:

{Milk, Biscuits}: {T01, T05} → 2/8 = 25%

{Milk, Teabags}: {T01, T06} → 2/8 = 25%

{Milk, Bread}: {T03, T06, T08} → 3/8 = 37.5%

{Biscuits, Teabags}: {T01, T04} → 2/8 = 25%

{Biscuits, Bread}: {T04, T07} → 2/8 = 25%

{Teabags, Bread}: {T04, T06} → 2/8 = 25%

Step 6: Identify frequent 2-item sets (support ≥ 30%).

Frequent 2-item sets:

{Milk, Bread}

Step 7: Repeat the process for higher item sets if needed. Since only one frequent
2-item set remains, no candidate 3-item set can be formed and the mining stops.

With the 30% support threshold, the frequent item sets are {Milk}, {Biscuits},
{Teabags}, {Bread} and {Milk, Bread}. The advantage of the vertical format is that
all of these supports are obtained by TID-set intersections, without repeatedly
scanning the original transactions.
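
A minimal Python sketch of the vertical (TID-set) approach used above, with the same 30% threshold (a support count of at least 3):

from itertools import combinations

transactions = {
    "T01": {"milk", "biscuits", "surf powder", "teabags"},
    "T02": {"teabags", "sugar", "soap"},
    "T03": {"milk", "sugar", "bread", "soap"},
    "T04": {"bread", "teabags", "biscuits"},
    "T05": {"chocolate", "milk", "biscuits"},
    "T06": {"milk", "teabags", "bread"},
    "T07": {"bread", "biscuits", "chocolate"},
    "T08": {"milk", "surf powder", "bread"},
}
min_count = 3  # 30% of 8 transactions, rounded up

# Vertical layout: item -> set of TIDs that contain it
tidsets = {}
for tid, items in transactions.items():
    for item in items:
        tidsets.setdefault(item, set()).add(tid)

# Frequent 1-itemsets
freq1 = {item: tids for item, tids in tidsets.items() if len(tids) >= min_count}
print(sorted((item, len(tids)) for item, tids in freq1.items()))

# Frequent 2-itemsets: intersect TID sets instead of rescanning the transactions
for a, b in combinations(sorted(freq1), 2):
    common = tidsets[a] & tidsets[b]
    if len(common) >= min_count:
        print({a, b}, sorted(common))   # only {bread, milk} survives: T03, T06, T08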
A database has four transactions. Let min_sup=60% and min_conf=80%
TID date items_bought
100 10/15/2022 {K, A, B, D}
200 10/15/2022 {D, A, C, E, B}
300 10/19/2022 {C, A, B, E}
400 10/22/2022 {B, A, D}
Find all frequent item sets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.

To find frequent items using the Apriori algorithm and the FP-growth algorithm, let's
start with the Apriori algorithm:

Apriori Algorithm:

Step 1: Transpose the data. Record, for each item, which transactions contain it:

Item | 100 | 200 | 300 | 400 |
------------------------------
A    |  1  |  1  |  1  |  1  |
B    |  1  |  1  |  1  |  1  |
C    |  0  |  1  |  1  |  0  |
D    |  1  |  1  |  0  |  1  |
E    |  0  |  1  |  1  |  0  |
K    |  1  |  0  |  0  |  0  |

Step 2: Calculate the support of each item by counting the number of transactions in
which it appears.

Support for each item:

A: 4/4 = 100%

B: 4/4 = 100%

C: 2/4 = 50%

D: 3/4 = 75%

E: 2/4 = 50%

K: 1/4 = 25%

Step 3: Identify Frequent 1-Item Sets Select items with support greater than or equal to
the specified threshold (60%).
Frequent 1-item sets:

{A}, {B}, {D}

Step 4: Generate Candidate 2-Item Sets Create candidate 2-item sets by combining the
frequent 1-item sets

Candidate 2-item sets:

{A, B}, {A, D}, {B, D}

Step 5: Calculate Support for Candidate 2-Item Sets Count the support for each
candidate 2-item set.

Support for each 2-item set:

{A, B}: 4/4 = 100%

{A, D}: 3/4 = 75%

{B, D}: 3/4 = 75%

Step 6: Identify frequent 2-item sets (support ≥ 60%). All three candidates qualify.

Frequent 2-item sets:

{A, B}, {A, D}, {B, D}

Step 7: Generate and test 3-item sets. The only candidate is {A, B, D}; it appears in
transactions 100, 200 and 400, so its support is 3/4 = 75% and it is frequent. No
4-item candidate can be formed, so Apriori stops.

Frequent item sets (Apriori): {A}, {B}, {D}, {A, B}, {A, D}, {B, D}, {A, B, D}

FP-growth Algorithm:

The FP-growth algorithm builds a frequent pattern tree and uses a divide-and-
conquer strategy to mine frequent itemsets. The efficiency of the FP-growth
algorithm often makes it faster than Apriori, especially for large datasets.

For the sake of brevity, I'll provide the results without a step-by-step breakdown:

Frequent item sets using FP-growth (identical to the Apriori result, as it must be):

{A}, {B}, {D}, {A, B}, {A, D}, {B, D}, {A, B, D}


Efficiency Comparison:

The efficiency of Apriori and FP-growth depends on factors such as dataset size,
sparsity, and the specified minimum support threshold. Generally, FP-growth tends
to be more efficient than Apriori, especially for datasets with a large number of
transactions and a low support threshold.

The key reason for the efficiency of FP-growth is its ability to construct a condensed
representation of the dataset (the FP-tree) and avoid the generation of candidate
itemsets, which can be time-consuming in Apriori.

In summary, FP-growth is often more efficient than Apriori, especially for larger
datasets. However, the actual efficiency can vary based on the characteristics of the
dataset and the specific parameters used.
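
To confirm that both algorithms return the same frequent item sets for this database, one can run both miners from the third-party mlxtend package (assumed to be installed); on a four-transaction toy database the running-time difference is negligible, so FP-growth's efficiency advantage only shows up on larger data:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

transactions = [["K", "A", "B", "D"], ["D", "A", "C", "E", "B"],
                ["C", "A", "B", "E"], ["B", "A", "D"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Both miners return the same frequent itemsets at min_sup = 60%:
# {A}, {B}, {D}, {A,B}, {A,D}, {B,D}, {A,B,D}
print(apriori(df, min_support=0.6, use_colnames=True))
print(fpgrowth(df, min_support=0.6, use_colnames=True))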
UNIT-5

Suppose that the data-mining task is to cluster the following eight points (representing
location) into three clusters: A1 (2;10) ; A2 (2;5) ; A3 (8;4) ; B1 (5;8) ; B2 (7;5) ; B3 (6;4) ;
C1 (1;2) ; C2 (4;9). The distance function is Euclidean distance. Suppose initially we assign
A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to
determine: the three cluster centers after the first round of execution

Answers:

The k-means algorithm involves an iterative process to assign points to


clusters and update cluster centers. Let's go through the first round of
execution:

Initial Assignment:
 Cluster Centers:
 A1 (2, 10)
 B1 (5, 8)
 C1 (1, 2)

Assignment of Points to Clusters:


For each point, calculate the Euclidean distance to each cluster center and
assign the point to the cluster with the nearest center.

 Assign each point to the nearest cluster:


 A1 (2, 10) to Cluster A
 A2 (2, 5) to Cluster C
 A3 (8, 4) to Cluster B
 B1 (5, 8) to Cluster B
 B2 (7, 5) to Cluster B
 B3 (6, 4) to Cluster B
 C1 (1, 2) to Cluster C
 C2 (4, 9) to Cluster B

Update Cluster Centers:

Calculate the mean (centroid) of each cluster to update the cluster centers:

 New Cluster Centers:
 Cluster A (A1): (2, 10)
 Cluster B (A3, B1, B2, B3, C2): ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
 Cluster C (A2, C1): ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

Summary:
After the first round of execution:

 Cluster A centre: (2, 10)
 Cluster B centre: (6, 6)
 Cluster C centre: (1.5, 3.5)

These updated cluster centers will be used in subsequent rounds of the k-


means algorithm until convergence. The process involves iteratively
assigning points to the nearest cluster and updating the cluster centers
until there is minimal change in the assignment.
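
One round of the algorithm can be verified with a short Python sketch (standard library only; math.dist needs Python 3.8+):

import math

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
centers = {"A": points["A1"], "B": points["B1"], "C": points["C1"]}

# Assignment step: each point joins the cluster with the nearest current center
clusters = {k: [] for k in centers}
for name, p in points.items():
    nearest = min(centers, key=lambda k: math.dist(p, centers[k]))
    clusters[nearest].append(name)

# Update step: each center becomes the mean of the points assigned to it
for k, members in clusters.items():
    xs = [points[m][0] for m in members]
    ys = [points[m][1] for m in members]
    centers[k] = (sum(xs) / len(xs), sum(ys) / len(ys))

print(clusters)  # {'A': ['A1'], 'B': ['A3', 'B1', 'B2', 'B3', 'C2'], 'C': ['A2', 'C1']}
print(centers)   # {'A': (2.0, 10.0), 'B': (6.0, 6.0), 'C': (1.5, 3.5)}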

Consider five points {X1, X2, X3, X4, X5} with the following coordinates as a
two-dimensional sample for clustering: X1 = (0.5, 2.5); X2 = (0, 0); X3 = (1.5, 1);
X4 = (5, 1); X5 = (6, 2). Illustrate the K-means partitioning algorithm using the above data set.
Answer:

Let's illustrate the K-means clustering algorithm using the provided dataset with five
points:

X1 = (0.5, 2.5), X2 = (0, 0), X3 = (1.5, 1), X4 = (5, 1), X5 = (6, 2)

Step 1: Initialization

Let's initialize the K-means algorithm with k = 2 clusters and select initial cluster centers.

 Initial Cluster Centers:

 C1 = (0.5, 2.5)
 C2 = (5, 1)

Step 2: Assignment
Assign each point to the nearest cluster based on Euclidean distance.

 Assign each point to the nearest cluster:

 X1 is closer to C1
 X2 is closer to C1
 X3 is closer to C1
 X4 is closer to C2
 X5 is closer to C2

Step 3: Update Cluster Centers

Calculate the mean (centroid) of each cluster to update the cluster centers.

 Updated Cluster Centers:

 C1 = ((0.5 + 0 + 1.5)/3, (2.5 + 0 + 1)/3) = (0.67, 1.17)
 C2 = ((5 + 6)/2, (1 + 2)/2) = (5.5, 1.5)

Repeat Iterations

Repeat the assignment and update steps until convergence or a predetermined number
of iterations. In this example, a second assignment pass with the updated centers
leaves every point in the same cluster, so the algorithm has already converged with
clusters {X1, X2, X3} and {X4, X5}.

Summary:

After convergence, the dataset is partitioned into two clusters, {X1, X2, X3} with
center (0.67, 1.17) and {X4, X5} with center (5.5, 1.5); each center is the mean of
the coordinates of the points in its cluster.
Cluster the following data into three clusters, using the k-means method.
X y
10.9 12.6
2.3 8.4
8.4 12.6
12.1 16.2
7.3 8.9
23.4 11.3
19.7 18.5
17.1 17.2
3.2 3.4
1.3 22.8
2.4 6.9
2.4 7.1
3.1 8.3
2.9 6.9
11.2 4.4
8.3 8.7
Solution :

Step 1: Initialization
Randomly select initial cluster centers. For simplicity, let's assume the initial
centers are the first three data points:

 Initial Cluster Centers:

 C1 = (10.9, 12.6)
 C2 = (2.3, 8.4)
 C3 = (8.4, 12.6)

Step 2: Assignment
Assign each point to the nearest cluster based on Euclidean distance.

 Assign each point to the nearest cluster:

 Points closer to C1 than to C2 and C3 belong to Cluster 1.
 Points closer to C2 than to C1 and C3 belong to Cluster 2.
 Points closer to C3 than to C1 and C2 belong to Cluster 3.

Step 3: Update Cluster Centers

Calculate the mean (centroid) of each cluster to update the cluster centers.

 Updated Cluster Centers:

 C1 = mean of the points in Cluster 1
 C2 = mean of the points in Cluster 2
 C3 = mean of the points in Cluster 3

Repeat Iterations
Repeat the assignment and update steps until convergence or a
predetermined number of iterations. The points will be reassigned to
clusters, and cluster centers will be updated in each iteration.

Convergence:
The algorithm converges when the cluster assignments and centers do not
change significantly between iterations.

Please note that the actual numerical calculations and iterations depend on
the specific implementation of the algorithm. The given steps provide a
high-level overview of the k-means clustering process.
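
Because the solution above is described only in general terms, the sketch below (plain Python, 3.8+ for math.dist, no external libraries) runs k-means to convergence on the 16 points with the stated initial centers; it guards against the possibility of an empty cluster by keeping the old center in that case:

import math

points = [(10.9, 12.6), (2.3, 8.4), (8.4, 12.6), (12.1, 16.2), (7.3, 8.9),
          (23.4, 11.3), (19.7, 18.5), (17.1, 17.2), (3.2, 3.4), (1.3, 22.8),
          (2.4, 6.9), (2.4, 7.1), (3.1, 8.3), (2.9, 6.9), (11.2, 4.4), (8.3, 8.7)]
centers = [points[0], points[1], points[2]]   # first three points as initial centers

while True:
    # Assignment step: index of the nearest center for every point
    clusters = [[] for _ in centers]
    for p in points:
        i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        clusters[i].append(p)
    # Update step: mean of each cluster (keep the old center if a cluster is empty)
    new_centers = [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
        for i, c in enumerate(clusters)
    ]
    if new_centers == centers:   # assignments no longer change the means: converged
        break
    centers = new_centers

for c, members in zip(centers, clusters):
    print(tuple(round(v, 2) for v in c), members)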

Suppose that the data mining task is to cluster points into three clusters, where the points are
A1(2,10),A2(2,5),A3(8,4),B1(5,8),B2(7,5),B3(6,4),C1(1,2),C2(4,9). The distance function is
Euclidean distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster,
respectively. Use the k-means algorithm to show only the three cluster centers after the first
round of execution.
Sure, let's apply the k-means algorithm to cluster the given points into three clusters
and show the cluster centers after the first round of execution.

Given Points:

A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9)

Initial Assignment:

Let's initialize the k-means algorithm with three clusters and assign the initial centers
as follows:

 Initial Cluster Centers:

 Cluster 1: A1 (2, 10)
 Cluster 2: B1 (5, 8)
 Cluster 3: C1 (1, 2)

Assignment of Points to Clusters:

Assign each point to the nearest cluster based on Euclidean distance.

 Assign each point to the nearest cluster:

 A1 to Cluster 1
 A2 to Cluster 3
 A3 to Cluster 2
 B1 to Cluster 2
 B2 to Cluster 2
 B3 to Cluster 2
 C1 to Cluster 3
 C2 to Cluster 2

Update Cluster Centers:

Calculate the mean (centroid) of each cluster to update the cluster centers.

 Updated Cluster Centers After the First Round:

 Cluster 1 (A1): (2, 10)
 Cluster 2 (A3, B1, B2, B3, C2): ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
 Cluster 3 (A2, C1): ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

These three centers, (2, 10), (6, 6) and (1.5, 3.5), are the cluster centers after the
first round of execution. The process can be repeated for further iterations until
convergence.

Given the points x1 = {1, 0}, x2 = {0,1}, x3={2, 1}, and x4 = {3, 3}. Suppose that these
points are randomly clustered into two clusters: C1 = {x1, x3} and C2 = {x2, x4}. Apply one
iteration of Kmeans partitional-clustering algorithm and find new distribution of elements in
clusters. What is the change in a total square error?

Answer:

The K-means algorithm involves an iterative process where points are


assigned to clusters, and the cluster centers are updated. Let's go through
one iteration of the K-means algorithm for the given initial clustering:

Initial Clustering:

C1 = {x1, x3}, C2 = {x2, x4}

Step 1: Update Cluster Centers:

Calculate the mean (centroid) of each cluster to update the cluster centers.

 Updated Cluster Centers:

 C1 = mean of x1 and x3 = ((1 + 2)/2, (0 + 1)/2) = (1.5, 0.5)
 C2 = mean of x2 and x4 = ((0 + 3)/2, (1 + 3)/2) = (1.5, 2)

Step 2: Re-assign Points to Clusters:

Assign each point to the nearest cluster center based on Euclidean distance.

 Re-assign each point to the nearest cluster:

 x1 = (1, 0): distance 0.71 to C1, 2.06 to C2 → C1
 x2 = (0, 1): distance 1.58 to C1, 1.80 to C2 → C1
 x3 = (2, 1): distance 0.71 to C1, 1.12 to C2 → C1
 x4 = (3, 3): distance 2.92 to C1, 1.80 to C2 → C2

New Clustering:

C1 = {x1, x2, x3}, C2 = {x4}

Change in Total Square Error:

The total square error is the sum of squared distances between each point and the
centroid of its cluster:

Total Square Error = Σ_{i=1}^{4} d(x_i, centroid of its cluster)^2

For the initial clustering, with centroids (1.5, 0.5) and (1.5, 2):
E^2 = (0.5 + 0.5) + (3.25 + 3.25) = 7.5

For the new clustering, the centroids become C1 = ((1 + 0 + 2)/3, (0 + 1 + 1)/3) = (1, 0.67)
and C2 = (3, 3), giving:
E^2 = (0.44 + 1.11 + 1.11) + 0 ≈ 2.67

So one iteration of K-means moves x2 from the second cluster to the first and reduces
the total square error from 7.5 to about 2.67, a decrease of roughly 4.83. A further
iteration would leave the assignments unchanged, so the algorithm has converged.
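
The error figures can be checked with a short Python sketch (standard library only; math.dist needs Python 3.8+):

import math

points = {"x1": (1, 0), "x2": (0, 1), "x3": (2, 1), "x4": (3, 3)}

def centroid(names):
    xs = [points[n][0] for n in names]
    ys = [points[n][1] for n in names]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def total_square_error(clusters):
    # Sum of squared distances from every point to the centroid of its cluster
    total = 0.0
    for names in clusters:
        cx, cy = centroid(names)
        total += sum((points[n][0] - cx) ** 2 + (points[n][1] - cy) ** 2 for n in names)
    return total

old = [["x1", "x3"], ["x2", "x4"]]
print("initial E^2:", total_square_error(old))                        # 7.5

# One K-means iteration: centroids of the old clusters, then re-assignment
centers = [centroid(c) for c in old]
new = [[], []]
for name, p in points.items():
    k = min((0, 1), key=lambda i: math.dist(p, centers[i]))
    new[k].append(name)
print("new clusters:", new)                                           # [['x1', 'x2', 'x3'], ['x4']]
print("E^2 after the iteration:", round(total_square_error(new), 2))  # 2.67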
