
Data Mining 3

The document describes clustering 8 points (A1-A8) based on their X and Y coordinates. It assigns the points to two clusters, calculates the means of each cluster, then recalculates the distances and cluster assignments. It also covers calculating entropy and information gain to determine the best attribute to split a dataset for decision tree classification.

Uploaded by

doaa mahmoud

i X Y

A1 2 10
A2 2 5
A3 8 4
A4 5 8
A5 7 5
A6 6 4
A7 1 2
A8 4 9

Step 1: Choose two points as the initial cluster centers (A1 and A2) and calculate the Euclidean distance from each point to each center

i A1 (Cluster 1) A2 (Cluster 2)
A1 0 5
A2 5 0
A3 8.5 6.08
A4 3.6 4.2
A5 7.07 5
A6 7.2 4.1
A7 8.06 3.16
A8 2.2 4.47
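
The distances in the table above can be reproduced with a short script. This is a minimal sketch, assuming plain Euclidean distance (which matches the tabulated values); the names `points` and `distances_to_centers` are illustrative and are not part of the original solution.

```python
import math

# Coordinates copied from the table of points A1..A8.
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}


def distances_to_centers(points, centers):
    """Return {name: [Euclidean distance to each center]}."""
    return {
        name: [math.dist(p, c) for c in centers]
        for name, p in points.items()
    }


# A1 and A2 are the initial centers chosen in Step 1.
initial_centers = [points["A1"], points["A2"]]
table = distances_to_centers(points, initial_centers)

for name, (d1, d2) in table.items():
    print(f"{name}  {d1:.2f}  {d2:.2f}")
```

Each point is then assigned to the center with the smaller of its two distances.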

Assign each point to the nearest center:

A1 belongs to the cluster of point A1

A2 belongs to the cluster of point A2

A3 belongs to the cluster of point A2

A4 belongs to the cluster of point A1

A5 belongs to the cluster of point A2

A6 belongs to the cluster of point A2

A7 belongs to the cluster of point A2

A8 belongs to the cluster of point A1


i X Y Cluster
A1 2 10 1
A2 2 5 2
A3 8 4 2
A4 5 8 1
A5 7 5 2
A6 6 4 2
A7 1 2 2
A8 4 9 1

Calculate the mean of Cluster 1 (A1, A4, A8)

X = (2+5+4)/3 = 3.67

Y = (10+8+9)/3 = 9

Mean of Cluster 1: (3.67, 9).

Calculate the mean of Cluster 2 (A2, A3, A5, A6, A7)

X = (2+8+7+6+1)/5 = 4.8

Y = (5+4+5+4+2)/5 = 4

Mean of Cluster 2: (4.8, 4).
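
The mean calculation can be checked in code. The sketch below assumes the cluster assignment from the table above; `assignment` and `cluster_means` are hypothetical names introduced only for illustration.

```python
# Coordinates and cluster labels copied from the assignment table.
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
assignment = {
    "A1": 1, "A2": 2, "A3": 2, "A4": 1,
    "A5": 2, "A6": 2, "A7": 2, "A8": 1,
}


def cluster_means(points, assignment):
    """Average the X and Y coordinates of the points in each cluster."""
    sums = {}
    for name, cluster in assignment.items():
        x, y = points[name]
        sx, sy, n = sums.get(cluster, (0, 0, 0))
        sums[cluster] = (sx + x, sy + y, n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}


means = cluster_means(points, assignment)
for cluster, (mx, my) in sorted(means.items()):
    print(f"Cluster {cluster}: ({mx:.2f}, {my:.2f})")
```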

Step 2: Recalculate the distance from each point to the cluster means (3.67, 9) and (4.8, 4)

i Cluster 1 Cluster 2
A1 1.94 6.62
A2 4.33 2.97
A3 6.62 3.2
A4 1.67 4
A5 5.21 2.42
A6 5.52 1.2
A7 7.49 4.29
A8 0.33 5.06

A1 belongs to Cluster 1

A2 belongs to Cluster 2

A3 belongs to Cluster 2

A4 belongs to Cluster 1

A5 belongs to Cluster 2

A6 belongs to Cluster 2

A7 belongs to Cluster 2

A8 belongs to Cluster 1

i X Y Cluster
A1 2 10 1
A2 2 5 2
A3 8 4 2
A4 5 8 1
A5 7 5 2
A6 6 4 2
A7 1 2 2
A8 4 9 1

Calculate the mean of Cluster 1 (A1, A4, A8)

X = (2+5+4)/3 = 3.67

Y = (10+8+9)/3 = 9

Mean of Cluster 1: (3.67, 9).

Calculate the mean of Cluster 2 (A2, A3, A5, A6, A7)

X = (2+8+7+6+1)/5 = 4.8

Y = (5+4+5+4+2)/5 = 4

Mean of Cluster 2: (4.8, 4).

The cluster assignments and the means are the same as in the previous step, so the clustering has converged.
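
The whole procedure (assign, recompute means, repeat until nothing changes) can be sketched as a small loop. This is an illustrative implementation of standard k-means under the same assumptions as the worked example (Euclidean distance, A1 and A2 as initial centers); the function name `kmeans` and the `max_iter` parameter are assumptions, and clusters are indexed 0 and 1 rather than 1 and 2.

```python
import math

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}


def kmeans(points, centers, max_iter=100):
    """Repeat assignment and mean update until assignments stop changing."""
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_assignment = {
            name: min(range(len(centers)),
                      key=lambda k: math.dist(p, centers[k]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [points[n] for n, c in assignment.items() if c == k]
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        centers = new_centers
    return assignment, centers


assignment, centers = kmeans(points, [points["A1"], points["A2"]])
print(assignment)
print(centers)
```

On this data the second pass produces the same assignment as the first, so the loop stops with the means calculated above.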

Q2

Find the entropy of the set. We have 10 records.

Output label: 3 Yes, 7 No

E(S) = - PY * log2 PY - PN * log2 PN

= − 3/10 ∗ log2 (3/10) – 7/10 ∗ log2 (7/10)

= 0.88
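
The entropy value can be verified numerically. The following is a minimal sketch; the helper name `entropy` and its list-of-counts interface are assumptions made for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    # Classes with zero count contribute nothing (0 * log2 0 is taken as 0).
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# 3 Yes and 7 No out of 10 records.
e_s = entropy([3, 7])
print(round(e_s, 2))  # 0.88
```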
Find the information gain for each attribute. Attribute: Marital Status

Split of the 10 records (3 Yes, 7 No):
Married: 4 No
Single: 2 Yes, 2 No
Divorced: 1 Yes, 1 No

E(Married) = 0

E(Single) = 1

E(Divorced) = 1

G(S, "Marital Status") = E(S) – (PM*E(Married) + PS*E(Single) + PD*E(Divorced))

= 0.88 - (4/10 * 0 + 4/10 * 1 + 2/10 * 1)

= 0.88 - 0.6 = 0.28
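
The gain for Marital Status can be reproduced from the class counts in each branch. This is a sketch under the assumption that each branch is described by a (Yes, No) count pair; `information_gain` and the dictionary layout are hypothetical names, not part of the original solution.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, branches):
    """Parent entropy minus the size-weighted entropy of the branches."""
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branches.values())
    return entropy(parent_counts) - weighted


# (Yes, No) counts per Marital Status value, as used in the calculation.
marital_status = {"Married": (0, 4), "Single": (2, 2), "Divorced": (1, 1)}

for value, counts in marital_status.items():
    print(f"{value}: weight {sum(counts)}/10, entropy {entropy(counts):.2f}")

gain = information_gain((3, 7), marital_status)
print(round(gain, 2))  # 0.28
```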

Find the information gain for each attribute. Attribute: Refund

Split of the 10 records (3 Yes, 7 No):
Refund = Yes: 3 No
Refund = No: 3 Yes, 4 No

E(Yes) = 0

E(No) = - PN * log2 PN - PY * log2 PY

= − 4/7 ∗ log2 (4/7) – 3/7 ∗ log2 (3/7)

= 0.985

G(S, "Refund") = E(S) – (PYes*E(Yes) + PNo*E(No))

= 0.88 - (3/10 * 0 + 7/10 * 0.985)

= 0.19
Find the information gain for each attribute. Attribute: Taxable Income

Split of the 10 records (3 Yes, 7 No):
Taxable Income > 100: 3 No
Taxable Income <= 100: 3 Yes, 4 No

E(>100) = 0

E(<=100) = - PN * log2 PN - PY * log2 PY

= − 4/7 ∗ log2 (4/7) – 3/7 ∗ log2 (3/7)

= 0.985

G(S, "Taxable Income") = E(S) – (P(>100)*E(>100) + P(<=100)*E(<=100))

= 0.88 - (3/10 * 0 + 7/10 * 0.985)

= 0.19

From the calculated information gains, select the attribute with the highest value:

Attribute: Refund ➔ 0.19

Attribute: Marital Status ➔ 0.28

Attribute: Taxable Income ➔ 0.19

So, select Marital Status as the first split, as it has the highest information gain.
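
The comparison of the three attributes can be sketched as follows. The branch counts are the (Yes, No) pairs used in the calculations above; the names `splits`, `gains` and `best` are illustrative assumptions.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, branches):
    """Parent entropy minus the size-weighted entropy of the branches."""
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent_counts) - weighted


# (Yes, No) counts for every branch of each candidate attribute.
splits = {
    "Refund": [(0, 3), (3, 4)],
    "Marital Status": [(0, 4), (2, 2), (1, 1)],
    "Taxable Income": [(0, 3), (3, 4)],
}

gains = {name: information_gain((3, 7), b) for name, b in splits.items()}
best = max(gains, key=gains.get)

for name, g in gains.items():
    print(f"{name}: {g:.2f}")
print("Best attribute:", best)  # Marital Status
```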

Marital Status
Married ➔ No
Divorced, Single ➔ remaining records, to be split further:

TID Refund Tax
1 Yes 125k
3 No 70k
5 No 95k
7 Yes 220k
8 No 85k
10 No 90k
