Solutions To Part 2 of The Mid-Term Examination: I.E. 1062/2062 DATA MINING



Question 1
a) (4 points)
The raw data set of 155 instances has a total of 40+40+20+4=104 in class Low, 30+3+10=43 in
class Medium, and 5+3=8 in class High. Thus the entropy measure at the root node =
I(104,43,8) =
-(104/155)log2(104/155) - (43/155)log2(43/155) - (8/155)log2(8/155) = 1.1202
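As a quick check, this entropy can be computed with a small Python helper (a sketch; the function name is our choice, not part of the solution):

```python
from math import log2

def entropy(counts):
    """I(c1, ..., ck) in bits, using the convention 0 * log2(0) = 0."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(round(entropy([104, 43, 8]), 4))  # 1.1202
```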

b) (8 points)
Consider "Department" for the splitting attribute: We have
Sales: 80 Low, 30 Medium, 0 High (total of 110 instances)
I(80,30,0) = -(80/110)log2(80/110) - (30/110)log2(30/110) - (0/110)log2(0/110) = 0.8454
(here and below we use the convention 0·log2(0) = 0)
Systems: 20 Low, 3 Medium, 8 High (total of 31 instances)
I(20,3,8) = -(20/31)log2(20/31) - (3/31)log2(3/31) - (8/31)log2(8/31) = 1.2383
Marketing: 4 Low, 10 Medium, 0 High (total of 14 instances)
I(4,10,0) = -(4/14)log2(4/14) - (10/14)log2(10/14) - (0/14)log2(0/14) = 0.8631
Thus E(Department) = (110/155)0.8454 + (31/155)1.2383 + (14/155)0.8631 = 0.9255

Thus Expected Information Gain from having “Department” at the root node =
I(104,43,8) - E(Department) = 1.1202 - 0.9255 = 0.1946

To compute the Gain Ratio we need the intrinsic value of the split, which is = I(110,31,14) =
-(110/155)log2(110/155) - (31/155)log2(31/155) - (14/155)log2(14/155) = 1.1288
Thus gain ratio = 0.1946/1.1288 = 0.1724
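The whole gain/gain-ratio calculation for "Department" can be verified in a few lines of Python (a sketch; names like `expected_info` are ours):

```python
from math import log2

def entropy(counts):
    """I(c1, ..., ck) in bits, with 0 * log2(0) taken as 0."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

root = entropy([104, 43, 8])                       # entropy at the root, ~1.1202
branches = [[80, 30, 0], [20, 3, 8], [4, 10, 0]]   # Sales, Systems, Marketing
n = sum(sum(b) for b in branches)                  # 155 instances in total

expected_info = sum(sum(b) / n * entropy(b) for b in branches)  # E(Department)
gain = root - expected_info                        # information gain, ~0.1946
intrinsic = entropy([sum(b) for b in branches])    # I(110,31,14), ~1.1288
gain_ratio = gain / intrinsic                      # ~0.1724

print(round(gain, 4), round(gain_ratio, 4))
```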

c) (2 points)
There are no instances with Status=Junior, Department=Systems, and Age=40+. We therefore
look at all instances arriving at the Junior/Systems node: there are 23, with 20 in Low and
3 in Medium. We then assign the “???” leaf the majority class of these 23 instances, which is LOW.

d) (7 points)
Consider the 48 Senior status records: we have the following data:

Department   Age    Salary   Count
Sales        30's   Medium   30
Systems      30's   High     5
Systems      40+    High     3
Marketing    40+    Medium   10

Thus we have 48 records with 0 in Low, 40 in Medium, 8 in High. The entropy measure at the
node being considered is thus = I(0,40,8) =
-(0/48)log2(0/48) - (40/48)log2(40/48) - (8/48)log2(8/48) = 0.65
AGE AS SPLITTING ATTRIBUTE:
20’s: (0 instances)
I(0,0,0) = 0
30’s: 0 Low, 30 Medium, 5 High (total of 35 instances)
I(0,30,5) = -(0/35)log2(0/35) - (30/35)log2(30/35) - (5/35)log2(5/35) = 0.5917
40+: 0 Low, 10 Medium, 3 High (total of 13 instances)
I(0,10,3) = -(0/13)log2(0/13) - (10/13)log2(10/13) - (3/13)log2(3/13) = 0.7793

Thus E(Age) = (0/48)0 + (35/48)0.5917 + (13/48)0.7793= 0.6425

Thus Expected Information Gain from Age =
I(0,40,8) - E(Age) = 0.65 - 0.6425 = 0.0075

DEPARTMENT AS SPLITTING ATTRIBUTE:


Sales: 0 Low, 30 Medium, 0 High (total of 30 instances)
I(0,30,0) = -(0/30)log2(0/30) - (30/30)log2(30/30) - (0/30)log2(0/30) = 0
Systems: 0 Low, 0 Medium, 8 High (total of 8 instances)
I(0,0,8) = -(0/8)log2(0/8) - (0/8)log2(0/8) - (8/8)log2(8/8) = 0
Marketing: 0 Low, 10 Medium, 0 High (total of 10 instances)
I(0,10,0) = -(0/10)log2(0/10) - (10/10)log2(10/10) - (0/10)log2(0/10) = 0

Thus E(Department) = 0

Thus Expected Information Gain from Department =
I(0,40,8) - E(Department) = 0.65 - 0.0 = 0.65
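Both candidate gains at this node can be checked with the same entropy helper as before (a sketch; `expected_info` is our name):

```python
from math import log2

def entropy(counts):
    """I(c1, ..., ck) in bits; an empty branch has entropy 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def expected_info(branches):
    """Weighted average entropy over the branches of a split."""
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * entropy(b) for b in branches)

node = entropy([0, 40, 8])  # ~0.65 at the Senior node

# Age split: 20's (empty), 30's, 40+
gain_age = node - expected_info([[0, 0, 0], [0, 30, 5], [0, 10, 3]])   # ~0.0075

# Department split: Sales, Systems, Marketing -- every branch is pure
gain_dept = node - expected_info([[0, 30, 0], [0, 0, 8], [0, 10, 0]])  # ~0.65
```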

So we choose Department and this completes the tree. The final tree, with the class counts
[Low, Medium, High] at each node, is:

Status [104,43,8]
  Senior -> Department [0,40,8]
    Sales     [0,30,0]  -> MEDIUM
    Systems   [0,0,8]   -> HIGH
    Marketing [0,10,0]  -> MEDIUM
  Junior -> Department [104,3,0]
    Sales     [80,0,0]  -> LOW
    Marketing [4,3,0]   -> LOW
    Systems   [20,3,0]  -> Age
      20's [20,0,0] -> LOW
      30's [0,3,0]  -> MEDIUM
      40+  [0,0,0]  -> LOW

e) (2 points)
A Junior Sales employee in her 30's is classified as LOW via the Status=Junior,
Department=Sales branch of the tree (note that Age is irrelevant to this classification!).
f) (3 points)
If we prune the Age subtree under Status=Junior, Department=Systems, we get a single
classification leaf of LOW instead, since this is the majority class of the 23 instances with
Status=Junior and Department=Systems (in fact, the entire right-hand side under Status=Junior
could be replaced with a single leaf labeled LOW!).
As compared to the full tree, which is perfect and has 0 error, the error after pruning is 3/155 =
1.94% (the 3 Medium instances that are Junior and Systems are now wrongly classified as Low).

Question 2 (5 points)
A neural net will need seven input nodes and three output nodes defined as below:
INPUT
Node 1: 1 if Department=Sales, 0 otherwise
Node 2: 1 if Department=Systems, 0 otherwise
Node 3: 1 if Department=Marketing, 0 otherwise
Node 4: 1 if Age=20’s, 0 otherwise
Node 5: 1 if Age=30’s, 0 otherwise
Node 6: 1 if Age=40+, 0 otherwise
Node 7: 1 if Status=Senior, 0 if Junior
OUTPUT
Node 1: 1 if in class Low, 0 otherwise
Node 2: 1 if in class Medium, 0 otherwise
Node 3: 1 if in class High, 0 otherwise

With the above definition, the instance of part (e) (Sales, 30’s, Junior) would lead to the input
vector (1,0,0,0,1,0,0).
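This one-hot encoding is easy to mechanize; a minimal sketch (the function and list names are our choices):

```python
DEPARTMENTS = ["Sales", "Systems", "Marketing"]
AGES = ["20's", "30's", "40+"]

def encode(department, age, status):
    """One-hot encode an instance into the 7-component input vector."""
    vec = [1 if department == d else 0 for d in DEPARTMENTS]
    vec += [1 if age == a else 0 for a in AGES]
    vec.append(1 if status == "Senior" else 0)  # node 7: Senior flag
    return vec

print(encode("Sales", "30's", "Junior"))  # [1, 0, 0, 0, 1, 0, 0]
```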

Question 3 (9 points)
Covering Algorithm for class Medium
Department
Sales: MEDIUM=30 out of 110 so p/t=30/110=0.27
Systems: MEDIUM=3 out of 31 so p/t=3/31=0.10
Marketing: MEDIUM=10 out of 14 so p/t=10/14=0.71
Status
Junior: MEDIUM=3 out of 107 so p/t=3/107 = 0.03
Senior: MEDIUM=40 out of 48 so p/t=40/48=0.83 (best)
Age
20’s: MEDIUM=0 out of 60 so p/t=0/60=0.0
30’s: MEDIUM=33 out of 82 so p/t=33/82=0.40
40+: MEDIUM=10 out of 13 so p/t=10/13=0.77

So the partial rule is “If (Status=Senior) then MEDIUM”, with an accuracy of 0.83.
Considering only the instances with Senior status (the 48 instances covered by the above
clause), we have
Department
Sales: MEDIUM=30 out of 30 so p/t=1.0 (best)
Systems: MEDIUM=0 out of 8 so p/t=0
Marketing: MEDIUM=10 out of 10 so p/t=1.0

Age
20’s: MEDIUM=0 out of 0 so p/t is not meaningful
30’s: MEDIUM=30 out of 35 so p/t=30/35=0.86
40+: MEDIUM=10 out of 13 so p/t=10/13=0.77

Both “Department=Sales” and “Department=Marketing” lead to perfect classification.
However, Sales has higher coverage (30 instances as opposed to 10), so we pick Sales.

The final rule is "If (Status=Senior) and (Department=Sales) then MEDIUM" - this rule
covers a total of 30 instances (out of 155).
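The p/t scoring in the second round can be sketched in Python over the Senior-status records tabulated in Question 1(d) (record keys and the function name are our choices):

```python
# Senior-status records from the table in Question 1(d).
SENIOR = [
    {"dept": "Sales",     "age": "30's", "class": "Medium", "count": 30},
    {"dept": "Systems",   "age": "30's", "class": "High",   "count": 5},
    {"dept": "Systems",   "age": "40+",  "class": "High",   "count": 3},
    {"dept": "Marketing", "age": "40+",  "class": "Medium", "count": 10},
]

def pt(records, attr, value, target="Medium"):
    """Return (p, t): target-class instances covered vs. all instances covered."""
    covered = [r for r in records if r[attr] == value]
    p = sum(r["count"] for r in covered if r["class"] == target)
    t = sum(r["count"] for r in covered)
    return p, t

print(pt(SENIOR, "dept", "Sales"))  # (30, 30) -- accuracy 1.0
print(pt(SENIOR, "age", "30's"))    # (30, 35) -- accuracy ~0.86
```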
