
Homework4DecisionTree Answers Vs1


Homework 4: Decision Tree Inductive Learning

The file dt_train.csv contains 601 lines with 10 variables. The first line contains
column headers that may be interpreted as follows:
id: observation identifier.
t1: measurement on test 1.
t2: measurement on test 2.
t3: measurement on test 3.
t4: measurement on test 4.
t5: measurement on test 5.
t6: measurement on test 6.
t7: measurement on test 7.
t8: measurement on test 8.
d: binary output variable set to 1 if product is defective and 0 otherwise.
The next 600 lines contain 600 examples, for which the values of the above
features are specified.
The table below reproduces the first 2 observations.
id  t1  t2  t3  t4  t5  t6  t7  t8  d
1   84   8  64   6  94  36  51  21  1
2   39  67  61  77  80  35  89  80  1

Use rpart with the training examples to come up with a small set of rules that
correctly classify the output variable d based on input variable values (t1, t2,
t3, t4, t5, t6, t7, and t8).
Answer: Completed
Command:
> library(rpart)
> trainingdata = read.csv("dt_train.csv")
> modeldata <- rpart(d ~ t1+t2+t3+t4+t5+t6+t7+t8, data = trainingdata, method = "class")
> modeldata

Output:
n= 600

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 600 95 1 (0.1583333 0.8416667)
  2) t7< 33.5 196 95 1 (0.4846939 0.5153061)
    4) t5< 60.5 125 30 0 (0.7600000 0.2400000)
      8) t3< 76 95 0 0 (1.0000000 0.0000000) *
      9) t3>=76 30 0 1 (0.0000000 1.0000000) *
    5) t5>=60.5 71 0 1 (0.0000000 1.0000000) *
  3) t7>=33.5 404 0 1 (0.0000000 1.0000000) *
Analysis:
Terminal nodes (leaves) are marked with * at the end of their row. In this case
the terminal nodes are 3, 5, 8 and 9.
Specify the rules.
Command:
> rule <- path.rpart(modeldata, nodes= 3)
> rule <- path.rpart(modeldata, nodes= 5)
> rule <- path.rpart(modeldata, nodes= 8)
> rule <- path.rpart(modeldata, nodes= 9)
Output:
node number: 3
  root
  t7>=33.5

node number: 5
  root
  t7< 33.5
  t5>=60.5

node number: 8
  root
  t7< 33.5
  t5< 60.5
  t3< 76

node number: 9
  root
  t7< 33.5
  t5< 60.5
  t3>=76
Analysis:
The predicted value of d for each rule is the yval of the corresponding terminal
node in modeldata. The rules can be specified as:
IF t7 >= 33.5 THEN d = 1
IF t7 < 33.5 AND t5 >= 60.5 THEN d = 1
IF t7 < 33.5 AND t5 < 60.5 AND t3 < 76 THEN d = 0
IF t7 < 33.5 AND t5 < 60.5 AND t3 >= 76 THEN d = 1
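The four rules can also be written as a small R function. This is only a sketch: classify_d is a hypothetical helper name (it is not part of rpart), and the two rows are the sample observations from the training table above.

```r
# Hypothetical helper (not part of rpart): applies the four rules read off the tree.
classify_d <- function(t3, t5, t7) {
  # Only the branch t7 < 33.5, t5 < 60.5, t3 < 76 ends in d = 0;
  # every other branch of the tree ends in d = 1.
  ifelse(t7 < 33.5 & t5 < 60.5 & t3 < 76, 0, 1)
}

# First two training observations (only the variables the rules use).
first_two <- data.frame(id = 1:2, t3 = c(64, 61), t5 = c(94, 80), t7 = c(51, 89))
first_two$d_rule <- classify_d(first_two$t3, first_two$t5, first_two$t7)
print(first_two)
```

Both observations have t7 >= 33.5, so the first rule applies and d_rule is 1 for each, which agrees with the d column of the table.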
The file dt_test.csv contains 200 test examples with the same 10 variables.
Test your trained classifier on these test examples and present your confusion
matrix. Comment on your classification accuracy.
Command & Confusion Matrix:
> testdata = read.csv("dt_test.csv")
> testdataRule1 <- subset(testdata, t7>= 33.5)
> table(testdataRule1$d, testdataRule1$d == "1")
    TRUE
  1  129
> testdataRule2 <- subset(testdata, t7< 33.5 & t5>= 60.5)
> table(testdataRule2$d, testdataRule2$d == "1")
    TRUE
  1   22
> testdataRule3 <- subset(testdata, t7< 33.5 & t5< 60.5 & t3 < 76)
> table(testdataRule3$d, testdataRule3$d == "0")
    TRUE
  0   37
> testdataRule4 <- subset(testdata, t7< 33.5 & t5< 60.5 & t3 >= 76)
> table(testdataRule4$d, testdataRule4$d == "1")
    TRUE
  1   12
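Analysis:
None of the four tables contains a FALSE entry, so every one of the 200 test
examples (129 + 22 + 37 + 12) matches the class predicted by its rule. The
classification accuracy on the test set is therefore 100%.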
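The per-rule counts can be combined into a single 2 x 2 confusion matrix. The sketch below rebuilds it from the counts reported above rather than from dt_test.csv, so the data frame and its column names are assumptions made for illustration.

```r
# Per-rule test counts as reported above: each rule's predicted class,
# the actual class of the rows it covers, and the number of such rows.
counts <- data.frame(
  rule      = 1:4,
  predicted = c(1, 1, 0, 1),
  actual    = c(1, 1, 0, 1),
  n         = c(129, 22, 37, 12)
)

# Expand to one entry per test example, then cross-tabulate actual vs predicted.
actual    <- factor(rep(counts$actual, counts$n), levels = c(0, 1))
predicted <- factor(rep(counts$predicted, counts$n), levels = c(0, 1))
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Accuracy is the share of examples on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

With the fitted model and the test file at hand, table(testdata$d, predict(modeldata, testdata, type = "class")) should give the same matrix directly.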
Then use the rules to predict the output class d for the following 10 test cases
(presented in the file dt_new.csv):
new_case  t1  t2  t3  t4  t5  t6  t7  t8  d
1          8  86  55  53  36  12  82  19  1
2         22  36  80  69  90  33  22   6  1
3         74  26  32  26  38  52  63  12  1
4         66  71  71  52  42  88  89  70  1
5         55  72  61  41  91  39  50  96  1
6         34  58  22  84  84  61  95  57  1
7         23  70  39  65  16  71  96  78  1
8          9  19  67  43   2  20  92   3  1
9          6  71  20   6  27  58   6  22  0
10        68  40  86  82  82  44  61  48  1

Command:
> newdata = read.csv("dt_new.csv")
> predict(modeldata, newdata)
Output:
   0 1
1  0 1
2  0 1
3  0 1
4  0 1
5  0 1
6  0 1
7  0 1
8  0 1
9  1 0
10 0 1
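Analysis:
The output gives, for each new case, the predicted probability of d = 0 (first
column) and d = 1 (second column). Therefore:
d = 1 for new_case = 1, 2, 3, 4, 5, 6, 7, 8 and 10
d = 0 for new_case = 9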
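
Turning the probability output into class labels can be sketched as follows. The matrix is typed in from the output printed above instead of being produced by predict, and the names prob and predicted_d are chosen here for illustration.

```r
# Class probabilities as printed by predict(modeldata, newdata):
# column "0" is P(d = 0), column "1" is P(d = 1), one row per new case.
prob <- matrix(
  c(rep(c(0, 1), 8),  # cases 1 to 8
    1, 0,             # case 9
    0, 1),            # case 10
  ncol = 2, byrow = TRUE,
  dimnames = list(as.character(1:10), c("0", "1"))
)

# For each row, take the label of the column with the larger probability.
predicted_d <- as.integer(colnames(prob)[max.col(prob, ties.method = "first")])
names(predicted_d) <- rownames(prob)
print(predicted_d)
```

Calling predict(modeldata, newdata, type = "class") on the fitted model should return the class labels without this extra step.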
