Homework #2
Fall 2015
Submitted By-
Group 11
Ankit Bhardwaj ([email protected])
Arpit Gulati ([email protected])
Nitish Puri ([email protected])
Problem 1
a) Input the data set. Set the role of INCOME to target. Use a partition node to divide the
data into 60% train, 40% test.
Ans: The Salary-class.csv file was read with a Var. File source node, which was then connected to a
Partition node to divide the data into 60% train and 40% test.
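For reference, here is a minimal sketch of the same 60/40 partition outside Modeler, assuming pandas and scikit-learn are available (only the file name comes from the assignment; everything else is illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data set and split it 60% train / 40% test, mirroring the partition node.
data = pd.read_csv("Salary-class.csv")
train, test = train_test_split(data, train_size=0.6, random_state=1)
print(len(train), len(test))  # roughly 60% and 40% of the records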
b) Create the default C&R decision tree. How many leaves are in the tree?
Ans:
There are a total of 7 leaves, which can be seen by browsing the decision tree with the viewer
option in the C&R model. Two of the leaf rules, for example, are:
if MSTATUS in [ Divorced, Married-spouse-absent, Never-married, Separated, Widowed ]
and C-GAIN <= 7139.5 then INCOME <=50K
if MSTATUS in [ Married-AF-spouse, Married-civ-spouse ] and DEGREE in [ 10th, 11th, 12th,
1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college ]
and C-GAIN <= 5095.5 then INCOME <=50K
e) Create two more C&R trees. The first is just like the default tree except you do not
“prune tree to avoid overfitting” (on the basic tab). The other does prune, but you
require 500 records in a parent branch and 100 records in a child branch. How do the
three trees differ (briefly)? Which seems most accurate on the training data? Which
seems most accurate on the test data?
Ans:
The tree without pruning has a maximum depth of 7 and 17 leaves.
The tree that prunes but requires 500 records in a parent branch and 100 records in a child
branch has a depth of 4 and 7 leaves.
The default tree and the 500/100 pruned tree are therefore very similar: they have the same
depth and the same number of leaves. The unpruned tree differs, with depth 7 and 17 leaves,
and it also assigns different predictor importance values to the attributes.
When all three trees are connected to an analysis node, the results below are obtained.
All three trees classify 84.96% of the training records and 84.13% of the testing records
correctly, and the three models are in 100% agreement with each other. Since the default model
and the 500/100 pruned model achieve this with a tree depth of only 4, they are the more
efficient choices and are less prone to overfitting.
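A rough scikit-learn analogue of the three-tree comparison, sketched on synthetic data since the Modeler stream itself is not reproduced here; min_samples_split and min_samples_leaf stand in for the 500-record parent and 100-record child requirements:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=1)

trees = {
    "pruned default (depth 4)": DecisionTreeClassifier(max_depth=4),
    "no pruning": DecisionTreeClassifier(),
    "500 parent / 100 child": DecisionTreeClassifier(min_samples_split=500,
                                                     min_samples_leaf=100),
}
for name, tree in trees.items():
    tree.fit(X_tr, y_tr)
    # Compare training vs. testing accuracy, as the analysis node does.
    print(name, tree.score(X_tr, y_tr), tree.score(X_te, y_te))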
Problem 2
a) Input the zoo1.csv to train the decision tree classifier (C5.0) and come up with a decision
tree to classify a new record into one of the categories (pick the “favor accuracy” option
in the C5.0 node). Make sure you examine the data first and think about what field(s) to
use for the classification scheme.
Ans:
After examining the data it is evident that the Animal attribute has a distinct value for every
record, so it can be excluded when building the tree.
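A minimal sketch of the same idea outside Modeler, assuming zoo1.csv has an "Animal" name column and a "Type" target (both column names are assumptions):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

zoo = pd.read_csv("zoo1.csv")
X = zoo.drop(columns=["Animal", "Type"])  # Animal is unique per record, so it has no predictive value
y = zoo["Type"]
tree = DecisionTreeClassifier().fit(X, y)
print(tree.score(X, y))  # training accuracy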
b) Rename the generated node as “fulltree” and fully unfold it while browsing it. Use this
to draw the full tree: how many leaves does it have? What is the classification accuracy
on the training dataset? You can check this through an analysis node or through a table.
Ans:
Both models give 100% accuracy on the training dataset. One difference visible in the
analysis, however, is that the two models assign different predictor importance values to the
attributes.
Predictor importance for fulltree: importance is concentrated in mainly 3 attributes: milk,
backbone and feathers.
Predictor importance for the second model: importance is spread across milk, feathers,
backbone, airborne, breathes and aquatic.
e) Next, use the “fulltree” node and an analysis node to classify the records in the testing
dataset, zoo2.csv (to do this, just disconnect the zoo1.csv data source node and instead
connect a new data source node at the beginning of the data stream with zoo2.csv as
the variable). Compare the classification accuracy here with what you saw in part (b)
and comment. What are the misclassified animals?
Ans:
Using the analysis node we can see that the tree classifies 90% of the test records correctly
and 10% incorrectly, i.e. 3 records are misclassified. In part (b), by contrast, the accuracy on
the training data was 100%; the drop on unseen data indicates that the full tree overfits the
training set.
The three misclassified animals are:
1) Flea: classified as invertebrate, should be insect.
2) Seasnake: classified as fish, should be reptile.
3) Termite: classified as invertebrate, should be insect.
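Continuing the sketch from part (a), with the same assumed column names, the misclassified test animals can be listed like this:

import pandas as pd

# 'tree' is the classifier fitted on zoo1.csv in the earlier sketch.
test = pd.read_csv("zoo2.csv")
pred = tree.predict(test.drop(columns=["Animal", "Type"]))
wrong = test[pred != test["Type"]]
print(wrong["Animal"].tolist())  # expected: flea, seasnake, termite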
f) Suppose you wished to use a single-level tree (i.e., 1R: just one attribute to classify) and
you use the full data set (zoo.csv) to determine this. Which of the three attributes “milk”,
“feathers” and “aquatic” yields the best results? Why do you think the results are so
skewed in each case?
Ans:
Case 1, taking “Milk” as the only attribute:
Accuracy is 60.4% when taking only “Milk” as the attribute for predicting “Type”, i.e. 61 out of
101 records are predicted correctly.
Case 2, taking “Feathers” as the only attribute:
Accuracy is 60.4% when taking only “Feathers” as the attribute for predicting “Type”, i.e. 61 out
of 101 records are predicted correctly.
Case 3, taking “Aquatic” as the only attribute:
Accuracy is 47.52% when taking only “Aquatic” as the attribute for predicting “Type”, i.e. 48 out
of 101 records are predicted correctly.
Clearly, the attributes “Milk” and “Feathers” yield the best result (60.4%) compared to
“Aquatic”, which gives 47.52%.
The results are skewed in each case because a single attribute does not carry enough
information to classify all the animals into the correct types. As the results show, in all three
cases mammal is the type predicted for most of the observations: when milk is true the type is
almost always mammal, when feathers is false the majority type is mammal, and when aquatic
is false the majority type is mammal. So the distribution strongly favors mammal.
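A minimal 1R sketch: predict the majority Type for each value of a single attribute and measure accuracy on the full zoo.csv (column names are assumptions; with matching names it should reproduce the accuracies above):

import pandas as pd

zoo = pd.read_csv("zoo.csv")
for attr in ["milk", "feathers", "aquatic"]:
    # Majority class of Type for each value (0/1) of the attribute.
    majority = zoo.groupby(attr)["Type"].agg(lambda s: s.mode()[0])
    pred = zoo[attr].map(majority)
    acc = (pred == zoo["Type"]).mean()
    print(attr, round(acc, 4))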
Problem 3
(a) What is the gini impurity value for this data set?
Ans :- The data set contains 7 Yes and 7 No records, so
Gini(D) = 1 – (7/14)^2 – (7/14)^2 = 0.5
(b) Splitting on Income gives three branches: High (0 Yes, 3 No: pure), Medium (4 Yes, 2 No)
and Low (3 Yes, 2 No). The Gini impurity of the split is the weighted average:
Gini(Income) = (3/14)*Gini(High) + (6/14)*Gini(Medium) + (5/14)*Gini(Low)
= (3/14)*0 + (6/14)*(4/9) + (5/14)*(12/25)
= 4/21 + 6/35 = 0.3619
(c) What is the gain in gini impurity of income obtained by splitting on income as in part
(b)?
Ans :- Gain = Gini index of entire data set – Gini index using ‘Income’ as split variable.
= 0.5 (part (a) calculated) – 0.3619 (part (b) calculated)
= 0.1381.
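The arithmetic above can be checked with a few lines of Python (the counts are taken directly from the answer):

def gini(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

g_all = gini(7, 7)                    # 0.5 for the whole data set
branches = [(0, 3), (4, 2), (3, 2)]   # High, Medium, Low
g_split = sum((y + n) / 14 * gini(y, n) for y, n in branches)
print(round(g_all, 4), round(g_split, 4), round(g_all - g_split, 4))  # 0.5 0.3619 0.1381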
(d) Continue building the tree. If you do not split on a node, explain why. Draw the final tree
that you obtained.
Ans :-
We will grow the decision tree further, keeping ‘Income’ as the initial split variable, and try
different combinations of ‘Student’ and ‘Credit Rating’ to see what we get on the target
variable.
Approach to splitting a node:
1. Follow a trial-and-error approach when choosing a splitting variable.
2. The moment we get a pure subset, we do not split further.
Tree after the initial split on Income:
Income = High → 0 Yes, 3 No (PURE)
Income = Medium → 4 Yes, 2 No
Income = Low → 3 Yes, 2 No
Step 1 :- Choosing ‘Credit Rating’ as the next split variable, we look at the values of the target
variable for the combinations <Medium income, Fair credit rating> and <Medium income,
Excellent credit rating>.
Splitting on Credit Rating under Medium income:
Income = High → 0 Yes, 3 No (PURE)
Income = Medium → 4 Yes, 2 No → Credit Rating: one branch has 1 Yes, 2 No (impure),
leaving 3 Yes, 0 No (pure) in the other
Income = Low → 3 Yes, 2 No
Conclusion :- We rule out choosing ‘Credit Rating’ as the next splitting variable for the
MEDIUM income branch, since we are not getting pure subsets.
Step 2 :- Choosing ‘Student’ as the next split variable for the Medium income branch, we look at
the values of the target variable for the combinations <Medium income, Student = Yes> and
<Medium income, Student = No>.
Final tree:
Income = High → 0 Yes, 3 No (PURE)
Income = Medium → split on Student:
Student = Yes → 4 Yes, 0 No (PURE)
Student = No → 0 Yes, 2 No (PURE)
Income = Low → split on Credit Rating:
one rating value → 3 Yes, 0 No (PURE)
the other rating value → 0 Yes, 2 No (PURE)
Every leaf is now pure, so no further splits are needed.
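The final tree can also be written as nested conditionals; note that the pairing of the Credit Rating values with the two Low-income leaves is a guess, since the diagram does not make it explicit:

def predict(income, student, credit_rating):
    if income == "High":
        return "No"   # 0 Yes, 3 No (pure)
    if income == "Medium":
        return "Yes" if student == "Yes" else "No"   # 4Y/0N vs. 0Y/2N
    # income == "Low": split on Credit Rating (3Y/0N vs. 0Y/2N)
    return "Yes" if credit_rating == "Fair" else "No"  # Fair/Excellent pairing assumed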
The purpose of separating the data into training and testing data is to determine the
accuracy of predictions. We measure the performance of a model in terms of its error
rate: the percentage of incorrectly classified instances in the data set. To measure
performance, we divide the data set into:
• Training data: the training set (seen data), used to build the model (determine its
parameters).
• Test data: the test set (unseen data), used to measure the model's performance
(holding the parameters constant). Test data stands in for future records: we do not
know their Y/dependent variable value and we predict it using our model.
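A small illustration of the error-rate definition (the labels below are made up):

actual    = ["Yes", "No", "Yes", "Yes", "No"]
predicted = ["Yes", "No", "No",  "Yes", "No"]
error_rate = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print(error_rate)  # 0.2, i.e. 20% of the instances are misclassified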
Pruning removes the branches of a decision tree that are of low statistical significance; its
purpose is to avoid overfitting while building a decision tree for a particular data set.
• Pre-pruning: done during the construction of the tree. A stopping criterion prevents
nodes from being expanded further (allowing a certain level of "impurity" in each
node).
• Post-pruning: done after the construction of the tree. Branches are removed from
the bottom up to a certain limit, using criteria similar to pre-pruning.
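Both styles have rough scikit-learn analogues (an illustration, not the Modeler implementation): pre-pruning via stopping criteria, post-pruning via cost-complexity pruning of a fully grown tree:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop expanding nodes early via depth / leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)
# Post-pruning: grow the tree fully, then prune it back by cost complexity.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)
print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())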