Decision Tree
We can identify the target people by building a decision tree from our dataset.
Decision Tree:
A decision tree is one of the most powerful and popular tools for classification and prediction (regression).
A decision tree is a flowchart-like tree structure:
Each internal node (split node, decision node) denotes a test on an attribute,
Each branch represents an outcome of the test, and
Each leaf node (terminal node) holds a class label.
Background:
Growing a tree involves deciding which feature to split on (the splitting feature), what conditions to use for splitting, and when to stop (the stopping criteria).
The key quantities used for these decisions are:
Entropy of classes
Entropy of attributes
Information gain
Entropy is the degree of randomness or uncertainty; in our case, the degree of class variance. In other words, entropy is also a measure of impurity: a dataset is called pure if all its instances have the same value of the class attribute, and for a pure dataset the entropy is zero (or close to zero). Information gain, also known as mutual information, is a measure used to select the most informative attribute, i.e. the attribute that helps us reduce entropy the most.
Entropy of classes (binary)
E = -F_c1 · log2(F_c1) - F_c2 · log2(F_c2)
E = -F_Yes · log2(F_Yes) - F_No · log2(F_No)
where F_c1 and F_c2 are the fractions of the two classes (for example YES and NO) in the class attribute.
Entropy of a particular attribute value v (e.g. v = Low, Med, High for salary)
E_v = -F_c1,v · log2(F_c1,v) - F_c2,v · log2(F_c2,v)
E_Low = -F_Yes,Low · log2(F_Yes,Low) - F_No,Low · log2(F_No,Low)
where F_c1,v and F_c2,v are the fractions of YES and NO among the records having the selected attribute value.
Information gain of an attribute A:
IG(A) = E(class) - Σ_v (n_v / n) · E_v
where n_v is the number of records with value v of attribute A and n is the total number of records.
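For illustration (not part of the original notes), a minimal Python sketch of the entropy formula above:

```python
import math

def entropy(counts):
    """Entropy of a node from its class counts, e.g. entropy([6, 4])."""
    n = sum(counts)
    # 0 * log2(0) is treated as 0, so empty classes are skipped
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([6, 4]))   # ~0.971 (used in the worked example below)
print(entropy([5, 5]))   # 1.0 -> maximum impurity for a 50/50 node
```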
Let us understand all of this through an example.
SN Name Sex Salary Marital st. Investor
1 P1 M Low Un N
2 P2 M Med Un N
3 P3 M Med Ma Y
4 P4 F Med Ma N
5 P5 M Med Ma Y
6 P6 F High Un Y
7 P7 F Low Un N
8 P8 M High Un Y
9 P9 F Med Un Y
10 P10 M Low Ma Y
Entropy of class
Number of "Yes" in the dataset: 6
Number of "No" in the dataset: 4
Total number of observations: 10
E(class) = E(6,4) = -(6/10)·log2(6/10) - (4/10)·log2(4/10) ≈ 0.971
Salary = Low (Yes = 1, No = 2):
E_Low = E(1,2)
E_Low = -F_Yes,Low · log2(F_Yes,Low) - F_No,Low · log2(F_No,Low)
E_Low = -(1/3)·log2(1/3) - (2/3)·log2(2/3)
E_Low ≈ 0.918
Salary = Med (Yes = 3, No = 2):
E_Med = E(3,2) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) ≈ 0.971
Salary = High (Yes = 2, No = 0):
E_High = E(2,0) = -(2/2)·log2(2/2) - (0/2)·log2(0/2) = 0   (taking 0·log2(0) = 0)
Summary
IG(Salary) = E(class) - [ (3/10)·E_Low + (5/10)·E_Med + (2/10)·E_High ] = 0.971 - [0.275 + 0.486 + 0] ≈ 0.210
Computed in the same way for the other attributes:
Attribute          Information Gain
Salary             0.210
Sex                0.020
Marital Status     0.046
Salary has the highest information gain, so it is chosen as the root node.
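To double-check these numbers, here is a small, self-contained Python sketch (my own illustration) that recomputes the information gain of each attribute from the 10-record dataset above:

```python
import math

# (Sex, Salary, Marital, Investor) for the 10 records above
data = [("M", "Low", "Un", "N"), ("M", "Med", "Un", "N"), ("M", "Med", "Ma", "Y"),
        ("F", "Med", "Ma", "N"), ("M", "Med", "Ma", "Y"), ("F", "High", "Un", "Y"),
        ("F", "Low", "Un", "N"), ("M", "High", "Un", "Y"), ("F", "Med", "Un", "Y"),
        ("M", "Low", "Ma", "Y")]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(col):
    labels = [r[3] for r in data]
    gain = entropy(labels)
    for v in set(r[col] for r in data):
        subset = [r[3] for r in data if r[col] == v]
        gain -= len(subset) / len(data) * entropy(subset)
    return gain

for name, col in [("Salary", 1), ("Sex", 0), ("Marital Status", 2)]:
    print(name, round(info_gain(col), 3))   # ~0.21, ~0.02, ~0.046
```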
[Tree after the first split: root node Salary; the High branch is already a leaf labelled Yes.]
Observations:
The "High" branch has already reached its decision: class "Yes".
How? A person with a high salary is an investor, no matter what his/her marital status or gender is. Ex.:
SN Name Sex Salary Marital st. Investor
6 P6 F High Un Y
8 P8 M High Un Y
[Tree after splitting the Low branch on Marital st.: Salary = High → Yes; Salary = Low and Married → Yes; Salary = Low and Unmarried → No.]
Observations:
The "Married" and "Unmarried" branches have already reached their decisions: class "Yes" and class "No" respectively. How? A person with a low salary who is married will invest, and one who is unmarried will not invest, no matter what his/her sex is. Ex.:
SN Name Sex Salary Marital st. Investor
1 P1 M Low Un N
7 P7 F Low Un N
10 P10 M Low Ma Y
This is a case of a "pure dataset", i.e. each resulting sub-dataset contains only one class.
Summary (for the Salary = Med sub-dataset)
Attribute          Information Gain
Sex                0.020
Marital Status     0.020
The two attributes tie; Sex is chosen for the next split.
[Final tree: root node Salary.
Salary = High → Yes.
Salary = Low → Marital st.: Unmarried → No, Married → Yes.
Salary = Med → Sex: M → Marital st. (Unmarried → No, Married → Yes); F → Marital st. (Married → No, Unmarried → Yes).]
Stopping criteria (see the sketch after this list)
1. Max/fixed number of leaves
2. Max/fixed depth of the tree
3. Min number of observations in a node
4. Pure node (every element in the subset belongs to the same class; in that case the node is turned into a leaf node and labelled with the class of the examples).
5. There are no more attributes to be selected, but the examples still do not all belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset.
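The first three criteria map directly onto hyperparameters in common implementations. A minimal scikit-learn sketch (assuming scikit-learn is installed; the data is a hypothetical, numerically encoded toy set, not the investor dataset):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, numerically encoded toy data (placeholders)
X = [[0, 1], [1, 0], [1, 1], [0, 0], [2, 1], [2, 0]]
y = [0, 0, 1, 0, 1, 1]

clf = DecisionTreeClassifier(
    criterion="entropy",   # information-gain style splitting
    max_leaf_nodes=8,      # criterion 1: max/fixed number of leaves
    max_depth=3,           # criterion 2: max/fixed depth of the tree
    min_samples_leaf=1,    # criterion 3: min observations per leaf (set higher in practice)
)
clf.fit(X, y)
print(clf.predict([[2, 1]]))   # pure nodes (criterion 4) are handled automatically
```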
For example, we may be interested in predicting who will or will not graduate from college, or who will or will not renew a subscription. These would be examples of simple binary
classification problems, where the categorical dependent variable can only assume two distinct and mutually exclusive values. In other cases, we might be interested in predicting which
one of multiple different alternative consumer products (e.g., makes of cars) a person decides to purchase, or which type of failure occurs with different types of engines. In those cases
there are multiple categories or classes for the categorical dependent variable.
CLASSIFICATION TREES
(Solve classification-type problems)
Classification tree methods (i.e., decision tree methods such as ID3) are recommended when the data mining task involves classification or prediction of outcomes (categories). A classification tree labels records and assigns them to discrete classes.
REGRESSION TREES
(Solve regression-type problems)
A regression tree works in a very similar fashion, but its output is a continuous numerical value. Regression trees are needed when the response variable is numeric or continuous, for example the predicted price of a consumer good. Thus regression trees are applicable to prediction-type problems, as opposed to classification.
CLASSIFICATION AND REGRESSION TREES (C&RT)
(Solve both classification and regression type problems)
The CART, or Classification & Regression Trees, methodology was introduced in 1984 by Leo Breiman. This algorithm is a cornerstone of ensemble methods such as bagging and boosting. The representation for the CART model is a binary tree (a parent node has a maximum of two branches / child nodes).
Gini index of a node t:
GINI(t) = 1 - Σ_i [p(i | t)]^2
where p(i | t) is the fraction of records of class i at node t. The Gini index is 0 for a pure node and reaches its maximum, 1 - 1/k, when the k classes are equally distributed.
Squaring the terms gives more weight to larger proportions (similar to how a sum of squared errors emphasises large errors more than a sum of absolute errors).
GINI of a split
GINI(s, t) = P_L · GINI(t_L) + P_R · GINI(t_R)
           = weighted_cost(left) + weighted_cost(right)
where
s : the split
t : the node being split
P_L : proportion of observations in the left node after split s
GINI(t_L) : Gini of the left node after split s
P_R : proportion of observations in the right node after split s
GINI(t_R) : Gini of the right node after split s
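A minimal Python sketch (my own illustration, not from the notes) of the Gini index of a node and the weighted Gini of a split; the 12/12 split used below is a hypothetical example:

```python
def gini(counts):
    """Gini index of a node from its class counts: 1 - sum of squared class fractions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    """Weighted Gini of a binary split: P_L * GINI(t_L) + P_R * GINI(t_R)."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return n_left / n * gini(left_counts) + n_right / n * gini(right_counts)

print(gini([12, 12]))              # 0.5 -> the 24-record parent node in the example below
print(gini_split([8, 4], [4, 8]))  # ~0.444 for a hypothetical 12/12 split
```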
Example
We have an example in which the input (parent) node has an equal number of target-variable values, "Yes" and "No". The overall number of observations is 24.
The Gender variable is considered to split the node. The Gini value of the parent node is calculated as below:
GINI(parent) = 1 - (1/2)^2 - (1/2)^2 = 1 - 0.25 - 0.25 = 0.5
Now we want to split the node based on the Gender variable. After the split we will have a summary of the "Yes"/"No" counts in each gender node; we then weight and sum the Gini of each node by the proportion of the data it takes up to get the GINI of the split.
As another example, consider the following dataset, in which we look for the best split on the numeric attribute A1:
SR A1 A2 Class
1 3.2 1.5 0
2 1.3 1.2 0
3 3.7 2.8 0
4 2.9 2.4 0
5 3.9 1.9 0
6 7.5 3.5 1
7 9.0 3.2 1
8 7.4 0.9 1
9 9.5 4.2 1
10 7.3 3.5 1
The attribute value with the lowest weighted Gini index is chosen as the split for the node. Here, splitting on the value 3.9 of A1 gives the lowest Gini index: A1 <= 3.9 separates the two classes perfectly, so the weighted Gini of the split is 0.
[Split node: A1, with branches A1 <= 3.9 and A1 > 3.9.]
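A short Python sketch (my own illustration, using the "<=" split convention described above) that scans the observed A1 values as candidate thresholds and reports the best split:

```python
A1  = [3.2, 1.3, 3.7, 2.9, 3.9, 7.5, 9.0, 7.4, 9.5, 7.3]
cls = [0,   0,   0,   0,   0,   1,   1,   1,   1,   1]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

best = None
for t in sorted(set(A1)):                     # candidate thresholds = observed values
    left  = [c for a, c in zip(A1, cls) if a <= t]
    right = [c for a, c in zip(A1, cls) if a > t]
    if not left or not right:
        continue
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(cls)
    if best is None or weighted < best[1]:
        best = (t, weighted)

print(best)   # (3.9, 0.0): A1 <= 3.9 separates the two classes perfectly
```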
Why can ordinal attribute values be grouped, as long as the grouping does not violate the order property of the attribute values?
Ans: Preserving the order may or may not be useful; it depends on the relation of the feature to the class, but mostly it is useful. My advice, therefore: as part of feature engineering, decide whether to use an ordered or an unordered factor. Use an ordered factor only if it is highly correlated with the output variable; otherwise fall back to an unordered factor.
We usually split ordered values into groups that do not contradict the order. A group can be {small, medium} versus {large}, but it cannot contain small and large while excluding medium, because there is a sequence in the data. If there were no such ordering, we could form any combination of attribute values. Suppose an attribute describes a fruit and can be apple, pineapple or watermelon; since there is no ordering, all possible combinations for binary splits are allowed.
Contradiction: It may be useless, since for example a T-shirt factory can decide to print red t-shirts in sizes Small and Large and blue t-shirts in sizes Medium and Extra-Large. Since we don't know the model that generates the data, how can we infer that it is "better" to preserve the order in the splits of an ordinal attribute? There is no inherent advantage in maintaining the order of an attribute's splits. Ans: Feature construction is one of the important tasks of the modeller; it is up to you to decide whether to represent a categorical variable as ordered or unordered.
Ex.2
Let's assume we have 3 classes and 80 objects: 19 objects in class 1, 21 objects in class 2, and 40 objects in class 3 (denoted as (19, 21, 40)).
The Gini index would be: 1 - [(19/80)^2 + (21/80)^2 + (40/80)^2] = 0.6247, i.e. cost_before = Gini(19, 21, 40) = 0.6247.
In order to decide where to split, we test all possible splits. For example, splitting at x1 < 2.0623 results in the split (16, 9, 0) and (3, 12, 40).
After testing x1 < 2.0623:
cost_L = Gini(16, 9, 0) = 0.4608
cost_R = Gini(3, 12, 40) = 0.4205
Then we weight branch impurity by the empirical branch probabilities:
cost_{x1 < 2.0623} = (25/80)·cost_L + (55/80)·cost_R = 0.4331
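These numbers can be verified with a small multi-class Gini function (my own sketch, not part of the example):

```python
def gini(counts):
    """Multi-class Gini index from class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

cost_before = gini([19, 21, 40])                        # 0.6247
cost_left   = gini([16, 9, 0])                          # 0.4608
cost_right  = gini([3, 12, 40])                         # 0.4205
cost_split  = 25 / 80 * cost_left + 55 / 80 * cost_right
print(round(cost_before, 4), round(cost_left, 4),
      round(cost_right, 4), round(cost_split, 4))       # 0.6247 0.4608 0.4205 0.4331
```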
Ex.3
Class  Var1  Var2
A      0     33
A      0     54
A      0     56
A      0     42
A      1     50
B      1     55
B      1     31
B      0     -4
B      1     77
B      0     49
We'll first try using the Gini index on a couple of values. Let's try Var1 == 1 and Var2 >= 32.
Gini Index Example: Var1 == 1
Baseline of Split: Var1 has 4 instances (4/10) where it is equal to 1 and 6 instances (6/10) where it is equal to 0.
For Var1 == 1 & Class == A: 1 / 4 instances have class equal to A.
For Var1 == 1 & Class == B: 3 / 4 instances have class equal to B.
o Gini Index here is 1-((1/4)^2 + (3/4)^2) = 0.375
For Var1 == 0 & Class== A: 4 / 6 instances have class equal to A.
For Var1 == 0 & Class == B: 2 / 6 instances have class equal to B.
o Gini Index here is 1-((4/6)^2 + (2/6)^2) = 0.4444
We then weight and sum each of the splits based on the baseline / proportion of the
data each split takes up.
o 4/10 * 0.375 + 6/10 * 0.4444 = 0.41667
Gini Index Example: Var2 >= 32
Baseline of Split: Var2 has 8 instances (8/10) where it is >= 32 and 2 instances (2/10) where it is less than 32.
For Var2 >= 32 & Class == A: 5 / 8 instances have class equal to A.
For Var2 >= 32 & Class == B: 3 / 8 instances have class equal to B.
o Gini Index here is 1-((5/8)^2 + (3/8)^2) = 0.46875
For Var2 < 32 & Class == A: 0 / 2 instances have class equal to A.
For Var2 < 32 & Class == B: 2 / 2 instances have class equal to B.
o Gini Index here is 1-((0/2)^2 + (2/2)^2) = 0
We then weight and sum each of the splits based on the baseline / proportion of the
data each split takes up.
o 8/10 * 0.46875 + 2/10 * 0 = 0.375
Based on these results, you would choose Var2>=32 as the split since its weighted Gini
Index is smallest.
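A quick check of both candidate splits in Python (my own sketch, using the class counts derived above):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# (A, B) class counts in each branch of the two candidate splits
var1_split = 4/10 * gini([1, 3]) + 6/10 * gini([4, 2])   # Var1 == 1 vs Var1 == 0
var2_split = 8/10 * gini([5, 3]) + 2/10 * gini([0, 2])   # Var2 >= 32 vs Var2 < 32
print(round(var1_split, 5), round(var2_split, 5))        # 0.41667 0.375 -> Var2 >= 32 wins
```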
Ex. 4
X y
-1.2 0
-3.2 0
2.1 1
1.5 1
Disadvantages
Instability
The reliability of the information in the decision tree depends on feeding it precise internal and external information at the outset. Even a small change in the input data can at times cause large changes in the tree. Changing variables, excluding duplicate information, or altering the sequence midway can lead to major changes and might require redrawing the tree.
Complexity
Computing the probabilities of the different possible branches, determining the best split for each node, and selecting optimal combining weights for pruning are complicated tasks that require much expertise and experience.
Decision trees are tree data structures that are generated using learning algorithms for the purpose of classification and regression.
One of the most common problems when learning a decision tree is finding the optimal size of the resulting tree, the size that leads to better real-world accuracy of the model. A tree that has too many branches and layers can overfit the training data. The performance of a tree can be further increased by pruning, which involves removing the branches that make use of features with low importance. This way we reduce the complexity of the tree, and thus increase its predictive power by reducing overfitting.
Decision tree pruning can be divided into two types: pre-pruning and post-pruning.
Post-pruning strategies:
Minimum error. The tree is pruned back to the point where the cross-validated error is a minimum. Cross-validation is the process of building a tree with most of the data and then using the remaining part of the data to test its accuracy.
Smallest tree. The tree is pruned back slightly further than the minimum error; technically, the pruning creates a tree whose cross-validation error is within 1 standard error of the minimum. The smaller tree is more intelligible, at the cost of a small increase in error.
Pre-pruning:
When the leaf node has very few observations left (min number of observations in a node).
This ensures that we stop growing the tree when the reliability of further splitting the node becomes suspect due to small sample size. The Central Limit Theorem tells us that, when observations are mutually independent, about 30 observations constitute a large sample. This can serve as a rough guide, though usually this user-input parameter should be higher than 30, say 50 or 100 or more, because we typically work with multi-dimensional observations and the observations could be correlated.
Method A: cost-complexity pruning
Tree Score = RSS + α·T
α (alpha) is a hyperparameter we find using cross-validation, and T is the number of leaves in the subtree.
We calculate the Tree Score for all subtrees of the decision tree and then pick the subtree with the lowest Tree Score. However, we can observe from the equation that the value of alpha determines the choice of subtree. The value of alpha is found using cross-validation: we repeat the above process for different values of alpha, which gives us a sequence of trees, and the value of alpha that on average gives the lowest Tree Score is the final value of alpha. Finally, our pruned decision tree is the tree corresponding to this final value of alpha (see the sketch below).
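For reference, a minimal scikit-learn sketch of cost-complexity pruning (assuming scikit-learn 0.22 or later; the iris dataset is used only as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate alpha values for cost-complexity ("weakest link") pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha whose pruned tree has the best cross-validated score
scores = [(a, cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                              X, y, cv=5).mean())
          for a in path.ccp_alphas if a >= 0.0]   # guard against tiny negative round-off
best_alpha = max(scores, key=lambda s: s[1])[0]
print(best_alpha)
```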
Method B:
Decide which nodes to prune using one of the following methods:
1. Use a distinct dataset from the training set (called a validation set) to evaluate the effect of post-pruning nodes from the tree.
2. Build the tree using the training set, then apply a statistical test to estimate whether pruning or expanding a particular node is likely to produce an improvement beyond the training set.
o Error estimation
o Significance test (e.g., the Chi2 test)
The first method is the most common approach. In this approach, the available data are separated into two sets of examples: a training set, which is used to build the decision tree, and a validation set, which is used to evaluate the impact of pruning the tree. The second method is also a common approach. Here, we explain error estimation and the Chi2 test.
Post-pruning using error estimation
The error estimate for a sub-tree is the weighted sum of the error estimates of all its leaves. The error estimate (e) for a node is the pessimistic (upper confidence bound) estimate used in C4.5:
e = (f + z^2/(2N) + z·sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)
where f is the error rate on the training data at the node, N is the number of training instances at the node, and z is the standard-normal quantile for the chosen confidence level.
The error rate at the parent node is 0.46 and since the error rate for its children (0.51) increases with the split,
we do not want to keep the children.
Post-pruning using the Chi2 test
Example: class counts in the three child nodes of a candidate split:
        Child 1  Child 2  Child 3
Bad       4        1        4
Good      2        1        2
Chi2 = 0.21, degrees of freedom = 2, probability (p-value) ≈ 0.90.
If we require the probability to be below a threshold (e.g., 0.05) for a split to be kept, then we decide not to split this node.
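As a quick check, a short sketch using scipy (assuming scipy is installed) reproduces the Chi2 value, the degrees of freedom and the probability:

```python
from scipy.stats import chi2_contingency

observed = [[4, 1, 4],    # "Bad" counts in the three child nodes
            [2, 1, 2]]    # "Good" counts in the three child nodes

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p, 2), dof)   # ~0.21, ~0.90, 2 -> the split is not significant
```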
The first type is the one described by many articles in this blog: the Classification tree. This is also what is referred to as a decision tree by default. However, there is another basic decision tree in common use: the Regression tree, which works in a very similar fashion. This article explores the main differences between them: when to use each, how they differ, and some cautions.
This might seem like a trivial issue - once you know the difference! Classification trees, as the name implies, are used to separate the dataset into classes belonging to the response variable. Usually the response variable has two classes: Yes or No (1 or 0). If the target
variable has more than 2 categories, then a variant of the algorithm, called C4.5, is used.
For binary splits however, the standard CART procedure is used. Thus classification trees
are used when the response or target variable is categorical in nature.
Regression trees are needed when the response variable is numeric or continuous, for example the predicted price of a consumer good. Thus regression trees are applicable to prediction-type problems, as opposed to classification.
Keep in mind that in either case, the predictors or independent variables may be
categorical or numeric. It is the target variable that determines the type of decision tree
needed.
In a standard classification tree, the idea is to split the dataset based on homogeneity of the data. Let's say, for example, we have two variables, age and weight, that predict whether a person is going to sign up for a gym membership or not. If our training data showed that 90% of the people older than 40 signed up, we would split the data there, and age would become a top node in the tree. We can almost say that this split has made the data "90% pure". Rigorous measures of impurity, based on computing the proportion of the data that belongs to each class, such as entropy or the Gini index, are used to quantify the homogeneity in classification trees.
In a regression tree the idea is this: since the target variable does not have classes, we fit a
regression model to the target variable using each of the independent variables. Then for
each independent variable, the data is split at several split points. At each split point, the
"error" between the predicted value and the actual values is squared to get a "Sum of
Squared Errors (SSE)". The split point errors across the variables are compared and the
variable/point yielding the lowest SSE is chosen as the root node/split point. This process is
recursively continued.
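A minimal Python sketch (my own illustration, with hypothetical one-feature data) of choosing a regression-tree split point by the lowest SSE:

```python
# Each branch predicts its mean; pick the split point with the lowest total SSE
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.3, 0.9, 4.8, 5.2, 5.0]

def sse(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

best = None
for t in xs[1:]:                                   # candidate split points
    left  = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    total = sse(left) + sse(right)
    if best is None or total < best[1]:
        best = (t, total)

print(best)   # ~(4.0, 0.16): splitting between x = 3 and x = 4 gives the lowest SSE
```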
We discussed a C4.5 classification tree (for more than 2 categories of the target variable) here, which uses information gain to decide which variable to split on. In a corresponding regression tree, standard deviation is used to make that decision in place of information gain. More technical details are here. Regression trees, by virtue of using regression models, lose one strength of standard decision trees: the ability to handle highly non-linear parameters. In such cases, it may be better to use the C4.5-type implementation.
[Figure: a tree summarizing, at a high level, the types of decision trees available.]
Decision tree bagging (handling overfitting)
We train different models on different samples of the data and combine them to come up with an improved classifier.
Decision trees have a tendency towards high variance, which leads to poor generalization. Tree bagging is a technique that can help us solve this problem.
But how? Bagging's key feature is its sampling: it creates different samples of the data with replacement. We then build a different model on each sample and, at the prediction stage, combine their results (see the sketch at the end of this section).
[Figure: bagging workflow - the data is resampled into several samples, a tree is trained on each, and their predictions are combined (averaged / voted) into one prediction.]
Consider the following dataset:
SN RED GREEN BLUE CLASS
1 0.95 0.3 0.63 0
2 0.83 0.4 0.61 0
3 0.75 0.25 0.59 0
4 0.63 0.19 0.39 0
5 0.65 0.45 0.46 1
6 0.53 0.3 0.19 1
7 0.32 0.5 0.35 1
8 0.77 0.55 0.41 1
Table-1: Main dataset
Sample-1
SN RED GREEN BLUE CLASS
1 0.95 0.3 0.63 0
2 0.83 0.4 0.61 0
8 0.77 0.55 0.41 1
Sample-2
SN RED GREEN BLUE CLASS
2 0.83 0.4 0.61 0
3 0.75 0.25 0.59 0
4 0.63 0.19 0.39 0
Sample-3
SN RED GREEN BLUE CLASS
2 0.83 0.4 0.61 0
5 0.65 0.45 0.46 1
6 0.53 0.3 0.19 1
Properties (key points):
Every sample is unique.
Every sample is created from the same dataset.
Every sample has a fixed number of observations.
Every sample has different statistical properties.
Every sample has the same features.
Decision trees learnt over different samples therefore have different learning statistics.
Sample-1
SN RED GREEN CLASS
1 0.95 0.3 0
2 0.83 0.4 0
6 0.53 0.3 1
Sample-2
SN H S CLASS
1 0.63 0.19 0
3 0.53 0.3 0
4 0.32 0.5 0
Sample-3
SN BLUE H CLASS
1 0.63 0.63 0
6 0.19 0.83 1
8 0.41 0.63 1
Each of our classifiers will learn from different instances (with less overlap) and different features, which yields less correlated but better-generalizing predictors.
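A minimal scikit-learn sketch of bagging (assuming scikit-learn is available), using the 8-record colour dataset from Table-1; the query point passed to predict is hypothetical:

```python
from sklearn.ensemble import BaggingClassifier

# The 8-record RED/GREEN/BLUE dataset from Table-1
X = [[0.95, 0.30, 0.63], [0.83, 0.40, 0.61], [0.75, 0.25, 0.59], [0.63, 0.19, 0.39],
     [0.65, 0.45, 0.46], [0.53, 0.30, 0.19], [0.32, 0.50, 0.35], [0.77, 0.55, 0.41]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# The default base estimator is a decision tree; each tree sees a bootstrap sample of the
# rows and a random subset of the columns, much like the hand-made samples above.
bag = BaggingClassifier(n_estimators=10, max_samples=0.75, max_features=2,
                        bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.predict([[0.70, 0.45, 0.40]]))   # the trees' predictions are combined by voting
```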