Decision Tree

Let us start with a simple application:


Suppose you are the owner of a fund advisory company and you want to select people who are willing to invest through your company.

How can you choose them?


One way is brute force: pick up your phone, dial numbers, and ask people whether they are willing to invest money in the market. Some of them may not want to, some may not earn enough, and some will simply be irritated.

So what should you do?


Your problem can be solved by analyzing people's mindsets, provided past data is available.

SN Name Sex Age Employment Salary Marital st. Investor
1 P1 M 20 Un NA Un N
2 P2 M 25 Em 25k Un N
3 P3 M 30 Em 40k Ma Y
4 P4 F 30 Em 40k Ma N
5 P5 M 25 Em 30k Ma Y
A summary of the dataset (assume the full dataset) can be represented as:
 Only 15% of people in the 20-25 year age group are interested, and only 5% in the 50-70 year age group, while 80% of the 25-30 year age group invest in the market.
 Only 30% of the females in the population invest, whereas 70% of the males do.
 Unmarried males are less interested (30%); on the other hand, 70% of married males are interested. So our criteria narrow down further.
 Next, married persons who have children are most likely to invest (80%), and only 20% of those without children invest.

We have found our target people... in effect, we have created a decision tree from our dataset.

Decision Tree:
A decision tree is one of the most powerful and popular tools for classification and prediction (regression).
 A decision tree is a flowchart-like tree structure.
 Each internal node denotes a test on an attribute (split nodes, decision nodes or internal nodes).
 Each branch represents an outcome of the test, and
 Each leaf node (terminal node) holds a class label.

Background:
 Growing a tree involves deciding which features to choose (splitting feature),
 what conditions to use for splitting, and
 when to stop (stopping criteria).

Recursive Splitting (in the spirit of ID3)


 In this procedure all the features are considered.
 Different split points are tried and tested using a cost function, which calculates how much each split costs us in accuracy.
 The split with the best (lowest) cost is selected.
 This algorithm is recursive in nature, as the groups formed can be sub-divided using the same strategy.
 Because of this procedure, the algorithm is also known as a greedy algorithm (ID3).
1. Entire dataset (include all features)
2. Select the splitting criterion (find the splitting node which maximizes the objective function)
3. Split the branch (divide the dataset into sub/multiple groups)
4. Repeat steps 2 & 3 with each subgroup
5. Check the stopping criteria (condition)
6. Final decision tree
Iterative Dichotomiser 3 (ID3)
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan[1] (1986) used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm.
Procedure: The ID3 algorithm begins with the original set S as the root node. On each iteration, it runs through every unused attribute of the set S and calculates the entropy H(S) or the information gain IG(S) for splitting on that attribute. It then selects the attribute which has the smallest entropy (or, equivalently, the largest information gain). The set S is then split or partitioned by the selected attribute to produce subsets of the data.

The ID3 algorithm is suited to classification of categorical datasets.

SN Name Sex Salary Marital st. Investor


1 P1 M Low Un N
2 P2 M Med Un N
3 P3 M Med Ma Y
4 P4 F Med Ma N
5 P5 M Med Ma Y
6 P6 F High Un Y
7 P7 F Low Un N
8 P8 M High Un Y
9 P9 F Med Un Y
10 P10 M Low Ma Y
Selecting the splitting node
Step 1 - Calculate the entropy of the class
Step 2 - Calculate the entropy of each attribute (feature)
Step 3 - Calculate the information gain of each attribute (feature)
Step 4 - Select the attribute with the highest information gain for splitting - which feature is best to split on?

Root Node (select splitting node)


We have to select the attribute (feature) whose data is most informative. This requires three calculations:

 Entropy of classes
 Entropy of attributes
 Information gain
Entropy is the degree of randomness or uncertainty - in our case, the degree of class variance. In other words, entropy is also a measure of impurity; a dataset is called pure if all its instances have the same class label. For a pure dataset, the entropy will be zero or near zero. Information gain, also known as mutual information, is a measure used to select the most informative attribute, i.e. the one that reduces the entropy the most.
Entropy of classes (Binary)
E = -Fc1 log2(Fc1) - Fc2 log2(Fc2)
E = -FYes log2(FYes) - FNo log2(FNo)

Where Fc1 and Fc2 are the fractions of the two classes (for example YES and NO) in the class attribute.
Entropy of a particular attribute value (v = Low, Med, High for salary):
Ev = -Fc1i log2(Fc1i) - Fc2i log2(Fc2i)
Elow = -Fc1_low log2(Fc1_low) - Fc2_low log2(Fc2_low)
Elow = -FYes_low log2(FYes_low) - FNo_low log2(FNo_low)

Where Fc1i and Fc2i are the fractions of YES and NO for the selected attribute value.

Total entropy of an attribute

Ea = Σ (i = 1 to k) Evi * [(Fc1i + Fc2i) / N]

Where k is the number of values in the given attribute (Low, Med, High for salary) and N is the number of data points.

Esalary = Elow * [(Fc1_low + Fc2_low)/N] + EMed * [(Fc1_Med + Fc2_Med)/N] + EHigh * [(Fc1_High + Fc2_High)/N]


Information Gain
IG = E - Ea
For example (with illustrative values): IG(Salary) = 0.94 - 0.693 = 0.247
Understand all through example
SN Name Sex Salary Marital st. Investor
1 P1 M Low Un N
2 P2 M Med Un N
3 P3 M Med Ma Y
4 P4 F Med Ma N
5 P5 M Med Ma Y
6 P6 F High Un Y
7 P7 F Low Un N
8 P8 M High Un Y
9 P9 F Med Un Y
10 P10 M Low Ma Y

Entropy of class
 Number of “Yes” in dataset: 6
 Number of “No” in dataset: 4
 Total number of observations:10

E (Investor)= E(Yes, No)


E (Investor)= E(6, 4)
E (Investor)= -FYeslog2(FYes) – FNolog2(FNo)
E (Investor)= -(6/10)log2(6/10) - (4/10)log2(4/10)
E(Investor)= 0.9709

Entropy of attribute 'Salary'


Number of values in Salary = 3 (Low, Med, High)
E (Investor, Salary) = Esalary = Flow*Elow + FMed*EMed + FHigh*EHigh
= Flow*E(1,2) + FMed*E(3,2) + FHigh*E(2,0)
Entropy of attribute value-Low
Value Yes No
Low 1 2

Elow = E(1,2)
Elow = -Fc1_low log2(Fc1_low) - Fc2_low log2(Fc2_low)
Elow = -FYes_low log2(FYes_low) - FNo_low log2(FNo_low)
Elow = -(1/3)log2(1/3) - (2/3)log2(2/3)
Elow = -0.33 log2(0.33) - 0.66 log2(0.66)
Elow = 0.9234

Value Yes No
Med 3 2
EMed = E(3,2)
EMed = -(3/5)log2(3/5) - (2/5)log2(2/5)
EMed = 0.9709
Value Yes No
High 2 0
EHigh = E(2,0)
EHigh = -(2/2)log2(2/2) - (0/2)log2(0/2)
EHigh = 0   (taking 0*log2(0) = 0)

Total Entropy of attribute “Salary”


E (Investor, Salary)= Esalary = Flow*Elow+ FMed*EMed + FHigh*EHigh
= [(1+2)/10]*0.9234 +[(3+2)/10]*0.9709 +[(2+0)/10]*0
Esalary = 0.7624
Information Gain (Salary)
IGSalary = E (Investor) – Esalary
IGSalary =0.9709-0.7624
IGSalary =0.2085
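These entropy and information-gain calculations can be checked with a short Python sketch (this is not from the original notes; the helper names entropy and information_gain are our own, and the exact values differ slightly from the hand figures above because the fractions are not rounded to 0.33/0.66):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, label="Investor"):
    # IG = class entropy minus the weighted entropy of the attribute's subsets
    base = entropy([r[label] for r in rows])
    total = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[label] for r in rows if r[attr] == v]
        remainder += len(subset) / total * entropy(subset)
    return base - remainder

# The 10-person dataset from the text
data = [
    {"Sex": "M", "Salary": "Low",  "Marital": "Un", "Investor": "N"},
    {"Sex": "M", "Salary": "Med",  "Marital": "Un", "Investor": "N"},
    {"Sex": "M", "Salary": "Med",  "Marital": "Ma", "Investor": "Y"},
    {"Sex": "F", "Salary": "Med",  "Marital": "Ma", "Investor": "N"},
    {"Sex": "M", "Salary": "Med",  "Marital": "Ma", "Investor": "Y"},
    {"Sex": "F", "Salary": "High", "Marital": "Un", "Investor": "Y"},
    {"Sex": "F", "Salary": "Low",  "Marital": "Un", "Investor": "N"},
    {"Sex": "M", "Salary": "High", "Marital": "Un", "Investor": "Y"},
    {"Sex": "F", "Salary": "Med",  "Marital": "Un", "Investor": "Y"},
    {"Sex": "M", "Salary": "Low",  "Marital": "Ma", "Investor": "Y"},
]

print(entropy([r["Investor"] for r in data]))   # ~0.9710  (class entropy)
print(information_gain(data, "Salary"))          # ~0.210  (0.2085 in the text, due to rounding)
print(information_gain(data, "Sex"))             # ~0.020  (0.0169 in the text)
print(information_gain(data, "Marital"))         # ~0.046

Salary has the largest information gain in either case, so the conclusion below is unchanged.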
Attribute (Sex)
Values Yes No Attribute value Entropy
M 4 2 0.9234
F 2 2 1
ESex = 0.9540
IGSex= 0.0169

Attribute (Marital Status)


Values Yes No Attribute value Entropy
M 3 1 0.8112
UN 3 3 1
EMarital = 0.9244
IGMarital = 0.0465

Summary
Attribute Information Gain
Salary 0.2085
Sex 0.0169
Marital Status 0.0465
[Tree so far: root node = Salary, with branches Low, Med and High; the High branch is a leaf labelled "Yes".]
Observations:
The "High" branch has already reached its decision: class "Yes".
How? A person with a high salary is an investor no matter what his/her marital status or gender is.
SN Name Sex Salary Marital st. Investor
6 P6 F High Un Y
8 P8 M High Un Y

This is a case of a "pure dataset", i.e. all its instances have the same class.
Next parent node ->
Calculation of the next node under the low salary
branch:
SN Name Sex Salary Marital st. Investor
1 P1 M Low Un N
7 P7 F Low Un N
10 P10 M Low Ma Y

Choosing between sex and marital status?


(Sex)
Values Yes No Attribute value Entropy
M 1 1 1
F 0 1 0
ESex = 0.6666
IGSex = 0.2568   (the class entropy of the Low-salary subset is 0.9234; 0.9234 - 0.6666 = 0.2568)
(Marital Status)
Values Yes No Attribute value Entropy
Ma 1 0 0
Un 0 2 0
EMarital = 0
IGMarital = 0.9234
Summary
Attribute Information Gain
Sex 0.2568
Marital Status 0.9234

[Tree so far: root = Salary; High branch -> leaf "Yes"; Low branch -> Marital st. node with branches Married -> "Yes" and Unmarried -> "No"; the Med branch is still to be decided.]
Observations:
The "Married" and "Unmarried" branches have already reached their decisions: class "Yes" and "No" respectively. How? A person with a low salary who is married will invest, and one who is unmarried will not, no matter what his/her sex is. E.g.:
SN Name Sex Salary Marital st. Investor
1 P1 M Low Un N
7 P7 F Low Un N
10 P10 M Low Ma Y
This is a case of a "pure dataset", i.e. each sub-dataset has a single class.

Next parent node ->


Calculation of the next node under the Med salary
branch:
SN Name Sex Salary Marital st. Investor
2 P2 M Med Un N
3 P3 M Med Ma Y
4 P4 F Med Ma N
5 P5 M Med Ma Y
9 P9 F Med Un Y
Choosing between sex and marital status?
(Sex)
Values Yes No Attribute value Entropy
M 2 1 0.9234
F 1 1 1
ESex = 0.9540
IGSex = 0.0169
(Marital Status)
Values Yes No Attribute value Entropy
Ma 2 1 0.9234
Un 1 1 1
EMarital = 0.9540
IGMarital = 0.0169
(The class entropy of the Med-salary subset is E(3,2) = 0.9709, so IG = 0.9709 - 0.9540 = 0.0169 for both attributes.)

Summary
Attribute Information Gain
Sex 0.0169
Marital Status 0.0169

Both attributes have the same information gain! (We may choose either of them.)
[Tree so far: root = Salary; High -> "Yes"; Low -> Marital st. (Married -> "Yes", Unmarried -> "No"); Med -> Sex node, still to be expanded.]

Next parent node ->


Calculation of the next node under the sex “Male”
branch:
SN Name Sex Salary Marital st. Investor
2 P2 M Med Un N
3 P3 M Med Ma Y
5 P5 M Med Ma Y

Splitting on marital status yields pure subsets: all married males invest and the unmarried do not.
Calculation of the next node under the sex "Female" branch:
SN Name Sex Salary Marital st. Investor
4 P4 F Med Ma N
9 P9 F Med Un Y
Here too the subsets are pure: the married female does not invest and the unmarried one does.

[Final tree: root = Salary. High -> "Yes". Low -> Marital st. (Married -> "Yes", Unmarried -> "No"). Med -> Sex; M -> Marital st. (Married -> "Yes", Unmarried -> "No"); F -> Marital st. (Married -> "No", Unmarried -> "Yes").]
Stopping criteria
1. Max/fixed number of leaves
2. Max/fixed depth of the tree
3. Min number of observations in a node
4. Pure node (every element in the subset belongs to the same class; in this case the node is turned into a leaf node and labelled with the class of the examples).

5. There are no more attributes to be selected, but the examples still do not belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset.

6. There are no examples in the subset, which happens when no example in the parent set was found to match a specific value of the selected attribute. An example could be the absence of a person among the population with age over 100 years. Then a leaf node is created and labelled with the most common class of the examples in the parent node's set.

7. Splitting no longer improves information gain, i.e. IG(Children) < IG(Parent).

A minimal recursive ID3 sketch that applies several of these stopping criteria is given below.
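This is a small Python sketch of the recursive ID3 procedure described above, not the original author's code; the function names (id3, info_gain) and the stopping parameters max_depth and min_samples are illustrative assumptions. It selects the attribute with the highest information gain, splits, and recurses until a stopping condition is met:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, label):
    base = entropy([r[label] for r in rows])
    subset = lambda v: [r[label] for r in rows if r[attr] == v]
    return base - sum(len(subset(v)) / len(rows) * entropy(subset(v))
                      for v in set(r[attr] for r in rows))

def id3(rows, attrs, label, depth=0, max_depth=5, min_samples=1):
    classes = [r[label] for r in rows]
    majority = Counter(classes).most_common(1)[0][0]
    # Stopping criteria: pure node, no attributes left, depth or size limits reached
    if len(set(classes)) == 1 or not attrs or depth >= max_depth or len(rows) <= min_samples:
        return majority
    best = max(attrs, key=lambda a: info_gain(rows, a, label))
    if info_gain(rows, best, label) <= 0:          # splitting no longer helps
        return majority
    tree = {best: {}}
    for v in set(r[best] for r in rows):
        branch_rows = [r for r in rows if r[best] == v]
        tree[best][v] = id3(branch_rows, [a for a in attrs if a != best],
                            label, depth + 1, max_depth, min_samples)
    return tree

Called on the 10-person table as id3(data, ["Sex", "Salary", "Marital"], "Investor"), this reproduces the tree derived by hand: Salary at the root, High -> Yes, Low -> Marital st., and Med -> Sex (the Sex/Marital tie is broken by attribute order).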


CLASSIFICATION AND REGRESSION TREES (CART)

Classification-type problems: Classification-type problems are generally those where we attempt to predict values of a categorical dependent variable (class) from one or more continuous and/or categorical predictor variables.

For example, we may be interested in predicting who will or will not graduate from college, or who will or will not renew a subscription. These would be examples of simple binary
classification problems, where the categorical dependent variable can only assume two distinct and mutually exclusive values. In other cases, we might be interested in predicting which
one of multiple different alternative consumer products (e.g., makes of cars) a person decides to purchase, or which type of failure occurs with different types of engines. In those cases
there are multiple categories or classes for the categorical dependent variable.

Regression-type problems: Regression-type problems are generally those where we attempt to predict the values of a continuous variable from one or more continuous and/or categorical predictor variables.
For example, we may want to predict the selling prices of single family homes (a continuous dependent variable) from various other continuous predictors (e.g., square footage) as well
as categorical predictors (e.g., style of home, such as ranch, two-story, etc.; zip code or telephone area code where the property is located, etc.; note that this latter variable would be
categorical in nature, even though it would contain numeric values or codes).

CLASSIFICATION TREES
(Solve classification-type problems)
Classification tree methods (i.e., decision tree methods such as ID3) are recommended when the data mining task involves classification or prediction of outcomes (categories). A classification tree labels records and assigns them to discrete classes.
REGRESSION TREES
(Solve regression-type problems)
A regression tree works in a very similar fashion, but its output is a continuous numerical value. Regression trees are needed when the response variable is numeric or continuous.
For example, the predicted price of a consumer good. Thus regression trees are applicable for prediction-type problems as opposed to classification.
CLASSIFICATION AND REGRESSION TREES (C&RT)
(Solve both classification and regression type problems)
The CART or Classification & Regression Trees methodology was introduced in 1984 by Leo Breiman. This algorithm is a cornerstone of ensemble methods such as bagging and boosting. The representation for the CART model is a binary tree (each parent node has at most two child branches).

With numerical values


If you look closer at the tree, you will see that all the attributes are numerical, so the node-splitting criterion is based on numerical values only. There is an important property of a binary split: one branch holds values less than the split value and the other holds values equal to or greater than it. This criterion is used to divide the data using an anchor (split) point.
The main elements of CART (and any decision tree algorithm) are:

1. Rules for splitting data at a node based on the value of one variable, together with an impurity check (entropy for categorical data; the Gini index for numerical as well as categorical data);
2. Stopping rules for deciding when a branch is terminal and can be split no more; and
3. Finally, a prediction for the target variable in each terminal node.
What are the issues with entropy?
Gini Index for impurity check
The Gini index is the cost function used to evaluate splits in the dataset. Each split has two important aspects: the first is the attribute and the second is the attribute value.

 It favors larger partitions and attributes with many levels/distinct values (because of this property, Gini is often given more priority than entropy).
 It uses the squared proportions of the classes.
 For a perfectly classified (pure) dataset, the Gini index is zero.
 For an evenly distributed dataset it takes its maximum value, 1 - 1/k for k classes.
 You want a variable split that has a low Gini index.
 The formula is 1 - (P(class1)^2 + P(class2)^2 + ... + P(classN)^2).
For a nominal variable with k levels, the maximum value of the Gini index is 1 - 1/k.

For 2 levels, max Gini index = 1 - 1/2 = 0.5.

A minimal code check of this formula follows.
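A short Python sketch of the Gini index formula (the function name gini is our own):

def gini(proportions):
    # Gini index = 1 - sum of squared class proportions
    return 1 - sum(p ** 2 for p in proportions)

print(gini([1.0, 0.0]))        # 0.0   -> pure node
print(gini([0.5, 0.5]))        # 0.5   -> maximum for 2 classes (1 - 1/2)
print(gini([1/3, 1/3, 1/3]))   # ~0.667 -> maximum for 3 classes (1 - 1/3)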

Binary Target Variable: Worked-out Example


Consider an example where the target variable is binary and the two class proportions are 0.74 and 0.26.
Gini Index = 1 - 0.74^2 - 0.26^2
= 1 - 0.5476 - 0.0676
= 0.3848

Nominal Target Variable: Worked-out Example


When the target variable is a nominal variable with several levels, we can calculate the Gini index in a similar way. This is an illustrative example with class proportions 0.08, 0.13, 0.29 and 0.50.

Gini Index = 1 - 0.08^2 - 0.13^2 - 0.29^2 - 0.50^2

= 1 - 0.0064 - 0.0169 - 0.0841 - 0.25

= 0.643
Squaring the terms gives greater weight to larger proportions (compare the sum of absolute errors with the sum of squared errors).

GINI of a split
GINI (s,t) = PL GINI (tL) + PR GINI (tR)
= weighted_Cost(Left) + weighted_Cost(Right)
Where

s : split
t : node
PL : Proportion of observation in Left Node after split, s
GINI (tL) : Gini of Left Node after split, s
PR : Proportion of observation in Right Node after split, s
GINI (tR) : Gini of Right Node after split, s

Example
We have an example in which the input (parent) node has an equal number of target variable values, "Yes" and "No". The overall number of observations is 24.

The Gender variable is considered to split the node. The Gini split value is calculated as below.

The Gini index for this node (class) will be

= 1 - (1/2)^2 - (1/2)^2
= 1 - 0.25 - 0.25

= 0.5

Now we want to split the node based on the Gender variable. After the split, 8 observations fall in the left node (with a 6/2 class split) and 16 observations fall in the right node (with a 6/10 class split).

Now, let’s calculate GINI index of the split using Gender variable.

GINI (s,t) = PL GINI (tL) + PR GINI (tR)

GINI (tL) = 1 - (6/8)^2 - (2/8)^2
= 1 - 0.5625 - 0.0625
= 0.375
GINI (tR) = 1 - (6/16)^2 - (10/16)^2
= 1 - 0.140625 - 0.390625
= 0.469

We then weight and sum each of the splits based on the proportion of the data each split takes up.

GINI (s,t) = (8/24)*0.375 + (16/24)*0.469

= 0.125 + 0.313
= 0.438
Similarly, we need to find GINI index value for all the split points and select the best split for a
variable.
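A quick Python check of this split calculation (the helper names gini_counts and gini_split are our own):

def gini_counts(*counts):
    # Gini index from raw class counts
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    # Weighted Gini of a binary split; the weights are the node sizes
    n_l, n_r = sum(left_counts), sum(right_counts)
    n = n_l + n_r
    return n_l / n * gini_counts(*left_counts) + n_r / n * gini_counts(*right_counts)

print(gini_counts(12, 12))           # 0.5    -> parent node (12 Yes, 12 No)
print(gini_split((6, 2), (6, 10)))   # ~0.4375 -> the Gender split (0.438 in the text)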

Ex. [multiple attributes-Continuous]


How to find out attribute and its value to best split?

SR A1 A2 Class
1 3.2 1.5 0
2 1.3 1.2 0
3 3.7 2.8 0
4 2.9 2.4 0
5 3.9 1.9 0
6 7.5 3.5 1
7 9.0 3.2 1
8 7.4 0.9 1
9 9.5 4.2 1
10 7.3 3.5 1

Suppose the Gini index score against each value of each attribute is as follows (columns: A1 value, its Gini index, A2 value, its Gini index):

3.2 0.20 1.5 0.47
1.3 0.25 1.2 0.50
3.7 0.14 2.8 0.14
2.9 0.23 2.4 0.32
3.9 0.00 1.9 0.41
7.5 0.23 3.5 0.25
9.0 0.25 3.2 0.20
7.4 0.20 0.9 0.25
9.5 0.25 4.2 0.25
7.3 0.14 3.5 0.25

The attribute value with the lowest Gini index score will be chosen as the splitting node in the decision tree. Here, the value 3.9 of A1 has the lowest Gini index.

[Split: an A1 node with branches A1 < 3.9 and A1 >= 3.9, each leading to a data subset.]

Repeat this process for each node of the next level until some terminating criterion is encountered (a sketch of this threshold search is given below).
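A small Python sketch of this search over candidate thresholds for one continuous attribute, using the A1 data above (the Gini scores in the table were illustrative; with this data any threshold separating 3.9 from 7.3 gives a perfect, zero-Gini split). The helper names are our own:

def gini_counts(*counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def split_gini(values, labels, threshold):
    # Weighted Gini of splitting rows into (value < threshold) and (value >= threshold)
    left  = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    def node_gini(ys):
        return gini_counts(*[ys.count(c) for c in set(ys)]) if ys else 0.0
    return len(left) / n * node_gini(left) + len(right) / n * node_gini(right)

def best_threshold(values, labels):
    # Candidate thresholds: midpoints between consecutive distinct sorted values
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min((split_gini(values, labels, t), t) for t in candidates)

A1 = [3.2, 1.3, 3.7, 2.9, 3.9, 7.5, 9.0, 7.4, 9.5, 7.3]
y  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(best_threshold(A1, y))   # (0.0, 5.6) -> zero-Gini split at the midpoint of 3.9 and 7.3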
Ex.: Consider Salary for the first split using the Gini index (a sketch evaluating the binary groupings of Salary follows the table):

SN Name Sex Salary Marital st. Investor


1 P1 M Low Un N
2 P2 M Med Un N
3 P3 M Med Ma Y
4 P4 F Med Ma N
5 P5 M Med Ma Y
6 P6 F High Un Y
7 P7 F Low Un N
8 P8 M High Un Y
9 P9 F Med Un Y
10 P10 M Low Ma Y
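Since CART makes binary splits, a categorical attribute like Salary with three levels must be grouped into two partitions. The following is a small sketch (our own helper names, not the original notes) that evaluates the three possible groupings of Salary on this table by weighted Gini:

from itertools import combinations

def gini_counts(yes, no):
    n = yes + no
    return 1 - (yes / n) ** 2 - (no / n) ** 2 if n else 0.0

# (Yes, No) counts per salary level, taken from the 10-person table
counts = {"Low": (1, 2), "Med": (3, 2), "High": (2, 0)}

def weighted_gini(group):
    # group: set of levels sent to the left branch; the remaining levels go right
    left  = [counts[v] for v in counts if v in group]
    right = [counts[v] for v in counts if v not in group]
    def agg(cs):  # aggregate (Yes, No) counts of a branch
        return sum(c[0] for c in cs), sum(c[1] for c in cs)
    (ly, ln), (ry, rn) = agg(left), agg(right)
    n = ly + ln + ry + rn
    return (ly + ln) / n * gini_counts(ly, ln) + (ry + rn) / n * gini_counts(ry, rn)

for group in combinations(counts, 1):   # each single level vs the other two
    print(set(group), round(weighted_gini(set(group)), 4))
# {'Low'} 0.419, {'Med'} 0.48, {'High'} 0.4 -> {High} vs {Low, Med} is the best binary split here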
The groupings shown in Figures 4.10(a) and (b) preserve the order among the attribute values, whereas the grouping shown in Figure 4.10(c) violates this property because it combines the attribute values Small and Large into the same partition while Medium and Extra Large are combined into another partition.

Why can ordinal attribute values be grouped as long as the grouping does not violate the order property of the attribute values?

Ans: It may or may not be useful - it depends on the relation of the feature with the class, but mostly it is useful.

Discussion: it is an important optimization feature. We have said that for a categorical predictor with m levels, all 2^(m-1) different possible splits are tested. Luckily, for any ordered outcome there is a computational shortcut that allows the program to find the best split using only m-1 comparisons. As you can see, an ordered factor can be processed much more efficiently.

My advice therefore: as part of feature engineering, decide whether to use a factor as ordered or unordered. Use an ordered factor only if it is highly correlated with the output variable; otherwise fall back to an unordered factor.

We usually split things into specified parts which are not contradictory. A special item can be small or medium, as one group, and large, as the other group; but it cannot be small and large at the same time. The point is that you have a sequence (an ordering) in your data. If there were no such ordering, you could have any combination of attribute values. Suppose you have a set of attribute values for a fruit: apple, pineapple and watermelon. Because there is no ordering, all possible combinations are allowed for binary splits.

Counter-argument: it may be useless, since for example a T-shirt factory can decide to print red T-shirts in sizes Small and Large and blue T-shirts in sizes Medium and Extra Large. Since we don't know the model that generates the data, how can we infer that it is "better" to preserve the order in the splits of an ordinal attribute? Is there any advantage in maintaining the order of an attribute's splits? Ans: feature construction is one of the important tasks of the modeler. It is up to you to decide whether to represent a categorical variable as ordered or unordered.

Ex. 2
Let's assume we have 3 classes and 80 objects: 19 objects are in class 1, 21 in class 2, and 40 in class 3 (denoted as (19, 21, 40)).
The Gini index would be: 1 - [(19/80)^2 + (21/80)^2 + (40/80)^2] = 0.6247, i.e. cost_before = Gini(19,21,40) = 0.6247.
In order to decide where to split, we test all possible splits. For example, splitting at 2.0623 results in the partition (16,9,0) and (3,12,40).
After testing x1 < 2.0623:
cost_L = Gini(16,9,0) = 0.4608
cost_R = Gini(3,12,40) = 0.4205
Then we weight the branch impurities by the empirical branch probabilities:
cost_{x1 < 2.0623} = (25/80)*cost_L + (55/80)*cost_R = 0.4331

We do that for every possible split, for example x1 < 1:

cost_{x1 < 1} = Fraction_L * Gini(8,4,0) + Fraction_R * Gini(11,17,40) = (12/80)*0.4444 + (68/80)*0.5653 = 0.5472
After that, we choose the split with the lowest cost. This is the split x1 < 2.0623, with a cost of 0.4331.
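These numbers can be checked with a few lines of Python (the helper name gini is our own):

def gini(*counts):
    # Gini index for a node with the given class counts
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = gini(19, 21, 40)                                   # ~0.6247
left, right = gini(16, 9, 0), gini(3, 12, 40)               # ~0.4608, ~0.4205
print(parent, 25/80 * left + 55/80 * right)                 # ~0.6247, ~0.4331
print(12/80 * gini(8, 4, 0) + 68/80 * gini(11, 17, 40))     # ~0.5472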

Ex.3

Gini Index by Hand


Let’s walk through an example of calculating a few nodes.

 We’re trying to predict the class variable.


 For numeric variables, you would go from distinct value to distinct value and check
the split as less than and greater than or equal to.

Class Var1 Var2

A 0 33

A 0 54
A 0 56

A 0 42

A 1 50

B 1 55

B 1 31

B 0 -4

B 1 77

B 0 49

We’ll first try using Gini Index on a couple values. Let’s try Var1 == 1 and Var2 >=32.
Gini Index Example: Var1 == 1

 Baseline of Split: Var1 has 4 instances (4/10) where it’s equal to 1 and 6 instances
(6/10) when it’s equal to 0.
 For Var1 == 1 & Class == A: 1 / 4 instances have class equal to A.
 For Var1 == 1 & Class == B: 3 / 4 instances have class equal to B.
o Gini Index here is 1-((1/4)^2 + (3/4)^2) = 0.375
 For Var1 == 0 & Class== A: 4 / 6 instances have class equal to A.
 For Var1 == 0 & Class == B: 2 / 6 instances have class equal to B.
o Gini Index here is 1-((4/6)^2 + (2/6)^2) = 0.4444
 We then weight and sum each of the splits based on the baseline / proportion of the
data each split takes up.
o 4/10 * 0.375 + 6/10 * 0.444 = 0.41667

Gini Index Example: Var2 >= 32

 Baseline of Split: Var2 has 8 instances (8/10) where it’s equal >=32 and 2 instances
(2/10) when it’s less than 32.
 For Var2 >= 32 & Class == A: 5 / 8 instances have class equal to A.
 For Var2 >= 32 & Class == B: 3 / 8 instances have class equal to B.
o Gini Index here is 1-((5/8)^2 + (3/8)^2) = 0.46875
 For Var2 < 32 & Class == A: 0 / 2 instances have class equal to A.
 For Var2 < 32 & Class == B: 2 / 2 instances have class equal to B.
o Gini Index here is 1-((0/2)^2 + (2/2)^2) = 0
 We then weight and sum each of the splits based on the baseline / proportion of the
data each split takes up.
o 8/10 * 0.46875 + 2/10 * 0 = 0.375

Based on these results, you would choose Var2>=32 as the split since its weighted Gini
Index is smallest.

Ex. 4
X y
-1.2 0
-3.2 0
2.1 1
1.5 1

X is attribute and splitting value is x>=0


Calculate Gini index at x>=0
LEFT TREE RIGHT TREE
X y x y
-1.2 0 2.1 1
-3.2 0 1.5 1

Gini of the split at x >= 0

= (2/4)[1 - (2/2)^2 - (0/2)^2] + (2/4)[1 - (0/2)^2 - (2/2)^2]
= 0.5[1 - 1 - 0] + 0.5[1 - 0 - 1]
= 0 + 0 = 0
Ex. 5
X y
-1.2 1
-3.2 0
2.1 1
1.5 1

X is attribute and splitting value is x>=0


Calculate Gini index at x>=0
LEFT TREE    RIGHT TREE
x y          x y
-1.2 1       2.1 1
-3.2 0       1.5 1

Gini of the split at x >= 0

= (2/4)[1 - (1/2)^2 - (1/2)^2] + (2/4)[1 - (0/2)^2 - (2/2)^2]
= 0.5[1 - 0.25 - 0.25] + 0.5[1 - 0 - 1]
= 0.5[0.5] + 0 = 0.25
Pros and Cons of decision trees
Advantages:
Decision trees are applicable to continuous, ordinal and categorical inputs. They can solve both classification and regression problems.
Decision trees are also not sensitive to outliers, since splitting is based on the proportion of samples within the split ranges and not on absolute values.
Another advantage of a decision tree is that it explores possibilities and alternatives, leading toward a desirable outcome.
{For example, a tree showing ways to use excess capital will show what choices are available, and what choices must await stock market fluctuations. Another revelation from decision trees is the taxonomy of priorities - for example, is employee maintenance more or less important than stockholder dividends?}

Advantage: Decision trees implicitly perform variable


screening or feature selection
{We described here why feature selection is important in analytics. We also introduced a few
common techniques for performing feature selection or variable screening. When we fit a
decision tree to a training dataset, the top few nodes on which the tree is split are essentially
the most important variables within the dataset and feature selection is completed
automatically!}

 Advantage: Decision trees require relatively little effort from users for data preparation (little data pre-processing is required).
Consider scale differences between parameters - for example, a dataset which measures revenue in millions and loan age in years. Fitting a regression model would require some form of normalization or scaling before we could interpret the coefficients. Such variable transformations are not required with decision trees, because the tree structure remains the same with or without the transformation.

Advantage: Nonlinear relationships between


parameters do not affect tree performance
As we described here, highly nonlinear relationships between variables will result in failing
checks for simple regression models and thus make such models invalid. However, decision
trees do not require any assumptions of linearity in the data. Thus, we can use them in
scenarios where we know the parameters are nonlinearly related.

Advantage: The best feature of using trees for analytics


- easy to interpret and explain to executives!
Decision trees are very intuitive and easy to explain; the main advantage is interpretability. Decision trees are "white boxes" in the sense that the acquired knowledge can be expressed in a readable form, while KNN, SVM and neural networks are generally black boxes, i.e. you cannot read the acquired knowledge in a comprehensible way.

Disadvantages

Instability
The reliability of the information in the decision tree depends on feeding precise internal and external information at the outset. Even a small change in the input data can at times cause large changes in the tree. Changing variables, excluding duplicate information, or altering the sequence midway can lead to major changes and might require redrawing the tree.

Complexity/ computationally Expensive


Among the major decision tree disadvantages is their complexity. Decision trees are easy to use compared to other decision-making models, but preparing decision trees, especially large ones with many branches, is a complex and time-consuming affair.

Computing probabilities of different possible branches, determining the best split of each node, and selecting
optimal combining weights to prune algorithms contained in the decision tree are complicated tasks that require
much expertise and experience.

Difficult to interpret if Tree is Large


Decision trees are relatively easy to understand when there are few decisions and outcomes included in the
tree. Large trees that include dozens of decision nodes (spots where new decisions are made) can be
convoluted and may have limited value. The more decisions there are in a tree, the less accurate any expected
outcomes are likely to be.

Tree structure prone to sampling error -


While decision trees are generally robust to outliers, due to their tendency to overfit they are prone to sampling errors. If the sampled training data is somewhat different from the evaluation or scoring data, then decision trees tend not to produce great results.

Tree splitting is locally greedy -


At each level, the tree looks for the binary split that reduces impurity by the maximum amount. This is a greedy algorithm and achieves a local optimum. It may be possible, for example, to accept less than the maximum drop in impurity at the current level so as to achieve the lowest possible impurity for the final tree, but the tree-splitting algorithm cannot see beyond the current level. This means that the decision tree built is typically locally optimal, not globally optimal or best.

Suffers from overfitting -

This applies to ID3, CART, C4.5 and C5.0 alike.


Pruning (decision trees)

Decision trees are tree data structures that are generated using learning algorithms for the purpose of classification and regression.
One of the most common problems when learning a decision tree is determining the optimal size of the resulting tree, the size that leads to better real-world accuracy of the model. A tree that has too many branches and layers can result in overfitting of the training data. The performance of a tree can be further improved by pruning, which involves removing the branches that make use of features having low importance. This way we reduce the complexity of the tree, and thus increase its predictive power by reducing overfitting.

Overfitting is a significant practical difficulty for decision tree models


and many other predictive models. Overfitting happens when the
learning algorithm continues to develop hypotheses that reduce
training set error at the cost of an increased test set error.
Pruning a decision tree helps to prevent overfitting the training data so that our
model generalizes well to unseen data. Pruning a decision tree means to remove a
subtree that is redundant and not a useful split and replace it with a leaf node.

Decision tree pruning can be divided into two types: pre-pruning and post-
pruning.
 Pre- pruning
 Post-pruning

I will consider 2 pruning strategies,

 Minimum error. The tree is pruned back to the point where the cross-validated error is a minimum. Cross-validation is the process of building a tree with most of the data and then using the remaining part of the data to test the accuracy
of the tree.

 Smallest tree. The tree is pruned back slightly further than the minimum error. Technically the pruning creates a tree with cross-validation error within 1 standard error of the minimum error. The smaller tree is more intelligible at the
cost of a small increase in error.

Pre-pruning stops growing the tree earlier, before it perfectly classifies the training set (the decision tree construction process is stopped during training). Also known as the Early Stopping Rule, it is the method where subtree construction is halted at a particular node after the evaluation of some measure.

o (prevent overfitting) stop the tree-building process early, before it


produces leaves with very small samples. This heuristic is known
as early stopping.
o At each stage of splitting the tree, we check the cross-validation
error. If the error does not decrease significantly enough then
we stop.

o These measures can be the Gini Impurity or the Information


Gain. In pre-pruning, we evaluate the pruning condition based on
the above measures at each node. Examples of pruning
conditions include informationGain(Attr)> minGain or treeDepth
== MaxDepth. If the condition is satisfied, we prune the subtree.
That means we replace the decision node with a leaf node.
Otherwise, we continue building the tree using our decision tree
algorithm.

o Early stopping may underfit by stopping too early. The current split may be of little benefit by itself, but having made it, subsequent splits could reduce the error much more significantly.

o Pre-pruning has the advantage of being faster and more efficient


as it avoids generating overly complex subtrees which overfit the
training data. However, in pre-pruning, the growth of the tree is
stopped prematurely by our stopping condition.

Some pre-pruning methods are:


1. Max/fixed depth of tree
2. IG(Children) < IG(Parent)
When a leaf node is a pure node -
If a leaf node happens to be a pure node at any stage, then no further downstream tree is grown from that node. The tree can continue to be grown from other leaf nodes.

When a leaf node has very few observations left (min number of observations in a node) -
This ensures that we terminate the tree when the reliability of further splitting the node becomes suspect due to small sample size. The Central Limit Theorem tells us that when observations are mutually independent, about 30 observations constitute a large sample. This can be a rough guide, though usually this user-input parameter should be higher than 30, say 50 or 100 or more, because we typically work with multi-dimensional observations and the observations could be correlated.

When the decrease in impurity of the tree is very small -


This user-input parameter leads to termination of the tree when the impurity drops by a very small amount, say 0.001 or less.

When a sufficient number of leaves has been created (max/fixed number of leaves) -


One method of stopping the growth of the tree is to reach a desired number of leaves, a user-input parameter, and then stop. Personally, I find this not to be a very good criterion, simply because the growth of the tree is unbalanced: when the stopping condition is met, some branches will have nodes with very few observations while others have very many. Also, while it is possible to decide what a small sample size is, or what a small change in impurity is, it is not usually possible to know what a reasonable number of leaves is for a given dataset and business context.

When cross-validation impurity starts to increase -


This is one of the more complex methods, but it is likely to be more robust as it doesn't require any assumption about user input. The training data is split into train and cross-validation sets, in, say, a 70%-30% proportion. The tree is grown on the train data by computing the impurity of the tree and splitting wherever a decrease in impurity is observed. A similar tree is replicated on the cross-validation data. Since we are growing the tree on the train data, its impurity will always decrease, by the very definition of the process. However, at some point the impurity of the cross-validation tree will increase for the same split. This is the point where we can stop growing the tree, since the divergence in error (impurity) signals the start of overfitting.

Several of the criteria above map directly onto hyperparameters of common decision-tree implementations, as sketched below.
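For instance, here is a hedged sketch using scikit-learn (assuming it is installed; X and y are placeholders for your training features and labels). Its DecisionTreeClassifier exposes pre-pruning controls that correspond to several of the criteria listed above:

from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the pre-pruning criteria listed above
clf = DecisionTreeClassifier(
    max_depth=5,                  # max/fixed depth of tree
    max_leaf_nodes=20,            # max/fixed number of leaves
    min_samples_leaf=30,          # minimum observations required in a leaf
    min_impurity_decrease=0.001,  # stop when the impurity drop is very small
    criterion="gini",             # or "entropy"
)
# clf.fit(X, y)                   # X, y: training features and labels (placeholders)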
Post-pruning that allows the tree to perfectly classify the training set, and
then post prune the tree. You grow the tree entirely using your decision tree
algorithm and then you prune the subtrees in the tree in a bottom-up
fashion. You start from the bottom decision node and, based on measures
such as Gini Impurity or Information Gain, you decide whether to keep this
decision node or replace it with a leaf node. For example, say we want to
prune out subtrees that result in least information gain. When deciding the
leaf node, we want to know what leaf our decision tree algorithm would
have created if it didn’t create this decision node.
The important step of tree pruning is to define a criterion be used to
determine the correct final tree size.

1-Reduced Error Pruning (REP)


One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with its
most popular class. If the prediction accuracy is not affected then the change is kept. While somewhat naive, reduced
error pruning has the advantage of simplicity and speed.
REP belongs to the Post-Pruning category. In REP, pruning is performed with the help of
a validation set (Use a distinct dataset from the training set (called validation set), to evaluate the effect of
post-pruning nodes from the tree; In this approach, the available data are separated into two sets of examples: a
training set, which is used to build the decision tree, and a validation set, which is used to evaluate the impact of
pruning the tree.). In REP, all nodes are evaluated for pruning in a bottom up fashion. A node is
pruned if the resulting pruned tree performs no worse than the original tree on the
validation set. The subtree at the node is replaced with a leaf node which is assigned the
most common class.

2-Cost complexity pruning


Cost-complexity pruning also falls under the post-pruning category. Cost-complexity
pruning works by calculating a Tree Score based on Residual Sum of Squares (RSS) for the
subtree, and a Tree Complexity Penalty that is a function of the number of leaves in the
subtree. The Tree Complexity Penalty compensates for the difference in the number of
leaves. Numerically, Tree Score is defined as follows:

Method A:
TreeScore=RSS+aT
a (alpha) is a hyperparameter we find using cross-validation, and T is the number of leaves in
the subtree.

We calculate the Tree Score for all subtrees in the decision tree, and then pick the subtree
with the lowest Tree Score. However, we can observe from the equation that the value of
alpha determines the choice of the subtree. The value of alpha is found using cross-
validation. We repeat the above process for different values of alpha, which gives us a
sequence of trees. The value of alpha that on average gives the lowest Tree Score is the final
value of alpha. Finally, our pruned decision tree will be the tree corresponding to the final value of alpha.
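As a hedged illustration (not part of the original notes), scikit-learn implements this idea as minimal cost-complexity pruning via the ccp_alpha parameter and the cost_complexity_pruning_path method; the toy dataset and the simple validation split below stand in for full cross-validation:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate alphas are the effective alphas computed from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    # Grow a tree and prune it back with the given complexity penalty
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)    # validation accuracy in place of full cross-validation
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, pruned.get_n_leaves())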

Method B:
using one of the following methods:

1. Use a distinct dataset from the training set (called validation set), to evaluate the effect of post-pruning nodes from the tree.
2. Build the tree by using the training set, then apply a statistical test to estimate whether pruning or expanding a particular node is likely to produce an improvement beyond the training set.

o Error estimation

o Significance testing (e.g., Chi-square test)

The first method is the most common approach. In this approach, the available data are separated into two sets of examples: a training set, which is used to build the decision tree, and a validation set, which is used to evaluate the impact of pruning the tree. The second method is also a common approach. Here, we explain the error
estimation and Chi2 test.
3-Post-pruning using Error estimation
The error estimate for a sub-tree is the weighted sum of the error estimates of all its leaves. The error estimate (e) for a node is computed from the node's observed error rate and its size [formula not reproduced in these notes].

In the illustrated example, the error rate at the parent node is 0.46, and since the error rate for its children (0.51) increases with the split, we do not want to keep the children.

Post-pruning using the Chi2 test (categorical)


In the Chi2 test we construct the corresponding frequency table and calculate the Chi2 value and its probability.

      Bronze Silver Gold
Bad   4      1      4
Good  2      1      2

Chi2 = 0.21, probability = 0.90, degrees of freedom = 2

If we require the probability to be less than a limit (e.g., 0.05) before allowing a split, then since 0.90 is well above that limit we decide not to split the node.
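This can be verified with SciPy (assuming scipy is installed; chi2_contingency performs the same test on the 2x3 frequency table):

from scipy.stats import chi2_contingency

table = [[4, 1, 4],   # Bad
         [2, 1, 2]]   # Good
chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 2), round(p, 2), dof)   # about 0.21, 0.90, 2 -> not significant, so do not split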

4-Expected Error Pruning


 Compute the static error estimate e(s) (the approximate expected error) assuming that we prune at a particular node.
 Compute the approximate backed-up error from the children assuming we do not prune.
 If the expected (static) error is less than or equal to the backed-up error, prune.
How do we determine exactly which subtree to prune?

The decision rule about pruning below a node s is:


If the static error is less than or equal to the backed-up error, then prune, making s a leaf node and assigning it the most common classification of the training examples affiliated with that node.
Note: the error calculation and its interpretation differ a little from the earlier methods. Here,
Static error estimate e(s): the expected error if we prune at s (do not split further; the node becomes a leaf).
Backed-up error: the expected error if we keep the split (the weighted expected error of the children).
Conclusion: if keeping the split does not reduce the expected error, i.e. the static error is less than or equal to the backed-up error, we prune; otherwise we keep the split.

Static error estimate e(s): as used in the worked example below, e(s) = (N - n + k - 1)/(N + k), where N is the number of examples at the node, n is the number of examples in the majority class and k is the number of classes.

Backed-up error BackUpError(s) = sum over children c of P(c) * e(c), where P(c) is the ratio of the child's observations to the parent's.

An example:
Static error estimate e(s) = (6 - 4 + 2 - 1)/(6 + 2) = 0.375
with N = 6 (4 + 2, the number of examples in the node), n = 4 (the [4, 2] majority class count), k = 2 (binary classification).
Backed-up error = 5/6 * 0.429 + 1/6 * 0.333 = 0.413
Since the static error (0.375) is less than or equal to the backed-up error (0.413), we prune (the subtree is replaced by a leaf).
For another node: backed-up error = 2/3 * 0.5 + 1/3 * 0.333 = 0.444.

A short code check of these numbers follows.
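A small Python check of these expected-error numbers (the formula and helper names follow the reconstruction above, not the original author's code; the child class counts are inferred from the weights and error values in the example):

def static_error(N, n, k):
    # Expected error if the node is turned into a leaf (Laplace-style estimate)
    return (N - n + k - 1) / (N + k)

def backed_up_error(children):
    # children: list of (N, n) pairs per child; k = 2 classes here
    total = sum(N for N, _ in children)
    return sum(N / total * static_error(N, n, 2) for N, n in children)

print(static_error(6, 4, 2))              # 0.375
print(backed_up_error([(5, 3), (1, 1)]))  # ~0.413  (5/6*0.429 + 1/6*0.333)
print(backed_up_error([(2, 1), (1, 1)]))  # ~0.444  (2/3*0.5   + 1/3*0.333)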
Summary:
 Pruning is a technique in machine learning that reduces
the size of decision trees by removing sections of the tree
that provide little power to classify instances.
 Pruning reduces the complexity of the final classifier, and
hence improves predictive accuracy by the reduction
of overfitting.
 One of the questions that arises in a decision tree algorithm
is the optimal size of the final tree. A tree that is too large
risks overfitting the training data and poorly generalizing to
new samples. A small tree might not capture important
structural information about the sample space. However, it
is hard to tell when a tree algorithm should stop because it
is impossible to tell if the addition of a single extra node will
dramatically decrease error. This problem is known as
the horizon effect. A common strategy is to grow the tree
until each node contains a small number of instances then
use pruning to remove nodes that do not provide additional
information.
 Pruning should reduce the size of a learning tree without
reducing predictive accuracy as measured by a cross-
validation set. There are many techniques for tree pruning
that differ in the measurement that is used to optimize
performance.
2 main differences between
classification and regression trees

The first type is the one described by the many articles in this blog: Classification tree. This
is also referred to as a Decision tree by default. However there is another basic decision
tree in common use: Regression tree, which also works in a very similar fashion. This
article explores the main differences between them: when to use each, how they differ
and some cautions.

Difference 1: When to use classification vs regression tree

This might seem like a trivial issue - once you know the difference! Classification trees, as
the name implies are used to separate the dataset into classes belonging to the response
variable. Usually the response variable has two classes: Yes or No (1 or 0). If the target
variable has more than 2 categories, then a variant of the algorithm, called C4.5, is used.
For binary splits however, the standard CART procedure is used. Thus classification trees
are used when the response or target variable is categorical in nature.

Regression trees are needed when the response variable is numeric or continuous. For
example, the predicted price of a consumer good. Thus regression trees are applicable
for prediction type of problems as opposed to classification.

Keep in mind that in either case, the predictors or independent variables may be
categorical or numeric. It is the target variable that determines the type of decision tree
needed.

Difference 2: How they work

In a standard classification tree, the idea is to split the dataset based on the homogeneity of the data. Let's say, for example, we have two variables, age and weight, that predict whether a person is going to sign up for a gym membership or not. If our training data showed that 90% of the people who are older than 40 signed up, we split the data there and age becomes a top node in the tree. We can almost say that this split has made the data "90% pure". Rigorous measures of impurity, based on computing the proportion of the data that belongs to a class, such as entropy or the Gini index, are used to quantify the homogeneity in classification trees.
In a regression tree the idea is this: since the target variable does not have classes, we fit a
regression model to the target variable using each of the independent variables. Then for
each independent variable, the data is split at several split points. At each split point, the
"error" between the predicted value and the actual values is squared to get a "Sum of
Squared Errors (SSE)". The split point errors across the variables are compared and the
variable/point yielding the lowest SSE is chosen as the root node/split point. This process is
recursively continued.
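A brief sketch of this SSE-based split search for a regression tree (the toy data and helper names are our own, not from the original article):

def sse(values):
    # Sum of squared errors around the mean prediction of a node
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_sse_split(x, y):
    # Try a split between every pair of consecutive distinct x values
    best = None
    for t in sorted(set(x))[1:]:
        left  = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t)
    return best

# Toy example: square footage vs. selling price (illustrative numbers)
sqft  = [800, 950, 1100, 1400, 1700, 2100]
price = [100, 110, 120, 180, 200, 240]
print(best_sse_split(sqft, price))   # splitting at 1400 gives the lowest total SSE here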

We discussed a C4.5 classification tree (for more than 2 categories of target variable) here
which uses information gain to decide on which variable to split. In a corresponding
regression tree, standard deviation is used to make that decision in place of information
gain. More technical details are here. Regression trees, by virtue of using regression
models lose the one strength of standard decision trees: ability to handle highly non-linear
parameters. In such cases, it may be better to use the C4.5 type implementation.

This tree below summarizes at a high level the types of decision trees available!
Decision tree bagging (handling overfitting)
We train different models on different samples of the data and try to come up with an improved version of the classifier.
Decision trees have a tendency toward high variance, which hurts their generalization. Tree bagging is a technique that can help us solve this problem.
But how? Because bagging has a unique feature: sampling. It creates different samples out of the data with replacement. We then train a different model on each sample and, at the prediction stage, combine the results.

[Flow: Data -> Sample-1, Sample-2, ..., Sample-N -> Decision Tree-1, Decision Tree-2, ..., Decision Tree-N -> Average / combined prediction.]
Consider following dataset
SN RED GREEN BLUE CLASS
1 0.95 0.3 0.63 0
2 0.83 0.4 0.61 0
3 0.75 0.25 0.59 0
4 0.63 0.19 0.39 0
5 0.65 0.45 0.46 1
6 0.53 0.3 0.19 1
7 0.32 0.5 0.35 1
8 0.77 0.55 0.41 1
Table-1: Main dataset

Sample-1
SN RED GREEN BLUE CLASS
1 0.95 0.3 0.63 0
2 0.83 0.4 0.61 0
8 0.77 0.55 0.41 1

Sample-2
SN RED GREEN BLUE CLASS
2 0.83 0.4 0.61 0
3 0.75 0.25 0.59 0
4 0.63 0.19 0.39 0

Sample-3
SN RED GREEN BLUE CLASS
2 0.83 0.4 0.61 0
5 0.65 0.45 0.46 1
6 0.53 0.3 0.19 1
Properties (key points):
 Every sample is unique.
 Every sample is created from the same dataset.
 Every sample has a fixed number of observations.
 Every sample has different statistical properties.
 Every sample has the same features.
 Decision trees learnt over different samples therefore have different learning statistics.
A minimal bagging sketch using the Table-1 data is shown below.
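This is a minimal bagging sketch in Python (our own helper names; scikit-learn's DecisionTreeClassifier is assumed to be available, and a majority vote combines the trees):

import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=10, seed=0):
    # Train one tree per bootstrap sample (rows sampled with replacement)
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        Xs, ys = [X[i] for i in idx], [y[i] for i in idx]
        trees.append(DecisionTreeClassifier().fit(Xs, ys))
    return trees

def bagging_predict(trees, x):
    # Majority vote over the individual trees' predictions
    votes = [t.predict([x])[0] for t in trees]
    return Counter(votes).most_common(1)[0][0]

# The RGB dataset from Table-1 above
X = [[0.95, 0.3, 0.63], [0.83, 0.4, 0.61], [0.75, 0.25, 0.59], [0.63, 0.19, 0.39],
     [0.65, 0.45, 0.46], [0.53, 0.3, 0.19], [0.32, 0.5, 0.35], [0.77, 0.55, 0.41]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

trees = bagging_fit(X, y, n_trees=5)
print(bagging_predict(trees, [0.6, 0.4, 0.3]))   # combined (voted) class for a new point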

From bagging to random forest


Bagging is nothing but creating multiple samples out of our dataset and training a different tree on each. Its primary purpose is to handle overfitting, provided that the samples are not correlated. If the samples are dissimilar enough, the resulting trees will be more diversified. But in practice these samples are correlated, and that can leave the ensemble with high variance.
So what can we do next to make the trees less correlated?
We can build each subsample with a different subset of the features.

Example: suppose we have 6 features in the dataset; we can create subsamples with two features each (around the square root of the total number of features), as shown below:

SN RED GREEN BLUE H S V CLASS


1 0.95 0.3 0.63 0.63 0.19 0.39 0
2 0.83 0.4 0.61 0.83 0.4 0.61 0
3 0.75 0.25 0.59 0.53 0.3 0.19 0
4 0.63 0.19 0.39 0.32 0.5 0.35 0
5 0.65 0.45 0.46 0.95 0.3 0.63 1
6 0.53 0.3 0.19 0.83 0.4 0.61 1
7 0.32 0.5 0.35 0.75 0.25 0.59 1
8 0.77 0.55 0.41 0.63 0.19 0.39 1
Table-2 Main dataset

Sample-1
SN RED GREEN CLASS
1 0.95 0.3 0
2 0.83 0.4 0
6 0.53 0.3 1

Sample-2
SN H S CLASS
1 0.63 0.19 0
3 0.53 0.3 0
4 0.32 0.5 0

Sample-3
SN BLUE H CLASS
1 0.63 0.63 0
6 0.19 0.83 1
8 0.41 0.63 1

Each of our classifiers will learn from different instances (with less overlap) and different features, which yields less correlated trees and a better-generalizing combined predictor.
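A hedged sketch of this per-tree feature subsampling follows (our own helper names; in practice scikit-learn's RandomForestClassifier with max_features="sqrt" implements the same idea together with bootstrap row sampling):

import math
import random
from sklearn.tree import DecisionTreeClassifier

def random_subspace_fit(X, y, n_trees=10, seed=0):
    # For each tree, pick about sqrt(d) features at random plus a bootstrap sample of rows
    rng = random.Random(seed)
    d = len(X[0])
    m = max(1, round(math.sqrt(d)))          # e.g. 6 features -> 2 per tree
    models = []
    for _ in range(n_trees):
        feats = rng.sample(range(d), m)
        rows = [rng.randrange(len(X)) for _ in range(len(X))]
        Xs = [[X[i][j] for j in feats] for i in rows]
        ys = [y[i] for i in rows]
        models.append((feats, DecisionTreeClassifier().fit(Xs, ys)))
    return models

def predict(models, x):
    # Majority vote, each tree seeing only its own feature subset
    votes = [t.predict([[x[j] for j in feats]])[0] for feats, t in models]
    return max(set(votes), key=votes.count)

Applied to the 6-feature Table-2 data, each tree sees a different pair of columns, mirroring Sample-1 to Sample-3 above.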
