Decision Tree
We can identify the target people by building a decision tree from our dataset.
Decision Tree:
A decision tree is one of the most powerful and popular tools for classification and prediction (regression).
A decision tree is a flowchart-like tree structure:
Each internal node (split node, decision node) denotes a test on an attribute,
Each branch represents an outcome of the test, and
Each leaf node (terminal node) holds a class label.
Background:
Growing a tree involves deciding which feature to split on (the splitting feature), what conditions to use for splitting, and when to stop (the stopping criteria).
The key quantities used for these decisions are:
Entropy of classes
Entropy of attributes
Information gain
Entropy is the degree of randomness or uncertainty; in our case, the degree of class variance. In other words, entropy is also a measure of impurity: a dataset is called pure if all its instances have the same value of the class attribute, and for a pure dataset the entropy is zero (or close to zero). Information gain, also known as mutual information, is a measure used to select the most informative attribute, i.e. the attribute that helps us reduce entropy the most.
Entropy of classes (binary)
E = -F_c1 · log2(F_c1) - F_c2 · log2(F_c2)
E = -F_Yes · log2(F_Yes) - F_No · log2(F_No)
where F_c1 and F_c2 are the fractions of the two classes (for example YES and NO) in the class attribute.
Entropy of a particular attribute value v (e.g. v = Low, Med, High for salary)
E_v = -F_c1,v · log2(F_c1,v) - F_c2,v · log2(F_c2,v)
E_Low = -F_Yes,Low · log2(F_Yes,Low) - F_No,Low · log2(F_No,Low)
where F_c1,v and F_c2,v are the fractions of YES and NO among the records having the selected attribute value.
Information gain of an attribute A:
IG(A) = E(class) - Σ_v (n_v / n) · E_v
where n_v is the number of records with value v of attribute A and n is the total number of records.
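For illustration (not part of the original notes), a minimal Python sketch of the entropy formula above:

```python
import math

def entropy(counts):
    """Entropy of a node from its class counts, e.g. entropy([6, 4])."""
    n = sum(counts)
    # 0 * log2(0) is treated as 0, so empty classes are skipped
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([6, 4]))   # ~0.971 (used in the worked example below)
print(entropy([5, 5]))   # 1.0 -> maximum impurity for a 50/50 node
```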
Let us understand all of this through an example.
SN Name Sex Salary Marital st. Investor
1 P1 M Low Un N
2 P2 M Med Un N
3 P3 M Med Ma Y
4 P4 F Med Ma N
5 P5 M Med Ma Y
6 P6 F High Un Y
7 P7 F Low Un N
8 P8 M High Un Y
9 P9 F Med Un Y
10 P10 M Low Ma Y
Entropy of class
Number of "Yes" in the dataset: 6
Number of "No" in the dataset: 4
Total number of observations: 10
E(class) = E(6,4) = -(6/10)·log2(6/10) - (4/10)·log2(4/10) ≈ 0.971
Salary = Low (Yes = 1, No = 2):
E_Low = E(1,2)
E_Low = -F_Yes,Low · log2(F_Yes,Low) - F_No,Low · log2(F_No,Low)
E_Low = -(1/3)·log2(1/3) - (2/3)·log2(2/3)
E_Low ≈ 0.918
Salary = Med (Yes = 3, No = 2):
E_Med = E(3,2) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) ≈ 0.971
Salary = High (Yes = 2, No = 0):
E_High = E(2,0) = -(2/2)·log2(2/2) - (0/2)·log2(0/2) = 0   (taking 0·log2(0) = 0)
Summary
IG(Salary) = E(class) - [ (3/10)·E_Low + (5/10)·E_Med + (2/10)·E_High ] = 0.971 - [0.275 + 0.486 + 0] ≈ 0.210
Computed in the same way for the other attributes:
Attribute          Information Gain
Salary             0.210
Sex                0.020
Marital Status     0.046
Salary has the highest information gain, so it is chosen as the root node.
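To double-check these numbers, here is a small, self-contained Python sketch (my own illustration) that recomputes the information gain of each attribute from the 10-record dataset above:

```python
import math

# (Sex, Salary, Marital, Investor) for the 10 records above
data = [("M", "Low", "Un", "N"), ("M", "Med", "Un", "N"), ("M", "Med", "Ma", "Y"),
        ("F", "Med", "Ma", "N"), ("M", "Med", "Ma", "Y"), ("F", "High", "Un", "Y"),
        ("F", "Low", "Un", "N"), ("M", "High", "Un", "Y"), ("F", "Med", "Un", "Y"),
        ("M", "Low", "Ma", "Y")]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(col):
    labels = [r[3] for r in data]
    gain = entropy(labels)
    for v in set(r[col] for r in data):
        subset = [r[3] for r in data if r[col] == v]
        gain -= len(subset) / len(data) * entropy(subset)
    return gain

for name, col in [("Salary", 1), ("Sex", 0), ("Marital Status", 2)]:
    print(name, round(info_gain(col), 3))   # ~0.21, ~0.02, ~0.046
```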
[Tree after the first split: root node Salary; the High branch is already a leaf labelled Yes.]
Observations:
The "High" branch has already reached its decision: class "Yes".
How? A person with a high salary is an investor, no matter what his/her marital status or gender is. Ex.:
SN Name Sex Salary Marital st. Investor
6 P6 F High Un Y
8 P8 M High Un Y
[Tree after splitting the Low branch on Marital st.: Salary = High → Yes; Salary = Low and Married → Yes; Salary = Low and Unmarried → No.]
Observations:
The "Married" and "Unmarried" branches have already reached their decisions: class "Yes" and class "No" respectively. How? A person with a low salary who is married will invest, and one who is unmarried will not invest, no matter what his/her sex is. Ex.:
SN Name Sex Salary Marital st. Investor
1 P1 M Low Un N
7 P7 F Low Un N
10 P10 M Low Ma Y
This is a case of a "pure dataset", i.e. each resulting sub-dataset contains only one class.
Summary (for the Salary = Med sub-dataset)
Attribute          Information Gain
Sex                0.020
Marital Status     0.020
The two attributes tie; Sex is chosen for the next split.
[Final tree: root node Salary.
Salary = High → Yes.
Salary = Low → Marital st.: Unmarried → No, Married → Yes.
Salary = Med → Sex: M → Marital st. (Unmarried → No, Married → Yes); F → Marital st. (Married → No, Unmarried → Yes).]
Stopping criteria (see the sketch after this list)
1. Max/fixed number of leaves
2. Max/fixed depth of the tree
3. Min number of observations in a node
4. Pure node (every element in the subset belongs to the same class; in that case the node is turned into a leaf node and labelled with the class of the examples).
5. There are no more attributes to be selected, but the examples still do not all belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset.
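The first three criteria map directly onto hyperparameters in common implementations. A minimal scikit-learn sketch (assuming scikit-learn is installed; the data is a hypothetical, numerically encoded toy set, not the investor dataset):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, numerically encoded toy data (placeholders)
X = [[0, 1], [1, 0], [1, 1], [0, 0], [2, 1], [2, 0]]
y = [0, 0, 1, 0, 1, 1]

clf = DecisionTreeClassifier(
    criterion="entropy",   # information-gain style splitting
    max_leaf_nodes=8,      # criterion 1: max/fixed number of leaves
    max_depth=3,           # criterion 2: max/fixed depth of the tree
    min_samples_leaf=1,    # criterion 3: min observations per leaf (set higher in practice)
)
clf.fit(X, y)
print(clf.predict([[2, 1]]))   # pure nodes (criterion 4) are handled automatically
```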
For example, we may be interested in predicting who will or will not graduate from college, or who will or will not renew a subscription. These would be examples of simple binary
classification problems, where the categorical dependent variable can only assume two distinct and mutually exclusive values. In other cases, we might be interested in predicting which
one of multiple different alternative consumer products (e.g., makes of cars) a person decides to purchase, or which type of failure occurs with different types of engines. In those cases
there are multiple categories or classes for the categorical dependent variable.
CLASSIFICATION TREES
(Solve classification-type problems)
Classification tree methods (i.e., decision tree methods such as ID3) are recommended when the data mining task involves classification or prediction of outcomes (categories). A classification tree labels records and assigns them to discrete classes.
REGRESSION TREES
(Solve regression-type problems)
A regression tree works in a very similar fashion, but its output is a continuous numerical value. Regression trees are needed when the response variable is numeric or continuous, for example the predicted price of a consumer good. Thus regression trees are applicable to prediction-type problems, as opposed to classification.
CLASSIFICATION AND REGRESSION TREES (C&RT)
(Solve both classification and regression type problems)
The CART, or Classification & Regression Trees, methodology was introduced in 1984 by Leo Breiman. This algorithm is a cornerstone of ensemble methods such as bagging and boosting. The representation for the CART model is a binary tree (a parent node has a maximum of two branches / child nodes).
Gini index of a node t:
GINI(t) = 1 - Σ_i [p(i | t)]^2
where p(i | t) is the fraction of records of class i at node t. The Gini index is 0 for a pure node and reaches its maximum, 1 - 1/k, when the k classes are equally distributed.
Squaring the terms gives more weight to larger proportions (similar to how a sum of squared errors emphasises large errors more than a sum of absolute errors).
GINI of a split
GINI(s, t) = P_L · GINI(t_L) + P_R · GINI(t_R)
           = weighted_cost(left) + weighted_cost(right)
where
s : the split
t : the node being split
P_L : proportion of observations in the left node after split s
GINI(t_L) : Gini of the left node after split s
P_R : proportion of observations in the right node after split s
GINI(t_R) : Gini of the right node after split s
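A minimal Python sketch (my own illustration, not from the notes) of the Gini index of a node and the weighted Gini of a split; the 12/12 split used below is a hypothetical example:

```python
def gini(counts):
    """Gini index of a node from its class counts: 1 - sum of squared class fractions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    """Weighted Gini of a binary split: P_L * GINI(t_L) + P_R * GINI(t_R)."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return n_left / n * gini(left_counts) + n_right / n * gini(right_counts)

print(gini([12, 12]))              # 0.5 -> the 24-record parent node in the example below
print(gini_split([8, 4], [4, 8]))  # ~0.444 for a hypothetical 12/12 split
```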
Example
We have an example in which the input (parent) node has an equal number of target-variable values, "Yes" and "No". The overall number of observations is 24.
The Gender variable is considered to split the node. The Gini value of the parent node is calculated as below:
GINI(parent) = 1 - (1/2)^2 - (1/2)^2 = 1 - 0.25 - 0.25 = 0.5
Now we want to split the node based on the Gender variable. After the split we will have a summary of the "Yes"/"No" counts in each gender node; we then weight and sum the Gini of each node by the proportion of the data it takes up to get the GINI of the split.
As another example, consider the following dataset, in which we look for the best split on the numeric attribute A1:
SR A1 A2 Class
1 3.2 1.5 0
2 1.3 1.2 0
3 3.7 2.8 0
4 2.9 2.4 0
5 3.9 1.9 0
6 7.5 3.5 1
7 9.0 3.2 1
8 7.4 0.9 1
9 9.5 4.2 1
10 7.3 3.5 1
The attribute value with the lowest weighted Gini index is chosen as the split for the node. Here, splitting on the value 3.9 of A1 gives the lowest Gini index: A1 <= 3.9 separates the two classes perfectly, so the weighted Gini of the split is 0.
[Split node: A1, with branches A1 <= 3.9 and A1 > 3.9.]
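A short Python sketch (my own illustration, using the "<=" split convention described above) that scans the observed A1 values as candidate thresholds and reports the best split:

```python
A1  = [3.2, 1.3, 3.7, 2.9, 3.9, 7.5, 9.0, 7.4, 9.5, 7.3]
cls = [0,   0,   0,   0,   0,   1,   1,   1,   1,   1]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

best = None
for t in sorted(set(A1)):                     # candidate thresholds = observed values
    left  = [c for a, c in zip(A1, cls) if a <= t]
    right = [c for a, c in zip(A1, cls) if a > t]
    if not left or not right:
        continue
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(cls)
    if best is None or weighted < best[1]:
        best = (t, weighted)

print(best)   # (3.9, 0.0): A1 <= 3.9 separates the two classes perfectly
```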
Why can ordinal attribute values be grouped, as long as the grouping does not violate the order property of the attribute values?
Ans: Preserving the order may or may not be useful; it depends on the relation of the feature to the class, but mostly it is useful. My advice, therefore: as part of feature engineering, decide whether to use an ordered or an unordered factor. Use an ordered factor only if it is highly correlated with the output variable; otherwise fall back to an unordered factor.
We usually split ordered values into groups that do not contradict the order. A group can be {small, medium} versus {large}, but it cannot contain small and large while excluding medium, because there is a sequence in the data. If there were no such ordering, we could form any combination of attribute values. Suppose an attribute describes a fruit and can be apple, pineapple or watermelon; since there is no ordering, all possible combinations for binary splits are allowed.
Contradiction: It may be useless, since for example a T-shirt factory can decide to print red t-shirts in sizes Small and Large and blue t-shirts in sizes Medium and Extra-Large. Since we don't know the model that generates the data, how can we infer that it is "better" to preserve the order in the splits of an ordinal attribute? There is no inherent advantage in maintaining the order of an attribute's splits. Ans: Feature construction is one of the important tasks of the modeller; it is up to you to decide whether to represent a categorical variable as ordered or unordered.
Ex.2
Let's assume we have 3 classes and 80 objects: 19 objects in class 1, 21 objects in class 2, and 40 objects in class 3 (denoted as (19, 21, 40)).
The Gini index would be: 1 - [(19/80)^2 + (21/80)^2 + (40/80)^2] = 0.6247, i.e. cost_before = Gini(19, 21, 40) = 0.6247.
In order to decide where to split, we test all possible splits. For example, splitting at x1 < 2.0623 results in the split (16, 9, 0) and (3, 12, 40).
After testing x1 < 2.0623:
cost_L = Gini(16, 9, 0) = 0.4608
cost_R = Gini(3, 12, 40) = 0.4205
Then we weight branch impurity by the empirical branch probabilities:
cost_{x1 < 2.0623} = (25/80)·cost_L + (55/80)·cost_R = 0.4331
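These numbers can be verified with a small multi-class Gini function (my own sketch, not part of the example):

```python
def gini(counts):
    """Multi-class Gini index from class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

cost_before = gini([19, 21, 40])                        # 0.6247
cost_left   = gini([16, 9, 0])                          # 0.4608
cost_right  = gini([3, 12, 40])                         # 0.4205
cost_split  = 25 / 80 * cost_left + 55 / 80 * cost_right
print(round(cost_before, 4), round(cost_left, 4),
      round(cost_right, 4), round(cost_split, 4))       # 0.6247 0.4608 0.4205 0.4331
```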
Ex.3
Class  Var1  Var2
A      0     33
A      0     54
A      0     56
A      0     42
A      1     50
B      1     55
B      1     31
B      0     -4
B      1     77
B      0     49
We'll first try using the Gini index on a couple of values. Let's try Var1 == 1 and Var2 >= 32.
Gini Index Example: Var1 == 1
Baseline of Split: Var1 has 4 instances (4/10) where it is equal to 1 and 6 instances (6/10) where it is equal to 0.
For Var1 == 1 & Class == A: 1 / 4 instances have class equal to A.
For Var1 == 1 & Class == B: 3 / 4 instances have class equal to B.
o Gini Index here is 1-((1/4)^2 + (3/4)^2) = 0.375
For Var1 == 0 & Class== A: 4 / 6 instances have class equal to A.
For Var1 == 0 & Class == B: 2 / 6 instances have class equal to B.
o Gini Index here is 1-((4/6)^2 + (2/6)^2) = 0.4444
We then weight and sum each of the splits based on the baseline / proportion of the
data each split takes up.
o 4/10 * 0.375 + 6/10 * 0.4444 = 0.41667
Gini Index Example: Var2 >= 32
Baseline of Split: Var2 has 8 instances (8/10) where it is >= 32 and 2 instances (2/10) where it is less than 32.
For Var2 >= 32 & Class == A: 5 / 8 instances have class equal to A.
For Var2 >= 32 & Class == B: 3 / 8 instances have class equal to B.
o Gini Index here is 1-((5/8)^2 + (3/8)^2) = 0.46875
For Var2 < 32 & Class == A: 0 / 2 instances have class equal to A.
For Var2 < 32 & Class == B: 2 / 2 instances have class equal to B.
o Gini Index here is 1-((0/2)^2 + (2/2)^2) = 0
We then weight and sum each of the splits based on the baseline / proportion of the
data each split takes up.
o 8/10 * 0.46875 + 2/10 * 0 = 0.375
Based on these results, you would choose Var2>=32 as the split since its weighted Gini
Index is smallest.
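A quick check of both candidate splits in Python (my own sketch, using the class counts derived above):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# (A, B) class counts in each branch of the two candidate splits
var1_split = 4/10 * gini([1, 3]) + 6/10 * gini([4, 2])   # Var1 == 1 vs Var1 == 0
var2_split = 8/10 * gini([5, 3]) + 2/10 * gini([0, 2])   # Var2 >= 32 vs Var2 < 32
print(round(var1_split, 5), round(var2_split, 5))        # 0.41667 0.375 -> Var2 >= 32 wins
```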
Ex. 4
X y
-1.2 0
-3.2 0
2.1 1
1.5 1
Disadvantages
Instability
The reliability of the information in the decision tree depends on feeding it precise internal and external information at the outset. Even a small change in the input data can at times cause large changes in the tree. Changing variables, excluding duplicate information, or altering the sequence midway can lead to major changes and might require redrawing the tree.
Complexity
Computing the probabilities of the different possible branches, determining the best split for each node, and selecting optimal combining weights for pruning are complicated tasks that require much expertise and experience.
Decision trees are tree data structures that are generated using learning algorithms for the purpose of classification and regression.
One of the most common problems when learning a decision tree is finding the optimal size of the resulting tree, the size that leads to better real-world accuracy of the model. A tree that has too many branches and layers can overfit the training data. The performance of a tree can be further increased by pruning, which involves removing the branches that make use of features with low importance. This way we reduce the complexity of the tree, and thus increase its predictive power by reducing overfitting.
Decision tree pruning can be divided into two types: pre-pruning and post-pruning.
Post-pruning strategies:
Minimum error. The tree is pruned back to the point where the cross-validated error is a minimum. Cross-validation is the process of building a tree with most of the data and then using the remaining part of the data to test its accuracy.
Smallest tree. The tree is pruned back slightly further than the minimum error; technically, the pruning creates a tree whose cross-validation error is within 1 standard error of the minimum. The smaller tree is more intelligible, at the cost of a small increase in error.
Pre-pruning:
When the leaf node has very few observations left (min number of observations in a node).
This ensures that we stop growing the tree when the reliability of further splitting the node becomes suspect due to small sample size. The Central Limit Theorem tells us that, when observations are mutually independent, about 30 observations constitute a large sample. This can serve as a rough guide, though usually this user-input parameter should be higher than 30, say 50 or 100 or more, because we typically work with multi-dimensional observations and the observations could be correlated.
Method A: cost-complexity pruning
Tree Score = RSS + α·T
α (alpha) is a hyperparameter we find using cross-validation, and T is the number of leaves in the subtree.
We calculate the Tree Score for all subtrees of the decision tree and then pick the subtree with the lowest Tree Score. However, we can observe from the equation that the value of alpha determines the choice of subtree. The value of alpha is found using cross-validation: we repeat the above process for different values of alpha, which gives us a sequence of trees, and the value of alpha that on average gives the lowest Tree Score is the final value of alpha. Finally, our pruned decision tree is the tree corresponding to this final value of alpha (see the sketch below).
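For reference, a minimal scikit-learn sketch of cost-complexity pruning (assuming scikit-learn 0.22 or later; the iris dataset is used only as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate alpha values for cost-complexity ("weakest link") pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha whose pruned tree has the best cross-validated score
scores = [(a, cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                              X, y, cv=5).mean())
          for a in path.ccp_alphas if a >= 0.0]   # guard against tiny negative round-off
best_alpha = max(scores, key=lambda s: s[1])[0]
print(best_alpha)
```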
Method B:
Decide which nodes to prune using one of the following methods:
1. Use a distinct dataset from the training set (called a validation set) to evaluate the effect of post-pruning nodes from the tree.
2. Build the tree using the training set, then apply a statistical test to estimate whether pruning or expanding a particular node is likely to produce an improvement beyond the training set.
o Error estimation
o Significance test (e.g., the Chi2 test)
The first method is the most common approach. In this approach, the available data are separated into two sets of examples: a training set, which is used to build the decision tree, and a validation set, which is used to evaluate the impact of pruning the tree. The second method is also a common approach. Here, we explain error estimation and the Chi2 test.
Post-pruning using error estimation
The error estimate for a sub-tree is the weighted sum of the error estimates of all its leaves. The error estimate (e) for a node is the pessimistic (upper confidence bound) estimate used in C4.5:
e = (f + z^2/(2N) + z·sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)
where f is the error rate on the training data at the node, N is the number of training instances at the node, and z is the standard-normal quantile for the chosen confidence level.
The error rate at the parent node is 0.46 and since the error rate for its children (0.51) increases with the split,
we do not want to keep the children.
Post-pruning using the Chi2 test
Example: class counts in the three child nodes of a candidate split:
        Child 1  Child 2  Child 3
Bad       4        1        4
Good      2        1        2
Chi2 = 0.21, degrees of freedom = 2, probability (p-value) ≈ 0.90.
If we require the probability to be below a threshold (e.g., 0.05) for a split to be kept, then we decide not to split this node.
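As a quick check, a short sketch using scipy (assuming scipy is installed) reproduces the Chi2 value, the degrees of freedom and the probability:

```python
from scipy.stats import chi2_contingency

observed = [[4, 1, 4],    # "Bad" counts in the three child nodes
            [2, 1, 2]]    # "Good" counts in the three child nodes

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p, 2), dof)   # ~0.21, ~0.90, 2 -> the split is not significant
```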
The first type is the one described by many articles in this blog: the Classification tree. This is also what is referred to as a decision tree by default. However, there is another basic decision tree in common use: the Regression tree, which works in a very similar fashion. This article explores the main differences between them: when to use each, how they differ, and some cautions.
This might seem like a trivial issue - once you know the difference! Classification trees, as the name implies, are used to separate the dataset into classes belonging to the response variable. Usually the response variable has two classes: Yes or No (1 or 0). If the target
variable has more than 2 categories, then a variant of the algorithm, called C4.5, is used.
For binary splits however, the standard CART procedure is used. Thus classification trees
are used when the response or target variable is categorical in nature.
Regression trees are needed when the response variable is numeric or continuous, for example the predicted price of a consumer good. Thus regression trees are applicable to prediction-type problems, as opposed to classification.
Keep in mind that in either case, the predictors or independent variables may be
categorical or numeric. It is the target variable that determines the type of decision tree
needed.
In a standard classification tree, the idea is to split the dataset based on homogeneity of the data. Let's say, for example, we have two variables, age and weight, that predict whether a person is going to sign up for a gym membership or not. If our training data showed that 90% of the people older than 40 signed up, we would split the data there, and age would become a top node in the tree. We can almost say that this split has made the data "90% pure". Rigorous measures of impurity, based on computing the proportion of the data that belongs to each class, such as entropy or the Gini index, are used to quantify the homogeneity in classification trees.
In a regression tree the idea is this: since the target variable does not have classes, we fit a
regression model to the target variable using each of the independent variables. Then for
each independent variable, the data is split at several split points. At each split point, the
"error" between the predicted value and the actual values is squared to get a "Sum of
Squared Errors (SSE)". The split point errors across the variables are compared and the
variable/point yielding the lowest SSE is chosen as the root node/split point. This process is
recursively continued.
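A minimal Python sketch (my own illustration, with hypothetical one-feature data) of choosing a regression-tree split point by the lowest SSE:

```python
# Each branch predicts its mean; pick the split point with the lowest total SSE
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.3, 0.9, 4.8, 5.2, 5.0]

def sse(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

best = None
for t in xs[1:]:                                   # candidate split points
    left  = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    total = sse(left) + sse(right)
    if best is None or total < best[1]:
        best = (t, total)

print(best)   # ~(4.0, 0.16): splitting between x = 3 and x = 4 gives the lowest SSE
```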
We discussed a C4.5 classification tree (for more than 2 categories of the target variable) here, which uses information gain to decide which variable to split on. In a corresponding regression tree, standard deviation is used to make that decision in place of information gain. More technical details are here. Regression trees, by virtue of using regression models, lose one strength of standard decision trees: the ability to handle highly non-linear parameters. In such cases, it may be better to use the C4.5-type implementation.
[Figure: a tree summarizing, at a high level, the types of decision trees available.]
Decision tree bagging (handling overfitting)
We train different models on different samples of the data and combine them to come up with an improved classifier.
Decision trees have a tendency towards high variance, which leads to poor generalization. Tree bagging is a technique that can help us solve this problem.
But how? Bagging's key feature is its sampling: it creates different samples of the data with replacement. We then build a different model on each sample and, at the prediction stage, combine their results (see the sketch at the end of this section).
[Figure: bagging workflow - the data is resampled into several samples, a tree is trained on each, and their predictions are combined (averaged / voted) into one prediction.]
Consider the following dataset:
SN RED GREEN BLUE CLASS
1 0.95 0.3 0.63 0
2 0.83 0.4 0.61 0
3 0.75 0.25 0.59 0
4 0.63 0.19 0.39 0
5 0.65 0.45 0.46 1
6 0.53 0.3 0.19 1
7 0.32 0.5 0.35 1
8 0.77 0.55 0.41 1
Table-1: Main dataset
Sample-1
SN RED GREEN BLUE CLASS
1 0.95 0.3 0.63 0
2 0.83 0.4 0.61 0
8 0.77 0.55 0.41 1
Sample-2
SN RED GREEN BLUE CLASS
2 0.83 0.4 0.61 0
3 0.75 0.25 0.59 0
4 0.63 0.19 0.39 0
Sample-3
SN RED GREEN BLUE CLASS
2 0.83 0.4 0.61 0
5 0.65 0.45 0.46 1
6 0.53 0.3 0.19 1
Properties (key points):
Every sample is unique.
Every sample is created from the same dataset.
Every sample has a fixed number of observations.
Every sample has different statistical properties.
Every sample has the same features.
Decision trees learnt over different samples therefore have different learning statistics.
Sample-1
SN RED GREEN CLASS
1 0.95 0.3 0
2 0.83 0.4 0
6 0.53 0.3 1
Sample-2
SN H S CLASS
1 0.63 0.19 0
3 0.53 0.3 0
4 0.32 0.5 0
Sample-3
SN BLUE H CLASS
1 0.63 0.63 0
6 0.19 0.83 1
8 0.41 0.63 1
Each of our classifiers will learn from different instances (with less overlap) and different features, which yields less correlated but better-generalizing predictors.
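A minimal scikit-learn sketch of bagging (assuming scikit-learn is available), using the 8-record colour dataset from Table-1; the query point passed to predict is hypothetical:

```python
from sklearn.ensemble import BaggingClassifier

# The 8-record RED/GREEN/BLUE dataset from Table-1
X = [[0.95, 0.30, 0.63], [0.83, 0.40, 0.61], [0.75, 0.25, 0.59], [0.63, 0.19, 0.39],
     [0.65, 0.45, 0.46], [0.53, 0.30, 0.19], [0.32, 0.50, 0.35], [0.77, 0.55, 0.41]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# The default base estimator is a decision tree; each tree sees a bootstrap sample of the
# rows and a random subset of the columns, much like the hand-made samples above.
bag = BaggingClassifier(n_estimators=10, max_samples=0.75, max_features=2,
                        bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.predict([[0.70, 0.45, 0.40]]))   # the trees' predictions are combined by voting
```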