CHTKT - DataScience - Chapter03 - Machine Learning With Python - 02
2023
Decision tree
Source: Internet
Instructor: TRAN THI TUAN ANH 4 / 34
3. CLASSIFICATION (cont)- DECISION TREE
Example decision tree: a candidate who has received a job offer and wants to
decide whether or not to accept it.
Classification error:
E_m = 1 − max_i(p_i)
where p_i represents the proportion of instances of class i in the node.
A lower classification error suggests a more pure or homogeneous leaf
node. Example: if you have a leaf node with
Class A: 16 obs
Class B: 13 obs
Class C: 1 obs
What is the classification error of this node?
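The question above can be answered directly in Python; this is a minimal sketch (the `classification_error` helper name is illustrative, not from the lecture):

```python
def classification_error(counts):
    """Classification error E_m = 1 - max(p_i) for one node.

    `counts` holds the number of observations of each class in the node.
    """
    n = sum(counts)
    return 1 - max(counts) / n

# Node with Class A: 16 obs, Class B: 13 obs, Class C: 1 obs
print(classification_error([16, 13, 1]))  # 1 - 16/30 = 14/30 ≈ 0.4667
```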
Gini impurity
Gini = Σ_{i=1}^{K} p_i (1 − p_i) = 1 − Σ_{i=1}^{K} p_i^2
where
p_i represents the proportion of instances of class i in the node.
1 − p_i is the probability of selecting an element not from class i.
A lower Gini impurity suggests a more pure or homogeneous leaf node.
Example: if you have a leaf node with
Class A: 16 obs
Class B: 13 obs
Class C: 1 obs
What is the Gini impurity of this node?
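The same node can be scored with the Gini formula; a minimal sketch (the `gini` helper name is illustrative):

```python
def gini(counts):
    """Gini impurity = 1 - sum(p_i^2) for one node.

    `counts` holds the number of observations of each class in the node.
    """
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Node with Class A: 16 obs, Class B: 13 obs, Class C: 1 obs
print(gini([16, 13, 1]))  # 1 - 426/900 = 474/900 ≈ 0.5267
```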
Entropy
Entropy = − Σ_{i=1}^{K} p_i log_2(p_i)
where
p_i represents the proportion of instances of class i in the node.
A lower entropy value indicates a more pure or homogeneous node.
Example: if you have a leaf node with
Class A: 16 obs
Class B: 13 obs
Class C: 1 obs
What is the Entropy of this node?
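As before, the answer can be computed in a few lines of Python (the `entropy` helper name is illustrative; terms with a zero count are skipped since 0·log 0 is taken as 0):

```python
from math import log2

def entropy(counts):
    """Entropy = -sum(p_i * log2(p_i)) for one node.

    `counts` holds the number of observations of each class in the node.
    """
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

# Node with Class A: 16 obs, Class B: 13 obs, Class C: 1 obs
print(entropy([16, 13, 1]))  # ≈ 1.1700
```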
Information gain
The information gain is based on the decrease in entropy after a
dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns
the highest information gain.
Note: More uncertainty, more entropy!
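The gain of a candidate split can be sketched as parent entropy minus the size-weighted entropy of the child subsets; a pure-Python illustration (the `information_gain` helper name is illustrative, not from the lecture):

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(parent, subsets):
    """Decrease in entropy after splitting `parent` into `subsets`."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# A perfect split of an evenly mixed node removes all uncertainty:
parent = ["yes"] * 8 + ["no"] * 8
print(information_gain(parent, [["yes"] * 8, ["no"] * 8]))  # 1.0
```

A useless split (children with the same class mix as the parent) would give a gain of 0, which is why tree construction prefers the attribute with the highest gain.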
When to stop?
when all records in current data subset have the same output
or all records have exactly the same set of input attributes
or set a minimum number of observations on each leaf
or set a maximum depth, i.e. the length of the longest path from the root to a leaf
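The four stopping rules above can be combined into a single check; a minimal sketch (the `should_stop` name and the default thresholds are illustrative, not from the lecture):

```python
def should_stop(records, depth, min_leaf=5, max_depth=4):
    """Return True when node growth should stop.

    `records` is a list of (features, label) pairs; `min_leaf` and
    `max_depth` are illustrative default thresholds.
    """
    labels = {y for _, y in records}
    if len(labels) == 1:                  # all records share the same output
        return True
    inputs = {tuple(x) for x, _ in records}
    if len(inputs) == 1:                  # all records have identical input attributes
        return True
    if len(records) <= min_leaf:          # minimum observations per leaf reached
        return True
    if depth >= max_depth:                # maximum depth (longest path) reached
        return True
    return False

# A pure node stops; a mixed, large, shallow node keeps splitting:
print(should_stop([((0,), "A"), ((1,), "A")], depth=0))  # True
```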
THE END