Decision Tree
It is a prediction model in which a series of tests is conducted and the answers are used to make a prediction.
A decision tree consists of a root node (or starting node), interior nodes, and leaf nodes (or terminating nodes) that are connected by branches.
The tree is kept shallow by testing the descriptive features that best discriminate between instances with different target feature values toward the top of the tree.
When a card is chosen at random from the pack shown in the figure, it is not known for sure which card will be picked, as every card is equally likely to be chosen: this corresponds to very high entropy (high impurity).
SHANNON’S ENTROPY MODEL
An attractive characteristic of this function is that the range of values taken by the binary logarithm of a probability, [−∞, 0], is much larger than the range taken by the probability itself, [0, 1].
Figure: (a) a graph illustrating how the value of a binary log (the log to the base 2) of a probability changes across the range of probability values; (b) the impact of multiplying these values by −1.
Shannon's model of entropy is a weighted sum of the logs of the probabilities of each possible outcome when we make a random selection from a set:
H(t) = - Σ P(t = i) × log2( P(t = i) ), summed over the possible outcomes i.
We use 2 as the base when we calculate entropy, which means that we measure entropy in bits.
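As a minimal sketch (not from the slides), this weighted sum can be computed directly in Python; the helper name `entropy` and the probability-list input are illustrative choices.

```python
import math

def entropy(probabilities):
    """Shannon entropy, in bits: H = -sum_i P(i) * log2(P(i)).

    Outcomes with zero probability contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin: two equally likely outcomes give exactly 1 bit of entropy.
print(entropy([0.5, 0.5]))  # 1.0
```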
ENTROPY CALCULATION: EXAMPLE 1
Entropy for a card being picked from a pack of 52 cards.
The probability of randomly selecting any specific card i, P(card = i), from a set of 52 cards is 1/52. The entropy is therefore
H(card) = - [ 52 × 1/52 log2(1/52) ] = log2(52) ≈ 5.70 bits
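A quick way to check this example (a sketch using only the standard library): 52 equally likely outcomes give log2(52) bits.

```python
import math

# 52 equally likely cards, each with probability 1/52
H_card = -sum((1 / 52) * math.log2(1 / 52) for _ in range(52))
print(round(H_card, 4))         # 5.7004
print(round(math.log2(52), 4))  # same value: entropy of n equally likely outcomes is log2(n)
```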
ENTROPY CALCULATION: EXAMPLE 2
Entropy for the suit of a card picked from a pack of 52 cards (4 different suits).
Calculating the entropy of a set of 52 playing cards if we only distinguish between cards based on their suit (heart, club, diamond or spade):
H(suit) = - [ 4 × 13/52 log2(13/52) ] = 2 bits
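The same check for the four suits (again only a standard-library sketch): the entropy drops to 2 bits because there are now only four equally likely outcomes.

```python
import math

# Four suits, 13 cards each: P(suit) = 13/52 = 1/4 for every suit
H_suit = -sum((13 / 52) * math.log2(13 / 52) for _ in range(4))
print(H_suit)  # 2.0 bits -- much lower entropy than distinguishing all 52 cards
```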
INFORMATION GAIN
We need to develop a formal model that captures these intuitions about the informativeness of features. Shannon's entropy model does this; the measure of informativeness that we will use is known as information gain.
COMPUTATION OF INFORMATION GAIN
Step 1: Compute the entropy of the original dataset with respect to the target feature.
Step 2: For each descriptive feature, create the sets that result from partitioning the instances in the dataset using that feature's values, and then sum the entropy scores of each of these sets, weighted by the fraction of instances in each set (the remaining entropy).
Step 3: Subtract the remaining entropy value (computed in Step 2) from the original entropy value (computed in Step 1) to give the information gain.
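These three steps can be sketched in Python as follows; the list-of-dicts dataset representation and the helper names `entropy`, `rem`, and `information_gain` are illustrative assumptions rather than anything prescribed by the slides.

```python
import math
from collections import Counter

def entropy(dataset, target):
    """Step 1: entropy of the dataset with respect to the target feature."""
    counts = Counter(row[target] for row in dataset)
    total = len(dataset)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rem(dataset, feature, target):
    """Step 2: weighted sum of the entropies of the partitions created by splitting on `feature`."""
    total = len(dataset)
    remaining = 0.0
    for level in {row[feature] for row in dataset}:
        partition = [row for row in dataset if row[feature] == level]
        remaining += (len(partition) / total) * entropy(partition, target)
    return remaining

def information_gain(dataset, feature, target):
    """Step 3: original entropy minus the remaining entropy."""
    return entropy(dataset, target) - rem(dataset, feature, target)
```

Evaluating `information_gain` for every descriptive feature and picking the largest value identifies the most informative feature to split on.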
COMPUTATION OF INFORMATION GAIN (CONTD.)
Step 1: Total entropy of the dataset with respect to the target feature.
Step 2: Remaining entropy for each descriptive feature, obtained by partitioning the dataset on that feature's values (e.g. UNKNOWN SENDER = true and UNKNOWN SENDER = false):
Computation of remaining entropy for the SUSPICIOUS WORDS feature
Computation of remaining entropy for the UNKNOWN SENDER feature
Computation of remaining entropy for the CONTAINS IMAGES feature
COMPUTATION OF INFORMATION GAIN (CONTD.)
Step 3: Information gain calculation for each descriptive feature.
CHOICE OF THE ROOT NODE
The feature that possesses the highest information gain is the right choice for the root node to start with. As the tree grows, the entropy model allows us to decide which test we should add to the sequence next.
ITERATIVE DICHOTOMIZER 3 (ID3) ALGORITHM (A DECISION TREE INDUCTION ALGORITHM)
Require: a set of descriptive features, d, and a set of training instances, D
1: if all the instances in D have the same target level C then
2: return a decision tree consisting of a leaf node with label C
3: else if d is empty then
4: return a decision tree consisting of a leaf node with the label of the majority target level in D
5: else if D is empty then
6: return a decision tree consisting of a leaf node with the label of the majority target level of the dataset of the immediate parent node
7: else
8: d[best] ← the feature in d with the highest information gain, arg max IG(d, D)
9: make a new node, Node d[best], and label it with d[best]
10: partition D using d[best]
11: remove d[best] from d
12: for each partition, grow a branch from Node d[best] by rerunning the algorithm on that partition
Lines 1-6 are the base cases: the algorithm stops growing the current path and adds a leaf node to the tree. Lines 7-12 extend the current path by adding an interior node to the tree and growing its branches iteratively.
In the ID3 algorithm, the base cases are the situations where we stop splitting the dataset and construct a leaf node with an associated target level.
There are two important things to remember when designing these base cases:
1. The dataset of training instances considered at each of the interior nodes in the tree is not the complete dataset.
2. Once a feature has been tested, it is not considered for selection again along that path in the tree.
The ID3 algorithm uses the information gain metric to choose the best feature to test at each
node in the tree
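The pseudocode above can be rendered as a short recursive Python sketch; the nested-dict tree representation, the function names, and the list-of-dicts dataset format are illustrative assumptions rather than part of the algorithm's definition.

```python
import math
from collections import Counter

def entropy(D, target):
    counts = Counter(row[target] for row in D)
    return -sum((c / len(D)) * math.log2(c / len(D)) for c in counts.values())

def info_gain(D, feature, target):
    rem = 0.0
    for level in {row[feature] for row in D}:
        part = [row for row in D if row[feature] == level]
        rem += (len(part) / len(D)) * entropy(part, target)
    return entropy(D, target) - rem

def id3(D, features, target, parent_majority=None):
    """Returns a target level (leaf) or a nested dict {feature: {level: subtree}}."""
    if not D:                              # base case: empty partition ->
        return parent_majority             # majority level of the parent's dataset
    levels = {row[target] for row in D}
    if len(levels) == 1:                   # base case: all instances agree ->
        return levels.pop()                # leaf labelled with that target level
    majority = Counter(row[target] for row in D).most_common(1)[0][0]
    if not features:                       # base case: no features left ->
        return majority                    # leaf labelled with the majority level
    # Otherwise split on the feature with the highest information gain.
    best = max(features, key=lambda f: info_gain(D, f, target))
    tree = {best: {}}
    for level in {row[best] for row in D}:  # grow one branch per observed level
        partition = [row for row in D if row[best] == level]
        remaining = [f for f in features if f != best]
        tree[best][level] = id3(partition, remaining, target, majority)
    return tree
```

Each recursive call only sees the partition of the data routed to that node, and the tested feature is removed from the candidate list for that path, matching the two remarks above. In this sketch branches are grown only for feature levels observed in the partition, so the empty-partition base case is defensive.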
EXAMPLE
THE TOTAL ENTROPY FOR THIS DATASET
H(vegetation) = - [ 3/7 log2(3/7) + 2/7 log2(2/7) + 2/7 log2(2/7) ] = 1.5567 bits
CALCULATING THE ENTROPY OF EVERY FEATURE
Calculating the entropy for feature = STREAM
Consider the values taken by the feature STREAM, i.e. (true, false), and the vegetation level of the instances in each partition:
STREAM = true: riparian, riparian, conifer, chapparal
STREAM = false: chapparal, chapparal, conifer
Calculating the entropy for feature = ELEVATION
ELEVATION = high: chapparal, conifer, chapparal
ELEVATION = medium: chapparal, riparian
ELEVATION = highest and ELEVATION = low: one instance each
H(elevation = high) = - [ 2/3 log2(2/3) + 1/3 log2(1/3) + 0/3 log2(0/3) ] = 0.9183 bits
H(elevation = medium) = - [ 1/2 log2(1/2) + 1/2 log2(1/2) + 0/2 log2(0/2) ] = 1 bit
H(elevation = highest) = - [ 1/1 log2(1/1) + 0 + 0 ] = 0 bits
H(elevation = low) = - [ 1/1 log2(1/1) + 0 + 0 ] = 0 bits
Rem(elevation, D) = 3/7 (0.9183) + 2/7 (1) + 0 + 0 = 0.6793 bits
IG(elevation) = H(vegetation) - Rem(elevation) = 1.5567 - 0.6793 = 0.877 bits
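These figures can be reproduced with a few lines of Python; the seven-instance dataset below is an assumption reconstructed from the partitions and entropy values shown in this example, so treat it as a sketch rather than the slides' exact table.

```python
import math
from collections import Counter

# Assumed seven-instance vegetation dataset, reconstructed from the partitions above
D = [
    {"stream": False, "slope": "steep",    "elevation": "high",    "vegetation": "chapparal"},
    {"stream": True,  "slope": "moderate", "elevation": "low",     "vegetation": "riparian"},
    {"stream": True,  "slope": "steep",    "elevation": "medium",  "vegetation": "riparian"},
    {"stream": False, "slope": "steep",    "elevation": "medium",  "vegetation": "chapparal"},
    {"stream": False, "slope": "flat",     "elevation": "high",    "vegetation": "conifer"},
    {"stream": True,  "slope": "steep",    "elevation": "highest", "vegetation": "conifer"},
    {"stream": True,  "slope": "steep",    "elevation": "high",    "vegetation": "chapparal"},
]

def H(rows):
    counts = Counter(r["vegetation"] for r in rows)
    return -sum((c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())

def rem(rows, feature):
    total = 0.0
    for level in {r[feature] for r in rows}:
        part = [r for r in rows if r[feature] == level]
        total += (len(part) / len(rows)) * H(part)
    return total

print(round(H(D), 4))                        # 1.5567 bits: total entropy of the dataset
print(round(rem(D, "elevation"), 4))         # 0.6793 bits
print(round(H(D) - rem(D, "elevation"), 4))  # 0.8774 bits: IG(elevation)
print(round(H(D) - rem(D, "stream"), 4))     # 0.3060 bits: IG(stream)
print(round(H(D) - rem(D, "slope"), 4))      # 0.5774 bits: IG(slope)
```

Under this assumed dataset, ELEVATION has the largest information gain, which is consistent with it being chosen as the root node.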
CONSIDERING THE FIRST NODE BELOW THE ROOT FOR SPLITTING (THE ELEVATION = MEDIUM PARTITION)
H(stream = false) = - [ 1/1 log2(1) ] = 0
H(stream = true) = - [ 1/1 log2(1) ] = 0
Rem(stream) = 1/2 (0) + 1/2 (0) = 0
IG(stream) = H(elevation = medium) - Rem(stream) = 1 - 0 = 1
H(slope = steep) = - [ 1/2 log2(1/2) + 1/2 log2(1/2) ] = 1
Rem(slope) = 2/2 (1) = 1
IG(slope) = H(elevation = medium) - Rem(slope) = 1 - 1 = 0
STREAM has a higher information gain than SLOPE and so is the best feature to split on at this node.
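A small check of this split, assuming the two-instance ELEVATION = medium partition implied by the calculations above:

```python
import math
from collections import Counter

# Assumed ELEVATION = medium partition: (stream, slope, vegetation)
medium = [("true", "steep", "riparian"), ("false", "steep", "chapparal")]

def H(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def ig(index):
    rem = 0.0
    for level in {row[index] for row in medium}:
        part = [row[2] for row in medium if row[index] == level]
        rem += (len(part) / len(medium)) * H(part)
    return H([row[2] for row in medium]) - rem

print(ig(0))  # IG(stream) = 1.0
print(ig(1))  # IG(slope)  = 0.0
```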
SPLITTING THE TREE FURTHER
Consider the partition where ELEVATION = high:
H(elevation = high) = - [ 2/3 log2(2/3) + 1/3 log2(1/3) + 0 ] = 0.9183
STREAM = false: chapparal, conifer; STREAM = true: chapparal
SLOPE = steep: chapparal, chapparal; SLOPE = flat: conifer
H(stream = false) = - [ 1/2 log2(1/2) + 1/2 log2(1/2) ] = 1
H(stream = true) = - [ 1/1 log2(1) ] = 0
Rem(stream) = 2/3 (1) + 1/3 (0) = 0.666
IG(stream) = H(elevation = high) - Rem(stream) = 0.9183 - 0.666 = 0.2517
H(slope = steep) = - [ 2/2 log2(2/2) ] = 0
H(slope = flat) = - [ 1/1 log2(1/1) ] = 0
Rem(slope) = 2/3 (0) + 1/3 (0) = 0
IG(slope) = H(elevation = high) - Rem(slope) = 0.9183 - 0 = 0.9183
SLOPE has a higher information gain than STREAM and so is chosen to split this node.
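And the corresponding check for the three-instance ELEVATION = high partition (values taken from the partition tables above):

```python
import math
from collections import Counter

# ELEVATION = high partition, as tabulated above: (stream, slope, vegetation)
high = [
    ("false", "steep", "chapparal"),
    ("false", "flat",  "conifer"),
    ("true",  "steep", "chapparal"),
]

def H(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def ig(index):
    rem = 0.0
    for level in {row[index] for row in high}:
        part = [row[2] for row in high if row[index] == level]
        rem += (len(part) / len(high)) * H(part)
    return H([row[2] for row in high]) - rem

print(round(ig(0), 4))  # IG(stream) ~ 0.2516
print(round(ig(1), 4))  # IG(slope)  ~ 0.9183
```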
FINAL DECISION TREE
The final tree has ELEVATION at the root: the ELEVATION = low and ELEVATION = highest branches end directly in leaf nodes, the ELEVATION = medium branch is split on STREAM, and the ELEVATION = high branch is split on SLOPE, with each of those branches ending in a pure leaf node.