
DECISION TREE

 A decision tree is a prediction model in which a series of tests is conducted on the descriptive features of an instance, and the answers are used to make a prediction.

 A decision tree consists of a root node (or starting node), interior nodes, and leaf nodes (or terminating nodes) that are connected by branches.

 Each non-leaf node (root or interior) in the tree specifies a test to be carried out on a descriptive feature.

 The number of possible levels that a feature can take determines the number of downward branches from a non-leaf node.
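As an illustration only (not taken from the slides), a minimal Python sketch of this node-and-branch structure; the class name, field names, and predict method are hypothetical:

from typing import Dict, Optional


class Node:
    """A decision-tree node: a non-leaf node tests a feature, a leaf node stores a prediction."""

    def __init__(self, feature: Optional[str] = None, prediction: Optional[str] = None):
        self.feature = feature                  # descriptive feature tested here (None for a leaf)
        self.prediction = prediction            # target level returned at a leaf (None otherwise)
        self.children: Dict[str, "Node"] = {}   # one downward branch per possible feature level

    def predict(self, instance: Dict[str, str]) -> str:
        """Run the series of tests from this node down to a leaf."""
        if self.prediction is not None:
            return self.prediction
        return self.children[instance[self.feature]].predict(instance)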
THE DATASET

 To predict whether emails are spam or ham (genuine). The dataset has three binary descriptive features:
   SUSPICIOUS WORDS
   UNKNOWN SENDER
   CONTAINS IMAGES
DEMONSTRATION

How do we decide which is the best decision tree to use?


CHOICE OF ROOT NODE

 Shallower decision trees are preferred.

 The shallowness of the tree depends on placing the descriptive features that best discriminate between instances with different target feature values toward the top of the tree.

 The formal measure we will use to do this is Shannon's entropy model.


SHANNON’S ENTROPY MODEL
 Claude Shannon's entropy model defines a computational measure of the impurity (heterogeneity) of the elements in a set.

 In the activity of choosing a card randomly from fig. (a), it is known for sure that an ace of spades will be selected  zero entropy (no impurity).

 In the activity of choosing a card randomly from fig. (f), it is not known for sure which card is going to be picked, as every card is equally likely to be picked  very high entropy (high impurity).
SHANNON’S ENTROPY MODEL

 An outcome with a large probability should map to a low entropy value.

 An outcome with a small probability should map to a large entropy value.

 The mathematical logarithm, or log, function does almost exactly this transformation.

 An attractive characteristic of this function is that the range of values for the binary logarithm of a probability, [−∞, 0], is much larger than the range taken by the probability itself, [0, 1].
SHANNON’S ENTROPY MODEL

(a) A graph illustrating how the value of a binary log (the log to the base 2) of a probability
changes across the range of probability values; (b) the impact of multiplying these values by − 1.
SHANNON’S ENTROPY MODEL

 Shannon’s model of entropy is a weighted sum of the logs of the probabilities of each
possible outcome when we make a random selection from a set.

measure of the impurity,


heterogeneity, of a set

use 2 as the base, s, when we calculate entropy, which means that we measure entropy in bits
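The entropy equation itself is presumably shown as an image on the slide; in its standard form, for a target feature t that can take l levels,

H(t, D) = − Σ (i = 1 to l)  P(t = i) × log2( P(t = i) )

A minimal Python sketch of the same computation (the function name and the use of collections.Counter are my own choices):

import math
from collections import Counter


def entropy(values):
    # Shannon entropy, in bits, of a list of outcomes
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


# The two card activities described above (labels are illustrative):
print(entropy(["ace of spades"] * 52))            # 0.0  -> no impurity
print(entropy([f"card {i}" for i in range(52)]))  # ~5.7 -> maximum impurity for 52 outcomes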
ENTROPY CALCULATION: EXAMPLE 1
Entropy for a card being picked from a pack of 52 cards

 The probability of randomly selecting any specific card i, P(card = i), from a set of 52 cards is 1/52. The entropy is calculated as below.
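Working this through (the slide presumably shows the same calculation as an image):

H(card) = − [ 1/52 log2(1/52) + 1/52 log2(1/52) + … + 1/52 log2(1/52) ]   (52 identical terms)
        = − 52 × 1/52 × log2(1/52)
        = log2(52) ≈ 5.70 bits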
ENTROPY CALCULATION: EXAMPLE 2
Entropy for the suit of a card picked from a pack of 52 cards (4 different suits)

 Calculating the entropy of a set of 52 playing cards if we only distinguish between cards based on their suit (heart, club, diamond, or spade).
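Working this through (the slide presumably shows the same calculation as an image), each suit has probability 13/52 = 1/4, so

H(suit) = − [ 1/4 log2(1/4) + 1/4 log2(1/4) + 1/4 log2(1/4) + 1/4 log2(1/4) ]
        = log2(4) = 2 bits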
INFORMATION GAIN

 We need to develop a formal model that captures our intuitions about the informativeness of the features.
 Shannon's entropy model does this; the measure of informativeness that we will use is known as information gain.
COMPUTATION OF INFORMATION GAIN
Step 1: Compute the entropy of the original dataset with respect to the target feature.

Step 2: For each descriptive feature, create the sets that result from partitioning the instances in the dataset using their feature values, and then sum the entropy scores of each of these sets, weighted by the fraction of instances in each set.

Step 3: Subtract the remaining entropy value (computed in Step 2) from the original entropy value (computed in Step 1) to give the information gain. The corresponding formulas are sketched below.
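In the standard formulation (the slides presumably show these equations as images), the remaining entropy and the information gain for a descriptive feature d are

rem(d, D) = Σ over the levels l of d of  [ |D(d=l)| / |D| ] × H(t, D(d=l))

IG(d, D) = H(t, D) − rem(d, D)

where D(d=l) denotes the partition of D containing the instances for which feature d takes the level l, and t is the target feature.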
COMPUTATION OF INFORMATION GAIN (CONTD.)
Step 1: Entropy of the original dataset with respect to the target feature
COMPUTATION OF INFORMATION GAIN (CONTD.)
Step 2: Computation of remaining entropy for the SUSPICIOUS WORDS feature
COMPUTATION OF INFORMATION GAIN (CONTD.)
Step 2: Computation of remaining entropy for the UNKNOWN SENDER feature
(Partition tables: the three instances with UNKNOWN SENDER = true and the three with UNKNOWN SENDER = false, each listed with its class.)
COMPUTATION OF INFORMATION GAIN (CONTD.)
Step 2: Computation of remaining entropy for the CONTAINS IMAGES feature
COMPUTATION OF INFORMATION GAIN (CONTD.)
Step 3: Information gain calculation for each descriptive feature
CHOICE OF THE ROOT NODE

 The feature that possesses the highest information gain is the right choice to be the root node to start with. As the tree grows, the entropy model allows us to decide which test we should add to the sequence next.
ITERATIVE DICHOTOMIZER 3 (ID3) ALGORITHM (DECISION TREE INDUCTION ALGORITHMS)

1: if all the instances in D have the same target level C then
2:   return a decision tree consisting of a leaf node with label C
3: else if d is empty then
4:   return a decision tree consisting of a leaf node with the label of the majority target level in D
5: else if D is empty then
6:   return a decision tree consisting of a leaf node with the label of the majority target level of the dataset of the immediate parent node
7: else
8:   d[best] ← arg max over d ∈ d of IG(d, D)
9:   make a new node, Node_d[best], and label it with d[best]
10:  partition D using d[best]
11:  remove d[best] from d
12:  for each partition D_i of D do
13:    grow a branch from Node_d[best] to the decision tree created by rerunning ID3 with D = D_i

(In the base cases, lines 1-6, the algorithm stops growing the current path in the tree by adding a leaf node to the tree; in the recursive case, lines 7-13, it extends the current path by adding an interior node to the tree and growing the branches iteratively. Here D is the set of training instances and d is the set of descriptive features.)
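A compact Python sketch of the algorithm above (a minimal reading of the pseudocode, not a reference implementation; the helper names are my own):

import math
from collections import Counter


def entropy(values):
    # Shannon entropy (in bits) of a list of target-feature values
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def partition(D, feature):
    # split the list-of-dicts dataset D by the levels of `feature`
    parts = {}
    for row in D:
        parts.setdefault(row[feature], []).append(row)
    return parts


def information_gain(D, feature, target):
    # IG(feature, D) = entropy of D minus the remaining (weighted) entropy after the split
    rem = sum(len(p) / len(D) * entropy([r[target] for r in p])
              for p in partition(D, feature).values())
    return entropy([r[target] for r in D]) - rem


def majority(D, target):
    return Counter(r[target] for r in D).most_common(1)[0][0]


def id3(D, features, target, parent=None):
    # returns a nested-dict decision tree; leaves are target levels
    if not D:                                   # base case (line 5): empty partition ->
        return majority(parent, target)         # majority level of the parent's dataset
    levels = {r[target] for r in D}
    if len(levels) == 1:                        # base case (line 1): all instances agree
        return levels.pop()
    if not features:                            # base case (line 3): no features left to test
        return majority(D, target)
    best = max(features, key=lambda f: information_gain(D, f, target))   # line 8
    rest = [f for f in features if f != best]                            # line 11
    return {best: {lvl: id3(part, rest, target, parent=D)                # lines 12-13
                   for lvl, part in partition(D, best).items()}}

Applied to the vegetation dataset used in the worked example below, this sketch should reproduce the same splits that are computed by hand there (ELEVATION at the root, then STREAM under Elevation = medium and SLOPE under Elevation = high).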
NOTE

 In the ID3 algorithm, the base cases are the situations where we stop splitting the dataset and construct a leaf node with an associated target level.

 There are two important things to remember when designing these base cases:
1. The dataset of training instances considered at each of the interior nodes in the tree is not the complete dataset.
2. Once a feature has been tested, it is not considered for selection again along that path in the tree.

 The ID3 algorithm uses the information gain metric to choose the best feature to test at each node in the tree.
EXAMPLE
THE TOTAL ENTROPY FOR THIS DATASET
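This value is presumably shown as an image on the slide; reconstructing it from the class counts in the partition tables that follow (3 chapparal, 2 riparian, 2 conifer out of 7 instances):

H(vegetation) = − [ 3/7 log2(3/7) + 2/7 log2(2/7) + 2/7 log2(2/7) ]
              ≈ 1.557 bits

This is the H(vegetation) value used in the information gain calculations below.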
CALCULATING THE ENTROPY OF EVERY FEATURE
Calculating the entropy for feature = STREAM
Consider the values taken by the feature STREAM, i.e. (true, false)

STREAM = true                 STREAM = false
Stream   Vegetation           Stream   Vegetation
true     Riparian             false    Chapparal
true     Riparian             false    Chapparal
true     Conifer              false    Conifer
true     Chapparal

H(Stream=true)  = − [ 2/4 log2(2/4) + 1/4 log2(1/4) + 1/4 log2(1/4) ]
                = 1.5 bits

H(Stream=false) = − [ 0/3 log2(0/3) + 2/3 log2(2/3) + 1/3 log2(1/3) ]   (taking 0 log2 0 = 0)
                = 0.918 bits

Rem(Stream) = 4/7 (1.5) + 3/7 (0.918)
            = 1.251 bits

IG(Stream) = H(vegetation) − Rem(Stream)
           = 1.557 − 1.251
           = 0.306 bits
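A quick standalone check of these figures (the class counts are read off the two partition tables above):

import math


def H(counts):
    # entropy in bits from a list of class counts (zero counts contribute nothing)
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)


rem = 4/7 * H([2, 1, 1]) + 3/7 * H([2, 1])   # Stream = true, Stream = false
print(round(H([3, 2, 2]) - rem, 3))          # 0.306 = IG(Stream)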
CALCULATING THE ENTROPY OF EVERY FEATURE
Calculating the entropy for feature = SLOPE
Consider the values taken by the feature SLOPE, i.e. (flat, moderate, steep)

SLOPE = flat             SLOPE = moderate            SLOPE = steep
Slope    Vegetation      Slope      Vegetation       Slope    Vegetation
flat     Conifer         moderate   Riparian         steep    Chapparal
                                                     steep    Riparian
                                                     steep    Chapparal
                                                     steep    Conifer
                                                     steep    Chapparal

H(Slope=flat)     = − [ 1/1 log2(1) ]
                  = 0
H(Slope=moderate) = − [ 1/1 log2(1) ]
                  = 0
H(Slope=steep)    = − [ 3/5 log2(3/5) + 1/5 log2(1/5) + 1/5 log2(1/5) ]
                  = 1.371

Rem(Slope) = 5/7 (1.371) + 1/7 (0) + 1/7 (0)
           = 0.979 bits

IG(Slope) = H(vegetation) − Rem(Slope)
          = 1.557 − 0.979
          = 0.577 bits
CALCULATING THE ENTROPY OF EVERY FEATURE
Calculating the entropy for feature = ELEVATION
Consider the values taken by the feature ELEVATION, i.e. (low, medium, high, highest)

ELEVATION = low          ELEVATION = medium       ELEVATION = high         ELEVATION = highest
Elevation  Vegetation    Elevation  Vegetation    Elevation  Vegetation    Elevation  Vegetation
low        Riparian      medium     Riparian      high       Chapparal     highest    Conifer
                         medium     Chapparal     high       Conifer
                                                  high       Chapparal

H(Elevation=high)    = − [ 2/3 log2(2/3) + 1/3 log2(1/3) + 0/3 log2(0/3) ]
                     = 0.918 bits
H(Elevation=medium)  = − [ 1/2 log2(1/2) + 1/2 log2(1/2) + 0/2 log2(0/2) ]
                     = 1 bit
H(Elevation=highest) = − [ 1/1 log2(1/1) + 0 + 0 ]
                     = 0
H(Elevation=low)     = − [ 1/1 log2(1/1) + 0 + 0 ]
                     = 0

Rem(Elevation) = 3/7 (0.918) + 2/7 (1) + 1/7 (0) + 1/7 (0)
               = 0.679 bits

IG(Elevation) = H(vegetation) − Rem(Elevation)
              = 1.557 − 0.679
              = 0.877 bits
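For completeness, a standalone snippet that recomputes all three information gains from the seven instances reconstructed from the partition tables in this example (only the per-feature class counts matter for these numbers):

import math
from collections import Counter

# (stream, slope, elevation, vegetation) for the 7 instances, read off the tables above
data = [
    ("false", "steep", "high", "chapparal"), ("true", "moderate", "low", "riparian"),
    ("true", "steep", "medium", "riparian"), ("false", "steep", "medium", "chapparal"),
    ("false", "flat", "high", "conifer"),    ("true", "steep", "highest", "conifer"),
    ("true", "steep", "high", "chapparal"),
]


def H(rows):
    # entropy (in bits) of the vegetation labels in `rows`
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in Counter(r[-1] for r in rows).values())


for idx, name in [(0, "stream"), (1, "slope"), (2, "elevation")]:
    parts = {}
    for r in data:
        parts.setdefault(r[idx], []).append(r)
    rem = sum(len(p) / len(data) * H(p) for p in parts.values())
    print(name, round(H(data) - rem, 3))
# prints: stream 0.306, slope 0.577, elevation 0.877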
CONSIDERING THE FIRST NODE (ROOT NODE) FOR SPLITTING

 The feature with the highest information gain is chosen as the root node.
DECISION TREE AFTER FIRST LEVEL OF SPLITTING
 ELEVATION has the largest information gain of the three features and so is selected by the algorithm at the root node of the tree.

 ELEVATION is no longer listed in these partitions, as it has already been used to split the data.

 Elevation = low and Elevation = highest both have a single instance and can be converted into leaf nodes.
SPLITTING THE TREE (SECOND TIME)
Consider the partition Elevation = medium

H(elevation=medium) = − [ 1/2 log2(1/2) + 1/2 log2(1/2) + 0 ]
                    = 1

Stream   Vegetation          Slope    Vegetation
true     Riparian            steep    Riparian
false    Chapparal           steep    Chapparal

H(stream=true)  = − [ 1/1 log2(1) ] = 0
H(stream=false) = − [ 1/1 log2(1) ] = 0
Rem(Stream) = 1/2 (0) + 1/2 (0) = 0
IG(Stream)  = H(elevation=medium) − Rem(Stream)
            = 1 − 0 = 1

H(slope=steep) = − [ 1/2 log2(1/2) + 1/2 log2(1/2) ] = 1
Rem(Slope) = 2/2 (1) = 1
IG(Slope)  = H(elevation=medium) − Rem(Slope)
           = 1 − 1 = 0

STREAM has a higher information gain than SLOPE and so is the best feature to split this partition on.
SPLITTING THE TREE FURTHER
Consider the partition Elevation = high

H(elevation=high) = − [ 2/3 log2(2/3) + 1/3 log2(1/3) + 0 ]
                  = 0.918

Stream   Vegetation          Slope    Vegetation
false    Chapparal           steep    Chapparal
false    Conifer             flat     Conifer
true     Chapparal           steep    Chapparal

H(stream=false) = − [ 1/2 log2(1/2) + 1/2 log2(1/2) ] = 1
H(stream=true)  = − [ 1/1 log2(1) ] = 0
Rem(Stream) = 2/3 (1) + 1/3 (0) = 0.667
IG(Stream)  = H(elevation=high) − Rem(Stream)
            = 0.918 − 0.667 = 0.251

H(slope=steep) = − [ 2/2 log2(2/2) ] = 0
H(slope=flat)  = − [ 1/1 log2(1/1) ] = 0
Rem(Slope) = 2/3 (0) + 1/3 (0) = 0
IG(Slope)  = H(elevation=high) − Rem(Slope)
           = 0.918 − 0 = 0.918

SLOPE has a higher information gain than STREAM and so is the best feature to split this partition on.
FINAL DECISION TREE
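The final tree is presumably shown as an image on the slide; read off the splits computed above, it should look like this (a reconstruction, not copied from the slide):

ELEVATION
  = low      → leaf: Riparian
  = highest  → leaf: Conifer
  = medium   → STREAM
                  = true   → leaf: Riparian
                  = false  → leaf: Chapparal
  = high     → SLOPE
                  = steep  → leaf: Chapparal
                  = flat   → leaf: Conifer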
