
Decision Trees

Dr Subharag Sarkar
Content

 Introduction
 Decision Trees
 Choosing an Attribute
 Entropy
 Information Gain
 Summary
Introduction

 Decision tree induction is one of the simplest and yet most successful
forms of machine learning.
 We first describe the representation—the hypothesis space—and then
show how to learn a good hypothesis.
 A decision tree represents a function that takes as input a vector of
attribute values and returns a “decision”—a single output value.
 The input and output values can be discrete or continuous.
 For now, we will concentrate on problems where the inputs have
discrete values and the output has exactly two possible values; this is
Boolean classification, where each example input will be classified as
true (a positive example) or false (a negative example).
Decision Trees

 A decision tree reaches its decision by performing a sequence of tests.


 Each internal node in the tree corresponds to a test of the value of one
of the input attributes, Ai, and the branches from the node are labeled
with the possible values of the attribute, Ai = vik.
 Each leaf node in the tree specifies a value to be returned by the
function.
 The decision tree representation is natural for humans; indeed, many
“How To” manuals (e.g., for car repair) are written entirely as a single
decision tree stretching over hundreds of pages.
Example
 We will build a decision tree to decide whether to wait for a table at a
restaurant.
 The aim here is to learn a definition for the goal predicate WillWait.
 Firstly, we list the attributes that we will consider as part of the input:
Example
 Note that every variable has a small set of possible values; the value of WaitEstimate, for example, is not an integer but one of the four discrete values 0–10, 10–30, 30–60, or >60.
 Each pattern Xi is a set of 10 attribute
values, describing the case of a
customer asked to wait at the
restaurant.
 Using these training examples, we want
to learn a function F, mapping each
pattern into a Boolean answer (will wait
or will not wait).
Decision Trees
 A decision tree is a type of classifier. The
input is a pattern. The output is a class.
 Given a pattern: we start at the root of
the tree.
 The current node asks a question about
the pattern.
 Based on the answer, we move to a child
of the current node.
 When we get to a leaf node, we get the
output of the decision tree (a class).
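
 As a quick illustration, here is a minimal Python sketch of that procedure (not the textbook's code; the node classes, attribute names, and the small tree fragment are illustrative assumptions):

# Minimal sketch of decision-tree classification.
# A leaf stores a class label; an internal node stores the attribute it tests
# and one child subtree per possible value of that attribute.

class Leaf:
    def __init__(self, label):
        self.label = label            # class to return, e.g. "will wait"

class Node:
    def __init__(self, attribute, children):
        self.attribute = attribute    # attribute tested at this node
        self.children = children      # dict: attribute value -> subtree

def classify(tree, pattern):
    """Start at the root; at each internal node, look up the pattern's value for
    the node's attribute and follow that branch; return the leaf's class."""
    while isinstance(tree, Node):
        tree = tree.children[pattern[tree.attribute]]
    return tree.label

# A made-up fragment of the restaurant tree, just enough for the two traces
# worked through on the next slides:
tree = Node("Patrons", {
    "None": Leaf("will not wait"),
    "Some": Leaf("will wait"),
    "Full": Node("WaitEstimate", {
        "0-10": Leaf("will wait"),
        "10-30": Node("Hungry", {"False": Leaf("will wait"),
                                 "True": Leaf("will not wait")}),
        "30-60": Leaf("will not wait"),
        ">60": Leaf("will not wait"),
    }),
})
print(classify(tree, {"Patrons": "Some"}))                  # -> will wait
print(classify(tree, {"Patrons": "Full",
                      "WaitEstimate": "10-30",
                      "Hungry": "False"}))                  # -> will wait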
Decision Trees

 Example: What is the output of the decision tree on this pattern?
 First, we check the value of Patrons =
Some
 Where do we go next? To the middle
child
 What happens next?
 Leaf node. Output is will wait.
Decision Trees

 Example: What is the output of the decision tree on this pattern?
 First, we check value of Patrons = Full
 Where do we go next? To the right child.
 Next check: value of WaitEstimate? 10-
30
 Where do we go next? To the second-
from-the-right child.
 Next check: value of Hungry? False
 Where do we go next? To the left child.
 We arrived at a leaf node
 Output: will wait.
Decision Trees
 At this point, it should be clear how to
apply a decision tree to classify a
pattern.
 Obviously, there are lots and lots of
different decision trees that we can
come up with.
Decision Trees
 Here is a different decision tree.
 The natural question is: How can
we construct a good decision
tree?
Decision Tree Learning
 DTL is the pseudocode that the textbook provides for learning a decision tree.
 It looks simple and short, but (as
in TT-Entails?) there are a lot of
details we should look into.
Decision Tree Learning
 Notice that the function is
recursive (the line of the
recursive call is highlighted).
 This function builds the entire
tree, and its recursive calls build
each individual subtree, and
each individual leaf node.
Decision Tree Learning
 Examples: A set of training
examples.
 Remember, each training
example is a pair, consisting of a
pattern and a class label.
 Attributes: A list of attributes
that we can choose to test.
 Parent_examples: A default class
to output if no better choice is
available (details later).
Decision Tree Learning
 The function returns a Decision
tree.
 However, notice the highlighted
line, that says if there are no
examples, we should return
parent_examples. What does it
mean?
 It means we should return a leaf
node, that outputs class
parent_examples.
Decision Tree Learning
 This is the only reason we have
a parent_examples argument to
the DTL function.
 If there are no examples,
obviously we need to create a
leaf node.
 The parent_examples argument
tells us what class to store at
the leaf node.
Decision Tree Learning
 First base case: the examples are
empty. As discussed before, we
return a leaf node with output class
parent_examples.
 Second base case: all examples
have the same class. We just return
a leaf node with that class as its
output.
 Third base case: attributes is empty.
We have run out of questions to ask.
 In this case, we have to return a leaf
node. What class should we store
there?
 We output the class Plurality-Value(examples).
 Plurality-Value is an auxiliary function that returns the most frequent class among the examples.
Decision Tree Learning
 The highlighted code shows the
recursive case.
 The first thing we do is call Importance.
 Importance is an auxiliary function
that chooses the attribute we should
check at this node.
 We will talk A LOT about the
Importance function, a bit later.
 For now, just accept that this
function will do its job and choose
an attribute, which we store at
variable A.
Decision Tree Learning
 Here we create the tree that we are
going to return.
 We store in that tree (probably in
some member variable) the fact
that we will be testing attribute A.
 Next???
 Next, we need to create the children
of that tree.
 How?
 With recursive calls to DTL, with
appropriate arguments.
 How many children do we
create?
 As many as the values of attribute
A.
Decision Tree Learning
 The highlighted loop creates the
children (the subtrees of tree).
 Each iteration creates one child.
Decision Tree Learning
 Remember, each child (subtree) corresponds to a value vi of the chosen attribute A.
 To create that child, we need to call
DTL with appropriate values for
examples, attributes, and
parent_examples.
 What examples should we use?
 The subset of examples where A=vi.
Decision Tree Learning
 What attributes do we use?
 Everything in the attributes variable,
except for A.
 Why are we leaving A out?
 What happens if we use the
same attribute twice in a path?
Using an Attribute Twice in a Path
 What happens the second time we use Patrons as the test attribute in this example?
 It is useless. All patterns reaching this node already have Patrons = Full.
 Therefore, all patterns will go to the
right subtree.
Decision Tree Learning: Recursive
Case
 Finally, in our recursive call to DTL, what
should be the value of parent_examples?
 This will be used as output for a leaf node, if
exs are empty.
 First of all, why would exs ever be empty?
 It may just happen that no training examples
have A=vk.
 So, what should be the value of
parent_examples?
 It should be Plurality-Value(examples): the
most frequent class among the examples.
Decision Tree Learning: Recursive
Case
 Once the recursive call to DTL has returned a
subtree, we add that to our tree.
 Finally, once all recursive calls to DTL have
returned, we return the tree we have created.
 Are we done?
 Almost, except that we still need to talk about how Importance works.
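
 Before that, here is the whole learner in one place, as a minimal Python sketch of the DTL pseudocode (my own rendering, not the textbook's exact code); it reuses the Leaf and Node classes from the earlier classification sketch, ATTRIBUTE_VALUES is an assumed lookup table of each attribute's possible values, and importance is a placeholder until information gain is defined:

from collections import Counter

ATTRIBUTE_VALUES = {"Patrons": ["None", "Some", "Full"]}   # ...one entry per attribute

def plurality_value(examples):
    """Most frequent class label among a list of (pattern, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def importance(attributes, examples):
    return attributes[0]          # placeholder: replaced by information gain later

def dtl(examples, attributes, parent_class):
    # parent_class plays the role of the slides' parent_examples argument:
    # the default class to put in a leaf when no examples are left.
    if not examples:                                  # base case 1: no examples
        return Leaf(parent_class)
    labels = {label for _, label in examples}
    if len(labels) == 1:                              # base case 2: all one class
        return Leaf(labels.pop())
    if not attributes:                                # base case 3: no attributes left
        return Leaf(plurality_value(examples))

    A = importance(attributes, examples)              # choose the attribute to test
    children = {}
    for v in ATTRIBUTE_VALUES[A]:                     # one child per value of A
        exs = [(p, c) for p, c in examples if p[A] == v]
        remaining = [a for a in attributes if a != A]        # never reuse A on this path
        children[v] = dtl(exs, remaining, plurality_value(examples))
    return Node(A, children)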
Choosing an Attribute
Choosing an Attribute

 Here we see two different attributes chosen at the root of the decision
tree.
 At each node, green stands for a training example that waited, red stands
for a training example that did not wait.
 Which attribute seems better (more useful) to you?
 The Patrons attribute is more useful because it separates the greens from the reds better.
Choosing an Attribute

 How can we quantify how well an attribute separates training examples from different classes?
 We need to define two new quantities:
 Entropy.
 Information gain.
 These quantities are computed with specific formulas.
 Information gain will be used to choose the best attribute.
Entropy
Entropy
 Suppose a group of friends is trying to decide which
movie they can watch together on Sunday.
 There are 2 choices for movies, one is “Lucy” and the
second is “Titanic” and now everyone has to tell
their choice.
 After everyone gives their answer we see that “Lucy”
gets 4 votes and “Titanic” gets 5 votes.
 Which movie do they watch now?
 Isn’t it hard to choose 1 movie now because the votes
for both movies are somewhat equal?
Entropy
 This is exactly what we call disorder: there is a roughly equal number of votes for both movies, and we can’t really decide which movie to watch.
 It would have been much easier if the votes for “Lucy” were 8 and the votes for “Titanic” were 2.
 Then we could easily say that the majority of votes are for “Lucy”, hence everyone will be watching that movie.
 In a decision tree, the output is mostly “yes” or “no”.
Entropy

 Entropy is a measure of the uncertainty, or disorder, in our dataset.
 Decision trees use Entropy to measure the impurity of
the node.
 Impurity is the degree of randomness; it tells how
random our data is.
Entropy- Two Class Example

 Suppose that we have a set X of training examples.


 K1 examples have class label A.
 K2 examples have class label B.
 Let K = K1 + K2.
 Then the entropy of the set X depends only on the two ratios K1/K and K2/K.
 The entropy H is defined as:
   H(X) = −(K1/K)·log₂(K1/K) − (K2/K)·log₂(K2/K)
 Note: logarithms in this discussion are always base 2.


Entropy – General Example

 In the general case:


 Suppose that we have a set X of training examples.
 Suppose there are N different class labels L1, …, LN.
 K1 examples have class label L1.
 K2 examples have class label L2.
 In general, Ki examples have the class label Li.
 Let K = K1 + K2 + … + KN.
 Then the entropy H of the set X is given by this formula:
   H(X) = −(K1/K)·log₂(K1/K) − (K2/K)·log₂(K2/K) − … − (KN/K)·log₂(KN/K)
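
 As a sanity check of this formula, here is a small Python sketch (illustrative, not part of the slides) that computes H from a list of class counts; it reproduces the values in the examples that follow:

import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts Ki.
    H = -sum_i (Ki/K) * log2(Ki/K) with K = sum of all Ki; classes with
    Ki = 0 contribute nothing, since x * log2(x) -> 0 as x -> 0."""
    total = sum(counts)
    return sum((k / total) * math.log2(total / k) for k in counts if k > 0)

print(entropy([200, 200]))    # 1.0    (Example 1: even two-class split)
print(entropy([20, 500]))     # ~0.24  (Example 2)
print(entropy([20, 5000]))    # ~0.04  (Example 3)
print(entropy([0, 5000]))     # 0.0    (Example 4: all examples in one class)
print(entropy([50, 15, 25]))  # ~1.42  (Example 5: three classes, can exceed 1)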
Making Sense of Entropy
 If this is the first time you see the definition of entropy, it probably does not
look very intuitive.
 We will look at several specific examples so that you can see how entropy
behaves.
 The lowest possible entropy value is 0.
 The lower the entropy is, the more uneven the distribution of classes is.
 Zero entropy means all training examples have the same class.
 The highest possible entropy value is log₂N (N = number of classes).
 The higher the entropy is, the more even the distribution of classes is.
 Entropy log₂N means that the number of training examples for each class is equal: K1 = K2 = … = KN.
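
 Using the entropy function sketched above, both bounds are easy to check numerically (the counts here are made up for illustration):

import math

# Reusing entropy() from the earlier sketch.
print(entropy([100]))               # 0.0: every example has the same class (lowest value)
print(entropy([25, 25, 25, 25]))    # 2.0: four evenly split classes (highest value)
print(math.log2(4))                 # 2.0 = log2(N) for N = 4, matching the bound above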
Example 1

 We have a set X of training examples, of two classes.


 200 examples have class label A.
 200 examples have class label B.
 What is the entropy of X?

 The classes are evenly split: K1/K = K2/K = 1/2.
 Therefore, H = −(1/2)·log₂(1/2) − (1/2)·log₂(1/2) = 1, the largest possible value (log₂2 = 1).
Example 2

 We have a set X of training examples, of two classes.


 20 examples have class label A.
 500 examples have class label B.
 What is the entropy of X?

 The classes are quite unevenly split.
 Therefore, H has a smaller value, closer to 0:
   H = −(20/520)·log₂(20/520) − (500/520)·log₂(500/520) ≈ 0.24
Example 3

 We have a set X of training examples, of two classes.


 20 examples have class label A.
 5000 examples have class label B.
 What is the entropy of X?

 The classes are even more unevenly split.
 Therefore, H is even smaller than in the previous example:
   H = −(20/5020)·log₂(20/5020) − (5000/5020)·log₂(5000/5020) ≈ 0.04
Example 4

 We have a set X of training examples, of two classes.


 0 examples have class label A.
 5000 examples have class label B.
 What is the entropy of X?

 This is an interesting case: the term for class A has the form 0 · log₂0, a product of 0 and −infinity. Who wins, 0 or infinity?
 We will skip the mathematical proof, but 0 wins, so H = 0.
 This makes sense: the most uneven split possible gives the lowest H possible.
Example 5

 We have a set X of training examples, of two classes.


 50 examples have class label A.
 15 examples have class label B.
 25 examples have class label C.
 What is the entropy of X?

 Since we have three classes, the entropy can be greater than 1 (up to log₂3 ≈ 1.58):
   H = −(50/90)·log₂(50/90) − (15/90)·log₂(15/90) − (25/90)·log₂(25/90) ≈ 1.42
Information Gain
Information Gain

 Information gain measures the reduction of uncertainty given some feature.
 It also helps to decide the attribute that should be
selected as a decision node or root node.

 It is just the entropy of the full dataset minus the (weighted average) entropy of the dataset given some feature:
   Gain = H(E) − Σi (|Ei|/|E|) · H(Ei), where the Ei are the subsets produced by splitting on that feature.
Information Gain

 It tells us how much information a feature/attribute provides about the class.
 The splitting of nodes, and hence the building of the decision tree, is driven by the value of the information gain.
 The decision tree always tries to maximize the information gain: the attribute with the highest information gain is chosen for the split first.
Information Gain – Behaviour

 Overall, we like attributes that yield as high an information gain as possible.


 To get that, we want the entropy of the subsets Ei to be small.
 This happens when the attribute splits the training examples nicely so
that different classes get concentrated on different subsets Ei.
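
 Continuing the Python sketch (illustrative only; the example counts below are made up, apart from the gym example's 16/14 split quoted on the following slides), information gain can be computed directly from the class counts of the full set E and of each subset Ei:

import math

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return sum((k / total) * math.log2(total / k) for k in counts if k > 0)

def information_gain(parent_counts, child_counts):
    """Gain = H(E) minus the weighted average of H(Ei) over the subsets Ei,
    where each subset's weight is its share of the parent's examples."""
    total = sum(parent_counts)
    weighted = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - weighted

print(entropy([16, 14]))                            # ~0.99: the gym example's parent entropy
print(information_gain([8, 8], [[8, 0], [0, 8]]))   # 1.0: a perfect split (pure subsets)
print(information_gain([8, 8], [[4, 4], [4, 4]]))   # 0.0: a useless split (no separation)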
Information Gain
 Suppose our entire population has a total of 30
instances.
 The dataset is to predict whether the person will go to
the gym or not.
 Let’s say 16 people go to the gym and 14 people
don’t.
 Now we have two features to predict whether he/she
will go to the gym or not.
 Feature 1 is “Energy” which takes two values “high”
and “low”
 Feature 2 is “Motivation” which takes 3 values “No
motivation”, “Neutral” and “Highly motivated”.
Feature 1
 Let’s calculate the entropy: with 16 “go” and 14 “don’t” out of 30 examples, the parent entropy is E(Parent) = −(16/30)·log₂(16/30) − (14/30)·log₂(14/30) ≈ 0.99; the entropy of each child node (“high” and “low”) is computed the same way from its own counts.
Feature 1
 Now calculate the weighted average of the entropy of each child node, weighting each child by its share of the 30 examples:
   E(Parent|Energy) ≈ 0.62
 Now that we have E(Parent) and E(Parent|Energy), the information gain will be:
   Information Gain = E(Parent) − E(Parent|Energy) ≈ 0.99 − 0.62 = 0.37
 Our parent entropy was near 0.99, and looking at this value of the information gain, we can say that the entropy of the dataset will decrease by 0.37 if we make “Energy” our root node.
Feature 2

 Let us calculate the entropy here in the same way: the entropy of each child node (“No motivation”, “Neutral”, “Highly motivated”) is computed from that node’s own counts.


Feature 2
 Now calculate the weighted average of the entropy of each child node, exactly as before:
   E(Parent|Motivation) ≈ 0.86
 Now that we have E(Parent) and E(Parent|Motivation), the information gain will be:
   Information Gain = E(Parent) − E(Parent|Motivation) ≈ 0.99 − 0.86 = 0.13
 Our parent entropy was near 0.99, and looking at this value of the information gain, we can say that the entropy of the dataset will decrease by 0.13 if we make “Motivation” our root node.
Information Gain
 We now see that the “Energy” feature gives a larger reduction (0.37) than the “Motivation” feature (0.13).
 Hence, we will select the feature that has the highest
information gain and then split the node based on that
feature.
 In this example “Energy” will be our root node and we’ll do
the same for sub-nodes.
Information Gain
 Here we can see that when Energy is “high” the entropy is low, and hence we can say a person will go to the gym if they have high energy. But what if the energy is low?
 We will again split the node based on the new feature
which is “Motivation”.
Information Gain – Example 1

 What is the information gain in this example?
 H(E) = ???
 H(E1) = ???
 H(E2) = ???
 H(E3) = ???
Information Gain – Example 1
 What is the information gain in this
example?
 H(E) = 1, since classes are evenly split at
the top.
 H(E1) = 0, since we only have will not wait
cases on the left child.
 H(E2) = 0, since we only have will wait
cases on the middle child.
 H(E3) > 0, since the right child contains a mix of will wait and will not wait cases; its value comes from the two-class entropy formula.
 The information gain is H(E) minus the weighted average of H(E1), H(E2), and H(E3), and it is well above 0 here because two of the three children are pure.
Information Gain – Example 2
 What is the information gain in
this example?
 H(E) = ???
 H(E1) = ???
 H(E2) = ???
 H(E3) = ???
 H(E4)=???
Information Gain – Example 2
 What is the information gain in
this example?
 H(E) = 1, since classes are
evenly split.
 H(E1) = H(E2) = H(E3) = H(E4) = 1, since the classes at all children are also evenly split.
 Therefore the information gain is 1 − 1 = 0: this attribute tells us nothing about the class.
Importance
 We can now finally specify what
Importance does:
 Importance returns the attribute
that achieves the highest
information gain on the given
examples.
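
 In code, Importance is just an argmax over information gain; this sketch (illustrative, reusing information_gain from the earlier sketch and the (pattern, label) example representation used in the DTL sketch) would fill in the placeholder left there:

from collections import Counter

def class_counts(examples):
    """List of per-class counts for a list of (pattern, label) pairs."""
    return list(Counter(label for _, label in examples).values())

def importance(attributes, examples):
    """Return the attribute with the highest information gain on these examples."""
    def gain(attribute):
        # Group the examples by their value for this attribute...
        groups = {}
        for pattern, label in examples:
            groups.setdefault(pattern[attribute], []).append((pattern, label))
        # ...then compare the parent entropy with the weighted child entropies.
        return information_gain(class_counts(examples),
                                [class_counts(g) for g in groups.values()])
    return max(attributes, key=gain)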
Summary

 Decision trees are a popular pattern recognition method.


 Learning a decision tree is done by recursively:
 Picking an attribute at the root.
 Sending the training examples to different children of the root, based on
their values of the chosen attribute.
 Learning each of the children.
 Choosing attributes is based on information gain.
 We prefer attributes that concentrate different classes on different
children.
