
Decision Trees

Dr Subharag Sarkar
Content

 Introduction
 Decision Trees
 Choosing an Attribute
 Entropy
 Information Gain
 Summary
Introduction

 Decision tree induction is one of the simplest and yet most successful
forms of machine learning.
 We first describe the representation—the hypothesis space—and then
show how to learn a good hypothesis.
 A decision tree represents a function that takes as input a vector of
attribute values and returns a “decision”—a single output value.
 The input and output values can be discrete or continuous.
 For now, we will concentrate on problems where the inputs have
discrete values and the output has exactly two possible values; this is
Boolean classification, where each example input will be classified as
true (a positive example) or false (a negative example).
Decision Trees

 A decision tree reaches its decision by performing a sequence of tests.


 Each internal node in the tree corresponds to a test of the value of one
of the input attributes, Ai, and the branches from the node are labeled
with the possible values of the attribute, Ai = vik.
 Each leaf node in the tree specifies a value to be returned by the
function.
 The decision tree representation is natural for humans; indeed, many
“How To” manuals (e.g., for car repair) are written entirely as a single
decision tree stretching over hundreds of pages.
Example
 We will build a decision tree to decide whether to wait for a table at a
restaurant.
 The aim here is to learn a definition for the goal predicate WillWait.
 Firstly, we list the attributes that we will consider as part of the input:
Example
 Note that every variable has a small set of possible values; the value of WaitEstimate, for example, is not an integer but one of the four discrete values 0–10, 10–30, 30–60, or >60.
 Each pattern Xi is a set of 10 attribute
values, describing the case of a
customer asked to wait at the
restaurant.
 Using these training examples, we want
to learn a function F, mapping each
pattern into a Boolean answer (will wait
or will not wait).
Decision Trees
 A decision tree is a type of classifier. The
input is a pattern. The output is a class.
 Given a pattern: we start at the root of
the tree.
 The current node asks a question about
the pattern.
 Based on the answer, we move to a child
of the current node.
 When we get to a leaf node, we get the
output of the decision tree (a class).
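
 As a quick illustration, here is a minimal Python sketch of that procedure (not the textbook's code; the node classes, attribute names, and the small tree fragment are illustrative assumptions):

# Minimal sketch of decision-tree classification.
# A leaf stores a class label; an internal node stores the attribute it tests
# and one child subtree per possible value of that attribute.

class Leaf:
    def __init__(self, label):
        self.label = label            # class to return, e.g. "will wait"

class Node:
    def __init__(self, attribute, children):
        self.attribute = attribute    # attribute tested at this node
        self.children = children      # dict: attribute value -> subtree

def classify(tree, pattern):
    """Start at the root; at each internal node, look up the pattern's value for
    the node's attribute and follow that branch; return the leaf's class."""
    while isinstance(tree, Node):
        tree = tree.children[pattern[tree.attribute]]
    return tree.label

# A made-up fragment of the restaurant tree, just enough for the two traces
# worked through on the next slides:
tree = Node("Patrons", {
    "None": Leaf("will not wait"),
    "Some": Leaf("will wait"),
    "Full": Node("WaitEstimate", {
        "0-10": Leaf("will wait"),
        "10-30": Node("Hungry", {"False": Leaf("will wait"),
                                 "True": Leaf("will not wait")}),
        "30-60": Leaf("will not wait"),
        ">60": Leaf("will not wait"),
    }),
})
print(classify(tree, {"Patrons": "Some"}))                  # -> will wait
print(classify(tree, {"Patrons": "Full",
                      "WaitEstimate": "10-30",
                      "Hungry": "False"}))                  # -> will wait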
Decision Trees

 Example: What is the output of the decision tree on this pattern?
 First, we check the value of Patrons =
Some
 Where do we go next? To the middle
child
 What happens next?
 Leaf node. Output is will wait.
Decision Trees

 Example: What is the output of the decision tree on this pattern?
 First, we check value of Patrons = Full
 Where do we go next? To the right child.
 Next check: value of WaitEstimate? 10-
30
 Where do we go next? To the second-
from-the-right child.
 Next check: value of Hungry? False
 Where do we go next? To the left child.
 We arrived at a leaf node
 Output: will wait.
Decision Trees
 At this point, it should be clear how to
apply a decision tree to classify a
pattern.
 Obviously, there are lots and lots of
different decision trees that we can
come up with.
Decision Trees
 Here is a different decision tree.
 The natural question is: How can
we construct a good decision
tree?
Decision Tree Learning
 DTL is the pseudocode that the textbook provides for learning a decision tree.
 It looks simple and short, but (as
in TT-Entails?) there are a lot of
details we should look into.
Decision Tree Learning
 Notice that the function is
recursive (the line of the
recursive call is highlighted).
 This function builds the entire
tree, and its recursive calls build
each individual subtree, and
each individual leaf node.
Decision Tree Learning
 Examples: A set of training
examples.
 Remember, each training
example is a pair, consisting of a
pattern and a class label.
 Attributes: A list of attributes
that we can choose to test.
 Parent_examples: A default class
to output if no better choice is
available (details later).
Decision Tree Learning
 The function returns a Decision
tree.
 However, notice the highlighted
line, that says if there are no
examples, we should return
parent_examples. What does it
mean?
 It means we should return a leaf
node, that outputs class
parent_examples.
Decision Tree Learning
 This is the only reason we have
a parent_examples argument to
the DTL function.
 If there are no examples,
obviously we need to create a
leaf node.
 The parent_examples argument
tells us what class to store at
the leaf node.
Decision Tree Learning
 First base case: the examples are
empty. As discussed before, we
return a leaf node with output class
parent_examples.
 Second base case: all examples
have the same class. We just return
a leaf node with that class as its
output.
 Third base case: attributes is empty.
We have run out of questions to ask.
 In this case, we have to return a leaf
node. What class should we store
there?
 We output the class Plurality-Value(examples).
 Plurality-Value is an auxiliary function that returns the most frequent class among the examples.
Decision Tree Learning
 The highlighted code shows the
recursive case.
 The first thing we do is call Importance.
 Importance is an auxiliary function
that chooses the attribute we should
check at this node.
 We will talk A LOT about the
Importance function, a bit later.
 For now, just accept that this
function will do its job and choose
an attribute, which we store at
variable A.
Decision Tree Learning
 Here we create the tree that we are
going to return.
 We store in that tree (probably in
some member variable) the fact
that we will be testing attribute A.
 Next???
 Next, we need to create the children
of that tree.
 How?
 With recursive calls to DTL, with
appropriate arguments.
 How many children do we
create?
 As many as the values of attribute
A.
Decision Tree Learning
 The highlighted loop creates the
children (the subtrees of tree).
 Each iteration creates one child.
Decision Tree Learning
 Remember, each child (subtree) corresponds to a value vi of the chosen attribute A.
 To create that child, we need to call
DTL with appropriate values for
examples, attributes, and
parent_examples.
 What examples should we use?
 The subset of examples where A=vi.
Decision Tree Learning
 What attributes do we use?
 Everything in the attributes variable,
except for A.
 Why are we leaving A out?
 What happens if we use the
same attribute twice in a path?
Using an Attribute Twice in a Path
 What happens the second time we use Patrons as the test attribute in this example?
 It is useless. All patterns reaching this node already have Patrons = Full.
 Therefore, all patterns will go to the
right subtree.
Decision Tree Learning: Recursive
Case
 Finally, in our recursive call to DTL, what
should be the value of parent_examples?
 This will be used as output for a leaf node, if
exs are empty.
 First of all, why would exs ever be empty?
 It may just happen that no training examples
have A=vk.
 So, what should be the value of
parent_examples?
 It should be Plurality-Value(examples): the
most frequent class among the examples.
Decision Tree Learning: Recursive
Case
 Once the recursive call to DTL has returned a
subtree, we add that to our tree.
 Finally, once all recursive calls to DTL have
returned, we return the tree we have created.
 Are we done?
 Almost, except that we still need to talk about how Importance works.
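
 Before that, here is the whole learner in one place, as a minimal Python sketch of the DTL pseudocode (my own rendering, not the textbook's exact code); it reuses the Leaf and Node classes from the earlier classification sketch, ATTRIBUTE_VALUES is an assumed lookup table of each attribute's possible values, and importance is a placeholder until information gain is defined:

from collections import Counter

ATTRIBUTE_VALUES = {"Patrons": ["None", "Some", "Full"]}   # ...one entry per attribute

def plurality_value(examples):
    """Most frequent class label among a list of (pattern, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def importance(attributes, examples):
    return attributes[0]          # placeholder: replaced by information gain later

def dtl(examples, attributes, parent_class):
    # parent_class plays the role of the slides' parent_examples argument:
    # the default class to put in a leaf when no examples are left.
    if not examples:                                  # base case 1: no examples
        return Leaf(parent_class)
    labels = {label for _, label in examples}
    if len(labels) == 1:                              # base case 2: all one class
        return Leaf(labels.pop())
    if not attributes:                                # base case 3: no attributes left
        return Leaf(plurality_value(examples))

    A = importance(attributes, examples)              # choose the attribute to test
    children = {}
    for v in ATTRIBUTE_VALUES[A]:                     # one child per value of A
        exs = [(p, c) for p, c in examples if p[A] == v]
        remaining = [a for a in attributes if a != A]        # never reuse A on this path
        children[v] = dtl(exs, remaining, plurality_value(examples))
    return Node(A, children)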
Choosing an Attribute
Choosing an Attribute

 Here we see two different attributes chosen at the root of the decision
tree.
 At each node, green stands for a training example that waited, red stands
for a training example that did not wait.
 Which attribute seems better (more useful) to you?
 The Patrons attribute is more useful because it separates the greens from the reds better.
Choosing an Attribute

 How can we quantify how well an attribute separates training examples from different classes?
 We need to define two new quantities:
 Entropy.
 Information gain.
 These quantities are computed with specific formulas.
 Information gain will be used to choose the best attribute.
Entropy
Entropy
 Suppose a group of friends is trying to decide which
movie they can watch together on Sunday.
 There are 2 choices for movies, one is “Lucy” and the
second is “Titanic” and now everyone has to tell
their choice.
 After everyone gives their answer we see that “Lucy”
gets 4 votes and “Titanic” gets 5 votes.
 Which movie do they watch now?
 Isn’t it hard to choose 1 movie now because the votes
for both movies are somewhat equal?
Entropy
 This is exactly what we call disorder: there is a roughly equal number of votes for both movies, and we can’t really decide which movie to watch.
 It would have been much easier if the votes for “Lucy” were 8 and the votes for “Titanic” were 2.
 Then we could easily say that the majority of votes are for “Lucy”, hence everyone will be watching that movie.
 In a decision tree, the output is mostly “yes” or “no”.
Entropy

 Entropy is a measure of the uncertainty, or disorder, in our dataset.
 Decision trees use Entropy to measure the impurity of
the node.
 Impurity is the degree of randomness; it tells how
random our data is.
Entropy- Two Class Example

 Suppose that we have a set X of training examples.


 K1 examples have class label A.
 K2 examples have class label B.
 Let K = K1 + K2.
 Then the entropy of the set X depends only on the two ratios K1/K and K2/K.
 The entropy H is defined as:
   H(X) = −(K1/K)·log₂(K1/K) − (K2/K)·log₂(K2/K)
 Note: logarithms in this discussion are always base 2.


Entropy – General Example

 In the general case:


 Suppose that we have a set X of training examples.
 Suppose there are N different class labels L1, …, LN.
 K1 examples have class label L1.
 K2 examples have class label L2.
 In general, Ki examples have the class label Li.
 Let K = K1 + K2 + … + KN.
 Then the entropy H of the set X is given by this formula:
   H(X) = −(K1/K)·log₂(K1/K) − (K2/K)·log₂(K2/K) − … − (KN/K)·log₂(KN/K)
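
 As a sanity check of this formula, here is a small Python sketch (illustrative, not part of the slides) that computes H from a list of class counts; it reproduces the values in the examples that follow:

import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts Ki.
    H = -sum_i (Ki/K) * log2(Ki/K) with K = sum of all Ki; classes with
    Ki = 0 contribute nothing, since x * log2(x) -> 0 as x -> 0."""
    total = sum(counts)
    return sum((k / total) * math.log2(total / k) for k in counts if k > 0)

print(entropy([200, 200]))    # 1.0    (Example 1: even two-class split)
print(entropy([20, 500]))     # ~0.24  (Example 2)
print(entropy([20, 5000]))    # ~0.04  (Example 3)
print(entropy([0, 5000]))     # 0.0    (Example 4: all examples in one class)
print(entropy([50, 15, 25]))  # ~1.42  (Example 5: three classes, can exceed 1)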
Making Sense of Entropy
 If this is the first time you see the definition of entropy, it probably does not
look very intuitive.
 We will look at several specific examples so that you can see how entropy
behaves.
 The lowest possible entropy value is 0.
 The lower the entropy is, the more uneven the distribution of classes is.
 Zero entropy means all training examples have the same class.
 The highest possible entropy value is log₂N (N = number of classes).
 The higher the entropy is, the more even the distribution of classes is.
 Entropy log₂N means that the number of training examples for each class is equal: K1 = K2 = … = KN.
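
 Using the entropy function sketched above, both bounds are easy to check numerically (the counts here are made up for illustration):

import math

# Reusing entropy() from the earlier sketch.
print(entropy([100]))               # 0.0: every example has the same class (lowest value)
print(entropy([25, 25, 25, 25]))    # 2.0: four evenly split classes (highest value)
print(math.log2(4))                 # 2.0 = log2(N) for N = 4, matching the bound above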
Example 1

 We have a set X of training examples, of two classes.


 200 examples have class label A.
 200 examples have class label B.
 What is the entropy of X?

 The classes are evenly split: K1/K = K2/K = 1/2.
 Therefore, H = −(1/2)·log₂(1/2) − (1/2)·log₂(1/2) = 1, the largest possible value (log₂2 = 1).
Example 2

 We have a set X of training examples, of two classes.


 20 examples have class label A.
 500 examples have class label B.
 What is the entropy of X?

 The classes are quite unevenly split.
 Therefore, H has a smaller value, closer to 0:
   H = −(20/520)·log₂(20/520) − (500/520)·log₂(500/520) ≈ 0.24
Example 3

 We have a set X of training examples, of two classes.


 20 examples have class label A.
 5000 examples have class label B.
 What is the entropy of X?

 The classes are even more unevenly split.
 Therefore, H is even smaller than in the previous example:
   H = −(20/5020)·log₂(20/5020) − (5000/5020)·log₂(5000/5020) ≈ 0.04
Example 4

 We have a set X of training examples, of two classes.


 0 examples have class label A.
 5000 examples have class label B.
 What is the entropy of X?

 This is an interesting case: the term for class A has the form 0 · log₂0, a product of 0 and −infinity. Who wins, 0 or infinity?
 We will skip the mathematical proof, but 0 wins, so H = 0.
 This makes sense: the most uneven split possible gives the lowest H possible.
Example 5

 We have a set X of training examples, of two classes.


 50 examples have class label A.
 15 examples have class label B.
 25 examples have class label C.
 What is the entropy of X?

 Since we have three classes, the entropy can be greater than 1 (up to log₂3 ≈ 1.58):
   H = −(50/90)·log₂(50/90) − (15/90)·log₂(15/90) − (25/90)·log₂(25/90) ≈ 1.42
Information Gain
Information Gain

 Information gain measures the reduction of uncertainty given some feature.
 It also helps to decide the attribute that should be
selected as a decision node or root node.

 It is just the entropy of the full dataset minus the (weighted average) entropy of the dataset given some feature:
   Gain = H(E) − Σi (|Ei|/|E|) · H(Ei), where the Ei are the subsets produced by splitting on that feature.
Information Gain

 It tells us how much information a feature/attribute provides about the class.
 The splitting of nodes, and hence the building of the decision tree, is driven by the value of the information gain.
 The decision tree always tries to maximize the information gain: the attribute with the highest information gain is chosen for the split first.
Information Gain – Behaviour

 Overall, we like attributes that yield as high an information gain as possible.


 To get that, we want the entropy of the subsets Ei to be small.
 This happens when the attribute splits the training examples nicely so
that different classes get concentrated on different subsets Ei.
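
 Continuing the Python sketch (illustrative only; the example counts below are made up, apart from the gym example's 16/14 split quoted on the following slides), information gain can be computed directly from the class counts of the full set E and of each subset Ei:

import math

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return sum((k / total) * math.log2(total / k) for k in counts if k > 0)

def information_gain(parent_counts, child_counts):
    """Gain = H(E) minus the weighted average of H(Ei) over the subsets Ei,
    where each subset's weight is its share of the parent's examples."""
    total = sum(parent_counts)
    weighted = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - weighted

print(entropy([16, 14]))                            # ~0.99: the gym example's parent entropy
print(information_gain([8, 8], [[8, 0], [0, 8]]))   # 1.0: a perfect split (pure subsets)
print(information_gain([8, 8], [[4, 4], [4, 4]]))   # 0.0: a useless split (no separation)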
Information Gain
 Suppose our entire population has a total of 30
instances.
 The dataset is to predict whether the person will go to
the gym or not.
 Let’s say 16 people go to the gym and 14 people
don’t.
 Now we have two features to predict whether he/she
will go to the gym or not.
 Feature 1 is “Energy” which takes two values “high”
and “low”
 Feature 2 is “Motivation” which takes 3 values “No
motivation”, “Neutral” and “Highly motivated”.
Feature 1
 Let’s calculate the entropy: with 16 “go” and 14 “don’t” out of 30 examples, the parent entropy is E(Parent) = −(16/30)·log₂(16/30) − (14/30)·log₂(14/30) ≈ 0.99; the entropy of each child node (“high” and “low”) is computed the same way from its own counts.
Feature 1
 Now calculate the weighted average of the entropy of each child node, weighting each child by its share of the 30 examples:
   E(Parent|Energy) ≈ 0.62
 Now that we have E(Parent) and E(Parent|Energy), the information gain will be:
   Information Gain = E(Parent) − E(Parent|Energy) ≈ 0.99 − 0.62 = 0.37
 Our parent entropy was near 0.99, and looking at this value of the information gain, we can say that the entropy of the dataset will decrease by 0.37 if we make “Energy” our root node.
Feature 2

 Let us calculate the entropy here in the same way: the entropy of each child node (“No motivation”, “Neutral”, “Highly motivated”) is computed from that node’s own counts.


Feature 2
 Now calculate the weighted average of the entropy of each child node, exactly as before:
   E(Parent|Motivation) ≈ 0.86
 Now that we have E(Parent) and E(Parent|Motivation), the information gain will be:
   Information Gain = E(Parent) − E(Parent|Motivation) ≈ 0.99 − 0.86 = 0.13
 Our parent entropy was near 0.99, and looking at this value of the information gain, we can say that the entropy of the dataset will decrease by 0.13 if we make “Motivation” our root node.
Information Gain
 We now see that the “Energy” feature gives a larger reduction (0.37) than the “Motivation” feature (0.13).
 Hence, we will select the feature that has the highest
information gain and then split the node based on that
feature.
 In this example “Energy” will be our root node and we’ll do
the same for sub-nodes.
Information Gain
 Here we can see that when Energy is “high” the entropy is low, and hence we can say a person will go to the gym if they have high energy. But what if the energy is low?
 We will again split the node based on the new feature
which is “Motivation”.
Information Gain – Example 1

 What is the information gain in this example?
 H(E) = ???
 H(E1) = ???
 H(E2) = ???
 H(E3) = ???
Information Gain – Example 1
 What is the information gain in this
example?
 H(E) = 1, since classes are evenly split at
the top.
 H(E1) = 0, since we only have will not wait
cases on the left child.
 H(E2) = 0, since we only have will wait
cases on the middle child.
 H(E3) > 0, since the right child contains a mix of will wait and will not wait cases; its value comes from the two-class entropy formula.
 The information gain is H(E) minus the weighted average of H(E1), H(E2), and H(E3), and it is well above 0 here because two of the three children are pure.
Information Gain – Example 2
 What is the information gain in
this example?
 H(E) = ???
 H(E1) = ???
 H(E2) = ???
 H(E3) = ???
 H(E4)=???
Information Gain – Example 2
 What is the information gain in
this example?
 H(E) = 1, since classes are
evenly split.
 H(E1) = H(E2) = H(E3) = H(E4) = 1, since the classes at all children are also evenly split.
 Therefore the information gain is 1 − 1 = 0: this attribute tells us nothing about the class.
Importance
 We can now finally specify what
Importance does:
 Importance returns the attribute
that achieves the highest
information gain on the given
examples.
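
 In code, Importance is just an argmax over information gain; this sketch (illustrative, reusing information_gain from the earlier sketch and the (pattern, label) example representation used in the DTL sketch) would fill in the placeholder left there:

from collections import Counter

def class_counts(examples):
    """List of per-class counts for a list of (pattern, label) pairs."""
    return list(Counter(label for _, label in examples).values())

def importance(attributes, examples):
    """Return the attribute with the highest information gain on these examples."""
    def gain(attribute):
        # Group the examples by their value for this attribute...
        groups = {}
        for pattern, label in examples:
            groups.setdefault(pattern[attribute], []).append((pattern, label))
        # ...then compare the parent entropy with the weighted child entropies.
        return information_gain(class_counts(examples),
                                [class_counts(g) for g in groups.values()])
    return max(attributes, key=gain)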
Summary

 Decision trees are a popular pattern recognition method.


 Learning a decision tree is done by recursively:
 Picking an attribute at the root.
 Sending the training examples to different children of the root, based on
their values of the chosen attribute.
 Learning each of the children.
 Choosing attributes is based on information gain.
 We prefer attributes that concentrate different classes on different
children.
