2 ML Ch3 Decision Trees Final
Lecture Slides 2
n In these slides:
¨ you will strengthen your understanding of the introductory
concepts of machine learning
¨ learn about the decision tree approach, which is one of the
fundamental approaches in machine learning
n Decision trees are also the basis of a new method called
“random forest” which we will see towards the end of the
course.
n Note that you can skip the slides marked Advanced; they contain
extra information
Decision Trees
n One of the most widely used and practical methods for
inductive inference
(Figure: example decision tree with a "Current Debt?" test node)
n Leaves
¨ Classification: Class labels, or proportions
¨ Regression: numeric output; e.g. the average of the outputs at the leaf, or a local fit
n AB + AC + AD + BC + BD + CD
Decision tree learning algorithm
n For a given training set, there are many trees that encode it
without any error
(Figure: the same node containing 25+ and 25- examples, split by two different candidate tests, A1 < …? and A2 < …?, each with true/false branches)
n Show the high school form example with gender field
Entropy of a Binary Random Variable
n Entropy measures the impurity of S:
Entropy(S) = − p × log2(p) − (1 − p) × log2(1 − p)
Entropy(X) = 0.1 × lg(1 / 0.1) + (1 − 0.1) × lg(1 / (1 − 0.1)) ≈ 0.47
Entropy – General Case
n When the random variable has multiple possible outcomes, its
entropy becomes:
Entropy(S) = − Σ_{i=1}^{c} p_i × log2(p_i), where p_i is the proportion of outcomes in class i
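n A minimal sketch in Python of the entropy computation above (the entropy function name and the list-of-proportions interface are illustrative assumptions, not from the slides):

```python
import math

def entropy(proportions):
    """Entropy in bits of a distribution given as class proportions p_i (summing to 1)."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.1, 0.9]))   # the binary example above, about 0.47 bits
print(entropy([0.5, 0.5]))   # maximum impurity for a binary variable: 1.0 bit
print(entropy([1.0]))        # a pure node: entropy 0
```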
We would select the Humidity attribute to split the root node, as it has the higher
information gain.
Selecting the Next Attribute
n Computing the information gain for each attribute, we selected the Outlook
attribute as the first test, resulting in the following partially learned tree:
n We can repeat the same process recursively, until the stopping conditions are
satisfied.
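n A hedged sketch of the information-gain computation used to pick attributes such as Outlook (the dictionary-of-attributes data layout and the function names are assumptions for illustration, not from the slides):

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy of the class labels of a set of examples."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over values v of (|S_v|/|S|) * Entropy(S_v)."""
    total = len(labels)
    remainder = 0.0
    for v in {ex[attribute] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == v]
        remainder += (len(subset) / total) * entropy_of_labels(subset)
    return entropy_of_labels(labels) - remainder

# At each node, the attribute with the largest gain is chosen, e.g.:
# best = max(attributes, key=lambda a: information_gain(S, y, a))
```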
Partially learned tree
Until stopped:
n Select one of the unused attributes to partition the
remaining examples at each non-terminal node, using
only the training samples associated with that node
(a recursive sketch in code follows the stopping criteria below)
Stopping criteria:
n each leaf node contains examples of only one class
n the algorithm has run out of attributes
n …
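n The recursive sketch mentioned above, assuming the information_gain function from the earlier sketch and examples stored as attribute-value dictionaries (both are illustrative assumptions):

```python
from collections import Counter

def id3(examples, labels, attributes):
    """Grow a tree until one of the stopping criteria above holds."""
    if len(set(labels)) == 1:                      # all examples of one class -> leaf
        return labels[0]
    if not attributes:                             # ran out of attributes -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    # Pick the unused attribute with the highest information gain at this node
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    children = {}
    for v in {ex[best] for ex in examples}:
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        children[v] = id3([examples[i] for i in idx],
                          [labels[i] for i in idx],
                          [a for a in attributes if a != best])
    return (best, children)                        # internal node: (test attribute, branches)
```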
Advanced: Other measures of impurity
n Entropy is not the only measure of impurity. If a function satisfies
certain criteria, it can be used as a measure of impurity.
n For binary random variables (only the value of p is indicated), we have:
¨ Gini index: 2p(1-p)
n p=0.5 Gini Index=0.5
n p=0.9 Gini Index=0.18
n p=1 Gini Index=0
n p=0 Gini Index=0
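n A tiny check of the Gini values listed above (the function name is illustrative):

```python
def gini_binary(p):
    """Gini index 2p(1-p) of a binary variable with positive-class proportion p."""
    return 2 * p * (1 - p)

for p in (0.5, 0.9, 1.0, 0.0):
    print(p, round(gini_binary(p), 2))   # 0.5, 0.18, 0.0, 0.0
```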
Continuous Values
Missing Attributes
…
Continuous Valued Attributes
n Create a discrete attribute to test continuous variables
Temperature = 82.5
(Temperature > 72.3) = t, f
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
Incorporating continuous-valued attributes
n Where to cut?
¨ We can show that the best threshold always lies at the transitions
between the two classes (shown as red boundaries in the figure); see the sketch below
(Figure: a continuous-valued attribute with candidate thresholds at the class transitions)
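n A small sketch of finding the candidate cut points at class transitions, using the Temperature/PlayTennis values from the slide above (the function name is an illustrative assumption):

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, l1), (v2, l2) in zip(pairs, pairs[1:])
            if l1 != l2]

temps = [40, 48, 60, 72, 80, 90]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, play))   # [54.0, 85.0]
# Each cut c defines a boolean attribute (Temperature > c); the cut with the
# highest information gain is kept.
```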
Advanced: Split Information?
n In each tree, the leaves contain samples of only one class
(e.g. 50+, 10+, 10-, etc.).
¨ Hence, the remaining entropy is 0 in each one.
(Figure: two trees over the same 100 examples, one splitting on attribute A and one on A2, with pure leaves such as 10 positive examples)
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) = − Σ_{i=1}^{c} (|S_i| / |S|) × lg(|S_i| / |S|)

where S_1 … S_c are the subsets of S produced by partitioning S on the c values of attribute A.
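n A sketch of these two quantities in Python, reusing the information_gain sketch from earlier (an assumption for illustration, not the slides' code):

```python
import math
from collections import Counter

def split_information(examples, attribute):
    """SplitInformation(S, A) = - sum_i (|S_i|/|S|) * lg(|S_i|/|S|)."""
    total = len(examples)
    counts = Counter(ex[attribute] for ex in examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(examples, labels, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    si = split_information(examples, attribute)
    return information_gain(examples, labels, attribute) / si if si > 0 else 0.0
```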
(Figure: a new subtree testing Wind, whose Strong branch separates the Yes/No examples)
n ID3 (the greedy algorithm that was outlined) will make a new split
and will classify future examples following the new path as negative.
¨ Pruning: replacing a subtree with a leaf labeled with the most common
classification in the subtree.
¨ …
Reduced-Error Pruning (Quinlan 1987)
n Split the data into training and validation sets (a pruning sketch in code follows below)
¨ Sort the pruned rules by accuracy and use them in that order.
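n A compact sketch of reduced-error pruning under an assumed tree representation (each internal node is a dict storing its test attribute, its branches, and the majority label of the training examples that reached it; none of these names come from the slides):

```python
def classify(node, example):
    # Leaves are plain class labels; internal nodes are dicts.
    while isinstance(node, dict) and not node.get("pruned"):
        node = node["children"].get(example[node["attr"]], node["majority"])
    return node["majority"] if isinstance(node, dict) else node

def accuracy(tree, examples, labels):
    hits = sum(classify(tree, ex) == lab for ex, lab in zip(examples, labels))
    return hits / len(labels)

def reduced_error_prune(tree, node, val_examples, val_labels):
    """Bottom-up: tentatively replace each subtree by its majority-label leaf and
    keep the replacement if accuracy on the validation set does not drop."""
    if not isinstance(node, dict):
        return
    for child in node["children"].values():
        reduced_error_prune(tree, child, val_examples, val_labels)
    before = accuracy(tree, val_examples, val_labels)
    node["pruned"] = True                      # node now acts as a leaf labeled node["majority"]
    if accuracy(tree, val_examples, val_labels) < before:
        node["pruned"] = False                 # pruning hurt validation accuracy: undo
```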
¨ One feature where the input space is divided into 3 bins: