Decision Trees

Lecture 9 & 10
Introduction to Decision Trees
Decision Trees
• A hierarchical data structure that represents data by implementing a divide-and-conquer strategy
• Can be used as a non-parametric classification and regression method
• Given a collection of examples, learn a decision tree that represents it
• Use this representation to classify new examples

[Figure: a small decision tree whose leaves are labeled C, B, A]

4
The Representation
• Decision Trees are classifiers for instances represented as feature vectors
  • e.g., color = {red, blue, green}; shape = {circle, triangle, rectangle}; label = {A, B, C}
• Nodes are tests for feature values
• There is one branch for each value of the feature
• Leaves specify the category (label)
• Can categorize instances into multiple disjoint categories

[Figure: a decision tree with root Color, internal Shape nodes, and leaves labeled A, B, C; side annotations contrast "Learning a Decision Tree" with "Evaluation of a Decision Tree"]

5
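To make this representation concrete, here is a minimal Python sketch (not from the slides) of a tree over the color/shape features above, together with the evaluation loop that walks an instance from the root to a leaf. The particular branch-to-label assignments are illustrative, since only the feature and label sets are fixed by the slide.

# Internal nodes test one feature and have one child per feature value;
# leaves carry a label. The branch-to-label assignments are illustrative.
TREE = {
    "feature": "color",
    "children": {
        "red":   {"label": "C"},
        "blue":  {"feature": "shape",
                  "children": {"circle":    {"label": "B"},
                               "triangle":  {"label": "A"},
                               "rectangle": {"label": "B"}}},
        "green": {"feature": "shape",
                  "children": {"circle":    {"label": "A"},
                               "triangle":  {"label": "C"},
                               "rectangle": {"label": "B"}}},
    },
}

def evaluate(node, instance):
    """Follow the branch matching the instance's value for each tested
    feature until a leaf is reached, then return its label."""
    while "label" not in node:
        node = node["children"][instance[node["feature"]]]
    return node["label"]

print(evaluate(TREE, {"color": "blue", "shape": "circle"}))   # -> B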
Expressivity of Decision Trees
• As Boolean functions, decision trees can represent any Boolean function.
• They can be rewritten as rules in Disjunctive Normal Form (DNF):
  • Green ∧ Square → positive
  • Blue ∧ Circle → positive
  • Blue ∧ Square → positive
• The disjunction of these rules is equivalent to the decision tree.
• What did we show? What is the hypothesis space here?
  • 2 dimensions: color and shape
  • 3 values each: color (red, blue, green), shape (triangle, square, circle)
  • |X| = 9: (red, triangle), (red, circle), (blue, square), …
  • |Y| = 2: + and -
  • |H| = 2^9

[Figure: the Color/Shape decision tree from the previous slide]

6
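Spelling out the count in the last bullet: each of the |X| = 9 possible instances can be labeled either + or -, so

    |H| = |Y|^|X| = 2^9 = 512

distinct labelings (Boolean functions) exist over this instance space, and by the first bullet a decision tree over these two features can represent every one of them.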
Decision Trees
• Output is a discrete category. Real-valued outputs are possible (regression trees).
• There are efficient algorithms for processing large amounts of data (but not too many features).
• There are methods for handling noisy data (classification noise and attribute noise) and for handling missing attribute values.

[Figure: the Color/Shape decision tree again]

7
Decision Boundaries
• Usually, instances are represented as attribute-value pairs (color = blue, shape = square, +)
• Numerical values can be used either by discretizing or by using thresholds for splitting nodes
• In this case, the tree divides the feature space into axis-parallel rectangles, each labeled with one of the labels

[Figure: a 2-D feature space over X and Y partitioned into axis-parallel rectangles of + and - points, shown next to the corresponding tree of threshold tests (X < 3, Y > 7, Y < 5, X < 1) with yes/no branches]

8
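A short Python sketch of how threshold tests produce axis-parallel decision regions. The tests (X < 3, Y > 7, Y < 5, X < 1) are taken from the figure; how they nest and which label sits at each leaf did not survive extraction, so the structure and labels below are illustrative.

def classify(x, y):
    """Each root-to-leaf path corresponds to one axis-parallel rectangle
    in the (X, Y) plane; thresholds from the figure, labels illustrative."""
    if x < 3:
        if y > 7:
            return "+"                 # rectangle: x in (-inf, 3), y in (7, inf)
        return "+" if x < 1 else "-"   # the strip x < 3, y <= 7, split at x = 1
    return "+" if y < 5 else "-"       # right half-plane (x >= 3), split at y = 5

print(classify(2, 8), classify(0.5, 4), classify(2, 4), classify(4, 2))   # -> + + - +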
Today's key concepts
• Learning decision trees (the ID3 algorithm)
  • Greedy heuristic (based on information gain)
  • Originally developed for discrete features
• Overfitting
  • What is it? How do we deal with it?
• Some extensions of DTs
• Principles of experimental ML

9
Administration
• There is no waiting list anymore; everyone who wanted to be in is in.
• Recitations
• Quizzes
• Questions?
  • Please ask/comment during class.

10
Learning Decision Trees (the ID3 Algorithm)

Decision Trees
• Can represent any Boolean function
• Can be viewed as a way to compactly represent a lot of data
• Natural representation (think of the game of 20 questions)
• Evaluation of the decision tree classifier is easy
• Clearly, given data, there are many ways to represent it as a decision tree
• Learning a good representation from the data is the challenge

[Figure: the "play tennis" tree: Outlook at the root; Sunny → Humidity (High: No, Normal: Yes); Overcast → Yes; Rain → Wind (Strong: No, Weak: Yes)]

12
Will I play tennis today?
• Features
• Outlook: {Sun, Overcast, Rain}
• Temperature: {Hot, Mild, Cool}
• Humidity: {High, Normal, Low}
• Wind: {Strong, Weak}

• Labels
• Binary classification task: Y = {+, -}

13
Will I play tennis today?

    O  T  H  W  Play?
 1  S  H  H  W   -
 2  S  H  H  S   -
 3  O  H  H  W   +
 4  R  M  H  W   +
 5  R  C  N  W   +
 6  R  C  N  S   -
 7  O  C  N  S   +
 8  S  M  H  W   -
 9  S  C  N  W   +
10  R  M  N  W   +
11  S  M  N  S   +
12  O  M  H  S   +
13  O  H  N  W   +
14  R  M  H  S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

14
Basic Decision Tree Learning Algorithm
• Data is processed in batch (i.e., all the data is available). Algorithm?
• Recursively build a decision tree top-down.

[Slide graphic: the 14-example table alongside the resulting tree: Outlook at the root; Sunny → Humidity (High: No, Normal: Yes); Overcast → Yes; Rain → Wind (Strong: No, Weak: Yes)]
Basic Decision Tree Algorithm
• Let S be the set of examples
• Label is the target attribute (the prediction)
• Attributes is the set of measured attributes

ID3(S, Attributes, Label)   (Iterative Dichotomiser)
  If all examples are labeled the same, return a single-node tree with that Label.
  Otherwise begin:
    A = the attribute in Attributes that best classifies S; create a Root node for the tree that tests A.
    For each possible value v of A:
      Add a new tree branch corresponding to A = v.
      Let Sv be the subset of examples in S with A = v.
      If Sv is empty: add a leaf node with the most common value of Label in S
        (why? so the tree can still answer at evaluation time, when an instance with this value reaches the branch).
      Else: below this branch add the subtree ID3(Sv, Attributes - {A}, Label).
  End
  Return Root

16
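Below is a compact Python sketch of this recursion, assuming discrete-valued features. The best_attribute argument is the selection heuristic, deliberately left as a parameter here because the information-gain criterion is only developed in the next slides; the function and parameter names are my own.

from collections import Counter

def id3(examples, attributes, domains, best_attribute):
    """examples:   list of (feature_dict, label) pairs (the set S)
    attributes: feature names still available for testing
    domains:    dict mapping each feature to its set of possible values
    best_attribute(examples, attributes) -> feature to test next"""
    labels = [y for _, y in examples]

    # All examples are labeled the same: return a single-node tree.
    if len(set(labels)) == 1:
        return {"label": labels[0]}
    # No attributes left to test: fall back to the majority label.
    if not attributes:
        return {"label": Counter(labels).most_common(1)[0][0]}

    a = best_attribute(examples, attributes)
    root = {"feature": a, "children": {}}
    for v in domains[a]:                                   # one branch per value of A
        s_v = [(x, y) for x, y in examples if x[a] == v]
        if not s_v:
            # Empty subset: leaf with the most common label in S,
            # so the tree can still answer at evaluation time.
            root["children"][v] = {"label": Counter(labels).most_common(1)[0][0]}
        else:
            root["children"][v] = id3(s_v,
                                      [b for b in attributes if b != a],
                                      domains, best_attribute)
    return root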
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
• But, finding the minimal decision tree consistent with the data is NP-hard
• The recursive algorithm is a greedy heuristic search for a simple
tree, but cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next
attribute to condition on.

17
Picking the Root Attribute
• Consider data with two Boolean attributes (A, B). Instances and labels:
  (A=0, B=0), - : 50 examples
  (A=0, B=1), - : 50 examples
  (A=1, B=0), - : 0 examples
  (A=1, B=1), + : 100 examples
• What should be the first attribute we select?
• Splitting on A: we get purely labeled nodes.
• Splitting on B: we don't get purely labeled nodes.
• What if we have (A=1, B=0), - : 3 examples?
• (One way to think about it: the number of queries required to label a random data point.)

[Figure: splitting on A (A=1 → +, A=0 → -) vs. splitting on B (B=1 → an A test → + / -, B=0 → -)]

18
Picking the Root Attribute
• Consider data with two Boolean attributes (A, B).
  <(A=0, B=0), ->: 50 examples
  <(A=0, B=1), ->: 50 examples
  <(A=1, B=0), ->: 3 examples (was 0 on the previous slide)
  <(A=1, B=1), +>: 100 examples
• What should be the first attribute we select?
• The trees look structurally similar; which attribute should we choose?
• Advantage: A. But we need a way to quantify things.
• One way to think about it: the number of queries required to label a random data point.
• If we choose A we have less uncertainty about the labels.

[Figure: splitting on A first (A=1 → a B test with 100 + and 3 -; A=0 → 100 -) vs. splitting on B first (B=1 → an A test with 100 + and 50 -; B=0 → 53 -)]

19
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as possible (Occam's Razor)
• The main decision in the algorithm is the selection of the next attribute to condition on.
• We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node.
• The most popular heuristic is based on information gain, which originated with the ID3 system of Quinlan.

20
Entropy
• The entropy (impurity, disorder) of a set of examples S, relative to a binary classification, is

      Entropy(S) = -p_+ log2(p_+) - p_- log2(p_-)

  where p_+ is the proportion of positive examples in S and p_- is the proportion of negative examples in S.
• If all the examples belong to the same category: Entropy = 0
• If the examples are equally mixed (0.5, 0.5): Entropy = 1
• Entropy = level of uncertainty.
• In general, when p_i is the fraction of examples labeled i:

      Entropy(S) = - Σ_i p_i log2(p_i)

• Entropy can be viewed as the number of bits required, on average, to encode the class labels. If the probability of + is 0.5, a single bit is required per example; if it is 0.8, we can use less than 1 bit.

21
Information Gain
High entropy → high level of uncertainty
Low entropy → no uncertainty

• The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute:

      Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|Sv| / |S|) · Entropy(Sv)

  where:
  • Sv is the subset of S for which attribute a has value v, and
  • the entropy of partitioning the data is calculated by weighting the entropy of each partition by its size relative to the original set.
• Partitions of low entropy (imbalanced splits) lead to high gain.
• Go back to check which of the A, B splits is better.

[Figure: an Outlook node with branches Sunny, Overcast, Rain]

24
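As a worked answer to the last bullet, here is a sketch that plugs the counts from page 19 (100 positives and 103 negatives overall) into the gain formula; the rounded numbers in the final comment are approximate.

import math

def entropy(pos, neg):
    """Binary entropy, in bits, of a set with `pos` positives and `neg` negatives."""
    h, total = 0.0, pos + neg
    for c in (pos, neg):
        if c:
            h -= (c / total) * math.log2(c / total)
    return h

h_s = entropy(100, 103)   # entropy of all 203 examples

# Splitting on A: A=1 holds 100+ / 3-, A=0 holds 0+ / 100-.
gain_a = h_s - (103 / 203) * entropy(100, 3) - (100 / 203) * entropy(0, 100)

# Splitting on B: B=1 holds 100+ / 50-, B=0 holds 0+ / 53-.
gain_b = h_s - (150 / 203) * entropy(100, 50) - (53 / 203) * entropy(0, 53)

print(round(gain_a, 2), round(gain_b, 2))   # roughly 0.90 vs 0.32: A wins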
Will I play tennis today?

[The 14-example table repeated from page 14.]

• Calculate the current entropy:
  p_+ = 9/14, p_- = 5/14
  Entropy(Play) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94

25
Information Gain of Every Attribute
1. Calculate the entropy of the whole data set.
2. Calculate the entropy of the subset of examples for each attribute value; the attribute's gain is then the entropy from step 1 minus the size-weighted average of these subset entropies.
Entropy of the Entire Data Set
• Entropy of each Outlook attribute value (Sunny, Overcast, Rain), i.e., the entropy of the subset of the data with Outlook = Sunny, and so on.
• Entropy of each Temperature attribute value (Hot, Mild, Cool); combining these gives Gain(S, Temperature) ≈ 0.0289.
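Putting the pieces together, here is a sketch that transcribes the 14-example table and computes the information gain of every attribute. The gains in the closing comment are approximate; Gain(S, Temperature) comes out near the 0.0289 quoted above (0.0289 is what you get if the data-set entropy is first rounded to 0.94).

import math
from collections import Counter

# The 14 examples from the table: (Outlook, Temperature, Humidity, Wind, Play?)
DATA = [
    ("S", "H", "H", "W", "-"), ("S", "H", "H", "S", "-"), ("O", "H", "H", "W", "+"),
    ("R", "M", "H", "W", "+"), ("R", "C", "N", "W", "+"), ("R", "C", "N", "S", "-"),
    ("O", "C", "N", "S", "+"), ("S", "M", "H", "W", "-"), ("S", "C", "N", "W", "+"),
    ("R", "M", "N", "W", "+"), ("S", "M", "N", "S", "+"), ("O", "M", "H", "S", "+"),
    ("O", "H", "N", "W", "+"), ("R", "M", "H", "S", "-"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    """Entropy, in bits, of the Play? labels of a list of rows."""
    counts = Counter(row[-1] for row in rows)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(rows, idx):
    """Whole-set entropy minus the size-weighted entropy of the subsets
    induced by the values of the attribute in column `idx`."""
    total = len(rows)
    remainder = 0.0
    for v in {row[idx] for row in rows}:
        subset = [row for row in rows if row[idx] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

print(round(entropy(DATA), 2))                    # 0.94, as on page 25
for name, idx in ATTRS.items():
    print(name, round(gain(DATA, idx), 3))
# Approximately: Outlook 0.247, Temperature 0.029, Humidity 0.152,
# Wind 0.048 -> Outlook gives the largest gain and becomes the root.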
