2.0 - Decision Tree

The document discusses decision trees as a method of supervised learning, emphasizing their structure as a series of 'if-else' decisions based on features to classify data. It explains the process of learning decision stumps, which are simple decision trees with one rule, and how to evaluate their effectiveness using classification accuracy. Additionally, it outlines the computational costs associated with decision stumps and introduces the greedy recursive splitting method for building more complex decision trees.

EC9630: Machine Learning

Decision Trees
Data Representation and Exploration
Features
• Example-feature representation (samples: another name we'll use for examples):

  Age  Job?  City  Rating  Income
  23   Yes   Van   A       22,000.00
  23   Yes   Bur   BBB     21,000.00
  22   No    Van   CC      0.00
  25   Yes   Sur   AAA     57,000.00
Supervised Learning
• We discussed supervised learning:
Egg Milk Fish Wheat Shellfish Peanuts … Sick?
0 0.7 0 0.3 0 0 1
0.3 0.7 0 0.6 0 0.01 1
0 0 0 0.8 0 0 0
0.3 0.7 1.2 0 0.10 0.01 1
0.3 0 1.2 0.3 0.10 0.01 1

• Input for an example (day of the week) is a set of features (quantities of food).
• Output is a desired class label (whether or not we got sick).
• Goal of supervised learning:
– Use data to find a model that outputs the right label based on the features.
• Above, the model predicts whether foods will make you sick (even with new combinations).
– This framework can be applied to any problem where we have input/output examples.
Decision Trees
• Decision trees are simple programs consisting of:
– A nested sequence of “if-else” decisions based on the features (splitting rules).
– A class label as a return value at the end of each sequence.
• We can draw a sequence of decisions as a tree. Example decision tree:

  Milk > 0.5?
    True  -> Sick
    False -> Egg > 1?
               True  -> Sick
               False -> Not Sick
Supervised Learning as Writing A Program
• There are many possible decision trees.
– We’re going to search for one that is good at our supervised learning problem.

• So our input is data and the output will be a program.


– This is called “training” the supervised learning model.
– This is different from the usual input/output specification for writing a program.

• Supervised learning is useful when you have lots of labeled data BUT:
1. The problem is too complicated to write a program ourselves.
2. A human expert can't explain why you assign certain labels,
OR
2. We don't have a human expert for the problem.
Learning A Decision Stump: “Search and Score”
• We’ll start with "decision stumps”:
– Simple decision tree with 1 splitting rule based on thresholding 1 feature.

  Milk > 0.5?
    True  -> Sick
    False -> Not Sick

• To “learn” a decision stump we need to find 3 things:


– Which feature should we use to “split” the data?
– What value of the threshold should be used?
– What classes should we use at the leaves? (Relevant when we have more than two possible labels.)
Learning A Decision Stump: “Search and Score”
• To “learn” a decision stump:
1. Define a ‘score’ for each possible rule.
2. Search for the rule with the best score.

– Example: score several candidate stumps, such as (Egg > 0.5), (Milk > 0.1), and (Milk > 0.5) with different choices of leaf labels. With scores of 10, 45, 65, and 50, we keep the rule with the best score.

• Q: what “score” should we use?


Learning A Decision Stump: Accuracy Score
• Most intuitive score: classification accuracy.
– "If we use this rule, how many examples do we label correctly?"

  Example rule:  Egg > 1?   True -> Sick,  False -> Not Sick

  Milk  Fish  Egg  Sick?
  0.7   0     1    1
  0.7   0     2    1
  0     0     0    0
  0.7   1.2   0    0
  0     1.2   2    1
  0     0     0    0

• Computing classification accuracy for this rule:
– Find most common labels after applying splitting rule:
• When (egg > 1), we were “sick” 2 times out of 2.
• When (egg ≤ 1), we were “not sick” 3 times out of 4.
– Compute accuracy:
• The accuracy (“score”) of the rule is 5 times out of 6.
• This “score” evaluates quality of a rule.
– We “learn” a decision stump by finding the rule with the best score.
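To make this scoring step concrete, here is a minimal sketch (not from the slides; names like rule_accuracy are illustrative) that computes the accuracy score of the rule (egg > 1) on the table above, using NumPy:

    import numpy as np

    # Columns: Milk, Fish, Egg (features) and Sick (labels), as in the table above.
    X = np.array([[0.7, 0.0, 1],
                  [0.7, 0.0, 2],
                  [0.0, 0.0, 0],
                  [0.7, 1.2, 0],
                  [0.0, 1.2, 2],
                  [0.0, 0.0, 0]])
    y = np.array([1, 1, 0, 0, 1, 0])

    def rule_accuracy(X, y, feature, threshold):
        # Score of the stump (X[:, feature] > threshold): predict the most
        # common label on each side of the split, then count correct labels.
        satisfied = X[:, feature] > threshold
        yhat = np.zeros(len(y), dtype=int)
        for side in (satisfied, ~satisfied):
            if side.any():
                yhat[side] = np.bincount(y[side]).argmax()
        return np.mean(yhat == y)

    print(rule_accuracy(X, y, feature=2, threshold=1))  # (egg > 1) scores 5/6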
Learning A Decision Stump: By Hand
• Let's search for the decision stump maximizing classification score:

  Milk  Fish  Egg  Sick?
  0.7   0     1    1
  0.7   0     2    1
  0     1.2   0    0
  0.7   1.2   0    0
  0     1.3   2    1
  0     0     0    0

– First we check the "baseline rule" of predicting the mode (no split): this gets 3/6 accuracy.
– If (milk > 0) predict "sick" (2/3), else predict "not sick" (2/3): 4/6 accuracy.
– If (fish > 0) predict "not sick" (2/3), else predict "sick" (2/3): 4/6 accuracy.
– If (fish > 1.2) predict "sick" (1/1), else predict "not sick" (3/5): 4/6 accuracy.
– If (egg > 0) predict "sick" (3/3), else predict "not sick" (3/3): 6/6 accuracy.
– If (egg > 1) predict "sick" (2/2), else predict "not sick" (3/4): 5/6 accuracy.

• Highest-scoring rule: (egg > 0) with leaves “sick” and “not sick”.
• Notice we only need to test feature thresholds that happen in the data:
– There is no point in testing the rule (egg > 3), it gets the “baseline” score.
– There is no point in testing the rule (egg > 0.5), it gets the (egg > 0) score.
– Also note that we don’t need to test “<“, since it would give equivalent rules.
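The search itself is just a double loop over features and thresholds. A minimal sketch (assuming NumPy and the rule_accuracy sketch above; the function name fit_stump is made up for illustration):

    def fit_stump(X, y):
        # Try every rule (feature j, threshold t) where t is a value that actually
        # appears in the data, and keep the rule with the best accuracy score.
        best = (None, None, np.mean(y == np.bincount(y).argmax()))  # baseline: predict the mode
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                acc = rule_accuracy(X, y, feature=j, threshold=t)
                if acc > best[2]:
                    best = (j, t, acc)
        return best  # (feature index, threshold, training accuracy)

On the small allergy dataset above, this returns the rule (egg > 0) with a score of 6/6, matching the search by hand.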
Supervised Learning Notation (MEMORIZE THIS)
  Egg   Milk  Fish  Wheat  Shellfish  Peanuts  |  Sick?
  0     0.7   0     0.3    0          0        |  1
  0.3   0.7   0     0.6    0          0.01     |  1
  0     0     0     0.8    0          0        |  0
  0.3   0.7   1.2   0      0.10       0.01     |  1
  0.3   0     1.2   0.3    0.10       0.01     |  1

  X = the n x d block of feature values (left);  y = the n x 1 vector of labels (right).

• Feature matrix ‘X’ has rows as examples, columns as features.


– xij is feature ‘j’ for example ‘i’ (quantity of food ‘j’ on day ‘i’).
– xi is the list of all features for example ‘i’ (all the quantities on day ‘i’).
– xj is column ‘j’ of the matrix (the value of feature ‘j’ across all examples).
• Label vector ‘y’ contains the labels of the examples.
– yi is the label of example ‘i’ (1 for “sick”, 0 for “not sick”).
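As a sketch (not from the slides), this notation maps directly onto NumPy array indexing; the variable names below are illustrative:

    import numpy as np

    # Feature matrix X (n x d) and label vector y from the table above.
    X = np.array([[0.0, 0.7, 0.0, 0.3, 0.00, 0.00],
                  [0.3, 0.7, 0.0, 0.6, 0.00, 0.01],
                  [0.0, 0.0, 0.0, 0.8, 0.00, 0.00],
                  [0.3, 0.7, 1.2, 0.0, 0.10, 0.01],
                  [0.3, 0.0, 1.2, 0.3, 0.10, 0.01]])
    y = np.array([1, 1, 0, 1, 1])

    n, d = X.shape     # n = 5 examples, d = 6 features
    x_ij = X[1, 0]     # feature j=1 (Egg) for example i=2 (0-based indexing): 0.3
    x_i  = X[1, :]     # all features of example i=2: (0.3, 0.7, 0, 0.6, 0, 0.01)
    x_j  = X[:, 0]     # feature j=1 (Egg) across all examples: (0, 0.3, 0, 0.3, 0.3)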
Supervised Learning Notation (MEMORIZE THIS)
• Using the same feature matrix X (n x d) and label vector y (n x 1) as above:
– Rows of X (examples):
  x1 = (0, 0.7, 0, 0.3, 0, 0)
  x2 = (0.3, 0.7, 0, 0.6, 0, 0.01)
  x3 = (0, 0, 0, 0.8, 0, 0)
– Columns of X (one feature across all examples):
  x^1 = (0, 0.3, 0, 0.3, 0.3)        (Egg)
  x^2 = (0.7, 0.7, 0, 0.7, 0)        (Milk)
  x^6 = (0, 0.01, 0, 0.01, 0.01)     (Peanuts)
Supervised Learning Notation (MEMORIZE THIS)
(Using the same feature matrix X and label vector y as above.)
• Training phase:
– Use ‘X’ and ‘y’ to find a ‘model’ (like a decision stump).
• Prediction phase:
– Given an example xi, use the 'model' to predict a label ŷi ("sick" or "not sick").
• Training error:
– Fraction of times our prediction ŷi does not equal the true label yi.
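A sketch of the prediction phase and training error for a decision stump (using the X and y arrays from the notation sketch above; the function name is illustrative):

    def predict_stump(X, feature, threshold, label_true, label_false):
        # Prediction phase: apply one splitting rule to every example.
        return np.where(X[:, feature] > threshold, label_true, label_false)

    # E.g., the stump (egg > 0): predict "sick" (1) if satisfied, "not sick" (0) otherwise.
    yhat = predict_stump(X, feature=0, threshold=0, label_true=1, label_false=0)

    # Training error: fraction of predictions that do not equal the true label.
    training_error = np.mean(yhat != y)   # 1/5 for the X and y above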
Cost of Decision Stumps
• How much does this cost?
• Assume we have:
– ‘n’ examples (days that we measured).
– ‘d’ features (foods that we measured).
– ‘k’ thresholds (>0, >1, >2, …) for each feature.

• Computing the score of one rule costs O(n):


– We need to go through all ‘n’ examples to find most common labels.
– We need to go through all ‘n’ examples again to compute the accuracy.
– See notes on webpage for review of “O(n)” notation.

• We compute score for up to k*d rules (‘k’ thresholds for each of ‘d’ features):
– So we need to do an O(n) operation k*d times, giving total cost of O(ndk).
Cost of Decision Stumps
• Is a cost of O(ndk) good?
• Size of the input data is O(nd):
– If ‘k’ is small then the cost is roughly the same cost as loading the data.
• We should be happy about this, you can learn on any dataset you can load!
– If ‘k’ is large then this could be too slow for large datasets.

• Example: if all our features are binary then k=1, just test (feature > 0):
– Cost of fitting decision stump is O(nd), so we can fit huge datasets.
• Example: if all our features are numerical with unique values then k = n.
– Cost of fitting a decision stump is O(n²d).
• We don't like having n² because we want to fit datasets where 'n' is large!
– Bonus slides: how to reduce the cost in this case down to O(nd log n).
• Basic idea: sort features and track labels. Allows us to fit decision stumps to huge datasets.
Digression: “Debugging by Frustration/TA”
• Here is one way to write a complicated program:
1. Write the entire function at once.
2. Try it out to “see if it works”.
3. Spend hours fiddling with commands, to find magic working combination.
4. Send code to the TA, asking “what is wrong?”

• If you are lucky, Step 2 works and you are done!

• If you are not lucky, this takes way longer than principled coding methods.
– This is also a great way to introduce bugs into your code.
– And you will not be able to do Step 4 when you graduate.
Digression: Debugging 101
• What strategies could we use to debug an ML implementation?
– Use “print” statements to see what is happening at each step of the code.
• Or use a debugger.
– Develop one or more simple "test cases" where you worked out the result by hand.
• Maybe one of the functions you are using does not work the way you think it does.
– Check if the “predict” functionality works correctly on its own.
• Maybe the training works but the prediction does not.
– Check if the “training” functionality works correctly on its own.
• Maybe the prediction works but the training does not.
– Try the implementation with only one training example or only one feature.
• Maybe there is an indexing problem, or things are not being aggregated properly.
– Make a “brute force” implementation to compare to your “fast/clever” implementation.
• Maybe you made a mistake when trying to be fast/clever.
• With these strategies, you should be able to diagnose locations of problems.
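For example, the dataset from the "by hand" slide makes a good test case for a stump implementation; a minimal sketch (assuming NumPy and the fit_stump sketch from earlier):

    # Hand-worked test case: on the 6-example dataset from the "by hand" slide,
    # the best stump should be (egg > 0) with training accuracy 6/6.
    X_test = np.array([[0.7, 0.0, 1],
                       [0.7, 0.0, 2],
                       [0.0, 1.2, 0],
                       [0.7, 1.2, 0],
                       [0.0, 1.3, 2],
                       [0.0, 0.0, 0]])
    y_test = np.array([1, 1, 0, 0, 1, 0])

    feature, threshold, acc = fit_stump(X_test, y_test)
    assert (feature, threshold, acc) == (2, 0, 1.0), "stump search has a bug"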
Next Topic: Learning Decision Trees
Decision Stumps and Decision Trees

  Milk > 0.5?
    False -> Egg > 1.0?
               True  -> Sick
               False -> Not Sick
    True  -> Egg > 0.1?
               True  -> Sick
               False -> Not Sick

  Each split in this tree is a decision stump.
Decision Tree Learning
• Decision stumps have only 1 rule based on only 1 feature.
– Very limited class of models: usually not very accurate for most tasks.

• Decision trees allow sequences of splits based on multiple features.


– Very general class of models: can get very high accuracy.
– However, it’s computationally infeasible to find the best decision tree.

• Most common decision tree learning algorithm in practice:


– Greedy recursive splitting.
Example of Greedy Recursive Splitting
• Start with the full dataset and find the decision stump with the best score:

  Egg  Milk  ...  Sick?
  0    0.7        1
  1    0.7        1
  0    0          0
  1    0.6        1
  1    0          0
  2    0.6        1
  0    1          1
  2    0          1
  0    0.3        0
  1    0.6        0
  2    0          1

  Best stump:  Milk > 0.5?   True -> Sick,  False -> Not Sick

• Split into two smaller datasets based on the stump:

  Milk ≤ 0.5:
  Egg  Milk  ...  Sick?
  0    0          0
  1    0          0
  2    0          1
  0    0.3        0
  2    0          1

  Milk > 0.5:
  Egg  Milk  ...  Sick?
  0    0.7        1
  1    0.7        1
  1    0.6        1
  2    0.6        1
  0    1          1
  1    0.6        0
Greedy Recursive Splitting
We now have a decision stump and two datasets:

  Stump:  Milk > 0.5?   True -> Sick,  False -> Not Sick

  Milk ≤ 0.5:  (0,0,0), (1,0,0), (2,0,1), (0,0.3,0), (2,0,1)                    [Egg, Milk, Sick]
  Milk > 0.5:  (0,0.7,1), (1,0.7,1), (1,0.6,1), (2,0.6,1), (0,1,1), (1,0.6,0)

Fit a decision stump to each leaf's data:

  For the Milk ≤ 0.5 data:   Egg > 1?       True -> Sick,  False -> Not Sick
  For the Milk > 0.5 data:   Lactose > 0?   True -> Sick,  False -> Not Sick
Greedy Recursive Splitting
Then add these stumps to the tree:

  Milk > 0.5?
    False -> Egg > 1?
               True  -> Sick
               False -> Not Sick
    True  -> Lactose > 0?
               True  -> Sick
               False -> Not Sick
Greedy Recursive Splitting
This gives a "depth 2" decision tree. It splits the two datasets into four datasets:

  Milk ≤ 0.5, Egg ≤ 1:
  Egg  Milk  ...  Sick?
  0    0          0
  1    0          0
  0    0.3        0

  Milk ≤ 0.5, Egg > 1:
  Egg  Milk  ...  Sick?
  2    0          1
  2    0          1

  Milk > 0.5, Lactose > 0:
  Egg  Milk  ...  Sick?
  0    0.7        1
  1    0.7        1
  1    0.6        1
  2    0.6        1
  0    1          1

  Milk > 0.5, Lactose ≤ 0:
  Egg  Milk  ...  Sick?
  1    0.6        0
Greedy Recursive Splitting
We could try to split the four leaves to make a "depth 3" decision tree:

  Milk > 0.5?
    False -> Egg > 1?
               True  -> Sick
               False -> Ice cream > 0.3?
                          True  -> Sick
                          False -> Not Sick
    True  -> Lactose > 0?
               False -> Not Sick
               True  -> Egg > 1?
                          True  -> Sick
                          False -> Not Sick
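A minimal recursive sketch of greedy recursive splitting (assuming NumPy and the fit_stump sketch from earlier; the dictionary tree representation and the max_depth stopping rule are illustrative choices, not from the slides):

    def fit_tree(X, y, depth=0, max_depth=3):
        # Greedy recursive splitting: fit a stump, split the data, recurse on each part.
        mode = np.bincount(y).argmax()
        if depth >= max_depth or len(np.unique(y)) == 1:
            return {"leaf": mode}                       # stop: predict the most common label
        feature, threshold, acc = fit_stump(X, y)
        if feature is None:                             # no rule beat the baseline score
            return {"leaf": mode}
        satisfied = X[:, feature] > threshold
        return {"feature": feature, "threshold": threshold,
                "true":  fit_tree(X[satisfied], y[satisfied], depth + 1, max_depth),
                "false": fit_tree(X[~satisfied], y[~satisfied], depth + 1, max_depth)}

    def predict_tree(tree, x):
        # Follow the if-else rules from the root down to a leaf for one example x.
        while "leaf" not in tree:
            tree = tree["true"] if x[tree["feature"]] > tree["threshold"] else tree["false"]
        return tree["leaf"]

    # Usage: tree = fit_tree(X, y); predict_tree(tree, X[0]) gives the label for the first example.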


Which score function should a decision tree use?
• Shouldn't we just use the accuracy score?
– For leaves: yes, just maximize accuracy.
– For internal nodes: not necessarily.

• Maybe no simple rule like (egg > 0.5) improves accuracy.


– But this doesn’t necessarily mean we should stop!
Example Where Accuracy Fails
• Consider a dataset with 2 features and 2 classes ('□' and 'o').
– Because there are 2 features, we can draw 'X' as a scatterplot.
• Colours and shapes denote the class labels 'y'.
• A decision stump would divide the space by a horizontal or vertical line.
– Testing whether xi1 > t or whether xi2 > t.
• On this dataset, no horizontal/vertical line improves accuracy.
– The baseline is 'o', but we would need to get many 'o' wrong to get one '□' right.

[Scatterplot: a central group of '□' points surrounded by 'o' points; both axes run from about -3 to 3.]
Example Where Accuracy Fails
• Splitting rule 1: (x > 1).

[Same scatterplot, now with a vertical line at x = 1.]
Which score function should a decision tree use?

• Most common score in practice is “information gain”.


– “Choose split that decreases entropy of labels the most”.
• Information gain for baseline rule (“do nothing”) is 0.
– Infogain is large if labels are “more predictable” (“less random”) in next layer.
• Even if it does not increase classification accuracy at one depth,
we hope that it makes classification easier at the next depth.
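As a sketch (not from the slides; assuming NumPy as np, with illustrative function names), the entropy of the labels and the information gain of a candidate split can be computed as:

    def entropy(y):
        # Entropy (in bits) of a vector of class labels.
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(X, y, feature, threshold):
        # Decrease in label entropy from splitting on the rule (X[:, feature] > threshold).
        satisfied = X[:, feature] > threshold
        n, n_true = len(y), satisfied.sum()
        if n_true == 0 or n_true == n:
            return 0.0                   # the "do nothing" baseline rule has zero gain
        return (entropy(y)
                - (n_true / n) * entropy(y[satisfied])
                - ((n - n_true) / n) * entropy(y[~satisfied]))

A split can have positive information gain even when it does not change the accuracy, which is why infogain keeps splitting in the example that follows.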
Example Where Accuracy Fails
• Splitting rule 1: (x > 1).
– This split makes the labels less random: everything to the right of the line is classified as 'o', and the rest as '□'.
– More splits are needed for accurate classification.

  x > 1?
    True  -> 'o'
    False -> '□'

[Scatterplot with a vertical line at x = 1.]
Example Where Accuracy Fails
• Splitting rule 2: (y > 2).
– Splitting rule 1 (x > 1): everything to the right is classified as 'o'.
– Splitting rule 2 (y > 2): everything on the top is classified as 'o'.
– Points with x ≤ 1 and y ≤ 2 are still predicted as '□'.

  x > 1?
    True  -> 'o'
    False -> y > 2?
               True  -> 'o'
               False -> '□'

[Scatterplot with a vertical line at x = 1 and a horizontal line at y = 2.]
Example Where Accuracy Fails
• Splitting rule 3: (y ≤ -2).
– Each split makes the labels less random:
  • Rule 1 (x > 1): everything to the right is classified as 'o'.
  • Rule 2 (y > 2): everything on the top is classified as 'o'.
  • Rule 3 (y ≤ -2): everything on the bottom is classified as 'o'.
– Points with x ≤ 1, y ≤ 2, and y > -2 are still predicted as '□'.

  x > 1?
    True  -> 'o'
    False -> y > 2?
               True  -> 'o'
               False -> y ≤ -2?
                          True  -> 'o'
                          False -> '□'

[Scatterplot with lines at x = 1, y = 2, and y = -2.]
Example Where Accuracy Fails
• Splitting rule 4: (x ≤ -1).
– Everything to the left is classified as 'o'; the remaining points (with -1 < x ≤ 1 and -2 < y ≤ 2) are predicted as '□'.

  x > 1?
    True  -> 'o'
    False -> y > 2?
               True  -> 'o'
               False -> y ≤ -2?
                          True  -> 'o'
                          False -> x ≤ -1?
                                     True  -> 'o'
                                     False -> '□'

[Scatterplot with lines at x = 1, x = -1, y = 2, and y = -2; the '□' points lie inside this box.]
Discussion of Decision Tree Learning
• Advantages:
– Easy to implement.
– Interpretable.
– Learning is fast, and prediction is very fast.
– Can elegantly handle a small number of missing values during training.

• Disadvantages:
– Hard to find optimal set of rules.
– Greedy splitting often not accurate, requires very deep trees.
Discussion of Decision Tree Learning
• Issues:
– Can you revisit a feature?
• Yes, knowing other information could make feature relevant again.
– More complicated rules?
• Yes, but searching for the best rule gets much more expensive.
– What is best score?
• Infogain is the most popular and often works well, but is not always the best.
– What if you get new data?
• You could consider splitting if there is enough data at the leaves, but occasionally might want to re-
learn the whole tree or sub-trees.
– What depth?
• Some implementations stop at a maximum depth.
• Some stop if too few examples in leaf.
• Some stop if infogain is too small.
• Some stop by checking performance on a “validation set” (we will discuss this next time).
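As a concrete illustration (not from the slides), these kinds of stopping rules appear as hyperparameters in scikit-learn's DecisionTreeClassifier; a minimal sketch:

    from sklearn.tree import DecisionTreeClassifier

    model = DecisionTreeClassifier(
        criterion="entropy",       # use an information-gain-style splitting score
        max_depth=3,               # stop at a maximum depth
        min_samples_split=5,       # don't split nodes with too few examples
        min_samples_leaf=2,        # don't create leaves with too few examples
    )
    # model.fit(X, y) trains the tree; model.predict(X) returns predicted labels.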
Summary
• Supervised learning:
– Using data to write a program based on input/output examples.
• Decision trees: predicting a label using a sequence of simple rules.
• Decision stumps: simple decision tree that is very fast to fit.
• Greedy recursive splitting: uses a sequence of stumps to fit a tree.
– Very fast and interpretable, but not always the most accurate.
• Information gain: splitting score based on decreasing entropy.

• Next time: the most important ideas in machine learning.


Additional Learning Materials
Other Considerations for Food Allergy Example
• What types of preprocessing might we do?
– Data cleaning: check for and fix missing/unreasonable values.
– Summary statistics:
• Can help identify “unclean” data.
• Correlation might reveal an obvious dependence (e.g., between "sick" and "peanuts").
– Data transformations:
• Convert everything to same scale? (e.g., grams)
• Add foods from day before? (maybe “sick” depends on multiple days)
• Add date? (maybe what makes you “sick” changes over time).
– Data visualization: look at a scatterplot of each feature and the label.
• Maybe the visualization will show something weird in the features.
• Maybe the pattern is really obvious!
• What you do might depend on how much data you have:
– Very little data:
• Represent food by common allergic ingredients (lactose, gluten, etc.)?
– Lots of data:
• Use more fine-grained features (bread from bakery vs. hamburger bun)?
Going from O(n²d) to O(nd log n) for Numerical Features
• Do we have to compute score from scratch?
– As an example, assume we eat integer number of eggs:
• So the rules (egg > 1) and (egg > 2) have the same decisions, except when (egg == 2).
• We can actually compute the best rule involving ‘egg’ in O(n log n):
– Sort the examples based on ‘egg’, and use these positions to re-arrange ‘y’.
– Go through the sorted values in order, updating the counts of #sick and #not-sick that
both satisfy and don’t satisfy the rules.
– With these counts, it’s easy to compute the classification accuracy (see bonus slide).
• Sorting costs O(n log n) per feature.
• Total cost of updating counts is O(n) per feature.
• Total cost is reduced from O(n²d) to O(nd log n).
• This is a good runtime:
– O(nd) is the size of data, same as runtime up to a log factor.
– We can apply this algorithm to huge datasets.
How do we fit stumps in O(nd log n)?
• Let's say we're trying to find the best rule involving milk:
– First grab the milk column and sort it (using the sort positions to re-arrange the sick column). This step costs O(n log n) due to sorting.
– Now go through the milk values in order, keeping track of #sick and #not-sick that are above/below the current value. E.g., #sick above 0.3 is 5.
– With these counts, the accuracy score is (sum of the most common label counts above and below) / n.

  Sorted milk column (with re-arranged labels):

  Milk  Sick?
  0     0
  0     0
  0     0
  0     0
  0.3   0
  0.6   1
  0.6   1
  0.6   0
  0.7   1
  0.7   1
  1     1
How do we fit stumps in O(nd log n)?
• Start with the baseline rule (), which is always "satisfied":
– If satisfied: #sick = 5 and #not-sick = 6. If not satisfied: #sick = 0 and #not-sick = 0.
– This gives accuracy of (6+0)/n = 6/11.
• Next try the rule (milk > 0), and update the counts based on the 4 rows with milk = 0:
– If satisfied: #sick = 5 and #not-sick = 2. If not satisfied: #sick = 0 and #not-sick = 4.
– This gives accuracy of (5+4)/n = 9/11, which is better.
• Next try the rule (milk > 0.3), and update the counts based on the 1 row with milk = 0.3:
– If satisfied: #sick = 5 and #not-sick = 1. If not satisfied: #sick = 0 and #not-sick = 5.
– This gives accuracy of (5+5)/n = 10/11, which is better.
• (and keep going until you get to the end…)
How do we fit stumps in O(nd log n)?
• Notice that for each row, updating the counts only costs O(1).
– Since there are O(n) rows, the total cost of updating the counts is O(n).
• Instead of 2 labels (sick vs. not-sick), consider the case of 'k' labels:
– Updating the counts still costs O(n), since each row has one label.
– But computing the 'max' across the labels costs O(k), so the cost is O(kn).
• With 'k' labels, you can decrease the cost using a "max-heap" data structure:
– The cost of getting the max is O(1), and the cost of updating the heap for a row is O(log k).
– And k ≤ n (each row has only one label).
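A minimal sketch of this sorted scan for one feature with binary (0/1) labels (assuming NumPy as np; the function name is illustrative):

    def best_rule_for_feature(x, y):
        # Best threshold and accuracy for rules of the form (x > t), in O(n log n):
        # sort once, then sweep the values while updating label counts above/below t.
        order = np.argsort(x)
        x_sorted, y_sorted = x[order], y[order]
        n = len(y)
        above = np.array([np.sum(y_sorted == 0), np.sum(y_sorted == 1)])  # counts with x > t
        below = np.array([0, 0])                                          # counts with x <= t
        best_t, best_acc = None, above.max() / n     # baseline rule: every example "satisfied"
        for i in range(n):
            # Rule (x > x_sorted[i]): example i moves from the "above" side to the "below" side.
            below[y_sorted[i]] += 1
            above[y_sorted[i]] -= 1
            if i + 1 < n and x_sorted[i + 1] == x_sorted[i]:
                continue                             # only score at distinct feature values
            acc = (above.max() + below.max()) / n
            if acc > best_acc:
                best_t, best_acc = x_sorted[i], acc
        return best_t, best_acc

Applying this to each of the 'd' columns gives the O(nd log n) total; on the sorted milk column above, the best rule found is (milk > 0.3) with accuracy 10/11.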
Can decision trees re-visit a feature?

• Yes:

  Milk > 0.5?
    False -> Egg > 1?
               True  -> Sick
               False -> Ice cream > 0.3?
                          False -> Not Sick
                          True  -> Milk > 0.3?        (milk is revisited)
                                     True  -> Sick
                                     False -> Not Sick
    True  -> Lactose > 0?
               False -> Not Sick
               True  -> Egg > 1?
                          True  -> Sick
                          False -> Not Sick

• Knowing (ice cream > 0.3) makes small milk quantities relevant.
Can decision trees have more complicated rules?
• Yes!
• Rules that depend on more than one feature:

  Milk > 0.5?
    False -> Egg > 1?
               True  -> Sick
               False -> Not Sick
    True  -> (Lactose > 0 & Egg > 1)?
               True  -> Sick
               False -> Not Sick

• But now searching for the best rule can get expensive.
Can decision trees have more complicated rules?
• Yes!
• Rules that depend on more than one threshold:

  1995 ≤ Birth Year ≤ 2005?
    True  -> Sick
    False -> Not Sick

• “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”
– Consider decision stumps based on multiple splits of 1 attribute.
– Showed that this gives comparable performance to more-fancy methods on many datasets.
Does being greedy actually hurt?
• Can't you just go deeper to correct greedy decisions?
– Yes, but you need to "re-discover" rules with less data.
• Consider that you are allergic to milk (and drink it often), and also get sick when you (rarely) combine diet coke with mentos.
• The greedy method will first split on milk (it helps accuracy the most), because at the time milk is the best feature to consider:

  Milk > 0.5?
    False -> Pepsi > 0?
               False -> Not Sick
               True  -> Mentos > 0?
                          True  -> Sick
                          False -> Not Sick
    True  -> Pepsi > 0?
               False -> Not Sick
               True  -> Mentos > 0?
                          True  -> Sick
                          False -> Not Sick

• Greedy decision: we make whatever choice seems best at the moment and then solve the subproblems that arise later.
• You have to re-learn this rule twice to build the sub-trees; the worst case is identical sub-trees at the same level.
Does being greedy actually hurt?
• A non-greedy method could get a simpler tree by splitting on milk later:

  Pepsi > 0.5?
    False -> Milk > 0.5?
               True  -> Sick
               False -> Not Sick
    True  -> Mentos > 0?
               True  -> Sick
               False -> Milk > 0.5?
                          True  -> Sick
                          False -> Not Sick

• There is still some repeated structure, but the tree is simpler.
Does being greedy actually hurt?
• Non-greedy methods that find such trees exist, but they are complicated and not popular currently.
Decision Trees with Probabilistic Predictions
• Often, we’ll have multiple ‘y’ values at each leaf node.
• In these cases, we might return probabilities instead of a label.

• E.g., if in the leaf node we have 5 "sick" examples and 1 "not sick" example:
– Return p(y = “sick” | xi) = 5/6 and p(y = “not sick” | xi) = 1/6.

• In general, a natural estimate of the probabilities at the leaf nodes:


– Let ‘nk’ be the number of examples that arrive to leaf node ‘k’.
– Let ‘nkc’ be the number of times (y == c) in the examples at leaf node ‘k’.
– Maximum likelihood estimate for this leaf is p(y = c | xi) = nkc/nk.
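A sketch of this maximum likelihood estimate (assuming NumPy as np and integer class labels 0..k-1; the function name is illustrative):

    def leaf_probabilities(y_leaf, num_classes):
        # p(y = c | x reaches this leaf) = n_kc / n_k for each class c.
        counts = np.bincount(y_leaf, minlength=num_classes)   # n_kc for each class c
        return counts / len(y_leaf)                           # divide by n_k

    print(leaf_probabilities(np.array([1, 1, 1, 1, 1, 0]), num_classes=2))  # [1/6, 5/6]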
Alternative Stopping Rules
• There are more complicated rules for deciding when *not* to split.

• Rules based on minimum sample size.


– Don’t split any nodes where the number of examples is less than some
‘m’.
– Don’t split any nodes that create children with less than ‘m’ examples.
• These types of rules try to make sure that you have enough data to justify decisions.

• Alternately, you can use a validation set (see next lecture):


– Don’t split the node if it decreases an approximation of test accuracy.
