2.0 - Decision Tree
Decision Trees
Data Representation and Exploration
[Figure: example data table, where each row is a sample and each column is a feature.]
Supervised Learning
• We discussed supervised learning:
Egg   Milk  Fish  Wheat  Shellfish  Peanuts  …  Sick?
0     0.7   0     0.3    0          0           1
0.3   0.7   0     0.6    0          0.01        1
0     0     0     0.8    0          0           0
0.3   0.7   1.2   0      0.10       0.01        1
0.3   0     1.2   0.3    0.10       0.01        1
• Input for an example (day of the week) is a set of features (quantities of food).
• Output is a desired class label (whether or not we got sick).
• Goal of supervised learning:
– Use data to find a model that outputs the right label based on the features.
• Above, model predicts whether foods will make you sick (even with new combinations).
– This framework can be applied to any problem where we have input/output examples.
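To make the framework concrete, here is a minimal sketch (assuming scikit-learn is available) of fitting a model from the input/output examples in the table above and predicting on a new food combination; the final query row is made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Feature matrix X: one row per day, one column per food quantity
# (egg, milk, fish, wheat, shellfish, peanuts), copied from the table above.
X = np.array([[0,   0.7, 0,   0.3, 0,    0   ],
              [0.3, 0.7, 0,   0.6, 0,    0.01],
              [0,   0,   0,   0.8, 0,    0   ],
              [0.3, 0.7, 1.2, 0,   0.10, 0.01],
              [0.3, 0,   1.2, 0.3, 0.10, 0.01]])
y = np.array([1, 1, 0, 1, 1])   # label for each day: 1 = sick, 0 = not sick

model = DecisionTreeClassifier()                  # the model class discussed below
model.fit(X, y)                                   # "use data to find a model"
print(model.predict([[0.3, 0, 0, 0.3, 0, 0]]))    # predict for a new food combination
```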
Decision Trees
• Decision trees are simple programs consisting of:
– A nested sequence of “if-else” decisions based on the features (splitting rules).
– A class label as a return value at the end of each sequence.
Can draw sequences of decisions as a tree:
• Example decision tree:
Milk > 0.5?
├─ True:  Sick
└─ False: Egg > 1?
   ├─ True:  Sick
   └─ False: Not Sick
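Written as a program, the example tree above is just nested “if-else” statements; a minimal sketch in Python (the thresholds come from the tree above, the function name is ours):

```python
def predict_sick(milk, egg):
    # Each splitting rule thresholds one feature; each path ends in a class label.
    if milk > 0.5:
        return "sick"
    else:
        if egg > 1:
            return "sick"
        else:
            return "not sick"

print(predict_sick(milk=0.7, egg=0))  # -> "sick"
```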
Supervised Learning as Writing A Program
• There are many possible decision trees.
– We’re going to search for one that is good at our supervised learning problem.
• Supervised learning is useful when you have lots of labeled data BUT:
1. The problem is too complicated to write a program ourselves.
2. A human expert can’t explain why they assign certain labels,
OR we don’t have a human expert for the problem.
Learning A Decision Stump: “Search and Score”
• We’ll start with “decision stumps”:
– Simple decision tree with 1 splitting rule based on thresholding 1 feature.
• Highest-scoring rule: (egg > 0) with leaves “sick” and “not sick”.
• Notice we only need to test feature thresholds that happen in the data:
– There is no point in testing the rule (egg > 3), it gets the “baseline” score.
– There is no point in testing the rule (egg > 0.5), it gets the (egg > 0) score.
– Also note that we don’t need to test “<“, since it would give equivalent rules.
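A minimal sketch of this “search and score” procedure, assuming classification accuracy as the score and binary labels in {0, 1}; it tries the rule (feature j > t) for every feature and every threshold value appearing in the data, plus the no-split baseline:

```python
import numpy as np

def fit_stump(X, y):
    """Brute-force decision stump: try (x_j > t) for each feature j and each
    threshold t occurring in the data; score = number of correct predictions."""
    n, d = X.shape
    baseline = np.bincount(y).argmax()               # most common label overall
    best_score = np.sum(y == baseline)               # "baseline" rule: no split
    best_rule = (None, None, baseline, baseline)     # (feature, threshold, yes-label, no-label)
    for j in range(d):
        for t in np.unique(X[:, j]):
            yes = X[:, j] > t
            if not yes.any() or yes.all():
                continue                             # rule never/always fires: baseline score
            y_yes = np.bincount(y[yes]).argmax()     # label predicted when the rule is true
            y_no = np.bincount(y[~yes]).argmax()     # label predicted when the rule is false
            score = np.sum(y[yes] == y_yes) + np.sum(y[~yes] == y_no)
            if score > best_score:
                best_score, best_rule = score, (j, t, y_yes, y_no)
    return best_rule
```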
Supervised Learning Notation (MEMORIZE THIS)
– The feature matrix ‘X’ has ‘n’ rows (one per example) and ‘d’ columns (one per feature), so X is n×d.
– The label vector ‘y’ has one entry per example, so y is n×1.
– In the food example above, X holds the food quantities and y holds the “sick?” labels.
• We compute score for up to k*d rules (‘k’ thresholds for each of ‘d’ features):
– So we need to do an O(n) operation k*d times, giving total cost of O(ndk).
Cost of Decision Stumps
• Is a cost of O(ndk) good?
• Size of the input data is O(nd):
– If ‘k’ is small then the cost is roughly the same cost as loading the data.
• We should be happy about this, you can learn on any dataset you can load!
– If ‘k’ is large then this could be too slow for large datasets.
• Example: if all our features are binary then k=1, just test (feature > 0):
– Cost of fitting decision stump is O(nd), so we can fit huge datasets.
• Example: if all our features are numerical with unique values then k=n.
– Cost of fitting a decision stump is O(n²d).
• We don’t like having n² because we want to fit datasets where ‘n’ is large!
– Bonus slides: how to reduce the cost in this case down to O(nd log n).
• Basic idea: sort features and track labels. Allows us to fit decision stumps to huge datasets.
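The bonus-slide idea can be sketched roughly as follows (a sketch only, assuming binary labels in {0, 1} and accuracy as the score): sort each feature once, then sweep over candidate thresholds while tracking how many of each label fall on each side, so each feature costs O(n log n) instead of O(n²):

```python
import numpy as np

def fit_stump_sorted(X, y):
    """O(nd log n) stump fitting for binary labels: sort each feature,
    then sweep thresholds while maintaining left/right label counts."""
    n, d = X.shape
    ones = int(y.sum())
    best_correct = max(ones, n - ones)        # baseline: predict the most common label
    best_rule = (None, None)                  # (feature, threshold)
    for j in range(d):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        left_ones = 0
        for i in range(n - 1):                # candidate threshold between xs[i] and xs[i+1]
            left_ones += ys[i]
            if xs[i] == xs[i + 1]:
                continue                      # identical values: not a distinct threshold
            left_n, right_n = i + 1, n - (i + 1)
            right_ones = ones - left_ones
            correct = max(left_ones, left_n - left_ones) + max(right_ones, right_n - right_ones)
            if correct > best_correct:
                best_correct = correct
                best_rule = (j, (xs[i] + xs[i + 1]) / 2)
    return best_rule
```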
Digression: “Debugging by Frustration/TA”
• Here is one way to write a complicated program:
1. Write the entire function at once.
2. Try it out to “see if it works”.
3. Spend hours fiddling with commands to find a magic working combination.
4. Send code to the TA, asking “what is wrong?”
• If you are not lucky, this takes way longer than principled coding methods.
– This is also a great way to introduce bugs into your code.
– And you will not be able to do Step 4 when you graduate.
Digression: Debugging 101
• What strategies could we use to debug an ML implementation?
– Use “print” statements to see what is happening at each step of the code.
• Or use a debugger.
– Develop one or more simple “test cases”, where you worked out the result by hand.
• Maybe one of the functions you are using does not work the way you think it does.
– Check if the “predict” functionality works correctly on its own.
• Maybe the training works but the prediction does not.
– Check if the “training” functionality works correctly on its own.
• Maybe the prediction works but the training does not.
– Try the implementation with only one training example or only one feature.
• Maybe there is an indexing problem, or things are not being aggregated properly.
– Make a “brute force” implementation to compare to your “fast/clever” implementation.
• Maybe you made a mistake when trying to be fast/clever.
• With these strategies, you should be able to diagnose locations of problems.
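As one concrete instance of the brute-force strategy above, here is a small hedged sketch (the arguments fast_predict and brute_force_predict are placeholders for your own implementations) that compares the two on many small random datasets:

```python
import numpy as np

def check_against_brute_force(fast_predict, brute_force_predict, n=20, d=3, trials=100):
    """Compare a 'fast/clever' implementation against a slow but obviously-correct one
    on small random inputs; any disagreement points to a bug."""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        X = rng.random((n, d))
        if not np.array_equal(fast_predict(X), brute_force_predict(X)):
            print("Mismatch found on:", X)
            return False
    return True
```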
Next Topic: Learning Decision Trees
Decision Stumps and Decision Trees
Decision Tree Learning
• Decision stumps have only 1 rule based on only 1 feature.
– Very limited class of models: usually not very accurate for most tasks.
• Fit a decision stump to each leaf’s data, then add these stumps to the tree:
– The (milk ≤ 0.5) leaf gets the stump (egg > 1); the (milk > 0.5) leaf gets the stump (lactose > 0).
– This gives four leaves, each with its own subset of the data: (milk ≤ 0.5, egg ≤ 1), (milk ≤ 0.5, egg > 1), (milk > 0.5, lactose ≤ 0), (milk > 0.5, lactose > 0).
[Figure: the four leaf datasets, each shown as a small (egg, milk, …, sick?) table.]
Greedy Recursive Splitting
• We could try to split the four leaves to make a “depth 3” decision tree.
[Figure: the tree so far, rooted at (milk > 0.5).]
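A minimal sketch of greedy recursive splitting, assuming binary labels in {0, 1} and accuracy as the splitting score (the slides note later that information gain is the more popular score); each call fits the best stump on the node’s data, splits the data by that rule, and recurses on the two subsets:

```python
import numpy as np

def fit_tree(X, y, max_depth):
    """Greedy recursive splitting: fit the best stump at this node (by accuracy),
    split the data by that rule, and recurse on each side."""
    mode = np.bincount(y).argmax()
    if max_depth == 0 or np.all(y == mode):
        return {"label": int(mode)}                       # leaf: predict the most common label
    best = None                                           # (errors, feature, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            right = X[:, j] > t
            if right.all() or not right.any():
                continue                                  # split separates nothing
            errors = np.sum(y[right] != np.bincount(y[right]).argmax()) \
                   + np.sum(y[~right] != np.bincount(y[~right]).argmax())
            if best is None or errors < best[0]:
                best = (errors, j, t)
    if best is None:
        return {"label": int(mode)}
    _, j, t = best
    right = X[:, j] > t
    return {"feature": j, "threshold": t,
            "false": fit_tree(X[~right], y[~right], max_depth - 1),
            "true": fit_tree(X[right], y[right], max_depth - 1)}

def predict_one(tree, x):
    """Follow the splitting rules from the root down to a leaf."""
    while "label" not in tree:
        tree = tree["true"] if x[tree["feature"]] > tree["threshold"] else tree["false"]
    return tree["label"]
```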
Example Where Accuracy Fails
• Consider a dataset with 2 features and 2 classes (‘□’ and ‘o’).
– Because there are 2 features, we can draw ‘X’ as a scatterplot.
– Colours and shapes denote the class labels ‘y’.
• Which score function should a decision tree use?
• Splitting rule 1: (X > 1), with leaves ‘□’ (False) and ‘o’ (True).
[Figure: scatterplot of the data with this split; more splits are needed for accurate classification.]
Example Where Accuracy Fails
• Splitting rule 1: (X > 1) — everything on the right is classified as ‘o’.
• Splitting rule 2: (Y > 2) — everything on the top is classified as ‘o’.
• Points with X ≤ 1 and Y ≤ 2 will be predicted as ‘□’.
[Figure: scatterplot with these splits; tree: (X > 1)? True → ‘o’; False → (Y > 2)? True → ‘o’; False → ‘□’.]
Example Where Accuracy Fails
• Splitting rule 1: (X > 1) — everything on the right is classified as ‘o’, everything on the left as ‘□’.
• Splitting rule 2: (Y > 2) — everything on the top is classified as ‘o’, everything below as ‘□’.
• Splitting rule 3: (Y ≤ -2) — everything on the bottom is classified as ‘o’.
• The remaining points (X ≤ 1, -2 < Y ≤ 2) will be predicted as ‘□’.
[Figure: scatterplot with the three splits; tree: (X > 1)? True → ‘o’; False → (Y > 2)? True → ‘o’; False → (Y ≤ -2)? True → ‘o’; False → ‘□’.]
Discussion of Decision Tree Learning
• Advantages:
– Easy to implement.
– Interpretable.
– Learning is fast and prediction is very fast.
– Can elegantly handle a small number of missing values during training.
• Disadvantages:
– Hard to find optimal set of rules.
– Greedy splitting often not accurate, requires very deep trees.
Discussion of Decision Tree Learning
• Issues:
– Can you revisit a feature?
• Yes, knowing other information could make feature relevant again.
– More complicated rules?
• Yes, but searching for the best rule gets much more expensive.
– What is best score?
• Infogain is the most popular and often works well, but is not always the best.
– What if you get new data?
• You could consider splitting if there is enough data at the leaves, but occasionally you might want to re-learn the whole tree or sub-trees.
– What depth?
• Some implementations stop at a maximum depth.
• Some stop if too few examples in leaf.
• Some stop if infogain is too small.
• Some stop by checking performance on a “validation set” (we will discuss this next time).
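For reference, these stopping criteria map onto hyperparameters of common implementations; a hedged example using scikit-learn’s DecisionTreeClassifier (the parameter values here are arbitrary illustrations, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    criterion="entropy",          # score splits by entropy decrease (information gain)
    max_depth=5,                  # stop at a maximum depth
    min_samples_leaf=10,          # stop if too few examples would land in a leaf
    min_impurity_decrease=0.01,   # stop if the gain from splitting is too small
)
```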
Summary
• Supervised learning:
– Using data to write a program based on input/output examples.
• Decision trees: predicting a label using a sequence of simple rules.
• Decision stumps: simple decision tree that is very fast to fit.
• Greedy recursive splitting: uses a sequence of stumps to fit a tree.
– Very fast and interpretable, but not always the most accurate.
• Information gain: splitting score based on decreasing entropy.
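Since information gain is only named here, a short sketch of the standard definition (the usual formula, not taken from these slides): the gain of a split is the entropy of the labels before the split minus the weighted entropy of the labels in the two resulting leaves.

```python
import numpy as np

def entropy(y):
    """Entropy of a label vector: H(y) = -sum_c p_c * log2(p_c)."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_gain(y, split):
    """Information gain of a boolean split: entropy before minus weighted entropy after."""
    n = len(y)
    n_true = split.sum()
    return entropy(y) - (n_true / n) * entropy(y[split]) \
                      - ((n - n_true) / n) * entropy(y[~split])

# Example: labels [1,1,0,1,1] split by (egg > 0) in the food data above.
y = np.array([1, 1, 0, 1, 1])
split = np.array([False, True, False, True, True])   # egg > 0
print(info_gain(y, split))
```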
• (Bonus) Example of revisiting a feature: knowing (ice cream > 0.3) makes small milk quantities relevant again.
Can decision trees have more complicated rules?
• Yes!
• Rules that depend on more than one feature:
• But now searching for the best rule can get expensive.
Can decision trees have more complicated rules?
• Yes!
• Rules that depend on more than one threshold:
• “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”
– Consider decision stumps based on multiple splits of 1 attribute.
– Showed that this gives comparable performance to fancier methods on many datasets.
Does being greedy actually hurt?
• Can’t you just go deeper to correct greedy decisions?
– Yes, but you need to “re-discover” rules with less data.
• Consider that you are allergic to milk (and drink it often), and also get sick when you (rarely) combine diet coke with mentos.
• The greedy method will first split on milk (it helps accuracy the most), because at that point milk is the best feature to consider.
• Greedy decision: we make whatever choice seems best at the moment, and then solve the subproblems that arise later.
[Tree: (milk > 0.5)? Each branch then splits on (Pepsi > 0); its True branch splits on (Mentos > 0) with leaves “Sick” (True) and “Not Sick” (False), and its False branch is “Not Sick”.]
• You have to learn the (Pepsi, Mentos) condition twice to build the sub-trees; the worst case is identical sub-trees at the same level.
Does being greedy actually hurt?
• Can’t you just go deeper to correct greedy decisions?
– Yes, but you need to “re-discover” rules with less data.
• Consider that you are allergic to milk (and drink it often), and also get sick when you (rarely) combine diet coke with mentos.
• Greedy method should first split on milk (helps accuracy the most).
• Non-greedy method could get a simpler tree (split on milk later).
[Figure: the simpler non-greedy tree.]
• (Bonus) Leaves can also return probabilities: e.g., if in the leaf node we have 5 “sick” examples and 1 “not sick”:
– Return p(y = “sick” | xi) = 5/6 and p(y = “not sick” | xi) = 1/6.