2.0 - Decision Tree

The document discusses decision trees as a method of supervised learning, emphasizing their structure as a series of 'if-else' decisions based on features to classify data. It explains the process of learning decision stumps, which are simple decision trees with one rule, and how to evaluate their effectiveness using classification accuracy. Additionally, it outlines the computational costs associated with decision stumps and introduces the greedy recursive splitting method for building more complex decision trees.

EC9630: Machine Learning

Decision Trees
Data Representation and Exploration
Features
• Example-feature representation (samples: another name we'll use for examples):

  Age  Job?  City  Rating  Income
  23   Yes   Van   A       22,000.00
  23   Yes   Bur   BBB     21,000.00
  22   No    Van   CC      0.00
  25   Yes   Sur   AAA     57,000.00
Supervised Learning
• We discussed supervised learning:
Egg Milk Fish Wheat Shellfish Peanuts … Sick?
0 0.7 0 0.3 0 0 1
0.3 0.7 0 0.6 0 0.01 1
0 0 0 0.8 0 0 0
0.3 0.7 1.2 0 0.10 0.01 1
0.3 0 1.2 0.3 0.10 0.01 1

• Input for an example (day of the week) is a set of features (quantities of food).
• Output is a desired class label (whether or not we got sick).
• Goal of supervised learning:
– Use data to find a model that outputs the right label based on the features.
• Above, the model predicts whether foods will make you sick (even with new combinations).
– This framework can be applied to any problem where we have input/output examples.
Decision Trees
• Decision trees are simple programs consisting of:
– A nested sequence of “if-else” decisions based on the features (splitting rules).
– A class label as a return value at the end of each sequence.
• We can draw a sequence of decisions as a tree. Example decision tree:

  Milk > 0.5?
    True  -> Sick
    False -> Egg > 1?
               True  -> Sick
               False -> Not Sick
Supervised Learning as Writing A Program
• There are many possible decision trees.
– We’re going to search for one that is good at our supervised learning problem.

• So our input is data and the output will be a program.


– This is called “training” the supervised learning model.
– This is different from the usual input/output specification for writing a program.

• Supervised learning is useful when you have lots of labeled data BUT:
1. The problem is too complicated to write a program ourselves.
2. A human expert can't explain why you assign certain labels,
OR
2. We don't have a human expert for the problem.
Learning A Decision Stump: “Search and Score”
• We’ll start with "decision stumps”:
– Simple decision tree with 1 splitting rule based on thresholding 1 feature.

  Milk > 0.5?
    True  -> Sick
    False -> Not Sick

• To “learn” a decision stump we need to find 3 things:


– Which feature should we use to “split” the data?
– What value of the threshold should be used?
– What classes should we use at the leaves? (Relevant when we have more than two possible labels.)
Learning A Decision Stump: “Search and Score”
• To “learn” a decision stump:
1. Define a ‘score’ for each possible rule.
2. Search for the rule with the best score.

– Example: score several candidate stumps, such as (Egg > 0.5), (Milk > 0.1), and (Milk > 0.5) with different choices of leaf labels. With scores of 10, 45, 65, and 50, we keep the rule with the best score.

• Q: what “score” should we use?


Learning A Decision Stump: Accuracy Score
• Most intuitive score: classification accuracy.
– "If we use this rule, how many examples do we label correctly?"

  Example rule:  Egg > 1?   True -> Sick,  False -> Not Sick

  Milk  Fish  Egg  Sick?
  0.7   0     1    1
  0.7   0     2    1
  0     0     0    0
  0.7   1.2   0    0
  0     1.2   2    1
  0     0     0    0

• Computing classification accuracy for this rule:
– Find most common labels after applying splitting rule:
• When (egg > 1), we were “sick” 2 times out of 2.
• When (egg ≤ 1), we were “not sick” 3 times out of 4.
– Compute accuracy:
• The accuracy (“score”) of the rule is 5 times out of 6.
• This “score” evaluates quality of a rule.
– We “learn” a decision stump by finding the rule with the best score.
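To make this scoring step concrete, here is a minimal sketch (not from the slides; names like rule_accuracy are illustrative) that computes the accuracy score of the rule (egg > 1) on the table above, using NumPy:

    import numpy as np

    # Columns: Milk, Fish, Egg (features) and Sick (labels), as in the table above.
    X = np.array([[0.7, 0.0, 1],
                  [0.7, 0.0, 2],
                  [0.0, 0.0, 0],
                  [0.7, 1.2, 0],
                  [0.0, 1.2, 2],
                  [0.0, 0.0, 0]])
    y = np.array([1, 1, 0, 0, 1, 0])

    def rule_accuracy(X, y, feature, threshold):
        # Score of the stump (X[:, feature] > threshold): predict the most
        # common label on each side of the split, then count correct labels.
        satisfied = X[:, feature] > threshold
        yhat = np.zeros(len(y), dtype=int)
        for side in (satisfied, ~satisfied):
            if side.any():
                yhat[side] = np.bincount(y[side]).argmax()
        return np.mean(yhat == y)

    print(rule_accuracy(X, y, feature=2, threshold=1))  # (egg > 1) scores 5/6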
Learning A Decision Stump: By Hand
• Let's search for the decision stump maximizing classification score:

  Milk  Fish  Egg  Sick?
  0.7   0     1    1
  0.7   0     2    1
  0     1.2   0    0
  0.7   1.2   0    0
  0     1.3   2    1
  0     0     0    0

– First we check the "baseline rule" of predicting the mode (no split): this gets 3/6 accuracy.
– If (milk > 0) predict "sick" (2/3), else predict "not sick" (2/3): 4/6 accuracy.
– If (fish > 0) predict "not sick" (2/3), else predict "sick" (2/3): 4/6 accuracy.
– If (fish > 1.2) predict "sick" (1/1), else predict "not sick" (3/5): 4/6 accuracy.
– If (egg > 0) predict "sick" (3/3), else predict "not sick" (3/3): 6/6 accuracy.
– If (egg > 1) predict "sick" (2/2), else predict "not sick" (3/4): 5/6 accuracy.

• Highest-scoring rule: (egg > 0) with leaves “sick” and “not sick”.
• Notice we only need to test feature thresholds that happen in the data:
– There is no point in testing the rule (egg > 3), it gets the “baseline” score.
– There is no point in testing the rule (egg > 0.5), it gets the (egg > 0) score.
– Also note that we don’t need to test “<“, since it would give equivalent rules.
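The search itself is just a double loop over features and thresholds. A minimal sketch (assuming NumPy and the rule_accuracy sketch above; the function name fit_stump is made up for illustration):

    def fit_stump(X, y):
        # Try every rule (feature j, threshold t) where t is a value that actually
        # appears in the data, and keep the rule with the best accuracy score.
        best = (None, None, np.mean(y == np.bincount(y).argmax()))  # baseline: predict the mode
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                acc = rule_accuracy(X, y, feature=j, threshold=t)
                if acc > best[2]:
                    best = (j, t, acc)
        return best  # (feature index, threshold, training accuracy)

On the small allergy dataset above, this returns the rule (egg > 0) with a score of 6/6, matching the search by hand.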
Supervised Learning Notation (MEMORIZE THIS)
  Egg   Milk  Fish  Wheat  Shellfish  Peanuts  |  Sick?
  0     0.7   0     0.3    0          0        |  1
  0.3   0.7   0     0.6    0          0.01     |  1
  0     0     0     0.8    0          0        |  0
  0.3   0.7   1.2   0      0.10       0.01     |  1
  0.3   0     1.2   0.3    0.10       0.01     |  1

  X = the n x d block of feature values (left);  y = the n x 1 vector of labels (right).

• Feature matrix ‘X’ has rows as examples, columns as features.


– xij is feature ‘j’ for example ‘i’ (quantity of food ‘j’ on day ‘i’).
– xi is the list of all features for example ‘i’ (all the quantities on day ‘i’).
– xj is column ‘j’ of the matrix (the value of feature ‘j’ across all examples).
• Label vector ‘y’ contains the labels of the examples.
– yi is the label of example ‘i’ (1 for “sick”, 0 for “not sick”).
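As a sketch (not from the slides), this notation maps directly onto NumPy array indexing; the variable names below are illustrative:

    import numpy as np

    # Feature matrix X (n x d) and label vector y from the table above.
    X = np.array([[0.0, 0.7, 0.0, 0.3, 0.00, 0.00],
                  [0.3, 0.7, 0.0, 0.6, 0.00, 0.01],
                  [0.0, 0.0, 0.0, 0.8, 0.00, 0.00],
                  [0.3, 0.7, 1.2, 0.0, 0.10, 0.01],
                  [0.3, 0.0, 1.2, 0.3, 0.10, 0.01]])
    y = np.array([1, 1, 0, 1, 1])

    n, d = X.shape     # n = 5 examples, d = 6 features
    x_ij = X[1, 0]     # feature j=1 (Egg) for example i=2 (0-based indexing): 0.3
    x_i  = X[1, :]     # all features of example i=2: (0.3, 0.7, 0, 0.6, 0, 0.01)
    x_j  = X[:, 0]     # feature j=1 (Egg) across all examples: (0, 0.3, 0, 0.3, 0.3)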
Supervised Learning Notation (MEMORIZE THIS)
• Using the same feature matrix X (n x d) and label vector y (n x 1) as above:
– Rows of X (examples):
  x1 = (0, 0.7, 0, 0.3, 0, 0)
  x2 = (0.3, 0.7, 0, 0.6, 0, 0.01)
  x3 = (0, 0, 0, 0.8, 0, 0)
– Columns of X (one feature across all examples):
  x^1 = (0, 0.3, 0, 0.3, 0.3)        (Egg)
  x^2 = (0.7, 0.7, 0, 0.7, 0)        (Milk)
  x^6 = (0, 0.01, 0, 0.01, 0.01)     (Peanuts)
Supervised Learning Notation (MEMORIZE THIS)
(Using the same feature matrix X and label vector y as above.)
• Training phase:
– Use ‘X’ and ‘y’ to find a ‘model’ (like a decision stump).
• Prediction phase:
– Given an example xi, use the 'model' to predict a label ŷi ("sick" or "not sick").
• Training error:
– Fraction of times our prediction ŷi does not equal the true label yi.
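A sketch of the prediction phase and training error for a decision stump (using the X and y arrays from the notation sketch above; the function name is illustrative):

    def predict_stump(X, feature, threshold, label_true, label_false):
        # Prediction phase: apply one splitting rule to every example.
        return np.where(X[:, feature] > threshold, label_true, label_false)

    # E.g., the stump (egg > 0): predict "sick" (1) if satisfied, "not sick" (0) otherwise.
    yhat = predict_stump(X, feature=0, threshold=0, label_true=1, label_false=0)

    # Training error: fraction of predictions that do not equal the true label.
    training_error = np.mean(yhat != y)   # 1/5 for the X and y above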
Cost of Decision Stumps
• How much does this cost?
• Assume we have:
– ‘n’ examples (days that we measured).
– ‘d’ features (foods that we measured).
– ‘k’ thresholds (>0, >1, >2, …) for each feature.

• Computing the score of one rule costs O(n):


– We need to go through all ‘n’ examples to find most common labels.
– We need to go through all ‘n’ examples again to compute the accuracy.
– See notes on webpage for review of “O(n)” notation.

• We compute score for up to k*d rules (‘k’ thresholds for each of ‘d’ features):
– So we need to do an O(n) operation k*d times, giving total cost of O(ndk).
Cost of Decision Stumps
• Is a cost of O(ndk) good?
• Size of the input data is O(nd):
– If ‘k’ is small then the cost is roughly the same cost as loading the data.
• We should be happy about this, you can learn on any dataset you can load!
– If ‘k’ is large then this could be too slow for large datasets.

• Example: if all our features are binary then k=1, just test (feature > 0):
– Cost of fitting decision stump is O(nd), so we can fit huge datasets.
• Example: if all our features are numerical with unique values then k = n.
– Cost of fitting a decision stump is O(n²d).
• We don't like having n² because we want to fit datasets where 'n' is large!
– Bonus slides: how to reduce the cost in this case down to O(nd log n).
• Basic idea: sort features and track labels. Allows us to fit decision stumps to huge datasets.
Digression: “Debugging by Frustration/TA”
• Here is one way to write a complicated program:
1. Write the entire function at once.
2. Try it out to “see if it works”.
3. Spend hours fiddling with commands, to find magic working combination.
4. Send code to the TA, asking “what is wrong?”

• If you are lucky, Step 2 works and you are done!

• If you are not lucky, this takes way longer than principled coding methods.
– This is also a great way to introduce bugs into your code.
– And you will not be able to do Step 4 when you graduate.
Digression: Debugging 101
• What strategies could we use to debug an ML implementation?
– Use “print” statements to see what is happening at each step of the code.
• Or use a debugger.
– Develop one or more simple "test cases" where you worked out the result by hand.
• Maybe one of the functions you are using does not work the way you think it does.
– Check if the “predict” functionality works correctly on its own.
• Maybe the training works but the prediction does not.
– Check if the “training” functionality works correctly on its own.
• Maybe the prediction works but the training does not.
– Try the implementation with only one training example or only one feature.
• Maybe there is an indexing problem, or things are not being aggregated properly.
– Make a “brute force” implementation to compare to your “fast/clever” implementation.
• Maybe you made a mistake when trying to be fast/clever.
• With these strategies, you should be able to diagnose locations of problems.
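For example, the dataset from the "by hand" slide makes a good test case for a stump implementation; a minimal sketch (assuming NumPy and the fit_stump sketch from earlier):

    # Hand-worked test case: on the 6-example dataset from the "by hand" slide,
    # the best stump should be (egg > 0) with training accuracy 6/6.
    X_test = np.array([[0.7, 0.0, 1],
                       [0.7, 0.0, 2],
                       [0.0, 1.2, 0],
                       [0.7, 1.2, 0],
                       [0.0, 1.3, 2],
                       [0.0, 0.0, 0]])
    y_test = np.array([1, 1, 0, 0, 1, 0])

    feature, threshold, acc = fit_stump(X_test, y_test)
    assert (feature, threshold, acc) == (2, 0, 1.0), "stump search has a bug"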
Next Topic: Learning Decision Trees
Decision Stumps and Decision Trees

  Milk > 0.5?
    False -> Egg > 1.0?
               True  -> Sick
               False -> Not Sick
    True  -> Egg > 0.1?
               True  -> Sick
               False -> Not Sick

  Each split in this tree is a decision stump.
Decision Tree Learning
• Decision stumps have only 1 rule based on only 1 feature.
– Very limited class of models: usually not very accurate for most tasks.

• Decision trees allow sequences of splits based on multiple features.


– Very general class of models: can get very high accuracy.
– However, it’s computationally infeasible to find the best decision tree.

• Most common decision tree learning algorithm in practice:


– Greedy recursive splitting.
Example of Greedy Recursive Splitting
• Start with the full dataset and find the decision stump with the best score:

  Egg  Milk  ...  Sick?
  0    0.7        1
  1    0.7        1
  0    0          0
  1    0.6        1
  1    0          0
  2    0.6        1
  0    1          1
  2    0          1
  0    0.3        0
  1    0.6        0
  2    0          1

  Best stump:  Milk > 0.5?   True -> Sick,  False -> Not Sick

• Split into two smaller datasets based on the stump:

  Milk ≤ 0.5:
  Egg  Milk  ...  Sick?
  0    0          0
  1    0          0
  2    0          1
  0    0.3        0
  2    0          1

  Milk > 0.5:
  Egg  Milk  ...  Sick?
  0    0.7        1
  1    0.7        1
  1    0.6        1
  2    0.6        1
  0    1          1
  1    0.6        0
Greedy Recursive Splitting
We now have a decision stump and two datasets:

  Stump:  Milk > 0.5?   True -> Sick,  False -> Not Sick

  Milk ≤ 0.5:  (0,0,0), (1,0,0), (2,0,1), (0,0.3,0), (2,0,1)                    [Egg, Milk, Sick]
  Milk > 0.5:  (0,0.7,1), (1,0.7,1), (1,0.6,1), (2,0.6,1), (0,1,1), (1,0.6,0)

Fit a decision stump to each leaf's data:

  For the Milk ≤ 0.5 data:   Egg > 1?       True -> Sick,  False -> Not Sick
  For the Milk > 0.5 data:   Lactose > 0?   True -> Sick,  False -> Not Sick
Greedy Recursive Splitting
Then add these stumps to the tree:

  Milk > 0.5?
    False -> Egg > 1?
               True  -> Sick
               False -> Not Sick
    True  -> Lactose > 0?
               True  -> Sick
               False -> Not Sick
Greedy Recursive Splitting
This gives a "depth 2" decision tree. It splits the two datasets into four datasets:

  Milk ≤ 0.5, Egg ≤ 1:
  Egg  Milk  ...  Sick?
  0    0          0
  1    0          0
  0    0.3        0

  Milk ≤ 0.5, Egg > 1:
  Egg  Milk  ...  Sick?
  2    0          1
  2    0          1

  Milk > 0.5, Lactose > 0:
  Egg  Milk  ...  Sick?
  0    0.7        1
  1    0.7        1
  1    0.6        1
  2    0.6        1
  0    1          1

  Milk > 0.5, Lactose ≤ 0:
  Egg  Milk  ...  Sick?
  1    0.6        0
Greedy Recursive Splitting
We could try to split the four leaves to make a "depth 3" decision tree:

  Milk > 0.5?
    False -> Egg > 1?
               True  -> Sick
               False -> Ice cream > 0.3?
                          True  -> Sick
                          False -> Not Sick
    True  -> Lactose > 0?
               False -> Not Sick
               True  -> Egg > 1?
                          True  -> Sick
                          False -> Not Sick
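A minimal recursive sketch of greedy recursive splitting (assuming NumPy and the fit_stump sketch from earlier; the dictionary tree representation and the max_depth stopping rule are illustrative choices, not from the slides):

    def fit_tree(X, y, depth=0, max_depth=3):
        # Greedy recursive splitting: fit a stump, split the data, recurse on each part.
        mode = np.bincount(y).argmax()
        if depth >= max_depth or len(np.unique(y)) == 1:
            return {"leaf": mode}                       # stop: predict the most common label
        feature, threshold, acc = fit_stump(X, y)
        if feature is None:                             # no rule beat the baseline score
            return {"leaf": mode}
        satisfied = X[:, feature] > threshold
        return {"feature": feature, "threshold": threshold,
                "true":  fit_tree(X[satisfied], y[satisfied], depth + 1, max_depth),
                "false": fit_tree(X[~satisfied], y[~satisfied], depth + 1, max_depth)}

    def predict_tree(tree, x):
        # Follow the if-else rules from the root down to a leaf for one example x.
        while "leaf" not in tree:
            tree = tree["true"] if x[tree["feature"]] > tree["threshold"] else tree["false"]
        return tree["leaf"]

    # Usage: tree = fit_tree(X, y); predict_tree(tree, X[0]) gives the label for the first example.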


Which score function should a decision tree use?
• Shouldn't we just use the accuracy score?
– For leaves: yes, just maximize accuracy.
– For internal nodes: not necessarily.

• Maybe no simple rule like (egg > 0.5) improves accuracy.


– But this doesn’t necessarily mean we should stop!
Example Where Accuracy Fails
• Consider a dataset with 2 features and 2 classes ('□' and 'o').
– Because there are 2 features, we can draw 'X' as a scatterplot.
• Colours and shapes denote the class labels 'y'.
• A decision stump would divide the space by a horizontal or vertical line.
– Testing whether xi1 > t or whether xi2 > t.
• On this dataset, no horizontal/vertical line improves accuracy.
– The baseline is 'o', but we would need to get many 'o' wrong to get one '□' right.

[Scatterplot: a central group of '□' points surrounded by 'o' points; both axes run from about -3 to 3.]
Example Where Accuracy Fails
• Splitting rule 1: (x > 1).

[Same scatterplot, now with a vertical line at x = 1.]
Which score function should a decision tree use?

• Most common score in practice is “information gain”.


– “Choose split that decreases entropy of labels the most”.
• Information gain for baseline rule (“do nothing”) is 0.
– Infogain is large if labels are “more predictable” (“less random”) in next layer.
• Even if it does not increase classification accuracy at one depth,
we hope that it makes classification easier at the next depth.
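As a sketch (not from the slides; assuming NumPy as np, with illustrative function names), the entropy of the labels and the information gain of a candidate split can be computed as:

    def entropy(y):
        # Entropy (in bits) of a vector of class labels.
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(X, y, feature, threshold):
        # Decrease in label entropy from splitting on the rule (X[:, feature] > threshold).
        satisfied = X[:, feature] > threshold
        n, n_true = len(y), satisfied.sum()
        if n_true == 0 or n_true == n:
            return 0.0                   # the "do nothing" baseline rule has zero gain
        return (entropy(y)
                - (n_true / n) * entropy(y[satisfied])
                - ((n - n_true) / n) * entropy(y[~satisfied]))

A split can have positive information gain even when it does not change the accuracy, which is why infogain keeps splitting in the example that follows.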
Example Where Accuracy Fails
• Splitting rule 1: (x > 1).
– This split makes the labels less random: everything to the right of the line is classified as 'o', and the rest as '□'.
– More splits are needed for accurate classification.

  x > 1?
    True  -> 'o'
    False -> '□'

[Scatterplot with a vertical line at x = 1.]
Example Where Accuracy Fails
• Splitting rule 2: (y > 2).
– Splitting rule 1 (x > 1): everything to the right is classified as 'o'.
– Splitting rule 2 (y > 2): everything on the top is classified as 'o'.
– Points with x ≤ 1 and y ≤ 2 are still predicted as '□'.

  x > 1?
    True  -> 'o'
    False -> y > 2?
               True  -> 'o'
               False -> '□'

[Scatterplot with a vertical line at x = 1 and a horizontal line at y = 2.]
Example Where Accuracy Fails
• Splitting rule 3: (y ≤ -2).
– Each split makes the labels less random:
  • Rule 1 (x > 1): everything to the right is classified as 'o'.
  • Rule 2 (y > 2): everything on the top is classified as 'o'.
  • Rule 3 (y ≤ -2): everything on the bottom is classified as 'o'.
– Points with x ≤ 1, y ≤ 2, and y > -2 are still predicted as '□'.

  x > 1?
    True  -> 'o'
    False -> y > 2?
               True  -> 'o'
               False -> y ≤ -2?
                          True  -> 'o'
                          False -> '□'

[Scatterplot with lines at x = 1, y = 2, and y = -2.]
Example Where Accuracy Fails
• Splitting rule 4: (x ≤ -1).
– Everything to the left is classified as 'o'; the remaining points (with -1 < x ≤ 1 and -2 < y ≤ 2) are predicted as '□'.

  x > 1?
    True  -> 'o'
    False -> y > 2?
               True  -> 'o'
               False -> y ≤ -2?
                          True  -> 'o'
                          False -> x ≤ -1?
                                     True  -> 'o'
                                     False -> '□'

[Scatterplot with lines at x = 1, x = -1, y = 2, and y = -2; the '□' points lie inside this box.]
Discussion of Decision Tree Learning
• Advantages:
– Easy to implement.
– Interpretable.
– Learning is fast, and prediction is very fast.
– Can elegantly handle a small number of missing values during training.

• Disadvantages:
– Hard to find optimal set of rules.
– Greedy splitting often not accurate, requires very deep trees.
Discussion of Decision Tree Learning
• Issues:
– Can you revisit a feature?
• Yes, knowing other information could make feature relevant again.
– More complicated rules?
• Yes, but searching for the best rule gets much more expensive.
– What is best score?
• Infogain is the most popular and often works well, but is not always the best.
– What if you get new data?
• You could consider splitting if there is enough data at the leaves, but occasionally might want to re-
learn the whole tree or sub-trees.
– What depth?
• Some implementations stop at a maximum depth.
• Some stop if too few examples in leaf.
• Some stop if infogain is too small.
• Some stop by checking performance on a “validation set” (we will discuss this next time).
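As a concrete illustration (not from the slides), these kinds of stopping rules appear as hyperparameters in scikit-learn's DecisionTreeClassifier; a minimal sketch:

    from sklearn.tree import DecisionTreeClassifier

    model = DecisionTreeClassifier(
        criterion="entropy",       # use an information-gain-style splitting score
        max_depth=3,               # stop at a maximum depth
        min_samples_split=5,       # don't split nodes with too few examples
        min_samples_leaf=2,        # don't create leaves with too few examples
    )
    # model.fit(X, y) trains the tree; model.predict(X) returns predicted labels.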
Summary
• Supervised learning:
– Using data to write a program based on input/output examples.
• Decision trees: predicting a label using a sequence of simple rules.
• Decision stumps: simple decision tree that is very fast to fit.
• Greedy recursive splitting: uses a sequence of stumps to fit a tree.
– Very fast and interpretable, but not always the most accurate.
• Information gain: splitting score based on decreasing entropy.

• Next time: the most important ideas in machine learning.


Additional Learning Materials
Other Considerations for Food Allergy Example
• What types of preprocessing might we do?
– Data cleaning: check for and fix missing/unreasonable values.
– Summary statistics:
• Can help identify “unclean” data.
• Correlation might reveal an obvious dependence (e.g., between "sick" and "peanuts").
– Data transformations:
• Convert everything to same scale? (e.g., grams)
• Add foods from day before? (maybe “sick” depends on multiple days)
• Add date? (maybe what makes you “sick” changes over time).
– Data visualization: look at a scatterplot of each feature and the label.
• Maybe the visualization will show something weird in the features.
• Maybe the pattern is really obvious!
• What you do might depend on how much data you have:
– Very little data:
• Represent food by common allergic ingredients (lactose, gluten, etc.)?
– Lots of data:
• Use more fine-grained features (bread from bakery vs. hamburger bun)?
Going from O(n²d) to O(nd log n) for Numerical Features
• Do we have to compute score from scratch?
– As an example, assume we eat integer number of eggs:
• So the rules (egg > 1) and (egg > 2) have the same decisions, except when (egg == 2).
• We can actually compute the best rule involving ‘egg’ in O(n log n):
– Sort the examples based on ‘egg’, and use these positions to re-arrange ‘y’.
– Go through the sorted values in order, updating the counts of #sick and #not-sick that
both satisfy and don’t satisfy the rules.
– With these counts, it’s easy to compute the classification accuracy (see bonus slide).
• Sorting costs O(n log n) per feature.
• Total cost of updating counts is O(n) per feature.
• Total cost is reduced from O(n²d) to O(nd log n).
• This is a good runtime:
– O(nd) is the size of data, same as runtime up to a log factor.
– We can apply this algorithm to huge datasets.
How do we fit stumps in O(nd log n)?
• Let's say we're trying to find the best rule involving milk:
– First grab the milk column and sort it (using the sort positions to re-arrange the sick column). This step costs O(n log n) due to sorting.
– Now go through the milk values in order, keeping track of #sick and #not-sick that are above/below the current value. E.g., #sick above 0.3 is 5.
– With these counts, the accuracy score is (sum of the most common label counts above and below) / n.

  Sorted milk column (with re-arranged labels):

  Milk  Sick?
  0     0
  0     0
  0     0
  0     0
  0.3   0
  0.6   1
  0.6   1
  0.6   0
  0.7   1
  0.7   1
  1     1
How do we fit stumps in O(nd log n)?
• Start with the baseline rule (), which is always "satisfied":
– If satisfied: #sick = 5 and #not-sick = 6. If not satisfied: #sick = 0 and #not-sick = 0.
– This gives accuracy of (6+0)/n = 6/11.
• Next try the rule (milk > 0), and update the counts based on the 4 rows with milk = 0:
– If satisfied: #sick = 5 and #not-sick = 2. If not satisfied: #sick = 0 and #not-sick = 4.
– This gives accuracy of (5+4)/n = 9/11, which is better.
• Next try the rule (milk > 0.3), and update the counts based on the 1 row with milk = 0.3:
– If satisfied: #sick = 5 and #not-sick = 1. If not satisfied: #sick = 0 and #not-sick = 5.
– This gives accuracy of (5+5)/n = 10/11, which is better.
• (and keep going until you get to the end…)
How do we fit stumps in O(nd log n)?
• Notice that for each row, updating the counts only costs O(1).
– Since there are O(n) rows, the total cost of updating the counts is O(n).
• Instead of 2 labels (sick vs. not-sick), consider the case of 'k' labels:
– Updating the counts still costs O(n), since each row has one label.
– But computing the 'max' across the labels costs O(k), so the cost is O(kn).
• With 'k' labels, you can decrease the cost using a "max-heap" data structure:
– The cost of getting the max is O(1), and the cost of updating the heap for a row is O(log k).
– And k ≤ n (each row has only one label).
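A minimal sketch of this sorted scan for one feature with binary (0/1) labels (assuming NumPy as np; the function name is illustrative):

    def best_rule_for_feature(x, y):
        # Best threshold and accuracy for rules of the form (x > t), in O(n log n):
        # sort once, then sweep the values while updating label counts above/below t.
        order = np.argsort(x)
        x_sorted, y_sorted = x[order], y[order]
        n = len(y)
        above = np.array([np.sum(y_sorted == 0), np.sum(y_sorted == 1)])  # counts with x > t
        below = np.array([0, 0])                                          # counts with x <= t
        best_t, best_acc = None, above.max() / n     # baseline rule: every example "satisfied"
        for i in range(n):
            # Rule (x > x_sorted[i]): example i moves from the "above" side to the "below" side.
            below[y_sorted[i]] += 1
            above[y_sorted[i]] -= 1
            if i + 1 < n and x_sorted[i + 1] == x_sorted[i]:
                continue                             # only score at distinct feature values
            acc = (above.max() + below.max()) / n
            if acc > best_acc:
                best_t, best_acc = x_sorted[i], acc
        return best_t, best_acc

Applying this to each of the 'd' columns gives the O(nd log n) total; on the sorted milk column above, the best rule found is (milk > 0.3) with accuracy 10/11.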
Can decision trees re-visit a feature?

• Yes:

  Milk > 0.5?
    False -> Egg > 1?
               True  -> Sick
               False -> Ice cream > 0.3?
                          False -> Not Sick
                          True  -> Milk > 0.3?        (milk is revisited)
                                     True  -> Sick
                                     False -> Not Sick
    True  -> Lactose > 0?
               False -> Not Sick
               True  -> Egg > 1?
                          True  -> Sick
                          False -> Not Sick

• Knowing (ice cream > 0.3) makes small milk quantities relevant.
Can decision trees have more complicated rules?
• Yes!
• Rules that depend on more than one feature:

  Milk > 0.5?
    False -> Egg > 1?
               True  -> Sick
               False -> Not Sick
    True  -> (Lactose > 0 & Egg > 1)?
               True  -> Sick
               False -> Not Sick

• But now searching for the best rule can get expensive.
Can decision trees have more complicated rules?
• Yes!
• Rules that depend on more than one threshold:

  1995 ≤ Birth Year ≤ 2005?
    True  -> Sick
    False -> Not Sick

• “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”
– Consider decision stumps based on multiple splits of 1 attribute.
– Showed that this gives comparable performance to more-fancy methods on many datasets.
Does being greedy actually hurt?
• Can't you just go deeper to correct greedy decisions?
– Yes, but you need to "re-discover" rules with less data.
• Consider that you are allergic to milk (and drink it often), and also get sick when you (rarely) combine diet coke with mentos.
• The greedy method will first split on milk (it helps accuracy the most), because at the time milk is the best feature to consider:

  Milk > 0.5?
    False -> Pepsi > 0?
               False -> Not Sick
               True  -> Mentos > 0?
                          True  -> Sick
                          False -> Not Sick
    True  -> Pepsi > 0?
               False -> Not Sick
               True  -> Mentos > 0?
                          True  -> Sick
                          False -> Not Sick

• Greedy decision: we make whatever choice seems best at the moment and then solve the subproblems that arise later.
• You have to re-learn this rule twice to build the sub-trees; the worst case is identical sub-trees at the same level.
Does being greedy actually hurt?
• A non-greedy method could get a simpler tree by splitting on milk later:

  Pepsi > 0.5?
    False -> Milk > 0.5?
               True  -> Sick
               False -> Not Sick
    True  -> Mentos > 0?
               True  -> Sick
               False -> Milk > 0.5?
                          True  -> Sick
                          False -> Not Sick

• There is still some repeated structure, but the tree is simpler.
Does being greedy actually hurt?
• Non-greedy methods that find such trees exist, but they are complicated and not popular currently.
Decision Trees with Probabilistic Predictions
• Often, we’ll have multiple ‘y’ values at each leaf node.
• In these cases, we might return probabilities instead of a label.

• E.g., if in the leaf node we have 5 "sick" examples and 1 "not sick" example:
– Return p(y = “sick” | xi) = 5/6 and p(y = “not sick” | xi) = 1/6.

• In general, a natural estimate of the probabilities at the leaf nodes:


– Let ‘nk’ be the number of examples that arrive to leaf node ‘k’.
– Let ‘nkc’ be the number of times (y == c) in the examples at leaf node ‘k’.
– Maximum likelihood estimate for this leaf is p(y = c | xi) = nkc/nk.
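A sketch of this maximum likelihood estimate (assuming NumPy as np and integer class labels 0..k-1; the function name is illustrative):

    def leaf_probabilities(y_leaf, num_classes):
        # p(y = c | x reaches this leaf) = n_kc / n_k for each class c.
        counts = np.bincount(y_leaf, minlength=num_classes)   # n_kc for each class c
        return counts / len(y_leaf)                           # divide by n_k

    print(leaf_probabilities(np.array([1, 1, 1, 1, 1, 0]), num_classes=2))  # [1/6, 5/6]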
Alternative Stopping Rules
• There are more complicated rules for deciding when *not* to split.

• Rules based on minimum sample size.


– Don’t split any nodes where the number of examples is less than some
‘m’.
– Don’t split any nodes that create children with less than ‘m’ examples.
• These types of rules try to make sure that you have enough data to justify decisions.

• Alternately, you can use a validation set (see next lecture):


– Don’t split the node if it decreases an approximation of test accuracy.
