Lecture 4
CS771: Intro to ML
Decision Trees
▪ A Decision Tree (DT) defines a hierarchy of rules to make a prediction
[Figure: an example DT. The root node tests "Body temp." (Warm / Cold): the Cold branch predicts Non-mammal; the Warm branch tests "Gives birth" (Yes → Mammal, No → Non-mammal).]
▪ Root and internal nodes test rules. Leaf nodes make predictions
[Figure: a DT classifying points in a 2-D feature space (Feature 1 = 𝑥1, Feature 2 = 𝑥2). The root tests 𝑥1 > 3.5?; its NO branch tests 𝑥2 > 2? (NO → Predict Red, YES → Predict Green) and its YES branch tests 𝑥2 > 3? (NO → Predict Green, YES → Predict Red).]

Remember: the root node contains all the training inputs; internal/leaf nodes receive a subset of the training inputs.

A DT is very efficient at test time: to predict the label of a test point, nearest neighbors would require computing distances from all 48 training inputs, whereas the DT predicts the label by doing just 2 feature-value comparisons. Way faster!
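To make the 2-comparison claim concrete, here is a minimal sketch of the tree in the figure as nested if/else tests (the function name is ours; the thresholds and labels are read off the figure):

```python
def predict(x1, x2):
    """Classify a 2-D point with the DT from the figure:
    at most 2 feature-value comparisons per prediction."""
    if x1 > 3.5:                               # root node: test feature 1
        return "Red" if x2 > 3 else "Green"    # right subtree
    else:
        return "Green" if x2 > 2 else "Red"    # left subtree

print(predict(5.0, 4.0))  # YES, YES path -> Red
print(predict(2.0, 1.0))  # NO, NO path   -> Red
```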
Decision Tree for Regression: An Example
We can use any regression model here but would like a simple one, so let's use a constant-prediction based regression model. Another simple option can be to predict the average output of the training inputs in the corresponding region.
[Figure: 1-D regression data (input 𝐱, output y). The DT first tests 𝑥 > 4?: if YES, predict 𝑦 = 3.5; if NO, test 𝑥 > 3?: if YES, predict 𝑦 = 3, otherwise predict 𝑦 = 1.5.]
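A minimal sketch of this regression tree as nested threshold tests, where each leaf returns a constant (the leaf values 3.5, 3 and 1.5 are read off the figure and are assumed to be region-wise averages of the training outputs):

```python
def predict(x):
    """Constant-prediction regression DT from the figure:
    each leaf outputs one constant for its region of the input space."""
    if x > 4:
        return 3.5
    elif x > 3:
        return 3.0
    else:
        return 1.5

print(predict(4.5))  # -> 3.5
print(predict(2.0))  # -> 1.5
```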
How to decide which rules to test for and in what order? How to assess the informativeness of a rule?

[Figure: the same 2-D classification example. The root tests 𝑥1 > 3.5?; its NO branch tests 𝑥2 > 2? (NO → Predict Red, YES → Predict Green) and its YES branch tests 𝑥2 > 3? (NO → Predict Green, YES → Predict Red).]

In general, constructing a DT is an intractable problem (NP-hard). Often we can use some "greedy" heuristics to construct a "good" DT: the rules are organized in the DT such that the most informative rules are tested first (hmm.. so DTs are like the "20 questions" game: ask the most useful questions first). To do so, we use the training data to figure out which rules should be tested at each node.

The informativeness of a rule is related to the extent of the purity of the split arising due to that rule: more informative rules yield purer splits.

The same rules will be applied on the test inputs to route them along the tree until they reach some leaf node, where the prediction is made.
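To make "purity of a split" concrete, here is a minimal sketch that scores candidate rules by the weighted entropy of the label counts in their child nodes (entropy is one common purity measure; the helper names and the example counts are ours):

```python
from math import log2

def entropy(counts):
    """Entropy of a label distribution, e.g. counts = [reds, greens]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_impurity(children):
    """Weighted average entropy of the child nodes after a split.
    Lower impurity means purer children, i.e. a more informative rule."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

# Hypothetical [red, green] counts in the two children of each candidate rule
rule_a = [[9, 1], [1, 9]]   # nearly pure children -> informative rule
rule_b = [[5, 5], [5, 5]]   # totally mixed children -> uninformative rule
print(split_impurity(rule_a))  # ~0.469
print(split_impurity(rule_b))  # 1.0
```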
Decision Trees: Some Considerations

▪ What should be the size/shape of the DT? Usually, cross-validation can be used to decide the size/shape (see the sketch below)
▪ Number of internal and leaf nodes
▪ Branching factor of internal nodes

Remember that the root and internal nodes of the DT split the training data (we can think of them as a "classifier").
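One way to choose the size/shape via cross-validation is a grid search over the maximum depth; a minimal sketch using scikit-learn (the dataset and candidate depths are placeholders, and max_depth is just one proxy for size/shape):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; in practice use your own training set
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation over candidate tree depths
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [1, 2, 3, 5, 10, None]},
                      cv=5)
search.fit(X, y)
print(search.best_params_)  # depth with the best cross-validated accuracy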
Decision Tree for Classification: Another Example
▪ Deciding whether to play or not to play Tennis on a Saturday
▪ Each input (Saturday) has 4 categorical features: Outlook, Temp., Humidity, Wind
▪ A binary classification problem (play vs no-play)
▪ Below Left: Training data; Below Right: A decision tree constructed using this data

[Figure: the training table of Saturdays and the DT built from it, with "outlook" tested at the root node.]

Why did we test the outlook feature's value first?
▪ At the root: IG(S, outlook) = 0.246, IG(S, humidity) = 0.151, IG(S, temp) = 0.029
▪ Thus we choose the "outlook" feature to be tested at the root node
▪ Now how to grow the DT, i.e., what to do at the next level? Which feature to test next?
▪ Rule: for each child node, select the feature with the highest IG, and iterate (a worked sketch of the IG computation follows below)
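A minimal worked sketch of the root-node IG computation. Note the assumption: the class counts below (9 play / 5 no-play overall; outlook splitting them as sunny 2/3, overcast 4/0, rain 3/2) come from the classic PlayTennis dataset, not from the excerpt above, but they reproduce the quoted IG(S, outlook):

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a (play, no-play) count pair."""
    total = pos + neg
    return -sum(p / total * log2(p / total)
                for p in (pos, neg) if p > 0)

# S has 9 play / 5 no-play examples; outlook partitions S into 3 subsets
S = (9, 5)
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}

n = sum(S)
ig = entropy(*S) - sum((p + q) / n * entropy(p, q)
                       for p, q in outlook.values())
print(round(ig, 3))  # 0.247, i.e. the 0.246 above up to intermediate rounding
```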
Growing the tree
▪ When features are real-valued (no finite set of possible values to try), things are a bit trickier
▪ Can use tests based on thresholding feature values (recall our synthetic data examples); see the sketch below
▪ Need to be careful w.r.t. the number of threshold points, how fine each range is, etc.
▪ More sophisticated decision rules at the internal nodes can also be used
▪ Basically, we need some rule that splits the inputs at an internal node into homogeneous groups
▪ The rule can even be a machine learning classification algo (e.g., LwP or a deep learner)
▪ However, in DTs we want the tests to be fast, so single-feature based rules are preferred
▪ Also need to take care when handling training or test inputs that have some features missing
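As referenced in the thresholding bullet above, a common heuristic (our assumption, not prescribed in the slides) is to take midpoints between consecutive distinct sorted values of a feature as the finite set of candidate thresholds:

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct feature values:
    a finite set of thresholds to try for a real-valued feature."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

feature_1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # e.g. x1 in the earlier figure
print(candidate_thresholds(feature_1))        # [1.5, 2.5, 3.5, 4.5, 5.5]
```

Note that 3.5 appears among the candidates, matching the 𝑥1 > 3.5 rule used earlier.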
Ensemble of Trees
▪ An ensemble is a collection of models; all the trees can be trained in parallel
▪ Each model makes a prediction; we take their majority as the final prediction
▪ An ensemble of trees is a collection of simple DTs, each tree trained on a subset of the training inputs/features
▪ Often preferred as compared to a single massive, complicated tree
▪ A popular example: Random Forest (RF); see the sketch below

[Figure: an RF with 3 simple trees; the majority prediction will be the final prediction.]
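A minimal sketch of an RF and its majority vote using scikit-learn (the dataset is a placeholder; n_estimators=3 mirrors the 3-tree figure):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the slide's training set
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 3 simple trees, as in the figure; each tree sees a bootstrap sample of
# the inputs and a random subset of the features at each split
rf = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0)
rf.fit(X, y)

# The forest's prediction is the majority vote of its trees
print(rf.predict(X[:1]))
print([int(t.predict(X[:1])[0]) for t in rf.estimators_])  # individual votes
```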