Lecture 4
CS771: Intro to ML
Decision Trees
▪ A Decision Tree (DT) defines a hierarchy of rules to make a prediction
[Figure: an example DT. The root node tests "Body temp." (Warm / Cold): the Cold branch predicts Non-mammal; the Warm branch tests "Gives birth" (Yes → Mammal, No → Non-mammal).]
▪ Root and internal nodes test rules. Leaf nodes make predictions
[Figure: a DT classifying points in a 2-D feature space (Feature 1 = 𝑥1, Feature 2 = 𝑥2). The root tests 𝑥1 > 3.5?; its NO branch tests 𝑥2 > 2? (NO → Predict Red, YES → Predict Green) and its YES branch tests 𝑥2 > 3? (NO → Predict Green, YES → Predict Red).]

Remember: the root node contains all the training inputs; internal/leaf nodes receive a subset of the training inputs.

A DT is very efficient at test time: to predict the label of a test point, nearest neighbors would require computing distances from all 48 training inputs, whereas the DT predicts the label by doing just 2 feature-value comparisons. Way faster!
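To make the 2-comparison claim concrete, here is a minimal sketch of the tree in the figure as nested if/else tests (the function name is ours; the thresholds and labels are read off the figure):

```python
def predict(x1, x2):
    """Classify a 2-D point with the DT from the figure:
    at most 2 feature-value comparisons per prediction."""
    if x1 > 3.5:                               # root node: test feature 1
        return "Red" if x2 > 3 else "Green"    # right subtree
    else:
        return "Green" if x2 > 2 else "Red"    # left subtree

print(predict(5.0, 4.0))  # YES, YES path -> Red
print(predict(2.0, 1.0))  # NO, NO path   -> Red
```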
Decision Tree for Regression: An Example
We can use any regression model here but would like a simple one, so let's use a constant-prediction based regression model. Another simple option can be to predict the average output of the training inputs in the corresponding region.
[Figure: 1-D regression data (input 𝐱, output y). The DT first tests 𝑥 > 4?: if YES, predict 𝑦 = 3.5; if NO, test 𝑥 > 3?: if YES, predict 𝑦 = 3, otherwise predict 𝑦 = 1.5.]
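A minimal sketch of this regression tree as nested threshold tests, where each leaf returns a constant (the leaf values 3.5, 3 and 1.5 are read off the figure and are assumed to be region-wise averages of the training outputs):

```python
def predict(x):
    """Constant-prediction regression DT from the figure:
    each leaf outputs one constant for its region of the input space."""
    if x > 4:
        return 3.5
    elif x > 3:
        return 3.0
    else:
        return 1.5

print(predict(4.5))  # -> 3.5
print(predict(2.0))  # -> 1.5
```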
How to decide which rules to test for and in what order? How to assess the informativeness of a rule?

[Figure: the same 2-D classification example. The root tests 𝑥1 > 3.5?; its NO branch tests 𝑥2 > 2? (NO → Predict Red, YES → Predict Green) and its YES branch tests 𝑥2 > 3? (NO → Predict Green, YES → Predict Red).]

In general, constructing a DT is an intractable problem (NP-hard). Often we can use some "greedy" heuristics to construct a "good" DT: the rules are organized in the DT such that the most informative rules are tested first (hmm.. so DTs are like the "20 questions" game: ask the most useful questions first). To do so, we use the training data to figure out which rules should be tested at each node.

The informativeness of a rule is related to the extent of the purity of the split arising due to that rule: more informative rules yield purer splits.

The same rules will be applied on the test inputs to route them along the tree until they reach some leaf node, where the prediction is made.
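To make "purity of a split" concrete, here is a minimal sketch that scores candidate rules by the weighted entropy of the label counts in their child nodes (entropy is one common purity measure; the helper names and the example counts are ours):

```python
from math import log2

def entropy(counts):
    """Entropy of a label distribution, e.g. counts = [reds, greens]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_impurity(children):
    """Weighted average entropy of the child nodes after a split.
    Lower impurity means purer children, i.e. a more informative rule."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

# Hypothetical [red, green] counts in the two children of each candidate rule
rule_a = [[9, 1], [1, 9]]   # nearly pure children -> informative rule
rule_b = [[5, 5], [5, 5]]   # totally mixed children -> uninformative rule
print(split_impurity(rule_a))  # ~0.469
print(split_impurity(rule_b))  # 1.0
```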
Decision Trees: Some Considerations

▪ What should be the size/shape of the DT? Usually, cross-validation can be used to decide the size/shape (see the sketch below)
▪ Number of internal and leaf nodes
▪ Branching factor of internal nodes

Remember that the root and internal nodes of the DT split the training data (we can think of them as a "classifier").
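One way to choose the size/shape via cross-validation is a grid search over the maximum depth; a minimal sketch using scikit-learn (the dataset and candidate depths are placeholders, and max_depth is just one proxy for size/shape):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; in practice use your own training set
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation over candidate tree depths
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [1, 2, 3, 5, 10, None]},
                      cv=5)
search.fit(X, y)
print(search.best_params_)  # depth with the best cross-validated accuracy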
Decision Tree for Classification: Another Example
▪ Deciding whether to play or not to play Tennis on a Saturday
▪ Each input (Saturday) has 4 categorical features: Outlook, Temp., Humidity, Wind
▪ A binary classification problem (play vs no-play)
▪ Below Left: Training data; Below Right: A decision tree constructed using this data

[Figure: the training table of Saturdays and the DT built from it, with "outlook" tested at the root node.]

Why did we test the outlook feature's value first?
▪ At the root: IG(S, outlook) = 0.246, IG(S, humidity) = 0.151, IG(S, temp) = 0.029
▪ Thus we choose the "outlook" feature to be tested at the root node
▪ Now how to grow the DT, i.e., what to do at the next level? Which feature to test next?
▪ Rule: for each child node, select the feature with the highest IG, and iterate (a worked sketch of the IG computation follows below)
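A minimal worked sketch of the root-node IG computation. Note the assumption: the class counts below (9 play / 5 no-play overall; outlook splitting them as sunny 2/3, overcast 4/0, rain 3/2) come from the classic PlayTennis dataset, not from the excerpt above, but they reproduce the quoted IG(S, outlook):

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a (play, no-play) count pair."""
    total = pos + neg
    return -sum(p / total * log2(p / total)
                for p in (pos, neg) if p > 0)

# S has 9 play / 5 no-play examples; outlook partitions S into 3 subsets
S = (9, 5)
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}

n = sum(S)
ig = entropy(*S) - sum((p + q) / n * entropy(p, q)
                       for p, q in outlook.values())
print(round(ig, 3))  # 0.247, i.e. the 0.246 above up to intermediate rounding
```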
Growing the tree
▪ When features are real-valued (no finite set of possible values to try), things are a bit trickier
▪ Can use tests based on thresholding feature values (recall our synthetic data examples); see the sketch below
▪ Need to be careful w.r.t. the number of threshold points, how fine each range is, etc.
▪ More sophisticated decision rules at the internal nodes can also be used
▪ Basically, we need some rule that splits the inputs at an internal node into homogeneous groups
▪ The rule can even be a machine learning classification algo (e.g., LwP or a deep learner)
▪ However, in DTs we want the tests to be fast, so single-feature based rules are preferred
▪ Also need to take care when handling training or test inputs that have some features missing
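As referenced in the thresholding bullet above, a common heuristic (our assumption, not prescribed in the slides) is to take midpoints between consecutive distinct sorted values of a feature as the finite set of candidate thresholds:

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct feature values:
    a finite set of thresholds to try for a real-valued feature."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

feature_1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # e.g. x1 in the earlier figure
print(candidate_thresholds(feature_1))        # [1.5, 2.5, 3.5, 4.5, 5.5]
```

Note that 3.5 appears among the candidates, matching the 𝑥1 > 3.5 rule used earlier.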
Ensemble of Trees
▪ An ensemble is a collection of models; all the trees can be trained in parallel
▪ Each model makes a prediction; we take their majority as the final prediction
▪ An ensemble of trees is a collection of simple DTs, each tree trained on a subset of the training inputs/features
▪ Often preferred as compared to a single massive, complicated tree
▪ A popular example: Random Forest (RF); see the sketch below

[Figure: an RF with 3 simple trees; the majority prediction will be the final prediction.]
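A minimal sketch of an RF and its majority vote using scikit-learn (the dataset is a placeholder; n_estimators=3 mirrors the 3-tree figure):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the slide's training set
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 3 simple trees, as in the figure; each tree sees a bootstrap sample of
# the inputs and a random subset of the features at each split
rf = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0)
rf.fit(X, y)

# The forest's prediction is the majority vote of its trees
print(rf.predict(X[:1]))
print([int(t.predict(X[:1])[0]) for t in rf.estimators_])  # individual votes
```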