Decision Tree in ML
Decision Trees
A Decision Tree (DT) defines a hierarchy of rules to make a prediction.
[Figure: a DT that classifies animals as Mammal vs Non-mammal, starting from the root node with a test on body temperature (warm vs cold) and then on whether the animal gives birth (yes vs no).]
[Figure: a DT on 2D synthetic data, with internal-node threshold tests on the features and four leaf nodes predicting Red, Green, Green, Red over regions of the Feature 1 / Feature 2 plane.]
Remember: the root node contains all the training inputs; each leaf node receives a subset of the training inputs.
DT is very efficient at test time: to predict the label of a test point, nearest neighbours would require computing distances from all 48 training inputs, whereas the DT predicts the label by doing just 2 feature-value comparisons! Way faster.
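As a minimal sketch of this point (the tree structure and thresholds below are hypothetical, not the exact ones from the figure), prediction with a small DT is just a couple of nested feature-value comparisons:

def dt_predict(x):
    """x is a 2D point (x1, x2); returns a colour label via at most 2 comparisons."""
    x1, x2 = x
    if x1 > 3.0:                                  # root-node test (hypothetical threshold)
        return "Red" if x2 > 2.0 else "Green"     # second test, then a leaf prediction
    else:
        return "Green" if x2 > 2.0 else "Red"     # second test, then a leaf prediction

print(dt_predict((4.5, 2.5)))   # label produced with just 2 comparisons, no distance computations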
Decision Trees for Classification: Another Example
Deciding whether to play or not to play Tennis on a Saturday
Each input (Saturday) has 4 categorical features: Outlook,
Temp., Humidity, Wind
A binary classification problem (play vs no-play)
Below Left: Training data, Below Right: A decision tree
constructed using this data
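A small sketch of fitting a DT to play-tennis style data (the rows below are illustrative, not the slide's exact table): the 4 categorical features are one-hot encoded so that scikit-learn's DecisionTreeClassifier can consume them.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Outlook":  ["Sunny", "Overcast", "Rain",   "Sunny",  "Rain"],
    "Temp":     ["Hot",   "Mild",     "Cool",   "Cool",   "Mild"],
    "Humidity": ["High",  "Normal",   "Normal", "Normal", "High"],
    "Wind":     ["Weak",  "Strong",   "Weak",   "Weak",   "Strong"],
    "Play":     ["No",    "Yes",      "Yes",    "Yes",    "No"],
})

X = pd.get_dummies(data.drop(columns="Play"))   # one-hot encode the 4 categorical features
y = data["Play"]

clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(clf.predict(X[:1]))                        # predict play/no-play for the first Saturday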
In general, constructing a DT is an intractable problem (NP-hard). Often we can use some "greedy" heuristics to construct a "good" DT.
To do so, we use the training data to figure out which rules should be tested at each node. The same rules will be applied on the test inputs to route them along the tree until they reach some leaf node where the prediction is made.
The rules are organized in the DT such that the most informative rules are tested first. (Hmm.. so DTs are like the "20 questions" game: ask the most useful questions first.)
Informativeness of a rule is related to the extent of the purity of the split arising due to that rule: more informative rules yield more pure splits.
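A small sketch (with made-up play/no-play counts, not the slide's table) of how a rule's informativeness can be measured via the purity of the split it induces, using entropy and information gain:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Hypothetical labels before and after splitting on two candidate rules
parent  = ["play"] * 9 + ["no"] * 5
split_a = [["play"] * 6 + ["no"] * 1, ["play"] * 3 + ["no"] * 4]   # fairly pure children
split_b = [["play"] * 5 + ["no"] * 3, ["play"] * 4 + ["no"] * 2]   # less pure children

print(information_gain(parent, split_a))   # higher gain -> more informative rule
print(information_gain(parent, split_b))   # lower gain  -> less informative rule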
Decision Tree Construction: An Example
Let’s consider the playing Tennis example
Assume each internal node will test the value of one of the
features
When features are real-valued (no finite possible values to try), things are
a bit more tricky
Can use tests based on thresholding feature values (recall our
synthetic data examples)
Need to be careful w.r.t. number of threshold points, how fine each
range is, etc.
More sophisticated decision rules at the internal nodes can also be used
Basically, need some rule that splits inputs at an internal node into
homogeneous groups
The rule can even be a machine learning classification algo (e.g., LwP or a deep learner).
Reference: Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees.
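A sketch of the thresholding idea for a real-valued feature (an assumed illustration, not the course's exact procedure): candidate thresholds are taken as midpoints between consecutive sorted feature values, and each candidate is scored by the purity of the split it produces (Gini impurity here).

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    order = sorted(range(len(values)), key=lambda i: values[i])
    v = [values[i] for i in order]
    y = [labels[i] for i in order]
    best = None
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        t = (v[i] + v[i - 1]) / 2          # candidate threshold: midpoint between neighbours
        left, right = y[:i], y[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, t)
    return best                             # (weighted impurity, best threshold)

# Toy data: one real-valued feature with red/green labels
print(best_threshold([1.0, 2.0, 3.0, 4.0, 5.0],
                     ["red", "red", "green", "green", "red"]))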
An Illustration: DT with Real-Valued Features
[Figure: 2D data with real-valued features (Feature 1 on the x-axis, Feature 2 on the y-axis), showing the "best" (purest possible) horizontal and vertical splits, the resulting threshold test at an internal node, the leaf predictions, and a test example being routed through the tree.]
Some key strengths:
Can handle different types of features (real, categorical, etc.)
Very fast at test time
Multiple DTs can be combined via ensemble methods: more powerful (e.g., Decision Forests; will see later; a brief sketch follows below)
Used in several real-world ML applications, e.g., recommender systems, gaming (Kinect human-body pose estimation)
Some key weaknesses:
Learning the optimal DT is intractable (NP-hard); existing algos are mostly greedy heuristics
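A brief sketch of combining multiple DTs via an ensemble, as mentioned above. It uses scikit-learn's RandomForestClassifier on made-up 2D data; the dataset and settings are illustrative only.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(200, 2) * 6                                     # 2 real-valued features
y = (X[:, 0] > 3).astype(int) ^ (X[:, 1] > 2).astype(int)    # toy labels (XOR-like regions)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_tr, y_tr)                                       # each tree sees a bootstrap sample
print("test accuracy:", forest.score(X_te, y_te))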
Key Hyperparameters (scikit-learn DecisionTreeClassifier)
1. criterion
•What it does: Chooses the function to measure the quality of a split.
•Options:
• "gini" (default) — Uses Gini Impurity.
• "entropy" — Uses Entropy (from Information Gain).
•Impact: Affects how the tree decides where to split.
2. max_depth
•What it does: Sets the maximum depth of the tree.
•Impact: Controls overfitting (deep trees may overfit) and underfitting (shallow trees
may underfit).
3. min_samples_split
•What it does: Minimum number of samples required to split an internal node.
•Default: 2
•Impact: Higher values = fewer splits = simpler tree.
4. min_samples_leaf
•What it does: Minimum number of samples that must be in a leaf node.
•Default: 1
•Impact: Bigger values = more pruning = prevents very small leaves.
5. max_features
•What it does: Number of features to consider when looking for the best split.
•Options:
• Integer (exact count)
• Float (proportion of total features)
• "sqrt" (square root of total features — common in Random Forests)
• "log2"
• None (use all features)
6. max_leaf_nodes
•What it does: Maximum number of leaf nodes.
•Impact: Limits tree growth and simplifies model.
7. splitter
•What it does: Chooses strategy to split nodes.
•Options:
• "best" — Chooses the best split.
• "random" — Chooses a random split among the best candidates.
8. class_weight
•What it does: Weights assigned to each class (for handling imbalance).
•Options:
• None — All classes treated equally.
• balanced — Weights are inversely proportional to class frequencies.

Example with these hyperparameters:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="gini",
    max_depth=5,
    min_samples_split=4,
    min_samples_leaf=2,
    max_features="sqrt",
)
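A short usage sketch tying the hyperparameters above to a fit/predict call. The data here is synthetic (make_classification with an imbalanced class mix) and the chosen values are illustrative, not recommendations.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = DecisionTreeClassifier(
    criterion="entropy",        # split quality measured by entropy
    max_depth=4,                # limit depth to control overfitting
    min_samples_leaf=5,         # avoid very small leaves
    class_weight="balanced",    # compensate for the imbalanced classes
    random_state=0,
)
clf.fit(X_tr, y_tr)
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
print("test accuracy:", clf.score(X_te, y_te))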