
Decision Trees

Midsem Exam
Feb 22 (Thu), 6PM, L16,17,18,19,20
Only for registered students (regular + audit)
Assigned seating – will be announced soon
Open notes (handwritten only)
No mobile phones, tablets etc
Bring your institute ID card
If you don’t bring it, you may have to spend precious time during the exam getting verified separately
Syllabus:
All videos, slides, code linked on the course discussion page (link below) till 21 Feb 2024 (Wed)
https://www.cse.iitk.ac.in/users/purushot/courses/ml/2023-24-w/discussion.html
See GitHub for practice questions
Doubt Clearing and Practice Session
Feb 21, 2024 (Wed), 11PM, Online
Exact timing and meeting link TBA
Solve previous years' questions
Clear doubts
Building Decision Trees
[Figure: a decision tree over 2-D data with features X and Y. The root asks "X < 5.5?" (Yes goes left, No goes right); internal nodes ask "Y > 10.5", "Y > 9", "X < 11.5", "X < 8.5", "Y > 2.5", "Y > 5.5" and "X < 12"; the remaining nodes are leaves. An accompanying scatter plot (axes running 1–12 and 1–14) shows the data being partitioned.]
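A minimal sketch (not from the slides; the `Node` layout, field names and the toy labels are my own) of how such a tree can be stored and how a test point is routed to a leaf:

```python
# A node either asks "is x[feature] < threshold?" (internal) or stores a label (leaf)
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # index of the feature tested at this node
        self.threshold = threshold  # go left if x[feature] < threshold, else go right
        self.left = left
        self.right = right
        self.label = label          # prediction stored at a leaf (None for internal nodes)

def predict(node, x):
    """Send a test point down the tree until it reaches a leaf."""
    while node.label is None:       # still at an internal node
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.label

# e.g. a tiny tree whose root asks "X < 5.5?" as in the figure above
root = Node(feature=0, threshold=5.5,
            left=Node(label="left class"), right=Node(label="right class"))
print(predict(root, [3.0, 7.0]))    # 3.0 < 5.5, so we land in the left leaf
```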
Decision Trees – all shapes and sizes
This DT is balanced – all leaf nodes are at the same depth from the root. The previous DT was very imbalanced, which is generally considered bad
May prune the tree to make it more shallow as well. Also possible to have a DT with more than 2 children per internal node
Imbalanced DTs may offer very poor prediction accuracy as well as take as long as kNN to make a prediction. Imagine a DT which is just a chain of nodes: with n data points the chain could be n nodes deep, so some predictions will take O(n) time. With a balanced DT, every prediction takes at most O(log n) time
Regression with Decision Trees
To perform real-valued regression, we may simply use the average score of the training points at a leaf node to predict scores for test data points
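A quick illustration of this average-at-the-leaf rule (the helper name `leaf_value` is mine, assuming NumPy is available):

```python
import numpy as np

# a regression leaf just remembers the average target of the training points
# that reached it and predicts that constant for any test point routed there
def leaf_value(targets_at_leaf):
    return float(np.mean(targets_at_leaf))

print(leaf_value([2.0, 3.0, 7.0]))   # 4.0
```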
How to learn a DT?
How many children should a node have?
How to send data points to children?
When to stop splitting and make the node a leaf?
What to do at a leaf?
How many trees to train?

What to do at a leaf?
Can take any (complicated) action at a leaf
Cheapest thing to do would be to store the majority color (label) at a leaf. A slightly more informative (more expensive as well) thing can be to store how many training points of each color (label) reached that leaf
Why not call another machine learning algorithm?
For speed, keep leaf action simple
Simplest action – constant prediction
Such a DT will encode a piecewise constant
prediction function
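A small sketch of the two cheap leaf actions mentioned above (function names are my own):

```python
from collections import Counter

def majority_label(labels_at_leaf):
    """Cheapest leaf action: store (and predict) only the majority label."""
    return Counter(labels_at_leaf).most_common(1)[0][0]

def label_counts(labels_at_leaf):
    """Slightly more informative (and more expensive): store all label counts."""
    return dict(Counter(labels_at_leaf))

leaf = ["red", "red", "blue"]
print(majority_label(leaf))   # 'red'  -> constant prediction
print(label_counts(leaf))     # {'red': 2, 'blue': 1}
```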
How to split a node into children nodes?
In principle there is no restriction (e.g. we could even use a deep net to split a node). However, in practice we use a simple ML algorithm, such as a linear classifier, to split nodes. This is because the usefulness of DTs largely comes from being able to rapidly send a test data point to a leaf
Notice, splitting a node is a classification problem in itself! Binary if two children, multiclass if more than 2 children
Can we use any classification technique to split a node or are there some restrictions?
Oh! So we are using a simple ML technique such as binary classification to learn a DT!
Splitting a Node – some lessons
Various notions of purity exist – entropy and Gini index for classification problems, variance for regression problems
Node splitting algorithm must be fast, else DT predictions will be slow
Making sure that the split is balanced (e.g. roughly half the data points go left and half go right) is also important to ensure that the tree is balanced. However, ensuring balance is often tricky
Often people carefully choose just a single feature and split a node based on that (e.g. age < 25 go left, age ≥ 25 go right). Such "simple classifiers" are often called decision stumps
Pure nodes are very convenient. We can make them leaves right away and not have to worry about splitting them 🙂
A child node is completely pure if it contains training data of only one class.
How do I decide whether to use age or gender? Even if using age, how do I decide whether to threshold at 25 or 65?
Usually, people go over all available features and all possible thresholds (can be slow if not done cleverly) and choose a feature and a threshold for that feature so that the child nodes that are created are as pure as possible
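The purity notions named above, written out in minimal Python (function names are mine, assuming NumPy is available):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy of a set of class labels, in bits."""
    n = len(labels)
    p = np.array([c / n for c in Counter(labels).values()])
    return float(np.sum(p * np.log2(1.0 / p)))

def gini(labels):
    """Gini index: chance that two random draws from the set disagree."""
    n = len(labels)
    p = np.array([c / n for c in Counter(labels).values()])
    return float(1.0 - np.sum(p ** 2))

def variance(targets):
    """Variance of real-valued targets -- the regression analogue of impurity."""
    return float(np.var(targets))

print(entropy(["a", "a", "b", "b"]))   # 1.0 bit: a maximally impure two-class set
print(gini(["a", "a", "a", "a"]))      # 0.0: a pure node
```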
Purifying Decision Stumps
[Figure: two candidate stumps on the same 2-D data – the purest horizontal split and the purest vertical split, each shown with its Left and Right children.]
Search for the purest horizontal split by going over all possible thresholds and checking the purity of the resulting children; do the same in the vertical direction
Two possible splitting directions. Let us choose the one that gives us purer children – the purest vertical split is more pure, so use it!
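A brute-force version of this search, i.e. try every feature and every observed threshold and keep the stump whose children have the lowest weighted entropy (a sketch with names of my own choosing, not the course code):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * np.log2(n / c) for c in Counter(labels).values())

def best_stump(X, y):
    """Try every feature and every observed threshold; keep the stump whose
    two children have the lowest weighted entropy (i.e. are purest)."""
    n, d = X.shape
    best_feature, best_threshold, best_score = None, None, float("inf")
    for f in range(d):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] < t], y[X[:, f] >= t]
            if len(left) == 0 or len(right) == 0:
                continue                      # not a real split
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if score < best_score:
                best_feature, best_threshold, best_score = f, t, score
    return best_feature, best_threshold, best_score

# toy data: a single "age" feature, threshold 30 purifies both children
X = np.array([[20.0], [22.0], [30.0], [40.0]])
y = np.array(["young", "young", "old", "old"])
print(best_stump(X, y))   # feature 0, threshold 30.0, weighted entropy 0.0
```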
Node splitting via linear classifiers
One Final Recap
Pruning Strategies
Stop if node is pure or almost pure
Stop if all features exhausted – avoid using a feature twice on a path
Limits depth of tree to d (the number of dimensions)
Can stop if a node is ill-populated i.e. has few training points
Can also (over) grow a tree and then merge nodes to shrink it
Merge two leaves and see if it worsens performance on the
validation set or not – rinse and repeat
Use a validation set to make these decisions (never touch test set)
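A rough sketch of the grow-then-merge idea, assuming leaves store label counts so that two sibling leaves can be merged; the `Node` layout and helper names are my own, not from the slides:

```python
from collections import Counter

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, counts=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.counts = counts                   # Counter of training labels (leaves only)

    def is_leaf(self):
        return self.counts is not None

def predict(node, x):
    while not node.is_leaf():
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.counts.most_common(1)[0][0]    # majority label at the leaf

def accuracy(root, X_val, y_val):
    return sum(predict(root, x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def nodes_with_two_leaf_children(node):
    if node is None or node.is_leaf():
        return []
    here = [node] if node.left.is_leaf() and node.right.is_leaf() else []
    return here + nodes_with_two_leaf_children(node.left) + nodes_with_two_leaf_children(node.right)

def prune(root, X_val, y_val):
    """Merge a pair of sibling leaves whenever doing so does not hurt validation
    accuracy; rinse and repeat until no merge survives the check."""
    merged = True
    while merged:
        merged = False
        for node in nodes_with_two_leaf_children(root):
            before = accuracy(root, X_val, y_val)
            left, right = node.left, node.right
            node.counts, node.left, node.right = left.counts + right.counts, None, None
            if accuracy(root, X_val, y_val) < before:   # merge made things worse: undo
                node.counts, node.left, node.right = None, left, right
            else:
                merged = True
```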
Decision Trees - Lessons
Very fast at making predictions (if tree is reasonably balanced)
Can handle discrete data (even non-numeric data) as well – e.g. can have a stump such as: blood group AB or O go left, else go right
SVM, RR etc. have difficulty with such non-numeric and discrete data since it is difficult to define distances and averages with them (however, there are workarounds to do SVM etc. with discrete data as well)
Tons of DT algorithms exist – both classical (ID3, C4.5) as well as recent (GBDT, LPSR, Parabel) – DTs are versatile and very useful
Reason: DT learning is an NP-hard problem 🙁 so everyone relies on heuristics. If you think you have a better way of splitting nodes or handling leaf actions, it may well become yet another DT algorithm
Uncertainty
Playing Hangman: to win at hangman, we must ask questions that eliminate wrong answers quickly
Imagine a language where the letter "#" appears in every word. Guessing # gives us no useful information
[Figure: starting from 4096 candidate words, each good question halves the set – 4096, then 2048 and 2048, then four sets of 1024, … 10 levels later … sets of size 1.]
Similarly, very rare letters are not very good either – they will occasionally help us identify the word very quickly but will mostly cause us to make a mistake.
Uncertainty Reduction – Hangman
Start State: amid, baby, back, bake, bike, book, bump, burn, cave, chip, cook, damp, duck, dump, fade, good, have, high, hook, jazz, jump, kick, maid, many, mind, monk, much, must, paid, pain, park, pick, pine, pipe, pond, pony, pump, push, quick, quid, quit, sail, same, save, sight, size, stay, study, stuff, suffer, sway, tail, twin, wage, wake, wall, warn, wave, weak, wear, whip, wife, will, wind, wine, wing, wipe, wise, wish, with, wood, wound, year
Goal States: singleton sets such as {amid}, {hook}, {sail}, …, {year}
Good Question: splits the words into two roughly equal halves (amid … pony on one side, pump … year on the other)
Bad Question: splits off a single word (amid) and leaves all the remaining words (baby … year) together
Uncertainty Reduction – Classification
I can see that we wish to go from an uncertain start state to goal states where we are certain about a prediction – but how we define a good question is still a bit vague
[Figure: the start state / goal states / good question / bad question picture from the previous slide, redrawn for a classification dataset.]
Entropy is a measure of Uncertainty
(Notions of entropy exist for real-valued cases as well but they involve probability density functions, so we skip those for now)
If we have a set of $n$ words, then that set has an entropy of $\log_2 n$
Larger sets have larger entropy and a set with a single word has entropy $0$. Makes sense, since we have no uncertainty if only a single word is possible
More generally, if there is a set of $n$ elements of $k$ types with $n_i$ elements of type $i$, then its entropy is defined as
$$H = -\sum_{i=1}^{k} p_i \log_2 p_i$$
where $p_i = n_i / n$ is the proportion of elements of type $i$ (or class $i$ in multiclass cases)
The earlier example is a special case where each word is its own "type" i.e., there are $n$ "types" with $n_i = 1$ for all $i$
A pure set (all elements of a single class) has entropy $0$ whereas a set with the same number of elements of each of $k$ classes has entropy $\log_2 k$
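A quick numerical check of these definitions (the helper name is mine; it computes $\sum_i p_i \log_2(1/p_i)$, which is the same quantity as above):

```python
import math
from collections import Counter

def entropy(items):
    """H = sum_i p_i * log2(1/p_i); equals log2(n) when all n items are distinct."""
    n = len(items)
    return sum((c / n) * math.log2(n / c) for c in Counter(items).values())

print(entropy(["amid", "baby", "back", "bake"]))  # 4 distinct words -> log2(4) = 2.0 bits
print(entropy(["A", "A", "A", "A"]))              # pure set -> 0.0
print(entropy(["A", "A", "B", "B"]))              # two equally common classes -> 1.0 bit
```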
What is a good question?
No single criterion – depends on the application
ID3 (Iterative Dichotomiser 3 by Ross Quinlan) suggests that a good question is one that reduces confusion the most i.e., reduces entropy the most
Suppose asking a question splits a set $S$ into subsets $S_1, \ldots, S_k$
Note that $S_i \cap S_j = \emptyset$ if $i \neq j$ and $\bigcup_i S_i = S$
Let us denote $n_i = |S_i|$ and $n = |S|$ – note that $\sum_i n_i = n$
Then the entropy of this collection of sets is defined to be
$$H(S_1, \ldots, S_k) = \sum_{i=1}^{k} \frac{n_i}{n} \cdot H(S_i)$$
Can interpret this as "average" or "weighted" entropy since a fraction $\frac{n_i}{n}$ of the points will land up in the set $S_i$ where the entropy is $H(S_i)$
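A short sketch of this weighted entropy and the resulting information gain for a split of labelled points (function names are mine):

```python
import math
from collections import Counter

def entropy(items):
    n = len(items)
    return sum((c / n) * math.log2(n / c) for c in Counter(items).values())

def weighted_entropy(subsets):
    """H(S_1, ..., S_k) = sum_i (n_i / n) * H(S_i)."""
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * entropy(s) for s in subsets)

def information_gain(parent, subsets):
    return entropy(parent) - weighted_entropy(subsets)

# splitting a perfectly mixed set into two pure halves gains exactly 1 bit
parent = ["A", "A", "B", "B"]
print(information_gain(parent, [["A", "A"], ["B", "B"]]))   # 1.0
```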
A good question for Hangman
The definition of entropy/information and why these were named as such is a bit unclear, but they have several interesting properties e.g., if our first question halves the set of words (1 bit of info) and the next question further quarters the remaining set (2 bits of info), then we have 3 bits of info and our set size has gone down by a power of $2^3 = 8$ i.e., information defined this way can be added up!
Suppose a question splits a set of 4096 words into (2048, 2048)
Old entropy was $\log_2 4096 = 12$ bits
New entropy is $\frac{2048}{4096}\log_2 2048 + \frac{2048}{4096}\log_2 2048 = 11$ bits
Entropy reduced by $1$, so we say we gained $1$ bit of information
Suppose a question splits the set into (1024, 1024, 1024, 1024)
New entropy is $\log_2 1024 = 10$ bits
Gained $2$ bits of information – makes sense – each set is smaller
Suppose a question splits the set into (16, 64, 4016)
New entropy is $\frac{16}{4096}\log_2 16 + \frac{64}{4096}\log_2 64 + \frac{4016}{4096}\log_2 4016 \approx 11.85$ bits
We gained only about $0.15$ bits of information 🙁
Yup! In fact, there is a mathematical proof that the definition of entropy we used is the only definition that satisfies 3 intuitive requirements. Suppose an event occurs with probability $p$ and we wish to measure the information $I(p)$ from that event's occurrence s.t.
1. A sure event conveys no information i.e., $I(1) = 0$
2. The more common the event, the less information it conveys i.e., $I(p) \leq I(q)$ if $p \geq q$
3. The information conveyed by two independent events adds up i.e., $I(pq) = I(p) + I(q)$
I see … the only definition of $I$ that satisfies all three requirements is $I(p) = -\log_b p$ for some base $b$. We then define entropy as $H = -\sum_i p_i \log_b p_i$. If we choose base $2$ we get information in "bits" (binary digits). If we choose base $e$ we get information in "nits" (natural digits) aka nats. If we choose base $10$ we get information in "dits" (decimal digits) aka hartleys
The ID3 Algorithm
Given a test data point, we go down the tree using
the splitting criteria till we reach a leaf where we
use the leaf action to make our prediction

With $S$ as the set of all training points, create a root node $r$ and call train($r$, $S$)
Train(node $n$, set $S$):
If $S$ is sufficiently pure or sufficiently small, make $n$ a leaf, decide a simple leaf action (e.g., most popular class, label popularity vector, etc.) and return
Else, out of the available choices, choose the splitting criterion (e.g. a single feature) that causes maximum information gain i.e., reduces entropy the most
Split along that criterion to get a partition of $S$ into $S_1, \ldots, S_k$ (e.g. $k$ children if that feature takes $k$ distinct values)
Create child nodes $n_1, \ldots, n_k$ and call train($n_i$, $S_i$) for each
There are several augmentations to this algorithm e.g. C4.5, C5.0 that allow handling real-valued features, missing features, boosting etc.
Note: ID3 will not ensure a balanced tree but usually balance is decent
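A compact sketch of the Train() recursion above for purely categorical features (the data layout and names are mine; real implementations such as C4.5 add thresholds for numeric features, pruning, etc.):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def id3(rows, labels, features):
    """rows: list of dicts mapping feature name -> categorical value."""
    # stopping rule: pure node (or no features left) -> leaf with the most popular class
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # choose the feature with maximum information gain (= minimum weighted entropy)
    def weighted_entropy(f):
        total = 0.0
        for v in set(r[f] for r in rows):
            sub = [lab for r, lab in zip(rows, labels) if r[f] == v]
            total += len(sub) / len(rows) * entropy(sub)
        return total
    best = min(features, key=weighted_entropy)
    # split into one child per distinct value of that feature and recurse
    children = {}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        children[v] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                          [f for f in features if f != best])
    return {"feature": best, "children": children}

# toy categorical data (values are made up, not from the slides)
rows = [{"grp": "AB"}, {"grp": "O"}, {"grp": "A"}, {"grp": "B"}]
labels = ["left", "left", "right", "right"]
print(id3(rows, labels, ["grp"]))
```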
Careful use of DTs
DTs can be tweaked to give very high training accuracies
Can badly overfit to training data if grown too large
Choice of decision stumps is critical
PUF problem: a single linear model works
DT will struggle and eventually overfit if we insist that questions used to split the DT nodes use a single feature
However, if we allow node questions to be a general linear model, the root node itself can purify the data completely 🙂
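A hedged illustration of this last point on synthetic data (assuming scikit-learn is available; the data is only a toy stand-in for the PUF responses, labelled by the sign of a single linear function):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the PUF setting: the label is the sign of one linear function
# of many features, so a single linear "question" separates the data exactly,
# while single-feature (axis-aligned) questions each carry very little information.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 20))
w = rng.standard_normal(20)
y = (X @ w > 0).astype(int)
X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

shallow_tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
linear_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("axis-aligned tree, depth 3:", shallow_tree.score(X_te, y_te))
print("single linear model       :", linear_model.score(X_te, y_te))
# the tree must grow much deeper (and starts to overfit) before it catches up
```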
