Lecture 3

MACHINE LEARNING (CS 403/603)

Decision Trees; Overfitting and Underfitting; and Holdout Techniques

Dr. Puneet Gupta


Introduction
Decision Tree (DT)
● A sequence of tests.
● Representation very natural for humans.
● Style of many “How to” manuals and
trouble-shooting procedures.
● Example: Find a number between 1 and
100 by asking questions.
● DT learning is about learning such a tree
from labeled training data, i.e.,
supervised learning.
● Also known as learning by asking
questions.

Image source: New York Times April 16, 2008


Introduction
Aim: Given a person, find out its class. (The figure shows two candidate trees, DT1 and DT2.)
● What are the inputs? Gender = {Male (M), Female (F)} and Height = {1.5, …, 2.5}
● What are the outputs? Class = {Short (S), Medium (M), Tall (T)}
● The first node is called the root node, internal nodes test some attribute, edges are the values of the attributes, and external (leaf) nodes are the outcome of the prediction.
● There can be multiple DTs. Which one will you prefer?
● Optimal DT: Finding an optimal DT is an NP-complete problem. Hence, DT algorithms use heuristic approaches: they typically follow a greedy, top-down, recursive divide-and-conquer strategy to create the DT.
Introduction

[Example DT on our data: the root tests Food (great / mediocre / yuck); the “great” branch tests Speedy (yes / no), the “mediocre” branch tests Price (adequate / high), and the leaves predict yes or no.]
How to perform testing?
Computational complexity: DTs are computationally inexpensive at test time. Once a DT is known, classifying a test record is extremely fast, with a worst-case time complexity of O(d), where d is the maximum depth of the tree.
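The following is a minimal sketch of such a test-time traversal, assuming a nested-dictionary representation of a learned DT; the attribute names and the example tree are illustrative only, not part of the lecture data.

```python
def classify(tree, record):
    """Walk from the root to a leaf; the cost is O(d), where d is the tree depth."""
    while isinstance(tree, dict):               # internal node: {attribute: {value: subtree}}
        attribute, branches = next(iter(tree.items()))
        tree = branches[record[attribute]]      # follow the edge matching the record's value
    return tree                                 # leaf: the predicted class

# Hypothetical tree in the spirit of the Food/Speedy/Price example above
dt = {"Food": {"great": {"Speedy": {"yes": "yes", "no": "no"}},
               "mediocre": {"Price": {"adequate": "yes", "high": "no"}},
               "yuck": "no"}}
print(classify(dt, {"Food": "great", "Speedy": "yes", "Price": "high"}))  # -> yes
```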
Top down Greedy DT: Algorithm
Algorithm
● At the current node, select the best attribute.
● Create a descendant node for each value of that attribute; create the edges; and partition the examples accordingly.
● Repeat the above steps for each successor node until all the examples are classified correctly or there are no attributes left.
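As a rough illustration (not the exact algorithm used in the lecture), the recursion can be sketched as follows; best_attribute is assumed to be a scoring helper, such as the information-gain criterion discussed next.

```python
from collections import Counter

def grow_tree(examples, attributes, best_attribute):
    """Top-down greedy DT construction (sketch).
    examples: list of (feature_dict, label) pairs; attributes: list of attribute names."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attributes:        # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class
    a = best_attribute(examples, attributes)           # greedy choice of the best attribute
    tree = {a: {}}
    for value in {x[a] for x, _ in examples}:          # one edge per observed value
        subset = [(x, y) for x, y in examples if x[a] == value]
        rest = [b for b in attributes if b != a]
        tree[a][value] = grow_tree(subset, rest, best_attribute)
    return tree
```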
Which is the best attribute?

● A statistical property called information gain measures how well a given attribute separates the training examples.
● Information gain uses the notion of entropy.
● Information gain = expected reduction in entropy.
Entropy
Entropy measures the amount of information in a random variable.
What is the entropy of rolling a die with 8 equiprobable faces?
Ans: 3 bits.
In our case, entropy measures the impurity of a collection of examples. It depends on the distribution of the random variable p.
Assume p+ and p– are the proportions of positive and negative examples in S, i.e., the training set.

● Entropy(S) ≡ – p+ log2 p+ – p– log2 p– [convention: 0 log2 0 = 0]


● Entropy ([14+, 0–]) = – 14/14 log2 (14/14) – 0 log2 (0) = 0
● Entropy ([9+, 5–]) = – 9/14 log2 (9/14) – 5/14 log2 (5/14) = 0.94
● Entropy ([7+, 7– ]) = – 7/14 log2 (7/14) – 7/14 log2 (7/14) = 1/2
+ 1/2 = 1 [log21/2 = – 1]
● Note: the log of a number < 1 is negative; since 0 ≤ p ≤ 1, the binary entropy satisfies 0 ≤ entropy ≤ 1.
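A small sketch of the entropy computation, reproducing the three worked values above (binary labels assumed):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                  # convention: 0 * log2(0) = 0
            result -= p * log2(p)
    return result

print(entropy(14, 0))   # 0.0   (pure collection)
print(entropy(9, 5))    # ~0.94 (mixed collection)
print(entropy(7, 7))    # 1.0   (maximally impure)
```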
Information Gain
● Information gain is the expected reduction in entropy caused by partitioning the examples on an attribute. Alternatively, it is the difference in the entropy before and after the split.
● The higher the information gain, the more effective the attribute is in classifying the training data.
● The entropies obtained after splitting are normalized, i.e., weighted by the fraction of examples that fall in each branch.
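A sketch of the information-gain computation, reusing the entropy() function from the sketch above; the (feature_dict, label) data layout and the "yes"/"no" labels are assumptions.

```python
def information_gain(examples, attribute):
    """Expected reduction in entropy from splitting examples on attribute."""
    def ent(subset):
        pos = sum(1 for _, y in subset if y == "yes")
        return entropy(pos, len(subset) - pos)
    gain = ent(examples)                                    # entropy before the split
    for value in {x[attribute] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attribute] == value]
        gain -= len(subset) / len(examples) * ent(subset)   # weighted (normalized) child entropy
    return gain
```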
Top down Greedy DT: Algorithm
Step 1
● Which attribute should be tested at the root?
Gain(S, Outlook) = 0.246; Gain(S, Humidity) = 0.151; Gain(S, Wind) = 0.084; and Gain(S, Temperature) = 0.029
● Outlook provides the best prediction; hence, partition using Outlook.
Step 2
● Working on the Outlook = Sunny node:
Gain(S_Sunny, Humidity) = 0.970 − 3/5 × 0.0 − 2/5 × 0.0 = 0.970
Gain(S_Sunny, Wind) = 0.970 − 2/5 × 1.0 − 3/5 × 0.918 = 0.019
Gain(S_Sunny, Temp.) = 0.970 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.570
● Humidity provides the best prediction.
Step 3
● Do nothing, as every remaining node is a leaf node.

[Resulting leaves: Outlook = Sunny, Humidity = High → {D1, D2, D8}: No; Outlook = Sunny, Humidity = Normal → {D9, D11}: Yes; Outlook = Rain, Wind = Weak → {D4, D5, D10}: Yes; Outlook = Rain, Wind = Strong → {D6, D14}: No.]
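The Sunny-branch numbers can be checked with the entropy() sketch above, assuming the standard PlayTennis counts for the five Sunny examples (2 yes / 3 no):

```python
S_sunny = entropy(2, 3)                                                 # ~0.970
gain_humidity = S_sunny - 3/5 * entropy(0, 3) - 2/5 * entropy(2, 0)     # 0.970
gain_wind     = S_sunny - 2/5 * entropy(1, 1) - 3/5 * entropy(1, 2)     # ~0.019
gain_temp     = S_sunny - 2/5 * entropy(0, 2) - 2/5 * entropy(1, 1) - 1/5 * entropy(1, 0)  # ~0.570
print(round(gain_humidity, 3), round(gain_wind, 3), round(gain_temp, 3))
```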
Top down Greedy DT: Error
New training sample
<Outlook=Sunny, Temp=Hot,
Humidity=Normal, Wind=Strong,
PlayTennis=No>

The new noisy example causes splitting of the second leaf node.

Reasons of noise:
● Some values of attributes are incorrect because of
errors in the data acquisition process or the
preprocessing phase
● The classification is wrong because of some error
Overfitting reason 1: When a model is fit too closely to the training data, it starts learning from the noise and inaccurate data entries.
[Figure: a DT on two variables gets distorted by a noise point.]
Overfitting
Overfitting reason 2: If there is a large number of attributes, ML algorithms may find meaningless regularities in the data that are irrelevant to the true, important, distinguishing features. This is worsened by a lack of data points. E.g.,
1) when predicting rain, whether you go out or not is irrelevant;
2) predicting the roll of a die using the day of the week and the colour of the die.
Overfitting means fitting the training set “too well”, so that performance on the test set degrades.
Underfitting refers to a model that can neither model the training data nor generalize to new data.
● As long as the model keeps learning, the error on the training and testing data keeps decreasing.
● If learning goes on too long, overfitting starts due to noise and less relevant attributes. Hence, the performance of the model on the test set decreases.
● For a good model, we stop at a point just before the test error starts increasing, i.e., the point where the model performs well on both the training and unseen testing datasets.
Mitigating Overfitting by Holdout Techniques
The training set is used for creating the model; the test set is used for estimating model performance and should not be used for training.
1) Naive approach: split the total examples into a training set and a test set, and use the full training set.
● Suffers from overfitting.
2) Holdout method: split the total examples into a training set, a validation set and a test set.
● Problematic for small dataset sizes.
● Since it is a single train-and-test experiment, it will be misleading if we choose an “unfortunate” split.
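A minimal sketch of a single holdout split (the 70/30 ratio is an arbitrary choice, not from the lecture):

```python
import random

def holdout_split(examples, test_fraction=0.3, seed=0):
    """Single random train/test split: the basic holdout method."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

train_set, test_set = holdout_split(list(range(100)))
print(len(train_set), len(test_set))              # 70 30
```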
Model Selection
Try different values of K in K-NN or tree depths in DT and look at the performance on the validation set. Select the value that gives the best accuracy on the validation set.
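A hedged sketch of validation-based model selection over tree depths; train_model and accuracy are hypothetical placeholders, not functions defined in the lecture.

```python
def select_model(depths, train_set, validation_set, train_model, accuracy):
    """Return the depth (or K in K-NN) with the best validation accuracy."""
    best_depth, best_acc = None, -1.0
    for depth in depths:
        model = train_model(train_set, max_depth=depth)   # fit on the training set only
        acc = accuracy(model, validation_set)             # evaluate on the held-out validation set
        if acc > best_acc:
            best_depth, best_acc = depth, acc
    return best_depth, best_acc
```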
The limitations of the holdout method can be handled with a family of re-sampling methods, at the expense of higher computational cost:
● Cross-validation using random subsampling, K-Fold Cross-Validation, Leave-One-Out Cross-Validation (LOOCV)
● Bootstrapping
Cross-validation
1) Random Subsampling
● Performs K data splits of the full training set.
● Each data split randomly selects a (fixed) number of examples without replacement as the validation set.
● For each data split i, we retrain the classifier with the remaining examples and then estimate the error Ei on the validation set.
● The true error estimate is obtained as the average of all the estimates.
2) K-Fold Cross-Validation
● Create a K-fold partition of the dataset.
● For each of the K experiments, use K−1 folds for training and the remaining fold for testing.
● It is better than random subsampling as it uses all the examples in the dataset.
3) Leave-One-Out Cross-Validation (LOOCV)
● Set K equal to the number of examples N in K-Fold cross-validation, so each validation fold contains a single example.
● Highly time expensive.

[Figure: in each experiment EXP1 … EXPK a different fold serves as the validation set; the per-experiment errors E1, E2, E3, … are averaged to obtain E_average.]
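A sketch of K-fold cross-validation; train_and_error is a hypothetical placeholder that trains on one set and returns the error on the other.

```python
def k_fold_cv(examples, k, train_and_error):
    """Each fold serves as the validation set exactly once; the errors are averaged."""
    folds = [examples[i::k] for i in range(k)]            # K disjoint partitions
    errors = []
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        errors.append(train_and_error(training, validation))
    return sum(errors) / k                                # averaged error estimate
```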
Bootstrapping
Step 1: From a dataset with N examples:
1) Randomly select (with replacement) N examples and use this set for training.
2) The remaining examples that were not selected for training are used for testing. The number of such examples is likely to change from fold to fold.
Step 2: Calculate the error.
Step 3: Repeat steps 1 and 2 for a specified number of folds, K.
Step 4: As before, the true error is estimated as the average error rate on the test examples.
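A sketch of the bootstrap procedure described above; train_and_error is again a hypothetical placeholder.

```python
import random

def bootstrap_error(examples, n_rounds, train_and_error, seed=0):
    """Sample N examples with replacement for training; test on the ones left out."""
    rng = random.Random(seed)
    n, errors = len(examples), []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]            # sample indices with replacement
        training = [examples[i] for i in idx]
        out_of_bag = set(range(n)) - set(idx)                 # examples never selected for training
        testing = [examples[i] for i in out_of_bag]
        errors.append(train_and_error(training, testing))
    return sum(errors) / n_rounds                             # averaged error estimate
```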
Mitigating Overfitting by Holdout Techniques: Model Selection
[Figure: each candidate model (Model 1 … Model q) is trained by the ML algorithm on the training set and evaluated on the validation set, yielding Error_1 … Error_q (or average errors); the model with the minimum error is selected as the final model and then assessed on the test set, giving the final error.]
Different ML algorithms are designed by varying hyperparameters.


Summary: DT
Advantages
● Simple and easy to interpret
● Do not make any assumptions about the distribution of the data
● Easily handle different types of features (real, categorical, etc.)
● Very fast at test time

Disadvantages
● Learning the optimal DT is NP-Complete. The existing algorithms
are heuristics, like the one we discussed.
● Can be complex if pruning is avoided

Questions left?
● Continuous valued attributes, ordinal attributes and so on...
● Alternative measures for selecting attributes
● Multi-value split or binary split
● How to perform regression? Use variance instead of entropy (a small sketch follows below).
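For that last point, a minimal sketch of a variance-based split criterion (the regression analogue of information gain), under the same (feature_dict, target) data layout assumed earlier:

```python
def variance(values):
    """Variance of the numeric targets: the impurity measure for regression trees."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(examples, attribute):
    """Drop in variance after splitting on attribute (analogue of information gain)."""
    gain = variance([y for _, y in examples])
    for value in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == value]
        gain -= len(subset) / len(examples) * variance(subset)
    return gain
```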
Summary: Understanding of Overfitting
and underfitting
● Overfitting: Good performance on training data, poor generalization to other data.
● Underfitting: Poor performance on training data and poor generalization
to other data.

How to avoid Overfitting in ML?


● Early Stopping: Its rules provide guidance on how many iterations can be run before the learner begins to overfit. In the case of a DT, if the gain of the best attribute at a node is below a threshold, stop and make this node a leaf node. Alternatively, one can stop growing the DT if the number of instances at a node is less than some user-specified threshold (a small check along these lines is sketched after this list).
● Pruning: Pruning is extensively used while building DT models. It simply removes the nodes which add little predictive power for the problem at hand; cross-validation can be used to decide which nodes to remove.
● Regularization: It adds to the objective function a cost term for bringing in more features. Hence it tries to push the coefficients of many variables to zero and thereby reduce the cost term. (Will see later...)
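A small sketch of the early-stopping check mentioned in the first bullet; the two thresholds are illustrative defaults, not values given in the lecture.

```python
def should_stop(examples, best_gain, min_gain=0.01, min_instances=5):
    """Stop growing the DT at this node if the best gain or the node size is too small."""
    return best_gain < min_gain or len(examples) < min_instances
```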
Summary: Holdout Techniques
Procedure outline for holdout techniques:
1) Divide the full data into a training, a validation and a test set.
2) Train the model using the training set.
3) Evaluate the model using the validation set.
4) Repeat steps 2 and 3 using different architectures and training parameters.
5) Select the best model using the evaluation results.
6) Assess this final model using the test set.

Note
● If cross-validation or bootstrapping is used, steps 2 and 3 have to be repeated for each of the K folds.
● After assessing the final model on the test set, YOU MUST NOT tune the model any further.


References

1) Tom Mitchell, Machine Learning, McGraw-Hill.
2) Richard O. Duda, Peter E. Hart and David G. Stork, Pattern Classification.
