Lecture 3
[Figure: a decision tree over the restaurant attributes Food (great / mediocre / yuck), Speedy (yes / no), and Price (adequate / high), fitted to our data]
Reasons for noise:
● Some attribute values are incorrect because of errors in the data acquisition process or the preprocessing phase
● The classification (class label) is wrong because of some error
Overfitting reason 1: When a model is trained with a large amount of noisy data, it starts learning from the noise and inaccurate data entries.
[Figure: a decision tree on two variables distorted by a noise point]
Overfitting
Overfitting reason 2: If there is a large number of attributes, ML algorithms may find meaningless regularities in the data that are irrelevant to the true, important, distinguishing features. This is made worse by a lack of data points. E.g.,
1) when predicting rain, whether you go out or not is an irrelevant attribute;
2) predicting the roll of a die from the day of the week and the colour of the die.
Overfitting means fitting the training set “too well”, so that performance on the test set degrades.
Underfitting refers to a model that can neither model the training data nor
generalize to new data.
● As the model keeps on learning, the error on both the training data and the test data keeps decreasing.
● If learning goes on for too long, overfitting sets in due to noise and less relevant attributes, and the performance of the model on the test set decreases.
● For a good model, we stop at the point just before the test error starts increasing, i.e., the point where the model performs well on both the training data and the unseen test data.
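A minimal sketch of this stopping idea, assuming scikit-learn and a synthetic dataset: tree depth is used here as a stand-in for "amount of learning", and the depth range, dataset, and split are illustrative choices, not from the lecture. In practice the later slides use a separate validation set for this choice.

```python
# Sketch: watch training vs. held-out error as model complexity grows,
# and keep the complexity reached just before the held-out error rises.
# Assumes scikit-learn; dataset and depth range are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

best_depth, best_err = None, np.inf
for depth in range(1, 16):                      # increasing "amount of learning"
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    print(f"depth={depth:2d}  train_err={train_err:.3f}  test_err={test_err:.3f}")
    if test_err < best_err:                     # first minimum: just before error rises
        best_depth, best_err = depth, test_err

print("depth chosen just before overfitting sets in:", best_depth)
```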
Mitigating Overfitting by Holdout Techniques
The training set is used for creating the model; the test set is used for estimating model performance and should not be used for training.
1) Naive approach
● Use the full training set.
● Suffers from overfitting.
2) Holdout method
● Problematic when the dataset size is small.
● Since it is a single train-and-test experiment, it will be misleading if we choose an “unfortunate” split (see the split sketch below).
[Diagrams: Total Examples → Training set | Test set, and Total Examples → Training set | Validation set | Test set]
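A minimal sketch of the holdout split, assuming scikit-learn; the 60/20/20 proportions and the synthetic dataset are illustrative choices, not fixed by the lecture.

```python
# Sketch: carve a dataset into training / validation / test sets (holdout method).
# Assumes scikit-learn; the 60/20/20 ratios are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First split off the test set, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 / 200 / 200
```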
Model Selection
Try different values of K in K-NN or different tree depths in DT and look at the performance on the validation set. Select the value that gives the best accuracy on the validation set.
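A minimal sketch of this selection loop for K-NN, assuming scikit-learn; the dataset and the candidate K values are illustrative, not from the lecture.

```python
# Sketch: pick K for K-NN by accuracy on a held-out validation set (model selection).
# Assumes scikit-learn; dataset and candidate K values are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9, 15]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_val, y_val)               # accuracy on the validation set
    if acc > best_acc:
        best_k, best_acc = k, acc

print("selected K:", best_k, "validation accuracy:", round(best_acc, 3))
```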
The limitations of the holdout method can be handled with a family of re-sampling methods, at the expense of higher computational cost:
● Cross-validation: random subsampling, K-fold cross-validation, leave-one-out cross-validation (LOOCV)
● Bootstrapping
Cross-validation
1) Random Subsampling
● Performs K data splits of the full training set.
● Each data split randomly selects a (fixed) number of examples without replacement as the validation set.
● For each data split i, we retrain the classifier with the remaining examples and then estimate the error Ei on the validation set.
● The true error estimate is obtained as the average of all the estimates.
2) K-Fold Cross-Validation
● Create a K-fold partition of the dataset.
● For each of the K experiments, use K-1 folds for training and the remaining fold for testing.
● It is better than random subsampling as it uses all the examples in the dataset.
3) Leave-One-Out Cross-Validation (LOOCV)
● K-fold cross-validation with each fold holding a single example, i.e., K equal to the number of examples.
● Highly time-expensive.
[Figure: K-fold cross-validation. In experiment i (EXP i), fold i is held out as the validation set (V) and the error Ei is measured on it.]
Eaverage = (E1 + E2 + … + EK) / K
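A minimal sketch of the K-fold error estimate above, assuming scikit-learn; the dataset, the classifier, and K = 5 are illustrative choices.

```python
# Sketch: K-fold cross-validation error estimate, averaging the per-fold errors Ei.
# Assumes scikit-learn; dataset, classifier, and K=5 are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
errors = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    errors.append(1 - clf.score(X[val_idx], y[val_idx]))   # Ei for fold i

print("per-fold errors:", [round(e, 3) for e in errors])
print("Eaverage =", round(float(np.mean(errors)), 3))
```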
[Figure: model-selection workflow. Each candidate model 1, 2, …, q is trained by the ML algorithm on the training set and evaluated on the validation set (or by its average cross-validation error), giving Error1, Error2, …, Errorq; the model with the lowest error is selected as the final model and assessed on the test set.]
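A minimal sketch of the workflow above, assuming scikit-learn; the candidate models, the dataset, and the split are illustrative choices, not from the lecture.

```python
# Sketch: select among candidate models by average cross-validation error on the
# training data, then refit the winner and assess it once on the held-out test set.
# Assumes scikit-learn; candidates and dataset are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "knn_k3": KNeighborsClassifier(n_neighbors=3),
    "knn_k7": KNeighborsClassifier(n_neighbors=7),
    "tree_d3": DecisionTreeClassifier(max_depth=3, random_state=0),
}
cv_error = {name: 1 - cross_val_score(m, X_train, y_train, cv=5).mean()
            for name, m in candidates.items()}

best = min(cv_error, key=cv_error.get)                 # lowest average CV error
final = candidates[best].fit(X_train, y_train)         # refit on all training data
print("selected:", best, " test error:", round(1 - final.score(X_test, y_test), 3))
```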
Disadvantages
● Learning the optimal DT is NP-Complete. The existing algorithms
are heuristics, like the one we discussed.
● The tree can become complex if pruning is avoided
Questions left?
● Continuous valued attributes, ordinal attributes and so on...
● Alternative measures for selecting attributes
● Multi-value split or binary split
● How to perform regression? Use variance instead of entropy (see the sketch below)
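A minimal sketch of the "variance instead of entropy" idea: score a candidate split of a numeric target by the reduction in variance it achieves, analogous to information gain. The function name and the toy data are illustrative, not from the lecture.

```python
# Sketch: variance reduction as the split criterion for regression trees
# (the regression analogue of information gain with entropy).
# Function name and toy data are illustrative; assumes both sides of the split
# are non-empty.
import numpy as np

def variance_reduction(y, left_mask):
    """Drop in variance when targets y are split by the boolean mask left_mask."""
    y_left, y_right = y[left_mask], y[~left_mask]
    n = len(y)
    weighted = (len(y_left) / n) * np.var(y_left) + (len(y_right) / n) * np.var(y_right)
    return np.var(y) - weighted

# Toy example: split a numeric target on a feature threshold.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(variance_reduction(y, x < 3.5))   # large reduction -> good split
```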
Summary: Understanding Overfitting and Underfitting
● Overfitting: Good performance on the training data, poor generalization to other data.
● Underfitting: Poor performance on training data and poor generalization
to other data.
Note
● If cross-validation or bootstrapping is used, steps 3 and 4