EDA Module 2
•Low-Bias, High-Variance: the model fits the training data closely, but its predictions change a lot from one training set to another (overfitting).
•High-Bias, Low-Variance: the model gives stable predictions across training sets, but systematically misses the true pattern (underfitting).
•High-Bias, High-Variance: the model is both systematically wrong and unstable, the worst of both cases.
Bias-Variance trade-off.
• While building a machine learning model, it is really
important to take care of bias and variance in order to
avoid overfitting and underfitting in the model.
• If the model is very simple, with few parameters, it
tends to have low variance and high bias; if the model
has a large number of parameters, it tends to have
high variance and low bias.
• So, a balance must be struck between the bias and
variance errors, and this balance between the bias
error and the variance error is known as the Bias-Variance
trade-off.
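The trade-off above can be seen numerically by training two extreme models on many independent noisy samples of the same underlying function. This is a minimal sketch with hypothetical toy data (the function y = x², the noise level, and both models are assumptions for illustration): a mean predictor stands in for a very simple, high-bias model, and a 1-nearest-neighbour predictor for a flexible, high-variance one.

```python
import random
import statistics

random.seed(0)

def true_f(x):
    # Assumed true underlying function (illustration only).
    return x * x

def sample_dataset(n=20):
    # One noisy training set drawn from the true function.
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, 0.1) for x in xs]
    return xs, ys

x0 = 0.5  # fixed test point where we examine bias and variance

# High-bias, low-variance model: always predict the mean training label.
# Low-bias, high-variance model: predict the label of the nearest training x.
mean_preds, nn_preds = [], []
for _ in range(500):  # many independent training sets
    xs, ys = sample_dataset()
    mean_preds.append(statistics.fmean(ys))
    nn_preds.append(min(zip(xs, ys), key=lambda p: abs(p[0] - x0))[1])

for name, preds in [("mean model", mean_preds), ("1-NN model", nn_preds)]:
    bias2 = (statistics.fmean(preds) - true_f(x0)) ** 2
    var = statistics.pvariance(preds)
    print(f"{name}: bias^2={bias2:.4f}, variance={var:.4f}")
```

Running this shows the simple model with the larger squared bias and the flexible model with the larger variance, matching the bullets above.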
Introduction to cross-validation
• An ideal predictor is one that learns all the structure
in the data but none of the noise. While the prediction
error (PE) on the training data reduces monotonically
with increasing model complexity, the same is not true
for the test data.
• Bias and variance move in opposing directions, and at a
suitable bias-variance combination the PE on the test
data reaches its minimum. The model that achieves this
lowest possible PE is the best prediction model.
• Cross-validation is a comprehensive set of data-splitting
techniques that helps to estimate the model complexity
at which the test PE is minimized.
Cross-validation: Data
splitting method?
• Why split the data?
• What data splitting methods have we
discussed so far?
– Holdout sample: Training & Test data
Demerits of the holdout method
• In a sparse data set, one may not have the luxury of
setting aside a reasonable portion of the data for testing.
• If we happen to get a 'bad' split, the estimate is not
reliable.
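K-fold cross-validation addresses both demerits: every observation is used for training and, exactly once, for testing, and the error estimate is averaged over all folds so no single 'bad' split dominates. This is a minimal sketch on hypothetical toy data (the linear relationship, the noise level, and the through-the-origin least-squares model are all assumptions for illustration):

```python
import random
import statistics

random.seed(1)

# Assumed toy data: y = 2x plus Gaussian noise (illustration only).
data = [(x, 2.0 * x + random.gauss(0, 0.5)) for x in [i / 20 for i in range(100)]]
random.shuffle(data)  # shuffle before splitting into folds

def fit_slope(train):
    # Least-squares slope for a line through the origin.
    num = sum(x * y for x, y in train)
    den = sum(x * x for x, y in train)
    return num / den

def k_fold_error(data, k=5):
    fold_size = len(data) // k
    errors = []
    for i in range(k):
        # Fold i is held out as test data; the rest is training data.
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        slope = fit_slope(train)
        mse = statistics.fmean((y - slope * x) ** 2 for x, y in test)
        errors.append(mse)
    # Averaging over folds smooths out any single 'bad' split.
    return statistics.fmean(errors)

print(f"5-fold estimate of prediction error: {k_fold_error(data):.3f}")
```

Repeating this for models of increasing complexity and picking the one with the lowest averaged PE is how cross-validation locates the minimum described earlier.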
Three-way Split: Training, Validation
and Test Data
• The available data is partitioned into three sets:
training, validation and test set.
• The prediction model is trained on the training
set and evaluated on the validation set; the
selected model's final performance is then
assessed on the test set.
• A typical split is 50% for the training data and
25% each for validation set and test set.
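The 50/25/25 three-way split can be sketched as follows; the placeholder dataset (integer indices standing in for records) is an assumption for illustration:

```python
import random

random.seed(0)
data = list(range(100))  # placeholder dataset: indices stand in for records
random.shuffle(data)     # shuffle so each set is a random sample

# Typical split: 50% training, 25% validation, 25% test.
n = len(data)
train = data[: n // 2]
validation = data[n // 2 : 3 * n // 4]
test = data[3 * n // 4 :]

print(len(train), len(validation), len(test))  # prints: 50 25 25
```

Shuffling before slicing matters: without it, any ordering in the data (e.g. by date or class) would leak systematic differences into the three sets.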