UNIT03
• Underfitting
• Overfitting
• Bias – variance trade-off
• Underfitting
– Occurs when the target function is kept too simple
– Occurs when sufficient training data is unavailable
– Underfitting results in poor performance on the
training data as well as poor generalization to
test data.
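A minimal sketch of underfitting, assuming a hypothetical curved target y = x^2 and a hand-written least-squares helper (fit_line and mse are illustrative names, not from any library). A straight line in x is too simple for this target, so its error stays high on both training and test data, while the same helper applied to the feature x^2 fits well.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data drawn from a curved target, y = x^2
train_x = [-3, -2, -1, 0, 1, 2, 3]
test_x = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
train_y = [x ** 2 for x in train_x]
test_y = [x ** 2 for x in test_x]

# Too-simple model: a straight line in x (underfits the curve)
line = fit_line(train_x, train_y)
line_train_mse = mse(train_x, train_y, *line)
line_test_mse = mse(test_x, test_y, *line)

# Model that matches the shape of the target: a straight line in x^2
train_sq = [x ** 2 for x in train_x]
test_sq = [x ** 2 for x in test_x]
curve = fit_line(train_sq, train_y)
curve_train_mse = mse(train_sq, train_y, *curve)
curve_test_mse = mse(test_sq, test_y, *curve)

print("line :", line_train_mse, line_test_mse)
print("curve:", curve_train_mse, curve_test_mse)
```

The point to notice is that the underfitted line is poor on the training set itself, not only on the test set.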
• Bias
– Gap between the predicted values and the actual values
– Parametric models generally have high bias, which
makes them easier to understand/interpret and
faster to learn. These algorithms perform poorly
on data sets that are complex in
nature and do not align with the simplifying
assumptions made by the algorithm.
– Underfitting results in high bias.
• Variance
– Errors due to variance arise from differences in the
training data sets used to train the model.
– Spread of the predicted values with respect to
each other
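The two definitions above can be illustrated with a small sketch. The numbers are hypothetical: each list holds the predictions for one and the same input made by a model retrained on five different training sets. Bias is taken here as the gap between the average prediction and the actual value, and variance as the spread of the predictions around their own average.

```python
from statistics import mean, pvariance

actual = 10.0  # hypothetical true value for one input

# Hypothetical predictions from models trained on five different training sets
simple_preds = [7.0, 7.2, 6.8, 7.1, 6.9]      # close to each other, far from actual
flexible_preds = [6.0, 14.0, 9.0, 12.0, 9.0]  # centred on actual, widely scattered


def bias(preds, actual_value):
    """Gap between the average predicted value and the actual value."""
    return mean(preds) - actual_value


def variance(preds):
    """Spread of the predicted values with respect to each other."""
    return pvariance(preds)


for name, preds in [("simple", simple_preds), ("flexible", flexible_preds)]:
    print(name, "bias:", bias(preds, actual), "variance:", variance(preds))
```

In this made-up data the simple model shows high bias and low variance, and the flexible model shows the opposite, which is the trade-off named in the outline.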
EVALUATING PERFORMANCE OF A MODEL
• Supervised learning - classification
– Accuracy
– Sensitivity
Understanding with a cricket match win example
• There are four possibilities with regards to the
cricket match win/loss prediction:
– The model predicted win and the team won
– The model predicted win and the team lost
– The model predicted loss and the team won
– The model predicted loss and the team lost
In this problem, the obvious class of interest is ‘win’.
The first case, i.e. the model predicted win and the
team won, is a True Positive (TP).
• Model accuracy is given by the total number of
correct classifications (either as the class of
interest, i.e. True Positive, or as not the class of
interest, i.e. True Negative) divided by the total
number of classifications done.
Model accuracy = (TP + TN) / (TP + FP + FN + TN)
where FP and FN are the False Positive and False Negative counts.
• Error rate: the percentage of misclassifications, i.e. 1 − model accuracy
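As a sketch of how the four cricket outcomes turn into accuracy and error rate, the snippet below counts them from two label lists. The match results are made up for illustration, and confusion_counts is a hypothetical helper, not a library function.

```python
def confusion_counts(actual, predicted, positive="win"):
    """Count the four outcomes, treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted win, team won
        elif p == positive:
            fp += 1  # predicted win, team lost
        elif a == positive:
            fn += 1  # predicted loss, team won
        else:
            tn += 1  # predicted loss, team lost
    return tp, fp, fn, tn


# Hypothetical results for ten matches
actual = ["win", "win", "loss", "loss", "win", "loss", "win", "loss", "win", "loss"]
predicted = ["win", "loss", "loss", "win", "win", "loss", "win", "loss", "win", "loss"]

tp, fp, fn, tn = confusion_counts(actual, predicted)
total = tp + fp + fn + tn

accuracy = (tp + tn) / total    # correct classifications / all classifications
error_rate = (fp + fn) / total  # misclassifications / all classifications

print(tp, fp, fn, tn, accuracy, error_rate)
```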
• Specificity
– Specificity of a model measures the proportion of
negative examples which have been correctly
classified.
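A short sketch of specificity alongside sensitivity, using hypothetical counts. Sensitivity is taken in its standard sense, the proportion of positive examples correctly classified, which the list above names but does not define.

```python
# Hypothetical counts from a win/loss classifier
tp, fn = 40, 10  # actual wins: predicted correctly / predicted as loss
tn, fp = 45, 5   # actual losses: predicted correctly / predicted as win

# Proportion of positive (win) examples correctly classified
sensitivity = tp / (tp + fn)

# Proportion of negative (loss) examples correctly classified
specificity = tn / (tn + fp)

print("sensitivity:", sensitivity)
print("specificity:", specificity)
```

Both measures divide by the number of actual examples of a class, so each is unaffected by how the model performs on the other class.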
• There are two other performance measures of
a supervised learning model which are similar
to sensitivity and specificity.
– Precision: precision gives the proportion of
positive predictions which are truly positive.
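Precision can be sketched the same way. The function below is a hypothetical helper; returning None when the model makes no positive predictions is a design choice made for this example, since the ratio is undefined in that case.

```python
def precision(tp, fp):
    """Proportion of positive predictions that are truly positive.

    Returns None when there are no positive predictions at all.
    """
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return None
    return tp / predicted_positive


# Hypothetical counts: 40 correct 'win' predictions, 5 incorrect ones
print(precision(40, 5))
```

Unlike sensitivity, the denominator here is the number of predicted positives, not the number of actual positives.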
• What is clustering?
• Challenges that lie in the process of clustering:
– It is generally not known how many clusters can be formulated
from a particular data set. It is completely open-ended in most
cases and provided as a user input to a clustering algorithm.
– Even if the number of clusters is given, the same number of
clusters can be formed with different groups of data instances.
In a more objective way, it can be said that a clustering