Tutorial 4 Solutions
1. Flexible vs Inflexible Method (Modified from An Introduction to Statistical Learning)
For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer in terms of bias and variance.
(a) The sample size n is extremely large, and the number of features is small.
Solution: If the sample size is large and the number of features is small, the variance can usually be kept small even with flexible methods. Flexible methods can also achieve smaller bias, so they will likely be preferable in this case.
(b) The number of features is extremely large, and the number of observations n is small.
Solution: If the number of features is extremely large and the number of observations is small, it may be difficult to keep the variance small with flexible methods. Inflexible methods may perform better, even though they are more biased, provided their variance is much smaller.
(c) The relationship between the features and response is highly non-linear.
Solution: Inflexible methods may not be able to represent highly non-linear functions and thus have high bias. In this case, using flexible methods to reduce the bias may be helpful, although the resulting variance must still be kept in check.
(d) The variance of the error terms, i.e. σ² = Var(ϵ), is extremely high.
Solution: If the data is very noisy, then it is easy for flexible methods to overfit and
have high variance. An inflexible method would be less likely to overfit the noise.
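These scenarios can be probed with a quick simulation. The sketch below is a minimal illustration, assuming Python with NumPy and scikit-learn (the tutorial prescribes no code); the non-linear target, the polynomial degree standing in for "flexible", and the sample sizes are all arbitrary choices, contrasting scenarios (a) and (b) under the non-linear target of (c).

```python
# Minimal simulation contrasting a flexible and an inflexible method.
# All names, degrees, and sample sizes here are illustrative choices.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def run_trial(n_train, f, noise_sd=1.0):
    """Fit an inflexible (linear) and a flexible (degree-10 polynomial)
    model on n_train points and return their test MSEs."""
    x_tr = rng.uniform(-2, 2, size=(n_train, 1))
    y_tr = f(x_tr).ravel() + rng.normal(0, noise_sd, size=n_train)
    x_te = rng.uniform(-2, 2, size=(2000, 1))
    y_te = f(x_te).ravel() + rng.normal(0, noise_sd, size=2000)

    linear = LinearRegression().fit(x_tr, y_tr)
    flexible = make_pipeline(PolynomialFeatures(10), LinearRegression()).fit(x_tr, y_tr)
    return (mean_squared_error(y_te, linear.predict(x_te)),
            mean_squared_error(y_te, flexible.predict(x_te)))

f = lambda x: np.sin(2 * x)            # a non-linear target, as in part (c)
print("large n:", run_trial(5000, f))  # flexible wins: low bias, variance controlled
print("small n:", run_trial(15, f))    # inflexible often wins: variance dominates
```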
2. Bias-Variance Trade-off
(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and irreducible error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
Solution:
Figure 1: Curves showing bias, variance, training error, test error and Bayes error against flexibility.
(b) Explain why each of the five curves has the shape displayed in part (a).
Solution: As flexibility increases, the model is better able to approximate the target conditional distribution, so bias decreases; for universal approximators, the bias should decrease to zero. As flexibility increases, variance increases, since the number of ways to fit the same training dataset also increases. Training error decreases monotonically, eventually reaching zero once the function class can interpolate the data exactly. Test error decreases initially, while the reduction in bias dominates, but eventually increases once variance grows faster than any further reduction in bias. Irreducible error is external data noise, unaffected by the approximator, so it does not change with the flexibility of the method.
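These shapes can be reproduced with a small Monte-Carlo sketch. The snippet below is illustrative only and assumes Python with NumPy (not prescribed by the tutorial); polynomial degree stands in for the flexibility axis, and the target f, noise level σ, and sample size are arbitrary choices. For each degree it averages squared bias, variance, and training error over many simulated training sets, and estimates the expected test error via the decomposition test ≈ bias² + variance + σ².

```python
# Monte-Carlo sketch of the five curves: sweep flexibility (polynomial
# degree) and average the error components over simulated training sets.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)               # illustrative target
sigma = 0.3                               # irreducible noise level
x_grid = np.linspace(-1, 1, 50)           # fixed test inputs

for degree in [1, 3, 6, 9]:
    preds, train_errs = [], []
    for _ in range(200):                  # 200 simulated training sets
        x = rng.uniform(-1, 1, 40)
        y = f(x) + rng.normal(0, sigma, 40)
        coefs = np.polyfit(x, y, degree)
        train_errs.append(np.mean((y - np.polyval(coefs, x)) ** 2))
        preds.append(np.polyval(coefs, x_grid))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2={bias2:.3f}  var={var:.3f}  "
          f"train={np.mean(train_errs):.3f}  test~{bias2 + var + sigma**2:.3f}")
```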
3. k-Nearest Neighbours Regression. Assume each training point is generated as y_i = f(x_i) + ϵ_i, with ϵ_i ∼ N(0, σ²) drawn independently for each i.
(a) Let’s first try to understand the assumption about the training dataset. Which of these statements are correct about the assumptions on the training set?
A) all ϵ_i have the same mean
B) all y_i have the same mean
C) all y_i have the same variance
Solution: A and C.
Reason: Consider the data-generation process y_i = f(x_i) + ϵ_i. Here f(x_i) is a function of x_i, which is not a random variable, so E[y_i] = f(x_i) generally differs across i, and B is false. However, ϵ_i is a random variable which follows the Gaussian distribution ϵ_i ∼ N(0, σ²) and is sampled independently for each i, so every ϵ_i has mean 0, and A is true. Moreover, Var[f(x_i)] = 0 because we assume the nearest neighbours x_i are fixed, hence the values of f(x_i) are also fixed; thus Var[y_i] = Var[ϵ_i] = σ² is the same for every i, and C is true.
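A quick numerical check of this reasoning, as a sketch (Python with NumPy is an assumption here, and f, the fixed inputs, and σ are arbitrary choices): simulating many draws of ϵ_i at three fixed inputs shows E[ϵ_i] ≈ 0 for all i (A), differing E[y_i] (B false), and equal Var[y_i] ≈ σ² (C).

```python
# Numerical check of statements A, B, C (illustrative sketch only).
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: x ** 2
x = np.array([0.0, 1.0, 2.0])               # fixed inputs x_i, as assumed
sigma = 0.5
eps = rng.normal(0, sigma, size=(100_000, 3))
y = f(x) + eps                              # y_i = f(x_i) + eps_i, broadcast

print("E[eps_i] :", eps.mean(axis=0))       # ~0 for every i      -> A true
print("E[y_i]   :", y.mean(axis=0))         # ~0, 1, 4: differ    -> B false
print("Var[y_i] :", y.var(axis=0))          # ~sigma^2 for all i  -> C true
```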
• Bias squared: Assuming fixed neighbours x_i,
  E[(1/k) ∑_{i=1}^{k} y(x_i)] = (1/k) ∑_{i=1}^{k} f(x_i).
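To spell out the step behind this identity: E[y(x_i)] = f(x_i) + E[ϵ_i] = f(x_i), so linearity of expectation gives the displayed equality, and the squared bias of the kNN estimate follows. A short worked form, assuming (as the setup suggests) the kNN estimator f̂(x_0) = (1/k) ∑_{i=1}^{k} y(x_i) at a query point x_0, where x_0 is notation introduced here:

```latex
\mathrm{Bias}^2\!\left[\hat{f}(x_0)\right]
  = \left(\mathbb{E}\!\left[\hat{f}(x_0)\right] - f(x_0)\right)^{2}
  = \left(\frac{1}{k}\sum_{i=1}^{k} f(x_i) - f(x_0)\right)^{2}.
```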
(c) How would the equations from Part (b) change when k is varied? How would you
choose an optimal value of k using the above equations?
Solution:
• Choice of k will not affect the irreducible error.
• The variance will decrease as k increases: with fixed neighbours, Var[f̂(x_0)] = Var[(1/k) ∑_{i=1}^{k} ϵ_i] = σ²/k.
• As the value of k increases, bias will likely increase: as the number of neighbours grows, we consider points further away from x_0, so the average of the f(x_i) drifts further from f(x_0) (in the extreme case where k equals the number of points in the training set, f̂(x) would just give the mean of the training set outputs).
• The optimal k balances these two effects: since the irreducible error is constant in k, choose the k that minimises the sum of the squared bias and variance terms.
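In practice this choice of k is usually made by a validation sweep. Below is a minimal sketch, assuming scikit-learn's KNeighborsRegressor (the target function, noise level, and candidate k values are illustrative):

```python
# Sweep k and pick the value with the lowest held-out error
# (illustrative sketch; scikit-learn is an assumption here).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
f = lambda x: np.sin(4 * x)
x_tr = rng.uniform(0, 1, (200, 1))
y_tr = f(x_tr).ravel() + rng.normal(0, 0.3, 200)
x_va = rng.uniform(0, 1, (1000, 1))
y_va = f(x_va).ravel() + rng.normal(0, 0.3, 1000)

for k in [1, 5, 20, 50, 200]:
    model = KNeighborsRegressor(n_neighbors=k).fit(x_tr, y_tr)
    mse = mean_squared_error(y_va, model.predict(x_va))
    print(f"k={k:3d}  validation MSE={mse:.3f}")
# small k: low bias, high variance; k=200 (=n) predicts the training mean
```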