Lecture 11: Bias-Variance Tradeoff
As usual, we are given a dataset D = {(x1 , y1 ), … , (xn , yn )} , drawn i.i.d. from some distribution
P (X, Y ) . Throughout this lecture we assume a regression setting, i.e. y ∈ R . In this lecture we will
decompose the generalization error of a classifier into three rather interpretable terms. Before we do that,
let us consider that for any given input x there might not exist a unique label y. For example, if your
vector x describes the features of a house (e.g., #bedrooms, square footage, ...) and the label y its price, you
could imagine two houses with identical description selling for different prices. So for any given feature
vector x , there is a distribution over possible labels. We therefore define the following, which will come in
useful later on:
The expected label denotes the label you would expect to obtain, given a feature vector x:
$$\bar{y}(x) = E_{y\mid x}\left[y\right] = \int_y y \, \Pr(y \mid x) \, \partial y.$$
Alright, so we draw our training set D , consisting of n inputs, i.i.d. from the distribution P . As a second
step we typically call some machine learning algorithm A on this data set to learn a hypothesis (aka
classifier). Formally, we denote this process as hD = A(D).
For a given hD , learned on data set D with algorithm A , we can compute the generalization error (as
measured in squared loss) as follows:
$$E_{(x,y)\sim P}\left[\left(h_D(x) - y\right)^2\right] = \int_x \int_y \left(h_D(x) - y\right)^2 \Pr(x, y) \, \partial y \, \partial x.$$
Note that one can use other loss functions. We use squared loss because it has nice mathematical
properties, and it is also the most common loss function.
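To make this concrete, here is a minimal Python sketch (not from the lecture) that approximates this expectation by Monte Carlo averaging over test samples. The distribution P and the hypothesis h_D below are hypothetical stand-ins chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_P(m):
    """Draw m i.i.d. (x, y) pairs from the assumed distribution P."""
    x = rng.uniform(0.0, 1.0, size=m)
    return x, 2.0 * x + rng.normal(0.0, 0.1, size=m)

def h_D(x):
    """A fixed hypothesis, e.g. the output of some learning algorithm A."""
    return 1.9 * x + 0.05

# Approximate E_{(x,y)~P}[(h_D(x) - y)^2] by an average over test samples.
x_te, y_te = sample_P(100_000)
print(f"estimated generalization error: {np.mean((h_D(x_te) - y_te) ** 2):.4f}")
```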
The previous statement is true for a given training set D. However, remember that D itself is drawn i.i.d. from P^n, and is therefore a random variable. Further, hD is a function of D, and is therefore also a random variable; we can compute its expectation:
$$\bar{h} = E_{D\sim P^n}\left[h_D\right] = \int_D h_D \Pr(D) \, \partial D,$$
where Pr(D) is the probability of drawing data set D from P^n. Here, $\bar{h}$ is a weighted average over functions.
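The integral over all possible data sets is intractable, but $\bar{h}$ can be approximated empirically. The following sketch (same hypothetical setup as above; the algorithm A is assumed here to be ordinary least squares) averages the predictions of many hypotheses, each trained on a freshly drawn D:

```python
import numpy as np

rng = np.random.default_rng(1)
n, num_datasets = 20, 500

def sample_P(m):
    x = rng.uniform(0.0, 1.0, size=m)
    return x, 2.0 * x + rng.normal(0.0, 0.1, size=m)

def A(x, y):
    """The learning algorithm: least-squares fit of y = w*x + b."""
    w, b = np.polyfit(x, y, deg=1)
    return lambda x_new: w * x_new + b

# Average the predictions of h_D over many freshly drawn training sets D.
x_grid = np.linspace(0.0, 1.0, 50)
preds = np.zeros((num_datasets, x_grid.size))
for i in range(num_datasets):
    x_tr, y_tr = sample_P(n)          # draw a fresh D ~ P^n
    preds[i] = A(x_tr, y_tr)(x_grid)

h_bar = preds.mean(axis=0)            # pointwise estimate of h-bar on x_grid
print(h_bar[:5])
```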
Similarly, we can compute the expected test error, where the expectation is now also over the draw of the training set:
$$E_{\substack{(x,y)\sim P \\ D\sim P^n}}\left[\left(h_D(x) - y\right)^2\right] = \int_D \int_x \int_y \left(h_D(x) - y\right)^2 \Pr(x, y) \Pr(D) \, \partial x \, \partial y \, \partial D.$$
To be clear, D is our training set and the (x, y) pairs are the test points.
We are interested in exactly this expression, because it evaluates the quality of a machine learning
algorithm A with respect to a data distribution P (X, Y ) . In the following we will show that this
expression decomposes into three meaningful terms.
$$E_{x,y,D}\left[\left(h_D(x) - y\right)^2\right] = E_{x,y,D}\left[\left[\left(h_D(x) - \bar{h}(x)\right) + \left(\bar{h}(x) - y\right)\right]^2\right]$$
$$= E_{x,D}\left[\left(h_D(x) - \bar{h}(x)\right)^2\right] + 2\, E_{x,y,D}\left[\left(h_D(x) - \bar{h}(x)\right)\left(\bar{h}(x) - y\right)\right] + E_{x,y}\left[\left(\bar{h}(x) - y\right)^2\right]$$
The middle term of the above equation is 0, as we show below:
$$E_{x,y,D}\left[\left(h_D(x) - \bar{h}(x)\right)\left(\bar{h}(x) - y\right)\right] = E_{x,y}\left[E_D\left[h_D(x) - \bar{h}(x)\right]\left(\bar{h}(x) - y\right)\right]$$
$$= E_{x,y}\left[\left(E_D\left[h_D(x)\right] - \bar{h}(x)\right)\left(\bar{h}(x) - y\right)\right]$$
$$= E_{x,y}\left[\left(\bar{h}(x) - \bar{h}(x)\right)\left(\bar{h}(x) - y\right)\right]$$
$$= E_{x,y}\left[0\right] = 0$$
Returning to the earlier expression, we are left with the variance and another term:
$$E_{x,y,D}\left[\left(h_D(x) - y\right)^2\right] = \underbrace{E_{x,D}\left[\left(h_D(x) - \bar{h}(x)\right)^2\right]}_{\text{Variance}} + E_{x,y}\left[\left(\bar{h}(x) - y\right)^2\right]$$
We can break down the second term in the above equation as follows:
$$E_{x,y}\left[\left(\bar{h}(x) - y\right)^2\right] = E_{x,y}\left[\left[\left(\bar{h}(x) - \bar{y}(x)\right) + \left(\bar{y}(x) - y\right)\right]^2\right]$$
$$= \underbrace{E_{x,y}\left[\left(\bar{y}(x) - y\right)^2\right]}_{\text{Noise}} + \underbrace{E_{x}\left[\left(\bar{h}(x) - \bar{y}(x)\right)^2\right]}_{\text{Bias}^2} + 2\, E_{x,y}\left[\left(\bar{h}(x) - \bar{y}(x)\right)\left(\bar{y}(x) - y\right)\right]$$
The third term in the equation above is 0, as we show below:
$$E_{x,y}\left[\left(\bar{h}(x) - \bar{y}(x)\right)\left(\bar{y}(x) - y\right)\right] = E_{x}\left[E_{y\mid x}\left[\bar{y}(x) - y\right]\left(\bar{h}(x) - \bar{y}(x)\right)\right]$$
$$= E_{x}\left[\left(\bar{y}(x) - E_{y\mid x}\left[y\right]\right)\left(\bar{h}(x) - \bar{y}(x)\right)\right]$$
$$= E_{x}\left[\left(\bar{y}(x) - \bar{y}(x)\right)\left(\bar{h}(x) - \bar{y}(x)\right)\right]$$
$$= E_{x}\left[0\right] = 0$$
This gives us the decomposition of the expected test error:
$$\underbrace{E_{x,y,D}\left[\left(h_D(x) - y\right)^2\right]}_{\text{Expected test error}} = \underbrace{E_{x,D}\left[\left(h_D(x) - \bar{h}(x)\right)^2\right]}_{\text{Variance}} + \underbrace{E_{x,y}\left[\left(\bar{y}(x) - y\right)^2\right]}_{\text{Noise}} + \underbrace{E_{x}\left[\left(\bar{h}(x) - \bar{y}(x)\right)^2\right]}_{\text{Bias}^2}$$
Variance: Captures how much your classifier changes if you train on a different training set. How "over-
specialized" is your classifier to a particular training set (overfitting)? If we have the best possible model
for our training data, how far off are we from the average classifier?
Bias: What is the inherent error that you obtain from your classifier even with infinite training data? This
is due to your classifier being "biased" to a particular kind of solution (e.g. linear classifier). In other
words, bias is inherent to your model.
Noise: How big is the data-intrinsic noise? This error measures ambiguity due to your data distribution and feature representation. You can never beat this; it is an aspect of the data.
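The decomposition can also be checked numerically. Below is a sketch under hypothetical assumptions (x ~ Uniform(0,1), expected label ȳ(x) = sin(2πx), Gaussian label noise, and A fits a degree-1 polynomial, so the bias is clearly nonzero); it estimates each of the three terms and compares their sum to the expected test error:

```python
import numpy as np

rng = np.random.default_rng(2)
n, num_datasets, sigma = 20, 500, 0.1

def y_bar(x):
    """Expected label E[y|x]; known here only because we chose P ourselves."""
    return np.sin(2 * np.pi * x)

def sample_D(m):
    """Draw m i.i.d. (x, y) pairs: x ~ Uniform(0,1), y = y_bar(x) + noise."""
    x = rng.uniform(0.0, 1.0, size=m)
    return x, y_bar(x) + rng.normal(0.0, sigma, size=m)

# Fixed test sample drawn from the same distribution P.
x_te, y_te = sample_D(2_000)

# Train h_D on many independently drawn training sets D ~ P^n.
preds = np.zeros((num_datasets, x_te.size))
for i in range(num_datasets):
    x_tr, y_tr = sample_D(n)
    preds[i] = np.polyval(np.polyfit(x_tr, y_tr, deg=1), x_te)  # A = line fit

h_bar = preds.mean(axis=0)                        # estimate of h-bar on x_te
variance = np.mean((preds - h_bar) ** 2)          # E[(h_D(x) - h_bar(x))^2]
bias_sq  = np.mean((h_bar - y_bar(x_te)) ** 2)    # E[(h_bar(x) - y_bar(x))^2]
noise    = np.mean((y_te - y_bar(x_te)) ** 2)     # E[(y_bar(x) - y)^2] ~ sigma^2
total    = np.mean((preds - y_te) ** 2)           # expected test error

print(f"variance + bias^2 + noise = {variance + bias_sq + noise:.4f}")
print(f"expected test error       = {total:.4f}")
```

Up to Monte Carlo error, the two printed numbers should agree, which is exactly the identity derived above.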
Fig 2: The variation of bias and variance with model complexity. This is similar to the concept of overfitting and underfitting: more complex models tend to overfit, while the simplest models underfit.
Source: http://scott.fortmann-roe.com/docs/BiasVariance.html
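One can reproduce the qualitative trend in Fig 2 with the same hypothetical sinusoid setup as above: sweeping the polynomial degree (a stand-in for model complexity) shows the bias term shrinking while the variance term grows:

```python
import numpy as np

rng = np.random.default_rng(3)
n, num_datasets, sigma = 30, 500, 0.1

def y_bar(x):
    return np.sin(2 * np.pi * x)

def sample_D(m):
    x = rng.uniform(0.0, 1.0, size=m)
    return x, y_bar(x) + rng.normal(0.0, sigma, size=m)

x_te = np.linspace(0.05, 0.95, 200)
for degree in [1, 3, 5, 7]:          # degree as a proxy for model complexity
    preds = np.zeros((num_datasets, x_te.size))
    for i in range(num_datasets):
        x_tr, y_tr = sample_D(n)
        preds[i] = np.polyval(np.polyfit(x_tr, y_tr, deg=degree), x_te)
    h_bar = preds.mean(axis=0)
    bias_sq = np.mean((h_bar - y_bar(x_te)) ** 2)
    variance = np.mean((preds - h_bar) ** 2)
    print(f"degree {degree}: bias^2 = {bias_sq:.5f}, variance = {variance:.5f}")
```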
Figure 3: Test and training error as the number of training instances increases.
The graph above plots the training error and the test error and can be divided into two overarching
regimes. In the first regime (on the left side of the graph), training error is below the desired error
threshold (denoted by ϵ ), but test error is significantly higher. In the second regime (on the right side of
the graph), test error is remarkably close to training error, but both are above the desired tolerance of ϵ .
Regime 1 (High Variance): the first regime, where the training error is low but the test error is high, indicates overfitting.
Symptoms: training error is much lower than test error; training error is below ϵ; test error is above ϵ.
Remedies: add more training data; reduce model complexity (complex models are prone to high variance); bagging (covered later in the course).
Regime 2 (High Bias): the second regime, where training and test error are close but both above ϵ, indicates underfitting.
Symptoms: training error is higher than the desired error threshold ϵ.
Remedies: use a more complex model (e.g., kernelize, use non-linear models); add features; boosting (covered later in the course).
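A short sketch (again under the hypothetical sinusoid setup) traces the learning curves of Figure 3: with few training instances the gap between training and test error is large (high variance), and as n grows both errors converge toward a floor set by bias and noise:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_P(m):
    x = rng.uniform(0.0, 1.0, size=m)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=m)

x_te, y_te = sample_P(5_000)          # large test sample, fixed across runs
for n in [5, 10, 20, 50, 100, 500]:
    x_tr, y_tr = sample_P(n)
    coeffs = np.polyfit(x_tr, y_tr, deg=3)   # A = degree-3 polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_err  = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"n = {n:4d}: train = {train_err:.4f}, test = {test_err:.4f}")
```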