Bias-Variance Decomposition
Bias-variance decomposition of machine learning algorithms for various loss functions.
Overview
Often, researchers use the terms bias and variance or "bias-variance tradeoff" to describe the performance of a model -- i.e., you may stumble upon talks, books, or articles where people say that a model has a high variance or high bias. So, what does that mean? In general, we might say that "high variance" is proportional to overfitting and "high bias" is proportional to underfitting.

So why are we attempting this bias-variance decomposition in the first place? Decomposing the loss into bias and variance helps us understand learning algorithms, as these concepts are correlated with underfitting and overfitting.
To use the more formal terms for bias and variance, assume we have a point estimator $\hat{\theta}$ of some parameter or function $\theta$. Then, the bias is commonly defined as the difference between the expected value of the estimator and the parameter that we want to estimate:

$$\text{Bias} = E[\hat{\theta}] - \theta.$$
If the bias is larger than zero, we also say that the estimator is positively biased; if the bias is smaller than zero, the estimator is negatively biased; and if the bias is exactly zero, the estimator is unbiased. Similarly, we define the variance as the difference between the expected value of the squared estimator and the squared expectation of the estimator:

$$\text{Var}(\hat{\theta}) = E[\hat{\theta}^2] - \big(E[\hat{\theta}]\big)^2.$$
Note that in the context of this lecture, it will be more convenient to write the variance in its alternative form:

$$\text{Var}(\hat{\theta}) = E\big[(E[\hat{\theta}] - \hat{\theta})^2\big].$$
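To make these definitions concrete, here is a small simulation sketch. The estimator chosen here, the plug-in sample variance of a normal sample, is purely an illustrative choice (it is not part of the discussion above); its bias and variance are estimated by drawing many samples with a known true variance.

import numpy as np

rng = np.random.default_rng(0)

n, true_var = 10, 1.0
estimates = np.array([np.var(rng.normal(0.0, 1.0, n))   # plug-in variance estimate (ddof=0)
                      for _ in range(100000)])

bias = estimates.mean() - true_var                       # E[theta_hat] - theta
variance = np.mean(estimates**2) - estimates.mean()**2   # E[theta_hat^2] - (E[theta_hat])^2

print('bias: %.3f (theory: %.3f)' % (bias, -true_var / n))
print('variance: %.3f' % variance)

The estimated bias comes out close to the theoretical value of $-\sigma^2/n$ for this particular estimator.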
Suppose there is an unknown target function or "true function" that we want to approximate. Now, suppose we have different training sets drawn from an unknown distribution defined as "true function + noise." The following plot shows different linear regression models, each fit to a different training set. None of these hypotheses approximate the true function well, except at two points (around x=-10 and x=6). Here, we can say that the bias is large because the difference between the true value and the predicted value, on average (here, average means "expectation over the training sets," not "expectation over examples in the training set"), is large:
http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/#bias-variance-decomposition Página 1 de 9
Bias-Variance Decomposition - mlxtend 28/07/2021 06:02
The next plot shows different unpruned decision tree models, each fit to a different training set. Note that these hypotheses fit the training data very closely. However, if we consider the expectation over training sets, the average hypothesis would fit the true function perfectly (given that the noise is unbiased and has an expected value of 0). As we can see, the variance is very large, since on average, a prediction differs a lot from the expected value of the prediction:
http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/#bias-variance-decomposition Página 2 de 9
Bias-Variance Decomposition - mlxtend 28/07/2021 06:02
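The following sketch mimics this setup numerically; the cubic "true function," the noise level, and the test point are made-up choices for illustration. It fits a linear regression model (high bias, like the first plot) and an unpruned decision tree (high variance, like the second plot) to many noisy training sets and compares their squared bias and variance at a single test point.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def true_function(x):                      # hypothetical "true function"
    return 0.05 * x**3

x_train = np.linspace(-10, 10, 30).reshape(-1, 1)
x_test = np.array([[4.0]])                 # a single test point
preds = {'linear regression': [], 'unpruned decision tree': []}

for _ in range(200):                       # 200 training sets = true function + noise
    y_train = true_function(x_train).ravel() + rng.normal(0, 5, len(x_train))
    preds['linear regression'].append(
        LinearRegression().fit(x_train, y_train).predict(x_test)[0])
    preds['unpruned decision tree'].append(
        DecisionTreeRegressor().fit(x_train, y_train).predict(x_test)[0])

y_true = float(true_function(x_test)[0, 0])
for name, p in preds.items():
    p = np.array(p)
    print('%s: bias^2=%.2f, variance=%.2f'
          % (name, (p.mean() - y_true)**2, p.var()))

The linear model shows a large squared bias and small variance, while the unpruned tree shows a small bias and large variance, mirroring the two plots described above.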
Before we introduce the bias-variance decomposition of the 0-1 loss for classification, let us start with the decomposition of the squared loss as an easy warm-up exercise to get familiar with the overall concept.

The previous section already listed the common formal definitions of bias and variance; however, let us define them again for convenience. In the context of these machine learning lecture notes, $\hat{y}$ denotes the prediction of a model (an estimator of the true target value $y$), so that

$$\text{Bias} = E[\hat{y}] - y \quad\text{and}\quad \text{Var}(\hat{y}) = E\big[(E[\hat{y}] - \hat{y})^2\big].$$

Note that unless noted otherwise, the expectation is over training sets!
To get started with the squared error loss decomposition into bias and variance, let us do some algebraic manipulation, i.e., adding and subtracting the expected value of $\hat{y}$ and then expanding the expression using the quadratic formula $(a + b)^2 = a^2 + b^2 + 2ab$:

$$
\begin{aligned}
S = (y - \hat{y})^2 \\
(y - \hat{y})^2 &= (y - E[\hat{y}] + E[\hat{y}] - \hat{y})^2 \\
&= (y - E[\hat{y}])^2 + (E[\hat{y}] - \hat{y})^2 + 2(y - E[\hat{y}])(E[\hat{y}] - \hat{y}).
\end{aligned}
$$
Next, we take the expectation on both sides. Since $y$ and $E[\hat{y}]$ are constants with respect to the expectation over training sets, the cross-term vanishes:

$$
\begin{aligned}
E\big[2(y - E[\hat{y}])(E[\hat{y}] - \hat{y})\big] &= 2E\big[(y - E[\hat{y}])(E[\hat{y}] - \hat{y})\big] \\
&= 2(y - E[\hat{y}])\,E\big[E[\hat{y}] - \hat{y}\big] \\
&= 2(y - E[\hat{y}])\big(E[E[\hat{y}]] - E[\hat{y}]\big) \\
&= 2(y - E[\hat{y}])\big(E[\hat{y}] - E[\hat{y}]\big) \\
&= 0,
\end{aligned}
$$

so that

$$
E\big[(y - \hat{y})^2\big] = \underbrace{(y - E[\hat{y}])^2}_{\text{Bias}^2} + \underbrace{E\big[(E[\hat{y}] - \hat{y})^2\big]}_{\text{Variance}}.
$$
So, this is the canonical decomposition of the squared error loss into bias and variance. The next section will discuss some approaches to decomposing the 0-1 loss that we commonly use for classification accuracy or error.
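Before moving on, here is a quick numerical sanity check of the squared-loss identity above; the true target and the spread of the predictions (one prediction per hypothetical training set) are made-up values for illustration.

import numpy as np

rng = np.random.default_rng(0)

y = 3.2                                      # fixed true target at one test point
y_hat = 2.0 + rng.normal(0.0, 1.5, 100000)   # hypothetical predictions, one per training set

loss = np.mean((y - y_hat)**2)               # E[(y - y_hat)^2]
bias_sq = (y - y_hat.mean())**2              # (y - E[y_hat])^2
variance = np.mean((y_hat.mean() - y_hat)**2)

print('%.4f = %.4f + %.4f' % (loss, bias_sq, variance))   # loss equals bias^2 + variance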
The following figure is a sketch of variance and bias in relation to the training error and generalization error -- how high variance relates to overfitting, and how large bias relates to underfitting:
http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/#bias-variance-decomposition Página 3 de 9
Bias-Variance Decomposition - mlxtend 28/07/2021 06:02
Domingos's paper [1] may offer the most intuitive and general formulation of the 0-1 loss decomposition. However, for simplicity, we will first go over the Kong & Dietterich formulation [2], which is the same as Domingos's but excludes the noise term.
The following summarizes the relevant terms we used for the squared loss in relation to the 0-1 loss. Recall that the 0-1 loss, $L$, is 0 if a class label is predicted correctly, and 1 otherwise. The main prediction for the squared error loss is simply the average over the predictions, $E[\hat{y}]$ (the expectation is over training sets); for the 0-1 loss, Kong & Dietterich and Domingos defined it as the mode. I.e., if a model predicts the label 1 more than 50% of the time (considering all possible training sets), then the main prediction is 1, and 0 otherwise.
Hence, as a result of using the mode to define the main prediction of the 0-1 loss, the bias is 1 if the main prediction does not agree with the true label $y$, and 0 otherwise (here, $E[\hat{y}]$ denotes the main prediction, i.e., the mode):

$$
\text{Bias} =
\begin{cases}
1 & \text{if } y \neq E[\hat{y}], \\
0 & \text{otherwise.}
\end{cases}
$$
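As a small illustration (the prediction values below, standing in for predictions from ten hypothetical training sets, are made up), the main prediction and the bias of the 0-1 loss can be computed like this:

import numpy as np

y_true = 1
y_hat = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])  # predictions from 10 hypothetical training sets

# main prediction = mode; for binary labels this is 1 if label 1 is predicted more than 50% of the time
main_prediction = int(y_hat.mean() > 0.5)
bias = int(main_prediction != y_true)   # 1 if the main prediction disagrees with the true label

print(main_prediction, bias)            # -> 1 0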
http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/#bias-variance-decomposition Página 4 de 9
Bias-Variance Decomposition - mlxtend 28/07/2021 06:02
The variance of the 0-1 loss is defined as the probability that the predicted label does not match the main prediction:

$$\text{Variance} = P(\hat{y} \neq E[\hat{y}]).$$
Next, let us take a look at what happens to the loss if the bias is 0. Given the general definition of the loss, loss = bias + variance, if the bias is 0, then the loss equals the variance:

$$\text{Loss} = 0 + \text{Variance} = P(\hat{y} \neq y) = \text{Variance} = P(\hat{y} \neq E[\hat{y}]).$$

In other words, if a model has zero bias, its loss is entirely defined by the variance, which is intuitive if we think of variance as being proportional to overfitting.
The more surprising scenario is if the bias is equal to 1. If the bias is equal to 1, as explained by Pedro Domingos, increasing the variance can decrease the loss, which is an interesting observation. This can be seen by first rewriting the 0-1 loss function as

$$\text{Loss} = P(\hat{y} \neq y) = 1 - P(\hat{y} = y).$$
(Note that we have not done anything new, yet.) Now, if we look at the previous equation of the bias, if the bias is 1, we have $y \neq E[\hat{y}]$. Since $y$ does not equal the main prediction, a prediction $\hat{y}$ that equals $y$ cannot equal the main prediction, and vice versa (for binary labels, $\hat{y} = y$ if and only if $\hat{y} \neq E[\hat{y}]$). Using the "inverse" ("1 minus"), we can then write the loss as

$$\text{Loss} = P(\hat{y} \neq y) = 1 - P(\hat{y} = y) = 1 - P(\hat{y} \neq E[\hat{y}]).$$
Since the variance is $P(\hat{y} \neq E[\hat{y}])$, the loss is hence defined as "loss = bias - variance" if the bias is 1 (or "loss = 1 - variance"). This might be quite unintuitive at first, but the explanation Kong, Dietterich, and Domingos offer is that if a model has a very high bias such that its main prediction is always wrong, increasing the variance can be beneficial, since increasing the variance would push the decision boundary, which might lead to some correct predictions just by chance. In other words, for scenarios with high bias, increasing the variance can improve (decrease) the loss!
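The two cases can be checked numerically with a small sketch; the prediction arrays below are made-up values. The helper computes the main prediction, bias, variance, and average 0-1 loss over hypothetical training sets:

import numpy as np

def zero_one_stats(y_true, y_hat):
    # main prediction (mode for binary labels), bias, variance, and average 0-1 loss
    main = int(np.mean(y_hat) > 0.5)
    bias = int(main != y_true)
    variance = float(np.mean(y_hat != main))
    loss = float(np.mean(y_hat != y_true))
    return bias, variance, loss

# zero-bias case: loss == bias + variance
print(zero_one_stats(1, np.array([1, 1, 0, 1, 1])))   # -> (0, 0.2, 0.2)

# bias = 1 case: loss == bias - variance
print(zero_one_stats(0, np.array([1, 1, 0, 1, 1])))   # -> (1, 0.2, 0.8)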
References
[1] Domingos, Pedro. "A unified bias-variance decomposition." Proceedings of the 17th International Conference on Machine Learning, 2000.
[2] Dietterich, Thomas G., and Eun Bae Kong. "Machine learning bias, statistical bias, and statistical variance of decision tree algorithms." Technical report, Department of Computer Science, Oregon State University, 1995.
http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/#bias-variance-decomposition Página 5 de 9
Bias-Variance Decomposition - mlxtend 28/07/2021 06:02
The following example prepares the bias-variance decomposition of a decision tree classifier on the Iris dataset:

from mlxtend.data import iris_data
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.3,
                                                     random_state=123,
                                                     shuffle=True,
                                                     stratify=y)

tree = DecisionTreeClassifier(random_state=123)
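The decomposition itself is computed with the bias_variance_decomp function documented in the API section below; a sketch of the call for this classifier (the random_seed value here is an arbitrary illustrative choice):

from mlxtend.evaluate import bias_variance_decomp

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        tree, X_train, y_train, X_test, y_test,
        loss='0-1_loss',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)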
For comparison, here is the bias-variance decomposition of a bagging classifier, which should intuitively have a lower variance than a single decision tree:
from sklearn.ensemble import BaggingClassifier

tree = DecisionTreeClassifier(random_state=123)
bag = BaggingClassifier(base_estimator=tree,   # use estimator=tree in scikit-learn >= 1.2
                        n_estimators=100,
                        random_state=123)
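The corresponding call is a sketch along the same lines, with bag passed as the estimator:

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        bag, X_train, y_train, X_test, y_test,
        loss='0-1_loss',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)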
http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/#bias-variance-decomposition Página 6 de 9
Bias-Variance Decomposition - mlxtend 28/07/2021 06:02
The decomposition also supports the squared error loss ('mse'); the following example applies it to a decision tree regressor on the Boston Housing dataset:

from mlxtend.data import boston_housing_data
from sklearn.tree import DecisionTreeRegressor

X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.3,
                                                     random_state=123,
                                                     shuffle=True)

tree = DecisionTreeRegressor(random_state=123)
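For the regressor, a sketch of the same kind of call applies, this time with loss='mse' (the random_seed value is again an arbitrary choice):

from mlxtend.evaluate import bias_variance_decomp

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        tree, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)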
For comparison, the bias-variance decomposition of a bagging regressor is shown below, which should
intuitively have a lower variance than a single decision tree:
from sklearn.ensemble import BaggingRegressor

tree = DecisionTreeRegressor(random_state=123)
bag = BaggingRegressor(base_estimator=tree,   # use estimator=tree in scikit-learn >= 1.2
                       n_estimators=100,
                       random_state=123)
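And a sketch of the corresponding call for the bagging regressor:

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        bag, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)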
http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/#bias-variance-decomposition Página 7 de 9
Bias-Variance Decomposition - mlxtend 28/07/2021 06:02
The decomposition can also be applied to a TensorFlow/Keras model. First, the network below is trained and evaluated on the original train/test split:

import numpy as np
import tensorflow as tf
from sklearn.metrics import mean_squared_error

np.random.seed(1)
tf.random.set_seed(1)

X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.3,
                                                     random_state=123,
                                                     shuffle=True)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation=tf.nn.relu),
    tf.keras.layers.Dense(1)
])

optimizer = tf.keras.optimizers.Adam()
model.compile(loss='mean_squared_error', optimizer=optimizer)
model.fit(X_train, y_train, epochs=100, verbose=0)  # train the network; the epoch count here is an illustrative choice

mean_squared_error(model.predict(X_test), y_test)

32.69300595184836
Note that it is highly recommended to use the same number of training epochs that you would use on the
original training set to ensure convergence:
np.random.seed(1)
tf.random.set_seed(1)
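# A sketch of the corresponding decomposition call, continuing the code above.
# It assumes that extra keyword arguments such as `epochs` and `verbose` are forwarded
# to the estimator's .fit() method via fit_params (see the API section below); the
# epoch count and num_rounds values are illustrative choices.
from mlxtend.evaluate import bias_variance_decomp

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model, X_train, y_train, X_test, y_test,
        loss='mse',
        num_rounds=100,
        random_seed=123,
        epochs=100,
        verbose=0)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)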
API
bias_variance_decomp(estimator, X_train, y_train, X_test, y_test, loss='0-1_loss', num_rounds=200, random_seed=None, fit_params)
Parameters

estimator : object
    A classifier or regressor object or class implementing both a fit and predict method similar to the scikit-learn API.
http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/#bias-variance-decomposition Página 8 de 9
Bias-Variance Decomposition - mlxtend 28/07/2021 06:02
X_train : array-like
    A training dataset for drawing the bootstrap samples to carry out the bias-variance decomposition.

y_train : array-like
    Targets (class labels, continuous values in case of regression) associated with the X_train examples.

X_test : array-like
    The test dataset for computing the average loss, bias, and variance.

y_test : array-like
    Targets (class labels, continuous values in case of regression) associated with the X_test examples.

loss : str (default='0-1_loss')
    Loss function for performing the bias-variance decomposition. Currently allowed values are '0-1_loss' and 'mse'.

num_rounds : int (default=200)
    Number of bootstrap rounds for performing the bias-variance decomposition.

random_seed : int (default=None)
    Random seed for the bootstrap sampling used for the bias-variance decomposition.

fit_params : additional parameters
    Additional parameters to be passed to the .fit() function of the estimator when it is fit to the bootstrap samples.
Returns
avg_expected_loss, avg_bias, avg_var : returns the average expected loss, average bias, and average variance (all floats), where the average is computed over the data points in the test set.
Examples
For usage examples, please see http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/