
Bias-Variance Decomposition
Bias variance decomposition of machine learning algorithms for various loss functions.

from mlxtend.evaluate import bias_variance_decomp

Overview
Often, researchers use the terms bias and variance or "bias-variance tradeoff" to describe the performance of a model -- i.e., you may stumble upon talks, books, or articles where people say that a model has a high variance or high bias. So, what does that mean? In general, we might say that "high variance" is proportional to overfitting, and "high bias" is proportional to underfitting.

Anyway, why are we attempting to do this bias-variance decomposition in the first place? The decomposition of the loss into bias and variance helps us understand learning algorithms, as these concepts are correlated to underfitting and overfitting.

To use the more formal terms for bias and variance, assume we have a point estimator $\hat{\theta}$ of some parameter or function $\theta$. Then, the bias is commonly defined as the difference between the expected value of the estimator and the parameter that we want to estimate:

$$\text{Bias} = E[\hat{\theta}] - \theta.$$

If the bias is larger than zero, we also say that the estimator is positively biased; if the bias is smaller than zero, the estimator is negatively biased; and if the bias is exactly zero, the estimator is unbiased. Similarly, we define the variance as the difference between the expected value of the squared estimator and the squared expectation of the estimator:

$$\text{Var}(\hat{\theta}) = E\big[\hat{\theta}^2\big] - \big(E[\hat{\theta}]\big)^2.$$

Note that in the context of this lecture, it will be more convenient to write the variance in its alternative form:

$$\text{Var}(\hat{\theta}) = E\big[\big(E[\hat{\theta}] - \hat{\theta}\big)^2\big].$$
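As a quick sanity check of these two definitions, the following is a minimal NumPy sketch (the distribution, sample size, and number of simulated "training sets" are assumed toy values) that estimates the bias and variance of two estimators of a Gaussian's variance: the unbiased sample variance and the biased maximum-likelihood version.

```python
import numpy as np

rng = np.random.default_rng(0)

true_var = 4.0                     # theta: the parameter we want to estimate
n, num_training_sets = 10, 50_000  # small samples, many simulated "training sets"

# Draw many independent samples and compute both estimators on each of them
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(num_training_sets, n))
theta_hat_unbiased = samples.var(axis=1, ddof=1)  # sample variance (unbiased)
theta_hat_mle = samples.var(axis=1, ddof=0)       # maximum-likelihood variance (biased)

for name, theta_hat in [('unbiased', theta_hat_unbiased), ('MLE', theta_hat_mle)]:
    bias = theta_hat.mean() - true_var                      # E[theta_hat] - theta
    variance = np.mean((theta_hat.mean() - theta_hat)**2)   # E[(E[theta_hat] - theta_hat)^2]
    print('%8s estimator: bias = %.3f, variance = %.3f' % (name, bias, variance))
```

The MLE estimator should come out negatively biased (its expectation is $\frac{n-1}{n}\theta$), while the ddof=1 version should have a bias close to zero; both have nonzero variance.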

To illustrate the concept further in the context of machine learning ...

Suppose there is an unknown target function or "true function" that we want to approximate. Now, suppose we have different training sets drawn from an unknown distribution defined as "true function + noise." The following plot shows different linear regression models, each fit to a different training set. None of these hypotheses approximate the true function well, except at two points (around x=-10 and x=6). Here, we can say that the bias is large because the difference between the true value and the predicted value, on average (here, average means "expectation over the training sets," not "expectation over examples in the training set"), is large:


The next plot shows different unpruned decision tree models, each fit to a different training set. Note that these hypotheses fit the training data very closely. However, if we consider the expectation over training sets, the average hypothesis would fit the true function perfectly (given that the noise is unbiased and has an expected value of 0). As we can see, the variance is very large, since on average, a prediction differs a lot from the expected value of the prediction:
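This behavior is easy to reproduce with a small simulation. The sketch below uses an assumed toy setup (the true function, noise level, and sample sizes are illustrative choices, not taken from the plots): it fits a linear model and an unpruned decision tree to many noisy training sets and compares the squared bias and the variance of their predictions at one fixed query point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def f(x):                        # assumed "true function"
    return 0.05 * x**2 + np.sin(x)

x_query = np.array([[3.0]])      # fixed point at which we inspect the predictions
y_query = f(x_query.ravel())[0]
n_train, n_sets = 30, 2000

preds = {'linear': [], 'tree': []}
for _ in range(n_sets):
    # each training set = "true function + noise", drawn independently
    X = rng.uniform(-10, 10, size=(n_train, 1))
    y = f(X.ravel()) + rng.normal(scale=0.5, size=n_train)
    preds['linear'].append(LinearRegression().fit(X, y).predict(x_query)[0])
    preds['tree'].append(DecisionTreeRegressor().fit(X, y).predict(x_query)[0])

for name, p in preds.items():
    p = np.asarray(p)
    bias_sq = (y_query - p.mean())**2   # (y - E[y_hat])^2, expectation over training sets
    variance = p.var()                  # E[(E[y_hat] - y_hat)^2]
    print('%6s model: bias^2 = %.3f, variance = %.3f' % (name, bias_sq, variance))
```

The linear model should show the larger squared bias (it cannot represent the curvature of the target), while the unpruned tree should show the larger variance (it chases the noise in each individual training set).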

Bias-Variance Decomposition of the Squared Loss


We can decompose a loss function such as the squared loss into three terms: a variance, a bias, and a noise term (and the same is true for the decomposition of the 0-1 loss later). However, for simplicity, we will ignore the noise term.


Before we introduce the bias-variance decomposition of the 0-1 loss for classification, let us start with the decomposition of the squared loss as an easy warm-up exercise to get familiar with the overall concept.

The previous section already listed the common formal definitions of bias and variance; however, let us define them again for convenience:

$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta, \qquad \text{Var}(\hat{\theta}) = E\big[\big(E[\hat{\theta}] - \hat{\theta}\big)^2\big].$$

Recall that in the context of these machine learning lecture notes, we defined

- the true or target function as $y = f(x)$,
- the predicted target value as $\hat{y} = \hat{f}(x) = h(x)$,
- and the squared loss as $S = (y - \hat{y})^2$. (I use $S$ here because it will be easier to tell it apart from the $E$, which we use for the expectation in this lecture.)

Note that unless noted otherwise, the expectation is over training sets!

To get started with the squared error loss decomposition into bias and variance, let us do some algebraic manipulation, i.e., adding and subtracting the expected value of $\hat{y}$ and then expanding the expression using the quadratic formula $(a+b)^2 = a^2 + b^2 + 2ab$:

$$
\begin{aligned}
S = (y - \hat{y})^2 &= (y - E[\hat{y}] + E[\hat{y}] - \hat{y})^2 \\
&= (y - E[\hat{y}])^2 + (E[\hat{y}] - \hat{y})^2 + 2(y - E[\hat{y}])(E[\hat{y}] - \hat{y}).
\end{aligned}
$$
Next, we just apply the expectation on both sides, and we are already done:

$$
\begin{aligned}
E[S] = E[(y - \hat{y})^2] &= (y - E[\hat{y}])^2 + E[(E[\hat{y}] - \hat{y})^2] \\
&= [\text{Bias}]^2 + \text{Variance}.
\end{aligned}
$$
You may wonder what happened to the "$2ab$" term ($2(y - E[\hat{y}])(E[\hat{y}] - \hat{y})$) when we applied the expectation. It turns out that it evaluates to zero and hence vanishes from the equation, which can be shown as follows (note that $y - E[\hat{y}]$ is a constant with respect to the expectation over training sets, so it can be pulled out):

$$
\begin{aligned}
E[2(y - E[\hat{y}])(E[\hat{y}] - \hat{y})] &= 2\,E[(y - E[\hat{y}])(E[\hat{y}] - \hat{y})] \\
&= 2\,(y - E[\hat{y}])\,E[E[\hat{y}] - \hat{y}] \\
&= 2\,(y - E[\hat{y}])\,(E[E[\hat{y}]] - E[\hat{y}]) \\
&= 2\,(y - E[\hat{y}])\,(E[\hat{y}] - E[\hat{y}]) \\
&= 0.
\end{aligned}
$$
So, this is the canonical decomposition of the squared error loss into bias and variance. The next section will discuss some approaches that have been proposed for decomposing the 0-1 loss that we commonly use for classification accuracy or error.
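The decomposition is also easy to verify numerically. The following sketch (again an assumed toy regression setup) refits a shallow decision tree to freshly drawn training sets, estimates $E[\hat{y}]$ at a fixed, noise-free test point, and checks that the average squared error equals the squared bias plus the variance:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

def f(x):
    return np.sin(x)             # assumed noise-free "true function"

x_test = np.array([[1.5]])
y_true = f(x_test.ravel())[0]    # y: fixed target value (noise term ignored, as above)

preds = []
for _ in range(5000):            # the expectation is over training sets
    X = rng.uniform(-3, 3, size=(25, 1))
    y = f(X.ravel()) + rng.normal(scale=0.3, size=25)
    preds.append(DecisionTreeRegressor(max_depth=3).fit(X, y).predict(x_test)[0])

preds = np.asarray(preds)
expected_loss = np.mean((y_true - preds)**2)      # E[(y - y_hat)^2]
bias_sq = (y_true - preds.mean())**2              # (y - E[y_hat])^2
variance = np.mean((preds.mean() - preds)**2)     # E[(E[y_hat] - y_hat)^2]

print('E[S] = %.4f  vs  bias^2 + variance = %.4f' % (expected_loss, bias_sq + variance))
```

The two printed numbers agree up to floating-point error, since the identity also holds exactly for the empirical averages.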

The following figure is a sketch of variance and bias in relation to the training error and generalization error -- how high variance relates to overfitting, and how large bias relates to underfitting:


Bias-Variance Decomposition of the 0-1 Loss


Note that decomposing the 0-1 loss into bias and variance components is not as straightforward as for the squared error loss. To quote Pedro Domingos, a well-known machine learning researcher and professor at the University of Washington:

"several authors have proposed bias-variance decompositions related to zero-one loss (Kong & Dietterich, 1995; Breiman, 1996b; Kohavi & Wolpert, 1996; Tibshirani, 1996; Friedman, 1997). However, each of these decompositions has significant shortcomings." [1]

In fact, the paper this quote was taken from may offer the most intuitive and general formulation at this point. However, for simplicity, we will first go over the Kong & Dietterich formulation [2] of the 0-1 loss decomposition, which is the same as Domingos's but excludes the noise term.

The table below summarizes the relevant terms we used for the squared loss in relation to the 0-1 loss. Recall that the 0-1 loss, $L$, is 0 if a class label is predicted correctly, and 1 otherwise. The main prediction for the squared error loss is simply the average over the predictions, $E[\hat{y}]$ (the expectation is over training sets); for the 0-1 loss, Kong & Dietterich and Domingos defined it as the mode. I.e., if a model predicts the label 1 more than 50% of the time (considering all possible training sets), then the main prediction is 1, and 0 otherwise.

| | Squared Loss | 0-1 Loss |
|---|---|---|
| Single loss | $(y - \hat{y})^2$ | $L(y, \hat{y})$ |
| Expected loss | $E[(y - \hat{y})^2]$ | $E[L(y, \hat{y})]$ |
| Main prediction | $E[\hat{y}]$ (mean/average) | mode |
| Bias$^2$ | $(y - E[\hat{y}])^2$ | $L(y, E[\hat{y}])$ |
| Variance | $E[(E[\hat{y}] - \hat{y})^2]$ | $E[L(\hat{y}, E[\hat{y}])]$ |
Hence, as a result of using the mode to define the main prediction of the 0-1 loss, the bias is 1 if the main prediction does not agree with the true label $y$, and 0 otherwise:

$$
\text{Bias} = \begin{cases} 1 & \text{if } y \neq E[\hat{y}], \\ 0 & \text{otherwise.} \end{cases}
$$


The variance of the 0-1 loss is defined as the probability that the predicted label does not match the main prediction:

$$\text{Variance} = P(\hat{y} \neq E[\hat{y}]).$$

Next, let us take a look at what happens to the loss if the bias is 0. Given the general definition of the loss, loss = bias + variance, if the bias is 0, then we define the loss as the variance:

$$\text{Loss} = 0 + \text{Variance} = P(\hat{y} \neq y) = \text{Variance} = P(\hat{y} \neq E[\hat{y}]).$$

In other words, if a model has zero bias, its loss is entirely defined by the variance, which is intuitive if we think of variance as being proportional to overfitting.

The more surprising scenario occurs when the bias is equal to 1. If the bias is equal to 1, as explained by Pedro Domingos, increasing the variance can decrease the loss, which is an interesting observation. This can be seen by first rewriting the 0-1 loss function as

$$\text{Loss} = P(\hat{y} \neq y) = 1 - P(\hat{y} = y).$$

(Note that we have not done anything new, yet.) Now, if we look at the previous equation for the bias: if the bias is 1, we have $y \neq E[\hat{y}]$. Since $y$ is not equal to the main prediction, whenever $\hat{y}$ is equal to $y$, $\hat{y}$ must not be equal to the main prediction. Using the "inverse" ("1 minus"), we can then write the loss as

$$\text{Loss} = P(\hat{y} \neq y) = 1 - P(\hat{y} = y) = 1 - P(\hat{y} \neq E[\hat{y}]).$$

Hence, if the bias is 1, the loss is "loss = bias - variance" (or "loss = 1 - variance"). This might be quite unintuitive at first, but the explanation Kong, Dietterich, and Domingos offer is that if a model has a very high bias such that its main prediction is always wrong, increasing the variance can be beneficial, since increasing the variance would push the decision boundary, which might lead to some correct predictions just by chance. In other words, for scenarios with high bias, increasing the variance can improve (decrease) the loss!
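To make the two cases concrete, here is a small simulated sketch (the prediction probabilities below are arbitrary assumptions, chosen only for illustration). For two hypothetical binary classifiers evaluated at a single test point -- one whose main prediction is correct (bias 0), one whose main prediction is wrong (bias 1) -- it computes the mode-based main prediction and the 0-1 bias, variance, and loss:

```python
import numpy as np

rng = np.random.default_rng(3)
y_true = 1                      # true label at a fixed test point

def zero_one_decomposition(predictions, y):
    """Kong & Dietterich-style 0-1 decomposition at a single test point."""
    predictions = np.asarray(predictions)
    main_pred = int(predictions.mean() > 0.5)      # mode of the binary predictions
    bias = int(main_pred != y)                     # 1 if the main prediction is wrong
    variance = np.mean(predictions != main_pred)   # P(y_hat != main prediction)
    loss = np.mean(predictions != y)               # P(y_hat != y)
    return bias, variance, loss

# Each entry plays the role of the prediction obtained from one training set.
preds_a = rng.binomial(1, p=0.8, size=10_000)  # correct (label 1) 80% of the time -> bias 0
preds_b = rng.binomial(1, p=0.3, size=10_000)  # correct only 30% of the time      -> bias 1

for name, preds in [('A (low bias)', preds_a), ('B (high bias)', preds_b)]:
    bias, var, loss = zero_one_decomposition(preds, y_true)
    print('%14s: bias = %d, variance = %.3f, loss = %.3f' % (name, bias, var, loss))
```

Classifier A should show loss = bias + variance (about 0 + 0.2), while classifier B should show loss = bias - variance (about 1 - 0.3): for the heavily biased model, additional variance actually lowers the 0-1 loss, exactly as described above.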

References
[1] Domingos, Pedro. "A unified bias-variance decomposition." Proceedings of the 17th International Conference on Machine Learning, 2000.
[2] Dietterich, Thomas G., and Eun Bae Kong. "Machine learning bias, statistical bias, and statistical variance of decision tree algorithms." Technical report, Department of Computer Science, Oregon State University, 1995.

Example 1 -- Bias-Variance Decomposition of a Decision Tree Classifier


from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split

X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)

tree = DecisionTreeClassifier(random_state=123)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        tree, X_train, y_train, X_test, y_test,
        loss='0-1_loss',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 0.062
Average bias: 0.022
Average variance: 0.040

For comparison, the bias-variance decomposition of a bagging classifier, which should intuitively have a lower variance than a single decision tree:

from sklearn.ensemble import BaggingClassifier

tree = DecisionTreeClassifier(random_state=123)
bag = BaggingClassifier(base_estimator=tree,
                        n_estimators=100,
                        random_state=123)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        bag, X_train, y_train, X_test, y_test,
        loss='0-1_loss',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 0.048
Average bias: 0.022
Average variance: 0.026

Example 2 -- Bias-Variance Decomposition of a Decision Tree Regressor


from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeRegressor
from mlxtend.data import boston_housing_data
from sklearn.model_selection import train_test_split

X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

tree = DecisionTreeRegressor(random_state=123)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        tree, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 31.756
Average bias: 13.856
Average variance: 17.900

For comparison, the bias-variance decomposition of a bagging regressor is shown below, which should
intuitively have a lower variance than a single decision tree:

from sklearn.ensemble import BaggingRegressor

tree = DecisionTreeRegressor(random_state=123)
bag = BaggingRegressor(base_estimator=tree,
                       n_estimators=100,
                       random_state=123)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        bag, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 18.620
Average bias: 15.461
Average variance: 3.159

Example 3 -- TensorFlow/Keras Support


Since mlxtend v0.18.0, bias_variance_decomp also supports Keras models. Note that the original model is reset in each round (before refitting it to the bootstrap samples).


from mlxtend.evaluate import bias_variance_decomp
from mlxtend.data import boston_housing_data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import tensorflow as tf
import numpy as np

np.random.seed(1)
tf.random.set_seed(1)

X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation=tf.nn.relu),
    tf.keras.layers.Dense(1)
])

optimizer = tf.keras.optimizers.Adam()
model.compile(loss='mean_squared_error', optimizer=optimizer)

model.fit(X_train, y_train, epochs=100, verbose=0)

mean_squared_error(model.predict(X_test), y_test)

32.69300595184836

Note that it is highly recommended to use the same number of training epochs that you would use on the
original training set to ensure convergence:

np.random.seed(1)
tf.random.set_seed(1)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model, X_train, y_train, X_test, y_test,
        loss='mse',
        num_rounds=100,
        random_seed=123,
        epochs=200,  # fit_param
        verbose=0)   # fit_param

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 32.740
Average bias: 27.474
Average variance: 5.265

API

bias_variance_decomp(estimator, X_train, y_train, X_test, y_test, loss='0-1_loss', num_rounds=200, random_seed=None, **fit_params)

estimator : object

A classifier or regressor object or class implementing both a fit and predict method similar to the scikit-learn API.


X_train : array-like, shape=(num_examples, num_features)

A training dataset for drawing the bootstrap samples to carry out the bias-variance decomposition.

y_train : array-like, shape=(num_examples)

Targets (class labels, continuous values in case of regression) associated with the X_train examples.

X_test : array-like, shape=(num_examples, num_features)

The test dataset for computing the average loss, bias, and variance.

y_test : array-like, shape=(num_examples)

Targets (class labels, continuous values in case of regression) associated with the X_test examples.

loss : str (default='0-1_loss')

Loss function for performing the bias-variance decomposition. Currently allowed values are '0-1_loss' and
'mse'.

num_rounds : int (default=200)

Number of bootstrap rounds for performing the bias-variance decomposition.

random_seed : int (default=None)

Random seed for the bootstrap sampling used for the bias-variance decomposition.

fit_params : additional parameters

Additional parameters to be passed to the .fit() function of the estimator when it is fit to the bootstrap samples.

Returns

avg_expected_loss, avg_bias, avg_var : returns the average expected loss, average bias, and average variance (all floats), where the average is computed over the data points in the test set.

Examples

For usage examples, please see http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/


Copyright © 2014-2020 Sebastian Raschka (http://sebastianraschka.com)


Documentation built with MkDocs (http://www.mkdocs.org/).
