Linear Regression
In the following, we will apply the mathematical concepts from Chapters 2, 5, 6, and 7 to solving linear regression (curve-fitting) problems. In regression, we want to find a function f that maps inputs x ∈ R^D to corresponding function values f(x) ∈ R, given a set of training inputs x_n and corresponding observations y_n = f(x_n) + ε, where ε is a random variable that comprises measurement noise and unmodeled processes. An illustration of such a regression problem is given in Figure 9.1. A typical regression problem is shown in Figure 9.1(a): For some input values x we observe (noisy) function values y = f(x) + ε. The task is to infer the function f that generated the data. A possible solution is given in Figure 9.1(b), where we also show three distributions centered at the function values f(x) that represent the noise in the data.
Regression is a fundamental problem in machine learning, and regression problems appear in a diverse range of research areas and applications, including time-series analysis (e.g., system identification), control and robotics (e.g., reinforcement learning, forward/inverse model learning), optimization (e.g., line searches, global optimization), and deep-learning applications (e.g., computer games, speech-to-text translation, image recognition, automatic video annotation). Regression is also a key ingredient of classification algorithms.
Figure 9.1 (a) Regression problem: observed noisy function values from which we wish to infer the underlying function that generated the data; (b) possible regression solution: a function that could have generated the data (blue), with an indication of the measurement noise of the function values at the corresponding inputs (orange distributions).
Finding a regression function requires solving a variety of problems, including the following:
• Choice of the model (type) and the parametrization of the regression function. Given a dataset, what function classes (e.g., polynomials) are good candidates for modeling the data, and what particular parametrization (e.g., degree of the polynomial) should we choose? Model selection, as discussed in Section 8.5, allows us to compare various models to find the simplest model that explains the training data reasonably well.
• Finding good parameters. Having chosen a model of the regression function, how do we find good model parameters? Here, we will need to look at different loss/objective functions (they determine what a "good" fit is) and optimization algorithms that allow us to minimize this loss.
• Overfitting and model selection. Overfitting is a problem when the regression function fits the training data "too well" but does not generalize to unseen test data. Overfitting typically occurs if the underlying model (or its parametrization) is overly flexible and expressive; see Section 8.5. We will look at the underlying reasons and discuss ways to mitigate the effect of overfitting in the context of linear regression.
• Relationship between loss functions and parameter priors. Loss functions (optimization objectives) are often motivated and induced by probabilistic models. We will look at the connection between loss functions and the underlying prior assumptions that induce these losses.
• Uncertainty modeling. In any practical setting, we have access to only a finite, potentially large, amount of (training) data for selecting the model class and the corresponding parameters. Given that this finite amount of training data does not cover all possible scenarios, we may want to describe the remaining parameter uncertainty to obtain a measure of confidence of the model's prediction at test time; the smaller the training set, the more important uncertainty modeling becomes. Consistent modeling of uncertainty equips model predictions with confidence bounds.
In the following, we will be using the mathematical tools from Chapters 3, 5, 6, and 7 to solve linear regression problems. We will discuss maximum likelihood and maximum a posteriori (MAP) estimation to find optimal model parameters. Using these parameter estimates, we will have a brief look at generalization errors and overfitting. Toward the end of this chapter, we will discuss Bayesian linear regression, which allows us to reason about model parameters at a higher level, thereby removing some of the problems encountered in maximum likelihood and MAP estimation.
9.1 Problem Formulation
Figure 9.2 Linear regression without features. (a) Example functions (straight lines) that can be described using the linear model in (9.2); (b) training set; (c) maximum likelihood estimate.
We consider the regression problem

y = f(x) + ε ,    (9.1)

where x ∈ R^D are inputs and y ∈ R are noisy function values (targets). Furthermore, ε ∼ N(0, σ²) is independent, identically distributed (i.i.d.) measurement noise. In this particular case, ε is Gaussian distributed with mean 0 and variance σ². Our objective is to find a function that is close (similar) to the unknown function that generated the data.
In this chapter, we focus on parametric models, i.e., we choose a parametrized function f and find parameters that "work well" for modeling the data. In linear regression, we consider the special case that the parameters appear linearly in our model. An example of linear regression is

y = x^⊤θ + ε ,    (9.2)

with the corresponding likelihood

p(y | x, θ) = N(y | x^⊤θ, σ²) ,    (9.3)
which is the probability of observing a target value y given that we know the input location x and the parameters θ. Note that the only source of uncertainty originates from the observation noise (as x and θ are assumed known in (9.3)). Without any observation noise, the relationship between x and y would be deterministic and (9.3) would be a delta distribution. For x, θ ∈ R, the linear regression model in (9.2) describes straight lines (linear functions), and the parameter θ is the slope of the line. Figure 9.2(a) shows some examples. This model is not only linear in the parameters, but also linear in the inputs x. We will see later that y = φ^⊤(x)θ for nonlinear transformations φ is also a linear regression model, because "linear regression" refers to models that are "linear in the parameters", i.e., models that describe a function by a linear combination of input features.
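To make this concrete, here is a minimal sketch in Python/NumPy (the feature map, parameter values, and noise level are assumptions chosen purely for illustration) that evaluates the likelihood (9.3) for a model that is nonlinear in the input but linear in the parameters:

```python
import numpy as np
from scipy.stats import norm

def phi(x):
    # Polynomial features (1, x, x^2); any nonlinear transformation works,
    # since the model remains linear in the parameters theta.
    return np.array([1.0, x, x**2])

theta = np.array([0.5, -1.0, 0.2])   # hypothetical parameter values
sigma = 0.3                          # hypothetical noise standard deviation

def likelihood(y, x):
    # p(y | x, theta) = N(y | phi(x)^T theta, sigma^2), cf. (9.3)
    mean = phi(x) @ theta
    return norm.pdf(y, loc=mean, scale=sigma)

print(likelihood(y=0.1, x=1.5))
```

Swapping phi for any other feature map leaves the model linear in θ, which is all that "linear regression" requires.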
In the following, we will discuss in more detail how to find good parameters θ and how to evaluate whether a parameter set "works well".
9.2 Parameter Estimation
To find the desired parameters θ_ML that maximize the likelihood, we typically perform gradient ascent (or gradient descent on the negative likelihood). In the case of linear regression we consider here, however, a closed-form solution exists, which makes iterative gradient descent unnecessary. In practice, instead of maximizing the likelihood directly, we apply the log-transformation to the likelihood function and minimize the negative log-likelihood. Since the logarithm is a (strictly) monotonically increasing function, the optimum of a function f is identical to the optimum of log f.
Remark (Log Transformation). Since the likelihood function is a product of N Gaussian distributions, the log-transformation is useful because (a) it does not suffer from numerical underflow, and (b) the differentiation rules turn out to be simpler. Numerical underflow will be a problem when we multiply N probabilities, where N is the number of data points, since we cannot represent very small numbers, such as 10^{-256}. Furthermore, the log-transform turns the product into a sum of log-probabilities, such that the corresponding gradient is a sum of individual gradients, instead of a repeated application of the product rule (5.54) to compute the gradient of a product of N terms. ♦
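A quick numerical illustration of this remark (a sketch with synthetic data and a fixed unit Gaussian; none of the numbers come from the text): the product of many densities underflows, while the sum of log-densities does not.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 1000
y = rng.normal(loc=0.0, scale=1.0, size=N)   # hypothetical observations

# The product of N Gaussian densities underflows to 0.0 for large N ...
prod_likelihood = np.prod(norm.pdf(y, loc=0.0, scale=1.0))

# ... whereas the sum of log-densities remains perfectly representable.
log_likelihood = np.sum(norm.logpdf(y, loc=0.0, scale=1.0))

print(prod_likelihood)   # 0.0 (numerical underflow)
print(log_likelihood)    # a finite negative number
```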
To find the optimal parameters θ_ML of our linear regression problem, we minimize the negative log-likelihood

−log p(y | X, θ) = −log ∏_{n=1}^{N} p(y_n | x_n, θ) = −∑_{n=1}^{N} log p(y_n | x_n, θ) ,    (9.7)

where we exploited that the likelihood (9.4) factorizes over the number of data points due to our independence assumption on the training set.
In the linear regression model (9.2), the likelihood is Gaussian (due to the Gaussian additive noise term), so that we arrive at

log p(y_n | x_n, θ) = −(1/(2σ²)) (y_n − x_n^⊤θ)² + const ,    (9.8)

where the constant includes all terms independent of θ. Using (9.8) in the negative log-likelihood (9.7), we obtain (ignoring the constant terms)

L(θ) := −log p(y | X, θ) = (1/(2σ²)) ∑_{n=1}^{N} (y_n − x_n^⊤θ)²    (9.9a)
      = (1/(2σ²)) (y − Xθ)^⊤(y − Xθ) = (1/(2σ²)) ‖y − Xθ‖² ,    (9.9b)

where X = [x_1, …, x_N]^⊤ ∈ R^{N×D}. The negative log-likelihood function is also called the error function.
Remark. There is some notation overloading: We often summarize the set of training inputs in X, whereas in the design matrix we additionally assume a specific "shape". ♦
In (9.9b) we used the fact that the sum of squared errors between the observations y_n and the corresponding model predictions x_n^⊤θ equals the
squared distance between y and Xθ. Remember from Section 3.1 that ‖x‖² = x^⊤x if we choose the dot product as the inner product.
With (9.9b) we now have a concrete form of the negative log-likelihood function we need to optimize. We immediately see that (9.9b) is quadratic in θ. This means that we can find a unique global solution θ_ML for minimizing the negative log-likelihood L. We can find the global optimum by computing the gradient of L, setting it to 0, and solving for θ.
Using the results from Chapter 5, we compute the gradient of L with respect to the parameters as

dL/dθ = d/dθ [ (1/(2σ²)) (y − Xθ)^⊤(y − Xθ) ]    (9.10a)
      = (1/(2σ²)) d/dθ [ y^⊤y − 2y^⊤Xθ + θ^⊤X^⊤Xθ ]    (9.10b)
      = (1/σ²) (−y^⊤X + θ^⊤X^⊤X) ∈ R^{1×D} .    (9.10c)

As a necessary optimality condition we set this gradient to 0 and obtain

dL/dθ = 0  ⟺  θ^⊤X^⊤X = y^⊤X    (9.11a)
           ⟺  θ^⊤ = y^⊤X(X^⊤X)^{−1}    (9.11b)
           ⟺  θ_ML = (X^⊤X)^{−1}X^⊤y .    (9.11c)
We could right-multiply the first equation by (X^⊤X)^{−1} because X^⊤X is positive definite if X has full column rank, i.e., rank(X) = D.
Remark. In this case, setting the gradient to 0 is a necessary and sufficient condition, and we obtain a global minimum, since the Hessian ∇²_θ L(θ) = (1/σ²) X^⊤X ∈ R^{D×D} is positive definite. ♦
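As a sanity check of (9.11c), the following sketch computes the closed-form maximum likelihood estimate on synthetic data (the dataset, noise level, and parameter values are illustrative assumptions). Solving the linear system is preferred over explicitly inverting X^⊤X.

```python
import numpy as np

rng = np.random.default_rng(42)
N, D = 100, 3
X = rng.normal(size=(N, D))               # design matrix, rows are inputs x_n^T
theta_true = np.array([1.0, -2.0, 0.5])   # hypothetical ground-truth parameters
sigma = 0.1
y = X @ theta_true + sigma * rng.normal(size=N)   # noisy targets, cf. (9.2)

# theta_ML = (X^T X)^{-1} X^T y, cf. (9.11c); np.linalg.solve avoids an explicit inverse.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ml)   # close to theta_true for small noise
```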
Figure 9.4 Polynomial regression. (a) Dataset consisting of (x_n, y_n) pairs, n = 1, …, 10; (b) maximum likelihood polynomial of degree 4.
Consider the dataset in Figure 9.5(a), which consists of N = 10 pairs (x_n, y_n).
Figure 9.5 Maximum likelihood fits for different polynomial degrees M (training data and MLE fit in each panel).
If we do not consider the noise variance σ² a free model parameter, we can ignore the scaling by 1/σ², so that we end up with the squared-error loss function ‖y − Φθ‖². Instead of using this squared loss, we often use the root mean squared error (RMSE)

√( ‖y − Φθ‖² / N ) = √( (1/N) ∑_{n=1}^{N} (y_n − φ^⊤(x_n)θ)² ) ,    (9.21)

which (a) allows us to compare errors of datasets with different sizes (the RMSE is normalized by the number of data points) and (b) has the same scale and the same units as the observed function values y_n. For example, assume we fit a model that maps post codes (x is given as latitude, longitude) to house prices (y-values are EUR). Then the RMSE is also measured in EUR, whereas the squared error is given in EUR². If we choose to include the factor σ² from the original negative log-likelihood (9.9b), then we end up with a "unit-free" objective.
For model selection (see Section 8.5) we can use the RMSE (or the negative log-likelihood) to determine the best degree of the polynomial by finding the polynomial degree M that minimizes the objective. Given that the polynomial degree is a natural number, we can perform a brute-force search and enumerate all (reasonable) values of M. For a training set of size N it is sufficient to test 0 ≤ M ≤ N − 1. For M ≥ N we would need to solve an underdetermined system of linear equations, so that we would end up with infinitely many solutions.
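The brute-force search over polynomial degrees can be sketched as follows (synthetic data; the generating function, noise level, and helper names are assumptions, not taken from the text). In the same loop one would also evaluate the RMSE on a held-out test set to perform the model selection discussed below.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
x_train = rng.uniform(-4, 4, size=N)
y_train = np.sin(x_train) + 0.2 * rng.normal(size=N)   # hypothetical noisy targets

def poly_features(x, M):
    # Feature matrix Phi with columns 1, x, x^2, ..., x^M
    return np.vander(x, M + 1, increasing=True)

def rmse(y, y_pred):
    # Root mean squared error, cf. (9.21)
    return np.sqrt(np.mean((y - y_pred) ** 2))

for M in range(N):                      # test 0 <= M <= N - 1
    Phi = poly_features(x_train, M)
    theta_ml, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)   # maximum likelihood fit
    train_rmse = rmse(y_train, Phi @ theta_ml)
    print(f"degree M={M}: training RMSE = {train_rmse:.3f}")
```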
Figure 9.5 shows a number of polynomial fits determined by maximum likelihood for the dataset from Figure 9.5(a) with N = 10 observations. We notice that polynomials of low degree (e.g., constants (M = 0) or linear polynomials (M = 1)) fit the data poorly and, hence, are poor representations of the true underlying function. For degrees M = 3, …, 5 the fits look plausible and smoothly interpolate the data. When we go to higher-degree polynomials, we notice that they fit the data better and better.
Figure 9.6 Training error (blue) and test error (orange), measured as RMSE, as a function of the degree M of the polynomial.
In the extreme case of M = N − 1 = 9, the function passes through every single data point. However, these high-degree polynomials oscillate wildly and are a poor representation of the underlying function that generated the data, so that we suffer from overfitting.
Remember that the goal is to achieve good generalization by making accurate predictions for new (unseen) data. We obtain some quantitative insight into the dependence of the generalization performance on the polynomial degree M by considering a separate test set comprising 200 data points generated using exactly the same procedure used to generate the training set. As test inputs, we chose a linear grid of 200 points in the interval [−5, 5]. For each choice of M, we evaluate the RMSE (9.21) for both the training data and the test data.
Looking now at the test error, which is a qualitative measure of the generalization properties of the corresponding polynomial, we notice that initially the test error decreases; see Figure 9.6 (orange). For fourth-order polynomials the test error is relatively low and stays relatively constant up to degree 5. However, from degree 6 onward the test error increases significantly, and high-order polynomials have very bad generalization properties. In this particular example, this is also evident from the corresponding maximum likelihood fits in Figure 9.5. Note that the training error (blue curve in Figure 9.6) never increases when the degree of the polynomial increases. In our example, the best generalization (the point of the smallest test error) is obtained for a polynomial of degree M = 4.
where the second term is the regularizer, and λ > 0 controls the "strictness" of the regularization.
Remark. Instead of the Euclidean norm ‖·‖₂, we can choose any p-norm ‖·‖_p. In practice, smaller values for p lead to sparser solutions. Here, "sparse" means that many parameter values θ_n = 0, which is also useful for variable selection. For p = 1, the regularizer is called LASSO (least absolute shrinkage and selection operator) and was proposed by Tibshirani (1996). ♦
From a probabilistic perspective, adding a regularizer is identical to placing a prior distribution p(θ) on the parameters and then selecting the parameters that maximize the posterior distribution p(θ | X, y), i.e., we choose the parameters θ that are "most probable" given the training data. The posterior over the parameters θ, given the training data X, y, is obtained by applying Bayes' theorem as

p(θ | X, y) = p(y | X, θ) p(θ) / p(y | X) .    (9.23)
The parameter vector θ_MAP that maximizes the posterior (9.23) is called the maximum a posteriori (MAP) estimate. To find the MAP estimate, we follow steps that are similar in flavor to maximum likelihood estimation. We start with the log-transform and compute the log-posterior as

log p(θ | X, y) = log p(y | X, θ) + log p(θ) + const ,    (9.24)

where the constant comprises the terms that are independent of θ. We see that the log-posterior in (9.24) is the sum of the log-likelihood log p(y | X, θ) and the log-prior log p(θ).
Remark (Relation to Regularization). Choosing a Gaussian parameter prior p(θ) = N(0, b²I) with b² = 1/(2λ), the (negative) log-prior term is

−log p(θ) = λθ^⊤θ + const = λ‖θ‖₂² + const ,    (9.25)

and we recover exactly the regularization term in (9.22). This means that for a quadratic regularization, the regularization parameter λ in (9.22) corresponds to half the precision (inverse variance) of the Gaussian (isotropic) prior p(θ). Therefore, the log-prior in (9.24) reflects the impact of the regularizer, which penalizes implausible values, i.e., values that are unlikely under the prior. ♦
To find the MAP estimate θ_MAP, we minimize the negative log-posterior. The gradient of the negative log-posterior with respect to θ is
−(d log p(θ | X, y))/dθ = (1/σ²)(θ^⊤Φ^⊤Φ − y^⊤Φ) + (1/b²)θ^⊤ .    (9.29)
We will find the MAP estimate θ_MAP by setting this gradient to 0:

(1/σ²)(θ^⊤Φ^⊤Φ − y^⊤Φ) + (1/b²)θ^⊤ = 0    (9.30a)
⟺ θ^⊤((1/σ²)Φ^⊤Φ + (1/b²)I) − (1/σ²)y^⊤Φ = 0    (9.30b)
⟺ θ^⊤(Φ^⊤Φ + (σ²/b²)I) = y^⊤Φ    (9.30c)
⟺ θ^⊤ = y^⊤Φ(Φ^⊤Φ + (σ²/b²)I)^{−1} ,    (9.30d)

so that we obtain the MAP estimate (by transposing both sides of the last equality)

θ_MAP = (Φ^⊤Φ + (σ²/b²)I)^{−1} Φ^⊤ y .    (9.31)

Note that Φ^⊤Φ is symmetric and positive semidefinite and the additional term (σ²/b²)I is strictly positive definite, such that all eigenvalues of the matrix to be inverted are positive.
Comparing the MAP estimate in (9.31) with the maximum likelihood estimate in (9.18), we see that the only difference between both solutions is the additional term (σ²/b²)I in the inverse matrix. This term ensures that Φ^⊤Φ + (σ²/b²)I is symmetric and strictly positive definite (i.e., its inverse exists) and plays the role of the regularizer.
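A sketch of the MAP estimate (9.31) in NumPy (the feature map and the values of σ² and b² are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 15
x = rng.uniform(-4, 4, size=N)
y = np.sin(x) + 0.2 * rng.normal(size=N)        # hypothetical noisy targets

Phi = np.vander(x, 5, increasing=True)          # polynomial features up to degree 4
sigma2 = 0.2 ** 2                               # assumed noise variance
b2 = 1.0                                        # assumed prior variance, p(theta) = N(0, b^2 I)

# theta_MAP = (Phi^T Phi + (sigma^2 / b^2) I)^{-1} Phi^T y, cf. (9.31)
K = Phi.shape[1]
theta_map = np.linalg.solve(Phi.T @ Phi + (sigma2 / b2) * np.eye(K), Phi.T @ y)
print(theta_map)
```

Letting b² grow large removes the additional term σ²/b² I and recovers the maximum likelihood estimate.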
Figure 9.7 Comparison of the maximum likelihood and MAP estimates (training data, MLE fit, and MAP fit).
9.3 Bayesian Linear Regression
In Bayesian linear regression, we consider the model

y = φ^⊤(x)θ + ε ,   ε ∼ N(0, σ²) ,   p(θ) = N(m_0, S_0) ,    (9.32)

where we now explicitly place a Gaussian prior p(θ) = N(m_0, S_0) on θ, which turns the parameter vector into a latent variable. The full probabilistic model, i.e., the joint distribution of the observed and latent variables y and θ, respectively, is

p(y, θ | x) = p(y | x, θ) p(θ) ,    (9.33)

which allows us to write down the corresponding graphical model in Figure 9.8, where we made the parameters of the Gaussian prior on θ explicit.

Figure 9.8 Graphical model for Bayesian linear regression.
In the absence of training data, we obtain the (prior) predictive distribution of y∗ at a test input x∗ by integrating out the parameters,

p(y∗ | x∗) = ∫ p(y∗ | x∗, θ) p(θ) dθ ,    (9.34)

which we can interpret as the average prediction of y∗ | x∗, θ over all plausible parameters θ according to the prior distribution p(θ). Note that predictions using the prior distribution only require us to specify the input locations x∗, but no training data.
In our model, we chose a conjugate (Gaussian) prior on θ, so that the predictive distribution is Gaussian as well (and can be computed in closed form): With the prior distribution p(θ) = N(m_0, S_0), we obtain the predictive distribution as

p(y∗ | x∗) = N(y∗ | φ^⊤(x∗)m_0, φ^⊤(x∗)S_0φ(x∗) + σ²) ,    (9.35)

where we used that (i) the prediction is Gaussian due to conjugacy and the marginalization property of Gaussians, (ii) the Gaussian noise is independent, so that V[y∗] = V[φ^⊤(x∗)θ] + V[ε], and (iii) y∗ is a linear transformation of θ, so that we can apply the rules for computing the mean and covariance of the prediction analytically by using (6.50) and (6.51), respectively. In (9.35), the term φ^⊤(x∗)S_0φ(x∗) in the predictive variance explicitly accounts for the uncertainty associated with the parameters θ, whereas σ² is the uncertainty contribution due to the measurement noise.
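A sketch of the prior predictive moments in (9.35) (the prior parameters m0 and S0, the feature map, and the noise variance are assumptions for illustration):

```python
import numpy as np

def phi(x):
    # Polynomial feature vector (1, x, x^2, x^3); an assumption for illustration.
    return np.array([1.0, x, x**2, x**3])

K = 4
m0 = np.zeros(K)           # prior mean
S0 = 0.5 * np.eye(K)       # prior covariance
sigma2 = 0.2 ** 2          # noise variance

def prior_predictive(x_star):
    # p(y* | x*) = N(phi^T m0, phi^T S0 phi + sigma^2), cf. (9.35)
    f = phi(x_star)
    mean = f @ m0
    var = f @ S0 @ f + sigma2
    return mean, var

print(prior_predictive(1.5))
```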
Figure 9.9 Prior over functions. (a) Prior distribution over functions, showing the mean function (black line) and the marginal uncertainties (shaded), representing the 95% confidence bounds; (b) samples from the prior over functions, which are induced by samples from the parameter prior.

So far, we looked at computing predictions using the parameter prior p(θ). However, when we have a parameter posterior (given some training data X, y), the same principles for prediction and inference hold as in (9.34) – we just need to replace the prior p(θ) with the posterior p(θ | X, y). In the following, we will derive the posterior distribution in detail before using it to make predictions.
The normalizing constant of the posterior is the marginal likelihood/evidence p(y | X), which is independent of the parameters θ and ensures that the posterior is normalized, i.e., that it integrates to 1. We can think of the marginal likelihood as the likelihood averaged over all possible parameter settings (with respect to the prior distribution p(θ)).
Therefore, we can write θ = By, and by using the rules for linear transformations of the mean and covariance from (6.50)–(6.51) we obtain

N(θ | By, σ²BB^⊤) = N(θ | (Φ^⊤Φ)^{−1}Φ^⊤y, σ²(Φ^⊤Φ)^{−1})    (9.43)

after some re-arranging of the terms for the covariance matrix.
If we now look at (9.43) and define its mean as µ and covariance matrix as Σ in (9.41c) and (9.41b), respectively, we obtain the covariance S_N and the mean m_N of the parameter posterior N(θ | m_N, S_N) as

S_N = (S_0^{−1} + σ^{−2}Φ^⊤Φ)^{−1} ,    (9.44a)
m_N = S_N (S_0^{−1}m_0 + σ^{−2}Φ^⊤Φ (Φ^⊤Φ)^{−1}Φ^⊤y)    (9.44b)
    = S_N (S_0^{−1}m_0 + σ^{−2}Φ^⊤y) ,    (9.44c)

respectively, where in (9.44b) we identify Σ^{−1} = σ^{−2}Φ^⊤Φ and µ = (Φ^⊤Φ)^{−1}Φ^⊤y. Note that the posterior mean m_N equals the MAP estimate θ_MAP from (9.31). This also makes sense since the posterior distribution is unimodal (Gaussian) with its maximum at the mean.
Remark. The posterior precision (inverse covariance)

S_N^{−1} = S_0^{−1} + (1/σ²) Φ^⊤Φ    (9.45)

of the parameters θ (see (9.44a)) contains two terms: S_0^{−1} is the prior precision and (1/σ²)Φ^⊤Φ is a data-dependent (precision) term. Both terms (matrices) are symmetric and positive definite. The data-dependent term (1/σ²)Φ^⊤Φ grows as more data is taken into account, i.e., it accumulates contributions from the data. This means (at least) two things:

• The posterior precision grows as more and more data is taken into account; therefore, the covariance, and with it the uncertainty about the parameters, shrinks.
• The relative influence of the parameter prior vanishes for large N.

Therefore, for N → ∞ the prior plays no role, and the parameter posterior tends to a point estimate, the MAP estimate. ♦
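A sketch of the posterior update (9.44a) and (9.44c) in NumPy (the data, feature map, prior, and noise variance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 20, 4
x = rng.uniform(-4, 4, size=N)
Phi = np.vander(x, K, increasing=True)          # feature matrix, rows are phi(x_n)^T
y = np.sin(x) + 0.2 * rng.normal(size=N)        # hypothetical noisy targets

sigma2 = 0.2 ** 2                               # assumed noise variance
m0 = np.zeros(K)                                # prior mean
S0 = np.eye(K)                                  # prior covariance

# S_N = (S_0^{-1} + sigma^{-2} Phi^T Phi)^{-1}, cf. (9.44a)
SN_inv = np.linalg.inv(S0) + Phi.T @ Phi / sigma2
SN = np.linalg.inv(SN_inv)

# m_N = S_N (S_0^{-1} m_0 + sigma^{-2} Phi^T y), cf. (9.44c)
mN = SN @ (np.linalg.solve(S0, m0) + Phi.T @ y / sigma2)
print(mN)
```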
where the constant contains the black terms in (9.47a), which are independent of θ. The orange terms are terms that are linear in θ, and the blue terms are the ones that are quadratic in θ. By inspecting (9.47b), we find that this equation is quadratic in θ. The fact that the unnormalized log-posterior distribution is a (negative) quadratic form implies that the posterior is Gaussian. Identifying the quadratic and linear terms yields

S_N^{−1} = Φ^⊤σ^{−2}IΦ + S_0^{−1}  ⟺  S_N = (σ^{−2}Φ^⊤Φ + S_0^{−1})^{−1} ,    (9.50)
m_N^⊤S_N^{−1} = (σ^{−2}Φ^⊤y + S_0^{−1}m_0)^⊤  ⟺  m_N = S_N(σ^{−2}Φ^⊤y + S_0^{−1}m_0) .    (9.51)
Analogously to (9.35), but with the prior replaced by the parameter posterior, the posterior predictive distribution at a test input x∗ is p(y∗ | X, y, x∗) = N(y∗ | φ^⊤(x∗)m_N, φ^⊤(x∗)S_Nφ(x∗) + σ²). The term φ^⊤(x∗)S_Nφ(x∗) reflects the posterior uncertainty associated with the parameters θ. Note that S_N depends on the training inputs X; see (9.44a). The predictive mean coincides with the prediction made using the MAP estimate.
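Continuing that sketch, the posterior predictive mean and variance at a test input x∗ can be computed as follows (the polynomial feature map is again an assumption; mN, SN, and sigma2 are taken from a posterior computation such as the sketch above):

```python
import numpy as np

def posterior_predictive(x_star, mN, SN, sigma2):
    # p(y* | X, y, x*) = N(phi^T mN, phi^T SN phi + sigma^2),
    # analogous to (9.35) with the prior replaced by the posterior.
    f = np.array([x_star ** k for k in range(len(mN))])   # polynomial features
    mean = f @ mN
    var = f @ SN @ f + sigma2
    return mean, var

# Example usage with mN, SN, sigma2 from the posterior computation sketch above:
# print(posterior_predictive(1.5, mN, SN, sigma2))
```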
Remark (Mean and Variance of Noise-Free Function Values). In many cases, we are not interested in the predictive distribution p(y∗ | X, y, x∗) of a (noisy) observation. Instead, we would like to obtain the distribution of the (noise-free) function values f(x∗) = φ^⊤(x∗)θ.
Figure 9.10 Bayesian linear regression and posterior over functions.
Figure 9.9 visualizes the prior over functions induced by the parameter prior, together with sample functions from this prior. Figure 9.10 shows the posterior over functions that we obtain via Bayesian linear regression: the training dataset is shown in Figure 9.10(a); Figure 9.10(b) shows the posterior distribution over functions, including the functions we would obtain via maximum likelihood and MAP estimation. The function we obtain using the MAP estimate also corresponds to the posterior mean function in the Bayesian linear regression setting. Figure 9.10(c) shows some plausible realizations (samples) of functions under that posterior over functions.
Figure 9.11 shows some examples of the posterior distribution over functions induced by the parameter posterior. For different polynomial degrees M, the left panels show the maximum likelihood estimate, the MAP estimate (which is identical to the posterior mean function), and the 95% predictive confidence bounds, represented by the shaded area. The right panels show samples from the posterior over functions: Here, we sampled parameters θ_i from the parameter posterior and computed the function φ^⊤(x∗)θ_i, which is a single realization of a function under the posterior distribution over functions. For low-order polynomials, the parameter posterior does not allow the parameters to vary much: The sampled functions are nearly identical. When we make the model more flexible by adding more parameters (i.e., we end up with a higher-order polynomial), these parameters are not sufficiently constrained by the posterior, and the sampled functions can be easily visually separated. We also see in the corresponding panels on the left how the uncertainty increases, especially at the boundaries. Although for a seventh-order polynomial the MAP estimate yields a reasonable fit, the Bayesian linear regression model additionally tells us that the posterior uncertainty is huge. This information can be critical when we use these predictions in a decision-making system, where bad decisions can have significant consequences (e.g., in reinforcement learning or robotics).
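Sampling functions from the posterior, as shown in the right panels of Figure 9.11, can be sketched as follows (illustrative only; mN and SN come from a posterior computation such as the sketch above):

```python
import numpy as np

def sample_posterior_functions(x_grid, mN, SN, num_samples=5, rng=None):
    # Draw theta_i ~ N(mN, SN) and evaluate phi^T(x) theta_i on a grid of inputs.
    rng = np.random.default_rng() if rng is None else rng
    K = len(mN)
    Phi_grid = np.vander(x_grid, K, increasing=True)   # polynomial features
    thetas = rng.multivariate_normal(mN, SN, size=num_samples)
    return Phi_grid @ thetas.T                          # each column is one sampled function

# Example usage (with mN, SN from the posterior sketch above):
# x_grid = np.linspace(-5, 5, 200)
# samples = sample_posterior_functions(x_grid, mN, SN)
```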
Figure 9.11 Bayesian linear regression. Left panels: shaded areas indicate the 95% predictive confidence bounds; the mean of the Bayesian linear regression model coincides with the MAP estimate, and the predictive uncertainty is the sum of the noise term and the posterior parameter uncertainty, which depends on the location of the test input. Right panels: samples from the posterior over functions. (a) Posterior distribution for polynomials of degree M = 3 (left) and samples from the posterior over functions (right); (b) the same for polynomials of degree M = 5; (c) the same for polynomials of degree M = 7.
where we integrate out the model parameters θ. We compute the marginal likelihood in two steps: first, we show that the marginal likelihood is
Gaussian (as a distribution in y); second, we compute the mean and covariance of this Gaussian.
1. The marginal likelihood is Gaussian: From Section 6.6.2 we know that (i) the product of two Gaussian random variables is an (unnormalized) Gaussian distribution, and (ii) a linear transformation of a Gaussian random variable is Gaussian distributed. In (9.62b), we require a linear transformation to bring N(y | Xθ, σ²I) into the form N(θ | µ, Σ) for some µ, Σ. Once this is done, the integral can be solved in closed form. The result is the normalizing constant of the product of the two Gaussians. The normalizing constant itself has Gaussian shape; see (6.114).
2. Mean and covariance: We compute the mean and covariance matrix of the marginal likelihood by exploiting the standard results for means and covariances of affine transformations of random variables; see Section 6.4.4. The mean of the marginal likelihood is computed as
E_{θ,ε}[y] = E_{θ,ε}[Xθ + ε] = X E_θ[θ] = X m_0 .    (9.65)

The covariance follows in the same way as Cov[y] = X S_0 X^⊤ + σ²I, so that the marginal likelihood is p(y | X) = N(y | X m_0, X S_0 X^⊤ + σ²I).
The marginal likelihood can now be used for Bayesian model selection as discussed in Section 8.5.2.
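Since both the prior and the likelihood are Gaussian, the marginal likelihood worked out above is the Gaussian N(y | Xm0, XS0X^⊤ + σ²I). The following sketch evaluates its log value, which is the quantity compared in Bayesian model selection (data, prior, and noise level are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(11)
N, D = 20, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=N)   # hypothetical data

m0 = np.zeros(D)          # prior mean
S0 = np.eye(D)            # prior covariance
sigma2 = 0.1 ** 2         # noise variance

# Marginal likelihood p(y | X) = N(y | X m0, X S0 X^T + sigma^2 I),
# obtained from the affine-transformation rules for Gaussians (Section 6.4.4).
mean = X @ m0
cov = X @ S0 @ X.T + sigma2 * np.eye(N)
log_evidence = multivariate_normal.logpdf(y, mean=mean, cov=cov)
print(log_evidence)
```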
9.4 Maximum Likelihood as Orthogonal Projection

Let us consider the simple linear regression case

y = xθ + ε ,   ε ∼ N(0, σ²) ,    (9.66)

with x, θ ∈ R, i.e., straight lines through the origin.
Figure 9.12 Geometric interpretation of least squares. (a) Regression dataset consisting of noisy observations y_n (blue) of function values f(x_n) at input locations x_n; (b) the orange dots are the projections of the noisy observations (blue dots) onto the line θ_ML x. The maximum likelihood solution to a linear regression problem finds a subspace (line) onto which the overall projection error (orange lines) of the observations is minimized.
Collecting the training inputs and targets in vectors X, y ∈ R^N, we recall the results from Section 9.2.1 and obtain the maximum likelihood estimator for the slope parameter as

θ_ML = (X^⊤X)^{−1}X^⊤y = (X^⊤y)/(X^⊤X) ∈ R .    (9.67)

This means for the training inputs X we obtain the optimal (maximum likelihood) reconstruction of the training data, i.e., the approximation with the minimum least-squares error

Xθ_ML = X (X^⊤y)/(X^⊤X) = (XX^⊤)/(X^⊤X) y .    (9.68)
As we are basically looking for a solution of y = Xθ, we can think of linear regression as a problem of solving systems of linear equations. Therefore, we can relate to concepts from linear algebra and analytic geometry that we discussed in Chapters 2 and 3. In particular, looking carefully at (9.68), we see that the maximum likelihood estimator θ_ML in our example from (9.66) effectively performs an orthogonal projection of y onto the one-dimensional subspace spanned by X. Recalling the results on orthogonal projections from Section 3.7, we identify XX^⊤/(X^⊤X) as the projection matrix, θ_ML as the coordinates of the projection onto the one-dimensional subspace of R^N spanned by X, and Xθ_ML as the orthogonal projection of y onto this subspace.
Therefore, the maximum likelihood solution also provides a geometrically optimal solution by finding the vectors in the subspace spanned by X that are "closest" to the corresponding observations y, where "closest" means the smallest (squared) distance of the function values y_n to x_nθ. This is achieved by orthogonal projections. Figure 9.12(b) shows the orthogonal projection of the noisy observations onto the subspace that
minimizes the squared distance between the original dataset and its projection, which corresponds to the maximum likelihood solution.
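A small numerical check of this projection view (illustrative data only): the maximum likelihood fit Xθ_ML coincides with the orthogonal projection of y onto the subspace spanned by X.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 30
X = rng.uniform(-10, 10, size=N)              # scalar inputs, stacked in a vector
y = 0.7 * X + rng.normal(scale=2.0, size=N)   # hypothetical noisy targets

theta_ml = (X @ y) / (X @ X)                  # cf. (9.67)

# Projection of y onto the one-dimensional subspace spanned by X, cf. (9.68)
projection = (np.outer(X, X) / (X @ X)) @ y

# Both ways of computing the fitted values agree (up to numerical precision).
print(np.allclose(X * theta_ml, projection))   # True
```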
In the general linear regression case where

y = φ^⊤(x)θ + ε ,   ε ∼ N(0, σ²) ,    (9.69)

with vector-valued features φ(x) ∈ R^K, we can again interpret the maximum likelihood result

y ≈ Φθ_ML ,    (9.70)
Φθ_ML = Φ(Φ^⊤Φ)^{−1}Φ^⊤y ,    (9.71)

as a projection onto a K-dimensional subspace of R^N, which is spanned by the columns of the feature matrix Φ; see Section 3.7.2.

If the feature functions φ_k that we use to construct the feature matrix Φ are orthonormal (see Section 3.6), we obtain a special case where the columns of Φ form an orthonormal basis (see Section 3.5), such that Φ^⊤Φ = I. This then leads to the projection

Φ(Φ^⊤Φ)^{−1}Φ^⊤y = ΦΦ^⊤y = (∑_{k=1}^{K} φ_k φ_k^⊤) y ,    (9.72)
so that the coupling between different features has disappeared and the maximum likelihood projection is simply the sum of projections of y onto the individual basis vectors φ_k, i.e., the columns of Φ. Many popular basis functions in signal processing, such as wavelets and Fourier bases, are orthogonal basis functions. When the basis is not orthogonal, one can convert a set of linearly independent basis functions to an orthogonal basis by using the Gram-Schmidt process (Strang, 2003).
9.5 Further Reading
More information on Gaussian processes can be found in (MacKay, 1998; Rasmussen and Williams, 2006).
We focused on Gaussian parameter priors in the discussions in this chapter because they allow for closed-form inference in linear regression models. However, even in a regression setting with Gaussian likelihoods we may choose a non-Gaussian prior. Consider a setting where the inputs are x ∈ R^D and our training set is small and of size N ≪ D. This means that the regression problem is underdetermined. In this case, we can choose a parameter prior that enforces sparsity, i.e., a prior that tries to set as many parameters to 0 as possible (variable selection). This prior provides a stronger regularizer than the Gaussian prior, which often leads to increased prediction accuracy and interpretability of the model. The Laplace prior is one example that is frequently used for this purpose. A linear regression model with the Laplace prior on the parameters is equivalent to linear regression with L1 regularization (LASSO) (Tibshirani, 1996). The Laplace distribution is sharply peaked at zero (its first derivative is discontinuous), and it concentrates its probability mass closer to zero than the Gaussian distribution, which encourages parameters to be 0. Therefore, the non-zero parameters are relevant for the regression problem, which is the reason why we also speak of "variable selection".