Linear Regression

In the following, we will apply the mathematical concepts from Chapters 2, 5, 6 and 7 to solving linear regression (curve fitting) problems. In regression, we want to find a function f that maps inputs x ∈ R^D to corresponding function values f(x) ∈ R, given a set of training inputs x_n and corresponding observations y_n = f(x_n) + ε, where ε is a random variable that comprises measurement noise and unmodeled processes. An illustration of such a regression problem is given in Figure 9.1. A typical regression problem is given in Figure 9.1(a): For some input values x we observe (noisy) function values y = f(x) + ε. The task is to infer the function f that generated the data. A possible solution is given in Figure 9.1(b), where we also show three distributions centered at the function values f(x) that represent the noise in the data.

Regression is a fundamental problem in machine learning, and regression problems appear in a diverse range of research areas and applications, including time-series analysis (e.g., system identification), control and robotics (e.g., reinforcement learning, forward/inverse model learning), optimization (e.g., line searches, global optimization), and deep-learning applications (e.g., computer games, speech-to-text translation, image recognition, automatic video annotation). Regression is also a key ingredient of classification algorithms.

[Figure 9.1 (a) Regression problem: observed noisy function values from which we wish to infer the underlying function that generated the data. (b) Regression solution: a possible function that could have generated the data (blue), with an indication of the measurement noise of the function values at the corresponding inputs (orange distributions).]


Finding a regression function requires solving a variety of problems, including:

• Choice of the model (type) and the parametrization of the regression function. Given a data set, what function classes (e.g., polynomials) are good candidates for modeling the data, and what particular parametrization (e.g., degree of the polynomial) should we choose? Model selection, as discussed in Section 8.5, allows us to compare various models to find the simplest model that explains the training data reasonably well.
• Finding good parameters. Having chosen a model of the regression function, how do we find good model parameters? Here, we will need to look at different loss/objective functions (they determine what a “good” fit is) and optimization algorithms that allow us to minimize this loss.
• Overfitting and model selection. Overfitting is a problem when the regression function fits the training data “too well” but does not generalize to unseen test data. Overfitting typically occurs if the underlying model (or its parametrization) is overly flexible and expressive, see Section 8.5. We will look at the underlying reasons and discuss ways to mitigate the effect of overfitting in the context of linear regression.
• Relationship between loss functions and parameter priors. Loss functions (optimization objectives) are often motivated and induced by probabilistic models. We will look at the connection between loss functions and the underlying prior assumptions that induce these losses.
• Uncertainty modeling. In any practical setting, we have access to only a finite, potentially large, amount of (training) data for selecting the model class and the corresponding parameters. Given that this finite amount of training data does not cover all possible scenarios, we may want to describe the remaining parameter uncertainty to obtain a measure of confidence in the model’s prediction at test time; the smaller the training set, the more important uncertainty modeling becomes. Consistent modeling of uncertainty equips model predictions with confidence bounds.

In the following, we will be using the mathematical tools from Chapters 3, 5, 6 and 7 to solve linear regression problems. We will discuss maximum likelihood and maximum a posteriori (MAP) estimation to find optimal model parameters. Using these parameter estimates, we will have a brief look at generalization errors and overfitting. Toward the end of this chapter, we will discuss Bayesian linear regression, which allows us to reason about model parameters at a higher level, thereby removing some of the problems encountered in maximum likelihood and MAP estimation.

[Figure 9.2 Linear regression without features. (a) Example functions (straight lines) that can be described using the linear model in (9.2); (b) training set; (c) maximum likelihood estimate.]

9.1 Problem Formulation

We consider the regression problem

y = f(x) + ε ,    (9.1)

where x ∈ R^D are inputs and y ∈ R are noisy function values (targets). Furthermore, ε ∼ N(0, σ²) is independent, identically distributed (i.i.d.) measurement noise. In this particular case, ε is Gaussian distributed with mean 0 and variance σ². Our objective is to find a function that is close (similar) to the unknown function that generated the data.

In this chapter, we focus on parametric models, i.e., we choose a parametrized function f and find parameters that “work well” for modeling the data. In linear regression, we consider the special case that the parameters appear linearly in our model. An example of linear regression is

y = f(x) + ε = x^⊤ θ + ε ,    (9.2)

where θ ∈ R^D are the parameters we seek, and ε ∼ N(0, σ²) is i.i.d. Gaussian measurement/observation noise. The class of functions described by (9.2) are straight lines that pass through the origin. In (9.2), we chose a parametrization f(x) = x^⊤ θ. For the time being we assume that the noise variance σ² is known. The noise model induces the likelihood

p(y | x, θ) = N(y | x^⊤ θ, σ²) ,    (9.3)

which is the probability of observing a target value y given that we know the input location x and the parameters θ. Note that the only source of uncertainty originates from the observation noise (as x and θ are assumed known in (9.3)); without any observation noise, the relationship between x and y would be deterministic and (9.3) would be a delta distribution.

For x, θ ∈ R the linear regression model in (9.2) describes straight lines (linear functions), and the parameter θ would be the slope of the line. Figure 9.2(a) shows some examples. This model is not only linear in the parameters, but also linear in the inputs x. We will see later that y = φ(x)θ for nonlinear transformations φ is also a linear regression model because “linear regression” refers to models that are “linear in the parameters”, i.e., models that describe a function by a linear combination of input features.


In the following, we will discuss in more detail how to find good parameters θ and how to evaluate whether a parameter set “works well”.

9.2 Parameter Estimation

Consider the linear regression setting (9.2) and assume we are given a training set D consisting of N inputs x_n ∈ R^D and corresponding observations/targets y_n ∈ R, n = 1, . . . , N. The corresponding graphical model is given in Figure 9.3. Note that y_i and y_j are conditionally independent given their respective inputs x_i, x_j, such that the likelihood function factorizes according to

p(y_1, . . . , y_N | x_1, . . . , x_N) = ∏_{n=1}^{N} p(y_n | x_n) = ∏_{n=1}^{N} N(y_n | x_n^⊤ θ, σ²) .    (9.4)

[Figure 9.3 Probabilistic graphical model for linear regression. Observed random variables are shaded; deterministic/known values are shown without circles. The parameters θ are treated as unknown/latent quantities.]

The likelihood and the factors p(y_n | x_n) are Gaussian due to the noise distribution.

In the following, we are interested in finding optimal parameters θ* ∈ R^D for the linear regression model (9.2). Once the parameters θ* are found, we can predict function values by using this parameter estimate in (9.2), so that at an arbitrary test input x_* we predict the probability of an output y_* as

p(y_* | x_*, θ*) = N(y_* | x_*^⊤ θ*, σ²) .    (9.5)

In the following, we will have a look at parameter estimation by maximizing the likelihood, a topic that we already covered to some degree in Section 8.2.

9.2.1 Maximum Likelihood Estimation

A widely used approach to finding the desired parameters θ_ML is maximum likelihood estimation, where we find parameters θ_ML that maximize the likelihood (9.4). Maximizing the likelihood means maximizing the probability of the (training) data given the parameters. We obtain the maximum likelihood parameters as

θ_ML = arg max_θ p(y | X, θ) ,    (9.6)

where we define the design matrix X := [x_1, . . . , x_N]^⊤ ∈ R^{N×D} and y := [y_1, . . . , y_N]^⊤ ∈ R^N as the collections of training inputs and targets, respectively. Note that the nth row in the design matrix X corresponds to the data point x_n.

Remark. Note that the likelihood is not a probability distribution in θ: it is simply a function of the parameters θ but does not integrate to 1 (i.e., it is unnormalized), and may not even be integrable with respect to θ. However, the likelihood in (9.6) is a normalized probability distribution in the data y. ♦


To find the desired parameters θ_ML that maximize the likelihood, we typically perform gradient ascent (or gradient descent on the negative likelihood). In the case of linear regression we consider here, however, a closed-form solution exists, which makes iterative gradient descent unnecessary. In practice, instead of maximizing the likelihood directly, we apply the log-transformation to the likelihood function and minimize the negative log-likelihood. (Since the logarithm is a (strictly) monotonically increasing function, the optimum of a function f is identical to the optimum of log f.)

Remark (Log Transformation). Since the likelihood function is a product of N Gaussian distributions, the log-transformation is useful because (a) it does not suffer from numerical underflow and (b) the differentiation rules turn out to be simpler. Numerical underflow will be a problem when we multiply N probabilities, where N is the number of data points, since we cannot represent very small numbers, such as 10^−256. Furthermore, the log-transform turns the product into a sum of log-probabilities such that the corresponding gradient is a sum of individual gradients, instead of a repeated application of the product rule (5.54) to compute the gradient of a product of N terms. ♦
To find the optimal parameters θ_ML of our linear regression problem, we minimize the negative log-likelihood

− log p(y | X, θ) = − log ∏_{n=1}^{N} p(y_n | x_n, θ) = − ∑_{n=1}^{N} log p(y_n | x_n, θ) ,    (9.7)

where we exploited that the likelihood (9.4) factorizes over the number of data points due to our independence assumption on the training set.

In the linear regression model (9.2) the likelihood is Gaussian (due to the Gaussian additive noise term), such that we arrive at

log p(y_n | x_n, θ) = − (1/(2σ²)) (y_n − x_n^⊤ θ)² + const ,    (9.8)

where the constant includes all terms independent of θ. Using (9.8) in the negative log-likelihood (9.7) we obtain (ignoring the constant terms)

L(θ) := − log p(y | X, θ) = (1/(2σ²)) ∑_{n=1}^{N} (y_n − x_n^⊤ θ)²    (9.9a)
      = (1/(2σ²)) (y − Xθ)^⊤ (y − Xθ) = (1/(2σ²)) ‖y − Xθ‖² ,    (9.9b)

where X = [x_1, . . . , x_N]^⊤ ∈ R^{N×D}. The negative log-likelihood function is also called the error function.

Remark. There is some notation overloading: We often summarize the set of training inputs in X, whereas in the design matrix we additionally assume a specific “shape”. ♦

In (9.9b) we used the fact that the sum of squared errors between the observations y_n and the corresponding model predictions x_n^⊤ θ equals the squared distance between y and Xθ. Remember from Section 3.1 that ‖x‖² = x^⊤ x if we choose the dot product as the inner product.
With (9.9b) we now have a concrete form of the negative log-likelihood function we need to optimize. We immediately see that (9.9b) is quadratic in θ. This means that we can find a unique global solution θ_ML for minimizing the negative log-likelihood L. We can find the global optimum by computing the gradient of L, setting it to 0 and solving for θ.

Using the results from Chapter 5, we compute the gradient of L with respect to the parameters as

dL/dθ = d/dθ [ (1/(2σ²)) (y − Xθ)^⊤ (y − Xθ) ]    (9.10a)
      = (1/(2σ²)) d/dθ [ y^⊤ y − 2 y^⊤ Xθ + θ^⊤ X^⊤ X θ ]    (9.10b)
      = (1/σ²) (−y^⊤ X + θ^⊤ X^⊤ X) ∈ R^{1×D} .    (9.10c)

As a necessary optimality condition we set this gradient to 0 and obtain

dL/dθ = 0  ⟺  θ^⊤ X^⊤ X = y^⊤ X    (9.11a)
           ⟺  θ^⊤ = y^⊤ X (X^⊤ X)^{−1}    (9.11b)
           ⟺  θ_ML = (X^⊤ X)^{−1} X^⊤ y .    (9.11c)

We could right-multiply the first equation by (X^⊤ X)^{−1} because X^⊤ X is positive definite (which is the case if X has full column rank, i.e., rk(X) = D).

Remark. In this case, setting the gradient to 0 is a necessary and sufficient condition and we obtain a global minimum since the Hessian ∇²_θ L(θ) = X^⊤ X ∈ R^{D×D} is positive definite. ♦

Example 9.1 (Fitting Lines)
Let us have a look at Figure 9.2, where we aim to fit a straight line f(x) = θx, where θ is an unknown slope, to a data set using maximum likelihood estimation. Examples of functions in this model class (straight lines) are shown in Figure 9.2(a). For the data set shown in Figure 9.2(b) we find the maximum likelihood estimate of the slope parameter θ using (9.11c) and obtain the maximum likelihood linear function shown in Figure 9.2(c).
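The closed-form solution (9.11c) is straightforward to evaluate numerically. The following is a minimal NumPy sketch of the line-fitting example; the data-generating slope, noise level, and number of points are illustrative assumptions, not values from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical training data for the model y = x * theta + eps (a line through the origin).
    theta_true, sigma = 1.2, 1.0                              # assumed "ground truth" and noise level
    x = rng.uniform(-10.0, 10.0, size=20)                     # training inputs x_n
    y = theta_true * x + sigma * rng.standard_normal(x.size)  # noisy targets y_n

    # Maximum likelihood estimate (9.11c): theta_ML = (X^T X)^{-1} X^T y.
    X = x.reshape(-1, 1)                                      # design matrix, N x D with D = 1
    theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
    print(theta_ml)                                           # approaches theta_true as N grows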

Maximum Likelihood Estimation with Features

So far, we considered the linear regression setting described in (9.2), which allowed us to fit straight lines to data using maximum likelihood estimation. However, straight lines are not particularly expressive when it comes to fitting more interesting data. Fortunately, linear regression offers us a way to fit nonlinear functions within the linear regression framework: Since “linear regression” only refers to “linear in the parameters”, we can perform an arbitrary nonlinear transformation φ(x) of the inputs x and then linearly combine the components of the result. The model parameters θ still appear only linearly. The corresponding linear regression model is

y = φ^⊤(x) θ + ε = ∑_{k=0}^{K−1} θ_k φ_k(x) + ε ,    (9.12)

where φ : R^D → R^K is a (nonlinear) transformation of the inputs x and φ_k : R^D → R is the kth component of the feature vector φ.

Example 9.2 (Polynomial Regression)
We are concerned with a regression problem y = φ^⊤(x)θ + ε, where x ∈ R and θ ∈ R^K. A transformation that is often used in this context is

φ(x) = [ φ_0(x), φ_1(x), . . . , φ_{K−1}(x) ]^⊤ = [ 1, x, x², x³, . . . , x^{K−1} ]^⊤ ∈ R^K .    (9.13)

This means we “lift” the original one-dimensional input space into a K-dimensional feature space consisting of all monomials x^k for k = 0, . . . , K − 1. With these features, we can model polynomials of degree ≤ K − 1 within the framework of linear regression: A polynomial of degree K − 1 is

f(x) = ∑_{k=0}^{K−1} θ_k x^k = φ^⊤(x) θ ,    (9.14)

where φ is defined in (9.13) and θ = [θ_0, . . . , θ_{K−1}]^⊤ ∈ R^K contains the (linear) parameters θ_k.

Let us now have a look at maximum likelihood estimation of the parameters θ in the linear regression model (9.12). We consider training inputs x_n ∈ R^D and targets y_n ∈ R, n = 1, . . . , N, and define the feature matrix (design matrix) as

    ⎡ φ^⊤(x_1) ⎤   ⎡ φ_0(x_1)  ···  φ_{K−1}(x_1) ⎤
Φ = ⎢    ⋮     ⎥ = ⎢ φ_0(x_2)  ···  φ_{K−1}(x_2) ⎥ ∈ R^{N×K} ,    (9.15)
    ⎣ φ^⊤(x_N) ⎦   ⎢    ⋮               ⋮        ⎥
                   ⎣ φ_0(x_N)  ···  φ_{K−1}(x_N) ⎦

where Φ_ij = φ_j(x_i) and φ_j : R^D → R.


Example 9.3 (Feature Matrix for Second-order Polynomials)
For a second-order polynomial and N training points x_n ∈ R, n = 1, . . . , N, the feature matrix is

    ⎡ 1  x_1  x_1² ⎤
Φ = ⎢ 1  x_2  x_2² ⎥ .    (9.16)
    ⎢ ⋮   ⋮    ⋮   ⎥
    ⎣ 1  x_N  x_N² ⎦

With the feature matrix Φ defined in (9.15), the negative log-likelihood for the linear regression model (9.12) can be written as

− log p(y | X, θ) = (1/(2σ²)) (y − Φθ)^⊤ (y − Φθ) + const .    (9.17)

Comparing (9.17) with the negative log-likelihood in (9.9b) for the “feature-free” model, we immediately see that we just need to replace X with Φ. Since both X and Φ are independent of the parameters θ that we wish to optimize, we arrive immediately at the maximum likelihood estimate

θ_ML = (Φ^⊤ Φ)^{−1} Φ^⊤ y    (9.18)

for the linear regression problem with nonlinear features defined in (9.12).

Remark. When we were working without features, we required X^⊤ X to be invertible, which is the case when X has full column rank (rk(X) = D). In (9.18), we correspondingly require Φ^⊤ Φ to be invertible, which is the case if and only if the columns of the feature matrix are linearly independent, i.e., rk(Φ) = K. Nonlinear feature transformations can make previously linearly dependent inputs X linearly independent (and vice versa). ♦
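To make this concrete, here is a small NumPy sketch (an assumed setup, not code from the book) that builds the polynomial feature matrix from (9.15)/(9.16) and computes the maximum likelihood estimate (9.18). It uses np.linalg.lstsq rather than explicitly forming (Φ^⊤Φ)^{−1}, which solves the same least-squares problem in a numerically preferable way.

    import numpy as np

    def polynomial_features(x, degree):
        """Feature matrix Phi as in (9.15) with monomials phi_k(x) = x^k, k = 0, ..., degree."""
        x = np.asarray(x).reshape(-1)
        return np.vander(x, N=degree + 1, increasing=True)   # shape (N, K) with K = degree + 1

    def theta_ml(Phi, y):
        """Maximum likelihood estimate (9.18), computed via a least-squares solver."""
        return np.linalg.lstsq(Phi, y, rcond=None)[0]

    # Hypothetical data in the spirit of Example 9.4 (constants are illustrative).
    rng = np.random.default_rng(1)
    x_train = rng.uniform(-5.0, 5.0, size=10)
    y_train = -np.sin(x_train / 5) + np.cos(x_train) + 0.2 * rng.standard_normal(x_train.size)

    Phi = polynomial_features(x_train, degree=4)
    theta = theta_ml(Phi, y_train)

    # Predictions at test locations are phi(x_*)^T theta_ML.
    x_test = np.linspace(-5.0, 5.0, 100)
    y_pred = polynomial_features(x_test, degree=4) @ theta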

Example 9.4 (Maximum Likelihood Polynomial Fit)

[Figure 9.4 Polynomial regression. (a) Data set consisting of (x_n, y_n) pairs, n = 1, . . . , 10; (b) maximum likelihood polynomial of degree 4.]

Consider the data set in Figure 9.4(a). The data set consists of N = 10 pairs (x_n, y_n), where x_n ∼ U[−5, 5] and y_n = −sin(x_n/5) + cos(x_n) + ε, where ε ∼ N(0, 0.2²).
We fit a polynomial of degree 4 using maximum likelihood estimation, i.e., the parameters θ_ML are given by (9.18). The maximum likelihood estimate yields function values φ^⊤(x_*) θ_ML at any test location x_*. The result is shown in Figure 9.4(b).

Estimating the Noise Variance

Thus far, we assumed that the noise variance σ² is known. However, we can also use the principle of maximum likelihood estimation to obtain an estimate σ²_ML for the noise variance. To do this, we follow the standard procedure: we write down the log-likelihood, compute its derivative with respect to σ² > 0, set it to 0 and solve:

log p(y | X, θ, σ²) = ∑_{n=1}^{N} log N(y_n | θ^⊤ φ(x_n), σ²)    (9.19a)
  = ∑_{n=1}^{N} [ −(1/2) log(2π) − (1/2) log σ² − (1/(2σ²)) (y_n − θ^⊤ φ(x_n))² ]    (9.19b)
  = −(N/2) log σ² − (1/(2σ²)) ∑_{n=1}^{N} (y_n − θ^⊤ φ(x_n))² + const ,    (9.19c)

where we abbreviate s := ∑_{n=1}^{N} (y_n − θ^⊤ φ(x_n))². The partial derivative of the log-likelihood with respect to σ² is then

∂ log p(y | X, θ, σ²) / ∂σ² = −N/(2σ²) + s/(2σ⁴) = 0    (9.20a)
  ⟺  N/(2σ²) = s/(2σ⁴)    (9.20b)
  ⟺  σ²_ML = s/N = (1/N) ∑_{n=1}^{N} (y_n − θ^⊤ φ(x_n))² .    (9.20c)

Therefore, the maximum likelihood estimate for the noise variance is the mean squared distance between the noise-free function values θ^⊤ φ(x_n) and the corresponding noisy observations y_n at x_n, for n = 1, . . . , N.
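A direct NumPy counterpart of (9.20c), assuming a feature matrix Phi, targets y, and parameters theta as in the earlier sketch:

    import numpy as np

    def sigma2_ml(Phi, y, theta):
        """Maximum likelihood noise variance (9.20c): mean squared residual of the fit."""
        return np.mean((y - Phi @ theta) ** 2)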

9.2.2 Overfitting in Linear Regression

We just discussed how to use maximum likelihood estimation to fit linear models (e.g., polynomials) to data. We can evaluate the quality of the model by computing the error/loss incurred. One way of doing this is to compute the negative log-likelihood (9.9b), which we minimized to determine the MLE. Alternatively, given that the noise parameter σ² is not a free model parameter, we can ignore the scaling by 1/σ², so that we end up with a squared-error loss function ‖y − Φθ‖². Instead of using this squared loss, we often use the root mean squared error (RMSE)

√( ‖y − Φθ‖² / N ) = √( (1/N) ∑_{n=1}^{N} (y_n − φ^⊤(x_n) θ)² ) ,    (9.21)

which, being normalized by the number of data points, (a) allows us to compare errors of data sets of different sizes and (b) has the same scale and the same units as the observed function values y_n. For example, assume we fit a model that maps post-codes (x is given in latitude, longitude) to house prices (y-values are EUR). Then the RMSE is also measured in EUR, whereas the squared error is given in EUR². If we choose to include the factor σ² from the original negative log-likelihood (9.9b), then we end up with a “unit-free” objective.

For model selection (see Section 8.5) we can use the RMSE (or the negative log-likelihood) to determine the best degree of the polynomial by finding the polynomial degree M that minimizes the objective. Given that the polynomial degree is a natural number, we can perform a brute-force search and enumerate all (reasonable) values of M. For a training set of size N it is sufficient to test 0 ≤ M ≤ N − 1. For M ≥ N we would need to solve an underdetermined system of linear equations, so that we would end up with infinitely many solutions.

[Figure 9.5 Maximum likelihood fits for different polynomial degrees M: (a) M = 0, (b) M = 1, (c) M = 3, (d) M = 4, (e) M = 6, (f) M = 9.]

Figure 9.5 shows a number of polynomial fits determined by maximum likelihood for the dataset from Figure 9.4(a) with N = 10 observations. We notice that polynomials of low degree (e.g., constant (M = 0) or linear (M = 1)) fit the data poorly and, hence, are poor representations of the true underlying function. For degrees M = 3, . . . , 5 the fits look plausible and smoothly interpolate the data. When we go to higher-degree polynomials, we notice that they fit the data better and better. In the extreme case of M = N − 1 = 9, the function passes through every single data point. However, these high-degree polynomials oscillate wildly and are a poor representation of the underlying function that generated the data, such that we suffer from overfitting.

Remember that the goal is to achieve good generalization by making accurate predictions for new (unseen) data. We obtain some quantitative insight into the dependence of the generalization performance on the polynomial degree M by considering a separate test set comprising 200 data points generated using exactly the same procedure used to generate the training set. As test inputs, we chose a linear grid of 200 points in the interval [−5, 5]. For each choice of M, we evaluate the RMSE (9.21) for both the training data and the test data. Note that the noise variance σ² > 0.

[Figure 9.6 Training error and test error as a function of the degree of the polynomial.]

Looking now at the test error, which is a qualitative measure of the generalization properties of the corresponding polynomial, we notice that initially the test error decreases, see Figure 9.6 (orange). For fourth-order polynomials the test error is relatively low and stays relatively constant up to degree 5. However, from degree 6 onward the test error increases significantly, and high-order polynomials have very bad generalization properties. In this particular example, this is also evident from the corresponding maximum likelihood fits in Figure 9.5. Note that the training error (blue curve in Figure 9.6) never increases when the degree of the polynomial increases. In our example, the best generalization (the point of the smallest test error) is obtained for a polynomial of degree M = 4.
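The brute-force search over polynomial degrees described above can be sketched as follows; it reuses the hypothetical helpers polynomial_features and theta_ml from the earlier sketch, and x_train, y_train, x_test, y_test are assumed arrays generated as described in the text.

    import numpy as np

    def rmse(y, y_pred):
        """Root mean squared error (9.21)."""
        return np.sqrt(np.mean((y - y_pred) ** 2))

    def select_degree(x_train, y_train, x_test, y_test, max_degree):
        """Fit polynomials of degree 0..max_degree by maximum likelihood and pick the
        degree with the smallest test RMSE (cf. Figure 9.6)."""
        results = []
        for M in range(max_degree + 1):
            Phi_train = polynomial_features(x_train, M)
            theta = theta_ml(Phi_train, y_train)
            train_err = rmse(y_train, Phi_train @ theta)
            test_err = rmse(y_test, polynomial_features(x_test, M) @ theta)
            results.append((M, train_err, test_err))
        return min(results, key=lambda r: r[2])   # (best degree, its train RMSE, its test RMSE)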

9.2.3 Regularization and Maximum A Posteriori Estimation

We just saw that maximum likelihood estimation is prone to overfitting. It often happens that the magnitude of the parameter values becomes relatively big if we run into overfitting (Bishop, 2006). One way to mitigate the effect of overfitting is to penalize big parameter values by a technique called regularization. In regularization, we add a term to the log-likelihood that penalizes the magnitude of the parameters θ. A typical example is a regularized “loss function” of the form

− log p(y | X, θ) + λ ‖θ‖₂² ,    (9.22)

where the second term is the regularizer, and λ > 0 controls the “strictness” of the regularization.

Remark. Instead of the Euclidean norm ‖·‖₂, we can choose any p-norm ‖·‖_p. In practice, smaller values for p lead to sparser solutions. Here, “sparse” means that many parameter values θ_n = 0, which is also useful for variable selection. For p = 1, the regularizer is called LASSO (least absolute shrinkage and selection operator) and was proposed by Tibshirani (1996). ♦

From a probabilistic perspective, adding a regularizer is identical to placing a prior distribution p(θ) on the parameters and then selecting the parameters that maximize the posterior distribution p(θ | X, y), i.e., we choose the parameters θ that are “most probable” given the training data. The posterior over the parameters θ, given the training data X, y, is obtained by applying Bayes’ theorem as

p(θ | X, y) = p(y | X, θ) p(θ) / p(y | X) .    (9.23)

The parameter vector θ_MAP that maximizes the posterior (9.23) is called the maximum a posteriori (MAP) estimate.

To find the MAP estimate, we follow steps that are similar in flavor to maximum likelihood estimation. We start with the log-transform and compute the log-posterior as

log p(θ | X, y) = log p(y | X, θ) + log p(θ) + const ,    (9.24)

where the constant comprises the terms that are independent of θ. We see that the log-posterior in (9.24) is the sum of the log-likelihood log p(y | X, θ) and the log-prior log p(θ).

Remark (Relation to Regularization). Choosing a Gaussian parameter prior p(θ) = N(0, b² I) with b² = 1/(2λ), the (negative) log-prior term is

− log p(θ) = λ θ^⊤ θ + const = λ ‖θ‖₂² + const ,    (9.25)

and we recover exactly the regularization term in (9.22). This means that for a quadratic regularization, the regularization parameter λ in (9.22) corresponds to half the precision (inverse variance) of the Gaussian (isotropic) prior p(θ); equivalently, the prior precision is 1/b² = 2λ. Therefore, the log-prior in (9.24) reflects the impact of the regularizer that penalizes implausible values, i.e., values that are unlikely under the prior. ♦
To find the MAP estimate θ_MAP, we minimize the negative log-posterior distribution with respect to θ, i.e., we solve

θ_MAP ∈ arg min_θ { − log p(y | X, θ) − log p(θ) } .    (9.26)

We determine the gradient of the negative log-posterior with respect to θ as

− d log p(θ | X, y)/dθ = − d log p(y | X, θ)/dθ − d log p(θ)/dθ ,    (9.27)

where we identify the first term on the right-hand side as the gradient of the negative log-likelihood given in (9.10c).

More concretely, with a Gaussian prior p(θ) = N(0, b² I) on the parameters θ, the negative log-posterior for the linear regression setting (9.12) is

− log p(θ | X, y) = (1/(2σ²)) (y − Φθ)^⊤ (y − Φθ) + (1/(2b²)) θ^⊤ θ + const .    (9.28)

Here, the first term corresponds to the contribution from the log-likelihood, and the second term originates from the log-prior. The gradient of the negative log-posterior with respect to the parameters θ is then

− d log p(θ | X, y)/dθ = (1/σ²) (θ^⊤ Φ^⊤ Φ − y^⊤ Φ) + (1/b²) θ^⊤ .    (9.29)

We will find the MAP estimate θ_MAP by setting this gradient to 0:

(1/σ²) (θ^⊤ Φ^⊤ Φ − y^⊤ Φ) + (1/b²) θ^⊤ = 0    (9.30a)
  ⟺  θ^⊤ ( (1/σ²) Φ^⊤ Φ + (1/b²) I ) − (1/σ²) y^⊤ Φ = 0    (9.30b)
  ⟺  θ^⊤ ( Φ^⊤ Φ + (σ²/b²) I ) = y^⊤ Φ    (9.30c)
  ⟺  θ^⊤ = y^⊤ Φ ( Φ^⊤ Φ + (σ²/b²) I )^{−1} ,    (9.30d)

so that we obtain the MAP estimate (by transposing both sides of the last equality)

θ_MAP = ( Φ^⊤ Φ + (σ²/b²) I )^{−1} Φ^⊤ y .    (9.31)

Comparing the MAP estimate in (9.31) with the maximum likelihood estimate in (9.18), we see that the only difference between the two solutions is the additional term (σ²/b²) I in the inverse matrix. This term ensures that Φ^⊤ Φ + (σ²/b²) I is symmetric and strictly positive definite (i.e., its inverse exists): Φ^⊤ Φ is symmetric and positive semidefinite and the additional term is strictly positive definite, such that all eigenvalues of the matrix to be inverted are positive. The additional term plays the role of the regularizer.
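In code, the MAP estimate (9.31) only adds a scaled identity to the Gram matrix before solving; a minimal sketch, assuming Phi and y as before and user-chosen sigma2 (noise variance) and b2 (prior variance):

    import numpy as np

    def theta_map(Phi, y, sigma2, b2):
        """MAP estimate (9.31) for the Gaussian prior N(0, b2 * I) on theta."""
        K = Phi.shape[1]
        A = Phi.T @ Phi + (sigma2 / b2) * np.eye(K)   # regularized Gram matrix; always invertible
        return np.linalg.solve(A, Phi.T @ y)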


Example 9.5 (MAP Estimation for Polynomial Regression)

[Figure 9.7 Polynomial regression: maximum likelihood and MAP estimates for polynomials of degree 6 (left) and degree 8 (right).]

In the polynomial regression example from Section 9.2.1, we place a Gaussian prior p(θ) = N(0, I) on the parameters θ and determine the MAP estimates according to (9.31). In Figure 9.7, we show both the maximum likelihood and the MAP estimates for polynomials of degree 6 (left) and degree 8 (right). The prior (regularizer) does not play a significant role for the low-degree polynomial, but keeps the function relatively smooth for higher-degree polynomials. However, the MAP estimate can only push the boundaries of overfitting; it is not a general solution to this problem.

In the following, we will discuss Bayesian linear regression, where we average over all plausible sets of parameters instead of focusing on a point estimate.

9.3 Bayesian Linear Regression

Previously, we looked at linear regression models where we estimated the model parameters θ, e.g., by means of maximum likelihood or MAP estimation. We discovered that MLE can lead to severe overfitting, in particular in the small-data regime. MAP addresses this issue by placing a prior on the parameters that plays the role of a regularizer.

Bayesian linear regression pushes the idea of the parameter prior a step further and does not even attempt to compute a point estimate of the parameters; instead, the full posterior over the parameters is taken into account when making predictions. This means we do not fit any parameters, but we compute an average over all plausible parameter settings (according to the posterior).


9.3.1 Model

In Bayesian linear regression, we consider the model

prior:       p(θ) = N(m_0, S_0) ,
likelihood:  p(y | x, θ) = N(y | φ^⊤(x) θ, σ²) ,    (9.32)

where we now explicitly place a Gaussian prior p(θ) = N(m_0, S_0) on θ, which turns the parameter vector into a latent variable. The full probabilistic model, i.e., the joint distribution of the observed and latent variables y and θ, respectively, is

p(y, θ | x) = p(y | x, θ) p(θ) ,    (9.33)

which allows us to write down the corresponding graphical model in Figure 9.8, where we made the parameters of the Gaussian prior on θ explicit.

[Figure 9.8 Graphical model for Bayesian linear regression.]

9.3.2 Prior Predictions

In practice, we are usually not so much interested in the parameter values θ. Instead, our focus often lies in the predictions we make with those parameter values. In a Bayesian setting, we take the parameter distribution and average over all plausible parameter settings when we make predictions. More specifically, to make predictions at an input location x_*, we integrate out θ and obtain

p(y_* | x_*) = ∫ p(y_* | x_*, θ) p(θ) dθ = E_θ[ p(y_* | x_*, θ) ] ,    (9.34)

which we can interpret as the average prediction of y_* | x_*, θ over all plausible parameters θ according to the prior distribution p(θ). Note that predictions using the prior distribution only require us to specify the input locations x_*, but no training data.

In our model, we chose a conjugate (Gaussian) prior on θ so that the predictive distribution is Gaussian as well (and can be computed in closed form): With the prior distribution p(θ) = N(m_0, S_0), we obtain the predictive distribution as

p(y_* | x_*) = N( φ^⊤(x_*) m_0 , φ^⊤(x_*) S_0 φ(x_*) + σ² ) ,    (9.35)

where we used that (i) the prediction is Gaussian due to conjugacy and the marginalization property of Gaussians, (ii) the Gaussian noise is independent so that V[y_*] = V[φ^⊤(x_*) θ] + V[ε], and (iii) y_* is a linear transformation of θ so that we can apply the rules for computing the mean and covariance of the prediction analytically by using (6.50) and (6.51), respectively.

In (9.35), the term φ^⊤(x_*) S_0 φ(x_*) in the predictive variance explicitly accounts for the uncertainty associated with the parameters θ, whereas σ² is the uncertainty contribution due to the measurement noise.
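The prior predictive moments in (9.35), and the prior over functions visualized in Example 9.6, can be sketched as follows (a hypothetical setup; Phi_grid denotes the feature matrix evaluated on a grid of test inputs):

    import numpy as np

    def prior_predictive(phi_star, m0, S0, sigma2):
        """Prior predictive mean and variance (9.35) for a single feature vector phi(x_*)."""
        mean = phi_star @ m0
        var = phi_star @ S0 @ phi_star + sigma2
        return mean, var

    def sample_prior_functions(Phi_grid, m0, S0, n_samples, rng):
        """Draw theta_i ~ N(m0, S0) and return the function values Phi_grid @ theta_i
        for each sample (one column per sampled function, cf. Figure 9.9(b))."""
        thetas = rng.multivariate_normal(m0, S0, size=n_samples)   # shape (n_samples, K)
        return Phi_grid @ thetas.T                                  # shape (N_grid, n_samples)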


Example 9.6 (Prior over Functions)
Let us consider a Bayesian linear regression problem with polynomials of degree 5. We choose a parameter prior p(θ) = N(0, (1/4) I). Figure 9.9 visualizes the distribution over functions induced by this parameter prior, including some function samples from this prior.

[Figure 9.9 Prior over functions. (a) Distribution over functions represented by the mean function (black line) and the marginal uncertainties (shaded), representing the 95% confidence bounds; (b) samples from the prior over functions, which are induced by samples from the parameter prior.]

So far, we looked at computing predictions using the parameter prior p(θ). However, when we have a parameter posterior (given some training data X, y), the same principles for prediction and inference hold as in (9.34); we just need to replace the prior p(θ) with the posterior p(θ | X, y). In the following, we will derive the posterior distribution in detail before using it to make predictions.

9.3.3 Posterior Distribution

Given a training set of inputs x_n ∈ R^D and corresponding observations y_n ∈ R, n = 1, . . . , N, we compute the posterior over the parameters using Bayes’ theorem as

p(θ | X, y) = p(y | X, θ) p(θ) / p(y | X) ,    (9.36)

where X is the collection of training inputs and y the collection of training targets. Furthermore, p(y | X, θ) is the likelihood, p(θ) the parameter prior, and

p(y | X) = ∫ p(y | X, θ) p(θ) dθ    (9.37)

the marginal likelihood/evidence, which is independent of the parameters θ and ensures that the posterior is normalized, i.e., it integrates to 1. We can think of the marginal likelihood as the likelihood averaged over all possible parameter settings (with respect to the prior distribution p(θ)).

In our specific model (9.32), the posterior (9.36) can be computed in closed form as

p(θ | X, y) = N(θ | m_N, S_N) ,    (9.38a)
S_N = ( S_0^{−1} + σ^{−2} Φ^⊤ Φ )^{−1} ,    (9.38b)
m_N = S_N ( S_0^{−1} m_0 + σ^{−2} Φ^⊤ y ) ,    (9.38c)

where the subscript N indicates the size of the training set. In the following, we will detail how we arrive at this posterior.

Bayes’ theorem tells us that the posterior p(θ | X, y) is proportional to the product of the likelihood p(y | X, θ) and the prior p(θ):

posterior:   p(θ | X, y) = p(y | X, θ) p(θ) / p(y | X) ,    (9.39a)
likelihood:  p(y | X, θ) = N(y | Φθ, σ² I) ,    (9.39b)
prior:       p(θ) = N(θ | m_0, S_0) .    (9.39c)

We will discuss two approaches to derive the desired posterior.

Approach 1: Linear Transformation of Gaussian Random Variables

Looking at the numerator of the posterior in (9.39a), we know that the Gaussian prior times the Gaussian likelihood (where the parameters on which we place the Gaussian appear linearly in the mean) is an (unnormalized) Gaussian (see Section 6.6.2). If necessary, we can find the normalizing constant using (6.114). If we want to compute that product by using the results from (6.112)–(6.113) in Section 6.6.2, we need to ensure the product has the “right” form, i.e.,

N(y | Φθ, σ² I) N(θ | m_0, S_0) = N(θ | µ, Σ) N(θ | m_0, S_0)    (9.40)

for some µ, Σ. With this form we determine the desired product immediately as

N(θ | µ, Σ) N(θ | m_0, S_0) ∝ N(θ | m_N, S_N) ,    (9.41a)
S_N = ( S_0^{−1} + Σ^{−1} )^{−1} ,    (9.41b)
m_N = S_N ( S_0^{−1} m_0 + Σ^{−1} µ ) .    (9.41c)

In order to get the “right” form, we need to turn N(y | Φθ, σ² I) into N(θ | µ, Σ) for appropriate choices of µ, Σ. We will do this by using a linear transformation of Gaussian random variables (see Section 6.6), which allows us to exploit the property that linearly transformed Gaussian random variables are Gaussian distributed. More specifically, we will find µ = By and Σ = σ² BB^⊤ by linearly transforming the relationship y = Φθ in the likelihood into By = θ for a suitable B. We obtain

y = Φθ  ⟺  Φ^⊤ y = Φ^⊤ Φθ  ⟺  (Φ^⊤ Φ)^{−1} Φ^⊤ y = θ ,   where B := (Φ^⊤ Φ)^{−1} Φ^⊤ .    (9.42)

Therefore, we can write θ = By, and by using the rules for linear transformations of the mean and covariance from (6.50)–(6.51) we obtain

N(θ | By, σ² BB^⊤) = N(θ | (Φ^⊤ Φ)^{−1} Φ^⊤ y, σ² (Φ^⊤ Φ)^{−1})    (9.43)

after some rearranging of the terms for the covariance matrix.

If we now look at (9.43) and define its mean as µ and its covariance matrix as Σ in (9.41c) and (9.41b), respectively, we obtain the covariance S_N and the mean m_N of the parameter posterior N(θ | m_N, S_N) as

S_N = ( S_0^{−1} + σ^{−2} Φ^⊤ Φ )^{−1} ,    (9.44a)
m_N = S_N ( S_0^{−1} m_0 + σ^{−2} (Φ^⊤ Φ)(Φ^⊤ Φ)^{−1} Φ^⊤ y )    (9.44b)
    = S_N ( S_0^{−1} m_0 + σ^{−2} Φ^⊤ y ) ,    (9.44c)

respectively. Note that the posterior mean m_N equals the MAP estimate θ_MAP from (9.31). This also makes sense since the posterior distribution is unimodal (Gaussian) with its maximum at the mean.

Remark. The posterior precision (inverse covariance)

S_N^{−1} = S_0^{−1} + (1/σ²) Φ^⊤ Φ    (9.45)

of the parameters θ (see (9.44a)) contains two terms: S_0^{−1} is the prior precision and (1/σ²) Φ^⊤ Φ is a data-dependent (precision) term. Both terms (matrices) are symmetric and positive definite. The data-dependent term (1/σ²) Φ^⊤ Φ accumulates contributions from the data and grows as more data is taken into account. This means (at least) two things:

• The posterior precision grows as more and more data is taken into account; therefore, the covariance, and with it the uncertainty about the parameters, shrinks.
• The relative influence of the parameter prior vanishes for large N.

Therefore, for N → ∞ the prior plays no role, and the parameter posterior tends to a point estimate, the MAP estimate. ♦
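In code, the posterior moments (9.38b)–(9.38c) amount to a few linear-algebra operations; a minimal sketch with explicit inverses for readability (in practice one would use a Cholesky-based solve):

    import numpy as np

    def posterior(Phi, y, m0, S0, sigma2):
        """Parameter posterior N(m_N, S_N) from (9.38b)-(9.38c)."""
        S0_inv = np.linalg.inv(S0)
        SN = np.linalg.inv(S0_inv + (Phi.T @ Phi) / sigma2)   # (9.38b)
        mN = SN @ (S0_inv @ m0 + (Phi.T @ y) / sigma2)        # (9.38c)
        return mN, SN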

Approach 2: Completing the Squares

Instead of looking at the product of the prior and the likelihood, we can transform the problem into log-space and solve for the mean and covariance of the posterior by completing the squares.

The sum of the log-prior and the log-likelihood is

log N(y | Φθ, σ² I) + log N(θ | m_0, S_0)    (9.46a)
  = −(1/2) [ σ^{−2} (y − Φθ)^⊤ (y − Φθ) + (θ − m_0)^⊤ S_0^{−1} (θ − m_0) ] + const ,    (9.46b)

where the constant contains terms independent of θ. We will ignore the constant in the following. We now factorize (9.46b), which yields

−(1/2) [ σ^{−2} y^⊤ y − 2 σ^{−2} y^⊤ Φθ + θ^⊤ σ^{−2} Φ^⊤ Φ θ + θ^⊤ S_0^{−1} θ − 2 m_0^⊤ S_0^{−1} θ + m_0^⊤ S_0^{−1} m_0 ]    (9.47a)
  = −(1/2) [ θ^⊤ ( σ^{−2} Φ^⊤ Φ + S_0^{−1} ) θ − 2 ( σ^{−2} Φ^⊤ y + S_0^{−1} m_0 )^⊤ θ ] + const ,    (9.47b)

where the constant contains the terms in (9.47a) that are independent of θ, and the remaining terms are linear and quadratic in θ. By inspecting (9.47b), we find that this expression is quadratic in θ. The fact that the unnormalized log-posterior distribution is a (negative) quadratic form implies that the posterior is Gaussian, i.e.,

p(θ | X, y) = exp( log p(θ | X, y) ) ∝ exp( log p(y | X, θ) + log p(θ) )    (9.48a)
  ∝ exp( −(1/2) [ θ^⊤ ( σ^{−2} Φ^⊤ Φ + S_0^{−1} ) θ − 2 ( σ^{−2} Φ^⊤ y + S_0^{−1} m_0 )^⊤ θ ] ) ,    (9.48b)

where we used (9.47b) in the last expression.

The remaining task is to bring this (unnormalized) Gaussian into the form that is proportional to N(θ | m_N, S_N), i.e., we need to identify the mean m_N and the covariance matrix S_N. To do this, we use the concept of completing the squares. The desired log-posterior is

log N(θ | m_N, S_N) = −(1/2) (θ − m_N)^⊤ S_N^{−1} (θ − m_N) + const    (9.49a)
  = −(1/2) [ θ^⊤ S_N^{−1} θ − 2 m_N^⊤ S_N^{−1} θ + m_N^⊤ S_N^{−1} m_N ] .    (9.49b)

Here, we factorized the quadratic form (θ − m_N)^⊤ S_N^{−1} (θ − m_N) into a term that is quadratic in θ alone, a term that is linear in θ, and a constant term. This allows us to find S_N and m_N by matching the quadratic and linear terms in (9.47b) and (9.49b), which yields

S_N^{−1} = σ^{−2} Φ^⊤ Φ + S_0^{−1}  ⟺  S_N = ( σ^{−2} Φ^⊤ Φ + S_0^{−1} )^{−1} ,    (9.50)
m_N^⊤ S_N^{−1} = ( σ^{−2} Φ^⊤ y + S_0^{−1} m_0 )^⊤  ⟺  m_N = S_N ( σ^{−2} Φ^⊤ y + S_0^{−1} m_0 ) .    (9.51)

This is identical to the solution in (9.44a)–(9.44c), which we obtained by a linear transformation of Gaussian random variables.


Remark (Completing the Squares: General Approach). If we are given an equation

x^⊤ A x − 2 a^⊤ x + const₁ ,    (9.52)

where A is symmetric and positive definite, which we wish to bring into the form

(x − µ)^⊤ Σ (x − µ) + const₂ ,    (9.53)

we can do this by setting

Σ := A ,    (9.54)
µ := Σ^{−1} a ,    (9.55)

and const₂ = const₁ − µ^⊤ Σ µ. ♦

We can see that the terms inside the exponential in (9.48b) are of the form (9.52) with

A := σ^{−2} Φ^⊤ Φ + S_0^{−1} ,    (9.56)
a := σ^{−2} Φ^⊤ y + S_0^{−1} m_0 .    (9.57)

Since A and a can be difficult to identify in equations like (9.47a), it is often helpful to bring these equations into the form (9.52), which decouples the quadratic term, linear terms, and constants; this simplifies finding the desired solution.

9.3.4 Posterior Predictions

In (9.34), we computed the predictive distribution of y_* at a test input x_* using the parameter prior p(θ). In principle, predicting with the parameter posterior p(θ | X, y) is not fundamentally different, given that in our conjugate model the prior and posterior are both Gaussian (with different parameters). Therefore, by following the same reasoning as in Section 9.3.2 we obtain the (posterior) predictive distribution

p(y_* | X, y, x_*) = ∫ p(y_* | x_*, θ) p(θ | X, y) dθ    (9.58a)
  = ∫ N(y_* | φ^⊤(x_*) θ, σ²) N(θ | m_N, S_N) dθ    (9.58b)
  = N( y_* | φ^⊤(x_*) m_N , φ^⊤(x_*) S_N φ(x_*) + σ² ) .    (9.58c)

The term φ^⊤(x_*) S_N φ(x_*) reflects the posterior uncertainty associated with the parameters θ. Note that S_N depends on the training inputs X, see (9.44a). The predictive mean coincides with the MAP estimate.

Remark (Mean and Variance of Noise-Free Function Values). In many cases, we are not interested in the predictive distribution p(y_* | X, y, x_*) of a (noisy) observation. Instead, we would like to obtain the distribution of the (noise-free) latent function values f(x_*) = φ^⊤(x_*) θ. We determine the corresponding moments by exploiting the properties of means and variances, which yields

E[f(x_*) | X, y] = E_θ[ φ^⊤(x_*) θ | X, y ] = φ^⊤(x_*) E_θ[ θ | X, y ] = φ^⊤(x_*) m_N = m_N^⊤ φ(x_*) ,    (9.59)

V_θ[f(x_*) | X, y] = V_θ[ φ^⊤(x_*) θ | X, y ] = φ^⊤(x_*) V_θ[ θ | X, y ] φ(x_*) = φ^⊤(x_*) S_N φ(x_*) .    (9.60)

We see that the predictive mean is the same as the predictive mean for noisy observations, since the noise has mean 0, and the predictive variance only differs by σ², which is the variance of the measurement noise: When we predict noisy function values, we need to include σ² as a source of uncertainty, but this term is not needed for noise-free predictions. Here, the only remaining uncertainty stems from the parameter posterior. ♦

Remark (Distribution over Functions). The fact that we integrate out the parameters θ induces a distribution over functions: If we sample θ_i ∼ p(θ | X, y) from the parameter posterior, we obtain a single function realization θ_i^⊤ φ(·). The mean function, i.e., the set of all expected function values E_θ[f(·) | θ, X, y], of this distribution over functions is m_N^⊤ φ(·). The (marginal) variances, i.e., the variances of the function values f(·), are given by φ^⊤(·) S_N φ(·). ♦
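A sketch of the posterior predictive moments (9.58c) for a batch of test inputs, assuming Phi_star holds the feature vectors φ(x_*) as rows and mN, SN come from the posterior computation above:

    import numpy as np

    def posterior_predictive(Phi_star, mN, SN, sigma2):
        """Posterior predictive mean and variance (9.58c) at test feature rows Phi_star (N_* x K)."""
        mean = Phi_star @ mN
        # diag(Phi_star @ SN @ Phi_star^T): marginal variances of the noise-free function values (9.60).
        var_f = np.einsum('ij,jk,ik->i', Phi_star, SN, Phi_star)
        return mean, var_f + sigma2   # add sigma2 for noisy observations; drop it for noise-free f(x_*)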

Example 9.7 (Posterior over Functions)

[Figure 9.10 Bayesian linear regression and posterior over functions. (a) Training data; (b) posterior distribution over functions, represented by the marginal uncertainties (shaded) showing the 95% predictive confidence bounds, the maximum likelihood estimate (MLE), and the MAP estimate (MAP), which is identical to the posterior mean function; (c) samples from the posterior over functions, which are induced by samples from the parameter posterior.]

Let us revisit the Bayesian linear regression problem with polynomials of degree 5. We choose a parameter prior p(θ) = N(0, (1/4) I). Figure 9.9 visualizes the prior over functions induced by the parameter prior and sample functions from this prior.
Figure 9.10 shows the posterior over functions that we obtain via Bayesian linear regression. The training dataset is shown in Figure 9.10(a); Figure 9.10(b) shows the posterior distribution over functions, including the functions we would obtain via maximum likelihood and MAP estimation. The function we obtain using the MAP estimate also corresponds to the posterior mean function in the Bayesian linear regression setting. Figure 9.10(c) shows some plausible realizations (samples) of functions under that posterior over functions.

Figure 9.11 shows some examples of the posterior distribution over functions induced by the parameter posterior. For different polynomial degrees M, the left panels show the maximum likelihood estimate, the MAP estimate (which is identical to the posterior mean function) and the 95% predictive confidence bounds, represented by the shaded area. The right panels show samples from the posterior over functions: Here, we sampled parameters θ_i from the parameter posterior and computed the function φ^⊤(x_*) θ_i, which is a single realization of a function under the posterior distribution over functions. For low-order polynomials, the parameter posterior does not allow the parameters to vary much: The sampled functions are nearly identical. When we make the model more flexible by adding more parameters (i.e., we end up with a higher-order polynomial), these parameters are not sufficiently constrained by the posterior, and the sampled functions can be easily visually separated. We also see in the corresponding panels on the left how the uncertainty increases, especially at the boundaries. Although for a seventh-order polynomial the MAP estimate yields a reasonable fit, the Bayesian linear regression model additionally tells us that the posterior uncertainty is huge. This information can be critical when we use these predictions in a decision-making system, where bad decisions can have significant consequences (e.g., in reinforcement learning or robotics).

[Figure 9.11 Bayesian linear regression for polynomials of degree M = 3, 5, 7 (top to bottom). Left panels: Shaded areas indicate the 95% predictive confidence bounds; the mean of the Bayesian linear regression model coincides with the MAP estimate, and the predictive uncertainty is the sum of the noise term and the posterior parameter uncertainty, which depends on the location of the test input. Right panels: Sampled functions from the posterior distribution.]

9.3.5 Computing the Marginal Likelihood

In Section 8.5.2, we highlighted the importance of the marginal likelihood for Bayesian model selection. In the following, we compute the marginal likelihood for Bayesian linear regression with a conjugate Gaussian prior on the parameters, i.e., exactly the setting we have been discussing in this chapter. Just to recap, we consider the following generative process:

θ ∼ N(m_0, S_0) ,    (9.61a)
y_n | x_n, θ ∼ N(x_n^⊤ θ, σ²) ,    (9.61b)

for n = 1, . . . , N. The marginal likelihood is given by

p(y | X) = ∫ p(y | X, θ) p(θ) dθ    (9.62a)
         = ∫ N(y | Xθ, σ² I) N(θ | m_0, S_0) dθ ,    (9.62b)

where we integrate out the model parameters θ. The marginal likelihood can be interpreted as the expected likelihood under the prior, i.e., E_θ[p(y | X, θ)]. We compute the marginal likelihood in two steps: First, we show that the marginal likelihood is Gaussian (as a distribution in y); second, we compute the mean and covariance of this Gaussian.

1. The marginal likelihood is Gaussian: From Section 6.6.2 we know that (i) the product of two Gaussian random variables is an (unnormalized) Gaussian distribution, (ii) a linear transformation of a Gaussian random variable is Gaussian distributed. In (9.62b), we require a linear transformation to bring N(y | Xθ, σ² I) into the form N(θ | µ, Σ) for some µ, Σ. Once this is done, the integral can be solved in closed form. The result is the normalizing constant of the product of the two Gaussians. The normalizing constant itself has Gaussian shape, see (6.114).

2. Mean and covariance: We compute the mean and covariance matrix of the marginal likelihood by exploiting the standard results for means and covariances of affine transformations of random variables, see Section 6.4.4. The mean of the marginal likelihood is computed as

E_θ[y | X] = E_θ[Xθ + ε] = X E_θ[θ] = X m_0 .    (9.63)

Note that ε ∼ N(0, σ² I) is a vector of i.i.d. random variables. The covariance matrix is given as

Cov_θ[y] = Cov[Xθ] + σ² I = X Cov_θ[θ] X^⊤ + σ² I    (9.64a)
         = X S_0 X^⊤ + σ² I .    (9.64b)

Hence, the marginal likelihood is

p(y | X) = (2π)^{−N/2} det(X S_0 X^⊤ + σ² I)^{−1/2}
           × exp( −(1/2) (y − X m_0)^⊤ (X S_0 X^⊤ + σ² I)^{−1} (y − X m_0) ) .    (9.65)

The marginal likelihood can now be used for Bayesian model selection as discussed in Section 8.5.2.
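For numerical model comparison one works with the log of (9.65); a sketch, assuming the feature-free model of (9.61) with design matrix X (for the feature-based model, replace X by Φ):

    import numpy as np

    def log_marginal_likelihood(X, y, m0, S0, sigma2):
        """Log of (9.65): y | X ~ N(X m0, X S0 X^T + sigma2 * I)."""
        N = y.shape[0]
        cov = X @ S0 @ X.T + sigma2 * np.eye(N)
        diff = y - X @ m0
        _, logdet = np.linalg.slogdet(cov)
        quad = diff @ np.linalg.solve(cov, diff)
        return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)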

5370 9.4 Maximum Likelihood as Orthogonal Projection


Having crunched through much algebra to derive maximum likelihood
and MAP estimates, we will now provide a geometric interpretation of
maximum likelihood estimation. Let us consider a simple linear regression
setting

y = xθ + ,  ∼ N 0, σ 2 ,

(9.66)

5371 in which we consider linear functions f : R → R that go through the


5372 origin (we omit features here for clarity). The parameter θ determines the
5373 slope of the line. Figure 9.12(a) shows a one-dimensional dataset.
With a training data set X = [x1 , . . . , xN ]> ∈ RN , y = [y1 , . . . , yN ]> ∈

Draft (2018-08-30) from Mathematics for Machine Learning. Errata and feedback to https://mml-book.com.
9.4 Maximum Likelihood as Orthogonal Projection 291
4 4 Figure 9.12
Geometric
2 2 interpretation of
least squares. (a)
0 0 Dataset; (b)
y

y
Maximum
Projection likelihood solution
−2 −2
Observations interpreted as a
Maximum likelihood estimate
projection.
−4 −4
−4 −2 0 2 4 −4 −2 0 2 4
x x

(a) Regression dataset consisting of noisy ob- (b) The orange dots are the projections of
servations yn (blue) of function values f (xn ) the noisy observations (blue dots) onto the
at input locations xn . line θML x. The maximum likelihood solution to
a linear regression problem finds a subspace
(line) onto which the overall projection er-
ror (orange lines) of the observations is mini-
mized.

RN , we recall the results from Section 9.2.1 and obtain the maximum
likelihood estimator for the slope parameter as

    \theta_{ML} = (X^\top X)^{-1} X^\top y = \frac{X^\top y}{X^\top X} \in \mathbb{R}\,.    (9.67)

This means for the training inputs X we obtain the optimal (maximum likelihood) reconstruction of the training data, i.e., the approximation with the minimum least-squares error

    X\theta_{ML} = X\,\frac{X^\top y}{X^\top X} = \frac{X X^\top}{X^\top X}\, y\,.    (9.68)
As we are basically looking for a solution of y = X\theta, we can think of linear regression as a problem of solving systems of linear equations. Therefore, we can relate to concepts from linear algebra and analytic geometry that we discussed in Chapters 2 and 3. In particular, looking carefully at (9.68), we see that the maximum likelihood estimator \theta_{ML} in our example from (9.66) effectively performs an orthogonal projection of y onto the one-dimensional subspace spanned by X. Recalling the results on orthogonal projections from Section 3.7, we identify \frac{X X^\top}{X^\top X} as the projection matrix, \theta_{ML} as the coordinate of the projection onto the one-dimensional subspace of \mathbb{R}^N spanned by X, and X\theta_{ML} as the orthogonal projection of y onto this subspace.
Therefore, the maximum likelihood solution also provides a geometrically optimal solution by finding the vectors in the subspace spanned by X that are “closest” to the corresponding observations y, where “closest” means the smallest (squared) distance of the function values y_n to x_n\theta. This is achieved by orthogonal projections. Figure 9.12(b) shows the orthogonal projection of the noisy observations onto the subspace that minimizes the squared distance between the original dataset and its projection, which corresponds to the maximum likelihood solution.
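
The projection view of (9.67) and (9.68) is easy to verify numerically. The following sketch (assuming NumPy; the dataset is made up for illustration) computes θML for the model (9.66) and checks that XθML is exactly the orthogonal projection of y onto the subspace spanned by X:

import numpy as np

rng = np.random.default_rng(1)

# One-dimensional dataset for y = x*theta + eps (made-up values).
N = 30
x = rng.uniform(-4, 4, size=N)          # training inputs, shape (N,)
y = 2.0 * x + 0.5 * rng.normal(size=N)  # noisy observations

# Maximum likelihood estimate (9.67): theta_ML = x^T y / (x^T x).
theta_ml = x @ y / (x @ x)

# Orthogonal projection of y onto span{x}: P = x x^T / (x^T x), see (9.68).
P = np.outer(x, x) / (x @ x)
proj_y = P @ y

print(np.allclose(proj_y, x * theta_ml))   # True: X theta_ML is the projection of y
print(np.allclose(P @ proj_y, proj_y))     # True: P is idempotent, i.e., a projection
print(np.isclose((y - proj_y) @ x, 0.0))   # True: the residual is orthogonal to x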
In the general linear regression case where

    y = \phi^\top(x)\theta + \epsilon\,, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)    (9.69)

with vector-valued features \phi(x) \in \mathbb{R}^K, we again can interpret the maximum likelihood result

    y \approx \Phi\theta_{ML}\,,                         (9.70)
    \theta_{ML} = (\Phi^\top\Phi)^{-1}\Phi^\top y        (9.71)

as a projection onto a K-dimensional subspace of \mathbb{R}^N, which is spanned by the columns of the feature matrix \Phi, see Section 3.7.2.
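
As a quick numerical check (a sketch assuming NumPy; the polynomial features and toy data are arbitrary choices, not prescribed by the text), Φθ_ML coincides with the orthogonal projection Φ(Φ^⊤Φ)^{-1}Φ^⊤y of y onto the column space of Φ:

import numpy as np

rng = np.random.default_rng(2)

# Noisy observations of an unknown function at N inputs (made-up data).
N, K = 25, 4
x = rng.uniform(-3, 3, size=N)
y = np.sin(x) + 0.2 * rng.normal(size=N)

# Feature matrix with polynomial features phi(x) = (1, x, x^2, x^3).
Phi = np.vander(x, K, increasing=True)           # shape (N, K)

# Maximum likelihood parameters (9.71), computed via least squares.
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Orthogonal projection of y onto the column space of Phi.
proj = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

print(np.allclose(Phi @ theta_ml, proj))         # True: Phi theta_ML is the projection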
If the feature functions φk that we use to construct the feature ma-
trix Φ are orthonormal (see Section 3.6), we obtain a special case where
the columns of Φ form an orthonormal basis (see Section 3.5), such that
Φ> Φ = I . This will then lead to the projection
    \Phi(\Phi^\top\Phi)^{-1}\Phi^\top y = \Phi\Phi^\top y = \left(\sum_{k=1}^{K} \phi_k \phi_k^\top\right) y    (9.72)

5395 so that the coupling between different features has disappeared and the
5396 maximum likelihood projection is simply the sum of projections of y onto
5397 the individual basis vectors φk , i.e., the columns of Φ. Many popular basis
5398 functions in signal processing, such as wavelets and Fourier bases, are
5399 orthogonal basis functions. When the basis is not orthogonal, one can
5400 convert a set of linearly independent basis functions to an orthogonal basis
5401 by using the Gram-Schmidt process (Strang, 2003).
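
A small sketch of this special case (assuming NumPy; the orthonormal columns are obtained here via a QR decomposition purely for illustration) confirms that with Φ^⊤Φ = I the projection decomposes into a sum of rank-one projections onto the individual columns φk, as in (9.72):

import numpy as np

rng = np.random.default_rng(3)

N, K = 20, 5
y = rng.normal(size=N)

# Build an orthonormal feature matrix (its columns form an orthonormal basis)
# by orthogonalizing a random matrix with a QR decomposition.
Phi, _ = np.linalg.qr(rng.normal(size=(N, K)))
assert np.allclose(Phi.T @ Phi, np.eye(K))       # Phi^T Phi = I

# Full projection vs. sum of projections onto the individual basis vectors.
proj_full = Phi @ Phi.T @ y
proj_sum = sum(np.outer(Phi[:, k], Phi[:, k]) @ y for k in range(K))

print(np.allclose(proj_full, proj_sum))          # True: the feature couplings vanish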

5402 9.5 Further Reading


In this chapter, we discussed linear regression for Gaussian likelihoods and conjugate Gaussian priors on the parameters of the model. This allowed for closed-form Bayesian inference. However, in some applications we may want to choose a different likelihood function. For example, in a binary classification setting, we observe only two possible (categorical) outcomes, and a Gaussian likelihood is inappropriate in this setting. Instead, we can choose a Bernoulli likelihood that returns the probability of the predicted label being 1 (or 0). We refer to the books by Bishop (2006), Murphy (2012), and Barber (2012) for an in-depth introduction to classification problems. A different example where non-Gaussian likelihoods are important is count data. Counts are non-negative integers, and in this case a Binomial or Poisson likelihood would be a better choice than a Gaussian. All these examples fall into the category of generalized linear models, a flexible generalization of linear regression that allows for response variables that have error distribution models other than a Gaussian


distribution. The GLM generalizes linear regression by allowing the linear model to be related to the observed values via a smooth and invertible function \sigma(\cdot) that may be nonlinear, so that y = \sigma(f), where f = \theta^\top\phi(x) is the linear regression model from (9.12). We can therefore think of a generalized linear model in terms of function composition y = \sigma \circ f, where f is a linear regression model and \sigma the activation function. Note that although we are talking about “generalized linear models”, the outputs y are no longer linear in the parameters \theta. In logistic regression, we choose the logistic sigmoid \sigma(f) = \frac{1}{1+\exp(-f)} \in [0, 1], which can be interpreted as the probability of observing a binary output y = 1 of a Bernoulli random variable. The function \sigma(\cdot) is called the transfer function or activation function; its inverse is called the canonical link function. (For ordinary linear regression, the activation function would simply be the identity.) From this perspective, it is also clear that generalized linear models are the building blocks of (deep) feedforward neural networks: If we consider a generalized linear model y = \sigma(Ax + b), where A is a weight matrix and b a bias vector, we identify this generalized linear model as a single-layer neural network with activation function \sigma(\cdot). We can now recursively compose these functions via

    x_{k+1} = f_k(x_k)\,, \qquad f_k(x_k) = \sigma_k(A_k x_k + b_k)    (9.73)
5403 for k = 0, . . . , K − 1 where x0 are the input features and xK = y
5404 are the observed outputs, such that f K−1 ◦ · · · ◦ f 0 is a K -layer deep
5405 neural network. Therefore, the building blocks of this deep neural net-
5406 work are the generalized linear models defined in (9.73). A great post
5407 on the relation between GLMs and deep networks is available at https:
5408 //tinyurl.com/glm-dnn. Neural networks (Bishop, 1995; Goodfellow
5409 et al., 2016) are significantly more expressive and flexible than linear re-
5410 gression models. However, maximum likelihood parameter estimation is a
5411 non-convex optimization problem, and marginalization of the parameters
5412 in a fully Bayesian setting is analytically intractable.
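
The composition (9.73) is straightforward to write down in code. The sketch below (assuming NumPy; layer sizes and weights are arbitrary and randomly initialized, i.e., this is not a trained model) stacks generalized linear models with logistic-sigmoid activations into a small feedforward network:

import numpy as np

rng = np.random.default_rng(4)

def sigmoid(f):
    # Logistic sigmoid, sigma(f) = 1 / (1 + exp(-f)).
    return 1.0 / (1.0 + np.exp(-f))

def glm(x, A, b, sigma=sigmoid):
    # A single generalized linear model: y = sigma(A x + b).
    return sigma(A @ x + b)

# Stack K generalized linear models into a K-layer feedforward network (9.73).
layer_sizes = [3, 8, 8, 1]                       # arbitrary choice for illustration
params = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def deep_network(x0, params):
    x = x0
    for A, b in params:                          # x_{k+1} = f_k(x_k) = sigma_k(A_k x_k + b_k)
        x = glm(x, A, b)
    return x

y = deep_network(rng.normal(size=3), params)
print(y)                                         # output of the K-layer network, in (0, 1)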
We briefly hinted at the fact that a distribution over parameters induces a distribution over regression functions. Gaussian processes (Rasmussen and Williams, 2006) are regression models where the concept of a distribution over functions is central. Instead of placing a distribution over parameters, a Gaussian process places a distribution directly on the space of functions without the “detour” via the parameters. To do so, the Gaussian process exploits the kernel trick (Schölkopf and Smola, 2002), which allows us to compute inner products between two function values f(x_i), f(x_j) only by looking at the corresponding inputs x_i, x_j. A Gaussian process is closely related to both Bayesian linear regression and support vector regression, but it can also be interpreted as a Bayesian neural network with a single hidden layer where the number of units tends to infinity (Neal, 1996; Williams, 1997). An excellent introduction to Gaussian processes can be found in (MacKay, 1998; Rasmussen and Williams, 2006).
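
To give a flavor of what a distribution directly over functions looks like, the following sketch (assuming NumPy; the RBF kernel and its hyperparameters are arbitrary illustrative choices) draws sample functions from a Gaussian process prior evaluated at a grid of test inputs:

import numpy as np

rng = np.random.default_rng(5)

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel k(x, x') = var * exp(-(x - x')^2 / (2 l^2)).
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

# Test inputs at which we evaluate functions drawn from the GP prior.
x_test = np.linspace(-4, 4, 100)
K = rbf_kernel(x_test, x_test)

# Each sample is a function evaluated at x_test; jointly, the values are Gaussian
# with mean 0 and covariance given by the kernel matrix (plus jitter for stability).
samples = rng.multivariate_normal(np.zeros(len(x_test)),
                                  K + 1e-8 * np.eye(len(x_test)), size=3)
print(samples.shape)   # (3, 100): three functions drawn from the GP prior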
We focused on Gaussian parameter priors in the discussions in this chapter because they allow for closed-form inference in linear regression models. However, even in a regression setting with Gaussian likelihoods we may choose a non-Gaussian prior. Consider a setting where the inputs are x \in \mathbb{R}^D and our training set is small, of size N \ll D. This means that the regression problem is under-determined. In this case, we can choose a parameter prior that enforces sparsity, i.e., a prior that tries to set as many parameters to 0 as possible (variable selection). This prior provides a stronger regularizer than the Gaussian prior, which often leads to increased prediction accuracy and interpretability of the model. The Laplace prior is one example that is frequently used for this purpose. A linear regression model with the Laplace prior on the parameters is equivalent to linear regression with L1 regularization (LASSO) (Tibshirani, 1996). The Laplace distribution is sharply peaked at zero (its first derivative is discontinuous), and it concentrates its probability mass closer to zero than the Gaussian distribution, which encourages parameters to be 0. Therefore, the non-zero parameters are relevant for the regression problem, which is the reason why we also speak of “variable selection”.
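
The sparsity effect of the Laplace prior can be illustrated with a short sketch (assuming scikit-learn is available; data, dimensions, and regularization strengths are arbitrary choices): on an under-determined problem, L1-regularized regression (LASSO) sets most coefficients exactly to zero, whereas L2 regularization (corresponding to a Gaussian prior) does not:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)

# Under-determined problem: few observations (N), many input dimensions (D).
N, D = 20, 100
X = rng.normal(size=(N, D))
theta_true = np.zeros(D)
theta_true[:3] = [2.0, -3.0, 1.5]               # only three relevant parameters
y = X @ theta_true + 0.1 * rng.normal(size=N)

ridge = Ridge(alpha=1.0).fit(X, y)              # corresponds to a Gaussian prior (L2)
lasso = Lasso(alpha=0.1).fit(X, y)              # corresponds to a Laplace prior (L1)

print("non-zero ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-6))
print("non-zero lasso coefficients:", np.sum(np.abs(lasso.coef_) > 1e-6))
# The LASSO typically keeps only a handful of coefficients, performing variable selection.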
