Derivation of Normal Equations
1 Linear Regression
In these notes I want to explain how the normal equations for linear regression are derived. To do this, I will be using vector and matrix calculus. Please remember that you can find these details on the internet on different websites such as Wikipedia; the idea is the same but the symbols might be different. I am going to stay consistent with my course contents.
We have defined the following hypothesis function:
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n,   (1)
where each x is an n-dimensional vector, or observation. Our goal is to minimise the cost function J(θ)
using least squares as follows:
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,   (2)
where x^{(i)} is the ith observation out of m samples and y^{(i)} is the ground truth (class label) for that
ith observation. While doing these calculations, the rule of thumb is to check the dimensions of the resulting
vectors. This gives you a good idea of whether you have done the calculus correctly.
So here I redefine the vectors in our problem, just as a reminder. The regression coefficients in our problem
are θ, defined as
\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}   (3)
Each of the m inputs x is a vector of n + 1 dimensions, where we add x_0 = 1 for convenience. Therefore our
hypothesis function becomes:
h_\theta(x) = \theta^T x,   (4)
where this equation represents the dot product of the two vectors. We define the 'design matrix', referred
to as X, using the following notation:
X = \begin{bmatrix} \cdots & x^{(1)T} & \cdots \\ \cdots & x^{(2)T} & \cdots \\ & \vdots & \\ \cdots & x^{(m)T} & \cdots \end{bmatrix}   (5)
as a matrix of m rows, in which the ith row is the ith sample (the vector x^{(i)T}). With this, we can rewrite the
least-squares cost as follows, replacing the explicit sum by matrix multiplication:
J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)   (6)
Using some matrix transpose identities, we can simplify this a bit. I will throw away the fraction part \frac{1}{2m},
since we are going to equate the derivative to zero anyway. So here are the equations:

J(\theta) \propto (X\theta - y)^T (X\theta - y) = (X\theta)^T (X\theta) - (X\theta)^T y - y^T (X\theta) + y^T y   (7)

= \theta^T X^T X \theta - (X\theta)^T y - y^T (X\theta) + y^T y   (8)

Note that Xθ is a vector, and so is y. So when we take the dot product of one with the other, it does not matter what the
order is (as long as the dimensions work out), i.e. (X\theta)^T y = y^T (X\theta). So we can further simplify:

J(\theta) \propto \theta^T X^T X \theta - 2 (X\theta)^T y + y^T y   (9)
This is the matrix representation of the cost function J that we wished to minimise. Remember that with the
system of linear equations we took the partial derivative of the cost function and set it equal to
zero, and using this derivative we ran the gradient descent algorithm to find the values of θ that minimise
this cost function.
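As a quick sanity check of (6) and (9), the following sketch (again on made-up data) confirms that the matrix form of the cost matches the explicit sum, and that the simplified expression (9) is just 2m times the cost in (6):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3                       # made-up sizes for illustration
X = rng.normal(size=(m, n))       # design matrix (each row is one sample)
y = rng.normal(size=m)
theta = rng.normal(size=n)

# Explicit sum, equation (2).
J_sum = np.sum((X @ theta - y) ** 2) / (2 * m)

# Matrix form, equation (6).
r = X @ theta - y
J_mat = (r @ r) / (2 * m)

# Expanded form, equation (9), without the 1/(2m) factor.
J_expanded = theta @ X.T @ X @ theta - 2 * (X @ theta) @ y + y @ y

print(np.isclose(J_sum, J_mat))               # True
print(np.isclose(J_expanded, 2 * m * J_mat))  # True: (9) equals 2m times (6)
```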
Now we have to take partial derivatives with respect to vectors, which is slightly uncomfortable, but I will
try to explain it.
Let's work out the derivatives in terms of vectors. Let's consider the following function:
f(v) = a^T v   (10)
We know how to take the partial derivative of the above function by expanding it. So let's define the partial
derivatives first:
\frac{\partial}{\partial v} f = \begin{bmatrix} \frac{\partial f}{\partial v_1} \\ \frac{\partial f}{\partial v_2} \\ \vdots \\ \frac{\partial f}{\partial v_n} \end{bmatrix}   (11)
f(v) = a^T v   (12)

= a_1 v_1 + a_2 v_2 + \dots + a_n v_n   (13)

\frac{\partial f}{\partial v_1} = a_1, \quad \frac{\partial f}{\partial v_2} = a_2, \quad \dots, \quad \frac{\partial f}{\partial v_n} = a_n   (14)

\frac{\partial}{\partial v} f = a.   (16)
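To convince yourself of (16) numerically, here is a small sketch (assuming NumPy, with made-up vectors) that compares a finite-difference gradient of f(v) = a^T v against a:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
a = rng.normal(size=n)
v = rng.normal(size=n)

f = lambda v: a @ v   # f(v) = a^T v, equation (10)

# Central finite-difference gradient, component by component.
eps = 1e-6
grad_fd = np.array([(f(v + eps * np.eye(n)[i]) - f(v - eps * np.eye(n)[i])) / (2 * eps)
                    for i in range(n)])

print(np.allclose(grad_fd, a))   # True: d(a^T v)/dv = a, equation (16)
```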
Let's keep working through the partial derivatives with respect to vectors. Let's define

P(\theta) = 2 (X\theta)^T y.   (17)

Here, please verify that the dimensions of the above equation are consistent. Now let's expand the above
equation:
P(\theta) = 2 \left( \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ & & \vdots & \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix} \right)^T \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}   (18)

P(\theta) = 2 \begin{bmatrix} x_{11}\theta_1 + \cdots + x_{1n}\theta_n \\ x_{21}\theta_1 + \cdots + x_{2n}\theta_n \\ \vdots \\ x_{m1}\theta_1 + \cdots + x_{mn}\theta_n \end{bmatrix}^T \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}   (19)
We can write the above equations in a much more compact form using summation signs:
P(\theta) = 2 \sum_{r=1}^{m} (x_{r1}\theta_1 + \cdots + x_{rn}\theta_n) y_r   (21)

= 2 \sum_{r=1}^{m} \sum_{c=1}^{n} y_r x_{rc} \theta_c   (22)
\frac{\partial P}{\partial \theta_1} = 2 (x_{11} y_1 + \cdots + x_{m1} y_m)

\frac{\partial P}{\partial \theta_2} = 2 (x_{12} y_1 + \cdots + x_{m2} y_m)

\vdots

\frac{\partial P}{\partial \theta_n} = 2 (x_{1n} y_1 + \cdots + x_{mn} y_m)   (23)
So we can write the above equations in matrix form as follows:

\frac{\partial}{\partial \theta} P = 2 X^T y   (24)

\frac{\partial}{\partial \theta} P = \frac{\partial}{\partial \theta} \left( 2 (X\theta)^T y \right) = 2 X^T y   (25)
Take a moment to convince yourself this is true. It is just collecting the individual components of X (transposed) into a
matrix and the individual components of y into a vector. Since X is an m-by-n matrix and y is an m-by-1
column vector, the dimensions of X^T y work out and the result is an n-by-1 column vector.
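Here is a similar hedged check (made-up data, NumPy) that the gradient of P(θ) = 2(Xθ)^T y is indeed 2X^T y:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
theta = rng.normal(size=n)

P = lambda t: 2 * (X @ t) @ y   # P(theta) = 2 (X theta)^T y, equation (17)

# Central finite-difference gradient of P at theta.
eps = 1e-6
grad_fd = np.array([(P(theta + eps * np.eye(n)[i]) - P(theta - eps * np.eye(n)[i])) / (2 * eps)
                    for i in range(n)])

print(np.allclose(grad_fd, 2 * X.T @ y))   # True: dP/dtheta = 2 X^T y, equations (24)-(25)
```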
So we have just computed the second term of the vector derivative of J. Now let's go back to the full definition
of J and see how to compute the derivative of its first term. Let's call it Q:
Q(\theta) = \theta^T X^T X \theta   (26)
This will be slightly more complex, but here we go. Please note that there is a transpose of X.
Q(\theta) = \begin{pmatrix} \theta_1 & \cdots & \theta_n \end{pmatrix} \begin{bmatrix} x_{11} & x_{21} & \cdots & x_{m1} \\ x_{12} & x_{22} & \cdots & x_{m2} \\ & & \vdots & \\ x_{1n} & x_{2n} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ & & \vdots & \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}   (27)
The two matrices in the centre multiply to give the 'X-squared' matrix X^T X, which is n-by-n. The element in row r and column c
of this square matrix is:
\sum_{i=1}^{m} x_{ir} x_{ic},   (28)
and this X-squared matrix is symmetric. Let's call each element of this squared matrix X^2_{rc}. Now the
above equation becomes:
Q(\theta) = \begin{pmatrix} \theta_1 & \cdots & \theta_n \end{pmatrix} \begin{bmatrix} X^2_{11}\theta_1 + \cdots + X^2_{1n}\theta_n \\ X^2_{21}\theta_1 + \cdots + X^2_{2n}\theta_n \\ \vdots \\ X^2_{n1}\theta_1 + \cdots + X^2_{nn}\theta_n \end{bmatrix}   (29)
Q(\theta) = \theta_1 (X^2_{11}\theta_1 + \cdots + X^2_{1n}\theta_n) + \theta_2 (X^2_{21}\theta_1 + \cdots + X^2_{2n}\theta_n) + \cdots + \theta_n (X^2_{n1}\theta_1 + \cdots + X^2_{nn}\theta_n)   (30)
\frac{\partial Q}{\partial \theta_1} = (2\theta_1 X^2_{11} + \theta_2 X^2_{12} + \cdots + \theta_n X^2_{1n}) + \theta_2 X^2_{21} + \cdots + \theta_n X^2_{n1}   (31)
Remember that X^2 (that is, X^T X) is a symmetric matrix, so X^2_{12} = X^2_{21}. Hence,

\frac{\partial Q}{\partial \theta_1} = 2\theta_1 X^2_{11} + 2\theta_2 X^2_{12} + \cdots + 2\theta_n X^2_{1n}   (32)
\frac{\partial Q}{\partial \theta} = 2 X^2 \theta = 2 X^T X \theta   (33)
Hence,
\frac{\partial}{\partial \theta} \theta^T X^T X \theta = 2 X^T X \theta.   (34)
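And the corresponding numerical check for (34), again on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 3
X = rng.normal(size=(m, n))
theta = rng.normal(size=n)

Q = lambda t: t @ X.T @ X @ t   # Q(theta) = theta^T X^T X theta, equation (26)

# Central finite-difference gradient of Q at theta.
eps = 1e-6
grad_fd = np.array([(Q(theta + eps * np.eye(n)[i]) - Q(theta - eps * np.eye(n)[i])) / (2 * eps)
                    for i in range(n)])

print(np.allclose(grad_fd, 2 * X.T @ X @ theta))   # True: dQ/dtheta = 2 X^T X theta, equation (34)
```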
Now let's look back at the matrix form of the cost function J given in (9) and insert the partial derivatives.
\frac{\partial J}{\partial \theta} = \frac{\partial}{\partial \theta} \left( \theta^T X^T X \theta - 2 (X\theta)^T y + y^T y \right)   (35)

= 2 X^T X \theta - 2 X^T y = 0   (36)
or

2 X^T X \theta = 2 X^T y   (37)

\theta = (X^T X)^{-1} X^T y   (38)
which is the normal equation that we use to calculate θ in closed form using matrix notation.
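As an illustration of (38), the following sketch solves the normal equation on made-up data, verifies that the gradient (36) vanishes at the solution, and cross-checks against NumPy's least-squares routine. (Solving the linear system (37) directly is numerically preferable to forming the inverse explicitly.)

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 20, 4
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

# Normal equation (38): theta = (X^T X)^{-1} X^T y,
# computed by solving the linear system (37) rather than inverting X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient (36) should vanish at the minimiser.
grad = 2 * X.T @ X @ theta - 2 * X.T @ y
print(np.allclose(grad, 0))

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta, theta_lstsq))
```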
Now that you know how to calculate partial derivatives with respect to vectors and matrices, let's have a look at how to
derive the normal equations for linear regression using standard derivative rules. Let's recap the cost function for linear regression:
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,   (39)
As before, we can use some matrix transpose identities to write this cost in matrix form (the constant factor \frac{1}{2m} does not affect where the minimum is, since we are going to set the derivative to zero anyway):

J(\theta) = \frac{1}{2m} \left( \theta^T X^T X \theta - 2 y^T (X\theta) + y^T y \right).
Now our goal is simply to find the values of θ that minimise the cost function J(θ). To do that, we
use the concepts from our FCS: to minimise a function, all we need to do is take the derivative of
the function that we want to minimise and set it equal to zero. We just have to remember one thing: here
we are dealing with vectors and matrices, not scalar variables. So let's have a look at some of the rules that we
are going to use.
If we take the derivative of a linear function of θ, f(θ) = Aθ, with respect to (w.r.t.) θ, then the partial derivative
comes out to be:

\frac{\partial}{\partial \theta} (A\theta) = A^T   (44)
Secondly, for a symmetric matrix A (as X^T X is),

\frac{\partial}{\partial \theta} (\theta^T A \theta) = 2 A^T \theta   (45)
Applying rules (44) and (45) to the matrix form of J(θ):

\frac{\partial J(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta} \, \frac{1}{2m} \left( \theta^T X^T X \theta - 2 y^T (X\theta) + y^T y \right)   (46)

= \frac{1}{2m} \left( 2 X^T X \theta - 2 X^T y + 0 \right)   (47)
All that we need to do now is to set the first derivative of the function equal to zero and solve for θ.
\frac{1}{2m} \left( 2 X^T X \theta - 2 X^T y + 0 \right) = 0,   (48)

2 X^T X \theta = 2 X^T y,   (49)

X^T X \theta = X^T y,   (50)

\theta = (X^T X)^{-1} X^T y,   (51)
which is the equation for solving θ. This is called the Normal Equation for linear regression.
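Putting the whole recipe together, here is a minimal end-to-end sketch on made-up data and with hypothetical variable names: it builds the design matrix with the x_0 = 1 column as in (5), solves the normal equation (51), and uses the fitted θ for prediction:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 50, 2
features = rng.normal(size=(m, n))        # raw inputs, made up for illustration
X = np.c_[np.ones(m), features]           # design matrix with x0 = 1 prepended, as in (5)

true_theta = np.array([1.0, -2.0, 0.5])   # hypothetical "true" coefficients [theta_0, theta_1, theta_2]
y = X @ true_theta + 0.01 * rng.normal(size=m)   # noisy labels

# Normal equation (51): theta = (X^T X)^{-1} X^T y, solved as a linear system.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# The fitted theta is used with the hypothesis h_theta(x) = theta^T x for prediction.
x_new = np.array([1.0, 0.3, -1.2])        # new input, with x0 = 1 included
print(theta)           # should be close to true_theta
print(x_new @ theta)   # predicted value for the new input
```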