
Derivation of Normal Equations for Linear Regression

Instructor: Dr. Ali Hassan

1 Linear Regression

In these notes I want to explain how the normal equations for linear regression are derived. For this, I will be using vector and matrix calculus. Please remember that you can find these details on different websites such as Wikipedia; the idea is the same, but the symbols might be different. I am going to stay consistent with my course contents.
We have defined the following hypothesis function:

h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n,   (1)

where each x is an n-dimensional vector, or observation. Our goal is to minimise the cost function J(\theta)
using least squares as follows:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,   (2)

where x^{(i)} is the ith observation out of m samples and y^{(i)} is the ground truth (class label) for that
ith observation. While doing these calculations, the rule of thumb is to check the dimensions of the resulting
vectors. This gives you a good idea of whether you have done the calculus correctly.
So here I redefine the vectors in our problem just as a reminder. The regression coefficients in our problem
are \theta, defined as

 
\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}   (3)

Each of the m inputs x is a vector of n + 1 dimensions, where we add x_0 = 1 for convenience. Therefore our
hypothesis function becomes:

h_\theta(x) = \theta^T x,   (4)

where this equation represents the dot product of the two vectors. We define the 'design matrix', referred
to as X, using the following notation:


X = \begin{bmatrix} \cdots\; x^{(1)T}\; \cdots \\ \cdots\; x^{(2)T}\; \cdots \\ \vdots \\ \cdots\; x^{(m)T}\; \cdots \end{bmatrix}   (5)

as a matrix of m rows, in which the ith row is the ith sample (the vector x^{(i)}). With this, we can rewrite the
least-squares cost as follows, replacing the explicit sum by matrix multiplication:

J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)   (6)
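As a quick numerical sanity check, the following NumPy sketch (with made-up random data; the variable names are just for illustration) confirms that the vectorised cost in equation (6) matches the explicit sum in equation (2):

import numpy as np

# Minimal sketch: compare the summed cost (2) with the matrix form (6) on random data.
rng = np.random.default_rng(0)
m, n = 5, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # design matrix with x0 = 1
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

# Equation (2): explicit sum over the m samples.
J_sum = sum((X[i] @ theta - y[i]) ** 2 for i in range(m)) / (2 * m)

# Equation (6): vectorised form (X theta - y)^T (X theta - y) / (2m).
r = X @ theta - y
J_mat = (r @ r) / (2 * m)

print(np.isclose(J_sum, J_mat))  # True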

1
Using some matrix transpose identities, we can simplify this a bit. I will throw away the fraction part 2m ,
since we’re going to compare a derivative to zero anyway. So here are the equations:

J(\theta) = ((X\theta)^T - y^T)(X\theta - y)   (7)

J(\theta) = (X\theta)^T X\theta - (X\theta)^T y - y^T (X\theta) + y^T y   (8)

Note that X\theta is a vector, and so is y. Their inner product is a scalar, so it does not matter in which
order we multiply them (as long as the dimensions work out): (X\theta)^T y = y^T (X\theta). So we can further simplify:

J(\theta) = \theta^T X^T X\theta - 2(X\theta)^T y + y^T y   (9)

This is the matrix representation of the cost function J that we wished to minimise. Remember that, using the
system of linear equations, we had taken the partial derivative of the cost function and set it equal to
zero. Using this derivative we used the gradient descent algorithm to find the values of \theta that minimise
this cost function.
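Before moving on, here is a small numerical check (made-up data again) that the expanded form in equation (9) really equals (X\theta - y)^T (X\theta - y), i.e. equation (6) without the \frac{1}{2m} factor:

import numpy as np

# Sketch: the expanded quadratic (9) should equal the compact residual form.
rng = np.random.default_rng(1)
m, n = 6, 3
X = rng.normal(size=(m, n + 1))
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

lhs = theta @ X.T @ X @ theta - 2 * (X @ theta) @ y + y @ y   # equation (9)
rhs = (X @ theta - y) @ (X @ theta - y)                       # 2m times equation (6)
print(np.isclose(lhs, rhs))  # True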
Now we have to take the partial derivative with respect to vectors, which is slightly uncomfortable, but I will
try to explain it.
Let's work out the derivatives in terms of vectors. Let's consider the following function:

f(v) = a^T v   (10)

We know how to take the partial derivative of the above function by expanding it. So let's define the partial
derivatives first:

\frac{\partial}{\partial v} f = \begin{bmatrix} \frac{\partial f}{\partial v_1} \\ \frac{\partial f}{\partial v_2} \\ \vdots \\ \frac{\partial f}{\partial v_n} \end{bmatrix}   (11)

We know how to solve this by expanding the vectors as follows:



f(v) = a^T v   (12)
     = a_1 v_1 + a_2 v_2 + \dots + a_n v_n   (13)

Now, computing the partial derivative with respect to each component, we get the following:

\frac{\partial f}{\partial v_1} = a_1, \quad \frac{\partial f}{\partial v_2} = a_2, \quad \dots, \quad \frac{\partial f}{\partial v_n} = a_n   (14)

So in terms of vectors, we can write the above equation as:


\frac{\partial}{\partial v} f = a.   (16)
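We can also verify this rule numerically; the sketch below (random data, purely illustrative) approximates the gradient of f(v) = a^T v with central finite differences and compares it with equation (16):

import numpy as np

# Finite-difference check that the gradient of a^T v is simply a.
rng = np.random.default_rng(2)
n = 4
a = rng.normal(size=n)
v = rng.normal(size=n)
eps = 1e-6

f = lambda v: a @ v
grad_fd = np.array([(f(v + eps * e) - f(v - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.allclose(grad_fd, a))  # True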
Let's keep working through the partial derivatives with respect to vectors. Let's define

P(\theta) = 2(X\theta)^T y.   (17)

Here, please verify that the dimensions in the above equation are consistent. Now let's expand the above
equation:

P(\theta) = 2 \left( \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix} \right)^T \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}   (18)

Expanding the equations:

P(\theta) = 2 \begin{bmatrix} x_{11}\theta_1 + \dots + x_{1n}\theta_n \\ x_{21}\theta_1 + \dots + x_{2n}\theta_n \\ \vdots \\ x_{m1}\theta_1 + \dots + x_{mn}\theta_n \end{bmatrix}^T \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}   (19)

and now multiplying by the vector y:

P(\theta) = 2\left[ (x_{11}\theta_1 + \dots + x_{1n}\theta_n)\, y_1 + (x_{21}\theta_1 + \dots + x_{2n}\theta_n)\, y_2 + \dots + (x_{m1}\theta_1 + \dots + x_{mn}\theta_n)\, y_m \right]   (20)

We can write the above equation in a much more compact form using summation signs:

P(\theta) = 2 \sum_{r=1}^{m} (x_{r1}\theta_1 + \dots + x_{rn}\theta_n)\, y_r   (21)
          = 2 \sum_{r=1}^{m} \sum_{c=1}^{n} y_r\, x_{rc}\, \theta_c   (22)

Now let's start taking the partial derivatives of the above equation with respect to each \theta_i:

\frac{\partial P}{\partial \theta_1} = 2(x_{11} y_1 + \dots + x_{m1} y_m), \quad \frac{\partial P}{\partial \theta_2} = 2(x_{12} y_1 + \dots + x_{m2} y_m), \quad \dots, \quad \frac{\partial P}{\partial \theta_n} = 2(x_{1n} y_1 + \dots + x_{mn} y_m)   (23)

So we can write the above equations in matrix form as follows:


\frac{\partial}{\partial \theta} P = 2X^T y   (24)

\frac{\partial}{\partial \theta} P = \frac{\partial}{\partial \theta}\left( 2(X\theta)^T y \right) = 2X^T y   (25)

Take a moment to convince yourself this is true. It is just collecting the individual components of X into a
matrix and the individual components of y into a vector. Since X is an m-by-n matrix and y is an m-by-1
column vector, X^T y has dimensions n-by-1, so the dimensions work out and the result is an n-by-1 column vector.
So we've just computed the second term of the vector derivative of J. Now let's go back to the full definition
of J and see how to compute the derivative of its first term. Let's call it Q:

Q(\theta) = \theta^T X^T X\theta   (26)

This will be slightly more complex, but here we go ... please note that there is a transpose of X.

Q(\theta) = \begin{pmatrix} \theta_1 & \cdots & \theta_n \end{pmatrix} \begin{bmatrix} x_{11} & x_{21} & \cdots & x_{m1} \\ x_{12} & x_{22} & \cdots & x_{m2} \\ \vdots & \vdots & & \vdots \\ x_{1n} & x_{2n} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}   (27)

The centre two matrices multiply to give X^T X, an n-by-n square matrix, which we will call the X-squared matrix. The element in row r and column c
of this square matrix is:

\sum_{i=1}^{m} x_{ir} x_{ic},   (28)

where X-squared (that is, X^T X) is a symmetric matrix. Let's denote each element of this squared matrix by X^2_{rc}. Now the
above equation becomes:

Q(\theta) = \begin{pmatrix} \theta_1 & \cdots & \theta_n \end{pmatrix} \begin{bmatrix} X^2_{11}\theta_1 + \dots + X^2_{1n}\theta_n \\ X^2_{21}\theta_1 + \dots + X^2_{2n}\theta_n \\ \vdots \\ X^2_{n1}\theta_1 + \dots + X^2_{nn}\theta_n \end{bmatrix}   (29)

Now multiplying by θ we get:

Q(\theta) = \theta_1 (X^2_{11}\theta_1 + \dots + X^2_{1n}\theta_n) + \theta_2 (X^2_{21}\theta_1 + \dots + X^2_{2n}\theta_n) + \dots + \theta_n (X^2_{n1}\theta_1 + \dots + X^2_{nn}\theta_n)   (30)

Let's start calculating the partial derivatives as follows:

\frac{\partial Q}{\partial \theta_1} = (2\theta_1 X^2_{11} + \theta_2 X^2_{12} + \dots + \theta_n X^2_{1n}) + \theta_2 X^2_{21} + \dots + \theta_n X^2_{n1}   (31)

Remember that X^2 (that is, X^T X) is a symmetric matrix, so X^2_{12} = X^2_{21}. Hence,

\frac{\partial Q}{\partial \theta_1} = 2\theta_1 X^2_{11} + 2\theta_2 X^2_{12} + \dots + 2\theta_n X^2_{1n}   (32)

The remaining partial derivatives will also be similar.


Now, writing the partial derivatives in vector form, we get:

\frac{\partial Q}{\partial \theta} = 2X^2 \theta = 2X^T X\theta   (33)

Hence,

\frac{\partial}{\partial \theta} \theta^T X^T X\theta = 2X^T X\theta.   (34)
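The same finite-difference trick confirms equation (34) numerically (again on made-up data):

import numpy as np

# Sketch: Q(theta) = theta^T X^T X theta should have gradient 2 X^T X theta, as in (34).
rng = np.random.default_rng(4)
m, n = 6, 3
X = rng.normal(size=(m, n))
theta = rng.normal(size=n)
eps = 1e-6

Q = lambda t: t @ X.T @ X @ t
grad_fd = np.array([(Q(theta + eps * e) - Q(theta - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
print(np.allclose(grad_fd, 2 * X.T @ X @ theta))  # True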

Now let's look back at the matrix form of the cost function J given in (9) and insert the partial derivatives.

\frac{\partial J}{\partial \theta} = \frac{\partial}{\partial \theta}\left( \theta^T X^T X\theta - 2(X\theta)^T y + y^T y \right)   (35)
                                   = 2X^T X\theta - 2X^T y = 0   (36)

OR

2X^T X\theta = 2X^T y   (37)

Now, assuming that the matrix X^T X is invertible, we get

\theta = (X^T X)^{-1} X^T y   (38)

which is the normal equation that we use for calculating \theta in closed form using matrix notation.
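Here is a minimal NumPy sketch of equation (38) on some made-up data (the sample size, noise level, and the cross-check against np.linalg.lstsq are just for illustration):

import numpy as np

# Sketch: solve linear regression with the normal equation and cross-check it.
rng = np.random.default_rng(5)
m, n = 50, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # x0 = 1 column
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=m)

# theta = (X^T X)^{-1} X^T y; assumes X^T X is invertible.
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_normal, theta_lstsq))  # True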

2 Derivation Using Matrix Derivatives

Now that you know how to calculate the partial derivatives of matrices, let's have a look at how to derive the
normal equations for linear regression using these rules directly. Let's recap the cost function for linear regression:
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,   (39)

The vectorised form of this equation is given as follows:


J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)   (40)

Using some matrix transpose identities, we can simplify this a bit. As before, I will drop the fraction
\frac{1}{2m}, since we are going to set the derivative equal to zero anyway. So here are the equations:

J(\theta) = ((X\theta)^T - y^T)(X\theta - y)   (41)

J(\theta) = (X\theta)^T X\theta - (X\theta)^T y - y^T (X\theta) + y^T y   (42)

Rearranging the above matrices and vectors, we get:

J(\theta) = \theta^T X^T X\theta - 2y^T (X\theta) + y^T y   (43)

Now our goal is simply to find the values of \theta that minimise the cost function J(\theta). To do that, we
use the concepts from our FCS: to minimise a function, all we need to do is take the derivative of
the function we want to minimise and set it equal to zero. We have to remember one thing, though: here
we are dealing with vectors and matrices, not scalar variables. So let's have a look at some of the rules that we
are going to use.

2.1 Matrix Derivatives

If we take the derivative of a function of \theta, say f(\theta) = A\theta, with respect to (w.r.t.) \theta, then the partial derivative
comes out to be:


\frac{\partial}{\partial \theta} (A\theta) = A^T   (44)

Secondly,

\frac{\partial}{\partial \theta} (\theta^T A\theta) = 2A^T \theta,   (45)

which holds in this form when A is symmetric, as is the case for A = X^T X.
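These two rules can also be probed numerically. Below is a small sketch (random data, my own illustration): the first rule is checked by contracting A\theta with a fixed vector c, whose gradient should be A^T c, and the second is checked with a symmetric matrix, as X^T X is:

import numpy as np

# Sketch: finite-difference checks of the two matrix-derivative rules above.
rng = np.random.default_rng(6)
n = 4
A = rng.normal(size=(n, n))
S = A + A.T                      # a symmetric matrix, like X^T X
theta = rng.normal(size=n)
c = rng.normal(size=n)
eps = 1e-6
I = np.eye(n)

# Rule (44): the scalar c^T (A theta) has gradient A^T c with respect to theta.
f = lambda t: c @ A @ t
grad_fd = np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps) for e in I])
print(np.allclose(grad_fd, A.T @ c))   # True

# Rule (45): theta^T S theta has gradient 2 S theta when S is symmetric.
q = lambda t: t @ S @ t
grad_fd = np.array([(q(theta + eps * e) - q(theta - eps * e)) / (2 * eps) for e in I])
print(np.allclose(grad_fd, 2 * S @ theta))  # True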

2.2 Calculating the Derivatives

Now let's apply these rules to the cost function. Therefore:

\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \frac{\partial}{\partial \theta}\left( \theta^T X^T X\theta - 2y^T (X\theta) + y^T y \right)   (46)
                                           = \frac{1}{2m} \left( 2X^T X\theta - 2X^T y + 0 \right)   (47)

All that we need to do now is set the first derivative of the function equal to zero and solve for \theta:

\frac{1}{2m} \left( 2X^T X\theta - 2X^T y + 0 \right) = 0,   (48)
2X^T X\theta = 2X^T y,   (49)
X^T X\theta = X^T y,   (50)
\theta = (X^T X)^{-1} X^T y,   (51)

which is the equation for solving for \theta. This is called the Normal Equation for linear regression.
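As a closing sketch (again on made-up data), the snippet below solves the linear system X^T X\theta = X^T y with np.linalg.solve, which is numerically preferable to forming the explicit inverse, and checks that a plain gradient-descent loop on J(\theta), as mentioned in Section 1, reaches the same answer; the step size and iteration count here are arbitrary choices:

import numpy as np

# Sketch: normal equation via a linear solve versus gradient descent on J(theta).
rng = np.random.default_rng(7)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y = X @ np.array([0.5, 2.0, -1.0]) + 0.05 * rng.normal(size=m)

theta_ne = np.linalg.solve(X.T @ X, X.T @ y)     # normal equation, equation (51)

theta_gd = np.zeros(n + 1)
for _ in range(5000):
    grad = (X.T @ X @ theta_gd - X.T @ y) / m    # gradient of J, equation (47)
    theta_gd -= 0.1 * grad

print(np.allclose(theta_ne, theta_gd, atol=1e-6))  # True for this well-conditioned X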
