
Machine Learning Course - CS-433

Least Squares

Sept 24, 2024

Martin Jaggi
Last updated on: September 24, 2024
credits to Mohammad Emtiyaz Khan & Rüdiger Urbanke
Motivation

In rare cases, one can compute the optimum of the cost function analytically. Linear regression using a mean-squared error cost function is one such case. Here the solution can be obtained explicitly, by solving a linear system of equations. These equations are sometimes called the normal equations. This method is one of the most popular methods for data fitting. It is called least squares.

To derive the normal equations, we first show that the problem is convex. We then use the optimality conditions for convex functions (see the previous lecture notes on optimization). I.e., at the optimum parameter, call it w*, it must be true that the gradient of the cost function is 0. I.e.,

\nabla L(w^\star) = 0.

This is a system of D equations.
Normal Equations

Recall that the cost function for linear regression with mean-squared error is given by

L(w) = \frac{1}{2N} \sum_{n=1}^{N} \big( y_n - x_n^\top w \big)^2
     = \frac{1}{2N} (y - Xw)^\top (y - Xw),

where

y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix},
\qquad
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1D} \\
x_{21} & x_{22} & \cdots & x_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{ND}
\end{bmatrix}.
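As a quick numerical sanity check, here is a minimal NumPy sketch (the toy X, y, and w below are made up for illustration) confirming that the sum form and the matrix form of L(w) agree:

import numpy as np

# Toy data (made up for illustration): N = 5 samples, D = 2 features.
rng = np.random.default_rng(0)
N, D = 5, 2
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

# Sum form: (1 / 2N) * sum_n (y_n - x_n^T w)^2
L_sum = sum((y[n] - X[n] @ w) ** 2 for n in range(N)) / (2 * N)

# Matrix form: (1 / 2N) * (y - Xw)^T (y - Xw)
e = y - X @ w
L_mat = (e @ e) / (2 * N)

assert np.isclose(L_sum, L_mat)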

We claim that this cost function is convex in w. There are several ways of proving this:

1. Simplest way: observe that L is naturally represented as the sum (with positive coefficients) of the simple terms (y_n - x_n^T w)^2. Further, each of these simple terms is the composition of a linear function with a convex function (the square function). Therefore, each of these simple terms is convex and hence the sum is convex.

2. Directly verify the definition, that for any \lambda \in [0, 1] and any w, w',

L(\lambda w + (1 - \lambda) w') - \big( \lambda L(w) + (1 - \lambda) L(w') \big) \le 0.

Computation: using the identity \|\lambda a + (1 - \lambda) b\|^2 = \lambda \|a\|^2 + (1 - \lambda) \|b\|^2 - \lambda (1 - \lambda) \|a - b\|^2 with a = y - Xw and b = y - Xw', the left-hand side equals

- \frac{\lambda (1 - \lambda)}{2N} \, \| X (w - w') \|_2^2,

which indeed is non-positive.

3. We can compute the second derivative (the Hessian) and show that it is positive semidefinite (all its eigenvalues are non-negative). For the present case a computation shows that the Hessian has the form

\frac{1}{N} X^\top X.

This matrix is indeed positive semidefinite since its non-zero eigenvalues are the squares of the non-zero singular values of the matrix X. (A small numerical check follows this list.)
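To make the Hessian argument from item 3 concrete, here is a minimal NumPy check on made-up toy data that the eigenvalues of (1/N) X^T X are non-negative and equal to the squared singular values of X, scaled by 1/N:

import numpy as np

# Toy data (made up for illustration): N = 6 samples, D = 3 features.
rng = np.random.default_rng(1)
N, D = 6, 3
X = rng.normal(size=(N, D))

# Hessian of the MSE cost: (1/N) X^T X.
H = (X.T @ X) / N

# Eigenvalues of the symmetric Hessian: all non-negative (up to round-off).
eigvals = np.linalg.eigvalsh(H)
assert np.all(eigvals >= -1e-12)

# The non-zero eigenvalues are the squared singular values of X, scaled by 1/N.
sing_vals = np.linalg.svd(X, compute_uv=False)
assert np.allclose(np.sort(eigvals), np.sort(sing_vals ** 2 / N))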
Now that we know that the function is convex, let us find its minimum. If we take the gradient of this expression with respect to the weight vector w, we get

\nabla L(w) = -\frac{1}{N} X^\top (y - Xw).

If we set this expression to 0, we get the normal equations for linear regression,

X^\top \underbrace{(y - Xw)}_{\text{error}} = 0.

This is a system of D equations in D variables.
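A minimal sketch (toy data, made up for illustration) that verifies the analytic gradient above against central finite differences:

import numpy as np

# Toy data (made up for illustration).
rng = np.random.default_rng(2)
N, D = 8, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

def cost(w):
    e = y - X @ w
    return (e @ e) / (2 * N)

# Analytic gradient: -(1/N) X^T (y - Xw).
grad_analytic = -(X.T @ (y - X @ w)) / N

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_numeric = np.zeros(D)
for d in range(D):
    step = np.zeros(D)
    step[d] = eps
    grad_numeric[d] = (cost(w + step) - cost(w - step)) / (2 * eps)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)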
Geometric Interpretation

The error is orthogonal to all columns of X.

The span of X is the space spanned by the columns of X. Every element of the span can be written as u = Xw for some choice of w. Which element of span(X) shall we take? The normal equations tell us that the optimum choice for u, call it u*, is that element so that y - u* is orthogonal to span(X). In other words, we should pick u* to be equal to the projection of y onto span(X).

The following figure illustrates this:

[Figure (taken from Bishop's book): the target y ∈ R^N is projected onto the subspace S = span(X); the projection Xw* is the element of the span closest to y, i.e., the minimizer of ‖y - Xw‖.]
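A minimal numerical illustration of this orthogonality (toy data, made up for illustration): the residual at the least-squares solution is orthogonal to every column of X.

import numpy as np

# Toy data (made up for illustration).
rng = np.random.default_rng(3)
N, D = 10, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Least-squares solution via the normal equations (X assumed to have full column rank).
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# The residual (error) at the optimum is orthogonal to every column of X.
residual = y - X @ w_star
assert np.allclose(X.T @ residual, np.zeros(D), atol=1e-8)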
Least Squares

The matrix X^T X ∈ R^{D×D} is called the Gram matrix. If it is invertible, we can multiply the normal equations by the inverse of the Gram matrix from the left to get a closed-form expression for the minimum:

w^\star = (X^\top X)^{-1} X^\top y.

We can use this model to predict a new value for an unseen datapoint (test point) x_m:

\hat{y}_m := x_m^\top w^\star = x_m^\top (X^\top X)^{-1} X^\top y.
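A minimal sketch of the closed-form solution and prediction in NumPy (toy data, made up for illustration; in practice one solves the linear system rather than forming the inverse, see the Additional Notes):

import numpy as np

# Toy data (made up for illustration): N samples, D features.
rng = np.random.default_rng(4)
N, D = 20, 4
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Closed-form least-squares solution: solve the normal equations
# (X^T X) w = X^T y rather than forming the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for an unseen test point x_m.
x_m = rng.normal(size=D)
y_hat_m = x_m @ w_star

# Cross-check against NumPy's dedicated least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_star, w_lstsq)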

Invertibility and Uniqueness

Note that the Gram matrix X^T X ∈ R^{D×D} is invertible if and only if X has full column rank, or in other words rank(X) = D.

Proof: To see this, assume first that rank(X) < D. Then there exists a non-zero vector u so that Xu = 0. It follows that X^T X u = 0, and so rank(X^T X) < D. Therefore, X^T X is not invertible.

Conversely, assume that X^T X is not invertible. Hence, there exists a non-zero vector v so that X^T X v = 0. It follows that

0 = v^\top X^\top X v = (Xv)^\top (Xv) = \| Xv \|^2.

This implies that Xv = 0, i.e., rank(X) < D.
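A small numerical illustration of this equivalence (toy data, made up for illustration): a rank-deficient X produces a singular Gram matrix.

import numpy as np

# Toy data (made up for illustration).
rng = np.random.default_rng(5)
N, D = 8, 3

# Full column rank: the Gram matrix X^T X is invertible.
X_full = rng.normal(size=(N, D))
assert np.linalg.matrix_rank(X_full) == D
assert np.linalg.matrix_rank(X_full.T @ X_full) == D

# Rank-deficient X: copy the first column into the third.
X_def = X_full.copy()
X_def[:, 2] = X_def[:, 0]
assert np.linalg.matrix_rank(X_def) == 2
assert np.linalg.matrix_rank(X_def.T @ X_def) == 2   # singular Gram matrix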


Rank Deficiency and Ill-Conditioning

Unfortunately, in practice, X is often rank deficient.

• If D > N (the over-parametrized setting), we always have rank(X) < D (since row rank = column rank), so the Gram matrix is not invertible.

• If D ≤ N, but some of the columns x_{:d} of X are (nearly) collinear, then the matrix is ill-conditioned, leading to numerical issues when solving the linear system.

Can we solve least squares if X is rank deficient? Yes, using a linear system solver applied to the normal equations

X^\top X w = X^\top y.
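One way this can look in practice (a sketch on made-up toy data, not prescribed by the notes): NumPy's np.linalg.lstsq solves the least-squares problem via the SVD even when X is rank deficient, returning the minimum-norm minimizer among the infinitely many solutions.

import numpy as np

# Rank-deficient toy example (made up for illustration): a duplicated column
# makes rank(X) < D, so the Gram matrix is singular.
rng = np.random.default_rng(6)
N, D = 10, 3
X = rng.normal(size=(N, D))
X[:, 2] = X[:, 1]
y = rng.normal(size=N)

# SVD-based solver; returns the minimum-norm least-squares solution.
w, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
assert rank == 2

# Any minimizer satisfies the normal equations X^T (y - Xw) = 0.
assert np.allclose(X.T @ (y - X @ w), 0, atol=1e-8)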
Summary of Linear Regression

We have studied three types of methods:

1. Grid Search

2. Iterative Optimization Algorithms: (Stochastic) Gradient Descent

3. Least Squares: closed-form solution, for the linear MSE case
Additional Notes

Solving linear systems

There are many ways to solve a linear system Xw = y, but it usually involves a decomposition of the matrix X, such as the QR or LU decomposition, which are very robust. Matlab's backslash operator and also NumPy's linalg package implement this in just one line:

w = np.linalg.solve(X, y)

It is important to never invert a matrix to solve a linear system, as this would incur a cost at least three times the cost of using a linear solver. For more, see this blog post: https://gregorygundersen.com/blog/2020/12/09/matrix-inversion/.
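Note that np.linalg.solve expects a square coefficient matrix; in the least-squares setting with a tall (N × D) data matrix X, the square system to hand to it is the normal equations (X^T X) w = X^T y. A minimal sketch (toy data, for illustration only) contrasting the factorization-based solver with the discouraged explicit inverse:

import numpy as np

# Toy data (made up for illustration).
rng = np.random.default_rng(7)
N, D = 100, 5
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# The square system to solve is the normal equations (X^T X) w = X^T y.
A = X.T @ X
b = X.T @ y

w_solve = np.linalg.solve(A, b)   # preferred: factorization-based solver
w_inv = np.linalg.inv(A) @ b      # discouraged: forms the explicit inverse

# Both agree here, but the solver is cheaper and numerically more stable.
assert np.allclose(w_solve, w_inv)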

For a robust implementation, see Sec. 7.5.2 of Kevin Murphy’s book.

Closed-form solution for MAE

Can you derive a closed-form solution for the 1-parameter model when using the MAE cost function, i.e., when minimizing ‖y - Xw‖_1 (up to a constant factor)?
See this short article: http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/.
