Least Squares
Martin Jaggi
Last updated on: September 24, 2024
credits to Mohammad Emtiyaz Khan & Rüdiger Urbanke
Motivation
In rare cases, one can compute the optimum of the cost function analytically. Linear regression using a mean-squared-error cost function is one such case. To derive the normal equations, we first show that the problem is convex, so that any stationary point is a global optimum, and then set the gradient to zero: $\nabla L(\mathbf{w}^\star) = \mathbf{0}$.
Normal Equations
Recall that the cost function for linear regression with mean-squared error is given by
\[
L(\mathbf{w}) \;=\; \frac{1}{2N}\sum_{n=1}^{N}\bigl(y_n - \mathbf{x}_n^\top \mathbf{w}\bigr)^2 \;=\; \frac{1}{2N}\,(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w}),
\]
where
\[
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix},
\qquad
\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1D} \\
x_{21} & x_{22} & \cdots & x_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{ND}
\end{bmatrix}.
\]
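As a quick sanity check, the per-sample sum and the vectorized matrix form of $L(\mathbf{w})$ can be compared numerically; the NumPy sketch below uses made-up data and an arbitrary weight vector purely for illustration.

import numpy as np

rng = np.random.default_rng(42)
N, D = 10, 3
X = rng.standard_normal((N, D))   # rows are the input vectors x_n
y = rng.standard_normal(N)        # targets y_n
w = rng.standard_normal(D)        # an arbitrary weight vector

# Sum over samples: (1/2N) * sum_n (y_n - x_n^T w)^2
L_sum = sum((y[n] - X[n] @ w) ** 2 for n in range(N)) / (2 * N)

# Vectorized form: (1/2N) * (y - Xw)^T (y - Xw)
e = y - X @ w
L_vec = (e @ e) / (2 * N)

assert np.isclose(L_sum, L_vec)   # both expressions give the same cost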
Convexity can be seen in several ways. The cost is a sum of simple terms of the form $(y_n - \mathbf{x}_n^\top\mathbf{w})^2$. Further, each of these simple terms is the composition of a linear function of $\mathbf{w}$ with the (convex) square function, and a sum of convex functions is convex. Alternatively, for any $\mathbf{w}^\star$ with $\nabla L(\mathbf{w}^\star) = \mathbf{0}$ one can verify directly that
\[
L(\mathbf{w}) - L(\mathbf{w}^\star) \;=\; \frac{1}{2N}\,\|\mathbf{X}(\mathbf{w}-\mathbf{w}^\star)\|_2^2 \;\ge\; 0.
\]
A third option is to compute the second derivative (the Hessian) and show that it is positive semidefinite (all its eigenvalues are non-negative): for any $\mathbf{v}$ we have $\mathbf{v}^\top\mathbf{X}^\top\mathbf{X}\,\mathbf{v} = \|\mathbf{X}\mathbf{v}\|_2^2 \ge 0$.
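As a numerical illustration of the last point (with a made-up design matrix), the eigenvalues of the Hessian $\mathbf{X}^\top\mathbf{X}/N$ can be checked directly:

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.standard_normal((N, D))        # made-up design matrix

hessian = X.T @ X / N                  # Hessian of the MSE cost L(w)
eigvals = np.linalg.eigvalsh(hessian)  # eigenvalues of the symmetric Hessian
print(eigvals.min() >= -1e-12)         # True: all eigenvalues non-negative, Hessian is PSD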
Now that we know that the function is convex, we find the optimum by setting the gradient to zero. This yields the normal equations
\[
\mathbf{X}^\top \underbrace{(\mathbf{y} - \mathbf{X}\mathbf{w})}_{\text{error}} \;=\; \mathbf{0},
\]
a system of $D$ equations in $D$ variables.
Geometric Interpretation
The error $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w} \in \mathbb{R}^N$ is orthogonal to all columns of $\mathbf{X}$. The span of $\mathbf{X}$ is the space spanned by the columns of $\mathbf{X}$. Every element of the span can be written as $\mathbf{u} = \mathbf{X}\mathbf{w}$ for some choice of $\mathbf{w}$. Which element of $\operatorname{span}(\mathbf{X})$ shall we take? The normal equations tell us to take the one whose error is orthogonal to $\operatorname{span}(\mathbf{X})$, i.e. the orthogonal projection of $\mathbf{y}$ onto $\operatorname{span}(\mathbf{X})$, which is exactly the element minimizing $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|$.
[Figure: the target $\mathbf{y} \in \mathbb{R}^N$ and its orthogonal projection $\mathbf{X}\mathbf{w}$ onto $S = \operatorname{span}(\mathbf{X})$; least squares minimizes the error norm $\|\mathbf{e}\| = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|$ over $\mathbf{w}$.]
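This orthogonality is easy to verify numerically; the sketch below uses synthetic data and np.linalg.lstsq only as a convenient way to obtain a least-squares solution.

import numpy as np

rng = np.random.default_rng(2)
N, D = 30, 4
X = rng.standard_normal((N, D))   # synthetic inputs
y = rng.standard_normal(N)        # synthetic targets

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # a least-squares fit
error = y - X @ w_ls                           # the residual e

print(X.T @ error)   # numerically ~0 in every entry: e is orthogonal to the columns of X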
Least Squares
The matrix $\mathbf{X}^\top\mathbf{X} \in \mathbb{R}^{D\times D}$ is called the Gram matrix. If it is invertible, we can multiply the normal equations by the inverse of the Gram matrix to obtain the closed-form solution
\[
\mathbf{w}^\star = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}.
\]
We can use this model to predict a new value for an unseen datapoint (test point) $\mathbf{x}_m$:
\[
\hat{y}_m = \mathbf{x}_m^\top \mathbf{w}^\star = \mathbf{x}_m^\top (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}.
\]
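A minimal NumPy sketch of this closed-form fit and prediction, on synthetic data (the shapes and variable names are illustrative):

import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 3
X = rng.standard_normal((N, D))                                      # training inputs
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(N)    # noisy targets

# Solve the normal equations X^T X w = X^T y (no explicit matrix inverse)
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for an unseen test point x_m
x_m = rng.standard_normal(D)
y_hat = x_m @ w_star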
The Gram matrix $\mathbf{X}^\top\mathbf{X}$ is invertible if and only if $\mathbf{X}$ has full column rank, i.e. $\operatorname{rank}(\mathbf{X}) = D$.

Proof: To see this, assume first that $\operatorname{rank}(\mathbf{X}) < D$. Then there exists a non-zero vector $\mathbf{u}$ so that $\mathbf{X}\mathbf{u} = \mathbf{0}$. It follows that $\mathbf{X}^\top\mathbf{X}\mathbf{u} = \mathbf{0}$, and so $\operatorname{rank}(\mathbf{X}^\top\mathbf{X}) < D$. Therefore, $\mathbf{X}^\top\mathbf{X}$ is not invertible.
Conversely, assume that $\mathbf{X}^\top\mathbf{X}$ is not invertible. Hence, there exists a non-zero vector $\mathbf{v}$ so that $\mathbf{X}^\top\mathbf{X}\mathbf{v} = \mathbf{0}$. It follows that
\[
0 = \mathbf{v}^\top\mathbf{X}^\top\mathbf{X}\mathbf{v} = (\mathbf{X}\mathbf{v})^\top(\mathbf{X}\mathbf{v}) = \|\mathbf{X}\mathbf{v}\|^2,
\]
so $\mathbf{X}\mathbf{v} = \mathbf{0}$ with $\mathbf{v} \neq \mathbf{0}$, and hence $\operatorname{rank}(\mathbf{X}) < D$.

In practice, if some columns of $\mathbf{X}$ are (nearly) collinear, then the matrix is ill-conditioned, which leads to numerical difficulties when solving the linear system $\mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{X}^\top\mathbf{y}$.
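A brief synthetic illustration of this failure mode (the data below is made up, with one column an exact multiple of another):

import numpy as np

rng = np.random.default_rng(3)
N = 20
a = rng.standard_normal(N)
X = np.column_stack([a, 2 * a])        # second column exactly collinear with the first

gram = X.T @ X
print(np.linalg.matrix_rank(gram))     # 1 < D = 2: the Gram matrix is rank-deficient
print(np.linalg.cond(gram))            # enormous condition number: ill-conditioned

# np.linalg.solve(gram, X.T @ y) would fail or be unreliable here;
# np.linalg.lstsq still returns a (minimum-norm) least-squares solution.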
Summary of Linear Regression
We have studied three types of methods:
1. Grid Search
2. Iterative Optimization Algorithms: (Stochastic) Gradient Descent
3. Least Squares: closed-form solution, for linear MSE
Additional Notes
Solving linear systems
There are many ways to solve a linear system Xw = y, but it usually involves a decomposition of the matrix X, such as the QR or LU decomposition, which are very robust. Matlab's backslash operator and also NumPy's linalg package implement this in just one line:
w = np.linalg.solve(X, y)
It is important to never invert a matrix to solve a linear system, as this would incur a cost at least three times the cost of using a linear solver. For more, see this blog post: https://gregorygundersen.com/blog/2020/12/09/matrix-inversion/.
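For the least-squares problem in particular, a decomposition-based route in NumPy is np.linalg.lstsq, which works directly on a tall X; the sketch below, on synthetic data, compares it with solving the normal equations.

import numpy as np

rng = np.random.default_rng(4)
N, D = 200, 10
X = rng.standard_normal((N, D))   # tall design matrix (N > D)
y = rng.standard_normal(N)

# Option 1: solve the square D x D normal equations, without forming an inverse
w1 = np.linalg.solve(X.T @ X, X.T @ y)

# Option 2: dedicated least-squares routine applied directly to X
w2, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w1, w2))   # True when X has full column rank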