
Regression using Least Squares

The fundamental problem in supervised machine learning is to fit a function to a series of point evaluations. We are given data
$$(t_1, y_1), \ldots, (t_M, y_M), \quad \text{with } t_m \in \mathbb{R}^D \text{ and } y_m \in \mathbb{R},$$
and want to find a function $f : \mathbb{R}^D \to \mathbb{R}$ such that
$$f(t_m) \approx y_m, \quad m = 1, \ldots, M. \qquad (1)$$

When the $y_m$ (and the range of $f$) are continuous-valued, the process of fitting such an $f$ is called regression. The name is a terrible one, but the terminology has become ingrained through 130+ years of use, so we are stuck with it.

We need two key ingredients to put the task in (1) on firm mathematical ground: a model function class and a loss function.

Function class. For the problem (1) to make sense, our search needs to be limited to a class of functions $\mathcal{F}$. For example, we might restrict $\mathcal{F}$ to polynomials, or to twice-differentiable functions, etc. Choosing this $\mathcal{F}$ is a modeling problem, as it basically encodes what we believe to be true (or reasonably close to true) about the function we are trying to discover. It also has to be something we can compute with. The examples below will take $\mathcal{F}$ to be an appropriately chosen Hilbert space, or a subspace of a Hilbert space; these will both lead to algorithms that lean heavily on linear algebra.

In some cases, this $\mathcal{F}$ is chosen implicitly. This is the case in modern multi-layer neural networks, where the structure for computing the function is defined, but it is hard to describe exactly the class of functions they can represent.

With the choice of $\mathcal{F}$ fixed, the problem (1) now becomes
$$\text{find } f \in \mathcal{F} \text{ such that } f(t_m) \approx y_m, \quad m = 1, \ldots, M. \qquad (2)$$

Loss function. This penalizes the deviation of the $f(t_m)$ from the $y_m$. Given a loss function $\ell(\cdot, \cdot) : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ that quantifies the penalty for the deviation at a single sample location, we can assign a (positive) numeric score to the performance of every candidate $f$ by summing over all the samples. This allows us to write (2) more precisely as an optimization problem:
$$\min_{f \in \mathcal{F}} \sum_{m=1}^{M} \ell(y_m, f(t_m)).$$

There are again many loss functions you might consider, and
depending on the context, some might be more natural than

others. We will focus almost all of our efforts on the squared loss $\ell(u, v) = |u - v|^2$. Then the problem becomes
$$\min_{f \in \mathcal{F}} \sum_{m=1}^{M} |y_m - f(t_m)|^2.$$

We will see below that this choice, coupled with a subspace or Hilbert space model for $\mathcal{F}$, allows us to completely solve the regression problem using linear algebra.



Linear regression

The most classical choice for the function class $\mathcal{F}$ is that it contains all linear functions on $\mathbb{R}^D$. A function $f : \mathbb{R}^D \to \mathbb{R}$ is linear if
$$f(\alpha t_1 + \beta t_2) = \alpha f(t_1) + \beta f(t_2),$$

for all $\alpha, \beta \in \mathbb{R}$ and $t_1, t_2 \in \mathbb{R}^D$. It is a fact that every linear functional on $\mathbb{R}^D$ is uniquely represented by a vector $w$, where
$$f(t) = \langle t, w \rangle = t^T w.$$

So given the $\{(t_m, y_m)\}$, we want to find $w$ such that
$$y_m \approx t_m^T w,$$

and using the squared loss to measure the mismatch, we have the following optimization problem
$$\min_{w \in \mathbb{R}^D} \sum_{m=1}^{M} |y_m - t_m^T w|^2.$$

If we stack up the $t_m$ as rows in an $M \times D$ matrix $A$, the sum in the optimization program above becomes $\|y - Aw\|_2^2$, where
$$A = \begin{bmatrix} t_1^T \\ t_2^T \\ \vdots \\ t_M^T \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix}.$$

We can then find the best linear function on $\mathbb{R}^D$ by solving the finite-dimensional least-squares problem
$$\min_{w \in \mathbb{R}^D} \|y - Aw\|_2^2.$$

Given a solution $\hat{w}$ to the above, we have the best linear function $\hat{f}$, where
$$\hat{f}(t) = t^T \hat{w} = \hat{w}_1 t_1 + \cdots + \hat{w}_D t_D.$$

It is often the case that we also want to add a constant offset to the above, and find an affine function of the form
$$f(t) = w_0 + w_1 t_1 + \cdots + w_D t_D.$$



This is done simply by adding a column of all ones to $A$:
$$A' = \begin{bmatrix} 1 & t_1^T \\ 1 & t_2^T \\ \vdots & \vdots \\ 1 & t_M^T \end{bmatrix},$$
and then solving
$$\min_{w' \in \mathbb{R}^{D+1}} \|y - A'w'\|_2^2.$$
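
As a concrete illustration, here is a minimal numerical sketch in Python/numpy of the affine fit just described. The dimensions, the synthetic data, and the variable names are illustrative assumptions, not anything prescribed by the handout.

import numpy as np

# Fit an affine function f(t) = w0 + <t, w> to synthetic data by solving the
# least-squares problem min ||y - A'w'||_2^2 described above.
rng = np.random.default_rng(0)
M, D = 50, 3                                 # number of samples, input dimension
T = rng.standard_normal((M, D))              # rows are the sample locations t_m
w_true = np.array([1.0, -2.0, 0.5])
y = 4.0 + T @ w_true + 0.1 * rng.standard_normal(M)   # noisy affine responses

A_prime = np.hstack([np.ones((M, 1)), T])    # prepend the all-ones column
w_prime, *_ = np.linalg.lstsq(A_prime, y, rcond=None)

w0, w = w_prime[0], w_prime[1:]
print("offset:", w0)                         # close to 4.0
print("weights:", w)                         # close to w_true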



Nonlinear regression using a basis

It is easy to generalize what is in the previous section to fit nonlinear functions to data. What is interesting is that we still end up with a very similar linear least-squares problem.

We do this by letting $\mathcal{F}$ be a finite-dimensional subspace (of a Hilbert space) spanned by a set of specified basis functions. Given a set of building blocks $\psi_n : \mathbb{R}^D \to \mathbb{R}$ (i.e. basis functions), the model implicit here is that our target function $f : \mathbb{R}^D \to \mathbb{R}$ is (at least approximately) in the subspace spanned by $\psi_1, \ldots, \psi_N$, that is, there exists $\{x_n\}$ such that
$$f(t) = \sum_{n=1}^{N} x_n \psi_n(t).$$

Fitting a function in this space $\mathcal{F}$ is the same as fitting an $x \in \mathbb{R}^N$ such that
$$y_1 \approx \sum_{n=1}^{N} x_n \psi_n(t_1), \quad \cdots, \quad y_M \approx \sum_{n=1}^{N} x_n \psi_n(t_M).$$



With the least-squares loss, we want to solve
$$\min_{x \in \mathbb{R}^N} \sum_{m=1}^{M} \Big|y_m - \sum_{n=1}^{N} x_n \psi_n(t_m)\Big|^2 = \sum_{m=1}^{M} |y_m - \Psi(t_m)^T x|^2, \qquad (3)$$
where $\Psi(\cdot) : \mathbb{R}^D \to \mathbb{R}^N$ is
$$\Psi(t) = \begin{bmatrix} \psi_1(t) \\ \psi_2(t) \\ \vdots \\ \psi_N(t) \end{bmatrix}.$$

Constructing the $M \times N$ matrix $A$ as
$$A = \begin{bmatrix} \Psi(t_1)^T \\ \Psi(t_2)^T \\ \vdots \\ \Psi(t_M)^T \end{bmatrix} = \begin{bmatrix} \psi_1(t_1) & \psi_2(t_1) & \cdots & \psi_N(t_1) \\ \psi_1(t_2) & \psi_2(t_2) & \cdots & \psi_N(t_2) \\ \vdots & \vdots & & \vdots \\ \psi_1(t_M) & \psi_2(t_M) & \cdots & \psi_N(t_M) \end{bmatrix}, \qquad (4)$$

the optimization program becomes
$$\min_{x \in \mathbb{R}^N} \|y - Ax\|_2^2.$$

From a solution $\hat{x}$ to the program above, we can synthesize the solution to (3) as
$$\hat{f}(t) = \sum_{n=1}^{N} \hat{x}_n \psi_n(t).$$

Even though we are ultimately recovering a nonlinear function of a continuous variable, introducing a basis allows us to put the nonlinear regression problem into the exact same computational framework as linear regression.

Play around with the code in example-regression.m. Here, we take $M$ (noisy) samples of an underlying function, and then perform regression using $N$ basis functions. Notice that increasing $N$ makes the class richer, and corresponds to adding columns to $A$. This increases our ability to match the samples, but also increases the risk of overfitting the model.
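
For readers working in Python rather than MATLAB, the following rough sketch mirrors that experiment with numpy. The monomial basis, the underlying cosine, the noise level, and the dimensions are all illustrative assumptions; they are not necessarily what example-regression.m uses.

import numpy as np

# Sample a nonlinear function with noise, then regress onto N basis functions.
# Basis here: the monomials psi_n(t) = t^(n-1), n = 1, ..., N.
rng = np.random.default_rng(1)
M, N = 40, 8
t = np.sort(rng.uniform(0.0, 1.0, M))                       # sample locations t_m in R
y = np.cos(4 * np.pi * t) + 0.1 * rng.standard_normal(M)    # noisy samples y_m

A = np.vander(t, N, increasing=True)                        # A[m, n] = psi_n(t_m), as in (4)
x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

def f_hat(s):
    # synthesize f_hat(s) = sum_n x_hat[n] * psi_n(s)
    return np.vander(np.atleast_1d(s), N, increasing=True) @ x_hat

print("residual on the samples:", np.linalg.norm(y - A @ x_hat))
# Increasing N adds columns to A: the residual shrinks, but the risk of
# overfitting grows.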



The least-squares problem

We have seen that in both the linear regression case and the nonlinear regression case, the problem reduces to the same form. We will look at problems like this multiple times in this course, and it is worth starting to study this kind of problem from the perspective of linear algebra.

We start with the following fundamental result: any solution $\hat{x}$ to
$$\min_{x \in \mathbb{R}^N} \|y - Ax\|_2^2 \qquad (5)$$
must obey the normal equations
$$A^T A \hat{x} = A^T y. \qquad (6)$$

To see this, note that $g(x) = \|y - Ax\|_2^2$ is a convex, differentiable function on all of $\mathbb{R}^N$. Thus a necessary condition for $\hat{x}$ to be a minimizer of $g$ is that the gradient vanishes at $\hat{x}$: $\nabla g(\hat{x}) = 0$. We have
$$\nabla\big(\|y - Ax\|_2^2\big) = \nabla\big(\|y\|_2^2 - 2y^T Ax + \|Ax\|_2^2\big) = -2A^T y + 2A^T A x,$$
which is $0$ exactly when (6) holds.
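
A quick numerical sanity check of (6), assuming $A$ has full column rank so the normal equations have a unique solution (the dimensions below are arbitrary illustrative choices):

import numpy as np

# When A has full column rank, solving A^T A x = A^T y gives the same
# minimizer as a generic (SVD-based) least-squares solver.
rng = np.random.default_rng(2)
M, N = 30, 5
A = rng.standard_normal((M, N))       # full column rank with probability 1
y = rng.standard_normal(M)

x_normal = np.linalg.solve(A.T @ A, A.T @ y)      # solve the normal equations
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)   # generic least squares

print(np.allclose(x_normal, x_lstsq))             # True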

There are several natural questions at this point. First, is there any guarantee that (6) even has a solution? Are there conditions under which the solution is guaranteed to be unique or non-unique? And if the solution is non-unique, what should we do?

We will answer these questions using the following two facts from linear algebra. For any $M \times N$ matrix $A$,

1. $\text{Null}(A^T A) = \text{Null}(A)$, and

2. $\text{Col}(A^T A) = \text{Row}(A)$.

(try to prove these at home)

Now it is easy to argue that

1. The system (6) always has a solution, no matter what $A$ and $y$ are. This follows immediately from the fact that
$$A^T y \in \text{Row}(A) = \text{Col}(A^T A).$$
This means that there is always at least one minimizer of (5).

2. If $\text{rank}(A) = N$, then (6) has a unique solution, and so (5) has a unique minimizer. The unique minimizer is
$$\hat{x} = (A^T A)^{-1} A^T y.$$

3. If $\text{rank}(A) < N$, then there are an infinite number of solutions to (5). In this case, $A$ has a non-trivial null space, so if $\hat{x}$ is a solution to (5), so is $\hat{x} + v$ for all $v \in \text{Null}(A)$, as
$$\|y - A(\hat{x} + v)\|_2^2 = \|y - A\hat{x}\|_2^2,$$
since $Av = 0$. (A small numerical sketch of this case appears after this list.)

4. If $\text{rank}(A) = M$, then there exists at least one $\hat{x}$ such that $\|y - A\hat{x}\|_2 = 0$, that is, $A\hat{x} = y$. If in addition $M < N$, then there will be an infinity of such solutions.
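
Here is the small numerical sketch promised in point 3: a rank-deficient $A$ is manufactured by repeating a column, and adding a null-space vector to a least-squares solution leaves the residual unchanged. The construction below is just one convenient illustration.

import numpy as np

# If A is rank deficient, adding any v in Null(A) to a least-squares solution
# does not change the residual. A repeated column is an easy way to build a
# matrix with a non-trivial null space.
rng = np.random.default_rng(3)
M, N = 20, 4
B = rng.standard_normal((M, N - 1))
A = np.hstack([B, B[:, [0]]])          # last column repeats the first: rank(A) = N - 1
y = rng.standard_normal(M)

x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)     # one particular solution
v = np.zeros(N)
v[0], v[-1] = 1.0, -1.0                           # A v = 0, so v is in Null(A)

r1 = np.linalg.norm(y - A @ x_hat)
r2 = np.linalg.norm(y - A @ (x_hat + 5.0 * v))
print(np.isclose(r1, r2))                         # True: same residual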



Minimum $\ell_2$ solutions

When there are an infinite number of solutions, we need a way to pick one of them. For this, modeling will again come into play; we need some kind of value system other than the objective in (5) to judge which of the solutions is best. There are many, many ways this can be done; we will look at two variations of one idea below.

One principle is to choose the solution that is the smallest (i.e. closest to the origin). In many applications this is justified using some kind of minimum energy principle, the sum of squares being a proxy for the amount of resources it takes to implement a solution. It can also be thought of as a kind of Principle of Parsimony (or Principle of Economy, Occam's razor), in that the solution should not be any bigger than it needs to be.

The minimum energy least-squares solution is now the solution to the optimization program
$$\min_{x \in \mathbb{R}^N} \|x\|_2^2 \quad \text{subject to} \quad A^T A x = A^T y. \qquad (7)$$

It happens that there is now always a unique solution to this program. We will revisit this problem through the lens of the singular value decomposition. For now, we will look at the case where $A$ has full row rank, $\text{rank}(A) = M$. In this case, $A^T A x = A^T y$ if and only if $Ax = y$, so the above program simplifies to
$$\min_{x \in \mathbb{R}^N} \|x\|_2^2 \quad \text{subject to} \quad Ax = y. \qquad (8)$$

The first thing to realize (and this holds in the general case as well) is that the solution to the above will be in $\text{Row}(A)$. We know that the row and null spaces are orthogonal complements of one another, and so every $x \in \mathbb{R}^N$ can be written as
$$x = x_1 + x_2, \quad \text{where } x_1 \in \text{Row}(A),\ x_2 \in \text{Null}(A),$$
and $x_1^T x_2 = 0$ of course. We can recast the optimization in (8) as a search over $x_1$ and $x_2$:
$$\min_{\substack{x_1 \in \text{Row}(A) \\ x_2 \in \text{Null}(A)}} \|x_1 + x_2\|_2^2 \quad \text{subject to} \quad A(x_1 + x_2) = y.$$
Since $Ax_2 = 0$ and $\|x_1 + x_2\|_2^2 = \|x_1\|_2^2 + \|x_2\|_2^2$ (why?), the above simplifies to
$$\min_{\substack{x_1 \in \text{Row}(A) \\ x_2 \in \text{Null}(A)}} \|x_1\|_2^2 + \|x_2\|_2^2 \quad \text{subject to} \quad Ax_1 = y.$$

Since $x_2$ no longer appears in the constraints, we see that the minimizer will have $x_2 = 0$, i.e. the solution lies entirely in $\text{Row}(A)$.

When $\text{rank}(A) = M$, we can use the fact that $\hat{x} \in \text{Row}(A)$ to derive a closed-form solution. We know that there exists a $\hat{v}$ such that
$$\hat{x} = A^T \hat{v}, \quad \text{and} \quad AA^T \hat{v} = y.$$

Since $\text{rank}(A) = M$, we know that the $M \times M$ matrix $AA^T$ is invertible, and so there is exactly one $\hat{v}$ that obeys the second condition above, namely $\hat{v} = (AA^T)^{-1}y$. This gives the closed form $\hat{x} = A^T(AA^T)^{-1}y$.
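
A short numerical sketch of this closed form, under the assumption $\text{rank}(A) = M < N$. It also uses the fact that numpy's lstsq returns the minimum-norm least-squares solution, so the two computations should agree.

import numpy as np

# Minimum-norm solution when rank(A) = M < N: the closed form
# x_hat = A^T (A A^T)^{-1} y should match np.linalg.lstsq, which returns the
# minimum-norm least-squares solution for underdetermined systems.
rng = np.random.default_rng(4)
M, N = 5, 12
A = rng.standard_normal((M, N))       # full row rank with probability 1
y = rng.standard_normal(M)

v_hat = np.linalg.solve(A @ A.T, y)   # the unique v_hat with A A^T v_hat = y
x_closed = A.T @ v_hat                # x_hat = A^T v_hat, lies in Row(A)
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(A @ x_closed, y))         # y is matched exactly
print(np.allclose(x_closed, x_lstsq))       # same minimum-norm solution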

We will derive the solution to this problem for $\text{rank}(A) < M$ later when we talk about the singular value decomposition.



Regularization

The problem with solving (5) when $A$ has a non-trivial null space is not just that there are an infinity of solutions; it is also that the space of solutions is unbounded: you can add a null-space vector of arbitrary size and not change the functional. Even when $A$ technically has full column rank, if it is poorly conditioned, then a small change in $y$ can amount to a massive change in $\hat{x}$.

We will look much more carefully at describing how well-conditioned the least-squares problem is when we talk about the SVD.

In place of (5), we solve
$$\min_{x \in \mathbb{R}^N} \|y - Ax\|_2^2 + \delta\|x\|_2^2 \qquad (9)$$
for some $\delta \geq 0$. We are favoring solutions close to the origin, but this formulation can be applied for any $A$, no matter the rank. It also gives us a little more flexibility in the model, as we can treat $\delta$ as a knob that sets the trade-off between how closely we want $Ax$ to match $y$, and how large $x$ can be. Note that when $\text{rank}(A) = M$, the solution to (9) goes to the solution to (8) as $\delta \to 0$.



The analog of the normal equations for (9) is that a solution $\hat{x}$ must obey
$$(A^T A + \delta I)\hat{x} = A^T y. \qquad (10)$$

Unlike $A^T A$ itself, the matrix $A^T A + \delta I$ is always invertible for $\delta > 0$, no matter what $A$ is. Thus the unique solution to (9) is
$$\hat{x} = (A^T A + \delta I)^{-1} A^T y.$$

It also turns out that we can write this as
$$\hat{x} = A^T (AA^T + \delta I)^{-1} y.$$
You can verify this by plugging the expression above into the left-hand side of (10).
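
A quick numerical check of the two equivalent expressions (the dimensions and $\delta$ below are arbitrary illustrative choices):

import numpy as np

# The two ridge expressions derived above agree:
# (A^T A + delta I)^{-1} A^T y  and  A^T (A A^T + delta I)^{-1} y.
rng = np.random.default_rng(5)
M, N = 15, 40                          # more unknowns than equations is fine here
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)
delta = 0.1

x1 = np.linalg.solve(A.T @ A + delta * np.eye(N), A.T @ y)   # N x N system
x2 = A.T @ np.linalg.solve(A @ A.T + delta * np.eye(M), y)   # M x M system

print(np.allclose(x1, x2))    # True; the second form is cheaper when M << N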

From a high-level perspective, the new objective function recognizes that making $Ax$ as close to $y$ as possible should not be the only thing we are interested in. Models depending solely on $A$ are rarely perfect. Later we will talk about this point in relation to prior information on the unknown parameter $x$. In the context of regression, using (9) is often referred to as ridge regression. If the $\ell_1$ norm is used in place of the $\ell_2$ norm, it is referred to as lasso regression:
$$\min_{x \in \mathbb{R}^N} \|y - Ax\|_2^2 + \delta\|x\|_1 \qquad (11)$$
for some $\delta \geq 0$.
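
As a rough illustration of the difference between the two penalties, the sketch below uses scikit-learn's Ridge and Lasso. Note that sklearn's Lasso scales the data-fit term by 1/(2M), so its alpha is not literally the $\delta$ in (11); the point is only the qualitative behavior, with lasso driving most weights exactly to zero while ridge merely shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Sparse ground truth: only the first 3 of N = 20 weights are nonzero.
rng = np.random.default_rng(6)
M, N = 60, 20
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[:3] = [2.0, -1.5, 1.0]
y = A @ x_true + 0.1 * rng.standard_normal(M)

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(A, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(A, y)

print("nonzeros in ridge solution:", np.sum(np.abs(ridge.coef_) > 1e-8))   # ~N
print("nonzeros in lasso solution:", np.sum(np.abs(lasso.coef_) > 1e-8))   # ~3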



Ridge regression using a basis

Let us return to the fundamental problem of fitting a function to data (regression). Above, we saw that once a basis was introduced, we could set this up as a least-squares problem, and then we saw one way to make this least-squares problem better posed using regularization.

Another variation is to impose the regularization in the function space. Suppose that we are looking for an $f$ in a Hilbert space $\mathcal{S}$ that matches observed data $\{(t_m, y_m)\}$. The associated regularized least-squares problem is
$$\min_{f \in \mathcal{S}} \sum_{m=1}^{M} |y_m - f(t_m)|^2 + \delta\|f\|_{\mathcal{S}}^2. \qquad (12)$$

Note that we are using the Hilbert space norm to penalize the size of the function. If we again introduce a basis, modeling the target function as being in the span of $\psi_1, \ldots, \psi_N$, then we can rewrite this as
$$\min_{x \in \mathbb{R}^N} \|y - Ax\|_2^2 + \delta\Big\|\sum_{n=1}^{N} x_n \psi_n\Big\|_{\mathcal{S}}^2,$$

where $A$ is constructed as in (4). Expanding the term on the right above gives us
$$\Big\|\sum_{n=1}^{N} x_n \psi_n\Big\|_{\mathcal{S}}^2 = \Big\langle \sum_{n=1}^{N} x_n \psi_n, \ \sum_{k=1}^{N} x_k \psi_k \Big\rangle_{\mathcal{S}} = \sum_{n=1}^{N}\sum_{k=1}^{N} x_n x_k \langle \psi_n, \psi_k \rangle_{\mathcal{S}} = x^T G x,$$

where $G$ is the Gram matrix for the basis, $G_{n,k} = \langle \psi_n, \psi_k \rangle_{\mathcal{S}}$. Thus the optimization (12) in Hilbert space becomes the finite-dimensional problem
$$\min_{x \in \mathbb{R}^N} \|y - Ax\|_2^2 + \delta x^T G x, \qquad (13)$$

which has the closed-form solution
$$\hat{x} = (A^T A + \delta G)^{-1} A^T y.$$

We then synthesize the solution as before,
$$\hat{f}(t) = \sum_{n=1}^{N} \hat{x}_n \psi_n(t).$$

When the $\{\psi_n\}$ are orthonormal, then $G = I$, and (13) is equivalent to the standard regularized least-squares problem. When the basis is not orthonormal, then solving (13) and (9) will in general produce different weights and hence different synthesized functions. But in practice, if the basis is reasonable, then the difference will be minor.
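
To see (13) in action with a non-orthonormal basis, the sketch below uses the monomials $\psi_n(t) = t^{n-1}$ on $[0, 1]$ viewed as elements of $L_2([0, 1])$, whose Gram matrix is the Hilbert matrix; the function being sampled, the noise level, and $\delta$ are illustrative assumptions.

import numpy as np

# Ridge regression with a Gram-matrix penalty, problem (13). For monomials on
# [0, 1], the Gram matrix is the Hilbert matrix G[n, k] = 1/(n + k + 1) with
# 0-indexed n, k, which is not the identity, so (13) and (9) generally differ.
rng = np.random.default_rng(7)
M, N, delta = 30, 6, 1e-3
t = np.sort(rng.uniform(0.0, 1.0, M))
y = np.sin(2 * np.pi * t) + 0.05 * rng.standard_normal(M)

A = np.vander(t, N, increasing=True)                    # A[m, n] = psi_n(t_m), as in (4)
n = np.arange(N)
G = 1.0 / (n[:, None] + n[None, :] + 1)                 # Gram matrix <psi_n, psi_k>

x_G = np.linalg.solve(A.T @ A + delta * G, A.T @ y)         # solution of (13)
x_I = np.linalg.solve(A.T @ A + delta * np.eye(N), A.T @ y) # solution of (9)

print("||x_G - x_I|| =", np.linalg.norm(x_G - x_I))     # nonzero in general, since G != I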
