
Bindel, Fall 2019 Matrix Computation

2019-10-18

1 Least squares and minimal norm problems


The least squares problem with Tikhonov regularization is

\[
  \mbox{minimize } \frac{1}{2} \|Ax - b\|_2^2 + \frac{\eta^2}{2} \|x\|_2^2.
\]
The Tikhonov regularized problem is useful for understanding the connection
between least squares solutions to overdetermined problems and minimal
norm solutions to underdetermined problems. For $\eta > 0$, the problem admits
a unique solution independent of whether $m \geq n$ or $m < n$, and independent
of whether or not $A$ has maximal rank. The limit as $\eta \rightarrow 0$ is also well
defined: it is the smallest-norm $x$ that minimizes the residual $\|Ax - b\|$.
We usually write the Tikhonov-regularized solution as

\[
  x_\eta = (A^T A + \eta^2 I)^{-1} A^T b.
\]

However, we can get some interesting insights by writing the regularized
normal equations
\[
  \begin{bmatrix} -I & A \\ A^T & \eta^2 I \end{bmatrix}
  \begin{bmatrix} r_\eta \\ x_\eta \end{bmatrix} =
  \begin{bmatrix} b \\ 0 \end{bmatrix}.
\]
The first equation in this system defines the residual ($r = Ax - b$), while the
second gives the regularized version of the normal equation ($A^T r + \eta^2 x = 0$).
Eliminating the $r$ variable gives us the regularized normal equation in the
form we have seen before; but we can also eliminate $x$ to yield
\[
  (-I - \eta^{-2} A A^T) r_\eta = b.
\]

Scaling variables, we have
\[
  (A A^T + \eta^2 I) r_\eta = -\eta^2 b,
\]

and by substituting into the equation $A^T r + \eta^2 x = 0$, we have an alternate
expression for the solution to the regularized problem:
\[
  x_\eta = A^T (A A^T + \eta^2 I)^{-1} b.
\]

Thus, playing around with the regularized normal equations gives us two
different expressions for $x_\eta$:
\begin{align*}
  x_\eta &= (A^T A + \eta^2 I)^{-1} A^T b \\
         &= A^T (A A^T + \eta^2 I)^{-1} b.
\end{align*}

In the full-rank overdetermined case ($m > n$), the former expression gives
us the usual least squares solution $(A^T A)^{-1} A^T b$; in the full-rank under-determined
case ($m < n$), the latter expression gives us the usual minimal norm
solution $A^T (A A^T)^{-1} b$.
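
As a quick numerical check, the following minimal NumPy sketch (the matrix sizes and the value of $\eta$ are arbitrary choices made only for illustration) verifies that the two formulas agree, and that for small $\eta$ they land near the ordinary least squares and minimal norm solutions:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 1e-2

# Overdetermined case (m > n): both formulas agree and are close to the
# ordinary least squares solution for small eta.
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
x1 = np.linalg.solve(A.T @ A + eta**2 * np.eye(5), A.T @ b)
x2 = A.T @ np.linalg.solve(A @ A.T + eta**2 * np.eye(20), b)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x1 - x2), np.linalg.norm(x1 - x_ls))

# Underdetermined case (m < n): both formulas agree and are close to the
# minimal norm solution A^T (A A^T)^{-1} b for small eta.
A, b = rng.standard_normal((5, 20)), rng.standard_normal(5)
x1 = np.linalg.solve(A.T @ A + eta**2 * np.eye(20), A.T @ b)
x2 = A.T @ np.linalg.solve(A @ A.T + eta**2 * np.eye(5), b)
x_mn = A.T @ np.linalg.solve(A @ A.T, b)
print(np.linalg.norm(x1 - x2), np.linalg.norm(x1 - x_mn))
```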
For the majority of this lecture, we will focus on the minimal norm
solution to underdetermined problems and its role in kernel methods. However,
the connection between the regularized form of the minimal norm solution
in the underdetermined case and the regularized form of the least squares
problem in the overdetermined case will be relevant to a discussion at the
end of the lecture on (one) fast method for kernel-based fitting.

2 Feature maps and the kernel trick


In our first lecture on least squares, we described one of the standard uses of
least squares: fitting a linear model to data. That is, given (possibly noisy)
observations $y_i = f(x_i)$ for $x_i \in \mathbb{R}^n$, we fit $f(x) \approx s(x) = x^T \beta$ by minimizing
the squared error
\[
  \min_\beta \|X\beta - y\|^2,
\]

where $X$ is a matrix whose $i$th row is the vector of coordinates for the $i$th
data point. Even in simple applications of least squares, however, a purely
linear model may not be adequate for modeling $f$; we might at least want to
consider affine or polynomial functions in the coordinates, if not something
more exotic. A simple way to get more complex models is to introduce
a feature map that takes our original points in $\mathbb{R}^n$ and maps them into a
higher-dimensional space where we will fit our linear model:
\[
  s(x) = \phi(x)^T \beta, \qquad \phi : \mathbb{R}^n \rightarrow \mathbb{R}^N.
\]

The features $\phi_1, \ldots, \phi_N$ are chosen in advance; the regression coefficients $\beta$
are fit according to the data. When $m > N$, we would fit the regression
coefficients by minimizing the residual norm $\|\Phi \beta - y\|$, where $[\Phi]_{ij} = \phi_j(x_i)$.
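
For concreteness, here is a small sketch of such a fit in the $m > N$ regime (the one-dimensional inputs, the cubic monomial features, and the test function are hypothetical choices made only for illustration):

```python
import numpy as np

def phi(x):
    # Hypothetical feature map: cubic monomial features of a scalar input (N = 4).
    return np.array([1.0, x, x**2, x**3])

rng = np.random.default_rng(1)
xs = rng.uniform(-1, 1, size=50)                    # m = 50 sample points
y = np.sin(3 * xs) + 0.05 * rng.standard_normal(50) # noisy observations

Phi = np.array([phi(x) for x in xs])                # [Phi]_ij = phi_j(x_i), m > N
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # minimize ||Phi beta - y||
s = lambda x: phi(x) @ beta                         # fitted model s(x) = phi(x)^T beta
print(s(0.5))
```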

But often we are interested in the case when $N \gg m$, in which case we seek
a minimal norm solution to the underdetermined problem, i.e.
\[
  \beta = \Phi^T (\Phi \Phi^T)^{-1} y.
\]

Substituting this into our formula for $s$, we have
\[
  s(x) = \phi(x)^T \Phi^T (\Phi \Phi^T)^{-1} y.
\]

Now, define the kernel function $k(x, x') = \phi(x)^T \phi(x')$; then we can rewrite
$s(x)$ in terms of the kernel function as
\[
  s(x) = k_{xX} K_{XX}^{-1} y,
\]

where $X = (x_1, x_2, \ldots, x_m)$ is the list of sample coordinates, and the subscript
$X$ means "form a matrix or vector where $x_1, \ldots, x_m$ are inserted into this
argument in turn," i.e.
\[
  k_{xX} = \begin{bmatrix} k(x, x_1) & k(x, x_2) & \cdots & k(x, x_m) \end{bmatrix},
  \qquad
  K_{XX} = \begin{bmatrix}
    k(x_1, x_1) & \cdots & k(x_1, x_m) \\
    \vdots & \ddots & \vdots \\
    k(x_m, x_1) & \cdots & k(x_m, x_m)
  \end{bmatrix}.
\]

Having expressed our interpolant purely in terms of the kernel function, we
can now dispense with the feature map and the corresponding $\beta$ coefficients:
only the kernel matters. For common kernels used in approximation theory
and statistics (such as the Matérn family or the squared exponential kernel),
we usually don't bother to write down an associated feature map.

3 Placing parens and alternate interpretations


The expression
\[
  s(x) = \phi(x)^T \Phi^T (\Phi \Phi^T)^{-1} y
\]
involves a product of several terms that we can group in different ways:
\begin{align*}
  s(x) &= \phi(x)^T \beta, & \beta &= \Phi^\dagger y \\
  s(x) &= k_{xX} c,        & c &= K_{XX}^{-1} y \\
  s(x) &= d(x)^T y,        & d(x) &= K_{XX}^{-1} K_{Xx} = (\Phi^T)^\dagger \phi(x).
\end{align*}

We have already discussed the meaning of the first of these groupings, with
$\beta$ as a minimal norm solution to an underdetermined linear system relating
features to observations. We now comment on the other two.
The expression
\[
  s(x) = k_{xX} c = \sum_{i=1}^m k(x, x_i) c_i
\]

involves basis functions $x \mapsto k(x, x_i)$ depending on the location of the data
sites $x_1, \ldots, x_m$. Many common kernels depend only on the distance between
the two arguments; for example, the squared exponential kernel is
\[
  k_{\mathrm{SE}}(x, x') = \psi(\|x - x'\|; \sigma), \qquad
  \psi(r; \sigma) = \exp(-r^2 / 2\sigma^2).
\]

In this case, we would have
\[
  s(x) = \sum_{i=1}^m \psi(\|x - x_i\|) c_i,
\]

i.e. $s(x)$ is a linear combination of translates of the function $\psi$. The
coefficients $c_i$ are simply chosen to satisfy the interpolation conditions.
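
As a concrete instance of this expansion, here is a sketch of squared exponential interpolation in one dimension (the data sites, the test function, and $\sigma = 0.1$ are arbitrary choices made only for illustration):

```python
import numpy as np

sigma = 0.1
psi = lambda r: np.exp(-r**2 / (2 * sigma**2))     # squared exponential profile psi(r)

X = np.linspace(0, 1, 10)                          # data sites x_1, ..., x_m
y = np.sin(2 * np.pi * X)                          # observed values

KXX = psi(np.abs(X[:, None] - X[None, :]))         # [K_XX]_ij = psi(|x_i - x_j|)
c = np.linalg.solve(KXX, y)                        # interpolation conditions K_XX c = y

s = lambda x: psi(np.abs(x - X)) @ c               # s(x) = sum_i psi(|x - x_i|) c_i
print(s(X[3]), y[3])                               # s reproduces the data at a site
```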
The expression
\[
  s(x) = d(x)^T y = \sum_{i=1}^m d_i(x) y_i
\]

is an expansion of $s$ in terms of the Lagrange functions $d_i$, which satisfy
$d_i(x_j) = \delta_{ij}$. Another way of thinking about $d(x)$ involves the least squares
formulation
\[
  \mbox{minimize } \|\Phi^T d(x) - \phi(x)\|^2.
\]
Why is this a sensible thing to do? The least squares formulation is attempting
to solve the approximation problem
\[
  \phi_i(x) \approx \sum_{j=1}^m \phi_i(x_j) d_j(x)
\]

in a least squares sense; that is, for a collection of representative functions
(the features), we are trying to predict the value at $x$ as a linear combination
of values at the sample points $x_1, \ldots, x_m$. Once we have that combination,
together with the function values $f(x_1), \ldots, f(x_m)$ (the $y$ vector), we use the
same linear combination to predict the value of $f(x)$.
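
Using the same hypothetical monomial feature map as in the earlier sketch, we can check both the Lagrange property $d_i(x_j) = \delta_{ij}$ and the least squares characterization of $d(x)$:

```python
import numpy as np

deg = 10
phi = lambda x: x ** np.arange(deg + 1)            # hypothetical monomial feature map
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=5)                     # m = 5 sample points
Phi = np.array([phi(x) for x in X])                # m-by-N feature matrix
KXX = Phi @ Phi.T                                  # K_XX = Phi Phi^T

def d(x):
    # Prediction weights d(x) = K_XX^{-1} K_Xx.
    return np.linalg.solve(KXX, Phi @ phi(x))

# Lagrange property: the weights at a data site pick out that site's value.
D = np.array([d(xj) for xj in X])
print(np.linalg.norm(D - np.eye(5)))               # approximately zero

# Least squares view: d(x) minimizes ||Phi^T d - phi(x)||.
xtest = 0.3
d_ls, *_ = np.linalg.lstsq(Phi.T, phi(xtest), rcond=None)
print(np.linalg.norm(d(xtest) - d_ls))             # approximately zero
```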

4 From kernels back to least squares


While there are several interpretations for the kernel system, in practice we
usually compute
\[
  K_{XX} c = y
\]
and then predict using $s(x) = k_{xX} c$. In general, this costs $O(m^3)$ time for
the initial fit and $O(m)$ time to evaluate the interpolant at each new point.
However, we can sometimes use the structure of the kernel to more quickly
compute the coefficients or predict at new points. Standard approaches
typically exploit either smoothness of the kernel or low-dimensional structure
of the distribution of points in the original space. We will briefly discuss the
former, using the connection between minimal norm problems and least squares
that we discussed earlier in the lecture.
For very smooth kernel functions with long length scales relative to the
spacing between points, the kernel matrix $K_{XX}$, though positive definite,
will be very ill-conditioned. In this case, we often work with a regularized
version of the fitting problem:
\[
  (K_{XX} + \eta^2 I) c = y.
\]

Often far fewer than $m$ eigenvalues of $K_{XX}$ are much greater than $\eta^2$,
and so we can effectively approximate the system by replacing $K_{XX}$ with a
low-rank factorization $A A^T$ (where $A$ has $p \ll m$ columns):
\[
  (A A^T + \eta^2 I) \hat{c} = y.
\]
Here, we can think of the rows of $A$ as being "reduced" feature vectors.


From our earlier discussion, we recognize that $\hat{c}$ is the scaled residual for a
regularized least squares problem with $A$; that is, if we solve
\[
  \mbox{minimize } \frac{1}{2} \|Au - y\|^2 + \frac{\eta^2}{2} \|u\|^2,
\]
then
\[
  \hat{c} = \eta^{-2} (y - Au) = -\eta^{-2} r.
\]
Moreover, suppose we know how to compute the reduced feature vector at
an evaluation point $x$, i.e. we can find $a_x$ such that
\[
  a_x^T A^T = k_{xX}.
\]

Then, using the regularized normal equation $A^T r + \eta^2 u = 0$, we have
\[
  k_{xX} \hat{c} = a_x^T A^T (-\eta^{-2} r) = a_x^T u.
\]
That is, solving the regularized kernel problem is (up to error associated with
the low-rank approximation) equivalent to solving a regularized least squares
problem, and we get the same predictions whether we compute the least
squares predictor $a_x^T u$ or the kernel-based predictor $k_{xX} \hat{c}$.
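
The sketch below illustrates that equivalence numerically; the factor $A$ (whose rows stand in for reduced feature vectors), the data $y$, and the reduced feature vector $a_x$ at the evaluation point are random stand-ins chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
m, p, eta = 50, 5, 1e-2
A = rng.standard_normal((m, p))                    # rows of A: "reduced" feature vectors
y = rng.standard_normal(m)                         # data values
a_x = rng.standard_normal(p)                       # reduced features at the evaluation point
k_xX = a_x @ A.T                                   # so that a_x^T A^T = k_xX

# Kernel-style solve with the low-rank-plus-regularization approximation.
c_hat = np.linalg.solve(A @ A.T + eta**2 * np.eye(m), y)

# Regularized least squares problem in the small variable u.
u = np.linalg.solve(A.T @ A + eta**2 * np.eye(p), A.T @ y)

print(k_xX @ c_hat, a_x @ u)                       # the two predictors agree
```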
