Least Squares and Minimum Norm Problems
2019-10-18
Consider the Tikhonov-regularized least squares problem
\[
\operatorname*{minimize}_x \; \frac{1}{2} \|Ax - b\|_2^2 + \frac{\eta^2}{2} \|x\|_2^2.
\]
The Tikhonov-regularized problem is useful for understanding the connection
between least squares solutions to overdetermined problems and minimal
norm solutions to underdetermined problems. For η > 0, the problem admits
a unique solution independent of whether m ≥ n or m < n, and independent
of whether or not A has maximal rank. The limit as η → 0 is also well-defined:
it is the smallest-norm x that minimizes the residual ∥Ax − b∥.
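This limiting behavior is easy to check numerically. The following is a minimal sketch, assuming NumPy; the random test matrix, right-hand side, and the value of η are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

eta = 1e-6
# Tikhonov-regularized solution x_eta = (A^T A + eta^2 I)^{-1} A^T b
x_eta = np.linalg.solve(A.T @ A + eta**2 * np.eye(n), A.T @ b)

# Minimum-norm least squares solution (the eta -> 0 limit)
x_star = np.linalg.pinv(A) @ b

print(np.linalg.norm(x_eta - x_star))  # tiny for small eta
```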
We usually write the Tikhonov-regularized solution as
\[
x_\eta = (A^T A + \eta^2 I)^{-1} A^T b.
\]
However, the residual $r_\eta = Ax_\eta - b$ satisfies
\[
(AA^T + \eta^2 I) r_\eta = -\eta^2 b,
\]
and the regularized normal equations give $\eta^2 x_\eta = -A^T r_\eta$, so we also have
\[
x_\eta = A^T (AA^T + \eta^2 I)^{-1} b.
\]
Thus, playing around with the regularized normal equations gives us two
different expressions for $x_\eta$:
\[
x_\eta = (A^T A + \eta^2 I)^{-1} A^T b = A^T (AA^T + \eta^2 I)^{-1} b.
\]
In the full-rank overdetermined case (m > n), the former expression gives
us the usual least-squares solution $(A^T A)^{-1} A^T b$ as η → 0; in the full-rank
underdetermined case (m < n), the latter expression gives us the usual
minimum-norm solution $A^T (AA^T)^{-1} b$.
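The agreement of the two expressions is also easy to check numerically; here is a minimal sketch, again assuming NumPy with an arbitrary test problem.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, eta = 8, 5, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Expression via the n-by-n regularized normal equations
x1 = np.linalg.solve(A.T @ A + eta**2 * np.eye(n), A.T @ b)
# Expression via the m-by-m system
x2 = A.T @ np.linalg.solve(A @ A.T + eta**2 * np.eye(m), b)

print(np.linalg.norm(x1 - x2))  # agreement to roundoff
```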
For the majority of this lecture, we will focus on the minimum-norm
solution to underdetermined problems and its role in kernel methods. However,
the connection between the regularized form of the minimum-norm solution
in the underdetermined case and the regularized form of the least squares
problem in the overdetermined case will be relevant to a discussion at the
end of the lecture on (one) fast method for kernel-based fitting.
Suppose we want to approximate a function f from sampled values $y_i = f(x_i)$
at points $x_1, \ldots, x_m \in \mathbb{R}^n$. In the simplest case, we might fit a linear
model $s(x) = x^T \beta$ by least squares, minimizing $\|X\beta - y\|^2$,
where X is a matrix whose ith row is the vector of coordinates for the ith
data point. Even in simple applications of least squares, however, a purely
linear model may not be adequate for modeling f; we might at least want to
consider affine or polynomial functions in the coordinates, if not something
more general. A simple way to get more complex models is to introduce
a feature map that takes our original points in $\mathbb{R}^n$ and maps them into a
higher-dimensional space where we will fit our linear model
\[
s(x) = \phi(x)^T \beta, \qquad \phi : \mathbb{R}^n \to \mathbb{R}^N.
\]
Let $\Phi \in \mathbb{R}^{m \times N}$ be the matrix whose ith row is $\phi(x_i)^T$, so that fitting
the data exactly means solving $\Phi \beta = y$. Often we are interested in the case
when N ≫ m, in which case the system is underdetermined and we seek the
minimal norm solution, i.e.
\[
\beta = \Phi^T (\Phi \Phi^T)^{-1} y.
\]
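As a concrete illustration, the sketch below assumes NumPy and uses a hypothetical monomial feature map ϕ(x) = (1, x, ..., x^{N−1}) for scalar inputs; the data, dimensions, and the name `phi` are placeholder choices, not anything fixed by the lecture.

```python
import numpy as np

def phi(x, N=12):
    """Hypothetical feature map: monomials 1, x, ..., x^(N-1) of a scalar x."""
    return x ** np.arange(N)

rng = np.random.default_rng(2)
m, N = 6, 12                              # N >> m: an underdetermined fit
xs = np.sort(rng.uniform(-1, 1, m))
y = np.sin(np.pi * xs)                    # placeholder data to fit

Phi = np.vstack([phi(x, N) for x in xs])  # m-by-N matrix with rows phi(x_i)^T

# Minimum-norm solution beta = Phi^T (Phi Phi^T)^{-1} y (equivalently pinv(Phi) @ y)
beta = Phi.T @ np.linalg.solve(Phi @ Phi.T, y)

print(np.linalg.norm(Phi @ beta - y))     # small: the fit reproduces the data
```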
Now, define the kernel function $k(x, x') = \phi(x)^T \phi(x')$, let $K_{XX} = \Phi \Phi^T$ be
the kernel matrix with entries $k(x_i, x_j)$, let $k_{xX}$ be the row vector with entries
$k(x, x_i)$, and let $K_{Xx} = k_{xX}^T$. Then we can rewrite s(x) in terms of the kernel
function as
\[
\begin{aligned}
s(x) &= \phi(x)^T \beta, & \beta &= \Phi^\dagger y, \\
s(x) &= k_{xX} c, & c &= K_{XX}^{-1} y, \\
s(x) &= d(x)^T y, & d(x) &= K_{XX}^{-1} K_{Xx} = (\Phi^T)^\dagger \phi(x).
\end{aligned}
\]
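Continuing the same hypothetical monomial-feature setup, the following sketch checks numerically that the three expressions above give the same prediction; `phi`, the evaluation point, and all dimensions are again illustrative choices.

```python
import numpy as np

def phi(x, N=12):
    """Hypothetical feature map: monomials 1, x, ..., x^(N-1) of a scalar x."""
    return x ** np.arange(N)

rng = np.random.default_rng(3)
m, N = 6, 12
xs = np.sort(rng.uniform(-1, 1, m))
y = np.sin(np.pi * xs)
Phi = np.vstack([phi(x, N) for x in xs])

KXX = Phi @ Phi.T                    # kernel matrix, (K_XX)_{ij} = k(x_i, x_j)
x0 = 0.3                             # an arbitrary evaluation point
kxX = Phi @ phi(x0, N)               # vector with entries k(x0, x_i)

beta = np.linalg.pinv(Phi) @ y       # minimum-norm feature-space weights
c = np.linalg.solve(KXX, y)          # kernel-space weights
d = np.linalg.solve(KXX, kxX)        # weights applied directly to y

s1 = phi(x0, N) @ beta               # s(x0) = phi(x0)^T beta
s2 = kxX @ c                         # s(x0) = k_{x0 X} c
s3 = d @ y                           # s(x0) = d(x0)^T y
print(s1, s2, s3)                    # all three agree up to rounding
```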
We have already discussed the meaning of the first of these groupings, with
β as a minimal-norm solution to an underdetermined linear system relating
features to observations. We now comment on the other two.
The expression
\[
s(x) = k_{xX} c = \sum_{i=1}^m k(x, x_i) c_i
\]
writes s as a weighted combination of kernel functions centered at the data
points. In practice, we typically solve a regularized version of the fitting
system,
\[
(K_{XX} + \eta^2 I) c = y.
\]
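For concreteness, here is a minimal sketch of the regularized kernel fit, assuming NumPy and a squared-exponential kernel; the kernel, length scale `ell`, data, and η are all illustrative choices rather than ones prescribed by the lecture.

```python
import numpy as np

def k(x, xp, ell=0.5):
    """Squared-exponential kernel (an illustrative choice)."""
    return np.exp(-(x - xp)**2 / (2 * ell**2))

rng = np.random.default_rng(4)
m, eta = 20, 1e-2
X = np.sort(rng.uniform(-1, 1, m))
y = np.sin(np.pi * X) + 1e-2 * rng.standard_normal(m)   # noisy samples

KXX = k(X[:, None], X[None, :])                   # m-by-m kernel matrix
c = np.linalg.solve(KXX + eta**2 * np.eye(m), y)  # (K_XX + eta^2 I) c = y

def s(x):
    """Regularized kernel predictor s(x) = sum_i k(x, x_i) c_i."""
    return k(x, X) @ c

print(s(0.3), np.sin(np.pi * 0.3))   # prediction vs. the underlying function
```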
Often far fewer than m eigenvalues of $K_{XX}$ are much greater than η², and
so we can effectively approximate $K_{XX}$ by a low-rank factorization $AA^T$,
where $A \in \mathbb{R}^{m \times k}$ with k ≪ m, and instead solve
\[
(AA^T + \eta^2 I) \hat{c} = y.
\]
Recall from the start of the lecture that if u minimizes
\[
\frac{1}{2} \|Au - b\|^2 + \frac{\eta^2}{2} \|u\|^2
\]
with b = y, then the residual $r = Au - b$ satisfies $(AA^T + \eta^2 I) r = -\eta^2 b$,
and therefore
\[
\hat{c} = \eta^{-2} (b - Au) = -\eta^{-2} r.
\]
Moreover, suppose we know how to compute the reduced feature vector at
an evaluation point x, i.e. we can find $a_x$ such that
\[
a_x^T A^T = k_{xX}.
\]
Then
\[
k_{xX} \hat{c} = a_x^T A^T \hat{c} = a_x^T u,
\]
since $A^T \hat{c} = \eta^{-2}(A^T b - A^T A u) = u$ by the regularized normal equations
$(A^T A + \eta^2 I) u = A^T b$.
That is, solving the regularized kernel problem is (up to error associated with
a low-rank approximation) equivalent to solving a regularized least squares
problem, and we get the same predictions whether we compute the least
squares predictor $a_x^T u$ or the kernel-based predictor $k_{xX} \hat{c}$.
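Here is a minimal sketch of this equivalence, assuming NumPy and a kernel that is exactly low rank because it comes from a k-dimensional reduced feature map (so the rows of A are the reduced feature vectors and a_x is the reduced feature vector at x); the feature map `feat` and all parameters are illustrative.

```python
import numpy as np

def feat(x, k=5):
    """Reduced (rank-k) feature map; monomial features as an illustrative choice."""
    return x ** np.arange(k)

rng = np.random.default_rng(5)
m, kdim, eta = 30, 5, 1e-2
X = np.sort(rng.uniform(-1, 1, m))
y = np.sin(np.pi * X)

A = np.vstack([feat(x, kdim) for x in X])   # m-by-k factor; here K_XX = A A^T exactly

# Kernel-style solve: (A A^T + eta^2 I) c_hat = y
c_hat = np.linalg.solve(A @ A.T + eta**2 * np.eye(m), y)

# Regularized least squares solve: u minimizes (1/2)||Au - y||^2 + (eta^2/2)||u||^2
u = np.linalg.solve(A.T @ A + eta**2 * np.eye(kdim), A.T @ y)

# c_hat is recovered from the least squares residual: c_hat = eta^{-2} (y - A u)
print(np.linalg.norm(c_hat - (y - A @ u) / eta**2))

# Same prediction either way: a_x^T u == k_xX c_hat, where k_xX = a_x^T A^T
x0 = 0.3
a_x = feat(x0, kdim)
print(a_x @ u, (A @ a_x) @ c_hat)
```

With k ≪ m, the least squares route costs O(mk²) rather than the O(m³) of factoring the dense m-by-m kernel system, which is the payoff of the fast method alluded to above.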