Chapter 12

Machine Learning

Gabriel Peyré
CNRS & DMA, École Normale Supérieure
[email protected]
https://mathematical-tours.github.io
www.numerical-tours.com
January 3, 2021
This chapter gives a rapid overview of the main concepts in machine learning. The goal is not to be exhaustive, but to highlight representative problems and insist on the distinction between unsupervised (visualization and clustering) and supervised (regression and classification) setups. We also shed light on the tight connections between machine learning and inverse problems.
While imaging science problems are generally concerned with processing a single datum (e.g. an image), machine learning is rather concerned with analyzing large collections of data. The focus (goal and performance measures) is thus radically different, but quite surprisingly, both fields use very similar tools and algorithms (in particular linear models and convex optimization).
and covariance
$$\hat C \stackrel{\text{def.}}{=} \frac{1}{n}\sum_{i=1}^n (x_i - \hat m)(x_i - \hat m)^* \in \mathbb{R}^{p \times p}. \tag{12.1}$$
Denoting $\tilde X \stackrel{\text{def.}}{=} X - \mathbb{1}_n \hat m^*$, one has $\hat C = \tilde X^* \tilde X / n$.
Note that if the points $(x_i)_i$ are modelled as i.i.d. variables, and denoting $x$ one of these random variables, one has, using the law of large numbers, the almost sure convergence, as $n \to +\infty$, of $\hat m$ and $\hat C$ toward the mean and covariance of $x$.
Figure 12.1: Empirical covariance of the data and its associated singular values.
One has that $\tilde x_i = \mathrm{Proj}_{\tilde T}(x_i)$ where $\tilde T \stackrel{\text{def.}}{=} \hat m + \mathrm{Span}_{k=1}^d(v_k)$ is an affine space.
Figure 12.3 shows an example of PCA for 2-D and 3-D visualization.
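As an illustration, here is a minimal Python sketch of PCA computed through the SVD of the centered data matrix (the function name, the random toy data and the shapes below are ours, not part of the text; any data matrix of shape $n \times p$, such as the Iris features, could be used instead).

```python
import numpy as np

def pca(X, d=2):
    """Minimal PCA sketch: center the data, then project on the top-d
    right singular vectors of the centered matrix (the eigenvectors of
    the empirical covariance C_hat = X_tilde^T X_tilde / n)."""
    m_hat = X.mean(axis=0)                  # empirical mean
    X_tilde = X - m_hat                     # centered data
    # thin SVD: X_tilde = U diag(s) V^T; columns of V = principal directions
    U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    V = Vt.T[:, :d]
    Z = X_tilde @ V                         # d-dimensional coordinates for visualization
    X_proj = m_hat + Z @ V.T                # projections back in R^p (the x_tilde_i)
    return Z, X_proj, s**2 / X.shape[0]     # scores, projections, eigenvalues of C_hat

X = np.random.randn(150, 4)                 # toy stand-in for the Iris features
Z, X_proj, lam = pca(X, d=2)
print(Z.shape, lam)
```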
Optimality analysis. We now show that, among all possible linear dimensionality reduction methods, PCA is optimal in the sense of the $\ell^2$ error. To simplify, and without loss of generality (since the mean can be subtracted from the data), we assume that the empirical mean is zero, $\hat m = 0$, so that $X = \tilde X$.
1 https://en.wikipedia.org/wiki/Iris_flower_data_set
Figure 12.3: 2-D and 3-D PCA visualization of the input clouds.
We recall that $X = \sqrt{n}\, U \operatorname{diag}(\sigma) V^\top$ and $\hat C = \frac{1}{n} X^\top X = V \Lambda V^\top$ where $\Lambda = \operatorname{diag}(\lambda_i = \sigma_i^2)$, with $\lambda_1 \geq \ldots \geq \lambda_r$.
The following proposition shows that PCA is optimal in terms of $\ell^2$ distance if one considers only affine spaces. This means that we consider the following compression/decompression to a dimension $k$ (i.e. dimensionality reduction and then expansion)
$$\min_{R, S \in \mathbb{R}^{p \times k}} f(R, S) \stackrel{\text{def.}}{=} \sum_i \|x_i - R S^\top x_i\|_{\mathbb{R}^p}^2. \tag{12.5}$$
Note that this minimization is a priori not trivial to solve because, although f (·, S) and f (R, ·) are convex,
f is not jointly convex. So iteratively minimizing on R and S might fail to converge to a global minimizer.
This section aims at proving the following theorem.
Theorem 20. A solution of (12.5) is $S = R = V_{1:k} \stackrel{\text{def.}}{=} [v_1, \ldots, v_k]$.
Proof. We consider an arbitrary pair $(R, S)$. Since the matrix $R S^\top$ has rank $k' \leq k$, let $W \in \mathbb{R}^{p \times k'}$ be an ortho-basis of $\mathrm{Im}(R S^\top)$, so that $W^\top W = \mathrm{Id}_{k' \times k'}$. We remark that
$$\operatorname*{argmin}_z \|x - W z\|^2 = W^\top x,$$
because the first order condition for this problem reads $W^\top(W z - x) = 0$. Hence, denoting $R S^\top x_i = W z_i$ for some $z_i \in \mathbb{R}^{k'}$,
$$f(R, S) = \sum_i \|x_i - R S^\top x_i\|^2 = \sum_i \|x_i - W z_i\|^2 \geq \sum_i \|x_i - W W^\top x_i\|^2 \geq f(\tilde W, \tilde W),$$
where $\tilde W \in \mathbb{R}^{p \times k}$ is obtained by completing $W$ with $k - k'$ additional orthonormal columns.
Proof. Using the previous lemma, one can consider only $R = S$ with $S^\top S = \mathrm{Id}_k$, so that one needs to solve
$$f(S, S) = \sum_i \|x_i - S S^\top x_i\|^2 = \sum_i \|x_i\|^2 - 2\, x_i^\top S S^\top x_i + x_i^\top S (S^\top S) S^\top x_i.$$
Figure 12.4: Left: proof of the rightmost inequality in (12.6). Middle: matrix B, right: matrix B̃.
The next lemma provides an upper bound on the quantity being minimized as the solution of a convex
optimization problem (a linear program). The proof of the theorem follows by showing that this upper
bound is actually reached, which provides a certificate of optimality.
Lemma 9. Denoting $C = V \Lambda V^\top$, one has
$$\operatorname{tr}(S^\top C S) \;\leq\; \max_{\beta \in \mathbb{R}^p} \Big\{ \sum_{i=1}^p \lambda_i \beta_i \;;\; 0 \leq \beta \leq 1,\ \sum_i \beta_i \leq k \Big\} \;=\; \sum_{i=1}^k \lambda_i, \tag{12.6}$$
where we denoted $B \stackrel{\text{def.}}{=} V^\top S \in \mathbb{R}^{p \times k}$, $(b_i)_{i=1}^p$ with $b_i \in \mathbb{R}^k$ the rows of $B$, and $\beta_i \stackrel{\text{def.}}{=} \|b_i\|^2 \geq 0$, so that $\operatorname{tr}(S^\top C S) = \operatorname{tr}(B^\top \Lambda B) = \sum_{i=1}^p \lambda_i \beta_i$.
We extend the $k$ columns of $B$ into an orthogonal basis $\tilde B \in \mathbb{R}^{p \times p}$ such that $\tilde B \tilde B^\top = \tilde B^\top \tilde B = \mathrm{Id}_p$, so that $\beta_i \leq \|\tilde b_i\|^2 = 1$ and $\sum_i \beta_i = \|B\|_F^2 \leq \|S\|_F^2 = k$, and hence $(\beta_i)_{i=1}^p$ satisfies the constraints of the considered optimization problem; hence $\operatorname{tr}(S^\top C S)$ is necessarily smaller than the maximum possible value.
For the proof of the second upper bound (the rightmost equality in (12.6)), we only verify it in 2-D and 3-D using a drawing, see Figure 12.4, left.
Proof of Theorem 20. Setting $S = V_{1:k} = [v_1, \ldots, v_k]$, it satisfies $C S = V \Lambda V^\top V_{1:k} = V_{1:k} \operatorname{diag}(\lambda_i)_{i=1}^k$ and hence
$$\operatorname{tr}(S^\top C S) = \operatorname{tr}\big(S^\top S \operatorname{diag}(\lambda_i)_{i=1}^k\big) = \operatorname{tr}\big(\mathrm{Id}_k \operatorname{diag}(\lambda_i)_{i=1}^k\big) = \sum_{i=1}^k \lambda_i.$$
This value matches the right-most upper bound of Lemma 9, which shows that this S is optimal.
12.1.2 Clustering and k-means
A typical unsupervised learning task is to infer a class label $y_i \in \{1, \ldots, k\}$ for each input point $x_i$; this is often called a clustering problem (since the set of points associated to a given label can be thought of as a cluster).
k-means A way to infer these labels is by assuming that the clusters are compact, and optimizing some
compactness criterion. Assuming for simplicity that the data are in Euclidean space (which can be relaxed
to an arbitrary metric space, although the computations become more complicated), the k-means approach
minimizes the distance between the points and their class centroids $c = (c_\ell)_{\ell=1}^k$, where each $c_\ell \in \mathbb{R}^p$. The corresponding variational problem becomes
$$\min_{(y, c)} E(y, c) \stackrel{\text{def.}}{=} \sum_{\ell=1}^k \sum_{i : y_i = \ell} \|x_i - c_\ell\|^2.$$
If, during the iterations, one of the clusters associated to some $c_\ell$ becomes empty, then one can either decide to destroy it and replace $k$ by $k-1$, or try to "teleport" the center $c_\ell$ to another location (this might increase the objective function $E$, however).
Since the energy $E$ decreases during each of these two steps, it converges to some limit value. Since there is a finite number of possible label assignments, it is actually constant after a finite number of iterations, and the algorithm stops.
Of course, since the energy is non-convex, little can be said about the properties of the clusters output by k-means. To try to reach lower energy levels, it is possible to "teleport", during the iterations, centroids $c_\ell$ associated to clusters with high energy to locations within clusters with lower energy (because optimal solutions should somehow balance the energy).
Figure 12.6 shows an example of k-means iterations on the Iris dataset.
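The two alternating steps can be sketched in a few lines of Python (a toy implementation under our own naming, not the code used to produce the figures; the assignment and centroid updates correspond to the steps (12.7) and (12.8) referred to in the text).

```python
import numpy as np

def kmeans(X, k, n_iter=50, rng=np.random.default_rng(0)):
    """Minimal k-means (Lloyd) sketch: alternate label and centroid updates,
    each of which decreases the energy E(y, c)."""
    n, p = X.shape
    c = X[rng.choice(n, size=k, replace=False)]        # centroids initialized on data points
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        y = d2.argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster (kept if empty)
        for l in range(k):
            if np.any(y == l):
                c[l] = X[y == l].mean(axis=0)
    return y, c

X = np.random.randn(200, 2)
labels, centroids = kmeans(X, k=3)
print(np.bincount(labels, minlength=3), centroids.shape)
```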
k-means++ To obtain good results when using k-means, it is crucial to have an efficient initialization scheme. In practice, the best results are obtained by seeding the initial centroids as far as possible from one another (a greedy strategy works great in practice).
Quite surprisingly, there exists a randomized seeding strategy which can be shown to be close to optimal in terms of the value of $E$, even without running the k-means iterations (although in practice they are still needed to polish the result). The corresponding k-means++ initialization is obtained by selecting $c_1$ uniformly at random among the $x_i$, and then, assuming $c_1, \ldots, c_{\ell-1}$ have been seeded, drawing $c_\ell$ among the samples according to the probability $\pi^{(\ell)}$ on $\{1, \ldots, n\}$ proportional to the squared distance to the previously seeded centers
$$\forall\, i \in \{1, \ldots, n\}, \quad \pi_i^{(\ell)} \stackrel{\text{def.}}{=} \frac{d_i^2}{\sum_{j=1}^n d_j^2} \quad \text{where} \quad d_i \stackrel{\text{def.}}{=} \min_{1 \leq r \leq \ell-1} \|x_i - c_r\|.$$
Figure 12.6: Left: iterations of the k-means algorithm (Iter #1, #2, #3, #16). Right: histogram of points belonging to each class after the k-means optimization.
This means that points which are located far away from the previously seeded centers are more likely to be picked.
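A minimal sketch of this seeding strategy is given below (our own toy implementation, assuming Euclidean data stored as the rows of a NumPy array).

```python
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    """Minimal k-means++ seeding sketch: the first center is uniform among the x_i,
    each next center is drawn with probability proportional to the squared
    distance d_i^2 to the already seeded centers."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        pi = d2 / d2.sum()                    # the probability pi^(l) of the text
        centers.append(X[rng.choice(n, p=pi)])
    return np.array(centers)

X = np.random.randn(500, 2)
print(kmeans_pp_init(X, k=4))
```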
The following result, due to David Arthur and Sergei Vassilvitskii, shows that this seeding is optimal up to a log factor on the energy. Note that finding a global optimum is known to be NP-hard.
Theorem 21. For the centroids c? defined by the k-means++ strategy, denoting y ? the associated nearest
neighbor labels defined as in (12.7), one has
Lloyd algorithm and continuous densities. The k-means iterations are also called the "Lloyd" algorithm, which also finds applications in optimal vector quantization for compression. It can also be used in the "continuous" setting where the empirical samples $(x_i)_i$ are replaced by an arbitrary measure $\mu$ over $\mathbb{R}^p$. The energy to minimize becomes
$$\min_{(\mathcal{V}, c)} \sum_{\ell=1}^k \int_{\mathcal{V}_\ell} \|x - c_\ell\|^2\, d\mu(x)$$
where $(\mathcal{V}_\ell)_\ell$ is a partition of the domain. Step (12.7) is replaced by the computation of a Voronoi cell
$$\mathcal{C}_\ell \stackrel{\text{def.}}{=} \{ x \;;\; \forall\, \ell' \neq \ell,\ \|x - c_\ell\| \leq \|x - c_{\ell'}\| \}.$$
These Voronoi cells are polyhedra delimited by pieces of the perpendicular bisectors between centroids, and this Voronoi segmentation can be computed efficiently using tools from algorithmic geometry in low dimension. Step (12.8) is then replaced by
$$\forall\, \ell \in \{1, \ldots, k\}, \quad c_\ell \leftarrow \frac{\int_{\mathcal{C}_\ell} x\, d\mu(x)}{\int_{\mathcal{C}_\ell} d\mu(x)}.$$
In the case where $\mu$ is the uniform distribution, the optimal solution corresponds to the hexagonal lattice. Figure 12.7 displays two examples of Lloyd iterations on 2-D densities on a square domain.
Figure 12.7: Iterations of the k-means algorithm (Lloyd algorithm) on continuous densities $\mu$. Top: uniform. Bottom: non-uniform (the density of $\mu$ with respect to the Lebesgue measure is displayed as a grayscale image in the background).
In order to make the problem tractable computationally, and also in order to obtain efficient prediction scores, it is important to restrict the fit to the data $y_i \approx f(x_i)$ using a "small enough" class of functions. Intuitively, in order to avoid overfitting, the "size" of this class of functions should grow with the number $n$ of samples.
Here $L : \mathcal{Y}^2 \to \mathbb{R}^+$ is the so-called loss function, and it should typically satisfy $L(y, y') = 0$ if and only if $y = y'$. The specifics of $L$ depend on the application at hand (in particular, one should use different losses for classification and regression tasks). To highlight the dependency of $\hat f$ on $n$, we occasionally write $\hat f_n$.
Intuitively, one should have $\hat f_n \to \bar f$ as $n \to +\infty$, which can be captured by the expectation of the prediction error over the samples $(x_i, y_i)_i$, i.e.
One should be careful that here the expectation is over both x (distributed according to the marginal πX
of π on X ), and also the n i.i.d. pairs (xi , yi ) ∼ π used to define fˆn (so a better notation should rather be
$(x_i, y_i)_i$). Here $\tilde L$ is some loss function on $\mathcal{Y}$ (one can use $\tilde L = L$ for instance). One can also study convergence in probability, i.e.
$$\forall\, \varepsilon > 0, \quad E_{\varepsilon, n} \stackrel{\text{def.}}{=} \mathbb{P}\big(\tilde L(\hat f_n(x), \bar f(x)) > \varepsilon\big) \to 0.$$
If this holds, then one says that the estimation method is consistent (in expectation or in probability). The question is then to derive convergence rates, i.e. to upper bound $E_n$ or $E_{\varepsilon, n}$ by some explicit decay rate.
Note that when $\tilde L(y, y') = |y - y'|^r$, convergence in expectation is stronger than (i.e. implies) convergence in probability, since, using Markov's inequality,
$$E_{\varepsilon, n} = \mathbb{P}(|\hat f_n(x) - \bar f(x)|^r > \varepsilon) \leq \frac{1}{\varepsilon}\,\mathbb{E}(|\hat f_n(x) - \bar f(x)|^r) = \frac{E_n}{\varepsilon}.$$
where $J$ is some regularization function, for instance $J = \|\cdot\|_2^2$ (to avoid blowing-up of the parameter) or $J = \|\cdot\|_1$ (to perform model selection, i.e. to use only a sparse set of features among a possibly very large pool of $p$ features). Here $\lambda_n > 0$ is a regularization parameter, and it should tend to 0 when $n \to +\infty$.
Then one similarly defines the ideal parameter β̄ as in (12.10) so that the limiting estimator as n → +∞
is of the form f¯ = f (·, β̄) for β̄ defined as
$$\bar\beta \in \operatorname*{argmin}_\beta \int_{\mathcal{X} \times \mathcal{Y}} L(f(x, \beta), y)\, d\pi(x, y) = \mathbb{E}_{(x, y) \sim \pi}\big(L(f(x, \beta), y)\big). \tag{12.12}$$
Prediction vs. estimation risks. In this parametric approach, one could also be interested in studying how close $\hat\beta$ is to $\bar\beta$. This can be measured by controlling how fast some estimation error $\|\hat\beta - \bar\beta\|$ (for some
norm || · ||) goes to zero. Note however that in most cases, controlling the estimation error is more difficult
than doing the same for the prediction error. In general, doing a good parameter estimation implies doing
a good prediction, but the converse is not true.
which converges to $\mathbb{E}(L(\hat f(x), y))$ for large $\bar n$. Minimizing $R_{\bar n}$ to set some meta-parameter of the method (for instance the regularization parameter $\lambda_n$) is called "cross-validation" in the literature.
Figure 12.9: Conditional expectation.
Least square and conditional expectation. If one does not put any constraint on $f$ (besides being measurable), then the optimal limit estimator $\bar f(x)$ defined in (12.10) is simply averaging the values $y$ sharing the same $x$, which is the so-called conditional expectation. Assuming for simplicity that $\pi$ has some density $\frac{d\pi}{dx\, dy}$ with respect to a tensor product measure $dx \otimes dy$ (for instance the Lebesgue measure), one has
$$\forall\, x \in \mathcal{X}, \quad \bar f(x) = \mathbb{E}(y\,|\,x) = \frac{\int_{\mathcal{Y}} y\, \frac{d\pi}{dx\, dy}(x, y)\, dy}{\int_{\mathcal{Y}} \frac{d\pi}{dx\, dy}(x, y)\, dy}.$$
Penalized linear models. A very simple class of models is obtained by imposing that $f$ is linear, and setting $f(x, \beta) = \langle x, \beta\rangle$, for parameters $\beta \in \mathcal{B} = \mathbb{R}^p$. Note that one can also treat affine functions this way, by remarking that $\langle x, \beta\rangle + \beta_0 = \langle (x, 1), (\beta, \beta_0)\rangle$ and replacing $x$ by $(x, 1)$. So in the following, without loss of generality, we only treat the vectorial (non-affine) case.
Under the square loss, the regularized ERM (12.11) is conveniently rewritten as
$$\hat\beta \in \operatorname*{argmin}_{\beta \in \mathcal{B}} \frac{1}{2}\langle \hat C \beta, \beta\rangle - \langle \hat u, \beta\rangle + \lambda_n J(\beta) \tag{12.14}$$
Figure 12.10: Linear regression.
where we introduced the empirical correlation (already introduced in (12.1)) and observations
$$\hat C \stackrel{\text{def.}}{=} \frac{1}{n} X^* X = \frac{1}{n}\sum_{i=1}^n x_i x_i^* \quad \text{and} \quad \hat u \stackrel{\text{def.}}{=} \frac{1}{n}\sum_{i=1}^n y_i x_i = \frac{1}{n} X^* y \in \mathbb{R}^p.$$
As $n \to +\infty$, under weak conditions on $\pi$, one has with the law of large numbers the almost sure convergence
$$\hat C \to C \stackrel{\text{def.}}{=} \mathbb{E}(x x^*) \quad \text{and} \quad \hat u \to u \stackrel{\text{def.}}{=} \mathbb{E}(y\, x). \tag{12.15}$$
When considering $\lambda_n \to 0$, in some cases, one can show that in the limit $n \to +\infty$, one retrieves the following ideal parameter
$$\bar\beta \in \operatorname*{argmin}_\beta \{ J(\beta) \;;\; C\beta = u \}.$$
Problem (12.14) is equivalent to the regularized resolution of inverse problems (8.9), with $\hat C$ in place of $\Phi$ and $\hat u$ in place of $\Phi^* y$. The major, and in fact only, difference between machine learning and inverse problems is that the linear operator is also noisy, since $\hat C$ can be viewed as a noisy version of $C$. The "noise level", in this setting, is $1/\sqrt{n}$ in the sense that
$$\mathbb{E}(\|\hat C - C\|) \sim \frac{1}{\sqrt{n}} \quad \text{and} \quad \mathbb{E}(\|\hat u - u\|) \sim \frac{1}{\sqrt{n}},$$
under the assumption that $\mathbb{E}(y^4) < +\infty$ and $\mathbb{E}(\|x\|^4) < +\infty$, to ensure that one can use the central limit theorem on $x^2$ and $xy$. Note that, although we use here a linear estimator, one does not need to assume a "linear" relation of the form $y = \langle x, \beta\rangle + w$ with a noise $w$ independent from $x$, but rather hopes to do "as well as possible", i.e. to estimate a linear model as close as possible to $\bar\beta$.
The general take home message is that it is possible to generalize Theorems 10, 12 and 13 to cope with
the noise on the covariance matrix to obtain prediction convergence rates of the form
Ridge regression (quadratic penalization). For $J = \|\cdot\|^2/2$, the estimator (12.14) is obtained in closed form as
$$\hat\beta = (X^* X + n \lambda_n \mathrm{Id}_p)^{-1} X^* y = (\hat C + \lambda_n \mathrm{Id}_p)^{-1} \hat u. \tag{12.16}$$
This is often called ridge regression in the literature. Note that thanks to the Woodbury formula, this
estimator can also be re-written as
If $n \gg p$ (which is the usual setup in machine learning), then (12.16) is preferable. In some cases however (in particular when using RKHS techniques), it makes sense to consider a very large $p$ (even infinite dimensional), so that (12.17) must be used.
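The equivalence between the two formulas can be checked numerically, as in the following sketch (assuming real-valued features, so that $X^* = X^\top$, and assuming (12.17) denotes the Woodbury form $\hat\beta = X^*(X X^* + n\lambda_n \mathrm{Id}_n)^{-1} y$; the toy data is ours).

```python
import numpy as np

n, p, lam = 200, 5, 0.1
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# form (12.16): a p x p linear system, convenient when n >> p
C_hat = X.T @ X / n
u_hat = X.T @ y / n
beta_primal = np.linalg.solve(C_hat + lam * np.eye(p), u_hat)

# Woodbury form (assumed to be (12.17)): an n x n system, the one that
# survives when p is very large or infinite (kernel methods)
beta_dual = X.T @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), y)

print(np.allclose(beta_primal, beta_dual))   # both formulas agree
```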
If $\lambda_n \to 0$, then using (12.15), one has the convergence in expectation and probability
$$\hat\beta \to \bar\beta = C^+ u.$$
Theorems 10 and 12 can be extended to this setting and one obtains the following result.
Theorem 22. If
$$\bar\beta = C^\gamma z \quad \text{where} \quad \|z\| \leq \rho \tag{12.18}$$
for $0 < \gamma \leq 2$, then
$$\mathbb{E}(\|\hat\beta - \bar\beta\|^2) \leq C \rho^{\frac{2}{\gamma+1}}\, n^{-\frac{\gamma}{\gamma+1}} \tag{12.19}$$
for a constant $C$ depending only on $\gamma$.
It is important to note that, since $\bar\beta = C^+ u$, the source condition (12.18) is always satisfied. What truly matters here is that the rate (12.19) does not depend on the dimension $p$ of the features, but rather only on
ρ, which can be much smaller. This theoretical analysis actually works perfectly fine in infinite dimension
p = ∞ (which is the setup considered when dealing with RKHS below).
(Figure panels: k = 1, k = 5, k = 10, k = 40.)
In practice, the parameter $R$ can be set through cross-validation, by minimizing the testing risk $R_{\bar n}$ defined in (12.13), which typically uses a 0-1 loss counting the number of mis-classifications
$$R_{\bar n} \stackrel{\text{def.}}{=} \sum_{j=1}^{\bar n} \delta\big(\bar y_j - \hat f(\bar x_j)\big)$$
where $\delta(0) = 0$ and $\delta(s) = 1$ if $s \neq 0$. Of course the method extends to an arbitrary metric space in place of the Euclidean space $\mathbb{R}^p$ for the features. Note also that, instead of explicitly sorting all the Euclidean distances, one can use fast nearest neighbor search methods.
Figure 12.12 shows, for the IRIS dataset, the classification domains (i.e. $\{x \;;\; f(x) = \ell\}$ for $\ell = 1, \ldots, k$) using a 2-D projection for visualization. Increasing $R$ leads to smoother class boundaries.
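The following short sketch illustrates nearest-neighbor classification and the selection of $R$ by a held-out testing risk (toy Gaussian data and function names of our own; the Iris features could be substituted).

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, R):
    """R-nearest-neighbors sketch: majority vote among the R closest training points."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d2, axis=1)[:, :R]               # indices of the R nearest neighbors
    return np.array([np.bincount(v).argmax() for v in y_train[idx]])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2.5, 1, (100, 2))])
y = np.repeat([0, 1], 100)
perm = rng.permutation(200)
tr, te = perm[:150], perm[150:]                       # train / held-out split

for R in (1, 5, 10, 40):
    err = np.mean(knn_predict(X[tr], y[tr], X[te], R) != y[te])   # 0-1 testing risk
    print(f"R = {R:2d}: test error = {err:.3f}")
```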
Approximate risk minimization. The hard classifier is defined from a linear predictor $\langle x, \beta\rangle$ as $\operatorname{sign}(\langle x, \beta\rangle) \in \{-1, +1\}$. The 0-1 loss error function (somehow the "ideal" loss) counts the number of mis-classifications, and the ideal classifier can be computed as
$$\min_\beta \sum_{i=1}^n \ell_0(-y_i \langle x_i, \beta\rangle) \tag{12.20}$$
Figure 12.13: 1-D and 2-D logistic classification, showing the impact of ||β|| on the sharpness of the
classification boundary.
where $\ell_0 = 1_{\mathbb{R}^+}$. Indeed, a mis-classification corresponds to $\langle x_i, \beta\rangle$ and $y_i$ having different signs, so that in this case $\ell_0(-y_i\langle x_i, \beta\rangle) = 1$ (and 0 otherwise for a correct classification).
The function $\ell_0$ is non-convex and hence problem (12.20) is itself non-convex and, in full generality, can be shown to be NP-hard to solve. One thus relies on proxies, which are functions that upper-bound $\ell_0$ and are convex (and sometimes differentiable).
The most celebrated proxies are
$$\ell(u) = \max(1 + u, 0) \quad \text{and} \quad \ell(u) = \frac{\log(1 + e^u)}{\log(2)},$$
which are respectively the hinge loss (corresponding to support vector machines (SVM), and non-smooth) and the logistic loss (which is smooth). The $1/\log(2)$ factor is just a constant which ensures $\ell_0 \leq \ell$. AdaBoost uses $\ell(u) = e^u$. Note that least squares corresponds to using $\ell(u) = (1+u)^2$, but this is a poor proxy for $\ell_0$ for negative values, although it might work well in practice. Note that SVM is a non-smooth problem, which can be cast as a linear program minimizing the so-called classification margin
$$\min_{u \geq 0,\ \beta} \Big\{ \sum_i u_i \;;\; u_i \geq 1 - y_i \langle x_i, \beta\rangle \Big\}.$$
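The upper-bounding property $\ell_0 \leq \ell$ of these proxies can be checked numerically, as in the short sketch below (written in terms of $u = -y\langle x, \beta\rangle$ as in (12.20); the explicit expressions used for the hinge and normalized logistic losses are the ones assumed above).

```python
import numpy as np

# losses as functions of u = -y <x, beta>, as in (12.20)
ell0     = lambda u: (u >= 0).astype(float)           # 0-1 loss 1_{R+}
hinge    = lambda u: np.maximum(1 + u, 0)             # SVM proxy
logistic = lambda u: np.log1p(np.exp(u)) / np.log(2)  # logistic proxy, scaled by 1/log(2)

u = np.linspace(-3, 3, 601)
print(np.all(hinge(u) >= ell0(u)))       # True: the hinge loss upper-bounds ell_0
print(np.all(logistic(u) >= ell0(u)))    # True: so does the normalized logistic loss
```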
Logistic loss probabilistic interpretation. Logistic classification can be understood as a linear model
as introduced in Section 12.3.1, although the decision function f (·, β) is not linear. Indeed, one needs
to "remap" the linear value $\langle x, \beta\rangle$ into the interval $[0, 1]$. In logistic classification, we define the predicted probability of $x$ belonging to the class with label $-1$ as
$$f(x, \beta) \stackrel{\text{def.}}{=} \theta(\langle x, \beta\rangle) \quad \text{where} \quad \theta(s) \stackrel{\text{def.}}{=} \frac{e^s}{1 + e^s} = (1 + e^{-s})^{-1}, \tag{12.21}$$
which is often called the "logit" model. Using a linear decision model might seem overly simplistic, but in high dimension $p$, the number of degrees of freedom is actually enough to reach surprisingly good classification performances. Note that the probability of belonging to the second class is $1 - f(x, \beta) = \theta(-s)$. This symmetry of the $\theta$ function is important because it means that both classes are treated equally, which makes sense for "balanced" problems (where the total masses of the classes are roughly equal).
Intuitively, $\beta/\|\beta\|$ controls the direction of the separating hyperplane, while $1/\|\beta\|$ is roughly the fuzziness of the separation. As $\|\beta\| \to +\infty$, one obtains a sharp decision boundary, and logistic classification resembles SVM.
Note that f (x, β) can be interpreted as a single layer perceptron with a logistic (sigmoid) rectifying unit,
more details on this in Chapter 15.
Figure 12.14: Comparison of loss functions (binary 0-1, logistic, hinge). [ToDo: Re-do the figure, it is not correct, they should upper bound $\ell_0$]
Since the $(x_i, y_i)$ are modeled as i.i.d. variables, it makes sense to define $\hat\beta$ from the observations using a maximum likelihood, assuming that each $y_i$ conditioned on $x_i$ is a Bernoulli variable with associated probability $(p_i, 1 - p_i)$ with $p_i = f(x_i, \beta)$. The probability of observing $y_i$ is thus, denoting $s_i = \langle x_i, \beta\rangle$,
$$\mathbb{P}(y = y_i \,|\, x = x_i) = p_i^{1 - \bar y_i}(1 - p_i)^{\bar y_i} = \Big(\frac{e^{s_i}}{1 + e^{s_i}}\Big)^{1 - \bar y_i}\Big(\frac{1}{1 + e^{s_i}}\Big)^{\bar y_i},$$
where we denoted $\bar y_i = \frac{y_i + 1}{2} \in \{0, 1\}$.
One can then minimize minus the sum of the log of the likelihoods, which reads
$$\hat\beta \in \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n -\log\big(\mathbb{P}(y = y_i \,|\, x = x_i)\big) = \sum_{i=1}^n -(1 - \bar y_i)\log\frac{e^{s_i}}{1 + e^{s_i}} - \bar y_i \log\frac{1}{1 + e^{s_i}}.$$
Some algebraic manipulations show that this is equivalent to an ERM-type form (12.11) with a logistic loss function
$$\hat\beta \in \operatorname*{argmin}_{\beta \in \mathbb{R}^p} E(\beta) \stackrel{\text{def.}}{=} \frac{1}{n}\sum_{i=1}^n L(\langle x_i, \beta\rangle, y_i) \tag{12.22}$$
where the logistic loss reads
$$L(s, y) \stackrel{\text{def.}}{=} \log(1 + \exp(-sy)). \tag{12.23}$$
Problem (12.22) is a smooth convex minimization. If X is injective, E is also strictly convex, hence it has a
single global minimum.
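A minimal gradient descent on (12.22) can be sketched as follows (toy data, step size and function name are ours; the gradient uses $\partial_s L(s, y) = -y/(1 + e^{sy})$).

```python
import numpy as np

def logistic_regression(X, y, n_iter=2000, tau=0.5):
    """Gradient descent sketch on E(beta) = (1/n) sum_i log(1 + exp(-y_i <x_i, beta>)),
    with labels y_i in {-1, +1}."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        s = X @ beta
        grad = -(X.T @ (y / (1 + np.exp(y * s)))) / n   # gradient of the logistic ERM
        beta -= tau * grad
    return beta

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.repeat([-1.0, 1.0], 100)
beta = logistic_regression(X, y)
print(beta, np.mean(np.sign(X @ beta) == y))            # fitted vector, training accuracy
```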
Figure (12.14) compares the binary (ideal) 0-1 loss, the logistic loss and the hinge loss (the one used for
SVM).
Figure 12.15: Influence of the separation distance between the classes on the classification probability.
Note that in the case of $k = 2$ classes, $\mathcal{Y} = \{-1, 1\}$, this model can be shown to be equivalent to the two-class logistic classification method exposed in Section 12.4.2, with a solution vector equal to $\beta_1 - \beta_2$ (so it is computationally more efficient to only consider a single vector as we did).
The computation of the LSE operator is unstable for large values of $S_{i,\ell}$ (numerical overflow, producing NaN), but this can be fixed by subtracting the largest element in each row, since
LSE(S + a) = LSE(S) + a
Figure 12.16: 2-D and 3-D PCA visualization of the digits images.
Figure 12.17: Results of digit classification. Left: probability $h(x)_\ell$ of belonging to each of the 9 first classes (displayed over a 2-D PCA space). Right: colors reflect the probability $h(x)$ of belonging to the classes.
if $a$ is constant along the rows. This is often referred to as the "LSE trick" and is very important to use in practice (in particular if some classes are well separated, since the corresponding $\beta_\ell$ vector might become large).
The gradient of the LSE operator is the soft-max operator
$$\nabla \mathrm{LSE}(S) = \mathrm{SM}(S) \stackrel{\text{def.}}{=} \Big(\frac{e^{S_{i,\ell}}}{\sum_m e^{S_{i,m}}}\Big)_{i,\ell}.$$
Similarly to the LSE, it needs to be stabilized by subtracting the maximum value along the rows before computation.
Once the matrix $D$ is computed, the gradient of $E$ is computed as
$$\nabla E(\beta) = \frac{1}{n} X^* \big(\mathrm{SM}(X\beta) - D\big),$$
and one can minimize $E$ using for instance a gradient descent scheme.
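The stabilized soft-max and the resulting gradient can be sketched as follows (our own toy code; we assume, as the notation suggests, that $D$ is the $n \times k$ one-hot encoding of the observed classes).

```python
import numpy as np

def softmax_rows(S):
    """Row-wise soft-max, stabilized by the LSE trick: subtracting the row maximum
    (a constant along each row) leaves SM(S) unchanged and avoids overflow."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def grad_E(beta, X, D):
    """Gradient (1/n) X^T (SM(X beta) - D) of the multi-class logistic energy."""
    return X.T @ (softmax_rows(X @ beta) - D) / X.shape[0]

rng = np.random.default_rng(0)
n, p, k = 50, 8, 3
X = rng.standard_normal((n, p))
D = np.eye(k)[rng.integers(k, size=n)]     # one-hot matrix of the labels (assumed meaning of D)
beta = np.zeros((p, k))
beta -= 1.0 * grad_E(beta, X, D)           # one gradient descent step
print(grad_E(beta, X, D).shape)            # (p, k)
```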
To illustrate the method, we use a dataset of $n$ images of size $p = 8 \times 8$, representing digits from 0 to 9 (so there are $k = 10$ classes). Figure 12.16 displays a few representative examples as well as 2-D and 3-D PCA projections. Figure (12.17) displays the "fuzzy" decision boundaries by visualizing the value of $h(x)$ using colors on a regular image grid.
not well captured by these models. In many cases (e.g. for text data) the input data is not even in a linear space, so one cannot even apply these models.
Kernel methods are a simple yet surprisingly powerful remedy for these issues. By lifting the features to a high dimensional embedding space, they allow one to generate non-linear decision and regression functions, while still re-using the machinery (linear system solvers or convex optimization algorithms) of linear models. Also, thanks to the so-called "kernel trick", the computational cost does not depend on the dimension of the embedding space, but on the number $n$ of points. It is the perfect example of so-called "non-parametric" methods, where the number of degrees of freedom (the number of variables involved when fitting the model) grows with the number of samples. This is often desirable when one wants the precision of the result to improve with $n$, and also to mathematically model the data using "continuous" models (e.g. functional spaces such as Sobolev spaces).
The general rule of thumb is that any machine learning algorithm which only makes use of inner products (and not directly of the features $x_i$ themselves) can be "kernelized" to obtain a non-parametric algorithm. This is for instance the case for linear and nearest neighbor regression, SVM classification, logistic classification and PCA dimensionality reduction. We first explain the general machinery, and then instantiate it on a few representative setups (ridge regression, nearest-neighbor regression and logistic classification).
As a warmup, we start with the regression using a square loss, `(y, z) = 12 (y − z)2 , so that
which is the ridge regression problem studied in Section 12.3.1. In this case, the solution is obtained by setting the gradient to zero, as
$$\beta^\star = (\Phi^* \Phi + \lambda \mathrm{Id}_{\mathcal{H}})^{-1} \Phi^* y.$$
It cannot be computed because a priori H can be infinite dimensional. Fortunately, the following Woodbury
matrix identity comes to the rescue. Note that this is exactly the formula (12.17) when using K in place of
Ĉ
Proof. Denoting U = (Φ∗ Φ + λIdH )−1 Φ∗ and V = Φ∗ (ΦΦ∗ + λIdRn )−1 , one has
(Φ∗ Φ + λIdH )U = Φ∗
and
(Φ∗ Φ + λIdH )V = (Φ∗ ΦΦ∗ + λΦ∗ )(ΦΦ∗ + λIdRn )−1 = Φ∗ (ΦΦ∗ + λIdRn )(ΦΦ∗ + λIdRn )−1 = Φ∗ .
Using this formula, one has the alternate formulation of the solution as
$$\beta^\star = \Phi^* c^\star \quad \text{where} \quad c^\star \stackrel{\text{def.}}{=} (K + \lambda \mathrm{Id}_{\mathbb{R}^n})^{-1} y,$$
This means that $c^\star \in \mathbb{R}^n$ can be computed from the knowledge of the kernel $\kappa$ alone, without the need to actually compute the lifted features $\varphi(x)$. If the kernel $\kappa$ can be evaluated efficiently, forming $K$ requires only $n^2$ kernel evaluations, and the computation of $c^\star$ can be done exactly in $O(n^3)$ operations, and approximately in $O(n^2)$ (which is the typical setup in machine learning where computing exact solutions is overkill).
From this optimal $c^\star$, one can compute the predicted value for any $x \in \mathcal{X}$ using, once again, only kernel evaluations, since
$$\langle \beta^\star, \varphi(x)\rangle_{\mathcal{H}} = \Big\langle \sum_i c_i^\star \varphi(x_i), \varphi(x)\Big\rangle_{\mathcal{H}} = \sum_i c_i^\star \kappa(x, x_i). \tag{12.25}$$
$$\kappa(x, x') = (x_1 x'_1 + x_2 x'_2 + c)^2 = \langle \varphi(x), \varphi(x')\rangle \quad \text{where} \quad \varphi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2, \sqrt{2c}\, x_1, \sqrt{2c}\, x_2, c)^* \in \mathbb{R}^6.$$
In Euclidean spaces, the Gaussian kernel is the most well known and used kernel
$$\kappa(x, y) \stackrel{\text{def.}}{=} e^{-\frac{\|x - y\|^2}{2\sigma^2}}. \tag{12.26}$$
The bandwidth parameter $\sigma > 0$ is crucial and controls the locality of the model. It is typically tuned through cross validation. It corresponds to an infinite dimensional lifting $x \mapsto e^{-\frac{\|x - \cdot\|^2}{2(\sigma/2)^2}} \in L^2(\mathbb{R}^p)$. Another related popular kernel is the Laplacian kernel $\exp(-\|x - y\|/\sigma)$. More generally, when considering translation invariant kernels $\kappa(x, x') = k(x - x')$ on $\mathbb{R}^p$, being positive definite is equivalent to $\hat k(\omega) > 0$ where $\hat k$ is the Fourier transform, and the associated lifting is obtained by considering $\hat h = \sqrt{\hat k}$ and $\varphi(x) = h(x - \cdot) \in L^2(\mathbb{R}^p)$.
Figure 12.18 shows an example of regression using a Gaussian kernel.
(Figure 12.18 panels: σ = 0.1, σ = 0.5, σ = 1, σ = 5.)
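A kernel ridge regression along these lines can be sketched as follows (1-D toy data and parameter values of our own; the solve computes $c^\star = (K + \lambda \mathrm{Id})^{-1} y$ and the prediction uses (12.25)).

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Gaussian kernel matrix kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, lam, sigma = 100, 1e-2, 0.5
x = rng.uniform(-1, 1, (n, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(x, x, sigma)
c = np.linalg.solve(K + lam * np.eye(n), y)          # c* = (K + lambda Id)^{-1} y

x_test = np.linspace(-1, 1, 200)[:, None]
y_pred = gaussian_kernel(x_test, x, sigma) @ c       # prediction sum_i c_i* kappa(x, x_i)
print(y_pred.shape)
```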
Kernels on non-Euclidean spaces. One can define a kernel $\kappa$ on a general space $\mathcal{X}$. A valid kernel should satisfy that, for any set of distinct points $(x_i)_i$, the associated kernel matrix $K = (\kappa(x_i, x_j))_{i,j}$ is positive definite (i.e. has strictly positive eigenvalues). This is equivalent to the existence of an embedding Hilbert space $\mathcal{H}$ and a feature map $\varphi : \mathcal{X} \to \mathcal{H}$ such that $\kappa(x, x') = \langle \varphi(x), \varphi(x')\rangle_{\mathcal{H}}$.
For instance, if $\mathcal{X} = \mathcal{P}(\mathcal{M})$ is the set of measurable sets of some measure space $(\mathcal{M}, \mu)$, one can define $\kappa(A, B) = \mu(A \cap B)$, which corresponds to the lifting $\varphi(A) = 1_A$, so that $\langle \varphi(A), \varphi(B)\rangle = \int 1_A(x) 1_B(x)\, d\mu(x) = \mu(A \cap B)$. It is also possible to define kernels on strings or between graphs, but this is more involved.
Note that in practice, it is often desirable to normalize the kernel so that it has a unit diagonal (for translation invariant kernels over $\mathbb{R}^p$, this is automatically achieved because $k(x, x) = k(0, 0)$ is constant). This corresponds to replacing $\varphi(x)$ by $\varphi(x)/\|\varphi(x)\|_{\mathcal{H}}$ and thus replacing $\kappa(x, x')$ by $\frac{\kappa(x, x')}{\sqrt{\kappa(x, x)\,\kappa(x', x')}}$.
where $c \in \mathbb{R}^n$ is a solution of
$$\min_{c \in \mathbb{R}^n} L(Kc, y) + \frac{\lambda}{2}\langle Kc, c\rangle_{\mathbb{R}^n} \tag{12.29}$$
where we defined
$$K \stackrel{\text{def.}}{=} \Phi \Phi^* = \big(\langle \varphi(x_i), \varphi(x_j)\rangle_{\mathcal{H}}\big)_{i,j=1}^n \in \mathbb{R}^{n \times n}.$$
$$0 \in \Phi^* \partial L(\Phi \beta^\star, y) + \lambda \beta^\star,$$
$$\beta^\star = -\frac{1}{\lambda}\Phi^* u^\star \in \mathrm{Im}(\Phi^*),$$
which is the desired result.
Equation (12.28) expresses the fact that the solution only lives in the $n$-dimensional space spanned by the lifted observed points $\varphi(x_i)$. In contrast to (12.27), the optimization problem (12.29) is a finite dimensional optimization problem, and in many cases of practical interest, it can be solved approximately in $O(n^2)$ (which is the cost of forming the matrix $K$). For large scale applications, this complexity can be further reduced by computing a low-rank approximation $K \approx \tilde\Phi^* \tilde\Phi$ (which is equivalent to computing an approximate embedding space of finite dimension).
For classification applications, one can use for $\ell$ a hinge or a logistic loss function, and then the decision boundary is computed following (12.25), using kernel evaluations only, as
$$\operatorname{sign}\big(\langle \beta^\star, \varphi(x)\rangle_{\mathcal{H}}\big) = \operatorname{sign}\Big(\sum_i c_i^\star \kappa(x, x_i)\Big).$$
Figure 12.19 illustrates such a non-linear decision function on a simple 2-D problem. Note that while the
decision boundary is not a straight line, it is a linear decision in a higher (here infinite dimensional) lifted
space H (and it is the projection of this linear decision space which makes it non-linear).
It is also possible to extend nearest neighbors classification (as detailed in Section 12.4.1) and regression over a lifted space by making use only of kernel evaluations, simply noticing that the pairwise distances can be computed as $\|\varphi(x) - \varphi(x')\|_{\mathcal{H}}^2 = \kappa(x, x) + \kappa(x', x') - 2\kappa(x, x')$.
where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is some loss function. In order to achieve this goal, the method selects a class of functions $\mathcal{F}$ and minimizes the empirical risk
$$\hat f = \operatorname*{argmin}_{f \in \mathcal{F}} \hat L(f) \stackrel{\text{def.}}{=} \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)).$$
the average error associated to the choice of some predicted value $z \in \mathcal{Y}$ at location $x \in \mathcal{X}$, one has the decomposition of the risk
$$L(f) = \int_{\mathcal{X}}\Big(\int_{\mathcal{Y}} \ell(y, f(x))\, d\mathbb{P}_{Y|X}(y|x)\Big)\, d\mathbb{P}_X(x) = \int_{\mathcal{X}} \alpha(f(x)|x)\, d\mathbb{P}_X(x).$$
Example 2 (Regression). For regression applications, where $\mathcal{Y} = \mathbb{R}$ and $\ell(y, z) = (y - z)^2$, one has $f^\star(x) = \mathbb{E}(Y|X = x) = \int_{\mathcal{Y}} y\, d\mathbb{P}_{Y|X}(y|x)$, which is the conditional expectation.
Example 3 (Classification). For classification applications, where $\mathcal{Y} = \{-1, 1\}$, it is convenient to introduce $\eta(x) \stackrel{\text{def.}}{=} \mathbb{P}_{Y|X}(y = 1|x) \in [0, 1]$. If the two classes are separable, then $\eta(x) \in \{0, 1\}$ on the support of $X$ (it is not defined elsewhere). For the 0-1 loss $\ell(y, z) = 1_{y \neq z} = 1_{\mathbb{R}^+}(-yz)$, one has $f^\star = \operatorname{sign}(2\eta - 1)$ and $L(f^\star) = \mathbb{E}_X(\min(\eta(X), 1 - \eta(X)))$. In practice, computing this $\eta$ is not possible from the data $\mathcal{D}$ alone, and minimizing the 0-1 loss is NP-hard for most $\mathcal{F}$. Considering a loss of the form $\ell(y, z) = \Gamma(-yz)$, the Bayes estimator then reads, in the fully non-parametric setup,
$$f^\star(x) = \operatorname*{argmin}_z \big\{ \alpha(z|x) = \eta(x)\Gamma(-z) + (1 - \eta(x))\Gamma(z) \big\}.$$
Calibration in the classification setup. A natural question is to ensure that in this (non realistic . . . )
setup, the final binary classifier sign(f ? ) is equal to sign(2η − 1), which is the Bayes classifier of the (non-
convex) 0-1 loss. In this case, the loss is said to be calibrated. Note that this does not mean that f ? is itself
equal to 2η − 1 of course. One has the following result.
Proposition 39. A loss ` associated to a convex Γ is calibrated if and only if Γ is differentiable at 0 and
Γ0 (0) > 0.
In particular, the hinge and logistic losses are thus calibrated. Denoting $L_\Gamma$ the risk associated to $\ell(y, z) = \Gamma(-yz)$, and denoting $\Gamma_0 = 1_{\mathbb{R}^+}$ the 0-1 loss, stronger quantitative controls are of the form
$$0 \leq L_{\Gamma_0}(f) - \inf L_{\Gamma_0} \leq \Psi\big(L_\Gamma(f) - \inf L_\Gamma\big) \tag{12.30}$$
for some increasing function $\Psi : \mathbb{R}^+ \to \mathbb{R}^+$. Such a control ensures in particular that if $f^\star$ minimizes $L_\Gamma$, it also minimizes $L_{\Gamma_0}$, and hence $\operatorname{sign}(f^\star) = \operatorname{sign}(2\eta - 1)$ and the loss is calibrated. One can show that the hinge loss enjoys such a quantitative control with $\Psi(r) = r$ and that the logistic loss has a worse control since it requires $\Psi(s) = \sqrt{s}$.
$$\hat f \stackrel{\text{def.}}{=} \operatorname*{argmin}_{f \in \mathcal{F}} \hat L(f)$$
is decomposed as the sum of the estimation (random) error and the approximation (deterministic) error
$$L(\hat f) - \inf L = \Big[ L(\hat f) - \inf_{\mathcal{F}} L \Big] + A(\mathcal{F}) \quad \text{where} \quad A(\mathcal{F}) \stackrel{\text{def.}}{=} \Big[ \inf_{\mathcal{F}} L - \inf L \Big]. \tag{12.31}$$
Approximation error. Bounding the approximation error fundamentally requires some hypothesis on $f^\star$. This is somehow the take home message of "no free lunch" results, which show that learning is not possible without regularity assumptions on the distribution of $(X, Y)$. We only give here a simple example.
Example 4 (Linearly parameterized functionals). A popular class of functions is that of linearly parameterized maps of the form
$$f(x) = f_w(x) = \langle \varphi(x), w\rangle_{\mathcal{H}},$$
where $\varphi : \mathcal{X} \to \mathcal{H}$ somehow "lifts" the data features to a Hilbert space $\mathcal{H}$. In the particular case where $\mathcal{X} = \mathbb{R}^p$ is already a (finite dimensional) Hilbert space, one can use $\varphi(x) = x$ and recover the usual linear methods. One can also consider for instance polynomial features, $\varphi(x) = (1, x_1, \ldots, x_p, x_1^2, x_1 x_2, \ldots)$, giving rise to polynomial regression and polynomial classification boundaries. One can then use a restricted class of functions of the form $\mathcal{F} = \{f_w \;;\; \|w\|_{\mathcal{H}} \leq R\}$ for some radius $R$. If one assumes for simplicity that $f^\star = f_{w^\star}$ is of this form, and that the loss $\ell(y, \cdot)$ is $Q$-Lipschitz, then the approximation error is bounded, using an orthogonal projection on this ball, by
$$A(\mathcal{F}) \leq Q\, \mathbb{E}(\|\varphi(x)\|_{\mathcal{H}})\, \max(\|w^\star\|_{\mathcal{H}} - R, 0).$$
Remark 2 (Connections with RKHS). Note that this lifting actually corresponds to using functions $f$ in a reproducing kernel Hilbert space, denoting
$$\|f\|_k \stackrel{\text{def.}}{=} \inf_{w \in \mathcal{H}} \{ \|w\|_{\mathcal{H}} \;;\; f = f_w \},$$
and the associated kernel is $k(x, x') \stackrel{\text{def.}}{=} \langle \varphi(x), \varphi(x')\rangle_{\mathcal{H}}$. But this is not important for our discussion here.
Estimation error. The estimation error can be bounded for arbitrary distributions by leveraging concentration inequalities (to control the impact of the noise) and using some bound on the size of $\mathcal{F}$.
The first simple but fundamental inequality bounds the estimation error by some uniform distance between $L$ and $\hat L$. Denoting $g \in \mathcal{F}$ an optimal estimator such that $L(g) = \inf_{\mathcal{F}} L$ (assuming for simplicity it exists), one has
$$L(\hat f) - \inf_{\mathcal{F}} L = \big[L(\hat f) - \hat L(\hat f)\big] + \big[\hat L(\hat f) - \hat L(g)\big] + \big[\hat L(g) - L(g)\big] \leq 2 \sup_{f \in \mathcal{F}} |\hat L(f) - L(f)| \tag{12.32}$$
since $\hat L(\hat f) - \hat L(g) \leq 0$. So the goal is "simply" to control $\Delta(\mathcal{D}) \stackrel{\text{def.}}{=} \sup_{\mathcal{F}} |\hat L - L|$, which is a random value
where the $\varepsilon_i$ are independent Bernoulli random variables (i.e. $\mathbb{P}(\varepsilon_i = \pm 1) = 1/2$). Note that $\mathcal{R}_n(\mathcal{G})$ actually depends on the distribution of $(X, Y)$ . . .
Here, one needs to apply this notion of complexity to the class of functions $\mathcal{G} = \ell[\mathcal{F}]$ defined as
$$\ell[\mathcal{F}] \stackrel{\text{def.}}{=} \{ (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto \ell(y, f(x)) \;;\; f \in \mathcal{F} \},$$
and one has the following control, which can be proved using a simple but powerful symmetrization trick.
Proposition 41. One has
$$\mathbb{E}(\Delta(\mathcal{D})) \leq 2\, \mathcal{R}_n(\ell[\mathcal{F}]).$$
If $\ell$ is $Q$-Lipschitz with respect to its second variable, one furthermore has $\mathcal{R}_n(\ell[\mathcal{F}]) \leq Q\, \mathcal{R}_n(\mathcal{F})$ (here the class of functions only depends on $x$).
Putting (12.31), (12.32), Propositions 40 and 41 together, one obtains the following final result.
Theorem 23. Assuming $\ell$ is $Q$-Lipschitz and bounded by $\ell_\infty$ on the support of $(Y, f(X))$, one has, with probability $1 - \delta$,
$$0 \leq L(\hat f) - \inf L \leq 2\ell_\infty \sqrt{\frac{2\log(1/\delta)}{n}} + 4Q\, \mathcal{R}_n(\mathcal{F}) + A(\mathcal{F}). \tag{12.34}$$
Example 5 (Linear models). In the case where $\mathcal{F} = \{\langle \varphi(\cdot), w\rangle_{\mathcal{H}} \;;\; \|w\| \leq R\}$, where $\|\cdot\|$ is some norm on $\mathcal{H}$, one has
$$\mathcal{R}_n(\mathcal{F}) \leq \frac{R}{n}\, \mathbb{E}\Big\| \sum_i \varepsilon_i \varphi(x_i) \Big\|_*$$
where $\|\cdot\|_*$ is the so-called dual norm
$$\|u\|_* \stackrel{\text{def.}}{=} \sup_{\|w\| \leq 1} \langle u, w\rangle_{\mathcal{H}}.$$
In the special case where $\|\cdot\| = \|\cdot\|_{\mathcal{H}}$ is Hilbertian, one can further simplify this expression, since $\|u\|_* = \|u\|_{\mathcal{H}}$, and
$$\mathcal{R}_n(\mathcal{F}) \leq \frac{R \sqrt{\mathbb{E}(\|\varphi(x)\|_{\mathcal{H}}^2)}}{\sqrt{n}}.$$
This result is powerful since the bound does not depend on the feature dimension (and it can even be applied in the RKHS setting where $\mathcal{H}$ is infinite dimensional). In this case, one sees that the convergence speed in (12.34) is of the order $1/\sqrt{n}$ (plus the approximation error). One should keep in mind that one also needs to select the "regularization" parameter $R$ to obtain the best possible trade-off. In practice, this is done by cross validation on the data themselves.
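This bound can be illustrated by a small Monte Carlo experiment (our own sketch, with $\varphi(x) = x$ and the expectation over the Rademacher signs estimated empirically, conditionally on a fixed draw of the $x_i$; exact values depend on the random draw).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, R, n_trials = 200, 10, 1.0, 2000

X = rng.standard_normal((n, p))                     # plays the role of the phi(x_i)
eps = rng.choice([-1.0, 1.0], size=(n_trials, n))   # the random signs eps_i
# Monte Carlo estimate of (R/n) E || sum_i eps_i phi(x_i) || ...
estimate = R / n * np.mean(np.linalg.norm(eps @ X, axis=1))
# ... compared with the bound R sqrt(E ||phi(x)||^2) / sqrt(n)
bound = R * np.sqrt(np.mean((X ** 2).sum(axis=1))) / np.sqrt(n)
print(estimate, bound, estimate <= bound)
```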
Example 6 (Application to SVM classification). One cannot use the result (12.34) in the case of the 0-1 loss $\ell(y, z) = 1_{z \neq y}$, since it is not Lipschitz (and also because minimizing $\hat L$ would be intractable). One can however apply it to a softer, piece-wise affine, upper-bounding loss $\ell_\rho(z, y) = \Gamma_\rho(-zy)$ for
$$\Gamma_\rho(s) \stackrel{\text{def.}}{=} \min\big(1, \max(0, 1 + s/\rho)\big).$$
This function is $1/\rho$-Lipschitz, and it is sandwiched between the 0-1 loss and a (scaled) hinge loss
$$\Gamma_0 \leq \Gamma_\rho \leq \Gamma_{\mathrm{SVM}}(\cdot/\rho) \quad \text{where} \quad \Gamma_{\mathrm{SVM}}(s) \stackrel{\text{def.}}{=} \max(1 + s, 0).$$
This allows one, after a change of variable $w \mapsto w/\rho$, to bound with probability $1 - \delta$ the 0-1 risk using an SVM risk, by applying (12.34):
$$L_{\Gamma_0}(\hat f) - \inf_{f \in \mathcal{F}_\rho} L_{\Gamma_{\mathrm{SVM}}}(f) \leq 2\sqrt{\frac{2\log(1/\delta)}{n}} + 4\,\frac{\sqrt{\mathbb{E}(\|\varphi(x)\|_{\mathcal{H}}^2)}/\rho}{\sqrt{n}}$$
where $\mathcal{F}_\rho \stackrel{\text{def.}}{=} \{ f_w = \langle \varphi(\cdot), w\rangle_{\mathcal{H}} \;;\; \|w\| \leq 1/\rho \}$. In practice, one rather solves a penalized version of the above risk (in its empirical version)
$$\min_w \hat L_{\Gamma_{\mathrm{SVM}}}(f_w) + \lambda \|w\|_{\mathcal{H}}^2 \tag{12.35}$$
From the optimal $c^\star$ solving this strictly convex program, the estimator is evaluated as $f_{w^\star}(x) = \langle w^\star, \varphi(x)\rangle = \sum_i c_i^\star k(x_i, x)$.