Chapter 12

Machine Learning

Gabriel Peyré
CNRS & DMA, École Normale Supérieure
[email protected]
https://mathematical-tours.github.io
www.numerical-tours.com
January 3, 2021
This chapter gives a rapid overview of the main concepts in machine learning. The goal is not to be exhaustive, but to highlight representative problems and insist on the distinction between unsupervised (visualization and clustering) and supervised (regression and classification) setups. We also shed light on the tight connections between machine learning and inverse problems.
While imaging science problems are generally concerned with processing a single datum (e.g. an image), machine learning is rather concerned with analyzing large collections of data. The focus (goal and performance measures) is thus radically different, but quite surprisingly, both fields use very similar tools and algorithms (in particular linear models and convex optimization).
and covariance
$$\hat C \stackrel{\text{def.}}{=} \frac{1}{n}\sum_{i=1}^n (x_i - \hat m)(x_i - \hat m)^* \in \mathbb{R}^{p \times p}. \tag{12.1}$$
Denoting $\tilde X \stackrel{\text{def.}}{=} X - \mathbb{1}_n \hat m^*$, one has $\hat C = \tilde X^* \tilde X / n$.
Note that if the points $(x_i)_i$ are modelled as i.i.d. variables, and denoting $x$ one of these random variables, one has, using the law of large numbers, the almost sure convergence, as $n \to +\infty$, of $\hat m$ and $\hat C$ toward the mean and covariance of $x$.
Figure 12.1: Empirical covariance of the data and its associated singular values.
One has that $\tilde x_i = \mathrm{Proj}_{\tilde T}(x_i)$ where $\tilde T \stackrel{\text{def.}}{=} \hat m + \mathrm{Span}_{k=1}^d(v_k)$ is an affine space.
Figure 12.3 shows an example of PCA for 2-D and 3-D visualization.
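As an illustration, here is a minimal Python sketch of PCA computed through the SVD of the centered data matrix (the function name, the random toy data and the shapes below are ours, not part of the text; any data matrix of shape $n \times p$, such as the Iris features, could be used instead).

```python
import numpy as np

def pca(X, d=2):
    """Minimal PCA sketch: center the data, then project on the top-d
    right singular vectors of the centered matrix (the eigenvectors of
    the empirical covariance C_hat = X_tilde^T X_tilde / n)."""
    m_hat = X.mean(axis=0)                  # empirical mean
    X_tilde = X - m_hat                     # centered data
    # thin SVD: X_tilde = U diag(s) V^T; columns of V = principal directions
    U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    V = Vt.T[:, :d]
    Z = X_tilde @ V                         # d-dimensional coordinates for visualization
    X_proj = m_hat + Z @ V.T                # projections back in R^p (the x_tilde_i)
    return Z, X_proj, s**2 / X.shape[0]     # scores, projections, eigenvalues of C_hat

X = np.random.randn(150, 4)                 # toy stand-in for the Iris features
Z, X_proj, lam = pca(X, d=2)
print(Z.shape, lam)
```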
Optimality analysis. We now show that, among all possible linear dimensionality reduction methods, PCA is optimal in the sense of the $\ell^2$ error. To simplify, and without loss of generality (since the mean can be subtracted from the data), we assume that the empirical mean is zero, $\hat m = 0$, so that $X = \tilde X$.
1 https://en.wikipedia.org/wiki/Iris_flower_data_set
Figure 12.3: 2-D and 3-D PCA visualization of the input clouds.
We recall that $X = \sqrt{n}\, U \operatorname{diag}(\sigma) V^\top$ and $\hat C = \frac{1}{n} X^\top X = V \Lambda V^\top$ where $\Lambda = \operatorname{diag}(\lambda_i = \sigma_i^2)$, with $\lambda_1 \geq \ldots \geq \lambda_r$.
The following proposition shows that PCA is optimal in terms of $\ell^2$ distance if one considers only affine spaces. This means that we consider the following compression/decompression to a dimension $k$ (i.e. dimensionality reduction and then expansion)
$$\min_{R, S \in \mathbb{R}^{p \times k}} f(R, S) \stackrel{\text{def.}}{=} \sum_i \|x_i - R S^\top x_i\|_{\mathbb{R}^p}^2. \tag{12.5}$$
Note that this minimization is a priori not trivial to solve because, although f (·, S) and f (R, ·) are convex,
f is not jointly convex. So iteratively minimizing on R and S might fail to converge to a global minimizer.
This section aims at proving the following theorem.
Theorem 20. A solution of (12.5) is $S = R = V_{1:k} \stackrel{\text{def.}}{=} [v_1, \ldots, v_k]$.
Proof. We consider an arbitrary pair $(R, S)$. Since the matrix $R S^\top$ has rank $k' \leq k$, let $W \in \mathbb{R}^{p \times k'}$ be an ortho-basis of $\mathrm{Im}(R S^\top)$, so that $W^\top W = \mathrm{Id}_{k' \times k'}$. We remark that
$$\operatorname*{argmin}_z \|x - W z\|^2 = W^\top x,$$
because the first order condition for this problem reads $W^\top(W z - x) = 0$. Hence, denoting $R S^\top x_i = W z_i$ for some $z_i \in \mathbb{R}^{k'}$,
$$f(R, S) = \sum_i \|x_i - R S^\top x_i\|^2 = \sum_i \|x_i - W z_i\|^2 \geq \sum_i \|x_i - W W^\top x_i\|^2 \geq f(\tilde W, \tilde W),$$
where $\tilde W \in \mathbb{R}^{p \times k}$ is obtained by completing $W$ with $k - k'$ additional orthonormal columns.
Proof. Using the previous lemma, one can consider only $R = S$ with $S^\top S = \mathrm{Id}_k$, so that one needs to solve
$$f(S, S) = \sum_i \|x_i - S S^\top x_i\|^2 = \sum_i \|x_i\|^2 - 2\, x_i^\top S S^\top x_i + x_i^\top S (S^\top S) S^\top x_i.$$
Figure 12.4: Left: proof of the rightmost inequality in (12.6). Middle: matrix B, right: matrix B̃.
The next lemma provides an upper bound on the quantity being minimized as the solution of a convex
optimization problem (a linear program). The proof of the theorem follows by showing that this upper
bound is actually reached, which provides a certificate of optimality.
Lemma 9. Denoting $C = V \Lambda V^\top$, one has
$$\operatorname{tr}(S^\top C S) \;\leq\; \max_{\beta \in \mathbb{R}^p} \Big\{ \sum_{i=1}^p \lambda_i \beta_i \;;\; 0 \leq \beta \leq 1,\ \sum_i \beta_i \leq k \Big\} \;=\; \sum_{i=1}^k \lambda_i, \tag{12.6}$$
where we denoted $B \stackrel{\text{def.}}{=} V^\top S \in \mathbb{R}^{p \times k}$, $(b_i)_{i=1}^p$ with $b_i \in \mathbb{R}^k$ the rows of $B$, and $\beta_i \stackrel{\text{def.}}{=} \|b_i\|^2 \geq 0$, so that $\operatorname{tr}(S^\top C S) = \operatorname{tr}(B^\top \Lambda B) = \sum_{i=1}^p \lambda_i \beta_i$.
We extend the $k$ columns of $B$ into an orthogonal basis $\tilde B \in \mathbb{R}^{p \times p}$ such that $\tilde B \tilde B^\top = \tilde B^\top \tilde B = \mathrm{Id}_p$, so that $\beta_i \leq \|\tilde b_i\|^2 = 1$ and $\sum_i \beta_i = \|B\|_F^2 \leq \|S\|_F^2 = k$, and hence $(\beta_i)_{i=1}^p$ satisfies the constraints of the considered optimization problem; hence $\operatorname{tr}(S^\top C S)$ is necessarily smaller than the maximum possible value.
For the proof of the second upper bound (the rightmost equality in (12.6)), we only verify it in 2-D and 3-D using a drawing, see Figure 12.4, left.
Proof of Theorem 20. Setting $S = V_{1:k} = [v_1, \ldots, v_k]$, it satisfies $C S = V \Lambda V^\top V_{1:k} = V_{1:k} \operatorname{diag}(\lambda_i)_{i=1}^k$ and hence
$$\operatorname{tr}(S^\top C S) = \operatorname{tr}\big(S^\top S \operatorname{diag}(\lambda_i)_{i=1}^k\big) = \operatorname{tr}\big(\mathrm{Id}_k \operatorname{diag}(\lambda_i)_{i=1}^k\big) = \sum_{i=1}^k \lambda_i.$$
This value matches the right-most upper bound of Lemma 9, which shows that this S is optimal.
12.1.2 Clustering and k-means
A typical unsupervised learning task is to infer a class label $y_i \in \{1, \ldots, k\}$ for each input point $x_i$; this is often called a clustering problem (since the set of points associated to a given label can be thought of as a cluster).
k-means A way to infer these labels is by assuming that the clusters are compact, and optimizing some
compactness criterion. Assuming for simplicity that the data are in Euclidean space (which can be relaxed
to an arbitrary metric space, although the computations become more complicated), the k-means approach
minimizes the distance between the points and their class centroids $c = (c_\ell)_{\ell=1}^k$, where each $c_\ell \in \mathbb{R}^p$. The corresponding variational problem becomes
$$\min_{(y, c)} E(y, c) \stackrel{\text{def.}}{=} \sum_{\ell=1}^k \sum_{i : y_i = \ell} \|x_i - c_\ell\|^2.$$
If, during the iterations, one of the clusters associated to some $c_\ell$ becomes empty, then one can either decide to destroy it and replace $k$ by $k-1$, or try to "teleport" the center $c_\ell$ to another location (this might increase the objective function $E$, however).
Since the energy $E$ decreases during each of these two steps, it converges to some limit value. Since there is a finite number of possible label assignments, it is actually constant after a finite number of iterations, and the algorithm stops.
Of course, since the energy is non-convex, little can be said about the properties of the clusters output by k-means. To try to reach lower energy levels, it is possible to "teleport", during the iterations, centroids $c_\ell$ associated to clusters with high energy to locations within clusters with lower energy (because optimal solutions should somehow balance the energy).
Figure 12.6 shows an example of k-means iterations on the Iris dataset.
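The two alternating steps can be sketched in a few lines of Python (a toy implementation under our own naming, not the code used to produce the figures; the assignment and centroid updates correspond to the steps (12.7) and (12.8) referred to in the text).

```python
import numpy as np

def kmeans(X, k, n_iter=50, rng=np.random.default_rng(0)):
    """Minimal k-means (Lloyd) sketch: alternate label and centroid updates,
    each of which decreases the energy E(y, c)."""
    n, p = X.shape
    c = X[rng.choice(n, size=k, replace=False)]        # centroids initialized on data points
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        y = d2.argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster (kept if empty)
        for l in range(k):
            if np.any(y == l):
                c[l] = X[y == l].mean(axis=0)
    return y, c

X = np.random.randn(200, 2)
labels, centroids = kmeans(X, k=3)
print(np.bincount(labels, minlength=3), centroids.shape)
```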
k-means++ To obtain good results when using k-means, it is crucial to have an efficient initialization scheme. In practice, the best results are obtained by seeding the initial centroids as far as possible from one another (a greedy strategy works great in practice).
Quite surprisingly, there exists a randomized seeding strategy which can be shown to be close to optimal in terms of the value of $E$, even without running the k-means iterations (although in practice they are still needed to polish the result). The corresponding k-means++ initialization is obtained by selecting $c_1$ uniformly at random among the $x_i$, and then, assuming $c_1, \ldots, c_{\ell-1}$ have been seeded, drawing $c_\ell$ among the samples according to the probability $\pi^{(\ell)}$ on $\{1, \ldots, n\}$ proportional to the squared distance to the previously seeded centers
$$\forall\, i \in \{1, \ldots, n\}, \quad \pi_i^{(\ell)} \stackrel{\text{def.}}{=} \frac{d_i^2}{\sum_{j=1}^n d_j^2} \quad \text{where} \quad d_i \stackrel{\text{def.}}{=} \min_{1 \leq r \leq \ell-1} \|x_i - c_r\|.$$
Figure 12.6: Left: iterations of the k-means algorithm (Iter #1, #2, #3, #16). Right: histogram of points belonging to each class after the k-means optimization.
This means that points which are located far away from the previously seeded centers are more likely to be picked.
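A minimal sketch of this seeding strategy is given below (our own toy implementation, assuming Euclidean data stored as the rows of a NumPy array).

```python
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    """Minimal k-means++ seeding sketch: the first center is uniform among the x_i,
    each next center is drawn with probability proportional to the squared
    distance d_i^2 to the already seeded centers."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        pi = d2 / d2.sum()                    # the probability pi^(l) of the text
        centers.append(X[rng.choice(n, p=pi)])
    return np.array(centers)

X = np.random.randn(500, 2)
print(kmeans_pp_init(X, k=4))
```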
The following result, due to David Arthur and Sergei Vassilvitskii, shows that this seeding is optimal up to a log factor on the energy. Note that finding a global optimum is known to be NP-hard.
Theorem 21. For the centroids c? defined by the k-means++ strategy, denoting y ? the associated nearest
neighbor labels defined as in (12.7), one has
Lloyd algorithm and continuous densities. The k-means iterations are also called the "Lloyd" algorithm, which also finds applications in optimal vector quantization for compression. It can also be used in the "continuous" setting where the empirical samples $(x_i)_i$ are replaced by an arbitrary measure $\mu$ over $\mathbb{R}^p$. The energy to minimize becomes
$$\min_{(\mathcal{V}, c)} \sum_{\ell=1}^k \int_{\mathcal{V}_\ell} \|x - c_\ell\|^2\, d\mu(x)$$
where $(\mathcal{V}_\ell)_\ell$ is a partition of the domain. Step (12.7) is replaced by the computation of a Voronoi cell
$$\mathcal{C}_\ell \stackrel{\text{def.}}{=} \{ x \;;\; \forall\, \ell' \neq \ell,\ \|x - c_\ell\| \leq \|x - c_{\ell'}\| \}.$$
These Voronoi cells are polyhedra delimited by pieces of the perpendicular bisectors between centroids, and this Voronoi segmentation can be computed efficiently using tools from algorithmic geometry in low dimension. Step (12.8) is then replaced by
$$\forall\, \ell \in \{1, \ldots, k\}, \quad c_\ell \leftarrow \frac{\int_{\mathcal{C}_\ell} x\, d\mu(x)}{\int_{\mathcal{C}_\ell} d\mu(x)}.$$
In the case where $\mu$ is the uniform distribution, the optimal solution corresponds to the hexagonal lattice. Figure 12.7 displays two examples of Lloyd iterations on 2-D densities on a square domain.
Figure 12.7: Iterations of the k-means algorithm (Lloyd algorithm) on continuous densities $\mu$. Top: uniform. Bottom: non-uniform (the density of $\mu$ with respect to the Lebesgue measure is displayed as a grayscale image in the background).
In order to make the problem tractable computationally, and also in order to obtain efficient prediction scores, it is important to restrict the fit to the data $y_i \approx f(x_i)$ using a "small enough" class of functions. Intuitively, in order to avoid overfitting, the "size" of this class of functions should grow with the number $n$ of samples.
Here $L : \mathcal{Y}^2 \to \mathbb{R}^+$ is the so-called loss function, and it should typically satisfy $L(y, y') = 0$ if and only if $y = y'$. The specifics of $L$ depend on the application at hand (in particular, one should use different losses for classification and regression tasks). To highlight the dependency of $\hat f$ on $n$, we occasionally write $\hat f_n$.
Intuitively, one should have $\hat f_n \to \bar f$ as $n \to +\infty$, which can be captured by the expectation of the prediction error over the samples $(x_i, y_i)_i$, i.e.
One should be careful that here the expectation is over both x (distributed according to the marginal πX
of π on X ), and also the n i.i.d. pairs (xi , yi ) ∼ π used to define fˆn (so a better notation should rather be
$(x_i, y_i)_i$). Here $\tilde L$ is some loss function on $\mathcal{Y}$ (one can use $\tilde L = L$ for instance). One can also study convergence in probability, i.e.
$$\forall\, \varepsilon > 0, \quad E_{\varepsilon, n} \stackrel{\text{def.}}{=} \mathbb{P}\big(\tilde L(\hat f_n(x), \bar f(x)) > \varepsilon\big) \to 0.$$
If this holds, then one says that the estimation method is consistent (in expectation or in probability). The question is then to derive convergence rates, i.e. to upper bound $E_n$ or $E_{\varepsilon, n}$ by some explicit decay rate.
Note that when $\tilde L(y, y') = |y - y'|^r$, convergence in expectation is stronger than (i.e. implies) convergence in probability, since, using Markov's inequality,
$$E_{\varepsilon, n} = \mathbb{P}(|\hat f_n(x) - \bar f(x)|^r > \varepsilon) \leq \frac{1}{\varepsilon}\,\mathbb{E}(|\hat f_n(x) - \bar f(x)|^r) = \frac{E_n}{\varepsilon}.$$
where $J$ is some regularization function, for instance $J = \|\cdot\|_2^2$ (to avoid blowing-up of the parameter) or $J = \|\cdot\|_1$ (to perform model selection, i.e. to use only a sparse set of features among a possibly very large pool of $p$ features). Here $\lambda_n > 0$ is a regularization parameter, and it should tend to 0 when $n \to +\infty$.
Then one similarly defines the ideal parameter β̄ as in (12.10) so that the limiting estimator as n → +∞
is of the form f¯ = f (·, β̄) for β̄ defined as
$$\bar\beta \in \operatorname*{argmin}_\beta \int_{\mathcal{X} \times \mathcal{Y}} L(f(x, \beta), y)\, d\pi(x, y) = \mathbb{E}_{(x, y) \sim \pi}\big(L(f(x, \beta), y)\big). \tag{12.12}$$
Prediction vs. estimation risks. In this parametric approach, one could also be interested in studying how close $\hat\beta$ is to $\bar\beta$. This can be measured by controlling how fast some estimation error $\|\hat\beta - \bar\beta\|$ (for some
norm || · ||) goes to zero. Note however that in most cases, controlling the estimation error is more difficult
than doing the same for the prediction error. In general, doing a good parameter estimation implies doing
a good prediction, but the converse is not true.
which converges to $\mathbb{E}(L(\hat f(x), y))$ for large $\bar n$. Minimizing $R_{\bar n}$ to set some meta-parameter of the method (for instance the regularization parameter $\lambda_n$) is called "cross-validation" in the literature.
Figure 12.9: Conditional expectation.
Least square and conditional expectation. If one does not put any constraint on $f$ (besides being measurable), then the optimal limit estimator $\bar f(x)$ defined in (12.10) is simply averaging the values $y$ sharing the same $x$, which is the so-called conditional expectation. Assuming for simplicity that $\pi$ has some density $\frac{d\pi}{dx\, dy}$ with respect to a tensor product measure $dx \otimes dy$ (for instance the Lebesgue measure), one has
$$\forall\, x \in \mathcal{X}, \quad \bar f(x) = \mathbb{E}(y\,|\,x) = \frac{\int_{\mathcal{Y}} y\, \frac{d\pi}{dx\, dy}(x, y)\, dy}{\int_{\mathcal{Y}} \frac{d\pi}{dx\, dy}(x, y)\, dy}.$$
Penalized linear models. A very simple class of models is obtained by imposing that $f$ is linear, and setting $f(x, \beta) = \langle x, \beta\rangle$, for parameters $\beta \in \mathcal{B} = \mathbb{R}^p$. Note that one can also treat affine functions this way, by remarking that $\langle x, \beta\rangle + \beta_0 = \langle (x, 1), (\beta, \beta_0)\rangle$ and replacing $x$ by $(x, 1)$. So in the following, without loss of generality, we only treat the vectorial (non-affine) case.
Under the square loss, the regularized ERM (12.11) is conveniently rewritten as
$$\hat\beta \in \operatorname*{argmin}_{\beta \in \mathcal{B}} \frac{1}{2}\langle \hat C \beta, \beta\rangle - \langle \hat u, \beta\rangle + \lambda_n J(\beta) \tag{12.14}$$
Figure 12.10: Linear regression.
where we introduced the empirical correlation (already introduced in (12.1)) and observations
$$\hat C \stackrel{\text{def.}}{=} \frac{1}{n} X^* X = \frac{1}{n}\sum_{i=1}^n x_i x_i^* \quad \text{and} \quad \hat u \stackrel{\text{def.}}{=} \frac{1}{n}\sum_{i=1}^n y_i x_i = \frac{1}{n} X^* y \in \mathbb{R}^p.$$
As $n \to +\infty$, under weak conditions on $\pi$, one has with the law of large numbers the almost sure convergence
$$\hat C \to C \stackrel{\text{def.}}{=} \mathbb{E}(x x^*) \quad \text{and} \quad \hat u \to u \stackrel{\text{def.}}{=} \mathbb{E}(y\, x). \tag{12.15}$$
When considering $\lambda_n \to 0$, in some cases, one can show that in the limit $n \to +\infty$, one retrieves the following ideal parameter
$$\bar\beta \in \operatorname*{argmin}_\beta \{ J(\beta) \;;\; C\beta = u \}.$$
Problem (12.14) is equivalent to the regularized resolution of inverse problems (8.9), with $\hat C$ in place of $\Phi$ and $\hat u$ in place of $\Phi^* y$. The major, and in fact only, difference between machine learning and inverse problems is that the linear operator is also noisy, since $\hat C$ can be viewed as a noisy version of $C$. The "noise level", in this setting, is $1/\sqrt{n}$ in the sense that
$$\mathbb{E}(\|\hat C - C\|) \sim \frac{1}{\sqrt{n}} \quad \text{and} \quad \mathbb{E}(\|\hat u - u\|) \sim \frac{1}{\sqrt{n}},$$
under the assumption that $\mathbb{E}(y^4) < +\infty$ and $\mathbb{E}(\|x\|^4) < +\infty$, to ensure that one can use the central limit theorem on $x^2$ and $xy$. Note that, although we use here a linear estimator, one does not need to assume a "linear" relation of the form $y = \langle x, \beta\rangle + w$ with a noise $w$ independent from $x$, but rather hopes to do "as well as possible", i.e. to estimate a linear model as close as possible to $\bar\beta$.
The general take home message is that it is possible to generalize Theorems 10, 12 and 13 to cope with
the noise on the covariance matrix to obtain prediction convergence rates of the form
Ridge regression (quadratic penalization). For $J = \|\cdot\|^2/2$, the estimator (12.14) is obtained in closed form as
$$\hat\beta = (X^* X + n \lambda_n \mathrm{Id}_p)^{-1} X^* y = (\hat C + \lambda_n \mathrm{Id}_p)^{-1} \hat u. \tag{12.16}$$
This is often called ridge regression in the literature. Note that thanks to the Woodbury formula, this
estimator can also be re-written as
If $n \gg p$ (which is the usual setup in machine learning), then (12.16) is preferable. In some cases however (in particular when using RKHS techniques), it makes sense to consider a very large $p$ (even infinite dimensional), so that (12.17) must be used.
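The equivalence between the two formulas can be checked numerically, as in the following sketch (assuming real-valued features, so that $X^* = X^\top$, and assuming (12.17) denotes the Woodbury form $\hat\beta = X^*(X X^* + n\lambda_n \mathrm{Id}_n)^{-1} y$; the toy data is ours).

```python
import numpy as np

n, p, lam = 200, 5, 0.1
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# form (12.16): a p x p linear system, convenient when n >> p
C_hat = X.T @ X / n
u_hat = X.T @ y / n
beta_primal = np.linalg.solve(C_hat + lam * np.eye(p), u_hat)

# Woodbury form (assumed to be (12.17)): an n x n system, the one that
# survives when p is very large or infinite (kernel methods)
beta_dual = X.T @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), y)

print(np.allclose(beta_primal, beta_dual))   # both formulas agree
```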
If $\lambda_n \to 0$, then using (12.15), one has the convergence in expectation and probability
$$\hat\beta \to \bar\beta = C^+ u.$$
Theorems 10 and 12 can be extended to this setting and one obtains the following result.
Theorem 22. If
$$\bar\beta = C^\gamma z \quad \text{where} \quad \|z\| \leq \rho \tag{12.18}$$
for $0 < \gamma \leq 2$, then
$$\mathbb{E}(\|\hat\beta - \bar\beta\|^2) \leq C \rho^{\frac{2}{\gamma+1}}\, n^{-\frac{\gamma}{\gamma+1}} \tag{12.19}$$
for a constant $C$ depending only on $\gamma$.
It is important to note that, since $\bar\beta = C^+ u$, the source condition (12.18) is always satisfied. What truly matters here is that the rate (12.19) does not depend on the dimension $p$ of the features, but rather only on
ρ, which can be much smaller. This theoretical analysis actually works perfectly fine in infinite dimension
p = ∞ (which is the setup considered when dealing with RKHS below).
(Figure panels: k = 1, k = 5, k = 10, k = 40.)
In practice, the parameter $R$ can be set through cross-validation, by minimizing the testing risk $R_{\bar n}$ defined in (12.13), which typically uses a 0-1 loss counting the number of mis-classifications
$$R_{\bar n} \stackrel{\text{def.}}{=} \sum_{j=1}^{\bar n} \delta\big(\bar y_j - \hat f(\bar x_j)\big)$$
where $\delta(0) = 0$ and $\delta(s) = 1$ if $s \neq 0$. Of course the method extends to an arbitrary metric space in place of the Euclidean space $\mathbb{R}^p$ for the features. Note also that, instead of explicitly sorting all the Euclidean distances, one can use fast nearest neighbor search methods.
Figure 12.12 shows, for the IRIS dataset, the classification domains (i.e. $\{x \;;\; f(x) = \ell\}$ for $\ell = 1, \ldots, k$) using a 2-D projection for visualization. Increasing $R$ leads to smoother class boundaries.
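The following short sketch illustrates nearest-neighbor classification and the selection of $R$ by a held-out testing risk (toy Gaussian data and function names of our own; the Iris features could be substituted).

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, R):
    """R-nearest-neighbors sketch: majority vote among the R closest training points."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d2, axis=1)[:, :R]               # indices of the R nearest neighbors
    return np.array([np.bincount(v).argmax() for v in y_train[idx]])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2.5, 1, (100, 2))])
y = np.repeat([0, 1], 100)
perm = rng.permutation(200)
tr, te = perm[:150], perm[150:]                       # train / held-out split

for R in (1, 5, 10, 40):
    err = np.mean(knn_predict(X[tr], y[tr], X[te], R) != y[te])   # 0-1 testing risk
    print(f"R = {R:2d}: test error = {err:.3f}")
```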
Approximate risk minimization. The hard classifier is defined from a linear predictor $\langle x, \beta\rangle$ as $\operatorname{sign}(\langle x, \beta\rangle) \in \{-1, +1\}$. The 0-1 loss error function (somehow the "ideal" loss) counts the number of mis-classifications, and the ideal classifier can be computed as
$$\min_\beta \sum_{i=1}^n \ell_0(-y_i \langle x_i, \beta\rangle) \tag{12.20}$$
Figure 12.13: 1-D and 2-D logistic classification, showing the impact of ||β|| on the sharpness of the
classification boundary.
where $\ell_0 = 1_{\mathbb{R}^+}$. Indeed, a mis-classification corresponds to $\langle x_i, \beta\rangle$ and $y_i$ having different signs, so that in this case $\ell_0(-y_i\langle x_i, \beta\rangle) = 1$ (and 0 otherwise for a correct classification).
The function $\ell_0$ is non-convex and hence problem (12.20) is itself non-convex and, in full generality, can be shown to be NP-hard to solve. One thus relies on proxies, which are functions that upper-bound $\ell_0$ and are convex (and sometimes differentiable).
The most celebrated proxies are
$$\ell(u) = \max(1 + u, 0) \quad \text{and} \quad \ell(u) = \frac{\log(1 + e^u)}{\log(2)},$$
which are respectively the hinge loss (corresponding to support vector machines (SVM), and non-smooth) and the logistic loss (which is smooth). The $1/\log(2)$ factor is just a constant which ensures $\ell_0 \leq \ell$. AdaBoost uses $\ell(u) = e^u$. Note that least squares corresponds to using $\ell(u) = (1+u)^2$, but this is a poor proxy for $\ell_0$ for negative values, although it might work well in practice. Note that SVM is a non-smooth problem, which can be cast as a linear program minimizing the so-called classification margin
$$\min_{u \geq 0,\ \beta} \Big\{ \sum_i u_i \;;\; u_i \geq 1 - y_i \langle x_i, \beta\rangle \Big\}.$$
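The upper-bounding property $\ell_0 \leq \ell$ of these proxies can be checked numerically, as in the short sketch below (written in terms of $u = -y\langle x, \beta\rangle$ as in (12.20); the explicit expressions used for the hinge and normalized logistic losses are the ones assumed above).

```python
import numpy as np

# losses as functions of u = -y <x, beta>, as in (12.20)
ell0     = lambda u: (u >= 0).astype(float)           # 0-1 loss 1_{R+}
hinge    = lambda u: np.maximum(1 + u, 0)             # SVM proxy
logistic = lambda u: np.log1p(np.exp(u)) / np.log(2)  # logistic proxy, scaled by 1/log(2)

u = np.linspace(-3, 3, 601)
print(np.all(hinge(u) >= ell0(u)))       # True: the hinge loss upper-bounds ell_0
print(np.all(logistic(u) >= ell0(u)))    # True: so does the normalized logistic loss
```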
Logistic loss probabilistic interpretation. Logistic classification can be understood as a linear model
as introduced in Section 12.3.1, although the decision function f (·, β) is not linear. Indeed, one needs
to "remap" the linear value $\langle x, \beta\rangle$ into the interval $[0, 1]$. In logistic classification, we define the predicted probability of $x$ belonging to the class with label $-1$ as
$$f(x, \beta) \stackrel{\text{def.}}{=} \theta(\langle x, \beta\rangle) \quad \text{where} \quad \theta(s) \stackrel{\text{def.}}{=} \frac{e^s}{1 + e^s} = (1 + e^{-s})^{-1}, \tag{12.21}$$
which is often called the "logit" model. Using a linear decision model might seem overly simplistic, but in high dimension $p$, the number of degrees of freedom is actually enough to reach surprisingly good classification performances. Note that the probability of belonging to the second class is $1 - f(x, \beta) = \theta(-s)$. This symmetry of the $\theta$ function is important because it means that both classes are treated equally, which makes sense for "balanced" problems (where the total masses of the classes are roughly equal).
Intuitively, $\beta/\|\beta\|$ controls the direction of the separating hyperplane, while $1/\|\beta\|$ is roughly the fuzziness of the separation. As $\|\beta\| \to +\infty$, one obtains a sharp decision boundary, and logistic classification resembles SVM.
Note that f (x, β) can be interpreted as a single layer perceptron with a logistic (sigmoid) rectifying unit,
more details on this in Chapter 15.
Figure 12.14: Comparison of loss functions (binary 0-1, logistic, hinge). [ToDo: Re-do the figure, it is not correct, they should upper bound $\ell_0$]
Since the $(x_i, y_i)$ are modeled as i.i.d. variables, it makes sense to define $\hat\beta$ from the observations using a maximum likelihood, assuming that each $y_i$ conditioned on $x_i$ is a Bernoulli variable with associated probability $(p_i, 1 - p_i)$ with $p_i = f(x_i, \beta)$. The probability of observing $y_i$ is thus, denoting $s_i = \langle x_i, \beta\rangle$,
$$\mathbb{P}(y = y_i \,|\, x = x_i) = p_i^{1 - \bar y_i}(1 - p_i)^{\bar y_i} = \Big(\frac{e^{s_i}}{1 + e^{s_i}}\Big)^{1 - \bar y_i}\Big(\frac{1}{1 + e^{s_i}}\Big)^{\bar y_i},$$
where we denoted $\bar y_i = \frac{y_i + 1}{2} \in \{0, 1\}$.
One can then minimize minus the sum of the log of the likelihoods, which reads
$$\hat\beta \in \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n -\log\big(\mathbb{P}(y = y_i \,|\, x = x_i)\big) = \sum_{i=1}^n -(1 - \bar y_i)\log\frac{e^{s_i}}{1 + e^{s_i}} - \bar y_i \log\frac{1}{1 + e^{s_i}}.$$
Some algebraic manipulations show that this is equivalent to an ERM-type form (12.11) with a logistic loss function
$$\hat\beta \in \operatorname*{argmin}_{\beta \in \mathbb{R}^p} E(\beta) \stackrel{\text{def.}}{=} \frac{1}{n}\sum_{i=1}^n L(\langle x_i, \beta\rangle, y_i) \tag{12.22}$$
where the logistic loss reads
$$L(s, y) \stackrel{\text{def.}}{=} \log(1 + \exp(-sy)). \tag{12.23}$$
Problem (12.22) is a smooth convex minimization. If X is injective, E is also strictly convex, hence it has a
single global minimum.
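A minimal gradient descent on (12.22) can be sketched as follows (toy data, step size and function name are ours; the gradient uses $\partial_s L(s, y) = -y/(1 + e^{sy})$).

```python
import numpy as np

def logistic_regression(X, y, n_iter=2000, tau=0.5):
    """Gradient descent sketch on E(beta) = (1/n) sum_i log(1 + exp(-y_i <x_i, beta>)),
    with labels y_i in {-1, +1}."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        s = X @ beta
        grad = -(X.T @ (y / (1 + np.exp(y * s)))) / n   # gradient of the logistic ERM
        beta -= tau * grad
    return beta

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.repeat([-1.0, 1.0], 100)
beta = logistic_regression(X, y)
print(beta, np.mean(np.sign(X @ beta) == y))            # fitted vector, training accuracy
```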
Figure (12.14) compares the binary (ideal) 0-1 loss, the logistic loss and the hinge loss (the one used for
SVM).
Figure 12.15: Influence of the separation distance between the classes on the classification probability.
Note that in the case of $k = 2$ classes, $\mathcal{Y} = \{-1, 1\}$, this model can be shown to be equivalent to the two-class logistic classification method exposed in Section 12.4.2, with a solution vector equal to $\beta_1 - \beta_2$ (so it is computationally more efficient to only consider a single vector as we did).
The computation of the LSE operator is unstable for large values of $S_{i,\ell}$ (numerical overflow, producing NaN), but this can be fixed by subtracting the largest element in each row, since
LSE(S + a) = LSE(S) + a
Figure 12.16: 2-D and 3-D PCA visualization of the digits images.
Figure 12.17: Results of digit classification. Left: probability $h(x)_\ell$ of belonging to each of the 9 first classes (displayed over a 2-D PCA space). Right: colors reflect the probability $h(x)$ of belonging to the classes.
if $a$ is constant along the rows. This is often referred to as the "LSE trick" and is very important to use in practice (in particular if some classes are well separated, since the corresponding $\beta_\ell$ vector might become large).
The gradient of the LSE operator is the soft-max operator
$$\nabla \mathrm{LSE}(S) = \mathrm{SM}(S) \stackrel{\text{def.}}{=} \Big(\frac{e^{S_{i,\ell}}}{\sum_m e^{S_{i,m}}}\Big)_{i,\ell}.$$
Similarly to the LSE, it needs to be stabilized by subtracting the maximum value along the rows before computation.
Once the matrix $D$ is computed, the gradient of $E$ is computed as
$$\nabla E(\beta) = \frac{1}{n} X^* \big(\mathrm{SM}(X\beta) - D\big),$$
and one can minimize $E$ using for instance a gradient descent scheme.
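The stabilized soft-max and the resulting gradient can be sketched as follows (our own toy code; we assume, as the notation suggests, that $D$ is the $n \times k$ one-hot encoding of the observed classes).

```python
import numpy as np

def softmax_rows(S):
    """Row-wise soft-max, stabilized by the LSE trick: subtracting the row maximum
    (a constant along each row) leaves SM(S) unchanged and avoids overflow."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def grad_E(beta, X, D):
    """Gradient (1/n) X^T (SM(X beta) - D) of the multi-class logistic energy."""
    return X.T @ (softmax_rows(X @ beta) - D) / X.shape[0]

rng = np.random.default_rng(0)
n, p, k = 50, 8, 3
X = rng.standard_normal((n, p))
D = np.eye(k)[rng.integers(k, size=n)]     # one-hot matrix of the labels (assumed meaning of D)
beta = np.zeros((p, k))
beta -= 1.0 * grad_E(beta, X, D)           # one gradient descent step
print(grad_E(beta, X, D).shape)            # (p, k)
```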
To illustrate the method, we use a dataset of $n$ images of size $p = 8 \times 8$, representing digits from 0 to 9 (so there are $k = 10$ classes). Figure 12.16 displays a few representative examples as well as 2-D and 3-D PCA projections. Figure (12.17) displays the "fuzzy" decision boundaries by visualizing the value of $h(x)$ using colors on a regular image grid.
not well captured by these models. In many cases (e.g. for text data) the input data is not even in a linear space, so one cannot even apply these models.
Kernel methods are a simple yet surprisingly powerful remedy for these issues. By lifting the features to a high dimensional embedding space, they allow one to generate non-linear decision and regression functions, while still re-using the machinery (linear system solvers or convex optimization algorithms) of linear models. Also, thanks to the so-called "kernel trick", the computational cost does not depend on the dimension of the embedding space, but on the number $n$ of points. It is the perfect example of so-called "non-parametric" methods, where the number of degrees of freedom (the number of variables involved when fitting the model) grows with the number of samples. This is often desirable when one wants the precision of the result to improve with $n$, and also to mathematically model the data using "continuous" models (e.g. functional spaces such as Sobolev spaces).
The general rule of thumb is that any machine learning algorithm which only makes use of inner products (and not directly of the features $x_i$ themselves) can be "kernelized" to obtain a non-parametric algorithm. This is for instance the case for linear and nearest neighbor regression, SVM classification, logistic classification and PCA dimensionality reduction. We first explain the general machinery, and then instantiate it on a few representative setups (ridge regression, nearest-neighbor regression and logistic classification).
As a warmup, we start with the regression using a square loss, `(y, z) = 12 (y − z)2 , so that
which is the ridge regression problem studied in Section 12.3.1. In this case, the solution is obtained by setting the gradient to zero, as
$$\beta^\star = (\Phi^* \Phi + \lambda \mathrm{Id}_{\mathcal{H}})^{-1} \Phi^* y.$$
It cannot be computed because a priori H can be infinite dimensional. Fortunately, the following Woodbury
matrix identity comes to the rescue. Note that this is exactly the formula (12.17) when using K in place of
Ĉ
Proof. Denoting U = (Φ∗ Φ + λIdH )−1 Φ∗ and V = Φ∗ (ΦΦ∗ + λIdRn )−1 , one has
(Φ∗ Φ + λIdH )U = Φ∗
and
(Φ∗ Φ + λIdH )V = (Φ∗ ΦΦ∗ + λΦ∗ )(ΦΦ∗ + λIdRn )−1 = Φ∗ (ΦΦ∗ + λIdRn )(ΦΦ∗ + λIdRn )−1 = Φ∗ .
Using this formula, one has the alternate formulation of the solution as
$$\beta^\star = \Phi^* c^\star \quad \text{where} \quad c^\star \stackrel{\text{def.}}{=} (K + \lambda \mathrm{Id}_{\mathbb{R}^n})^{-1} y,$$
This means that $c^\star \in \mathbb{R}^n$ can be computed from the knowledge of the kernel $\kappa$ alone, without the need to actually compute the lifted features $\varphi(x)$. If the kernel $\kappa$ can be evaluated efficiently, forming $K$ requires only $n^2$ kernel evaluations, and the computation of $c^\star$ can be done exactly in $O(n^3)$ operations, and approximately in $O(n^2)$ (which is the typical setup in machine learning where computing exact solutions is overkill).
From this optimal $c^\star$, one can compute the predicted value for any $x \in \mathcal{X}$ using, once again, only kernel evaluations, since
$$\langle \beta^\star, \varphi(x)\rangle_{\mathcal{H}} = \Big\langle \sum_i c_i^\star \varphi(x_i), \varphi(x)\Big\rangle_{\mathcal{H}} = \sum_i c_i^\star \kappa(x, x_i). \tag{12.25}$$
$$\kappa(x, x') = (x_1 x'_1 + x_2 x'_2 + c)^2 = \langle \varphi(x), \varphi(x')\rangle \quad \text{where} \quad \varphi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2, \sqrt{2c}\, x_1, \sqrt{2c}\, x_2, c)^* \in \mathbb{R}^6.$$
In Euclidean spaces, the Gaussian kernel is the most well known and used kernel
$$\kappa(x, y) \stackrel{\text{def.}}{=} e^{-\frac{\|x - y\|^2}{2\sigma^2}}. \tag{12.26}$$
The bandwidth parameter $\sigma > 0$ is crucial and controls the locality of the model. It is typically tuned through cross validation. It corresponds to an infinite dimensional lifting $x \mapsto e^{-\frac{\|x - \cdot\|^2}{2(\sigma/2)^2}} \in L^2(\mathbb{R}^p)$. Another related popular kernel is the Laplacian kernel $\exp(-\|x - y\|/\sigma)$. More generally, when considering translation invariant kernels $\kappa(x, x') = k(x - x')$ on $\mathbb{R}^p$, being positive definite is equivalent to $\hat k(\omega) > 0$ where $\hat k$ is the Fourier transform, and the associated lifting is obtained by considering $\hat h = \sqrt{\hat k}$ and $\varphi(x) = h(x - \cdot) \in L^2(\mathbb{R}^p)$.
Figure 12.18 shows an example of regression using a Gaussian kernel.
(Figure 12.18 panels: σ = 0.1, σ = 0.5, σ = 1, σ = 5.)
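A kernel ridge regression along these lines can be sketched as follows (1-D toy data and parameter values of our own; the solve computes $c^\star = (K + \lambda \mathrm{Id})^{-1} y$ and the prediction uses (12.25)).

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Gaussian kernel matrix kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, lam, sigma = 100, 1e-2, 0.5
x = rng.uniform(-1, 1, (n, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(x, x, sigma)
c = np.linalg.solve(K + lam * np.eye(n), y)          # c* = (K + lambda Id)^{-1} y

x_test = np.linspace(-1, 1, 200)[:, None]
y_pred = gaussian_kernel(x_test, x, sigma) @ c       # prediction sum_i c_i* kappa(x, x_i)
print(y_pred.shape)
```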
Kernels on non-Euclidean spaces. One can define a kernel $\kappa$ on a general space $\mathcal{X}$. A valid kernel should satisfy that, for any set of distinct points $(x_i)_i$, the associated kernel matrix $K = (\kappa(x_i, x_j))_{i,j}$ is positive definite (i.e. has strictly positive eigenvalues). This is equivalent to the existence of an embedding Hilbert space $\mathcal{H}$ and a feature map $\varphi : \mathcal{X} \to \mathcal{H}$ such that $\kappa(x, x') = \langle \varphi(x), \varphi(x')\rangle_{\mathcal{H}}$.
For instance, if $\mathcal{X} = \mathcal{P}(\mathcal{M})$ is the set of measurable sets of some measure space $(\mathcal{M}, \mu)$, one can define $\kappa(A, B) = \mu(A \cap B)$, which corresponds to the lifting $\varphi(A) = 1_A$, so that $\langle \varphi(A), \varphi(B)\rangle = \int 1_A(x) 1_B(x)\, d\mu(x) = \mu(A \cap B)$. It is also possible to define kernels on strings or between graphs, but this is more involved.
Note that in practice, it is often desirable to normalize the kernel so that it has a unit diagonal (for translation invariant kernels over $\mathbb{R}^p$, this is automatically achieved because $k(x, x) = k(0, 0)$ is constant). This corresponds to replacing $\varphi(x)$ by $\varphi(x)/\|\varphi(x)\|_{\mathcal{H}}$ and thus replacing $\kappa(x, x')$ by $\frac{\kappa(x, x')}{\sqrt{\kappa(x, x)\,\kappa(x', x')}}$.
where $c \in \mathbb{R}^n$ is a solution of
$$\min_{c \in \mathbb{R}^n} L(Kc, y) + \frac{\lambda}{2}\langle Kc, c\rangle_{\mathbb{R}^n} \tag{12.29}$$
where we defined
$$K \stackrel{\text{def.}}{=} \Phi \Phi^* = \big(\langle \varphi(x_i), \varphi(x_j)\rangle_{\mathcal{H}}\big)_{i,j=1}^n \in \mathbb{R}^{n \times n}.$$
$$0 \in \Phi^* \partial L(\Phi \beta^\star, y) + \lambda \beta^\star,$$
$$\beta^\star = -\frac{1}{\lambda}\Phi^* u^\star \in \mathrm{Im}(\Phi^*),$$
which is the desired result.
Equation (12.28) expresses the fact that the solution only lives in the $n$-dimensional space spanned by the lifted observed points $\varphi(x_i)$. In contrast to (12.27), the optimization problem (12.29) is a finite dimensional optimization problem, and in many cases of practical interest, it can be solved approximately in $O(n^2)$ (which is the cost of forming the matrix $K$). For large scale applications, this complexity can be further reduced by computing a low-rank approximation $K \approx \tilde\Phi^* \tilde\Phi$ (which is equivalent to computing an approximate embedding space of finite dimension).
For classification applications, one can use for $\ell$ a hinge or a logistic loss function, and then the decision boundary is computed following (12.25), using kernel evaluations only, as
$$\operatorname{sign}\big(\langle \beta^\star, \varphi(x)\rangle_{\mathcal{H}}\big) = \operatorname{sign}\Big(\sum_i c_i^\star \kappa(x, x_i)\Big).$$
Figure 12.19 illustrates such a non-linear decision function on a simple 2-D problem. Note that while the
decision boundary is not a straight line, it is a linear decision in a higher (here infinite dimensional) lifted
space H (and it is the projection of this linear decision space which makes it non-linear).
It is also possible to extend nearest neighbors classification (as detailed in Section 12.4.1) and regression over a lifted space by making use only of kernel evaluations, simply noticing that the pairwise distances can be computed as $\|\varphi(x) - \varphi(x')\|_{\mathcal{H}}^2 = \kappa(x, x) + \kappa(x', x') - 2\kappa(x, x')$.
where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is some loss function. In order to achieve this goal, the method selects a class of functions $\mathcal{F}$ and minimizes the empirical risk
$$\hat f = \operatorname*{argmin}_{f \in \mathcal{F}} \hat L(f) \stackrel{\text{def.}}{=} \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)).$$
the average error associated to the choice of some predicted value $z \in \mathcal{Y}$ at location $x \in \mathcal{X}$, one has the decomposition of the risk
$$L(f) = \int_{\mathcal{X}}\Big(\int_{\mathcal{Y}} \ell(y, f(x))\, d\mathbb{P}_{Y|X}(y|x)\Big)\, d\mathbb{P}_X(x) = \int_{\mathcal{X}} \alpha(f(x)|x)\, d\mathbb{P}_X(x).$$
Example 2 (Regression). For regression applications, where $\mathcal{Y} = \mathbb{R}$ and $\ell(y, z) = (y - z)^2$, one has $f^\star(x) = \mathbb{E}(Y|X = x) = \int_{\mathcal{Y}} y\, d\mathbb{P}_{Y|X}(y|x)$, which is the conditional expectation.
Example 3 (Classification). For classification applications, where $\mathcal{Y} = \{-1, 1\}$, it is convenient to introduce $\eta(x) \stackrel{\text{def.}}{=} \mathbb{P}_{Y|X}(y = 1|x) \in [0, 1]$. If the two classes are separable, then $\eta(x) \in \{0, 1\}$ on the support of $X$ (it is not defined elsewhere). For the 0-1 loss $\ell(y, z) = 1_{y \neq z} = 1_{\mathbb{R}^+}(-yz)$, one has $f^\star = \operatorname{sign}(2\eta - 1)$ and $L(f^\star) = \mathbb{E}_X(\min(\eta(X), 1 - \eta(X)))$. In practice, computing this $\eta$ is not possible from the data $\mathcal{D}$ alone, and minimizing the 0-1 loss is NP-hard for most $\mathcal{F}$. Considering a loss of the form $\ell(y, z) = \Gamma(-yz)$, the Bayes estimator then reads, in the fully non-parametric setup,
$$f^\star(x) = \operatorname*{argmin}_z \big\{ \alpha(z|x) = \eta(x)\Gamma(-z) + (1 - \eta(x))\Gamma(z) \big\}.$$
Calibration in the classification setup. A natural question is to ensure that in this (non realistic . . . )
setup, the final binary classifier sign(f ? ) is equal to sign(2η − 1), which is the Bayes classifier of the (non-
convex) 0-1 loss. In this case, the loss is said to be calibrated. Note that this does not mean that f ? is itself
equal to 2η − 1 of course. One has the following result.
Proposition 39. A loss ` associated to a convex Γ is calibrated if and only if Γ is differentiable at 0 and
Γ0 (0) > 0.
In particular, the hinge and logistic losses are thus calibrated. Denoting $L_\Gamma$ the risk associated to $\ell(y, z) = \Gamma(-yz)$, and denoting $\Gamma_0 = 1_{\mathbb{R}^+}$ the 0-1 loss, stronger quantitative controls are of the form
$$0 \leq L_{\Gamma_0}(f) - \inf L_{\Gamma_0} \leq \Psi\big(L_\Gamma(f) - \inf L_\Gamma\big) \tag{12.30}$$
for some increasing function $\Psi : \mathbb{R}^+ \to \mathbb{R}^+$. Such a control ensures in particular that if $f^\star$ minimizes $L_\Gamma$, it also minimizes $L_{\Gamma_0}$, and hence $\operatorname{sign}(f^\star) = \operatorname{sign}(2\eta - 1)$ and the loss is calibrated. One can show that the hinge loss enjoys such a quantitative control with $\Psi(r) = r$ and that the logistic loss has a worse control since it requires $\Psi(s) = \sqrt{s}$.
$$\hat f \stackrel{\text{def.}}{=} \operatorname*{argmin}_{f \in \mathcal{F}} \hat L(f)$$
is decomposed as the sum of the estimation (random) error and the approximation (deterministic) error
$$L(\hat f) - \inf L = \Big[ L(\hat f) - \inf_{\mathcal{F}} L \Big] + A(\mathcal{F}) \quad \text{where} \quad A(\mathcal{F}) \stackrel{\text{def.}}{=} \Big[ \inf_{\mathcal{F}} L - \inf L \Big]. \tag{12.31}$$
Approximation error. Bounding the approximation error fundamentally requires some hypothesis on $f^\star$. This is somehow the take home message of "no free lunch" results, which show that learning is not possible without regularity assumptions on the distribution of $(X, Y)$. We only give here a simple example.
Example 4 (Linearly parameterized functionals). A popular class of functions is that of linearly parameterized maps of the form
$$f(x) = f_w(x) = \langle \varphi(x), w\rangle_{\mathcal{H}},$$
where $\varphi : \mathcal{X} \to \mathcal{H}$ somehow "lifts" the data features to a Hilbert space $\mathcal{H}$. In the particular case where $\mathcal{X} = \mathbb{R}^p$ is already a (finite dimensional) Hilbert space, one can use $\varphi(x) = x$ and recover the usual linear methods. One can also consider for instance polynomial features, $\varphi(x) = (1, x_1, \ldots, x_p, x_1^2, x_1 x_2, \ldots)$, giving rise to polynomial regression and polynomial classification boundaries. One can then use a restricted class of functions of the form $\mathcal{F} = \{f_w \;;\; \|w\|_{\mathcal{H}} \leq R\}$ for some radius $R$. If one assumes for simplicity that $f^\star = f_{w^\star}$ is of this form, and that the loss $\ell(y, \cdot)$ is $Q$-Lipschitz, then the approximation error is bounded, using an orthogonal projection on this ball, by
$$A(\mathcal{F}) \leq Q\, \mathbb{E}(\|\varphi(x)\|_{\mathcal{H}})\, \max(\|w^\star\|_{\mathcal{H}} - R, 0).$$
Remark 2 (Connections with RKHS). Note that this lifting actually corresponds to using functions $f$ in a reproducing kernel Hilbert space, denoting
$$\|f\|_k \stackrel{\text{def.}}{=} \inf_{w \in \mathcal{H}} \{ \|w\|_{\mathcal{H}} \;;\; f = f_w \},$$
and the associated kernel is $k(x, x') \stackrel{\text{def.}}{=} \langle \varphi(x), \varphi(x')\rangle_{\mathcal{H}}$. But this is not important for our discussion here.
Estimation error. The estimation error can be bounded for arbitrary distributions by leveraging concentration inequalities (to control the impact of the noise) and using some bound on the size of $\mathcal{F}$.
The first simple but fundamental inequality bounds the estimation error by some uniform distance between $L$ and $\hat L$. Denoting $g \in \mathcal{F}$ an optimal estimator such that $L(g) = \inf_{\mathcal{F}} L$ (assuming for simplicity it exists), one has
$$L(\hat f) - \inf_{\mathcal{F}} L = \big[L(\hat f) - \hat L(\hat f)\big] + \big[\hat L(\hat f) - \hat L(g)\big] + \big[\hat L(g) - L(g)\big] \leq 2 \sup_{f \in \mathcal{F}} |\hat L(f) - L(f)| \tag{12.32}$$
since $\hat L(\hat f) - \hat L(g) \leq 0$. So the goal is "simply" to control $\Delta(\mathcal{D}) \stackrel{\text{def.}}{=} \sup_{\mathcal{F}} |\hat L - L|$, which is a random value
where the $\varepsilon_i$ are independent Bernoulli random variables (i.e. $\mathbb{P}(\varepsilon_i = \pm 1) = 1/2$). Note that $\mathcal{R}_n(\mathcal{G})$ actually depends on the distribution of $(X, Y)$ . . .
Here, one needs to apply this notion of complexity to the class of functions $\mathcal{G} = \ell[\mathcal{F}]$ defined as
$$\ell[\mathcal{F}] \stackrel{\text{def.}}{=} \{ (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto \ell(y, f(x)) \;;\; f \in \mathcal{F} \},$$
and one has the following control, which can be proved using a simple but powerful symmetrization trick.
Proposition 41. One has
$$\mathbb{E}(\Delta(\mathcal{D})) \leq 2\, \mathcal{R}_n(\ell[\mathcal{F}]).$$
If $\ell$ is $Q$-Lipschitz with respect to its second variable, one furthermore has $\mathcal{R}_n(\ell[\mathcal{F}]) \leq Q\, \mathcal{R}_n(\mathcal{F})$ (here the class of functions only depends on $x$).
Putting (12.31), (12.32), Propositions 40 and 41 together, one obtains the following final result.
Theorem 23. Assuming $\ell$ is $Q$-Lipschitz and bounded by $\ell_\infty$ on the support of $(Y, f(X))$, one has, with probability $1 - \delta$,
$$0 \leq L(\hat f) - \inf L \leq 2\ell_\infty \sqrt{\frac{2\log(1/\delta)}{n}} + 4Q\, \mathcal{R}_n(\mathcal{F}) + A(\mathcal{F}). \tag{12.34}$$
Example 5 (Linear models). In the case where $\mathcal{F} = \{\langle \varphi(\cdot), w\rangle_{\mathcal{H}} \;;\; \|w\| \leq R\}$, where $\|\cdot\|$ is some norm on $\mathcal{H}$, one has
$$\mathcal{R}_n(\mathcal{F}) \leq \frac{R}{n}\, \mathbb{E}\Big\| \sum_i \varepsilon_i \varphi(x_i) \Big\|_*$$
where $\|\cdot\|_*$ is the so-called dual norm
$$\|u\|_* \stackrel{\text{def.}}{=} \sup_{\|w\| \leq 1} \langle u, w\rangle_{\mathcal{H}}.$$
In the special case where $\|\cdot\| = \|\cdot\|_{\mathcal{H}}$ is Hilbertian, one can further simplify this expression, since $\|u\|_* = \|u\|_{\mathcal{H}}$, and
$$\mathcal{R}_n(\mathcal{F}) \leq \frac{R \sqrt{\mathbb{E}(\|\varphi(x)\|_{\mathcal{H}}^2)}}{\sqrt{n}}.$$
This result is powerful since the bound does not depend on the feature dimension (and it can even be applied in the RKHS setting where $\mathcal{H}$ is infinite dimensional). In this case, one sees that the convergence speed in (12.34) is of the order $1/\sqrt{n}$ (plus the approximation error). One should keep in mind that one also needs to select the "regularization" parameter $R$ to obtain the best possible trade-off. In practice, this is done by cross validation on the data themselves.
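This bound can be illustrated by a small Monte Carlo experiment (our own sketch, with $\varphi(x) = x$ and the expectation over the Rademacher signs estimated empirically, conditionally on a fixed draw of the $x_i$; exact values depend on the random draw).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, R, n_trials = 200, 10, 1.0, 2000

X = rng.standard_normal((n, p))                     # plays the role of the phi(x_i)
eps = rng.choice([-1.0, 1.0], size=(n_trials, n))   # the random signs eps_i
# Monte Carlo estimate of (R/n) E || sum_i eps_i phi(x_i) || ...
estimate = R / n * np.mean(np.linalg.norm(eps @ X, axis=1))
# ... compared with the bound R sqrt(E ||phi(x)||^2) / sqrt(n)
bound = R * np.sqrt(np.mean((X ** 2).sum(axis=1))) / np.sqrt(n)
print(estimate, bound, estimate <= bound)
```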
Example 6 (Application to SVM classification). One cannot use the result (12.34) in the case of the 0-1 loss $\ell(y, z) = 1_{z \neq y}$, since it is not Lipschitz (and also because minimizing $\hat L$ would be intractable). One can however apply it to a softer, piece-wise affine, upper-bounding loss $\ell_\rho(z, y) = \Gamma_\rho(-zy)$ for
$$\Gamma_\rho(s) \stackrel{\text{def.}}{=} \min\big(1, \max(0, 1 + s/\rho)\big).$$
This function is $1/\rho$-Lipschitz, and it is sandwiched between the 0-1 loss and a (scaled) hinge loss
$$\Gamma_0 \leq \Gamma_\rho \leq \Gamma_{\mathrm{SVM}}(\cdot/\rho) \quad \text{where} \quad \Gamma_{\mathrm{SVM}}(s) \stackrel{\text{def.}}{=} \max(1 + s, 0).$$
This allows one, after a change of variable $w \mapsto w/\rho$, to bound with probability $1 - \delta$ the 0-1 risk using an SVM risk, by applying (12.34):
$$L_{\Gamma_0}(\hat f) - \inf_{f \in \mathcal{F}_\rho} L_{\Gamma_{\mathrm{SVM}}}(f) \leq 2\sqrt{\frac{2\log(1/\delta)}{n}} + 4\,\frac{\sqrt{\mathbb{E}(\|\varphi(x)\|_{\mathcal{H}}^2)}/\rho}{\sqrt{n}}$$
where $\mathcal{F}_\rho \stackrel{\text{def.}}{=} \{ f_w = \langle \varphi(\cdot), w\rangle_{\mathcal{H}} \;;\; \|w\| \leq 1/\rho \}$. In practice, one rather solves a penalized version of the above risk (in its empirical version)
$$\min_w \hat L_{\Gamma_{\mathrm{SVM}}}(f_w) + \lambda \|w\|_{\mathcal{H}}^2 \tag{12.35}$$
From the optimal $c^\star$ solving this strictly convex program, the estimator is evaluated as $f_{w^\star}(x) = \langle w^\star, \varphi(x)\rangle = \sum_i c_i^\star k(x_i, x)$.