
Principles of Statistical Machine Learning

Fondements Théoriques de l'Apprentissage Statistique
(Theoretical Foundations of Statistical Learning)

Dr Ernest Fokoué

Associate Professor of Statistics
School of Mathematical Sciences
Rochester Institute of Technology

Visiting Professor of Statistical Machine Learning
Laboratoire de Mathématiques de Bretagne Atlantique
Université de Bretagne-Sud, Vannes, France

June-July 2015


Binary Classification in the Plane
Ripley's 2D binary classification data set

Figure: Scatterplot of Ripley's 2D binary classification data set, plotted as x2 against x1.
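As a quick companion to the figure, here is a minimal R sketch for reproducing a plot of Ripley's data. It assumes the synth.tr data frame shipped with the MASS package (columns xs, ys and class label yc); this choice of data source is an assumption for illustration, not something specified in the slides.

# Minimal sketch: plot Ripley's two-class synthetic training data.
# Assumes MASS::synth.tr with columns xs, ys (inputs) and yc (class label).
library(MASS)

plot(synth.tr$xs, synth.tr$ys,
     col = ifelse(synth.tr$yc == 1, "red", "blue"),
     pch = 19, xlab = "x1", ylab = "x2",
     main = "Ripley's 2D binary classification data")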
Binary Classification in the Plane

For the binary classification problem introduced earlier:

A collection {(x_1, y_1), …, (x_n, y_n)} of i.i.d. observations is given.
x_i ∈ X ⊆ R², i = 1, …, n. X is the input space.
y_i ∈ {−1, +1}. Y = {−1, +1} is the output space.

What is the probability law that governs the (x_i, y_i)'s?
What is the functional relationship between x and y?
What is the "best" approach to determining, from the available observations, the relationship between x and y in such a way that, given a new (unseen) observation x_new, its class y_new can be predicted as accurately as possible?


Basic Remarks on Classification

While some points clearly belong to one of the classes, there are
other points that are either strangers in a foreign land, or are
positioned in such a way that no automatic classification rule can
clearly determine their class membership.
One can construct a classification rule that puts all the points in
their corresponding classes. Such a rule would prove disastrous in
classifying new observations not present in the current collection
of observations.
Indeed, we have a collection of pairs (x_i, y_i) of observations coming from some unknown distribution P(x, y).


Basic Remarks on Classification

Finding an automatic classification rule that achieves the absolute very best on the present data is not enough, since infinitely many more observations can be generated by P(x, y) for which good classification will be required.
Even the universally best classifier will make mistakes.
Of all the functions in Y^X, it is reasonable to assume that there is a function f* that maps any x ∈ X to its corresponding y ∈ Y, i.e.,

    f*: X → Y
        x ↦ f*(x),

with the minimum number of mistakes.


Risk Minimization Revisited
Let f denote any generic function mapping an element x of X to its corresponding image f(x) in Y.
Each time x is drawn from P(x), the disagreement between the image f(x) and the true image y is called the loss, denoted by ℓ(y, f(x)).
The expected value of this loss function with respect to the distribution P(x, y) is called the risk functional of f. Generically, we shall denote the risk functional of f by R(f), so that

    R(f) = E[ℓ(Y, f(X))] = ∫ ℓ(y, f(x)) dP(x, y).

The best function f* over the space Y^X of all measurable functions from X to Y is therefore

    f* = arg inf_f R(f),

so that

    R(f*) = R* = inf_f R(f).


On the need to reduce the search space

Unfortunately, f* can only be found if P(x, y) is known. Therefore, since we do not know P(x, y) in practice, it is hopeless to determine f*.
Besides, trying to find f* without the knowledge of P(x, y) implies having to search the infinite-dimensional function space Y^X of all mappings from X to Y, which is an ill-posed and computationally nasty problem.
Throughout this lecture, we will seek to solve the more reasonable problem of choosing, from a function space F ⊂ Y^X, the one function f+ ∈ F that best estimates the dependencies between x and y.
It is therefore important to define what is meant by "best estimates". For that, the concepts of loss function and risk functional need to be defined.


Loss and Risk in Pattern Recognition
For this classification/pattern recognition task, the so-called 0-1 loss function defined below is used. More specifically,

    ℓ(y, f(x)) = 1{y ≠ f(x)} = { 0 if y = f(x),  1 if y ≠ f(x) }.    (1)

The corresponding risk functional is

    R(f) = ∫ ℓ(y, f(x)) dP(x, y) = E[1{Y ≠ f(X)}] = Pr_{(X,Y)~P}[Y ≠ f(X)].

The minimizer of the 0-1 risk functional over all possible classifiers is the so-called Bayes classifier, which we shall denote here by f*, given by

    f* = arg inf_f Pr_{(X,Y)~P}[Y ≠ f(X)].

Specifically, the Bayes classifier f* is given by the posterior probability of class membership, namely

    f*(x) = arg max_{y ∈ Y} Pr[Y = y | x].
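To make the Bayes rule f*(x) = arg max_y Pr[Y = y | x] concrete, here is a small R sketch for a hypothetical setting in which P(x, y) is fully known: the prior Pr[Y = +1] and the two class-conditional densities (taken here to be bivariate Gaussians with identity covariance) are assumptions chosen purely for illustration, not part of the slides.

# Sketch: Bayes classifier when P(x, y) is known (assumed Gaussian class-conditionals).
bayes_classifier <- function(x, mu_neg, mu_pos, pi_pos = 0.5) {
  # posterior is proportional to prior times class-conditional density;
  # with identity covariance the density factorizes into univariate normals
  dens <- function(x, mu) dnorm(x[, 1], mu[1]) * dnorm(x[, 2], mu[2])
  post_pos <- pi_pos       * dens(x, mu_pos)
  post_neg <- (1 - pi_pos) * dens(x, mu_neg)
  ifelse(post_pos > post_neg, +1, -1)    # f*(x) = arg max_y Pr[Y = y | x]
}

x_new <- rbind(c(0.2, 0.1), c(1.8, 2.1))            # two query points
bayes_classifier(x_new, mu_neg = c(0, 0), mu_pos = c(2, 2))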


Function Class in Pattern Recognition
As stated earlier, trying to find f* is hopeless. One needs to select a function space F ⊂ Y^X, and then choose the best estimator f+ from F, i.e.,

    f+ = arg inf_{f ∈ F} R(f),

so that

    R(f+) = R+ = inf_{f ∈ F} R(f).

For the binary pattern recognition problem, one may consider finding the best linear separating hyperplane, i.e.,

    F = { f: X → {−1, +1} | ∃ β_0 ∈ R, β = (β_1, …, β_p)ᵀ ∈ R^p such that f(x) = sign(βᵀx + β_0), ∀ x ∈ X }.


Empirical Risk Minimization

Let D = {(X_1, Y_1), …, (X_n, Y_n)} be an i.i.d. sample from P(x, y).
The empirical version of the risk functional is

    R̂_n(f) = (1/n) Σ_{i=1}^{n} 1{Y_i ≠ f(X_i)}.

We therefore seek the best classifier by this empirical standard,

    f̂_n = arg min_{f ∈ F} { (1/n) Σ_{i=1}^{n} 1{Y_i ≠ f(X_i)} }.

Since it is impossible to search all possible functions, it is usually crucial to choose the "right" function space F.
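As an illustration of the empirical risk R̂_n(f), the following R sketch evaluates a fixed linear classifier f(x) = sign(βᵀx + β_0) on a simulated binary sample; the simulated data model and the particular coefficients are assumptions for illustration only, not part of the lecture.

# Sketch: empirical risk (0-1 loss) of a fixed linear classifier on simulated data.
set.seed(1)
n <- 200
x <- matrix(rnorm(2 * n), ncol = 2)                               # inputs in R^2
y <- ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, +1, -1)     # labels in {-1, +1}

f_linear <- function(x, beta, beta0) sign(drop(x %*% beta) + beta0)
empirical_risk <- function(yhat, y) mean(y != yhat)               # (1/n) * sum of 1{y_i != f(x_i)}

empirical_risk(f_linear(x, beta = c(1, 1), beta0 = 0), y)         # R_hat_n(f) for this fixed f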


Bias-Variance Trade-Off

In traditional statistical estimation, one needs to address at the very least issues like: (a) the bias of the estimator; (b) the variance of the estimator; (c) the consistency of the estimator. Recall from elementary point estimation that, if θ is the true value of the parameter to be estimated, and θ̂ is a point estimator of θ, then one can decompose the total error as follows:

    θ̂ − θ = (θ̂ − E[θ̂]) + (E[θ̂] − θ),    (2)

where the first term is the estimation error and the second term is the bias.

Under the squared error loss, one seeks the θ̂ that minimizes the mean squared error,

    θ̂ = arg min E[(θ̂ − θ)²] = arg min MSE(θ̂),

rather than trying to find the minimum variance unbiased estimator (MVUE).


Bias-Variance Trade-off

Clearly, the traditional so-called bias-variance decomposition of the MSE reveals the need for a bias-variance trade-off. Indeed,

    MSE(θ̂) = E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)²
            = variance + bias².

If the estimator θ̂ were to be sought from all possible values of θ, then it might make sense to hope for the MVUE. Unfortunately, and especially in function estimation as we clearly argued earlier, there will be some bias, so that the error one gets has a bias component along with the variance component in the squared error loss case. If the bias is made small, then an estimator with a larger variance is obtained. Similarly, a small variance will tend to come from estimators with a relatively large bias. The best compromise is then to trade off bias and variance, which in functional terms translates into a trade-off between approximation error and estimation error.
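The decomposition MSE(θ̂) = variance + bias² is easy to verify numerically. The sketch below does so in R for a family of shrinkage estimators θ̂ = c·X̄ of a normal mean; the true θ, the sample size and the shrinkage factors c are illustrative assumptions.

# Sketch: bias-variance trade-off for shrinkage estimators theta_hat = c * xbar.
set.seed(1)
theta <- 2; n <- 20; n_sim <- 10000
shrink <- c(0, 0.25, 0.5, 0.75, 0.9, 1)

decomp <- sapply(shrink, function(s) {
  est   <- s * replicate(n_sim, mean(rnorm(n, mean = theta)))   # many draws of theta_hat
  bias2 <- (mean(est) - theta)^2
  v     <- var(est)
  c(bias2 = bias2, variance = v, mse = bias2 + v)
})
colnames(decomp) <- paste0("c=", shrink)
round(decomp, 4)    # small c: large bias, small variance; c = 1: unbiased, largest variance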
Bias-Variance Trade-off

Figure: Illustration of the qualitative behavior of the dependence of bias (squared) versus variance on a trade-off parameter such as λ or h: the true risk is minimized at an intermediate amount of smoothing. For small values (less smoothing) the variability is too high; for large values (more smoothing) the bias gets large.


Structural risk minimization principle

Since making the estimator of the function arbitrarily complex causes the problems mentioned earlier, the intuition for a trade-off reveals that, instead of minimizing the empirical risk R̂_n(f) alone, one should do the following (see the sketch after this list):
Choose a collection of function spaces {F_k : k = 1, 2, …}, maybe a collection of nested spaces (increasing in size).
Minimize the empirical risk in each class.
Minimize the penalized empirical risk

    min_k  min_{f ∈ F_k}  R̂_n(f) + penalty(k, n),

where penalty(k, n) gives preference to models with small estimation error.

It is important to note that penalty(k, n) measures the capacity of the function class F_k. The widely used technique of regularization for solving ill-posed problems is a particular instance of structural risk minimization.
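Here is a minimal R sketch of the structural risk minimization idea over nested polynomial classes F_k (polynomials of degree k). The data-generating model and the AIC-like penalty 2k/n are illustrative assumptions; the slides do not prescribe a particular penalty.

# Sketch: penalized empirical risk over nested polynomial classes F_k.
set.seed(1)
n <- 100
x <- runif(n, -1, 1)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

penalized_risk <- sapply(1:10, function(k) {
  fit <- lm(y ~ poly(x, k))                 # minimize empirical (squared-error) risk within F_k
  mean(residuals(fit)^2) + 2 * k / n        # empirical risk + penalty(k, n)
})
which.min(penalized_risk)                   # degree k selected by the penalized criterion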


Regularization for Complexity Control

Tikhonov's Variational Approach to Regularization [Tikhonov, 1963]
Find f that minimizes the functional

    R̂_n^(reg)(f) = (1/n) Σ_{i=1}^{n} ℓ(y_i, f(x_i)) + λ Ω(f),

where λ > 0 is some predefined constant.

Ivanov's Quasi-solution Approach to Regularization [Ivanov, 1962]
Find f that minimizes the functional

    R̂_n(f) = (1/n) Σ_{i=1}^{n} ℓ(y_i, f(x_i))

subject to the constraint

    Ω(f) ≤ C,

where C > 0 is some predefined constant.
Regularization for Complexity Control
Phillips' Residual Approach to Regularization [Phillips, 1962]
Find f that minimizes the functional

    Ω(f)

subject to the constraint

    (1/n) Σ_{i=1}^{n} ℓ(y_i, f(x_i)) ≤ ε,

where ε > 0 is some predefined constant.

In all the above, the functional Ω(f) is called the regularization functional. Ω(f) is defined in such a way that it controls the complexity of the function f. For instance,

    Ω(f) = ||f''||² = ∫_a^b (f''(t))² dt

is a regularization functional used in spline smoothing.
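In R, smoothing splines implement exactly this Tikhonov-type penalty: stats::smooth.spline minimizes a residual sum of squares plus λ ∫ (f''(t))² dt. The simulated data below and the two λ values are illustrative assumptions.

# Sketch: roughness-penalized (Tikhonov-style) regression with smooth.spline().
set.seed(1)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

fit_rough  <- smooth.spline(x, y, lambda = 1e-8)   # tiny penalty: wiggly fit, low bias, high variance
fit_smooth <- smooth.spline(x, y, lambda = 1e-2)   # heavy penalty: smooth fit, higher bias, low variance
c(rough_df = fit_rough$df, smooth_df = fit_smooth$df)   # effective degrees of freedom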
Statistical Consistency

Definition: Let θ̂_n be an estimator of some scalar quantity θ based on an i.i.d. sample X_1, X_2, …, X_n from the distribution with parameter θ. Then θ̂_n is said to be a consistent estimator of θ if θ̂_n converges in probability to θ, i.e.,

    θ̂_n →^P θ as n → ∞.

In other words, θ̂_n is a consistent estimator of θ if, ∀ε > 0,

    lim_{n→∞} Pr{ |θ̂_n − θ| > ε } = 0.

It turns out that for unbiased estimators θ̂_n, consistency follows in a straightforward way from a basic probabilistic inequality like Chebyshev's inequality. However, for biased estimators, one has to be more careful.


A Basic Important Inequality

Theorem (Bienaymé-Chebyshev inequality): Let X be a random variable with finite mean μ_X = E[X], i.e., |E[X]| < +∞, and finite variance σ²_X = V(X), i.e., V(X) < +∞. Then, ∀ε > 0,

    Pr[ |X − E[X]| > ε ] ≤ V(X)/ε².

It is therefore easy to see here that, with unbiased θ̂_n, one has E[θ̂_n] = θ, and the consistency result is immediate. For the sake of clarity, let's recall here the elementary weak law of large numbers.
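A quick simulation makes the inequality tangible; the Exponential(1) distribution and the threshold ε = 2 below are arbitrary illustrative choices.

# Sketch: empirical tail probability versus the Chebyshev upper bound V(X)/eps^2.
set.seed(1)
x   <- rexp(1e5, rate = 1)                         # E[X] = 1, V(X) = 1
eps <- 2
c(empirical_tail = mean(abs(x - mean(x)) > eps),   # Pr[|X - E[X]| > eps], estimated
  chebyshev_bound = var(x) / eps^2)                # the bound is loose but valid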


Weak Law of Large Numbers

Let X be a random variable with finite mean μ_X = E[X], i.e., |E[X]| < +∞, and finite variance σ²_X = V(X), i.e., V(X) < +∞. Let X_1, X_2, …, X_n be a random sample of n observations drawn independently from the distribution of X, so that for i = 1, …, n, we have E[X_i] = μ and V[X_i] = σ². Let X̄_n be the sample mean, i.e.,

    X̄_n = (1/n)(X_1 + X_2 + … + X_n) = (1/n) Σ_{i=1}^{n} X_i.

Then, clearly, E[X̄_n] = μ, and, ∀ε > 0,

    lim_{n→∞} Pr[ |X̄_n − μ| > ε ] = 0.    (3)

This essentially expresses the fact that the empirical mean X̄_n converges in probability to the theoretical mean μ in the limit of very large samples.


Weak Law of Large Numbers

We therefore have

    X̄_n →^P μ as n → ∞.

With μ_X̄ = E[X̄_n] = μ and σ²_X̄ = σ²/n, one applies the Bienaymé-Chebyshev inequality and gets: ∀ε > 0,

    Pr[ |X̄_n − μ| > ε ] ≤ σ²/(nε²),    (4)

which, by inversion, is the same as

    |X̄_n − μ| < ( σ²/(nδ) )^{1/2}    (5)

with probability at least 1 − δ.
Why is all the above of any interest to statistical learning theory?


Weak Law of Large Numbers

Why is all the above of any interest to statistical learning theory?
Equation (3) states the much needed consistency of X̄_n as an estimator of μ.
Equation (4), by showing how the bound depends on n and ε, helps assess the rate at which X̄_n converges to μ.
Equation (5), by providing a confidence interval, helps compute bounds on the unknown true mean μ as a function of the empirical mean X̄_n and the confidence level 1 − δ.
Finally, how does one go about constructing estimators with all the above properties?
Components of Statistical Machine Learning
Interestingly, all those four components of classical estimation theory will be encountered again in statistical learning theory. Essentially, the four components of statistical learning theory consist of finding the answers to the following questions:
(a) What are the necessary and sufficient conditions for the consistency of a learning process based on the ERM principle? This leads to the Theory of consistency of learning processes.
(b) How fast is the rate of convergence of the learning process? This leads to the Nonasymptotic theory of the rate of convergence of learning processes.
(c) How can one control the rate of convergence (the generalization ability) of the learning process? This leads to the Theory of controlling the generalization ability of learning processes.
(d) How can one construct algorithms that can control the generalization ability of the learning process? This leads to the Theory of constructing learning algorithms.
Error Decomposition revisited

A reasoning on error decomposition and consistency of estimators, along with rates, bounds and algorithms, applies to function spaces as well: indeed, the difference between the true risk R(f̂_n) associated with f̂_n and the overall minimum risk R* can be decomposed to explore in greater detail the sources of error in the function estimation process:

    R(f̂_n) − R* = [R(f̂_n) − R(f+)] + [R(f+) − R*],    (6)

where the first bracket is the estimation error and the second is the approximation error.

A reasoning similar to the bias-variance trade-off and consistency arguments can be made, with the added complication brought by the need to distinguish between the true risk functional and the empirical risk functional, and also by the need to assess both pointwise behaviors and uniform behaviors. In a sense, one needs to generalize the decomposition and the law of large numbers to function spaces.


Approximation-Estimation Trade-Off

Figure: Illustration of the qualitative behavior of the dependence of bias (squared) versus variance on a trade-off parameter such as λ or h: the true risk is minimized at an intermediate amount of smoothing. For small values (less smoothing) the variability is too high; for large values (more smoothing) the bias gets large.


Consistency of the Empirical Risk Minimization principle

The ERM principle is consistent if it provides a sequence of functions f̂_n, n = 1, 2, …, for which both the expected risk R(f̂_n) and the empirical risk R̂_n(f̂_n) converge to the minimal possible value of the risk R(f+) in the function class under consideration, i.e.,

    R(f̂_n) →^P inf_{f ∈ F} R(f) = R(f+)  as n → ∞,

and

    R̂_n(f̂_n) →^P inf_{f ∈ F} R(f) = R(f+)  as n → ∞.

Vapnik discusses the details of this theorem at length, and extends the exploration to include the difference between what he calls trivial consistency and non-trivial consistency.


Consistency of the Empirical Risk Minimization principle

To better understand consistency in function spaces, consider the sequence of random variables

    ξ_n = sup_{f ∈ F} ( R(f) − R̂_n(f) ),    (7)

and consider studying

    lim_{n→∞} P{ sup_{f ∈ F} ( R(f) − R̂_n(f) ) > ε } = 0,  ∀ε > 0.

Vapnik shows that the sequence of the means of the random variables ξ_n converges to zero as the number n of observations increases.
He also remarks that the sequence of random variables ξ_n converges in probability to zero if the set of functions F contains a finite number m of elements. We will show that later in the case of pattern recognition.
Consistency of the Empirical Risk Minimization principle
It remains then to describe the properties of the set of functions F and of the probability measure P(x, y) under which the sequence of random variables ξ_n converges in probability to zero, i.e., under which

    lim_{n→∞} P{ sup_{f ∈ F} [R(f) − R̂_n(f)] > ε  or  sup_{f ∈ F} [R̂_n(f) − R(f)] > ε } = 0.

Recall that R̂_n(f) is the realized disagreement between the classifier f and the truth about the label y of x, based on the information contained in the sample D.
It is easy to see that, for a given (fixed) function (classifier) f,

    E[R̂_n(f)] = R(f).    (8)

Note that while this pointwise unbiasedness of the empirical risk is a good bottom-line property to have, it is not enough. More is needed, as the comparison is against R(f+), or better yet against R(f*).
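The pointwise unbiasedness E[R̂_n(f)] = R(f) in (8) is easy to check by simulation for one fixed classifier; the data-generating model below is an illustrative assumption.

# Sketch: E[R_hat_n(f)] = R(f) for a fixed classifier f, checked by Monte Carlo.
set.seed(1)
f <- function(x) sign(x[, 1] + x[, 2])                       # one fixed classifier
one_empirical_risk <- function(n) {
  x <- matrix(rnorm(2 * n), ncol = 2)
  y <- ifelse(x[, 1] + x[, 2] + rnorm(n) > 0, +1, -1)
  mean(y != f(x))                                            # R_hat_n(f) on one sample
}
mean(replicate(2000, one_empirical_risk(n = 50)))            # averages to R(f) over many samples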
Consistency of the Empirical Risk
Remember that the goal of statistical function estimation is to devise a technique (strategy) that chooses from the function class F the one function whose true risk is as close as possible to the lowest risk in class F.
The question arises: since one cannot calculate the true error, how can one devise a learning strategy for choosing classifiers based on it?
Tentative answer: at least devise strategies that yield functions for which the upper bound on the theoretical risk is as tight as possible, so that one can make confidence statements of the form:
With probability 1 − δ over an i.i.d. draw of some sample according to the distribution P, the expected future error rate of some classifier is bounded by a function g(δ, error rate on sample) of δ and the error rate on the sample,

    Pr{ TestError ≤ TrainError + ε(n, δ, F) } ≥ 1 − δ.


Foundation Result in Statistical Learning Theory

Theorem (Vapnik and Chervonenkis, 1971): Let F be a class of functions implementing some learning machines, and let ν = VCdim(F) be the VC dimension of F. Let the theoretical and the empirical risks be defined as earlier, and consider any data distribution in the population of interest. Then, ∀f ∈ F, the prediction error (theoretical risk) is bounded by

    R(f) ≤ R̂_n(f) + ( [ν(log(2n/ν) + 1) − log(δ/4)] / n )^{1/2}    (9)

with probability at least 1 − δ, or, equivalently,

    Pr{ TestError ≤ TrainError + ( [ν(log(2n/ν) + 1) − log(δ/4)] / n )^{1/2} } ≥ 1 − δ.


Bounds on the Generalization Error
For instance, using Chebyshev's inequality and the fact that E[R̂_n(f)] = R(f), it is easy to see that, for a given classifier f and a sample D = {(x_1, y_1), …, (x_n, y_n)},

    Pr[ |R̂_n(f) − R(f)| > ε ] ≤ R(f)(1 − R(f)) / (nε²).

To estimate the true but unknown error R(f) with a probability of at least 1 − δ, it makes sense to use inversion, i.e., set

    δ = R(f)(1 − R(f)) / (nε²),  so that  ε = ( R(f)(1 − R(f)) / (nδ) )^{1/2}.

Owing to the fact that max_{R(f) ∈ [0,1]} R(f)(1 − R(f)) = 1/4, we have

    ( R(f)(1 − R(f)) / (nδ) )^{1/2} ≤ ( 1/(4nδ) )^{1/2}.


Bounds on the Generalization Error

Based on Chebyshev's inequality, for a given classifier f, with a probability of at least 1 − δ, the bound on the difference between the true risk R(f) and the empirical risk R̂_n(f) is given by

    |R̂_n(f) − R(f)| < ( 1/(4nδ) )^{1/2}.

Recall that one of the goals of statistical learning theory is to assess the rate of convergence of the empirical risk to the true risk, which translates into assessing how tight the corresponding bounds on the true risk are.
In fact, many bounds can be so loose as to become useless. It turns out that the above Chebyshev-based bound is not a good one, at least compared to bounds obtained using the so-called Hoeffding's inequality.


Bounds on the Generalization Error

Theorem (Hoeffding's inequality): Let Z_1, Z_2, …, Z_n be a collection of i.i.d. random variables with Z_i ∈ [a, b]. Then, ∀ε > 0,

    Pr[ | (1/n) Σ_{i=1}^{n} Z_i − E[Z] | > ε ] ≤ 2 exp( −2nε² / (b − a)² ).

Corollary (Hoeffding's inequality for sample proportions): Let Z_1, Z_2, …, Z_n be a collection of i.i.d. random variables from a Bernoulli distribution with "success" probability p. Let p̂_n = (1/n) Σ_{i=1}^{n} Z_i. Clearly, p̂_n ∈ [0, 1] and E[p̂_n] = p.
Therefore, as a direct consequence of the above theorem, we have, ∀ε > 0,

    Pr[ |p̂_n − p| > ε ] ≤ 2 exp(−2nε²).


Bounds on the Generalization Error

So we have, ∀ε > 0,

    Pr[ |p̂_n − p| > ε ] ≤ 2 exp(−2nε²).

Now, setting δ = 2 exp(−2ε²n), it is straightforward to see that the Hoeffding-based 1 − δ level confidence bound on the difference between R(f) and R̂_n(f) for a fixed classifier f is given by

    |R̂_n(f) − R(f)| < ( ln(2/δ) / (2n) )^{1/2}.

Which of the two bounds is tighter? Clearly, we need to find out which of ln(2/δ) or 1/(2δ) is larger. This is the same as comparing exp(1/(2δ)) and 2/δ, which in turn means comparing a^{2/δ} and 2/δ, where a = exp(1/4). For the usual small values of δ, a^{2/δ} > 2/δ, so that the Hoeffding bound is the tighter of the two. The graph on the next slide also confirms this.
Bounds on the Generalization Error

Chernoff vs Chebyshev bounds for proportions: delta = 0.01 Chernoff vs Chebyshev bounds for proportions: delta = 0.05
0.5 0.25
Chernoff Chernoff
Chebyshev Chebyshev
0.45

0.4 0.2

0.35
Theoretical bound f(n,)

Theoretical bound f(n,)


0.3 0.15

0.25

0.2 0.1

0.15

0.1 0.05

0.05

0 0
0 2000 4000 6000 8000 10000 12000 0 2000 4000 6000 8000 10000 12000
n = Sample size n = Sample size

Dr Ernest Fokou (RIT) Principles of Statistical Machine Learning 34 / 49
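The comparison in the figure is easy to reproduce; the sketch below plots both bounds as functions of n for one value of δ (an illustrative choice).

# Sketch: Chebyshev bound (1/(4*n*delta))^(1/2) versus Hoeffding/Chernoff bound
# (log(2/delta)/(2*n))^(1/2), as functions of the sample size n.
delta <- 0.05
n     <- seq(100, 12000, by = 100)
chebyshev <- sqrt(1 / (4 * n * delta))
hoeffding <- sqrt(log(2 / delta) / (2 * n))

matplot(n, cbind(chebyshev, hoeffding), type = "l", lty = 1:2, col = 1:2,
        xlab = "n = Sample size", ylab = "Theoretical bound f(n, delta)")
legend("topright", legend = c("Chebyshev", "Hoeffding/Chernoff"), lty = 1:2, col = 1:2)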


Beyond Chernoff and Hoeffding

In all the above, we only addressed pointwise convergence of R̂_n(f) to R(f); i.e., for a fixed machine f ∈ F, we studied the convergence of R̂_n(f) to R(f).
Needless to say, such pointwise convergence is of very little use here.
A more interesting issue to address is uniform convergence. That is, over all machines f ∈ F, determine the necessary and sufficient conditions for the convergence of

    sup_{f ∈ F} |R̂_n(f) − R(f)|  to 0.

Clearly, such a study extends the Law of Large Numbers to function spaces, thereby providing tools for the construction of bounds on the theoretical errors of learning machines.
Beyond Chernoff and Hoeffding

Since uniform convergence requires the consideration of the entirety of the function space of interest, care needs to be taken regarding the dimensionality of the function space.
Uniform convergence will prove substantially easier to handle for finite function classes than for infinite-dimensional function spaces.
Indeed, for infinite-dimensional spaces, one will need to introduce concepts such as the capacity of the function space, measured through devices such as the VC dimension and covering numbers.


Beyond Chernoff and Hoeffding

Theorem: If R̂_n(f) and R(f) are close for all f ∈ F, i.e., for a given ε > 0,

    sup_{f ∈ F} |R̂_n(f) − R(f)| ≤ ε,

then

    R(f̂_n) − R(f+) ≤ 2ε.

Proof: Recall that we defined f̂_n as the best function yielded by the empirical risk R̂_n(f) in the function class F. Recall also that R̂_n(f̂_n) can be made as small as possible, as we saw earlier. Therefore, with f+ being the minimizer of the true risk in class F, we always have

    R̂_n(f+) − R̂_n(f̂_n) ≥ 0.

As a result,

    R(f̂_n) = R(f̂_n) − R(f+) + R(f+)
            ≤ [R̂_n(f+) − R̂_n(f̂_n)] + [R(f̂_n) − R(f+)] + R(f+)
            ≤ 2 sup_{f ∈ F} |R(f) − R̂_n(f)| + R(f+).

Consequently,

    R(f̂_n) − R(f+) ≤ 2 sup_{f ∈ F} |R(f) − R̂_n(f)|,

as required.
Beyond Chernoff and Hoeffding

Corollary: A direct consequence of the above theorem is the following. For a given machine f ∈ F,

    R(f) ≤ R̂_n(f) + ( ln(2/δ) / (2n) )^{1/2}

with probability at least 1 − δ, δ > 0.

If the function class F is finite, i.e.,

    F = {f_1, f_2, …, f_m},

where m = |F| = #F = the number of functions in the class F, then it can be shown that, for all f ∈ F,

    R(f) ≤ R̂_n(f) + ( (ln m + ln(2/δ)) / (2n) )^{1/2}

with probability at least 1 − δ, δ > 0.
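The finite-class bound above is simple enough to compute directly; the values of n, m and δ in the example call are illustrative assumptions.

# Sketch: uniform deviation bound for a finite class, ((log(m) + log(2/delta)) / (2*n))^(1/2).
finite_class_bound <- function(n, m, delta = 0.05) {
  sqrt((log(m) + log(2 / delta)) / (2 * n))
}
finite_class_bound(n = 1000, m = 100)   # bound on |R(f) - R_hat_n(f)| holding for all 100 classifiers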
Beyond Chernoff and Hoeffding

It can also be shown that

    R(f̂_n) ≤ R̂_n(f+) + 2 ( (ln m + ln(2/δ)) / (2n) )^{1/2}    (10)

with probability at least 1 − δ, δ > 0, where, as before,

    f+ = arg inf_{f ∈ F} R(f)  and  f̂_n = arg min_{f ∈ F} R̂_n(f).

Equation (10) is of foundational importance, because it reveals clearly that the size of the function class controls the uniform bound on the crucial generalization error: indeed, as the size m of the function class F increases, the complexity term of the bound increases while the best achievable risk in F decreases, so that the trade-off between the two is controlled by the size m of the function class.


Vapnik-Chervonenkis Dimension

Definition (Shattering): Let X ≠ ∅ be any non-empty domain. Let F ⊆ 2^X be any non-empty class of functions having X as their domain. Let S ⊆ X be any finite subset of the domain X. Then S is said to be shattered by F iff

    { S ∩ f | f ∈ F } = 2^S.

In other words, F shatters S if any subset of S can be obtained by intersecting S with some set from F.
Example: A class F ⊆ 2^X of classifiers is said to shatter a set {x_1, x_2, …, x_n} of n points if, for any possible configuration of labels y_1, y_2, …, y_n, we can find a classifier f ∈ F that reproduces those labels.


Vapnik-Chervonenkis Dimension

Definition (VC dimension): Let X ≠ ∅ be any non-empty learning domain. Let F ⊆ 2^X be any non-empty class of functions having X as their domain. The VC dimension of F is the cardinality of the largest finite set S ⊆ X that is shattered by F, i.e.,

    VCdim(F) := max{ |S| : S is shattered by F }.

Note: If arbitrarily large finite sets are shattered by F, i.e., if no largest finite shattered set exists, then VCdim(F) = ∞.
Example: The VC dimension of a class F ⊆ 2^X of classifiers is the largest number of points that F can shatter.


Vapnik-Chervonenkis Dimension

Remarks: If VCdim(F) = d, then there exists a finite set S ⊆ X such that |S| = d and S is shattered by F. Importantly, every set S ⊆ X such that |S| > d is not shattered by F. Clearly, we do not expect to learn anything until we have at least d training points.
Intuitively, this means that an infinite VC dimension is not desirable, as it could imply the impossibility of learning the concept underlying any data from the population under consideration. However, a finite VC dimension does not guarantee the learnability of the concept underlying any data from the population under consideration either.
Fact: Let F be any finite function (concept) class. Then, since it requires 2^d distinct concepts to shatter a set of cardinality d, no set of cardinality greater than log_2 |F| can be shattered. Therefore, log_2 |F| is always an upper bound for the VC dimension of finite concept classes.


Vapnik-Chervonenkis Dimension

To gain insights into the central concept of VC dimension, we herein consider a few examples of practical interest for which the VC dimension can be found.
VC dimension of the space of separating hyperplanes: Let X = R^p be the domain for the binary (Y ∈ {−1, +1}) classification task, and consider using hyperplanes to separate the points of X. Let F denote the class of all such separating hyperplanes. Then

    VCdim(F) = p + 1.

Intuitively, the usual pictures for the case X = R² (not reproduced here) help see why the VC dimension is p + 1: three points in general position can always be labeled arbitrarily and separated by a line, while four points cannot, as the XOR labeling shows (a rough numerical check follows below).
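As a rough numerical companion (not part of the slides), the sketch below checks shattering by linear classifiers in R² using a hard-margin linear SVM from the e1071 package as a heuristic separability test; the particular point configurations, the use of svm(), and the large cost value are all illustrative assumptions.

# Sketch: do linear classifiers shatter 3 points in R^2? 4 points? (heuristic check)
library(e1071)

separable <- function(x, y) {
  fit <- svm(x, factor(y), type = "C-classification",
             kernel = "linear", cost = 1e6, scale = FALSE)
  all(as.character(predict(fit, x)) == as.character(y))   # zero training error as a proxy
}

shatters <- function(x) {
  n <- nrow(x)
  labelings <- expand.grid(rep(list(c(-1, 1)), n))         # all 2^n label configurations
  all(apply(labelings, 1, function(y) {
    if (length(unique(y)) == 1) return(TRUE)               # constant labelings always achievable
    separable(x, y)
  }))
}

three_pts <- rbind(c(0, 0), c(1, 0), c(0, 1))
four_pts  <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1))
c(shatters(three_pts), shatters(four_pts))                 # expect TRUE then FALSE (XOR labeling)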


Foundation Result in Statistical Learning Theory

Theorem (Vapnik and Chervonenkis, 1971): Let F be a class of functions implementing some learning machines, and let ν = VCdim(F) be the VC dimension of F. Let the theoretical and the empirical risks be defined as earlier, and consider any data distribution in the population of interest. Then, ∀f ∈ F, the prediction error (theoretical risk) is bounded by

    R(f) ≤ R̂_n(f) + ( [ν(log(2n/ν) + 1) − log(δ/4)] / n )^{1/2}    (11)

with probability at least 1 − δ, or, equivalently,

    Pr{ TestError ≤ TrainError + ( [ν(log(2n/ν) + 1) − log(δ/4)] / n )^{1/2} } ≥ 1 − δ.
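For a sense of scale, the bound in (11) is easy to tabulate; the function below evaluates it, and the example values of the training error, n, ν and δ are illustrative assumptions only.

# Sketch: the VC generalization bound of (11) as a function of n, nu and delta.
vc_bound <- function(train_error, n, nu, delta = 0.05) {
  train_error + sqrt((nu * (log(2 * n / nu) + 1) - log(delta / 4)) / n)
}
vc_bound(train_error = 0.10, n = 10000, nu = 3)   # e.g. separating lines in R^2 (nu = p + 1 = 3)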


Appeal of the VC Bound
Note: One of the greatest appeals of the VC bound is that, though applicable to function classes of infinite dimension, it preserves the same intuitive form as the bound derived for finite-dimensional F.
Essentially, using the VC dimension concept, the number L of possible labeling configurations obtainable from F with VCdim(F) = ν over 2n points satisfies

    L ≤ (en/ν)^ν.    (12)

The VC bound is then obtained essentially by replacing log |F| with log L in the expression of the risk bound for finite-dimensional F.
The most important part of the above theorem is the fact that the generalization ability of a learning machine depends on both the empirical risk and the complexity of the class of functions used, which is measured here by the VC dimension ν of F (Vapnik and Chervonenkis, 1971).
Appeal of the VC Bound

Also, the bounds offered here are distribution-free, since no assumption is made about the distribution of the population.
The details of this important result will be discussed again in chapters 6 and 7, where we will present other measures of the capacity of a class of functions.
Remark: From the expression of the VC bound, it is clear that an intuitively appealing way to improve the predictive performance (reduce the prediction error) of a class of machines is to achieve a trade-off (compromise) between a small VC dimension and minimization of the empirical risk.
At first, it may seem as if the VC dimension is acting in a way similar to the number of parameters, since it serves as a measure of the complexity of F. In this spirit, the following is a possible guiding principle.


Appeal of the VC Bound

Intuition: One should seek to construct a classifier that achieves the best trade-off (balance, compromise) between the complexity of the function class, measured by the VC dimension, and the fit to the training data, measured by the empirical risk.
Now equipped with this sound theoretical foundation, one can then go on to the implementation of various learning machines. We shall use R to discover some of the most commonly used learning machines.


Machine Learning CRAN Task View in R
Let's visit the website where most of the R community goes:
http://www.r-project.org
Let's install some packages and get started:
install.packages("ctv")
library(ctv)
install.views("MachineLearning")
install.views("HighPerformanceComputing")
install.views("Bayesian")
install.views("Robust")
Let's load a couple of packages and explore:
library(e1071)
library(MASS)
library(kernlab)
