Lec SML Basic Theory 2
Dr Ernest Fokoué
Figure: scatter plot of the two classes in the plane (vertical axis x2).
While some points clearly belong to one of the classes, there are
other points that are either strangers in a foreign land, or are
positioned in such a way that no automatic classification rule can
clearly determine their class membership.
One can construct a classification rule that puts all the points in
their corresponding classes. Such a rule would prove disastrous in
classifying new observations not present in the current collection
of observations.
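As a minimal sketch of this point (not from the slides; the two overlapping Gaussian classes and the use of a 1-nearest-neighbour rule are my own choices), the code below builds a rule that classifies every training point correctly yet does noticeably worse on fresh data from the same distribution:

# minimal sketch: a rule with zero training error can still generalize poorly
library(class)                                    # provides knn()
set.seed(1)
make_data <- function(n) {
  y <- rep(c(-1, 1), each = n)                                # two classes
  x <- cbind(rnorm(2 * n, mean = y), rnorm(2 * n, mean = y))  # heavily overlapping clouds
  list(x = x, y = factor(y))
}
train <- make_data(100); test <- make_data(1000)
train_err <- mean(knn(train$x, train$x, train$y, k = 1) != train$y)  # 0: every point memorized
test_err  <- mean(knn(train$x, test$x,  train$y, k = 1) != test$y)   # substantially larger
c(train_err, test_err)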
Indeed, we have a collection of pairs $(x_i, y_i)$ of observations coming from some unknown distribution $P(x, y)$.
so that
$$R(f^*) = R^* = \inf_{f} R(f).$$
The minimizer of the 0-1 risk functional over all possible classifiers is the so-called Bayes classifier, which we shall denote here by $f^*$, given by
$$f^* = \underset{f}{\arg\inf}\ \Pr_{(X,Y)\sim P}\big[\,Y \neq f(X)\,\big].$$
Similarly, within a chosen function class $\mathcal{F}$, the best element $f^+$ is such that
$$R(f^+) = R^+ = \inf_{f \in \mathcal{F}} R(f).$$
For the binary pattern recognition problem, one may consider finding the best linear separating hyperplane, i.e.,
$$\mathcal{F} = \left\{ f : \mathcal{X} \to \{-1,+1\} \;\middle|\; \beta_0 \in \mathbb{R},\ \boldsymbol{\beta} = (\beta_1,\dots,\beta_p)^\top \in \mathbb{R}^p,\ f(\mathbf{x}) = \mathrm{sign}\!\left(\boldsymbol{\beta}^\top \mathbf{x} + \beta_0\right),\ \mathbf{x} \in \mathcal{X} \right\}.$$
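As a small sketch (the particular values of beta and beta0 below are arbitrary, purely for illustration), a member of this class can be coded directly in R:

# a member of F: f(x) = sign(beta' x + beta0), with the boundary mapped to +1
linear_classifier <- function(x, beta, beta0) {
  ifelse(drop(x %*% beta) + beta0 >= 0, +1, -1)
}
X <- matrix(rnorm(10), ncol = 2)                  # five points in R^2
linear_classifier(X, beta = c(1, -2), beta0 = 0.5)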
Let $\mathcal{D} = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ be an iid sample from $P(x, y)$.
The empirical version of the risk functional is
$$\widehat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{Y_i \neq f(X_i)\}}.$$
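In R, the empirical risk of a fixed classifier is simply the proportion of misclassified sample points; the data-generating mechanism and the classifier in this sketch are hypothetical choices made only so the code runs:

# sketch: empirical 0-1 risk of a fixed classifier on a simulated sample
set.seed(1)
n <- 200
X <- matrix(rnorm(2 * n), ncol = 2)
Y <- ifelse(X[, 1] + X[, 2] + rnorm(n) > 0, 1, -1)    # noisy labels
f <- function(x) ifelse(x[, 1] + x[, 2] >= 0, 1, -1)  # a fixed classifier
R_hat <- mean(Y != f(X))                              # (1/n) * sum of 1{Y_i != f(X_i)}
R_hat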
Under the squared error loss, one seeks the estimator $\hat{\theta}$ that minimizes the mean squared error,
$$\hat{\theta} = \arg\min_{\hat{\theta}}\ \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \arg\min_{\hat{\theta}}\ \mathrm{MSE}(\hat{\theta}).$$
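For reference, the mean squared error admits the standard bias-variance decomposition (a well-known identity, stated here because the figure below is labelled in exactly these terms):
$$\mathrm{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta\big)^2}_{\text{squared bias}} + \underbrace{\mathbb{V}(\hat{\theta})}_{\text{variance}}.$$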
Figure: Illustration of the qualitative behavior of the dependence of bias versus variance on a trade-off parameter such as $\lambda$ or $h$ (curves: true risk, squared bias, variance; optimal smoothing lies between less and more smoothing). For small values the variability is too high; for large values the bias gets large.
Since making the estimator of the function arbitrarily complex causes the problems mentioned earlier, the intuition behind a trade-off suggests that, instead of simply minimizing the empirical risk $\widehat{R}_n(f)$, one should do the following:
Choose a collection of function spaces $\{\mathcal{F}_k : k = 1, 2, \dots\}$, maybe a collection of nested spaces (increasing in size);
Minimize the empirical risk in each class;
Minimize the penalized empirical risk (see the sketch below):
$$\min_{k}\ \min_{f \in \mathcal{F}_k}\ \Big\{ \widehat{R}_n(f) + \mathrm{penalty}(k, n) \Big\}.$$
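A minimal sketch of this recipe, under assumptions of my own choosing (nested classes F_k of polynomial logistic classifiers of degree k, and a penalty of order sqrt(k*log(n)/n); the slides do not prescribe either choice):

# sketch: penalized empirical risk minimization over nested classes of increasing degree
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- ifelse(runif(n) < plogis(2 * x - x^2), 1, -1)            # simulated labels in {-1, +1}
penalized <- sapply(1:6, function(k) {
  fit  <- glm(I(y == 1) ~ poly(x, k), family = binomial)      # empirical risk minimizer in F_k
  yhat <- ifelse(predict(fit, type = "response") > 0.5, 1, -1)
  mean(yhat != y) + sqrt(k * log(n) / n)                       # empirical risk + penalty(k, n)
})
which.min(penalized)                                           # selected class index k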
Theorem
(Bienaymé–Chebyshev inequality) Let $X$ be a random variable with finite mean $\mu_X = \mathbb{E}[X]$, i.e. $|\mathbb{E}[X]| < +\infty$, and finite variance $\sigma_X^2 = \mathbb{V}(X)$, i.e. $\mathbb{V}(X) < +\infty$. Then, for all $\epsilon > 0$,
$$\Pr\big[\,|X - \mathbb{E}[X]| > \epsilon\,\big] \leq \frac{\mathbb{V}(X)}{\epsilon^2}.$$
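A quick sanity check of this inequality by simulation (the Exp(1) distribution and the value of epsilon are arbitrary choices for the sketch):

# sketch: empirical tail probability vs the Chebyshev bound V(X)/eps^2, for X ~ Exp(1)
set.seed(2)
x   <- rexp(1e5, rate = 1)        # E[X] = 1, V(X) = 1
eps <- 2
mean(abs(x - 1) > eps)            # empirical Pr[|X - E[X]| > eps], about 0.05
1 / eps^2                         # Chebyshev bound: 0.25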
It is therefore easy to see here that, with an unbiased $\hat{\theta}_n$, one has $\mathbb{E}[\hat{\theta}_n] = \theta$, and the result is immediate. For the sake of clarity, let's recall here the elementary weak law of large numbers.
Let $X_1, \dots, X_n$ be iid random variables with mean $\mu$ and variance $\sigma^2$, and define
$$\bar{X}_n = \frac{1}{n}\,(X_1 + X_2 + \cdots + X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Then, clearly, $\mathbb{E}[\bar{X}_n] = \mu$, and, $\forall \epsilon > 0$,
$$\lim_{n \to \infty} \Pr\big[\,|\bar{X}_n - \mu| > \epsilon\,\big] = 0. \qquad (3)$$
This essentially expresses the fact that the empirical mean $\bar{X}_n$ converges in probability to the theoretical mean $\mu$ in the limit of very large samples.
We therefore have
$$\bar{X}_n \xrightarrow{\ P\ } \mu.$$
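A small simulation of this convergence (Uniform(0,1) observations, so mu = 0.5; the sample sizes and epsilon are arbitrary):

# sketch: Pr[|X_bar_n - mu| > eps] shrinks as n grows (X_i ~ Uniform(0,1), mu = 0.5)
set.seed(3)
eps <- 0.05
sapply(c(10, 100, 1000), function(n) {
  mean(replicate(2000, abs(mean(runif(n)) - 0.5) > eps))
})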
Applying Chebyshev's inequality to $\bar{X}_n$, whose variance is $\sigma^2/n$, gives
$$\Pr\big[\,|\bar{X}_n - \mu| > \epsilon\,\big] \leq \frac{\sigma^2}{n\epsilon^2}, \qquad (4)$$
which, by inversion (setting $\delta = \sigma^2/(n\epsilon^2)$ and solving for $\epsilon$), is the same as
$$|\bar{X}_n - \mu| < \sqrt{\frac{\sigma^2}{n\delta}} \qquad (5)$$
with probability at least $1 - \delta$.
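A numerical check of this statement (Exp(1) observations, so mu = sigma^2 = 1; n and delta are arbitrary choices): the empirical coverage should be at least 1 - delta, and in fact Chebyshev is typically very conservative.

# sketch: coverage of |X_bar_n - mu| < sqrt(sigma^2 / (n * delta))
set.seed(4)
n <- 50; delta <- 0.1
bound <- sqrt(1 / (n * delta))                            # sigma^2 = 1 for Exp(1)
mean(replicate(5000, abs(mean(rexp(n)) - 1) < bound))     # well above 1 - delta = 0.9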
Why is all the above of any interest to statistical learning theory?
Figure: true risk, squared bias, and variance versus the amount of smoothing (less smoothing to more smoothing), repeating the earlier trade-off picture.
The ERM principle is said to be consistent if
$$R(\widehat{f}_n) \xrightarrow{\ P\ } R(f^+) \quad\text{and}\quad \widehat{R}_n(\widehat{f}_n) \xrightarrow{\ P\ } \inf_{f \in \mathcal{F}} R(f) = R(f^+), \quad \text{as } n \to \infty.$$
Vapnik discusses the details of this theorem at length, and extends the
exploration to include the difference between what he calls trivial
consistency and non-trivial consistency.
Vapnik shows that the sequence of the means of the random variables $\xi_n$ converges to zero as the number $n$ of observations increases.
He also remarks that the sequence of random variables $\xi_n$ converges in probability to zero if the set of functions $\mathcal{F}$ contains a finite number $m$ of elements. We will show that later in the case of pattern recognition.
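As a preview of that finite case (this is the standard union-bound-plus-Hoeffding argument, not necessarily the derivation given later in the course): for a class with a finite number $m$ of elements, $\Pr\big[\max_{f \in \mathcal{F}} |\widehat{R}_n(f) - R(f)| > \epsilon\big] \leq 2m\,e^{-2n\epsilon^2}$, which indeed vanishes as $n$ grows. A quick evaluation in R:

# sketch: the finite-class uniform bound 2*m*exp(-2*n*eps^2) goes to zero as n grows
finite_class_bound <- function(n, m, eps) 2 * m * exp(-2 * n * eps^2)
sapply(c(100, 1000, 10000), finite_class_bound, m = 50, eps = 0.05)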
Consistency of the Empirical Risk Minimization principle
It remains then to describe the properties of the set of functions $\mathcal{F}$ and of the probability measure $P(x, y)$ under which the sequence of random variables $\xi_n$ converges in probability to zero.
(" # " #)
lim P sup [R(f ) Rn (f )] > or sup [Rn (f ) R(f )] >
b b = 0.
n f F f F
For any fixed $f$, the empirical risk is an unbiased estimator of the true risk:
$$\mathbb{E}\big[\widehat{R}_n(f)\big] = R(f). \qquad (8)$$
So we have, $\forall \epsilon > 0$, bounds on $\Pr\big[\,|\widehat{R}_n(f) - R(f)| > \epsilon\,\big]$: since $\widehat{R}_n(f)$ is a proportion, both the Chebyshev bound and the sharper Chernoff (Hoeffding) bound apply, as compared in the figure below.
Figure: Chernoff vs Chebyshev bounds for proportions; theoretical bound $f(n, \delta)$ as a function of the sample size $n$, for $\delta = 0.01$ (left) and $\delta = 0.05$ (right).
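A sketch reproducing the comparison in the figure, under the assumption that the two curves are the inverted bounds 1/(2*sqrt(n*delta)) (Chebyshev, using p(1-p) <= 1/4) and sqrt(log(2/delta)/(2*n)) (Chernoff/Hoeffding):

# sketch: theoretical deviation bounds for a proportion, as functions of n
chebyshev_eps <- function(n, delta) 1 / (2 * sqrt(n * delta))
chernoff_eps  <- function(n, delta) sqrt(log(2 / delta) / (2 * n))
n  <- seq(100, 12000, by = 100)
op <- par(mfrow = c(1, 2))
for (delta in c(0.01, 0.05)) {
  plot(n, chebyshev_eps(n, delta), type = "l", col = "red",
       xlab = "n = Sample size", ylab = "Theoretical bound f(n, delta)",
       main = paste("delta =", delta))
  lines(n, chernoff_eps(n, delta), col = "blue")
  legend("topright", legend = c("Chebyshev", "Chernoff"), col = c("red", "blue"), lty = 1)
}
par(op)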
If the uniform deviation satisfies
$$\sup_{f \in \mathcal{F}} \big|\widehat{R}_n(f) - R(f)\big| \leq \epsilon,$$
then
$$R(\widehat{f}_n) - R(f^+) \leq 2\epsilon.$$
Proof: Recall that we defined $\widehat{f}_n$ as the best function yielded by the empirical risk $\widehat{R}_n(f)$ in the function class $\mathcal{F}$. Recall also that $\widehat{R}_n(\widehat{f}_n)$ can be made as small as possible, as we saw earlier. Therefore, with $f^+$ being the minimizer of the true risk over $\mathcal{F}$, we always have
$$\widehat{R}_n(f^+) - \widehat{R}_n(\widehat{f}_n) \geq 0.$$
As a result,
$$R(\widehat{f}_n) = R(\widehat{f}_n) - R(f^+) + R(f^+) \leq \widehat{R}_n(f^+) - \widehat{R}_n(\widehat{f}_n) + R(\widehat{f}_n) - R(f^+) + R(f^+) \leq 2\sup_{f \in \mathcal{F}} \big|R(f) - \widehat{R}_n(f)\big| + R(f^+).$$
Consequently,
$$R(\widehat{f}_n) - R(f^+) \leq 2\sup_{f \in \mathcal{F}} \big|R(f) - \widehat{R}_n(f)\big|,$$
as required.
Beyond Chernoff and Hoeffding
Corollary: A direct consequence of the above theorem is the following. For a given machine $f \in \mathcal{F}$, with probability at least $1 - \delta$,
$$R(f) \leq \widehat{R}_n(f) + \left( \frac{\ln(2/\delta)}{2n} \right)^{1/2}.$$
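For a concrete feel (the values of n and delta below are arbitrary), the corollary's deviation term is easy to evaluate:

# sketch: the deviation term sqrt(ln(2/delta) / (2n)) from the corollary
gen_gap <- function(n, delta) sqrt(log(2 / delta) / (2 * n))
gen_gap(n = 1000, delta = 0.05)   # about 0.043, so R(f) <= R_hat_n(f) + 0.043 w.p. >= 0.95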
A finite set $S$ is shattered by $\mathcal{F}$ when every one of its subsets is picked out by some member of the class, i.e.
$$\{\, S \cap f \,:\, f \in \mathcal{F} \,\} = 2^S,$$
where $f$ is identified with the subset of points it labels $+1$.
For the class of linear classifiers in $\mathbb{R}^p$ introduced earlier, $\mathrm{VCdim}(\mathcal{F}) = p + 1$.
Intuitively, pictures for the case of $\mathcal{X} = \mathbb{R}^2$ (shown on the slides) help see why the VC dimension is $p + 1$.
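The same intuition can also be checked numerically. The sketch below (my own construction, using a hard-margin linear SVM from e1071 as the separability check) verifies that three non-collinear points in the plane are shattered by linear classifiers, consistent with VCdim = p + 1 = 3:

# sketch: every dichotomy of 3 non-collinear points in R^2 is realized by a hyperplane
library(e1071)                                       # svm() with a linear kernel
S <- cbind(x1 = c(0, 1, 0), x2 = c(0, 0, 1))         # three non-collinear points
labelings <- expand.grid(y1 = c(-1, 1), y2 = c(-1, 1), y3 = c(-1, 1))
shattered <- apply(labelings, 1, function(y) {
  if (length(unique(y)) == 1) return(TRUE)           # constant labelings: take beta = 0, beta0 = +/-1
  fit <- svm(S, factor(y), kernel = "linear", cost = 1e6, scale = FALSE)
  all(predict(fit, S) == factor(y))
})
all(shattered)                                       # TRUE: all 2^3 = 8 dichotomies are realized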
# CRAN task views (requires the ctv package)
library(ctv)
install.views("MachineLearning")
install.views("HighPerformanceComputing")
install.views("Bayesian")
install.views("Robust")
Let's load a couple of packages and explore
library(e1071)    # svm(), naiveBayes(), tune(), ...
library(MASS)     # lda(), qda(), classic datasets
library(kernlab)  # ksvm() and other kernel methods