
Principles of Statistical Machine Learning

Fondements Théoriques de l'Apprentissage Statistique
(Theoretical Foundations of Statistical Learning)

Dr Ernest Fokoué

Associate Professor of Statistics
School of Mathematical Sciences
Rochester Institute of Technology

Visiting Professor of Statistical Machine Learning
Laboratoire de Mathématiques de Bretagne Atlantique
Université de Bretagne-Sud, Vannes, France

June-July 2015


Binary Classification in the Plane
Ripley's 2D binary classification data set

Figure: Scatterplot of Ripley's 2D binary classification data set, plotted as x2 against x1.
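As a quick companion to the figure, here is a minimal R sketch for reproducing a plot of Ripley's data. It assumes the synth.tr data frame shipped with the MASS package (columns xs, ys and class label yc); this choice of data source is an assumption for illustration, not something specified in the slides.

# Minimal sketch: plot Ripley's two-class synthetic training data.
# Assumes MASS::synth.tr with columns xs, ys (inputs) and yc (class label).
library(MASS)

plot(synth.tr$xs, synth.tr$ys,
     col = ifelse(synth.tr$yc == 1, "red", "blue"),
     pch = 19, xlab = "x1", ylab = "x2",
     main = "Ripley's 2D binary classification data")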
Binary Classification in the Plane

For the binary classification problem introduced earlier:

A collection {(x_1, y_1), …, (x_n, y_n)} of i.i.d. observations is given.
x_i ∈ X ⊆ R², i = 1, …, n. X is the input space.
y_i ∈ {−1, +1}. Y = {−1, +1} is the output space.

What is the probability law that governs the (x_i, y_i)'s?
What is the functional relationship between x and y?
What is the "best" approach to determining, from the available observations, the relationship between x and y in such a way that, given a new (unseen) observation x_new, its class y_new can be predicted as accurately as possible?


Basic Remarks on Classification

While some points clearly belong to one of the classes, there are
other points that are either strangers in a foreign land, or are
positioned in such a way that no automatic classification rule can
clearly determine their class membership.
One can construct a classification rule that puts all the points in
their corresponding classes. Such a rule would prove disastrous in
classifying new observations not present in the current collection
of observations.
Indeed, we have a collection of pairs (x_i, y_i) of observations coming from some unknown distribution P(x, y).


Basic Remarks on Classification

Finding an automatic classification rule that achieves the absolute very best on the present data is not enough, since infinitely many more observations can be generated by P(x, y) for which good classification will be required.
Even the universally best classifier will make mistakes.
Of all the functions in Y^X, it is reasonable to assume that there is a function f* that maps any x ∈ X to its corresponding y ∈ Y, i.e.,

    f*: X → Y
        x ↦ f*(x),

with the minimum number of mistakes.


Risk Minimization Revisited
Let f denote any generic function mapping an element x of X to its corresponding image f(x) in Y.
Each time x is drawn from P(x), the disagreement between the image f(x) and the true image y is called the loss, denoted by ℓ(y, f(x)).
The expected value of this loss function with respect to the distribution P(x, y) is called the risk functional of f. Generically, we shall denote the risk functional of f by R(f), so that

    R(f) = E[ℓ(Y, f(X))] = ∫ ℓ(y, f(x)) dP(x, y).

The best function f* over the space Y^X of all measurable functions from X to Y is therefore

    f* = arg inf_f R(f),

so that

    R(f*) = R* = inf_f R(f).


On the need to reduce the search space

Unfortunately, f* can only be found if P(x, y) is known. Therefore, since we do not know P(x, y) in practice, it is hopeless to determine f*.
Besides, trying to find f* without the knowledge of P(x, y) implies having to search the infinite-dimensional function space Y^X of all mappings from X to Y, which is an ill-posed and computationally nasty problem.
Throughout this lecture, we will seek to solve the more reasonable problem of choosing, from a function space F ⊂ Y^X, the one function f+ ∈ F that best estimates the dependencies between x and y.
It is therefore important to define what is meant by "best estimates". For that, the concepts of loss function and risk functional need to be defined.


Loss and Risk in Pattern Recognition
For this classification/pattern recognition task, the so-called 0-1 loss function defined below is used. More specifically,

    ℓ(y, f(x)) = 1{y ≠ f(x)} = { 0 if y = f(x),  1 if y ≠ f(x) }.    (1)

The corresponding risk functional is

    R(f) = ∫ ℓ(y, f(x)) dP(x, y) = E[1{Y ≠ f(X)}] = Pr_{(X,Y)~P}[Y ≠ f(X)].

The minimizer of the 0-1 risk functional over all possible classifiers is the so-called Bayes classifier, which we shall denote here by f*, given by

    f* = arg inf_f Pr_{(X,Y)~P}[Y ≠ f(X)].

Specifically, the Bayes classifier f* is given by the posterior probability of class membership, namely

    f*(x) = arg max_{y ∈ Y} Pr[Y = y | x].
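To make the Bayes rule f*(x) = arg max_y Pr[Y = y | x] concrete, here is a small R sketch for a hypothetical setting in which P(x, y) is fully known: the prior Pr[Y = +1] and the two class-conditional densities (taken here to be bivariate Gaussians with identity covariance) are assumptions chosen purely for illustration, not part of the slides.

# Sketch: Bayes classifier when P(x, y) is known (assumed Gaussian class-conditionals).
bayes_classifier <- function(x, mu_neg, mu_pos, pi_pos = 0.5) {
  # posterior is proportional to prior times class-conditional density;
  # with identity covariance the density factorizes into univariate normals
  dens <- function(x, mu) dnorm(x[, 1], mu[1]) * dnorm(x[, 2], mu[2])
  post_pos <- pi_pos       * dens(x, mu_pos)
  post_neg <- (1 - pi_pos) * dens(x, mu_neg)
  ifelse(post_pos > post_neg, +1, -1)    # f*(x) = arg max_y Pr[Y = y | x]
}

x_new <- rbind(c(0.2, 0.1), c(1.8, 2.1))            # two query points
bayes_classifier(x_new, mu_neg = c(0, 0), mu_pos = c(2, 2))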


Function Class in Pattern Recognition
As stated earlier, trying to find f* is hopeless. One needs to select a function space F ⊂ Y^X, and then choose the best estimator f+ from F, i.e.,

    f+ = arg inf_{f ∈ F} R(f),

so that

    R(f+) = R+ = inf_{f ∈ F} R(f).

For the binary pattern recognition problem, one may consider finding the best linear separating hyperplane, i.e.,

    F = { f: X → {−1, +1} | ∃ β_0 ∈ R, β = (β_1, …, β_p)ᵀ ∈ R^p such that f(x) = sign(βᵀx + β_0), ∀ x ∈ X }.


Empirical Risk Minimization

Let D = {(X_1, Y_1), …, (X_n, Y_n)} be an i.i.d. sample from P(x, y).
The empirical version of the risk functional is

    R̂_n(f) = (1/n) Σ_{i=1}^{n} 1{Y_i ≠ f(X_i)}.

We therefore seek the best classifier by this empirical standard,

    f̂_n = arg min_{f ∈ F} { (1/n) Σ_{i=1}^{n} 1{Y_i ≠ f(X_i)} }.

Since it is impossible to search all possible functions, it is usually crucial to choose the "right" function space F.
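As an illustration of the empirical risk R̂_n(f), the following R sketch evaluates a fixed linear classifier f(x) = sign(βᵀx + β_0) on a simulated binary sample; the simulated data model and the particular coefficients are assumptions for illustration only, not part of the lecture.

# Sketch: empirical risk (0-1 loss) of a fixed linear classifier on simulated data.
set.seed(1)
n <- 200
x <- matrix(rnorm(2 * n), ncol = 2)                               # inputs in R^2
y <- ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, +1, -1)     # labels in {-1, +1}

f_linear <- function(x, beta, beta0) sign(drop(x %*% beta) + beta0)
empirical_risk <- function(yhat, y) mean(y != yhat)               # (1/n) * sum of 1{y_i != f(x_i)}

empirical_risk(f_linear(x, beta = c(1, 1), beta0 = 0), y)         # R_hat_n(f) for this fixed f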


Bias-Variance Trade-Off

In traditional statistical estimation, one needs to address at the very least issues like: (a) the bias of the estimator; (b) the variance of the estimator; (c) the consistency of the estimator. Recall from elementary point estimation that, if θ is the true value of the parameter to be estimated, and θ̂ is a point estimator of θ, then one can decompose the total error as follows:

    θ̂ − θ = (θ̂ − E[θ̂]) + (E[θ̂] − θ),    (2)

where the first term is the estimation error and the second term is the bias.

Under the squared error loss, one seeks the θ̂ that minimizes the mean squared error,

    θ̂ = arg min E[(θ̂ − θ)²] = arg min MSE(θ̂),

rather than trying to find the minimum variance unbiased estimator (MVUE).


Bias-Variance Trade-off

Clearly, the traditional so-called bias-variance decomposition of the MSE reveals the need for a bias-variance trade-off. Indeed,

    MSE(θ̂) = E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)²
            = variance + bias².

If the estimator θ̂ were to be sought from all possible values of θ, then it might make sense to hope for the MVUE. Unfortunately, and especially in function estimation as we clearly argued earlier, there will be some bias, so that the error one gets has a bias component along with the variance component in the squared error loss case. If the bias is made small, then an estimator with a larger variance is obtained. Similarly, a small variance will tend to come from estimators with a relatively large bias. The best compromise is then to trade off bias and variance, which in functional terms translates into a trade-off between approximation error and estimation error.
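The decomposition MSE(θ̂) = variance + bias² is easy to verify numerically. The sketch below does so in R for a family of shrinkage estimators θ̂ = c·X̄ of a normal mean; the true θ, the sample size and the shrinkage factors c are illustrative assumptions.

# Sketch: bias-variance trade-off for shrinkage estimators theta_hat = c * xbar.
set.seed(1)
theta <- 2; n <- 20; n_sim <- 10000
shrink <- c(0, 0.25, 0.5, 0.75, 0.9, 1)

decomp <- sapply(shrink, function(s) {
  est   <- s * replicate(n_sim, mean(rnorm(n, mean = theta)))   # many draws of theta_hat
  bias2 <- (mean(est) - theta)^2
  v     <- var(est)
  c(bias2 = bias2, variance = v, mse = bias2 + v)
})
colnames(decomp) <- paste0("c=", shrink)
round(decomp, 4)    # small c: large bias, small variance; c = 1: unbiased, largest variance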
Bias-Variance Trade-off

Figure: Illustration of the qualitative behavior of the dependence of bias (squared) versus variance on a trade-off parameter such as λ or h: the true risk is minimized at an intermediate amount of smoothing. For small values (less smoothing) the variability is too high; for large values (more smoothing) the bias gets large.


Structural risk minimization principle

Since making the estimator of the function arbitrarily complex causes the problems mentioned earlier, the intuition for a trade-off reveals that, instead of minimizing the empirical risk R̂_n(f) alone, one should do the following (see the sketch after this list):
Choose a collection of function spaces {F_k : k = 1, 2, …}, maybe a collection of nested spaces (increasing in size).
Minimize the empirical risk in each class.
Minimize the penalized empirical risk

    min_k  min_{f ∈ F_k}  R̂_n(f) + penalty(k, n),

where penalty(k, n) gives preference to models with small estimation error.

It is important to note that penalty(k, n) measures the capacity of the function class F_k. The widely used technique of regularization for solving ill-posed problems is a particular instance of structural risk minimization.
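Here is a minimal R sketch of the structural risk minimization idea over nested polynomial classes F_k (polynomials of degree k). The data-generating model and the AIC-like penalty 2k/n are illustrative assumptions; the slides do not prescribe a particular penalty.

# Sketch: penalized empirical risk over nested polynomial classes F_k.
set.seed(1)
n <- 100
x <- runif(n, -1, 1)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

penalized_risk <- sapply(1:10, function(k) {
  fit <- lm(y ~ poly(x, k))                 # minimize empirical (squared-error) risk within F_k
  mean(residuals(fit)^2) + 2 * k / n        # empirical risk + penalty(k, n)
})
which.min(penalized_risk)                   # degree k selected by the penalized criterion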


Regularization for Complexity Control

Tikhonov's Variational Approach to Regularization [Tikhonov, 1963]
Find f that minimizes the functional

    R̂_n^(reg)(f) = (1/n) Σ_{i=1}^{n} ℓ(y_i, f(x_i)) + λ Ω(f),

where λ > 0 is some predefined constant.

Ivanov's Quasi-solution Approach to Regularization [Ivanov, 1962]
Find f that minimizes the functional

    R̂_n(f) = (1/n) Σ_{i=1}^{n} ℓ(y_i, f(x_i))

subject to the constraint

    Ω(f) ≤ C,

where C > 0 is some predefined constant.
Regularization for Complexity Control
Phillips' Residual Approach to Regularization [Phillips, 1962]
Find f that minimizes the functional

    Ω(f)

subject to the constraint

    (1/n) Σ_{i=1}^{n} ℓ(y_i, f(x_i)) ≤ ε,

where ε > 0 is some predefined constant.

In all the above, the functional Ω(f) is called the regularization functional. Ω(f) is defined in such a way that it controls the complexity of the function f. For instance,

    Ω(f) = ||f''||² = ∫_a^b (f''(t))² dt

is a regularization functional used in spline smoothing.
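In R, smoothing splines implement exactly this Tikhonov-type penalty: stats::smooth.spline minimizes a residual sum of squares plus λ ∫ (f''(t))² dt. The simulated data below and the two λ values are illustrative assumptions.

# Sketch: roughness-penalized (Tikhonov-style) regression with smooth.spline().
set.seed(1)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

fit_rough  <- smooth.spline(x, y, lambda = 1e-8)   # tiny penalty: wiggly fit, low bias, high variance
fit_smooth <- smooth.spline(x, y, lambda = 1e-2)   # heavy penalty: smooth fit, higher bias, low variance
c(rough_df = fit_rough$df, smooth_df = fit_smooth$df)   # effective degrees of freedom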
Statistical Consistency

Definition: Let θ̂_n be an estimator of some scalar quantity θ based on an i.i.d. sample X_1, X_2, …, X_n from the distribution with parameter θ. Then θ̂_n is said to be a consistent estimator of θ if θ̂_n converges in probability to θ, i.e.,

    θ̂_n →^P θ as n → ∞.

In other words, θ̂_n is a consistent estimator of θ if, ∀ε > 0,

    lim_{n→∞} Pr{ |θ̂_n − θ| > ε } = 0.

It turns out that for unbiased estimators θ̂_n, consistency follows in a straightforward way from a basic probabilistic inequality like Chebyshev's inequality. However, for biased estimators, one has to be more careful.


A Basic Important Inequality

Theorem (Bienaymé-Chebyshev inequality): Let X be a random variable with finite mean μ_X = E[X], i.e., |E[X]| < +∞, and finite variance σ²_X = V(X), i.e., V(X) < +∞. Then, ∀ε > 0,

    Pr[ |X − E[X]| > ε ] ≤ V(X)/ε².

It is therefore easy to see here that, with unbiased θ̂_n, one has E[θ̂_n] = θ, and the consistency result is immediate. For the sake of clarity, let's recall here the elementary weak law of large numbers.
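A quick simulation makes the inequality tangible; the Exponential(1) distribution and the threshold ε = 2 below are arbitrary illustrative choices.

# Sketch: empirical tail probability versus the Chebyshev upper bound V(X)/eps^2.
set.seed(1)
x   <- rexp(1e5, rate = 1)                         # E[X] = 1, V(X) = 1
eps <- 2
c(empirical_tail = mean(abs(x - mean(x)) > eps),   # Pr[|X - E[X]| > eps], estimated
  chebyshev_bound = var(x) / eps^2)                # the bound is loose but valid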


Weak Law of Large Numbers

Let X be a random variable with finite mean μ_X = E[X], i.e., |E[X]| < +∞, and finite variance σ²_X = V(X), i.e., V(X) < +∞. Let X_1, X_2, …, X_n be a random sample of n observations drawn independently from the distribution of X, so that for i = 1, …, n, we have E[X_i] = μ and V[X_i] = σ². Let X̄_n be the sample mean, i.e.,

    X̄_n = (1/n)(X_1 + X_2 + … + X_n) = (1/n) Σ_{i=1}^{n} X_i.

Then, clearly, E[X̄_n] = μ, and, ∀ε > 0,

    lim_{n→∞} Pr[ |X̄_n − μ| > ε ] = 0.    (3)

This essentially expresses the fact that the empirical mean X̄_n converges in probability to the theoretical mean μ in the limit of very large samples.


Weak Law of Large Numbers

We therefore have

    X̄_n →^P μ as n → ∞.

With μ_X̄ = E[X̄_n] = μ and σ²_X̄ = σ²/n, one applies the Bienaymé-Chebyshev inequality and gets: ∀ε > 0,

    Pr[ |X̄_n − μ| > ε ] ≤ σ²/(nε²),    (4)

which, by inversion, is the same as

    |X̄_n − μ| < ( σ²/(nδ) )^{1/2}    (5)

with probability at least 1 − δ.
Why is all the above of any interest to statistical learning theory?


Weak Law of Large Numbers

Why is all the above of any interest to statistical learning theory?
Equation (3) states the much needed consistency of X̄_n as an estimator of μ.
Equation (4), by showing how the bound depends on n and ε, helps assess the rate at which X̄_n converges to μ.
Equation (5), by providing a confidence interval, helps compute bounds on the unknown true mean μ as a function of the empirical mean X̄_n and the confidence level 1 − δ.
Finally, how does one go about constructing estimators with all the above properties?
Components of Statistical Machine Learning
Interestingly, all those four components of classical estimation theory will be encountered again in statistical learning theory. Essentially, the four components of statistical learning theory consist of finding the answers to the following questions:
(a) What are the necessary and sufficient conditions for the consistency of a learning process based on the ERM principle? This leads to the Theory of consistency of learning processes.
(b) How fast is the rate of convergence of the learning process? This leads to the Nonasymptotic theory of the rate of convergence of learning processes.
(c) How can one control the rate of convergence (the generalization ability) of the learning process? This leads to the Theory of controlling the generalization ability of learning processes.
(d) How can one construct algorithms that can control the generalization ability of the learning process? This leads to the Theory of constructing learning algorithms.
Error Decomposition revisited

A reasoning on error decomposition and consistency of estimators, along with rates, bounds and algorithms, applies to function spaces as well: indeed, the difference between the true risk R(f̂_n) associated with f̂_n and the overall minimum risk R* can be decomposed to explore in greater detail the sources of error in the function estimation process:

    R(f̂_n) − R* = [R(f̂_n) − R(f+)] + [R(f+) − R*],    (6)

where the first bracket is the estimation error and the second is the approximation error.

A reasoning similar to the bias-variance trade-off and consistency arguments can be made, with the added complication brought by the need to distinguish between the true risk functional and the empirical risk functional, and also by the need to assess both pointwise behaviors and uniform behaviors. In a sense, one needs to generalize the decomposition and the law of large numbers to function spaces.


Approximation-Estimation Trade-Off

Figure: Illustration of the qualitative behavior of the dependence of bias (squared) versus variance on a trade-off parameter such as λ or h: the true risk is minimized at an intermediate amount of smoothing. For small values (less smoothing) the variability is too high; for large values (more smoothing) the bias gets large.


Consistency of the Empirical Risk Minimization principle

The ERM principle is consistent if it provides a sequence of functions f̂_n, n = 1, 2, …, for which both the expected risk R(f̂_n) and the empirical risk R̂_n(f̂_n) converge to the minimal possible value of the risk R(f+) in the function class under consideration, i.e.,

    R(f̂_n) →^P inf_{f ∈ F} R(f) = R(f+)  as n → ∞,

and

    R̂_n(f̂_n) →^P inf_{f ∈ F} R(f) = R(f+)  as n → ∞.

Vapnik discusses the details of this theorem at length, and extends the exploration to include the difference between what he calls trivial consistency and non-trivial consistency.


Consistency of the Empirical Risk Minimization principle

To better understand consistency in function spaces, consider the sequence of random variables

    ξ_n = sup_{f ∈ F} ( R(f) − R̂_n(f) ),    (7)

and consider studying

    lim_{n→∞} P{ sup_{f ∈ F} ( R(f) − R̂_n(f) ) > ε } = 0,  ∀ε > 0.

Vapnik shows that the sequence of the means of the random variables ξ_n converges to zero as the number n of observations increases.
He also remarks that the sequence of random variables ξ_n converges in probability to zero if the set of functions F contains a finite number m of elements. We will show that later in the case of pattern recognition.
Consistency of the Empirical Risk Minimization principle
It remains then to describe the properties of the set of functions F and of the probability measure P(x, y) under which the sequence of random variables ξ_n converges in probability to zero, i.e., under which

    lim_{n→∞} P{ sup_{f ∈ F} [R(f) − R̂_n(f)] > ε  or  sup_{f ∈ F} [R̂_n(f) − R(f)] > ε } = 0.

Recall that R̂_n(f) is the realized disagreement between the classifier f and the truth about the label y of x, based on the information contained in the sample D.
It is easy to see that, for a given (fixed) function (classifier) f,

    E[R̂_n(f)] = R(f).    (8)

Note that while this pointwise unbiasedness of the empirical risk is a good bottom-line property to have, it is not enough. More is needed, as the comparison is against R(f+), or better yet against R(f*).
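The pointwise unbiasedness E[R̂_n(f)] = R(f) in (8) is easy to check by simulation for one fixed classifier; the data-generating model below is an illustrative assumption.

# Sketch: E[R_hat_n(f)] = R(f) for a fixed classifier f, checked by Monte Carlo.
set.seed(1)
f <- function(x) sign(x[, 1] + x[, 2])                       # one fixed classifier
one_empirical_risk <- function(n) {
  x <- matrix(rnorm(2 * n), ncol = 2)
  y <- ifelse(x[, 1] + x[, 2] + rnorm(n) > 0, +1, -1)
  mean(y != f(x))                                            # R_hat_n(f) on one sample
}
mean(replicate(2000, one_empirical_risk(n = 50)))            # averages to R(f) over many samples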
Consistency of the Empirical Risk
Remember that the goal of statistical function estimation is to devise a technique (strategy) that chooses from the function class F the one function whose true risk is as close as possible to the lowest risk in class F.
The question arises: since one cannot calculate the true error, how can one devise a learning strategy for choosing classifiers based on it?
Tentative answer: at least devise strategies that yield functions for which the upper bound on the theoretical risk is as tight as possible, so that one can make confidence statements of the form:
With probability 1 − δ over an i.i.d. draw of some sample according to the distribution P, the expected future error rate of some classifier is bounded by a function g(δ, error rate on sample) of δ and the error rate on the sample,

    Pr{ TestError ≤ TrainError + ε(n, δ, F) } ≥ 1 − δ.


Foundation Result in Statistical Learning Theory

Theorem (Vapnik and Chervonenkis, 1971): Let F be a class of functions implementing some learning machines, and let ν = VCdim(F) be the VC dimension of F. Let the theoretical and the empirical risks be defined as earlier, and consider any data distribution in the population of interest. Then, ∀f ∈ F, the prediction error (theoretical risk) is bounded by

    R(f) ≤ R̂_n(f) + ( [ν(log(2n/ν) + 1) − log(δ/4)] / n )^{1/2}    (9)

with probability at least 1 − δ, or, equivalently,

    Pr{ TestError ≤ TrainError + ( [ν(log(2n/ν) + 1) − log(δ/4)] / n )^{1/2} } ≥ 1 − δ.


Bounds on the Generalization Error
For instance, using Chebyshev's inequality and the fact that E[R̂_n(f)] = R(f), it is easy to see that, for a given classifier f and a sample D = {(x_1, y_1), …, (x_n, y_n)},

    Pr[ |R̂_n(f) − R(f)| > ε ] ≤ R(f)(1 − R(f)) / (nε²).

To estimate the true but unknown error R(f) with a probability of at least 1 − δ, it makes sense to use inversion, i.e., set

    δ = R(f)(1 − R(f)) / (nε²),  so that  ε = ( R(f)(1 − R(f)) / (nδ) )^{1/2}.

Owing to the fact that max_{R(f) ∈ [0,1]} R(f)(1 − R(f)) = 1/4, we have

    ( R(f)(1 − R(f)) / (nδ) )^{1/2} ≤ ( 1/(4nδ) )^{1/2}.


Bounds on the Generalization Error

Based on Chebyshev's inequality, for a given classifier f, with a probability of at least 1 − δ, the bound on the difference between the true risk R(f) and the empirical risk R̂_n(f) is given by

    |R̂_n(f) − R(f)| < ( 1/(4nδ) )^{1/2}.

Recall that one of the goals of statistical learning theory is to assess the rate of convergence of the empirical risk to the true risk, which translates into assessing how tight the corresponding bounds on the true risk are.
In fact, many bounds can be so loose as to become useless. It turns out that the above Chebyshev-based bound is not a good one, at least compared to bounds obtained using the so-called Hoeffding's inequality.


Bounds on the Generalization Error

Theorem (Hoeffding's inequality): Let Z_1, Z_2, …, Z_n be a collection of i.i.d. random variables with Z_i ∈ [a, b]. Then, ∀ε > 0,

    Pr[ | (1/n) Σ_{i=1}^{n} Z_i − E[Z] | > ε ] ≤ 2 exp( −2nε² / (b − a)² ).

Corollary (Hoeffding's inequality for sample proportions): Let Z_1, Z_2, …, Z_n be a collection of i.i.d. random variables from a Bernoulli distribution with "success" probability p. Let p̂_n = (1/n) Σ_{i=1}^{n} Z_i. Clearly, p̂_n ∈ [0, 1] and E[p̂_n] = p.
Therefore, as a direct consequence of the above theorem, we have, ∀ε > 0,

    Pr[ |p̂_n − p| > ε ] ≤ 2 exp(−2nε²).


Bounds on the Generalization Error

So we have, ∀ε > 0,

    Pr[ |p̂_n − p| > ε ] ≤ 2 exp(−2nε²).

Now, setting δ = 2 exp(−2ε²n), it is straightforward to see that the Hoeffding-based 1 − δ level confidence bound on the difference between R(f) and R̂_n(f) for a fixed classifier f is given by

    |R̂_n(f) − R(f)| < ( ln(2/δ) / (2n) )^{1/2}.

Which of the two bounds is tighter? Clearly, we need to find out which of ln(2/δ) or 1/(2δ) is larger. This is the same as comparing exp(1/(2δ)) and 2/δ, which in turn means comparing a^{2/δ} and 2/δ, where a = exp(1/4). For the usual small values of δ, a^{2/δ} > 2/δ, so that the Hoeffding bound is the tighter of the two. The graph on the next slide also confirms this.
Bounds on the Generalization Error

Chernoff vs Chebyshev bounds for proportions: delta = 0.01 Chernoff vs Chebyshev bounds for proportions: delta = 0.05
0.5 0.25
Chernoff Chernoff
Chebyshev Chebyshev
0.45

0.4 0.2

0.35
Theoretical bound f(n,)

Theoretical bound f(n,)


0.3 0.15

0.25

0.2 0.1

0.15

0.1 0.05

0.05

0 0
0 2000 4000 6000 8000 10000 12000 0 2000 4000 6000 8000 10000 12000
n = Sample size n = Sample size

Dr Ernest Fokou (RIT) Principles of Statistical Machine Learning 34 / 49
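The comparison in the figure is easy to reproduce; the sketch below plots both bounds as functions of n for one value of δ (an illustrative choice).

# Sketch: Chebyshev bound (1/(4*n*delta))^(1/2) versus Hoeffding/Chernoff bound
# (log(2/delta)/(2*n))^(1/2), as functions of the sample size n.
delta <- 0.05
n     <- seq(100, 12000, by = 100)
chebyshev <- sqrt(1 / (4 * n * delta))
hoeffding <- sqrt(log(2 / delta) / (2 * n))

matplot(n, cbind(chebyshev, hoeffding), type = "l", lty = 1:2, col = 1:2,
        xlab = "n = Sample size", ylab = "Theoretical bound f(n, delta)")
legend("topright", legend = c("Chebyshev", "Hoeffding/Chernoff"), lty = 1:2, col = 1:2)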


Beyond Chernoff and Hoeffding

In all the above, we only addressed pointwise convergence of R̂_n(f) to R(f); i.e., for a fixed machine f ∈ F, we studied the convergence of R̂_n(f) to R(f).
Needless to say, such pointwise convergence is of very little use here.
A more interesting issue to address is uniform convergence. That is, over all machines f ∈ F, determine the necessary and sufficient conditions for the convergence of

    sup_{f ∈ F} |R̂_n(f) − R(f)|  to 0.

Clearly, such a study extends the Law of Large Numbers to function spaces, thereby providing tools for the construction of bounds on the theoretical errors of learning machines.
Beyond Chernoff and Hoeffding

Since uniform convergence requires the consideration of the entirety of the function space of interest, care needs to be taken regarding the dimensionality of the function space.
Uniform convergence will prove substantially easier to handle for finite function classes than for infinite-dimensional function spaces.
Indeed, for infinite-dimensional spaces, one will need to introduce concepts such as the capacity of the function space, measured through devices such as the VC dimension and covering numbers.


Beyond Chernoff and Hoeffding

Theorem: If R̂_n(f) and R(f) are close for all f ∈ F, i.e., for a given ε > 0,

    sup_{f ∈ F} |R̂_n(f) − R(f)| ≤ ε,

then

    R(f̂_n) − R(f+) ≤ 2ε.

Proof: Recall that we defined f̂_n as the best function yielded by the empirical risk R̂_n(f) in the function class F. Recall also that R̂_n(f̂_n) can be made as small as possible, as we saw earlier. Therefore, with f+ being the minimizer of the true risk in class F, we always have

    R̂_n(f+) − R̂_n(f̂_n) ≥ 0.

As a result,

    R(f̂_n) = R(f̂_n) − R(f+) + R(f+)
            ≤ [R̂_n(f+) − R̂_n(f̂_n)] + [R(f̂_n) − R(f+)] + R(f+)
            ≤ 2 sup_{f ∈ F} |R(f) − R̂_n(f)| + R(f+).

Consequently,

    R(f̂_n) − R(f+) ≤ 2 sup_{f ∈ F} |R(f) − R̂_n(f)|,

as required.
Beyond Chernoff and Hoeffding

Corollary: A direct consequence of the above theorem is the following. For a given machine f ∈ F,

    R(f) ≤ R̂_n(f) + ( ln(2/δ) / (2n) )^{1/2}

with probability at least 1 − δ, δ > 0.

If the function class F is finite, i.e.,

    F = {f_1, f_2, …, f_m},

where m = |F| = #F = the number of functions in the class F, then it can be shown that, for all f ∈ F,

    R(f) ≤ R̂_n(f) + ( (ln m + ln(2/δ)) / (2n) )^{1/2}

with probability at least 1 − δ, δ > 0.
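The finite-class bound above is simple enough to compute directly; the values of n, m and δ in the example call are illustrative assumptions.

# Sketch: uniform deviation bound for a finite class, ((log(m) + log(2/delta)) / (2*n))^(1/2).
finite_class_bound <- function(n, m, delta = 0.05) {
  sqrt((log(m) + log(2 / delta)) / (2 * n))
}
finite_class_bound(n = 1000, m = 100)   # bound on |R(f) - R_hat_n(f)| holding for all 100 classifiers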
Beyond Chernoff and Hoeffding

It can also be shown that

    R(f̂_n) ≤ R̂_n(f+) + 2 ( (ln m + ln(2/δ)) / (2n) )^{1/2}    (10)

with probability at least 1 − δ, δ > 0, where, as before,

    f+ = arg inf_{f ∈ F} R(f)  and  f̂_n = arg min_{f ∈ F} R̂_n(f).

Equation (10) is of foundational importance, because it reveals clearly that the size of the function class controls the uniform bound on the crucial generalization error: indeed, as the size m of the function class F increases, the complexity term of the bound increases while the best achievable risk in F decreases, so that the trade-off between the two is controlled by the size m of the function class.


Vapnik-Chervonenkis Dimension

Definition (Shattering): Let X ≠ ∅ be any non-empty domain. Let F ⊆ 2^X be any non-empty class of functions having X as their domain. Let S ⊆ X be any finite subset of the domain X. Then S is said to be shattered by F iff

    { S ∩ f | f ∈ F } = 2^S.

In other words, F shatters S if any subset of S can be obtained by intersecting S with some set from F.
Example: A class F ⊆ 2^X of classifiers is said to shatter a set {x_1, x_2, …, x_n} of n points if, for any possible configuration of labels y_1, y_2, …, y_n, we can find a classifier f ∈ F that reproduces those labels.


Vapnik-Chervonenkis Dimension

Definition (VC dimension): Let X ≠ ∅ be any non-empty learning domain. Let F ⊆ 2^X be any non-empty class of functions having X as their domain. The VC dimension of F is the cardinality of the largest finite set S ⊆ X that is shattered by F, i.e.,

    VCdim(F) := max{ |S| : S is shattered by F }.

Note: If arbitrarily large finite sets are shattered by F, i.e., if no largest finite shattered set exists, then VCdim(F) = ∞.
Example: The VC dimension of a class F ⊆ 2^X of classifiers is the largest number of points that F can shatter.


Vapnik-Chervonenkis Dimension

Remarks: If VCdim(F) = d, then there exists a finite set S ⊆ X such that |S| = d and S is shattered by F. Importantly, every set S ⊆ X such that |S| > d is not shattered by F. Clearly, we do not expect to learn anything until we have at least d training points.
Intuitively, this means that an infinite VC dimension is not desirable, as it could imply the impossibility of learning the concept underlying any data from the population under consideration. However, a finite VC dimension does not guarantee the learnability of the concept underlying any data from the population under consideration either.
Fact: Let F be any finite function (concept) class. Then, since it requires 2^d distinct concepts to shatter a set of cardinality d, no set of cardinality greater than log_2 |F| can be shattered. Therefore, log_2 |F| is always an upper bound for the VC dimension of finite concept classes.


Vapnik-Chervonenkis Dimension

To gain insights into the central concept of VC dimension, we herein consider a few examples of practical interest for which the VC dimension can be found.
VC dimension of the space of separating hyperplanes: Let X = R^p be the domain for the binary (Y ∈ {−1, +1}) classification task, and consider using hyperplanes to separate the points of X. Let F denote the class of all such separating hyperplanes. Then

    VCdim(F) = p + 1.

Intuitively, the usual pictures for the case X = R² (not reproduced here) help see why the VC dimension is p + 1: three points in general position can always be labeled arbitrarily and separated by a line, while four points cannot, as the XOR labeling shows (a rough numerical check follows below).
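As a rough numerical companion (not part of the slides), the sketch below checks shattering by linear classifiers in R² using a hard-margin linear SVM from the e1071 package as a heuristic separability test; the particular point configurations, the use of svm(), and the large cost value are all illustrative assumptions.

# Sketch: do linear classifiers shatter 3 points in R^2? 4 points? (heuristic check)
library(e1071)

separable <- function(x, y) {
  fit <- svm(x, factor(y), type = "C-classification",
             kernel = "linear", cost = 1e6, scale = FALSE)
  all(as.character(predict(fit, x)) == as.character(y))   # zero training error as a proxy
}

shatters <- function(x) {
  n <- nrow(x)
  labelings <- expand.grid(rep(list(c(-1, 1)), n))         # all 2^n label configurations
  all(apply(labelings, 1, function(y) {
    if (length(unique(y)) == 1) return(TRUE)               # constant labelings always achievable
    separable(x, y)
  }))
}

three_pts <- rbind(c(0, 0), c(1, 0), c(0, 1))
four_pts  <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1))
c(shatters(three_pts), shatters(four_pts))                 # expect TRUE then FALSE (XOR labeling)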


Foundation Result in Statistical Learning Theory

Theorem (Vapnik and Chervonenkis, 1971): Let F be a class of functions implementing some learning machines, and let ν = VCdim(F) be the VC dimension of F. Let the theoretical and the empirical risks be defined as earlier, and consider any data distribution in the population of interest. Then, ∀f ∈ F, the prediction error (theoretical risk) is bounded by

    R(f) ≤ R̂_n(f) + ( [ν(log(2n/ν) + 1) − log(δ/4)] / n )^{1/2}    (11)

with probability at least 1 − δ, or, equivalently,

    Pr{ TestError ≤ TrainError + ( [ν(log(2n/ν) + 1) − log(δ/4)] / n )^{1/2} } ≥ 1 − δ.
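For a sense of scale, the bound in (11) is easy to tabulate; the function below evaluates it, and the example values of the training error, n, ν and δ are illustrative assumptions only.

# Sketch: the VC generalization bound of (11) as a function of n, nu and delta.
vc_bound <- function(train_error, n, nu, delta = 0.05) {
  train_error + sqrt((nu * (log(2 * n / nu) + 1) - log(delta / 4)) / n)
}
vc_bound(train_error = 0.10, n = 10000, nu = 3)   # e.g. separating lines in R^2 (nu = p + 1 = 3)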


Appeal of the VC Bound
Note: One of the greatest appeals of the VC bound is that, though applicable to function classes of infinite dimension, it preserves the same intuitive form as the bound derived for finite-dimensional F.
Essentially, using the VC dimension concept, the number L of possible labeling configurations obtainable from F with VCdim(F) = ν over 2n points satisfies

    L ≤ (en/ν)^ν.    (12)

The VC bound is then obtained essentially by replacing log |F| with log L in the expression of the risk bound for finite-dimensional F.
The most important part of the above theorem is the fact that the generalization ability of a learning machine depends on both the empirical risk and the complexity of the class of functions used, which is measured here by the VC dimension ν of F (Vapnik and Chervonenkis, 1971).
Appeal of the VC Bound

Also, the bounds offered here are distribution-free, since no assumption is made about the distribution of the population.
The details of this important result will be discussed again in chapters 6 and 7, where we will present other measures of the capacity of a class of functions.
Remark: From the expression of the VC bound, it is clear that an intuitively appealing way to improve the predictive performance (reduce the prediction error) of a class of machines is to achieve a trade-off (compromise) between a small VC dimension and minimization of the empirical risk.
At first, it may seem as if the VC dimension is acting in a way similar to the number of parameters, since it serves as a measure of the complexity of F. In this spirit, the following is a possible guiding principle.


Appeal of the VC Bound

Intuition: One should seek to construct a classifier that achieves the best trade-off (balance, compromise) between the complexity of the function class, measured by the VC dimension, and the fit to the training data, measured by the empirical risk.
Now equipped with this sound theoretical foundation, one can then go on to the implementation of various learning machines. We shall use R to discover some of the most commonly used learning machines.


Machine Learning CRAN Task View in R
Let's visit the website where most of the R community goes:
http://www.r-project.org
Let's install some packages and get started:
install.packages("ctv")
library(ctv)
install.views("MachineLearning")
install.views("HighPerformanceComputing")
install.views("Bayesian")
install.views("Robust")
Let's load a couple of packages and explore:
library(e1071)
library(MASS)
library(kernlab)
