
Lecture 1, October 18, 2016

Intro to Learning Theory


Ruth Urner

1 Machine Learning and Learning Theory


Coming soon...

2 Formal Framework
2.1 Basic notions
In our formal model for machine learning, the instances to be classified are members of a
set X , the domain set or feature space. Instances are to be classified into a label set Y.
For now (and most of the class), we assume that the label set is binary, that is Y = {0, 1}.
For example, an instance x ∈ X could be an email and its label indicates whether the
email is spam (y = 1) or not spam (y = 0). We often assume that the instances are
represented as real-valued vectors, that is X ⊆ R^d for some dimension d.
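For instance, an email can be turned into such a vector by a bag-of-words encoding over a fixed vocabulary; the following Python sketch is purely illustrative (the vocabulary and helper name are made up):

import string

# Hypothetical featurization: represent an email as a vector in R^d by
# counting occurrences of each word from a fixed vocabulary of size d.
VOCAB = ["free", "money", "meeting", "report"]   # d = 4, illustrative only

def featurize(email: str) -> tuple:
    # Count vocabulary words, ignoring case and punctuation.
    words = email.lower().translate(str.maketrans("", "", string.punctuation)).split()
    return tuple(float(words.count(w)) for w in VOCAB)

print(featurize("Free money! Claim your free prize"))   # (2.0, 1.0, 0.0, 0.0)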
A predictor or classifier is a function h : X → Y. A learner is a function that takes
some training data and maps it to a predictor. We let the training data be denoted by a
sequence S = ((X_1, Y_1), . . . , (X_n, Y_n)). Then, formally, a learner A is a function

A : ⋃_{i=1}^{∞} (X × Y)^i → Y^X
A : S ↦ h,

where Y^X denotes the set of all functions from set X to set Y. For convenience, when the
learner is clear from context, we use the notation h_n to denote the output of the learner
on data of size n, that is h_n = A(S) for |S| = n.
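To make these definitions concrete, here is a minimal Python sketch of the framework; the type names and the trivial majority-vote learner are illustrative choices, not part of the formal model:

from typing import Callable, List, Tuple

# An instance x is a point of the domain set X (here: a real vector),
# a label y is an element of Y = {0, 1}.
Instance = Tuple[float, ...]
Label = int
Sample = List[Tuple[Instance, Label]]     # S = ((X_1, Y_1), ..., (X_n, Y_n))

Predictor = Callable[[Instance], Label]   # h : X -> Y
Learner = Callable[[Sample], Predictor]   # A maps training data to a predictor

def majority_learner(s: Sample) -> Predictor:
    # A trivial learner: always predict the majority label of the sample.
    majority = int(sum(y for _, y in s) >= len(s) / 2)
    return lambda x: majority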
The goal of learning is to produce a predictor h that correctly classifies not only the
training data, but also future instances that it has not seen yet. We thus need a
mathematical description of how the environment produces instances. In particular, we would
like to model that the environment (or nature) remains somehow stable, that the process
that generated the training data is the same that will generate future data.
We model the data generation as a probability distribution P over X ×Y = X ×{0, 1}.
We further assume that the instances (X_i, Y_i) are i.i.d. (independently and identically
distributed) according to P .
The performance of a classifier h on an instance (X, Y ) is measured by a loss function.
A loss function is a function

ℓ : Y^X × X × Y → R.

The value ℓ(h, X, Y) ∈ R indicates “how badly h predicts on example (X, Y)”. We will,
for now, work with the binary loss (or 0/1-loss), defined as

ℓ(h, X, Y) = 1[h(X) ≠ Y],

where 1[p] denotes the indicator function of predicate p, that is, 1[p] = 1 if p is true and
1[p] = 0 if p is false. The binary loss is 1 if the prediction of h on example (X, Y) is
wrong. If the prediction is correct, no loss is suffered and the binary loss assigns value 0.
We can now formally phrase the goal of learning as aiming for a classifier that has
low loss on expectation over the data generating distribution. That is, we would like to
output a classifier that has low expected loss, or risk, defined as

L(h) = E_{(X,Y)∼P}[ℓ(h, X, Y)] = E_{(X,Y)∼P}[1[h(X) ≠ Y]].

Since our loss function assumes only values in {0, 1}, the above expectation is equal to
the probability of generating an example X on which h makes a wrong prediction. That
is, we have

L(h) = E_{(X,Y)∼P}[1[h(X) ≠ Y]] = P_{(X,Y)∼P}[h(X) ≠ Y].

Note, however, that the learner does not get to see the data generating distribution. It
thus cannot merely output a classifier of lowest expected loss. The learner needs to make its
decisions based on the data S. Given a classifier h and data S, the learner can evaluate
the empirical risk of h on S:

L_n(h) = (1/n) ∑_{i=1}^{n} 1[h(X_i) ≠ Y_i].
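In code, the binary loss and the empirical risk are direct transcriptions of these formulas (a sketch, reusing the hypothetical types from the block above):

def binary_loss(h: Predictor, x: Instance, y: Label) -> int:
    # 0/1-loss: 1 if h predicts wrongly on (x, y), 0 otherwise.
    return int(h(x) != y)

def empirical_risk(h: Predictor, s: Sample) -> float:
    # L_n(h): the average binary loss of h over the n training examples.
    return sum(binary_loss(h, x, y) for x, y in s) / len(s)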

2.2 On the relation of empirical and true risk


A natural strategy for the learner would be to simply output a function that has small
empirical risk. In favor of this approach, we now show that the empirical risk is an
unbiased estimator of the true risk.

Claim 1. For all functions h : X → {0, 1} and for all sample sizes n we have

E_S[L_n(h)] = L(h).
Proof.

E_S[L_n(h)] = E_S[(1/n) ∑_{i=1}^{n} 1[h(X_i) ≠ Y_i]]
            = (1/n) ∑_{i=1}^{n} E_S[1[h(X_i) ≠ Y_i]]
            = (1/n) ∑_{i=1}^{n} E_{(X,Y)∼P}[1[h(X) ≠ Y]]
            = (1/n) ∑_{i=1}^{n} L(h)
            = L(h),

where the second equality holds by linearity of expectation, and the third equality holds
since each expectation depends only on one (the i-th) example in S.
Thus, for any fixed function, the empirical risk gives us an unbiased estimate of the
quantity that we are after, the true risk. Note that this holds even for small sample sizes.
Moreover, by the law of large numbers, the above claim implies that, with large sample
sizes, the empirical risk of a classifier converges to its true risk (in probability). As we
see more and more data, the empirical risk of a function becomes a better and better
estimate of its true risk.
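This convergence is easy to observe numerically. The following sketch fixes a classifier before seeing any data and estimates its risk from samples of growing size; the distribution (uniform marginal with 10% label noise, so that L(h) = 0.1 exactly) is made up for illustration:

import random

def draw(n):
    # Hypothetical P: X uniform on [0, 1], Y = 1[X > 0.3],
    # with each label flipped with probability 0.1.
    sample = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.3)
        if random.random() < 0.1:
            y = 1 - y
        sample.append((x, y))
    return sample

h = lambda x: int(x > 0.3)   # fixed before the data is seen; L(h) = 0.1

for n in [10, 100, 10000]:
    s = draw(n)
    l_n = sum(int(h(x) != y) for x, y in s) / n
    print(n, l_n)   # L_n(h) approaches the true risk 0.1 as n grows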
This may lead us to believe that the simple learning strategy of just finding some
function with low empirical risk should succeed at achieving low true risk as we see more
and more data. However, the following phenomenon shows that this strategy can in fact
go wrong arbitrarily badly.
Claim 2. There exists a distribution P and a learner such that for all n we have

L_n(h_n) = 0 and L(h_n) = 1.

Proof. As the data generating distribution, consider the uniform distribution over [0, 1] × {1}.
That is, in any sample S generated by this P, the examples are all labeled with 1, that is
S = ((X_1, 1), . . . , (X_n, 1)). We construct a “stubborn” learner A. The stubborn learner
outputs a function that agrees with the sample’s labels on points that were in the sample
S, but keeps believing that the label is 0 everywhere else. Formally:

h_n(X) = A(S)(X) = 1 if (X, 1) ∈ S, and 0 otherwise.

Now we clearly have L_n(h_n) = 0 for all n. However, since S is finite, the set of instances X
on which h_n predicts 1 has measure 0. Thus, with probability 1, h_n outputs the incorrect
label 0, and hence L(h_n) = 1.
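The stubborn learner can also be simulated directly (a sketch mirroring the construction above): its empirical risk is 0, yet a fresh draw from P almost surely misses every memorized point, so its error on unseen data is 1.

import random

def stubborn_learner(s):
    # Memorize the sampled instances; insist on label 0 everywhere else.
    seen = {x for x, _ in s}
    return lambda x: 1 if x in seen else 0

# P: X uniform on [0, 1], the label is always 1.
train = [(random.random(), 1) for _ in range(1000)]
fresh = [(random.random(), 1) for _ in range(1000)]

h_n = stubborn_learner(train)
print(sum(h_n(x) != y for x, y in train) / len(train))   # 0.0: zero empirical risk
print(sum(h_n(x) != y for x, y in fresh) / len(fresh))   # 1.0: fails on unseen data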
The difference between the situations in the above two claims is that, in the second
case, the function h_n depends on the data. While, for every fixed function h (fixed before
the data is seen), the empirical risk estimates converge to the true risk of this function,
this convergence is not uniform over all functions. Claim 2 shows that, at any given
sample size, there exist functions for which true and empirical risk are arbitrarily far
apart.
Now, in machine learning, we do want the function that the learner outputs to be able
to depend on the data. Furthermore, the learner only ever gets to see a finite amount of
data. We have seen that, for any finite sample size, that is, on any finite amount of data,
the empirical risk can be a very bad indicator of the true risk of a function.
Basic questions of learning theory thus are: How can we control the (true) risk of a
function learned based on a finite amount of data? Can we identify situations where we
can relate the true and empirical risk?

2.3 Fixing a hypothesis space


We have seen that, if we want our learned function h_n to depend on the data, we have
to change the rules for the learner. In Claim 2, we let the learner output any function it
wanted. This resulted in the learner adapting itself very well to the data it has seen in
the sample, achieving 0 empirical risk, while not making any progress towards predicting
well on unseen examples.
The construction of Claim 2 is an extreme version of a phenomenon called overfitting.
In informal terms, if a learning method has too much freedom with regards to the
functions it can output, it may “overadapt” to the training data, rather than extracting
structure that will also apply to the unseen examples. Overfitting is a frequently
encountered phenomenon in practice, one that has to be guarded against. To prevent the
learner from overfitting, we need to restrict the class of predictors.
A hypothesis class H is a set of predictors, H ⊆ {0, 1}^X. Instead of allowing the learner
to output any function, we will now consider learners that output functions from H. We
will see that, in many cases, fixing the hypothesis class before we see the data, will let us
regain control over the relation between empirical and true risk.
However, fixing the hypothesis class also means that there may not be any good
function in the class. We will thus rephrase the goal of learning, to only require the
learner to come up with a function that is (approximately) as good as the best function
in the class H.
Thus, our new goal is to show that
L(h_n) ≤ inf_{h∈H} L(h) + f(n),

where f is a decreasing function of the sample size n. That is, as we see more and more data,
we would like that the true risk of the output of the learner approaches the best risk
possible with the class H. Or equivalently, we would like to show that
L(h_n) − inf_{h∈H} L(h) ≤ f(n).

2.4 Learnability of finite classes


We now show that the above goal is achievable for finite classes H = {h_1, . . . , h_N}. We
will analyze the learner ERM (Empirical Risk Minimization), which outputs a function
from H that has minimal empirical risk:

ERM : S ↦ ĥ_n ∈ argmin_{i} L_n(h_i).
There may be several functions in H that have lowest empirical risk on a data set S. But
since every function h ∈ H has some empirical risk on data S, and the empirical risk can
only assume finitely many values (namely multiples of 1/n = 1/|S|), the argmin is a nonempty
subset of H. A learner is an ERM learner if it always outputs some function from this
subset.
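For a finite class, an ERM learner is a direct minimization over H; here is a sketch, where the class of 100 threshold classifiers and the toy sample are hypothetical examples:

def erm(hypotheses, s):
    # Return some h in H with minimal empirical risk on S (ties broken arbitrarily).
    return min(hypotheses, key=lambda h: sum(h(x) != y for x, y in s) / len(s))

# Hypothetical finite class: threshold classifiers h_t(x) = 1[x > t].
H = [(lambda x, t=k / 100: int(x > t)) for k in range(100)]

s = [(0.1, 0), (0.2, 0), (0.6, 1), (0.9, 1)]   # a toy sample
h_hat = erm(H, s)
print(h_hat(0.7))   # 1: the selected threshold separates the toy sample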
For now, we will further make a simplifying assumption on the data generating
distribution P. We will assume that P is realizable with respect to the class H. A distribution
is realizable with respect to a hypothesis class H if there is an h^* ∈ H with L(h^*) = 0.

Theorem 1. Let H = {h_1, . . . , h_N} and δ ∈ (0, 1]. Under the realizability assumption,
we have, with probability at least (1 − δ) over the generation of the sample S,

L(ĥ_n) ≤ (log N + log(1/δ)) / n.
Proof. Note that L_n(h^*) = 0 for all possible samples S. Thus, for any ε > 0, ERM only
outputs a function with error larger than ε if L_n(ĥ_n) = 0 while L(ĥ_n) ≥ ε.
For every h ∈ H with L(h) > ε, each of the n i.i.d. examples is classified correctly by h
with probability 1 − L(h) < 1 − ε, so we have

P_S{L_n(h) = 0} ≤ (1 − ε)^n ≤ e^{−εn}.

(Recall that, for all x ∈ R, we have (1 + x) ≤ e^x.)


Let H_ε denote the set of functions h in H with L(h) > ε. Using the union bound, we get

P_S{L(ĥ_n) ≥ ε} ≤ P_S{∃h ∈ H_ε : L_n(h) = 0}
              = P_S{∨_{h∈H_ε} L_n(h) = 0}
              ≤ |H_ε| (1 − ε)^n
              ≤ |H| (1 − ε)^n
              ≤ |H| e^{−εn} = N e^{−εn}.

Now we set ε = (log(1/δ) + log N) / n.
Plugging in this value for ε, we have shown

P_S{ L(ĥ_n) ≥ (log N + log(1/δ)) / n } ≤ δ,

which is equivalent to the statement of the theorem.


Thus, under the realizability assumption, we have, with probability at least 1 − δ, that
L(h_n) − min_{h∈H} L(h) ≤ f(n) for f(n) = (log N + log(1/δ)) / n.
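To get a feel for the rate f(n), one can evaluate the bound numerically (a sketch; the values N = 1000 and δ = 0.05 are arbitrary):

from math import ceil, log

N, delta = 1000, 0.05                          # |H| and the failure probability
bound = lambda n: (log(N) + log(1 / delta)) / n

print(bound(1000))                             # about 0.0099 with n = 1000 examples
eps = 0.01                                     # target risk
print(ceil((log(N) + log(1 / delta)) / eps))   # about 991 examples suffice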
