Lecture 1
2 Formal Framework
2.1 Basic notions
In our formal model for machine learning, the instances to be classified are members of a
set X , the domain set or feature space. Instances are to be classified into a label set Y.
For now (and most of the class), we assume that the label set is binary, that is Y = {0, 1}.
For example, an instance x ∈ X could be an email and its label indicates whether the
email is spam (y = 1) or not spam (y = 0). We often assume that the instances are
represented as real-valued vectors, that is X ⊆ R^d for some dimension d.
A predictor or classifier is a function h : X → Y. A learner is a function that takes
some training data and maps it to a predictor. We let the training data be denoted by a
sequence S = ((X1 , Y1 ), . . . , (Xn , Yn )). Then, formally, a learner A is a function
\[
A \colon \bigcup_{i=1}^{\infty} (X \times Y)^i \to Y^X
\]
\[
A \colon S \mapsto h,
\]
where Y^X denotes the set of all functions from set X to set Y. For convenience, when the
learner is clear from context, we use the notation h_n to denote the output of the learner
on data of size n, that is h_n = A(S) for |S| = n.
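To make these type signatures concrete, here is a minimal Python sketch; the names and the trivial majority-vote learner are illustrative choices, not part of the formal model:

```python
from typing import Callable, List, Tuple

Instance = Tuple[float, ...]           # an instance x in X, with X a subset of R^d
Sample = List[Tuple[Instance, int]]    # S = ((X_1, Y_1), ..., (X_n, Y_n)), labels in {0, 1}
Predictor = Callable[[Instance], int]  # a classifier h : X -> Y
Learner = Callable[[Sample], Predictor]

def majority_learner(S: Sample) -> Predictor:
    """A trivial learner: always predicts the majority label of S."""
    majority = int(2 * sum(y for _, y in S) >= len(S))
    return lambda x: majority
```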
The goal of learning is to produce a predictor h that correctly classifies not only the
training data, but also future instances that it has not seen yet. We thus need a mathe-
matical description of how the environment produces instances. In particular, we would
like to model that the environment (or nature) remains somehow stable, that the process
that generated the training data is the same that will generate future data.
We model the data generation as a probability distribution P over X ×Y = X ×{0, 1}.
We further assume that the instances (Xi , Yi ) are i.i.d. (independently and identically
distributed) according to P .
The performance of a classifier h on an instance (X, Y ) is measured by a loss function.
A loss function is a function
\[
\ell \colon (Y^X \times X \times Y) \to \mathbb{R}.
\]
The value ℓ(h, X, Y) ∈ R indicates “how badly h predicts on example (X, Y)”. We will,
for now, work with the binary loss (or 0/1-loss), defined as
\[
\ell(h, X, Y) = \mathbb{1}[h(X) \neq Y],
\]
where 1[p] denotes the indicator function of predicate p, that is 1[p] = 1 if p is true and
1[p] = 0 if p is false. The binary loss is 1 if the prediction of h on example (X, Y) is
wrong. If the prediction is correct, no loss is suffered and the binary loss assigns value 0.
We can now formally phrase the goal of learning as aiming for a classifier that has
low loss in expectation over the data generating distribution. That is, we would like to
output a classifier that has low expected loss, or risk, defined as
\[
L(h) = \mathbb{E}_{(X,Y) \sim P}\, \ell(h, X, Y).
\]
Since our loss function assumes only values in {0, 1}, the above expectation is equal to
the probability of generating an example (X, Y) on which h makes a wrong prediction. That
is, we have
\[
L(h) = P\big( h(X) \neq Y \big).
\]
Note, however, that the learner does not get to see the data generating distribution. It
thus cannot simply output a classifier with lowest expected loss. The learner needs to make its
decisions based on the data S. Given a classifier h and data S, the learner can evaluate
the empirical risk of h on S
\[
L_n(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X_i) \neq Y_i].
\]
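In code, the binary loss and the empirical risk translate directly into the following minimal sketch (h is any predictor in the sense above):

```python
def zero_one_loss(h, x, y):
    """Binary (0/1) loss: 1 if h misclassifies the example (x, y), else 0."""
    return int(h(x) != y)

def empirical_risk(h, S):
    """L_n(h): the average 0/1 loss of h over the sample S."""
    return sum(zero_one_loss(h, x, y) for x, y in S) / len(S)
```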
Claim 1. For all functions h : X → {0, 1} and for all sample sizes n we have
\[
\mathbb{E}_S\, L_n(h) = L(h).
\]
Proof.
\begin{align*}
\mathbb{E}_S\, L_n(h) &= \mathbb{E}_S\, \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X_i) \neq Y_i] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_S\, \mathbb{1}[h(X_i) \neq Y_i] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{(X,Y)}\, \mathbb{1}[h(X) \neq Y] \\
&= \frac{1}{n} \sum_{i=1}^{n} L(h) \\
&= L(h),
\end{align*}
where the second equality holds by linearity of expectation, and the third equality holds
since each summand depends only on a single example (the i-th) in S, which is distributed
according to P.
Thus, for any fixed function, the empirical risk gives us an unbiased estimate of the
quantity that we are after, the true risk. Note that this holds even for small sample sizes.
Moreover, by the law of large numbers, the above claim implies that, with large sample
sizes, the empirical risk of a classifier converges to its true risk (in probability). As we
see more and more data, the empirical risk of a function becomes a better and better
estimate of its true risk.
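This behavior is easy to observe numerically. The following sketch fixes a classifier before drawing any data and estimates its empirical risk for growing sample sizes; the distribution and thresholds below are arbitrary choices for illustration:

```python
import random

# P: X uniform on [0, 1], Y = 1[X > 0.5]. The fixed classifier h
# thresholds at 0.3, so it errs exactly when X lands in (0.3, 0.5],
# giving a true risk of L(h) = 0.2.
def draw_sample(n):
    xs = [random.random() for _ in range(n)]
    return [(x, int(x > 0.5)) for x in xs]

h = lambda x: int(x > 0.3)  # fixed before the data is seen

for n in [10, 100, 10_000]:
    S = draw_sample(n)
    L_n = sum(int(h(x) != y) for x, y in S) / n
    print(n, L_n)  # L_n(h) concentrates around the true risk 0.2
```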
This may lead us to believe that the simple learning strategy of just finding some
function with low empirical risk should succeed at achieving low true risk as we see more
and more data. However, the following phenomenon shows that this strategy can in fact
go wrong arbitrarily badly.
Claim 2. There exists a distribution P and a learner, such that for all n we have
\[
L_n(h_n) = 0 \quad \text{while} \quad L(h_n) = 1.
\]
Proof. As the data generating distribution, consider the uniform distribution over [0, 1] × {1}.
That is, in any sample S generated by this P, the examples are labeled with 1, that is
S = ((X_1, 1), . . . , (X_n, 1)). We construct a “stubborn” learner A. The stubborn learner
outputs a function that agrees with the sample’s labels on points that were in the sample
S, but keeps believing that the label is 0 everywhere else. Formally:
\[
h_n(X) = A(S)(X) = \begin{cases} 1 & \text{if } (X, 1) \in S \\ 0 & \text{otherwise.} \end{cases}
\]
Now we clearly have L_n(h_n) = 0 for all n. However, since S is finite, the set of instances X
on which h_n predicts 1 has measure 0. Thus, with probability 1 over a fresh example drawn
from P, h_n outputs the incorrect label 0, and therefore L(h_n) = 1.
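The stubborn learner is easy to simulate. The sketch below uses the distribution from the proof (uniform on [0, 1] with constant label 1) and reproduces L_n(h_n) = 0 alongside a true risk of essentially 1:

```python
import random

def stubborn_learner(S):
    """Memorizes the sampled points; predicts 0 everywhere else."""
    memorized = {x for x, _ in S}
    return lambda x: 1 if x in memorized else 0

S = [(random.random(), 1) for _ in range(100)]  # all labels are 1
h_n = stubborn_learner(S)

empirical = sum(int(h_n(x) != y) for x, y in S) / len(S)
true_est = sum(int(h_n(random.random()) != 1) for _ in range(10_000)) / 10_000
print(empirical, true_est)  # 0.0 on the sample, ~1.0 on fresh data
```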
The difference between the situations in the above two claims is that, in the second
case, the function h_n depends on the data. While, for every fixed function h (fixed before
the data is seen), the empirical risk estimates converge to the true risk of this function,
this convergence is not uniform over all functions. Claim 2 shows that, at any given
sample size, there exist functions for which true and empirical risk are arbitrarily far
apart.
Now, in machine learning, we do want the function that the learner outputs to be able
to depend on the data. Furthermore, the learner only ever gets to see a finite amount of
data. We have seen that, for any finite sample size, that is, on any finite amount of data,
the empirical risk can be a very bad indicator of the true risk of a function.
Basic questions of learning theory thus are: How can we control the (true) risk of a
function learned based on a finite amount of data? Can we identify situations where we
can relate the true and empirical risk?
A natural approach is to restrict the learner to a hypothesis class H ⊆ Y^X, fixed before
seeing the data, and to let the learner output a function from H that minimizes the
empirical risk. This learning rule is called empirical risk minimization (ERM):
\[
\hat{h}_n \in \operatorname*{argmin}_{h \in H} L_n(h).
\]
We would then like to show guarantees of the form
\[
L(\hat{h}_n) \leq \inf_{h \in H} L(h) + f(n),
\]
where f is a decreasing function of sample size n. That is, as we see more and more data,
we would like that the true risk of the output of the learner approaches the best risk
possible with the class H. Or equivalently, we would like to show that
\[
L(\hat{h}_n) - \inf_{h \in H} L(h) \leq f(n).
\]
We say that the realizability assumption holds for H if there exists a function h* ∈ H
with L(h*) = 0, that is, if the class contains a function of zero risk.
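Over a finite class, ERM can be implemented by exhaustive search. A minimal sketch, using a hypothetical class of threshold functions on [0, 1] under which the example distribution below is realizable:

```python
import random

def empirical_risk(h, S):
    """L_n(h): average 0/1 loss of h over the sample S."""
    return sum(int(h(x) != y) for x, y in S) / len(S)

def erm(H, S):
    """Return a hypothesis in H with minimal empirical risk on S."""
    return min(H, key=lambda h: empirical_risk(h, S))

# A hypothetical finite class: N = 10 threshold functions on [0, 1].
H = [(lambda t: (lambda x: int(x > t)))(k / 10) for k in range(10)]

# Realizable data: the true labeler (threshold 0.5) lies in H.
S = [(x, int(x > 0.5)) for x in (random.random() for _ in range(100))]

h_hat = erm(H, S)
print(empirical_risk(h_hat, S))  # 0.0: ERM fits a realizable sample perfectly
```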
Theorem 1. Let H = {h_1, . . . , h_N} and δ ∈ (0, 1]. Under the realizability assumption,
we have with probability at least (1 − δ) over the generation of the sample S
\[
L(\hat{h}_n) \leq \frac{\log N + \log(1/\delta)}{n}.
\]
Proof. By the realizability assumption, there is an h* ∈ H with L(h*) = 0, and hence
L_n(h*) = 0 with probability 1 over the sample S. ERM thus always outputs a function
ĥ_n with L_n(ĥ_n) = 0. So, for any ε > 0, ERM only outputs a function with error larger
than ε if L_n(ĥ_n) = 0 while L(ĥ_n) ≥ ε.
For every h ∈ H with L(h) > ε, since the n examples are drawn i.i.d. from P, we have
\[
P_S\big( L_n(h) = 0 \big) = \big(1 - L(h)\big)^n \leq (1 - \epsilon)^n \leq e^{-\epsilon n}.
\]
By a union bound over the at most N such functions in H, the probability that any h ∈ H
with L(h) > ε attains L_n(h) = 0 is at most N e^{-εn}. Now we set
\[
\epsilon = \frac{\log(1/\delta) + \log N}{n}.
\]
Plugging in this value for ε, we get N e^{-εn} = δ, and thus we have shown
\[
P_S\left( L(\hat{h}_n) \geq \frac{\log N + \log(1/\delta)}{n} \right) \leq \delta.
\]
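To get a feel for the rate, the following snippet evaluates the bound for a few sample sizes; N = 1000 and δ = 0.01 are arbitrary example values:

```python
from math import log

N, delta = 1000, 0.01
for n in [100, 1_000, 10_000]:
    print(n, (log(N) + log(1 / delta)) / n)
# With N = 1000 and delta = 0.01, already n = 1000 guarantees a risk
# of roughly 0.012 or less with probability at least 0.99.
```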