02 First Model of Learning
You watch me do this trick a couple times and notice I always hand out 5 cards
A short digression
• Suppose that Javier has a biased coin, which lands on heads with some unknown probability $p$
• Javier tosses the coin $n$ times
– Let $\hat{p}_n$ denote the fraction of tosses that land on heads
– By the law of large numbers, $\hat{p}_n \to p$ as $n \to \infty$
Clearly, at least in a very limited sense, we can learn something about $p$ from observations
There is always the possibility that we are totally wrong, but given enough data,
the probability should be very small
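To make this concrete, here is a minimal simulation sketch (assuming Python with NumPy; the bias $p = 0.3$ and the sample sizes are arbitrary illustrative choices) showing the empirical fraction of heads concentrating around $p$, and the probability of a large deviation shrinking as $n$ grows.

import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                   # the "unknown" bias (known only to the simulation)

# the estimate p_hat gets closer to p as the number of tosses n grows
for n in [10, 100, 1000, 10000]:
    p_hat = (rng.random(n) < p).mean()    # fraction of heads in n tosses
    print(n, p_hat, abs(p_hat - p))

# estimate the probability of a deviation of at least 0.1 when n = 100
n, eps, trials = 100, 0.1, 100_000
p_hats = rng.binomial(n, p, size=trials) / n
print((np.abs(p_hats - p) >= eps).mean())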
Connection to learning
Coin tosses: We want to estimate $p$ (i.e., predict how likely a “heads” is)
Learning: We want to estimate a function $h: \mathcal{X} \to \mathcal{Y}$
Define
(Population) risk: $R(h) = \mathbb{P}\left(h(X) \ne Y\right)$
Empirical risk: $\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{h(x_i) \ne y_i\}$
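As a concrete sketch of these two quantities (assuming Python with NumPy; the data distribution and the single threshold classifier below are my own illustrative choices, not from the lecture): the empirical risk is the average 0-1 loss on the training sample, while the population risk is approximated here by evaluating $h$ on a very large independent sample.

import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    # toy distribution (illustrative only): x ~ Uniform[0,1], y = 1{x > 0.6},
    # with each label flipped with probability 0.1
    x = rng.random(n)
    y = ((x > 0.6) ^ (rng.random(n) < 0.1)).astype(int)
    return x, y

def h(x):
    # one fixed hypothesis: predict 1 whenever x > 0.5
    return (x > 0.5).astype(int)

x_train, y_train = sample(100)
empirical_risk = np.mean(h(x_train) != y_train)   # average 0-1 loss on the sample

x_test, y_test = sample(1_000_000)                # large sample to approximate R(h)
population_risk = np.mean(h(x_test) != y_test)

print(empirical_risk, population_risk)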
Trust, but verify
The law of large numbers guarantees that as long as we have enough data, we will have that $\hat{R}_n(h) \approx R(h)$
This means that we can use $\hat{R}_n(h)$ to verify whether $h$ was a good hypothesis
If we fix a hypothesis $h$ before drawing our data, then the law of large numbers tells us that $\hat{R}_n(h) \to R(h)$ as $n \to \infty$
However, it is also true that for a fixed $n$, if we are considering a large number of hypotheses, it can still be very likely that there is some hypothesis $h$ for which $\hat{R}_n(h)$ is still very far from $R(h)$
Example
Question 1: If I toss a fair coin 10 times, what is the probability that I get 10 heads?
Even though this probability is tiny for a single coin, if many people each toss a fair coin 10 times, it becomes likely that at least one of them gets 10 heads. In the same way, if we consider many hypotheses, it is also likely that there will be at least one hypothesis $h$ for which $\hat{R}_n(h)$ is very different from $R(h)$
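The arithmetic behind this example (the numbers of repetitions $K$ below are my own illustrative choices): a single fair coin shows 10 heads in 10 tosses with probability $(1/2)^{10} = 1/1024 \approx 0.001$, but the chance that at least one of $K$ independent 10-toss experiments comes up all heads grows quickly with $K$.

p_single = 0.5 ** 10                      # P(10 heads in 10 tosses) = 1/1024

# probability that at least one of K independent 10-toss experiments is all heads
for K in [1, 100, 1000, 10000]:
    print(K, 1 - (1 - p_single) ** K)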
We are given a set of candidate hypotheses $\mathcal{H}$ and training data $\{(x_i, y_i)\}_{i=1}^{n}$, where each example is drawn independently from the same (unknown) distribution
From the training data $\{(x_i, y_i)\}_{i=1}^{n}$, we would like to select the best possible hypothesis from $\mathcal{H}$
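A minimal sketch of this selection step (assuming Python with NumPy; the toy distribution and the class of threshold classifiers are my own illustrative choices): compute the empirical risk of every hypothesis in a finite class and keep the minimizer.

import numpy as np

rng = np.random.default_rng(2)

# training data from a toy distribution (illustrative only)
n = 200
x = rng.random(n)
y = ((x > 0.6) ^ (rng.random(n) < 0.1)).astype(int)

# a finite hypothesis class: threshold classifiers h_t(x) = 1{x > t}
thresholds = np.linspace(0.0, 1.0, 21)
emp_risks = [np.mean((x > t).astype(int) != y) for t in thresholds]

# select the hypothesis with the smallest empirical risk
t_hat = thresholds[int(np.argmin(emp_risks))]
print(t_hat, min(emp_risks))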
Empirical risk
Recall our definition of risk and its empirical counterpart
Risk: $R(h) = \mathbb{P}\left(h(X) \ne Y\right)$
Empirical risk: $\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{h(x_i) \ne y_i\}$
The empirical risk $\hat{R}_n(h)$ gives us an estimate of the true risk $R(h)$, and from the law of large numbers we know that $\hat{R}_n(h) \to R(h)$ as $n \to \infty$
Aside: note that $\mathbb{E}[\hat{R}_n(h)] = R(h)$, i.e., the empirical risk is an unbiased estimate of the true risk
The risk in ERM
As long as we have enough data, for any particular hypothesis $h$, we expect $\hat{R}_n(h) \approx R(h)$
However, if $|\mathcal{H}|$ is very large, then we can also expect that there are some $h \in \mathcal{H}$ for which $\hat{R}_n(h)$ is far from $R(h)$
Of course, we can never guarantee that $\hat{R}_n(h) \approx R(h)$ holds, so instead we will be concerned with the probability that $|\hat{R}_n(h) - R(h)| \ge \epsilon$
[Figure: the sampling distribution of $\hat{R}_n(h)$]
Too much randomness?
Ultimately, we will want to show something like
$|\hat{R}_n(h) - R(h)| \le \epsilon$ for all $h \in \mathcal{H}$ (with high probability)
In order to tease all of this apart, let’s begin by going back to just a single hypothesis $h$ and studying $\mathbb{P}\left(|\hat{R}_n(h) - R(h)| \ge \epsilon\right)$
Bounding the error
We want to calculate
$\mathbb{P}\left(\left|\hat{R}_n(h) - R(h)\right| \ge \epsilon\right)$
and this is just asking about the probability that a Binomial random variable (here $n\hat{R}_n(h)$, which is $\mathrm{Binomial}(n, R(h))$) will be within $n\epsilon$ of its mean
This has no nice closed-form expression, is rather unwieldy to work with, and doesn’t give us much intuition
Instead, we will settle for an upper bound of the form $\mathbb{P}\left(\left|\hat{R}_n(h) - R(h)\right| \ge \epsilon\right) \le \delta$, or equivalently, a statement that with probability at least $1 - \delta$ we have $\left|\hat{R}_n(h) - R(h)\right| < \epsilon$
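For instance (a sketch assuming Python with SciPy, and arbitrary illustrative values of $n$, $R(h)$, and $\epsilon$), the exact deviation probability can be evaluated numerically from the Binomial distribution, even though it has no convenient closed form:

import numpy as np
from scipy.stats import binom

n, R, eps = 100, 0.3, 0.1        # sample size, true risk R(h), tolerance (illustrative)
S = binom(n, R)                  # n * empirical risk is Binomial(n, R(h))

upper = S.sf(np.ceil(n * (R + eps)) - 1)    # P(S >= n(R + eps))
lower = S.cdf(np.floor(n * (R - eps)))      # P(S <= n(R - eps))
print(upper + lower)                        # P(|R_hat - R| >= eps)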
Concentration inequalities
An inequality of the form $\mathbb{P}\left(|Z - \mathbb{E}[Z]| \ge t\right) \le \delta$ (for a random variable $Z$ and a small $\delta$) is called a concentration inequality
There are many different concentration inequalities that give us various bounds
along these lines
We will start with a very simple one, and then build up to a stronger result
Markov’s inequality
The simplest of these results is Markov’s inequality: if $Z$ is a nonnegative random variable, then for any $t > 0$,
$\mathbb{P}(Z \ge t) \le \frac{\mathbb{E}[Z]}{t}$
This immediately gives us Chebyshev’s inequality: for any random variable $Z$ with finite variance and any $t > 0$,
$\mathbb{P}\left(|Z - \mathbb{E}[Z]| \ge t\right) \le \frac{\mathrm{Var}(Z)}{t^2}$
Proof. Note that $(Z - \mathbb{E}[Z])^2$ is a nonnegative random variable. Thus we can apply Markov’s inequality to obtain
$\mathbb{P}\left(|Z - \mathbb{E}[Z]| \ge t\right) = \mathbb{P}\left((Z - \mathbb{E}[Z])^2 \ge t^2\right) \le \frac{\mathbb{E}\left[(Z - \mathbb{E}[Z])^2\right]}{t^2} = \frac{\mathrm{Var}(Z)}{t^2}$
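As a quick numerical illustration (a sketch assuming Python with SciPy; the coin-toss parameters are arbitrary), we can compare what Markov and Chebyshev guarantee for the empirical mean of $n$ coin tosses against the exact deviation probability:

import numpy as np
from scipy.stats import binom

n, p, eps = 100, 0.3, 0.1        # illustrative values
S = binom(n, p)                  # number of heads in n tosses

exact = S.sf(np.ceil(n * (p + eps)) - 1) + S.cdf(np.floor(n * (p - eps)))

# Chebyshev applied to p_hat = S/n, which has variance p(1-p)/n
chebyshev = p * (1 - p) / (n * eps ** 2)

# Markov applied directly to the nonnegative variable p_hat (one-sided only)
markov = p / (p + eps)

print(exact, chebyshev, markov)

Here the exact probability comes out to a few percent, while Chebyshev only promises roughly 0.21 and Markov roughly 0.75, which foreshadows the need for a stronger tool.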
Proof of Markov (Part 1)
There is a simple proof of Markov if you know the (super useful!) fact that for any nonnegative random variable $Z$,
$\mathbb{E}[Z] = \int_0^\infty \mathbb{P}(Z \ge u)\,du$
Thus, since $\mathbb{P}(Z \ge u) \ge \mathbb{P}(Z \ge t)$ for all $u \le t$,
$\mathbb{E}[Z] = \int_0^\infty \mathbb{P}(Z \ge u)\,du \ge \int_0^t \mathbb{P}(Z \ge u)\,du \ge t\,\mathbb{P}(Z \ge t)$
Proof of Markov (Part 2)
We can visualize this result as follows: $\mathbb{E}[Z]$ is the area under the curve $u \mapsto \mathbb{P}(Z \ge u)$, and this area always contains the rectangle of width $t$ and height $\mathbb{P}(Z \ge t)$, and hence
$t\,\mathbb{P}(Z \ge t) \le \mathbb{E}[Z] \quad\Longleftrightarrow\quad \mathbb{P}(Z \ge t) \le \frac{\mathbb{E}[Z]}{t}$
Hoeffding’s inequality
Chebyshev’s inequality gives us the kind of result we are after, but it is too loose to be of practical use
Hoeffding’s inequality assumes a bit more about our random variables beyond having finite variance (namely, that they are bounded), but gets us a much tighter and more useful result: if $Z_1, \ldots, Z_n$ are independent random variables with $Z_i \in [0, 1]$, then for any $\epsilon > 0$
$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[Z_i]\right| \ge \epsilon\right) \le 2e^{-2n\epsilon^2}$
To prove the upper tail, note that for any $s > 0$ and $t > 0$
$\mathbb{P}\left(\sum_{i=1}^{n}\left(Z_i - \mathbb{E}[Z_i]\right) \ge t\right) = \mathbb{P}\left(e^{s\sum_{i}(Z_i - \mathbb{E}[Z_i])} \ge e^{st}\right) \le e^{-st}\,\mathbb{E}\left[e^{s\sum_{i}(Z_i - \mathbb{E}[Z_i])}\right]$ (Markov)
$= e^{-st}\prod_{i=1}^{n}\mathbb{E}\left[e^{s(Z_i - \mathbb{E}[Z_i])}\right]$ (Independence)
Hoeffding’s Lemma
It is not obvious, but also not too hard to show, that if $Z_i \in [0, 1]$ then
$\mathbb{E}\left[e^{s(Z_i - \mathbb{E}[Z_i])}\right] \le e^{s^2/8}$
(the proof uses convexity and then gets a bound using a Taylor series expansion)
Plugging this in gives $\mathbb{P}\left(\sum_{i=1}^{n}(Z_i - \mathbb{E}[Z_i]) \ge t\right) \le e^{-st + ns^2/8}$. By setting $s = 4t/n$ (which minimizes the exponent), we have
$\mathbb{P}\left(\sum_{i=1}^{n}\left(Z_i - \mathbb{E}[Z_i]\right) \ge t\right) \le e^{-2t^2/n}$
Putting it all together
Thus, applying the same argument to the lower tail and setting $t = n\epsilon$, we have proven that
$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[Z_i]\right| \ge \epsilon\right) \le 2e^{-2n\epsilon^2}$
Finally, going back to our original problem (where the $Z_i$ are the 0-1 losses $\mathbb{1}\{h(x_i) \ne y_i\}$), this means that Hoeffding yields the bound
$\mathbb{P}\left(\left|\hat{R}_n(h) - R(h)\right| \ge \epsilon\right) \le 2e^{-2n\epsilon^2}$
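A quick numerical sanity check of this bound (a sketch assuming Python with NumPy; the parameters are arbitrary): simulate the empirical risk of a single hypothesis many times and compare the observed deviation frequency with the Hoeffding bound $2e^{-2n\epsilon^2}$, and with Chebyshev for reference.

import numpy as np

rng = np.random.default_rng(3)
n, R, eps, trials = 1000, 0.3, 0.05, 200_000    # illustrative values

# n * R_hat is Binomial(n, R) when the 0-1 losses are i.i.d. Bernoulli(R)
R_hats = rng.binomial(n, R, size=trials) / n
observed = np.mean(np.abs(R_hats - R) >= eps)

hoeffding = 2 * np.exp(-2 * n * eps ** 2)
chebyshev = R * (1 - R) / (n * eps ** 2)
print(observed, hoeffding, chebyshev)

The observed frequency is well below the Hoeffding bound, and the Hoeffding bound is in turn much smaller than what Chebyshev gives for these parameters.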
Multiple hypotheses
Thus, after much effort, we have that for a particular hypothesis $h$,
$\mathbb{P}\left(\left|\hat{R}_n(h) - R(h)\right| \ge \epsilon\right) \le 2e^{-2n\epsilon^2}$
What we actually need is for this to hold simultaneously for every $h \in \mathcal{H}$. Equivalently, we can try to bound the probability that any hypothesis has an empirical risk that deviates from its mean by more than $\epsilon$
Formal statement
We can express this mathematically as
$\mathbb{P}\left(\exists\, h \in \mathcal{H} : \left|\hat{R}_n(h) - R(h)\right| \ge \epsilon\right) \le \sum_{h \in \mathcal{H}} \mathbb{P}\left(\left|\hat{R}_n(h) - R(h)\right| \ge \epsilon\right) \le 2|\mathcal{H}|\, e^{-2n\epsilon^2}$
where the first step is the union bound; note that the factor $|\mathcal{H}|$ is linearly increasing in the number of hypotheses, while $e^{-2n\epsilon^2}$ is exponentially decreasing in $n$
This suggests that ERM is a reasonable approach as long as $|\mathcal{H}|$ isn’t too big (i.e., as long as $\log|\mathcal{H}|$ is small compared to $n\epsilon^2$)
Note that the above is equivalent to the statement that with probability at least $1 - 2|\mathcal{H}|\,e^{-2n\epsilon^2}$, we have $\left|\hat{R}_n(h) - R(h)\right| \le \epsilon$ for all $h \in \mathcal{H}$
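One convenient way to read this bound (a sketch; the target values of $\epsilon$ and $\delta$ below are arbitrary) is to ask how large $n$ must be so that $2|\mathcal{H}|e^{-2n\epsilon^2} \le \delta$, which gives $n \ge \log(2|\mathcal{H}|/\delta)/(2\epsilon^2)$:

import numpy as np

def sample_size(num_hypotheses, eps, delta):
    # smallest n with 2|H| exp(-2 n eps^2) <= delta
    return int(np.ceil(np.log(2 * num_hypotheses / delta) / (2 * eps ** 2)))

for H in [10, 1000, 10 ** 6]:
    print(H, sample_size(H, eps=0.05, delta=0.05))

Note how the required sample size grows very slowly as the class gets larger, consistent with the linear-versus-exponential remark above.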
Bounding the excess risk
Note that we would ideally actually like to choose $h^{\star} = \arg\min_{h \in \mathcal{H}} R(h)$, the hypothesis with the smallest true risk
What about $\hat{h}_n = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)$, the hypothesis that ERM actually selects?
We will bound $R(\hat{h}_n) - R(h^{\star})$ in two steps…
• $R(\hat{h}_n)$ cannot be too much bigger than $\hat{R}_n(\hat{h}_n)$: from before, with high probability $R(\hat{h}_n) \le \hat{R}_n(\hat{h}_n) + \epsilon$
• $\hat{R}_n(\hat{h}_n)$ cannot be too much bigger than $R(h^{\star})$: By the definition of $\hat{h}_n$, $\hat{R}_n(\hat{h}_n) \le \hat{R}_n(h^{\star})$. From before, we have $\hat{R}_n(h^{\star}) \le R(h^{\star}) + \epsilon$. Thus $\hat{R}_n(\hat{h}_n) \le R(h^{\star}) + \epsilon$.
Combining these, with probability at least $1 - 2|\mathcal{H}|\,e^{-2n\epsilon^2}$ we have $R(\hat{h}_n) \le R(h^{\star}) + 2\epsilon$
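To illustrate this guarantee numerically (a sketch assuming Python with NumPy; the toy distribution and threshold hypothesis class are my own illustrative choices, chosen so the true risks can be computed in closed form): repeat ERM on many training sets and check how far $R(\hat{h}_n)$ falls above $R(h^{\star})$.

import numpy as np

rng = np.random.default_rng(4)

# toy setup (illustrative): x ~ Uniform[0,1], y = 1{x > 0.6} with labels flipped
# with probability 0.1; hypotheses are thresholds h_t(x) = 1{x > t}
thresholds = np.linspace(0.0, 1.0, 101)
true_risk = 0.1 + 0.8 * np.abs(thresholds - 0.6)   # R(h_t) in closed form
best = true_risk.min()                             # R(h*)

n, trials, excess = 200, 500, []
for _ in range(trials):
    x = rng.random(n)
    y = ((x > 0.6) ^ (rng.random(n) < 0.1)).astype(int)
    emp = np.array([np.mean((x > t).astype(int) != y) for t in thresholds])
    t_hat = thresholds[int(np.argmin(emp))]        # the ERM pick on this training set
    excess.append(0.1 + 0.8 * abs(t_hat - 0.6) - best)

print(np.mean(excess), np.max(excess))             # average and worst excess risk observed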
Of course, the trick in doing a good job of learning is to ensure that $R(h^{\star})$ is actually small…
unfortunately…
Fundamental tradeoff
Making $R(h^{\star})$ small generally requires a larger hypothesis class, but more hypotheses ultimately sacrifices our guarantee that $\hat{R}_n(h) \approx R(h)$ for all $h \in \mathcal{H}$, which causes the whole argument to break down
[Figure: error as a function of the size of the hypothesis class $\mathcal{H}$]