02 First Model of Learning

The document discusses the relationship between probability, learning, and hypothesis testing, emphasizing the importance of empirical risk minimization (ERM) in estimating functions. It outlines how the law of large numbers can help verify hypotheses, while also addressing challenges posed by multiple hypotheses and the need for confidence bounds. The text concludes with a discussion on the tradeoff between the richness of hypothesis sets and the guarantees of learning performance.

Uploaded by Mark Davenport

Another approach

You watch me do this trick a couple times and notice I always hand out 5 cards

Suppose you instead consider

Now, can you learn a function $h$ such that $h(x)$ is a reliable predictor of $y$?


Probability to the rescue!
Any $h$ agreeing with the training data may be possible,
but that does not mean that any such $h$ is equally probable

A short digression
• Suppose that Javier has a biased coin, which lands on heads with some unknown
probability $p$


• Javier tosses the coin $n$ times, observing a fraction of heads $\hat{p}_n$

Does $\hat{p}_n$ tell us anything about $p$?


What can we learn from $\hat{p}_n$?
Given enough tosses (large $n$), we expect that $\hat{p}_n \approx p$

Law of large numbers

$P\left(|\hat{p}_n - p| > \epsilon\right) \to 0$ as $n \to \infty$ (for any $\epsilon > 0$)

Clearly, at least in a very limited sense, we can learn something about $p$ from
observations

There is always the possibility that we are totally wrong, but given enough data,
the probability should be very small
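The law of large numbers above is easy to sanity-check with a quick simulation (a minimal sketch; the bias $p = 0.3$ and the toss counts are arbitrary example values):

```python
import random

def estimate_bias(p, n, seed=0):
    """Toss a coin with heads-probability p a total of n times and
    return the observed fraction of heads (the estimate of p)."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p for _ in range(n))
    return heads / n

# As n grows, the estimate concentrates around the true bias p
p = 0.3
for n in [10, 100, 10_000, 1_000_000]:
    print(n, estimate_bias(p, n))
```

For small $n$ the estimate can be badly off, but the deviations shrink as $n$ grows.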
Connection to learning
Coin tosses: We want to estimate $p$ (i.e., predict how likely a “heads” is)
Learning: We want to estimate a function $f$

Suppose we have a hypothesis $h$ and that the label space $\mathcal{Y}$ is discrete


Think of the $(x_i, y_i)$ as a series of independent coin tosses, where the $(x_i, y_i)$ are
drawn from a probability distribution $P_{XY}$
– heads: our hypothesis is correct, i.e., $h(x_i) = y_i$
– tails: our hypothesis is wrong, i.e., $h(x_i) \neq y_i$

Define
(Population) risk: $R(h) = P(h(x) \neq y)$

Empirical risk: $\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{h(x_i) \neq y_i\}$
Trust, but verify
The law of large numbers guarantees that as long as we have enough data, we will
have that $\hat{R}_n(h) \approx R(h)$

This means that we can use $\hat{R}_n(h)$ to verify whether $h$ was a good hypothesis

Unfortunately, verification is not learning

• Where did $h$ come from?


• What if $R(h)$ is large?
• How do we know if $h = f$, or at least, if $h \approx f$?
• Given many possible hypotheses, how can we pick a good one?
From coins to learning
Consider an ensemble of many hypotheses $\mathcal{H}$

If we fix a hypothesis $h$ before drawing our data, then the law of large numbers
tells us that $\hat{R}_n(h) \approx R(h)$ for large $n$

However, it is also true that for a fixed $n$, if $|\mathcal{H}|$ is large it can still be very likely
that there is some hypothesis $h \in \mathcal{H}$ for which $\hat{R}_n(h)$ is still very far from $R(h)$
Example
Question 1: If I toss a fair coin 10 times, what is the
probability that I get 10 heads?

Question 2: If I toss 1000 fair coins 10 times each, what is
the probability that some coin will get 10 heads?

This illustrates the fundamental challenge of multiple hypothesis testing
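The two probabilities can be computed directly (a quick check; the numbers 10 and 1000 come from the questions above):

```python
# Question 1: a single fair coin gives 10 heads in 10 tosses
p_single = 0.5 ** 10                   # = 1/1024, about 0.1%

# Question 2: at least one of 1000 independent fair coins does
p_any = 1 - (1 - p_single) ** 1000     # about 62%

print(f"one coin:  {p_single:.5f}")
print(f"some coin: {p_any:.3f}")
```

An outcome that is wildly improbable for any single coin becomes more likely than not once enough coins are in play.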


…and back to learning
If we have many hypotheses (large $|\mathcal{H}|$), then

even though for any fixed hypothesis $h$ it is likely that $\hat{R}_n(h) \approx R(h)$,

it is also likely that there will be at least one hypothesis $h \in \mathcal{H}$ where $\hat{R}_n(h)$ is very
different from $R(h)$

Can we adapt our approach to handle many hypotheses?


A first model of learning
Let’s restrict our attention to binary classification
– our labels belong to $\{0, 1\}$ (or $\{-1, +1\}$)

We observe the data $(x_1, y_1), \ldots, (x_n, y_n)$

where each $(x_i, y_i)$ is drawn independently from a common (unknown) distribution $P_{XY}$

Suppose we are given a finite list of possible hypotheses $\mathcal{H} = \{h_1, \ldots, h_M\}$, where $M = |\mathcal{H}|$

From the training data, we would like to select the best possible hypothesis
from $\mathcal{H}$
Example
Empirical risk
Recall our definition of risk and its empirical counterpart

Risk: $R(h) = P(h(x) \neq y)$

Empirical risk: $\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{h(x_i) \neq y_i\}$

The empirical risk $\hat{R}_n(h)$ gives us an estimate of the true risk $R(h)$, and from
the law of large numbers we know that $\hat{R}_n(h) \to R(h)$ as $n \to \infty$

We should be able to use the empirical risk to choose a good hypothesis


Empirical risk minimization (ERM)
We want to choose a hypothesis from $\mathcal{H}$ that achieves a small risk $R(h)$

Since $\hat{R}_n(h)$ is supposed to be a good estimate of $R(h)$, an incredibly natural


(and common) strategy is to pick

$\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)$

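As a concrete toy illustration of ERM, here is a minimal sketch over a small finite class of threshold classifiers; the class, data, and helper names are illustrative choices, not from the slides:

```python
def empirical_risk(h, data):
    """Fraction of examples in data that hypothesis h misclassifies."""
    return sum(h(x) != y for x, y in data) / len(data)

def erm(hypotheses, data):
    """Return the hypothesis minimizing empirical risk on data."""
    return min(hypotheses, key=lambda h: empirical_risk(h, data))

# Toy data: the label is 1 exactly when x >= 0.5
data = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]

# Finite hypothesis class: threshold classifiers h_t(x) = 1{x >= t}
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]

h_hat = erm(hypotheses, data)
print(empirical_risk(h_hat, data))  # 0.0 -- the threshold 0.5 fits perfectly
```

ERM simply scans the class and keeps whichever hypothesis makes the fewest training mistakes.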
The risk in ERM
As long as we have enough data, for any particular hypothesis $h$, we expect
$\hat{R}_n(h) \approx R(h)$

However, if $|\mathcal{H}|$ is very large, then we can also expect that there are some $h$ for
which $\hat{R}_n(h)$ is very far from $R(h)$

Thus, what can we say about $R(\hat{h})$?


• We know that $\hat{R}_n(\hat{h})$ is as small as it can be
– this could be because $R(\hat{h})$ is small
– or, it could be because $\hat{R}_n(h) \ll R(h)$ for some $h$
• Which explanation is more likely?
– it depends… just how large is $|\mathcal{H}|$?
Confidence bounds
One way to provide guarantees for the ERM approach is to set $n$ and $\epsilon$ such that

$|\hat{R}_n(h) - R(h)| \leq \epsilon$

for all $h \in \mathcal{H}$ (and for some suitably small choice of $\epsilon$)

Of course, we can never guarantee that this holds, so instead we will be concerned
with the probability that $|\hat{R}_n(h) - R(h)| > \epsilon$

(figure: distribution of $\hat{R}_n(h)$)
Too much randomness?
Ultimately, we will want to show something like

$P\left(|\hat{R}_n(h) - R(h)| > \epsilon\right) \leq \delta$

for all $h \in \mathcal{H}$

What is random here?


– the training data $(x_1, y_1), \ldots, (x_n, y_n)$
– $\hat{R}_n(h)$, because each $\hat{R}_n(h)$ depends on the training data
– $\hat{h}$, because it depends on $\hat{R}_n$

In order to tease all of this apart, let’s begin by going back to just a single
hypothesis $h$ and studying $P\left(|\hat{R}_n(h) - R(h)| > \epsilon\right)$
Bounding the error
We want to calculate $P\left(|\hat{R}_n(h) - R(h)| > \epsilon\right)$

Note that $\hat{R}_n(h)$ is a random variable


– we can write $\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} Z_i$, where the $Z_i = \mathbb{1}\{h(x_i) \neq y_i\}$ are Bernoulli random variables
– thus, $n \hat{R}_n(h)$ is a Binomial random variable
– since $E[Z_i] = R(h)$, we have that $E[\hat{R}_n(h)] = R(h)$
Deviation from the mean
Thus, an equivalent way to think about our problem is that we would like to
calculate $P\left(|\hat{R}_n(h) - R(h)| \leq \epsilon\right)$

and this is just asking about the probability that a Binomial random variable will be
within $n\epsilon$ of its mean

If $F$ represents the cumulative distribution function (CDF) of our binomial


random variable $n \hat{R}_n(h)$, then we can write
$P\left(|\hat{R}_n(h) - R(h)| \leq \epsilon\right) \approx F\big(n(R(h) + \epsilon)\big) - F\big(n(R(h) - \epsilon)\big)$
Bounding the deviation
Unfortunately, the CDF we are interested in is given by

$F(k) = \sum_{j=0}^{\lfloor k \rfloor} \binom{n}{j} R(h)^j \big(1 - R(h)\big)^{n-j}$

This has no nice closed-form expression, is rather unwieldy to work with, and
doesn’t give us much intuition

Instead of calculating the probability exactly, it is enough to get a good bound of


the form
$P\left(|\hat{R}_n(h) - R(h)| > \epsilon\right) \leq \delta$
or equivalently
$P\left(|\hat{R}_n(h) - R(h)| \leq \epsilon\right) \geq 1 - \delta$
Concentration inequalities
An inequality of the form
$P\left(|X - E[X]| > \epsilon\right) \leq \delta$
tells us how a particular random variable (in this case $X = \hat{R}_n(h)$) concentrates


around its mean

There are many different concentration inequalities that give us various bounds
along these lines

We will start with a very simple one, and then build up to a stronger result
Markov’s inequality
The simplest of these results is Markov’s inequality

Let $X$ be any nonnegative random variable.


Then for any $t > 0$,
$P(X \geq t) \leq \frac{E[X]}{t}$

This is cool on its own, but can be leveraged to say even


more since for any strictly monotonically increasing and
nonnegative-valued function $\varphi$,
$P(X \geq t) = P\big(\varphi(X) \geq \varphi(t)\big) \leq \frac{E[\varphi(X)]}{\varphi(t)}$
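Markov's inequality is easy to check numerically; this sketch uses an exponential random variable (an arbitrary nonnegative choice with $E[X] = 1$):

```python
import random

rng = random.Random(1)
# X nonnegative: exponential with E[X] = 1
samples = [rng.expovariate(1.0) for _ in range(200_000)]
mean = sum(samples) / len(samples)

for t in [1.0, 2.0, 4.0]:
    tail = sum(x >= t for x in samples) / len(samples)
    # Markov: P(X >= t) <= E[X] / t; the true tail e^{-t} is much smaller
    print(f"t={t}: tail {tail:.4f} <= bound {mean / t:.4f}")
```

The bound always holds, but it is loose, which is why we will keep pushing toward stronger inequalities.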
Chebyshev’s inequality
As an example, Chebyshev’s inequality
states that for any random variable $X$ and any $t > 0$,
$P\left(|X - E[X]| \geq t\right) \leq \frac{\mathrm{Var}(X)}{t^2}$

Proof.
Note that $(X - E[X])^2$ is a nonnegative random
variable. Thus we can apply Markov’s inequality to obtain
$P\left(|X - E[X]| \geq t\right) = P\left((X - E[X])^2 \geq t^2\right) \leq \frac{E\left[(X - E[X])^2\right]}{t^2} = \frac{\mathrm{Var}(X)}{t^2}$
Proof of Markov (Part 1)
There is a simple proof of Markov if you know the (super useful!) fact that for any
nonnegative random variable $X$,
$E[X] = \int_0^\infty P(X \geq t)\, dt$

Proof. We can write
$E[X] = E\left[\int_0^\infty \mathbb{1}\{X \geq t\}\, dt\right]$


where we have used the fact that $X = \int_0^X dt = \int_0^\infty \mathbb{1}\{X \geq t\}\, dt$

Thus $E[X] = \int_0^\infty E\left[\mathbb{1}\{X \geq t\}\right] dt = \int_0^\infty P(X \geq t)\, dt$
Proof of Markov (Part 2)
We can visualize this result as
(figure: the area under the curve $P(X \geq s)$ contains the rectangle of height $P(X \geq t)$ over $[0, t]$)

Thus, we can immediately see that we must have
$E[X] = \int_0^\infty P(X \geq s)\, ds \geq \int_0^t P(X \geq s)\, ds \geq t \cdot P(X \geq t)$

and hence $P(X \geq t) \leq \frac{E[X]}{t}$
Hoeffding’s inequality
Chebyshev’s inequality gives us the kind of result we are after, but it is too loose
to be of practical use

Hoeffding’s inequality assumes a bit more about our random variables beyond
having finite variance, but gets us a much tighter and more useful result:

Let $Z_1, \ldots, Z_n$ be independent bounded random variables, i.e., random variables


such that $Z_i \in [a_i, b_i]$ with probability one, for all $i$

Let $S_n = \sum_{i=1}^{n} Z_i$. Then for any $t > 0$, we have

$P\left(|S_n - E[S_n]| \geq t\right) \leq 2 \exp\left(\frac{-2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)$
Chernoff’s bounding method
To prove this result, we will use a similar approach as in Chebyshev’s inequality
To begin, consider only the upper tail inequality: for any $s > 0$,

$P\left(S_n - E[S_n] \geq t\right) = P\left(e^{s(S_n - E[S_n])} \geq e^{st}\right) \leq e^{-st}\, E\left[e^{s(S_n - E[S_n])}\right]$ (Markov)

$= e^{-st} \prod_{i=1}^{n} E\left[e^{s(Z_i - E[Z_i])}\right]$ (Independence)
Hoeffding’s Lemma
It is not obvious, but also not too hard to show, that
$E\left[e^{s(Z_i - E[Z_i])}\right] \leq e^{s^2 (b_i - a_i)^2 / 8}$

(proof uses convexity and then gets a bound using a Taylor series expansion)

Plugging this in, we obtain that for any $s > 0$, we have
$P\left(S_n - E[S_n] \geq t\right) \leq \exp\left(-st + \frac{s^2}{8} \sum_{i=1}^{n}(b_i - a_i)^2\right)$

By setting $s = \frac{4t}{\sum_{i=1}^{n}(b_i - a_i)^2}$, we have
$P\left(S_n - E[S_n] \geq t\right) \leq \exp\left(\frac{-2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)$
Putting it all together
Thus, we have proven that
$P\left(S_n - E[S_n] \geq t\right) \leq \exp\left(\frac{-2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)$

An analogous argument proves
$P\left(S_n - E[S_n] \leq -t\right) \leq \exp\left(\frac{-2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)$

Combined, these give
$P\left(|S_n - E[S_n]| \geq t\right) \leq 2 \exp\left(\frac{-2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)$

Special case: Binomials
If the $Z_i$ are Bernoulli random variables, then $S_n$ is a Binomial random variable
and Hoeffding’s inequality becomes
$P\left(|S_n - E[S_n]| \geq t\right) \leq 2 e^{-2t^2/n}$

Finally going back to our original problem, this means that Hoeffding yields the
bound
$P\left(|\hat{R}_n(h) - R(h)| > \epsilon\right) \leq 2 e^{-2n\epsilon^2}$
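For the Binomial case we can compare Hoeffding's bound with the exact tail probability (a sketch; $n$, $p = R(h)$, and $\epsilon$ are arbitrary example values):

```python
import math

def exact_tail(n, p, eps):
    """P(|S_n/n - p| > eps) for S_n ~ Binomial(n, p), summed exactly."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if abs(k / n - p) > eps)

n, p, eps = 100, 0.3, 0.1
exact = exact_tail(n, p, eps)
hoeffding = 2 * math.exp(-2 * n * eps**2)
print(f"exact {exact:.4f} <= Hoeffding {hoeffding:.4f}")
```

Hoeffding is looser than the exact probability, but unlike the exact sum it is a simple closed form that we can manipulate analytically.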
Multiple hypotheses
Thus, after much effort, we have that for a particular hypothesis $h$,
$P\left(|\hat{R}_n(h) - R(h)| > \epsilon\right) \leq 2 e^{-2n\epsilon^2}$

However, we are ultimately interested in $\hat{h}$, not just a single hypothesis

One way to argue that $\hat{R}_n(\hat{h}) \approx R(\hat{h})$ is to ensure that
$|\hat{R}_n(h) - R(h)| \leq \epsilon$


simultaneously for all $h \in \mathcal{H}$

Equivalently, we can try to bound the probability that any hypothesis has an
empirical risk that deviates from its mean by more than $\epsilon$
Formal statement
We can express this mathematically as
$P\left(\exists\, h \in \mathcal{H} : |\hat{R}_n(h) - R(h)| > \epsilon\right)$

We can bound this using something called the union bound


Union bound
Union bound: For any sequence of events $A_1, A_2, \ldots$,
$P\left(\bigcup_i A_i\right) \leq \sum_i P(A_i)$

The events in our case are given by $A_h = \left\{ |\hat{R}_n(h) - R(h)| > \epsilon \right\}$ for $h \in \mathcal{H}$


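The union bound step can be checked empirically; this sketch simulates the deviation events $A_h$ for a handful of independent "coins" standing in for hypotheses (all parameter values here are arbitrary):

```python
import random

rng = random.Random(0)
n, eps, p = 50, 0.15, 0.5   # tosses per coin, deviation size, true bias
M, trials = 5, 20_000       # number of "hypotheses", repetitions

union_count, single_counts = 0, [0] * M
for _ in range(trials):
    # deviated[j] indicates the event A_j: |p_hat_j - p| > eps
    deviated = [abs(sum(rng.random() < p for _ in range(n)) / n - p) > eps
                for _ in range(M)]
    union_count += any(deviated)
    for j, d in enumerate(deviated):
        single_counts[j] += d

p_union = union_count / trials
p_sum = sum(single_counts) / trials
print(f"P(union) ~ {p_union:.3f} <= sum of P(A_h) ~ {p_sum:.3f}")
```

The union probability is always at most the sum, and the two are close here because the events are rare and independent; with overlapping events the bound can be much looser.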
Final result
Combining the union bound with Hoeffding’s inequality, we obtain

$P\left(\exists\, h \in \mathcal{H} : |\hat{R}_n(h) - R(h)| > \epsilon\right) \leq \sum_{h \in \mathcal{H}} P\left(|\hat{R}_n(h) - R(h)| > \epsilon\right) \leq 2 |\mathcal{H}|\, e^{-2n\epsilon^2}$
Interpretation
We went through all of this work to show that

$P\left(\exists\, h \in \mathcal{H} : |\hat{R}_n(h) - R(h)| > \epsilon\right) \leq 2 |\mathcal{H}|\, e^{-2n\epsilon^2}$

where the right-hand side is linearly increasing in $|\mathcal{H}|$ but exponentially decreasing in $n$

This suggests that ERM is a reasonable approach as long as $|\mathcal{H}|$ isn’t too big
(i.e., $|\mathcal{H}| \ll e^{2n\epsilon^2}$)

Note that the above is equivalent to the statement that with probability at
least $1 - \delta$, where $\delta = 2 |\mathcal{H}|\, e^{-2n\epsilon^2}$,
$|\hat{R}_n(h) - R(h)| \leq \sqrt{\frac{1}{2n} \log \frac{2|\mathcal{H}|}{\delta}}$ for all $h \in \mathcal{H}$
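Rearranging the final bound also tells us how much data we need for a target accuracy $\epsilon$ and confidence $\delta$; the helper below and its example values are illustrative:

```python
import math

def sample_complexity(num_hypotheses, eps, delta):
    """Smallest n with 2 |H| exp(-2 n eps^2) <= delta,
    i.e. n >= log(2 |H| / delta) / (2 eps^2)."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps**2))

n = sample_complexity(1000, eps=0.05, delta=0.05)
print(n)  # a few thousand samples suffice for |H| = 1000
```

Because $|\mathcal{H}|$ enters only through a logarithm, even a large hypothesis class costs relatively little extra data.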
Bounding the excess risk
Note that we would ideally actually like to choose $h^* = \arg\min_{h \in \mathcal{H}} R(h)$

We can also relate the performance of $\hat{h}$ to $h^*$:

We have already shown that with probability at least $1 - \delta$,
$|\hat{R}_n(h) - R(h)| \leq \epsilon$ for all $h \in \mathcal{H}$, where $\epsilon = \sqrt{\frac{1}{2n} \log \frac{2|\mathcal{H}|}{\delta}}$

What about $R(\hat{h}) - R(h^*)$?
Bounding the excess risk
We will bound $R(\hat{h}) - R(h^*)$ in two steps…
• $R(\hat{h})$ cannot be too much bigger than $\hat{R}_n(h^*)$:
By the definition of $\hat{h}$, $\hat{R}_n(\hat{h}) \leq \hat{R}_n(h^*)$
From before, we have $R(\hat{h}) \leq \hat{R}_n(\hat{h}) + \epsilon$
Thus $R(\hat{h}) \leq \hat{R}_n(h^*) + \epsilon$

• $\hat{R}_n(h^*)$ cannot be too much bigger than $R(h^*)$:


From before, we have $\hat{R}_n(h^*) \leq R(h^*) + \epsilon$
Thus $R(\hat{h}) \leq R(h^*) + 2\epsilon$
The upshot
Thus, with probability at least $1 - \delta$,
$R(\hat{h}) \leq R(h^*) + 2\sqrt{\frac{1}{2n} \log \frac{2|\mathcal{H}|}{\delta}}$

Bottom line: As long as $|\mathcal{H}|$ isn’t too big ($\log |\mathcal{H}| \ll n$) then we can be reasonably


confident that $R(\hat{h})$ isn’t too much larger than $R(h^*)$

Of course, the trick in doing a good job of learning is to ensure that $R(h^*)$ is actually
small

To achieve this, we need a “rich” set of possible hypotheses…

unfortunately…
Fundamental tradeoff
More hypotheses ultimately sacrifices our guarantee that $\hat{R}_n(h) \approx R(h)$ for all $h \in \mathcal{H}$,
which causes the whole argument to break

(figure: error versus the “richness” of the hypothesis set)
