Lecture Summary

The document discusses Jensen's inequality and uniform convergence. It talks about increasing the hypothesis space to find a good model. It also discusses expectations within expectations when using Rademacher elements. The document covers bounding the expectation using a growth function and the size of the data. It states that the six statements in the theorem are equivalent. It then shifts to discussing what big data means and using samples to handle it for frequent item set mining. It covers deriving sample size lower bounds for PAC learning and guarantees around learning with infinite versus finite hypothesis spaces.


Jensen's inequality

We prove this by uniform convergence:

\forall h \in H: |L_D(h) - L_{\mathcal{D}}(h)| \le \epsilon


This means that increasing the hypothesis space increases the chance of finding a good
model.

Instead of taking a union bound over all hypotheses, we look at the worst case, the
supremum of the deviation over H. The supremum does not have to be attained by any
element. Just like the limit of a function, it is the least upper bound, and it may
never be reached by anything we actually sample.
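
To make this concrete, the quantity being bounded (written in my notation, which may differ slightly from the slides) is the expected worst-case deviation, where L_D is the empirical loss on the sample D and L_{\mathcal{D}} is the true loss on the distribution \mathcal{D}:

  \mathbb{E}_{D \sim \mathcal{D}^m} \Big[ \sup_{h \in H} \big( L_{\mathcal{D}}(h) - L_D(h) \big) \Big]

Bounding this single supremum replaces a union bound over all (possibly infinitely many) hypotheses.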

During the second step, by definition, the value is the average (expectation) of the
differences of the losses on each of the two datasets.

To get rid of the infinity, we observe that the equation involves both a "training" and
a "testing" sample; during AI training we also have a training and a test set.
For the expectation it does not matter if we swap elements between D and D'; both are
samples from the same distribution, so it does not matter from which dataset a point comes.
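
A hedged sketch of this symmetrization step, with D' an independent "ghost" sample of the same size m (the exact form on the slides may differ):

  \mathbb{E}_{D} \Big[ \sup_{h \in H} \big( L_{\mathcal{D}}(h) - L_D(h) \big) \Big] \;\le\; \mathbb{E}_{D, D'} \Big[ \sup_{h \in H} \big( L_{D'}(h) - L_D(h) \big) \Big]

It uses that L_{\mathcal{D}}(h) = \mathbb{E}_{D'}[L_{D'}(h)] and that the supremum of an expectation is at most the expectation of the supremum, which is where Jensen-style reasoning enters.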

Using Rademacher variables we can generate a vector \sigma \in \{-1,1\}^m and observe
more things.
Since the equality holds for any fixed vector of length m, we can swap the expectations
around.
This is interesting: we have an expectation within an expectation.
An expectation is an average; an average is a loop, so a nested expectation is a nested loop.
We may have infinitely many hypotheses, but on the sample only finitely many behaviours
are distinguishable, so we are talking about finite things.
In the finite case the supremum is the same as taking the maximum inside the expectation.
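
A minimal Python sketch (my own illustration, not from the lecture) of "nested expectation is a nested loop": the outer loop averages over random Rademacher vectors sigma, and the inner max plays the role of the supremum over the finitely many distinguishable hypotheses. The argument loss_diff_vectors is a hypothetical list of per-example loss-difference vectors, one per hypothesis.

import random

def estimate_rademacher_term(loss_diff_vectors, m, n_trials=10000):
    # Outer loop: Monte Carlo estimate of the expectation over sigma.
    total = 0.0
    for _ in range(n_trials):
        sigma = [random.choice((-1, 1)) for _ in range(m)]
        # Inner step: maximum over the finite set of hypotheses
        # (the supremum reduced to a max).
        best = max(
            sum(s * v for s, v in zip(sigma, vec)) / m
            for vec in loss_diff_vectors
        )
        total += best
    return total / n_trials

# Toy usage with three hypotheses on m = 4 points:
# print(estimate_rademacher_term([[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]], m=4))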

Step 4 implies that our expectation has a bound.


The size of the data is 2m. Then we know that the size of H restricted to those 2m
points is bounded by the growth function \tau_H(2m) (last week).
Rewriting with the growth function, we then know the expectation has a bound.
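
A hedged sketch of the resulting bound (this is the standard growth-function form; the exact constants in the lecture may differ), with \tau_H the growth function:

  \mathbb{E}_{D, D'} \Big[ \sup_{h \in H} \big( L_{D'}(h) - L_D(h) \big) \Big] \;\le\; c \cdot \sqrt{ \frac{ \log \tau_H(2m) }{ m } }

for some constant c. The right-hand side goes to 0 as m grows, as long as \log \tau_H(2m) grows more slowly than m.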

Step 5, almost done.


We're talking about a non-negative random variable.
Markov's inequality tells us that with probability at least 1-\delta the variable is
less than or equal to a fraction (its expectation divided by \delta).
We only need to show that this fraction can be made as small as we want, which happens
if m is big enough.
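
As a worked reminder of the Markov step (standard inequality, my phrasing): for a non-negative random variable X and any a > 0,

  \Pr[X \ge a] \;\le\; \frac{\mathbb{E}[X]}{a}

Taking a = \mathbb{E}[X]/\delta gives \Pr[X \ge \mathbb{E}[X]/\delta] \le \delta, i.e. with probability at least 1-\delta we have X \le \mathbb{E}[X]/\delta. Combined with the bound on the expectation above, this fraction can be made as small as we like by taking m large enough.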

The theorem that follows from this says that all six statements are equivalent (this is
the fundamental theorem of statistical learning).

-- AFTER BREAK
Conceptual
We started by discussing what big data means
With big data, everything is statistically significant, regardless of how small the difference is.

Using samples is the way to handle big data and learn from it.
We used frequent itemset mining as the example.
However, the mining algorithm does multiple scans over the database; if we don't sample
and the database doesn't fit in RAM, that is very costly.
We then asked ourselves how big our sample should be to learn from it.
By lowering the required frequency threshold on the sample, you can catch the itemsets
that are frequent in reality and reduce the missed ones.
However, there CAN still be an itemset that does not appear as frequent in the sample
but is frequent in reality.
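
A minimal hypothetical sketch of the sampling idea (illustration only; the function and parameter names are my own, not the lecture's): draw a random sample of transactions and mine it with a lowered threshold theta - eps, so that itemsets frequent in the full database are unlikely to be missed, at the cost of some false positives.

import random
from collections import Counter
from itertools import combinations

def sample_frequent_itemsets(transactions, sample_size, theta, eps, max_len=2):
    # Work on a random sample instead of scanning the full database.
    sample = random.sample(transactions, sample_size)
    counts = Counter()
    for t in sample:
        items = sorted(set(t))
        # Brute-force count of all itemsets up to max_len (fine for a sketch).
        for k in range(1, max_len + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    lowered = theta - eps  # lowered relative frequency threshold
    return {s for s, c in counts.items() if c / sample_size >= lowered}
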
Then we noticed that we can phrase itemset mining as a machine learning problem.
Classification is supervised learning.
We derived a lower bound on how big the sample should be.
We call our learning PAC, and then investigated how we can do this PAC learning,
using hypothesis sets etc. to describe how to learn.
It IS doable to learn with finite hypothesis sets (see the bound sketched below).
The no-free-lunch theorem tells us that if you can express too much, you cannot learn.
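
As a reminder of the finite case (standard realizable PAC bound, possibly with different constants than in the lecture): a finite hypothesis set H can be learned from

  m \;\ge\; \frac{1}{\epsilon} \Big( \ln |H| + \ln \frac{1}{\delta} \Big)

samples, so the required sample size grows only logarithmically in |H|.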

If the VC dimension is infinite, then the class cannot be PAC learned.


Today we saw that if the VC dimension is finite, we can in fact DO PAC learning.
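
A hedged sketch of the corresponding sample complexity (the standard form; constants omitted): a class H with finite VC dimension d is agnostically PAC learnable with

  m = O\!\left( \frac{ d + \ln(1/\delta) }{ \epsilon^2 } \right)

samples, and with 1/\epsilon in place of 1/\epsilon^2 in the realizable case (up to logarithmic factors).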

We were still trying to find sample bounds for frequent itemset mining, and what the
guarantees are; this will be Friday.

One remaining question is whether we can learn more with a larger class of hypotheses.
The answer is: not really, which is kind of surprising.
Can we guarantee that we eventually reach an accuracy of alpha? The answer is no.

Another side note: this course teaches that there are requirements on the amount of
data needed in order to learn things.
If we only have a limited number of clients and a great number of nodes, then we will
not be able to learn.
This is the opposite of big data, but it shows the other side of what we learned.
