COMS 4771

Introduction to Machine Learning

Nakul Verma
Announcements

• HW2 due now!

• Project proposal due tomorrow

• Midterm next lecture!

• HW3 posted
Last time…

• Linear Regression

• Parametric vs Nonparametric regression

• Logistic Regression for classification

• Ridge and Lasso Regression

• Kernel Regression

• Consistency of Kernel Regression

• Speeding up nonparametric regression with trees


Towards formalizing ‘learning’

What does it mean to learn a concept?


• Gain knowledge or experience of the concept.

The basic process of learning


• Observe a phenomenon
• Construct a model from observations
• Use that model to make decisions / predictions

How can we make this more precise?


A statistical machinery for learning
Phenomenon of interest:
Input space: X        Output space: Y
There is an unknown distribution D over X × Y.
The learner observes m examples (x_1, y_1), …, (x_m, y_m) drawn i.i.d. from D.

Construct a model: Machine learning

Let F be a collection of models, where each f ∈ F predicts y given x.


From the observations, select a model f ∈ F which predicts well:

        err(f) := P_{(x,y)~D}[ f(x) ≠ y ]        (generalization error of f)

We can say that we have learned the phenomenon if

        err(f) ≤ ε        for any tolerance level ε > 0 of our choice.


PAC Learning
For all tolerance levels ε > 0, and all confidence levels δ > 0, if there exists
some model selection algorithm A that selects f_A ∈ F from m = m(ε, δ) observations,
i.e., A : (x_1, y_1), …, (x_m, y_m) ↦ f_A, and has the property:
with probability at least 1 − δ over the draw of the sample,

        err(f_A) ≤ min_{f ∈ F} err(f) + ε

We call:
• The model class F is PAC-learnable.
• If the sample size m(ε, δ) is polynomial in 1/ε and 1/δ, then F is efficiently PAC-learnable.

A popular algorithm:
Empirical risk minimizer (ERM) algorithm

        f_ERM := argmin_{f ∈ F} (1/m) Σ_{i=1}^m 1[ f(x_i) ≠ y_i ]
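For concreteness, here is a minimal Python sketch of ERM over a small finite model class of one-dimensional threshold classifiers; the class of thresholds, the noisy data-generating rule, and all names are illustrative assumptions rather than anything prescribed by the lecture.

```python
import numpy as np

# A minimal ERM sketch for a finite model class (illustrative example).
# Model class F: threshold classifiers f_t(x) = 1[x >= t] for a finite grid of thresholds t.

def erm(xs, ys, thresholds):
    """Return the threshold with the smallest empirical (sample) error."""
    best_t, best_err = None, np.inf
    for t in thresholds:
        preds = (xs >= t).astype(int)
        sample_err = np.mean(preds != ys)   # (1/m) * sum of 1[f(x_i) != y_i]
        if sample_err < best_err:
            best_t, best_err = t, sample_err
    return best_t, best_err

# Hypothetical data: labels are 1[x >= 0.3] with 10% label noise.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=200)
ys = ((xs >= 0.3).astype(int) ^ (rng.uniform(size=200) < 0.1)).astype(int)

t_hat, err_hat = erm(xs, ys, thresholds=np.linspace(0, 1, 101))
print(f"ERM threshold: {t_hat:.2f}, sample error: {err_hat:.3f}")
```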
PAC learning simple model classes
Theorem (finite size F):
Pick any tolerance level ε > 0, and any confidence level δ > 0;
let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D.

if m ≥ (2/ε²) ( ln|F| + ln(2/δ) ), then with probability at least 1 − δ:

        err(f_ERM) ≤ min_{f ∈ F} err(f) + ε

⇒ a finite-size F is efficiently PAC learnable

Occam’s Razor Principle:


All things being equal, usually the simplest explanation
of a phenomenon is a good hypothesis.

Simplicity = representational succinctness
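To get a feel for the finite-class bound above, the short sketch below plugs numbers into the sample-size requirement m ≥ (2/ε²)(ln|F| + ln(2/δ)); the particular values of |F|, ε, and δ are arbitrary illustrative choices.

```python
import math

def occam_sample_size(num_models, eps, delta):
    """Sample size sufficient for the finite-class (Occam) bound: m >= (2/eps^2)(ln|F| + ln(2/delta))."""
    return math.ceil((2.0 / eps**2) * (math.log(num_models) + math.log(2.0 / delta)))

# Illustrative values only: a class of one million models, 5% tolerance, 99% confidence.
print(occam_sample_size(num_models=10**6, eps=0.05, delta=0.01))
# Note the mild (logarithmic) dependence on |F| and 1/delta, versus the 1/eps^2 dependence.
```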


Proof sketch
Define:

        err(f) := P_{(x,y)~D}[ f(x) ≠ y ]                  (generalization error of f)
        err_S(f) := (1/m) Σ_{i=1}^m 1[ f(x_i) ≠ y_i ]      (sample error of f)

We need to analyze (with f* := argmin_{f ∈ F} err(f)):

        err(f_ERM) − err(f*)
            = [ err(f_ERM) − err_S(f_ERM) ] + [ err_S(f_ERM) − err_S(f*) ] + [ err_S(f*) − err(f*) ]

The middle term is ≤ 0 (f_ERM minimizes the sample error), so it suffices to control the
uniform deviations of the expectation of a random variable from its sample average,
sup_{f ∈ F} | err(f) − err_S(f) |.
Proof sketch
Fix any f ∈ F and a sample S = (x_1, y_1), …, (x_m, y_m); define the random variables
Z_i := 1[ f(x_i) ≠ y_i ].

Then E[Z_i] = err(f) (generalization error of f) and (1/m) Σ_i Z_i = err_S(f) (sample error of f).

Lemma (Chernoff-Hoeffding bound ’63):

Let X_1, …, X_m be m Bernoulli r.v. drawn independently from B(p).
For any tolerance level ε > 0:

        P[ | (1/m) Σ_i X_i − p | > ε ] ≤ 2 e^{−2ε²m}

A classic result in concentration of measure; proof later.
Proof sketch
Need to analyze

        P[ ∃ f ∈ F : | err(f) − err_S(f) | > ε/2 ]  ≤  Σ_{f ∈ F} P[ | err(f) − err_S(f) | > ε/2 ]  ≤  2 |F| e^{−ε²m/2}

(union bound over F, then the Chernoff-Hoeffding bound applied to each f).

Equivalently, by choosing m ≥ (2/ε²) ln(2|F|/δ), with probability at least 1 − δ,

        | err(f) − err_S(f) | ≤ ε/2        for all f ∈ F,

which, combined with the decomposition above, gives err(f_ERM) ≤ min_{f ∈ F} err(f) + ε.
PAC learning simple model classes
Theorem (Occam’s Razor):
Pick any tolerance level ε > 0, and any confidence level δ > 0;
let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D.

if m ≥ (2/ε²) ( ln|F| + ln(2/δ) ), then with probability at least 1 − δ:

        err(f_ERM) ≤ min_{f ∈ F} err(f) + ε

⇒ a finite-size F is efficiently PAC learnable


One thing left…
Still need to prove:

Lemma (Chernoff-Hoeffding bound ’63):

Let X_1, …, X_m be m Bernoulli r.v. drawn independently from B(p).
For any tolerance level ε > 0:

        P[ | (1/m) Σ_i X_i − p | > ε ] ≤ 2 e^{−2ε²m}

This quantifies how the sample average deviates from the true average as a function of the
number of samples (m).

Need to analyze: how the probability measure concentrates around a central value (like the mean).
Detour: Concentration of Measure
Let’s start with something simple:

Let X be a non-negative random variable.


For a given constant c > 0, what is P[ X ≥ c ]?

Markov’s Inequality:        P[ X ≥ c ] ≤ E[X] / c

Why?
Observation:        c · 1[ X ≥ c ] ≤ X
Take expectation on both sides:        c · P[ X ≥ c ] ≤ E[X].
Concentration of Measure
Using Markov to bound deviation from mean…

Let X be a random variable (not necessarily non-negative).


Want to examine:        P[ | X − E[X] | ≥ c ]        for some given constant c > 0

Observation:

        P[ | X − E[X] | ≥ c ] = P[ (X − E[X])² ≥ c² ] ≤ E[ (X − E[X])² ] / c² = Var(X) / c²

(applying Markov’s Inequality to the non-negative random variable (X − E[X])²)

Chebyshev’s Inequality:        P[ | X − E[X] | ≥ c ] ≤ Var(X) / c²
True for all distributions!
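A quick simulation makes these inequalities concrete. The sketch below compares the Markov and Chebyshev bounds with the empirically observed tail probability of an exponential random variable; the choice of distribution, sample size, and thresholds is an arbitrary illustration.

```python
import numpy as np

# Compare Markov / Chebyshev bounds with observed tail probabilities (illustrative only).
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # non-negative, E[X] = 1, Var(X) = 1

for c in (2.0, 4.0, 8.0):
    empirical = np.mean(x >= c)                   # P[X >= c], estimated from samples
    markov = x.mean() / c                         # Markov: E[X] / c
    chebyshev = x.var() / (c - x.mean())**2       # Chebyshev applied to P[|X - EX| >= c - EX]
    print(f"c={c}: empirical={empirical:.4f}  Markov<= {markov:.4f}  Chebyshev<= {chebyshev:.4f}")
```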
Concentration of Measure
Sharper estimates using an exponential!

Let X be a random variable (not necessarily non-negative).


For some given constant c > 0, examine P[ X − E[X] ≥ c ].

Observation: for any t > 0,

        P[ X − E[X] ≥ c ] = P[ e^{t(X − E[X])} ≥ e^{tc} ] ≤ e^{−tc} · E[ e^{t(X − E[X])} ]

(by Markov’s Inequality applied to the non-negative random variable e^{t(X − E[X])});
then optimize the bound over t > 0.

This is called Chernoff’s bounding method.
Concentration of Measure
Now, given X_1, …, X_m i.i.d. random variables (assume 0 ≤ X_i ≤ 1),

define Y_i := X_i − E[X_i].

By Chernoff’s bounding technique, for any t > 0:

        P[ (1/m) Σ_i Y_i ≥ ε ] ≤ e^{−tmε} · E[ e^{t Σ_i Y_i} ] = e^{−tmε} · Π_i E[ e^{t Y_i} ]        (Y_i i.i.d.)

Each Y_i is bounded and has mean zero, so E[ e^{t Y_i} ] ≤ e^{t²/8} (Hoeffding’s lemma);
optimizing over t (take t = 4ε) gives

        P[ (1/m) Σ_i X_i − E[X_1] ≥ ε ] ≤ e^{−2ε²m}

Combined with the symmetric deviation, this implies the Chernoff-Hoeffding bound!
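As a sanity check on the bound just derived, the sketch below estimates the deviation probability for Bernoulli samples by simulation and compares it with 2·exp(−2ε²m); the parameters p, m, ε, and the number of trials are arbitrary illustrative values.

```python
import numpy as np

# Empirical check of the Chernoff-Hoeffding bound (illustrative parameter choices).
rng = np.random.default_rng(1)
p, m, eps, trials = 0.3, 200, 0.1, 100_000

samples = rng.binomial(1, p, size=(trials, m))        # trials x m Bernoulli(p) draws
deviations = np.abs(samples.mean(axis=1) - p)         # |(1/m) sum X_i - p| per trial
empirical = np.mean(deviations > eps)                 # P[ deviation > eps ], estimated
bound = 2 * np.exp(-2 * eps**2 * m)                   # Chernoff-Hoeffding bound

print(f"empirical deviation prob: {empirical:.4f}   Hoeffding bound: {bound:.4f}")
```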


Back to Learning Theory!
Theorem (Occam’s Razor):
Pick any tolerance level ε > 0, and any confidence level δ > 0;
let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D.

if m ≥ (2/ε²) ( ln|F| + ln(2/δ) ), then with probability at least 1 − δ:

        err(f_ERM) ≤ min_{f ∈ F} err(f) + ε

⇒ a finite-size F is efficiently PAC learnable


Learning general concepts
Consider linear classification: F = { x ↦ sign(w·x + b) }

This class contains infinitely many models, so ln|F| = ∞ and the
Occam’s Razor bound is ineffective.


VC Theory
Need to capture the true richness of F.

Definition (Vapnik-Chervonenkis or VC dimension):

We say that a model class F has VC dimension d if d is the size of the largest set of
points x_1, …, x_d such that for all 2^d possible labellings of x_1, …, x_d
there exists some f ∈ F that achieves that labelling.

Example: F = linear classifiers in R²  (VC dimension = 3)


VC Dimension

Another example:
F = axis-aligned rectangles in R²

Four points in general position can be shattered, but for any five points, consider
labelling the left-, right-, top- and bottom-most points positive and a remaining point
negative: the class of rectangles cannot realize this labelling, since any rectangle
containing the four extremal points also contains the fifth. So the VC dimension is 4.
(A small brute-force check of this appears below.)

VC dimension:
• A combinatorial concept to capture the true richness of F.
• Often (but not always!) proportional to the degrees-of-freedom, or
the number of independent parameters, in F.
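The sketch below is a small brute-force check of the rectangle example, assuming (as above) that the rectangles are axis-aligned; the specific point sets and helper names are illustrative.

```python
from itertools import product

def rect_realizes(points, labels):
    """Can some axis-aligned rectangle contain exactly the points labelled 1?"""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True  # a degenerate (empty) rectangle realizes the all-negative labelling
    xs, ys = [p[0] for p in pos], [p[1] for p in pos]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    # The bounding box of the positives is the tightest candidate rectangle;
    # the labelling is realizable iff it excludes every negative point.
    return all(not (lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y)
               for p, y in zip(points, labels) if y == 0)

def shattered(points):
    """True if every one of the 2^n labellings of the points is realizable."""
    return all(rect_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

# Four 'diamond' points can be shattered; adding a centre point breaks shattering.
four = [(0, 1), (0, -1), (1, 0), (-1, 0)]
five = four + [(0, 0)]
print(shattered(four), shattered(five))   # expected: True False
```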
VC Theorem
Theorem (Vapnik-Chervonenkis ’71):
Pick any tolerance level ε > 0, and any confidence level δ > 0;
let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D.

if m ≥ (C/ε²) ( d ln(1/ε) + ln(1/δ) ), where d is the VC dimension of F and C is an
absolute constant, then with probability at least 1 − δ:

        err(f_ERM) ≤ min_{f ∈ F} err(f) + ε

⇒ any F with finite VC dimension is efficiently PAC learnable

VC Theorem ⇒ Occam’s Razor Theorem  (for finite F, the VC dimension is at most log₂|F|)
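To see how the VC bound scales, the sketch below evaluates the sample-size expression for a few VC dimensions; the constant C = 8 and the particular ε, δ are placeholders, since the theorem only determines the constant up to a universal factor.

```python
import math

def vc_sample_size(d, eps, delta, C=8.0):
    """Sample size suggested by the VC bound, up to the unspecified absolute constant C."""
    return math.ceil((C / eps**2) * (d * math.log(1.0 / eps) + math.log(1.0 / delta)))

# Illustrative scaling: how m grows with the VC dimension d at fixed eps, delta.
for d in (3, 10, 100):   # e.g. d = 3 for linear classifiers in R^2
    print(d, vc_sample_size(d, eps=0.05, delta=0.01))
```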


Tightness of VC bound
Theorem (VC lower bound):
Let A be any model selection algorithm that, given m samples, returns a
model from F, that is, A : (x_1, y_1), …, (x_m, y_m) ↦ f_A ∈ F.
For all tolerance levels ε > 0, and all confidence levels δ > 0,
there exists a distribution D such that if m ≤ c ( d + ln(1/δ) ) / ε² (for a small
absolute constant c), then with probability at least δ, err(f_A) > min_{f ∈ F} err(f) + ε.
Some implications

• VC dimension of a model class fully characterizes its learning ability!

• Results are agnostic to the underlying distribution.


One algorithm to rule them all?

From our discussion it may seem that the ERM algorithm is universally consistent.

This is not the case!

Theorem (no free lunch, Devroye ‘82):


Pick any sample size m, any algorithm A, and any ε > 0.
There exists a distribution D such that

        E[ err(f_A) ] ≥ 1/2 − ε,

while the Bayes optimal error is 0.


Further refinements and extensions

• How to do model class selection? Structural risk minimization results.

• Dealing with kernels – Fat margin theory

• Incorporating priors over the models – PAC-Bayes theory

• Is it possible to get distribution-dependent bounds? Rademacher complexity

• How about regression? Can derive similar results for nonparametric regression.
What We Learned…

• Formalizing learning

• PAC learnability

• Occam’s razor Theorem

• VC dimension and the VC theorem

• No Free-lunch theorem
Questions?
Next time…

Midterm!

Unsupervised learning.
Announcements

• Project proposal due tomorrow

• Midterm next lecture!

• HW3 posted
