COMS 4771

Introduction to Machine Learning

Nakul Verma
Announcements

• HW2 due now!

• Project proposal due tomorrow

• Midterm next lecture!

• HW3 posted
Last time…

• Linear Regression

• Parametric vs Nonparametric regression

• Logistic Regression for classification

• Ridge and Lasso Regression

• Kernel Regression

• Consistency of Kernel Regression

• Speeding up nonparametric regression with trees


Towards formalizing ‘learning’

What does it mean to learn a concept?


• Gain knowledge or experience of the concept.

The basic process of learning


• Observe a phenomenon
• Construct a model from observations
• Use that model to make decisions / predictions

How can we make this more precise?


A statistical machinery for learning
Phenomenon of interest:
Input space: X        Output space: Y
There is an unknown distribution D over X × Y.
The learner observes m examples (x_1, y_1), …, (x_m, y_m) drawn i.i.d. from D.

Construct a model: Machine learning

Let F be a collection of models, where each f ∈ F predicts y given x.


From the observations, select a model f ∈ F which predicts well:

        err(f) := P_{(x,y)~D}[ f(x) ≠ y ]        (generalization error of f)

We can say that we have learned the phenomenon if

        err(f) ≤ ε        for any tolerance level ε > 0 of our choice.


PAC Learning
For all tolerance levels ε > 0, and all confidence levels δ > 0, if there exists
some model selection algorithm A that selects f_A ∈ F from m = m(ε, δ) observations,
i.e., A : (x_1, y_1), …, (x_m, y_m) ↦ f_A, and has the property:
with probability at least 1 − δ over the draw of the sample,

        err(f_A) ≤ min_{f ∈ F} err(f) + ε

We call:
• The model class F is PAC-learnable.
• If the sample size m(ε, δ) is polynomial in 1/ε and 1/δ, then F is efficiently PAC-learnable.

A popular algorithm:
Empirical risk minimizer (ERM) algorithm

        f_ERM := argmin_{f ∈ F} (1/m) Σ_{i=1}^m 1[ f(x_i) ≠ y_i ]
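For concreteness, here is a minimal Python sketch of ERM over a small finite model class of one-dimensional threshold classifiers; the class of thresholds, the noisy data-generating rule, and all names are illustrative assumptions rather than anything prescribed by the lecture.

```python
import numpy as np

# A minimal ERM sketch for a finite model class (illustrative example).
# Model class F: threshold classifiers f_t(x) = 1[x >= t] for a finite grid of thresholds t.

def erm(xs, ys, thresholds):
    """Return the threshold with the smallest empirical (sample) error."""
    best_t, best_err = None, np.inf
    for t in thresholds:
        preds = (xs >= t).astype(int)
        sample_err = np.mean(preds != ys)   # (1/m) * sum of 1[f(x_i) != y_i]
        if sample_err < best_err:
            best_t, best_err = t, sample_err
    return best_t, best_err

# Hypothetical data: labels are 1[x >= 0.3] with 10% label noise.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=200)
ys = ((xs >= 0.3).astype(int) ^ (rng.uniform(size=200) < 0.1)).astype(int)

t_hat, err_hat = erm(xs, ys, thresholds=np.linspace(0, 1, 101))
print(f"ERM threshold: {t_hat:.2f}, sample error: {err_hat:.3f}")
```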
PAC learning simple model classes
Theorem (finite size F):
Pick any tolerance level ε > 0, and any confidence level δ > 0;
let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D.

if m ≥ (2/ε²) ( ln|F| + ln(2/δ) ), then with probability at least 1 − δ:

        err(f_ERM) ≤ min_{f ∈ F} err(f) + ε

⇒ a finite-size F is efficiently PAC learnable

Occam’s Razor Principle:


All things being equal, usually the simplest explanation
of a phenomenon is a good hypothesis.

Simplicity = representational succinctness
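To get a feel for the finite-class bound above, the short sketch below plugs numbers into the sample-size requirement m ≥ (2/ε²)(ln|F| + ln(2/δ)); the particular values of |F|, ε, and δ are arbitrary illustrative choices.

```python
import math

def occam_sample_size(num_models, eps, delta):
    """Sample size sufficient for the finite-class (Occam) bound: m >= (2/eps^2)(ln|F| + ln(2/delta))."""
    return math.ceil((2.0 / eps**2) * (math.log(num_models) + math.log(2.0 / delta)))

# Illustrative values only: a class of one million models, 5% tolerance, 99% confidence.
print(occam_sample_size(num_models=10**6, eps=0.05, delta=0.01))
# Note the mild (logarithmic) dependence on |F| and 1/delta, versus the 1/eps^2 dependence.
```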


Proof sketch
Define:

        err(f) := P_{(x,y)~D}[ f(x) ≠ y ]                  (generalization error of f)
        err_S(f) := (1/m) Σ_{i=1}^m 1[ f(x_i) ≠ y_i ]      (sample error of f)

We need to analyze (with f* := argmin_{f ∈ F} err(f)):

        err(f_ERM) − err(f*)
            = [ err(f_ERM) − err_S(f_ERM) ] + [ err_S(f_ERM) − err_S(f*) ] + [ err_S(f*) − err(f*) ]

The middle term is ≤ 0 (f_ERM minimizes the sample error), so it suffices to control the
uniform deviations of the expectation of a random variable from its sample average,
sup_{f ∈ F} | err(f) − err_S(f) |.
Proof sketch
Fix any f ∈ F and a sample S = (x_1, y_1), …, (x_m, y_m); define the random variables
Z_i := 1[ f(x_i) ≠ y_i ].

Then E[Z_i] = err(f) (generalization error of f) and (1/m) Σ_i Z_i = err_S(f) (sample error of f).

Lemma (Chernoff-Hoeffding bound ’63):

Let X_1, …, X_m be m Bernoulli r.v. drawn independently from B(p).
For any tolerance level ε > 0:

        P[ | (1/m) Σ_i X_i − p | > ε ] ≤ 2 e^{−2ε²m}

A classic result in concentration of measure; proof later.
Proof sketch
Need to analyze

        P[ ∃ f ∈ F : | err(f) − err_S(f) | > ε/2 ]  ≤  Σ_{f ∈ F} P[ | err(f) − err_S(f) | > ε/2 ]  ≤  2 |F| e^{−ε²m/2}

(union bound over F, then the Chernoff-Hoeffding bound applied to each f).

Equivalently, by choosing m ≥ (2/ε²) ln(2|F|/δ), with probability at least 1 − δ,

        | err(f) − err_S(f) | ≤ ε/2        for all f ∈ F,

which, combined with the decomposition above, gives err(f_ERM) ≤ min_{f ∈ F} err(f) + ε.
PAC learning simple model classes
Theorem (Occam’s Razor):
Pick any tolerance level ε > 0, and any confidence level δ > 0;
let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D.

if m ≥ (2/ε²) ( ln|F| + ln(2/δ) ), then with probability at least 1 − δ:

        err(f_ERM) ≤ min_{f ∈ F} err(f) + ε

⇒ a finite-size F is efficiently PAC learnable


One thing left…
Still need to prove:

Lemma (Chernoff-Hoeffding bound ’63):

Let X_1, …, X_m be m Bernoulli r.v. drawn independently from B(p).
For any tolerance level ε > 0:

        P[ | (1/m) Σ_i X_i − p | > ε ] ≤ 2 e^{−2ε²m}

This quantifies how the sample average deviates from the true average as a function of the
number of samples (m).

Need to analyze: how the probability measure concentrates around a central value (like the mean).
Detour: Concentration of Measure
Let’s start with something simple:

Let X be a non-negative random variable.


For a given constant c > 0, what is P[ X ≥ c ]?

Markov’s Inequality:        P[ X ≥ c ] ≤ E[X] / c

Why?
Observation:        c · 1[ X ≥ c ] ≤ X
Take expectation on both sides:        c · P[ X ≥ c ] ≤ E[X].
Concentration of Measure
Using Markov to bound deviation from mean…

Let X be a random variable (not necessarily non-negative).


Want to examine:        P[ | X − E[X] | ≥ c ]        for some given constant c > 0

Observation:

        P[ | X − E[X] | ≥ c ] = P[ (X − E[X])² ≥ c² ] ≤ E[ (X − E[X])² ] / c² = Var(X) / c²

(applying Markov’s Inequality to the non-negative random variable (X − E[X])²)

Chebyshev’s Inequality:        P[ | X − E[X] | ≥ c ] ≤ Var(X) / c²
True for all distributions!
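A quick simulation makes these inequalities concrete. The sketch below compares the Markov and Chebyshev bounds with the empirically observed tail probability of an exponential random variable; the choice of distribution, sample size, and thresholds is an arbitrary illustration.

```python
import numpy as np

# Compare Markov / Chebyshev bounds with observed tail probabilities (illustrative only).
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # non-negative, E[X] = 1, Var(X) = 1

for c in (2.0, 4.0, 8.0):
    empirical = np.mean(x >= c)                   # P[X >= c], estimated from samples
    markov = x.mean() / c                         # Markov: E[X] / c
    chebyshev = x.var() / (c - x.mean())**2       # Chebyshev applied to P[|X - EX| >= c - EX]
    print(f"c={c}: empirical={empirical:.4f}  Markov<= {markov:.4f}  Chebyshev<= {chebyshev:.4f}")
```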
Concentration of Measure
Sharper estimates using an exponential!

Let X be a random variable (not necessarily non-negative).


For some given constant c > 0, examine P[ X − E[X] ≥ c ].

Observation: for any t > 0,

        P[ X − E[X] ≥ c ] = P[ e^{t(X − E[X])} ≥ e^{tc} ] ≤ e^{−tc} · E[ e^{t(X − E[X])} ]

(by Markov’s Inequality applied to the non-negative random variable e^{t(X − E[X])});
then optimize the bound over t > 0.

This is called Chernoff’s bounding method.
Concentration of Measure
Now, given X_1, …, X_m i.i.d. random variables (assume 0 ≤ X_i ≤ 1),

define Y_i := X_i − E[X_i].

By Chernoff’s bounding technique, for any t > 0:

        P[ (1/m) Σ_i Y_i ≥ ε ] ≤ e^{−tmε} · E[ e^{t Σ_i Y_i} ] = e^{−tmε} · Π_i E[ e^{t Y_i} ]        (Y_i i.i.d.)

Each Y_i is bounded and has mean zero, so E[ e^{t Y_i} ] ≤ e^{t²/8} (Hoeffding’s lemma);
optimizing over t (take t = 4ε) gives

        P[ (1/m) Σ_i X_i − E[X_1] ≥ ε ] ≤ e^{−2ε²m}

Combined with the symmetric deviation, this implies the Chernoff-Hoeffding bound!
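As a sanity check on the bound just derived, the sketch below estimates the deviation probability for Bernoulli samples by simulation and compares it with 2·exp(−2ε²m); the parameters p, m, ε, and the number of trials are arbitrary illustrative values.

```python
import numpy as np

# Empirical check of the Chernoff-Hoeffding bound (illustrative parameter choices).
rng = np.random.default_rng(1)
p, m, eps, trials = 0.3, 200, 0.1, 100_000

samples = rng.binomial(1, p, size=(trials, m))        # trials x m Bernoulli(p) draws
deviations = np.abs(samples.mean(axis=1) - p)         # |(1/m) sum X_i - p| per trial
empirical = np.mean(deviations > eps)                 # P[ deviation > eps ], estimated
bound = 2 * np.exp(-2 * eps**2 * m)                   # Chernoff-Hoeffding bound

print(f"empirical deviation prob: {empirical:.4f}   Hoeffding bound: {bound:.4f}")
```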


Back to Learning Theory!
Theorem (Occam’s Razor):
Pick any tolerance level ε > 0, and any confidence level δ > 0;
let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D.

if m ≥ (2/ε²) ( ln|F| + ln(2/δ) ), then with probability at least 1 − δ:

        err(f_ERM) ≤ min_{f ∈ F} err(f) + ε

⇒ a finite-size F is efficiently PAC learnable


Learning general concepts
Consider linear classification: F = { x ↦ sign(w·x + b) }

This class contains infinitely many models, so ln|F| = ∞ and the
Occam’s Razor bound is ineffective.


VC Theory
Need to capture the true richness of F.

Definition (Vapnik-Chervonenkis or VC dimension):

We say that a model class F has VC dimension d if d is the size of the largest set of
points x_1, …, x_d such that for all 2^d possible labellings of x_1, …, x_d
there exists some f ∈ F that achieves that labelling.

Example: F = linear classifiers in R²  (VC dimension = 3)


VC Dimension

Another example:
F = axis-aligned rectangles in R²

Four points in general position can be shattered, but for any five points, consider
labelling the left-, right-, top- and bottom-most points positive and a remaining point
negative: the class of rectangles cannot realize this labelling, since any rectangle
containing the four extremal points also contains the fifth. So the VC dimension is 4.
(A small brute-force check of this appears below.)

VC dimension:
• A combinatorial concept to capture the true richness of F.
• Often (but not always!) proportional to the degrees-of-freedom, or
the number of independent parameters, in F.
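The sketch below is a small brute-force check of the rectangle example, assuming (as above) that the rectangles are axis-aligned; the specific point sets and helper names are illustrative.

```python
from itertools import product

def rect_realizes(points, labels):
    """Can some axis-aligned rectangle contain exactly the points labelled 1?"""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True  # a degenerate (empty) rectangle realizes the all-negative labelling
    xs, ys = [p[0] for p in pos], [p[1] for p in pos]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    # The bounding box of the positives is the tightest candidate rectangle;
    # the labelling is realizable iff it excludes every negative point.
    return all(not (lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y)
               for p, y in zip(points, labels) if y == 0)

def shattered(points):
    """True if every one of the 2^n labellings of the points is realizable."""
    return all(rect_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

# Four 'diamond' points can be shattered; adding a centre point breaks shattering.
four = [(0, 1), (0, -1), (1, 0), (-1, 0)]
five = four + [(0, 0)]
print(shattered(four), shattered(five))   # expected: True False
```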
VC Theorem
Theorem (Vapnik-Chervonenkis ’71):
Pick any tolerance level ε > 0, and any confidence level δ > 0;
let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D.

if m ≥ (C/ε²) ( d ln(1/ε) + ln(1/δ) ), where d is the VC dimension of F and C is an
absolute constant, then with probability at least 1 − δ:

        err(f_ERM) ≤ min_{f ∈ F} err(f) + ε

⇒ any F with finite VC dimension is efficiently PAC learnable

VC Theorem ⇒ Occam’s Razor Theorem  (for finite F, the VC dimension is at most log₂|F|)
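To see how the VC bound scales, the sketch below evaluates the sample-size expression for a few VC dimensions; the constant C = 8 and the particular ε, δ are placeholders, since the theorem only determines the constant up to a universal factor.

```python
import math

def vc_sample_size(d, eps, delta, C=8.0):
    """Sample size suggested by the VC bound, up to the unspecified absolute constant C."""
    return math.ceil((C / eps**2) * (d * math.log(1.0 / eps) + math.log(1.0 / delta)))

# Illustrative scaling: how m grows with the VC dimension d at fixed eps, delta.
for d in (3, 10, 100):   # e.g. d = 3 for linear classifiers in R^2
    print(d, vc_sample_size(d, eps=0.05, delta=0.01))
```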


Tightness of VC bound
Theorem (VC lower bound):
Let A be any model selection algorithm that, given m samples, returns a
model from F, that is, A : (x_1, y_1), …, (x_m, y_m) ↦ f_A ∈ F.
For all tolerance levels ε > 0, and all confidence levels δ > 0,
there exists a distribution D such that if m ≤ c ( d + ln(1/δ) ) / ε² (for a small
absolute constant c), then with probability at least δ, err(f_A) > min_{f ∈ F} err(f) + ε.
Some implications

• VC dimension of a model class fully characterizes its learning ability!

• Results are agnostic to the underlying distribution.


One algorithm to rule them all?

From our discussion it may seem that the ERM algorithm is universally consistent.

This is not the case!

Theorem (no free lunch, Devroye ‘82):


Pick any sample size m, any algorithm A, and any ε > 0.
There exists a distribution D such that

        E[ err(f_A) ] ≥ 1/2 − ε,

while the Bayes optimal error is 0.


Further refinements and extensions

• How to do model class selection? Structural risk minimization results.

• Dealing with kernels – Fat margin theory

• Incorporating priors over the models – PAC-Bayes theory

• Is it possible to get distribution-dependent bounds? Rademacher complexity

• How about regression? Can derive similar results for nonparametric regression.
What We Learned…

• Formalizing learning

• PAC learnability

• Occam’s razor Theorem

• VC dimension and the VC theorem

• No Free-lunch theorem
Questions?
Next time…

Midterm!

Unsupervised learning.
Announcements

• Project proposal due tomorrow

• Midterm next lecture!

• HW3 posted
