Lecture 1
2 Formal Framework
2.1 Basic notions
In our formal model for machine learning, the instances to be classified are members of a
set X , the domain set or feature space. Instances are to be classified into a label set Y.
For now (and most of the class), we assume that the label set is binary, that is Y = {0, 1}.
For example, an instance x ∈ X could be an email and its label indicates whether the
email is spam (y = 1) or not spam (y = 0). We often assume that the instances are
represented as real-valued vectors, that is X ⊆ R^d for some dimension d.
A predictor or classifier is a function h : X → Y. A learner is a function that takes
some training data and maps it to a predictor. We let the training data be denoted by a
sequence S = ((X1 , Y1 ), . . . , (Xn , Yn )). Then, formally, a learner A is a function
\[
A \colon \bigcup_{i=1}^{\infty} (X \times Y)^i \to Y^X
\]
\[
A \colon S \mapsto h,
\]
where Y^X denotes the set of all functions from set X to set Y. For convenience, when the
learner is clear from context, we use the notation h_n to denote the output of the learner
on data of size n, that is h_n = A(S) for |S| = n.
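To make these type signatures concrete, here is a minimal Python sketch; the names and the trivial majority-vote learner are illustrative choices, not part of the formal model:

```python
from typing import Callable, List, Tuple

Instance = Tuple[float, ...]           # an instance x in X, with X a subset of R^d
Sample = List[Tuple[Instance, int]]    # S = ((X_1, Y_1), ..., (X_n, Y_n)), labels in {0, 1}
Predictor = Callable[[Instance], int]  # a classifier h : X -> Y
Learner = Callable[[Sample], Predictor]

def majority_learner(S: Sample) -> Predictor:
    """A trivial learner: always predicts the majority label of S."""
    majority = int(2 * sum(y for _, y in S) >= len(S))
    return lambda x: majority
```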
The goal of learning is to produce a predictor h that correctly classifies not only the
training data, but also future instances that it has not seen yet. We thus need a mathe-
matical description of how the environment produces instances. In particular, we would
like to model that the environment (or nature) remains somehow stable, that the process
that generated the training data is the same that will generate future data.
We model the data generation as a probability distribution P over X ×Y = X ×{0, 1}.
We further assume that the instances (Xi , Yi ) are i.i.d. (independently and identically
distributed) according to P .
The performance of a classifier h on an instance (X, Y ) is measured by a loss function.
A loss function is a function
\[
\ell \colon (Y^X \times X \times Y) \to \mathbb{R}.
\]
The value ℓ(h, X, Y) ∈ R indicates “how badly h predicts on example (X, Y)”. We will,
for now, work with the binary loss (or 0/1-loss), defined as
\[
\ell(h, X, Y) = \mathbb{1}[h(X) \neq Y],
\]
where 1[p] denotes the indicator function of predicate p, that is 1[p] = 1 if p is true and
1[p] = 0 if p is false. The binary loss is 1 if the prediction of h on example (X, Y) is
wrong. If the prediction is correct, no loss is suffered and the binary loss assigns value 0.
We can now formally phrase the goal of learning as aiming for a classifier that has
low loss in expectation over the data generating distribution. That is, we would like to
output a classifier that has low expected loss, or risk, defined as
\[
L(h) = \mathbb{E}_{(X,Y) \sim P}\, \ell(h, X, Y).
\]
Since our loss function assumes only values in {0, 1}, the above expectation is equal to
the probability of generating an example (X, Y) on which h makes a wrong prediction. That
is, we have
\[
L(h) = P\big( h(X) \neq Y \big).
\]
Note, however, that the learner does not get to see the data generating distribution. It
thus cannot simply output a classifier with lowest expected loss. The learner needs to make its
decisions based on the data S. Given a classifier h and data S, the learner can evaluate
the empirical risk of h on S
\[
L_n(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X_i) \neq Y_i].
\]
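In code, the binary loss and the empirical risk translate directly into the following minimal sketch (h is any predictor in the sense above):

```python
def zero_one_loss(h, x, y):
    """Binary (0/1) loss: 1 if h misclassifies the example (x, y), else 0."""
    return int(h(x) != y)

def empirical_risk(h, S):
    """L_n(h): the average 0/1 loss of h over the sample S."""
    return sum(zero_one_loss(h, x, y) for x, y in S) / len(S)
```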
Claim 1. For all functions h : X → {0, 1} and for all sample sizes n we have
\[
\mathbb{E}_S\, L_n(h) = L(h).
\]
Proof.
\begin{align*}
\mathbb{E}_S\, L_n(h) &= \mathbb{E}_S\, \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X_i) \neq Y_i] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_S\, \mathbb{1}[h(X_i) \neq Y_i] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{(X,Y)}\, \mathbb{1}[h(X) \neq Y] \\
&= \frac{1}{n} \sum_{i=1}^{n} L(h) \\
&= L(h),
\end{align*}
where the second equality holds by linearity of expectation, and the third equality holds
since each summand depends only on a single example (the i-th) in S, which is distributed
according to P.
Thus, for any fixed function, the empirical risk gives us an unbiased estimate of the
quantity that we are after, the true risk. Note that this holds even for small sample sizes.
Moreover, by the law of large numbers, the above claim implies that, with large sample
sizes, the empirical risk of a classifier converges to its true risk (in probability). As we
see more and more data, the empirical risk of a function becomes a better and better
estimate of its true risk.
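This behavior is easy to observe numerically. The following sketch fixes a classifier before drawing any data and estimates its empirical risk for growing sample sizes; the distribution and thresholds below are arbitrary choices for illustration:

```python
import random

# P: X uniform on [0, 1], Y = 1[X > 0.5]. The fixed classifier h
# thresholds at 0.3, so it errs exactly when X lands in (0.3, 0.5],
# giving a true risk of L(h) = 0.2.
def draw_sample(n):
    xs = [random.random() for _ in range(n)]
    return [(x, int(x > 0.5)) for x in xs]

h = lambda x: int(x > 0.3)  # fixed before the data is seen

for n in [10, 100, 10_000]:
    S = draw_sample(n)
    L_n = sum(int(h(x) != y) for x, y in S) / n
    print(n, L_n)  # L_n(h) concentrates around the true risk 0.2
```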
This may lead us to believe that the simple learning strategy of just finding some
function with low empirical risk should succeed at achieving low true risk as we see more
and more data. However, the following phenomenon shows that this strategy can in fact
go wrong arbitrarily badly.
Claim 2. There exists a distribution P and a learner, such that for all n we have
\[
L_n(h_n) = 0 \quad \text{while} \quad L(h_n) = 1.
\]
Proof. As the data generating distribution, consider the uniform distribution over [0, 1] × {1}.
That is, in any sample S generated by this P, the examples are labeled with 1, that is
S = ((X_1, 1), . . . , (X_n, 1)). We construct a “stubborn” learner A. The stubborn learner
outputs a function that agrees with the sample’s labels on points that were in the sample
S, but keeps believing that the label is 0 everywhere else. Formally:
\[
h_n(X) = A(S)(X) = \begin{cases} 1 & \text{if } (X, 1) \in S \\ 0 & \text{otherwise.} \end{cases}
\]
Now we clearly have L_n(h_n) = 0 for all n. However, since S is finite, the set of instances X
on which h_n predicts 1 has measure 0. Thus, with probability 1 over a fresh example drawn
from P, h_n outputs the incorrect label 0, and therefore L(h_n) = 1.
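The stubborn learner is easy to simulate. The sketch below uses the distribution from the proof (uniform on [0, 1] with constant label 1) and reproduces L_n(h_n) = 0 alongside a true risk of essentially 1:

```python
import random

def stubborn_learner(S):
    """Memorizes the sampled points; predicts 0 everywhere else."""
    memorized = {x for x, _ in S}
    return lambda x: 1 if x in memorized else 0

S = [(random.random(), 1) for _ in range(100)]  # all labels are 1
h_n = stubborn_learner(S)

empirical = sum(int(h_n(x) != y) for x, y in S) / len(S)
true_est = sum(int(h_n(random.random()) != 1) for _ in range(10_000)) / 10_000
print(empirical, true_est)  # 0.0 on the sample, ~1.0 on fresh data
```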
The difference between the situations in the above two claims is that, in the second
case, the function h_n depends on the data. While, for every fixed function h (fixed before
the data is seen), the empirical risk estimates converge to the true risk of this function,
this convergence is not uniform over all functions. Claim 2 shows that, at any given
sample size, there exist functions for which true and empirical risk are arbitrarily far
apart.
Now, in machine learning, we do want the function that the learner outputs to be able
to depend on the data. Furthermore, the learner only ever gets to see a finite amount of
data. We have seen that, for any finite sample size, that is, on any finite amount of data,
the empirical risk can be a very bad indicator of the true risk of a function.
Basic questions of learning theory thus are: How can we control the (true) risk of a
function learned based on a finite amount of data? Can we identify situations where we
can relate the true and empirical risk?
A natural approach is to restrict the learner to a hypothesis class H ⊆ Y^X, fixed before
seeing the data, and to let the learner output a function from H that minimizes the
empirical risk. This learning rule is called empirical risk minimization (ERM):
\[
\hat{h}_n \in \operatorname*{argmin}_{h \in H} L_n(h).
\]
We would then like to show guarantees of the form
\[
L(\hat{h}_n) \leq \inf_{h \in H} L(h) + f(n),
\]
where f is a decreasing function of sample size n. That is, as we see more and more data,
we would like that the true risk of the output of the learner approaches the best risk
possible with the class H. Or equivalently, we would like to show that
\[
L(\hat{h}_n) - \inf_{h \in H} L(h) \leq f(n).
\]
We say that the realizability assumption holds for H if there exists a function h* ∈ H
with L(h*) = 0, that is, if the class contains a function of zero risk.
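Over a finite class, ERM can be implemented by exhaustive search. A minimal sketch, using a hypothetical class of threshold functions on [0, 1] under which the example distribution below is realizable:

```python
import random

def empirical_risk(h, S):
    """L_n(h): average 0/1 loss of h over the sample S."""
    return sum(int(h(x) != y) for x, y in S) / len(S)

def erm(H, S):
    """Return a hypothesis in H with minimal empirical risk on S."""
    return min(H, key=lambda h: empirical_risk(h, S))

# A hypothetical finite class: N = 10 threshold functions on [0, 1].
H = [(lambda t: (lambda x: int(x > t)))(k / 10) for k in range(10)]

# Realizable data: the true labeler (threshold 0.5) lies in H.
S = [(x, int(x > 0.5)) for x in (random.random() for _ in range(100))]

h_hat = erm(H, S)
print(empirical_risk(h_hat, S))  # 0.0: ERM fits a realizable sample perfectly
```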
Theorem 1. Let H = {h_1, . . . , h_N} and δ ∈ (0, 1]. Under the realizability assumption,
we have with probability at least (1 − δ) over the generation of the sample S
\[
L(\hat{h}_n) \leq \frac{\log N + \log(1/\delta)}{n}.
\]
Proof. By the realizability assumption, there is an h* ∈ H with L(h*) = 0, and hence
L_n(h*) = 0 with probability 1 over the sample S. ERM thus always outputs a function
ĥ_n with L_n(ĥ_n) = 0. So, for any ε > 0, ERM only outputs a function with error larger
than ε if L_n(ĥ_n) = 0 while L(ĥ_n) ≥ ε.
For every h ∈ H with L(h) > ε, since the n examples are drawn i.i.d. from P, we have
\[
P_S\big( L_n(h) = 0 \big) = \big(1 - L(h)\big)^n \leq (1 - \epsilon)^n \leq e^{-\epsilon n}.
\]
By a union bound over the at most N such functions in H, the probability that any h ∈ H
with L(h) > ε attains L_n(h) = 0 is at most N e^{-εn}. Now we set
\[
\epsilon = \frac{\log(1/\delta) + \log N}{n}.
\]
Plugging in this value for ε, we get N e^{-εn} = δ, and thus we have shown
\[
P_S\left( L(\hat{h}_n) \geq \frac{\log N + \log(1/\delta)}{n} \right) \leq \delta.
\]
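To get a feel for the rate, the following snippet evaluates the bound for a few sample sizes; N = 1000 and δ = 0.01 are arbitrary example values:

```python
from math import log

N, delta = 1000, 0.01
for n in [100, 1_000, 10_000]:
    print(n, (log(N) + log(1 / delta)) / n)
# With N = 1000 and delta = 0.01, already n = 1000 guarantees a risk
# of roughly 0.012 or less with probability at least 0.99.
```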