
18.657: Mathematics of Machine Learning

Lecturer: Philippe Rigollet                                        Lecture 2
Scribe: Jonathan Weed                                              Sep. 14, 2015

Part I
Statistical Learning Theory
1. BINARY CLASSIFICATION

In the last lecture, we looked broadly at the problems that machine learning seeks to solve
and the techniques we will cover in this course. Today, we will focus on one such problem,
binary classification, and review some important notions that will be foundational for the
rest of the course.
Our present focus on binary classification is justified both because it encompasses much
of what we want to accomplish in practice and because the response variables in the binary
classification problem are bounded. (We will see a very important application of this fact
below.) It also happens that there are some nasty surprises in non-binary classification,
which we avoid by focusing on the binary case here.

1.1 Bayes Classifier


Recall the setup of binary classification: we observe a sequence (X1 , Y1 ), . . . , (Xn , Yn ) of n
independent draws from a joint distribution PX,Y . The variable Y (called the label ) takes
values in {0, 1}, and the variable X takes values in some space X representing “features” of
the problem. We can of course speak of the marginal distribution PX of X alone; moreover,
since Y is supported on {0, 1}, the conditional random variable Y |X is distributed according
to a Bernoulli distribution. We write Y |X ∼ Bernoulli(η(X)), where

η(X) = IP(Y = 1|X) = IE[Y |X].

(The function η is called the regression function.)


We begin by defining an optimal classifier called the Bayes classifier. Intuitively, the
Bayes classifier is the classifier that “knows” η—it is the classifier we would use if we had
perfect access to the distribution Y |X.

Definition: The Bayes classifier of Y given X, denoted h∗, is the function defined by the
rule

    h∗(x) = 1 if η(x) > 1/2,   and   h∗(x) = 0 if η(x) ≤ 1/2.

In other words, h∗ (X) = 1 whenever IP(Y = 1|X) > IP(Y = 0|X).
Our measure of performance for any classifier h (that is, any function mapping X to
{0, 1}) will be the classification error: R(h) = IP(Y ≠ h(X)). The Bayes risk is the value
R∗ = R(h∗) of the classification error associated with the Bayes classifier. The following
theorem establishes that the Bayes classifier is optimal with respect to this metric.

Theorem: For any classifier h, the following identity holds:

    R(h) − R(h∗) = ∫_{h ≠ h∗} |2η(x) − 1| PX(dx) = IEX[|2η(X) − 1| 1(h(X) ≠ h∗(X))],   (1.1)

where {h ≠ h∗} denotes the (measurable) set {x ∈ X : h(x) ≠ h∗(x)}.

In particular, since the integrand is nonnegative, the Bayes classifier minimizes R(h)
over all classifiers h, so R∗ = min_h R(h).
Moreover,

    R(h∗) = IE[min(η(X), 1 − η(X))] ≤ 1/2.   (1.2)


Proof. We begin by proving Equation (1.2). The definition of R(h) implies

R(h) = IP(Y ≠ h(X)) = IP(Y = 1, h(X) = 0) + IP(Y = 0, h(X) = 1),

where the second equality follows since the two events are disjoint. By conditioning on X
and using the tower law, this last quantity is equal to

IE[IE[1(Y = 1, h(X) = 0)|X]] + IE[IE[1(Y = 0, h(X) = 1)|X]]

Now, h(X) is measurable with respect to X, so we can factor it out to yield

IE[1(h(X) = 0)η(X) + 1(h(X) = 1)(1 − η(X))],   (1.3)

where we have replaced IE[Y |X] by η(X).


In particular, if h = h∗ , then Equation 1.3 becomes

IE[1(η(X) ≤ 1/2)η(X) + 1(η(X) > 1/2)(1 − η(X))].

But η(X) ≤ 1/2 implies η(X) ≤ 1 − η(X) and conversely, so we finally obtain

R(h∗) = IE[1(η(X) ≤ 1/2)η(X) + 1(η(X) > 1/2)(1 − η(X))]
      = IE[(1(η(X) ≤ 1/2) + 1(η(X) > 1/2)) min(η(X), 1 − η(X))]
      = IE[min(η(X), 1 − η(X))],

as claimed. Since min(η(X), 1 − η(X)) ≤ 1/2, its expectation is certainly at most 1/2 as
well.
Now, given an arbitrary h, applying Equation 1.3 to both h and h∗ yields

R(h) − R(h∗) = IE[1(h(X) = 0)η(X) + 1(h(X) = 1)(1 − η(X))]
             − IE[1(h∗(X) = 0)η(X) + 1(h∗(X) = 1)(1 − η(X))],

which is equal to

IE[(1(h(X) = 0) − 1(h∗ (X) = 0))η(X) + (1(h(X) = 1) − 1(h∗ (X) = 1))(1 − η(X))].

Since h(X) takes only the values 0 and 1, the difference 1(h(X) = 1) − 1(h∗(X) = 1) in the
second term can be rewritten as −(1(h(X) = 0) − 1(h∗(X) = 0)). Factoring yields

IE[(2η(X) − 1)(1(h(X) = 0) − 1(h∗ (X) = 0))].

The term 1(h(X) = 0) − 1(h∗(X) = 0) is equal to −1, 0, or 1 depending on whether h
and h∗ agree. When h(X) = h∗(X), it is zero. When h(X) ≠ h∗(X), it equals 1 whenever
h(X) = 0 and −1 otherwise. Applying the definition of the Bayes classifier, we obtain

    IE[(2η(X) − 1)1(h(X) ≠ h∗(X)) sign(η(X) − 1/2)] = IE[|2η(X) − 1|1(h(X) ≠ h∗(X))],

as desired.

We make several remarks. First, the quantity R(h) − R(h∗ ) in the statement of the
theorem above is called the excess risk of h and denoted E(h). (“Excess,” that is, above
the Bayes classifier.) The theorem implies that E(h) ≥ 0.
Second, the risk of the Bayes classifier R∗ equals 1/2 if and only if η(X) = 1/2 almost
surely. This maximal risk for the Bayes classifier occurs precisely when X “contains no
information” about the label Y. Equation (1.1) makes clear that the excess risk
weighs the discrepancy between h and h∗ according to how far η is from 1/2. When η is
close to 1/2, no classifier can perform well and the excess risk is low. When η is far from
1/2, the Bayes classifier performs well and we penalize classifiers that fail to do so more
heavily.
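To make these quantities concrete, here is a minimal Python sketch (not from the lecture; the distribution, the regression function η(x) = x, and the competing classifier are invented for illustration) that estimates R(h), R(h∗), and checks the excess-risk formula (1.1) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): X uniform on [0, 1], eta(x) = x, so the Bayes
# classifier is h*(x) = 1(x > 1/2) and R* = E[min(eta, 1 - eta)] = 1/4.
def eta(x):
    return x

def bayes_classifier(x):
    return (eta(x) > 0.5).astype(int)

def some_other_classifier(x):
    return (x > 0.8).astype(int)  # an arbitrary competitor

def risk(h, n_mc=10**6):
    """Monte Carlo estimate of R(h) = P(Y != h(X))."""
    x = rng.uniform(0, 1, size=n_mc)
    y = rng.binomial(1, eta(x))          # Y | X ~ Bernoulli(eta(X))
    return np.mean(y != h(x))

R_star = risk(bayes_classifier)          # ~ 0.25
R_h = risk(some_other_classifier)        # ~ 0.34
print(f"R(h*) ~ {R_star:.3f}, R(h) ~ {R_h:.3f}, excess risk ~ {R_h - R_star:.3f}")

# The excess-risk formula (1.1): E_X[|2 eta(X) - 1| 1(h(X) != h*(X))].
# Here the disagreement region is (1/2, 0.8], and the integral of |2x - 1| over it is 0.09.
x = rng.uniform(0, 1, size=10**6)
excess = np.mean(np.abs(2 * eta(x) - 1)
                 * (some_other_classifier(x) != bayes_classifier(x)))
print(f"excess risk via (1.1) ~ {excess:.3f}")
```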
As noted last time, linear discriminant analysis attacks binary classification by putting
some model on the data. One way to achieve this is to impose some distributional assump-
tions on the conditional distributions X|Y = 0 and X|Y = 1.
We can reformulate the Bayes classifier in these terms by applying Bayes’ rule:

    η(x) = IP(Y = 1|X = x)
         = IP(X = x|Y = 1) IP(Y = 1) / ( IP(X = x|Y = 1) IP(Y = 1) + IP(X = x|Y = 0) IP(Y = 0) ).

(In general, when PX is a continuous distribution, we should consider infinitesimal
probabilities IP(X ∈ dx).)
Assume that X|Y = 0 and X|Y = 1 have densities p0 and p1 , and IP(Y = 1) = π is
some constant reflecting the underlying tendency of the label Y . (Typically, we imagine
that π is close to 1/2, but that need not be the case: in many applications, such as anomaly
detection, Y = 1 is a rare event.) Then h∗ (X) = 1 whenever η(X) ≥ 1/2, or, equivalently,
whenever
    p1(x) / p0(x) ≥ (1 − π) / π.
When π = 1/2, this rule amounts to reporting 1 or 0 by comparing the densities p1
and p0 . For instance, in Figure 1, if π = 1/2 then the Bayes classifier reports 1 whenever
p1 ≥ p0 , i.e., to the right of the dotted line, and 0 otherwise.
On the other hand, when π is far from 1/2, the Bayes classifier is weighted towards the
underlying bias of the label variable Y.
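The likelihood-ratio form of the rule is easy to experiment with. The following sketch is illustrative only: the Gaussian class-conditional densities N(−1, 1) and N(+1, 1) are assumptions, not taken from the lecture. It shows how the decision boundary shifts as π moves away from 1/2.

```python
import numpy as np
from scipy.stats import norm

# Illustrative choice (not from the lecture): p0 = N(-1, 1), p1 = N(+1, 1).
p0 = norm(loc=-1.0, scale=1.0).pdf
p1 = norm(loc=+1.0, scale=1.0).pdf

def bayes_classifier(x, pi=0.5):
    """Report 1 whenever p1(x)/p0(x) >= (1 - pi)/pi, i.e. eta(x) >= 1/2."""
    return (p1(x) * pi >= p0(x) * (1 - pi)).astype(int)

x = np.linspace(-4, 4, 9)
print(bayes_classifier(x, pi=0.5))   # switches from 0 to 1 at x = 0, where p1 = p0
print(bayes_classifier(x, pi=0.1))   # a rare positive class pushes the boundary to the right
```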

1.2 Empirical Risk Minimization


The above considerations are all probabilistic, in the sense that they discuss properties of
some underlying probability distribution. The statistician does not have access to the true
probability distribution PX,Y ; she only has access to i.i.d. samples (X1 , Y1 ), . . . , (Xn , Yn ).
We consider now this statistical perspective. Note that the underlying distribution PX,Y
still appears explicitly in what follows, since that is how we measure our performance: we
judge the classifiers we produce on future i.i.d. draws from PX,Y.

Figure 1: The Bayes classifier when π = 1/2.

Given data Dn = {(X1, Y1), . . . , (Xn, Yn)}, we build a classifier ĥn(X), which is random
in two senses: it is a function of the random variable X and it also depends implicitly on
the random data Dn. As above, we judge a classifier according to the quantity E(ĥn). This
is a random variable: though we have integrated out X, the excess risk still depends on the
data Dn. We will therefore consider both bounds on its expected value and bounds that
hold with high probability. In any case, the bound E(ĥn) ≥ 0 always holds. (This inequality
does not merely hold “almost surely,” since we proved that R(h) ≥ R(h∗) uniformly over
all choices of classifier h.)
Last time, we proposed two different philosophical approaches to this problem. In
particular, generative approaches make distributional assumptions about the data, attempt
to learn parameters of these distributions, and then plug the resulting values into the model.
The discriminative approach—the one taken in machine learning—will be described in great
detail over the course of this semester. However, there is some middle ground, which is worth
mentioning briefly. This middle ground avoids making explicit distributional assumptions
about X while maintaining some of the flavor of the generative model.
The central insight of this middle approach is the following: since by definition h∗(x) =
1(η(x) > 1/2), we can estimate η by some η̂n and thereby produce the estimator ĥn(x) =
1(η̂n(x) > 1/2). The result is called a plug-in estimator.
Of course, achieving good performance with a plug-in estimator requires some assumptions.
(No-free-lunch theorems imply that we can’t avoid making an assumption somewhere!)
One possible assumption is that η(X) is smooth; in that case, there are many
nonparametric regression techniques available (Nadaraya-Watson kernel regression, wavelet
bases, etc.).
We could also assume that η(X) is a function of a particular form. Since η(X) takes
values in [0, 1], standard linear models are generally inapplicable; rather, by applying
the logit transform we obtain logistic regression, which assumes that η satisfies an identity
of the form

    log( η(X) / (1 − η(X)) ) = θᵀX.
Plug-in estimators are called “semi-parametric” since they avoid making any assumptions
about the distribution of X. These estimators are widely used because they perform fairly
well in practice and are very easy to compute. Nevertheless, they will not be our focus here.
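As a concrete illustration of the plug-in idea, the sketch below fits a logistic model for η and thresholds the estimate at 1/2. The simulated data and the use of scikit-learn’s LogisticRegression are conveniences assumed for illustration, not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Simulated data (hypothetical): eta(x) follows a logistic model with theta = (2, -1).
n, theta = 500, np.array([2.0, -1.0])
X = rng.normal(size=(n, 2))
eta = 1.0 / (1.0 + np.exp(-X @ theta))
Y = rng.binomial(1, eta)

# Plug-in estimator: estimate eta by eta_hat, then classify via 1(eta_hat > 1/2).
# fit_intercept=False matches the intercept-free form log(eta/(1-eta)) = theta^T x above.
model = LogisticRegression(fit_intercept=False).fit(X, Y)
eta_hat = model.predict_proba(X)[:, 1]
h_n = (eta_hat > 0.5).astype(int)        # same labels as model.predict(X)

print("training classification error:", np.mean(h_n != Y))
```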
In what follows, we focus on the discriminative framework and empirical risk
minimization. Our benchmark continues to be the risk function R(h) = IE[1(Y ≠ h(X))],
which is clearly not computable based on the data alone; however, we can attempt to use a
naïve statistical “hammer” and replace the expectation with an average.

Definition: The empirical risk of a classifier h is given by

    R̂n(h) = (1/n) Σ_{i=1}^n 1(Yi ≠ h(Xi)).

Minimizing the empirical risk over the family of all classifiers is useless, since we can
always minimize the empirical risk by mimicking the data and classifying arbitrarily other-
wise. We therefore limit our attention to classifiers in a certain family H.
Definition: The Empirical Risk Minimizer (ERM) over H is any element¹ ĥ^erm of the set
argmin_{h∈H} R̂n(h).

In order for our results to be meaningful, the class H must be much smaller than the
space of all classifiers. On the other hand, we also hope that the risk of ĥ^erm will be close
to the Bayes risk, but that is unlikely if H is too small. The next section will give us tools
for quantifying this tradeoff.
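For concreteness, here is a small sketch of ERM over a toy finite class H of threshold classifiers h_t(x) = 1(x > t); the data-generating distribution and the grid of thresholds are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data (hypothetical): X uniform on [0, 1], eta(x) = x, so the Bayes boundary is 1/2.
n = 200
X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, X)

# A small finite class H: threshold classifiers h_t(x) = 1(x > t) on a coarse grid.
thresholds = np.linspace(0.0, 1.0, 11)

def empirical_risk(t):
    """R_hat_n(h_t) = (1/n) sum_i 1(Y_i != h_t(X_i))."""
    return np.mean(Y != (X > t).astype(int))

risks = [empirical_risk(t) for t in thresholds]
t_erm = thresholds[int(np.argmin(risks))]
print(f"ERM threshold: {t_erm:.1f}, empirical risk: {min(risks):.3f}")
```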

1.3 Oracle Inequalities


An oracle is a mythical classifier, one that is impossible to construct from data alone but
whose performance we nevertheless hope to mimic. Specifically, given H we define h̄ to be
an element of argmin_{h∈H} R(h)—a classifier in H that minimizes the true risk. Of course,
we cannot determine h̄, but we can hope to prove a bound of the form

    R(ĥ) ≤ R(h̄) + something small.   (1.4)

Since h̄ is the best classifier in H given perfect knowledge of the distribution, a bound of
the form given in Equation 1.4 would imply that ĥ has performance that is almost best-in-
class. We can also apply such an inequality in the so-called improper learning framework,
where we allow ĥ to lie in a slightly larger class H′ ⊃ H; in that case, we still get nontrivial
guarantees on the performance of ĥ if we know how to control R(h̄).
There is a natural tradeoff between the two terms on the right-hand side of Equation 1.4.
When H is small, we expect the performance of the oracle h̄ to suffer, but we may hope
to approximate h̄ quite closely. (Indeed, in the limit where H is a single function, the
“something small” in Equation 1.4 is equal to zero.) On the other hand, as H grows the
oracle will become more powerful but approximating it becomes more statistically difficult.
(In other words, we need a larger sample size to achieve the same measure of performance.)
Since R(ĥ) is a random variable, we ultimately want to prove a bound in expectation
or a tail bound of the form

    IP( R(ĥ) ≤ R(h̄) + ∆n,δ(H) ) ≥ 1 − δ,

where ∆n,δ(H) is some explicit term depending on our sample size and our desired level of
confidence.
¹ In fact, even an approximate solution will do: our bounds will still hold whenever we produce a classifier
ĥ satisfying R̂n(ĥ) ≤ inf_{h∈H} R̂n(h) + ε.

In the end, we should recall that

    E(ĥ) = R(ĥ) − R(h∗) = (R(ĥ) − R(h̄)) + (R(h̄) − R(h∗)).

The second term in the above equation is the approximation error, which is unavoidable
once we fix the class H. Oracle inequalities give a means of bounding the first term, the
stochastic error.
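A toy computation may help fix this decomposition. In the sketch below (illustrative assumptions: X ~ U[0, 1], η(x) = x, and a deliberately poor class H of thresholds), the approximation error R(h̄) − R(h∗) can be computed exactly, while the stochastic error R(ĥ) − R(h̄) comes from running ERM on a finite sample.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustration (invented): X ~ U[0,1], eta(x) = x, so h*(x) = 1(x > 1/2) and R* = 1/4.
# Take a deliberately poor class H of thresholds {0.7, 0.8, 0.9}; the oracle h_bar is the
# best of these, and R(h_bar) - R* is the approximation error.
def true_risk(t):
    # R(h_t) for h_t(x) = 1(x > t):  integral_0^t x dx + integral_t^1 (1 - x) dx
    return 0.5 * (t**2 + (1 - t)**2)

H = np.array([0.7, 0.8, 0.9])
R_star = 0.25
t_bar = H[np.argmin(true_risk(H))]                  # oracle threshold within H
approx_error = true_risk(t_bar) - R_star

# ERM over H on a sample of size n; stochastic error = R(h_hat) - R(h_bar).
n = 200
X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, X)
emp_risk = [np.mean(Y != (X > t).astype(int)) for t in H]
t_hat = H[int(np.argmin(emp_risk))]
stoch_error = true_risk(t_hat) - true_risk(t_bar)

print(f"approximation error: {approx_error:.3f}, stochastic error: {stoch_error:.3f}")
```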

1.4 Hoeffding’s Theorem


Our primary building block is the following important result, which allows us to understand
how closely the average of random variables matches their expectation.

Theorem (Hoeffding’s Theorem): Let X1, . . . , Xn be n independent random variables
such that Xi ∈ [0, 1] almost surely. Then for any t > 0,

    IP( | (1/n) Σ_{i=1}^n (Xi − IEXi) | > t ) ≤ 2 e^{−2nt²}.

In other words, deviations from the mean decay exponentially fast in n and t.

Proof. Define centered random variables Zi = Xi − IEXi. It suffices to show that

    IP( (1/n) Σ_i Zi > t ) ≤ e^{−2nt²},

since the lower tail bound follows analogously. (Exercise!)


We apply Chernoff bounds. Since the exponential function is an order-preserving bijection,
we have for any s > 0

    IP( (1/n) Σ_i Zi > t ) = IP( exp(s Σ_i Zi) > e^{stn} ) ≤ e^{−stn} IE[e^{s Σ_i Zi}]   (Markov)
                           = e^{−stn} Π_i IE[e^{sZi}],                                   (1.5)

where in the last equality we have used the independence of the Zi.


We therefore need to control the term IE[e^{sZi}], known as the moment-generating
function of Zi. If the Zi were normally distributed, we could compute the moment-generating
function analytically. The following lemma establishes that we can do something similar
when the Zi are bounded.

Lemma (Hoeffding’s Lemma): If Z ∈ [a, b] almost surely and IEZ = 0, then

    IE[e^{sZ}] ≤ e^{s²(b−a)²/8}.

Proof of Lemma. Consider the log-moment generating function ψ(s) = log IE[e^{sZ}], and note
that it suffices to show that ψ(s) ≤ s²(b − a)²/8. We will investigate ψ by computing the
first several terms of its Taylor expansion. Standard regularity conditions imply that we
can interchange the order of differentiation and integration to obtain

    ψ′(s) = IE[Z e^{sZ}] / IE[e^{sZ}],

    ψ″(s) = ( IE[Z² e^{sZ}] IE[e^{sZ}] − IE[Z e^{sZ}]² ) / IE[e^{sZ}]²
          = IE[ Z² · e^{sZ}/IE[e^{sZ}] ] − ( IE[ Z · e^{sZ}/IE[e^{sZ}] ] )².

Since e^{sZ}/IE[e^{sZ}] integrates to 1, we can interpret ψ″(s) as the variance of Z under the
probability measure dF = (e^{sZ}/IE[e^{sZ}]) dIP. We obtain

    ψ″(s) = var_F(Z) = var_F( Z − (a + b)/2 ),

since the variance is unaffected by shifts. But |Z − (a + b)/2| ≤ (b − a)/2 almost surely since
Z ∈ [a, b] almost surely, so

    var_F( Z − (a + b)/2 ) ≤ IE_F[ (Z − (a + b)/2)² ] ≤ (b − a)²/4.

Finally, the fundamental theorem of calculus (applied twice, using ψ(0) = 0 and
ψ′(0) = IEZ = 0) yields

    ψ(s) = ∫₀ˢ ∫₀ᵘ ψ″(v) dv du ≤ s²(b − a)²/8.

This concludes the proof of the Lemma.

Applying Hoeffding’s Lemma to Equation (1.5), we obtain

    IP( (1/n) Σ_i Zi > t ) ≤ e^{−stn} Π_i e^{s²/8} = e^{ns²/8 − stn},

for any s > 0. Plugging in s = 4t > 0 yields

    IP( (1/n) Σ_i Zi > t ) ≤ e^{−2nt²},

as desired.
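As a quick sanity check (not part of the lecture), the following sketch compares the empirical tail probability of the sample mean of Bernoulli(1/2) variables with the Hoeffding bound 2e^{−2nt²}; the particular choices of n and t are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical check of Hoeffding's inequality for X_i ~ Bernoulli(0.5), which lie in [0, 1].
n, t, trials = 100, 0.1, 100_000
X = rng.binomial(1, 0.5, size=(trials, n))
deviations = np.abs(X.mean(axis=1) - 0.5)

empirical = np.mean(deviations > t)
hoeffding = 2 * np.exp(-2 * n * t**2)    # the bound 2 e^{-2 n t^2}
print(f"P(|mean - 0.5| > {t}) ~ {empirical:.4f}  vs  Hoeffding bound {hoeffding:.4f}")
```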

Hoeffding’s Theorem implies that, for any fixed classifier h, the bound

    |R̂n(h) − R(h)| ≤ √( log(2/δ) / (2n) )

holds with probability 1 − δ. We can immediately apply this formula to yield a maximal
inequality: if H is a finite family, i.e., H = {h1, . . . , hM}, then with probability 1 − δ/M
the bound

    |R̂n(hj) − R(hj)| ≤ √( log(2M/δ) / (2n) )

holds. The event that max_j |R̂n(hj) − R(hj)| > t is the union of the events
|R̂n(hj) − R(hj)| > t for j = 1, . . . , M, so the union bound immediately implies that

    max_j |R̂n(hj) − R(hj)| ≤ √( log(2M/δ) / (2n) )
with probability 1−δ. In other words, for such a family, we can be assured that the empirical
risk and the true risk are close. Moreover, the logarithmic dependence on M implies that
we can increase the size of the family H exponentially quickly with n and maintain the
same guarantees on our estimate.
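To get a feel for the logarithmic dependence on M, the short sketch below evaluates the uniform deviation bound √(log(2M/δ)/(2n)) for a few (hypothetical) class sizes at a fixed sample size.

```python
import numpy as np

# The uniform deviation bound sqrt(log(2M/delta) / (2n)) for a finite class of size M,
# at confidence level 1 - delta.  Note the mild (logarithmic) dependence on M.
def uniform_bound(n, M, delta=0.05):
    return np.sqrt(np.log(2 * M / delta) / (2 * n))

n = 1000
for M in (10, 1000, 10**6):
    print(f"M = {M:>7d}: max_j |R_hat_n(h_j) - R(h_j)| <= {uniform_bound(n, M):.3f}")
```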

MIT OpenCourseWare
http://ocw.mit.edu

18.657 Mathematics of Machine Learning


Fall 2015

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
