Lecture 4 - 01/16/2011
Estimation VS Approximation
Lecturer: Ofer Dekel
1 Some inequalities
Chebyshev's inequality: Let $Z$ be a random variable with expected value $\mu$ and variance $\sigma^2 < \infty$. Then for all $\epsilon > 0$, we have $\Pr(|Z - \mu| \ge \epsilon) \le \sigma^2/\epsilon^2$.

Proof. Let $X = (Z - \mu)^2 \ge 0$; then $E(X) = E[(Z - \mu)^2] = \sigma^2$. From Markov's inequality we have
$$\Pr(X \ge \epsilon^2) = \Pr\big((Z - \mu)^2 \ge \epsilon^2\big) = \Pr(|Z - \mu| \ge \epsilon) \le \frac{E(X)}{\epsilon^2} = \frac{\sigma^2}{\epsilon^2}. \tag{1}$$
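A minimal numerical sketch of this bound, assuming an arbitrary test distribution (an exponential random variable, for which $\mu = \sigma^2 = 1$) and a few values of $\epsilon$:

    # Monte Carlo check of Chebyshev's inequality on an exponential variable
    # (the distribution and the epsilon grid are arbitrary choices).
    import numpy as np

    rng = np.random.default_rng(0)
    Z = rng.exponential(scale=1.0, size=1_000_000)   # mean 1, variance 1
    mu, var = Z.mean(), Z.var()
    for eps in (1.0, 2.0, 3.0):
        tail = np.mean(np.abs(Z - mu) >= eps)        # estimate of Pr(|Z - mu| >= eps)
        bound = var / eps**2                         # Chebyshev bound sigma^2 / eps^2
        print(f"eps={eps}: empirical tail {tail:.4f} <= bound {bound:.4f}")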
Hoeffding's inequality: Let $Z_1, \dots, Z_m$ be independent random variables. Assume that the $Z_i$ are almost surely bounded: $\Pr(Z_i \in [a, b]) = 1$, where $b - a = c$. Then, for the average of these variables $\bar{Z} = \frac{1}{m}\sum_{i=1}^m Z_i$, we have
$$\Pr\big(\bar{Z} - E(\bar{Z}) \ge \epsilon\big) \le \exp\!\left(-\frac{2m\epsilon^2}{c^2}\right) \tag{2}$$
for any $\epsilon > 0$.
Proof. Without loss of generality we assume $E(\bar{Z}) = 0$ and $c = 1$ (or we can just let $Z_i' = \frac{Z_i - E(Z_i)}{c}$). Then from Markov's inequality we have
$$\Pr(\bar{Z} \ge \epsilon) = \Pr\big(e^{4m\epsilon\bar{Z}} \ge e^{4m\epsilon^2}\big) \le \frac{E[e^{4m\epsilon\bar{Z}}]}{e^{4m\epsilon^2}} = \frac{E\big[\prod_i e^{4\epsilon Z_i}\big]}{e^{4m\epsilon^2}} = \frac{\prod_i E[e^{4\epsilon Z_i}]}{e^{4m\epsilon^2}} \le \frac{\prod_i e^{2\epsilon^2}}{e^{4m\epsilon^2}} = e^{-2m\epsilon^2}, \tag{3}$$
where $E\big[\prod_i e^{4\epsilon Z_i}\big] = \prod_i E[e^{4\epsilon Z_i}]$ holds only because the $Z_i$ are independent, and the last inequality uses Hoeffding's lemma, $E[e^{sZ_i}] \le e^{s^2/8}$ for a zero-mean variable bounded in an interval of length $1$ (here $s = 4\epsilon$).
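A small simulation sketch of inequality (2), assuming Bernoulli($1/2$) variables (so $c = 1$) and arbitrary choices of $m$ and $\epsilon$:

    # Compare the empirical tail of the sample mean with the Hoeffding bound
    # exp(-2 m eps^2 / c^2); the Bernoulli(1/2) setup, m and eps are arbitrary.
    import numpy as np

    rng = np.random.default_rng(1)
    m, trials, eps, c = 100, 200_000, 0.1, 1.0
    Zbar = rng.binomial(m, 0.5, size=trials) / m   # sample means of m Bernoulli(1/2) variables
    tail = np.mean(Zbar - 0.5 >= eps)              # estimate of Pr(Zbar - E[Zbar] >= eps)
    bound = np.exp(-2 * m * eps**2 / c**2)
    print(f"empirical tail {tail:.4f} <= Hoeffding bound {bound:.4f}")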
McDiarmid's inequality: Let $Z = (x_1, \dots, x_m)$ be a vector of independent random variables and let $f$ be a function of $m$ variables such that
$$|f(x_1, \dots, x_i, \dots, x_m) - f(x_1, \dots, x_i', \dots, x_m)| \le \frac{d}{m}$$
for any $1 \le i \le m$. In other words, replacing the $i$-th coordinate $x_i$ by some other value changes the value of $f$ by at most $d/m$. Then $f$ has the $\frac{d}{m}$-bounded property and satisfies the following inequality
$$\Pr\big(f(Z) - E[f(Z)] \ge \epsilon\big) \le \exp\!\left(-\frac{2m\epsilon^2}{d^2}\right) \tag{4}$$
for any $\epsilon > 0$.
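As a sketch of the bounded-difference condition, note that the sample mean of values in $[0, d]$ is $\frac{d}{m}$-bounded, which is how Hoeffding's inequality becomes a special case of (4); the check below uses the arbitrary values $d = 1$ and $m = 20$:

    # Verify numerically that the sample mean of values in [0, d] changes by
    # at most d/m when a single coordinate is replaced (d and m are arbitrary).
    import numpy as np

    rng = np.random.default_rng(2)
    d, m = 1.0, 20

    def f(x):                               # f(x_1, ..., x_m) = sample mean
        return x.mean()

    x = rng.uniform(0, d, size=m)
    worst = 0.0
    for i in range(m):
        for new_value in (0.0, d):          # the extremes maximize the change
            x_prime = x.copy()
            x_prime[i] = new_value
            worst = max(worst, abs(f(x) - f(x_prime)))
    print(f"largest observed change {worst:.4f} <= d/m = {d / m:.4f}")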
2 Generalization bounds

2.1 Bound for a single fixed hypothesis
In the context of machine learning theory, let the sample set $S$ play the role of $Z$: each $Z_i = l(h; (x_i, y_i))$ and $f(Z) = \frac{1}{m}\sum_{i=1}^m l(h; (x_i, y_i)) = l(h; S)$. For a fixed hypothesis $h$, a loss function $l \in [0, c]$ and $\epsilon > 0$, we then have
$$\Pr\big(|l(h; D) - l(h; S)| \ge \epsilon\big) \le 2\exp\!\left(-\frac{2m\epsilon^2}{c^2}\right),$$
where $D$ denotes the distribution of the examples $(x, y)$ and $S$ is a sample set drawn from $D$ with size $m$. Let $\delta = 2\exp\!\left(-\frac{2m\epsilon^2}{c^2}\right)$. Then with probability at least $1 - \delta$, we have
$$|l(h; D) - l(h; S)| \le c\sqrt{\frac{\log(2/\delta)}{2m}}.$$
The above inequality says that for each hypothesis $h \in H$, the set of samples $S$ that satisfies the bound $c\sqrt{\frac{\log(2/\delta)}{2m}}$ has probability at least $1 - \delta$. However, these sets may be different for different hypotheses. For a fixed observed sample $S^*$, the inequality may not hold for all hypotheses in $H$, including the hypothesis $h_{ERM} = \arg\min_{h \in H} l(h; S^*)$ with minimum empirical risk. Only some hypotheses in $H$ (not necessarily $h_{ERM}$) will satisfy this inequality.
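Before moving on, a short sketch evaluating the bound $c\sqrt{\log(2/\delta)/(2m)}$ for a few sample sizes, assuming the arbitrary values $c = 1$ and $\delta = 0.05$:

    # Evaluate eps = c * sqrt(log(2/delta) / (2m)) for several sample sizes m
    # (c and delta are arbitrary illustrative choices).
    import math

    c, delta = 1.0, 0.05
    for m in (100, 1_000, 10_000):
        eps = c * math.sqrt(math.log(2 / delta) / (2 * m))
        print(f"m={m:6d}: |l(h;D) - l(h;S)| <= {eps:.4f} with probability >= {1 - delta}")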
2.2 Uniform bound
To overcome the above limitation, we need to derive a uniform bound that holds for all hypotheses in $H$. As shown in Fig. 1, we define the set $\Omega_i = \{S \sim D : |l(h_i; D) - l(h_i; S)| > \epsilon\}$ containing all bad samples for which the bound fails for $h_i$. For each $i$, $\Pr(\Omega_i) \le \delta$. If $|H| = k$, the union bound gives $\Pr(\Omega_1 \cup \dots \cup \Omega_k) \le \sum_{i=1}^k \Pr(\Omega_i)$. As a result, we obtain the uniform bound:
$$\Pr\big(\forall h \in H : |l(h; D) - l(h; S)| \le \epsilon\big) \ge 1 - \sum_i \Pr\big(|l(h_i; D) - l(h_i; S)| > \epsilon\big) \ge 1 - 2k\exp\!\left(-\frac{2m\epsilon^2}{c^2}\right). \tag{5}$$
Figure 1: Set diagram for the sets $\Omega_i$ and the uniformly good set of samples.
Theorem 1. Suppose the size of the hypothesis space is $|H| = k$, the loss function $l \in [0, c]$, and $S$ is a sample set drawn from distribution $D$ with $|S| = m$. Then for all $\delta > 0$ and for all $h \in H$, with probability at least $1 - \delta$,
$$|l(h; D) - l(h; S)| \le c\sqrt{\frac{\log(2/\delta) + \log(k)}{2m}}. \tag{6}$$
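A sketch of how the bound (6) grows with the size $k$ of the hypothesis space, assuming the arbitrary values $c = 1$, $\delta = 0.05$ and $m = 1000$:

    # The uniform bound pays only an additive log(k) inside the square root
    # (c, delta, m and the k grid are arbitrary illustrative choices).
    import math

    c, delta, m = 1.0, 0.05, 1000
    for k in (1, 10, 100, 10_000):
        eps = c * math.sqrt((math.log(2 / delta) + math.log(k)) / (2 * m))
        print(f"k={k:6d}: uniform deviation <= {eps:.4f}")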
2.3 Excess Risk
Let $\epsilon = c\sqrt{\frac{\log(2/\delta) + \log(k)}{2m}}$. By Theorem 1, with probability at least $1 - \delta$ every $h \in H$ satisfies $|l(h; D) - l(h; S)| \le \epsilon$. In particular, for $h^* = \arg\min_{h \in H} l(h; D)$,
$$l(h_{ERM}; D) \le l(h_{ERM}; S) + \epsilon \le l(h^*; S) + \epsilon \le l(h^*; D) + 2\epsilon,$$
so the excess risk of $h_{ERM}$ is at most $2\epsilon$.
Figure 2: The empirical risk $l(h_i; S)$ and the range for the true risk $l(h_i; D)$. The difference between $l(h_{ERM}; D)$ and $l(h^*; D)$ is at most $2\epsilon$, where $h^* = \arg\min_{h \in H} l(h; D)$.
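A small simulation sketch of this excess-risk bound, under an assumed toy setup ($X$ uniform on $[0, 1]$, label $y = 1_{x > 0.3}$, $H$ a grid of $k$ threshold classifiers, zero-one loss so $c = 1$):

    # ERM over k threshold classifiers h_t(x) = 1{x > t}; under this toy
    # distribution the true risk of h_t is |t - 0.3|, so the excess risk of
    # the empirical minimizer can be compared directly with 2*eps.
    import numpy as np

    rng = np.random.default_rng(3)
    m, k, delta = 500, 21, 0.05
    thresholds = np.linspace(0, 1, k)
    x = rng.uniform(0, 1, size=m)
    y = (x > 0.3).astype(int)
    emp_risk = np.array([np.mean((x > t).astype(int) != y) for t in thresholds])
    true_risk = np.abs(thresholds - 0.3)
    excess = true_risk[emp_risk.argmin()] - true_risk.min()
    eps = np.sqrt((np.log(2 / delta) + np.log(k)) / (2 * m))
    print(f"excess risk {excess:.4f} <= 2*eps = {2 * eps:.4f}")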
2.4 Estimation VS Approximation
First we define the Bayesian risk $\min_h l(h; D)$, where the minimum is taken over all possible hypotheses $h$ (not only those in $H$), and the Bayesian hypothesis $h_{Bayes} = \arg\min_h l(h; D)$, a hypothesis that attains the Bayesian risk. Some errors may be inevitable, so the Bayesian risk may be strictly positive.

Take the binary classification task for example, where $y \in \{-1, +1\}$ is the label and the loss function is the zero-one loss $l(h; (x, y)) = 1_{h(x) \ne y}$. The risk of $h$ can be written as:
$$l(h; D) = E[1_{h(x) \ne y}] = E_x\big[E[1_{h(x) \ne y} \mid X = x]\big], \tag{7}$$
where
$$E[1_{h(x) \ne y} \mid X = x] = \Pr(h(x) \ne y \mid X = x) = \begin{cases} \Pr(Y = -1 \mid X = x) & \text{if } h(x) = +1, \\ \Pr(Y = +1 \mid X = x) & \text{if } h(x) = -1. \end{cases} \tag{8}$$
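A sketch computing the Bayesian risk implied by (8) under an assumed posterior $\eta(x) = \Pr(Y = +1 \mid X = x)$: the Bayesian hypothesis predicts the more likely label at each $x$, so its conditional error is $\min(\eta(x), 1 - \eta(x))$.

    # Bayesian risk E_x[min(eta(x), 1 - eta(x))] for an assumed posterior
    # eta(x) = x with X uniform on [0, 1]; the exact value here is 0.25.
    import numpy as np

    x = np.linspace(0, 1, 100_001)
    eta = x                                    # assumed Pr(Y = +1 | X = x)
    cond_error = np.minimum(eta, 1 - eta)      # error of the Bayesian hypothesis at x
    bayes_risk = cond_error.mean()             # average over the uniform grid ~ E_x[...]
    print(f"Bayesian risk = {bayes_risk:.4f}")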
Figure 3: The approximation errors are shown as solid arrows pointing from the Bayesian hypothesis $h_{Bayes}$ to $h_i^*$, where $h_i^* = \arg\min_{h \in H_i} l(h; D)$. The estimation errors are shown as dotted arrows pointing from $h_i^*$ to $h_{ERM}$. Suppose $h_{ERM} \in H_1 \cap H_2$ and $|H_2| \ll |H_1|$. The figure demonstrates the effect of the size of the hypothesis space on the approximation error and the estimation error.
Scenario 2: $k = 3$, $H = \{h_{ERM}, h_1, h_2\}$. Now we have
$$l(h_{ERM}; D) \le \min_{h \in H} l(h; D) + 2c\sqrt{\frac{\log(2/\delta) + \log(3)}{2m}}.$$
Theorem 2. Suppose the size of the hypothesis space is infinite, $|H| = \infty$, the loss function $l \in [0, c]$, and $S$ is a sample set drawn from distribution $D$ with $|S| = m$. Then for all $\delta > 0$ and for all $h \in H$, with probability at least $1 - \delta$,
$$|l(h; S) - l(h; D)| \le \epsilon(\delta). \tag{9}$$
Proof. To apply McDiarmid's inequality (4), we define $f(S) = \max_{h \in H}\,[l(h; D) - l(h; S)]$.

First we show that $f(S)$ is $\frac{c}{m}$-bounded. For any $h$, change one example in $S$ to obtain $S'$. Since $l \in [0, c]$, we have $|l(h; S') - l(h; S)| \le \frac{c}{m}$, while $l(h; D)$ remains the same; therefore $|f(S) - f(S')| \le \frac{c}{m}$.
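A numerical sketch of this $\frac{c}{m}$-bounded property under an assumed toy setup (threshold hypotheses on $[0, 1]$, zero-one loss, so $c = 1$), replacing each example of $S$ in turn with an arbitrary new one:

    # Check |f(S) - f(S')| <= c/m for f(S) = max_h [l(h;D) - l(h;S)] over a
    # small grid of threshold classifiers; the data distribution is the same
    # toy choice as before (X uniform on [0,1], y = 1{x > 0.3}).
    import numpy as np

    rng = np.random.default_rng(4)
    m, c = 50, 1.0
    thresholds = np.linspace(0, 1, 11)
    true_risk = np.abs(thresholds - 0.3)          # l(h_t; D) under the toy distribution

    def f(x, y):
        emp = np.array([np.mean((x > t).astype(int) != y) for t in thresholds])   # l(h_t; S)
        return np.max(true_risk - emp)

    x = rng.uniform(0, 1, size=m)
    y = (x > 0.3).astype(int)
    base = f(x, y)
    worst = 0.0
    for i in range(m):
        x2, y2 = x.copy(), y.copy()
        x2[i], y2[i] = rng.uniform(0, 1), rng.integers(0, 2)   # replace one example arbitrarily
        worst = max(worst, abs(base - f(x2, y2)))
    print(f"largest observed change {worst:.4f} <= c/m = {c / m:.4f}")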
Next we apply McDiarmid's inequality:
$$\Pr\big(f(S) - E[f(S)] \ge \epsilon\big) \le 2\exp\!\left(-\frac{2m\epsilon^2}{c^2}\right) = \delta,$$
so with probability at least $1 - \delta$,
$$f(S) \le E[f(S)] + c\sqrt{\frac{\log(2/\delta)}{2m}},$$
where $E[f(S)] = E_S\big\{\max_{h \in H}[l(h; D) - l(h; S)]\big\}$. The expectation is taken over all possible samples.
Third, we show that $\max_{i \in I} E(X_i) \le E(\max_{i \in I} X_i)$. For any $j \in I$, $X_j \le \max_i X_i$, so $E(X_j) \le E(\max_i X_i)$. Taking the maximum over $j$, we have $\max_j E(X_j) \le E(\max_i X_i)$. Using this lemma, we can show
$$E_S[f(S)] \le \cdots \tag{10}$$
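A quick numerical sketch of the lemma $\max_i E(X_i) \le E(\max_i X_i)$, assuming three independent standard normal coordinates (an arbitrary choice):

    # Estimate both sides of max_i E[X_i] <= E[max_i X_i] by Monte Carlo.
    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.standard_normal(size=(1_000_000, 3))   # each row is one draw of (X_1, X_2, X_3)
    max_of_means = X.mean(axis=0).max()            # max_i E[X_i], estimated (about 0)
    mean_of_max = X.max(axis=1).mean()             # E[max_i X_i], estimated (about 0.85)
    print(f"max_i E[X_i] ~ {max_of_means:.4f} <= E[max_i X_i] ~ {mean_of_max:.4f}")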