
CSE522, Winter 2011, Learning Theory

Lecture 4 - 01/16/2011

Estimation VS Approximation
Lecturer: Ofer Dekel

Scribe: Yanping Huang

1 Some inequalities

Chebyshev's inequality: Let $Z$ be a random variable with expected value $\mu$ and variance $\sigma^2 < \infty$. Then $\forall \epsilon > 0$, we have $\Pr(|Z - \mu| \ge \epsilon) \le \sigma^2/\epsilon^2$.

Proof. Let $X = (Z - \mu)^2 \ge 0$, then $E[X] = E[(Z - \mu)^2] = \sigma^2$. From Markov's inequality we have

$$\Pr(X \ge \epsilon^2) = \Pr((Z - \mu)^2 \ge \epsilon^2) = \Pr(|Z - \mu| \ge \epsilon) \le \frac{E[X]}{\epsilon^2} = \frac{\sigma^2}{\epsilon^2}. \qquad (1)$$
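As a quick numerical sanity check (an illustration added here, not part of the original notes; the exponential distribution and sample size are arbitrary choices), the following Python sketch estimates $\Pr(|Z - \mu| \ge \epsilon)$ by Monte Carlo and compares it with the Chebyshev bound $\sigma^2/\epsilon^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 1.0          # Exp(1) has mean 1 and variance 1
z = rng.exponential(scale=1.0, size=1_000_000)

for eps in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(z - mu) >= eps)   # Monte Carlo estimate of Pr(|Z - mu| >= eps)
    bound = sigma2 / eps**2                      # Chebyshev bound sigma^2 / eps^2
    print(f"eps={eps}: empirical={empirical:.4f} <= bound={bound:.4f}")
```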

Hoeffding's inequality: Let $Z_1, \ldots, Z_m$ be independent random variables. Assume that the $Z_i$ are almost surely bounded: $\Pr(Z_i \in [a, b]) = 1$ where $b - a = c$. Then, for the average of these variables $\bar{Z} = \frac{1}{m}\sum_{i=1}^m Z_i$, we have $\Pr(\bar{Z} - E[\bar{Z}] \ge \epsilon) \le \exp(-\frac{2m\epsilon^2}{c^2})$ for any $\epsilon > 0$.

Proof. Without loss of generality we assume $E[\bar{Z}] = 0$ and $c = 1$ (or we can just let $Z_i' = \frac{Z_i - E[Z_i]}{c}$). Then from Markov's inequality we have

$$\Pr(\bar{Z} \ge \epsilon) = \Pr\big(e^{4m\epsilon\bar{Z}} \ge e^{4m\epsilon^2}\big) \le \frac{E[e^{4m\epsilon\bar{Z}}]}{e^{4m\epsilon^2}} = \frac{E[\prod_i e^{4\epsilon Z_i}]}{e^{4m\epsilon^2}} = \frac{\prod_i E[e^{4\epsilon Z_i}]}{e^{4m\epsilon^2}} \le \frac{\prod_i e^{2\epsilon^2}}{e^{4m\epsilon^2}} = \exp(-2m\epsilon^2), \qquad (2)$$

where the equality $E[\prod_i e^{4\epsilon Z_i}] = \prod_i E[e^{4\epsilon Z_i}]$ holds only because the $Z_i$ are independent, and the last inequality uses Hoeffding's lemma: for a mean-zero $Z_i$ bounded in an interval of length 1, $E[e^{sZ_i}] \le e^{s^2/8}$ (here with $s = 4\epsilon$).
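As an illustration added here (not part of the lecture; the Bernoulli distribution and the particular values of $m$ and $\epsilon$ are arbitrary), the sketch below draws many samples of the average of $m$ Bernoulli$(p)$ variables, for which $c = 1$, and compares the empirical frequency of the deviation $\bar{Z} - E[\bar{Z}] \ge \epsilon$ with the Hoeffding bound $\exp(-2m\epsilon^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, eps, trials = 100, 0.5, 0.1, 200_000   # c = b - a = 1 for Bernoulli variables

# Each row is one sample of Z_1, ..., Z_m; averaging over columns gives Z-bar.
z_bar = rng.binomial(1, p, size=(trials, m)).mean(axis=1)

empirical = np.mean(z_bar - p >= eps)        # frequency of the one-sided deviation
bound = np.exp(-2 * m * eps**2)              # Hoeffding bound with c = 1
print(f"empirical={empirical:.5f} <= bound={bound:.5f}")
```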

McDiarmid's inequality: Suppose $Z_1, \ldots, Z_m$ are independent, let $Z = (Z_1, Z_2, \ldots, Z_m)$, and assume that $f$ satisfies

$$\sup_{z_1, \ldots, z_m, z_i'} |f(z_1, \ldots, z_m) - f(z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_m)| \le \frac{d}{m} \qquad (3)$$

for any $1 \le i \le m$. In other words, replacing the $i$-th coordinate $z_i$ by some other value changes the value of $f$ by at most $d/m$. Then $f$ has the $\frac{d}{m}$-bounded differences property and satisfies the following inequality

$$\Pr(f(Z) - E[f(Z)] \ge \epsilon) \le \exp\Big(-\frac{2m\epsilon^2}{d^2}\Big) \qquad (4)$$

for any $\epsilon > 0$.
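As a small illustration (added here, not from the notes), the empirical mean of $m$ values in $[0, c]$ changes by at most $c/m$ when one coordinate is replaced, so it is $\frac{d}{m}$-bounded with $d = c$ and McDiarmid's bound reduces to Hoeffding's. The sketch below simply evaluates the bound for some arbitrary parameter values.

```python
import numpy as np

def mcdiarmid_bound(m: int, d: float, eps: float) -> float:
    """One-sided McDiarmid tail bound exp(-2*m*eps^2 / d^2) for a d/m-bounded f."""
    return float(np.exp(-2 * m * eps**2 / d**2))

# For f = empirical mean of m values in [0, c], replacing one value moves f by
# at most c/m, so d = c and the bound coincides with Hoeffding's inequality.
m, c, eps = 500, 1.0, 0.05
print(mcdiarmid_bound(m, d=c, eps=eps))   # exp(-2 * 500 * 0.0025 / 1) = exp(-2.5)
```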

2 Generalization Bounds for finite hypothesis space

2.1 Chernoff bound for a fixed hypothesis

In the context of machine learning theory, let the sample set $S$ play the role of $Z$, with each $Z_i = \ell(h; (x_i, y_i))$ and $f(Z) = \frac{1}{m}\sum_{i=1}^m \ell(h; (x_i, y_i)) = \ell(h; S)$. For a fixed hypothesis $h$, a loss function $\ell \in [0, c]$ and $\epsilon > 0$, we then have $\Pr(|\ell(h; D) - \ell(h; S)| \ge \epsilon) \le 2\exp(-\frac{2m\epsilon^2}{c^2})$, where $D$ denotes the distribution of the examples $(x, y)$ and $S$ is a sample set drawn from $D$ with size $m$. Let $\delta = 2\exp(-\frac{2m\epsilon^2}{c^2})$. Then with probability at least $1 - \delta$, we have

$$|\ell(h; D) - \ell(h; S)| \le c\sqrt{\frac{\log(2/\delta)}{2m}}.$$

The above inequality says that for each hypothesis $h \in H$ there is a set of samples, of probability at least $1 - \delta$, on which the bound $c\sqrt{\frac{\log(2/\delta)}{2m}}$ holds. However, these sets may be different for different hypotheses. For a fixed observed sample $S$, this inequality may not hold for all hypotheses in $H$, including the hypothesis $h_{ERM} = \arg\min_{h \in H} \ell(h; S)$ with minimum empirical risk. Only some hypotheses in $H$ (not necessarily $h_{ERM}$) will satisfy this inequality.
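To make the fixed-hypothesis bound concrete, here is a small simulation sketch (added for illustration; the true risk value 0.3 and the zero-one loss with $c = 1$ are arbitrary assumptions). It draws many sample sets of size $m$, computes the empirical risk of the fixed hypothesis on each, and checks that the deviation exceeds $\epsilon = c\sqrt{\log(2/\delta)/(2m)}$ with frequency at most $\delta$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta, c = 200, 0.05, 1.0
true_risk = 0.3                                   # l(h; D) for the fixed hypothesis h

eps = c * np.sqrt(np.log(2 / delta) / (2 * m))    # deviation at confidence 1 - delta

# Each row is one sample set S of size m; the row mean is the empirical risk l(h; S).
losses = rng.binomial(1, true_risk, size=(100_000, m))
empirical_risk = losses.mean(axis=1)

violations = np.mean(np.abs(empirical_risk - true_risk) > eps)
print(f"eps={eps:.4f}, violation frequency={violations:.4f} (should be <= delta={delta})")
```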

2.2 Uniform bound

To overcome the above limitation, we need to derive a uniform bound that holds for all hypotheses in $H$ simultaneously. As shown in Fig. 1, we define $\Omega_i = \{S \sim D : |\ell(h_i; D) - \ell(h_i; S)| > \epsilon\}$ to be the set containing all bad samples for which the bound fails for $h_i$. For each $i$, $\Pr(\Omega_i) \le \delta$. If $|H| = k$, we can write $\Pr(\Omega_1 \cup \cdots \cup \Omega_k) \le \sum_{i=1}^k \Pr(\Omega_i)$. As a result, we obtain the uniform bound:

$$\Pr(\forall h \in H : |\ell(h; D) - \ell(h; S)| \le \epsilon) \ge 1 - \sum_i \Pr(|\ell(h_i; D) - \ell(h_i; S)| > \epsilon) \ge 1 - 2k\exp\Big(-\frac{2m\epsilon^2}{c^2}\Big). \qquad (5)$$

Finally we have the theorem for a uniform bound.

Figure 1: Set diagram for the $\Omega_i$ and the set of samples that are $\epsilon$-uniformly good for all hypotheses.

Theorem 1. Suppose the size of the hypothesis space is $|H| = k$, the loss function $\ell \in [0, c]$, and $S$ is the sample set drawn from distribution $D$ with $|S| = m$. Then $\forall \delta > 0$ and $\forall h \in H$, with probability at least $1 - \delta$,

$$|\ell(h; D) - \ell(h; S)| \le c\sqrt{\frac{\log(2/\delta) + \log(k)}{2m}}. \qquad (6)$$
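As an illustrative sketch (added here; the parameter values are arbitrary), the code below evaluates the right-hand side of (6) to show that the uniform bound grows only logarithmically with the number of hypotheses $k$.

```python
import numpy as np

def uniform_bound(k: int, m: int, delta: float, c: float = 1.0) -> float:
    """Right-hand side of (6): deviation valid simultaneously for all k hypotheses."""
    return c * np.sqrt((np.log(2 / delta) + np.log(k)) / (2 * m))

m, delta = 1000, 0.05
for k in [1, 100, 10**6, 10**10]:
    print(f"k={k:>12}: eps={uniform_bound(k, m, delta):.4f}")
```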

2.3 Excess Risk

Define the excess risk of any hypothesis $h \in H$ to be $\ell(h; D) - \min_{h' \in H} \ell(h'; D)$. From Theorem 1 we have $\ell(h_{ERM}; D) \le \min_{h \in H} \ell(h; D) + 2\epsilon$, as shown in Fig. 2, where $\epsilon = c\sqrt{\frac{\log(2/\delta) + \log(k)}{2m}}$. Indeed, writing $h^* = \arg\min_{h \in H} \ell(h; D)$, with probability at least $1 - \delta$ we have $\ell(h_{ERM}; D) \le \ell(h_{ERM}; S) + \epsilon \le \ell(h^*; S) + \epsilon \le \ell(h^*; D) + 2\epsilon$.

Figure 2: The empirical risk $\ell(h_i; S)$ and the range for the true risk $\ell(h_i; D)$. The difference between $\ell(h_{ERM}; D)$ and $\ell(h^*; D)$ is at most $2\epsilon$, where $h^* = \arg\min_{h \in H} \ell(h; D)$.
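A small simulation sketch (added for illustration; the class of threshold classifiers, the uniform data distribution, and the 10% label noise are all arbitrary assumptions, with the zero-one loss so $c = 1$): it repeatedly selects $h_{ERM}$ from a finite class by empirical risk and checks how often its excess risk exceeds $2\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta, eta = 500, 0.05, 0.1                       # sample size, confidence, label noise
thresholds = np.linspace(0.0, 1.0, 21)               # finite class H of threshold classifiers
k = len(thresholds)

# True zero-one risk of h_t(x) = sign(x - t) when x ~ U(0,1) and y = sign(x - 0.35)
# flipped with probability eta: risk(t) = eta + (1 - 2*eta) * |t - 0.35|.
true_risk = eta + (1 - 2 * eta) * np.abs(thresholds - 0.35)
eps = np.sqrt((np.log(2 / delta) + np.log(k)) / (2 * m))   # Theorem 1 with c = 1

excess = []
for _ in range(2000):
    x = rng.uniform(0, 1, m)
    y = np.where(x > 0.35, 1, -1) * np.where(rng.uniform(size=m) < eta, -1, 1)
    # Empirical risk of every hypothesis on this sample; ERM picks the smallest.
    emp_risk = [(np.where(x > t, 1, -1) != y).mean() for t in thresholds]
    excess.append(true_risk[int(np.argmin(emp_risk))] - true_risk.min())

print(f"fraction of trials with excess risk > 2*eps: {np.mean(np.array(excess) > 2 * eps):.4f}"
      f"  (should be at most delta={delta})")
```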

2.4 Estimation VS Approximation

First we define the Bayes risk $\min_{\text{all possible } h} \ell(h; D)$ and the Bayes hypothesis $\arg\min_h \ell(h; D)$, a hypothesis that attains the Bayes risk. Sometimes some error is inevitable, and the Bayes risk may be strictly positive.

Take the binary classification task for example, where $y \in \{-1, +1\}$ is the label and the loss function is the zero-one loss $\ell(h; (x, y)) = 1_{h(x) \ne y}$. The risk of $h$ can be written as

$$\ell(h; D) = E[1_{h(x) \ne y}] = E_x\big[E[1_{h(x) \ne y} \mid X = x]\big], \qquad (7)$$

where

$$E[1_{h(x) \ne y} \mid X = x] = \Pr(h(x) \ne y \mid X = x) = \begin{cases} \Pr(Y = -1 \mid X = x) & \text{if } h(x) = +1 \\ \Pr(Y = +1 \mid X = x) & \text{if } h(x) = -1. \end{cases}$$

We have the Bayes hypothesis that minimizes the above risk,

$$h_{Bayes}(x) = \begin{cases} +1 & \text{if } \Pr(Y = +1 \mid X = x) > 0.5 \\ -1 & \text{otherwise}, \end{cases} \qquad (8)$$

if we know the distribution $D = \Pr(Y \mid X)\Pr(X)$.
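As an illustration (added here; the small discrete joint distribution is invented for this example), the sketch below forms $h_{Bayes}$ according to (8) and evaluates the resulting Bayes risk, which is strictly positive whenever neither label is certain given $x$.

```python
import numpy as np

# A made-up joint distribution over X in {0, 1, 2} and Y in {-1, +1}:
p_x = np.array([0.5, 0.3, 0.2])                 # Pr(X = x)
p_y_plus_given_x = np.array([0.9, 0.4, 0.55])   # Pr(Y = +1 | X = x)

# Bayes hypothesis (8): predict +1 whenever Pr(Y = +1 | X = x) > 0.5.
h_bayes = np.where(p_y_plus_given_x > 0.5, +1, -1)

# The conditional risk of h_Bayes at x is the probability of the minority label,
# so the Bayes risk is E_x[min(Pr(Y=+1|x), Pr(Y=-1|x))].
bayes_risk = np.sum(p_x * np.minimum(p_y_plus_given_x, 1 - p_y_plus_given_x))
print(h_bayes, f"Bayes risk = {bayes_risk:.3f}")
```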


For a hypothesis space $H$, we define the approximation error as $\ell(h^*; D) - \ell(h_{Bayes}; D)$ and the estimation error as $\ell(h_{ERM}; D) - \ell(h^*; D)$, where $h^* = \arg\min_{h \in H} \ell(h; D)$.

We observe that as the size of the hypothesis space $k = |H|$ increases, the approximation error may decrease while the estimation error will increase. Consider the following two scenarios, demonstrated in Fig. 3.

Scenario 1: $k = 10^{10}$, $h_{ERM} = \arg\min_{h \in H} \ell(h; S)$. Then with probability at least $1 - \delta$, from the excess risk theorem we have

$$\ell(h_{ERM}; D) \le \min_{h \in H} \ell(h; D) + 2c\sqrt{\frac{\log(2/\delta) + \log(10^{10})}{2m}}.$$

Figure 3: The approximation errors are shown as solid arrows pointing from the Bayes hypothesis $h_{Bayes}$ to $h_i^*$, where $h_i^* = \arg\min_{h \in H_i} \ell(h; D)$. The estimation errors are shown as dotted arrows pointing from $h_i^*$ to $h_{ERM}$. Suppose $h_{ERM} \in H_1 \cap H_2$ and $|H_2| \ll |H_1|$. The figure demonstrates the effect of the size of the hypothesis space on the approximation error and the estimation error.
Scenario 2: $k = 3$, $H = \{h_{ERM}, h_1, h_2\}$. Now we have

$$\ell(h_{ERM}; D) \le \min_{h \in H} \ell(h; D) + 2c\sqrt{\frac{\log(2/\delta) + \log(3)}{2m}}.$$
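The two estimation-error terms can be compared directly; the sketch below (illustrative, with arbitrary values of $m$, $\delta$ and $c = 1$) evaluates $2c\sqrt{(\log(2/\delta) + \log k)/(2m)}$ for $k = 10^{10}$ and $k = 3$.

```python
import numpy as np

def estimation_term(k: float, m: int, delta: float, c: float = 1.0) -> float:
    """The 2*epsilon term from the excess risk bound for a class of size k."""
    return 2 * c * np.sqrt((np.log(2 / delta) + np.log(k)) / (2 * m))

m, delta = 2000, 0.05
print(f"k=10^10: {estimation_term(1e10, m, delta):.4f}")   # Scenario 1: large class
print(f"k=3    : {estimation_term(3, m, delta):.4f}")      # Scenario 2: small class
```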

3 Generalization Bound for infinite hypothesis space

Theorem 2. Suppose the size of the hypothesis space is infinite, $|H| = \infty$, the loss function $\ell \in [0, c]$, and $S$ is the sample set drawn from distribution $D$ with $|S| = m$. Then $\forall \delta > 0$ and $\forall h \in H$, with probability at least $1 - \delta$,

$$|\ell(h; S) - \ell(h; D)| \le \epsilon(\delta). \qquad (9)$$

Proof. To apply McDiarmid's inequality (4), we define $f(S) = \max_{h \in H}[\ell(h; D) - \ell(h; S)]$.

First we show that $f(S)$ is $\frac{c}{m}$-bounded. For any $h$, change one example in $S$ to obtain $S'$. Since $\ell \in [0, c]$, we have $|\ell(h; S') - \ell(h; S)| \le \frac{c}{m}$, and $\ell(h; D)$ remains the same, so $|f(S) - f(S')| \le \frac{c}{m}$.

Next we apply McDiarmid's inequality,

$$\Pr(|f(S) - E[f(S)]| \ge \epsilon) \le 2\exp\Big(-\frac{2m\epsilon^2}{c^2}\Big) = \delta,$$

so with probability at least $1 - \delta$,

$$f(S) \le E[f(S)] + c\sqrt{\frac{\log(2/\delta)}{2m}},$$

where $E[f(S)] = E_S\{\max_{h \in H}[\ell(h; D) - \ell(h; S)]\}$. The expectation is taken over all possible samples.

Third, we show that $\max_{i \in I} E(X_i) \le E(\max_{i \in I} X_i)$. For any $j \in I$, $X_j \le \max_i X_i$, so $E(X_j) \le E(\max_i X_i)$; taking the maximum over $j$ gives $\max_j E(X_j) \le E(\max_i X_i)$. Using this lemma and the fact that $\ell(h; D) = E_{S'}[\ell(h; S')]$ for a fresh sample $S'$, we can show

$$E_S[f(S)] = E_S\Big[\max_h [\ell(h; D) - \ell(h; S)]\Big] = E_S\Big[\max_h \big[E_{S'}[\ell(h; S')] - \ell(h; S)\big]\Big] \le E_S E_{S'}\Big[\max_h \big(\ell(h; S') - \ell(h; S)\big)\Big]. \qquad (10)$$
