Discussion Notes 2-6

Review of entropy and Han's inequality, and their application to the Efron-Stein inequality; applications of these concepts in turn, including VC dimension, conjugate dual functions, and Rademacher complexity.

STAT210B Notes

Matt Olfat
February 13, 2017

Efron-Stein Inequality
Let $X_1, \dots, X_n$ be independent random variables, and let $Z = f(X_1, \dots, X_n)$. Then, we have that:
$\mathrm{Var}(Z) \le E\Big[\sum_{i=1}^n \big(E_i[Z^2] - E_i[Z]^2\big)\Big]$,
where $E_i[\cdot]$ denotes expectation over $X_i$ with the other coordinates held fixed.
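As a quick numerical illustration (not part of the original notes), the sketch below checks the bound by Monte Carlo for the assumed example $f = \max$ of $n$ i.i.d. Uniform(0,1) variables. It uses the resampled-copy form $\frac{1}{2}\sum_i E[(Z - Z_i')^2]$ with $Z_i'$ obtained by redrawing the $i$th coordinate, which coincides with the right-hand side above since $E[(Z - Z_i')^2 \mid X^{(i)}] = 2\big(E_i[Z^2] - E_i[Z]^2\big)$.

```python
# Monte Carlo sanity check of the Efron-Stein bound (sketch; f = max is an assumed example).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 200_000

X = rng.uniform(size=(trials, n))
Z = X.max(axis=1)

rhs = 0.0
for i in range(n):
    Xi = X.copy()
    Xi[:, i] = rng.uniform(size=trials)   # resample only the i-th coordinate
    Zi = Xi.max(axis=1)
    rhs += 0.5 * np.mean((Z - Zi) ** 2)

print(f"Var(Z)            ~ {Z.var():.5f}")
print(f"Efron-Stein bound ~ {rhs:.5f}")   # should be an upper bound on Var(Z)
```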

We will show that, writing $\mathrm{Ent}_\phi(Z) = E[\phi(Z)] - \phi(E[Z])$:

$\phi(x) = x^2$: $\mathrm{Ent}_\phi(Z) = E[Z^2] - (E[Z])^2 = \mathrm{Var}(Z)$;

$\phi(x) = x\log x$: $\mathrm{Ent}_\phi(Z) \le E\big[\sum_{i=1}^n \big(E_i[\phi(Z)] - \phi(E_i[Z])\big)\big]$.
Let $X$ be a random variable taking values in a countable set $\mathcal{X}$.

Definition 1 The Shannon entropy of $X$ is $H(X) = E[-\log p(X)] = -\sum_{x\in\mathcal{X}} p(x)\log p(x)$.

Definition 2 If $P, Q$ are probability distributions on $\mathcal{X}$, then $D(P\|Q) = \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)} \ge 0$.
 
This can be shown using $\log t \le t - 1$ and the convention $0\log 0 = 0$: consider $D(P\|Q) \ge \sum_{x\in\mathcal{X}:\,p(x)>0} p(x)\Big(1 - \frac{q(x)}{p(x)}\Big) \ge 0$.
Entropy is maximized by the uniform distribution: let $|\mathcal{X}| < \infty$ and $q(x) = \frac{1}{|\mathcal{X}|}$. Then $D(P\|Q) = \log|\mathcal{X}| - H(P) \ge 0$, so $H(P) \le \log|\mathcal{X}|$.
Also, $D(P\|Q) = 0$ iff $P = Q$.
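A minimal numerical sketch of these definitions (not from the notes; the distribution $p$ is an arbitrary example). It computes $H(P)$ and $D(P\|Q)$ and checks the identity $H(P) = \log|\mathcal{X}| - D(P\|\mathrm{Unif})$, so in particular $H(P) \le \log|\mathcal{X}|$.

```python
# Sketch: Shannon entropy, KL divergence, and the uniform-maximizes-entropy identity.
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.125, 0.125])   # example distribution on |X| = 4 points
u = np.full(4, 0.25)                      # uniform distribution

print(shannon_entropy(p))                 # H(P) in nats
print(np.log(4) - kl_divergence(p, u))    # equals H(P)
print(shannon_entropy(p) <= np.log(4))    # True: H(P) <= log|X|
```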

Entropy on Product Spaces


If $X, Y$ are random variables taking values in $\mathcal{X}, \mathcal{Y}$, respectively, they have some joint probability $P(x,y)$ on $\mathcal{X}\times\mathcal{Y}$: $H(X,Y) = -\sum_{x\in\mathcal{X},\,y\in\mathcal{Y}} p(x,y)\log p(x,y)$.

Definition 3 The mutual information between $X$ and $Y$ is $I(X,Y) = H(X) + H(Y) - H(X,Y)$.
This measures how dependent the variables are.

Now, the marginal is $P_X(x) = \sum_{y\in\mathcal{Y}} P(X=x, Y=y)$. Then,
$I(X,Y) = \sum_{x\in\mathcal{X},\,y\in\mathcal{Y}} p(x,y)\log\frac{p(x,y)}{p_X(x)\,p_Y(y)} = D(P\,\|\,P_X\otimes P_Y)$,
where $\otimes$ denotes the product measure, $(P_X\otimes P_Y)(X=x, Y=y) = P_X(X=x)P_Y(Y=y)$. This is zero when $X, Y$ are independent.

Definition 4 The conditional entropy of $X$ given $Y$ is $H(X|Y) = H(X,Y) - H(Y) = E_Y[H(P(X \mid Y))]$.

This leads to $D(P_{XY}\,\|\,P_X\otimes P_Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) \ge 0$, i.e. $H(X) \ge H(X|Y)$ (conditioning reduces entropy). All of this together gives us the chain rule $H(X_1,\dots,X_m) = H(X_1) + H(X_2|X_1) + H(X_3|X_1,X_2) + \cdots$.
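The following sketch (the $2\times 2$ joint table is an assumed example, not from the notes) computes the mutual information both from the entropy identity $I(X,Y) = H(X) + H(Y) - H(X,Y)$ and as $D(P_{XY}\,\|\,P_X\otimes P_Y)$, confirming the two expressions agree.

```python
# Sketch: mutual information from a small joint distribution, computed two ways.
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

P = np.array([[0.3, 0.1],
              [0.1, 0.5]])              # joint distribution of (X, Y)
Px, Py = P.sum(axis=1), P.sum(axis=0)   # marginals

I_def = H(Px) + H(Py) - H(P.ravel())
I_kl  = np.sum(P * np.log(P / np.outer(Px, Py)))

print(I_def, I_kl)                      # identical up to floating-point rounding
```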

Theorem 5 (Han's Inequality) Let $X_1,\dots,X_n$ be discrete random variables. Then,
$H(X_1,\dots,X_n) \le \frac{1}{n-1}\sum_{i=1}^n H(X_1,\dots,X_{i-1},X_{i+1},\dots,X_n)$.

Proof. $H(X_1,\dots,X_n) = H(X_1,\dots,X_{i-1},X_{i+1},\dots,X_n) + H(X_i \mid X_1,\dots,X_{i-1},X_{i+1},\dots,X_n)$.

Sum this over all $i$ to get:
$n H(X_1,\dots,X_n) = \sum_{i=1}^n H(X_1,\dots,X_{i-1},X_{i+1},\dots,X_n) + \sum_{i=1}^n H(X_i \mid X_1,\dots,X_{i-1},X_{i+1},\dots,X_n)$
$\le \sum_{i=1}^n H(X_1,\dots,X_{i-1},X_{i+1},\dots,X_n) + \sum_{i=1}^n H(X_i \mid X_1,\dots,X_{i-1})$
$= \sum_{i=1}^n H(X_1,\dots,X_{i-1},X_{i+1},\dots,X_n) + H(X_1,\dots,X_n)$,
where the last equality is the chain rule. Subtracting $H(X_1,\dots,X_n)$ from both sides and dividing by $n-1$ gives the claim.

This is tight when the variables are independent.
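A small numerical check of Theorem 5 (the random joint distribution is an assumption for illustration): for a joint distribution on $\{0,1\}^3$ it compares $H(X_1,\dots,X_n)$ with $\frac{1}{n-1}\sum_i H(X^{(i)})$.

```python
# Sketch: Han's inequality on a random joint distribution over {0,1}^3.
import numpy as np

rng = np.random.default_rng(1)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

n, k = 3, 2
P = rng.random((k,) * n)
P /= P.sum()                                       # random joint distribution

lhs = H(P.ravel())                                 # H(X_1, ..., X_n)
rhs = sum(H(P.sum(axis=i).ravel()) for i in range(n)) / (n - 1)
print(lhs, "<=", rhs)                              # Han's inequality
```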


Consider the binary hypercube $\{-1,1\}^n$. The Hamming distance measures the number of differing entries in two vectors. Also consider a graph $G = (V, E)$, where each corner of the cube is a node and adjacent corners (those at Hamming distance 1) have edges between them. Then $|V| = 2^n$, $|E| = n\,2^{n-1}$. Also, $\frac{|E|}{|V|} = n/2 = \frac{\log_2|V|}{2}$.
We want: for every $A \subseteq V = \{-1,1\}^n$, $|E(A)| \le \frac{|A|}{2}\log_2|A|$, where $E(A)$ denotes the edges with both endpoints in $A$.
Proof. Let $X$ have the uniform distribution over $A$. By Han's inequality, $\sum_{i=1}^n \big(H(X) - H(X^{(i)})\big) \le H(X)$, and $H(X) - H(X^{(i)}) = H(X_i \mid X^{(i)})$ is the entropy of the $i$th coordinate given everything else. This is also equal to $-\sum_{x\in A} p(x)\log p(x_i \mid x^{(i)})$. However, $p(x_i \mid x^{(i)}) = 1/2$ if $\bar{x}^{(i)} := (x_1,\dots,x_{i-1},-x_i,x_{i+1},\dots,x_n) \in A$, and is $1$ otherwise. We have:
$\sum_{i=1}^n \big(H(X) - H(X^{(i)})\big) = \frac{\log 2}{|A|}\sum_{x\in A}\sum_{i=1}^n \mathbf{1}\{\bar{x}^{(i)} \in A\} = \frac{\log 2}{|A|}\,2|E(A)| \le H(X) = \log|A|$,
which rearranges to $|E(A)| \le \frac{|A|}{2}\log_2|A|$.
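A brute-force sketch (illustrative only, not from the notes) of the edge-counting bound just proved: for random subsets $A$ of $\{-1,1\}^4$ it counts $|E(A)|$ and compares with $\frac{|A|}{2}\log_2|A|$.

```python
# Sketch: check |E(A)| <= (|A|/2) * log2|A| for random subsets A of the hypercube.
import itertools, math, random

n = 4
cube = list(itertools.product([-1, 1], repeat=n))

def edges_within(A):
    # count hypercube edges with both endpoints in A
    A = set(A)
    count = 0
    for x in A:
        for i in range(n):
            y = x[:i] + (-x[i],) + x[i + 1:]   # neighbor: flip coordinate i
            if y in A:
                count += 1
    return count // 2                          # each edge was counted twice

random.seed(0)
for _ in range(5):
    A = random.sample(cube, k=random.randint(1, 2 ** n))
    lhs = edges_within(A)
    rhs = (len(A) / 2) * math.log2(len(A))
    print(lhs, "<=", round(rhs, 3))
```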
Let $\mathcal{X}$ be a countable set and let $P, Q$ be probability distributions on $\mathcal{X}^n$. Assume $P = P_1\otimes P_2\otimes\cdots\otimes P_n$ is a product measure, and denote by $q^{(i)}(x^{(i)})$ the marginal of $Q$ on $(x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)$, and $p^{(i)}(x^{(i)}) = p_1(x_1)\cdots p_{i-1}(x_{i-1})\,p_{i+1}(x_{i+1})\cdots p_n(x_n)$.
Theorem 6 (Han's Inequality) $D(Q\|P) \ge \frac{1}{n-1}\sum_{i=1}^n D(Q^{(i)}\|P^{(i)})$,
or equivalently $D(Q\|P) \le \sum_{i=1}^n \big(D(Q\|P) - D(Q^{(i)}\|P^{(i)})\big)$.
Proof. The original theorem gave us $H(Q) \le \frac{1}{n-1}\sum_{i=1}^n H(Q^{(i)})$. Now, we have
$\sum_{x\in\mathcal{X}^n} q(x)\log q(x) \ge \frac{1}{n-1}\sum_{i=1}^n \sum_{x^{(i)}\in\mathcal{X}^{n-1}} q^{(i)}(x^{(i)})\log q^{(i)}(x^{(i)})$.
We want to show that
$\sum_{x\in\mathcal{X}^n} q(x)\log p(x) = \frac{1}{n-1}\sum_{i=1}^n \sum_{x^{(i)}\in\mathcal{X}^{n-1}} q^{(i)}(x^{(i)})\log p^{(i)}(x^{(i)})$.
We may use the fact that $p$ is a product measure to get
$\sum_{x\in\mathcal{X}^n} q(x)\log p(x) = \frac{1}{n}\sum_{i=1}^n \sum_{x\in\mathcal{X}^n} q(x)\log\big(p^{(i)}(x^{(i)})\,p_i(x_i)\big) = \frac{1}{n}\sum_{i=1}^n \sum_{x\in\mathcal{X}^n} q(x)\log p_i(x_i) + \frac{1}{n}\sum_{i=1}^n \sum_{x\in\mathcal{X}^n} q(x)\log p^{(i)}(x^{(i)})$.
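A numerical sketch of Theorem 6 (the random $Q$ and random product measure $P$ are assumptions for the example): it compares $D(Q\|P)$ with $\frac{1}{n-1}\sum_i D(Q^{(i)}\|P^{(i)})$ for distributions on $\{0,1\}^3$.

```python
# Sketch: Han's inequality for relative entropies with a product reference measure P.
import numpy as np

rng = np.random.default_rng(4)
n, k = 3, 2

def kl(q, p):
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

Q = rng.random((k,) * n); Q /= Q.sum()             # arbitrary joint Q on X^n
marg = [rng.random(k) for _ in range(n)]
marg = [m / m.sum() for m in marg]
P = np.einsum('i,j,k->ijk', *marg)                 # product measure P1 x P2 x P3

lhs = kl(Q.ravel(), P.ravel())
rhs = sum(kl(Q.sum(axis=i).ravel(), P.sum(axis=i).ravel()) for i in range(n)) / (n - 1)
print(lhs, ">=", rhs)
```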

February 13th

Let $Z_i' = f(X_1,\dots,X_{i-1},X_i',X_{i+1},\dots,X_n)$, where $X_i'$ is an independent copy of $X_i$. Then, by symmetry,
$E[(Z - Z_i')^2] = E[(Z - Z_i')^2\mathbf{1}\{Z \ge Z_i'\}] + E[(Z - Z_i')^2\mathbf{1}\{Z < Z_i'\}] = 2E[(Z - Z_i')_+^2] = 2E[(Z - Z_i')_-^2]. \quad (1)$
This gives us various versions of Efron-Stein.

Definition 7 Let $f : \mathcal{X}^n \to [0,\infty)$. $f$ is self-bounded if for all $i$ there exist $f_i : \mathcal{X}^{n-1} \to [0,\infty)$ such that $0 \le f(x) - f_i(x^{(i)}) \le 1$ and $\sum_{i=1}^n \big(f(x) - f_i(x^{(i)})\big) \le f(x)$.

Corollary 8 Let $Z = f(X_1,\dots,X_n)$ with $X_1,\dots,X_n$ independent, $f$ self-bounded, and $Z \in L^2$. Then,
$\mathrm{Var}(Z) \le E[Z]$.

Proof. The $f_i$ are given. By Efron-Stein, $\mathrm{Var}[Z] \le \sum_{i=1}^n E[(f(X) - f_i(X^{(i)}))^2] \le \sum_{i=1}^n E[f(X) - f_i(X^{(i)})] \le E[f(X)] = E[Z]$.
So self-bounded functions have variances smaller than their expected values. One application
of this is in relative stability:
Definition 9 We say that nonnegative random variables $Z_n$ are relatively stable if $\frac{Z_n}{E[Z_n]} \to 1$ in probability as $n$ increases.

Thus, the expectation is all we need to know about the magnitude of $Z_n$. If we assume that $\mathrm{Var}[Z_n] \le E[Z_n]$, then
$P\Big(\Big|\frac{Z_n}{E[Z_n]} - 1\Big| \ge \epsilon\Big) \le \frac{\mathrm{Var}[Z_n]}{\epsilon^2 E[Z_n]^2} \le \frac{1}{\epsilon^2 E[Z_n]}$,
so $Z_n$ is relatively stable whenever $E[Z_n] \to \infty$.
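A concrete sketch of relative stability (the Binomial choice is our own example, not from the notes): $Z_n \sim \mathrm{Binomial}(n,p)$ has $\mathrm{Var}[Z_n] = np(1-p) \le np = E[Z_n]$, and the simulation shows $P(|Z_n/E[Z_n] - 1| > 0.1)$ shrinking as $n$ grows.

```python
# Sketch: a variable with Var[Z_n] <= E[Z_n]; the ratio Z_n / E[Z_n] concentrates near 1.
import numpy as np

rng = np.random.default_rng(5)
p, trials = 0.3, 20_000

for n in [10, 100, 1000, 10000]:
    Z = rng.binomial(n, p, size=trials)          # Var = np(1-p) <= np = E[Z]
    ratio = Z / (n * p)
    print(n, np.mean(np.abs(ratio - 1) > 0.1))   # P(|Z/E[Z] - 1| > 0.1) shrinks with n
```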
Now, we move on to configuration functions. First, we say that a property $\Pi$ is defined over a union of finite products of a set $\mathcal{X}$: let $\Pi_1 \subseteq \mathcal{X}_1$, $\Pi_2 \subseteq \mathcal{X}_1\times\mathcal{X}_2$, and so on. We say that $(x_1,\dots,x_n) \in \mathcal{X}^n$ satisfies the property if $(x_1,\dots,x_n) \in \Pi_n$. $\Pi$ is hereditary (monotone) if the following holds: if $(x_1,\dots,x_n) \in \Pi_n$, then for any $i_1 < \dots < i_k$, $(x_{i_1},\dots,x_{i_k}) \in \Pi_k$. Let $f : \bigcup_i \mathcal{X}^i \to \mathbb{N}$ map any string $x = (x_1,\dots,x_n)$ to the size of the maximal sub-string of $x$ that satisfies $\Pi$. Then $f$ is called a configuration function.

Corollary 10 If $f$ is a configuration function, then $f$ is self-bounded.

Proof. Let $f_i$ be $f$ applied to the remaining $n-1$ coordinates, and $x = (x_1,\dots,x_n) \in \mathcal{X}^n$. Let $i_1,\dots,i_k$ be such that $(x_{i_1},\dots,x_{i_k})$ is a maximal sub-string satisfying $\Pi$, so $f_i(x^{(i)}) = f(x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)$. Then, clearly, $0 \le f - f_i \le 1$, and $f$ and $f_i$ can only differ for $i \in \{i_1,\dots,i_k\}$, so $\sum_{i=1}^n \big(f(x) - f_i(x^{(i)})\big) \le k = f(x)$. Then, by Corollary 8, $\mathrm{Var}(f(X)) \le E[f(X)]$.
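A small sketch (the function is our own example, not from the notes) checking the self-bounding conditions for a simple configuration function: $f(x)$ = number of distinct values among $x_1,\dots,x_n$, whose hereditary property is "all selected entries are distinct".

```python
# Sketch: verify the self-bounding conditions for f(x) = number of distinct values.
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return len(set(x))

x = list(rng.integers(0, 5, size=10))
drops = [f(x) - f(x[:i] + x[i + 1:]) for i in range(len(x))]

print(all(0 <= d <= 1 for d in drops))   # 0 <= f(x) - f_i(x^{(i)}) <= 1
print(sum(drops) <= f(x))                # sum_i (f(x) - f_i(x^{(i)})) <= f(x)
```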
VC dimensions are an example of this. Let $\mathcal{A}$ be a collection of subsets of $\mathcal{X}$ and let $X = (x_1,\dots,x_n) \in \mathcal{X}^n$. The trace is $\mathrm{Tr}(X) = \{A \cap \{x_1,\dots,x_n\} : A \in \mathcal{A}\}$. We can see that, depending on $\mathcal{A}$, the trace of $X$ may not capture the richness of the entire power set of $\{x_1,\dots,x_n\}$ (consider $X$ a collection of points and $\mathcal{A}$ the set of half-spaces facing to the right). The shatter coefficient (the growth coefficient) of $X$ is $|\mathrm{Tr}(X)|$. A subset $\{x_{i_1},\dots,x_{i_k}\} \subseteq \{x_1,\dots,x_n\}$ is shattered by $\mathcal{A}$ if its trace is equal to its power set. The VC dimension of $\mathcal{A}$ with respect to that particular $X$, denoted $D(X)$, is the size of the maximal subset shattered by $\mathcal{A}$. Clearly, the property of being shattered is hereditary, so the VC dimension is a configuration function and we have $\mathrm{Var}[D(X)] \le E[D(X)]$. This is an example of an empirical process that concentrates.
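A toy sketch (the class of left half-lines on the real line is an assumed example) computing the trace and shatter coefficient of a point set, and exhibiting VC dimension 1 for this class.

```python
# Sketch: trace, shatter coefficient, and VC dimension for the class {(-inf, t]}.
def trace(points, thresholds):
    # all subsets of `points` picked out by half-lines (-inf, t], t in `thresholds`
    return {frozenset(p for p in points if p <= t) for t in thresholds}

def shatters(points, thresholds):
    # shattered means the trace equals the full power set of `points`
    return len(trace(points, thresholds)) == 2 ** len(points)

X = [0.3, 1.7, 2.5, 4.0]
T = [-1.0, 0.5, 2.0, 3.0, 5.0]

print(len(trace(X, T)))                             # 5 = |X| + 1 traced subsets
print(shatters([1.0], T), shatters([1.0, 2.0], T))  # True, False -> VC dimension is 1
```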
Another key example of an empirical process that concentrates is the Rademacher complexity. Let $x_1,\dots,x_n$ be independent uniform $[0,1]^d$ random variables and $\epsilon_1,\dots,\epsilon_n$ i.i.d. Rademacher$(\frac{1}{2})$. Let $Z = E\big[\max_{k\in[d]} \sum_{j=1}^n \epsilon_j (x_j^\top e_k) \mid x_1,\dots,x_n\big]$. $Z$ has the self-bounding property, as removing one element from the summation inside the maximization can only decrease the total value by less than one, i.e. $0 \le Z - Z_i \le 1$. Furthermore, $\sum_i (Z - Z_i) = \sum_i \big(E[\max_{k\in[d]} \sum_j \epsilon_j (x_j^\top e_k)] - E[\max_{k\in[d]} \sum_{j\ne i} \epsilon_j (x_j^\top e_k)]\big) \le n(E[\max$
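A Monte Carlo sketch (the dimensions, sample sizes, and data law are assumptions for illustration) estimating $Z$ for one fixed draw of $x_1,\dots,x_n$, i.e. the conditional expectation over the Rademacher signs of the maximum coordinate sum.

```python
# Sketch: estimate Z = E[ max_k sum_j eps_j * x_j[k] | x_1,...,x_n ] by resampling signs.
import numpy as np

rng = np.random.default_rng(3)
n, d, mc = 50, 5, 10_000

x = rng.uniform(size=(n, d))              # fixed data x_1,...,x_n in [0,1]^d
eps = rng.choice([-1, 1], size=(mc, n))   # mc independent draws of the Rademacher signs

# for each sign draw: max over coordinates k of sum_j eps_j * x_j[k]; then average
Z_hat = np.mean((eps @ x).max(axis=1))
print(f"estimated Z ~ {Z_hat:.3f}")
```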
