
Lecture Notes CS:5360 Randomized Algorithms

Lecture 7: Sept 11, 2018


Scribe: Tanmay Inamdar

1 Markov’s Inequality
Recall that our general theme is to upper bound tail probabilities, i.e., probabilities of the form
Pr(X ≥ c · E[X]) or Pr(X ≤ c · E[X]). The first tool towards that end is Markov’s Inequality.
Note. This is a simple tool, but it is usually quite weak. It is mainly used to derive stronger tail
bounds, such as Chebyshev’s Inequality.

Theorem 1 (Markov’s Inequality) Let X be a non-negative random variable. Then, for any a > 0,

Pr(X ≥ a) ≤ E[X]/a.
Before we discuss the proof of Markov’s Inequality, first let’s look at a picture that illustrates the
event that we are looking at.

Figure 1: Markov’s Inequality bounds the probability of the shaded region {X ≥ a} (the horizontal axis is marked at E[X] and a).

Proof:[1] Suppose, for simplicity, that X is a discrete random variable. Then

E[X] = Σ_x x · Pr(X = x)
     ≥ Σ_{x ≥ a} x · Pr(X = x)
     ≥ a · Σ_{x ≥ a} Pr(X = x)
     = a · Pr(X ≥ a).

Rearranging, we get Pr(X ≥ a) ≤ E[X]/a.

Proof:[2] Define a random variable Y as follows:

Y = 1 if X ≥ a, and Y = 0 otherwise.

Now, if X < a, then Y = 0; otherwise X ≥ a, in which case Y = 1. In both cases, we have Y ≤ X/a. (Note that we use the fact that X is a non-negative random variable in the first case.) Therefore, E[Y] ≤ E[X]/a. However, since Y is an indicator random variable, E[Y] = Pr(Y = 1) = Pr(X ≥ a). This implies that Pr(X ≥ a) ≤ E[X]/a.
Example. Let X be a random variable that denotes the number of heads when n fair coins are tossed independently. Using Linearity of Expectation, we get that E[X] = n/2.

Plugging in a = 3n/4 in Markov’s Inequality, we get that Pr(X ≥ 3n/4) ≤ (n/2)/(3n/4) = 2/3. This is a quite weak bound on the tail probability, since we intuitively know that X should be concentrated very tightly around its mean. (If we toss 10,000 fair coins, we have a sense that the probability of getting 7,500 or more heads is going to be very small.)

To illustrate this point further, consider Pr(X ≥ n). Plugging in a = n, we get Pr(X ≥ n) ≤ (n/2)/n = 1/2. However, we know that Pr(X ≥ n) = Pr(X = n) = 1/2^n, since the outcomes of all n coin tosses must be heads when X = n.

The example above illustrates that often, the bounds given by Markov’s Inequality are quite
weak. This should not be surprising, however, since this bound only makes use of the expected
value of a random variable.
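To see how loose the bound is in practice, here is a small Python sketch (our own illustration, not part of the original notes; the helper names are ours) that compares Markov's bound with the exact binomial tail probability for both choices of a in the example above.

```python
from math import comb

def exact_tail(n, k):
    """Exact Pr(X >= k) for X ~ Bin(n, 1/2)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2**n

n = 100
for a in (3 * n // 4, n):            # a = 3n/4 and a = n, as in the example
    markov = (n / 2) / a             # Markov: Pr(X >= a) <= E[X]/a
    print(f"a = {a:3d}: Markov bound = {markov:.3f}, exact tail = {exact_tail(n, a):.3e}")
```

For a = 3n/4 the bound is 2/3 while the exact tail is many orders of magnitude smaller, and for a = n the bound is 1/2 while the exact probability is 1/2^n.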

2 Chebyshev’s Inequality
In order to get more information about a random variable, we can use moments of a random
variable.

Definition 2 (Moment) The k-th moment of a random variable X is E[X^k].

Higher moments often reveal more information about a random variable, which, in turn, helps us
derive better bounds. However, there is a trade-off. It is often difficult to compute higher moments
in practical cases, e.g., while analyzing randomized algorithms. Now, let us look at the variance of
a random variable.

Definition 3 (Variance) The variance of a random variable X, denoted Var[X], is E[(X − E[X])^2].

The variance of a random variable can be seen as the expected squared distance of X from its expected value E[X]. Another way to look at Var[X] is as follows.

Var[X] = E[(X − E[X])^2]
       = E[X^2 − 2 · X · E[X] + E[X]^2]
       = E[X^2] − E[2E[X] · X] + E[E[X]^2]        (Linearity of Expectation)
       = E[X^2] − 2E[X] · E[X] + E[X]^2           (2 and E[X] are constants)
       = E[X^2] − E[X]^2.

That is, the variance of X equals the difference between the second moment of X and the square of the expected value of X (i.e., the square of the first moment of X).
Now, we can derive Chebyshev’s Inequality, which often gives much stronger bounds than Markov’s Inequality.

Theorem 4 (Chebyshev’s Inequality) For any a > 0,

Pr(|X − E[X]| ≥ a) ≤ Var[X]/a^2.
Again, let us look at a picture that illustrates Chebyshev’s Inequality.

Figure 2: Chebyshev’s Inequality bounds the probability of the shaded regions, i.e., both tails Pr(E[X] − X ≥ a) and Pr(X − E[X] ≥ a), each at distance a from E[X].

Proof:

Pr(|X − E[X]| ≥ a) = Pr((X − E[X])^2 ≥ a^2) = Pr(Y ≥ a^2),

where Y = (X − E[X])^2. Note that Y is a non-negative random variable. Therefore, using Markov’s Inequality,

Pr(Y ≥ a^2) ≤ E[Y]/a^2 = E[(X − E[X])^2]/a^2 = Var[X]/a^2.

Example. Again consider the fair coin example. Recall that X denotes the number of heads when n fair coins are tossed independently. We saw that Pr(X ≥ 3n/4) ≤ 2/3, using Markov’s Inequality. Let us see how Chebyshev’s Inequality can be used to give a much stronger bound on this probability. First, notice that:

Pr(X ≥ 3n/4) = Pr(X − n/2 ≥ n/4) ≤ Pr(|X − n/2| ≥ n/4) = Pr(|X − E[X]| ≥ n/4).

That is, we are interested in bounding the upper tail probability. However, as seen before, Chebyshev’s Inequality upper bounds the probabilities of both tails. In order to use Chebyshev’s Inequality, we must first calculate Var[X]. To do so, we begin by characterizing what kind of random variable X is.

Definition 5 (Binomial Random Variable) A random variable X is Binomial with parameters n and p (denoted as X ∼ Bin(n, p)) if X takes on values 0, 1, . . . , n − 1, n, with the following distribution:

Pr(X = j) = (n choose j) · p^j · (1 − p)^{n−j}.

A binomial random variable X ∼ Bin(n, p) denotes the number of successes (heads) when n independent coins are tossed, with each coin having success (heads) probability p. In our example, X is a binomial random variable with parameters n and 1/2. However, we consider the more general case.

To compute Var[X], we need E[X] and E[X^2]. For a Binomial random variable, E[X] = np can be computed easily, as seen before. However, computing E[X^2] directly is quite tedious. Therefore, we decompose X in the following manner.

For each coin toss i = 1, . . . , n, define an indicator r.v. Xi that equals 1 with probability p and 0 with probability 1 − p. That is, Xi is 1 if the i-th coin toss is heads, and 0 otherwise. It is easy to see that X = Σ_{i=1}^{n} Xi.
Before we show how the variance of X can be decomposed, we need the following definition.

Definition 6 (Covariance) The covariance of random variables Xi and Xj, denoted as Cov(Xi, Xj), is E[(Xi − E[Xi]) · (Xj − E[Xj])].

Cov(Xi, Xj) is a measure of correlation between Xi and Xj. It immediately follows from the definition that Cov(Xi, Xj) = Cov(Xj, Xi). Another way to look at Cov(Xi, Xj) is as follows.

Cov(Xi, Xj) = E[(Xi − E[Xi]) · (Xj − E[Xj])]
            = E[Xi Xj − Xi E[Xj] − Xj E[Xi] + E[Xi]E[Xj]]
            = E[Xi Xj] − E[Xi] · E[Xj] − E[Xi] · E[Xj] + E[Xi] · E[Xj]       (Linearity of Expectation; E[Xi] and E[Xj] are constants)
            = E[Xi Xj] − E[Xi] · E[Xj].
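As an illustration (our own sketch, not from the notes), the following Python snippet estimates the covariance from samples using the identity Cov = E[AB] − E[A]E[B], once for an independent pair of fair coins and once for a fully correlated pair; the function and variable names are ours.

```python
import random

def estimate_cov(pairs):
    """Estimate Cov(A, B) = E[AB] - E[A]E[B] from a list of (a, b) samples."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    mean_ab = sum(a * b for a, b in pairs) / n
    return mean_ab - mean_a * mean_b

random.seed(0)
trials = 100_000

# Independent fair coins: the covariance should be close to 0.
indep = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(trials)]
print("independent coins:", estimate_cov(indep))   # ~ 0

# Xj = Xi (fully correlated indicators): covariance ~ Var[Xi] = p(1-p) = 1/4.
same = [(x, x) for x in (random.randint(0, 1) for _ in range(trials))]
print("identical coins  :", estimate_cov(same))    # ~ 0.25
```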

Now, we state the following theorem without proof.

Theorem 7

Var[Σ_{i=1}^{n} Xi] = Σ_{i=1}^{n} Var[Xi] + Σ_{i ≠ j} Cov(Xi, Xj).

Consider the case where all Xi’s are mutually independent. Then, for any distinct Xi, Xj, E[Xi Xj] = E[Xi] · E[Xj], which implies that Cov(Xi, Xj) = 0. That is,

Theorem 8 (Linearity of Variance) If X1, X2, . . . , Xn are all mutually independent, then

Var[Σ_{i=1}^{n} Xi] = Σ_{i=1}^{n} Var[Xi].

Notes.

1. Linearity of Variance requires the independence of the random variables, whereas Linearity
of Expectation does not.

2. We do not need mutual independence between the random variables for Linearity of Variance. A weaker notion called pairwise independence suffices. That is, for any distinct Xi, Xj, it is sufficient to require that Xi and Xj be independent; a small numerical sketch follows this list.
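The following small Python check is our own illustration (not from the notes): X1 and X2 are independent fair bits and X3 = X1 XOR X2, so the three bits are pairwise independent but not mutually independent, yet the variance of their sum still equals the sum of their variances.

```python
from itertools import product

# Enumerate the uniform distribution over (X1, X2); X3 = X1 xor X2.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]
prob = 1 / len(outcomes)  # each (X1, X2) pair is equally likely

def expect(f):
    """E[f(X1, X2, X3)] under the uniform distribution above."""
    return sum(prob * f(*o) for o in outcomes)

# Variance of the sum S = X1 + X2 + X3.
mean_s = expect(lambda a, b, c: a + b + c)
var_s = expect(lambda a, b, c: (a + b + c - mean_s) ** 2)

# Sum of the individual variances; each Xi is a fair bit, so Var[Xi] = 1/4.
var_sum = sum(
    expect(lambda a, b, c, i=i: ((a, b, c)[i] - 0.5) ** 2) for i in range(3)
)

print(var_s, var_sum)  # both 0.75, even though X3 is determined by X1 and X2
```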

Example: (Continued) In our coin toss example, all Xi’s are in fact mutually independent. Therefore, Var[X] = Σ_{i=1}^{n} Var[Xi].

For any Xi, Var[Xi] = E[Xi^2] − E[Xi]^2, and E[Xi] = Pr(Xi = 1) = p. Note that Xi^2 has the same distribution as Xi, and therefore E[Xi^2] = p. So, Var[Xi] = p − p^2 = p(1 − p). And therefore, Var[X] = np(1 − p).

Theorem 9 (Variance of a Binomial Random Variable) If X ∼ Bin(n, p), then Var[X] = np(1 − p).

Therefore, in the case of fair coin tosses, Var[X] = n/4. By Chebyshev’s Inequality,

Pr(|X − n/2| ≥ n/4) ≤ (n/4)/(n/4)^2 = 4/n.

Recall that Markov’s Inequality gave us a much weaker bound of 2/3 on the same tail probability. Later on, we will discover that using Chernoff Bounds, we can get an even stronger bound of O(1/exp(n)) on the same probability. However, Chernoff Bounds require mutual independence, whereas even the weaker notion of pairwise independence suffices for an application of Chebyshev’s Inequality.
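For comparison, the following sketch (again ours, not from the notes) tabulates the three quantities for Pr(X ≥ 3n/4): Markov's bound of 2/3, Chebyshev's bound of 4/n, and the exact tail probability, for a few values of n.

```python
from math import comb

def exact_tail(n, k):
    """Exact Pr(X >= k) for X ~ Bin(n, 1/2)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2**n

print("   n   Markov  Chebyshev   exact")
for n in (20, 100, 400):
    k = 3 * n // 4                       # the threshold 3n/4
    markov = (n / 2) / k                 # E[X]/a        = 2/3
    chebyshev = (n / 4) / (n / 4) ** 2   # Var[X]/a^2    = 4/n
    print(f"{n:4d}   {markov:.3f}   {chebyshev:.4f}   {exact_tail(n, k):.2e}")
```

The Markov column stays at 2/3 for every n, the Chebyshev column decays like 4/n, and the exact tail decays exponentially, foreshadowing the Chernoff bound.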

Lecture Notes CS:5360 Randomized Algorithms
Lectures 8 and 9: Sept 13 & Sept 18, 2018
Scribe: Tanmay Inamdar

Example 1: (Coupon Collector’s Problem)


Recall that in the Coupon Collector’s Problem, we defined Xi as the number of cereal boxes bought while having i − 1 distinct coupons, and the random variable X = Σ_{i=1}^{n} Xi denoted the total number of boxes required to obtain all n distinct coupons. We had shown the following:

1. Each Xi is a geometric random variable with parameter pi = (n − i + 1)/n. So, E[Xi] = n/(n − i + 1).

2. E[X] = n ln n + Θ(n), using Linearity of Expectation.

Can we bound the tail probability Pr(X ≥ 2n ln n)?

Using Markov’s Inequality, Pr(X ≥ 2n ln n) ≤ (n ln n + Θ(n))/(2n ln n) = 1/2 + Θ(1/ln n) = 1/2 + o(1). For sufficiently large n, this bound is arbitrarily close to 1/2.
What do we require for using Chebyshev’s Inequality?

1. We need to set a = n ln n, ignoring the lower order Θ(n) term in E[X].

2. We need to compute Var(X). The variables Xi ’s are mutually independent. For example, the
number of cereal boxes bought while having 3 distinct coupons does not affect the number
of cereal boxes bought while having 4 distinct coupons. Therefore, we can use Linearity of
Variance, provided that we can compute Var[Xi ] for each Xi .

We state the following result without proof.


Theorem 10 If Y ∼ Geom(p), then Var[Y] = (1 − p)/p^2.

Using this, for any Xi, we have that Var[Xi] = (1 − pi)/pi^2 ≤ 1/pi^2 = n^2/(n − i + 1)^2. Therefore,

Var[X] = Σ_{i=1}^{n} Var[Xi]
       ≤ Σ_{i=1}^{n} n^2/(n − i + 1)^2
       = n^2 · Σ_{j=1}^{n} 1/j^2
       ≤ n^2 · Σ_{j=1}^{∞} 1/j^2
       = n^2 · π^2/6.                     (∵ Σ_{j=1}^{∞} 1/j^2 = π^2/6)

Therefore, plugging a = n ln n, E[X] = n ln n, and Var[X] = n^2 π^2/6 in Chebyshev’s Inequality,

Pr(X ≥ 2n ln n) ≤ Pr(|X − E[X]| ≥ n ln n) ≤ (n^2 π^2/6)/(n ln n)^2 = π^2/(6 (ln n)^2) = Θ(1/(ln n)^2).

Note that this bound is much better than the bound obtained via Markov’s Inequality. 
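As a sanity check (our own sketch, not part of the notes; the helper names are ours), the following Python simulation estimates Pr(X ≥ 2n ln n) for the coupon collector process and prints it next to the Markov and Chebyshev bounds derived above.

```python
import math
import random

def coupon_collector_draws(n):
    """Number of uniform draws needed to see all n coupon types."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

random.seed(1)
n, trials = 200, 2000
threshold = 2 * n * math.log(n)

hits = sum(coupon_collector_draws(n) >= threshold for _ in range(trials))
print("empirical Pr(X >= 2n ln n):", hits / trials)
print("Markov bound (about 1/2)  :", (n * math.log(n)) / threshold)
print("Chebyshev bound           :", (n**2 * math.pi**2 / 6) / (n * math.log(n))**2)
```

The empirical frequency is typically far below even the Chebyshev bound, consistent with the ordering of the three estimates discussed above.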
A note on error probabilities.

Error probability (upper bound)    Comment
1/c (a constant)                   Weakest useful probability bound. Can use probability amplification.
1/poly(log n)                      Approaches 0 as n increases, but the rate of approach may not be satisfactory.
1/n^c, 0 < c < 1                   Gray area. Sometimes referred to as “high probability”.
1/n                                Standard meaning of “high probability”.

3 Median Finding by Sampling


We are going to discuss a Monte Carlo algorithm with deterministic O(n) running time, with failure probability at most 1/n^{1/4}. This algorithm is notable for its use of sampling, a probabilistic technique that has many applications. The analysis of the algorithm uses Chebyshev’s inequality. A different randomized median-finding algorithm appears in the homework – it is Las Vegas with expected O(n) running time.

The input is a list L[1..n] of n distinct numbers. We are required to find the median element, i.e., the element of rank n/2 in the sorted version of L, which we denote by m = median(L). First, we describe the idea of the algorithm informally.

Informal Description.

• We sample L to get a sublist S of t elements, where t ≪ n is a parameter to be fixed later. Sampling a smaller subset of the input is often referred to as down-sampling.

• The intuition is that median(S) should be “close to” median(L) in the sorted version of L. We can find median(S) by any standard sorting algorithm in O(t log t) time, which would be O(n) if we choose t small enough.

• Let d be the element in Sorted S of rank t/2 − t′, and u be the element in Sorted S of rank t/2 + t′. Again, t′ is a parameter to be fixed later. (Figure: Sorted S, with d at distance t′ below the middle, u at distance t′ above the middle, and m lying between them.) The intuition is that median(L) should lie between d and u in Sorted L.

• Then, we find the set C = {x ∈ L | d ≤ x ≤ u} – this takes O(n) time. (Figure: in Sorted L, C is the contiguous block of elements between d and u.)

• We sort C and find the median of L.

• In what ways can things go wrong?

Define ℓd = |{x ∈ L | x < d}|, and mu = |{x ∈ L | x > u}|. (ℓd stands for “less than d” and mu stands for “more than u”.) The “bad events” are as follows.

1. B1 ≡ ℓd > n/2. In this case, m is smaller than d, and so m ∉ C. (Figure: in Sorted L, the ℓd elements smaller than d include m, and C lies entirely to the right of m.)

2. B2 ≡ mu > n/2. In this case, m is greater than u, and so m ∉ C. (Figure: symmetric to the previous one, with the mu elements larger than u including m.)

3. B3 ≡ C is “too large” to be sorted!

Now, we describe the algorithm formally.


Algorithm 1: MonteCarloMedianFind(L[1..n])
1  Pick t = ⌈n^{3/4}⌉ elements from L by sampling independently, with replacement, uniformly at random. The resulting S is a multiset.
2  Sort S and let
     d = element of rank ⌊n^{3/4}/2 − √n⌋ in Sorted S,
     u = element of rank ⌈n^{3/4}/2 + √n⌉ in Sorted S.
3  By scanning L and comparing each element in L with d and u, compute C = {x ∈ L | d ≤ x ≤ u}, ℓd = |{x ∈ L | x < d}|, and mu = |{x ∈ L | x > u}|.
4  if ℓd > n/2 then FAIL
5  if mu > n/2 then FAIL
6  if |C| > 4 · n^{3/4} then FAIL
7  Sort C and return the element with rank n/2 − ℓd in Sorted C.

Lines 4 and 5 correspond to the bad events B1 and B2 as defined earlier. Line 6 corresponds to the bad event B3, which we now define formally as B3 ≡ |C| > 4 · n^{3/4}. Note that if none of the bad events B1, B2 and B3 happens, then the algorithm is guaranteed to return the correct median of the list L, in O(n) time (because if B3 does not happen, then S and C are small enough to be sorted by a deterministic algorithm, say merge sort, in O(n) time). We will omit ceilings and floors in the analysis for simplicity.
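A direct Python translation of Algorithm 1 might look as follows. This is our own sketch, not an official implementation from the course: the variable names, the rank convention for even n, the clamping of ranks to [1, t], and the extra boundary guard before Line 7 are ours.

```python
import math
import random

def monte_carlo_median(L):
    """A sketch of Algorithm 1 (MonteCarloMedianFind); returns None on FAIL.

    Assumes the elements of L are distinct and n is even, and returns the
    element of rank n/2 (1-based) of the sorted list, following the
    convention used in the notes.
    """
    n = len(L)
    t = math.ceil(n ** 0.75)

    # Line 1: sample t elements independently, uniformly at random, with replacement.
    S = sorted(random.choice(L) for _ in range(t))

    # Line 2: pick d and u around the middle of Sorted S (ranks are 1-based;
    # clamping to [1, t] is our own small robustness guard).
    lo = max(1, math.floor(t / 2 - math.sqrt(n)))
    hi = min(t, math.ceil(t / 2 + math.sqrt(n)))
    d, u = S[lo - 1], S[hi - 1]

    # Line 3: one scan of L computes C, ld and mu.
    C = [x for x in L if d <= x <= u]
    ld = sum(x < d for x in L)
    mu = sum(x > u for x in L)

    # Lines 4-6: the bad events B1, B2, B3.
    if ld > n / 2 or mu > n / 2 or len(C) > 4 * n ** 0.75:
        return None  # FAIL

    # Line 7: the median has rank n/2 - ld within Sorted C.
    k = n // 2 - ld
    if not 1 <= k <= len(C):
        return None  # boundary corner case, treated as FAIL (our guard)
    return sorted(C)[k - 1]

# Example usage: compare against a sort-based median on a random input.
random.seed(2)
L = random.sample(range(10**6), 10_000)
print(monte_carlo_median(L), sorted(L)[len(L) // 2 - 1])
```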

Lemma 11

Pr(B1) ≤ 1/(4n^{1/4}).
Proof: Recall that B1 ≡ ℓd > n/2. Let X denote the number of elements in S that are ≤ m. Note that d has rank n^{3/4}/2 − √n in S, i.e., the number of elements in S that are less than d is ≤ n^{3/4}/2 − √n. Now, if B1 occurs, then m < d, and so the number of elements sampled from the shaded region {x ∈ L | x ≤ m} is at most n^{3/4}/2 − √n.

(Figure: Sorted L, with m to the left of d; the ℓd elements smaller than d are marked.)

Therefore, Pr(B1) ≤ Pr(X < n^{3/4}/2 − √n). Now, we upper bound the latter probability. For analyzing X, we can look at the sampling process as a sequence of t = n^{3/4} trials, with “success” corresponding to an element being chosen from the left of m (inclusive). The probability of “success” is (number of elements ≤ m)/n = ((n − 1)/2 + 1)/n = 1/2 + 1/(2n), assuming n is odd.
Therefore, E[X] = n^{3/4} · (1/2 + 1/(2n)) ≥ n^{3/4}/2, and Var[X] = n^{3/4} · (1/2 + 1/(2n)) · (1/2 − 1/(2n)) ≤ n^{3/4}/4, using the formulas for the expected value and the variance of a binomial random variable.


Figure 3: The distribution of X. Area to the left of the blue line: X < n^{3/4}/2 − √n. Area to the left of the red line: X < E[X] − √n.

Now, we want to analyze Pr(X < n^{3/4}/2 − √n), which is the area left of the blue line in the picture above. We upper bound it by Pr(X < E[X] − √n), which is the area left of the red line (recall that E[X] ≥ n^{3/4}/2, so the blue line lies to the left of the red line). Therefore,

Pr(B1) ≤ Pr(X < n^{3/4}/2 − √n)          (as argued earlier)
       ≤ Pr(X < E[X] − √n)               (from Figure 3)
       ≤ Pr(|X − E[X]| > √n)             (bounding the lower tail probability by both tails)
       ≤ (n^{3/4}/4) / (√n)^2            (plugging in Var[X] ≤ n^{3/4}/4 and a = √n in Chebyshev’s inequality)
       = 1/(4n^{1/4}).

We have a similar lemma about the bad event B2 ≡ mu > n/2. The proof of this lemma is entirely
symmetric, so we state it here without proof.

Lemma 12

Pr(B2) ≤ 1/(4n^{1/4}).

Now, we have a bad event B3 ≡ |C| > 4 · n^{3/4}. We have the following lemma.

Lemma 13

Pr(B3) ≤ 1/(2n^{1/4}).
Proof: We decompose B3 into two events B31 and B32, where

B31 ≡ the number of elements in C that are ≥ m is > 2n^{3/4},
B32 ≡ the number of elements in C that are ≤ m is > 2n^{3/4}.

We have that B3 ⊆ B31 ∪ B32. We will show that Pr(B31) ≤ 1/(4n^{1/4}); the proof for Pr(B32) ≤ 1/(4n^{1/4}) is symmetric. Note that by the union bound, these two bounds imply the lemma.

(Figure: Sorted L, with m at rank n/2 and u more than 2n^{3/4} positions to its right; the red region consists of the n/2 − 2n^{3/4} elements of largest rank, i.e., those of rank ≥ n/2 + 2n^{3/4}.)

If the event B31 occurs, then there are > 2n^{3/4} elements between m and u (including both). However, the number of elements in S that are > u is n^{3/4}/2 − √n, all of which must then be sampled from the red region (in fact, from the subset of the red region that lies to the right of u). But, as we compute next, the expected number of elements sampled from the red region is only n^{3/4}/2 − 2√n.
Let X denote the number of elements sampled in S from the red region, i.e., the number of sampled elements with rank ≥ n/2 + 2n^{3/4} in L. Similar to earlier, X is a binomial random variable with n^{3/4} trials and success probability (n/2 − 2n^{3/4})/n = 1/2 − 2/n^{1/4}. Therefore, E[X] = n^{3/4}/2 − 2√n, and Var[X] = n^{3/4} · (1/2 − 2/n^{1/4}) · (1/2 + 2/n^{1/4}) ≤ n^{3/4}/4.

Pr(B31) ≤ Pr(X ≥ n^{3/4}/2 − √n)
        = Pr(X − (n^{3/4}/2 − 2√n) ≥ √n)
        = Pr(X − E[X] ≥ √n)
        ≤ Pr(|X − E[X]| ≥ √n)
        ≤ (n^{3/4}/4) / (√n)^2            (plugging in Var[X] ≤ n^{3/4}/4 and a = √n in Chebyshev’s inequality)
        = 1/(4n^{1/4}).

Now, we conclude with the following theorem about Algorithm 1.

Theorem 14 Algorithm 1 runs in O(n) deterministic time and returns the median with probability at least 1 − 1/n^{1/4}.

Proof: We skip the arguments about the running time and the correctness of the algorithm. We have already argued that the algorithm can make an error in four ways, corresponding to the bad events B1, B2 and B31, B32. We have argued that the probability of each of these bad events can be upper bounded by 1/(4n^{1/4}). By the union bound over these four events, the probability that the algorithm makes an error is at most 1/n^{1/4}.

