Lecture Notes CS:5360 Randomized Algorithms
1 Markov’s Inequality
Recall that our general theme is to upper bound tail probabilities, i.e., probabilities of the form
Pr(X ≥ c · E[X]) or Pr(X ≤ c · E[X]). The first tool towards that end is Markov’s Inequality.
Note. This is a simple tool, but it is usually quite weak. It is mainly used to derive stronger tail
bounds, such as Chebyshev’s Inequality.
Markov's Inequality: Let X be a non-negative random variable. Then, for any a > 0,

Pr(X ≥ a) ≤ E[X]/a.
Before we discuss the proof of Markov’s Inequality, first let’s look at a picture that illustrates the
event that we are looking at.
[Figure: the distribution of X, with E[X] and a marked on the axis and the tail event {X ≥ a} shaded.]
From the picture, since X is non-negative, every outcome in the event {X ≥ a} contributes at least a to E[X], so E[X] ≥ a · Pr(X ≥ a). Rearranging, we get

Pr(X ≥ a) ≤ E[X]/a.
Proof: [2] Define a random variable Y as follows: Y = 1 if X ≥ a, and Y = 0 otherwise. Now, if X < a, then Y = 0; otherwise X ≥ a, in which case Y = 1. In both cases, we have that Y ≤ X/a. (Note that we use the fact that X is a non-negative random variable in the first case.) Therefore, E[Y] ≤ E[X]/a. However, since Y is an indicator random variable, E[Y] = Pr(Y = 1) = Pr(X ≥ a). This implies that Pr(X ≥ a) ≤ E[X]/a.
Example. Let X be a random variable that denotes the number of heads when n fair coins are tossed independently. Using Linearity of Expectation, we get that E[X] = n/2.
Plugging in a = 3n/4 in Markov's Inequality, we get that Pr(X ≥ 3n/4) ≤ (n/2)/(3n/4) = 2/3. This is quite a weak bound on the tail probability from Markov's Inequality, since we intuitively know that X should be concentrated very tightly around its mean. (If we toss 10,000 fair coins, we have a sense that the probability of getting 7,500 or more heads is going to be very small.)
To illustrate this point further, consider Pr(X ≥ n). Plugging in a = n, we get Pr(X ≥ n) ≤ (n/2)/n = 1/2. However, we know that Pr(X ≥ n) = Pr(X = n) = 1/2^n, since the outcomes of all n coin tosses must be heads when X = n.
The example above illustrates that often, the bounds given by Markov’s Inequality are quite
weak. This should not be surprising, however, since this bound only makes use of the expected
value of a random variable.
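To see numerically how loose this bound is, one can estimate the tail probability by simulation; the following Python sketch is our illustration (the function name and parameters are ours), comparing the estimated Pr(X ≥ 3n/4) with the Markov bound of 2/3.

```python
import random

def estimate_upper_tail(n, trials=10000):
    """Estimate Pr(X >= 3n/4), where X is the number of heads in n fair coin tosses."""
    threshold = 3 * n / 4
    hits = sum(
        sum(random.randint(0, 1) for _ in range(n)) >= threshold
        for _ in range(trials)
    )
    return hits / trials

n = 200
print("Markov bound          :", 2 / 3)                     # (n/2) / (3n/4), independent of n
print("estimated Pr(X>=3n/4) :", estimate_upper_tail(n))    # essentially 0 already for n = 200
```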
2 Chebyshev’s Inequality
In order to get more information about a random variable, we can use the moments of the random variable. (Recall that the k-th moment of a random variable X is E[X^k]; in particular, E[X] is the first moment and E[X²] is the second moment.)
Higher moments often reveal more information about a random variable, which, in turn, helps us derive better bounds. However, there is a trade-off: it is often difficult to compute higher moments in practical cases, e.g., while analyzing randomized algorithms. Now, let us look at the variance of a random variable.
The variance of a random variable X can be seen as the expected squared distance of X from its expected value E[X]. Another way to look at Var[X] is as follows.

Var[X] = E[(X − E[X])²]
       = E[X² − 2 · X · E[X] + E[X]²]
       = E[X²] − 2 · E[X] · E[X] + E[X]²   (by Linearity of Expectation)
       = E[X²] − E[X]².

That is, the variance of X equals the difference between the second moment of X and the square of the expected value of X (i.e., the square of the first moment of X).
Now, we can derive Chebyshev's Inequality, which often gives much stronger bounds than Markov's Inequality.

Chebyshev's Inequality: For any random variable X and any a > 0,

Pr(|X − E[X]| ≥ a) ≤ Var[X]/a².
Again, let us look at a picture that illustrates Chebyshev's Inequality.

[Figure: the distribution of X, with both tails at distance at least a from E[X] shaded.]
Proof: Define Y = (X − E[X])², which is a non-negative random variable. Note that |X − E[X]| ≥ a if and only if Y ≥ a². Applying Markov's Inequality to Y,

Pr(|X − E[X]| ≥ a) = Pr(Y ≥ a²) ≤ E[Y]/a² = E[(X − E[X])²]/a² = Var[X]/a².
Example. Again consider the fair coin example. Recall that X denotes the number of heads when n fair coins are tossed independently. We saw that Pr(X ≥ 3n/4) ≤ 2/3 using Markov's Inequality. Let us see how Chebyshev's Inequality can be used to give a much stronger bound on this probability. First, notice that:

Pr(X ≥ 3n/4) = Pr(X − n/2 ≥ n/4) ≤ Pr(|X − n/2| ≥ n/4) = Pr(|X − E[X]| ≥ n/4).
That is, we are interested in bounding the upper tail probability. However, as seen before,
Chebyshev’s Inequality upper bounds probabilities of both tails. In order to use Chebyshev’s
Inequality, we must first calculate Var[X]. First, we must characterize what kind of random
variable X is.
Definition 5 (Binomial Random Variable) A random variable X is Binomial with parameters n and p (denoted as X ∼ Bin(n, p)) if X takes on values 0, 1, . . . , n, with the following distribution:

Pr(X = j) = (n choose j) · p^j · (1 − p)^(n−j).
A binomial random variable X ∼ Bin(n, p) denotes the number of successes (heads) when n independent coins are tossed, with each coin having success (heads) probability p. In our example, X is a binomial random variable with parameters n and 1/2. However, we consider the more general case.
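As a quick check of this definition, the following Python sketch (our illustration) evaluates the Bin(n, p) distribution directly and verifies that the probabilities sum to 1 and that the mean equals np.

```python
from math import comb

def binom_pmf(n, p, j):
    """Pr(X = j) for X ~ Bin(n, p)."""
    return comb(n, j) * p ** j * (1 - p) ** (n - j)

n, p = 20, 0.5
probs = [binom_pmf(n, p, j) for j in range(n + 1)]
print(sum(probs))                                        # 1.0 (up to floating-point error)
print(sum(j * pr for j, pr in enumerate(probs)), n * p)  # both are np = 10.0
```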
To compute Var[X], we need E[X] and E[X²]. For a Binomial random variable, E[X] = np can be computed easily, as seen before. However, computing E[X²] directly is quite tedious. Therefore, we decompose X in the following manner. For each coin toss i = 1, . . . , n, define an indicator r.v. Xi that takes value 1 with probability p and value 0 with probability 1 − p. That is, Xi is 1 if the i-th coin toss is heads, and 0 otherwise. It is easy to see that X = Σ_{i=1}^n Xi.
Before we show how the variance of X can be decomposed, we need the following definition: for random variables X and Y, the covariance is defined as Cov(X, Y) = E[X · Y] − E[X] · E[Y].
Theorem 7

Var[Σ_{i=1}^n Xi] = Σ_{i=1}^n Var[Xi] + Σ_{i≠j} Cov(Xi, Xj).
Consider the case where all Xi's are mutually independent. Then, for any Xi, Xj with i ≠ j, E[Xi Xj] = E[Xi] · E[Xj], which implies that Cov(Xi, Xj) = 0. That is,

Var[Σ_{i=1}^n Xi] = Σ_{i=1}^n Var[Xi].

This property is referred to as Linearity of Variance.
Notes.
1. Linearity of Variance requires the independence of the random variables, whereas Linearity
of Expectation does not.
2. We do not need mutual independence between the random variables for Linearity of Variance. A weaker notion called pairwise independence suffices. That is, it is sufficient to require that Xi and Xj be independent for every pair of distinct indices i ≠ j (see the sketch below for a concrete example).
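To make Note 2 concrete, here is a small Python sketch of a standard example (our illustration, not taken from these notes): X and Y are independent fair bits and Z = X XOR Y. The three variables are only pairwise independent, yet Linearity of Variance still holds for their sum.

```python
from itertools import product

# X and Y are independent fair bits; Z = X XOR Y. Then X, Y, Z are pairwise
# independent (any two of them are independent) but not mutually independent
# (Z is determined by X and Y).
outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]  # each with probability 1/4

def variance(values):
    """Variance of a uniform distribution over the given list of values."""
    mean = sum(values) / len(values)
    return sum(v * v for v in values) / len(values) - mean ** 2

var_of_sum = variance([x + y + z for x, y, z in outcomes])
sum_of_vars = sum(variance([o[i] for o in outcomes]) for i in range(3))
print(var_of_sum, sum_of_vars)   # both are 0.75: Var[X+Y+Z] = Var[X] + Var[Y] + Var[Z]
```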
Example (Continued). In our coin toss example, all Xi's are in fact mutually independent. Therefore, Var[X] = Σ_{i=1}^n Var[Xi].

For any Xi, Var[Xi] = E[Xi²] − E[Xi]². We have E[Xi] = Pr(Xi = 1) = p. Note that Xi² also has the same distribution as Xi, and therefore E[Xi²] = p. So, Var[Xi] = p − p² = p(1 − p). And therefore, Var[X] = np(1 − p).

For our fair coins, p = 1/2, so Var[X] = n/4. Plugging a = n/4 and Var[X] = n/4 into Chebyshev's Inequality,

Pr(X ≥ 3n/4) ≤ Pr(|X − E[X]| ≥ n/4) ≤ (n/4)/(n/4)² = 4/n.

Recall that Markov's Inequality gave us a much weaker bound of 2/3 on the same tail probability. Later on, we will discover that using Chernoff Bounds, we can get an even stronger bound of O(1/exp(n)) on the same probability. However, Chernoff Bounds require mutual independence, whereas even the weaker notion of pairwise independence suffices for an application of Chebyshev's Inequality.
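To see the improvement concretely, the following Python sketch (our illustration) computes the exact tail probability Pr(X ≥ 3n/4) for X ∼ Bin(n, 1/2) and compares it with the Markov bound (2/3) and the Chebyshev bound (4/n) derived above.

```python
from math import comb

def exact_tail(n):
    """Exact Pr(X >= 3n/4) for X ~ Bin(n, 1/2)."""
    k0 = -(-3 * n // 4)                          # ceil(3n/4)
    return sum(comb(n, k) for k in range(k0, n + 1)) / 2 ** n

for n in [20, 100, 400]:
    #     n    exact tail    Markov   Chebyshev
    print(n, exact_tail(n), 2 / 3, 4 / n)
```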
Lecture Notes CS:5360 Randomized Algorithms
Lectures 8 and 9: Sept 13 & Sept 18, 2018
Scribe: Tanmay Inamdar
2. We need to compute Var(X). (Recall the setting: this is the Coupon Collector's problem, where X = Σ_{i=1}^n Xi and Xi is the number of cereal boxes bought while we hold exactly i − 1 distinct coupons; Xi is geometric with success probability pi = (n − i + 1)/n.) The variables Xi are mutually independent. For example, the number of cereal boxes bought while having 3 distinct coupons does not affect the number of cereal boxes bought while having 4 distinct coupons. Therefore, we can use Linearity of Variance, provided that we can compute Var[Xi] for each Xi.
Using this, for any Xi, we have that Var[Xi] = (1 − pi)/pi² ≤ 1/pi² = n²/(n − i + 1)². Therefore,

Var[X] = Σ_{i=1}^n Var[Xi]
       ≤ Σ_{i=1}^n n²/(n − i + 1)²
       = n² · Σ_{j=1}^n 1/j²
       ≤ n² · Σ_{j=1}^∞ 1/j²
       = n² · π²/6        (since Σ_{j=1}^∞ 1/j² = π²/6).
Therefore, plugging a = n ln n, E[X] = n ln n, and Var[X] = n²π²/6 into Chebyshev's Inequality,

Pr(X ≥ 2n ln n) ≤ Pr(|X − E[X]| ≥ n ln n) ≤ (n²π²/6)/(n ln n)² = π²/(6(ln n)²) = Θ(1/(ln n)²).
Note that this bound is much better than the bound obtained via Markov’s Inequality.
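A quick simulation illustrates the bound. The following Python sketch (our illustration; it simulates the coupon-collecting process described above) estimates Pr(X ≥ 2n ln n) and prints the Chebyshev bound π²/(6(ln n)²) for comparison.

```python
import math
import random

def coupons_needed(n):
    """Number of boxes bought until all n coupon types have been collected."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        boxes += 1
    return boxes

def estimate_tail(n, trials=2000):
    """Estimate Pr(X >= 2 n ln n) by simulation."""
    threshold = 2 * n * math.log(n)
    return sum(coupons_needed(n) >= threshold for _ in range(trials)) / trials

n = 200
print("estimated Pr(X >= 2n ln n):", estimate_tail(n))
print("Chebyshev bound           :", math.pi ** 2 / (6 * math.log(n) ** 2))
```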
A note on error probabilities.

Error probability (upper bound) | Comment
1/c | Weakest useful probability bound. Can use probability amplification.
1/poly(log n) | Approaches 0 as n increases, but the rate of approach may not be satisfactory.
1/n^c, 0 < c < 1 | Gray area. Sometimes referred to as "high probability".
1/n | Standard meaning of "high probability".
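To make the first row concrete (a standard amplification argument, stated here as an aside): if each independent run of an algorithm fails with probability at most 1/c for some constant c > 1, and a failure can be detected, then running the algorithm k times and returning the first successful answer fails only when all k runs fail, which happens with probability at most (1/c)^k.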
Informal Description.
• The intuition is that median(S) should be "close to" median(L) in the sorted version of L. (Here L is the input list of n elements and S is a random sample of t elements drawn from L; later we take t = n^{3/4}.) We can find median(S) by any standard sorting algorithm in O(t log t) time, which would be O(n) if we choose t small enough.
• Let d be the element in Sorted S of rank t/2 − t′, and u be the element in Sorted S of rank t/2 + t′. Again, t′ is a parameter to be fixed later. See the following figure for an illustration.
[Figure: Sorted S, with d, m, and u marked; d and u are t′ positions below and above the middle of S, respectively.]
• Then, we find the set C = {x ∈ L | d ≤ x ≤ u} – this takes O(n) time.
[Figure: Sorted L, with the contiguous block C of elements between d and u marked.]
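Putting the informal description together, here is a minimal Python sketch of the sampling-based median algorithm (our illustration, not a verbatim transcription of Algorithm 1 referenced below). It assumes distinct elements, odd n, sampling with replacement, and the parameter values t = n^{3/4} and t′ = √n used in the analysis below; returning None plays the role of reporting failure when one of the bad events B1, B2, B3, discussed next, is detected.

```python
import math
import random

def randomized_median(L):
    """Sampling-based median selection (sketch). Assumes the elements of L are
    distinct and n = len(L) is odd. Returns the median of L, or None if one of
    the bad events B1, B2, B3 occurs."""
    n = len(L)
    t = int(round(n ** 0.75))                          # sample size t = n^(3/4)
    tp = math.isqrt(n)                                 # t' = sqrt(n)
    S = sorted(random.choice(L) for _ in range(t))     # sample with replacement; O(t log t)
    d = S[max(t // 2 - tp, 0)]                         # element of rank t/2 - t' in sorted S
    u = S[min(t // 2 + tp, t - 1)]                     # element of rank t/2 + t' in sorted S
    C = [x for x in L if d <= x <= u]                  # O(n)
    ell_d = sum(x < d for x in L)                      # elements of L smaller than d
    m_u = sum(x > u for x in L)                        # elements of L greater than u
    if ell_d > n // 2 or m_u > n // 2:                 # bad events B1, B2: median not in C
        return None
    if len(C) > 4 * t:                                 # bad event B3: C too large to sort in O(n)
        return None
    C.sort()
    return C[n // 2 - ell_d]                           # the median has rank n//2 (0-indexed) in L

if __name__ == "__main__":
    L = random.sample(range(10 ** 6), 10001)
    print(randomized_median(L), sorted(L)[len(L) // 2])
```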
1. B1 ≡ ℓ_d > n/2, where ℓ_d denotes the number of elements of L that are smaller than d. In this case, m is smaller than d, and so m ∉ C.

2. B2 ≡ m_u > n/2, where m_u denotes the number of elements of L that are greater than u. In this case, m is greater than u, and so m ∉ C. See the figure for illustration.

[Figure: Sorted L, with C ending at u and m lying to the right of u; the elements greater than u are labeled m_u.]
Lines 4 and 5 correspond to the bad events B1 and B2 as defined earlier. Line 6 corresponds to the bad event B3, which we now define formally as B3 ≡ |C| > 4 · n^{3/4}. Note that if none of the bad events B1, B2, and B3 happens, then the algorithm is guaranteed to return the correct median of the list L in O(n) time (because if B3 does not happen, then S and C are small enough to be sorted by a deterministic algorithm, say merge sort, in O(n) time). We will omit ceilings and floors in the analysis for simplicity.
Lemma 11  Pr(B1) ≤ 1/(4n^{1/4}).
Proof: Recall that B1 ≡ ℓ_d > n/2. Let X denote the number of elements in S that are ≤ m. Note that d has rank n^{3/4}/2 − √n in S, i.e., the number of elements in S that are less than d is ≤ n^{3/4}/2 − √n. Now, since m ≤ d, the number of elements sampled from the shaded blue region in the following picture is at most n^{3/4}/2 − √n.
[Figure: Sorted L, with m to the left of d; the elements smaller than d form the region labeled ℓ_d, and the elements ≤ m (a sub-region of it) are shaded blue.]
Therefore, Pr(B1) ≤ Pr(X < n^{3/4}/2 − √n). Now, we upper bound the latter probability.
For analyzing X, we can look at the sampling process as a sequence of t = n^{3/4} trials, with "success" corresponding to an element being chosen from the left of m (i.e., an element ≤ m). The probability of "success" is (number of elements ≤ m)/n = ((n − 1)/2 + 1)/n = 1/2 + 1/(2n), assuming n is odd.
Therefore, E[X] = n^{3/4} · (1/2 + 1/(2n)) ≥ n^{3/4}/2, and Var[X] = n^{3/4} · (1/2 + 1/(2n)) · (1/2 − 1/(2n)) ≤ n^{3/4}/4, using the formulas for the expected value and the variance of a binomial random variable.
[Figure 3: The distribution of X. Area to the left of the blue line: X < n^{3/4}/2 − √n. Area to the left of the red line: X < E[X] − √n.]
Now, we want to analyze Pr(X < n^{3/4}/2 − √n), which is the area to the left of the blue line in the picture above. We upper bound it by Pr(X < E[X] − √n), which is the area to the left of the red line. Therefore,
Pr(B1) ≤ Pr(X < n^{3/4}/2 − √n)      (as argued earlier)
       ≤ Pr(X < E[X] − √n)           (from Figure 3)
       ≤ Pr(|X − E[X]| > √n)         (bounding the lower tail probability by both tails)
       ≤ (n^{3/4}/4)/(√n)²           (plugging Var[X] ≤ n^{3/4}/4 and a = √n into Chebyshev's Inequality)
       = 1/(4n^{1/4}).
We have a similar lemma about the bad event B2 ≡ m_u > n/2. The proof of this lemma is entirely symmetric, so we state it here without proof.
Lemma 12  Pr(B2) ≤ 1/(4n^{1/4}).
Now, we have a bad event B3 ≡ |C| > 4 · n^{3/4}. We have the following lemma.
Lemma 13  Pr(B3) ≤ 1/(2n^{1/4}).
Proof: We decompose B3 into two events B31 and B32, where B31 is the event that C contains more than 2 · n^{3/4} elements that are greater than or equal to m, and B32 is the event that C contains more than 2 · n^{3/4} elements that are less than or equal to m.
We have that B3 ⊆ B31 ∪ B32. We will show that Pr(B31) ≤ 1/(4n^{1/4}); the proof that Pr(B32) ≤ 1/(4n^{1/4}) is symmetric. Note that, by the union bound, these two bounds imply the lemma.
[Figure: Sorted L under the event B31: more than 2n^{3/4} elements lie between m and u, and the red region consists of the roughly n/2 − 2n^{3/4} elements of rank at least n/2 + 2n^{3/4}.]
If the event B31 occurs, then there are > 2n^{3/4} elements between m and u (including both). However, the number of elements in S that are > u is n^{3/4}/2 − √n, all of which must then be sampled from the red region (in fact, from the subset of the red region that lies to the right of u). But, as we compute below, the expected number of elements sampled from the red region is only n^{3/4}/2 − 2√n.
Let X denote the number of elements sampled in S from the red region, i.e., the number of sampled elements with rank ≥ n/2 + 2n^{3/4} in L. Similar to earlier, X is a binomial random variable with n^{3/4} trials and success probability (n/2 − 2n^{3/4})/n = 1/2 − 2/n^{1/4}. Therefore, E[X] = n^{3/4}/2 − 2√n, and

Var[X] = n^{3/4} · (1/2 − 2/n^{1/4}) · (1/2 + 2/n^{1/4}) ≤ n^{3/4}/4.
Pr(B31) ≤ Pr(X ≥ n^{3/4}/2 − √n)          (as argued above)
        = Pr(X − (n^{3/4}/2 − 2√n) ≥ √n)
        = Pr(X − E[X] ≥ √n)
        ≤ Pr(|X − E[X]| ≥ √n)
        ≤ (n^{3/4}/4)/(√n)²               (plugging Var[X] ≤ n^{3/4}/4 and a = √n into Chebyshev's Inequality)
        = 1/(4n^{1/4}).
Theorem 14  Algorithm 1 runs in O(n) deterministic time and returns the median with probability at least 1 − 1/n^{1/4}.
Proof: We skip the arguments about the running time and the correctness of the algorithm. We have already argued that the algorithm can make an error in four ways, corresponding to the bad events B1, B2, B31, and B32. We have argued that the probability of each of these bad events can be upper bounded by 1/(4n^{1/4}). By a union bound over these four events, the probability that the algorithm makes an error is at most 1/n^{1/4}.
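(In the terminology of the table on error probabilities above, a failure probability of 1/n^{1/4} falls in the 1/n^c, 0 < c < 1 row, i.e., the "gray area".)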