
CS 473, Spring 2020
Homework 5
Due Wednesday, March 11, 2020 at 9pm

1. Suppose we are hashing from the universe U = {0, 1, . . . , 2^w − 1} of w-bit strings to a hash table of size m = 2^ℓ; that is, we are hashing w-bit words into ℓ-bit labels. To define our universal family of hash functions, we think of words and labels as boolean vectors, and we specify our hash function by choosing a random ℓ × w boolean matrix. For any ℓ × w matrix M of 0s and 1s, define the hash function h_M : {0, 1}^w → {0, 1}^ℓ by the boolean matrix-vector product

\[ h_M(x) \;=\; Mx \bmod 2 \;=\; \bigoplus_{i=1}^{w} M_i x_i \;=\; \bigoplus_{i \,:\, x_i = 1} M_i, \]

where ⊕ denotes bitwise exclusive-or (that is, addition mod 2), M_i denotes the ith column of M, and x_i denotes the ith bit of x.

(a) Prove that the family M = {h_M | M ∈ {0, 1}^{ℓ×w}} of all random-matrix hash functions is universal.
Solution: Pick two arbitrary distinct words x ≠ y; we need to prove that Pr_M[h_M(x) = h_M(y)] ≤ 1/m. Definition-chasing implies that for any matrix M, we have h_M(x) = h_M(y) if and only if M(x ⊕ y) mod 2 = 0. Thus, it suffices to prove for any non-zero word z ∈ {0, 1}^w that Pr_M[Mz mod 2 = 0] ≤ 1/m.
Fix an arbitrary non-zero word z, and suppose z_j = 1. Then we have

\[ Mz \bmod 2 = 0 \iff \bigoplus_{i} M_i z_i = 0 \iff \bigoplus_{i \ne j} M_i z_i = M_j. \]

If we fix all the entries in M except the jth column M_j, the left side of the last equation is a fixed vector, but the right side is still uniformly random. The probability that the random vector M_j equals the fixed vector ⊕_{i≠j} M_i z_i is exactly 1/m. „

Rubric: 4 points
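For small parameters the claim can also be checked exhaustively. The following Python sketch (my own addition, not part of the solution) represents an ℓ × w matrix by its w columns, enumerates all such matrices for w = 3 and ℓ = 2, and verifies that every pair of distinct words collides under at most a 1/m fraction of them.

    from itertools import product

    w, ell = 3, 2              # small enough for exhaustive enumeration
    m = 2 ** ell               # number of labels

    def h(M, x):
        # XOR together the columns M[i] for which bit i of x is 1
        out = 0
        for i in range(w):
            if (x >> i) & 1:
                out ^= M[i]
        return out

    # A matrix is represented by its w columns, each an ell-bit integer.
    matrices = list(product(range(m), repeat=w))

    for x in range(2 ** w):
        for y in range(x + 1, 2 ** w):
            collisions = sum(h(M, x) == h(M, y) for M in matrices)
            assert collisions * m <= len(matrices)   # Pr[collision] <= 1/m
    print("universal: verified for w = 3, ell = 2")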

(b) Prove that M is not uniform.

Solution: h(0) = 0 for all h ∈ M, so Pr[h(0) = 0] = 1 rather than 1/m. „

Rubric: 1 point


(c) Now consider a modification of the previous scheme, where we specify a hash function by a random matrix M ∈ {0, 1}^{ℓ×w} and an independent random offset vector b ∈ {0, 1}^ℓ:

\[ h_{M,b}(x) \;=\; (Mx + b) \bmod 2 \;=\; \Bigl( \bigoplus_{i=1}^{w} M_i x_i \Bigr) \oplus b. \]

Prove that the family M⁺ of all such functions is strongly universal.

Solution: Fix two arbitrary distinct words x ≠ y and two labels u and v; we need to compute the probability that h(x) = u and h(y) = v.
Because x ≠ y, there is some index i with x_i ≠ y_i; without loss of generality, suppose x_i = 0 and y_i = 1. Fix all the entries in M except in column M_i; then

\[ h(x) = \bar u \oplus b \qquad\text{and}\qquad h(y) = \bar v \oplus M_i \oplus b \]

for some fixed vectors ū and v̄; specifically,

\[ \bar u = \bigoplus_{j \ne i} M_j x_j \qquad\text{and}\qquad \bar v = \bigoplus_{j \ne i} M_j y_j. \]

Definition-chasing now implies that h(x) = u and h(y) = v if and only if

\[ b = u \oplus \bar u \qquad\text{and}\qquad M_i = u \oplus \bar u \oplus v \oplus \bar v. \]

The vectors b and M_i are uniform and independent, so the probability of this event is exactly 1/m². „

Rubric: 4 points
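The same kind of exhaustive check works here (again my own sketch, not part of the solution): enumerate every pair (M, b) for w = 3 and ℓ = 2, and confirm that each pair of distinct words lands on each pair of labels exactly a 1/m² fraction of the time.

    from itertools import product

    w, ell = 3, 2
    m = 2 ** ell

    def h(M, b, x):
        # (Mx + b) mod 2, with M given as w columns, each an ell-bit integer
        out = b
        for i in range(w):
            if (x >> i) & 1:
                out ^= M[i]
        return out

    funcs = [(M, b) for M in product(range(m), repeat=w) for b in range(m)]

    for x in range(2 ** w):
        for y in range(2 ** w):
            if x == y:
                continue
            for u in range(m):
                for v in range(m):
                    hits = sum(h(M, b, x) == u and h(M, b, y) == v
                               for (M, b) in funcs)
                    assert hits * m * m == len(funcs)   # exactly 1/m^2
    print("strongly universal: verified for w = 3, ell = 2")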

(d) Prove that M⁺ is not 4-uniform.

Solution: h(3) = h(0) ⊕ h(1) ⊕ h(2) for all h ∈ M⁺: indeed, h(0) ⊕ h(1) ⊕ h(2) = b ⊕ (M_1 ⊕ b) ⊕ (M_2 ⊕ b) = M_1 ⊕ M_2 ⊕ b = h(3). „

Rubric: 1 point

(e) [Extra credit] Prove that M⁺ is actually 3-uniform.

Solution: Fix three distinct words x, y, z and three (not necessarily distinct) labels s, t, u. Because x, y, z are distinct, there must be two indices i ≠ j such that the bit-pairs (x_i, x_j), (y_i, y_j), (z_i, z_j) are also distinct. Fix all entries in M except those in columns M_i and M_j. Up to permutations of the variable names, there are only three cases to consider.

• Case 1: No (1, 1). Suppose x_i = x_j = y_j = z_i = 0 and y_i = z_j = 1. Then

\[ h(x) = \bar s \oplus b \qquad h(y) = \bar t \oplus M_i \oplus b \qquad h(z) = \bar u \oplus M_j \oplus b \]

for some fixed vectors s̄, t̄, and ū. It follows that h(x) = s and h(y) = t and h(z) = u if and only if

\[ b = s \oplus \bar s \qquad M_i \oplus b = t \oplus \bar t \qquad M_j \oplus b = u \oplus \bar u \]

and therefore (by solving the system of three mod-2 linear equations) if and only if

\[ M_i = s \oplus \bar s \oplus t \oplus \bar t \qquad M_j = s \oplus \bar s \oplus u \oplus \bar u \qquad b = s \oplus \bar s. \]

• Case 2: No (0, 1). Suppose x_i = x_j = y_j = 0 and y_i = z_i = z_j = 1. Then

\[ h(x) = \bar s \oplus b \qquad h(y) = \bar t \oplus M_i \oplus b \qquad h(z) = \bar u \oplus M_i \oplus M_j \oplus b \]

for some fixed vectors s̄, t̄, and ū. It follows that h(x) = s and h(y) = t and h(z) = u if and only if

\[ b = s \oplus \bar s \qquad M_i \oplus b = t \oplus \bar t \qquad M_i \oplus M_j \oplus b = u \oplus \bar u \]

and therefore if and only if

\[ M_i = s \oplus \bar s \oplus t \oplus \bar t \qquad M_j = t \oplus \bar t \oplus u \oplus \bar u \qquad b = s \oplus \bar s. \]

• Case 3: No (0, 0). Suppose x_j = y_i = 0 and x_i = y_j = z_i = z_j = 1. Then

\[ h(x) = \bar s \oplus M_i \oplus b \qquad h(y) = \bar t \oplus M_j \oplus b \qquad h(z) = \bar u \oplus M_i \oplus M_j \oplus b \]

for some fixed vectors s̄, t̄, and ū. It follows that h(x) = s and h(y) = t and h(z) = u if and only if

\[ M_i \oplus b = s \oplus \bar s \qquad M_j \oplus b = t \oplus \bar t \qquad M_i \oplus M_j \oplus b = u \oplus \bar u \]

and therefore if and only if

\[ M_i = t \oplus \bar t \oplus u \oplus \bar u \qquad M_j = s \oplus \bar s \oplus u \oplus \bar u \qquad b = s \oplus \bar s \oplus t \oplus \bar t \oplus u \oplus \bar u. \]

In all three cases, we have found unique values of M_i, M_j, and b such that h(x) = s and h(y) = t and h(z) = u. We conclude that Pr[(h(x) = s) ∧ (h(y) = t) ∧ (h(z) = u)] = 1/m³, as required. „

Rubric: Max 5 points extra credit. This is not the only correct proof, or even necessarily the
simplest proof.


2. (a) Prove that the item returned by GetOneSample(S) is chosen uniformly at random
from S.
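The pseudocode for GetOneSample, with its line marked (?), appears in the problem handout and is not reproduced here. As a reference point for the analysis below, here is a minimal Python sketch (my reconstruction) of the single-item reservoir-sampling routine it describes; the placement of line (?) is my assumption, chosen so that it executes with probability 1/ℓ in iteration ℓ, as parts (b)–(d) require.

    import random

    def get_one_sample(stream):
        """Reservoir sampling with a reservoir of size 1 (reconstruction)."""
        sample = None
        ell = 0
        for x in stream:
            ell += 1
            if random.randrange(ell) == 0:   # true with probability 1/ell
                sample = x                   # <-- line (?)
        return sample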
Solution: Let P(i, n) denote the probability that GetOneSample(S) returns the ith item in a given stream S of length n. We consider the behavior of the last iteration of the algorithm. With probability 1/n, GetOneSample returns the last item in the stream, and with probability (n − 1)/n, GetOneSample returns the recursively computed sample of the first n − 1 elements of S. Thus, we have the following recurrence for all positive integers i and n:

\[ P(i, n) = \begin{cases} 0 & \text{if } i > n \\ 1/n & \text{if } i = n \\ \frac{n-1}{n}\,P(i, n-1) & \text{if } i < n \end{cases} \]

The recurrence includes the implicit base case P(i, 0) = 0 for all i. The induction hypothesis implies that P(i, n − 1) = 1/(n − 1) for all i < n. It follows immediately that P(i, n) = 1/n for all i ≤ n, as required. „

Rubric: 3 points.

Solution: See our solution to problem 3 (with k = 1). „

Rubric: Full credit if and only if (1) the proposed algorithm for problem 3 is actually equivalent to GetOneSample when k = 1, (2) the proposed solution to problem 3 includes a correct proof of uniformity, and (3) the proposed solution to problem 3 doesn't refer to problem 2(a).

(b) What is the exact expected number of times that GetOneSample(S) executes line (?)?

Solution: Let X_i be the indicator variable that equals 1 if line (?) is executed in the ith iteration of the main loop; then X = Σ_i X_i is the total number of executions of line (?). The algorithm immediately gives us E[X_i] = Pr[X_i = 1] = 1/i, so linearity of expectation implies

\[ E[X] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} \frac{1}{i} = H_n. \] „

Rubric: 2 points. 1 point for “O(log n)”.
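As a quick sanity check (my own addition, assuming the reconstruction of GetOneSample sketched above and a stream length of n = 100), a simulation matches H_n:

    import random

    def executions(n):
        # Count how many times line (?) fires on a stream of length n.
        return sum(random.randrange(ell) == 0 for ell in range(1, n + 1))

    n, trials = 100, 200_000
    empirical = sum(executions(n) for _ in range(trials)) / trials
    exact = sum(1 / i for i in range(1, n + 1))          # H_n
    print(f"empirical {empirical:.3f}  vs  H_n = {exact:.3f}")   # both about 5.19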

(c) What is the exact expected value of ℓ when GetOneSample(S) executes line (?) for the last time?

Solution: Let ℓ* denote the value of ℓ when GetOneSample(S) executes line (?) for the last time. Part (a) immediately implies that ℓ* is uniformly distributed between 1 and n. It follows that E[ℓ*] = (n + 1)/2. „

Rubric: 2 points. 1 point for “O(n)”.


(d) What is the exact expected value of ℓ when GetOneSample(S) executes line (?) for the second time, or the algorithm ends, whichever happens first?

Solution: Let ℓ₂ denote the value of ℓ when GetOneSample(S) executes line (?) for the second time, or the algorithm ends, whichever happens first. To determine

\[ E[\ell_2] = \sum_{i=1}^{n} \Pr[\ell_2 \ge i], \]

we compute Pr[ℓ₂ ≥ i] for all i.
We immediately have Pr[ℓ₂ ≥ 1] = 1 and Pr[ℓ₂ ≥ i] = 0 for all i > n. For any i such that 2 ≤ i ≤ n, we have ℓ₂ ≥ i if and only if line (?) is not executed in iterations 2 through i − 1 of the main loop, so

\[ \Pr[\ell_2 \ge i] = \prod_{j=2}^{i-1} \frac{j-1}{j} = \frac{1}{i-1}. \]

We conclude that

\[ E[\ell_2] = \sum_{i=1}^{n} \Pr[\ell_2 \ge i] = 1 + \sum_{i=2}^{n} \frac{1}{i-1} = H_{n-1} + 1. \] „

Rubric: 3 points. This is not the only correct proof. 2 points for writing “O(log n)”; 1 point
for an exact answer that is incorrect but still O(log n).
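Again a quick simulation (mine; n = 100 assumed) agrees with H_{n−1} + 1. Note that the first execution of line (?) always happens at ℓ = 1, since Random(1) must return 1:

    import random

    def ell_at_second_execution(n):
        # First execution is always at ell = 1; find the next, or stop at n.
        for ell in range(2, n + 1):
            if random.randrange(ell) == 0:
                return ell
        return n   # line (?) never fired again before the stream ended

    n, trials = 100, 200_000
    empirical = sum(ell_at_second_execution(n) for _ in range(trials)) / trials
    exact = 1 + sum(1 / i for i in range(1, n))          # H_{n-1} + 1
    print(f"empirical {empirical:.3f}  vs  exact {exact:.3f}")   # both about 6.18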


3. (This is a continuation of the previous problem.) Describe and analyze an algorithm that
returns a subset of k distinct items chosen uniformly at random from a data stream of
length at least k. Prove your algorithm is correct.

Solution (O(k) time per element, 8/10): Intuitively, we can chain together k independent instances of GetOneSample, where the items rejected by each instance are treated as the input stream for the next instance. Folding these k instances together gives us the following algorithm, which runs in O(kn) time. The algorithm maintains a single array sample[1 .. k].

GetSamples(S, k):
    ℓ ← 0
    while S is not done
        x ← next item in S
        ℓ ← ℓ + 1
        for j ← 1 to min{k, ℓ}
            if Random(ℓ − j + 1) = 1
                swap x ↔ sample[j]
    return sample[1 .. k]

Correctness follows inductively from the correctness of GetOneSample as follows. The analysis in part (a) implies that sample[1] is equally likely to be any element of S. Let S′[1 .. n − 1] be the sequence of items rejected by the first instance; at each iteration ℓ ≥ 2, item S′[ℓ − 1] is either S[ℓ] (with probability 1 − 1/ℓ) or the previous value of sample[1] (with probability 1/ℓ). The output array sample[2 .. k] has the same distribution as the output array from GetSamples(S′, k − 1). Thus, the inductive hypothesis implies that sample[2 .. k] contains k − 1 distinct items chosen uniformly at random from S \ {sample[1]}. We conclude that sample[1 .. k] contains k distinct items chosen uniformly at random from S, as required. „
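For concreteness, here is a direct Python transcription of GetSamples (my own, using zero-based indices, with random.randrange(r) standing in for Random(r)):

    import random

    def get_samples(stream, k):
        """k chained single-item reservoirs, O(k) work per stream element."""
        sample = [None] * k
        ell = 0
        for x in stream:
            ell += 1
            for j in range(1, min(k, ell) + 1):
                # Instance j has now seen ell - j + 1 items, so it keeps the
                # current item with probability 1/(ell - j + 1).
                if random.randrange(ell - j + 1) == 0:
                    sample[j - 1], x = x, sample[j - 1]   # rejected item flows on
        return sample

    print(get_samples(iter(range(1, 101)), 5))   # five distinct items from 1..100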

Solution (O(1) time per element, 10/10): To develop the algorithm, let’s start with
a standard algorithm for a different problem. Recall from the lecture notes that the
Fisher-Yates algorithm randomly permutes an arbitrary array. (This algorithm is
called InsertionShuffle in the notes.)

FisherYates(S[1 .. n]):
    for ℓ ← 1 to n
        j ← Random(ℓ)
        swap S[j] ↔ S[ℓ]
    return S

We can prove that each of the n! possible permutations of the input array has the same probability of being output by FisherYates, by combining two observations:

• First, there are exactly n! possible outcomes of all the calls to Random. Specifically, the ℓth call to Random has ℓ possible outcomes, so the total number of outcomes is exactly ∏_{ℓ=1}^{n} ℓ = n!.

• Second, every permutation of S can be computed by FisherYates. Specifically, the location of S[n] in the output array gives us a unique outcome for the last call to Random, and (after undoing the last swap) the induction hypothesis implies that there is a sequence of Random outcomes that produces any desired permutation of S[1 .. n − 1]. (The base case n = 0 is trivial.)

In particular, each subset of k distinct elements of S has the same probability of appearing in the prefix S[1 .. k] of the output array.

Now suppose we modify the Fisher-Yates algorithm in two different ways. First, let the input be given in a stream rather than an explicit array; in particular, we do not know the value of n in advance. Second, we maintain only the first k elements of the output array.

FYSample(S, k):
    ℓ ← 0
    while S is not done
        x ← next item in S
        ℓ ← ℓ + 1
        j ← Random(ℓ)
        if ℓ ≤ k
            sample[ℓ] ← sample[j]
        if j ≤ k
            sample[j] ← x
    return sample[1 .. k]

The k elements produced by this modified algorithm have exactly the same
distribution as the first k elements of the output of FisherYates. Thus, FYSample
chooses a k-element subset of the stream uniformly at random, as required. The
algorithm runs in O(n) time, using only O(k) space. „
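Similarly, a Python rendering of FYSample (my sketch, zero-based):

    import random

    def fy_sample(stream, k):
        """Truncated streaming Fisher-Yates: keep only the first k output slots."""
        sample = []
        ell = 0
        for x in stream:
            ell += 1
            j = random.randrange(ell)        # uniform in {0, ..., ell - 1}
            if ell <= k:
                # Slots j and ell-1 are both kept, so perform the full swap.
                sample.append(x if j == ell - 1 else sample[j])
                if j != ell - 1:
                    sample[j] = x
            elif j < k:
                sample[j] = x                # old sample[j] would move past slot k
        return sample

    print(fy_sample(iter(range(1, 101)), 5))   # five distinct items from 1..100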

Rubric: 10 points = 5 for the algorithm + 5 for the proof. Max 8 points for O(kn)-time algorithm;
scale partial credit. These are neither the only correct algorithms nor the only correct proofs for
these algorithms.
