Programming Homework Help
In this problem we will work with the word RAM model, where we can
store each integer in a single word (or "byte") of computer memory. The
number of bits that can be stored in a single word is ⌈lg n⌉, so arithmetic
operations on words, i.e., on O(lg n)-bit numbers, take O(1) time. Finally, in
this model comparing two n-bit numbers takes O(n) time.
For this problem you can assume that you are provided with a black-box
prime number generator, and that it takes O(1) time to sample O(lg K)-bit
primes, for K polynomial in n. Also, you can assume the prime density
theorem, π(k) ∼ k/ln k, where π(k) denotes the number of distinct primes less than k.
(a) Define fp : {0, 1}^m → {0, 1, . . . , p − 1} as fp(X) := g(X) mod p, where p is a prime and
g : {0, 1}^m → Z is the function that converts an m-bit binary string to the
corresponding base-2 integer. Note that if X is equal to Y then fp(X) = fp(Y).
However, if X differs from Y, it can still be the case that fp(X) =
fp(Y). We will refer to cases where the function evaluates to the same value
on two different string inputs as false positives.
Hint: Consider the prime factorization of |g(X) − g(Y )|. Notice that the
number of prime factors is at most m.
Solution:
A false positive happens when m-bit strings X ≠ Y but fp(X) = fp(Y). This
implies that the integers g(X) and g(Y) are the same modulo p, i.e.,
g(X) − g(Y) ≡ 0 (mod p). This happens if and only if p is a prime factor of |g(X) − g(Y)|.
Thus, we should count the number of distinct prime factors of |g(X) − g(Y)|.
Let |g(X) − g(Y)| = q1 q2 · · · qc, where the qi are its (not necessarily distinct)
prime factors. Clearly qi ≥ 2, since 2 is the smallest prime, so |g(X) −
g(Y)| ≥ 2^c. Since g(X), g(Y) < 2^m, we have |g(X) − g(Y)| < 2^m and hence c < m.
Therefore |g(X) − g(Y)| has at most m prime factors, which is also the maximum
number of distinct prime factors it can have. When we sample a random prime from the set
P = {p1, p2, . . . , pt}, the probability of picking a prime factor of |g(X) − g(Y)|
and obtaining a false positive is at most m/t.
Even tighter bounds are possible, but they would not improve the asymptotic
running time of our algorithm in the subsequent parts.
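As a sanity check, here is a minimal Python sketch of the fingerprint from this part; the names `g` and `fp` follow the text, and the concrete strings and primes are illustrative:

```python
def g(bits: str) -> int:
    """Interpret an m-bit binary string as a base-2 integer."""
    return int(bits, 2)

def fp(bits: str, p: int) -> int:
    """Fingerprint from part (a): g(X) mod p."""
    return g(bits) % p

# A concrete false positive: X != Y, yet p = 3 divides g(X) - g(Y) = 9,
# so the fingerprints collide.
X, Y = "1010", "0001"
assert X != Y and fp(X, 3) == fp(Y, 3)
# A different prime resolves this pair: 5 does not divide 9.
assert fp(X, 5) != fp(Y, 5)
```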
(b) Let X(j) be a length-m substring of the target text that starts at position j, for a given
j ∈ {1, 2, . . . , n − m + 1}. Design a randomized algorithm that given fp(Y ) and
fp(X(j)) determines if there is a match between the pattern and the target text for a
given offset j ∈ {1, 2, . . . , n − m + 1}.
Solution:
Recall that a Monte Carlo algorithm is guaranteed to terminate quickly on any input,
but it has some probability of returning the wrong result, whereas a Las Vegas
algorithm is guaranteed to be correct but only terminates within some expected time, i.e., in
some cases it can take a long time. For this problem, we will present both types of
algorithms.
Monte Carlo version: Report "a match" if fp(Y) = fp(X(j)) and "no match" otherwise.
Thus, the overall error probability of our algorithm is at most m/t. The algorithm only
compares two numbers fp(Y ) and fp(X(j)), which are both at most p − 1. If we can
bound p to be poly(n,m), then it only takes O(log n) bits to represent both fp(Y ) and
fp(X(j)), and comparing them would take O(1) time in our model. Hence we have a
Monte Carlo algorithm that runs in O(1) time with an error probability of at most m/t.
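A minimal Python sketch of this Monte Carlo check (the fingerprints are assumed to be precomputed integers; the names are illustrative):

```python
def monte_carlo_match(fp_Y: int, fp_Xj: int) -> bool:
    """O(1) Monte Carlo check: declare a match iff the fingerprints agree.
    By part (a), it errs (a false positive) with probability at most m/t."""
    return fp_Y == fp_Xj

p = 7
assert monte_carlo_match(int("1010", 2) % p, int("1010", 2) % p)  # true match
# False positive: the 4-bit strings differ, but 10 ≡ 3 (mod 7).
assert monte_carlo_match(int("1010", 2) % p, int("0011", 2) % p)
```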
Las Vegas version:
If fp(Y) = fp(X(j)), compare g(X(j)) and g(Y) directly; if they are equal, return "a
match", and otherwise return "no match". If fp(Y) ≠ fp(X(j)), return "no match" immediately.
Analysis: The difference between this algorithm and the previous one is that when
the fingerprints fp(Y) and fp(X(j)) are the same, we must make sure the strings g(Y)
and g(X(j)) are actually the same. There is no way to guarantee this other than
to check them directly. Since g(Y) and g(X(j)) are m-bit numbers, this takes O(m)
time.
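The Las Vegas variant can be sketched as follows; per the analysis above, the direct comparison in the collision branch costs O(m):

```python
def las_vegas_match(Xj: str, Y: str, p: int) -> bool:
    """Always-correct match check: trust a fingerprint mismatch,
    but verify a fingerprint collision with a direct O(m) comparison."""
    if int(Xj, 2) % p != int(Y, 2) % p:
        return False          # fingerprints differ, so the strings differ
    return Xj == Y            # rule out a false positive directly

# The pair from part (a): fingerprints collide mod 3, but the strings differ,
# so the direct comparison catches the false positive.
assert las_vegas_match("1010", "1010", 3) is True
assert las_vegas_match("1010", "0001", 3) is False
```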
(c) Design a formula that given g(X(j)) computes g(X(j+1)), where X(j) is a length-m
substring of the target text that starts at position j, for a given j ∈ {1, 2, . . . , n−m+1}.
Use it to compute fp(X(j + 1)) from fp(X(j)). Note that the formula should depend on
X, j, and m.
Solution: We regard X(j) as an m-bit binary string whose leftmost bit is the
most significant, i.e., g(X(j)) = 2^(m−1) X_j + 2^(m−2) X_{j+1} + · · · + 2^0 X_{j+m−1}.
To get g(X(j+1)), we discard the original most significant bit X_j, shift the
remaining bits left by one position, and append the new least significant bit X_{j+m}.
Hence we have the following relation between g(X(j)) and g(X(j + 1)):

g(X(j + 1)) = 2 (g(X(j)) − 2^(m−1) X_j) + X_{j+m},

so

fp(X(j + 1)) = (2 (fp(X(j)) − 2^(m−1) X_j) + X_{j+m}) mod p.
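In Python, this rolling update can be sketched as follows (names illustrative; bits are the characters '0'/'1'):

```python
def next_fp(fp_j: int, dropped_bit: int, new_bit: int, m: int, p: int) -> int:
    """Part (c)'s update: fp(X(j+1)) = (2*(fp(X(j)) - 2^(m-1)*X_j) + X_{j+m}) mod p."""
    return (2 * (fp_j - pow(2, m - 1, p) * dropped_bit) + new_bit) % p

# Roll across a small bit string and check against direct recomputation.
X, m, p = "110101", 3, 7
fp_val = int(X[:m], 2) % p
for j in range(1, len(X) - m + 1):
    fp_val = next_fp(fp_val, int(X[j - 1]), int(X[j + m - 1]), m, p)
    assert fp_val == int(X[j:j + m], 2) % p
```

Note that 2^(m−1) is reduced mod p up front (via three-argument `pow`), so each update stays O(1) on O(lg n)-bit words.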
(d) Suppose that X(j) and Y differ at every string position. Give the best upper bound
you can on the expected number of positions j such that fp(X(j)) = fp(Y), where j ∈ {1,
2, . . . , n − m + 1}.
Solution: Let C_k be the indicator variable for the event fp(X(k)) = fp(Y). Then the total number
of false positives FP over all positions is

FP = C_1 + C_2 + · · · + C_{n−m+1}.

Since we are told that X(j) ≠ Y for all j (so g(X(j)) ≠ g(Y)), the result of part (a) gives E[C_i] ≤
m/t for every i. Hence, by linearity of expectation,

E[FP] = E[C_1] + · · · + E[C_{n−m+1}] ≤ (n − m + 1) m/t.
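For intuition, a small simulation (illustrative parameters, not from the problem) can compare the empirical number of false positives against the (n − m + 1)m/t bound:

```python
import random

def small_primes(limit: int) -> list[int]:
    """Sieve of Eratosthenes; stands in for the set P = {p_1, ..., p_t}."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, ok in enumerate(sieve) if ok]

random.seed(0)
primes = small_primes(1000)            # t primes
t, m = len(primes), 8
pattern = "1" * m
text = "".join(random.choice("01") for _ in range(64))
# Keep only windows that differ from the pattern, as part (d) assumes.
windows = [w for j in range(len(text) - m + 1)
           if (w := text[j:j + m]) != pattern]

trials, total = 2000, 0
for _ in range(trials):
    p = random.choice(primes)
    fp_pat = int(pattern, 2) % p
    total += sum(int(w, 2) % p == fp_pat for w in windows)

bound = len(windows) * m / t
print(f"empirical E[FP] ~ {total / trials:.3f}, bound = {bound:.3f}")
```

The empirical average should sit below the bound, which is loose because most windows share far fewer than m prime factors with the pattern's difference.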
(e) Solution: We will determine K (the upper bound on p) at the end; for now we
treat it as a variable. Our algorithm works as follows:
1. Sample a random prime from the range 1 to K. This can be accomplished in O(1) time
using the black box described in the introduction of the problem, or we can do it
ourselves as follows:
First we generate a random number from 1 to K, not necessarily prime; this
takes O(1) time in our model if K is poly(n, m). Typically, pseudorandom numbers
are generated from "seeds" using modular arithmetic, and all of this can be done in
O(1) time. We note here that deterministic primality-testing algorithms exist that run
in polylog(K) time; that is, given any number less than K, we can determine
whether it is prime or composite in polylog(K) time. Thus, to pick a random prime,
we sample a number uniformly at random, run the deterministic primality
test on it, and repeat until we obtain a prime. From the prime
density theorem, the probability that a sampled number is prime is about π(K)/K ≈
1/ln(K), so we expect O(log K) attempts before finding a prime.
This brings the total runtime of this step to O(log K) × polylog(K) = polylog(K).
2. Compute fp(Y) and fp(X(1)). Both g(Y) and g(X(1)) are m-bit numbers, so these
arithmetic operations take O(m) time.
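Step 1's rejection sampling can be sketched as follows; the trial-division test below is a simple stand-in for the polylog-time deterministic primality tests mentioned above:

```python
import random

def is_prime(x: int) -> bool:
    """Trial division; a placeholder for a fast deterministic test (e.g., AKS)."""
    if x < 2:
        return False
    d = 2
    while d * d <= x:
        if x % d == 0:
            return False
        d += 1
    return True

def sample_prime(K: int) -> int:
    """Draw uniform numbers in [2, K] until one is prime; by the prime
    density theorem this takes about ln(K) expected attempts."""
    while True:
        candidate = random.randint(2, K)
        if is_prime(candidate):
            return candidate
```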
3. Using fp(X(1)), compute all fp(X(j)) using the formula from part (c). We argued
that each additional value takes only O(1) time to compute, and there are n − m +
1 = O(n) of them in total, so this step takes O(n) time.
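Putting steps 1–3 together with part (b)'s verification, the whole Las Vegas matcher can be sketched as follows (names illustrative; `p` is assumed to have been sampled as in step 1):

```python
def rabin_karp_bits(text: str, pattern: str, p: int) -> list[int]:
    """Return all 0-indexed offsets where `pattern` occurs in `text`,
    using rolling fingerprints mod p and direct verification on hits."""
    n, m = len(text), len(pattern)
    if m > n:
        return []
    fp_Y = int(pattern, 2) % p
    fp_X = int(text[:m], 2) % p
    msb = pow(2, m - 1, p)          # 2^(m-1) mod p, precomputed once
    matches = []
    for j in range(n - m + 1):
        # Las Vegas step: verify a fingerprint hit with an O(m) comparison.
        if fp_X == fp_Y and text[j:j + m] == pattern:
            matches.append(j)
        if j + m < n:               # roll the fingerprint via part (c)'s formula
            fp_X = (2 * (fp_X - msb * int(text[j])) + int(text[j + m])) % p
    return matches

assert rabin_karp_bits("110101", "101", 7) == [1, 3]
```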
(f) Provide a bound for the probability that the running time is more than 100 times
the expected running time.
Solution: Let T be the random variable describing the running time of the
algorithm. By Markov's inequality, we have

Pr[T ≥ 100 E[T]] ≤ E[T] / (100 E[T]) = 1/100.