4101 Assignment 9
Answer(a)
Consider two different strings $x = \langle x_0, x_1, \ldots, x_\ell \rangle$ and $y = \langle y_0, y_1, \ldots, y_m \rangle$ (possibly with
different lengths):
(i) Why must there be some index k such that $0 \le k \le \min(\ell, m)$ and $x_k \ne y_k$?
Since x and y are different strings, they cannot agree at every position. Note that x
has indices $0, \ldots, \ell$ and y has indices $0, \ldots, m$, so $x_k$ and $y_k$ are both defined
exactly when $0 \le k \le \min(\ell, m)$. There are two cases:
• If the strings have the same length ($\ell = m$), every index lies in this range. At least
one index k must satisfy $x_k \ne y_k$, since otherwise x and y would be identical.
• If the strings have different lengths, say $\ell < m$, the shorter string ends at index
$\min(\ell, m) = \ell$. If the strings agreed at every index $0 \le k \le \ell$, then x would be a
proper prefix of y. Different lengths alone do not rule this out, so the claim relies
on the string encoding: for instance, if the last character of every string is a
terminator that appears nowhere else, then $x_\ell$ is the terminator while $y_\ell$ is not,
and $k = \ell$ works.
• Therefore, in both cases there is an index k with $0 \le k \le \min(\ell, m)$ and $x_k \ne y_k$.
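The case analysis above can be sketched in code. This is a minimal illustration, not part of the assignment: the helper names `first_difference` and `encode` are hypothetical, and the 0 terminator appended by `encode` is an assumption about how strings are encoded.

```python
def first_difference(x, y):
    """Return the smallest index k with x[k] != y[k], or None if the two
    sequences agree on every index of the shorter one."""
    # len(x) = l + 1 and len(y) = m + 1, so this range is 0..min(l, m).
    for k in range(min(len(x), len(y))):
        if x[k] != y[k]:
            return k
    return None


def encode(s):
    # Assumed encoding: ASCII codes followed by a 0 terminator
    # that appears nowhere else in the string.
    return [ord(c) for c in s] + [0]
```

With the terminator, `first_difference(encode("ab"), encode("abc"))` returns 2, because the terminator of the shorter string is compared against `c`. Without it, `first_difference([97, 98], [97, 98, 99])` returns `None`, which is the proper-prefix case.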
(ii) Show that h(x) = h(y) if and only if
\[ a_k (x_k - y_k) \bmod p = \Bigl( \sum_{i=0,\, i \ne k}^{m} a_i y_i - \sum_{i=0,\, i \ne k}^{\ell} a_i x_i \Bigr) \bmod p. \]
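We are given two strings $x = \langle x_0, x_1, \ldots, x_\ell \rangle$ and $y = \langle y_0, y_1, \ldots, y_m \rangle$, and we need to
show that the hash values h(x) and h(y) are equal if and only if the condition
\[ a_k (x_k - y_k) \bmod p = \Bigl( \sum_{i=0,\, i \ne k}^{m} a_i y_i - \sum_{i=0,\, i \ne k}^{\ell} a_i x_i \Bigr) \bmod p \]
holds. For the forward direction, assume h(x) = h(y), that is,
$\sum_{i=0}^{\ell} a_i x_i \equiv \sum_{i=0}^{m} a_i y_i \pmod{p}$.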
We now separate the sums at the index k where $x_k \ne y_k$. Let's write the sum for each
side, taking care to exclude the term at k:
\[ \sum_{i=0}^{\ell} a_i x_i = a_k x_k + \sum_{i=0,\, i \ne k}^{\ell} a_i x_i \]
and
\[ \sum_{i=0}^{m} a_i y_i = a_k y_k + \sum_{i=0,\, i \ne k}^{m} a_i y_i. \]
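Substituting these into $\sum_{i=0}^{\ell} a_i x_i - \sum_{i=0}^{m} a_i y_i \equiv 0 \pmod{p}$, which is what
h(x) = h(y) means, gives:
\[ a_k (x_k - y_k) + \sum_{i=0,\, i \ne k}^{\ell} a_i x_i - \sum_{i=0,\, i \ne k}^{m} a_i y_i \equiv 0 \pmod{p}. \]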
Rearranging terms:
\[ a_k (x_k - y_k) \equiv \Bigl( \sum_{i=0,\, i \ne k}^{m} a_i y_i - \sum_{i=0,\, i \ne k}^{\ell} a_i x_i \Bigr) \pmod{p}. \]
This completes the proof for the forward direction: if h(x) = h(y), then the condition
holds.
Now, we prove the reverse direction: if the condition holds, then h(x) = h(y).
Assume that:
\[ a_k (x_k - y_k) \equiv \Bigl( \sum_{i=0,\, i \ne k}^{m} a_i y_i - \sum_{i=0,\, i \ne k}^{\ell} a_i x_i \Bigr) \pmod{p}. \]
Moving the right-hand side to the left and absorbing the terms $a_k x_k$ and $a_k y_k$ back
into their sums:
\[ \sum_{i=0}^{\ell} a_i x_i - \sum_{i=0}^{m} a_i y_i \equiv 0 \pmod{p}. \]
Thus:
\[ \Bigl( \sum_{i=0}^{\ell} a_i x_i \Bigr) \bmod p = \Bigl( \sum_{i=0}^{m} a_i y_i \Bigr) \bmod p. \]
This shows that h(x) = h(y), completing the proof for the reverse direction.
Therefore, we have shown that h(x) = h(y) if and only if the condition
\[ a_k (x_k - y_k) \bmod p = \Bigl( \sum_{i=0,\, i \ne k}^{m} a_i y_i - \sum_{i=0,\, i \ne k}^{\ell} a_i x_i \Bigr) \bmod p \]
holds.
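As a sanity check on the equivalence, the following sketch compares both sides for every choice of coefficients with a small modulus. The function names `h`, `condition` and `check_equivalence` are hypothetical, and the tiny value of p is chosen only to keep the enumeration small; it is not the p > 128 of the assignment.

```python
from itertools import product


def h(a, x, p):
    # Hash of the sequence x under coefficients a: (sum of a_i * x_i) mod p.
    return sum(ai * xi for ai, xi in zip(a, x)) % p


def condition(a, x, y, k, p):
    # Left side: a_k (x_k - y_k) mod p.
    lhs = (a[k] * (x[k] - y[k])) % p
    # Right side: the two sums with the term at index k left out.
    rhs = (sum(a[i] * y[i] for i in range(len(y)) if i != k)
           - sum(a[i] * x[i] for i in range(len(x)) if i != k)) % p
    return lhs == rhs


def check_equivalence(x, y, k, p):
    # Try every coefficient vector of the needed length.
    n = max(len(x), len(y))
    return all((h(a, x, p) == h(a, y, p)) == condition(a, x, y, k, p)
               for a in product(range(p), repeat=n))
```

For example, `check_equivalence([3, 1, 4], [3, 2, 4, 1], 1, 5)` enumerates all 625 coefficient vectors and confirms that the two statements agree on each of them.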
(ii) Where did you use the assumption that the ai ’s are independent?
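Independence is used when the condition in (ii) is read as an equation in $a_k$ alone.
The right-hand side depends only on the coefficients $a_i$ with $i \ne k$, so once those are
fixed it is a constant. Because $a_k$ is chosen independently of the other coefficients,
fixing them does not change its distribution: $a_k$ is still uniform on $\{0, 1, \ldots, p-1\}$.
If the coefficients were not independent, $a_k$ could be correlated with the right-hand
side, and collisions could become more likely.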
(b) Where, exactly, in part (a) did you use the assumption that p > 128?
Where, exactly, did you use the assumption that the ai ’s are independent?
In part (a), the assumption that p > 128 was used in the following context:
- Each character is encoded in ASCII, which uses values between 0 and 127. Since
$x_k \ne y_k$, the difference satisfies $0 < |x_k - y_k| \le 127 < p$, so $x_k - y_k$ is not a
multiple of p. This keeps the factor $(x_k - y_k)$ in $a_k (x_k - y_k) \bmod p$ nonzero modulo
p, so the condition in (a)(ii) genuinely constrains $a_k$. If p ≤ 128, two different
characters could differ by a multiple of p; the left-hand side would then be 0 for
every choice of $a_k$, and two strings that differ only in such a position would always
collide.
As for the assumption that the $a_i$'s are independent:
- The random numbers $a_0, a_1, \ldots, a_{p-1}$ are chosen independently, so the value of $a_k$
is not affected by the values of the other coefficients. This is used in part (a)(ii):
the right-hand side of the condition involves only the $a_i$ with $i \ne k$, and independence
lets us fix those values while still treating $a_k$ as uniformly random. If the $a_i$'s were
not independent, $a_k$ could be correlated with the right-hand side, which could increase
the likelihood of collisions between different strings.
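To make the role of p > 128 concrete, the sketch below counts how many values of $a_k$ satisfy $a_k d \equiv c \pmod{p}$, where d stands for $x_k - y_k$ and c for the right-hand side of the condition. The function name `count_solutions` and the sample moduli 131 and 97 are illustrative assumptions (both are prime), not values from the assignment.

```python
def count_solutions(d, c, p):
    """Count the values a in {0, ..., p-1} with a * d congruent to c mod p."""
    return sum(1 for a in range(p) if (a * d - c) % p == 0)


# With a prime p = 131 > 128, every nonzero ASCII difference d (|d| <= 127)
# leaves exactly one admissible value of a_k.
unique_for_131 = all(count_solutions(d, c, 131) == 1
                     for d in range(-127, 128) if d != 0
                     for c in (0, 1, 57, 130))

# With p = 97 <= 128, the difference d = 97 is a multiple of p:
# the equation has no solution for c != 0 and every a works for c = 0.
none_for_97 = count_solutions(97, 5, 97)
all_for_97 = count_solutions(97, 0, 97)
```

The exactly-one count in the first case also relies on p being prime, which is an assumption of this sketch.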
(i) How should Georgy store the coefficients a0 , a1 , . . .? When should each ai
be chosen?
Georgy should store the coefficients $a_0, a_1, \ldots$ in a growable array that belongs to
the hash table and is shared by every insertion and search. Each value should be selected
independently and uniformly at random from the set {0, 1, 2, . . . , p − 1}.
There's no need to choose all the $a_i$'s in advance: $a_i$ can be chosen the first time a
string long enough to need it is hashed, and appended to the array at that point. Once
chosen, however, a coefficient must never be changed or discarded. Searching for a
string requires recomputing exactly the same hash value that was used when it was
inserted, and that is only possible if the same coefficients are still available.
Therefore, Georgy can pick each $a_i$ lazily, when it is first needed, but must store it
permanently alongside the table.
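A minimal sketch of this storage scheme follows, assuming chaining with Python lists as buckets. The class name `StringHashTable`, its method names, the `seed` parameter, and the 0 terminator appended before hashing are all illustrative assumptions rather than details taken from the assignment.

```python
import random


class StringHashTable:
    def __init__(self, p, seed=None):
        self.p = p
        self.coeffs = []                       # a_0, a_1, ...; kept for the table's lifetime
        self.buckets = [[] for _ in range(p)]  # one chain per hash value
        self.rng = random.Random(seed)

    def _coeff(self, i):
        # Choose coefficients lazily, the first time index i is needed.
        while len(self.coeffs) <= i:
            self.coeffs.append(self.rng.randrange(self.p))
        return self.coeffs[i]

    def hash_value(self, s):
        # Assumed encoding: ASCII codes followed by a 0 terminator.
        codes = [ord(c) for c in s] + [0]
        return sum(self._coeff(i) * c for i, c in enumerate(codes)) % self.p

    def insert(self, s):
        bucket = self.buckets[self.hash_value(s)]
        if s not in bucket:
            bucket.append(s)

    def contains(self, s):
        # Reuses the stored coefficients, so the hash matches the one used at insertion.
        return s in self.buckets[self.hash_value(s)]
```

Because `coeffs` only ever grows, hashing a longer string later extends the array without changing the hash of any string already stored.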
(ii) What is the expected time to search for a string of length ℓ in the hash
table, and how can you justify this?
Once the hash table for the set S is built, the expected time to perform a search for
a string of length ℓ can be broken down into two parts: the time to compute the hash
function and the time to search within the bucket.
1. **Time to Compute the Hash Function**: The hash function h(x) involves computing
the sum $\sum_{i=0}^{\ell} a_i x_i$ modulo p. Since each arithmetic operation and comparison in
{0, 1, 2, . . . , p − 1} is constant time, the time to compute the hash is proportional to the
length of the string ℓ, i.e., O(ℓ).
2. **Time to Search in the Bucket**: After computing the hash, we check the corresponding
bucket for the string. Since chaining is used to resolve collisions, the search
within a bucket involves traversing a linked list of strings. If each stored string
other than the query lands in the query's bucket with probability at most 1/p, then by
linearity of expectation the expected number of such strings in that bucket is at most
n/p, where n is the number of strings and p is the size of the hash table.
The expected time to search within the bucket is thus proportional to the expected
number of elements in the bucket, i.e., O(n/p), counting each comparison against a
stored string as one step. If a comparison is instead charged up to O(ℓ) for a
character-by-character check, this term becomes O(ℓ · n/p).
Combining both parts, the total expected time to search for a string of length ℓ is:
\[ O(\ell) + O\!\left(\frac{n}{p}\right). \]
Thus, the expected time to perform the search is $O(\ell + n/p)$, where $O(\ell)$ accounts for
the hash computation time, and $O(n/p)$ accounts for the search within the bucket.
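The two cost terms can be made visible with a small sketch that reports them separately for one search. It assumes fixed coefficients chosen up front for strings of at most `max_len` characters; the function names `build_table`, `h` and `search`, and the parameters `max_len` and `seed`, are hypothetical.

```python
import random


def h(s, a, p):
    # One multiplication and addition per character: O(len(s)) work.
    return sum(a[i] * ord(c) for i, c in enumerate(s)) % p


def build_table(strings, p, max_len, seed=0):
    rng = random.Random(seed)
    a = [rng.randrange(p) for _ in range(max_len)]  # fixed coefficients
    buckets = [[] for _ in range(p)]
    for s in strings:
        buckets[h(s, a, p)].append(s)
    return a, buckets


def search(s, a, buckets, p):
    """Return (found, hash_steps, comparisons) for one lookup."""
    comparisons = 0
    for t in buckets[h(s, a, p)]:
        comparisons += 1          # one step per stored string examined
        if t == s:
            return True, len(s), comparisons
    return False, len(s), comparisons
```

Here `hash_steps` corresponds to the O(ℓ) term and `comparisons` to the bucket term; averaged over all p buckets, the chain length is exactly n/p.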