4101 Assignment 9

The document outlines a homework assignment for York University's EECS 4101/5101 course, focusing on designing a hash table for storing strings with a specific hash function to minimize collisions. It includes problems related to the properties of the hash function, assumptions about prime number sizes, and the independence of coefficients. Additionally, it discusses the expected time complexity for searching strings in the hash table using chaining for collision resolution.

Uploaded by

rajpunjabi47

York University EECS 4101/5101

November 22, 2024


Homework Assignment #9
Due: November 29, 2024 at 5:00 p.m.
November 29, 2024

Problem 1: Georgy’s Hash Table


Georgy is working on designing a hash table to store strings of various lengths. Since the
length of each string may vary, Georgy is trying to devise a general method for selecting a
hash function such that the probability of collision between any two strings is minimized.
A string x of length ℓ is represented as a sequence of characters ⟨x0 , x1 , . . . , xℓ ⟩, where
the character xℓ is a special End of Text (ETX) character that is not allowed to appear
earlier in the string. Each character is encoded in ASCII, with values between 0 and 127
(with ETX encoded as 3).
For simplicity, Georgy decides to use a hash table size p, where p is a prime number
greater than 128. Georgy generates random numbers a0 , a1 , a2 , . . ., where each ai ∈
{0, 1, 2, . . . , p − 1}, and defines the hash function h as:

\[ h(\langle x_0, x_1, \ldots, x_\ell \rangle) = \left( \sum_{i=0}^{\ell} a_i x_i \right) \bmod p \]
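As a rough illustration of the definition above, the following Python sketch encodes a string with a trailing ETX and evaluates h for a given coefficient list. The names `encode` and `h`, the choice p = 131, and the fixed seed are assumptions made for this example only; they are not part of the assignment.

```python
import random

ETX = 3  # ASCII code of the End of Text terminator


def encode(s: str) -> list[int]:
    """Return the ASCII codes of s followed by the ETX terminator."""
    codes = [ord(c) for c in s]
    if any(c > 127 or c == ETX for c in codes):
        raise ValueError("characters must be 7-bit ASCII and must not be ETX")
    return codes + [ETX]


def h(codes: list[int], a: list[int], p: int) -> int:
    """Compute (sum of a[i] * x_i) mod p over every position of the string."""
    if len(a) < len(codes):
        raise ValueError("not enough coefficients for this string")
    return sum(ai * xi for ai, xi in zip(a, codes)) % p


p = 131                     # a prime greater than 128 (assumed for the example)
rng = random.Random(0)      # fixed seed only so the sketch is repeatable
a = [rng.randrange(p) for _ in range(16)]  # covers strings of length up to 15

bucket = h(encode("hello"), a, p)  # some bucket index in 0..p-1
```

The value of `bucket` depends on the randomly drawn coefficients, so only its range (0 to p − 1) is predictable.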

Answer (a)
Consider two different strings x = ⟨x0 , x1 , . . . , xℓ ⟩ and y = ⟨y0 , y1 , . . . , ym ⟩ (possibly with
different lengths):

(i) Why must there be some index k such that $0 \le k \le \min(\ell, m)$ and $x_k \neq y_k$?
Assume without loss of generality that $\ell \le m$, so $\min(\ell, m) = \ell$, and suppose for contradiction that $x_k = y_k$ for every index $0 \le k \le \ell$. There are two cases:
• If $\ell = m$, the two strings agree at every position, so x = y. This contradicts the assumption that x and y are different.
• If $\ell < m$, then $y_\ell = x_\ell = \text{ETX}$. But ETX may appear in y only at its last position m, and $\ell < m$, so this is also a contradiction.
Either way, some index k with $0 \le k \le \min(\ell, m)$ must satisfy $x_k \neq y_k$. The bound includes $\min(\ell, m)$ itself because the shorter string ends at index $\min(\ell, m)$, where it holds ETX, while a strictly longer string holds an ordinary character at that index.
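The existence of such an index can also be checked mechanically. The sketch below is a hypothetical helper, not part of the assignment: it scans positions 0 through min(ℓ, m) and returns the first mismatch, assuming both inputs are code lists that end in ETX and contain ETX nowhere else.

```python
ETX = 3  # ASCII code of the End of Text terminator


def first_difference(x: list[int], y: list[int]) -> int:
    """Return the smallest k with x[k] != y[k].

    Assumes both sequences end in ETX and contain it nowhere else, so a
    mismatch always exists within the shorter length unless x == y.
    """
    # len(x) = l + 1 and len(y) = m + 1, so this range covers 0..min(l, m).
    for k in range(min(len(x), len(y))):
        if x[k] != y[k]:
            return k
    raise ValueError("the two strings are identical")


# "a" versus "ab": the mismatch is at index 1, where "a" holds ETX.
k = first_difference([97, ETX], [97, 98, ETX])
```

When one string is a prefix of the other, the mismatch falls exactly at index min(ℓ, m), which is why the bound in the question is inclusive.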
(ii) Show that h(x) = h(y) if and only if
\[ a_k (x_k - y_k) \bmod p = \left( \sum_{i=0,\, i \neq k}^{m} a_i y_i - \sum_{i=0,\, i \neq k}^{\ell} a_i x_i \right) \bmod p. \]
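We are given two strings $x = \langle x_0, x_1, \ldots, x_\ell \rangle$ and $y = \langle y_0, y_1, \ldots, y_m \rangle$, and an index k with $x_k \neq y_k$ as in part (i). We need to show that the hash values h(x) and h(y) are equal if and only if the condition
\[ a_k (x_k - y_k) \bmod p = \left( \sum_{i=0,\, i \neq k}^{m} a_i y_i - \sum_{i=0,\, i \neq k}^{\ell} a_i x_i \right) \bmod p \]
holds for this index k.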


The hash function h(x) for the string x is defined as:
\[ h(x) = \left( \sum_{i=0}^{\ell} a_i x_i \right) \bmod p. \]
Similarly, for the string y, the hash function is:
\[ h(y) = \left( \sum_{i=0}^{m} a_i y_i \right) \bmod p. \]

Now, let us assume that h(x) = h(y). This means:
\[ \left( \sum_{i=0}^{\ell} a_i x_i \right) \bmod p = \left( \sum_{i=0}^{m} a_i y_i \right) \bmod p. \]
This can be rewritten as:
\[ \sum_{i=0}^{\ell} a_i x_i - \sum_{i=0}^{m} a_i y_i \equiv 0 \pmod p. \]

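We now separate the sums at the index k where $x_k \neq y_k$. Since $k \le \min(\ell, m)$, the term at k appears in both sums, and we can write each sum with that term pulled out:
\[ \sum_{i=0}^{\ell} a_i x_i = a_k x_k + \sum_{i=0,\, i \neq k}^{\ell} a_i x_i \]
and
\[ \sum_{i=0}^{m} a_i y_i = a_k y_k + \sum_{i=0,\, i \neq k}^{m} a_i y_i. \]
Thus, the difference becomes:
\[ \left( a_k x_k + \sum_{i=0,\, i \neq k}^{\ell} a_i x_i \right) - \left( a_k y_k + \sum_{i=0,\, i \neq k}^{m} a_i y_i \right) \equiv 0 \pmod p. \]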

Simplifying the expression:
\[ a_k (x_k - y_k) + \sum_{i=0,\, i \neq k}^{\ell} a_i x_i - \sum_{i=0,\, i \neq k}^{m} a_i y_i \equiv 0 \pmod p. \]
Rearranging terms:
\[ a_k (x_k - y_k) \equiv \sum_{i=0,\, i \neq k}^{m} a_i y_i - \sum_{i=0,\, i \neq k}^{\ell} a_i x_i \pmod p. \]

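This completes the proof for the forward direction: if h(x) = h(y), then the condition holds.
Now, we prove the reverse direction: if the condition holds, then h(x) = h(y).
Assume that:
\[ a_k (x_k - y_k) \equiv \sum_{i=0,\, i \neq k}^{m} a_i y_i - \sum_{i=0,\, i \neq k}^{\ell} a_i x_i \pmod p. \]
Adding $a_k y_k + \sum_{i=0,\, i \neq k}^{\ell} a_i x_i$ to both sides gives:
\[ a_k x_k + \sum_{i=0,\, i \neq k}^{\ell} a_i x_i \equiv a_k y_k + \sum_{i=0,\, i \neq k}^{m} a_i y_i \pmod p, \]
that is, $\sum_{i=0}^{\ell} a_i x_i \equiv \sum_{i=0}^{m} a_i y_i \pmod p$. Thus:
\[ \left( \sum_{i=0}^{\ell} a_i x_i \right) \bmod p = \left( \sum_{i=0}^{m} a_i y_i \right) \bmod p. \]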

This shows that h(x) = h(y), completing the proof for the reverse direction.
Therefore, we have shown that h(x) = h(y) if and only if the condition
\[ a_k (x_k - y_k) \bmod p = \left( \sum_{i=0,\, i \neq k}^{m} a_i y_i - \sum_{i=0,\, i \neq k}^{\ell} a_i x_i \right) \bmod p \]
holds for the index k.


(b) Assumptions and Dependencies


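(i) Where did you use the assumption that p > 128?
The assumption guarantees that $x_k - y_k \not\equiv 0 \pmod p$. Both characters are ASCII values between 0 and 127 and they differ, so $1 \le |x_k - y_k| \le 127 < p$, and no nonzero integer of absolute value below p is a multiple of p. Because p is prime, $x_k - y_k$ then has a multiplicative inverse modulo p, which is what makes the equation for $a_k$ have exactly one solution. If p were at most 127, two different characters could differ by a multiple of p, and the coefficient $a_k$ would no longer separate them.

(ii) Where did you use the assumption that the $a_i$'s are independent?
Independence lets us fix the values of all $a_i$ with $i \neq k$, which turns the right-hand side of the condition in (a)(ii) into a constant, and still treat $a_k$ as uniformly distributed over $\{0, 1, \ldots, p-1\}$. If $a_k$ depended on the other coefficients, conditioning on them could change its distribution, and the probability that it equals the one required value would no longer have to be $1/p$.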

(iii) Show that the probability that h(x) = h(y) is $1/p$.

The hash values h(x) and h(y) depend on the random coefficients $a_0, a_1, \ldots$, each chosen independently and uniformly at random from $\{0, 1, 2, \ldots, p-1\}$.
Let k be an index with $x_k \neq y_k$, which exists by part (i). By part (ii), h(x) = h(y) holds exactly when
\[ a_k (x_k - y_k) \equiv c \pmod p, \quad \text{where } c = \sum_{i=0,\, i \neq k}^{m} a_i y_i - \sum_{i=0,\, i \neq k}^{\ell} a_i x_i. \]
Fix the values of all $a_i$ with $i \neq k$; then c is a constant. Since $1 \le |x_k - y_k| \le 127 < p$ and p is prime, $x_k - y_k$ has a multiplicative inverse modulo p, so the equation has exactly one solution in $\{0, 1, \ldots, p-1\}$, namely $a_k \equiv c \, (x_k - y_k)^{-1} \pmod p$.
Because $a_k$ is uniform over p values and independent of the other coefficients, it equals this solution with probability $1/p$, whatever values the other coefficients take.
Therefore, the probability that h(x) = h(y) is $1/p$.
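The counting step in this argument can be checked by brute force. The sketch below fixes every coefficient except the one at index k, tries all p values for that coefficient, and compares the colliding values with the closed form c · (x_k − y_k)⁻¹ mod p. The helper name `colliding_values`, the two sample strings, p = 131, and the seed are assumptions chosen for illustration. The three-argument `pow` with exponent −1 requires Python 3.8 or later.

```python
import random


def colliding_values(x, y, a, k, p):
    """Return every v in 0..p-1 for which setting a[k] = v makes h(x) == h(y)."""
    def h(codes, coeffs):
        return sum(c * s for c, s in zip(coeffs, codes)) % p

    hits = []
    for v in range(p):
        trial = a[:k] + [v] + a[k + 1:]  # all other coefficients stay fixed
        if h(x, trial) == h(y, trial):
            hits.append(v)
    return hits


p = 131
x = [97, 3]       # "a" followed by ETX
y = [97, 98, 3]   # "ab" followed by ETX
k = 1             # first index where the strings differ (3 versus 98)
rng = random.Random(1)
a = [rng.randrange(p) for _ in range(3)]

# c collects every term except the one at index k, as in part (ii).
c = (sum(a[i] * y[i] for i in range(len(y)) if i != k)
     - sum(a[i] * x[i] for i in range(len(x)) if i != k)) % p
predicted = c * pow((x[k] - y[k]) % p, -1, p) % p

print(colliding_values(x, y, a, k, p) == [predicted])  # True
```

Exactly one of the 131 candidate values collides, which matches the 1/p probability.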

(b) Where, exactly, in part (a) did you use the assumption that p > 128?
Where, exactly, did you use the assumption that the ai ’s are independent?
In part (a), the assumption that p > 128 was used in exactly one step: solving $a_k (x_k - y_k) \equiv c \pmod p$ for $a_k$, where c is the right-hand side of the condition in (a)(ii).
- Each character is encoded in ASCII with a value between 0 and 127, so two different characters satisfy $1 \le |x_k - y_k| \le 127$. With p > 128, this difference cannot be a multiple of p, so it is nonzero modulo p, and because p is prime it has a multiplicative inverse. That gives exactly one value of $a_k$ that causes a collision. If $p \le 127$, the difference could be a multiple of p (for example, p = 5 with $x_k = 70$ and $y_k = 65$). The left-hand side is then 0 for every $a_k$, and the collision probability given the other coefficients is either 0 or 1, depending on c, rather than $1/p$.
As for the assumption that the $a_i$'s are independent:
- Independence was used when the coefficients $a_i$ with $i \neq k$ were fixed to make c a constant. Because $a_k$ is independent of them, it remains uniformly distributed over $\{0, 1, \ldots, p-1\}$ after this conditioning, so it hits the single colliding value with probability exactly $1/p$. If the $a_i$'s were not independent, knowing the other coefficients could change the distribution of $a_k$. In the extreme case where all coefficients are equal, any two strings that are rearrangements of each other would always collide.
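Both assumptions can be probed with a small exhaustive experiment. The sketch below counts collisions over every possible coefficient vector, first with a prime that is too small and then with p = 131, and finally with "tied" coefficients that all share one random value. The function names and test strings are hypothetical choices for this illustration.

```python
from itertools import product


def h(codes, a, p):
    """(sum of a[i] * codes[i]) mod p."""
    return sum(ai * ci for ai, ci in zip(a, codes)) % p


def collisions_independent(x, y, p):
    """Count colliding coefficient vectors when every a_i ranges freely."""
    n = max(len(x), len(y))
    hits = sum(h(x, a, p) == h(y, a, p) for a in product(range(p), repeat=n))
    return hits, p ** n


def collisions_tied(x, y, p):
    """Count collisions when all coefficients equal one shared value v."""
    n = max(len(x), len(y))
    hits = sum(h(x, [v] * n, p) == h(y, [v] * n, p) for v in range(p))
    return hits, p


f, a_ = [70, 3], [65, 3]              # "F" and "A", each followed by ETX
print(collisions_independent(f, a_, 5))    # (25, 25): 70 - 65 is a multiple of 5
print(collisions_independent(f, a_, 131))  # (131, 17161): exactly 1/131

ab, ba = [97, 98, 3], [98, 97, 3]     # "ab" and "ba", each followed by ETX
print(collisions_tied(ab, ba, 131))        # (131, 131): always collide
```

With p = 5 the two strings collide for every choice of coefficients, and with fully dependent coefficients two anagrams collide for every choice as well. Only the combination of p > 128 and independent coefficients gives the 1/p rate here.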

(c) Storing Strings in a Hash Table


Now, Georgy wants to store a set S of n strings in a hash table of size p, where p > n.
He will use chaining to resolve collisions. Answer the following questions:

(i) How should Georgy store the coefficients a0 , a1 , . . .? When should each ai
be chosen?
Georgy should store the coefficients in an array (or growable list) that is kept for the lifetime of the hash table, so that each $a_i$ can be looked up in constant time. The same coefficients must be used for every insertion and every search: a search for x has to recompute exactly the value h(x) that was used when x was inserted, and the $1/p$ collision bound from part (a) applies to two strings hashed with the same $a_i$'s.
There is no need to choose all the $a_i$'s in advance, and since the strings have no fixed maximum length it is not possible to do so. Instead, each $a_i$ should be chosen, independently and uniformly at random from $\{0, 1, 2, \ldots, p-1\}$, the first time a string of length at least i is hashed, whether for an insertion or a search. Once chosen, $a_i$ is stored and never changed.
With this scheme, the number of stored coefficients grows only with the longest string hashed so far, and hashing a string of length ℓ still takes O(ℓ) time.
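Because a later search has to reproduce the hash value used at insertion time, the coefficients need to persist between operations. One way to sketch this is a small store that draws each a_i on first use and then keeps it. The class name `Coefficients`, its methods, and the seed parameter are assumptions for illustration, not part of the assignment.

```python
import random


class Coefficients:
    """Lazily generated, permanently stored coefficients a_0, a_1, ..."""

    def __init__(self, p: int, seed=None):
        self.p = p
        self._rng = random.Random(seed)
        self._a = []  # a_i lives at index i and is never regenerated

    def get(self, i: int) -> int:
        # Draw new coefficients only when index i is requested for the first time.
        while len(self._a) <= i:
            self._a.append(self._rng.randrange(self.p))
        return self._a[i]

    def hash(self, codes) -> int:
        """(sum of a_i * x_i) mod p, extending the stored list if needed."""
        return sum(self.get(i) * c for i, c in enumerate(codes)) % self.p


coeffs = Coefficients(131, seed=0)
first = coeffs.hash([97, 98, 3])            # draws a_0, a_1, a_2
coeffs.hash([104, 101, 108, 108, 111, 3])   # draws a_3, a_4, a_5 as well
again = coeffs.hash([97, 98, 3])            # reuses the stored a_0..a_2
```

Hashing the same string twice gives the same bucket because the earlier coefficients are reused, not redrawn.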

(ii) What is the expected time to search for a string of length ℓ in the hash
table, and how can you justify this?
Once the hash table for the set S is built, the expected time to search for a string x of length ℓ can be broken down into two parts: the time to compute the hash function and the time to search within the bucket.
1. **Time to Compute the Hash Function**: Computing $h(x) = \left( \sum_{i=0}^{\ell} a_i x_i \right) \bmod p$ takes ℓ + 1 multiplications and additions. Since each arithmetic operation on values in $\{0, 1, 2, \ldots, p-1\}$ takes constant time, the hash is computed in O(ℓ) time.
2. **Time to Search in the Bucket**: After computing the hash, we traverse the linked list in bucket h(x) and compare x against each string stored there. By part (a), each string $y \in S$ with $y \neq x$ lands in this bucket with probability $1/p$, so by linearity of expectation the expected number of such strings is at most $n/p$, which is less than 1 because p > n. Comparing x with a stored string stops at the first mismatch, which occurs within the first ℓ + 1 characters, so each comparison takes O(ℓ) time.
The expected time to search within the bucket is therefore $O(\ell \, (1 + n/p))$, where the 1 accounts for x itself if it is stored in the table.
Combining both parts, the total expected time to search for a string of length ℓ is:
\[ O(\ell) + O\!\left( \ell \left( 1 + \frac{n}{p} \right) \right) = O(\ell), \]
where the first O(ℓ) accounts for the hash computation and the final simplification uses $n/p < 1$.
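The bound on the bucket length can be justified with linearity of expectation and the $1/p$ collision probability from part (a). The following is a worked sketch, where the indicator notation $C_y$ and the assumption that one string comparison costs O(ℓ) are introduced here for illustration:

```latex
% C_y = 1 if h(y) = h(x), and 0 otherwise, for each y in S with y != x.
\begin{align*}
\mathbb{E}\bigl[\#\{\, y \in S : y \neq x,\ h(y) = h(x) \,\}\bigr]
  &= \sum_{y \in S,\ y \neq x} \mathbb{E}[C_y]
   = \sum_{y \in S,\ y \neq x} \Pr[h(y) = h(x)] \\
  &= \sum_{y \in S,\ y \neq x} \frac{1}{p}
   \le \frac{n}{p} < 1 \qquad (\text{since } p > n).
\end{align*}
% Each comparison against x costs O(\ell), and x itself adds at most one more:
\[
\mathbb{E}[\text{search time}]
  = O(\ell) + O(\ell) \cdot \left( 1 + \frac{n}{p} \right)
  = O(\ell).
\]
```

Linearity of expectation does not require the collision events for different strings y to be independent of each other, so only the pairwise bound from part (a) is needed.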
