Hash Table
This section concerns the Hash Table data structure and the design of its hash functions. The
hash table is an improved version of the Direct-Access Table, offering better operational
efficiency.
Direct-Access Table
Formal Def:
Suppose m distinct keys are taken from the universe 𝒰 ⊆ {0, 1, ..., n}; set up an array Τ[0, 1, ..., n]
in which the record with key k is stored in slot Τ[k]. Every position in array Τ is called a slot.
Such a table offers Θ(1) access time, but if the range of keys is too large, e.g. 64-bit numbers
with about 1.8 × 10^19 distinct possible keys, the direct-access table becomes prohibitive in
memory usage. It is a poor choice in the context of Internet-scale data. Thus, the Hash Table is
introduced.
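The idea can be sketched in a few lines of Python (the class name and interface below are my
own, for illustration only): every possible key gets its own slot, which is what makes both the
Θ(1) access time and the Θ(|𝒰|) space cost visible.

```python
# A minimal direct-access table sketch. Keys are small non-negative
# integers; the table is sized to the whole universe, so space is
# Θ(|U|) even when only a few keys are actually stored.

class DirectAccessTable:
    def __init__(self, universe_size):
        self.slots = [None] * universe_size  # one slot per possible key

    def insert(self, key, value):
        self.slots[key] = value              # Θ(1): direct indexing

    def lookup(self, key):
        return self.slots[key]               # Θ(1)

    def delete(self, key):
        self.slots[key] = None               # Θ(1)
```

With a universe of 64-bit integers, `universe_size` would have to be 2^64, which is exactly the
memory blow-up described above.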
Design Idea
Hash Table is an extended array structure, dynamically resizable on demand, that associates
keys with values. It is designed to combine the good qualities of the List and Array data
structures.
List requires only Θ(|𝒮|) space, but lookup takes Θ(|𝒮|) time as well.
Array requires only Ο(1) lookup time, but Θ(|𝒰|) space.
Hash Table, optimally, possesses both the small storage space of Ο(|𝒮|) and the swift entry
access of Ο(1).
In order to support the three distinct operations of a Hash Table: INSERTION of a new record,
DELETION of an existing record, and LOOKUP for a specified record, the Hash Table is built in
several steps:
1. decide n, the number of "buckets" for storing entries (n might be dynamically adjusted
later as the table grows or shrinks);
2. choose a hash function 𝒽: 𝒰 -> {0, 1, ..., n-1} that maps each key to a bucket;
3. store the record with key k in bucket Τ[𝒽(k)].
Note: if the record does not exist, INSERTION of a new entry will complete but DELETE will
fail; if the record already exists, INSERTION will override the existing value and DELETE will
succeed.
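These semantics can be observed directly with Python's built-in dict, which is itself a hash
table:

```python
d = {}
d["k"] = 1        # INSERTION of a new record completes
d["k"] = 2        # INSERTION again overrides the existing value
assert d["k"] == 2

del d["k"]        # DELETE succeeds because the record exists
try:
    del d["k"]    # DELETE of a missing record fails
except KeyError:
    pass          # Python signals the failure with KeyError
```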
What if the "bucket" location is already occupied when a new entry is inserted? In other words,
a hash collision happens. Generally, there are two approaches to resolving this situation:
(separate) chaining and open addressing.
Hash Collision
Note: hash collisions always happen; no "perfect" hash function can avoid collisions for all
inputs, since whenever the number of possible keys exceeds the number of buckets, the
pigeonhole principle forces some distinct keys to share a slot.
Load Factor
Define the load factor (α) as a hashing performance metric for hash table Τ:
α = n / m, where n is the total number of stored elements and m is the number of buckets in the
hash table. The value of α can be greater than 1, equal to 1, or less than 1; when α is far larger
than 1, the hash table is overloaded and needs to expand itself via the techniques introduced in
table doubling.
The average-case performance of hashing depends on how well the hash function 𝒽 distributes
the n keys among the m slots. Assume simple uniform hashing: the bucket each element hashes
into is independent of the other elements, and each of the m slots is equally likely to be chosen.
For ј = 0, 1, ..., m-1, denote the length of the list Τ[ј] by nј, so that n = n0 + n1 + ... + nm-1. The
expected value of nј is E[nј] = n/m = α.
The following sections build on this assumption.
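A quick empirical sketch of this expectation (the modular hash and the parameters below are
my own choices, for illustration): hash n random keys into m buckets and check that the average
bucket length equals α.

```python
# Empirical check of E[n_j] = n/m = α under (approximately) uniform
# hashing. The bucket index is taken as k mod m, an illustrative choice.
import random

m, n = 64, 1024
buckets = [0] * m
for _ in range(n):
    k = random.getrandbits(32)   # a random 32-bit key
    buckets[k % m] += 1

alpha = n / m
average = sum(buckets) / m
assert average == alpha          # total is n, so the mean is exactly n/m
```

Individual bucket lengths fluctuate around α, but their average is n/m by construction; simple
uniform hashing is what justifies using α as the *expected* length of any single bucket.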
(Separate) Chaining
In the chaining method, all entries with the same hash value are chained together in a linked list.
Specifically, each bucket of the hash table either holds a pointer to the head of the linked list
(singly or doubly linked) of all entries hashed to that bucket, or holds only NIL if the bucket is
empty.
LOOKUP costs Θ(1 + α) on average in the chaining version of a Hash Table. INSERTION takes
worst-case Ο(1) time (inserting at the head of the list), while DELETION might take up to Ο(n)
if all entries hash into one bucket, since the list must be traversed to find the entry; with a
doubly linked list and a pointer to the entry, DELETION itself speeds up to Ο(1).
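A minimal chaining table might look like the following sketch (Python lists stand in for the
linked lists, and the class name is my own):

```python
# A minimal (separate) chaining hash table. Each bucket is a list of
# (key, value) pairs; all colliding entries live in the same bucket.

class ChainedHashTable:
    def __init__(self, m=8):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _index(self, key):
        return hash(key) % self.m          # which bucket the key chains into

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # existing record: override value
                return
        bucket.append((key, value))        # new record: O(1) append

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:                   # expected Θ(1 + α) scan
                return v
        return None

    def delete(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                return True
        return False                       # record did not exist: DELETE fails
```

The scan inside each bucket is exactly where the Θ(1 + α) lookup cost comes from: one hash
evaluation plus an expected α comparisons.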
Open Addressing
Although it is possible for the (separate) chaining method to have a number of buckets smaller
than the number of stored elements (α > 1), open addressing requires α ≤ 1. In other words, all
elements are stored directly in the buckets, one entry per bucket. While it is possible for such a
hash table to fill up all of its buckets, open addressing saves the memory otherwise occupied by
pointers; that memory can be spent on more buckets, reducing hash collisions and improving
search speed.
In order to INSERT an element into the hash table, open addressing requires successive bucket
examinations, each termed a probe, until an empty bucket is located.
Probe sequence: to make the INSERTION operation succeed, for each key k there is a length-m
probe sequence
⟨𝒽(k, 0), 𝒽(k, 1), ..., 𝒽(k, m-1)⟩,
a permutation of ⟨0, 1, ..., m-1⟩, wherein each probe position falls within the range of buckets.
There are three major probing techniques, which produce the probe sequence in different
manners.
Note: when performing the DELETE operation, the designated entry should be labeled "deleted"
(a tombstone) rather than simply emptied, to prevent later LOOKUP and DELETE failures. (e.g.
suppose k2 was inserted after probing past the occupied slot of k1; if k1's slot is reset to empty,
a later search for k2 stops at that empty slot without knowing k2 lies beyond it, so further
LOOKUP and DELETE operations on k2 will fail.)
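The tombstone trick can be sketched as follows (linear probing, defined in the next subsection,
is used as the probe strategy here; the names and the simplified insert logic are my own):

```python
# Open-addressing table with "deleted" tombstones.
DELETED = object()   # tombstone marker, distinct from an empty slot

class OpenAddressTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [None] * m   # None means "never used"

    def _probe(self, key):
        h = hash(key) % self.m
        for i in range(self.m):   # linear probe sequence (h + i) mod m
            yield (h + i) % self.m

    def insert(self, key, value):
        # Simplified: reuses the first empty/tombstone slot. A full
        # implementation would keep probing to avoid duplicates of a
        # key that sits beyond a tombstone.
        for j in self._probe(key):
            s = self.slots[j]
            if s is None or s is DELETED or s[0] == key:
                self.slots[j] = (key, value)
                return
        raise RuntimeError("hash table overflow")

    def lookup(self, key):
        for j in self._probe(key):
            s = self.slots[j]
            if s is None:                      # truly empty: key absent
                return None
            if s is not DELETED and s[0] == key:
                return s[1]
        return None

    def delete(self, key):
        for j in self._probe(key):
            s = self.slots[j]
            if s is None:
                return False
            if s is not DELETED and s[0] == key:
                self.slots[j] = DELETED        # tombstone, never None
                return True
        return False
```

Because `delete` leaves a tombstone instead of `None`, a later `lookup` keeps probing past it,
so keys inserted beyond the deleted slot remain reachable.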
Linear Probing
Given an auxiliary hash function 𝒽': 𝒰 -> {0, 1, ..., m-1}, linear probing uses:
𝒽(k, i) = (𝒽'(k) + i) mod m, for i = 0, 1, ..., m-1,
so that INSERTION first probes bucket Τ[𝒽'(k)], then Τ[𝒽'(k) + 1], ..., wrapping around up to
Τ[𝒽'(k) - 1]. To be noted, the initial probe position determines the entire sequence, so only m
distinct probe sequences exist.
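A short sketch of the linear probe sequence, assuming the illustrative auxiliary hash
𝒽'(k) = k mod m:

```python
# Linear probing: h(k, i) = (h'(k) + i) mod m, with h'(k) = k % m here.
def linear_probe_sequence(k, m):
    h = k % m                             # auxiliary hash h'(k)
    return [(h + i) % m for i in range(m)]

seq = linear_probe_sequence(7, 8)
assert seq == [7, 0, 1, 2, 3, 4, 5, 6]    # wraps around to h'(k) - 1
assert sorted(seq) == list(range(8))      # a permutation of all buckets
```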
Quadratic Probing
Under the same premise as linear probing, quadratic probing adopts a better strategy:
𝒽(k, i) = (𝒽'(k) + c1·i + c2·i²) mod m, for i = 0, 1, ..., m-1,
where c1 and c2 (c2 ≠ 0) are auxiliary constants. Instead of occupying large runs of adjacent
buckets as linear probing does, quadratic probing leads to a more dispersed distribution of
elements.
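One well-known workable parameter choice (an assumption here, not taken from the text) is
c1 = c2 = 1/2 with m a power of two, which still visits every bucket exactly once:

```python
# Quadratic probing with c1 = c2 = 1/2, i.e. offsets i*(i+1)/2
# (triangular numbers). This visits all buckets when m is a power of two.
def quadratic_probe_sequence(k, m):
    h = k % m                   # illustrative auxiliary hash h'(k)
    return [(h + (i * i + i) // 2) % m for i in range(m)]

seq = quadratic_probe_sequence(3, 8)
assert sorted(seq) == list(range(8))   # still a permutation of all buckets
```

Not every (c1, c2, m) combination covers the whole table, which is why such special cases are
used in practice.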
Double Hashing
This is known as the best strategy of the three open addressing methods; it uses:
𝒽(k, i) = (𝒽1(k) + i·𝒽2(k)) mod m, for i = 0, 1, ..., m-1,
wherein both 𝒽1 and 𝒽2 are auxiliary hash functions; in order for the probe sequence to cover
the whole hash table, the value of 𝒽2(k) has to be relatively prime to the hash table size m.
Double hashing yields Θ(m²) distinct probe sequences, one per (𝒽1(k), 𝒽2(k)) pair,
approximating ideal uniform hashing, under which each key is equally likely to receive any of
the m! probe-sequence permutations.
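A sketch of double hashing with a prime table size (the two auxiliary hash functions below are
common textbook-style choices, assumed here for illustration): choosing m prime makes every
𝒽2(k) in {1, ..., m-1} relatively prime to m, so each probe sequence is a full permutation.

```python
# Double hashing: h(k, i) = (h1(k) + i * h2(k)) mod m.
# With m prime, h2(k) in {1, ..., m-1} is always coprime to m,
# so the sequence visits every bucket exactly once.
def double_hash_sequence(k, m):
    h1 = k % m                   # first auxiliary hash
    h2 = 1 + (k % (m - 1))       # second auxiliary hash, in {1, ..., m-1}
    return [(h1 + i * h2) % m for i in range(m)]

m = 11                           # prime table size
seq = double_hash_sequence(123, m)
assert sorted(seq) == list(range(m))   # a permutation of all buckets
```

Unlike linear probing, both the starting slot and the step size depend on the key, which is what
multiplies the number of distinct probe sequences from m up to Θ(m²).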