Lecture 9 - 2024-Searching and Hashing Algorithms
Hashing Algorithms
Chung-Ming Chen
Department of Biomedical Engineering
Textbook:
1. C++ Programming: From Problem Analysis to Program Design, 8th Edition, D.S. Malik
2. Data Structures Using C++, 2nd Edition, D.S. Malik
Search Algorithms and Analysis
key comparisons
Sequential and binary search algorithms search the list by comparing the
target element with the list elements. For this reason, these algorithms
are called comparison-based search algorithms.
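As a rough illustration of what is being counted, the sketch below
instruments a sequential search and a binary search with a counter for key
comparisons. The function names seqSearch and binSearch, the use of
std::vector<int>, and the convention of counting one comparison per loop
iteration in the binary search are assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Sequential search: returns the index of target, or -1 if it is absent.
// comparisons receives the number of key comparisons performed.
int seqSearch(const std::vector<int>& list, int target, int& comparisons)
{
    comparisons = 0;
    for (std::size_t i = 0; i < list.size(); i++)
    {
        comparisons++;
        if (list[i] == target)
            return static_cast<int>(i);
    }
    return -1;
}

// Binary search on a list sorted in ascending order. Each loop iteration is
// counted as one (three-way) key comparison.
int binSearch(const std::vector<int>& list, int target, int& comparisons)
{
    comparisons = 0;
    int first = 0;
    int last = static_cast<int>(list.size()) - 1;
    while (first <= last)
    {
        int mid = first + (last - first) / 2;
        comparisons++;
        if (list[mid] == target)
            return mid;
        else if (list[mid] < target)
            first = mid + 1;
        else
            last = mid - 1;
    }
    return -1;
}
```

On a sorted list of 12 items, finding the 11th item takes 11 comparisons
sequentially but only 3 with the binary search under this counting
convention.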
Two ways that data is organized with the help of the hash table:
– The data is stored within the hash table, that is, in an array.
– The data is stored in linked lists and the hash table is an array of
pointers to those linked lists.
Terminology
The hash table HT is usually divided into b buckets HT[0],
HT[1], ..., HT[b – 1]. Each bucket can hold r items, so
br = m, where m is the size of HT. Generally, r = 1, so each
bucket holds one item.
The hash function h maps the key X onto an integer t, that is, h(X) = t,
such that 0 ≤ h(X) ≤ b – 1.
collision
Two keys, X1 and X2, such that X1 ≠ X2, are called synonyms if
h(X1) = h(X2). Let X be a key and h(X) = t. If bucket t is full, we say
that an overflow occurs. Let X1 and X2 be two nonidentical keys. If
h(X1) = h(X2), we say that a collision occurs. If r = 1, that is, the
bucket size is 1, an overflow and a collision occur at the same time.
When choosing a hash function, the main objectives are to:
– Choose a hash function that is easy to compute.
– Minimize the number of collisions.
Suppose that each key is a string. The following C++ function uses
the division method to compute the address of the key.
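The function itself is not reproduced in these notes. A minimal sketch of a
division-method hash for string keys might look like the following; the table
size HTSize = 101, the use of std::string, and the choice to sum the
character codes before taking the remainder are assumptions made for
illustration.

```cpp
#include <string>

const int HTSize = 101;  // assumed table size (a prime), for illustration only

// Division method: add up the character codes of the key and take the
// remainder modulo the table size, so the result lies in [0, HTSize - 1].
int hashFunction(const std::string& insertKey)
{
    int sum = 0;
    for (char ch : insertKey)
        sum += static_cast<unsigned char>(ch);
    return sum % HTSize;
}
```

With this sketch, "abc" and "cab" have the same character sum and therefore
the same address, so they are synonyms and inserting both causes a collision.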
Initially, all the array positions are available. Because all the array
positions are available, the probability of any position being probed is
(1/20). Suppose that after storing some of the items, the hash table is
as shown below:
In this figure, a cross indicates that this array slot is occupied. Slot 9
will be occupied next if, for the next key, the hash address is 6, 7, 8,
or 9. Thus, the probability that slot 9 will be occupied next is 4/20.
Similarly, in this hash table, the probability that array position 14 will
be occupied next is 5/20.
Linear Probing: Clustering Problem
In this hash table, the probability that the array position 14 will be
occupied next is 9/20, whereas the probability that the array positions 15,
16, or 17 will be occupied next is 1/20. We see that items tend to cluster,
which would increase the search length. Linear probing, therefore,
causes clustering. This clustering is called primary clustering.
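The probabilities quoted above can be checked with a short sketch. Under
linear probing, an empty slot is filled next whenever the hash address is
that slot itself or any slot in the run of occupied slots immediately before
it. The function name addressesLeadingTo and the occupancy patterns used in
the note below are assumptions chosen to be consistent with the quoted
numbers; they are not taken from the original figures.

```cpp
#include <vector>

// Counts the hash addresses whose linear probe sequence ends at the empty
// slot `slot`. Dividing the result by the table size gives the probability
// that `slot` is filled next, assuming every hash address is equally likely.
int addressesLeadingTo(const std::vector<bool>& occupied, int slot)
{
    int n = static_cast<int>(occupied.size());
    if (occupied[slot])
        return 0;   // an occupied slot cannot be the next one filled

    int count = 1;  // hashing directly to the empty slot
    int j = (slot - 1 + n) % n;
    // Walk backward over the run of occupied slots that precedes `slot`.
    while (j != slot && occupied[j])
    {
        count++;
        j = (j - 1 + n) % n;
    }
    return count;
}
```

For a 20-slot table with slots 6 to 8 and 10 to 13 occupied, the function
gives 4 for slot 9 and 5 for slot 14. Once slot 9 is also filled, it gives 9
for slot 14 and 1 for each of slots 15, 16, and 17, which is the clustering
effect described above.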
One way to improve linear probing is to skip array positions by a fixed
constant, say c, rather than 1. In this case, the hash address is as
follows:
(h(X) + i * c) % HTSize
If c = 2 and h(X) = 2k, that is, h(X) is even, only the even-numbered array
positions are visited. Similarly, if c = 2 and h(X) = 2k + 1, that is, h(X) is
odd, only the odd-numbered array positions are visited. To visit all the
array positions, the constant c must be relatively prime to HTSize.
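The effect of the step size can be seen by counting how many distinct slots
the probe sequence reaches. The sketch below is illustrative only; the names
slotsVisited and relativelyPrime are hypothetical helpers, not functions from
the textbook.

```cpp
#include <set>

// Number of distinct slots reached by (home + i * c) % tableSize
// for i = 0, 1, ..., tableSize - 1.
int slotsVisited(int home, int c, int tableSize)
{
    std::set<int> visited;
    for (int i = 0; i < tableSize; i++)
        visited.insert((home + i * c) % tableSize);
    return static_cast<int>(visited.size());
}

// Euclid's algorithm: c and tableSize are relatively prime exactly when
// their greatest common divisor is 1.
bool relativelyPrime(int c, int tableSize)
{
    int a = c;
    int b = tableSize;
    while (b != 0)
    {
        int r = a % b;
        a = b;
        b = r;
    }
    return a == 1;
}
```

With HTSize = 20, a step of c = 2 reaches only 10 of the 20 slots, whereas
c = 3 (relatively prime to 20) reaches all 20.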
Random probing uses a random number generator to find the next available
slot. The ith slot in the probe sequence is
(h(X) + r_i) % HTSize
where r_i is the ith value in a random permutation of the numbers 1
to HTSize – 1. All insertions and searches use the same sequence of
random numbers.
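One possible way to realize this, sketched under assumptions: the offsets are
generated once from a fixed seed, so that insertions and searches see the
same sequence. The names makeOffsets and findSlot, the seed parameter, and
the choice of std::mt19937 are illustrative and not prescribed by the notes.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Builds the offsets r_1, ..., r_(tableSize - 1) as a random permutation of
// 1 .. tableSize - 1. The same seed always reproduces the same offsets.
std::vector<int> makeOffsets(int tableSize, unsigned seed)
{
    std::vector<int> r(tableSize - 1);
    std::iota(r.begin(), r.end(), 1);
    std::mt19937 gen(seed);
    std::shuffle(r.begin(), r.end(), gen);
    return r;
}

// Returns the first free slot for hash address `home`, trying `home` itself
// and then (home + r_i) % tableSize in order; returns -1 if the table is full.
int findSlot(const std::vector<bool>& occupied, int home,
             const std::vector<int>& r)
{
    int n = static_cast<int>(occupied.size());
    if (!occupied[home])
        return home;
    for (int offset : r)
    {
        int slot = (home + offset) % n;
        if (!occupied[slot])
            return slot;
    }
    return -1;
}
```

Because the offsets are a permutation of 1 to HTSize – 1, every other slot is
eventually probed, so a free slot is found whenever one exists.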
ITEM INSERTION AND COLLISION: For each key X (in the item),
we first find h(X) = t, where 0 ≤ t ≤ HTSize – 1. The item with this key
is then inserted in the linked list (which might be empty) pointed to by
HT[t]. It then follows that for nonidentical keys X1 and X2, if h(X1) =
h(X2), the items with keys X1 and X2 are inserted in the same linked
list and so collision is handled quickly and effectively. (A new item can
be inserted at the beginning of the linked list because the data in a
linked list is in no particular order.)
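A minimal sketch of this scheme follows, together with the sequential lookup
in the same list. The class name ChainedHashTable, the string keys, the
character-sum hash, and the use of std::forward_list in place of hand-written
pointers to linked lists are all assumptions made to keep the example short.

```cpp
#include <forward_list>
#include <string>
#include <vector>

class ChainedHashTable
{
public:
    explicit ChainedHashTable(int size) : HT(size) {}

    // Insert at the front of the list for bucket h(key); synonyms simply
    // end up in the same list.
    void insert(const std::string& key)
    {
        HT[hashAddress(key)].push_front(key);
    }

    // Compute h(key), then search only that bucket's list sequentially.
    bool search(const std::string& key) const
    {
        for (const std::string& item : HT[hashAddress(key)])
            if (item == key)
                return true;
        return false;
    }

private:
    // Assumed hash: sum of character codes modulo the table size.
    int hashAddress(const std::string& key) const
    {
        int sum = 0;
        for (char ch : key)
            sum += static_cast<unsigned char>(ch);
        return sum % static_cast<int>(HT.size());
    }

    std::vector<std::forward_list<std::string>> HT;  // one list per bucket
};
```

Inserting "abc" and "cab" places both in the same list because they hash to
the same address; a search for "bca" walks that list and reports that the key
is absent.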
SEARCH: Suppose that we want to determine whether an item R
with key X is in the hash table. As usual, first we calculate h(X).
Suppose h(X) = t. Then the linked list pointed to by HT[t] is searched
sequentially
Let