0% found this document useful (0 votes)
24 views

Lecture 9 - 2024-Searching and Hashing Algorithms

Uploaded by

jiawenf1009
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
24 views

Lecture 9 - 2024-Searching and Hashing Algorithms

Uploaded by

jiawenf1009
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 38

Searching and

Hashing Algorithms

Chung-Ming Chen
Department of Biomedical Engineering

Textbook:
1. C++ Programming: From Problem Analysis to Program Design, 8th Edition, D.S. Malik
2. Data Structures Using C++, 2nd Edition, D.S. Malik Page 1
Search Algorithms and Analysis

 Associated with each item in a data set is a special member that


uniquely identifies the item in the data set. For example, if you have a
data set consisting of student records, then the student ID uniquely
identifies each student in a particular school. This unique member of
the item is called the key of the item.
 The keys of the items in the data set are used in such operations as
searching, sorting, insertion, and deletion. For instance, when we
search the data set for a particular item, we compare the key of the
item for which we are searching with the keys of the items in the data
set.
 In the analysis of an algorithm, the key comparisons refer to
comparing the key of the search item with the key of an item in the
list. Moreover, the number of key comparisons refers to the number
of times the key of the item (in algorithms such as searching and
sorting) is compared with the keys of the items in the list.

Spring 2024 Data Structure 2


class arrayListType

Spring 2024 Data Structure 3


Sequential Search: Array-based Lists

Spring 2024 Data Structure 4


Sequential Search Analysis
 When analyzing a search algorithm, we count the number of key
comparisons because this number gives us the most useful
information. Furthermore, the criteria for counting the number of key
comparisons can be applied equally well to other search algorithms.

key comparisons

Spring 2024 Data Structure 5


Sequential Search Analysis
 Suppose that L is list of length n. We want to determine the number of key
comparisons made by the sequential search when the L is searched for a
given item.
 If the search item is not in the list, we then compare the search item with
every element in the list, making n comparisons. This is an unsuccessful
case.
 Suppose that the search item is in the list. Then the number of key
comparisons depends on where in the list the search item is located.
– Best case: The search item is the first element of L. We make only one key
comparison.
– Worst case: The search item is the last element in the list. The algorithm
makes n comparisons.
– The best and worst cases are not likely to occur every time we apply the
sequential search on L, so it would be more helpful if we could determine the
average behavior of the algorithm.

Spring 2024 Data Structure 6


Sequential Search Analysis
 To determine the average number of comparisons in the successful case
of the sequential search algorithm:
1. Consider all possible cases.
2. Find the number of comparisons for each case.
3. Add the number of comparisons and divide by the number of cases.
 If the search item, called the target, is the first element in the list, one
comparison is required.
 If the target is the second element in the list, two comparisons are
required.
 Similarly, if the target is the kth element in the list, k comparisons are
required. We assume that the target can be any element in the list; that is,
all list elements are equally likely to be the target.
 Suppose that there are n elements in the list. The following expression
gives the average number of comparisons:

Spring 2024 Data Structure 7


Binary Search for Ordered Lists
 The binary search algorithm uses the divide-and-conquer technique
to search the list, which can be performed only on ordered lists.
 First, the search item is compared with the middle element of the list.
If the search item is found, the search terminates. If the search item is
less than the middle element of the list, we restrict the search to the
first half of the list; otherwise, we search the second half of the list.

Spring 2024 Data Structure 8


Binary Search Algorithm

The maximum number


of iterations ?

Spring 2024 Data Structure 9


Insertion into an Ordered List

1. Use an algorithm similar to the binary search algorithm to find the


place where the item is to be inserted.
2. if the item is already in this list
output an appropriate message
else
use the function insertAt to insert the item in the list.

Spring 2024 Data Structure 10


Lower Bound on Comparison-Based Search Algorithms

 Sequential and binary search algorithms search the list by comparing the
target element with the list elements. For this reason, these algorithms
are called comparison-based search algorithms.

Spring 2024 Data Structure 11


Hashing Algorithms

 The complexity of hashing search algorithm is, on average, of order 1.


 In hashing, the data is organized with the help of a table, called the
hash table, denoted by HT, and the hash table is stored in an array.
 To determine whether a particular item with a key, say X, is in the table,
we apply a function h, called the hash function, to the key X; that is, we
compute h(X), read as h of X. The function h is typically an arithmetic
function and h(X) gives the address of the item in the hash table.
 Suppose that the size of the hash table, HT, is m. Then 0  h(X) < m.
Thus, to determine whether the item with key X is in the table, we look
at the entry HT[h(X)] in the hash table.
 Because the address of an item is computed with the help of a
function, it follows that the items are stored in no particular order.

Spring 2024 Data Structure 12


Organizing Data in the Hash Table

 Two ways that data is organized with the help of the hash table:
– The data is stored within the hash table, that is, in an array.
– The data is stored in linked lists and the hash table is an array of
pointers to those linked lists.
 Terminology
The hash table HT is, usually, divided into, say b buckets HT[0],
HT[1], ..., HT[b – 1]. Each bucket is capable of holding, say r items.
Thus, it follows that br = m, where m is the size of HT. Generally, r =
1 and so each bucket can hold one item.
The hash function h maps the key X onto an integer t, that is, h(X) = t,
such that 0  h(X)  b – 1.

Spring 2024 Data Structure 13


Collisions

collision

collision

Spring 2024 Data Structure 14


Collisions

 Two keys, X1 and X2, such that X1  X2, are called synonyms if
h(X1) = h(X2). Let X be a key and h(X) = t. If bucket t is full, we say
that an overflow occurs. Let X1 and X2 be two nonidentical keys. If
h(X1) = h(X2), we say that a collision occurs. If r = 1, that is, the
bucket size is 1, an overflow and a collision occur at the same time.
 When choosing a hash function, the main objectives are to:
– Choose a hash function that is easy to compute.
– Minimize the number of collisions.

Spring 2024 Data Structure 15


Hash Functions: Some Examples

 Mid-Square: In this method, the hash function, h, is computed by


squaring the identifier, and then using the appropriate number of bits
from the middle of the square to obtain the bucket address. Because
the middle bits of a square usually depend on all the characters, it is
expected that different keys will yield different hash addresses with
high probability, even if some of the characters are the same.
 Folding: In folding, the key X is partitioned into parts such that all the
parts, except possibly the last parts, are of equal length. The parts
are then added, in some convenient way, to obtain the hash address.
 Division (Modular arithmetic): In this method, the key X is
converted into an integer iX. This integer is then divided by the size of
the hash table, i.e., HTSize, to get the remainder, giving the address
of X in HT. That is, (in C++)
h(X) = iX % HTSize;

Spring 2024 Data Structure 16


Example: Division Method in C++

 Suppose that each key is a string. The following C++ function uses
the division method to compute the address of the key.

Spring 2024 Data Structure 17


Collision Resolution

 In reality, collisions are unavoidable because usually a hash function


always maps a larger domain onto a smaller range. Thus, in hashing,
we must include algorithms to handle collisions.
 Collision resolution techniques are classified into two categories:
open addressing (also called closed hashing), and chaining (also
called open hashing).
 In open addressing, the data is stored within the hash table.
 In chaining, the data is organized in linked lists and the hash table is
an array of pointers to the linked lists.

Spring 2024 Data Structure 18


Open Addressing – Linear Probing

 Suppose that an item with key X is to be inserted in HT. We use the


hash function to compute the index h(X) of this item in HT. Suppose
that h(X) = t. Then 0  h(X)  HTSize – 1. If HT[t] is empty, we store
this item into this array slot. Suppose that HT[t] is already occupied
by another item; we have a collision. In linear probing, starting at
location t, we search the array sequentially to find the next available
array slot.
 In linear probing, we assume that the array is circular so that if the
lower portion of the array is full, we can continue the search in the top
portion of the array. This can be easily accomplished by using the
mod operator. That is, starting at t, we check the array locations t,
(t + 1) % HTSize, (t + 2) % HTSize, . . ., (t + j) % HTSize. This is
called the probe sequence.
 The next array slot is given by
(h(X) + j) % HTSize
where j is the jth probe.
Spring 2024 Data Structure 19
Linear Probing: Example

Spring 2024 Data Structure 20


Linear Probing in C++

Spring 2024 Data Structure 21


Linear Probing: Clustering Problem

 From the definition of linear probing, we see that linear probing is


easy to implement. However, linear probing causes clustering; that
is, more and more new keys would likely be hashed to the array slots
that are already occupied. For example, consider the hash table of
size 20, as shown below:

 Initially, all the array positions are available. Because all the array
positions are available, the probability of any position being probed is
(1/20). Suppose that after storing some of the items, the hash table is
as shown below:

 In this figure, a cross indicates that this array slot is occupied. Slot 9
will be occupied next if, for the next key, the hash address is 6, 7, 8,
or 9. Thus, the probability that slot 9 will be occupied next is 4/20.
Similarly, in this hash table, the probability that array position 14 will
be occupied next is 5/20.
Spring 2024 Data Structure 22
Linear Probing: Clustering Problem

 Consider the hash table below:

 In this hash table, the probability that the array position 14 will be
occupied next is 9/20, whereas the probability that the array positions 15,
16, or 17 will be occupied next is 1/20. We see that items tend to cluster,
which would increase the search length. Linear probing, therefore,
causes clustering. This clustering is called primary clustering.
 One way to improve linear probing is to skip array positions by a fixed
constant, say c, rather than 1. In this case, the hash address is as
follows:
(h(X) + i * c) % HTSize
 If c = 2 and h(X) = 2k, that is, h(X) is even, only the even-numbered array
positions are visited. Similarly, if c = 2 and h(X) = 2k + 1, that is, h(X) is
odd, only the odd-numbered array positions are visited. To visit all the
array positions, the constant c must be relatively prime to HTSize.

Spring 2024 Data Structure 23


Random Probing

 This method uses a random number generator to find the next available
slot. The ith slot in the probe sequence is
(h(X) + ri) % HTSize
where ri is the ith value in a random permutation of the numbers 1
to HTSize – 1. All insertions and searches use the same sequence of
random numbers.

Spring 2024 Data Structure 24


Rehashing and Quadratic Probing
 Rehashing: In this method, if a collision occurs with the hash function
h, we use a series of hash functions, h1, h2, . . ., hs. That is, if the
collision occurs at h(X), the array slots hi(X), 1  i  s are examined.
 Quadratic Probing:

Spring 2024 Data Structure 25


Quadratic Probing Property

Spring 2024 Data Structure 26


Quadratic Probing: Fast Implementation

Spring 2024 Data Structure 27


Secondary Clustering
 Both random and quadratic probings eliminate primary clustering.
However, if two nonidentical keys, say X1 and X2, are hashed to the
same home position, that is, h(X1) = h(X2), then the same probe
sequence is followed for both keys. The same probe sequence is
used for both keys because random probing and quadratic probing
are functions of the home positions, not the original key. It follows that
if the hash function causes a cluster at a particular home position, the
cluster remains under these probings. This is called secondary
clustering.

Spring 2024 Data Structure 28


Double Hashing

 One way to solve secondary clustering is to use linear probing, with


the increment value a function of the key. This is called double
hashing. In double hashing, if a collision occurs at h(X), the probe
sequence is generated by using the rule:
(h(X) + i * g(X)) % HTSize
where g is the second hash function, and i = 0, 1, 2, 3, . . . .
 If the size of the hash table is a prime p, then we can define g as
follows:
g(k) = 1+(k % (p – 2))

Spring 2024 Data Structure 29


Double Hashing: Example

Spring 2024 Data Structure 30


Deletion: Open Addressing

 Suppose that an item, say R, is to be deleted from the hash table,


HT. Clearly, we first must find the index of R in HT. To find the index
of R, we apply the same criteria we applied to R when R was inserted
in HT. Let us further assume that after inserting R, another item, R’,
was inserted in HT, and the home position of R and R’ is the same.
The probe sequence of R is contained in the probe sequence of R’
because R’ was inserted in the hash table after R.
 Suppose that we delete R simply by marking the array slot containing
R as empty. If this array position stays empty, then while searching
for R’ and following its probe sequence, the search terminates at this
empty array position. This gives the impression that R’ is not in the
table, which, of course, is incorrect. The item R cannot be deleted
simply by marking its position as empty from the hash table.

Spring 2024 Data Structure 31


Solution to Deletion Dilemma

 Another solution is to use another array, say indexStatusList of int, of


the same size as the hash table as follows: Initialize each position of
indexStatusList to 0, indicating that the corresponding position in the
hash table is empty. When an item is added to the hash table at
position, say i, we set indexStatusList[i] to 1. When an item is deleted
from the hash table at position, say k, we set indexStatusList[k] to -1.
Therefore, each entry in the array indexStatusList is -1, 0, or 1.

Spring 2024 Data Structure 32


Chaining

 In chaining, the hash table, HT, is an array of pointers. Therefore, for


each j, where 0  j  HTSize – 1, HT[j] is a pointer to a linked list. The
size of the hash table, HTSize, is less than or equal to the number of
items.

Spring 2024 Data Structure 33


Chaining

 ITEM INSERTION AND COLLISION: For each key X (in the item),
we first find h(X) = t, where 0  t  HTSize – 1. The item with this key
is then inserted in the linked list (which might be empty) pointed to by
HT[t]. It then follows that for nonidentical keys X1 and X2, if h(X1) =
h(X2), the items with keys X1 and X2 are inserted in the same linked
list and so collision is handled quickly and effectively. (A new item can
be inserted at the beginning of the linked list because the data in a
linked list is in no particular order.)
 SEARCH: Suppose that we want to determine whether an item R
with key X is in the hash table. As usual, first we calculate h(X).
Suppose h(X) = t. Then the linked list pointed to by HT[t] is searched
sequentially

Spring 2024 Data Structure 34


Chaining

 DELETION: To delete an item, say R, from the hash table, first we


search the hash table to find where in a linked list R exists. We then
adjust the pointers at the appropriate locations and deallocate the
memory occupied by R.
 OVERFLOW: Because data is stored in linked lists, overflow is no
longer a concern because memory space to store the data is
allocated dynamically. Furthermore, the size of the hash table no
longer needs to be greater than the number of items. If the size of the
hash table is less than the number of items, some of the linked lists
contain more than one item. However, with a good hash function, the
average length of a linked list is still small and so the search is
efficient.

Spring 2024 Data Structure 35


Chaining

 ADVANTAGES OF CHAINING: From the construction of the hash


table using chaining, we see that item insertion and deletion are
straightforward. If the hash function is efficient, few keys are hashed
to the same home position. Thus, on average, a linked list is short,
which results in a shorter search length. If the item size is large, it
saves a considerable amount of space. For example, suppose there
are 1000 items and each item requires 10 words of storage. Further
suppose that each pointer requires one word of storage. We then
need 1000 words for the hash table, 10,000 words for the items, and
1000 words for the link in each node. A total of 12,000 words of
storage space, therefore, is required to implement chaining. On the
other hand, if we use quadratic probing, if the hash table size is twice
the number of items, we need 20,000 words of storage..

Spring 2024 Data Structure 36


Chaining
 DISADVANTAGES OF CHAINING: If the item size is
small, a considerable amount of space is wasted. For
example, suppose there are 1000 items, each requiring 1
word of storage. Chaining then requires a total of 3000
words of storage. On the other hand, with quadratic
probing, if the hash table size is twice the number of
items, only 2000 words are required for the hash table.
Also, if the table size is three times the number of items,
then in quadratic probing the keys are reasonably spread
out. This results in fewer collisions and so the search is
fast.

Spring 2024 Data Structure 37


Hashing Analysis

 Let

The parameter α is called the load factor.


 The average number of comparisons for a successful search and an
unsuccessful search are given below:

Spring 2024 Data Structure 38

You might also like