Unit 1 Hashing

Hashing is a technique for implementing hash tables that allows for constant average time complexity for insertions, deletions, and lookups, but is inefficient for ordered operations. It involves mapping keys to bucket addresses using hash functions, with collision resolution strategies such as separate chaining and open addressing. The choice of hash function and table size is critical for performance, with considerations for load factors and clustering effects impacting efficiency.


Hashing

Hash Tables
• The implementation of hash tables is called hashing.
• Hashing is a technique used for performing insertions,
deletions and finds in constant average time (i.e. O(1))
• This data structure, however, is not efficient in
operations that require any ordering information among
the elements, such as findMin, findMax and printing the
entire table in sorted order.
• Two types:
– static hashing
– dynamic hashing
Hash tables
• The ideal hash table structure is merely an array of some fixed
size, containing the key-item pairs.
• A stored item needs to have a data member, called key, that will
be used in computing the index value for the item.
– Key could be an integer, a string, etc
– e.g. a name or Id that is a part of a large employee structure
• The size of the array is TableSize.
• The items that are stored in the hash table are indexed by values
from 0 to TableSize – 1.
• Each key is mapped into some number in the range 0 to
TableSize – 1.
• The mapping is called a hash function.
• Hashing schemes use a hash function to map keys into hash
table buckets.
Example Hash Table
• Items (key, data): (john, 25000), (phil, 31250), (dave, 27500), (mary, 28200)
• Each key is passed through the hash function to obtain a slot in a table of size 10:
slot 3: john 25000
slot 4: phil 31250
slot 6: dave 27500
slot 7: mary 28200
(slots 0, 1, 2, 5, 8 and 9 are empty)
Inserting and deleting record
• To insert a record into the structure compute
the hash value h(Ki), and place the record in
the bucket address returned.
• For lookup operations, compute the hash
value as above and search each record in the
bucket for the specific record.
• To delete a record, simply look it up as above and remove it.
Hash Function & its properties
• The hash function, h, is a function from the set of all
search-keys, K, to the set of all bucket addresses, B.
• The hash function must be simple to compute.
• The distribution should be uniform.
– An ideal hash function should assign the same number
of records to each bucket.
• The distribution should be random.
– Regardless of the actual search-keys, each bucket has
the same number of records on average
– Hash values should not depend on any ordering of the
search keys.
Hash function
Problems:
• Keys may not be numeric.
• Number of possible keys is much larger than the
space available in table.
• Different keys may map into same location
– Hash function is not one-to-one => collision.
– If there are too many collisions, the performance of
the hash table will suffer dramatically.
Hash Function: division
• If the input keys are nonnegative integers then simply Key
mod TableSize is a general strategy.
– Unless key happens to have some undesirable
properties. (e.g. all keys end in 0 and we use mod 10)
• It is the most widely used hash function in practice.
• The choice of TableSize is critical. If the size is a power of 2,
then key mod TableSize depends only on the least significant bits of the key.
• In practice it has been observed that it is sufficient to
choose TableSize such that it has no prime divisors less
than 20
• If the keys are strings, hash function needs more care.
– First convert it into a numeric value.
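A minimal sketch of the division method in C; the table size and the string-folding multiplier 37 are illustrative choices, not values fixed by the slides:

```c
#define TABLE_SIZE 10007   /* a prime with no divisors below 20 */

/* Division method for nonnegative integer keys. */
unsigned hash_div(unsigned key) {
    return key % TABLE_SIZE;
}

/* String keys are first converted to a numeric value, then divided.
   The multiplier 37 is one common choice, not the only one. */
unsigned hash_str(const char *s) {
    unsigned h = 0;
    while (*s != '\0')
        h = h * 37 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}
```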
Hash Function: mid square
• One hash function that has found much use in symbol table
applications is the 'middle of square' function.
• This function, Fm, is computed by squaring the identifier and then
using an appropriate number of bits from the middle of the square
to obtain the bucket address
• the identifier is assumed to fit into one computer word.
• Since the middle bits of the square will usually depend upon all of the
characters in the identifier, different identifiers would result in
different hash addresses with high probability even when some of the
characters are the same
• The number of bits to be used to obtain the bucket address depends on
the table size.
• If r bits are used, the range of values is 2^r,
• so the size of hash tables is chosen to be a power of 2 when this kind
of scheme is used.
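A minimal mid-square sketch, assuming 32-bit keys and r = 10 bits, i.e. a table of size 2^10 = 1024 (both are assumptions for illustration):

```c
/* Mid-square: square the key and take r bits from the middle of the
   64-bit product; with r bits the bucket addresses span 2^r values. */
unsigned hash_midsquare(unsigned key) {
    unsigned long long sq = (unsigned long long)key * key;
    const int r = 10;               /* table size = 2^10 = 1024        */
    const int shift = (64 - r) / 2; /* drop low bits, keep the middle  */
    return (unsigned)((sq >> shift) & ((1u << r) - 1));
}
```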
Hash Function:Folding
• the identifier X is partitioned into several parts, all but the last
being of the same length.
• These parts are then added together to obtain the hash
address for X.
• There are two ways of carrying out this addition.
• In the first, all but the last part are shifted so that the least
significant bit of each part lines up with the corresponding bit
of the last part. The parts are then added together to get
f(X). This method is known as shift folding.
• The other method of adding the parts is folding at the
boundaries. In this method, the identifier is folded at the part
boundaries and digits falling into the same position are added
together to obtain f(X).
• Some folding methods go one step further and reverse every
other piece before the addition.
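A sketch of both folding variants, assuming decimal keys split into two-digit parts (the part width is an illustrative choice):

```c
/* Shift folding: split the key into two-digit parts and add them. */
unsigned shift_fold(unsigned long key, unsigned table_size) {
    unsigned sum = 0;
    while (key > 0) {
        sum += key % 100;          /* low two digits form one part */
        key /= 100;
    }
    return sum % table_size;
}

/* Folding at the boundaries: every other part is reversed before the
   addition, as if the digit string were folded at the part boundaries. */
unsigned boundary_fold(unsigned long key, unsigned table_size) {
    unsigned sum = 0;
    int flip = 0;
    while (key > 0) {
        unsigned part = key % 100;
        if (flip)
            part = (part % 10) * 10 + part / 10;  /* reverse two digits */
        sum += part;
        key /= 100;
        flip = !flip;
    }
    return sum % table_size;
}
```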
Collision Resolution
• If, when an element is inserted, it hashes to the
same value as an already inserted element, then we
have a collision.
• When two items hash to the same slot, we must
have a systematic method for placing the second
item in the hash table. This process is called
collision resolution
• There are several methods for collision resolution
– Separate chaining (open hashing)
– Open addressing (closed hashing)
• Linear Probing
• Quadratic Probing
• Double Hashing
Separate Chaining
• The idea is to keep a list of all elements that hash
to the same value. Chaining allows many items to
exist at the same location in the hash table.
– The array elements are pointers to the first nodes of the
lists.
– A new item is inserted to the front of the list.
• Advantages:
– Better space utilization for large items.
– Simple collision handling: searching linked list.
– Overflow: we can store more items than the hash table
size.
– Deletion is quick and easy: deletion from the linked list.
• As more and more items hash to the same
location, the difficulty of searching for the item in
that location increases.
Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81; hash(key) = key % 10.
bucket 0: 0
bucket 1: 81 → 1
bucket 2: (empty)
bucket 3: (empty)
bucket 4: 64 → 4
bucket 5: 25
bucket 6: 36 → 16
bucket 7: (empty)
bucket 8: (empty)
bucket 9: 49 → 9
(New items are inserted at the front of each list, so 81 precedes 1, 64 precedes 4, and so on.)
Operations
• Initialization: all entries are set to NULL
• Find:
– locate the cell using hash function.
– sequential search on the linked list in that cell.
• Insertion:
– Locate the cell using hash function.
– (If the item does not exist) insert it as the first item in
the list.
• Deletion:
– Locate the cell using hash function.
– Delete the item from the linked list.
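The three operations above, as a minimal separate-chaining sketch in C (integer keys and a table size of 10 are assumptions for illustration):

```c
#include <stdlib.h>

#define TABLE_SIZE 10

typedef struct Node { int key; struct Node *next; } Node;

Node *table[TABLE_SIZE];                 /* initialization: all NULL */

unsigned hash(int key) { return (unsigned)key % TABLE_SIZE; }

/* Insertion: locate the cell, insert the item at the front of its list. */
void insert(int key) {
    unsigned h = hash(key);
    Node *n = malloc(sizeof *n);
    n->key = key;
    n->next = table[h];
    table[h] = n;
}

/* Find: locate the cell, then sequential search on its linked list. */
Node *find(int key) {
    for (Node *p = table[hash(key)]; p != NULL; p = p->next)
        if (p->key == key) return p;
    return NULL;
}

/* Deletion: locate the cell, then unlink the node from the list. */
void remove_key(int key) {
    Node **pp = &table[hash(key)];
    while (*pp != NULL && (*pp)->key != key)
        pp = &(*pp)->next;
    if (*pp != NULL) {
        Node *dead = *pp;
        *pp = dead->next;
        free(dead);
    }
}
```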
Analysis of Separate Chaining
• Collisions are very likely.
– How likely, and what is the average length of the
lists?
• Load factor λ definition:
– Ratio of the number of elements (N) in a hash table
to the hash TableSize,
• i.e. λ = N/TableSize.
– The average length of a list is also λ.
– For chaining, λ is not bounded by 1; it can be > 1.
Cost of searching
• Cost = constant time to evaluate the hash function
+ time to traverse the list.
• Unsuccessful search:
– We have to traverse the entire list, so we need to compare λ nodes on
the average.
• Successful search:
– The list contains the one node that stores the searched item + 0 or more
other nodes.
– Expected # of other nodes = (N − 1)/M, which is essentially λ, since
M (= TableSize) is presumed large.
– On the average, we need to check half of the other nodes while
searching for a certain element.
– Thus the average search cost = 1 + λ/2.
Summary
• The analysis shows us that the table size is
not really important, but the load factor is.
• TableSize should be as large as the number
of expected elements in the hash table.
– To keep load factor around 1.
• TableSize should be prime for even
distribution of keys to hash table cells.
Hashing: Open Addressing
Collision Resolution with
Open Addressing
• Separate chaining has the disadvantage of
using linked lists.
– Requires the implementation of a second data
structure.
• In an open addressing hashing system, all
the data go inside the table.
– Thus, a bigger table is needed.
• Generally, the load factor should be below 0.5.
– If a collision occurs, alternative cells are tried
until an empty cell is found.
Open Addressing
• More formally:
– Cells h0(x), h1(x), h2(x), …are tried in succession where
hi(x) = (hash(x) + f(i)) mod TableSize, with f(0) = 0.
– The function f is the collision resolution strategy.
• There are three common collision resolution
strategies:
– Linear Probing
– Quadratic probing
– Double hashing
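The probe formula can be rendered directly in C, with the strategy f passed in as a function pointer (a sketch; the names linear, quadratic and probe are ours):

```c
typedef unsigned (*Strategy)(unsigned i);

unsigned linear(unsigned i)    { return i; }      /* f(i) = i   */
unsigned quadratic(unsigned i) { return i * i; }  /* f(i) = i^2 */

/* h_i(x) = (hash(x) + f(i)) mod TableSize, with f(0) = 0 */
unsigned probe(unsigned hash_val, unsigned i, Strategy f, unsigned table_size) {
    return (hash_val + f(i)) % table_size;
}
```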
Linear Probing
• In linear probing, collisions are resolved by
sequentially scanning an array (with
wraparound) until an empty cell is found.
– i.e. f is a linear function of i, typically f(i)= i.
• Example:
– Insert items with keys: 89, 18, 49, 58, 9 into an
empty hash table.
– Table size is 10.
– Hash function is hash(x) = x mod 10.
• f(i) = i;
Figure 20.4: Linear probing hash table after each insertion.
(Final layout for this example: cell 0: 49, cell 1: 58, cell 2: 9, cell 8: 18, cell 9: 89.)
Find and Delete
• The find algorithm follows the same probe
sequence as the insert algorithm.
– A find for 58 would involve 4 probes.
– A find for 19 would involve 5 probes.
• We must use lazy deletion (i.e. marking
items as deleted)
– Standard deletion (i.e. physically removing the
item) cannot be performed.
– e.g. remove 89 from hash table.
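A sketch of linear probing with lazy deletion, assuming nonnegative integer keys and the table size 10 of the running example:

```c
#define SIZE 10
enum State { EMPTY, OCCUPIED, DELETED };

int keys[SIZE];
enum State state[SIZE];                  /* all EMPTY initially */

/* Insert: scan sequentially (with wraparound) until a free cell. */
int lp_insert(int key) {
    for (int i = 0; i < SIZE; i++) {
        int pos = (key % SIZE + i) % SIZE;
        if (state[pos] != OCCUPIED) {    /* EMPTY or DELETED is reusable */
            keys[pos] = key;
            state[pos] = OCCUPIED;
            return pos;
        }
    }
    return -1;                           /* table full */
}

/* Find follows the same probe sequence; it must step over DELETED
   cells rather than stop there, which is why deletion has to be lazy. */
int lp_find(int key) {
    for (int i = 0; i < SIZE; i++) {
        int pos = (key % SIZE + i) % SIZE;
        if (state[pos] == EMPTY) return -1;   /* never-used cell: absent */
        if (state[pos] == OCCUPIED && keys[pos] == key) return pos;
    }
    return -1;
}

/* Lazy deletion: only mark the cell; never physically empty it. */
void lp_remove(int key) {
    int pos = lp_find(key);
    if (pos >= 0) state[pos] = DELETED;
}
```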
Clustering Problem
• As long as table is big enough, a free cell
can always be found, but the time to do so
can get quite large.
• Worse, even if the table is relatively empty,
blocks of occupied cells start forming.
• This effect is known as primary clustering.
• Any key that hashes into the cluster will
require several attempts to resolve the
collision, and then it will add to the cluster.
Analysis of insertion
• The average number of cells that are examined in
an insertion using linear probing is roughly
(1 + 1/(1 − λ)²) / 2
• For a half full table we obtain 2.5 as the average
number of cells examined during an insertion.
• Primary clustering is a problem at high load
factors. For half empty tables the effect is not
disastrous.
Analysis of Find
• An unsuccessful search costs the same as
insertion.
• The cost of a successful search of X is equal to the
cost of inserting X at the time X was inserted.
• For λ = 0.5 the average cost of insertion is 2.5.
The average cost of finding the newly inserted
item will be 2.5 no matter how many insertions
follow.
• Thus the average cost of a successful search is an
average of the insertion costs over all smaller load
factors.
Linear Probing – Analysis – Example
• What is the average number of probes for a successful
search and an unsuccessful search for this hash table?
– Hash function: h(x) = x mod 11
– Table contents: 0: 9, 1: empty, 2: 2, 3: 13, 4: 25, 5: 24,
6: empty, 7: empty, 8: 30, 9: 20, 10: 10
Successful search (cells probed per key):
– 20: 9 -- 30: 8 -- 2: 2 -- 13: 2, 3 -- 25: 3, 4
– 24: 2, 3, 4, 5 -- 10: 10 -- 9: 9, 10, 0
Avg. probes for SS = (1+1+1+2+2+4+1+3)/8 = 15/8
Unsuccessful search (cells probed from each home position):
– We assume that the hash function uniformly
distributes the keys.
– 0: 0, 1 -- 1: 1 -- 2: 2, 3, 4, 5, 6 -- 3: 3, 4, 5, 6
– 4: 4, 5, 6 -- 5: 5, 6 -- 6: 6 -- 7: 7 -- 8: 8, 9, 10, 0, 1
– 9: 9, 10, 0, 1 -- 10: 10, 0, 1
Avg. probes for US = (2+1+5+4+3+2+1+1+5+4+3)/11 = 31/11
Quadratic Probing
• Quadratic Probing eliminates primary clustering
problem of linear probing.
• The collision function is quadratic.
– The popular choice is f(i) = i².
• If the hash function evaluates to h and a search in
cell h is inconclusive, we try cells h + 1², h + 2², …,
h + i² (mod TableSize).
– i.e. It examines cells 1,4,9 and so on away from the
original probe.
• Remember that subsequent probe points are a
quadratic number of positions from the original
probe point.
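A sketch of the quadratic probe sequence; since i² = (i−1)² + 2i − 1, each probe point can be derived from the previous one without a multiplication (the enum State is the one from the linear-probing sketch above):

```c
/* Probe cells h, h + 1^2, h + 2^2, ... (mod size) for a free cell. */
int qp_find_slot(int key, const enum State state[], int size) {
    int pos = key % size;
    for (int i = 1; i <= size; i++) {
        if (state[pos] != OCCUPIED)
            return pos;                  /* free cell found */
        pos = (pos + 2 * i - 1) % size;  /* advance to h + i^2 */
    }
    return -1;                           /* no free cell reached */
}
```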
Figure 20.6: A quadratic probing hash table after each insertion (note that the table size was poorly chosen because it is not a prime number).
Quadratic Probing
• Problem:
– We may not be sure that we will probe all locations in
the table (i.e. there is no guarantee to find an empty cell
if table is more than half full.)
– If the hash table size is not prime, this problem will be
much more severe.
• However, there is a theorem stating that:
– If the table size is prime and load factor is not larger
than 0.5, all probes will be to different locations and an
item can always be inserted.
Some Considerations
• What happens if load factor gets too high?
– Dynamically expand the table as soon as the
load factor reaches 0.5, which is called
rehashing.
– The new table size should be a prime roughly twice the old size.
– When expanding the hash table, reinsert all elements
into the new table using the new hash function.
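A rehashing sketch under these rules: grow to a prime at least twice the old size and reinsert every element with the new hash function. Linear probing is assumed for the reinsertion, and the caller is assumed to supply new arrays large enough for the returned size:

```c
static int is_prime(int n) {
    if (n < 2) return 0;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

static int next_prime(int n) {
    while (!is_prime(n)) n++;
    return n;
}

/* Returns the new table size; occ[] flags mark occupied cells. */
int rehash(const int *old_keys, const int *old_occ, int old_size,
           int *new_keys, int *new_occ) {
    int new_size = next_prime(2 * old_size);
    for (int i = 0; i < new_size; i++) new_occ[i] = 0;
    for (int i = 0; i < old_size; i++) {
        if (!old_occ[i]) continue;
        int pos = old_keys[i] % new_size;        /* new hash function */
        while (new_occ[pos]) pos = (pos + 1) % new_size;
        new_keys[pos] = old_keys[i];
        new_occ[pos] = 1;
    }
    return new_size;
}
```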
Analysis of Quadratic Probing
• Quadratic probing has not yet been
mathematically analyzed.
• Although quadratic probing eliminates primary
clustering, elements that hash to the same location
will probe the same alternative cells. This is known
as secondary clustering.
• Techniques that eliminate secondary clustering are
available.
– the most popular is double hashing.
Double Hashing
• A second hash function is used to drive the
collision resolution.
– f(i) = i * hash2(x)
• We apply a second hash function to x and probe at
a distance hash2(x), 2*hash2(x), … and so on.
• The function hash2(x) must never evaluate to zero.
– e.g. Let hash2(x) = x mod 9 and try to insert 99 in the
previous example.
• A function such as hash2(x) = R – ( x mod R) with
R a prime smaller than TableSize will work well.
– e.g. try R = 7 for the previous example: hash2(x) = 7 − (x mod 7).
Example
• Insert keys: 89, 18, 49, 58, 69
• Hash1(key) = key % 10, Hash2(key) = 7 − (key % 7)
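A runnable sketch of this example (0 is used as the empty marker, which is safe here because all keys are nonzero):

```c
#include <stdio.h>

#define SIZE 10

int main(void) {
    int table[SIZE] = {0};                 /* 0 marks an empty cell */
    int keys[] = {89, 18, 49, 58, 69};

    for (int k = 0; k < 5; k++) {
        int key = keys[k];
        int h1  = key % 10;                /* Hash1(key) = key % 10      */
        int h2  = 7 - key % 7;             /* Hash2(key) = 7 - (key % 7) */
        int pos = h1;
        for (int i = 1; table[pos] != 0; i++)
            pos = (h1 + i * h2) % SIZE;    /* probe at distance i*Hash2  */
        table[pos] = key;
    }
    for (int i = 0; i < SIZE; i++)
        printf("%d: %d\n", i, table[i]);
    return 0;
}
```

Working through the probes: 49 collides at cell 9 and lands in cell 6 (Hash2 = 7), 58 collides at cell 8 and lands in cell 3 (Hash2 = 5), and 69 collides at cell 9 and lands in cell 0 (Hash2 = 1).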
The relative efficiency of four collision-resolution methods
Linear Probing with Chaining,
with/without Replacement
• Chaining here introduces an additional field stored
alongside each data slot: the chain field.
• A separate chain entry is maintained for
colliding data.
• When a collision occurs, the second colliding key is
stored in the next free slot found by linear probing.
• The address of this colliding key is then stored in the
chain field of the first colliding element; this variant is
called chaining without replacement.
Without replacement example
• Keys: 121, 13, 14, 31, 51, 16, 71, 48, 19; hash(key) = key % 10
• All Key and Chain entries start at −1 (empty). Final state after all insertions:
Index | Key | Chain
0 | −1 | −1
1 | 121 | 2
2 | 31 | 5
3 | 13 | −1
4 | 14 | −1
5 | 51 | 7
6 | 16 | −1
7 | 71 | −1
8 | 48 | −1
9 | 19 | −1
• 31, 51 and 71 all hash to 1, so they are placed in the next free slots (2, 5, 7) and linked through the chain fields: 1 → 2 → 5 → 7. The sketch below reproduces this table.
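A sketch of insertion without replacement that reproduces the table above (keys are assumed nonnegative; −1 marks both an empty slot and the end of a chain, as in the example):

```c
#define SIZE 10

int key_at[SIZE];     /* -1 = empty slot       */
int chain[SIZE];      /* -1 = end of the chain */

void init(void) {
    for (int i = 0; i < SIZE; i++)
        key_at[i] = chain[i] = -1;
}

/* Without replacement: a colliding key goes to the next free slot
   (linear probing) and is linked at the end of the chain starting
   at its home slot; the occupant of the home slot is never moved. */
void insert_nr(int key) {
    int home = key % SIZE;
    if (key_at[home] == -1) {             /* home slot is free */
        key_at[home] = key;
        return;
    }
    int pos = home;
    while (key_at[pos] != -1)             /* probe for a free slot */
        pos = (pos + 1) % SIZE;
    key_at[pos] = key;

    int tail = home;                      /* walk to the chain's end... */
    while (chain[tail] != -1)
        tail = chain[tail];
    chain[tail] = pos;                    /* ...and link the new slot */
}
```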
With replacement example
• Keys: 11, 21, 31, 34, 55, 52, 33; hash(key) = key % 10
• With replacement, a key sitting in a slot other than its home slot is displaced when the rightful owner of that slot arrives: here 52 displaces 21 from slot 2 (21 moves to the next free slot, 6), and 33 displaces 31 from slot 3 (31 moves to slot 7), with the chain fields updated accordingly.
• Initial state: all Key and Chain entries are −1. Final state:
Index | Key | Chain
0 | −1 | −1
1 | 11 | 6
2 | 52 | −1
3 | 33 | −1
4 | 34 | −1
5 | 55 | −1
6 | 21 | 7
7 | 31 | −1
8 | −1 | −1
9 | −1 | −1
Hashing Applications
• Compilers use hash tables to implement the
symbol table (a data structure to keep track
of declared variables).
• Game programs use hash tables to keep
track of positions they have encountered
(the transposition table).
• Online spelling checkers.
Dynamic Hashing
• More effective than static hashing when the
database grows or shrinks.
• Extendable hashing splits and coalesces
buckets appropriately with the database size.
– i.e. buckets are added and deleted on demand.
The Hash Function
• Typically produces a large number of
values, uniformly and randomly.
• Only part of the value is used depending on
the size of the database.
Data Structure
• Hash indices are typically a prefix of the
entire hash value.
• More than one consecutive index can point
to the same bucket.
– The indices have the same hash prefix which
can be shorter than the length of the index.
General Extendable Hash
Structure

In this structure, i2 = i3 = i, whereas i1 = i − 1 (i is the number of bits used by the bucket address table; each ij is the number used by bucket j).


Queries and Updates
• Lookup
– Take the first i bits of the hash value.
– Follow the corresponding entry in the
bucket address table.
– Look in the bucket.
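A lookup sketch for the structure above, assuming 32-bit hash values and a directory of 2^i bucket pointers indexed by the first i bits (Bucket is left opaque):

```c
typedef struct Bucket Bucket;   /* contents omitted in this sketch */

/* Take the first i bits of the hash value, follow the corresponding
   directory entry, and return the bucket to be searched. */
Bucket *eh_lookup(Bucket **dir, int i, unsigned hash_val) {
    unsigned index = (i == 0) ? 0 : hash_val >> (32 - i);
    return dir[index];
}
```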
Queries and Updates (Cont’d)
• Insertion
– Follow lookup procedure
– If the bucket has space, add the record.
– If not…
Insertion (Cont’d)
• Case 1: i = i_j (the global depth equals the local depth of the overflowing bucket j)
– Use an additional bit in the hash value
• This doubles the size of the bucket address table.
• Makes two entries in the table point to the full bucket.
– Allocate a new bucket, z.
• Set i_j and i_z to i
• Point the second entry to the new bucket
• Rehash the old bucket
– Repeat insertion attempt
Insertion (Cont’d)
• Case 2: i > i_j (the global depth exceeds the bucket's local depth)
– Allocate a new bucket, z
– Add 1 to i_j, and set i_j and i_z to this new value
– Put half of the entries in the first bucket and
half in the other
– Rehash records in bucket j
– Reattempt insertion
Example
• Insert the following values
• Directory index = 2, bucket size = 4
• Hash function: key mod 64
Comparison to Other Hashing
Methods
• Advantage: performance does not decrease
as the database size increases
– Space is conserved by adding and removing as
necessary
• Disadvantage: additional level of
indirection for operations
– Complex implementation
– Memory is wasted in pointers when the global
depth and local depth difference becomes
drastic.
Ordered Indexing vs. Hashing
• Hashing is less efficient if queries to the
database include ranges as opposed to
specific values.
• In cases where ranges are infrequent,
hashing provides faster insertion, deletion,
and lookup than ordered indexing.
Summary
• Hash tables can be used to implement the insert
and find operations in constant average time.
– it depends on the load factor not on the number of items
in the table.
• It is important to have a prime TableSize and a
correct choice of load factor and hash function.
• For separate chaining the load factor should be
close to 1.
• For open addressing load factor should not exceed
0.5 unless this is completely unavoidable.
– Rehashing can be implemented to grow (or shrink) the
table.
Dictionary
• A dictionary is a collection of items stored as key-element
pairs, used to look up items quickly by key
• We should be able to test keys for equality; for some
applications, we might want to not allow more than one
element with the same key (ex: student records stored by
student ID number), but for others this is acceptable (ex:
an English dictionary, since many words have multiple
definitions)
• All dictionaries should support certain basic operations,
like size(), isEmpty(), find(Key k), findAllVals(Key k),
insert(KeyValPair p), remove(Key k), removeAllVals(Key k).
Dictionary as ADT
• Data:
• Set of (key, value) pairs
• keys are mapped to values
• keys must be comparable
• keys must be unique
• Standard Operations:
• insert(key, value)
• find(key)
• delete(key)
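One possible C rendering of this ADT as an opaque interface (the key and value types are placeholders for the sketch; any of the hash tables above could sit behind it):

```c
typedef int Key;                 /* must be comparable for equality */
typedef const char *Value;

typedef struct Dict Dict;        /* representation hidden by the ADT */

Dict  *dict_new(void);
void   dict_insert(Dict *d, Key k, Value v);   /* insert(key, value)        */
Value  dict_find(const Dict *d, Key k);        /* find(key); NULL if absent */
void   dict_delete(Dict *d, Key k);            /* delete(key)               */
int    dict_size(const Dict *d);
int    dict_is_empty(const Dict *d);
```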
ordered dictionary
• An ordered dictionary is a particular kind of
dictionary in which we have a total ordering
on the keys. (Whereas before, we could
only test keys for equality, now we can
compare them.) An ordered dictionary will
support all the same methods as a generic
dictionary, plus some additional ones,
namely closestBefore(Key k),
closestAfter(Key k).
Skip list
• What is the worst-case search time for a sorted
linked list? For a balanced binary search tree? For a
sorted array?
– O(n), O(log n) and O(log n) respectively.
• How can we search faster in a sorted linked
list?
– We can create multiple layers so that we can
skip some nodes.
Skip list
• The upper layer works as an “express lane” which
connects only the main outer stations, and the lower layer
works as a “normal lane” which connects every station.
• Suppose we want to search for 50. We start from the first node
of the “express lane” and keep moving on the “express lane” till
we find a node whose next is greater than 50. Once we find
such a node (30 in the following example) on the
“express lane”, we move to the “normal lane” using the pointer
from this node, and linearly search for 50 on the “normal
lane”.
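A sketch of the two-lane search described above, assuming the head node holds the smallest key (a sentinel) so both lane scans can start from it:

```c
#include <stddef.h>

typedef struct SNode {
    int key;
    struct SNode *next;      /* normal lane: links every node        */
    struct SNode *express;   /* express lane: NULL on ordinary nodes */
} SNode;

/* Ride the express lane while the next express stop is <= the target,
   then drop down and scan the normal lane. */
SNode *skip_search(SNode *head, int key) {
    SNode *p = head;
    while (p->express != NULL && p->express->key <= key)
        p = p->express;
    while (p != NULL && p->key < key)
        p = p->next;
    return (p != NULL && p->key == key) ? p : NULL;
}
```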
