
CS235102 Data Structures

This document provides an overview of hashing techniques for implementing symbol tables. It describes the symbol table abstract data type and common operations like search, insert and delete. It then explains hash tables and different hashing functions like mid-square, division, folding and digit analysis. Finally, it discusses approaches for handling collisions when keys hash to the same location, including linear probing, quadratic probing, rehashing and chaining. The goal of hashing is to support fast constant-time operations for symbol tables by distributing keys across table locations via a hash function.

Uploaded by

janagyrama1

CS235102

Data Structures
Chapter 8 Hashing

Chapter 8 Hashing: Outline


The Symbol Table Abstract Data Type
Static Hashing
Hash Tables
Hashing Functions
Mid-square
Division
Folding
Digit Analysis
Overflow Handling
Linear Open Addressing, Quadratic probing, Rehashing
Chaining

The Symbol Table ADT (1/3)


Many examples of dictionaries are found in
applications, e.g., a spelling checker
In computer science, we generally use the term
symbol table rather than dictionary, when
referring to the ADT.
We define the symbol table as a set of name-attribute pairs.
Example: In a symbol table for a compiler
the name is an identifier
the attributes might include an initial value and
a list of lines that use the identifier.

The Symbol Table ADT (2/3)


Operations on symbol table:

Determine if a particular name is in the table


Retrieve/modify the attributes of that name
Insert/delete a name and its attributes

Implementations
Binary search tree: the worst-case complexity is O(n)
Some other (balanced) binary trees (chapter 10): O(log n).

Hashing
A technique for search, insert, and delete operations
that has very good expected performance.

The Symbol Table ADT (3/3)

Search Techniques
Search tree methods
Identifier comparisons

Hashing methods
Relies on a formula called the hash function.

Types of hashing
Static hashing
Dynamic hashing

Hash Tables (1/6)


In static hashing, we store the identifiers in a
fixed size table called a hash table
Arithmetic function, f
To determine the address of an identifier, x, in the
table
f(x) gives the hash, or home address, of x in the table

Hash table, ht
Stored in sequential memory locations that are
partitioned into b buckets, ht[0], ..., ht[b-1].
Each bucket has s slots

Hash Tables (2/6)


[Figure: hash table ht with b buckets, numbered 0 to b-1, each holding s slots; f(x) maps an identifier to a bucket address in 0 .. b-1]

Hash Tables (3/6)


The identifier density of a hash table
is the ratio n/T
n is the number of identifiers in the table
T is the total number of possible identifiers

The loading density or loading factor
of a hash table is α = n/(sb)
s is the number of slots per bucket
b is the number of buckets

Hash Tables (4/6)


Two identifiers, i1 and i2 are synonyms with
respect to f if f(i1) = f(i2)
We enter distinct synonyms into the same bucket as
long as the bucket has slots available

An overflow occurs when we hash a new
identifier into a full bucket
A collision occurs when we hash two
non-identical identifiers into the same bucket.
When the bucket size is 1, collisions and
overflows occur simultaneously.

Hash Tables (5/6)


Example 8.1: Hash table
b = 26 buckets and s = 2 slots. Distinct identifiers n = 10
The loading factor, α, is 10/52 = 0.19.
Associate the letters, a-z, with the numbers, 0-25,
respectively
Define a fairly simple hash function, f(x), as the
first character of x.

C library functions (f(x)):
acos(0), define(3), float(5), exp(4),
char(2), atan(0), ceil(2), floor(5),
clock(2), ctime(2)
synonyms: acos and atan; float and floor; char, ceil, clock, and ctime
overflow: clock, ctime
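Example 8.1 can be replayed with a short sketch. The function names first_char_hash and build_table are hypothetical, and a Python list of lists is assumed as a stand-in for the b = 26 buckets of s = 2 slots each; identifiers are inserted in the order listed on the slide.

```python
def first_char_hash(name: str) -> int:
    """Map the first character 'a'..'z' to a bucket address 0..25."""
    return ord(name[0]) - ord("a")

def build_table(names, buckets=26, slots=2):
    """Insert names in order; return (table, names that overflowed)."""
    table = [[] for _ in range(buckets)]
    overflow = []
    for name in names:
        bucket = table[first_char_hash(name)]
        if len(bucket) < slots:
            bucket.append(name)      # a free slot is available
        else:
            overflow.append(name)    # home bucket is already full
    return table, overflow

names = ["acos", "define", "float", "exp", "char",
         "atan", "ceil", "floor", "clock", "ctime"]
table, overflow = build_table(names)
loading = len(names) / (26 * 2)      # n / (s * b)

print(overflow)             # ['clock', 'ctime']
print(round(loading, 2))    # 0.19
```

The sketch only records which identifiers overflow; it does not place them anywhere, since overflow handling is a separate topic.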

Hash Tables (6/6)


When no overflows occur, the time required to enter,
delete, or search for identifiers does not depend on the
number of identifiers n in use; it is O(1).
Hash function requirements:
Easy to compute and produces few collisions.
Unfortunately, since the ratio b/T is usually small, we
cannot avoid collisions altogether.
=> Overflow handling mechanisms are needed

Hashing Functions (1/8)


A hash function, f, transforms an identifier, x, into
a bucket address in the hash table.
We want a hash function that is easy to compute
and that minimizes the number of collisions.
Hashing functions should be unbiased.
That is, if we randomly choose an identifier, x, from
the identifier space, the probability that f(x) = i is 1/b
for all buckets i.
We call a hash function that satisfies this unbiased
property a uniform hash function.
Hash functions covered: Mid-square, Division, Folding, Digit Analysis

Hashing Functions (2/8)


Mid-square: $f_m(x) = \mathrm{middle}(x^2)$
Frequently used in symbol table applications.
We compute $f_m$ by squaring the identifier and then
using an appropriate number of bits from the middle of
the square to obtain the bucket address.
The number of bits used to obtain the bucket address
depends on the table size. If we use r bits, the range
of the values is $2^r$.
Since the middle bits of the square usually depend
upon all the characters in an identifier, there is high
probability that different identifiers will produce
different hash addresses.
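A minimal sketch of the mid-square idea, assuming the identifier has already been converted to a non-negative integer that fits in key_bits bits (both the function name and the key_bits parameter are hypothetical, not from the slides). The square then fits in 2 * key_bits bits, and the r bits around its center are kept.

```python
def mid_square(x: int, key_bits: int, r: int) -> int:
    """Return the middle r bits of x squared (requires r <= 2 * key_bits)."""
    square = x * x                       # fits in 2 * key_bits bits
    shift = (2 * key_bits - r) // 2      # number of low-order bits to drop
    return (square >> shift) & ((1 << r) - 1)

# 11 is 1011 in binary; 11 * 11 = 121 is 01111001 as an 8-bit value,
# and its middle four bits are 1110.
print(mid_square(11, key_bits=4, r=4))   # 14
```

With r bits the result always lies in the range 0 to 2^r - 1, so the table would need 2^r buckets.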

Hashing Functions (3/8)


Division: $f_D(x) = x \% M$
Uses the modulus (%) operator.
We divide the identifier x by some number M and use
the remainder as the hash address for x.
This gives bucket addresses that range from 0 to M - 1,
where M is the table size.

The choice of M is critical.
If M is divisible by 2, then odd keys map to odd buckets
and even keys map to even buckets. (biased!!)
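The bias for an even M is easy to observe. The sketch below assumes integer keys and compares M = 8 with M = 7 on a key set that happens to contain only even numbers; the divisors and the key set are chosen purely for illustration.

```python
def division_hash(x: int, m: int) -> int:
    """Division method: the remainder of x divided by m."""
    return x % m

even_keys = range(0, 1000, 2)

# With an even divisor, even keys can only reach even buckets.
print(sorted({division_hash(k, 8) for k in even_keys}))  # [0, 2, 4, 6]

# With an odd divisor, the same keys reach every bucket.
print(sorted({division_hash(k, 7) for k in even_keys}))  # [0, 1, 2, 3, 4, 5, 6]
```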

Hashing Functions (4/8)


The choice of M is critical (contd.)
When many identifiers are permutations of each other, a biased use of
the table results.
Example: $X = x_1 x_2$ and $Y = x_2 x_1$
Internal binary representation: $x_1 \to C(x_1)$ and $x_2 \to C(x_2)$
Each character is represented by six bits
$X = C(x_1) \cdot 2^6 + C(x_2)$, $Y = C(x_2) \cdot 2^6 + C(x_1)$
With M = p (where p is a prime number):
$(f_D(X) - f_D(Y)) \% p = (C(x_1) \cdot 2^6 \% p + C(x_2) \% p - C(x_2) \cdot 2^6 \% p - C(x_1) \% p) \% p$
For p = 3, $2^6 = 64$ and 64 % 3 = 1, so
$(64 \% 3 \cdot C(x_1) \% 3 + C(x_2) \% 3 - 64 \% 3 \cdot C(x_2) \% 3 - C(x_1) \% 3) \% 3$
$= (C(x_1) \% 3 + C(x_2) \% 3 - C(x_2) \% 3 - C(x_1) \% 3) \% 3 = 0$
so X and Y hash to the same bucket.
The same behavior can be expected when p = 7, since 64 % 7 = 1.
A good choice for M would be: M a prime number such that M does
not divide $r^k \pm a$ for small k and a (r is the radix of the character set).
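The permutation argument can be checked numerically. In this sketch the six-bit character code C is an assumption (a = 1 through z = 26), as are the helper names; the divisor 101 is an arbitrary larger prime included only for contrast.

```python
from itertools import combinations
from string import ascii_lowercase

def code(ch: str) -> int:
    """Hypothetical six-bit character code C(ch): a = 1, ..., z = 26."""
    return ord(ch) - ord("a") + 1

def pack(two_chars: str) -> int:
    """Internal representation of x1 x2: C(x1) * 2^6 + C(x2)."""
    return code(two_chars[0]) * 2**6 + code(two_chars[1])

def colliding_pairs(m: int) -> int:
    """Count letter pairs whose two orderings hash to the same bucket mod m."""
    return sum(pack(a + b) % m == pack(b + a) % m
               for a, b in combinations(ascii_lowercase, 2))

print(pack("ab") % 3, pack("ba") % 3)   # 0 0
print(pack("ab") % 7, pack("ba") % 7)   # 3 3
print(colliding_pairs(3), colliding_pairs(7), colliding_pairs(101))  # 325 325 0
```

All 325 pairs of distinct letters collide for M = 3 and M = 7, because 64 leaves remainder 1 in both cases.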

Hashing Functions (5/8)


Folding

Partition identifier x into several parts


All parts except for the last one have the same length
Add the parts together to obtain the hash address

Two possibilities (divide x into several parts)


Shift folding:
Shift all parts except for the last one, so that the least
significant bit of each part lines up with corresponding
bit of the last part.
x1=123, x2=203, x3=241, x4=112, x5=20, address=699

Folding at the boundaries:


reverses every other partition before adding
x1=123, x2=302, x3=241, x4=211, x5=20, address=897

Hashing Functions (6/8)


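Folding example:
[Figure: parts P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20. Shift folding adds the parts as they are: 123 + 203 + 241 + 112 + 20 = 699. Folding at the boundaries reverses the digit order of every other part (MSD ---> LSD, LSD <--- MSD) before adding.]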
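Both folding variants fit in a few lines. This is a sketch that assumes the identifier is given as a string of decimal digits and that parts are three digits long, as in the example; the function name fold and its parameters are hypothetical.

```python
def fold(digits: str, part_len: int, at_boundaries: bool = False) -> int:
    """Split digits into parts of part_len and add them.

    With at_boundaries=True, every other part (the 2nd, 4th, ...) is
    reversed before adding, as if the paper were folded at part boundaries.
    """
    parts = [digits[i:i + part_len] for i in range(0, len(digits), part_len)]
    if at_boundaries:
        parts = [p[::-1] if k % 2 == 1 else p for k, p in enumerate(parts)]
    return sum(int(p) for p in parts)

x = "12320324111220"   # parts: 123, 203, 241, 112, 20
print(fold(x, 3))                        # 699
print(fold(x, 3, at_boundaries=True))    # 897
```

The sum may exceed the table size, so a real implementation would still need to reduce it to a valid bucket address.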

Hashing Functions (7/8)


Digit Analysis
Used with static files
A static file is one in which all the identifiers are known in
advance.

Using this method,


First, transform the identifiers into numbers using some radix,
r.
Second, examine the digits of each identifier, deleting those
digits that have the most skewed distribution.
We continue deleting digits until the number of remaining
digits is small enough to give an address in the range of the
hash table.

Hashing Functions (8/8)


Digit Analysis example:
All the identifiers are known in advance, M=1~999
X1: d11 d12 ... d1n
X2: d21 d22 ... d2n
...
Xm: dm1 dm2 ... dmn
Select 3 digits from n
Criterion:
Delete the digits having the most skewed distributions
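One possible way to carry out this selection is sketched below. The slides do not say how skew is measured, so the sum of squared digit counts per position is an assumption here (a larger value means a less even spread), as are the function names and the sample keys.

```python
from collections import Counter

def skew(column) -> int:
    """Assumed skew measure: sum of squared digit counts (larger = more skewed)."""
    return sum(c * c for c in Counter(column).values())

def digit_analysis(keys, keep):
    """Keep the `keep` least skewed digit positions; return (positions, addresses)."""
    n = len(keys[0])                       # all keys have n digits
    columns = [[k[i] for k in keys] for i in range(n)]
    ranked = sorted(range(n), key=lambda i: skew(columns[i]))
    kept = sorted(ranked[:keep])           # preserve left-to-right digit order
    return kept, [int("".join(k[i] for i in kept)) for k in keys]

keys = ["1203", "1214", "1225", "1236"]    # first two digits never vary
print(digit_analysis(keys, keep=2))        # ([2, 3], [3, 14, 25, 36])
```

Because the first two positions are identical in every key, they carry no information and are the ones deleted.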

The hash function most suitable for general-purpose
applications is the division method with a divisor, M,
such that M has no prime factors less than 20.

Overflow Handling (1/8)


Linear open addressing (Linear probing)
Compute f(x) for identifier x
Examine the buckets:
ht[(f(x)+j)%TABLE_SIZE], 0 ≤ j ≤ TABLE_SIZE
For each bucket examined, one of four cases holds:
The bucket contains x (x is already in the table).
The bucket contains the empty string (insert x into it).
The bucket contains a nonempty string other than x
(examine the next bucket, wrapping around circularly).
We return to the home bucket ht[f(x)]
(the table is full; we report an error condition and exit).
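The probing procedure can be sketched as follows, assuming one slot per bucket and the empty string as the marker for an unused bucket. The names linear_probe_insert and first_letter are hypothetical, and the five-bucket table is only for illustration.

```python
EMPTY = ""  # the empty string marks an unused bucket

def linear_probe_insert(ht, x, f):
    """Insert x into ht by linear probing and return the bucket used."""
    size = len(ht)
    home = f(x) % size
    for j in range(size):
        i = (home + j) % size       # wrap around circularly
        if ht[i] == x:
            return i                # x is already in the table
        if ht[i] == EMPTY:
            ht[i] = x               # first free bucket after the home bucket
            return i
        # otherwise the bucket holds another identifier: try the next one
    raise OverflowError("hash table is full")  # we are back at the home bucket

def first_letter(name):
    return ord(name[0]) - ord("a")

ht = [EMPTY] * 5
for name in ["ab", "ac", "ea", "eb"]:
    print(name, linear_probe_insert(ht, name, first_letter))
# ab 0
# ac 1
# ea 4
# eb 2
```

The last insertion shows the circular scan: eb starts at bucket 4, wraps to buckets 0 and 1, and settles in bucket 2.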

Overflow Handling (2/8)


Additive transformation and Division

[Figure: hash table with linear probing (13 buckets, 1 slot/bucket), showing insertion]

Overflow Handling (3/8)

Problem of Linear Probing

Identifiers tend to cluster together
Adjacent clusters tend to coalesce
This increases the search time
Example: suppose we enter the
C built-in functions into a
26-bucket hash table in order.
The hash function uses the first
character in each function name

Enter sequence:
acos, atoi, char, define, exp,
ceil, cos, float, atol, floor, ctime
Average # of key comparisons = 41/11 = 3.73

[Figure: hash table with linear probing (26 buckets, 1 slot/bucket)]
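The clustering in this example can be made visible by recording where each identifier ends up. This is a sketch under the slide's assumptions (26 buckets, one slot each, first-character hash); the helper names are hypothetical and the loop assumes the table never fills.

```python
def first_char(name: str) -> int:
    return ord(name[0]) - ord("a")

def place_all(names, b=26):
    """Insert names by linear probing; return {name: bucket} (table must not fill)."""
    ht = [None] * b
    placed = {}
    for name in names:
        i = first_char(name)
        while ht[i] is not None:    # occupied: move to the next bucket
            i = (i + 1) % b
        ht[i] = name
        placed[name] = i
    return placed

names = ["acos", "atoi", "char", "define", "exp",
         "ceil", "cos", "float", "atol", "floor", "ctime"]
placed = place_all(names)
for name in ("ceil", "atol", "ctime"):
    print(name, "home", first_char(name), "bucket", placed[name])
# ceil home 2 bucket 5
# atol home 0 bucket 8
# ctime home 2 bucket 10
```

The eleven identifiers fill buckets 0 through 10 as a single cluster, so late arrivals such as ctime land far from their home bucket.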

Overflow Handling (4/8)


Alternative techniques to improve open
addressing approach:

Quadratic probing
rehashing
random probing

Rehashing
Try f1, f2, ..., fm in sequence if collision occurs
Disadvantage of these open addressing techniques:
comparison of identifiers with different hash values
=> use chaining to resolve collisions
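Rehashing can be sketched as trying a list of hash functions in order. The three functions below (first character, last character, length) are purely hypothetical stand-ins for f1, f2, ..., fm, and None is assumed to mark an empty bucket.

```python
def rehash_insert(ht, x, hash_functions):
    """Try each hash function in turn until a usable bucket is found."""
    for f in hash_functions:
        i = f(x) % len(ht)
        if ht[i] is None or ht[i] == x:
            ht[i] = x
            return i
    raise OverflowError("every hash function led to an occupied bucket")

# Hypothetical series of hash functions for lowercase identifiers.
fs = [
    lambda s: ord(s[0]) - ord("a"),     # f1: first character
    lambda s: ord(s[-1]) - ord("a"),    # f2: last character
    lambda s: len(s),                   # f3: length
]

ht = [None] * 26
print(rehash_insert(ht, "acos", fs))    # 0
print(rehash_insert(ht, "atoi", fs))    # 8
```

Here atoi collides with acos under f1 and is placed by f2 instead.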

Overflow Handling (5/8)


Quadratic Probing
Linear probing searches buckets (f(x)+i)%b
Quadratic probing uses a quadratic function of i as the
increment
Examine buckets f(x), $(f(x)+i^2) \% b$, $(f(x)-i^2) \% b$, for
1<=i<=(b-1)/2
When b is a prime number of the form 4j+3, where j is an
integer, the quadratic search examines every bucket in the table

Prime    j      Prime    j
3        0      43       10
7        1      59       14
11       2      127      31
19       4      251      62
23       5      503      125
31       7      1019     254
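The coverage claim for primes of the form 4j+3 can be checked directly. The sketch below generates the probe sequence described above; the function names are hypothetical, and b = 13 (a prime of the form 4j+1) is included only as a contrasting case.

```python
def quadratic_probe_sequence(home: int, b: int):
    """Yield f(x), (f(x)+i^2) % b, (f(x)-i^2) % b for 1 <= i <= (b-1)/2."""
    yield home % b
    for i in range(1, (b - 1) // 2 + 1):
        yield (home + i * i) % b
        yield (home - i * i) % b

def buckets_reached(b: int, home: int = 0) -> int:
    """Number of distinct buckets visited by the probe sequence."""
    return len(set(quadratic_probe_sequence(home, b)))

print(list(quadratic_probe_sequence(2, 7)))   # [2, 3, 1, 6, 5, 4, 0]
print(buckets_reached(19))                    # 19
print(buckets_reached(13))                    # 7
```

For b = 7 and b = 19 every bucket is visited exactly once, while for b = 13 the sequence revisits buckets and reaches only 7 of the 13.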

Overflow Handling (6/8)


Chaining
Linear probing and its variations perform poorly
because inserting an identifier requires the
comparison of identifiers with different hash values.
In this approach we maintain a list of synonyms for
each bucket.
To insert a new element
Compute the hash address f(x)
Examine the identifiers in the list for f(x).

Since we would not know the sizes of the lists in
advance, we should maintain them as linked chains

Overflow Handling (7/8)


Results of Hash Chaining
acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime
f (x)=first character of x

# of key comparisons=21/11=1.91
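The comparison count above can be reproduced with a short sketch. A Python list per bucket is assumed here as a stand-in for a linked chain, new identifiers are appended at the end of their chain, and the function names are hypothetical.

```python
from collections import defaultdict

def first_char(name: str) -> int:
    return ord(name[0]) - ord("a")

def build_chains(names, f):
    """Insert each name into the chain of its home bucket."""
    chains = defaultdict(list)
    for name in names:
        chain = chains[f(name)]
        if name not in chain:       # examine the identifiers in the list for f(x)
            chain.append(name)
    return chains

def comparisons_to_find(chains, f, name) -> int:
    """Key comparisons needed to reach name in its chain."""
    return chains[f(name)].index(name) + 1

names = ["acos", "atoi", "char", "define", "exp",
         "ceil", "cos", "float", "atol", "floor", "ctime"]
chains = build_chains(names, first_char)
total = sum(comparisons_to_find(chains, first_char, n) for n in names)

print(total, round(total / len(names), 2))   # 21 1.91
```

Only synonyms are ever compared, which is why the average stays low even though the c chain holds four identifiers.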

Comparison:

Overflow Handling (8/8)

In Figure 8.7, the values in each column give the average
number of bucket accesses made in searching eight
different tables with 33,575, 24,050, 4909, 3072, 2241,
930, 762, and 500 identifiers each.
Chaining performs better than linear open addressing.
We can see that division is generally superior to the other hash functions

[Figure 8.7: Average number of bucket accesses per identifier retrieved]

Dynamic Hashing
Dynamic hashing using directories
Analysis of directory dynamic hashing

simulation

Directoryless dynamic hashing

Dynamic Hashing Using Directories

Program 8.5: Dynamic hashing

Analysis of Directory Dynamic Hashing

Directoryless Dynamic Hashing
