CH8 Hashing

This document discusses hashing as it relates to data structures. It begins with an outline of topics to be covered, including static hashing, dynamic hashing, and Bloom filters. It then provides background on hashing, explaining that it allows searching in O(1) time. Examples are given of applications that use dictionaries, like spelling checkers and databases. Different representations of dictionaries are reviewed, with hashing noted as allowing O(1) time searches. Finally, hash tables are explained as an efficient implementation of a symbol table using a hash function to map keys to buckets.

Uploaded by

張思思
Data Structures
CH8 Hashing
Prof. Tai-Lang Jong
Office: Delta 928
Tel: 42577
email: [email protected]
Spring 2022
Outline
• 8.1 Introduction
• 8.2 Static hashing
• 8.3 Dynamic hashing
• 8.4 Bloom filters

2
Hashing in Data Structures
• Searching is the dominant operation on any data structure. Inserting, deleting, and updating usually require a search first, so the cost of searching largely determines the time complexity of a data structure.
• Among comparison-based structures, the best search time is O(log n), achieved by balanced trees such as AVL trees and by binary search on a sorted array. Most other structures take O(n) time.
• Hashing was introduced to solve this searching problem: it supports searching in O(1) expected time, i.e., constant time on average.
3
Dictionary ADT Revisited
• A dictionary is a collection of items.
• Each item is a pair
• (key, element) or (key, value)
• Pairs have different keys.
• Keys may not have an ordering
• Dictionary represents a mapping from keys to elements.
• The primary use of dictionary is to store elements so that they can be
located quickly using (search) keys
• Operations.
• Get(theKey) => search
• Delete(theKey) => delete
• Insert(theKey, theElement) => insert
• Additional operations: (for dynamic dictionary)
• Ascend() (sorting)
• Get(index)
• Delete(index)
4
ADT 5.3 Dictionary Revisited
template <class K, class E>
class Dictionary {
public:
virtual bool IsEmpty() const = 0;
// return true iff the dictionary is empty
virtual pair <K, E>* Get(const K&) const = 0;
// return pointer to the pair with specified key; return 0 if no such pair
virtual void Insert(const pair <K, E>&) = 0;
// insert the given pair; if key is a duplicate update associated element
virtual void Delete(const K&) = 0;
// delete the pair with specified key
};
• The dictionary ADT provides operations for storing records, finding records, and removing records from the collection.
• This ADT gives us a standard basis for comparing various data structures.
Applications of Dictionary
• Spelling checker
• Thesaurus
• Natural language dictionary
• key: word in language X; value: word in language Y
• Index for a database
• Contact book
• key: name of person; value: telephone number
• Property-value collection
• key: property name; value: associated value
• Symbol tables generated by loaders, assemblers, and
compilers, e.g.,
• Table of program variable identifiers
• key: identifier; value: address in memory
6
Representation of Dictionary
• Sorted or unsorted sequences (linear lists)
• Easy to implement, space efficient
• Insertion itself is fast, but needs a lookup to check whether the name is already present
• Lookup is slow
• Binary search tree
• Get, Insert, Delete take O(n) time (worst case)
• Balanced binary search tree
• Get, Insert, Delete take O(log n) time
• Variants listed on the slide: AVL trees, self-organizing BSTs, red-black trees, (a,b)-trees (in particular 2-3 trees), B-trees
• Hashing
• Get, Insert, Delete take O(1) expected time
7
Simple Implementations of Dictionary
• Elements of a dictionary can be kept in a sequence (linked
list or array):
• data size: number of elements (n);
• dominant operation: key comparison
• Unordered:
search: O(n); insert: O(1)(no checking); delete: O(n)
• Ordered array:
search: O(logn); insert O(n); delete O(n)
• Ordered linked list:
search: O(n); insert O(n); delete: O(n)
• (keeping the sequence sorted does not help in this case!)
• Space complexity: Θ(n)
8
Complexity Of Dictionary Operations
Get(), Insert() and Delete()
Data Structure                 Worst Case   Expected
Hash Table                     O(n)         O(1)
Binary Search Tree             O(n)         O(log n)
Balanced Binary Search Tree    O(log n)     O(log n)

n is the number of elements in the dictionary


9
Symbol Table
• Symbol Table
• Can be viewed as a set of name-attribute pairs (like (key,
element) or (key, value) pair in dictionary)
• A form of dictionary
• Is widely used in many applications, including spelling checkers, thesauri, loaders, and compilers
• Common operations on a symbol table
• Search for a particular name in the table (Get(K&))
• Retrieve the attributes of that name
• Modify the attributes of that name
• Insert a new name and its attributes (Insert(<K,E>&))
• Delete a name and its attributes (Delete(K&))
10
How To Implement Symbol
Table?
1.Direct Addressing
• Assume potential keys are numbers from some universe U ⊆ N.
• An element with key k ∈ U can be kept under index k in a |U|-
element array:
search: O(1); insert: O(1); delete: O(1)
• This is extremely fast! What is the price?
n - number of elements currently kept.
What is space complexity?
space complexity: O(|U|) (|U| can be very high, even if we
keep a small number of elements!)
• Direct addressing is fast but wastes a lot of memory (when |U| >> n)
11
How To Implement Symbol
Table?
2. Binary Search Tree
• Allows efficient search, insert, and delete operation in O(h),
where h is the height of the tree
• Worst case O(n), where n is the total number of identifiers
• Can be improved to O(log n) (balanced BST, AVL tree, ..)
3. Hash Table
• A fixed-size linear array, ht
• For an identifier, x, The address of x is determined by a
hashing function h(x)

12
Hash Tables
• A hash table is another easy and efficient
implementation of a symbol table.
• Elements are kept in a b-element table, b << |U|
• It works with keys that are not ordered, but supports only
• Insert
• Delete
• Search
• It is based on the concept of a hash function h(key), h : U → [0..b − 1]
• Maps each possible element into a specified bucket
• The number of buckets b is much less than the number of
possible elements |U| (b << |U|)
• Each bucket can store a limited number of elements
13
Hashing Non-integer Keys
• For integer key:
hashing function: h : U → [0..b − 1], (|U| > b)
• What if the type of key is not integer?
• Additional step is needed:
• Before computing the hash function, the key should be
transformed to integer.
• For example: key is a string of characters, the
transformation should depend on all characters.
• This transforming function should have similar
properties to hashing function.

14
Registration Division Example
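"Everyone, please contact a Registrar's Office staff member to check your semester grades."
Most of the campus phones or emails the first staff member on the list.

Staff member   Extension / Email
Chen OO        31300 / chen@nthu...
Kuo OO         31301 / kuo@nthu...
Li OO          31302 / li@nthu...
Lin OO         31303 / lin@nthu..
Wang OO        31304 / wang@nthu...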

15
Hash Concepts
• Hash function
• Any deterministic function that can map data of
arbitrary size (original keys) to a continuous range data
of a desired fixed size: 0≤h(k)<b (hashed keys)
[Figure: keys (e.g., names) → hash function h() → hashed keys (e.g., 0~4), which index the table of values]
"周杰倫" → 2
"Donald Trump" → 3
"鈴木一朗" → 0

Hashed key   Staff member   Extension
0            Chen OO        31300
1            Kuo OO         31301
2            Li OO          31302
3            Lin OO         31303
4            Wang OO        31304
……
16
Hash Concepts
• Hash function
• It shuffles the order of mapping
• But it is deterministic

17
Hash in Cooking
• Hash: "chop and mix foods"

• Example: hash browns

[Figure: hash browns as the output of a "McDonald" function]

18
Hash in Chinese Decomposition
• Decompose Chinese characters into keyboard strokes
• Facilitate Chinese input
• Example: the Boshiamy ( 嘸蝦米 ) decomposition scheme

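[Figure: the characters 哈, 州, and 哥 pass through the Boshiamy function and come out as keystroke codes; the codes legible in the figure are OAO, YYY, and TOTO]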
19
Hash in Storing Data
• Example: Storing students' grades according to
their name initial letters

[Figure: the hash function "name initial" maps each record to one of the buckets A to Z]
John, 95 → J
Alice, 100 → A
Jane, 100 → J

Bucket A: (Alice, 100)
Bucket J: (John, 95) (Jane, 100)
20
Hashing
• Static hashing
• Store the dictionary (key, element) pairs in hash table
• Hash Table
• Hash Function
• Collision & Overflow Handling
• Open addressing (also called closed hashing)
• Chaining (also called open hashing)
• Dynamic hashing (extendible hashing)
• Retain the fast retrieval time of conventional hashing while
extending the technique so that it can accommodate
dynamically increasing and decreasing table size without
penalty.
• Provide acceptable hash table performance on a per-operation basis, using either a directory or a directoryless scheme
21
Advantages of Hashing
[Figure: an identifier (key) passes through the hash function h(key), which selects a bucket of the hash table; the buckets shown hold (Bob, 80) (Ben, 70); (Irene, 85); (John, 95) (Jane, 100); (Ken, 75); (Zoe, 80)]

• Inserting, deleting, and searching can take ~O(1) time
• Computing the hash function is designed to take O(1)
• Indexing the corresponding bucket in the table is O(1)
• Searching all slots in a bucket for a key is also O(1)
• The number of slots is independent of the number of pairs stored in the table
23
Terminology in Hashing
• A pair with a key k is stored in a hash table ht
• Buckets & slots
• b buckets in ht
• h(k) is the home bucket of a key k
• s slots per bucket
• Identifier density = n/T
• T possible different keys (|U|)
• n stored pairs in ht
• Loading factor (density) α = n/(sb)
• b is the number of buckets, s is the number of slots per bucket
• Synonyms (collision)
• Two identifiers I1 and I2 are synonyms if h(I1) = h(I2)

[Figure: hash table ht with buckets 0 to b-1 and s slots per bucket, holding (Alice, 100); (Bob, 80) (Ben, 70); (Irene, 85); (John, 95) (Jane, 100); (Ken, 75); (Zoe, 80)]
Collision and Overflow
• Collision
• When two nonidentical identifiers are hashed into the
same bucket
• Overflow
• When a new identifier is mapped or hashed by h into a full
bucket

• Collision exists at location 0 and 7


• Overflow will occur when AA is
hashed into the table (h(AA) = 0)!

25
Efficiency of Hash Table
• Search or insertion time of a hash
table
(1) compute the hash function
(2) search a bucket
• The above times are independent of
n – O(1)
• Collision is inevitable
• Taking the first character as the hash value is a poor choice because it causes too many collisions
• Many variables in a program begin with the same character
• An overflow mechanism is necessary
26
Hash Function

Domain (usually unbounded in size) → Hash Function → Range (usually bounded, [0,b))
The mapping usually has to be many-to-one
It is never one-to-many

27
Hash Function
• Good hash functions reduce the chance of collisions
• Enlarging the hash table can also reduce collisions
• At the cost of memory
• Ideal hash functions have
• Minimal number of collisions
• Uniform distribution of keys for various values
• Easy (fast) to compute
• Maximum usage of the information present in the values (keys)
• Yield different values even for very similar keys

[Figure: 26 buckets indexed by name initial, h(A)=0, h(B)=1, …, h(I)=8, h(J)=9, h(K)=10, …, h(Z)=25, holding (Alice, 100); (Bob, 80) (Ben, 70); (Irene, 85); (John, 95) (Jane, 100); (Ken, 75); (Zoe, 80)]
28
Uniform Hash Function
• Basic desired properties of hash function
• Easy to compute
• Minimize the number of collisions
• Uniform hash function
• A good hash function
• Should also depend on every character of an input
identifier
• Uniform hash function
• Let x be an identifier chosen at random
• Then, the probability that h(x) = i is 1/b for every bucket i
• That is, the hash function does not result in a biased use
of the hash table for random inputs
29
Hash Functions
• Classical examples
• Modulo (division)
• Mid-square
• Folding
• Digit analysis
• String-to-integer conversion

• We can design our own hash functions

30
Hash Function: Modulo
(Division)
• Most widely used hash function in practice
• Procedure
• h(k) = k % D

• Selection of D
• D is the number of buckets
• D should preferably be an odd number
• An even divisor D always maps even keys to even buckets and odd keys to odd buckets
• Real-world data tend to have a bias toward either odd or even keys
• Even better, D should be a prime number or a number with no prime factors smaller than 20
Hash Function: Modulo
(Division)
hash function hD(k) = k % D
where % is the modulo operator; that is, the remainder is used as the hash address

• Hash address
• In the range 0 through D-1, which implies that the table size is D
• D should not be a power of 2
• Otherwise, hD(k) depends only on the least significant bits of k
• e.g., D = 2^3: A1 is encoded as 2^6(10) + 1, so hD(A1) = 1;
XY1 is encoded as 2^12(33) + 2^6(34) + 1, so hD(XY1) = 1
• D is usually a prime number
32
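The power-of-two example above can be checked with a minimal sketch. It assumes the encoding implied by the example (digits 0-9 map to 0..9, letters A-Z map to 10..35, six bits per character); the function names `encode` and `hashMod` are illustrative and do not come from the slides.

```cpp
#include <cstdint>
#include <string>

// Assumed encoding: '0'-'9' -> 0..9, 'A'-'Z' -> 10..35, six bits per character.
std::uint64_t encode(const std::string& id) {
    std::uint64_t value = 0;
    for (char c : id) {
        std::uint64_t code = (c >= '0' && c <= '9') ? c - '0' : c - 'A' + 10;
        value = (value << 6) | code;  // append six bits for this character
    }
    return value;
}

// Division hash: the remainder is the bucket index.
std::uint64_t hashMod(std::uint64_t k, std::uint64_t D) { return k % D; }
```

With D = 8, `hashMod(encode("A1"), 8)` and `hashMod(encode("XY1"), 8)` both give 1, because only the low three bits of the last character survive. With the prime D = 7 the same two keys land in buckets 4 and 5.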
Hash Function: Mid-Square
• h(k) = some middle r bits of the square of k
• The number of buckets is equal to 2^r
• Example (r = 4: the middle four bits of the 8-bit square)
k    k^2   k^2 (binary)   h(k)
0    0     0000 0000      0
1    1     0000 0001      0
2    4     0000 0100      1
3    9     0000 1001      2
4    16    0001 0000      4
5    25    0001 1001      6
6    36    0010 0100      9
7    49    0011 0001      12
8    64    0100 0000      0
9    81    0101 0001      4
10   100   0110 0100      9
11   121   0111 1001      14
12   144   1001 0000      4
13   169   1010 1001      10
14   196   1100 0100      1
15   225   1110 0001      8

33
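The table above can be reproduced with a short sketch. The function name `midSquare` and its `width` parameter (the number of bits the square is viewed in, 8 in the table) are assumptions made for illustration.

```cpp
#include <cstdint>

// Middle r bits of k*k, where the square is viewed as a `width`-bit number.
// Assumes r <= width < 64 and that (width - r) is even, so the window is centered.
std::uint32_t midSquare(std::uint32_t k, unsigned r, unsigned width) {
    std::uint64_t sq = static_cast<std::uint64_t>(k) * k;
    unsigned shift = (width - r) / 2;                  // bits dropped on the right
    std::uint64_t mask = (std::uint64_t{1} << r) - 1;  // keeps the low r bits
    return static_cast<std::uint32_t>((sq >> shift) & mask);
}
```

For example, `midSquare(11, 4, 8)` takes 121 = 0111 1001, drops the two rightmost bits, and keeps 1110, which is 14 as in the table.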
Hash Function: Mid-Square

Character codes (decimal value = octal value):
A: 1 = 1₈
B: 2 = 2₈
...
Z: 26
0: 27 = 33₈
1: 28 = 34₈
2: 29 = 35₈
...
9: 36 = 44₈

• The coding of identifier x is right-justified, zero-filled, and has six bits per character
• Table size will be a power of 2
Hash Function: Folding
• Partition the key into several parts and add them
together
• Two strategies: shift folding and folding at the boundary
• Example
• k = 12320324111220, partitioned as 123 | 203 | 241 | 112 | 20

• Shift folding: add the parts as they are
h(k) = 123 + 203 + 241 + 112 + 20 = 699

• Folding at the boundary: reverse every other part before adding
h(k) = 123 + 302 + 241 + 211 + 20 = 897
35
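Both folding strategies can be sketched in a few lines. The helper `fold` and its parameters are hypothetical; it assumes the key is given as a string of decimal digits and that the parts are small enough for their sum to fit in an unsigned long.

```cpp
#include <algorithm>
#include <string>

// Split a decimal key into parts of `partLen` digits (the last part may be
// shorter) and add them. With `atBoundary` set, every second part is reversed
// before it is added (folding at the boundary); otherwise this is shift folding.
unsigned long fold(const std::string& digits, std::size_t partLen, bool atBoundary) {
    unsigned long sum = 0;
    bool reverseThis = false;  // the first part is never reversed
    for (std::size_t pos = 0; pos < digits.size(); pos += partLen) {
        std::string part = digits.substr(pos, partLen);
        if (atBoundary && reverseThis) std::reverse(part.begin(), part.end());
        sum += std::stoul(part);
        reverseThis = !reverseThis;
    }
    return sum;
}
```

Calling `fold("12320324111220", 3, false)` gives 699 and `fold("12320324111220", 3, true)` gives 897, matching the two results on the slide.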
Hash Function: Digit Analysis
• Application
• when all the identifiers are known in advance
• Procedure
• Step 1: Each identifier is interpreted as a number using radix r
• Step 2: Analyze the distribution of each digit
• Step 3: Drop biased digits
• The digits with the most skewed distributions are deleted one by one until the number of remaining digits is small enough to give an address
• Example
• Given three identifiers in radix-10 form: 891, 792, 793
• Digit distribution: 1st {8, 7, 7}, 2nd {9, 9, 9}, 3rd {1, 2, 3}
• The most skewed digit: the 2nd (it is always 9), so it is dropped first
36
String-to-Integer Conversion
• Useful when keys are strings
• Procedure
• Treat every group of n characters as an 8n-bit integer
• ASCII represents a character using 8 bits

• Add all integers together to obtain the overall value


• Adopt the aforementioned hash functions (modulo, folding…)
37
Converting a String to an
Integer
unsigned int StringToInt(string s)
{ // Convert s into a nonnegative integer that depends on all characters of s
    int length = (int) s.length(); // number of characters in s
    unsigned int answer = 0;
    if (length % 2 == 1) { // length is odd
        answer = s.at(length - 1);
        length--;
    }
    // length is now even
    for (int i = 0; i < length; i += 2) { // process two characters at a time
        answer += s.at(i);
        answer += ((int) s.at(i + 1)) << 8;
    }
    return answer;
}
38
C++ STL template class
hash<T>
• C++ STL provides specializations of the STL template
class hash<T> that transform instances of type T
into a nonnegative integer of type size_t.
template<>
class hash<string> {
public:
    size_t operator()(const string theKey) const
    { // Convert theKey to a nonnegative integer
        unsigned long hashValue = 0;
        int length = (int) theKey.length();
        for (int i = 0; i < length; i++)
            hashValue = 5 * hashValue + theKey.at(i);
        return size_t(hashValue);
    }
};
39
Design Our Own Hash
• Recall that
• Hash function is any deterministic function that can map
data of arbitrary size (original keys) to data of a desired
fixed size (hashed keys)
• So of course we can design a hash like this

Original Keys → + Trump's Birthday → Mid-Square → Folding → Modulo → Hashed Keys

• Consideration:
• We need to argue the advantages of our hash compared
with the commonly used ones
40
Secure Hash Functions
• Example
• SHA-1
• SHA256
• MD5

• Usage
• Message authentication
• Password store (passwords can be verified without being exposed)
• Digital signature (prevents tampering)
• Digital currency
• …

41
Message Authentication
• A message M to be transmitted over an insecure channel
from A to B.
• Assume we have a means to transmit messages much
smaller than M securely, e.g., encrypt smaller message
or transmit the smaller message on a more expensive
but far more secure channel.
• Send h(M) using more secure method. Send M over
insecure channel.
• At B, M' and h(M) are received; B computes h(M') and compares it with h(M) to decide whether the received M' is correct.
• It should be difficult for a malicious user who knows h and M to find a synonym of M (the weak collision resistance property).
Secure Hash Function Properties
• Desired secure hash function properties:
• Weak collision resistance:
• It is difficult for a malicious user who has knowledge of h & M to
determine a synonym of M
• One-way property:
• For a given c, it is computationally difficult to find a k such that h(k) = c
(inverse hashing problem)
• Strong collision resistance:
• It is computationally difficult to find a pair (x,y) such that h(x) = h(y)
• Several cryptographic hash functions with these
properties have been developed. They also have
additional properties:
• h can be applied to a block of data of any size
• h produces a fixed-length hash code
• h(k) is relatively easy to compute for any given k
43
Secure Hash Algorithm (SHA)
• Developed at the National Institute of Standards &
Technology (NIST) in USA.
• SHA-1 function:
• Input: any message of length < 2^64 bits
• Output: a 160-bit code
Step 1: Preprocess the message so that its length is q*512 bits for some integer q.
The preprocessing may entail adding a string of zeros to the end of the message.
Step 2: Initialize the 160-bit output buffer OB, which comprises five 32-bit registers
A, B, C, D, and E, with A = 0x67452301, B = 0xefcdab89, C = 0x98badcfe,
D = 0x10325476, E = 0xc3d2e1f0.
Step 3: for (int i = 1; i <= q; i++) {
Let Bi = ith block of 512 bits of the message;
OB = F (OB, Bi); //function F consists of 4 rounds of 20 atomic steps
}
Step 4: Output OB.
44
Atomic SHA Operation
[Figure: one atomic SHA step maps the registers A, B, C, D, E to their new values, using ft, S5, S30, Wt, Kt, and 32-bit additions (+)]
t = Step number, 0≤t≤79
ft(B,C,D) = Primitive (bitwise) logical function for step t; e.g., (BΛC)V(BΛD)
Sk = Circular left shift of the 32-bit register by k bits
Wt = A 32-bit value derived from Bi
Kt = A constant
Security Hash: MD5 Example
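[Figure: MD5 starts from the initial values 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476; each step combines F(X, Y, Z), the input (32 bits at a time), a sine-table constant, an addition (+), and a rotation to produce the output hash]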
46
Security Hash: MD5 Example
• https://www.md5hashgenerator.com/

• Example
• "NTHU" → 8191af722cfd2890b7a9e986003a6439
• "NTHU1" → 4c97870289739e75576a4cbeb6222e25
• "資料結構" → 9aa8415c41bb436d4b6e5618a7be6360

• Deterministic results
• Everyone gets exactly the same results, although the results look like random numbers
Security Hash: MD5 Example
• Important properties
• Hard to find the inverse
• "祝大家期末考順利 ? !" ("Good luck on the final exam, everyone!") → 21cee26fd407729a1c740105891e3fca
• Easy to verify
• Hard to find another meaningful input that has the same hash (synonyms)

48
Usage: Password Store

username: password
Mary: abcabc
John: edp903d
Bob: HE3dpq7

49
Usage: Password Store

[Figure: at login, only hashed passwords are stored, so someone who reads the file cannot tell the passwords]
username: hashed_password
Mary: 5abb7fa…
John: 8e2c1f8…
Bob: ed6e5ad…

50
Usage: Digital Signature
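• A prize quiz: everyone is asked to solve a data structures problem, and the first 3 to solve it win prizes!
• Scenario 1
• Someone claims "I solved it" first and then takes their time solving it, so the true order is hard to determine
• Scenario 2
• The first-place finisher publishes the answer, and the second and third both say "that is what I was thinking too"
• Scenario 3
• Everyone hands their answer to a referee, who then checks who answered correctly first. But the referee's fairness cannot be verified.

• A secure hash can help solve this problem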

51
Usage: Digital Currency
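• Digital currency (e.g., Bitcoin)
• Decentralized
• The ledger is kept by the currency's users rather than by a government or a bank
• Users who keep the ledger receive some reward

• Problems
• How to decide who gets the right to record the ledger, without letting particular users monopolize it?
• How to prevent the ledger from being tampered with?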

52
Usage: Digital Currency
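• Method
• Require all users to work on an inverse hash problem together
• Replace x with digits so that the first N digits of the MD5 of the string "A 轉給 B 十元 (xxxxxxxx)" ("A transfers 10 dollars to B") are 0
• A 轉給 B 十元 (00000000) → 4299bf747…
• A 轉給 B 十元 (00000001) → 2c81a694d…
• A 轉給 B 十元 (00000002) → 551741357…
• Solving the problem is called mining; the difficulty is tuned so that a solution is found about once every 10 minutes.
• Mining is profitable, so many people take part
• Monopolizing the right to record the ledger would require a huge number of computers, which does not pay off
• Each ledger block depends on the block before it (blockchain)
• Tampering with an already established block would require a huge number of computers to redo, within a short time, the inverse hash problems for every block after the tampered one, which does not pay off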
53
Overflow Handling
• Problem
• When a new identifier is hashed into a full
bucket, then we need to find another open
bucket
• Methods
• Open addressing -- find the closest bucket that is
not full
• Linear probing
• Quadratic probing
• Rehashing
• Chaining
• Implement each bucket as a linked list
54
Linear Open Addressing
• Procedure of searching an identifier x
• Step 1: compute h(x)
• Step 2: examine identifiers at positions ht[h(x)], ht[h(x)+1], ..., ht[h(x)+j] (indices taken modulo b) in this order until one of the following happens:
(a) ht[h(x)+j] = x; in this case x is found
(b) ht[h(x)+j] is null; x is not in the table
(c) We return to the starting position h(x); the table is full and x is not in the table

55
Program 8.4 Linear Probing
template <class K, class E>
pair<K, E>* LinearProbing<K, E>::Get(const K& k)
{ // Search the linear probing hash table ht (each bucket has exactly one
  // slot) for k. If a pair with this key is found, return a pointer to this pair;
  // otherwise, return 0.
    int i = h(k); // home bucket
    int j;
    for (j = i; ht[j] && ht[j]->first != k;) {
        j = (j + 1) % b; // treat the table as circular
        if (j == i) return 0; // back to the starting point
    }
    if (ht[j] && ht[j]->first == k) return ht[j]; // ht[j] may be an empty bucket
    return 0;
}
56
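Program 8.4 covers only the search. A self-contained sketch of the matching insertion is given here under simplifying assumptions that are not in the slides: non-negative integer keys, string elements, one slot per bucket, h(k) = k % b with b > 0, and a hypothetical class name `LinearProbingTable`. It requires C++17 for `std::optional`.

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical table: integer keys, one slot per bucket, h(k) = k % b.
class LinearProbingTable {
public:
    explicit LinearProbingTable(std::size_t b) : ht(b) {}

    // Returns false when the table is full; a duplicate key updates its element.
    bool insert(int key, const std::string& element) {
        std::size_t i = home(key), j = i;
        do {
            if (!ht[j] || ht[j]->first == key) {
                ht[j] = std::make_pair(key, element);
                return true;
            }
            j = (j + 1) % ht.size();  // treat the table as circular
        } while (j != i);
        return false;
    }

    // Returns a pointer to the stored pair, or nullptr if the key is absent.
    const std::pair<int, std::string>* get(int key) const {
        std::size_t i = home(key), j = i;
        do {
            if (!ht[j]) return nullptr;  // an empty bucket ends the probe
            if (ht[j]->first == key) return &*ht[j];
            j = (j + 1) % ht.size();
        } while (j != i);
        return nullptr;  // wrapped around: table is full and key is absent
    }

private:
    std::size_t home(int key) const { return static_cast<std::size_t>(key) % ht.size(); }
    std::vector<std::optional<std::pair<int, std::string>>> ht;
};
```

In a table with b = 5, the keys 3, 8, and 13 all have home bucket 3, so they end up in buckets 3, 4, and 0. A later search for 13 has to step over two pairs that merely share its home bucket, which is the clustering cost of linear probing.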
Problem of Linear Open Addressing
• Identifiers tend to cluster together
• Increase the search time
• Could be worse than the search tree structure
• An analysis shows that
• It takes approximately (2 − α)/(2 − 2α) identifier comparisons on average to look up an identifier, where α is the loading density (n/(sb))
• Quadratic probing
• Reduces the clustering problem
• Check sequence: (h(k) + i^2) % b and (h(k) − i^2) % b, for i = 1, 2, ...
57
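The two formulas above can be evaluated with a small sketch. The function names are illustrative, and the quadratic probe order (home, +1, −1, +4, −4, ...) is one reasonable reading of the check sequence on the slide; the home bucket is assumed to be non-negative.

```cpp
#include <cstddef>
#include <vector>

// Approximate average number of identifier comparisons needed to look up an
// identifier under linear probing, at loading density alpha (0 <= alpha < 1).
double linearProbingComparisons(double alpha) {
    return (2.0 - alpha) / (2.0 - 2.0 * alpha);
}

// First `count` buckets examined by quadratic probing for home bucket `home`
// in a table of b buckets: home, home+1, home-1, home+4, home-4, ...
std::vector<long> quadraticProbes(long home, long b, std::size_t count) {
    std::vector<long> seq;
    if (count == 0) return seq;
    seq.push_back(home % b);
    for (long i = 1; seq.size() < count; ++i) {
        seq.push_back((home + i * i) % b);
        if (seq.size() < count)
            seq.push_back(((home - i * i) % b + b) % b);  // keep the index non-negative
    }
    return seq;
}
```

At α = 0.5 the estimate is 1.5 comparisons, and it grows without bound as α approaches 1. The extra `+ b` in the backward probe matters because `%` in C++ can return a negative value for a negative left operand.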
Rehashing
• Another way to control the growth of clusters is to
use a series of hash functions h1, h2, …, hm. This is
called rehashing.
• Buckets hi(k), 1 ≤ i ≤ m are examined in that order.
• Double hashing:
• If h(k) is occupied, then we iteratively try buckets (h(k) + j*h'(k)) % b for j = 1, 2, …

58
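A minimal sketch of the double hashing probe sequence follows. The two hash functions used here (h(k) = k % b and h'(k) = 1 + k % (b − 1)) are a common textbook choice assumed for illustration; the slides do not specify them. The table size b is assumed to be at least 2.

```cpp
#include <vector>

// Buckets tried by double hashing: (h + j*h2) % b for j = 0, 1, 2, ...
// h2 is never zero; choosing b prime makes every bucket reachable.
std::vector<unsigned> doubleHashProbes(unsigned k, unsigned b, unsigned count) {
    unsigned h  = k % b;            // assumed primary hash
    unsigned h2 = 1 + k % (b - 1);  // assumed secondary hash, always in [1, b-1]
    std::vector<unsigned> seq;
    for (unsigned j = 0; j < count; ++j)
        seq.push_back((h + j * h2) % b);
    return seq;
}
```

For k = 10 and b = 7 the home bucket is 3 and the step is 5, so the first four probes are 3, 1, 6, 4. Two keys with the same home bucket usually get different steps, which is what breaks up the clusters.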
Chaining
• Linear probing performs poorly because the search
for an identifier involves comparisons with
identifiers that have different hash values.
• e.g., a search for ZA involves comparisons with the buckets ht[0] – ht[7], which cannot possibly collide with ZA.
• Unnecessary comparisons can be avoided if all the synonyms are put in the same list, with one list per bucket.
• Each chain has a head node. Head nodes are stored
sequentially.
59
Chain-Based Hash Table
• Each bucket is a chain
• Chain nodes are typically unordered
• We typically expect the hash function to spread records uniformly enough
• Thus each chain does not contain too many nodes
• Linearly traversing a chain is required for inserting, finding, and removing a key

[Figure: chains by name initial; each node holds data and a link, and each chain ends in a null link (0)]
A: (Alice, 100)
B: (Bob, 80) (Ben, 70)
I: (Irene, 85)
J: (John, 95) (Jane, 100)
K: (Ken, 75) (Kevin, 70)
L: (Linda, 90)
M: (Mary, 85)
Z: (Zoe, 80)
60
Chaining
ht data link
[0] A4 A3 A1 A2 A 0
[1] 0
[2] 0
[3] D 0
[4] E 0
[5] 0
[6] G GA 0

[7] 0
[8] 0
[9] 0
[10] 0
[11] L 0


[25] ZA Z 0

Hash table with 26 buckets
61
Program 8.5 Chain Search
template <class K, class E>
pair<K, E>* Chaining<K, E>::Get(const K& k)
{ // Search the chained hash table ht for k. If a pair with this key is found,
  // return a pointer to this pair; otherwise, return 0.
    int i = h(k); // home bucket
    // search the chain ht[i]
    for (ChainNode<pair<K, E>>* current = ht[i]; current;
         current = current->link)
        if (current->data.first == k) return &current->data;
    return 0;
}

i = h(“B3”) = 1

62
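Program 8.5 shows only the search. The sketch here adds insertion, using standard containers in place of the `ChainNode` type. The class name `ChainedTable`, the first-letter hash (A-Z to 0..25, mirroring the 26-bucket figure), and the integer element type are assumptions for illustration; keys are assumed to start with an uppercase letter.

```cpp
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical chained table with 26 buckets, one chain per first letter.
class ChainedTable {
public:
    ChainedTable() : ht(26) {}

    // Inserts a pair; a duplicate key updates its element.
    void insert(const std::string& key, int element) {
        auto& chain = ht[home(key)];
        for (auto& node : chain)
            if (node.first == key) { node.second = element; return; }
        chain.emplace_front(key, element);  // new keys go at the chain head
    }

    // Returns a pointer to the element, or nullptr if the key is absent.
    const int* get(const std::string& key) const {
        for (const auto& node : ht[home(key)])
            if (node.first == key) return &node.second;
        return nullptr;
    }

private:
    static std::size_t home(const std::string& key) {
        return static_cast<std::size_t>(key.at(0) - 'A');
    }
    std::vector<std::list<std::pair<std::string, int>>> ht;
};
```

A search for "B3" only walks the chain of bucket 1, so it never compares against the A-keys or Z-keys, unlike linear probing.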
A Comparison
• Hash Function
• Division is generally superior to the other types
• Collision handling
• Chaining outperforms linear open addressing

63
Chaining vs. Open Addressing
Chaining | Open Addressing
Elements can be stored outside the table. | Elements must be stored inside the table only.
At any time, the number of elements in the hash table may be greater than the size of the table. | The number of elements in the hash table cannot exceed the number of indices in the table.
When deletion is required, chaining is the best method. | When deletion is not required and only inserting and searching are needed, open addressing is better.
Chaining requires more space. | Open addressing requires less space than chaining.

64
8.3 Dynamic Hashing
• In static hashing, if the hash table is allocated too small, then when the data exceeds its capacity (the loading density exceeds a threshold), the entire table must be extended (its size increased) and restructured:
e.g., b buckets with divisor D = b → 2b + 1 buckets with new divisor D = 2b + 1 → the hash function changes → the whole hash table must be rebuilt from scratch (not just copied), a time-consuming process.
• For a very large dictionary that is accessed 24/7, this means an unacceptable interruption of service while the hash table is rebuilt.
66
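Why a copy is not enough can be seen by counting how many keys change their home bucket when the divisor grows. This is a minimal sketch assuming unsigned integer keys and the division hash k % D; the function name `movedKeys` is illustrative.

```cpp
#include <cstddef>
#include <vector>

// Counts the keys whose home bucket changes when the divisor grows from b to 2b+1.
std::size_t movedKeys(const std::vector<unsigned>& keys, unsigned b) {
    std::size_t moved = 0;
    for (unsigned k : keys)
        if (k % b != k % (2 * b + 1)) ++moved;
    return moved;
}
```

For the keys 3, 10, 17, 24 and b = 7, all four share home bucket 3 under D = 7, but under D = 15 they map to 3, 10, 2, and 9, so three of the four keys must be relocated.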
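Hash Table Size: b → 2b+1
hashed key hD(k) = bucket index
hD(k) ∈ [0, …, b-1], D = b

[Figure: key k is mapped by hD(k) into a table of b buckets holding (k1,e1) and (k2,e2); after the table grows to 2b+1 buckets, hD'(k) places (k1,e1) and (k2,e2) in different buckets]

D = b → D' = 2b+1

The hash table must be rebuilt from scratch (not a simple copy)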
67
Dynamic Hashing
• Dynamic hashing, or extendible hashing:
• Aims to reduce rebuild time by ensuring that each
rebuild changes the home bucket for the entries in
only one bucket, so that it can provide acceptable
hash table performance on a per operation basis
• Retain the fast retrieval time while extending --
dynamically increasing and decreasing table size with
very little penalty.
• Two forms of Dynamic hashing
• Dynamic hashing using directory
• Directoryless dynamic hashing

68
Dynamic Hashing Using
Directory
• Assume a file F is a collection of records R.
Each record has a key field K
Records are stored in pages or buckets whose capacity
is 2p
• Dynamic hashing using a directory first maps key k to a directory d by using h(k,p), the p least significant bits of h(k); the directory contains pointers to 2^p buckets
• On overflow, p is increased, so the directory grows, and new buckets are added dynamically with minimal bucket restructuring.
• With the directory in memory and the records on disk, the aim is to minimize page accesses
69
Dynamic Hashing Using
Directory
• The hashed key h(k) is sufficiently large (many bits)
• h(k,p) = the p least significant bits of h(k) → directory d → pointer to bucket
• The directory can double when overflow occurs

[Figure: k → h(k) = xxxx xxxx xxxx xxxx → h(k,p) → directory d → buckets holding (k1,e1) (k2,e2) and (k3,e3); new buckets are added to the hash table dynamically]
Dynamic Hashing Using
Directories
• Given a partial list of identifiers (keys) as follows:
k    h(k)
A0   100 000
A1   100 001
B0   101 000
B1   101 001
C1   110 001
C2   110 010
C3   110 011
C5   110 101

Encoding: A → 100, B → 101, C → 110, ...; 0 → 000, 1 → 001, 2 → 010, 3 → 011, 4 → 100, ...
h(k) has 6 bits
Take the p least significant bits → directory size = 2^p → directory depth = p
71
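The encoding in the table can be written down directly. This is a sketch under the assumption that keys are exactly one letter (A, B, C, ...) followed by one digit 0-7, as in the list above; the two overloads named `h` mirror the slide notation h(k) and h(k,p).

```cpp
#include <string>

// h(k) for two-character keys as in the table: the letter maps to
// A -> 100, B -> 101, C -> 110 (binary) and the digit to its 3-bit value.
unsigned h(const std::string& key) {
    unsigned letter = 4u + static_cast<unsigned>(key.at(0) - 'A');  // A=4, B=5, C=6
    unsigned digit  = static_cast<unsigned>(key.at(1) - '0');       // 0..7
    return (letter << 3) | digit;
}

// h(k, p): the p least significant bits of h(k).
unsigned h(const std::string& key, unsigned p) {
    return h(key) & ((1u << p) - 1u);
}
```

For example, `h("C5")` is 110 101 in binary (53), `h("C5", 2)` is 01, and `h("C5", 3)` is 101, so C5 shares a depth-2 directory entry with A1 and B1 but not a depth-3 entry.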
Starting with p = 2, 6 pairs in Buckets
• Assume dictionary already contains 6 keys A0, A1, B0, B1,
C2, C3 → p = 2 = directory depth as shown below:
h(k,2)   bucket contents
00       A0, B0
01       A1, B1
10       C2
11       C3

• Insert C5 into the dictionary: h(C5,2) = 01 → overflow


72
When Overflow
• Insert C5, h(C5,2) = 01: bucket 01 overflows
• Find the least u such that the values h(k,u) of the newly inserted key and the keys in the overflowed bucket are not all the same → u = 3 in this case
• Double the size of the directory. Copy the original pointers in the upper half into the new lower half
• Restructure by splitting only the home (overflowed) bucket:
h(A1,3) = 001, h(B1,3) = 001: A1 and B1 remain in the same old bucket
h(C5,3) = 101: only C5 differs → add a new bucket 101 for C5
73
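The "find a least u" step can be sketched as a loop over bit depths. The function name and the `maxBits` bound are assumptions; the inputs are the already hashed values h(k) of the new key and of the keys in the overflowed bucket, and `maxBits` is assumed to be below 32.

```cpp
#include <vector>

// Smallest u for which the u low-order bits of the hashed keys are not all
// identical; returns 0 if no u up to maxBits separates them.
unsigned leastSeparatingDepth(const std::vector<unsigned>& hashed, unsigned maxBits) {
    for (unsigned u = 1; u <= maxBits; ++u) {
        unsigned mask = (1u << u) - 1u;
        for (unsigned hk : hashed)
            if ((hk & mask) != (hashed.front() & mask)) return u;
    }
    return 0;
}
```

With h(A1) = 100 001 (33), h(B1) = 101 001 (41), and h(C5) = 110 101 (53), the result is 3. Replacing C5 with C1 = 110 001 (49) gives 4, because the three keys agree on their three low-order bits.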
• Next, insert C1 into the dictionary: h(C1,3) = 001 → overflow
74
When Overflow
• Insert C1, h(C1,3) = 001: bucket 001 overflows
• Find the least u such that the values h(k,u) of the newly inserted key and the keys in the overflowed bucket are not all the same → u = 4 in this case
• Double the size of the directory. Copy the original pointers in the upper half into the new lower half
• Restructure by splitting only the home (overflowed) bucket:
h(A1,4) = 0001, h(B1,4) = 1001, h(C1,4) = 0001
→ add a new bucket 1001 for B1; insert C1 into bucket 0001
75
Following the doubling of the directory, split the overflowed bucket 0001 by h(k,4):
A1, C1 → bucket 0001
B1 → bucket 1001 (new bucket)

[Figure: a new bucket is added and B1 is moved into it]

76
Dynamic Hashing Using
Directory
• Deletion from a dynamic hash table with a directory is
similar to insertion.
• Although dynamic hashing employs array doubling, the
time for this array doubling is considerably less than that
for the array doubling used in static hashing.
This is because in dynamic hashing, we need to rehash
only the entries in the overflow bucket rather than all
entries in the table.
• Further, savings result when the directory resides in
memory while the buckets are on disk.
A search requires only 1 disk access; an insert makes 1 read and 2 write accesses to the disk; the array doubling requires no disk access.
Directoryless Dynamic Hashing
• Dispense with the directory, d, of bucket pointers.
• An array, ht, of buckets is used.
• Assume that this array is as large as possible and there
is no possibility of increasing its size dynamically.
• To avoid initializing such a large array, use two variables q and r, 0 ≤ q < 2^r, to keep track of the active buckets.
• At any time, only buckets [0, 2^r + q − 1] are active. New active buckets can be added when needed by changing q and/or r (q++; if q = 2^r then q = 0, r++)
• Each active bucket is the start of a chain of buckets. The remaining buckets on a chain are called overflow buckets.
78
Index Rules for Active Buckets
• The active buckets 0 through q − 1 as well as the active buckets 2^r through 2^r + q − 1 are indexed using h(k,r+1)
• The remaining active buckets q through 2^r − 1 are indexed using h(k,r)
• Each active bucket is the start of a chain of buckets. The remaining buckets on a chain are called overflow buckets.

[Figure: array ht with boundaries at 0, q, 2^r, and 2^r + q; buckets [0, q) and [2^r, 2^r + q) are indexed by h(k,r+1), buckets [q, 2^r) by h(k,r); these are the active buckets, and chains may continue into overflow buckets]
Searching Directoryless Hash Table
• To search for key k, we need to compute h(k,r)

Program 8.6: Searching a directoryless hash table

if (h(k,r) < q)
    search the chain that begins at bucket h(k,r+1);
else
    search the chain that begins at bucket h(k,r);

80
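Program 8.6 translates almost directly into code once h(k,r) is taken to be the r low-order bits of the hashed key. The function name `activeBucket` is an assumption, and the argument `hk` stands for the full hash value h(k).

```cpp
// Index of the chain to search for a key with hash value hk, given the
// current state (r, q) of a directoryless table.
unsigned activeBucket(unsigned hk, unsigned r, unsigned q) {
    unsigned low = hk & ((1u << r) - 1u);             // h(k, r)
    if (low < q) return hk & ((1u << (r + 1)) - 1u);  // h(k, r+1)
    return low;
}
```

With r = 2 and q = 0, a key hashed to 110 101 (C5) is searched in chain 01. With r = 2 and q = 2, a key hashed to 101 101 (B5) has h(k,2) = 01 < q, so h(k,3) = 101 is used and chain 5 is searched.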
Overflow Handling
• To insert a new key-element pair, first search to determine whether the key is already in ht.
• If the search finishes without finding the key and the active bucket for the searched chain is full, we get an overflow.
• An overflow is handled by
1. Activating bucket 2^r + q (adding a new active bucket)
2. Reallocating (rehashing) the entries in chain q between chain q and the newly activated bucket (or chain) 2^r + q, using h(k,r+1).
3. q = q + 1; if (q == 2^r) {q = 0; r = r + 1;}
4. Finally, inserting the new key-element pair into the chain where the search algorithm would look for it, using the new r and q values:
if h(k,r) < q use h(k,r+1), else use h(k,r)
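Step 3 of the overflow handling is a small state update. The sketch uses a hypothetical `State` struct and helper names to show how q and r advance and how many buckets are active afterwards.

```cpp
// Hypothetical state of a directoryless table: 0 <= q < 2^r.
struct State {
    unsigned r;
    unsigned q;
};

// Step 3 of overflow handling, applied after bucket 2^r + q has been activated.
State advance(State s) {
    ++s.q;
    if (s.q == (1u << s.r)) {  // every bucket of this round has been split
        s.q = 0;
        ++s.r;
    }
    return s;
}

// Number of active buckets: 2^r + q.
unsigned activeCount(State s) { return (1u << s.r) + s.q; }
```

Starting from r = 2, q = 0 (4 active buckets), four overflows in a row lead to r = 3, q = 0 with 8 active buckets, so the table doubles one bucket at a time instead of all at once.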
• To start, assume there are 4 (= 2^r + q) bucket chains (r = 2, q = 0); each chain begins at one of the 4 active buckets and comprises only that active bucket (i.e., there are no overflow buckets). Each bucket has 2 slots.
• To insert k = C5, first determine whether C5 is in ht by using the search algorithm. If not, check for overflow.
(a) r = 2, q = 0; active buckets = 2^r + q = 4
1. Search C5: h(C5,2) = 01 → h(C5,2) > q (= 0), so use h(C5,2) = 01 and search that chain
2. C5 not found; check whether the active bucket for the searched chain is full
3. Yes → overflow → handle overflow

k    h(k)
A0   100 000
A1   100 001
B0   101 000
B1   101 001
C1   110 001
C2   110 010
C3   110 011
C5   110 101
• Insert k = C5: to search for k = C5, first compute h(k,r)
• h(C5,2) = 01 ≥ q (0) → examine bucket 01 using h(k,r)
→ C5 not found and bucket full → overflow
[Figure: (a) r = 2, q = 0, insert C5; (b) after inserting C5, r = 2, q = 1. Buckets below q and the new bucket are indexed by h(k,r+1), the others by h(k,r).]
• Overflow handling
1. Activate bucket 2^r + q = 100
2. Reallocate A0, B4 in chain q (i.e., chain 0) using h(k,r+1):
h(A0,r+1) = 000, h(B4,r+1) = 100
→ A0 not moved, B4 → bucket 100
3. q = q + 1 = 1, r = 2 unchanged
4. Insert C5: h(C5,r) = 01 ≥ q = 1 → use h(C5,2) = 01;
C5 is added to chain 01 using an overflow bucket
• Insert k = C1: to search for k = C1, first compute h(k,r)
• h(C1,2) = 01 = q (1) → examine bucket 01 using h(k,r)
→ C1 not found and bucket full → overflow
(b) r = 2, q = 1
• Overflow handling
1. Activate bucket 2^r + q = 101
2. Reallocate (rehash) A1, B5, C5 in chain q (i.e.,
chain 1) using h(k,r+1):
h(A1,r+1) = 001, h(B5,r+1) = 101, h(C5,r+1) = 101
→ A1 not moved, B5, C5 → bucket 101
3. q = q + 1 = 2 < 2^r, r = 2 unchanged
4. Insert C1: h(C1,r) = 01 < q = 2 → use h(C1,r+1) = 001;
C1 is added to bucket 001
[Figure: (b) r = 2, q = 1, insert C1; (c) r = 2, q = 2. Chain q is relocated using h(k,r+1); buckets below q and the newly activated buckets are indexed by h(k,r+1), the remaining buckets by h(k,r).]
Outline
• 8.1 Introduction
• 8.2 Static hashing
• 8.3 Dynamic hashing
• 8.4 Bloom filters

Bloom Filter Concepts
• Proposed by Burton Howard Bloom in 1970
• A Bloom filter is a data structure designed to tell
you, rapidly and memory-efficiently, whether an
element is present in a set.
• It is a probabilistic data structure that uses the
concept of hashing extensively.
• The price paid for this efficiency is that the answer
is probabilistic: the filter tells us that
the element either definitely is not in the set (no
false negatives) or may be in the set (false
positives are possible).
Bloom Filter Concepts
• Consider matching two strings when checking passwords
against a database.
• For security reasons, passwords are hashed and then stored in the
database. The hashed values are very long strings,
usually 70+ characters.
• In such cases, when two strings need to be compared
character by character, string matching algorithms take O(n).
• A Bloom filter, on the other hand, answers a membership query with
k hash computations and k bit lookups, which is O(1) for a fixed k.
• What is the additional advantage of using a Bloom filter?
• Bloom filters reduce the number of calls made to resources
such as servers and databases by quickly eliminating inputs
that cannot match any stored value.
Bloom Filter Concepts
Property                                             Traditional set data structures, e.g., a BST   Bloom filters
False positive (could be wrong when it says "Yes")   X                                              √ (drawback)
False negative (could be wrong when it says "No")    X                                              X
Easy insertion                                       √                                              √
Easy deletion                                        √                                              X (drawback)
Memory space efficiency                              Low                                            High (advantage)
Applications of Bloom Filter
• Authentication:
• Bloom filters can check for passwords and reject all of
the wrong passwords entered, thus reducing the load on
the main database servers.
• Authorization:
• Authorization is the process of giving access to users of a
website based on the level of authority a user possesses.
• The admin can access the entire website and make
changes, whereas a common user can view the website
only in read-only mode. Bloom filters are therefore a viable
way for large websites, such as e-commerce sites,
to quickly reject unauthorized access attempts.

Applications of Bloom Filter
• One-search (one-hit) wonders:
• Search engines keep track of search phrases and do
not cache a phrase until it has been searched for
repeatedly.
• We can do a little experiment to check this out.
• Open your incognito tab and go to any search engine you’d like.
Type a query related to Python, for example, “lists in python”.
The next time you type lists, it will still show results that are not
specific to Python.
• A couple of searches related to Python will lead to all search
results being directed towards Python.
• After a couple of searches, you will observe that just typing
dictionary will lead you to dictionaries in Python.
• Search engines keep track of the URLs and enable caching of
the URLs upon multiple accesses.
Grocery Shop Example
• Suppose we own a grocery
shop
• Customers occasionally ask
for an item whose availability
we are not sure of
• We may spend significant time
looking for an item before
realizing that it is
unavailable
• This time can be
saved if we know up front that the
item is definitely not
available.
Grocery Shop Example
• A Bloom filter can help
• Determine the availability of
a requested item
• Some false positives are
acceptable
• i.e., the data structure says
that an item is available when
in fact it is not
• No false negatives
• We do not want to
mistakenly turn down a
customer's request
Bloom Filter
• A Bloom filter is an exciting application of hash
tables, used to check for membership of elements in
a set
• Components of a Bloom filter
• An m-bit vector, initially filled with 0s
• Multiple (k) hash functions (k < m)
• The hash functions used in a Bloom filter should
be independent and uniformly distributed.
• They should also be as fast as possible (cryptographic hashes
such as SHA1, though widely used, are therefore not very good
choices).
• And n elements in the set
• The false positive rate will be approximately $(1 - e^{-kn/m})^k$
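
The components listed above can be put together in a short Python sketch. It is an illustration under stated assumptions rather than a reference implementation: the k bit positions are derived from a single SHA-256 digest by double hashing (position i is h1 + i·h2 mod m), which is chosen here only because it is deterministic and available in the standard library, even though a faster non-cryptographic hash would normally be preferred. The class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    """m-bit vector plus k derived hash positions per item."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the stride non-zero
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not in the set"; True means "may be in the set".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.add("Coke")
print(bf.might_contain("Coke"))  # an added item is never reported as absent
```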
Bloom Filter
• Example
• A table with 26 entries, A ~ Z,
initially 0
• Three hash functions for a string
• First character
• Second character
• Third character
[Figure: 26-entry bit vector of available items, indexed A to Z, all bits 0]

Bloom Filter
• Example
• Register string "Coke" into the
Bloom filter to indicate that our
grocery sells Coke
• Set the bit vector according to the
three hash values, C, O, and K
"Coke": h1 → "C", h2 → "O", h3 → "K"
[Figure: bit vector of available items with bits C, K, and O set to 1]

Bloom Filter
• A simple test
• If a customer requests "Coke"
afterward
• The bit vector is examined according to the
three hash values
• The Bloom filter determines that Coke is
available because the corresponding
bits have been set
"Coke": h1 → "C", h2 → "O", h3 → "K"
[Figure: bit vector of available items with bits C, K, and O set to 1]

Bloom Filter
• A simple test
• If a customer requests "Orange
juice" afterward
• The Bloom filter determines that Orange
juice is unavailable because at least
one corresponding bit is not set
"Orange juice": h1 → "O", h2 → "R", h3 → "A"
[Figure: bit vector of available items with bits C, K, and O set to 1]

Bloom Filter
• A simple test
• If a customer requests "Tea"
afterward
• The Bloom filter determines that Tea is
unavailable because at least one
corresponding bit is not set
"Tea": h1 → "T", h2 → "E", h3 → "A"
[Figure: bit vector of available items with bits C, K, and O set to 1]

Bloom Filter
• We register more strings into the
Bloom filter (add more elements to the
set), and the bit vector fills up
"Fanta" → F A N
"Sprite" → S P R
"Vitali" → V I T
[Figure: bit vector of available items with bits A, C, F, I, K, N, O, P, R, S, T, and V set to 1]

Bloom Filter
• Test again
• The Bloom filter still works
"Coke" → C O K
"Tea" → T E A
"Fanta" → F A N
[Figure: bit vector with bits A, C, F, I, K, N, O, P, R, S, T, and V set to 1]
Advantages
• Simple & fast & memory efficient
• Bloom filters filter out the majority of
true negatives and therefore enable
the design of efficient systems.
• "Coca Cola", "Fanta", "Sprite", and "Vitali" are represented with only 26 bits
[Figure: bit vector of available items with bits A, C, F, I, K, N, O, P, R, S, T, and V set to 1]

Disadvantages
• A Bloom filter exhibits false positives
• When the Bloom filter says "yes", it is not
100% true
• "Coffee" is a false positive in our
example (so is "Orange juice")
• But when the Bloom filter says "no", it
is always true
"Coffee" → C O F
Our grocery does not actually sell coffee!
[Figure: bit vector with bits A, C, F, I, K, N, O, P, R, S, T, and V set to 1]
• The probability of error goes up as the bit
array gets filled up.
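
The grocery example can be replayed with a few lines of Python. The sketch below assumes the three toy hash functions are simply the first, second, and third letters of the item name mapped to positions 0 to 25; the function names are hypothetical and chosen for illustration only.

```python
def letter_positions(item):
    """Three toy hash functions: first, second and third letter (A=0 .. Z=25)."""
    letters = [c for c in item.upper() if c.isalpha()]
    return [ord(c) - ord("A") for c in letters[:3]]


bits = [0] * 26  # the 26-entry bit vector, initially all 0


def register(item):
    for pos in letter_positions(item):
        bits[pos] = 1


def maybe_available(item):
    # False: definitely not registered. True: possibly registered.
    return all(bits[pos] for pos in letter_positions(item))


for name in ["Coke", "Fanta", "Sprite", "Vitali"]:
    register(name)

for query in ["Coke", "Tea", "Coffee", "Orange juice"]:
    print(query, maybe_available(query))
```

"Tea" is rejected because bit E was never set, while "Coffee" and "Orange juice" are accepted even though they were never registered: all of their bits were set by other items.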
Bloom Filter Analysis
• Key factors of a bloom filter
• Number of hash functions, k
• Number of bits in the bit vector, m
• Number of items expected to be stored, n
• Uniformity of the hash functions
• False positive analysis
• The bit vector is set nk times after n items are stored
• Each time, the probability that a particular bit is set is 1/m
• Assume true uniformity of hash functions (a particular bit is not set with probability 1 - 1/m)
• The probability that a particular bit is set after n items are
stored is $1 - (1 - 1/m)^{nk}$
• The probability of a false positive is $(1 - (1 - 1/m)^{nk})^k$
• We can carefully select m, n, and k to achieve an
acceptable false positive rate, e.g., 1%
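• Note:
$$\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^{m} = e^{-1}$$

• For large m
$$\left(1 - \frac{1}{m}\right)^{m} \approx e^{-1}$$

$$\left(1 - \frac{1}{m}\right)^{k} = \left(\left(1 - \frac{1}{m}\right)^{m}\right)^{k/m} \approx e^{-k/m}$$

$$\left(1 - \frac{1}{m}\right)^{nk} \approx e^{-nk/m}$$

• So:
$$\left(1 - \left(1 - \frac{1}{m}\right)^{nk}\right)^{k} \approx \left(1 - e^{-nk/m}\right)^{k}$$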
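
To see how close the approximation is, the exact expression and its exponential approximation can be evaluated side by side. The Python sketch below is illustrative only; the function names and the sample values of m, n, and k are assumptions, and both formulas presuppose independent, uniform hash functions.

```python
import math


def false_positive_exact(m: int, n: int, k: int) -> float:
    """(1 - (1 - 1/m)^(nk))^k, the expression before the large-m approximation."""
    return (1 - (1 - 1 / m) ** (n * k)) ** k


def false_positive_approx(m: int, n: int, k: int) -> float:
    """(1 - e^(-nk/m))^k, the approximation for large m."""
    return (1 - math.exp(-n * k / m)) ** k


# Sample configuration (assumed values): 10,000 bits, 1,000 items, 7 hash functions.
m, n, k = 10_000, 1_000, 7
print(false_positive_exact(m, n, k))
print(false_positive_approx(m, n, k))
```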

Hash Functions
• The hash functions used in a Bloom filter should
be independent and uniformly distributed. They
should also be as fast as possible (cryptographic
hashes such as SHA1, though widely used, are therefore
not very good choices).
• A short survey of bloom filter implementations:
• Chromium uses HashMix
• python-bloomfilter uses cryptographic hashes
• Plan9 uses a simple hash as proposed in Mitzenmacher
2005
• Sdroege Bloom filter uses fnv1a
• Squid uses MD5
How Big Should Bloom Filter Be?
• It's a nice property of Bloom filters that you can
tune the false positive rate of your filter. A larger
filter will have fewer false positives, and a smaller one
more.
• Your false positive rate will be approximately
$(1 - e^{-kn/m})^k$, so you can just plug in the number n of
elements you expect to insert, and try various
values of k and m to configure your filter for your
application

How Many Hash Functions?
• The more hash functions you have, the slower your
bloom filter, and the quicker it fills up. If you have
too few, however, you may suffer too many false
positives.
• Since you have to pick k when you create the filter,
you'll have to ballpark what range you expect n to be
in. Once you have that, you still have to choose a
potential m (the number of bits) and k (the number
of hash functions).
• It seems a difficult optimization problem, but
fortunately, given an m and an n, there is a formula
for the optimal value of k: $k = (m/n) \ln 2$
Determine The Size of a Bloom Filter
• To choose the size of a bloom filter
1. Choose a ballpark value for n
2. Choose a value for m
3. Calculate the optimal value of k
4. Calculate the error rate for our chosen values
of n, m, and k. If it's unacceptable, return to
step 2 and change m; otherwise we're done.
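
The four steps can be turned into a small search loop. The Python sketch below is one possible reading of the procedure, not a prescribed method: the starting guess for m, the growth factor used to "change m", and the rounding of k to the nearest integer are all assumptions made for illustration.

```python
import math


def optimal_k(m: int, n: int) -> int:
    """Step 3: k = (m/n) ln 2, rounded to a whole number of hash functions."""
    return max(1, round((m / n) * math.log(2)))


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Step 4: approximate error rate (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


def size_bloom_filter(n: int, target_rate: float, growth: float = 1.25):
    """Grow m until the error rate for the best k is acceptable."""
    m = n  # step 2: initial guess (assumed: one bit per expected item)
    while True:
        k = optimal_k(m, n)
        rate = false_positive_rate(m, n, k)
        if rate <= target_rate:
            return m, k, rate
        m = math.ceil(m * growth)  # unacceptable: return to step 2 with a larger m


# Step 1: ballpark n = 1,000 items, aiming for at most 1% false positives.
m, k, rate = size_bloom_filter(n=1_000, target_rate=0.01)
print(m, k, rate)
```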

Summary
• 8.1 Introduction
• 8.2 Static hashing
• 8.3 Dynamic hashing
• 8.4 Bloom filters

