CH8 Hashing
Data Structures
Prof. Tai-Lang Jong
Office: Delta 928
Tel: 42577
email: [email protected]
Spring 2022
Outline
• 8.1 Introduction
• 8.2 Static hashing
• 8.3 Dynamic hashing
• 8.4 Bloom filters
2
Hashing( 雜湊 ) in Data Structures
• Searching is a dominant operation on any data
structure. In most cases, inserting, deleting, and
updating all require a search first, so the cost of
searching largely determines a data structure's time
complexity.
• For most data structures the best search time is
O(log n), achieved only by structures such as AVL trees
and sorted arrays; in many cases searching takes O(n)
time.
• Hashing is introduced to address this search problem:
it supports searching in O(1), i.e., constant, time.
3
Dictionary ADT Revisited
• A dictionary is a collection of items.
• Each item is a pair
• (key, element) or (key, value)
• Pairs have different keys.
• Keys may not have an ordering
• A dictionary represents a mapping from keys to elements.
• The primary use of a dictionary is to store elements so that they can be
located quickly using (search) keys
• Operations.
• Get(theKey) => search
• Delete(theKey) => delete
• Insert(theKey, theElement) => insert
• Additional operations: (for dynamic dictionary)
• Ascend() (sorting)
• Get(index)
• Delete(index)
4
ADT 5.3 Dictionary Revisited
template <class K, class E>
class Dictionary {
public:
virtual bool IsEmpty() const = 0;
// return true iff the dictionary is empty
virtual pair <K, E>* Get(const K&) const = 0;
// return pointer to the pair with specified key; return 0 if no such pair
virtual void Insert(const pair <K, E>&) = 0;
// insert the given pair; if key is a duplicate update associated element
virtual void Delete(const K&) = 0;
// delete the pair with specified key
};
• The dictionary ADT provides operations for storing records, finding
records, and removing records from the collection.
• This ADT gives us a standard basis for comparing various data
structures.
5
Applications of Dictionary
• Spelling checker
• Thesaurus
• Natural language dictionary
• key: word in language X; value: word in language Y
• Index for a database
• Contact book
• key: name of person; value: telephone number
• Property-value collection
• key: property name; value: associated value
• Symbol tables generated by loaders, assemblers, and
compilers, e.g.,
• Table of program variable identifiers
• key: identifier; value: address in memory
6
Representation of Dictionary
• Sorted or unsorted sequences (linear lists)
• Easy to implement, space efficient
• Insertion itself is fast, but needs a lookup to check
whether the key is already present
• Lookup is slow
• Binary search tree
• Get, Insert, Delete take O(n) time (worst case)
• AVL trees (balanced binary search trees)
• Get, Insert, Delete take O(log n) time
• Other balanced structures: self-organizing BSTs, red-black trees,
(a,b)-trees (in particular 2-3-trees), B-trees
• Hashing
• Get, Insert, Delete take O(1) expected time
7
Simple Implementations of Dictionary
• Elements of a dictionary can be kept in a sequence (linked
list or array):
• data size: number of elements (n);
• dominant operation: key comparison
• Unordered:
search: O(n); insert: O(1)(no checking); delete: O(n)
• Ordered array:
search: O(logn); insert O(n); delete O(n)
• Ordered linked list:
search: O(n); insert O(n); delete: O(n)
• (keeping the sequence sorted does not help in this case!)
• Space complexity: Θ(n)
8
Complexity Of Dictionary Operations
Get(), Insert() and Delete()
Data Structure            Worst Case    Expected
Hash table                O(n)          O(1)
Binary search tree        O(n)          O(log n)
Balanced BST (e.g., AVL)  O(log n)      O(log n)
12
Hash Tables
• A hash table is another easy and efficient
implementation of a symbol table.
• Elements are kept in a b-element table, b << |U|
• It works with keys that are not ordered, but supports only
• Insert
• Delete
• Search
• It is based on the concept of a hash function h(key),
h : U → [0..b − 1]
• Maps each possible key into a specified bucket
• The number of buckets b is much less than the number of
possible keys |U| (b << |U|)
• Each bucket can store a limited number of elements
13
Hashing Non-integer Keys
• For integer key:
hashing function: h : U → [0..b − 1], (|U| > b)
• What if the type of key is not integer?
• An additional step is needed:
• Before computing the hash function, the key must be
transformed into an integer.
• For example, if the key is a string of characters, the
transformation should depend on all of the characters.
• This transformation function should have properties
similar to those of a hash function.
14
Registration Division Example
• Everyone is asked to contact a Registration Division staff
member to inquire about semester grades.
• Most people on campus phone or email the first staff member.
Staff member / Extension / Email
陳 OO 31300 / chen@nthu...
郭 OO 31301 / kuo@nthu...
李 OO 31302 / li@nthu...
林 OO 31303 / lin@nthu..
王 OO 31304 / wang@nthu...
15
Hash Concepts
• Hash function
• Any deterministic function that maps data of
arbitrary size (original keys) to a contiguous range
of a desired fixed size: 0 ≤ h(k) < b (hashed keys)
(Figure: keys such as " 周杰倫 ", "Donald Trump", " 鈴木一朗 " are mapped by
the hash function to hashed keys 2, 3, 0 in the range 0~4, which index the
staff/extension table: 0 陳 OO 31300, 1 郭 OO 31301, 2 李 OO 31302,
3 林 OO 31303, 4 王 OO 31304.)
16
Hash Concepts
• Hash function
• It shuffles the order of the mapping
• But it is deterministic
(Same figure as the previous slide: the keys are mapped to hashed keys in a
shuffled but fixed order.)
17
Hash in Cooking
• Hash: "chop and mix foods"
McDonald
Function
18
Hash in Chinese Decomposition
• Decompose Chinese characters into keyboard strokes
• Facilitate Chinese input
• Example: the Boshiamy ( 嘸蝦米 ) decomposition scheme
(Figure: characters such as 哈 , 州 , 哥 are decomposed by the Boshiamy
"function" into key sequences such as OAO, YYY, TOTO.)
19
Hash in Storing Data
• Example: Storing students' grades according to
their name initial letters
(Figure: student names are hashed by their initial letter; (Alice, 100) goes
to the A bucket, while (John, 95) and (Jane, 100) both go to the J bucket.)
20
Hashing
• Static hashing
• Store the dictionary (key, element) pairs in hash table
• Hash Table
• Hash Function
• Collision & Overflow Handling
• Open addressing or open hashing
• Chaining
• Dynamic hashing (extendible hashing)
• Retain the fast retrieval time of conventional hashing while
extending the technique so that it can accommodate
dynamically increasing and decreasing table size without
penalty.
• Provide acceptable hash table performance on a per-operation
basis, using either a directory or a directoryless scheme
21
Outline
• 8.1 Introduction
• 8.2 Static hashing
• 8.3 Dynamic hashing
• 8.4 Bloom filters
22
Advantages of Hashing
(Figure: an identifier (key) is mapped by the hash function h(key) into one
of the buckets 0 .. b−1 of the hash table; the buckets hold pairs such as
(Bob, 80), (Ben, 70), (Irene, 85), (John, 95), (Jane, 100), (Ken, 75), (Zoe, 80).)
• b is the number of buckets, s is the number of slots per bucket
• The loading density is n/(sb)
• Synonyms (collision)
• Two identifiers, I1 and I2, are synonyms if h(I1) = h(I2)
24
Collision and Overflow
• Collision
• When two nonidentical identifiers are hashed into the
same bucket
• Overflow
• When a new identifier is mapped or hashed by h into a full
bucket
25
Efficiency of Hash Table
• Search or insertion time of a hash
table
(1) compute the hash function
(2) search a bucket
• These times are independent of
n – O(1)
• Collisions are inevitable
• Taking only the first character is not a
good hash function because it causes
too many collisions
• many variables in a program begin
with the same character
• An overflow mechanism is necessary
26
Hash Function
Domain: usually unbounded in size
Range: usually a finite size [0, b)
A hash function is usually "many-to-one",
and never "one-to-many".
27
Hash Function
• Good hash functions reduce the
chance of collisions
• Enlarging the hash table can also
reduce collisions
• At the cost of memory size
• Ideal hash functions have
• A minimal number of collisions
• A uniform distribution of keys over
the buckets
• Easy (fast) computation
• Maximum usage of the information
present in the values (keys)
• Different hash values even for very
similar keys
(Figure: slots and buckets with h(A)=0 → (Alice, 100), h(B)=1 → (Bob, 80),
(Ben, 70), ..., h(I)=9 → (Irene, 85), h(J)=10 → (John, 95), (Jane, 100),
h(K)=11 → (Ken, 75), ..., h(Z)=25 → (Zoe, 80).)
28
Uniform Hash Function
• Basic desired properties of a hash function
• Easy to compute
• Minimizes the number of collisions
• A good hash function should also depend on every
character of the input identifier
• Uniform hash function
• Let x be an identifier chosen at random
• Then the probability that h(x) = i is 1/b for every bucket i
• That is, the hash function does not result in a biased use
of the hash table for random inputs
29
Hash Functions
• Classical examples
• Modulo (division)
• Mid-square
• Folding
• Digit analysis
• String-to-integer conversion
30
Hash Function: Modulo
(Division)
• Most widely used hash function in practice
• Procedure
• h(k) = k % D
• Selection of D
• D is the number of buckets
• D had better be an odd number
• An even divisor D always maps even keys to even buckets and
odd keys to odd buckets
• Real-world data tend to be biased toward either odd or even
keys
• It is even more desirable for D to be a prime number,
or a number having no prime factors smaller than 20
31
Hash Function: Modulo
(Division)
hash function hD(k) = k % D
where % is the modulo operator;
that is, the remainder is used as the hash address
• Hash address
• is in the range 0 through D−1, which implies that the table size is D
• D should not be a power of 2
• otherwise, hD(k) depends only on the least significant bits of k
• e.g., with D = 2^6: A1 is encoded as 2^6·(10) + 1, so hD(A1) = 1;
XY1 is encoded as 2^12·(33) + 2^6·(34) + 1, so hD(XY1) = 1
• D is usually a prime number
32
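A minimal sketch of the division hash in C++ (the function name and the choice of parameters are illustrative, not from the text):

#include <cstddef>

// Division (modulo) hash: the remainder of k divided by the table size D
// is the hash address. A prime (or at least odd) D avoids the even/odd
// bias and the power-of-2 pitfall described above.
inline std::size_t DivisionHash(std::size_t k, std::size_t D)
{
    return k % D;   // hash address in the range [0, D-1]
}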
Hash Function: Mid-Square
• h(k) = some middle r bits of the square of k
• The number of buckets is 2^r
• Example (r = 4: h(k) is the middle 4 bits of the 8-bit square k²)
k k² h(k) k k² h(k)
0 0 0000 0000 0 8 64 0100 0000 0
1 1 0000 0001 0 9 81 0101 0001 4
2 4 0000 0100 1 10 100 0110 0100 9
3 9 0000 1001 2 11 121 0111 1001 14
4 16 0001 0000 4 12 144 1001 0000 4
5 25 0001 1001 6 13 169 1010 1001 10
6 36 0010 0100 9 14 196 1100 0100 1
7 49 0011 0001 12 15 225 1110 0001 8
33
Hash Function: Mid-Square
• Identifiers are encoded using 6 bits per character:
A → 1 (01 in octal), B → 2 (02), ..., Z → 26 (32),
0 → 27 (33), 1 → 28 (34), 2 → 29 (35), ..., 9 → 36 (44)
• The coding of identifier x is right-justified, zero-filled, and has six bits per character
• Table size will be a power of 2
34
Hash Function: Folding
• Partition the key into several parts and add them
together
• Two strategies: shift folding and folding at the boundary
• Example
• k = 12320324111220, partitioned into 123 | 203 | 241 | 112 | 20
• Shift folding:
h(k) = 123 + 203 + 241 + 112 + 20 = 699
• (Folding at the boundary reverses every other partition before
adding, e.g., 123 + 302 + 241 + 211 + 20 = 897)
35
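The shift-folding example above can be written as a short C++ sketch (the helper name and part width are illustrative assumptions):

#include <string>
#include <cstddef>

// Shift folding: split the decimal digits of the key into fixed-width
// parts and add the parts together.
// For digits = "12320324111220" and partWidth = 3, the parts are
// 123, 203, 241, 112, 20 and the folded value is 699, as in the example.
std::size_t ShiftFold(const std::string& digits, std::size_t partWidth)
{
    std::size_t sum = 0;
    for (std::size_t i = 0; i < digits.size(); i += partWidth)
        sum += std::stoul(digits.substr(i, partWidth));   // add each part
    return sum;   // typically reduced further, e.g., sum % D
}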
Hash Function: Digit Analysis
• Application
• when all the identifiers are known in advance
• Procedure
• Step 1: Each identifier is interpreted as a number using radix r
• Step 2: Analyze the distribution of each digit
• Step 3: Drop biased digits
• The digits with the most skewed distributions are deleted one
by one until the number of remaining digits is small enough to
give an address
• Example
• Given three identifiers in radix-9 form: 891, 792, 793
• Digit distributions: 1st {8, 7, 7}, 2nd {9, 9, 9}, 3rd {1, 2, 3}
• The most skewed digit is the 2nd (always 9), so it is dropped first
36
String-to-Integer Conversion
• Useful when keys are strings
• Procedure
• Treat every n characters as an 8n-bit integer
• ASCII represents a character using 8 bits
• Consideration:
• We need to argue the advantages of our hash function compared
with the commonly used ones (see the sketch below)
40
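One possible string-to-integer conversion is sketched below; the multiplier 131 and the function names are illustrative assumptions, not a prescribed scheme:

#include <string>
#include <cstdint>
#include <cstddef>

// Convert a string to an integer so that every character influences the
// result, then reduce it with a division hash.
std::uint64_t StringToInt(const std::string& key)
{
    std::uint64_t value = 0;
    for (unsigned char c : key)
        value = value * 131 + c;   // mix in each 8-bit character
    return value;
}

inline std::size_t HashString(const std::string& key, std::size_t b)
{
    return StringToInt(key) % b;   // b buckets, ideally prime
}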
Secure Hash Functions
• Example
• SHA-1
• SHA256
• MD5
• Usage
• Message authentication
• Password storage (passwords can be verified without the stored
data revealing the passwords themselves)
• Digital signature (prevents forgery and tampering)
• Digital currency
• …
41
Message Authentication
• A message M to be transmitted over an insecure channel
from A to B.
• Assume we have a means to transmit messages much
smaller than M securely, e.g., encrypt smaller message
or transmit the smaller message on a more expensive
but far more secure channel.
• Send h(M) using more secure method. Send M over
insecure channel.
• At B, on receiving M’ and h(M), compute h(M’) and compare it
with h(M) to decide whether the received M’ is correct.
• It should be difficult for a malicious user who has
knowledge of h and M to determine a synonym of M --
the weak collision resistance property 42
Secure Hash Function Properties
• Desired secure hash function properties:
• Weak collision resistance:
• It is difficult for a malicious user who has knowledge of h & M to
determine a synonym of M
• One-way property:
• For a given c, it is computationally difficult to find a k such that h(k) = c
(inverse hashing problem)
• Strong collision resistance:
• It is computationally difficult to find a pair (x,y) such that h(x) = h(y)
• Several cryptographic hash functions with these
properties have been developed. They also have
additional properties:
• h can be applied to a block of data of any size
• h produces a fixed-length hash code
• h(k) is relatively easy to compute for any given k
43
Secure Hash Algorithm (SHA)
• Developed at the National Institute of Standards &
Technology (NIST) in USA.
• SHA-1 function:
• Input: any message of length < 2^64 bits
• Output: a 160-bit code
Step 1: Preprocess the message so that its length is q*512 bits for some integer q.
The preprocessing may entail adding a string of zeros to the end of the message.
Step 2: Initialize the 160-bit output buffer OB, which comprises five 32-bit registers
A, B, C, D, and E, with A = 0x67452301, B = 0xefcdab89, C = 0x98badcfe,
D = 0x10325476, E = 0xc3d2e1f0.
Step 3: for (int i = 1; i <= q; i++) {
Let Bi = ith block of 512 bits of the message;
OB = F(OB, Bi); // function F consists of 4 rounds of 20 atomic steps
}
Step 4: Output OB.
44
Atomic SHA Operation
(Figure: one atomic SHA-1 step transforms the five 32-bit registers
A, B, C, D, E using ft, the circular shifts S5 and S30, the word Wt, and
the constant Kt, then writes the result back into A, B, C, D, E.)
t = step number, 0 ≤ t ≤ 79
ft(B,C,D) = primitive (bitwise) logical function for step t; e.g., (B∧C)∨(¬B∧D)
Sk = circular left shift of the 32-bit register by k bits
Wt = a 32-bit value derived from Bi
Kt = a constant 45
Security Hash: MD5 Example
(Figure: the MD5 round mixes four 32-bit registers, initialized to
0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, with the function F(X, Y, Z),
32 bits of input at a time, constants from a sine table, and a rotation,
to produce the output hash.)
46
Security Hash: MD5 Example
• https://www.md5hashgenerator.com/
• Example
• “NTHU”
8191af722cfd2890b7a9e986003a6439
• “NTHU1”
4c97870289739e75576a4cbeb6222e25
• “ 資料結構”
9aa8415c41bb436d4b6e5618a7be6360
• Deterministic results
• Everyone can get exactly the same results, although the
results look like random numbers 47
Security Hash: MD5 Example
• Important properties
• Hard to find the inverse
• " 祝大家期末考順利 ? !" ("Wishing everyone good luck on the final exam?!")
21cee26fd407729a1c740105891e3fca
• Easy to verify
48
Usage: Password Store
(Figure: a login form asks for username and password; the server stores the
pairs in plaintext, e.g., Mary: abcabc, John: edp903d, Bob: HE3dpq7.)
49
Usage: Password Store
(Figure: at login, the submitted password is hashed and compared against a
table that stores only hashed passwords, e.g., Mary: 5abb7fa…, John: 8e2c1f8…,
Bob: ed6e5ad…; the plaintext passwords are never stored.)
50
Usage: Digital Signature
• A quiz with prizes: everyone is asked to solve a data structures
problem, and the first 3 people to solve it get a prize!
• Scenario 1
• Someone shouts "I solved it!" first and only then slowly works out
the answer; the true order is hard to judge
• Scenario 2
• The winner announces the answer, and the second and third place
both claim "that is exactly what I was thinking"
• Scenario 3
• Everyone hands their answer to a referee, who then checks who
answered correctly first; but the referee's fairness cannot be verified.
51
Usage: Digital Currency
• Digital currency (e.g., Bitcoin)
• Decentralized
• The ledger is kept by the currency's users, not by a government or a bank
• Users who do the bookkeeping receive some reward
• Problems
• How is the right to record decided, without letting particular users
monopolize it?
• How is the ledger protected against tampering?
52
Usage: Digital Currency
• Approach
• Ask all users to solve an inverse hash problem together
• Replace x with digits so that the first N hex digits of the MD5 of the
string " A 轉給 B 十元 (xxxxxxxx)" ("A transfers ten dollars to B (xxxxxxxx)")
are 0
• A 轉給 B 十元 (00000000) 4299bf747…
• A 轉給 B 十元 (00000001) 2c81a694d…
• A 轉給 B 十元 (00000002) 551741357…
• This search is called mining; its difficulty is tuned so that a solution
is found only about once every 10 minutes.
• Mining is profitable, so many people participate
• Monopolizing the bookkeeping right would require an enormous number of
computers, which is not cost-effective
• Each later ledger block is linked to the previous one (a blockchain)
• Tampering with an already-formed ledger would require redoing the inverse
hash problems for every block after the altered one in a short time,
which is not cost-effective
53
Overflow Handling
• Problem
• When a new identifier is hashed into a full
bucket, then we need to find another open
bucket
• Methods
• Open addressing -- find the closest bucket that is
not full
• Linear probing
• Quadratic probing
• Rehashing
• Chaining
• Implement each bucket as a linked list
54
Linear Open Addressing
• Procedure of searching an identifier x
• Step 1: compute h(x)
• Step 2: examine identifiers at positions ht[h(x)], ht[(h(x)+1) % b], ...,
ht[(h(x)+j) % b], in this order, until one of the following happens:
(a) ht[(h(x)+j) % b] = x; in this case x is found
(b) ht[(h(x)+j) % b] is null; x is not in the table
(c) we return to the starting position h(x); the table is
full and x is not in the table
55
Program 8.4 Linear Probing
template <class K, class E>
pair<K, E>* LinearProbing<K, E>::Get(const K& k)
{ // Search the linear probing hash table ht (each bucket has exactly one
  // slot) for k. If a pair with this key is found, return a pointer to this
  // pair; otherwise, return 0.
   int i = h(k); // home bucket
   int j;
   for (j = i; ht[j] && ht[j]->first != k;) {
      j = (j + 1) % b;       // treat the table as circular
      if (j == i) return 0;  // back to the start point: table full, k absent
   }
   if (ht[j] && ht[j]->first == k) return ht[j];
   return 0;                 // ht[j] is empty, so k is not in the table
}
56
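For completeness, a sketch of the matching Insert is shown below. It assumes the same members used in Program 8.4 (the bucket array ht of pair<K,E>*, the bucket count b, and the hash function h); duplicate keys update the element, and a full table is reported by throwing:

template <class K, class E>
void LinearProbing<K, E>::Insert(const pair<K, E>& thePair)
{
   int i = h(thePair.first);   // home bucket
   int j = i;
   do {
      if (!ht[j]) {                              // empty slot: place the pair here
         ht[j] = new pair<K, E>(thePair);
         return;
      }
      if (ht[j]->first == thePair.first) {       // duplicate key: update element
         ht[j]->second = thePair.second;
         return;
      }
      j = (j + 1) % b;                           // linear probe, circular table
   } while (j != i);
   throw "Hash table is full";                   // no open bucket was found
}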
Problem of Linear Open Addressing
• Identifiers tend to cluster together
• Increase the search time
• Could be worse than the search tree structure
• An analysis shows that
• looking up an identifier takes about (2−α)/(2−2α) comparisons,
where α is the loading density (α = n/(sb))
• Quadratic probing
• reduces the clustering problem
• probe sequence:
h(k), (h(k)+i²) % b, and (h(k)−i²) % b,
for i = 1, 2, ...
57
Rehashing
• Another way to control the growth of clusters is to
use a series of hash functions h1, h2, …, hm. This is
called rehashing.
• Buckets hi(k), 1 ≤ i ≤ m are examined in that order.
• Double hashing:
• If bucket h(k) is occupied, then we iteratively try buckets
(h(k) + j*h′(k)) % b for j = 1, 2, …
58
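A minimal sketch of the double-hashing probe sequence described above (the function name and the secondary hash are common textbook choices, assumed here rather than taken from the slides):

#include <cstddef>

// j-th bucket probed for key k in a table of b buckets (b prime).
// h2(k) is never 0, so the sequence (h1 + j*h2) % b visits every bucket.
inline std::size_t DoubleHashProbe(std::size_t k, std::size_t j, std::size_t b)
{
    std::size_t h1 = k % b;               // primary hash
    std::size_t h2 = 1 + (k % (b - 1));   // secondary hash, in [1, b-1]
    return (h1 + j * h2) % b;
}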
Chaining
• Linear probing performs poorly because the search
for an identifier involves comparisons with
identifiers that have different hash values.
• e.g., a search for ZA involves comparisons with the
buckets ht[0] – ht[7], none of which can possibly
collide with ZA.
• Unnecessary comparisons can be avoided if all the
synonyms are put in the same list, with one list
per bucket.
• Each chain has a head node. Head nodes are stored
sequentially.
59
Chain-Based Hash Table
• Each bucket is a chain
• Chain nodes are typically unordered
• We typically expect the hash function to spread
records uniformly enough
• Thus each chain does not contain too many nodes
• Linearly traversing a chain is required for
inserting, finding, and removing a key
(Figure: buckets A–Z, each heading a chain of (data, link) nodes, e.g.,
A → (Alice, 100), B → (Bob, 80) → (Ben, 70), I → (Irene, 85),
J → (John, 95) → (Jane, 100), K → (Ken, 75) → (Kevin, 70),
L → (Linda, 90), M → (Mary, 85), ..., Z → (Zoe, 80); a 0 link ends each chain.)
60
Chaining
(Figure: a hash table ht with 26 buckets; each node has a data field and a
link field. ht[0] heads the chain A4 → A3 → A1 → A2 → A, ht[3] → D,
ht[4] → E, ht[6] → G → GA, ht[11] → L, ht[25] → ZA → Z; empty buckets and
the last link of each chain hold 0.)
61
Program 8.5 Chain Search
template <class K, class E>
pair<K, E>* Chaining<K, E>::Get(const K& k)
{ // Search the chained hash table ht for k. If a pair with this key is
  // found, return a pointer to this pair; otherwise, return 0.
   int i = h(k); // home bucket
   // search the chain ht[i]
   for (ChainNode<pair<K, E>>* current = ht[i]; current; current = current->link)
      if (current->data.first == k) return &current->data;
   return 0;
}
62
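A sketch of the matching Insert for the chained table is shown below. It assumes the same members as Program 8.5 (ht, h, and ChainNode with data and link fields) and a ChainNode constructor taking the data and the next pointer, which is an assumption about that class:

template <class K, class E>
void Chaining<K, E>::Insert(const pair<K, E>& thePair)
{
   int i = h(thePair.first);   // home bucket
   // if the key is already present, update its element
   for (ChainNode<pair<K, E>>* current = ht[i]; current; current = current->link)
      if (current->data.first == thePair.first) {
         current->data.second = thePair.second;
         return;
      }
   // otherwise prepend a new node to the home chain
   ht[i] = new ChainNode<pair<K, E>>(thePair, ht[i]);
}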
A Comparison
• Hash Function
• Division is generally superior to the other types
• Collision handling
• Chaining outperforms linear open addressing
63
Chaining vs. Open Addressing
(Comparison table contrasting chaining with open addressing.)
64
Outline
• 8.1 Introduction
• 8.2 Static hashing
• 8.3 Dynamic hashing
• 8.4 Bloom filters
65
8.3 Dynamic Hashing
• In static hashing, if the hash table is allocated too
small, then when the data exceeds the capacity of the
hash table (the loading density exceeds a threshold),
the entire table must be extended (its size increased)
and restructured:
e.g., going from b buckets with divisor D = b to 2b + 1
buckets with new divisor D = 2b + 1 changes the hash
function, so the whole hash table must be rebuilt from
scratch (not merely copied) -- a time-consuming process.
• For a very large dictionary being accessed 24/7, that
means an unacceptable interruption of service during
hash table rebuilding.
66
Hash Table Size: b → 2b+1
(Figure: the hashed key hD(k) is the bucket index; with D = b the addresses
lie in [0, …, b−1]. After enlarging, D′ = 2b+1 and all keys are redistributed
over the larger table.)
68
Dynamic Hashing Using
Directory
• Assume a file F is a collection of records R.
Each record has a key field K.
Records are stored in pages (buckets) that can hold only
a limited number of records.
• Dynamic hashing using a directory first maps key k to a
directory entry using h(k,p), the p least significant bits
of h(k); the directory d contains pointers to 2^p buckets.
• On overflow, p is increased, so the size of the directory
is doubled, and new buckets are added dynamically with
minimal restructuring of existing buckets.
• With the directory in memory and the records on disk, the
aim is to minimize the number of page accesses.
69
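A minimal sketch of the directory lookup h(k, p) described above (the function name is illustrative):

#include <cstdint>

// The p least significant bits of the hashed key select one of the
// 2^p directory entries, each of which points to a bucket (page).
inline std::uint32_t DirectoryIndex(std::uint32_t hashedKey, int p)
{
    return hashedKey & ((1u << p) - 1);   // keep the p low-order bits
}
// When an overflow forces the directory to double, p increases by 1; every
// old entry is duplicated, and only the overflowing bucket is actually split.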
Dynamic Hashing Using
Directory
• The hashed key h(k) is made sufficiently large (16 bits in the figure)
• h(k,p) = the p least significant bits of h(k), used to index the
directory d of bucket pointers
• The directory can double when an overflow occurs
(Figure: a key k is hashed to h(k) = xxxx xxxx xxxx xxxx; h(k,p) selects a
directory entry, which points to a bucket holding pairs such as (k1,e1),
(k2,e2), with (k3,e3) in another bucket.)
76
Dynamic Hashing Using
Directory
• Deletion from a dynamic hash table with a directory is
similar to insertion.
• Although dynamic hashing employs array doubling, the
time for this array doubling is considerably less than that
for the array doubling used in static hashing.
This is because in dynamic hashing, we need to rehash
only the entries in the overflow bucket rather than all
entries in the table.
• Further, savings result when the directory resides in
memory while the buckets are on disk.
A search requires only 1 disk access; an insert makes 1
read and 2 write accesses to the disk; the directory
doubling requires no disk access. 77
Directoryless Dynamic Hashing
• Dispense with the directory, d, of bucket pointers.
• An array, ht, of buckets is used.
• Assume that this array is as large as possible and there
is no possibility of increasing its size dynamically.
• To avoid initializing such a large array, use two variables
q and r, 0 ≤ q < 2^r, to keep track of the active buckets.
• At any time, only buckets 0 through 2^r + q − 1 are active. New
active buckets can be added when needed by changing
q and/or r (q++; then if q == 2^r, set q = 0 and r++).
• Each active bucket is the start of a chain of buckets. The
remaining buckets on a chain are called overflow
buckets.
78
Index Rules for Active Buckets
• The active buckets 0 through q−1, as well as the active
buckets 2^r through 2^r + q − 1, are indexed using h(k,r+1)
• The remaining active buckets, q through 2^r − 1, are
indexed using h(k,r)
• Each active bucket is the start of a chain of buckets. The
remaining buckets on a chain are called overflow buckets.
(Figure: the array ht of active buckets 0 … q … 2^r … 2^r+q; the ranges
[0, q−1] and [2^r, 2^r+q−1] use h(k,r+1), the range [q, 2^r−1] uses h(k,r),
and overflow buckets hang off the active buckets.)
79
Searching Directoryless Hash Table
• To search for key k, we need to compute h(k,r)
If ( h(k,r) < q )
search the chain that begins at bucket h(k, r+1);
else
search the chain that begins at bucket h(k,r);
80
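The index rule can be captured in a small self-contained helper; the function name is an illustrative assumption:

#include <cstdint>

// Returns the active bucket whose chain should be searched for a key with
// hashed value hashedKey, given the current r and q of the table.
inline std::uint32_t ActiveBucket(std::uint32_t hashedKey, int r, std::uint32_t q)
{
    std::uint32_t index = hashedKey & ((1u << r) - 1);    // h(k, r)
    if (index < q)                                        // this range was split,
        index = hashedKey & ((1u << (r + 1)) - 1);        // so use h(k, r+1)
    return index;
}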
Overflow Handling
• To insert new key-element, search to determine if the key
is already in ht.
• When search is done, new key is not in ht, and the active
bucket for the searched chain is full, we get an overflow.
• An overflow is handled by
1. Activating bucket 2^r + q (adding a new active bucket)
2. Reallocating (rehashing) the entries in chain q between
chain q and the newly activated bucket (chain) 2^r + q,
using h(k,r+1)
3. q = q + 1; if (q == 2^r) { q = 0; r = r + 1; }
4. Finally, the new key-element pair is inserted into the chain
where the search algorithm would look for it, using
the new r and q values:
if h(k,r) < q use h(k,r+1), else use h(k,r) 81
• To start, assume there are 4 (= 2^r + q) bucket chains (r = 2, q = 0);
each chain begins at one of the 4 active buckets and comprises only that
active bucket (i.e., there are no overflow buckets). Each bucket has 2 slots.
• To insert k = C5, first determine whether C5 is in ht using the search
algorithm; if it is not, check for overflow.
(a) r = 2, q = 0; active buckets = 2^r + q = 4
Hashed keys h(k): A0 → 100 000, A1 → 100 001, B0 → 101 000, B1 → 101 001,
C1 → 110 001, C2 → 110 010, C3 → 110 011, C5 → 110 101
1. Search C5: h(C5,2) = 01 > q (= 0), so use h(C5,2) = 01 and search that chain.
2. C5 is not found; check whether the active bucket of the searched chain is full.
3. It is full → overflow → handle the overflow. 82
• Insert k = C5: to search for C5, first compute h(k,r).
• h(C5,2) = 01 > q (= 0), so examine bucket 01 using h(k,r).
C5 is not found and the bucket is full → overflow.
(a) r = 2, q = 0, insert C5 → (b) after inserting C5: r = 2, q = 1
• Overflow handling
1. Activate bucket 2^r + q = 100.
2. Reallocate A0 and B4 in chain q (i.e., chain 00) using h(k,r+1):
h(A0,r+1) = 000 and h(B4,r+1) = 100, so A0 is not moved and
B4 → bucket 100.
3. q = q + 1 = 1; r = 2 unchanged.
4. Insert C5: h(C5,r) = 01 ≥ q, so use h(C5,2) = 01; C5 is added to
bucket 01 using an overflow bucket.
83
• Insert k = C1: to search for C1, first compute h(k,r).
• h(C1,2) = 01 = q (= 1), so examine bucket 01 using h(k,r).
C1 is not found and the bucket (chain) is full → overflow.
(b) r = 2, q = 1
• Overflow handling
1. Activate bucket 2^r + q = 101.
2. Reallocate (rehash) A1, B5, and C5 in chain q (i.e., chain 01)
using h(k,r+1): h(A1,r+1) = 001, h(B5,r+1) = 101,
h(C5,r+1) = 101, so A1 is not moved and B5, C5 → bucket 101.
3. q = q + 1 = 2 < 2^r; r = 2 unchanged.
4. Insert C1: h(C1,r) = 01 < q = 2, so use h(C1,r+1) = 001;
C1 is added to bucket 001.
84
(b) r = 2, q = 1, insert C1 → (c) r = 2, q = 2
(Figure: chain q is relocated using h(k,r+1); buckets below q and at or above
2^r are now indexed with h(k,r+1), the remaining active buckets with h(k,r).)
85
Outline
• 8.1 Introduction
• 8.2 Static hashing
• 8.3 Dynamic hashing
• 8.4 Bloom filters
86
Bloom Filter Concepts
• Proposed by Burton Howard Bloom in 1970
• A Bloom filter is a data structure designed to tell
you, rapidly and memory-efficiently, whether an
element is present in a set.
• It is a probabilistic data structure that uses the
concept of hashing extensively.
• The price paid for this efficiency is that a Bloom
filter is a probabilistic data structure: it tells us that
the element either definitely is not in the set (no
false negative) or may be in the set (may have false
positive).
87
Bloom Filter Concepts
• Consider matching two strings in checking for passwords
from a database.
• The passwords are encrypted and then stored in the database
for security reasons. The hashed values are very long strings,
usually 70+ characters.
• In such cases, when two strings need to be compared
character by character, string matching algorithms take O(n).
• Bloom filter, on the other hand, takes O(1) to accomplish the
same task.
• What is the additional advantage of using a bloom filter?
• Bloom filters reduce the number of calls made to resources
such as servers and databases, by quickly eliminating inputs
that don’t match with the actual value.
88
Bloom Filter Concepts
Traditional set data structures (e.g., a BST) vs. Bloom filters:
• False positives (it could be wrong when it says "Yes"):
no vs. yes (a drawback)
• False negatives (it could be wrong when it says "No"):
no vs. no
• Easy insertion: yes vs. yes
• Easy deletion: yes vs. no (a drawback)
• Memory space efficiency: low vs. high (an advantage)
89
Applications of Bloom Filter
• Authentication:
• Bloom filters can check for passwords and reject all of
the wrong passwords entered, thus reducing the load on
the main database servers.
• Authorization:
• Authorization is the process of giving access to users of a
website based on the level of authority a user possesses.
• The admin can access the entire website and make
changes, whereas a common user can view the website
in read-only mode. Therefore using bloom filters in
websites like large e-commerce sites is a viable solution
to prevent access from non-authorized entries.
90
Applications of Bloom Filter
• One-search (one-hit) wonders:
• Search engines keep track of the search phrases and ensure
not to cache the phrases until they are searched for
repetitively.
• We can do a little experiment to check this out.
• Open your incognito tab and go to any search engine you’d like.
Type a query related to Python, for example, “lists in python”.
The next time you type lists, it will still show results that are not
specific to Python.
• A couple of searches related to Python will lead to all search
results being directed towards Python.
• After a couple of searches, you will observe that just typing
dictionary will lead you to dictionaries in Python.
• Search engines keep track of the URLs and enable caching of
the URLs upon multiple accesses.
91
Grocery Shop Example
• Suppose we own a grocery
shop
• Customers occasionally ask
for an item that we are not
sure about the availability
• We spend significant time
looking for an item before
realizing that the item is
unavailable
• This significant time can be
spared if we know that the
item is definitely not
available.
92
Grocery Shop Example
• A Bloom filter can help
• Determine the availability of
a requested item
• Some false positives are
acceptable
• i.e., the data structure says
that an item is available, but
in fact it is not
• No false negatives
• We do not want to
mistakenly turn down a
customer's request
93
Bloom Filter
• Bloom filter is an exciting application of the hash
tables used to check for membership of elements in
a set
• Components of a Bloom filter
• An m-bit vector, initially filled with 0
• Multiple (k) hash functions (k < m)
• The hash functions used in a Bloom filter should
be independent and uniformly distributed.
• They should also be as fast as possible (cryptographic hashes
such as SHA-1, though widely used, are therefore not very good
choices).
• And n elements in the set
• The false positive rate will be approximately (1 − e^(−kn/m))^k
94
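A minimal Bloom filter sketch in C++ follows. The sizes m = 1024 and k = 3 are arbitrary, and the k "independent" hashes are simulated by salting std::hash, which is adequate for illustration but not for production use:

#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

class BloomFilter {
public:
    void Insert(const std::string& key) {             // set k bits for the key
        for (int i = 0; i < k; ++i) bits.set(Index(key, i));
    }
    // false: definitely not in the set; true: possibly in the set
    bool PossiblyContains(const std::string& key) const {
        for (int i = 0; i < k; ++i)
            if (!bits.test(Index(key, i))) return false;
        return true;
    }
private:
    static constexpr int m = 1024;                     // bits in the vector
    static constexpr int k = 3;                        // number of hash functions
    std::bitset<m> bits;
    static std::size_t Index(const std::string& key, int salt) {
        return std::hash<std::string>{}(key + char('A' + salt)) % m;
    }
};

After bf.Insert("Coke"), bf.PossiblyContains("Coke") returns true, while a key that was never inserted returns false unless all of its bits happen to collide with inserted keys (a false positive).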
Bloom Filter
• Example
• A table with 26 entries, A ~ Z,
initially all 0
• Three hash functions for a string
• First character
• Second character
• Third character
(Figure: the 26-entry "Available items" bit vector, A through Z, all 0.)
95
Bloom Filter
• Example
• Register the string "Coke" into the
Bloom filter to indicate that our
grocery sells Coke
• Set the bit vector according to the
three hash values: h1("Coke") = "C",
h2("Coke") = "O", h3("Coke") = "K"
(Figure: in the "Available items" bit vector, the bits for C, O, and K are now 1.)
96
Bloom Filter
• A simple test
• If a customer asks for "Coke"
afterward
• The bit vector is examined according to the
three hash values
• The Bloom filter determines that Coke is
available because the corresponding
bits have all been set
(Figure: h1("Coke") = "C", h2("Coke") = "O", h3("Coke") = "K"; the C, O, and
K bits in the "Available items" vector are all 1.)
97
Bloom Filter
• A simple test
• If a customer asks for "Orange
juice" afterward
• The Bloom filter determines that orange
juice is unavailable because at least
one corresponding bit is not set
(Figure: h1("Orange juice") = "O", h2("Orange juice") = "R",
h3("Orange juice") = "A"; the O bit is set but the R and A bits are 0.)
98
Bloom Filter
• A simple test
• If a customer asks for "Tea"
afterward
• The Bloom filter determines that Tea is
unavailable because at least one
corresponding bit is not set
(Figure: h1("Tea") = "T", h2("Tea") = "E", h3("Tea") = "A"; none of the T, E,
A bits is set yet.)
99
Bloom Filter
• We register more strings into the
Bloom filter (add more elements to the
set); the bit vector gets filled up
(Figure: "Fanta" sets F, A, N; "Sprite" sets S, P, R; "Vitali" sets V, I, T;
together with "Coke", the bits C, O, K, F, A, N, S, P, R, V, I, T of the
"Available items" vector are now 1.)
100
Bloom Filter
• Test again
• The Bloom filter still works
(Figure: "Coke" → C, O, K and "Fanta" → F, A, N are reported as available
because all of their bits are set; "Tea" → T, E, A is still reported as
unavailable because the E bit is 0.)
101
Advantages
• Simple, fast, and memory efficient
• Bloom filters filter out the majority of
true negatives and therefore enable
the design of efficient systems.
(Figure: the whole inventory -- "Coca Cola", "Fanta", "Sprite", "Vitali", … --
is summarized by only 26 bits in the "Available items" vector.)
102
Disadvantages
• A Bloom filter exhibits false positives
• When the Bloom filter says "yes", it is not
100% certain
• "Coffee" is a false positive in our
example (so is "Orange juice")
• But when the Bloom filter says "no", it
is always true
(Figure: h1("Coffee") = "C", h2("Coffee") = "O", h3("Coffee") = "F"; all three
bits happen to be set, so the filter answers "yes" even though our grocery
does not actually sell coffee.)
• The probability of error goes up as the bit
array gets filled up. 103
Bloom Filter Analysis
• Key factors of a bloom filter
• Number of hash functions, k
• Number of bits in the bit vector, m
• Number of items expected to be stored, n
• Uniformity of the hash functions
• False positive analysis
• Bits are set nk times in total after n items are stored
• Each time, the probability that a particular bit is set is 1/m
(so it stays unset with probability 1 − 1/m)
• Assuming truly uniform hash functions, the probability that a
particular bit is still 0 after n items are stored is (1 − 1/m)^(nk),
so the probability that it is set is 1 − (1 − 1/m)^(nk)
• The probability of a false positive is therefore (1 − (1 − 1/m)^(nk))^k
• We can carefully select m, n, and k to achieve an
acceptable false positive rate, e.g., 1%
104
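As an illustration (numbers chosen here, not from the slides): with m = 1000 bits, n = 100 stored items, and k = 7 hash functions, kn/m = 0.7, so the approximate false positive rate is (1 − e^(−0.7))^7 ≈ (0.503)^7 ≈ 0.008, i.e., roughly 0.8%.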
• Note:
lim_{m→∞} (1 − 1/m)^m = e^(−1)
• For large m:
(1 − 1/m)^m ≈ e^(−1)
(1 − 1/m)^k = ((1 − 1/m)^m)^(k/m) ≈ e^(−k/m)
(1 − 1/m)^(nk) ≈ e^(−nk/m)
• So:
(1 − (1 − 1/m)^(nk))^k ≈ (1 − e^(−nk/m))^k
105
Hash Functions
• The hash functions used in a Bloom filter should
be independent and uniformly distributed. They
should also be as fast as possible (cryptographic
hashes such as SHA-1, though widely used, are
therefore not very good choices).
• A short survey of bloom filter implementations:
• Chromium uses HashMix
• python-bloomfilter uses cryptographic hashes
• Plan9 uses a simple hash as proposed in Mitzenmacher
2005
• Sdroege Bloom filter uses fnv1a
• Squid uses MD5
106
How Big Should Bloom Filter Be?
• It's a nice property of Bloom filters that you can
adjust the false positive rate of your filter. A larger
filter will have fewer false positives, and a smaller one
more.
• Your false positive rate will be approximately
(1 − e^(−kn/m))^k, so you can just plug in the number n of
elements you expect to insert and try various
values of k and m to configure your filter for your
application
107
How Many Hash Functions?
• The more hash functions you have, the slower your
bloom filter, and the quicker it fills up. If you have
too few, however, you may suffer too many false
positives.
• Since you have to pick k when you create the filter,
you'll have to ballpark what range you expect n to be
in. Once you have that, you still have to choose a
potential m (the number of bits) and k (the number
of hash functions).
• It seems a difficult optimization problem, but
fortunately, given an m and an n, we have a function
to choose the optimal value of k: (m/n)ln(2)
108
Determine The Size of a Bloom Filter
• To choose the size of a bloom filter
1. Choose a ballpark value for n
2. Choose a value for m
3. Calculate the optimal value of k
4. Calculate the error rate for our chosen values
of n, m, and k. If it's unacceptable, return to
step 2 and change m; otherwise we're done.
109
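A sketch of this sizing procedure in C++, using the standard closed-form choices m = −n·ln(p)/(ln 2)² and k = (m/n)·ln 2 (the target values below are illustrative):

#include <cmath>
#include <cstdio>

int main()
{
    const double n = 1000.0;   // expected number of elements (step 1)
    const double p = 0.01;     // acceptable false positive rate
    const double ln2 = std::log(2.0);

    double m = -n * std::log(p) / (ln2 * ln2);   // bits in the vector (step 2)
    double k = (m / n) * ln2;                    // optimal number of hashes (step 3)

    // verify with the approximation (1 - e^(-kn/m))^k (step 4)
    double rate = std::pow(1.0 - std::exp(-k * n / m), k);

    std::printf("m = %.0f bits, k = %.1f hashes, estimated rate = %.4f\n",
                m, k, rate);
    return 0;
}

For n = 1000 and p = 0.01 this gives roughly m ≈ 9,600 bits and k ≈ 7, with an estimated rate close to the 1% target.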
Summary
• 8.1 Introduction
• 8.2 Static hashing
• 8.3 Dynamic hashing
• 8.4 Bloom filters
110