04 UW Hashing
Database System Principles
Notes 5: Hashing and More
Indexes
… WHERE key = 22
[Figure: an Index entry (key 22, row pointer) points to the row with key 22 in the Table]
Types of Indexes
There are several types of index structures available to
you, depending on the need:
– A B+-tree index is in the form of a balanced tree
and is the default index type.
– A bitmap index has a bitmap for each distinct value
indexed, and each bit position represents a row
that may or may not contain the indexed value.
This is best for low-cardinality columns.
– B+-tree index
[Figure: B+-tree showing root, branch, and index entry]
– Bitmap index
CREATE BITMAP INDEX emp2 ON EMP (deptno);
DBA_INDEXES.INDEX_TYPE = 'BITMAP'
• online,3,AAAPvCAAFAAAAFaAAq
• online,3,AAAPvCAAFAAAAFaAAt
CS 245 Notes 5 8
Bitmap Indexes
[Figure: a table stored in File 3 (Blocks 10, 11, 12) and its bitmap index]
Index entries: <Key, Start ROWID, End ROWID, Bitmap>
<Blue, 10.0.3, 12.8.3, 1000100100010010100>
<Green, 10.0.3, 12.8.3, 0001010000100100000>
<Red, 10.0.3, 12.8.3, 0100000011000001001>
<Yellow, 10.0.3, 12.8.3, 0010001000001000010>
Bitmap Indexes
Structure of a bitmap index
A bitmap index is also organized as a B-tree, but the leaf node stores a
bitmap for each key value instead of a list of ROWIDs. Each bit in the
bitmap corresponds to a possible ROWID, and if the bit is set, it means
that the row with the corresponding ROWID contains the key value.
As shown in the diagram, the leaf node of a bitmap index contains the
following:
• An entry header that contains the number of columns and lock info
• Key values consisting of length and value pairs for each key column
• Start ROWID
• End ROWID
• A bitmap segment consisting of a string of bits. (The bit is set when the
corresponding row contains the key value and is unset when the row
does not contain the key value. The Oracle server uses a patented
compression technique to store bitmap segments.)
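The structure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Oracle's implementation: the function name `build_bitmap_index` is made up, list positions stand in for ROWIDs, and compression is ignored.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Return {key value: bit string}; bit k is '1' when row k holds that value."""
    n = len(values)
    bitmaps = defaultdict(lambda: ["0"] * n)
    for row, value in enumerate(values):
        bitmaps[value][row] = "1"
    return {value: "".join(bits) for value, bits in bitmaps.items()}

# Region column of the six-row example table in these notes (Empno 101..106).
region = ["east", "central", "west", "west", "central", "central"]
index = build_bitmap_index(region)
for value, bitmap in sorted(index.items()):
    print(value, bitmap)
```

Each distinct value costs one bit per row, which is why the notes recommend bitmap indexes for low-cardinality columns.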
Bitmap Index
Empno  Status    Region   Gender  Info
101    single    east     male    bracket_1
102    married   central  female  bracket_4
103    married   west     female  bracket_2
104    divorced  west     male    bracket_4
105    single    central  female  bracket_2
106    married   central  female  bracket_3
Bitmaps for Region (columns: east, central, west; one row per Empno):
101  1 0 0
102  0 1 0
103  0 0 1
104  0 0 1
105  0 1 0
106  0 1 0
Using Bitmap Indexes
SELECT COUNT(*)
FROM CUSTOMER
WHERE MARITAL_STATUS = 'married'
AND REGION IN ('central','west');
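A query like this can be answered from the bitmaps alone: OR the bitmaps for the listed regions, AND the result with the bitmap for 'married', and count the set bits. The sketch below assumes the six-row example table stands in for CUSTOMER and uses Python integers as bitmaps; the helper name `bitmap` is hypothetical.

```python
# (status, region) for Empno 101..106 from the example table; used here as a
# stand-in for the CUSTOMER table, which the notes do not show.
rows = [
    ("single", "east"), ("married", "central"), ("married", "west"),
    ("divorced", "west"), ("single", "central"), ("married", "central"),
]
STATUS, REGION = 0, 1

def bitmap(column, value):
    """Integer whose bit k is set when row k has `value` in `column`."""
    bits = 0
    for k, row in enumerate(rows):
        if row[column] == value:
            bits |= 1 << k
    return bits

married = bitmap(STATUS, "married")
central_or_west = bitmap(REGION, "central") | bitmap(REGION, "west")
matches = married & central_or_west      # rows satisfying both predicates
count = bin(matches).count("1")          # COUNT(*) without touching the table
print(count)
```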
Range queries
SELECT * FROM T
WHERE Age BETWEEN 44 AND 55
AND Salary BETWEEN 100 AND 200;
(AGE, SALARY) records of T, in order:
(25, 60) (45, 60) (50, 75) (50, 100) (50, 120) (70, 110)
(85, 140) (30, 260) (25, 400) (45, 350) (50, 275) (60, 260)
Bitvectors for Age
25: 100000001000
30: 000000010000
45: 010000000100
50: 001110000010
…
Bitvectors for Salary
60: 110000000000
75: 001000000000
100: 000100000000
110: …
Range queries
SELECT * FROM T
WHERE Age BETWEEN 44 AND 55
AND Salary BETWEEN 100 AND 200;
Age values in range:
45: 010000000100
50: 001110000010
OR -> 011110000110
Salary values in range:
100: 000100000000
110: 000001000000
120: 000010000000
140: 000000100000
OR -> 000111100000
AND of the two results: 000110000000, i.e., records (50, 100) and (50, 120)
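The range-query steps can be checked with a short Python sketch over the twelve (AGE, SALARY) records of the example. The helper names `bitvectors` and `or_range` are made up for illustration, and bit vectors are kept as plain strings for readability.

```python
records = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
           (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
N = len(records)

def bitvectors(column):
    """Map each distinct value in `column` to a bit string over all records."""
    vectors = {}
    for pos, rec in enumerate(records):
        bits = vectors.setdefault(rec[column], ["0"] * N)
        bits[pos] = "1"
    return {value: "".join(bits) for value, bits in vectors.items()}

def or_range(vectors, low, high):
    """OR together the bit vectors of all values in [low, high]."""
    result = 0
    for value, bits in vectors.items():
        if low <= value <= high:
            result |= int(bits, 2)
    return format(result, "0{}b".format(N))

age = or_range(bitvectors(0), 44, 55)         # Age BETWEEN 44 AND 55
salary = or_range(bitvectors(1), 100, 200)    # Salary BETWEEN 100 AND 200
both = format(int(age, 2) & int(salary, 2), "0{}b".format(N))
hits = [rec for rec, bit in zip(records, both) if bit == "1"]
print(age, salary, both, hits)
```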
Compressed bitmaps
Run-length encoding:
run: a sequence of i 0’s followed by a 1
Sample bitmap: 10000001000000000100010000000000001
Example: …0100000000000001
run with 13 0’s
i in binary: 1101
j = 4 (the number of bits in the binary form of i) -> in unary: 1110
Encoding for the run: 11101101 (j in unary, followed by i in binary)
Compressed bitmaps
Encoding for i = 0: 00
Encoding for i = 1: 01
Decoding:
Decode the following: 11101101001011 -> 13, 0, 3
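The encoding and decoding rules can be sketched in Python as follows. The function names are invented for illustration, and the sketch assumes, as the examples on the slides suggest, that j is written as j-1 ones followed by a zero and that trailing zeros after the last 1 are simply not encoded.

```python
def encode_run(i):
    """Encode a run of i zeros followed by a single 1."""
    binary = format(i, "b")                  # i in binary, j = len(binary) bits
    unary = "1" * (len(binary) - 1) + "0"    # j in unary
    return unary + binary

def encode_bitmap(bits):
    """Encode every run of a bit string; zeros after the last 1 are dropped."""
    out, zeros = [], 0
    for bit in bits:
        if bit == "0":
            zeros += 1
        else:
            out.append(encode_run(zeros))
            zeros = 0
    return "".join(out)

def decode(code):
    """Return the list of run lengths stored in an encoded string."""
    runs, pos = [], 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":              # read the unary length prefix
            j += 1
            pos += 1
        pos += 1                             # skip the 0 that ends the prefix
        runs.append(int(code[pos:pos + j], 2))
        pos += j
    return runs

print(encode_run(13))                # the 13-zero run from the slide
print(decode("11101101001011"))      # the decoding exercise from the slide
```

Because the unary prefix tells the decoder how many bits of i follow, the runs can be concatenated without separators.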
Hashing
[Figure: key -> h(key) -> buckets (typically 1 disk block each)]
Two alternatives
(1) key -> h(key) -> records
(direct reference, not flexible)
(2) key -> h(key) -> Index -> record
(indirect reference, more flexible)
Typical implementation
Example hash function
This may not be the best function …
Read Knuth Vol. 3 if you really need to select a good function.
Next: example to illustrate inserts, overflows, deletes
EXAMPLE 2 records/bucket
INSERT:
h(a) = 1
h(b) = 2
h(c) = 1
h(d) = 0
Buckets after these inserts:
0: d
1: a, c
2: b
3:
INSERT:
h(e) = 1
Bucket 1 is full, so e goes to an overflow block:
0: d
1: a, c -> overflow: e
2: b
3:
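The insert example can be replayed with a small Python sketch. The hash values are the ones given on the slides; the `Bucket` class and its overflow pointer are an assumed, simplified stand-in for disk blocks.

```python
BUCKET_CAPACITY = 2
H = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}   # hash values from the example

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None    # next block in the overflow chain, if any

    def insert(self, key):
        if len(self.records) < BUCKET_CAPACITY:
            self.records.append(key)
        else:
            if self.overflow is None:
                self.overflow = Bucket()
            self.overflow.insert(key)

    def chain(self):
        """Records of the primary block, then of each overflow block."""
        blocks, block = [], self
        while block is not None:
            blocks.append(list(block.records))
            block = block.overflow
        return blocks

table = [Bucket() for _ in range(4)]
for key in "abcde":
    table[H[key]].insert(key)

for number, bucket in enumerate(table):
    print(number, bucket.chain())
```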
EXAMPLE: deletion
Delete: e, f, c
[Figure: buckets 0 to 3 holding records a, b, c, d, e, f, g. After f is deleted: maybe move “g” up. After c is deleted, the slide shows d next to c's former slot.]
Rule of thumb:
• Try to keep space utilization between 50% and 80%
  Utilization = (# keys used) / (total # keys that fit)
• If < 50%, wasting space
• If > 80%, overflows significant
  (depends on how good the hash function is and on # keys/bucket)
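As a quick arithmetic check of the rule of thumb, here is a small Python sketch. The function names and the sample numbers (5 keys in 4 buckets of 2 keys each) are assumptions chosen for illustration.

```python
def utilization(keys_used, buckets, keys_per_bucket):
    """Utilization = # keys used / total # keys that fit."""
    return keys_used / (buckets * keys_per_bucket)

def advice(u):
    """Apply the 50%-80% rule of thumb to a utilization value."""
    if u < 0.5:
        return "wasting space"
    if u > 0.8:
        return "overflows significant"
    return "ok"

u = utilization(keys_used=5, buckets=4, keys_per_bucket=2)
print(u, advice(u))
```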
How do we cope with growth?
• Overflows and reorganizations
• Dynamic hashing
• Extensible
• Linear
Extensible hashing: two ideas
[Figure: h(K)[i] selects a directory entry, which points to a bucket]
h(K)[i]: means the first i bits of the output of the hash function
Example: h(k) is 4 bits; 2 keys/bucket
(The number after each bucket is the number of hash bits that bucket uses.)
Directory, i = 1:
0 -> [0001] 1
1 -> [1001, 1100] 1
Insert 1010: the bucket for prefix 1 is full, so it splits on 2 bits into [1001, 1010] 2 and [1100] 2
New directory, i = 2:
00 -> [0001] 1
01 -> [0001] 1 (same bucket as 00)
10 -> [1001, 1010] 2
11 -> [1100] 2
Example continued
Before, i = 2: 00, 01 -> [0001] 1; 10 -> [1001, 1010] 2; 11 -> [1100] 2
Insert: 0111, 0000
0111 fits, giving [0001, 0111] 1
0000 overflows that bucket, so it splits on 2 bits; the directory stays at i = 2:
00 -> [0000, 0001] 2
01 -> [0111] 2
10 -> [1001, 1010] 2
11 -> [1100] 2
Example continued
Before, i = 2: 00 -> [0000, 0001] 2; 01 -> [0111] 2; 10 -> [1001, 1010] 2; 11 -> [1100] 2
Insert: 1001
The bucket for prefix 10 is full and already uses i = 2 bits, so the directory doubles to i = 3 and the bucket splits on 3 bits:
000, 001 -> [0000, 0001] 2
010, 011 -> [0111] 2
100 -> [1001, 1001] 3
101 -> [1010] 3
110, 111 -> [1100] 2
Extensible hashing: deletion
• No merging of blocks
• Merge blocks and cut directory if possible
  (Reverse insert procedure)
Deletion example:
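Note: Still need overflow chains
• Example: many records with duplicate keys
insert 1100 into the full bucket [1101, 1100] 1
[Figure: if we split, the resulting buckets use 2 bits, but splitting cannot separate records that have the same key]
Solution: overflow chains
[Figure: the bucket [1101, 1100] keeps using 1 bit and the new 1100 goes into an overflow block chained to it]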
Summary Extensible hashing
+ Can handle growing files
   • with less wasted space
   • with no full reorganizations
- Indirection
   (Not bad if directory in memory)
- Directory doubles in size
   (Now it fits, now it does not)
Linear hashing
• Another dynamic hashing scheme
Two ideas:
(a) Use i low order bits of hash
[Figure: a b-bit hash value 01110101; the i low-order bits are used, and i grows]
Example b=4 bits, i =2, 2 keys/bucket
00: 0000, 1010
01: 0101, 1111
10, 11: future growth buckets
m = 01 (max used bucket) or n=2 (number of used buckets)
Example b=4 bits, i =2, 2 keys/bucket
• insert 0101
00: 0000, 1010
01: 0101, 1111 -> overflow: 0101
m = 01 (max used bucket)
Grow to m = 10: bucket 10 is added and 1010 moves there from bucket 00
Grow to m = 11: bucket 11 is added and 1111 moves there from bucket 01
00: 0000
01: 0101, 0101
10: 1010
11: 1111
Example Continued: How to grow beyond this?
i = 2 -> 3
When do we expand file?
• Keep track of: (# used slots) / (total # of slots)
  (# used slots follows from # records; total # of slots follows from # buckets)
Summary Linear Hashing
+ Can handle growing files
   • with less wasted space
   • with no full reorganizations
Example: BAD CASE
[Figure: a bucket marked "Very full"]
Hashing depends on data distribution!
Summary
Hashing
- How it works
- Dynamic hashing
- Extensible
- Linear
Next:
• Indexing vs Hashing
• Index definition in SQL
• Multiple key access
Indexing vs Hashing
• Hashing good for probes given key
e.g., SELECT …
FROM R
WHERE R.A = 5
• INDEXING (Including B Trees) good for Range Searches:
e.g., SELECT …
FROM R
WHERE R.A > 5 AND R.A < 10;
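The contrast can be illustrated in Python with a dict standing in for a hash index and a sorted list standing in for an ordered index such as a B-tree. The data values and function names are hypothetical; row ids are simply list positions.

```python
import bisect

values = [7, 3, 5, 9, 5, 12, 8]     # hypothetical R.A values; row id = position

# Hash index: one lookup finds all rows with a given key, but keys are unordered.
hash_index = {}
for rid, a in enumerate(values):
    hash_index.setdefault(a, []).append(rid)

# Ordered index: (key, row id) pairs kept sorted, as in the leaves of a B-tree.
sorted_index = sorted((a, rid) for rid, a in enumerate(values))
sorted_keys = [a for a, _ in sorted_index]

def probe(a):
    """WHERE R.A = a, answered by the hash index."""
    return hash_index.get(a, [])

def range_search(low, high):
    """WHERE R.A > low AND R.A < high, answered by two binary searches."""
    start = bisect.bisect_right(sorted_keys, low)
    end = bisect.bisect_left(sorted_keys, high)
    return [rid for _, rid in sorted_index[start:end]]

print(probe(5))
print(range_search(5, 10))
```

The hash index cannot serve the range query without probing every possible key or scanning all buckets, because neighboring keys hash to unrelated buckets.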
Index definition in SQL
Note: CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …)
OR PARAMETERS (e.g. Load Factor, Size of Hash, ...)
Multi-key Index
Strategy I:
• Use one index, say Dept.
• Get all Dept = “Toy” records and check their salary
[Figure: index I1]
Strategy II:
Strategy III:
One idea:
[Figure: indexes I1 and I3]
Example
Dept Index: Art, Sales, Toy
Salary Index: 10k, 15k, 17k, 21k
Salary Index: 12k, 15k, 15k, 19k
Example Record: Name=Joe, DEPT=Sales, SAL=15k
For which queries is this index good?
Find RECs Dept = “Sales” AND SAL = 20k
Find RECs Dept = “Sales” AND SAL > 20k
Find RECs Dept = “Sales”
Find RECs SAL = 20k
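One way to explore these four queries is a two-level index sketched in Python: a dict on Dept whose entries are sorted salary lists. This is an assumed structure for illustration; only the record for Joe comes from the example, and the other records and all function names are hypothetical. Salaries are in thousands.

```python
import bisect

# Hypothetical records; only Joe (Sales, 15k) appears in the notes.
records = [("Joe", "Sales", 15), ("Ann", "Sales", 21), ("Bob", "Art", 10),
           ("Eve", "Toy", 15), ("Sam", "Sales", 20)]

# First level: Dept -> second-level index; second level: sorted (salary, name).
index = {}
for name, dept, sal in records:
    index.setdefault(dept, []).append((sal, name))
for entries in index.values():
    entries.sort()

MAX_NAME = chr(0x10FFFF)    # sorts after any name, used as an upper sentinel

def dept_and_salary_range(dept, low, high):
    """Names with the given Dept and low <= SAL <= high: one lookup per level."""
    entries = index.get(dept, [])
    start = bisect.bisect_left(entries, (low, ""))
    end = bisect.bisect_right(entries, (high, MAX_NAME))
    return [name for _, name in entries[start:end]]

def salary_only(sal):
    """No Dept given: every second-level index has to be searched."""
    return [name for entries in index.values() for s, name in entries if s == sal]

print(dept_and_salary_range("Sales", 20, 20))             # Dept = Sales AND SAL = 20k
print(dept_and_salary_range("Sales", 21, float("inf")))   # Dept = Sales AND SAL > 20k
print(dept_and_salary_range("Sales", 0, float("inf")))    # Dept = Sales
print(salary_only(15))                                    # SAL = 15k, all depts scanned
```

In this sketch the first three queries start from a single Dept entry, while the salary-only query visits the salary index of every department.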