04 UW Hashing


Ullman et al.: Database System Principles
Notes 5: Hashing and More

Indexes
[Figure: the query … WHERE key = 22 looks up key 22 in the index; the index entry (key, row pointer) points to the row with key 22 in the table.]
Types of Indexes
There are several types of index structures available, depending on the need:
– A B+-tree index is in the form of a balanced tree and is the default index type.
– A bitmap index has a bitmap for each distinct value indexed, and each bit position represents a row that may or may not contain the indexed value. This is best for low-cardinality columns.
B+ tree index
[Figure: a B+-tree with root, branch, and leaf levels. Each leaf index entry consists of an index entry header, the key column length, the key column value, and the ROWID.]
B+-Tree Index
Structure of a B+-tree index
At the top of the index is the root, which contains entries that point to the
next level in the index. At the next level are branch blocks, which in turn
point to blocks at the next level in the index. At the lowest level are the
leaf nodes, which contain the index entries that point to rows in the
table. The leaf blocks are doubly linked to facilitate the scanning of the
index in an ascending as well as descending order of key values.

Format of index leaf entries
An index entry is made up of the following components:
– An entry header, which stores the number of columns and locking information
– Key column length-value pairs, which define the size of a column in the key followed by the value for the column (the number of such pairs is at most the number of columns in the index)
– The ROWID of a row that contains the key values

Index Options
– A unique index ensures that every indexed value is unique.
CREATE UNIQUE INDEX emp1 ON EMP (ename);
DBA_INDEXES.UNIQUENESS → 'UNIQUE'

– Bitmap index
CREATE BITMAP INDEX emp2 ON EMP (deptno);
DBA_INDEXES.INDEX_TYPE → 'BITMAP'

– An index can have its key values stored in ascending or descending order.
CREATE INDEX emp3 ON emp (sal DESC);
DBA_IND_COLUMNS.DESCEND → 'DESC'

– A composite index is one that is based on more than one column.
CREATE INDEX emp4 ON emp (empno, sal);
DBA_IND_COLUMNS.COLUMN_POSITION → 1, 2, …

– A function-based index is an index based on a function's return value.
CREATE INDEX emp5 ON emp (SUBSTR(ename, 3, 4));
DBA_IND_EXPRESSIONS.COLUMN_EXPRESSION → 'SUBSTR …'

– A compressed index has repeated key values removed.
CREATE INDEX emp6 ON emp (empno, ename, sal) COMPRESS 2;
DBA_INDEXES.COMPRESSION → 'ENABLED'
DBA_INDEXES.PREFIX_LENGTH → 2
Compressed index example
Uncompressed leaf entries:
• online,0,AAAPvCAAFAAAAFaAAa
• online,0,AAAPvCAAFAAAAFaAAg
• online,0,AAAPvCAAFAAAAFaAAl
• online,3,AAAPvCAAFAAAAFaAAq
• online,3,AAAPvCAAFAAAAFaAAt
Compressed leaf entries (each repeated prefix stored once):
• online,0
  • AAAPvCAAFAAAAFaAAa
  • AAAPvCAAFAAAAFaAAg
  • AAAPvCAAFAAAAFaAAl
• online,3
  • AAAPvCAAFAAAAFaAAq
  • AAAPvCAAFAAAAFaAAt

CS 245 Notes 5 8
Bitmap Indexes
[Figure: a table stored in file 3, blocks 10 to 12, and the leaf entries of a bitmap index on it.]
Index leaf entries: <Key, Start ROWID, End ROWID, Bitmap>
<Blue, 10.0.3, 12.8.3, 1000100100010010100>
<Green, 10.0.3, 12.8.3, 0001010000100100000>
<Red, 10.0.3, 12.8.3, 0100000011000001001>
<Yellow, 10.0.3, 12.8.3, 0010001000001000010>
Bitmap Indexes
Structure of a bitmap index
A bitmap index is also organized as a B-tree, but the leaf node stores a
bitmap for each key value instead of a list of ROWIDs. Each bit in the
bitmap corresponds to a possible ROWID, and if the bit is set, it means
that the row with the corresponding ROWID contains the key value.
As shown in the diagram, the leaf node of a bitmap index contains the following:
– An entry header that contains the number of columns and lock info
– Key values consisting of length and value pairs for each key column
– Start ROWID
– End ROWID
– A bitmap segment consisting of a string of bits. (The bit is set when the corresponding row contains the key value and is unset when the row does not contain the key value. The Oracle server uses a patented compression technique to store bitmap segments.)
Bitmap Index
Empno  Status    Region   Gender  Info
101    single    east     male    bracket_1
102    married   central  female  bracket_4
103    married   west     female  bracket_2
104    divorced  west     male    bracket_4
105    single    central  female  bracket_2
106    married   central  female  bracket_3

Bitmaps for the Region column (one bit per row, in table order):
Empno  REGION='east'  REGION='central'  REGION='west'
101    1              0                 0
102    0              1                 0
103    0              0                 1
104    0              0                 1
105    0              1                 0
106    0              1                 0
Using Bitmap Indexes
SELECT COUNT(*)
FROM CUSTOMER
WHERE MARITAL_STATUS = 'married'
AND REGION IN ('central','west');
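
A minimal Python sketch of how such a query could be answered from bitmaps alone. It assumes the six example rows (Empno 101 to 106) are the rows being queried; the helper name `bitmap` is illustrative and not part of any database API. Each bitmap is held as an integer whose leftmost bit stands for the first row.

```python
# Column values in table order (Empno 101..106), copied from the example table.
status = ["single", "married", "married", "divorced", "single", "married"]
region = ["east", "central", "west", "west", "central", "central"]


def bitmap(column, value):
    """Build an integer bitmap; the leftmost bit stands for the first row."""
    bits = 0
    for v in column:
        bits = (bits << 1) | int(v == value)
    return bits


married = bitmap(status, "married")   # 011001
central = bitmap(region, "central")   # 010011
west = bitmap(region, "west")         # 001100

# WHERE status = 'married' AND region IN ('central', 'west')
result = married & (central | west)
count = bin(result).count("1")        # COUNT(*) is the number of set bits

print(format(result, "06b"), count)
```

The count is obtained without touching the table rows, which is why bitmap indexes suit this kind of query.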
Range queries
SELECT * FROM T
WHERE Age BETWEEN 44 AND 55
AND Salary BETWEEN 100 AND 200;

Table T (12 records, in this order):
AGE  SALARY
25   60
45   60
50   75
50   100
50   120
70   110
85   140
30   260
25   400
45   350
50   275
60   260

Bitvectors for Age:
25: 100000001000
30: 000000010000
45: 010000000100
50: 001110000010
…
Bitvectors for Salary:
60: 110000000000
75: 001000000000
100: 000100000000
110: 000001000000
…
Range queries
SELECT * FROM T
WHERE Age BETWEEN 44 AND 55
AND Salary BETWEEN 100 AND 200;

Ages in the range: 45 and 50
45: 010000000100
50: 001110000010
OR -> 011110000110

Salaries in the range: 100, 110, 120, 140
100: 000100000000
110: 000001000000
120: 000010000000
140: 000000100000
OR -> 000111100000
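
The two OR results still have to be ANDed to answer the query. The following Python sketch does this for the 12 example records; the function names `bitvectors` and `range_bits` are illustrative assumptions, not part of the slides.

```python
# The 12 records of T, in table order.
ages = [25, 45, 50, 50, 50, 70, 85, 30, 25, 45, 50, 60]
salaries = [60, 60, 75, 100, 120, 110, 140, 260, 400, 350, 275, 260]
n = len(ages)


def bitvectors(column):
    """One integer bitvector per distinct value; leftmost bit = first record."""
    vectors = {}
    for pos, value in enumerate(column):
        vectors[value] = vectors.get(value, 0) | (1 << (n - 1 - pos))
    return vectors


def range_bits(vectors, low, high):
    """OR together the bitvectors of all values in [low, high]."""
    bits = 0
    for value, vec in vectors.items():
        if low <= value <= high:
            bits |= vec
    return bits


age_bits = range_bits(bitvectors(ages), 44, 55)
salary_bits = range_bits(bitvectors(salaries), 100, 200)
both = age_bits & salary_bits

# Positions of the set bits identify the matching records.
matches = [(ages[p], salaries[p]) for p in range(n) if (both >> (n - 1 - p)) & 1]
print(format(both, "012b"), matches)
```

Only the fourth and fifth records, (50, 100) and (50, 120), satisfy both range conditions.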
Compressed bitmaps
1's in a bit vector will be very rare, so we compress the vector.

Run-length encoding:
run: a sequence of i 0's followed by a 1
e.g., 10000001000000000100010000000000001

1. Determine how many bits the binary representation of i has. This is the number j.
2. We represent j in "unary" by j-1 1's and a single 0.
3. Then we follow with i in binary.
Compressed bitmaps

Example: …0100000000000001
run with 13 0’s

i in binary: 1101
j = 4 -> in unary: 1110
Encoding for the run: 11101101
Compressed bitmaps
Encoding for i = 0: 00
Encoding for i = 1: 01

We ignore the trailing 0's, but not the leading 0's!

Decoding:
Decode the following: 11101101001011 -> runs of length 13, 0, 3

Original bitvector: 0000000000000110001
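
The encoding and decoding steps can be sketched in Python as follows. This is a minimal illustration of the scheme described on the slides, working on strings of '0' and '1' characters; the function names `encode` and `decode` are assumptions made for the example.

```python
def encode(bits: str) -> str:
    """Run-length encode a bit string; trailing 0's are dropped."""
    out = []
    run = 0  # number of 0's seen since the last 1
    for bit in bits:
        if bit == "0":
            run += 1
        else:
            binary = format(run, "b")               # i in binary ("0" when i = 0)
            unary = "1" * (len(binary) - 1) + "0"   # j as j-1 ones and a single 0
            out.append(unary + binary)
            run = 0
    return "".join(out)


def decode(code: str) -> str:
    """Rebuild the bit string (without trailing 0's) from its encoding."""
    out = []
    pos = 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":   # count the unary 1's
            j += 1
            pos += 1
        pos += 1                  # skip the 0 that ends the unary part
        run = int(code[pos:pos + j], 2)
        pos += j
        out.append("0" * run + "1")
    return "".join(out)


print(encode("0000000000000110001"))
print(decode("11101101001011"))
```

Because the trailing 0's are not encoded, the decoder needs the number of records from elsewhere if the original length matters.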


Hashing
key → h(key) → bucket holding <key>
Buckets (typically 1 disk block each)

Two alternatives
(1) key → h(key) → bucket of records
(direct reference, not flexible)

(2) key → h(key) → index bucket of (key, pointer) entries → record
(indirect reference, more flexible)

• Alt (2) for "secondary" search key

Typical implementation
[Figure not recovered.]

Example hash function
• Key = 'x1 x2 … xn', an n-byte character string
• Have b buckets
• h: add x1 + x2 + … + xn
– compute sum modulo b

• This may not be the best function …
• Read Knuth Vol. 3 if you really need to select a good function.

Good hash function: the expected number of keys/bucket is the same for all buckets

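
A small Python sketch of the example hash function above (sum of the key's bytes modulo b). Encoding the string as UTF-8 to obtain bytes is an assumption made for the example.

```python
def h(key: str, b: int) -> int:
    """Add the bytes x1 + x2 + ... + xn of the key and reduce modulo b."""
    return sum(key.encode("utf-8")) % b


print(h("abc", 4))   # 97 + 98 + 99 = 294, and 294 mod 4 = 2
print(h("cba", 4))   # same bytes in another order, so the same bucket
```

The second call shows one reason this may not be the best function: keys that are permutations of each other always collide.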
Within a bucket:
• Do we keep keys sorted?
• Yes, if CPU time is critical & inserts/deletes are not too frequent

Next: example to illustrate inserts, overflows, deletes

EXAMPLE 2 records/bucket
Buckets 0 to 3, initially empty.
INSERT:
h(a) = 1
h(b) = 2
h(c) = 1
h(d) = 0

After these inserts:
bucket 0: d
bucket 1: a, c
bucket 2: b
bucket 3: (empty)

INSERT:
h(e) = 1
Bucket 1 is full, so e goes into an overflow block chained to bucket 1:
bucket 1: a, c -> overflow: e

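
The insert example can be replayed with a short Python sketch. The hash values are taken from the slides; the `Bucket` class, its block layout, and the delete behaviour (compacting the chain so that later keys move up) are assumptions made for illustration.

```python
BLOCK_CAPACITY = 2  # 2 records/bucket, as in the example

# Hash values given on the slides.
hash_values = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}


class Bucket:
    def __init__(self):
        self.blocks = [[]]  # primary block, followed by any overflow blocks

    def insert(self, key):
        # Use the first block in the chain that still has room.
        for block in self.blocks:
            if len(block) < BLOCK_CAPACITY:
                block.append(key)
                return
        self.blocks.append([key])  # chain a new overflow block

    def delete(self, key):
        # Remove the key and compact the chain so later keys move up.
        keys = [k for block in self.blocks for k in block]
        keys.remove(key)
        self.blocks = [keys[i:i + BLOCK_CAPACITY]
                       for i in range(0, len(keys), BLOCK_CAPACITY)] or [[]]


buckets = [Bucket() for _ in range(4)]
for key in ["a", "b", "c", "d", "e"]:
    buckets[hash_values[key]].insert(key)

for number, bucket in enumerate(buckets):
    print(number, bucket.blocks)
```

After the five inserts, bucket 1 holds a and c in its primary block and e in an overflow block.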
EXAMPLE: deletion
Before:
bucket 0: a
bucket 1: b, c -> overflow: d
bucket 2: e
bucket 3: f, g

Delete: e, f, c
• delete e: bucket 2 becomes empty
• delete f: maybe move "g" up
• delete c: d moves from the overflow block into bucket 1

Rule of thumb:
• Try to keep space utilization between 50% and 80%
Utilization = (# keys used) / (total # keys that fit)
• If < 50%, wasting space
• If > 80%, overflows significant (depends on how good the hash function is & on # keys/bucket)

How do we cope with growth?
• Overflows and reorganizations
• Dynamic hashing
– Extensible
– Linear

Extensible hashing: two ideas

(a) Use i of b bits output by hash function
h(K) → 00110101 (b bits)
use the first i bits; i grows over time…

(b) Use directory
h(K)[i] → directory entry → pointer to bucket
h(K)[i] means the first i bits of the hash function's output

Example: h(k) is 4 bits; 2 keys/bucket
Start: i = 1
0 → bucket [0001] (1 bit used)
1 → bucket [1001, 1100] (1 bit used)

Insert 1010: bucket 1 is full, so it is split on 2 bits and the directory doubles.
New directory: i = 2
00, 01 → bucket [0001] (1 bit used)
10 → bucket [1001, 1010] (2 bits used)
11 → bucket [1100] (2 bits used)

Example continued
Insert: 0111, 0000
• 0111 goes into bucket [0001], which becomes [0001, 0111].
• 0000 finds that bucket full. The bucket uses only 1 bit while i = 2, so it is split without doubling the directory.
i = 2
00 → bucket [0000, 0001] (2 bits used)
01 → bucket [0111] (2 bits used)
10 → bucket [1001, 1010] (2 bits used)
11 → bucket [1100] (2 bits used)

Example continued
Insert: 1001
Bucket 10 [1001, 1010] is full and already uses i = 2 bits, so the directory doubles to i = 3 and the bucket is split.
i = 3
000, 001 → bucket [0000, 0001] (2 bits used)
010, 011 → bucket [0111] (2 bits used)
100 → bucket [1001, 1001] (3 bits used)
101 → bucket [1010] (3 bits used)
110, 111 → bucket [1100] (2 bits used)

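
The insert procedure of the example can be sketched in Python as below. This is an illustrative sketch, not a disk-based implementation: keys are their own 4-bit hash values, the class and method names are assumptions, and a bucket that cannot be split any further raises an error instead of chaining an overflow block.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # number of leading hash bits this bucket uses
        self.keys = []


class ExtensibleHash:
    def __init__(self, hash_bits=4, capacity=2):
        self.b = hash_bits
        self.capacity = capacity
        self.i = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, h):
        return h >> (self.b - self.i)  # the first i bits of h

    def insert(self, h):
        bucket = self.directory[self._index(h)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(h)
            return
        if bucket.depth == self.b:
            raise OverflowError("no bits left to split on; overflow chain needed")
        if bucket.depth == self.i:
            # Double the directory: every old entry becomes two adjacent entries.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.i += 1
        # Split the bucket on one more bit.
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        old_keys, bucket.keys = bucket.keys, []
        for idx, b in enumerate(self.directory):
            if b is bucket and (idx >> (self.i - bucket.depth)) & 1:
                self.directory[idx] = sibling
        for k in old_keys:
            self.directory[self._index(k)].keys.append(k)
        self.insert(h)  # retry; may trigger another split

    def dump(self):
        return {format(idx, f"0{self.i}b"): [format(k, f"0{self.b}b") for k in b.keys]
                for idx, b in enumerate(self.directory)}


table = ExtensibleHash()
for key in [0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1001]:
    table.insert(key)
print(table.i, table.dump())
```

Replaying the inserts of the example ends with i = 3, with directory entries 100 and 101 pointing to separate buckets and the other entries still sharing buckets in pairs.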
Extensible hashing: deletion

• No merging of blocks
• Merge blocks and cut directory if possible (reverse insert procedure)

Deletion example:
• Run through the insert example in reverse!
But: typically not implemented

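Note: Still need overflow chains
• Example: many records with duplicate keys
insert 1100 into a full bucket [1101, 1100] (1 bit used)
If we split: the two copies of 1100 still fall into the same bucket (2 bits used), and with more duplicates no split could ever separate them.

Solution: overflow chains
insert 1100: add overflow block
bucket [1101, 1100] (1 bit used) -> overflow block [1100]
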
Summary Extensible hashing
+ Can handle growing files
  – with less wasted space
  – with no full reorganizations
- Indirection
  (Not bad if directory in memory)
- Directory doubles in size
  (Now it fits, now it does not)

Linear hashing
• Another dynamic hashing scheme
Two ideas:
(a) Use i low order bits of hash
h(K) = 01110101 (b bits); the last i bits are used, and i grows
(b) File grows linearly

Example b=4 bits, i =2, 2 keys/bucket

bucket 00: 0000, 1010
bucket 01: 0101, 1111
buckets 10, 11: future growth buckets
m = 01 (max used bucket) or n = 2 (number of used buckets)

Rule: If h(k)[i] ≤ m, then look at bucket h(k)[i]
else, look at bucket h(k)[i] - 2^(i-1)

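
The lookup rule can be written as a small Python function. Here h(k)[i] is taken to be the i low-order bits of the hash value, as in idea (a) of linear hashing; the function name `bucket_address` is an assumption made for the example.

```python
def bucket_address(h: int, i: int, m: int) -> int:
    """Bucket number for hash value h, given i bits in use and max used bucket m."""
    low = h & ((1 << i) - 1)        # h(k)[i]: the i low-order bits
    if low <= m:
        return low
    return low - (1 << (i - 1))     # that bucket does not exist yet: drop the top bit


# The four keys of the example, with i = 2 and m = 01.
for key in [0b0000, 0b1010, 0b0101, 0b1111]:
    print(format(key, "04b"), "->", format(bucket_address(key, 2, 0b01), "02b"))
```

With i = 2 and m = 01, keys ending in 10 and 11 are redirected to buckets 00 and 01, which matches the bucket contents in the example.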
Example b=4 bits, i =2, 2 keys/bucket
• insert 0101
• can have overflow chains!

bucket 00: 0000, 1010
bucket 01: 0101, 1111 -> overflow: 0101
buckets 10, 11: future growth buckets
m = 01 (max used bucket)

Rule: If h(k)[i] ≤ m, then look at bucket h(k)[i]
else, look at bucket h(k)[i] - 2^(i-1)

Note
• In textbook, n is used instead of m
• n = m+1 (n = number of used buckets)
• With m = 01: n = 10 in binary (n = 2 in decimal)

Example b=4 bits, i =2, 2 keys/bucket
• insert 0101 (continued): the file grows one bucket at a time
• m = 01 → 10: bucket 10 is added; bucket 00 is split and 1010 moves to bucket 10
• m = 10 → 11: bucket 11 is added; bucket 01 is split and 1111 moves to bucket 11, while both copies of 0101 stay in bucket 01

Result:
bucket 00: 0000
bucket 01: 0101, 0101
bucket 10: 1010
bucket 11: 1111
m = 11 (max used bucket)

Example Continued: How to grow beyond this?
• i = 2 → 3: buckets 00, 01, 10, 11 become 000, 001, 010, 011; buckets 100, 101, 110, 111 are available for growth
• m = 011 → 100: bucket 100 is added; bucket 000 is split (0000 stays in 000)
• m = 100 → 101: bucket 101 is added; bucket 001 is split and both copies of 0101 move to bucket 101

Result:
bucket 000: 0000
bucket 001: (empty)
bucket 010: 1010
bucket 011: 1111
bucket 100: (empty)
bucket 101: 0101, 0101
m = 101 (max used bucket)

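
The whole linear hashing example can be replayed with the Python sketch below. It is an in-memory illustration under stated assumptions: keys are their own 4-bit hash values, a bucket list longer than the capacity stands for an overflow chain, and growth is triggered by calling `grow()` by hand rather than by a utilization threshold. The class and method names are not from the slides.

```python
class LinearHashFile:
    def __init__(self, i=2, used_buckets=2, capacity=2):
        self.i = i
        self.capacity = capacity
        # buckets[k] lists the keys of bucket k; entries beyond
        # `capacity` would sit in an overflow chain on disk.
        self.buckets = [[] for _ in range(used_buckets)]

    @property
    def m(self):
        return len(self.buckets) - 1   # max used bucket

    def address(self, h):
        low = h & ((1 << self.i) - 1)  # the i low-order bits
        return low if low <= self.m else low - (1 << (self.i - 1))

    def insert(self, h):
        self.buckets[self.address(h)].append(h)

    def utilization(self):
        used = sum(len(b) for b in self.buckets)
        return used / (len(self.buckets) * self.capacity)

    def grow(self):
        """Increase m by one (and i if needed), then split one old bucket."""
        new = len(self.buckets)
        if new == (1 << self.i):       # all i-bit bucket numbers are in use
            self.i += 1
        self.buckets.append([])
        old = new - (1 << (self.i - 1))  # bucket whose keys may belong in `new`
        keys, self.buckets[old] = self.buckets[old], []
        for k in keys:
            self.buckets[self.address(k)].append(k)


f = LinearHashFile()
for key in [0b0000, 0b1010, 0b0101, 0b1111, 0b0101]:
    f.insert(key)
before = [list(b) for b in f.buckets]
for _ in range(4):                     # m: 01 -> 10 -> 11 -> 100 -> 101
    f.grow()
print(f.i, f.m, f.buckets)
```

Each call to `grow()` touches only one existing bucket, which is why no full reorganization is needed.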
• When do we expand file?

• Keep track of: U = (# used slots) / (total # of slots)
(# used slots ~ # records, total # of slots ~ # buckets)

• If U > threshold then increase m (and maybe i)

Summary Linear Hashing
+ Can handle growing files
  – with less wasted space
  – with no full reorganizations
+ No indirection like extensible hashing
- Can still have overflow chains

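Example: BAD CASE
[Figure: some buckets are very full while others are very empty. Relieving the full buckets means moving m up to them ("Need to move m here…"), which would waste space.]
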
Hashing depends on data distribution!

Summary
Hashing
- How it works
- Dynamic hashing
  - Extensible
  - Linear

Next:

• Indexing vs Hashing
• Index definition in SQL
• Multiple key access

Indexing vs Hashing
• Hashing good for probes given key
e.g., SELECT …
FROM R
WHERE R.A = 5

• INDEXING (including B-trees) good for range searches:
e.g., SELECT …
FROM R
WHERE R.A > 5 AND R.A < 10;

Index definition in SQL

• CREATE INDEX name ON rel (attr)
• CREATE UNIQUE INDEX name ON rel (attr)
  (defines candidate key)
• DROP INDEX name

Note: CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …)
OR PARAMETERS (e.g. Load Factor, Size of Hash, ...)

... at least in SQL...

In Oracle you can!

Note: ATTRIBUTE LIST ⇒ MULTIKEY INDEX (next)
e.g., CREATE INDEX foo ON R(A,B,C)

Multi-key Index

Motivation: Find records where DEPT = “Toy” AND SAL > 50k

What kind of indexes can support this query?

Strategy I:
• Use one index, say Dept.
• Get all Dept = “Toy” records and check their salary

Strategy II:

• Use 2 indexes; manipulate pointers
Dept index: Toy → list of pointers
Sal index: Sal > 50k → list of pointers
AND ⇒ intersection of the pointer lists

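
A minimal Python sketch of Strategy II. The two indexes and all record pointers below are hypothetical data invented for the illustration; only the idea of intersecting pointer sets comes from the slide.

```python
from bisect import bisect_right

# Hypothetical Dept index: department -> list of record pointers.
dept_index = {"Toy": [3, 7, 9, 12], "Sales": [1, 4]}

# Hypothetical Sal index: (salary in thousands, record pointer), sorted by salary.
sal_index = [(20, 1), (45, 7), (55, 3), (60, 4), (70, 9), (80, 12)]


def pointers_with_salary_above(threshold):
    """Pointers of all records with salary > threshold, found by binary search."""
    start = bisect_right(sal_index, (threshold, float("inf")))
    return {ptr for _, ptr in sal_index[start:]}


toy = set(dept_index.get("Toy", []))
high_paid = pointers_with_salary_above(50)

# AND of the two conditions = intersection of the two pointer sets.
result = sorted(toy & high_paid)
print(result)
```

Only the records whose pointers survive the intersection need to be fetched from the file.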
Strategy III:

• Multiple Key Index
One idea: [Figure: an index I1 whose entries point to further indexes I2 and I3.]

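Example
[Figure: a Dept index with entries Art, Sales, Toy; Dept entries point to Salary indexes (salary lists shown: 10k, 15k, 17k, 21k and 12k, 15k, 15k, 19k), which point to the records.]
Example record: Name=Joe, DEPT=Sales, SAL=15k
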
For which queries is this index good?
• Find RECs Dept = “Sales” AND SAL = 20k
• Find RECs Dept = “Sales” AND SAL > 20k
• Find RECs Dept = “Sales”
• Find RECs SAL = 20k

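
The four query types can be tried against a toy version of such a two-level index. In the Python sketch below, the nested structure (Dept first, then a sorted Salary list per department), the record identifiers, and the assignment of salaries to departments are all hypothetical; the sketch only illustrates why the first three queries fit the index and the last one does not.

```python
from bisect import bisect_left, bisect_right

# Hypothetical two-level index: Dept -> sorted list of (salary in thousands, record id).
index = {
    "Art": [(10, "r1"), (15, "r2"), (17, "r3"), (21, "r4")],
    "Sales": [(12, "r5"), (15, "r6"), (15, "r7"), (19, "r8")],
}


def dept_and_sal_eq(dept, sal):
    entries = index.get(dept, [])
    keys = [s for s, _ in entries]  # rebuilt on each call for brevity
    return [rid for _, rid in entries[bisect_left(keys, sal):bisect_right(keys, sal)]]


def dept_and_sal_gt(dept, sal):
    entries = index.get(dept, [])
    keys = [s for s, _ in entries]
    return [rid for _, rid in entries[bisect_right(keys, sal):]]


def dept_only(dept):
    return [rid for _, rid in index.get(dept, [])]


def sal_only(sal):
    # No Dept value to start from: every second-level index must be searched.
    return [rid for dept in index for rid in dept_and_sal_eq(dept, sal)]


print(dept_and_sal_eq("Sales", 15))
print(dept_and_sal_gt("Sales", 15))
print(dept_only("Sales"))
print(sal_only(15))
```

A query on salary alone has to probe the Salary index of every department, so this index helps it far less than the queries that fix the department.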
