02 Blocking - Addional

The document discusses different ways that a database management system can store data on disk, including using records, blocks, files, and indexes. It covers how records are organized into blocks and files, and different techniques for storing, modifying, and finding records using dense and sparse indexes as well as primary and secondary indexes. The storage and indexing of data impacts performance, space utilization, and the ability to efficiently access and modify records.

Uploaded by

rakhaadit
Copyright
© © All Rights Reserved
DBMS Storage Overview

Values

Records

Blocks

Files

Memory
1
Record
§  Collection of related data items (called
Fields)
§  Typically used to store one tuple
§  Example: Sells record consisting of
§  bar field
§  beer field
§  price field

2
Record Metadata
§  For fixed-length records, schema
contains the following information:
§  Number of fields
§  Type of each field
§  Order in record
§  For variable-length records, every
record contains this information in its
header

3
Record Header
§  Reserved part at the beginning of a
record
§  Typically contains:
§  Record type (which Schema?)
§  Record length (for skipping)
§  Time stamp (last access)

4
Files
§  Files consist of blocks containing records
§  How to place records into blocks?

§  Assumptions: fixed-length blocks, a single file

5
Files
§  Options for storing records in blocks:
1.  Separating records
2.  Spanned vs. unspanned
3.  Sequencing
4.  Indirection

6
1. Separating Records
Block R1 R2 R3

a. no need to separate - fixed size recs.


b. special marker
c. give record lengths (or offsets)
i.  within each record
ii.  in block header
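
Option c.ii can be sketched as follows. This is a minimal illustration, not a real DBMS layout: the 64-byte block size, the 2-byte little-endian offsets, and the names `pack_block` / `unpack_block` are all assumptions made for the example.

```python
import struct

BLOCK_SIZE = 64  # hypothetical, tiny block size for illustration


def pack_block(records):
    """Store variable-length records; the block header holds their offsets."""
    # Header: record count, then one start offset per record plus an end offset.
    header_size = 2 + 2 * (len(records) + 1)
    offsets = [header_size]
    for rec in records:
        offsets.append(offsets[-1] + len(rec))
    if offsets[-1] > BLOCK_SIZE:
        raise ValueError("records do not fit into one block (unspanned)")
    header = struct.pack(f"<H{len(offsets)}H", len(records), *offsets)
    return (header + b"".join(records)).ljust(BLOCK_SIZE, b"\x00")


def unpack_block(block):
    """Recover the records by slicing between consecutive header offsets."""
    (count,) = struct.unpack_from("<H", block, 0)
    offsets = struct.unpack_from(f"<{count + 1}H", block, 2)
    return [block[offsets[i]:offsets[i + 1]] for i in range(count)]
```

Because the header lists every offset, a reader can jump to record i without scanning the records before it.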

7
2. Spanned vs Unspanned
§  Unspanned: records must be in one block
R1 R2 R3 R4 R5

§  Spanned: one record in two or more blocks


R1 R2 R3(a) | R3(b) R4 R5 R6 R7(a)

§  Unspanned much simpler, but wastes space


§  Spanned essential if record size > block size
8
3. Sequencing
§  Ordering records in a file (and in the blocks)
by some key value
§  Can be used for binary search
§  Options:
a.  Next record is physically contiguous
R1 Next (R1) ...
b.  Records are linked

R1 Next (R1)
9
4. Indirection
§  How does one refer to records?
a.  Physical address (disk id, cylinder, head,
sector, offset in block)
b.  Logical record ids and a mapping table
Indirection map

Rec ID 17 → Physical addr. 2:34:5:742:2340

§  Tradeoff between flexibility and cost
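
The mapping-table variant (option b) might look like the sketch below. The class name `IndirectionMap` and its methods are hypothetical; the address tuple mirrors the (disk id, cylinder, head, sector, offset) form from the slide.

```python
class IndirectionMap:
    """Maps logical record ids to physical addresses (illustrative only)."""

    def __init__(self):
        self._table = {}

    def register(self, rec_id, addr):
        if rec_id in self._table:
            raise KeyError(f"record id {rec_id} already in use")
        self._table[rec_id] = addr

    def resolve(self, rec_id):
        # One extra lookup per access: the cost side of the tradeoff.
        return self._table[rec_id]

    def move(self, rec_id, new_addr):
        # Only the map changes; pointers holding rec_id stay valid.
        self._table[rec_id] = new_addr


imap = IndirectionMap()
imap.register(17, (2, 34, 5, 742, 2340))
imap.move(17, (2, 34, 5, 743, 0))
```

Moving a record touches only the map entry, which is the flexibility that the extra lookup pays for.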


10
Modification of Records
How to handle the following operations
on the record level?
1.  Insertion
2.  Deletion
3.  Update

11
1. Insertion
§  Easy case: records not in sequence
§  Insert new record at end of file
§  If records are fixed-length, insert new
record in deleted slot
§  Difficult case: records are sorted
§  Find position and slide following records
§  If records are sequenced by linking, insert
overflow blocks

12
2. Deletion
a.  Immediately reclaim space by shifting
other records or removing overflows
b.  Mark deleted and list as free for re-use
§  Tradeoffs:
§  How expensive is immediate reclaim?
§  How much space is wasted?

13
Problem with Deletion
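§  Dangling pointers: a pointer to a deleted record R1 points to "?"
§  When using physical addresses: leave a deletion mark in place of the record
§  The mark is never reused; the rest of the record's space may be reused
§  When using logical addresses: the map holds one entry (ID → LOC) per record
§  Never reuse a deleted ID (e.g. 7788) nor its space in the map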
14
3. Update
§  If records are fixed-length and the
order is not affected:
§  Fetch the record, modify it, write it back
§  Otherwise:
§  Delete the old record
§  Insert the new record overwriting the
tombstones from the deletion

15
Pointer Swizzling
§  Swizzling = replacement of physical
addresses by memory addresses when
loading blocks into memory
§  Automatic Swizzling: swizzle all
addresses when loading a block
(need to swizzle all pointers from and to
the block)
§  Swizzling on Demand: use addresses
which are invalid as memory addresses
16
Data Organization
§  There are millions of ways to organize
the data on disk
§  Trade-offs among flexibility, space utilization,
complexity, and performance

17
Summary 9
More things you should know:
§  Memory Hierarchy
§  Storage on hard disks
§  Values, Records, Blocks, Files
§  Storing and modifying records

18
Index Structures

19
Finding Records
§  How do we find the records for a query?
§  Example: SELECT * FROM Sells
§  Need to examine every block in every file
§  Group blocks into files by relation!
§  Example: SELECT * FROM Sells
WHERE price = 20;
§  Need to examine every block in the file

20
Finding Records
§  Indexes narrow the search to (almost)
only the relevant blocks

Value → Index → Blocks holding records → Matching records

§  Indexes can be dense or sparse


21
Dense Index
Dense index (one entry per record):
10 20 30 40 | 50 60 70 80 | 90 100 110 120
Sequential file (two records per block):
10 20 | 30 40 | 50 60 | 70 80 | 90 100
22
Sparse Index
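2nd-level sparse index (one entry per index block):
10 90 170 250 | 330 410 490 570
Sparse index (one entry per file block):
10 30 50 70 | 90 110 130 150 | 170 190 210 230
Sequential file (two records per block):
10 20 | 30 40 | 50 60 | 70 80 | 90 100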
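
A lookup through a sparse index can be sketched as below. The in-memory lists stand in for disk blocks, and the function name `sparse_lookup` is an assumption for illustration; the key values follow the sequential file shown on the slide.

```python
from bisect import bisect_right

# Sequential file: two records per block, sorted by key.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]
# Sparse index: the first key of every block.
sparse_index = [block[0] for block in blocks]


def sparse_lookup(key):
    """Return the number of the block holding key, or None if absent."""
    # Find the last index entry whose key is <= the search key.
    pos = bisect_right(sparse_index, key) - 1
    if pos < 0:
        return None  # smaller than every key in the file
    # Unlike a dense index, the block itself must be read to decide.
    return pos if key in blocks[pos] else None
```

Here `sparse_lookup(40)` finds block 1, while `sparse_lookup(45)` still has to inspect block 1 before it can report that the key does not exist.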
23
Deletion from Sparse Index
§  Delete 40
Sparse index: 10 30 50 70 | 90 110 130 150 (unchanged)
File before:  10 20 | 30 40 | 50 60 | 70 80
File after:   10 20 | 30 -- | 50 60 | 70 80

24
Deletion from Sparse Index
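§  Delete 30
Index before: 10 30 50 70 | 90 110 130 150
Index after:  10 40 50 70 | 90 110 130 150
File before:  10 20 | 30 40 | 50 60 | 70 80
File after:   10 20 | 40 -- | 50 60 | 70 80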

25
Deletion from Sparse Index
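§  Delete 30 & 40
Index before: 10 30 50 70 | 90 110 130 150
Index after:  10 50 70 -- | 90 110 130 150
File before:  10 20 | 30 40 | 50 60 | 70 80
File after:   10 20 | -- -- | 50 60 | 70 80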

26
Insertion into Sparse Index
§  Insert 35
Sparse index: 10 30 50 70 | 90 110 130 150 (unchanged)
File before:  10 20 | 30 -- | 50 60 | 70 80
File after:   10 20 | 30 35 | 50 60 | 70 80

27
Insertion into Sparse Index
§  Insert 25
Sparse index: 10 30 50 70 | 90 110 130 150
File:         10 20 | 30 35 | 50 60 | 70 80
§  Block 10 20 is full: 25 is placed in a separate (overflow) block
28
Sparse vs Dense
§  Sparse uses less index space per record
(can keep more of index in memory)
§  Sparse allows multi-level indexes
§  Dense can tell if record exists without
accessing it
§  Dense needed for secondary indexes
§  Primary index = order of records in storage
§  Secondary index = impose different order
29
Secondary Index
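2nd-level index: 10 20 50
Secondary index (dense, sorted): 10 10 20 20 | 20 30 40 50 | 50 60
Sequential file (not sorted by this key): 20 40 | 10 20 | 50 30 | 10 50 | 60 20
§  Careful when looking for 20: its index entries sit in two index blocks
30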
Secondary Index
2nd-level index: 10 50
Secondary index (one entry per distinct key): 10 20 30 40 | 50 60
Sequential file (not sorted by this key): 20 40 | 10 20 | 50 30 | 10 50 | 60 20
31
Combining Indexes
§  SELECT * FROM Sells WHERE beer =
'Od.Cl.' AND price = 20
[Figure: a beer index and a price index, each with buckets of pointers into Sells]

§  Just intersect buckets in memory!


32
Conventional Indexes
§  Sparse, Dense, Multi-level, ...
§  Advantages:
§  Simple
§  Sequential index is good for scans
§  Disadvantage:
§  Inserts expensive
§  Lose sequentiality and balance

33
Example: Unbalanced Index
Index blocks (sequential): 10 20 30 | 40 50 60 | 70 80 90
Overflow area (not sequential): 39 31 33 35 36 32 38 34

34
B+Trees

35
Idea
§  Conventional indexes are fixed-level
§  Give up sequentiality of the index in
favour of balance
§  B+Tree = variant of B-Tree
§  Allows index tree to grow as needed
§  Ensures that all blocks are between half
used and completely full

36
Characteristics
§  Parameter n determines number of keys
and pointers per node
§  With 4-byte keys, 8-byte pointers and 4096-byte
blocks, the maximal n is 340 (4n + 8(n+1) < 4096)
§  Leaves contain at least n/2 key-pointer pairs
to records and a pointer to the next leaf
§  Interior nodes contain at least (n-1)/2 keys
and at least n/2 pointers to other nodes
§  No restrictions for the root node
37
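
The bound on n can be checked with a few lines. The function name `max_keys` is an assumption; the default sizes are the ones from the slide (a node holds n keys and n+1 pointers in one block).

```python
def max_keys(block_size=4096, key_size=4, ptr_size=8):
    """Largest n such that n keys and n+1 pointers fit into one block."""
    # n*key_size + (n+1)*ptr_size <= block_size, solved for n.
    return (block_size - ptr_size) // (key_size + ptr_size)


n = max_keys()
used_bytes = 4 * n + 8 * (n + 1)
```

With the default sizes this gives n = 340, using 4088 of the 4096 bytes.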
Example: B+Tree (n=3)
                 [42]
      [11 23]              [64]
[3 6 9] [11 15 17] [23 31 37]   [42 57] [64 85]

38
Example: Leaf node

Keys: 42 57
§  Pointer 1: to record with key 42
§  Pointer 2: to record with key 57
§  Last pointer: to next leaf

39
Example: Interior node

Keys: 11 23
§  Pointer 1: to keys K < 11
§  Pointer 2: to keys 11 ≤ K < 23
§  Pointer 3: to keys 23 ≤ K

40
Restrictions
            Full node     Min. node
Non-leaf    11 23 42      64
Leaf        11 15 17      64 85
§  A pointer counts even when null
41
Insertion
§  If there is place in the appropriate leaf,
just insert it there
§  Otherwise:
§  Split the leaf in two and divide the keys
§  Insert the smallest value reachable through
the right node into the parent node
§  Recurse until there is enough room
§  Special case: Splitting the root results in
a new root
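
The leaf-level step can be sketched as below. This covers only a single leaf (no parent update or recursion), the function name `insert_into_leaf` is hypothetical, and the choice to keep the larger half on the left is an assumption.

```python
from bisect import insort


def insert_into_leaf(leaf_keys, key, n):
    """Insert key into a leaf holding at most n keys.

    Returns (left, right, separator). right and separator are None when
    no split was needed; otherwise separator is the smallest key of the
    right node and has to be inserted into the parent.
    """
    keys = list(leaf_keys)
    insort(keys, key)  # keep the leaf sorted
    if len(keys) <= n:
        return keys, None, None
    mid = (len(keys) + 1) // 2  # left half gets the extra key, if any
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]
```

For n = 3, inserting 15 into [11, 17] needs no split, while inserting 64 into [42, 57, 85] yields [42, 57] and [64, 85] with separator 64 for the parent.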
42
Example: Insertion
§  Insert 85: there is room in leaf [42 57]

            [11 23 42]
[3 6 9] [11 17] [23 31 37] [42 57 85]

43
Example: Insertion
§  Insert 15: there is room in leaf [11 17]

            [11 23 42]
[3 6 9] [11 15 17] [23 31 37] [42 57 85]

44
Example: Insertion
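§  Insert 64: leaf [42 57 85] is full and splits into [42 57] and [64 85];
the parent [11 23 42] is full as well and splits, which creates the new root [42]

                 [42]
      [11 23]              [64]
[3 6 9] [11 15 17] [23 31 37]   [42 57] [64 85]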

45
Deletion
§  If there are enough keys left in the
appropriate leaf, just delete the key
§  Otherwise:
§  If there is a direct sibling with more than
minimum key, steal one!
§  If not, join the node with a direct sibling and
delete the smallest value reachable through
the former right sibling from its parent
§  Special case: If the root contains only
one pointer after deletion, delete it
46
Example: Deletion
§  Delete 9: leaf [3 6 9] keeps enough keys

                 [42]
      [11 23]              [64]
[3 6] [11 15 17] [23 31 37]   [42 57] [64 85]

47
Example: Deletion
§  Delete 3: leaf [6] is too small and steals 11 from its sibling [11 15 17];
the parent key 11 becomes 15

                 [42]
      [15 23]              [64]
[6 11] [15 17] [23 31 37]   [42 57] [64 85]

48
Example: Deletion
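§  Delete 11: leaf [6] is too small and its sibling [15 17] is at the minimum,
so both are joined and key 15 is deleted from the parent

                 [42]
      [23]                 [64]
[6 15 17] [23 31 37]   [42 57] [64 85]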

49
Example: Deletion
§  Delete 17, 37: both leaves keep enough keys

                 [42]
      [23]                 [64]
[6 15] [23 31]   [42 57] [64 85]

50
Example: Deletion
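§  Delete 31: leaf [23] is joined with [6 15]; the interior node [23] loses its
key and is joined with [64]; the old root is left with one pointer and is deleted

        [42 64]
[6 15 23] [42 57] [64 85]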

51
Efficiency
§  Need to load one block for each level!
§  With n = 340 and an average fill of 255
pointers, we can index 255^3 ≈ 16.6
million records in only 3 levels
§  There are at most 342 blocks in the first
two levels
§  First two levels can be kept in memory
using less than 1.4 Mbyte
§  With the first two levels in memory, a lookup
only needs to access one index block!
52
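
The figures on this slide can be reproduced with a short calculation. The 4096-byte block size is taken from the earlier formula for n; the variable names are illustrative.

```python
n = 340            # maximal number of keys per node
avg_fill = 255     # average number of pointers actually used per node
block_size = 4096  # bytes per block

# Three levels, each fanning out by avg_fill pointers.
indexed_records = avg_fill ** 3

# Worst case for the first two levels: the root plus n + 1 children.
max_blocks_top_two = 1 + (n + 1)
memory_bytes = max_blocks_top_two * block_size
memory_mib = memory_bytes / 2 ** 20
```

This gives 16,581,375 indexed records, 342 blocks, and 1,400,832 bytes (about 1.34 MiB) for the first two levels.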
Range Queries
§  Queries often restrict an attribute to a
range of values
§  Example:
SELECT * FROM Sells
WHERE price > 20;
§  Records are found efficiently by searching
for value 20 and then traversing the leafs
§  Can also be used if there is both an upper
and a lower limit
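
The leaf traversal might look like the following sketch. The descent through the interior nodes is omitted: the starting leaf is passed in directly, and the names `Leaf`, `make_chain`, and `range_scan` are assumptions. The leaves are those of the n = 3 example tree.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # pointer to the next leaf


def make_chain(key_lists):
    """Build leaves and link each one to its right neighbour."""
    leaves = [Leaf(keys) for keys in key_lists]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves


def range_scan(start_leaf, low, high=None):
    """Collect keys with low < key (and key <= high, if high is given)."""
    result = []
    leaf = start_leaf
    while leaf is not None:
        for key in leaf.keys:
            if high is not None and key > high:
                return result  # keys are sorted, so the scan can stop
            if key > low:
                result.append(key)
        leaf = leaf.next
    return result


leaves = make_chain([[3, 6, 9], [11, 15, 17], [23, 31, 37], [42, 57], [64, 85]])
# A search for 20 would end in leaf [11 15 17], i.e. leaves[1].
above_20 = range_scan(leaves[1], 20)
```

Passing `high` as well handles the case with both a lower and an upper limit.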
53
Summary 10
More things you should know:
§  Dense Index, Sparse Index
§  Multi-Level Indexes
§  Primary vs Secondary Index
§  Structure of B+Trees
§  Insertion and Deletion in B+Trees

54
Hash Tables

55
Hash Table in Primary Storage
§  Main parameter B = number of buckets
§  Hash function h maps key to numbers
from 0 to B-1
§  Bucket array indexed from 0 to B-1
§  Each bucket contains exactly one value
§  Strategy for handling conflicts

56
Example: B = 4
§  Insert c (h(c) = 3)
§  Insert a (h(a) = 1)
§  Insert e (h(e) = 1): conflict!
§  Alternative 1:
§  Search for free bucket,
e.g. by Linear Probing (e goes to bucket 2)
§  Alternative 2:
§  Add overflow bucket (e is attached to bucket 1)
Buckets with Alternative 1: 0: (empty), 1: a, 2: e, 3: c
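
Alternative 1 can be sketched as below, using the hash values from the example. The function name `probe_insert` is hypothetical, and each bucket holds exactly one value.

```python
def probe_insert(table, key, h):
    """Insert key by linear probing; returns the bucket that was used."""
    num_buckets = len(table)
    for step in range(num_buckets):
        slot = (h(key) + step) % num_buckets  # wrap around at the end
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


hash_values = {"c": 3, "a": 1, "e": 1}  # h as given in the example
table = [None] * 4                      # B = 4
slots = [probe_insert(table, key, hash_values.get) for key in "cae"]
```

The conflict for e at bucket 1 is resolved by moving on to the next free bucket, which is bucket 2.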
57
Hash Function
§  Hash function should ensure hash values
are equally distributed
§  For integer key K, take h(K) = K modulo B
§  For string key, add up the numeric values
of the characters and compute the
remainder modulo B
§  For really good hash functions, see Donald
Knuth, The Art of Computer Programming:
Volume 3 – Sorting and Searching
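
The two simple hash functions described above can be written as follows. Taking the Unicode code point as the "numeric value" of a character is an assumption; the function names are illustrative.

```python
def hash_int(key, num_buckets):
    """h(K) = K modulo B for integer keys."""
    return key % num_buckets


def hash_str(key, num_buckets):
    """Sum of the character values, modulo B."""
    return sum(ord(ch) for ch in key) % num_buckets
```

Note that the string variant maps all permutations of the same characters to the same bucket, which is one reason to look for better functions.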
58
Hash Table in Secondary Storage
§  Each bucket is a block containing f
key-pointer pairs
§  Conflict resolution by probing potentially
leads to a large number of I/Os
§  Thus, conflict resolution by adding
overflow buckets
§  Need to ensure we can directly access
bucket i given number i
59
Example: Insertion, B=4, f=2
§  Insert a, b, c, d, e, g, i (in this order)
§  Resulting buckets:
0: d
1: a e → overflow block: i
2: b
3: c g
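
The example can be replayed with the sketch below. The hash values in `ASSUMED_H` are not stated on the slide; they are assumptions chosen to be consistent with the resulting buckets. `BucketBlock` is a hypothetical name, and each block stands for one disk block.

```python
class BucketBlock:
    """One block of at most f keys, with an optional overflow block."""

    def __init__(self, f):
        self.f = f
        self.keys = []
        self.overflow = None

    def insert(self, key):
        if len(self.keys) < self.f:
            self.keys.append(key)
            return
        if self.overflow is None:
            self.overflow = BucketBlock(self.f)  # chain a new block
        self.overflow.insert(key)

    def chain(self):
        """Keys per block along the chain; its length is the I/O count."""
        block, result = self, []
        while block is not None:
            result.append(list(block.keys))
            block = block.overflow
        return result


# Assumed hash values (not given in the source).
ASSUMED_H = {"a": 1, "b": 2, "c": 3, "d": 0, "e": 1, "g": 3, "i": 1}
B, f = 4, 2
buckets = [BucketBlock(f) for _ in range(B)]
for key in "abcdegi":
    buckets[ASSUMED_H[key]].insert(key)
```

Bucket 1 ends up as a chain of two blocks, so a lookup for i costs two I/Os instead of one.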

60
Efficiency
§  Very efficient if buckets use only one
block: one I/O per lookup
§  Space utilization is #keys in hash
divided by total #keys that fit
§  Try to keep between 50% and 80%:
§  < 50% wastes space
§  > 80% significant number of overflows

61
Dynamic Hashing
§  How to grow and shrink hash tables?
§  Alternative 1:
§  Use overflows and reorganizations
§  Alternative 2:
§  Use dynamic hashing
§  Extensible Hash Tables
§  Linear Hash Tables

62
Extensible Hash Tables
§  Hash function computes sequence of k
bits for each key
k = 8 00110101
i=3
§  At any time, use only the first i bits
§  Introduce indirection by a pointer array
§  Pointer array grows and shrinks (size 2^i)
§  Pointers may share data blocks (store
number of bits used for block in j)
63
Example: k = 4, f = 2

i = 2
00 → block [0001 0111], j = 1
01 → same block as 00
10 → block [1001 1010], j = 2
11 → block [1100], j = 2

64
Insertion
§  Find destination block B for key-pointer pair
§  If there is room, just insert it
§  Otherwise, let j denote the number of bits
used for block B
§  If j = i, increment i by 1:
§  Double the length of the bucket array to 2^(i+1)
§  Adjust pointers such that for old bit strings w,
w0 and w1 point to the same bucket
§  Retry insertion
65
Insertion
§  If j < i, add a new block B':
§  Key-pointer pairs with (j+1)st bit = 0 stay in B
§  Key-pointer pairs with (j+1)st bit = 1 go to B'
§  Set number of bits used to j+1 for B and B'
§  Adjust pointers in bucket array such that for
all w where previously w0 and w1 pointed to B,
now w1 points to B'
§  Retry insertion
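
The two insertion cases can be combined into the sketch below. It is a simplified in-memory model: the key itself serves as its k-bit hash value, the table starts with i = 1, and overflow blocks for keys that agree in all k bits are not handled. The names `Block` and `ExtensibleHash` are assumptions.

```python
class Block:
    def __init__(self, j):
        self.j = j      # number of bits used for this block
        self.keys = []


class ExtensibleHash:
    def __init__(self, k, f):
        self.k, self.f, self.i = k, f, 1
        self.array = [Block(1), Block(1)]  # pointer array of size 2**i

    def _slot(self, key):
        return key >> (self.k - self.i)    # first i bits of the k bits

    def insert(self, key):
        while True:                        # each pass is one "retry"
            block = self.array[self._slot(key)]
            if len(block.keys) < self.f:
                block.keys.append(key)
                return
            if block.j == self.k:
                raise RuntimeError("would need an overflow block")
            if block.j == self.i:
                # Double the array: w0 and w1 point where w pointed.
                self.array = [b for b in self.array for _ in (0, 1)]
                self.i += 1
                continue
            # j < i: split the block on its (j+1)st bit.
            new_block = Block(block.j + 1)
            bit = self.k - block.j - 1
            old_keys, block.keys = block.keys, []
            block.j += 1
            for old in old_keys:
                target = new_block if (old >> bit) & 1 else block
                target.keys.append(old)
            for slot, b in enumerate(self.array):
                if b is block and (slot >> (self.i - block.j)) & 1:
                    self.array[slot] = new_block


table = ExtensibleHash(k=4, f=2)
for key in (0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000):
    table.insert(key)
```

Replaying the keys of the examples (0001, 1001, 1100, then 1010, 0111, 0000) ends with i = 2 and four separate blocks, each with j = 2.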

66
Example: Insert, k = 4, f = 2
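§  Insert 1010
Before (i = 1):
0 → block [0001], j = 1
1 → block [1001 1100], j = 1
After (i = 2):
00, 01 → block [0001], j = 1
10 → block [1001 1010], j = 2
11 → block [1100], j = 2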

67
Example: Insert, k = 4, f = 2
§  Insert 0111 (i = 2)
00, 01 → block [0001 0111], j = 1
10 → block [1001 1010], j = 2
11 → block [1100], j = 2

68
Example: Insert, k = 4, f = 2
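§  Insert 0000 (i = 2): block [0001 0111] is full and j = 1 < i, so it is split
00 → block [0001 0000], j = 2
01 → block [0111], j = 2
10 → block [1001 1010], j = 2
11 → block [1100], j = 2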

69
Deletion
§  Find destination block B for key-pointer pair
§  Delete the key-pointer pair
§  If the two blocks referenced by w0 and w1
together contain at most f keys, merge them,
decrease their j by 1, and adjust pointers
§  If there is no block with j = i, reduce the
pointer array to size 2^(i-1) and decrease i by 1

70
Example: Delete, k = 4, f = 2
§  Delete 0000 (i = 2): blocks [0001] and [0111] are merged
00, 01 → block [0001 0111], j = 1
10 → block [1001 1010], j = 2
11 → block [1100], j = 2

71
Example: Delete, k = 4, f = 2
§  Delete 0111 (i = 2)
00, 01 → block [0001], j = 1
10 → block [1001 1010], j = 2
11 → block [1100], j = 2

72
Example: Delete, k = 4, f = 2
§  Delete 1010: blocks [1001] and [1100] are merged; no block with j = i
remains, so i is decreased from 2 to 1
0 → block [0001], j = 1
1 → block [1001 1100], j = 1

73
Efficiency
§  As long as pointer array fits into
memory and hash function behaves
nicely, just need one I/O per lookup
§  Overflows can still happen if many key-
pointer pairs hash to the same bit string
§  Solve by adding overflow blocks

74
