0% found this document useful (0 votes)
6 views

01 Dbms Storage

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views

01 Dbms Storage

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 61

SCALABLE DATA MANAGEMENT SYSTEMS

STORAGE: SINGLE-NODE DBMSs

Systems@TUDa http://tuda.systems/
ANNOUNCEMENTS
Warm-up lab: will start this Friday.
• Goal: see a sample lab and learn Rust
• Important: Warm-up lab will give bonus points if you solve it
• Gitlab accounts have been created and you should have received an email
• If you already had an account in our Gitlab (from a previous semester/course), your
account will be re-used
• In any case, try to log in and make sure that you are part of the group
"sdms_ws2024_students”

Systems@TUDa http://tuda.systems/ | 2
COURSE SCHEDULE
Week 1: Introduction
Week 2: DBMS Storage - Single Node
Week 3: DBMS Storage - Distributed
Week 4: DBMS Query Processing - Single Node Single-Node +
Week 5: DBMS Query Processing - Distributed Query Distributed DBMS
Week 6: DBMS Query Optimization - Single Node & Distributed
Week 7: DBMS Transaction Processing - Single Node
Week 8: DBMS Transaction Processing - Distributed
Week 9: Cloud DBMS - Data Centers & DBMS Architectures
Week 10: Cloud DBMS - Scalable Query Processing
Cloud DBMSs
Week 11: Cloud DBMS - Scalable Transaction Processing
Week 12: Cloud DBMS - Secure DBMSs
Week 13: Other Workloads – MapReduce / Streaming
Week 14: Other Workloads – Distributed AI Other Workloads

Systems@TUDa http://tuda.systems/ | 3
A SINGLE-NODE DBMS ARCHITECTURE
SQL Result

Query Optimization

Query Execution

Access Methods

Buffer Pool

Storage Manager

Table Data
Systems@TUDa http://tuda.systems/ | 4
TODAY: STORAGE LAYERS OF A DBMS
SQL Result

Query Optimization

Query Execution How to store and


Today’s Lecture: read table data
Storage Layers Access Methods efficiently?

Buffer Pool

Storage Manager

Table Data
Systems@TUDa http://tuda.systems/ | 5
DATABASES NEED TO STORE TABLE DATA

cust-key last-name … age


1 Müller 47
Example Table: 2 Lo 53
… …
99 Schmitt 22

Systems@TUDa http://tuda.systems/ | 6
THE STORAGE HIERARCHY (SIMPLIFIED)
core
registers

L1-I L1-D
L2

L3 / LLC (last-level cache)

MAIN MEMORY (DRAM)

PERSISTENT STORAGE (hard disk, ssd)


Systems@TUDa http://tuda.systems/ | 7
STORAGE HIERARCHY – ACCESS LATENCY

Systems@TUDa http://tuda.systems/ | 8
STORAGE HIERARCHY – CAPACITY & COST

3MiB each
96MiB

Increase Cost $ / GB
384MiB

265-4096GiB

1-335TiB

Systems@TUDa http://tuda.systems/ | 9
MOVEMENT OF DATA IN STORAGE HIERARCHY

Systems@TUDa http://tuda.systems/ | 10
LET US DESIGN A DATABASE STORAGE MANAGER

cust-key last-name … age


1 Müller 47
Example Table: 2 Lo 53
… …
99 Schmitt 22

Where to store data?

How to store table data for efficient access?

Systems@TUDa http://tuda.systems/ | 11
Interactive Session
(not in slides)
A SINGLE-NODE DBMS Later in the
semester

Query Optimization Select * FROM


Scan table1 & Execution customer WHERE ID=5

DBMS
Page Page Buffer Pool Storage
1 2
(in Memory)
Focus today on
single-node storage.

Next lecture(s):
Distributed storage

Block Block Block Table on HDD



1 2 n (Hard-Disk-Drive)

Systems@TUDa http://tuda.systems/ | 13
TODAY’S AGENDA
Storage Manager

Buffer Pool

Access Methods (in exercise session for lab 1)

Systems@TUDa http://tuda.systems/ | 14
TODAY’S AGENDA
Storage Manager

Buffer Pool

Access Methods (in exercise session for lab 1)

Systems@TUDa http://tuda.systems/ | 15
HOW TO MAP TABLE TO A HDD?
cust-key last-name … age
1 Müller 47
Table 2 Lo 53 Rows (=Tuples)
… … in Table
99 Schmitt 22

A table is mapped to a
database file which consist
of multiple disk blocks

Block Block Block


Database File 1 2

n Blocks on HDD

Systems@TUDa http://tuda.systems/ | 16
DATABASE FILES – TWO MAIN TYPES
Unsorted (Heap File) – New blocks are appended if table grows
cid ... cid ... cid ... cid ...
4234 ... 2234 ... 1211 ... 8004 ... How to
1345 ... 1745 ... ... 7235 ... 0325 ... organize data
How to organize
3367 ... 6367 ... 6337 ... 1357 ...
within a block?
1689 ... 8689 ... 7639 ... 2899 ...

Clustered File – Sorted by an attribute of table (by primary key)


cid ... cid ... cid ... cid ...
1234 ... 1478 ... 9021 ... 9234 ...
1345 ... 1489 ... ... 9089 ... 9287 ...
1356 ... 1591 ... 9111 ... 9321 ...
1389 ... 1599 ... 9139 ... 9333 ...

Systems@TUDa http://tuda.systems/ | 17
MAPPING TABLES TO BLOCKS: BASIC IDEA
For now: Block = Page
Page (e.g., 4KB)
Strawman idea for how to organize a page Tuple 1 (36B)
• Pages provide fixed-sized “slots” for tuples Tuple 2 (36B)
• Why fixed-sized slots? How can a DBMS know
Tuple 3 (36B)
the size of a row?

Free Space

Systems@TUDa http://tuda.systems/ | 18
EXAMPLE: DETERMINE SLOT SIZE

cust-key last-name … age


1 Müller 47 Slot
2 Lo 53
… … size?
99 Schmitt 22

INT + STRING (max. 20 CHARS) + INT = 8 + 20 + 8 Bytes (per tuple)

Systems@TUDa http://tuda.systems/ | 19
RECAP: DATA TYPES IN SQL
Strings (fixed and variable-length):
▪ CHAR(20) -- fixed length
How to support
▪ VARCHAR(40) -- variable length
variable-length
data?
Numbers (fixed-length):
▪ BIGINT (8 Byte), INT (4 Byte), SMALLINT (2), TINYINT (1)
▪ REAL (8.Byte), FLOAT (4 Byte)

Times and dates (fixed-length):


▪ …

Systems@TUDa http://tuda.systems/ | 20
SLOTTED PAGES: DETAILS
Slots
Slotted Page: One slot per tuple (Offset+size of tuples)
• Each slot: stores offset + size for tuple data
• Tuple data: stored from back to front
Header o o o

Page (e.g., 4KB)


Offset=4000 &
Operations on a Page: Size=32
• Insert operation: append tuple data at the end of
page + add a pointer into slot Free Space
• Read operation: first read pointer and then
Tuple 3
retrieve tuple data
• Delete operations: set a binary flag per slot if it Tuple 2 Tuple 1

is used (1=used slot, 0=free slot)


Tuple data (variable-length)

Systems@TUDa http://tuda.systems/ | 21
EXAMPLE: SLOTTED PAGES
Slots
(Offset + Size)

Header o o o

cust- last- age


key name
Table 1 Müller 47 Page
2 Lo 53 Free Space (e.g., 4KB)
… …
99 Schmitt 22 22 Schmitt 99

53 Lo 2 47 Müller 1

Tuple data

Systems@TUDa http://tuda.systems/ | 22
ANOTHER VARIANT OF A SLOTTED PAGES
Per slot: Fixed-
Fixed-length data of tuples (e.g., INT) length data &
Offset/size
is stored directly in slots
Tuple Tuple
Header 1 2
Only variable-length data (e.g., Tuple
3
VARCHAR) is stored at the end of page

Page
Free Space
Advantage: No indirection and faster
access to fixed-length data
Tuple Tuple Tuple
3 2 1

ONLY Variable-length data

Systems@TUDa http://tuda.systems/ | 23
HOW TO FIND DATA IN TABLES?
The DBMS needs a mechanism to identify individual
tuples in tables (e.g., for DBMS indexes, see below)

Databases use so called Record Ids

Customer Record Id
CID = 5? (Page=1,
Slot = 5)

Systems@TUDa http://tuda.systems/ | 24
DETAILS ABOUT RECORD IDs
Record IDs are “pointers” to a tuple Record Ids in DBMSs

Record ID = Unique Page-ID + Slot number ROWID


(e.g, PageID=1, Slot #=5) (4-byte)
• Page ID is used to find page (block on disk)
• Slot number is used to find data in page ROWID
(4-byte)
All commercial DBMS use this concept but call it
differently (ROWID, CTID, …)
CTID
(4-byte)

Systems@TUDa http://tuda.systems/ | 25
SLOTTED PAGES: ROW VS COLUMN STORE
Row-Store: All attributes of a tuple in a table are stored in one page

Table Pages

Column-Store: Different pages for individual columns of a table

Table Pages

Systems@TUDa http://tuda.systems/ | 26
Row vs. Column-store: Scan Perf.

From „Performance Tradeoffs in Read-


Optimized Databases“, VLDB

Systems@TUDa http://tuda.systems/ | 27
TODAY’S AGENDA
Storage Manager

Buffer Pool

Access Methods (in exercise session for lab 1)

Systems@TUDa http://tuda.systems/ | 28
HARD DISK DRIVES / HDDs ARE SLOW
HDDs are (still) the dominant storage device for classical DBMSs due to their low
$/GB price and high capacities

BUT: HDDs are slow due to mechanics of access


• Seek time to position arm approx. 3ms
• Sequential access ~200MB/s
• Huge imbalance in random vs. sequential I/O! (random is 100x slower)

The DBMS design must take these effects into account


• Cache data in main memory (RAM) → buffer pool
• Same holds also for SSDs (which are faster than HDDs but slower than RAM)
Systems@TUDa http://tuda.systems/ | 29
REMEMBER THE STORAGE HIERARCHY?

CPU Registers Faster &


Volatile No explicit control
Smaller
from applications
CPU Caches

RAM

Non- SSD (e.g., Flash)


Volatile
Hard Disk Drive (HDD)
Slower &
Larger

Systems@TUDa http://tuda.systems/ | 30
STORAGE HIERARCHY: ACCESS TIMES

0.5 ns L1 Cache 0.5 s


7 ns L2 Cache 7s
100 ns RAM 100 s
150.000 ns SSD 1.7 days
10.000.000 ns HDD 16.5 weeks

Access Times for Random Access


→ HDD is orders of magnitude (100.000x) slower than RAM

Systems@TUDa http://tuda.systems/ | 31
Example: Data Access via RIDs
Customer
CID = 5?

Systems@TUDa http://tuda.systems/ | 32
BUFFER: THE COMPLETE PICTURE
Customer
CID = 5?

(1, 5)

Systems@TUDa http://tuda.systems/ | 33
BUFFER: THE COMPLETE PICTURE

(1, 5)
Page Table
tells us where
to find page?
Systems@TUDa http://tuda.systems/ | 34 Buffer or Disk
BUFFER: THE COMPLETE PICTURE

(1, 5)

Systems@TUDa http://tuda.systems/ | 35
BUFFER MANAGER / BUFFER POOL

The page table keeps track of cached


pages and their location
1

3
Buffer pool provides fixed number of

“frames” to cache pages
Page Table Buffer Pool

Page table refers to a frame (pointer in


memory*) if page is in buffer pool

Systems@TUDa http://tuda.systems/ | 36
BUFFER MANAGER: INTERFACE?
Buffer pools provide two main operations: PIN and UNPIN

PIN: DBMS (i.e., worker thread which executes a query) requests a page using
Page ID
• Buffer pool checks if page is already in memory
• If not, it loads page and returns pointer to DBMS

UNPIN: worker thread releases page to indicate that it does not use it anymore

Buffer pool contains additional info for caching decisions: Dirty flag /
Reference counter (how many workers access page)
Systems@TUDa http://tuda.systems/ | 37
BUFFER MANAGER: PIN OPERATION
function PIN(pageno) // request page from buffer pool
1. if buffer caches page with pageno then
2. pinCount(pageno) = pinCount(pageno) + 1; Page already
3. return memory address of frame holding page; in buffer?

1. if buffer is full
2. select a victim-frame v using replacement policy; Page needs
3. if dirty(v) then write v to disk; to be evicted?
4. end

1. read page with pageno from disk into frame v;


2. pinCount(pageno) = 1; Load page
3. dirty(pageno) = false; from disk
4. return address of frame v; Into buffer
Systems@TUDa http://tuda.systems/ | 38
BUFFER MANAGER: UNPIN OPERATION
function UNPIN(pageno, dirty) // release page
1. pinCount(pageno) = pinCount(pageno) - 1;
2. if(dirty == true)
3. dirty(pageno) = true //indicate that page as modified

Systems@TUDa http://tuda.systems/ | 39
BUFFER MANAGER: EVICTION STRATEGIES
If buffer is full when a new page is requested via a PIN operation, then a
victim frame needs to be selected (called page eviction)

Discussion: What is a good eviction stragegy?


• Random: evict any page where pinCount=0
• Other ideas?

Systems@TUDa http://tuda.systems/ | 40
Interactive Session
(not in slides)
EVICTION STRATEGIES?

„Good“ eviction strategy: causes as little as possible page misses


in buffer (i.e., page needs to be loaded from disk)

Systems@TUDa http://tuda.systems/ | 42
BUFFER MANAGER: EVICTION STRATEGIES
Typical eviction strategies
• FIFO: first-in-first-out (using a simple queue)
• LRU: evict least-recently-used page by an access time per page
(list of pages sorted by access time)

Goal is to evict pages that are not likely to be used in near future

Systems@TUDa http://tuda.systems/ | 43
LEAST-RECENTLY-USED (LRU)
Main idea: keep most recently Implementation:
used pages in buffer Keep ordered list of page numbers
Page access order
Typical implementation: Keep a
sorted list of page IDs ordered by
access time

Can cause high overhead → one


„move“ in list per page access
(modern DBMS do millions of
queries per second) List of pages List of pages
after access of a after access of b

Systems@TUDa http://tuda.systems/ | 44
CLOCK ALGORITHM TO APPROXIMATE LRU
Clock algorithm is a „good enough“ approximate version of LRU with
lower overhead

Goal: Find an old page, but not necessarily the very oldest

Clock algorithm organizes frames in a circular buffer


• Each frame has a reference-bit
• Whenever page is requested, the reference-bit is set to 1

Systems@TUDa http://tuda.systems/ | 45
CLOCK: EVICTION ALGORITHM
A “clock hand” keeps track of which frame needs to be checked
next

Clock hand is ONLY moved if a page needs to be evicted!

Clock eviction algorithm


1. Check frame at clock hand
2. if reference bit=1, set ref-bit=0 and sweep to next frame in
clockwise-order; then goto 1
3. else stop sweeping and evict page in current frame (but only if it
is not pinned, which we ignore in the examples)
Systems@TUDa http://tuda.systems/ | 46
CLOCK: EXAMPLE (4 FRAMES)
Ref=1
Page requests:
1. PIN(pageID=5) Page 1

Hand
Ref=1 Ref=1

Page 4 Page 2

Page 3
Ref=1

Systems@TUDa http://tuda.systems/ | 47
CLOCK: EXAMPLE (4 FRAMES)
Ref=0
Page requests:
1. PIN(pageID=5) Page 1

Ref=1 Ref=1

Page 4 Page 2

Page 3
Ref=1

Systems@TUDa http://tuda.systems/ | 48
CLOCK: EXAMPLE (4 FRAMES)
Ref=0
Page requests:
1. PIN(pageID=5) Page 1

Ref=1 Ref=0

Page 4 Page 2

Page 3
Ref=1

Systems@TUDa http://tuda.systems/ | 49
CLOCK: EXAMPLE (4 FRAMES)
Ref=0
Page requests:
1. PIN(pageID=5) Page 1

Ref=1 Ref=0

Page 4 Page 2

Page 3
Ref=0

Systems@TUDa http://tuda.systems/ | 50
CLOCK: EXAMPLE (4 FRAMES)
Ref=0
Page requests:
1. PIN(pageID=5)
X
Page 1

Ref=0 Ref=0

Page 4 Page 2

Page 3
Ref=0

Systems@TUDa http://tuda.systems/ | 51
CLOCK: EXAMPLE (4 FRAMES)
Ref=1
Page requests:
Page 5 is loaded into buffer ->
1. PIN(pageID=5) Page 5 Reference-bit is reset to 1

Ref=0 Ref=0

Page 4 Page 2

Page 3
Ref=0

Systems@TUDa http://tuda.systems/ | 52
CLOCK: EXAMPLE (4 FRAMES)
Ref=1
Page requests:
1. PIN(pageID=5) Page 5
2. PIN(pageID=2)
3. PIN(pageID=3)

Ref=0 Ref=1

Page 4 Page 2

• Page 2 and 3 are accessed


Page 3
→ Reference-bit is reset to 1
Ref=1 • Hand does NOT move!
Systems@TUDa http://tuda.systems/ | 53
CLOCK: EXAMPLE (4 FRAMES)
Ref=1
Page requests:
1. PIN(pageID=5) Page 5
2. PIN(pageID=2)
3. PIN(pageID=3)
4. PIN(pageID=1)
Ref=0 Ref=1

Page 4 Page 2

Page 3
Ref=1

Systems@TUDa http://tuda.systems/ | 54
CLOCK: EXAMPLE (4 FRAMES)
Ref=0
Page requests:
1. PIN(pageID=5) Page 5
2. PIN(pageID=2)
3. PIN(pageID=3)
4. PIN(pageID=1)
Ref=0 Ref=1

Page 4 Page 2

Page 3
Ref=1

Systems@TUDa http://tuda.systems/ | 55
CLOCK: EXAMPLE (4 FRAMES)
Ref=0
Page requests:
1. PIN(pageID=5) Page 5
2. PIN(pageID=2)
3. PIN(pageID=3)
4. PIN(pageID=1)
Ref=0 Ref=0

Page 4 Page 2

Page 3
Ref=1

Systems@TUDa http://tuda.systems/ | 56
CLOCK: EXAMPLE (4 FRAMES)
Ref=0
Page requests:
1. PIN(pageID=5) Page 5
2. PIN(pageID=2)
3. PIN(pageID=3)
4. PIN(pageID=1)
Ref=0 Ref=0

Page 4 Page 2

Page 3
Ref=0

Systems@TUDa http://tuda.systems/ | 57
CLOCK: EXAMPLE (4 FRAMES)
Ref=0
Page requests:
1. PIN(pageID=5) Page 5
2. PIN(pageID=2)
3. PIN(pageID=3)
4. PIN(pageID=1)
Ref=0 Ref=0

X
Page 4 Page 2

Page 3
Ref=0

Systems@TUDa http://tuda.systems/ | 58
CLOCK: EXAMPLE (4 FRAMES)
Ref=0
Page requests:
1. PIN(pageID=5) Page 5
2. PIN(pageID=2)
3. PIN(pageID=3)
4. PIN(pageID=1)
Ref=1 Ref=0

Page 1 Page 2

Page 1 is loaded into buffer →


Page 3
Reference-bit is reset to 1
Ref=0

Systems@TUDa http://tuda.systems/ | 59
SUMMARY
Storage Manager

Buffer Pool

Access Methods (in exercise session for lab 1)

Systems@TUDa http://tuda.systems/ | 60
QUESTIONS

Systems@TUDa http://tuda.systems/ | 61

You might also like