0% found this document useful (0 votes)
5 views43 pages

08 Storage

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views43 pages

08 Storage

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 43

Lecture #08

ADVANCED
DATABASE
SYSTEMS
Storage Models &
Data Layout
@Andy_Pavlo // 15-721 // Spring 2020
2

D ATA O R G A N I Z AT I O N
Fixed-Length Variable-Length
Index Data Blocks Data Blocks

Block Id + C++11 alignas


Offset
Block Pointer Offset
44-bits 20-bits

15-721 (Spring 2020)


3

D ATA O R G A N I Z AT I O N

One can think of an in-memory database as just a


large array of bytes.
→ The schema tells the DBMS how to convert the bytes
into the appropriate type.
→ Each tuple is prefixed with a header that contains its
meta-data.

Storing tuples with as fixed-length data makes it


easy to compute the starting point of any tuple.

15-721 (Spring 2020)


4

Type Representation
Data Layout / Alignment
Storage Models
System Catalogs

15-721 (Spring 2020)


5

D ATA R E P R E S E N TAT I O N

INTEGER/BIGINT/SMALLINT/TINYINT
→ C/C++ Representation
FLOAT/REAL vs. NUMERIC/DECIMAL
→ IEEE-754 Standard / Fixed-point Decimals
TIME/DATE/TIMESTAMP
→ 32/64-bit int of (micro/milli)seconds since Unix epoch
VARCHAR/VARBINARY/TEXT/BLOB
→ Pointer to other location if type is ≥64-bits
→ Header with length and address to next location (if
segmented), followed by data bytes.

15-721 (Spring 2020)


6

VA R I A B L E P R E C I S I O N N U M B E R S

Inexact, variable-precision numeric type that uses


the “native” C/C++ types.

Store directly as specified by IEEE-754.

Typically faster than arbitrary precision numbers.


→ Example: FLOAT, REAL/DOUBLE

15-721 (Spring 2020)


7

VA R I A B L E P R E C I S I O N N U M B E R S

Rounding Example
#include <stdio.h>

Output int main(int argc, char* argv[]) {


x+y = 0.30000001192092895508 float x = 0.1;
0.3 = 0.29999999999999998890 float y = 0.2;
printf("x+y = %.20f\n", x+y);
printf("0.3 = %.20f\n", 0.3);
}

15-721 (Spring 2020)


8

FIXED PRECISION NUMBERS

Numeric data types with arbitrary precision and


scale. Used when round errors are unacceptable.
→ Example: NUMERIC, DECIMAL

Typically stored in an exact, variable-length binary


representation with additional meta-data.
→ Like a VARCHAR but not stored as a string

15-721 (Spring 2020)


9

D ATA L AYO U T

char[]
CREATE TABLE AndySux (
id INT PRIMARY KEY, header id value
value BIGINT
);
reinterpret_cast<int32_t*>(address)

15-721 (Spring 2020)


10

VA R I A B L E -L E N G T H F I E L D S

char[]
CREATE TABLE AndySux (
value VARCHAR(1024) header id Andy|64-BIT
64-BIT POINTER POINTER
);
Variable-Length Data Blocks
INSERT INTO AndySux
VALUES ("Andy has the worst LENGTH NEXT
Andy has the worst
hygiene that I have ever seen. I hygiene that I have ever seen. I hate
hate him so much.");

LENGTH NEXT him so much.


15-721 (Spring 2020)
11

N U L L D ATA T Y P E S

Choice #1: Special Values


→ Designate a value to represent NULL for a data type (e.g.,
INT32_MIN).

Choice #2: Null Column Bitmap Header


→ Store a bitmap in the tuple header that specifies what
attributes are null.

Choice #3: Per Attribute Null Flag


→ Store a flag that marks that a value is null.
→ Have to use more space than just a single bit because this
messes up with word alignment.
15-721 (Spring 2020)
12

DISCLAIMER

The truth is that you only need to worry about


word-alignment for cache lines (e.g., 64 bytes).

I’m going to show you the basic idea using 64-bit


words since it’s easier to see…

15-721 (Spring 2020)


13

W O R D -A L I G N E D T U P L E S

All attributes in a tuple must be word aligned to


enable the CPU to access it without any
unexpected behavior or additional work.
CREATE TABLE AndySux (
32-bits id INT PRIMARY KEY, char[]
64-bits cdate TIMESTAMP, id cdate c zipc
16-bits color CHAR(2),
32-bits zipcode INT 64-bit Word 64-bit Word 64-bit Word 64-bit Word
);

15-721 (Spring 2020)


14

W O R D -A L I G N E D T U P L E S

Approach #1: Perform Extra Reads


→ Execute two reads to load the appropriate parts
of the data word and reassemble them.

Approach #2: Random Reads


→ Read some unexpected combination of bytes
assembled into a 64-bit word.
Approach #3: Reject
Source: Levente Kurusa → Throw an exception and hope app handles it.
15-721 (Spring 2020)
15

W O R D -A L I G N M E N T: PA D D I N G

Add empty bits after attributes to ensure that tuple


is word aligned.

CREATE TABLE AndySux (


32-bits id INT PRIMARY KEY, char[]
00000000 00000
64-bits cdate TIMESTAMP, id 00000000
00000000 cdate c zipc 000
00000
00000000 000
16-bits color CHAR(2),
32-bits zipcode INT 64-bit Word 64-bit Word 64-bit Word 64-bit Word
);

15-721 (Spring 2020)


16

W O R D -A L I G N M E N T: R E O R D E R I N G

Switch the order of attributes in the tuples'


physical layout to make sure they are aligned.
→ May still have to use padding.

CREATE TABLE AndySux (


32-bits id INT PRIMARY KEY, char[]
000000000000
64-bits cdate TIMESTAMP, id zipc cdate c 000000000000
000000000000
000000000000
16-bits color CHAR(2),
32-bits zipcode INT 64-bit Word 64-bit Word 64-bit Word 64-bit Word
);

15-721 (Spring 2020)


17

C M U -D B A L I G N M E N T E X P E R I M E N T
Processor: 1 socket, 4 cores w/ 2×HT
Workload: Insert Microbenchmark

Avg. Throughput
No Alignment 0.523 MB/sec
Padding 11.7 MB/sec
Padding + Sorting 814.8 MB/sec

Source: Tianyu Li

15-721 (Spring 2020)


18

STORAGE MODELS

N-ary Storage Model (NSM)


Decomposition Storage Model (DSM)
Hybrid Storage Model

COLUMN-STORES VS. ROW-STORES: HOW


DIFFERENT ARE THEY REALLY?
SIGMOD 2008

15-721 (Spring 2020)


19

N -A R Y S T O R A G E M O D E L ( N S M )

The DBMS stores all of the attributes for a single


tuple contiguously.

Ideal for OLTP workloads where txns tend to


operate only on an individual entity and insert-
heavy workloads.

Use the tuple-at-a-time iterator model.

15-721 (Spring 2020)


20

N -A R Y S T O R A G E M O D E L ( N S M )

Advantages
→ Fast inserts, updates, and deletes.
→ Good for queries that need the entire tuple.
→ Can use index-oriented physical storage.

Disadvantages
→ Not good for scanning large portions of the table and/or
a subset of the attributes.

15-721 (Spring 2020)


21

D E C O M P O S I T I O N S TO R A G E M O D E L ( D S M )

The DBMS stores a single attribute for all tuples


contiguously in a block of data.

Ideal for OLAP workloads where read-only


queries perform large scans over a subset of the
table’s attributes.

15-721 (Spring 2020)


22

D E C O M P O S I T I O N S TO R A G E M O D E L ( D S M )

Advantages
→ Reduces the amount wasted work because the DBMS
only reads the data that it needs.
→ Better compression.

Disadvantages
→ Slow for point queries, inserts, updates, and deletes
because of tuple splitting/stitching.

15-721 (Spring 2020)


23

D S M S Y S T E M H I S TO R Y

1970s: Cantor DBMS


1980s: DSM Proposal
1990s: SybaseIQ (in-memory only)
2000s: Vertica, VectorWise, MonetDB
2010s: Everyone

15-721 (Spring 2020)


24

DSM: DESIGN DECISIONS

Tuple Identification
Data Organization
Update Policy
Buffering Location

OPTIMAL COLUMN LAYOUT FOR


HYBRID WORKLOADS
VLDB 2019

15-721 (Spring 2020)


25

D S M : T U P L E I D E N T I F I C AT I O N

Choice #1: Fixed-length Offsets


→ Each value is the same length for an attribute.
Choice #2: Embedded Tuple Ids
→ Each value is stored with its tuple id in a column.
Offsets Embedded Ids
A B C D A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3

15-721 (Spring 2020)


26

D S M : D ATA O R G A N I Z AT I O N

Choice #1: Insertion Order


→ Tuples are inserted into any free slot that is available in
existing blocks.

Choice #2: Sorted Order


→ Tuples are inserted based into a slot according to some
ordering scheme.

Choice #3: Partitioned


→ Assign tuples to blocks according to their attribute values
and some partitioning scheme (e.g., hashing, range).
15-721 (Spring 2020)
27

D S M : D ATA O R G A N I Z AT I O N

Data Table Sorted Table


INSERT INTO xxx A B C A B C
VALUES (a2, b1, c5); 0 a1 b1 c1 2 a1 b2 c8
1 a3 b2 c9 5 a1 b2 c9
2 a1 b2 c8 0 a1 b1 c1
3 a2 b2 c7 3 a2 b2 c7
4 a2
a3 b2
b1 c9
c5
c6 1 a3 b2 c9
5 a3
a1 b1
b2 c1
c9 6 a3 b1 c1
6 a3 b1 c6
c1 4 a3 b1 c6
7 a2 b1 c5 7
Sort Order: (A↑, B↓, C↑)
15-721 (Spring 2020)
28

C A S P E R D E LTA S TO R E

Range-partitioned column store with a "shallow"


order-preserving index above it.
→ Shallow index maps value ranges to partitions.
→ Index keys are sorted but the individual columns are not.

DBMS runs an offline optimization algorithm to


determine the optimal partitioning of data.

OPTIMAL COLUMN LAYOUT FOR


HYBRID WORKLOADS
VLDB 2019

15-721 (Spring 2020)


29

C A S P E R D E LTA S TO R E
Data Table
A B C
INSERT INTO xxx 0 a1 b2 c8
VALUES (a2, b1, c5); 1 a1 b2 c9
2 a1 b1 c1
3 a2 b2 c7
INSERT INTO xxx 4 a2 b1 c5
VALUES (a2, b2, c6); 5 a2
a3 b2
b1 c6
6 a3 b1 c1
Shallow Index 7 a3 b2 c9
key→partition 8 a3 b1 c6
9

15-721 (Spring 2020)


30

O B S E R VAT I O N

Data is “hot” when it enters the database


→ A newly inserted tuple is more likely to be updated again
the near future.

As a tuple ages, it is updated less frequently.


→ At some point, a tuple is only accessed in read-only
queries along with other tuples.

15-721 (Spring 2020)


31

H Y B R I D S TO R A G E M O D E L

Single logical database instance that uses different


storage models for hot and cold data.

Store new data in NSM for fast OLTP


Migrate data to DSM for more efficient OLAP

15-721 (Spring 2020)


32

H Y B R I D S TO R A G E M O D E L

Choice #1: Separate Execution Engines


→ Use separate execution engines that are optimized for
either NSM or DSM databases.

Choice #2: Single, Flexible Architecture


→ Use single execution engine that can efficiently operate
on both NSM and DSM databases.

15-721 (Spring 2020)


33

S E PA R AT E E X E C U T I O N E N G I N E S

Run separate “internal” DBMSs that each only


operate on DSM or NSM data.
→ Need to combine query results from both engines to
appear as a single logical database to the application.
→ Must use a synchronization method (e.g., 2PC) if a txn
spans execution engines.

Two approaches to do this:


→ Fractured Mirrors (Oracle, IBM)
→ Delta Store (SAP HANA)

15-721 (Spring 2020)


34

FRACTURED MIRRORS

Store a second copy of the database in a DSM


layout that is automatically updated.
→ All updates are first entered in NSM then eventually
copied into DSM mirror.

NSM DSM
(Primary) (Mirror) Analytical
Transactions Queries

A CASE FOR FRACTURED MIRRORS


VLDB 2002

15-721 (Spring 2020)


35

D E LTA S TO R E

Stage updates to the database in an NSM table.


A background thread migrates updates from delta
store and applies them to DSM data.

DSM
NSM Historical Data
Delta Store
Transactions

15-721 (Spring 2020)


36

P E LOT O N A D A P T I V E S TO R A G E

Employ a single execution engine architecture that


can operate on both NSM and DSM data.
→ Don’t need to store two copies of the database.
→ Don’t need to sync multiple database segments.

Note that a DBMS can still use the delta-store


approach with this single-engine architecture.

BRIDGING THE ARCHIPELAGO BETWEEN ROW-STORES AND


COLUMN-STORES FOR HYBRID WORKLOADS
SIGMOD 2016

15-721 (Spring 2020)


37

P E LOT O N A D A P T I V E S TO R A G E

Original Data Adapted Data


UPDATE AndySux A B C D A B C D
SET A = 123,
B = 456, Hot
C = 789
WHERE D = “xxx” A B C D

SELECT AVG(B)
FROM AndySux
WHERE C = “yyy”

Cold
15-721 (Spring 2020)
39

P E LOT O N A D A P T I V E S T O R A G E
Row Layout Column Layout Adaptive Layout
Execution Time (ms)

1600

1200

800

400

0 Scan Insert Scan Insert Scan Insert Scan Insert Scan Insert Scan Insert

Sep-15 Sep-16 Sep-17 Sep-18 Sep-19 Sep-20


15-721 (Spring 2020)
40

S Y S T E M C ATA LO G S

Almost every DBMS stores their database's


catalogs the same way that they store regular data.
→ Wrap object abstraction around tuples.
→ Specialized code for "bootstrapping" catalog tables.

The entire DBMS should be aware of transactions


in order to automatically provide ACID guarantees
for DDL commands and concurrent transactions.

15-721 (Spring 2020)


41

SCHEMA CHANGES

ADD COLUMN:
→ NSM: Copy tuples into new region in memory.
→ DSM: Just create the new column segment

DROP COLUMN:
→ NSM #1: Copy tuples into new region of memory.
→ NSM #2: Mark column as "deprecated", clean up later.
→ DSM: Just drop the column and free memory.

CHANGE COLUMN:
→ Check whether the conversion can happen. Depends on
default values.
15-721 (Spring 2020)
42

INDEXES

CREATE INDEX:
→ Scan the entire table and populate the index.
→ Must record changes made by txns that modified the table
while another txn was building the index.
→ When the scan completes, lock the table and resolve
changes that were missed after the scan started.

DROP INDEX:
→ Just drop the index logically from the catalog.
→ It only becomes "invisible" when the txn that dropped it
commits. All existing txns will still have to update it.

15-721 (Spring 2020)


43

SEQUENCES

Typically stored in the catalog. Used for


maintaining a global counter
→ Also called "auto-increment" or "serial" keys

Sequences are not maintained with the same


isolation protection as regular catalog entries.
→ Rolling back a txn that incremented a sequence does not
rollback the change to that sequence.
→ All INSERT queries would incur write-write conflicts.

15-721 (Spring 2020)


44

PA R T I N G T H O U G H T S

We abandoned the hybrid storage model


→ Significant engineering overhead.
→ Delta version storage + column store is almost
equivalent.

Catalogs are hard.

15-721 (Spring 2020)

You might also like