04 Storage2
Systems
Database Storage:
Tuple Organization
15-445/645 FALL 2024 PROF. ANDY PAVLO
ADMINISTRIVIA
Homework #1 is due September 8th @ 11:59pm
Snowflake
→ Thursday Sept 12th @ 12:00pm
→ GHC 9115
LAST CLASS
We presented a disk-oriented architecture where
the DBMS assumes that the primary storage
location of the database is on non-volatile disk.
SLOTTED PAGES
The most common layout scheme is called slotted pages.
The slot array maps "slots" to the tuples' starting position offsets.
[Figure: a page with a Header followed by a Slot Array with slots 1 to 7.]
RECORD IDS
The DBMS assigns each logical tuple a unique record identifier that
represents its physical location in the database.
→ File Id, Page Id, Slot #
→ Most DBMSs do not store ids in tuple.
→ SQLite uses ROWID as the true primary key.
→ Example identifiers: CTID (6 bytes), ROWID (8 bytes).
TUPLE-ORIENTED STORAGE
Insert a new tuple:
→ Check page directory to find a page with a free slot.
→ Retrieve the page from disk (if not in memory).
→ Check slot array to find empty space in page that will fit.
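The last step of the insert path can be sketched in C++ as follows. This is a minimal illustration, not how any particular DBMS lays out its pages: the SlottedPage type, the 4096-byte page size, and the 4-byte slot entry are all assumptions, and the page header and page directory are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical slotted page: the slot array grows from the front,
// tuple data grows backwards from the end of a fixed-size buffer.
struct SlottedPage {
    static constexpr std::size_t kPageSize = 4096;  // assumed page size
    struct Slot { uint16_t offset; uint16_t length; };

    std::vector<Slot> slots;           // stands in for the on-page slot array
    std::vector<unsigned char> data;   // raw page bytes
    std::size_t free_end;              // start of the used tuple region

    SlottedPage() : data(kPageSize), free_end(kPageSize) {}

    // Bytes left between the slot array and the tuple data.
    std::size_t FreeSpace() const {
        return free_end - slots.size() * sizeof(Slot);
    }

    // Returns the new slot number, or -1 if the tuple does not fit.
    int Insert(const std::string &tuple) {
        if (tuple.size() + sizeof(Slot) > FreeSpace()) return -1;
        free_end -= tuple.size();
        std::memcpy(data.data() + free_end, tuple.data(), tuple.size());
        slots.push_back({static_cast<uint16_t>(free_end),
                         static_cast<uint16_t>(tuple.size())});
        return static_cast<int>(slots.size()) - 1;
    }

    // Follows the slot's offset to the tuple bytes.
    std::string Read(int slot) const {
        const Slot &s = slots[slot];
        return std::string(
            reinterpret_cast<const char *>(data.data()) + s.offset, s.length);
    }
};
```

A record id would then pair a page id with the slot number returned by Insert, so the tuple can move inside the page without changing its id.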
TUPLE-ORIENTED STORAGE
Problem #1: Fragmentation
→ Pages are not fully utilized (unusable space, empty slots).
Problem #2: Useless Disk I/O
→ DBMS must fetch entire page to update one tuple.
Problem #3: Random Disk I/O
→ Worst-case scenario when updating multiple tuples is that
each tuple is on a separate page.
TODAY'S AGENDA
Log-Structured Storage
Index-Organized Storage
Data Representation
LOG-STRUCTURED STORAGE
Instead of storing tuples in pages and updating them
in-place, the DBMS maintains a log that records
changes to tuples.
→ Each log entry represents a tuple PUT/DELETE operation.
→ Originally proposed as log-structured merge trees (LSM
Trees) in 1996.
[Figure: in memory, records are kept sorted by key (low→high): PUT (key101,a2), PUT (key102,b1), PUT (key103,c1). On disk, SSTables are organized into Level #0 (newest→oldest), Level #1, and Level #2.]
LOG-STRUCTURED STORAGE
Key-value storage that appends log records on disk to represent
changes to tuples (PUT, DELETE).
→ Each log record must contain the tuple's unique identifier.
→ Put records contain the tuple contents.
→ Deletes mark the tuple as deleted.
[Figure: an SSTable sorted by key (low→high): DEL (key100), PUT (key101,a3), PUT (key102,b2), PUT (key103,c1).]
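A minimal C++ sketch of these PUT/DELETE semantics, under simplifying assumptions: LogStructuredStore, LogRecord, and the "memtable" name are hypothetical, std::map stands in for both the in-memory table and the on-disk SSTables, and only Level #0 is modeled.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct LogRecord {
    bool is_delete;     // true for DEL (a tombstone), false for PUT
    std::string value;  // tuple contents; empty for DEL
};

// A sorted run of records, ordered by key low→high.
using SSTable = std::map<std::string, LogRecord>;

class LogStructuredStore {
public:
    void Put(const std::string &key, const std::string &value) {
        memtable_[key] = LogRecord{false, value};
    }
    void Delete(const std::string &key) {
        memtable_[key] = LogRecord{true, ""};  // record the delete, do not erase
    }
    // Freeze the in-memory table into an immutable Level #0 SSTable.
    void Flush() {
        if (memtable_.empty()) return;
        level0_.insert(level0_.begin(), memtable_);  // keep newest first
        memtable_.clear();
    }
    // Returns true and fills *out if the key has a live value.
    bool Get(const std::string &key, std::string *out) const {
        auto it = memtable_.find(key);
        if (it != memtable_.end()) return Resolve(it->second, out);
        for (const SSTable &table : level0_) {  // newest→oldest
            auto hit = table.find(key);
            if (hit != table.end()) return Resolve(hit->second, out);
        }
        return false;
    }

private:
    static bool Resolve(const LogRecord &rec, std::string *out) {
        if (rec.is_delete) return false;  // the newest record is a tombstone
        *out = rec.value;
        return true;
    }
    SSTable memtable_;
    std::vector<SSTable> level0_;
};
```

Reads stop at the first record found for a key, which is why the newest-to-oldest search order matters: an older PUT must never shadow a newer PUT or DEL.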
LOG-STRUCTURED COMPACTION
Periodically compact SSTables to reduce wasted
space and speed up reads.
→ Only keep the "latest" values for each key using a sort-
merge algorithm.
[Figure: two SSTables, ordered newest→oldest, merged into one.]
Newer SSTable: DEL (key100), PUT (key101,a3), PUT (key102,b2), PUT (key103,c1)
Older SSTable: PUT (key101,a2), PUT (key102,b1), DEL (key103), PUT (key104,d2)
Merged SSTable: DEL (key100), PUT (key101,a3), PUT (key102,b2), PUT (key103,c1), PUT (key104,d2)
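The sort-merge step can be sketched in C++ as below. Record and Compact are hypothetical names, and the sketch assumes both inputs are already sorted by key. It keeps tombstones in the output, on the assumption that an even older SSTable might still contain the deleted key.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Record {
    std::string key;
    bool is_delete;     // true for DEL, false for PUT
    std::string value;  // empty for DEL
};

// Merge two key-sorted SSTables. When both contain the same key,
// the record from `newer` wins and the stale one is dropped.
std::vector<Record> Compact(const std::vector<Record> &newer,
                            const std::vector<Record> &older) {
    std::vector<Record> out;
    std::size_t i = 0, j = 0;
    while (i < newer.size() && j < older.size()) {
        if (newer[i].key < older[j].key) {
            out.push_back(newer[i++]);
        } else if (older[j].key < newer[i].key) {
            out.push_back(older[j++]);
        } else {
            out.push_back(newer[i++]);  // latest version of the key
            ++j;                        // skip the stale version
        }
    }
    while (i < newer.size()) out.push_back(newer[i++]);
    while (j < older.size()) out.push_back(older[j++]);
    return out;
}
```

Because both inputs are sorted, the merge is a single sequential pass over each SSTable, which fits the sequential I/O pattern that log-structured storage favors.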
DISCUSSION
Log-structured storage managers are more common
today than in previous decades.
→ This is partly due to the proliferation of RocksDB.
OBSERVATION
The two table storage approaches we've discussed
so far rely on indexes to find individual tuples.
→ Such indexes are necessary because the tables are
inherently unsorted.
INDEX-ORGANIZED STORAGE
DBMS stores a table's tuples as the value of an index
data structure.
→ Still use a page layout that looks like a slotted page.
→ Tuples are typically sorted in page based on key.
B+Tree pays maintenance costs upfront, whereas
LSMs pay for it later.
[Figure: B+Tree leaf nodes holding Tuple #3, Tuple #2, and Tuple #6.]
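To make the idea concrete, here is a small C++ sketch in which the tuples live directly inside an ordered index. It is only an illustration: IndexOrganizedTable and Tuple are hypothetical, and std::map (a balanced binary tree) stands in for the B+Tree a real system would use.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Tuple {
    int id;            // primary key
    std::string name;  // remaining tuple contents
};

class IndexOrganizedTable {
public:
    // The tuple itself is the value stored in the index.
    void Insert(const Tuple &t) { index_[t.id] = t; }

    // Point lookup goes straight through the index; no record id hop.
    const Tuple *Find(int id) const {
        auto it = index_.find(id);
        return it == index_.end() ? nullptr : &it->second;
    }

    // Range scan returns keys in sorted order because storage is sorted.
    std::vector<int> ScanIds(int lo, int hi) const {
        std::vector<int> ids;
        for (auto it = index_.lower_bound(lo);
             it != index_.end() && it->first <= hi; ++it) {
            ids.push_back(it->first);
        }
        return ids;
    }

private:
    std::map<int, Tuple> index_;
};
```

The sorting work happens on every insert, which mirrors the point that a B+Tree pays its maintenance cost upfront.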
TUPLE STORAGE
A tuple is essentially a sequence of bytes prefixed
with a header that contains meta-data about it.
DATA LAYOUT
CREATE TABLE foo (
  id INT PRIMARY KEY,
  value BIGINT
);
Tuple bytes (unsigned char[]): header | id | value
reinterpret_cast<int32_t*>(address)
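The sketch below shows how the DBMS might interpret the bytes of a foo tuple. The 4-byte header size and the packed layout are assumptions made for illustration. It uses std::memcpy in place of the reinterpret_cast shown on the slide, since memcpy stays well-defined even when the attribute is not aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kHeaderSize = 4;  // assumed header size
constexpr std::size_t kIdOffset = kHeaderSize;                     // id INT
constexpr std::size_t kValueOffset = kIdOffset + sizeof(int32_t);  // value BIGINT
constexpr std::size_t kTupleSize = kValueOffset + sizeof(int64_t);

// Serialize (id, value) into a raw byte array; the header is left zeroed.
std::vector<unsigned char> MakeTuple(int32_t id, int64_t value) {
    std::vector<unsigned char> bytes(kTupleSize, 0);
    std::memcpy(bytes.data() + kIdOffset, &id, sizeof(id));
    std::memcpy(bytes.data() + kValueOffset, &value, sizeof(value));
    return bytes;
}

// The catalog tells the DBMS the type and offset of each attribute;
// here those are hard-coded as the constants above.
int32_t ReadId(const unsigned char *tuple) {
    int32_t id;
    std::memcpy(&id, tuple + kIdOffset, sizeof(id));
    return id;
}

int64_t ReadValue(const unsigned char *tuple) {
    int64_t value;
    std::memcpy(&value, tuple + kValueOffset, sizeof(value));
    return value;
}
```

In this packed layout the BIGINT starts at byte 12, which is not a multiple of 8, so a direct pointer cast to int64_t* would be an unaligned access.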
WORD-ALIGNED TUPLES
All attributes in a tuple must be word aligned to
enable the CPU to access it without any unexpected
behavior or additional work.
WORD-ALIGNMENT: PADDING
Add empty bits after attributes to ensure that tuple
is word aligned. Essentially round up the storage
size of types to the next largest word size.
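As a rough sketch of the padding rule, the C++ below computes attribute offsets so that no attribute straddles a word boundary. The 8-byte word and the attribute sizes in the usage note are assumptions, and the stricter per-type alignment rules of real hardware are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kWord = 8;  // assumed 64-bit word

// Round offset up to the next multiple of alignment.
std::size_t AlignUp(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) / alignment * alignment;
}

// Returns each attribute's byte offset; *total receives the padded tuple size.
// Assumes every attribute is at most one word wide.
std::vector<std::size_t> PaddedOffsets(const std::vector<std::size_t> &sizes,
                                       std::size_t *total) {
    std::vector<std::size_t> offsets;
    std::size_t offset = 0;
    for (std::size_t size : sizes) {
        // End of the word that contains the byte at `offset`.
        std::size_t word_end = AlignUp(offset + 1, kWord);
        if (offset + size > word_end) offset = word_end;  // insert padding
        offsets.push_back(offset);
        offset += size;
    }
    *total = AlignUp(offset, kWord);  // pad the tail to a full word
    return offsets;
}
```

For hypothetical attribute sizes of 4, 8, 2, and 4 bytes, the offsets come out as 0, 8, 16, and 18, and the tuple grows from 18 bytes of data to 24 bytes with padding.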
WORD-ALIGNMENT: REORDERING
Switch the order of attributes in the tuples' physical
layout to make sure they are aligned.
→ May still have to use padding to fill remaining space.
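One simple reordering heuristic, sketched in C++ below, is to place the widest attributes first. This is an illustrative assumption rather than the policy of any specific DBMS; PaddedTupleSize and ReorderBySize are hypothetical helpers, and the word size is assumed to be 8 bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

constexpr std::size_t kWordBytes = 8;  // assumed 64-bit word

// Tuple size when no attribute may straddle a word boundary
// and the tuple is padded out to a whole number of words.
std::size_t PaddedTupleSize(const std::vector<std::size_t> &sizes) {
    std::size_t offset = 0;
    for (std::size_t size : sizes) {
        std::size_t word_end = (offset / kWordBytes + 1) * kWordBytes;
        if (offset + size > word_end) offset = word_end;  // padding
        offset += size;
    }
    return (offset + kWordBytes - 1) / kWordBytes * kWordBytes;
}

// Physical order: widest attributes first. The logical column order
// seen by queries would be kept separately in the catalog.
std::vector<std::size_t> ReorderBySize(std::vector<std::size_t> sizes) {
    std::sort(sizes.begin(), sizes.end(), std::greater<std::size_t>());
    return sizes;
}
```

With attribute sizes 4, 8, 4 the declared order needs 24 bytes, while the reordered layout 8, 4, 4 needs 16. With sizes 4, 8, 2, 4 both orders need 24 bytes, which is the case where padding is still required after reordering.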
DATA REPRESENTATION
INTEGER/BIGINT/SMALLINT/TINYINT
→ Same as in C/C++.
FLOAT/REAL vs. NUMERIC/DECIMAL
→ IEEE-754 Standard / Fixed-point Decimals.
VARCHAR/VARBINARY/TEXT/BLOB
→ Header with length, followed by data bytes OR pointer to
another page/offset with data.
→ Need to worry about collations / sorting.
TIME/DATE/TIMESTAMP/INTERVAL
→ 32/64-bit integer of (micro/milli)-seconds since Unix
epoch (January 1st, 1970).
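The difference between IEEE-754 floats and fixed-point decimals can be seen in a few lines of C++. Decimal2 is a hypothetical type that stores a value with two fractional digits as a scaled integer; real NUMERIC implementations support arbitrary precision and are considerably more involved.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fixed-point decimal: the value 1.23 is stored as 123.
struct Decimal2 {
    int64_t scaled;  // value * 100
};

Decimal2 FromHundredths(int64_t hundredths) { return Decimal2{hundredths}; }

// Integer addition is exact, so no rounding error accumulates.
Decimal2 Add(Decimal2 a, Decimal2 b) { return Decimal2{a.scaled + b.scaled}; }

bool Equal(Decimal2 a, Decimal2 b) { return a.scaled == b.scaled; }

// With binary floating point, 0.1 and 0.2 are not exactly representable,
// so their sum is compared against 0.3 here to expose the rounding error.
bool FloatSumIsExact() {
    double a = 0.1;
    double b = 0.2;
    return a + b == 0.3;
}
```

The trade-off is speed: floating-point arithmetic maps to native CPU instructions, while exact decimals need extra software logic on every operation.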
POSTGRES: NUMERIC
typedef unsigned char NumericDigit;
typedef struct {
  int ndigits;           // # of Digits
  int weight;            // Weight of 1st Digit
  int scale;             // Scale Factor
  int sign;              // Positive/Negative/NaN
  NumericDigit *digits;  // Digit Storage
} numeric;
LARGE VALUES
Most DBMSs do not allow a tuple to exceed the size of a single page.
To store values that are larger than a page, the DBMS uses separate
overflow storage pages.
→ Postgres: TOAST (>2KB)
→ MySQL: Overflow (>½ size of page)
→ SQL Server: Overflow (>size of page)
CREATE TABLE foo (
  id INT PRIMARY KEY,
  data INT,
  contents TEXT
);
[Figure: the tuple (Header | INT | INT | TEXT) stores a size and location for the TEXT attribute, which point to an overflow page holding the VARCHAR data.]
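A simplified C++ sketch of the size-plus-location indirection follows. Everything in it is an assumption made for illustration: the 2048-byte inline limit, the TextAttr and OverflowStore types, and the use of one vector entry to stand in for an overflow page (or chain of pages).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kInlineLimit = 2048;  // assumed inline threshold

// What the tuple stores for a TEXT attribute.
struct TextAttr {
    bool is_overflow;
    std::string inline_data;  // used when the value fits in the tuple
    std::size_t size;         // length of the full value
    std::size_t location;     // overflow page index when is_overflow is true
};

struct OverflowStore {
    std::vector<std::string> pages;  // each entry stands in for overflow storage

    // Small values stay in the tuple; large ones move to overflow storage.
    TextAttr Store(const std::string &value) {
        if (value.size() <= kInlineLimit) {
            return TextAttr{false, value, value.size(), 0};
        }
        pages.push_back(value);
        return TextAttr{true, "", value.size(), pages.size() - 1};
    }

    // Readers follow the location only when the value was moved out.
    std::string Load(const TextAttr &attr) const {
        return attr.is_overflow ? pages[attr.location] : attr.inline_data;
    }
};
```

Keeping only a size and location in the tuple keeps the tuple itself small, at the cost of an extra page fetch whenever the large value is actually read.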
SYSTEM CATALOGS
A DBMS stores meta-data about databases in its
internal catalogs.
→ Tables, columns, indexes, views
→ Users, permissions
→ Internal statistics
SYSTEM CATALOGS
You can query the DBMS’s internal
INFORMATION_SCHEMA catalog to get info about the
database.
→ ANSI standard set of read-only views that provide info
about all the tables, views, columns, and procedures in a
database
SQL-92:
SELECT *
  FROM INFORMATION_SCHEMA.TABLES
 WHERE table_catalog = '<db name>';
Postgres:
\d;
SQLite:
.tables
SQL-92:
SELECT *
  FROM INFORMATION_SCHEMA.TABLES
 WHERE table_name = 'student';
Postgres:
\d student;
SCHEMA CHANGES
ADD COLUMN:
→ NSM: Copy tuples into new region in memory.
→ DSM: Just create the new column segment on disk.
DROP COLUMN:
→ NSM #1: Copy tuples into new region of memory.
→ NSM #2: Mark column as "deprecated", clean up later.
→ DSM: Just drop the column and free memory.
CHANGE COLUMN:
→ Check whether the conversion is allowed to happen.
Depends on default values.
INDEXES
CREATE INDEX:
→ Scan the entire table and populate the index.
→ Have to record changes made by txns that modified the
table while another txn was building the index.
→ When the scan completes, lock the table and resolve
changes that were missed after the scan started.
DROP INDEX:
→ Just drop the index logically from the catalog.
→ It only becomes "invisible" when the txn that dropped it
commits. All existing txns will still have to update it.
CONCLUSION
Log-structured storage is an alternative approach to
the tuple-oriented architecture.
→ Ideal for write-heavy workloads because it maximizes
sequential disk I/O.
NEXT CLASS
Breaking your preconceived notion that a DBMS
stores everything as rows…