20 Logging
Database Systems
15-445/645 FALL 2024 PROF. ANDY PAVLO
ADMINISTRIVIA
Project #3 is due Sunday Nov 17th @ 11:59pm
→ Saturday Office Hours on Nov 16th @ 3-5pm GHC 5207
LAST CLASS
We discussed multi-version concurrency control
(MVCC) and how it affects the design of the entire
DBMS architecture.
MOTIVATION
[Figure: Schedule for T1: BEGIN, R(A), W(A), ..., COMMIT.
W(A) changes A from 1 to 2 in the buffer pool while the page
on disk still holds A=1. In the final frame the buffer pool
is empty and the page on disk still holds A=1.]
CRASH RECOVERY
Recovery algorithms are techniques to ensure
database consistency, transaction atomicity, and
durability despite failures.
TODAY’S AGENDA
Buffer Pool Policies
Shadow Paging
Write-Ahead Log
Logging Schemes
Checkpoints
DB Flash Talk: Confluent
OBSERVATION
The database’s primary storage location is on non-
volatile storage, but this is slower than volatile
storage. Use volatile memory for faster access:
→ First copy target record into memory.
→ Perform the writes in memory.
→ Write dirty records back to disk.
BUFFER POOL
[Figure: Schedule with two txns.
T1: BEGIN, R(A), W(A), ..., ROLLBACK.
T2: BEGIN, R(B), W(B), COMMIT.
The buffer pool and the disk both start with one page holding
A=1, B=9, C=7. T1 sets A=3 and T2 sets B=8 in the buffer pool.
When T2 commits, the page is written to disk as A=3, B=8, C=7,
which includes T1's uncommitted change to A. T1 then rolls back.]
STEAL POLICY
Whether the DBMS can evict a dirty object in the
buffer pool modified by an uncommitted txn and
overwrite the most recent committed version of
that object in non-volatile storage.
FORCE POLICY
Whether the DBMS requires that all updates made
by a txn are written back to non-volatile storage
before the txn can commit.
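The two policies can be read as two yes/no switches on the buffer pool. The sketch below is only an illustration of the definitions above; the names `BufferPoolPolicy`, `can_evict`, and `pages_to_flush_at_commit` are hypothetical, and it assumes each dirty page has a single writing txn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferPoolPolicy:
    steal: bool  # may a dirty page from an uncommitted txn be written to disk?
    force: bool  # must all of a txn's updates be on disk before it commits?


def can_evict(policy, page_dirty, writer_committed):
    """Return True if the buffer pool may write this page out and evict it."""
    if not page_dirty or writer_committed:
        return True      # nothing uncommitted would reach disk
    return policy.steal  # dirty + uncommitted: allowed only under STEAL


def pages_to_flush_at_commit(policy, dirty_pages_of_txn):
    """Return the pages that must be on disk before the commit completes."""
    return set(dirty_pages_of_txn) if policy.force else set()


NO_STEAL_FORCE = BufferPoolPolicy(steal=False, force=True)
STEAL_NO_FORCE = BufferPoolPolicy(steal=True, force=False)
```

Under `NO_STEAL_FORCE`, `can_evict` refuses a dirty page whose writer is still active, and every dirty page of a txn is flushed at commit. Under `STEAL_NO_FORCE` both restrictions are lifted.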
NO-STEAL + FORCE
[Figure: Schedule with two txns.
T1: BEGIN, R(A), W(A), ..., ROLLBACK.
T2: BEGIN, R(B), W(B), COMMIT.
After both writes the buffer pool page holds A=3, B=8, C=7
and the page on disk holds A=1, B=9, C=7.]
→ FORCE means that T2 changes must be written to disk at
T2's COMMIT.
→ NO-STEAL means that T1 changes cannot be written to disk yet.
→ The DBMS makes a copy of the page that contains only T2's
change (A=1, B=8, C=7) and writes that copy to disk.
NO-STEAL + FORCE
This approach is the easiest to implement:
→ Never have to undo changes of an aborted txn because the
changes were not written to disk.
→ Never have to redo changes of a committed txn because all
the changes are guaranteed to be written to disk at commit
time (assuming atomic hardware writes).
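A minimal sketch of the NO-STEAL + FORCE example, assuming a single page, one writer per object, and an atomic page write. The dictionaries and the `write`, `commit`, and `rollback` helpers are hypothetical stand-ins, not a real buffer pool API.

```python
disk = {"A": 1, "B": 9, "C": 7}  # the single page on disk
buffer_pool = dict(disk)         # in-memory copy of that page
uncommitted = {}                 # txn id -> {object: new value}


def write(txn, obj, value):
    buffer_pool[obj] = value
    uncommitted.setdefault(txn, {})[obj] = value


def commit(txn):
    # FORCE: this txn's changes must be on disk before commit returns.
    # NO-STEAL: other txns' uncommitted changes must stay off disk, so
    # copy the on-disk page and apply only this txn's writes to the copy.
    page_copy = dict(disk)
    page_copy.update(uncommitted.pop(txn, {}))
    disk.clear()
    disk.update(page_copy)  # stands in for one atomic page write


def rollback(txn):
    # None of this txn's changes reached disk, so there is nothing to undo
    # there; only the in-memory values are restored.
    for obj in uncommitted.pop(txn, {}):
        buffer_pool[obj] = disk[obj]


write("T1", "A", 3)
write("T2", "B", 8)
commit("T2")
state_after_commit = dict(disk)
rollback("T1")
```

After `commit("T2")` the disk page holds T2's B=8 but still the old A=1, which is why neither undo nor redo is needed after a crash.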
SHADOW PAGING
Instead of copying the entire database, the DBMS
copies pages on write to create two versions:
→ Master: Contains only changes from committed txns.
→ Shadow: Temporary database with changes made from
uncommitted txns.
To install updates when a txn commits, overwrite
the root so it points to the shadow, thereby
swapping the master and shadow.
[Figure: A master pointer refers to the master page table,
whose entries 1-4 map to pages on disk. Txn T1 runs and
then issues COMMIT.]
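The master/shadow swap can be sketched with two page tables over a shared set of physical slots. This is a toy model under stated assumptions: one txn at a time, in-memory dictionaries instead of disk, and a hypothetical class name `ShadowPagedDB`.

```python
class ShadowPagedDB:
    def __init__(self, pages):
        self.storage = dict(enumerate(pages))       # physical slot -> contents
        self.master = {i: i for i in self.storage}  # logical page -> slot
        self.shadow = None
        self.next_slot = len(self.storage)

    def begin(self):
        # Copy the page table only; the pages themselves are shared.
        self.shadow = dict(self.master)

    def write(self, page_no, contents):
        # Copy-on-write: the first write to a page gets a fresh slot.
        if self.shadow[page_no] == self.master[page_no]:
            self.shadow[page_no] = self.next_slot
            self.next_slot += 1
        self.storage[self.shadow[page_no]] = contents

    def read_committed(self, page_no):
        return self.storage[self.master[page_no]]

    def commit(self):
        old_slots = set(self.master.values())
        # Overwriting the root pointer installs every update at once.
        self.master, self.shadow = self.shadow, None
        for slot in old_slots - set(self.master.values()):
            del self.storage[slot]  # garbage-collect replaced pages

    def abort(self):
        for slot in set(self.shadow.values()) - set(self.master.values()):
            del self.storage[slot]  # drop the shadow copies
        self.shadow = None


db = ShadowPagedDB(["p1", "p2", "p3", "p4"])
db.begin()
db.write(1, "p2-new")
seen_before_commit = db.read_committed(1)
db.commit()
seen_after_commit = db.read_committed(1)
```

Readers of the master see the old page until the pointer swap, and an abort only has to discard the shadow slots.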
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original
page to a separate journal file before overwriting the
master version.
→ Called rollback mode.
After restarting, if a journal file exists, then the DBMS
restores it to undo changes from uncommitted txns.
[Figure: Memory holds Page 1, Page 2, and Page 3. Disk holds
Pages 1 through 6 and a journal file. The journal file
receives the original copy of Page 2 before the modified
page is written back to disk.]
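The journal idea can be illustrated with two dictionaries standing in for the database file and the journal file. This is a simplified sketch, not SQLite's actual code: it ignores fsync ordering and file headers, and the helper names are hypothetical.

```python
db_file = {1: "p1", 2: "p2", 3: "p3"}  # page number -> contents "on disk"
journal = {}                           # original pages saved before overwrite


def write_page(page_no, new_contents):
    # Save the original once, before the first overwrite of this page.
    if page_no not in journal:
        journal[page_no] = db_file[page_no]
    db_file[page_no] = new_contents


def commit():
    # Discarding the journal is what makes the txn's changes final.
    journal.clear()


def recover():
    # A journal that survives a restart means a txn did not finish,
    # so copy the saved originals back over the modified pages.
    for page_no, original in journal.items():
        db_file[page_no] = original
    journal.clear()


write_page(2, "p2-modified")
write_page(3, "p3-modified")
# Simulated crash here: commit() is never called.
recover()
```

After `recover()` the database file is back to its pre-txn contents, which is the undo behavior described above.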
OBSERVATION
Shadow paging requires the DBMS to perform
writes to random non-contiguous pages on disk.
Buffer pool policy trade-offs:
→ Force: Poor response time, but enforces durability of
committed txns.
→ No-Steal: Low throughput, but works for aborted txns.
→ Force + No-Steal: Trivial.
→ Steal (flush an unpinned dirty page even if the updating
txn is active):
   Concern: A stolen and flushed page was modified by an
   uncommitted txn T. If T aborts, how is atomicity enforced?
   Solution: Remember the old value (logs) and use it to UNDO.
WAL PROTOCOL
The DBMS stages all of a txn’s log records in volatile
storage (usually backed by the buffer pool).
WAL PROTOCOL
Write a <BEGIN> record to the log for each txn to
mark its starting point.
WAL – EXAMPLE
[Figure: Schedule for T1: BEGIN, W(A), W(B), ..., COMMIT.
The buffer pool and the disk both start with A=1, B=5, C=7.]
Each update record has the form
<txn, object, before value, after value>.
T1 appends these records to the WAL buffer:
<T1 BEGIN>
<T1, A, 1, 8>
<T1, B, 5, 9>
<T1 COMMIT>
At COMMIT, the WAL buffer is written to the log on disk.
The buffer pool holds A=8, B=9, C=7, while the page on disk
still holds A=1, B=5, C=7.
In the final frame the buffer pool is empty, but
everything we need to restore T1 is in the log!
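The example can be replayed in a few lines. This is a sketch under simplifying assumptions: lists and dictionaries stand in for the log file and pages, records are plain tuples, and the helper names (`begin`, `write`, `commit`, `redo_committed`) are hypothetical.

```python
disk_pages = {"A": 1, "B": 5, "C": 7}  # page contents on disk
disk_log = []                          # durable part of the WAL
buffer_pool = dict(disk_pages)         # in-memory page
wal_buffer = []                        # log records staged in memory


def begin(txn):
    wal_buffer.append((txn, "BEGIN"))


def write(txn, obj, value):
    # Log first: (txn, object, before value, after value).
    wal_buffer.append((txn, obj, buffer_pool[obj], value))
    buffer_pool[obj] = value


def commit(txn):
    wal_buffer.append((txn, "COMMIT"))
    disk_log.extend(wal_buffer)  # flush the log before acknowledging
    wal_buffer.clear()
    # The dirty page is NOT flushed here; it may stay in memory.


def redo_committed(log, pages):
    committed = {rec[0] for rec in log if rec[1:] == ("COMMIT",)}
    for rec in log:
        if len(rec) == 4 and rec[0] in committed:
            _, obj, _before, after = rec
            pages[obj] = after
    return pages


begin("T1")
write("T1", "A", 8)
write("T1", "B", 9)
commit("T1")
buffer_pool.clear()  # simulated crash: volatile memory is lost
recovered = redo_committed(disk_log, dict(disk_pages))
```

Even though the page on disk was never updated, replaying the durable log restores A=8 and B=9.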
WAL – IMPLEMENTATION
Flushing the log buffer to disk every time a txn
commits will become a bottleneck.
[Figure: Schedule with two txns, T1 and T2, whose writes
(including W(C) and W(D)) append records such as
<T2, D, 3, 4> to the in-memory WAL buffer before each COMMIT.
Annotation: Flush after an elapsed amount of time.]
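One common way to avoid a flush per commit is to batch log records from several txns into one write, often called group commit. The sketch below is an assumption-laden illustration: `GroupCommitLog`, `max_records`, and `max_wait` are hypothetical names, time is passed in explicitly to keep the example deterministic, and a list stands in for the log file.

```python
class GroupCommitLog:
    def __init__(self, max_records=4, max_wait=0.005):
        self.buffer = []        # records waiting to be flushed
        self.durable = []       # records already "on disk"
        self.flush_count = 0
        self.max_records = max_records
        self.max_wait = max_wait
        self.oldest = None      # arrival time of the first buffered record

    def append(self, record, now):
        if not self.buffer:
            self.oldest = now
        self.buffer.append(record)
        if len(self.buffer) >= self.max_records:
            self.flush()        # buffer is full

    def tick(self, now):
        # Called periodically: flush once the oldest record waited long enough.
        if self.buffer and now - self.oldest >= self.max_wait:
            self.flush()

    def flush(self):
        # One sequential write covers every txn in the batch.
        self.durable.extend(self.buffer)
        self.buffer.clear()
        self.flush_count += 1


log = GroupCommitLog(max_records=4, max_wait=0.005)
log.append("<T1 COMMIT>", now=0.000)
log.append("<T2, D, 3, 4>", now=0.001)
log.tick(now=0.002)  # too early, nothing is flushed
log.tick(now=0.006)  # timeout elapsed, both records go out in one flush
```

A txn's commit can only be acknowledged after the flush that contains its COMMIT record, so batching trades a little latency for fewer disk writes.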
LOGGING SCHEMES
Physical Logging
→ Record the byte-level changes made to a specific page.
→ Example: git diff
Logical Logging
→ Record the high-level operations executed by txns.
→ Example: UPDATE, DELETE, and INSERT queries.
Physiological Logging
→ Physical-to-a-page, logical-within-a-page.
→ Hybrid approach with byte-level changes for a single tuple
identified by page id + slot number.
→ Does not specify organization of the page.
LOGGING SCHEMES
UPDATE foo SET val = XYZ WHERE id = 1;
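To make the three schemes concrete, here is one possible set of log records for the query above. Every field name and value (page 99, offset 4, slot 1, the `ABC` before-image) is a made-up illustration, not taken from a real system.

```python
# Logical: record the statement itself.
logical = {"txn": "T1", "query": "UPDATE foo SET val = XYZ WHERE id = 1"}

# Physical: exact bytes at an exact offset within a page.
physical = {"txn": "T1", "page": 99, "offset": 4,
            "before": b"ABC", "after": b"XYZ"}

# Physiological: names the page and the slot, not the byte offset,
# so the page is free to reorganize its internal layout.
physiological = {"txn": "T1", "page": 99, "slot": 1,
                 "before": b"ABC", "after": b"XYZ"}


def redo_physical(page_bytes, rec):
    off = rec["offset"]
    page_bytes[off:off + len(rec["after"])] = rec["after"]


def redo_physiological(slots, rec):
    slots[rec["slot"]] = rec["after"]


page_bytes = bytearray(b"....ABC.")  # toy page with the tuple at offset 4
redo_physical(page_bytes, physical)

slots = {0: b"QRS", 1: b"ABC"}       # toy slotted page
redo_physiological(slots, physiological)
```

The logical record is the smallest but requires re-executing the query during recovery, while the other two can be applied directly to the page.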
OBSERVATION
The DBMS's WAL will grow forever.
After a crash, the DBMS must replay the entire log,
which will take a long time.
CHECKPOINTS
Blocking / Consistent Checkpoint Protocol:
→ Pause all queries.
→ Flush all WAL records in memory to disk.
→ Flush all modified pages in the buffer pool to disk.
→ Write a <CHECKPOINT> entry to WAL and flush to disk.
→ Resume queries.
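The five steps above can be sketched as a single method. `MiniDBMS` and its fields are hypothetical stand-ins for the real components, and lists and dictionaries replace the log file and disk pages.

```python
class MiniDBMS:
    def __init__(self):
        self.accepting_queries = True
        self.wal_buffer = []   # log records not yet on disk
        self.wal_disk = []     # durable log
        self.buffer_pool = {}  # page id -> (contents, dirty flag)
        self.disk_pages = {}

    def flush_wal(self):
        self.wal_disk.extend(self.wal_buffer)
        self.wal_buffer.clear()

    def flush_dirty_pages(self):
        for pid, (contents, dirty) in list(self.buffer_pool.items()):
            if dirty:
                self.disk_pages[pid] = contents
                self.buffer_pool[pid] = (contents, False)

    def checkpoint(self):
        self.accepting_queries = False              # 1. pause all queries
        try:
            self.flush_wal()                        # 2. WAL records to disk
            self.flush_dirty_pages()                # 3. modified pages to disk
            self.wal_buffer.append("<CHECKPOINT>")  # 4. log the checkpoint
            self.flush_wal()                        #    and flush it
        finally:
            self.accepting_queries = True           # 5. resume queries


db = MiniDBMS()
db.wal_buffer.append("<T1, A, 1, 2>")
db.buffer_pool["A"] = (2, True)
db.checkpoint()
```

The WAL is flushed before the pages so that no page reaches disk ahead of the log records that describe its changes.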
CHECKPOINTS
Use the <CHECKPOINT> record as the starting point for
analyzing the WAL.
Any txn that committed before the checkpoint is ignored (T1).
T2 + T3 did not commit before the last checkpoint.
→ Need to redo T2 because it committed after checkpoint.
→ Need to undo T3 because it did not commit before the crash.
WAL:
<T1 BEGIN>
<T1, A, 1, 2>
<T1 COMMIT>
<T2 BEGIN>
<T2, A, 2, 3>
<T3 BEGIN>
<CHECKPOINT>
<T2, B, 4, 5>
<T2 COMMIT>
<T3, A, 3, 4>
⋮
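The classification of T1, T2, and T3 can be reproduced with a short scan over the log. This is a sketch only: records are hypothetical tuples, and the `classify` helper assumes a blocking checkpoint as described earlier.

```python
log = [
    ("T1", "BEGIN"), ("T1", "A", 1, 2), ("T1", "COMMIT"),
    ("T2", "BEGIN"), ("T2", "A", 2, 3), ("T3", "BEGIN"),
    ("CHECKPOINT",),
    ("T2", "B", 4, 5), ("T2", "COMMIT"), ("T3", "A", 3, 4),
]


def classify(log):
    # Position of the last checkpoint record in the log.
    ckpt = max(i for i, rec in enumerate(log) if rec == ("CHECKPOINT",))
    began, committed_before, committed_after = set(), set(), set()
    for i, rec in enumerate(log):
        if rec == ("CHECKPOINT",):
            continue
        txn, kind = rec[0], rec[1]
        if kind == "BEGIN":
            began.add(txn)
        elif kind == "COMMIT":
            (committed_before if i < ckpt else committed_after).add(txn)
    return {
        "ignore": committed_before,  # changes already flushed by the checkpoint
        "redo": committed_after,     # committed, but pages may not be on disk
        "undo": began - committed_before - committed_after,  # never committed
    }


result = classify(log)
```

For the log above, `result` puts T1 in `ignore`, T2 in `redo`, and T3 in `undo`, matching the slide.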
CHECKPOINTS – CHALLENGES
In this example, the DBMS must stall txns when it
takes a checkpoint to ensure a consistent snapshot.
→ We will see how to get around this problem next class.
CHECKPOINTS – FREQUENCY
Checkpointing too often causes the runtime
performance to degrade.
→ System spends too much time flushing buffers.
CONCLUSION
Write-Ahead Logging is (almost) always the best
approach to handle loss of volatile storage.
Use incremental updates (STEAL + NO-FORCE)
with checkpoints.
NEXT CLASS
Better Checkpoint Protocols.
Recovery with ARIES.