
4. Database System Recovery

CSEP 545 Transaction Processing for E-Commerce
Philip A. Bernstein
Copyright 2012 Philip A. Bernstein

Outline
1. Introduction
2. Recovery Manager
3. Log-based Recovery
4. Media Failures

1. Introduction
A database may become inconsistent because of:
- Transaction failure (abort)
- Database system failure (possibly caused by an OS crash)
- Media crash (disk-resident data is corrupted)

The recovery system ensures the database contains exactly those updates produced by committed transactions, i.e., atomicity and durability despite failures.

Assumptions
Two-phase locking, holding write locks until after a transaction commits. This implies:
- Recoverability
- No cascading aborts
- Strictness (never overwrite uncommitted data)

Page-level everything (for now):
- Database is a set of pages
- Page-granularity locks
- A transaction's read or write operation operates on an entire page
- We'll look at record granularity later

Storage Model
Stable database - survives system failures.
Cache (volatile) - contains copies of some pages, which are lost by a system failure.

[Diagram: the Cache Manager exposes Fetch, Flush, Pin, Unpin, and Deallocate to its callers; it services Reads and Writes against the cache and moves pages to and from the stable database and log with Read and Write.]

Stable Storage
Write(P) overwrites the entire contents of P on the disk.
If Write is unsuccessful, the error might be detected on the next read ...
- e.g., page checksum error => page is corrupted

... or maybe not:
- Write correctly wrote, but to the wrong location

Write is the only operation that's atomic with respect to failures and whose successful execution can be determined by recovery procedures.

The Cache
Cache is divided into page-sized slots.
Dirty bit tells if the page was updated since it was last written to disk.
Pin count tells the number of Pin ops without Unpins.

    Page   Dirty Bit   Cache Address   Pin Count
    P2     1           91976           1
    P47    0           812             2
    P21    1           10101           0

Fetch(P) - read P into a cache slot. Return slot address.
Flush(P) - if P's slot is dirty and unpinned, then write it to disk (i.e., return after the disk acks).

The Cache (cont'd)


Pin(P) - make P's slot non-flushable and non-replaceable.
- Non-flushable because P's content may be inconsistent.
- Non-replaceable because someone has a pointer into P or is accessing P's content.

Unpin(P) - release it.
Deallocate(P) - allow P's slot to be reused (even if dirty).
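
To make this bookkeeping concrete, here is a minimal cache-manager sketch in Python. All names are illustrative (the slides don't prescribe an interface beyond the operation names), and the disk is modeled as a dict of page contents.

    class Slot:
        """One page-sized cache slot with the bookkeeping described above."""
        def __init__(self, page_id, data):
            self.page_id = page_id
            self.data = data          # in-memory copy of the page
            self.dirty = False        # updated since last written to disk?
            self.pin_count = 0        # Pin ops without matching Unpins
            self.lsn = 0              # used later for the write-ahead log check

    class CacheManager:
        def __init__(self, disk):
            self.disk = disk          # page_id -> page contents (stable storage)
            self.slots = {}           # page_id -> Slot

        def fetch(self, page_id):
            """Read the page into a cache slot and return the slot."""
            if page_id not in self.slots:
                self.slots[page_id] = Slot(page_id, self.disk[page_id])
            return self.slots[page_id]

        def pin(self, page_id):
            self.fetch(page_id).pin_count += 1

        def unpin(self, page_id):
            self.slots[page_id].pin_count -= 1

        def flush(self, page_id):
            """Write the slot to disk only if it is dirty and unpinned."""
            slot = self.slots.get(page_id)
            if slot and slot.dirty and slot.pin_count == 0:
                self.disk[page_id] = slot.data   # return after the disk acks
                slot.dirty = False

        def deallocate(self, page_id):
            self.slots.pop(page_id, None)        # slot is reusable even if dirty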

Big Picture
Record manager is the main user of the cache manager. It calls Fetch(P) and Pin(P) to ensure the page is in main memory, non-flushable, and non-replaceable.

[Diagram: the layered system - Query Optimizer, Query Executor, Access Method (record-oriented files), Page-oriented Files, Database - with the Recovery Manager calling the Cache Manager (Fetch, Flush, Pin, Unpin, Deallocate) above the Page File Manager.]

Latches
A page is a data structure with many fields. A latch is a short-term lock that gives its owner access to a page in main memory. A read latch allows the owner to read the content; a write latch allows the owner to modify the content. The latch is usually a bit in a control structure, not an entry in the lock manager, so it can be set and released much faster than a lock. There's no deadlock detection for latches.
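
A rough sketch of such a latch, assuming one latch per cache slot. A real implementation is a bit (or a word with compare-and-swap) in the slot's control structure; here a mutex-protected state stands in, waiters spin, and, as noted above, nothing detects deadlocks.

    import threading

    class Latch:
        """Short-term read/write latch; much lighter than a lock-manager lock."""
        def __init__(self):
            self._mutex = threading.Lock()
            self._readers = 0          # number of active read-latch holders
            self._writer = False       # is a write latch held?

        def acquire_read(self):
            while True:                # spin until no writer holds the latch
                with self._mutex:
                    if not self._writer:
                        self._readers += 1
                        return

        def release_read(self):
            with self._mutex:
                self._readers -= 1

        def acquire_write(self):
            while True:                # spin until the latch is completely free
                with self._mutex:
                    if not self._writer and self._readers == 0:
                        self._writer = True
                        return

        def release_write(self):
            with self._mutex:
                self._writer = False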

The Log
A sequential file of records describing updates:
- Address of updated page
- Id of transaction that did the update
- Before-image and after-image of the page

Whenever you update the cache, also update the log.
Log records for Commit(Ti) and Abort(Ti).
Some older systems separated before-images and after-images into separate log files.
If opi conflicts with and executes before opk, then opi's log record must precede opk's log record.
- Recovery will replay operations in log-record order.
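
A sketch of the log-record layout just described, with page-granularity before/after images. The field names (and the prev_lsn back-chain, which appears on a later slide) are illustrative.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class UpdateRecord:
        lsn: int                  # log sequence number (position in the log)
        txn_id: int               # id of the transaction that did the update
        page_id: str              # address of the updated page
        before_image: bytes       # page contents before the update (for undo)
        after_image: bytes        # page contents after the update (for redo)
        prev_lsn: Optional[int]   # back-chain to this txn's previous record

    @dataclass
    class CommitRecord:
        lsn: int
        txn_id: int

    @dataclass
    class AbortRecord:
        lsn: int
        txn_id: int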

The Log (cont'd)


To update records on a page:

    Fetch(P)              -- read P into cache
    Pin(P)                -- ensure P isn't flushed
    write lock(P)         -- for two-phase locking
    write latch(P)        -- get exclusive access to P
    update P              -- update P in cache
    log the update to P   -- append it to the log
    unlatch(P)            -- release exclusive access
    Unpin(P)              -- allow P to be flushed
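
A sketch of this sequence, reusing the CacheManager and Latch sketches above; the lock_manager and log objects are assumed, not defined by the slides.

    def update_page(txn_id, page_id, new_data, cache, latches, lock_manager, log):
        slot = cache.fetch(page_id)                 # read P into cache
        cache.pin(page_id)                          # ensure P isn't flushed
        lock_manager.write_lock(txn_id, page_id)    # 2PL: held until after commit
        latches[page_id].acquire_write()            # exclusive access to P
        try:
            before = slot.data
            slot.data = new_data                    # update P in cache
            slot.dirty = True
            slot.lsn = log.append_update(txn_id, page_id, before, new_data)
        finally:
            latches[page_id].release_write()        # release exclusive access
            cache.unpin(page_id)                    # allow P to be flushed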

2. Recovery Manager
Processes Commit, Abort, and Restart.

Commit(T)
- Write T's updated pages to stable storage atomically, even if the system crashes.

Abort(T)
- Undo the effects of T's writes.

Restart = recover from system failure
- Abort all transactions that were not committed at the time of the previous failure.
- Fix stable storage so it includes all committed writes and no uncommitted ones (so it can be read by new txns).

Recovery Manager Model


[Diagram: transactions 1..N issue Commit, Abort, and Restart to the Recovery Manager; the Recovery Manager calls the Cache Manager (Read, Write, Fetch, Pin, Unpin, plus Flush and Deallocate for normal operation; Restart uses Fetch, Pin, Unpin); the Cache Manager reads and writes the cache, the stable database, and the log.]

Implementing Abort(T)
Suppose T wrote page P.
- If P was not transferred to stable storage, then deallocate its cache slot.
- If it was transferred, then P's before-image must be in stable storage (else you couldn't undo after a system failure).

Undo Rule - Do not flush an uncommitted update of P until P's before-image is stable. (Ensures undo is possible.)
Write-Ahead Log Protocol - Do not flush P until P's before-image is in the log.

Avoiding Undo
Avoid the problem implied by the Undo Rule by never flushing uncommitted updates.
- Avoids stable logging of before-images.
- Don't need to undo updates after a system failure.

A recovery algorithm requires undo if an update of an uncommitted transaction can be flushed.
- Usually called a steal algorithm, because it allows a dirty cache page to be stolen.

Implementing Commit(T)
Commit must be atomic. So it must be implemented by a disk write.
Suppose T wrote P, T committed, and then the system fails. P must be in stable storage.
Redo Rule - Don't commit a transaction until the after-images of all pages it wrote are in stable storage (in the database or log). (Ensures redo is possible.)
- Often called the Force-At-Commit rule.

Avoiding Redo
To avoid redo, flush all of T's updates to the stable database before it commits. (They must be in stable storage.)
- Usually called a Force algorithm, because updates are forced to disk before commit.
- It's easy, because you don't need stable bookkeeping of after-images.
- But it's inefficient for hot pages. (Consider TPC-A/B.)

Conversely, a recovery algorithm requires redo if a transaction may commit before all of its updates are in the stable database.

Avoiding Undo and Redo?


To avoid both undo and redo:
- never flush uncommitted updates (to avoid undo), and
- flush all of T's updates to the stable database before it commits (to avoid redo).

Thus, it requires installing all of a transaction's updates into the stable database in one write to disk.
It can be done (use shadow paging), but it isn't efficient for short transactions and record-level updates.

Implementing Restart
To recover from a system failure:
- Abort transactions that were active at the failure.
- For every committed transaction, redo updates that are in the log but not the stable database.
- Resume normal processing of transactions.

Idempotent operation - many executions of the operation have the same effect as one execution. Restart must be idempotent: if it's interrupted by a failure, then it re-executes from the beginning.
Restart contributes to unavailability. So make it fast!

3. Log-based Recovery
Logging is the most popular mechanism for implementing recovery algorithms. The recovery manager implements:
- Commit - by writing a commit record to the log and flushing the log (satisfies the Redo Rule).
- Abort - by using the transaction's log records to restore before-images.
- Restart - by scanning the log and undoing and redoing operations as necessary.

The algorithms are fast since they use sequential log I/O in place of random database I/O. They greatly affect TP and Restart performance.

Implementing Commit
Every commit requires a log flush.
- If you can do K log flushes per second, then K is your maximum transaction throughput.

Group Commit Optimization - when processing commit, if the last log page isn't full, delay the flush to give it time to fill (see the sketch below).

If there are multiple data managers on a system, then each data manager must flush its log to commit.
- If each data manager isn't using its log's full update bandwidth, then a shared log saves log flushes.
- A good idea, but rarely supported commercially.
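
A sketch of group commit. The timing constant and interface are invented for illustration; the point is that every commit arriving while the first one is waiting shares a single log flush.

    import threading, time

    class GroupCommitLog:
        def __init__(self, delay=0.005):
            self._mutex = threading.Lock()
            self._pending = []        # (txn_id, event) awaiting a flush
            self._delay = delay       # time to let the last log page fill

        def commit(self, txn_id):
            flushed = threading.Event()
            with self._mutex:
                first = not self._pending
                self._pending.append((txn_id, flushed))
            if first:                 # first committer waits, then flushes
                time.sleep(self._delay)
                self._flush()
            flushed.wait()            # commit returns only after the log flush

        def _flush(self):
            with self._mutex:
                batch, self._pending = self._pending, []
            # ... one sequential write makes the whole batch's records stable ...
            for _, flushed in batch:
                flushed.set()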

Implementing Abort
To implement Abort(T), scan T's log records and install before-images.
To speed up Abort, back-chain each transaction's update records.

[Diagram: a table of transaction descriptors maps each transaction Ti to its last log record; each of Ti's update records (e.g., for pages Pk and Pm) carries a backpointer to Ti's previous record, ending in a null pointer at Ti's first log record.]
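
A sketch of Abort(T) using the back-chain, with the UpdateRecord fields from the earlier sketch and a log indexable by LSN. (This simple version just reinstalls before-images; the CLR refinement comes later.)

    def abort(txn_id, last_lsn, log, cache):
        """Follow T's backpointers from its last record, undoing as we go."""
        lsn = last_lsn
        while lsn is not None:
            rec = log[lsn]
            slot = cache.fetch(rec.page_id)
            slot.data = rec.before_image    # reinstall the before-image
            slot.dirty = True
            lsn = rec.prev_lsn              # back-chain to the previous record
        log.append_abort(txn_id)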

Satisfying the Undo Rule


To implement the Write-Ahead Log Protocol, tag each cache slot with the log sequence number (LSN) of the last update record to that slot's page.

    Page   Dirty Bit   Cache Address   Pin Count   LSN
    P47    1           812             2           (points into log)
    P21    1           10101           0           (points into log)

[Diagram: each slot's LSN points into the log, which runs from Start to End; a prefix of the log is on disk, the rest in main memory.]

The cache manager won't flush a page P until P's last update record, pointed to by the LSN, is on disk.
- P's last log record is usually stable before Flush(P), so this rarely costs an extra flush.
- The LSN must be updated while the latch is held on P's slot.
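
A sketch of the check inside Flush(P). The flushed_lsn attribute (the largest LSN known to be stable) is an assumed part of the log interface.

    def flush(page_id, cache, log):
        slot = cache.slots.get(page_id)
        if slot is None or not slot.dirty or slot.pin_count > 0:
            return
        if slot.lsn > log.flushed_lsn:   # P's last update record isn't stable
            log.flush()                  # rare: the log is usually ahead
        cache.disk[page_id] = slot.data
        slot.dirty = False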

Implementing Restart (rev 1)


Assume undo and redo are required. Scan the log backwards, starting at the end.
- How do you find the end?

Construct a commit list and a recovered-page list during the scan (assuming page-level logging):
- Commit(T) record => add T to the commit list.
- Update record for P by T: if P is not in the recovered-page list, then add P to the recovered-page list; if T is in the commit list, redo the update, else undo the update.
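
A sketch of this first-cut Restart. Records are assumed to carry a kind tag plus the fields from the earlier log-record sketch, and the iterator yields them newest-first.

    def restart_rev1(records_newest_first, disk):
        committed, recovered = set(), set()
        for rec in records_newest_first:     # backward scan from the end
            if rec.kind == "commit":
                committed.add(rec.txn_id)
            elif rec.kind == "update" and rec.page_id not in recovered:
                recovered.add(rec.page_id)   # only P's newest record matters
                if rec.txn_id in committed:
                    disk[rec.page_id] = rec.after_image    # redo
                else:
                    disk[rec.page_id] = rec.before_image   # undo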

Checkpoints
Problem - prevent Restart from scanning back to the start of the log.
A checkpoint is a procedure to limit the amount of work for Restart.

Cache-consistent checkpointing:
- Stop accepting new update, commit, and abort operations.
- Make a list of [active transaction, pointer to last log record].
- Flush all dirty pages.
- Append a checkpoint record to the log; include the list.
- Resume normal processing.

The database and log are now mutually consistent.
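
A sketch of the checkpoint procedure, reusing the flush sketch above; quiescing update, commit, and abort processing around the call is assumed to happen elsewhere.

    def cache_consistent_checkpoint(cache, log, active_txns):
        # active_txns: txn_id -> LSN of that transaction's last log record
        for page_id in list(cache.slots):
            flush(page_id, cache, log)            # flush every dirty page
        log.append_checkpoint(dict(active_txns))  # checkpoint record + list
        log.flush()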

Restart Algorithm (rev 2)


There's no need to redo records before the last checkpoint, so:
- Starting with the last checkpoint, scan forward in the log.
- Redo all update records. Process all aborts.
- Maintain a list of active transactions (initialized to the content of the checkpoint record).
- After you're done scanning, abort all active transactions.

Restart time is proportional to the amount of log after the last checkpoint.
Reduce restart time by checkpointing frequently. Thus, checkpointing must be cheap.

Fuzzy Checkpointing
Make checkpoints cheap by avoiding synchronized flushing of dirty cache at checkpoint time:
- Stop accepting new update, commit, and abort operations.
- Make a list of all dirty pages in cache.
- Make a list of [active transaction, pointer to last log record].
- Append a checkpoint record to the log; include the lists.
- Resume normal processing.
- Initiate a low-priority flush of all dirty pages.
- Don't checkpoint again until all of the last checkpoint's dirty pages are flushed.

Restart begins at the second-to-last (penultimate) checkpoint.
Checkpoint frequency depends on disk bandwidth.

Operation Logging
Record locking requires (at least) record logging.
- Suppose records x and y are on page P: w1[x] w2[y] abort1 commit2 is not strict with respect to pages.

Record logging requires Restart to read a page before updating it. This reduces log size.
Further reduce log size by logging a description of an update, not the entire before/after image of the record:
- Only log the after-image of an insertion.
- Only log the fields being updated.

Now Restart can't blindly redo. E.g., it must not insert a record twice.

LSN-based logging
Each database page P's header has the LSN of the last log record whose operation updated P.
Restart compares the log record's LSN and the page's LSN before redoing the log record's update U:
- Redo the update only if LSN(P) < LSN(U).

Undo is a problem: if U's transaction aborts and you undo U, what LSN do you put on the page?
Suppose T1 and T2 update records x and y on P: w1[x] w2[y] c2 a1. What LSN does a1 put on P?
- Not the LSN before w1[x], which says w2[y] didn't run.
- Not w2[y]'s LSN, which says w1[x] wasn't aborted.
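
The redo test itself is tiny. In a real system the page LSN lives in P's header; here a dict stands in for illustration.

    def maybe_redo(rec, disk, page_lsn):
        """Redo U only if the page has not already seen it."""
        if page_lsn.get(rec.page_id, 0) < rec.lsn:    # LSN(P) < LSN(U)
            disk[rec.page_id] = rec.after_image
            page_lsn[rec.page_id] = rec.lsn           # advance the page's LSN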

LSN-based logging (cont'd)


w1[x] w2[y] c2 a1 - what LSN does a1 put on P? Why not use a1's LSN?
- You must latch all of T1's updated pages before logging a1.
- Else, some w3[z] on P could be logged after a1 but be executed before a1, leaving a1's LSN on P instead of w3[z]'s.

Logging Undos
Log the undo(U) operation, and use its LSN on P.
- CLR = Compensation Log Record = a logged undo.
- Do this for all undos (during normal abort or recovery).

This preserves the invariant that the LSN on each page P exactly describes P's state relative to the log: P contains all updates to P up to and including the LSN on P, and no updates with larger LSNs.
So every aborted transaction's log is a palindrome of update records and undo records.
Restart processes Commit and Abort the same way:
- It redoes the transaction's log records.
- It only aborts active transactions after the forward scan.
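
A sketch of one undo step with a CLR. The append_clr call is an assumed log-interface method; its essential output is the new LSN that goes on the page.

    def undo_with_clr(rec, cache, log):
        """Undo one update record, logging the undo and stamping its LSN on P."""
        slot = cache.fetch(rec.page_id)
        clr_lsn = log.append_clr(rec.txn_id, rec.page_id,
                                 image=rec.before_image)
        slot.data = rec.before_image
        slot.lsn = clr_lsn        # the page now exactly reflects the log
        slot.dirty = True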

Logging Undos (cont'd)


Tricky issues:
- Multi-page updates (it's best to avoid them).
- Restart grows the log by logging undos. Each time it crashes, it has more log to process.

Optimization - each CLR points to the transaction's log record preceding the corresponding do:
- Splices out undone work.
- Avoids undoing undone work during abort.
- Avoids growing the log due to aborts during Restart.

[Diagram: DoA1 ... DoB1 ... DoC1 ... UndoC1 ... UndoB1 ..., where UndoC1 points back before DoC1 and UndoB1 points back before DoB1.]

Restart Algorithm (rev 3)


- Starting with the penultimate checkpoint, scan forward in the log.
- Maintain a list of active transactions (initialized to the content of the checkpoint record).
- Redo an update record U for page P only if LSN(P) < LSN(U).
- After you're done scanning, abort all active transactions. Log undos while aborting. Log an abort record when you're done aborting.

This style of record logging, logging undos, and replaying history during restart was popularized in the ARIES algorithm by Mohan et al. at IBM, published in 1992.
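
Putting the pieces together, here is a sketch of this rev of Restart in the style of the earlier sketches (record kinds, the checkpoint record's active-transaction list, and the undo_with_clr helper are all as assumed above).

    def restart_rev3(log, checkpoint_lsn, cache):
        active = dict(log[checkpoint_lsn].active_txns)   # txn_id -> last LSN
        for rec in log.scan_forward(checkpoint_lsn):     # redo phase: replay history
            if rec.kind in ("update", "clr"):
                active[rec.txn_id] = rec.lsn
                slot = cache.fetch(rec.page_id)
                if slot.lsn < rec.lsn:                   # LSN(P) < LSN(U)
                    slot.data = rec.after_image
                    slot.lsn = rec.lsn
                    slot.dirty = True
            elif rec.kind in ("commit", "abort"):
                active.pop(rec.txn_id, None)
        for txn_id, last_lsn in active.items():          # undo phase: abort losers
            lsn = last_lsn
            while lsn is not None:
                rec = log[lsn]
                if rec.kind == "update":
                    undo_with_clr(rec, cache, log)       # each undo is logged
                lsn = rec.prev_lsn
            log.append_abort(txn_id)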

Analysis Pass
Log a flush record after a flush occurs (to avoid redo).
To improve redo efficiency, pre-analyze the log:
- Requires accessing only the log, not the database.

Build a Dirty Page Table that contains the list of dirty pages and, for each page, the oldestLSN that must be redone:
- Flush(P) says to delete P from the Dirty Page Table.
- Write(P) adds P to the Dirty Page Table, if it isn't there.
- Include the Dirty Page Table in checkpoint records.
- Start at the last checkpoint record and scan forward, building the table.

Also build the list of active transactions with their lastLSNs.
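
A sketch of the analysis pass. It touches only the log: the checkpoint record seeds the tables, and flush records prune the Dirty Page Table.

    def analysis_pass(log, checkpoint_lsn):
        ckpt = log[checkpoint_lsn]
        dirty = dict(ckpt.dirty_page_table)     # page_id -> oldestLSN to redo
        active = dict(ckpt.active_txns)         # txn_id -> lastLSN
        for rec in log.scan_forward(checkpoint_lsn):
            if rec.kind == "update":
                dirty.setdefault(rec.page_id, rec.lsn)  # first write = oldestLSN
                active[rec.txn_id] = rec.lsn
            elif rec.kind == "flush":
                dirty.pop(rec.page_id, None)    # P is clean on disk again
            elif rec.kind in ("commit", "abort"):
                active.pop(rec.txn_id, None)
        redo_start = min(dirty.values(), default=checkpoint_lsn)
        return dirty, active, redo_start        # redo begins at redo_start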

Analysis Pass (cont'd)


Start redo at the oldest oldestLSN in the Dirty Page Table:
- Then scan forward in the log, as usual.
- Only redo records that might need it, that is, those where LSN(redo record) >= oldestLSN, hence there's no later flush record.

Also use the Dirty Page Table to guide page prefetching:
- Prefetch pages in oldestLSN order in the Dirty Page Table.

Logging B-Tree Operations


To split a page:
- log the records deleted from the first page (for undo)
- log the records inserted into the second page (for redo)
- they're the same records, so log them once!

Even so, this doubles the amount of log used for inserts:
- log the inserted data when the record is first inserted
- if a page holds N records, log N/2 records every time a page is split, which occurs once for every N/2 insertions

User-level Optimizations
- If checkpoint frequency is controllable, then run some experiments.
- Partition the DB across more disks to reduce restart time (if Restart is multithreaded).
- Increase the resources (e.g., cache) available to the restart program.

Shared Disk System


Two processes A and B can each cache a page P when they write-lock different records on it (e.g., A locks r2, B locks r7). Only one process at a time can have write privilege.
- Use a global lock manager.

When setting a write lock on P, a process may need to refresh its cached copy from disk (if another process recently updated P).
- Use a version number on the page and in the lock.

Shared Disk System (cont'd)


- When a process sets the lock, it tells the lock manager the version number of its cached page.
- A process increments the version number the first time it updates a cached page.
- When a process is done with an updated page, it flushes the page to disk and then increments the version number in the lock.

Need a shared log manager, possibly with local caching in each machine.

4. Media Failures
A media failure is the loss of some of stable storage.
Most disks have MTBF over 10 years. Still, if you have 10 disks ...
So shadowed disks are important:
- Writes go to both copies. Handshake between the Writes to avoid common failure modes (e.g., power failure).
- Service each read from one copy.

To bring up a new shadow:
- Copy tracks from the good disk to the new disk, one at a time.
- A Write goes to both disks if the track has been copied.
- A Read goes to the good disk, until the track is copied.

RAID
RAID - redundant array of inexpensive disks:
- Use an array of N disks in parallel.
- A stripe is an array of the ith block from each disk.
- Each stripe is partitioned into M data blocks and N-M error-correction blocks.

Each stripe is one logical block, which can survive a single-disk failure.
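
For the common single-parity case (N-M = 1), the error-correction block is just the XOR of the data blocks, and any one lost block is the XOR of the survivors. A small self-contained example:

    def xor_blocks(blocks):
        """XOR equal-length blocks together, byte by byte."""
        out = bytes(len(blocks[0]))
        for block in blocks:
            out = bytes(a ^ b for a, b in zip(out, block))
        return out

    data = [b"abcd", b"wxyz", b"1234"]     # M = 3 data blocks in one stripe
    parity = xor_blocks(data)              # the single error-correction block

    # One disk fails: rebuild its block from the surviving blocks plus parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]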

Where to Use Disk Redundancy?


Preferably for both the DB and the log. But at least for the log:
- In an undo algorithm, it's the only place that has certain before-images.
- In a redo algorithm, it's the only place that has certain after-images.

If you don't shadow the log, it's a single point of failure.

Archiving
An archive is a database snapshot used for media recovery: load the archive and redo the log.
To take an archive snapshot:
- write a start-archive record to the log
- copy the DB to an archive medium
- write an end-archive record to the log (or simply mark the archive as complete)

So the end-archive record says that all updates before the start-archive record are in the archive.
You can use the standard LSN-based Restart algorithm to recover an archive copy relative to the log.
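
A sketch of taking the snapshot, with the bracketing records as described. Normal logging continues during the copy, which is why recovery later replays the log starting at the start-archive record.

    def take_archive_snapshot(log, db_pages, archive):
        log.append(("start-archive",))
        log.flush()
        for page_id, data in db_pages.items():   # copy the DB to archive medium
            archive[page_id] = data
        log.append(("end-archive",))   # asserts: every update before
        log.flush()                    # start-archive is in the archive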

Archiving (cont'd)
To archive the log, use two pairs of shadowed disks: dump one pair to the archive (e.g., tape) while using the other pair for on-line logging (i.e., ping-pong to avoid disk contention).
Optimization - only archive committed pages, and purge undo information from the log before archiving.
To do an incremental archive, use an archive bit in each page:
- Each page update sets the bit.
- To archive, copy pages with the bit set, then clear it.

To reduce media recovery time:
- rebuild the archive from incremental copies
- partition the log to enable fast recovery of a few corrupted pages
