This document describes the Practical Byzantine Fault Tolerance (PBFT) algorithm for replicating services over unreliable networks where some nodes may fail arbitrarily. It explains the problem PBFT aims to solve, outlines the system model and failure assumptions, and details the three-phase protocol used to reach agreement on the order of requests across replicas in the presence of faulty nodes.
Practical Byzantine Fault Tolerance
Appears in the Proceedings of the Third Symposium on Operating Systems Design and Implementation, New Orleans, USA, February 1999
The Problem
• Provide a reliable answer to a computation even in the presence of Byzantine faults.
• A client would like to:
  • Transmit a request
  • Wait for k replies
  • Conclude that the answer is correct

The Model
• Networks are unreliable
  • Can delay, reorder, drop, or retransmit messages
• Some fraction of nodes are unreliable
  • May behave in any way, and need not follow the protocol
• Nodes can verify the authenticity of messages

Failures
• The system requires 3f+1 nodes to withstand f failures
• All f faulty nodes may fail to respond, so a node can only wait for n-f replies
• But there is no guarantee that those n-f replies all come from good nodes, so the good replies must outnumber the bad ones
• This holds if n-2f > f, i.e. n > 3f

Nodes
• Maintain state:
  • Log
  • View number
  • Service state
• Can perform a set of operations
  • Need not be simple reads/writes
  • Must be deterministic
• Well-behaved nodes must:
  • Start in the same state
  • Execute requests in the same order

Views
• Operations occur within views
• For a given view, a particular node is designated the primary node, and the others are backup nodes
• Primary = v mod n
  • n is the number of nodes
  • v is the view number

Protocol
• A three-phase protocol
  • Pre-prepare: the primary proposes an order (a sequence number)
  • Prepare: backups agree on the sequence number
  • Commit: replicas agree to commit

Agreement
• Quorum based
  • 2f+1 nodes must have the same value
  • The system has 3f+1 nodes
  • Any two subsets of 2f+1 nodes have at least one good node in common
  • Good nodes don't lie
  • So every node that reaches a quorum reaches the same decision

Messages
• The following messages are used by the protocol, and are signed by the sender
• Request <o,t,c> (called m)
  • Sent from the client to the primary
  • Contains: operation, timestamp, and client #
• Reply <v,t,c,i,r>
  • Sent from each replica to the client
  • Contains: view #, timestamp, client #, replica #, and result
• Pre-prepare <v,n,d>, m
  • Multicast from the primary to the backups
  • Contains: view #, sequence #, and the digest of m
  • The request message m may be sent separately

Messages (cont.)
• Prepare <v,n,d,i>
  • Sent amongst the backups
• Commit <v,n,d,i>
  • Replica i is prepared to commit sequence # n in view v
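As a concrete illustration of the sizing rules and message formats above, here is a minimal Python sketch. It is not the paper's implementation; the names (quorum_size, primary_for_view, Request, PrePrepare, Prepare, Commit) and the choice of SHA-256 for the digest are assumptions made for this example.

from dataclasses import dataclass
import hashlib


def quorum_size(f: int) -> int:
    # With n = 3f + 1 replicas, any two sets of 2f + 1 replicas overlap in at
    # least f + 1 replicas, so they share at least one non-faulty replica.
    return 2 * f + 1


def primary_for_view(v: int, n: int) -> int:
    # The primary for view v is replica number v mod n.
    return v % n


def digest(payload: bytes) -> str:
    # Digest carried in pre-prepare/prepare/commit (SHA-256 is an assumption).
    return hashlib.sha256(payload).hexdigest()


@dataclass
class Request:        # <o, t, c>: operation, timestamp, client #
    o: bytes
    t: float
    c: int


@dataclass
class PrePrepare:     # <v, n, d>: view #, sequence #, digest of the request
    v: int
    n: int
    d: str


@dataclass
class Prepare:        # <v, n, d, i>: i is the sending replica's number
    v: int
    n: int
    d: str
    i: int


@dataclass
class Commit:         # <v, n, d, i>: replica i is prepared to commit (n, v)
    v: int
    n: int
    d: str
    i: int


if __name__ == "__main__":
    f = 1                              # tolerate one Byzantine fault
    n = 3 * f + 1                      # 4 replicas in total
    print(quorum_size(f))              # 3
    print(primary_for_view(5, n))      # replica 1 is the primary in view 5
    print(digest(b"op")[:8])           # digest prefix of a request payload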
Accepting messages
• Messages are accepted in each phase only if:
  • The node is currently in view v
  • The sequence number n is within a certain range
  • The node has not received contradictory messages
  • The digest matches the computed digest of the message

Pre-prepare
• The client sends a request to the primary
• The primary assigns a sequence number to the request and multicasts a pre-prepare message
• Backups:
  • Receive the pre-prepare message
  • Validate it and drop it if invalid
  • Record the request, the pre-prepare message, and a newly generated prepare message in the log
  • Multicast the prepare message to the other backups

Prepare
• A prepare message indicates a backup's willingness to accept a given sequence number
• Once a quorum of prepare messages is received, a commit message is sent

Commit
• Nodes must ensure that enough nodes have prepared before applying the changes, so:
  • A node waits for a quorum of commit messages before applying a change (see the sketch after these slides)
  • Changes are applied in order of sequence number
    • A change cannot be applied until all lower-numbered requests have been applied

Truncating the log
• Checkpoints are taken at regular intervals
  • Requests are either in the log or already reflected in a stable checkpoint
• Each node maintains multiple copies of state:
  • A copy of the last proven checkpoint
  • Zero or more unproven checkpoints
  • The current working state
• A node sends a checkpoint message when it generates a new checkpoint
  • The checkpoint is proven when a quorum agrees on it
  • Then this checkpoint becomes stable
  • The log is truncated and old checkpoints are discarded

View change
• The view change mechanism protects against faulty primaries
• Backups propose a view change when a timer expires
  • The timer runs whenever a backup has accepted some message and is waiting to execute it
• Once a view change is proposed, the backup will no longer do work (except checkpointing) in the current view

View change 2
• A view change message contains:
  • The sequence # of the highest request in the stable checkpoint
  • And the checkpoint messages proving it
  • A pre-prepare message for each non-checkpointed request
  • And proof that it was prepared
• The new primary declares a new view when it receives a quorum of view change messages

New view
• The new primary computes:
  • The maximum checkpointed sequence number
  • The maximum sequence number not yet checkpointed
• It constructs new pre-prepare messages for the uncheckpointed requests
  • Either a new pre-prepare for a request in the new view
  • Or a no-op pre-prepare, so there are no gaps in the sequence

New view 2
• The new primary sends a new view message
  • Contains all the view change messages
  • And all the computed pre-prepare messages
• Recipients verify:
  • The pre-prepare messages
  • That they have the latest checkpoint
    • If not, they can fetch a copy
• Each recipient sends a prepare message for each pre-prepare
• And enters the new view

Controlling View Changes
• Avoid moving through views too quickly
• Nodes will wait longer before the next view change if:
  • No useful work was done in the previous view (i.e. only re-execution of previous requests)
  • Or enough nodes accepted the change, but no new view was declared
• If a node gets f+1 view change requests with a higher view number:
  • It will send its own view change with the minimum of those view numbers
  • This is safe, because at least one non-faulty replica sent such a message
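The sketch below illustrates, under simplifying assumptions, how a single replica might track the prepare and commit quorums described in the slides above. Authentication, the sequence-number range check, checkpointing, and networking are assumed to be handled elsewhere; ReplicaState and its method names are illustrative rather than the paper's API.

from collections import defaultdict


class ReplicaState:
    def __init__(self, f: int, view: int):
        self.f = f
        self.view = view
        self.quorum = 2 * f + 1
        self.pre_prepared = {}              # n -> digest accepted from the primary
        self.prepares = defaultdict(set)    # (n, digest) -> ids of replicas that prepared
        self.commits = defaultdict(set)     # (n, digest) -> ids of replicas that committed
        self.executed_up_to = 0             # highest sequence number already applied

    def on_pre_prepare(self, v, n, d):
        # Accept only in the current view and only if no conflicting digest
        # has already been logged for sequence number n.
        if v == self.view and self.pre_prepared.get(n, d) == d:
            self.pre_prepared[n] = d
            return True                     # caller should multicast a prepare message
        return False

    def on_prepare(self, v, n, d, i):
        if v != self.view or self.pre_prepared.get(n) != d:
            return False
        self.prepares[(n, d)].add(i)
        # Prepared once a quorum of 2f + 1 replicas agree on (v, n, d).
        return len(self.prepares[(n, d)]) >= self.quorum   # caller multicasts a commit

    def on_commit(self, v, n, d, i):
        if v != self.view:
            return False
        self.commits[(n, d)].add(i)
        return len(self.commits[(n, d)]) >= self.quorum    # safe to apply, in order

    def try_execute(self, n, d, apply_fn):
        # Apply in sequence-number order only: n must be the next unexecuted request.
        if n == self.executed_up_to + 1 and len(self.commits[(n, d)]) >= self.quorum:
            apply_fn(n)
            self.executed_up_to = n
            return True
        return False


if __name__ == "__main__":
    f = 1
    replica = ReplicaState(f, view=0)
    d = "digest-of-request-1"
    replica.on_pre_prepare(0, 1, d)
    for sender in (0, 1, 2):                # 2f + 1 = 3 matching prepare messages
        replica.on_prepare(0, 1, d, sender)
    for sender in (0, 1, 2):                # 2f + 1 = 3 matching commit messages
        replica.on_commit(0, 1, d, sender)
    replica.try_execute(1, d, lambda n: print("executed request", n))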
Nondeterminism
• The model requires that requests be deterministic
• But this is not always the case
  • E.g. updating a timestamp using the current clock
• Two solutions:
  • Let the primary propose a value
    • Create a <value, message> pair and proceed as before
  • Allow the backups to select values
    • Wait for 2f+1 values
    • Start the three-phase protocol

Optimizations
• Don't send f+1 full replies back to the client
  • Instead send f digests and 1 full result (see the sketch at the end of this document)
  • If they don't match, retry with the old protocol
• Tentative execution
  • After the prepare phase, a backup may tentatively execute the request
  • The client waits for a quorum of tentative replies; otherwise it retries and waits for f+1 replies
• Read-only requests
  • Clients multicast directly to the replicas
  • Replicas execute the request, wait until no tentative requests are pending, and return the result
  • The client waits for a quorum of results

Implementation
• The protocol is implemented in a replication library
  • No mechanism to change views
• Uses upcalls to allow servers to:
  • Invoke requests (client)
  • Execute requests
  • Create and delete checkpoints
  • Retrieve checkpoints
  • Compute digests (of checkpoints)

Implementation 2
• Communication
  • UDP for point-to-point communication
  • UDP multicast for group communication

Micro-benchmark
• Compares a service that executes a no-op
  • Single server vs. replicated using the protocol

BFS
• An implementation of NFS using the replication library
• Looks like normal NFS to clients
• The replication library runs requests via a relay
• The server maintains filesystem state in memory-mapped files

BFS 2
• The server maintains at most 2 checkpoints
  • Using copy-on-write
• Digests are computed incrementally
  • For efficiency
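As a small illustration of the digest-reply optimization above, the sketch below shows a client accepting a result once the single full reply and f digest-only replies match. The function names and the use of SHA-256 are assumptions for this example, not the library's interface.

import hashlib


def result_digest(result: bytes) -> str:
    return hashlib.sha256(result).hexdigest()


def accept_reply(full_result: bytes, digest_replies: list[str], f: int):
    # Accept the result only if it is vouched for by at least f + 1 matching
    # replies: the single full reply plus f matching digest-only replies.
    d = result_digest(full_result)
    matching = 1 + sum(1 for r in digest_replies if r == d)
    return full_result if matching >= f + 1 else None   # None -> retry with the old protocol


if __name__ == "__main__":
    f = 1
    result = b"file contents"
    digests = [result_digest(b"file contents")]   # one matching digest from another replica
    print(accept_reply(result, digests, f))       # b'file contents' -> accepted (2 >= f + 1)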