Tutorial on BFT State Machine Replication (T1 BFTSMR)
Alysson Bessani
EuroSys 2012
Summary
• Part 1: The Basics
o State machine replication
o Potential applications
o 5 fundamental results on distributed systems
o Paxos/Viewstamped replication
o Castro & Liskov's PBFT
• Part 2: BFT Literature Review
o Improving performance
o Improving resource efficiency
o Improving robustness
• Part 3: Applications, Open Problems & Practice
o BFT Applications
o Open problems on BFT
o BFT-SMaRt
o Practice: a BFT (in-memory) KV store
Part I
The Basics
EuroSys 2012
Replication
• Replication is a technique used for performance
and/or fault tolerance
Passive Replication
• Also called Primary-Backup (PB) or master-slave
• Clients talk to the primary, which sends the operations and checkpoints to the backups
o Sometimes backup replicas answer read-only operations
• If the primary crashes, one of the backups takes over
[Figure: clients send op1 and op2 to the primary, which forwards the operations and periodic checkpoints to the backups]
Active Replication
• Also called State Machine Replication – SMR
(Schneider, ACM CS 1990)
• All servers execute the same set of operations in the
same order (servers are always “synchronized”)
• Clients wait for the first reply (crash faults)
[Figure: clients multicast op1 and op2 to all replicas, which execute them in the same order]
SMR Requirements
• Initial state: all replicas start in the same state (easy!)
• Coordination: all replicas receive the same sequence of inputs (requires a total order multicast)
System Properties
• Safety: all servers execute the same sequence of
requests
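To make these requirements concrete, here is a minimal sketch (mine, not from the tutorial) of a replica applying a totally ordered stream of requests to a deterministic state machine; the StateMachine interface and all names are illustrative.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative interface: any deterministic service can be replicated this way.
interface StateMachine {
    byte[] apply(byte[] request); // must be deterministic (no clocks, no randomness)
}

class Replica {
    private final StateMachine service;
    // Filled by the total order multicast layer: every replica sees the same sequence.
    private final BlockingQueue<byte[]> orderedRequests = new LinkedBlockingQueue<>();

    Replica(StateMachine service) { this.service = service; }

    void deliver(byte[] request) { orderedRequests.add(request); }

    void run() throws InterruptedException {
        while (true) {
            byte[] request = orderedRequests.take(); // same order at every replica
            byte[] reply = service.apply(request);   // same state transition everywhere
            sendReplyToClient(reply);
        }
    }

    private void sendReplyToClient(byte[] reply) { /* network code omitted */ }
}

Since every replica starts in the same state and applies the same sequence of deterministic operations, all correct replicas stay synchronized.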
System Models:
BFT SMR Assumptions
• Faults
o How many faulty servers and clients does the system tolerate? Of what type (e.g., crash, crash-recovery, Byzantine)?
• Time
o Are time assumptions needed (e.g., upper bounds on message and execution times, synchronized clocks)?
• Connectivity
o Are all processes connected?
o Are the communication links reliable? Authenticated?
• Cryptography
o What cryptography assumptions are needed?
• Architecture
o Homogeneous or heterogeneous?
Consensus ≡ Total Order Multicast
[Figure: total order multicast built from a reliable multicast followed by consensus on the set of messages to deliver (processes p1–p4)]
• Why does it work? Every process decides the same set of messages
• Conversely, an atomic broadcast (total order multicast) protocol can be used to solve consensus
[Figure: consensus built from a total order multicast of the proposals (processes p1–p4)]
• Why does it work? The decision is the first message delivered to every process
• This equivalence holds in most system models
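As an illustration of the second direction of this equivalence, the sketch below (assumed interfaces, not any particular library) solves consensus on top of a total order multicast: each process multicasts its proposal and decides the first delivered value, which is the same at every correct process.

import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Assumed total order multicast interface.
interface TotalOrderMulticast {
    void multicast(byte[] message);            // send to all processes
    void onDeliver(Consumer<byte[]> handler);  // deliveries happen in the same order everywhere
}

class ConsensusFromTOM {
    private final TotalOrderMulticast tom;
    private final CompletableFuture<byte[]> decision = new CompletableFuture<>();

    ConsensusFromTOM(TotalOrderMulticast tom) {
        this.tom = tom;
        // complete() keeps only the first delivered value; since the delivery order is
        // the same at every correct process, all of them decide the same value.
        tom.onDeliver(decision::complete);
    }

    void propose(byte[] value) { tom.multicast(value); }

    byte[] decide() throws Exception { return decision.get(); }
}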
Impossibility of Fault-Tolerant Consensus in Asynchronous Systems (FLP)
• Why?
o It is impossible to differentiate a faulty process from a merely slow one
[Figure: p1 proposes v = 1 and p2 proposes v = 0; if p1 and p2 receive nothing from p3, they cannot decide between 0 and 1]
Minimum Synchrony
required for FT Consensus
• Result:
Fault-tolerant consensus can be solved in the eventually synchronous
system model
• Why?
o The system is asynchronous but has the notion of time
o After some point, the system will become synchronous (bounded but
unknown communication and processing delays)
o If the algorithm keeps trying (always ensuring safety) and increasing the
timeout values, it will be able to solve consensus
[Figure: rounds 0 and 1 among processes p1–p4 with rotating coordinators: the coordinator of round 0 has T seconds to enforce its value, the coordinator of round 1 has 1.5T seconds, and so on (timeouts increase at each new round)]
Fault Thresholds
• State Machine Replication has two phases
o Ordering → consensus requirements

                    Crash    Byzantine
  Synchronous        f+1     3f+1 / f+1*
  Non-synchronous    2f+1    3f+1
  (* using signatures)

o Execution → voting requirements

                    Crash    Byzantine
                     f+1      2f+1
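A small helper (my own illustration) encoding the non-synchronous thresholds above, handy as a sanity check when sizing a deployment:

// Illustrative helper for the non-synchronous rows of the tables above.
final class Thresholds {
    // Minimum replicas for ordering (consensus) in a non-synchronous system.
    static int orderingReplicas(int f, boolean byzantine) {
        return byzantine ? 3 * f + 1 : 2 * f + 1;
    }

    // Minimum replicas for execution (voting on results).
    static int executionReplicas(int f, boolean byzantine) {
        return byzantine ? 2 * f + 1 : f + 1;
    }

    public static void main(String[] args) {
        // Tolerating f = 1 Byzantine fault: 4 replicas to order, 3 to execute.
        System.out.println(orderingReplicas(1, true) + " ordering / "
                + executionReplicas(1, true) + " execution replicas");
    }
}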
Viewstamped Replication
[Figure: normal case: client c sends a Request to the leader (replica 0); the leader sends Prepare to the backups, collects PrepareOk replies, then executes the request and sends the Reply]
• Requests are executed only after a majority of the replicas have them in their logs
• This ensures the request will remain visible even if the leader fails
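A minimal sketch (illustrative names; the real protocol also tracks view numbers and log positions) of the leader-side rule: execute and reply only after a majority of replicas, the leader included, has acknowledged the Prepare.

import java.util.HashSet;
import java.util.Set;

// Illustrative leader-side bookkeeping for one log position.
class PendingRequest {
    private final int n;                       // total number of replicas
    private final Set<Integer> acks = new HashSet<>();

    PendingRequest(int n, int leaderId) {
        this.n = n;
        acks.add(leaderId);                    // the leader already has the request in its log
    }

    // Called when a PrepareOk arrives from a backup.
    boolean onPrepareOk(int replicaId) {
        acks.add(replicaId);
        return acks.size() > n / 2;            // majority reached: safe to execute and reply
    }
}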
Viewstamped Replication
[Figure: view change: replicas that suspect the old leader (replica 0) send DoViewChange to the next leader (replica 1), which sends StartView to all replicas]
• If a replica suspects the leader, it sends a message to the next leader
• If the next leader receives f+1 such messages, it synchronizes the replica logs and starts a new view
PBFT (Castro & Liskov, OSDI'99)
• Cryptography
o PK signatures (used here to simplify the protocol presentation)
o MACs (each pair of processes shares a key)
o Digests (hashes)
• Algorithm outline:
o System evolves in views, numbered sequentially. In each view v, one server is the primary, the others are the backups: primary(v) = v mod N
o Client multicasts a signed request to all servers
o Servers reach agreement about the sequence number of the request
• The primary proposes the sequence number for each request
• The backups confirm that the primary follows the protocol
o If the primary fails, there is a view change
o Client waits for at least f+1 replies with the same result (at least one
correct server executed the operation and produced the result)
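A sketch of the client-side vote (illustrative; real clients also match request identifiers and handle timeouts): a result is accepted once f+1 replicas return the same value, since at least one of them is correct.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative client-side reply voter.
class ReplyVoter {
    private final int f;
    private final Map<String, Integer> counts = new HashMap<>();

    ReplyVoter(int f) { this.f = f; }

    // Returns the reply once f+1 identical replies were received, null otherwise.
    byte[] onReply(byte[] reply) {
        String key = Arrays.toString(reply);      // value-based key for counting
        int c = counts.merge(key, 1, Integer::sum);
        return c >= f + 1 ? reply : null;         // at least one correct replica sent this value
    }
}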
• Pre-prepare phase:
o primary receives a correctly signed request m
o It assigns a sequence number n to the message and sends this number, a
digest of request D(m) and its current view number to all backups (other
replicas) in a PRE-PREPARE message
o backup replicas receive the message and test its validity, i.e., whether n has not been assigned to another request and whether the message belongs to view v
o If a replica has m and a valid PRE-PREPARE for it, it proceeds to the prepare
phase (m is pre-prepared)
• Prepare phase:
o replicas store the received PRE-PREPARE message
o each replica sends a PREPARE message to other replicas containing v, n and
the digest D(m) of the message
o all replicas that receive 2f PREPARE messages from other replicas with the same v, n and D(m) proceed to the commit phase
o when a replica finishes the prepare phase for m, we say that m is prepared on
this replica
• Commit phase:
o each replica multicasts a COMMIT message containing v and n
o the request m to which n was assigned is executed when:
• a replica receives 2f COMMIT messages with the same v and n from other replicas
• all requests with sequence numbers lower than n have been executed
o when replica i finishes the commit phase, we say that m is committed at i
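The prepared and committed conditions can be summarized as two quorum checks per (v, n) slot; the sketch below is illustrative and omits message authentication, digests and log management.

import java.util.HashSet;
import java.util.Set;

// Illustrative per-(view, sequence number) quorum bookkeeping for PBFT.
class PbftSlot {
    private final int f;
    private final Set<Integer> prepares = new HashSet<>(); // senders of matching PREPAREs
    private final Set<Integer> commits  = new HashSet<>(); // senders of matching COMMITs
    private boolean prePrepared = false;

    PbftSlot(int f) { this.f = f; }

    void onPrePrepare()           { prePrepared = true; }
    void onPrepare(int replicaId) { prepares.add(replicaId); }
    void onCommit(int replicaId)  { commits.add(replicaId); }

    // m is prepared: a valid PRE-PREPARE plus 2f matching PREPAREs from other replicas.
    boolean prepared() { return prePrepared && prepares.size() >= 2 * f; }

    // m is committed locally: prepared plus 2f matching COMMITs from other replicas
    // (execution additionally waits until all lower sequence numbers are executed).
    boolean committed() { return prepared() && commits.size() >= 2 * f; }
}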
PBFT: Checkpoint
• Every protocol message is only accepted (and logged) if its assigned sequence number falls within a certain interval delimited by two values: h and H = h + L (the maximum log size)
• Periodically (every K request executions), the replicas exchange CHECKPOINT messages to advance h and H by K
• CHECKPOINT messages contain a digest of the system's state before the checkpoint and the sequence number n of the last request executed to reach this state (n mod K = 0)
• Replicas store 2f+1 CHECKPOINT messages as a proof that no other checkpoint for n is possible
o any two sets of 2f+1 replicas out of 3f+1 intersect in at least f+1 replicas ((2f+1) + (2f+1) = 4f+2 > 3f+1), so even with f Byzantine replicas at least one correct replica is in both, and it would never certify two different checkpoints for n
• All messages regarding requests with sequence numbers smaller than n can then be discarded from the log
• Late replicas can update themselves by fetching states that can be proved correct with 2f+1 CHECKPOINT messages
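A sketch of the watermark logic described above (illustrative; checkpoint proofs and state digests are omitted):

// Illustrative log-window (watermark) management.
class CheckpointWindow {
    private final int L;   // maximum log size (H = h + L)
    private final int K;   // checkpoint period, in executed requests
    private long h = 0;    // low watermark: sequence number of the last stable checkpoint

    CheckpointWindow(int logSize, int checkpointPeriod) { this.L = logSize; this.K = checkpointPeriod; }

    // A protocol message for sequence number n is only accepted and logged inside the window.
    boolean accepts(long n) { return n > h && n <= h + L; }

    // Called once 2f+1 matching CHECKPOINT messages for sequence number n were collected.
    void onStableCheckpoint(long n) {
        if (n % K == 0 && n > h) {
            h = n;  // advance the window; log entries with sequence number <= n can be discarded
        }
    }
}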
PBFT: View Change
• A backup replica triggers the view change protocol if some message m stays pending for more than a certain time limit (its request timeout expires)
• At this point, the replica stops accepting messages for view v and sends a VIEW-CHANGE message containing:
o the next view number v+1
o the sequence number n of its last stable checkpoint
o a set C of 2f+1 signed CHECKPOINT messages that validate n
o a set P of messages prepared at i in views v' ≤ v
o a set Q of messages pre-prepared at i in views v' ≤ v
• each backup replica that receives the NEW-VIEW message obtains the VIEW-CHANGE messages used to build it
o it may already have them, or it can fetch them from other replicas
• with these messages, each <message, sequence number> assignment contained in the NEW-VIEW message can be verified (using the same procedure the new primary used to choose these assignments)
o if some assignment is invalid, a VIEW-CHANGE for v+2 is sent to all replicas
o otherwise, a PREPARE message is sent for each assignment and the protocol resumes its normal operation, as if the assignment were a PRE-PREPARE message
PBFT: Optimizations I
• One of the key contributions of PBFT is its set of optimizations
• Digest replies
o Instead of all replicas sending the full reply to a request, the client chooses just one replica to send it; the others send only a digest of the reply, to allow voting
o If the received reply turns out to be wrong, the client asks for the full reply from other replicas
• Batching
o Instead of running the agreement protocol for every single request, it can be run for sets of requests (batches)
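A sketch of batching at the primary (illustrative; real implementations also bound how long a request may wait): requests that arrive while an agreement instance is running are grouped and ordered together in the next instance.

import java.util.ArrayList;
import java.util.List;

// Illustrative request batching at the primary.
class Batcher {
    private final int maxBatchSize;
    private final List<byte[]> pending = new ArrayList<>();

    Batcher(int maxBatchSize) { this.maxBatchSize = maxBatchSize; }

    synchronized void onRequest(byte[] request) { pending.add(request); }

    // Called when the previous agreement instance finishes: one instance orders the whole batch.
    synchronized List<byte[]> nextBatch() {
        int size = Math.min(pending.size(), maxBatchSize);
        List<byte[]> batch = new ArrayList<>(pending.subList(0, size));
        pending.subList(0, size).clear();
        return batch;
    }
}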
References
• Schneider. Implementing Fault-Tolerant Services using the
State Machine Approach: a Tutorial. ACM Computing
Surveys 1990.
• Lamport. The Part-time Parliament. ACM TOCS 1998
• Oki & Liskov. Viewstamped Replication: A New Primary
Copy Method to Support Highly-Available Distributed
Systems. PODC’88
• Burrows. The Chubby Lock Service for loosely-coupled
distributed systems. OSDI’06
• Baker et al. Megastore: Providing Scalable, Highly
Available Storage for Interactive Services. CIDR’11
• Hunt et al. ZooKeeper: Wait-free Coordination for
Internet-scale Systems. USENIX’10
References
• Bolosky et al. Paxos Replicated State Machines as the
Basis of a High-Performance Data Store. NSDI’11
• Rao et al. Using Paxos to Build a Scalable, Consistent,
and Highly Available Datastore. VLDB’11
• Calder et al. Windows Azure Storage: A Highly Available
Cloud Storage Service with Strong Consistency. SOSP’11
• Castro & Liskov. Practical Byzantine Fault Tolerance.
OSDI’99
• Castro & Liskov. Practical Byzantine Fault Tolerance and
Proactive Recovery. ACM TOCS 2002
• Liskov. From Viewstamped Replication to Byzantine Fault
Tolerance. Replication: Theory and Practice, 2010
Part II
BFT Literature Review
EuroSys 2012
Outline
• Improving BFT performance
• Robust BFT protocols
• Architectural hybridization
• Implementation techniques
• Complementary techniques for BFT
Note: there are other papers and other aspects, but this is my
selection given the time constraints we have
Improving BFT
Performance
• PBFT performance is competitive with crash fault-
tolerant systems, and in some cases even with non-
replicated systems
• However, in the expected common situation where
o there are no faults
o the system is synchronous
o there is no concurrency
it is possible to do significantly better, as the following protocols show
Improving BFT
Performance
• Since PBFT publication, several works tried to
improve its performance
• Q/U – Query/Update (Abd-El-Malek et al, SOSP’05)
o “Pure” quorum-based protocol that works in asynchronous systems
o Advantages:
• Improves the fault scalability of the system, i.e., the throughput of the
system does not drop dramatically when f increases
• Operations require only two communication steps (best case)
o Drawbacks:
• Sacrifices Liveness (Obstruction-freedom instead of Wait-freedom):
operations only terminate if there is no write contention on the object
• Requires n ≥ 5f +1
Improving BFT
Performance
• HQ-Replication (Cowling et al, OSDI’06)
o Combines quorum-based protocols with PBFT
• If there is no concurrency, executes a (f-dissemination BQS) write
protocol to change the system state
• If concurrency is detected, start PBFT to order concurrent requests
o Same advantages as Q/U, with the same liveness guarantees as PBFT, and using only 3f+1 replicas
[Figure: HQ normal case, taking 1 or 2 communication steps]
Zyzzyva: Speculative BFT (Kotla et al, SOSP'07)
[Figure: fast case: the client sends a REQUEST, the primary orders it with ORDER-REQ, and the replicas speculatively execute the request in the order given by the primary, sending SPEC-RESPONSE directly to the client]
[Figure: commit phase: if the client cannot gather enough matching SPEC-RESPONSEs, it sends a COMMIT message; replicas that see that 2f+1 replicas match some history commit it and answer with LOCAL-COMMIT]
[Figure: view change: a malicious primary sends different ORDER-REQ messages to different replicas; the resulting conflicting responses form a proof of misbehavior (POM), which triggers a view change]
Aliph: The Next 700 BFT Protocols (Guerraoui et al, EuroSys'10)
• Abstract is a nice idea that really simplifies the design of optimistic state machine replication: each protocol instance only handles its own “common case” and aborts otherwise
• Aliph composes three Abstract instances with static switching: Quorum, Chain and Backup (a PBFT-like protocol); initially Quorum is active, on abort the system switches to Chain, then to Backup, which commits k requests before switching back to Quorum, and so on
o Quorum (latency-optimal): commits in 2 one-way message delays using only 3f+1 replicas (Q/U has the same latency but needs 5f+1), but does not tolerate contention
o Chain (throughput-optimal): replicas are organized in a pipeline and authenticate messages with lightweight chain authenticators (CAs, at most f+1 MACs instead of 3f+1); with batching, the number of MAC operations at the bottleneck replica tends to 1
o Backup: commits requests as long as there are at most f Byzantine replicas, regardless of contention, asynchrony or Byzantine clients
[Figure: Table 2 of the paper, comparing state-of-the-art BFT protocols (number of replicas, MAC operations at the bottleneck replica, latency), and the communication patterns of Quorum (Fig. 4) and Chain (Fig. 5)]
Robust BFT Protocols
• By robust, we mean:
o The system maintains stable performance even when under attack by f malicious replicas and an unbounded number of malicious clients
Spinning
• A protocol built upon PBFT, with a modification based on a simple idea:
o PBFT's problem is that a malicious primary can keep ordering requests very slowly without triggering view changes
o So, why not change the view after each message commit?
o In this way, the sequence number of a message matches exactly the view number in which it is delivered
• Potential problem:
o The view change protocol is complex and costly
o But this is not a problem here: the view change happens deterministically after every committed message, so no special protocol is needed to change the primary
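The rotation rule itself is trivial; an illustrative sketch of how the primary can be derived directly from the sequence/view number:

// Illustrative primary selection: in Spinning the view changes after every commit,
// so the primary of the i-th ordered message is simply replica i mod n.
final class RotatingPrimary {
    static int primaryOf(long sequenceNumber, int n) {
        return (int) (sequenceNumber % n);
    }

    public static void main(String[] args) {
        int n = 4;
        for (long seq = 0; seq < 8; seq++) {
            System.out.println("message " + seq + " ordered by replica " + primaryOf(seq, n));
        }
    }
}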
Spinning
• Example execution of Spinning:
o first request is ordered by s1, which is the primary of view v
o second request is ordered by s2, which is the primary of v+1
o …
Spinning: Performance
• Does changing the primary improve or degrade performance in fault-free executions?
Spinning: Performance
• What happens when latency is injected by a faulty primary?
[Figure: throughput vs. amount of delay injected, starting from the no-delay case; Spinning's performance degrades much more slowly than PBFT's, because malicious primaries can only degrade the performance of the system in f out of n protocol executions]
Spinning: Issues
• Without the repair procedure of view changes, how do replicas recover from a malicious primary in some view?
o Merge operation: joins one or more faulty views (i.e., views with faulty primaries) with a correct view (i.e., one with a correct primary)
o The idea is very similar to PBFT's view change: the new correct primary reads the state of the system and resumes ordering requests while ensuring the protocol invariants
Architectural
Hybridization
• Motivation: BFT in Homogeneous Systems is Expensive
[Figure: communication patterns of PBFT and Zyzzyva]
Architectural
Hybridization
• Is it possible to do better?
1- Less than 3f+1 replicas to tolerate f Byzantine faults?
• Homogeneous non-synchronous systems require 3f+1 replicas
History: Trusted
Components and BFT
• (Correia et al, SRDS’02) BFT Reliable Multicast using TTCB,
a distributed real-time trusted component
• (Correia et al, SRDS’04) BFT SMR with 2f+1 replicas using a
distributed trusted component
• (Chun et al, SOSP’07) PBFT with 2f+1 replicas using a
complex local trusted component (A2M)
• (Levin et al, NSDI'09) A2M reduced to a simple secure counter (TrInc), which can be built using a TPM chip
• (Veronese et al, DI-FCUL TR 2008, TC 2011) MinBFT shows that with a trusted counter one can reduce BFT SMR to viewstamped replication/Paxos
• (Kapitza et al, EuroSys'12) BFT SMR with only f+1 active replicas, using a trusted counter efficiently implemented in an FPGA
PBFT vs. MinBFT
[Figure: message patterns from client request to reply: PBFT uses four replicas and three phases (pre-prepare, prepare, commit); MinBFT uses three replicas and two phases (prepare, commit)]
Benefits of MinBFT
• 2f+1 instead of 3f+1 replicas (minimal for general SMR)
• 2 steps instead of 3 in the normal case (minimal for consensus)
• USIG is arguably a minimal trusted component
[Figure: MinBFT view change: a new view is installed after a replica receives f+1 VIEW-CHANGE messages]
• Practical effects:
o A primary replica cannot send two PREPARE messages with different requests and the same sequence number (UI)
o A backup replica cannot lie about the value proposed by the primary
HMAC-based USIG
[Figure: replicas 0, 1 and 2 each keep the shared secret key SK inside their trusted component; a message m is certified as (m, HMAC(m))]
• Both createUI and verifyUI require access to the trusted component
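A minimal sketch of such a USIG under the assumptions above (a monotonic counter plus an HMAC computed inside the trusted component over the counter and the message); key distribution and certificate formats are ignored, and all names are illustrative.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative HMAC-based USIG: createUI/verifyUI over a monotonic counter.
class Usig {
    private final SecretKeySpec key;   // shared secret, assumed to live inside the trusted component
    private long counter = 0;          // monotonic counter, never decremented or reused

    Usig(byte[] secret) { this.key = new SecretKeySpec(secret, "HmacSHA1"); }

    // UI = (counter value, HMAC(counter || message))
    synchronized byte[][] createUI(byte[] message) throws Exception {
        counter++;
        return new byte[][] { ByteBuffer.allocate(8).putLong(counter).array(), hmac(counter, message) };
    }

    // Verifies that the UI was produced by a USIG holding the same key for this counter value.
    boolean verifyUI(long counterValue, byte[] tag, byte[] message) throws Exception {
        return Arrays.equals(tag, hmac(counterValue, message));
    }

    private byte[] hmac(long counterValue, byte[] message) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(key);
        mac.update(ByteBuffer.allocate(8).putLong(counterValue).array());
        return mac.doFinal(message);
    }
}

Because the counter only moves forward and is bound to the message by the HMAC, a faulty primary cannot assign the same identifier to two different requests, which is exactly the equivocation the practical effects above rule out.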
USIG Performance:
VM x TPM
• TPM USIG
o Signature: 797 ms
o Only one counter increment every 3.5 seconds
o 32-bit monotonic counter
• VM USIG
Implementation
Techniques
• BASE (Castro et al, TOCS 2003)
o Define useful abstractions to implement diverse BFT services
• Parallel execution of requests (Kotla & Dahlin, DSN'04)
o Some service requests do not require totally ordered execution (e.g., writes to different files of a file system) and can be executed in parallel
o May improve the throughput of certain services (e.g., a distributed FS)
Separating Agreement/Execution Architecture
• Servers are separated into two layers: agreement and execution
• Clients sign requests; the agreement replicas verify the signatures
• 3f+1 replicas agree on the request sequence numbers and 2f+1 replicas execute the requests
[Figure: clients sign requests; the agreement cluster verifies them and forwards ordered requests to the execution cluster, which executes them and sends the replies]
Agreement/Execution Problem
• In data centers, clients are usually also servers... and they have to be fast (generating signatures is very costly)
o E.g., web services (BFT clients) access a BFT database (the execution layer)
• These web service hosts need to serve lots of clients (high throughput), and they are paid for by the service provider
[Figure: Internet clients access the web service hosts, which in turn act as BFT clients of the replicated service]
UpRight Architecture
• A new layer needs to be deployed to avoid client signatures: the request quorum (RQ)
• Servers in this layer store the request and generate a matrix signature to be ordered by the agreement layer
• The execution layer fetches the request from the RQ after it is ordered, executes it and sends the reply
[Figure: 1. the client sends the request to the RQ; 2. the RQ sends the request hash + matrix signature to the agreement layer; 3. the agreement layer produces the request hash + sequence number; 4. execution replicas fetch the ordered request from the RQ]
UpRight Remarks
• Number of faults tolerated (u = total failures tolerated for liveness, r = commission failures tolerated for safety):
o Request quorum: n_r ≥ 2u + r + 1
o Ordering: n_o ≥ 2u + r + 1
o Execution: n_e ≥ u + r + 1
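A small sanity-check helper (my own illustration) encoding these bounds:

// Illustrative helper for the UpRight layer sizes listed above
// (u = failures tolerated for liveness, r = commission failures tolerated for safety).
final class UpRightSizes {
    static int requestQuorum(int u, int r) { return 2 * u + r + 1; }
    static int ordering(int u, int r)      { return 2 * u + r + 1; }
    static int execution(int u, int r)     { return u + r + 1; }

    public static void main(String[] args) {
        // u = r = 1 gives the classic BFT case: 4 ordering and 3 execution replicas.
        System.out.println(ordering(1, 1) + " ordering, " + execution(1, 1) + " execution");
    }
}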
ZZ Architecture
• Key observation: in fault-free executions, f+1 execution replicas are enough for the execution layer
• In server consolidation scenarios, the extra f execution replicas can be dormant VMs, woken up only when a fault is detected
[Figure: the agreement/execution architecture with the extra execution replicas kept as dormant ("zzz") virtual machines]
Complementary Techniques
for BFT: Fault Recovery
• Problem with tolerating f faults:
o If an intelligent adversary is able to compromise f machines, then, given enough time, he/she will compromise f+1 (or more)
o Solution: Proactive Recovery (Castro & Liskov, TOCS 2002)
• Replicas (compromised or not) are cleaned periodically
• PR requires a local trusted real-time component
o Otherwise, it may be vulnerable to certain attacks (Sousa et al, DSN’05)
o Most proactive recovery systems are vulnerable (Sousa et al, HotDep’06)
• To ensure availability you may also need 2k extra
replicas if at most k recover at the same time
Outdated …
Complementary Techniques
for BFT: Diversity
• f-fault-tolerant replicated systems are useful only
if faults are not correlated
• It usually requires diverse replicas
o Different administrative domains
o N-version programming (effective?)
o Obfuscation, Memory randomization (effective?)
o Use of different components like databases (Gashi et
al, TDSC 2007), file systems (Castro et al, TOCS 2003)
and operating systems (Garcia et al, DSN’11) is
effective!
• What about deploying and managing diversity?
References
• Abd-El-Malek et al. Fault-scalable Byzantine Fault-
tolerant Services. SOSP’05
• Cowling et al. HQ-Replication: a Hybrid Quorum Protocol
for Byzantine Fault Tolerance. OSDI’06
• Kotla et al. Zyzzyva: Speculative Byzantine Fault
Tolerance. ACM TOCS 2009 (prel. SOSP’07)
• Guerraoui et al. The Next 700 BFT Protocols. EuroSys’10
• Amir et al. Byzantine Replication Under Attack. IEEE TDSC 2011 (prel. DSN'08)
• Moniz et al. RITAS: Services for Randomized Intrusion
Tolerance. IEEE TDSC 2011
• Veronese et al. Spin One’s Wheels? Byzantine Fault
Tolerance with a Spinning Primary. SRDS’09
References
• Clement et al. Making Byzantine Fault-tolerant Systems
Tolerate Byzantine faults. NSDI’09
• Martin & Alvisi. Fast Byzantine Paxos. IEEE TDSC 2007
• Veríssimo. Travelling through Wormholes: a new look at
Distributed System Models. SIGACT News 2006
• Correia et al. Hybrid Byzantine-resilient Reliable Multicast.
SRDS’02
• Correia et al. How to Tolerate Half less One Byzantine
Faults in Practical Distributed Systems. SRDS’04
• Chun et al. Attested append-only memory: Making
adversaries stick to their word. SOSP’07
• Levin et al. TrInc: Small Trusted Hardware for Large
Distributed Systems. NSDI’09
References
• Veronese et al. Efficient Byzantine Fault Tolerance. IEEE TC 2011, to appear (prel. DI-FCUL Tech. Report 2008)
• Kapitza et al. CheapBFT: Resource-efficient Byzantine Fault Tolerance. EuroSys'12
• Castro et al. BASE: Using Abstractions to Improve Fault
Tolerance. ACM TOCS 2003
• Kotla & Dahlin. High-throughput Byzantine Fault
Tolerance. DSN’04
• Distler & Kapitza. Increasing Performance in Byzantine
Fault-Tolerant Systems with On-Demand Replica
Consistency. EuroSys’11
• Yin et al. Separating Agreement from Execution in
Byzantine Fault-tolerant Services. SOSP’03
References
• Clement et al. UpRight Cluster Services. SOSP’09
• Wood et al. ZZ and the Art of BFT Execution.
EuroSys’11
• Sousa et al. How resilient are distributed f fault/
intrusion-tolerant systems? DSN’05
• Sousa et al. Hidden Problems of Asynchronous
Proactive Recovery. HotDep’07
• Gashi et al. Fault tolerance via diversity for off-the-
shelf products: a study with SQL database servers.
IEEE TDSC 2007
• Garcia et al. OS Diversity for Intrusion tolerance:
Myth or Reality? DSN’11
Other Aspects
Wide-area replication
• Wester et al. Tolerating Latency in Replicated State Machines Through
Client Speculation. NSDI’09
• Mao et al. Towards Low Latency State Machine Replication for Uncivil
Wide-area Networks. HotDep’09
• Amir et al. STEWARD: Scaling Byzantine Fault-Tolerant Replication to Wide-
Area Networks. IEEE TDSC 2010
• Veronese et al. EBAWA: Efficient Byzantine Agreement for Wide-Area
Networks. HASE’10
Weak consistency & others
• Li & Mazières. Beyond One-third Faulty Replicas in Byzantine Fault Tolerant
Systems. NSDI’07
• Singh et al. Zeno: Eventually Consistent Byzantine-Fault Tolerance. NSDI’09
• Sen et al. Prophecy: Using History for High-Throughput Fault Tolerance.
NSDI’10
• Bessani et al. Active Quorum Systems. HotDep’10
Part III
Applications, Open Problems & Practice
EuroSys 2012
BFT Applications
• Distributed File Systems
o BFS (Castro & Liskov, TOCS 2002), BASEFS (Castro et al, TOCS 2003)
o Oceanstore (Kubiatowicz et al, ASPLOS’00), Farsite (Adya et al, OSDI’02)
o UR-HDFS (Clement et al, SOSP'09)
• Database replication
o Commit Barrier Scheduling (Vandiver et al, SOSP’07)
o Byzantium (Garcia et al, EuroSys’11)
• Coordination Service
o DepSpace (Bessani et al, EuroSys’08)
o UR-Zookeeper (Clement et al, SOSP’09)
• Naming Services
o DNS (Cachin & Samar, DSN’04)
o LDAP (FCUL, unpublished)
Intrusion-tolerant Systems
• Definition
An intrusion-tolerant system is a replicated system
in which a malicious adversary needs to
compromise more than f out of n components in
less than T time units in order to make it fail.
Comments:
• Similar to BFT with proactive recovery
• T and f make little sense without previous requirements
• Other definitions are possible
Open Problems on BFT
• 3 solved
• 2 half-solved
• 5 open
Solved Problem:
Performance
1990s: first implementations with useful performance appear (Rampart, SecureRing)
1999: Castro & Liskov's PBFT
2000s: PBFT-like protocols with better performance under certain favorable conditions
[Figure: timeline; goals achieved: minimal latency, maximal throughput]
Solved Problem:
Resource Efficiency
• Separating agreement from execution
o 3f+1 replicas for ordering requests
o 2f+1 replicas for executing requests
o f+1 exec. replicas may be sufficient with VMs
• Trusted components (e.g., TPM)
o Agreement with 2f+1 replicas (instead of 3f+1)
• Result (from PBFT to MinBFT): minimal number of replicas, communication steps and trusted component
Solved Problem:
Recovery
• Problem with tolerating f faults:
o If an intelligent adversary is able to compromise f machines, then, given enough time, he/she will compromise f+1 (or more)
Half-solved Problem:
Diversity
• f-fault-tolerant replicated systems are
useful only if faults are not correlated
• It usually requires diverse replicas
o Different administrative domains
o N-version programming (effective?)
o Obfuscation, Memory randomization
(effective?)
o Use of different components like databases,
file systems, operating systems is effective!
• What about deploying diversity?
Half-solved Problem:
Robust Performance of BFT
• BFT replication is
o very efficient in favorable conditions
o very inefficient in unfavorable conditions
• What about a balance?
o efficient enough in most conditions
• Design principles (Prime, Aardvark, AQS)
o No complex optimizations
o Use public-key crypto if needed
o Exploit application semantics for optimizations
Open Problems:
Intrusion Reaction
• Most BFT protocols only tolerate faults and
don't take actions against malicious replicas
(other than what is required for correctness)
• In practice, replica behavior needs to be
monitored and recovery actions need to be
executed if intrusions are detected
• Research question: Given the specification
of a protocol, how to automatically detect
misbehaviors and react to them?
Open Problems:
Time-bounded State Transfer
• Recall that the window of vulnerability of an
intrusion-tolerant system is bounded by T
o Every T time units all replicas are rejuvenated
o Every replica must take no more than T/n time units to
recover itself, i.e., take the following steps:
• Shutdown
• Choose a clean (and different) OS image
• Boot
• Fetch and validate service state
• Research question: How to bound the last step?
Open Problems:
Diversity Management
• Research question: Assume we have a pool
of diverse configurations for the system
replicas, how to choose the best set?
o The idea is to minimize the number of shared
vulnerabilities/bugs among any two replicas
o This is even more complicated if replicas change
at runtime
• Besides that, diversity means management
of complexity. How to deal with it?
Open Problems:
Confidential Operation
[Figure: clients issue store(k,v) and read(k) operations on the replicated servers]
Open Problems:
Graceful Degradation
• Our intrusion tolerance definition is very strict (all-or-
nothing)
• Research question: How to specify degraded
behaviors for intrusion tolerant systems in general?
• Examples: What if …
o … there are more than f faulty replicas?
o … the system is completely asynchronous?
execute(command){
//change state
reply = invoke(command);
return reply;
}
BFT-SMaRt
• Started in 2006, as a Byzantine Paxos
implementation on the Neko simulator
• Later extended to be the replication layer of
DepSpace (Bessani et al, EuroSys’08)
• Currently used/maintained by researchers in
Portugal, Brazil and Germany
• Sponsored by:
BFT-SMaRt
• BFT-SMaRt design principles:
o Java-based (for security and correctness reasons)
o No optimizations that bring complexity
o Modularity
o Features: Extensible API, State Management, Reconfiguration
Modularity
BFT-SMaRt Replica Architecture
[Figure: replica architecture: requests are received from and replies sent to secure sockets; client signatures are verified on arrival; the protocol core runs the ordering protocol (steps 1-8); timers trigger regency (leader) changes; ordered operations are delivered for execution]
BFT-SMaRt Software
• It is a library (.jar file) that must be linked with the
client and the servers…
• There is no service/component that must be
deployed or managed besides the BFT client and
server
• Available at http://code.google.com/p/bft-smart/
• Current version: 0.7
o Many disruptive features are being integrated in the code
o API changes will happen
o Bugs remain
o Any help is welcome!
BFT-SMaRt Software
[Figure: an application class (e.g., CounterServer, with a main() method) uses BFT-SMaRt.jar and a configuration directory, on both the client and the server side]
Configuration
• A directory containing three things
o The keys directory, with the private key file of process i and the public key file of every other process j
• In the future these keys will go to keystores/truststores
o Do not use consecutive ports (each replica uses its port p, plus p+1)
Configuration
• system.config: a Java properties file containing the
system parameters
system.authentication.hmacAlgorithm = HmacSHA1
system.servers.num = 4
system.servers.f = 1
system.totalordermulticast.timeout = 12000000
system.totalordermulticast.highMark = 10000
system.totalordermulticast.maxbatchsize = 400
system.totalordermulticast.verifyTimestamps = false
system.totalordermulticast.state_transfer = true
system.totalordermulticast.checkpoint_period = 50
system.totalordermulticast.revival_highMark = 10
system.communication.useSignatures = 0
system.communication.useMACs = 1
system.initial.view = 0,1,2,3
system.debug = 0
BFT-SMaRt Programming
• Client-side:
o ServiceProxy is the main class to be used
o Requests and replies are byte arrays (to avoid unnecessary overheads)
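A minimal client sketch. The ServiceProxy constructor taking a client id and the invokeOrdered(byte[]) method follow later BFT-SMaRt versions; in version 0.7 the exact class/method names may differ, so treat them as assumptions and check the distribution you use.

import bftsmart.tom.ServiceProxy;   // package and method names assumed, see note above
import java.nio.ByteBuffer;

// Illustrative BFT-SMaRt client: requests and replies are plain byte arrays.
public class CounterClient {
    public static void main(String[] args) {
        int clientId = Integer.parseInt(args[0]);
        ServiceProxy proxy = new ServiceProxy(clientId);   // reads the configuration directory

        // Encode the request however the service expects; here, a 4-byte increment value.
        byte[] request = ByteBuffer.allocate(4).putInt(1).array();
        byte[] reply = proxy.invokeOrdered(request);       // totally ordered invocation (name assumed)

        System.out.println("counter = " + ByteBuffer.wrap(reply).getInt());
    }
}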
BFT-SMaRt Programming
• Server-side:
o ServiceReplica is the main class
o It needs an implementation of Executable and Recoverable to work
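A minimal server sketch of a counter service. The method names (executeOrdered, getState, setState) and the way the replica is started are assumptions based on the pattern described here; check the API of the BFT-SMaRt version you are using.

import java.nio.ByteBuffer;

// Illustrative server: in the real library this class would implement Executable and
// Recoverable and be handed to a ServiceReplica; the names below are assumptions.
public class CounterServer {
    private int counter = 0;

    // Invoked for every totally ordered request; must be deterministic.
    public byte[] executeOrdered(byte[] request) {
        counter += ByteBuffer.wrap(request).getInt();
        return ByteBuffer.allocate(4).putInt(counter).array();
    }

    // State management hooks used by the state transfer protocol (Recoverable).
    public byte[] getState() { return ByteBuffer.allocate(4).putInt(counter).array(); }
    public void setState(byte[] state) { counter = ByteBuffer.wrap(state).getInt(); }

    public static void main(String[] args) {
        int replicaId = Integer.parseInt(args[0]);
        CounterServer server = new CounterServer();
        // In the real library the replica is started by constructing a ServiceReplica and
        // handing it this object, e.g. new ServiceReplica(replicaId, server, server).
        System.out.println("replica " + replicaId + " ready (library wiring omitted)");
    }
}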
BFT-SMaRt Programming
• Server-side (cont.):
References
• Kubiatowicz et al. OceanStore: An Architecture for Global-scale Persistent
Storage. ASPLOS’00
• Adya et al. FARSITE: Federated, Available, and Reliable Storage for an
Incompletely Trusted Environment. OSDI’02
• Vandiver et al. Tolerating Byzantine Faults in Database Systems using Commit
Barrier Scheduling. SOSP’07
• Garcia et al. Efficient Middleware for Byzantine Fault-tolerant Database
Replication. EuroSys’11
• Bessani et al. DepSpace: A Byzantine Fault-tolerant Coordination Service.
EuroSys’08
• Cachin & Samar. Secure Distributed DNS. DSN'04
• Fraga & Powell. A Fault- and Intrusion-Tolerant File System. IFIP SEC'85
• Bessani. From Byzantine Fault Tolerance to Intrusion Tolerance (A position
paper). WRAITS’11
• Sousa & Bessani. From Byzantine Consensus to BFT State Machine Replication:
A latency-optimal transformation. EDCC’12
http://www.di.fc.ul.pt/~bessani
http://code.google.com/p/bft-smart
EuroSys 2012