Tutorial on BFT State Machine Replication (T1 BFTSMR)
Alysson Bessani
EuroSys 2012
Summary
• Part 1: The Basics
o State machine replication
o Potential applications
o 5 fundamental results on distributed systems
o Paxos/Viewstamped replication
o Castro & Liskov's PBFT
• Part 2: BFT Literature Review
o Improving performance
o Improving resource efficiency
o Improving robustness
• Part 3: Applications, Open Problems & Practice
o BFT Applications
o Open problems on BFT
o BFT-SMaRt
o Practice: a BFT (in-memory) KV store
Part I
The Basics
EuroSys 2012
Replication
• Replication is a technique used for performance
and/or fault tolerance
Passive Replication
• Also called Primary-Backup (PB) or master-slave
• Clients talk to the primary, which sends the operations and checkpoints to the backups
o Sometimes backup replicas answer read-only operations
• If the primary crashes, one of the backups takes over
[Figure: clients send op1 and op2 to the primary, which forwards the operations and periodic checkpoints to the backups]
Active Replication
• Also called State Machine Replication – SMR
(Schneider, ACM CS 1990)
• All servers execute the same set of operations in the
same order (servers are always “synchronized”)
• Clients wait for the first reply (crash faults)
[Figure: clients multicast op1 and op2 to all replicas, which execute them in the same order]
SMR Requirements
• Initial state: all replicas start in the same state (easy!)
• Coordination: all replicas receive the same sequence of inputs (requires a total order multicast)
System Properties
• Safety: all servers execute the same sequence of
requests
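To make these requirements concrete, here is a minimal sketch (mine, not from the tutorial) of a replica applying a totally ordered stream of requests to a deterministic state machine; the StateMachine interface and all names are illustrative.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative interface: any deterministic service can be replicated this way.
interface StateMachine {
    byte[] apply(byte[] request); // must be deterministic (no clocks, no randomness)
}

class Replica {
    private final StateMachine service;
    // Filled by the total order multicast layer: every replica sees the same sequence.
    private final BlockingQueue<byte[]> orderedRequests = new LinkedBlockingQueue<>();

    Replica(StateMachine service) { this.service = service; }

    void deliver(byte[] request) { orderedRequests.add(request); }

    void run() throws InterruptedException {
        while (true) {
            byte[] request = orderedRequests.take(); // same order at every replica
            byte[] reply = service.apply(request);   // same state transition everywhere
            sendReplyToClient(reply);
        }
    }

    private void sendReplyToClient(byte[] reply) { /* network code omitted */ }
}

Since every replica starts in the same state and applies the same sequence of deterministic operations, all correct replicas stay synchronized.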
System Models:
BFT SMR Assumptions
• Faults
o How many faulty servers and clients does the system tolerate? Of what type (e.g., crash, crash-recovery, Byzantine)?
• Time
o Are time assumptions needed (e.g., upper bounds on message and execution times, synchronized clocks)?
• Connectivity
o Are all processes connected?
o Are the communication links reliable? Authenticated?
• Cryptography
o What cryptography assumptions are needed?
• Architecture
o Homogeneous or heterogeneous?
Consensus ≡ Total Order Multicast
[Figure: total order multicast built from a reliable multicast followed by consensus on the set of messages to deliver (processes p1–p4)]
• Why does it work? Every process decides the same set of messages
• Conversely, an atomic broadcast (total order multicast) protocol can be used to solve consensus
[Figure: consensus built from a total order multicast of the proposals (processes p1–p4)]
• Why does it work? The decision is the first message delivered to every process
• This equivalence holds in most system models
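As an illustration of the second direction of this equivalence, the sketch below (assumed interfaces, not any particular library) solves consensus on top of a total order multicast: each process multicasts its proposal and decides the first delivered value, which is the same at every correct process.

import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Assumed total order multicast interface.
interface TotalOrderMulticast {
    void multicast(byte[] message);            // send to all processes
    void onDeliver(Consumer<byte[]> handler);  // deliveries happen in the same order everywhere
}

class ConsensusFromTOM {
    private final TotalOrderMulticast tom;
    private final CompletableFuture<byte[]> decision = new CompletableFuture<>();

    ConsensusFromTOM(TotalOrderMulticast tom) {
        this.tom = tom;
        // complete() keeps only the first delivered value; since the delivery order is
        // the same at every correct process, all of them decide the same value.
        tom.onDeliver(decision::complete);
    }

    void propose(byte[] value) { tom.multicast(value); }

    byte[] decide() throws Exception { return decision.get(); }
}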
Impossibility of Fault-Tolerant Consensus in Asynchronous Systems (FLP)
• Why?
o It is impossible to differentiate a faulty process from a merely slow one
[Figure: p1 proposes v = 1 and p2 proposes v = 0; if p1 and p2 receive nothing from p3, they cannot decide between 0 and 1]
Minimum Synchrony
required for FT Consensus
• Result:
Fault-tolerant consensus can be solved in the eventually synchronous
system model
• Why?
o The system is asynchronous but has the notion of time
o After some point, the system will become synchronous (bounded but
unknown communication and processing delays)
o If the algorithm keeps trying (always ensuring safety) and increasing the
timeout values, it will be able to solve consensus
[Figure: rounds 0 and 1 among processes p1–p4 with rotating coordinators: the coordinator of round 0 has T seconds to enforce its value, the coordinator of round 1 has 1.5T seconds, and so on (timeouts increase at each new round)]
Fault Thresholds
• State Machine Replication has two phases
o Ordering → consensus requirements

                    Crash    Byzantine
  Synchronous        f+1     3f+1 / f+1*
  Non-synchronous    2f+1    3f+1
  (* using signatures)

o Execution → voting requirements

                    Crash    Byzantine
                     f+1      2f+1
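A small helper (my own illustration) encoding the non-synchronous thresholds above, handy as a sanity check when sizing a deployment:

// Illustrative helper for the non-synchronous rows of the tables above.
final class Thresholds {
    // Minimum replicas for ordering (consensus) in a non-synchronous system.
    static int orderingReplicas(int f, boolean byzantine) {
        return byzantine ? 3 * f + 1 : 2 * f + 1;
    }

    // Minimum replicas for execution (voting on results).
    static int executionReplicas(int f, boolean byzantine) {
        return byzantine ? 2 * f + 1 : f + 1;
    }

    public static void main(String[] args) {
        // Tolerating f = 1 Byzantine fault: 4 replicas to order, 3 to execute.
        System.out.println(orderingReplicas(1, true) + " ordering / "
                + executionReplicas(1, true) + " execution replicas");
    }
}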
Viewstamped Replication
[Figure: normal case: client c sends a Request to the leader (replica 0); the leader sends Prepare to the backups, collects PrepareOk replies, then executes the request and sends the Reply]
• Requests are executed only after a majority of the replicas have them in their logs
• This ensures the request will remain visible even if the leader fails
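A minimal sketch (illustrative names; the real protocol also tracks view numbers and log positions) of the leader-side rule: execute and reply only after a majority of replicas, the leader included, has acknowledged the Prepare.

import java.util.HashSet;
import java.util.Set;

// Illustrative leader-side bookkeeping for one log position.
class PendingRequest {
    private final int n;                       // total number of replicas
    private final Set<Integer> acks = new HashSet<>();

    PendingRequest(int n, int leaderId) {
        this.n = n;
        acks.add(leaderId);                    // the leader already has the request in its log
    }

    // Called when a PrepareOk arrives from a backup.
    boolean onPrepareOk(int replicaId) {
        acks.add(replicaId);
        return acks.size() > n / 2;            // majority reached: safe to execute and reply
    }
}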
Viewstamped Replication
[Figure: view change: replicas that suspect the old leader (replica 0) send DoViewChange to the next leader (replica 1), which sends StartView to all replicas]
• If a replica suspects the leader, it sends a message to the next leader
• If the next leader receives f+1 such messages, it synchronizes the replica logs and starts a new view
PBFT (Castro & Liskov, OSDI'99)
• Cryptography
o PK signatures (used here to simplify the protocol presentation)
o MACs (each pair of processes shares a key)
o Digests (hashes)
• Algorithm outline:
o System evolves in views, numbered sequentially. In each view v, one server is the primary, the others are the backups: primary(v) = v mod N
o Client multicasts a signed request to all servers
o Servers reach agreement about the sequence number of the request
• The primary proposes the sequence number for each request
• The backups confirm that the primary follows the protocol
o If the primary fails, there is a view change
o Client waits for at least f+1 replies with the same result (at least one
correct server executed the operation and produced the result)
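A sketch of the client-side vote (illustrative; real clients also match request identifiers and handle timeouts): a result is accepted once f+1 replicas return the same value, since at least one of them is correct.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative client-side reply voter.
class ReplyVoter {
    private final int f;
    private final Map<String, Integer> counts = new HashMap<>();

    ReplyVoter(int f) { this.f = f; }

    // Returns the reply once f+1 identical replies were received, null otherwise.
    byte[] onReply(byte[] reply) {
        String key = Arrays.toString(reply);      // value-based key for counting
        int c = counts.merge(key, 1, Integer::sum);
        return c >= f + 1 ? reply : null;         // at least one correct replica sent this value
    }
}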
• Pre-prepare phase:
o primary receives a correctly signed request m
o It assigns a sequence number n to the message and sends this number, a
digest of request D(m) and its current view number to all backups (other
replicas) in a PRE-PREPARE message
o backup replicas receive the message and test its validity, i.e., whether n has not been assigned to another request and whether the message belongs to view v
o If a replica has m and a valid PRE-PREPARE for it, it proceeds to the prepare
phase (m is pre-prepared)
• Prepare phase:
o replicas store the received PRE-PREPARE message
o each replica sends a PREPARE message to other replicas containing v, n and
the digest D(m) of the message
o all replicas that receive 2f PREPARE messages from other replicas with the same v, n and D(m) proceed to the commit phase
o when a replica finishes the prepare phase for m, we say that m is prepared on
this replica
• Commit phase:
o each replica multicasts a COMMIT message containing v and n
o the request m to which n was assigned is executed when:
• a replica receives 2f COMMIT messages with the same v and n from other replicas
• all requests with sequence numbers lower than n have been executed
o when replica i finishes the commit phase, we say that m is committed at i
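The prepared and committed conditions can be summarized as two quorum checks per (v, n) slot; the sketch below is illustrative and omits message authentication, digests and log management.

import java.util.HashSet;
import java.util.Set;

// Illustrative per-(view, sequence number) quorum bookkeeping for PBFT.
class PbftSlot {
    private final int f;
    private final Set<Integer> prepares = new HashSet<>(); // senders of matching PREPAREs
    private final Set<Integer> commits  = new HashSet<>(); // senders of matching COMMITs
    private boolean prePrepared = false;

    PbftSlot(int f) { this.f = f; }

    void onPrePrepare()           { prePrepared = true; }
    void onPrepare(int replicaId) { prepares.add(replicaId); }
    void onCommit(int replicaId)  { commits.add(replicaId); }

    // m is prepared: a valid PRE-PREPARE plus 2f matching PREPAREs from other replicas.
    boolean prepared() { return prePrepared && prepares.size() >= 2 * f; }

    // m is committed locally: prepared plus 2f matching COMMITs from other replicas
    // (execution additionally waits until all lower sequence numbers are executed).
    boolean committed() { return prepared() && commits.size() >= 2 * f; }
}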
PBFT: Checkpoint
• Every protocol message is only accepted (and logged) if its assigned sequence number falls within a certain interval delimited by two values: h and H = h + L (the maximum log size)
• Periodically (every K request executions), the replicas exchange CHECKPOINT messages to advance h and H by K
• CHECKPOINT messages contain a digest of the system's state before the checkpoint and the sequence number n of the last request executed to reach this state (n mod K = 0)
• Replicas store 2f+1 CHECKPOINT messages as a proof that no other checkpoint for n is possible
o any two sets of 2f+1 replicas out of 3f+1 intersect in at least f+1 replicas ((2f+1) + (2f+1) = 4f+2 > 3f+1), so even with f Byzantine replicas at least one correct replica is in both, and it would never certify two different checkpoints for n
• All messages regarding requests with sequence numbers smaller than n can then be discarded from the log
• Late replicas can update themselves by fetching states that can be proved correct with 2f+1 CHECKPOINT messages
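A sketch of the watermark logic described above (illustrative; checkpoint proofs and state digests are omitted):

// Illustrative log-window (watermark) management.
class CheckpointWindow {
    private final int L;   // maximum log size (H = h + L)
    private final int K;   // checkpoint period, in executed requests
    private long h = 0;    // low watermark: sequence number of the last stable checkpoint

    CheckpointWindow(int logSize, int checkpointPeriod) { this.L = logSize; this.K = checkpointPeriod; }

    // A protocol message for sequence number n is only accepted and logged inside the window.
    boolean accepts(long n) { return n > h && n <= h + L; }

    // Called once 2f+1 matching CHECKPOINT messages for sequence number n were collected.
    void onStableCheckpoint(long n) {
        if (n % K == 0 && n > h) {
            h = n;  // advance the window; log entries with sequence number <= n can be discarded
        }
    }
}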
PBFT: View Change
• A backup replica triggers the view change protocol if some message m stays pending for more than a certain time limit (its request timeout expires)
• At this point, the replica stops accepting messages for view v and sends a VIEW-CHANGE message containing:
o the next view number v+1
o the sequence number n of its last stable checkpoint
o a set C of 2f+1 signed CHECKPOINT messages that validate n
o a set P of messages prepared at i in views v' ≤ v
o a set Q of messages pre-prepared at i in views v' ≤ v
• each backup replica that receives the NEW-VIEW message obtains the VIEW-CHANGE messages used to build it
o it may already have them, or it can fetch them from other replicas
• with these messages, each <message, sequence number> assignment contained in the NEW-VIEW message can be verified (using the same procedure the new primary used to choose these assignments)
o if some assignment is invalid, a VIEW-CHANGE for v+2 is sent to all replicas
o otherwise, a PREPARE message is sent for each assignment and the protocol resumes its normal operation, as if the assignment were a PRE-PREPARE message
PBFT: Optimizations I
• One of the key contributions of PBFT is its set of optimizations
• Digest replies
o Instead of all replicas sending the full reply to a request, the client chooses just one replica to send it; the others send only a digest of the reply, to allow voting
o If the received reply turns out to be wrong, the client asks for the full reply from other replicas
• Batching
o Instead of running the agreement protocol for every single request, it can be run for sets of requests (batches)
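A sketch of batching at the primary (illustrative; real implementations also bound how long a request may wait): requests that arrive while an agreement instance is running are grouped and ordered together in the next instance.

import java.util.ArrayList;
import java.util.List;

// Illustrative request batching at the primary.
class Batcher {
    private final int maxBatchSize;
    private final List<byte[]> pending = new ArrayList<>();

    Batcher(int maxBatchSize) { this.maxBatchSize = maxBatchSize; }

    synchronized void onRequest(byte[] request) { pending.add(request); }

    // Called when the previous agreement instance finishes: one instance orders the whole batch.
    synchronized List<byte[]> nextBatch() {
        int size = Math.min(pending.size(), maxBatchSize);
        List<byte[]> batch = new ArrayList<>(pending.subList(0, size));
        pending.subList(0, size).clear();
        return batch;
    }
}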
References
• Schneider. Implementing Fault-Tolerant Services using the
State Machine Approach: a Tutorial. ACM Computing
Surveys 1990.
• Lamport. The Part-time Parliament. ACM TOCS 1998
• Oki & Liskov. Viewstamped Replication: A New Primary
Copy Method to Support Highly-Available Distributed
Systems. PODC’88
• Burrows. The Chubby Lock Service for loosely-coupled
distributed systems. OSDI’06
• Baker et al. Megastore: Providing Scalable, Highly
Available Storage for Interactive Services. CIDR’11
• Hunt et al. ZooKeeper: Wait-free Coordination for
Internet-scale Systems. USENIX’10
References
• Bolosky et al. Paxos Replicated State Machines as the
Basis of a High-Performance Data Store. NSDI’11
• Rao et al. Using Paxos to Build a Scalable, Consistent,
and Highly Available Datastore. VLDB’11
• Calder et al. Windows Azure Storage: A Highly Available
Cloud Storage Service with Strong Consistency. SOSP’11
• Castro & Liskov. Practical Byzantine Fault Tolerance.
OSDI’99
• Castro & Liskov. Practical Byzantine Fault Tolerance and
Proactive Recovery. ACM TOCS 2002
• Liskov. From Viewstamped Replication to Byzantine Fault
Tolerance. Replication: Theory and Practice, 2010
Part II
BFT Literature Review
EuroSys 2012
Outline
• Improving BFT performance
• Robust BFT protocols
• Architectural hybridization
• Implementation techniques
• Complementary techniques for BFT
Note: there are other papers and other aspects, but this is my
selection given the time constraints we have
Improving BFT
Performance
• PBFT performance is competitive with crash fault-
tolerant systems, and in some cases even with non-
replicated systems
• However, in the expected common situation where
o there are no faults
o the system is synchronous
o there is no concurrency
it is possible to do significantly better, as the following protocols show
Improving BFT
Performance
• Since PBFT publication, several works tried to
improve its performance
• Q/U – Query/Update (Abd-El-Malek et al, SOSP’05)
o “Pure” quorum-based protocol that works in asynchronous systems
o Advantages:
• Improves the fault scalability of the system, i.e., the throughput of the
system does not drop dramatically when f increases
• Operations require only two communication steps (best case)
o Drawbacks:
• Sacrifices Liveness (Obstruction-freedom instead of Wait-freedom):
operations only terminate if there is no write contention on the object
• Requires n ≥ 5f +1
Improving BFT
Performance
• HQ-Replication (Cowling et al, OSDI’06)
o Combines quorum-based protocols with PBFT
• If there is no concurrency, executes a (f-dissemination BQS) write
protocol to change the system state
• If concurrency is detected, start PBFT to order concurrent requests
o Same advantages as Q/U, with the same liveness guarantees as PBFT, and using only 3f+1 replicas
[Figure: HQ normal case, taking 1 or 2 communication steps]
Zyzzyva: Speculative BFT (Kotla et al, SOSP'07)
[Figure: fast case: the client sends a REQUEST, the primary orders it with ORDER-REQ, and the replicas speculatively execute the request in the order given by the primary, sending SPEC-RESPONSE directly to the client]
[Figure: commit phase: if the client cannot gather enough matching SPEC-RESPONSEs, it sends a COMMIT message; replicas that see that 2f+1 replicas match some history commit it and answer with LOCAL-COMMIT]
[Figure: view change: a malicious primary sends different ORDER-REQ messages to different replicas; the resulting conflicting responses form a proof of misbehavior (POM), which triggers a view change]
Aliph: The Next 700 BFT Protocols (Guerraoui et al, EuroSys'10)
• Abstract is a nice idea that really simplifies the design of optimistic state machine replication: each protocol instance only handles its own “common case” and aborts otherwise
• Aliph composes three Abstract instances with static switching: Quorum, Chain and Backup (a PBFT-like protocol); initially Quorum is active, on abort the system switches to Chain, then to Backup, which commits k requests before switching back to Quorum, and so on
o Quorum (latency-optimal): commits in 2 one-way message delays using only 3f+1 replicas (Q/U has the same latency but needs 5f+1), but does not tolerate contention
o Chain (throughput-optimal): replicas are organized in a pipeline and authenticate messages with lightweight chain authenticators (CAs, at most f+1 MACs instead of 3f+1); with batching, the number of MAC operations at the bottleneck replica tends to 1
o Backup: commits requests as long as there are at most f Byzantine replicas, regardless of contention, asynchrony or Byzantine clients
[Figure: Table 2 of the paper, comparing state-of-the-art BFT protocols (number of replicas, MAC operations at the bottleneck replica, latency), and the communication patterns of Quorum (Fig. 4) and Chain (Fig. 5)]
Robust BFT Protocols
• By robust, we mean:
o The system maintains stable performance even when under attack by f malicious replicas and an unbounded number of malicious clients
Spinning
• A protocol built upon PBFT, with a modification based on a simple idea:
o PBFT's problem is that a malicious primary can keep ordering requests very slowly without triggering view changes
o So, why not change the view after each message commit?
o In this way, the sequence number of a message matches exactly the view number in which it is delivered
• Potential problem:
o The view change protocol is complex and costly
o But this is not a problem here: the view change happens deterministically after every committed message, so no special protocol is needed to change the primary
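The rotation rule itself is trivial; an illustrative sketch of how the primary can be derived directly from the sequence/view number:

// Illustrative primary selection: in Spinning the view changes after every commit,
// so the primary of the i-th ordered message is simply replica i mod n.
final class RotatingPrimary {
    static int primaryOf(long sequenceNumber, int n) {
        return (int) (sequenceNumber % n);
    }

    public static void main(String[] args) {
        int n = 4;
        for (long seq = 0; seq < 8; seq++) {
            System.out.println("message " + seq + " ordered by replica " + primaryOf(seq, n));
        }
    }
}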
Spinning
• Example execution of Spinning:
o first request is ordered by s1, which is the primary of view v
o second request is ordered by s2, which is the primary of v+1
o …
Spinning: Performance
• Does changing the primary improve or degrade performance in fault-free executions?
Spinning: Performance
• What happens when latency is injected by a faulty primary?
[Figure: throughput vs. amount of delay injected, starting from the no-delay case; Spinning's performance degrades much more slowly than PBFT's, because malicious primaries can only degrade the performance of the system in f out of n protocol executions]
Spinning: Issues
• Without the repair procedure of view changes, how do replicas recover from a malicious primary in some view?
o Merge operation: joins one or more faulty views (i.e., views with faulty primaries) with a correct view (i.e., one with a correct primary)
o The idea is very similar to PBFT's view change: the new correct primary reads the state of the system and resumes ordering requests while ensuring the protocol invariants
Architectural
Hybridization
• Motivation: BFT in Homogeneous Systems is Expensive
[Figure: communication patterns of PBFT and Zyzzyva]
Architectural
Hybridization
• Is it possible to do better?
1- Less than 3f+1 replicas to tolerate f Byzantine faults?
• Homogeneous non-synchronous systems require 3f+1 replicas
History: Trusted
Components and BFT
• (Correia et al, SRDS’02) BFT Reliable Multicast using TTCB,
a distributed real-time trusted component
• (Correia et al, SRDS’04) BFT SMR with 2f+1 replicas using a
distributed trusted component
• (Chun et al, SOSP’07) PBFT with 2f+1 replicas using a
complex local trusted component (A2M)
• (Levin et al, NSDI'09) A2M reduced to a simple secure counter (TrInc), which can be built using a TPM chip
• (Veronese et al, DI-FCUL TR 2008, TC 2011) MinBFT shows that with a trusted counter one can reduce BFT SMR to viewstamped replication/Paxos
• (Kapitza et al, EuroSys'12) BFT SMR with only f+1 active replicas, using a trusted counter efficiently implemented in an FPGA
PBFT vs. MinBFT
[Figure: message patterns from client request to reply: PBFT uses four replicas and three phases (pre-prepare, prepare, commit); MinBFT uses three replicas and two phases (prepare, commit)]
Benefits of MinBFT
• 2f+1 instead of 3f+1 replicas (minimal for general SMR)
• 2 steps instead of 3 in the normal case (minimal for consensus)
• USIG is arguably a minimal trusted component
[Figure: MinBFT view change: a new view is installed after a replica receives f+1 VIEW-CHANGE messages]
• Practical effects:
o A primary replica cannot send two PREPARE messages with different requests and the same sequence number (UI)
o A backup replica cannot lie about the value proposed by the primary
HMAC-based USIG
[Figure: replicas 0, 1 and 2 each keep the shared secret key SK inside their trusted component; a message m is certified as (m, HMAC(m))]
• Both createUI and verifyUI require access to the trusted component
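A minimal sketch of such a USIG under the assumptions above (a monotonic counter plus an HMAC computed inside the trusted component over the counter and the message); key distribution and certificate formats are ignored, and all names are illustrative.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative HMAC-based USIG: createUI/verifyUI over a monotonic counter.
class Usig {
    private final SecretKeySpec key;   // shared secret, assumed to live inside the trusted component
    private long counter = 0;          // monotonic counter, never decremented or reused

    Usig(byte[] secret) { this.key = new SecretKeySpec(secret, "HmacSHA1"); }

    // UI = (counter value, HMAC(counter || message))
    synchronized byte[][] createUI(byte[] message) throws Exception {
        counter++;
        return new byte[][] { ByteBuffer.allocate(8).putLong(counter).array(), hmac(counter, message) };
    }

    // Verifies that the UI was produced by a USIG holding the same key for this counter value.
    boolean verifyUI(long counterValue, byte[] tag, byte[] message) throws Exception {
        return Arrays.equals(tag, hmac(counterValue, message));
    }

    private byte[] hmac(long counterValue, byte[] message) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(key);
        mac.update(ByteBuffer.allocate(8).putLong(counterValue).array());
        return mac.doFinal(message);
    }
}

Because the counter only moves forward and is bound to the message by the HMAC, a faulty primary cannot assign the same identifier to two different requests, which is exactly the equivocation the practical effects above rule out.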
USIG Performance:
VM x TPM
• TPM USIG
o Signature: 797 ms
o Only one counter increment every 3.5 seconds
o 32-bit monotonic counter
• VM USIG
Implementation
Techniques
• BASE (Castro et al, TOCS 2003)
o Define useful abstractions to implement diverse BFT services
• Parallel execution of requests (Kotla & Dahlin, DSN'04)
o Some service requests do not require totally ordered execution (e.g., writes to different files of a file system) and can be executed in parallel
o May improve the throughput of certain services (e.g., a distributed FS)
Separating Agreement/Execution Architecture
• Servers are separated into two layers: agreement and execution
• Clients sign requests; the agreement replicas verify the signatures
• 3f+1 replicas agree on the request sequence numbers and 2f+1 replicas execute the requests
[Figure: clients sign requests; the agreement cluster verifies them and forwards ordered requests to the execution cluster, which executes them and sends the replies]
Agreement/Execution Problem
• In data centers, clients are usually also servers... and they have to be fast (generating signatures is very costly)
o E.g., web services (BFT clients) access a BFT database (the execution layer)
• These web service hosts need to serve lots of clients (high throughput), and they are paid for by the service provider
[Figure: Internet clients access the web service hosts, which in turn act as BFT clients of the replicated service]
UpRight Architecture
• A new layer needs to be deployed to avoid client signatures: the request quorum (RQ)
• Servers in this layer store the request and generate a matrix signature to be ordered by the agreement layer
• The execution layer fetches the request from the RQ after it is ordered, executes it and sends the reply
[Figure: 1. the client sends the request to the RQ; 2. the RQ sends the request hash + matrix signature to the agreement layer; 3. the agreement layer produces the request hash + sequence number; 4. execution replicas fetch the ordered request from the RQ]
UpRight Remarks
• Number of faults tolerated (u = total failures tolerated for liveness, r = commission failures tolerated for safety):
o Request quorum: n_r ≥ 2u + r + 1
o Ordering: n_o ≥ 2u + r + 1
o Execution: n_e ≥ u + r + 1
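A small sanity-check helper (my own illustration) encoding these bounds:

// Illustrative helper for the UpRight layer sizes listed above
// (u = failures tolerated for liveness, r = commission failures tolerated for safety).
final class UpRightSizes {
    static int requestQuorum(int u, int r) { return 2 * u + r + 1; }
    static int ordering(int u, int r)      { return 2 * u + r + 1; }
    static int execution(int u, int r)     { return u + r + 1; }

    public static void main(String[] args) {
        // u = r = 1 gives the classic BFT case: 4 ordering and 3 execution replicas.
        System.out.println(ordering(1, 1) + " ordering, " + execution(1, 1) + " execution");
    }
}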
ZZ Architecture
• Key observation: in fault-free executions, f+1 execution replicas are enough for the execution layer
• In server consolidation scenarios, the extra f execution replicas can be dormant VMs, woken up only when a fault is detected
[Figure: the agreement/execution architecture with the extra execution replicas kept as dormant ("zzz") virtual machines]
Complementary Techniques
for BFT: Fault Recovery
• Problem with tolerating f faults:
o If an intelligent adversary is able to compromise f machines, then, given enough time, he/she will compromise f+1 (or more)
o Solution: Proactive Recovery (Castro & Liskov, TOCS 2002)
• Replicas (compromised or not) are cleaned periodically
• PR requires a local trusted real-time component
o Otherwise, it may be vulnerable to certain attacks (Sousa et al, DSN’05)
o Most proactive recovery systems are vulnerable (Sousa et al, HotDep’06)
• To ensure availability you may also need 2k extra
replicas if at most k recover at the same time
Outdated …
Complementary Techniques
for BFT: Diversity
• f-fault-tolerant replicated systems are useful only
if faults are not correlated
• It usually requires diverse replicas
o Different administrative domains
o N-version programming (effective?)
o Obfuscation, Memory randomization (effective?)
o Use of different components like databases (Gashi et
al, TDSC 2007), file systems (Castro et al, TOCS 2003)
and operating systems (Garcia et al, DSN’11) is
effective!
• What about deploying and managing diversity?
References
• Abd-El-Malek et al. Fault-scalable Byzantine Fault-
tolerant Services. SOSP’05
• Cowling et al. HQ-Replication: a Hybrid Quorum Protocol
for Byzantine Fault Tolerance. OSDI’06
• Kotla et al. Zyzzyva: Speculative Byzantine Fault
Tolerance. ACM TOCS 2009 (prel. SOSP’07)
• Guerraoui et al. The Next 700 BFT Protocols. EuroSys’10
• Amir et al. Byzantine Replication Under Attack. IEEE TDSC 2011 (prel. DSN'08)
• Moniz et al. RITAS: Services for Randomized Intrusion
Tolerance. IEEE TDSC 2011
• Veronese et al. Spin One’s Wheels? Byzantine Fault
Tolerance with a Spinning Primary. SRDS’09
References
• Clement et al. Making Byzantine Fault-tolerant Systems
Tolerate Byzantine faults. NSDI’09
• Martin & Alvisi. Fast Byzantine Paxos. IEEE TDSC 2007
• Veríssimo. Travelling through Wormholes: a new look at
Distributed System Models. SIGACT News 2006
• Correia et al. Hybrid Byzantine-resilient Reliable Multicast.
SRDS’02
• Correia et al. How to Tolerate Half less One Byzantine
Faults in Practical Distributed Systems. SRDS’04
• Chun et al. Attested append-only memory: Making
adversaries stick to their word. SOSP’07
• Levin et al. TrInc: Small Trusted Hardware for Large
Distributed Systems. NSDI’09
References
• Veronese et al. Efficient Byzantine Fault Tolerance. IEEE TC 2011, to appear (prel. DI-FCUL Tech. Report 2008)
• Kapitza et al. CheapBFT: Resource-efficient Byzantine Fault Tolerance. EuroSys'12
• Castro et al. BASE: Using Abstractions to Improve Fault
Tolerance. ACM TOCS 2003
• Kotla & Dahlin. High-throughput Byzantine Fault
Tolerance. DSN’04
• Distler & Kapitza. Increasing Performance in Byzantine
Fault-Tolerant Systems with On-Demand Replica
Consistency. EuroSys’11
• Yin et al. Separating Agreement from Execution in
Byzantine Fault-tolerant Services. SOSP’03
References
• Clement et al. UpRight Cluster Services. SOSP’09
• Wood et al. ZZ and the Art of BFT Execution.
EuroSys’11
• Sousa et al. How resilient are distributed f fault/
intrusion-tolerant systems? DSN’05
• Sousa et al. Hidden Problems of Asynchronous
Proactive Recovery. HotDep’07
• Gashi et al. Fault tolerance via diversity for off-the-
shelf products: a study with SQL database servers.
IEEE TDSC 2007
• Garcia et al. OS Diversity for Intrusion tolerance:
Myth or Reality? DSN’11
Other Aspects
Wide-area replication
• Wester et al. Tolerating Latency in Replicated State Machines Through
Client Speculation. NSDI’09
• Mao et al. Towards Low Latency State Machine Replication for Uncivil
Wide-area Networks. HotDep’09
• Amir et al. STEWARD: Scaling Byzantine Fault-Tolerant Replication to Wide-
Area Networks. IEEE TDSC 2010
• Veronese et al. EBAWA: Efficient Byzantine Agreement for Wide-Area
Networks. HASE’10
Weak consistency & others
• Li & Mazières. Beyond One-third Faulty Replicas in Byzantine Fault Tolerant
Systems. NSDI’07
• Singh et al. Zeno: Eventually Consistent Byzantine-Fault Tolerance. NSDI’09
• Sen et al. Prophecy: Using History for High-Throughput Fault Tolerance.
NSDI’10
• Bessani et al. Active Quorum Systems. HotDep’10
Part III
Applications, Open Problems & Practice
EuroSys 2012
BFT Applications
• Distributed File Systems
o BFS (Castro & Liskov, TOCS 2002), BASEFS (Castro et al, TOCS 2003)
o Oceanstore (Kubiatowicz et al, ASPLOS’00), Farsite (Adya et al, OSDI’02)
o UR-HDFS (Clement et al, SOSP'09)
• Database replication
o Commit Barrier Scheduling (Vandiver et al, SOSP’07)
o Byzantium (Garcia et al, EuroSys’11)
• Coordination Service
o DepSpace (Bessani et al, EuroSys’08)
o UR-Zookeeper (Clement et al, SOSP’09)
• Naming Services
o DNS (Cachin & Samar, DSN’04)
o LDAP (FCUL, unpublished)
Intrusion-tolerant Systems
• Definition
An intrusion-tolerant system is a replicated system
in which a malicious adversary needs to
compromise more than f out of n components in
less than T time units in order to make it fail.
Comments:
• Similar to BFT with proactive recovery
• T and f make little sense without previous requirements
• Other definitions are possible
Open Problems on BFT
• 3 solved
• 2 half-solved
• 5 open
Solved Problem:
Performance
1990s: first implementations with useful performance appear (Rampart, SecureRing)
1999: Castro & Liskov's PBFT
2000s: PBFT-like protocols with better performance under certain favorable conditions
[Figure: timeline; goals achieved: minimal latency, maximal throughput]
Solved Problem:
Resource Efficiency
• Separating agreement from execution
o 3f+1 replicas for ordering requests
o 2f+1 replicas for executing requests
o f+1 exec. replicas may be sufficient with VMs
• Trusted components (e.g., TPM)
o Agreement with 2f+1 replicas (instead of 3f+1)
• Result (from PBFT to MinBFT): minimal number of replicas, communication steps and trusted component
Solved Problem:
Recovery
• Problem with tolerating f faults:
o If an intelligent adversary is able to compromise f machines, then, given enough time, he/she will compromise f+1 (or more)
Half-solved Problem:
Diversity
• f-fault-tolerant replicated systems are
useful only if faults are not correlated
• It usually requires diverse replicas
o Different administrative domains
o N-version programming (effective?)
o Obfuscation, Memory randomization
(effective?)
o Use of different components like databases,
file systems, operating systems is effective!
• What about deploying diversity?
Half-solved Problem:
Robust Performance of BFT
• BFT replication is
o very efficient in favorable conditions
o very inefficient in unfavorable conditions
• What about a balance?
o efficient enough in most conditions
• Design principles (Prime, Aardvark, AQS)
o No complex optimizations
o Use public-key crypto if needed
o Exploit application semantics for optimizations
Open Problems:
Intrusion Reaction
• Most BFT protocols only tolerate faults and
don't take actions against malicious replicas
(other than what is required for correctness)
• In practice, replica behavior needs to be
monitored and recovery actions need to be
executed if intrusions are detected
• Research question: Given the specification
of a protocol, how to automatically detect
misbehaviors and react to them?
Open Problems:
Time-bounded State Transfer
• Recall that the window of vulnerability of an
intrusion-tolerant system is bounded by T
o Every T time units all replicas are rejuvenated
o Every replica must take no more than T/n time units to
recover itself, i.e., take the following steps:
• Shutdown
• Choose a clean (and different) OS image
• Boot
• Fetch and validate service state
• Research question: How to bound the last step?
Open Problems:
Diversity Management
• Research question: Assume we have a pool
of diverse configurations for the system
replicas, how to choose the best set?
o The idea is to minimize the number of shared
vulnerabilities/bugs among any two replicas
o This is even more complicated if replicas change
at runtime
• Besides that, diversity means management
of complexity. How to deal with it?
Open Problems:
Confidential Operation
[Figure: clients issue store(k,v) and read(k) operations on the replicated servers]
Open Problems:
Graceful Degradation
• Our intrusion tolerance definition is very strict (all-or-
nothing)
• Research question: How to specify degraded
behaviors for intrusion tolerant systems in general?
• Examples: What if …
o … there are more than f faulty replicas?
o … the system is completely asynchronous?
execute(command){
//change state
reply = invoke(command);
return reply;
}
BFT-SMaRt
• Started in 2006, as a Byzantine Paxos
implementation on the Neko simulator
• Later extended to be the replication layer of
DepSpace (Bessani et al, EuroSys’08)
• Currently used/maintained by researchers in
Portugal, Brazil and Germany
• Sponsored by:
BFT-SMaRt
• BFT-SMaRt design principles:
o Java-based (for security and correctness reasons)
o No optimizations that bring complexity
o Modularity
o Features: Extensible API, State Management, Reconfiguration
Modularity
BFT-SMaRt Replica Architecture
[Figure: replica architecture: requests are received from and replies sent to secure sockets; client signatures are verified on arrival; the protocol core runs the ordering protocol (steps 1-8); timers trigger regency (leader) changes; ordered operations are delivered for execution]
BFT-SMaRt Software
• It is a library (.jar file) that must be linked with the
client and the servers…
• There is no service/component that must be
deployed or managed besides the BFT client and
server
• Available at http://code.google.com/p/bft-smart/
• Current version: 0.7
o Many disruptive features are being integrated in the code
o API changes will happen
o Bugs remain
o Any help is welcome!
BFT-SMaRt Software
[Figure: an application class (e.g., CounterServer, with a main() method) uses BFT-SMaRt.jar and a configuration directory, on both the client and the server side]
Configuration
• A directory containing three things
o The keys directory, with the private key file of process i and the public key file of every other process j
• In the future these keys will go to keystores/truststores
o Do not use consecutive ports (each replica uses its port p, plus p+1)
Configuration
• system.config: a Java properties file containing the
system parameters
system.authentication.hmacAlgorithm = HmacSHA1
system.servers.num = 4
system.servers.f = 1
system.totalordermulticast.timeout = 12000000
system.totalordermulticast.highMark = 10000
system.totalordermulticast.maxbatchsize = 400
system.totalordermulticast.verifyTimestamps = false
system.totalordermulticast.state_transfer = true
system.totalordermulticast.checkpoint_period = 50
system.totalordermulticast.revival_highMark = 10
system.communication.useSignatures = 0
system.communication.useMACs = 1
system.initial.view = 0,1,2,3
system.debug = 0
BFT-SMaRt Programming
• Client-side:
o ServiceProxy is the main class to be used
o Requests and replies are byte arrays (to avoid unnecessary overheads)
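A minimal client sketch. The ServiceProxy constructor taking a client id and the invokeOrdered(byte[]) method follow later BFT-SMaRt versions; in version 0.7 the exact class/method names may differ, so treat them as assumptions and check the distribution you use.

import bftsmart.tom.ServiceProxy;   // package and method names assumed, see note above
import java.nio.ByteBuffer;

// Illustrative BFT-SMaRt client: requests and replies are plain byte arrays.
public class CounterClient {
    public static void main(String[] args) {
        int clientId = Integer.parseInt(args[0]);
        ServiceProxy proxy = new ServiceProxy(clientId);   // reads the configuration directory

        // Encode the request however the service expects; here, a 4-byte increment value.
        byte[] request = ByteBuffer.allocate(4).putInt(1).array();
        byte[] reply = proxy.invokeOrdered(request);       // totally ordered invocation (name assumed)

        System.out.println("counter = " + ByteBuffer.wrap(reply).getInt());
    }
}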
BFT-SMaRt Programming
• Server-side:
o ServiceReplica is the main class
o It needs an implementation of Executable and Recoverable to work
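A minimal server sketch of a counter service. The method names (executeOrdered, getState, setState) and the way the replica is started are assumptions based on the pattern described here; check the API of the BFT-SMaRt version you are using.

import java.nio.ByteBuffer;

// Illustrative server: in the real library this class would implement Executable and
// Recoverable and be handed to a ServiceReplica; the names below are assumptions.
public class CounterServer {
    private int counter = 0;

    // Invoked for every totally ordered request; must be deterministic.
    public byte[] executeOrdered(byte[] request) {
        counter += ByteBuffer.wrap(request).getInt();
        return ByteBuffer.allocate(4).putInt(counter).array();
    }

    // State management hooks used by the state transfer protocol (Recoverable).
    public byte[] getState() { return ByteBuffer.allocate(4).putInt(counter).array(); }
    public void setState(byte[] state) { counter = ByteBuffer.wrap(state).getInt(); }

    public static void main(String[] args) {
        int replicaId = Integer.parseInt(args[0]);
        CounterServer server = new CounterServer();
        // In the real library the replica is started by constructing a ServiceReplica and
        // handing it this object, e.g. new ServiceReplica(replicaId, server, server).
        System.out.println("replica " + replicaId + " ready (library wiring omitted)");
    }
}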
BFT-SMaRt Programming
• Server-side (cont.):
References
• Kubiatowicz et al. OceanStore: An Architecture for Global-scale Persistent
Storage. ASPLOS’00
• Adya et al. FARSITE: Federated, Available, and Reliable Storage for an
Incompletely Trusted Environment. OSDI’02
• Vandiver et al. Tolerating Byzantine Faults in Database Systems using Commit
Barrier Scheduling. SOSP’07
• Garcia et al. Efficient Middleware for Byzantine Fault-tolerant Database
Replication. EuroSys’11
• Bessani et al. DepSpace: A Byzantine Fault-tolerant Coordination Service.
EuroSys’08
• Cachin & Samar. Secure Distributed DNS. DSN'04
• Fraga & Powell. A Fault- and Intrusion-Tolerant File System. IFIP SEC'85
• Bessani. From Byzantine Fault Tolerance to Intrusion Tolerance (A position
paper). WRAITS’11
• Sousa & Bessani. From Byzantine Consensus to BFT State Machine Replication:
A latency-optimal transformation. EDCC’12
http://www.di.fc.ul.pt/~bessani
http://code.google.com/p/bft-smart
EuroSys 2012