Distributed Systems Notes
Giuliano Abruzzo
November 26, 2019
Contents
1 Modelling Distributed Systems
2 Links
2.1 Fair-Loss P2P link
2.2 Stubborn P2P link
2.3 Perfect P2P link
3 Physical Time
3.1 Synchronization Algorithms
3.1.1 Cristian's Algorithm
3.1.2 Berkeley's Algorithm
3.1.3 Network Time Protocol
4 Logical Time
4.1 Logical clock
4.1.1 Scalar Logical Clock
4.1.2 Vector Logical Clock
4.2 Logical Time and Distributed Algorithms
4.2.1 Lamport's algorithm
4.2.2 Ricart-Agrawala's algorithm
5 Failure Detection & Leader Election
5.2 Leader Election
5.2.1 Perfect Leader Election
6 Broadcast Communication
6.1 Best Effort Broadcast
6.2 Reliable Broadcast
6.2.1 Reliable Broadcast, Synchronous system
6.2.2 Reliable Broadcast, Asynchronous system
6.3 Uniform Reliable Broadcast
6.3.1 Uniform Reliable Broadcast, Synchronous system
6.3.2 Uniform Reliable Broadcast, Asynchronous system
6.4 Probabilistic Broadcast
6.4.1 Eager Probabilistic Broadcast
7 Consensus
7.1 Regular Consensus
7.2 Uniform Consensus
8 Paxos
9 Ordered Communications
9.1 FIFO broadcast
9.2 Causal Order Broadcast
9.2.1 Waiting Causal Broadcast
9.2.2 Non-Waiting Causal Broadcast
10 Total Order Broadcast
10.1 Total Order Algorithm
11 Distributed Registers
11.0.1 Regular Register
11.0.2 Atomic Register
11.1 Regular Register Interface
11.1.1 Read-One-Write-All Algorithm
11.1.2 Fail-Silent Algorithm
11.2 Atomic Register Interface
11.2.1 Regular Register (1,N) to Atomic Register (1,1)
11.2.2 Atomic Register (1,1) to Atomic Register (1,N)
11.2.3 Read-Impose Write-All Algorithm
11.2.4 Read-Impose Write-Majority algorithm
12 Software Replication
12.1 Primary Backup
12.1.1 No-Crash scenario
12.1.2 Crash scenario
12.2 Active Replication
13 CAP Theorem
14 Byzantine Tolerant Broadcast
16.1.1 Byzantine Tolerant Safe Register
16.2.1 Regular Register with cryptography
16.2.2 Regular Register without cryptography
19 Blockchain
1 Modelling Distributed Systems
A distributed system is a set of entities (computers, machines) communicating, coordinating
and sharing resources in order to reach a common goal, while appearing as a single computing
system. To describe situations in distributed systems we will use distributed abstractions,
because they capture common properties of a large range of systems and prevent reinventing
the same solution for variants of the same problem. We will use the Composition Model, where
a system is described as a set of modules that interact through events: requests (invocations
coming from the layer above), indications (notifications going to the layer above), and the
properties the module must satisfy. We will use these three elements in the module descriptions
in the specifications. For a simple synchronous and asynchronous Job Handler we have the
following behaviour:
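A minimal Python sketch of the two variants (the submit/confirm names and the step method are illustrative, not the course's exact interface): the synchronous handler confirms a job only after executing it, while the asynchronous one confirms immediately and executes the job later.

from collections import deque

class SyncJobHandler:
    def submit(self, job):
        job()                      # process the job right away
        return "confirm"           # the confirmation follows the execution

class AsyncJobHandler:
    def __init__(self):
        self.buffer = deque()      # jobs waiting to be processed
    def submit(self, job):
        self.buffer.append(job)    # just save the job
        return "confirm"           # confirm before the job is executed
    def step(self):                # invoked repeatedly by the runtime
        if self.buffer:
            self.buffer.popleft()()

handler = AsyncJobHandler()
handler.submit(lambda: print("job executed"))
handler.step()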
If we need to show a job transformation and processing abstraction, we have to use another
model with two components: a Transformation Handler on top of the Job Handler. Only the
operations corresponding to the arrows entering the Transformation Handler are implemented,
because these are the operations the Transformation Handler has to handle. So, we write
pseudocode which follows the given specification, where operations and properties are listed.
The bottom and top variables are used together with the buffer size M to ensure that the
limit of jobs that can be held by the buffer is not exceeded; two exercises then build on this
specification.
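A hedged Python sketch of the bounded-buffer idea just described, assuming the bottom/top counters and the buffer size M mentioned in the text (the submit/process_one names are illustrative):

M = 3                              # assumed buffer size

class TransformationHandler:
    def __init__(self):
        self.buffer = [None] * M
        self.bottom = 0            # count of jobs already processed
        self.top = 0               # count of jobs accepted so far
    def submit(self, job):
        if self.top - self.bottom == M:
            return "error"         # buffer full: the job limit would be exceeded
        self.buffer[self.top % M] = job
        self.top += 1
        return "ok"
    def process_one(self):
        if self.bottom < self.top:
            job = self.buffer[self.bottom % M]
            self.bottom += 1
            job()

h = TransformationHandler()
for k in range(4):
    print(h.submit(lambda k=k: print("job", k)))   # the 4th submit returns "error"
h.process_one()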
Processes in a distributed system often communicate through messages. We can represent a
distributed algorithm as a set of automata, one per process, which define how to react to a
message. The execution of a distributed algorithm is represented by a sequence of steps executed
by the processes, and its properties fall into two classes:
• Safety: states that the algorithm should never behave in a wrong way;
– a safety property is a property that can be violated at some time t and never be satisfied
again after that time;
– equivalently, a safety property is a property such that, whenever it is violated in some
execution E of an algorithm, there is a partial execution E′ of E such that the property will be
violated in any extension of E′;
• Liveness: ensures that eventually something good happens;
– a liveness property is a property of a distributed system execution such that, for any time
t, there is some hope that the property can be satisfied at some time t′ ≥ t.
A system can also be partially synchronous: there is an unknown time t after which the system
behaves as a synchronous system, for a period long enough for the distributed algorithm to
terminate.
2 Links
Links are used to model the network component of a distributed system; they connect pairs
of processes. We have three different types of link:
• Fair-loss links;
• Stubborn links;
• Perfect links;
The two linked processes can crash, the time taken to execute an operation is bounded, while
messages can be lost and can take an indefinite time to reach their destination. The generic
link interface offers a send request and a deliver indication. With a fair-loss link, the sender
must take care of retransmissions if it wants to be sure that m is delivered at its destination;
there is no guarantee that the sender can stop the retransmissions of each message, so each
message may be delivered more than once.
The Stubborn P2P link is implemented on top of a fair-loss link by storing every sent message
and periodically retransmitting all of them. The Perfect P2P link is built, in turn, on top of
the stubborn link: No duplication is ensured by the last piece of pseudocode thanks to the
variable delivered, which filters out messages already delivered; No creation is inherited from
the link below, and Reliable delivery derives from the whole scheme, in particular from the
stubborn link, which is built precisely to keep retransmitting messages until they are delivered.
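A minimal single-process simulation of this layering (callback names are illustrative): the stubborn link keeps retransmitting everything it ever sent, and the perfect link on top filters duplicates with the delivered variable.

class StubbornLink:
    def __init__(self, deliver_cb):
        self.sent = set()                  # messages retransmitted forever
        self.deliver_cb = deliver_cb
    def send(self, m):
        self.sent.add(m)
        self.on_timeout()                  # in a real system, fired periodically
    def on_timeout(self):
        for m in self.sent:                # retransmit everything ever sent
            self.deliver_cb(m)

class PerfectLink:
    def __init__(self):
        self.delivered = set()             # ensures the No duplication property
        self.sl = StubbornLink(self.on_sl_deliver)
    def send(self, m):
        self.sl.send(m)
    def on_sl_deliver(self, m):
        if m not in self.delivered:
            self.delivered.add(m)
            print("pp2p deliver:", m)

pl = PerfectLink()
pl.send("hello")
pl.sl.on_timeout()                         # retransmission: not delivered twice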
3 Physical Time
In a distributed system processes run on different nodes interconnected by means of a network
and cooperate to complete a computation. They communicate only through messages and, as
ordering is required in such applications, time is a critical factor for distributed systems. Each
process pi in a distributed system runs on a single mono-processor machine with no shared
memory; it has a state si, changed by the actions executed during the algorithm. Each process
generates a sequence of events. We indicate with:
• →i the ordering relation between two events e and e′ of the same process pi;
• e →i e′ if and only if e happened before e′;
We call local history the sequence of events produced by a process, partial local history a
prefix of a local history, and global history the set containing every local history. Events can be
time-stamped with physical clock values: within a single process it is always possible to
order events, but in a distributed system, in presence of network delay, it is impossible to realize
a common clock shared among every process. Still, it is possible to use timestamps in order to
synchronize physical clocks through algorithms, with a certain degree of approximation:

Ci(t) = α · Hi(t) + β

where Ci(t) represents the software clock, Hi(t) represents the hardware clock, and α and β are
two factors chosen to keep the software clock close to real time and to guarantee monotonicity.
This software clock is generally not completely accurate: it can differ from the real time, and
differ between processes, due to the precision of the approximation. We have to keep the
granularity (also called resolution) of the software clock, i.e. the interval of time between two
increments of the software clock, smaller than the time difference between two consecutive
events.
UTC is the international standard for clock synchronization, and we can have two types of
synchronization:
• External Synchronization:
– processes synchronize their clock Ci with a UTC source S, in a way such that for each time
interval: |S(t) − Ci(t)| < D, where D is a synchronization bound;
• Internal Synchronization:
– all the processes synchronize their clocks pairwise with respect to D, so: |Ci(t) − Cj(t)| < D;
Clocks that are internally synchronized are not necessarily externally synchronized, while
clocks that are externally synchronized are also internally synchronized with a bound of 2 · D.
A hardware clock is correct if its drift rate is within a limited bound p > 0:

1 − p ≤ dC/dT ≤ 1 + p

If we have a correct hardware clock H we can measure a time interval [t, t′]:

(1 − p)(t′ − t) ≤ H(t′) − H(t) ≤ (1 + p)(t′ − t)

Software clocks have to be monotone: if t′ > t, then C(t′) > C(t).
3.1 Synchronization Algorithms
3.1.1 Cristian's Algorithm
A process p asks the current time through a request mr and receives t in a reply mt from a
time server S, and p will set its time to t + RTT/2, where RTT is the round-trip time
experienced by p. It is important to note that the time server can crash or can be hacked.
The accuracy of this algorithm strongly depends on the RTT, and we can have two extreme
cases:
• Case 1:
– the real reply time is as large as possible, i.e. RTT − min, while the estimated one is RTT/2;
– ∆ = estimated − real = RTT/2 − (RTT − min) = −(RTT/2 − min);
• Case 2:
– the real reply time is as small as possible, i.e. min, while the estimated one is RTT/2;
– ∆ = estimated − real = RTT/2 − min = +(RTT/2 − min);
So the accuracy of Cristian's Algorithm is ±(RTT/2 − min), where min is the minimum
transmission delay.
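A hedged sketch of the rule above (the 5-second server offset is a made-up example): the client estimates the server's current time as t + RTT/2.

import time

def server_time():                 # stands in for the time server S
    return time.time() + 5.0       # assume S is 5 seconds ahead of the client

def cristian_sync():
    t0 = time.time()               # send the request m_r
    t = server_time()              # receive t in the reply m_t
    t1 = time.time()               # reply received
    rtt = t1 - t0
    return t + rtt / 2             # value the client sets its clock to

print("synchronized clock:", cristian_sync())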
3.1.2 Berkeley's Algorithm
The master process computes the differences ∆pi between the master clock and the clock of
every process pi (including itself); then it computes the average avg of all the differences ∆pi,
without considering faulty processes (a process whose clock differs from the master's by more
than a threshold); at the end it computes the correction for each process (even a faulty one)
with the formula: adjpi = avg − ∆pi.
When a slave process receives the correction, it applies it to the local clock. If the correction
is negative, the process doesn't decrement the clock value but slows down its clock, since
decrementing a clock can cause problems.
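A minimal sketch of these rules, with made-up clock values and threshold; it computes adj = avg − ∆ for every process, excluding clocks beyond the threshold from the average.

def berkeley(master_clock, slave_clocks, threshold):
    deltas = {p: master_clock - c for p, c in slave_clocks.items()}
    deltas["master"] = 0.0                          # the master includes itself
    good = [d for d in deltas.values() if abs(d) <= threshold]
    avg = sum(good) / len(good)                     # faulty clocks are excluded
    return {p: avg - d for p, d in deltas.items()}  # correction for everyone

corrections = berkeley(100.0, {"p1": 98.0, "p2": 103.0, "p3": 250.0}, 10.0)
print(corrections)    # p3 is ignored in the average but still gets a correction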
3.1.3 Network Time Protocol
NTP works with a hierarchy of time servers. It's important to note that physical
synchronization doesn't work in an asynchronous system, as it makes any bound-based logic
useless: the time needed for an answer is unpredictable.
4 Logical Time
4.1 Logical clock
As we said, physical clocks are good if we have a precise estimation of delays, but this can be
hard, and often we want to know in which order some events happened rather than the exact
time of each of them. Since in a distributed system each node has its own clock, if clocks are
not aligned it's not possible to order events generated by different processes, so we need a
reliable way to order events: this is the logical clock. These clocks are based on the causal
relations between events, and it's important to define two relations:
• →i is the ordering relation between events in a process pi ;
• → is the happened-before between any pairs of events;
We say that two events e and e′ are in happened-before relation (e → e′) if:
• ∃ pi | e →i e′;
• ∀ message m : send(m) → receive(m);
– send(m) is the event of sending a message m;
– receive(m) is the event of receipt of the same message m;
• ∃ e″ | (e → e″) ∧ (e″ → e′);
Where the last rule says that the happened-before relation is transitive. Using these rules we
can define a causally ordered sequence of events, and if two events are not in happened-before
relation they are concurrent (e ∥ e′). The logical clock is a software counting register that
monotonically increases its value, and it is not related to the physical clock in any way. We
denote with Li(e) the logical timestamp assigned by the logical clock of a process pi to the
event e. The fundamental property is:

if e → e′ then L(e) < L(e′)
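A minimal sketch of a scalar (Lamport) clock respecting this property: increment on every event, and on receipt take the maximum of the local value and the piggybacked timestamp before incrementing.

class LamportClock:
    def __init__(self):
        self.t = 0
    def local_event(self):
        self.t += 1
        return self.t
    def send(self):
        self.t += 1
        return self.t                  # timestamp piggybacked on the message
    def receive(self, ts):
        self.t = max(self.t, ts) + 1   # never fall behind the sender
        return self.t

p, q = LamportClock(), LamportClock()
m = p.send()                           # p: t = 1
print(q.receive(m))                    # q: t = max(0, 1) + 1 = 2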
4.1.2 Vector Logical Clock
Each process pi initializes its vector clock Vi with Vi[j] = 0 ∀ j = 1 ... N, and pi increases
Vi[i] by 1 when it generates an event: Vi[i] = Vi[i] + 1.
When pi sends m:
• create an event send(m);
• increment Vi[i];
• timestamp m with t = Vi;
When pi receives m with timestamp t:
• update Vi[j] = max(t[j], Vi[j]) ∀ j = 1 ... N;
• produce an event receive(m);
• increment Vi[i];
The implementation is like before, with the difference that the updates of the vector are done
at the index corresponding to the process that generated the event. So Vi[i] represents the
number of events produced by pi, while Vi[j] represents the number of events of pj that pi
knows about. Two events are in happened-before relation iff V ≤ V′ ∧ V ≠ V′, i.e. it must be
V < V′.
• event e11 with V = (1, 0, 0) < event e12 with V = (1, 1, 0), so we have: e11 → e12;
• event e11 with V = (1, 0, 0) ≮ event e13 with V = (0, 0, 1) (and vice versa), so we have:
e11 ∥ e13;
Differently from the Scalar Clock, the Vector Clock allows us to determine whether two events
are concurrent or in happened-before relation.
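A hedged sketch of the vector clock rules above, including the comparison that tells happened-before apart from concurrency (the three-process scenario reproduces the e11/e12/e13 example):

class VectorClock:
    def __init__(self, i, n):
        self.i, self.v = i, [0] * n
    def send(self):
        self.v[self.i] += 1            # the send itself is an event
        return list(self.v)            # timestamp t = V_i attached to m
    def receive(self, t):
        self.v = [max(a, b) for a, b in zip(self.v, t)]
        self.v[self.i] += 1            # the receive is an event too

def happened_before(v, w):             # v < w  iff  v <= w and v != w
    return all(a <= b for a, b in zip(v, w)) and v != w

p0, p1, p2 = (VectorClock(i, 3) for i in range(3))
e11 = p0.send()                        # (1, 0, 0)
p1.receive(e11)
e12 = list(p1.v)                       # (1, 1, 0)
e13 = p2.send()                        # (0, 0, 1)
print(happened_before(e11, e12))       # True: e11 -> e12
print(happened_before(e11, e13) or happened_before(e13, e11))  # False: concurrent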
4.2 Logical Time and Distributed Algorithms
4.2.1 Lamport's algorithm
The algorithm's rules for a process pi are:
• Access to the CS:
– pi sends a request message, attaching its clock value, to all processes and adds its request
to its queue Q;
• Request reception from pj:
– pi puts pj's request (timestamp included) in its queue and sends back an ACK to pj;
• pi enters the CS iff:
– pi has in its queue a request of its own with timestamp t;
– t is the smallest timestamp in the queue;
– pi has already received from every other process an ACK with timestamp t′ > t;
• Release of the CS:
– pi sends a release message to all the other processes and deletes its own request from the
queue;
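A compact sketch of the entry condition only (message handling is omitted, names are illustrative): a process enters the CS when its own request has the smallest (timestamp, pid) pair in the queue and every other process has been heard from with a larger timestamp.

def can_enter_cs(pid, queue, last_ts_from):
    """queue: set of (ts, pid) requests; last_ts_from: pid -> last ACK timestamp."""
    mine = min(ts for ts, p in queue if p == pid)
    is_smallest = all((mine, pid) <= (ts, p) for ts, p in queue)
    acked = all(ts > mine for p, ts in last_ts_from.items() if p != pid)
    return is_smallest and acked

queue = {(3, "p1"), (5, "p2")}
print(can_enter_cs("p1", queue, {"p2": 6, "p3": 7}))   # True
print(can_enter_cs("p2", queue, {"p1": 6, "p3": 7}))   # False: p1's request is older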
The safety proof can be sketched as follows: suppose by contradiction that both processes pi
and pj enter the critical section; this means that both have received an ACK from every other
process and that each one's timestamp is the smallest in its queue:
• ti < tj < ACKi.ts;
• tj < ti < ACKj.ts;
which is impossible, since ti < tj and tj < ti cannot hold at the same time.
So we have three cases:
• pj's ACK arrives before pj's request: then pi can enter the CS without any problem;
• pj's ACK arrives after pj's request but before pi's ACK: then pi enters the CS without any
problem and sends its ACK after executing the CS;
• both processes receive the ACKs when the two requests are in the queue, but mutual
exclusion is guaranteed by the total order on the timestamps;
Fairness is satisfied because different requests can be either in happened-before or in
concurrent relation: in the first case everything is done respecting that order, while in the
second case the CS accesses can happen in any order. In the worst case, this algorithm needs
3(N − 1) messages per CS execution.
This algorithm lets processes enter the critical section in the order of their request timestamps,
breaking ties with the process number, or on the basis of any deterministic function that
ensures a total order.
5 Failure Detection & Leader Election
A system is synchronous, asynchronous or partially synchronous depending on its timing
assumptions. If they are explicit, we talk about a synchronous system, otherwise it is
asynchronous. Partially synchronous systems are the ones that need abstract timing
assumptions, and we have two choices:
• put the assumptions in the system model (including links and processes);
• create separate abstractions that encapsulate the timing assumptions;
To prove the correctness of the perfect failure detector we have to show that both properties
are satisfied. In this case they follow from the perfect point-to-point link: if a process crashes,
it won't be able to send HEARTBEAT_REPLY messages any more, so if at the timeout there
is a process that hasn't replied to the requests, it means that it has crashed.
As we see from the specification, an eventually perfect failure detector ◇P can mistakenly
suspect a process, but it restores it as soon as it receives a reply, also increasing the timeout.
This can happen when the chosen timeout is too short. If a process q crashes and stops
sending replies, p doesn't change its judgment anymore.
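A single-round sketch of this logic (the network is abstracted by the alive_replies set): processes that did not answer the heartbeat request before the timeout are declared crashed, exactly once.

def detect(processes, alive_replies, detected):
    for p in sorted(processes):
        if p not in alive_replies and p not in detected:
            detected.add(p)
            print("crash detected:", p)    # trigger <crash | p>
    return detected

detected = set()
detect({"p1", "p2", "p3"}, alive_replies={"p1", "p3"}, detected=detected)  # p2
detect({"p1", "p2", "p3"}, alive_replies={"p1"}, detected=detected)        # p3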
5.2 Leader Election
5.2.1 Perfect Leader Election
Sometimes we may be more interested in knowing whether some process is alive than in
monitoring failures. We can use a different oracle, which reports a process that is alive, called
the Leader Election module. In the perfect leader election we use the perfect failure detector.
This ensures that, eventually, the correct processes will elect the same correct process as their
leader. It is not guaranteed that leaders won't change for an arbitrary period of time, and
many leaders might be elected during the same period of time without having crashed; once a
unique leader is determined and doesn't change again, we say that the leader has stabilized.
In the last piece of code, in the deliver event, we update the epoch number of a process that
crashed and recovered again.
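A hedged sketch of the classic "monarchical" construction on top of a perfect failure detector: every process trusts the highest-priority process not yet detected as crashed, so all correct processes eventually agree on the same leader.

def leader(ranked_processes, detected):
    for p in ranked_processes:             # lower index = higher priority
        if p not in detected:
            return p

procs = ["p1", "p2", "p3"]
print(leader(procs, set()))                # p1
print(leader(procs, {"p1"}))               # p2, once p1 is detected as crashed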
Exercise 1
Exercise 1.1
The answer is yes, because we can implement it on the process that has all the links of channel
A: when the timeout ends, if a process didn't send a reply we know for sure that it has crashed.
Init:
  correct_i = {p1, p2, p3, p4}
  alive_i = ∅
  detected_i = ∅
  for each pj ∈ correct_i do:
    trigger send(HeartBeatRequest, i) to pj
  start(timer1)

upon receive(HeartBeatReply, j) do:
  alive_i = alive_i ∪ {pj}

when timer1 = 0:
  for each pj ∈ correct_i do:
    trigger send(AliveList, alive_i, i) to pj
  start(timer2)

when timer2 = 0:
  for each pj ∈ correct_i do:
    if pj ∉ alive_i ∧ pj ∉ detected_i:
      detected_i = detected_i ∪ {pj}
      trigger crash(pj)
  alive_i = ∅
  for each pj ∈ correct_i do:
    trigger send(HeartBeatRequest, i) to pj
  start(timer1)
Exercise 1.2
We can't, because once the process that has all the channels A fails, it is not guaranteed that
all the failures will be detected, since the remaining channels have a non-zero probability of
losing a message.
Exercise 1.3
We can't, because all the links are fair-loss, so we can only implement an eventually perfect
failure detector.
Exercise 2
The answer is yes, because we can implement it on the process that has all the links of channel
A: when the timeout ends, if a process didn't send a reply we know for sure that it has crashed.
Uses:
  Oracle O_i
  Perfect P2P link

Init:
  leader_i = ⊥
  left_i = getLeft()
  right_i = getRight()
6 Broadcast Communication
6.1 Best Effort Broadcast
We will now focus on message broadcasting. This means that a process sends a message to all
the other ones, and there are several types of broadcast. The first one is BEB, or Best Effort
Broadcast, which ensures message delivery only if the sender doesn't crash; if it does, processes
may disagree on whether or not to deliver the message.
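A minimal sketch of BEB: broadcasting is just a pp2p-send to every process, so all guarantees come from the perfect links and nothing compensates for a sender crashing mid-loop.

def beb_broadcast(m, processes, pp2p_send):
    for q in processes:
        pp2p_send(q, m)                    # trigger <pp2pSend | q, m>

beb_broadcast("hello", ["p1", "p2", "p3"],
              lambda q, m: print(f"pp2p send {m!r} to {q}"))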
6.2 Reliable Broadcast
The Reliable Broadcast, instead, has the same properties as BEB plus the Agreement
property; two schemes, one for BEB and one for RB, help in understanding the difference:
6.2.1 Reliable Broadcast, Synchronous system
In the Reliable Broadcast for a synchronous system, the first if in the upon event (beb,
deliver) is used to check for duplicate messages (in case of delays in receiving messages), so it
guarantees No duplication. Only the messages coming from a crashed process are
retransmitted, which ensures the Agreement property; the re-broadcast is done leaving the
original crashed process as sender. In the best case we have 1 BEB message per RB message,
meaning there are no crashes in the system and no need to re-broadcast; in the worst case we
have n − 1 BEB messages per RB message, i.e. n − 1 failures, so for each RB message we have
to re-broadcast the message n − 1 times.
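A condensed sketch of this (lazy) scheme: duplicates are filtered with a delivered set, delivered messages are remembered per sender, and when a sender is detected as crashed its messages are re-broadcast with the original sender kept.

class LazyReliableBroadcast:
    def __init__(self, beb_broadcast):
        self.delivered, self.from_ = set(), {}
        self.beb = beb_broadcast
    def on_beb_deliver(self, sender, m):
        if (sender, m) in self.delivered:
            return                             # No duplication
        self.delivered.add((sender, m))
        print("rb deliver:", sender, m)
        self.from_.setdefault(sender, []).append(m)
    def on_crash(self, p):
        for m in self.from_.get(p, []):        # Agreement: relay on behalf of p
            self.beb(p, m)

rb = LazyReliableBroadcast(lambda s, m: print(f"re-broadcast {m!r} as {s}"))
rb.on_beb_deliver("p1", "m1")
rb.on_beb_deliver("p1", "m1")                  # duplicate: ignored
rb.on_crash("p1")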
6.3 Uniform Reliable Broadcast
There is also another type of Reliable Broadcast, called Uniform Reliable Broadcast or URB,
where the only difference is that the Agreement property becomes uniform: correct processes
must deliver also the messages delivered by faulty processes, so the deliveries of the crashed
processes are a subset of the deliveries of the correct processes:
Where ack is a matrix with one row per message and one column per process of the system.
When a process p BEB-broadcasts a message, we put in pending the tuple
(sender id, message id). When a process BEB-delivers a message m, it inserts itself in the row
of m in the ack matrix; if the tuple (sender id, message id) is not in pending, the tuple is
inserted and the message is rebroadcast. candeliver is a boolean function that checks that the
set of correct processes is a subset of the processes that have BEB-delivered m.
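A sketch of this bookkeeping (following the all-correct-processes formulation in the text): ack[m] is the row of processes that have beb-delivered m, and candeliver holds once the set of correct processes is contained in that row.

class URB:
    def __init__(self, correct):
        self.correct = set(correct)
        self.pending, self.ack, self.delivered = set(), {}, set()
    def on_beb_deliver(self, p, sender, m):
        self.ack.setdefault(m, set()).add(p)
        if (sender, m) not in self.pending:
            self.pending.add((sender, m))      # first sight: re-broadcast m here
        if self.correct <= self.ack[m] and m not in self.delivered:
            self.delivered.add(m)              # candeliver(m) holds
            print("urb deliver:", sender, m)

u = URB(correct={"p1", "p2"})
u.on_beb_deliver("p1", "p1", "m")
u.on_beb_deliver("p2", "p1", "m")              # all correct processes acked: deliver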
6.4.1 Eager Probabilistic Broadcast
This broadcast is used when we work on huge distributed systems: if we have 100 nodes, we
could need 100² or 100³ messages in the worst case for a single message delivery. With this
system, based on Gossip Dissemination, a process sends a message to a set of random
processes, the processes that receive it send the message to another set of random processes,
and this goes on for r rounds; we cover almost all the nodes with a cost much lower than
before.
The picktargets function picks k random processes from the entire set of processes, while the
gossip function is used to send a message to a subset of processes for a number of rounds.
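A toy, single-machine sketch of gossip dissemination with the picktargets idea (the values of k and r are made-up parameters):

import random

def picktargets(processes, k, me):
    return random.sample([p for p in processes if p != me], k)

def gossip(processes, source, k, r):
    infected = {source}                        # processes that know the message
    for _ in range(r):
        for p in list(infected):
            for q in picktargets(processes, k, p):
                infected.add(q)                # q delivers the message
    return infected

procs = list(range(100))
print(len(gossip(procs, 0, k=3, r=4)), "nodes reached out of", len(procs))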
7 Consensus
7.1 Regular Consensus
A group of processes must agree on a value proposed by one of them: they start with different
opinions and then they converge toward only one of them.
We don't deal with asynchronous systems, because no algorithm can guarantee to reach
consensus even with a single process crash failure. In the case of a synchronous system we can
implement Flooding Consensus, in which processes exchange their values and, when all the
processes have made their own proposal available, a value is chosen; for this to work we need
the communication to be failure-free.
receivedfrom is an array where we insert in the i-th position the set of processes from which
we delivered in the i-th round, while proposals is an array of N entries where the i-th position
holds all the proposals received in that round (including our own). The propose event handles
our proposal by adding it to the proposal set and sending it to all the others through BEB.
When we check the correct array, we are checking whether the set of processes we received
proposals from in this round is equal to the one of the last round (so only correct processes
are left); if we haven't decided yet, the process decides by applying a deterministic function to
the proposal set and sends the decision to all the others through BEB. The last event handles
the situation in which the value was decided by another process: we check that the process
that took the decision is alive, set the decided variable and rebroadcast the decision with BEB.
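A crash-free, single-round illustration of the decision rule (min plays the role of the deterministic function): a process decides only when the set of processes heard from stops shrinking between rounds.

def decide(proposals, received_this_round, received_last_round):
    if received_this_round == received_last_round:   # no new crash observed
        return min(proposals)                        # deterministic choice
    return None                                      # move to the next round

print(decide({5, 3, 8}, {"p1", "p2", "p3"}, {"p1", "p2", "p3"}))  # decides 3
print(decide({5, 3}, {"p1", "p2"}, {"p1", "p2", "p3"}))           # None: next round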
• Correctness:
– Validity and Integrity follow from the properties of the communication channels;
– Termination is ensured because the algorithm terminates after at most N rounds;
– Agreement is satisfied because the same deterministic function is applied to the same
values by all correct processes;
• Performance:
– best case: one communication round plus the decision broadcast, so 2 · N² messages;
– worst case: N² messages are exchanged in each of the N rounds, so we have N³ messages;
7.2 Uniform Consensus
In the Uniform Consensus we have the Uniform Agreement property, which means that
faulty processes also agree on the decided value.
In this case we check only the proposals from the current round, so the algorithm is very
similar to the previous one, but here the decision is based only on the current round. If we are
not in the last round, we increment the round variable and reset the receivedfrom array, the
set that contains the processes from which we received a proposal.
• Correctness:
– Validity and Integrity follow from the properties of the best-effort broadcast;
– Termination is ensured because all correct processes reach round N and decide in that
round;
∗ the strong completeness property of the failure detector implies that no correct process
waits indefinitely for a message from a process that has crashed, as the crashed process is
eventually removed from correct;
– Uniform Agreement holds because all processes that reach round N have the same set of
values in their proposalset variable;
• Performance:
– we have N communication steps and O(N³) messages for all correct processes to decide;
8 Paxos
The Paxos algorithm was introduced to provide a viable solution to consensus in asynchronous
systems: Safety is always guaranteed, while the algorithm makes progress (Liveness) only
when the network works well for long enough (partial synchrony). We have two basic
assumptions:
• agents can fail by stopping, they operate at arbitrary speed and they may restart;
– since all agents may fail after a value is chosen and then restart, a solution is impossible
unless some information can be remembered by an agent that has failed and restarted;
• messages can take arbitrarily long to be delivered and can be duplicated or lost, but they
are not corrupted;
There are three actors in the Paxos protocol:
• Proposers: they propose values;
• Acceptors: they commit on a final decided value;
• Learners: they passively assist to the decision and obtain the final decided value;
A model with only one acceptor is the simplest one, but we have a problem if it crashes, so we
must have multiple acceptors; in this case a value is accepted when the majority of them
accepts it. The problem is that each acceptor may receive a different set of proposals. A
possible solution is that:
• an acceptor may accept at most one value;
But in this case, which value should the acceptor accept? A possible solution is that:
• an acceptor must accept the first proposal it receives;
But in this case we can have a sort of deadlock in which the acceptors cannot reach a majority.
We have to keep track of the different proposals by assigning each one a unique number, and
then a value is chosen when a proposal with that value has been accepted by the majority.
• If a proposal with value v is accepted, every higher-numbered proposal that is accepted by
any acceptor has value v;
But what if a new proposal proposes a different value that the acceptor must accept?
• If a proposal with value v is chosen, every higher-numbered proposal issued by any proposer
has value v;
Now let's assume that a proposal m with value v has been accepted; we have to guarantee
that any proposal n > m has value v. We can prove it by induction, assuming that every
proposal with number in [m, n − 1] has value v. For m to be accepted there is a majority of
acceptors that accepted it; therefore the assumption implies that every acceptor in that
majority has accepted a proposal with number in [m, n − 1] with value v.
• For any v and n, if a proposal with value v and number n is issued, then there is a set S
consisting of a majority of acceptors such that either:
– no acceptor in S has accepted any proposal numbered less than n;
– v is the value of the highest-numbered proposal among all proposals numbered less
than n accepted by the acceptors in S;
This condition considers the situation in which a set of acceptors S accepts a proposal n with
value v, and this can happen in two cases: in the first case, no previous proposal with
number less than n was accepted; in the second case, a proposal n′ < n was already accepted,
but v is equal to the value v′ of the highest-numbered such proposal n′.
To ensure this, we require that a proposer that wants to propose a value numbered n learns
the highest-numbered value with number less than n that has been or will be accepted by any
acceptor in a majority. To deal with the "will be" part, the proposer simply asks the acceptors
to promise not to accept any more proposals numbered less than n. The Paxos protocol has
two main phases:
• Phase 1:
– a proposer chooses a new proposal number n and sends a prepare request
(PREPARE, n) to a majority of acceptors;
– if an acceptor receives a prepare request, it responds with a promise not to accept
any more proposals numbered less than n, and it suggests the value v′ of the
highest-numbered proposal n′ that it has accepted, if there is any, else ⊥:
(ACK, n, n′, v′) if it exists;
(ACK, n, ⊥, ⊥) if not;
– if an acceptor receives a prepare request with an n lower than that of any prepare
request it has already responded to, it sends out a (NACK, n);
• Phase 2:
– if the proposer receives ACKs from a majority of acceptors, then it can issue an accept
request (ACCEPT, n, v), where n is the number that appeared in the prepare request
and v is the value of the highest-numbered proposal among the responses, or the proposer's
own proposal if none was received;
– if an acceptor receives an accept request, it accepts the proposal unless it has already
responded to a prepare request with a number greater than n;
– whenever an acceptor accepts a proposal, it sends (ACCEPT, n, v) to the learners; a
learner that receives (ACCEPT, n, v) from a majority of acceptors decides v and sends a
(DECIDE, v) to all the other learners, and all the learners that receive (DECIDE, v)
decide v;
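A hedged sketch of the acceptor side of the two phases (no networking, single acceptor): promised and accepted carry exactly the state the protocol above needs.

class Acceptor:
    def __init__(self):
        self.promised = -1                     # highest n promised so far
        self.accepted = (None, None)           # (n', v') of the last accepted proposal
    def on_prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("ACK", n, *self.accepted)  # suggest the accepted pair, if any
        return ("NACK", n)
    def on_accept(self, n, v):
        if n >= self.promised:                 # no higher promise was made
            self.promised = n
            self.accepted = (n, v)
            return ("ACCEPT", n, v)            # forwarded to the learners
        return ("NACK", n)

a = Acceptor()
print(a.on_prepare(1))                         # ('ACK', 1, None, None)
print(a.on_accept(1, "x"))                     # ('ACCEPT', 1, 'x')
print(a.on_prepare(0))                         # ('NACK', 0): already promised n = 1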
9 Ordered Communications
Here we need to define guarantees on the order of deliveries inside a group of processes. We
have three different types of ordering:
• deliveries respect the FIFO ordering of the corresponding sends;
• deliveries respect the Causal ordering of the corresponding sends;
• deliveries respect a Total ordering, shared by all processes;
The reliable broadcast we previously studied doesn't have any property on the delivery order
of messages, and this can cause problems in some communications.
In the upon event Deliver there is a while loop on next, used when we receive a message
whose sn (the sequence number used to control the order) is the expected one; next is an
array used to track how many messages have been delivered from each process. In the while
loop we empty the pending set of the messages whose sn is the next expected one, and in this
way we respect the FIFO property.
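A sketch of that delivery loop, assuming per-sender sequence numbers starting at 1:

class FIFOBroadcast:
    def __init__(self):
        self.next = {}                         # sender -> next expected sn
        self.pending = set()                   # out-of-order (sender, sn, m) triples
    def on_rb_deliver(self, sender, sn, m):
        self.pending.add((sender, sn, m))
        nxt = self.next.setdefault(sender, 1)
        while True:
            ready = next(((s, n, x) for s, n, x in self.pending
                          if s == sender and n == nxt), None)
            if ready is None:
                break                          # a gap remains: wait
            self.pending.discard(ready)
            print("fifo deliver:", sender, ready[2])
            nxt = self.next[sender] = nxt + 1

f = FIFOBroadcast()
f.on_rb_deliver("p1", 2, "b")                  # held back in pending
f.on_rb_deliver("p1", 1, "a")                  # delivers a, then b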
It's important to note that Causal Broadcast = Reliable Broadcast + Causal Order and that
Causal Order = FIFO Order + Local Order, where Local Order means that if a process
delivers a message m before sending a message m′, then no correct process delivers m′ if it has
not already delivered m.
9.2.1 Waiting Causal Broadcast
V is the logical vector, a vector of dimension N, the number of processes. In the Broadcast
event, we copy the current logical vector and, in the position of the current process, we insert
lsn before it is incremented. In the Deliver event, we enter the while loop only if the set of
pending messages contains a message whose logical vector W is lower than or equal to our
logical vector; in this way we can deliver it while respecting the causal order condition. In the
Deliver event we also increment the logical clock entry of the sender process: by the RB
properties, a message that we are going to deliver was previously sent by some sender process,
so we increment the entry of that process.
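A sketch of the waiting condition only: a pending message with vector W is deliverable once W ≤ V componentwise, and each delivery bumps the sender's entry, possibly unblocking other messages.

def deliverable(W, V):
    return all(w <= v for w, v in zip(W, V))

def try_deliver(pending, V):
    progress = True
    while progress:
        progress = False
        for sender, W, m in list(pending):
            if deliverable(W, V):
                pending.remove((sender, W, m))
                V[sender] += 1                 # one more delivery from `sender`
                print("crb deliver:", m)
                progress = True

V = [0, 0, 0]                                  # our vector clock (3 processes)
pending = [(0, (0, 0, 0), "m1"), (1, (1, 0, 0), "m2")]
try_deliver(pending, V)                        # m1 is delivered first, unblocking m2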
Safety: given two broadcast messages m and m′ such that broadcast(m) → broadcast(m′),
each process has to deliver m before m′.
Liveness: eventually each message will be delivered, and this is guaranteed by two
assumptions:
• the number of broadcast events that precede a certain event is finite;
• the underlying reliable broadcast eventually delivers every broadcast message;
9.2.2 Non-Waiting Causal Broadcast
The approach of this algorithm is continuous: each time a message is delivered, the process
doesn't wait for the missing messages; a message is delivered as soon as the process is sure
that the past messages of the received one have been delivered, and it is then added to the list
of past messages. The past list is a list in which all the messages involved in deliver or
broadcast actions are inserted (respecting the causal order). In the broadcast event we insert
in the past list the message that is being broadcast. In the deliver event, instead, we check the
whole past list of the received message: each message extracted from it is checked against the
messages the current process has already delivered and, if missing, it is delivered and inserted
in the current past list; at the end the received message itself, if it is not among the past
messages, is delivered and inserted in the list.
10 Total Order Broadcast
A Total Order Broadcast is a reliable broadcast that orders all messages, even those from
different senders and those that are not causally related: Total Order Broadcast = Reliable
Broadcast + Total Order. From the reliable part we have that processes agree on the same set
of messages they deliver, and from the total order part, that processes agree on the same
sequence of messages. A message is delivered by all or by none of the processes and, if it is
delivered, every other message is ordered either before or after it.
It's important to note that Total Order is orthogonal to FIFO and Causal Order: respecting
the total order doesn't mean that FIFO and causal order are respected too. FIFO and causal
order are related (if the causal order is respected, FIFO is respected as well), while with total
order alone we cannot make any assumption on causal or FIFO order.
In order to study this part, we need to consider a system model composed of a static set of
processes with perfect communication channels, asynchronous and subject to crash faults, and
we characterize the system in terms of its possible runs R. Total order specifications are
usually composed of four properties:
• a Validity property guarantees that messages sent by correct processes will eventually be
delivered at least by correct processes;
• an Integrity property guarantees that no spurious or duplicate messages are delivered;
• an Agreement property ensures that processes deliver the same set of messages;
• an Order property constrains processes delivering the same messages to deliver them in
the same order;
The total order specifications with crash failures and perfect channels are:
• NUV: if a correct process TOCASTs a message m, then some correct process will eventually
deliver m;
• UI: for any message m, every process p delivers m at most once, and only if m was
previously TOCAST by some process;
The Agreement property:
So the constraint for Uniform Agreement is that correct processes always deliver the same
set of messages, and that the set of messages delivered by a faulty process is a subset of the
set delivered by the correct processes; in NUA, instead, the set delivered by faulty processes
can be completely different.
SUTO says that processes have the same prefix of the sequence of delivered messages, and
after an omission (a message not delivered by someone) we have disjoint sets of delivered
messages (like p3 in the example image). In WUTO, instead, there is no such restriction: the
only thing that matters is the relative order of the deliveries shared between processes.
10.1 Total Order Algorithm
When the list of messages is not empty and the process is not waiting for any decision from
the consensus, it proposes its unordered list of messages. When it receives a decision from
the consensus, it delivers all the decided messages in some deterministic order. It's important
to note that the process checks that the consensus round and its own round are equal, in
order to be sure that the decision is taken on the current situation.
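A condensed sketch of one round of this scheme, with a trivial stand-in for the consensus instance and sorting as the deterministic delivery order:

def to_round(round_no, unordered, consensus_propose):
    decided = consensus_propose(round_no, set(unordered))  # same at every process
    for m in sorted(decided):                  # deterministic delivery order
        print(f"round {round_no}: to-deliver", m)
    return [m for m in unordered if m not in decided]      # leftovers for next round

# The lambda below fakes consensus by deciding exactly the local proposal.
leftover = to_round(1, ["b", "a", "c"], lambda r, batch: batch)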
• Due to URB, all processes (even faulty ones) deliver the same set of messages, so we obtain
UA;
• Due to non-uniform consensus, all correct processes decide the same list of messages, so
correct processes deliver messages in the same order, while a faulty process may deliver
(before crashing) a different sequence of messages.
11 Distributed Registers
• each value is univocally identified;
• processes are sequential, so a process can invoke only one operation at a time;
The notation for a register is (X, Y), where X processes can write and Y processes can read;
for example, (1, 1) is a register in which only one process can write and only one process can
read (these processes are decided a priori).
It's important to note that in a regular register a process can read a value v′ and then a value
v even if the writer has written v and then v′, as long as the write and the read operations are
concurrent, but this is not allowed in an Atomic register:
11.0.2 Atomic Register
The Atomic Register is a regular register with an ordering property (which is valid also for
read operations of different processes):
• Ordering: if a read returns v2 after a read that precedes it has returned v1, then v1 cannot
be written after v2;
Some examples:
11.1.1 Read-One-Write-All Algorithm
We will use the fail-stop algorithm Read-One-Write-All, in which processes can crash but
the crashes can be reliably detected by all the other processes with the use of a perfect failure
detector; it uses perfect point-to-point links and a Best-Effort Broadcast (BEB). The idea of
the algorithm is that each process stores a local copy of the register, where:
• Read-one: each read operation returns the value stored in the local copy of the register;
• Write-all: each write operation updates the value locally stored at each process that the
writer considers not to have crashed, and a write completes when the writer receives an ack
from each process that has not crashed;
The BEB instance is used to broadcast the new value to all the processes during the write
operation. The pl instance is used when the processes have to send the ack back to the writer.
The writeset is used by the writer to keep track of all the processes that confirmed the
reception of the updated value: when the set of correct processes is a subset of the processes
that sent an ack, the write operation completes and the writeset is emptied. So for the write
operation we need at most 2N messages, while the read operation needs 0 messages, because
it is a local operation.
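A sketch of the replica-side state of Read-One-Write-All (the message plumbing is abstracted away): writes complete when all non-crashed processes acked, reads are purely local.

class ROWARegister:
    def __init__(self, processes):
        self.val = None
        self.correct = set(processes)          # maintained via the perfect FD
        self.writeset = set()
    def on_write_deliver(self, v):             # every replica stores a local copy
        self.val = v                           # ...and then pp2p-sends an ACK back
    def on_ack(self, p):
        self.writeset.add(p)
        if self.correct <= self.writeset:      # every non-crashed process acked
            self.writeset = set()
            print("write complete")
    def read(self):
        return self.val                        # read-one: local, 0 messages

r = ROWARegister({"p1", "p2"})
r.on_write_deliver(42)
r.on_ack("p1"); r.on_ack("p2")
print(r.read())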
11.1.2 Fail-Silent Algorithm
The problem with the previous algorithm is that it doesn't ensure validity if the failure
detector is not perfect. In that case we can use a different algorithm that doesn't use a failure
detector at all. This algorithm is the fail-silent majority voting regular register: the idea
is that each process locally stores a copy of the current value of the register, each written
value is univocally associated with a timestamp, and the writer and the reader processes use a
set of witness processes to track the last value written. We use Quorums, i.e. sets of witness
processes such that the intersection of any two of them is not empty, and Majority Voting,
so each set is constituted by a majority of processes:
When a process needs to write, it broadcasts its value together with its (increased) timestamp.
When a process receives the message, in the deliver event of the write, it checks whether the
received timestamp is bigger than the current timestamp of its value; in that case it updates
the value and sends an ACK back. When the writer receives more than N/2 ACKs (we
assume a majority of correct processes), it triggers the WriteReturn. For a read operation,
instead, since we don't have a perfect failure detector, a process cannot be sure that its local
value is still up to date, so it needs to consult the other processes in order to obtain a quorum.
In the deliver event of the read we build the quorum: the first check is that the received r
(the identifier of the read) is the same as the current rid; the value is then inserted in the
readlist and, when the number of processes in the list exceeds half of the total number of
processes, the ReadReturn is triggered with the value with the highest timestamp. Both a
Write and a Read operation need at most 2N messages.
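A sketch of the reader's quorum rule alone: with more than N/2 (timestamp, value) replies for the current read id, return the value with the highest timestamp.

def read_return(readlist, n):
    if len(readlist) > n // 2:                 # majority quorum reached
        return max(readlist)[1]                # pair with the highest timestamp
    return None                                # keep waiting for more replies

replies = [(3, "old"), (5, "new"), (4, "mid")]
print(read_return(replies, n=5))               # 'new'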
In order to pass from a Regular Register (1,N) to an Atomic Register (1,N) we have to
distinguish two steps:
• we use a Regular Register (1,N) to build an Atomic Register (1,1);
• we use a set of Atomic Registers (1,1) to build an Atomic Register (1,N);
11.2.1 Regular Register (1,N) to Atomic Register (1,1)
In order to respect the Ordering property, there is a check on the received timestamp: it must
be greater than the timestamp of the previously read value. Each Write or Read operation
requests a write/read on the underlying regular register.
11.2.2 Atomic Register (1,1) to Atomic Register (1,N)
In the Write event we use the writing variable, which permits only one process to write at a
time. In the WriteReturn event of the atomic registers below, when we receive a number of
ACKs equal to the number of processes, the process checks whether the variable writing is
true (i.e. this process is the writer): in that case it triggers its own WriteReturn, otherwise
it triggers its ReadReturn. In the ReadReturn event of the atomic registers below we add a
tuple with the received value and its timestamp to the readlist; when the number of items in
the list is N, we choose the value with the maximum timestamp (in order to respect the
ordering property) and we send the new value to all the other processes with a Write on the
atomic registers below. It is important to note that both read and write operations use a
write operation on the N atomic registers below, so when a WriteReturn event is received we
can have two cases: if the process is the writer, it completes its write; if it is not the writer, it
completes its read through the ReadReturn.
11.2.3 Read-Impose Write-All Algorithm
In this algorithm we use a Best Effort Broadcast, a Perfect Failure Detector and perfect
point-to-point links. When a process wants to read a variable from the register, it broadcasts
to all the other processes a write operation carrying its own local value with its timestamp (it
imposes the value it is about to return). When a process receives a beb-deliver with a write,
it checks whether the received timestamp is bigger than its own timestamp for that variable;
in that case it updates the tuple, and then it sends back an ACK to confirm the delivery of
the broadcast message. When the set of correct processes is a subset of the writeset (the set
of processes that have sent an ACK), if the process is a reader (variable reading = true) it
triggers the ReadReturn, else, if the process is a writer, it triggers the WriteReturn. Since for
any read operation the reader ensures that every other process has a timestamp greater than
or equal to the returned one, the ordering property is ensured. Write and Read operations
need at most 2N messages each.
11.2.4 Read-Impose Write-Majority algorithm
Since in this algorithm we don't use any failure detector, in addition to a write timestamp we
need a read timestamp, which is incremented in both read and write operations. In the Read
event we broadcast a request to report to all the other processes that our process needs a
quorum on the variable. When we receive more than N/2 responses, the process takes only
the messages with r = rid (the current read), selects the value with the highest timestamp,
and triggers a broadcast that imposes on all the other processes the value decided by the
quorum. In the deliver of the write event, instead, to maintain the ordering property, a
process takes the value only if its timestamp is bigger than the current one, and then sends
an ACK. Then, as for the read event, when the number of ACKs received exceeds N/2, the
process triggers the ReadReturn if it is a reader, or the WriteReturn if it is a writer. Since
the read imposes the value it read on a majority of processes, and thanks to the intersection
property of quorums, the ordering property is respected. A write operation needs at most 2N
messages, while a read operation needs at most 4N messages, because the read does two
broadcasts: one to obtain a quorum and one to impose its value on all the other processes.
12 Software Replication
In distributed systems, Software Replication is used for fault-tolerance purposes, i.e. to
guarantee the availability of a service (an object) despite failures. If we consider p the failure
probability of an object O, the availability of O is 1 − p. If we replicate the object O on n
nodes, its availability becomes 1 − pⁿ, since the object is unavailable only if all n replicas fail
(for example, with p = 0.01 and n = 3 the availability is 1 − 10⁻⁶). The system model is
composed of a set of processes connected by perfect point-to-point links, and they may crash.
These processes interact with a set of objects X located at different sites managed by
processes, under one of the following consistency criteria:
• Linearizability;
• Sequential consistency;
• Causal consistency;
The first two criteria compose Strong Consistency, while the third one is Weak
Consistency.
If we call ≺ the precedence relation and ∥ the concurrency relation, we say that an execution
E is linearizable if there exists a sequence S including all operations of E such that:
• for any operations O1 and O2 such that O1 ≺ O2, O1 appears before O2 in the sequence S;
• the sequence S is legal, i.e. for every object the sub-sequence of S restricted to it respects
the sequential specification of the object;
There is a sufficient condition for linearizability: replicas must agree on the set of invocations
they handle and on the order in which they handle these invocations:
• Atomicity: given an invocation x.op(arg) by pi, if one replica of the object x handles this
invocation, then every correct replica of x also handles that invocation;
• Ordering: given two invocations x.op(arg) by pi and x.op′(arg′) by pj, if two replicas handle
both invocations, they handle them in the same order;
There are two main techniques that implement linearizability: primary backup and active
replication.
• Scenario 1, Primary answers before failing:
– the client receives the answer, so it's fine;
• Scenario 2, Primary fails before sending the update messages:
– the client doesn't get any answer and resends the request after a timeout, so the new
primary will handle the request as new;
• Scenario 3, Primary fails after sending the update messages but before receiving all the
ACKs:
– in order to guarantee atomicity, the update must be received either by all or by no one;
when a primary fails, another one must be elected among all the replicas.
12.2 Active Replication
In Active Replication we need a total order broadcast, even for the clients. On the other
hand, Active Replication doesn't need any recovery action upon the failure of a replica.
13 CAP Theorem
The CAP theorem (Consistency, Availability, Partition tolerance) states that, in a
distributed system subject to failures, we can guarantee only two of these three properties at
the same time. To see why this rule holds, imagine a situation in which we have two nodes
connected to each other:
Data are replicated across the nodes, so they have the same dataset. Let's see what C, A and
P mean in this situation:
• Consistency: if the dataset in N1 is changed, then we need to change also the dataset in
N2, so that it looks the same in both of them;
• Availability: as long as both N1 and N2 are up and running, I should be able to
query/update data on any of them;
• Partition tolerance: if the link between N1 and N2 fails, I should still be able to
query/update my dataset;
The best way to understand how the theorem works is to see what happens during a network
partition. N1 and N2 cannot communicate anymore, and let's assume that we have some
means to discover the partition. Now, if someone talks to N1 and changes the dataset, N1
cannot propagate these changes to N2:
• if we choose consistency, we have to block all the updates on the system (both nodes), but
this makes it unavailable;
• if we choose availability, we make different updates on the two nodes, and this erases
consistency;
We can see that we have to sacrifice one of them; one decent solution is to reduce availability
to one single node and then update the other when the link is available again.
So we can choose between CP and AP, since choosing CA is obviously nonsense: a system
that gives up partition tolerance offers no guarantee when the network fails, so we can't rely
on it. It is also important to say that we can tune the C-level and the A-level, so we are not
constrained into choosing one extreme or the other: for example, we can be read-available on
any node but not update-available, or be available on only one node and apply some
post-partition recovery; or we can choose eventual consistency, if our application is fine with
using slightly old data on some nodes, to improve availability. Now let's write a formal proof
of the CAP theorem, by contradiction:
Let's assume that there exists a system that is consistent, available and partition tolerant,
and let's partition it. Next, we update the value on N1 to v1: since the system is available,
we can do it. Then we read the value from N2, but it returns v0, because the link is broken
and N1 cannot pass v1 to N2. This is an inconsistent result, so we have a contradiction.
14 Byzantine Tolerant Broadcast
Byzantine processes are processes that may deviate arbitrarily from the instructions that an
algorithm assigns to them, or act as if they were deliberately preventing the algorithm from
reaching its goals. The basic step to fight them is to use some cryptographic mechanisms to
implement the perfect link abstraction, but this alone does not allow to tolerate Byzantine
processes.
The consistency property is very important because it refers to one single broadcast event:
every correct process delivers the same message. In the broadcast event we send to all the
processes a message that contains the id of the sender; this is used in the deliver event, where
we check whether the message really comes from the sender, and in that case the process
resends the message to all the other processes, with itself as sender, as an ECHO. When a
process receives an echo message, it adds the message to the echo array in the position of the
process that sent it. When the number of processes with the message m in the echo array is
bigger than (N + f)/2, we can finally deliver the broadcast.
We know that in a normal quorum we need more than N/2 processes; in this case, since we
have to deal with f Byzantine processes, we need more than (N + f)/2. The number of
correct processes in such a quorum is bigger than (N + f)/2 − f = (N − f)/2, so any two
quorums together contain more than N − f correct processes, which is the total number of
correct processes in the system: hence two quorums always intersect in at least one correct
process. Moreover, N − f, the number of correct processes, has to be bigger than the quorum
size (N + f)/2, so that a quorum of correct processes always exists; this gives N > 3f, for
which correctness is ensured.
When the echo quorum of a receiving process is reached, the process sends a READY message
to all the other processes with its id and the message received through the echoes. When the
deliver event of a READY message fires, the ready array is filled with the message in the
position of the sender process. A READY message can also be sent when the number of
READY messages delivered is higher than f, because in this case at least one of them comes
from a correct process. At the end, in order to deliver the broadcast message, we check
that the number of processes from which we received the READY message is bigger than 2f;
this is done so that Byzantine processes alone cannot force a delivery with their READY
messages.
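A sketch of the two amplification thresholds just described, for N processes and at most f Byzantine ones (the counts are passed in; the message plumbing is omitted):

def on_counts(n, f, echo_count, ready_count, sent_ready):
    actions = []
    if not sent_ready and (echo_count > (n + f) / 2 or ready_count > f):
        actions.append("send READY")           # f+1 READYs: at least one is correct
    if ready_count > 2 * f:
        actions.append("deliver")              # enough READYs come from correct processes
    return actions

print(on_counts(n=4, f=1, echo_count=3, ready_count=0, sent_ready=False))  # send READY
print(on_counts(n=4, f=1, echo_count=0, ready_count=3, sent_ready=True))   # deliver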
The Byzantine Generals problem considers a commander who orders his lieutenants to attack
or to retreat; commander and lieutenants can be traitors (Byzantine) and, in order to win, all
loyal generals must attack or retreat together. After the commander sends the order, the
lieutenants can communicate among themselves. So we have two goals:
• all loyal generals decide upon the same plan of action;
• the traitors cannot cause the loyal generals to adopt a bad plan;
We can rephrase the goals as:
• any two loyal generals use the same value of v(i), where v(i) is the information
communicated by the i-th general;
• if the i-th general is loyal, then the value he sends must be used by every loyal general as
the value of v(i);
So the properties are:
• all loyal lieutenants obey the same order;
• if the commander is loyal, then every loyal general obeys the order he sends;
As for the correctness result of the previous section, we need N ≥ 3f + 1 generals in order to
make this work.
In this system model we have reliable communication channels, the message source is known,
message omissions can be detected and the default decision for lieutenants is RETREAT.
We will use a recursive algorithm, defining a set of protocols OM(f):
• OM(0):
– the commander sends his value to every lieutenant;
– each lieutenant uses the value received, or RETREAT if he receives no value;
• OM(f) with f > 0:
– the commander sends his value to every lieutenant;
– each lieutenant i, with v(i) the value it received from the commander, sends its value to
the other N − 2 lieutenants (everybody minus itself and the commander);
– for each value received (not counting duplicates), and considering also its own value, each
lieutenant uses the majority of the values to choose its own result;
In fewer words, when the lieutenants receive the value from the commander, they start
resending the value among themselves. When a lieutenant receives a message from another
lieutenant, it adds the value to an array, in the position of the sender; at the end it chooses
the majority based on the content of the array and, if it is not possible to find a majority,
the lieutenant decides to retreat.
16.1.1 Byzantine Tolerant Safe Register
We have N servers, with 1 writer and n readers.
We have to ensure that, once the writer returns from a write operation, any following read
operation returns the last written value: the writer sends a request to the servers and waits
for enough ACK messages to be sure that enough correct servers delivered it. The same holds
for the read operation, which sends a read request and waits for enough reply messages to be
able to read the newest value. The quorum that works for Byzantine broadcast is not enough
here, since the safe register has a stronger semantics: we require that a write operation is
visible to everyone once it terminates, so we need a Masking Quorum: N > 4f, with quorum
size (N + 2f)/2.
In the deliver event of the write, when a new write request arrives, we check whether the
sender is the writer process and, in that case, whether the value of the receiver is older than
the newly sent value; if so, the receiver updates it and then sends an ACK. When the writer
receives an ACK it adds it to the acklist and, when the number of ACKs received is bigger
than (N + 2f)/2 (so the quorum is reached), it can trigger the WriteReturn. When a process
wants to read a value, it triggers a read request and all the processes send it the requested
value. When the reader receives at least (N + 2f)/2 values, it selects the value that occurs
more than f times and has the highest timestamp; in this way the read returns the most
recent value written.
16.2.1 Regular Register with cryptography
In the write event the writer signs the message and sends it to all the processes. The quorum,
since we use cryptographic mechanisms, is bigger than (N + f)/2, so we have to assume
N > 3f. When a process wants to read some value, it triggers a read request to all the
processes, which respond with their signed value and its timestamp. The reader checks
whether the signature is correct and, in that case, adds the message to its readlist; when the
readlist is bigger than the quorum, it triggers the ReadReturn with the value with the highest
timestamp in the readlist.
16.2.2 Regular Register without cryptography
It's important to note that sometimes just one phase for writing is not enough, because there are
cases in which a reader is not able to choose a value, so we need two phases:
• Pre-write: the writer sends PREWRITE messages with the current value-timestamp
pair, and then waits until it receives PREACK messages from N − f processes;
• Write: the writer sends WRITE messages with the current value-timestamp pair, and
then waits until it receives ACK messages from N − f processes;
Every process stores two timestamp/value pairs, so that faulty processes can be detected.
The write event is the same as usual, with a quorum of N − f and with the two phases pre-write
and write. When a reader wants to read something, it sends a request, and every process responds
with a message containing two pairs: the value-timestamp pair of the pre-write and the same pair
for the write. The reader checks whether these two pairs are equal, or whether the timestamp of
the pre-write is bigger by exactly one unit than the timestamp of the write, which covers the case
in which that process received the pre-write but not yet the write (see the sketch below). The
reader then checks whether there are pairs reported by more than f processes (so that they cannot
come only from byzantine processes) and whether there exists a subset of the readlist forming a
quorum on the highest timestamps (the bigger of the two timestamps). If the quorum is not
reached, the read request is re-sent.
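As a sketch, the per-reply consistency check could look as follows in Python (the reply format is
assumed for illustration):

def valid_reply(prewrite, write):
    # A server's reply carries the (timestamp, value) pairs of its last
    # pre-write and its last write. They are consistent if they coincide,
    # or if the pre-write is exactly one step ahead (the server received
    # the PREWRITE but not yet the corresponding WRITE).
    return prewrite == write or prewrite[0] == write[0] + 1

print(valid_reply((5, "v5"), (5, "v5")))  # True: both phases completed
print(valid_reply((5, "v5"), (4, "v4")))  # True: write still in flight
print(valid_reply((9, "v9"), (4, "v4")))  # False: byzantine reply, gap > 1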
In order to transmit a message in a multi-hop network there are three different ways:
• Flooding: the message is sent to all the processes of the network;
• Routing: the message is sent along a spanning tree, so we have a route to reach a
destination process;
• Gossip: a probabilistic approach;
Since some byzantine processes may block or alter messages in the multi-hop network, we need to
implement a byzantine reliable broadcast: for a Complete Communication Network we have validity,
no duplication, integrity and totality, while for a Multi-Hop Network we need a correct source,
safety (integrity) and liveness (validity).
We can use the following theorem: since we know there are at most f byzantine processes in the
system, if a message arrives through f + 1 disjoint paths it can be safely accepted.
• Propagation Algorithm:
– The source process s sends the message msg := ⟨s, m, ∅⟩ to all of its neighbors;
– A correct process p saves and relays the message received from a neighbor q to all the
neighbors not included in traversed_processes, appending to it the id of q, so: msg :=
⟨s, m, traversed_processes ∪ {q}⟩;
• Verification Algorithm:
– If a process receives copies of msg carrying the same m and s among whose traversed_processes
it is possible to identify f + 1 disjoint paths (from s to the current node), then m
is delivered by the process;
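A brute-force Python sketch of the verification step; it also shows why delivery is expensive,
since finding f + 1 pairwise-disjoint traversed_processes sets among the received copies is a
combinatorial search:

from itertools import combinations

def can_deliver(copies, f):
    # copies: for a fixed (s, m), the traversed_processes set of every copy
    # received so far (the empty set is the copy that came directly from s).
    # Deliver m iff f + 1 of these sets are pairwise disjoint, i.e. the
    # copies traveled along f + 1 node-disjoint paths from the source.
    for group in combinations(copies, f + 1):
        if all(a.isdisjoint(b) for a, b in combinations(group, 2)):
            return True
    return False

# With f = 1, two disjoint paths suffice: {1, 2} and {3} are disjoint.
print(can_deliver([{1, 2}, {3}, {1, 4}], f=1))  # True
print(can_deliver([{1, 2}, {1, 3}], f=1))       # False: every path crosses 1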
Since all the messages generated by a byzantine process are labeled with its ID, a byzantine process
cannot generate traversed_processes sets with minimum cut greater than f, so the Dolev algorithm
enforces safety. Liveness, instead, depends on the network topology.
Byzantine reliable broadcast can be achieved in a static network G composed of n processes,
f of which byzantine, if and only if the vertex connectivity of G is at least 2f + 1 (the vertex
connectivity of a graph is the minimum number of nodes whose deletion disconnects it).
So it is not practically employable, and there is a further problem: a byzantine process can start
flooding the network with lots of messages, causing a denial of service and hence breaking liveness.
A possible solution is to restrict the capability of every process with a Bounded Channel
Capacity, where every process can send only a bounded number of messages in a time window,
which decreases the input size of the NP-complete verification problem.
The Dolev algorithm relays any message to any process not already included in traversed_processes,
so we can optimize it:
• Optimization 1:
– If a process p has delivered a message msg, then p can relay msg with an empty
traversed_processes without affecting the safety property;
• Optimization 2:
– A process p does not have to relay messages carrying m to processes qi that have already
delivered m;
If the network is routed, i.e. with fixed routes between every pair of processes, the source broadcasts
a message along 2f + 1 disjoint routes, and any other process relays a message only if it respects
the planned routes. So we need at least a (2f + 1)-vertex-connected network, and we obtain a
Quadratic Message Complexity (every edge is traversed once) and a Linear Delivery Complexity
(counting the copies of a message).
If instead we have Digital Signatures, the source digitally signs the message to broadcast,
and any correct process just relays the received messages; in this case an (f + 1)-vertex-connected
network is enough.
In the CPA algorithm, every process relays a message only if it has been delivered, and since at
most f faulty processes are present in the neighborhood, the algorithm enforces safety; liveness,
again, depends on the network topology. The Message Complexity is quadratic (since every edge
is traversed once) and the Delivery Complexity is linear (counting the copies of a message).
17.2.2 MKLO
An MKLO is a partition of the nodes into levels in which the source is placed in L0 and the
neighbors of the source in L1, while any other node is placed in the first level such that it has at
least k neighbors in the previous levels. If an MKLO exists, the liveness of the CPA algorithm is
guaranteed. The correctness conditions are:
• The neighbors of the source wait until they receive m from s, then they deliver m and multicast
⟨s, m, ∅⟩;
• When ⟨s, m, S⟩ is received from a neighbor q with q ∉ S and |S| ≤ Z − 3, then Rec(q) :=
⟨s, m, S⟩ and multicast ⟨s, m, S ∪ {q}⟩;
• When ∃ m, p, q, S such that q ≠ p, q ∉ S, Rec(q) = ⟨s, m, ∅⟩ and Rec(p) = ⟨s, m, S⟩, then
deliver m, multicast ⟨s, m, ∅⟩ and stop;
Every process keeps a copy of the last message received from every neighbor, so linear memory
is required on every process; this tolerates at most 1 byzantine process arbitrarily placed, or
relies on the spatial condition D > 4.
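A sketch of how the level partition could be computed, assuming full knowledge of a static topology
(the graph is an adjacency dictionary; the function name is illustrative):

def mklo_levels(graph, source, k):
    # graph: {node: set of neighbors}. Returns {node: level} if an MKLO
    # exists (so CPA liveness is guaranteed), or None if some node can
    # never accumulate k neighbors in strictly lower levels.
    level = {source: 0}
    for q in graph[source]:
        level[q] = 1  # the neighbors of the source form L1
    current, changed = 1, True
    while changed:
        changed, current = False, current + 1
        for node in graph:
            if node in level:
                continue
            below = sum(1 for q in graph[node]
                        if q in level and level[q] < current)
            if below >= k:  # first level with at least k placed neighbors
                level[node] = current
                changed = True
    return level if len(level) == len(graph) else None

g = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
print(mklo_levels(g, source=0, k=2))  # {0: 0, 1: 1, 2: 1, 3: 2}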
18.1 TVG Model
The TVG model, or Time-Varying Graph, is a graph G := (V, E, ρ, ζ) where V is the set of
nodes, E the set of edges, ρ is the presence function (a true/false function that indicates whether
an edge is present at a given date) and ζ is the latency function (a function that, given an edge
and a date, returns the latency). G can also be described as a sequence of static graphs (snapshots).
The underlying graph is the graph on which all the snapshot graphs are based.
A sequence of distinct nodes (p_1, ..., p_n) is a Journey, or Dynamic Path, from p_1 to p_n if
there exists a sequence of dates (t_1, ..., t_n) such that ∀ i ∈ {1, ..., n − 1} we have:
• e_i = (p_i, p_{i+1}) ∈ E, so there is an edge connecting p_i to p_{i+1};
• ∀ t ∈ [t_i, t_i + ζ(e_i, t_i)], ρ(e_i, t) = 1, so p_i can send a message to p_{i+1} at date t_i;
• ζ(e_i, t_i) ≤ t_{i+1} − t_i, so the aforementioned message is received by date t_{i+1};
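A direct Python transcription of this definition (presence is checked at integer instants for
simplicity; rho and zeta stand for the presence and latency functions of the TVG):

def is_journey(nodes, dates, rho, zeta):
    # nodes = (p_1, ..., p_n), dates = (t_1, ..., t_n).
    for i in range(len(nodes) - 1):
        e, t = (nodes[i], nodes[i + 1]), dates[i]
        lat = zeta(e, t)
        # the edge must be present during the whole interval [t_i, t_i + zeta]
        if not all(rho(e, t + dt) for dt in range(int(lat) + 1)):
            return False
        # the message must be received by date t_{i+1}
        if t + lat > dates[i + 1]:
            return False
    return True

rho = lambda e, t: True   # every edge always present
zeta = lambda e, t: 1     # one time unit of latency per hop
print(is_journey(["a", "b", "c"], [0, 1, 2], rho, zeta))  # True
print(is_journey(["a", "b", "c"], [0, 0, 0], rho, zeta))  # False: too early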
The Broadcast latency on 1-interval connected networks is O(n): ∀t at least one not-yet-informed
node is connected to an informed node, so the complexity is the same as in the static multi-hop
network. The Byzantine Reliable Broadcast specification is the same as in the multi-hop case.
As for the multi-hop network, we can have two different failure models: globally bounded and
locally bounded.
Unlike the multi-hop network, in which Vertex Cut = Disjoint Paths, in a dynamic network
Vertex Cut ≥ Disjoint Paths. So the idea is to extend the Dolev algorithm: we use the same
propagation algorithm, but we check for a dynamic min-cut of size f + 1, and every message is
retransmitted every time a process detects a network change in its neighborhood.
Correctness is guaranteed by the fact that byzantine processes cannot generate traversed_processes
sets with minimum cut greater than f. For liveness, instead, byzantine reliable communication
from a process p to a process q is achievable if and only if the dynamic minimum cut between
p and q is at least 2f + 1. The complexity is the same as Dolev's: exponential message complexity
and NP-complete delivery complexity.
For static distributed systems the vertex connectivity of a graph can be verified in polynomial time
through a max-flow algorithm, while in dynamic distributed systems the computation of a dynamic
min-cut is an NP-complete problem.
We can use the CPA algorithm on Dynamic Distributed Systems without further changes.
The safety property is still guaranteed: every process relays a message only if it has been delivered,
and at most f faulty processes are present in the neighborhood. The liveness property in static
networks requires the existence of an MKLO partition to be guaranteed. In dynamic networks
an MKLO is not enough: an edge may disappear while transmitting a message, and the order
of appearance of edges matters, so we have to use the TMKLO, the Temporal Minimum K-Level
Ordering:
RCD + MKLO
where RCD denotes the edge appearances that allow a message transmitted over a channel to be
reliably delivered. The necessary condition with the TMKLO is k = f + 1, while the sufficient
condition is k = 2f + 1, and the TMKLO can be computed only with full TVG knowledge.
19 Blockchain
A Blockchain is a decentralized, distributed and public digital ledger that is used to record trans-
actions across many computers, so that any recorded transaction cannot be altered retroactively
without altering all subsequent blocks. So, it's a decentralized, fully replicated database on a
trust-less P2P network containing a history of transactions, and it's public, immutable and
non-repudiable. Blockchain is used for example by bitcoin, in which the network ensures the
validity of transactions without a trusted centralized authority. The bitcoin Blockchain is composed
of the bitcoin, a virtual currency, and the bitcoin public ledger, the list of all transactions ever made.
When a new transaction is made, the sender updates his ledger and then broadcasts the message,
so all the nodes update their bitcoin public ledger and verify the transaction with the use of digital
signatures (every node has a private key).
All the transactions are grouped in blocks, and the blocks are chained to each other: each block
holds a reference to the hash of the previous one, and all the transactions in the same block are
considered made at the same time. Transactions not yet in a block are called unconfirmed
transactions, and each node can pick a set of unconfirmed transactions in order to build a new
block and propose it. It's important to note that attaching a new block requires Consensus.
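A toy Python sketch of the hash chaining just described (this is not Bitcoin's real block format;
the difficulty, field names and JSON encoding are illustrative):

import hashlib, json, time

def block_hash(block):
    # Hash of the block's canonical JSON encoding.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def make_block(prev_hash, transactions, difficulty=4):
    # Reference the previous block's hash, then brute-force a nonce until
    # the block's hash starts with `difficulty` zeros (the proof-of-work).
    block = {"prev": prev_hash, "txs": transactions,
             "time": time.time(), "nonce": 0}
    while not block_hash(block).startswith("0" * difficulty):
        block["nonce"] += 1
    return block

genesis = make_block("0" * 64, ["coinbase -> alice"])
second = make_block(block_hash(genesis), ["alice -> bob: 5"])
# Tampering with `genesis` changes its hash, so `second["prev"]` no longer
# matches and every later block would have to be re-mined.
assert second["prev"] == block_hash(genesis)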
We have two important metrics:
• Block Interval: the time to wait for a block to propagate to all the nodes;
• Throughput: how many transactions per second;
Miners are incentivized because solving a block gives coins to the node that found the proof-of-work,
and some transactions can carry an additional fee given to the node that mines the block containing
them. Rewards are an incentive for nodes to keep supporting the blockchain and to keep nodes
honest, and they are a way to distribute coins into circulation.
It is practically impossible for an attacker to change a transaction in a specific block of the
blockchain, because the attacker would have to be quicker than the rest of the whole network: he
would have to re-mine n + 1 blocks (the specific block plus all blocks after it) faster than the rest
of the network. Only then would the attacker obtain the longest, modified blockchain, to which the
whole network would converge, but this requires more than 50% of the computational power of the
network. Since the last blocks are less secure (a competing branch can still become the longest),
a recipient should wait for 5 or 6 blocks before considering a transaction final; this makes the
success probability of an attack very low (see the estimate below), so this solution protects
Integrity and prevents the Double Spending Fraud.
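The waiting rule can be made quantitative with the gambler's-ruin estimate from the original
bitcoin paper: an attacker controlling a fraction q < 1/2 of the computational power, currently
z blocks behind, ever catches up with probability (q/(1 − q))^z (a simplified bound, ignoring the
paper's Poisson correction):

def catch_up_probability(q, z):
    # q: attacker's share of the computational power; z: blocks behind.
    p = 1 - q
    return 1.0 if q >= p else (q / p) ** z

# Waiting z = 6 blocks against a 30% attacker:
print(catch_up_probability(0.3, 6))  # ~0.006, so the attack is very unlikely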
There is an alternative to proof-of-work called proof-of-stake: while proof-of-work is very secure
but wastes a lot of electric power, Proof-of-Stake is secure without mining, so no energy is
wasted. Instead of mining a block, the creator of the next block is chosen in a deterministic way
according to his wealth, and the reward is not related to the created block but to your wallet:
the longer you keep the coins in the wallet, the higher the reward. The probability of minting
(instead of mining) is proportional to your wallet, so attacking the network would require a lot
of coins, and it's very hard to mint two consecutive blocks.