
Distributed System Notes

Giuliano Abruzzo
November 26, 2019

Contents

1 Modelling Distributed System
2 Links
  2.1 Fair-Loss P2P link
  2.2 Stubborn P2P link
  2.3 Perfect P2P link
3 Physical Time
  3.1 Synchronization Algorithms
    3.1.1 Cristian's Algorithm
    3.1.2 Berkeley's Algorithm
    3.1.3 Network Time Protocol
4 Logical Time
  4.1 Logical clock
    4.1.1 Scalar Logical Clock
    4.1.2 Vector Logical Clock
  4.2 Logical Time and Distributed Algorithms
    4.2.1 Lamport's algorithm
    4.2.2 Ricart-Agrawala's algorithm
5 Failure Detection & Leader Election
  5.1 Failure Detector Abstraction
    5.1.1 Perfect Failure Detector
    5.1.2 Eventually Perfect Failure Detector
  5.2 Leader Election
    5.2.1 Perfect Leader Election
    5.2.2 Eventual Leader Election
    5.2.3 Leader Election with Fair Lossy Links
6 Broadcast Communication
  6.1 Best Effort Broadcast
  6.2 Reliable Broadcast
    6.2.1 Reliable Broadcast, Synchronous system
    6.2.2 Reliable Broadcast, Asynchronous system
  6.3 Uniform Reliable Broadcast
    6.3.1 Uniform Reliable Broadcast, Synchronous system
    6.3.2 Uniform Reliable Broadcast, Asynchronous system
  6.4 Probabilistic Broadcast
    6.4.1 Eager Probabilistic Broadcast
7 Consensus
  7.1 Regular Consensus
  7.2 Uniform Consensus
8 Paxos
9 Ordered Communications
  9.1 FIFO broadcast
  9.2 Causal Order Broadcast
    9.2.1 Waiting Causal Broadcast
    9.2.2 Non-Waiting Causal Broadcast
10 Total Order Broadcast
  10.1 Total Order Algorithm
    10.1.1 UC and URB
    10.1.2 UC and NURB
    10.1.3 NUC and URB
    10.1.4 NUC and NURB
11 Distributed Registers
  11.0.1 Regular Register
  11.0.2 Atomic Register
  11.1 Regular Register Interface
    11.1.1 Read-One-Write-All Algorithm
    11.1.2 Fail-Silent Algorithm
  11.2 Atomic Register Interface
    11.2.1 Regular Register (1,N) to Atomic Register (1,1)
    11.2.2 Atomic Register (1,1) to Atomic Register (1,N)
    11.2.3 Read-Impose Write-All Algorithm
    11.2.4 Read-Impose Write-Majority Algorithm
12 Software Replication
  12.1 Primary Backup
    12.1.1 No-Crash scenario
    12.1.2 Crash scenario
  12.2 Active Replication
13 CAP Theorem
14 Byzantine Tolerant Broadcast
  14.1 Byzantine Consistent Broadcast
  14.2 Byzantine Reliable Broadcast
15 Byzantine Tolerant Consensus
  15.1 Byzantine Generals Problem
    15.1.1 Byzantine Generals Problem with Authentication Codes
16 Byzantine Tolerant Registers
  16.1 Safe Register
    16.1.1 Byzantine Tolerant Safe Register
  16.2 Regular Register
    16.2.1 Regular Register with cryptography
    16.2.2 Regular Register without cryptography
17 Byzantine Tolerant Broadcast, Multi-Hop Networks
  17.1 Globally Bounded Failure Model, Multi-Hop network
    17.1.1 Dolev's Algorithm
  17.2 Locally Bounded Failure Model, Multi-Hop network
    17.2.1 CPA Algorithm
    17.2.2 MKLO
  17.3 Byzantine Reliable Broadcast in Planar Networks
18 Byzantine Tolerant Broadcast, Dynamic Networks
  18.1 TVG Model
  18.2 Globally Bounded Failure Model, Dynamic network
  18.3 Locally Bounded Failure Model, Dynamic network
19 Blockchain
1 Modelling Distributed System
A distributed system is a set of entities (computers, machines) communicating, coordinating and sharing resources in order to reach a common goal, while appearing as a single computing system. To explain several situations in distributed systems we will use distributed abstractions, because they capture common properties of a large range of systems and they prevent reinventing the same solution for variants of the same problem. We will use the Composition Model, where there are several elements represented as:

• Request: events used by a component to request a service from another component;

• Confirmation: events used by a component to confirm the completion of a request;

• Indication: events used by a component to deliver information to another component;

We will use these three event types in the module descriptions of the specifications. For a simple synchronous and asynchronous Job Handler we have the following pseudocode:
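The original figure with the pseudocode is not reproduced here; as a purely illustrative substitute (all names are invented for the example), a minimal Python sketch of the two variants could look as follows: the synchronous handler confirms a job only after processing it, while the asynchronous one buffers the job, confirms immediately, and processes it later.

class JobHandler:
    def confirm(self, job):          # Confirmation event to the upper layer
        print("job confirmed:", job)

class SyncJobHandler(JobHandler):
    def submit(self, job):           # Request event
        job()                        # process the job right away
        self.confirm(job)            # confirmed only after processing

class AsyncJobHandler(JobHandler):
    def __init__(self):
        self.buffer = []

    def submit(self, job):           # Request event
        self.buffer.append(job)
        self.confirm(job)            # confirmed before being processed

    def on_timeout(self):            # jobs are processed later, in background
        while self.buffer:
            self.buffer.pop(0)()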

If we need to show a Job transformation and processing abstraction we have to use another model,
with two components, a Transformation Handler and the Job Handler, in pseudocode:

Only the operations corresponding to the arrows entering the Transformation Handler are implemented, because these are the operations the Transformation Handler has to handle. So, we write pseudocode that follows the given specification, where operations and properties are listed. The bottom and top variables are used along with the buffer size M to ensure that the limit of jobs that can be processed by the buffer is not exceeded. Now we implement two exercises:

Processes in a Distributed System often communicate through messages. We can represent a
distributed algorithm as a series of automata, one per process, which define how to react to a
message. The execution of a distributed algorithm is represented by a sequence of steps executed
by the processes:

Distributed Algorithms should have two fundamental properties:

• Safety: states that the algorithm should not behave in a wrong way;
  – a safety property is a property that can be violated at some time t and never be satisfied again after that time;
  – a safety property is a property such that, whenever it is violated in some execution E of an algorithm, there is a partial execution E′ of E such that the property will be violated in any extension of E′;
• Liveness: ensures that eventually something good happens;
  – a liveness property is a property of a distributed system execution such that, for any time t, there is some hope that the property can be satisfied at some time t′ ≥ t.

There are three different timing assumptions:

• Synchronous: there is a known upper bound on the time taken for processing, communication, and the drift of a local clock with respect to real time;

• Asynchronous: there are no timing assumptions on processes and links, but we can use logical clocks to measure time with respect to communication;

• Partially synchronous: there is an unknown time t after which the system behaves as a synchronous system, for a period long enough to terminate the distributed algorithm;

2 Links
Links are used to model the network component of a distributed system; in fact, they connect pairs of processes. We have three different types of link:
• Fair-loss links;
• Stubborn links;
• Perfect links;
The two linked processes can crash, the time taken to execute an operation is bounded, and messages can be lost and can take an indefinite time to reach their destination. The generic link interface is:

A message is typically received at a given port of the network and stored in some buffer; then some algorithm is executed to satisfy the properties of the link abstraction, before the message is actually delivered. Remember: Deliver is different from receive.

2.1 Fair-Loss P2P link


The Fair-Loss point-to-point link specification has two operations, Send and Deliver, and three properties:
• Fair-loss: if a correct process p infinitely often sends a message m to a correct process q, then q delivers m an infinite number of times;

• Finite duplication: if a correct process p sends a message m a finite number of times to process q, then m cannot be delivered an infinite number of times by q;

• No creation: if some process q delivers a message m with sender p, then m was previously sent to q by p;

The sender must take care of the retransmissions if it wants to be sure that m is delivered at its destination; there is no point at which the sender knows it can safely stop retransmitting a message, and each message may be delivered more than once.

2.2 Stubborn P2P link


The Stubborn point-to-point link has two operations, Send and Deliver, and two properties:
• Stubborn delivery: if a correct process p sends a message m once to a correct process q, then q delivers m an infinite number of times;

• No creation: same as the fair-loss P2P link;

The implementation of the Stubborn P2P link is:
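Since the pseudocode figure is missing here, the following is a minimal Python sketch of the standard construction on top of a fair-loss link: every message ever sent is retransmitted forever on a timer (class and method names are illustrative, and deliver is an assumed upper-layer callback).

class StubbornLink:
    def __init__(self, fll, deliver):
        self.fll = fll           # underlying fair-loss link
        self.deliver = deliver   # upper-layer deliver callback
        self.sent = set()        # every (dest, msg) ever sent

    def send(self, dest, msg):
        self.fll.send(dest, msg)
        self.sent.add((dest, msg))

    def on_timeout(self):
        # retransmit everything periodically: by the fair-loss property,
        # each message is then delivered infinitely often at correct processes
        for dest, msg in self.sent:
            self.fll.send(dest, msg)

    def on_fll_deliver(self, src, msg):
        self.deliver(src, msg)   # no filtering at this layer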

2.3 Perfect P2P link


The perfect link solves all the issues presented above; in fact, besides No Duplication and No Creation it has another property:
• Reliable delivery: if a correct process p sends a message m to a correct process q, then q eventually delivers m;

We can observe that No duplication is ensured by the last piece of pseudocode, thanks to the delivered variable. No creation is inherited, and Reliable delivery derives from the whole scheme, in particular from the underlying stubborn link, which keeps retransmitting every message.
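The corresponding pseudocode figure is also missing; here is a matching sketch of the construction just described, with the delivered variable filtering duplicates (illustrative names as before):

class PerfectLink:
    def __init__(self, sl, deliver):
        self.sl = sl                 # underlying stubborn link
        self.deliver = deliver       # upper-layer deliver callback
        self.delivered = set()       # messages already delivered

    def send(self, dest, msg):
        self.sl.send(dest, msg)

    def on_sl_deliver(self, src, msg):
        # the stubborn link delivers each message infinitely often;
        # pass it upward only the first time (No duplication)
        if (src, msg) not in self.delivered:
            self.delivered.add((src, msg))
            self.deliver(src, msg)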

3 Physical Time
In a distributed system processes run on different nodes interconnected by means of a network and cooperate to complete a computation. They communicate only through messages and, as ordering is required in such applications, time is a critical factor for distributed systems. Each process pi in a distributed system runs on a single mono-processor machine with no shared memory, and it has a state si, changed by the actions performed during the algorithm execution. Each process generates a sequence of events:

• Internal Event: an event that transforms the process state;

• External Event: a send/receive event;

• e_i^k denotes the k-th event of process pi;

We indicate with:
• →i the ordering relation between two events e and e′ of the same process pi;
• e →i e′ if and only if e happened before e′;
We call local history the sequence of events produced by a process, a partial local history a prefix of a local history, and global history the set containing every local history. Events can be time-stamped through physical clock values: within a single process it is always possible to order events, but in a distributed system, in presence of network delays, it is impossible to realize a common clock shared among every process. Anyway, it is possible to use timestamps in order to synchronize physical clocks through algorithms with a certain degree of approximation:

Ci(t) = α · Hi(t) + β

where Ci(t) represents the software clock and Hi(t) the hardware clock; α and β are two factors chosen to bring the result closer to real time and to guarantee monotonicity. This software clock is generally not completely accurate: it can differ from real time, and differently at each process, depending on the precision of the approximation. We have to keep the granularity (also called resolution), i.e. the interval of time between two increments of the software clock, smaller than the time difference between two consecutive notable events:

Tresolution < ∆T between two notable events

There are two parameters that affect physical clocks:

• Skew: the difference between two clocks at a given time: |Ci(t) − Cj(t)|;
• Drift rate: the gradual misalignment of once synchronized clocks, caused by the slight inaccuracies of the time-keeping mechanism; for example, ordinary quartz clocks deviate by about 1 second every 12 days;

UTC is the international standard for clock synchronization, and we can have two types of synchronization:
• External Synchronization:
  – processes synchronize their clock Ci with a UTC source S, in such a way that for each time interval: |S(t) − Ci(t)| < D, where D is a synchronization bound;
• Internal Synchronization:
  – all the processes synchronize their clocks pairwise with respect to D: |Ci(t) − Cj(t)| < D;

So, clocks that are internally synchronized are not necessarily externally synchronized; instead, clocks that are externally synchronized with bound D are also internally synchronized, with a bound of 2 · D.

A hardware clock is correct if its drift rate is within a bound p > 0:

1 − p ≤ dC/dT ≤ 1 + p

If we have a correct hardware clock H we can measure a time interval [t, t′]:

(1 − p)(t′ − t) ≤ H(t′) − H(t) ≤ (1 + p)(t′ − t)

Software clocks have to be monotone: if t′ > t, then C(t′) > C(t).

3.1 Synchronization Algorithms


3.1.1 Cristian's Algorithm
Cristian's algorithm is an external synchronization algorithm which uses a time server S that receives the signal from a UTC source. It also works in an asynchronous system, in a probabilistic way. It is based on the message round trip, or RTT, and RTTs must be small enough in order to obtain synchronization.

A process p asks the current time through a message mr and receives the time t in a message mt from S; p will then set its time to t + RTT/2, where RTT is the round-trip time experienced by p. It is important to note that the time server can crash or can be hacked.

The accuracy of this algorithm strongly depends on RTT, and we can have two cases:

• Case 1:
  – the real reply time is greater than the estimated time RTT/2; at worst it is equal to RTT − min;
  – ∆ = estimated − real = RTT/2 − (RTT − min) = −(RTT/2 − min);

• Case 2:
  – the real reply time is smaller than the estimated time RTT/2; at best it is equal to min;
  – ∆ = estimated − real = RTT/2 − min = +(RTT/2 − min);

So the accuracy of Cristian's Algorithm is ±(RTT/2 − min), where min is the minimum transmission delay.
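As a toy illustration of the client-side computation (not taken from the original notes; time_server_request is a hypothetical call that returns the server's time t):

import time

def cristian_sync(time_server_request):
    t0 = time.monotonic()
    t = time_server_request()   # server reads its UTC clock within [t0, t1]
    t1 = time.monotonic()
    rtt = t1 - t0
    return t + rtt / 2          # estimate, accurate within ±(RTT/2 − min)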

3.1.2 Berkeley’s Algorithm


Berkeley's Algorithm is an internal synchronization algorithm with a master/slave structure and is based on two steps: in the first step the master gathers the clock values of all processes and computes the differences; in the second step it computes the corrections. The accuracy of this protocol depends on the maximum round-trip time. If the master crashes, another one is elected, and the algorithm is tolerant to arbitrary behavior, like a slave that sends a wrong value, since the master uses a threshold.

The master process computes the differences ∆pi between the master clock and the clock of every process pi (even itself); then it computes the average, avg, of all the differences ∆pi without considering faulty processes (processes whose clock differs from the master's by more than a threshold), and at the end it computes the correction for each process (even faulty ones) with the formula: ADGpi = avg − ∆pi.

When a slave process receives the correction, it applies it to its local clock. If the correction is negative, the process doesn't decrement the value but slows down its clock, since decrementing can cause problems.
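A minimal sketch of the master-side computation just described (purely illustrative: clocks maps each process to its reported clock value, master_clock is the master's own value, and threshold filters out faulty clocks):

def berkeley_corrections(clocks, master_clock, threshold):
    # differences between the master clock and each reported clock
    deltas = {p: master_clock - c for p, c in clocks.items()}
    # ignore faulty processes whose clock is too far from the master's
    good = [d for d in deltas.values() if abs(d) <= threshold]
    avg = sum(good) / len(good)
    # every process (even a faulty one) receives a correction
    return {p: avg - d for p, d in deltas.items()}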

3.1.3 Network Time Protocol


The NTP, or Network Time Protocol, is a time service over the Internet that synchronizes clients with UTC. It's a standard for external clock synchronization of distributed systems, employs several security mechanisms, and is based on a remote reading procedure like Cristian's Algorithm. Furthermore, it adds basic mechanisms for clustering, filtering and evaluating data quality.

It works with a hierarchy:

• Primary servers: connected directly to UTC sources;

• Secondary servers: synchronized to primary servers;

• Synchronization subnet: lowest level servers in users' computers;

This hierarchy is reconfigurable in case of faults. There are three modes of synchronization:

• Multicast: the server periodically sends its actual time to its leaves;

• Procedure Call: the server replies to requests with its timestamp;

• Symmetrical: pairs of time servers synchronize using messages containing timing information;

It's important to note that physical synchronization doesn't work in an asynchronous system, as the absence of bounds makes the timeout logic useless: the time needed for an answer is unpredictable.

4 Logical Time
4.1 Logical clock
As we said, physical clocks are good if we have a precise estimation of delays, but this can be hard, and often we want to know in which order some events happened rather than the exact time of each of them. Since in a distributed system each node has its own clock, if clocks are not aligned it's not possible to order events generated by different processes, so we need a reliable way to order events: this is the logical clock. These clocks are based on the causal relations between events, and it's important to define two relations:
• →i is the ordering relation between events in a process pi;
• → is the happened-before relation between any pair of events;

We say that two events e and e′ are in happened-before relation if:
• ∃ pi | e →i e′;
• ∀ message m: send(m) → receive(m), where:
  – send(m) is the event of sending a message m;
  – receive(m) is the event of receipt of the same message m;
• ∃ e, e′, e″ | (e → e″) ∧ (e″ → e′);

where the last rule says that the happened-before relation is transitive. Using these rules we can define a causally ordered sequence of events, and if two events are not in happened-before relation they are concurrent (e ∥ e′). The logical clock is a software counting register that monotonically increases its value, and it is not related to the physical clock in any way. We denote by Li(e) the logical timestamp assigned by the logical clock of process pi to the event e. There is a property that says:

if e → e′ then L(e) < L(e′)

There are two main implementations of the logical clock:

4.1.1 Scalar Logical Clock


Each process pi initializes its logical clock Li = 0, and pi increases its logical clock by 1 when it generates an event (send or receive): Li = Li + 1.

When pi sends m:
• create an event send(m);
• increment Li;
• timestamp m with t = Li;

When pi receives m:
• update Li = max(t, Li);
• produce an event receive(m);
• increment Li;
0 0
As we said, if e → e′ then L(e) < L(e′), but the converse does not hold, so it is possible that two events with ordered timestamps are concurrent. Hence we cannot determine whether two events are in happened-before relation by analyzing scalar clocks only: for instance, we can have L(e31) < L(e21) while e21 ∥ e31.
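A compact Python sketch of the scalar clock rules above (illustrative; the message transport is abstracted away):

class ScalarClock:
    def __init__(self):
        self.L = 0

    def send_event(self):
        self.L += 1          # the send itself is an event
        return self.L        # timestamp t attached to the message

    def receive_event(self, t):
        self.L = max(t, self.L)
        self.L += 1          # the receive event gets the updated value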

4.1.2 Vector Logical Clock


Each process has an array of N integers, where N is the number of processes taken into consideration; this array is called Vector Clock, and each process maintains its own. The Vector Clock resolves the previous problem, because here we have:

e → e′ iff L(e) < L(e′)

Each process pi initializes its vector clock Vi with Vi[j] = 0 ∀ j = 1 ... N, and pi increases Vi[i] by 1 when it generates an event: Vi[i] = Vi[i] + 1;

When pi sends m:
• create an event send(m);
• increment Vi[i];
• timestamp m with t = Vi;

When pi receives m:
• update Vi[j] = max(t[j], Vi[j]) ∀ j = 1 ... N;
• produce an event receive(m);
• increment Vi[i];

The implementation is like before, with the difference that the updates of the vector are done index by index according to the event that generated the message. So Vi[i] represents the number of events produced by pi, while Vi[j] represents the number of events of pj that pi knows about. Two events are in happened-before relation iff V ≤ V′ ∧ V ≠ V′, i.e. it must be V < V′.

   
• event e11 = (1, 0, 0) < event e12 = (1, 1, 0), so we have: e11 → e12;

• event e11 = (1, 0, 0) ≮ event e13 = (0, 0, 1), so we have: e11 ∥ e13;

Differently from the Scalar Clock, the Vector Clock allows us to determine whether two events are concurrent or in happened-before relation.
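A matching sketch for the vector clock (illustrative), with a helper implementing the component-wise comparison used above:

class VectorClock:
    def __init__(self, i, n):
        self.i = i               # index of this process
        self.V = [0] * n

    def send_event(self):
        self.V[self.i] += 1
        return list(self.V)      # timestamp: a copy of the vector

    def receive_event(self, t):
        self.V = [max(a, b) for a, b in zip(self.V, t)]
        self.V[self.i] += 1

def happened_before(v, w):
    # v < w: every component <= and at least one strictly smaller
    return all(a <= b for a, b in zip(v, w)) and v != w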

4.2 Logical Time and Distributed Algorithms


We have seen two mechanisms to represent logical time, the scalar clock and the vector clock. We will now discuss how logical-time-based algorithms can be implemented in distributed systems. First, the mutual exclusion abstraction specification is:
• Mutual Exclusion: at every time t at most one process p is in the critical section;

• No-Deadlock: there always exists a process p able to enter the critical section;

• No-Starvation: every process p requesting the critical section eventually gets in;

4.2.1 Lamport’s algorithm


In Lamport's algorithm, when a process wants to enter the critical section it sends a request message to all the others; a counter (timestamp) is used for maintaining a history of the operations, and this counter is incremented for each event, i.e. whenever a message (even one not related to the mutual exclusion computation) is sent or received.

Each process pi uses a data structure composed of:

• Ci: a counter maintained by pi;

• Q: a queue maintained by pi where critical section requests are stored;

The algorithm rules for process pi are:
• Access to the CS:
  – pi sends a request message, attaching Ci, to all processes and adds its own request to Q;
• Request reception from pj:
  – pi puts pj's request (timestamp included) in its queue and sends back an ACK to pj;
• pi enters the CS iff:
  – pi has in its queue a request with timestamp t;
  – t is the smallest timestamp in the queue;
  – pi has already received an ACK with timestamp t′ > t from every other process;
• Release of the CS:
  – pi sends a release message to all the other processes and deletes its own request from the queue;
• Reception of a release message from pj:
  – pi deletes pj's request from the queue;

The safety proof can be explained as follows: suppose by contradiction that both processes pi and pj enter the critical section. This means that both have received an ACK from every other process and that each one's timestamp is the smallest in its queue:
• ti < tj < ACKi.ts;
• tj < ti < ACKj.ts;
which is a contradiction, since ti < tj and tj < ti cannot both hold.

Looking at the possible interleavings, we have three cases:
• pj's ACK arrives before pj's request: then pi can enter the CS without any problem;
• pj's ACK arrives after pj's request but before pi's ACK: then pi enters the CS without any problem and sends its ACK after executing the CS;
• both processes receive the ACKs while the two requests are in the queue, but mutual exclusion is guaranteed by the total order on the timestamps;
Fairness is satisfied because different requests are either in happened-before or in concurrent relation: in the first case everything is done respecting that order, while in the second case the CS accesses can happen in any order. In the worst case, this algorithm needs 3(N − 1) messages per CS execution.

4.2.2 Ricart-Agrawala’s algorithm


Each process has:
• Replies: a reply counter, initially 0;
• State ∈ {Requesting, CS, NCS};

• Q: a queue for pending requests, initially empty;

• Last_Req;
• Num;

And the algorithm is:

This algorithm lets processes enter the critical section based on the process number, or on the basis of a deterministic function that ensures a total order.
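The pseudocode figure is not reproduced here; the following is a hedged Python sketch of the standard Ricart-Agrawala logic (send is an assumed transport callback): a process replies to a request immediately unless it is in the CS, or requesting with an earlier (timestamp, id) pair, in which case it defers the reply until it releases the CS.

class RicartAgrawala:
    def __init__(self, pid, n, send):
        self.pid, self.n, self.send = pid, n, send
        self.state = "NCS"       # in {"Requesting", "CS", "NCS"}
        self.num = 0             # Lamport-style counter
        self.last_req = None     # (timestamp, pid) of my pending request
        self.replies = 0
        self.deferred = []       # requests to be answered on release

    def request_cs(self):
        self.num += 1
        self.state, self.replies = "Requesting", 0
        self.last_req = (self.num, self.pid)
        for q in range(self.n):
            if q != self.pid:
                self.send(q, ("REQ", self.pid, self.last_req))

    def on_request(self, q, ts):
        self.num = max(self.num, ts[0])
        mine_first = self.state == "CS" or (
            self.state == "Requesting" and self.last_req < ts)
        if mine_first:
            self.deferred.append(q)      # answered after leaving the CS
        else:
            self.send(q, ("REPLY", self.pid))

    def on_reply(self):
        self.replies += 1
        if self.replies == self.n - 1:
            self.state = "CS"            # all permissions collected

    def release_cs(self):
        self.state = "NCS"
        for q in self.deferred:
            self.send(q, ("REPLY", self.pid))
        self.deferred = []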

5 Failure Detection & Leader Election
A system is synchronous, asynchronous or partially synchronous depending on its timing assumptions: if they are explicit we talk about a synchronous system, otherwise the system is asynchronous. Partially synchronous systems are the ones that need abstract timing assumptions, and we have two choices:
• put the assumptions in the system model (including links and processes);
• create separate abstractions that encapsulate the timing assumptions;

5.1 Failure Detector Abstraction


A failure detector abstraction is a software module used to detect faulty processes; it encapsulates the timing assumptions of either a partially synchronous or a fully synchronous system. It has two properties:
• Accuracy: that represents the ability to avoid mistakes;
• Completeness: that represents the ability to detect all failures;

5.1.1 Perfect Failure Detector

To prove the correctness we have to show that both properties are satisfied. In this case they follow from the perfect point-to-point link: if a process crashes, it won't be able to send HEARTBEAT_REPLY messages any more, so if at the timeout some process hasn't replied to the requests, it means that it has crashed.

5.1.2 Eventually Perfect Failure Detector


In the eventually perfect failure detector there is an unknown time T after which crashes can be accurately detected. In the asynchronous period (the moments before T), the failure detector can make mistakes, assuming that correct processes have crashed, so the notion of detection becomes suspicion.

As we see from the specification, a ◇P can mistakenly suspect a process, but it is able to restore it as soon as it receives a reply, also increasing the timeout. This can happen when the chosen timeout is too short. If a process q crashes and stops sending replies, p doesn't change its judgment any more.
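A minimal sketch of the timeout-based implementation just described (illustrative; send is an assumed transport callback). Every period the process pings all others; whoever doesn't answer within the timeout is suspected, and a late reply from a suspected process both restores it and increases the timeout.

class EventuallyPerfectFD:
    def __init__(self, procs, timeout, send):
        self.procs, self.timeout, self.send = procs, timeout, send
        self.alive = set(procs)     # replies seen in the current period
        self.suspected = set()

    def on_timeout(self):
        for p in self.procs:
            if p not in self.alive and p not in self.suspected:
                self.suspected.add(p)        # suspect(p)
            elif p in self.alive and p in self.suspected:
                self.suspected.remove(p)     # restore(p): we were too hasty
                self.timeout += 1            # so enlarge the timeout
            self.send(p, "HEARTBEAT_REQUEST")
        self.alive = set()

    def on_heartbeat_reply(self, p):
        self.alive.add(p)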

5.2 Leader Election
5.2.1 Perfect Leader Election
Sometimes we may be more interested in knowing whether some process is alive than in monitoring failures. We can use a different oracle, one that reports a process that is alive, called the Leader Election module. The perfect leader election uses the perfect failure detector:

5.2.2 Eventual Leader Election


If the failure detector is not perfect we talk about eventual leader election:

This ensures that, eventually, correct processes will elect the same correct process as their leader. It doesn't guarantee, though, that the leader won't change for an arbitrary period of time, nor that many leaders won't be elected during the same period of time without having crashed. Once a unique leader is determined and doesn't change again, we say that the leader has stabilized.

5.2.3 Leader Election with Fair Lossy Links


If we have fair-loss links we need to adapt the whole scheme, using crash-recovery and timeouts:

In the last piece of code, in the fll deliver event, we update the epoch number of a process that has crashed and recovered again.

Exercise 1

Exercise 1.1
The answer is yes, because we can implement it on the process that has all the links of channel A. In that case, when the timeout expires, if a process didn't send a reply we know for sure that it has crashed.

Init:
  correct_i = {p1, p2, p3, p4}
  alive_i = ∅
  detected_i = ∅
  for each pj ∈ correct_i do:
    trigger send(HeartBeatRequest, i) to pj
  start(timer1)

upon event deliver(HeartBeatRequest, j) from pj
  trigger send(HeartBeatReply, i) to pj

upon event deliver(HeartBeatReply, j) from pj
  alive_i = alive_i ∪ {pj}

when timer1 = 0
  for each pj ∈ correct_i do:
    trigger send(ALIVE_LIST, alive_i, i) to pj
  start(timer2)

when timer2 = 0
  for each pj ∈ correct_i do:
    if pj ∉ alive_i ∧ pj ∉ detected_i
      detected_i = detected_i ∪ {pj}
      trigger crash(pj)
  alive_i = ∅
  for each pj ∈ correct_i do:
    trigger send(HeartBeatRequest, i) to pj
  start(timer1)

Exercise 1.2
We can't, because once the process that has all the channel-A links fails, it's not guaranteed that all the failures will be detected, since a fair-loss channel has a non-zero probability of losing any single message.

Exercise 1.3
We can't, because all the links are fair-loss, so we can only implement an eventually perfect failure detector.

Exercise 2

Uses:
  Oracle Oi
  Perfect P2P link

Init:
  leader_i = ⊥
  left_i = getLeft()
  right_i = getRight()

when left_i = null do:
  leader_i = pi
  trigger leader(pi)
  trigger send(NewLeader, leader_i) to right_i

upon event deliver(NewLeader, l) from left_i
  if leader_i ≠ l
    leader_i = l
    trigger leader(l)
    trigger send(NewLeader, leader_i) to right_i

upon event left_neighbour(pj)
  left_i = pj

upon event right_neighbour(pj)
  right_i = pj
  trigger send(NewLeader, leader_i) to right_i

6 Broadcast Communication
6.1 Best Effort Broadcast
We now focus on message broadcasting: a process sends a message to all the other ones. There are several types of broadcast; the first one is BEB, or Best Effort Broadcast, which ensures message delivery only if the sender doesn't crash. If the sender crashes, processes may disagree on whether or not to deliver the message.

6.2 Reliable Broadcast
The Reliable Broadcast instead is:

in which we have the same properties of BEB plus the Agreement property. We now look at two schemes that help in understanding BEB and RB:

6.2.1 Reliable Broadcast, Synchronous system
The Reliable Broadcast in a synchronous system is:

The first if in the upon event (beb, deliver) is used to check for duplicate messages (in case of delays in receiving messages), so it guarantees no duplication. Only the messages coming from a crashed process are retransmitted, which ensures the agreement property; the re-broadcast is done leaving the original crashed process as sender. In the best case we have 1 BEB message per RB message, which means there are no crashes in the system and no re-broadcast is needed; in the worst case we have n − 1 BEB messages per RB message, i.e. with n − 1 failures each RB message has to be re-broadcast n − 1 times.

6.2.2 Reliable Broadcast, Asynchronous system

If the failure detector is not perfect, in an asynchronous system we always have to retransmit the message (eager algorithm). In this case the best case coincides with the worst case, so we have n BEB messages per RB message:
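Since the pseudocode figure is missing, here is an illustrative sketch of the eager algorithm: every process re-broadcasts each message the first time it sees it, so agreement no longer depends on a failure detector (beb and deliver are assumed callbacks).

class EagerReliableBroadcast:
    def __init__(self, pid, beb, deliver):
        self.pid, self.beb, self.deliver = pid, beb, deliver
        self.delivered = set()

    def broadcast(self, msg):
        self.beb.broadcast((self.pid, msg))

    def on_beb_deliver(self, data):
        sender, msg = data
        if data not in self.delivered:
            self.delivered.add(data)
            self.deliver(sender, msg)    # RB-deliver to the upper layer
            self.beb.broadcast(data)     # eager relay, original sender kept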

6.3 Uniform Reliable Broadcast
There is also another type of Reliable Broadcast, called Uniform Reliable Broadcast or URB, where the only difference is that the Agreement property becomes uniform, which means that correct processes must deliver also the messages delivered by faulty processes, so the deliveries of the crashed processes are a subset of the deliveries of the correct processes:

6.3.1 Uniform Reliable Broadcast, Synchronous system

where ack is a matrix with a row for every message and a column for every process of the system. When process p sends a BEB message, we put the tuple (id_sender, id_message) in pending. When a process receives a message from the BEB, it inserts itself in the row of m in the ack matrix; if the tuple (id_sender, id_message) is not in pending, the tuple is inserted and the message is re-broadcast. candeliver is a boolean function that ensures that the set of correct processes is a subset of the processes from which the BEB deliver of m was received.

6.3.2 Uniform Reliable Broadcast, Asynchronous system


The algorithm is the same as the synchronous one, but we no longer have the perfect failure detector, so the candeliver function is modified: it returns true when the number of acks received is more than half of the total. The assumption is that the majority of the processes are correct.

6.4 Probabilistic Broadcast


In Probabilistic Broadcast the message is delivered, say, 99% of the times, so the broadcast is not fully reliable. Probabilistic broadcast can implement a tree structure in which the broadcast message is sent directly to the children; with such hierarchical communication the system loses some reliability, but in terms of speed it is more efficient.

6.4.1 Eager Probabilistic Broadcast
This broadcast is used when we work on huge distributed systems. If we have 100 nodes, we could need 100² or 100³ messages in the worst case for a single message delivery. With a system based on Gossip Dissemination, instead, a process sends a message to a set of random processes, the processes that receive it forward it to another set of random processes, and this happens for r rounds; we cover almost all the nodes with a cost much lower than before.

The picktargets function picks k random processes from the entire set of processes; the gossip function is used to send a message to a subset of processes, for a given number of rounds.
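A toy sketch of the eager gossip logic described above (illustrative; send and deliver are assumed callbacks, and duplicate filtering is omitted for brevity):

import random

class EagerProbabilisticBroadcast:
    def __init__(self, pid, procs, k, rounds, send, deliver):
        self.pid, self.procs = pid, procs
        self.k, self.rounds = k, rounds
        self.send, self.deliver = send, deliver

    def picktargets(self):
        others = [p for p in self.procs if p != self.pid]
        return random.sample(others, self.k)

    def gossip(self, msg, r):
        for target in self.picktargets():
            self.send(target, ("GOSSIP", msg, r))

    def broadcast(self, msg):
        self.gossip(msg, self.rounds)

    def on_gossip(self, msg, r):
        self.deliver(msg)
        if r > 1:
            self.gossip(msg, r - 1)      # keep spreading for r rounds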

7 Consensus
7.1 Regular Consensus
A group of processes must agree on a value proposed by one of them: they start with different opinions and then they converge toward a single one.
We don't deal with asynchronous systems, because no algorithm can guarantee to reach consensus even with a single crash failure. In the case of a synchronous system, we can implement Flooding Consensus, in which processes exchange their values and, when all processes have made their own proposals available, a value is chosen; in order to do this we need communication without failures.

We can see that receivedfrom is an array where in the i-th position we insert the processes from which we delivered in the i-th round, while proposals is an array of N entries where in the i-th position we put all the proposals received in that round (including our own). The propose event handles our proposal by adding it to the proposals array and then sending it to all the others via BEB. When we check the received sets, we are checking whether the set of proposals received in this round is equal to the one received in the last round (so only correct processes are left); if we haven't decided a value yet, the process decides it and sends it to all the others via BEB. The last event handles the situation in which the value was decided by another process: we check that the process that took the decision is alive, set the decided variable, and re-broadcast the decision with BEB.
• Correctness:
  – Validity and Integrity follow from the properties of the communication channels;
  – Termination is ensured because the algorithm terminates after at most N rounds;
  – Agreement is satisfied because the same deterministic function is applied to the same values by all correct processes;
• Performance:
  – Best case: one communication round, so 2 × N² messages;
  – Worst case: N² messages are exchanged in each of N rounds, so we have N³ messages;

7.2 Uniform Consensus
In Uniform Consensus we have the Uniform Agreement property, which means that faulty processes also agree on the decided value:

In this case we check only the proposals from the current round, so the algorithm is very similar to the previous one, but here the decision is based only on the current round. If we are not in the last round, we increment the round variable and reset the receivedfrom array, i.e. the set containing the processes from which we received a proposal.

• Correctness:
  – Validity and Integrity follow from the properties of the best-effort broadcast;
  – Termination is ensured because all correct processes reach round N and decide in that round;
    ∗ the strong completeness property of the failure detector implies that no correct process waits indefinitely for a message from a process that has crashed, as the crashed process is eventually removed from correct;
  – Uniform Agreement holds because all processes that reach round N have the same set of values in their proposalset variable;

• Performance:
  – We have N communication steps and O(N³) messages for all correct processes to decide;

8 Paxos
The Paxos algorithm was introduced in order to provide a viable solution to consensus in asynchronous systems: Safety is always guaranteed, while the algorithm makes progress (Liveness) only when the network works well for long enough (partial synchrony). We have two basic assumptions:
• agents can fail by stopping, they operate at arbitrary speed, and they may restart;
  – since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted;
• messages can take arbitrarily long to be delivered, can be duplicated or lost, but they aren't corrupted;
There are three actors in the Paxos protocol:
• Proposer: proposes a value;
• Acceptors: processes that commit on a final decided value;
• Learners: passively assist to the decision and obtain the final decided value;
A model with only one acceptor is the simplest one, but we have a problem if it crashes, so we must have multiple acceptors; in this case a value is accepted when a majority of them accepts it.

The problem is that each acceptor may receive a different set of proposals; a possible solution is:
• an acceptor may accept at most one value;
But in this case which value should the acceptor accept? A possible answer is:
• an acceptor must accept the first proposal it receives;
But then we can have a sort of deadlock in which the acceptors cannot reach a majority. We have to keep track of the different proposals by assigning each one a unique number; a value is then chosen when a proposal with that value has been accepted by a majority.

• If a proposal with value v is accepted, every higher-numbered proposal that is accepted by any acceptor has value v;
But what if a new proposal proposes a new different value that the acceptor must accept?
• If a proposal with value v is chosen, every higher-numbered proposal issued by any proposer has value v;
Now let's assume that a proposal m with value v has been accepted; we have to guarantee that any proposal n > m has value v. We can prove it by induction, assuming that every proposal with number in [m, n − 1] has value v. For m to be accepted there is a majority of acceptors that accepted it. Therefore the assumption that m has been accepted implies that every acceptor in some majority has accepted a proposal with number in [m, n − 1] with value v.
• For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either:
  – no acceptor in S has accepted any proposal numbered less than n; or
  – v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S;
This condition covers the situation in which a set of acceptors S accepts a proposal n with value v, and this can happen in two cases: in the first case, none of the previous proposals with id < n were accepted; in the second case, a proposal n′ < n was already accepted, but v is equal to the value v′ of n′.

To ensure this, we have to ask a proposer that wants to propose a value numbered n to learn the highest-numbered value with number less than n that has been or will be accepted by any acceptor in a majority. To learn about such proposals, the proposer simply asks the acceptors not to accept any proposal numbered less than n. The Paxos protocol has two main phases:
• Phase 1:
  – a proposer chooses a new proposal number n and sends a prepare request (PREPARE, n) to a majority of acceptors;
  – if an acceptor receives a prepare request, it responds with a promise not to accept any more proposals numbered less than n, and it reports the value v′ of the highest-numbered proposal n′ that it has accepted, if there is any, else ⊥:
    (ACK, n, n′, v′) if it exists;
    (ACK, n, ⊥, ⊥) if not;
  – if an acceptor receives a prepare request with a number n′ lower than the n of any prepare request it has already responded to, it sends out a (NACK, n′);
• Phase 2:
  – if the proposer receives ACKs from a majority of acceptors, then it can issue an accept request (ACCEPT, n, v), where n is the number that appeared in the prepare request and v is the value of the highest-numbered proposal among the responses, or the proposer's own proposal if none was reported;
  – if an acceptor receives an accept request, it accepts the proposal unless it has already responded to a prepare request with a number greater than n;
  – whenever an acceptor accepts a proposal it responds with (ACCEPT, n, v); a learner that receives (ACCEPT, n, v) from a majority of acceptors decides v and sends a (DECIDE, v) to all the other learners. All the learners that receive (DECIDE, v) decide v;
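A compact sketch of the acceptor's state machine implied by the two phases above (illustrative; message transport is abstracted away and replies are returned directly):

class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest prepare number promised
        self.accepted_n = None    # number of the last accepted proposal
        self.accepted_v = None    # and its value

    def on_prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n   # promise: nothing below n from now on
            return ("ACK", n, self.accepted_n, self.accepted_v)
        return ("NACK", n)

    def on_accept(self, n, v):
        if n >= self.promised_n:  # not superseded by a newer prepare
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return ("ACCEPT", n, v)
        return ("NACK", n)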

9 Ordered Communications
Here we need to define guarantees about the order of deliveries inside a group of processes. We have three different types of ordering:
• deliveries respect the FIFO ordering of the corresponding sends;
• deliveries respect the Causal ordering of the corresponding sends;
• deliveries respect a Total ordering;
The reliable broadcast we previously studied has no property on the ordering of message deliveries, and this can cause problems in some communications.

9.1 FIFO broadcast

In the upon event Deliver there is a while loop on next, used when we receive a message with sn (the identifier of the message, used to control the order) equal to the expected one; next is an array used to track how many messages have arrived from each process. In the while loop we empty the pending set of the messages whose sn is the next expected one, so that the FIFO property is respected.
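An illustrative sketch of this mechanism (rb and deliver are assumed callbacks; next[s] is the next sequence number expected from sender s):

from collections import defaultdict

class FifoBroadcast:
    def __init__(self, pid, rb, deliver):
        self.pid, self.rb, self.deliver = pid, rb, deliver
        self.lsn = 0                        # my local sequence number
        self.next = defaultdict(lambda: 1)  # expected sn per sender
        self.pending = {}                   # (sender, sn) -> message

    def broadcast(self, msg):
        self.lsn += 1
        self.rb.broadcast((self.pid, self.lsn, msg))

    def on_rb_deliver(self, data):
        sender, sn, msg = data
        self.pending[(sender, sn)] = msg
        # deliver in order for as long as the next expected sn is pending
        while (sender, self.next[sender]) in self.pending:
            m = self.pending.pop((sender, self.next[sender]))
            self.next[sender] += 1
            self.deliver(sender, m)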

9.2 Causal Order Broadcast

The Causal Order Broadcast ensures that messages are delivered respecting all cause-effect relations; it is an extension of the happened-before relation. A message m1 may cause a message m2, denoted m1 → m2, and this happens when:
• some process p broadcasts m1 before it broadcasts m2;

• some process p delivers m1 and subsequently broadcasts m2;

• there exists some message m′ such that m1 → m′ and m′ → m2;

It's important to note that Causal Broadcast = Reliable Broadcast + Causal Order and that Causal Order = FIFO Order + Local Order, where Local Order means that if a process delivers a message m before broadcasting a message m′, then no correct process delivers m′ if it has not already delivered m.

9.2.1 Waiting Causal Broadcast

V is the logical vector, a vector of dimension N, the number of processes. In the Broadcast event we copy the current logical vector and, in the position of the current process, we insert lsn before it is incremented. In the Deliver event we enter the while loop only if the set of pending messages contains messages whose logical vector W is lower than our logical vector, so that we can deliver them respecting the condition of the causal order. In the Deliver event we also increment the logical clock entry of the sending process, because by the RB property a message we are going to deliver was previously sent by a sender process, so we increment the entry of that process.

Safety: given two broadcast messages m and m′ such that broadcast(m) → broadcast(m′), each process has to deliver m before m′.
Liveness: eventually each message will be delivered; this is guaranteed by two assumptions:
• the number of broadcast events that precede a certain event is finite;

• channels are reliable;

9.2.2 Non-Waiting Causal Broadcast

The approach of this algorithm is continuous: each time a message is delivered, the process doesn't wait for the missing messages; a message is delivered as soon as the process is sure that the messages in the past of the received one have been delivered, and it is then added to the list of past messages. The past list is a list in which all the messages involved in deliver or broadcast actions are inserted (respecting the causal order). In the broadcast event we insert into the past list the message that will be broadcast. In the deliver event, instead, we check the whole past list of the received message and, for each message extracted from it, we check whether the current process has already delivered it; if not, it is delivered and inserted in the current past list. At the end, the current message, if it is not among our past messages, is delivered and inserted in the list.

10 Total Order Broadcast
A Total Order Broadcast is a reliable broadcast that orders all messages, even those from different senders and those that are not causally related. Total Order Broadcast = Reliable Broadcast + Total Order: from the reliable part, processes agree on the same set of messages they deliver; from the total order part, processes agree on the same sequence of messages. A message is delivered to all processes or to none and, if the message is delivered, every other message is ordered either before or after it.

It's important to note that Total Order is orthogonal to FIFO and Causal Order: respecting total order doesn't mean that FIFO and causal order are respected too. FIFO and causal order, instead, are related (if causal order is respected, FIFO order is respected too), while with total order alone we cannot make any assumption on causal or FIFO order.

In order to study this part, we consider a system model composed of a static set of processes with perfect communication channels, asynchronous and subject to crash faults, and we characterize the system in terms of its possible runs R. Total order specifications are usually composed of four properties:
• a Validity property guarantees that messages sent by correct processes will eventually be delivered, at least by correct processes;
• an Integrity property guarantees that no spurious or duplicate messages are delivered;
• an Agreement property ensures that processes deliver the same set of messages;
• an Order property constrains processes delivering the same messages to deliver them in the same order;
The total order specifications with crash failures and perfect channels are:
• NUV: if a correct process TOCASTs a message m, then some correct process will eventually deliver m;

• UI: for any message m, every process p delivers m at most once, and only if m was previously TOCAST by some process;
The Agreement property comes in two flavours:

• UNIFORM AGREEMENT (UA):
  – if a process (correct or not) TODelivers a message m, then all correct processes will eventually TODeliver m;

• NON-UNIFORM AGREEMENT (NUA):
  – if a correct process TODelivers a message m, then all correct processes will eventually TODeliver m;

So the constraint of Uniform Agreement is that correct processes always deliver the same set of messages and the set of messages delivered by a faulty process is a subset of the set delivered by the correct processes; in NUA, instead, the set of a faulty process can be completely different.

The ordering properties for the uniform case are:

• STRONG UNIFORM TOTAL ORDER (SUTO):
  – if some process TODelivers some message m before m′, then a process TODelivers m′ only after it has TODelivered m;

• WEAK UNIFORM TOTAL ORDER (WUTO):
  – if processes p and q both TODeliver messages m and m′, then p TODelivers m before m′ if and only if q TODelivers m before m′;

So SUTO says that processes share the same prefix of the sequence of delivered messages, and after an omission (a message not delivered by someone) a process has a disjoint set of delivered messages (like p3 in the example figure). In WUTO, instead, there is no prefix restriction: the only thing that matters is the relative order of the deliveries between processes.

We have the same properties for the non-uniform case:

• STRONG NON-UNIFORM TOTAL ORDER (SNUTO):
  – if some correct process TODelivers some message m before m′, then a correct process TODelivers m′ only after it has TODelivered m;

• WEAK NON-UNIFORM TOTAL ORDER (WNUTO):
  – if correct processes p and q both TODeliver messages m and m′, then p TODelivers m before m′ if and only if q TODelivers m before m′;

10.1 Total Order Algorithm

So, when the list of messages is not empty and the process is not waiting for any decision from the consensus, it sends its list of unordered messages. When it receives a decision from the consensus, it delivers all the messages in the decided order. It's important to note that the process checks that the consensus round and its own round are equal, in order to be sure that the decision refers to the current situation.

10.1.1 UC and URB


When we use Uniform Consensus and Uniform Reliable Broadcast we obtain a Total Order TO(UA, SUTO):
• due to URB all processes (even faulty ones) deliver the same set of messages, so we obtain UA;
• due to UC all processes (even faulty ones) decide the same list of messages, so messages are sorted by a deterministic rule and we will have the same order;

10.1.2 UC and NURB


When we use Uniform Consensus and Non-Uniform Reliable Broadcast we obtain a Total Order TO(NUA, SUTO):
• due to NURB all correct processes deliver the same set of messages, while faulty ones can deliver other messages, so we obtain NUA;
• due to UC all processes (even faulty ones) decide the same list of messages, so messages are sorted by a deterministic rule and we will have the same order;

10.1.3 NUC and URB


When we use Non-Uniform Consensus and Uniform Reliable Broadcast we obtain a Total Order TO(UA, WNUTO):
• due to URB all processes (even faulty ones) deliver the same set of messages, so we obtain UA;
• due to NUC all correct processes decide the same list of messages, so correct processes deliver messages in the same order, while a faulty process may deliver (before crashing) a different sequence of messages.

10.1.4 NUC and NURB


When we use Non-Uniform Consensus and Non-Uniform Reliable Broadcast we obtain a Total Order TO(NUA, WNUTO):
• due to NURB all correct processes deliver the same set of messages, while faulty ones can deliver other messages, so we obtain NUA;
• due to NUC all correct processes decide the same list of messages, so correct processes deliver messages in the same order, while a faulty process may deliver (before crashing) a different sequence of messages.

11 Distributed Registers

A register is a shared variable accessed by processes through read and write operations. This abstraction supports the design of distributed solutions by hiding the complexity of the message passing system and the distribution of the data.

We have two operations:

• Read operation: read() → v, returns the current value v of the register;
• Write operation: write(v), writes the value v in the register and returns true at the end of the operation;

There are three basic assumptions:

• a register stores only positive integers and it's initialized to 0;
• each written value is univocally identified;
• processes are sequential, so a process can invoke only one operation at a time;

The notation for a register is (X, Y), where X processes can write and Y processes can read; for example, (1, 1) is a register in which only one process can write and only one process can read (these processes are decided a priori).

Every operation is characterized by two events: Invocation and Return, and each of these events occurs at a single indivisible point in time. An operation is complete if both the invocation and the return events have occurred; it is failed if the process crashes before obtaining a return. Given two operations o and o′, we say that o precedes o′ if the response event of o precedes the invocation event of o′. If it is not possible to define a precedence relation between two operations, they are said to be concurrent.

The sequential specification consists of two properties:

• Liveness: each operation eventually terminates;

• Safety: each read operation returns the last value written;

11.0.1 Regular Register


A regular register is a (1, N) register in which the two following properties hold:

• Termination: if a correct process invokes an operation, then the operation eventually receives its confirmation;

• Validity: a read operation returns the last value written, or a value concurrently written.

It's important to note that in a regular register a process can read a value v and then a value v′ even if the writer has written v′ and then v, as long as the write and the read operations are concurrent; this is not allowed in an Atomic register:

11.0.2 Atomic Register
The Atomic Register is a regular register with an ordering property (which holds also between read operations of different processes):
• Ordering: if a read returns v2 after a read that precedes it has returned v1, then v1 cannot be written after v2;

Some examples:

11.1 Regular Register Interface


Let's move to the different implementations of the register, beginning with the regular register (1, N), which is built in this way:

11.1.1 Read-One-Write-All Algorithm
We will use the fail-stop algorithm Read-One-Write-All, in which processes can crash but crashes can be reliably detected by all the other processes through a perfect failure detector; it uses perfect point-to-point links and a best-effort broadcast (BEB). The idea is that each process stores a local copy of the register, where:

• Read-one: each read operation returns the value stored in the local copy of the register;
• Write-all: each write operation updates the value locally stored at each process the writer considers not to have crashed, and a write completes when the writer receives an ack from each such process;

The BEB instance is used to broadcast the new value to all the processes during the write operation, while the pl (perfect link) instance is used by the processes to send the ack back to the writer. The writeset is used by the writer to keep track of all the processes that confirmed the update, and when the set of correct processes is a subset of the processes that acked, the write operation completes and the writeset is reset to the empty set. So a write operation needs at most 2N messages, while a read needs 0 messages, since it is a local operation.

11.1.2 Fail-Silent Algorithm
The problem with this algorithm is that it doesn't ensure validity if the failure detector is not perfect: in that case the validity property is not respected. So we can use a different algorithm that doesn't need a failure detector, the fail-silent Majority Voting regular register. The idea is that each process locally stores a copy of the current value of the register, each written value is uniquely associated with a timestamp, and the writer and reader processes use a set of witness processes to track the last value written. We rely on quorums, where the intersection of any two sets of witness processes is not empty, and on majority voting, so each set is constituted by a majority of processes:

When a process needs to write, it broadcasts its value together with its (increased) timestamp. When a process delivers the write message, it checks whether the received timestamp is bigger than the current timestamp of its value; in this case it updates the value and sends an ACK back. When the writer receives more than N/2 ACKs (recall the assumption of a majority of correct processes) it triggers the WriteReturn. For a read operation instead, since we don't have a perfect failure detector, a process cannot be sure that its local value is still up to date, so it needs to consult the other processes in order to obtain a quorum (and thus a value). In the deliver event of the read the quorum is collected: the first check is that the r (identifier of the read) received matches the current rid; the answer is then inserted into the readlist, and when the number of processes in the list exceeds half of the total number of processes the ReadReturn is triggered with the value carrying the highest timestamp. Both write and read operations need at most 2N messages.
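
A sketch of the Majority Voting logic under the same assumptions (message plumbing abstracted behind the illustrative beb_broadcast/pl_send callbacks):

# Fail-silent Majority Voting sketch (1,N regular register),
# assuming a majority of correct processes.
class MajorityVoting:
    def __init__(self, self_id, n, beb_broadcast, pl_send):
        self.self_id, self.n = self_id, n
        self.beb_broadcast, self.pl_send = beb_broadcast, pl_send
        self.ts, self.value = 0, 0        # local (timestamp, value) copy
        self.wts = 0                      # writer-side timestamp
        self.acks = 0
        self.rid = 0                      # identifier of the current read
        self.readlist = {}

    def write(self, v):                   # writer only
        self.wts += 1
        self.acks = 0
        self.beb_broadcast(("WRITE", self.wts, v))

    def on_write(self, sender, wts, v):
        if wts > self.ts:                 # keep only newer values
            self.ts, self.value = wts, v
        self.pl_send(sender, ("ACK", wts))

    def on_ack(self, wts):
        if wts == self.wts:
            self.acks += 1
            if self.acks > self.n // 2:   # majority of ACKs reached
                return "WriteReturn"

    def read(self):
        self.rid += 1
        self.readlist = {}
        self.beb_broadcast(("READ", self.rid))

    def on_read(self, sender, r):
        self.pl_send(sender, ("VALUE", r, self.ts, self.value))

    def on_value(self, sender, r, ts, v):
        if r == self.rid:                 # answers to the current read only
            self.readlist[sender] = (ts, v)
            if len(self.readlist) > self.n // 2:
                _, val = max(self.readlist.values())  # highest timestamp wins
                self.readlist = {}
                return ("ReadReturn", val)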

11.2 Atomic Register Interface


The atomic register interface has the same properties as the regular one, plus the Ordering property. In order to pass from a regular register (1, N) to an atomic register (1, N) we distinguish two phases:
• We use a Regular Register (1, N ) to build an Atomic Register (1, 1);
• We use a set of Atomic Registers (1, 1) to build an Atomic Register (1, N );

11.2.1 Regular Register (1,N) to Atomic Register (1,1)

In order to respect the Ordering property there is a check on the received timestamp: the reader verifies that it is greater than the timestamp of the previously read value. Each write or read operation requires a write/read on the underlying regular register.
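
A sketch of this transformation, assuming the underlying regular register stores (timestamp, value) pairs initialized to (0, 0):

# (1,N) regular -> (1,1) atomic: the reader remembers the
# highest-timestamped pair it has returned and never goes back.
class AtomicFromRegular11:
    def __init__(self, regular):
        self.regular = regular        # underlying (1,N) regular register
        self.wts = 0                  # writer-side timestamp
        self.ts, self.value = 0, 0    # reader side: last pair returned

    def write(self, v):               # one write on the regular register
        self.wts += 1
        self.regular.write((self.wts, v))

    def read(self):                   # one read on the regular register
        ts, v = self.regular.read()
        if ts > self.ts:              # ordering check on the timestamp
            self.ts, self.value = ts, v
        return self.value             # never return an older value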

11.2.2 Atomic Register (1,1) to Atomic Register (1,N)

In the Write event we use the writing variable, which permits only one process to write at a time. In the WriteReturn event of the underlying atomic registers, when a number of ACKs equal to the number of processes has been received, the process checks whether the variable writing is true (i.e., it is the writer): in this case it triggers its own WriteReturn, otherwise it triggers its ReadReturn. In the ReadReturn event of the underlying atomic registers we add a tuple with the received value and its timestamp to the readlist; when the number of items in the list is N, we choose the value with the maximum associated timestamp (in order to respect the ordering property) and send the new value to all the other processes with a Write on the underlying atomic registers. So it's important to note that both read and write operations use a write operation on the N underlying atomic registers; when a WriteReturn event is received there are two cases: if the process is the writer, the write completes; if the process is not the writer, the read completes through the ReadReturn.

11.2.3 Read-Impose Write-All Algorithm


The atomic register also has a modified version of the Read-One Write-All algorithm of the regular register, called the Read-Impose Write-All algorithm. The idea is that the read operation writes: a read imposes on all correct processes to update their local copy of the register with the value read, unless they store a more recent value:

In this algorithm we use best-effort broadcast, a perfect failure detector and a PP2P. When a process wants to read a variable from the register, it broadcasts to all the other processes a write operation carrying its own local value with its timestamp. When a process receives a beb deliver with a write, it checks whether the received timestamp is bigger than its own timestamp for that variable; in this case it updates the tuple and sends back an ACK to confirm the delivery of the broadcast message. When the set of correct processes is a subset of the writeset (the set of processes that have sent an ACK), the process triggers the ReadReturn if it is a reader (variable reading = true), or the WriteReturn if it is a writer. Since for any read operation the reader process ensures that every other process has a timestamp greater or equal, the ordering property is ensured. Both write and read operations need at most 2N messages.
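
A sketch of the read path only, written as methods of a process object analogous to the previous sketches (the write path is the same as in Read-One-Write-All; all names are illustrative):

def read(self):
    self.reading = True
    self.writeset = set()
    self.readval = self.value
    # impose the value we are about to return on every correct process
    self.beb_broadcast(("WRITE", self.ts, self.value))

def on_write(self, sender, ts, v):
    if ts > self.ts:                      # keep only more recent values
        self.ts, self.value = ts, v
    self.pl_send(sender, ("ACK", self.self_id))

def on_ack(self, correct, pid):
    self.writeset.add(pid)
    if correct <= self.writeset:          # all correct processes answered
        if self.reading:
            self.reading = False
            return ("ReadReturn", self.readval)
        return "WriteReturn"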

11.2.4 Read-Impose Write-Majority algorithm


The Read-Impose Write-Majority algorithm is a variation of the Majority Voting algorithm of the regular register, in which we don't have any failure detector and we assume a majority of correct processes. The idea is to impose on a majority of processes the value read:

Since in this algorithm we don’t use any failure detector, in addiction to a write timestamp
we need a read timestamp that will be incremented in both operation of read and write. In the

51
Read event we have a broadcast used to report to all the other processes that my process need
a quorum on this variable. When we receive at least N/2 responses the process will take only the
messages with r = rid (so the current read) and will take the value with highest timestamp and will
trigger a broadcast that imposes to all the other process to change its own variable to the current
one decided by the quorum. In the deliver of the write event instead to maintain the ordering
property will take the value with a timestamp bigger than the actual and will send an ACK. So,
like for the read event, if the number of ACK received are at least N/2 if the process is a reader
will trigger the ReadReturn else if the process is a writer it will trigger the W riteReturn. Since
the read imposes the write of the value read to a majority of processes and to the property of
intersection of quorum the ordering property is respected. For write operation we need at most
2N messages, instead for read operation we need at most 4N messages, cause the read does two
broadcast one for obtaining a quorum and the other for impose its value to all the other processes.

12 Software Replication
In distributed systems, software replication is used for fault-tolerance purposes, i.e., to guarantee the availability of a service (an object) despite failures. If we consider p the failure probability of an object O, the availability of O is 1 − p. If we replicate O on n nodes with independent failures, its availability becomes 1 − p^n, since the object is unavailable only when all n replicas have failed. So now the system model is composed of a set of processes that are connected with a PP2P, and they may crash.
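
As a quick sanity check (assuming independent failures): with p = 0.01 and n = 3 replicas, availability rises from 1 − p = 0.99 for a single copy to 1 − (0.01)^3 = 0.999999 for the replicated object.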

These processes interact with a set of objects X located at different sites and managed by processes:

• Each object has a state, that can be accessed through operations;


• An operation by a process pi on an object x ∈ X is a pair invocation/response:
– The operation invocation is: x op(arg) pi where arg is the argument of the operation;
– The operation response is: x ok(res) pi where res is the result of the operation;
– The pair invocation/response is: x op(arg) pi /x ok(res) pi
• After issuing an invocation a process is blocked until it receives the matching response;
In order to tolerate process crash failures, a logical object must have several physical replicas located at different sites; we assume that a process pi crashes when an object xi crashes. It's important that replication is transparent to the client processes: the client thinks it is interacting with a single server, even if in reality it is interacting with a correct copy of it.

There are three consistency criteria:


• Linearizability;

• Sequential consistency;
• Causal consistency;
The first two criteria compose Strong Consistency, while the third one is Weak Consistency.

If we call ≺ the precedence relation and ∥ the concurrency relation, we say that an execution E is linearizable if there exists a sequence S including all operations of E such that:
• for any operations O1 and O2 such that O1 ≺ O2, O1 appears before O2 in the sequence S;
• the sequence S is legal: for every object, the sub-sequence of S restricted to that object respects the sequential specification of the object;
There is a sufficient condition for linearizability: replicas must agree on the set of invocations they handle and on the order according to which they handle these invocations:
• Atomicity: Given an invocation x op(arg) pi , if one replica of the object x handles this
invocation, then every correct replica of x also handles that invocation;
• Ordering: Given two invocations x op(arg) pi and x op(arg) pj if two replicas handle both
the invocations, they handle them in the same order ;
There are two main techniques that implement linearizability: primary backup and active replication.

12.1 Primary Backup


Primary:
• Receives invocations from clients and sends back the answers;
• Given an object x, then prim(x) returns the primary of x;
Backup:
• Interacts with prim(x);
• Used to guarantee fault tolerance by replacing the primary when it crashes;

12.1.1 No-Crash scenario


Before sending back the response to the client, the primary replica sends an update to all the other correct backups, and only after it gets an ACK from them does it send the response to the client. Linearizability is guaranteed since the order in which prim(x) receives client invocations defines the order of the operations on the object.

12.1.2 Crash scenario


There are three different scenarios:
• Scenario 1: the primary fails after having sent the answer, and there are two cases:
– The client doesn't receive the response (it is lost because the primary crashed): the client re-transmits the request after a timeout; the new primary recognizes the request as already issued and sends back the result without updating the replicas;
– The client receives the answer, so everything is fine;
• Scenario 2: the primary fails before sending the update messages:
– The client doesn't get any answer and re-sends the request after a timeout, so the new primary handles the request as new;
• Scenario 3: the primary fails after sending the update messages but before receiving all the ACKs:
– In order to guarantee atomicity, the update must be received either by all or by no one; when a primary fails, a new one has to be elected among the replicas.

12.2 Active Replication


Here all the replicas have the same role, and each replica is deterministic (if they have the same state and the same input they produce the same output), so the client will receive the same response from all the replicas. The client doesn't need to wait for the response of all the replicas: it takes the response sent by the first correct replica it receives from. In this case, in order to ensure linearizability:
• Atomicity: if a replica executes an invocation, all correct replicas execute the same invocation;
• Ordering: no two correct replicas execute two invocations in different orders;
So we need a total order broadcast, even for the clients. Obviously, active replication doesn't need any recovery action upon the failure of a replica.

14 CAP Theorem
The CAP theorem (Consistency, Availability, Partition tolerance) states that in a distributed system, in case of failures, we can guarantee only two of these three properties at the same time. To see why this rule holds, imagine a situation in which we have two nodes connected to each other.

Data are replicated across the nodes, so they have the same dataset. In this situation C, A and P mean:
• Consistency: if the dataset on N1 is changed, then we also need to change the dataset on N2 so that it looks the same on both of them;
• Availability: as long as both N1 and N2 are up and running, I should be able to query/update data on any of them;

• Partition tolerance: if the link between N1 and N2 fails, I should still be able to query/update my dataset;
The best way to understand how the theorem works is to see what happens during a network
partition:

So N1 and N2 cannot communicate anymore; let's assume that we have some means to discover the partition. Now, if someone talks to N1 and changes the dataset, N1 cannot propagate these changes to N2:
• If we choose consistency, we have to block all the updates on the system (both nodes), but this makes it unavailable;
• If we choose availability, we allow different updates on the two nodes, and this erases consistency;
We can see that we have to sacrifice one of them; one decent solution is to reduce availability to one single node and then update the other when the link is available again.

So, we can choose between CP and AP, since choosing CA is obviously nonsense: under a partition such a system gives us no guarantee at all, so we can't work with it. It is also important to say that we can choose the C-level and the A-level, so we are not constrained into chasing one or the other. For example, we can be read-available on any node but not update-available, or be available on only one node and apply some post-partition recovery. Or we can choose eventual consistency, if our app is okay with using slightly old data on some nodes, to improve availability. Now let's write a formal proof of the CAP theorem, by contradiction:

Let’s assume that there exists a system that is consistent, available and partition tolerant. Then
we will partition the system like:

Next, we have to update the value on N1 to v1 and since the system is available we can do it.
Next we will read the value of N2 but it returns v0 cause the link is broken and N1 cannot pass v1
to N2 . This is an inconsistent scheme so we have a contradiction.

14 Byzantine Tolerant Broadcast
Byzantine processes are processes that may deviate arbitrarily from the instructions that an algorithm assigns to them, or act as if they were deliberately preventing the algorithm from reaching its goals. The basic step to fight them is to use some cryptographic mechanisms to implement the perfect-links abstraction, but this alone doesn't allow us to tolerate Byzantine processes.

14.1 Byzantine Consistent Broadcast

The consistency property is very important because it refers to one single broadcast event: every correct process delivers the same message. In the broadcast event we send a message to all the processes, and this message contains the id of the sender; this is used in the deliver event, where we check whether the message really comes from the sender, and in this case the process re-sends the message to all the other processes, with itself as sender, as an echo. When a process receives an echo message, it adds the message to the echo array in the position of the process that sent it. When the number of processes with the message m in the echo array is bigger than (N + f)/2, we can finally deliver the broadcast.
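
A sketch of this echo logic in Python for a single designated sender s; n, f and a send_all callback are assumptions:

# Echo phase of Byzantine consistent broadcast (sketch).
class ConsistentBroadcast:
    def __init__(self, self_id, s, n, f, send_all):
        self.self_id, self.s = self_id, s
        self.n, self.f = n, f
        self.send_all = send_all
        self.echoes = {}                  # process id -> echoed message
        self.delivered = False

    def broadcast(self, m):               # only the designated sender s
        self.send_all(("SEND", self.s, m))

    def on_send(self, sender, s, m):
        if sender == s == self.s:         # accept only messages really from s
            self.send_all(("ECHO", self.self_id, m))

    def on_echo(self, sender, m):
        self.echoes.setdefault(sender, m)  # one echo per process counts
        support = sum(1 for v in self.echoes.values() if v == m)
        if not self.delivered and support > (self.n + self.f) / 2:
            self.delivered = True          # Byzantine quorum of echoes
            return ("Deliver", self.s, m)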

We know that a normal quorum needs more than N/2 processes, but in this case, since we have to deal with f Byzantine processes, a quorum needs more than (N + f)/2 processes. The number of correct processes in such a quorum is then more than (N + f)/2 − f, which equals (N − f)/2. To be sure that any two Byzantine quorums share at least one correct process, consider the edge case of two quorums whose correct parts are disjoint: together they would contain more than N − f correct processes (the sum of the two quorums' correct parts), which is impossible, so the quorums must intersect in at least one correct process. Finally, N − f is the number of processes guaranteed to be correct, and for a quorum of correct processes to exist we need N − f > (N + f)/2, which gives N > 3f as the condition under which correctness is ensured.

14.2 Byzantine Reliable Broadcast

When the echo quorum of a receiving process is reached, the process sends a ready message to all the other processes with its id and the message received through the echoes. When a deliver event of a ready message arrives, the ready array is filled with the message in the position of the sender process. The ready message can also be sent as soon as the number of ready messages delivered is higher than f, because in that case at least one of them comes from a correct process. At the end, in order to deliver the broadcast message, we check that the number of processes from which we received the ready message is bigger than 2f; this is done so that the at most f Byzantine processes cannot, on their own, make us deliver a message.
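
A sketch of the extra ready phase, as methods to be added to the previous class, assuming a readys dictionary and a sent_ready flag initialized in the constructor:

def on_ready(self, sender, m):
    self.readys.setdefault(sender, m)
    support = sum(1 for v in self.readys.values() if v == m)
    if not self.sent_ready and support > self.f:
        # more than f readys: at least one comes from a correct process,
        # so it is safe to join and send our own ready (amplification)
        self.sent_ready = True
        self.send_all(("READY", self.self_id, m))
    if not self.delivered and support > 2 * self.f:
        self.delivered = True              # backed by > f correct processes
        return ("Deliver", self.s, m)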

15 Byzantine Tolerant Consensus


Since Byzantine processes may invent values or claim to have proposed different values, we need to adapt the validity property of consensus. So we restrict the specification only to correct processes and we define two different types of validity, weak and strong:

Weak Byzantine Consensus:


Strong Byzantine Consensus:

15.1 Byzantine Generals Problem


The Byzantine generals problem is a consensus problem for Byzantine processes in which we have a general (called commander) that communicates to the other generals (called lieutenants) whether to attack or to retreat; both the commander and the lieutenants can be traitors (Byzantine), and in order to win all loyal generals must attack or all must retreat. After the commander sends the order, the lieutenants can communicate among themselves. So we have two goals:
• All loyal generals decide upon the same plan of action;
• The traitors cannot cause the loyal generals to adopt a bad plan;
We can rephrase the goals as:
• Any two loyal generals use the same value of v(i), where v(i) is the information communicated by the i-th general;
• If the i-th general is loyal, then the value he sends must be used by every loyal general as the value of v(i);
So the properties are:
• All loyal lieutenants obey the same order;
• If the commander is loyal, then every loyal general obeys the order he sends;
As for the correctness result of the previous section, we need N ≥ 3f + 1 generals in order for this to work:

So in this system model we have reliable communication channels, the message source is known, message omissions can be detected, and the default decision for lieutenants is RETREAT. We will use a recursive algorithm and define a set of protocols OM(f) for which:
• OM(0):
– The commander sends his value to every lieutenant;
– Each lieutenant uses the value received, or RETREAT if he receives no value;
• OM(f) with f > 0:
– The commander sends his value to every lieutenant;
– For each i, let v(i) be the value lieutenant i received from the commander; lieutenant i acts as the commander in OM(f − 1) to send v(i) to the other N − 2 lieutenants (everybody minus himself and the commander);
– For each value received (not counting duplicates), and by considering also his own value, each lieutenant uses the majority of the values to choose his own result;
In fewer words, when the lieutenants receive the value from the commander, they start to re-send it among themselves. When a lieutenant receives a message from another lieutenant, he adds the value to an array at the position of the sender; at the end he chooses the majority of the contents of the array, and if it is not possible to choose a majority the lieutenant decides to retreat.
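
A compact recursive sketch of OM(f) in Python; the traitorous behavior is simulated by an arbitrary rule, so it is a toy model under stated assumptions, not Lamport's original formulation:

from collections import Counter

RETREAT = "RETREAT"

def send(general, value, receiver, byzantine):
    # A loyal general relays his value faithfully; a traitor may send
    # anything (here, arbitrarily, RETREAT to odd-numbered receivers).
    if general in byzantine:
        return RETREAT if receiver % 2 else value
    return value

def om(commander, value, lieutenants, f, byzantine):
    """Return the order each lieutenant decides on (OM(f) sketch)."""
    received = {p: send(commander, value, p, byzantine) for p in lieutenants}
    if f == 0:
        return received                   # OM(0): use the value received
    decisions = {}
    for i in lieutenants:
        votes = [received[i]]             # his own value from the commander
        for j in lieutenants:
            if j != i:
                # value about j obtained via OM(f-1), j acting as commander
                sub = om(j, received[j],
                         [p for p in lieutenants if p != j], f - 1, byzantine)
                votes.append(sub[i])
        top = Counter(votes).most_common()
        has_majority = len(top) == 1 or top[0][1] > top[1][1]
        decisions[i] = top[0][0] if has_majority else RETREAT
    return decisions

# Example: 3 loyal lieutenants tolerate 1 traitorous commander (N = 4 > 3f).
print(om(commander=0, value="ATTACK", lieutenants=[1, 2, 3], f=1, byzantine={0}))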

15.1.1 Byzantine Generals Problem with Authentication Codes


The problem is that, since this is a recursive algorithm, we have a large number of messages and high complexity, so a solution can be to use message authentication codes:
• The commander signs and sends his value (v : 0) to every lieutenant.
• For each i:
– If lieutenant i receives a message (v : 0) from the commander and has not yet received any order, then: Vi = {v} and he sends v : 0 : i to every other lieutenant;
– If lieutenant i receives a message v : 0 : j1 : ... : jk and v is not in Vi, then: he adds v to Vi and, if k < f, he sends v : 0 : j1 : ... : jk : i to every lieutenant other than j1, ..., jk;
• For each i: when lieutenant i receives no more messages, he obeys the order choice(Vi);

16 Byzantine Tolerant Registers


16.1 Safe Register
A safe register is a register with a validity property stating that a read operation that is not concurrent with any write returns the last value written:
16.1.1 Byzantine Tolerant Safe Register
We have N servers with 1 writer and n readers:

We have to ensure that once the writer returns from a write operation, any following read operation returns the last written value: the writer sends a request to the servers and waits for enough ACK messages to be sure that enough correct servers delivered it. The same holds for the read operation, which sends a read request and waits for enough reply messages to be able to read the newest value.

The quorum that works for Byzantine broadcast is not enough here, since the safe register has a stronger semantics: we require that a write operation is visible to all once it terminates, so we need a masking quorum, with N > 4f and quorum size (N + 2f)/2.

In the deliver event of the write, when a new write request arrives, the receiver checks that the sender is the writer process and that its local value is older than the new value sent; in this case the receiver updates it and then sends an ACK. When the writer receives an ACK it adds it to the acklist, and when the number of ACKs received is bigger than (N + 2f)/2 (so the quorum is reached) the WriteReturn can be triggered. When a process wants to read a value, it triggers a read request, and all processes send it the requested value. When the reader has received at least (N + 2f)/2 values, it selects the value that occurs more than f times and has the highest timestamp; in this way the read returns the most recent value written.
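
A sketch of the reader's selection rule, where the reply list and the masking-quorum threshold are assumptions:

# Pick the pair vouched for by more than f servers with the highest
# timestamp, out of a masking quorum of replies.
from collections import Counter

def select_value(replies, f):
    """replies: list of (timestamp, value) pairs received from servers."""
    counts = Counter(replies)
    # pairs reported by more than f servers cannot be Byzantine inventions
    candidates = [pair for pair, c in counts.items() if c > f]
    if not candidates:
        return None                       # no safe value identifiable
    return max(candidates)                # highest timestamp wins

# Example: N = 9, f = 2 -> quorum of at least (9 + 4) / 2 = 6.5, i.e. 7 replies.
replies = [(3, 42)] * 5 + [(2, 7)] + [(9, 666)]   # last pair is forged
print(select_value(replies, f=2))         # -> (3, 42)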

16.2 Regular Register


The regular register is an evolution of the Majority Voting algorithm; we will discuss two implementations, with cryptography and without it.

16.2.1 Regular Register with cryptography


In the cryptographic one, the writer signs the timestamp-value pair, so it sends the signed pair (ts, v) to all the processes; the reader verifies the signature on each pair received and ignores those with invalid signatures.

In the write event the sender signs the message and sends it to all the processes. Since we use cryptographic mechanisms, the quorum is bigger than (N + f)/2, so we have to assume N > 3f. When a process wants to read some value, it triggers a read request to all the processes, which respond with their signed value and its timestamp. The reader checks whether each signature is correct, and in this case adds the message to its readlist; when the readlist is bigger than the quorum it triggers the ReadReturn with the value carrying the highest timestamp in the readlist.

16.2.2 Regular Register without cryptography
It’s important to note that sometime just one phase for writing is not enough cause we can have
some cases in which a reader is not able to choose a value, so we need two phases:
• Pre-write: where the writer sends P REW RIT E messages with the current value-timestamp
pair, and then it waits until it receives P REACK messages from N − f processes;
• Write: where the writer sends W RIT E messages with the current value-timestamp pair, and
then it waits until it receives ACK messages from N − f processes;
So every process stores two timestamp/value pairs, in this way they can detect faulty process:

The write event is the same as usual, with a quorum of N − f and the two phases of pre-write and write. When a reader wants to read something it sends a request, and all the processes respond with a message containing two pairs: the value-timestamp pair of the pre-write and the one of the write. The reader checks whether these two pairs are equal, or whether the timestamp of the pre-write is bigger by one unit than the timestamp of the write, which covers the case in which a process received the pre-write but not yet the write. Then the reader checks whether a value is vouched for by more than f processes (so it cannot come only from Byzantine ones) and whether there exists a subset of the readlist reaching the quorum with the highest timestamps (the bigger of the two timestamps); if the quorum is not reached, the read request is re-sent.

So the termination property must be relaxed into finite-write termination: instead of requiring that every operation of a correct process eventually terminates, a read operation that is concurrent with infinitely many write operations may not terminate.

17 Byzantine Tolerant Broadcast, Multi-Hop Networks


The distributed system is an abstraction in which we have a set of spatially separated entities, each with a certain computational power, that are able to communicate and coordinate among themselves to reach a common goal. It's defined by a set of assumptions like:
• Process assumptions: all correct processes, crash failures, crash with recovery, Byzantine, etc.;
• Link assumptions: fair-loss, perfect link, etc.;
We also have different communication networks, or network topologies, showing how processes are interconnected with each other, i.e., defining the set of processes which can directly exchange messages. The perfect point-to-point link is a link with no duplication, no creation, and reliable delivery, which means that if a correct process p sends a message m to a correct process q then q eventually delivers m. In a multi-hop network, instead, a message may cross several hops to reach its destination, and communication and broadcast are based on the PP2P:

In order to transmit a message in a multi-hop network there are three different ways:
• Flooding: the message is sent to all the processes of the network;
• Routing: the message is sent along a spanning tree, so we have a route to reach a destination process;
• Gossip: which uses probabilistic dissemination;

Since we can have some Byzantine processes that can block or change messages in the multi-hop network, we need to implement a Byzantine reliable broadcast: for a complete communication network we have validity, no duplication, integrity and totality, while for a multi-hop network we need a correct source, safety (integrity) and liveness (validity).

We can distinguish two different types of failure models:

• Globally Bounded: which bounds the total number of Byzantine processes in the whole system;
• Locally Bounded: which bounds the number of Byzantine processes to which every correct process is directly connected;

17.1 Globally Bounded Failure Model, Multi-Hop network


So with Globally Bounded Failure Model we have:
• n processes;
• Not-complete communication network ;

• Processes can be correct or byzantine;


• We can have f faulty processes;
• Processes have no global knowledge (so they don’t know about each other);

• Authenticated Perfect Channels, in which we have reliable delivery, no duplication, no creation and authenticity;
A graph is called k-connected if and only if it contains k independent paths between any two vertexes. Menger's theorem (vertex cut = disjoint paths) says that the minimum number of vertexes separating two nodes p and q is equal to the maximum number of disjoint p–q paths in the graph; in other words, it is equal to the minimum number of nodes we need to delete from the graph to obtain a min-cut between them.

We can use this theorem: since we know there are at most f Byzantine processes in the system, if a message comes through f + 1 disjoint paths it can be safely accepted.

17.1.1 Dolev’s Algorithm


To this end we can use Dolev's algorithm, whose idea is to leverage the authenticated channels to collect the IDs of the processes traversed by a message, so the message format will be (source, content, traversed processes). This algorithm is divided in two parts:

• Propagation algorithm:
– The source process s sends the message msg := ⟨s, m, ∅⟩ to all of its neighbors;
– A correct process p saves and relays a message received from a neighbor q to all other neighbors not included in the traversed processes, appending to it the id of q: msg := ⟨s, m, traversed processes ∪ {q}⟩;

• Verification algorithm:
– If a process receives copies of msg carrying the same m and s in which it is possible to identify f + 1 disjoint paths (from s to the current node) among the traversed processes, then m is delivered by the process;

Since all the messages generated by a Byzantine process are labeled with its ID, it is not possible for Byzantine processes to generate traversed-processes sets with a minimum cut greater than f, so Dolev's algorithm enforces safety; liveness instead depends on the network topology.
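
A brute-force sketch of the verification step in Python; its exponential cost is consistent with the complexity discussed below, and the path representation is an assumption:

# Deliver m once the collected 'traversed processes' lists contain
# f + 1 pairwise vertex-disjoint paths (endpoints are not counted).
from itertools import combinations

def can_deliver(paths, f):
    """paths: list of traversed-processes lists for copies of the same (s, m)."""
    for subset in combinations(paths, f + 1):
        inner = [set(p) for p in subset]
        if all(a.isdisjoint(b) for a, b in combinations(inner, 2)):
            return True                   # found f + 1 disjoint paths
    return False

# Example with f = 1: two disjoint routes suffice to deliver.
print(can_deliver([["b", "c"], ["d"], ["b", "e"]], f=1))   # -> True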

Byzantine reliable broadcast can be achieved in a static network G composed of n processes, f of which Byzantine, if and only if the vertex connectivity of G is at least 2f + 1; the vertex connectivity of a graph is the minimum number of nodes whose deletion disconnects it.

We have two types of complexity:

• Message complexity: exponential in the number of processes;

• Delivery complexity: it requires solving an NP-complete problem;

So, it’s not practically employable, and we can have even another problem: a byzantine processes
can start flooding the network with lots of messages, so we have denial of service problem and so no
liveness. A possible solution is to restrict the capability of every process, so we can use a Bounded
Channel Capacity, where every process can send only a bounded number of messages in a time
window, so in this way we decrease the input for the NP-complete problem.

Dolev's algorithm relays any message to any process not already included in the traversed processes, so we can optimize it:
• Optimization 1:
– If a process p has delivered a message msg, then p can relay msg with an empty traversed processes without affecting the safety property;
• Optimization 2:
– A process p does not have to relay messages carrying m to processes qi that have already delivered m;

On asynchronous systems the message complexity is still exponential, while on synchronous systems with specific topologies these optimizations have been shown to be very effective.

In case we have a routed network, i.e., with fixed routes between every pair of processes, the source broadcasts a message along 2f + 1 disjoint routes, and any other process relays a message only if it respects the planned routes. So we need at least a (2f + 1)-vertex-connected network, and we get quadratic message complexity (every edge is traversed once) and linear delivery complexity (counting the copies of a message).

If instead we have digital signatures, the source digitally signs the message to broadcast and any correct process just relays the received messages; in this case we need at least an (f + 1)-vertex-connected network.

17.2 Locally Bounded Failure Model, Multi-Hop network


With Locally Bounded Failure Model we have:
• n processes;
• Not-complete communication network ;
• Processes can be correct or byzantine;

• We can have f faulty processes in the neighborhood of every process;


• Processes have no global knowledge (so they don’t know about each other);
• Authenticated Perfect Channels;

17.2.1 CPA Algorithm


The idea is that, since we have at most f faulty processes in every neighborhood, a process waits for f + 1 copies of the same message in order to deliver it. So the source broadcasts the message, all the neighbors of the source directly accept and relay it, while any other process accepts and relays the message when it receives it from f + 1 distinct neighbors.

Every process relays a message only if it has been delivered, and since at most f faulty processes are present in each neighborhood the CPA algorithm enforces safety; liveness instead depends on the network topology. The message complexity is quadratic (every edge is traversed once), and the delivery complexity is linear (counting the copies of a message).
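
A sketch of the CPA delivery rule in Python; the relay callback and the neighbor bookkeeping are illustrative:

# CPA: accept a message either directly from the source or after
# f + 1 distinct neighbors have relayed it.
class CPA:
    def __init__(self, self_id, source, f, relay):
        self.self_id, self.source, self.f = self_id, source, f
        self.relay = relay                 # callback: send to all neighbors
        self.copies = set()                # neighbors that relayed m
        self.delivered = False

    def on_receive(self, sender, m):
        if self.delivered:
            return
        if sender == self.source:          # neighbors of the source accept directly
            self.deliver(m)
        else:
            self.copies.add(sender)
            if len(self.copies) >= self.f + 1:
                self.deliver(m)

    def deliver(self, m):
        self.delivered = True
        self.relay(("MSG", self.self_id, m))   # relay only after delivering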

17.2.2 MKLO
MKLO (minimum k-level ordering) is a partition of the nodes into levels, in which the source is placed in L0, the neighbors of the source are in L1, and any other node is placed in the first level such that it has at least k neighbors in the previous levels. If an MKLO exists, the liveness of the CPA algorithm is guaranteed. So the correctness conditions are:

• Necessary condition: an MKLO with k = f + 1;

• Sufficient condition: an MKLO with k = 2f + 1;
The strict condition is an MKLO with k = f + 1 after removing any possible placement of the Byzantine processes;

17.3 Byzantine Reliable Broadcast in Planar Networks


A planar graph is a graph in which edges do not cross; let Z be the maximum number of edges per polygon and D the minimum number of nodes between two Byzantine nodes. It is possible to achieve Byzantine reliable broadcast in a 4-connected planar graph if and only if D > Z. The implementation is:
• Every process saves in Rec(q) the last message received from a neighbor q;
• The source s multicasts an information m;

• The neighbors of the source wait until they receive m from s, then they deliver m and multicast ⟨s, m, ∅⟩;
• When ⟨s, m, S⟩ is received from a neighbor q with q ∉ S and |S| ≤ Z − 3, then Rec(q) := ⟨s, m, S⟩ and the process multicasts ⟨s, m, S ∪ {q}⟩;

• When ∃ m, p, q, S such that q ≠ p, q ∉ S, Rec(q) = ⟨s, m, ∅⟩ and Rec(p) = ⟨s, m, S⟩, then the process delivers m, multicasts ⟨s, m, ∅⟩ and stops;
Every process keeps a copy of the last message received from every neighbor, so only linear memory is required on every process; we tolerate at most 1 Byzantine process arbitrarily placed, or more under the spatial condition D > 4.

18 Byzantine Tolerant Broadcast, Dynamic Networks


In dynamic networks both the set of processes that compose the system (churn) and the communication network change continuously. In this system processes can directly exchange messages with a subset of all the processes, this subset can change over time, and processes can be isolated for a while. So we need a model that changes over time; one of the most general is the TVG model.

18.1 TVG Model
The TVG model, or Time-Varying Graph, is a graph G := (V, E, ρ, ζ) where V is the set of nodes, E the set of edges, ρ is the presence function (a boolean function indicating whether an edge is present at a given time), and ζ is the latency function (a function that, given an edge and a time, returns the latency). G can also be described as a sequence of static graphs (snapshots). The underlying graph is the graph on which all the snapshot graphs are based:

A sequence of distinct nodes (p1, ..., pn) is a journey, or dynamic path, from p1 to pn if there exists a sequence of dates (t1, ..., tn) such that ∀ i ∈ {1, ..., n − 1} we have (a checking sketch follows the list):
• ei = (pi, pi+1) ∈ E, so there is an edge connecting pi to pi+1;
• ∀ t ∈ [ti, ti + ζ(ei, ti)], ρ(ei, t) = 1, so pi can send a message to pi+1 at date ti;
• ζ(ei, ti) ≤ ti+1 − ti, so the aforementioned message is received by date ti+1;
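
A minimal check of this definition in Python, assuming integer dates and a TVG given by its edge set and the rho/zeta functions (all names are illustrative):

def is_journey(nodes, dates, edges, rho, zeta):
    for i in range(len(nodes) - 1):
        e = (nodes[i], nodes[i + 1])
        if e not in edges:                        # e_i must belong to E
            return False
        t = dates[i]
        latency = zeta(e, t)
        # the edge must stay present for the whole transmission interval...
        if not all(rho(e, t + dt) for dt in range(latency + 1)):
            return False
        # ...and the message must be received by the next date
        if latency > dates[i + 1] - t:
            return False
    return True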

There are different classes of TVG:

• Class 1, Temporal source: ∃u ∈ V : ∀v ∈ V, u → v; broadcast is feasible from at least one node;
• Class 2, Temporal sink: ∃u ∈ V : ∀v ∈ V, v → u; a function whose input is spread over all nodes can be computed;
• Class 3, Connectivity over time: ∀u, v ∈ V, u → v; every node can reach all the others;
• Class 5, Recurrent connectivity: ∀u, v ∈ V, ∀t ∈ τ, ∃J ∈ J∗(u,v) : departure(J) > t; routing can always be achieved;
• Class 6, Recurrence of edges: ∀e ∈ E, ∀t ∈ τ, ∃t′ > t : ρ(e, t′) = 1, and G is connected;
• Class 7, Time-bounded recurrence of edges: ∀e ∈ E, ∀t ∈ τ, ∃t′ ∈ [t, t + δ] : ρ(e, t′) = 1, and G is connected;
• Class 8, Periodicity of edges: ∀e ∈ E, ∀t ∈ τ, ∀k ∈ N, ρ(e, t) = ρ(e, t + kp) for some period p, and G is connected;
• Class 10, T-interval connectivity: ∀i ∈ N, ∀T ∈ N, ∃G′ ⊆ G with VG′ = VG such that G′ is connected and ∀j ∈ [i, i + T − 1], G′ ⊆ Gj;

The broadcast latency on 1-interval connected networks is O(n): at every time t at least one not-yet-informed node is connected to an informed node, so the complexity is equal to that of the static multi-hop network. The Byzantine reliable broadcast specification is as before, and like for the multi-hop network we can have two different types of failure models, globally bounded and locally bounded.

18.2 Globally Bounded Failure Model, Dynamic network


• n processes;
• Dynamic Communication Network, TVG;

• Processes can be correct or byzantine;


• We can have f faulty processes;
• Processes have no global knowledge (so they don’t know about each other);
• Authenticated Perfect Channels;

Unlike the multi-hop (static) network, in which Vertex Cut = Disjoint Paths, in a dynamic network Vertex Cut ≥ Disjoint Paths. So the idea is to extend Dolev's algorithm by using the same propagation algorithm, but checking for a dynamic min-cut of size f + 1, and every message is retransmitted every time a process detects a network change in its neighborhood.

Safety is guaranteed by the fact that Byzantine processes cannot generate traversed-processes sets with a minimum cut greater than f; as for liveness, Byzantine reliable communication between two endpoints p and q is achievable if and only if the dynamic minimum cut between p and q is at least 2f + 1. The complexity is the same as Dolev's: exponential message complexity, NP-complete delivery complexity.

For static distributed systems the vertex connectivity of a graph can be verified in polynomial time through a max-flow algorithm, while in dynamic distributed systems the computation of a dynamic min-cut is an NP-complete problem.

18.3 Locally Bounded Failure Model, Dynamic network


• n processes;
• Dynamic Communication Network, TVG;
• Processes can be correct or byzantine;
• We can have f faulty processes in the neighborhood of every process;
• Processes have no global knowledge (so they don’t know about each other);
• Authenticated Perfect Channels;

We can use the CPA algorithm on dynamic distributed systems without further changes. The safety property is still guaranteed: every process relays a message only if it has been delivered, and at most f faulty processes are present in each neighborhood. The liveness property in static networks requires the existence of an MKLO partition; in dynamic networks an MKLO is not enough, because an edge may disappear while transmitting a message and the order of appearance of edges matters, so we have to use the TMKLO, the temporal minimum k-level ordering:

TMKLO = RCD + MKLO

where RCD denotes the edge appearances that allow a message transmitted over a channel to be reliably delivered. The necessary condition with TMKLO is k = f + 1, the sufficient condition is k = 2f + 1, and the computation of the TMKLO is possible only with full TVG knowledge.

19 Blockchain
A blockchain is a decentralized, distributed and public digital ledger used to record transactions across many computers, so that any recorded transaction cannot be altered retroactively without the alteration of all subsequent blocks. So it's a decentralized, fully replicated database on a trust-less P2P network containing a history of transactions, and it's public, immutable and non-repudiable.

Blockchain is used, for example, by Bitcoin, in which the network ensures the validity of transactions without a trusted centralized authority. The Bitcoin blockchain is composed of the bitcoin, a virtual currency, and the Bitcoin public ledger, the list of all transactions ever made. When a new transaction is made, the sender updates his ledger and then broadcasts the message, so all the nodes update their Bitcoin public ledger and verify the transaction with the use of digital signatures (all the nodes have a private key).

All the transactions are grouped in blocks, and the blocks are connected to each other in a chain: each block has a reference to the hash of the previous one, and all the transactions in the same block are considered made at the same time. Transactions not yet in a block are called unconfirmed transactions, and each node can pick a set of unconfirmed transactions in order to build a new block and propose it. It's important to note that attaching a new block requires consensus.

The proof-of-work is a mathematical challenge to solve a block. Mining means finding a nonce such that hash(block) < target, where the target is a threshold decided a priori by the system; the first node that solves a block can propose it as the next one in the blockchain, and the other nodes that receive the block recompute the hash to check its validity. If two or more branches arrive together we have a temporary disagreement, and in this case the network has to converge to the longest one. A miner needs high computational power in order to compute thousands of hashes per second.
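
A toy sketch of mining in Python, where the target difficulty and the block encoding are arbitrary assumptions:

# Find a nonce such that hash(block || nonce) < target.
import hashlib

def mine(block_data: bytes, target: int) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce                  # proof-of-work found
        nonce += 1

target = 1 << (256 - 20)                  # ~2^20 hashes expected on average
nonce = mine(b"prev_hash|transactions", target)
print("nonce:", nonce)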

Transaction rate depends on two parameters:

• Block Size: how many transactions fit in a block;

• Block Interval: the time to wait for a block to propagate to all the nodes;
We have two important metrics:
• Throughput: how many transactions per second;

• Latency: how much time to complete a transaction;


Increasing the block size improves the throughput, but a bigger block takes longer to propagate in the network. Decreasing the block interval instead improves the latency, but leads the system to instability caused by disagreement.

Mining is encouraged because solving a block gives coins to the node that found the proof-of-work, and some transactions can carry an additional fee given to the node who mines the block containing them. Rewards are an incentive for nodes to keep supporting the blockchain and to keep nodes honest, and they are a way to distribute coins into circulation.

It’s impossible for an attacker to change a transaction in a specific block in the blockchain, cause
an attacker should be quicker than the rest of the whole network, and in that case he should be able
to re-mine n + 1 blocks (all blocks next to the specific block ) quicker than the rest of the network,
if so the attacker could obtain the longest and modified blockchain and all network would converge
to it, but the attacked should have the 50% of the computational power of the network. Since the
last blocks are less secure (in order to obtain the longest branch) an attacker should wait for 5 or 6
blocks in order to make an attack successfully and this make its probability too low, so this solution
protect Integrity and Double Spending Fraud.

There is an alternative to proof-of-work called proof-of-stake: while proof-of-work is very secure but wastes a lot of electric power, proof-of-stake is secure without mining, so we don't waste energy. Instead of mining a block, the creator of the next block is chosen in a deterministic way according to its wealth, and the reward is not related to the created block but to your wallet: the longer you keep the coins in the wallet, the higher the reward. The probability of minting (instead of mining) is proportional to your wallet, so attacking the network would require a lot of coins, and it's very hard to mint two consecutive blocks.

