Group Communication
Groups are ubiquitous in human society. We often use collective names for groups of people as a convenient means for referring to or addressing some part of the population as if it were a single entity, like a school class, an age group, or a social category. People get together in groups whenever concerted action can be expected to provide gains in efficiency or to improve the chances of success in some endeavor, like a road crew, a military platoon, or a research team.

Similarly, groups can be used in distributed com- [...] groups could be limited to the convenience of collectively designating a set of processes¹ using a common name or address. Such facilities are already offered in many local-area networks (LANs), and we are all accustomed to using Internet news groups or mailing lists. The full benefits of the group concept, however, can be reaped only if we know how to set up and coordinate groups of processes that work together to fulfill a common purpose, like sharing a computational load, increasing performance, or providing a fault-tolerant service. This special section presents some of the current ideas on how such groups can be created
[...] or group communication service should provide is still the subject of much debate within the research community. The ease with which a given group’s service semantics can be provided or, indeed, whether or not such a service can be implemented at all, depends heavily on what can be said or assumed about the computation environment in which the service is to be provided. The most important assumptions concern how processes can fail and how well they communicate.

The strongest and most common assumption about process failures is the crash failure model: a process acts in full accordance with its specification until it suddenly ceases all activity. A crash failure [...] behaved failures can be neglected. A crash failure model might be appropriate, for example, for a general-purpose computing network in which the most common problem is the unavailability of certain sites. In such an environment, any service that can tolerate process crash failures provides a useful improvement in dependability over one that cannot tolerate any failures at all. A crash failure model can also be appropriate for ultra-dependable distributed systems if [...]

At the other end of the failure spectrum is the arbitrary failure model in which no restrictive assumptions are made about the way processes can fail. For example, they could fail by sending erroneous messages, by
saturating the network, or even by colluding with other faulty processes to bring down the system. This is a “worst-case” failure model that frees the system designer of any obligation to justify the realism of a more restrictive assumption. It is particularly appropriate for building ultra-dependable systems or for dealing with processes under the control of a malicious intruder. Unfortunately, protocols that can tolerate such arbitrary failures require more redundancy and more messages than if they were designed to tolerate only crash failures; they are also much more difficult to design and validate.

The strongest assumption that can be made about inter-process communication is that any message sent by a correct process to another correct process is always received within a given delay, the so-called synchronous communication assumption. The nice thing about this assumption is that one process can reliably detect whether another process is alive just by sending it a query and waiting a known bounded time for a response. Unfortunately, in a system where processes must communicate over a shared network, such perfection is guaranteed only with a certain probability, by using multiple communication paths and/or message retransmissions. Often, however, it is impossible to give even a probabilistic guarantee, since the actual load on the network may be totally unpredictable.

The opposite approach is to consider that there is no known limit on the time it takes for a message to reach its destination. Protocols designed without knowledge of time limits could easily be ported from one environment to another, since they would operate correctly whatever the performance of the network. Unfortunately, with such totally asynchronous communication, a process cannot decide whether another process has crashed or whether its query or the expected response is still on its way across the network. In practice, it is essential to introduce some notion of time so that processes know how long to wait for an expected response before suspecting that the originator of the response might have failed.

Note, however, that suspicion of a crash is not the same as detection of a crash; the suspected process might still be perfectly healthy. It is easy to see, therefore, that it is impossible to achieve any sort of deterministic agreement between correct processes. The best that can be done in such an environment is to ensure that certain safety properties are guaranteed whatever the communication delays, and that useful progress is made whenever the network performs well enough for processes to communicate with each other in a timely manner.

It might be instructive at this point to draw a few analogies between groups of processes and groups of people. Imagine that a committee meeting is convened and attendees gather round a table in a meeting room. It’s a long meeting, so the attendees get very tired, and some of them doze off now and again. However, when they are awake they can easily see who else is awake, since all are sitting round the same table. With a little organization (a protocol), those that stay awake can (for example) take turns to address the meeting, and they should all know who else is awake, what they heard, and what they should have learned.

This analogy illustrates several points:

• There are at least three sorts of groups to be considered: the people eligible to attend the meeting (the committee); those who attend the meeting (the attendees); and those who participate in the committee’s work at a given instant because they are awake (the participants).
• The meeting room setting is analogous to the synchronous communication model discussed earlier in which communication is reliable and timely, and it is easy for people (processes) to detect whether some of them have fallen asleep (crashed). Thus, they can make strong statements about who is awake (the current membership of the group of participants), what they have heard (the messages delivered), and what they have all learned (changes to internal state resulting from the order of message delivery).
• It shows that we must also worry about how new attendees (people who join the meeting after it has started) and recovered attendees (participants who fall asleep and later awaken) are brought up to date with what has been decided while they were absent or asleep.

Now let us consider another setting for the committee meeting. Let us suppose that, instead of meeting round a table in a quiet room, the committee tries to conduct its business in a large and very busy hotel lobby. Because of the hustle and bustle, the attendees cannot always see or talk directly to one another. Even when one attendee can see another attendee, it’s not certain that the latter is looking at the former. Some of the attendees could fall asleep or go home without the others ever noticing. It’s quite plain that in this setting the committee has a much harder job processing its agenda in a consistent way. We can suppose that the attendees try to gather together to get some work done, but to do so they also have to reach some sort of agreement about who they all think are in their particular gathering (for example, to decide who will act as chair).
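The difference between suspecting and detecting a crash, raised in the discussion of asynchronous communication, can be made concrete with a small sketch. Everything named here is an illustrative assumption rather than part of any protocol described in this section: the `TimeoutSuspector` class, its `timeout` parameter, the peer names `p1` and `p2`, and the hand-fed clock values that stand in for real time.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutSuspector:
    """Hypothetical helper: suspects a peer that has been silent longer than `timeout`."""
    timeout: float
    last_heard: dict = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the (simulated) time at which a message from `peer` arrived.
        self.last_heard[peer] = now

    def suspects(self, now: float) -> set:
        # A silent peer is only suspected; it may merely be slow, not crashed.
        return {p for p, t in self.last_heard.items() if now - t > self.timeout}


detector = TimeoutSuspector(timeout=2.0)
detector.heartbeat("p1", now=0.0)
detector.heartbeat("p2", now=0.0)

detector.heartbeat("p1", now=1.0)    # p1 answers on time
early = detector.suspects(now=2.5)   # p2 has been silent for 2.5 > 2.0

detector.heartbeat("p1", now=3.0)
detector.heartbeat("p2", now=3.0)    # p2's delayed message finally arrives
later = detector.suspects(now=4.0)   # the suspicion of p2 is withdrawn

print(early, later)
```

In this run `p2` is suspected at time 2.5 although it never crashed, and the suspicion disappears once its late message arrives. A longer timeout would produce fewer false suspicions at the cost of noticing real crashes later, which is the trade-off any notion of time introduces.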