Group Communication

The document discusses the concept of groups in both human society and distributed computing systems, emphasizing their role in enhancing efficiency and coordination for common purposes. It explores various models of process failures, particularly crash and arbitrary failure models, and the implications for group communication protocols. The document also highlights ongoing research and prototype systems aimed at improving group management and communication in diverse network environments.


David Powell, Guest Editor

Groups are ubiquitous in human society. We often use collective names for groups of people as a convenient means for referring to or addressing some part of the population as if it were a single entity, like a school class, an age group, or a social category. People get together in groups whenever concerted action can be expected to provide gains in efficiency or to improve the chances of success in some endeavor, like a road crew, a military platoon, or a research team.

Similarly, groups can be used in distributed computing systems to help master the complexity of large applications or to help provide non-functional properties, such as availability or security. Computation groups could be limited to the convenience of collectively designating a set of processes¹ using a common name or address. Such facilities are already offered in many local-area networks (LANs), and we are all accustomed to using Internet news groups or mailing lists. The full benefits of the group concept, however, can be reaped only if we know how to set up and coordinate groups of processes that work together to fulfill a common purpose, like sharing a computational load, increasing performance, or providing a fault-tolerant service. This special section presents some of the current ideas on how such groups can be created and managed.

¹ The term “process” is used here for simplicity of expression. Of course, many different sorts of computing entities can be considered in groups, such as physical processors, database servers, and sub-networks of a larger communication network.

Illustration: Will Terry

What facilities or semantics a group management or group communication service should provide is still the subject of much debate within the research community. The ease with which a given group’s service semantics can be provided or, indeed, whether or not such a service can be implemented at all, depends heavily on what can be said or assumed about the computation environment in which the service is to be provided. The most important assumptions concern how processes can fail and how well they communicate with each other.

The strongest and most common assumption about process failures is the crash failure model: a process acts in full accordance with its specification until it suddenly ceases all activity. A crash failure model is appropriate if the probability of less well-behaved failures can be neglected. A crash failure model might be appropriate, for example, for a general-purpose computing network in which the most common problem is the unavailability of certain sites. In such an environment, any service that can tolerate process crash failures provides a useful improvement in dependability over one that cannot tolerate any failures at all. A crash failure model can also be appropriate for ultra-dependable distributed systems if nodes have extensive built-in self-checking.

At the other end of the failure spectrum is the arbitrary failure model in which no restrictive assumptions are made about the way processes can fail. For example, they could fail by sending erroneous messages, by saturating the network, or even by colluding with other faulty processes to bring down the system. This is a “worst-case” failure model that frees the system designer from any obligation to justify the realism of a more restrictive assumption. It is particularly appropriate for building ultra-dependable systems or for dealing with processes under the control of a malicious intruder. Unfortunately, protocols that can tolerate such arbitrary failures require more redundancy and more messages than if they were designed to tolerate only crash failures; they are also much more difficult to design and validate.

The strongest assumption that can be made about inter-process communication is that any message sent by a correct process to another correct process is always received within a given delay, the so-called synchronous communication assumption. The nice thing about this assumption is that one process can reliably detect whether another process is alive just by sending it a query and waiting a known bounded time for a response. Unfortunately, in a system where processes must communicate over a shared network, such perfection is guaranteed only with a certain probability, by using multiple communication paths and/or message retransmissions. Often, however, it is impossible to give even a probabilistic guarantee, since the actual load on the network may be totally unpredictable.

The opposite approach is to consider that there is no known limit on the time it takes for a message to reach its destination. Protocols designed without knowledge of time limits could easily be ported from one environment to another, since they would operate correctly whatever the performance of the network. Unfortunately, with such totally asynchronous communication, a process cannot decide whether another process has crashed or whether its query or the expected response is still on its way across the network. In practice, it is essential to introduce some notion of time so that processes know how long to wait for an expected response before suspecting that the originator of the response might have failed.

Note, however, that suspicion of a crash is not the same as detection of a crash; the suspected process might still be perfectly healthy. It is easy to see, therefore, that it is impossible to achieve any sort of deterministic agreement between correct processes. The best that can be done in such an environment is to ensure that certain safety properties are guaranteed whatever the communication delays, and that useful progress is made whenever the network performs well enough for processes to communicate with each other in a timely manner.

It might be instructive at this point to draw a few analogies between groups of processes and groups of people. Imagine that a committee meeting is convened and attendees gather round a table in a meeting room. It’s a long meeting, so the attendees get very tired, and some of them doze off now and again. However, when they are awake they can easily see who else is awake, since all are sitting round the same table. With a little organization (a protocol), those that stay awake can (for example) take turns to address the meeting, and they should all know who else is awake, what they heard, and what they should have learned.

This analogy illustrates several points:

• There are at least three sorts of groups to be considered: the people eligible to attend the meeting (the committee); those who attend the meeting (the attendees); and those who participate in the committee’s work at a given instant because they are awake (the participants).
• The meeting room setting is analogous to the synchronous communication model discussed earlier in which communication is reliable and timely, and it is easy for people (processes) to detect whether some of them have fallen asleep (crashed). Thus, they can make strong statements about who is awake (the current membership of the group of participants), what they have heard (the messages delivered), and what they have all learned (changes to internal state resulting from the order of message delivery).
• It shows that we must also worry about how new attendees (people who join the meeting after it has started) and recovered attendees (participants who fall asleep and later awaken) are brought up to date with what has been decided while they were absent or asleep.

Now let us consider another setting for the committee meeting. Let us suppose that, instead of meeting round a table in a quiet room, the committee tries to conduct its business in a large and very busy hotel lobby. Because of the hustle and bustle, the attendees cannot always see or talk directly to one another. Even when one attendee can see another attendee, it’s not certain that the latter is looking at the former. Some of the attendees could fall asleep or go home without the others ever noticing. It’s quite plain that in this setting the committee has a much harder job to process its agenda in some consistent way. We can suppose that the attendees try to gather together to get some work done, but to do so they also have to reach some sort of agreement about who they all think are in their particular gathering (for example, to decide who will act as chair).
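
The difference between suspecting and detecting a crash, discussed earlier, can be made concrete with a small sketch. It is not taken from any of the systems in this special section; the class name `TimeoutSuspector`, its methods, and the two-second timeout are assumptions chosen purely for illustration. Time is passed in explicitly so that the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TimeoutSuspector:
    """Hypothetical helper: suspects a peer after `timeout` seconds of silence."""
    timeout: float
    last_heard: Dict[str, float] = field(default_factory=dict)

    def heard_from(self, peer: str, now: float) -> None:
        # Any message (a heartbeat or a reply to a query) counts as a sign of life.
        self.last_heard[peer] = now

    def suspects(self, now: float) -> Set[str]:
        # These peers are only *suspected*: each may be crashed, slow, or cut off.
        return {p for p, t in self.last_heard.items() if now - t > self.timeout}


detector = TimeoutSuspector(timeout=2.0)
detector.heard_from("p1", now=0.0)
detector.heard_from("p2", now=0.0)

early = detector.suspects(now=1.0)   # nobody has been silent for more than 2 s

detector.heard_from("p1", now=3.0)   # p1 keeps responding, p2 stays silent
late = detector.suspects(now=4.0)    # p2 has been silent for 4 s

detector.heard_from("p2", now=4.5)   # p2 was merely slow, not crashed
after = detector.suspects(now=4.8)   # the suspicion is withdrawn
```

Under the synchronous assumption, a timeout set to the known bound on a query and its response would turn every suspicion into a reliable detection. Without such a bound, as in the hotel lobby, the same code can wrongly suspect a healthy process, which is why the suspicion has to be revocable.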


52 April 1996/Vol. 39, No. 4 COMMUNICATIONS OF THE ACM


As people drift in and out of one another’s sight, they have to successively reach new decisions about who is in their gathering. Furthermore, there could be many such gatherings in different parts of the lobby at the same time. In this case, different gatherings of attendees could end up making conflicting decisions. The only way to avoid such conflicts is to impose a rule stating, for example, that only gatherings with a majority of committee members are allowed to make any decisions. Alternatively, the committee could attempt to reconcile conflicting decisions whenever two or more gatherings are able to merge and form a new gathering.

This meeting-in-a-crowded-lobby scenario is analogous to the asynchronous communication model discussed earlier. It illustrates several points:

• Since people (processes) cannot reliably detect whether some of them are absent or, equivalently, have fallen asleep (crashed), they cannot decide exactly who is attending the meeting.
• It is impossible to prevent the attendees from splitting into separate sub-meetings (gatherings). Such gatherings or groups of participants are sometimes called “views” of the meeting, since the participants consider that their gathering is the meeting. This logical partitioning of “meetings” is unavoidable in asynchronous settings, so asynchronous group protocols must be able to deal with it.
• When an attendee joins an existing gathering (and thus forms a new, larger gathering), the participants have to work out what this new participant already knows about the committee’s work and bring him or her up to date. This reporting has to be done either if the new participant just woke up from a nap (recovered from a crash) or if he or she lost touch with an earlier gathering (became disconnected). In an asynchronous setting, it’s difficult to tell the difference. If the participants are not going to have to constantly tell “new” participants everything they have done since the beginning of time, they have to remember (keep on stable storage) what they did in earlier gatherings, and the protocol has to ensure that all work done in successive gatherings is done in some consistent fashion.

* * *

Over roughly the last decade, there has been considerable research into the management of process groups and protocols for communication within and between such groups. The articles in this special section present some of the prototype systems and current research activities typical of the prevalent ideas in this fascinating area.

The article by Louise Moser, Michael Melliar-Smith, and their team at the University of California, Santa Barbara, describes the Totem system, which provides a totally ordered multicast service to application process groups. It is particularly suitable for supporting fault-tolerant soft real-time applications. Totem is a scalable system built using a hierarchy of group communication protocols for groups of processors on a LAN, for groups of interconnected LANs, and for groups of application processes.

In their article on the Transis system, Danny Dolev and Dalia Malki from the Hebrew University of Jerusalem consider some of the difficult problems that arise in diverse network settings. The authors discuss how different components of a partitioned network can operate autonomously and then merge operations when they become reconnected (such partitioned operation is of special interest in mobile applications). They also consider the need for different protocols for fast local communication and for the unavoidably slower communication between local clusters.

Most articles in this special section assume that processes fail only by crashing or by just being slow. The exception is the short article by Mike Reiter from AT&T Bell Laboratories, sketching some of the novel group communication ideas embodied in the Rampart system to provide tolerance for malicious intrusion. Here, faulty processes can collude and fail in quite arbitrary fashion, and groups are used to mask the malicious actions of a minority of the group members.

There are many different ways of defining and using process groups. Features that may be useful in one setting may be impediments in others. The Horus system, described in the article by Robbert van Renesse, Ken Birman, and Silvano Maffeis of Cornell University, aims to provide a very flexible environment for system programmers to configure group protocols specifically adapted to the problem at hand.

The last two articles do not describe specific systems but address instead the fundamentals of group communication. André Schiper (EPFL, Switzerland) and Michel Raynal (IRISA, France) trace some interesting directions for future research into various sorts of process groups. In particular, they consider the differences between groups for replicated, fault-tolerant objects and groups for implementing atomic transactions. They propose a multicast primitive for carrying out a specific class of transactions on a set of replicated, fault-tolerant objects.

Flaviu Cristian of the University of California, San Diego, compares the properties of group communication protocols for the synchronous and asynchronous communication models. This comparison underlines the advantages and drawbacks of both models and should be considered essential reading for anyone interested in group communication.

DAVID POWELL is Directeur de Recherche CNRS at the Laboratoire d’Analyse et d’Architecture des Systèmes where he works in the Dependable Computing and Fault Tolerance Research Group. Current Address: LAAS-CNRS, 7 Avenue du Colonel Roche, 31077 Toulouse, France; email: [email protected]

