Programming Distributed Systems P310-Feldman
J. A. Feldman, J. R. Low, and P. D. Rovner
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery, Inc. To copy otherwise, or to republish, requires a fee and/or specific permission.
Figure 1. RIG Hardware Configuration
Modules communicate with one another solely through messages. In order to have communication, there must be something that is understood by both communicating modules. The common element in PLITS is a name, which may be thought of as an uninterpreted string of characters. A message is a set of (name~value) pairs called slots. The value portion of a slot will be an element of some primitive domain (think of integers) whose representation is also generally understood.

The modules of any PLITS system will have to be able to compose, send, receive, and decompose messages. For this purpose, we must add some data types and operations to ALGOL or any other body language. In this case, the primitive data types of ALGOL will have to be extended to include module and message. Each module also will contain an explicit declaration (Public) of every slot name that it can deal with along with the data type of that slot. There is a process analogous to link-editing that insures that public slot names are used consistently.

For a first example, suppose there were a module, Fibonacci, which provided the service of supplying consecutive positive Fibonacci numbers, and a module, George, which wanted to make use of this service. The code for this would be something like that shown in Example 1 (George and Fibonacci are actual constant named modules, not module prototypes or class definitions).

We see that George and Fibonacci both know the slot names "Object" and "Recipient" and thus can communicate. At the appropriate time, George composes a message with one slot, having as a value the system identifier for the module George itself. After sending the message to Fibonacci, George is suspended until a message from Fibonacci is received, i.e., this is essentially a subroutine call. The Fibonacci module simply waits for a request and fulfills it. The syntax for accessing and modifying messages treats them like the records of, e.g., Pascal [Hoare, 1973].

Starting from a survey of the "powerful ideas" of programming systems, we attempted to see if there were inherent incompatibilities among them. It was immediately clear that one could not combine all of the useful language primitives in a consistent way--so PLITS had to include different languages. Networking was clearly here to stay and had to be accounted for. Structured programming seemed to be attacking the right problem with unreasonable methods. Messages were known to be a very good control primitive and were the coin of networking. The experience with RIG convinced us that messages also seemed to be a good mechanism for producing reliable yet still flexible software.

The message-module paradigm became established quickly as the fundamental solution for PLITS.

The decision to have public names as the basis of communication seems obvious in retrospect, but was difficult to arrive at. By sharing names rather than variables or sequential positions in some structure, modules could be written in a way that was clear, but did not have the problems of shared storage.

It was apparent from work in automatic programming and verification that more declarative information was needed--hence we included the general notion of assertions. Although many difficult questions remain, enough clean solutions have been found to convince us that there is something fundamentally sound in the PLITS world view.

Example 1 is basically bad PLITS code; the module Fibonacci contains no error checking. Let us consider an expanded but still weak version which will not cause integer overflow (Example 2).

The first new notion occurs on line 4, where a public slot name of type "problem_type" is declared. The type problem_type is a fixed sequence of uninterpreted symbols exactly like the Pascal "enumeration" type. There will be several public enumerations in a PLITS system.

In lines 9-11, a prepackaged message is assembled and stored in the message variable, My_Complaint. The other new code is in lines 21-27; the Recipient slot of My_Complaint is filled in from the Request. If there is a Complaint_Dept slot in this request, the module which is its value will be sent the complaint. Otherwise, some default complaint handler, City_Hall, will hear about it. The name of the Recipient module (which may have been awaiting an answer) is passed along to the Complaint_Dept, because there might be some appropriate response to the problem. For example, there could be some double precision Fibonacci module which would be able to return an appropriate value if George were prepared to
 1  Begin "George"                        Begin "Fibonacci"
 2  Public Integer Object;                Public Integer Object;
 3  Public module Recipient;              Public module Recipient;
                                          While true do
                                          Begin
 7  Send {Recipient~Me} to Fibonacci;     Request ← Receive from Any;
10  End                                   End
11  End "George"                          End "Fibonacci"
Example 1
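The exchange in Example 1 can be approximated in Python as a rough sketch. This is not PLITS: the `Module` class, the thread-and-queue mailboxes, the `Stop` slot used to shut the server down, and the starting values 0 and 1 are all assumptions made for illustration. Messages are modelled as dictionaries from slot names to values, with the slot names "Recipient" and "Object" taken from the example.

```python
import queue
import threading


class Module:
    """Hypothetical stand-in for a PLITS module: a name plus a mailbox."""

    def __init__(self, name):
        self.name = name
        self.mailbox = queue.Queue()

    def send(self, message):
        # A message is a dict of slot name -> value.
        self.mailbox.put(message)

    def receive(self):
        # Blocks (suspends the caller) until a message arrives.
        return self.mailbox.get()


def fibonacci_body(me):
    """Wait for a request from any module and answer it, forever."""
    last, this = 0, 1
    while True:
        request = me.receive()
        if request.get("Stop"):  # assumed shutdown slot, not in the paper
            return
        request["Recipient"].send({"Object": this})
        last, this = this, last + this


def george_body(me, fibonacci, count):
    """Send a request, then suspend until the reply arrives (a subroutine call)."""
    results = []
    for _ in range(count):
        fibonacci.send({"Recipient": me})
        reply = me.receive()
        results.append(reply["Object"])
    return results


def run_example(count=5):
    george = Module("George")
    fibonacci = Module("Fibonacci")
    worker = threading.Thread(target=fibonacci_body, args=(fibonacci,), daemon=True)
    worker.start()
    results = george_body(george, fibonacci, count)
    fibonacci.send({"Stop": True})
    worker.join(timeout=1)
    return results
```

As in the example, George names itself in the Recipient slot, so the server needs no prior knowledge of its clients.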
Example 2 (beginning)
1 Begin "Fibonacci"
2 Public Integer Object;
3 Public module Recipient, Complaint_Dept, Complainer;
4 Public problem_type Problem;
5 message Request, My_Complaint;
6 module Complainee;
7 integer This, Last, Previous, Biggest;
8 Last ← 0; This ← 1; Biggest ← 23~ ~1 ;

Example 3 (beginning)
1 Begin "Fibonacci"
2 Public integer Object;
3 Public module Recipient;
4 Public action_type Action;
5 message Request;
6 map This[transaction] => integer;
7 map Last[transaction] => integer;
accept it. This would require that George handle double-size integers: that is not hard to arrange, for example, by an extra slot for the high order part.

There is a more interesting problem in the control discipline used in the coding of the module George given in Example 1. The statement on line 8 is:

    Mess2 ← Receive from Fibonacci.

But we saw in the expanded Fibonacci module of Example 2 that there might be an error recovery module that would supply the answer if Fibonacci could not. The coding style of line 8 requires that the answer be conveyed back to Fibonacci and then to George, but there is nothing to be gained by retracing our steps. To solve this and a number of other control problems, we will add one more construct, transaction, to PLITS. Intuitively, a transaction is a key which can be used in the regulation of message traffic. We could replace 8 with:

    8' Mess2 ← Receive about Key4;

where Key4 is a transaction which is identified with the generation of this sequence of Fibonacci numbers. Selective receives based on transaction keys allow a receiving module to be programmed without regard to which module will ultimately send it the message. Yet the receiving module is still able to keep separate "conversations" distinct.

This leads us to the use of transactions as tags for different streams of communication to and from multiplexing modules. In Example 3, we have a new Fibonacci module which can simultaneously maintain several different streams of Fibonacci sequences for one or many users.

In the example, "map" (lines 6-7, 28, 27, 34, 35, etc.) indicates a data structure which maps a transaction key into an integer. Think of it as an integer array indexed by transaction keys. "Action_type" is an enumeration type consisting of START, GENERATE, and TERMINATE. MaxActive is a constant which is the maximum number of active sequences allowed.

Every PLITS message contains two slots placed there by the system. The "From" slot contains the source of the message. The "About" slot contains the transaction key associated with the message. (A dummy key is supplied if the user has not provided one.)

To obtain a Fibonacci stream, a user module will send a message with Action field equal to "Start" and with the globally known constant transaction key "StartNewSequence." The Fibonacci module sends the first element of a new sequence back along with a new transaction key, which will be used by the user module in each future request.

Note the power of the selective receive in line 16. The Fibonacci module has complete control over what it responds to. Originally (line 16), it is listening for any request for service. When a stream is opened, it adds that stream to its set (line 28). When the maximum allowable number of active streams is reached, it turns deaf to new requests for service (line 24). Finally, when a stream dies, it turns deaf to that stream and again allows new requests for service (lines 40-44).

4. DSYS--A Distributed System

With the PLITS style of programming as background and a source of examples, we are developing an experimental system (DSYS) to support high-level distributed computing. DSYS will run on the seven computers in our laboratory: four ALTOs, two Eclipses, and a PDP/10. It will provide facilities for defining and running PLITS distributed jobs (DJOBs).

The remainder of this section outlines both our progress to date and our present design ideas for DSYS. The discussion in Sections 4.10 and 4.11 summarizes our goals and states some questions that we find useful for DSYS planning.

4.1 System Overview

Even on a single machine, there will have to be some underlying programs which handle messages. We will call this collection of programs the kernel for a PLITS site. The kernel is a conventional multi-programming monitor which sequences through the modules on its "ready" queue. The kernel also maintains data structures describing modules which are "suspended," waiting to Receive a message of a specified sort. These data structures, together with analogous ones for messages which result from Send statements, suffice to implement the PLITS message primitives.

A problem arises if the modules are written in different body languages. It may be the case that languages differ in their representation of primitive data types (e.g., real). We require that the representation of primitive data types be uniform within a site. This, as well as other considerations, may give rise to the situation where there is more than one site on a given machine involved in an individual distributed job (DJOB). Figure 2 is a graphic representation of the breakdown of functions and terminology which we have adopted. It is convenient to divide the PLITS support functions into two subsets carried out by the site kernel and by the DSYS Host Control Program (DHCP) respectively. In the example, there are two DJOBs, A and B, which have no connection but happen to be both distributed over Machines 1 and 2. DJOB A consists of three sites: S11 and S12 on Machine 1 and S21 on Machine 2. Each site has a kernel associated with it as described above. The kernel performs the following functions:

(1) distributes messages within the site;
(2) forwards messages to and from other sites;
(3) carries out needed representation shifts for inter-site messages;
(4) allocates resources within the site;
(5) generates unique (world-wide) names;
(6) checks for errors and assertion violations.

We have briefly discussed the first three functions. The fourth function, resource allocation within the site, is concerned with storage allocation and reclamation, scheduling of ready modules, etc. The fifth function is the generation of unique names for modules and transaction keys (see Section 4.6). Error and assertion checking is discussed in Section 4.8.

Each DHCP is an extension of its machine's operating system. It performs four main functions (see Section 4.5 for more detail):

(1) distributes messages among sites local to this machine;
(2) forwards messages to and from other machines;
(3) starts and stops DJOBs, and provides access to other operating system services;
(4) checks for errors and assertion violations.

Let us first consider the problem of setting up a DJOB. If there are two sites on the same machine with the same representations, the DHCP only has to check that the use of public slot names is compatible--essentially the same process as combining the externals of two load modules. If there are several machines involved and there is an incompatibility in representation of a primitive data type, then some conversion routines will have to be automatically invoked. The ARPA network voice protocol [Cohen, 1976] presents a good model of a scheme in which a dialog between machines is used to reconcile representation differences before messages containing data are sent. All of this is fairly messy, but should only be necessary when a new PLITS language processor is brought up on a machine. In the usual case, the standard conversions between sites will have been established and the negotiations between machines will be simple.

When a PLITS message is sent by a module in a site, its destination is checked. If it is within a site, the site kernel handles it; if not, it is given to the local DHCP. If the destination is within another site on the same machine, it is given to the kernel for that site; if not, the DHCP has it forwarded to the appropriate machine. This is the job of DHCP functions 1 and 2 above. To do this effectively requires quite a lot of mechanism beneath the surface. Problems faced include reliable transmission, flow control, error handling, and providing user services in a distributed operating system.
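The forwarding decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Address` type, the `route` function, and the tuple labels are hypothetical names, and the final step on the destination machine (its DHCP handing the message to the destination site's kernel) is inferred from DHCP function 1 rather than spelled out in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    """Hypothetical module address: machine, site on that machine, module at that site."""
    machine: int
    site: int
    module: int


def route(source, dest):
    """Return the ordered list of components that handle a message."""
    hops = [("kernel", source.machine, source.site)]
    if (source.machine, source.site) == (dest.machine, dest.site):
        # Destination is within the sending site: the site kernel handles it.
        return hops
    # Otherwise the message is given to the local DHCP.
    hops.append(("DHCP", source.machine))
    if source.machine != dest.machine:
        # The local DHCP has the message forwarded to the appropriate machine.
        # Assumption: the DHCP on that machine receives it.
        hops.append(("DHCP", dest.machine))
    # The DHCP on the destination machine gives it to the kernel for that site.
    hops.append(("kernel", dest.machine, dest.site))
    return hops


# Sites of DJOB A: S11 and S12 on Machine 1, S21 on Machine 2.
same_site = route(Address(1, 1, 1), Address(1, 1, 2))
same_machine = route(Address(1, 1, 1), Address(1, 2, 1))
other_machine = route(Address(1, 1, 1), Address(2, 1, 1))
```

The sending module sees none of this; it hands the message to its own site kernel in all three cases.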
Figure 2. Example Overview of PLITS DJOBs (DJOB A and DJOB B, distributed over Machine 1 and Machine 2, which are joined by a link)
A destination descriptor is a distributed data structure. The destination's site kernel has a portion, and each computer upon which there is at least one module sending messages to the destination has a portion (maintained by its DCM). One can view the portions on foreign computers as queue extensions. The primary job of the DCM is to maintain this distributed data structure to support module-to-module communication across computer boundaries.

Flow Control

The DCM extends the basic flow control strategy that is used by site kernels to include flow control for messages to foreign modules. This is done by providing (limited) local queue space for each foreign destination, a mechanism for forwarding messages to foreign site kernels, and a mechanism for communicating state information about a destination from its site kernel to (forwarding) DCMs. Thus, for message communication to a foreign module, the DCM acts as (an agent for) the foreign site kernel. That is, the DCM makes it appear to the sending site kernel as if it (the DCM) were the foreign site kernel. From the point of view of the sending module, its site kernel responds uniformly to messages sent to a local module or to a foreign one: the SENDMESSAGE call returns (perhaps after a delay) with a code that specifies either an error condition (see Section 4.8 below) or that the message was posted for delivery.

Reliable Transmission

One of the special problems of network communication is "reliable transmission." In general, messages sent over a communication line may be lost, garbled, or duplicated, and may arrive in a different order than they were sent. A communication system can provide a reliable transmission service in any of several ways, all of which depend on feedback from the receiver to the sender. DSYS provides reliable transmission on an "end-to-end" basis, rather than between each pair of computers along the way. The sending end is a (forwarding) DCM, and the receiving end is a destination, i.e., a (module, transaction) pair. It is the responsibility of the receiving site kernel to remember for each of its destinations the state (i.e., message sequence number) of the message stream from each foreign DCM. A "positive-acknowledgment, retransmission" protocol for reliable transmission is used between receiving site kernels and forwarding DCMs. This is exactly the communication path needed for end-to-end flow control! It turns out that the mechanisms for end-to-end flow control can be used (with very small additional cost) for reliable transmission as well. If the transmission line error rate is low enough (our experience indicates that it is), the expense of an (occasional) end-to-end retransmission is offset by the advantages of a simple and flexible low-level (i.e., computer-to-computer) protocol.

Computer-to-Computer Flow Control

A separate (but related) issue is flow control between adjacent computers. If a receiving computer cannot keep up with a sending one, it can either discard information when buffer space is exhausted or somehow ask the sender to "wait a while." The former strategy has the advantage of simplicity, but causes information to be lost, thus effectively increasing the communication line error rate. It is usually a bad idea to increase the effective line error rate to compensate for too simple a design. DSYS uses a straightforward version of the "wait a while" idea to control flow between computers. The basic strategy is the same for computer-to-computer flow control as it is for end-to-end flow control on DSYS connections: when the receiver finds that its remaining buffer space is critically low, it sends a "stop sending" request to the sender. As soon as enough buffer space becomes available, it sends a "continue sending" request. Enough extra space at the receiver is allocated to accommodate data that arrives while the "stop sending" message is in transit. A sender that has been stopped will resume sending after a time if no "continue" message is received (it may have been lost). In pathological cases, data will be lost, and end-to-end retransmission will be necessary. Once again, we assume that this will happen very infrequently, and that the parameter values for the space and time thresholds can be adjusted for an acceptable trade-off between minimal expected lossage and efficient normal operation.

4.6 Names

There is a question of how to generate unique names in a distributed system. If there were a central source of names, it might take a long time to get one, and the central source might be sometimes inaccessible. If each site created its own, there would either have to be a lot of handshaking or there would be a danger of duplications. Our solution is simple and quite general: a name (in the present design) is a 32-bit number, composed of four fields--a computer number, an "incarnation number," a site number, and a "local module number." The computer number uniquely identifies one of the computers in our network. The incarnation number is used to distinguish old incarnations of the operating system on the indicated computer from the most recent one. DSYS uses this information to trap references to defunct operating system incarnations (see the discussion on error handling in Section 4.8). The site number identifies a site on the indicated computer, and the local module number identifies a module at the site. Thus, a DSYS module name uniquely identifies a module in the distributed system.

One consequence of these definitions is that a given module instance always resides on the same machine, somewhat contrary to current fantasies about distributed computing. In our view, a module will be compiled to take full advantage of the hardware and software resources of its machine. There will be equivalent modules on various machines, and programs will be able to choose between them, but each will have a distinct unique name and machine of residence.

DHCPs and site kernels have names too. If the site number in a name is zero, the name identifies the DHCP on the indicated computer. If the site number is non-zero but the local module number is zero, the name identifies the indicated site kernel on the indicated computer.

The system uses the names of DHCPs and site kernels in its protocols for connections, flow control, and reliable transmission. That is, the distributed system uses messages to get its work done, just as a user DJOB does.

4.7 Access to Services

One of the tasks of the DSYS Host Control Program (DHCP) on each computer is to provide DJOBs with access to the services that the local operating system provides. Typical services include file system access, ARPANET, printer, TELNET, text editor, and facilities for creating and running a site as part of a DJOB. Each DHCP is equipped with a built-in module called "RequestFielder" which provides DJOBs with such access. There is a DSYS call that allows any module to find the name of the "RequestFielder" module at any DHCP. This is one application of a general "name service" facility within DSYS (not described here). A special message protocol is used to arrange with a DSYS RequestFielder for a service.

4.8 Error Handling

Many of the special problems of distributed computing relate to handling errors. In a conventional programming style (i.e., not message-based), subroutines are used as the primary structuring mechanism. The usual assumption about subroutines is that they are available when called, and that they function properly. The analog of a subroutine call in a message-based programming style is the "handshake": send a message, wait for a reply. In general, of course, message activity can be pipelined or multiplexed, and the relationships between incoming and outgoing messages can be much richer than a direct response to each query.

4.8.1 Errors Unique to Distributed Computing

In addition to all of the ways in which a subroutine can fail (bugs, bad specifications, name conflicts, etc.), messages can fail in (at least) the following other ways:

1. "Synchronous" errors
Synchronous errors are those that can be detected when a module executes a system call to send or receive a message. Synchronous errors arise because call parameters are bad in some way. Such errors can be reported as "failure" of the system call. Examples:
(a) Specified site (or module) does not exist.
(b) Specified computer is down.
(c) Incarnation number of specified computer operating system is out of date.
2. "Asynchronous" errors
Asynchronous errors are not immediately detectable as problems with the parameters to a system call, and can occur at any time. Examples:
(a) A message that was previously queued for delivery couldn't be delivered after all (because of a, b, or c above).
(b) A "demon" has discovered a problem. A "demon" is a service provided by DSYS whereby a module can request explicit (asynchronous) notification (via an EMERGENCY message) when a specified (other) module ceases to exist.
(c) A foreign site or service that was being used by this DJOB terminated abnormally.

Unfortunately, there are more opportunities for errors to occur in systems for distributed computing than in conventional systems. The programmer of a distributed computation must therefore give more thought to the problem of dealing with errors and "exceptional conditions" to provide an adequately robust program. Systematic conventions for how to deal with such errors should help, and we are developing some ideas along these lines for DSYS.

4.8.2 Emergency Messages

The present DSYS design provides "emergency" messages as the mechanism that the system uses to report asynchronous errors to a module. If a module has an emergency message on its input queue, the system will include a notice that there is a pending emergency message as part of the normal response to any call that sends or receives a message. This is only an initial attempt at providing a uniform mechanism for errors and other asynchronous conditions.

4.9 Implementation

An experimental version of DSYS is up and working in our local network. There are experimental DHCPs for the ALTOs and for the PDP/10, and the Eclipse DHCP is in the final stages of debugging. Each DHCP has most of a Communications Manager, a name server, and a rudimentary Job Manager (presently a RequestFielder that provides file service).

4.10 Summary

There is a rapidly growing awareness [Hoare, 1977] that the paradigm of a collection of communicating sequential processes is a useful and powerful concept for solving problems and for developing computer systems. In the usual way, progress requires the development of concrete systems which both test ideas and lead to new ones.

What can be done to provide systematic conventions for dealing with the errors and exceptional conditions that occur in distributed computations? In particular, how can such a system be made robust? What can be done to maintain the integrity of a distributed system (and of innocent user jobs) when either a user job or a part of the system fails?

How should "user job" be defined? What services should the distributed system provide, and how should user jobs deal with the distributed system? What are the special problems of user jobs in such an environment, and how can the distributed system help?

4.11.2 Longer-Range Questions

How can performance be monitored and distributed computations (and systems) be tuned? In general, how should the programmer think about an execution of his computation? What tools can the system provide to help in this regard? Such tools should also be helpful to the system designer.

How can such a system be made reliable? Are there practical descriptive techniques for the protocols of real distributed computations? How can such a description be used effectively to uncover design problems or generate tests? How much of this can be automated?

References

Ball, et al., "RIG, Rochester's Intelligent Gateway: System Overview," TR5, Computer Science Department, University of Rochester, April 1976; also appeared in IEEE Transactions on Software Engineering, Vol. SE-2, No. 4, December 1976.
Cohen, D., "Specifications for the Network Voice Protocol," ISI/RR-75-39, Information Sciences Institute, University of Southern California, March 1976.
Foster, John D., "Distributive Processing for Banking," Datamation, July 1976.
Hoare, C. A. R., "Communicating Sequential Processes," Computer Science Department, Queen's University, Belfast, March 1977.
Hoare, C. A. R. and Wirth, N., "An Axiomatic Definition of the Programming Language Pascal," Acta Informatica, Vol. 2, 1973.
Rovner, P. D., working paper, to appear as TR22, Computer Science Department, University of Rochester, 1977.