Distributed Counting
Distributed Counting
Diss. ETH
1
12.82k
e*.B
*
Distributed
How to
Counting
Bottlenecks
Bypass
Roger
ETHICS ETH-BIB
P Wattenhofer
00100003152448
Leer
Vide
Empty
Distributed
How to
Counting
Bypass
Bottlenecks
Technology (ETH)
Zurich
Dipl. Informatik-Ingenieur
ETH Zurich
accepted
on
the recommendation of
Widmayer,
Examiner
Herlihy, Co-Examiner
Leer
Vide
Empty
Can you do addition?" the White Queen asked. "What's one and one and one and one and one
and
one
and
one
and
one
and
one
and one?"
"I don't
-
know,"
Lewis
Acknowledgments
am
grateful
to
Guntlin, and especially Peter Widmayer and Maurice Herlihy for proof-reading and correcting (at least a couple of pages) of this
Simon Poole, Stefan
manuscript.
Finally,
rest of the
gratitude to family,
my
and
Contents
Introduction
1.1 1.2
Distributed
2 4 5 7
...Counting
Performance
Overview
1.3
1.4
Central Scheme
2.1
2.2
11
Construction
Performance?
11
12
2.3 2.4
Queueing Theory
Performance!
13
15
Decentralization
3.1
3.2
19
Introduction
Lower Bound
20 22
3.3
Upper
Bound
26
Counting
4.1
Networks
31 32
Construction
4.2
Counting
vs.
Sorting
Revisited
37
4.3
4.4
Queueing Theory
Performance
40
42
Diffracting
5.1
Tree
47
Construction
48
ii
Contents
5.2
Performance
52
Combining
6.1 6.2 6.3 6.4
Tree
59
Construction
Performance
60
62 65
Counting Pyramid
Performance
68
Optimal Counting
7.1 7.2
73 Tree 74 76
Synchronous Combining
Lower Bound
Simulation
8.1 8.2
81
Model Results
81
84
Discussion 9.1
93
Applications
Properties
93
9.2
96
Abstract
A distributed counter is
variable that is
an
common
atomic test-and-increment
counter value to the
we
counting,
with the
emphasis
on
efficiency.
system's
counter
value
with
distinguished
central processor.
and-increment operation,
However, with
they send a request message to the central a reply message with the current counter value. number of processors operating on the distributed large
processor will become
a
bottleneck.
There will be
primary goal of
Since the
a
counter.
absence of
the
this work is to implement an efficient distributed efficiency of a distributed counter depends on the bottleneck, any reasonable model of efficiency must comprise
essence
of bottlenecks.
a
In
one
approach,
we
messages which
during
a
series of test-anda
increment
operations.
"sequential" setting. Since distributed counting has a multiplicity of applications (allocation of resources, synchronization of processes, mutual exclusion, or as a base for
distributed data
structures),
part of the work we present the three most important an efficient distributed counter: The family
Counting
Networks
are
Networks, the
by Aspnes, Herlihy, and Shavit; Counting Networks an elegant decentralized structure, that had enormous influence on research. Contrary to Counting Tree Shavit and Zemach demands Diffracting proposal by
iv
Abstract
notion of time.
Finally
we
present the
one
Combining
request, is
idea, combining
meta
(Gottlieb, Lubachevski,
Rudolph).
time,
the
we
study
and other
a completely revised tree concept with systematic combining improvements. A randomized derivative of the Combining Tree,
further
advantages
in
practice.
to
analyze
an
the
expected
bottleneck,
propose that
no
processor
can
handle
schemes
we
unlimited number of messages in limited time. We show that all three are considerably more efficient than the central scheme. Moreover,
distributed
There
are
is
an
asymptotically optimum
speed which
examine
other
characteristics
counter.
beyond
pure
we a
are
desirable for
distributed
In
particular,
We want
stronger
counting scheme to the than test-and-increment more only. And we provide powerful operations wish that a counting scheme adapts instantaneously to changing conditions in the system. Since these points are satisfied by the Counting Pyramid,
it is
a
correctness conditions
(e.g. hnearizabihty).
promising distributed
practical applications. We
further discuss
time.
model-specific analyses
with
are
issues such
as
Our theoretical
case
studies
many
their
structures such
stack
or
queue, and
dynamic
load
balancing).
Kurzfassung
Ein
Zahler
in
einem
System
verteilter
die
der
jedem
Prozessoren ist eine Variable, test-and-tncrement-ZugriS erlaubt: anfragenden Prozessor mitgeteilt, und
um
Systems
wird
eins
erhoht.
In
dieser
Arbeit
Effizienzbetrachtungen
des
besonderer Wert
gelegt
wird.
Systems bei einem zentralen Prozessor. Greifen andere Prozessoren mittels test-and-increment-Oper&tion auf den Zahlerwert zu, senden sie dem zentralen Prozessor eine Anfragenachricht und erhalten postwendend
eine Antwortnachrieht mit dem aktuellen Zahlerwert.
System gross, wird der zentrale Prozessor uberlastet. Zu viele Anfragenachrichten treffen in zu kurzer Zeit beim zentralen Prozessor ein. Die Anfragen konnen nicht mehr postwendend beantwortet werden
Prozessoren im
-
sie stauen sich beim zentralen Prozessor. Der zentrale Prozessor wird
zum
Systems.
ist, einen effizienten verteilten Zahler
von zu
primare
realisieren.
Nadelohrs
Effizienz-Modell
Anzahl der
die Nadelohr-Problematik in
jedes verminftige
Serie
von
test-and-mcrement-Zugriffen
zeigen
eine nicht-
triviale untere Schranke und stellen einen verteilten Zahler vor, der diese
untere Schranke in einer
sequentiellen Umgebung
hat
erreicht.
Da verteiltes
Anwendungen Ressourcen, (Allokation Synchronisation von Prozessen, gegenseitiger Ausschluss oder als Basis fur verteilte Datenstrukturen), gibt die untere Schranke Aufschluss iiber den
Zahlen eine Vielzahl
von
von
vi
Kurzfassung
Im
zur
prasentieren wir die drei wichtigsten Vorschlage Realisierung verteilter Zahler: Die Klasse der Counting Networks von Aspnes, Herlihy und Shavit. Counting Networks sind verteilte Zahler mit eleganter dezentraler Struktur, die einen grossen Einfluss auf die Forschung hatten und haben werden. Im Gegensatz zu Counting Networks
Hauptteil
der Arbeit
effizienten
verlangt der Diffracting Tree von Shavit und Zemach von den Prozessoren einen Zeitbegriff. Schliesslich stellen wir den Combining Tree vor. Dessen Schliisselidee, mehrere Anfragen zu einer Meta-Anfrage zu kombinieren, ist schon seit 1983 (Gottlieb, Lubachevski und Rudolph) bekannt und wird von Eine randomisierte uns komplett iiberarbeitet und signifikant verbessert. Verwandte des Combining Tree, die Counting Pyramid, verspricht weitere Vorteile in der Anwendung.
Alle drei
und
andere
Eigenschaften
gepriift. Wir analysieren die erwartete Zeit, die ein test-and-mcrementZugriff kostet. Um der Nadelohr-Problematik Rechnung zu tragen, verlangen wir, dass kein Prozessor eine unbegrenzte Anzahl Nachrichten in begrenzter Zeit verarbeiten kann. Wir zeigen, dass alle drei Schemata wesentlich effizienter als der zentrale Zahler sind. Des weiteren besprechen
wir, weshalb der Combining Tree der asymptotisch bestmogliche verteilte
Zahler ist.
Neben purer
Geschwindigkeit
einem
uns
diverse
weitere
Eigen
schaften
von
verteilten
untersuchen
insbesondere
Korrektheitsbedingungen (Stichwort hneanzabihty), die Flexibilitat, aufwendigere Operationen als test-and-mcrement anzubieten und Diese die Anpassungsfahigkeit an sich andernde Umstande im System.
verscharfte
Eigenschaften sind die Starke der Counting Pyramid, weshalb sie ein vielversprechender verteilter Zahler fur den praktischen Einsatz ist. Uberdies besprechen wir modellspezifische Punkte wie Fehlertoleranz oder den Zeitbegriff.
Unsere theoretischen
Analysen
werden durch
Ergebnisse
einer Simulation
gestiitzt. Wir prasentieren verschiedene Fallstudien und deren Interpretationen. Ausserdem stellen wir einige wichtige Anwendungen des verteil ten Zahlens (verteilte Datenstrukturen wie Stack, Queue oder dynamische
Lastverteilungsverfahren)
vor.
Chapter
Introduction
has been trying to gain power by summoning job. One believed in a linear relationship of the number of workers and the job duration: if one person could complete a job in n days, then n persons could complete the job in one day. However, this is true only if all the persons perfectly work together as a team and do not quarrel. If the job calls for intensive coordination, the linear relationship does not seem achievable. In such a case, merely increasing the number of workers will not suffice. It may thus be seen that what is mostly needed is
man
a
workers to
complete
example in mind let us evaluate digital computers and their performance. Since 1950 there has been a remarkable rate of increase in
With this
computing speed of approximately ten times every five years. However, the switching time of current processors is getting closer to the physical limit of
about 1 ns, due to natural constraints such as the speed of light and power dissipation (see [Noy79, Kuc96, Sch98] among others). Having learned from the first paragraph, a logical way to improve performance beyond a single "processor" is to connect multiple "processors" together, a technology that came to be known as parallel processing or distributed computing.
Some problems arising in digital parallel processing are identical to "manual parallel processing": How can one orchestrate the processors in such a way that they work together as a team and not mutually hinder themselves?
How much coordination overhead is inevitable?
Can
one
a
come
up with
a
techniques
so
scales,
adding
processor to
more
Introduction
insight into these questions, computer scientists have come up simple problem which lies in the core of distributed coordination: distributed counting. Since many other coordination problems and data
could call it the
depending on (or related to) distributed counting, Drosophila of distributed coordination and distributed data structures. The design and analysis of distributed data structures draw theoretical interest from the fact that, contrary to distributed algorithms,
structures seem to be
one
An operation
each
may
interfere with
invested
an
operation of
by
futile.
be
seen
as
an
evolvement of the most fundamental distributed coordination primitive, mutual exclusion [Dij65, Lam74]. Whereas mutual exclusion primarily intended to for
accomplish correct coordination, distributed counting aims providing fast coordination among processors. In this work, we will
counting, present
and
and
analyze
several
interesting
as
schemes
that count,
deal
with
related
lower bounds.
answer
"10!" instead of
being peppered
The rest of this
with increments
a
by
the
Loosely speaking, we will show losing count when White Queen if Tweedledum and
follows.
In Sections 1.1 and 1.2
helping
is
hand.
as
chapter
organized
respectively,
present our distributed system model and explain what counting is. In Section 1.3, we point out how the performance of a counting
we
Finally,
an
overview
of the
complete
work is
given
in Section 1.4.
1.1
Distributed
a
...
Our definition of
definition of
an
distributed
system
is
analogous
to
the
standard
or
asynchronous
on
practical
text books
a
We deal with
distributed system of
one
processors.
Each processor is
n;
the value
knowledge. Each processor p has unbounded local memory, which is accessible exclusively to processor p. There is no global memory. Processors can interact only by passing messages. To do so, the processors
use
consume.
Whenever
primitive,
message
(a
copy of some
portion
of processor
p's
local
memory)
1.1 Distributed
...
Since
the
more
than
one
processor q at
(approximately)
queued at processor q. Processor q will consume and handle the messages sequentially in first-come-first-serve order (with ties broken arbitrarily) by executing the consume primitive. Transferring a message from processor p to processor q takes an unbounded, but finite amount of time. Both, sending and consuming a message, need some local computation. The sending processor p must specify the receiver and start the transfer of the message. The consuming processor q must remove the message from the queue of incoming messages, and parse the content of the message. The system is asynchronous: executing the send (consume) primitive takes an unbounded, but finite amount of time. While a processor p is sending (consuming) a message, processor p is busy and cannot do anything else. Transferring a message is independent of any
same
processor, however.
-*
time
send
processor p
J^7
processor q
arrival
consume
Figure
1.1:
Transferring
Message
a
Figure
q.
1.1 shows
an
example:
Processor p sends
is
message
to processor
First,
message
is transferred.
arriving
already
m
waiting
to be consumed.
are
The message
is
consumed.
Finally, the
message
by
processor q.
Every
processor q is
polling
incoming
messages.
Whenever
the queue is not empty, the first message in the queue is consumed and
immediately
q is
handled.
While processor q is
handling
executing a loop of (a) checking its queue for incoming messages, (b) consuming the first message in the queue, and (c) immediately handling
it. It is not processor
possible
to broadcast
or
multicast
message to
not
more
than
one
primitive. It is more than one message at once by executing the is, sending and consuming are both sequential.
by executing
the send
Introduction
We
assume
any
are
postulate that
ever
no
failures whatsoever
no
occur
in the system: No
be lost
or
corrupted,
processor will
ever
crash
or
behave
badly.
point, our distributed system differs from the standard definition asynchronous message passing system: We assume that the expected time for transferring a message is known a priori, and we assume that the expected time for doing a local computation (e.g. add two values of the local memory, executing the send or the consume primitive) is known a priori. This does not conflict the condition that there is no upper bound on the time for transferring a message (or doing some local computation); the probability that a message takes much more than expected time will be small however. Because the expected times are known, we cannot claim that From a quite provocative standpoint, our system is totally asynchronous. one could say that knowing the expected times is only necessary when analyzing the performance (but not for correctness). We will discuss this along with other model related issues in Section 9.2, when we have further
In
one
of
an
insight. For
common
the moment,
assume
knowledge
of
expected
our
times.
are
on
parallel/distributed
the message
passing
as
model, because other popular models (e.g. shared memory) are not to the concept of queueing that is used extensively in this work.
close
1.2
...
Counting
implemented for such encapsulates an integer value
operation inc (short for test-and-increment): When a operation, then a copy of the system's counter
initiating processor p and the system's counter (by one). Initially, the system's counter value val
operation until the completion
the counter value is returned to the
(when
initiating
processor)
1.2.
by
T.
Compare
with
Figure
Definition 1.1
it
(Correctness)
counting scheme
is
said to be correct
if
fulfills
1.3 Performance
time
initiation
^-^C
"*\
"\
completion
<
system
*"u0
operation
time
Figure
1.2:
Operation
Time
(1)
No
No
duplication:
omission:
No value
is
returned twice.
is
(2)
number
Definition 1.2
(Quiescent State)
inc
The system
is
in
quiescent
state
operation
is
completed.
// the system is in a quiescent state and k mc operations were initiated, exactly the values 0,1,..., k1 were returned by a correct counting
scheme.
[HW90], when,
More
in addition to
accomplishing
assigned
they
were
requested.
formally:
(Linearizability) A correct counting scheme is called [HW90] if the following is granted. Whenever the first of two
is
operations
completed before
the second
is
For many applications, the use of a Hnearizable counting scheme as a basic primitive considerably simplifies arguing about correctness. As explained in [HW90, AW94], linearizability is related to (but not the same as) other correctness criteria for asynchronous distributed systems [Lam79, Pap79].
1.3
Performance
finding
an
counting
of
is
strongly
finding
an
appropriate
measure
efficiency.
Introduction
measures of efficiency in distributed systems, such as time or complexity, will not work for distributed counting. For instance, even though a distributed counter could be message and time optimal by just storing the system's counter value with a single processor c and having
The obvious
message
access
only
one
message
exchange,
not scale
such
an
implementation
a
clearly
whenever
frequently,
the processor
operation bottleneck;
there will be
high
message
congestion
In other
words, the
work of the scheme should not be concentrated at any single processor or within a small group of processors, even if this optimizes some measure
of
measure we
of
care
of
no
that
send
or
receive
an
For concreteness in
our
us
assume
that
depend
or
on
work is
always
the
bounded
by
counter value
addition,
assume
that
some on
amount of
computation
at
tc
average.
some
We
message, need
a
local
processor is to send
message to every
consume
the message
expected
time.
operation from
an
probability for
processor to initiate
to the size
operation
in
small
interval of time be
proportional
frequency
scheme.
is substantial when
us
arguing
that
about the
on
performance of
counting
Let
therefore
assume
operation
the
initiations at
processor.
These the
put
us
in
position
performance
variables with
however,
up, let
we
can be expected values for random generally arbitrary probability distributions. Sometimes, will make specific assumptions on these distributions. To sum
us
n:
local
computation.
1.4 Overview
message.
operations
processor.
Having four system parameters enables us to argue very accurately on performance. The drawback of sometimes ugly formulas is balanced by the advantage of more insight due to precision. The performance model can be applied on many theoretical as well as real machines. One can derive
results for machines with fast communication communication Our
(e.g. PRAM) as well as slow (e.g. Internet) just by choosing tc and tm accordingly.
performance model is related to other suggestions towards this end. The LogP model [CKP+96] is very close, although we do not distinguish "overhead" and "gap", two parameters of the LogP model that are always within a constant factor of 2. If one is interested in the performance of a counting scheme according to the QRQW PRAM model [GMR94], one has
to set
tc, tm
:=
1 and
t%
(the
initiations)
to
an
arbitrary integer; that is, the QRQW PRAM may be seen as a synchronous special case of our message passing model. In [GMR94], it is debated that the standard PRAM model does not reflect the congestion capabilities of
real machines: it is either too strong
(concurrent read/write),
or
too weak
are
(exclusive read/write);
approximated
are
well
with the
[Val90],
[BK92],
Message
popular
completely different
Unlike
a
model for
congestion that
and
our
[DHW93,
DHW97]. (a
similar
approach
1.4
In this
Overview
chapter,
the
we
have
presented
counting is. Additionally, we have discussed how we will performance of a counting scheme in most parts of this work.
have
a
Let
us now
look at what is to
come
in the
following chapters.
Chapter
is
today
many available
Introduction
We
see
counting
is not
there is
point
at which additional
give no additional performance. In order to assess the expected performance of the Central Scheme, we give a short introduction into fundamental queueing theory. Chapter 3 organized"
and prove
studies exist whether
are
[WW97b].
a
problem
a
of
"centrally implementing a
not
processor is
communication bottleneck
lower and
matching
Chapter
-
give hope
that there
are
some
coordination overhead
seems
unavoidable, however.
scheme
advisable to construct
real
counting
study,
From
there exist many similarities between the upper bound and the
efficient
counting
schemes
presented
in
Chapter
a theoretical point of view, the most interesting counting schemes are family of Counting Networks [AHS91] which are presented and discussed in Chapter 4. By introducing the theory of queueing networks, we are in the position to show that the Counting Networks are indeed much more efficient than the Central Scheme. To our knowledge, this is the first performance analysis of Counting Networks by means of queueing theory. We prove that Counting Networks are correct, though not Unearizable. Moreover, we discuss the relation of Counting Networks and their close relatives, the Sorting Networks.
the
Right
upon
simple and efficient but not Unearizable counting scheme. Unlike Counting Networks, the Diffracting Tree exploits that processors have a notion of time, in our model.
show that the
Diffracting
Tree is
Combining Tree [WW98b], the most Its key our performance model. for time a long idea, combining several requests, is well-known [GLR83]. For the first time, our proposal studies the combining concept without underlying hardware network but with systematic waiting in order to promote combining. Also in this chapter, we present the Counting Pyramid [WW98a], a new adaptive derivate of the Combining Tree. We believe that the Counting Pyramid is the counting scheme with most practical relevance
In
Chapter 6,
we
present
the
up to date.
After
counting schemes, we ask what the best possible performance for distributed counting is. In Chapter 7, by making a detour to a synchronous machine model, we present a lower bound for having
seen
1.4 Overview
counting
Tree
In
in
our
machine model is
[WW97a].
our
Combining
(Counting Pyramid)
we
asymptotically analysis
and
possible.
simulation
the
Chapter 8,
complement
an
[WW98b].
excellent method to
assess
performance
For
purity
counting
theoretical
with their
results of various
case
studies
are
given, along
certainly
not
we
least, Chapter 9
First,
coordination problems and distributed data structures closely related to counting. Then, performance along with non-performance properties (linearizability, flexibility, adaptiveness, fault tolerance) are discussed. Moreover, we debate on the machine model and various associated attributes such as waiting or simplicity.
processor
are
indeed
Except the introduction, most chapters of the work are self-contained. Figure 1.3 gives an intuition how the chapters and sections depend on each other. The heart of the work is the discussion of the three fast counting
schemes
(Chapters 4, 5,
-
and
6).
can
be read
performance analysis (Sections 2.4 4.4, 5.2, 6.2, and 6.4) needs knowledge of queueing theory (Sections 2.3 and 4.3). Chapters 3 and 7 are theoretical; lower bounds on counting issues are shown. The fe-ary tree structures used in these chapters are related to the Combining Tree the distributed counting (Chapter 6). Section 9.2 sums everything up expert is encouraged to read this section now and check the details when
independently
needed.
are the new Combining Tree and Counting Pyramid (Chapter 6), various lower bounds on counting (Chapters 3 and 7), and last but not least a universal framework to analyze the average case performance of counting schemes by means of queueing theory (Sections 2.4, 4.4, 5.2, 6.2, and 6.4) which is supported by simulation (Chapter 8).
adaptive derivative
the
10
Introduction
Fast
Counting
Schemes
1
f
Networks
Counting^* 1
Applications^-v
/^~~\_.4.4A
4.3
C^ir
^"f 9.1 J
\
>.
ltroductionl
I Discussion
{9.2 X^^F^
/
/
S
I I
Diffracting 1 Tree J
/V-Crroperties
/ Queueing S. Theory
PV
\w^
JfCombining\ Tree ^-X 1
l 6.3
2.3
Central Scheme
1
I
MnsJ
)
[Jr\
1
Simulation 1
\3L6-2
/
J ""CTounting
Pyramid
f~%^
Decentra- 1
lization
(jfc-ary Trees^
Theory
Lower Bounds
Optimal TP I Counting jf
jf
Figure
Chapter
Central Scheme
In this
chapter,
as a
we
in
Section 1.3
stimulus for
research,
Arguing
try
about the
we
In
we
Section 2.2,
to do
chat
informally
2.3,
about various
problems that
to
arise when
so.
In Section
-
we
give
queueing theory
just enough
to be in the
position
analyze
the Central
2.1
Construction
Scheme,
an
value,
processor
is to administrate
Initially,
is 0.
Every
processor in the
an
Whenever
a
request operation, request message contains the identification of the initiating processor
When processor
processor
a
c
processor p sends
This
p.
consumes
by first returning
by one. locally incrementing reply reply message is finally consumed by the initiating processor p, thus completing processor p's inc operation. If processor c itself initiates an inc
message and second
This
operation,
by
one
processor c (locally) reads the counter value and increments it (without sending a message). Figure 2.1 sketches the organization
c
(with
processor
as
the
center)
12
Central Scheme
Figure
Theorem 2.1
is
are
assigned centrally
at processor
no
Since the local variable val is incremented with every request message,
value is returned
correctness condition
(1)
c
of Definition
is not
request messages
at processor
larger
are
consumed
by
0,..., k
are
returned. Since at
operations
initiated,
correctness condition
(2)
of Definition
Linearizability: Let processor c decide at time td(v) that some inc operation will receive counter value v. This mc operation is initiated at time tl(v) and is completed at time tc(v). Because the time for decision must be between initiation and completion, we know that tl(v) < td(v) < tc(v). Moreover, we have td{a) < td(b) => a < b because the local variable val is
1.1 follows.
never
decremented at processor
c.
The
1.4)
operations
is initiated
second
(a
<
(tc(a) 6)" is
<
tl(b))
fulfilled since
td(a)
<
tc(a)
tl(b)
<
td{b)
=> a <
b.
2.2
Performance?
accesses
of the processors to
are
long enough that the Central Scheme can handle all other, without being a bottleneck processor. Then, requests one the total operation time is T tm + tc + tm, because, a request message
processor p is
after the
does
some
to
time tm), processor c (time tc), and a reply As we processor p (time tm)
(taking
steps
will
see
in the
following paragraph,
this
analysis
was a
2.3
Queueing Theory
13
It is
possible that
same
many
at processor
a
at
more or
less the
time.
anism for
one
incoming
we are
Queued
all
n
messages
are
handled
(first-come-first-served).
c
Whenever
inc
unlucky,
or
operation
at more
less the
same
receives
re
These requests
one
first
costs
sequentially.
a
Since
an
handling
an
request
=
operation time of T
significant
gap between
ation time T
time T
systems. A
scheme
sages
can are
be reached
meaningful by studying
To
answer
performance
case:
of
counting
mes
expected
How many
when processor
A.K.Erlang
Today
it is
theory
a
of
queueing early
in the twentieth
century.
the next
widely
used in
broad
variety
of
applications.
In
paragraphs we will show a couple of definitions and basic results in queueing theory. Following the tradition in the field, we will use other names than everybody else, because the usual names (such as client, server, job, customer) could be misunderstood in our environment. By using the
expected behavior as a benchmark, we get a simple and consistent assess the performance of the Central Scheme.
way to
2.3
Let
us
Queueing Theory
define
some
about the
processor
c
performance
in
our
arguing queueing system, such as the Central Scheme. At the single processor queueing system
of
a
single
processor
of processor c, messages arrive from the left and exit at the right, depicted in figure 2.2. Processor c consists of a handler that handles the incoming
messages,
and
before
being
handled.
The messages
manner.
arriving
system
are
handled in first-come-first-serve
In
our
to the
requesting
application, handling a message is returning a counter value processor and incrementing the local counter value by one.
a
In
a
our
processor to initiate
an
inc
operation
a
in
proportional
This
very
14
Central Scheme
Processor
0Queue
Handler
Figure
2.2:
Single
Processor
Queueing System
on
The Poisson
many other
a
important
In
a
as
building
block.
probability,
in is
larger
than
some
P[t <t0}
where
process j
l-e
-At,
is
the
two
arrivals.
This It
is
Poisson the
has many
interesting
mathematical
properties:
only
continuous stochastic process that has the Markov property (sometimes called memoryless property). Loosely speaking, the Markov property means
that the
process is of
no use
in
predicting
the
future.
probably illustrated best by giving an example known as the bus stop paradox: Buses arrive at a station according to a Poisson process.
mean
The
Suppose
now
that you
What is the expected time until the next bus arrives? Naturally, one would like to answer one minute. This would be true if buses arrived deterministically. However, we are dealing with a Poisson process. Then, the correct answer is ten minutes.
we
will
one
use
joining independent
Poisson process.
resulting
process is
a
a
equal
to the
sum
large
number of
Poisson).
can
It is shown in
a good choice to model the independent users (which are not necessarily [KT75] that, under relatively mild conditions, the n
aggregate arrival
be
process of
independent
a
processes
(with
rate
arrival rate
A
as
n
-)
oo.
approximated
a
well
by
>
Therefore,
practical
one
realistic
assumption
in
many
situations For
(e.g.
distributed
counting).
process
completeness, let us mention the reverse: When splitting randomly in k processes, every of the k resulting processes is
2.4 Performance!
15
more
please consult
book
on
queueing theory
own
stochastics
[KT75, Nel95].
you know that it
From your
experience
with real
queueing situations,
handling
system collapses.
handler, denoted by
It
can
busy.
be calculated
as
-,
is the where j is the average time between two message arrivals and A and are positive, we have p > 0. It Since time. both, p, handling average is crucial that the handler must not be overloaded results in p < 1. Thus p
can
-
that is
<
which
As every textbook
processor
queueing theory presents the analysis for the single queueing system, we will only show the result that is interesting
on
for
us.
Theorem 2.2
tr for
(Expected Response Time) The expected response time in the queue + time to be handled) is message (tr
time
t^E^(1+2(l^(1+C-))
where H
is
for
=
the
handling
H
time
(note
=
that
E[H]
)>
^), 0%
the
variation
expected of
of the
(defined by Cjj
mm*
an<^ 0 < p
< I
is
the utilization
handler.
from the
(such as the expected number of single processor queueing system) can be derived directly other by applying the main theorem in Queueing Theory:
interesting
coefficients
Theorem 2.3
(Little)
The
processor queueing
system
N
is
in
the
single
response time
tr with N
Xtr, where A
2.4
We're
Performance!
now
in the position queueing theory.
to
analyze
assume
by using single
a
processor
We
operation
at random
(that is,
the
probability of an
small
time interval is
proportional
interval); therefore,
initiation
16
Central Scheme
is
Poisson process.
time is T
tm + tr + t,,
For
instance,
let
us
assume
handling
request
tr
time,
deterministically.
Lemma 2.4 When
handling
is
is
deterministic with
handling
time tc,
the
expected
response time
tc 2
=
p
,
tr
where
tc
p
21-/
Proof: We have
< 1.
U
0.
E[H]
=
tc and
Var(H)
we
we
get
tr
^ T^~
w^h P
joined
are one
single
get the
access
a
p-,
where
processor.
Please mind
applicable.
The
investigate
another handler.
larger
than i0 is
=
P[t <tQ]
where
-
l-
e~iu
is the
expected handling
time. A
single
processor
queueing system
exponential handling is often called M/M/1 system both, arrival process and handling process, have the
one
one
gets
is
Lemma 2.5
The
expected
tr
=
response time
for
an
exponential
< 1.
handler
1-p
where p
i,
It has been found reasonable
How realistic is
an
exponential handler?
a
telephone call or the consulting time at For our application, one could argue, that there an administration desk. might be cache misses when handling a message and therefore the handling time is not deterministic. But obviously, arguments like this are dragged in by head and shoulders. The real motivation comes from simplicity and tractability of the results. When we will investigate more complex schemes, organized as networks of queueing systems, we will use the M/M/1 system with more complex handlers, the counting schemes as a building block are not analytically tractable.
for
2.4 Performance!
17
Lemma 2.6
The
the
expected
as
exponential
time
handler is
as
ymptotically
handler.
same
the
expected
response
with deterministic
Proof:
time is
t*/t*
ter be the expected response time when the handling deterministic (Lemma 2.4) resp. exponential (Lemma 2.5). Since 1 and 0 < p < 1, we conclude that \t% <tf < t%.
Let
resp.
-
tf
Theorem 2.7
constant k.
(Performance)
Let p
y*
time
<
for
positive
Then the
expected operation
of
0{tm+tc).
2.6,
=
we
expected operation
time is T
tm + j- + tm. As
in Theorem
1.
not be
^
a
< 1
i,
we
have T
2tm
whenever p >
+
When
D
tc).
simple argument [WW97c], one can show that this is asymptotically there is no counting scheme that does not need any local computation or message exchange. Whenever there is no positive constant By optimal;
k such that
nlf-
p < 1
By
but
means
of
queueing theory,
distributions:
can
argue not
only
on
complete
The
probability
larger
than to is
P[t>t0]
Therefore,
A;
is
>
a
e~^1'p)t.
larger
time than
the
ktr, where
=
constant
=
tr
is
not
the
expected
the
(tr
^f-)
is
P[t
by
k.
ktr] 0(tm
e~k. Thus,
+
=
only
expected operation
time is bounded
bounded by T
tc) but also the response time of a single inc operation is 0(tm + tc) with probability 1 e"k for a positive constant
p > 1? As
What
happens when
c
already
mentioned in Section
2.3,
processor
response time of
Instead of
solving
the
just sending a request inc operation, processor p waits congestion problem are known for
some
(e.g. exponential back-off [GGMM88], or queue lock mechanisms [And90, MS91]). Note that these solutions may help against congestion but
an
do not have
improved operation
are
other initiations
18
Central Scheme
bulk request
values instead.
Processor
adding
average
by replying an interval of z values and Suppose the waiting time for sending a processor p is distributed exponentially with
c
waiting
time tw.
is A
f^.
As
we
want to
guarantee p
1,
we
must have
=
tw
>
M/M/l
j^f-
where p
n%c-.
Remember
inc
operation
some
message is sent.
paradox
=
expected waiting
time is
operation
:=
time is T
+
=
2tm
by setting tw
2tm
+
tc(n
tc(n 2^ + 1)
Summarizing,
T
=
(sparse access);
Since in every
the
0(tm
tc).
counting
scheme
operation
exchange a message and do some calculation in order to fulfill the correctness criteria, the Central Scheme is asymptotically optimal when Whenever ti is too small, one can have an expected access is very low. operation time of T 0(tm + ntc) when using waiting. The expected
must
=
operation
system
-
time is
the
growing linearly with the number of system does not scale, adding a processor
more
processors in the
to
a
computation
does not
give
performance.
Chapter
Decentralization
Having
wonder
scheme
seen
counting scheme in the previous chapter, one might Is there a counting a better solution is possible: all, whether, which is not central, thus decentral, but fulfills the correctness
a
naive
at
coordinate the
We
processors?
of
study
prove
some
n
the
a
problem
implementing
no
processor is
we a
communication bottleneck
lower bound of
that
over
processor must
exchange
-
sequence of
processors.
helps
to
counter
that
achieves this
once
bound
when
each processor
exactly
in Section 3.3.
bound is
much
a
tight.
Even
though
problem such
a
counting
can
be
decentralized, it
real
along
this theoretical
(which
abstracts
considerably from
in
Chapter
The
6.
in this
reasoning
chapter
a
is
A quorum system is
collection of sets of
loosely related with that in quorum systems. elements, where every two sets
20
Decentralization
laid in
[GB85]
in the
and
[Mae85]; [Nei92]
proposed
seventies
contains
good
survey
on
the
topic.
starting
continuing
until
today
3.1
Introduction
deriving a lower bound, we will restrict the model of compu Chapter 1 in two ways. First, we will ignore concurrency control problems, that is, we will not propose a mechanism that synchronizes mes
For the sake of tation of
sages and local
operations
at
processors.
Let
us
therefore
assume
that
enough time elapses in between any two inc operations to make sure that the preceding inc operation is completed before the next one is initiated, i.e. no two operations overlap in time. Second, we assume that every processor initiates the inc operation exactly once. This restriction will simplify the one could easily derive the following argumentation but is not essential results for a more general setting.
The execution of
a
single
inc
operation proceeds
as
follows.
Let p be
To do so,
operation.
messages to
others, and
After
p receives the last of the messages which lets processor p determine the
current value val of the counter
(for instance, processor p may simply receive message). This does not necessarily terminate the
messages may be sent in order to
operation: Additional
operations. As soon as no further messages are sent, the execution of the inc operation terminates. During this execution, the counter value has been incremented. That is, the distributed execution
prepare the system for future
of
an
inc
operation
is
partially
system.
Figure
3.1: Execution of
an
inc
operation
3.1 Introduction
21
We
can
an
inc
operation
an
as
directed, acyclic
operation initiated graph (DAG). Figure DAG label of A with the node 17. represents processor q by processor q performing some communication. In general, a processor may appear more than once as a node label; in particular, the initiating processor p appears as the source of the DAG and somewhere else in the DAG, where processor
3.1 shows the execution of inc
p is informed of the current counter value val. An Pi to
a arc
from
node labeled
sending
of
to processor p2.
initiating
receive
a
processor p, let
Ip
or
message
during
the inc
papers
quorum
systems
[Mae85].
increment the
Lemma 3.1
(Hot Spot) Letp andq be two processors that Then IpC\ Iq / 0 must hold.
contradiction, let
us
Proof:
Because
assume
that
Ip
Iq
Iq
0.
only
the processors in
Ip
can
gets
an
it, none of the processors operation initiated by processor p. Therefore, incorrect counter value, a contradiction.
the
knows
processor q
Note that
an operation depends on the operation immediately precedes it. Examples for such data structures are a bit can be accessed and nipped, and a priority queue.
an
The execution of
inc
operation is
not
function of the
initiating
processor and the state of the system, due to the nondeterministic nature
of
distributed
computation. Here, the state of the system in between any operations is denned as a vector of the local states
Now consider
a
of the processors.
in state
s
prefix
of the DAG of
an
execution
exec
pref of
processors
present in
that
prefix.
is
a
Then for any state of the system that is different from s, but
pref, the considered prefix prefix of a possible execution. The reason is simply that the processors in pref cannot distinguish between both states, and therefore any partially ordered set of events that involves only processors in pref and can happen nondeterministically in one state can happen in the other as well.
of
exec
22
Decentralization
3.2
Lower Bound
Consider
a
Definitions 3.2
sequence
of
consecutive inc
operations of
distributed counter.
p sends
or
receives
Let mp denote the number of messages that processor during the entire operation sequence; we call this the
a
message load
of processor p. Choose
maxp=i mp
and call b
bottleneck processor.
a
We will derive
not too many
lower bound
are
on
operations
are
initiated
by
any
easily
many
p is
by a single processor: Whenever a processor inc operation more frequently than all other processors initiating together, it is optimal to store the system's counter value at processor p.)
operations
initiated
the
To be
even more
bound,
we
initiates
exactly
operation.
In any
Theorem 3.3
(Lower Bound)
on
n
algorithm
that
implements
dis
processors where every processor initiates the inc there a bottleneck processor that sends and receives il(k) is operation once,
tributed counter
messages, where
kkk
n.
Proof: To of
an
simplify
us a
execution of
an inc
operation by
DAG,
see
sequence of messages
Figure 3.2. This communication list models along an arc in the DAG corresponds along a path in the list. Processor 17, for
example,
sends
single
(Figure 3.2)
(Figure 3.1).
just
once,
by
processor 11 in the
a
list.
By counting each
more
arc
in the list
we
get
lower bound
no
on
DAG,
because
processor
incoming arcs to nodes with its label in the list than in the DAG. Therefore, a lower bound for the list is also a lower bound for the DAG.
Figure
Now let
us
3.2: DAG of
Figure 3.1
n
as a
list
study
processor
initiates
one
inc
have
23
operation
more
than
one
possible
execution
distributed system. The communication list of a processor is the communication list of the execution of the inc operation initiated by the processor. Let the length of the communication
due to the nondeterminism of
list be the number of
arcs
We will argue
on
operations. The
sequence of
we
operation
in the sequence,
processor
(among
yet)
such that
and let
L% be the length of
operation is exactly Lt in the list and at least Lv DAG; bound, we only argue on the messages in the list. total of For the n inc operations, the number of messages sent is Y^=i ^ti let us denote this number as n-L, with L as the average number of messages sent. Because every sent message is going to be received by a processor,
that
are
in the
we
have
with nib
>
2nL.
processor b
an
processor
performs
inc
operation,
change.
Figure
We will
initiating
an
inc
operation
now
argue
on
the choice of
to the
operation.
according
lengths
yet,
see
Figure
we
the list
lengths
compare the
only length of a
operation.
length
of
the sequence, and consider the list of processor q for the i-th inc Let
l% denote the length of this list. By definition of the operation sequence, < lt Lt (i 1,..., n). Let pt<J denote the processor label of the j-th node
=
24
Decentralization
of the
we
list, where j
=
have plto
q for i
1,...,
n.
weight
operation
as
h
where
**
m(pl:J)
before the t-th inc operation, and /x := for each p = 1,..., n, and therefore uii
How do the
an
is the number of messages that processor pltJ sent or received 1. 0 nib + Initially, we have m(p)
=
0.
inc
this, let
one
us
compare the
that at least
of the processors in
with that
operation; let pzj be the first node inc operation i + 1 can differ
from the list for inc operation i in all elements that follow ptj, including ptj itself, but there is at least one execution in which it is identical to the list before inc operation i in all elements that precede plj (formally,
pl+1 ^
0,..., /). The reason is that for none of the processors pt j for j preceding pzj, the (knowledge about the system) state changes due to the i-th inc operation; each processor pl} (j 0,..., / 1) cannot distinguish
= =
=
whether
or
This
immediately gives
Wl+
wl+l
+^
m(pt+hj)
-^
^ m(pltJ)
p
J=/+l
>
1
>
a'
^
^-^
uJ
1 f
, n<<
1
=
.1
f
U>i+
fit
fj,J
1
=
u>i +
-r
We therefore have
n
**
u.'i
,=1^
25
m(pto)
inc
operations. We have
ra(Pn,o)
wn
~y~]
j=i
ra(Pn,j)
/^
With fi
rrib >
mq
>
m(pnfl)
we
get
>
Am(pnj) 2^ V3,
+ 1
1
=
ttfn-(l--j-)
1
wn + r
n
+ l
>
Xl+'re,+x"
<
v^i
xn
for x% >
0,
1,...,
(e.g.
[Bin77])
we
get
fi
>
j^
n;
\J\i~nL
n
~
77t
That is, (i
>
L+yn.
m^
=
we
conclude /j,
>
k, where
in
kkk
n.
Since
1, this
Asymptotically,
common
kkk
can
be
expressed
more
terms
follows.
26
Decentralization
Corollary operation
3.4
(Lower Bound)
on n
In any
dis
tributed counter
once, there is
V log log n)
messages.
Proof:
z
Let
be defined thus k
=
as
zz
n.
Since kkk
zz
=
n,
=
we
know that
where
>
k >
z/2,
6(z). Solving
W(x)
is Lambert's W function
ew{logn),
lim^(/ogn)logJogn
n-+oo
logn
we
get k
0(2)
(ew<"*>)
(j^).
3.3
Upper
a
Bound
We propose
previous
are
It is based
on
communication tree
whose root holds the counter value. The leaves of the tree
forwarding
processors
operations. The inner nodes of the tree operation request to the root. Recall initiates exactly one inc operation.
an
inc
as
follows.
level k+1;
level that
zero. n
=
kkk.
n
For
let
us assume
to the next
simplicity, higher
kkk,
integer k).
an
identifier id that
currently works for the node; let us call this the current simplicity, we do not distinguish between a node and its current processor whenever no ambiguity arises. Furthermore, it knows the identifiers of its k children and its parent. In addition, it keeps
or
it;
we
Initially,
node
j (j
0,..., kl
1)
level i
(i
(i
l)fcfe
jfcfe_i
+ 1.
3.3
Upper
Bound
27
root)
level 0
level 1
Figure
identifier.
no two inner nodes on levels 1 through k get the same Furthermore, the largest identifier (used for the parent of the
the value
-
(k
l)kk
kkk
no
(kk
kk
l)kk~k
kk
fc
+ 1 + 1
on
=
kkk
n.
We will make
the
same
sure
that
levels 1
through
=
ever
have
1. The leaves
have identifiers
the
n
right on level k + 1, representing by this regular scheme, all the processors. identifiers all initial locally. The age of all inner processors can compute 0. root stores an additional value, The the nodes including root is initially 0. where val the counter value val, initially
to
defined
Now let
us
describe how
non-root node
an
inc
operation initiated
a
at
processor p is
executed.
parent. Any
and
one
receiving
an
by
two
(one
an
for
receiving
val;
for sending
a
message).
message, it sends
furthermore,
a
by
two. After
incrementing
value,
node decides
locally
only if
updates
by setting
28
Decentralization
age
0 and idnew
:=
id0id
messages, k + 1
new
processor of its
job
and children
parent and
children about idnew. Note that in this way, we're able to keep the length of messages as short as O(logn) bits. There is a slight difference when
the root retires: value It
additionally informs
a
the
new
val,
and it
saves
Since
they
happen
If so,
processor
solving
this
problem
is
For simplicity, we do not as described. handling the corresponding messages; one way a proper handshaking protocol, with a constant
we
number of extra messages for each of the messages While correctness is derive
a
describe
we
straightforward and
is therefore
omitted,
will
now
bound
on
Lemma 3.5
(Retirement)
more
once
during
any
single
inc operation.
node, and
let
be the
is
(in
historic
order)
u
that retires
u
second time.
once
Since
first,
mc
retired
only
during
the current
operation. Therefore,
for k > 1 node Lemma 3.6
u
(Grow Old) //
an
inner
during
an inc
operation,
it sends and
receives
at most
four
messages.
Proof: Let p be the processor that initiates the node u that is on the path from leaf p to the from its child
v on
mc
operation. Each
one
inner
root receives
message
path, and it forwards this message to its parent. Among all nodes adjacent to u, only its parent and v can retire during the current mc operation, because u's other children are not on the path from p to the root and belong to Ip only if u retires. Due to Lemma 3.5, no node can retire more than once during a single inc operation, thus u does not
that receive
one more
sum
up,
node
receives
if
the
path from
p to the root.
Lemma 3.7
n
(Number
of
Retirements) During
on
of
mc
level
retires at most
kk~l
1 times.
3.3
Upper Bound
29
Proof:
on each path and therefore receives at most two operation and sends one message (the counter value). It
retires after every 4fc messages, with the total number r0 of retirements
satisfying
t-o <
7T 4fc
=
-X
4
<
kk.
In
at most
we
general, a node on level i is on kk~l+1 paths, and it receives and sends 3kk~l+l +r,_i messages. With a retirement at every 4fc messages,
inductively get
node
on
level i:
<
<
replace others when through k have been defined just for the purpose of providing a sufficiently large interval of replacement processor identifiers. The jth node (j 0,..., k% 1) on level i (i 1,..., k) initially uses processor (i l)kk +jkk~l + 1; its replacement
us now
Let
consider the
availability
of processors that
on
nodes retire.
levels 1
processor candidates
are
(i
Note that these
case. are
l)kk
jkk~l
{2,3,... ,kk~1}.
just
as
exactly kk~l
1 processors,
In
addition,
note
replaces
its processor
kk
1 times.
Lemma 3.8
most
(Inner
Node
0(k)
and sends at
Proof:
When
processor starts
working for a node, it receives k 4- 1 predecessor that tells about the identifiers of its parent
3.6,
we
it sends k + 1
one
to its
Lemma 3.9
(Leaf
Node
Work) During
the
entire sequence
of
inc
operations,
each
leaf
receives
30
Decentralization
exactly
one mc
operation
an
and receives
an
answer,
accounting for
two messages.
It receives
on
that this
happens
kk~k
times.
-
0
D
Theorem 3.10
(Bottleneck) During
receives
of
mc
oper
0{k)
messages,
where
kkk
n.
working
at most
once
most
conclude
we
0(k)
get
total of
O(k)
messages
as
claimed.
common
more
O(logn/loglogn).
3.11
Therefore
Corollary
is
tight.
Chapter
Counting
The first
Networks
counting scheme without a bottleneck processor is the family of Counting Networks, introduced 1991 by James Aspnes, Maurice Herlihy, and Nir Shavit [AHS91]. Although the goal of the previous chapter and
their work is do not
similar, the approach could not be more different. [AHS91] investigate the platonic nature of counting but rather present an
excellent
practical
Counting Network.
have a very rich combinatorial structure. Partly beauty, partly because of their applicability, they roused much interest in the community and started the gold rush in distributed a majority counting. They have been a tremendous source of inspiration of published papers in the field are about Counting Networks. Their close relatives, the Sorting Networks, are even more popular in theoretical computer science. They're often treated in introductory algorithm courses Due and in distinguished text books [Knu73, Man89, CLR92, OW93]. to the amount of published material, we're not in the position to discuss
Counting
Networks
because of their
every aspect of
Counting
Networks. If
one
is interested in
a more
detailed
paper
treatment,
we'd recommend
the
[AHS94]
or
on Counting Networks [AHS91], we will present Counting Network as a representative for the whole family of Counting Networks. In the first section, we present the recursive construction of the Bitonic Counting Network. Then, we will prove that the
Bitonic
In Section Networks
Counting Network fulfills the criteria for a correct counting scheme. 4.2, a couple of the most fundamental properties of Counting
are
discussed. We will
Sorting
the
of
Counting
and
32
Counting
Networks
of this
chapter,
as
we
will
analyze
the
Network
Counting Networks in general. To do so, we have to reveal a bit more queueing theory in Section 4.3. We will see why Counting Networks are indeed much more powerful than the Central Scheme. To our knowledge, this is the first performance analysis of Counting Networks by means of queueing theory.
as
well
of
4.1
A
is
Construction
Network consists of balancers connected
Counting
an
by
are
wires.
A balancer
arrive
element with k
input
output wires.
times and
Messages
on
input
wires at
arbitrary
words,
a
forwarded to the
the two
toggle mechanism,
sending the first, third, fifth, etc. second, the forth, the sixth, etc. to
For
a
incoming
balancer,
we
the zth
input
0,1,..., k
1.
Similarly,
=
we
sent messages
the concept of
balancer.
x0
X\
o
yo
JS
H-\
y\
Figure
4.1: A balancer
Properties
4.1
(1)
J2i=o
x*
2/o + 2/i
eventually forwarded.
In other
words, if
in
^2l=0
xt
2/o + 2/1
(3)
The number
of
=
output
+
wire
is
at most
one
in
higher
of
output
wire:
any state y0
\(y0
yi)/2] (thus
[(y0
yi)/2\).
4.1 Construction
33
Counting
Networks
are
built
upon
these
are
balancers.
Most
Counting
as
exactly
l
more
lucid if
one
draws
balancer
in
figure
4.2.
xo
yo
x\
yi
Figure
There
are
more
fashionable style
several
Counting Network
structures
known; let
Network.
us
restrict
on
the most is
prominent
isomorphic
to Batcher's Bitonic
This structure
a
very
popular
many
OW93].
w
comprehensive computer science text The only parameter of a Bitonic Counting Network always
a
It is
which is
power of two.
w
output wires.
the
Counting
Network in
inductive
manner.
To build
Counting Network
of width
parallel B[w/2] by a single Merger 1 of the upper resp. the outputs 0,..., w/2 1 of M[w\, just 1 0,..., w/2 resp. w/2,..., w B[l] is defined as a simple wire.
follow
of width
(short B[w\), we let two w (short M[w]); that is, lower B[w/2] are the inputs
as
depicted
in
figure
4.3.
A
w
>
Merger of width w (M[w]) is defined recursively as well. M[w] (with 2) is constructed with two parallel M[w/2}. The upper M[w/2] takes
even
the
resp.
lower half of
M[uj],
that is,
wires
input
wires
1 of the upper
+
M[w/2]
5,...,
are
-
the
input
lower
M[w/2]
w
uses
4,...,
to the
as
input.
M[w). The 1,3,5,..., w/2 1, w/2, w/2 + 2, w/2 + of the upper and lower M[w/2] is routed
3, w/2
+
w
1 of
original position. Finally, every two neighboring outputs (0 with-1, 2 2 with w with 3, w 1) are balanced once more. Figure 4.4 illustrates the definition of a Merger; to be distinguished, even and odd input/output wires have different colors. M[2] is defined as a simple balancer such as in
...,
figure
In the
4.2.
following,
we
name
promises.
We follow the
34
Counting
Networks
B[w]
in
M[w]
Figure
Definition 4.2
B[w]
is said to
(Step Property)
i/0
< yl
A sequence yo,Vi, y3
<
,Vw-i
1, for any i
<
j.
//
step property,
(1)
(2)
subsequences have
and odd
then its
even
subsequences satisfy
w
u/2-l
1
*
ui/2-1 nrf
iu "
1
*
^
8
=
J/2i+l
i=0
1=0
2=0
Facts 4.4
If
yw-i
have the
step property,
(1)
and and
J27=d
J2i=o
w
xi
EIlo
y*> then Xl
Vi +
=
y%
for
''
''w
l-
(2)
x*
^2i=o
0,1,...,
1)
such that x3
yt
for
unique j (j 0,..., w 1,
=
Using
these facts
we
will
now
be in the
position
to prove
that every
Merger
preserves the
step property
[AHS91].
quiescent state, if the inputs
have the
xo: xi,...,
M[w]
in
xw/2ixw/2+\i
ixw-i
step property,
then
the
xw/2-i output
step property.
4.1 Construction
35
M[w]
Figure
M[w]
Proof: For
w
=
by
2:
induction is
a
on
the width
w.
balancer and
For
>
2:
Let zq,
...
,zw/2-i resp.
z'0,... ,z'w,2_1
be the output of
lower
M[w/2]
subnetwork.
xw/2ixw/2+1t
even
and odd
induction
(Fact 4.3.1). By
From Fact an<i %'
M[w/2]
'
step property.
4.3.2
we
conclude that Z
+
Ll^rio2"1^]
at most 1
we
r|Er="J/2^1implies that
is yi
so
=
Since
\a]
by
[b\
=
and
[a\
\b]
1.
differ
by
at most 1.
If Z
Z',
Fact 4.4.1
z[
=
for i
0,..., w/2
Therefore,
...,
the output of
M[w]
Z[j/2J for i
0,...,
it;
1. Since zq,
2w/2-i
M[w]
=
by 1,
Fact 4.4.2
except
Let /
j/i
:=
1.
(with
=
i <
2j)
is 1 +1. The
output
(with
balancer
2j +1) is L The output ^2j and j/2j+i are balanced by the final J- Therefore M[w] preserves / + 1 and y2j+i resulting in j/2j
D
Counting
from
the final
output
comes
Merger
inputs
are
recursively
36
Counting
Networks
3)
Theorem 4.6
a
(Correctness)
Network
In
output
wires
of
Bitonic
Counting
of width
Corollary
4.7
(Correctness)
is a
correct
counting scheme
Proof
Assume that k
state
mc
operations
1
were
initiated and
k
mc
we
are
in
quiescent
(Definition
2),
operations
is
completed We number the first message to leave B[w]'s top output wire with 0, the first message to leave the ith (0 w 1, 1) output wire with i the the zth to leave 1,2, jth (j 0, generally ) message ,w (i l) l)w + i With Theorem 4 6, exactly the values output wire with (j
=
0,1,
Fact 1 Let
us
were
alloted, with
follows
build
a
none
missing and
none
alloted twice
With
3,
the
Corollary
we
describe how
Bitonic
is
passing model
processor p
to
one
A balancer's role
receives a
turns
output
is
wire
message
as
finally leaves
the
network,
it
sent to
on
one
processors,
acting
,w
modulo counters
assigns the values
the ith
(i
0,
1)
for incoming messages and returns the result to the of the mc operation Whenever a processor initiates an initiating processor
2w,
mc
operation
wire
it sends
message to
an
of the
Network
Since there
are w
input
wires
and two
connected to the
so
same
are
balancer
short
-
(processor),
forwarded
is
possibilities
to do
The messages
they basically
sent to and
inside the
Counting
Network resp
Figure
4 5
Bitonic
Counting
a
Network with
width 8 sends
a
Processor p initiates
operation
To do so, processor p
random input
wire
This message
When
reaching
modulo
counter,
value
assigned
4.2
Counting
vs.
Sorting
37
f-
xx
f-
a
random choice
,
tz
4.5: A Bitonic
completion
Figure
Counting
4.2
Counting
vs.
Sorting
programming
-
"Indeed, I believe that virtually every important aspect of sorting and searching!"
uncovers connections between counting and sorting. As already mentioned, Counting Networks have a rich underlying theory with many interesting aspects. With lack of space, we're not in the position to discuss all of them and will focus on those that will help us to analyze the Counting
This section
Networks.
Counting
one
Network is
isomorphic
to Batcher's Bitonic
Sorting Network,
might hope that other Counting Networks can be constructed when replacing the comparators of a Sorting Network with balancers. Indeed, there is a relation of counting and sorting, although
quite that strong. Already in their seminal paper [AHS91], Aspnes, Herlihy, and Shavit show that every Counting Network is also a Sorting Network but not vice versa. Therefore, Counting Networks can be seen as
not
a
generalization
one
of
Sorting
Networks.
a
can
and
vice
versa.
Theorem 4.8
(Counting
is a
vs.
Sorting)
The
isomorphic
vice versa.
network
of
Counting
Proof:
Network
Sorting
The
isomorphic networks
of the Even-Odd
not
or
Insertion
Sorting
Network
are
Counting Networks.
38
Counting
Networks
For the other direction, let C be a Counting Network and 1(C) be the isomorphic network, where every balancer is replaced by a comparator. Let 1(C) have an arbitrary input of O's and l's; that is, some of the input wires have a 0, all others have a 1. There is a message at C"s ith input wire if and only if 1(C) 's i input wire is 0. Since C is a Counting Network, all messages are routed to the upper output wires. 1(C) is isomorphic to C,
therefore
if and
comparator in
if the
1(C)
will receive
on
its upper
(lower)
wire
only
(lower)
wire.
corresponding balancer receives a message on its upper Using an inductive argument, the O's and l's will be routed
all O's exit the network the lower wires. We
on
on
apply
Sorting
Lemma
[Lei92]
The
are
which states that any network that sorts O's and l's does also sort
is
a
Sorting Network.
Counting Network
message from
More an input wire to an output wire. by is d 0. The of the network's wires depth of a input depth formally, balancer's output wire is defined as maxt~Qdt + 1, where dt is the depth of
input
wire
(i
0,..., k
of
1).
The
depth of
the
Counting
a depth replacing "balancer" with "comparator", depth of the isomorphic Sorting Network.
Remark:
Lemma 4.9
(Depth)
w
The Bitonic
w.
Counting Network
with width
has
depth
\ log2
\ log
d(M[w/2]) + 1 d(X) for the depth of X. Since d(M[w}) With d(B[w]) 1, we conclude d(M[w\) logw. d(M[2]) 0, we get d(B[w]) d(M[w}) + d(B[w/2)) + d{M[w]) and d(B[\)) 1 2 hi H h + + + logw \ogw \ogw d(M[w/2]) d(M[2\) and the follows. Lemma + ^ logw(\ogw 1)
Proof: We write
and
=
=
The
depth of sorting
the Bitonic
Counting Network
is
objects [Knu73], every Sorting Network with width w has depth Q(logw). Using Theorem 4.8 this lower bound holds for Counting Networks too. Since the depth of a network is one of the key parameters for performance, there has been much research done for Counting Networks with small depth. In the journal version of their seminal paper, [AHS94] propose an alternative Counting Network based on the Periodic Balanced Sorting Network [DPRS89] with 6(log2 w) depth.
In
[KP92],
constant
Counting Network with depth 0(ck's* logui) (for a positive c) is presented. However, this scheme is superseded by a simpler
a
w
4.2
Counting
vs.
Sorting
39
0(logwloglogu;)-depth network and finally an asymptotically optimal O(logiu) construction in Michael Klugerman's thesis [Klu94]. It is crucial to note that all these schemes make use of the infamous AKS Sorting Network [AKS83] as a building block. Therefore, just as in the AKS Sorting Network, the Big-Oh hides an unrealistically large constant. One cannot hope to improve this constant without improving the constant for the AKS Sorting
Network.
As
is
already mentioned
one
of the
more
counting scheme. It
can
be
shown
by
an
easy
example that
The Bitonic
Lemma 4.10
(Linearizability)
Counting Network
is not
linearizable.
Proof: Please consider the Bitonic 4.6: Assume that two inc
Counting
were
figure
operations
corresponding
on wire 0 and 2 (both in light grey color). having passed the second resp. the first balancer, these traversing messages "fall asleep"; In other words, both messages take unusually long time before they are received by the next balancer. Since we are in an asynchronous setting, this may be case.
After
Zzz
I
ZZz
I'
rr
Therefore, this
message is
Figure
4.6:
network
on
already passed by
on
forwarded twice
and
the lower output wire of these balancers. After the message leaves the network
reaching
wire 2
on
Strictly afterwards,
the network
on
wire 1.
After
having passed
Therefore, the
40
Counting
Networks
(and
not
depicted
in
wires 1 resp. 3, these Because values. dark the and medium grey the getting exactly grey do conflict Definition with the Bitonic operation 1.4, Counting Network is
on
figure 4.6), the two light grey eventually leave the network
not linearizable.
Note that
example
4.6 behaves
are
correctly
in the
quiescent
state:
Finally,
exactly
They Counting Networks and counting in general. linearizable Counting Network must have an infinite number of balancers and present the "Skew-Filter", an infinite network that can be appended to a Counting Network in order to make it linearizable. Moreover, they show that one can construct more efficient linearizable Counting Networks when
reveal various
important facts
messages
are
They present
lower bounds
well
use
as
the
of this
chapters.
A different approach was chosen by [LSST96]. Assuming that unlimited asynchrony is not reflecting the real world accurately, they show that Count ing Networks are practically linearizable: Whenever one inc operation is completed significantly before another is initiated (where "significantly" depends on the depth of the network and the maximum and minimum message time tm), the first inc operation will get a smaller value for a large class of existing Counting Networks. On the other hand, the authors show that the Bitonic Counting Network is not linearizable when the maximum possible message time is more than twice as large as the minimum possible message time. These results were further refined by [MPT97].
4.3
Queueing Theory
Revisited
As in Section 2.4, we use queueing theory to argue about the expected operation time of a Counting Network. Things are more complicated with
Counting Networks,
as
there is not
single
processor with
one
queue but
analyzed. However, queueing networks attracted the interest of many researchers. They are usually presented directly upon the single processor queueing system in many text books [BG87, PHB90, Rob93, Nel95]. The seminal idea of how to make queueing networks analytically
tractable
comes
from
4.3
Queueing Theory
Revisited
41
Lemma 4.11
(Burke)
The
departure
in
process
Section
rate A.
2.4)
the
from equilibrium,
an a
The
proof
may
be
looked
up
in
text
book
an
on
queueing networks
distinguished
from
an
steady-state.
discuss network is
queueing
queueing networks as they are used in this section. A a graph of nodes and directed arcs, connecting these nodes. Each node represents a single processor queueing system. Messages The routing of the messages through the system are sent along the arcs. there is a fixed probability for a message to be sent along is random the arc from processor p to processor q. Apart from this internal traffic, messages may enter or exit the network. See figure 4.7 for a sample network with three processors. Internal (external) traffic is depicted with black Two of three the nodes in the system receive messages from an arcs. (grey)
now
-
external
every
source.
At
sum
two
nodes,
messages
node,
arc
the
of the
can
In
our
example,
to
one
this half.
probabilities outgoing arcs must be one. be achieved by setting the routing probability on
every
Network
Figure
Lemma
4.7:
Sample
Network
processes
4.11, together with the joining resp. splitting property of Poisson (joining independent Poisson processes results in one Poisson
process and
splitting
Jackson
one
process
randomly
resulting published by
processes is also
[Jac57, Jac63].
Given
a
Theorem 4.12
rate
(Jackson)
queueing
network, where
every han
is are Poisson dis dling exponentially, to and the other with sent stations tributed, fixed probabilities. messages are Then, in the equilibrium, every processor in the system can be modeled as
distributed
an
isolated
M/M/1
processor,
rest
of the system.
42
Counting Networks
4.4
Performance
help of Theorem 4.12, we're in the position to argue analytically about Counting Networks. We start by evaluating the Bitonic Counting Network. Then, we discuss how to generalize the results for arbitrary Counting Networks.
With the
Lemma 4.13 In
is ?f.
Wti
Bitonic
Counting Network,
balancer
Proof: Let b be
balancer of
are
Bitonic
Counting
w.
input
Whenever
a
processor p is to initiate
an
inc
operation,
message to b with
one
probability
to
of the
operation
Poisson
expected
With the
joining
operations
processors is
f;. Therefore,
Counting
2-f-.
input
a
Using
an
inductive argument,
one can
Network is the
certainly
balancer,
(lower)
w
wire with
probability
|,
^~.
In Section
needed
properly. Note, that these modulo counters are not really necessary and were only introduced for simplicity. Instead of doing the modulo counting
in the modulo counters, last column of
one
counting directly
in the
the upper resp. lower wire to a modulo counter but will directly return the result to the initiating processor followed by adding ^ to the local variable,
preparing itself for the next message. This task time, too. Therefore, we need ^d(w) processors
with We must
can
be done in constant
as
to
operate
balancers,
Network.
\ log
a
Counting
< n.
To be correct,
we
processor is not
operations.
an
(external)
With
Lemma
4.13,
we
get
4.4 Performance
43
Corollary
p
4.14
as a
is
^- + f-,
if processor
operates
balancer. The
Lemma 4.15
Bitonic
Counting
+
Network and
traverse the
processor
is
(d(w)
1) (
-h
where
depth d(w)
log2
logw.
Proof:
Since
handling
message
rate is fi
takes
=
constant
amount
of local
utilization of the
only l/tc. handling time is distributed exponentially. The handler, denoted by p, is the fraction of time where the
Please note, that
we can
can
handler is busy. It
be calculated A
P=
-
as
=2
+
w
tc
-.
U
-
can
vary
between 0 and 1.
time for
a
As
we
response
processor is tr
jf-. With
an
expected
choosing
an
This expected operation time should be minimized by appropriate width w, respecting these restrictions:
power
is
(1)
is
of
2.
(2) (3)
The width
small
number
of
processors
available:
fd(w)
< n.
The width
is
big enough
< 1.
is
sufficiently
small: p
One
can
2s^tfi
w
with the smallest w fulfilling restrictions doubling w while restriction 4.16.2 remains valid. Whenever there is no w respecting Restrictions 4.16, the Bitonic Counting Network is not applicable tl: the time between to initiations is too small for any width w. Let us first argue when a Counting Network is applicable. Then we will present an asymptotically optimal choice for w.
by starting
One cannot
explicitly
try
to
solve the
a
however,
we
get
taste of
44
Counting
Networks
For
simplicity,
> 1.
suppose that
w
=
is
power of 2.
<
Width
is valid for
jd(w)
^o^d^y)
d(n)
= =
2$n)d{n)
< n
since
<
d(n)
2^lf,
ntc.
2(d(n) + 1)tt
applicability
1. That
is, ti
Since
0(log2n),
the
of
the Bitonic
with tt
fi(clog n)
at least
Counting
simplicity,
assume
that there is
w
more
width
as
satisfying
Restrictions 4.16.
Choose
asymptotically optimal
w
Counting
is
power of 2 and
<
~
Atc
<2w
The Bitonic
Counting
satisfies
Re
4-16.
we
Proof: As
have
chosen
power of
Concerning
With the
restriction 4.16.3,
we
have p
2^ f*+
<
2(^ + 1)^
we
5 +
applicability
> 2
restriction ti >
2(d(n)
l)tc,
get
p <
^. d/^)+lwe
Counting Network
with
=
more
than
one
balancer,
have
and therewith P < Note that every Counting + Network of width w < would not satisfy restriction 4.16.3, because
d(n)
|u)
tc t%
n + w 2
ti
ti
4tc t%
1.
^w,w,2w,4w,...
\wd{w)
<
satisfy
than
to do
<
n
more
largest
so, that is
\Wd{W)
< n.
Having
<
we
get
\Wd{W)
for
Theorem 4.18
tonic
(Performance)
Network is
The
expected operation
time
the Bi
Counting
T
Utc
tm) log2
^J
ifU
as
2^ ^
> 0
the
proof for
4.4 Performance
45
~.
operation
time is
(d(d))
i)^Jl_ ^
+
+
o(d(f)(tc tm)).
better
There is
no
w* with
an
asymptotically
+
operation time,
as
the second
factor is at least
Cl(tc
a
tm)
because p > 0.
Q(d(1jJL)),
because
Bitonic
Counting
equal
to D
\w
\w
>
\n^-
n^.
arguments
(Lemma
4.13 to Theorem
4.18)
can
large family of Counting Networks. All balancers must have 2 input wires and the probability for a message to pass a balancer must be the same for all balancers, though. For the Klugerman's optimal-depth Counting Network [Klu94], one gets
for
a
Theorem 4.19
(Performance)
is
The
expected operation
time
for
the op
iftt
fl(tc logn).
of the infamous hides
a
AKS
Sorting
reader
constant in
optimal-depth Counting Network makes [AKS83] as a building block the big-Oh of the operation time.
Network
may
use
and
huge
is
The
be
disturbed that
the
Bitonic
Counting
Network
operation initiations tt
Whenever tt is relatively small significantly larger than 2clog n. (compared to the number of processors n), the queues of not-handled messages at the first balancers of the network will get bigger and bigger and the expected operation time will go to
choose the width of the Bitonic
have
n
processors in
infinity. This is because we cannot Counting Network arbitrarily as we only the system that act as the balancers. Waiting some
messages from time to time
time and
sending bulk
such
a
(the
on
Central Scheme
trick,
confer Section
2.4)
had to
"split"
message
would lead to
congestion
46
Counting
Networks
completeness, let us mention that there has been a considerable amount study to generalize Counting Networks. Besides the already discussed waiting paradigm [HSW91], several papers tried to build Counting Net works with larger balancers (fan-out must not be 2 but can be arbitrary) [AA92, BM96, BM98]. By having balancers with k input/output wires, one can realize a Counting Network with depth 0(logl w) instead of 0(log2 w), when the width is w. On the other hand, larger balancers have much more congestion.
For
of
queueing theory can be used for virtually We have been discussing the Bitonic Counting Network any network. in detail. Other Counting Networks (e.g. the optimal-depth Counting Network) can be analyzed with exactly the same procedure. For Counting Networks that have balancers with k input/output wires and depth 0((tc + tm) klogk Ij-). 0(logfc w), the expected operation time will be T Therefore, the overall performance of Counting Networks with larger balancers is not competitive (except for very specific environments). We will see another example of analyzing a network of processors with queueing theory in the next Chapter.
=
Chapter
Diffracting
The introduction of the
Tree
Diffracting
seen as
Zemach
[SZ94]
could be
Asaph counting. being often called another instance of Counting the difference carries more weight than the
Tree in 1994
by
But let
us
stick to the
common
first. As
Counting Networks,
are
Diffracting
(degenerated
and
to
finally
a
a
reach
assigned.
Tree
The most
is not
significant difference,
reason
why
Diffracting
Counting Network,
consumed message
are
allowed to not
handle
immediately
a
together.
Two fundamental
it be
advantageous
some
to wait
message
immediately? Second,
handling (since
waiting for
asynchronicity)?
We will find
an
answer
to the first
question
in Section 5.2.
The second
question
issues. without
along
We give
a
short
answer
a
clock
by sending
When the dummy message is By choosing the number of achieve an asymptotically correct expected waiting time.
can realize waiting cycle of processors. received by the initiator, waiting time is over. processors in the cycle accordingly, we can
dummy
message in
48
Diffracting
Tree
Diffracting Diffracting Tree is correct, though not linearizable. In Section 5.2, we analyze the performance of the Diffracting Tree. This section is built upon Section 4.3. We will see why the Diffracting Tree is an appealing alternative to Counting Networks. For a detailed treatment of the Diffracting Tree, we recommend [SZ96].
we
section,
Additionally,
we
5.1
A
Construction
Diffracting Tree consists of balancers connected by wires. Contrary to Counting Networks, a balancer has always one input wire and two output wires. In Section 4.1, we said that a balancer is a toggle mechanism, sending the first, third, fifth, etc. incoming message to the upper, the second, the forth, the sixth, etc. to the lower output wire. We denote the number of consumed messages on the input wire with x, and the number of sent
messages
on
0,1.
Properties 4.1):
(compare
with
Properties
(1)
>
yo+Vi
in
any state.
(2) Every
we are
eventually forwarded.
x
=
In other
words, if
j/o + yx.
(3)
The number
of
=
output
+
wire
is
at most
one in
higher
of
output
wire:
any state y0
\{y0
yi)/2] (thus
[{y0
yi)/2\).
Diffracting Tree is a binary tree of balancers of height h where the output The xth output wire from the binary are re-arranged (Figure 5.1): 1 where output 0 (2h tree (i 0,..., 2h 1) is the top (bottom) output wire of the tree) is routed to the jth output wire of the Diffracting Tree, such that j(i) is the reversed binary representation of v. Let ih-\ih-2 *o
wires
=
-
be the
k
=
binary representation of
-
i, that is
J2k=o l^k
Diffracting
short
vvith ii G
{0,1}
a
for
0,..., h
1, then j{i)
we
YllZl tk2h-k-1
the
Tree
is
on
correct
To do so,
and
down
=
halves.
We
use
following
=
forms:
rounding f (x)
~
[a;/2],4- (x) |_a:/2J,J (x) G {t (x),l (x)}. Moreover, a^a-z ..ak(x) 1,. .,k and |*- (x) =| at(a2(.. .flfc(x)...)) where at(x) =| (x) for i (tfc-1 (*)) for k > 1 (with f (a;) =t (x)).
=
5.1 Construction
49
balncer
balncer
balncer
Figure
5.1:
balncer
balncer
balncer
Tree of
y*
\/
/ \
y6
Diffracting
height
rounding
up and down
halves,
we
know
for
x, k
and
therefore
<
ai
ik (x)
<
ai... at
%h (x)
=
ax... a;
tfc (x)
where at
-% for
(3) f (x)
\x/2k]
and
ik (x)
\x(2k\
and
therefore
< 1.
[x+\\<\x].
\,\k (x) <||fc (z) <Uk (x) <Uk (*)
inequality follow directly
let x-f from Fact 5.2.2.
Lemma 5.3
inequality,
-
know that xr
< 1 <*
:=tfc (a;) and x :=|fe (x). From Fact xt/2 < (xl + l)/2. With Fact 5.2.5 we [xj2 1/2J
get
if (x)
[xt/2\
<
L(xj.
1)/2J
<
As for
Counting Networks,
we
Diffracting
50
Diffracting
Tree
j, 0
<
yt
y3
<
(see
Definition
4.2).
Theorem 5.4
the
(Correctness)
Tree with
In
Diffracting
height
The number of messages leaving on the ith (i 0, ...,2h 1) output wire is yx. Let ih-iih-2 o be the binary representation of i, that Proof:
=
-
is i
to
^2kZ0 ik^k
leaving
Ofc
with u 6
{0,1}
for k
0,..., h
output wire i
are
passing the
same
balancers. The
re-routing of the
=
wires
after
ah-ia.h-2
-o,o{x)
such that
=t^=> ik
0 and ak
=|0- ik
1 for k
0,..., h
1.
Without loss of
generality, let
of
binary representation
such that 1
yt,
we
=
ji
>
ii
=
6fc
0 < i < j < 2h 1. Let jh-ijh-2 -jo be the exists there / an Then, j. integer (/ 0,..., h 1) 0 and jk I + 1,..., h 1. As for ik for k 0 and 6/l_i6^_2 -bo(x) such that bk =t<=> jfc
0,...,
1.
Using
Lemma 5.3
we
conclude that
bibi-i... 6o(a:)
a/a/-i
a0(x).
since ak
=
Moreover,
A;
=
we
have ah-i
.ai+i(x)
b^-i
bi+i(x)
bk for
/ +
1,..., ft
1.
Therefore,
< ah-i..
y3 =6^-1
...6(+1(6(6;_! ...b0(x))
we
.a;+i(a;a(_i.. .a0(x))
j/t.
for every
pair
0 <
i, j
<
2h
conclude that y2 and y3 differ by at most 1. Therewith, for any pair j < j we have
D
0 < yz
j/j
< 1
As for
one
Counting Networks,
easily implement a balancer by using However, this naive approach is bound to implementing the root of the tree is a hot-spot and
could
not better than the Central Scheme.
we
the
problem,
on
balancer
If
it is based
even
the
[SZ94]:
evenly b, they
an
number
sophisticated implementation of following by Shavit and Zemach of messages pass through a balancer, they are
need
a more
observation
balancer
goal of "colliding" and "diffracting", the implementation of Diffracting Tree, we use an array of prisms the to cleave optical jargon) and one toggle. When a message enters ([SZ94] the balancer, it selects one of the s prisms uniformly at random. If there
To achieve the
a
5.1 Construction
51
already
one
prism, the
two messages
are
diffracted,
If
balancer,
one
is
no
prism, the
time tw.
another message enters the prism in the meantime, both are diffracted. If a message does not meet another within time tw, the message is sent to the toggle (which is nothing but a balancer t within balancer b; to avoid
ambiguities
b,
we
call t
toggle).
balancer
Tree balancer
Figure
of the
5.2 shows
balancer:
as
Messages
arrive
on
one
prisms (sketched
small
triangles) uniformly
If the
directly on toggle (i) on the black wire. The toggle forwards the first, third, fifth, etc. incoming message to the lower output to the upper, the second, the forth, the sixth, etc. wire. Every prism and the toggle are implemented by a single processor. By rereading the balancer properties, you can convince yourself that this hence, it is correct. diffracting balancer fulfills the Properties 5.1
message is
one
lucky,
As for
Counting Networks,
it
can
be shown
by
an
easy
example
that
Lemma 5.5
(Linearizability)
a
The
Diffracting
Diffracting Tree with height 1 and a prism array of only (the root balancer) such as in Figure 5.3: Assume inc that two operations were initiated and two corresponding messages were the first (light grey color) sent to the upper prism where they are diffracted
balancer
-
sent to the
light grey message "falls asleep"; in other words, the message takes unusually long time before it is received by the modulo counter. Since we are in an asynchronous setting, this may be the case. The medium grey message is received by the lower modulo counter where
lower output wire. The
52
Diffracting
Tree
to the
initiating
processor;
this inc
Figure
5.3:
Strictly afterwards, another inc operation (dark grey) is initiated and the corresponding message is sent to the lower prism. Waiting for a partner message is not successful and the message is forwarded to the toggle and from there (since it is the first to arrive at the toggle) directly to the upper output wire. Then the dark grey message is received by the modulo counter and (since it is the first) the value 0 is assigned and returned to the initiating
processor.
received
Finally (and not depicted in Figure 5.3), by the upper modulo counter,
the dark grey and the medium grey
the
light
grey message is
eventually
Because
and value 2 is
assigned.
operation
is not linearizable.
example of Figure 5.3 behaves correctly Finally, exactly the values 0,1,2 are allotted.
in the
quiescent
state:
5.2
In
a
Performance
[SUZ96], a performance analysis of the Diffracting Tree is presented for synchronous machine model. Since we want all counting schemes to be analyzed with the same machine model, we have to restart from scratch.
One intuitively thinks that there is not much
that the
sense
in
slowing
down the
toggle
is not overloaded
by letting
two messages
coordinate
locally. Analyzing
Tree has the the
Diffracting
Tree is
a more
more
degrees of freedom
than
s
height
of
5.2 Performance
53
message in
identically (some
are
diffracted,
some are
not all
Let us start analyzing the Diffracting Tree by first focusing on its building block, the balancer. Assume that messages arrive at a balancer with arrival
rate A
(Poisson).
The arrival rate at
a
Lemma 5.6
prism
is
Aa
7 + -f-.
Proof: Since the messages choose one of the s prisms uniformly at random, the arrival rate of messages at any single prism is Am
=
As
we
have
specified,
no
time to be diffracted.
The
problems when we assume the waiting time to be distributed exponentially with expected waiting time tw. To stay within our model, we introduce a virtual waiting message representing the event when waiting is over. A waiting message arrives at a prism with arrival The two Poisson arrival streams (messages and waitings) rate Xw j-are joined to one single Poisson arrival stream with rate Aa Am + \w (see D Section 2.3).
waiting
poses
additional
probability
an
that
a
an
arrival is
message is pm
=
-f^, the
=
probability
We're
now
that
arrival is
virtual
waiting
message is pw
a
jf*-
1 pm.
probability of
a
message to be diffracted.
Lemma 5.7
The
probability for
Pt
message to be
s
toggled
is
2Xtv
a
Proof:
Let
us
have
prism (Figure
5.4).
'CX&
Figure
There
are
Prism
only
two
legal transitions,
are
(a
message
arrival)
no
and
(waiting
time is
over).
There
message in the
54
Diffracting
Tree
be
prism, the prism is in state so- Whenever there is diffracted, the prism is in state si. Whenever
and
a
one
message
waiting
to
the
prism
is in state si
message
system goes
the
to state Sd-
Analogously,
we
whenever the
prism
is in state si and
waiting
time is over,
toggle. From
there is
prism
returns
immediately
no more
message in the
prism.
By solving
PwSo +sd + st
si
sd St
(l-pw)s0
(l-PtuW
VwS\
sd + st
1
or
by simple symmetry reasoning, one concludes that the probability to ispw and the probability to visit state Sd is lpw. Note, that there are always two messages diffracted when arriving at state d whereas there is only one message toggled when arriving at state t. Therefore, the probability for a message to be toggled is
visit state St
St
Pw
Aw
-
S s
Pt~
2sd
+ st
2(1-pw)+pw
and tw
we
Am + AA
2Xtw
D
carefully
a a
prism, prism
as
nor
the
toggle
are
overloaded. As
an
arrival is tc
(for
message
as
well
for
virtual waiting
message),
=
we
must
guarantee that
< 1.
Pa
AAic
is
(S
)tc
Zw
toggle
we
receives
message
probability
is A.
Having
pt
s+2\tm
Set
< 1.
Pt
Xpttc
time
+ 2MW
The
expected
=
for
message
from entering
+
to
leaving
"J 1
-
+ Pa
o
2
"1
Pttm
Aa
i
1
Pt
5.2 Performance
55
when
we assure
that p&
<
1 and pt < 1.
1
pt,
a
Proof: With
probability
are
waiting
meet
an
in the
prism for the other message to be diffracted, half of them already waiting message in the prism. The average waiting time
is
-r~.
Pa
Aa
With
probability pt, a message is sent to the toggle. A message that is toggled is first queued in the prism, then it will wait for the waiting time out, then it is sent to the toggle where it is queued once again before being
handled. Thus, the total time in the balancer is
r
+T-+tm+ 1 Aa Pa
a
pt
The total
expected
-
time for
message in
balancer is therefore
tb
(i
pt)
pa
to
AA /
\1
Pa
\Aa
tm +
~~)
1
-
pt)
which
can
be
simplified
the claimed.
As you see, formulas tend to be quite naughty. Therefore, we will not For a balancer pursue the exact but rather stick to an asymptotic analysis.
with arrival rate
time
s :=
A,
we
prism
waiting
| 2Atc \,tw
:=
2stc.
level
one
generality, suppose that \tc > |. If Xtc < |, one can realize a single processor. Note that the size of a prism array on i + 1 of the Diffracting Tree is exactly half the size of a prism array on i of the Diffracting Tree, because the arrival rate is split in half from
to the next down
level
the
tree.
prism
size
and
waiting
and pt
time tw
<
does not
overload
prism
or
the
toggle,
>
< 1
< 1.
Proof: We have
\2Xtc]
=
[2|]
2. Then
Pa
AAC
/A
(j
\stc
Pt
~
Ptc
3A2tg
~
3Atc
_ ~
3
<
~s +
2\iw
2\tc
8\H*
2T8A7;
< L
56
Diffracting
Tree
Corollary
a
5.10
The
balancer is tb
time
for
message
from entering
to
leaving
Proof: 5.8
Withpt
<
|,
and pt <
(Lemma 5.9),
we
tc
=
~,
o~v+Pttm
ll+pt
AA
pA
*
pttc
-
=
-
pt
0(tc+
AA
tm +
tc).
We have Aa
r-
and therefore
J_
AA
st^
=
=
2g2tc
s
A4,
<
""
&Xt2c
2Xtc
+ 1
<
_
6Atg
=
~~
2sAic
2Aic
asymptotically the best one could expect. The tc and the tm term Corollary 5.10 are inherent, since every message is handled by a and a not negligible part of them are not diffracted and they must be prism sent from a prism to the toggle. Since messages are sent from one balancer to another, the term tm will occur anyway when analyzing the performance of the Diffracting Tree.
This is
in tb in
Let
us
concentrate
on
the
Diffracting
Tree
as a
whole again.
log
ntc
+ 1.
Theorem 5.11
(Performance)
The
fracting
Tree is
0 Utc +
tm) log
y~)
'
#*
n(fc log").
Proof: The arrival rate at the root balancer half of the messages
are
(on
level
0)
is
Ao
f-.
Since
are
forwarded
to the lower
balancer
on
level I is
A;
tl2-'.
level h is
5.2 Performance
57
On level h,
P
-
simple
within
processor p
can
handle
arriving
messages because
^htc
0(tc) expected
immediately
=
act
2h.
The
that tb
operation 0(tc
time is T
+
0(h(tb
a
tm)).
From
Corollary 5.10,
we
know
tm)
is spent in
balancer.
act
to
for the
arrays cannot +
the The
a
0(sh
2h)
Oijj^log^j*-),
larger than
tv
have
Diffracting
s.
Tree with
height h,
n
and the
prism
Since the
solution for
r\ogx
<
is
<
vJ^n.,
x
=
where
W(n)
is Lambert's W function
=
(confer
can
be
and
=
rj1L. Therefore,
Q(tcW(n)),
which
D
Q(tc\ogn).
Theorem
result 4.19, we By comparing performance of the Diffracting Tree is asymptotically exactly the same as for an optimal-depth Counting Network. As the constant hidden in the big-Oh of the operation time is small, the Diffracting Tree will have a much better performance in reality, though. The lesson one might learn from the Diffracting Tree is that one can get surprisingly simpler efficient counting schemes by relaxing model constraints (that is, allowing the messages to wait for each other).
with
conclude
that
the
few
generalizations
or
done to the
[ST95]
the Reactive
58
Diffracting
Tree
Chapter
Combining
The idea of
a
Tree
combining
several requests to
one
sending
every request
alone, is well
known within
Lubachevski, and
1983
early eighties.
[GLR83].
shortly
on
followed
Later,
underlying hardware of the network which resulted in the Software Combining Tree [YTL86]. Another analysis and several applications are due to [GVW89]. According
time,
we
to
our
underlying hardware
introduce systematic waiting to gain performance [WW98b]: A message is not forwarded immediately to another processor but will wait for a defined amount of time in order to promote combining. All prior
not wait
message
instantly
adaptive Combining Counting Pyramid [WW98a]. The Counting Pyramid is analyzed in Section 6.4. Due to its adaptiveness, we believe that the Counting Pyramid has practical merit.
will
present
the
a new
the
Tree
60
Combining
Tree
6.1
We have
Construction
a
height h: the
inner node
root is
u on
on
the tree
are
level h.
> 1.
level
-
0,..., h
:=
2 has
node
on
level h
> n
1 has m'
\n/mh~1}
sure
mh~l
m'
1.
> 1
and
mh~x
< n,
n.
it is
node
distinct
leaves and
give
We
are
discarded.
identify each
with p
=
1,...,
exactly
in the tree, processor p acts for leaf p and for inner node p
(if
it
exists).
To
achieve
this,
neighbor
of these two
neighbors.
The
A simple strategy
to
operations
is the
following.
system's
a
Whenever
processor p wants to
parent, which in
message to its
request arrives
at the root.
performance of the system is not better than a Central Scheme. To overcome this problem, we let the inner nodes combine several requests from their children. That means, instead of just forwarding each hot-spot,
and the
an
as
(within
certain time
requested by its children at "roughly the same time" frame). On the other hand, we have to guarantee
are
counter values
we
forwarded up resp.
quickly. Consequently,
only
need
upward
and
downward messages. An upward message is sent up the tree and consists of only one integer, the number of increment operations requested by the
subtree.
interval of counter values
A downward message is sent down the tree and consists of an (assigned by the root), specified by the first and
us
describe this
counting
scheme
precisely by defining
6.1 Construction
61
immediately
sends
an
upward
message
(asking
a
for
one
counter
value)
to the
in the tree.
Later, leaf
Root When
gets
downward message from the parent with this message, the inc
an
counter value.
By consuming
operation
is
assigned completed.
the root
an upward message from a child asking for z counter values, instantaneously returns a downward message with the interval {val,..., val + z 1} to the child and increments val by z.
receiving
Inner Node
As
already sketched,
on
inner nodes
are
to combine
upcoming
messages and
decombine them
Also,
an
inner node
keeps track
of all requests sent up to the root whose response did not come down yet. These requests are stored in a local first-in-first-out queue.
Let
us
inner node p
first describe the process of sending a request up in the tree: The keeps track of the number of increment requests by the leaves
as sum a
(initially 0).
child q
upward
an
message from
a
asking
counter
to
sum.
values,
an
tuple (q, z)
to
expected waiting
for
(with
sum
>
0)
counter
upward message to its parent, asking values, and it locally resets sum to 0.
inner node p receives
a
symmetric. Whenever
downward
given interval to the children, according to the entries at the beginning of its queue. For simplicity, assume that messages do not overtake each other on the same link; that is, if two
message from its parent, it distributes the
messages
are
Theorem 6.1
The
Combining
Tree
is
Proof:
are
(compare
with Theorem
at
2.1)
Correctness:
assigned centrally non-intersecting intervals, no value is returned twice, thus The number of condition (1) of Definition 1.1. correctness fulfilling
the root.
distribute
of initiated
handled
inc
operations.
inc
consumed and
1
are
by
returned.
Since at least k
operations
are
initiated,
correctness condition
(2)
of
62
Combining
Tree
Linearizability:
and is
td(v)
is
operation
This inc
operation
initiated at time
tx(v)
completed
we
at time
tc(v).
completion,
<
know that
tl(v)
<
td(v)
<
tc(v). Moreover,
never
have
td(a)
td(b)
=>
< b because
(Definition 1.4)
(a
linearizability completed
lower counter
<
(tc(a) b)" is
<
tl(b))
fulfilled since
td{a)
<
tc(a)
tl{b)
<
td(b)
^a<b.
6.2
Let
us
Performance
now
argue
on
the
performance of
this
tc,
more
specifically, tm
we
a
>
in Theorem 6.9
In order to arrive at
parameters
as
follows:
,h:=
Antc
Directly
Fact 6.2 It
is
tl>tw= 4tm
>
16ic-
Lemma 6.3
(Up-Down)
all
At each
inner
node, the
is
amount
as
of
local
com
the
same
the amount
of
messages.
Proof:
An inner node p
more
upward
messages than
downward messages. On the other hand, upward messages can always be handled in tc time, whereas an interval from a downward message has to be distributed to let
us
possibly
many children.
as
a
To
simplify
the
argumentation,
model called
downward message
(often
bulk
arrival),
every
one
in the queue.
This
meeting the demands of exactly one can be done in tc time, too. Moreover,
a
message generates
queue
tuple
queue
tuple,
the
proposition
follows.
6.2 Performance
63
simplify bookkeeping, all we do in our analysis is counting twice as many incoming upward messages plus outgoing upward messages at a node and forget about the incoming and outgoing downward messages.
To
Lemma 6.4
The arrival rate at
an inner
node
on
level I with I
0,..., h
2is\<-c.
Proof: than A child sends
With
no
two
P-.
r^-l
m
upward messages within expected tw upward messages at inner node u is no ^ + 1 and Fact 6.2, we have
tm
time.
more
< -^- +
11 +
problems when we assume the waiting exponentially with expected waiting time tw. To stay within our model, we introduce a virtual waiting message representing the event when waiting is over (compare with Lemma 5.6). The waiting
The
waiting
poses
no
additional
time to be distributed
^-.
downward
messages
From
Lemma
as
6.3,
we
know that
handling
is
as
expensive
handling upward messages. handling upward messages, sending upward result, we get (simplified using Fact 6.2)
(
,
When
adding
messages and
doubling the
3
<
1 +
1 +
1
=t
4
+ <
4 +
=
Lemma 6.5
an inner
node
on
level h
is
\ <
--.
(leaf
Having
u
at most
inner node
we
(on
>
tree) initiates one mc operation in tt expected m' children, the arrival rate for upward messages at level h +1, [logm 1) is no more than y-. Since h
of the
^j^]
have mh~l
^i. Therefore
n
^^
Using
Fact
U.
1^zrMr.
+ L
6.2,
we
have m!
1
<
1 h-<
1 h
1
.
proof is equal
to the second
64
Combining
Tree
Lemma 6.6
processor is \ <
j-.
Proof:
A processor p is
an
leaf and
p initiates
value every tj time. The arrival rate the arrival rate at inner node p
possibly one inner node. As a leaf, (expected) and consumes a counter at processor p is therefore bounded by
Lemmata 6.4 and 6.5 not
we more
(with
than
^-)
and leaf p
(not
more
than
j-).
2
^
get
,3
-
4tc
4tc
16tc
8tc
a
Corollary
6.6,
we
get
<
~tc
< 1.
Lemma 6.8
sage is
(Response Time)
to the
any
upward
mes
forwarded
Proof:
l/tc. Handling a message takes tc expected time, thus we have p. < 6.6, the utilization ratio is p |. With Theorem 2.2,
=
-
expected
response time is tr
,'
<
8ic.
This
immediately
leads to the
following.
The
Theorem 6.9
(Performance)
T
the Com
bining
Tree is
=
0(tm\og1f/\ogtf)forwarding
6.3,
we
a
message at
an
only 0(tc)
down too.
time.
From Lemma
one
Handling
message goes
one
along
(tm)
down
might send
=
6.2, we get for again. Using Ih tm. With h Tlogm 2h(tc + tm) + htw <
get T
=
sending transferring one message All up the tree and it up (tw). the total expected operation time
with
^]
'
+ 1 and
[^"|,
C
we
0(tm \ogtm/tc^).
have introduced two restrictions
was
At the
to
rid of them.
One constraint
U ^
tw( 4tm).
When processors
are
6.3
Counting Pyramid
65
operation
time \tm
very often
(ti
<
itm),
we
an
upward
one
message to the
parent immediately,
to
instead, and
bulk
request message
(compare
2.4).
Since ti
0(tm),
this does
waiting time asymptotically. Please note that the Combining Tree is the first counting scheme that is applicable even when U is very small. So far, all counting schemes we've seen had a restriction on the frequency the processors may initiate an inc operation; the Combining
Tree however has
none.
>
Atr.
Since
sending/receiving
message
always at least some local computation, this constraint will usually be satisfied naturally. However, if the opposite is the case, it is best to set 4 and adjust h resp. tw to [log4 ^] resp. 10c. Again, up a tree with m one can show a similar Timing Lemma and Counting Theorem with an
=
upper bound of
0(iclog2 rfSL).
Combing Tree, each processor p acts as two nodes operations and possibly as an inner node to manage the inc operations. Alternatively we could initiate an inc operation directly at an inner node p (and save some nodes on the path to the root and back). This variation however, has asymptotically the same performance as the Combining Tree and is not as fair (processors with inner nodes closer to the root have a better operation time).
We have
as a seen
that in the
leaf
to initiate inc
6.3
In this
Counting Pyramid
section,
we
propose
Counting Pyramid.
Its
of the
Combining
Tree
the
combining
requests along upward paths in a tree, and to decombine the answers on the way down. The Counting Pyramid extends this concept in two ways.
First,
next
node forwards
not
request
one
of the nodes
on
the
higher level,
along
the
necessarily
as
tree
same
path
helps
to
spread
out
a evenly substantially, we name it differently for clearer distinction: The scheme is a pyramid, a level is a floor, and a node is a brick. Second, a processor decides freely on the floor in the pyramid at which it initially sends its request. We propose that this decision be based on the frequency with
the load
across
departs from
tree
which the processor increments: If it increments very often, it should send the request to a low floor; if increments are very rare, it should go directly
66
Combining
Tree
to
Pyramid.
With this
is
the first
counting
scheme that
adapts
to any
adaptiveness and because counting is not the only operation that Pyramid supports, the Counting Pyramid is quite a flexible structure with high potential for a variety of applications. Each processor p has a completely local view of the the performance of the counting speed Counting Pyramid for processor p does not depend on the access rates of
the
-
Due to its
solely
on
processor p.
Obviously,
this individual
processors
access
patterns
are
correlated. Our
vnc
approach will work nicely whenever processors initiate operation completely independently of each other.
of h
an
=
the
h 0. floors. Floor / (/ 1) is made The single brick in floor 0 is the top of the Pyramid, with floor numbers increasing from top to bottom. The height h of the Counting Pyramid is defined as h := log^ n; for simplicity, assume that h is an integer The number of bricks in the Pyramid is smaller
..,
of
>
1 is
integer
it is possible to give each identify each brick with its number processor p will act for brick p. Regrettably, we cannot claim that the Pyramid concept is truly novel. It has been recognized as quite a universal paradigm in ancient times already, and even the distinguished role of the top has been acknowledged previously (Figure 6.1).
=
Yl/=o m^
7^1
<
"
Thus,
We
n.
Figure
6 1
The
Counting Pyramid
on
6.3
Counting Pyramid
67
processor p
request
message to
random brick
b in floor
A brick
random
upon
/ is to be specified later). 1 at in floor / > 0 that receives a request picks a brick in floor / floor in and forwards the request to that brick. The top (brick 0),
/ (/
0,..., h
1;
good
choice for
receipt of
As
response
Pyramid along
the
same
its
to
the
initiating
processor p.
Analogous
into
one.
to the
Combining Tree,
a
up the
same
Pyramid,
resp.
we let a brick combine several requests just forwarding each individual request combine requests arriving at "roughly the
time"
(within
certain time
are
frame).
long.
requests
counter values
forwarded up resp.
Let
us
Pyramid
is sent
quickly,
up the
i.e., without
waiting
too
messages:
messages.
An
two kinds of
integers:
s, the sender
values, specified
Let
us
interval.
by the first and the last counter value of the describe the counting scheme more precisely by defining
e.g.
Top
When
receiving
a
an
upward by
z.
message from q,
counter
values, the
z
top
returns
{val,...,
val +
1}
to
Initially,
val is 0.
Initiating
upward
Processor
an
increment.
one
Processor p
immediately sends
-
an
(asking
/
for
counter
value)
to
/.
The value of
will be
function
there has been one) resp. the time since setting up the system (if processor p has not initiated an inc operation yet), and it will be denoted as f(t). Later, processor p will get a downward message from brick b with an assigned counter value. Then, the inc operation is completed.
operation (if
Brick
As
(not
the
top)
bricks
are
already sketched,
on
to
combine
upcoming
a
messages
and
decombine them
Pyramid. Also,
brick
keeps track of
down yet.
all open requests sent up to the top whose response did not
come
were
An open request is
upward
messages that
combined
68
Combining
Tree
by the brick
memory.
and sent to
brick
on
the next
higher
floor
(i.e.,
number).
Let
us
brick b in floor
not
as
/ keeps
Pyramid
to
brick in floor
an
1, denoted
sum
(initially 0).
for
z
upward
message from
asking
counter
values,
z
open
to
From time to
time tw
upward
message to
random brick in
resets
sum
floor
1, asking for
counter
symmetrical. Whenever brick b receives a downward message, it distributes the given interval according to the tuples of the corresponding open request. Afterwards, the open request is removed from
The way down is
local memory.
6.4
Performance
we
In this section,
will argue
on
the
performance
the
are
of the
by
6.2.
means
of queueing theory.
Tree
Because
Combining
derivative,
all arguments
following argumentation gets simpler when we being. First, let the expected time for be significantly larger than the time used for some transferring a message local computation on average, that is, tm > 6tc. Moreover, we restrict
introduce two restrictions for the time
the
access
Let
us
assume
elapses between the initiations of two consecutive inc operations by the same processor is Poisson with expected value tl. We restrict tt such that
t%
>
an
individual time t since the last initiation of processor tw, but not the average. Later, we will show how to
In order to arrive at
as
a
two restrictions.
fast scheme,
we
follows:
m :=
,tw
:-
6tm,f{t)
:=
min
logm
TiZw
-1.
Directly
on
tm and tt follows:
Fact 6.10
(Relative Durations)
tt >tw
6tm
>
36tc.
6.4 Performance
69
Lemma 6.11
(Up-Down)
At each
brick, the
is
amount
as
messages
the
same
the amount
might
receive far
more
upward
a
always
be handled in
tc expected time,
distributed to
whereas
an
interval from
possibly many tuples in the corresponding open request. To simplify the argumentation, let us model a downward message as a whole set of downward entities (often called a bulk arrival), every one meeting the demands of exactly one pending tuple in the open request. Each entity task can be handled in tc expected time, too. Moreover, since every upward
message
generates
tuple
entity
removes a
tuple, the
Lemma follows.
simplify bookkeeping, we count twice as many incoming upward plus outgoing upward messages at a brick and forget about the incoming and outgoing downward messages, in the following analysis.
To
messages
Lemma 6.12
(Brick
Arrival
Rate)
(m
\ L"W
brick
is
A<2(
1
^"W
Proof: Let
in floor
us
upward
messages
arriving
at
brick b
/.
upward
two
(1)
Brick b receives
floor there
1)
send
no
are
m^+1 bricks
in
messages from bricks in floor / + 1. Bricks (in upward messages within time tw on average. As floor / + 1, and m' bricks in floor /, the arrival
rate for
upward
floor
+ 1 is
no more
than
p-.
receiving upward messages directly initiating processors, / as start floor. In the worst case, all n processors in the system choose / as the start floor for their increments. /
<
Brick b is
from
1, then
nt,.
/
Having
/(<)=
n
logm-^
t
-lo/
+ l>logm-^t> mf+1 t
-
ibli'tij
FtiLji
at most
independent
processors
starting
at
/ (with
choice of
rate is bounded
by
1
m
nm.f+l
<
mf
nt,
mf
70
Combining
Tree
rate at
brick in floor h
(with
choice of
1
m
mm
Xi
t%
xw
assume
(3)
The
waiting
poses
no
additional
the
waiting
we
virtual
is
waiting
and
the
combined upward
^-.
as
From Lemma
expensive
handling doubling
the
the
handling downward messages is as handling upward messages. When adding up the cost for types (1), (2a) resp. (2b), and (3) of upward messages, and D result (for downward messages), the Lemma follows.
6.11,
we
know
that
Corollary
5
6.13
(Brick
Arrival
Rate)
brick is A <
etc-
Proof.
6.12
<
+ +
2m+1
<
2^
2(^ +
l)
+ l
itm
T-c^w
6
^w
__
Otc
_1_
Otc
_5_
Ut'c
Corollary
sor
6.14
(Processor
Arrival
Rate)
proces
is X <
g|-.
only acting
Also with
an
Proof:
A processor p is not
as
brick in the
processor p is also
initiating
average.
a
inc
to time with
of at least tt,
expected delay of at least tt, preceding operation. The arrival rate at processor p is therefore bounded by the arrival rate at brick p (with Corollary 6.13) and f- (initiating and receiving). By Fact 6.10, we get
on
processor p receives
6.4 Performance
71
Corollary
5
g
<
6.15
(No Overload)
No processor
is
overloaded
since
<
<k
11.
Proof:
we
handled,
on
average.
With
-tc
Corollary
6.16
(Brick Response)
At
brick,
any
upward
message
is
Proof: From
queueing theory,
a
we
expected
response time
for
the
message at
brick is tr
n*
With p )
l/tc
and
Corollary 6.15,
Theorem 6.17
(Pyramid Performance)
is
The
expected operation
time
of
the
Counting Pyramid
(tmlogh/\ogj
and t
is
where
nmin(l, tc/t),
the last
inc
since
the
initiation
of
operation.
brick takes
a
Proof:
Corollary
only
brick
0(tc)
time.
Please note that the time spent in the queue at From Lemma 6.11,
we
is included
Handling one message goes along with transferring one message (tm) and waiting until the brick might send it upwards (tw). All up the Pyramid and down again. Using Fact 6.10, the expected operation time when entering in floor / is 2f(9tc + tm) + ftw < llftm. With the
messages too.
definition of
f(t)
and tw
6im
<
f(t)
mini
Iogm-y
logm
,h
-1
O I min I
-,
logm
.
(\ogtm/tc min(ntc/t,n)J
72
Combining
Tree
Corollary
at
a
6.18
(Memory)
The
expected
amount
of
processor is
0(mh)
Proof: From
o(^logn/logt^y
know that messages arrive at the brick with
Corollary 6.13,
we
arrival rate less than j-. More or less every second message is a potential upward message, with the consequence that a tuple has to be stored in
memory. With Theorem
tuple
is removed after
we
0(tm logm n)
two
time
(n <n).
to
>
section,
have
arguments.
often
When processors
very
simplify the of them. One constraint was t, tw(= 6tm). active and initiate the increment operation very
introduced
restrictions
a request immediately, but instead, and to already combine
=
(,
<
6tm),
one
message. Thus tt
0(tm);
waiting
time
asymptotically,
>
6tc.
Since
sending/receiving
message
always at least some local computation, this constraint will usually be satisfied naturally. However, if the opposite is the case, all we have to 4 and adjust h resp. tw to log4 n do is setting up a Pyramid with m
=
resp.
I2tc. Again,
one can
show
in similar way
an
upper bound
on
the
Chapter
Optimal Counting
fundamentally different solutions for counting: family of Counting Networks, the Diffracting Tree and the Combining Tree (including its adaptive version the Counting the in We have to seen that, expected operation time, Pyramid). regard the Combining Tree has a better performance than the other schemes. Still not answered is the question of optimality: What is the lowest possible
Up
to now,
we
have
seen
four
The Central
Scheme,
the
operation time?
From
our
observations in
Chapter 3,
we
know that
counting
scheme with
no
there is little
hope for
very
asymptotically
inc
"constant"
operation
of the
"optimal"
will tackle this question by making a detour to another model synchronous machine model. After defining the model, we present a synchronous equivalent to the Combining Tree from Section 6.1. In Section 7.2, we present a lower bound for counting on a synchronous machine. As the performance of the upper bound in Section 7.1 matches this lower bound, lower and upper bounds are tight. Finally, we will show how the synchronous model (and the lower bound) can be extended to the machine
[WW97a]
the
model
we are
used to
(see
Section
1.1).
74
Optimal Counting
7.1
Synchronous Combining
Tree
synchronous correspondent to the model of Chapter 1: synchronous, distributed system of n processors in a message passing network, where each processor is uniquely identified with one of the integers from 1 to n. Each processor has unbounded local memory; there is no shared memory. Any processor can exchange messages directly with any other processor. The processors operate in synchrony. Within each clock cycle (of a global clock), each processor may receive one message, perform a constant number of local calculations, and send one message, in this order. Every message takes exactly m (with m > 1) cycles to travel from the sending processor to the receiving processor. No failures whatsoever occur
Consider
a
in the In the
system.
following,
we
present
synchronous equivalent
a
to the
Combining
on
Tree from Section 6.1. We have level zero, all leaves of the tree the
For
height
h: the root is
are on
(including
mh.
root)
m
n.
children.
that
n
=
simplicity,
>
us assume
mh.
Since
1 and
possible to give each inner node a distinct number between Furthermore, we number the n leaves with distinct numbers from
1,
it is
1 to
resp.
n.
inner
node p
one
leaf p. Therefore,
one
1,...,
n, there is
exactly
setting
up this tree
of its parent.
A processor p will act for its leaf and for its inner
node;
to
achieve this, it will communicate with every neighbor of these two nodes in the tree. Every processor knows its neighbors.
As in the
Combining Tree,
the
a
system's
Whenever
processor p wants to
sends
request
message to its
so
message to
on
value val to this request, and val is sent down the tree
until it arrives at the leaf p.
along
the
forwarding
simplify the following reasoning, we will make strong use of the synchronicity. Every processor is to know the current cycle c of the system. Each node in the tree that is the ith (i 1,..., m) child of its parent node is allowed to send a message to its parent only in cycle c l)m + i, (2k
7.1
Synchronous Combining
Tree
75
Similarly, every inner node in the tree is allowed to send a 2km + j, for any 1,..., m) child only in cycle c jth (j k. restrictions From the we get immediately: integer sending
for
integer
k.
message to its
Lemma 7.1
message per
(One Message)
cycle.
No node sends
(or receives)
more
than
one
for
m
cycles to get to parent) receives a message from the parent only if c 2km + i + m= (2k + \)m + i, and receives a message from the jth child only if c 1,... m. (2k' l)m+j+m 2k'm+j, for j Again, (2k + l)m 4- i ^ 2k'm + j, for integers k, k' and 1 < i, j < m.
node,
a
node
(that
Lemma 7.2
message per
(Communication)
cycle.
A processor sends
(receives)
at most
one
care
that for
(p
1,..., n),
Using
has
no
immediately.
Therefore,
Let
us a
processor
can
describe the
counting
scheme
only
in
cycle
(2k
l)m
the client
containing application
generated by
cycles.
Several
cycles later,
when the message went all the way up the tree and
an
Root
The root receives
in the
a
for
counter values
cycle immediately
As
2km +
to child
j.
j and
Inner Node
already sketched,
on
inner nodes
are
to combine
upcoming
come
messages and
Also,
an
inner node
keeps track
operations
are
operations
stored in
76
Optimal Counting
Let
us
sending upward
node p has to
keep
by
the
leaves in the subtree that have not yet been forwarded up the tree (initially 0). When inner node p receives a message from child j asking for z counter
values,
there
p enqueues
record
(j, z)
to
sum.
The children's
are
cycles
2km +
(2k + l)m+i
cycle
c,
sum
is reset to 0.
at
cycle
symmetrical. In cycle (2k + \)m +1, the inner node p gets starting counter value start from the parent. Then, if 1,... m, the first record of queue is j, (2k + 2)m + j, j
=
inner node p
removes
j. Then,
From the
quantity
to start.
description of the
tree's
nodes,
we
get immediately:
Fact 7.3
(Forwarding)
cycle, cycles
is
2m
at every node.
mc
operation
is
exe
along the links alone is 2h m cycles. Fact 7.3 delays of at most 2m cycles at every node, thus in total at most 2h 2m. Therefore, the total number of cycles is D Q(hm). Since mh Q(2hm) + 0(4hm) n, we have h logm n.
have additional
=
7.2
For the
Lower Bound
synchronous setting,
we
will derive
lower bound
case.
on
the time it
us
To do so, let
first
discuss the
Let
a
processor p broadcast
a
piece
possible
time. If
q
cycle
c, then
a)
processor
already
message message
1 or b) processor q received a cycle c with the information in cycle c, that is, another processor sent a to q in cycle c m. Therefore, the number of processors that know
77
the information in
cycle
is defined
by
if
c
fm{c)
fm{c
1)
fm{c
m),
> m,
and
/m(c)
1 if
< m.
(Dissemination) Let m > 1 for c fm(c m) if c > m and fm(c) bounded from above by fm(c) < mclm.
Lemma 7.5
-
4 and /ei
/m(c)
-
0,...,
1.
is
Proof: For
we
0,...,
1, the claim is
mc/m.
If
> m,
have
by
induction
+
fm{c)
Since
<
m(c-1)/m
m(c-m)/m
mc/m
m m
m>lm
m l/m
m+m'/^
< 1
for
>
Theorem 7.6 (Broadcasting) fi(mlogmn) cycles. Proof: Lemma 7.5 says that in
<
>
1,
broadcasting takes
cycles,
m
we
can
inform
no
more
than
n
/m(c)
mclm
special
processors, when
>
4.
Therefore, informing
^
c
=
/1(n)
=
cycles,
can
where
/^H71)
m[lgmnl-
2,3,
one
therefore
f^{n)
>
log2n. For
easily show that /m(c) < 2C and 2,3, we have log2n 6(mlogmn),
=
By symmetry, accumulating information from n different processors processor takes the same time as broadcasting to n processors.
at
one
Corollary 7.7 (Accumulation) For every m > 1, accumulating infor mation from n processors takes Q.(m logm n) cycles.
Finally,
we use
distributed
prove
synchronous
Theorem 7.8
(Lower Bound)
m
An inc
operation
costs
Q(mlogmn)
cy
cles, for
every
> 1.
Proof: At and
cycle
1,
assume
s
value is val.
c no
Assume that
processors initiate
an
processor initiates
quiescent and the counter inc operation at cycle operation at cycle c + l,...,c + t,
an
for
1.1
The criterion for correct counting (Definition sufficiently large t. and Fact 1.3) requires that the s processors get the counter values
78
Optimal Counting
val,..., val
gets the
we
1. Assume processor pw is
w
=
one
of these
processors and
0,...,
w
1. For this to be
s
possible,
pw
1 of the
involved processors. As
the
Corollary 7.7, this takes Q(mlogmw) cycles. Since for majority of the s processors, w fi(s), the result cannot be expected before cycle c + Q(m logm s). Whenever s f2(n) (a substantial part of the
=
processors),
schemes
this bound is
Q.(mlogm n).
only
hold for linearizable
counting
(Definition 1.4), but for the weaker general correctness condition (Definition 1.1) that is used for the family of Counting Networks and the
Tree.
Diffracting imply:
(Theorem 7.8)
The
synchronous Combining
Tree
is also
an
asynchronous setting,
Let
us assume
our
machine model:
that
some
local
at
processor takes
average) tc
message is
tm. In the following, we will reformulate the arguments above in this setting. Inevitably, many statements are very close to their synchronous counterparts. For simplicity, we assume that tm > tc.
Let
a
processor p broadcast
a
piece of information
in the shortest
possible
time. If
q
a
a)
processor
already
tc
or
b)
processor q received
tc
<
t <
to.
Since
transferring
information
before time to
tm.
Therefore,
at time
to is bounded by
+
/(to)
<
/(to
tc)
/(t0
tm),
/(t0)
1 if t0 < tm.
Lemma 7.10
(Dissemination)
<
With
tm/tc
>
4, /(to)
bounded
from
above
by f{to)
{tm/tc)t0,tm
tm/tc
we
> 4 >
1, and
to/tTO
>
0,
have
/(t0)
/(to
1 <
(tm/tc)to^m.
/(t0
-
If to
<
>
tm,
have
by induction
/(to)
<
tc)
tm)
(W*r)(t~te)/tm
(Wtc)(t-tm)/tm.
79
For
simplicity,
we
set
m :=
tm/tc.
+
Then
+
/(*o)
Since
<
mto/tm_1/m
for
mto/tm_1
mto/tm
m}lm
'
m-mllm
m+mu
< 1
m >
Theorem 7.11 (Broadcasting) Broadcasting a piece of information to $s$ arbitrary processors takes $\Omega(t_m \log_{t_m/t_c} s)$ time.

Proof: Lemma 7.10 says that in time $t_0$, we can inform no more than $f(t_0) \le (t_m/t_c)^{t_0/t_m}$ processors, when $t_m/t_c \ge 4$. Therefore, informing $s$ processors takes at least $f^{-1}(s)$ time, with $\log_{t_m/t_c} s \le t_0/t_m$. Thus $t_0 \ge f^{-1}(s) = t_m \lceil \log_m s \rceil$. For the special case $1 < t_m/t_c < 4$, one can easily show that $f(t_0) \le 2^{t_0/t_c}$ and therefore $t_0 \ge f^{-1}(s) \ge t_c \log_2 s$. As $1 < t_m/t_c < 4$ and $t_c < t_m < 4 t_c$ (thus $t_c = \Theta(t_m)$), we have $t_c \log_2 s = \Theta(t_m \log_{t_m/t_c} s)$, and the Theorem follows.
By symmetry, accumulating information from $s$ different processors at one processor takes at least as much time as broadcasting to $s$ processors.

Corollary 7.12 (Accumulation) Accumulating information from $s$ processors takes $\Omega(t_m \log_{t_m/t_c} s)$ time.
Finally, we use Corollary 7.12 to prove a lower bound for distributed counting in our model (Section 1.1). Assume that the time between two initiations of an inc operation is $t_i$ (exactly), for every processor p.

Theorem 7.13 (Lower Bound) An inc operation of an arbitrary counting scheme costs $\Omega(t_m \log_{t_m/t_c} \frac{n t_c}{t_i})$ time.
Proof: At time $t_0$, assume the system is quiescent and the counter value is val. Assume that $s$ processors initiate an inc operation after time $t_0$; no processor initiates an inc operation for a sufficiently long time thereafter. The criterion for correct counting (Definition 1.1 and Fact 1.3) requires that the $s$ processors get the counter values val, ..., val + s - 1. Assume processor $p_w$ is one of these processors and gets the counter value val + w, $w = 0,\dots,s-1$. For this to be possible, $p_w$ has to accumulate information from $w - 1$ of the $s$ involved processors. As we know from Corollary 7.12, this takes $\Omega(t_m \log_{t_m/t_c} w)$ time. Since for the majority of the $s$ processors, $w = \Omega(s)$, the result cannot be expected before time $t_0 + \Omega(t_m \log_{t_m/t_c} s)$. Since the time between two initiations is $t_i$, there are $s = n t_c / t_i$ initiations in a time interval of size $t_c$. Therefore, an inc operation costs $\Omega(t_m \log_{t_m/t_c} \frac{n t_c}{t_i})$ time.
This lower bound of Theorem 7.13 matches the expected operation time of the Combining Tree (Theorem 6.9) and the Counting Pyramid (Theorem 6.17). Unfortunately, our analysis used restrictions (e.g. the time to handle a message had to be exponentially distributed). As one cannot analyze the performance of a counting scheme (that is more complex than the Central Scheme) for general distributions of the various parameters (e.g. handling time), we are not in the position to formally prove that the Combining Tree has an asymptotically optimal expected operation time. The simulation results of the next chapter, however, indicate that the expected operation time for other distributions is asymptotically the same as for the exponential distribution; at the end of the next chapter we will discuss this in more detail.
Chapter 8

Simulation

Analysis is an excellent method to assess the performance of distributed counting schemes, starting with the work of [HLS95]. However, when not all distributions (e.g. the handling time) are Markov, the analysis gets intractable; simulation is then the suitable tool when arguing about such systems. Therefore, we complement our analysis by simulation. For purity and generality reasons, we decided to simulate the counting schemes not for a specific machine, but for a distributed virtual machine.

In Section 8.1, we present the model of the distributed virtual machine and the benchmarks. Then, in Section 8.2, the results of the simulation studies are given, along with their interpretation [WW98b].
8.1 Model

The Distributed Virtual Machine (DVM) consists of $n$ processors, communicating by exchanging messages. Local computation steps cost $t_c$ time; the time for a message to be sent to another processor is denoted by $t_m$. Since several processors might send a message to the same processor concurrently, every received message is stored in a queue of incoming messages and consumed one after the other. Note that both, sending and consuming a message, need some local computation (for sending: specify the receiver and start the sending process; for consuming: mark the message as read and remove it from the queue of incoming messages). With our DVM, it is not possible to broadcast a message to more than one processor in one step. In other words, if a processor is to send a message to every other processor in turn, the last processor will not consume the message before $\Omega(n t_c + t_m)$ time has passed.

The times $t_c$ and $t_m$ do not have to be constants, but may follow arbitrary probability distributions. This way, one can easily simulate the performance of a system where some messages are delayed or processors might have temporary performance problems (due to caching, for example). As a virtual machine, the DVM abstracts from reality in a number of aspects. Although every counting scheme we have tested uses no more than a constant number of local computation steps to work on a pending event, the constants themselves do certainly differ, a fact simply ignored by the DVM: it does not distinguish a simple element (e.g. a balancer in a Counting Network: receiving a message, toggling a bit, sending the message according to the bit) from a complex element (e.g. an inner node in a Combining Tree when receiving an upward message: receiving a message, storing the message in a queue, adding the value z to a sum). We accepted this simplification to have the opportunity to experimentally evaluate the counting schemes on very large scale DVM's with up to 16384 processors, a dimension which is surely intractable for a simulation model that is close to a real machine.
On the DVM, we have implemented four schemes: the Central Scheme (Chapter 2), the Bitonic Counting Network (Chapter 4), the Diffracting Tree (Chapter 5), and the Combining Tree (Chapter 6). Although we believe that the Counting Pyramid (Section 6.3) is a very interesting structure with favorable practical properties, we did not implement it, as its most important advantage, the adaptiveness, is not exercised by our benchmarks.

Sometimes, the real implementation differs from the analyzed scheme in the "waiting": instead of exponentially distributed waiting times (tossing coins), it is generally more realistic to wait deterministically, e.g. a combined upward message is to be sent within a fixed time if no further upward message arrives, and possibly immediately otherwise. We were not allowed to use these deterministic waiting methods for the analysis (Sections 5.2, 6.2) because they are not Poisson and therefore would have violated a precondition of Jackson's Theorem (Theorem 4.12).
The efficiency of all counting schemes but the Central Scheme varies according to a few structural parameters (e.g. network width, waiting time). For each scheme, we have chosen the best parameter settings according to the analysis in the respective performance section (Sections 4.4, 5.2, 6.2). Since the performance analysis only provided asymptotical results without constants, we figure out the constants by an iterative (non-asymptotical) analysis; still, we cannot guarantee that all parameter settings are optimal.
We have implemented the system, the counting schemes, and the test methods in the programming language Java. By using the technique of event-driven simulation, it is possible to simulate a machine with more than a thousand processors; if you are simulating a machine with 16384 processors that must run on a machine with only one processor, you otherwise definitely run into problems of time or memory.
For the following tests, we use a popular criterion to estimate the performance of a counting scheme: the operation time, the time from the initiation of an inc operation to its completion. The performance of a counting system can also be measured by its throughput, the number of inc operations that are handled per time unit. We restricted ourselves to the operation time because operation time and throughput are closely related. In most of the following tests, each processor executes a loop that initiates an inc operation whenever it receives the value for its previous inc operation. In that scenario, operation time and throughput are inversely proportional to each other: Assume that P* is the setting of the parameters of a counting scheme (e.g. the width for the Bitonic Counting Network) that minimizes the operation time. The parameter-setting P* is also optimum for throughput, since an alternative parameter-setting P' that maximizes throughput contradicts the optimality of P* for the operation time.

For every figure presented, we made five test runs of up to 10'000 initiations of an inc operation per processor. The result presented is the average of these runs; the deviation between the runs was mostly small. The simulation is available at http://www.inf.ethz.ch/personal/watten/sim/.
8.2 Results

Loop Benchmark

For the first benchmark, each processor executes a loop that initiates an inc operation whenever it receives the value for its previous inc operation. We measure the average operation time of an inc operation, the time from initiation to completion. This benchmark produces a high level of concurrency (the highest possible when having the restriction that every processor is only allowed to have one pending operation); [HLS95] call this the Counting Benchmark. We are testing systems with up to 16384 processors. Calculation time $t_c$ (for handling a message) and message time $t_m$ (for transferring a message) are both deterministic with 1 resp. 5 time units. Figure 8.1 shows the average operation time of the four counting schemes.
[Figure 8.1: Loop Benchmark. Average operation time vs. number of processors for the Central Scheme, the Bitonic Counting Network, the Diffracting Tree, and the Combining Tree.]
As long as the number of processors in the system is small (up to 16 processors), all schemes show equal performance, since every scheme degenerates to the Central Scheme (the Bitonic Counting Network has width 1, the Diffracting Tree has height 0, and the Combining Tree has height 1).
The operation time is then more or less bound by the message time: if there is no congestion in the system, the initiating processor sends a message to the central processor, which returns the current counter value. When the number of processors is rising, the Central Scheme runs into congestion: Whenever a message arrives at the central processor, there is already a huge queue of not-yet-handled messages. As the processors are executing a loop, a message from more or less every other processor is already in the queue. Therefore, when having $n$ processors in the system, the average operation time for the Central Scheme is about $n t_c$. It was not possible to show the curve for $n > 512$ (for 16384 processors, the average operation time is about 16384 time units).
The Bitonic Counting Network has depth $\Theta(\log^2 w)$; it therefore also loses time in the depth when the width $w$ is high. In the chart, one can see nicely that one has to double the width of the Bitonic Counting Network from time to time in order to be competitive; doubling the width is a major influence on performance and therefore, the performance drops significantly whenever one has to do it. For our setting, the width was doubled at 256 and 2048 processors, respectively. We will have a closer look into this phenomenon in the Bitonic Counting Network Benchmark.
Let us concentrate on the interesting case of 32 processors, where the Diffracting Tree beats the Combining Tree and every other scheme. The difference however is rather small and hard to see in the chart. Consider the smallest instance of every class that is not equivalent to the Central Scheme: The smallest non-central Combining Tree has height 2 (therefore, always 4 messages are sent from initiation to completion of an inc operation); the smallest non-central Bitonic Counting Network has width 4 (a message is passing 3 balancers and therefore 4 messages have to be sent from initiation to completion of an inc operation). Both, the Combining Tree with height 2 and the Bitonic Counting Network with width 4, are not competitive in this setting with 32 processors, because the Central Scheme (where only 2 messages are sent) performs better. On the other hand, with the Diffracting Tree of height 1 (one balancer consisting of one prism and one toggle, and two modulo counters), there is a scheme where most of the time (when the messages are diffracted), only 3 messages are sent. In the case of $n = 32$, the Diffracting Tree with height 1 therefore performs a little bit better than all other schemes.
If you know simulation studies such as [HLS95] or [SZ96] (where the Combining Tree did not perform as good as the Bitonic Counting Network and the Diffracting Tree), you might be surprised by our results. There are at least two justifications for the discrepancy. First, our Combining Tree differs significantly from the version that [HLS95] and [SZ96] implemented: They used a binary tree and a fixed waiting time, whereas we adapt degree and waiting time to circumstances; more severely, their Combining Tree did not use a technique that promotes combining many messages. The second reason might be that we ignored the different constants involved when handling a message (confer the discussion in Section 8.1); [SZ96] used a much more realistic machine model that takes these constants into account.
Bitonic Counting Network Benchmark

Let us quickly investigate the Bitonic Counting Network as a representative for how choosing the parameters influences the performance. Concretely, how does the choice of the width (the only parameter of the Bitonic Counting Network) influence its performance? Again, each processor executes a loop that initiates an inc operation whenever it received the value for its previous inc operation. We measure the average operation time of an inc operation, the time from initiation to completion.

For this benchmark, calculation time and message time are distributed exponentially with expected values 1 respectively 10 time units. Figure 8.2 shows the average operation time for $n = 256, 512, 1024, 2048$ processors as a function of the width $w$ of the network.
First note that, for every $n$, the curve has U-form. When the width of the network is small, the balancers are facing a lot of congestion. On the other hand, having the largest possible width may not be good either: for $n = 1024$, for example, a small width would suffice to keep the congestion at the balancers small, and a larger width only increases the depth of the network. The optimum is in the middle, where both congestion and depth are small.
[Figure 8.2: Bitonic Counting Network. Average operation time as a function of the network width.]
It is worth noting that, when congestion at the balancers is high, the networks have more or less the same performance when the width is a constant fraction of the number of processors: for width 2, 4, 8, 16 and $n$ = 256, 512, 1024, 2048 respectively, the average operation time is between 250 and 300 time units. We might conclude that when congestion is high, the depth of the network does not really matter. Although in a network of width 16, a message has to pass 10 times more balancers than in the network with width 2, the congestion faced at the first balancer visited is responsible for slowing down the operation.

On the other extreme, when there is virtually no congestion, the performance of networks with the same width is roughly the same; curves of the same width (e.g. width 16 for $n = 512$ and $n = 1024$) have similar operation time. Sometimes the curves cross (e.g. those of $n = 1024$ resp. $n = 2048$). Therefore Figure 8.2 looks "bumpy".
Working Benchmark

Initiating an inc operation immediately upon receiving the value of the last inc operation is not very realistic. In practical situations, a processor often performs some local computation depending on the last value before initiating the next inc operation. In this benchmark, each processor executes a loop that initiates an inc operation, waits for the result, and performs some local computation (working) before initiating the next inc operation. Again, we measure the average operation time of an inc operation; [HLS95] use a similar benchmark.

We have a system of 1024 processors. The calculation time $t_c$ is distributed exponentially (the expected calculation time is 1 time unit). The message time is distributed uniformly from 3 to 7 time units; that is, the expected message time is 5 time units. The only parameter of the benchmark is the time for the local computation (the working time): For this test, we decided to have it distributed exponentially with expected time between 1 and 8192 time units. Figure 8.3 shows the average operation time of the four counting schemes.
[Figure 8.3: Working Benchmark. Average operation time of the four counting schemes as a function of the expected working time.]
A first impression is that Figure 8.3 is approximately a mirror image of Figure 8.1; the short explanation is that a high working time plays the same role as a small number of processors. Let us have a closer look: As long as the working time is small, the operation time is the same as if there were no working at all, because practically every processor is busy with the counting business. The Central Scheme is not competitive; its curve is not visible in the chart, as its operation time is close to 1024 time units as long as the working time is small enough.

On the other extreme, when the working time is high (more than 1024 time units), all schemes show equal performance, since every scheme degenerates to the Central Scheme. For an expected working time equal to 1024, one can recognize the Diffracting Tree anomaly that was described in the loop benchmark.

One simple thought helps when studying this benchmark: an arbitrary processor p is either working or counting (waiting for the completion of its inc operation). Let the expected working time be $W$ and the expected operation time be $T$. Then, in steady state, the probability that an arbitrary processor is working is $\frac{W}{T+W}$; the probability that an arbitrary processor is counting is $\frac{T}{T+W}$. To give an example: In Figure 8.3 you see that for expected working time $W = 64$, the average operation time $T$ of the Combining Tree is close to 60. Therefore, approximately half of the processors are working, half of them are counting. In other words, when the working time is 64 for 1024 processors in this setting, one should choose the same parameters for the Combining Tree as in the case of a simple loop benchmark with 512 processors only.
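The steady-state argument is easy to reproduce; here is a minimal sketch (our own illustration, with the numbers read off Figure 8.3):

    public class SteadyState {
        public static void main(String[] args) {
            double W = 64, T = 60;           // expected working resp. operation time
            double pCounting = T / (T + W);  // probability a processor is counting
            // with 1024 processors, roughly half are counting at any time
            System.out.printf("counting: %.2f, busy processors: %.0f%n",
                              pCounting, 1024 * pCounting);
        }
    }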
Message Time Benchmark

As we have seen in the analysis, the Combining Tree saves a logarithmic factor in the expected operation time compared to the Diffracting Tree (confer Theorem 5.11 and Theorem 6.9, respectively). In this benchmark, we inspect whether this logarithmic factor is visible in practice or hidden by the constants that are ignored by asymptotical analysis. We compare the Combining Tree with the Diffracting Tree only, because the expected operation times of the Central Scheme and the Bitonic Counting Network differ not only at this logarithmic factor.
As for the loop benchmark, each processor executes a loop that initiates an inc operation whenever it received the value for its previous inc operation. We measure the average operation time of an inc operation, the time from initiation to completion. The calculation time is deterministically 1 time unit. The only parameter of the benchmark is the message time; it shall be distributed deterministically as well, with values from 1 up to 128 time units. Figure 8.4 shows the average operation time of the two counting schemes.
[Figure 8.4: Message Time Benchmark. Average operation time of the Diffracting Tree and the Combining Tree as a function of the expected message time.]
On the left-hand side of the chart, where the message time is of the same magnitude as the calculation time, both schemes, the Diffracting Tree and the Combining Tree, have comparable performance. When message time and calculation time are equal, the Combining Tree is binary, and thus the height of the Combining Tree is about the height of the Diffracting Tree; note however that in the Combining Tree, a message traverses the height twice (up and down) instead of only once as in the Diffracting Tree. Proceeding to the right, we see that the performance difference between Diffracting Tree and Combining Tree gets bigger and bigger. In fact, the difference of the performance between $t_m = 1$ and $t_m = 128$ is growing with a factor of about 1.9, which is close to the factor of 2 promised by our theoretical analysis. In Chapter 6, we have seen that in the Combining Tree, the number of children per node should be chosen proportional to the ratio of message time over calculation time. Therefore, whenever messages are relatively slow, the Combining Tree is flat and has low height. Thus, the Combining Tree should always be the first choice when the ratio of message time and calculation time is high, that is, in systems where the processors are loosely coupled (e.g. Internet).
We have seen that the ratio of the performances is approximately growing with the expected factor 2. On the other hand, as our analysis delivers asymptotical results only, important constants are missing. Therefore, the ratio of the absolute operation times of the Combining Tree and the Diffracting Tree does not show a factor 2. In the Combining Tree, for example, the logarithm basis (coming from the height of the tree) is better approximated by $t_m/t_c + 6$. Therefore, in the Loop Benchmark, the gap between the Combining Tree and the Diffracting Tree is smaller than "expected".
Distribution Benchmark

From the description of the foregoing benchmarks, you know that we have implemented several distributions for the involved random variables (e.g. $t_m$ and $t_c$). The aim of this last benchmark is to examine how the distributions influence the performance of a counting scheme.

As for the loop benchmark, each processor executes a loop that initiates an inc operation whenever it received the value for its previous inc operation. We measure the average operation time of an inc operation, the time from initiation to completion.
We test systems with $n = 2,\dots,16384$ processors. The calculation and message times are deterministic (1 resp. 5 time units), exponentially distributed (Markov with parameters 1 resp. 5), uniformly distributed (the time to handle a message is distributed uniformly between 0 and 2 time units, the time for a message to be transferred uniformly between 0 and 10 time units), or Gauss distributed (non-negative, with expected values 1 resp. 5 and variance 1).
The average operation times of all 16 combinations of the distributions for calculation and message time are within a few percent (mostly less than 3%). Having deterministic times is usually slightly better than the average, having exponential times is usually slightly worse than the average. This is not astounding, as Lemma 2.6 showed that a deterministic handler is somewhat better than an exponential handler. Therefore, our theoretical analysis using exponentially distributed random variables is a good approximation for many distributions and therefore a good yardstick for reality. Even more, one might argue that our analytical performance analysis is somewhat on the worst case side.

Since our lower bound (Theorem 7.13) matches the performance of the Combining Tree (and the Counting Pyramid) for exponentially distributed variables (Theorems 6.9 and 6.17), and we have observed by simulation that the exponential distribution is (slightly) worse than deterministic variables, one might say that the Combining Tree and the Counting Pyramid are indeed asymptotically optimal.
Although efficiency is very important, there are other qualities a counting scheme can have; we discuss a couple of the more substantial ones in the next chapter.
Chapter 9

Discussion

Having studied the efficiency of counting schemes in the previous chapters, we will now take a look at issues that are beyond pure speed. First, we will briefly see what applications a counting scheme has; many processor coordination problems and distributed data structures are closely related to counting. Then, in Section 9.2, we will summarize important performance related and non-performance related properties of the presented counting schemes. Additionally, we will bring up issues that went short in the previous chapters.
9.1 Applications

"I administer them. I count them and recount them. It is difficult. But I am a serious man!"
- Antoine de Saint-Exupery, Le Petit Prince

An illustrative application of a distributed counting scheme is an Internet based flight reservation system: a reverse counting scheme is counting down the number of available seats in a plane and is returning a seat number whenever somebody is initiating the reserve operation. Obviously, a single central reservation computer will become a bottleneck when enough customers reserve concurrently. In fact, the real applications of distributed counting schemes are more subtle.
Loop Index Distribution

Loop index distribution (see [HLS95]) is a dynamic load balancing technique where processors dynamically choose independent loop iterations and execute them concurrently. An intuitive example is rendering the Mandelbrot set: The screen is partitioned into rectangles, and a loop iteration covers one rectangle in the screen. As the rectangles are independent of each other, they can be rendered in parallel. Some rectangles however take longer than others; it is therefore not advisable to assign the rectangles to the processors beforehand, but to do it dynamically. Numbering the rectangles and using a counting scheme to assign the numbers dynamically to processors is a "picturesque" application of distributed counting schemes, as the sketch below illustrates. A remaining difficulty however is to choose the size of the rectangles.
A distributed counting scheme may be used as the basis of many important distributed data structures. In the following, we will present some of them.
Queue

Having two counters, one can easily implement a queue with the operations enqueue and dequeue, which append an object to the end of the queue resp. remove the first object from the queue. The object that was enqueued as the $i$-th object in the queue is stored at processor $i \bmod n$, when having $n$ processors in the system. Initially, both counter values (named first and last) are initiated with 0. Whenever a processor p wants to enqueue an object o, p initiates an inc operation on last (receiving $l$) and sends o to processor $l \bmod n$. Whenever a processor p wants to dequeue an object, p initiates an inc operation on first (receiving $f$) and sends a request message to processor $f \bmod n$; a sketch follows.
Stack

Implementing a stack (supporting the two operations push and pop) is more demanding, as one needs a test-and-decrement operation in addition to the test-and-increment operation. [ST95] showed for the Diffracting Tree how to implement a test-and-decrement operation by means of "anti-tokens", that is, messages that are sent to the "wrong" (upper instead of lower and vice versa) wire. The very same idea can be used for Counting Networks and the Combining Tree; in the Counting Pyramid, the concept of adding/subtracting an arbitrary number to/from the current system value is included anyway. Similar to the queue, we push or pop an object by incrementing resp. decrementing a counter and storing the object at resp. requesting it from the processor given by the received value.

One can improve the behavior of the stack with the elimination technique: In the Diffracting Tree, whenever a message from a push operation (token) and a message from a pop operation (anti-token) meet in a prism, they can exchange information immediately and both messages must not be forwarded anymore. Whenever push and pop operations are initiated about as often, the performance will improve dramatically [ST95].

The same elimination technique can be applied to the Combining Tree (and the Counting Pyramid): Assume that $k$ messages are to be combined, where the $i$-th message has value $m_i$ ($m_i > 0$ stands for $m_i$ push operation requests, $m_i < 0$ for $-m_i$ pop operation requests). A single message with the value $\sum_{i=1}^{k} m_i$ has to be sent upward. Every other pop operation finds a push operation partner such that both can be satisfied without going to the root, as the following sketch illustrates.
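A minimal sketch of this combining step (names are ours): only the sum of the signed requests travels upward, and matched push/pop pairs are satisfied locally.

    // Elimination during combining: positive values are push requests,
    // negative values are pop requests; only the sum must travel upward.
    public class CombiningElimination {
        static int combine(int[] m) {
            int sum = 0;
            for (int mi : m) sum += mi;  // pops find push partners implicitly:
            return sum;                  // matched pairs cancel, satisfied locally
        }

        public static void main(String[] args) {
            int[] requests = { +3, -2, +1, -1 };   // pushes and pops to combine
            System.out.println(combine(requests)); // only "+1" is sent upward
        }
    }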
Job Pool

The elimination technique can right away be used for producer/consumer tasks. The job pool is a prestigious problem in the load balancing community: Processors produce jobs which are added to the pool, and processors consume jobs from the pool. The producing and consuming patterns are incalculable. In [DS97], it is conjectured that job pools based on counting schemes might be effective competitors of heuristic local load balancing schemes such as [RSU91, Hen95].
Fetch-And-$\Phi$

Already in [GK81], it is recognized that the combining idea can be generalized beyond counting: they propose a general fetch-and-$\Phi$ operation, where $\Phi$ may be any associative binary operation defined on a set. Routinely used instances of fetch-and-$\Phi$ are test-and-set, swap, compare-and-swap, test-and-add, fetch-and-or, and fetch-and-max, among others [BMW85, Wil88]. Therefore, many common coordination structures such as mutual exclusion can be realized elegantly with the Combining Tree or the Counting Pyramid. It is not possible to accomplish these generalized coordination structures with Counting Networks or the Diffracting Tree. A sketch of the generalization follows.
Outlook

There are other conceivable applications of the generalized fetch-and-$\Phi$ operation. An open question is whether a distributed priority queue [DP92, RCD+94] can be tuned to more adaptiveness by means of counting schemes. This could possibly reveal a method to establish bottleneck-free parallel best-first branch&bound algorithms [MDFK95]. Other applications might be a distributed dictionary resp. a distributed database [Lit80, Sto84, Pel90, KW95] that has really improved performance even when there are hot spots (that is, highly accessed variables/tuples).
9.2 Properties

Apparently, this work focussed on performance; we never presented or analyzed a counting scheme without having an eye on its expected efficiency. Other interesting and important properties of a counting scheme, such as linearizability and adaptiveness, were subject of our discussion, but never in the heart of it. In the following, we recapitulate various properties and briefly discuss how nicely the counting schemes behave in regard to them.
Performance

We have been analyzing thoroughly the performance of the presented counting schemes by means of queueing theory (confer Sections 2.4, 4.4, 5.2, 6.2, and 6.4, respectively) as well as simulation (see Chapter 8).

Summarizing, one can say that in both worlds, the theoretical as well as the practical, the Combining Tree (and the Counting Pyramid as a Combining Tree derivate) has slightly better expected performance than the Diffracting Tree, which in turn is better than the Bitonic Counting Network, and that the Central Scheme is not competitive when access is high. To sum up, the expected operation time T of an inc operation is, simplified (by assuming that local computation costs 1 time unit on average, transferring a message costs $m$ time units on average, and every processor initiates the inc operation very frequently):

Central Scheme: $T = \Theta(m + n)$
Bitonic Counting Network: $T = \Theta(m \log^2 n)$
Diffracting Tree: $T = \Theta(m \log n)$
Combining Tree and Counting Pyramid: $T = \Theta(m \log_m n)$
For current parallel machines, the ratio of message time over calculation time is best approximated with values of 1'000...50'000 [DD95, Sch98]. Therefore, the Combining Tree and the Counting Pyramid gain about a factor of 10 in expected performance compared to the Diffracting Tree. Note that we did not use such large ratios in the Message Time Benchmark (see Section 8.2) because of simulation limitations. From Chapter 7 we know that the Combining Tree and the Counting Pyramid are asymptotically "optimal". Moreover, the Combining Tree and the Counting Pyramid are the only counting schemes known to be applicable and efficient when processors initiate the inc operation very frequently.
Linearizability

Besides performance, linearizability is a key property of a counting scheme, since it considerably simplifies arguing about correctness [HW90, AW94]. We have seen that both, the Diffracting Tree and the Bitonic Counting Network, are not linearizable (see Lemma 4.10 for the Bitonic Counting Network and [HSW96] for Counting Networks in general, see Lemma 5.5 for the Diffracting Tree). The Central Scheme and the Combining Tree (Counting Pyramid) on the other hand are linearizable (see Theorem 2.1 for the Central Scheme and the corresponding result for the Combining Tree). Therefore, the Combining Tree (Counting Pyramid) is the only scheme that is linearizable in its very nature and efficient at the same time.
Flexibility

Various applications demand that counting schemes allow more powerful operations than the test-and-increment operation exclusively (confer Section 9.1). To implement stacks (and pools), one needs both, test-and-increment- and test-and-decrement-operations; [ST95] extended the Diffracting Tree to the Elimination Tree such that stacks and pools can be implemented. The Combining Tree (and the Counting Pyramid) is able to add or subtract any value, and supports many operations beyond that; the only restriction on the associative binary operation is that it is combinable without using too many bits. In this sense, it is the most flexible scheme. There is also another advantage of having the possibility for powerful operations: Imagine a system where the load is extremely high: processors initiate the inc operation without waiting for the completion of their previous inc operation; [YTL86] call this an "unlimited" access system. Using the Combining Tree (or a Counting Pyramid), processors can initiate inc operations in bulk, asking for several counter values at once; this is not easily possible with the Diffracting Tree or Counting Networks.
Adaptiveness

Any counting scheme of general relevance should be adaptive; the scheme should perform well under high as well as under low load, such that it can be used as a general purpose coordination primitive. The Central Scheme has no system parameters; it is not adaptive at all.

For the Counting Pyramid, we stated that it adapts instantaneously to the access pattern, since the floor to which a processor p sends its request depends solely on the access frequency of processor p; this is not the case for all other counting schemes. For the Diffracting Tree, [DS97] have proposed an appealing adaptive version, the Reactive Diffracting Tree. As the name indicates, the Reactive Diffracting Tree does not adapt without delay; it has to change its structure in order to fulfill a new access pattern. Since changing the structure needs time, the Reactive Diffracting Tree is not suitable for scenarios where access patterns change very frequently. The very same idea, the folding and unfolding of nodes, could be used for the Combining Tree also; the Counting Pyramid however is a more promising extension of the Combining Tree in terms of adaptiveness.

It is rather difficult to modify the Bitonic Counting Network in order to make it adaptive, because [AA92, Klu94] showed that there is no Counting Network where the number of wires is not a power of two, as long as we have the restriction that balancers have exactly two output wires. A reactive Counting Network in the spirit of the Reactive Diffracting Tree is therefore hard to construct, since one cannot increment or decrement the number of wires by one. One could however make a Counting Network adaptive according to access in the spirit of the Counting Pyramid: Let processors that do not initiate the inc operation too often skip the first balancers. [HW98] discussed ideas towards this end.
One promising layout is to have $k+1$ classes $c$, where $c = 0,\dots,k$ ($k = \log n$). Every class $c$ consists of $2^c$ entry wires (confer Figure 9.1 for $k = 3$; note that the depth of the networks in the figure is not proportional to their real depth).

[Figure 9.1: An adaptive Bitonic Counting Network with classes 0 to 3, Bitonic Counting Networks $B_c$, and Merger networks $M_c$.]
First, messages of class $c$ enter a Bitonic Counting Network of width $2^c$ ($B_c$ for short). The Counting Networks are followed by Merger networks; more precisely, the output wires of $B_c$ and the output wires of $M_{c+1}$ are the input wires of $M_c$, for $c = 0,\dots,k-2$. The output wires of $B_k$ and $B_{k-1}$ enter $M_{k-1}$ directly. A message enters the network on a random wire of its class; it is thus possible to choose the class according to the access frequency without overloading any balancer. A message of class $c$ has to pass $O(c^2)$ balancers in $B_c$ and $O(ck)$ balancers in the following Mergers. Thus, messages that enter the network on a wire of a constant class will only pass $O(k)$ balancers.
As for the Counting Pyramid, a processor p chooses a random input wire of an appropriate class to send the request to. If processor p initiates the inc operation very frequently, processor p must choose a high class; if processor p initiates the inc operation rather rarely, processor p can choose a low class. Thus, the simplified operation time varies from $O(m \log n)$ (when the load is low) to $O(m \log^2 n)$ (when the load is high).

For the Diffracting Tree and the Bitonic Counting Network we're done: all structural parameters can be adapted (confer Section 5.2). The Combining Tree (as well as the Counting Pyramid) makes use of one more system parameter, the ratio of message time over calculation time (see Section 6.4). When this ratio is changing, the Counting Pyramid should adapt the number of bricks per floor (determined by $m^f$ for floor $f$) accordingly. We can achieve this by setting up a Counting Pyramid with $m := 2$ initially. Later, we use the floors $0, i, 2i, 3i, \dots$ only, with $i := \lfloor \log_2 m \rfloor$. By measuring $t_m$ and $t_c$, e.g. by sending a dummy message around a cycle of processors and doing as many local computation steps as possible in the meantime, a processor can estimate a good value for $m$. The top of the Counting Pyramid can change the Counting Pyramid accordingly from time to time.

Waiting
We have seen that both, the Diffracting Tree and the Combining Tree (Counting Pyramid), use the concept of waiting in order to diffract/combine messages. At first sight, a counting scheme using waiting needs a notion of time, which is not available in an asynchronous system. Practically however, we can realize waiting nevertheless by sending a dummy message around a cycle of processors. Whenever the dummy message is received by the initiator, the waiting time is over. By choosing the cycle accordingly (for the Counting Pyramid: a cycle of 6 processors), we can achieve an asymptotically correct expected waiting time, as the following sketch indicates.
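A minimal sketch of this trick (all names are ours; in the DVM, receive would be an asynchronous message delivery rather than a direct call):

    // Waiting without a clock: a dummy message travels around a cycle of k
    // processors; its round trip takes about k*(tc+tm) in expectation, so its
    // return signals that the waiting time is over.
    interface Node { void receive(Object msg); }

    class Dummy { }                     // the dummy message type

    class CycleNode implements Node {
        Node next;                      // successor in the cycle
        public void receive(Object msg) {
            if (msg instanceof Dummy) next.receive(msg);  // forward along cycle
        }
    }

    class CombiningNode implements Node {
        Node cycleStart;                // first processor of the cycle
        void startWaiting() { cycleStart.receive(new Dummy()); }
        public void receive(Object msg) {
            if (msg instanceof Dummy) flush();  // dummy returned: waiting over
        }
        void flush() { /* e.g. send the combined upward message now */ }
    }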
In the preceding subsection on adaptiveness, we have seen that a scheme need not know system parameters such as expected message time, expected calculation time, or expected time between two initiations, but can accommodate (reactively or immediately) to (changing) system parameters. One could claim that knowing the expected times for the system parameters is only necessary when analyzing the performance, not for setting up a scheme. Still, one has to be careful about alleging that we are in a totally asynchronous message passing model (see [Lyn96] among many others): we are dealing exclusively with expected (average case) performance, whereas the term "asynchronous" is usually connected with worst case (impossibility) results.

Let us seize the opportunity to discuss the connection of our performance evaluation model and the shared memory contention model by Cynthia Dwork, Maurice Herlihy, and Orli Waarts [DHW93, DHW97]. In order to assess the performance, they introduce an adversary scheduler. By summing up the total number of "stalls" of accesses in the worst case and
dividing by the number of accesses, they end up with an amortized worst case performance measure. By setting $m = 1$ in our model and letting the scheduler act at random, the consequence is, so to speak, an average case performance measure. In our simplified model (confer the subsection on performance), both performance models (our average case and the worst case model of [DHW93, DHW97]) have the same $\Theta(n)$ efficiency for the Central Scheme and (more surprisingly) $\Theta(\log^2 n)$ for the Bitonic Counting Network. The two of all presented counting schemes that use the waiting concept are not competitive when there is a strong adversary scheduler, although they have a better expected performance: both have $\Theta(n)$ efficiency when the adversary scheduler queues the messages at the root resp. the toggle of the root in the Combining Tree resp. the Diffracting Tree. Is waiting therefore a good ingredient whenever a system should have good expected performance and we don't care about the worst case?

There has been considerable research on wait-free (and block-free) computation (see [MT97] among others). This research, however, was done in the shared memory model, which differs from our message passing model significantly; the notion of waiting there is completely unlike ours. In shared memory, a wait-free implementation guarantees that a processor must not wait for another processor. This concept cannot be transferred directly to our message passing model, since by definition, processors must wait for messages from other processors in order to complete an inc operation (there is no reliable "passive" memory in a message passing system), even though shared memory can be simulated on a message passing machine [ABD95]. It is certainly interesting whether there is a connection of our conjecture (waiting might be good for expected performance) and the research whether the wait-free property harms performance in shared memory [ALS94].
Note that waiting in the shared memory manner, where a processor p waits for another processor q in order to proceed, might cause an infinite delay (when processor q crashes while others are waiting for it). This leads to two further issues, fault tolerance and simplicity; both are discussed in the following.
Simplicity

As data structures and algorithms get more and more elaborate and complex, having simple and implementable solutions is often a key requirement when designing a structure/algorithm.

At first sight, the Counting Networks seem to be much more complex than the Combining Tree, as they have a very rich structure and it is relatively intricate to show their correctness. This structural complexity however does not contradict our notion of simplicity. Implementing a message handler of a processor in a Counting Network is a moderate task: Handling a message is basically consuming it and alternately forwarding it to one of two followup processors (see the sketch below). Such a handler could be realized in hardware without much effort, a goal that always should be kept in mind when building a basic coordination primitive such as a counting scheme. The finite state machine of a prism element of a Diffracting Tree (see Figure 5.4) is more sophisticated; essentially, the waiting concept introduces different kinds of messages to arrive, which complicates the handler. Especially in the shared memory model, realizing waiting requires several expensive coordination (e.g. compare-and-swap) operations.
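A minimal sketch of such a balancer handler (names are ours): the whole state is one toggle bit, and handling a message is a constant amount of work.

    // A balancer's message handler: consume a message and forward it to one
    // of two follow-up processors, alternating according to a toggle bit.
    class Balancer {
        private boolean toggle = false;          // the single bit of state
        private final int upperWire, lowerWire;  // the two follow-up processors

        Balancer(int upper, int lower) { upperWire = upper; lowerWire = lower; }

        // called whenever a message arrives at this balancer
        void handle(Message msg, Network net) {
            int dest = toggle ? lowerWire : upperWire;  // alternate the wires
            toggle = !toggle;
            net.send(dest, msg);    // forward; no other computation needed
        }
    }
    interface Network { void send(int dest, Object msg); }
    class Message { }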
Although the Combining Tree (and the Counting Pyramid) is very simple at first sight, its message handlers are quite involved; it would be a considerable amount of work to build the Combining Tree in hardware.
Fault Tolerance

Distributed systems that tolerate faults are a very popular object of study. Following the impossibility of asynchronous consensus [FLP85], one can show that no counting scheme counts correctly in an asynchronous system in the presence of one faulty processor, because two processors that are to receive consecutive counter values would have to reach consensus on who gets which value. On the other side, Counting Networks seem to be much more fault tolerant than the other three schemes, because they are up to now the only counting schemes known where every processor is of equal importance and none sticks out. In other words, when an enemy can choose one processor to break, the Central Scheme, the Combining Tree (Counting Pyramid), and the Diffracting Tree fail: the enemy chooses the central processor, the root, resp. the toggle of the root balancer. In a Counting Network of width $w$, on the other hand, at most $2/w$ of the inc operations pass any fixed balancer; thus, when one balancer fails, most initiated inc operations are not affected at all, and only an expected fraction of $2/w$ of them is lost.
One can make the Counting Pyramid more fault tolerant, similar to Counting Networks. To be more precise, here is a fault tolerant version of the Counting Pyramid [HW98]: The top (floor 0) is cut off and replaced with a Counting Network of width $m$ (for simplicity, assume that $m$ is a power of 2). Bricks of floor 1 behave similar as in the original Counting Pyramid: Still, they combine several incoming requests and send one combined message to their Bitonic Counting Network input wire. As this request is not asking for 1 but $z$ counter values, it is split up at the balancers (the first balancer turns it into messages requesting $\lceil z/2 \rceil$ and $\lfloor z/2 \rfloor$ counter values, respectively, which are forwarded accordingly). When a request for several values arrives, the modulo counters will return a modulo interval of values to the requesting brick on floor 1. The simplified operation time is $T = O(m \log_m n + m \log^2 m)$ for the fault tolerant Counting Pyramid. A processor break down results in an expected loss of $1/m$ of the messages. If more fault tolerance is needed, one should cut the Counting Pyramid on a floor with a higher number. Please note that introducing a Counting Network as the top of the Counting Pyramid implies dropping the linearizability property. It is an interesting question to what extent fault tolerance and linearizability contradict each other; a good starting point towards this end is [HSW96].
Overview

The following table summarizes the most important properties of the four presented counting schemes. A good/medium/bad assessment for a specific property is marked by +/o/-, respectively.

Quality           Central   Bitonic Counting   Diffracting   Counting
                  Scheme    Network            Tree          Pyramid
Performance       -         o                  o             +
Linearizability   +         -                  -             +
Flexibility       +         -                  o             +
Adaptiveness      -         -                  o             +
Simplicity        +         +                  o             o
Fault Tolerance   -         +                  -             o
Besides pure performance, we have discussed various other properties of the four presented counting schemes. The family of Counting Networks and the Counting Pyramid are the most interesting schemes we have seen. The Counting Pyramid may be of practical relevance, since it succeeds in expected performance as well as linearizability, flexibility and adaptiveness. The strength of the Counting Networks is more on the theory side: they are the only decentral (and therefore fault tolerant) schemes that are known up to now. Moreover, they are the only schemes that do have good worst case performance [DHW93, DHW97].

Obviously, the last word is not spoken, neither on the theoretical nor on the practical side; it would be interesting to study more properties (e.g. fault tolerance, worst case vs. average case). Especially the relation to the shared memory model deserves further study.
Bibliography
[AA92] Eran Aharonson and Hagit Attiya. Counting networks with arbitrary fan-out. In Proceedings of the 3rd Annual ACM-SIAM Symposium on Discrete Algorithms, pages 104-113, Orlando, FL, USA, January 1992.

[ABD95] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing systems. Journal of the ACM, 42(1):124-142, January 1995.

[ADT95] Yehuda Afek, Dalia Dauber, and Dan Touitou. Wait-free made fast. In Proceedings of the 27th Annual ACM Symposium on Theory of Computing, pages 538-547, New York, May 1995.

[AHS91] James Aspnes, Maurice Herlihy, and Nir Shavit. Counting networks and multi-processor coordination. In Proceedings of the Twenty-Third Annual ACM Symposium on Theory of Computing, pages 348-358, New Orleans, Louisiana, 6-8 May 1991.

[AHS94] James Aspnes, Maurice Herlihy, and Nir Shavit. Counting networks. Journal of the ACM, 41(5):1020-1048, September 1994.

[AKS83] Miklos Ajtai, Janos Komlos, and Endre Szemeredi. An O(n log n) sorting network. In Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, pages 1-9, Boston, Massachusetts, 25-27 April 1983.

[ALS94] Hagit Attiya, Nancy Lynch, and Nir Shavit. Are wait-free algorithms fast? Journal of the ACM, 41(4):725-763, July 1994.
[And90] Thomas Anderson. The performance of spin lock alternatives for shared memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6-16, January 1990.

[And91] Gregory R. Andrews. Concurrent Programming: Principles and Practice. Benjamin/Cummings, Redwood City, 1 edition, 1991.

[AW94] Hagit Attiya and Jennifer L. Welch. Sequential consistency versus linearizability. ACM Transactions on Computer Systems, 12(2):91-122, May 1994.

[Bat68] K. E. Batcher. Sorting networks and their applications. In Proceedings of the AFIPS Spring Joint Computer Conference, pages 307-314, 1968.

[BC84] … Prentice-Hall, 1984.

[BG87] D. Bertsekas and R. Gallager. Data Networks. Prentice-Hall, 1987.

[Bin77] … Formeln und Tafeln. Orell Füssli Verlag, 1977.

[BK92] Amotz Bar-Noy and Shlomo Kipnis. Designing broadcasting algorithms in the postal model for message-passing systems. In Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 13-22, San Diego, California, June 29-July 1, 1992.
[BM96] Costas Busch and Marios Mavronicolas. A combinatorial treatment of balancing networks. Journal of the ACM, 43(5):794-839, September 1996.

[BM98] Costas Busch and Marios Mavronicolas. … In Proceedings of the Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, April 1998.

[BMW85] W. C. Brantley, K. P. McAuliffe, and J. Weiss. RP3 processor-memory element. In Proceedings of the 1985 International Conference on Parallel Processing, pages 782-789, 1985.

[CDK95] George Coulouris, Jean Dollimore, and Tim Kindberg. Distributed Systems. Addison-Wesley, Wokingham, 2 edition, 1995.

[CGHJ93] Robert M. Corless, Gaston H. Gonnet, David E. G. Hare, and David J. Jeffrey. On the Lambert W function, 1993.

[CKP+96] David E. Culler, Richard M. Karp, David Patterson, Abhijit Sahay, Eunice E. Santos, Klaus Erik Schauser, Ramesh Subramonian, and Thorsten von Eicken. LogP: A practical model of parallel computation. Communications of the ACM, 39(11):78-85, 1996.
[CLR92] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Press and McGraw-Hill Book Company, 6th edition, 1992.

[Coh83] L. A. Cohn. A conceptual approach to general purpose parallel computer design. PhD thesis, Columbia University, New York, 1983.

[DD95] J. J. Dongarra and T. Dunigan. Message-passing performance of various computers. Technical Report UT-CS-95-299, Department of Computer Science, University of Tennessee, July 1995.

[DHW93] Cynthia Dwork, Maurice Herlihy, and Orli Waarts. Contention in shared memory algorithms. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pages 174-183, San Diego, California, 16-18 May 1993.

[DHW97] Cynthia Dwork, Maurice Herlihy, and Orli Waarts. Contention in shared memory algorithms. Journal of the ACM, 44(6):779-805, 1997.

[Dij65] Edsger W. Dijkstra. Solution of a problem in concurrent programming control. Communications of the ACM, 8(9):569, September 1965.

[DP92] Narsingh Deo and Sushil Prasad. Parallel heap: An optimal parallel priority queue. The Journal of Supercomputing, 6(1):87-98, March 1992.

[DPRS89] Martin Dowd, Yehoshua Perl, Larry Rudolph, and Michael Saks. The periodic balanced sorting network. Journal of the ACM, 36(4):738-757, 1989.
[DS97] Giovanni Della-Libera and Nir Shavit. Reactive diffracting trees. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 24-32, 1997.

[EL75] Paul Erdős and László Lovász. Problems and results on 3-chromatic hypergraphs and some related questions. In Infinite and Finite Sets, volume 10, 1975.
[Fis87]
[FLP85] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374-382, April 1985.

[GB85] Hector Garcia-Molina and Daniel Barbara. How to assign votes in a distributed system. Journal of the ACM, 32(4):841-860, October 1985.

[GGK+83] Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU Ultracomputer: Designing a MIMD, shared memory parallel computer. IEEE Transactions on Computers, C-32(2):175-189, 1983.

[GGMM88] Jonathan Goodman, Albert G. Greenberg, Neal Madras, and Peter March. Stability of binary exponential backoff. Journal of the ACM, 35(3):579-602, 1988.

[GH81] Donald Gross and Carl M. Harris. Fundamentals of Queueing Theory. John Wiley and Sons, 1981.

[GK81] Allan Gottlieb and Clyde P. Kruskal. Coordinating parallel processors: A partial unification. Computer Architecture News, 9(6):16-24, October 1981.

[GLR83] Allan Gottlieb, Boris D. Lubachevsky, and Larry Rudolph. Basic techniques for the efficient coordination of very large numbers of cooperating sequential processors. ACM Transactions on Programming Languages and Systems, 5(2):164-189, April 1983.
[GMR94] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. The QRQW PRAM: Accounting for contention in parallel algorithms. In Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 638-648, Arlington, Virginia, 1994.

[GVW89] James R. Goodman, Mary K. Vernon, and Philip J. Woest. Efficient synchronization primitives for large-scale cache-coherent multiprocessors. In Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 64-75, Boston, Massachusetts, 3-6 April 1989.

[Hen95] Dominik Henrich. … Branch-and-Bound …, 1995.

[Her91] Maurice Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124-149, January 1991.

[HLS95] Maurice Herlihy, Beng-Hong Lim, and Nir Shavit. Scalable concurrent counting. ACM Transactions on Computer Systems, 13(4):343-364, November 1995.

[HSW91] Maurice Herlihy, Nir Shavit, and Orli Waarts. Low contention linearizable counting. In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science, pages 526-535, 1991.

[HSW96] Maurice Herlihy, Nir Shavit, and Orli Waarts. Linearizable counting networks. Distributed Computing, 9(4):193-203, 1996.
[HT90] Maurice P. Herlihy and Mark R. Tuttle. Wait-free computation in message passing systems. In Proceedings of the 9th Annual ACM Symposium on Principles of Distributed Computing, pages 347-362, Quebec City, Quebec, Canada, 1990.

[HW90] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463-492, July 1990.

[HW98] Maurice Herlihy and Roger Wattenhofer. Adaptive counting schemes. Unpublished manuscript, written while visiting Brown University, May 1998.

[Ja'92] Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley, 1992.

[Jac57] J. R. Jackson. Networks of waiting lines. Operations Research, 5:518-521, 1957.

[Jac63] J. R. Jackson. Jobshop-like queueing systems. Management Science, 10, 1963.

[Klu94] Michael Richard Klugerman. Small-Depth Counting Networks and Related Topics. PhD thesis, MIT, Cambridge, MA, September 1994.

[KM96] … Acta Informatica, … 275, 1996.

[Knu73] Donald E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, MA, USA, 1973.
[KP92] Michael Klugerman and C. Gregory Plaxton. Small-depth counting networks. In Proceedings of the 24th Annual ACM Symposium on the Theory of Computing, pages 417-428, Victoria, B.C., Canada, May 1992.

[KT75] S. Karlin and H. M. Taylor. A First Course in Stochastic Processes. Academic Press, 1975.

[Kuc96] David J. Kuck. High Performance Computing: Challenges for Future Systems. Oxford University Press, New York, 1996.

[KW95] Brigitte Kroll and Peter Widmayer. Balanced distributed search trees do not exist. In Algorithms and Data Structures, 4th International Workshop, volume 955 of Lecture Notes in Computer Science, pages 50-61, Kingston, Ontario, Canada, 16-18 August 1995.

[LAB93] Pangfeng Liu, William Aiello, and Sandeep Bhatt. An atomic model for message-passing. In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, 1993.

[Lam74] Leslie Lamport. A new solution of Dijkstra's concurrent programming problem. Communications of the ACM, 17(8):453-455, 1974.
[Lam79] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, September 1979.

[Lam83] Leslie Lamport. Specifying concurrent program modules. ACM Transactions on Programming Languages and Systems, 5(2):190-222, 1983.

[Lei92] Frank Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.

[Lit80] Witold Litwin. Linear hashing: A new tool for files and tables addressing. In International Conference On Very Large Data Bases, pages 212-223, Long Beach, Ca., USA, October 1980.

[LKK86] Gyungho Lee, Clyde P. Kruskal, and David J. Kuck. The effectiveness of combining in shared memory parallel computers in the presence of 'hot spots'. In International Conference on Parallel Processing, pages 35-41, Los Alamitos, Ca., USA, August 1986.

[Lov73] Laszlo Lovasz. Coverings and colorings of hypergraphs. In Proc. 4th Southeastern Conference on Combinatorics, Graph Theory, and Computing, 1973.
[LSST96] Nancy Lynch, Nir Shavit, Alex Shvartsman, and Dan Touitou. Counting networks are practically linearizable. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, pages 280-289, New York, May 1996.

[Lyn96] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann Publishers, Los Altos, CA, USA, 1996.

[Mae85] Mamoru Maekawa. A √N algorithm for mutual exclusion in decentralized systems. ACM Transactions on Computer Systems, 3(2):145-159, 1985.

[Man89] Udi Manber. Introduction to Algorithms: A Creative Approach. Addison-Wesley, 1989.

[MDFK95] … Efficient use of parallel … in practice. Lecture Notes in Computer Science, 1000:62-80, 1995.

[MPT97] Marios Mavronicolas, Marina Papatriantafilou, and Philippas Tsigas. The impact of timing on linearizability in counting networks. In Proceedings of the 11th International Parallel Processing Symposium, 1997.
[MR97] Dahlia Malkhi and Michael Reiter. Byzantine quorum systems. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 569-578, El Paso, Texas, 4-6 May 1997.

[MS91] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, 1991.

[MT97] Shlomo Moran and Gadi Taubenfeld. A lower bound on wait-free counting. Journal of Algorithms, 24(1):1-19, 1997.

[Nei92] Mitchell L. Neilsen. Quorum Structures in Distributed Systems. PhD thesis, Department of Computing and Information Sciences, Kansas State University, 1992.

[Nel95] Randolph Nelson. Probability, Stochastic Processes, and Queueing Theory. Springer, 1995.

[Noy79] Robert N. Noyce. Hardware prospects and limitations, pages 321-337. IEEE, 1979.

[NW98] Moni Naor and Avishai Wool. The load, capacity, and availability of quorum systems. SIAM Journal on Computing, 27(2):423-447, 1998.
[OW93] Thomas Ottmann and Peter Widmayer. Algorithmen und Datenstrukturen. BI-Wissenschaftsverlag, 1993.

[Pap79] Christos H. Papadimitriou. The serializability of concurrent database updates. Journal of the ACM, 26(4):631-653, October 1979.

[Pel90] David Peleg. Distributed data structures: A complexity-oriented view. In Distributed Algorithms, 4th International Workshop, volume 486 of Lecture Notes in Computer Science, pages 71-89, Bari, Italy, 24-26 September 1990.

[PHB90] H. T. Papadopulos, C. Heavey, and J. Browne. Queueing Theory in Manufacturing Systems Analysis and Design. Chapman & Hall, London, 1990.

[PN85] Gregory F. Pfister and Alan Norton. "Hot spot" contention and combining in multistage interconnection networks. IEEE Transactions on Computers, C-34(10):943-948, 1985.

[PW95] David Peleg and Avishai Wool. Crumbling walls: A class of practical and efficient quorum systems. In Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing, pages 120-129, August 1995.

[PW96] David Peleg and Avishai Wool. How to be an efficient snoop, or the probe complexity of quorum systems. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, pages 290-299, May 1996.

[RCD+94] Abhiram Ranade, S. Cheng, E. Deprit, J. Jones, and S. Shih. Parallelism and locality in priority queues. In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing, USA, October 1994.

[Rob93] Thomas G. Robertazzi. Computer Networks and Systems: Queueing Theory and Performance Evaluation. Springer, Berlin, 1 edition, 1993.

[RSU91] Larry Rudolph, M. Slivkin-Allalouf, and Eli Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, pages 237-245, Hilton Head, SC, July 1991.
[Sch98] Gaby Schulemann. Innere Werte - Technik für Supercomputer. In c't Magazin für Computertechnik, volume 11, pages 88-89. Heise Verlag, 1998.

[ST95] Nir Shavit and Dan Touitou. Elimination trees and the construction of pools and stacks. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '95), pages 54-63, Santa Barbara, California, July 1995.

[Sto84] … Database …. Computer Magazine of the IEEE Computer Group Society, C-33(7), July 1984.

[SUZ96] Nir Shavit, Eli Upfal, and Asaph Zemach. A steady state analysis of diffracting trees. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 33-41, Padua, Italy, June 24-26, 1996.

[SZ94] Nir Shavit and Asaph Zemach. Diffracting trees. In Proceedings of the 6th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 167-176, New York, NY, USA, June 1994.

[SZ96] Nir Shavit and Asaph Zemach. Diffracting trees. ACM Transactions on Computer Systems, 14(4), 1996.

[Val90] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1990.

[Wil88] James Wilson. Operating System Data Structures for Shared-Memory MIMD Machines with Fetch-and-Add. PhD thesis, New York University, 1988.

[WW97a] Roger Wattenhofer and Peter Widmayer. Distributed counting at maximum speed. Technical Report 277, ETH Zurich, Departement Informatik, November 1997.
[WW97b] Roger Wattenhofer and Peter Widmayer. An inherent bottleneck in distributed counting, 1997.

[WW97c] Roger Wattenhofer and Peter Widmayer. Towards … decentral …. In … World … Conference, pages 490-496, 1997.

[WW98a] Roger Wattenhofer and Peter Widmayer. The counting pyramid. Technical Report 295, ETH Zurich, Departement Informatik, March 1998.

[WW98b] Roger Wattenhofer and Peter Widmayer. Fast counting with the optimum combining tree. Technical Report 288, ETH Zurich, Departement Informatik, January 1998.

[WW98c] Roger Wattenhofer and Peter Widmayer. An inherent bottleneck in distributed counting. Journal of Parallel and Distributed Computing, 49:135-145, 1998.

[YTL86] Pen-Chung Yew, Niau-Feng Tzeng, and Duncan H. Lawrie. Distributing hot-spot addressing in large scale multiprocessors. In International Conference on Parallel Processing, pages 51-58, Los Alamitos, Ca., USA, August 1986.
Curriculum Vitae

Roger P. Wattenhofer

17.11.1969: Birth in Lachen SZ, Switzerland

1976-1984: Primary School in Siebnen SZ

1984-1989: …

1990-1994: Studies in Computer Science at the ETH Zurich. Subsidiary subject Operations Research. Special training on System Software and Theoretical Computer Science. Practical projects in Graph Theory and Cryptology. Engineering experience at … AG. Diploma Thesis "Space Filling Curves".

1995-present: Research and Teaching Assistant at the Computer Science Department, ETH Zurich. Research in the fields of distributed and spatial data structures and algorithms. Teaching in data structures and algorithms ("Prinzipien des Algorithmenentwurfs"), simulation, and theory of parallel and distributed algorithms. Supervising several computer science master theses. Colloquium Chair of SIROCCO 97, the 4th International Colloquium on Structural Information and Communication Complexity. Co-Organizer of the Swiss Championship in Informatics, 1996-1998, and of the Swiss Olympic Team in Informatics.