Peer-To-Peer Systems and Applications
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
New York University, NY, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
Ralf Steinmetz Klaus Wehrle (Eds.)
Peer-to-Peer Systems
and Applications
Volume Editors
Ralf Steinmetz
TU Darmstadt
KOM - Multimedia Communications Lab
Merckstr. 25, 64283 Darmstadt, Germany
E-mail: [email protected]
Klaus Wehrle
Universität Tübingen
Protocol-Engineering and Distributed Systems Group
Morgenstelle 10 c, 72076 Tübingen, Germany
E-mail: [email protected]
CR Subject Classification (1998): C.2, H.3, H.4, C.2.4, D.4, F.2.2, E.1, D.2
ISSN 0302-9743
ISBN-10 3-540-29192-X Springer Berlin Heidelberg New York
ISBN-13 978-3-540-29192-3 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2005
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Boller Mediendesign
Printed on acid-free paper SPIN: 11530657 06/3142 543210
This book is dedicated to our children:
Jan, Alexander,
Felix, Lena, Samuel & Julius
Foreword
Ion Stoica (University of California at Berkeley)
With this book, Steinmetz and Wehrle have made a successful attempt
to present the vast amount of knowledge in the Peer-to-Peer field, accumulated
over the last few years, in a coherent and structured fashion.
The book includes articles on the most recent developments in the field. This
makes the book equally useful for readers who want an up-to-date
perspective on the field and for researchers who want to enter it.
The combination of traditional Peer-to-Peer designs and applications with
the discussion of their self-organizing properties and their impact on other
areas of computer science makes this book a worthy addition to the Peer-to-
Peer field.
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Why We Wrote This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Structure and Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Teaching Materials and Book Website . . . . . . . . . . . . . . . . . . . . . 5
1.4 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4. Application Areas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Storage Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5 Processor Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.2.3 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.2.4 Data Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.3 DHT Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.3.2 Node Arrival . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.3.3 Node Failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.3.4 Node Departure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4 DHT Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4.1 Routing Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.4.2 Storage Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.4.3 Client Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Part V. Self-Organization
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
List of Authors
Ion Stoica
645 Soda Hall
Computer Science Division
University of California, Berkeley
Berkeley, CA 94720-1776
USA

Klaus Wehrle
Universität Tübingen
Protocol-Engineering & Distributed Systems Group
Morgenstelle 10c
72076 Tübingen
Germany

Ralf Steinmetz
TU Darmstadt
KOM – Multimedia Communications
Merckstraße 25
64283 Darmstadt
Germany

Jörg Eberspächer
TU München
Institute of Communication Networks
Arcisstraße 21
80290 München
Germany

Detlef Schoder
Universität zu Köln
Seminar für Wirtschaftsinformatik, insb. Informationsmanagement
Pohligstr. 1
50969 Köln
Germany

Rüdiger Schollmeier
TU München
Institute of Communication Networks
Arcisstraße 21
80290 München
Germany

Kai Fischbach
Universität zu Köln
Seminar für Wirtschaftsinformatik, insb. Informationsmanagement
Pohligstr. 1
50969 Köln
Germany

Christian Schmitt
Universität zu Köln
Seminar für Wirtschaftsinformatik, insb. Informationsmanagement
Pohligstr. 1
50969 Köln
Germany

Simon Rieche
Universität Tübingen
Protocol-Engineering & Distributed Systems Group
Morgenstelle 10c
72076 Tübingen
Germany

Michael Kaufmann
Universität Tübingen
Arbeitsbereich für Paralleles Rechnen
WSI – Am Sand 13
72076 Tübingen
Germany

Stefan Götz
Universität Tübingen
Protocol-Engineering & Distributed Systems Group
Morgenstelle 10c
72076 Tübingen
Germany

Heiko Niedermayer
Universität Tübingen
Computer Networks & Internet
Morgenstelle 10c
72076 Tübingen
Germany

Christian Koppen
Universität Passau
Computer Networks & Computer Communications Group
Innstraße 33
94032 Passau
Germany

Burkhard Stiller
Universität Zürich, IFI
Communication Systems Group
Winterthurerstraße 190
8057 Zürich
Switzerland

Jan Mischke
McKinsey & Company, Inc.
Switzerland

Danny Raz
Technion IIT
Department of Computer Science
Haifa 32000
Israel

Wolfgang Nejdl
Universität Hannover, KBS
Appelstraße 4
30167 Hannover
Germany

Wolf Siberski
Universität Hannover, KBS
Appelstraße 4
30167 Hannover
Germany

Wolf-Tilo Balke
L3S Research Center
Expo Plaza 1
30539 Hannover
Germany

Gerhard Hasslinger
T-Systems Technologiezentrum
Deutsche-Telekom-Allee 7
64307 Darmstadt
Germany

Kurt Tutschku
Universität Würzburg
Institut für Informatik, Lehrstuhl III
Am Hubland
97074 Würzburg
Germany

Phuoc Tran-Gia
Universität Würzburg
Institut für Informatik, Lehrstuhl III
Am Hubland
97074 Würzburg
Germany

Wolfgang Kellerer
DoCoMo Communications Laboratories Europe GmbH
Landsberger Straße 312
80687 München
Germany

Andreas Heinemann
TU Darmstadt
FG Telekooperation
Hochschulstraße 10
64289 Darmstadt
Germany

Oliver P. Waldhorst
Universität Dortmund
Rechnersysteme und Leistungsbewertung
August-Schmidt-Straße 12
44227 Dortmund
Germany

Max Mühlhäuser
TU Darmstadt
FG Telekooperation
Hochschulstraße 10
64289 Darmstadt
Germany

Christoph Lindemann
Universität Dortmund
Rechnersysteme und Leistungsbewertung
August-Schmidt-Straße 12
44227 Dortmund
Germany

Jussi Kangasharju
TU Darmstadt
FG Telekooperation
Hochschulstraße 10
64289 Darmstadt
Germany

Thomas Hummel
Accenture
European Technology Park
449, Route des Crêtes
06902 Sophia Antipolis
France

Steffen Muhle
Universität zu Köln
Seminar für Wirtschaftsinformatik, insb. Informationsmanagement
Pohligstr. 1
50969 Köln
Germany

Jan Gerke
ETH Zürich, TIK
Gloriastrasse 35
8092 Zürich
Switzerland

David Hausheer
ETH Zürich, TIK
Gloriastrasse 35
8092 Zürich
Switzerland

Michael Conrad
Universität Karlsruhe
Institute of Telematics
Zirkel 2
76128 Karlsruhe
Germany

Jochen Dinger
Universität Karlsruhe
Institute of Telematics
Zirkel 2
76128 Karlsruhe
Germany

Hannes Hartenstein
Universität Karlsruhe
Institute of Telematics
Zirkel 2
76128 Karlsruhe
Germany

Marcus Schöller
Universität Karlsruhe
Institute of Telematics
Zirkel 2
76128 Karlsruhe
Germany

Daniel Rolli
Universität Karlsruhe
Lehrstuhl für Informationsbetriebswirtschaftslehre
Englerstr. 14
76128 Karlsruhe
Germany

Martina Zitterbart
Universität Karlsruhe
Institute of Telematics
Zirkel 2
76128 Karlsruhe
Germany
The term “Peer-to-Peer” has drawn much attention in the last few years,
particularly through file-sharing applications, though distributed computing
and Internet-based telephony have also been implemented successfully.
Within these applications, the Peer-to-Peer concept is mainly used to share
files, i.e., to exchange diverse media data such as music, films, and programs.
Usage of these applications has grown enormously, even more rapidly than
that of the World Wide Web. Much of the attention focused on early
Peer-to-Peer systems, however, concerned copyright issues of the shared
content.
But the concept of Peer-to-Peer architectures offers many other inter-
esting and significant research avenues, as the research community has re-
peatedly pointed out. Due to its main design principle of being completely
decentralized and self-organizing – as opposed to the Internet’s traditional
Client-Server paradigm – the Peer-to-Peer concept emerges as a major de-
sign pattern for future applications, system components, and infrastructural
services, particularly with regard to scalability and resilience.
The Peer-to-Peer concept also offers new perspectives and challenges, e.g.,
building scalable and resilient distributed systems and deploying new services
quickly. Based on the decentralized Peer-to-Peer approach, new Internet
services can be deployed on demand, without the time-consuming effort of
positioning a product for the appropriate market, community, or company.
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 1-5, 2005.
Springer-Verlag Berlin Heidelberg 2005
In writing and editing this book, the editors have pursued the following
objectives:
– Overview of the Peer-to-Peer Research Area:
Although research on Peer-to-Peer systems and applications is still young,
the Peer-to-Peer concept has already proven to be applicable and useful
in many cases. With this book, we want to give a comprehensive overview
of the broad range of applications of the Peer-to-Peer paradigm. In addition to
a definition of the term “Peer-to-Peer” and a discussion of fundamental
mechanisms, we want to show the many different facets of Peer-to-Peer research
and its applications. These manifold facets are also reflected in the
structure of the book and its ten parts.
– Common Understanding of the Peer-to-Peer Paradigm:
After providing a good overview of the research field, our second objec-
tive is to define our notion of the “Peer-to-Peer paradigm”. In the past,
many things were called “Peer-to-Peer” – yet were often not even slightly
related to it – and most people only associated “Peer-to-Peer” with pop-
ular file-sharing applications and not with the promising advantages and
possibilities the paradigm can offer in a variety of other scenarios.
– Compendium and Continuing Knowledge Base for Teaching:
The literature does not yet offer a good overview of Peer-to-Peer
systems that is also useful for teaching purposes. Thus, the third intention
of this book is to provide a common basis for teaching, with material for
lectures, seminars, and labs. The knowledge of many experts, each in their
own specific research area, has been assembled for this book. Teachers
can therefore choose from a wide range of chapters on all aspects of Peer-
to-Peer systems and applications and design the syllabus for their classes
with individual accents. In addition to this textbook, electronic slides are
available on the companion website.
The idea to write and edit this book arose from a sequence of international
and German activities and events aimed (1) at coordinating and
supporting research in the area of Peer-to-Peer systems and applications and
(2) at establishing a highly interconnected research community. Among these events
have been the KuVS Hot Topics Meeting (GI/ITG KuVS Fachgespräch)
“Quality in Peer-to-Peer-Systems” (TU Darmstadt, September 2003) [197],
the Dagstuhl Seminar “Peer-to-Peer Systems” (March 2004) [149], and the
GI/ITG Workshop “Peer-to-Peer Systems and Applications” (Kaiserslautern,
March 2005) [244]. In the course of these events, a scientific community of re-
searchers, mostly from German-speaking countries but also from elsewhere,
in particular the U.S., formed in the area of Peer-to-Peer systems and appli-
cations.
1.2 Structure and Contents
Part V: Self-Organization
Part V deals with the fascinating topic of self-organization. General aspects
and a characterization of self-organization is given in Chapter 15. Chapter 16
follows with a discussion of self-organization in Peer-to-Peer-based systems.
The authors of each chapter were asked to supply related teaching materi-
als, in particular slides in the current PowerPoint format. All this e-learning
content can be retrieved by instructors from www.peer-to-peer.info – the
website of this book. The slides can be used without charge and adapted in-
dividually by teachers, provided that this book and the origin of the material
are appropriately acknowledged.
Teachers may also want to publish their modifications at the book website
so that they are accessible to a wide audience. Our hope is that contributions
from the community will allow the companion website to grow into a large
knowledge base.
More information on accessing and using the website can be found at
www.peer-to-peer.info. Please send us your comments on improvements,
errors, or any other issues to be addressed in the next edition through
this website. Thank you!
1.4 Acknowledgements
The efforts of many people were indispensable in the editing, writing, and
reviewing of this book. First and foremost, the editors would like to thank
all the authors for their written contributions and for providing appropriate
teaching materials. Without the contributions of these experts in the area
of Peer-to-Peer systems and applications, this compendium would not
have been such a success. We also want to thank all reviewers for their
comments and remarks on the chapters, which were an important factor in
ensuring the quality of this book.
Special thanks go to all the people who helped, with great diligence, in the
background, especially Stefan Götz, Jens Gogolek, Marc Bégin, and Oliver
Heckmann. A special thank-you goes to Simon Rieche, who spent countless
hours solving LaTeX riddles, coordinating the review process, and dealing
with most of the communication between editors and authors. All of them
supported this project with untiring efforts and helped to make it a reality
smoothly and in a distributed fashion. We also want to thank our editor, Alfred
Hofmann of Springer, for his spontaneous enthusiasm for this
book project and his great support during the entire editing and production
process.
Last but not least, we gratefully acknowledge the support, encourage-
ment, and patience of our families and friends.
2. What Is This “Peer-to-Peer” About?
2.1 Definitions
Oram et al. [462] give a basic definition of the term “Peer-to-Peer”, which is
further refined in [573]:
[a Peer-to-Peer system is] a self-organizing system of equal, autonomous
entities (peers) [which] aims for the shared usage of distributed resources in
a networked environment avoiding central services.
In short, it is a system with completely decentralized self-organization and
resource usage. Apart from these basic principles, Peer-to-Peer systems can
be characterized as follows (though a single system rarely exhibits all of these
properties):
1 We do not distinguish between Peer-to-Peer computing and Peer-to-Peer
networking, but focus on Peer-to-Peer (P2P) as a property, characteristic,
method, or mechanism.
(Fig. 2.1: example structures of clients (C), peers (P), and servers (S):
(a) and (b) include central services, whereas (c) operates without them.)
Decentralized Self-Organization:
9. Ideally, resources can be located without any central entity or service (in
Figures 2.1a and 2.1b, centralized services are necessary, in contrast to
Figure 2.1c). Similarly, the system is controlled in a self-organizing or
ad hoc manner. As mentioned above, this guideline may be violated for
reasons of performance; the decentralized nature, however, should be
preserved. The result of such a mix is a Peer-to-Peer system with a hybrid
structure (cf. Fig. 2.1b).
2 TU Darmstadt, Sept. 2003, http://www.kom.tu-darmstadt.de/ws-p2p/
2.2 Research Challenges in Peer-to-Peer Systems & Applications
(Figure: anticipated development of Peer-to-Peer applications and research
focus, 2004–2011. Applications: lawsuits against users, software patents,
and intellectual property issues (2004); P2P requires “flat-rate” access,
still low-bandwidth end-nodes (2005); digital rights management (2006);
best-effort service insufficient for most applications, interoperability (2007);
commercialisation as the end of P2P, lack of trust, video conferences (2008);
location-based services in MANETs, distributed and decentralized (2010);
context-aware services and trustworthy computing (2011). Research focus:
P2P file systems, P2P for business, concepts of trust and dynamic information
systems security (ACLs); P2P in mobile networks, cellular/ad hoc (2007);
service differentiation (2009); P2P/Grid integration (2010).)
a routing tree (e.g., Pastry) or finger tables (e.g., Chord), a request is routed
towards the desired data item. For such requests, logarithmic complexity
is guaranteed. Often, the amount of routing information is on the order of
O(log N) entries per peer (see also Chapter 8).
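To make this logarithmic routing concrete, here is a minimal, Chord-like sketch. It builds finger tables from a global view of the ring (a real DHT builds them through its join protocol) and routes greedily; the identifier size, node ids, and key are illustrative, not from any specific implementation.

```python
M = 16  # identifier bits; real DHTs use 128-160 bit identifiers

def between(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.finger = []  # finger[i]: first node whose id >= self.id + 2^i (mod 2^M)

def build_ring(ids):
    """Build fully populated finger tables from a global view (sketch only)."""
    ids = sorted(ids)
    nodes = {i: Node(i) for i in ids}
    def succ(k):
        k %= 1 << M
        for i in ids:
            if i >= k:
                return nodes[i]
        return nodes[ids[0]]  # wrap around the ring
    for n in nodes.values():
        n.finger = [succ(n.id + (1 << i)) for i in range(M)]
    return nodes

def lookup(node, key):
    """Greedy finger-table routing; returns (responsible node, overlay hops)."""
    hops = 0
    while not between(key, node.id, node.finger[0].id):  # finger[0] = successor
        nxt = node.finger[0]
        for f in reversed(node.finger):  # largest jump that does not overshoot
            if between(f.id, node.id, key):
                nxt = f
                break
        node, hops = nxt, hops + 1
    return node.finger[0], hops

# 64 evenly spaced peers; key 4242 is managed by its successor, node 5120
nodes = build_ring(range(0, 1 << M, 1 << 10))
owner, hops = lookup(nodes[0], 4242)
print(owner.id, hops)  # -> 5120 1
```

Each greedy hop at least halves the remaining ring distance to the key, which is where the logarithmic bound on routing steps comes from.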
Besides the similarities to known database indexing techniques discussed
above, DHTs employ additional techniques to manage data structures,
to add redundancy, and to locate the nearest instances of a requested data
item.
Part III of this book covers structured Peer-to-Peer systems in detail,
with a special focus on Distributed Hash Tables.
2.3 Conclusion
Fig. 3.1: Portions of traffic (file sharing, data transfers, unidentified) measured
per week in the Abilene backbone from 18.02.2002 until 18.07.2004; the peaks
at 18.12.2002 and 18.02.2004 result from measurement errors (source: [42])
3. Past and Future
Until the end of 2004, the amount of identified Peer-to-Peer traffic decreased
to approximately 15 percent. This might point to an increasing efficiency
of Peer-to-Peer protocols, since signaling traffic is reduced, or
to decreasing usage of Peer-to-Peer applications. However, if we also look
at the unidentified traffic and the traffic identified as data transfers, we
can observe that these volumes are increasing and that the total amount of
traffic stemming from these three sources stays at a constant level of nearly 90
percent. Analyzing the networking techniques of different Peer-to-Peer appli-
cations in more detail, this could also indicate that Peer-to-Peer applications
are “going underground”: they use TCP port 80, so that at the port level they
cannot be distinguished from common data transfers. Furthermore, more
and more Peer-to-Peer applications use so-called port hopping, meaning that
they frequently change their communication port at runtime and can
thus not be identified as file-sharing applications on the port level. As a result,
the amount of unidentified traffic and data transfers increases and the amount
of identified Peer-to-Peer traffic decreases, while the total stays at a
constant level of approximately 90 percent.
Hence, Peer-to-Peer communication plays a dominant role in today’s net-
works and is also proliferating into many new application areas. In this chap-
ter, we will look at the development of Peer-to-Peer applications over the
last few years, analyze the development of the capabilities of user termi-
nals, and finally consider possible directions that Peer-to-Peer technology
might take in the future.
This was about to change in May 1999. Home users started to use their con-
nected computers for more than just temporarily requesting content from
web or email servers. With the introduction of the music- and file-sharing
application Napster by Shawn Fanning [437], users opened their comput-
ers not only to consume and download content, but also to offer and provide
content to other participating users over the Internet. This phenomenon is
best described by the artificial term SERVENT for such a node: a com-
bination of the first syllable of the term SERVer and the second syllable of
the term cliENT.
Comparing the Peer-to-Peer networks, which started with Napster, to the
architecture established by the ARPANET, we can observe that, in contrast
to today’s Peer-to-Peer realizations, the ARPANET was not self-organizing.
It was administrated by a centralized steering committee and did not
provide any means for context- or content-based routing beyond “simple”
address-based routing. In current Peer-to-Peer networks, the participating
users establish a virtual network, entirely independent from the physical net-
work, without having to obey any administrative authorities or restrictions.
These networks are based on UDP or TCP connections, are completely self-
organizing, and frequently change their topology as users join and leave
the network in a random fashion, nearly without any loss of network func-
tionality.
Another decentralized and application-layer-oriented communications
paradigm is Grid computing, which became famous with the project
SETI@home [557]. It is often compared to Peer-to-Peer as a more structured
approach with the dedicated aim of sharing computing power and
storage space for distributed computations and simulations [217]. Indeed, the ba-
sic principles of Peer-to-Peer and Grid are similar. However, concerning the
number of participating users, and thus also the traffic volumes, Grid com-
puting currently plays a minor role. Nevertheless, it has high growth
potential.
Because of the mostly illegal content shared in the Napster network (con-
tent was mostly copyright-protected, MP3-compressed music), the Recording
Industry Association of America (RIAA) filed a lawsuit against Napster Inc.
in December 1999. This was possible because the Napster network relies
heavily on a centralized lookup/index server operated by Napster Inc. This
server, which represents a single point of failure in the Napster network, could
therefore be targeted by the RIAA.
the enhanced protocol, which consists of two hierarchical routing layers [359].
The foundation for this development of Gnutella had already been laid in
October 2000 with the presentation of the Reflector/SuperPeer concept. Peer-
to-Peer networks with such a second, dynamic routing hierarchy are called
second-generation Peer-to-Peer networks. As Figure 3.2 shows, second-generation
Peer-to-Peer protocols are still widely used today; eDonkey2000 and
FastTrack are based on such an overlay routing concept [184, 358, 410, 423].
However, in May 2003 things began to change again. Applications based on
the FastTrack protocol caused significantly less traffic, whereas the traffic
volumes of, e.g., Gnutella or eDonkey increased. In addition, we can observe
from Figure 3.2 that the traffic caused by the BitTorrent network increased
significantly and, at the end of 2004, accounted for the majority of the
traffic [127, 320].
Two main reasons explain this phenomenon. First of all, in KaZaA the
amount of hardly identifiable, corrupted content increased significantly due to
the weakness of the hashing algorithm used (UUHASH). Thus users switched
Fig. 3.2: Traffic proportions of the different Peer-to-Peer applications and proto-
cols (among them FastTrack, eDonkey2000, Gnutella, BitTorrent, WinMX,
Audiogalaxy, Freenet, and Direct Connect++) from the traffic measured
per week in the Abilene backbone from 18.02.2002 until 18.07.2004
(source: [42])
4.1 Information
Virtual Office offers users the opportunity to set up so-called shared spaces,
which provide a shared working environment for virtual teams formed on
an ad-hoc basis, as well as to invite other users to work in these teams.
Groove Virtual Office can be expanded by system developers. A devel-
opment environment, the Groove Development Kit, is available for this
purpose [187].
4.2 Files
4.3 Bandwidth
This party waits until a certain number of identical results are received from
these peers before accepting the result as correct.
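This quorum check can be sketched in a few lines (the function name and quorum policy are illustrative, not taken from any particular system):

```python
from collections import Counter

def accept_result(replies, quorum):
    """Accept a computation result once `quorum` identical replies arrive.

    `replies` is an iterable of results received from independent peers;
    returns the accepted result, or None if no value reaches the quorum.
    """
    counts = Counter()
    for value in replies:
        counts[value] += 1
        if counts[value] >= quorum:
            return value
    return None

# Three of four peers agree, so with a quorum of 3 the value 42 is accepted
print(accept_result([42, 41, 42, 42], quorum=3))  # -> 42
```

The quorum size trades verification cost against confidence: the larger it is, the more redundant computations are wasted, but the harder it becomes for faulty or malicious peers to inject a wrong result.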
By means of file replication and random distribution of identification num-
bers to peers using a hash function, the Peer-to-Peer storage network auto-
matically ensures that several copies of the same file are stored at different
geographical locations. No additional administration or backup so-
lution is required to achieve protection against a local incident or loss of data.
This procedure also reduces the significance of a problem which is character-
istic of Peer-to-Peer networks: there is no guarantee that a particular peer
will be available in the network at a particular point in time (availability
problem). In a Peer-to-Peer storage network, this could mean that no peer
storing a requested file is currently available. Increasing the number of
replicas stored at various geographical locations can, however, enhance the
probability that at least one such peer will be available in the network.
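The effect of replication on availability can be quantified: if each peer is online independently with probability p, at least one of k replicas is reachable with probability 1 - (1 - p)^k. A small illustrative sketch (the numbers are made up):

```python
def availability(p, k):
    """Probability that at least one of k independent replicas is online."""
    return 1 - (1 - p) ** k

# Even with peers online only 30% of the time, a handful of
# geographically dispersed replicas makes a file highly available.
for k in (1, 5, 10):
    print(k, round(availability(0.3, k), 3))  # -> 1 0.3 / 5 0.832 / 10 0.972
```

The independence assumption is optimistic (peers in the same region or time zone tend to go offline together), so real systems typically place replicas to reduce such correlation.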
The low administration costs, which result from the self-organized char-
acter of Peer-to-Peer storage networks, and the fact that additional backup
solutions are seldom required are among the advantages these new systems
offer for providing and efficiently managing storage.
5. First and Second Generation of Peer-to-Peer Systems
Fig. 5.1: Classification of Client-Server and Peer-to-Peer networks.

Client-Server:
1. The server is the central entity and only provider of service and content
→ the network is managed by the server.
2. The server is the higher-performance system.
3. Clients are the lower-performance systems.
Example: WWW

Peer-to-Peer (common characteristics):
1. Resources are shared between the peers.
2. Resources can be accessed directly from other peers.
3. A peer is both provider and requestor (Servent concept).

Unstructured P2P, 1st generation – Centralized P2P:
1. All features of Peer-to-Peer included.
2. A central entity is necessary to provide the service.
3. The central entity is some kind of index/group database.
Example: Napster

Unstructured P2P, 1st generation – Pure P2P:
1. All features of Peer-to-Peer included.
2. Any terminal entity can be removed without loss of functionality.
3. → No central entities.
Examples: Gnutella 0.4, Freenet

Unstructured P2P, 2nd generation – Hybrid P2P:
1. All features of Peer-to-Peer included.
2. Any terminal entity can be removed without loss of functionality.
3. → Dynamic central entities.
Examples: Gnutella 0.6, JXTA

Structured P2P – DHT-based:
1. All features of Peer-to-Peer included.
2. Any terminal entity can be removed without loss of functionality.
3. → No central entities.
4. Connections in the overlay are fixed.
Examples: Chord, CAN
The messages employed in Napster are fairly simple and easy to track, as
they are transmitted as plain-text messages. In the following, we describe the
basic messages used in Napster to announce and to search for content.
Each message to/from the Napster server has the basic structure given
in Figure 5.2. The first four bytes provide the <Length> parameter, which
specifies the length of the payload of this message. The <Function> param-
eter, stated in the following four bytes, defines the message type, e.g., login
or search, as described below. The payload finally carries the
parameters necessary for the different messages, e.g., the keywords of a search
message.
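Following the layout just described, a header parser might look like the sketch below. Only the field widths and their order come from the text; the byte order and the exact payload encoding are assumptions for illustration.

```python
import struct

def parse_napster_message(data: bytes):
    """Split one Napster-style message into (function, payload).

    Assumes the layout described in the text: a 4-byte <Length> field and a
    4-byte <Function> code (little-endian assumed here), followed by a
    plain-text payload of <Length> bytes.
    """
    length, function = struct.unpack_from("<II", data, 0)
    payload = data[8:8 + length]
    return function, payload.decode("ascii", errors="replace")

# A hypothetical SEARCH (0xC8) message carrying a keyword payload
msg = struct.pack("<II", 5, 0xC8) + b"hello"
print(parse_napster_message(msg))  # -> (200, 'hello')
```

Carrying the payload length in the header lets the receiver frame messages on a TCP stream, where message boundaries are not preserved by the transport.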
5.2 Centralized Peer-to-Peer Networks
After a successful login, the server sends a LOGIN ACK (0x03) (size: 20
bytes) to the client. If the <nick> is registered, the email address given at
registration time is returned.
If the <nick> is not registered, a dummy value is returned. As soon as the
peer is logged in, it sends one “CLIENT NOTIFICATION OF SHARED FILE”
(0x64) message for every file it wants to share (see Figure 5.4). Thus routing
is possible, as every client announces its shared objects to the Napster server.
This message contains the filename of the file (<Filename>), its MD5 hash value (<MD5>) [519], and its size in bytes (<Size>). The MD5 (Message Digest 5) algorithm produces a 128-bit “fingerprint” of any file, and it is extremely unlikely that two different files produce the same hash value. The MD5 hash is therefore intended to give any user the possibility to verify the integrity of the shared file, even if parts of the file are provided by different Napster users. As specific parameters of the music file, this message additionally provides the bitrate (<Bitrate>), the sampling rate of the MP3 (<frequency>), and the playout time of the music file (<time>). The
bit rate represents the quality of the used coding and compression algorithm.
The average size of this message is 74 bytes.
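Such a fingerprint can be computed with any standard MD5 implementation; a sketch using Python's hashlib follows. Reading in chunks is our choice, not part of the protocol.

```python
import hashlib
import io

def md5_fingerprint(stream, chunk_size: int = 65536) -> str:
    """Compute the 128-bit MD5 fingerprint of a file-like object.
    Reading in chunks keeps memory use constant even for large files."""
    digest = hashlib.md5()
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
    return digest.hexdigest()

# The same digest identifies the file no matter which peer serves it,
# which is what enables downloading one file from multiple sources.
fingerprint = md5_fingerprint(io.BytesIO(b"some mp3 payload"))
```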
File Request
To be able to download a file from the Napster network, peers which share the requested file have to be found first. Therefore the requesting peer sends a SEARCH (0xC8) message to the Napster server; the format of such a request is shown in Figure 5.5. To specify the search, this message contains several parameters stating keywords that describe the requested object (artist name and parts of the song name). Furthermore, this message specifies a filter, e.g. to demand a certain quality of the requested file, such as its bitrate and sampling frequency. The parameter <compare> can take the values “at least”, “at best” or “equal to”. Thus the requesting peer can choose the quality and also the size of the file, which, together with the link type of the providing peer (parameter <Link Type>, e.g. a T1 connection), can strongly influence the download speed. The parameter <MAX RESULTS> finally states the maximum number of results the requesting peer wants the Napster server to return. The average size of such a message is 130 bytes.
Upon receiving a SEARCH message, the Napster server tries to match the
parameters stated in the SEARCH message with the entries of its database,
consisting of data previously received from other peers upon initialization
(CLIENT NOTIFICATION OF SHARED FILE (0x64) messages). If the server
can resolve the query, it answers with at least one RESPONSE (0xC9) containing information about shared files matching the previously stated criteria (see Figure 5.6). To provide the requesting peer with information about the available data and where it can be downloaded from, this message contains the full filename (<File-Name>) and the IP address (<IP>) of the providing peer, so that the requesting peer can download the requested file directly via its HTTP instance [365]. Furthermore, the file size (<Size>), the playout time (<length>), and the sampling rate and bitrate of the file are stated (<Freq>, <Bitrate>). To check the integrity of the file and to be able to download the
file from multiple sources the MD5 hash value of the shared file is also stated
(<MD5>). The average size of such a message is 200 bytes.
5.2.3 Discussion
Fig. 5.7: Sample message sequence chart for one Napster server with one request-
ing and one providing peer
Fig. 5.8: Sample graph of a simulated Gnutella 0.4 network (100 nodes)
The nodes communicate directly with each other without any central instance. However, at the beginning, i.e. in a bootstrap phase, a central entity like a beacon server, from which IP addresses of active nodes can be retrieved, is necessary. If a node has already participated in the network, it may also be able to enter the network by trying to connect to nodes whose addresses it cached in a previous session. As soon as a new node knows the IP address and port of one active Gnutella node, it first establishes a TCP connection to this node and then connects by sending the ASCII-encoded request string “GNUTELLA CONNECT/<protocol version string>\n\n” to it. If the participating peer accepts this connection request, it must respond with “GNUTELLA OK\n\n”.
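The connect sequence described above can be sketched as follows; the function names are ours, host and port would come from a beacon server or a cached address list, and error handling is reduced to a minimum.

```python
import socket

OK_REPLY = "GNUTELLA OK\n\n"

def make_connect_request(version: str = "0.4") -> bytes:
    """Build the ASCII-encoded connect string quoted above."""
    return f"GNUTELLA CONNECT/{version}\n\n".encode("ascii")

def gnutella_handshake(host: str, port: int, version: str = "0.4"):
    """Open a TCP connection to one active node and perform the
    handshake. Returns the connected socket, or raises
    ConnectionError if the peer does not answer with OK."""
    sock = socket.create_connection((host, port), timeout=10)
    sock.sendall(make_connect_request(version))
    reply = sock.recv(128).decode("ascii", errors="replace")
    if reply != OK_REPLY:
        sock.close()
        raise ConnectionError(f"peer rejected handshake: {reply!r}")
    return sock
```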
Gnutella mainly uses the four messages stated above. The messages are set up in a similar manner to those of Napster: they consist of a general message header and the additional payload (see Figure 5.9). However, since in Gnutella the messages are flooded through the overlay network, some additional parameters are necessary beyond those used for Napster. The <Descriptor ID> is a 16-byte string uniquely identifying the message on the network. Thus circles can be detected, i.e. every message which is received twice by a node is not forwarded any further. At the same time, backward routing of possible response messages becomes possible.
Every node therefore has to store this ID and the IP address from which
it received the message for a certain time. The <TTL> (Time-to-Live) value
determines the number of hops a message is forwarded in the overlay network.
This value is decreased by every node which received the message before the
message is forwarded. When the TTL value reaches zero, the message is not
forwarded any further, to avoid infinitely circulating messages. Generally a
TTL value of seven is considered to be sufficient to reach a large fraction
of the nodes participating in the overlay network. The <Hops> value states the number of hops a message has already been forwarded and is therefore increased by one by every forwarding peer. It can be used to check that a requesting peer did not initially use a TTL value larger than seven. The <Payload length> parameter states the size of the message, so that the next message in the incoming stream can be clearly identified.
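The combination of duplicate detection and TTL/Hops bookkeeping described above can be condensed into one forwarding decision. The dictionary-based message stand-in is our simplification of the real message header.

```python
def should_forward(msg: dict, seen_ids: set) -> bool:
    """One node's forwarding decision for a flooded Gnutella message.
    `msg` carries 'descriptor_id', 'ttl' and 'hops' keys, a
    simplified stand-in for the real header fields."""
    if msg["descriptor_id"] in seen_ids:
        return False                    # received a second time: drop
    seen_ids.add(msg["descriptor_id"])  # remember ID for backward routing
    msg["ttl"] -= 1                     # decrease TTL before forwarding
    msg["hops"] += 1                    # one more hop travelled
    return msg["ttl"] > 0               # TTL exhausted -> do not forward
```

A node keeps `seen_ids` (together with the address the message arrived from) for a certain time, which is exactly the state needed to route responses backwards.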
However, the most important field, which determines the payload, is the <Payload-Descriptor> field. The messages we distinguish here are 0x00 for a PING, 0x01 for a PONG, 0x80 for a QUERY and 0x81 for a QUERYHIT message [126]. The network exploration message PING does not contain any payload, whereas the payload of the PONG message states, in addition to the contact information (IP address and port), information about the amount of shared files. To search for data, the QUERY message contains, besides a parameter stating the requested minimum download speed, a null-terminated search string with the keywords separated by blanks, describing the requested object. The average size of this message is 78.4 bytes. If we now assume that an average word has a length of eight characters plus one character for the blank, we can also compute the average number of words a user states as search criteria, as every character is encoded with one byte. For Gnutella this results in an average of 7.35 words per QUERY. Similar to the PING messages, the QUERY messages are flooded through the overlay. As soon as a node receives a QUERY message, it compares the search criteria with its locally shared objects.
5.3.3 Discussion
In our example the flooding of the request messages results, as we can see from Figure 5.15, in 12 PING and 12 PONG messages, plus 6 messages for the initial connection establishment. Taking the message sizes from above into account (PING: 23 bytes, PONG: 37 bytes) and assuming 34 bytes for each connection-establishment message pair (GnuCon+OK), this results in a total of 462 bytes.
This traffic is necessary to merely explore the network. We can also observe
in Figure 5.15, that several messages are not forwarded any further, because
they are received for a second time.
If we further assume that the node starts a search in this small network, this results in 12 QUERY messages. Assuming that three nodes answer this QUERY, resulting in eight additional QUERYHIT messages, we can calculate the total traffic this node caused in this small network as 6,928 bytes. Together with the initialization traffic we arrive at a total of 7,390 transmitted bytes. This is significantly more than the traffic caused by the
Napster peer. For a larger network we can assume that the amount of traffic
grows even further as the messages are flooded via more hops. The main
reason is the distributed nature of the Gnutella network. This causes on the
one hand a lot of traffic as no central lookup is available, but on the other
hand also makes this network hard to attack, as no central single point of
failure exists.
Fig. 5.15: Sample message sequence chart to illustrate the basic signaling behavior of Gnutella 0.4
Fig. 5.16: Map of Gnutella Network measured on 12.08.2002 up to 1st hop level
Fig. 5.17: Map of Gnutella Network measured on 12.08.2002 up to 2nd hop level
Fig. 5.18: Map of Gnutella Network measured on 12.08.2002 up to the 3rd hop
level, including the zigzag PING-PONG route (bold line)
p(d) =
  c · (1^(−1.4) − 0.05),   d = 1
  c · d^(−1.4),            1 < d ≤ 7
  c · 0.05,                d = 20
  0,                       in any other case        (5.4)

with c = (Σ_{d=1}^{7} d^(−1.4))^(−1)

average: d̄ = 2.8, var(d) = 3.55
Figure 5.19 depicts a sample network which is based on the Superpeer dis-
tribution stated above. Due to the nodes with degree 20 it has a hub-like
structure, similar to the measured structure of a Gnutella 0.6 network (see
Figure 5.20). These hubs dominate the structure of this overlay network.
Because of their high degree, these nodes establish connections between each other with a higher probability (marked by dashed lines in Figure 5.19). This results in a kind of second hierarchical layer, as it also occurs in the Gnutella 0.6
network. The nodes with a small degree are mainly located at the edge of the
network.
Fig. 5.19: Sample graph of a simulated Gnutella 0.6 network (100 nodes)
Although the number of nodes with degree one is high (47%) in this distribution, the number of separate sub-components is small, as can be observed by inspection. This results in a comparably high number of nodes reachable within a few hops. This behavior can be explained by the fact that the average degree of the Superpeer distribution, d̄ = 2.80, is higher than that of the power-law distribution used earlier for the Gnutella 0.4 network.
If we transform the abstract network structure depicted in Figure 5.20 into the geographical view depicted in Figure 5.21, we can make the network visible and can determine, e.g., the popularity of the Gnutella network in different countries (see Figure 5.21). In the abstract view we can furthermore observe the hub-like structure of the Gnutella 0.6 network, which cannot be retrieved from the geographical view. However, comparing both figures we can again observe the problem of the random structure, resulting in zigzag routes.
Fig. 5.20: Abstract network structure of a part of the Gnutella network (222 nodes
Geographical view given by Figure 5.21, measured on 01.08.2002)
Fig. 5.21: Geographical view of a part of the Gnutella network (222 nodes); the
numbers depict the node numbers from the abstract view (measured on
01.08.2002)
All messages, i.e. PING, PONG, QUERY and QUERYHIT, defined in Gnutella
0.4, are also used in Gnutella 0.6 . However, to reduce the traffic imposed
on the Leafnodes and to use the Superpeer layer efficiently, the Leafnodes
have to announce their shared content to the Superpeers they are connected
to. Therefore the ROUTE TABLE UPDATE message (0x30) is used (see Fig-
ure 5.22 and Figure 5.23). The <Variant> parameter is used to identify a
ROUTE TABLE UPDATE message either as Reset or as an Update.
The Reset variant is used to clear the route table on the receiver, i.e. the Superpeer. Therefore, additionally, the length of the table to be cleared (<Table Length>) must be stated. The parameter <Infinity> is currently not used; it was intended to clear the route table on several nodes, if the route table were broadcast in the overlay.
The variant Patch is used to upload and set a new route table at the Superpeer. To avoid transferring one large table at once, which might block the communication channel of a Gnutella node, it is possible to break one route table into a maximum of 255 chunks, which are numbered with the parameter <Seq No>; the total number of chunks used is stated with the parameter <Seq Size>. To reduce the message size further, the parameter <Compression> can be used to state a compression scheme applied to the route table (0x0 for no compression, 0x1 for the ZLIB algorithm). For the details of the decompression, the parameter <Entry Bits> is used, which is not explained in detail here. The parameter <DATA> contains 32-bit hash values of the keywords describing all objects shared by the Leafnode.
Fig. 5.22: ROUTE TABLE UPDATE (0x30) payload structure (Reset, Variant = 0x0)
Fig. 5.23: ROUTE TABLE UPDATE (0x30) payload structure (Patch, Variant = 0x1)
5.4.3 Discussion
Fig. 5.25: Sample message sequence chart to illustrate the basic signaling behavior
of Gnutella 0.6
6.1 Introduction
In this chapter we will introduce two famous network models that have aroused much interest in recent years: the small-world model of Duncan Watts and Steven Strogatz [615], and scale-free or power-law networks, first presented by the Faloutsos brothers [201] and filled with life by a model of Albert-László Barabási and Réka Albert [60]. These models describe some structural aspects of most real-world networks. The most prevalent structure of small-world networks is a local mesh-like part combined with some random edges that make the network small.
The preceding chapters sketched the field of Peer-to-Peer concepts and
applications. The field and its history are deeply intertwined with the area
of distributed computing and sometimes overlaps with concepts from client-
server systems and ad hoc networks. To set a clear foundation we base this
chapter on the following, quite general definition of Peer-to-Peer systems:
Definition 6.1.1. Peer-to-Peer Systems
A Peer-to-Peer system consists of computing elements that are:
(1) connected by a network,
(2) addressable in a unique way, and
(3) share a common communication protocol.
All computing elements, synonymously called nodes or peers, have comparable
roles and share responsibility and costs for resources.
The functions of Peer-to-Peer systems are manifold. They may be coarsely
subsumed under communication of information, sharing services, and
sharing resources. To implement these functions, the system has to pro-
vide some infrastructure. What are the requirements to make a Peer-to-Peer
infrastructure useful? Here, we will concentrate on the following four condi-
tions:
Condition 1 Decentrality
One inherent claim of the Peer-to-Peer idea is that there is no
central point in which all information about the system, data
and users is stored. If there is no central organizing element
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 57-76, 2005.
Springer-Verlag Berlin Heidelberg 2005
6.2 Definitions
Let V = {1, 2, 3, . . . , n} be the set of all n nodes, or peers, in the system. Each
node is identified and addressable by its number. The underlying network
makes it possible to route a message from each node to any other node.
Because of the decentralized nature of a Peer-to-Peer system, not every node
v is required to store routing information to each and every other node.
The set of nodes over which node v will route outgoing messages is denoted
by N (v) and called the neighbors of v. Every Peer-to-Peer network can be
associated with a graph G = (V, E). E is the set of edges e = (i, j) where
j is a neighbor of i, i.e., there is at least one entry in the routing table of i
that uses j as the next node. For edge e = (i, j), i is the source node and j is
the target node. The number of edges is denoted by m. G is sometimes called
the overlay network of a Peer-to-Peer system. The edges might be weighted,
e.g., with the number of entries that use j as the next node or the cost for
the traverse of this edge. All edges are directed.
The set of edges can also be represented in the adjacency matrix A(G)
with dimension n × n: a_ij is 1 if and only if the edge e = (i, j) ∈ E. If edges are weighted with a weight function ω : E → R, then a_ij is commonly given by ω((i, j)) if (i, j) ∈ E and zero otherwise. The set of eigenvectors and
eigenvalues of a matrix is defined as the set of all vectors x and real numbers
λ such that:
Ax = λx (6.1)
The outdegree k_o(v) of a node v is defined as the number of neighbors it has: k_o(v) = |N(v)|. The indegree k_i(v) is defined as the number of neighbor sets in which v is an element: k_i(v) = Σ_{w∈V} [v ∈ N(w)]. The Boolean expression in brackets is given in Iverson notation (see [257]) and evaluates to 1 if the expression is true and to zero otherwise. The degree k(v) of a node v is defined as the sum of indegree and outdegree. A path P(i, j) from node i to node j is defined as a subset P ⊆ E of edges {e_1, e_2, ..., e_k} where e_1 = (i, v_1), e_k = (v_{k−1}, j) and ∀ 1 < l < k: e_l = (v_{l−1}, v_l). The path length of a path P is defined as the number of edges in it. If the edge set is weighted with a weight function ω : E → R, then the path length L(P(i, j)) of a path P(i, j) between two nodes i and j is defined as:

L(P(i, j)) = Σ_{e∈P(i,j)} ω(e)   (6.2)
In the following, we will use the first definition to reduce complexity. All
further definitions can be easily transformed for weighted graphs.
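The definitions above can be sketched directly in code; the adjacency-set representation and the numbering of nodes from 0 instead of 1 are our choices.

```python
def adjacency_matrix(n: int, neighbors: dict):
    """Build the n x n adjacency matrix A(G) from the neighbor sets
    N(v): a_ij = 1 iff (i, j) is an edge."""
    return [[1 if j in neighbors[i] else 0 for j in range(n)]
            for i in range(n)]

def outdegree(v, neighbors) -> int:
    return len(neighbors[v])                       # k_o(v) = |N(v)|

def indegree(v, neighbors) -> int:
    # k_i(v) = sum over all w of [v in N(w)] (Iverson bracket)
    return sum(1 for w in neighbors if v in neighbors[w])

# Tiny directed example: 0 -> 1, 0 -> 2, 1 -> 2
nbrs = {0: {1, 2}, 1: {2}, 2: set()}
A = adjacency_matrix(3, nbrs)
```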
Any path with minimal length between two nodes is a shortest path between these nodes. The distance d(i, j) between two nodes is the length of any shortest path between them. The diameter D(G) of a graph G is defined as the maximal distance between any two nodes in the graph:

D(G) = max_{(i,j)∈V×V} d(i, j)   (6.3)

The average path length D̄(G) of a graph G is defined as the sum of the distances over all pairs of nodes divided by the number of all pairs of nodes:

D̄(G) = (Σ_{(i,j)∈V×V, i≠j} d(i, j)) / (n · (n − 1))   (6.4)
A graph is connected if there is at least one path between every two nodes.
A graph is k-connected if the removal of any set with k − 1 nodes leaves
the graph connected. Let VC ⊆ V be a subset of nodes. The induced graph
G(VC ) = (VC , EC ) is defined as the graph with the following edge set EC :
EC = {e = (i, j)|i, j ∈ VC }. An induced graph is a (simple) component if it
is connected.
The set of edges in a graph is formally a relation R ⊆ V × V on the set of
possible edges. A network family is an (infinite) set of graphs with the same
relation. Normally, this relation is given as an algorithm that decides which
edges are added to the graph and which are not.
most people. In terms of graph theory the result signifies that the diameter
of social networks is quite small despite their dense local structure. What
kind of network model can reproduce this special combination of properties?
This is the riddle that was not to be solved until the 1990s. In the following
sections we will describe the most important approaches with which social
and other evolving networks are modeled today. We will show that some of
the features of these networks are interesting for Peer-to-Peer applications
and present ideas about how their favored properties can be transferred to
Peer-to-Peer overlay networks.
The analysis of social relationships as graphs can be traced back to the 1950s
[614]. At the same time, the first graph models, concerning random graphs,
were introduced. They were so successful that they were used as simulation
models for very different networks over the following 30 years. Random graphs
were introduced independently by Gilbert [245] and Erdős and Rényi [464]. We will first present the model of Erdős and Rényi, following [84].
Gilbert's Model
For M ∼ pN the two models Gn,M and Gn,p are almost interchangeable [84].
Here, we will just review some of the important results for Random Graphs
that are interesting in comparison with small-worlds and Scale-Free Networks,
cited from [80]. We use the Gilbert notation.
For the first theorem cited, the random graph is built up sequentially,
by adding random edges one at a time. Analyzing the connectivity of the
evolving graph, we can make an interesting observation: After having added
approximately n/2 edges, we get a giant connected component with a size of
Θ(n) as stated in the following theorem.
Theorem 6.4.2. Giant Connected Component
Let c > 0 be a constant and p = c/n. If c < 1 every component of Gn,p has
order O(log n) with high probability. If c > 1 then there will be one component
with high probability that has a size of (f(c) + o(1)) · n, where f(c) > 0. All
other components have size O(log n) [84].
This theorem is easy to remember and nonetheless surprising: The giant
connected component emerges with high probability when the average degree
is about one.
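The threshold behavior of Theorem 6.4.2 can be observed in a simple simulation; the sketch below (all function names are ours) samples G_{n,p} once below and once above c = 1 and measures the largest component.

```python
import random

def gnp(n: int, p: float, rng) -> dict:
    """Sample an undirected G(n, p) random graph as adjacency sets."""
    adj = {v: set() for v in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def largest_component(adj: dict) -> int:
    """Size of the largest connected component (iterative DFS)."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best

rng = random.Random(1)
n = 2000
below = largest_component(gnp(n, 0.5 / n, rng))   # c = 0.5 < 1: all small
above = largest_component(gnp(n, 2.0 / n, rng))   # c = 2.0 > 1: giant one
```

For c < 1 the largest component stays tiny relative to n, while for c > 1 a component covering a constant fraction of all nodes appears.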
The next property concerns the degree distribution: If one node is drawn randomly from V, how high is the probability P(k) that it has degree k? In random graphs the degree distribution is described by a Poisson distribution

P(k) = c^k e^(−c) / k!

as stated in the following theorem:
Theorem 6.4.3. Degree distribution
Let Xk be the number of nodes with degree k in Gn,p . Let c be a constant with
c > 0 and p = c/n. Then, for k = 0, 1, 2 . . .
Pr[(1 − ε) · c^k e^(−c)/k! ≤ X_k/n ≤ (1 + ε) · c^k e^(−c)/k!] → 1   (6.6)

for every ε > 0, as n → ∞ [84].
This can easily be seen by the following argument: First, we can construct Gn,p in a Bernoulli experiment with n(n − 1)/2 variables X_ij, i ≠ j, i, j ∈ V, that are 1 with probability p. The degree of node i is the sum of all variables X_ij, and for reasonably small p and n → ∞, this sum can be described by a Poisson distribution.
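This limiting behavior can be checked empirically; the following sketch (names and sample sizes are our choices) compares the fraction of degree-k nodes in one sampled G_{n,p} with the Poisson value.

```python
import math
import random

def poisson_pmf(k: int, c: float) -> float:
    """P(k) = c^k e^(-c) / k!, the limiting degree distribution."""
    return c ** k * math.exp(-c) / math.factorial(k)

def degree_fraction(n: int, c: float, k: int, rng) -> float:
    """Empirical fraction X_k / n of nodes with degree k in one
    sample of G(n, p) with p = c / n."""
    p = c / n
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                deg[i] += 1
                deg[j] += 1
    return sum(1 for d in deg if d == k) / n

rng = random.Random(7)
frac = degree_fraction(3000, 3.0, 3, rng)   # compare with P(3) for c = 3
```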
The next question to be answered is the diameter of Gn,p . It is given by
the following theorem:
C(i) = E(N(i)) / (d(i) · (d(i) − 1))   (6.7)
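Equation 6.7, used again in the small-world discussion below, measures how densely the neighbors of a node are interconnected. A minimal sketch, under the assumption of an undirected adjacency-set representation (counting ordered neighbor pairs, which matches the d(i)(d(i) − 1) denominator):

```python
def clustering_coefficient(i, adj) -> float:
    """C(i) = E(N(i)) / (d(i) * (d(i) - 1)): the number of ordered
    neighbor pairs of i that are themselves connected, divided by
    the number of pairs that could be connected."""
    nbrs = adj[i]
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u != v and v in adj[u])
    return links / (d * (d - 1))

# Triangle 0-1-2 plus a pendant node 3 attached to node 0
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
c0 = clustering_coefficient(0, adj)   # only the pair (1, 2) is connected
c1 = clustering_coefficient(1, adj)   # neighbors 0 and 2 are connected
```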
Random graphs were very popular for a long time for two reasons: First, many of their properties are exactly solvable in a rigorous analysis, and they can be exactly defined and varied in many different ways. Second, they provide a much richer field of application than the other network model that was popular at the time, i.e., regular graphs in which every node has the same degree, such as a lattice. No one doubted that social networks are not exactly random, but as long as some of their properties were well described by random graphs, they seemed an easy and useful way to model all kinds of different networks.
Despite the excitement that followed the Milgram experiment, until 1998 there was no convincing network model generating a network that is locally highly clustered and at the same time has a small diameter. Then, Watts and Strogatz [615] analyzed three different kinds of real networks: a film collaboration network in which two actors are connected by an undirected edge whenever they have acted together in at least one film, the neural network of the worm C. elegans, and the power grid of the United States. For each of
these networks they measured the average path length in the graph and com-
pared it with a random graph with the same number of nodes and edges. The
average path length was in each case slightly higher but clearly within the
same order of magnitude. On the other hand one could see that the real net-
works were much more densely connected on a local level than the random
ones. To measure this density, the authors introduced a new measure, the
clustering coefficient which we have already defined in Equation 6.7. Watts
and Strogatz compared the average clustering coefficient of these real net-
works with the corresponding random networks: The clustering coefficients
were at least ten times higher in real networks and for the film collaboration
network the factor is more than 1000. With this analysis the surprising result
of Milgram’s work could be made more intelligible: Real networks have nearly
the same diameter as Random Graphs and at the same time show a high,
local clustering.
Table 6.1: Average path length D and average clustering coefficient C for three
real networks, compared with random graphs that have the same num-
ber of nodes and the same average degree. The first network represents
actors that are connected by an edge if they have contributed to at
least one film together, the second is defined as the set of all genera-
tors, transformers and substations that are connected by high voltage
transmission lines. The neural network of C. elegans displays all neu-
rons and considers them as connected if they share a synapse or gap
junction. All three networks show the small-world phenomenon, with
an average path length comparable to that of the corresponding ran-
dom graph and a clustering coefficient that is considerably larger than
in the random graphs ([615]).
With this, small-worlds are defined as networks with a dense, local struc-
ture, evaluated by the clustering coefficient, and a small diameter that is
comparable to that of a random graph with the same number of nodes and
edges. Watts and Strogatz introduced a very simple network model that is
able to reproduce this behavior. It starts with a chordal ring: nodes are numbered from 1 to n and placed on a circle. Then, every node is connected with its k clockwise next neighbors (Fig. 6.2).
This ring is traversed and for every encountered edge a random number
between zero and one is drawn. If it is smaller than a given constant 0 ≤
p ≤ 1 the edge will be rewired. The rewiring is done by drawing uniformly
at random a new target node from the set of all nodes V , deleting the old
edge and inserting the new edge between the old source and the new target
node. It is important to preclude duplicate edges in this process. If p is small,
Fig. 6.2: The small-world model introduced by Watts and Strogatz [615] starts
with a chordal ring in which n nodes are placed on a circle and connected
with their k clockwise next neighbors (here, k = 2). With probability p
every edge can be rewired once. The rewiring is done by choosing uni-
formly at random a new target node, such that the old edge is removed
and the new one connects the old source node with the new target node.
The figure shows that as p grows the model can be tuned between total
regularity and total randomness. With sufficiently small p it is possible to
maintain the local structure and yet provide an overall small diameter.
This state thus displays the properties of small-worlds as they can be
found in reality.
almost no edges will be rewired and the local structure is nearly completely preserved. If p is close to 1, the resulting graph resembles a random graph with a small average path length. The states in between these two extremes are the interesting ones. Fig. 6.3 shows the dependency of the clustering coefficient
and average path length on p for a graph with 5000 nodes. Clearly, even a
quite small p of about 0.005 is sufficient to reduce the diameter so much that
it resembles the diameter in the corresponding random graph without losing
the local structure that is measured with the clustering coefficient.
Viewed from another perspective, the findings of Watts and Strogatz indi-
cate that a small number of random edges decreases the average path length
significantly since they can be viewed as ‘short-cuts’ spanning the regular
graph. With this model a part of the riddle regarding real networks was
solved.
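The construction described above — a chordal ring followed by probabilistic rewiring — can be sketched as follows; the adjacency-set representation, the seed, and the function name are our choices.

```python
import random

def watts_strogatz(n: int, k: int, p: float, rng) -> dict:
    """Build a chordal ring of n nodes, each connected to its k
    clockwise next neighbors, then rewire every edge with probability
    p to a uniformly drawn new target, precluding self-loops and
    duplicate edges as required by the model."""
    edges = {(i, (i + j) % n) for i in range(n) for j in range(1, k + 1)}
    adj = {v: set() for v in range(n)}
    for (u, v) in edges:
        adj[u].add(v)
        adj[v].add(u)
    for (u, v) in list(edges):
        if rng.random() < p:
            w = rng.randrange(n)
            while w == u or w in adj[u]:     # no self-loops, no duplicates
                w = rng.randrange(n)
            adj[u].discard(v)                # delete the old edge ...
            adj[v].discard(u)
            adj[u].add(w)                    # ... insert the new one
            adj[w].add(u)
    return adj

rng = random.Random(42)
ring = watts_strogatz(20, 2, 0.0, rng)   # p = 0: the regular chordal ring
sw = watts_strogatz(200, 2, 0.1, rng)    # small rewiring probability
```

Note that rewiring replaces exactly one edge by one new edge, so the total number of edges n · k is preserved for every p.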
In Sect. 6.5 we will present some more properties of small-worlds that
are especially interesting for Peer-to-Peer applications. In Subsect. 6.5.1 we
will present a more generalized model of Small World Networks in multi-
dimensional spaces, introduced by Kleinberg in [353, 354]. But despite the
immediate success of the small-world model of Watts and Strogatz the riddle
was only partly solved, as would soon become clear.
Fig. 6.3: The diagram shows the dependency of the clustering coefficient C and
the average path length L on the rewiring probability p. For each prob-
ability ten different small-worlds with 5000 nodes have been simulated.
The clustering coefficient of the small-world after the rewiring phase was
divided by the clustering coefficient of the chordal ring before rewiring.
Analogously, the average path length is given in relation to the aver-
age path length before the rewiring. It can be clearly seen that a small
rewiring probability of approximately 0.005 is sufficient to reduce the av-
erage path length to 1/10 without decreasing the clustering coefficient by
more than 1.5%.
P(d) ∝ d^H   (6.11)
This last property is more approximative than the other properties but
is nonetheless useful as the authors show in their paper [201].
After the authors had found this self-organizing structure, they asked in
their discussion “Why would such an unruly entity like the Internet follow
any statistical regularities?”. The answer to this question was given by an
elegant model of Barabási and Albert in the same year [60]. They examined
a part of the World Wide Web (WWW) [60] (see also [20]) and displayed
the result as a graph. In this graph, all visited pages were represented as
nodes, and two pages were connected by a directed edge (i, j) if page i had
a link pointing to page j. In this graph the number of nodes with a given
degree was calculated. By dividing it by the number of nodes in the graph,
the probability P (k) of drawing uniformly at random a node with degree k
is computed. The authors observed that the probability P (k) is proportional
to k to the power of a constant γ (similar to E3 above):
P (k) ∝ k −γ (6.12)
The Barabási-Albert-Model
already in the network. The probability Π_t(j) that some old node j gets one of the m edges is proportional to its current degree k_t(j) at time t:

Π_t(j) = k_t(j) / Σ_{v∈V} k_t(v)   (6.13)
       = k_t(j) / (2 · m_t)   (6.14)

with m_t being the number of edges in the graph at time t.
Thus, the network model works as follows:
1. Begin with a small network of at least m_0 nodes and some edges.
2. In each simulation step add one node. For each of its m edges, draw one of the nodes j that are already in the graph, each with probability Π(j).
It should be clear that this algorithm is not a model in the mathematical
sense [80] but rather defines a family of possible implementations. Later,
Albert and Barabási could show in [22, 19] that the only requirement for the
emergence of a scale-free behavior is that the probability of gaining a new
edge is proportional to the degree of a node in each timestep. Thus, it is
sufficient that any network model show this preferential attachment in order
to generate scale-free networks. This property can be easily remembered as
a behavior in which ‘the rich get richer’.
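One possible implementation of the two steps above is sketched below. Keeping a list with one entry per edge end makes a uniform draw from that list equivalent to Equation 6.14, since node j appears k_t(j) times in it; rejecting duplicate targets within one step is our simplification.

```python
import random

def barabasi_albert(n: int, m: int, rng) -> dict:
    """Grow a graph by preferential attachment: start from a small
    clique of m + 1 nodes, then add one node per step with m edges
    whose targets are drawn proportionally to current degree."""
    adj = {v: set() for v in range(m + 1)}
    targets = []                 # every node listed once per edge end
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
            targets += [i, j]
    for v in range(m + 1, n):
        adj[v] = set()
        chosen = set()
        while len(chosen) < m:   # draw m distinct targets
            chosen.add(targets[rng.randrange(len(targets))])
        for t in chosen:
            adj[v].add(t)
            adj[t].add(v)
            targets += [v, t]
    return adj

rng = random.Random(3)
g = barabasi_albert(500, 2, rng)
max_degree = max(len(g[v]) for v in g)   # a few heavily linked hubs emerge
```

The oldest nodes accumulate edges fastest ('the rich get richer'), which produces the heavy-tailed degree distribution of Equation 6.12.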
To date, many different variants of network models that generate scale-free
networks have been published: A mathematical model more precisely defined
than the Barabási-Albert-model is the linearized chord diagram, introduced
in [79]. Here, two groups provide each node with an initial attractiveness that
increases the probability of being chosen by a constant value [180, 175]. A
quite complicated but powerful model with many parameters was given in
[132].
A model that is simple to adapt to Peer-to-Peer systems was first intro-
duced by Kumar et al. [369] for web graphs, and independently by Vazquez
et al. [605] and Pastor-Satorras et al. [477] for modeling protein interaction
networks: In each timestep of this model, one of the existing nodes is cloned
with all the links to other nodes and the clones are connected to each other.
Then, both clones lose some edges at random with a very small probabil-
ity and gain as many new edges to new, randomly drawn target nodes. It
can be easily shown that the probability of node j gaining a new edge in
timestep t is proportional to its degree at that time: The more edges it has,
the more probable it is that one of its neighbors is chosen to be cloned. If
one of the neighbors is cloned, the edge to j is copied and thus the degree of
j is increased by 1. Thus, this model shows preferential attachment and the
resulting networks are scale-free with respect to the degree distribution.
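A minimal sketch of one duplication step, under the simplifying assumption that only the new clone rewires its edges (the function name and the rewiring probability p are illustrative):

```python
import random

def duplication_step(adj, p=0.05):
    """One step of the duplication model: clone a uniformly chosen
    node together with all of its links, connect clone and original,
    then rewire each copied edge with a small probability p."""
    orig = random.choice(sorted(adj))
    clone = max(adj) + 1
    adj[clone] = set()
    for nb in list(adj[orig]):
        adj[clone].add(nb)
        adj[nb].add(clone)               # copy every edge of orig
    adj[clone].add(orig)
    adj[orig].add(clone)                 # connect the two clones
    for nb in list(adj[clone] - {orig}):
        if random.random() < p:          # drop a copied edge ...
            adj[clone].remove(nb)
            adj[nb].remove(clone)
            new = random.choice(sorted(set(adj) - {clone, nb}))
            adj[clone].add(new)          # ... and draw a fresh target
            adj[new].add(clone)
    return clone
```

Since a node is chosen for cloning in proportion to how often it appears as someone's neighbor, high-degree nodes gain copied edges more often, which is the preferential attachment argued for in the text.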
In the following we want to discuss some of the properties of Small Worlds
and Scale-Free Networks that are interesting for Peer-to-Peer systems.
70 6. Random Graphs, Small-Worlds and Scale-Free Networks
The model starts with a set of grid points in an n × n square. Each node
i is identified by the two coordinates x_i, y_i that define its position P(i) in
the grid. The distance d(i, j) is here defined as the number of 'lattice steps'
separating them:

d(i, j) = |x_i − x_j| + |y_i − y_j|    (6.15)
The set of (directed) edges is constructed in two parts:
1. First, every node i is connected with all nodes j that are within distance
d(i, j) ≤ p for some given integer p (the local contacts).
2. Second, q additional (long-range) edges are built for each node i. The
probability that such a directed edge from i has endpoint j is proportional
to d(i, j)^(−r), with r a given real constant. To obtain a proper probability
distribution, the normalizing constant is given by Σ_{v∈V} d(i, v)^(−r). This
probability distribution is called the inverse r-th-power distribution.
6.5 Applications to Peer-to-Peer Systems 71
If p and q are given as fixed constants, this network family is described only
by parameter r.
Now, a message is to be sent within this network. The transmission model
is as follows: We start with two arbitrary nodes in the network, source node
s and target node t. The goal is to transmit the message from s to t with
as few steps as possible. An algorithm is defined as decentralized if at any
time-step the current message holder u has knowledge of only:
DA 1) the set of local contacts among all nodes (i.e. the underlying grid
structure),
DA 2) the position, P (t), of target node t on the grid, and
DA 3) the locations and long-range contacts of all nodes that have come in
contact with the message.
Here, we just want to state the results of this approach. The proofs can be
found in [354]. The first result is that there is only one possible value of
r in a two-dimensional grid for which a decentralized algorithm is able to
perform the transmission task in a number of steps polynomial in log n. This
efficiency is measured as the expected delivery time, i.e., the number of steps
before the message reaches its target:
Theorem 6.5.1. Navigability in Kleinberg Small-Worlds
There is a decentralized algorithm A and a constant α, independent of n,
so that when r = 2 and p = q = 1, the expected delivery time is at most
α · (log n)^2.
The next theorem shows that r = 2 is the only parameter for which the
expected delivery time is polynomial in log n:
Theorem 6.5.2. (a) Let 0 ≤ r < 2. There is a constant α_r, depending
on p, q, r, but independent of n, so that the expected delivery time of any
decentralized algorithm is at least α_r · n^((2−r)/3).
(b) Let r > 2. There is a constant α_r, depending on p, q, r, but independent
of n, so that the expected delivery time of any decentralized algorithm is at
least α_r · n^((r−2)/(r−1)).
These results can be generalized to multi-dimensional spaces. For any
k-dimensional space, a decentralized algorithm can construct paths of length
polynomial in log n if and only if r = k.
What does this decentralized algorithm look like? In each step, the current
message holder u chooses a contact that is as close to the target as possible
in terms of lattice distance. And that is all. Note that this very simple
algorithm does not make use of DA 3). Accordingly, no memorization of the
route a message has taken is needed.
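The grid construction and the greedy rule can be sketched as follows (an illustrative simulation with assumed helper names; p = q = 1 and r = 2 as in Theorem 6.5.1):

```python
import random

def kleinberg_grid(n, p=1, q=1, r=2):
    """Kleinberg's small-world on an n x n grid: local edges to every
    node within lattice distance p, plus q long-range contacts per
    node, drawn with probability proportional to d(i, j)**(-r)."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    d = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
    contacts = {}
    for u in nodes:
        local = [v for v in nodes if v != u and d(u, v) <= p]
        others = [v for v in nodes if v != u]
        weights = [d(u, v) ** (-r) for v in others]
        contacts[u] = local + random.choices(others, weights, k=q)
    return contacts, d

def greedy_route(contacts, d, s, t):
    """Forward to the known contact closest to the target in lattice
    distance; uses only local contacts and t's grid position."""
    path = [s]
    while path[-1] != t:
        u = path[-1]
        nxt = min(contacts[u], key=lambda v: d(v, t))
        if d(nxt, t) >= d(u, t):   # cannot get closer (never happens
            break                  # here, since p >= 1 local links)
        path.append(nxt)
    return path
```

Because each hop strictly decreases the lattice distance to the target, the route never needs the history required by DA 3).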
Summarizing, Kleinberg-small-worlds provide a way of building an overlay
network for Peer-to-Peer applications, in which a very simple, greedy and
local routing protocol is applicable. On the other hand, it requires some
2. When the datastore at a node is full and a new file f with key key(f )
arrives (from either a new insertion of a file or a successful request), the
node finds out from the current datastore the file with key v farthest
from the seed in terms of the distance in the key space S:
the family of harmonic distributions (a fact which inspired the name of the
protocol). As in the Kleinberg paper, the actual routing protocol is greedy:
every message holder forwards the message to the node it knows whose key
is closest to the requested file key.
The authors ensure that no node has more than a fixed number k of
(incoming) long range contacts. If, by chance, one node asks to establish a
long-range link to a node that has already reached this number, the latter
will refuse the new connection. The most interesting property of this protocol
is that it shows a trade-off between the number of links a node has and the
expected path length within the network to find a file:
Theorem 6.5.3. Symphony
The expected path length in an n-node network with k = O(1) edges, built
by the Symphony protocol, is inversely proportional to k and proportional to
(log n)^2.
This is true whether long-range links are used in one direction only (from the
one building it to the one randomly chosen) or in both directions.
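Symphony's long-range links can be sketched by sampling the clockwise distance of a link from the harmonic density p(x) = 1/(x ln n) on [1/n, 1] via inverse-transform sampling (the function name is illustrative):

```python
import random

def harmonic_offset(n, u=None):
    """Draw the clockwise ring distance x of a long-range link from
    the harmonic pdf p(x) = 1 / (x ln n) on [1/n, 1]. The cdf is
    F(x) = 1 + ln(x)/ln(n), so inverting a uniform u in [0, 1)
    yields x = n**(u - 1)."""
    if u is None:
        u = random.random()
    return n ** (u - 1)
```

A node at ring position s would then try to link to the node managing position (s + harmonic_offset(n)) mod 1, refusing links once it already has k incoming long-range contacts.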
The Symphony approach is elegant and smoothly transforms the idea of
Kleinberg small-worlds to the world of Peer-to-Peer systems. An even more
sophisticated approach was given by Hui, Lui and Yau in [309]. In their
Small-World Overlay Protocol (SWOP), clusters emerge in a self-organized
way. The basic idea is again based on a hash-function and nodes that are
placed on a unit-perimeter circle. Additionally, every node tries to connect
to one random node with the probability distribution in Equation 6.17. Here,
n is the number of clusters in the system.
A new node joining the circle will form the basis of a new cluster if both of its
neighbors are members of clusters that have reached the maximum size. Otherwise, it will
join the smaller of the two clusters and create some intra-cluster connections.
The maximal cluster size is given as a variable of the system and might
be changed dynamically. Each cluster has one designated head node that is
chosen by some periodically repeated voting mechanism. This head node is
responsible for maintaining some ‘long-range’ inter-cluster connections. The
routing protocol is the same greedy protocol used in the other approach: each
message holder forwards the message to the node it knows whose key is
closest to the requested file key.
The article is mainly concerned with the proper behavior of a protocol in a
flash crowd scenario: these are situations in which some static or dynamic
object is heavily requested. The example provided by the authors is the crush
on the CNN web server for news documents that was triggered by the 9/11
incident. Here, the news consists not only of static documents but might also
be changing within minutes. A careful distribution within the net can prevent
server crashes.
The idea proposed by the authors is that heavily requested documents should
be copied via the inter-cluster links so that nearly every cluster has its own
copy. This is sufficient in static scenarios; dynamic documents additionally
require a version number.
6.6 Summary
This chapter has presented three prominent network models that are able to
model different aspects of many complex and dynamic networks. First was
the random graph model. It is easy to simulate and many properties can be
analyzed with stochastic methods. It can be a good model for some Peer-to-
Peer systems. Other Peer-to-Peer systems exhibit the so-called small-world
effect: High clustering of nodes that share similar interests and just a few
links between nodes with very different interests. These few ‘long-range’ or
‘short-cut’ links decrease the diameter such that the average path length in
these networks is almost as short as in a random graph with the same number
of nodes and links. Finally, we presented a model that generates scale-free
networks. In these networks the presence of highly connected nodes (‘hubs’)
is much more probable than in random graphs, i.e., the probability P(k) of
finding a node with degree k is proportional to k^(−γ), where γ is a constant.
Small-world networks are interesting for Peer-to-Peer systems because
they provide a good way to structure nodes with similar interests into groups
without losing the small diameter of random graphs. Scale-free networks
exhibit a good fault tolerance, but on the other hand, they are extremely
vulnerable to attacks.
As shown, some authors have already tackled the problem of how de-
sired properties of these three network models can be transmitted to overlay
networks in Peer-to-Peer systems using simple and local protocols. Future
research will have to show which kind of network model is best for build-
ing structured, yet self-organizing overlay networks for Peer-to-Peer systems
that are stable despite dynamic changes and scale nicely under the constantly
increasing number of peers.
7. Distributed Hash Tables
Klaus Wehrle, Stefan Götz, Simon Rieche (University of Tübingen)
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 79-93, 2005.
Springer-Verlag Berlin Heidelberg 2005
Fig. 7.1: The lookup problem: Node A wants to store a data item D in the dis-
tributed system. Node B wants to retrieve D without having prior knowl-
edge of D’s current location. How should the distributed system, espe-
cially data placement and retrieval, be organized (in particular, with
regard to scalability and efficiency)?
¹ In the context of Peer-to-Peer systems, the distributed system – the collection
of participating nodes pursuing the same purpose – is often called the overlay
network or overlay system.
7.1 Distributed Management and Retrieval of Data 81
Fig. 7.2: Central Server: (1) Node A publishes its content on the central server S.
(2) Some node B requests the actual location of a data item D from the
central server S. (3) If existing, S replies with the actual location of D.
(4) The requesting node B transmits the content directly from node A.
This section advocates the use of Distributed Hash Tables by comparing three
basic strategies to store and retrieve data in distributed systems: centralized
servers, flooding search, and distributed indexing.
[Figure 7.3: Flooding search – node B broadcasts the query "B searches D" to its neighbors, which forward it through the network; node A, storing D, answers "I have D" and transmits D directly.]
The central server constitutes a single point of failure and attack. If it fails
or becomes unavailable for either of these reasons, the distributed system –
as a whole – is no longer useable.
Overall, the central server approach is best for simple and small applica-
tions or systems with a limited number of participants, since the costs for
data retrieval are in the order of O(1) and the amount of network load (in
proximity of the server) and the necessary storage capacity increase by O(N ).
But, scalability and availability are vital properties, especially when systems
grow by some orders of magnitude or when system availability is crucial.
Therefore, more scalable and reliable solutions need to be investigated.
Distributed systems with a central server are very vulnerable since all requests
rely on the server's availability and consistency. An opposite approach is
pursued by the so-called second generation of Peer-to-Peer systems (cf. Chapter 5.3).
They keep no explicit information about the location of data items at nodes
other than those actually storing the content. This means
that there is no additional information concerning where to find a specific
item in the distributed system. Thus, to retrieve an item D, the only chance
is to ask as many participating nodes as necessary whether they
presently have item D. Second generation Peer-to-Peer systems rely
on this principle and broadcast a request for an item D among the nodes
of the distributed system. If a node receives a query, it floods this message
to other nodes until a certain hop count (Time to Live – TTL) is exceeded.
Often, the general assumption is that content is replicated multiple times in
the network, so a query may be answered in a small number of hops.
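The flooding scheme with a hop limit can be sketched as follows (a simplified synchronous simulation; real systems forward asynchronously and decrement a per-message TTL):

```python
def flood(adj, start, has_item, ttl):
    """Broadcast a query hop by hop: every node forwards it to its
    neighbors until the hop counter (TTL) is exceeded; every reached
    node holding the item answers."""
    hits, seen, frontier = set(), {start}, [start]
    for _ in range(ttl):
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in seen:        # forward only to new nodes
                    seen.add(v)
                    nxt.append(v)
                    if has_item(v):
                        hits.add(v)
        frontier = nxt
    return hits
```

The trade-off stated in the text is visible here: a small TTL bounds the message load but may miss items stored further away, so flooding relies on content being replicated often enough to be found within few hops.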
Fig. 7.4: Comparison of complexity in terms of search effort (y-axis) and storage
cost per node (x-axis). Bottlenecks and special characteristics of each
approach are named.
Fig. 7.5: Distributed Hash Table: The nodes in the distributed system organize
themselves in a structured overlay and establish a small amount of rout-
ing information for quick and efficient routing to other overlay nodes.
(1) Node A sends a request for item D to an arbitrary node of the DHT.
(2) The request is forwarded according to DHT routing with O(logN )
hops to the target node. (3) The target node sends D to node A.
Distributed Hash Tables provide a global view of data distributed among many
nodes, independent of the actual location. Thereby, the location of data depends
on the current DHT state and not intrinsically on the data itself.
Overall, Distributed Hash Tables possess the following characteristics:
– In contrast to unstructured Peer-to-Peer systems, each DHT node manages
only a small number of references to other nodes, typically O(log N),
where N denotes the number of nodes in the system.
– By mapping nodes and data items into a common address space, routing
to a node leads to the data items for which a certain node is responsible.
– Queries are routed via a small number of nodes to the target node. Because
of the small set of references each node manages, a data item can be located
by routing via O(log N ) hops. The initial node of a lookup request may be
any node of the DHT.
– By distributing the identifiers of nodes and data items nearly equally
throughout the system, the load for retrieving items should be balanced
equally among all nodes.
– Because no node plays a distinct role within the system, the formation of
hot spots or bottlenecks can be avoided. Also, the departure or dedicated
elimination of a node should have no considerable effects on the function-
ality of a DHT. Therefore, Distributed Hash Tables are considered to be
very robust against random failures and attacks.
– A distributed index provides a definitive answer about results. If a data
item is stored in the system, the DHT guarantees that the data is found.
The following table compares the main characteristics of the presented
approaches in terms of complexity, vulnerability, and query ability. With
regard to communication overhead, per-node state maintenance, and
resilience, Distributed Hash Tables show the best performance unless
complex queries are vital. For fuzzy or complex query patterns, unstructured
Peer-to-Peer systems are still the best option.
Table 7.1: Comparison of central server, flooding search, and distributed indexing.
Distributed Hash Tables introduce new address spaces into which data is
mapped. Address spaces typically consist of large integer values, e.g., the
range from 0 to 2^160 − 1. Distributed Hash Tables achieve distributed indexing
by assigning a contiguous portion of the address space to each participating
node (Figure 7.6). Given a value from the address space, the main operation
provided by a DHT system is the lookup function, i.e., to determine the node
responsible for this value.
Distributed Hash Table approaches differ mainly in how they internally
manage and partition their address space. In most cases, these schemes lend
themselves to geometric interpretations of address spaces. As a simple ex-
ample, all mathematical operations on the address space could be performed
modulo its number of elements, yielding a ring-like topology.
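Such a ring-like mapping can be sketched as follows (an illustrative toy with a 16-bit address space instead of the typical 160 bits; the node addresses are made up):

```python
import hashlib

M = 16                                    # a toy 2**M address space

def ident(name):
    """Hash an arbitrary name into the address space."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

def responsible(node_ids, key_id):
    """The node managing key_id: the first node at or after key_id
    on the ring, wrapping around modulo 2**M."""
    return min(node_ids, key=lambda n: (n - key_id) % (2 ** M))

nodes = sorted(ident(f"10.0.0.{i}:4000") for i in range(8))
owner = responsible(nodes, ident("some-file.mp3"))
```

Every node thereby manages the contiguous arc of the ring between its predecessor's identifier and its own, and a lookup only needs to find the key's successor.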
7.2 Fundamentals of Distributed Hash Tables 87
[Figure: the address space, often viewed as a circle from 0 to 2^m − 1, with H(Node Y) = 3485, H(Node X) = 2906, and data item D at H("D") = 3107.]
Fig. 7.6: A linear address space with integer values ranging from 0 to 65,535. The
address space is partitioned among eight peers.
example, a node may transfer responsibility for a part of its address space to
other nodes, or several nodes may manage the same portion of address space.
Chapter 9 discusses load-balancing schemes in more detail.
7.2.3 Routing
There are two possibilities for storing data in a Distributed Hash Table.
In a Distributed Hash Table which uses direct storage, the data is copied
upon insertion to the node responsible for it (Figure 7.8(a)). The advantage
is that the data is located directly in the Peer-to-Peer system and the node
which inserted it can subsequently leave the DHT without the data becoming
unavailable. The disadvantage is the overhead in terms of storage and network
bandwidth. Since nodes may fail, the data must be replicated to several nodes
to increase its availability. Additionally, for large data, a huge amount of
storage is necessary on every node.
The other possibility is to store references to the data. The inserting node
only places a pointer to the data into the Distributed Hash Table. The data
itself remains on this node, leading to reduced load in the DHT (Figure
7.8(b)). However, the data is only available as long as the node is available.
In both cases, a node using the Distributed Hash Table for lookup purposes
does not have to be part of the Distributed Hash Table in order to use
its services. This allows realizing a DHT service as a third-party infrastructure
service, such as the OpenDHT project [511].
7.3.1 Overview
To store or access data in a Distributed Hash Table, a node first needs to join
it. The arrival of new nodes leads to changes in the DHT infrastructure, to
which the routing information and distribution of data needs to be adapted.
At this stage, the new node can insert data items into the Distributed Hash
Table and retrieve data from it. In case a node fails or leaves the system, the
DHT needs to detect and adapt to this situation.
It takes four steps for a node to join a Distributed Hash Table. First, the
new node has to get in contact with the Distributed Hash Table. Hence, with
some bootstrap method it gets to know some arbitrary node of the DHT. This
node is used as an entry point to the DHT until the new node is an equivalent
member of the DHT. Then, the new node needs to be assigned a partition
in the logical address space. Depending on the DHT implementation, a node
may choose arbitrary or specific partitions on its own or it determines one
based on the current state of the system. Third, the routing information in
the system needs to be updated to reflect the presence of the new node.
Fourth, the new node retrieves all (key, value) pairs under its responsibility
from the node that stored them previously.
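The key hand-over of the fourth step can be sketched on a simple ring model (bootstrap and routing updates are only implied by the shared ring set; names and the 16-bit space are illustrative):

```python
def responsible(ring, key, space=2 ** 16):
    """Successor of key on the identifier ring."""
    return min(ring, key=lambda n: (n - key) % space)

def join(ring, store, new_id):
    """Sketch of the four join steps: the bootstrap contact is
    implicit, new_id takes its position in the address space, the
    shared 'ring' set stands in for updated routing information,
    and the former successor hands over every (key, value) pair
    that now falls under new_id's responsibility."""
    old = responsible(ring, new_id)   # node covering new_id's range
    ring.add(new_id)
    moved = {k: v for k, v in store[old].items()
             if responsible(ring, k) == new_id}
    store[new_id] = moved
    for k in moved:
        del store[old][k]
```

Only the keys between the new node's predecessor and the new node itself move; the rest of the system is unaffected, which is what makes joins local operations.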
[Figure: layered view – a distributed application invokes Put(Key, Value) and Get(Key) on a Distributed Hash Table (CAN, Chord, Pastry, Tapestry, …), which returns the stored Value.]
There are two angles from which the functionality of Distributed Hash Tables
can be viewed: they can be interpreted as routing systems or as storage
systems. The first interpretation focuses on the delivery of packets to nodes
in a DHT based on a destination ID. In the second, a Distributed Hash Table
appears as a storage system similar to a hash table. These notions are reflected
in the interface that a Distributed Hash Table provides to applications.
Given the above interfaces, a node can only utilize its primitives after joining
a Distributed Hash Table. However, a distributed system can also be struc-
tured such that the nodes participating in the DHT make available the DHT
services to other, non-participating hosts. In such an environment, these hosts
act as clients of the DHT nodes. This setup can be desirable where, for ex-
ample, the Distributed Hash Table is run as an infrastructure service on a
dedicated set of nodes for increased reliability. The interface between clients
and DHT nodes is also well-suited to realize access control and accounting
for services available on the Distributed Hash Table. Note that this interface
can itself be implemented as an application on top of the DHT routing or
storage layer.
7.5 Conclusions
8.1 Chord
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 95-117, 2005.
Springer-Verlag Berlin Heidelberg 2005
96 8. Selected DHT Algorithms
[Figure 8.1 shows a 6-bit Chord identifier circle with nodes N8, N10, N15, N18, N24, N29, N35, N43, N48, N57 and keys K5, K26, K38, K43, K49, K51, K61.]

Finger Table of N8:
Idx  Target ID  Successor
0    N8 + 1     N10
1    N8 + 2     N10
2    N8 + 4     N15
3    N8 + 8     N18
4    N8 + 16    N24
5    N8 + 32    N43
Fig. 8.1: A 6-bit Chord identifier space. Dotted lines indicate which nodes host
which keys. Black lines represent the fingers of node N 8.
8.1.2 Routing
Given a Chord identifier circle, all identifiers are well-ordered and keys and
nodes are uniquely associated. Thus, each (key, value) pair is located and
managed on a single, well-defined node. The DHT is formed by the set of
all (key, value) pairs on all nodes of an identifier circle. The key to efficient
lookup and modification operations on this data is to quickly locate the node
responsible for a particular key.
For a very simple routing algorithm, only very little per-node state is re-
quired. Each node needs to store its successor node on the identifier circle.
When a key is being looked up, each node forwards the query to its successor
in the identifier circle. One of the nodes will determine that the key lies be-
tween itself and its successor. Thus, the key must be hosted by this successor.
Consequently, the successor is communicated as the result of the query back
to its originator.
This inefficient form of key location involves a number of messages linear
to the number of nodes on the identifier circle. Chord utilizes additional per-
node state for more scalable key lookups.
Each node maintains a routing table, the finger table (cf. Figure 8.1),
pointing to other nodes on the identifier circle. Given a circle with l-bit
identifiers, a finger table has a maximum of l entries. On node n, the table
entry at row i identifies the first node that succeeds n by at least 2^(i−1), i.e.,
successor(n + 2^(i−1)), where 1 ≤ i ≤ l. In Figure 8.1, for example, the second
finger of node N8 (8 + 2^1 = 10) is node N10 and the third finger (8 + 2^2 = 12)
is node N15. The first finger of a node is always its immediate successor on
the identifier circle.
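Using the node IDs of Figure 8.1, the finger construction can be sketched as follows (an illustrative helper that scans a global node list instead of routing):

```python
def successor(nodes, ident, m=6):
    """First node whose ID is at or after ident on the 2**m circle."""
    space = 2 ** m
    return min(nodes, key=lambda n: (n - ident) % space)

nodes = [8, 10, 15, 18, 24, 29, 35, 43, 48, 57]  # the ring of Fig. 8.1
fingers = [successor(nodes, 8 + 2 ** i) for i in range(6)]
# → [10, 10, 15, 18, 24, 43], matching the finger table of N8
```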
As a finger table stores at most l entries, its size is independent of the
number of keys or nodes forming the DHT. Each finger entry consists of
a node ID, an IP address and port pair, and possibly some book-keeping
information. Even for large identifiers, e.g., l = 256, this is a relatively small
amount of data per node which can be efficiently managed and searched. The
routing information from finger tables provides information about nearby
nodes and a coarse-grained view of long-distance links at intervals increasing
by powers of two.
The Chord routing algorithm exploits the information stored in the finger
table of each node. A node forwards queries for a key k to the closest pre-
decessor of k on the identifier circle according to its finger table. When the
query reaches a node n such that k lies between n and the successor of n on
the identifier circle, node n reports its successor as the answer to the query.
Thus, for distant keys k, queries are routed over large distances on the
identifier circle in a single hop. Furthermore, the closer the query gets to
k, the more accurate the routing information of the intermediate nodes on
the location of k becomes. Given the power-of-two intervals of finger IDs,
each hop covers at least half of the remaining distance on the identifier circle
between the current node and the target identifier. This results in an average
of O(log(N)) routing hops for a Chord circle with N participating nodes. For
example, a Chord network with 1,000 nodes forwards queries, on average, in
roughly 10 steps. In their experiments, Stoica et al. show that the average
lookup requires ½ · log(N) steps.
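The routing rule can be sketched on the ring of Figure 8.1 (an illustrative trace in which a global node list stands in for the per-node finger tables and real messages):

```python
def route(nodes, start, key, m=6):
    """Trace a lookup: each hop forwards to the closest preceding
    finger of the key until the key falls between the current node
    and its immediate successor, which then hosts the key."""
    space = 2 ** m
    succ = lambda i: min(nodes, key=lambda n: (n - i) % space)
    path, cur = [start], start
    while True:
        nxt = succ(cur + 1)                    # immediate successor
        if (key - cur) % space <= (nxt - cur) % space:
            path.append(nxt)                   # nxt is responsible
            return path
        fingers = {succ(cur + 2 ** i) for i in range(m)}
        # closest predecessor of the key among the fingers
        best = max((f for f in fingers
                    if 0 < (f - cur) % space < (key - cur) % space),
                   key=lambda f: (f - cur) % space,
                   default=nxt)
        path.append(best)
        cur = best
```

For example, a lookup for key 54 starting at node N8 follows the path 8 → 43 → 48 → 57 on the example ring: each hop at least halves the remaining distance on the circle.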
8.1.3 Self-Organization
The Chord system described so far also needs to allow for nodes joining and
leaving the system as well as to deal with node failures.
Node Arrivals
In order to join a Chord identifier circle, the new node first determines some
identifier n. The original Chord protocol does not impose any restrictions on
this choice. For example, n could be set at random, assuming that the
probability of collisions with existing node IDs is low in an identifier space large
enough. There have been several proposals to restrict node IDs according to
certain criteria, e.g., to exploit network locality or to avoid identity spoofing.
For the new node n, another node o must be known which already par-
ticipates in the Chord system. By querying o for n’s own ID, n retrieves its
successor. It notifies its successor s of its presence leading to an update of
the predecessor pointer of s to n. Node n then builds its finger table by
iteratively querying o for the successors of n + 2^1, n + 2^2, n + 2^3, etc. At this stage, n
has a valid successor pointer and finger table. However, n does not show up
in the routing information of other nodes. In particular, it is not known to
its predecessor as its new successor since the lookup algorithm is not apt to
determine a node’s predecessor.
Stabilization Protocol
Chord introduces a stabilization protocol to validate and update successor
pointers as nodes join and leave the system. Stabilization requires an addi-
tional predecessor pointer and is performed periodically on every node. The
stabilize() function on a node k requests the successor of k to return its
predecessor p. If p equals k, k and its successor agree on being each other’s
respective predecessor and successor. The fact that p lies between k and its
successor indicates that p recently joined the identifier circle as k’s successor.
Thus, node k updates its successor pointer to p and notifies p of being its
predecessor.
With the stabilization protocol, the new node n does not actively de-
termine its predecessor. Instead, the predecessor itself has to detect and fix
inconsistencies of successor and predecessor pointers using stabilize(). Af-
ter node n has thus learnt of its predecessor, it copies all keys it is responsible
for, i.e., keys between predecessor(n) and n, while the predecessor of n re-
leases them.
At this stage, all successor pointers are up to date and queries can be
routed correctly, albeit slowly. Since the new node n is not present in the
finger tables of other nodes, they forward queries to the predecessor of n
even if n would be more suitable. Node n’s predecessor then needs to forward
the query to n via its successor pointer. Multiple concurrent node arrivals
may lead to several linear forwardings via successor pointers.
The number of nodes whose finger table needs to be updated is in the
order of O(log(N)) in a system with N nodes. Based on the layout of a
finger table, a new node n can identify the nodes with outdated finger tables
as predecessor(n − 2^(i−1)) for 1 < i ≤ l. However, the impact of outdated
finger tables on lookup performance is small, and in the face of multiple node
arrivals, the finger table updates would be costly. Therefore, Chord prefers to
update finger tables lazily. Similar to the stabilize() function, each node
n runs the fix_fingers() function periodically. It picks a random index i
(1 < i ≤ l) from the finger table and looks up the true current successor
of n + 2^(i−1).
Node Failures
Chord addresses node failures on several levels. To detect node failures, all
communication with other nodes needs to be checked for timeouts. When a
node detects a failure of a finger during a lookup, it chooses the next best
preceding node from its finger table. Since a short timeout is sufficient, lookup
performance is not significantly affected in such a case. The fix_fingers()
function ensures that failed nodes are removed from the finger tables. To
expedite this process, fix_fingers() can be invoked specifically on a failed
finger.
Node Departures
Treating nodes that voluntarily leave a Chord network like failed ones does
not affect the stability of the network. Yet it is inefficient because the failure
needs to be detected and rectified. Therefore, a leaving node should transfer
its keys to its successor and notify its successor and predecessor. This ensures
that data is not lost and that the routing information remains intact.
8.2 Pastry
The Pastry distributed routing system was proposed in 2001 by Rowstron and
Druschel [527]. Similar to Chord, its main goal is to create a completely de-
centralized, structured Peer-to-Peer system in which objects can be efficiently
located and messages efficiently routed. Instead of organizing the identifier
space as a Chord-like ring, the routing is based on numeric closeness of iden-
tifiers. In their work, Rowstron and Druschel focus not only on the number
of routing hops, but also on network locality as factors in routing efficiency.
Fig. 8.2: A 4-bit Pastry identifier space with six keys mapped onto five nodes.
Numeric closeness is an ambiguous metric for assigning keys to nodes as
illustrated for key K22.
In Pastry, nodes and data items uniquely associate with l-bit identifiers, i.e.,
integers in the range of 0 to 2l −1 (l is typically 128). Under such associations,
an identifier is termed a node ID or a key, respectively. Pastry views identifiers
as strings of digits to the base 2b where b is typically chosen to be 4. A key
is located on the node to whose node ID it is numerically closest.
Figure 8.2 illustrates a Pastry identifier space with 4-bit identifiers and
b = 2, so all numbers are to the base of 4. The closest node to, e.g., key
K01 is N01, whereas K03 is located on node N10. The distances of key K22
to nodes N21 and N23 are equal, so both nodes host this key to satisfy the
requirements.
Pastry’s node state is divided into three main elements. The routing table,
similar to Chord’s finger table, stores links into the identifier space. The
leaf set contains nodes which are close in the identifier space (like Chord’s
successor list). Nodes that are close together in terms of network locality are
listed in the neighborhood set.
Pastry measures network locality based on a given scalar network proximity
metric. This metric is assumed to be already available from the network
infrastructure and might range from IP hops to the actual geographic
location of nodes.
[Figure 8.3 shows the node state of Pastry node 103220: a routing table with one row per prefix length (e.g., nodes 031120, 201303, and 312201 in row 0 share no digit with 103220, while 103112 and 103302 in row 3 share the prefix 103), the leaf set {103123, 103210, 103302, 103330}, and the neighborhood set {031120, 312201, 120132, 101203}.]
Fig. 8.3: Pastry node state for the node 103220 in a 12-bit identifier space and a
base of 4 (l = 12, b = 2). The routing table lists nodes with the length
of the common node identifier prefix corresponding to the row index.
Routing Table
A Pastry node's routing table R (see Figure 8.3) is made up of ⌈l/b⌉ rows with
2^b − 1 entries per row (an additional column in Figure 8.3 also lists the
digits of the local node ID for clarity). On node n, the entries in row i hold
the identities of Pastry nodes whose node IDs share an i-digit prefix with n
but differ in the (i + 1)-th digit. For example, the first row of the routing table is
populated with nodes that have no prefix in common with n. When there
is no node with an appropriate prefix, the corresponding table entry is left
empty.
Routing tables built according to the Pastry scheme achieve an effect
similar to Chord finger tables. A node has a coarse-grained knowledge of
other nodes which are distant in the identifier space. The detail of the routing
information increases with the proximity of other nodes in the identifier space.
Without a large number of nearby nodes, the last rows of the routing table
are only sparsely populated. Intuitively, the identifier space would need to be
fully exhausted with node IDs for complete routing tables on all nodes. In
a system with N nodes, only log_{2^b}(N) routing table rows are populated on
average.
In populating the routing table, there is a choice from the set of nodes
with the appropriate identifier prefix. During the routing process, network
locality can be exploited by selecting nodes which are close in terms of a
network proximity metric.
Leaf Set
The routing table sorts node IDs by prefix. To increase lookup efficiency, the
leaf set L of node n holds the |L| nodes numerically closest to n. The routing
table and the leaf set are the two sources of information relevant for routing.
The leaf set also plays a role similar to Chord’s successor lists in recovering
from failures of adjacent nodes.
Neighborhood Set
Instead of numeric closeness, the neighborhood set M is concerned with nodes
that are close to the current node with regard to the network proximity
metric. Thus, it is not involved in routing itself but in maintaining network
locality in the routing information.
8.2.3 Routing

Routing in Pastry is divided into two main steps. First, a node checks whether
the key k is within the range of its leaf set. If this is the case, it implies that k
is located on one of the nearby nodes of the leaf set. Thus, the node forwards
the query to the leaf set node numerically closest to k. In case this is the
node itself, the routing process is finished.
If k does not fall into the range of leaf set nodes, the query needs to be
forwarded over a longer distance using the routing table. In this case, a node
n tries to pass the query on to a node which shares a longer common prefix
with k than n itself. If there is no such entry in the routing table, the query
is forwarded to a node which shares a prefix with k of the same length as n
but which is numerically closer to k than n.
For example, a node with a routing table as in Figure 8.3 would send a
query for key 103200 on to node 103210 as it is the leaf set node closest to
the key. Since the leaf set holds the closest nodes, the key is known to be
located on that node. A query for key 102022, although numerically closer to
node 101203, is forwarded to node 102303 since it shares the prefix 102 with
the key (whereas the current node shares only 10). For key 103000, there is
no routing table entry with a longer common prefix than the current node.
Thus the current node routes the query to node 103112 which has the same
common prefix 103 but is numerically closer than the current node.
This scheme ensures that routing loops do not occur because the query
is routed strictly to a node with a longer common identifier prefix than the
current node, or to a numerically closer node with the same prefix.
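The routing decision described above might be sketched as follows; the digit-string identifiers, the flat routing-table dictionary keyed by (row, digit), and all names are simplifying assumptions rather than Pastry's actual data structures:

```python
# Hedged sketch of Pastry's routing decision for the running example
# (base-4 digit strings); the data layout is illustrative, not Pastry's.

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits the two identifiers have in common."""
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def next_hop(local, key, leaf_set, routing_table):
    val = lambda s: int(s, 4)  # numeric value of a base-4 digit string
    # Step 1: if the key falls within the leaf set's range, deliver it to
    # the numerically closest leaf (possibly the local node itself).
    candidates = leaf_set + [local]
    if min(map(val, candidates)) <= val(key) <= max(map(val, candidates)):
        return min(candidates, key=lambda n: abs(val(n) - val(key)))
    # Step 2: prefer a routing table entry with a longer common prefix.
    p = shared_prefix_len(local, key)
    entry = routing_table.get((p, key[p]))  # row p, column = next key digit
    if entry is not None:
        return entry
    # Step 3: fall back to any known node with an equally long prefix that
    # is numerically closer to the key than the local node.
    known = leaf_set + list(routing_table.values())
    closer = [n for n in known if shared_prefix_len(n, key) >= p
              and abs(val(n) - val(key)) < abs(val(local) - val(key))]
    return min(closer, key=lambda n: abs(val(n) - val(key))) if closer else local

# Node state from Figure 8.3 (node 103220):
leaf_set = ["103123", "103210", "103302", "103330"]
routing_table = {(0, '0'): "031120", (0, '2'): "201303", (0, '3'): "312201",
                 (1, '1'): "110003", (1, '2'): "120132", (1, '3'): "132012",
                 (2, '0'): "100221", (2, '1'): "101203", (2, '2'): "102303",
                 (3, '1'): "103112", (3, '3'): "103302", (4, '1'): "103210"}
print(next_hop("103220", "102022", leaf_set, routing_table))  # 102303
```

Run against the node state of Figure 8.3, this sketch reproduces the three routing examples from the text.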
8.2.4 Self-Organization
In practice, Pastry needs to deal with node arrivals, departures, and failures,
while, at the same time, maintaining good routing performance if possible.
This section describes how Pastry achieves these goals.
Node Arrival
Before joining a Pastry system, a node chooses a node ID. Pastry itself allows
arbitrary node IDs, but applications may have more restrictive requirements.
Commonly, a node ID is formed as the hash value of a node’s public key or
IP address.
For bootstrapping, the new node n is assumed to know a nearby Pastry
node k based on the network proximity metric. Now n needs to initialize its
node state, i.e., its routing table, leaf set, and neighborhood set. Since k is assumed to
be close to n, the nodes in k’s neighborhood set are reasonably good choices
for n, too. Thus, n copies the neighborhood set from k.
To build its routing table and leaf set, n needs to retrieve information
about the Pastry nodes which are close to n in the identifier space. To do
this, n routes a special “join” message via k to a key equal to n. According
to the standard routing rules, the query is forwarded to node c with the
numerically closest node ID. Due to this property, the leaf set of c is suitable
for n, so it retrieves c’s leaf set for itself.
The join request triggers all nodes, which forwarded the request towards
c, to provide n with their routing information. Node n’s routing table is
constructed from the routing information of these nodes starting at row zero.
As this row is independent of the local node ID, n can use the entries at row
zero of k’s routing table. Recall that n and k are assumed to be close
in terms of the network proximity metric. Since k stores nearby nodes in its
routing table, these entries are also close to n. In the general case of n and k
not sharing a common prefix, n cannot re-use entries from any other row in
k’s routing table.
The route of the join message from n to c leads via nodes v1, ..., vm whose
node IDs share increasingly longer common prefixes with n. Thus, row 1 from
the routing table of node v1 is also a good choice for the same row of the
routing table of n. The same is true for row 2 on node v2, and so on. Based on this information,
the routing table can be constructed for node n.
Finally, the new node sends its node state to all nodes referenced in its
routing table, leaf set, and neighborhood set. These nodes can update their
own routing information accordingly. In
contrast to the lazy updates in Chord, this mechanism actively updates the
state in all affected nodes when a new node joins the system. At this stage,
the new node is fully present and reachable in the Pastry network.
The arrival and departure of nodes affects only a relatively small number
of nodes in a Pastry system. Consequently, the state updates from multiple
such operations rarely overlap and there is little contention. Thus, Pastry uses
the following optimistic time-stamp-based approach to avoid major inconsis-
tencies of node state: the state a new node receives is time-stamped. After
the new node initializes its own internal state, it announces its state back
to the other nodes including the original time-stamps. If the time-stamps do
not match on the other nodes, they request the new node to repeat the join
procedure.
Node Failure
Node failure is detected when a communication attempt with another node
fails. Routing requires contacting nodes from the routing table and leaf set,
resulting in lazy detection of failures. Since the neighborhood set is not in-
volved in routing, Pastry nodes periodically test the liveness of the nodes in
their neighborhood sets.
During routing, the failure of a single node in the routing table does not
significantly delay the routing process. The local node can choose to forward
the pending query to a different node from the same row in the routing table.
Alternatively, a node could store backup nodes with each entry in the routing
table.
Failed nodes need to be evicted from the routing table to preserve routing
performance and correctness. To replace a failed node at entry i in row j of
its routing table (R_j^i), a node contacts another node referenced in row j.
Entries in the same row j of the remote node are valid for the local node.
Hence, it can copy entry R_j^i from the remote node to its own routing table
after verifying the liveness of the entry. In case this entry has failed as well,
the local node can probe the other nodes in row j for entry R_j^i. If no live node with
the appropriate node ID prefix can be obtained in this way, the local node
expands its horizon by querying nodes from the preceding row R_{j−1}. With
very high probability, this procedure eventually finds a valid replacement for
the failed routing table entry R_j^i, if one exists.
Repairing a failed entry in the leaf set L of a node is straightforward by
utilizing the leaf sets of other nodes referenced in the local leaf set. The node
contacts the leaf set entry with the largest index on the side of the failed
node in order to retrieve the remote leaf set L′. If this node is unavailable,
the local node can revert to leaf set entries with smaller indices. Since the
entries in L and L′ are close to each other in the identifier space and overlap,
the node selects an appropriate replacement node from L′ and adds it to its
own leaf set. In the event that the replacement entry has failed as well, the node
again requests the leaf sets of other nodes from its local leaf set. For this
procedure to be unsuccessful, |L|/2 adjacent nodes need to fail simultaneously.
The probability of such a circumstance can be kept low even with modest
values of |L|.
Nodes recover from node failures in their neighborhood sets in a fashion
similar to repairing the leaf set. However, failures cannot be detected lazily
since the nodes in the neighborhood set are not contacted regularly for rout-
ing purposes. Therefore, each node periodically checks the liveness of nodes
in its neighborhood set. When a node failure is detected, a node consults
the neighborhood sets of other neighbor nodes to determine an appropriate
replacement entry.
Node Departure
Since Pastry can maintain stable routing information in the presence of node
failures, deliberate node departures were originally treated as node failures
for simplicity. However, a Pastry network would benefit from departure opti-
mizations similar to those proposed for Chord. The primary goals would be
to prevent data loss and reduce the amount of network overhead induced by
Pastry’s failure recovery mechanisms.
Arbitrary Failures
The approaches proposed for dealing with failures assumed that nodes fail
by becoming unreachable. However, failures can lead to a random behavior
of nodes, including malicious violations of the Pastry protocol. Rowstron and
Druschel propose to address these problems by randomly choosing alter-
native routes to circumvent failed nodes. Thus, a node chooses randomly,
according to the constraints for routing correctness, from a set of nodes to
route queries to with a bias towards the default route. A failed node would
thus be able to interfere with some traffic but eventually be avoided after a
number of retransmissions. How node arrivals and departures can be made
more resilient to failed or malicious nodes is not addressed in the original
work on Pastry.
8.2.5 Performance

Pastry optimizes two aspects of routing and locating the node responsible for
a given key: it attempts both to achieve a small number of hops to reach the
destination node, and to exploit network locality to reduce the overhead of
each individual hop.
Route Length
The routing scheme in Pastry essentially divides the identifier space into
domains of size 2^n, where n is a multiple of b. Routes lead from high-order
domains to low-order domains, thus reducing the remaining identifier space
to be searched in each step. Intuitively, this results in an average number of
routing steps related to the logarithm of the size of the system. This intuition
is supported by a more detailed analysis.
It is assumed that routing information on all nodes is correct and that
there are no node failures. There are three cases in the Pastry routing scheme,
the first of which is to forward a query according to the routing table. In
this case, the query is forwarded to a node with a longer prefix match than
the current node. Thus, the number of nodes with longer prefix matches
shrinks by a factor of about 2^b in each such step, so only about log_{2^b}(N)
steps of this kind can occur. The two remaining cases, delivery via the leaf
set and forwarding to a numerically closer node with an equally long prefix,
add only a small number of additional hops with high probability, yielding
an expected route length of O(log_{2^b}(N)).
Locality
By exploiting network locality, Pastry routing optimizes not only the number
of hops but also the costs of each individual hop. The criteria to populate a
node’s routing table allow a choice among a number of nodes with matching
ID prefixes for each routing table entry. By selecting nearby nodes in terms
of network locality, the individual routing lengths are minimized. This ap-
proach does not necessarily yield the shortest end-to-end route but leads to
reasonable total route lengths.
Initially, a Pastry node uses the routing table entries from nodes on a
path to itself in the identifier space. The proximity of the new node n and
the existing well-known node k implies that the entries in k’s first row of
the routing table are also close to n. The entries of subsequent rows from
nodes on the path from k to n may seem close to k but not necessarily to
n. However, the distance from k to these nodes is relatively long compared
to the distance between k and n. This is because the entries in later routing
table rows have to be chosen from a logarithmically smaller set of nodes in
the system. Hence, their distance to k and n increases logarithmically on
average. Another implication of this fact is that messages are routed over
increasing distances the closer they get to the destination ID.
Fig. 8.4: A two-dimensional six-bit CAN identifier space with four nodes, parti-
tioned into the zones A (0−63, 0−31), B (0−31, 32−47), C (0−31, 48−63),
and D (32−63, 32−63). For simplicity, it is depicted as a plane instead of
a torus.
Fig. 8.5: The route from node N1 to a key K with coordinates (x, y) in a two-
dimensional CAN topology before node N7 joins. Neighbor set of N1:
{N2, N6, N5}.

Fig. 8.6: New node N7 arrives in the zone of N1. N1 splits its zone and assigns
one half to N7. Updated neighbor set of N1: {N7, N2, N6, N5}.
For routing purposes, a CAN node stores information only about its immedi-
ate neighbors. Two nodes in a d-dimensional space are considered neighbors
if their coordinate spans overlap along d − 1 dimensions and abut along the
remaining dimension. Figure 8.5 illustrates neighbor relationships. For example,
nodes N1 and N6 are neighbors as they overlap in the y dimension and are
next to each other in the x dimension. At the same time, nodes N5 and N6 are
not neighbors as they do not overlap in any dimension. Similarly, nodes N1
and N4 overlap in the x dimension but are not adjacent in the y dimension,
so they are not neighbors of each other.
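The neighbor relation can be sketched as follows, assuming zones are represented as per-dimension (lo, hi) intervals; the coordinates are taken from Figure 8.4:

```python
# Hedged sketch of CAN's neighbor test; zones are axis-aligned boxes given as
# per-dimension (lo, hi) intervals, with coordinates from Figure 8.4.

def is_neighbor(zone_a, zone_b):
    """True if the zones' spans overlap in d - 1 dimensions and abut in one."""
    overlaps = abuts = 0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(zone_a, zone_b):
        if a_lo < b_hi and b_lo < a_hi:      # spans share a common interval
            overlaps += 1
        elif a_hi == b_lo or b_hi == a_lo:   # spans merely touch
            abuts += 1
    return overlaps == len(zone_a) - 1 and abuts == 1

# Zones from Figure 8.4 (intervals are half-open):
A = ((0, 64), (0, 32))    # (0-63, 0-31)
B = ((0, 32), (32, 48))   # (0-31, 32-47)
C = ((0, 32), (48, 64))   # (0-31, 48-63)
D = ((32, 64), (32, 64))  # (32-63, 32-63)
print(is_neighbor(A, B), is_neighbor(A, C))  # True False
```

A and B overlap along x and abut at y = 32, so they are neighbors; A and C neither overlap nor touch along y, so they are not.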
The routing information in CAN is comprised of the IP address, a port,
and the zone of every neighbor of a node. This data is necessary to access
the CAN service on a neighbor node and to know its location in the identifier
space. In a d-dimensional identifier space partitioned into zones of equal size,
each node has 2d neighbors. Thus, the number of nodes participating in a
CAN system can grow very large while the necessary routing information per
node remains constant.
8.3 Content Addressable Network CAN 109
8.3.4 Self-Organization
Node Arrival
A node n joining a CAN system needs to be allocated a zone and the zone
neighbors need to learn of the existence of n. The three main steps in this
procedure are: to find an existing node of a CAN system; to determine which
zone to assign to the new node; and to update the neighbor state.
Like Chord and Pastry, CAN is not tied to a particular mechanism for
locating nodes in the overlay network to be joined. However, Ratnasamy et
al. suggest using a dynamic DNS name to record one or more nodes belonging
to a particular CAN system. The referenced nodes may in turn publish a list
of other nodes in the same CAN overlay. This scheme allows for replication
and randomized node selection to circumvent node failures.
Given a randomly chosen location in the identifier space, the new node
n sends a special join message via one of the existing nodes to these coordi-
nates. Join messages are forwarded according to the standard CAN routing
procedure. After the join message reaches the destination node d, d splits
its zone in half and assigns one half to n (cf. Figure 8.6). In order to ease
the merging of zones when nodes leave and to equally partition the identifier
space, CAN assumes a certain ordering of the dimensions by which zones
are split. For example, zones may be split along the first (x) dimension, then
along the second (y) dimension and so on. Finally, d transfers the (key, value)
pairs to n for which it has become responsible.
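The zone split on node arrival might be sketched like this; the alternating split dimension and the box representation of zones are illustrative simplifications:

```python
# Illustrative sketch (not the original CAN code): splitting the destination
# node's zone in half when a new node joins, alternating split dimensions.

def split_zone(zone, depth):
    """Split an axis-aligned zone along dimension depth % d; return both halves."""
    d = depth % len(zone)
    lo, hi = zone[d]
    mid = (lo + hi) // 2
    left, right = list(zone), list(zone)
    left[d] = (lo, mid)    # the splitting node keeps one half ...
    right[d] = (mid, hi)   # ... and assigns the other to the new node
    return tuple(left), tuple(right)

# A node owning the full 6-bit two-dimensional space admits a first joiner:
old, new = split_zone(((0, 64), (0, 64)), depth=0)
print(old, new)  # ((0, 32), (0, 64)) ((32, 64), (0, 64))
```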
Node Failure
The zones of failing or leaving nodes must be taken over by live nodes to
maintain a valid partitioning of the CAN identifier space. A CAN node de-
tects the failure of a neighbor when it ceases to send update messages. In such
an event, the node starts a timer. When the timer fires, it sends takeover mes-
sages to the neighbors of the failed node. The timer is set up such that nodes
with large zones have long timeouts while small zones result in short time-
outs. Consequently, nodes with small zone sizes send their takeover messages
first.
When a node receives a takeover message, it cancels its own timer pro-
vided its zone is larger than the one advertised in the message. Otherwise,
it replies with its own takeover message. This scheme efficiently chooses the
neighboring node with the smallest zone volume. The elected node claims
ownership of the deserted zone and merges it with its own zone if possible.
Alternatively, it temporarily manages both zones.
The hash-table data of a failed node is lost. However, the application
utilizing CAN is expected to periodically refresh data items it inserted into
the DHT (the same is true for the other systems presented here). Thus, the
hash table state is eventually restored.
During routing, a node may find that the neighbor to which a message
is to be forwarded has failed and the repair mechanism has not yet taken
effect. In such a case, it forwards the message to the live neighbor next
closest to the destination coordinates. If all neighbors that are closer to the
destination have failed, the local node floods the message in a controlled
manner within the overlay until a closer node is found.
Node Departure
When a node l deliberately leaves a CAN system, it notifies a neighbor n
whose zone can be merged with l’s zone. If no such neighbor exists, l chooses
the neighbor with the smallest zone volume. It then copies the contents of its
hash table to the selected node so this data remains available.
As described above, departing and failing nodes can leave a neighbor node
managing more than one zone at a time. CAN uses a background process of
zone reassignment to restore, where possible, a one-to-one assignment of
nodes to zones.
With multiple hash functions, a (key, value) pair would be associated with
a different identifier per hash function. Storing and accessing the (key, value)
at each of the corresponding nodes increases data availability. Furthermore,
routing can be performed in parallel towards all the different locations of a
data item reducing the average query latency. However, these improvements
come at the cost of additional per-node state and routing traffic.
The mechanisms presented above reduce per-hop latency by increasing
the number of neighbors known to a node. This allows a node to forward
messages to neighbors with a low RTT. However, CAN may also construct
the overlay so it resembles more closely the underlying IP network. To place
nodes close to each other, both at the IP and the overlay level, CAN assumes
the existence of well-known landmark nodes. Before joining a CAN network,
a node samples its RTT to the landmarks and chooses a zone close to a
landmark with a low RTT. Thus, the network latency en route to its neighbors
can be expected to be low resulting in lower per-hop latency.
For a more uniform partition of the identifier space, nodes should not join
a CAN system at a random location. Instead, the node which manages the
initial random location queries its neighbors for their zone volume. The node
with the largest zone volume is then chosen to split its zone and assign half
of it to the new node. This mechanism contributes significantly to a uniform
partitioning of the coordinate space.
For real load-balancing, however, the zone size is not the only factor to
consider. Particularly popular (key, value) pairs create hot spots in the iden-
tifier space and can place substantial load on the nodes hosting them. In a
CAN network, overload caused by hot-spots may be reduced through caching
and replication. Each node caches a number of recently accessed data items
and satisfies queries for these data items from its cache if possible. Overloaded
nodes may also actively replicate popular keys to their neighbors. The neigh-
bors in turn reply to a certain fraction of these frequent requests themselves.
Thus, load is distributed over a wider area of the identifier space.
8.4 Symphony
Like Chord, Symphony arranges its nodes in a ring, and each node maintains
links to its immediate successor and predecessor on the ring. In Symphony,
the Chord finger table is replaced by a constant but configurable number k
of long distance links. In contrast to other systems, there is no deterministic
construction rule for long distance links. Instead, these links are chosen
randomly according to harmonic distributions (hence the name Symphony).
Effectively, the harmonic distribution of
long-distance links favors large distances in the identifier space for a system
with few nodes and decreasingly smaller distances as the system grows.
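The harmonic link selection can be sketched via inverse-transform sampling; the density p(x) = 1/(x ln N) on [1/N, 1] is Symphony's, while the code itself is an illustrative sketch:

```python
# Hedged sketch: Symphony draws each long-distance link from the harmonic
# density p(x) = 1/(x ln N) on [1/N, 1], where x is the clockwise distance
# on the unit ring. Inverting the CDF gives x = N**(u - 1) for uniform u.
import random

def long_link_distance(num_nodes: int) -> float:
    u = random.random()
    return num_nodes ** (u - 1)

# With few nodes, sampled distances tend to be large; as the system grows,
# ever smaller distances become likely, matching the behavior described above.
random.seed(1)
samples = [long_link_distance(2 ** 20) for _ in range(10000)]
print(min(samples) >= 2 ** -20, max(samples) <= 1.0)  # True True
```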
The basic routing in this setup is trivial: a query is forwarded to the
node with the shortest distance to the destination key. By exploiting the bi-
directional nature of links to other nodes, routing both clockwise and counter-
clockwise leads, on average, to a 25% to 30% reduction of routing hops.
Symphony additionally employs a 1-lookahead approach. The lookahead table
of each node records those nodes which are reachable through the successor,
predecessor, and long distance links, i.e., the neighbors of a node’s neighbors.
Instead of routing greedily, a node forwards messages to its direct neighbor
(not a neighbor’s neighbor) which promises the best progression towards the
destination. This reduces the average number of routing hops by 40% at the
expense of management overhead when nodes join or leave the system.
In comparison with the systems discussed previously, the main contribu-
tion of Symphony is its constant degree topology resulting in very low costs
of per-node state and of node arrivals and departures. It also utilizes bi-
directional links between nodes and bi-directional routing. Symphony's routing
performance of O((1/k) log^2(N)) is competitive compared with Chord and the
other systems (O(log(N))) but does not scale as well with exceedingly large
numbers of nodes. However, nodes can vary the number of links they main-
tain to the rest of the system during run-time based on their capabilities,
which is not permitted by the original designs of Chord, Pastry, and CAN.
8.5 Viceroy
Fig. 8.7: A Viceroy topology with 18 nodes arranged in levels 1 to 3. Lines indicate
short- and long-range downlinks; other links and lower levels are omitted
for simplicity.
8.6 Kademlia
Fig. 8.8: An example of a Kademlia topology: a binary tree over the identifier
space. The black node 0010 knows about the subtrees that do not match
its identifier as indicated by the dotted squares. Each node successively
forwards a query to α nodes in a destination subtree.
Kademlia favors nodes that have been known for a long time in its routing
tables, since such nodes have a higher probability of remaining available
than fresh nodes. This increases the stability of the routing topology and
also prevents good links from being flushed from the routing tables by
distributed denial-of-service attacks, as can be the case in other DHT systems.
With its XOR metric, Kademlia’s routing has been formally proved con-
sistent and achieves a lookup latency of O(log(N )). The required amount
of node state grows with the size of a Kademlia network. However, it is
configurable and together with the adjustable parallelism factor allows for a
trade-off of node state, bandwidth consumption, and lookup latency.
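The XOR metric at the heart of these results can be sketched as follows; the contact list and the choice of α are illustrative:

```python
# Minimal sketch of Kademlia's XOR distance and of selecting the α closest
# known contacts to a target ID; the contact list is illustrative.

def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric: the bitwise XOR of two identifiers."""
    return a ^ b

def closest_contacts(target: int, contacts: list, alpha: int = 3) -> list:
    """The α known contacts closest to the target under the XOR metric."""
    return sorted(contacts, key=lambda c: xor_distance(c, target))[:alpha]

contacts = [0b0010, 0b0111, 0b1011, 0b1100]
print(closest_contacts(0b0011, contacts, alpha=2))  # [2, 7]
```

Queries are forwarded to these α closest contacts in parallel, which is where the adjustable parallelism factor mentioned above comes in.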
8.7 Summary
The core feature of every DHT system is its self-organizing distributed opera-
tion. All presented systems aim to remain fully functional and usable at scales
of thousands or even millions of participating nodes. This obviously implies
that node failures must be both tolerated and of low impact to the operation
and performance of the overall system. Hence, performance considerations
are an integral part of the design of each system.
Since the lookup of a key is probably the most frequently executed op-
eration and essential to all DHT systems, a strong focus is put on its per-
formance. The number of routing hops is an important factor for end-to-end
latency, but the latency of each hop also plays an important role. Gener-
ally, additional routing information on each node also provides a chance for
choosing better routes. However, the management of this information and of
links to other nodes in a system also incurs overhead in processing time and
bandwidth consumption.
Table 8.1: Performance comparison of DHT systems. The columns show the aver-
ages for the number of routing hops during a key lookup, the amount
of per-node state, and the number of messages when nodes join or leave
the system.
Table 8.1 summarizes the routing latency, per-node state, and the costs
of node arrivals and departures in the systems discussed above. It illustrates
how design choices, like a constant-degree topology, affect the properties of
a system. It should be noted that these results are valid only for the original
proposals of each system and that the O() notation leaves ample room for
variation. In many cases, design optimizations from one system can also be
transferred to another system. Furthermore, the effect of implementation op-
timizations should not be underestimated. The particular behavior of a DHT
network in a certain application scenario needs to be determined individually
through simulation or real-world experiments.
9. Reliability and Load Balancing in
Distributed Hash Tables
Simon Rieche, Heiko Niedermayer, Stefan Götz, Klaus Wehrle
(University of Tübingen)
After introducing some selected Distributed Hash Table (DHT) systems, this
chapter introduces algorithms for DHT-based systems which balance the stor-
age data load (Section 9.1) or care for the reliability of the data (Section 9.2).
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 119-135, 2005.
Springer-Verlag Berlin Heidelberg 2005
Fig. 9.1: Distribution of data among a Chord DHT without load-balancing mech-
anisms.
The number of documents to be stored ranged from 100,000 to 1,000,000, and
for this purpose the Chord ring's address space had a size of m = 22 bits.
Consequently, 2^22 = 4,194,304 documents and/or nodes could be stored and
managed in the ring. The keys for the data and nodes were generated randomly. The
load of a node was defined by the number of documents it stored.
The graphs in Fig. 9.1 clearly show that the assumption of an equal dis-
tribution of data among peers by simply using a hash function does not hold.
For example, Fig. 9.1(a) shows how many nodes (y-axis) store a certain num-
ber of documents (x-axis). It is obvious that there is an unequal distribution
of documents among the nodes. For an easier comparison, the grey line indi-
cates the optimal number in the case of equal distribution – approximately
122 documents per node in this example. Additionally, Fig. 9.1(b) plots the
number of nodes without a document.
9.1 Storage Load Balancing of Data in Distributed Hash Tables
9.1.1 Definitions
As an example, more than 300,000 collected file names from music and
video servers were hashed, and the ID space was divided into intervals ac-
cording to the first bits of an ID, e.g., 8 bits for 256 intervals. The load of each
of these intervals is distributed closely around the average with the empirical
standard deviation (σExperiment = 34.5) being close to the theoretical one
(σBinomial = 34.2).
However, the assumptions of this model are not realistic for DHTs because
interval sizes for nodes are not equal. The next section will deduce the interval
size distribution.
The rationale for using this continuous model is that it is easier than a
discrete one and the ID space is large compared with the number of nodes in
it.
The continuous uniform distribution is defined as:

         { 0    x < 0
U(x) =   { x    0 ≤ x < 1
         { 1    x ≥ 1
Let L be the distribution of the interval size. It is given as the minimum
of N − 1 experiments.¹

¹ For our statistical analysis it does not matter if the node responsible for the data
is at the beginning or the end of the interval.
Fig. 9.2: Probability Density Function for Continuous Model with 4,096 nodes.
            N−1
L(x) = 1 −  ∏  (1 − U(x)) = 1 − (1 − U(x))^(N−1)
            i=1

         { 0                  x < 0
       = { 1 − (1 − x)^(N−1)  0 ≤ x < 1
         { 1                  x ≥ 1
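This closed form can be checked empirically by drawing the minimum of N − 1 uniform samples; the sample sizes below are illustrative:

```python
# Hedged sketch: empirically checking L(x) = 1 - (1 - x)**(N - 1), the CDF
# of the minimum of N - 1 uniform draws (i.e., of the interval size).
import random

def interval_size_cdf(x: float, n: int) -> float:
    return 1 - (1 - x) ** (n - 1)

random.seed(7)
n, trials = 64, 5000
x = 1.0 / n                      # probe at the mean interval size
hits = sum(min(random.random() for _ in range(n - 1)) <= x
           for _ in range(trials))
print(abs(hits / trials - interval_size_cdf(x, n)) < 0.05)  # True
```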
Fig. 9.3: Load Distribution (mean load = 128; 4,096 nodes) approximated with a
scaled probability function from the Discrete Model.
As an approximation for the load, we use the number of data items as the ID space.
Consequently, we get an approximation for the probability of a certain load
(e.g., probability that a node has load = 10 items). If we multiply these
probabilities with the number of nodes, we get the frequency distribution
shown in Fig. 9.3.
The third constraint results in the transfer of the largest virtual server
that will not make the receiving node heavy. This increases the chance of
finding another light node in the next round which can receive a virtual
server of this heavy node.
One-to-One Scheme. This scheme is the simplest one. Two nodes are picked
at random, and a virtual server is transferred from a heavy node to a light
one. Each light node periodically selects a random node and initiates a
transfer if that node is heavy and the above three rules hold.
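A single round of this scheme might be sketched as follows; the threshold and the representation of virtual servers as per-node load lists are illustrative assumptions:

```python
# Hedged sketch of the one-to-one scheme: a light node probes one random
# node and, if that node is heavy, takes over the largest virtual server
# that does not make the light node heavy itself.
import random

THRESHOLD = 100  # illustrative: a node is "heavy" above this total load

def one_to_one_round(nodes):
    """nodes maps a node name to the loads of its virtual servers."""
    light = [n for n, vs in nodes.items() if sum(vs) <= THRESHOLD]
    if not light:
        return
    receiver = random.choice(light)
    probed = random.choice(list(nodes))
    if probed == receiver or sum(nodes[probed]) <= THRESHOLD:
        return  # only heavy nodes shed virtual servers
    movable = [v for v in nodes[probed]
               if sum(nodes[receiver]) + v <= THRESHOLD]
    if movable:
        v = max(movable)        # largest server that keeps the receiver light
        nodes[probed].remove(v)
        nodes[receiver].append(v)

random.seed(3)
nodes = {"a": [90, 60, 40], "b": [20], "c": [30, 10]}
for _ in range(50):
    one_to_one_round(nodes)
print(sum(sum(vs) for vs in nodes.values()))  # total load is conserved: 250
```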
One-to-Many Scheme. This scheme allows a heavy node to consider more
than one light node at a time. Each heavy node transfers a virtual server
to one node of a known set of light nodes. For each light node of this set,
the best virtual server is computed as described above and only the lightest
virtual server of these will be transferred.
Many-to-Many Scheme. This scheme matches many heavy nodes to many
light nodes. In order to get many heavy nodes and many light nodes to
interact, a global pool of virtual servers is created – an intermediate step in
moving a virtual server from a heavy node to a light node. The pool is only
a local data structure used to compute the final allocation.
In three phases (unload, insert, and dislodge) the virtual servers to be
transferred are computed. In the first one (unload) each heavy node puts
the information about its virtual servers into a global pool until this node
becomes light.
The virtual servers in the pool must then be transferred to nodes in the
next step (insert). This phase is executed in rounds, in which the heaviest
virtual server from the pool is selected and transferred to a light node, de-
termined using the rules above. This phase continues until the pool becomes
empty, or until no more virtual servers can be transferred.
In the final phase (dislodge), the heaviest virtual server remaining in the pool is exchanged with a lighter virtual server of a light node, provided the exchange does not make that node heavy. If such a node is found, the insert phase begins again; otherwise the algorithm terminates and the remaining virtual servers in the pool stay at their current nodes.
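Sketched in a few lines, the many-to-many scheme might look as follows; the node representation, the fixed load threshold, and the shedding order are assumptions of this sketch, and the dislodge phase is only hinted at:

```python
# Sketch of the many-to-many scheme (unload / insert phases).
# A node is "heavy" when its total load exceeds a fixed target;
# the threshold and the largest-first shedding order are assumptions.

TARGET = 10  # assumed per-node load threshold

def total(node):
    return sum(node["servers"])

def many_to_many(nodes):
    pool = []
    # Phase 1 (unload): heavy nodes move virtual servers into the pool
    # until they become light.
    for node in nodes:
        while total(node) > TARGET and node["servers"]:
            vs = max(node["servers"])          # shed the largest server first
            node["servers"].remove(vs)
            pool.append(vs)
    # Phase 2 (insert): repeatedly place the heaviest pooled server
    # on the lightest node that can accept it without becoming heavy.
    pool.sort(reverse=True)
    leftovers = []
    for vs in pool:
        candidates = [n for n in nodes if total(n) + vs <= TARGET]
        if candidates:
            min(candidates, key=total)["servers"].append(vs)
        else:
            leftovers.append(vs)               # the dislodge phase would handle these
    return leftovers

nodes = [{"servers": [8, 7]}, {"servers": [2]}, {"servers": [1, 1]}]
left = many_to_many(nodes)
print([sorted(n["servers"]) for n in nodes], left)
```

After the run, every node's load is at or below the threshold and the pool is empty, mirroring the two first phases of the scheme.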
h1 (x), h2 (x), · · · , hd (x) to insert the data x. For each of these computed re-
sults, the node responsible for this ID in the DHT is located. The data is now
placed on the peer with the lowest load.
There are two ways to implement the search. A simple implementation recalculates all d hash functions and performs a lookup for each resulting value; one of the contacted peers must have stored the data. These lookups can be made in parallel and thus take little more time than their classic counterparts, although this approach generates a factor of d more network traffic per search.
The second way of searching is to use redirection pointers. Insertion pro-
ceeds exactly as before, but in addition to storing the item at the least
loaded peer, all other peers store a redirection pointer to this node. To re-
trieve document x, it is not necessary to calculate all possible hash func-
tions h1 , h2 , . . . , hd , because each possible node h1 (x), h2 (x), . . . , hd (x) stores
a pointer to document x. Thus, each of these nodes can forward the request
directly to the node which is actually storing the requested document. Hence,
a request for a certain key has to be made only to one of the d possible nodes.
If this node does not store the data, the request is forwarded directly to the
right node via the pointer. Nevertheless, the owner of a key has to insert
the document periodically to prevent its removal after a timeout (soft state).
Lookups now take at most only one more step.
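The insertion and pointer-based lookup described above can be sketched as follows; the peer table, the hash construction, and d = 3 are illustrative assumptions, and a real DHT would of course route each lookup over overlay hops:

```python
# Sketch of "power of d choices" insertion and lookup with redirection
# pointers. Peer IDs and hash derivation are made up for illustration.
import hashlib

D = 3
PEERS = {pid: {} for pid in range(16)}        # peer id -> local store

def h(i, key):
    # i-th hash function, mapped onto the peer id space (an assumption)
    digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
    return int(digest, 16) % len(PEERS)

def insert(key, value):
    candidates = [h(i, key) for i in range(D)]
    owner = min(candidates, key=lambda pid: len(PEERS[pid]))
    PEERS[owner][key] = value                 # store at the least loaded peer
    for pid in candidates:                    # all d peers keep a pointer
        if pid != owner:
            PEERS[pid][key] = ("ptr", owner)

def lookup(key):
    pid = h(0, key)                           # ask any one of the d peers
    entry = PEERS[pid].get(key)
    if isinstance(entry, tuple) and entry[0] == "ptr":
        entry = PEERS[entry[1]][key]          # at most one extra step
    return entry

insert("doc-x", "payload")
print(lookup("doc-x"))
```

The lookup contacts a single candidate peer and follows at most one redirection pointer, matching the "at most one more step" property.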
Fig. 9.4: Splitting of an Interval: Nodes 1 to 6 are assigned to the same interval
and are overloaded in terms of data load. Since only three nodes are
necessary to maintain an interval, this interval can be split.
2f Nodes with Excessive Load. If 2f different nodes are assigned to the same interval, and each node stores significantly more documents than average, then this interval is divided. The point of separation is the center of the interval: it can be computed either as the midpoint of the interval borders or as the median of the hash values of the stored documents. This implies that
no load in terms of data has to be moved anywhere, and the respective nodes
lose approximately half of their data load at once. Finally, the predecessors
and successors will be adapted accordingly. Figure 9.4 shows an example of
such an interval division.
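The split-point computation can be illustrated with a small sketch; representing interval borders and document keys as small integers is an assumption of this sketch (the median-of-keys variant would work analogously):

```python
# Sketch: splitting an overloaded interval at its midpoint so that each
# half keeps roughly half of the documents. Borders and keys are treated
# as integers in the hash address space (an assumption for this sketch).

def split_interval(lo, hi, doc_keys):
    mid = (lo + hi) // 2                      # center of the interval borders
    left = [k for k in doc_keys if k < mid]
    right = [k for k in doc_keys if k >= mid]
    return (lo, mid, left), (mid, hi, right)

docs = [3, 5, 9, 12, 14]
a, b = split_interval(0, 16, docs)
print(a, b)
```

No document changes its hash position; each half of the former node group simply keeps the keys that already fall into its new sub-interval.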
More than f Nodes in an Interval. Intervals with more than f but less than
2f nodes can release some nodes to other intervals. If nodes within a partic-
ular interval are overloaded, they wait for additional nodes to join them. If
128 9. Reliability and Load Balancing in DHTs
Fig. 9.5: Moving nodes: Nodes 1 to 4 are assigned to an interval. Since only three
nodes are necessary to maintain an interval, node 4 can be transferred
into another overloaded interval.
some nodes are very light, they periodically send this information to nodes placed in different intervals. These destination intervals can be found using the routing entries of the finger tables. If an interval with a heavy load exists, light nodes can be moved to it. Based on the new situation, the accumulated nodes within the new interval can then try to split it according to the rules described above.
Figure 9.5 shows an example of such a shifting of nodes to regions of higher
load. Nodes 1 to 4 are very light in terms of data load and are responsible for
the same interval. Since only three nodes are required, node 4 can be moved
to an overloaded interval that should be divided.
No more than f Nodes within an Interval. As an additional alternative, interval borders may be shifted. Nodes compare their load with the load of their immediate predecessors and successors. If a node's own interval carries more load than its neighbor's, part of the load can be released by shifting the interval border. Figure 9.6 shows an example of such a shifting of interval borders.
Fig. 9.6: Intervals adjusted between neighbors: The nodes within the right interval are collectively light, while the nodes within the preceding interval are overloaded. The interval border between them can be shifted.
parison to virtual servers in section 9.1.3) and are computed with different
hash functions applied to their own ID. Each node chooses only one virtual
node to become active. The address (2b + 1) · 2^(−a) of a node is denoted by ⟨a, b⟩, where a and b are non-negative integers and b < 2^(a−1). This is an unambiguous notation for all addresses with a finite binary representation. These addresses are ordered according to the length of their binary representation, so ⟨a, b⟩ < ⟨a′, b′⟩ if a < a′ or (a = a′ and b < b′).
Each node now chooses its ideal state. Given any set of active virtual nodes, each (possibly inactive) virtual node spans a certain range of addresses between itself and the succeeding active virtual node. Each real node activates the virtual node that spans the minimal possible (under the ordering just defined) address space. Thus, each node occasionally determines which of its virtual nodes spans the smallest address space and activates that particular virtual node.
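The ⟨a, b⟩ notation and its ordering can be sketched as follows; the candidate virtual nodes are made up for illustration:

```python
# Sketch of the <a, b> address notation: <a, b> stands for the address
# (2b + 1) * 2**(-a), and addresses are ordered by the length of their
# binary representation, i.e. compare a first, then b.
from fractions import Fraction

def address(a, b):
    # <a, b> denotes the address (2b + 1) * 2**(-a)
    return Fraction(2 * b + 1, 2 ** a)

# Candidate <a, b> pairs of one real node's virtual nodes (illustrative):
virtual_nodes = [(3, 1), (1, 0), (2, 1)]      # addresses 3/8, 1/2, 3/4
# Comparing a first and then b is exactly Python's tuple ordering.
active = min(virtual_nodes)
print(active, address(*active))
```

Using exact fractions keeps the dyadic addresses unambiguous; the minimal pair under the ordering, here ⟨1, 0⟩ with address 1/2, would be the one a node activates.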
Item Balancing. This also shifts interval borders. Nodes can compare their
load with the loads of other nodes. If its own interval shows more load than
its neighbor’s, part of the load can be released and thus interval borders
between two intervals will be shifted.
gorithm [513]. The focus was on the distribution of documents among the
nodes.
Simulation Scenarios
In each scenario, a Chord DHT with 4,096 nodes was simulated and multiple
simulations were run per scenario to confirm the results. The simulation in
[513] shows that the results are comparable with the simulations presented
in [99]. The total number of documents to be stored ranged from 100,000 to
1,000,000. The keys for the data and nodes were generated randomly. For
this purpose, the Chord ring’s address space had a size of m = 22 bits.
Consequently, 2^22 = 4,194,304 documents and/or nodes could be stored and
managed in the ring. In the simulation, the load of a node is defined as the
number of documents it stores.
Simulation Results
Fig. 9.7(a) shows the distribution of documents in Chord without load-balancing. Between 10^5 and 10^6 documents were distributed across 4,096
nodes. The upper value indicates the maximum number of documents per
node, the lower value the minimum number. The optimal number of docu-
ments per node is indicated by the marker in the middle.
Even for a large number of documents in the DHT, there are some nodes
not managing any documents and, consequently, without any load. Some
nodes have a load of up to ten times above the optimum. Fig. 9.7(b) shows
that Power of Two Choices works much more efficiently than the original
Chord without load-balancing. However, there are still obvious differences in
the loads of the nodes. Some are still without any document.
Applying the concept of Virtual Servers with the One-to-One scheme
(cf. Fig. 9.7(c)) results in a more efficient load-balancing. Nevertheless, this
is coupled with a much higher workload for each node because it has to
manage many virtual servers. Additionally, the data of all virtual servers of
one physical node has to be stored in the memory of the managing node.
Fig. 9.7(d) shows that the best results for load-balancing are achieved by using the heat dispersion algorithm. Each node manages a certain amount of data, and load fluctuations are relatively small. Documents are only moved from neighbor to neighbor. Using virtual servers, however, requires copying the data of a whole virtual server, so load is always moved in these coarser units, and additional node management is necessary.
9.2 Reliability of Data in Distributed Hash Tables 131
[Four panels plot documents per node against the total number of documents (0 to 1,000k): (a) Chord without load-balancing, (b) Chord with Power of Two Choices, (c) Chord with virtual servers, (d) Chord with the heat dispersion algorithm.]
Fig. 9.7: Simulation results comparing different approaches for load balancing in
Chord.
Through much research on the design and stabilization of DHT lookup services, these systems aim to provide a stable global addressing structure on top of a dynamic network of unreliable, constantly failing and arriving nodes. This allows fully decentralized services and distributed applications to be built on DHTs. This section presents algorithms for ensuring that data stored at failing nodes remains available after the stabilization routines of the Peer-to-Peer network have been applied.
There are two ways to store data in the DHT in a fault-tolerant manner. One is to replicate the data to other nodes; the other is to split the data and make it more available through redundancy.
9.2.1 Redundancy
9.2.2 Replication
Successor-List
The authors of Chord present in [575] a way to make the data in their DHT more reliable. The idea is to use a so-called successor-list, which is also used to stabilize the network after nodes leave. The successor-list of any node consists of the f nearest successors clockwise on the Chord ring.
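A minimal sketch of successor-list replication, assuming a small example ring, f = 2, and a toy storage interface (all of these are illustrative, not Chord's actual data structures):

```python
# Sketch: replicating a key's data to the f nearest successors on a
# Chord ring, so the data survives the failure of its owner.

def successor_list(ring, start_idx, f):
    # ring is a sorted list of node IDs; wrap around clockwise
    n = len(ring)
    return [ring[(start_idx + i) % n] for i in range(1, f + 1)]

ring = [2, 7, 11, 25, 40, 51]
store = {node: {} for node in ring}

def put(owner_idx, key, value, f=2):
    owner = ring[owner_idx]
    store[owner][key] = value
    for succ in successor_list(ring, owner_idx, f):
        store[succ][key] = value              # replicas survive owner failure

put(1, "k", "v")
print([node for node in ring if "k" in store[node]])
```

If the owner (node 7) fails, its successor (node 11) already holds a replica and can take over the key range, which is exactly what the successor-list enables.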
Fig. 9.8: Successor-list of a node with f nearest nodes clockwise on the Chord ring.
each joining node to all other nodes responsible for the same interval. Following this, the documents located within this interval are copied to this new node. Then, the original methods to insert a node in Chord are performed.
To keep the complexity of the routing tables low, each node stores only one
reference to the list of nodes for each finger within its own routing table to
other intervals.
Figure 9.10 shows an example of the distribution of intervals on different
nodes, where each interval has the minimum of two different nodes assigned
to it.
[Fig. 9.10: Nodes distributed over the address space; each interval is maintained by at least two different nodes.]
According to [575], each node may maintain several virtual servers, but most nodes store only one. This allows nodes with higher performance to store several virtual servers, so Chord can take advantage of the high computational power of certain nodes. Such a node declares itself responsible for two or more intervals and thus manages several virtual servers, each responsible for a separate, disjoint interval. The intervals stored at the virtual servers of the same physical node must not be identical, in order to guarantee fault tolerance.
If new data has to be inserted into an interval, it is distributed by one node to all other nodes responsible for the same interval. In contrast to the original Chord DHT, no copies of the data are replicated clockwise to the next n nodes along the ring. Figure 9.11 shows the distribution of replicas of the inserted data to the neighbors responsible for the same interval.
If any node leaves the system and any other node takes notice of this, the
standard stabilization routine of Chord is performed. The predecessors and
successors are informed, and afterwards, inconsistent finger table entries are
identified and updated using the periodic maintenance routine.
Thus DHT systems become more reliable and far more efficient due to
the structured management of nodes. Generally, random losses of nodes are
not critical because at least f nodes manage one interval cooperatively. The
modified DHT system can cope with a loss of (f − 1) nodes assigned to the same interval. If fewer than f nodes remain within one interval, the algorithm immediately merges adjacent intervals.
Fig. 9.11: Copy of data to the neighbors responsible for the same interval.
9.3 Summary
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 137-153, 2005.
Springer-Verlag Berlin Heidelberg 2005
138 10. P-Grid: Dynamics of Self-Organizing Processes
P(j, t + 1) = P(j, t) + α(P(j − 1, t) − P(j, t)) + (1 − α)((j − 1)P(j − 1, t) − jP(j, t)).
Now assume that the degree distribution is in steady state, i.e. P(j, t) = c_j for all t > 0. We can derive

c_j / c_{j−1} = 1 − (2 − α)/(1 + α + j(1 − α)) ≈ 1 − ((2 − α)/(1 − α)) · (1/j),

where the approximation is valid for large j. This relationship is satisfied approximately for

c_j ≈ j^(−(2−α)/(1−α)).

To see this, note that for this c_j we have

c_j / c_{j−1} ≈ (1 − 1/j)^((2−α)/(1−α)) ≈ 1 − ((2 − α)/(1 − α)) · (1/j).
This is a first example of how a self-organization process results in a global
structural feature, namely the power-law degree distribution. The probability
that a node has a given in-degree remains invariant while the network grows,
thus the system is in a dynamic equilibrium during network construction.
The structure of the resulting overlay network is the basis for performing
searches efficiently. In Gnutella, searches are performed by message flooding.
A low network diameter, as in the power-law graph, guarantees low search
latency. Message flooding however induces a high consumption of network
bandwidth. Therefore other strategies for performing searches in Gnutella
networks have been investigated. The independence of the network main-
tenance and search protocols makes it possible to use alternative search
strategies which may exploit the emergent overlay network structure more
efficiently. Examples of such alternative strategies are the random walker
model [397] and the percolation search model [537], which both exploit the
specific structure of the network.
To summarize, we can observe two important points for unstructured
overlay networks such as Gnutella. First, the structure of the network and
We assume that the data keys are taken from the interval [0, 1[. The struc-
ture of a P-Grid overlay network is based on two simple principal ideas: (1)
the key space is recursively bisected such that the resulting partitions carry
approximately the same workload. Peers are associated with those partitions.
Using a bisection approach greatly simplifies decentralized load-balancing by
local decision-making. (2) Bisecting the key space induces a canonical trie
structure which is used as the basis for implementing a standard, distributed
prefix routing scheme for efficient search.
This is illustrated in Fig. 10.1. At the bottom we see a possible skewed
key distribution in the interval [0, 1[. We bisect the interval such that each
resulting partition carries (approximately) the same load. Each partition can
be uniquely identified by a bit sequence. We associate one or more peers (in
the example exactly two) with each of the partitions. We call the bit sequence
of a peer’s partition the peer’s path. The bit sequences induce a trie structure,
which is used to implement prefix routing. Each peer maintains references in
its routing table that pertain to its path. More specifically, for each position
of its path, it maintains one or more references to a peer that has a path
with the opposite bit at this position. Thus the trie structure is represented
in a distributed fashion by the routing tables of the peers, such that there
is no hierarchy in the actual overlay network. This construction is analogous
to other prefix routing schemes that have been devised [491, 527]. Search in
such overlay networks is performed by resolving a requested key bit by bit.
When bits cannot be resolved locally, peers forward the request to a peer
known from their routing tables.
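Prefix routing over such a trie can be sketched as follows; in this sketch a peer's identifier is simply its path, and the routing tables are made-up examples (the real P-Grid keeps multiple references per position):

```python
# Sketch of P-Grid prefix routing: each peer has a path and, for every
# position of that path, a reference to some peer whose path has the
# opposite bit at that position.

PEERS = {
    "00": {"routing": {0: "10", 1: "01"}},    # position -> referenced peer
    "01": {"routing": {0: "10", 1: "00"}},
    "10": {"routing": {0: "01", 1: "11"}},
    "11": {"routing": {0: "00", 1: "10"}},
}

def search(peer, key):
    path = peer                               # peer id doubles as its path here
    for pos, bit in enumerate(key):
        if pos >= len(path) or path[pos] == bit:
            continue                          # this bit is resolved locally
        # bit mismatch: forward to a peer with the opposite bit at pos
        return search(PEERS[path]["routing"][pos], key)
    return path                               # this peer is responsible

print(search("00", "11"))
```

Each forwarding step fixes at least one more prefix bit, so the query converges to the peer responsible for the requested key.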
P-Grid uses replication in two ways in order to increase the resilience of the
overlay network when nodes or network links fail. Multiple references are kept
in the routing tables, thus providing alternative access paths, and multiple
peers are associated with the same key space partitions (structural replica-
tion) in order to provide data redundancy. The self-organization mechanisms
we will discuss for P-Grid will relate to these two replication mechanisms.
Contrary to standard prefix routing approaches, P-Grid does not assume a maximal key length that limits the tree depth and thus the search cost. This
assumption would compromise the load-balancing properties achieved by bi-
section. Thus search efficiency is not guaranteed structurally, since in the
worst case search cost is related to the maximal path length of the trie,
which for skewed key distributions can be up to linear in the network size.
[Fig. 10.1: Example P-Grid. Recursive bisection of the key space under a skewed load distribution induces a trie; one or more peers (identified by ID) are associated with each partition and carry its bit sequence as their path (e.g., 00*, 010*, 011*); peers with the same path form a replica sub-network; each routing table entry references a peer with the opposite bit at the corresponding path position.]
ally using a hash table in main memory is the constant time of lookup, insert, and delete operations. To achieve this, however, a hash table sacrifices the order relationship of the keys. Over a network, where only parts of the hash table are stored at each location, multiple overlay hops are needed anyway; for most conventional DHTs the number of hops is logarithmic in the network size. Thus the main advantage of constant-time access no longer exists. The fundamental issue to address is therefore whether we can realize a search tree which is similarly efficient as a DHT in terms of fault-tolerance, load-balancing, etc., but additionally preserves the key ordering and hence supports not only efficient exact queries but also higher-level search predicates such as substring search, range queries [157], etc. This is a major goal in the design of the P-Grid overlay network.
2. Referential integrity: During the process each of the peers has to en-
counter at least one peer that decided for the other partition. Thus the
peers have the necessary information to construct the routing table.
The second condition makes the problem non-trivial, since otherwise peers
could simply select partition 0 with probability p and 1 otherwise. P-Grid uses
the following distributed algorithm to solve the problem.
1. Each undecided peer initiates interactions with a uniformly randomly
selected peer until it has reached a decision.
2. If the contacted peer is undecided the peers perform a balanced split with
probability 0 ≤ α(p) ≤ 1 and maintain references to each other.
3. If the contacted peer has already decided for 1 then the contacting peer
decides for 0 with probability 0 ≤ β(p) ≤ 1 and with probability 1 − β(p)
for 1. In the first case it maintains a reference to the contacted peer. In
the second case it obtains a reference to a peer from the other partition
from the contacted peer.
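The distributed algorithm above can be simulated in a few lines; this sketch assumes α(p) = 1 (balanced splits always succeed) and a fixed β, infers the case of a contacted peer that decided for 0 (it makes the contacting peer decide for 1, consistent with the model equations), and omits the reference bookkeeping:

```python
# Sketch: simulating the randomized bisection decisions with alpha = 1.
# Routing-reference maintenance is omitted; only decisions are tracked.
import random

def partition(n_peers, beta, seed=0):
    rng = random.Random(seed)
    state = [None] * n_peers                  # None = undecided, else "0"/"1"
    while any(s is None for s in state):
        i = rng.choice([k for k, s in enumerate(state) if s is None])
        j = rng.randrange(n_peers - 1)
        j = j if j < i else j + 1             # uniformly random other peer
        if state[j] is None:                  # balanced split (alpha = 1)
            state[i], state[j] = "0", "1"
        elif state[j] == "1":                 # decide 0 with probability beta
            state[i] = "0" if rng.random() < beta else "1"
        else:                                 # contacted peer decided for 0
            state[i] = "1"                    # assumed complementary rule
    return state.count("0") / n_peers         # resulting fraction p

print(round(partition(2000, beta=0.8), 2))
```

For large networks the resulting fraction of peers deciding for 0 should approach the p that corresponds to the chosen β in the steady-state analysis below.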
We can model this algorithm as a Markovian process. We assume that in each step one peer that has not yet reached a decision contacts another randomly selected peer. We denote by P(0, t) and P(1, t) the expected number of peers that have decided for 0 and 1, respectively, by step t. Initially P(0, 0) = P(1, 0) = 0. At the end of the process, at some step t_e, we have P(0, t_e) + P(1, t_e) = n + 1. We analyze the case α(p) = 1. Then the model can be given as

P(0, t) = P(0, t − 1) + (1/n)(n − P(0, t − 1) − (1 − β)P(1, t − 1))

P(1, t) = P(1, t − 1) + (1/n)(n − βP(1, t − 1))
In order to determine the proper value of β for a given value of p, we have to solve the recursive system. By standard solution methods we obtain

P(0, t) = (n/β)(2β − 1 + (1 − β/n)^t − 2β((n − 1)/n)^t)

P(1, t) = (n/β)(1 − (1 − β/n)^t)
We observe that the recursion terminates as soon as no more undecided peers exist, i.e., as soon as P(0, t_e) + P(1, t_e) = n + 1. By evaluating this termination condition we obtain

t_e(n) = log(2)/log(n/(n − 1)) + 1   (10.1)
Note that t_e does not depend on p; thus the partitioning process requires the same number of interactions among peers independent of the load distribution. By definition p = P(0, t_e)/(n + 1), which yields a relationship between the network size n + 1 and the load distribution p via β(p, n). For large networks, by letting n → ∞, we obtain the following relationship between p and β(p):
p = 1 − (1/β)(1 − 2^(−β))   (10.2)
Positive solutions for β(p) cannot be obtained for all values of p. From Equation 10.2 we derive that positive solutions exist for p ≥ 1 − log(2). Informally speaking, since balanced splits are executed unconditionally in this case, the algorithm cannot adapt to arbitrarily skewed distributions. Therefore, for 0 ≤ p < 1 − log(2) we have to pursue a different strategy and reduce the probability of balanced splits, i.e. α(p) < 1. The analysis of this case is analogous, and we therefore omit it here.
Various non-trivial issues still need to be addressed to extend this basic
process to a complete method for constructing a P-Grid overlay network with
load-balancing characteristics. The value of p is normally not known, thus it
needs to be estimated from the key samples the peers have available locally.
This introduces errors into the process which require non-trivial corrections.
The process needs to be performed recursively, thus errors in proportionally
bisecting the key space accumulate. The process needs to be approximately synchronized to keep the assumptions made for the basic process valid. The bisection process should terminate as soon as the number of peers in the
same partition falls below a threshold. Since peers cannot know during the
bootstrapping all potential replica peers in the same partition, other criteria,
based on the locally available keys, need to be evaluated. Solutions for these
problems have been developed and it has been shown that in fact it is pos-
sible to efficiently construct a P-Grid overlay network satisfying the desired
load-balancing properties based on the elementary process introduced in this
section [6, 7].
[Fig. 10.2: Example P-Grid state. Peers are either presently online or offline, and each peer's routing table caches identity-to-address mappings that may be up-to-date or stale.]
no dependency between the peer identity (idP7 = 0111) and the path it is
associated with (π(P7 ) = 0000). In its routing table P7 stores references for
paths starting with 1, 01 and 001, so that queries with these prefixes can be
forwarded closer to the peers holding the searched information. The cached
physical addresses of these references may be up-to-date (for example, P13 ’s)
or be stale (denoted by underlining, for example, P5 ).
A peer Pq decides that it has failed to contact a peer Ps , if one of the
following happens: (1) No peer is available at the cached address (trivial case)
or (2) the contacted peer has a different identifier. In either of these cases
an up-to-date identity-to-address mapping can be obtained by querying the
P-Grid. If peer Ps goes offline, and comes online later with a different IP
address, it can insert a new identity-to-address mapping into P-Grid.
If a peer fails to contact peers in its routing table, it initiates a new query
to discover the latest identity-to-address mapping of any of those peers. If
this is successful it forwards the query.
Assuming the initial setup while the P-Grid is in the state shown in Fig-
ure 10.2, the query processing will work as follows. Assume that P7 receives
a query Q(01∗). P7 fails to forward the query to either of P5 or P14 since
their cache entries are stale. Thus P7 initiates a recursive query for (P5 ), i.e.,
Q(0101), which needs to be forwarded to either P5 or P14 . This fails again.
P7 then initiates a recursive query for (P14 ), i.e., Q(1110), which needs to be
forwarded to P12 and (or) P13 . P12 is offline, so irrespective of the cache being
stale or up-to-date, the query cannot be forwarded to P12 . P13 is online, and
the cached physical address of P13 at P7 is up-to-date, so the query is for-
warded to P13 . P13 needs to forward Q(P14 ) to either P2 or P12 . Forwarding
10.3 Self-Organization in Structured Peer-to-Peer Systems 149
to P12 fails and so does the attempt to forward the query to P2 because P13 ’s
cache entry for P2 is stale. Thus P13 initiates a recursive query for (P2 ), i.e.,
Q(0010). P13 sends Q(P2 ) to P5 which forwards it to P7 and/or P9 . Let us
assume P9 replies. Thus P13 learns P2 ’s address and updates its cache. P13
also starts processing and forwards the parent query (P14 ) to P2 . P2 provides
P14 ’s up-to-date address, and P7 updates its cache.
Having learned P14's current physical address, P7 now forwards the original query Q(01∗) to P14. This not only satisfies the original query; P7 also has the opportunity to learn and update the physical addresses that P14 knows and P7 needs, for example, P5's latest physical address (we assume that peers
synchronize their routing tables during communication since this does not in-
cur any overhead). In the end, the query Q(01∗) is answered successfully and
additionally P7 gets to know the up-to-date physical addresses of P14 and
possibly of P5. Furthermore, due to the child queries, P13 updates its cached address for P2. Figure 10.3 shows the final state of the P-Grid with several caches updated after the completion of Q(01∗) at P7.
P_µ(t + 1) = P_µ(t) − p_c(1 − P_µ(t)) + (N_rec − 1)(1 − 1/(r log₂ n))   (10.3)
N_rec can be expressed in terms of P_µ(t). The negative contribution in the recursion corresponds to the fraction of correct routing table entries of a peer that turn stale between two queries issued by the peer, and the positive contribution is the fraction of incorrect routing table entries of a peer that are repaired due to recursively triggered and successfully processed queries.
The system is in a dynamic equilibrium if P_µ(t) = µ for some constant µ.
10.4 Summary
We have seen three examples of how self-organizing processes induce struc-
tural features of Peer-to-Peer overlay networks, one example for unstructured
overlay networks and two examples for structured overlay networks. Each of
[Figure: curves for N_rec = 5, 10, and 25, plotted against p_on.]
the examples differed slightly in the nature of the process studied, the type of equilibrium obtained, and the purpose for which the model of the process was developed.
The ideas presented in this chapter which we explored during the process
of designing and implementing the P-Grid system are generally applicable,
however. Often dynamic systems will have to be analyzed as Markovian sys-
tems, be it for a-posteriori analysis, or to study their evolution over time,
given a set of rules for local interactions, or to investigate the equilibrium
state in the presence of perturbations.
Also, in the context of Peer-to-Peer systems, reactive route maintenance
strategies have been studied by other projects, e.g., in DKS, as well as other
systems also try to address the problem of fast and parallel overlay construc-
tion mechanisms, for example, [31]. Other systems that focus more on storage
load-balancing for arbitrary load distributions and use small-world routing
include SkipGraphs [36] and Mercury [72] (among several other recent Peer-
to-Peer proposals). Increasingly, there is a confluence of ideas, arrived at independently by various research groups dealing with self-organization problems.
More importantly, analyses of network evolution and maintenance either explicitly or implicitly assume a Markovian model, an analytical approach which we have tried to present formally here by elaborating on three different self-organizing processes. In the case of modeling preferential attachment
in unstructured overlay networks the stochastic model has been developed
to explain a-posteriori a phenomenon that has been observed in many artifi-
cial and natural networks, including Peer-to-Peer overlay networks. Thus it is
used to explain empirical evidence. The model itself identifies a dynamic equi-
158 11. Application-Layer Multicast
end user nodes do not usually participate in packet replication and group
management, and act only as service consumers.
In the next section, we follow one of the possible taxonomies: classifying
ALM based on structured or unstructured overlays. For each category, we
further distinguish two sub-categories with reference to the peculiarities of
each category.
Finally, before delving into the specification details of ALM systems, it is
important to identify the performance overhead and service costs introduced
by moving multicast functionality to the application layer. Below is a list of
metrics that are often used to evaluate any given ALM system, but that also
represent a guideline for ALM design in general:
– Relative Delay Penalty (RDP) or Stretch: The ratio of the one-way overlay delay of a node pair to the unicast delay between the same nodes in the same direction. The goal is to match overlay routing as closely as possible to the underlying IP routing, thus reducing the resulting delay penalty.
– Throughput : Similarly to RDP, this performance metric measures the ef-
fective data throughput achieved for a single receiver (over time or on
average).
– Stress: Stress is defined as the number of times the same packet traverses
a specific physical link in either direction. Essentially, it quantifies the cost
of moving the replication effort to end systems in terms of data bandwidth.
– Control Overhead : Number of control messages exchanged throughout an
ALM session. This metric represents the cost in terms of control message
exchanges.
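The first and third metrics can be computed directly from measured delays and per-packet link traversals; the sample values and link names below are made-up illustrations:

```python
# Sketch: computing RDP (stretch) and per-link stress for an overlay.

def rdp(overlay_delay, unicast_delay):
    # ratio of one-way overlay delay to direct unicast delay, same direction
    return overlay_delay / unicast_delay

def stress(packet_traversals):
    # packet_traversals: physical links one packet crossed (either direction)
    counts = {}
    for link in packet_traversals:
        counts[link] = counts.get(link, 0) + 1
    return counts

print(rdp(45.0, 30.0))                        # overlay path 50% slower
print(stress([("A", "B"), ("B", "C"), ("A", "B")]))
```

A stress of 2 on link (A, B) means the same packet crossed that physical link twice, which is exactly the replication cost ALM pays for moving multicast into end systems.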
The term "centralized" in the context of ALM design does not refer to data replication handled by a centralized entity; instead, it refers to the design principle of a centralized entity handling group management and creation/optimization of the distribution tree. This is the approach taken in the Application Level Multicast Infrastructure (ALMI [482]).
The entire coordination task in ALMI is assigned to the "session controller", as shown in the architecture diagram (Figure 11.1). The session controller, residing either on a dedicated server of the service provider or at a group member node, exchanges point-to-point messages (dashed arrows) via unicast with every overlay node (drawn as a circular point). It is worth mentioning that the controller does not lie on the data path, i.e., it is not part of the distribution tree (marked with bold, solid arrows), thus avoiding bottlenecks in data distribution.
[Fig. 11.1: ALMI architecture: the session controller exchanges control messages with each member of the virtual multicast tree.]
A node that wants to join an ALMI session sends a JOIN message to the
session controller. Note that the discovery of the session controller’s location
for a given session ID is beyond the scope of the system specification and realized by third-party means, for example via a known URL or e-mail notification. When the newly arrived node is accepted into the group, it
receives a response containing its member ID (identifier in the overlay) and
the location of a parent node to which it should append itself. Finally, the
newly added node sends a GRAFT message to its parent and obtains in
response the data ports for the two-way communication with its parent. Node
departures are realized similarly by signaling the session controller with a
LEAVE message.
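The join handshake can be sketched as follows; the message format, the member-ID assignment, the parent-selection rule, and the port numbers are assumptions of this sketch, not ALMI's actual wire protocol:

```python
# Sketch of the ALMI join handshake: JOIN to the session controller,
# then GRAFT to the assigned parent.

class SessionController:
    def __init__(self):
        self.members = {}                     # member id -> parent id (or None)
        self.next_id = 0

    def handle_join(self, addr):
        member_id = self.next_id
        self.next_id += 1
        # naive parent choice for this sketch: the previously joined member
        parent = member_id - 1 if member_id > 0 else None
        self.members[member_id] = parent
        return {"member_id": member_id, "parent": parent}

    def handle_leave(self, member_id):
        self.members.pop(member_id, None)

def graft(parent_id):
    # the parent answers with the data ports for two-way communication
    return {"parent": parent_id, "data_ports": (9000, 9001)}

ctrl = SessionController()
root = ctrl.handle_join("10.0.0.1")
reply = ctrl.handle_join("10.0.0.2")
ports = graft(reply["parent"])
print(root, reply, ports)
```

In the real system the controller would choose the parent from its computed spanning tree rather than by join order; the sketch only shows the JOIN/GRAFT message flow.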
Tree creation and maintenance are also tasks performed by the session controller. Given a performance metric of interest (e.g., delay), the controller locally computes a minimum spanning tree on the graph of group members and assigns the upstream and downstream nodes to each overlay node in the
distribution tree. Note that, unlike other ALM systems, ALMI builds a single
shared tree with bidirectional links that is jointly used by all members for data
distribution. Measurement data for the metric to be optimized is provided by
each overlay node to the controller on a point-to-point basis. For this purpose,
each overlay node actively probes every other node and reports the results
to the controller. Obviously, this generates an O(n^2) message overhead. To
scale the monitoring service to larger groups, ALMI limits the degree of each
node in the monitoring graph. Although this may initially lead to sub-optimal
multicast trees, over time each node dynamically prunes bad links and adds
new links to the monitoring topology, resulting in more efficient multicast
trees.
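The controller's tree computation can be sketched with Prim's algorithm over reported delay measurements; the delay values and the data layout below are hypothetical:

```python
# Sketch of the session controller's tree computation: given pairwise delay
# measurements reported by members, compute a minimum spanning tree (Prim's
# algorithm) for the shared distribution tree.

import heapq

def minimum_spanning_tree(delays):
    """delays: dict mapping (a, b) -> measured delay (both directions
    present); returns the tree as a list of (parent, child, delay) edges."""
    nodes = {n for edge in delays for n in edge}
    start = next(iter(nodes))
    visited = {start}
    frontier = [(d, a, b) for (a, b), d in delays.items() if a == start]
    heapq.heapify(frontier)
    tree = []
    while frontier and len(visited) < len(nodes):
        d, a, b = heapq.heappop(frontier)
        if b in visited:
            continue                      # edge would close a cycle
        visited.add(b)
        tree.append((a, b, d))
        for (x, y), w in delays.items():
            if x == b and y not in visited:
                heapq.heappush(frontier, (w, x, y))
    return tree
```

From the resulting edge list, the controller would then assign each member its upstream and downstream neighbors in the bidirectional shared tree.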
Recapitulating, the centralization approach adopted by ALMI offers two
primary advantages: high control over the overlay topology and ease of im-
plementation. Moreover, as a side-effect of the first advantage, detection of
malicious nodes is easier to realize because all control operations pass through
the session controller. On the other hand, ALMI is plagued with the scala-
bility and dependability concerns of all centralized systems. While the first
deficiency remains unresolved, ALMI tries to alleviate the negative effects of
controller failures by introducing backup controllers. These synchronize peri-
odically with the main controller’s state and, in case of failure detection, one
of the backup controllers replaces the session coordinator.
(a) RDP vs. physical delay in ms, for Narada, DVMRP and unicast; (b) number of physical links vs. link stress (log-scale), for Narada, DVMRP and unicast.
Fig. 11.2: Delay performance of ESM and induced cost in terms of link stress
Mesh partitions occur when the abrupt failure of a node disconnects part of the mesh.
For this reason, when node A has not received any refresh message from node
B for a given time period, it starts actively probing node B. If B responds,
A creates an overlay link to B to repair the partition. Otherwise, it presumes
the departure of B and ultimately deletes B from its group member list.
The mesh constructed with Narada is heavily influenced by randomizing
effects, such as link failures after node departures, additions of bad quality
”emergency” links during partition repair, or additions of arbitrary links dur-
ing node arrivals (recall that a newly arrived node creates arbitrary links to a
few bootstrap nodes). Also, throughout the session, the varying conditions of
the network substrate may render a previously good mesh formation obsolete.
Due to the reasons mentioned above, ESM employs periodic re-evaluation of
the constructed mesh. This re-evaluation is performed autonomously by each
member, which actively probes its mesh links and adds new, or drops existing,
links. Various heuristics are used to evaluate the utility of link additions and
drops, incorporating the effects of a potential overlay reconfiguration on the
entire group [307].
Finally, data delivery in ESM follows the reverse-path forwarding [150]
concept: a node ni that receives a packet from source S via a mesh-neighbor
nj forwards the packet if and only if nj is the next hop on ni's shortest
path to S. If that condition holds, ni replicates the packet to all of
its mesh-neighbors that use ni as their next hop to reach S. The forwarding
algorithm requires each node to maintain an entry for every other node in
its routing table, which contains not only the next overlay hop and the
cost associated with the destination, but also the exact path that leads to it.
This extra information is also used for loop avoidance and for mitigating
count-to-infinity problems.
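The RPF check described above can be sketched as follows, assuming the per-destination routing tables are given (all names are illustrative):

```python
# Sketch of reverse-path forwarding on a mesh: a packet from source S is
# accepted only from the neighbor that is our next hop towards S, and is
# replicated to exactly those neighbors that reach S through us.

def forward(node, source, sender, routing, neighbors):
    """routing[n][s] is node n's next hop towards s; returns the list of
    neighbors to which `node` replicates the packet, or [] if dropped."""
    # RPF check: accept only if the sender is our next hop towards the source.
    if routing[node][source] != sender:
        return []
    # Replicate to every mesh-neighbor whose next hop towards S is us.
    return [n for n in neighbors[node] if routing[n][source] == node]
```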
Figure 11.2(a) plots the delay penalty with regard to native unicast delay
obtained with an ESM overlay. The data originate from a simulation of the
Narada protocol, incorporating 1024 nodes, 3145 links and a group size of
128, where the out-degree of each node was limited to 3 to 6 neighbors. The
plot conveys that Narada minimizes the delay penalty experienced by node
pairs with high unicast delay, whereas the penalty increases for small unicast
delay values. The fairly large RDP values for node pairs with very low unicast
delays are explained by the fact that even a slightly suboptimal overlay
configuration magnifies the tiny delay between nearby nodes into a large
relative penalty. Still, the effective delay among these node pairs meets real-
time requirements. In addition, the histogram in Figure 11.2(b) compares the
physical link stress in ESM against two common alternatives: naive unicast
and IP multicast using DVMRP [161]. Intuitively, native multicast is optimal
in terms of stress (stress 1 across all links). It is worth mentioning that ESM
manages to limit the maximum link stress to 9. In contrast, native unicast
leads to a longer tail, loading few links with stress greater than 9, and worse,
transmitting the same packet 128 times over a single link.
Summarizing, End System Multicast builds self-organizing overlays and
therefore provides increased robustness with minimal configuration effort.
As it does not require any support from entities inside the network,
it constitutes a ready-to-deploy solution. Furthermore, beyond the
system specification presented herein, Narada can be optimized against var-
ious QoS metrics, such as throughput or node out-degree [306]. Finally, an
asset of the entire ESM endeavor is that the system prototype has already
been used for various Internet broadcasts spanning multiple continents,
connecting home, academic and commercial network environments [305]. One of
the limitations of ESM is its inability to scale to large group sizes, mainly
because of the volume of routing information that needs to be exchanged and
maintained by each node. A major concern common to all host-based ALM
solutions is service interruption due to abrupt node failures. This failure type
however, is one of the major open issues in peer-to-peer research in general.
11.4 Structured Overlays
This section discusses multicast support in structured overlays and provides
insight into issues pertaining to their design and performance.
Two approaches have been taken towards extending the routing infras-
tructure of a structured overlay to support multicast and these differ primar-
ily in the manner they implement packet replication:
1. Directed flooding of each message across the virtual space.
2. Forming a tree, rooted at the group source used to distribute messages
from a specific source to the group members (tree leaves).
In the following, we review one representative system from each of the
two categories and end with a comparison of the two.
3. If a message has already traversed at least half of the distance from the
source across a particular dimension, then the message is not forwarded
by a receiving node.
4. Each node caches the sequence numbers of received messages and discards
messages it has already forwarded.
Intuitively, rules 1 and 2 ensure that a message reaches all dimensions of
the virtual space and additionally that the message is flooded to all nodes in
one single dimension. Rule 3 prevents the flooding process from looping. A
sample one-to-many communication in a two-dimensional CAN is depicted
in Figure 11.3, where the listed forwarding rules are applied on a hop-by-hop
basis to flood a message to all members of the CAN. Note that if the coor-
dinate space is not perfectly partitioned, then a node may receive multiple
copies of the same packet. This is particularly true for nodes C and D in the
flooding example presented in Figure 11.3. [506] specifies enhancements to
the elementary forwarding process that avoid, though do not entirely eliminate,
a significant fraction of duplicates.
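Rules 3 and 4 can be sketched as follows for a CAN whose coordinate space wraps around; the space size and the node coordinates used here are hypothetical:

```python
# Sketch of CAN flooding rules 3 (half-distance check per dimension) and 4
# (duplicate suppression via a cache of seen sequence numbers).

SPACE = 1.0  # extent of each dimension of the (toroidal) coordinate space

def half_distance_exceeded(source, node, dim):
    """Rule 3: true if the message has covered at least half the space
    along dimension `dim`, measured from the source."""
    dist = abs(node[dim] - source[dim])
    dist = min(dist, SPACE - dist)          # wrap-around distance
    return dist >= SPACE / 2

class CanNode:
    def __init__(self, coords):
        self.coords = coords
        self.seen = set()                   # rule 4: duplicate cache

    def should_forward(self, seqno, source, dim):
        if seqno in self.seen:              # rule 4: already handled
            return False
        self.seen.add(seqno)
        return not half_distance_exceeded(source, self.coords, dim)
```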
forwards the message to the node with nodeId id2 whose prefix matches the
identifier idn in at least b more bits than the match of id1. If no such
match exists, the node forwards the message to the node with
the numerically closest nodeId. It can be proven that the described routing
scheme always converges [527]. Forwarding decisions, as well as the mapping of
nodeIds to IP addresses, are accomplished using routing state maintained at
each node. A more detailed presentation of the Pastry system can be found
in Section 8.2, which discusses thoroughly the routing table structure (Section
8.2.2), node bootstrapping and failure handling (Section 8.2.4), and the
mapping of geographical/IP-level proximity to the identifier space (Section
8.2.5).
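The prefix-based forwarding decision can be sketched as follows, with identifiers represented as hex strings of equal length (a simplification of Pastry's actual routing-table lookup):

```python
# Sketch of Pastry-style prefix routing: prefer a known node whose nodeId
# shares a longer prefix with the key than our own; otherwise fall back to
# the numerically closest candidate.

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(current, key, candidates):
    """Pick the forwarding target among the known nodes `candidates`."""
    own = shared_prefix_len(current, key)
    better = [c for c in candidates if shared_prefix_len(c, key) > own]
    if better:
        return max(better, key=lambda c: shared_prefix_len(c, key))
    # Fallback: numerically closest nodeId to the key.
    return min(candidates, key=lambda c: abs(int(c, 16) - int(key, 16)))
```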
Scribe [109] builds a large-scale, fully decentralized, many-to-many dis-
tribution service on top of a Pastry infrastructure. Multicast distribution is
essentially core-based, using a Pastry node as the rendezvous point. Group
members "join" the tree rooted at the well-known rendezvous point, while
group sources send multicast data directly to the core. Scribe exposes the
following simple API calls to group communication applications:
– Create(credentials,topicId): creates a group identified by a unique top-
icId, which is the result of hashing a textual description of the group’s topic
concatenated with the nodeId of the creator node. Credentials are used for
applying access control to group creation.
– Subscribe(credentials, topicId, eventHandler): commands the local
Scribe instance to join the group identified by the topicId, resulting in
receiving multicast data of a particular group. Arriving group data are
passed to the specified event handler.
– Unsubscribe(credentials, topicId): causes the local node to leave the
specified group.
– Publish(credentials, topicId, event): used by group sources to com-
municate an event (i.e., multicast data) to the specified group.
An application intending to create a group uses Scribe’s "create" API
call. Then Scribe passes a CREATE message using the topicId and creden-
tials specified by the application to the local Pastry instance. The message is
routed to the node with the nodeId numerically closest to the topicId. The re-
ceiving node checks the credentials, adds the topicId to the locally maintained
groups and becomes the rendezvous point (RP) for the group. Adding a leaf
to the tree rooted at the RP is a receiver-initiated process: Scribe asks Pastry
to route a SUBSCRIBE message using the relevant topicId as the message
destination key. At each node along the route towards the RP, the message
is intercepted by Scribe. If the node already holds state for the particular
topicId, it adds the preceding node’s nodeId to its children table and
terminates the message. If no state exists, the node creates a children table
entry associated with the topicId and forwards the message towards the RP.
The latter process results in a reverse-path forwarding [150] distribution tree
Fig. 11.4: Tree creation and data replication in a sample Scribe overlay (labels: publish msg; temporary Pastry route 1001 → 1100; subscribe (0111); subscribe (0100)).
(left) RAD for CAN configurations d=10, z=1; d=9, z=2; d=12, z=3; d=10, z=5; d=8, z=10; (right) RAD for Pastry configurations b=1 through b=4.
Fig. 11.5: Relative delay penalty in various configurations of CAN multicast and
Scribe.
As a SUBSCRIBE message is routed towards the RP, it accumulates state
on the nodes it traverses, including the RP itself. As soon as the SUBSCRIBE
message is processed at the RP, multicast data start flowing on the
distribution tree (which in this case is a single node chain) towards node
1100. Now, let us look at how the arrival of a receiver in the "vicinity" of
node 1100, namely node 0100, affects the distribution tree. Following the
common join process, node 0100 sends a SUBSCRIBE message towards the
RP. The message is again delivered first to node 1001. Normally, the message
would follow the path 1001-1111-1100 to reach the RP. However, since node
1001 already holds state for group 1100, it just adds node 0100 to its
respective children table entry and terminates the SUBSCRIBE message.
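The per-node SUBSCRIBE handling can be sketched as follows (an illustrative reduction; real Scribe also handles credentials and node failures):

```python
# Sketch of how an intermediate Scribe node handles a SUBSCRIBE: existing
# topic state means the tree above us is already built, so the new child is
# recorded and the message terminated; otherwise state is created and the
# message keeps travelling towards the rendezvous point.

class ScribeNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.children = {}          # topic_id -> set of child nodeIds

    def on_subscribe(self, topic_id, child_id):
        """Returns True if the message must be forwarded towards the RP."""
        if topic_id in self.children:
            self.children[topic_id].add(child_id)
            return False            # terminate: subtree joins existing tree
        self.children[topic_id] = {child_id}
        return True                 # keep routing towards the RP
```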
Table 11.1: Maximum and average link stress in CAN and Scribe multicast.
RAND corresponds to randomly assigning nodeIds to newly arriving
Pastry nodes, whereas TOP denotes a topologically aware assignment
of nodeIds, taking proximity into consideration.
All runs used a single multicast group with the same source. Since there are
various parameter sets possible for the instantiation of a CAN/Pastry over-
lay, the authors have experimented with several configurations: a) in CAN,
by tuning the number of dimensions d of the Cartesian space and the maximum
number z of nodes allowed in each zone, and b) in Pastry, by tuning the
parameter b (the number of bits matched during per-hop prefix matching).
A small excerpt of the results is illustrated in Figure 11.5 and Table 11.1.
Figure 11.5 shows that tree-based multicast over Pastry is superior to CAN-
flooding in terms of both maximum (RMD) and average (RAD) relative
delay penalty, making the former more suitable for delay-sensitive applica-
tions (e.g., multi-point teleconferencing). In contrast, certain configurations
of CAN-flooding manage to economize link utilization compared with Pastry,
as outlined in Table 11.1. Consequently, for a non-delay-critical application
(e.g., TV broadcasting), CAN-flooding offers a cheaper solution. For more de-
tailed comparisons, including further metrics and evaluation against a larger
number of concurrent groups, please refer to [110].
Topology-Awareness:
Matching proximity in the overlay sense with the actual proximity of the
underlying IP substrate is key to improving the performance (delivery delay,
throughput) and cost (stress) of ALM. While this is an absolute necessity in
structured overlays [502], efforts are being made to achieve a closer match
in unstructured overlays as well [370].
Quality of Service:
Recently, the provision of (probabilistic) QoS guarantees in overlay communi-
cation [383] has been receiving increasing interest. To name an example, an overlay
node can take advantage of redundancy in overlay paths [539] to a given
destination and alternate over the several path options according to path
conditions. For instance, if node A is able to reach node B using two loss-
disjoint paths P1 and P2, A is able to pick the path with the lowest loss rate
to increase quality [538]. Similarly, multi-path routing together with redun-
dant coding inside the overlay can be used in k-redundant, directed acyclic
graphs to increase data throughput [643].
Multi-source Support:
Many of the existing ALM schemes employ source-specific multicast trees.
Still, various multicast applications inherently have multiple sources (such as
teleconferencing or online multi-player games). The trivial solution manifests
itself by creating one tree per source; still, it is evident that this is not the
most efficient solution in all practical cases. Alternatively, the overlay rout-
ing algorithm may take application semantics into consideration to provide
economical multi-source support [341] (e.g., by creating trees on demand and
applying tree caching).
Security:
Malicious node behavior can harm the performance and stability of an
application-layer multicast overlay. For example, Mathy et al. showed in
[401] the impact of malicious nodes reporting false delay measurement val-
ues. Clearly, shielding overlay networks with powerful cheating detection and
avoidance mechanisms is an interesting challenge.
11.6 Summary
In the preceding sections, we have introduced a first set of interesting ap-
plications of peer-to-peer networks: application-layer multicast. In this field,
peer-to-peer technology has helped to overcome the slow adoption of mul-
ticast mechanisms at the network layer. While unstructured and centralized
peer-to-peer systems allowed for fast deployment of those networks, the fam-
ily of structured peer-to-peer networks offers nearly unlimited scalability. The vast
amount of ongoing work will unleash further improvements of the existing
multicast systems and introduce new applications of peer-to-peer technology
in other networking areas.
12. ePOST
Alan Mislove, Andreas Haeberlen, Ansley Post, Peter Druschel
(Rice University & Max Planck Institute for Software Systems)
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 171-192, 2005.
Springer-Verlag Berlin Heidelberg 2005
Layer                     Function
Email Client              Interacts with user
IMAP / POP3 / SMTP        Standard email access protocols
ePOST                     Uses POST to provide email services
POST                      Securely and reliably delivers messages
Glacier / PAST / Scribe   Stores data / disseminates messages
Pastry                    Routes messages in overlay
Fig. 12.2: Diagram of application layers (shown in the figure for two rings, Ring A and Ring B). Note that rings may be running different protocols, as in this example.
12.1.1 Design
We assume that all members of a given ring are fully connected in the underlying physical
network, i.e., they are not separated by firewalls or NAT boxes.
All nodes in the entire system join the global ring, unless they are con-
nected behind a firewall or a NAT. In addition, each node joins an organi-
zational ring consisting of all the nodes that belong to a given organization.
A node is permitted to route messages and perform other operations only in
rings of which it is a member.
An example configuration is shown in Figure 12.3. Nodes shown in gray
are instances of the same node in multiple rings and nodes in black are only in
a single ring because they are behind a firewall. The nodes connected by lines
are actually instances of the same node, running in different rings. Ring A7
consists of nodes in an organization that are fully connected to the Internet.
Thus, each node is also a member of the global ring. Ring 77 represents a set
of nodes mostly behind a firewall.
Fig. 12.3: An example ring configuration (labels: Global Ring, Ring A7, Ring 77).
An organizational ring may also include nodes that are connected to the Internet
through a firewall or NAT box.
Recall that a node that is a member of more than one ring is a gateway
node. Such a node supports multiple virtual overlay nodes, one in each ring,
but uses the same nodeId in each ring. Gateway nodes can forward messages
between rings, as described in the next subsection. In Figure 12.3 above, all
of the nodes in ring A7 are gateway nodes between the global ring and ring
A7. To maximize load balance and fault tolerance, all nodes are expected to
serve as gateway nodes, unless connectivity limitations (firewalls and NAT
boxes) prevent it.
Gateway nodes announce themselves to other members of the rings in
which they participate by subscribing to a multicast group in each of the
rings. The group identifiers of these groups are the ringIds of the associated
rings. In Figure 12.3 for instance, a node that is a member of both the global
ring and A7, joins the Scribe groups:
– Scribe group A700...0 in the global ring
– Scribe group 0000...0 in ring A7
12.1.4 Routing
Recall that each node knows the ringIds of all rings in which it is a member.
We assume that each message carries, in addition to a target key, the ringId
of the ring in which the key is stored. Gateways forward messages as follows.
If the target ringId of a message equals one of these ringIds, the node simply
forwards the message to the corresponding ring. From that point on, the
message is routed according to the structured overlay protocol within that
target ring.
Otherwise, the node needs to locate a gateway node in the target ring,
which is accomplished via a Scribe anycast. If the node is a member of the
global ring, it then forwards the message via anycast in the global ring to
the group that corresponds to the destination’s ringId. The message will be
delivered by Scribe to a gateway node for the target ring that is close in the
physical network, among all such gateway nodes. This gateway node then
forwards the data into the target ring, and routing proceeds as before.
If the sender is not a member of the global ring, then it forwards the
message into the global ring via a gateway node, by anycasting to the
local Scribe group whose identifier corresponds to the ringId of the global
ring. Routing then proceeds as described above.
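The gateway forwarding decision can be sketched as follows, with Scribe anycast reduced to a lookup table of known gateways per ringId (all names are hypothetical):

```python
# Sketch of inter-ring routing in ePOST: deliver locally if we are a member
# of the target ring; otherwise anycast to a gateway of the target ring in
# the global ring, reaching the global ring via a gateway first if needed.

def route(node_rings, target_ring, gateways, global_ring="0000"):
    """Return the sequence of steps taken to move a message towards
    `target_ring` from a node that is a member of the rings `node_rings`."""
    if target_ring in node_rings:
        return ["deliver-in:" + target_ring]
    if global_ring in node_rings:
        # Anycast in the global ring to a gateway of the target ring.
        return ["anycast-gateway:" + gateways[target_ring],
                "deliver-in:" + target_ring]
    # Not in the global ring: first reach the global ring via a gateway.
    return (["anycast-gateway:" + gateways[global_ring]] +
            route([global_ring], target_ring, gateways, global_ring))
```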
In the previous discussion, we assumed that messages carry both a key and the
ringId of the ring in which the key is stored. In practice, however, applications
may need to look up a key without knowledge of where the object is stored.
For instance, keys are often derived from the hash of a textual name provided
by a human user. In this case, the ring in which the key is stored may be
unknown.
The following mechanism is designed to enable the global lookup of keys
even when the ring in which a key resides is not known to the requester.
When a key is inserted into an organizational ring and that key should be
visible globally, a special indirection record is inserted into the global ring
that associates the key with the ringIds of the organizational rings where
replicas of the key are stored. The ringIds of a key can now be looked up in
the global ring. Note that indirection records are the only data that need to
be stored in the global ring. To prevent space-filling attacks, only legitimate
indirection records are accepted by members of the global ring.
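The indirection mechanism can be sketched with in-memory dictionaries standing in for the DHT rings (names are hypothetical):

```python
# Sketch of global key lookup via indirection records: the global ring maps
# each globally visible key to the set of ringIds holding replicas.

def insert_globally_visible(key, ring_id, org_rings, global_ring):
    """Store the object in its organizational ring and publish an
    indirection record for it in the global ring."""
    org_rings.setdefault(ring_id, {})[key] = "object-data"
    global_ring.setdefault(key, set()).add(ring_id)   # indirection record

def lookup(key, org_rings, global_ring):
    """Resolve a key whose home ring is unknown to the requester."""
    for ring_id in global_ring.get(key, ()):          # consult indirection
        value = org_rings.get(ring_id, {}).get(key)
        if value is not None:
            return ring_id, value
    return None
```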
ePOST uses the POST messaging system to provide email services. At a high
level, POST provides three generic services: (i) a shared, secure, durable mes-
sage store, (ii) metadata based on single-writer logs, and (iii) event notifica-
tion. These services can be combined to implement a variety of collaborative
applications, such as email, news, instant messaging, shared calendars, and
whiteboards.
In a typical pattern of use, users create messages (such as emails in the
case of ePOST) that are inserted in encrypted form into the secure store.
To send a message to another user or group, the event notification service
is used to provide the recipient(s) with the necessary information to locate
and decrypt the message. The recipients may then modify their personal,
application-specific metadata to incorporate the message into their view, such
as a private mail folder in ePOST.
POST assumes the existence of a certificate authority. This authority signs
identity certificates binding a user’s unique name (e.g., his email address) to
his public key. The same authority issues the nodeId certificates required
for secure routing in Pastry [108]. Users can access the system from any
participating node, but it is assumed that the user trusts her local node,
hereafter referred to as the trusted node, with her private key material.
Though participating nodes may suffer from Byzantine failures, POST
also assumes that a large majority (> 75%) of nodes in the system behave
correctly, and that at least one node from each PAST replica set has not
been compromised. If these assumptions are violated, POST’s services may
not be available, though the durability of stored data is still ensured thanks to
Glacier, an archival storage layer that is described in Section 12.4.2. Addition-
ally, POST makes the common assumption that breaking the cryptographic
ciphers and signatures is computationally infeasible.
Table 12.1 shows pseudocode detailing the POST API that is presented
to applications. The store and fetch methods comprise the single-copy mes-
sage store. Similarly, the readMostRecentEntry, readPreviousEntry, and
appendEntry methods provide the metadata service, and the notify method
represents the event notification service.
The most interesting of these APIs is the metadata service, and we de-
scribe it in more detail here. Each of the user’s logs is given a name unique to
the user, denoted below by LogName. Applications can scan through a log in
reverse order by first calling readMostRecentEntry, followed by successive
invocations of readPreviousEntry. Similarly, applications can append to the
log by simply calling appendEntry with the desired target log’s name.
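The metadata API can be sketched as follows; the in-memory list stands in for log entries stored in PAST, and only the methods named above are modeled:

```python
# Sketch of a single-writer log with the POST-style metadata API: readers
# scan backwards from the most recent entry via readPreviousEntry.

class Log:
    def __init__(self, log_name):
        self.log_name = log_name
        self.entries = []                 # oldest entry first

    def appendEntry(self, data):
        self.entries.append(data)

    def readMostRecentEntry(self):
        return len(self.entries) - 1 if self.entries else None

    def readPreviousEntry(self, index):
        return index - 1 if index > 0 else None

    def scan(self):
        """Yield entries in reverse order, as an application would."""
        i = self.readMostRecentEntry()
        while i is not None:
            yield self.entries[i]
            i = self.readPreviousEntry(i)
```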
POST uses the PAST distributed hash table to store three types of data:
content-hash blocks, certificate blocks, and public-key blocks.
Content-Hash Blocks
Content-hash blocks, which store immutable data objects such as email data,
are stored using the cryptographic hash of the block’s contents as the key.
Content-hash blocks can be authenticated by obtaining a single replica and
verifying that its contents match the key; because they are immutable, any
corruption of the content can be easily detected.
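Verification of a content-hash block can be sketched as follows (SHA-256 stands in for whatever secure hash a deployment uses):

```python
# Sketch of content-hash block handling: the storage key is the hash of the
# block's contents, so any single replica can be authenticated locally.

import hashlib

def store_key(content: bytes) -> str:
    """Derive the block's key from its (immutable) contents."""
    return hashlib.sha256(content).hexdigest()

def verify_block(key: str, content: bytes) -> bool:
    """Any corruption changes the hash, so a mismatch exposes tampering."""
    return hashlib.sha256(content).hexdigest() == key
```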
Certificate Blocks
Certificate blocks are signed by the certificate authority and bind a name
(e.g. an email address) to a public key. Certificate blocks are stored using
the cryptographic hash of the name as the key and are also immutable after
creation. Certificate blocks can be authenticated based on their digital sig-
nature, since all users are assumed to know the certificate authority’s public
key.
Public-Key Blocks
Public-key blocks contain timestamps, are signed with a private key, and
are stored using a secure hash of the corresponding public key as the key.
The signature attached to the block allows for block mutation after creation.
First, the nodes storing replicas of the block must verify that the signature
on the update matches the already-known public key. To prevent an attacker
from trying to roll the block back to an earlier valid state, the storage nodes
verify that the timestamps are increasing monotonically. Finally, the object
requester must obtain all live replicas, verify their signatures, and discard
any with older timestamps.
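The rollback protection can be sketched as follows; signature checking is reduced to a boolean flag for brevity, and all names are illustrative:

```python
# Sketch of mutable public-key block handling: storage nodes accept an
# update only with a valid signature and a strictly newer timestamp, and a
# reader keeps the freshest authentic replica.

class StorageNode:
    def __init__(self):
        self.block = None                 # (timestamp, data, signed_ok)

    def update(self, timestamp, data, signature_valid):
        if not signature_valid:
            return False
        if self.block and timestamp <= self.block[0]:
            return False                  # reject rollback to older state
        self.block = (timestamp, data, signature_valid)
        return True

def read_freshest(replicas):
    """Reader side: take the authentic replica with the newest timestamp."""
    valid = [r.block for r in replicas if r.block and r.block[2]]
    return max(valid, key=lambda b: b[0])[1] if valid else None
```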
A user’s Scribe group provides a rendezvous point for nodes waiting for news
from the associated user, or for anybody wishing to notify the user that new
data is available. For example, users waiting for another user A to come
online can subscribe to A’s group. Once A is online again, he publishes to his
group, informing others of his presence.
While POST stores potentially sensitive user data on nodes throughout the
network, the system seeks to provide a level of privacy, integrity and durabil-
ity comparable to maintaining data on a trusted server. A technique called
convergent encryption [176] is used. This allows a message to be disclosed to
selected recipients, while ensuring that copies of a given plain-text message
inserted by different users or applications map to the same cipher-text, thus
ensuring that only a single copy of the message is stored.
To store a message X, POST first computes the cryptographic hash Hash(X),
uses this hash as a key to encrypt X with an efficient symmetric cipher, and
then stores the resulting ciphertext under the key

    Hash(Encrypt_{Hash(X)}(X)),

i.e., the secure hash of the ciphertext. To decrypt the message, a user
must know the hash of the plain-text.
Convergent encryption reduces the storage requirements when multiple
copies of the same content are inserted into the store independently. This
happens, for example, when a popular document is sent as an email attach-
ment or posted on bulletin boards by many different users.
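Convergent encryption can be sketched as follows; the SHA-256-based stream cipher is a toy stand-in for the efficient symmetric cipher mentioned above (a real deployment would use, e.g., AES):

```python
# Sketch of convergent encryption: identical plaintexts always yield
# identical ciphertexts, and therefore identical storage keys, enabling
# single-copy storage.

import hashlib

def _keystream_encrypt(key: bytes, data: bytes) -> bytes:
    # Toy deterministic stream cipher: XOR with SHA-256(key || counter).
    out = bytearray()
    for i in range(0, len(data), 32):
        pad = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[i:i + 32], pad))
    return bytes(out)

def store(plaintext: bytes):
    content_key = hashlib.sha256(plaintext).digest()       # Hash(X)
    ciphertext = _keystream_encrypt(content_key, plaintext)
    storage_key = hashlib.sha256(ciphertext).hexdigest()   # hash of ciphertext
    return storage_key, ciphertext, content_key

def fetch(ciphertext: bytes, content_key: bytes) -> bytes:
    return _keystream_encrypt(content_key, ciphertext)     # XOR is symmetric
```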
In certain scenarios, it may be undesirable to use convergent encryption,
such as when the plain-text can easily be guessed. In these cases, the POST
store can be configured to use conventional symmetric encryption with ran-
domly generated keys.
The notification service is used to alert users and groups of users to certain
events, such as the availability of a new email message, a change in the state
of a user, or a change in the state of a shared object.
For instance, after a new message has been inserted into POST as part of an
email or a newsgroup posting, the intended recipient(s) must be alerted to
the availability of the message and be provided with the appropriate decryption
key. Commonly, this type of notification involves obtaining the contact
address from the recipient’s identity block. Then, a notification message
containing the message’s decryption key is sent to the recipient’s trusted
node; it is encrypted with the recipient’s public key and signed by the sender.
In practice, the notification can be more complicated if the sender and
the recipient are not on-line at the same time. To handle this case, the sender
delegates the responsibility of delivering the notification message to a set of
k random nodes. When a user A wishes to send a notification message to
a user B whose trusted node is off-line, A first sends a notification request
message to the k nodes numerically closest to a random Pastry key C. This
message is encrypted with B’s public key and signed by A. The k nodes are
then responsible for delivering the notification message (contained within the
notification request message) to B. Each of these nodes stores the message
and then subscribes to the Scribe group associated with B.
Whenever user B is on-line, his trusted node periodically publishes a
message to the Scribe group rooted at the hash of his public key, notifying
any subscribers of his presence and current contact address. Upon receipt
of this message, the subscribers deliver the notification by sending it to the
contact address. As long as not all of the replica nodes fail at the same time,
the notification is guaranteed to be delivered. POST relies on Scribe only for
timely delivery: if Scribe messages are occasionally lost due to failures, the
notification will still be delivered, since users periodically publish to their
group.
12.2.5 Metadata
In order to make the PAST DHT practical for use in applications such as
ePOST, we found it necessary to introduce a mechanism for removing objects
from the DHT.
Disk space is not necessarily a problem, since the rapid growth in hard
disk capacity would probably make it possible to store all inserted data ad
infinitum. However, the network bandwidth required to repair failed replicas
would become unwieldy over time. Such maintenance is necessary to ensure
that there are always at least k live replicas of each stored object,
re-replicating objects as necessary.
The obvious solution is to add a delete operation to PAST that removes
the object associated with the given key. However, a delete method is un-
safe, because a single compromised node could use it to delete data at will.
Moreover, safe deletion of shared objects requires a secure reference-counting
scheme, which is difficult to implement in a system with frequent node failures
and the possibility of Byzantine faults.
As an alternative solution, we added leases to objects stored in PAST.
Each object inserted into the DHT is given an expiration date by the insert-
ing node. Once the expiration date for a given object has passed, the storage
nodes are free to delete the object. Clients must periodically extend the leases
on all data they are interested in. The modified PAST API is shown in Ta-
ble 12.2.
Expired objects are not deleted immediately; instead, they are kept for an additional grace period TG. During this time,
the objects are still available for queries, but they are no longer advertised
to other nodes during maintenance. Thus, nodes that have already deleted
their objects do not attempt to recover them.
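The lease mechanism, including the grace period TG, can be sketched as follows (time units and the grace-period value are hypothetical):

```python
# Sketch of lease-based expiration in the modified PAST: objects carry an
# expiration time, clients extend leases, and expired objects stay readable
# during a grace period but are no longer advertised during maintenance.

GRACE_PERIOD = 30  # TG, in the same (hypothetical) time units as `now`

class LeasedStore:
    def __init__(self):
        self.objects = {}                      # key -> (value, expires_at)

    def insert(self, key, value, expires_at):
        self.objects[key] = (value, expires_at)

    def extend_lease(self, key, new_expiry):
        value, old = self.objects[key]
        self.objects[key] = (value, max(old, new_expiry))

    def lookup(self, key, now):
        entry = self.objects.get(key)
        if entry and now < entry[1] + GRACE_PERIOD:
            return entry[0]                    # still served during grace
        return None

    def advertised_keys(self, now):
        # Maintenance re-replicates only objects with unexpired leases.
        return [k for k, (_, exp) in self.objects.items() if now < exp]
```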
POST is designed to face a variety of threats, ranging from nodes that simply
fail to operate, to attackers trying to read or modify sensitive information.
POST must likewise be robust against free-riding behavior, such as users
consuming more resources than they contribute, and against application-specific
resource consumption issues, such as the space consumed by spam messages.
Threat Model
Our threat model for POST includes attacks from both within and outside
of POST. Internal attacks can be broken down into two classes: free riding
and malicious behavior. Free riding, discussed below, consists of either selfish
behavior or simple denial of service. Malicious behavior, however, can consist
of nodes attempting to read confidential data, modify existing data, or delete
data from the ePOST system.
Data Privacy
Data Integrity
Due to the single-writer property and the content-hash chaining [408] of the
logs, it is computationally infeasible for a malicious user or storage node to
insert a new log record or to modify an existing log record without the change
being detected. This is due to the choice of a collision-resistant secure hash
function to chain the log entries and the use of signatures based on public
key encryption in the log heads.
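The content-hash chaining of log records can be illustrated with a small sketch: each record carries the hash of its predecessor, so modifying or inserting a record invalidates every later hash. This is a simplified model of the mechanism, not POST's actual record format.

```python
# Illustrative hash-chained log: tampering with any record breaks the chain.
import hashlib

def record_hash(payload, prev_hash):
    # Collision-resistant hash over the previous record's hash and the payload.
    return hashlib.sha256(prev_hash + payload).hexdigest().encode()

def append(log, payload):
    prev = log[-1][1] if log else b""
    log.append((payload, record_hash(payload, prev)))

def verify(log):
    # Recompute the chain from the start; any mismatch reveals tampering.
    prev = b""
    for payload, h in log:
        if record_hash(payload, prev) != h:
            return False
        prev = h
    return True
```

In POST the head of the chain is additionally signed by the log's owner, which is what binds the whole chain to a single writer.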
To prevent version rollback attacks by malicious storage nodes, public-key
blocks contain timestamps. When reading a public-key block (e.g., a log-head)
from the store, nodes read all replicas and use the authentic replica with
the most recent timestamp. When reading content-hash blocks or certificate
blocks, they can use any authentic replica.
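The rollback-resistant read described above can be sketched as: fetch all replicas of a public-key block, discard those that fail authentication, and return the authentic replica with the most recent timestamp. The replica representation and the signature check are stand-ins, not POST's actual data structures.

```python
# Sketch of a rollback-resistant read of a public-key block (e.g., a log head).
# `is_authentic` stands in for signature verification against the owner's key.
def read_public_key_block(replicas, is_authentic):
    authentic = [r for r in replicas if is_authentic(r)]
    if not authentic:
        raise LookupError("no authentic replica found")
    # A malicious node may serve an old (but validly signed) version;
    # taking the freshest authentic timestamp defeats such rollbacks.
    return max(authentic, key=lambda r: r["timestamp"])
```

For content-hash and certificate blocks this extra step is unnecessary, since any authentic replica is by construction the only possible version.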
Denial of Service
Free Riding
Nodes within the network may try to consume much more remote storage
than they provide to the network. Likewise, nodes may wish to fetch objects
more often than they serve objects to other nodes. If bandwidth or storage are
scarce resources, users will have an incentive to modify their POST software
to behave selfishly. Nodes can generally be coerced into behaving correctly
when other nodes observe their behavior and, if they determine a node to
be a freeloader, will refuse to give it service [448, 135]. Such mechanisms can
guarantee that it is rational for nodes to behave correctly.
POST, in its present form, does not yet include any explicit incentive
mechanisms [448, 135]. The reason is that within an administrative domain,
members generally have external incentives to cooperate. If abuses do occur,
they can be localized to an organizational ring, and the offending users can
be reprimanded within the organization.
Each ePOST user is expected to run a daemon program on his desktop com-
puter that implements ePOST, and contributes some CPU, network band-
width and disk storage to the system. The daemon also acts as an SMTP
and IMAP server, thus allowing the user to utilize conventional email client
programs. The daemon is assumed to be trusted by the user and holds the
user’s private key material. No other participating nodes in the system are
assumed to be trusted by the user.
When ePOST receives messages from a client program, it parses them into
MIME components (message body and any attachments) and these are stored
as separate objects in POST’s secure store. Recall that frequently circulated
attachments are stored in the system only once.
The message components are first inserted into POST by the sender’s
ePOST daemon; then, a notification message is sent to the recipient. Sending
a message or attachment to a large number of recipients requires very little
additional storage overhead beyond sending to a single recipient, as the data is
only inserted once. Additionally, if messages are forwarded or sent by different
users, the original message data does not need to be stored again; the message
reference is reused.
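This single-instance storage follows directly from keying components by their content hash: inserting the same attachment twice yields the same key, so only a reference needs to be sent or stored again. A minimal sketch, with a plain dictionary standing in for POST's secure store:

```python
# Sketch of content-hash-based single-instance storage.
import hashlib

def insert_component(store, data):
    # The key is the content hash, so identical data maps to one object.
    key = hashlib.sha256(data).hexdigest()
    if key not in store:
        store[key] = data          # stored at most once system-wide
    return key                     # the reference placed in notifications
```

A message forwarded to many recipients therefore costs one insertion plus one small notification per recipient.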
Each email folder is represented by an encrypted POST log. Each log entry
represents a change to the state of the associated folder, such as the addition
or deletion of a message. Since the log can only be written by its owner and
its contents are encrypted, ePOST preserves or exceeds the level of privacy
and integrity provided by conventional email systems with storage on trusted
servers. A diagram of the logs used in ePOST is shown in Figure 12.4.
Fig. 12.4: Log structure used in ePOST. Each box or circle represents a separate
object in the DHT.
12.3.5 Management
Software
The first management task incurred with ePOST is maintaining the proxy
software. This software needs to be kept running and up-to-date as bugs
are fixed and features are added. In our deployment, the ePOST proxy is
configured as a service that is restarted automatically if it fails. Software
upgrades are handled by signing updated code and having users’ proxies
periodically check and download authorized updates.
To allow administrators to efficiently monitor the ePOST system, we have
built a graphical administrative monitoring interface. This application allows
Storage
Access Control
Controlling access to ePOST can be broken down into two related tasks:
trust and naming. Trust is based on certificates, which users must obtain
from their organization to participate in the system. This is no different from
current email systems, where each user is required to obtain an account on
an email server. For example, in our experimental deployment, we provide a
web page where users can sign up and download certificates. In practice, the
process may require various forms of authentication before the new certificate
is produced.
Naming in ePOST is managed in a manner similar to current systems.
Organizations ensure that email addresses are unique and associated with
only one public key. This is easy to accomplish, since each user must obtain
a certificate from his organization.
ePOST has the potential for requiring substantially lower administrative
overhead than conventional email systems, since the self-organizing properties
of the underlying Peer-to-Peer substrate can mask the effect of node failures.
Additionally, the organic scalability granted to ePOST by the overlay has
the potential to significantly reduce the overhead associated with scaling an
existing email service to more users.
Moreover, as the size of the system increases, so does the storage supply,
which allows the system to support organic growth.
Since the system is built out of unreliable components, it must be prepared
to handle occasional node failures. Cooperative storage systems like PAST
often assume that the node population is highly diverse, i.e., that the nodes
are running different operating systems, use different hardware platforms, are
located in different countries, etc. Under these conditions, node failures can
be approximated as independent and identically distributed. To ensure data
durability, it is thus sufficient to store a small number of replicas for each
object, and to create new replicas when a node failure is detected.
Unfortunately, most real distributed systems exhibit high diversity only
in some aspects, but not in others. For example, the fraction of nodes running
Microsoft Windows can be 60% or higher in many environments.
In such a system, failures are not independent. For example, if the Windows
machines share a common vulnerability, a worm that exploits this vulnera-
bility may cause a large-scale correlated failure that can affect a majority of
the nodes. Moreover, if the worm can obtain administrator privileges on the
machines it infects, the failures can even be Byzantine.
The reactive replication strategy in PAST is clearly not sufficient to han-
dle failures of this type. Even if the failure is not Byzantine, there may simply
not be enough time to create a sufficient number of additional replicas. As a
consequence, early deployments of ePOST sometimes suffered data loss dur-
ing correlated failures. Since this is not acceptable for critical data like email,
the system needed another mechanism to ensure data durability.
[Figure: Placement of full replicas and fragments in the ring. Full replicas
(R) of object o are stored on the nodes closest to its key H(o); fragments are
stored at keys H(o)+1, H(o)+2, ..., H(o)+5. Legend: node (online), node
(offline), full replica, fragment.]
12.4.2 Glacier
The durability layer in POST, which is called Glacier, takes a different ap-
proach [269]. Instead of relying on a sophisticated failure model, it makes a
very simple assumption, namely that a correlated failure does not affect
more than a fraction f_max of the nodes; all failure scenarios up to that frac-
tion are assumed to be equally likely. In order to tolerate such a wide range
of failures, Glacier must sacrifice some capacity in the cooperative store for
additional redundancy; thus, it trades abundance for increased reliability.
When a new object is inserted, Glacier applies an erasure code to trans-
form it into a large number of fragments. Together, the fragments are much
larger than the object itself, but a small number of them is sufficient to re-
store the entire object. For example, Glacier may be configured to create
48 fragments, each of which is 20% the size of the object. This corresponds
to a storage overhead of 9.6, but the object can be restored as long as any
five fragments survive.
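The arithmetic behind this configuration can be checked with a small sketch: with n fragments, each 1/m the size of the object, any m fragments reconstruct it, the storage overhead is n/m, and up to n - m fragment losses are tolerated. The function name is illustrative, not part of Glacier's API.

```python
# Back-of-the-envelope parameters of Glacier's erasure-coding configuration.
def erasure_params(n_fragments, fragments_needed):
    fragment_size = 1.0 / fragments_needed      # relative to the object size
    overhead = n_fragments * fragment_size      # total storage per object
    tolerated_losses = n_fragments - fragments_needed
    return fragment_size, overhead, tolerated_losses
```

With n = 48 and m = 5 this yields 20% fragments, a 9.6x storage overhead, and tolerance of up to 43 lost fragments, matching the configuration described above.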
Glacier then attaches to each fragment a so-called manifest which, among
other things, contains hashes of all the other fragments. This is used to au-
thenticate fragments. Finally, Glacier spreads the fragments across the over-
lay, calculating the key of fragment i as

    k_i = K + i / (n + 1)

where K is the key of the object and n is the total number of fragments.
This ensures that the fragments are easy to find without extra bookkeeping
(which may be lost in a failure). Also, if the overlay is large enough, each
fragment is stored on a different node, which ensures that fragment losses are
not correlated.
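The placement rule above, mapped onto a discrete circular key space (2^160, as in Pastry-style overlays; the ring size and function name are assumptions for this sketch), can be written as:

```python
# Sketch of the fragment placement rule k_i = K + i/(n+1) on a discrete ring.
RING = 2 ** 160  # size of the circular key space (Pastry-style assumption)

def fragment_keys(object_key, n_fragments):
    # Fragments are spread at equal distances of 1/(n+1) of the ring,
    # so they are easy to locate from K alone, without extra bookkeeping.
    step = RING // (n_fragments + 1)
    return [(object_key + (i + 1) * step) % RING for i in range(n_fragments)]
```

Because the keys are deterministic functions of K, a node can always recompute where every fragment of an object belongs, even after a failure.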
For security reasons, Glacier does not allow fragments to be overwritten
or deleted remotely. If these operations were permitted, a compromised node
could use them to delete its own data on other nodes. However, objects may
be associated with a lease, and their storage is reclaimed when the lease
expires. Also, Glacier supports a per-object version number to implement
mutable objects.
Since some fragments are continually lost due to individual node failures
and departures, Glacier implements a maintenance mechanism to reconstruct
missing fragments. However, because of the high redundancy, Glacier can
afford high latencies between the loss of a fragment and its recovery; thus,
the maintenance mechanism need not be tightly coupled.
Because of the way fragments are placed in the ring, each Glacier node
knows that its peers at a distance k/(n+1) (k = 1..n) in ring space store a set of
fragments that is very similar to its own. Thus, each node periodically (say,
once every few hours) asks one of its peers for a list of fragments it stores,
and compares that list to the fragments in its own local store. If it finds a key
for which it does not currently have a fragment, it calculates the positions
of all corresponding fragments and checks whether any of them fall into its
local key range. If so, it asks its peers for a sufficient number of fragments to
restore the object, computes its own fragment, and stores it locally.
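One round of this loosely coupled maintenance can be sketched as follows. The peer exchange and object reconstruction are stubbed out as parameters; the function and parameter names are illustrative, not Glacier's actual interfaces.

```python
# Sketch of one Glacier maintenance round: compare a peer's reported object
# keys against the local store and recover any fragment that belongs here.
def maintenance_round(local_store, in_local_range, peer_object_keys,
                      fragment_keys_of, recover_fragment):
    for obj_key in peer_object_keys:
        # Compute where all fragments of this object belong ...
        for frag_key in fragment_keys_of(obj_key):
            # ... and recover any that fall into our key range but are missing.
            if in_local_range(frag_key) and frag_key not in local_store:
                # In Glacier: fetch enough fragments from peers, restore the
                # object, compute our own fragment, and store it locally.
                local_store[frag_key] = recover_fragment(obj_key)
```

Since high redundancy makes recovery latency uncritical, such rounds can run as rarely as every few hours, which is what keeps the maintenance bandwidth low.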
Glacier also takes advantage of the fact that nodes often depart the overlay
for a certain amount of time (e.g. because of a scheduled downtime) but
return afterwards with their store intact. Therefore, Glacier nodes do not
immediately take over the ring space of a failed neighbor, but wait for a
certain grace period T . If the node returns during that time, it only needs
to reconstruct the fragments that were inserted while it was absent; the vast
majority of its fragments remains unmodified.
The loosely coupled maintenance mechanism greatly reduces the band-
width required for fragment maintenance. In the actual deployment, which
exhibits moderate churn, and with the configuration mentioned earlier, Glacier uses
less bandwidth than PAST, even though it manages over three times more
storage.
In ePOST, the storage load mainly consists of small objects (email texts and
headers). This causes more overhead in Glacier because the number of keys
is higher, and thus more storage space and bandwidth is required for per-key
metadata such as the fixed-size manifests. To reduce this overhead, ePOST
aggregates objects before inserting them into Glacier.
The main challenge in object aggregation is how to do it securely in an
environment with large-scale Byzantine failures. Even though there are con-
siderable advantages in performing aggregation on the storage nodes, Glacier
cannot allow this because these nodes cannot be trusted. Therefore, each
node is required to create and maintain its own aggregates. This includes
keeping a mapping from object keys to aggregate keys (which is required to
locate objects), extending the leases of aggregates whose objects are still in
use, and merging old aggregates whose objects have mostly expired.
The mapping from object keys to aggregates requires special attention
because it is crucial during recovery. Without it, the application may be
unable to find its objects after a failure without searching Glacier’s entire
store, which is infeasible. For this reason, the system adds to each aggregate
a few pointers to other aggregates, thus forming a directed acyclic graph
(DAG). During recovery, an ePOST node traverses its DAG and is thus able
to locate all non-expired objects it has inserted. Moreover, the DAG contains
a hash tree, which is used to authenticate all aggregates. The only additional
requirement for ePOST is to maintain a pointer to the top-level aggregate;
this pointer is kept in an object with a well-known key that is directly inserted
into Glacier.
13.1 Introduction
The idea of GRID computing originated in the scientific community and was
initially motivated by processing power and storage intensive applications
[213]. The basic objective of GRID computing is to support resource sharing
among individuals and institutions (organizational units), or resource enti-
ties within a networked infrastructure. Resources that can be shared are, for
example, bandwidth, storage, processing capacity, and data [304, 429]. The
resources pertain to organizations and institutions across the world; they
can belong to a single enterprise or be in an external resource-sharing and
service provider relationship. On the GRID, they form distributed, hetero-
geneous, dynamic virtual organizations [221]. The GRID provides a resource
abstraction in which the resources are represented by services. Through this
strong service orientation, the GRID effectively becomes a networked infras-
tructure of interoperating services. The driving vision behind this is the idea
of “service-oriented science” [215].
The GRID builds on the results of distributed systems research and im-
plements them on a wider scale. Through the proliferation of the Internet and
the development of the Web (together with emerging distributed middleware
platforms), large-scale distributed applications that span a wide geographical
and organizational area have become possible. This has been taken advantage of
within the Peer-to-Peer and the GRID communities more or less at the same
time. The GRID has been driven from within the science community, which
first saw the potential of such systems and implemented them on a wider
scale. Application areas here are distributed supercomputing (e.g., physical
process simulations), high-throughput computing (to utilize unused processor
cycles), on-demand computing (for short-term demands and load-balancing),
data intensive computing (synthesizing information from data that is main-
tained in geographically distributed repositories), and collaborative comput-
ing [225].
It is important to note that the prime objective of GRID computing
is to provide access to common, very large pools of different resources that
enable innovative applications to utilize them [227]. This is one of the defining
differences between Peer-to-Peer (P2P) and GRID computing.
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 193-206, 2005.
Springer-Verlag Berlin Heidelberg 2005
Although both are concerned with the pooling and co-ordinated use of resources, the GRID's
objective is to provide a platform for the integration of various applications,
whereas initially Peer-to-Peer applications were vertically integrated [217].
Many current GRID implementations are based on the Globus Toolkit™,
an open source toolkit [203, 219]. Within the Globus project, a pragmatic
approach has been taken in implementing services needed to support a com-
putational GRID. The Open GRID Services Architecture (OGSA) developed
by the Global GRID Forum (GGF), inspired by the Globus project, develops
the GRID idea further and is also concerned with issues that have not
been the focus of the Globus project, such as architectural and standardiza-
tion matters. OGSA combines GRID technology with Web services. This has
led to a strong service orientation, where services on the GRID are generally
autonomous and not subject to centralized control.
Besides Globus and OGSA, there is the European-driven development
around UNICORE. It has especially influenced information GRIDs in the cor-
porate world but is also still very much in the development phase. In this
chapter we concentrate on the developments within Globus and GGF. Un-
fortunately, a comprehensive discussion is not possible within the space of
this chapter. It rather provides an overview of the ideas and concepts be-
hind the GRID, but also shows how they have evolved from an infrastructure
driven approach towards a service-oriented architecture. This is particularly
important since the GRID is still a very active and fast developing area. More
information can be found in [225] and on the GGF Web site (www.ggf.org).
In this chapter, the main initiatives and concepts driving GRID develop-
ments are introduced. This includes an outline of the architectural concepts
behind GRID but also a discussion of the Globus Project and the develop-
ments around OGSA. Subsequently, the relationship between Peer-to-Peer
and GRID is outlined.
The GRID idea has evolved over the years and there is no prescribed or
standardized GRID architecture. The current understanding of the GRID
has been influenced by a number of initiatives and individuals who have been
driving the development. Therefore, the GRID architecture presented in the
following represents a generally accepted view and not a standardized reference
framework. It should be regarded as an abstraction in which the various GRID
tools and services can be located according to their functionality.
A computational GRID is more formally defined as “a hardware and soft-
ware infrastructure that provides dependable, consistent, pervasive, and in-
expensive access to high-end computational capabilities” [225]. Different ap-
plications can be implemented on top of this infrastructure to utilize the
shared resources. If different institutions or individuals participate in such a
sharing relationship, they form a virtual organization [222]. The concept un-
derlying the GRID facilitates collaboration across institutional boundaries.
To achieve this a protocol architecture is proposed and standard components
are specified that can be used by the different parties entering into a sharing
relationship [227].
[Figure: The layered GRID architecture: Application, Collective, Resource,
Connectivity, and Fabric Layer.]
The Globus project started in 1996 and is hosted by Argonne National Lab-
oratory’s Mathematics and Computer Science Division, the University of
Southern California’s Information Sciences Institute, and the University of
Chicago’s Distributed Systems Laboratory. It was one of the first and most
visible activities in this area. It is supported by a number of institutional
(e.g., National Computational Science Alliance (USA), NASA, Universities
of Chicago and Wisconsin) and industry partners (e.g., IBM and Microsoft)
[248]. The project is centered on four main activity areas:
1. Building of large-scale GRID applications such as distributed supercom-
puting, smart instruments, desktop supercomputing, and tele-immersion.
2. Support for planning and building of large-scale testbeds for GRID re-
search, as well as for functional GRID systems.
3. Research into GRID related issues such as resource management, secu-
rity, information services, fault detection and data management.
4. Building of software tools for a variety of platforms (the so-called Globus
Toolkit™), which are, however, considered research prototypes only.
The Globus Toolkit™ supplies the building blocks for a GRID infrastruc-
ture, i.e., it provides services and modules required to support GRID appli-
cations and programming tools. It is a community-based, open architecture,
open source set of services and software libraries [221]. The services are pro-
grams that interact with each other to exchange information or co-ordinate
the processing of tasks. They can be used independently or together to form
a supporting platform for GRID applications.
A Globus service encapsulates a certain functionality and provides an
abstraction for resources. For instance, a number of services deal with re-
source selection, allocation, and management. “Resource”, in this context,
is a generic term for everything required to process a task. This includes
system resources such as CPU, network bandwidth and storage capacity.
The Resource Selection Service (RSS) provides a generic resource selection
framework for all kinds of GRID applications. For a specific case it identifies
a suitable set of resources by taking into account application characteristics
and system status [392].
best use of the advantages of both technologies. Web services define tech-
niques for describing software components and accessing them. Further, Web
service discovery methods allow the identification of relevant service providers
within the system regardless of platform or hosting environment-specific fea-
tures. Web services are open in that they are programming language, pro-
gramming model, and system software neutral.
[Figure 13.2: The GT3 architecture: GT3 Data Services on top of the GT3
Base Services (Resource Management, Data Transfer, Information Services,
Reservation, Monitoring), which build on the GT3 Core (Interface &
Behaviour).]
In the context of OGSA, a new version of the Globus toolkit was de-
veloped (Globus Toolkit Version 3, GT3). From GT3 onwards the Globus
architecture was defined together with OGSA. Currently (i.e. in 2005) GT4
is the most recent version; it refines GT3, whose main concepts remain valid.
The Core, as shown in Figure 13.2, implements
service interfaces and behaviors as specified in the GRID Services Specifica-
tion [599]. The Core and Base Service Layer are part of the OGSI system
framework.
A number of (standard) high-level services that address requirements of
eBusiness and eScience applications are being discussed within GGF. Such
services include:
– distributed data management services (e.g., for database access, data trans-
lation, replica management and location, and transactions);
– workflow services (for coordinating different tasks on multiple Grid re-
sources);
– auditing services (for recording usage data);
– instrumentation and monitoring services (for measuring and reporting sys-
tem state information); and
– security protocol mapping services (for enabling distributed security pro-
tocols to be transparently mapped onto native platform security services).
These services can be implemented and composed in various ways, replacing
some of the current Globus toolkit services, for instance, those dealing
with resource management and data transfer [222].
With the strong service orientation adopted by OGSA the GRID can be re-
garded as a network of services. Web service mechanisms provide support for
describing, discovering, accessing and securing interactions (see Chapter 14
for a more detailed discussion on Web services). More formally, OGSA defines a
GRID service as a network-enabled entity that represents computational and
storage resources, networks, programs, and databases, inter alia. Within the
virtual organization formed by these networked services, clear service defini-
tions and a set of protocols are required to invoke these services. Note, the
protocols are independent of the actual service definitions and vice versa.
They specify a delivery semantic and address issues such as reliability and
authentication. A protocol that guarantees that a message is reliably received
exactly once can, for instance, be used to achieve reliability, if required. Mul-
tiple protocol bindings for a single interface are possible because WSDL is
used for the service definition [222]. However, the protocol definition itself is
outside the scope of OGSA.
To ensure openness, virtual service definitions are used, according to which
multiple (ideally interworking) implementations can be produced. Thus, a
client invoking a service does not have to consider the platform a service in-
stantiation is running on, or have to know anything about the implementation
details. The interaction between services happens via well-defined, published
service interfaces that are implementation independent. To increase the gen-
erality of the service definition, authentication and reliable service invocation
are viewed as service protocol binding issues that are external to the core ser-
vice definition. However, they still have to be addressed within a complete OGSA
implementation.
The OGSA services are also concerned with transient service instances
within the GRID infrastructure because services are not necessarily static
and persistent (i.e., a service can be created and destroyed dynamically).
Furthermore, OGSA conventions allow identification of service changes such
as service upgrades. The information documenting these changes also states
whether the service is backward compatible regarding interface and seman-
tics.
Since GRID services have to run on multiple platforms in a distributed
heterogeneous environment, service implementations should be portable not
only in terms of their design, but also as far as code and the hosting environ-
ment are concerned. OGSA defines the basic behavior of a service but does
not prescribe how a service should be executed. It is the hosting environment
that defines how a GRID service implementation realizes the GRID service
semantics [221]. Apart from the traditional OS based implementations, GRID
services can also be built on top of new hosting environments such as J2EE,
WebSphere, .NET, JXTA, or Sun ONE. These hosting environments tend to
offer better programmability and manageability; they are usually also more
flexible and provide a degree of safety.
Despite OGSA not being concerned with implementation details, the def-
inition of baseline characteristics can facilitate the service implementation.
Issues that have to be addressed in the context of hosting environments are
the mapping of GRID wide names and service handles into programming lan-
guage specific pointers or references, the dispatch of invocations into actions
such as events and procedure calls, protocol processing and data formatting
for network transmission, lifetime management, and inter-service authentica-
tion.
The original motivation behind GRID and Peer-to-Peer applications has been
similar; both are concerned with the pooling and organization of distributed
resources that are shared between (virtual) communities connected via a
ubiquitous network (such as the Internet). The resources and services they
provide can be located anywhere in the system and are made transparently
available to the users on request. Both also take a similar structural approach
by using overlay structures on top of the underlying communication (sub-
)system.
However, there are also substantial differences on the application, func-
tional and structural levels. The applications supported through the GRID
are mainly scientific applications that are used in a professional context. The
number of entities is still rather moderate in size, and the participating in-
stitutions are usually known. Current Peer-to-Peer applications, in contrast,
provide open access for a large, fluctuating number of unknown participants
with highly variable behavior. Therefore, Peer-to-Peer has to deal with scal-
ability and failure issues much more than GRID applications. Peer-to-Peer
applications are still largely concerned with file and information sharing.
In addition, they usually provide access to simple resources (e.g. process-
ing power), whereas the GRID infrastructure provides access to a resource
pool (e.g., computing clusters, storage systems, databases, but also scientific
instruments, sensors, etc.) [217]. Peer-to-Peer applications usually are ver-
tically integrated, i.e. the application itself realizes many of the conceptual
and basic functionalities that should be part of an architecture or the in-
frastructure. One example is the realization of overlay structures within the application itself.
In contrast, the GRID is essentially a multipurpose infrastructure where the
core functionality is provided by a set of services that are part of the architec-
ture. The resources are represented by services that can be used by different
applications.
In recent years, a number of Peer-to-Peer middleware platforms have been
developed that provide generic Peer-to-Peer support. The functionality they
support comprises, for example, naming, discovery, communication, security,
and resource aggregation. One example is JXTA [330], an open platform
designed for Peer-to-Peer computing. Its goal is to develop basic building
blocks and services to enable innovative applications for peer groups. An-
other, emerging Peer-to-Peer platform is Microsoft’s Windows Peer-to-Peer
Networking (MSP2P) [119], which provides simple access to networked re-
sources. There is also ongoing research in this area. For instance, in the EU
funded project on Market Managed Peer-to-Peer Services (MMAPPS) a mid-
dleware platform has been created that incorporates market mechanisms (in
particular, accounting, pricing and trust mechanisms) [578]. On top of this
platform, a number of applications (i.e., a file sharing application, a medical
application, and a WLAN roaming application) have been implemented to
show how such a generic platform can be used.
The GRID has been in successful operation within the scientific community
for a number of years. However, the potential of the GRID goes beyond
scientific applications and can for instance also be applied to the government
domain, health-care, industry and the eCommerce sector [62]. Many of the
basic concepts and methods could remain unchanged when applied to these
new domains. Other issues not within the scope of the current GRID initiative
will have to be addressed in the context of these application areas (e.g.,
commercial accounting and IPR issues). Further, with a more widespread
adoption of the GRID, there is a greater need for scalability, dependability
and trust mechanisms, fault-tolerance, self-organization, self-configuration,
and self-healing functionality. This indicates that mechanisms from the Peer-
to-Peer application and platform domain and the Peer-to-Peer paradigm in
general could be adopted more widely by the GRID. This would result in
a more dynamic, scalable, and robust infrastructure without changing the
nature or fundamental concepts. However, this will only happen in the context
of the service-oriented architecture. Thus, the developments between Peer-to-
Peer and Web services as described in Chapter 14 and between Peer-to-Peer
and GRID are actually running in parallel.
Peer-to-Peer applications are also developing into more complex systems
that provide more sophisticated services. A platform approach has been pro-
posed by some vendors and research initiatives to provide more generic sup-
port for sophisticated Peer-to-Peer applications. It is expected that devel-
opers of Peer-to-Peer systems are going to become increasingly interested in
such platforms, standard tools for service description, discovery and access,
etc. [217]. Such a Peer-to-Peer infrastructure would then have a lot in com-
mon with the GRID infrastructure. However, the goal behind the GRID (i.e.,
providing access to computational resources encapsulated as services) is not
necessarily shared by these middleware platforms. They are built for better
and more flexible application support.
Essentially, it is a matter of substantiating the claims represented by the
Peer-to-Peer paradigm of providing more flexibility, dynamicity, robustness,
dependability and scalability for large scale distributed systems. If this is
successful and additional quality of service features (such as performance and
efficiency) can also be ensured, Peer-to-Peer mechanisms can become central
to the GRID. Peer-to-Peer applications, on the other hand, will have to adopt
13.6 Summary
The idea for the GRID was conceived within the science community and
inspired by the success of the Internet and results produced by distributed
systems research. The main target application areas are resource sharing,
distributed supercomputing, data intensive computing, and data sharing and
collaborative computing. The GRID provides an abstraction for the different
resources in the form of services.
The architectural view of the GRID can be compared to the Internet
hourglass model where a small group of core protocols and components builds
the link between the high-level mechanisms and a number of lower level base
technologies [67]. The various services in this architecture can be located
at one of the different layers, namely the Fabric, Connectivity, Resource,
and Collective Layer. The Globus Toolkit™ provided the first tools for a
GRID infrastructure. These tools exploit the capabilities of the platforms
and hosting environments they run on, but do not add any functionality on
the system level. Using this pragmatic approach, some remarkable systems
have been realized by the Globus project, or with the help of the Globus
Toolkit. The OGSA initiative within the Global GRID Forum (conceived
within the Globus Project) is developing the original ideas further. It takes
a more systematic approach and defines a universal service architecture in
which the advantages of GRID technology and Web services are combined.
It is strictly service-oriented; i.e. everything is regarded as a service charac-
terized by well-specified platform and protocol independent interfaces. This
universal service idea combined with openness and platform independence,
allows building very large and functionally complex systems. Applying these
concepts could provide a way to deal with management issues that have so
far restricted the size of distributed systems.
The relationship between Peer-to-Peer and GRID is still a controversial
topic. Since the GRID is defined as an infrastructure formed out of services representing resources, its scope and extent are better defined than those of Peer-to-Peer. The term Peer-to-Peer is, on the one hand, used for a
group of distributed applications (such as the well known file sharing appli-
cations); on the other hand it also refers to a paradigm encompassing the
concepts of decentralization, self-organization, and resource sharing within a
system context [573]. Recently, middleware platforms have been developed
that provide generic support for Peer-to-Peer applications, implementing the
Peer-to-Peer paradigm in an operating system-independent fashion. The ob-
jective of the GRID is to provide an infrastructure that pools and coordinates
the use of large sets of distributed resources (i.e., to provide access to compu-
tational resources similar to the access to electricity provided by the power
grid). The most recent developments within the GRID community go towards a strong service orientation. Within the GGF, the ideas developed in the service-oriented architecture and Web service domain are being adopted. Hence, a convergence between GRID and Peer-to-Peer would actually run in parallel with, or be predated by, a convergence of Peer-to-Peer and Web services. However, it has been recognized that an adoption of Peer-to-Peer principles could be
beneficial in terms of scalability, dependability, and robustness. The pooling
and sharing of resources is also a common theme in Peer-to-Peer applica-
tions. This could be supported by Peer-to-Peer middleware platforms in the
future. However, this does not mean global access to computational resources
(represented by services) anywhere, anytime. The question of how, indeed if,
the two concepts converge is still open.
14. Web Services and Peer-to-Peer
Markus Hillenbrand, Paul Müller (University of Kaiserslautern)
14.1 Introduction
Peer-to-Peer and Web services both address decentralized computing. They
can be considered as rather distinct from each other, but a closer look at the
Web services technology reveals a great potential for a combination of both
Peer-to-Peer and Web services.
The basic idea behind Web services technology is to provide functionality
over the Internet that can be accessed using a well-defined interface. This
idea of a service-oriented architecture forms the next evolutionary step in
application design and development after procedural programming, object
orientation, and component-oriented development. During the last twenty
years, different middleware approaches and application designs have been
introduced to leverage dated technology and provide easy access over open
and mostly insecure access networks.
The most recognized and well established technologies for creating dis-
tributed systems are the Remote Procedure Call (RPC, 1988) from Sun
Microsystems, the Distributed Computing Environment (DCE, 1993) from
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 207-224, 2005.
Springer-Verlag Berlin Heidelberg 2005
the Open Software Foundation (OSF), the Common Object Request Broker
Architecture (CORBA, 1990s) from the Object Management Group (OMG),
the Java Remote Method Invocation (RMI, 1990s) and Enterprise JavaBeans (EJB, 1990s) from Sun Microsystems, and the Distributed Component Object Model (DCOM, 1997, and COM+, late 1990s) from Microsoft. Each of these
technologies introduced a higher level of abstraction for creating distributed
applications and reduced the implementation effort necessary to achieve this
goal. Figure 14.1 illustrates the relationship between the underlying pro-
gramming paradigm, the level of abstraction, and the complexity of creating
a distributed application.
Distribution aspects have always been an addendum to procedural pro-
gramming and object-orientation (mostly using some kind of remote proce-
dure call mechanism) and are not intrinsic to the paradigms. Solutions fol-
lowing the component-oriented paradigm provide middleware functionality
and software containers that allow for distribution during software development and help manage the resulting software systems. In contrast to this,
Web services are based on open, well-defined, and established standards and
encompass distribution from within the specifications. In combination with
currently evolving additional standards (cf. Chapter 14.2.6) they have a good
chance to achieve the goals of a real and secure distributed middleware ar-
chitecture.
The Web services technology was initiated by industry, not academia, and more and more large companies are working on Web services technology and applying it in real-world applications. But what is the reason for this development? Unfortunately, there is no commonly used definition
for Web services. Instead, several distinct definitions have to be consulted to
investigate what Web services are and how to use them. Two major driving
forces of the Web services technology – IBM and Microsoft – define Web
services as follows:
Definition 14.1.1. (IBM, 2003) “Web services are self-contained, modular
applications that can be described, published, located, and invoked over a net-
work. Web services perform encapsulated business functions, ranging from
simple request-reply to full business process interactions. These services can
be new applications or just wrapped around existing legacy systems to make
them network-enabled. Services can rely on other services to achieve their
goals.”
Microsoft favors a similar definition of the Web services technology, but
it emphasizes standard Internet protocols:
Definition 14.1.2. (MSDN, 2001) “A Web service is a programmable ap-
plication logic accessible using standard Internet protocols, or to put it an-
other way, the implementation of Web-supported standards for transparent
machine-to-machine and application-to-application communication.”
The common aspect of definitions 14.1.1 and 14.1.2 is their focus on busi-
ness and application-to-application communication. A more technical view
on Web services is given by the following definition from the World Wide
Web Consortium in 2003:
Definition 14.1.3. (W3C: May, 2003) “A Web service is a software system
identified by a URI (Uniform Resource Identifier), whose public interfaces
and bindings are defined and described using XML. Its definition can be dis-
covered by other software systems. These systems may then interact with the
Web service in a manner prescribed by its definition, using XML based mes-
sages conveyed by Internet protocols.”
This definition completely abstracts from the implementation and usage
of Web services and is entirely based on XML. During the definition phases
of Web services related standards, the W3C has revised this definition several
times to make it more specific in terms of technology while trying to keep it
as general as possible. As of 2004, the current definition reads as follows:
Definition 14.1.4. (W3C Feb, 2004) “A Web service is a software system
designed to support interoperable machine-to-machine interaction over a net-
work. It has an interface described in a machine-processable format (specif-
ically WSDL). Other systems interact with the Web service in a manner
prescribed by its description using SOAP messages, typically conveyed us-
ing HTTP with an XML serialization in conjunction with other Web-related
standards.”
Compared to definition 14.1.3, not only XML [89] but also WSDL [118,
115] and SOAP [265] are part of the definition. And HTTP [206] is mentioned
as the typical transport protocol. This makes the definition of a Web service
more precise from a technological view, but also narrows applicability and
extensibility.
The relevant standards mentioned in the definitions will be briefly introduced in the next sections. A sample Web service (providing functionality to add two integers or two complex numbers) will be used to illustrate them.
14.2 Architecture and Important Standards

The Web services technology permits loose coupling and simple integration of software components into applications – irrespective of programming languages and operating systems – by using several standards. The basic architecture is shown in figure 14.2.
Three participants interact to perform a task. A service provider is re-
sponsible for creating and publishing a description of a service interface us-
ing WSDL. The provider also contributes the actual implementation of the
service on a server responding to requests from clients that adhere to this
possible. In each case the information needed can be retrieved solely from the
WSDL document.
The necessary standards and protocols to either publish, find, or bind
a Web service will now be explained in greater detail. The examples adhere to the WSDL 1.1 specification because this version is currently widely used and has broad tool support.
14.2.1 XML

XML [89] is the key to platform- and programming-language-neutral data exchange. It provides mechanisms to create complex data structures as well as to model dependencies between data sets. An XML document itself is a plain text file using a given character encoding scheme (e.g., ISO 8859-15 or UTF-8). In the following, the necessary parts of the XML specification will be introduced to give a better understanding of the next sections.
Structure
XML allows any name to be used as an element name. Thus, the vocabulary of XML documents is not fixed. To avoid collisions between element names, XML namespaces [89] were introduced in 1999 and updated in 2004. A
namespace can be defined inside an element (usually the root element) and is
valid for all child elements (“XML namespaces” in figure 14.3). A namespace
is specified using a Uniform Resource Identifier (URI [68]) which itself can
either be a Uniform Resource Locator (URL) or a Uniform Resource Name
(URN). A URL points to a specific location where more information about
the namespace can be found while a URN is just a globally unique name.
It is possible to use different namespaces inside an XML document, and the
XML document itself can use elements from these namespaces in any suitable
order.
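As a small illustration of this mechanism, the following sketch (using Python's standard xml.etree library; the URNs and element names are invented for this example) parses a document in which two namespaces disambiguate elements with the same local name:

```python
import xml.etree.ElementTree as ET

# Two namespaces declared on the root element; both URIs are
# hypothetical and serve only to illustrate collision avoidance.
doc = """<root xmlns:a="urn:example:invoice" xmlns:b="urn:example:customer">
  <a:name>Invoice 42</a:name>
  <b:name>Alice</b:name>
</root>"""

root = ET.fromstring(doc)

# ElementTree expands prefixes to {namespace-URI}localname, so the
# two <name> elements remain distinguishable despite the same local name.
invoice_name = root.find("{urn:example:invoice}name").text
customer_name = root.find("{urn:example:customer}name").text
print(invoice_name, customer_name)
```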
Together with XML Namespaces, XML Schema [200, 593, 75] is one important building block for creating modular XML documents. Its major goal is to impose syntactic restrictions on XML elements, i.e., XML Schemas can be used to assign and define data types. Besides basic data types such as
integer, string, date, etc. provided by the standard, it is possible to define
new datatypes (“XML complex type definition” in figure 14.3). Using the
appropriate XML Schema elements, it is further possible to define new sim-
ple (primitive) data types, complex data types (like structures, arrays, etc.)
as well as enumerations and choices. It is also possible to define and assign
structural patterns restricting the range of values for the data types. Ad-
ditionally, XML Schemas can be imported into other XML documents, e.g.
WSDL documents. This allows for re-use of XML data types and a modular
design of XML documents.
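Since a schema file is itself an ordinary XML document, it can be inspected (and imported elsewhere) with generic XML tooling. The following Python sketch builds a minimal schema defining a complex data type; the element names real and imaginary are assumptions, as figure 14.3 is not reproduced here:

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

# A minimal schema defining a complex data type; the child element
# names (real, imaginary) are assumptions for illustration.
schema = f"""<xs:schema xmlns:xs="{XSD}"
    targetNamespace="urn:example:math">
  <xs:complexType name="Complex">
    <xs:sequence>
      <xs:element name="real" type="xs:double"/>
      <xs:element name="imaginary" type="xs:double"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

root = ET.fromstring(schema)
# A schema is plain XML, so generic tooling can list the declared
# types and elements (or import them into a WSDL types element).
types = [ct.get("name") for ct in root.iter(f"{{{XSD}}}complexType")]
elems = [e.get("name") for e in root.iter(f"{{{XSD}}}element")]
print(types, elems)
```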
14.2.2 WSDL
The Web Services Description Language (WSDL [118, 115]) is an XML based
format for describing the interface of a Web service. The WSDL document
starts with an XML header and the body is divided into several parts (shown
in figure 14.4):
Root Element
Types
The data types used by the Web service should be designed using XML
Schema. Inside the types element it is possible to define data types for
the current service or to import data types from remote documents using
the XML Schema import element. In figure 14.5 the types element is used
to import the Complex data type defined in figure 14.3. The actual XML
Schema file location is specified using the schemaLocation attribute and
its namespace is specified using the namespace attribute accordingly. The
targetNamespace attribute can be used to map the namespace of the XML
Schema document into another namespace.
Fig. 14.5: WSDL types element used to import an XML Schema data type
Messages
Messages are exchanged between the client and the service and represent
the data necessary to call a Web service function or to create a response. A
message element has a name and several parts that make up the message.
Every part element usually has a type – and this type is either imported
or defined in the types element. In figure 14.6 four messages are defined.
The first (Message addComplex) has two child elements x and y, and the
second message (Message addComplexResponse) contains only one child el-
ement result. As the name of the message suggests, it is used as a response
to the first message. Messages three and four work in the same manner.
Port Types
A Web service can have several porttype elements¹, each containing a set
of operations provided by the Web service. The port types use the messages
defined using the message elements to create input and output messages
for each operation. In figure 14.7 the two operations addComplex and addInt are defined using the messages from figure 14.6; pairing an input with an output message thus forms a request-response operation such as addComplex. With WSDL 1.1 other
operation types are possible: one-way (the endpoint behind the operation
receives a message), solicit-response (the endpoint receives a message and
sends a correlated message), and notification (the endpoint sends a message).
Bindings
The binding element assigns a data encoding format and a transport protocol
to the Web service operations. It is possible to assign more than one protocol
¹ In the WSDL 2.0 specification the porttype element has been renamed to interface and extended to support more types of communication.
to the same operation. In figure 14.8 both operations are defined to use SOAP
over HTTP.
Service
The service element finally defines for each binding a port as the actual end-
point, i.e. the place in the network where the actual software runs and offers
the service². In figure 14.9 the binding defined in figure 14.8 is assigned to
the SOAP access point provided by the software running on localhost on
port 8080.
² In the WSDL 2.0 specification the port element has been renamed to endpoint in order to clarify the meaning.
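The WSDL elements discussed above can be tied together in a minimal, machine-readable skeleton. The following Python sketch builds such a skeleton for the addInt operation and extracts the operation name and the endpoint address; the service and binding names and the /add path are invented, while the localhost:8080 endpoint follows the example in the text:

```python
import xml.etree.ElementTree as ET

WSDL = "http://schemas.xmlsoap.org/wsdl/"
SOAP = "http://schemas.xmlsoap.org/wsdl/soap/"

# Skeleton of a WSDL 1.1 document for the sample service; message parts
# and binding details are trimmed for brevity, and the names below
# (AddService, AddBinding, /add) are assumptions for illustration.
wsdl = f"""<definitions xmlns="{WSDL}" xmlns:soap="{SOAP}"
    xmlns:tns="urn:example:add">
  <message name="addInt"/>
  <message name="addIntResponse"/>
  <portType name="AddPortType">
    <operation name="addInt">
      <input message="tns:addInt"/>
      <output message="tns:addIntResponse"/>
    </operation>
  </portType>
  <service name="AddService">
    <port name="AddPort" binding="tns:AddBinding">
      <soap:address location="http://localhost:8080/add"/>
    </port>
  </service>
</definitions>"""

root = ET.fromstring(wsdl)
# A client toolkit reads the same information: which operations exist,
# and where (at which endpoint) the service can be reached.
ops = [o.get("name") for o in root.iter(f"{{{WSDL}}}operation")]
endpoint = root.find(f".//{{{SOAP}}}address").get("location")
print(ops, endpoint)
```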
14.2.3 SOAP
The mandatory body (“SOAP body” in figure 14.10) finally carries all ap-
plication specific information for the final recipient. This final recipient must
be able to semantically understand the body elements. A fault element inside
the body can be used to carry an error message to one of the intermediaries
or back to the origin of the message.
An additional standard allows for attachments to be transmitted in MIME
encoded form, enabling Web services to process large binary data files.
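A minimal sketch of both message shapes, using Python's standard XML library: a SOAP 1.1 request envelope for the addInt operation (parameter names x and y as in figure 14.6) and a fault reply; the fault text is invented:

```python
import xml.etree.ElementTree as ET

ENV = "http://schemas.xmlsoap.org/soap/envelope/"
ET.register_namespace("soap", ENV)

# Build a SOAP 1.1 request: the mandatory body carries the
# application-specific payload for the final recipient.
envelope = ET.Element(f"{{{ENV}}}Envelope")
body = ET.SubElement(envelope, f"{{{ENV}}}Body")
call = ET.SubElement(body, "addInt")        # application-specific payload
ET.SubElement(call, "x").text = "2"
ET.SubElement(call, "y").text = "3"
request = ET.tostring(envelope, encoding="unicode")

# A fault reply instead carries an error description inside the body;
# the fault text below is invented for illustration.
fault_doc = f"""<soap:Envelope xmlns:soap="{ENV}">
  <soap:Body>
    <soap:Fault>
      <faultcode>soap:Client</faultcode>
      <faultstring>x is not an integer</faultstring>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>"""
fault = ET.fromstring(fault_doc).find(f"{{{ENV}}}Body/{{{ENV}}}Fault")
print(fault.find("faultstring").text)
```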
14.2.4 HTTP
14.2.5 UDDI
14.2.6 WS-*
WS-Addressing
WS-Federation
WS-Policy
The Web Services Policy Framework [540] provides a general purpose model
and corresponding syntax to describe and communicate the policies of a Web
service. It defines a base set of constructs that can be used and extended
by other Web services specifications to describe a broad range of service
requirements, preferences, and capabilities.
WS-ReliableMessaging
WS-ResourceFramework
WS-Security
WS-Transaction
As Web services evolve and are deployed on a larger scale, the need for
the combination of several Web services in order to create a business pro-
cess arises. Several languages and specifications can be identified that deal
with service orchestration. The most relevant are XML Process Definition
Language (XPDL [452]), Business Process Modeling Language (BPML [33]),
Web Service Choreography Interface (WSCI [34]), Electronic Business using
eXtensible Markup Language (ebXML [455]), and Business Process Execu-
tion Language for Web services (BPEL4WS [588]). The latter is currently
the most promising candidate for a common standard.
BPEL4WS is based on XML and can be used to combine distributed Web services into a business process. Interactions can be modeled both between Web services and between the business process and its clients. The clients can thus be detached from the actual business logic and be kept simple.
BPEL4WS is driven by major companies such as IBM and Microsoft and
provides a language to implement complex processes by allowing for different
actions like calling a Web service, manipulating data, and handling errors.
Flow control can be realized using control flow statements like tests, loops,
and threads. To the outside, a BPEL4WS business process can be described
like a normal Web service and have its own WSDL description – a client does
not need to know the internal structure or control flow of the process.
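The control-flow idea behind such a process can be sketched in a few lines of Python; the two stub functions stand in for real Web service invocations, and all names are invented for illustration:

```python
# Stub "services": in a real BPEL4WS process these would be
# remote Web service invocations, not local functions.

def check_credit(customer: str) -> bool:
    return customer != "unknown"          # pretend credit lookup

def place_order(customer: str, item: str) -> str:
    return f"order:{customer}:{item}"     # pretend order service

def order_process(customer: str, item: str) -> str:
    """Combine the two services into one process, BPEL-style:
    invoke, conditional flow, and fault handling."""
    try:
        if not check_credit(customer):        # first invoke + switch
            raise ValueError("credit check failed")
        return place_order(customer, item)    # second invoke
    except ValueError as fault:               # fault handler
        return f"fault:{fault}"

# To its clients the whole process looks like one simple service call.
print(order_process("alice", "book"))
print(order_process("unknown", "book"))
```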
might also be used as incentives for further research and development in this
area.
The Web services standards evolve at a high rate and influence other tech-
nologies as well. There are several issues that also concern Peer-to-Peer tech-
nology:
XML
All data formats and all data exchange protocols in the Web services area are
based on XML. XML Schema is used to define platform and programming
language neutral data types, SOAP is used to transfer these data types to the
service, and WSDL is used to describe the service itself. New XML standards
or enhancements can be integrated into the Web services technology with
little effort, as XML security and XML encryption have shown.
Another benefit would be to use XML schema definitions for describ-
ing resources, data, services, and peers within a Peer-to-Peer system with
meta data. An XML-based description of the resources and data shared in a Peer-to-Peer system would be more flexible (with regard to different schema files and namespaces) and extensible, because a schema file can easily be extended without affecting existing software, thus allowing for a smooth upgrade or change in the meta data description. A more detailed view
on schema-based Peer-to-Peer systems is given in chapter 19.
Service Registration
Security
Interoperability
One of the design goals of Web services has been to be as open and interop-
erable as possible. Standardized interfaces (written in WSDL) can be used
and accessed by any system capable of processing XML documents. There is no artificial language or operating system barrier in a Web services scenario. Together with security standards, this allows large business processes and applications to be deployed over the Internet using different programming
languages and operating systems.
Service orchestration
Web services can be combined to create a business process using Web service
orchestration. This allows for re-use and encapsulation. The JXTA SOAP
project (http://soap.jxta.org) for example brings together Web services
and Peer-to-Peer technology by defining a bridge between SOAP and the
JXTA protocol. This can be further extended by defining workflows on top
of these services. JXTA is explained in more detail in chapter 21.3.1.
Decentralization
Transport Protocols
The success of the World Wide Web and Web services is partly based on the
simplicity and scalability of HTTP. Operating in real time and being state-
less allows for a tight coordination between client (browser) and server (Web
server) – with little overhead. But in systems with a high need for synchro-
nization (like instant messaging) HTTP is inadequate due to its design. This
also applies to services that need a lot of time to process a request (large
data base operations or complex calculations). HTTP is designed to deliver
an answer immediately. Some systems have instead adopted the Simple Mail
Transfer Protocol (SMTP) for asynchronous messaging in this case. But there
are several other protocols that might prove useful in different usage scenar-
ios. Especially Peer-to-Peer instant messaging protocols are designed to allow
for a flexible two-way communication.
Addressing Scheme
Client/Server Architecture
On the World Wide Web roles like client and server are largely fixed – the
Web server is always a server, and a Web browser is always a client. This also
applies to Web services running on a Web server. In Peer-to-Peer systems
however, these roles are only temporary. A node usually acts as client or
server, depending on the current task. This also affects scalability. A strong
client/server architecture only scales with the servers, while a Peer-to-Peer
infrastructure scales depending on the roles taken by the nodes.
⁴ Such a combination could be a Peer-to-Peer system using Web service technology or a Web service application scenario adopting Peer-to-Peer techniques.
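The role switching described above can be sketched with a toy Peer class (all names invented); each node both serves requests and issues them, depending on the task:

```python
# Toy sketch: in a Peer-to-Peer system a node takes the client or the
# server role per task, unlike the fixed roles on the World Wide Web.

class Peer:
    def __init__(self, name: str, data: dict):
        self.name = name
        self.data = data            # resources this peer can serve

    def serve(self, key: str):
        """Server role: answer a request from another peer."""
        return self.data.get(key)

    def query(self, other: "Peer", key: str):
        """Client role: ask another peer for a resource."""
        return other.serve(key)

a = Peer("A", {"song.mp3": "bytes-of-song"})
b = Peer("B", {"paper.pdf": "bytes-of-paper"})

# Each node acts as client and as server, depending on the task.
print(a.query(b, "paper.pdf"))    # A is client, B is server
print(b.query(a, "song.mp3"))     # roles reversed
```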
Bandwidth
Using XML message formats and searching for services using Peer-to-Peer
technology in distributed applications will increase the need for bandwidth
dramatically compared to a central registry such as UDDI. If there is no
central registry, a lot of nodes (peers) of the system have to be queried for
their services – this is especially the case when using unstructured Peer-to-
Peer systems (cf. Part II).
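A back-of-the-envelope model makes the difference concrete. Assuming each peer forwards a query to its neighbours until a TTL expires (a simplified Gnutella-style flooding model; the degree and TTL values are illustrative), the message count grows rapidly compared to a single registry lookup:

```python
# Simplified model: the querying peer sends to all d neighbours; each
# receiving peer forwards to its d-1 remaining neighbours until the
# TTL expires. Duplicate deliveries are ignored for simplicity.

def flooding_messages(degree: int, ttl: int) -> int:
    """Messages sent when a query is flooded with the given TTL."""
    return sum(degree * (degree - 1) ** hop for hop in range(ttl))

def registry_messages() -> int:
    return 2   # one request to the central registry, one response

print(flooding_messages(degree=4, ttl=3))   # 4 + 12 + 36 = 52
print(registry_messages())
```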
Security
Maintenance
14.5 Resulting Architectures

Several architectures can be imagined when joining Web services and Peer-to-Peer technologies. One of the most promising can be outlined as follows.
Distributed applications will have two faces: Peer-to-Peer in a closed and
rather secure system (i.e. the Intranet or a similar form) and additional Web
service access points for external communication on the Internet – as long as
security is weak there. It is possible to have the benefits of Peer-to-Peer sys-
tems like decentralization, scalability, and availability inside an application,
inside a complex system, or inside a company. On the edge to the Internet
this is changed to the benefits of Web services like security and standardized
WSDL interface descriptions.
This approach could be used to design service brokers (i.e. the entities
responsible for finding a service matching a request like in [211]) and search
engines (i.e. entities responsible for finding arbitrary information matching a
request like in [296]) by using Peer-to-Peer technology internally and offering
their results in XML/WSDL.
Further Reading
This chapter about Web services and Peer-to-Peer was only a short introduction into the world of distributed services. A good starting point for obtaining more knowledge are the following references (in no particular order) [25, 291, 193, 496, 141].
15. Characterization of Self-Organization
Hermann De Meer, Christian Koppen (University of Passau)
15.1 Introduction
Self-organization is used in many disciplines to refer to several related phenomena. Some of the more prominent phenomena summarized under the umbrella of self-organization are autonomy, self-maintenance, optimization, adaptivity, rearrangement, reproduction, and emergence. An exact match, however, has yet to be accomplished. Even in the context of this book on Peer-
to-Peer systems, self-organization is used in various forms to relate to several
interesting but distinct properties of Peer-to-Peer networking. Before Peer-
to-Peer networks are analyzed in more detail in Chapter 16 for their degree
of affinity to self-organization, we juxtapose selected but prominent defini-
tions and criteria of self-organization from all disciplines in this chapter. The
purpose of that exercise is to broaden scope and horizon of understanding
self-organization in the context of Peer-to-Peer networks. It is hoped such
an approach may spearhead new developments and stimulate innovative discussions. Due to the nature of some of the disciplines, the definitions may lack mathematical precision, and some ambiguities may not be overcome.
It is still believed that, by comparing and relating the existing manifold perspectives and concepts, a more objective and thought-provoking discussion in the context of Peer-to-Peer networking can result. This is particularly so as self-
organization may offer great potentials and pose high risks by the same token.
The notion of self-organization is not a new one. In fact, its roots may
even be traced back to ancient times. It was Aristotle who stated that “The whole is more than the sum of its parts” [32, 10f-1045a], a simple definition for a phenomenon that nowadays is called emergence and is attributed to
self-organizing systems. In the 20th century, the pioneer discipline engaging
in self-organization was the science of cybernetics, originated as theory of
communication and control of regulatory feedback in the 1940s; cyberneti-
cists study the fundamentals of organizational forms of machines and human
beings. The name and concept of self-organization as it is understood to-
day emerged in the 1960s, when related principles were detected in different
scientific disciplines.
The biologists Varela and Maturana coined the term autopoiesis, a form of organization that every living organism seems to exhibit [402, 403].
The chemist Ilya Prigogine observed the formation of order in a special class of chemical systems, which he then called dissipative [495]. The biochemists
Eigen and Schuster detected autocatalytic hypercycles, a form of chemical
molecules which align with each other and reproduce themselves to build
up and maintain a stable structure [551]. The physicist Haken analyzed the
laser and found that the atoms and molecules organize themselves so that a
homogeneous ray of light is generated [271]. The field of synergetics resulted
as a whole new discipline from this research [272]. These are just some examples; many more have appeared in the natural, social, economic, and information sciences.
Each approach has revealed some basic principles of self-organization, and
it is interesting to see how similarities can be identified. Self-organization may
appear in different facets or it can be seen as an assembly of several, many
or all of the described properties. We first describe the various properties
in more detail in Section 15.2, use them to characterize self-organization in
Section 15.3 and then apply the results to examples in computer science,
emphasizing what we see as the positive impact of self-organization on these
example areas, in Section 15.4. Section 15.5 concludes this chapter.
15.2 Basic Definitions

15.2.1 System
Definition: System
A system is a set of components that have relations between each
other and form a unified whole. A system distinguishes itself from its
environment.
such as the cup in this example. The system is not observed as a giant set of
single molecules, but as one big entity.
15.2.2 Complexity
The term “complexity” is used to denote diverse concepts coming from do-
mains which exhibit strong differences, like social [57], economic [120] and
computer sciences [628]. Even the term theory of complexity is used to denote multiple disciplines, including theoretical computer science, systems theory
and chaos theory. For example, in theoretical computer science the Landau
symbols (e.g., O(n), ω(n), Θ(n)) are used to describe the time or space complexity of an algorithm, independently of a certain implementation. The Kolmogorov complexity, the degree of order in a string, is determined by the size of the shortest computer program that creates this string. A general
definition of complexity is therefore hard to achieve.
Definition: Complexity
We use the term complexity to denote the existence of system
properties that make it difficult to describe the semantics of a system’s
overall behavior in an arbitrary language, even if complete information
about its components and interactions is known. [58]
called convection cells (Figure 15.1). The water moves up on one side and
down on the other side, in a regular movement. The more regular movement
can now be described more easily by a formula, due to the reduced complexity.
15.2.3 Feedback
Definition: Feedback
We use the term “feedback” to describe “the return to the input of a
part of the output of a machine, system, or process (as for producing
changes in an electronic circuit that improve performance or in an
automatic control device that provide self-corrective action)” [409].
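A minimal numerical sketch of such self-corrective feedback (all values invented): a fraction of the output error is returned to the input, so the system converges towards a target value:

```python
# Negative feedback: part of the output is returned to the input,
# producing self-corrective behaviour. Gains and values are invented.

def feedback_loop(target: float, gain: float, steps: int) -> float:
    value = 0.0
    for _ in range(steps):
        error = target - value      # compare output with desired value
        value += gain * error       # feed a fraction of the error back
    return value

result = feedback_loop(target=10.0, gain=0.5, steps=20)
print(round(result, 4))
```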
15.2.4 Emergence
The term “emergence” is used in various disciplines [190], and there is as yet no general agreement about its meaning: “First, it is often applied to situa-
tions, agent behaviors, that are surprising and not fully understood. Second,
it refers to a property of a system that is not contained in any one of its parts.
This is the typical usage in the fields of artificial life, dynamical systems, and
neural networks for phenomena of self-organization. Third, it concerns behav-
ior resulting from the agent-environment interaction whenever the behavior
is not preprogrammed”[488]. The term “emergence” is also used to describe
15.2.6 Criticality
The term “criticality” is used in many domains but has acquired special
importance in the field of thermodynamics. Criticality “is used in connection
with phase transitions. When the temperature of the system is precisely equal
to the transition temperature, something extraordinary happens. [. . .] The
system becomes critical in the sense that all members of the system influence
each other.” [323] Since this definition is not general enough to be valid in
the context of self-organization, we denote a group of system components an
“assembly” and use the following definition:
Definition: Criticality
“An assembly in which a chain reaction is possible is called critical,
and is said to have obtained criticality.”[623, Criticality]
Fig. 15.2: Example for a simulation run of the Abelian Sandpile Model. A single
grain of sand causes an avalanche that affects all but the upper left field.
Simulations have shown that this system moves into a critical state: many
fields have a height between 1 and 3 – they are stable because another grain
of sand will not change the system structure. At the same time, a few fields
reside at the critical value of 4 – another grain of sand will cause an avalanche
that changes the system’s structure [647]. A combination of critical and non-
critical field states causes the system to remain stable in most of the cases
(because under the assumption of a random distribution of newly dropped
sand grains, the probability of hitting a stable field is high), at the same
time keeping the possibility for change (at least a few fields can cause a top-
pling). Bak concludes: “A frozen state cannot evolve. A chaotic state cannot
remember the past. That leaves the critical state as the only alternative.”
[46, p. 6]
Note that some systems (e.g., the Abelian Sandpile Model) have the abil-
ity to move themselves into a critical state without external influences. This
phenomenon is called self-organized criticality [45] and can be observed in
the most diverse real systems such as earthquakes, stock exchange crashes,
traffic jams, or solar storms [172].
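The toppling rule of the Abelian Sandpile Model described above can be sketched in a few lines (grid size and number of dropped grains are arbitrary illustrative choices; grains falling over the edge are lost, as in the standard model):

```python
import random

SIZE, CRITICAL = 10, 4

def drop_grain(grid, x, y):
    """Add one grain at (x, y); topple every cell that reaches the
    critical height of 4, passing one grain to each of its 4 neighbors.
    Returns the avalanche size (number of topplings)."""
    grid[x][y] += 1
    unstable = [(x, y)] if grid[x][y] >= CRITICAL else []
    toppled = 0
    while unstable:
        i, j = unstable.pop()
        if grid[i][j] < CRITICAL:
            continue
        grid[i][j] -= CRITICAL
        toppled += 1
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < SIZE and 0 <= nj < SIZE:  # edge grains fall off
                grid[ni][nj] += 1
                if grid[ni][nj] >= CRITICAL:
                    unstable.append((ni, nj))
    return toppled

grid = [[0] * SIZE for _ in range(SIZE)]
avalanches = [drop_grain(grid, random.randrange(SIZE), random.randrange(SIZE))
              for _ in range(5000)]
# After a transient, most drops cause no toppling at all, while a few
# trigger large avalanches -- the signature of the critical state.
```

Plotting a histogram of `avalanches` would show the heavy-tailed avalanche-size distribution characteristic of self-organized criticality.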
Definition: Hierarchy
For this context, we define a hierarchy as a rooted tree. “A tree is an
undirected simple graph G” satisfying the condition that “any two
vertices in G can be connected by a unique simple path. [. . .] A tree is
called a rooted tree if one vertex has been designated the root, in which
case the edges have a natural orientation, towards or away from the
root.” [623, Rooted Tree]
i.e., its distance from the root. This allows us to order the elements partially.
Thus, a hierarchy can be seen as an indication of order.
Definition: Heterarchy
“A heterarchy is a type of network structure that allows a high degree
of connectivity. By contrast, in a hierarchy every node is connected to
at most one parent node and zero or more child nodes. In a heterarchy,
however, a node can be connected to any of its surrounding nodes.”
[623, Heterarchy]
15.2.8 Stigmergy
Definition: Stigmergy
“Stigmergy defines a paradigm of indirect and asynchronous
communication mediated by an environment.” [173, Stigmergy]
15.2.9 Perturbation
Definition: Perturbation
A perturbation is a disturbance which causes an act of compensation,
whereby the disturbance may be experienced in a positive or negative
way. [594, p. 118] (orig. in German)
brane of a cell. With its help, the cell determines which substances gain access
and which are rejected.
15.3.4 Maintenance
15.3.6 Feedback
Negative feedback prevents the system from growing so fast that it would
collapse. Even if it is built of viable components only, it can reach a criti-
cal size where it might break or cannot react to perturbations fast enough.
Therefore, it is necessary to damp positive feedback eventually. In a thermostat,
the positive feedback of increasing the water flow (and thus the heat)
is opposed by the negative feedback of decreasing it. This allows the self-
regulatory adaptation of temperature independently of the environment.
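The interplay of heating and negative feedback in a thermostat can be sketched as a simple control loop. All constants below (setpoint, heater gain, heat-loss coefficient) are illustrative assumptions:

```python
def simulate_thermostat(setpoint=21.0, ambient=5.0, steps=200):
    """Negative feedback: the further the temperature lies below the
    setpoint, the more heat is added; the heating shrinks as the error
    shrinks, which damps the growth and yields a stable equilibrium."""
    temp = ambient
    gain, loss = 0.2, 0.05  # heater gain, heat loss to the environment
    history = []
    for _ in range(steps):
        error = setpoint - temp           # output fed back to the input
        heating = max(0.0, gain * error)  # the heater can only add heat
        temp += heating - loss * (temp - ambient)
        history.append(temp)
    return history

temps = simulate_thermostat()
# The temperature converges to the fixed point where heating balances
# loss (here 17.8 degrees), independently of the colder environment.
```

With purely proportional feedback the equilibrium sits slightly below the setpoint; the point of the sketch is only that the damping loop keeps the system stable regardless of the ambient temperature.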
15.3.7 Criticality
Systems such as the Abelian Sandpile Model (see Section 15.2.6) show that
local influences can have global effects: every cell can only influence its 4
neighbor cells. It depends only on the system’s state whether the addition of another
grain of sand triggers a massive toppling, which may reach every cell on the
grid. In this case, the reason for the uncertainty of the effect is that the system
resides in the state of criticality. As described in Section 15.2.6, criticality
offers a basic stability as well as the capability for changes. Both properties
are essential for evolving systems – instability causes breakdown, while inflexibility
prevents growth and adaptivity. Other phenomena which are regarded
as being self-organizing, like evolution [45] or earthquakes [172], are also
assumed to reside in a critical state. Therefore, criticality can be seen as an
indication for self-organization.
15.3.8 Emergence
It appears that not only the state of criticality but also emergence (see Sec-
tion 15.2.4) connects local influences and global effects. In systems like ant
colonies [174], a set of simple rules in combination with randomness allows
the ants to fulfill tasks like building an ant hill or foraging. Although no ant
knows the overall environment, the swarm as a whole is able to determine
short paths to food sources (a more detailed description is given later in Sec-
tion 15.4.2). Since the effect depends on the whole system and not only on
its parts, it is denoted as emergent.
Emergence is often characterized as being unpredictable. Consider the
appearance of convection cells in the Bénard system (described in Sec-
tion 15.2.2). Although the effects of heating and cooling are known as well
as the properties of water, the rotational direction of the rolls (clockwise or
counter clockwise) cannot be predicted.
15.4 Applications in Computer Science
In this section, we describe three examples which show some of the charac-
teristics specified above and their positive effects.
240 15. Characterization of Self-Organization
This section gives a short overview of small-world and scale-free networks with
regard to self-organization. For more details, we refer to Chapter 6.
Milgram’s Experiment
Small-World Networks
Scale-Free Networks
The scale-free (SF) model [60] describes networks in a more dynamic way
than the small-world model and is mainly based on two mechanisms:
– Dynamic construction and
– Preferential attachment.
The construction process usually starts with m nodes and no edges. New
nodes are added incrementally, and a constant number of edges is attached
to them (to stay within O(n)). The probability that an edge from a new node
n is connected to a certain node with degree ki is given by:
P (n → ki ) = ki / Σj kj .    (15.2)
Due to the additivity and homogeneity (on algebraic grounds) of this func-
tion, the correlation between the degree of a node and the probability for
new nodes to connect to it is linear. This leads to a scale-free network: the
structure of the system is independent of its current size (Figure 15.4). Scale-
free networks show the property that most nodes have a small number of
connections while only a few are highly meshed (“hubs”). This relation can
be mathematically described by a power law: P (k) ∼ k −γ (where γ is a sys-
tem constant). Therefore, these networks are also called power law networks.
Sometimes they are also denoted fractal networks, signifying a correlation to
fractals1 . For power law networks as well as fractals, a part of the system has
the same structure as the whole (Figure 15.4). This property, which is called
self-similarity for fractals, is just another expression for freedom of scale.
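Eq. (15.2) can be turned directly into a small simulation of the construction process (the node count and the number of edges per new node are illustrative choices):

```python
import random

def scale_free_graph(n_nodes=2000, m=2):
    """Grow a network by preferential attachment: every new node
    attaches m edges, choosing targets with probability k_i / sum_j k_j."""
    # Start with a small seed of m+1 nodes of degree m each.
    degrees = {i: m for i in range(m + 1)}
    # Node i appears k_i times in this list, so a uniform draw from it
    # is exactly a degree-proportional choice.
    attachment = [i for i in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(random.choice(attachment))  # P ~ k_i / sum k_j
        degrees[new] = m
        attachment.extend([new] * m)
        for t in targets:
            degrees[t] += 1
            attachment.append(t)
    return degrees

deg = scale_free_graph()
# A few highly meshed hubs emerge, while most nodes keep a degree
# close to m -- the power-law signature P(k) ~ k^(-gamma).
```

Counting how many nodes exceed a given degree reproduces the hub-versus-majority split described in the text.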
Connection to Self-Organization
1
A fractal is a geometric object which can be divided into parts, each of which is
similar to the original object. [623, Fractal].
Fig. 15.4: Example for a hierarchical, fractal, scale-free network. Every part of the
system has the same structure as the whole.
15.4.2 Swarming
Ant Algorithms
Ant algorithms are used to solve a problem by means of many agents called
ants. The strategies of these ants have been inferred from observing nature,
where real ants use chemical substances (“pheromones”) to communicate
with each other. Ants spread pheromones while moving around and detect
the trails of their conspecifics. The substance evaporates over time. Ants
follow traces with a probability proportional to the strength of the pheromone
signals. One application of ant algorithms is the traveling salesman problem
(TSP), i.e., finding the shortest cycle through a given number of cities. Ant
algorithms are based on a model of foraging behavior of real ants as illustrated
in Figure 15.5.
Fig. 15.5: How real ants find the shortest path. 1) Ants move from their nest to
a food source. 2) They arrive at a decision point. 3) Some ants choose
the upper path, some the lower path. The choice is random. 4) Since
ants move approximately at constant speed, the ants which choose the
lower, shorter path reach the opposite decision point faster than those
which choose the upper, longer path. 5) Pheromone accumulates at a
higher rate on the shorter path, so consequently more and more ants
choose this path. 6) The decision of the following ants is influenced by
the higher pheromone concentration.
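The mechanism of Figure 15.5 can be sketched as a two-path pheromone simulation. Path lengths, the evaporation rate, and the deposit rule are simplifying illustrative assumptions:

```python
import random

def simulate_ants(n_ants=5000, evaporation=0.02):
    """Ants repeatedly choose between a short and a long path with
    probability proportional to the pheromone on it. Pheromone
    accumulates faster on the shorter path because round trips
    finish sooner, modeled here as a deposit of 1/length."""
    length = {"short": 1.0, "long": 2.0}
    pheromone = {"short": 1.0, "long": 1.0}
    for _ in range(n_ants):
        total = pheromone["short"] + pheromone["long"]
        path = ("short" if random.random() < pheromone["short"] / total
                else "long")
        pheromone[path] += 1.0 / length[path]  # shorter path, larger rate
        for p in pheromone:                    # evaporation over time
            pheromone[p] *= 1 - evaporation
    return pheromone

ph = simulate_ants()
# Positive feedback concentrates almost all pheromone on the short path.
```

No single ant compares the two paths; the preference for the shorter one emerges from the interaction of random choice, differential deposit rates, and evaporation.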
Connection to Self-Organization
The concept of cellular automata can be traced back to the 1940s when
John von Neumann investigated self-replicating systems [610]. A cellular au-
tomaton can be explained as an accumulation of many deterministic finite
automata which all have the same set of rules. It consists of an infinite, n-
dimensional grid of homogeneous cells ci which have a state s(ci , t) at time
t. For each cell ci , a neighborhood N (ci ) is defined. It can be chosen ac-
cording to certain metrics, e.g., the two neighbors in each dimension, or the
two neighbors and additionally ci itself. For each combination of states of
N (ci ), a rule is defined which determines s(ci , t + 1). The rules are valid for
all cells, so the total number of rules is given by the number of possible states
to the power of the number of cells in the neighborhood, |s|^|N| .
A very simple example is the mod 2-automaton. It is one-dimensional
(n = 1), has two possible states (s(c, t) ∈ {0, 1}) and has only the single rule
for all cells y:

s(y, t + 1) = (s(y − 1, t) + s(y + 1, t)) mod 2 .    (15.3)

Eq. (15.3) implies that s(y, t + 1) is independent of s(y, t). A plot showing
the states of this simple automaton over time is illustrated in Figure 15.6.
Unexpectedly, the automaton builds the structure of a well-known fractal,
the Sierpinski-triangle. Other automata show similar behavior, but there are
differences which lead to the following classification [628]:
– Class 1 (trivial automata): lead to the same state for each cell, independent
of the initial state.
– Class 2 (periodic automata): lead to a fixed state for each cell, dependent
on the initial state.
Fig. 15.6: Visualization of the mod 2-automaton (time progresses from top to bot-
tom): the Sierpinski-triangle
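The mod 2-automaton can be reproduced in a few lines: each cell's next state is the sum of its two neighbors modulo 2. Cyclic boundaries approximate the infinite grid, and the width and step count are arbitrary illustrative choices:

```python
def step(cells):
    """One generation of the mod 2-automaton:
    s(y, t+1) = (s(y-1, t) + s(y+1, t)) mod 2."""
    n = len(cells)
    return [(cells[(y - 1) % n] + cells[(y + 1) % n]) % 2
            for y in range(n)]

width, steps = 63, 31
cells = [0] * width
cells[width // 2] = 1  # a single active cell as initial state
for _ in range(steps):
    print("".join("#" if c else " " for c in cells))
    cells = step(cells)
# The printed rows form the Sierpinski triangle of Figure 15.6.
```

Despite the trivially simple local rule, the global structure that appears over time is a fractal, which is exactly the unexpectedness the text points out.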
Connection to Self-Organization
15.5 Conclusions
16.1 Introduction
In the year 1999, the first Peer-to-Peer system, Napster [436], began its
(short) career in the Internet. The popularity of Peer-to-Peer networks has
grown immensely ever since. Nowadays, the traffic load on the Internet ap-
pears to be dominated by Peer-to-Peer applications (see Chapter 22 for de-
tails). As the downside of this success story, scalability and flexibility issues became
visible. If well understood and carefully implemented, self-organization may
provide a useful means to handle these challenges. But since self-organization
may resist imposed control if applied naively, it can just as well be
a source of inefficiency. Many Peer-to-Peer systems have been advertised
as being self-organizing, although meaning and significance of this claim are
far from being clear. There are several classes of Peer-to-Peer systems that
exhibit different properties with different degrees of self-organization. Peer-to-
Peer systems have to provide services like routing, searching for and accessing
of resources. An open question is whether, and to what extent, self-organization, with
all its elusiveness, can emerge as an essential means for improving the quality of
these services. Such improved service quality is to be achieved equally for
performance, robustness, security, and scalability in a completely open world.
Based on the characteristics as outlined in Chapter 15, we describe criteria
for self-organization of Peer-to-Peer systems. In Section 16.2.1, these criteria
are first introduced and motivated. Following that, the criteria are applied to
some of the more popular unstructured and structured Peer-to-Peer systems
in Section 16.2.2 and Section 16.2.3, respectively. In each case the overall
degree of self-organization incorporated is first identified and then potential
enhancements of self-organization are discussed. In Section 16.3 the Active
Virtual Peer concept is introduced as an example for a higher degree of self-
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 247-266, 2005.
Springer-Verlag Berlin Heidelberg 2005
248 16. Self-Organization in Peer-to-Peer Systems
16.2.1 Criteria
system is (as usual for information systems) imposed from the outside, i.e.,
from the developers, and does not arise in a self-determined way.
The results of our analysis are presented in Tables 16.1 and 16.2. Explana-
tions are given in Sections 16.2.2 and 16.2.3. We use the following structure for
each subsection of Sections 16.2.2 and 16.2.3: the first paragraph gives a very
short description of the analyzed Peer-to-Peer system (for details, we refer
to other chapters of this book). The second paragraph contains our results
concerning the basic criteria identity, boundaries, reproduction, mutability
& organization. The third paragraph deals with the basic criteria metrics
& adaptivity. The last paragraph illustrates conformance to the criteria for
autonomy.
Boundaries × × ×
Reproduction × × × ×
Mutability × × ×
Organization × × ×
Metrics
Adaptivity ×
Feedback × × × ×
Reduction of
complexity
× × × ×
Randomness × × × × ×
SOC × × × × ×
Emergence × × × ×
Table 16.1: Self-organization in unstructured Peer-to-Peer systems. The sym-
bols show the degree of conformance with the criteria listed in Sec-
tion 16.2.1.
● – full conformance, ◐ – partial conformance, × – no conformance
16.2 Evaluation of Peer-to-Peer Systems 251
Napster
Napster [436] was conceived as a platform to share audio data in the well-
known MP3-format.
The server of the system is the only bootstrap node. It admits every peer
to enter, so boundary conditions are not actively enforced by the system
itself. Of course, external policies for admittance could be imposed, but these
policies are not an integral part of the Peer-to-Peer system. Consequently,
one of the essential characteristics of self-organization, namely self-bounding,
is not fulfilled. Neither peer structure nor data is actively reproduced; when
a peer leaves the system, its data is no longer available if not provided by
other peers. Clients cannot take over management tasks from the server and
the server does not share files. As a result, mutability is not given as far
as the system structure is concerned. The type of organization is a mix of
a very flat hierarchy (between the clients and the server) and a heterarchy
(among the clients). The heterarchical organization is of advantage – it does
not affect the whole system if a single peer fails. Unfortunately, clients direct
their search requests to the central server only, which is therefore indispensable
for the operation of the system. Thus the server is a single point of
failure (SPoF) which ruins the positive effects of the heterarchy.
Metrics can be attained by means of keep-alive-messages (ping/pong)
that peers exchange among each other; these messages are an appropriate
way to detect the failure of a node or connection. If such a message fails to
appear, the server can be asked again to locate the respective file at another
peer. Such a form of adaptivity can be of advantage to the
overall system. However, no explicit precaution against or response to over-
load conditions is taken into account. Similarly no defence against possible
DoS attacks or against infiltration by corrupted data is integrated. Thus, the
criterion of metrics is not fully satisfied.
Keep-alive-messages are a form of internal communication, which is an
indication of feedback. But to conform to the criterion “feedback”, reactions
and structural changes are also necessary. This is not the case in Napster, so
feedback is hardly incorporated. No further properties that satisfy the criteria
for autonomy are known.
Gnutella
FastTrack
FastTrack [202] can be seen as a hybrid of Napster and Gnutella. All peers
are equal, but every peer can decide to become a SuperNode, which offers
services to other peers (“successors”) that connect to it. This concept is a
structural response to Gnutella’s usage profile.
The boundaries of FastTrack can at least be somehow influenced, because
the system offers a possibility to limit bandwidth and connection count. When
overload is on the rise, a peer can refuse requests (including requests from
new nodes that want to enter the network). This feature, however, has to be
adjusted manually and is not self-organizing; furthermore, it does not allow
rejecting a request or node based on specific characteristics. Thus FastTrack is
not fully compliant with the criterion “boundaries”. FastTrack clients do not
replicate data without user interaction, so no active reproduction occurs. The
eDonkey
The eDonkey network [185] has strong parallels to FastTrack, but is geared
to the transfer of very large files. In addition to swarming, peers can help
each other by means of “hording”, i.e., swapping received data among each
other so that data does not need to be downloaded from (far away) sources.
In eDonkey, a peer runs either a client application or a server application or
both.
eDonkey conforms to the “boundaries” criterion in analogy to FastTrack:
every interested node can connect to a server, while the server may reject
requests if it is overloaded. Also in eDonkey, no reproduction is done auto-
matically. Every peer (preferably with high capabilities) can run a server.
A server in eDonkey is comparable to the SuperNode concept in FastTrack,
thus the same arguments are valid concerning mutability. The organization
of peers is quite similar, too; as an enhancement, a client can connect to
multiple servers at the same time, which makes the system more robust.
Considerations about metrics are analogous to those for the Peer-to-Peer systems
analyzed above. The failure of a node does not affect the whole network; if
the node ran a server, its clients can connect to another server and continue.
Swarming and hording are designed to reduce the traffic load. The use of
Freenet
The identity of the Freenet approach [231] differs from that of the systems
covered so far. Its purpose is to provide an infrastructure for free and
anonymous information exchange. The detailed mechanisms are described in
[123] and [124].
Like the other approaches, Freenet offers no control for joining or leav-
ing nodes, and thus has no means to decide about its boundaries in a self-
determined way. But it provides the reproduction of data: requested infor-
mation is cached on the nodes between source and target. This results in
the movement of data towards its requesters. More than that, it leads to the
duplication of popular data while unrequested data times out. The data
distribution thus adapts to user requests. All peers are equal; they are organized in the form of
a heterarchy which does not allow for mutability of the system structure. In
addition, every node only knows a fixed number of neighbors, which leads to
inefficiency on the one hand but higher potential for anonymity on the other
hand.
The use of metrics, especially the handling of perturbations, is exception-
ally interesting in Freenet, as it is hard for perturbations to have an effect on
the system at all. A failure of a peer or connection can be tolerated, because
its data (at least the popular part of it) is cached on its neighbors. This is
also why overload of a node is unlikely to occur: the more peers request data
that a certain node holds, the more copies will be made and, thus, later re-
quests will not even reach the original node but be answered by increasingly
“closer” nodes. A similar argument applies to the threat of DoS attacks – the
intended effect is a temporary high load in the network until an attacker gets
swamped by responses. All these properties result from Freenet’s focus on
the data, and not on the peers. An attack or request is not directed at a peer,
but at data (which is adapted on demand). The manipulation of data
is also hard to achieve because of the multiple encryption techniques and the
lack of knowledge about the location of data. In conclusion, Freenet in
fact offers fewer measures and reactions to perturbations but incorporates
prevention by design.
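The load-dispersing effect of caching along the request path can be illustrated with a small sketch. The linear chain of nodes and the cache-everything policy are simplifying assumptions, not Freenet's actual routing:

```python
def request(chain_caches, key, origin_value):
    """Route a request along a chain of nodes toward the data holder;
    on the way back, every intermediate node caches the value, so a
    later request is answered by an increasingly 'closer' node."""
    hops = 0
    for cache in chain_caches:  # nodes between requester and source
        hops += 1
        if key in cache:        # answered before reaching the source
            return hops
    # Value found at the original holder; cache it on every node passed.
    for cache in chain_caches:
        cache[key] = origin_value
    return hops + 1

nodes = [{} for _ in range(10)]
first = request(nodes, "doc", "data")   # travels the whole chain
second = request(nodes, "doc", "data")  # answered by the first node
```

After one full-length request, every subsequent request for the same key terminates at the first hop, so the original holder is shielded from popularity-induced overload.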
Identity | Allocation | Storage & allocation | Allocation | Distribution
Boundaries × × × ×
Reproduction
Mutability ×
Organization
Metrics
Adaptivity
Feedback × × ×
Reduction of
complexity
× × ×
Randomness × × × ×
SOC × × × ×
Emergence × × × ×
Table 16.2: Self-organization in structured Peer-to-Peer systems. The symbols
show the degree of conformance with the criteria listed in Sec-
tion 16.2.1
● – full conformance, ◐ – partial conformance, × – no conformance
Chord
Chord [117], [575] arranges all peers on a ring of size N . Every peer holds a
routing table which contains log N “fingers”, i.e., addresses of other peers.
These are not set arbitrarily but in a way to gain small-world characteristics:
there are many entries for nearby peers and a few for distant ones. Due to the
universality of DHTs, the identity of Chord is not filesharing but allocation
of data in a more general way.
In analogy to Gnutella, every peer can serve as a bootstrap node and
every unknown node is admitted to the system, so the “boundaries” crite-
rion is not met. Chord does not reproduce peers or connections, but the ring
structure is always preserved. Additionally, the redundant storage of data is
possible: when a new node enters the system, it obtains the data it is respon-
sible for while its predecessor may keep a copy of (at least a part of) it. In
case of the departure of a node, its neighbors redistribute the data among
each other. So Chord conforms at least partially to the “reproduction” criterion.
Since the ring structure is a design principle, it is immutable. However,
a smart (or maybe dynamic re-) assignment of IDs can be used to balance the
traffic load so that mutability is partially given. Concerning the organization,
arbitrary connections may exist between peers, so peers form a heterarchy.
Nevertheless, communication paths are not chosen arbitrarily but structured
because every peer has a routing table. This table is different for each peer
so that no SPoF exists. This means that every peer is part of many hierar-
chies (i.e., routing tables with distance as order). Taken altogether, Chord
implements a sophisticated combination of hierarchy and heterarchy.
Metrics are used as in unstructured Peer-to-Peer systems: the failure of a
node can be detected by the absence of keep-alive-messages. If such a failure
occurs, the predecessor of the missing node takes over the responsibilities of
its successor. This allows a high measure of robustness in conjunction with
the application of redundancy. On the other hand, an overload of peers is not
unlikely: since data is portioned in disjoint parts, there is only one peer for
every piece of information. Thus, the nodes which keep popular data are at
high risk to be congested. The problem of intentionally faked data or routing
tables has not been addressed so far. Taking it all together, Chord offers some
mechanisms to support adaptivity.
The peers’ routing tables are periodically checked for consistency. This is
a form of feedback which is explained as follows. Entry i in the routing table
of peer n contains the address of a peer p whose distance to n on the ring
is between 2^i and 2^(i+1). If p fails, n searches for a peer q which is a neighbor
of p and whose distance to n is also between 2^i and 2^(i+1). If such a peer q is
found, n directs the respective requests to q instead of p. Since this means a
change of connections, the system structure is changed (the ring is stabilized).
Thus, messages between peers lead to a more stable system structure which
preserves consistency and efficiency and conforms to the criterion “feedback”.
There is no indication for the satisfaction of another criterion for autonomy.
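The finger structure described above can be sketched as follows: entry i of node n points at the first live peer whose ring distance lies in the interval between 2^i and 2^(i+1). The node set, the ring size, and the linear interval scan are illustrative simplifications, not the real Chord lookup protocol:

```python
RING = 2 ** 10  # ring of size N = 1024

def build_fingers(n, live_peers):
    """Entry i points at the first live peer whose ring distance from n
    lies in [2^i, 2^(i+1)). Re-running the same search after a finger
    fails yields the substitute peer q described in the text."""
    fingers = {}
    for i in range(10):  # log2(N) entries
        for d in range(2 ** i, 2 ** (i + 1)):
            candidate = (n + d) % RING
            if candidate in live_peers:
                fingers[i] = candidate  # first peer inside the interval
                break
    return fingers

live = {1, 5, 18, 70, 300, 600}
fingers = build_fingers(1, live)
# Small intervals near n produce many entries for nearby peers; the
# exponentially growing intervals produce few entries for distant ones,
# which is the small-world mix of short- and long-range links.
```

Because every peer builds its table relative to its own position, all routing tables differ, which is why no single point of failure arises from this structure.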
PAST
PAST [476] is an approach for the allocation and archival storage of data. The
allocation is managed by Pastry [527], an algorithm based on prefix routing.
Each of the b^n peers in a Pastry system is part of n nested clusters and knows
b − 1 peers on each cluster level, which resembles a scale-free network. Further
details can be found in [526], [528].
Every interested peer gains access to the system, so the boundaries are
not exclusively determined by the system. In analogy to Chord, replication
is possible; the actual deployment can be adjusted by the replication param-
eter k on a per-file basis. Since this parameter is an integral part of PAST, at
least reproduction of data can be identified. The system is immutably structured
in the form of a b*-tree with peers being the leaves. But as in Chord, a
dynamic assignment of IDs could be used for load-balancing. Another analogy
to Chord is the organization. Peers may arbitrarily connect to each other
(heterarchy), but communication is forced to traverse along the cluster hier-
archy. Additional data structures named leaf set and neighborhood set
allow direct and thus efficient communication to topological or domain spe-
cific nearby peers. The cluster hierarchy on the other hand offers an upper
bound to the number of necessary hops to deliver a message to distant peers.
Taken together, the organization is designed to support both efficiency
and robustness.
Concerning metrics, PAST differs only slightly from aforementioned Peer-
to-Peer systems – the number of keep-alive-messages sent to nearby peers is
higher than those sent to distant peers. When a failure or overload occurs,
redundancy is used to confine effects locally so the global system is safe-
guarded. The replication parameter, which is crucial to support adaptivity,
is adjusted manually only so that self-organization is limited in that respect.
PAST offers a dedicated security concept that allows authorization via smart-
cards after a peer has joined. This clearly reduces the danger of fakes (which
represent a negative perturbation) but, on the other hand, strongly restricts
the applicability of the system and the number of users.
No property was found to satisfy one of the criteria for autonomy.
CAN
With the approach of content addressable networks (CANs) [504], data is organized
in the form of D-dimensional vectors; for every dimension a different hash
function is used. Requests are routed from a node to both of its neighbors
in every dimension, which leads to a routing complexity of O(D · N^(1/D)) with constant
storage cost for each of the N peers.
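The effect of the dimension on routing cost can be checked numerically: for N evenly spread peers on a D-dimensional torus, the CAN paper reports an average route length of (D/4)·N^(1/D). A small sketch (the even spread is the usual CAN assumption):

```python
def can_route_length(n_peers, dim):
    """Average routing distance in a D-dimensional CAN torus with
    n_peers zones: (D/4) * n_peers**(1/D) hops, i.e. O(D * N^(1/D))."""
    side = n_peers ** (1.0 / dim)  # zones per dimension
    return dim * side / 4.0        # avg. torus distance is side/4 per dim

N = 2 ** 20
costs = {d: can_route_length(N, d) for d in (2, 3, 5, 10)}
# Higher dimensions shorten routes dramatically, at the price of
# maintaining 2*D neighbors per peer.
```

For a million peers, two dimensions yield routes of several hundred hops while ten dimensions yield about ten, which is the trade-off between routing state and path length.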
As in the other cases, no peer willing to enter the system is rejected
and thus the boundaries are not exclusively determined by the system itself.
First ideas considering reproduction of data (similar to Chord) are described
NICE
evokes the problem of freeriders and attackers. Even worse, most systems
do not even have metrics to detect perturbations appropriately, not to mention
reactions to perturbations. While Napster and NICE can be taken as examples
for the power of hierarchies to assure efficiency, Napster and NICE can
also serve as warning examples for the risks of SPoFs that can result from
hierarchies. Networks like native Gnutella, on the other hand, may suffer
from signaling overhead due to flooding in a (fixed) heterarchical connectiv-
ity structure. Most systems have an invariant structure; at most, they are able
to change the distribution of data.
We believe that there exist four areas where the enforcement of control will
be beneficial for such applications.
The first is access control. Participants of Peer-to-Peer overlays are typ-
ically granted access to all resources offered by the peers. These resources
are valuable. Thus, the resource provider, either a content provider or a network
provider, needs to identify and regulate the admission to the overlay. In particular
for Peer-to-Peer file sharing applications, access control should block
off Peer-to-Peer applications or enable controlled content sharing.
The second area is resource management. The resources of individual peers
have to be treated with care, e.g., peers with low-bandwidth connections should not
be overloaded with download requests, and all peers should be exploited equally. For Peer-to-
Peer file sharing applications, for example, content caching capabilities will
improve the performance while reducing the stress imposed on the network.
A third area of interest is overlay load control. Overlay load control copes
with traffic flows inside the overlay. Its goal is to balance the traffic and
load in order to maintain sufficient throughput inside the overlay while also
protecting other network services by mapping this load in an optimum way
onto the underlying network infrastructure.
Finally, the fourth area of control is adaptive topology control. Overlay
connections may be established or destroyed arbitrarily by the peers since
they can join or leave the virtual network at any time. Topology control may
enforce redundant connections, thus increasing the reliability of the service.
In addition, topology control may force the structure of the virtual network
to be more efficient and faster in locating resources when using broadcast
protocols.
Having identified the objectives of control for a Peer-to-Peer overlay, it is
important to examine how adaptive and unsupervised control mechanisms
need to be implemented, without diminishing the virtues of the Peer-to-Peer
model or introducing further complexity and overhead to the network. We
believe that it is vital to preserve the autonomy of the peers inside a Peer-
to-Peer network. Additional control loops, which adapt to the behavior of
a Peer-to-Peer overlay, must not interfere with the autonomous nature of
any Peer-to-Peer application. To achieve this goal, we suggest implementing
control through an additional support infrastructure.
may apply traffic engineering for standard IP routing protocols [212] as well
as for explicit QoS enabled mechanisms like MPLS [630].
Figure 16.2 depicts a scenario where two AVPs, AVP 1 and AVP 2, are
located within a single administrative domain. AVP 1 consists of three AOL
modules and one VCC component, while AVP 2 comprises two AOL modules.
Multiple ordinary peers, denoted by “Peer”, maintain connections to
them. The two AVPs maintain overlay connections to each other. The AOL
modules of the AVPs are in command of the overlay connections. This way,
the AVPs can impose control on the overlay connection.
Having identified earlier the objectives for control of a Peer-to-Peer overlay,
we now examine how the AVP addresses these control issues. Deployed
AVPs create a realm wherein they constantly exchange information. Each
AVP consists of multiple AOL and VCC proxylets which communicate and
collaborate. The exchange of information allows for coordinated control of the
overlay. A realm of AVPs is better suited than a single entity to evaluate the
conditions inside a particular part of a Peer-to-Peer overlay, and this knowledge
is distributed in order to achieve better results. Again, this capability
promotes the flexibility and adaptivity of the AVP approach. Furthermore, an
AVP imposes control by providing effectors on the connection level. So far,
these effectors comprise the Router module and the Connection Manager module.
The Connection Manager enforces control by manipulating the connections
peers maintain with each other. That is a significant difference compared to
most Peer-to-Peer applications where the way peers connect to each other is
random. By applying connection management, the AVP can enforce different
control schemes.
The Router module governs the relaying of messages on application-level
according to local or federated constraints, e.g., access restriction or virtual
peer state information. The Sensor module provides state information for the
distributed and collaborative control scheme.
The proposed concept relies on Active Virtual Peers as the main build-
ing block. The presented AVPs implement means for overlay control with
respect to access, routing, topology forming, and application layer resource
management. The AVP concept not only allows for a flexible combination of
algorithms and techniques but enables operation over an adaptive and
self-organizing virtual infrastructure. The significance of the approach is
based on the automatic expandability and adaptivity of the whole overlay
network as Peer-to-Peer services evolve. From this perspective, AVPs are
inherently different from all other Peer-to-Peer systems. While some other
Peer-to-Peer systems do comply with criteria of self-organization, a similar
autonomy in developing features of self-organization seems unique to AVPs:
unlike other Peer-to-Peer approaches, AVPs adapt to their environment
without manual triggers.
The concept of AVPs is similar to the “ultrapeers” since both apply a peer
hierarchy and reduce signaling traffic. AVPs differ from “ultrapeers”, how-
ever, because of their overlay load control capability and adaptivity to the
underlying network structure. The well-known Kazaa Peer-to-Peer fileshar-
ing service [558] applies a concept similar to “ultrapeers”. In Kazaa these
distinct nodes are denoted as “superpeers”.
The OverQoS architecture [580] aims to provide QoS services for overlay
networks. Dedicated OverQoS routers are placed at fixed points inside an
ISP’s (Internet Service Provider) network and connected through overlay
links. The aggregation of flows into controlled flows of an overlay enables
this architecture to adapt to varying capacities of the IP network and ensure
a statistical guarantee on loss rates. This OverQoS approach complements
and extends the limited load control provided so far in the AOL proxylet.
However, it lacks any adaptivity to the varying network topology as addressed
by the AVP.
Resilient overlay networks (RONs) [26] provide considerable control and
choice on end hosts and applications on how data can be transmitted, with
the aim of improving end-to-end reliability and performance. However, RONs
are mostly restricted within single administrative domains.
16.4 Conclusions
occur automatically and not be manually triggered or come from the outside.
Viable structures should emerge autonomously. Furthermore, the boundaries
of a self-organizing Peer-to-Peer network should be self-determined, a prop-
erty hardly observed with current Peer-to-Peer systems. Detrimental effects
caused by attackers or freeriders should be confined or prevented.
Self-organization reaches beyond the obviously desirable properties like
flexibility, adaptivity or robustness. It includes the use of random compo-
nents that allow the system to create new viable structures. It entails the
appearance of emergent properties triggered by interacting components. It
extends to the state of criticality that allows for appropriate reaction
and restructuring in response to perturbations. A reduction of complexity
enabling scalable growth while maintaining identity is also often seen as a
property of self-organization. In all cases, control that can be exercised
externally is limited to a
minimum.
Decentralized self-management comes very close to the ideal of self-
organization. In addition, it would be desirable if self-organizing Peer-to-Peer
networks could be steered towards a certain overall purposeful goal. Such
steering may take place in accordance with observations made on emergent
properties of self-organizing systems. Emergence relies on simple rules,
adapts sensibly to perturbations according to an “implanted” goal, and does
not develop pathological features under various forms of stress but compen-
sates for stress, to a wide extent, in a reasonable way. Studying emergence
may lead to identifying the simple rules to be implanted into Peer-to-Peer
networks such that purposeful behavior may emerge. Pathological behavior in
terms of detrimental performance or security should be made autonomously
avoidable. If self-organizing systems are seen as systems which “create their
own life”, purposeful and efficient operation may become a big challenge. Self-
organizing Peer-to-Peer systems with the inherent property of well-behavedness
that can be purposefully “implanted” would be the ideal case. How such an
“implant” can be created and inserted is one of the big remaining challenges.
17. Peer-to-Peer Search and Scalability
Burkhard Stiller (University of Zürich and ETH Zürich)
Jan Mischke (McKinsey & Company, Inc., Zürich)
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 269-288, 2005.
Springer-Verlag Berlin Heidelberg 2005
Lookup refers to finding the node hosting data for a particular identifier. For this
work, the canonical search process has been broken into different phases,
where necessary steps and possible short-cuts have been identified. Figure
17.2(a) shows the process from keywords over names and addresses to the
path to target node hosting the desired resources. Of course, for retrieving a
document additional techniques may be applied, such as multiple keyword
search or approximate keyword search, however, the full set of information
retrieval techniques are limited in this section to Peer-to-Peer search and
lookup.
[Figure 17.2: The canonical search process — from keywords via keyword search to names, via lookup to addresses, and via routing to the path to the target node; short-cuts: keyword lookup, name routing, and keyword routing.]
Lookup maps names onto addresses in the network. Addresses specify the network location
of the node hosting the resource with a given name, e.g., the IP address of
the host. Finally, routing is the process of finding a path and moving queries
to the target node.
Three short-cut mechanisms can help optimize search. Name routing com-
bines the (distributed) lookup of the target node address with path identi-
fication and query forwarding to that node. Keyword lookup returns one or
more addresses of nodes hosting resources with given keyword descriptions.
Napster is the most prominent example. Finally, keyword routing directly
routes towards a node hosting specified resources. Keyword routing is some-
times also called semantic routing or content routing. For the Peer-to-Peer
case, the process can be simplified as shown in Figure 17.2(b). Since Peer-
to-Peer systems build on overlay networks, routing becomes a trivial task:
knowing the target node address, the requestor simply creates a new virtual
link to that address. Only a few circumstances (like the anonymity requirement
in Freenet [122]) lead to a more difficult overlay routing approach, which is
an issue separate from search.
Based on these initial discussions, the focus of this chapter is scalable
keyword lookup for a single keyword. With the search process defined
and disaggregated, it becomes obvious that searching requires a series of map-
pings, from the keyword space to the name space to the address space to the
space of paths to nodes. The fundamental structural options in a distributed
environment are the same for each mapping; a complete classification, and
in our view the optimum design space, is provided in Figure 17.3. The
criterion of mutually exclusive and collectively exhaustive branches at each
level has been built directly into the classification.
A mapping can only be defined through a computation or a table. A (pre-
defined) computation is difficult to achieve but some attempts have been
made, usually involving hashing. More widely adopted are tables with (up-
datable) entries for the desired search items, e.g., a node address for each
valid name. Mapping then comes down to finding the desired table entry
and looking up the associated value. In a distributed environment, a table
can either reside on a central entity like a search engine server, or be fully
replicated on each node, or be distributed among the nodes.
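The three distributed-table options differ mainly in where entries live. The distributed case can be sketched with a simple hash-based placement rule; the node names, keys, and the modulo scheme here are purely illustrative, not a scheme from the text:

```python
import hashlib

# Hash-based placement for the distributed-table case (node names, keys,
# and the modulo rule are purely illustrative).
def node_for(name, nodes):
    h = int(hashlib.sha1(name.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["n0", "n1", "n2", "n3"]
table = {}                      # node -> entries it is responsible for
for name in ["songA", "songB", "songC"]:
    table.setdefault(node_for(name, nodes), []).append(name)
# Every peer can compute the responsible node locally -- no central entity.
```

Because the placement rule is deterministic, any peer can locate the responsible node without consulting a central index.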
Distributed tables are the most interesting and challenging in that, for each
mapping, they require collaboratively finding and contacting the node that offers
the desired information or table entry. Two important aspects distinguish
distributed table approaches: the structure of the table, i.e. the distribution
of table entries to nodes, and the physical or overlay topology of the network.
The distribution of table entries can happen at random or in a well-designed
17.1 Peer-to-Peer Search and Lookup in Overlay Networks 273
[Figure 17.3: Classification of mapping options — a mapping relation is defined via computation or via a table; a table may be central, complete on each node, or distributed; a distributed table combines aligned table structure and topology (hierarchical: classical non-symmetric or symmetric redundant hierarchy; non-hierarchical: ordered space), an unaligned structured table structure and topology, or a random table structure and random topology (no neighborhood information, or neighborhood information without/with recursion).]
process leading to a clear target table structure; the same applies for the
distribution of links and, hence, the topology. Whether the table structure
and topology are designed and aligned, or both random or at least one of
them designed but not aligned with the other has a substantial implication
on search.
In a random table structure and random topology, it is natural that each
node at least carries information about itself, i.e. its address, the names of
the objects it hosts, and corresponding keyword descriptions. In addition
to information in their own tables, nodes may have knowledge of the table
entries of their neighbors in an aggregated or non-aggregated form. The
knowledge of neighboring table entries will in some cases be restricted to
the direct neighbors, but can also involve recursion: an arbitrary node A not
only learns about the table entries of its neighbors Bi , but also through Bi
about Bi ’s neighbors Cij , Cij ’s neighbors Dijk , and so on. This way, nodes
eventually know about most or even all keywords, names, or addresses in the
direction of each neighbor in a usually aggregated way.
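This recursive, direction-wise aggregation can be sketched as follows; the topology, keyword sets, and the round-based update are illustrative assumptions, not a protocol from the text:

```python
# Toy line topology A - B - C; keyword sets are illustrative.
def aggregate(topology, own, rounds):
    """know[n][b]: keywords reachable in the direction of neighbor b,
    aggregated over `rounds` hops; the back-direction is excluded."""
    know = {n: {b: set(own[b]) for b in topology[n]} for n in topology}
    for _ in range(rounds):
        new = {n: {b: set(own[b]) for b in topology[n]} for n in topology}
        for n in topology:
            for b in topology[n]:
                for nb, agg in know[b].items():
                    if nb != n:          # do not reflect n's direction back
                        new[n][b] |= agg
        know = new
    return know

topology = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
own = {"A": {"a"}, "B": {"b"}, "C": {"c"}}
know = aggregate(topology, own, rounds=2)
# know["A"]["B"] == {"b", "c"}: everything lying in B's direction from A
```

After enough rounds, each node knows (in aggregated form) which keywords lie in the direction of each neighbor, which is exactly the routing hint described above.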
Rather than keeping explicit knowledge on neighboring table entries,
nodes can exploit implicit knowledge when the table distribution and topology
follow a clear and aligned structure that every node knows. The most com-
mon approach is certainly the classical hierarchy. A root node informs about
table areas represented by a number of second-level nodes. The second-level
nodes, in turn, delegate to third-level nodes for sub-areas within their own
area, and so on, until a request finally reaches the leaf node responsible for
the desired entry. Particularly in the quest for scalable Peer-to-Peer search al-
gorithms, “symmetric hierarchies” have been created by adding redundancy.
In symmetric redundant hierarchies, every node can act as the root or be
on any other level of the hierarchy. This can be achieved by replicating root
information on table areas on each node as well as second-level information
on sub-areas. Symmetric redundant hierarchies show structural similarities
to k-ary n-cubes (cf. [535]).
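The idea that every node can act as the root of such a hierarchy can be illustrated with digit-wise routing over identifier strings; this is a toy construction loosely in the spirit of k-ary designs, and the identifier alphabet and table layout are assumptions:

```python
from itertools import product

digits = "012"
nodes = ["".join(p) for p in product(digits, repeat=2)]   # '00' ... '22'

def contact(node, level, d):
    """A contact sharing `node`'s prefix up to `level`, with digit d there."""
    prefix = node[:level]
    return next(n for n in nodes if n.startswith(prefix) and n[level] == d)

def route(start, target):
    """Digit-wise routing: fix one identifier digit per level."""
    node, path = start, [start]
    for level in range(len(target)):
        if node[level] != target[level]:
            node = contact(node, level, target[level])
            path.append(node)
    return path
```

Since every node keeps a contact for every digit value at every level, any node can play the role of the root, which is the symmetry described above.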
Non-hierarchical structures are also possible and available. In an ordered
space, the table is split into consecutive areas. Each of the areas is represented
on one node. The nodes, in turn, are ordered in the same way, i.e. neighboring
table areas reside on neighboring nodes. Examples of such spaces are rings
or Euclidean spaces, but other forms are possible.
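A minimal sketch of an ordered space, with identifiers on a circle and each node responsible for the arc ending at its own position (the positions and the wrap-around rule are illustrative):

```python
def responsible_node(ring, key):
    """ring: sorted node positions on an identifier circle [0, 100).
    The node at or clockwise-after `key` holds that table area."""
    for pos in ring:
        if pos >= key:
            return pos
    return ring[0]  # wrap around the ring

ring = [10, 35, 60, 90]
```

Because neighboring table areas reside on neighboring nodes, a lookup can always make progress by forwarding towards the key's position on the circle.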
Unaligned table structures and topologies occur when the table is dis-
tributed according to a clear structure, but the topology is random, or the
topology is designed, but the table structure random, or both table and topol-
ogy are clearly structured, but in different ways. While the first case is helpful
to allow for aggregation of table area information, the second case is advan-
tageous for performance improvements compared to a completely random
approach. It appears difficult to gain from the third case.
Designs based on any kind of structured table, regardless of the topology,
are often referred to as Distributed Hash Tables (DHTs).
fault tolerance of the network from the perspective of a single peer, the
node degree can be a significant inhibitor for scalability: The node degree
determines the size of the routing table on each peer with the according
impact on memory consumption and processing power.
– Wire Length (τ̄ )
The wire length is the average round trip delay of an overlay link, con-
tributing to the latency in the system. The wire length is closely related
to mapping an overlay network properly onto a physical network: A low
wire length in a Peer-to-Peer overlay network can be achieved by choosing
neighbors that are also neighbors or at least physically and topologically
close in the underlying network. Closely related to wire length is the notion
of stretch: the stretch of a path in an overlay network is the ratio of the
total number of physical network hops underlying the overlay path that
separates two peers to the minimum number of physical network hops between
the two when routing is not confined to the overlay network.
contacting an arbitrary peer in the system, but query traffic can eventually
lead to a collapse. In order to quantitatively assess the above measures, it is
necessary to define suitable metrics, which determine a standard of measure-
ment that can be applied to a corresponding dimension. This metric quantifies
a dimension in that it associates each pair of elements of that dimension with
a number or parameter reflecting the distance of these members along that
dimension. The metric also defines the unit to measure the distance in.
system malfunctions. Significant latency can result from queries traversing
several peer nodes. None of this latency is productive; all of it is overhead.
Latency is measured in milliseconds (ms).
All resource consumption of a Peer-to-Peer system that is not defined as
productive above is considered overhead. More specifically, the overhead com-
prises all protocols and functionality of the distributed architecture that are
necessary to operate the system and to give it certain properties. In a math-
ematical notation, total resource consumption can be represented by a vector
p_tot = p_prod + p_OH;   p = (ProcessingPower, Memory, Bandwidth, Latency)^T,
p_prod = ε · p_tot;   ε = Diag(ProcessingPowerEfficiency, MemoryEfficiency, BandwidthEfficiency, LatencyEfficiency),
where Diag() creates a diagonal matrix from a vector, i.e., a matrix with all
zeroes except for the diagonal elements. Like the resource consumption, the
efficiency heavily depends on the scale σ.
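With illustrative numbers, the split of total consumption into productive share and overhead reads:

```python
# Illustrative numbers only; resources follow the vector above.
p_tot = {"processing": 10.0, "memory": 4.0, "bandwidth": 2.0, "latency": 0.5}
eps   = {"processing": 0.6,  "memory": 0.5, "bandwidth": 0.8, "latency": 0.2}

# p_prod = Diag(eps) . p_tot (element-wise); the remainder is overhead
p_prod = {r: eps[r] * p_tot[r] for r in p_tot}
p_oh   = {r: p_tot[r] - p_prod[r] for r in p_tot}
```

The diagonal matrix acts element-wise, so each resource has its own efficiency factor.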
Scale is the size and frequency of tasks or the system performing the tasks
along possibly multiple dimensions. In light of the variety of possible dimen-
sions scale can refer to, it is necessary to identify a set of requirements that
17.2 Scalability in Peer-to-Peer Systems 279
σ = (# peers, % online, objects per peer, task frequency, processing size of task, memory size of task, bandwidth size of task, ...)^T,
where these dots symbolize additional scale dimensions that may be required
for specific systems.
In summary, mechanisms for managing the (resources of the) Peer-to-Peer
network include the Peer-to-Peer overlay network management, driven by the
rate of joins to and departures from the overlay Peer-to-Peer network, and a QoS
control. Mechanisms for offering and retrieving services in a distributed en-
vironment address (a) a service description and classification driven by the
variety and complexity of services, (b) lookup and search driven by users’ to-
tal queries per successful query/task, (c) pricing, indexing, and advertising,
(d) negotiations and the percentage of negotiated tasks, and (e) contract-
ing driven by the percentage contracts per task. Mechanisms for fulfilling
a service cover accounting and charging and invoicing, a.o. driven by the
number of clearing or payment authorities. The set of organizational and
self-learning mechanisms for peers includes the building and maintenance of
peer groups, affected by the diversity of interests of peers, and the reputation
of peers, affected by the frequency of reputation updates. Finally, security
mechanisms, such as identification, authentication, authorization, encryption,
and decryption, are driven by the percentage of tasks requiring the respective
security mechanism.
Building on these notations for scale and efficiency, scalability can now be
captured mathematically. Strict scalability, the asymptotically constant effi-
ciency requirement, can be translated into
ε = ε(σ) → const  for  σ → ∞.
Relative to a reference system, this yields the scalability matrix
Σ = lim_{σ→∞} ε_System(σ) · (ε_Reference(σ))^{-1}.
All elements of the diagonal refer to the corresponding resource in the effi-
ciency matrix, i.e., Σ1 is the processing-power scalability, Σ2 is the memory
scalability, Σ3 is the bandwidth scalability, Σ4 is the latency scalability. Each
of these scalability metrics depends on all or a subset of the scale dimensions
σi , i = 1, 2, ...
applying the scale vector defined in Section 17.2.1 and the additional dimension
queries per task, which was identified as relevant for search; “size of task”
stands for the bandwidth, memory, or processing size of a task, respectively.
This expression can be simplified. First of all, task size is only relevant
for productive resource consumption to accomplish tasks, not for overhead.
Second, # peers and % online will only appear in product form, so it is pos-
sible to aggregate. Similarly, it is possible to aggregate the query frequency:
query freq = task freq · queries per task. Finally, search functionality can be
separated into query (routing and processing) and overlay network manage-
ment. The query overhead will be directly proportional to query freq, while
the overlay network management will be independent of it. Abbreviating
p_OH,search(σ) = p_query(n, |o|) · query_freq + p_OVLmgmt(n, |o|),
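A small numeric sketch of this overhead expression, with placeholder cost functions standing in for p_query and p_OVLmgmt; the logarithmic costs are an assumption (roughly Chord-like), not values from the text:

```python
import math

# Sketch of the overhead model: per-query cost scaled by query frequency,
# plus query-independent overlay maintenance. The log-cost functions are
# placeholder assumptions, not taken from the text.
def p_oh_search(n, o, query_freq, p_query, p_ovl_mgmt):
    return p_query(n, o) * query_freq + p_ovl_mgmt(n, o)

oh = p_oh_search(n=1024, o=100, query_freq=5.0,
                 p_query=lambda n, o: math.log2(n),         # hops per query
                 p_ovl_mgmt=lambda n, o: 2 * math.log2(n))  # table upkeep
# oh = 10 * 5 + 20 = 70 cost units
```

The split makes visible that the query term scales with query frequency while the maintenance term does not.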
17.3 A Scheme for Lookup and Search Overlay Scalability 283
The pruning probability p_p,0 at node 0, the requesting node, will usually
be zero.
The latency L for a query is driven by the characteristic path-length PL_Ch
and the wire length τ̄ as well as the pruning factor f_p:
L = PL_R · τ̄ = PL_Ch · f_p · τ̄ ,
where E[·] yields the expected value in case the request path-length PL_R,
the node degree d, and/or the routing efficiency are random variables.
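Plugging illustrative values into the latency relation above:

```python
# Illustrative values, not measurements.
PL_ch = 4.0   # characteristic path-length (overlay hops)
f_p   = 0.75  # pruning factor
tau   = 80.0  # average wire length per overlay hop, in ms

PL_r = PL_ch * f_p   # request path-length actually traversed
L    = PL_r * tau    # resulting query latency in ms (here: 240 ms)
```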
As for the latency, the characteristic path-length and the pruning prob-
ability influence the bandwidth overhead (and scalability) in a major way,
bearing in mind PL_R = PL_Ch · f_p. Furthermore, the routing efficiency plays a
significant role. It is also obvious that the packet size should be kept as small
as possible. The equation further suggests that the node degree be kept low.
However, this applies only if the routing efficiency is smaller than 1. And even
then, a lower node degree entails a larger characteristic path-length with its
negative influence on aggregate bandwidth. Note that a higher node degree
also increases in principle the bandwidth available as it augments the number
of links from or to a node. However, these links are only virtual links in the
overlay network that all have to be mapped onto one and the same physical
access line of a node.
Memory overhead for search is mainly driven by the state information
to be kept on each node. In particular, this is the size of the routing table,
determined by the node degree d, as well as any other state information like
object links to objects on remote nodes. Processing power is mostly consumed
for query routing. Hence, the routing table size and thus the node degree
should be kept low to keep processing overhead in bounds. The frequency of
messages to be routed will automatically be optimized when attempting to
reduce the number of aggregate messages magg .
The overlay network management overhead is too system-specific to be
properly addressed in this general section. It comprises all tasks to create
and maintain overlay network links and routing tables. In random networks
like Gnutella, e.g., it is limited to ping and pong messages only, whereas it
becomes more complicated in structured networks like Chord. Typical tasks
then include the insertion of new nodes and new object links into the overlay.
As for queries, the path-length and the aggregate number of messages for
these insertion events have to be evaluated.
Table 17.2: Scalability Assessment Scheme for Peer-to-Peer Lookup and Search
Having outlined the key concepts on search and lookup in overlay networks
and having defined the model for a formal Peer-to-Peer scalability approach,
[Figure: SHARK example topology — node A, the neighbors of node A, and further peers joined by overlay links; nodes are arranged on levels L1 and L2 along dimensions L1D1/L1D2 and L2D1/L2D2, with node coordinates such as (7,2) and (2,3).]
This overview on Peer-to-Peer lookup and scalability has shown that effi-
ciency and scalability of these mechanisms can be formalized and may have
impacts on existing systems. A heuristic approach to scalability evaluation
currently prevails. Therefore, this work provides an analytical yet pragmatic
assessment scheme that can help to formalize and standardize scalability in-
vestigations of Peer-to-Peer systems. As a newly proposed scheme, SHARK
has been outlined as a scalable adaptation of a symmetric hierarchy, which
autonomously arranges nodes and objects in the network into semantic
clusters.
18. Algorithmic Aspects of Overlay Networks
Danny Raz (Technion, Israel Institute of Technology)
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 289-321, 2005.
Springer-Verlag Berlin Heidelberg 2005
exact algorithm and the size of the data. We denote this complexity by P(k),
where k is the size of the input to the computation. In this chapter we follow
[507] and assume that P(k) = P · k for some constant P, since we at least
need to copy all the data from the fast layer to the computation layer.
Recall that a link in the overlay network is a virtual link and it may be
a long path in the underlying network. Figure 18.1 depicts such a network.
The big circles represent overlay nodes and the wide dotted lines represent
links between nodes in the overlay network. The translation of overlay links
to physical links depends on the routing in the underlying (IP) network. For
example a packet going from node C to node D may be routed through node
E, even though E is not connected to C in the overlay network.
There are two different methods in which overlay networks can handle
routing. In the first one, the overlay network takes care of routing, and thus
packets are forwarded by the overlay layer to the appropriate virtual neighbor.
For example, if node C needs to send a message to node G, it may route it
through node B; thus the actual packet may travel more than once over
several links of the underlying network. The second routing method uses
the underlying routing, i.e., if node C needs to send a message to node G, it
will get its actual address (say IP address) and will send the message directly
to this address. The actual performance of the algorithms depends, of course,
on the routing methods, and we will carry out the analysis for each of the
methods.
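The hop-count difference between the two methods can be made concrete on the example nodes above; the distance table is hypothetical:

```python
# Hypothetical underlying hop counts between overlay nodes of Figure 18.1.
dist = {("C", "B"): 2, ("B", "G"): 3, ("C", "G"): 4}

def hops(a, b):
    return dist.get((a, b)) or dist.get((b, a))

# Method 1: forward via the overlay neighbor B (C -> B -> G).
overlay_hops = hops("C", "B") + hops("B", "G")
# Method 2: resolve G's address and send directly over the underlay.
direct_hops = hops("C", "G")
```

With these numbers, overlay forwarding costs 5 underlying hops while the direct send costs 4, illustrating why the analysis must be carried out per routing method.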
Now, under this model, the time complexity (i.e. the time it takes to
complete the task) of data collection is no longer just a function of the length
of the paths along which the data is collected. In fact, this time complexity
also depends on the number of overlay network nodes in which the data is
processed, and on the complexity of the processing algorithm itself.
In particular, it is no longer true that the popular solution of collecting
data along a spanning tree as described in [37] is optimal.
[Figure 18.1: An example overlay network — the big circles (nodes A–G) are overlay nodes, the wide dotted lines are links between overlay nodes, and the shaded areas represent the underlying physical network.]
Consider for example the tree described in Figure 18.2. Since the amount of information
passed along each of the nodes on the long path is O(n), the overall time
complexity is Ω(P·n²). Another naive solution to this problem is to itera-
tively query each node in the network; however, this solution requires that
(for the same tree) n separate messages arrive at the root, and an overall
Ω(n²) messages if the overlay network takes care of the routing. The goal
is to develop algorithms that have both linear time and linear message
complexity.
We start with a simpler problem, where data is collected along a given
path in the overlay network. We study different simple algorithms for this
case (similar to the one presented in [507] for the active networking case) and
analyze their performance in the two overlay routing methods. Then we turn
to the algorithm, called collect-rec in [507], that uses recursion to collect data.
When the algorithm collects data from a path, it partitions the path into two
segments, and runs recursively on each segment. The data collected from the
second segment is sent to the first node in the path. The complexity analysis
of this algorithm shows that its time complexity is O(nP + nC) and its
message complexity is O(n log(n)) in the first routing method, as in the
active networking case. If routing in the overlay network is done using the
second method, then the time complexity is O(nP + n·d_n·C) and the message
complexity can be bounded by min{O(n·d_n·log(n)), O(d̄·log(n))}, where d_n
is the average distance (in underlying network hops) between overlay
neighbors, and d̄ is the average distance (in underlying network hops)
between any two overlay nodes.
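The recursive halving that drives these bounds can be sketched as follows; this toy version only counts shipped data units, and the helper names are illustrative, not taken from [507]:

```python
# Sketch of the recursive halving idea: collect data along a path by
# splitting it in two, recursing on each half, and shipping the second
# half's result to the head of the path.
def collect_rec(path, msgs):
    """Returns data gathered from `path`; records (src, dst, units) ships."""
    if len(path) == 1:
        return [path[0]]              # a node's own unit of data
    mid = len(path) // 2
    first = collect_rec(path[:mid], msgs)
    second = collect_rec(path[mid:], msgs)
    msgs.append((path[mid], path[0], len(second)))  # ship result to head
    return first + second

msgs = []
data = collect_rec(list(range(8)), msgs)
units_shipped = sum(m[2] for m in msgs)  # 12 = (n/2) * log2(n) for n = 8
```

Each recursion level ships n/2 data units in total, and there are log n levels, which is where the n log n behaviour comes from.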
The more general problem is the problem of efficiently collecting data from
a general overlay network. We are given a network with a specific node called
root, where the number of nodes in the overlay network is n, its diameter is D,
the number of logical links is m, and each node holds one unit of information.
The root intends to collect the data from all the nodes in the overlay network.
We assume that the network topology is arbitrary and no global routing
information is available to the nodes. That is, a node in the overlay network
may know the name (or IP address) of its logical neighbors, but no global
information about the structure of the overlay network is known. Our aim is
to develop an algorithm that solves the problem defined above, with minimal
time and message complexity.
The need for such data collection depends on the specific network. In the
popular Peer-to-Peer networks, one can think of obtaining a snapshot of the
network at a given time, or finding the number of copies of a very popular
file. For CDNs, obtaining usage statistics is a constant need, and doing so
with minimal delay and network overhead is highly desirable. As shown before,
the naive implementation of collecting data along a given spanning tree may
perform badly. One possible approach is to extend Algorithm collect-rec [507]
from a path to a general graph. We explain this method, and generalize the
data collection algorithm to an algorithm that collects an arbitrary amount
292 18. Algorithmic Aspects of Overlay Networks
of data from any link on a path. This last algorithm is then used as a building
block in the more general algorithm that collects data in an almost optimal
way from any given spanning tree.
For many overlay applications (such as CDNs), assuming that the overlay
network maintains a spanning tree is a very natural assumption, as some
information should be sent to (or collected from) all the nodes. For other
applications, like a Peer-to-Peer file sharing application, maintaining a global
spanning tree may be too expensive, since the number of users that join (or
leave) the Peer-to-Peer network per time unit is too large. Nevertheless, in all
non-trivial proposals for a structured Peer-to-Peer network, maintaining such
a tree requires at most a very small change to the existing infrastructure.
If a spanning tree does not exist, one will have to create a spanning tree
and run the algorithm on top of it (assuming the amount of data collected
is big enough). In order to create such a tree, one can use the well-known
algorithm [37], in which every node that receives a message indicating the
creation of the tree sends this message to all its neighbors. The creation of
such a tree in our model takes O(CD + mP ) time and O(m) messages.
This can be naturally included in the number assigning step of Algorithm
collect-rec, resulting in a message complexity of O(m + n log(n)) and a time
complexity of O(mP + nC).
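The flooding step of [37] can be sketched as follows — a minimal Python model in which the adjacency-list representation and the BFS queue (standing in for message arrival order) are our assumptions, not part of the chapter:

```python
from collections import deque

# Sketch of the flooding-based tree creation: every node that receives the
# creation message for the first time records the sender as its parent and
# forwards the message to all of its neighbors.
def build_spanning_tree(adj, root):
    """adj: {node: [neighbors]}; returns {node: parent}, parent of root is None."""
    parent = {root: None}
    queue = deque([root])          # BFS order stands in for message arrival order
    while queue:
        u = queue.popleft()
        for v in adj[u]:           # u forwards the creation message to all neighbors
            if v not in parent:    # v accepts the first copy it receives
                parent[v] = u
                queue.append(v)
    return parent

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(build_spanning_tree(adj, 0))  # {0: None, 1: 0, 2: 0, 3: 1}
```

Note that every link is examined at least once, which is where the O(m) message bound comes from.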
The rest of the chapter is organized as follows. We start with the formal
definition of the model; then, in Section 18.3.1, we describe algorithms that
collect information from a path in the overlay network. In Section 18.4 we
describe a data collection algorithm that works on a single path, but where
the amount of data in each node is not fixed. Apart from being an interesting
problem in itself, this algorithm provides the basic building block for the more
general algorithm, weighted collect on trees, described in Section 18.5. This
algorithm collects information from a tree that spans the overlay network.
We deal with the creation of such a spanning tree in Section 18.6, and with
general functions that can be computed from the nodes' data in Section 18.7.
18.2 Model Definition
(Figure: the node model — an Application Layer (AL) module on top of a Fast Network Layer (FNL) module, attached to a router.)
This model is similar to the one introduced in [507] for active networks.
It distinguishes between two types of delay: the delay of a message that
only passes through the FNL module, and the delay of a message that
also triggers a computation in the AL. A message that passes only through
the FNL suffers a constant delay. We bound this delay by the constant C;
note that in practice the propagation time between neighbors (in the overlay
network) may vary from 1 ms (in a LAN) to 80 ms (in a WAN). Since in most
systems messages are exchanged using the TCP protocol over links that
have enough bandwidth (say, more than 144 Kb/s), the propagation delay of
messages is almost constant, which justifies this constant-delay assumption.
The problem can be stated as follows. A node v seeks to learn the ids of the
nodes along the route (in the overlay network) from itself to another node
u. Recall that in our model, v only knows the id of the next hop node along
this route.
In a naive implementation (naive), node v queries its next hop node for the id
of the second hop node. Then it iteratively queries the nodes along the route
until it reaches the one leading to the destination. This method resembles
the way the traceroute program works, but it does not use the TTL field
which is not part of the model. The delay of the naive algorithm is comprised
of n activations of an AL level program plus the network delay. The average
network delay of the i-th query in the first routing method is 2 i dn C, for
i = 1, 2, . . . , n − 1, which sums up to O(n^2) dn C time units. The message
complexity in this case is given by ∑_{i=1}^{n−1} 2 i dn = O(n^2) dn. Using
the second routing method, each message travels on the average d̄ underlying
network hops, and thus the average delay of the algorithm is O(n(P + d̄ C)),
and the average message complexity is 2 n d̄.
Next we describe two simple algorithms, collect-en-route and report-en-
route, (presented in [507] ) that improve the above solution to the route
exploration problem, and analyze their performance using our model. Fol-
lowing this discussion we turn to more sophisticated solutions that achieve
near optimal performances.
collect-en-route
1. for MSG*(s, d, list)
2. if i == d
3. send Report(list|i) to s
4. else
5. send MSG*(s, d, list|i) to d
The total delay of collect-en-route is bounded by 2 n dn C + ∑_{i=1}^{n} i P =
2 n dn C + (n(n+1)/2) P for the first routing method, and by
n dn C + C d̄ + ∑_{i=1}^{n} i P = n dn C + C d̄ + (n(n+1)/2) P for the second routing
method. Note that this algorithm is somewhat more sensitive to packet loss
than the previous (and the following) one since no partial information is
available at the source before the algorithm terminates. Furthermore, the
time-out required to detect a message loss here is significantly larger than
with the other algorithms presented here.
report-en-route
1. for MSG*(s, d, c)
2. send Report(id, c + 1) to s
3. if i != d
4. send MSG*(s, d, c + 1) to d
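To make the contrast between the two schemes concrete, here is a toy Python count of overlay messages on a path of n nodes; the unit-size hops and the counting model are our simplification, not the chapter's:

```python
# Toy message counts on a path of n overlay hops (simplified model):
# collect-en-route forwards one growing message and the destination sends a
# single report back; report-en-route makes every visited node send its own
# unit report back to the source, the report of node i travelling i hops.
def collect_en_route_msgs(n):
    return n + 1                      # n forward hops + one report back

def report_en_route_msgs(n):
    return n + sum(range(1, n + 1))   # n forward hops + an i-hop report per node i

print(collect_en_route_msgs(8), report_en_route_msgs(8))  # 9 44
```

The linear versus quadratic growth mirrors the linear and quadratic message complexities discussed in the text.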
The message complexity using the first routing method is clearly O(dn n^2),
and using the second routing method it becomes O(n(dn + d̄)). The communi-
cation delay for this algorithm using the first routing method is 2 dn n C since
18.3 Gathering Information Along a Path 297
exactly one message traverses the route in the forward direction until the
destination, and this message is then sent back to the source. If we use the
second routing method, the communication delay of the message sent from
the destination back to the source is C d̄. The execution delay in all the nodes
is P since the message length is exactly one unit. The total delay is given
then by n(2 C dn + P ) for the first routing method, and n(C dn + P ) + C d̄ for
the second routing method.
Algorithm collect-en-route features a linear message complexity with a
quadratic delay, while algorithm report-en-route features a linear completion
delay with a quadratic message complexity. Combining these two algorithms
we can achieve tradeoffs between these two measures. In particular, if both
measures are equally important we may want to minimize their sum.
Report-Every-l
18.3.2 collect-rec
A different approach, however, is needed in order to reduce both the time and
message complexity. The following algorithm, collect-rec, achieves an almost
linear time and linear message complexity. The main idea is to partition the
path between the source and the destination into two segments, to run the
algorithm recursively on each segment, and then to send the information
about the second segment route from the partition point to the source via
the FNL track. In order to do so on the segment (i, j), in each recursive step
one needs to find the id of the partition point, k, and to notify this node, k,
that it has to both perform the algorithm on the segment (k, j) and report to
i. In addition, i has to know that it is collecting data only until this partition
point, k, and it should get the rest of the information via the fast track.
The partition can be done, naively, in two passes. First we find the segment
length. Then sending the segment length and a counter in the slow track
allows k to identify itself as the partition node.
The idea behind the algorithm, as described above, is very simple. However,
the detailed implementation is somewhat complex. A pseudo-code implementation
of this algorithm is given below.
main(i, d, l)
1. if l = 0
2. return({val_i})
3. send Reach_port(i, d, l/2, l/2) to d ¹
4. L1 ← main(i, d, l/2)
5. L2 ← receive_port()
6. return(L1|L2)

collect(d)
1. l ← getlength(s, d)
2. L ← main(s, d, l)
3. return(L)

¹ Subindexing with port indicates a possible implementation where the port
number is used to deliver incoming messages to the correct recursive instantiation.
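The recursive structure of collect-rec can be sketched in Python; in this toy model (names ours, not the chapter's) the message passing is collapsed into plain recursive calls:

```python
# Sketch of the collect-rec recursion on a path of node ids: the source
# collects the first half via the slow-track recursion, while the partition
# node runs the same procedure on the second half and returns its result
# directly via the fast track (modeled here as a second recursive call).
def collect_rec(path, i, l):
    """Collect the ids of the l nodes path[i], ..., path[i+l-1]."""
    if l == 0:
        return []
    if l == 1:
        return [path[i]]
    half = l // 2
    first = collect_rec(path, i, half)               # slow-track recursion
    second = collect_rec(path, i + half, l - half)   # done by the partition node
    return first + second                            # second half arrives via FNL

path = list("ACEGIKMO")
print(collect_rec(path, 0, len(path)))  # ['A', 'C', 'E', 'G', 'I', 'K', 'M', 'O']
```

Each id is copied O(log n) times along the recursion, which is the source of the n log(n) terms in the complexity bounds.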
(Figure: panels (a) and (b) showing the logical trees built by the algorithm over the path nodes A, C, E, G, I, K, M, O.)
For the second routing method we get the following recursive formula:
TC(n) ≤ TC(n/2) + P n/2 + (P + C dn n/2 + C d̄),
concentrated at the end of the path. The collect-rec algorithm is also not
optimal for this problem, since when all the data is concentrated in the last
node the data is copied log(n) times during its transmission to the first node,
and thus the execution complexity of collect-rec will be O(n log(n) P ). Thus,
we developed a new algorithm that improves collect-rec and solves this problem
more efficiently.
Our problem can be formally defined as follows: We are given an overlay
network with two designated nodes i and j. Node i intends to collect data
from all the nodes along the path from itself to j, where the amount of data in
each node may vary. Let the length of the path be n and the total amount of
data in all the nodes along the path be n̄. We assume that no global routing
information is available to the nodes, and thus every node knows only the ids
of its neighbors. Our goal is to develop an algorithm that solves the described
problem while minimizing the message and time complexity.
Our algorithm, called algorithm weighted collect-rec, follows the steps of the
collect-rec algorithm. In order to collect the data along the path from node
i to node j the algorithm partitions the path into two segments, runs itself
on each segment recursively and then sends data from the second segment to
i via the Fast Network Layer. However, in our case we cannot assume that
the amount of data in each segment is proportional to the segment size since
nodes may hold an arbitrary amount of data. Thus we need to adjust the
destination of each node’s data according to the amount of data that the
node sends, and the total amount of data in the system.
Consider the logical tree properties of algorithm collect-rec described in
the previous section. Item 2 is not valid for this logical tree if the data is
distributed arbitrarily along the path. In the example in Figure 18.9.a, node
O has 4 units of data while each other node has one unit of data. Since O
has extra data, the amount of data processed by node M is not bounded by
2². In this case algorithm weighted collect-rec adjusts the logical tree in such
a way that O sends its data directly to node I (see Figure 18.9.b). Let n̄ be
the total amount of data in the tree; we assume for simplicity that n̄ ≥ n and
term the value n̄/n the amount ratio. The adjusted tree has the following
properties:
1. The number of nodes with height h is bounded by O(n/2^h).
2. If the amount of data that a node sends to its father is greater than n̄ 2^{l−1}/n
and smaller than or equal to n̄ 2^l/n, then the data is copied at most
log(n) − l times during the algorithm run.
3. If the amount of data that a node sends to its father is greater than n̄ 2^l/n
and smaller than or equal to n̄ 2^{l+1}/n, then the distance on the path from
the node to its father in the logical tree is bounded by O(2^{log(n)−l}).
These properties allow the weighted collect-rec algorithm to achieve good
complexity. The weighted collect-rec algorithm uses two parameters to determine
the distance of the data transmission: the node's position in the basic
logical tree, and the amount of data that the nodes in its logical subtree hold.
The algorithm consists of three phases. During the first phase the algorithm
computes the path parameters, i.e., the path length and the total amount
of information along the path, assigns a length level to each node, and defines
the destination for each node according to the basic logical tree, i.e., a tree
that does not consider the amount of data in each node. The purposes of the
second phase are to adjust the length of transporting data when required,
18.4 weighted collect-rec Algorithm 303
and to allow each node to know the amount of data it should receive during
the last phase and the id of the parent node in the adjusted logical tree. In
the third phase, all the nodes send their own data and the data they received
from other nodes to their parent in the adjusted logical tree. At the end of
this phase, the root (the first node in the path) receives all the data, and the
algorithm accomplishes its task.
Phase 1
In this phase the algorithm builds the basic logical tree (that of the collect-rec
algorithm) and computes the amount ratio, which indicates the total amount of
data in the path. Later, in the next phase, this information is used to modify
the logical tree as needed. Recall that the logical tree defines the partition
of the path into segments such that node i is responsible for collecting the
information from segment [i, i + segmentsize_i]. The same idea is deployed in
the weighted collect-rec algorithm. The weighted collect-rec algorithm builds
a logical tree (see for example Figure 18.9.a), in which links indicate the
responsibility of nodes for segments.
At the beginning of the algorithm the root sends a message with two
counters towards the last node in the path. The counters are the length and
total data amount of the path. The message passes through the Application
Layer until it reaches the last node in the path. Each node that is traversed
by this message, increments the length and adds its local data amount to the
total data amount. The last node sends the message to the root through the
Fast Network Layer.
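Collapsing the message passing into a loop, the Phase-1 counter update can be sketched as follows (the function name and representation are ours):

```python
# Sketch of Phase 1: a message carrying (length, total_data) walks the path
# through the Application Layer; every traversed node increments the length
# counter and adds its local data amount. The last node then returns both
# counters to the root via the Fast Network Layer.
def path_parameters(data_amounts):
    length, total = 0, 0
    for amount in data_amounts:   # one update per traversed node
        length += 1
        total += amount
    return length, total          # sent back to the root on the fast track

print(path_parameters([1, 1, 4, 1]))  # (4, 7)
```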
When the root receives the message containing the path parameters, it
initiates a recursive partition process. During this process each node i sends
partition messages to nodes with numbers (i + segmentsize_i/2^k), where
k ranges from 0 to i's length level − 1. The messages are delivered through
the Application Layer. Each such message, destined to node i, contains the
address of its sender (i.e., i’s parent in the logical tree who is also called i’s
destination), i’s length level that describes the length of i’s segment, and the
amount ratio. Partition messages are sent via the Application Layer using a
counter in the same way this process is done in algorithm collect-rec. Upon
receiving the partition message each node stores all received parameters, it
sends a hello message containing its id to its parent in the logical tree, and
initiates a partition process on its segment. At the end of this phase, each
node knows the ids of its parent and of its children in the basic logical tree.
Note, that the root’s length level is log(n) and the size of its segment is n.
Phase 2
As mentioned above, the goal of the second phase is to compute the adjusted
logical tree. This includes determining the ids of the parents and children in
the new tree, and computing the amount of data that will be collected. This is
done by collecting information regarding the amount of data in each segment,
and disseminating this information along the hierarchy toward the root. This
can be viewed as running algorithm collect-rec when the data collected is the
amount of data at each node, and the length of its segment. This information
is sufficient in order for each node to determine which of the nodes in its
subtree are its children in the adjusted tree, and how much information each
one of them sends.
This phase starts with a message ComputeAdjustTree(n, n̄, idList), sent
by the root using the fast track along the basic logical tree. Upon receiving
this message, each node in the tree stores locally the idList that contains
the ids of all the logical nodes from the root to it, adds its id to the idList,
and forwards the message to the children in its basic logical tree. When a
leaf gets this message, it computes the id of its destination according to
the idList and the amount of data it has, according to the following rule:
if the amount of data is greater than n̄ 2^{l−1}/n and smaller than or equal to
n̄ 2^l/n, then the destination is the (log(n) − l)-th element in the idList. It
then sends an AdjustTree(DataList) message to its parent in the basic logical
tree. Upon receiving an AdjustTree(DataList) message from all children, each
intermediate node computes the amount of data it should receive, the ids of
its children in the adjusted tree, and by adding its own data amount, the
amount of data it needs to send up the tree. Using this data and the rule
specified above, the node can use the idList it stored to compute the id of
its adjusted destination (i.e. its parent in the adjusted tree). It then appends
all lists it received from its children, adds its own amount and dest-id, and
sends an Adjust() message to its parent in the basic logical tree.
This phase ends when the root gets an Adjust() message from all its
children. At this point each node in the tree knows how much information it
should get, and the ids of the nodes that will send it information, and also
the id of its destination, i.e., its parent in the adjusted logical tree.
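The destination rule above can be sketched as follows; the function name, the 0-based idList indexing (root first), and the clamping to the available ancestors are our assumptions:

```python
import math

# Sketch of the Phase-2 destination rule: a node sending an amount a with
# n̄·2^(l-1)/n < a <= n̄·2^l/n picks the (log(n) - l)-th entry of the idList
# collected on the way down from the root as its parent in the adjusted tree.
def adjusted_destination(a, n, n_bar, id_list):
    if a <= 0:
        return id_list[-1]                    # no data: keep the basic-tree parent
    l = math.ceil(math.log2(a * n / n_bar))   # smallest l with a <= n̄·2^l/n
    idx = int(math.log2(n)) - l
    idx = max(0, min(idx, len(id_list) - 1))  # clamp to the available ancestors
    return id_list[idx]

# With n = n̄ = 8 and idList ['root', 'a', 'b', 'parent']:
print(adjusted_destination(8, 8, 8, ['root', 'a', 'b', 'parent']))  # root
print(adjusted_destination(1, 8, 8, ['root', 'a', 'b', 'parent']))  # parent
```

A node holding all the data (a = n̄) is thus sent all the way up to the root, while a node with a single unit keeps its basic-tree parent.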
Phase 3
In this phase, data is sent along the links of the adjusted logical tree (as
described in Figure 18.9). Each node sends data to its destination. Nodes
that receive data from other nodes must wait until they have received all of
it, and only then transmit the received data together with their local
data to their destination. This is done through the fast track. At the end
of the phase the root receives all the data of the path and the algorithm
accomplishes its task.
weighted collect-rec-Phase2-3(i)
1. if received ComputeAdjustTree(n, n̄, idList)
2.   if (segmentsize_i > 1)
3.     idList = id | idList
4.     send ComputeAdjustTree(n, n̄, idList) to Child_L
5.     send ComputeAdjustTree(n, n̄, idList) to Child_R
6.   else
7.     dest = parent
8.     if (local data amount > 0)
9.       l = log n + 1 − log(n/n̄ · local data amount)
10.      dest = idList_l
11.    DataList = (i, local data amount, dest)
12.    send AdjustTree(DataList) to parent
13.    send Report(Data) to dest
14. i = 0
15. while (i < number-of-children)
16.   if received AdjustTree(DataList)
17.     i++
18.     List = List | DataList
19. dataAmount = processAdjustTree(List)
20. dataAmount += local data amount
21. dest = parent
22. if (dataAmount > 0)
23.   l = log n + 1 − log(n/n̄ · dataAmount)
24.   dest = idList_l
25. DataList = (i, dataAmount, dest) | DataList
26. send AdjustTree(DataList) to parent
27. collectedAmount = dataAmount − local data amount
28. L ← collectData(collectedAmount)
29. L ← L | (local data)
30. send Report(L) to dest
As explained above, the first and second phases of algorithm weighted collect-rec
can be viewed as an execution of algorithm collect-rec, where the data
collected is the triple (i, amount(i), dest(i)) and not the real data of each
node. The only change here is that we added at the beginning of Phase 2 the
message ComputeAdjustTree(idList), which goes down the logical tree. However,
since the amount of data in the idList is bounded by the tree height (log n), the
complexity of these phases is the same as the complexity of collect-rec, which
can be formally stated as follows.
Lemma 18.4.1. The time complexity of Phase 1 and Phase 2 is O(n(P +
C dn)) and the message complexity is O(n(dn + log n)) for the first routing
method; the time complexity is O(n(P + C dn) + C d̄ log n) and the message
complexity is O(n(d̄ + dn + log n)) for the second routing method.

collectData(n)
1. L ← empty
2. while (n > 0)
3.   if received Report(data)
4.     L ← L | data
5.     n −= sizeof(data)
6. return L

processAdjustTree(List)
1. amount = 0
2. for each entry in List
3.   if (entry.dest == i) amount += entry.amount
4. return amount
The more difficult part is to analyze the complexity of the third phase in
which data is actually sent along the logical links of the adjusted tree. We
need to prove the following lemma.
Lemma 18.4.2. The time complexity of Phase 3 is O(n(P + C dn)) and
the message complexity is O(n(dn + log n)) for the first routing method; the
time complexity is O(n(P + C dn) + C d̄ log n) and the message complexity is
O(n(d̄ + dn + log n)) for the second routing method.
Proof. When the algorithm collects data, all the messages are sent towards
the first node in the path; therefore the communication delay of Phase 3 is
O(C dn n) using the first routing method and O(C d̄ log n) using the second
routing method.
Define the length level of a node to be l if its segment size is smaller than
2^{l+1} but greater than or equal to 2^l. A node has an amount level l if the
amount of data it sends (this includes its own data and the data received
from the other nodes during the algorithm execution) is greater than or equal
to (n̄/n)2^l and smaller than (n̄/n)2^{l+1}. Finally, we define the node's level to
be the maximum between its length level and its amount level.
According to the algorithm every node sends its data to a node with a
higher level. The amount level of the node determines the execution delay
in the node. The execution delay in a node with amount level la is at most
the amount of data it might process without increasing its amount level,
hence it is at most (n̄/n)·2^{la+1}. There are log(n) possible amount levels
and processing in the nodes with the same level is done simultaneously, hence
the execution delay of Phase 3 is:
∑_{l=0}^{log(n)} (n̄/n)·2^l ≤ O(n̄).
Therefore the time complexity of Phase 3 is O(n(P +Cdn )) for the first routing
method, and O(n(P + Cdn ) + C d¯log n) for the second routing method.
During the data collection phase each node sends one message with data
to its destination. The distance (in terms of overlay hops) between the node
and its destination is bounded by 2^l, where l is the node's level. The number
of nodes with level l is bounded by n/2^{l−1}, since there are at most n/2^l nodes
with amount level l and at most n/2^l nodes with length level l. Hence, when
using the first routing method, the total number of messages passing during
the third phase is bounded by
∑_{l=0}^{log(n)} (n dn/2^{l−1})·2^l ≤ O(n dn log(n)).
18.5 Gathering Information from a Tree

In this section we deal with a more general problem where we need to collect
information from a general graph, and not from a specific path. First, we
assume the existence of a spanning tree rooted at the root, and we want to
collect information from all leaves of this tree along the paths to the root.
As shown in the introduction to this chapter, the naive solution of collecting
data along a spanning tree as described in [37] is not optimal in our model.
One can show that in the first routing method, it is impossible to gather all
information with a time complexity lower than Ω(DC + nP ), where D is the
diameter of the overlay network. This is true because a message cannot arrive
at the most remote element faster than DC time units, and the algorithm
must spend at least P time units to copy the message from every element
in the network. The message complexity cannot be lower than Ω(n), since
every element in the network must send at least one message, thus the root
must process at least n units of data. Moreover, if no global structure such
as a spanning tree is available, the message complexity is bounded from below by Ω(m),
(Fig. 18.13: a spanning tree whose nodes are numbered 0–14 in pre-order, with labels i, j, and k marking nodes referred to in the text.)
where m is the number of logical links in the overlay network, since every
link in the graph must be tested in order to ensure that we cover all nodes.
A first step towards developing algorithms that will have both a linear
time and message complexity is to modify collect-rec to work on trees. This
can be done (see Figure 18.13) by assigning a number to every node in the
tree according to a pre-order traversal starting at the root. We now consider
the path that goes from node 1 (the root) to node n, according to this order.
Note that node i and node i + 1 may not be neighbors in the overlay network
(for example nodes 11 and 12 in Figure 18.13), and therefore this path is not
a simple path. However, the total length of the path in terms of overlay nodes
is bounded by 2n.
Creating such a path requires assigning numbers to the nodes in the tree
according to the pre-order traversal, and allowing nodes in the tree to
route messages according to this node number. For this, each node should
know its own number, and the number range of each of its subtrees. This can
be easily done by a bottom up pass on the tree that collects the sizes of the
subtrees, followed by a top down pass on the tree in which every node assigns
ranges to each of its subtrees. Since the size of the messages is constant, the
time complexity of this process can be bounded by O(Cdn D + P n), and the
message complexity is O(n).
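The two passes can be sketched in Python, with recursion replacing the actual message exchanges (the dictionary representation is an assumption of ours):

```python
# Sketch of the two-pass pre-order numbering: a bottom-up pass computes
# subtree sizes, then a top-down pass gives every node its pre-order number
# and, implicitly, the number range [number, number + size - 1] it owns.
def assign_preorder(children, root):
    """children: {node: [child, ...]}; returns {node: (number, subtree_size)}."""
    size = {}
    def sizes(u):                       # bottom-up pass: subtree sizes
        size[u] = 1 + sum(sizes(c) for c in children.get(u, []))
        return size[u]
    sizes(root)
    result = {}
    def assign(u, number):              # top-down pass: number ranges
        result[u] = (number, size[u])
        next_num = number + 1
        for c in children.get(u, []):   # child c owns the next size[c] numbers
            assign(c, next_num)
            next_num += size[c]
    assign(root, 0)
    return result

tree = {0: [1, 4], 1: [2, 3]}
print(assign_preorder(tree, 0))  # {0: (0, 5), 1: (1, 3), 2: (2, 1), 3: (3, 1), 4: (4, 1)}
```

A node can then route a message towards number x by forwarding it to the child whose range contains x.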
After this phase we can run Algorithm collect-rec on this path and ob-
tain the message and time complexity of collect-rec. All together we get the
following theorem.
Theorem 18.5.1. One can collect data from any given spanning tree with
time complexity of O(n(P + C dn)) and message complexity of O(n(dn +
log n)) for the first routing method, and time complexity of O(n(P + C dn) +
C d̄ log n) and message complexity of O(n(d̄ + dn + log n)) for the second
routing method.
However, in practice n might be very large (100,000 nodes or more in a
typical Peer-to-Peer network) while the diameter of the network (and thus
the height of the spanning tree) is much smaller (typically around 10). It is thus
both practically important and theoretically interesting to reduce the complexity
of the data collection algorithm in this model.
In the rest of this section we describe an algorithm, called weighted collect
on trees, for the general data collection problem. As indicated before, we
assume the existence of a spanning tree rooted at the root, and we want to
collect information from all leaves of this tree along the paths to the root. Our
algorithm follows the ideas presented in the previous section for the weighted
collect-rec algorithm. We start with the given spanning tree (as we started
with he basic logical tree in algorithm weighted collect-rec, and we modify it
by assigning new destinations to some nodes. This is done in a way that data
is always sent up the tree towards the root, balancing between the amount
of data that a node sends and the length of path along which the data is
transported.
In order to do so, we assign what we call a length level to each node. These
length level values should have the following two properties: the number of
nodes with length level l must be greater than or equal to the total number
of nodes with length level greater than l, and the distance between
a node with length level l and the nearest node towards the root with length
level greater than l must be bounded by O(2^l). Finding an appropriate
assignment that has the desired properties is not easy. As will be proven later,
the following algorithm assigns length levels to the nodes of the tree with
the desired properties. All leaves have length level 0. The length level of each
internal node is the minimum between the maximum of all the length levels
in the node's sub-tree plus 1, and the position of the first 1 in the binary
representation of the distance of the node from the root. Once length levels
are assigned, each node should send its data to the first node, on the way
to the root in the tree, that has a length level that is bigger than the node's
length level.
Phase 1
As explained above, the goals of the first phase are to assign a length level to
each node, to find the destination of each node, and to compute the number
of the adjust messages that each node needs to receive during the second
phase.
Definition 18.5.1. Let lengthbin(i) be the binary representation of the distance
between the root and node i.

Definition 18.5.2. Let η(i) be the position of the least significant bit that
equals 1 in lengthbin(i).
The algorithm assigns length level 0 to all leaves. The length levels of all
other nodes are defined by the formula min{η(i), lm(i) + 1} (where lm(i) is
the maximal length level in the sub-tree of i, excluding i).
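The assignment rule can be sketched as follows; the tree representation and the single recursive traversal are our assumptions, while the actual algorithm computes the same values by message passing, as described next:

```python
# Sketch of the length-level assignment: leaves get level 0; an internal
# node i gets min(eta(i), l_m(i) + 1), where eta(i) is the position of the
# least significant 1-bit of i's distance from the root and l_m(i) is the
# maximal length level in i's subtree (excluding i).
def eta(distance):
    # position of the least significant set bit; infinite for the root itself
    return (distance & -distance).bit_length() - 1 if distance else float("inf")

def length_levels(children, root):
    depth, level = {root: 0}, {}
    def walk(u):                           # returns the maximal level in u's subtree
        if not children.get(u):            # leaf
            level[u] = 0
            return 0
        l_max = 0
        for c in children[u]:
            depth[c] = depth[u] + 1
            l_max = max(l_max, walk(c))
        level[u] = min(eta(depth[u]), l_max + 1)
        return max(l_max, level[u])
    walk(root)
    return level

# A path root -> a -> b: the leaf b and the odd-depth node a get level 0.
print(length_levels({"root": ["a"], "a": ["b"]}, "root"))  # {'b': 0, 'a': 0, 'root': 1}
```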
In order to find the distance from each node to the root, the root disseminates
among its children a message with a counter initialized to 1.
Each node that receives such a message stores the value of the counter as its
distance to the root, increments the counter, and disseminates the message
to its children.
Then, the algorithm computes the maximal length level in the sub-tree of
each node and assigns the length level to each node. During this computa-
tion each node receives one message from each of its children. This message
contains the maximal length level in the child sub-tree. Upon receiving this
message the node computes both its length level and the maximal length level
in its sub-tree (considering its own level as well), and sends the message with
the computed value to its parent. This computation is done bottom up, and it
starts in the leaves immediately after receiving the message with the distance
to the root.
Once a node knows its length level it starts searching for its destination.
This is done by creating a request that contains the node's length level and
sending it towards the root via the Application Layer. Each node that receives
such a request stores the address of the child who sent it. Then, the node
forwards the request to its parent iff the node's length level is smaller than or
equal to the received length level and the node has not previously sent a request
with the same length level.
If the node’s length level is bigger than the requested length level, it sends
a reply with its address and length level to the appropriate child. When a
node receives a reply, it disseminates the reply to all children who had sent
the appropriate requests before, and stores the address from the reply if this
is also its own destination.
In order to move on to the last task of the phase, it is important that
each node knows that it has received all messages from its subtree. This
is done by sending notification messages up the tree. Each leaf sends such a
notification immediately after sending its request. Other nodes send their
notification when they have received notifications from all of their children and
have finished processing all the received requests. Note that a node delays the
dissemination of replies to its children until it sends a notification towards the
root. Only after the node notifies its father is it guaranteed that it knows about
all its children who are waiting for the delayed replies.
The last task of the phase is to figure out the number of length sources of
each node. Recall that node j is a length source of node i if j received i's
address as a destination address during the first phase. Denote by Vl(i) the set
of the length sources of i with length level l. The number of length sources
equals the sum of the sizes of Vl(i), for all l smaller than i's length level. Next
we describe how the algorithm computes Vl(i) for a specific l. Consider the virtual tree
that contains all the paths through which the replies with i’s address passed.
When replies reach their destinations this virtual tree is well-defined, since
each node has the addresses of the children that received these messages,
and the node will not receive new requests since its children finished sending
requests. The root of this sub-tree is i. Note, that two such virtual trees, that
are built from the paths passed by the replies with the same length level, have
no common edges if their roots are distinct. The algorithm then computes
the sum of the values of all the virtual tree nodes in a bottom up way, where
each node adds the value 1 iff the node’s length level is l.
Once the destination of each node is known, the algorithm proceeds exactly
as in algorithm weighted collect-rec since the balancing between the amount
of data and the length level is exactly the same, and sending the data is of
course the same.
For simplicity, we do not explicitly state the pseudo code that implements the
algorithm. The main difference between algorithm weighted collect on trees
and algorithm weighted collect-rec is in the first phase and the definition of
the length levels. Once the logical tree on top of the overlay tree is created,
the algorithm and its analysis are very similar to those of algorithm weighted
collect-rec. We begin with two definitions.
Definition 18.5.3. Let V_l be the set of nodes whose length level assigned
by the algorithm is l. Denote by n_x the size of V_x.
Lemma 18.5.1. The size of V_l is greater than or equal to the sum of the
sizes of V_i for all i > l, i.e., n_l ≥ Σ_{i>l} n_i.
Proof. Consider the set V_{>l} = ∪_{i>l} V_i. Denote by v_i a node from
V_{>l}, and by M(v_i) the node nearest to v_i in v_i’s sub-tree with
length level l. Such a node always exists, because under each node with length
level x there are nodes with all length levels less than x. If there are two or
more such nodes, we arbitrarily choose one of them.
Consider two distinct nodes v_1, v_2 from V_{>l}. There are two cases. The
first case is when neither of the two nodes is in the sub-tree of the other.
The second case is when one node is in the sub-tree of the other.
312 18. Algorithmic Aspects of Overlay Networks
M(v_i) is always in the sub-tree of v_i. Thus, in the first case M(v_1) and
M(v_2) are distinct nodes. Consider now the second case. Assume that v_2 is
in the sub-tree of v_1. M(v_2) is in the sub-tree of v_2. Since the length
levels of the considered nodes are greater than l, both η(v_1) and η(v_2)
are greater than l. Hence, the path from v_2 to v_1 contains at least one node
with length level equal to l, so M(v_1) and M(v_2) are distinct nodes.
In both cases we proved that M(v_1) ≠ M(v_2); hence for each v_i there
is an M(v_i) that is distinct from all other M(v_j) with v_i ≠ v_j. Thus,
n_l ≥ Σ_{i>l} n_i, and the lemma follows.
Lemma 18.5.2. The number of nodes with length level l is bounded by
n/2^l.
Proof. In order to prove the lemma it is sufficient to prove the following
equation

    Σ_{i≥l} n_i ≤ n/2^l        (18.1)

since n_i ≥ 0 for every i.
The proof of Equation 18.1 is obtained by induction on the length level.
Base case. The length level equals 0. The lemma holds because the
total number of nodes in the tree is n.
Inductive step. Suppose that the lemma holds for all length levels less
than or equal to l. We prove that the lemma also holds for length level l + 1.
Equation 18.1 for length level l can be rewritten as

    n_l + Σ_{i>l} n_i ≤ n/2^l.        (18.2)

According to Lemma 18.5.1, n_l ≥ Σ_{i>l} n_i. Thus, Equation 18.3 follows
from Equation 18.2:

    2 Σ_{i>l} n_i ≤ n/2^l,  i.e.,  Σ_{i>l} n_i ≤ n/2^{l+1}.        (18.3)
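The induction can be illustrated numerically: any per-level counts satisfying the property of Lemma 18.5.1 also satisfy the halving bound of Equation 18.1. A small check with a made-up sequence:

```python
# If n_l >= sum_{i>l} n_i (Lemma 18.5.1), then the tail sums halve:
# sum_{i>=l} n_i <= n / 2^l (Equation 18.1). The sequence below is made up.
counts = [16, 8, 4, 2, 1, 1]               # n_0 .. n_5
n = sum(counts)                            # n = 32
for l in range(len(counts)):
    assert counts[l] >= sum(counts[l + 1:])      # Lemma 18.5.1
    assert sum(counts[l:]) <= n / 2 ** l         # Equation 18.1
print("bound holds for n =", n)
```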
Consider the first case. Since the distance from i to the root is greater
than 2^{l+1}, there is a 1 in lengthbin(i) at a position k which is greater
than l. Let ī be the node that lies on the path from i to the root such that
lengthbin(ī) is the same as lengthbin(i) except in position l. ī’s length
level cannot be smaller than (l + 1), because the first 1 in lengthbin(ī)
is at a position greater than l, and the maximal length level in ī’s sub-tree
is at least l since i is in this sub-tree. The distance from i to ī is 2^l,
thus the lemma follows in this case.
Consider the second case. Let k be the position of the first 1 in lengthbin(i).
Let ī be the node that lies on the path from i to the root such that the
distance between i and ī is 2^{l+1}. There are two sub-cases: k > (l + 1) and
k = (l + 1). When k > (l + 1), lengthbin(ī) has no 1 at positions less than
l + 1. When k = (l + 1), lengthbin(ī) is the same as lengthbin(i) except in
position k. Since the distance from i to the root is greater than 2^{l+1},
lengthbin(i) has a 1 at a position greater than k. In both sub-cases the first
1 in lengthbin(ī) is at a position greater than l. Since i is in ī’s sub-tree,
the maximal length level in this sub-tree is at least l. Therefore, ī’s length
level cannot be smaller than (l + 1). The lemma follows.
Before proving the next lemma, we note that the maximal length level
assigned by the algorithm is log(D).
Lemma 18.5.4. The time complexity of phase 1 is O(Dd_nC + nP) and its
message complexity is O(d_n n log(D)).
Proof. During the first phase the algorithm uses the following types of
messages: a message with the distance to the root, a message with the maximal
length level in the sub-tree of each node, a message that notifies the node’s
father that the node finished sending requests, a message with a request for
the address of the destination, and a message with a reply that contains the
destination address. The complexities related to the first three types of
messages are the same, and the complexities related to the last two types are
also the same. Hence the complexity of the first phase is determined by the
complexity related to the messages with the distance to the root and by the
complexity related to the messages with the destination address.
Consider the messages with the distance to the root. Each node receives one
such message, thus their message complexity is O(n). The messages are
disseminated in one direction, hence their communication delay is O(DC). In
order to evaluate the execution delay related to these messages, consider a
critical path {s_0, s_1, ..., s_k}, where messages are sent from s_{i+1} to
s_i and s_0 is the root. The execution delay of these messages is equal to
the sum of the execution delays at each node, denoted by t_i:

    Σ_{i=0}^{k} t_i.        (18.4)
The node spends constant time when it sends such a message to each
of its children, hence t_i equals the number of the node’s children. Since the
total number of children in the tree cannot exceed the number of nodes of the
tree, the execution delay of the messages with the distance to the root is
bounded by O(nP). The time complexity related to the messages with the
distance to the root is O(DC + nP).
Consider now the messages that contain the addresses of the destinations.
These messages are always sent in one direction, hence their communication
delay is O(DC). Since the paths traversed by messages with the same level but
sent by different nodes have no common edges, the execution delay is
Σ_{i=0}^{log(D)} t_i, where t_i is the execution delay of processing
messages with level i. t_i consists of the execution delay of receiving the
message from the node’s father and retransmitting the message to the children.
According to Lemma 18.5.3 the first component is bounded by 2^{i+1}. The
second component is bounded by the number of nodes with length level i.
According to Lemma 18.5.2 this number is bounded by n/2^i. Hence, the
execution delay is:

    Σ_{i=0}^{log(D)} (2^{i+1} + n/2^i) ≤ O(n).        (18.5)
The time complexity of these messages is O(DC + nP). Consider now their
message complexity. Each node sends one such message to a distance bounded
by 2^{i+1}, where i is the node’s length level, according to Lemma 18.5.3. The
number of nodes with length level i is bounded by n/2^i according to Lemma
18.5.2. Hence, the message complexity of these messages is:

    Σ_{i=0}^{log(D)} (n/2^i) · 2^{i+1} ≤ O(n log(D)).        (18.6)
The lemma follows.
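The sum in Equation 18.6 can be checked directly: each of the log(D)+1 terms equals 2n, so the total is 2n(log D + 1). A quick numeric check with made-up values of n and D:

```python
import math

# Each term of Equation 18.6 is (n / 2**i) * 2**(i+1) = 2n; with log(D)+1
# terms the sum is 2n(log D + 1) = O(n log D). n and D are made up.
n, D = 1024, 64
levels = int(math.log2(D))                  # maximal length level log(D) = 6
total = sum((n / 2 ** i) * 2 ** (i + 1) for i in range(levels + 1))
assert total == 2 * n * (levels + 1)
print(int(total))  # -> 14336
```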
As explained, the continuation of the algorithm is very similar to al-
gorithm weighted collect-rec and thus we omit the detailed description and
analysis. The overall complexity of the algorithm is stated in the following
theorem.
Theorem 18.5.2. The time complexity of algorithm weighted collect on
trees is O(n(P + Cd_n)) and the message complexity is O(n(d_n + log D)) for
the first routing method; the time complexity is O(n(P + Cd_n) + Cd̄ log D)
and the message complexity is O(n(d̄ + d_n + log n)) for the second routing
method.
18.6 Gathering Information from General Graphs 315
In many cases, exploring the path between two nodes is just an intermediate
step towards computing some function along this path. A typical example
is bottleneck detection: we want to detect the most congested link along a
path. Another typical example is the need to know how many copies of a
certain file are available in a Peer-to-Peer file sharing system. In both cases
the computation can be done using a single pass on the data (path in the
first example, and tree in the second one) using constant size messages.
Bottleneck detection is a special case of a global sensitive function [121],
which we term a succinct function. These functions, e.g., average, min, max,
and modulo, can be computed with a single pass over the input, in any order,
requiring only a constant amount of memory. For such functions we can prove
the following theorem.
Theorem 18.7.1. Every succinct function on a path can be computed with
time complexity Θ(nd_n(P + C)) and message complexity Θ(n), and these
bounds are tight.
In a similar way, we can define succinct-tree functions as functions from
a set of elements {x_i} to a single element x, such that if f({x_i}) = x and
f({y_i}) = y then f({x_i} ∪ {y_i}) = f({x} ∪ {y}). Such functions can be
computed on a tree using a single bottom-up pass with fixed-length messages,
and thus the following theorem holds.
Theorem 18.7.2. Every succinct-tree function can be computed on an overlay
network with time complexity Θ(Cd_nD + S_nP) and message complexity Θ(m)
(Θ(n) if a spanning tree is available), and these bounds are tight.
Note that since we only use messages between neighbors in the overlay net-
work, the same results hold for both routing methods.
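As an illustration of a succinct-tree function, min can be computed in one bottom-up pass in which every node forwards a single constant-size value to its parent; the tree and values below are made up:

```python
# One bottom-up pass computing a succinct-tree function (here: min); each
# node forwards exactly one constant-size value to its parent.
def tree_fold(children, values, f, root):
    acc = values[root]
    for c in children.get(root, []):
        acc = f(acc, tree_fold(children, values, f, c))  # merge child result
    return acc

tree = {0: [1, 2], 2: [3]}
vals = {0: 7, 1: 3, 2: 9, 3: 1}
print(tree_fold(tree, vals, min, 0))  # -> 1
```

The merge property f({x_i} ∪ {y_i}) = f({x} ∪ {y}) is what allows each node to forward only the folded value of its subtree.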
In this section we present the results of running the weighted collect-rec
algorithm. Figure 18.14 depicts the results of collecting data that is
distributed uniformly among all the nodes. The X axis is the path length. In
these runs we checked the time and message complexity of the algorithm for
different values of the amount ratio, from 1 up to 1.8. Recall that the
amount ratio is the ratio between the amount of data in all the nodes and the
path length. The results show that the message complexity is independent of
the amount of transported data. For clarity of presentation, we plotted in
Fig. 18.14(b) the line for the theoretical bound O(n log(n)).
18.8 Performance Evaluation 317
As for the time complexity, there is a clear difference in the running time
for different amount ratios. Note, however, that this difference is not
proportional to the difference in the amount ratio. This can be explained by
the fact that the running time of the first two phases depends only on the
path length. The difference in the running time of the algorithm for different
amount ratios is introduced by the third phase, when the actual data
collection is done.
Fig. 18.14. Time units and number of messages vs. path length.
Figure 18.15 depicts the results of collecting data from a path where all
the data is concentrated in one node. Note that in this case the node that
contains all the data must send the data directly to the root during the last
phase. In each chart, the X axis describes the distance from the single node
that contains the data to the root. The number of nodes in the path is 100 in
charts a and b. Both the number of messages and the running time increase
when the data is located further away from the root. The initial values show
the minimal time and number of messages required to accomplish the first two
phases (i.e., the time complexity or the number of messages when the data is
actually located at the root). The number of messages increases proportionally
to the distance from the node to the root. As the node moves further away from
the root, the message with the data is transmitted over a greater distance,
and the number of messages grows. The growth of the running time looks
like a step function. The figures show that the steps appear at values of n/2^i.
Since the node sends the data directly to the root during the last phase,
the algorithm must deliver the address of the root to this node during the
second phase. Consider the path along which the address is delivered. The
message with the address may be processed by other nodes, and each time the
maximal length level of the nodes that participate in the delivery increases,
a new step starts. This is because the amount of work done by a node
during the second phase is proportional to n/2^i. Inside each step (best
seen for small values of x) there is a weak increase of the running
time. This increase is caused by the growth of the communication delay that
a message with data suffers when the distance to the root increases. The
presented results were obtained by running one experiment for each value of
x. A larger number of experiments is useless here, since the simulator gives
exactly the same results for the same starting parameters.
Fig. 18.15. Number of messages and time units vs. data location.
Figure 18.16 depicts the results of collecting data from a path where an
arbitrary group of nodes contains all the data in the system. The number
of nodes in the path is 100 in charts a and b. In each chart the X axis
describes the size of the set of nodes that contains all the data. Note that
the data is distributed uniformly among all the nodes in the set. The charts
show that the running time decreases when the group size grows from 4-5
up to approximately 20% of n. When the size of the group increases, the
data processing becomes more parallel, and this causes the processing time to
decrease. When the size of the group continues to grow, the time required for
processing the data at each node increases, since there are more data packets.
This effect balances the effect of the parallel processing, and the total time
required stays the same.
As can be seen from Figure 18.16(b), the number of messages increases
logarithmically with the group size. The charts show that as the group grows,
the average cost (in messages) of adding new nodes decreases.
The average value of the standard deviation is 13% for the running time and
3% for the number of messages. The standard deviation is higher for small
groups and decreases as the group size increases.
Fig. 18.16. Time and number of messages vs. group size.
In order to evaluate the weighted collect on trees algorithm we used two
network models. The networks created according to these models have different
topological properties. The first model, denoted by random tree, is the
network topology where the probability that a new node will be linked to an
existing node is the same for all existing nodes (also known as G(n, p)).
However, recent studies indicate that the Internet topology is far from being
a random graph [201]. In order to capture spanning trees of such a model, the
described algorithm was also tested on Barabasi trees. In this model the
probability that a new node will be linked to an existing node with k links
is proportional to k [637]. As will be shown below, the differences in the
network models affect the performance of the algorithm. The results of
evaluating the performance of the weighted collect on trees algorithm are
depicted in Figure 18.17. In both charts, the x axis describes the size of
the tree. Figure 18.17(a) shows the results of running the algorithm on
random trees and Figure 18.17(b) shows the results of running the algorithm
on Barabasi trees. The plotted results reflect an average cost taken from
1000 runs per tree size, and the values of the standard deviation do not
exceed 4%.
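The two tree models can be sketched as follows: in a random tree the new node attaches to a uniformly chosen existing node, while in a Barabasi tree the attachment probability is proportional to the current number of links of each existing node. This is an illustrative generator, not the simulator used for the experiments:

```python
import random

def random_tree(n):
    # the new node attaches to a uniformly chosen existing node
    return [random.randrange(i) for i in range(1, n)]   # parent of node i

def barabasi_tree(n):
    # the new node attaches to an existing node with probability
    # proportional to its current number of links
    parents, degree = [], [1]            # seed node 0
    for i in range(1, n):
        p = random.choices(range(i), weights=degree)[0]
        parents.append(p)
        degree[p] += 1
        degree.append(1)
    return parents

random.seed(0)
print(len(barabasi_tree(1000)))  # -> 999 parent links
```

In the Barabasi variant a few early nodes accumulate many children, which matches the observation below that a small group of high-degree nodes performs most of the work.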
One can observe that both the time and the number of messages grow
linearly³; this agrees with the theoretical analysis. The running time of the
algorithm is smaller on random trees than on Barabasi trees. This is explained
by the fact that Barabasi trees have a small group of nodes with a large
number of children (in random trees the distribution of children among the
internal nodes is more uniform), and these nodes perform a lot of work. The
number of messages is greater when the algorithm runs on random trees.
This can be explained by the fact that the diameter of Barabasi trees is less
than the diameter of random trees.
³ Note that the time scale is logarithmic.
Fig. 18.17. (a) Time for random trees; (b) time for Barabasi trees; (c)
messages. In all charts the x axis is the tree size; the time charts show the
cases c = p, p = 20c, and c = 20p.
Note that the actual value of the time complexity depends on the ratio
between C and P. In practical scenarios this ratio depends on the network
RTT (since C also represents the propagation delay), the type of processing,
the architecture of the overlay network, and the efficiency of data handling
at the nodes. However, in a Peer-to-Peer network, where distant hosts can
be neighbors in the overlay layer, this ratio can indeed be small (i.e., ≤ 1),
while in local area overlay networks (as in Active Networks prototypes) this
ratio can be as big as 20.
When creating the graphs in Figure 18.17, we used p = c = 1 to calculate
the case c = p, c = 1, p = 20 to calculate the case p = 20c, and p = 1, c = 20
to calculate the case c = 20p. It is clear, then, that the fastest case is when
c = p. However, as indicated by the graphs, the effect of increasing p is much
more severe than that of increasing c. If we look at the Barabasi tree, we see
that it takes about 2000 time units to collect information from a 1000 node
tree (where c = 20p). Assuming p = 0.1 ms, we can infer that a 100,000 node
graph could be collected in 20 seconds assuming an RTT of 2 ms. For more
realistic RTTs, and
assuming an open TCP connection between peers, one can collect information
from 100,000 nodes in less than a minute, using algorithm weighted collect on
trees.
19. Schema-Based Peer-to-Peer Systems
Wolfgang Nejdl, Wolf Siberski (L3S and University of Hannover)
19.1 Introduction
When sharing information or resources — the most prominent application of
Peer-to-Peer systems — one is immediately faced with the issue of searching.
Any application which provides an information collection needs some means
to enable users to find relevant information. Therefore, the expressivity of
the query language supported by the system is a crucial aspect of Peer-to-
Peer networks. Daswani et al. [154] distinguish key-based, keyword-based and
schema-based systems.
Key-based systems can retrieve information objects based on a unique
hash key assigned to each of them. This means that documents, for example,
have to be requested by their name. This kind of query is supported by all
DHT networks (cf. Chapter 7). Typically, key-based search features are not
exposed to end-users, but rather used as basic infrastructure.
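Key-based retrieval can be pictured as follows; the hash function choice and the dict standing in for the DHT are illustrative assumptions:

```python
import hashlib

# Objects are stored and requested under a unique key derived by hashing
# their name; a plain dict stands in for the distributed hash table.
dht = {}

def put(name, content):
    key = hashlib.sha1(name.encode()).hexdigest()
    dht[key] = content
    return key

def get(name):
    return dht.get(hashlib.sha1(name.encode()).hexdigest())

put("song.mp3", b"...")
# The exact name is required: a near miss hashes to a different key.
print(get("song.mp3") is not None, get("Song.mp3") is None)  # -> True True
```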
Keyword-based systems extend this with the possibility to look for
documents based on a list of query terms. This means that users do not have to
know the document they are looking for, but can ask for all documents relevant
to particular keywords. Non-ranking keyword-based systems find matching
resources by executing a string or string-pattern matching algorithm, e.g.,
on the file name. Ranking keyword-based approaches score documents according
to their relevance, based on statistics derived from the document full
text. Chapter 20 describes the latter kind of systems.
Schema-based systems manage and provide query capabilities for struc-
tured information. Structured means that the information instances adhere
to a predefined schema. For example, in a digital archive any document is
described using a schema consisting of elements such as title, author, subject,
etc. In schema-based systems, queries have to be formulated in terms of the
schema (e.g. “find all documents with author=Smith”). Nowadays the domi-
nant schema-based systems are relational databases; other important variants
are XML and Semantic Web data stores. Schema-based Peer-to-Peer systems
are sometimes also called Peer Data Management Systems (e.g., in [273]).
Digital archives are an application area where schema-based queries pro-
vide significant value. Here, users often need to formulate complex queries,
i.e., queries with constraints regarding several criteria, to specify their search.
For example, to find recent books about Java programming, one would need
to exclude outdated books and to disambiguate between the Java program-
ming language (“find all books where ’Java’ occurs in the title, publication
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 323-336, 2005.
Springer-Verlag Berlin Heidelberg 2005
324 19. Schema-Based Peer-to-Peer Systems
date is less than three years ago, and subject is a subtopic of computer”).
Such complex queries are only supported by schema-based systems.
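Such a complex query could look as follows when book metadata is held as structured records; the field names and the subtopic relation are made up for illustration:

```python
# Illustrative schema-based query over made-up book records: 'Java' in the
# title, published within the last three years, subject a subtopic of computer.
SUBTOPICS = {"computer": {"computer", "programming languages", "networks"}}

books = [
    {"title": "Java Programming", "year": 2004, "subject": "programming languages"},
    {"title": "Java Programming", "year": 1997, "subject": "programming languages"},
    {"title": "Birds of Java", "year": 2004, "subject": "geography"},
]

def matches(book, now=2005):
    return ("Java" in book["title"]
            and now - book["year"] < 3
            and book["subject"] in SUBTOPICS["computer"])

print([b["title"] for b in books if matches(b)])  # only the recent CS book
```

A keyword-based system could only match on "Java" and would return all three records; the schema constraints are what filter out the outdated and off-topic hits.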
We can observe two converging development lines, one regarding database
systems, the other Peer-to-Peer networks. Databases started as centralized
systems, where one server processes queries from all clients. Since then,
they have evolved towards a higher degree of distribution, e.g., by introducing
mediator-based [622] distributed query processing¹. At the same time, Peer-
to-Peer systems have developed towards support for more expressive queries
[445, 70, 4, 274, 308]. Schema-based Peer-to-Peer systems are the point where
these two directions of research meet, as shown in Figure 19.1 (see also [260]).
Fig. 19.1. Convergence of database research and Peer-to-Peer research:
on the Peer-to-Peer side, key-based systems (Chord, Pastry, P-Grid) and
fixed-schema/keyword systems (Napster, Gnutella, FastTrack, LimeWire); on the
database side, TSIMMIS, Mariposa, ObjectGlobe and conventional DBMS; both
lines converge towards schema-based systems (Edutella, Piazza, PIER).
The data model used to store information is tightly connected to the aspect
of the query language. Many data models have been proposed for storing
structured data, and it is out of scope to discuss them in detail here. We
rather want to mention some basic distinctions with respect to the data model
that influence the capabilities of the system. The most basic way of storing
structured data is in terms of a fixed, standardized schema that is used
across the whole system. In this view, less complex data models like the ones
used in key- or keyword-based systems can be considered a special case of a
very simple fixed schema. Despite the obvious limitations, fixed-schema
approaches are often found in Peer-to-Peer systems, because this eliminates
the problem of
The data placement dimension is about where the data is stored in the net-
work. Two different strategies for data placement in the network can be iden-
tified: placement according to ownership and placement according to search
strategy.
Placement according to ownership. In a Peer-to-Peer system it seems most
natural to store information at the peer which is controlled by the information
owner. And this is indeed the typical case. The advantage is that access and
modification are under complete control of the owner. For example, if the
owner wants to cease publishing his resources, he can simply disconnect his
peer from the network. In the owner-based placement approach the network
is only used to increase access to the information.
Placement according to search strategy. The complementary model is when
peers cooperate not only in the search process, but already in storing the
information. The network as a whole then acts like a uniform facility to store
and retrieve information. In this case, data is distributed over the peers so
that it can be searched for in the most efficient manner, i.e., according to
the search strategy implemented in the network. Thus, the owner has less
control, but the network becomes more efficient.
Both variants can be further improved in terms of efficiency by the in-
troduction of additional caching and replication strategies. Note that while
19.2 Design Dimensions of Schema-Based Peer-to-Peer Systems 327
this improves the network performance, it also reduces the owner’s control of
information.
ing middle way between pure structured and pure unstructured networks (cf.
Chapter 15).
Super-Peer Networks. Inspired by the mediator work in distributed databases,
a special kind of hybrid network, so-called super-peer networks, has gained
attention as a topology for schema-based Peer-to-Peer networks. Peer
performance characteristics (processing power, bandwidth, availability, etc.)
are not distributed uniformly over all peers in a network.
Exploiting these different capabilities in a Peer-to-Peer network can lead to
an efficient network architecture [635], where a small subset of peers, called
super-peers, takes over specific responsibilities for peer aggregation, query
routing and possibly mediation. For this purpose, only the super-peers form
a Peer-to-Peer network, and all other peers connect directly to the resulting
super-peer backbone.
Super-peer-based Peer-to-Peer infrastructures usually exploit a two-phase
routing architecture, which routes queries first in the super-peer backbone,
and then distributes them to the peers connected to the super-peers. Like
database mediators, super-peers only need to know which schema elements
each connected peer supports. This is a small amount of information and thus
easily indexed and maintained. Another advantage is the ability of super-peers
to perform coordinating tasks such as creating a distributed query plan for
a query (see Section 19.4.2). The disadvantage of super-peer networks is the
need to explicitly dedicate specific nodes to the super-peer role, which
limits the self-organization capabilities of the network.
When ontologies are used to categorize information, this can be exploited
to further optimize peer selection in a super-peer network. Each super-peer
becomes responsible for one or several ontology classes. Peers are clustered at
these super-peers according to the classes of information they provide. Thus,
an efficient structured network approach can be used to forward a query to
the right super-peer, which distributes it to all relevant peers [395].
Discussion. Structured and unstructured networks have complementary ad-
vantages and disadvantages regarding their use for schema-based networks.
The predetermined structure allows for more efficient query distribution in
a structured network, because each peer ’knows’ the network structure and
can forward queries just in the right direction. But this only works well
if query complexity is limited; otherwise too many separate overlay networks
have to be created and maintained.
In unstructured networks, peers do not know exactly in which direction
to send a query. Therefore, queries have to be spread within the network
to increase the probability of hitting the peer(s) having the requested re-
source, thus decreasing network efficiency. On the other hand, queries can
take more or less any form, as long as each peer is able to match its resources
against them locally. For support of highly expressive queries, as needed
e.g. in ontology-based systems, only unstructured networks are feasible. An
exception are some DHT systems which have been extended recently into
19.3 Case Study: A Peer-to-Peer Network for the Semantic Web 329
In the Semantic Web, an important aspect for its overall design is the
exchange of data among computer systems without the need of explicit
consumer-producer relationships. The Resource Description Framework
standard (RDF, [360]) is used to annotate resources on the Web and provides
the means by which computer systems can exchange and comprehend data. All
resources are identifiable by unique resource identifiers (URIs plus anchor
ids). All annotations are represented as statements of the form
<subject, property, value>, where subject identifies the resource we want to
describe (using a URI), property denotes which attribute we specify, and
value the attribute value, expressed as a primitive datatype or a URI
referring to another resource. For example, to annotate document
http://site/sample.html with its author, we could use the statement
<http://site/sample.html dc:creator "Paul Smith">.
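The statement model can be mimicked with plain <subject, property, value> tuples; the second triple and the wildcard matcher below are illustrative, not an RDF API:

```python
# Minimal triple store: statements are (subject, property, value) tuples and
# None acts as a wildcard in query patterns. Not a real RDF library.
triples = [
    ("http://site/sample.html", "dc:creator", "Paul Smith"),
    ("http://site/sample.html", "dc:title", "A Sample Page"),
]

def match(pattern):
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

print(match((None, "dc:creator", None)))  # all statements with dc:creator
```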
RDF Schema (RDFS, [91]) is used to define the vocabulary used for de-
scribing our resources. RDFS schema definitions include resource classes,
properties and property constraints (domain, range, etc.). For example, prop-
erty dc:creator is a property of the standardized Dublin Core metadata
schema for document archives [159]. We can use any properties defined in
the schemas we use, possibly mix different schemas, and relate different re-
all capability levels, thus enabling reuse of existing functionality, e.g., for
translation purposes.
HyperCuP broadcast. Whenever an SP/SP index stays the same after an
update, propagation stops.
Because one important aspect of Peer-to-Peer networks is their dynamicity,
the SP/SP indices are not replicated versions of a central index, in contrast
to distributed architectures in the database area (e.g., [88]), but rather
parts of a distributed index, similar to routing indices in TCP/IP networks.
When a query arrives at a super-peer, it matches the schema elements
occurring in the query against the index information. The query is only
forwarded to the peers and along the super-peer connections which use the same
schema elements and are therefore able to deliver results. Thus, the indices
act as message forwarding filters which ensure that the query is distributed
only to relevant peers.
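An SP/SP-style index can be pictured as a mapping from schema elements to the connections (peers or neighboring super-peers) known to support them; the sketch below forwards a query only to connections covering all of its schema elements (the names and the intersection rule are illustrative assumptions):

```python
# Sketch of an SP/SP-style routing index: schema element -> connections
# (peers or neighboring super-peers) known to support it. Names are made up.
index = {
    "dc:creator":     {"peer1", "sp-east"},
    "dc:title":       {"peer1", "peer2"},
    "lom:difficulty": {"sp-east"},
}

def route(query_elements):
    """Forward only to connections supporting every element of the query."""
    targets = None
    for el in query_elements:
        found = index.get(el, set())
        targets = found if targets is None else targets & found
    return targets or set()

print(route({"dc:creator", "dc:title"}))  # -> {'peer1'}
```

Connections that support none of the query's schema elements are never contacted, which is the filtering effect described above.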
Ranking scores each resource that matches a query using a certain set of
criteria and then returns it as part of a ranked list. Additionally, we need
to restrict the number of results, both to make it easier for the user to
process them and to minimize traffic in a Peer-to-Peer environment. In
databases, this approach is referred to as top-k query processing, where only
the k best matching resources are returned to the user.
Top-k ranking in Peer-to-Peer networks has to address two additional
challenges [444]:
Mismatch in scoring techniques and input data. The scoring techniques and
input data used by different peers can have a strong impact on obtaining the
correct overall top-scored objects. Since we want to minimize network traffic,
but nevertheless integrate the top-scored objects from all different peers (and
super-peers) within each super-peer, each super-peer has to decide how to
score answers to a given query. In general we assume that every peer
throughout the network uses the same methods to score documents with respect
to a query, though the input data used to compute these scores may differ.
Using only distributed knowledge. Distributed information, and thus different
input data to score answers, complicates top-k retrieval, because many
scoring measures that take global characteristics into account simply cannot
be evaluated correctly with limited local knowledge. See Section 20.1.2 for a
context where some global knowledge (or an estimate of it) is required for
correct score calculation.
One algorithm for top-k query evaluation is presented in [54]. It uses the
same super-peer architecture as described in section 19.3.2. The algorithm is
based on local rankings at each peer, which are aggregated during routing of
answers for a given query at the super-peers involved in the query answering
process. Each peer computes local rankings for a given query, and returns just
the best matches to its super-peer. At the super-peer, the results are merged,
using the result scores, and routed back to the query originator. On the way
back, each involved super-peer again merges results from local peers and from
neighboring super-peers and forwards only the best results, until the aggre-
gated top k results reach the peer that issued the corresponding query. While
results are routed through the super-peers, each super-peer records in an
index which peers and super-peers contributed to the top k results for a
query. This information is subsequently used to route previously answered
queries directly and only to those peers able to provide top answers. Thus, the distribution
of queries can be limited significantly and query processing becomes much
more efficient. To cope with network churn, index entries expire after some
time, and the query is again sent to all relevant peers. Algorithm details can
be found in [54], together with optimality proofs and simulation results.
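The merge step performed at each super-peer on the way back to the query originator can be sketched as follows (a minimal illustration with invented function, document, and peer names; this is not the actual algorithm from [54]):

```python
import heapq

def merge_top_k(result_lists, k):
    """Merge locally ranked result lists arriving at a super-peer and keep
    only the k best-scored entries for forwarding."""
    return heapq.nlargest(k, (entry for lst in result_lists for entry in lst))

# Two peers each report their local top-3 as (score, document, peer) tuples:
peer_a = [(0.9, "d1", "A"), (0.7, "d2", "A"), (0.4, "d3", "A")]
peer_b = [(0.8, "d7", "B"), (0.6, "d8", "B"), (0.5, "d9", "B")]

top2 = merge_top_k([peer_a, peer_b], k=2)
# The super-peer forwards only the aggregated top-2 and records which
# peers contributed to it, for routing repeated queries directly:
contributors = {peer for (_score, _doc, peer) in top2}
```

Each super-peer on the return path repeats this merge over its local peers' and neighboring super-peers' lists, so only k results ever travel over any link.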
19.5 Conclusion
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 337-352, 2005.
Springer-Verlag Berlin Heidelberg 2005
338 20. Supporting Information Retrieval in Peer-to-Peer Systems
The first big application of Peer-to-Peer technology was the free exchange
of music files (mostly mp3 files, but also a limited amount of videos) over
the Internet. As early as 1999 Napster [436] offered a platform to exchange
files in a distributed fashion. Peers could offer files for download and directly
download files from other peers in the network. The Napster platform was not
Peer-to-Peer technology in the strict sense, because Napster still relied on a
central server administrating addresses of peers and lists of offered files. The
files offered were, however, not moved to the server, but all downloads were
directly initiated between two peers. The content searches in the Napster
network were made on a restricted amount of meta-data like filename, artist,
or song title. By matching this limited meta-data against a user's query
keywords, a content search thus only decided whether some peer offered an
adequate file, and it ordered the possible downloads by the expected quality
of the download connection.
A central index approach could only cope with the Napster network's
enormous success in terms of scalability by introducing hierarchies of index
servers, and it provided a single point of responsibility for the content. From
2000 on, the Gnutella [249] network therefore began to build a file exchange
platform on a true Peer-to-Peer structure. Content searches were performed
by flooding queries from
the originating peer to all neighboring nodes within a certain radius (the
time to live (TTL) for a query). This approach, too, proved not to be scalable
beyond a certain point and the Gnutella network spectacularly broke down
in August 2000 because of heavy overloads in low bandwidth peers. This
breakdown led to the introduction of load-balancing and the construction of
schema-based networks (FastTrack, e.g. KaZaA [343] or Morpheus [432]),
where a backbone of high bandwidth peers (so-called super-peers) takes a
lot of the query routing responsibility. Search in schema-based Peer-to-Peer
networks will be discussed in a different chapter of this book.
Previous applications for media exchange dealt mostly with exact or substring
matching of simple meta-data descriptions of media files. When it comes
to the exchange of (predominantly) textual documents, meta-data is not
enough, but fulltext searches have to be supported. Though meta-data can
capture some basic traits of a document (e.g. that some text is a ’newspaper
article’ related to ’sports’), they cannot anticipate and capture all the aspects
of a text a user might be interested in. Thus, in information retrieval all terms
that can be in some way important for a document should be searchable (i.e.
indexed). The second major difference to meta-data-based retrieval is that
information retrieval cannot use an exact match retrieval model, but has
to rely on ranked retrieval models. These models introduce the notion of a
certain degree of match of each document with respect to the query. The
higher the degree of match the better the document is expected to satisfy
a user’s information need. Thus in contrast to simple media file exchanges,
where the connection speed was the most interesting factor for choosing a
peer for download, information retrieval really has to find the document with
the best degree of match from any peer within the entire network. A well-
balanced expected precision and recall of such a content search is thus a major indicator
for the effectiveness of the search capabilities.
Very early on, information retrieval research encountered the necessity not
only to take the information in each document into account, but also to use
some background information regarding the entire collection, most prominently
the discriminatory power of each keyword within the specific collection. For ex-
ample, considering different news collections, the occurrence of the keyword
’basketball’ in a document will have a good discriminatory power in a general
news collection, a considerably lesser power to discriminate between documents
in a sports news collection and virtually no discriminative power within a
collection of NBA news. One of the most popular information retrieval mea-
sures thus is of the well-known TFxIDF type. This measure is a combination
of two parts (typically with some normalizations), the term frequency (TF,
measures how often a query term is contained in a certain document), and
the inverted document frequency (IDF, inverse of how often a query term
occurs in documents of the specific collection). Intuitively a document gets
more relevant the more often the query term(s) occur in the document and
the less often the query terms occur in other documents of the collection (i.e.
the more discriminating query terms are with respect to a collection). Though
TF can be determined locally by each peer, the IDF measure needs to inte-
grate collection-wide information and cannot be determined locally. A typical
instance of the TFxIDF measure (with s_q(D) as the score for query term q
in document D, DF_q as the number of documents in the collection that
contain q, and N as the total number of documents in the collection) is
e.g. given by:

   s_q(D) := ( TF_q(D) / max_{t∈D} TF_t(D) ) · log( N / DF_q )
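This score can be computed directly once the term statistics are known; a small sketch with invented numbers (function name and figures are ours, purely for illustration):

```python
import math

def tfxidf_score(tf_q, max_tf, N, df_q):
    """s_q(D) = TF_q(D) / max_{t in D} TF_t(D) * log(N / DF_q)."""
    return (tf_q / max_tf) * math.log(N / df_q)

# Invented statistics: 'basketball' occurs 4 times in D, the most frequent
# term in D occurs 8 times; the collection holds 10,000 documents, 100 of
# which contain 'basketball':
score = tfxidf_score(tf_q=4, max_tf=8, N=10_000, df_q=100)
# -> 0.5 * log(100) ≈ 2.30
```

Note that N and DF_q are exactly the collection-wide statistics that, as discussed below, a single peer cannot determine locally.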
point of failure and needs a high communication overhead for keeping track
of content changes in the network (e.g. peers changing their local content
by adding or deleting documents or peers joining or leaving the network).
Thus, distributed index structures that remain reliable even in the face of
network churn have to be used.
– Integration of Collection-Wide Information: Queries cannot be an-
swered by the individual peers having only local knowledge; rather, a peer
needs up-to-date collection-wide information for correct scoring. Con-
stantly disseminating this collection-wide information needs a high amount
of bandwidth, if the network is rather volatile, with a high number of peers
joining or leaving the network. Moreover, quick dissemination is necessary,
if peers show a certain locality in their interests and provide document
collections for specific topics, instead of a broad variety that resembles the
topic distribution of the network.
PlanetP [140], for example, summarizes each peer's local content using
Bloom filters for retrieval and disseminates these summaries throughout the
community using gossiping algorithms.
Collection Selection. If no central index of all collections’ contents is
given, choosing ’just the right’ collections for querying is a major problem. For
use in distributed environments like the WWW, several benefit estimators
for collection selection have been proposed. Basically these estimators use
aggregated statistics about the individual collections to estimate the expected
result quality of each individual collection. Expected qualities can then be
used for deciding which collections to select for querying or for determining
a querying sequence of the collections. The most popular benefit estimator is
the CORI measure [101], which computes the collection score si for collection
i with respect to a query q as:
   s_i := ( 1 / |q| ) · Σ_{t∈q} ( α + (1−α) · T_{i,t} · I_{i,t} )

with

   T_{i,t} := β + (1−β) · log(cdf_{i,t} + 0.5) / log(cdf_i^max + 1.0)

and

   I_{i,t} := log( (n + 0.5) / cf_t ) / log(n + 1.0)
where n is the number of collections, cdf_{i,t} is the document frequency of
term t in collection i, cdf_i^max is the maximum document frequency in
collection i, cf_t is the collection frequency of t, V_i is the term space of
collection i, i.e. the distinct terms in the collection's inverted index, and
V^avg is the average term space size of all collections whose inverted index
contains term t. However, it is important to note
that statistics like the collection frequency cf_t or the average term space size
V^avg have to be collected over all peers. That means they are collection-wide
information that cannot be determined locally but has to be disseminated
globally or estimated. The CORI estimators are also widely used in Peer-
to-Peer information retrieval, because they allow choosing collections with
a sufficient quality, while having to exchange only a very limited amount of
statistical data.
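A hedged sketch of this collection scoring (the function name, the parameter defaults α = β = 0.4, and all statistics are illustrative choices of ours, not necessarily those of [101]):

```python
import math

def cori_score(query_terms, cdf, cdf_max, cf, n, alpha=0.4, beta=0.4):
    """CORI-style belief score of one collection for a query. cdf maps each
    term to its document frequency within the collection, cdf_max is the
    collection's maximum document frequency, cf maps each term to the
    number of collections containing it, and n is the number of collections."""
    belief = 0.0
    for t in query_terms:
        T = beta + (1 - beta) * math.log(cdf.get(t, 0) + 0.5) / math.log(cdf_max + 1.0)
        I = math.log((n + 0.5) / cf[t]) / math.log(n + 1.0)
        belief += alpha + (1 - alpha) * T * I
    return belief / len(query_terms)

# Invented statistics: a sports collection where 120 documents contain
# 'basketball' versus a general-news collection with only 15:
sports = cori_score(["basketball"], {"basketball": 120}, 400, {"basketball": 2}, n=25)
general = cori_score(["basketball"], {"basketball": 15}, 400, {"basketball": 2}, n=25)
```

As expected, the sports collection receives the higher benefit estimate, since T_{i,t} grows with the term's document frequency in the collection.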
Metacrawlers. Closely related to the field of collection selection are
so-called metacrawlers like e.g. GlOSS [259] (shorthand for Glossary-of-
Servers Server). Metacrawlers have been designed in connection with the
text database discovery problem, i.e. the problem of selecting the most promis-
ing document collections from the WWW with respect to a query. The basic
idea is that a metacrawler does not crawl the actual document collection and
build a complete index over the documents, but rather collects only meta-
data about the individual collections like the number of documents in each
20.2 Indexstructures for Query Routing 343
collection and how many documents for each keyword (above a certain criti-
cal threshold number) in a collection are present. Abstracting from the actual
information which document contains the keyword, the indexes build by the
metacrawler are much smaller than inverted keyword indexes, however, of
course due to the aggregation of information also less reliable. For instance
the information whether keywords appear conjunctively in any document of
the collection is lost. But the resulting index can be handled centrally and
the meta-data used for giving probabilities of finding suitable documents in
each collection.
In GlOSS the usefulness of a collection for single keyword queries can be
characterized by the number of documents that contain the keyword normal-
ized by the total number of documents the collection offers. Building on the
assumption that keywords appear independently in documents, the usefulness
for multi-keyword queries is given as the product of the normalized numbers
for each individual keyword [259]. This basic text database discovery using
a central glossary of servers supports boolean retrieval and retrieval in the
vector space model (vGlOSS). Experiments on the GlOSS system show that
average index sizes can be reduced by about two orders of magnitude, while
the most useful collections are still estimated correctly (compared to a com-
plete inverted document index) in over 80% of cases. But still, since
the glossary index is a central index, it needs to be updated every time a
collection changes and thus does not lend itself easily to information retrieval
in Peer-to-Peer infrastructures.
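Under the independence assumption, the usefulness estimate reduces to a product of per-keyword document fractions; a minimal sketch (function name and collection statistics are invented for illustration):

```python
def gloss_usefulness(df, num_docs, query_terms):
    """GlOSS-style usefulness of a collection for a conjunctive query:
    the product of per-keyword document fractions, assuming keywords
    occur independently in documents."""
    usefulness = 1.0
    for t in query_terms:
        usefulness *= df.get(t, 0) / num_docs
    return usefulness

# Invented statistics: 1000 documents, 200 contain 'jazz', 50 contain 'vinyl':
u = gloss_usefulness({"jazz": 200, "vinyl": 50}, 1000, ["jazz", "vinyl"])
# -> 0.2 * 0.05 = 0.01, i.e. an expected 10 matching documents
```

The independence assumption is exactly what makes the glossary index so compact: only per-keyword counts are stored, never keyword combinations.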
Although the work on distributed information retrieval and metasearch is
definitely relevant related research, it addresses only the problem of integrat-
ing a small and typically rather static set of underlying retrieval engines and
information sources. Such a small federation of systems is of course less chal-
lenging than a collaborative search process in highly dynamic Peer-to-Peer
systems. We will take a closer look at specific techniques used in Peer-to-Peer
infrastructures in the following sections.
As has been stated, providing collection-wide information is essential for the retrieval ef-
fectiveness. There is a challenging trade-off between reduced network traffic
by lazy dissemination, which however leads to less effective retrieval, and a
large network traffic overhead caused by eager dissemination, which facilitates
very effective retrieval. What is needed is 'just the right' level of dissemination
to maintain a 'suitable' retrieval effectiveness. Hence, the proposed approaches
to disseminating collection-wide information rely on different techniques.
The PlanetP system [140] does not use collection-wide information like
e.g. the inverted document frequency of query terms directly, but circumvents
the problem by using a so-called inverted peer frequency (IPF).
The inverted peer frequency estimates for all query terms, which peers are
interesting contributors to a certain query. For each query term t the inverted
peer frequency is given by IPF_t := log(1 + N/N_t), where N is the number of
peers in the community and Nt is the number of peers that offer documents
containing term t. In PlanetP summarizations of the content in the form of
Bloom filters are used to decide what content a peer can offer. Since these
are eagerly disseminated throughout the network by gossiping algorithms,
each peer can locally determine the values of N and N_t. The relevance of a peer for
answering multi-keyword queries is then simply the sum of the inverted peer
frequencies for all query terms. Peers are then queried in the sequence of their
IPFs and the best documents are collected until newly queried peers no longer
improve the quality of the result set. In terms of retrieval effectiveness, [140]
shows that the approach is quite comparable to the use of inverted document
frequencies in precision and recall, and the documents retrieved using
IPF show an average overlap of about 70% with result sets retrieved using
IDF. However, by using gossiping to disseminate Bloom filters, the system's
scalability is severely limited.
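Peer ranking by summed IPF values can be sketched as follows (peer names and term sets are invented; this is a simplified illustration, not PlanetP's implementation):

```python
import math

def ipf(N, N_t):
    """Inverted peer frequency of a term: log(1 + N / N_t)."""
    return math.log(1 + N / N_t)

def rank_peers(peer_terms, query, N):
    """Rank peers by the summed IPF of the query terms they can serve."""
    N_t = {t: sum(1 for terms in peer_terms.values() if t in terms)
           for t in query}
    scores = {peer: sum(ipf(N, N_t[t]) for t in query if t in terms)
              for peer, terms in peer_terms.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Invented community of three peers and their (summarized) term sets;
# in PlanetP the membership test would use the gossiped Bloom filters:
peers = {"p1": {"jazz", "vinyl"}, "p2": {"jazz"}, "p3": {"rock"}}
ranking = rank_peers(peers, ["jazz", "vinyl"], N=3)
# p1 serves both terms and is ranked first
```

Peers are then contacted in this order until additional peers stop improving the result set.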
Structured Peer-to-Peer infrastructures allow for a more scalable way of
providing collection-wide information than simple gossiping. Based on the
notion that in answering a query current collection-wide information is only
needed for the query terms, each super-peer can disseminate such informa-
tion together with a query. [53] shows for a setting of distributed servers
hosting collections of newspaper articles that employing an index collecting
information like IDFs for certain query terms in parallel to the query routing
index can provide sufficiently up-to-date collection-wide information. The ba-
sic idea of both indexes is the same: the routing index of a super-peer states
what peers are interesting to address for a given query and the CWI index
provides collection-wide data for each keyword. The data in the CWI index
can change in two ways: like in routing indexes existing entries have only a
certain time to live, such that stale entries are periodically removed. On the
other hand it can be updated evaluating the answers of the peers that the
query was forwarded to. These peers can easily provide the result documents
together with local statistics about their individual collections. This statis-
tical information can then be aggregated along the super-peer backbone to
give an adequate snapshot of the currently most important document col-
lections for a keyword (e.g. document frequencies and collection sizes can be
added up). As stated in [607], the collection-wide information usually only
changes significantly if new peers join the network with corpora of docu-
ments on completely new topics. Since index entries only have a certain time
to live, occasionally flooding queries about query terms not in the index (and
disseminating only an estimation of the statistics needed), usually refreshes
the CWI index sufficiently, while not producing too many incorrect results.
Experiments in [53] show that by using a CWI index and disseminating the
collection-wide information together with the query, even in the presence of
massive popularity shifts the CWI index recovers quickly.
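The expiry behavior of such a CWI index can be sketched with a simple time-to-live map (a sketch under our own naming, not the data structure from [53]):

```python
import time

class CWIIndex:
    """Maps a keyword to collection-wide statistics with a per-entry
    time-to-live, so stale entries are dropped and eventually refreshed."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # term -> (stats, expiry time)

    def put(self, term, stats, now=None):
        now = time.time() if now is None else now
        self.entries[term] = (stats, now + self.ttl)

    def get(self, term, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(term)
        if entry is None or entry[1] < now:
            # Stale or unknown: the caller would flood the query and
            # refresh the entry from the peers' answers.
            self.entries.pop(term, None)
            return None
        return entry[0]

index = CWIIndex(ttl_seconds=60.0)
index.put("jazz", {"df": 120, "collection_size": 5000}, now=0.0)
fresh = index.get("jazz", now=30.0)   # still valid
stale = index.get("jazz", now=61.0)   # expired, entry removed
```

A miss on lookup is exactly the occasion for the occasional flooding described above, whose aggregated answers repopulate the entry.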
To estimate the novelty of the documents in a peer's collection with respect
to multi-keyword queries, the index lists or summaries of the peer have to
be investigated. Using Bloom filters as
summaries [66] proposes to build a peer p’s combined Bloom filter bp with
respect to the query as the bitwise logical AND of its filters for the individual
keywords and then estimate the novelty by comparing it to b_prev := ∨_{i∈S} b_i
as the union of those Bloom filters bi of the set of collections S that have
already been investigated previously during the retrieval process. The degree
of novelty can then be approximated by counting the locations where peer
p’s Bloom filter gives set bits that are not already set in the combined filter
of previous collections:
   |{k | b_p[k] = 1 ∧ b_prev[k] = 0}|
Analogously, the overlap between the collections can be estimated by
counting the number of bits that are set in both filters. Of course this is
only a heuristic measure as the actual documents have been abstracted into
summaries. Having the same summary, however, does not imply being the
same document, but only being characterized by the same keywords. That
means those documents are probably not adding new aspects for the user’s
information need as expressed in the query. Generally speaking, estimating the
overlap and preferentially querying peers that add new aspects to an answer set
is a promising technique for supporting information retrieval in Peer-to-Peer
environments and will need further attention.
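The two bit-counting estimates can be sketched directly on bit lists (real implementations would use packed bit arrays; the filters below are invented toy values):

```python
def novelty(bp, bprev):
    """Bits set in peer p's combined query filter but not yet seen in the
    union of previously inspected collections' filters."""
    return sum(1 for x, y in zip(bp, bprev) if x == 1 and y == 0)

def overlap(bp, bprev):
    """Bits set in both filters: an estimate of redundant content."""
    return sum(1 for x, y in zip(bp, bprev) if x == 1 and y == 1)

bp    = [1, 0, 1, 1, 0, 1]   # peer p's filter, ANDed over the query keywords
bprev = [1, 0, 0, 1, 1, 0]   # OR-union of the filters inspected so far
# novelty(bp, bprev) -> 2, overlap(bp, bprev) -> 2
```

A peer-selection strategy would then prefer the candidate peer with the highest novelty count relative to the collections already queried.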
where l is the shortest path between the topics in the taxonomy tree, h
is the depth level of the direct common subsumer, and α ≥ 0 and β > 0
are parameters scaling the contribution of shortest path length and depth,
respectively. Using optimal parameters (α = 0.2 and β = 0.6), this measure
shows a correlation with human similarity judgements that comes close to
the level of human replication. Experiments in a scenario of feder-
ated news collections in [53] show that the retrieval process can be effectively
supported, if documents can be classified sufficiently well by a taxonomy of
common categories.
This chapter has given a brief survey of techniques for information retrieval
in Peer-to-Peer infrastructures. In most of today’s applications in Peer-to-
Peer scenarios simple retrieval models based on exact matching of meta-data
are prevalent. Whereas meta-data annotation has to anticipate the use of
descriptors in later applications, information retrieval capabilities work on
more complex and unbiased information about the documents in each collec-
tion offered by a peer. Thus, such capabilities offer much more flexibility in
querying and open up a large number of semantically advanced applications.
Generally speaking, information retrieval differs from simple meta-data-
based retrieval in that a ranked retrieval model is employed where not only
some suitable peer for download needs to be found, but the ’best’ docu-
ments within the entire network must be located. Moreover and in contrast
to Gnutella-style infrastructures, querying has to be performed in a more ef-
ficient manner than simple flooding. Generally, only a small number of peers
should be selected for querying. In addition, the querying method has to be
relatively stable in the face of network churn, and since rankings usually rely
on collection-wide information, this information has to be estimated or
efficiently disseminated throughout the network.
The basic retrieval problem is heavily related to previous research in dis-
tributed information retrieval as is used for querying document collections in
the WWW. But the Peer-to-Peer environment still poses different challenges,
especially because network churn causes a much more dynamic retrieval en-
vironment and centralized index structures cannot be efficiently used. Also,
related work in Peer-to-Peer systems, e.g., distributed hash tables, cannot be
21.1 Introduction
Peer-to-Peer systems have been receiving considerable attention from the
networking research community recently. Several approaches have been pro-
posed as communication schemes in order to supply efficient and scalable
inter-peer communication. These schemes are designed on top of the physical
networking infrastructure as overlay networks, taking advantage of the rich
flexibility which overlays accomplish at low cost. A number of important design
approaches have already been presented in previous chapters. Their topologies
and operation mechanisms greatly influence the performance of routing and
topology maintenance algorithms and hence, the efficiency of the correspond-
ing Peer-to-Peer system.
However, Peer-to-Peer systems are distributed systems with a large num-
ber of non-functional requirements such as scalability, dependability (includ-
ing fault-tolerance, security, integrity, consistency), fairness, etc. These re-
quirements should be met in order to design systems which are easily de-
ployed on top of the Internet while making use of available resources in an
optimal way. Most approaches have been designed to deal with a subset of
these requirements. Nevertheless, they face intrinsic limitations in fulfilling the
complete set of the aforementioned requirements. In most cases, trade-offs in
meeting these requirements exist, raising severe constraints.
To elaborate further on the aforementioned trade-off issue, we consider the
design of a Peer-to-Peer system where fault-tolerance should be supported in
the presence of heterogeneous environments (peers may have different physi-
cal capabilities and behavioral patterns). In the context of Peer-to-Peer sys-
tems where peers represent unreliable components, fault-tolerance is achieved
mostly by the employment of redundancy and replication mechanisms. Pure
DHT-based approaches such as Chord [576] or Pastry [527] suggest a large
number of neighbors that usually increases logarithmically with respect to
the size of the system. While it has been shown that such approaches pro-
vide high fault-tolerance [385]1 , they ignore practical limitations raised by
peers of low physical capabilities that may not fulfill the continuously in-
creasing requirements as the system's size expands.
1 That study assumes a Peer-to-Peer system where both peers' inter-arrival and
service (lifespan) time distributions follow the Poisson model. However, it has
been empirically observed in many studies (cf. [98]) that peer lifespan follows
a different distribution.
In addition, by ignoring heterogeneity and dealing equally with each peer,
the system maintenance cost
increases significantly, while the least reliable peers contribute minimally to
the fault-tolerance of the system. Further, similar requirement trade-offs (i.e.
anonymity versus efficiency, heterogeneity versus load-balance, etc.) appear
when pure design approaches are selected.
In this chapter we investigate Peer-to-Peer systems, which follow a hy-
brid design approach. The word hybrid is used in many disciplines such as
in biology, in sociology or in linguistics. In general, it is used to characterize
“something derived from heterogeneous sources or composed of incongruous
elements" (Oxford Dictionary). Though the term "hybrid Peer-to-Peer
system" was initially used to describe approaches that combined both
Peer-to-Peer and Client/Server aspects, its usage was broadened to cover
further combinations of heterogeneous approaches.
In general, hybrid systems are claimed to be intrinsically better than pure
approaches, mostly because of the great heterogeneity observed in deployed
systems. They allow for the synergistic combination of two techniques with
more strengths and fewer weaknesses than either technique alone.
In the remainder of this chapter we investigate and define a coarse-grained
classification scheme for the observed topologies of the most important, state-
of-the-art, hybrid overlay networks, their underlying mechanisms and the
algorithms employed to operate on them. Then, we discuss their benefits
and drawbacks in a general system-unaware way that does not consider spe-
cific Peer-to-Peer systems, where hybrid approaches are compared with non-
hybrid approaches.
21.2 Overlay Network Design Dimensions
In order to meet the critical set of the aforementioned (and possibly ad-
ditional) non-functional requirements for the operation of the Peer-to-Peer
overlay networks, a great variety of approaches have been proposed. An-
alyzing the design mechanisms that characterize the Peer-to-Peer overlay
networks, three major design dimensions can be identified to classify the pro-
posed systems (cf. Figure 21.1). An alternative three-dimensional approach
is presented in [154].
Overlay networks vary in their structural design from tightly structured
networks such as Chord [576] or Pastry [527] to loosely structured ones such
as Freenet [124] or Gnutella [251]. This design dimension is graphically de-
picted in the projected axis of the design space in Figure 21.1. Tightly struc-
tured (or simply structured) overlays continuously maintain their topology,
targeting a "perfect" structure (e.g., a hypercube or a butterfly topol-
ogy). Structured topologies may incur a high maintenance cost, especially in
the presence of high churn rates. Also, they deal uniformly with the shared
objects and services provided by the system and they are unaware of their
query distribution, a fact that might cause a significant mismatch. Moreover,
Distributed Hash Table (DHT) based approaches (which are the most com-
mon mechanism to build structured overlay networks) cannot easily support
range queries2 .
data structures to distributed network topologies, such as tries [230] or modi-
fications of traditionally used topologies such as hypercubes [541], butterflies
[399] and multi-butterflies [155].
On the other hand, loosely structured (or simply unstructured ) overlays
do not aim to reach a predefined targeted topology, but rather they have a
more "random" structure. However, it has been observed that certain con-
nectivity policies (e.g., preferential attachment) may cause their topology to
evolve into power-law networks or networks with small-world characteristics. Unstruc-
tured topologies are typically inefficient in finding published, rare items and
the embedded searching operations are in general considerably costly in terms
of network overhead (most approaches use flooding or, at best, selective dis-
semination mechanisms [398]). The observed power-law topology, though it
provides a graph with a small diameter3 , distributes the communication
effort unevenly and introduces potential hot spots at those peers with high
degree playing the role of a "hub". However, in scenarios where the query dis-
tribution is non-uniform (i.e., lognormal, Zipf) unstructured networks may
be designed to operate efficiently.
Further, overlay networks may vary on the dependency of the peers on
each other, as shown on the vertical axis of Figure 21.1. Approaches
such as Chord or Freenet treat all of the participants equally and are
referred to as pure or flat Peer-to-Peer networks. On the other hand, hierarchi-
cal approaches such as Napster [436] or eDonkey [185] separate the common
overlay related responsibilities and assign the majority (or all) of the tasks
to a small subset of (usually) more powerful nodes only (e.g. for resource
indexing). This subset of peers is usually referred to as "servers", "super-peers"
or "ultra-peers". The fault-tolerance of flat approaches is considerably higher
than that of the hierarchical ones, since failures of or attacks on any single
peer do not have significant consequences. However, such approaches do not deal well
with the heterogeneity of the participating peers both in terms of physical
capabilities and user behavior. The complexity of flat approaches is usually
higher compared to the hierarchical counterparts. On the other hand, hi-
erarchical solutions require a certain infrastructure to operate and may be
controlled by third parties more easily than the non-hierarchical alternatives. The
operational load is unequally balanced among the networked entities and high
dependency exists among them.
2 Range queries are queries searching not for a single item that matches a specific
key but rather for a set of items which are "close" to a description based on e.g.
metadata.
3 A small diameter is a desirable feature for a network topology in order to reduce
the maximum number of hops required to reach any destination in the overlay.
[Figure 21.1: The overlay network design space, spanned by three axes:
Structure (loosely structured to tightly structured), Dependency (pure/flat
to hierarchical), and Determinism (probabilistic to deterministic)]
In this chapter we focus on systems that lie in the middle of at least one of
the axes shown in Figure 21.1, though many of the proposed systems follow
hybrid mechanisms in more than one dimension. By doing so, hybrid designs
aim to deal with the limitations of the pure approaches.
21.3.1 JXTA
[Figure: A JXTA overlay with rendezvous peers RDV1-RDV6, edge peers
P1, P2 and PX attached to them, and an advertisement propagated through
the rendezvous peers]
21.3.2 Brocade
The majority of DHTs assume that most nodes in the system are uniform in
resources such as network bandwidth and storage. As a result, messages are
often routed across multiple autonomous systems (AS) and administrative
domains before reaching their destinations.
Brocade is a hybrid overlay network proposal, where a secondary overlay
is layered on top of a primary DHT. The secondary overlay exploits knowl-
edge of underlying network characteristics and builds a location-aware layer
between “supernodes”, which are placed in critical locations in each AS of the
Internet. Supernodes are expected to be endpoints with high bandwidth and
fast access to the wide-area network and act as landmarks for each network
domain. Messages sent across different ASs can be delivered much faster if
normal peers are associated with their nearby supernodes, which can operate
as "shortcuts" to tunnel the messages towards their final destination.
[Figure: A Brocade overlay spanning several ASs; within each AS, normal
peers (P) attach to a superpeer (SP), and the superpeers form a secondary
location-aware overlay between the ASs]
21.3.3 SHARK
Pure DHT-based solutions rely on hash functions, which may map adver-
tised items to certain locations in the overlay structure (by assigning hash-
generated identifiers both to each item and overlay location). Such mecha-
nisms (while they are very efficient) are limited to single lookup queries of
these identifiers. Range (or rich) search queries based on keywords remain
challenging features for such systems. However, usually users prefer to specify
what they are looking for in terms of keywords. For instance, a user of a file
sharing application could look for a certain genre of music and not for a par-
ticular song. Additionally, multiple dimensions of meta-data are also highly
desirable, for instance, looking for a document released at a certain period
and related with a specific topic.
SHARK (Symmetric Hierarchy Adaption for Routing of Keywords) [421]
employs a hybrid DHT solution for rich keyword searching. Its hybrid over-
lay structure is composed of two parts: a structured one that considers the
Group of Interest (GoI) concept of AGILE [420] and several unstructured
subnetworks grouping peers with similar interests. Queries are initially
forwarded through the structured part of the network to reach the targeted un-
structured subnetwork. Then, they are broadcast to the set of interested
peers that provide the matching items. SHARK is described in more detail
in Section 17.4.
21.3.4 Omicron
The upper part of Figure 21.4 shows a (2, 3) directed de Bruijn
graph denoting a graph with a maximum out-degree of 2, a diameter of 3 and
order 8. Each node is represented by k-length (three in this example) strings.
Every character of the string can take d different values (two in this example).
In the general case each node is represented by a string such as u1 u2 ...uk .
The connections between the nodes follow a simple left shift operation from
node u1 (u2 ...uk ) to node (u2 ...uk )ux , where ux can take one of the possible
values of the characters (0, d − 1). For example, we can move from node (010)
to either node (100) or (101).
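The left-shift neighbor rule can be sketched in a few lines (a minimal illustration; the function name is our own):

```python
def de_bruijn_successors(node: str, d: int) -> set[str]:
    """Successors of a node in a (d, k) de Bruijn graph: drop the first
    character (left shift) and append any of the d symbol values."""
    return {node[1:] + str(x) for x in range(d)}

# Example from the text: node 010 in the (2, 3) graph.
print(sorted(de_bruijn_successors("010", 2)))  # ['100', '101']
```

Since every node has at most d successors and strings have length k, any node can be reached in at most k shifts, which is exactly the diameter stated above.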
[Figure 21.4 (upper part): the (2, 3) directed de Bruijn graph with the eight nodes 000, 001, 010, 011, 100, 101, 110, 111 and edges labeled by their left-shift transitions, e.g., 0(01)1 from node 001 to node 011.]
[Figure 21.4 (lower part): Omicron clusters composed of nodes with the roles R (routing), I (indexing), M (maintenance), and C (caching).]
21.4.1 OceanStore
Further Reading
For the interested reader, a number of further citations are provided here
to stimulate further research. A hybrid topology inspired by Peer-to-Peer
overlays and applied in mobile ad hoc networks can be found in [364]. A
two-tier hierarchical Chord is explored in [238]. Measurement efforts on the
KaZaA system can be found in [384]. A hybrid topology that extends Chord
to increase the degree of user anonymity can be found in [562]. An early
comparison of some pioneering hybrid approaches such as Napster and Pointera
is provided in [634]. A hybrid protocol named Borg [641] aims at scalable
application-level multicast. A generic mechanism for the construction
and maintenance of superpeer-based overlay networks is proposed in [428]. A
Peer-to-Peer overlay network simulator has been implemented especially to
support the evaluation of a wide range of hybrid design methods [152].
An approach to deploying hybrid Content Delivery Networks (CDNs)
based on an ad hoc Peer-to-Peer overlay and a centralized infrastructure is
described in [633].
22. ISP Platforms Under a Heavy Peer-to-Peer
Workload
Gerhard Haßlinger (T-Systems, Technologiezentrum Darmstadt)
22.1 Introduction
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 369-381, 2005.
Springer-Verlag Berlin Heidelberg 2005
From 1999, Napster offered a platform for file sharing, which within a few
months generated a considerable portion of traffic (> 20%) on IP networks in
the USA. Despite the shutdown of Napster due to copyright infringements
and the persisting problems of illegal content distribution, file sharing traffic
continued to increase until it became the dominant source of traffic [555].
Table 22.1 shows some representative measurement results for the com-
ponents of the Internet traffic mix in Europe in 2003-2004. Based on the
evaluation of TCP ports, more than half of the traffic is attributed to Peer-
to-Peer applications. The Peer-to-Peer traffic portion becomes even larger
when observed at the application layer [475], e.g., from 50% to almost 80%
as reported in [40]. More recently, most FastTrack traffic has been replaced
by BitTorrent activity in Deutsche Telekom’s and France Telecom’s traffic
statistics, while eDonkey is still dominant [492].
The application mix on IP networks also varies with the time of day:
web browsing (HTTP) oscillates between a peak in the heavy-traffic hours in
the evening and almost no activity for some hours after midnight. Figure 22.1
illustrates the main traffic portions of Peer-to-Peer, web browsing (HTTP), and
other applications, which again have been distinguished via standard TCP
ports.
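The port-based distinction mentioned above can be illustrated with a minimal classifier (the port-to-class mapping is an illustrative assumption; real measurement studies use far more extensive port lists):

```python
# Illustrative mapping of well-known TCP ports to traffic classes.
PORT_CLASSES = {
    80: "HTTP", 443: "HTTP",      # web browsing
    4662: "Peer-to-Peer",         # eDonkey
    1214: "Peer-to-Peer",         # FastTrack
    6881: "Peer-to-Peer",         # BitTorrent
}

def classify(port: int) -> str:
    """Assign a traffic class by TCP port; unknown ports fall into 'Other'."""
    return PORT_CLASSES.get(port, "Other")

print(classify(4662))  # Peer-to-Peer
print(classify(25))    # Other
```

The large "Other / unknown" share in Table 22.1 is a direct consequence of this fallback: any flow on an unregistered port cannot be attributed.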
Peer-to-peer applications again dominate the traffic volume and at the
same time show an overall smoothing effect on traffic profiles as compared
with client-server architectures for several reasons:
Table 22.1: Components of the Internet traffic mix in Europe in 2003-2004
(traffic shares from four measurements, distinguished by TCP ports):

  eDonkey                               60 %   38 %   ~54.5 %   ~20 %
  FastTrack                              6 %    8 %    ~1 %     ~10 %
  BitTorrent                             -      -      ~3.5 %   ~16 %
  Other Peer-to-Peer                     4 %    4 %    ~1 %     ~10 %
  All Peer-to-Peer                      70 %   50 %   ~60 %     ~56 %
  HTTP                                  10 %   15 %    -        ~12 %
  Other (non-Peer-to-Peer / unknown)    20 %   35 %    -        ~32 %

– Traffic variability over time:
The daily traffic profiles in broadband access platforms typically show high
activity during the daytime or in the evening. For Peer-to-Peer traffic, the
ratio of the peak to the mean rate is usually smaller than 1.5 due to back-
ground transfers which often last throughout the night. Web browsing and
many other applications have a ratio of 2 or higher. The ongoing Peer-
to-Peer data transfers through the night are initiated by long-lasting
background downloads of large video files with sizes often in the Gigabyte
range. When peers are connected via ADSL access lines, the throughput
of Peer-to-Peer transmission is limited by the upstream speed of the peers.
Thus it takes hours or days for a peer to download a file of Gigabyte size
at a rate of about 100 kbit/s, even when the peers stay continuously online.
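The stated duration follows directly from the file size and the upstream rate; a quick back-of-the-envelope check:

```python
def download_hours(file_bytes: float, rate_bit_per_s: float) -> float:
    """Lower bound on the transfer time: size in bits divided by the rate."""
    return file_bytes * 8 / rate_bit_per_s / 3600

# A 1-GB file at ~100 kbit/s upstream, as in the text:
print(round(download_hours(1e9, 100e3), 1))  # 22.2 (hours)
```

So even the idealized transfer of a single Gigabyte file keeps the upstream link busy for the better part of a day, which explains the smooth night-time traffic profile.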
Fig. 22.1: Typical profile of Peer-to-Peer, HTTP/HTTPS, and other traffic on
Deutsche Telekom’s IP platform, measured over three days in the second half of 2003
In recent years, the deployment of broadband access lines for the mass mar-
ket, together with the extensive use of file sharing applications, has pushed up
the traffic volume [459]. From 2000 to 2003, Deutsche Telekom supplied several
million homes with ADSL access, and the traffic on the Internet platform
increased by a factor of more than 100 over this 3-year period, coming close
to the scalability limits of mature carrier-grade routing equipment available
on the market. Meanwhile, broadband access penetration is continuously
expanding, with higher access speeds being offered, while the traffic growth
rate is flattening. In general, there are still undoubted requirements for larger
bandwidths in telecommunication networks, whereas the future development
of Peer-to-Peer traffic is difficult to predict.
Most video files currently use MPEG compression in order to adapt
to limited transmission capacities at the cost of reduced quality. Television
studios, on the other hand, demand high resolution for their video transmissions
in high-definition television (HDTV) quality, together with lossless coding
schemes. The corresponding transmission rate amounts to several Gbit/s for
a single video stream. Today’s IP backbone and access speeds would have to
be increased about 1000-fold for a widespread transport of video in HDTV
quality. Although many people may nowadays be satisfied with low video
quality, improving quality for video and further emerging broadband
applications will continue to increase the demand for even larger bandwidths
for at least the next decade.
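The order of magnitude of an uncompressed video stream can be checked with a rough calculation (the resolution, color depth, and frame rate used here are illustrative assumptions):

```python
# Back-of-the-envelope rate of an uncompressed HD video stream,
# assuming 1920x1080 pixels, 24 bit per pixel, 30 frames per second.
bits_per_s = 1920 * 1080 * 24 * 30
print(round(bits_per_s / 1e9, 2), "Gbit/s")  # ~1.49 Gbit/s
```

Already these moderate assumptions land in the Gbit/s range quoted above; higher frame rates or chroma depth push a single lossless stream to several Gbit/s.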
With regard to Peer-to-Peer applications, the illegitimate use of copyright-
protected content and the future effect of countermeasures are unknown
influencing factors. In addition, legal downloads and video streaming are cur-
rently being offered on client-server architectures over the Internet under
conditions acceptable for the mass market, which may partially satisfy the
demands of present Peer-to-Peer users. Moreover, scalability and many
security aspects, including resistance against denial-of-service attacks, seem to
be handled without much care by current Peer-to-Peer protocols.
On the other hand, the superior efficiency of the Peer-to-Peer principle for
the fast, widespread distribution of large amounts of data is attractive for
software distribution, especially when updates are frequently required, e.g., for
virus scanners. There is high potential for supporting various upcoming
applications with Peer-to-Peer networks, including online gaming and radio and
TV over IP, with some candidates for drivers of future traffic growth among
them.
through peering points and the backbone of the IP platform to which the
destination is attached.
Fig. 22.2: Globally Distributed Sources for Downloading a File with eDonkey
On the other hand, Figure 22.3 illustrates the usual access and back-
bone structure of broadband access providers. Tree-shaped access areas are
attached to the backbone at points of presence (PoPs), where remote ac-
cess control routers handle the registration and the sessions set up for the
users. For large provider networks serving millions of subscribers, it can be
expected that a majority of the data in global file sharing systems is already
replicated at some sources on the same ISP platform, and often even in the
same access area. This holds especially for the most popular and most
referenced data, since it is observed that the major portion of downloads
comes from a small set of very popular files. Thus the source distribution
of Figure 22.2 indicates unnecessarily long transmission paths, increasing the
traffic load between autonomous systems and in backbone areas. This leaves
potential for more efficient data exchange if a better match of the network
structures on the application and IP layers were achieved.
In the considered example, a Linux software file was downloaded. The
situation may be different when most audio and video data is exchanged by
people of one country in their own language. France Telecom observed that
a major part of the file sharing traffic on their Internet platform is local to
France, as can be expected from the differentiation of communities by language
[196].
[Fig. 22.3: Access and backbone structure of a broadband provider: customer premises equipment connects via Ethernet, SDH, ATM etc. to label edge routers at a PoP; an MPLS backbone with control systems for Internet access leads to local peering partners and the global Internet.]
Web caches provide an opportunity to optimize traffic flows. However,
conventional web caches do not capture Peer-to-Peer traffic and have therefore
become less effective. On the other hand, caches can be set up specifically for
Peer-to-Peer traffic. Such a cache acts as a proxy node in a Peer-to-Peer
network that stores a large amount of data. A major problem of conventional
web caching is that data in the cache has often expired because it has already
been updated on the corresponding web site. Peer-to-Peer file sharing systems
are not subject to expired data, since data is referenced via unique hash
identifiers.
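Why hash-based referencing rules out stale cache entries can be illustrated with a minimal content-addressed store (a sketch using SHA-1; actual file sharing protocols use their own hash functions, e.g., MD4-based hashes in eDonkey, and the class and method names are our own):

```python
import hashlib

class ChunkCache:
    """Minimal sketch of a Peer-to-Peer chunk cache keyed by content hash.
    Because the key is derived from the data itself, a cached chunk can
    never be stale: changed content necessarily has a different key."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def key(chunk: bytes) -> str:
        return hashlib.sha1(chunk).hexdigest()

    def put(self, chunk: bytes) -> str:
        k = self.key(chunk)
        self._store[k] = chunk
        return k

    def get(self, k: str):
        # Returns the chunk, or None on a cache miss.
        return self._store.get(k)

cache = ChunkCache()
k = cache.put(b"some file chunk")
assert cache.get(k) == b"some file chunk"
```

A web cache, by contrast, is keyed by URL, and the content behind a URL can change at any time, which is exactly the expiry problem described above.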
Such caches are not intended to play an active role in Peer-to-Peer net-
works. They should be transparent and used only to shorten transmission
paths to distant nodes, but should not be directly addressable as a source in
the Peer-to-Peer network. When a download is requested for a data chunk
available in the cache, the cache can respond in place of the source that was
selected in the search phase. For transparency reasons, the data should be
transferred as if it originated from the selected source with regard to
– the IP addresses,
– the upstream access bandwidth of the source, and
– possible changes in the status of the source, e.g., accounting for balanced
up- and download volumes of Peer-to-Peer network nodes.
However, caches cannot be made completely transparent. A data transfer
from the cache will usually have a shorter transmission delay, and a cache will
not be able to mimic the available upstream rate of the original source,
including time-varying bottlenecks on the transmission path behind the cache.
But at least the online availability and the access speed of the original source
should be taken into account. In fact, the upload capacity of the caches
will substitute a part of the upload capacity of the nodes in the Peer-to-Peer
network, with consequences for the total data throughput. The efficiency of
caches depends on the source selection by the Peer-to-Peer protocol. In prin-
ciple, unnecessary load on the backbone and on expensive international links
can be avoided.
To work in this way, caches for Peer-to-Peer traffic have to be adapted to
discover data for the most popular protocols in use. They do not reduce the
messaging overhead in the search phase. An alternative approach has been
taken by eDonkey, where caches of service providers can be directly included
as a configuration option of the Peer-to-Peer protocol. An open issue for
caching again lies in its partial use for illegal content, which was already a
problem before Peer-to-Peer became popular, but is becoming more serious
with file sharing.

22.4 Implications for QoS in Multi-service IP Networks
The Internet has developed from a data transfer application into a service-
integrating platform with a steadily increasing variety of service types, including
file transfer, email, web browsing, voice over IP, Peer-to-Peer data exchange,
etc. Each service type has specific quality of service (QoS) demands re-
garding bandwidth and transmission time, e.g., real-time constraints, as well as
tolerance for transmission errors and failure situations. Peer-to-Peer data ex-
change is usually of the best-effort service type without strict QoS demands.
Downloads often run for several hours or days in the background.
Although shorter transfers or even real-time transmissions would be desir-
able for some Peer-to-Peer applications, users are aware that economic tariffs
in a mass market impose access bandwidth limitations, so that broadband
transfers require considerable time even with increasing access speeds.
On the other hand, the impact of Peer-to-Peer traffic on other services
has to be taken into account. The present traffic profile in IP networks, with
a dominant Peer-to-Peer traffic portion of the best-effort type, suggests that
the differentiated services architecture [102, 618, 76] is sufficient as a simple
scheme to support QoS by introducing traffic classes to be handled at different
priorities.
Since presently less than 20% of the traffic in ISP networks seems to
have strict QoS requirements, including voice over IP and virtual private net-
works (VPNs), sufficient QoS could be guaranteed for those traffic types by a
strict forwarding priority. Even applications like web browsing and email are
included in a 20% portion of the traffic with the most demanding QoS require-
ments.
If the network is dimensioned for a traffic volume substantially larger
than that generated by the preferred traffic classes alone, these classes will not
suffer from bottlenecks and queueing delays in normal operation. Conversely,
the impact of preferring a small premium traffic class on the much larger
portion of Peer-to-Peer traffic is moderate.
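The effect of a strict forwarding priority can be sketched with a toy scheduler (an illustrative sketch; the names and packet labels are our own):

```python
from collections import deque

def serve(premium: deque, best_effort: deque, slots: int) -> list:
    """Strict-priority forwarding: in each service slot, a waiting premium
    packet always goes first; best-effort packets get the remaining slots."""
    out = []
    for _ in range(slots):
        if premium:
            out.append(premium.popleft())
        elif best_effort:
            out.append(best_effort.popleft())
    return out

# A small premium class amid a large best-effort (Peer-to-Peer) load:
prem = deque(["voip1", "voip2"])
p2p = deque(f"p2p{i}" for i in range(8))
print(serve(prem, p2p, 5))  # ['voip1', 'voip2', 'p2p0', 'p2p1', 'p2p2']
```

The premium packets never wait behind best-effort traffic, while the best-effort flow loses only the few slots consumed by the small premium class.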
The delivery of premium traffic classes can even be assured in some
failure situations, e.g., for single link breakdowns, provided that restoration
links are available in an appropriate network design. Since overload may occur
on those links, the best-effort traffic will then often be affected.
Nowadays, Peer-to-Peer protocols can cope with temporary disconnections
on the application layer and resume transmission from the last consistent state
afterwards. When file transfers via FTP or HTTP are interrupted,
a substantial part of the transmission is often lost and a complete restart of
the transfer may be required. The segmentation and reassembly of large data
files into small chunks improves the reliability and efficiency of Peer-to-Peer
transfers, which is essential given the non-assured QoS of best-effort transmission.
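The benefit of segmentation can be sketched as follows (the toy chunk size and function names are illustrative):

```python
CHUNK = 4  # toy chunk size in bytes

def split(data: bytes) -> dict:
    """Segment a file into fixed-size chunks, keyed by byte offset."""
    return {i: data[i:i + CHUNK] for i in range(0, len(data), CHUNK)}

def resume(chunks: dict, done: set) -> list:
    """After a disconnection, only the chunks not yet completed have to
    be re-requested -- no full restart as with a plain FTP/HTTP transfer."""
    return sorted(set(chunks) - done)

chunks = split(b"abcdefghij")          # chunk offsets 0, 4, 8
print(resume(chunks, done={0, 8}))     # [4]
```

A disconnection thus costs at most one partially transferred chunk, and the missing chunks can even be fetched from different sources afterwards.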
An obstacle to the application of differentiated services is the difficulty of
classifying Peer-to-Peer traffic. A lower-priority treatment based on TCP
port numbers will increase the tendency to disguise Peer-to-Peer applications
by using randomly chosen ports for unknown protocols or by transporting
Peer-to-Peer data exchange, e.g., over the HTTP port for web browsing. There-
fore, the only efficient way to classify traffic seems to be a declaration
and marking of the complete premium-type traffic by the users themselves
or by the originating servers, combined with a corresponding differentiated
tariff scheme. But even then, unresolved problems remain for supporting QoS
for inter-domain traffic and for QoS-sensitive traffic which is transferred over
peering points into a service provider’s platform.
22.5 Conclusion
The present traffic mix, with a dominant portion of best-effort type data
exchanges in the background, has implications for the quality of service con-
cept, suggesting that differentiated services [1] are sufficient to support QoS-
sensitive traffic types. Presently, such traffic types generate a comparatively
small traffic volume, even when applications like web browsing are included.
A comprehensive and appropriate classification of service types is still subject
to many unresolved issues.
23. Traffic Characteristics and Performance
Evaluation of Peer-to-Peer Systems
Kurt Tutschku, Phuoc Tran-Gia (University of Würzburg)
23.1 Introduction
Peer-to-Peer services have become the main source of traffic in the Inter-
net and are even challenging the World Wide Web (WWW) in popularity.
Backbone operators and Internet Service Providers (ISPs) consistently report
Peer-to-Peer-type traffic volumes exceeding 50 % of the total traffic in their
networks [42, 337, 372, 556], sometimes even reaching 80 % at off-peak times
[39, 236]; see also Chapter 22.
Peer-to-Peer services are highly attractive due to their simple administra-
tion, their high scalability, their apparent robustness, and their easy deployment.
The use of distributed, self-organizing Peer-to-Peer software might reduce the
capital and operational expenditures (CAPEX and OPEX) of service opera-
tors, since fewer entities have to be installed and operated. In a commercial
context, high-performance Peer-to-Peer means that these services meet tight
statistical performance bounds; for carrier-grade operation, this bound is typically
99.999 %, the so-called “five nines” concept. Before Peer-to-Peer services or
Peer-to-Peer-based algorithms can be released in a production environ-
ment, it has to be evaluated whether these Peer-to-Peer-based solutions meet
such requirements.
The aim of this chapter is to present selected characteristics of Peer-to-
Peer traffic and to discuss their impact on networks. In addition, the chapter
outlines what performance can be expected from Peer-to-Peer-based al-
gorithms and which factors influence Peer-to-Peer performance. First,
Section 23.2 discusses the relationship of basic Peer-to-Peer functions
with performance. Section 23.3 is dedicated to the traffic patterns of
popular Peer-to-Peer services. In particular, the characteristics of Gnutella
overlays (Section 23.3.1) and of the eDonkey file sharing application in wire-
line and wireless networks (Section 23.3.2) are investigated. The efficiency of
a Chord-like resource mediation algorithm is discussed in Section 23.4. Sec-
tion 23.5 is devoted to the performance of exchanging resources in a mobile
Peer-to-Peer architecture.
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 383-397, 2005.
Springer-Verlag Berlin Heidelberg 2005
[Fig. 23.1: Classification of content distribution services by resource mediation (operator-centric, centralized vs. user-oriented, P2P) and resource access (centralized vs. decentralized); the examples range from client/server, cache, and proxy to multicast and eDonkey.]

A detailed discussion of Figure 23.1 is provided in [27].
Gnutella was one of the first successful Peer-to-Peer file sharing applications
[335] and, due to its pure Peer-to-Peer architecture, largely sparked the wide-
spread interest in Peer-to-Peer. The Gnutella service forms an application-
specific overlay of Internet-accessible hosts running Gnutella-speaking appli-
cations like LimeWire [388] or BearShare [65]. In Gnutella, the overlay is used
for locating files and for finding other peers, the latter in order to maintain
the integrity of the overlay. The initial version of Gnutella [126] uses a simple
flooding protocol combined with a back-tracking mechanism for locating
resources (files or hosts) in the overlay. While qualitative evaluations have
revealed that Gnutella suffers from scalability problems [518], few quantitative
results are known on the traffic and the dynamics in Gnutella overlays.
In particular, the time scale and the variability of the number of virtual
connections have to be characterized [601].
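The flooding mechanism can be sketched as TTL-limited forwarding over the overlay (an illustrative sketch; real Gnutella additionally uses message identifiers for duplicate suppression and routes replies back along the query path, which is the back-tracking mentioned above):

```python
def flood_query(overlay: dict, start: str, ttl: int) -> set:
    """Forward a query to all neighbors, decrementing a TTL per hop;
    already-visited peers are skipped to suppress duplicates."""
    visited, frontier = {start}, [start]
    for _ in range(ttl):
        nxt = []
        for peer in frontier:
            for neighbor in overlay.get(peer, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    nxt.append(neighbor)
        frontier = nxt
    return visited  # all peers reached by the query

overlay = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(sorted(flood_query(overlay, "a", ttl=2)))  # ['a', 'b', 'c', 'd']
```

Since every reached peer forwards the query to all its neighbors, the message count grows roughly with the node degree raised to the TTL, which hints at the scalability problem and the traffic volumes measured below.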
Measurements at an unrestricted Gnutella client were carried out
in March 2002 at the University of Würzburg. The observations (cf. Fig-
ure 23.2) reveal that even without sharing files, a Gnutella client consumes
a tremendously high amount of bandwidth for locating resources (files or
hosts), reaching the order of tens of Mbit/s.
[Fig. 23.2: Mean and 95%-percentile of the load (in Mbit/s) at an unrestricted Gnutella client over 60 hours.]
In addition, Figure 23.2 shows that the traffic in Gnutella overlays varies
strongly over short timescales. This is mainly due to the use of flooding
protocols in Gnutella.
23.3.2 eDonkey
sion of file chunks to a downloading peer. The traffic profile [600] shows that
resource mediation traffic (also denoted as “signaling” traffic) and download
traffic have significantly different characteristics. Figure 23.5 depicts a scatter
plot graphically describing the correlation of the TCP holding time and the
size of eDonkey flows.
Each dot in the scatter plot represents an observed eDonkey flow. The
brighter dots are identified download flows; the dark dots represent non-
download connections. The scatter plot shows that almost all identified down-
load flows lie within the same region. In turn, the non-download flows lie
in a disjoint region of the plot. This reveals that download and non-
download flows have significantly different characteristics; a Peer-to-Peer
traffic model has to distinguish between both types of traffic.
The differences between the two types of traffic are underlined in Fig-
ures 23.6 and 23.7. The complementary cumulative distribution func-
tion (CCDF) of the flow size is depicted in Figure 23.6. Part (a) of Fig-
ure 23.6 shows that the download flow size decreases faster than linearly in
the log/log plot. That means that the flow sizes do not exhibit a strong
heavy-tail characteristic.
An approximation of the observed data with a lognormal distribution
achieves a good fit. This reduced strength of the heavy-tail feature was
not expected, but can be explained: the download flows are limited due to the
segmentation of files into chunks and due to the application of the multiple
source download principle. This observation gives evidence that the
“mice and elephants” phenomenon [479, 73] in eDonkey traffic is not as severe
as expected.
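The lognormal fit mentioned above can be reproduced on synthetic data (the parameters are illustrative; the maximum-likelihood estimates of a lognormal are simply the mean and standard deviation of the logarithms):

```python
import math, random

random.seed(42)
# Synthetic "flow sizes": lognormally distributed, i.e., log(size) is normal.
sizes = [random.lognormvariate(12.0, 2.0) for _ in range(20000)]

# Maximum-likelihood fit of the lognormal parameters mu and sigma.
logs = [math.log(s) for s in sizes]
mu = sum(logs) / len(logs)
sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
print(round(mu, 1), round(sigma, 1))  # close to the true 12.0 and 2.0
```

Plotting the CCDF of such samples on log/log axes yields the faster-than-linear decay described for Figure 23.6(a), in contrast to the straight line a true heavy-tailed (Pareto-like) distribution would produce.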
Fig. 23.5: Correlation of eDonkey TCP holding time and flow size (scatter plot; annotations mark a “stable” state of search information exchange and an application-level timeout)
Part (b) of Figure 23.6 depicts the size of non-download flows. The prob-
ability that a flow is larger than a given value decreases almost exponen-
tially up to a limit (approx. 14 Kbytes). Beyond this limit, the decrease is
irregular. This is expected behavior, since non-download flows are typically
signaling flows that renew requests.
Figure 23.7 depicts the CCDF of the eDonkey flow holding times on the TCP
level. The CCDF of the download connection holding time decreases moderately,
cf. Figure 23.7(a), and resembles more a linear decay in the log/log plot.
The CCDF of the holding time of non-download streams, cf. Figure 23.7(b),
decreases rapidly and irregularly. This is expected behavior, since
non-download connections are short and only weakly sensitive to TCP
flow control.
[Fig. 23.6: CCDF of the eDonkey flow size (in bytes) for (a) download flows, showing a strong decay and a good lognormal approximation, and (b) non-download flows.]

[Fig. 23.7: CCDF of the eDonkey flow holding time (in seconds) for (a) download and (b) non-download connections.]
[Figure: eDonkey downloads in a mobile environment: downloaded data (in MB) and bandwidth (in KB/s) over time for a mobile downloading peer served by a fixed and a mobile sharing peer; annotated set-up times (4.73 s, 11.03 s), idle times (48.6 s, 48.3 s), and download times (207 s fixed-to-mobile, 711 s mobile-to-mobile); the throughput for multiple source download (fixed/mobile to mobile) and the connection holding times are indicated.]
[Figure: search delay in a Chord system versus Chord size n: (a) mean, 0.95-, 0.99-, and 0.9999-quantile of the search delay for n up to 10000; (b) normalized search delay bound (in units of E[TN]) for coefficients of variation cT = 0.5, 1, and 2, for n up to 1000.]
[Figure: mobile Peer-to-Peer architecture: mobile peers in a 2.5/3G mobile network within the mobile operator domain, supported by a crawler, an index server, and a cache peer, and connected to peers in the Internet; data, signaling, and enhanced signaling paths are distinguished.]
[Figure: CCDF of the file download time (in minutes) for churn periods of 30 min, 2 h, and 12 h, shown for two system configurations.]
The results show that the churn behavior of the peers has a significant impact
on the download time of files; however, the additional infrastructure entity,
the cache peer, can reduce this effect.
Figure 23.14 compares the CCDF of the download time for popular and
unpopular mp3 files of 8 MBytes. The UMTS subscribers obtain quite reason-
able performance, since the download time exceeds 1 hour only with
a small probability. The GPRS subscribers, on the other hand, experience much
higher download times, and the shape of their curve is completely different
from the CCDF for UMTS. It seems that there exists a minimal required up-
load/download bandwidth of the peers for a given file size in order to retrieve
[Fig. 23.14: CCDF of the download time (in minutes) for GPRS and UMTS subscribers, for (a) popular and (b) unpopular files.]
a file efficiently. The shape of the UMTS curve in Figure 23.14 is characteris-
tic of the CCDF of the download time in an efficient system, while the GPRS
curve illustrates the behavior of inefficient systems. This effect becomes even
more obvious for unpopular files, which are not cached by the cache peer.
The results of Figure 23.14 show that mobile Peer-to-Peer file sharing is
almost impossible with GPRS, whereas UMTS is a good candidate for efficient
Peer-to-Peer file swapping.
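The existence of such a minimal bandwidth can be made plausible by a simple lower bound on the transfer time (the nominal GPRS and UMTS rates of 40 and 384 kbit/s used here are illustrative assumptions):

```python
def min_download_minutes(file_mb: float, rate_kbit_s: float) -> float:
    """Idealized lower bound: file size in bits divided by the access rate."""
    return file_mb * 8e6 / (rate_kbit_s * 1e3) / 60

# The 8-MByte mp3 file of Figure 23.14 under assumed nominal rates:
print(round(min_download_minutes(8, 384), 1))  # UMTS: ~2.8 min
print(round(min_download_minutes(8, 40), 1))   # GPRS: ~26.7 min
```

Since churn, signaling, and the upstream limits of the sharing peers inflate these idealized values considerably, the GPRS bound already leaves no margin for efficient file swapping.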
23.6 Conclusion
tion 23.3.1). Moreover, the measurements have revealed that multiple source
download mechanisms do not intensify the “mice and elephants” phenomenon
(cf. Section 23.3.2) and that these mechanisms can perform efficiently even
in mobile environments.
In the case of a mobile environment, the measurements and the performance
evaluation indicate that an optimal transfer segment size exists, which
depends on the type of access network and on the churn behavior of the peers.
The determination of this size is left for further research.
24. Peer-to-Peer in Mobile Environments
Wolfgang Kellerer (DoCoMo Euro-Labs)
Rüdiger Schollmeier (Munich University of Technology)
Klaus Wehrle (University of Tübingen)
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 401-417, 2005.
Springer-Verlag Berlin Heidelberg 2005
Fig. 24.1: Possible application scenario for location-based services: Locating a taxi.
node’s availability caused, e.g., by failures or by joins and leaves of users.
Changes as frequent as those typical of wireless environments appear as a
threat to conventional Peer-to-Peer systems.
In general, the application of Peer-to-Peer to mobile environments pro-
vides a number of opportunities and challenges, which we will outline in what
follows, since Peer-to-Peer was originally not designed for mobile environments.
As we will see in the next sections, two scenarios provide the
impetus for the application of Peer-to-Peer concepts in mobile environments.
Besides the well-known file sharing applications based on Peer-to-Peer
networks, new wireless applications also become feasible in mobile networks,
especially when we consider multi-hop links such as in mobile ad-hoc networks
(MANETs). Therefore, we motivate two basic examples in the next two sec-
tions that may be realized with Peer-to-Peer technology on top of an ad-hoc
wireless network.
Imagine, for example, a user standing at the side of a road who requires a taxi
but cannot see one nearby. The user could call the central taxi agency
and order a taxi, having to state his current position. The agency, which has
to track the current locations of its taxis, could then direct the nearest one to
the user.
If context-based routing were supported by the available MANET, the user
could simply send out a request, which would be broadcast in a multi-hop
manner, via a pre-configured number of hops, in his proximity. All participat-
ing nodes would forward the request until a taxi receives it, as illustrated in
Figure 24.1. The taxi could then reply with an appropriate response message
to the requesting node and, finally, pick up the user.
Thus, our context-based routing scheme allows the utilization of Location-
Based Services (LBS) without the need for centralized elements. The underly-
ing MANET limits the flooding of the search request to the geographical proxim-
ity. Additionally, all kinds of search requests can be imagined; possible request
categories could also include bars, restaurants, or the closest bus stops.
The second scenario is not as highly dynamic as the first one, but is also based
on ad-hoc network technology. On a university campus, today, students and
teaching staff are often equipped with wireless devices like laptops, PDAs,
and smartphones. During courses, seminars, reading groups, or in their spare time,
they may form groups of collaborating ’peers’. But often, collaboration with
networked systems cannot be deployed because of missing infrastructure
support (e.g., network plugs) or due to restrictive network policies.
A solution to such problems is the formation of spontaneous wireless ad-
hoc networks, e.g., based on Bluetooth, IEEE 802.11x, or both. Students are
then able to collaborate, share teaching materials, and much more. After
the courses are finished, the students move on to the next activity. Thus, the
spontaneous groups separate and – after a short and highly dynamic period
– form new collaborative groups. These new groups may have a very different
purpose.
The formation of spontaneous ad-hoc groups can easily be supported
by existing MANET technology. Two MANETs within wireless reachability
may also be interconnected by some Internet-connected peers.
But the sharing of information, the support for collaboration, and the real-
ization of other services are not directly supported by MANET technology.
(see Chapter 2). Therefore, we will discuss the use of both paradigms sepa-
rately in Sections 24.4.1 and 24.4.2, though there may be some overlapping
similarities between the two.
[Fig. 24.2: The MPP protocol stack: the application layer protocol MPP uses HTTP (session layer) over TCP (transport layer) for data exchange, while EDSR at the network (IP) layer performs routing; MPCP connects the data and routing paths across the layers.]
realized without the need for further location-sensitive sensors or cen-
tral instances. Further applications of such a combined approach have been
described previously.
To minimize the effort of creating a new protocol and to benefit from former
developments, the MPP protocol stack reuses existing network protocols as
much as possible. For node-to-node communication, the protocol utilizes an
enhanced version of the Dynamic Source Routing (DSR) protocol [326]. For
the transport of user data it uses HTTP over TCP, as illustrated by
Fig. 24.2. Thus, the Enhanced Dynamic Source Routing (EDSR) protocol requires
only a new application layer protocol and minor changes within the DSR
protocol. To connect the application layer protocol (MPP) with the
network layer protocol (EDSR), the Mobile Peer Control Protocol (MPCP)
is used.
Since MANETs already provide routing algorithms which enable the lo-
calization of network participants by their IP addresses, an additional Peer-
to-Peer implementation of this functionality is unnecessary and would even
degrade performance. Consequently, EDSR is designed to perform the necessary
routing tasks on the network layer and to supplement the application layer
protocol (MPP). This approach provides valuable advantages compared with
a separate treatment of both networks:
– The MANET controls the organization of the network. Thus changes in
the topology of the mobile network are taken into account automatically
by the Peer-to-Peer network.
– The network layer is responsible for routing and the application controls
the data exchange.
– The integration of both networks avoids redundant information requests.
– The inter-layer communication of the protocol optimizes performance, since
the overlay network can be optimally adjusted to the physical network.
– The application layer protocol MPP simplifies the implementation of new
services.
410 24. Peer-to-Peer in Mobile Environments
The separation of data exchange and routing tasks allows the reuse of
existing protocols like TCP and HTTP. Only for routing tasks must MPP
directly interact with EDSR residing in the network layer (cf. Fig. 24.2).
MPP allows distant peers to transparently exchange data. Therefore MPP
is responsible for file transfers within the Peer-to-Peer network and resides
in the Peer-to-Peer client application. MPP utilizes HTTP for data exchange
since it is simple to implement and well tested. The HTTP content-range header makes it possible to resume file transfers after network errors due to link breaks. EDSR is mostly based on the DSR protocol, but additionally specifies
new request and reply types to provide the means for finding peers by criteria
other than the IP address. EDSR thus extends DSR and therefore EDSR
nodes can be an integral part of DSR networks.
MPCP is the inter-layer communication channel between the application
and the network layer. Thus MPCP links the EDSR Protocol in the net-
work layer with the Peer-to-Peer application in the application layer. Using
MPCP, the application can register itself in the EDSR layer to initialize
search requests and to process incoming search requests from other nodes.
It communicates to the corresponding protocol all incoming and outgoing requests and responses, except for the file exchange itself.
On startup, the Peer-to-Peer application on the mobile device announces
itself to the EDSR layer via MPCP. If a user initializes a data search,
MPCP forwards the request to EDSR which transforms it into a search re-
quest (SREQ). Similar to DSR route requests (RREQ), EDSR floods SREQs
through the MANET. EDSR nodes receiving the request forward it to the registered Peer-to-Peer application via MPCP. Thus the Peer-to-Peer application can determine whether locally shared data satisfies the request's criteria.
If the request matches the description of a file shared by the node, the ap-
plication initializes an EDSR file reply. This reply is sent back to the source
node and contains all necessary information for the file transfer. Similar to
DSR route replies (RREP), a file reply (FREP) includes the complete path
between source and destination.
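The search sequence just described can be sketched in a few lines (a simplified in-memory model, not the actual MPP/EDSR implementation; all class, method, and field names are illustrative). An SREQ is flooded through the MANET, every receiving node hands it up to its registered application via MPCP, and a match produces an FREP that carries the complete path back to the source:

```python
# Illustrative sketch of the EDSR search-request (SREQ) / file-reply (FREP)
# flow; names are invented for this example, not taken from the MPP spec.

class EdsrNode:
    def __init__(self, node_id, shared_files):
        self.node_id = node_id
        self.shared_files = shared_files  # data shared by the local P2P app
        self.neighbors = []               # one-hop MANET neighbors
        self.seen = set()                 # duplicate suppression for the flood
        self.replies = []                 # FREPs received as inquiring node

    def search(self, criteria):
        """Application-triggered search, handed down via MPCP."""
        self.handle_sreq(source=self, criteria=criteria, path=[])

    def handle_sreq(self, source, criteria, path):
        key = (source.node_id, criteria)
        if key in self.seen:
            return                        # SREQ already processed
        self.seen.add(key)
        path = path + [self.node_id]      # record the route, as in DSR
        # MPCP hands the request up to the registered application, which
        # checks its locally shared data against the search criteria.
        matches = [f for f in self.shared_files if criteria in f]
        if matches:
            # FREP: like a DSR RREP, it carries the complete path back
            # (delivered directly here for brevity).
            source.replies.append({"path": path, "files": matches})
        for n in self.neighbors:          # flood the SREQ further
            n.handle_sreq(source, criteria, path)

# Three nodes in a line a - b - c; only c shares a matching file.
a, b, c = EdsrNode("a", []), EdsrNode("b", []), EdsrNode("c", ["song.mp3"])
a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
a.search("song")
print(a.replies)  # → [{'path': ['a', 'b', 'c'], 'files': ['song.mp3']}]
```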
To compare the performance of a protocol adapted to the underlying physical network, like MPP, with a protocol that establishes its overlay completely independently of the physical layer, we use an analytical approach. First, we have to evaluate the number of reachable nodes in a MANET environment.
If we assume an average of x_0 neighbors per node and a radio range of R_0, we can compute the node density as:

\text{nodedensity} = \frac{x_0}{R_0^2 \pi} \qquad (24.1)
If we further assume a uniform distribution of the nodes in the plane, the
cumulative distribution of the distance between two nodes is given by:
F(r) = \frac{r^2 \pi}{R_0^2 \pi} = \left( \frac{r}{R_0} \right)^2 \qquad (24.2)
which results in an increasing probability of occurrence for an increasing distance between two ad-hoc nodes. The pdf of this function can now be computed by taking the derivative of F(r):

f(r) = \frac{2r}{R_0^2} \qquad (24.3)
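The distribution can be checked numerically – a quick sketch under the text's assumption of uniformly placed nodes: the empirical fraction of nodes within distance r of a given node approaches F(r) = (r/R_0)².

```python
import math
import random

def empirical_cdf(r, r0=1.0, samples=100_000):
    """Fraction of uniformly placed neighbors within distance r of the
    center of a disc with radius r0; approaches F(r) = (r / r0)**2."""
    inside = 0
    for _ in range(samples):
        # Draw a uniform point in the disc by rejection sampling.
        while True:
            x, y = random.uniform(-r0, r0), random.uniform(-r0, r0)
            if x * x + y * y <= r0 * r0:
                break
        if math.hypot(x, y) <= r:
            inside += 1
    return inside / samples

print(empirical_cdf(0.5))  # close to F(0.5) = 0.25
```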
This simply reflects the fact that the differential surface increases linearly
with an increasing radius r. Accordingly, this also means that the probability
of occurrence of nodes within the distance r also increases linearly. Thus the
average distance between two nodes can be computed by:
\bar{d} = \int_0^{R_0} \frac{2r}{R_0^2} \cdot r \, dr = \frac{2}{3} R_0 \qquad (24.4)
This means, as illustrated by Figure 24.3, that the multihop reach of an average node only increases by \frac{2}{3} R_0 instead of R_0. Thus the number of
reachable nodes via h physical links can be computed by:
\Delta N_{phys} =
\begin{cases}
\frac{R_0^2 \pi \, x_0}{R_0^2 \pi} = x_0, & h = 1 \\[4pt]
\left[ \left( 1 + (h-1) \cdot \frac{2}{3} \right)^2 - \left( 1 + (h-2) \cdot \frac{2}{3} \right)^2 \right] \cdot \frac{R_0^2 \pi \, x_0}{R_0^2 \pi} = \frac{8}{9} h x_0, & h > 1
\end{cases}
\qquad (24.5)
If we now assume that a node in a Peer-to-Peer network does not adapt its
overlay network to the underlying physical network, then it must establish its
connections randomly. From equation 24.5 we can already observe that the
further away a node is, in terms of physical hops, the higher the probability p_{con}(h) that a randomly established connection leads to it:

p_{con}(h) = \frac{\Delta N_{phys}}{\left( R_0 + (h_{max}-1) \cdot \frac{2}{3} R_0 \right)^2 \pi \cdot \frac{x_0}{R_0^2 \pi}} =
\begin{cases}
\left( 1 + \frac{2}{3}(h_{max}-1) \right)^{-2}, & h = 1 \\[4pt]
\frac{8h}{9} \left( 1 + \frac{2}{3}(h_{max}-1) \right)^{-2}, & h > 1
\end{cases}
\qquad (24.6)
where hmax defines the maximum possible number of physical hops, which
is commonly limited to six [263]. The average path length of a not-adapted
overlay network in a physical network can thus be computed by:
\bar{l} = 1 \cdot \left( 1 + \frac{2}{3}(h_{max}-1) \right)^{-2} + \sum_{h=2}^{h_{max}} \frac{8h^2}{9} \left( 1 + \frac{2}{3}(h_{max}-1) \right)^{-2}
= \left[ \frac{8}{9} \left( \frac{h_{max}(h_{max}+1)(2h_{max}+1)}{6} - 1 \right) + 1 \right] \left( 1 + \frac{2}{3}(h_{max}-1) \right)^{-2}
\qquad (24.7)
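The derivation can be verified with a short script (a sketch of Eqs. 24.5–24.7 using exact rational arithmetic; x_0 cancels in p_con and is set to 1). It checks that p_con sums to one over h = 1…h_max and that the closed form of Eq. 24.7 equals the direct sum Σ h · p_con(h):

```python
from fractions import Fraction

def delta_n_phys(h):
    """Nodes reachable via exactly h physical hops (Eq. 24.5), with x0 = 1."""
    return Fraction(1) if h == 1 else Fraction(8, 9) * h

def p_con(h, hmax):
    """Probability that a random overlay connection spans h hops (Eq. 24.6)."""
    norm = (1 + Fraction(2, 3) * (hmax - 1)) ** 2
    return delta_n_phys(h) / norm

def avg_path_length(hmax):
    """Closed form of Eq. 24.7."""
    norm = (1 + Fraction(2, 3) * (hmax - 1)) ** 2
    total = Fraction(8, 9) * (Fraction(hmax * (hmax + 1) * (2 * hmax + 1), 6) - 1) + 1
    return total / norm

hmax = 6  # common maximum number of physical hops [263]
assert sum(p_con(h, hmax) for h in range(1, hmax + 1)) == 1
assert avg_path_length(hmax) == sum(h * p_con(h, hmax) for h in range(1, hmax + 1))
print(float(avg_path_length(hmax)))  # ≈ 4.31 underlay hops per overlay hop
```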
Challenges
As shown in Part III, numerous DHT approaches with versatile character-
istics determined by the subjacent topological structure (routing geometry)
exist. This structure induces different characteristics regarding flexibility and resilience. High flexibility in the choice of neighboring nodes enables optimizations with respect to the underlying network topology. A resilient DHT structure can still operate – even if many nodes fail at the same time – without the need to invoke expensive recovery algorithms. These issues are essential for the DHT to react flexibly to topological changes of the underlying network.
On the other hand, tree- and butterfly-based DHT protocols are not capable of using PRS (proximity route selection) because only one path exists in the DHT structure that decreases the distance between the two nodes. For this reason the routing algorithm does not allow any variation. Ring-based DHT protocols have to make a tradeoff between an increased hop count in the overlay and a possibly shorter or more stable path in the underlay.
In wireless networks the inherent broadcasting of packets to neighbors can be used to improve a peer's overlay routing, e.g., as "shortcuts" in the overlay network. Routing to these surrounding peers costs just one hop in the underlying network, which makes it very efficient. Keeping the connections to these nodes causes only small amounts of locally restricted network traffic.
If the ad-hoc network protocol is able to analyze packets, a message can be intercepted by a node that takes part in the routing process in the underlay network. Based on stored information about surrounding peers, this node may redirect the request to a node that is closer to the destination. The decision whether or not to intercept a message must be made with respect to the progress that would be achieved in the overlay structure compared to the distance on the alternative route. If connection speed is not an issue, even nodes not directly involved in the routing may intercept a message.
The route can be changed, and a route change message has to be sent to
the node responsible for processing the routing request. If no route change
message is received within a specified period of time, the routing progress
will be continued. However, this procedure will decrease network traffic at
the cost of increased latency. Route interception and active routing of non-
involved peers can be used to achieve more redundancy, leading to a more
stable network in case of node failures.
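The interception decision can be expressed as a simple predicate (entirely illustrative – the text does not fix a concrete metric): a node intercepts and redirects when the shortcut it knows makes more overlay progress per underlay hop than the original route would.

```python
def should_intercept(shortcut_progress, shortcut_hops,
                     original_progress, original_hops):
    """Illustrative interception rule: weigh the progress achieved in the
    overlay structure against the distance travelled in the underlay."""
    return (shortcut_progress / shortcut_hops
            > original_progress / original_hops)

# A one-hop neighbor covering half the remaining overlay distance beats a
# three-hop underlay route with the same overlay progress.
print(should_intercept(0.5, 1, 0.5, 3))  # → True
```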
Resilient Networks
DHTs are considered to be very resistant to node failures. Backup and recovery mechanisms that use distributed redundant information ensure that no information is lost if a node suddenly fails. Depending on the subjacent DHT topology, the DHT experiences reduced routing performance until the recovery has finished. It is shown in [266] that tree- and butterfly-based topologies in particular are vulnerable to node failures. Due to the high flexibility of ring-based topologies, these DHTs remain operable even in case of massive node failures.
When DHT protocols are used in an ad-hoc environment, resilience has to be considered a very important issue. The resilience of a DHT determines how much time may pass before expensive recovery mechanisms have to be invoked. As the quality of connections in ad-hoc networks is highly dependent on the environment of the nodes, some nodes may be temporarily inaccessible or poorly accessible because of node movement. If the recovery process is started too early, avoidable overhead is caused when the node becomes accessible again. If the topological structure allows the DHT protocol to delay recovery mechanisms without losing routing capability, these costly recovery measures can be avoided. This approach has a positive effect on
the maintenance costs of a DHT. In a worst-case scenario, a node that is alternately available and unavailable over a longer period of time can stress the whole network because of numerous join and leave procedures. This scenario
can easily be provoked by node movement along the network perimeter. Re-
silience is therefore an important factor when DHTs are used in combination
with ad-hoc networks. Resilient DHT structures are capable of compensating
node failures and are able to use recovery mechanisms more accurately.
Merging two DHTs generates a vast amount of network traffic: many key-value pairs have to be redistributed, and new neighborhood connections must be established. This stress may often be unacceptable, especially if the connection between the DHTs is weak or short-lived. A method to merge large DHT structures has to be designed with respect to the limitations (low bandwidth, high latency, etc.) of an ad-hoc network, to avoid network overload. Criteria like the stability of inter-DHT connections must be assessed to avoid merging two DHTs that are likely to split up again. However, if the merging of
DHTs is omitted – and structured communication between DHTs is chosen as
an alternative – continuously separating a DHT will create many small DHTs and cause an enormous communication overhead. In both cases the simultaneous coexistence of DHTs requires an unambiguous DHT identification.
24.5 Summary
The field of mobile Peer-to-Peer networks (MP2P) has various forms and
currently there exists no coherent view on what is understood by it. The
term mobile emphasizes that nodes/peers in the network are mobile, and
therefore need to be equipped with some kind of wireless communication
technology. Examples of nodes include pedestrians with mobile devices [284]
or vehicles with wireless communication capabilities [632]. Since all mobile
Peer-to-Peer networks construct an overlay over an existing wireless network,
implementations range from MP2P over mobile ad hoc networks (MANETs)
[156] to MP2P over cellular based networks [299, 298].
This chapter looks into a specific class of applications for mobile Peer-to-Peer networks. Here the Peer-to-Peer network is formed by humans carrying
mobile devices, like PDAs or mobile phones, with ad hoc communication
capabilities. All presented applications exploit the physical presence of a user
to support digital or real-life collaboration among them. The integration of
wireless communication technologies like Bluetooth or IEEE 802.11b WiFi
into mobile devices makes this kind of mobile Peer-to-Peer network feasible.
As stated by Dave Winer, “The P in P2P is People” [624]; most of the
Peer-to-Peer systems rely on the users’ will to contribute. This could be ob-
served in the success of first generation file-sharing applications like Napster
[438] or Gnutella [252].
The key issue of user contribution is even more pronounced in mobile Peer-to-Peer networks, where in general anonymous users form the network with their personal devices. Resources on these devices are typically limited; especially battery power can be a problem. A user risks draining his battery by contributing his device's resources to other users. The device may also become unavailable for personal tasks, like accessing the calendar or making phone calls. However, user contribution may be stimulated by the usefulness of an application.
Currently, we see the emergence of several mobile Peer-to-Peer applica-
tions, both as commercial products and in research, as described in Section
25.2. As stated above, all these applications make use of the physical presence of their users.
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 419-433, 2005.
Springer-Verlag Berlin Heidelberg 2005
420 25. Spontaneous Collaboration in Mobile P2P Networks
The second important aspect in the design space is the relation among
the users participating in the system. Questions to be answered are:
– Is the number of users known/fixed or unknown/open?
– Are the users identifiable, or do they work with pseudonyms or even act totally anonymously?
Answers to these questions have an impact on the usage and suitability of the
mobile Peer-to-Peer building blocks we present in Section 25.3.
The rest of this chapter is organized as follows. Section 25.2 presents emerg-
ing applications for mobile Peer-to-Peer networks, namely in the following
domains: enterprise knowledge management, spontaneous recommendation
passing, conference collaboration, spontaneous encounter with friends and
foafs1 , and spontaneous advertisement passing. Common building blocks for
mobile Peer-to-Peer networks and applications are derived from these ex-
amples in Section 25.3. Following this analysis, the iClouds project is pre-
sented in Section 25.4. iClouds provides an architecture that includes building
blocks for easier mobile Peer-to-Peer application development. This chapter
concludes with Section 25.5.
25.2.1 Shark
1
foaf = friend of a friend
Local stations store and manage only location relevant knowledge. This al-
lows for simple location based services. Each local station synchronizes its
knowledge with the central station.
One possible application for Shark is enterprise knowledge management.
Here, each mobile staff member is equipped with a mobile station. Local sta-
tions are installed at subsidiaries. A customer (with his own mobile station)
is able to learn about all kinds of information about the enterprise, e.g., prod-
uct descriptions, price lists or white papers, while talking to a mobile staff
member or visiting a subsidiary or shop.
25.2.2 MobiTip
The MobiTip [529] system allows its users to express their opinions on any-
thing of interest in the environment. Opinions are aggregated and presented
to the users as tips or recommendations. Opinions are entered in free text
form on the user’s device (a mobile phone) and shared in a Peer-to-Peer
manner on-the-fly with users nearby using Bluetooth.
A typical example is a shopping mall, where MobiTip users share their
personal views on certain shops or product offers.
The core MobiTip system can be extended by so-called connection hotspots.
A connection hotspot is placed at a selected location, e.g., the entrance of a
shopping mall, to collect tips and pass them to future visitors. This idea is
similar to the time-shifted communication in the Socialight system based on
Sticky Shadows (see Section 25.2.4).
25.2.3 SpotMe
Several other functions require communication with the local server, e.g., last-minute agenda updates, news dissemination, and questionnaires. SpotMe includes some basic post-event services. The collected data and contact information are made available to every participant online on the web and sent by e-mail.
25.2.4 Socialight
Socialight [332], a mobile social networking platform that uses mobile phones,
supports spontaneous encounter and interaction with friends and friends of
friends. Using the current or past location of friends, Socialight enables real-
time and time-shifted communication.
Location of users is determined by infrastructure based technology (GPS
and Cell-ID), or ad hoc by signal recognition of Bluetooth devices nearby.
Users have to register on a central platform before using Socialight. The
platform also stores information about the social network of users.
Peer-to-peer communication among users may happen via Tap & Tickle
or Sticky Shadows. Tap & Tickle are two digital gestures that allow users to
exchange information by vibration of their devices. Pressing a button on a
user’s phone will make his friend’s phone vibrate once (Tap) or rhythmically
(Tickle). This is meant as a non-intrusive way to communicate with nearby
friends.
With Sticky Shadows, users can attach digital information to a certain
location. This digital information is recognized by friends when they pass
the same location at a different time. Examples include restaurant reviews
for friends, sales or shopping recommendations, and educational purposes,
where teachers set Sticky Shadows for students.
25.2.5 AdPASS
[Table 25.1: Conceptual usage of the common services – presence awareness, message exchange, information filtering, information distribution, security, identity management, incentive schemes, reputation, and user notification – in the projects Shark, MobiTip, SpotMe, Socialight, and AdPASS.]
Table 25.1 summarizes the common services and their conceptual usage
in the presented sample applications.
The next section presents the iClouds project. The project goal is to design
a sound and coherent architecture for mobile Peer-to-Peer applications.
25.4 The iClouds Project
The two most important data objects found on the iClouds device are two
information lists (iLists for short):
– iHave-list (information have list or information goods):
The iHave-list holds all the information the user wants to contribute to
other users.
– iWish-list (information wish list or information needs):
In the iWish-list, the user specifies what kind of information he is interested
in.
[Figure: device B moves from the communication range of A into that of C.]
Each iClouds device periodically scans its vicinity to see if known nodes
are still active and in communication range and also to see if any new nodes
have appeared. Information about active nodes is stored in a neighbourhood
data structure.
By exchanging iLists, the iClouds devices align their information goods
and needs. Items on the iWish-lists are matched against items on the iHave-
lists. On a match, information items move from one iHave-list to the other.
For example, consider two iClouds users, Alice and Bob, who meet on the
street. When their iClouds devices discover each other, they will exchange
their iHave-lists and match them locally against their iWish-lists. If an item
on Bob’s iHave-list matches an item on Alice’s iWish-list, her iClouds device
will transfer that item onto her iHave-list.
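The exchange can be sketched as follows (function and variable names are illustrative, not from the iClouds implementation; a plain keyword match stands in for whatever matching scheme iClouds actually uses):

```python
def match_ihave(received_ihave, local_iwish, local_ihave):
    """Match a peer's received iHave-list against the local iWish-list and
    copy every matching item onto the local iHave-list (sketch)."""
    transferred = []
    for item in received_ihave:
        if any(wish in item for wish in local_iwish) and item not in local_ihave:
            local_ihave.append(item)
            transferred.append(item)
    return transferred

# Alice meets Bob: Bob's information goods vs. Alice's information needs.
bob_ihave = ["cafe review: Java Joe", "bus schedule line 42"]
alice_iwish = ["cafe"]
alice_ihave = []
print(match_ihave(bob_ihave, alice_iwish, alice_ihave))
# → ['cafe review: Java Joe']
```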
There are two main communication methods for transferring the iLists.
Peers can either pull the iLists from other peers or they can push their own
iLists to peers they meet. Either of these two operations is applicable to
both lists, which gives four distinct possibilities of communication. Table 25.2
summarizes these possibilities, along with their real-world equivalents.
In each of the four cases shown in Table 25.2, the matching operation
is always performed on the peer who receives the list (Alice’s peer in pull
and Bob’s peer in push). A key strength of iClouds is that each of the four
possible combinations corresponds to an interaction in the real world:
– Standard search.
This is the most natural communication pattern. Alice asks for the informa-
tion stored on Bob’s device and performs a match against her information
needs (specified in her iWish-list) on her device.
– Advertise.
This is a more direct approach. Alice gives her information goods straight
to Bob and it’s up to Bob to match this against the things he is inter-
ested in. As an example, consider an iClouds information sprinkler mounted
on shopping mall doorways pushing advertisements onto customer devices
when they enter the building. This is implemented in the AdPASS system
(cf. Section 25.2.5).
– Active service inquiry.
This is best suited for shopping clerks. They learn at a very early stage
what their customers are interested in. An example of this inquiry could be: "Can I help you? Please tell me what you are looking for."
In general, especially for privacy reasons and user acceptance, we believe
it is a good design choice to leave the iWish-list on the iClouds device.
Hence, this model of communication would likely be extremely rare in the
real world.
– Active search.
With active search, we model the natural “I’m looking for X. Can you help
me?”. This is similar to the standard search mechanism, except that the
user is actively searching for a particular item, whereas in the standard
search, the user is more passive.
25.4.3 Architecture
Figure 25.7 shows the architecture that is proposed and used in iClouds.
There is a general distinction between a communication layer and a ser-
vice layer. The communication layer provides simple one-hop message ex-
change between peers in communication range. A neighbourhood data struc-
ture keeps track of active peers in the vicinity.
The common services are located on the next layer. Each service can use
functionality provided by other services or by the communication layer below.
Note that the service layer is extensible for new services that might be needed
by future applications.
The applications reside on the topmost layer. To fulfil its purpose, an application has access to both the service and the communication layer.
[Fig. 25.7: The iClouds architecture – applications on top, common services in the middle, and the one-hop communication layer at the bottom.]
25.5 Conclusion
This chapter points out that there are several similarities in mobile Peer-to-
Peer applications. The analysis of emerging applications in this area identifies
a set of common services that serve as basic building blocks.
The iClouds architecture aims to provide a framework for mobile Peer-
to-Peer application developers who do not want to re-invent common func-
tionality over and over again. The architecture is implemented in Java as a
lightweight set of classes and runs on Java2 Micro Edition compliant mobile
devices with 802.11b WiFi communication support.
26. Epidemic Data Dissemination for Mobile
Peer-to-Peer Lookup Services
Christoph Lindemann, Oliver P. Waldhorst (University of Dortmund)
Building efficient lookup services for the Internet constitutes an active area of research. Recent work concentrates on building Internet-scale distributed hash tables as a building block of Peer-to-Peer systems, see e.g., [505], [575].
Castro et al. proposed the VIA protocol, which enables location of applica-
tion data across multiple service discovery domains, using a self-organizing
hierarchy [111]. Recently, Sun and Garcia-Molina introduced a partial lookup
service, exploiting the fact that for many applications it is sufficient to resolve
a key to a subset of all matching values [581]. Their paper discusses various design alternatives for a partial lookup service in the Internet. However, none of these papers considers distributed lookup services for mobile ad-hoc networks.
In MANETs, lookup services can be implemented using either unstructured or structured Peer-to-Peer networks, as described in Sections 24.4.1 and 24.4.2, respectively. However, such approaches put some requirements
on the MANET environment: (1) The MANET must provide a high degree
of connectivity, such that a given node can contact each other node at any
time with high probability. (2) The nodes in the MANET must exhibit low
mobility in order to minimize the required number of updates of routing
tables and other structures. Typically, both structured and unstructured approaches will perform poorly in scenarios with low connectivity and high mobility. This chapter describes an approach for building a Peer-to-Peer lookup
service that can cope with intermittent connectivity and high mobility. The
approach builds upon the observation by Grossglauser and Tse, that mobil-
ity does not necessarily hinder communication in MANET, but may support
cost-effective information exchange by epidemic dissemination [262].
As a first approach to epidemic information dissemination in mobile en-
vironments, Papadopouli and Schulzrinne introduced Seven Degrees of Sep-
aration (7DS), a system for mobile Internet access based on Web document
dissemination between mobile users [470]. To locate a Web document, a 7DS
node broadcasts a query message to all mobile nodes currently located inside
its radio coverage. Recipients of the query send response messages containing
file descriptors of matching Web documents stored in their local file caches.
Subsequently, such documents can be downloaded with HTTP by the inquir-
ing mobile node. Downloaded Web documents may be distributed to other
[Fig. 26.1: (a) The mobile phone broadcasts a query for key k, which matches value v in the local index of the notebook. (b) The notebook broadcasts a response for (k,v); all devices inside the radio coverage receive it and store (k,v) in their index caches. (c) After changing its position, the second mobile phone receives a query for key k broadcast by the PDA. (d) The second mobile phone generates a response for (k,v) from its index cache on behalf of the notebook.]
A node n may contribute index entries of the form (k, v) to the system
by inserting them in a local index. In Figure 26.1, the local index is drawn
as the first box below each mobile device. We refer to such an index entry as
supplied. The node n is called the origin node of an index entry. For example,
the notebook shown in Figure 26.1 is the origin node of the index entry (k, v).
A key k matches a value v, if (k, v) is currently supplied to the PDI system.
Each node in the system may issue queries in order to resolve a key k to all
matching values vi (see Figure 26.1a). A node issuing a query is denoted as
inquiring node.
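The two data structures can be sketched as follows (a simplified in-memory model with illustrative names; real PDI exchanges these entries via UDP broadcasts, and the index cache additionally uses LRU replacement, omitted here):

```python
class PdiNode:
    """Sketch of a PDI node: a local index for supplied entries and an
    index cache for entries overheard in responses (illustrative)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.local_index = {}   # entries this origin node supplies
        self.index_cache = {}   # entries overheard from responses

    def supply(self, key, value):
        self.local_index.setdefault(key, set()).add(value)

    def resolve(self, key):
        """All matching values, from the local index and the index cache."""
        return self.local_index.get(key, set()) | self.index_cache.get(key, set())

    def overhear(self, key, values):
        """Store index entries extracted from an overheard response."""
        self.index_cache.setdefault(key, set()).update(values)

notebook = PdiNode("notebook")
notebook.supply("k", "v")                    # notebook is the origin node
phone = PdiNode("phone")
phone.overhear("k", notebook.resolve("k"))   # phone overhears the response
print(phone.resolve("k"))  # → {'v'}
```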
Query messages are sent to the IP limited broadcast address 255.255.255.255 and a well-defined port, using the User Datagram Protocol (UDP). Using the IEEE 802.11 ad-hoc mode, all nodes located inside the radio coverage of the inquiring node receive a query message. Each of these
nodes may generate a response message. A response message contains the
key from the query and all matching values from either the local index or a
second data structure called index cache. To enable epidemic data dissemi-
nation, PDI response messages are sent to the IP limited broadcast address
255.255.255.255 and a well-defined port, too. Thus, all mobile nodes within
the radio coverage of the responding node will overhear the message (Fig-
ure 26.1b). Not only the inquiring node but also all other mobile nodes that
receive a response message extract all index entries and store them in the
index cache (see Figure 26.1b). In Figure 26.1, index caches are drawn as
the second box below mobile devices. Index entries from the index cache
are used to resolve queries locally, if the origin nodes of matching values re-
side outside the radio coverage of the inquiring node (see Figures 26.1c and
26.1d). Obviously, the index cache size is limited to a maximum number of
entries adjusted to the capabilities of the mobile device. A least-recently-used (LRU) replacement policy is employed if a mobile device runs out of
index cache space. By generating responses from index caches, information
is disseminated to all other nodes that are in direct contact, similar to the spread of an epidemic.
Recall that all PDI messages are sent to the limited broadcast address and
received by all nodes located inside the radio coverage of the sender. De-
pending on the transmission range of the wireless network interfaces, this
may considerably limit the number of nodes that receive a message. PDI in-
cludes a flooding mechanism that controls forwarding based on the content
of a message. The mechanism is illustrated in Figure 26.2. Query messages are flooded with a time-to-live value TTL_query, which is specified by the inquiring node. For detecting duplicate messages, each message is tagged with a unique source ID and a sequence number as described above. We will show in Section 26.4 that TTL_query ≤ 2 yields sufficient performance in most scenarios. Thus, PDI communication remains localized despite the flooding mechanism.
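The flood control can be sketched as follows (illustrative names; real PDI broadcasts over UDP rather than calling neighbors directly). Each node suppresses duplicates by the (source ID, sequence number) tag and re-broadcasts only while the TTL is positive:

```python
class Node:
    def __init__(self):
        self.neighbors = []   # nodes within radio coverage
        self.seen = set()     # (source id, sequence number) tags
        self.received = []    # keys of queries that reached this node

def flood_query(node, query, ttl):
    """Deliver a query to a node and re-broadcast it while ttl > 0,
    suppressing duplicates -- a sketch of PDI's query flooding."""
    tag = (query["src"], query["seq"])
    if tag in node.seen:
        return
    node.seen.add(tag)
    node.received.append(query["key"])
    if ttl == 0:
        return                # TTL exhausted: do not re-broadcast
    for neighbor in node.neighbors:
        flood_query(neighbor, query, ttl - 1)

# Chain a - b - c - d with TTL_query = 2: the query reaches a, b and c only.
a, b, c, d = Node(), Node(), Node(), Node()
a.neighbors, b.neighbors, c.neighbors, d.neighbors = [b], [a, c], [b, d], [c]
flood_query(a, {"src": "a", "seq": 1, "key": "k"}, ttl=2)
print([len(n.received) for n in (a, b, c, d)])  # → [1, 1, 1, 0]
```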
Similar to query messages, response messages are forwarded with time-to-live TTL_query. Recall that the payload of query messages consists of a
few keys. Thus, query messages are small and may be flooded without sig-
nificantly increasing network load (see Figure 26.2a and 26.2b). In contrast,
response messages can contain numerous values that may each have a con-
siderable size, depending on the application using PDI. Therefore, flooding
of complete response messages will significantly increase network load, even
if the scope of flooding is limited to two hops. For the cost-efficient flooding
of response messages, PDI incorporates a concept called selective forward-
ing. That is each node that receives a response message will search the index
cache for each index entry contained in the message (see Figure 26.2d). If an
entry is found, the node itself has already sent a response for this query with
high probability (e.g., as shown in Figure 26.2c). Therefore, forwarding this
index entry constitutes redundant information. Using selective forwarding,
each relay node removes all index entries found in its local index cache from
the response message, before the message is forwarded (see Figure 26.2e).
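Selective forwarding amounts to a filter applied before relaying (a sketch; names are illustrative): entries already present in the relay's own index cache are stripped from the response, and everything received is cached.

```python
def selectively_forward(response_entries, index_cache):
    """Return the part of a response that a relay node actually forwards:
    only entries missing from its own index cache (sketch of PDI's
    selective forwarding). The relay caches all received entries."""
    forwarded = [entry for entry in response_entries if entry not in index_cache]
    index_cache.update(response_entries)
    return forwarded

# The notebook has already cached (k, v1) and relays a response for
# (k, v1) and (k, v2): only the unknown entry (k, v2) is forwarded.
cache = {("k", "v1")}
print(selectively_forward([("k", "v1"), ("k", "v2")], cache))  # → [('k', 'v2')]
```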
[Fig. 26.2: (a) The mobile phone issues a query for key k with TTL_query > 0. (b) The notebook relays the query. (c) The notebook generates a response message for (k,v1) from the index cache. (d) The second mobile phone generates a response message for (k,v1,v2) from the index cache. (e) The notebook selectively forwards the (unknown) information (k,v2).]
The basic concept of PDI as described in Sections 26.2.2 and 26.2.3 does
not take into account intermittent connectivity and spontaneous departures
of nodes – circumstances under which all information previously supplied by a node expires. Examples of such cases include node failure or nodes leaving the area covered by the system. In such cases, an implicit invalidation
mechanism can achieve cache coherency. Timeouts constitute a common con-
cept to implement implicit invalidation in several distributed applications,
as they can assure cache consistency without the need to contact the source
of the cached information. PDI defines the concept of value timeouts to ap-
proximate the most recent information about the state of an index entry at
the origin node. Value timeouts limit the time any index entry (k, v) with
a given value v will be stored in an index cache. By receiving a response
from the origin node of (k, v), the corresponding value timeout will be reset.
Let age((k, v)) be the time elapsed since (k, v) was last extracted from a
response message generated by its origin node. We define the age a_v of value
v as a_v = min_k age((k, v)), i.e., the time elapsed since the most recent
response message of this kind was received. If a_v > T holds at a node for
the given timeout value T, all pairs (k, v) are removed from its index cache.
PDI implements only one timeout per value v rather than an individual time-
out for each index entry (k, v). This is motivated by the observation that in
most applications modification of an index entry (k, v) for a given v indi-
cates a substantial change of the value. Subsequently, all other index entries
(k , v) are likely to be influenced. For example, in a file sharing system a pair
(keywordi , U RI) is removed when the file specified by URI is withdrawn
26.3 Consistency Issues 441
from the system. Thus, all other pairs (keywordj , U RI) also become stale.
Note that depending on the application the concept of value timeouts can
be easily extended to incorporate individual timeout durations Tv for each
value v. Such duration may be included in a response message generated by
the origin node. For ease of exposition, we assume in the remainder of this
chapter a global timeout value T for all values in the system.
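The value-timeout rule can be sketched as follows; the set-based cache and the per-value bookkeeping dictionary are illustrative assumptions, not PDI's actual data structures:

```python
def expire_values(index_cache, origin_response_time, now, T=1000.0):
    """Implicit invalidation via value timeouts (illustrative sketch).

    index_cache: set of (key, value) pairs.
    origin_response_time[v]: local time at which the most recent response
        generated by v's origin node was received; the age of v is thus
        a_v = now - origin_response_time[v].
    All entries (k, v) are dropped once a_v exceeds the timeout T.
    """
    stale = {v for v, t in origin_response_time.items() if now - t > T}
    index_cache -= {(k, v) for (k, v) in index_cache if v in stale}
    return index_cache

cache = {("k1", "v"), ("k2", "v"), ("k3", "w")}
seen = {"v": 0.0, "w": 900.0}
expire_values(cache, seen, now=1500.0, T=1000.0)
# Both entries for v (age 1500 > T) are removed; (k3, w) survives (age 600).
```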
To determine the current age of a value, an age field is included in the
response message for each value. This age field is set to zero in each response
from the origin node. When receiving a response message, a node n extracts
the age of each value and calculates the supply time s_v, i.e., the time at
which the origin node generated a response for this value. Assume that the
response message contains age a_v; then s_v is determined by s_v = c_n − a_v,
where c_n denotes the local time of node n. s_v is stored in the index cache
together with v. Note that v might already be present in the index cache with
supply time s'_v. The copy in the index cache might result from a more recent
response by the origin node, i.e., s_v < s'_v. Thus, in order to relate the age of
a value to the most current response from the origin node, the supply time
is updated only if s_v > s'_v. When a node generates a response for a cached
index entry (k, v), it sets the age field for each value v to a_v = c_n − s_v. Note
that only time differences are transmitted in PDI messages, eliminating the
need for synchronizing clocks of all participating devices.
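The supply-time bookkeeping can be outlined as below; the function names and the dictionary holding s_v per value are assumptions for illustration:

```python
def update_supply_time(value, age_field, supply_times, local_clock):
    """Process the age field of a received response: s_v = c_n - a_v.

    The supply time is only updated if the received response reflects a
    more recent response of the origin node than the cached copy does.
    Only time differences travel in messages, so clocks need not be
    synchronized.
    """
    s_new = local_clock - age_field
    if s_new > supply_times.get(value, float("-inf")):
        supply_times[value] = s_new

def age_field_for_response(value, supply_times, local_clock):
    """Age field for a response generated from the cache: a_v = c_n - s_v."""
    return local_clock - supply_times[value]
```

For example, a response carrying age 30 received at local time 100 yields s_v = 70; a later response carrying age 80 at local time 120 corresponds to s_v = 40 < 70 and is therefore ignored.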
Fig. 26.3: Lazy invalidation in PDI: (a) the notebook withdraws (k, v) from its
local index and broadcasts an invalidation message for value v; (b) the mobile
phone relays the invalidation message and stores value v in its lazy invalidation
cache; (c) after changing its position, the mobile phone receives a response for the
stale value v broadcast by the PDA; (d) the mobile phone sends an invalidation
message from its cache on behalf of the notebook, and the PDA invalidates (k, v).
of nodes cannot be guaranteed, nor can directories for all cached copies of a
shared item be maintained. To address these constraints in mobile systems,
PDI defines the concept of lazy invalidation caches, implementing explicit
invalidation of values by epidemic dissemination of invalidation messages.
The basic idea of PDI's explicit invalidation mechanism is that a node removes
all index entries (k, v) from the index cache when it receives an invalidation
message for value v. Flooding with a TTL value TTL_inv is a straightforward way
to propagate invalidation messages. Unfortunately, in mobile systems even
a multi-hop connection between two nodes frequently does not exist. Subse-
quently, stale index entries remain in the index caches of nodes that are not
reached by the invalidation message. Note that these index entries will be re-
distributed in the system due to the epidemic dissemination. We have shown
that even repeated flooding of invalidation messages does not significantly
reduce the number of hits for stale index entries [389].
This observation is consistent with [162], which reports that deleted
database items 'resurrect' in a replicated database environment due to epi-
demic data dissemination. In [162], a solution is proposed that uses a special
message to testify the deletion of an item, denoted as a death certificate. Death
certificates are actively disseminated along with ordinary data and deleted af-
ter a certain time. In contrast, we propose a more or less passive (or 'lazy')
approach for the epidemic dissemination of invalidation messages, which is
illustrated in Figure 26.3. For the initial propagation of an invalidation mes-
sage by the origin node, we rely on flooding as described above (Figure 26.3a).
Each node maintains a data structure called lazy invalidation cache, which
is drawn as a third box below the mobile devices in Figure 26.3. When a
node receives an invalidation message for a value v, it not only relays it,
but also stores v in the invalidation cache (Figure 26.3b). Note that an entry
for v is stored in the invalidation cache regardless of whether the node stores
any index entry (k, v) for v in the index cache. Thus, every node will contribute
to the dissemination of invalidation information.
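The behavior of a node with a lazy invalidation cache can be sketched as follows (class and method names are illustrative; the actual PDI implementation differs):

```python
class PdiNode:
    """Illustrative sketch of a node with a lazy invalidation cache."""

    def __init__(self):
        self.index_cache = set()         # (key, value) index entries
        self.invalidation_cache = set()  # values known to be invalid

    def on_invalidate(self, value):
        # Drop all index entries for the value and remember the
        # invalidation, even if no entry (k, value) is currently cached.
        self.index_cache = {(k, v) for (k, v) in self.index_cache
                            if v != value}
        self.invalidation_cache.add(value)
        # ...relay the invalidation message with TTL_inv here...

    def stale_values_in(self, response_entries):
        # Lazy part: if an overheard response carries a value known to
        # be invalid, the node can send an invalidation message on
        # behalf of the origin node (Figure 26.3 (c) and (d)).
        return {v for (_k, v) in response_entries
                if v in self.invalidation_cache}
```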
26.4 Performance Studies
To evaluate the performance of the PDI and the proposed consistency mech-
anisms, we conduct simulation experiments using the network simulator ns-2
[198]. We developed an ns-2 application implementing the basic concepts of
PDI, selective forwarding, value timeouts, and lazy invalidation caches as
described in Sections 26.2 and 26.3. An instance of the PDI application is
attached to each simulated mobile node, using the UDP/IP protocol stack
and a MAC layer according to the IEEE 802.11 standard for wireless com-
munication. Recall that PDI can be configured by the four parameters shown
in Table 26.2. The goal of our simulation studies is to show that PDI can
be configured to the demands of different applications by adjusting these pa-
rameters. To this end, we have to define detailed models of the system in
which PDI is deployed.
The query popularity determines the probability that a query is for a given
key k. As shown in [570], [355], query popularity in Peer-to-Peer file sharing
systems follows a Zipf-like distribution, i.e., the query function for
k = 1, ..., K is given by:
w_query(k) ∼ k^(−β)
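Such a Zipf-like query workload can be sampled, for instance, as follows (the parameter values are illustrative, not those used in the study):

```python
import random

def zipf_query_keys(K, beta, n, seed=42):
    """Sample n query keys with popularity w_query(k) ~ k**(-beta)."""
    rng = random.Random(seed)
    weights = [k ** -beta for k in range(1, K + 1)]
    return rng.choices(range(1, K + 1), weights=weights, k=n)

keys = zipf_query_keys(K=1000, beta=0.8, n=10000)
# Due to the heavy tail, the most popular key is requested far more
# often than the least popular one.
```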
the total number of all up-to-date matching values currently in the system.
Coherence is measured by the stale hit rate SHR, i.e., SHR = H_S/(H_S + H_F),
where H_S denotes the number of stale hits and H_F the number of up-to-date
hits returned on a query. Note that the stale hit rate is related to the
information retrieval measure precision by precision = 1 − SHR.
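As a toy computation with made-up hit counts, the relation between the two measures looks as follows:

```python
def stale_hit_rate(stale_hits, fresh_hits):
    """SHR = H_S / (H_S + H_F)."""
    return stale_hits / (stale_hits + fresh_hits)

shr = stale_hit_rate(stale_hits=5, fresh_hits=95)  # 0.05
precision = 1.0 - shr                              # 0.95
```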
In all experiments, we conduct transient simulations starting with ini-
tially empty caches. For each run, the total simulation time is set to 2 hours.
To avoid inaccuracy due to initial warm-ups, we reset all statistic counters
after a warm-up period of 10 min. simulation time. Furthermore, we initialize
positions and speed of all nodes according to the steady state distribution
determined by the random waypoint mobility model [85] to avoid initial tran-
sients. For each point in all performance curves, we performed 100 indepen-
dent simulation runs and calculated the corresponding performance measures
at the end of the simulation. In all curves, 99% confidence intervals deter-
mined by independent replicates are included.

Fig. 26.4: Recall vs. system size for (a) different index cache sizes and (b) different
numbers of forwarding hops
Fig. 26.5: Recall vs. radio coverage for (a) different index cache sizes and (b)
different numbers of forwarding hops
when the node density passes a certain number of nodes. To understand this
effect, recall that the overall number of values in the system increases with
node density because each node contributes additional values to the lookup
service. Furthermore, the keys in queries are selected according to a Zipf-like
selection function. Due to the heavy tailed nature of this function, responses
to a large number of queries must be cached in order to achieve high hit rates.
Thus, the hit rate decreases even for large caches when the number of index
entries for popular queries exceeds cache capacity and the epidemic dissem-
ination of data decreases. We conclude from Figure 26.4 (a) that epidemic
data dissemination requires a sufficient node density. To gain most benefit of
the variety of values contributed to the lookup service by a large number of
nodes, sufficient index cache size should be provided. In properly configured
systems with a reasonable node density, PDI achieves hit rates up to 0.9.
Similar to the impact of index cache size, the impact of message forward-
ing is limited in systems with a low node density, as shown in Figure 26.4 (b).
Forwarding messages for more than four hops yields only marginal improve-
ments in the hit rate due to limited connectivity. However, for an increasing
node density, the hit rate grows faster in systems with message forwarding
enabled than in non-forwarding systems, as increasing connectivity favors
selective forwarding. In environments with about 64 nodes, configuring the
system for packet forwarding can improve hit rate by almost 30%. Neverthe-
less, non-forwarding systems benefit from growing node density as it fosters
epidemic information dissemination. Thus, the benefit of selective forward-
ing with TTL_query ≥ 4 hops becomes negligible if the number of mobile
nodes becomes larger than 64. In these scenarios, forwarding messages over
multiple hops will decrease the variety of information stored in the index
caches, because forwarded results replace other results in a high number of
caches. Thus, fewer different results are returned from the caches for succes-
sive queries. We conclude from Figure 26.4 (b) that message forwarding is
beneficial in system environments with a medium node density, while in
systems with a high node density message forwarding should be disabled.

Fig. 26.6: (a) Recall vs. mobility for different index cache sizes and (b) message
volume vs. mobility for different numbers of forwarding hops (curves for 32, 128,
512, and 2048 index entries; x-axis: maximum node speed in m/s)
In a second experiment, we investigate the sensitivity of PDI to the trans-
mission range of the wireless communication interfaces used by the mobile
nodes. The results of this study are shown in Figure 26.5. Figure 26.5 (a)
shows hit rate as a function of transmission range for different cache sizes.
For a transmission range below 100 meters, i.e., a radio coverage of about 1%
of the simulation area, PDI does not attain sufficient hit rates regardless of
the index cache size. Here, in most cases broadcast query messages are received
only by a small number of nodes. Consistent with the results shown in Figure
26.4 (b), given a reasonable size of the index cache and a transmission range of
115 meters (i.e., a radio coverage of about 4% of the considered area), PDI
achieves sufficiently high hit rates for Peer-to-Peer search queries. For larger
transmission ranges, most queries will reach the origin nodes of the matching
index entries (k, v). Thus, PDI does not benefit from caching index entries.
We conclude from Figure 26.5 (a) that for short-range communication de-
vices the number of participating devices must be high to enable the effective
employment of PDI, whereas for long-range communication, the system does
not significantly benefit from PDI.
Figure 26.5 (b) shows hit rate as a function of transmission range for dif-
ferent values of TTL_query. We find that message forwarding has no impact
in systems with small transmission ranges, while systems with medium trans-
mission ranges benefit heavily from forwarding. As another interesting result,
we find that for high transmission ranges PDI achieves the best performance
with message forwarding disabled. Here, unnecessary forwarding of PDI messages
will result in a substantial number of collisions of wireless transmissions, as
confirmed by the examination of ns-2 trace files. These collisions reduce the
Fig. 26.7: Recall vs. shared data for (a) different index cache sizes and (b) different
numbers of forwarding hops
Fig. 26.8: Recall vs. locality for (a) different forwarding options and (b) different
numbers of forwarding hops
Fig. 26.9: (a) Recall and (b) coherency vs. system size without invalidation (curves
for 32, 128, 512, and 2048 index entries; x-axis: node density in km^-2)
Coherency results are shown in Figure 26.9 (b). We find that without invalidation the stale hit rate may
reach 0.4. For smaller index cache sizes, the stale hit rate decreases with
node density. Jointly considering Figures 26.9 (a) and (b) reveals that for an
increasing node density the stale hit rate drops rapidly at the point when
the growth of the hit rate slows down. Looking closer at the index caches
in these scenarios, we find that the content of the caches is highly variable.
Thus, stale index entries are removed early from the caches. We conclude
from Figure 26.9 (b) that large caches yield a large number of stale hits if
no invalidation mechanism is used. In contrast, small index caches naturally
reduce stale hits, while they fail to provide high hit rates as shown in Figure
26.9 (a). This evidently illustrates the need for invalidation mechanisms in
order to achieve both high hit rates and low stale hit rates.
In a last experiment, we investigate the performance of an integrated
approach combining both value timeouts and lazy invalidation caches to take
into account both weak connectivity and information modification. In further
experiments presented in [389], we found that a suitable configuration of value
timeouts reduces the stale hit rate due to intermittent connectivity and node
failure by 75%. Furthermore, lazy invalidation caches of moderate size reduce
stale results due to data modification by more than 50%. Thus, we fix the
duration of the value timeout to 1000s and the invalidation cache size to 128
entries, since these parameters achieved best performance for the considered
scenario [389]. Figure 26.10 (a) plots the hit rate versus node density. We
find that hit rate is reduced mostly for small systems due to invalidations
of up-to-date index entries by value timeouts. This leads to a decrease of
at most 20%. The performance of index cache sizes of both 512 and 2048
is equal because a large cache cannot benefit from long-term correlations
between requests due to the short timeout. For growing number of nodes,
the hit rate converges towards results without an invalidation mechanism as
shown in Figure 26.9 (a).
Fig. 26.10: (a) Recall and (b) coherency vs. system size for hybrid invalidation
(curves for 32, 128, 512, and 2048 index entries; x-axis: node density in km^-2)
In return for the reduction in hit rate, the stale hit rate is significantly
reduced compared to a system without invalidation. As shown in Figure 26.10
(b), the stale hit rate is below 5% for all considered index cache sizes. We
conclude from Figure 26.10 that the integrated approach comprising the
introduced implicit and explicit invalidation mechanisms can effectively handle
both spontaneous node departures and modification of information. In
fact, for large index caches, the stale hit rate can be reduced by more than
85%. That is, more than 95% of the results delivered by PDI are up-to-date.
26.5 Summary
We have shown that PDI can cope with different application characteristics
using appropriate configurations. Thus, PDI can be employed for a large set of mobile
applications that possess a sufficiently high degree of temporal locality in the
request stream, including Web-portal and Web search without connection to
the search server, instant messaging applications, and mobile city guides.
In recent work, we have developed a general-purpose analytical perfor-
mance model for epidemic information dissemination in mobile ad hoc net-
works [390]. Currently, we are employing this modeling framework to opti-
mize PDI protocol parameters for selected mobile applications. Based on the
results, we are adopting PDI for developing software prototypes of a mobile
file sharing system, a mobile instant messaging application, and disconnected
Web search.
27. Peer-to-Peer and Ubiquitous Computing
Jussi Kangasharju (Darmstadt University of Technology)
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 457-469, 2005.
Springer-Verlag Berlin Heidelberg 2005
27.2.1 Information
27.2.2 Network
27.2.3 Collaboration
Because ubiquitous devices are typically small and have only very limited
capabilities, they must collaborate in order to deliver useful services. This
collaboration not only includes simple communications between devices, but
extends to actual cooperation and active sharing of resources and information.
To achieve this collaboration, we need to resort to techniques such as
grouping or communities which can be formed in an ad hoc manner. Such
groups can be formed by, for example, the devices carried by a person, or all
the devices in a given room. The devices within a group share a context and
normally need to be aware of each other and each other’s capabilities and
needs.
Since ubiquitous applications and devices can be used in many different cir-
cumstances, they need to be aware of the context in which they are being
used. One good example of context information is the current location of a
device, which may determine its behavior (e.g., when outside, a mobile phone
might ring loud, whereas in a meeting it would set itself to vibration mode).
Other context information includes current environmental conditions (light, tem-
perature, etc.) or higher-level information, such as the stored preferences of
the current user [166].
27.3 Communications in Ubiquitous Computing Architectures
Ubiquitous applications and devices are highly mobile, hence the middleware
must be able to handle this. We can distinguish two kinds of mobility which
require connections to be handed off during the communication.
On the one hand, we have horizontal handoffs (or handovers), which are
currently commonly used in mobile phone networks. In a mobile phone net-
work, a horizontal handoff occurs when the phone moves from the coverage
of one base station to another. During this move, the base stations must
transfer the communication resources to the new base station and the mobile
phone must then switch to this base station. In the ubiquitous computing
world, we do not necessarily have base stations, but can define a horizontal
handoff in an analogous manner. A horizontal handoff occurs when the device
must change its communication partners, but it continues to use the same
technology (e.g., Bluetooth or WLAN) for the communication.
On the other hand, we have vertical handoffs which occur when the device
must change communication technology in order to maintain the connection.
For example, a device in a WLAN hotspot which moves out of the reach of
the access point must switch to, e.g., UMTS, to remain connected.
Although horizontal handoffs are currently much more common, vertical
handoffs are likely to become more common with the proliferation of WLAN
networks. Some horizontal handoffs (especially in mobile phone networks)
can already be handled with current technology, but other types of handoffs,
especially vertical handoffs, usually result in broken connections.
Ubiquitous middleware must therefore have efficient support for both
kinds of handoffs, across a wide range of different networking technologies.
In this section, we will outline the major research challenges in the area of
ubiquitous Peer-to-Peer infrastructures. Each of the topics mentioned below
In a world where a multitude of devices are scattered around and are observ-
ing their environment, security and privacy issues are of paramount impor-
tance. Security allows us to authenticate the devices with which we commu-
nicate (and vice versa!) and is an important building block for establishing
trust. The world of ubiquitous computing is moving in a direction where more
and more of our everyday activities are taking place in the digital world and
therefore it is vital that we (as human users) are able to trust the devices
and architectures handling our affairs.
Likewise, in a world with many devices observing us, privacy issues have
an important role to play. These issues include the technical problems of
handling and preserving user privacy, through the use of technologies such as
anonymous (or pseudonymous) communications and transactions. Another
aspect of privacy concerns the non-technical issues, such as user acceptance,
and more generally, the expectations of the society as a whole with regard to
what level of privacy can be expected in widespread adoption of ubiquitous
computing architectures.
27.7 Summary
28.1 Introduction
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 473-489, 2005.
Springer-Verlag Berlin Heidelberg 2005
474 28. Business Applications and Revenue Models
3. How can these parties potentially recover their cost and earn a margin
of profit?
The notion of a service and application style will be introduced, different
revenue models will be presented, and their relevant properties will be iden-
tified. The abstract view of Peer-to-Peer interaction will be the reference in
the analysis of the application/service styles. The discussion at the end of the
chapter suggests ways of dealing with the shortcomings from which current
revenue models suffer.
28.2 Definitions
The Provider and the Receiver can reverse their roles throughout an interaction
sequence, i.e., the Receiver can become the Provider in a subsequent interaction
step and vice versa.
The terms “business model” and “revenue model” are often used interchange-
ably. However, for clarity a distinction should be made. A business model
is the totality of processes and arrangements that define a company’s ap-
proach to commercial markets in order to sell services and/or goods and
generate profits. In contrast, a revenue model is part of a business model:
it comprises all arrangements that permit the participants in business interac-
tions to charge fees to one or several other participants in order to cover
costs and add a margin to create profits [626].
Revenues can be generated in a direct or indirect way. In the direct way,
the receipts come directly from the customer. When an indirect revenue model
is adopted, the actual product will be provided for free. The gain is received
from a third party who is in one way or another interested in the diffusion
of the product. These two categories of possible revenue models can be refined
as follows (adapted from Laudon and Traver [378]):
– Indirect revenue models can be advertisements, affiliate models, or
bundling. When the advertising model is adopted, the third party wants to
communicate a sales message that is visible during the usage. It should be
considered that the message is influenced by the reputation of the advertis-
ing media. In the case of the affiliate model, the vendor receives revenue for
passing customers to an affiliate partner, who sells the products or services.
For every purchase, the vendor receives a commission. The bundling model
is similar to the affiliate model. The revenue is generated by products
or services that are connected to the free offering. Even though the basic
service is free, the customer has to pay for the additional offering.
– Direct revenue models are sales, transaction fees, and subscriptions. In
the sales model the customer pays directly for the object. Two forms have
to be distinguished. In the first one, licenses for the Peer-to-Peer business
application itself are sold. This model corresponds with the application
style. The second form is to use the application as an infrastructure to sell
goods or services. If the transaction fee model is used, the company will be
assigned the role of the mediating service. It provides a service that facili-
tates a transaction between the Provider and the Receiver. The Mediator
earns a fee for every transaction between them. Unlike the transaction fee
model, in the subscription model the fee is paid independently of the actual
usage. It is instead paid periodically, e.g., every month or year.
28.3 Revenue Models for P2P Business Application/Service Styles
In the following, the interaction style will be mapped to the reference view.
Thus, potential revenue models and issues in the application style setting as
well as in the service style setting can be examined.
Judging from the criteria that have been defined for viable business models
above, the application style allows for a differentiated charging through vari-
ous licensing models and fulfils the criteria of allocation effectiveness (as long
as the software is not illegally copied). However, when it comes to a service
style, revenue models for Instant Messaging face quite significant challenges
with respect to a differentiated charging structure. Allocation effectiveness
is not a problem for centralized and hybrid Peer-to-Peer architectures since
IM clients need to log on to the service provider’s network. However, this be-
comes a problem in pure Peer-to-Peer architectures, where it is rather
difficult to trace the communication between users in order to charge for it.
File sharing has become infamous through the quick rise and fall of Napster.
Although Napster collapsed with its first model, its place has been taken
over by others such as KaZaa, Morpheus, Grokster, and eDonkey. The music
industry is waging a ferocious fight against the free exchange of music by
seeding networks with empty music files or files that contain only part of a
song plus some affixed commercial content. Besides these technical weapons, the music
industry threatens users and service providers with lawsuits. By means of
rigorous penalties the users are to be discouraged from sharing illegal media
files. For example, the RIAA reached a settlement with four students in the
USA who ran a service for searching MP3 files in their college network. The
students paid penalties of between $12,000 and $17,000 each [83]. In Germany,
a trainee settled for EUR 8,000 [478].
Whatever the fate of the various file sharing applications and the commu-
nities behind them will be in the end, they have shaken up the value chain of
the music industry and they might well lead to changes in business models
[310]. Although such digital media exchanges have become the best-known
examples, they should not be mistaken for the only possible instantiation of
this application/service style. Two points are important here:
1. The exchange of entertainment media files is only one specific type of
content exchange. Any other digital content can also be exchanged, e.g.,
design documents, training documents, reports etc. Thus, Digital Content
Sharing can be used as a decentralized form of knowledge management. If
an adequate index service exists, the documents containing the knowledge
of the network participants can be accessed without forcing them to save
these documents on a central server.
2. The definition needs to be broader than just file sharing. It should include
streaming content as well since this type of content can also be recorded
and exchanged in a Peer-to-Peer manner (such an extension to include
streaming content clearly leads to specific challenges in the technological
the original Napster transactions, and the BMG Napster did not do more
than stop the illegal activity. For digital content exchanges that embrace a
Mediator approach, one could think of implementing a billing step into the
content exchange, where either or both of the participants are required to pay
fees for exchanging content, and those could then be paid to the Owner. If,
for the moment, it is assumed that this would be technically feasible, then
the Mediator adopts the role of an aggregating middleman for the content
which can be conveniently searched and compiled.
But it is questionable whether anybody other than the Owner should
be involved as a Provider. Why should the content first be bought from an
Owner and then be sold to another Receiver when at the same time the
Owner needs to be part of that transaction again and he needs to be prop-
erly reimbursed? Even though this distribution model could lead to technical
advantages, its economical benefits are not clear. It would be easier for the
Receiver to buy the content directly from the Provider. One might argue,
however, that the intermediary function can add additional value: Today’s
digital content exchanges usually integrate the recordings from different mu-
sic groups, which would not be the case with separate download sites. But
that is hardly convincing – the music industry could run a joint catalogue
service without major problems since the artists are regularly bound through
exclusive contracts. Finally, there is the question of control. Digital content
exchanges perform an unbundling of content and provide a possibility for
free reassembly through the user: Rather than buying a complete album,
consumers can buy selected titles only and create their own specific albums
to their tastes. The music industry and artists alike have good reasons to
be reluctant to agree on the unbundling and recompilation of the content. It
is rather difficult to determine an appropriate price for popular as opposed
to less popular titles. Apart from that, a full CD can be sold for a higher
price than, e.g., the three popular titles only [125, 604]. In any case, it seems
that the music industry cannot ignore the market demand for digital con-
tent. Several distribution services have started up recently, e.g., the Apple
iTunes store. But nearly all of the serious upcoming Providers are based on a
client-server architecture because more control over the distribution process
is guaranteed. In short: bringing an additional party into the transaction simply
because he or she happens to have the digital content at hand does not add
any clear economic value. It is more reasonable if the Provider once again
becomes identical with the Owner, in other words, the record companies sell
the content themselves or with the help of a few centralized licensed sellers.
Then it is likely to look more like iTunes, which follows a client-server-based
approach rather than a Peer-to-Peer exchange.
Finally, all these considerations will only hold for Peer-to-Peer exchanges
if a billing scheme can be built into the digital content exchange and if the
fees can be allocated accordingly. If no Mediator is involved, e.g., if a digital
content exchange is built on the decentralized Gnutella protocol, the enforce-
of fans over the years. When Marillion’s contract with their record label ex-
pired, they decided not to renew it. Instead they promoted the new record
using the Internet. The band members wrote to their fan base and asked
if they would be willing to pay for the CD in advance in order to finance
the making of the record. The response was overwhelming. Some fans even
offered to pay for two CDs “if that would help them”. When the production
was finished, the CD was offered through the band’s web site – not for down-
load, but for ordinary purchase through a secure web link. It remains to be
said, however, that there are other examples of such approaches that failed,
e.g., Stephen King’s experiment with an online book. A recent remarkable
approach without protection is addressed by the Potato System [493]. This
system tries to induce the users to license the media files by offering them the
possibility of earning a commission. When a user registers a media file and
hands it on to a friend, he receives a commission if the friend also buys a
license. Users are thus motivated to license and to recommend the file. But
whether this model will work is questionable. In the absence of copyright
protection, the Owner has no way to enforce his rights. This is likely to
be the main obstacle to music labels participating in this model.
ing technologies was ranked as one of the five top patents to watch by the
MIT Technology Review in 2001 [583].
If Grid Computing is mapped to the reference view, the Provider(s) can
be interpreted as the one(s) providing available computing power and the
Receiver as the one using this computing power to solve complex problems.
The providing interaction partner could equally be seen as the one who provides
the task to be processed. In this chapter, the focus lies on the computing
power because of the interest in its payment. The Mediator is the central
server application, which manages the distribution, analysis, integrity checks,
and security of the data sets.
Dougherty et al. have distinguished four different revenue models for Grid
Computing: the enterprise software sale, the public Internet exchange, an ap-
plication service provider (ASP) and a B2B exchange model, though the ASP
and B2B models “have not yet developed and may never develop” [178, p.
114]. The enterprise software sale model is identical with the view of an ap-
plication style, i.e., the revenue model is about selling distributed computing
software for installation behind the firewall or “enterprise grid”. The ratio-
nale is to provide more control over the contributing resources which will lead
to higher availability and better security. Apart from that, the LAN/WAN
capacities typically allow the transport of much larger data sets. The revenue
model is straightforward and consists of license fees and professional
services for implementation.
The public Internet exchange or “mixed grid” approach is of the service
style. The idea is to provide access to vast computing power on a worldwide
scale. An example is Moneybee where Grid Computing is used to predict
stock prices and exchange rates. Participants download and install a free
screen saver that uses idle PC resources, while the screen saver is active, to
perform complex computations downloaded from a Moneybee server. The
results are then uploaded to the Moneybee server [426].
With respect to revenue models, Grid Computing is in a quite different
situation from the other Peer-to-Peer interaction styles. As far as the Mediator
is concerned, there is a need to distinguish whether the service manages the
Grid Computing tasks on behalf of a third party or whether the mediating
service is identical with the Receiver. If the work is done on behalf of a third
party (which corresponds to the ASP model in [178]), the cost for the medi-
ating service plus a margin will need to be charged. If the mediating service
is provided by the Receiver, then the business utilization of the grid com-
putation results will have to cover the cost. In both cases, the issue is how
to determine the computing cost per Provider and how to compensate the
Providers. Mediators currently ask users to donate their excess resources. In
exchange, they offer a portion of these resources to non-profit organizations,
or else they provide “sweepstakes entries” for prizes. The Providers at Mon-
eybee contribute their resources free of charge. Their incentive is to get part
of the results (forecasts for stock prices and exchange rates) that the system
generates. It may certainly be that many other Grid Computing tasks are
taken on by Providers in a similar way for free (e.g., when they are bun-
dled with attractive add-ons like the lively graphics of a screen saver hosting
the client application as SETI@home does). What, however, if the Providers
want to be reimbursed monetarily for offering their excess resources? From
the technical point of view, it is not very difficult to employ a pay-per-use
model where the client application records resource usage and provides the
data to the mediating service which then reimburses based on usage. The real
problem is that the price paid for the resource supply is likely to be rather low
and (micro-) payments will need to be organized in a very efficient way, if the
transaction is not to use up all the benefit. It is doubtful whether financial
incentives will be capable of attracting a sufficient number of Providers – at
the end of the day, a non-monetary incentive seems to be the better idea.
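Such a pay-per-use scheme is technically simple to sketch. The following toy model illustrates client-side metering and mediator-side reimbursement; the rate, class names, and identifiers are hypothetical assumptions, not taken from any real Grid middleware:

```python
# Hypothetical sketch of pay-per-use metering for donated Grid resources.
# The client application records resource usage; the mediating service
# reimburses Providers based on the reported figures.

RATE_PER_CPU_SECOND = 0.0001  # assumed micro-payment rate

class UsageMeter:
    """Runs on the Provider's machine and records consumed resources."""
    def __init__(self):
        self.cpu_seconds = 0.0

    def record(self, cpu_seconds: float) -> None:
        self.cpu_seconds += cpu_seconds

class Mediator:
    """Central service that collects usage reports and computes payouts."""
    def __init__(self):
        self.balances = {}

    def report_usage(self, provider_id: str, meter: UsageMeter) -> None:
        payout = meter.cpu_seconds * RATE_PER_CPU_SECOND
        self.balances[provider_id] = self.balances.get(provider_id, 0.0) + payout

meter = UsageMeter()
meter.record(3600.0)          # one full hour of donated CPU time
mediator = Mediator()
mediator.report_usage("provider-42", meter)
print(round(mediator.balances["provider-42"], 2))  # 0.36
```

The example also makes the efficiency problem concrete: an hour of donated CPU time yields only a fraction of a currency unit, so the cost of transferring such micro-payments can easily exceed the payout itself.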
In summary, even though Grid Computing probably has the most straight-
forward revenue model of the core Peer-to-Peer applications, it still faces the
challenge of creating enough business to generate the micro-payments needed
to attract a sufficient subscriber base. Judging from the criteria for revenue
models, there are no problems regarding the allocation effectiveness or effi-
ciency. The question is whether Grid Computing will create sufficient busi-
ness value to earn its own living, once Providers want to charge for the use
of their resources.
28.3.4 Collaboration
diator can be the server which facilitates the communication or which offers
additional services, depending on the topology type. The Provider and Re-
ceiver are the communication partners of the workgroup, whereas the Object
is the message or document that is exchanged between them. As described,
the Provider generally is the legal owner of the Object or is at least authorised
to hand it on to the Receiver.
The method of selling groupware as an application style is unlikely to
change with new architectures underlying the software. Various licensing
models as introduced in Section 28.3.1 can be used. A special opportunity
arises from the complexity of groupware applications, i.e., the aggregation of
different functions and their deep integration into daily work. It can be assumed
that a comparatively high demand for professional services exists.
If groupware applications are hosted and brought to the users in the form
of a service style, the above considerations for the core applications of Instant
Messaging and Digital Content Sharing can be carried forward. Transaction-
based billing can only be arranged if there is a central instance that can
observe the usage of the service. Where the communication passes through a
server, a transaction-based fee can be calculated from the amount of
transferred data or the usage time. Otherwise, only the usage of the services
provided by the server can be charged for, e.g., the catalogue service
for members and files, storage for the files of temporarily offline peers, or
security services such as logging. Whether it is possible to adopt the service
style in the case of a completely decentralized architecture is questionable. In
this case, the considerations of Instant Messaging apply. It should be added
that the Provider of a Peer-to-Peer collaboration service style should consider
whether he wants to bill for every user or for a complete group. It is inherent
in collaboration that the work between two members can also benefit the
other group members. So it seems adequate to leave the choice of accounting
method to the customer.
With respect to the criteria for revenue models the danger is that allo-
cation effectiveness cannot be ensured. If the infrastructure employs a real
Peer-to-Peer model, the revenue model will face efficiency challenges. It is the
bundling of various services (such as IM and File Sharing and other, poten-
tially non-Peer-to-Peer services) that makes groupware interesting. Revenue
models for groupware service styles can then be built around various other
criteria. The Peer-to-Peer functionality is only one of them.
28.4 Discussion
of Skype was made public at the end of August 2003. Due to the easy in-
stallation process, the easy-to-use user interface and the good sound quality,
Skype could – as it stated – attract 10 million users in the first year, with more
than 600,000 users logged on on average [199]. It thus succeeded in building
a huge user community. This community gives Skype a user base
to establish a revenue model based on a chargeable service called SkypeOut,
which started at the end of July 2004 and allows users to make prepaid calls
into conventional telecommunication networks. It is conceivable that Skype could
use its VoIP infrastructure for further services, e.g., for a radio program or
for music distribution. For both IM bundling revenue models, the very basic
and rudimentary IM services could still be free (not least since even a
small fee can put communities that have grown accustomed to free-of-charge
use at risk of breaking apart). Additional services could be charged on
a pay-per-use basis. Premium services for secure access (e.g., for connections
to bank agents) could have a base subscription fee.
When it comes to Digital Content Sharing, it is currently unclear how the
battle between file exchanging and the music (or other, e.g., film) industry
will finally turn out. But as described above, it might be a better strategy
to try owning the communities. Clearly, owning a community would hardly
be possible if the participants of that community were required to pay for
something that they could get for free somewhere else. Once again, the way
to make such communities work would be to bundle the digital content with
other information goods that are not easily available through illegal content
exchanges. Examples would be discounts on concert tickets, fan merchandise
that can be ordered exclusively through the community, chat sessions with the
artists, or competitions in which personal items of the artists can be won.
Bundling is not really a remedy for Grid Computing revenue models. If
the transfer of the (micro-) payments generates too much overhead, then
revenue in the sense of a pecuniary compensation might not be the way
to go. However, providing information goods as a reimbursement could be
feasible. For example, the screen saver running the distributed computing task
might not just be made of lively graphics but it might provide an information
channel, such as Moneybee does (even though Moneybee is not Peer-to-Peer)
[426]. These information channels can report parts of the grid-wide computed
results as well as news independent of the computing task. At the end of the
day, this approach brings the revenue model back to barter-like structures.
Services for supporting Collaboration have possibilities for bundling sim-
ilar to IM. Basic services like document handover and communication can
be provided for free. Additional services that rely on central components
can be charged per use; these components are, in particular, the catalogue,
buffering, and security services. These services are not required for
smaller workgroups and Collaboration tasks, but they might become essential
with rising requirements.
29. Peer-to-Peer Market Management
Jan Gerke, David Hausheer (ETH Zurich)
29.1 Requirements
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 491-507, 2005.
Springer-Verlag Berlin Heidelberg 2005
Peer-to-Peer systems are based on the idea that peers offer services to other
peers. Ideally and in the absence of a monetary payment system, each peer
should contribute as much as it uses from other peers. However, as peers
are autonomous entities acting in a rational way, it is unlikely that such
cooperation is going to happen without appropriate incentives for peers to
share their resources. In fact, it was shown in [11] that 70% of Gnutella users
share no files at all. Thus, many users in Gnutella compete for the resources
offered by only a few peers, which leads to a major degradation of the overall
system performance.
Some Peer-to-Peer systems use specific accounting or reputation mecha-
nisms to deal with this problem, such as BitTorrent’s tit-for-tat mechanism
[128], eMule’s credit system [589], or KaZaA’s peer points [558]. However,
most of these mechanisms are purely file sharing-oriented and can thus hardly
be used for other types of services. Moreover, due to weak security measures
these mechanisms can usually not be applied to commercial purposes.
Another major problem faced in Peer-to-Peer networks is the fact that
individual peers are usually unreliable, i.e., they may be faulty or even act ma-
liciously. Peers may often join and leave the system, lose messages or stored
data, or deliberately misuse or harm the system. Replication can help to
increase data availability and reliability, however, without appropriate syn-
chronisation techniques replicated data may quickly become inconsistent. In
a commercial environment this problem becomes even more critical. A peer
may increase its own benefit by acting maliciously against potential competi-
tors, e.g., by not forwarding other peers’ service offers. It is hardly feasible
to use accounting mechanisms or payments as an incentive to fulfill such ba-
sic tasks, as it may be difficult to check whether a particular task has been
performed correctly or not. Also, the necessary accounting effort may quickly
exceed the effort for the actual task.
As a consequence of decentralisation and the potentially large size of a
Peer-to-Peer network, the efficient and scalable design of appropriate search
mechanisms and other distributed tasks is another difficult problem. Existing
Peer-to-Peer overlay infrastructures such as Pastry or Chord (cf. Chapter 8)
allow for efficient request routing, but have limited support against malicious
peers or insecure networks. In fact, many Peer-to-Peer mechanisms currently
do not consider malicious behavior at all.
The following three main functional goals form the basis for a completely
decentralized and generic Peer-to-Peer marketplace:
Service Support
The targeted Peer-to-Peer system needs to support completely different ser-
vices, including purely resource-based services such as processing power, stor-
age, or transportation, as well as higher level services such as content or soft-
ware applications going beyond pure file sharing. Also, combinations of ex-
isting services offered by different peers need to be supported, to create new,
value-added services. The service usage model described in Section 29.2.2
illustrates what such combinations of distributed services may look like.
Market-Based Management
The most important goal is the creation of a marketplace for trading different
services, managed by true market mechanisms which provide appropriate
incentives. On the one hand, the traditional way to achieve this is to introduce
a currency that can be used in exchange for the services being provided.
On the other hand, barter trade is a suitable alternative which has to be
supported, too. Barter is a simple form of trade where services are directly
exchanged against other services, e.g., a peer may only download a file if
it also provides one. The market model described in Section 29.2.1 further
details market-related aspects of the Peer-to-Peer architecture.
Decentralization
Today, many Internet-based marketplaces such as eBay [183] are based on
centralized infrastructures. However, a true Peer-to-Peer-based system must
rely solely on Peer-to-Peer mechanisms, which must be able to function without
any central components. Only this type of approach offers the full advantage
of the Peer-to-Peer concept and ensures that no central point of failure exists.
Efficiency
The adopted core mechanisms should lead to an economically efficient allo-
cation and use of the services being traded among the market participants.
Economic efficiency is reached when the services are allocated in a way which
maximizes the overall social benefit of all participants. Additionally, as far as
the technical design of the mechanisms is concerned, an efficient use of tech-
nical resources like network capacity, memory space, and processing power
has to be achieved. In a distributed system, network resources (i.e., commu-
nication bandwidth) clearly constitute the main bottleneck. Thus, the size and
number of exchanged messages have to be minimized.
Scalability
With respect to the technical performance, a solution should be capable of
operating under any load, i.e., any number of market participants or services
being offered. A system is scalable if the performance does not decrease as
the load increases. A centralized system does not scale well under these cir-
cumstances, because the load on it increases as more participants make use
of it. Therefore, a central system can quickly become overloaded, especially if
no centralized load-balancing concepts are applied. In contrast, Peer-to-Peer
systems benefit from the characteristic that the load caused by a participat-
ing peer can be compensated by those additional resources provided by that
peer. Emerging Peer-to-Peer overlay infrastructures (cf. Chapter 21) benefit
from this advantage and provide, in addition, scalable and efficient routing
mechanisms which can be used for object replication and load-balancing pur-
poses.
Reliability
It is important that a system designed for real-world applications is available
continuously and performs correctly and securely even in the case of individ-
ual failures. Centralized systems are highly vulnerable to total failures
or Denial-of-Service attacks, which can render a system unusable.
Peer-to-Peer systems are by design more robust against such failures or at-
tacks. At the same time, however, they can suffer from the fact that peers are
autonomous entities which, as mentioned earlier, may not behave as intended
by the designer of the mechanism. A solution has to minimize the impact of
such behavior and prevent or discourage it.
Accountability
Making the services traded among the peers accountable is another
indispensable requirement for a market-managed Peer-to-Peer system. An ac-
counting or payment mechanism is required which provides the notion of a
common currency that can represent the value of the individual services. This
may be a scalar value, which can be aggregated over time and thus represents
the current credit of a peer. Peer-to-Peer accounting systems are discussed
in detail in Chapter 32. One of the main challenges of an accounting system
is clearly to bind the accounting information to a real identity, thus making
re-entries of peers under a new identity costly and therefore unattractive.
Karma [608], PPay [636] or PeerMint [283] are potential systems that may
be used for this purpose. A similar mechanism is needed to keep track of
a trader’s reputation, considering its behavior in the past, such as cheat-
ing, freeriding, or running malicious attacks. There are trust mechanisms like
EigenTrust [334] which are able to aggregate such information in an efficient
way. The trust metric is needed to be able to exclude misbehaving peers from
the system.
Further desirable properties exist, such as privacy or anonymity, which may
conflict with accountability, as it is difficult to guarantee accountability and
anonymity at the same time. Whether a system has to provide such properties
depends on the targeted applications.
29.2 Architecture
The classical market is a place where sellers and buyers meet to exchange
goods against payment, e.g., money. While in old times this market corre-
sponded to a closed physical location, nowadays the term is used in a much
broader sense, e.g., to describe a national or even the global market.
The goods traded in the market described here are services (cf. Chap-
ter 14). Thus, the sellers and buyers are service providers and service con-
sumers. However, participants in the market are not restricted to either pro-
vide or consume a service. Rather, they can take on any of these roles at any
point of time. This means that they can provide a service to a second
participant and later use a service from a third participant or vice versa. They
can even do both at the same time, as shown in Figure 29.1.
Fig. 29.1: Peers simultaneously offering services to and payments from other peers.
Having laid the foundation for a Peer-to-Peer system through its market and
service usage models, it is now essential to derive the internal structure of a
peer. To ensure a sufficient degree of modularity of the architecture, a layered
structure is used for the peer model. The lowest layer consists of the resources
that are locally available at a certain peer node. On top of this layer, services are
executed which can draw on local resources, such as storage space, computing
power or content. They can also access remote resources through other services.
[Figure: service provision among peers – applications and service instances running on Peers 1–4]
The core functionality layer on each peer node provides the functional-
ity needed to uphold the peer network services. It is in charge of some basic
local functionality like local resource management as well as distributed func-
tionality and protocols like service discovery or reputation. In particular, it
includes all functionality needed to enable pricing, metering, and charging
and, hence, to support market management of the system. The core
functionality layer on each peer accesses and cooperates with the corresponding
layers on remote peers but does not access remote or local services.
Strategy development             Offline task of the service provider
Product management               Offline task of the service provider
Human resource management        Offline task of the service provider
Budgeting and controlling        Resource management and QoS control
Marketing and selling            Service description, discovery, pricing and negotiation
Contracting                      Service level agreements
Order fulfillment                Service execution, accounting, charging
Business development             Service composition (uses existing core functionality)
External security mechanisms     Security mechanisms included
29.3 Case Studies
The last two sections introduced the concept of a Peer-to-Peer based service
market and a system architecture to enable such a market. Still, this has
merely given an overview of the whole topic. Therefore, the purpose of this
section is to give a more detailed view of two subtopics that serve as examples
of the complexity and problems involved. First, the design of a Peer-to-Peer
middleware is introduced. Its purpose is to implement the key mechanisms
described in the previous sections, thus enabling the architecture, which in
turn enables the service market. This middleware has been developed and
implemented within the EU-funded MMAPPS project [591] where it has been
successfully used as a basis for various Peer-to-Peer-applications. Second,
one key mechanism, namely pricing, is presented in even more detail. Its
[Figure: the middleware between services and the network – modules for Accounting, Service Search, Pricing, Negotiation, and Security, connected to the services through interfaces I, J, and K]
After informing the A&C module about the forthcoming service delivery
via interface I, the Service Management module instantiates, configures and
starts a new service instance through interface I. During the service delivery,
the service instance reports its status by sending events to the A&C module
via interface J. The A&C module compares these events against the SLA and
informs the Service Management module via interface K when necessary, e.g.,
in the case of an SLA breach or when the service instance has finished. The
Service Management module controls the service delivery via interface I, e.g.,
stops it in the case of an SLA breach. For special purposes the A&C module
can contact remote A&C modules, e.g., to receive an immediate payment
through tokens (cf. Chapter 32).
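The SLA-checking behaviour of the A&C module described above can be sketched as follows. The event fields, the SLA term (a maximum transfer volume), and all names are hypothetical assumptions for illustration, not the actual MMAPPS interfaces:

```python
# Hypothetical sketch of the A&C module's SLA check: service instances
# report status events (interface J); the A&C module compares them
# against the SLA and notifies Service Management (interface K) on a
# breach, which would then stop the service instance.

class AccountingAndCharging:
    def __init__(self, sla_max_mb: float, on_breach):
        self.sla_max_mb = sla_max_mb        # agreed transfer volume (assumed SLA term)
        self.delivered_mb = 0.0
        self.on_breach = on_breach          # callback into Service Management

    def handle_event(self, event: dict) -> None:
        """Receive a status event from the running service instance."""
        self.delivered_mb += event.get("transferred_mb", 0.0)
        if self.delivered_mb > self.sla_max_mb:
            self.on_breach()                # SLA breach: inform Service Management

breaches = []
ac = AccountingAndCharging(sla_max_mb=100.0, on_breach=lambda: breaches.append("stop"))
ac.handle_event({"transferred_mb": 60.0})
ac.handle_event({"transferred_mb": 50.0})   # total now exceeds the SLA
print(breaches)  # ['stop'] - Service Management would stop the instance
```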
The middleware design has been described in more detail in [239] and
[241]. A prototype has been implemented within the MMAPPS project [591]
based on the JXTA framework [255]. The prototype serves as a proof of
concept, showing that the middleware enables the architecture presented in
Section 29.2. It specifically fulfills the functional requirements of service support
and market-based management, as well as the non-functional requirements
of efficiency and accounting (cf. Section 29.1). The other requirements of
decentralisation, scalability and reliability depend on the underlying Peer-to-
Peer framework, JXTA. The middleware does not impede these requirements,
since it introduces neither centralized entities nor unscalable protocols into
the Peer-to-Peer environment.
Online auctions like eBay [183] are becoming increasingly popular market-
places for trading all kinds of services over the Internet. Auction-based mar-
kets benefit from the flexibility to adjust prices dynamically and make it
possible to achieve efficient supply allocations (for an overview of auctions, cf. [234]).
However, those markets usually rely on a central component, i.e. the auc-
tioneer which collects price offers of all participants and performs matches.
PeerMart, which is the second case study presented in this chapter, com-
bines the advantages of an economically efficient auction mechanism with
the scalability and robustness of Peer-to-Peer networks. It is shown, how
PeerMart implements a variant of the Double Auction (DA) on top of a
Peer-to-Peer overlay network, as an efficient pricing mechanism for Peer-to-
Peer services. In contrast to a single-sided auction like the English Auction
or the Dutch Auction, the Double Auction allows both providers and consumers
to offer prices. The basic idea of PeerMart is to distribute the broker load
of an otherwise centralized auctioneer onto clusters of peers, each being re-
sponsible for brokering a certain number of services. PeerMart differs from
existing work, such as [165] and [460], since it applies a structured rather
than a random Peer-to-Peer overlay network, which enables deterministic
location of brokers and is more efficient and scalable. It resolves the chicken
and egg problem between providing incentives for services and being itself
dependent on peers’ functionality by introducing redundancy. Under the as-
sumption that a certain fraction of peers behaves correctly, PeerMart is thus
able to provide high reliability even in the presence of malicious or un-
reliable peers. Key design aspects of PeerMart are presented briefly in the
following.
Basic Design
The basic pricing mechanism in PeerMart works as follows: providers and
consumers interested in trading a particular service have to contact a
responsible broker from which they can request the current price. Brokers
(realized by clusters of peers) answer such requests with two prices:
– the current bid price, i.e. the current highest buy price offered by a con-
sumer
– the current ask price, i.e. the current lowest sell price offered by a provider
Based on this information, consumers and providers can then send their
own price offers (bids or asks) to the brokers. Brokers continuously run the
following matching strategy:
– Upon every price offer received from a peer, there is no match if the offer
is lower (higher) than the current ask price (bid price). However, the offer
may be stored in a table for later use.
– Otherwise, if there is a match, the offer will be forwarded to the peer that
made the highest bid (lowest ask). The resulting price for the service is set
to the mean price between the two matching price offers.
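The matching strategy above can be sketched as a toy, single-broker model. In PeerMart a broker is actually realized by a whole set of peers; here a single object suffices to illustrate the matching logic, and all class and method names are our own illustration:

```python
# Toy sketch of the Double Auction matching rule described above.
# Bids (buy offers) match against the lowest ask, asks (sell offers)
# against the highest bid; the trade price is the mean of the two.

class Broker:
    def __init__(self):
        self.bids = []  # stored buy offers: (price, consumer)
        self.asks = []  # stored sell offers: (price, provider)

    def quote(self):
        """Return (current bid price, current ask price), None if empty."""
        bid = max(self.bids)[0] if self.bids else None
        ask = min(self.asks)[0] if self.asks else None
        return bid, ask

    def offer_bid(self, price, consumer):
        """Consumer offers to buy: match against the lowest ask, if any."""
        if self.asks and price >= min(self.asks)[0]:
            ask_price, provider = self.asks.pop(self.asks.index(min(self.asks)))
            return (consumer, provider, (price + ask_price) / 2)  # mean price
        self.bids.append((price, consumer))  # no match: store for later use
        return None

    def offer_ask(self, price, provider):
        """Provider offers to sell: match against the highest bid, if any."""
        if self.bids and price <= max(self.bids)[0]:
            bid_price, consumer = self.bids.pop(self.bids.index(max(self.bids)))
            return (consumer, provider, (price + bid_price) / 2)
        self.asks.append((price, provider))
        return None

b = Broker()
b.offer_ask(1.0, "P1")          # ask $1 - no bid yet, stored
b.offer_ask(2.0, "P2")          # ask $2 - stored
match = b.offer_bid(3.0, "C1")  # bid $3 >= lowest ask $1 -> match
print(match)                    # ('C1', 'P1', 2.0): mean of $3 and $1
```

After the match, only the unmatched $2 ask remains in the broker's table, so a subsequent quote would report no bid price and an ask price of $2.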
To implement this mechanism in a decentralized manner, PeerMart uses
Pastry [527], a structured Peer-to-Peer overlay infrastructure. The overlay
handles peers joining and leaving the system and is used to find other peers
(brokers) in the network. Every peer is given a unique 128-bit node identifier
(nodeId), which can be calculated from a peer’s IP address or public key
using a secure hash function. In PeerMart it is assumed that every peer has
a public/private key pair, which is also used to sign and verify messages.
Furthermore, it is assumed that each service has a unique service identifier
(serviceId). For content services this can be achieved, e.g., by calculating the
hash value of the content data. The serviceId needs to have at least the same
length as the nodeId, to be able to map the services onto the address space
of the underlying network. The only varying service parameter considered at
this stage is the price.
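The identifier scheme described above can be illustrated as follows. The choice of SHA-1 truncated to 128 bits as the secure hash, and the plain numeric-distance broker selection (which ignores the wraparound of the overlay's circular address space), are simplifying assumptions for this sketch:

```python
# Illustrative derivation of 128-bit identifiers: nodeIds from a peer's
# IP address (or public key), serviceIds from a hash of the content data.
import hashlib

def node_id(ip_address: str) -> int:
    """128-bit nodeId derived from the peer's IP address."""
    digest = hashlib.sha1(ip_address.encode()).digest()
    return int.from_bytes(digest[:16], "big")  # keep 128 of the 160 bits

def service_id(content: bytes) -> int:
    """128-bit serviceId derived from the content data itself."""
    digest = hashlib.sha1(content).digest()
    return int.from_bytes(digest[:16], "big")

def broker_set(service: int, node_ids: list[int], n: int = 4) -> list[int]:
    """The n peers numerically closest to the serviceId act as brokers
    (simplified: absolute distance, ignoring the circular id space)."""
    return sorted(node_ids, key=lambda nid: abs(nid - service))[:n]

peers = [node_id(f"10.0.0.{i}") for i in range(1, 20)]
sid = service_id(b"some shared media file")
print(len(broker_set(sid, peers)))  # 4 brokers responsible for this service
```

Because both identifier types are 128 bits long, a serviceId can be mapped directly onto the overlay's address space, and the broker set for a given service can be located deterministically by any peer.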
A set of n peers (called a broker set) whose nodeIds are numerically closest
to the serviceId is responsible for acting as brokers for that service. Each peer
in a broker set keeps an auction table for the service to store the m/2 highest bids and
29.3 Case Studies 505
serviceId x
C1 broker set
bid $3 for x
x
1 for
ask $
P1
for x
serviceId y
ask $2 C3
C2
P2
matching offers to all other peers in the broker set. Based on the signature
of an offer, brokers can verify its validity. In addition to the offers matching
locally, a broker also forwards the current highest bid and lowest ask, if it
has not already been sent earlier. Thus, only potential candidates for a match
are synchronized among peers in a broker set. Based on the offers received
from other brokers the current bid price (ask price) can be determined and
a globally valid matching can be performed by every broker. Asks and bids
matching globally are finally forwarded to the corresponding peers by those
broker peers which initially received them.
This redundant approach implicitly accounts for message loss: when a message
is accidentally lost between two brokers, it appears as if one of the brokers
were acting maliciously. So far, however, timing issues have not been dealt
with; it was assumed that all messages are sent without any delay.
To tackle the problem of message delays, PeerMart uses slotted time for every
individual auction. Time slots have a fixed duration, which has to be longer
than the maximum expected round-trip time between any two peers.
Every time slot has a sequence number starting at zero when a service is
traded for the first time. Price offers from providers (consumers) are collected
continuously. At the end of every even time slot, the potential candidates
for a match are forwarded to the other brokers and arrive there during an
odd time slot. Candidates arriving during even time slots are either delayed
or dropped, depending on the sequence number. At the end of every odd
time slot, the final matches are performed and notified to the corresponding
peers. Since after this synchronization process all broker peers have the same
information needed to match offers, no matching conflict can occur. In the
rare case that more than one peer quoted exactly the same price within the
same time slot, a broker peer gives priority to the offer that arrived first.
After synchronization, the price offer which was prioritized by most brokers
is selected.
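The even/odd slot exchange boils down to a deterministic matching step that every broker can run on the synchronized offers. The following sketch illustrates that step; the data layout and the mid-price settlement are our assumptions for illustration, not PeerMart's specified rules.

```python
# Illustrative matching step of a broker in a slotted double auction.
# Offers are (price, arrival_order, peer_id); arrival order breaks price ties,
# mirroring the rule that the offer which came in first gets priority.

def match_offers(bids, asks):
    """Return a list of (consumer, provider, price) matches."""
    # Highest bid first; on a price tie, the earlier offer wins.
    bids = sorted(bids, key=lambda o: (-o[0], o[1]))
    # Lowest ask first; on a price tie, the earlier offer wins.
    asks = sorted(asks, key=lambda o: (o[0], o[1]))
    matches = []
    while bids and asks and bids[0][0] >= asks[0][0]:
        bid = bids.pop(0)
        ask = asks.pop(0)
        # Mid-price settlement is an assumption of this sketch.
        matches.append((bid[2], ask[2], (bid[0] + ask[0]) / 2))
    return matches

# Consumer C1 bids $3 for x; providers P1 and P2 ask $1 and $2.
match_offers([(3, 0, "C1")], [(1, 0, "P1"), (2, 1, "P2")])  # → [('C1', 'P1', 2.0)]
```

Because every broker sorts the same synchronized offers with the same tie-breaking rule, all brokers arrive at the same matches, which is why no matching conflict can occur.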
A prototype of PeerMart has been implemented and is available as open
source software for testing purposes [481]. More details about the imple-
mentation and results obtained from various experiments can be found in
[282]. These results show that PeerMart provides a reliable, attack-resistant
Peer-to-Peer pricing mechanism with low message and storage overhead, and
that it scales well with the number of peers trading services. The mechanism
is completely decentralized and suitable for trading any type of service.
Hence, it fulfills all the requirements stated in Section 29.1.
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 509-525, 2005.
Springer-Verlag Berlin Heidelberg 2005
30. A Peer-to-Peer Framework for Electronic Markets
Fig. 30.1: The Peer-to-Peer paradigm at various levels of a system: the traditional
architectures as well as our view for electronic markets.
The proposed architecture has been developed within the framework of the
SESAM project of the priority research program ‘Internet Economy’ funded
by the German Ministry of Education and Research (BMBF). The complete
project covers three scenarios: multi-utility markets, virtual power plants, and
wearable services. The latter two scenarios are of central interest regarding
the Peer-to-Peer paradigm. In the virtual power plant scenario, we assume
that many small devices producing electricity are deployed at locations such
as houses, small companies, and public buildings. The owners of these power
plants want to maximize their profit by selling the energy to the bidder with
the highest offer. The purchaser, on the other hand, wants to buy energy as
cheaply as possible. From this starting point many interesting questions arise:
How does a purchaser find the cheapest offer? Can purchaser and seller
enter into a contract without meeting in person? How are mini power plants
controlled to maximize profit? What happens if a mini power plant breaks
down? How is accounting accomplished?
These questions lead to several subprojects:
Electronic Contracting – Business processes are subject to legal rules. In or-
der to achieve transparency and seamlessness with high spontaneity in
the markets, the harmonization of the law must be promoted and signing
of contracts must be automated in the network.
Spontaneity, Transparency and Incentives – The considered scenarios re-
quire smooth and comprehensible interaction of various connected com-
ponents and services. Transparency and incentives for the actors involved
are preconditions for the functioning of such self-organizing markets.
Optimization, Control and Business Models – Due to their inherent decen-
tralized nature and dynamics, self-organizing and spontaneous markets
require specially adapted business models as well as completely new de-
centralized optimization and control mechanisms.
Robustness and Security – One important requirement for the commercial
success of the applications is the security and robustness of all partici-
pating components and processes against active and passive breakdowns
and against attacks on ‘normal’ activity.
The results of the virtual power plant scenario are carried over to the field
of so-called wearable services. New markets could emerge through communi-
cation among small devices like PDAs, mobile phones, and sensors in clothes
or worn directly on the body.
To build an integrated Peer-to-Peer system for electronic markets as out-
lined above, an architecture is needed that brings together service orientation
and overlay networks. In particular, the provision of standardized interfaces
to a pool of overlay networking techniques is needed. In this chapter we will
describe our service-oriented Peer-to-Peer architecture (Section 30.2) and dis-
are not trivially carried out by a Distributed Hash Table (DHT) based Peer-
to-Peer network. Therefore, we need an architecture where service instances
on each peer can easily be attached to a specific Peer-to-Peer networking
approach, i.e., we need an ‘overlay API’ such as the one for DHT-based
methods [148].
Following the above reasoning, we will now first make a quick digression
to traditional service-oriented architecture. Then, we will discuss the concept
of a ServiceNet in more detail and outline the peer node architecture that
brings together service orientation and overlay networks.
30.2.2 ServiceNets
Fig. 30.2: A peer (Peer A) participating in two ServiceNets. Each membership is described by a ServiceNet Class (SC-ID), a ServiceNet Instance (SI-ID), overlay information (e.g., Overlay-Id = Chord://3456 or Pastry://0x3e4), and a communication address (e.g., COM-Adr = 1.2.3.10:456).
an identity management service, which offers functions that are all based on
a unique key, could use a DHT-based Peer-to-Peer network. On the other
hand, the document service implements a keyword search for which a DHT-
based Peer-to-Peer network may not be suitable. Here, an unstructured
Peer-to-Peer network like GIA [113] might be a better fit.
Using one Peer-to-Peer network per ServiceNet has the advantage that
the characteristics of that network can be matched to the service's needs.
Clearly, to assist the process of service creation, a ‘catalogue’ is needed so
that one can find the suitable Peer-to-Peer network for a specific service and
ServiceNet, respectively, based on the required functionality or constraints.
For example, service developers and providers must characterize their service
by answering questions such as ‘Is there a unique key for data elements?’ etc.
Afterwards they should get a recommendation for a Peer-to-Peer network
their service should use. Papers like [346] provide an aid in building such
a catalogue. Currently we are developing a model that gives us a basis for
describing the characteristics and behavior of Peer-to-Peer networks. This
model will then be used for simulation and evaluation.
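The envisioned catalogue maps answers to such questions to an overlay recommendation. A minimal sketch of that idea, with entirely hypothetical decision rules, could look as follows:

```python
# Hypothetical catalogue lookup: map service characteristics to an overlay
# recommendation. The rules below are illustrative assumptions, not the
# model under development.

def recommend_overlay(unique_key: bool, keyword_search: bool) -> str:
    """Suggest a Peer-to-Peer network class for a service."""
    if keyword_search:
        return "unstructured (e.g., GIA)"
    if unique_key:
        return "DHT-based (e.g., Chord or Pastry)"
    return "no recommendation"

# An identity management service keyed by unique identifiers:
recommend_overlay(unique_key=True, keyword_search=False)
# A document service with keyword search:
recommend_overlay(unique_key=False, keyword_search=True)
```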
A peer in a ServiceNet (Figure 30.2) is characterized by:
– ServiceNet Class. A ServiceNet Class (SC) is a unique identifier for the type
of service which the ServiceNet offers. Thus, all peers of a ServiceNet
offer the same kind of interface and functionality. A document service is
one kind of service class.
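The per-ServiceNet information attached to a peer (cf. Figure 30.2) can be sketched as a small record; field names follow the figure's legend, while the types are our assumptions.

```python
from dataclasses import dataclass

# Sketch of a peer's per-ServiceNet membership record (cf. Figure 30.2).

@dataclass
class ServiceNetMembership:
    sc_id: int       # ServiceNet Class: the type of service offered
    si_id: int       # ServiceNet Instance: the concrete ServiceNet
    overlay_id: str  # overlay information, e.g. "Chord://3456"
    com_addr: str    # communication address, e.g. "1.2.3.10:456"

# Peer A from the figure participates in two ServiceNets at once:
peer_a = [
    ServiceNetMembership(3, 44, "Chord://3456", "1.2.3.10:456"),
    ServiceNetMembership(5, 43, "Pastry://0x3e4", "1.2.3.10:687"),
]
```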
In this section we will outline our peer architecture which integrates the
principles mentioned above. This platform offers a basis for current and fu-
ture services for electronic markets. The architecture of a peer is depicted in
Figure 30.3. This architecture can be divided into the following parts:
– Communication Layer. This layer provides an abstraction from a concrete
communication network, such as one based on IPv6, and enables the layer
above to send and receive messages.
– Overlay Layer. The overlay layer makes various overlay techniques avail-
able. Chord and Pastry are examples of DHT-based Peer-to-Peer
networks, and GIA is an example of an unstructured network. This
layer knows the appropriate algorithms and is responsible for initializing
procedures. For example, when a peer wants to join a ServiceNet, this layer
handles the initializing procedure like building a ‘finger table’ [575].
– SOAP-Processor. The SOAP-Processor translates programming-language
objects into corresponding SOAP messages, which are exchanged between
peers. SOAP messages are used because they are independent both of the
specific transport protocols used in the communication layer and of the
programming languages used to implement the services.
Fig. 30.3: The architecture of a peer: a Service Layer comprising service consumers/providers (e.g., Doc, Auth), supporting services, and service management (discovery, registry) connected through stubs; a System Access Layer containing the SOAP-Processor; an Overlay Layer (e.g., Chord, GIA); and a Communication Layer (e.g., IPv6).
– Service Consumer/Provider. This part summarizes all service consumers
and providers. The connection to the underlying SOAP-Processor is
handled through stubs that can be generated from the WSDL description
of a service. In this part all existing and future services will appear.
– Service Management. The Service Management includes all functions that
are necessary to find, publish and bind services. Therefore, the Service
Management makes use of the ServiceNet called service discovery, which
was mentioned in the previous section. If a service is bound, the manage-
ment has to inform the overlay layer so that a proper initialization can
take place.
– Supporting Services. These are additional services that support the service
developer by simplifying recurring tasks, for example the handling of
transactions. This would involve session handling and possibly calling
roll-back functions, compensation handlers, etc. In such cases, existing
specifications like WS-AtomicTransaction [377]
can be used.
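The SOAP-Processor's translation step can be illustrated with a minimal sketch; the operation name, parameters, and hand-rolled XML are our assumptions (a real implementation would use a full SOAP stack rather than string concatenation).

```python
# Illustrative sketch of the SOAP-Processor's job: turn a programming-
# language object (here, a dict of parameters) into a transport- and
# language-independent SOAP message.

def to_soap(operation: str, params: dict) -> str:
    """Wrap an operation call into a minimal SOAP envelope."""
    body = "".join(f"<{k}>{v}</{k}>" for k, v in params.items())
    return ("<soap:Envelope xmlns:soap="
            "'http://schemas.xmlsoap.org/soap/envelope/'>"
            f"<soap:Body><{operation}>{body}</{operation}></soap:Body>"
            "</soap:Envelope>")

# A hypothetical document-service call marshalled for the wire:
msg = to_soap("findDocument", {"keyword": "energy"})
```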
30.3 Security, Robustness, and Privacy Challenges
Besides traditional issues like end-to-end security and service robustness, new
challenges arise from using a Peer-to-Peer network:
Partial encryption. In contrast to traditional networks, where routing infor-
mation and data are clearly separated into header and payload, Peer-to-Peer
systems sometimes mix the two. For example, performing a search in GIA [113]
means that every involved node has to search in its local store of objects. If
we could assume that all nodes store only their own objects, then there would
not be a problem. But if we think about replication, then object encryption
and integrity become problematic. If we encrypt the whole object like the
payload in typical network protocols, nodes will not be able to search for these
objects. Hence, we have to encrypt the object in such a way that essential
information is accessible for routing and searching.
Fig. 30.4: The signed container data structure: a SESAMContainer holds a SESAMObject with its payload and one or more attached signatures.
models multiple issuers may sign the trust data as is done in PGP. Besides
that, alternative trust models, like reputation or recommendation systems,
require multiple signatures attached to a single item of trust information.
Having analyzed the robustness and security requirements for our framework
we now focus on persistent signatures, privacy-aware data handling and trust
models.
Trust Models
Using the data structure introduced in the last section (Figure 30.4) we
can ensure the integrity of data objects. To enable applications for a dis-
tributed marketplace where electronic contracting is supported, authenticity
of a peer’s identity is a major issue.
While the integrity of data objects is provided by signatures, peer identity
will be provided by digital certificates. Each signature contains an attribute
which links a certificate to the public key used. This attribute must itself
be included in the generation of the signature; otherwise an attacker would
be able to exchange the certificate used. A certificate itself contains one or
more signatures. This mechanism is suitable for building any trust model from simple
certificate lists up to more complex certificate trees such as in reputation
systems.
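The point that the certificate link must be part of the signed input can be demonstrated with a small sketch; here an HMAC stands in for a real public-key signature, and all names are illustrative.

```python
import hashlib
import hmac

# Sketch: bind the certificate link into the signature itself, so that
# swapping the certificate afterwards invalidates the signature.
# (HMAC stands in for a real public-key signature scheme.)

def sign(data: bytes, cert_fingerprint: bytes, key: bytes) -> bytes:
    # The certificate fingerprint is part of the signed input.
    return hmac.new(key, data + cert_fingerprint, hashlib.sha256).digest()

def verify(data: bytes, cert_fingerprint: bytes, key: bytes, sig: bytes) -> bool:
    return hmac.compare_digest(sign(data, cert_fingerprint, key), sig)

sig = sign(b"offer", b"cert-A", b"secret")
verify(b"offer", b"cert-A", b"secret", sig)      # valid
verify(b"offer", b"cert-B", b"secret", sig)      # certificate swapped: invalid
```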
The diversity of applications in a Peer-to-Peer marketplace requires
different trust models for verifying identities. Therefore, we develop trust model
plugins which implement various trust models. First, we implement common
trust models such as Certificate Authority based (X.509 [629]) or distributed
models (PGP [645]). Later, plugins to support reputation and recommenda-
tion systems are added. All plugins are used by the trust component which
offers a common interface for checking identities. The certificate and the cho-
sen trust model are the input parameters of the verification method of the
trust component.
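The plugin idea can be sketched as a common verification interface behind which the individual trust models sit; class and method names below are our assumptions, not the project's actual API.

```python
# Sketch of the trust component with pluggable trust models: every model
# implements one verification interface, and the component dispatches on
# the model chosen by the caller.

class TrustModel:
    def verify(self, certificate) -> bool:
        raise NotImplementedError

class CertificateListModel(TrustModel):
    """Simplest model: a flat list of trusted certificates."""
    def __init__(self, trusted):
        self.trusted = set(trusted)
    def verify(self, certificate) -> bool:
        return certificate in self.trusted

class TrustComponent:
    """Common interface for checking identities via registered plugins."""
    def __init__(self):
        self.plugins = {}
    def register(self, name: str, plugin: TrustModel):
        self.plugins[name] = plugin
    def check(self, certificate, model: str) -> bool:
        # Certificate and chosen trust model are the inputs to verification.
        return self.plugins[model].verify(certificate)

tc = TrustComponent()
tc.register("list", CertificateListModel({"cert-A"}))
tc.check("cert-A", "list")   # accepted
tc.check("cert-B", "list")   # rejected
```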
30.4 Summary
31. Security-Related Issues in Peer-to-Peer Networks
31.1 Introduction
In the following we specify the breaches of security that may take place on
the application layer of a Peer-to-Peer network. This presupposes a direct act
by a (malicious) user upon a Peer-to-Peer network through the application
interface which enables direct user-to-network interaction. This can happen
without a substantial effort by the user. Furthermore, we introduce the notion
of a “malicious node”, meaning a node in the Peer-to-Peer network (and the user
behind it) that uses the network improperly, whether deliberately or not.
Malicious nodes on the application layer may give incorrect responses to
requests. They might, for example, falsely report their bandwidth capacity
so that less traffic is routed over their own node, or they might (in
the case of file sharing) freeride in the network [11].
These problems are usually met with incentives, where peers are rewarded
for good behavior and penalized for bad. Virtual money for file exchanges was
introduced by the now defunct Mojo Nation project [425]. Credits are given
to peers in eMule [589] as a reward for files being uploaded and used as an
entitlement for raising one’s priority in the upload queue of another peer-
client.
Following are security issues that arise from different applications of Peer-
to-Peer networks.
The file storage service provides the storage of files at more than one loca-
tion in a Peer-to-Peer network, using a unique, e.g., alphanumeric,
character sequence for each file, called a “handle”. Contrary to file-sharing
mechanisms, where users look for specific content, searches in the
file storage system can be accomplished only with the help of these specific
handles, which are mapped to a unique peer or file in the network. Any peer in
the network can locate a file using its handle, and a peer with write permission
can update the file as well. In order to get a handle to find a file, a peer has
Nodes in a distributed hash table (DHT) lookup system build their routing
tables by consulting other nodes. A malicious node here may corrupt the
routing tables of others by sending invalid updates. In this case not only could
malicious nodes direct queries to invalid nodes, but well-behaving nodes might
give wrong routing information as well, due to invalid routing updates of their
own. A more sophisticated attack would be to supply nodes that contradict
general routing criteria (high latency, low bandwidth). CAN
[504] uses measurements of RTT (round-trip time) to favor lower-latency
paths in routing updates. This mechanism may equally be abused to choose
nodes with high-latency paths instead.
Pastry [527] makes these invalid routing tables easier to detect by forcing
entries to have a correct prefix. This way malicious nodes cannot produce
totally random routing tables without being easily detected. Another way of
ensuring that a given node in an update table is authentic, is for every node
to ascertain that this node is actually reachable [565]. This method would
however cause enormous costs in bandwidth and delay.
31.3.3 Partition
The Sybil Attack is a name for a node or any other single entity on a network
presenting multiple identities to other nodes. Other nodes might thus believe
they are interacting with several distinct nodes, whereas in fact they are
addressing only one entity. This would mean that, given particularly huge
resources (bandwidth, disk space, computing power) some “deceiving” peer
could gain control of a large part of a Peer-to-Peer network, thus undermining
its redundancy [177], which is one of its basic properties. However, considering
a Peer-to-Peer network to be made up of nodes at the “edges of a network”
[28], one could assume rather moderate resources at the disposition of each
peer. In a network of several hundred thousand nodes, such a peer could
presumably cause rather little damage.
In order to prevent the generation of multiple identities in a Peer-to-Peer
network, computational puzzles, like in HashCash [41], might be used. This
is an old defense against DoS (denial-of-service) attacks. Before
joining the network, a peer has to solve a computational problem and is
thus forced to spend CPU cycles, needing more time to join than usual. In
the case of an attacker with huge resources, however, this would at best slow
down the generation of false identities and the joining of these “virtual”
nodes. In structured Peer-to-Peer networks like Chord,
CFS, and Pastry where the network determines different tasks that certain
peers have to do, hashing of IP addresses is performed partly to establish
some kind of identity of individual peers. This would surely complicate and
prolong the process of a Sybil Attack.
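A HashCash-style admission puzzle can be sketched in a few lines: the joining peer must find a nonce whose hash has a number of leading zero bits, which is cheap to verify but costly to produce in bulk. Difficulty and identity encoding below are illustrative choices.

```python
import hashlib
from itertools import count

# HashCash-style puzzle sketch: slow down mass identity creation by
# requiring proof of work before a peer may join.

def solve(identity: bytes, difficulty_hex_zeros: int = 2) -> int:
    """Find a nonce so that SHA-256(identity || nonce) starts with zeros."""
    target = "0" * difficulty_hex_zeros
    for nonce in count():
        digest = hashlib.sha256(identity + str(nonce).encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def check(identity: bytes, nonce: int, difficulty_hex_zeros: int = 2) -> bool:
    """Verification costs a single hash, unlike solving."""
    digest = hashlib.sha256(identity + str(nonce).encode()).hexdigest()
    return digest.startswith("0" * difficulty_hex_zeros)

nonce = solve(b"peer-42")   # ~256 hash attempts on average at this difficulty
check(b"peer-42", nonce)    # a single hash to verify
```

As the text notes, this only slows a well-resourced attacker down; raising the difficulty raises the cost for honest joiners as well.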
The most “intrusive” method of binding an identity directly to a node
would be, of course, to provide a distinctive identification for each computing
unit that is taking part in a network. The commercial platform EMBASSY
[292] provides cryptographic keys embedded inside of every hardware device.
Of course, one has to trust implicitly that these devices carry the embedded
key (and not an arbitrary one) and that users actually use them as they are
supposed to. In the web-of-trust concept of Pretty Good Privacy (PGP) [644],
established identities may vouch for “newcomers” in the network, but this
may also be misused by a malicious node to subvert the chain of trust.
As a last resort for keeping all the possible weaknesses of the aforementioned
solutions somewhat under control, one can rely on certification authorities
and similar trusted agencies.
As stated in [171] “one can have, some claim, as many electronic personas
as one has time and energy to create”. Primarily this energy can be a decisive
constraining factor for success in such an attempt.
31.4 Security Concepts for Selected Systems
In the following sections we will introduce some solutions which deal with
security in the Peer-to-Peer area in quite different ways. The first two address security
31.4.1 Groove
Authentication
Workspace Security
With “relay services”, users may exchange workspace changes with a time
gap between sending and receiving. When a user goes offline, the relay service
“notices” it, saves the changes, and passes them on to a currently offline
partner peer of the same workspace once that peer comes online again.
Groove software randomly chooses a relay server from a list it contains and
registers a new account that a user previously set up. From now on, the soft-
ware uses only this relay server and sends, along with its contact information,
also the URL of its relay server in order for other Groove peers to acquire
the full communication path. More precisely, relay-to-workspace
communication takes place with a secret key established between the two,
but due to end-to-end encryption the relay service cannot access the data
inside delta or instant messages, only the header information needed to
locate the other peer.
Furthermore, relay servers can enclose the messages in an HTTP-compatible
format and use ports 80 or 443 (usually configured open) to circumvent
firewalls that may separate two peers collaborating over Groove. This
capability may well arouse suspicion among managers or network
administrators of a company's LAN who may not approve of arbitrary data
being exchanged across the company's boundaries.
Conclusion
The SixFour Peer-to-Peer system [566] obtained its name from the date of the
Tiananmen Square protests in Beijing in 1989. The developers intended this
system for potential users who live in oppressive countries with limited access
to free press. The basics of the system enable a peer to establish an encrypted
connection to a trusted peer outside of his country and then have this peer
further forward any other requests into the Internet. The aim is to enable
peers to get data confidentially from the Internet thus evading censorship.
A peer wanting to connect to the SixFour network would already have
to know the address of the trusted peer and his RSA public key. The au-
thenticity of the key would have to be checked at the website of the SixFour
developers, hacktivismo.com. For this purpose the peers would have to know
the appropriate signature of Hacktivismo [268].
The trusted peer simply has to forward the requests of peers that “hang”
on them so they can access any TCP- or UDP-based service on the Internet,
provided the trusted peer itself has access to these services, too. One of the
important criteria for becoming a trusted peer is to provide a permanent
IP-connection (permanent IPv4 address). The most reasonable entities to
become trusted peers (according to the idea of developers) seem to be, e.g.,
human rights organizations or NGOs promoting democratic values.
The routing inside of the SixFour network is anonymous if it occurs over
more than 3 nodes. That is, every peer in the routing protocol knows only the
RoutingID, the source, and the target of the packet. The routing topology
of the SixFour network is like Gnutella’s, in that every peer is connected
with several others and floods his own requests or routes other requests at
random to his neighbors. Duplicate requests are rejected according to their
RoutingID. The node-to-node encryption is a classical SSL over port 443 and
the end-to-end encryption is RSA-based.
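The Gnutella-like forwarding with duplicate rejection by RoutingID can be sketched as follows; the flooding variant is shown (the text also mentions random forwarding), and all names are illustrative rather than SixFour's actual protocol.

```python
# Sketch of Gnutella-style forwarding with duplicate rejection by
# RoutingID, as described for SixFour's routing layer.

class Node:
    def __init__(self, name: str):
        self.name = name
        self.neighbors = []
        self.seen = set()   # RoutingIDs already handled

    def receive(self, routing_id: str, payload: str, log: list):
        if routing_id in self.seen:
            return          # duplicate request: rejected
        self.seen.add(routing_id)
        log.append(self.name)
        for n in self.neighbors:   # flood the request to all neighbors
            n.receive(routing_id, payload, log)

a, b, c = Node("A"), Node("B"), Node("C")
a.neighbors, b.neighbors, c.neighbors = [b, c], [c], [b]
log = []
a.receive("req-1", "GET page", log)
# Every node handles the request exactly once despite the cycle B<->C.
```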
The primary shortcoming of this system is the question of how the rela-
tionship between a peer inside a “censored” part of the Internet and a trusted
peer “outside” is established. How does a potential peer inside a
“censored” part of the Internet come to know which IP-address is an address
of a trusted peer, through which he could make requests and not be afraid
of possible reprimand from the “censor”? It seems that only some kind of
out-of-band information gathering would help.
Furthermore, it is not clear how these trusted peers could be protected
from possible attacks from entities that would have an interest in disrupt-
ing the SixFour network (like those very countries whose censorship SixFour
tries to break). The possible attackers on trusted peers presumably have the
same or even better know-how than the developers of any free software on
the market and possibly strong financial backing as well. An additional cri-
terion before accepting entities acting as trusted peers could be a display of
security measures that these peers employ locally and set requirements for.
31.4 Security Concepts for Selected Systems 541
Otherwise one would never know which trusted peer could be compromised,
thus compromising the whole network.
Before becoming a serious application with wide acceptance, SixFour
would have to establish a wider community of users and demonstrate longer
stable operation (without security flaws or compromised peers). It of-
fers the basic functionality of confidential data exchange and anonymity in
the network, but it does not tackle the question of key distribution at all.
Although still at an early stage, SixFour is a
fresh example of what Peer-to-Peer systems are good for beyond simple file-
sharing.
31.4.3 Freenet
Security Architecture
Every Freenet peer runs a node that provides some storage space for the net-
work. When adding a file, a user assigns a globally unique identifier (GUID)
to that file. The GUID will be used when retrieving the file as well. GUIDs
are calculated using the SHA-1 hash and come in two kinds: the content-hash
key (CHK) and the signed-subspace key (SSK). A CHK is the SHA-1 hash of
a file's contents. An SSK is used to set up a personal namespace that
anyone (who has the keys) can read, but only its owner can write to. A peer
first generates an arbitrary public-private key pair and chooses a text description
of the subspace. One then calculates the SSK by hashing both the public part
of the key and the descriptive string, concatenating them, and hashing them
again.
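Following that description literally, the SSK derivation can be sketched with `hashlib`; this is our reading of the text, not Freenet's exact byte-level format.

```python
import hashlib

# Sketch of Freenet's GUID keys as described in the text: a CHK is the
# SHA-1 hash of the file contents; an SSK hashes the public key and the
# description independently, concatenates the hashes, and hashes again.

def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def content_hash_key(content: bytes) -> bytes:
    """CHK: hash of the file's contents."""
    return sha1(content)

def signed_subspace_key(public_key: bytes, description: bytes) -> bytes:
    """SSK: hash(hash(public key) || hash(description))."""
    return sha1(sha1(public_key) + sha1(description))

# Anyone holding the public key and the descriptive string can recreate
# the SSK and hence locate files in that subspace:
ssk = signed_subspace_key(b"pubkey-bytes", b"my/subspace")
```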
In order to retrieve the file, one needs the public key and the descriptive
string (to recreate the SSK). To add to, or update the file, one would need
the private part of the key to be able to compute the valid signature, which
the nodes storing the file would check and accept, or reject if false. The SSK
key can be used to store an indirect file containing a pointer to a CHK: a file
is stored under its CHK, which is in turn stored as an indirect file under the
SSK. Given the SSK, the original file is retrieved in two steps, and the origin
of the file is obscured one step further.
One of the most distinctive characteristics of Freenet is that the management
of a node and the management of that node's storage are somewhat
decoupled. If a node gets a query, it first checks its own store. The peculiarity
here is that the semantics of the content of the store itself is not comprehen-
sible to humans; the comparison between the request and a possible file in
the store has to be computed.
If the request is not satisfied, the node forwards the request to the node
with the closest key to the one requested. This information will be gathered
at one specific node through time by having many requests running through
it. When the request is successful, each node which passed the request now
passes the file back and creates an entry in its routing table binding the data
holder with the requested key. On the way back the nodes might cache the
file at their stores. This way subsequent searches find the requested file faster.
For requesting and inserting a file, Freenet offers the possibility of adding
a mix-style “pre-routing” of messages [124]. This way the messages would
be encrypted by a succession of public keys which establish a route that
this message will follow. When the message reaches the end-point of this
pre-routing path it is injected into the Freenet network and thus the true
originator of the message is “obliterated”.
Inserting a file works similarly to requesting one. A user assigns a GUID and
sends an insert message first to his own node with the new key and a TTL
value. The insert might fail because the same file is already in the network
(CHK collision) or because a different file with the same description exists
(SSK collision). Checking for such a file amounts to a lookup of whether
the key already exists; if not, the node searches for the numerically closest
key and forwards the insert message to the corresponding node. If the TTL
expires without a collision, the final node sends a message that the insert
may be performed. On the way to the final node the file is cached by every
intermediary node, the data is verified against the GUID, and a routing entry
is made pointing to the final node as the data holder.
The makers of Freenet recommend encrypting all data before inserting
it into the network. The network is oblivious to this encryption, as it only
forwards already encrypted bits, so the inserters have to distribute
the secret keys as well as the corresponding GUIDs directly to the end users.
This is performed through some out-of-band means such as personal com-
munication. The same method is used when adding a new node to Freenet:
a user wishing to join sends his new public key and a physical address to a
node whose physical address he already knows.
In order not to let a single node decide which key to assign to a joining
node (and thereby allow unilateral access to certain data), a chain of hashes
of each node's seed, XOR'ed with the other seeds down the path, and hashes
thereof (called “commitments”) is produced. This provides the means for
every node to check whether the other nodes revealed their seeds truthfully.
The key for the joining node is assigned as the XOR of all the seeds.
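The commit-then-reveal idea behind this key assignment can be sketched as follows; XOR-ing fixed-length hashes of the seeds (rather than the raw seeds) and the function names are simplifications of ours.

```python
import hashlib

# Sketch of joint key assignment by commit-then-reveal: each node first
# publishes a hash commitment to its seed; after all seeds are revealed,
# everyone checks the commitments and XORs the seeds into the key, so no
# single node can steer the result.

def commit(seed: bytes) -> bytes:
    """Binding commitment to a seed."""
    return hashlib.sha256(seed).digest()

def assign_key(commitments, revealed_seeds) -> int:
    """Check every reveal against its commitment, then XOR the seeds."""
    key = 0
    for c, seed in zip(commitments, revealed_seeds):
        if commit(seed) != c:
            raise ValueError("a node did not reveal its seed truthfully")
        # XOR fixed-length hashes of the seeds (a simplification).
        key ^= int.from_bytes(hashlib.sha256(seed).digest()[:8], "big")
    return key

seeds = [b"seed-A", b"seed-B", b"seed-C"]
commitments = [commit(s) for s in seeds]
assign_key(commitments, seeds)   # every node computes the same key
```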
A still open issue is the adequacy of searching the Freenet network for
relevant keys. There is still no effective way to route searches, leaving the dissemina-
tion of keys solely to out-of-band means. One possible solution is to build
public subspaces for indirect keyword files. When inserting files, one could
insert several indirect files corresponding to search keywords for the original
file, so an indirect file would have pointers to more than one file (more than
one hashed key as content).
The storage management determines how long a file will be kept based on
the popularity of the file, measured as the frequency of requests for it.
Files that are seldom requested are deleted when a new file has to be inserted.
Even when a file is deleted at one node, another one may still have a copy
of it. The node that deleted it will still have an entry in its routing
table pointing to the original data holder, as routing table entries are
much smaller than data and are kept longer.
Following are several more solutions that have as a priority the anonymization
of a connection, transaction, data stream, or communication between two
peers in a network. These systems do not protect specific data from access
by an unauthorized peer or intrusions into peer-machines. They try to make
facets of communication between two peers invisible or untraceable
by a potential eavesdropper. Usually the high connectivity of nodes in a
Peer-to-Peer network is used to obscure the path an arbitrary message has
taken.
Tarzan
network are a node running an application and another node running a NAT
that forwards the traffic to an end destination. Tarzan performs a nested
encryption per hop and encapsulates it in a UDP packet. These encryptions
are performed in the sequence of the nodes that will be passed through, so
the biggest share of encryptions is situated at the node seeking anonymity.
The method is similar to Onion routing [582] with the difference that the
nodes are chosen randomly and dynamically instead of using a fixed set. In
order to choose the relays, Tarzan uses the Chord lookup algorithm (relays
are in a Chord ring) with a random lookup key. When that key’s successor
is found, it responds with its IP address and public key. Tarzan thus claims
to provide anonymity even in the presence of malicious Tarzan participants,
inquisitive servers on the Internet, and observers with some limited
capability to see traffic on some links.
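The nested per-hop encryption can be sketched in a few lines; the XOR cipher stands in for real per-hop keys and ciphers, so this only illustrates the layering, not Tarzan's actual cryptography.

```python
# Sketch of nested per-hop ("onion-style") encryption: the sender wraps
# the message once per relay; each relay strips exactly one layer.
# (XOR stands in for a real per-hop cipher.)

def xor(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def wrap(message: bytes, hop_keys) -> bytes:
    """Sender: apply the layers so the last hop's layer is innermost."""
    for key in reversed(hop_keys):
        message = xor(message, key)
    return message

def route(packet: bytes, hop_keys) -> bytes:
    """Relays: each hop removes its own layer in path order."""
    for key in hop_keys:
        packet = xor(packet, key)
    return packet

hop_keys = [b"hop1", b"hop2", b"hop3"]
route(wrap(b"anonymous request", hop_keys), hop_keys)  # original message
```

Note that the bulk of the encryption work falls on the sender, matching the text's observation that the node seeking anonymity carries the biggest share of the encryptions.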
Publius
Onion Routing
Crowds
Users in this system [508] submit their requests through a “crowd” - a group
of Web surfers running the Crowds software. Crowds users forward HTTP
requests to another randomly selected member of their crowd. A path is
formed as the request crosses the network, and each Crowds member encrypts
the request for the next member on the path, so the path is not predetermined
when the request is submitted to the network. Thus neither the end server
nor a forwarding Crowds member can know where a request really originated.
By using symmetric ciphers, Crowds claims to perform better than mix-based
solutions. A drawback of Crowds is that each peer on the path of the data to
the intended destination can “see” the plain text.
31.5 Conclusion
32. Accounting in Peer-to-Peer-Systems
This section outlines the different design parameters for building an account-
ing system that meets the described requirements. The problem can be struc-
tured into two main parts: information collection covers how and in which
form information is collected; information storage concerns where the
collected information can be stored.
In Peer-to-Peer systems there are only a few options where accounting infor-
mation could be collected. Assuming real Peer-to-Peer connections (without
a third peer in the middle as a “trusted party”), the information can only
be collected at the service provider peer, at the service receiver peer, or at
both peers. This collection process usually consists of a metering process and
an evaluation process. The metering process measures the used bandwidth
(the amount of data received/sent), time (the duration of the service), or
collects some service specific signals like “file complete”. The evaluation pro-
cess interprets this information to generate accounting events. These events
determine when an accounting record is created.
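The split between metering and evaluation can be illustrated with a small sketch. The class names, the "file_complete" event, and the record fields are illustrative assumptions based on the description above.

```python
from dataclasses import dataclass

@dataclass
class AccountingRecord:
    peer_id: str
    event: str       # e.g. "file_complete"
    mbytes: float    # metered amount of data

class Meter:
    """Metering process: accumulates raw observations for one transfer."""
    def __init__(self, peer_id: str):
        self.peer_id = peer_id
        self.mbytes = 0.0

    def observe(self, chunk_mb: float):
        self.mbytes += chunk_mb

def evaluate(meter: Meter, expected_mb: float):
    """Evaluation process: interprets the metered data and generates an
    accounting event; here a record is created once the transfer is complete."""
    if meter.mbytes >= expected_mb:
        return [AccountingRecord(meter.peer_id, "file_complete", meter.mbytes)]
    return []

m = Meter("peerA")
for _ in range(5):
    m.observe(1.0)          # five 1-MB chunks
records = evaluate(m, expected_mb=5.0)
assert records and records[0].event == "file_complete"
```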
Accounting records contain the accounting information and can take sev-
eral forms.
Plain Numbers
The simplest form is plain numbers. For every peer there exists an account
balance stored somewhere in the Peer-to-Peer system. For example, for each
MByte uploaded the peer’s account balance is increased by one. Plain num-
bers can be changed very easily; therefore, this kind of accounting record is
especially vulnerable to fraud. Nevertheless, this form of information con-
sumes very little space and bandwidth when it is transferred between peers.
Receipts
Receipts are documents of fixed form that can contain all kinds of accounting
information, e.g. transaction partners, transaction object, metering informa-
tion, and an evaluation of it. This evaluation could state how this receipt
modifies the peer’s account balance. This part of the information would be
similar to plain numbers. In comparison with plain numbers, the additional
information of a receipt adds some trustworthiness to the accounting
information.
Signed Receipts
Tokens
In the context of this chapter the term token is used for any kind of issued
document. An issuer issues a specific number of these documents (tokens).
Thus, the number of tokens available in a Peer-to-Peer system can be limited.
This results in a specific characteristic that is otherwise hard to achieve -
through issuing a limited number of tokens they become a scarce resource.
Accordingly, tokens can represent other scarce resources. However, they are
not as easy to handle as normal receipts, because two major problems
must be addressed: forgery and double spending, both of which have to be
avoided. With respect to double spending, a token must be clearly identifiable.
Thus, it must contain a unique identification. Also, there must be a
32.3 A Classification of P2P Accounting Schemes 551
Proof of Work
The storage location determines how easy it is for a peer to manipulate
accounting records and defraud the system. There are three basic alternatives
for the account location: accounts
can be located locally at the peer that collects the data, at a central server,
or remotely at a peer other than the collecting peer. Centrally held accounts
obviously are contrary to the Peer-to-Peer paradigm. For every transaction
communication with the central entity would be necessary to transfer the
record. Therefore, this solution is out of the question. The advantages and
disadvantages of the two other alternatives shall now be elaborated.
Local Accounts
Storing accounting records at the place where they accrue has the obvious
advantage of reduced traffic, because the records do not need to be sent to
the account holder. However, using local accounts poses an important trust
problem. Using plain numbers or self-signed receipts to store information en-
ables users to easily change the contained information in order to defraud the
accounting system. The appearance of KaZaA Lite as a competitor to KaZaA
is an example of this behavior [344]. Therefore, receipts should be signed
either by the transaction partner or a third party; alternatively, tokens should
be used. Accordingly, accounting records need to be transferred between the
transaction partners. Therefore, in terms of bandwidth usage, storing
accounting records locally is only advantageous in comparison to other storage
locations if the transaction partner needs to get the records for some reason,
for example because both transaction partners have to agree on the contained
accounting information. Further advantages are that users have immediate
control over the collected accounting records. No redundancy need be built
into the system, because a peer’s accounting records are not needed when it
is offline. Also, users are themselves responsible for doing a data backup.
Remote Accounts
The alternative is to store accounting records at third peers. Using
third peers, hence separating account holders from account owners, clearly
makes it more difficult for any one peer to fraudulently manipulate accounting
records. Accordingly, depending on the Peer-to-Peer application’s
requirements, special mechanisms to ensure information integrity, such as
signing, might not be needed.
However, using remote accounts requires the exchange of more adminis-
trative messages between peers. This stems from the need for redundancy.
Because the account holder is not always available, several account holders
per peer, each holding a replica of the account, are required. All replications
of an account need to be kept consistent. Therefore, mechanisms to detect
potential inconsistencies as well as mechanisms for determining the correct
account state are required.
32.4 Proposed Accounting Schemes 553
Fig. 32.1: Classification of accounting options. Numbers/receipt-based
schemes use local accounts (KaZaA, eMule, SeAl, Swift), central accounts
(GridBank, RADIUS, CDNs), or remote accounts (KARMA). Token-based
schemes are classified by issuer (a central bank: micropayments, PPay;
decentralized token-based accounting; each participant: stamps) and by
account location (local: grid accounting; central bank: central
micropayments; remote accounts).
Currently the two most widely used Peer-to-Peer accounting mechanisms are
KaZaA’s participation level [342] and eMule’s credit system [192]. Both
systems account for the amount of data uploaded and downloaded and store
the collected information locally. KaZaA’s system uses the ratio of uploads
to downloads (measured in amount of data transferred) to calculate a peer’s
actual maximal allowed download speed. The higher this ratio, the higher
the maximal allowed download speed. This is a typical incentive mechanism
for file sharing systems. However, KaZaA’s system is easy to cheat because the
accounting information is stored locally. In fact, in the KaZaA clone KaZaA
Lite [344] the participation level is removed. In contrast, eMule’s credit
system is used to determine a requestor’s position in the provider’s download
queue. The position is determined by the amount of data the requestor
uploaded to the provider. This system has the obvious advantage that it cannot
be cheated: the provider keeps its own accounting records, and these only
influence its own behavior. The disadvantage is that the system only accounts
for local observations. A peer could have uploaded much more to the system
than it downloaded; however, if it downloads from a peer to which it did not
upload before, it will get a bad position in the provider’s download queue.
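eMule's local-credit queueing can be illustrated in a few lines. This is a deliberately simplified sketch; the real client uses a more involved credit modifier, and the peer names and credit values here are made up.

```python
def queue_position(requests, credits):
    """Order waiting requestors by the amount of data (in MB) they have
    previously uploaded to this provider: higher credit is served first."""
    return sorted(requests, key=lambda peer: credits.get(peer, 0.0), reverse=True)

# Locally stored credits: how much each requestor uploaded to *this* provider.
credits = {"peerA": 250.0, "peerB": 0.0, "peerC": 40.0}
queue = queue_position(["peerB", "peerC", "peerA"], credits)
assert queue == ["peerA", "peerC", "peerB"]
```

Because the credits only reorder the provider's own queue, falsifying them would only hurt the provider itself, which is why the scheme cannot be cheated.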
Another system that uses local accounts to store plain numbers as accounting
information is Swift [584]. In contrast to both systems mentioned before, it
is not yet used in practice. Swift basically is a behavior model for
Peer-to-Peer file sharing to support fair large scale distribution of files in
which downloads are fast. Each peer maintains a credit for every other peer
it is connected to. A peer will only upload to a peer with a positive credit
balance. Because the accounting data only affects the local peer behavior,
peers have no incentive to falsify the collected information.
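Swift's per-neighbor credit rule can be sketched as follows. The class name, the small initial credit used to bootstrap a fresh connection, and the update methods are assumptions for illustration, not part of the Swift specification.

```python
class SwiftPeer:
    """Simplified Swift-style behavior: keep a credit balance per neighbor
    and upload only to neighbors with a positive balance."""

    def __init__(self, initial_credit: float = 1.0):
        self.credits = {}
        self.initial = initial_credit  # assumed bootstrap credit

    def credit(self, neighbor: str) -> float:
        return self.credits.setdefault(neighbor, self.initial)

    def on_received(self, neighbor: str, mb: float):
        # The neighbor uploaded data to us, so its credit with us rises.
        self.credits[neighbor] = self.credit(neighbor) + mb

    def on_uploaded(self, neighbor: str, mb: float):
        # We uploaded data to the neighbor, so its credit with us falls.
        self.credits[neighbor] = self.credit(neighbor) - mb

    def will_upload_to(self, neighbor: str) -> bool:
        return self.credit(neighbor) > 0

p = SwiftPeer()
p.on_received("n1", 2.0)   # credit: 1.0 + 2.0 = 3.0
p.on_uploaded("n1", 3.5)   # credit: 3.0 - 3.5 = -0.5
assert not p.will_upload_to("n1")
```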
A system taking into account a peer’s actions in the overall system is
Karma [603]. Karma stores for every peer in the system a value that
represents its balance of uploads against downloads. This balance is stored at
remote peers. These remote peers are called a peer’s bank set. The bank
set consists of multiple peers, for redundancy reasons. The balance of a peer
must not be lost. Accordingly, a bank set is rather large - a suggested size
is 64 peers. For every transaction the bank sets of the provider and receiver
peer communicate to adjust the peers’ balance according to the transaction
value. Further, Karma includes the concept of an epoch. At the beginning
of each epoch every peer’s balance is adjusted accordingly in order to avoid
inflation.
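The bank-set idea can be sketched with a toy model. The replica-update and epoch-scaling logic shown here is an illustration of the description above; the holder names, initial balance, scaling factor, and the omission of consistency and majority-vote mechanisms are all assumptions.

```python
class BankSet:
    """Toy Karma-style bank set: a peer's balance replicated across several
    account-holder peers. All replicas are updated per transaction, and at
    each epoch boundary every balance is rescaled to counter inflation."""

    def __init__(self, owner: str, holders, initial: float = 100.0):
        self.owner = owner
        self.replicas = {h: initial for h in holders}

    def apply(self, delta: float):
        for h in self.replicas:
            self.replicas[h] += delta

    def epoch_adjust(self, factor: float):
        for h in self.replicas:
            self.replicas[h] *= factor

    def balance(self) -> float:
        # Majority/consistency checks are omitted in this sketch.
        return next(iter(self.replicas.values()))

provider = BankSet("peerP", [f"h{i}" for i in range(4)])
consumer = BankSet("peerC", [f"h{i}" for i in range(4, 8)])
consumer.apply(-5.0)   # consumer pays 5 karma for a 5-MB download
provider.apply(+5.0)
assert provider.balance() == 105.0 and consumer.balance() == 95.0
```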
In [14] another system using remote accounts is presented. So-called
accountants store a peer’s balance. Like in Karma, the accountants are third
peers. To ensure that the balance is not lost a set of accountants is required
for each peer. With every transaction the balance of the two transaction part-
ners is updated to the new value. A non-mediated and a mediated settlement
protocol are presented.
There are other systems known to do accounting using numbers. However,
these systems are not compliant with the Peer-to-Peer paradigm, as they
use a central server to do the accounting. Examples are the accounting
SeAl [453] is a Peer-to-Peer accounting system that uses locally stored
receipts. SeAl works based on favors. For every transaction a receipt is
created and stored locally at the receiver and provider. As a result of the
transaction, the receiver owes the provider a favor. A favor can be paid back
by the receiver by providing a service to the provider. Also the provider can
use a favor by redirecting service requests of other peers to the receiver. Fur-
thermore, peers can publish Blacklist Reports about peers behaving against
the system’s rules. For each service request a score is calculated (using paid
back favors and Blacklist Reports) that influences the request’s position in
the provider’s request queue. Accordingly, not all accounting data is stored
locally; for Blacklist Reports, however, no accounts exist for storage.
One class of token-based accounting systems are micropayment systems that
use tokens as a micro currency. For payment, these tokens are transferred
between users. (Micropayment systems that just modify centrally held bank
accounts on request belong to the plain-numbers-based systems.) All
micropayment systems use a central broker or bank; thus, they are not
appropriate for Peer-to-Peer systems. A micropayment system tailored to
Peer-to-Peer systems is presented in [636]. It relieves the broker of some
tasks, which are taken over by the peers of the system. As a result the broker
can even go off-line for short time periods and the system can still continue
to operate.
Mojo Nation [441] was one of the earliest Peer-to-Peer systems to use a
payment protocol. Users had to use a virtual currency called Mojos to obtain
a service from another peer. Mojo Nation still required a centralized trusted
third party to issue the Mojos and to resolve double-spending issues.
A system using stamps for peers’ “evidence of participation” is presented
in [431]. Every peer issues personalized stamps and trades these with other
peers. If peer A requests a service from peer B, peer A has to pay a specific
amount of peer B’s stamps back to B. There is no limit on how many stamps
a peer may issue. However, if a peer issues too many stamps in comparison to its
offered services the stamps will devalue. Thus, the peer will have difficulty
obtaining other stamps, as rational nodes will not wish to purchase its stamps.
This way the stamps protocol combines a virtual currency and reputation.
32.5.1 Prerequisites
The token-based accounting system assumes that users can clearly be
identified through a permanent id (e.g., a private/public key pair proven
through a certificate issued by a certification authority). Depending on
the application scenario, alternative approaches like [139] are also applica-
ble. Apart from the certification authority it is intended to avoid any central
element.
Further, we assume the use of a reputation mechanism in the Peer-to-Peer
system. This system is used to publish fraudulent behavior that technical
mechanisms cannot detect. The reputation mechanism assigns a reputation
value to each peer that represents the trustworthiness of the peer. A possible
solution is presented e.g. in [333].
32.5.2 Overview
The primary goal of the proposed system is to collect accounting data and
to enable system-wide coordination of resource and service usage based on
the collected information. To enable the usage of receipts for coordination in
a distributed system, the receipts must have the basic characteristic of the
resources and services they represent, i.e. they must be scarce. Therefore,
the receipts must be issued. Accordingly, every user has a limited amount
of tokens.
32.5 Token-Based Accounting Scheme 557
Figure 32.2 shows the information contained in a token. A new unused token
contains the first five information fields starting from the right hand of the
figure. The issuing date and time in milliseconds together with the serial
number and the owner id serve as unique identification of a token. This
is required to enable the detection of double spending. Further, this way
double spending can be traced to the owner. During the creation of a batch
of new tokens the serial number is randomly selected for every token. Thereby,
guessing which tokens exist in the system becomes hard. The account id is
used to allocate a token clearly to a specific application. Cross application
usage and trade of tokens are possible. The account id field is optional. The
fifth field contains the signature of the information contained in the first four
fields, signed with the system’s private key. This prevents forgery.
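The token layout and its unique identification can be sketched as a data structure. This is an illustrative model: in the actual scheme the system signature is produced with a threshold-shared private key, whereas the sketch below stands in an HMAC under a single key; all field and function names are assumptions.

```python
import hashlib
import hmac
import os
import time
from dataclasses import dataclass

SYSTEM_KEY = os.urandom(32)   # stand-in for the shared system private key

@dataclass(frozen=True)
class Token:
    issued_ms: int      # issuing date/time in milliseconds
    serial: int         # randomly chosen serial number (hard to guess)
    owner_id: str
    account_id: str     # optional binding to a specific application
    signature: bytes    # covers the first four fields; prevents forgery

def _payload(issued_ms, serial, owner_id, account_id) -> bytes:
    return f"{issued_ms}|{serial}|{owner_id}|{account_id}".encode()

def issue_token(owner_id: str, account_id: str = "") -> Token:
    issued_ms = int(time.time() * 1000)
    serial = int.from_bytes(os.urandom(8), "big")
    sig = hmac.new(SYSTEM_KEY,
                   _payload(issued_ms, serial, owner_id, account_id),
                   hashlib.sha256).digest()
    return Token(issued_ms, serial, owner_id, account_id, sig)

def verify(tok: Token) -> bool:
    expected = hmac.new(SYSTEM_KEY,
                        _payload(tok.issued_ms, tok.serial,
                                 tok.owner_id, tok.account_id),
                        hashlib.sha256).digest()
    return hmac.compare_digest(expected, tok.signature)

t = issue_token("peerA")
assert verify(t)
# (issuing time, serial, owner) uniquely identifies the token, so a double
# spend of the same token can be detected and traced to its owner.
unique_id = (t.issued_ms, t.serial, t.owner_id)
```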
Since a token is basically a receipt, it contains further information about
the transaction for which a token is used. The service consumer is the token
owner.
Before the owner sends the token to the service provider, it also adds the
service provider’s id to the token as well as information about the transaction
(such as transaction object, date and information about the quality of the
service provisioning). The owner finally signs the complete token using its
private key. Subsequently, the contained information cannot be changed by
the service provider.

Fig. 32.2: Information contained in a token. Required information: Issuing
Date, Serial Number, Owner ID, Account ID, Signature SK. Transaction
data: Transaction Date, Transaction Object, QoS Information, Service
Provider ID, Owner Signature.

The required information in a token is the information needed for unique
identification, i.e. the system signature, the service provider's id,
as well as the service provider’s signature. This prevents tokens from being
stolen. Because unused tokens contain the owner, only the owner can spend
them. Used tokens are signed and contain the receiver of the token. Only
the receiver is allowed to exchange tokens for new, own tokens. A token
has no intrinsic value; rather, it represents an accounting event. The value
of a token is determined in the token aggregation process.
The Token Aggregation process is used to exchange foreign tokens a peer col-
lected for new tokens issued to that peer. The eight-step Token Aggregation
procedure is shown in Figure 32.3 (a).
First the exchanging peer EP locates a trusted peer TP (1). Trusted peers
are eligible to exchange tokens and possess one part of the system’s private
key [164]. EP sends its N collected foreign tokens (Fn1, ..., FnN) to TP (2).
TP checks the foreign tokens for their validity: only tokens signed by the
owner and spent only once are valid for exchange.
Using the aggregation function M = A(Fn1, ..., FnN), TP calculates the
amount M of new tokens EP must receive in return for the foreign tokens.
The aggregation function is public and can take any form. TP now creates
M new, unsigned tokens (Un1, ..., UnM) (3).
To sign the new tokens with the system’s private key using threshold cryp-
tography [164] TP now locates further trusted peers (4). EP is not allowed
to choose the quorum of trusted peers itself. This alleviates the problem of
potential collaboration and fraud. The number of required trusted peers to
sign a token is determined by the secret sharing scheme used. The system’s
trustworthiness increases with the size of the quorum of trusted peers.
TP sends the new tokens to this quorum of trusted peers (5). Each peer
of the quorum now signs the tokens with its part of the system’s private
key (6). The resulting partial tokens (Pn1, ..., PnM) are transmitted back
to EP (7). Finally, EP combines the partial tokens into new complete tokens
(Tn1, ..., TnM) (8).
It is important to mention that the aggregation function adds an ad-
ditional degree of freedom to the system. With an appropriate aggregation
function specific economic systems can be implemented.
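The aggregation step can be sketched as follows. The 10% "exchange fee" in the example aggregation function is an invented policy chosen only to show the degree of freedom the function provides; the dictionary-based tokens and the signing callback abstract away the threshold cryptography.

```python
import math

def aggregation_function(foreign_tokens) -> int:
    """Example aggregation function A: a peer receives 90% of the exchanged
    foreign tokens back as new own tokens. The 10% fee is an invented
    economic policy, not part of the described scheme."""
    return math.floor(0.9 * len(foreign_tokens))

def token_aggregation(ep_tokens, quorum_sign):
    """Sketch of the exchange at the trusted peer TP: validate the foreign
    tokens, apply A, create unsigned tokens, and have the quorum sign them
    (quorum_sign stands in for the threshold-signing round trip)."""
    valid = [t for t in ep_tokens
             if t.get("owner_signed") and not t.get("spent_twice")]
    m = aggregation_function(valid)
    unsigned = [{"serial": i} for i in range(m)]
    return [quorum_sign(t) for t in unsigned]

fake_sign = lambda t: {**t, "system_signed": True}
foreign = [{"owner_signed": True, "spent_twice": False}] * 10
new_tokens = token_aggregation(foreign, fake_sign)
assert len(new_tokens) == 9 and all(t["system_signed"] for t in new_tokens)
```

Swapping in a different aggregation function (e.g., progressive fees, or a bonus for long-held tokens) changes the economic system without touching the rest of the protocol.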
Fig. 32.3: (a) The eight-step Token Aggregation procedure between the
exchanging peer EP, the trusted peer TP, and the quorum of trusted peers
that partially signs the new tokens; (b) the corresponding check of the
collected foreign tokens against the account holder sets of the exchanging
peer EP and the foreign peer FP.
32.5.6 Transactions
Standard Transaction
The standard transaction process is shown in Figure 32.4 (a). After a ser-
vice has been requested by the service consumer C, the service provider P
informs C about the terms and conditions of the service, including the num-
ber of tokens it expects in return for the service. If C accepts the terms and
conditions, the service provisioning phase begins.
During this phase, tokens can be transmitted before, after, or during the
service provisioning. For example, a token can be transmitted after each MB
transferred or after each minute of service received. Before a token is transmitted,
C fills in the required accounting information. C has no incentive to falsify
this information, because it influences only the token exchange of P. Then
C signs the token with its own private key and sends it to P. P checks the
signature of the received token using C ’s public key, which can be contained
in the token as owner id or transmitted with the service request. Thus, it can
be verified, that the token sender is also the token owner.
P can choose not to continue to provide the service if the contained
accounting data is incorrect. As a result of each transaction, C ’s own token
balance decreases and P ’s foreign token balance increases.
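The consumer-side and provider-side steps of this exchange can be sketched as follows. Signing is abstracted to attaching the signer's key name; the dictionary fields and function names are illustrative assumptions.

```python
def spend_token(token: dict, consumer_key: str, provider_id: str,
                transaction: dict) -> dict:
    """Consumer side: fill in the provider id and the transaction data,
    then sign the completed token (signing is abstracted in this sketch)."""
    spent = {**token, "provider_id": provider_id, "transaction": transaction}
    spent["owner_signature"] = f"signed-by:{consumer_key}"
    return spent

def accept_token(token: dict, expected_owner_key: str) -> bool:
    """Provider side: check that the sender of the token is its owner by
    verifying the owner signature against the owner's known key."""
    return token.get("owner_signature") == f"signed-by:{expected_owner_key}"

tok = {"owner_id": "peerC", "serial": 42}
spent = spend_token(tok, "peerC-key", "peerP",
                    {"object": "file.iso", "mbytes": 1})
assert accept_token(spent, "peerC-key")
assert not accept_token(spent, "mallory-key")
```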
Transaction partners could try to gain tokens by not paying after
receiving a service or by not delivering the service after receiving tokens. In
order to avoid this, transactions can be split into several parts. Then C sends
a signed token to P after P has delivered a part of the service; e.g., C sends a
token after each MByte of received data during a 5-MByte file transfer. A
further approach that eliminates the incentive for transaction partners to
cheat on each other is now presented.
Trustable Transaction
Fig. 32.4: (a) The standard transaction: service request, announcement of
the terms (number of tokens), accept, and service provisioning with token
transfer from consumer C to provider P; (b) the trustable transaction,
which additionally involves C’s account holders in signing the tokens.
Robbery
Tokens were designed to eliminate robbery. Tokens contain the owner id that
cannot be changed without detection through the system signature. Spent
tokens contain the token receiver secured through the owner’s signature.
Forgery
The system signature on each token ensures that the basic token data cannot
be changed and that no peer can create tokens itself. Thus, the system sig-
nature prevents forgery and is crucial for the trustworthiness of the system.
Accordingly, fraudulent collaboration of trusted peers must be avoided.
This can be achieved if in a quorum of trusted peers there is at least one
trustworthy peer. The probability of a quorum consisting of at least one good
peer can be determined using the hypergeometric distribution. The resulting
probability p defines the trust level of the system according to:
p(T, t, p_g) = 1 - C(T(1 - p_g), t) / C(T, t)

where T is the number of trusted peers, t the quorum size, p_g the
percentage of good peers, and C(n, k) the binomial coefficient.
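The trust level can be evaluated directly with `math.comb`; the function name and the rounding of the number of bad peers are assumptions of this sketch. It reproduces the example below: with 50% bad trusted peers, a quorum of ten reaches the 99.9% level.

```python
from math import comb

def trust_level(T: int, t: int, p_good: float) -> float:
    """Probability that a randomly drawn quorum of t out of T trusted peers
    contains at least one good peer (hypergeometric reasoning):
    p = 1 - C(T*(1 - p_good), t) / C(T, t)."""
    bad = round(T * (1 - p_good))
    if t > bad:
        return 1.0   # more quorum slots than bad peers: always one good peer
    return 1 - comb(bad, t) / comb(T, t)

# With 50% bad trusted peers, a quorum of ten reaches the 99.9% level,
# while a quorum of nine falls just short of it.
assert trust_level(1000, 10, 0.5) >= 0.999
assert trust_level(1000, 9, 0.5) < 0.999
```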
Figure 32.5 shows the required quorum size for specific trust levels. For
example, to achieve a trust level of 99.9% with 50% bad trusted peers in
the system, a quorum size of ten is required. However, because the trusted
peers are selected using the aforementioned reputation system the percentage
of bad trusted peers can be assumed to be much lower than 50%. Moreover,
because the trusted peers are not aware which other peers belong to a quorum,
having only bad peers in a quorum does not mean that this results in fraud.
The chosen (bad) trusted peers must also collaborate, and to do so the
quorum peers must know which other peers have been chosen for the
quorum. Thus, the achieved trust level is even higher.
Fig. 32.5: Required quorum size for trust levels by percentage of good peers
Furthermore, peers can only become trusted and receive a part of the
shared system private key, if their reputation is above a specific threshold
value. Accordingly, the proportion of bad peers among the trusted peers can
be assumed to be less than the proportion of bad peers in the whole system.
The actual trust threshold value depends on the reputation system used.
Additionally, threshold cryptography provides different proactive mech-
anisms to secure the key from being compromised. The key parts will be
updated periodically using proactive secret sharing [498]. This makes the old
key parts obsolete without changing the actual key. The system’s public key
remains the same. Further, a new system key will be created periodically
using the decentralized method presented in [81]. This is enforced by tokens
being valid only for a specific period of time. Therefore, the unique token id
contains the creation date and time. Outdated tokens can be exchanged for
new tokens using the Token Aggregation process. If the system’s private key
is kept secret the system can be considered secure.
Double Spending
The verification for double spending relies on the data held at the account
holders. Thus, users might try to corrupt their token list at the account
holders. This is avoided by not allowing peers to send any queries or
inquiries to the account list; rule breaches are reported to the reputation
system. Further, the token list at the account holders is a positive list. If a
peer plans to double spend a token, it has to avoid that the token is marked
in this positive list.
Maintenance
Maintenance costs arise from keeping the remote accounts consistent and
from the requirement to keep the system’s private key secret. This involves
calculating key updates at one quorum of trusted peers and distributing new
key parts afterwards to the rest of the trusted peers. Table 32.1 summarizes
the complexity of the maintenance actions, where k denotes the size of the
bank-sets and a (t, T) secret sharing scheme is used, where T denotes the
number of trusted peers in the system.
Table 32.1: Account Holder Set & System Key Maintenance Complexity
Transactions
For the analysis we assume a conservative ratio of 67% good peers in the
system. Further, we set a trust level of 99.9% (i.e. an error probability of
0.1%), which results in a quorum size t of 6 trusted peers. Furthermore, we
set the account holder set size k to 4. We model a file sharing scenario where
1 token is required per 1 MB download and the average file size s is 5 MB.
Users exchange tokens in different batch sizes b. The trustable transaction
procedure is used. If n transactions are carried out, the average number of
accounting messages M sent in such a scenario results in:
M(n, k, t, b) = n(2s + 2k) + (ns/b)(1 + 2k(b/s) + 2k + 2t)
For 100 transactions, exchanging 500 tokens with a batch size of 20 results
in 3125 messages. Simulating this scenario, the token-based accounting system
creates an additional overhead of less than 1% (for the mentioned example
it is less than 3.5 MB overhead for file transfers of 500 MB). Figure 32.6
(a) shows the generated traffic for different batch sizes and up to one million
transactions. As can be expected, the overall traffic generated by the token-
based accounting system is reduced as the batch size increases. However, the
effect levels off beyond a batch size of 20. Figure 32.6 (b) shows the influence
of an increased quorum size. The effect is not strong: even with a very high
trust level (t = 18) the system still generates no more than 1% overhead. The
effect of the account holder set size on the generated traffic is very small;
the graph is therefore omitted here.
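The message-count formula can be checked numerically; the function name is an assumption, while the parameter names follow the chapter's notation.

```python
def accounting_messages(n: int, k: int, t: int, b: int, s: float) -> float:
    """Average number of accounting messages for n transactions:
    M(n,k,t,b) = n(2s + 2k) + (n*s/b) * (1 + 2k*(b/s) + 2k + 2t),
    with account-holder-set size k, quorum size t, batch size b, and
    average file size s in MB (1 token per MB)."""
    transactions = n * (2 * s + 2 * k)                       # trustable transactions
    exchanges = (n * s / b) * (1 + 2 * k * (b / s) + 2 * k + 2 * t)
    return transactions + exchanges

# The chapter's example: 100 transactions (500 tokens), k=4, t=6, b=20, s=5.
assert accounting_messages(n=100, k=4, t=6, b=20, s=5) == 3125
```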
Fig. 32.6: Traffic in MByte generated by the token-based accounting system
vs. the number of transactions (transferred amount of data, 1-5 GB):
(a) for batch sizes b = 5, 10, 20, 50; (b) for quorum sizes t = 4, 6, 8, 10.
R. Steinmetz and K. Wehrle (Eds.): P2P Systems and Applications, LNCS 3485, pp. 567-581, 2005.
Springer-Verlag Berlin Heidelberg 2005
568 33. The PlanetLab Platform
From the beginning, the target application area for PlanetLab has been
planetary-scale systems, Peer-to-Peer applications prime among these. Plan-
etary-scale applications are characterized as involving computation spread
geographically across a wide area, for some subset of the following reasons:
– Removing latency: to serve a large, dispersed user population and still
provide fast end-to-end response time, computation must be moved to-
wards users to reduce the round-trip time for messages between users and
the service. An obvious example of such services is content distribution
networks (CDNs) like Akamai, which are in the business of moving content
closer to a worldwide user population.
– Spanning domain boundaries: the service executes in many geographi-
cal locations so as to have a presence in many physical areas, legal jurisdic-
tions, financial domains, etc. Examples of this kind of requirement include
censorship-resistant systems like FreeNet [124], and federated archival stor-
age systems like that envisioned by Oceanstore [368], which is intended to
survive the physical destruction or financial dissolution of any participating
service provider.
– Multiple vantage points: the application needs to process and correlate
data in real time from many physical locations. For example, network map-
ping and measurement applications were among the first services deployed
on PlanetLab and continue to be major users of the platform. PlanetLab
has also been used to deploy distributed crawlers of Peer-to-Peer networks
like Gnutella.
PlanetLab was enthusiastically taken up as a research vehicle, and has
rapidly become a canonical part of how large-scale networked systems re-
search is performed. At time of writing, PlanetLab consists of over 500 ma-
chines worldwide at more than 250 sites, with a significant presence in North
America, Europe, China, Russia, Brazil, India, and elsewhere.
This is complicated by the tension between two usage models for Planet-
Lab. The first is as a testbed for experiments: a system is implemented and
then deployed on PlanetLab for the purpose of obtaining measurement results
for a paper. The second is as a platform to support long-running services in
the wide area: several applications such as CoDeeN [468] and OpenDHT [340]
have been running continuously on PlanetLab and attracting real users for over
a year, at time of writing.
To cope with these tensions, the PlanetLab team adopted a number of
design principles to follow as the platform evolved [486]. Some of these only
became clear during the course of the initial PlanetLab deployment, but
others were identified at the original March 2002 workshop, including the
three main principles of application-centric interfaces, distributed virtualiza-
tion, and unbundled management.
We start by briefly describing what is, at time of writing, the way most re-
searchers and students use PlanetLab: creating and controlling slices through
the PlanetLab Central (PLC) web interface, or a command line tool which
talks to PLC. Other interfaces exist, for example slices can be created via the
Emulab portal [617].
Users log into PLC and create a slice, giving it a name, for example
ucb_p2ps. The first half of the name identifies the creating institution, while
the second is an arbitrary name for the slice. Having done this, users then
add nodes from anywhere on PlanetLab to the slice. Having added a node,
say planet1.berkeley.intel-research.net, to the slice, the user can log
into the machine via secure shell with a command like:
$ ssh ucb_p2ps@planet1.berkeley.intel-research.net
33.3 PlanetLab Methodology 575
What the user sees when logging in looks like a networked Linux machine.
She can install programs, run code, su to root, create new user accounts, run
programs like tcpdump, etc.
This lack of restriction on what users can do in slices leads to great flexi-
bility in the code that can run, as well as providing a familiar programming
environment (PlanetLab applications are usually compiled on users’ desktop
machines and then deployed on PlanetLab).
That said, there are significant differences between the runtime environ-
ment of a program running on a PlanetLab node and one running on a work-
station or server in a lab: network conditions are very different, and the
machine is always being shared with other projects. This has led to some
debate about the kinds of experimental results that PlanetLab can provide,
and the kinds of claims about system designs that can legitimately be made
based on such results.
33.3.2 Reproducibility
If PlanetLab and the Internet in general can be measured continuously (for
example, by some kind of “weather service”), it may be possible for the state
of the environment at the time of a particular experiment to be sufficiently
characterized that the results obtained can be rigorously compared with those of
other experiments which are found, after the fact, to have been conducted
under the same conditions.
There is, however, a different sense in which PlanetLab provides for re-
producibility: the functionality of systems can be verified, and indeed used,
by peer research groups. This has, of course, always been the case for small-
scale (in terms of deployment) systems like compilers and operating systems,
but the availability of PlanetLab now means that large-scale distributed sys-
tems built by research groups can also be taken up and used by other teams.
Of course, this tends to impose a different standard to that by which such
systems have traditionally been evaluated, a topic we return to below.
33.3.3 Representivity
Alongside the issue of how, and in what sense, experimental results from
PlanetLab are reproducible is the question of the extent to which they are
representative of reality. Like the issue of reproducibility, this has a number
of aspects.
PlanetLab nodes are situated in a wide variety of places, including com-
mercial colocation centers, industrial labs with commercial ISP connections,
universities with dual-homed connections to academic and commercial net-
works, and DSL lines and cable modems. Consequently, PlanetLab machines
provide an excellent set of vantage points from which to observe the Internet
as a whole. PlanetSeer [640] is an example of a service leveraging this: obser-
vation of user traffic to the CoDeeN proxy network from all over the Internet
is used to detect, triangulate, and diagnose Internet routing anomalies.
At the same time, however, the actual locations of PlanetLab nodes themselves are heavily skewed. Almost all nodes have much more bandwidth than a typical domestic U.S. or E.U. broadband connection, and the overwhelming
majority are connected to lightly loaded academic networks with very differ-
ent traffic characteristics to the commercial Internet. Banerjee, Griffin, and
Pias [55] were the first to point this out in print, and analyze the situation
in some detail, but the immediate practical implication is that measurements
of Internet paths between PlanetLab nodes are not likely to be representative
of the Internet as a whole.
It is suggested in [55] that future PlanetLab node locations might be chosen so as to make PlanetLab's network coverage representative of the Internet. However, resource constraints make this unlikely: PlanetLab is maintained primarily by its member sites, usually universities, which host machines locally in exchange for access to the global platform.
observed on the Internet at large. However, it has been hard to capture the
implications of such systems for the future design of the Internet itself, since
by their very nature these applications, and the effects they have on the
network, are not tracked in any detail.
The experience of running applications on PlanetLab – even though such
deployments are at a much smaller scale than a successful Peer-to-Peer file-
sharing application, for example – has led to a number of insights about
how Peer-to-Peer applications interact differently with the Internet. Three
features of Peer-to-Peer applications lead to this difference in behaviour.
First, a Peer-to-Peer node typically has alternative peers available for any given operation (DHT implementations by design keep multiple node addresses for each entry in their routing table). In contrast, a traditional web client (for instance) typically
has few or no alternative addresses to contact in the event of a failure to
contact the server.
This opens up a new design space for node-to-node communication pro-
tocols. For example, if one is interested in minimizing message latency in the
presence of failures, as in DHTs like Bamboo [510], it pays to have very ag-
gressive timeouts on the hop-by-hop exchanges by which messages are routed
through the DHT. Even if a node is simply being a little slow, it’s probably
worthwhile to reroute the message around the node and along an alternate
path, as long as this does not unduly increase network congestion.
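A minimal sketch of this per-hop strategy follows, assuming each routing-table entry yields an ordered list of candidate next hops and a send primitive that reports whether a hop acknowledged within the timeout. All names and the timeout value are illustrative, not Bamboo's actual interface.

```python
# Illustrative per-hop forwarding with aggressive timeouts: if the best
# next hop does not acknowledge quickly, reroute via the next candidate
# rather than waiting. Names and the timeout value are invented.
def forward(message, candidates, send, timeout=0.05):
    """candidates: next hops ordered by expected latency.
    send(hop, message, timeout) -> True if the hop acknowledged in time."""
    for hop in candidates:
        if send(hop, message, timeout):
            return hop  # handed off; this hop continues routing the message
    raise RuntimeError("all candidate next hops timed out")
```

The design trade-off is visible here: a smaller timeout reacts faster to slow nodes but, applied too aggressively, generates duplicate traffic that can add to network congestion.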
Such fine-grained control over, and rapid reaction to, message timeouts is
not possible with TCP as it is implemented in a mainstream operating sys-
tem kernel. Furthermore, TCP’s policy of reliable, in-order delivery of mes-
sages is not appropriate for many, if not most, high-performance Peer-to-Peer
systems. Consequently, almost all the Peer-to-Peer applications deployed on
PlanetLab today use UDP-based, TCP-friendly custom transport protocols
rather than vanilla, kernel-based TCP. This is in stark contrast to traditional
Internet applications, which are mostly TCP-based.
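As a rough illustration of why applications take this route, the sketch below shows a UDP request with an application-controlled retransmission timer, the kind of fine-grained timeout control kernel TCP does not expose. The parameter values are invented, and a deployed protocol would additionally need TCP-friendly congestion control.

```python
import socket

def request(sock, dest, payload, timeout=0.05, retries=3):
    """Send a datagram and wait briefly for a reply, retransmitting on
    timeout -- in a DHT, this is also the point to try an alternate node."""
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendto(payload, dest)
        try:
            data, _addr = sock.recvfrom(2048)
            return data
        except socket.timeout:
            continue  # retransmit under our own, aggressive timer
    return None  # give up (or reroute) after a few tries
```

Sending the datagram to the socket's own bound address makes the example self-contained: `request(sock, sock.getsockname(), b"ping")` simply receives its own datagram back.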
Over time, the role of PlanetLab Central (PLC) is expected to decrease, though PLC itself will most likely
remain as one resource broker among many.
Beyond merely supporting distributed and Peer-to-Peer applications,
however, recall that an explicit motivation for PlanetLab was to break the
impasse facing Internet researchers, in not being able to introduce architectural changes into the Internet. Overlay networks using the underlying Internet
were suggested as a way out of the problem.
Recently, the term network virtualization [559] has been coined to describe
the use of overlays above a network like the Internet to provide Internet-like
functionality themselves. By providing the ability to run multiple virtual
networks with real users, the argument goes, alternatives to the Internet can
be explored at scale without replacing the current infrastructure.
There are two broad schools of thought as to where this might lead. One
says that by experimenting with alternative network architectures, the net-
working community in the broadest sense (researchers, carriers, governments,
etc.) can select a new network architecture with properties preferable to the
Internet, and then continue to use network virtualization as a way to incre-
mentally deploy it.
The other, slightly more radical, school of thought is that network virtual-
ization is the next architecture, in other words, future networked applications
will operate in the main by setting up per-application virtual Peer-to-Peer
networks, which then connect with other applications at many points.
In any case, investigating issues such as these requires the ability firstly
to place computation at many points in the world (for routing and forward-
ing calculations), and secondly to acquire network paths between such points
whose resources are guaranteed in some way, possibly probabilistically. PlanetLab provides the former; existing commercial ISPs' virtual private network (VPN) services or optically switched wavelength paths could provide the latter. A combination of the two holds real possibilities for implementing the
successor to the Internet.
Bibliography
[48] M. Baker, R. Buyya, and D. Laforenza, “Grids and Grid Technologies for
Wide-Area Distributed Computing”, International Journal on Software:
Practice & Experience (SPE), 32(15):1437–1466, 2002.
[49] Y. Bakos and E. Brynjolfsson, “Bundling Information Goods: Pricing,
Profits and Efficiency”, Management Science, 45(12):1613–1630, 1999.
[50] Y. Bakos and E. Brynjolfsson, “Bundling and Competition on the Inter-
net: Aggregation Strategies for Information Goods”, Marketing Science,
19(1):63–82, 2000.
[51] H. E. Bal, K. P. Löhr, and A. Reinefeld, editors, Proceedings of the Second
IEEE/ACM International Symposium on Cluster Computing and the Grid,
Washington, DC, 2002.
[52] H. Balakrishnan, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica,
“Looking up Data in P2P Systems”, Communications of the ACM, 46(2),
2003.
[53] W.-T. Balke, W. Nejdl, W. Siberski, and U. Thaden, “DL meets P2P -
Distributed Document Retrieval based on Classification and Content”, In
European Conference on Digital Libraries (ECDL), Vienna, Austria, 2005.
[54] W.-T. Balke, W. Nejdl, W. Siberski, and U. Thaden, “Progressive Dis-
tributed Top-k Retrieval in Peer-to-Peer Networks”, In IEEE International
Conference on Data Engineering (ICDE), pp. 174–185, Tokyo, Japan, 2005,
IEEE Computer Society.
[55] S. Banerjee, T. G. Griffin, and M. Pias, “The Interdomain Connectivity of
PlanetLab Nodes”, In Proceedings of the Passive and Active Measurement
Workshop (PAM), 2004.
[56] G. Banga, P. Druschel, and J. C. Mogul, “Resource Containers: A New
Facility for Resource Management in Server Systems”, In Operating Systems
Design and Implementation, pp. 45–58, 1999.
[57] Y. Bar-Yam, “Complexity rising: From human beings to human civilization,
a complexity profile", NECSI, 1997.
[58] Y. Bar-Yam, Dynamics of Complex Systems, Westview Press, 1997.
[59] A. Barabási, Z. Dezsö, E. Ravasz, S.-H. Yook, and Z. Oltvai, “Scale-free and
hierarchical structures in complex networks”, Sitges Proceedings on Complex
Networks 2004, 2002.
[60] A.-L. Barabási and R. Albert, “Emergence of Scaling in Random Networks”,
Science, 286:509–512, October 1999.
[61] A. Barmouta and R. Buyya, “GridBank: A Grid Accounting Services
Architecture (GASA) for Distributed Systems Sharing and Integration”,
In 17th Annual International Parallel & Distributed Processing Symposium
(IPDPS 2003) Workshop on Internet Computing and E-Commerce, Nice,
France, April 22-26 2003.
[62] T. Barth and M. Grauer, GRID Computing – Ansätze für verteiltes virtuelles
Prototyping, Springer-Verlag, Berlin/Heidelberg, 2002.
[63] C. Batten, K. Barr, A. Saraf, and S. Trepetin, “pStore: A Secure Peer-to-
Peer Backup System”, Technical Memo MIT-LCS-TM-632, Massachusetts
Institute of Technology Laboratory for Computer Science, 2002.
[98] F. E. Bustamante and Y. Qiao, “Friendships that last: Peer lifespan and
its role in P2P protocols”, In Proceedings of the International Workshop on
Web Content Caching and Distribution, 2003.
[99] J. Byers, J. Considine, and M. Mitzenmacher, “Simple Load Balancing for
DHTs”, In Proceedings of 2nd International Workshop on Peer-to-Peer
Systems (IPTPS ’03), Berkeley, USA, 2003, IEEE.
[100] J. Callan, "Distributed Information Retrieval", In Advances in Information Retrieval, Kluwer Academic Publishers, 2000.
[101] J. Callan, Z. Lu, and W. B. Croft, “Searching Distributed Collections
with Inference Networks”, In International ACM Conference on Research
and Development in Information Retrieval (SIGIR), Seattle, WA, USA, 1995.
[102] B. Carpenter and K. Nichols, "Differentiated Services in the Internet", Proceedings of the IEEE, 90(9):1479–1494, 2002.
[103] A. Carzaniga, Architectures for an Event Notification Service Scalable to
Wide-area Networks, Ph.D. Thesis, Politecnico di Milano, Milano, Italy, 1998.
[104] J. L. Casti, “Complexity”, Encyclopaedia Britannica, 2005.
[105] M. Castro, "Practical Byzantine Fault Tolerance", http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TR-817.pdf, 2001.
[106] M. Castro, M. Costa, and A. Rowstron, “Debunking some myths about
structured and unstructured overlays”, In Proceedings of the 2nd Symposium
on Networked Systems Design and Implementation, Boston, MA, USA, 2005.
[107] M. Castro, M. Jones, A. Kermarrec, A. Rowstron, M. Theimer, H. Wang,
and A. Wolman, “An Evaluation of Scalable Application-level Multicast
Built Using Peer-to-peer Overlays”, In INFOCOM 2003, San Francisco, CA,
U.S.A., April 2003.
[108] M. Castro, P. Druschel, A. Ganesh, A. Rowstron, and D. S. Wallach, “Se-
curity for structured peer-to-peer overlay networks”, In Proceedings of the
5th USENIX Symposium on Operating Systems Design and Implementation
(OSDI ’02), Boston, Massachusetts, 2002.
[109] M. Castro, P. Druschel, A.-M. Kermarrec, and A. Rowstron, “SCRIBE:
A large-scale and decentralised application-level multicast infrastructure”,
IEEE Journal on Selected Areas in Communications, 20(8), 2002.
[110] M. Castro, M. B. Jones, A.-M. Kermarrec, A. Rowstron, M. Theimer,
H. Wang, and A. Wolman, “An evaluation of scalable application-level
multicast built using peer-to-peer overlays”, In IEEE Twenty-Second Annual
Joint Conference of the IEEE Computer and Communications Societies
(INFOCOM 2003), 2, pp. 1510–1520, IEEE, 2003.
[111] P. Castro, B. Greenstein, R. Muntz, C. Bisdikian, R. Kermani, and M. Pa-
padopouli, “Locating Application Data across Service Discovery Domains”,
In Proc. 7th ACM Mobicom, pp. 28–42, Rome, Italy, 2001.
[112] D. Chaum, A. Fiat, and M. Naor, “Untraceable electronic cash”, In CRYPTO
’88, volume 403 of LNCS, pp. 319–327, Springer Verlag, 1990.
[363] D. Kossmann, “The state of the art in distributed query processing”, ACM
Comput. Surv., 32(4):422–469, 2000.
[364] H. Koubaa and Z. Wang, “A Hybrid Content Location Approach between
Structured and Unstructured Topology”, In Proceedings of the Third Annual
Mediterranean Ad Hoc Networking Workshop, 2004.
[365] B. Krishnamurthy and J. Rexford, Web Protocols and Practice: HTTP/1.1,
Networking Protocols, Caching, and Traffic Measurement, Addison-Wesley
Professional, 2001.
[366] R. Krishnan, M. D. Smith, Z. Tang, and R. Telang, “The impact of
free-riding on peer-to-peer networks”, In System Sciences, 2004. Proceedings
of the 37th Annual Hawaii International Conference on, pp. 199–208, IEEE,
2004.
[367] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels,
R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, "OceanStore: An Architecture for Global-Scale Persistent Storage",
In 9th International Conference on Architectural Support for Programming
Languages and Operating Systems, 2000.
[368] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels,
R. Gummadi, S. Rhea, H. Weatherspoon, C. Wells, and B. Zhao,
“OceanStore: an Architecture for Global-scale Persistent Storage”, In
Proceedings of the 9th International Conference on Architectural Support for
Programming Languages and Operating Systems, pp. 190–201, ACM Press,
2000.
[369] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal,
“Stochastic Models for the Web Graph”, In Proceedings of the 41st IEEE
Symposium on Foundations of Computer Science, November 2000.
[370] M. Kwon and S. Fahmy, “Topology-aware overlay networks for group
communication”, In Proceedings of the 12th international workshop on
Network and operating systems support for digital audio and video, pp.
127–136, ACM Press, 2002.
[371] "La red de Gnutella, afectada por el virus 'Mandragora', que consume ancho de banda" ("The Gnutella network, affected by the bandwidth-consuming 'Mandragora' virus"), 2001, http://www.ciberpais.elpais.es/d/20010308/cibersoc/soc2.htm.
[372] Sprint Advanced Technology Laboratories, "IP Monitoring Project (IPMON) Home Page", http://ipmon.sprintlabs.com/ipmon.php/.
[373] K. Lakshminarayanan, I. Stoica, and K. Wehrle, “Support for service
composition in i3”, In MULTIMEDIA ’04: Proceedings of the 12th annual
ACM international conference on Multimedia, pp. 108–111, New York, NY,
USA, 2004, ACM Press.
[374] O. Landsiedel, K. Lehmann, and K. Wehrle, “T-DHT: Topology-Based Dis-
tributed Hash Tables”, In Proceedings of Fifth International IEEE Conference
on Peer-to-Peer-Computing, IEEE, September 2005.
[392] C. Liu, L. Yang, I. Foster, and D. Angulo, "Design and Evaluation of a Resource Selection Framework for GRID Applications", Globus Project Technical Report, Globus Alliance, 2002.
[393] B. Loo, J. Hellerstein, R. Huebsch, S. Shenker, and I. Stoica, “Enhancing
P2P File-Sharing with an Internet-Scale Query Processor”, In International
Conference on Very Large Databases (VLDB), Toronto, Canada, 2004.
[394] B. T. Loo, R. Huebsch, I. Stoica, and J. M. Hellerstein, “The Case for a
Hybrid P2P Search Infrastructure”, In Proceedings of the 4th International
Workshop on Peer-to-Peer Systems (IPTPS04), 2004.
[395] A. Löser, W. Nejdl, M. Wolpers, and W. Siberski, “Information Integration
in Schema-Based Peer-To-Peer Networks”, In Proceedings of the 15th
Conference On Advanced Information Systems Engineering (CAISE 03),
Klagenfurt/Velden, Austria, 2003, Springer.
[396] S. M. Lui and S. H. Kwok, “Interoperability of Peer-To-Peer File Sharing
Protocols”, ACM SIGecom Exchanges, 3(3):25ff, 2002.
[397] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, “Search and Replication
in Unstructured Peer-to-Peer Networks”, In Proceedings of the 16th ACM
International Conference on Supercomputing, pp. 84–95, ACM, 2002.
[398] Q. Lv, S. Ratnasamy, and S. Shenker, “Can Heterogeneity Make Gnutella
Scalable?”, In Proceedings of the 1st International Workshop on Peer-to-Peer
Systems (IPTPS02), 2002.
[399] D. Malkhi, M. Naor, and D. Ratajczak, “Viceroy: A Scalable and Dynamic
Emulation of the Butterfly”, In PODC ’02: Proceedings of the twenty-first
annual symposium on Principles of distributed computing, pp. 183–192, ACM
Press, 2002.
[400] G. Manku, M. Bawa, and P. Raghavan, “Symphony: Distributed Hashing in
a Small World”, In Proceedings of the 4th USENIX Symposium on Internet
Technologies and Systems (USITS 2003), 2003.
[401] L. Mathy, N. Blundell, V. Roca, and A. El-Sayed, “Impact of Simple
Cheating in Application-Level Multicast”, In Proceedings of IEEE Infocom,
IEEE, 2004.
[402] H. R. Maturana and F. J. Varela, Autopoiesis and Cognition: The Realization
of the Living, D. Reidel, Dordrecht, Holland, 1980.
[403] H. R. Maturana and F. J. Varela, Der Baum der Erkenntnis, Scherz,
München, 1987.
[404] A. Mauthe and D. Hutchison, “Peer-to-Peer Computing: Systems, Concepts
and Characteristics", In Praxis in der Informationsverarbeitung & Kommunikation (PIK), K. G. Saur Verlag, 2003.
[405] P. Maymounkov and D. Mazieres, “Kademlia: A Peer-to-Peer Information
System Based on the XOR Metric”, In International Workshop on Peer-to-
Peer Systems (IPTPS’02), 2002.
[406] McAfee Rumor, http://www.mcafeeasap.com/intl/en/content/virusscan_asap/rumor.asp, 2004.
[407] D. L. McGuinness and F. van Harmelen, "OWL Web Ontology Language Overview", http://www.w3.org/TR/owl-features/, 2004.
[562] A. Singh and L. Liu, “A Hybrid Topology Architecture for P2P Systems”,
In Proceedings of the 13th International Conference on Computer Commu-
nications and Networks, 2004.
[563] A. Singla and C. Rohrs, "Ultrapeers: Another Step Towards Gnutella Scalability", Gnutella Developer Forum, 2002.
[564] M. Sintek and S. Decker, “TRIPLE — A Query, Inference, and Transforma-
tion Language for the Semantic Web”, In Proceedings of the 1st International
Semantic Web Conference, Springer, 2002.
[565] E. Sit and R. Morris, “Security Considerations for Peer-to-Peer Distributed
Hash Tables”, In IPTPS 2002, 2002.
[566] SixFour Manual, 2003,
http://www.brain-pro.de/Seiten/six/readmeintro.html.
[567] Skype, “Skype Homepage”, http://www.skype.com/, 2004.
[568] T. Small and Z. Haas, “The Shared Wireless Infostation Model - A
New Ad Hoc Networking Paradigm (or Where there is a Whale, there is a
Way)”, In Proc. 4th ACM MobiHoc 2003, pp. 233–244, Annapolis, MD, 2003.
[569] M. Solarski, L. Strick, K. Motonaga, C. Noda, and W. Kellerer, “Flexible
Middleware Support for Future Mobile Services and Their Context-Aware
Adaptation", In F. A. Aagesen, C. Anutariya, and V. Wuwongse, editors, IFIP International Conference, INTELLCOMM 2004, Bangkok, Thailand, November 23–26, 2004, Springer LNCS 3283, pp. 281–292, Springer-Verlag GmbH, 2004.
[570] K. Sripanidkulchai, “The Popularity of Gnutella Queries and its Implications
on Scalability”, In Proc. O’Reilly Peer-to-Peer and Web Services Conf, 2001.
[571] K. Sripanidkulchai, B. Maggs, and H. Zhang, “Efficient Content Location
Using Interest-Based Locality in Peer-to-Peer Systems”, In Annual Joint
Conference of the IEEE Computer and Communications Societies (INFO-
COM), San Francisco, CA, USA, 2003.
[572] S. Staniford, V. Paxson, and N. Weaver, “How to Own the Internet in Your
Spare Time”, In Proceedings of the 11th USENIX Security Symposium, San
Francisco, CA, 2002.
[573] R. Steinmetz and K. Wehrle, “Peer-to-Peer-Networking & -Computing”,
Informatik-Spektrum, 27(1):51–54, 2004, Springer, Heidelberg (in German).
[574] I. Stoica, D. Adkins, S. Zhuang, S. Shenker, and S. Surana, “Internet
Indirection Infrastructure”, In Proceedings of ACM SIGCOMM, August
2002.
[575] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan, “Chord:
A Scalable Peer-To-Peer Lookup Service for Internet Applications”, In
Proceedings of the 2001 ACM SIGCOMM Conference, pp. 149–160, ACM
Press, 2001.
[576] I. Stoica, R. Morris, D. Liben-Nowell, D. Karger, M. F. Kaashoek, F. Dabek,
and H. Balakrishnan, “Chord: A scalable Peer-to-Peer Lookup Service for
Internet Applications”, IEEE Transactions on Networking, 11(1):17–32, 2003.
[577] T. Straub and A. Heinemann, “An Anonymous Bonus Point System For
Mobile Commerce Based On Word-Of-Mouth Recommendation”, In L. M.
Liebrock, editor, Applied Computing 2004. Proceedings of the 2004 ACM
Symposium on Applied Computing, pp. 766–773, New York, NY, USA, 2004,
ACM Press.
[578] B. Strulo, “Middleware to Motivate Co-operation in Peer-to-Peer Systems
(A Project Discussion)”, P2P Journal, 2004.
[579] M. Stump, “Peer-to-Peer Tracking Can Save Cash: Ellacoya”,
http://www.ellacoya.com/news/pdf/10 07 02 mcn.pdf, 2002.
[580] L. Subramanian, I. Stoica, H. Balakrishnan, and R. Katz, “OverQoS: Of-
fering Internet QoS Using Overlays”, In Proc. of 1st HotNets Workshop, 2002.
[581] Q. Sun and H. Garcia-Molina, “Partial Lookup Services”, In Proc. 23rd
Int. Conf. On Distributed Computing Systems (ICDCS 2003), pp. 58–67,
Providence, Rhode Island, 2003.
[582] P. F. Syverson, D. M. Goldschlag, and M. G. Reed, “Anonymous Connections
and Onion Routing”, In IEEE Symposium on Security and Privacy, pp.
44–54, Oakland, California, 1997.
[583] D. Talbot, “Distributed Computing, subsection in 5 Patents to Watch”,
MIT Technology Review, 104(4):42, 2001.
[584] K. Tamilmani, V. Pai, and A. Mohr, "SWIFT: A System With Incentives For
Trading”, In Proceedings of Second Workshop of Economics in Peer-to-Peer
Systems, 2004.
[585] A. Tarlano, W. Kellerer, R. Schollmeier, and J. Eberspächer, “Compression
Scheme Negotiation”, 2004.
[586] C. Tempich, S. Staab, and A. Wranik, “REMINDIN’: Semantic Query Rout-
ing in Peer-to-Peer Networks Based on Social Metaphors”, In Proceedings of
the Thirteenth International conference on the World Wide Web, New York,
NY, USA, 2004, ACM.
[587] D. L. Tennenhouse, J. M. Smith, W. D. Sincoskie, D. J. Wetherall, and G. J.
Minden, “A Survey of Active Network Research”, IEEE Communications
Magazine, 35(1):80–86, January 1997.
[588] S. Thatte, “Business Process Execution Language for Web Services Version
1.1", 2003, ftp://www6.software.ibm.com/software/developer/library/ws-bpel.pdf.
[589] The eMule Project, "The eMule Homepage", http://www.emule-project.net/, 2004.
[590] The Globus Alliance, http://www.globus.org/, 2004.
[591] The MMAPPS Consortium, “Market Management of Peer to Peer Services”,
http://www.mmapps.org/, 2004.
[592] The Network Simulator – ns-2, http://www.isi.edu/nsnam/ns/.
[593] H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn,
“XML Schema Part 1: Structures Second Edition”, W3C, 2004,
http://www.w3c.org/TR/xmlschema-1.
– overlay, 80
– paradigm, 9, 12, 79
– pure, 42
– research challenges, 12
– revenue model, 487
– second generation, 20
– service, 497
– structured, 79
– structured systems, 15
– system, 80
– traffic, 17
– unstructured systems, 37
PeerMart, 503
PeerMint, 495
percolation search model, 139
performance, 383, 492, 494
– concept, 384
perturbation, 235, 239
Piazza, 333
PIER, 327, 364
planetary-scale systems, 568
PlanetLab, 567
– central, 573
PlanetP, 348
PlanetSeer, 576
PLC, 573
positioning systems, 457
POST, 176, 189
– design, 176
– erasure code, 189
– fragments, 189
– glacier, 189
– manifest, 189
– security, 182
power-law
– distribution of node degrees, 138
– graphs, 139
– network, 68
PPay, 495
PPP, 19
preferential attachment, 68, 138
prefix routing, 141
preservation of key ordering, 143
price, 496, 503
– offer, 503
pricing, 501, 504
privacy, 495
proportional replication, 144
provider, 504
public key, 504
Publius, 544
QoS, 9, 23, 379, 496
Quality-of-Service, see QoS
queries
– key-based, 323
– keyword-based, 323
– range, 143
– recursive, 147
– schema-based, 323
query routing, 343
random
– graph by Gilbert, 61
– graphs, 57, 61
– walks, 139
randomized construction, 142
range queries, see queries
ranked retrieval model, 340
rational, 492
RDF, see Resource Description Framework
RDP, see relative delay penalty
redundancy, 132, 504
referential integrity, 145
relative delay penalty, 159
reliability, 9, 57, 131, 494, 504
rendezvous peer view, see RPV
replication, 132, 492
– structural, 141
reputation, 492, 495, 497
requirements, 491
research challenges, see Peer-to-Peer
resilience, 414
resource
– access control, 384, 385
– mediation, 384
– mediation functions, 385
Resource Description Framework, 271
resources, 10, 497
revenue model, 476
– direct, 476
– indirect, 476
RIAA, 20
robustness, 503
root node, 505
routing, 290
routing indexes, 345
routing table maintenance, 146
RPV, 358
SAN, see Storage Area Networks
scalability, 9, 79, 276, 280, 284, 286, 494, 503
– of networks, 57
scale-free network, 57, 67, 75, 240
Scoped Overlays, 172
ubiquitous
– computing, 457
– devices, 457
– infrastructures, 458
UDDI, 217
Unicorn, 194
unreliable, 492, 504
unstructured Peer-to-Peer
– in mobile environments, 408
– systems, see Peer-to-Peer
UseNet, 18
UUHASH, 21
Watts-Strogatz model, 64
Web Services, 198, 207
– addressing, 218
– description language, 198
– federation, 218
– policy, 218
– reliable messaging, 218
– resource framework, 201, 218
– security, 219
– transaction, 219
weighted collect
– -rec, 289
– on trees, 289
Wireless LAN, 405