Distributed and Cloud
Computing
From Parallel Processing to the
Internet of Things
Kai Hwang
Geoffrey C. Fox
Jack J. Dongarra
Contents
CHAPTER 3 Virtual Machines and Virtualization of Clusters and Data Centers
Summary
3.1 Implementation Levels of Virtualization
3.1.1 Levels of Virtualization Implementation
3.1.2 VMM Design Requirements and Providers
3.1.3 Virtualization Support at the OS Level
3.1.4 Middleware Support for Virtualization
3.2 Virtualization Structures/Tools and Mechanisms
3.2.1 Hypervisor and Xen Architecture
3.2.2 Binary Translation with Full Virtualization
3.2.3 Para-Virtualization with Compiler Support
Index
Preface
KEY FEATURES
• Coverage of modern distributed computing technology including computer clusters,
virtualization, service-oriented architecture, massively parallel processors, peer-to-peer systems,
cloud computing, social networks, and the Internet of Things.
• The book places major emphasis on exploiting the ubiquity, agility, efficiency, scalability,
availability, and programmability of parallel, distributed, and cloud computing systems.
• Latest developments in Hardware, Networks, and System Architecture:
• Multi-core CPUs and Many-Core GPUs (Intel, Nvidia, AMD)
• Virtual Machines and Virtual Clusters (CoD, Violin, Amazon VPC)
• Top 500 Architectures (Tianhe-1A, Jaguar, Roadrunner, etc.)
• Google AppEngine, Amazon AWS, Microsoft Azure, IBM BlueCloud
• TeraGrid, DataGrid, ChinaGrid, BOINC, Grid5000, and FutureGrid
• Chord, Napster, BitTorrent, KaZaA, PPLive, JXTA, and .NET
• RFID, Sensor Networks, GPS, CPS, and the Internet of Things
• Facebook, Force.com, Twitter, SGI Cyclone, Nebula, and GoGrid
• Recent advances in paradigms, programming, software and ecosystems:
• MapReduce, Dryad, Hadoop, MPI, Twister, BigTable, DISC, etc.
• Cloud Service and Trust Models (SaaS, IaaS, PaaS, and PowerTrust)
• Programming Languages and Protocols (Python, SOAP, UDDI, Pig Latin)
• Virtualization Software (Xen, KVM, VMware ESX, etc.)
• Cloud OS and Mashups (Eucalyptus, Nimbus, OpenNebula, vSphere 4, etc.)
• Service-Oriented Architecture (REST, WS, Web 2.0, OGSA, etc.)
• Distributed Operating Systems (DCE, Amoeba, and MOSIX)
• Middleware and Software Libraries (LSF, Globus, Hadoop, Aneka)
• Over 100 examples are illustrated with 300 figures, designed to meet the needs of students taking
a distributed systems course. Each chapter includes exercises and further reading.
• Included are case studies from the leading distributed computing vendors: Amazon, Google,
Microsoft, IBM, HP, Sun, Silicon Graphics, Rackspace, Salesforce.com, NetSuite, Enomaly, and
many more.
INVITED CONTRIBUTIONS
The book was jointly planned, authored, edited, and proofread by all three lead authors over four
years (2007–2011). Over this period, we invited and received partial contributions and technical
assistance from the following scientists, researchers, instructors, and doctoral students from
10 top universities in the United States, China, and Australia.
Listed below are the invited contributors to this book. The authorship, contributed sections, and
editorship of individual chapters are explicitly identified at the end of each chapter. We
thank them for their dedicated work and valuable contributions throughout the repeated
writing and revision process. The comments from the anonymous reviewers also helped improve
the final content.
Albert Zomaya, Nikzad Babaii Rizvandi, Young-Choon Lee, Ali Boloori, Reza Moraveji,
Javid Taheri, and Chen Wang, Sydney University, Australia
Rajkumar Buyya, University of Melbourne, Australia
Yongwei Wu, Weimin Zheng, and Kang Chen, Tsinghua University, China
Zhenyu Li, Ninghui Sun, Zhiwei Xu, and Gaogang Xie, Chinese Academy of Sciences
Zhibin Yu, Xiaofei Liao and Hai Jin, Huazhong University of Science and Technology
Judy Qiu, Shrideep Pallickara, Marlon Pierce, Suresh Marru, Gregor Laszewski,
Javier Diaz, Archit Kulshrestha, and Andrew J. Younge, Indiana University
Michael McLennan, George Adams III, and Gerhard Klimeck, Purdue University
Zhongyuan Qin, Kaikun Dong, Vikram Dixit, Xiaosong Lou, Sameer Kulkarni,
Ken Wu, Zhou Zhao, and Lizhong Chen, University of Southern California
Renato Figueiredo, University of Florida
Michael Wilde, University of Chicago.
Geoffrey C. Fox is a Distinguished Professor of Informatics, Computing and Physics and Associate
Dean of Graduate Studies and Research in the School of Informatics and Computing, Indiana University.
He previously taught and led many research groups at Caltech and Syracuse University.
He received his Ph.D. from Cambridge University, U.K. Fox is well known for his comprehensive
work and extensive publications in parallel architecture, distributed programming, grid computing,
web services, and Internet applications. His book on Grid Computing (coauthored with F. Berman
and Tony Hey) is widely used by the research community. He has produced over 60 Ph.D. students
in physics, computer science, and engineering over the years. Contact him via: [email protected].
Foreword
Richard Feynman, in his wonderful autobiography Surely You’re Joking, Mr. Feynman, recounts how
at Los Alamos in 1944 he was responsible for supervising the human computers who performed the
long and tedious calculations required by the Manhattan Project. Using the mechanical calculators that
were then the state of the art, the best human computer could achieve only one addition or
multiplication every few seconds. Feynman and his team thus developed methods for decomposing problems
into smaller tasks that could be performed simultaneously by different people (they passed cards with
intermediate results between people operating adders, multipliers, collators, and sorters); for running
multiple computations at once in the same computing complex (they used different color cards); for
prioritizing a more important computation (they eliminated cards of other colors); and for detecting
and recovering efficiently from errors (relevant cards, and their descendants, were removed, and
computations restarted).
Seventy years later, computer architects face similar challenges—and have adopted similar
solutions. Individual computing devices are far faster, but physical constraints still limit their speed.
Thus, today’s computing landscape is characterized by pervasive parallelism. Individual processors
incorporate pipelining, parallel instructions, speculative execution, and multithreading. Essentially
every computer system, from the humblest desktop to the most powerful supercomputer, incorporates
multiple processors. Designers of future exascale supercomputers, to be capable of 10^18 operations
per second, tell us that these computers will need to support 10^7 concurrent operations.
Parallelism is fundamentally about communication and coordination, and those two activities have
also been transformed over the past seventy years by dramatic technological change. Light is no
faster, at 8 inches or 20 centimeters per nanosecond in fiber, than in Feynman's time; one can never
expect to send a message in less than 50 milliseconds from Los Angeles to Auckland. But the rate at
which data can be transmitted has changed dramatically, from a few characters per second in 1910
(early telegraphs) to thousands of characters per second in 1969 (the early ARPANET) to more than 10 billion
characters per second over optical fibers in 2010.
Quasi-ubiquitous high-speed communications not only allows call centers to be relocated to India,
it also allows computation to be moved to centralized facilities that achieve massive economies of
scale, and permits enormous quantities of data to be collected and organized to support decision
making by people worldwide. Thus, government agencies, research laboratories, and companies who need
to simulate complex phenomena create and operate enormous supercomputers with hundreds of
thousands of processors. Similarly, companies such as Google, Facebook, and Microsoft who need
to process large quantities of data operate numerous massive “cloud” data centers that may each
occupy tens of thousands of square feet and contain tens or hundreds of thousands of computers. Like
Feynman’s Los Alamos team, these computing complexes provide computing as a service for many
people, and must juggle many computations performed for different purposes.
Massive parallelism, ultra-fast communication, and massive centralization are all fundamental to
human decision making today. The computations that are used to forecast tomorrow’s weather,
index the web, recommend movies, suggest social connections, predict the future state of the stock
market, or provide any one of a multitude of other desirable information products are typically
distributed over thousands of processors and depend on data collected from sometimes millions of
sources worldwide. Indeed, little of the modern world could function as it does without parallel and
distributed computing.
In this pervasively parallel and distributed world, an understanding of distributed computing is
surely an essential part of any undergraduate education in computer science. (Indeed, I would
argue, an understanding of these topics should be an essential part of any undergraduate education.
But I leave that argument for another time.) The most complex computer systems today are no
longer individual microprocessors, but entire data centers. The most complex computer programs
written today are those that manage or run on data-center-scale systems. A student who graduates
with a degree in computer science and does not understand how these systems and programs are
constructed is profoundly unprepared to engage productively in the modern workforce.
Hwang, Fox, and Dongarra's text is thus especially timely. In its three sections, it covers
progressively the hardware and software architectures that underpin modern massively parallel
computer systems; the concepts and technologies that enable cloud and distributed computing; and
advanced topics in distributed computing, including grid, peer-to-peer, and the Internet of Things.
In each area, the text takes a systems approach, describing not only concepts but also representative
technologies and realistic large-scale distributed computing deployments. Computing is as much an
engineering discipline as a science, and these descriptions of real systems will both prepare students
to use those systems and help them understand how other architects have navigated the
constraints associated with large-scale distributed system design.
The text also addresses some of the more challenging issues facing computer science researchers
today. To name just two, computers have emerged as a major consumer of electricity, accounting for
several percent of all electricity used in the US. (In Japan, it is ironic that following the 2011 tsunami,
the large supercomputers that may help prepare for future natural disasters must often be turned off
to conserve power.) And, the fact that 98% of the roughly 10 billion processors sold each year are
for embedded devices, and that these embedded devices are increasingly communication-enabled,
introduces the opportunity and challenge of an “Internet of Things” that will be vastly larger, more
complex, and more capable than today’s Internet of People.
I hope that the appearance of this book will stimulate more teaching of distributed computing in
universities and colleges—and not just as an optional topic, as is too often the case, but as a core
element of the undergraduate curriculum. I hope also that others outside universities will take this
opportunity to learn about distributed computing, and more broadly about what computing looks
like on the cutting edge: sometimes messy; often complex; but above all tremendously exciting.
Ian Foster
Jackson Hole, Wyoming
August, 2011
PART 1
Systems Modeling, Clustering, and Virtualization

CHAPTER 1
Distributed System Models and Enabling Technologies
This chapter is coauthored by Kai Hwang and Jack Dongarra with partial contributions by
Rajkumar Buyya and Ninghui Sun. Special technical assistance was provided by Zhiwei Xu, Zhou Zhao,
Xiaosong Lou, and Lizhong Chen.
SUMMARY
This chapter presents the evolutionary changes that have occurred in parallel, distributed, and cloud
computing over the past 30 years, driven by applications with variable workloads and large data
sets. We study both high-performance and high-throughput computing systems in parallel computers
appearing as computer clusters, service-oriented architecture, computational grids, peer-to-peer net-
works, Internet clouds, and the Internet of Things. These systems are distinguished by their hard-
ware architectures, OS platforms, processing algorithms, communication protocols, and service
models applied. We also introduce essential issues on the scalability, performance, availability,
security, and energy efficiency in distributed systems.
FIGURE 1.1
Evolutionary trend toward parallel, distributed, and cloud computing with clusters, MPPs, P2P networks, grids,
clouds, web services, and the Internet of Things.
clusters, grids, or Internet clouds has proliferated. These systems are employed by both consumers
and high-end web-scale computing and information services.
The general computing trend is to leverage shared web resources and massive amounts of data
over the Internet. Figure 1.1 illustrates the evolution of HPC and HTC systems. On the HPC side,
supercomputers (massively parallel processors or MPPs) are gradually being replaced by clusters of
cooperative computers out of a desire to share computing resources. The cluster is often a collection
of homogeneous compute nodes that are physically connected in close range to one another. We will
discuss clusters, MPPs, and grid systems in more detail in Chapters 2 and 7.
On the HTC side, peer-to-peer (P2P) networks are formed for distributed file sharing and
content delivery applications. A P2P system is built over many client machines (a concept we
will discuss further in Chapter 5). Peer machines are globally distributed in nature. P2P, cloud
computing, and web service platforms are more focused on HTC applications than on HPC
applications. Clustering and P2P technologies lead to the development of computational grids or data
grids.
the Top 500 most powerful computer systems in the world are measured by floating-point speed in
Linpack benchmark results. However, the number of supercomputer users is limited to less than
10% of all computer users. Today, the majority of computer users are using desktop computers or
large servers when they conduct Internet searches and market-driven computing tasks.
overlaps with distributed computing to a great extent, and cloud computing overlaps with distributed,
centralized, and parallel computing. The following list defines these terms more clearly; their architec-
tural and operational differences are discussed further in subsequent chapters.
• Centralized computing This is a computing paradigm by which all computer resources are
centralized in one physical system. All resources (processors, memory, and storage) are fully
shared and tightly coupled within one integrated OS. Many data centers and supercomputers are
centralized systems, but they are used in parallel, distributed, and cloud computing applications
[18,26].
• Parallel computing In parallel computing, all processors are either tightly coupled with
centralized shared memory or loosely coupled with distributed memory. Some authors refer to
this discipline as parallel processing [15,27]. Interprocessor communication is accomplished
through shared memory or via message passing. A computer system capable of parallel
computing is commonly known as a parallel computer [28]. Programs running in a parallel
computer are called parallel programs. The process of writing parallel programs is often
referred to as parallel programming [32].
• Distributed computing This is a field of computer science/engineering that studies distributed
systems. A distributed system [8,13,37,46] consists of multiple autonomous computers, each
having its own private memory, communicating through a computer network. Information
exchange in a distributed system is accomplished through message passing. A computer
program that runs in a distributed system is known as a distributed program. The process of
writing distributed programs is referred to as distributed programming. (The sketch following this
list contrasts this message-passing style with the shared-memory style of parallel programs.)
• Cloud computing An Internet cloud of resources can be either a centralized or a distributed
computing system. The cloud applies parallel or distributed computing, or both. Clouds can be
built with physical or virtualized resources over large data centers that are centralized or
distributed. Some authors consider cloud computing to be a form of utility computing or service
computing [11,19].
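To make the distinction between the parallel and distributed paradigms defined above concrete, the following minimal sketch simulates both styles on a single machine with Python's standard multiprocessing module: one group of workers coordinates through shared memory, while the other keeps private state and communicates results only by message passing. The example is purely illustrative; in a real distributed system the message-passing workers would run on separate computers connected by a network.

```python
# Parallel style: tightly coupled workers update shared memory.
# Distributed style: autonomous workers with private memory exchange messages.
# Illustrative sketch only; real distributed nodes would be separate machines.

from multiprocessing import Pipe, Process, Value

def parallel_worker(shared_sum, start, stop):
    local = sum(range(start, stop))
    with shared_sum.get_lock():          # coordinate through shared memory
        shared_sum.value += local

def distributed_worker(conn, start, stop):
    conn.send(sum(range(start, stop)))   # coordinate through message passing
    conn.close()

if __name__ == "__main__":
    shared = Value("q", 0)
    workers = [Process(target=parallel_worker, args=(shared, i, i + 250))
               for i in range(0, 1000, 250)]
    for w in workers: w.start()
    for w in workers: w.join()

    pipes = [Pipe() for _ in range(4)]
    nodes = [Process(target=distributed_worker, args=(child, i, i + 250))
             for (_, child), i in zip(pipes, range(0, 1000, 250))]
    for n in nodes: n.start()
    total = sum(parent.recv() for parent, _ in pipes)
    for n in nodes: n.join()

    print(shared.value, total)           # both print 499500
```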
As an alternative to the preceding terms, some in the high-tech community prefer the term concurrent
computing or concurrent programming. These terms typically refer to the union of parallel
computing and distributed computing, although biased practitioners may interpret them differently.
Ubiquitous computing refers to computing with pervasive devices at any place and time using wired
or wireless communication. The Internet of Things (IoT) is a networked connection of everyday
objects including computers, sensors, humans, etc. The IoT is supported by Internet clouds to
achieve ubiquitous computing with any object at any place and time. Finally, the term Internet
computing is even broader and covers all computing paradigms over the Internet. This book covers
all the aforementioned computing paradigms, placing more emphasis on distributed and cloud
computing and their working systems, including clusters, grids, P2P, and cloud systems.
at data centers. This chapter introduces the basics of various parallel and distributed families. Grids
and clouds are disparate systems that place great emphasis on resource sharing in hardware,
software, and data sets.
Design theory, enabling technologies, and case studies of these massively distributed systems
are also covered in this book. Massively distributed systems are intended to exploit a high degree
of parallelism or concurrency among many machines. In October 2010, the highest performing
cluster machine was built in China with 86,016 CPU processor cores and 3,211,264 GPU cores
in a Tianhe-1A system. The largest computational grid connects up to hundreds of server
clusters. A typical P2P network may involve millions of client machines working simultaneously.
Experimental cloud computing clusters have been built with thousands of processing nodes. We
devote the material in Chapters 4 through 6 to cloud computing. Case studies of HTC systems
will be examined in Chapters 4 and 9, including data centers, social networks, and virtualized
cloud platforms.
In the future, both HPC and HTC systems will demand multicore or many-core processors that
can handle large numbers of computing threads per core. Both HPC and HTC systems emphasize
parallelism and distributed computing. Future HPC and HTC systems must be able to satisfy this
huge demand for computing power in terms of throughput, efficiency, scalability, and reliability. The
system efficiency is determined by speed, programming, and energy factors (i.e., throughput per watt
of energy consumed). Meeting these goals requires satisfying the following design objectives:
• Efficiency measures the utilization rate of resources in an execution model by exploiting
massive parallelism in HPC. For HTC, efficiency is more closely related to job throughput, data
access, storage, and power efficiency.
• Dependability measures the reliability and self-management from the chip to the system and
application levels. The purpose is to provide high-throughput service with Quality of Service
(QoS) assurance, even under failure conditions.
• Adaptation in the programming model measures the ability to support billions of job requests
over massive data sets and virtualized cloud resources under various workload and service
models.
• Flexibility in application deployment measures the ability of distributed systems to run well in
both HPC (science and engineering) and HTC (business) applications.
systems emphasize both resource distribution and concurrency or high degree of parallelism (DoP).
Let’s review the degrees of parallelism before we discuss the special requirements for distributed
computing.
years. These applications spread across many important domains in science, engineering, business,
education, health care, traffic control, Internet and web services, military, and government
applications.
Almost all applications demand computing economics, web-scale data collection, system
reliability, and scalable performance. For example, distributed transaction processing is often
practiced in the banking and finance industry. Transactions represent 90 percent of the existing market for
reliable banking systems. Users must deal with multiple database servers in distributed transactions.
Maintaining the consistency of replicated transaction records is crucial in real-time banking services.
Other complications include lack of software support, network saturation, and security threats in these
applications. We will study applications and software support in more detail in subsequent chapters.
FIGURE 1.2
The vision of computer utilities in modern distributed computing systems.
(Modified from presentation slide by Raj Buyya, 2010)
program development. These hardware and software supports are necessary to build distributed
systems that explore massive parallelism at all processing levels.
FIGURE 1.3
Hype cycle for Emerging Technologies, 2010.
Hype Cycle Disclaimer
The Hype Cycle is copyrighted 2010 by Gartner, Inc. and its affiliates and is reused with permission. Hype Cycles are graphical representations of the relative maturity of
technologies, IT methodologies and management disciplines. They are intended solely as a research tool, and not as a specific guide to action. Gartner disclaims all warranties,
express or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
This Hype Cycle graphic was published by Gartner, Inc. as part of a larger research note and should be evaluated in the context of the entire report. The Gartner report is
available at http://www.gartner.com/it/page.jsp?id=1447613.
(Source: Gartner Press Release “Gartner’s 2010 Hype Cycle Special Report Evaluates Maturity of 1,800 Technologies” 7 October 2010.)
with respect to time and place. The idea is to tag every object using RFID or a related sensor or
electronic technology such as GPS.
With the introduction of the IPv6 protocol, 2^128 IP addresses are available to distinguish all the
objects on Earth, including all computers and pervasive devices. IoT researchers have estimated
that every human being will be surrounded by 1,000 to 5,000 objects. The IoT needs to be designed
to track 100 trillion static or moving objects simultaneously. The IoT demands universal addressa-
bility of all of the objects or things. To reduce the complexity of identification, search, and storage,
one can set the threshold to filter out fine-grain objects. The IoT obviously extends the Internet and
is more heavily developed in Asia and European countries.
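As a back-of-the-envelope check on the addressing claim above, the short script below compares the IPv6 address space against the object counts quoted in the text; the world-population figure is an assumption added only for illustration.

```python
# Back-of-the-envelope check of IPv6 addressing capacity for the IoT.
# The per-person object estimate (1,000-5,000) and the 100 trillion tracked
# objects come from the text; the world population is an assumed round figure.

IPV6_ADDRESSES = 2 ** 128          # total IPv6 address space
WORLD_POPULATION = 7 * 10 ** 9     # rough 2011 estimate (assumption)
OBJECTS_PER_PERSON = 5_000         # upper bound quoted in the text
TRACKED_OBJECTS = 100 * 10 ** 12   # 100 trillion objects, as in the text

total_objects = WORLD_POPULATION * OBJECTS_PER_PERSON
print(f"Objects surrounding all humans: {total_objects:.3e}")
print(f"Addresses per tracked object:   {IPV6_ADDRESSES / TRACKED_OBJECTS:.3e}")
print(f"Addresses per human:            {IPV6_ADDRESSES / WORLD_POPULATION:.3e}")
```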
In the IoT era, all objects and devices are instrumented, interconnected, and interact with each
other intelligently. This communication can occur between people and things or among the things
themselves. Three communication patterns co-exist: namely H2H (human-to-human), H2T (human-to-
thing), and T2T (thing-to-thing). Here things include machines such as PCs and mobile phones. The idea
here is to connect things (including human and machine objects) at any time and any place intelligently
with low cost. Any place connections include at the PC, indoor (away from PC), outdoors, and on the
move. Any time connections include daytime, night, outdoors and indoors, and on the move as well.
The dynamic connections will grow exponentially into a new dynamic network of networks,
called the Internet of Things (IoT). The IoT is still in its infancy. Many prototype
IoTs with restricted areas of coverage are under experimentation at the time of this writing.
Cloud computing researchers expect to use the cloud and future Internet technologies to support
fast, efficient, and intelligent interactions among humans, machines, and any objects on Earth.
A smart Earth should have intelligent cities, clean water, efficient power, convenient transportation,
good food supplies, responsible banks, fast telecommunications, green IT, better schools, good
health care, abundant resources, and so on. This dream living environment may take some time to
reach fruition in different parts of the world.
FIGURE 1.4
Improvement in processor and network technologies over 33 years.
(Courtesy of Xiaosong Lou and Lizhong Chen of University of Southern California, 2011)
FIGURE 1.5
Schematic of a modern multicore CPU chip using a hierarchy of caches, where L1 cache is private to each
core, on-chip L2 cache is shared, and L3 cache or DRAM is off the chip.
Both multi-core CPU and many-core GPU processors can handle multiple instruction threads at
different magnitudes today. Figure 1.5 shows the architecture of a typical multicore processor.
Each core is essentially a processor with its own private cache (L1 cache). Multiple cores are
housed in the same chip with an L2 cache that is shared by all cores. In the future, multiple CMPs
could be built on the same CPU chip with even the L3 cache on the chip. Multicore and multithreaded
designs are found in many high-end processors, including the Intel i7, Xeon, AMD
Opteron, Sun Niagara, IBM Power 6, and X cell processors. Each core may also be multithreaded.
For example, the Niagara II is built with eight cores, with eight threads handled by each core. This
implies that the maximum ILP and TLP that can be exploited in the Niagara is 64 (8 × 8 = 64). In
2011, the Intel Core i7 990x reported a 159,000 MIPS execution rate, as shown in the uppermost
square in Figure 1.4.
FIGURE 1.6
Five micro-architectures in modern CPU processors that exploit ILP and TLP supported by multicore and
multithreading technologies.
FIGURE 1.7
The use of a GPU along with a CPU for massively parallel execution in hundreds or thousands of processing
cores.
(Courtesy of B. He, et al., PACT'08 [23])
Example 1.1 The NVIDIA Fermi GPU Chip with 512 CUDA Cores
In November 2010, three of the five fastest supercomputers in the world (the Tianhe-1a, Nebulae, and
Tsubame) used large numbers of GPU chips to accelerate floating-point computations. Figure 1.8 shows
the architecture of the Fermi GPU, a next-generation GPU from NVIDIA. This is a streaming multiprocessor
(SM) module. Multiple SMs can be built on a single GPU chip. The Fermi chip has 16 SMs implemented
with 3 billion transistors, comprising up to 512 streaming processors (SPs), known as CUDA cores,
in total. The Tesla GPUs used in the Tianhe-1a have a similar architecture, with 448 CUDA cores.
The Fermi GPU is a newer generation of GPU, first appearing in 2010. The Tesla or Fermi GPU can be
used in desktop workstations to accelerate floating-point calculations or for building large-scale data
centers. The architecture shown is based on a 2009 white paper by NVIDIA [36]. There are 32 CUDA cores
per SM. Only one SM is shown in Figure 1.8. Each CUDA core has a simple pipelined integer ALU and an
FPU that can be used in parallel. Each SM has 16 load/store units allowing source and destination
addresses to be calculated for 16 threads per clock. There are four special function units (SFUs) for
executing transcendental instructions.
All functional units and CUDA cores are interconnected by an NoC (network on chip) to a large
number of SRAM banks (L2 caches). Each SM has a 64 KB L1 cache. The 768 KB unified L2 cache is
shared by all SMs and serves all load, store, and texture operations. Memory controllers are used to
connect to 6 GB of off-chip DRAMs. The SM schedules threads in groups of 32 parallel threads called warps.
In total, 256/512 FMA (fused multiply-add) operations can be done in parallel to produce 32-bit or 64-bit
floating-point results. If fully utilized, the 512 CUDA cores can work in parallel to deliver up to 515 Gflops of
double-precision results per GPU; on the order of 2,000 such GPUs would therefore be needed to reach
Pflops performance.
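To make the peak-throughput arithmetic concrete, the sketch below computes a theoretical peak from core count, clock rate, and per-core issue rate. The 448-core count is the Tesla configuration mentioned in the example; the 1.15 GHz clock and the assumption of one fused multiply-add per core per cycle (issued at half rate for double precision) are illustrative assumptions rather than figures from the text.

```python
# Rough peak-throughput estimate for a GPU: cores x clock x flops/core/cycle.
# Core counts follow the text; clock rate and issue rates are assumptions.

def peak_gflops(cores: int, clock_ghz: float, flops_per_core_per_cycle: float) -> float:
    """Theoretical peak in Gflops (ignores memory and scheduling overheads)."""
    return cores * clock_ghz * flops_per_core_per_cycle

# Single precision: one FMA (2 flops) per core per cycle (assumed).
sp = peak_gflops(448, 1.15, 2.0)
# Double precision: assumed to issue at half the single-precision rate.
dp = peak_gflops(448, 1.15, 1.0)

print(f"Assumed 448-core GPU peak: {sp:.0f} Gflops SP, {dp:.0f} Gflops DP")
# The DP result is roughly 515 Gflops, matching the figure quoted above.
```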
In the future, thousand-core GPUs may appear in Exascale (Eflops or 10^18 flops) systems. This
reflects a trend toward building future MPPs with hybrid architectures of both types of processing
chips. In a DARPA report published in September 2008, four challenges are identified for exascale
computing: (1) energy and power, (2) memory and storage, (3) concurrency and locality, and
(4) system resiliency. Here, we see the progress of GPUs along with CPU advances in power
FIGURE 1.8
NVIDIA Fermi GPU built with 16 streaming multiprocessors (SMs) of 32 CUDA cores each; only one SM is
shown. More details can be found in [49].
(Courtesy of NVIDIA, 2009 [36], 2011)
efficiency, performance, and programmability [16]. In Chapter 2, we will discuss the use of GPUs
to build large clusters.
FIGURE 1.9
The GPU performance (middle line, measured at 5 Gflops/W per core in 2011), compared with the lower CPU
performance (bottom line, measured at 0.8 Gflops/W per core in 2011) and the estimated 60 Gflops/W per core
needed for Exascale systems (EF, upper curve) in the future.
(Courtesy of Bill Dally [15])
than 1 Gflop/watt per CPU core. This may limit the scaling of future supercomputers. However, the
GPUs may close the gap with the CPUs. Data movement dominates power consumption. One needs
to optimize the storage hierarchy and tailor the memory to the applications. We need to promote
self-aware OS and runtime support and build locality-aware compilers and auto-tuners for GPU-
based MPPs. This implies that both power and software are the real challenges in future parallel
and distributed computing systems.
FIGURE 1.10
Improvement in memory and disk technologies over 33 years. The Seagate Barracuda XT disk has a capacity
of 3 TB in 2011.
(Courtesy of Xiaosong Lou and Lizhong Chen of University of Southern California, 2011)
block. So the SSD can last for several years, even under conditions of heavy write usage. Flash and
SSD will demonstrate impressive speedups in many applications.
Eventually, power consumption, cooling, and packaging will limit large system development.
Power increases linearly with respect to clock frequency and quadratically with respect to voltage
applied on chips. Clock rate cannot be increased indefinitely. Lowered voltage supplies are very
much in demand. Jim Gray once said in an invited talk at the University of Southern California,
“Tape units are dead, disks are tape units, flashes are disks, and memory are caches now.” This
clearly paints the future for disk and storage technology. In 2011, the SSDs are still too expensive
to replace stable disk arrays in the storage market.
FIGURE 1.11
Three interconnection networks for connecting servers, client hosts, and storage devices; the LAN connects
client hosts and servers, the SAN connects servers with disk arrays, and the NAS connects clients with large
storage systems in the network environment.
FIGURE 1.12
Three VM architectures in (b), (c), and (d), compared with the traditional physical machine shown in (a).
(Courtesy of M. Abde-Majeed and S. Kulkarni, 2009 USC)
FIGURE 1.13
VM multiplexing, suspension, provision, and migration in a distributed computing environment.
(Courtesy of M. Rosenblum, Keynote address, ACM ASPLOS 2006 [41])
• First, the VMs can be multiplexed between hardware machines, as shown in Figure 1.13(a).
• Second, a VM can be suspended and stored in stable storage, as shown in Figure 1.13(b).
• Third, a suspended VM can be resumed or provisioned to a new hardware platform, as shown in
Figure 1.13(c).
• Finally, a VM can be migrated from one hardware platform to another, as shown in Figure 1.13(d).
These VM operations enable a VM to be provisioned to any available hardware platform. They
also enable flexibility in porting distributed application executions. Furthermore, the VM approach
will significantly enhance the utilization of server resources. Multiple server functions can be
consolidated on the same hardware platform to achieve higher system efficiency. This will eliminate
server sprawl via deployment of systems as VMs, which move transparently to the shared hardware.
With this approach, VMware claimed that server utilization could be increased from its current
5–15 percent to 60–80 percent.
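The four primitive VM operations described above can be summarized as state transitions. The following sketch is a minimal, purely conceptual model of that life cycle; the class and method names are illustrative and do not correspond to the API of any real hypervisor such as Xen, KVM, or VMware.

```python
# Conceptual model of the four VM operations in Figure 1.13:
# multiplexing, suspension, resumption/provisioning, and live migration.
# Names are illustrative only; no real hypervisor API is implied.

class VirtualMachine:
    def __init__(self, name: str, host: str):
        self.name, self.host, self.state = name, host, "running"

    def suspend(self, storage: str) -> None:
        """Save the VM image to stable storage (Figure 1.13(b))."""
        self.state, self.host = f"suspended@{storage}", None

    def resume(self, new_host: str) -> None:
        """Provision the suspended VM onto a new hardware platform (1.13(c))."""
        self.state, self.host = "running", new_host

    def migrate(self, new_host: str) -> None:
        """Move a running VM between hardware platforms (1.13(d))."""
        self.host = new_host

# Multiplexing (1.13(a)): several VMs share one physical host.
vm1, vm2 = VirtualMachine("web", "host-A"), VirtualMachine("db", "host-A")
vm1.suspend("san://pool1")   # suspension to stable storage
vm1.resume("host-B")         # provisioning onto a new platform
vm2.migrate("host-B")        # live migration of a running VM
print(vm1.state, vm1.host, vm2.host)
```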
FIGURE 1.14
Growth and cost breakdown of data centers over the years.
(Source: IDC Report, 2009)
commodity switches and networks are more desirable in data centers. Similarly, commodity
x86 servers are preferred over expensive mainframes. The software layer handles network traffic
balancing, fault tolerance, and expandability. Currently, nearly all cloud computing data centers use
Ethernet as their fundamental network technology.
From the application perspective, clusters are most popular in supercomputing applications. In
2009, 417 of the Top 500 supercomputers were built with cluster architecture. It is fair to say that
clusters have laid the necessary foundation for building large-scale grids and clouds. P2P networks
appeal most to business applications. However, the content industry was reluctant to accept P2P
technology for lack of copyright protection in ad hoc networks. Many national grids built in the
past decade were underutilized for lack of reliable middleware or well-coded applications. Potential
advantages of cloud computing include its low cost and simplicity for both providers and users.
FIGURE 1.15
A cluster of servers interconnected by a high-bandwidth SAN or LAN with shared I/O devices and disk arrays;
the cluster acts as a single computer attached to the Internet.
Critical cluster design issues and feasible implementations include the following:
• Availability and Support: hardware and software support for sustained HA in a cluster, through failover, failback, checkpointing, rollback recovery, nonstop OS, and so on.
• Hardware Fault Tolerance: automated failure management to eliminate all single points of failure, using component redundancy, hot swapping, RAID, multiple power supplies, and so on.
• Single System Image (SSI): achieving SSI at the functional level with hardware and software support, middleware, or OS extensions; for example, hardware mechanisms or middleware support to achieve DSM at the coherent cache level.
• Efficient Communications: reducing message-passing system overhead and hiding latencies with fast message passing, active messages, enhanced MPI libraries, and so on.
• Cluster-wide Job Management: using a global job management system with better scheduling and monitoring, for example single-job management systems such as LSF, Codine, and so on.
• Dynamic Load Balancing: balancing the workload of all processing nodes along with failure recovery, via workload monitoring, process migration, job replication, gang scheduling, and so on.
• Scalability and Programmability: adding more servers to a cluster, or more clusters to a grid, as the workload or data set increases, using scalable interconnects, performance monitoring, distributed execution environments, and better software tools.
TeraGrid in US, EGEE in Europe, and ChinaGrid in China for various distributed scientific grid
applications.
FIGURE 1.16
Computational grid or data grid providing computing utility, data, and information services through resource
sharing and cooperation among participating organizations.
(Courtesy of Z. Xu, Chinese Academy of Science, 2004)
• Grid Applications Reported: for grids, distributed supercomputing, National Grid initiatives, etc.; for P2P networks, open grids with P2P flexibility, drawing all resources from client machines.
• Representative Systems: for grids, the TeraGrid built in the US, ChinaGrid in China, and the e-Science grid built in the UK; for P2P networks, JXTA, FightAIDS@Home, and SETI@home.
• Development Lessons Learned: for grids, restricted user groups, middleware bugs, and protocols to acquire resources; for P2P networks, unreliable user-contributed resources, limited to a few applications.
FIGURE 1.17
The structure of a P2P system by mapping a physical IP network to an overlay network built with virtual
links.
(Courtesy of Zhenyu Li, Institute of Computing Technology, Chinese Academy of Sciences, 2010)
formed by mapping each physical machine with its ID, logically, through a virtual mapping as
shown in Figure 1.17. When a new peer joins the system, its peer ID is added as a node in the
overlay network. When an existing peer leaves the system, its peer ID is removed from the overlay
network automatically. Therefore, it is the P2P overlay network that characterizes the logical con-
nectivity among the peers.
There are two types of overlay networks: unstructured and structured. An unstructured overlay
network is characterized by a random graph. There is no fixed route to send messages or files
among the nodes. Often, flooding is applied to send a query to all nodes in an unstructured overlay,
thus resulting in heavy network traffic and nondeterministic search results. Structured overlay
networks follow a certain connectivity topology and rules for inserting and removing nodes (peer IDs)
from the overlay graph. Routing mechanisms are developed to take advantage of the structured
overlays.
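The flooding behavior described above can be sketched in a few lines: a query is forwarded to every neighbor up to a hop limit (TTL), which is exactly why unstructured overlays generate heavy traffic and return nondeterministic results. The overlay, keys, and TTL below are illustrative assumptions, not the protocol of any particular P2P system.

```python
# Minimal flooding search over an unstructured overlay (adjacency lists).
# Illustrative only; real systems add duplicate suppression, random walks,
# and other optimizations to reduce the traffic counted here.

from collections import deque

def flood_search(overlay, start, target_key, keys, ttl=3):
    """Return (peers holding target_key within ttl hops, messages sent)."""
    visited, hits, messages = {start}, [], 0
    frontier = deque([(start, 0)])
    while frontier:
        peer, hops = frontier.popleft()
        if target_key in keys.get(peer, set()):
            hits.append(peer)
        if hops == ttl:
            continue
        for neighbor in overlay.get(peer, []):
            messages += 1                 # every forwarded query costs traffic
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return hits, messages

overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
keys = {"D": {"song.mp3"}}
print(flood_search(overlay, "A", "song.mp3", keys, ttl=2))
```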
make it too complex to apply in real applications. We need system scalability as the workload
increases. System scaling is directly related to performance and bandwidth. P2P networks do
have these properties. Data location also significantly affects collective performance. Data
locality, network proximity, and interoperability are three design objectives in distributed P2P
applications.
P2P performance is affected by routing efficiency and self-organization by participating peers.
Fault tolerance, failure management, and load balancing are other important issues in using overlay
networks. Lack of trust among peers poses another problem. Peers are strangers to one another.
Security, privacy, and copyright violations are major worries for those in the industry in terms of
applying P2P technology in business applications [35]. In a P2P network, all clients provide resources
including computing power, storage space, and I/O bandwidth. The distributed nature of P2P
networks also increases robustness, because limited peer failures do not form a single point of failure.
By replicating data in multiple peers, one can tolerate the loss of data on failed nodes. On the other hand,
disadvantages of P2P networks do exist. Because the system is not centralized, managing it is difficult.
In addition, the system lacks security. Anyone can log on to the system and cause damage or abuse.
Further, all client computers connected to a P2P network cannot be considered reliable or virus-free. In
summary, P2P networks are reliable for a small number of peer nodes. They are only useful for
applications that require a low level of security and have no concern for data sensitivity. We will discuss P2P
networks in Chapter 8, and extending P2P technology to social networking in Chapter 9.
FIGURE 1.18
Virtualized resources from data centers to form an Internet cloud, provisioned with hardware, software,
storage, network, and services for paid users to run their applications.
applications simultaneously. The cloud ecosystem must be designed to be secure, trustworthy, and
dependable. Some computer users think of the cloud as a centralized resource pool. Others consider
the cloud to be a server cluster which practices distributed computing over all the servers used.
FIGURE 1.19
Three cloud service models in a cloud landscape of major providers.
(Courtesy of Dennis Gannon, keynote address at Cloudcom2010 [19] )
Internet clouds offer four deployment modes: private, public, managed, and hybrid [11]. These
modes carry different security implications. The different SLAs imply that the security
responsibility is shared among all the cloud providers, the cloud resource consumers, and the third-
party cloud-enabled software providers. Advantages of cloud computing have been advocated by
many IT experts, industry leaders, and computer science researchers.
In Chapter 4, we will describe major cloud platforms that have been built and various cloud
service offerings. The following list highlights eight reasons to adopt the cloud for upgraded
Internet applications and web services:
1. Desired location in areas with protected space and higher energy efficiency
2. Sharing of peak-load capacity among a large pool of users, improving overall utilization
3. Separation of infrastructure maintenance duties from domain-specific application development
4. Significant reduction in cloud computing cost, compared with traditional computing paradigms
5. Cloud computing programming and application development
6. Service and data discovery and content/service distribution
7. Privacy, security, copyright, and reliability issues
8. Service agreements, business models, and pricing policies
FIGURE 1.20
Layered architecture for web services and the grids.
Here, one might get several models with, for example, Jini and JNDI (Java Naming and Directory
Interface) illustrating different approaches within the Java distributed object model. The CORBA
Trading Service, UDDI (Universal Description, Discovery, and Integration), LDAP (Lightweight Directory
Access Protocol), and ebXML (Electronic Business using eXtensible Markup Language) are other
examples of discovery and information services described in Section 5.4. Management services include
service state and lifetime support; examples include the CORBA Life Cycle and Persistent states, the
different Enterprise JavaBeans models, Jini’s lifetime model, and a suite of web services specifications
in Chapter 5. The above language or interface terms form a collection of entity-level capabilities.
The latter can have performance advantages and offers a “shared memory” model allowing more
convenient exchange of information. However, the distributed model has two critical advantages:
namely, higher performance (from multiple CPUs when communication is unimportant) and a
cleaner separation of software functions with clear software reuse and maintenance advantages. The
distributed model is expected to gain popularity as the default approach to software systems. In the
earlier years, CORBA and Java approaches were used in distributed systems rather than today’s
SOAP, XML, or REST (Representational State Transfer).
FIGURE 1.21
The evolution of SOA: grids of clouds and grids, where "SS" refers to a sensor service and "fs" to a filter or
transforming service.
amounts of raw data items. Processing this data will generate useful information, and subsequently,
the knowledge for our daily use. In fact, wisdom or intelligence is sorted out of large knowledge
bases. Finally, we make intelligent decisions based on both biological and machine wisdom. Read-
ers will see these structures more clearly in subsequent chapters.
Most distributed systems require a web interface or portal. For raw data collected by a large
number of sensors to be transformed into useful information or knowledge, the data stream may go
through a sequence of compute, storage, filter, and discovery clouds. Finally, the inter-service mes-
sages converge at the portal, which is accessed by all users. Two example portals, OGCE and
HUBzero, are described in Section 5.3 using both web service (portlet) and Web 2.0 (gadget)
technologies. Many distributed programming models are also built on top of these basic constructs.
The following compares three representative distributed operating systems (Amoeba, DCE, and MOSIX2):
• History and current system status: Amoeba was written in C and tested in the European community, with version 5.2 released in 1995; DCE was built as a user extension on top of UNIX, VMS, Windows, OS/2, etc.; MOSIX has been developed since 1977, is now called MOSIX2, and is used in HPC Linux and GPU clusters.
• Distributed OS architecture: Amoeba is microkernel-based and location-transparent, using many servers to handle files, directory, replication, run, boot, and TCP/IP services; DCE is a middleware OS providing a platform for running distributed applications, with support for RPC, security, and threads; MOSIX2 is a distributed OS with resource discovery, process migration, runtime support, load balancing, flood control, configuration, etc.
• OS kernel, middleware, and virtualization support: Amoeba uses a special microkernel that handles low-level process, memory, I/O, and communication functions; DCE packages handle file, time, directory, and security services, RPC, and authentication at middleware or user space; MOSIX2 runs with Linux 2.6, with extensions for use in multiple clusters and clouds with provisioned VMs.
• Communication mechanisms: Amoeba uses a network-layer FLIP protocol and RPC to implement point-to-point and group communication; DCE RPC supports authenticated communication and other security services in user programs; MOSIX2 uses PVM and MPI in collective communications, priority process control, and queuing services.
of a grid allows owners of clusters to share their computational resources among multiple cluster
owners. A MOSIX-enabled grid can extend indefinitely as long as trust exists among the cluster owners.
The MOSIX2 is being explored for managing resources in all sorts of clusters, including Linux clusters,
GPU clusters, grids, and even clouds if VMs are used. We will study MOSIX and its applications in
Section 2.4.4.
FIGURE 1.22
A transparent computing environment that separates the user data, application, OS, and hardware in time
and space – an ideal model for cloud computing.
Table 1.7 Parallel and Distributed Programming Models and Tool Sets
Model Description Features
effective use of resources from Internet clouds and data centers. Service clouds demand extending
Hadoop, EC2, and S3 to facilitate distributed computing over distributed storage systems. Many
other models have also been proposed or developed in the past. In Chapters 5 and 6, we will
discuss parallel and distributed programming in more detail.
1.4.3.2 MapReduce
This is a web programming model for scalable data processing on large clusters over large data sets
[16]. The model is applied mainly in web-scale search and cloud computing applications. The user
specifies a Map function to generate a set of intermediate key/value pairs. Then the user applies a
Reduce function to merge all intermediate values with the same intermediate key. MapReduce is
highly scalable to explore high degrees of parallelism at different job levels. A typical MapReduce
computation process can handle terabytes of data on tens of thousands or more client machines.
Hundreds of MapReduce programs can be executed simultaneously; in fact, thousands of
MapReduce jobs are executed on Google's clusters every day.
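A word-count job is the canonical illustration of this map/reduce split. The sketch below imitates the model in plain Python on a single machine; it omits the partitioning, shuffling across machines, and fault tolerance that a real MapReduce or Hadoop runtime provides.

```python
# Single-machine imitation of the MapReduce programming model:
# a Map function emits intermediate key/value pairs, and a Reduce
# function merges all values that share the same intermediate key.

from itertools import groupby
from operator import itemgetter

def map_fn(document: str):
    for word in document.split():
        yield (word.lower(), 1)           # intermediate key/value pair

def reduce_fn(word: str, counts):
    return (word, sum(counts))            # merge values for one key

documents = ["the cloud scales", "the grid and the cloud"]

# "Map" phase over all inputs, then group by intermediate key (the shuffle).
intermediate = sorted(pair for doc in documents for pair in map_fn(doc))
results = [reduce_fn(key, (v for _, v in group))
           for key, group in groupby(intermediate, key=itemgetter(0))]
print(results)   # e.g., [('and', 1), ('cloud', 2), ('grid', 1), ...]
```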
Table 1.8 Grid Standards and Toolkits for Scientific and Engineering Applications [6]
• OGSA Standard: Open Grid Services Architecture; offers common grid service standards for general public use. Security infrastructure: supports a heterogeneous distributed environment, bridging CAs, multiple trusted intermediaries, dynamic policies, multiple security mechanisms, etc.
• Globus Toolkits: resource allocation, Globus security infrastructure (GSI), and generic security service API. Security infrastructure: sign-in multisite authentication with PKI, Kerberos, SSL, proxy, delegation, and GSS API for message integrity and confidentiality.
• IBM Grid Toolbox: AIX and Linux grids built on top of the Globus Toolkit, autonomic computing, replica services. Security infrastructure: uses simple CA, grants access, grid service (ReGS), supports grid application for Java (GAF4J), GridMap in IntraGrid for security update.
in task spawning and massive data communication. It is efficient, as it processes data with a high
degree of parallelism across a large number of commodity nodes, and it is reliable in that it
automatically keeps multiple data copies to facilitate redeployment of computing tasks upon
unexpected system failures.
FIGURE 1.23
System scalability (measured by the number of processors or cores in a system) versus multiplicity of OS
images, based on 2010 technology.
depends on the QoS in a public network. Low-speed P2P networks, Internet clouds, and computer
clusters should be evaluated at the same networking level.
The maximum speedup of n is achieved only if the sequential bottleneck α is reduced to zero or
the code is fully parallelizable with α = 0. As the cluster becomes sufficiently large, that is, n →
∞, S approaches 1/α, an upper bound on the speedup S. Surprisingly, this upper bound is
independent of the cluster size n. The sequential bottleneck is the portion of the code that cannot be
parallelized. For example, the maximum speedup achieved is 4, if α = 0.25 or 1 − α = 0.75, even if one
uses hundreds of processors. Amdahl's law teaches us that we should make the sequential
bottleneck as small as possible. Increasing the cluster size alone may not result in a good speedup in
this case.
Very often the system efficiency is rather low, especially when the cluster size is very large.
To execute the aforementioned program on a cluster with n = 256 nodes, extremely low efficiency
E = 1/[0.25 × 256 + 0.75] = 1.5% is observed. This is because only a few processors (say, 4) are
kept busy, while the majority of the nodes are left idling.
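Amdahl's law underlies the numbers just quoted; the short calculation below is a minimal sketch of the fixed-workload speedup and efficiency formulas, and it reproduces both the speedup bound of 1/α and the 1.5 percent efficiency for α = 0.25 on a 256-node cluster.

```python
# Amdahl's law for a fixed workload: speedup S = 1 / (alpha + (1 - alpha)/n),
# efficiency E = S / n, where alpha is the sequential fraction of the code.

def amdahl_speedup(alpha: float, n: int) -> float:
    return 1.0 / (alpha + (1.0 - alpha) / n)

alpha, n = 0.25, 256
S = amdahl_speedup(alpha, n)
E = S / n
print(f"Speedup on {n} nodes: {S:.2f} (upper bound 1/alpha = {1/alpha:.0f})")
print(f"Efficiency: {E:.1%}")   # about 1.5%, as computed in the text
```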
workload W′ is essentially the sequential execution time on a single processor. The parallel
execution time of a scaled workload W′ on n processors is defined by a scaled-workload speedup
as follows:
S′ = W′/W = [αW + (1 − α)nW]/W = α + (1 − α)n   (1.3)
This speedup is known as Gustafson’s law. By fixing the parallel execution time at level W, the
following efficiency expression is obtained:
E′ = S′/n = α/n + (1 − α)   (1.4)
For the preceding program with a scaled workload, we can improve the efficiency of using a
256-node cluster to E′ = 0.25/256 + 0.75 = 0.751. One should apply Amdahl’s law and Gustafson’s
law under different workload conditions. For a fixed workload, users should apply Amdahl’s law.
To solve scaled problems, users should apply Gustafson’s law.
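For comparison, the same cluster under a scaled workload follows Gustafson's law, Equations (1.3) and (1.4); the calculation below reproduces the 0.751 efficiency figure for the 256-node example.

```python
# Gustafson's law for a scaled workload:
# S' = alpha + (1 - alpha) * n,   E' = S'/n = alpha/n + (1 - alpha).

def gustafson_speedup(alpha: float, n: int) -> float:
    return alpha + (1.0 - alpha) * n

alpha, n = 0.25, 256
S_scaled = gustafson_speedup(alpha, n)
E_scaled = S_scaled / n
print(f"Scaled speedup:    {S_scaled:.1f}")
print(f"Scaled efficiency: {E_scaled:.3f}")   # 0.751, as in the text
```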
System availability is attributed to many factors. All hardware, software, and network compo-
nents may fail. Any failure that will pull down the operation of the entire system is called a single
point of failure. The rule of thumb is to design a dependable computing system with no single
point of failure. Adding hardware redundancy, increasing component reliability, and designing for
testability will help to enhance system availability and dependability. In Figure 1.24, the effects on
system availability are estimated by scaling the system size in terms of the number of processor
cores in the system.
In general, as a distributed system increases in size, availability decreases due to a higher chance
of failure and a difficulty in isolating the failures. Both SMP and MPP are very vulnerable with
centralized resources under one OS. NUMA machines have improved in availability due to the use
of multiple OSes. Most clusters are designed to have HA with failover capability. Meanwhile,
private clouds are created out of virtualized data centers; hence, a cloud has an estimated availability
similar to that of the hosting cluster. A grid is visualized as a hierarchical cluster of clusters. Grids
have higher availability due to the isolation of faults. Therefore, clusters, clouds, and grids have
decreasing availability as the system increases in size. A P2P file-sharing network has the highest
aggregation of client machines. However, its peers operate independently, so its availability is low, and
many peer nodes may depart or fail simultaneously.
[Figure 1.24 chart: estimated availability, from low (0) to high (100 percent), plotted against system size in processor cores (small to large, up to 10^6) for SMP, NUMA, cluster, grid, and P2P network configurations.]
FIGURE 1.24
Estimated system availability by system size of common configurations in 2010.
[Figure 1.25 lists representative attacks and threats, including eavesdropping, traffic analysis, EM/RF interception, indiscretions of personnel, media scavenging, penetration, masquerade, bypassing controls, no authorization, physical intrusion, Trojan horses, trapdoors, service spoofing, DoS, resource exhaustion, intercept/alter, theft, integrity violation, repudiation, and replay.]
FIGURE 1.25
Various system attacks and network threats to cyberspace, resulting in four types of losses.
against virus, worm, and distributed DoS (DDoS) attacks. We also need to deploy mechanisms to
prevent online piracy and copyright violations of digital content. In Chapter 4, we will study
reputation systems for protecting cloud systems and data centers. Security responsibilities are
divided between cloud providers and users differently for the three cloud service models. The
providers are totally responsible for platform availability. The IaaS users are more responsible for
the confidentiality issue. The IaaS providers are more responsible for data integrity. In PaaS and
SaaS services, providers and users are equally responsible for preserving data integrity and
confidentiality.
[Figure 1.26 shows an application layer (with examples such as DNA sequence alignment, event simulation and analysis, high-energy physics, and weather forecasting) on top of a middleware layer, a resource layer, and a network layer.]
FIGURE 1.26
Four operational layers of distributed computing systems.
(Courtesy of Zomaya, Rizvandi and Lee of the University of Sydney [33] )
performance. The first step toward this end is to explore a relationship between performance and
energy consumption. Indeed, an application’s energy consumption depends strongly on the number
of instructions needed to execute the application and the number of transactions with the storage
unit (or memory). These two factors (compute and storage) are correlated and they affect comple-
tion time.
transportation systems. Traditional data centers suffer from high construction and operational costs,
complex resource management, poor usability, low security and reliability, and huge energy con-
sumption. It is necessary to adopt new technologies in next-generation data-center designs, a topic
we will discuss in more detail in Chapter 4.
where v, Ceff, K, and vt are the voltage, circuit switching capacity, a technology dependent factor,
and threshold voltage, respectively, and the parameter t is the execution time of the task under
clock frequency f. By reducing voltage and frequency, the device’s energy consumption can also be
reduced.
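The equations these symbols belong to precede this excerpt. As a hedged illustration only, the C sketch below uses the conventional DVFS relations they usually denote, E = Ceff · f · v^2 · t and f = K(v − vt)^2/v (an assumption, not necessarily the chapter's exact formulation), to show how lowering the voltage reduces both frequency and energy.

#include <stdio.h>

/* Conventional DVFS model (assumed): frequency and dynamic energy as functions of
   supply voltage v, threshold voltage vt, effective switching capacitance Ceff,
   technology-dependent factor K, and execution time t. */
static double dvfs_frequency(double K, double v, double vt) {
    return K * (v - vt) * (v - vt) / v;          /* f = K(v - vt)^2 / v    */
}
static double dvfs_energy(double Ceff, double f, double v, double t) {
    return Ceff * f * v * v * t;                 /* E = Ceff * f * v^2 * t */
}

int main(void) {
    /* Illustrative (hypothetical) parameter values, chosen only to show the trend.
       t is held fixed here for simplicity; in practice a lower f lengthens t.    */
    double Ceff = 1.0e-9, K = 1.0e9, vt = 0.3, t = 1.0;
    for (double v = 1.2; v >= 0.59; v -= 0.2) {
        double f = dvfs_frequency(K, v, vt);
        printf("v = %.1f V  f = %.2e Hz  E = %.3e J\n", v, f, dvfs_energy(Ceff, f, v, t));
    }
    return 0;
}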
[Figure 1.27 plots clock frequency (MHz, 100 to 200) versus time (msec, 1 to 7) for the same task executed without DVFS (left) and with DVFS applied during slack time (right).]
FIGURE 1.27
The DVFS technique (right) saves energy, compared to traditional practices (left) by reducing the
frequency or voltage during slack time.
Acknowledgments
This chapter was authored by Kai Hwang primarily. Geoffrey Fox and Albert Zomaya have contrib-
uted to Sections 1.4.1 and 1.5.4, respectively. Xiaosong Lou and Lizhong Chen at the University of
Southern California have assisted in plotting Figures 1.4 and 1.10. Nikzad Babaii Rizvandi and Young-
Choon Lee of the University of Sydney have contributed partially to Section 1.5.4. Jack Dongarra
has edited the entire Chapter 1.
References
[1] Amazon EC2 and S3, Elastic Compute Cloud (EC2) and Simple Storage Service (S3). http://en
.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud and http://spatten_presentations.s3.amazonaws
.com/s3-on-rails.pdf.
[2] M. Baker, R. Buyya, Cluster computing at a glance, in: R. Buyya (Ed.), High-Performance Cluster Com-
puting, Architecture and Systems, vol. 1, Prentice-Hall, Upper Saddle River, NJ, 1999, pp. 3–47, Chapter 1.
[3] A. Barak, A. Shiloh, The MOSIX Management System for Linux Clusters, Multi-Clusters, CPU Clusters,
and Clouds, White paper. www.MOSIX.org//txt_pub.html, 2010.
[4] L. Barroso, U. Holzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale
Machines, Morgan & Claypool Publishers, 2009.
[5] G. Bell, J. Gray, A. Szalay, Petascale computational systems: balanced cyberstructure in a data-centric
World, IEEE Comput. Mag. (2006).
[6] F. Berman, G. Fox, T. Hey (Eds.), Grid Computing, Wiley, 2003.
[7] M. Bever, et al., Distributed systems, OSF DCE, and beyond, in: A. Schill (Ed.), DCE-The OSF Distributed
Computing Environment, Springer-Verlag, 1993, pp. 1–20.
[8] K. Birman, Reliable Distributed Systems: Technologies, Web Services, and Applications, Springer-Verlag,
2005.
[9] G. Boss, et al., Cloud Computing–The BlueCloud Project. www.ibm.com/developerworks/websphere/
zones/hipods/, October 2007.
[10] R. Buyya (Ed.), High-Performance Cluster Computing, Vol. 1 and 2, Prentice-Hall, Englewood Cliffs, NJ, 1999.
[11] R. Buyya, J. Broberg, A. Goscinski (Eds.), Cloud Computing: Principles and Paradigms, Wiley, 2011.
[12] T. Chou, Introduction to Cloud Computing: Business and Technology. Lecture Notes at Stanford Univer-
sity and Tsinghua University, Active Book Press, 2010.
[13] G. Coulouris, J. Dollimore, T. Kindberg, Distributed Systems: Concepts and Design, Addison-Wesley, 2005.
[14] D. Culler, J. Singh, A. Gupta, Parallel Computer Architecture, Kaufmann Publishers, 1999.
[15] B. Dally, GPU Computing to Exascale and Beyond, Keynote Address at ACM Supercomputing
Conference, November 2010.
[16] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in: Proceedings of
OSDI 2004. Also, Communications of the ACM, Vol. 51, 2008, pp. 107–113.
[17] J. Dongarra, et al. (Eds.), Source Book of Parallel Computing, Morgan Kaufmann, San Francisco, 2003.
[18] I. Foster, Y. Zhao, J. Raicu, S. Lu, Cloud Computing and Grid Computing 360-Degree Compared, Grid
Computing Environments Workshop, 12–16 November 2008.
[19] D. Gannon, The Client+Cloud: Changing the Paradigm for Scientific Research, Keynote Address, IEEE
CloudCom2010, Indianapolis, 2 November 2010.
[20] V.K. Garg, Elements of Distributed Computing. Wiley-IEEE Press, 2002.
[21] R. Ge, X. Feng, K. Cameron, Performance Constrained Distributed DVS Scheduling for Scientific
Applications on Power-aware Clusters, in: Proceedings Supercomputing Conf., Washington, DC, 2005.
[22] S. Ghosh, Distributed Systems–An Algorithmic Approach, Chapman & Hall/CRC, 2007.
[23] B. He, W. Fang, Q. Luo, N. Govindaraju, T. Wang, Mars: A MapReduce Framework on Graphics
Processor, ACM PACT’08, Toronto, Canada, 25–29 October 2008.
[24] J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 2007.
[25] T. Hey, et al., The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, 2009.
[26] M.D. Hill, et al., The Data Center as a Computer, Morgan & Claypool Publishers, 2009.
[27] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programming, McGraw-Hill, 1993.
[28] K. Hwang, Z. Xu, Scalable Parallel Computing, McGraw-Hill, 1998.
[29] K. Hwang, S. Kulkarni, Y. Hu, Cloud Security with Virtualized Defense and Reputation-based Trust
Management, in: IEEE Conference on Dependable, Autonomous, and Secure Computing (DAC-2009),
Chengdu, China, 14 December 2009.
[30] K. Hwang, D. Li, Trusted Cloud Computing with Secure Resources and Data Coloring, in: IEEE Internet
Computing, Special Issue on Trust and Reputation Management, September 2010, pp. 14–22.
[31] Kelton Research, Server Energy & Efficiency Report. www.1e.com/EnergyCampaign/downloads/Server_
Energy_and_Efficiency_Report_2009.pdf, September 2009.
[32] D. Kirk, W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.
[33] Y.C. Lee, A.Y. Zomaya, A Novel State Transition Method for Metaheuristic-Based Scheduling in
Heterogeneous Computing Systems, in: IEEE Transactions on Parallel and Distributed Systems, September
2008.
[34] Z.Y. Li, G. Xie, K. Hwang, Z.C. Li, Churn-Resilient Protocol for Massive Data Dissemination in P2P
Networks, in: IEEE Trans. Parallel and Distributed Systems, May 2011.
[35] X. Lou, K. Hwang, Collusive Piracy Prevention in P2P Content Delivery Networks, in: IEEE Trans. on
Computers, July, 2009, pp. 970–983.
[36] NVIDIA Corp. Fermi: NVIDIA’s Next-Generation CUDA Compute Architecture, White paper, 2009.
[37] D. Peleg, Distributed Computing: A Locality-Sensitive Approach, SIAM, 2000.
[38] G.F. Pfister, In Search of Clusters, Second ed., Prentice-Hall, 2001.
[39] J. Qiu, T. Gunarathne, J. Ekanayake, J. Choi, S. Bae, H. Li, et al., Hybrid Cloud and Cluster Computing
Paradigms for Life Science Applications, in: 11th Annual Bioinformatics Open Source Conference BOSC
2010, 9–10 July 2010.
[40] M. Rosenblum, T. Garfinkel, Virtual machine monitors: current technology and future trends, IEEE
Computer (May) (2005) 39–47.
[41] M. Rosenblum, Recent Advances in Virtual Machines and Operating Systems, Keynote Address, ACM
ASPLOS 2006.
[42] J. Smith, R. Nair, Virtual Machines, Morgan Kaufmann, 2005.
[43] B. Sotomayor, R. Montero, I. Foster, Virtual Infrastructure Management in Private and Hybrid Clouds,
IEEE Internet Computing, September 2009.
[44] SRI. The Internet of Things, in: Disruptive Technologies: Global Trends 2025, www.dni.gov/nic/
PDF_GIF_Confreports/disruptivetech/appendix_F.pdf, 2010.
[45] A. Tanenbaum, Distributed Operating Systems, Prentice-Hall, 1995.
[46] I. Taylor, From P2P to Web Services and Grids, Springer-Verlag, London, 2005.
[47] Twister, Open Source Software for Iterative MapReduce, http://www.iterativemapreduce.org/.
[48] Wikipedia. Internet of Things, http://en.wikipedia.org/wiki/Internet_of_Things, June 2010.
[49] Wikipedia. CUDA, http://en.wikipedia.org/wiki/CUDA, March 2011.
[50] Wikipedia. TOP500, http://en.wikipedia.org/wiki/TOP500, February 2011.
[51] Y. Wu, K. Hwang, Y. Yuan, W. Zheng, Adaptive Workload Prediction of Grid Performance in
Confidence Windows, in: IEEE Trans. on Parallel and Distributed Systems, July 2010.
[52] Z. Zong, Energy-Efficient Resource Management for High-Performance Computing Platforms, Ph.D.
dissertation, Auburn University, 9 August 2008.
HOMEWORK PROBLEMS
Problem 1.1
Briefly define the following basic techniques and technologies that represent recent related advances
in computer architecture, parallel processing, distributed computing, Internet technology, and
information services:
a. High-performance computing (HPC) system
b. High-throughput computing (HTC) system
c. Peer-to-peer (P2P) network
d. Computer cluster versus computational grid
e. Service-oriented architecture (SOA)
f. Pervasive computing versus Internet computing
g. Virtual machine versus virtual infrastructure
h. Public cloud versus private cloud
i. Radio-frequency identifier (RFID)
j. Global positioning system (GPS)
k. Sensor network
l. Internet of Things (IoT)
m. Cyber-physical system (CPS)
Problem 1.2
Circle only one correct answer in each of the following two questions:
1. In the 2009 Top 500 list of the fastest computer systems, which architecture dominates?
a. Symmetric shared-memory multiprocessor systems.
b. Centralized massively parallel processor (MPP) systems.
c. Clusters of cooperative computers.
2. In a cloud formed by a cluster of servers, all servers must be selected as follows:
a. All cloud machines must be built on physical servers.
b. All cloud machines must be built with virtual servers.
c. The cloud machines can be either physical or virtual servers.
Problem 1.3
An increasing number of organizations in industry and business sectors adopt cloud systems.
Answer the following questions regarding cloud computing:
a. List and describe the main characteristics of cloud computing systems.
b. Discuss key enabling technologies in cloud computing systems.
c. Discuss different ways for cloud service providers to maximize their revenues.
Problem 1.4
Match 10 abbreviated terms and system models on the left with their descriptions on the right. Enter
the description label (a, b, c, …, j) in the underlined blanks in front of the terms.
______Globus (a) A scalable software platform promoted by Apache for web users to write and
run applications over vast amounts of distributed data
______BitTorrent (b) A P2P network for MP3 music delivery with a centralized directory server
______Gnutella (c) The programming model and associated implementation by Google for
distributed mapping and reduction of very large data sets
______EC2 (d) A middleware library jointly developed by USC/ISI and Argonne National Lab for
grid resource management and job scheduling
______TeraGrid (e) A distributed storage program by Google for managing structured data that can
scale to very large sizes
______EGEE (f) A P2P file-sharing network using multiple file index trackers
______Hadoop (g) A critical design goal of clusters of computers to tolerate nodal faults or recovery
from host failures
______SETI@home (h) The service architecture specification as an open grid standard
______Napster (i) An elastic and flexible computing environment that allows web application
developers to acquire cloud resources effectively
______BigTable (j) A P2P grid over 3 million desktops for distributed signal processing in search of
extraterrestrial intelligence
Problem 1.5
Consider a multicore processor with four heterogeneous cores labeled A, B, C, and D. Assume cores A
and D have the same speed. Core B runs twice as fast as core A, and core C runs three times faster than
core A. Assume that all four cores start executing the following application at the same time and no
cache misses are encountered in all core operations. Suppose an application needs to compute the square
of each element of an array of 256 elements. Assume 1 unit time for core A or D to compute the square
of an element. Thus, core B takes 1/2 unit time and core C takes 1/3 unit time to compute the square of an
element. Given the following division of labor in four cores:
Core A 32 elements
Core B 128 elements
Core C 64 elements
Core D 32 elements
a. Compute the total execution time (in time units) for using the four-core processor to compute the
squares of 256 elements in parallel. The four cores have different speeds. Some faster cores finish
the job and may become idle, while others are still busy computing until all squares are computed.
b. Calculate the processor utilization rate, which is the total amount of time the cores are busy (not
idle) divided by the total time all four cores spend, busy or idle, in executing the above application.
Problem 1.6
Consider parallel execution of an MPI-coded C program in SPMD (single program and multiple
data streams) mode on a server cluster consisting of n identical Linux servers. SPMD mode means
the same MPI program is running simultaneously on all servers but over different data sets of
identical workloads. Assume that 25 percent of the program execution is attributed to the execution
of MPI commands. For simplicity, assume that all MPI commands take the same amount of execu-
tion time. Answer the following questions using Amdahl’s law:
a. Given that the total execution time of the MPI program on a four-server cluster is T minutes,
what is the speedup factor of executing the same MPI program on a 256-server cluster,
compared with using the four-server cluster? Assume that the program execution is deadlock-
free and ignore all other runtime execution overheads in the calculation.
b. Suppose that all MPI commands are now enhanced by a factor of 2 by using active messages
executed by message handlers at the user space. The enhancement can reduce the execution
time of all MPI commands by half. What is the speedup of the 256-server cluster installed
with this MPI enhancement, compared with the old 256-server cluster without the MPI
enhancement?
Problem 1.7
Consider a program for multiplying two large-scale N × N matrices, where N is the matrix size. The
sequential multiply time on a single server is T1 = cN^3 minutes, where c is a constant determined by
the server used. An MPI-coded parallel program requires Tn = cN^3/n + dN^2/n^0.5 minutes to complete
execution on an n-server cluster system, where d is a constant determined by the MPI version used.
Assume the program has a zero sequential bottleneck (α = 0). The second term in Tn accounts for
the total message-passing overhead experienced by n servers.
Answer the following questions for a given cluster configuration with n = 64 servers, c = 0.8,
and d = 0.1. Parts (a, b) have a fixed workload corresponding to the matrix size N = 15,000. Parts
(c, d) have a scaled workload associated with an enlarged matrix size N′ = n^(1/3) N = 64^(1/3) × 15,000 =
4 × 15,000 = 60,000. Assume the same cluster configuration to process both workloads. Thus, the
system parameters n, c, and d stay unchanged. Running the scaled workload, the overhead also
increases with the enlarged matrix size N′.
a. Using Amdahl’s law, calculate the speedup of the n-server cluster over a single server.
b. What is the efficiency of the cluster system used in Part (a)?
c. Calculate the speedup in executing the scaled workload for an enlarged N′ × N′ matrix on the
same cluster configuration using Gustafson’s law.
d. Calculate the efficiency of running the scaled workload in Part (c) on the 64-processor cluster.
e. Compare the above speedup and efficiency results and comment on their implications.
Problem 1.8
Compare the similarities and differences between traditional computing clusters/grids and the
computing clouds launched in recent years. Consider all technical and economic aspects as listed
below. Answer the following questions against real example systems or platforms built in recent
years. Also discuss the possible convergence of the two computing paradigms in the future.
a. Hardware, software, and networking support
b. Resource allocation and provisioning methods
Problem 1.9
Answer the following questions regarding PC and HPC systems:
a. Explain why PCs and HPCs were evolutionary rather than revolutionary in the past 30 years.
b. Discuss the drawbacks in disruptive changes in processor architecture. Why is the memory wall
a major problem in achieving scalable changes in performance?
c. Explain why x86 processors are still dominating the PC and HPC markets.
Problem 1.10
Multicore and many-core processors have appeared in widespread use in both desktop computers
and HPC systems. Answer the following questions regarding advanced processors, memory devices,
and system interconnects:
a. What are the differences between multicore CPUs and GPUs in terms of architecture and usage?
b. Explain why parallel programming cannot match the progress of processor technology.
c. Suggest ideas and defend your argument with some plausible solutions to this mismatch problem
between core scaling and effective programming and use of multicores.
d. Explain why flash memory SSD can deliver better speedups in some HPC or HTC applications.
e. Justify the prediction that InfiniBand and Ethernet will continue to dominate the HPC market.
Problem 1.11
In Figure 1.7, you studied five categories of modern processors. Characterize in Table 1.9 five
micro-architectures for designing these processors. Comment on their advantages/shortcomings
and identify the names of two example commercial processors that are built in each processor
category. Assume a single core in the superscalar processor and the three multithreaded proces-
sors. The last processor category is a multicore CMP and each core is assumed to handle one
thread at a time.
Single-threaded Superscalar
Fine-grain Multithreading
Coarse-grain Multithreading
Simultaneous Multithreading (SMT)
Multicore Chip Multiprocessor (CMP)
Problem 1.12
Discuss the major advantages and disadvantages in the following areas:
a. Why are virtual machines and virtual clusters suggested in cloud computing systems?
b. What breakthroughs are required to build virtualized cloud systems cost-effectively?
c. What are the impacts of cloud platforms on the future of the HPC and HTC industry?
Problem 1.13
Characterize the following three cloud computing models:
a. What is an IaaS (Infrastructure-as-a-Service) cloud? Give one example system.
b. What is a PaaS (Platform-as-a-Service) cloud? Give one example system.
c. What is a SaaS (Software-as-a-Service) cloud? Give one example system.
Problem 1.14
Briefly explain each of the following cloud computing services. Identify two cloud providers by
company name in each service category.
a. Application cloud services
b. Platform cloud services
c. Compute and storage services
d. Collocation cloud services
e. Network cloud services
Problem 1.15
Briefly explain the following terms associated with network threats or security defense in a distribu-
ted computing system:
a. Denial of service (DoS)
b. Trojan horse
c. Network worm
d. Service spoofing
e. Authorization
f. Authentication
g. Data integrity
h. Confidentiality
Problem 1.16
Briefly answer the following questions regarding green information technology and energy effi-
ciency in distributed systems:
a. Why is power consumption critical to data-center operations?
b. What constitutes the dynamic voltage frequency scaling (DVFS) technique?
c. Conduct in-depth research on recent progress in green IT research, and write a report on its
applications to data-center design and cloud service applications.
Problem 1.17
Compare GPU and CPU chips in terms of their strengths and weaknesses. In particular, discuss the
trade-offs between power efficiency, programmability, and performance. Also compare various
MPP architectures in processor selection, performance target, efficiency, and packaging constraints.
Problem 1.18
Compare three distributed operating systems: Amoeba, DCE, and MOSIX. Research their recent
developments and their impact on applications in clusters, grids, and clouds. Discuss the suitability
of each system in its commercial or experimental distributed applications. Also discuss each
system’s limitations and explain why they were not successful as commercial systems.
CHAPTER 2
Computer Clusters for Scalable Parallel Computing
SUMMARY
Clustering of computers enables scalable parallel and distributed computing in both science and busi-
ness applications. This chapter is devoted to building cluster-structured massively parallel processors.
We focus on the design principles and assessment of the hardware, software, middleware, and operating
system support to achieve scalability, availability, programmability, single-system images, and fault tol-
erance in clusters. We will examine the cluster architectures of Tianhe-1A, Cray XT5 Jaguar, and IBM
Roadrunner. The study also covers the LSF middleware and MOSIX/OS for job and resource manage-
ment in Linux clusters, GPU clusters and cluster extensions to building grids and clouds. Only physical
clusters are studied in this chapter. Virtual clusters will be studied in Chapters 3 and 4.
2000, we see the trend moving to the clustering of RISC or x86 PC engines. Clustered products
now appear as integrated systems, software tools, availability infrastructure, and operating system
extensions. This clustering trend matches the downsizing trend in the computer industry. Supporting
clusters of smaller nodes will increase sales by allowing modular incremental growth in cluster con-
figurations. From IBM, DEC, Sun, and SGI to Compaq and Dell, the computer industry has lever-
aged clustering of low-cost servers or x86 desktops for their cost-effectiveness, scalability, and HA
features.
• DEC VAXcluster (1991): A UNIX cluster of symmetric multiprocessing (SMP) servers running the VMS OS with extensions, mainly used in HA applications
• U.C. Berkeley NOW Project (1995): A serverless network of workstations featuring active messaging, cooperative filing, and GLUnix development
• Rice University TreadMarks (1996): Software-implemented distributed shared memory for use in clusters of UNIX workstations based on page migration
• Sun Solaris MC Cluster (1995): A research cluster built over Sun Solaris workstations; some cluster OS functions were developed but were never marketed successfully
• Tandem Himalaya Cluster (1994): A scalable and fault-tolerant cluster for OLTP and database processing, built with nonstop operating system support
• IBM SP2 Server Cluster (1996): An AIX server cluster built with Power2 nodes and the Omega network, and supported by IBM LoadLeveler and MPI extensions
• Google Search Engine Cluster (2003): A 4,000-node server cluster built for Internet search and web service applications, supported by a distributed file system and fault tolerance
• MOSIX (2010, www.mosix.org): A distributed operating system for use in Linux clusters, multiclusters, grids, and clouds; used by the research community
A UNIX cluster of SMP servers running VMS/OS with extensions, mainly used in high-
availability applications. An AIX server cluster built with Power2 nodes and the Omega network and
supported by IBM LoadLeveler and MPI extensions. A scalable and fault-tolerant cluster for OLTP
and database processing built with nonstop operating system support. The Google search engine
was built at Google using commodity components. MOSIX is a distributed operating system for
use in Linux clusters, multiclusters, grids, and clouds, originally developed at the Hebrew Univer-
sity in 1999.
2.1.2.1 Scalability
Clustering of computers is based on the concept of modular growth. To scale a cluster from hundreds
of uniprocessor nodes to a supercluster with 10,000 multicore nodes is a nontrivial task. The scalabil-
ity could be limited by a number of factors, such as the multicore chip technology, cluster topology,
packaging method, power consumption, and cooling scheme applied. The purpose is to achieve scal-
able performance constrained by the aforementioned factors. We have to also consider other limiting
factors such as the memory wall, disk I/O bottlenecks, and latency tolerance, among others.
2.1.2.2 Packaging
Cluster nodes can be packaged in a compact or a slack fashion. In a compact cluster, the nodes are
closely packaged in one or more racks sitting in a room, and the nodes are not attached to periph-
erals (monitors, keyboards, mice, etc.). In a slack cluster, the nodes are attached to their usual peri-
pherals (i.e., they are complete SMPs, workstations, and PCs), and they may be located in different
rooms, different buildings, or even remote regions. Packaging directly affects communication wire
length, and thus the selection of interconnection technology used. While a compact cluster can uti-
lize a high-bandwidth, low-latency communication network that is often proprietary, nodes of a
slack cluster are normally connected through standard LANs or WANs.
2.1.2.3 Control
A cluster can be either controlled or managed in a centralized or decentralized fashion. A compact
cluster normally has centralized control, while a slack cluster can be controlled either way. In a cen-
tralized cluster, all the nodes are owned, controlled, managed, and administered by a central opera-
tor. In a decentralized cluster, the nodes have individual owners. For instance, consider a cluster
comprising an interconnected set of desktop workstations in a department, where each workstation
is individually owned by an employee. The owner can reconfigure, upgrade, or even shut down the
workstation at any time. This lack of a single point of control makes system administration of such
a cluster very difficult. It also calls for special techniques for process scheduling, workload migra-
tion, checkpointing, accounting, and other similar tasks.
2.1.2.4 Homogeneity
A homogeneous cluster uses nodes from the same platform, that is, the same processor architecture
and the same operating system; often, the nodes are from the same vendors. A heterogeneous cluster, in contrast, uses nodes built on different platforms.
2.1.2.5 Security
Intracluster communication can be either exposed or enclosed. In an exposed cluster, the communi-
cation paths among the nodes are exposed to the outside world. An outside machine can access the
communication paths, and thus individual nodes, using standard protocols (e.g., TCP/IP). Such
exposed clusters are easy to implement, but have several disadvantages:
• Being exposed, intracluster communication is not secure, unless the communication subsystem
performs additional work to ensure privacy and security.
• Outside communications may disrupt intracluster communications in an unpredictable fashion.
For instance, heavy BBS traffic may disrupt production jobs.
• Standard communication protocols tend to have high overhead.
In an enclosed cluster, intracluster communication is shielded from the outside world, which
alleviates the aforementioned problems. A disadvantage is that there is currently no standard for
efficient, enclosed intracluster communication. Consequently, most commercial or academic clusters
realize fast communications through one-of-a-kind protocols.
supercomputer sites, and virtualized cloud platforms. A good understanding of how clusters and
MPPs work collectively will pave the way toward understanding the ins and outs of large-scale
grids and Internet clouds in subsequent chapters. Several issues must be considered in developing
and using a cluster. Although much work has been done in this regard, this is still an active
research and development area.
to monitor the running condition of all nodes. In case of a node failure, critical jobs running on the
failing nodes can be saved by failing over to the surviving node machines. Rollback recovery
schemes restore the computing results through periodic checkpointing.
[Figure 2.1 chart: number of systems (0 to 500) in each Top 500 release from 06/1993 to 06/2010, broken down by architecture into MPP, cluster, SMP, constellations, single-processor, and other systems.]
FIGURE 2.1
Architectural share of the Top 500 systems.
(Courtesy of www.top500.org [25] )
In 2010, the Top 500 architecture is dominated by clusters (420 systems) and MPPs (80 systems).
The basic distinction between these two classes lies in the components they use to build the systems.
Clusters are often built with commodity hardware, software, and network components that are com-
mercially available. MPPs are built with custom-designed compute nodes, boards, modules, and cabi-
nets that are interconnected by special packaging. MPPs demand high bandwidth, low latency, better
power efficiency, and high reliability. Cost-wise, clusters are affordable by allowing modular growth
with scaling capability. The fact that MPPs appear in a much smaller quantity is due to their high
cost. Typically, only a few MPP-based supercomputers are installed in each country.
[Figure 2.2 chart: performance of the Top 500 systems from 1993 to 2010 on a log scale from 100 Mflops to 100 Pflops, showing the summed performance (Sum), the fastest system (#1), and the slowest listed system (#500); the three curves rise from roughly 1.12 Tflops, 58.7 Gflops, and 0.42 Gflops in 1993 to 43,673.1 Tflops, 2,566 Tflops, and 31,112 Gflops in 2010, respectively.]
FIGURE 2.2
Performance plot of the Top 500 supercomputers from 1993 to 2010.
(Courtesy of www.top500.org [25] )
Third place is represented by the combined use of the SLEs10 with the SGI ProPack5, with 15 sys-
tems (3 percent) over 135,200 processors. Fourth place goes to the CNK/SLES9 used by 14 systems
(2.8 percent) over 1.13 million processors. Finally, the CNL/OS was used in 10 systems (2 percent)
over 178,577 processors. The remaining 34 systems applied 13 other operating systems with a total
share of only 6.8 percent. In conclusion, the Linux OS dominates the systems in the Top 500 list.
1. Tianhe-1A, National Supercomputing Center, Tianjin, China (http://www.nscc-tj.gov.cn/en/): NUDT TH1A with 14,336 Xeon X5670 CPUs (six cores each) plus 7,168 NVIDIA Tesla M2050 GPUs (448 CUDA cores each), running Linux, built by the National Univ. of Defense Technology, China; 2.57 Pflops at 4.02 MW; 54.6% (over a peak of 4.7 Pflops)
2. Jaguar, DOE/SC/Oak Ridge National Lab., United States (http://computing.ornl.gov): Cray XT5-HE, an MPP with 224,162 x 6 AMD Opteron, 3D torus network, Linux (CLE), manufactured by Cray, Inc.; 1.76 Pflops at 6.95 MW; 75.6% of peak
3. Nebulae, China's National Supercomputer Center, ShenZhen, China (http://www.ict.cas.cas.cn): TC3600 Blade with 120,640 cores in 55,680 Xeon X5650 plus 64,960 NVIDIA Tesla C2050 GPUs, Linux, InfiniBand, built by Dawning, Inc.; 1.27 Pflops at 2.55 MW; 42.6% (over a peak of 2.98 Pflops)
4. TSUBAME 2.0, GSIC Center, Tokyo Institute of Technology, Tokyo, Japan (http://www.gsic.titech.ac.jp/): HP cluster, 3000SL, with 73,278 x 6 Xeon X5670 processors and NVIDIA GPUs, Linux/SLES 11, built by NEC/HP; 1.19 Pflops at 1.8 MW; 52.19% (over a peak of 2.3 Pflops)
5. Hopper, DOE/SC/LBNL/NERSC, Berkeley, CA, USA (http://www.nersc.gov/): Cray XE6 with 150,408 x 12 AMD Opteron, Linux (CLE), built by Cray, Inc.; 1.05 Pflops at 2.8 MW; 78.47% (over a peak of 1.35 Pflops)
Table 2.3 Sample Compute Node Architectures for Large Cluster Construction
• Homogeneous nodes using the same multicore processors: multicore processors mounted on the same node with a crossbar connected to shared memory or local disks. Representative system: the Cray XT5, which uses two six-core AMD Opteron processors in each compute node.
• Hybrid nodes using a CPU plus GPU or FLP accelerators: a general-purpose CPU handles integer operations, with GPUs acting as coprocessors to speed up FLP operations. Representative system: China's Tianhe-1A, which uses two Intel Xeon processors plus one NVIDIA GPU per compute node.
[Figure 2.3 bar chart: sustained performance and power consumption (megawatts, 0 to 8) of Tianhe-1A, Jaguar, Nebulae, TSUBAME, and Hopper II.]
FIGURE 2.3
Power and performance of the top 5 supercomputers in November 2010.
(Courtesy of www.top500.org [25] and B. Dally [10] )
[Figure 2.4 shows sequential and parallel applications running on top of a parallel programming environment and a cluster middleware layer that provides the single-system image and availability infrastructure over the interconnected cluster nodes.]
FIGURE 2.4
The architecture of a computer cluster built with commodity hardware, software, middleware, and network
components supporting HA and SSI.
(Courtesy of M. Baker and R. Buyya, reprinted with Permission [3] )
single job management system. Single memory may be realized with the help of the compiler or a
runtime library. A single process space is not necessarily supported.
In general, an idealized cluster is supported by three subsystems. First, conventional databases
and OLTP monitors offer users a desktop environment in which to use the cluster. In addition to
running sequential user programs, the cluster supports parallel programming based on standard
languages and communication libraries using PVM, MPI, or OpenMP. The programming environ-
ment also includes tools for debugging, profiling, monitoring, and so forth. A user interface sub-
system is needed to combine the advantages of the web interface and the Windows GUI. It
should also provide user-friendly links to various programming environments, job management
tools, hypertext, and search support so that users can easily get help in programming the compu-
ter cluster.
[Figure 2.5 shows three node-connection schemes, where each node contains a processor/cache (P/C), memory (M), a memory-I/O bridge (MIO), a disk (D), and a NIC: nodes connected only through a LAN; nodes additionally connected through a shared disk; and nodes connected through an SCI-based shared-memory interconnect.]
FIGURE 2.5
Three ways to connect cluster nodes (P/C: Processor and Cache; M: Memory; D: Disk; NIC: Network
Interface Circuitry; MIO: Memory-I/O Bridge.)
(Courtesy of Hwang and Xu [14] )
compute nodes dominate in system cost, because we may have 1,000 times more compute nodes
than service nodes in a single large clustered system. Table 2.3 introduces two example compute
node architectures: homogeneous design and hybrid node design.
In the past, most MPPs were built with a homogeneous architecture by interconnecting a large
number of the same compute nodes. In 2010, the Cray XT5 Jaguar system was built with 224,162
AMD Opteron processors with six cores each. The Tianhe-1A adopted a hybrid node design using two
Xeon CPUs plus one NVIDIA GPU in each compute node. The GPU could be replaced by special
floating-point accelerators. A homogeneous node design makes it easier to program and maintain
the system.
[Figure 2.6 shows the Blue Gene/L packaging hierarchy: a chip with 2 processors (2.8/5.6 Gflops, 4 MB); a compute card with 2 chips (5.6/11.2 Gflops, 0.5 GB DDR); a node board with 32 chips on 16 compute cards (90/180 Gflops, 8 GB DDR); a cabinet with 32 node boards in an 8 x 8 x 16 configuration (2.9/5.7 Tflops, 256 GB DDR); and the full system of 64 cabinets in a 64 x 32 x 32 configuration (180/360 Tflops, 16 TB DDR).]
FIGURE 2.6
The IBM Blue Gene/L architecture built with modular components packaged hierarchically in five levels.
(Courtesy of N. Adiga, et al., IBM Corp., 2005 [1] )
Top-500 list—topped the Japanese Earth Simulator. The system was upgraded to score a 478 Tflops speed
in 2007. By examining the architecture of the Blue Gene series, we reveal the modular construction of a
scalable MPP system as shown in Figure 2.6. With modular packaging, the Blue Gene/L system is con-
structed hierarchically from processor chips to 64 physical racks. This system was built with a total of
65,536 nodes with two PowerPC 440 FP2 processors per node. The 64 racks are interconnected by a
huge 3D 64 x 32 x 32 torus network.
In the lower-left corner, we see a dual-processor chip. Two chips are mounted on a computer card.
Sixteen computer cards (32 chips or 64 processors) are mounted on a node board. A cabinet houses
32 node boards with an 8 x 8 x 16 torus interconnect. Finally, 64 cabinets (racks) form the total
system at the upper-right corner. This packaging diagram corresponds to the 2005 configuration. Cus-
tomers can order any size to meet their computational needs. The Blue Gene cluster was designed to
achieve scalable performance, reliability through built-in testability, resilience by preserving locality of
failures and checking mechanisms, and serviceability through partitioning and isolation of fault
locations.
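The packaging numbers in Figure 2.6 multiply out consistently; the brief C check below is illustrative only, recomputing the node and processor counts and the system peak from the per-chip figure of 5.6 Gflops.

#include <stdio.h>

int main(void) {
    /* Blue Gene/L packaging hierarchy, per Figure 2.6 (2005 configuration). */
    int chips_per_card  = 2;
    int cards_per_board = 16;
    int boards_per_rack = 32;
    int racks           = 64;

    int chips = chips_per_card * cards_per_board * boards_per_rack * racks;
    printf("nodes (chips) = %d\n", chips);         /* 65,536 dual-processor nodes */
    printf("processors    = %d\n", 2 * chips);     /* 131,072 PowerPC processors  */

    double peak_gflops_per_chip = 5.6;             /* upper figure quoted per chip */
    printf("system peak ~ %.0f Tflops\n",
           peak_gflops_per_chip * chips / 1000.0); /* ~367 Tflops, ~360 in figure  */
    return 0;
}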
• Myrinet: available link speeds of 1.28 Gbps (M-XP) and 10 Gbps (M-10G); MPI latency of about 3 µs; has a network processor; supports any topology (typically a Clos network); source-based, cut-through routing; stop-and-go flow control
• Quadrics QsNet: available link speeds of 2.8 Gbps (QsNet) and 7.2 Gbps (QsNetII); MPI latency of about 3 µs; has a network processor; supports any topology (typically a fat tree); source-based, cut-through routing; wormhole flow control
• InfiniBand: available link speeds of 2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X); MPI latency of about 4.5 µs; has a network processor; supports any topology (typically a fat tree); destination-based routing; credit-based flow control
• Gigabit Ethernet: available link speed of 1 Gbps; MPI latency of about 40 µs; no network processor; supports any topology; destination-based routing; 802.3x flow control
FIGURE 2.7
Google search engine cluster architecture.
(Courtesy of Google, Inc. [6] )
[Figure 2.8 chart: number of Top 500 systems (0 to 300) using each class of high-bandwidth interconnect, plotted from Nov-03 through Nov-08.]
FIGURE 2.8
Distribution of high-bandwidth interconnects in the Top 500 systems from 2003 to 2008.
(Courtesy of www.top500.org [25] )
[Figure 2.9 shows an InfiniBand fabric of switches and routers connecting the cluster's host nodes to a storage subsystem and controller (attached through a TCA), I/O chassis, consoles, LANs, WANs, and other InfiniBand subnets.]
FIGURE 2.9
The InfiniBand system fabric built in a typical high-performance computer cluster.
(Source: O. Celebioglu, et al, “Exploring InfiniBand as an HPC Cluster Interconnect”, Dell Power Solutions,
Oct.2004 © 2011 Dell Inc. All Rights Reserved )
[Figure 2.10 shows, for each CPU/GPU node running Linux, a stack of IP packet applications on top of Linux OS extensions (a high-availability interface, hardware drivers, I/O drivers, and platform management drivers) and hardware support for hot-swap devices and router interfaces.]
FIGURE 2.10
Middleware, Linux extensions, and hardware support for achieving massive parallelism and HA in a Linux
cluster system built with CPUs and GPUs.
chip is capable of achieving up to 1 Tflops for single-precision (SP) arithmetic, and more than
80 Gflops for double-precision (DP) calculations. Recent HPC-optimized GPUs contain up to 4 GB
of on-board memory, and are capable of sustaining memory bandwidths exceeding 100 GB/second.
GPU clusters are built with a large number of GPU chips. GPU clusters have already demonstrated
their capability to achieve Pflops performance in some of the Top 500 systems. Most GPU clusters
are structured with homogeneous GPUs of the same hardware class, make, and model. The software
used in a GPU cluster includes the OS, GPU drivers, and clustering API such as an MPI.
The high performance of a GPU cluster is attributed mainly to its massively parallel multicore
architecture, high throughput in multithreaded floating-point arithmetic, and significantly reduced
time in massive data movement using large on-chip cache memory. In other words, GPU clusters
already are more cost-effective than traditional CPU clusters. GPU clusters result in not only a
quantum jump in speed performance, but also significantly reduced space, power, and cooling
demands. A GPU cluster can operate with a reduced number of operating system images, compared
with CPU-based clusters. These reductions in power, environment, and management complexity
make GPU clusters very attractive for use in future HPC applications.
[Figure 2.11 shows the proposed GPU chip with L2 cache banks L2_0 through L2_1023, memory controllers (MC) connecting to off-chip DRAMs, and a network interface (NI) connecting to the network routers.]
FIGURE 2.11
The proposed GPU chip design for 20 Tflops performance and 1.6 TB/s memory bandwidth in the Echelon system.
(Courtesy of Bill Dally, Reprinted with Permission [10] )
academic interest to illustrate how one can explore the many-core GPU technology to achieve
Exascale computing in the future.
[Figure 2.12 shows the Echelon hierarchy: a processor chip (PC) containing streaming multiprocessors SM0 through SM127, each with an L0 cache, sharing an LCO and a NoC that also connects DRAM cubes and NV RAM; Node 0 (N0) delivers 20 Tflops with 1.6 TB/s and 256 GB; Module 0 (M0), holding nodes N0 through N7, delivers 160 Tflops with 12.8 TB/s and 2 TB; Cabinet 0 (C0), holding modules M0 through M15, delivers 2.6 Pflops with 205 TB/s and 32 TB; cabinets C0 through CN form the Echelon system, supported by a self-aware OS, a self-aware runtime, and a locality-aware compiler and autotuner.]
FIGURE 2.12
The architecture of NVIDIA Echelon system built with a hierarchical network of GPUs that can deliver
2.6 Pflops per cabinet, and takes at least N = 400 cabinets to achieve the desired Eflops performance.
(Courtesy of Bill Dally, Reprinted with Permission [10] )
Each compute module features a performance of 160 Tflops and 12.8 TB/s over 2 TB of memory.
Thus, a single cabinet can house 128 GPU nodes or 16,000 processor cores. Each cabinet has the
potential to deliver 2.6 Pflops over 32 TB memory and 205 TB/s bandwidth. The N cabinets are
interconnected by a Dragonfly network with optical fiber.
To achieve Eflops performance, we need to use at least N = 400 cabinets. In total, an Exascale
system needs 327,680 processor cores in 400 cabinets. The Echelon system is supported by a self-
aware OS and runtime system. The Echelon system is also designed to preserve locality with the
support of a compiler and autotuner. At present, the NVIDIA Fermi (GF110) chip has 512 stream
processors; thus the Echelon design is about 25 times faster. It is highly likely that the Echelon will
employ a post-Maxwell NVIDIA GPU planned to appear in the 2013-2014 time frame.
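The cabinet and system figures quoted above follow directly from the per-node numbers; the short C check below is illustrative only and simply multiplies them out.

#include <stdio.h>

int main(void) {
    /* Echelon building blocks as described in the text. */
    double node_tflops      = 20.0;   /* Tflops per GPU node              */
    int    nodes_per_module = 8;      /* nodes N0 through N7              */
    int    modules_per_cab  = 16;     /* modules M0 through M15           */
    int    cabinets         = 400;    /* minimum cabinet count for Eflops */

    double module_tflops  = node_tflops * nodes_per_module;            /* 160 Tflops   */
    double cabinet_pflops = module_tflops * modules_per_cab / 1000.0;  /* 2.56 Pflops  */
    double system_eflops  = cabinet_pflops * cabinets / 1000.0;        /* ~1.02 Eflops */

    printf("module = %.0f Tflops, cabinet = %.2f Pflops, system = %.3f Eflops\n",
           module_tflops, cabinet_pflops, system_eflops);
    return 0;
}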
The following saxpy_parallel routine is written in CUDA C code for parallel execution by 256
threads/block on many processing cores on the GPU chip. Note that n blocks are handled by n pro-
cessing cores, where n could be on the order of hundreds of blocks.
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one array element per thread
    if (i < n) y[i] = a * x[i] + y[i];
}
// Invoke the parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
This is a good example of using CUDA C to exploit massive parallelism on a cluster of multi-
core and multithreaded processors using CUDA GPGPUs as building blocks.
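For readers who want to run the kernel end to end, the sketch below adds a plausible host-side driver; it is our illustration, not code from the book, and the array size is hypothetical. It allocates device buffers with cudaMalloc, copies x and y to the GPU, launches the kernel, and copies the result back.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;                       /* 1M elements (illustrative size) */
    size_t bytes = n * sizeof(float);
    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;                              /* device copies of x and y */
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;               /* 256 threads per block, as above */
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, dx, dy);
    cudaDeviceSynchronize();

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expected 4.0)\n", hy[0]); /* 2.0*1.0 + 2.0 = 4.0 */

    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}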
available for using Python, Perl, FORTRAN, Java, Ruby, Lua, MATLAB, and IDL. CUDA has
been used to accelerate nongraphical applications in computational biology, cryptography, and other
fields by an order of magnitude or more. A good example is the BOINC distributed computing cli-
ent. CUDA provides both a low-level API and a higher-level API. CUDA works with all NVIDIA
GPUs from the G8X series onward, including the GeForce, Quadro, and Tesla lines. NVIDIA states
that programs developed for the GeForce 8 series will also work without modification on all future
NVIDIA video cards due to binary compatibility.
• Symmetry A user can use a cluster service from any node. In other words, all cluster services
and functionalities are symmetric to all nodes and all users, except those protected by access
rights.
• Location-transparent The user is not aware of the whereabouts of the physical device that
eventually provides a service. For instance, the user can use a tape drive attached to any cluster
node as though it were physically attached to the local node.
The main motivation to have SSI is that it allows a cluster to be used, controlled, and main-
tained as a familiar workstation is. The word “single” in “single-system image” is sometimes synon-
ymous with “global” or “central.” For instance, a global file system means a single file hierarchy,
which a user can access from any node. A single point of control allows an operator to monitor and
configure the cluster system. Although there is an illusion of a single system, a cluster service or
functionality is often realized in a distributed manner through the cooperation of multiple compo-
nents. A main requirement (and advantage) of SSI techniques is that they provide both the perfor-
mance benefits of distributed implementation and the usability benefits of a single image.
From the viewpoint of a process P, cluster nodes can be classified into three types. The home
node of a process P is the node where P resided when it was created. The local node of a process
P is the node where P currently resides. All other nodes are remote nodes to P. Cluster nodes can
be configured to suit different needs. A host node serves user logins through Telnet, rlogin, or even
FTP and HTTP. A compute node is one that performs computational jobs. An I/O node is one that
serves file I/O requests. If a cluster has large shared disks and tape units, they are normally physi-
cally attached to I/O nodes.
There is one home node for each process, which is fixed throughout the life of the process. At
any time, there is only one local node, which may or may not be the host node. The local node and
remote nodes of a process may change when the process migrates. A node can be configured to
provide multiple functionalities. For instance, a node can be designated as a host, an I/O node, and
a compute node at the same time. The illusion of an SSI can be obtained at several layers, three of
which are discussed in the following list. Note that these layers may overlap with one another.
• Application software layer Two examples are parallel web servers and various parallel
databases. The user sees an SSI through the application and is not even aware that he is using a
cluster. This approach demands the modification of workstation or SMP applications for clusters.
• Hardware or kernel layer Ideally, SSI should be provided by the operating system or by the
hardware. Unfortunately, this is not a reality yet. Furthermore, it is extremely difficult to provide
an SSI over heterogeneous clusters. With most hardware architectures and operating systems
being proprietary, only the manufacturer can use this approach.
• Middleware layer The most viable approach is to construct an SSI layer just above the OS kernel.
This approach is promising because it is platform-independent and does not require application
modification. Many cluster job management systems have already adopted this approach.
Each computer in a cluster has its own operating system image. Thus, a cluster may display
multiple system images due to the stand-alone operations of all participating node computers. Deter-
mining how to merge the multiple system images in a cluster is as difficult as regulating many indi-
vidual personalities in a community to a single personality. With different degrees of resource
sharing, multiple systems could be integrated to achieve SSI at various operational levels.
The DNS translates the symbolic name and returns the IP address 159.226.41.150 of the least-
loaded node, which happens to be node Host1. The user then logs in using this IP address. The
DNS periodically receives load information from the host nodes to make load-balancing translation
decisions. In the ideal case, if 200 users simultaneously log in, the login sessions are evenly distrib-
uted among our hosts with 50 users each. In effect, the cluster behaves like a single host that is four
times as powerful.
FIGURE 2.13
Realizing a single entry point using a load-balancing domain name system (DNS).
(Courtesy of Hwang and Xu [14] )
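A minimal sketch of this load-balancing translation is given below. It is our illustration only; the host table, the extra IP addresses, and the selection policy are hypothetical, and a real DNS front end would refresh the load fields periodically from the load reports mentioned above.

#include <stdio.h>

/* Hypothetical host table kept by a load-balancing DNS front end. */
struct host { const char *name; const char *ip; double load; };

static struct host hosts[] = {
    { "Host1", "159.226.41.150", 0.15 },   /* IP address from the example text */
    { "Host2", "159.226.41.151", 0.60 },   /* remaining entries are invented   */
    { "Host3", "159.226.41.152", 0.45 },
    { "Host4", "159.226.41.153", 0.80 },
};

/* Resolve the cluster's symbolic name to the IP of the least-loaded host. */
static const char *resolve_least_loaded(void) {
    int n = sizeof(hosts) / sizeof(hosts[0]);
    int best = 0;
    for (int i = 1; i < n; ++i)
        if (hosts[i].load < hosts[best].load) best = i;
    return hosts[best].ip;
}

int main(void) {
    printf("login directed to %s\n", resolve_least_loaded());   /* Host1's address */
    return 0;
}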
FIGURE 2.14
Three types of storage in a single file hierarchy. Solid lines show what process P can access and the
dashed line shows what P may be able to access.
(Courtesy of Hwang and Xu [14] )
attached. Any process on any node can use any network and I/O device as though it were attached
to the local node. Single networking means any node can access any network connection.
Single Point of Control: The system administrator should be able to configure, monitor, test,
and control the entire cluster and each individual node from a single point. Many clusters help
with this through a system console that is connected to all nodes of the cluster. The system
console is normally connected to an external LAN (not shown in Figure 2.15) so that the
administrator can log in remotely to the system console from anywhere in the LAN to perform
administration work.
Note that single point of control does not mean all system administration work should be
carried out solely by the system console. In reality, many administrative functions are
distributed across the cluster. It means that controlling a cluster should be no more difficult than
administering an SMP or a mainframe. It implies that administration-related system information
(such as various configuration files) should be kept in one logical place. The administrator
monitors the cluster with one graphics tool, which shows the entire picture of the cluster, and
the administrator can zoom in and out at will.
Single point of control (or single point of management) is one of the most challenging issues
in constructing a cluster system. Techniques from distributed and networked system
management can be transferred to clusters. Several de facto standards have already been
developed for network management. An example is Simple Network Management Protocol
(SNMP). It demands an efficient cluster management package that integrates with the availability
support system, the file system, and the job management system.
Single Memory Space: Single memory space gives users the illusion of a big, centralized main
memory, which in reality may be a set of distributed local memory spaces. PVPs, SMPs, and DSMs
have an edge over MPPs and clusters in this respect, because they allow a program to utilize all
global or local memory space. A good way to test if a cluster has a single memory space is to run a
sequential program that needs a memory space larger than any single node can provide.
Suppose each node in Figure 2.15 has 2 GB of memory available to users. An ideal single
memory image would allow the cluster to execute a sequential program that needs 8 GB of
memory. This would enable a cluster to operate like an SMP system. Several approaches have
FIGURE 2.15
A cluster with single networking, single I/O space, single memory, and single point of control.
(Courtesy of Hwang and Xu [14] )
been attempted to achieve a single memory space on clusters. Another approach is to let the
compiler distribute the data structures of an application across multiple nodes. It is still a
challenging task to develop a single memory scheme that is efficient, platform-independent, and
able to support sequential binary codes.
Single I/O Address Space: Assume the cluster is used as a web server. The web information
database is distributed between the two RAIDs. An HTTP daemon is started on each node to
handle web requests, which come from all four network connections. A single I/O space implies
that any node can access the two RAIDs. Suppose most requests come from the ATM network.
It would be beneficial if the functions of the HTTP on node 3 could be distributed to all four
nodes. The following example shows a distributed RAID-x architecture for I/O-centric cluster
computing [9].
Example 2.6 Single I/O Space over Distributed RAID for I/O-Centric Clusters
A distributed disk array architecture was proposed by Hwang, et al. [9] for establishing a single I/O space
in I/O-centric cluster applications. Figure 2.16 shows the architecture for a four-node Linux PC cluster, in
which three disks are attached to the SCSI bus of each host node. All 12 disks form an integrated RAID-x
with a single address space. In other words, all PCs can access both local and remote disks. The addres-
sing scheme for all disk blocks is interleaved horizontally. Orthogonal stripping and mirroring make it possi-
ble to have a RAID-1 equivalent capability in the system.
The shaded blocks are images of the blank blocks. A disk block and its image will be mapped on dif-
ferent physical disks in an orthogonal manner. For example, the block B0 is located on disk D0. The
image block M0 of block B0 is located on disk D3. The four disks D0, D1, D2, and D3 are attached to four
servers, and thus can be accessed in parallel. Any single disk failure will not lose the data block, because
its image is available in recovery. All disk blocks are labeled to show image mapping. Benchmark experi-
ments show that this RAID-x is scalable and can restore data after any single disk failure. The distributed
RAID-x has improved aggregate I/O bandwidth in both parallel read and write operations over all physical
disks in the cluster.
[Figure 2.16 shows the 12 disks D0 through D11 (three per host node) on the cluster network; data blocks B0 through B35 are interleaved horizontally across the disks, and each image block Mj is stored on a disk attached to a different node than its data block Bj (for example, B0 resides on D0 while its image M0 resides on D3).]
FIGURE 2.16
Distributed RAID architecture with a single I/O space over 12 distributed disks attached to 4 host computers
in the cluster (Di stands for Disk i, Bj for disk block j, Mj an image for blocks Bj, P/M for processor/memory
node, and CDD for cooperative disk driver.)
(Courtesy of Hwang, Jin, and Ho [13] )
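To make the placement idea concrete, the toy sketch below assigns each data block and its image to disks attached to different nodes. It only illustrates the property that matters for single-disk recovery; it is not the exact orthogonal striping and mirroring layout of RAID-x, which clusters the image blocks differently, as Figure 2.16 shows.

#include <stdio.h>

#define NODES 4                        /* host nodes in the cluster     */
#define DISKS_PER_NODE 3               /* disks on each node's SCSI bus */
#define NDISKS (NODES * DISKS_PER_NODE)

/* Data blocks are interleaved horizontally across all 12 disks. */
static int data_disk(int b) { return b % NDISKS; }

/* Toy mirror placement (not the real RAID-x layout): put the image on a disk
   owned by the next node, so a block and its image never share a node. */
static int image_disk(int b) {
    int d = data_disk(b);
    int node = d % NODES;                      /* D0..D3 on nodes 0..3, and so on */
    int next_node = (node + 1) % NODES;
    return (d / NODES) * NODES + next_node;    /* same disk tier, next node       */
}

int main(void) {
    for (int b = 0; b < 8; ++b)
        printf("B%-2d -> D%-2d   image M%-2d -> D%-2d\n",
               b, data_disk(b), b, image_disk(b));
    return 0;
}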
• Middleware support for SSI clustering As shown in Figure 2.17, various SSI features are
supported by middleware developed at three cluster application levels:
• Management level This level handles user applications and provides a job management system
such as GLUnix, MOSIX, Load Sharing Facility (LSF), or Codine.
• Programming level This level provides single file hierarchy (NFS, xFS, AFS, Proxy) and
distributed shared memory (TreadMark, Wind Tunnel).
FIGURE 2.17
Relationship among clustering middleware at the job management, programming, and implementation levels.
(Courtesy of K. Hwang, H. Jin, C.L. Wang and Z. Xu [16] )
• Implementation level This level supports a single process space, checkpointing, process
migration, and a single I/O space. These features must interface with the cluster hardware and
OS platform. The distributed disk array, RAID-x, in Example 2.6 implements a single I/O space.
then repeats. A system’s reliability is measured by the mean time to failure (MTTF), which is the
average time of normal operation before the system (or a component of the system) fails. The metric
for serviceability is the mean time to repair (MTTR), which is the average time it takes to repair the
system and restore it to working condition after it fails. The availability of a system is defined by:
Availability = MTTF/(MTTF + MTTR) (2.1)
FIGURE 2.18
The operate-repair cycle of a computer system.
state and start over. For instance, we all have rebooted our PC to take care of transient failures such
as a frozen keyboard or window. Permanent failures cannot be corrected by rebooting. Some hard-
ware or software component must be repaired or replaced. For instance, rebooting will not work if
the system hard disk is broken.
When a node fails in the clusters in Figure 2.19(b) and Figure 2.19(c), not only will the node
applications all fail, but also all node data cannot be used until the node is repaired. The shared
disk cluster in Figure 2.19(d) provides a remedy. The system stores persistent data on the shared
disk, and periodically checkpoints to save intermediate results. When one WS node fails, the data
will not be lost in this shared-disk cluster.
[Figure 2.19 panels: (a) an SMP in which the processors share a bus and the memory where the OS resides; (b) a cluster of workstations, each running its own OS, connected by an Ethernet; (c) a dual-networked cluster using an Ethernet plus a high-speed network; (d) a shared-disk cluster in which the workstations share a RAID over the Ethernet.]
FIGURE 2.19
Single points of failure (SPF) in an SMP and in three clusters, where greater redundancy eliminates more
SPFs in systems from (a) to (d).
(Courtesy of Hwang and Xu [14] )
N versions all run simultaneously and their results are constantly compared. If the results differ, the
system is notified that a fault has occurred. But because of isolated redundancy, it is extremely unli-
kely that the fault will cause a majority of the N versions to fail at the same time. So the system
continues working, with the correct result generated by majority voting. In a highly available but
less mission-critical system, only one version needs to run at a time. Each version has a built-in
self-test capability. When one version fails, another version can take over.
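The voting step at the heart of this N-version scheme is easy to sketch. The short Python fragment below is only a conceptual illustration, not part of any particular fault-tolerance package; the three version functions and the majority threshold are illustrative assumptions.

from collections import Counter

def n_version_execute(versions, x):
    # Run all N independently developed versions and compare their results.
    results = [version(x) for version in versions]
    answer, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        # Isolated redundancy makes this case unlikely, but it must be detected.
        raise RuntimeError("no majority: fault affects too many versions")
    return answer

# Three versions of the same routine; the third one contains a fault.
versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x * x + 1]
print(n_version_execute(versions, 7))   # prints 49; the faulty version is outvoted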
node through both networks. The master node will detect a failure if it does not receive messages for a
beat (10 seconds) from a node and will make the following diagnoses:
• A node’s connection to one of the two networks failed if the master receives a heartbeat from the node
through one network but not the other.
• The node failed if the master does not receive a heartbeat through either network. It is assumed that
the chance of both networks failing at the same time is negligible.
The failure diagnosis in this example is simple, but it has several pitfalls. What if the master node fails? Is
the 10-second heartbeat period too long or too short? What if the heartbeat messages are dropped by the net-
work (e.g., due to network congestion)? Can this scheme accommodate hundreds of nodes? Practical HA sys-
tems must address these issues. A popular trick is to use the heartbeat messages to carry load information so
that when the master receives the heartbeat from a node, it knows not only that the node is alive, but also the
resource utilization status of the node. Such load information is useful for load balancing and job management.
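The two diagnosis rules above can be captured in a few lines. The following Python sketch assumes two networks named net-A and net-B and the 10-second heartbeat period; the data structures and function names are illustrative, not taken from any real HA package.

import time

HEARTBEAT_PERIOD = 10.0                 # seconds, as in the example above
NETWORKS = ("net-A", "net-B")
last_seen = {}                          # (node, network) -> time of last heartbeat
load = {}                               # node -> utilization piggybacked on the heartbeat

def receive_heartbeat(node, network, utilization):
    last_seen[(node, network)] = time.time()
    load[node] = utilization            # also usable for load balancing and job management

def diagnose(node):
    now = time.time()
    alive = [now - last_seen.get((node, net), -1e9) < HEARTBEAT_PERIOD for net in NETWORKS]
    if all(alive):
        return "healthy"
    if any(alive):
        return "connection to one network failed"
    return "node failed"                # assumes both networks rarely fail together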
Once a failure is diagnosed, the system notifies the components that need to know the failure event.
Failure notification is needed because the master node is not the only one that needs to have this informa-
tion. For instance, in case of the failure of a node, the DNS needs to be told so that it will not connect
more users to that node. The resource manager needs to reassign the workload and to take over the
remaining workload on that node. The system administrator needs to be alerted so that she can initiate
proper actions to repair the node.
Assume one of the cluster nodes fails every 100 hours. Other parts of the cluster never fail. Steps 1
through 3 take two hours. On average, the mean time for step 4 is two hours. What is the availability of
the cluster? What is the yearly failure cost if each one-hour downtime costs $82,500?
Solution: The cluster’s MTTF is 100 hours; the MTTR is 2 + 2 = 4 hours. According to Table 2.5, the
availability is 100/104 = 96.15 percent. This corresponds to 337 hours of downtime in a year, and the
failure cost is $82500 × 337, that is, more than $27 million.
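The arithmetic in this solution is easy to check with a few lines of Python, assuming a year of 365 × 24 = 8,760 hours:

def availability(mttf, mttr):
    # Eq. (2.1): Availability = MTTF / (MTTF + MTTR)
    return mttf / (mttf + mttr)

A = availability(100, 4)                     # 0.9615...
downtime = (1 - A) * 8760                    # about 337 hours of downtime per year
cost = downtime * 82_500                     # about $27.8 million
print(f"A = {A:.2%}, downtime = {downtime:.0f} h/yr, cost = ${cost:,.0f}")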
handled by this runtime support. This approach is used widely because it has the advantage that
user applications do not have to be modified.
A main problem is that most current checkpointing libraries are static, meaning the application
source code (or at least the object code) must be available. It does not work if the application is in
the form of executable code. A third approach requires the user (or the compiler) to insert check-
pointing functions in the application; thus, the application has to be modified, and the transparency
is lost. However, it has the advantage that the user can specify where to checkpoint. This is helpful
to reduce checkpointing overhead. Checkpointing incurs both time and storage overheads.
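A minimal sketch of the third, user-directed approach is shown below in Python. It assumes the application state fits in a small dictionary that can be pickled; real checkpointing libraries capture far more (the address space, open files, and message state), but the structure, in which the programmer decides what to save and where to place the checkpoint calls, is the same.

import os, pickle

CKPT_FILE = "app.ckpt"                          # hypothetical checkpoint file name

def checkpoint(state):
    with open(CKPT_FILE + ".tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace(CKPT_FILE + ".tmp", CKPT_FILE)   # atomic rename keeps the old checkpoint intact

def restore():
    if os.path.exists(CKPT_FILE):
        with open(CKPT_FILE, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "result": 0.0}

state = restore()
for i in range(state["iteration"], 1000):
    state["result"] += i * 0.001                # stand-in for the real computation
    state["iteration"] = i + 1
    if state["iteration"] % 100 == 0:           # checkpoint only where the user chose to
        checkpoint(state)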
Here, MTTF is the system’s mean time to failure. This MTTF accounts the time consumed to
save one checkpoint, and h is the average percentage of normal computation performed in a check-
point interval before the system fails. The parameter h is always in the range. After a system is
restored, it needs to spend h × (checkpoint interval) time to recompute.
[Figure 2.20 labels: checkpoint i, checkpoint i + 1, the failure point, the checkpoint saving time tc, and the fraction h of the checkpoint interval completed before the failure.]
FIGURE 2.20
Time parameters between two checkpoints.
(Courtesy of Hwang and Xu [14] )
[Figure 2.21 shows snapshot lines a, b, c, and d cutting across the time lines of three processes P, Q, and R; the dots s and x on P, t and v on Q, and u and z on R mark local checkpoints, and y marks a message exchanged between the processes.]
FIGURE 2.21
Consistent and inconsistent checkpoints in a parallel program.
(Courtesy of Hwang and Xu [14] )
(represented by dots), one from every process. In addition, some communication states may need to be
saved. The intersection of a snapshot line with a process’s time line indicates where the process should
take a (local) checkpoint. Thus, the program’s snapshot c consists of three local checkpoints: s, t, u for
processes P, Q, and R, respectively, plus saving the communication y.
Table 2.6 Job Scheduling Issues and Schemes for Cluster Nodes (columns: Issue, Scheme, and Key Problems)
requirement can be static or dynamic. A static scheme fixes the number of nodes assigned to a single job for
its entire lifetime. It may underutilize the cluster resources, and it cannot handle the situation in which the
needed nodes become unavailable, such as when the workstation owner shuts down the machine.
Dynamic resource allocation allows a job to acquire or release nodes during execution. However, it is
much more difficult to implement, requiring cooperation between a running job and the job management
system (JMS). The jobs make asynchronous requests to the JMS to add or delete resources, and the JMS
notifies a job when resources become available. The asynchrony means that a job should not be delayed
(blocked) by the request/notification. Cooperation between jobs and the JMS requires modification of the
programming languages/libraries. A primitive mechanism for such cooperation exists in PVM and MPI.
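The request/notification pattern between a running job and the JMS can be sketched as follows. The class and function names are hypothetical; a production JMS would run in a separate process and speak a real messaging protocol rather than an in-process queue.

import queue, threading

class ToyJMS:
    def __init__(self, free_nodes):
        self.free_nodes = list(free_nodes)

    def request_nodes(self, n, notify_queue):
        def grant():                                  # handled in the background
            granted = self.free_nodes[:n]
            del self.free_nodes[:n]
            notify_queue.put(granted)                 # notify the job when nodes are ready
        threading.Thread(target=grant).start()        # the asynchronous request does not block the job

jms = ToyJMS(["n1", "n2", "n3", "n4"])
notifications = queue.Queue()
jms.request_nodes(2, notifications)                   # the job asks for two more nodes ...
# ... and keeps computing here instead of waiting ...
print("acquired:", notifications.get())               # picks up the grant when convenient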
FIGURE 2.22
The tiling technique for scheduling more jobs to cluster nodes to shorten the total makespan and thus
increase the job throughput.
(Courtesy of Hwang and Xu [14] )
fed to a simulator to generate various statistical and performance results, based on different sequen-
tial and parallel workload combinations, resource allocations, and scheduling policies. The following
workload characteristics are based on a NAS benchmark experiment. Of course, different workloads
may have variable statistics.
• Roughly half of parallel jobs are submitted during regular working hours. Almost 80 percent of
parallel jobs run for three minutes or less. Parallel jobs running longer than 90 minutes account
for 50 percent of the total time.
• The sequential workload shows that 60 percent to 70 percent of workstations are available to
execute parallel jobs at any time, even during peak daytime hours.
• On a workstation, 53 percent of all idle periods are three minutes or less, but 95 percent of idle
time is spent in periods of time that are 10 minutes or longer.
• A 2:1 rule applies, which says that a network of 64 workstations, with proper JMS software, can
sustain a 32-node parallel workload in addition to the original sequential workload. In other
words, clustering gives a supercomputer half of the cluster size for free!
However, once it starts execution, it stays in that node. A package that does support dynamic
process migration is Condor.
• All packages allow dynamic suspension and resumption of a user job by the user or by the
administrator. All packages allow resources (e.g., nodes) to be dynamically added to or deleted.
• Most packages provide both a command-line interface and a graphical user interface. Besides
UNIX security mechanisms, most packages use the Kerberos authentication system.
• The lsmake utility is a parallel version of the UNIX make utility, allowing a makefile to be
processed in multiple nodes simultaneously.
is most suitable for running compute-intensive applications with low to moderate amounts of I/O.
Tests of MOSIX2 show that the performance of several such applications over a 1 GB/second
campus grid is nearly identical to that of a single cluster. Here are some interesting features of
MOSIX2:
• Users can log in on any node and do not need to know where their programs run.
• There is no need to modify or link applications with special libraries.
• There is no need to copy files to remote nodes, thanks to automatic resource discovery and
workload distribution by process migration.
• Users can load-balance and migrate processes from slower to faster nodes and from nodes that
run out of free memory.
• Sockets are migratable for direct communication among migrated processes.
• The system features a secure runtime environment (sandbox) for guest processes.
• The system can run batch jobs with checkpoint recovery along with tools for automatic
installation and configuration scripts.
[Figure 2.23 plots execution time in seconds (0 to 1,800) against cluster size (8, 16, 24, and 32 nodes) for MOSIX and PVM.]
FIGURE 2.23
Performance of the memory-ushering algorithm using MOSIX versus PVM.
(Courtesy of A. Barak and O. La’adan [5] )
FIGURE 2.24
The Tianhe-1A system built by the National University of Defense Technology and installed at the National
Supercomputer Center, Tianjin, China, in 2010 [11].
[Figure 2.25 subsystems: a compute subsystem of CPU-plus-GPU nodes, a service subsystem of operation nodes, a monitoring and diagnosis subsystem, a communication subsystem, and a storage subsystem built from an MDS and multiple OSS servers.]
FIGURE 2.25
Abstract architecture of the Tianhe-1A system.
connected via PCI-E. A blade has two nodes and is 2U in height (Figure 2.25). The complete sys-
tem has 14,336 Intel sockets (Westmere) plus 7,168 NVIDIA Fermi boards plus 2,048 Galaxy sock-
ets (the Galaxy processor-based nodes are used as frontend processing for the system). A compute
node has two Intel sockets plus a Fermi board plus 32 GB of memory.
FIGURE 2.26
Calculation of the theoretical peak speed of Tianhe-1A system.
The total system has a theoretical peak of 4.7 Pflops as calculated in Figure 2.26. Note
that there are 448 CUDA cores in each GPU node. The peak speed is achieved through 14,336
Xeon CPUs (with 380,064 cores) and 7,168 Tesla GPUs (with 448 CUDA cores per node and
3,496,884 CUDA cores in total). There are 3,876,948 processing cores in both the CPU and GPU
chips. An operational node has two eight-core Galaxy chips (1 GHz, SPARC architecture) plus
32 GB of memory. The Tianhe-1A system is packaged in 112 compute cabinets, 12 storage cabinets,
six communications cabinets, and eight I/O cabinets.
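The calculation summarized in Figure 2.26 can be reproduced approximately from publicly reported component figures. The per-socket clock rate (2.93 GHz Xeon X5670, four flops per cycle per core) and the 515 Gflops double-precision peak per Tesla M2050 GPU used below are taken from vendor data sheets rather than from the figure, so the sketch should be read as an estimate:

cpu_peak = 14_336 * 6 * 2.93e9 * 4      # sockets x cores x clock x flops per cycle, about 1.01 Pflops
gpu_peak = 7_168 * 515e9                # GPUs x double-precision peak per GPU, about 3.69 Pflops
rpeak = cpu_peak + gpu_peak             # about 4.70 Pflops, matching the 4.7 Pflops quoted above
rmax = 2.566e15                         # Linpack result reported in the November 2010 Top 500 list
print(f"Rpeak = {rpeak/1e15:.2f} Pflops, Linpack efficiency = {rmax/rpeak:.1%}")   # roughly 55 percent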
The operation nodes are composed of two eight-core Galaxy FT-1000 chips. These processors
were designed by NUDT and run at 1 GHz. The theoretical peak of the eight-core chip is
8 Gflops. The complete system has 1,024 of these operational nodes, each having
32 GB of memory. These operational nodes are intended to function as service nodes for job crea-
tion and submission. They are not intended as general-purpose computational nodes. Their speed is
excluded from the calculation of the peak or sustained speed. The GPU portion of the peak speed is
calculated as 3.692 Pflops [11], contributed by the 7,168 compute nodes (448 CUDA cores per GPU per
compute node) working in parallel with the 14,336 six-core CPUs in four subsystems.
The system has total disk storage of 2 petabytes implemented with a Lustre clustered file sys-
tem. There are 262 terabytes of main memory distributed in the cluster system. The Tianhe-1A epi-
tomizes modern heterogeneous CPU/GPU computing, enabling significant achievements in
performance, size, and power. The system would require more than 50,000 CPUs and twice as
much floor space to deliver the same performance using CPUs alone. A 2.507-petaflop system built
entirely with CPUs would consume at least 12 megawatts, which is three times more power than
what the Tianhe-1A consumes.
FIGURE 2.27
The ARCH fat-tree interconnect in two stages of high-bandwidth switches [11].
[Figure 2.28 components: a programming environment (MPI, OpenMP, a parallel debugger, and heterogeneous optimization), a compiler system (C/C++ and Fortran 77/90/95), a visualization system, and an operating system kernel with energy management, security isolation, and a parallel file system.]
FIGURE 2.28
Software architecture of the Tianhe-1A supercomputer [11].
The system has a Linpack efficiency (Rmax/Rpeak) of 54.58 percent, which is much lower than the 75 percent efficiency
achieved by Jaguar and Roadrunner. Listed below are some applications of Tianhe-1A. Most of
them are specially tailored to satisfy China’s national needs.
• Parallel AMR (Adaptive Mesh Refinement) method
• Parallel eigenvalue problems
• Parallel fast multipole methods
• Parallel computing models
• Gridmol computational chemistry
• ScGrid middleware, grid portal
• PSEPS parallel symmetric eigenvalue package solvers
• FMM-radar fast multipole methods on radar cross sections
• Transplant many open source software programs
• Sandstorm prediction, climate modeling, EM scattering, or cosmology
• CAD/CAE for automotive industry
[Figure 2.29 components of the Cray SeaStar2+ chip: a HyperTransport interface, a DMA engine, a six-port router, a memory/control-processor interface and bridge, and an embedded PowerPC 440 processor.]
FIGURE 2.29
The interconnect SeaStar router chip design in the Cray XT5 Jaguar supercomputer.
(Courtesy of Cray, Inc. [9] and Oak Ridge National Laboratory, United States, 2009 )
Department of Energy, in the United States. The entire Jaguar system is built with 86 cabinets. The
following are some interesting architectural and operational features of the Jaguar system:
• Built with AMD six-core Opteron processors running Linux at a 2.6 GHz clock rate
• Has a total of 224,162 cores on more than 37,360 processors in 88 cabinets in four rows (there
are 1,536 or 2,304 processor cores per cabinet)
• Features 8,256 compute nodes and 96 service nodes interconnected by a 3D torus network, built
with Cray SeaStar2+ chips
• Attained a sustained speed, Rmax, from the Linpack Benchmark test of 1.759 Pflops
• Largest Linpack matrix size tested recorded as Nmax = 5,474,272 unknowns
The basic building blocks are the compute blades. The interconnect router in the SeaStar2+ chip
(Figure 2.29) provides six high-speed links to six neighbors in the 3D torus, as seen in Figure 2.30.
The system is scalable by design from small to large configurations. The entire system has 129 TB
of compute memory. In theory, the system was designed with a peak speed of Rpeak = 2.331 Pflops.
In other words, only 75 percent (=1.759/2.331) efficiency was achieved in Linpack experiments.
The external I/O interface uses 10 Gbps Ethernet and InfiniBand links. MPI 2.1 was applied in mes-
sage-passing programming. The system consumes 32–43 kW per cabinet. With 160 cabinets, the
entire system consumes up to 6.950 MW. The system is cooled with forced air, which consumes
a lot of electricity.
[Figure 2.30 details: compute nodes (each with a Cray SeaStar2+ chip and 4–32 GB of memory) and service nodes (with dual PCI-Express links) are joined by the 3D interconnect; each torus link provides 9.6 GB/s, alongside a 6.4 GB/s direct-connect link and 25.6 GB/s of direct-connect memory bandwidth.]
FIGURE 2.30
The 3D torus interconnect in the Cray XT5 Jaguar supercomputer.
(Courtesy of Cray, Inc. [9] and Oak Ridge National Laboratory, United States, 2009 )
links which connect to six neighbors in the 3D torus. The peak bidirectional bandwidth of each link
is 9.6 GB/second with sustained bandwidth in excess of 6 GB/second. Each port is configured with
an independent router table, ensuring contention-free access for packets.
The router is designed with a reliable link-level protocol with error correction and retransmis-
sion, ensuring that message-passing traffic reliably reaches its destination without the costly timeout
and retry mechanism used in typical clusters. The torus interconnect directly connects all the nodes
in the Cray XT5 system, eliminating the cost and complexity of external switches and allowing for
easy expandability. This allows systems to economically scale to tens of thousands of nodes—
well beyond the capacity of fat-tree switches. The interconnect carries all message-passing and I/O
traffic to the global file system.
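The topological properties mentioned here translate directly into bandwidth figures. The sketch below estimates the bisection bandwidth of a torus from the 9.6 GB/second link rate cited above; the 16 × 16 × 24 dimensions are purely hypothetical, chosen only to show the arithmetic.

def torus_bisection_bw(x, y, z, link_bw):
    # Cut the largest dimension in half: every ring that crosses the cut
    # contributes two links because of the wraparound connection.
    dims = sorted((x, y, z))
    rings_cut = dims[0] * dims[1]          # one ring per node position in the two smaller dimensions
    return 2 * rings_cut * link_bw

print(torus_bisection_bw(16, 16, 24, 9.6), "GB/s")   # 4915.2 GB/s for this hypothetical shape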
The Cray XT5 3D torus architecture is designed for superior MPI performance in HPC applica-
tions. This is accomplished by incorporating dedicated compute nodes and service nodes. Compute
nodes are designed to run MPI tasks efficiently and reliably to completion. Each compute node is
composed of one or two AMD Opteron microprocessors (dual or quad core) and direct attached
memory, coupled with a dedicated communications resource. Service nodes are designed to provide
system and I/O connectivity and also serve as login nodes from which jobs are compiled and
launched. The I/O bandwidth of each compute node is designed for 25.6 GB/second performance.
[Figure 2.31 elements: the PPE and the SPEs connected by the element interconnect bus (EIB), the memory interface controller (MIC) with XIO links to RAM, and the FlexIO interfaces IOIF_0 and IOIF_1 reached through the BEI.]
FIGURE 2.31
Schematic of the IBM Cell processor architecture.
(Courtesy of IBM, http://www.redbooks.ibm.com/redpapers/pdfs/redp4477.pdf [28] )
operating on a shared memory model. The rack is built with TriBlade servers, which are connected by
an InfiniBand network. In order to sustain this compute power, the connectivity within each node con-
sists of four PCI Express x8 links, each capable of 2 GB/s transfer rates, with a 2 μs latency. The
expansion slot also contains the InfiniBand interconnect, which allows communications to the rest of
the cluster. The capability of the InfiniBand interconnect is rated at 2 GB/s with a 2 μs latency.
computing in two edited volumes [7]. Two books on Linux clusters are [20,23]. HA clusters are
treated in [24]. Recent assessment of HPC interconnects can be found in [6,8,12,22]. The Google
cluster interconnect was reported by Barroso, et al. [6]. The use of GPUs for supercomputing was discussed
in [10]. GPU clusters were studied in [19]. CUDA parallel programming for GPUs is treated in [31].
MOSIX/OS for cluster or grid computing is treated in [4,5,30].
Hwang, Jin, and Ho developed a distributed RAID system for achieving a single I/O space in a
cluster of PCs or workstations [13–17]. More details of LSF can be found in Zhou [35]. The Top 500
list was cited from the release in June and November 2010 [25]. The material on the Tianhe-1A can
be found in Dongarra [11] and on Wikipedia [29]. The IBM Blue Gene/L architecture was reported
by Adiga, et al. [1] and subsequently upgraded to a newer model called the Blue Gene/P solution.
The IBM Roadrunner was reported by Kevin, et al. [18] and also in Wikipedia [28]. The Cray XT5
and Jaguar systems are described in [9]. China’s Nebulae supercomputer was reported in [27]. Speci-
fic cluster applications and checkpointing techniques can be found in [12,16,17,24,32,34]. Cluster
applications can be found in [7,15,18,21,26,27,33,34].
Acknowledgments
This chapter is authored jointly by Kai Hwang of USC and Jack Dongarra of UTK. Some cluster
material is borrowed from the earlier book [14] by Kai Hwang of USC and Zhiwei Xu of the
Chinese Academy of Sciences. Valuable suggestions to update the material were made by Rajkumar
Buyya of the University of Melbourne.
References
[1] N. Adiga, et al., An overview of the Blue Gene/L supercomputer, in: ACM Supercomputing Conference
2002, November 2002, http://SC-2002.org/paperpdfs/pap.pap207.pdf.
[2] D. Bader, R. Pennington, Cluster computing applications, Int. J. High Perform. Comput. (May) (2001).
[3] M. Baker, et al., Cluster computing white paper. http://arxiv.org/abs/cs/0004014, January 2001.
[4] A. Barak, A. Shiloh, The MOSIX Management Systems for Linux Clusters, Multi-Clusters and Clouds.
White paper, www.MOSIX.org//txt_pub.html, 2010.
[5] A. Barak, O. La’adan, The MOSIX multicomputer operating systems for high-performance cluster comput-
ing, Future Gener. Comput. Syst. 13 (1998) 361–372.
[6] L. Barroso, J. Dean, U. Holzle, Web search for a planet: The Google cluster architecture, IEEE Micro.
23 (2) (2003) 22–28.
[7] R. Buyya (Ed.), High-Performance Cluster Computing. Vols. 1 and 2, Prentice Hall, New Jersey, 1999.
[8] O. Celebioglu, R. Rajagopalan, R. Ali, Exploring InfiniBand as an HPC cluster interconnect, (October)
(2004).
[9] Cray, Inc, CrayXT System Specifications. www.cray.com/Products/XT/Specifications.aspx, January 2010.
[10] B. Dally, GPU Computing to Exascale and Beyond, Keynote Address, ACM Supercomputing Conference,
November 2010.
[11] J. Dongarra, Visit to the National Supercomputer Center in Tianjin, China, Technical Report, University of
Tennessee and Oak Ridge National Laboratory, 20 February 2011.
[12] J. Dongarra, Survey of present and future supercomputer architectures and their interconnects, in: Interna-
tional Supercomputer Conference, Heidelberg, Germany, 2004.
[13] K. Hwang, H. Jin, R.S. Ho, Orthogonal striping and mirroring in distributed RAID for I/O-Centric cluster
computing, IEEE Trans. Parallel Distrib. Syst. 13 (2) (2002) 26–44.
[14] K. Hwang, Z. Xu, Support of clustering and availability, in: Scalable Parallel Computing, McGraw-Hill,
1998, Chapter 9.
[15] K. Hwang, C.M. Wang, C.L. Wang, Z. Xu, Resource scaling effects on MPP performance: STAP bench-
mark implications, IEEE Trans. Parallel Distrib. Syst. (May) (1999) 509–527.
[16] K. Hwang, H. Jin, E. Chow, C.L. Wang, Z. Xu, Designing SSI clusters with hierarchical checkpointing
and single-I/O space, IEEE Concurrency (January) (1999) 60–69.
[17] H. Jin, K. Hwang, Adaptive sector grouping to reduce false sharing of distributed RAID clusters, J. Clust.
Comput. 4 (2) (2001) 133–143.
[18] J. Kevin, et al., Entering the petaflop era: the architecture and performance of Roadrunner, www.c3.lanl
.gov/~kei/mypubbib/papers/SC08:Roadrunner.pdf, November 2008.
[19] V. Kindratenko, et al., GPU Clusters for High-Performance Computing, National Center for Supercomput-
ing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 2009.
[20] K. Kopper, The Linux Enterprise Cluster: Building a Highly Available Cluster with Commodity Hardware
and Free Software, No Starch Press, San Francisco, CA, 2005.
[21] S.W. Lin, R.W. Lau, K. Hwang, X. Lin, P.Y. Cheung, Adaptive parallel image rendering on multiproces-
sors and workstation clusters. IEEE Trans. Parallel Distrib. Syst. 12 (3) (2001) 241–258.
[22] J. Liu, D.K. Panda, et al., Performance comparison of MPI implementations over InfiniBand, Myrinet and
Quadrics, (2003).
[23] R. Lucke, Building Clustered Linux Systems, Prentice Hall, New Jersey, 2005.
[24] E. Marcus, H. Stern, Blueprints for High Availability: Designing Resilient Distributed Systems, Wiley.
[25] TOP500.org. Top-500 World’s fastest supercomputers, www.top500.org, November 2010.
[26] G.F. Pfister, In Search of Clusters, second ed., Prentice-Hall, 2001.
[27] N.H. Sun, China’s Nebulae Supercomputer, Institute of Computing Technology, Chinese Academy of
Sciences, July 2010.
[28] Wikipedia, IBM Roadrunner. http://en.wikipedia.org/wiki/IBM_Roadrunner, 2010, (accessed 10.01.10).
[29] Wikipedia, Tianhe-1. http://en.wikipedia.org/wiki/Tianhe-1, 2011, (accessed 5.02.11).
[30] Wikipedia, MOSIX. http://en.wikipedia.org/wiki/MOSIX, 2011, (accessed 10.02.11).
[31] Wikipedia, CUDA. http://en.wikipedia.org/wiki/CUDA, 2011, (accessed 19.02.11).
[32] K. Wong, M. Franklin, Checkpointing in distributed computing systems, J. Parallel Distrib. Comput.
(1996) 67–75.
[33] Z. Xu, K. Hwang, Designing superservers with clusters and commodity components. Annual Advances in
Scalable Computing, World Scientific, Singapore, 1999.
[34] Z. Xu, K. Hwang, MPP versus clusters for scalable computing, in: Proceedings of the 2nd IEEE Interna-
tional Symposium on Parallel Architectures, Algorithms, and Networks, June 1996, pp. 117–123.
[35] S. Zhou, LSF: Load Sharing and Batch Queuing Software, Platform Computing Corp., Canada, 1996.
HOMEWORK PROBLEMS
Problem 2.1
Differentiate and exemplify the following terms related to clusters:
a. Compact versus slack clusters
b. Centralized versus decentralized clusters
Problem 2.2
This problem refers to the redundancy technique. Assume that when a node fails, it takes 10 seconds
to diagnose the fault and another 30 seconds for the workload to be switched over.
a. What is the availability of the cluster if planned downtime is ignored?
b. What is the availability of the cluster if the cluster is taken down one hour per week for
maintenance, but one node at a time?
Problem 2.3
This is a research project to evaluate the cluster architectures of four supercomputers built in recent
years. Study the details of the No. 1 supercomputer, the Tianhe-1A, which was announced in the
Top 500 list released in November 2010. Your study should include the following:
a. Conduct an in-depth evaluation of the Tianhe-1A architecture, hardware components, operating
system, software support, parallelizing compilers, packaging, cooling, and new applications.
b. Compare the relative strengths and limitations of the Tianhe-1A with respect to the three case-
study systems: the Jaguar, Nebulae, and Roadrunner, studied in Section 2.5. Use tabulations or
plot curves, if you find enough benchmark data to conduct the comparison study.
Problem 2.4
This problem consists of two parts related to cluster computing:
1. Define and distinguish among the following terms on scalability:
a. Scalability over machine size
b. Scalability over problem size
c. Resource scalability
d. Generation scalability
2. Explain the architectural and functional differences among three availability cluster
configurations: hot standby, active takeover, and fault-tolerant clusters. Give two example
commercial cluster systems in each availability cluster configuration. Comment on their relative
strengths and weaknesses in commercial applications.
Problem 2.5
Distinguish between multiprocessors and multicomputers based on their structures, resource sharing,
and interprocessor communications.
a. Explain the differences among UMA, NUMA, COMA, DSM, and NORMA memory models.
b. What are the additional functional features of a cluster that are not found in a conventional
network of autonomous computers?
c. What are the advantages of a clustered system over a traditional SMP server?
Problem 2.6
Study the five research virtual cluster projects listed in Table 2.6 and answer the following ques-
tions regarding the coverage on COD and Violin experience given in Sections 2.5.3 and 2.5.4:
a. From the viewpoints of dynamic resource provisioning, evaluate the five virtual clusters and
discuss their relative strengths and weaknesses based on the open literature.
b. Report on the unique contribution from each of the five virtual cluster projects in terms of the
hardware setting, software tools, and experimental environments developed and performance
results reported.
Problem 2.7
This problem is related to the use of high-end x86 processors in HPC system construction. Answer
the following questions:
a. Referring to the latest Top 500 list of supercomputing systems, list all systems that have used
x86 processors. Identify the processor models and key processor characteristics such as number
of cores, clock frequency, and projected performance.
b. Some have used GPUs to complement the x86 CPUs. Identify those systems that have
procured substantial GPUs. Discuss the roles of GPUs to provide peak or sustained flops per
dollar.
Problem 2.8
Assume a sequential computer has 512 MB of main memory and enough disk space. The disk read/
write bandwidth for a large data block is 1 MB/second. The following code needs to apply
checkpointing:
do 1000 iterations
A = foo (C from last iteration) /* this statement takes 10 minutes */
B = goo (A) /* this statement takes 10 minutes */
C = hoo (B) /* this statement takes 10 minutes */
end do
A, B, and C are arrays of 120 MB each. All other parts of the code, operating system, libraries
take, at most, 16 MB of memory. Assume the computer fails exactly once, and the time to restore
the computer is ignored.
a. What is the worst-case execution time for the successful completion of the code if checkpointing
is performed?
b. What is the worst-case execution time for the successful completion of the code if plain
transparent checkpointing is performed?
c. Is it beneficial to use forked checkpointing with (b)?
d. What is the worst-case execution time for the code if user-directed checkpointing is performed?
Show the code where user directives are added.
e. What is the worst-case execution time of the code if forked checkpointing is used with (d)?
Problem 2.9
Compare the latest Top 500 list with the Top 500 Green List of HPC systems. Discuss a few top
winners and losers in terms of energy efficiency in power and cooling costs. Reveal the green-
energy winners’ stories and report their special design features, packaging, cooling, and manage-
ment policies that make them the winners. How different are the ranking orders in the two lists?
Discuss their causes and implications based on publicly reported data.
Problem 2.10
This problem is related to processor selection and system interconnects used in building the top
three clustered systems with commercial interconnects in the latest Top 500 list.
a. Compare the processors used in these clusters and identify their strengths and weaknesses in
terms of potential peak floating-point performance.
b. Compare the commercial interconnects of these three clusters. Discuss their potential performance
in terms of their topological properties, network latency, bisection bandwidth, and hardware used.
Problem 2.11
Study Example 2.6 and the original paper [13] reporting the distributed RAID-x architecture and
performance results. Answer the following questions with technical justifications or evidence:
a. Explain how the RAID-x system achieved a single I/O address space across distributed disks
attached to cluster nodes.
b. Explain the functionality of the cooperative disk drivers (CDDs) implemented in the RAID-x
system. Comment on its application requirements and scalability based on current PC
architecture, SCSI bus, and SCSI disk technology.
c. Explain why RAID-x has a fault-tolerance capability equal to that of the RAID-5 architecture.
d. Explain the strengths and limitations of RAID-x, compared with other RAID architectures.
Problem 2.12
Study the relevant material in Sections 2.2 and 2.5 and compare the system interconnects of the IBM
Blue Gene/L, IBM Roadrunner, and Cray XT5 supercomputers released in the November 2009 Top
500 evaluation. Dig deeper to reveal the details of these systems. These systems may use custom-
designed routers in interconnects. Some also use some commercial interconnects and components.
a. Compare the basic routers or switches used in the three system interconnects in terms of
technology, chip design, routing scheme, and claimed message-passing performance.
b. Compare the topological properties, network latency, bisection bandwidth, and hardware
packaging of the three system interconnects.
Problem 2.13
Study the latest and largest commercial HPC clustered system built by SGI, and report on the clus-
ter architecture in the following technical and benchmark aspects:
a. What is the SGI system model and its specification? Illustrate the cluster architecture with a
block diagram and describe the functionality of each building block.
b. Discuss the claimed peak performance and reported sustained performance from SGI.
c. What are the unique hardware, software, networking, or design features that contribute to the
claimed performance in Part (b)? Describe or illustrate those system features.
Problem 2.14
Consider in Figure 2.32 a server-client cluster with an active-takeover configuration between two
identical servers. The servers share a disk via a SCSI bus. The clients (PCs or workstations) and the
Ethernet are fail-free. When a server fails, its workload is switched to the surviving server.
a. Assume that each server has an MTTF of 200 days and an MTTR of five days. The disk has an
MTTF of 800 days and an MTTR of 20 days. In addition, each server is shut down for
maintenance for one day every week, during which time that server is considered unavailable.
Only one server is shut down for maintenance at a time. The failure rates cover both natural
failures and scheduled maintenance. The SCSI bus has a failure rate of 2 percent. The servers
and the disk fail independently. The disk and SCSI bus have no scheduled shutdown. The client
machine will never fail.
1. The servers are considered available if at least one server is available. What is the combined
availability of the two servers?
2. In normal operation, the cluster must have the SCSI bus, the disk, and at least one server
available simultaneously. What are the possible single points of failure in this cluster?
b. The cluster is considered unacceptable if both servers fail at the same time. Furthermore, the
cluster is declared unavailable when either the SCSI bus or the disk is down. Based on the
aforementioned conditions, what is the system availability of the entire cluster?
c. Under the aforementioned failure and maintenance conditions, propose an improved architecture
to eliminate all single points of failure identified in Part (a).
[Figure 2.32: Server 1 and Server 2 attached to an Ethernet and sharing a disk array over a SCSI bus.]
FIGURE 2.32
An HA cluster with redundant hardware components.
Problem 2.15
Study various cluster job scheduling policies in Table 2.6 and answer the following questions. You
may need to gather more information from Wikipedia, Google, or other sources if any of the sche-
duling policies are new to you.
a. Explain the advantages and disadvantages of nonpreemptive and preemptive scheduling policies
and suggest methods to amend the problems.
b. Repeat Part (a) for static and dynamic scheduling policies.
c. Repeat Part (a) for dedicated and space-sharing scheduling policies.
d. Compare the relative performance of time-sharing, independent, and gang scheduling policies.
e. Compare the relative performance in stay and migrating policies on local jobs against remote jobs.
Problem 2.16
Study various SSI features and HA support for clusters in Section 2.3 and answer the following
questions, providing reasons for your answers. Identify some example cluster systems that are
equipped with these features. Comment on their implementation requirements and discuss the opera-
tional obstacles to establish each SSI feature in a cluster system.
a. Single entry point in a cluster environment
b. Single memory space in a cluster system
c. Single file hierarchy in a cluster system
d. Single I/O space in a cluster system
e. Single network space in a cluster system
f. Single networking in a cluster system
g. Single point of control in a cluster system
h. Single job management in a cluster system
i. Single user interface in a cluster system
j. Single process space in a cluster system
Problem 2.17
Use examples to explain the following terms on cluster job management systems.
a. Serial jobs versus parallel jobs
b. Batch jobs versus interactive jobs
c. Cluster jobs versus foreign (local) jobs
d. Cluster processes, local processes, and kernel processes
e. Dedicated mode, space-sharing mode, and timesharing mode
f. Independent scheduling versus gang scheduling
Problem 2.18
This problem focuses on the concept of LSF:
a. Give an example of each of the four types of LSF jobs.
b. For a 1,000-server cluster, give two reasons why the LSF load-sharing policy is better if (1) the
entire cluster has one master LIM or (2) all LIMs are masters.
c. In the LSF master-election scheme, a node in the “no master” state waits for a time period
proportional to the node number before becoming a new master. Why is the wait time
proportional to the node number?
Problem 2.19
This problem is related to the use of MOSIX for cluster computing. Check with the open literature
on current features that have been claimed by designers and developers in supporting Linux clus-
ters, GPU clusters, multiclusters, and even virtualized clouds. Discuss the advantages and shortcom-
ings from the user’s perspective.
Problem 2.20
Compare China’s Tianhe-1A with the Cray Jaguar in terms of their relative strengths and weak-
nesses in architecture design, resource management, software environment, and reported applica-
tions. You may need to conduct some research to find the latest developments regarding these
systems. Justify your assessment with reasoning and evidential information.
CHAPTER 3
Virtual Machines and Virtualization of Clusters and Data Centers
SUMMARY
The reincarnation of virtual machines (VMs) presents a great opportunity for parallel, cluster, grid,
cloud, and distributed computing. Virtualization technology benefits the computer and IT industries
by enabling users to share expensive hardware resources by multiplexing VMs on the same set of
hardware hosts. This chapter covers virtualization levels, VM architectures, virtual networking,
virtual cluster construction, and virtualized data-center design and automation in cloud computing.
In particular, the designs of dynamically structured clusters, grids, and clouds are presented with
VMs and virtual clusters.
[Figure 3.1: before virtualization, applications run directly on the host machine; after virtualization, a virtualization layer hosts several virtual machines, each running a guest OS and its applications.]
FIGURE 3.1
The architecture of a computer system before and after virtualization, where VMM stands for virtual machine
monitor.
[Figure 3.2 depicts five abstraction levels for virtualization, from the instruction set architecture level at the bottom up to the application level at the top.]
FIGURE 3.2
Virtualization ranging from hardware to applications in five abstraction levels.
process-level virtualization. The most popular approach is to deploy high level language (HLL)
VMs. In this scenario, the virtualization layer sits as an application program on top of the operating
system, and the layer exports an abstraction of a VM that can run programs written and compiled
to a particular abstract machine definition. Any program written in the HLL and compiled for this
VM will be able to run on it. The Microsoft .NET CLR and Java Virtual Machine (JVM) are two
good examples of this class of VM.
Other forms of application-level virtualization are known as application isolation, application
sandboxing, or application streaming. The process involves wrapping the application in a layer that
is isolated from the host OS and other applications. The result is an application that is much easier
to distribute and remove from user workstations. An example is the LANDesk application virtuali-
zation platform which deploys software applications as self-contained, executable files in an isolated
environment without requiring installation, system modifications, or elevated security privileges.
3.1.1.6 Relative Merits of Different Approaches
Table 3.1 compares the relative merits of implementing virtualization at various levels. The column
headings correspond to four technical merits. “Higher Performance” and “Application Flexibility”
are self-explanatory. “Implementation Complexity” implies the cost to implement that particular vir-
tualization level. “Application Isolation” refers to the effort required to isolate resources committed
to different VMs. Each row corresponds to a particular level of virtualization.
The number of X’s in the table cells reflects the advantage points of each implementation level.
Five X’s implies the best case and one X implies the worst case. Overall, hardware and OS support
will yield the highest performance. However, the hardware and application levels are also the most
expensive to implement. User isolation is the most difficult to achieve. ISA implementation offers
the best application flexibility.
Table 3.1 Relative Merits of Virtualization at Various Levels (More “X”s Means Higher Merit, with a
Maximum of 5 X’s). Each row is a level of implementation; the columns are Higher Performance,
Application Flexibility, Implementation Complexity, and Application Isolation.
There are three requirements for a VMM. First, a VMM should provide an environment for pro-
grams which is essentially identical to the original machine. Second, programs run in this environ-
ment should show, at worst, only minor decreases in speed. Third, a VMM should be in complete
control of the system resources. Any program run under a VMM should exhibit a function identical
to that which it runs on the original machine directly. Two possible exceptions in terms of differ-
ences are permitted with this requirement: differences caused by the availability of system resources
and differences caused by timing dependencies. The former arises when more than one VM is run-
ning on the same machine.
The hardware resource requirements, such as memory, of each VM are reduced, but the sum of
them is greater than that of the real machine installed. The latter qualification is required because of
the intervening level of software and the effect of any other VMs concurrently existing on the same
hardware. Obviously, these two differences pertain to performance, while the function a VMM pro-
vides stays the same as that of a real machine. However, the identical environment requirement
excludes the behavior of the usual time-sharing operating system from being classed as a VMM.
A VMM should demonstrate efficiency in the use of the VMs. No one will prefer a VMM over a
physical machine if the VMM’s efficiency is too low. Traditional emulators and complete software
interpreters (simulators) emulate each instruction by means of functions or macros. Such a method
provides the most flexible solutions for VMMs. However, emulators or simulators are too slow to
be used as real machines. To guarantee the efficiency of a VMM, a statistically dominant subset
of the virtual processor’s instructions needs to be executed directly by the real processor, with no
software intervention by the VMM. Table 3.2 compares four hypervisors and VMMs that are in
use today.
Complete control of these resources by a VMM includes the following aspects: (1) The VMM is
responsible for allocating hardware resources for programs; (2) it is not possible for a program to
access any resource not explicitly allocated to it; and (3) it is possible under certain circumstances
for a VMM to regain control of resources already allocated. Not all processors satisfy these require-
ments for a VMM. A VMM is tightly related to the architectures of processors. It is difficult to
implement a VMM for some types of processors, such as the x86. Specific limitations include the
inability to trap on some privileged instructions. If a processor is not designed to support virtualiza-
tion primarily, it is necessary to modify the hardware to satisfy the three requirements for a VMM.
This is known as hardware-assisted virtualization.
[Figure 3.3 elements: user and root environments created from OpenVZ templates on top of the OpenVZ layer, which runs inside the host OS above the hardware and network.]
FIGURE 3.3
The OpenVZ virtualization layer inside the host OS, which provides some OS images to create VMs quickly.
(Courtesy of OpenVZ User’s Guide [65] )
computing, the first and second benefits can be used to overcome the defects of slow initialization of
VMs at the hardware level, and being unaware of the current application state, respectively.
example, the chroot command in a UNIX system can create several virtual root directories within a
host OS. These virtual root directories are the root directories of all VMs created.
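A minimal illustration of this mechanism, using Python's os.chroot wrapper around the same UNIX facility, is shown below. The path /srv/vm1 is only an example; the call requires root privileges and a directory already populated with the files the confined process needs.

import os

def enter_virtual_root(new_root="/srv/vm1"):
    # After these two calls, "/" for this process refers to new_root,
    # so the process sees only the file tree prepared for its VM partition.
    os.chroot(new_root)
    os.chdir("/")

# enter_virtual_root()   # uncomment on a prepared system running as root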
There are two ways to implement virtual root directories: duplicating common resources to each
VM partition; or sharing most resources with the host environment and only creating private
resource copies on the VM on demand. The first way incurs significant resource costs and overhead
on a physical machine. This issue neutralizes the benefits of OS-level virtualization, compared with
hardware-assisted virtualization. Therefore, OS-level virtualization is often a second choice.
• Linux vServer for Linux platforms (http://linux-vserver.org/): extends Linux kernels to implement a security
mechanism that helps build VMs by setting resource limits and file attributes and changing the root
environment for VM isolation.
• OpenVZ for Linux platforms [65] (http://ftp.openvz.org/doc/OpenVZ-Users-Guide.pdf): supports virtualization
by creating virtual private servers (VPSes); each VPS has its own files, users, process tree, and virtual
devices, which can be isolated from other VPSes, and checkpointing and live migration are supported.
• FVM (Feather-Weight Virtual Machines) for virtualizing Windows NT platforms [78]: uses system call
interfaces to create VMs at the NT kernel space; multiple VMs are supported by virtualized namespaces
and copy-on-write.
Linux servers. Each VPS has its own files, users and groups, process tree, virtual network, virtual devices,
and IPC through semaphores and messages.
The resource management subsystem of OpenVZ consists of three components: two-level disk alloca-
tion, a two-level CPU scheduler, and a resource controller. The amount of disk space a VM can use is set
by the OpenVZ server administrator. This is the first level of disk allocation. Each VM acts as a standard
Linux system. Hence, the VM administrator is responsible for allocating disk space for each user and
group. This is the second-level disk quota. The first-level CPU scheduler of OpenVZ decides which VM to
give the time slice to, taking into account the virtual CPU priority and limit settings.
The second-level CPU scheduler is the same as that of Linux. OpenVZ has a set of about 20 parameters
which are carefully chosen to cover all aspects of VM operation. Therefore, the resources that a VM can use
are well controlled. OpenVZ also supports checkpointing and live migration. The complete state of a VM can
quickly be saved to a disk file. This file can then be transferred to another physical machine and the VM can
be restored there. It only takes a few seconds to complete the whole process. However, there is still a delay
in processing because the established network connections are also migrated.
The WABI offers middleware to convert Windows system calls to Solaris system calls. Lxrun is
really a system call emulator that enables Linux applications written for x86 hosts to run on UNIX
systems. Similarly, Wine offers library support for virtualizing x86 processors to run Windows appli-
cations on UNIX hosts. Visual MainWin offers a compiler support system to develop Windows appli-
cations using Visual Studio to run on some UNIX hosts. The vCUDA is explained in Example 3.2
with a graphical illustration in Figure 3.4.
[Figure 3.4 elements: the CUDA library and device driver in the host OS, the vCUDA library and vGPU in the guest OS, and the VMM underneath.]
FIGURE 3.4
Basic concept of the vCUDA architecture.
(Courtesy of Lin Shi, et al. © IEEE [57] )
and interprets remote requests and creates a corresponding execution context for the API calls from the
guest OS, then returns the results to the guest OS. The vCUDA stub also manages actual physical resource
allocation.
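The forwarding idea behind vCUDA, in which the guest-side library packages an API call and the host-side stub executes it on the real library, can be illustrated with a toy example. The code below is a conceptual sketch only: it stays inside one process, uses made-up function names, and does not reflect the actual CUDA or vCUDA interfaces.

import json

# Host side: the stub owns the "real" library (represented here by ordinary functions).
REAL_LIBRARY = {"vector_add": lambda a, b: [x + y for x, y in zip(a, b)]}

def host_stub(request_bytes):
    call = json.loads(request_bytes)                      # decode the forwarded API call
    result = REAL_LIBRARY[call["name"]](*call["args"])    # execute it on the host
    return json.dumps(result).encode()                    # send the result back

# Guest side: the substitute library serializes the call instead of touching hardware.
def guest_call(name, *args):
    request = json.dumps({"name": name, "args": args}).encode()
    return json.loads(host_stub(request))                 # in vCUDA this hop crosses the VM boundary as an RPC

print(guest_call("vector_add", [1, 2, 3], [10, 20, 30]))  # [11, 22, 33]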
[Figure 3.5 elements: Domain0 and the guest domains (XenoLinux and XenoWindows), each running its own applications, sit on top of the Xen hypervisor, which controls the hardware devices.]
FIGURE 3.5
The Xen architecture’s special domain 0 for control and I/O, and several guest domains for user applications.
(Courtesy of P. Barham, et al. [7] )
particular controls the others. The guest OS, which has control ability, is called Domain 0, and the
others are called Domain U. Domain 0 is a privileged guest OS of Xen. It is first loaded when Xen
boots without any file system drivers being available. Domain 0 is designed to access hardware
directly and manage devices. Therefore, one of the responsibilities of Domain 0 is to allocate and
map hardware resources for the guest domains (the Domain U domains).
For example, Xen is based on Linux and its security level is C2. Its management VM is named
Domain 0, which has the privilege to manage other VMs implemented on the same host. If Domain
0 is compromised, the hacker can control the entire system. So, in the VM system, security policies
are needed to improve the security of Domain 0. Domain 0, behaving as a VMM, allows users to
create, copy, save, read, modify, share, migrate, and roll back VMs as easily as manipulating a file,
which flexibly provides tremendous benefits for users. Unfortunately, it also brings a series of
security problems during the software life cycle and data lifetime.
Traditionally, a machine’s lifetime can be envisioned as a straight line where the current state of
the machine is a point that progresses monotonically as the software executes. During this time, con-
figuration changes are made, software is installed, and patches are applied. In such an environment,
the VM state is akin to a tree: At any point, execution can go into N different branches where multiple
instances of a VM can exist at any point in this tree at any given time. VMs are allowed to roll back
to previous states in their execution (e.g., to fix configuration errors) or rerun from the same point
many times (e.g., as a means of distributing dynamic content or circulating a “live” system image).
the underlying hardware, binary translation must be adopted. Although the host-based architecture has
flexibility, the performance is too low to be useful in practice.
Although para-virtualization reduces the overhead, it has incurred other problems. First, its
compatibility and portability may be in doubt, because it must support the unmodified OS as
well. Second, the cost of maintaining para-virtualized OSes is high, because they may require
deep OS kernel modifications. Finally, the performance advantage of para-virtualization varies
greatly due to workload variations. Compared with full virtualization, para-virtualization is
relatively easy and more practical. The main problem in full virtualization is its low performance
in binary translation. To speed up binary translation is difficult. Therefore, many virtualization
products employ the para-virtualization architecture. The popular Xen, KVM, and VMware ESX
are good examples.
[Figure 3.9 elements: the console OS and several VMMs running directly on the x86 SMP hardware (CPU, memory, disk, and NIC).]
FIGURE 3.9
The VMware ESX server architecture using para-virtualization.
(Courtesy of VMware [71] )
VMware ESX Server File System. The service console is responsible for booting the system, initiating the
execution of the VMM and resource manager, and relinquishing control to those layers. It also facilitates
the process for system administrators.
At the time of this writing, many hardware virtualization products were available. The VMware
Workstation is a VM software suite for x86 and x86-64 computers. This software suite allows users
to set up multiple x86 and x86-64 virtual computers and to use one or more of these VMs simulta-
neously with the host operating system. The VMware Workstation assumes the host-based virtuali-
zation. Xen is a hypervisor for use in IA-32, x86-64, Itanium, and PowerPC 970 hosts. Actually,
Xen modifies Linux as the lowest and most privileged layer, or a hypervisor.
One or more guest OS can run on top of the hypervisor. KVM (Kernel-based Virtual Machine)
is a Linux kernel virtualization infrastructure. KVM can support hardware-assisted virtualization and
paravirtualization by using the Intel VT-x or AMD-v and VirtIO framework, respectively. The
VirtIO framework includes a paravirtual Ethernet card, a disk I/O controller, a balloon device for
adjusting guest memory usage, and a VGA graphics interface using VMware drivers.
Example 3.4 Hardware Support for Virtualization in the Intel x86 Processor
Since software-based virtualization techniques are complicated and incur performance overhead,
Intel provides a hardware-assist technique to make virtualization easy and improve performance.
Figure 3.10 provides an overview of Intel’s full virtualization techniques. For processor virtualization,
Intel offers the VT-x or VT-i technique. VT-x adds a privileged mode (VMX Root Mode) and some
instructions to processors. This enhancement traps all sensitive instructions in the VMM automatically.
For memory virtualization, Intel offers the EPT, which translates the virtual address to the machine’s
physical addresses to improve performance. For I/O virtualization, Intel implements VT-d and VT-c to
support this.
[Figure 3.10 elements: VT-x or VT-i for processor virtualization, EPT for memory virtualization, and VT-d and VT-c for I/O virtualization toward storage and the network.]
FIGURE 3.10
Intel hardware support for virtualization of processor, memory, and I/O devices.
(Modified from [68], Courtesy of Lizhong Chen, USC)
[Figure 3.11 elements: VMs running Ring 3 applications, the VMM running in VMX root mode, VM entry and VM exit transitions, the VM control structure (VMCS) configuration, memory and I/O virtualization, and processors with VT-x (or VT-i).]
FIGURE 3.11
Intel hardware-assisted CPU virtualization.
(Modified from [68], Courtesy of Lizhong Chen, USC)
CPU state for VMs, a set of additional instructions is added. At the time of this writing, Xen, VMware, and the
Microsoft Virtual PC all implement their hypervisors by using the VT-x technology.
Generally, hardware-assisted virtualization should have high efficiency. However, since the transition from
the hypervisor to the guest OS incurs high overhead switches between processor modes, it sometimes
cannot outperform binary translation. Hence, virtualization systems such as VMware now use a hybrid
approach, in which a few tasks are offloaded to the hardware but the rest is still done in software. In addition,
para-virtualization and hardware-assisted virtualization can be combined to improve the performance further.
[Figure 3.12 levels for VM1 and VM2: virtual memory (VA), physical memory (PA), and machine memory (MA).]
FIGURE 3.12
Two-level memory mapping procedure.
(Courtesy of R. Rblig, et al. [68] )
Since each page table of the guest OSes has a separate page table in the VMM corresponding to
it, the VMM page table is called the shadow page table. Nested page tables add another layer of
indirection to virtual memory. The MMU already handles virtual-to-physical translations as defined
by the OS. Then the physical memory addresses are translated to machine addresses using another
set of page tables defined by the hypervisor. Since modern operating systems maintain a set of
page tables for every process, the shadow page tables will get flooded. Consequently, the perfor-
mance overhead and cost of memory will be very high.
VMware uses shadow page tables to perform virtual-memory-to-machine-memory address trans-
lation. Processors use TLB hardware to map the virtual memory directly to the machine memory to
avoid the two levels of translation on every access. When the guest OS changes the virtual memory
to a physical memory mapping, the VMM updates the shadow page tables to enable a direct
lookup. The AMD Barcelona processor has featured hardware-assisted memory virtualization since
2007. It provides hardware assistance to the two-stage address translation in a virtual execution
environment by using a technology called nested paging.
FIGURE 3.13
Memory virtualization using EPT by Intel: a guest virtual address (GVA) is translated by the guest page tables (L4 to L1, rooted at the guest CR3) into a guest physical address (GPA), which the EPT MMU and EPT TLB in hardware translate into a host physical address (HPA). (The EPT, Intel's extended page table, is the hardware counterpart of the software shadow page table [68].)
If an access to a guest page table results in a page fault, the CPU will generate a page fault interrupt and will let the guest OS kernel
handle the interrupt. When the GPA of the L3 page table is obtained, the CPU will look in the EPT to get
the HPA of the L3 page table, as described earlier. To get the HPA corresponding to a GVA, the CPU
needs to look in the EPT five times, and each time the memory needs to be accessed four times. Therefore,
there are 20 memory accesses in the worst case, which is still very slow. To overcome this shortcoming,
Intel increased the size of the EPT TLB to decrease the number of memory accesses.
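The worst-case count quoted above follows from simple arithmetic under the text's model: a four-level guest page walk yields five guest physical addresses (four table pointers plus the final data address), and each GPA requires a four-level EPT walk. The short sketch below just reproduces that count; it is not a simulator of the actual hardware.

```python
# Worst-case memory-access count for a GVA -> HPA translation under EPT,
# following the model in the text (the numbers are the text's, not measured).
GUEST_LEVELS = 4                      # L4..L1 guest page tables
EPT_LEVELS = 4                        # memory accesses per EPT walk
gpas_to_translate = GUEST_LEVELS + 1  # four table pointers + final data GPA

ept_accesses = gpas_to_translate * EPT_LEVELS
print(ept_accesses)                   # 20 memory accesses in the worst case
```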
FIGURE 3.14
Device emulation for I/O virtualization, implemented inside the middle virtualization layer, which maps real I/O devices into the virtual devices used by the guest device driver. The virtualization layer emulates the virtual device, remaps guest and real I/O addresses, multiplexes and drives the physical device, and adds I/O features such as COW disks; the real device may be different from the virtual device.
(Courtesy of V. Chadha, et al. [10] and Y. Dong, et al. [15])
All the functions of a device or bus infrastructure, such as device enumeration, identification,
interrupts, and DMA, are replicated in software. This software is located in the VMM and acts as a
virtual device. The I/O access requests of the guest OS are trapped by the VMM, which interacts with
the real I/O devices. The full device emulation approach is shown in Figure 3.14.
A single hardware device can be shared by multiple VMs that run concurrently. However, software
emulation runs much slower than the hardware it emulates [10,15]. The para-virtualization method of
I/O virtualization is typically used in Xen. It is also known as the split driver model consisting of a
frontend driver and a backend driver. The frontend driver is running in Domain U and the backend dri-
ver is running in Domain 0. They interact with each other via a block of shared memory. The frontend
driver manages the I/O requests of the guest OSes and the backend driver is responsible for managing
the real I/O devices and multiplexing the I/O data of different VMs. Although para-I/O-virtualization
achieves better device performance than full device emulation, it comes with a higher CPU overhead.
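The split driver model can be pictured as a producer/consumer arrangement over shared memory. The sketch below is a toy, hypothetical rendering of that idea: per-guest frontends enqueue I/O requests into a shared ring, and a single backend in Domain 0 multiplexes them onto the real device. Xen's actual implementation uses grant tables and event channels rather than Python queues.

```python
# Minimal sketch (hypothetical) of Xen-style split I/O drivers:
# frontends in Domain U enqueue requests; one backend in Domain 0
# multiplexes them onto the physical device.
from collections import deque

shared_ring = deque()        # stands in for a shared-memory I/O ring

class FrontendDriver:
    def __init__(self, domain_id):
        self.domain_id = domain_id

    def submit(self, request):
        # The guest's I/O request goes to the backend, not to hardware.
        shared_ring.append((self.domain_id, request))

class BackendDriver:
    def __init__(self, real_device):
        self.real_device = real_device   # e.g., a function driving hardware

    def run_once(self):
        # Drain the ring, multiplexing I/O from different VMs onto one device.
        while shared_ring:
            domain_id, request = shared_ring.popleft()
            self.real_device(domain_id, request)

backend = BackendDriver(lambda dom, req: print(f"Dom0 issues {req} for DomU-{dom}"))
FrontendDriver(1).submit("read block 7")
FrontendDriver(2).submit("write block 3")
backend.run_once()
```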
Direct I/O virtualization lets the VM access devices directly. It can achieve close-to-native performance
without high CPU costs. However, current direct I/O virtualization implementations focus
on networking for mainframes, and there are many challenges for commodity hardware devices. For
example, when a physical device is reclaimed (required by workload migration) for later reassignment,
it may have been set to an arbitrary state (e.g., DMA to some arbitrary memory locations) in which it
functions incorrectly or even crashes the whole system. Since software-based I/O virtualization
requires a very high overhead of device emulation, hardware-assisted I/O virtualization is critical.
Intel VT-d supports the remapping of I/O DMA transfers and device-generated interrupts. The archi-
tecture of VT-d provides the flexibility to support multiple usage models that may run unmodified,
special-purpose, or “virtualization-aware” guest OSes.
Another way to help I/O virtualization is via self-virtualized I/O (SV-IO) [47]. The key idea of
SV-IO is to harness the rich resources of a multicore processor. All tasks associated with virtualizing an
I/O device are encapsulated in SV-IO. It provides virtual devices and an associated access API to VMs
and a management API to the VMM. SV-IO defines one virtual interface (VIF) for every kind of virtua-
lized I/O device, such as virtual network interfaces, virtual block devices (disk), virtual camera devices,
and others. The guest OS interacts with the VIFs via VIF device drivers. Each VIF consists of two mes-
sage queues. One is for outgoing messages to the devices and the other is for incoming messages from
the devices. In addition, each VIF has a unique ID for identifying it in SV-IO.
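Since each VIF is described as a unique ID plus two message queues, its structure is easy to sketch. The following illustration uses hypothetical names and plain queues; it mirrors the description above rather than the SV-IO API itself.

```python
# Minimal sketch (hypothetical) of an SV-IO virtual interface (VIF):
# each VIF has a unique ID and two message queues, one per direction.
from collections import deque
from itertools import count

_vif_ids = count(1)

class VIF:
    def __init__(self, kind):
        self.vif_id = next(_vif_ids)   # unique ID identifying the VIF in SV-IO
        self.kind = kind               # e.g., "net", "block", "camera"
        self.outgoing = deque()        # guest -> device messages
        self.incoming = deque()        # device -> guest messages

    def send(self, msg):               # called by the guest's VIF device driver
        self.outgoing.append(msg)

    def deliver(self, msg):            # called by SV-IO on device completion
        self.incoming.append(msg)

vnic = VIF("net")
vnic.send({"op": "tx", "bytes": 1500})
vnic.deliver({"op": "rx", "bytes": 60})
print(vnic.vif_id, len(vnic.outgoing), len(vnic.incoming))
```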
FIGURE 3.15
Functional blocks involved in sending and receiving network packets: the VMApp, VMDriver, and VMNet driver cooperate through system calls, IOCTL returns, memory copies into VM memory, and VMM-raised virtual IRQs to complete packet send and receive operations.
(Courtesy of VMware [71])
The virtual NIC models an AMD Lance Am79C970A controller. The device driver for a Lance controller
in the guest OS initiates packet transmissions by reading and writing a sequence of virtual I/O ports; each
read or write switches back to the VMApp to emulate the Lance port accesses. When the last OUT instruc-
tion of the sequence is encountered, the Lance emulator calls a normal write() to the VMNet driver. The
VMNet driver then passes the packet onto the network via a host NIC and then the VMApp switches back
to the VMM. The switch raises a virtual interrupt to notify the guest device driver that the packet was sent.
Packet receives occur in reverse.
FIGURE 3.16
Multicore virtualization method that exposes four virtual processors (V0–V3) to the system software (SW threads/guest VMs) above the ISA, when only three physical cores (C0–C2) are actually present on the chip (the fourth virtual processor is shown paused).
(Courtesy of Wells, et al. [74])
Today’s many-core CMPs use a physical hierarchy of two or more cache levels that stati-
cally determine the cache allocation and mapping. A virtual hierarchy is a cache hierarchy
that can adapt to fit the workload or mix of workloads [39]. The hierarchy’s first level locates
data blocks close to the cores needing them for faster access, establishes a shared-cache domain,
and establishes a point of coherence for faster communication. When a miss leaves a tile, it
first attempts to locate the block (or sharers) within the first level. The first level can also pro-
vide isolation between independent workloads. A miss at the L1 cache can invoke the L2
access.
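The two-level policy can be summarized in a few lines: a request first searches the requesting VM's own first-level domain and only falls through to the globally shared second level on a miss. The dictionary "caches" and the function below are illustrative assumptions, not the coherence protocol of Marty and Hill.

```python
# Schematic (hypothetical) of a two-level virtual cache hierarchy lookup:
# level 1 = the requesting VM's own shared-cache domain,
# level 2 = the globally shared level spanning all VMs.
def vh_lookup(block, vm_id, level1_domains, level2_global):
    domain = level1_domains[vm_id]          # tiles assigned to this VM
    if block in domain:                     # intra-VM hit: fast, isolated
        return "L1-domain hit", domain[block]
    if block in level2_global:              # global hit: slower, shared
        domain[block] = level2_global[block]    # pull the block closer
        return "L2-global hit", domain[block]
    return "miss to memory", None

level1 = {"VM0": {"A": 1}, "VM1": {}}
level2 = {"B": 2}
print(vh_lookup("A", "VM0", level1, level2))   # served within VM0's domain
print(vh_lookup("B", "VM1", level1, level2))   # served by the global level
```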
The idea is illustrated in Figure 3.17(a). Space sharing is applied to assign three workloads to
three clusters of virtual cores: namely VM0 and VM3 for database workload, VM1 and VM2 for
web server workload, and VM4–VM7 for middleware workload. The basic assumption is that each
workload runs in its own VM. However, space sharing applies equally within a single operating
system. Statically distributing the directory among tiles can do much better, provided operating sys-
tems or hypervisors carefully map virtual pages to physical frames. Marty and Hill suggested a
two-level virtual coherence and caching hierarchy that harmonizes with the assignment of tiles to
the virtual clusters of VMs.
Figure 3.17(b) illustrates a logical view of such a virtual cluster hierarchy in two levels. Each
VM operates in an isolated fashion at the first level. This will minimize both miss access time and
performance interference with other workloads or VMs. Moreover, the shared resources of cache
capacity, interconnect links, and miss handling are mostly isolated between VMs. The second level
maintains a globally shared memory. This facilitates dynamically repartitioning resources without
costly cache flushes. Furthermore, maintaining globally shared memory minimizes changes to
existing system software and allows virtualization features such as content-based page sharing.
A virtual hierarchy adapts to space-shared workloads like multiprogramming and server consolida-
tion. Figure 3.17 shows a case study focused on consolidated server workloads in a tiled architecture.
This many-core mapping scheme can also optimize for space-shared multiprogrammed workloads in
a single-OS environment.
FIGURE 3.17
CMP server consolidation by space-sharing of VMs into many cores forming multiple virtual clusters to execute various workloads. Each tile contains a processor with L1 instruction and data caches and a bank of the L2 cache.
(Courtesy of Marty and Hill [39])
FIGURE 3.18
A cloud platform with four virtual clusters over three physical clusters shaded differently.
(Courtesy of Fan Zhang, Tsinghua University)
• VMs can be colonized (replicated) in multiple servers for the purpose of promoting distributed
parallelism, fault tolerance, and disaster recovery.
• The size (number of nodes) of a virtual cluster can grow or shrink dynamically, similar to the
way an overlay network varies in size in a peer-to-peer (P2P) network.
• The failure of any physical nodes may disable some VMs installed on the failing nodes. But the
failure of VMs will not pull down the host system.
Since system virtualization has been widely used, it is necessary to effectively manage the VMs
running on a mass of physical computing nodes (such groups of VMs form virtual clusters) and consequently build
a high-performance virtualized computing environment. This involves virtual cluster deployment,
monitoring and management over large-scale clusters, as well as resource scheduling, load
balancing, server consolidation, fault tolerance, and other techniques. The different node colors in
Figure 3.18 refer to different virtual clusters. In a virtual cluster system, it is quite important to
store the large number of VM images efficiently.
Figure 3.19 shows the concept of a virtual cluster based on application partitioning or customi-
zation. The different colors in the figure represent the nodes in different virtual clusters. As a large
number of VM images might be present, the most important thing is to determine how to store
those images in the system efficiently. There are common installations for most users or applica-
tions, such as operating systems or user-level programming libraries. These software packages can
be preinstalled as templates (called template VMs). With these templates, users can build their own
software stacks. New OS instances can be copied from the template VM. User-specific components
such as programming libraries and applications can be installed to those instances.
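In code, the template approach is simply copy-then-customize. The sketch below, with hypothetical template and field names, clones a new instance from a preinstalled template and layers user-specific packages on top.

```python
# Minimal sketch (hypothetical): building per-user VM images from templates.
import copy

TEMPLATES = {
    "centos-base": {"os": "CentOS", "packages": ["openssh", "python"]},
}

def clone_from_template(template_name, user, extra_packages):
    """Copy a preinstalled template VM and add user-specific components."""
    image = copy.deepcopy(TEMPLATES[template_name])
    image["owner"] = user
    image["packages"] += extra_packages    # e.g., libraries and applications
    return image

vm_image = clone_from_template("centos-base", "alice", ["openmpi", "hadoop"])
print(vm_image)
```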
Three physical clusters are shown on the left side of Figure 3.18. Four virtual clusters are created
on the right, over the physical clusters. The physical machines are also called host systems. In
contrast, the VMs are guest systems. The host and guest systems may run with different operating
FIGURE 3.19
The concept of a virtual cluster based on application partitioning: VMs (VM1–VM4) belonging to different virtual clusters are deployed across the shared physical nodes.
(Courtesy of Kang Chen, Tsinghua University 2008)
systems. Each VM can be installed on a remote server or replicated on multiple servers belonging to
the same or different physical clusters. The boundary of a virtual cluster can change as VM nodes are
added, removed, or migrated dynamically over time.
Every VM is configured with a name, disk image, network setting, and allocated CPU and
memory. One needs to record each VM configuration into a file. However, this method is inefficient
when managing a large group of VMs. VMs with the same configurations could use preedited profiles
to simplify the process. In this scenario, the system configures the VMs according to the chosen pro-
file. Most configuration items use the same settings, while some of them, such as UUID, VM name,
and IP address, are assigned with automatically calculated values. Normally, users do not care which
host is running their VM. A strategy to choose the proper destination host for any VM is needed. The
deployment principle is to fulfill the VM requirement and to balance workloads among the whole
host network.
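A profile-driven configuration step might look like the following sketch: shared settings come from the chosen profile, while per-VM items such as the UUID, name, and IP address are generated automatically. The field names and address pool are illustrative placeholders.

```python
# Minimal sketch (hypothetical): instantiating VM configurations from a profile.
import uuid
from itertools import count

profile = {"cpu": 2, "memory_mb": 2048, "disk_image": "centos-base.img",
           "network": "br0"}
_ip_suffix = count(10)

def configure_vm(profile, cluster_prefix="10.0.0"):
    cfg = dict(profile)                    # shared settings from the profile
    cfg["uuid"] = str(uuid.uuid4())        # auto-generated per VM
    cfg["name"] = "vm-" + cfg["uuid"][:8]
    cfg["ip"] = f"{cluster_prefix}.{next(_ip_suffix)}"
    return cfg

vms = [configure_vm(profile) for _ in range(3)]
for vm in vms:
    print(vm["name"], vm["ip"], vm["cpu"], vm["memory_mb"])
```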
FIGURE 3.20
Live migration process of a VM from one host to another, from Stage 1 (Reservation: initialize a container on the target host) through Stage 4 (Commitment: the VM state on Host A is released).
(Courtesy of C. Clark, et al. [14])
An inactive state is defined by the virtualization platform, under which the VM is not enabled. An active state refers to a VM that has been instantiated at the virtualization
platform to perform a real task. A paused state corresponds to a VM that has been instantiated but is
disabled from processing a task or is paused in a waiting state. A VM enters the suspended state if its
machine file and virtual resources are stored back to the disk. As shown in Figure 3.20, live migration
of a VM consists of the following six steps:
Steps 0 and 1: Start migration. This step makes preparations for the migration, including
determining the migrating VM and the destination host. Although users could manually make a
VM migrate to an appointed host, in most circumstances, the migration is automatically started
by strategies such as load balancing and server consolidation.
Step 2: Transfer memory. Since the whole execution state of the VM is stored in memory,
sending the VM's memory to the destination node ensures continuity of the service provided by
the VM. All of the memory data is transferred in the first round, and then the migration controller
recopies the memory data changed in the previous round. These steps keep iterating until the
dirty portion of the memory is small enough to handle the final copy. Although precopying
memory is performed iteratively, the execution of programs is not noticeably interrupted (a
schematic of this precopy loop is sketched after the steps below).
Step 3: Suspend the VM and copy the last portion of the data. The migrating VM’s
execution is suspended when the last round’s memory data is transferred. Other nonmemory data
such as CPU and network states should be sent as well. During this step, the VM is stopped and
its applications will no longer run. This “service unavailable” time is called the “downtime” of
migration, which should be as short as possible so that it can be negligible to users.
Steps 4 and 5: Commit and activate the new host. After all the needed data is copied, the VM
reloads its state on the destination host, recovers the execution of its programs, and the
service provided by this VM continues. Then the network connection is redirected to the new
VM and the dependency on the source host is cleared. The whole migration process finishes by
removing the original VM from the source host.
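Steps 2 and 3 amount to an iterative precopy loop that stops when the writable working set is small enough and then suspends the VM for the final copy. The sketch below simulates only that control flow; the page-transfer and dirty-tracking helpers are hypothetical stand-ins for hypervisor mechanisms.

```python
# Schematic (hypothetical) of iterative precopy live migration.
def live_migrate(memory_pages, get_dirty_pages, send,
                 stop_threshold=8, max_rounds=10):
    send(set(memory_pages))                    # round 0: transfer all pages
    dirty = get_dirty_pages()                  # pages written meanwhile
    rounds = 0
    while len(dirty) > stop_threshold and rounds < max_rounds:
        send(dirty)                            # recopy last round's dirty pages
        dirty = get_dirty_pages()
        rounds += 1
    # Step 3: suspend the VM; the remaining dirty pages (plus CPU and
    # network state) are sent during the short downtime window.
    send(dirty)
    return dirty

# Toy run: the writable working set shrinks until the final copy is small.
dirty_rounds = iter([set(range(1, 12)), {2, 3, 5}])
sent = []
live_migrate(range(100), lambda: next(dirty_rounds), sent.append)
print([len(s) for s in sent])                  # [100, 11, 3]
```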
Figure 3.21 shows the effect on the data transmission rate (Mbit/second) of live migration of a
VM from one host to another. Before copying the VM, which was serving 512 KB files to 100 clients, the data
throughput was 870 Mbit/second. The first precopy takes 63 seconds, during which the rate is
reduced to 765 Mbit/second. Then the data rate reduces to 694 Mbit/second in 9.8 seconds for further
iterations of the copying process. The system experiences only 165 ms of downtime before the
VM is restored at the destination host. This experimental result shows a very small migration
overhead in live transfer of a VM between host nodes. This is critical to achieving dynamic cluster
reconfiguration and disaster recovery as needed in cloud computing. We will study these techniques
in more detail in Chapter 4.
With the emergence of widespread cluster computing more than a decade ago, many cluster con-
figuration and management systems have been developed to achieve a range of goals. These goals
naturally influence individual approaches to cluster management. VM technology has become a
popular method for simplifying management and sharing of physical computing resources. Platforms
FIGURE 3.21
Effect on data transmission rate (Mbit/sec, sampled over 100 ms and 500 ms intervals) versus elapsed time (seconds) of a VM migrated from one failing web server to another while serving 512 KB files to 100 concurrent clients; the rate settles at 694 Mbit/sec and the total downtime is 165 ms.
(Courtesy of C. Clark, et al. [14])
such as VMware and Xen allow multiple VMs with different operating systems and configurations
to coexist on the same physical host in mutual isolation. Clustering inexpensive computers is an
effective way to obtain reliable, scalable computing power for network services and compute-
intensive applications.
A distributed file system is used in ISR (Internet Suspend-Resume), serving as a transport mechanism for propagating a
suspended VM state. The actual file systems themselves are not mapped onto the distributed file
system. Instead, the VMM only accesses its local file system. The relevant VM files are explicitly
copied into the local file system for a resume operation and taken out of the local file system for a
suspend operation. This approach relieves developers from the complexities of implementing several
different file system calls for different distributed file systems. It also essentially disassociates the
VMM from any particular distributed file system semantics. However, this decoupling means that
the VMM has to store the contents of each VM’s virtual disks in its local files, which have to be
moved around with the other state information of that VM.
In smart copying, the VMM exploits spatial locality. Typically, people often move between the
same small number of locations, such as their home and office. In these conditions, it is possible to
transmit only the difference between the two file systems at suspending and resuming locations. This
technique significantly reduces the amount of actual physical data that has to be moved. In situations
where there is no locality to exploit, a different approach is to synthesize much of the state at the resum-
ing site. On many systems, user files only form a small fraction of the actual data on disk. Operating
system and application software account for the majority of storage space. The proactive state transfer
solution works in those cases where the resuming site can be predicted with reasonable confidence.
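Smart copying reduces to shipping only what differs between the suspend-site and resume-site file trees. The hash-comparison sketch below illustrates the idea with hypothetical helpers; real systems typically work at block or chunk granularity rather than whole files.

```python
# Minimal sketch (hypothetical): transmit only the difference between the
# suspend-site and resume-site copies of a VM's files ("smart copying").
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def files_to_transfer(source_files, resume_site_digests):
    """source_files: {path: bytes}; resume_site_digests: {path: sha256 hex}.
    Returns only the files missing or changed at the resume site."""
    delta = {}
    for path, data in source_files.items():
        if resume_site_digests.get(path) != digest(data):
            delta[path] = data
    return delta

suspend_site = {"/home/u/doc.txt": b"v2", "/usr/bin/app": b"unchanged"}
resume_site = {"/home/u/doc.txt": digest(b"v1"), "/usr/bin/app": digest(b"unchanged")}
print(list(files_to_transfer(suspend_site, resume_site)))   # ['/home/u/doc.txt']
```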
In most cluster environments, a network-accessible storage system, such as a storage area network (SAN) or network attached storage (NAS), is employed, so only the memory and CPU status
needs to be transferred from the source node to the target node. Live migration techniques mainly
use the precopy approach, which first transfers all memory pages and then iteratively copies only the pages modified
during the previous round. The VM service downtime is expected to be minimal with these iterative
copy operations. When the applications' writable working set becomes small, the VM is suspended and
only the CPU state and the dirty pages of the last round are sent out to the destination.
In the precopy phase, although a VM service is still available, much performance degradation will
occur because the migration daemon continually consumes network bandwidth to transfer dirty pages in
each round. An adaptive rate limiting approach is employed to mitigate this issue, but total migration time
is prolonged by nearly 10 times. Moreover, the maximum number of iterations must be set because not all
applications’ dirty pages are ensured to converge to a small writable working set over multiple rounds.
In fact, these issues with the precopy approach are caused by the large amount of data transferred
during the whole migration process. A checkpointing/recovery and trace/replay approach (CR/
TR-Motion) has been proposed to provide fast VM migration. This approach transfers the execution trace
file, logged by a trace daemon, in iterations rather than dirty pages. The total
size of all log files is much smaller than that of the dirty pages, so the total migration time and the downtime of
migration are drastically reduced. However, CR/TR-Motion is valid only when the log replay rate is
larger than the log growth rate. This inequality between source and target nodes limits the applica-
tion scope of live migration in clusters.
Another strategy, postcopy, has been introduced for live migration of VMs. Here, all memory pages are
transferred only once during the whole migration process and the baseline total migration time is
reduced. But the downtime is much higher than that of precopy due to the latency of fetching pages
from the source node before the VM can be resumed on the target. With the advent of multicore or
many-core machines, abundant CPU resources are available. Even if several VMs reside on the same
multicore machine, CPU resources are still plentiful because physical CPUs are frequently amenable to multi-
plexing. We can exploit these spare CPU resources to compress page frames, so the amount of
transferred data can be significantly reduced. Memory compression algorithms typically have little
memory overhead, and decompression is simple, very fast, and requires negligible extra memory.
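The compression step itself can be prototyped with a standard-library compressor: spare CPU cycles shrink the bytes placed on the wire, and the receiver restores the exact page contents. The zlib-based sketch below is only illustrative and is not the adaptive, regularity-aware scheme discussed next.

```python
# Minimal sketch (hypothetical): compress dirty pages before transfer to
# trade spare CPU cycles for reduced migration traffic.
import zlib

PAGE_SIZE = 4096

def pack_dirty_pages(pages):
    """pages: {page_number: bytes of length PAGE_SIZE}.
    Returns compressed payloads keyed by page number."""
    return {pn: zlib.compress(data, 1) for pn, data in pages.items()}

def unpack_dirty_pages(payloads):
    return {pn: zlib.decompress(blob) for pn, blob in payloads.items()}

dirty = {7: b"\x00" * PAGE_SIZE, 9: bytes(range(256)) * (PAGE_SIZE // 256)}
wire = pack_dirty_pages(dirty)
print({pn: len(blob) for pn, blob in wire.items()})   # far below 4096 per page
assert unpack_dirty_pages(wire) == dirty
```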
FIGURE 3.22
Live migration of a VM from the Dom0 domain to a Xen-enabled target host, using a dirty bitmap and shadow page tables maintained in the VMM layer above the hardware on both hosts.
No synchronization between the source and destination is needed when RDMA operations occur, which reduces the transfer to
a "one-sided" interface. Data communication over RDMA does not need to
involve the CPU, caches, or context switches. This allows migration to be carried out with minimal impact
on guest operating systems and hosted applications. Figure 3.22 shows a compression scheme for VM
migration.
This design requires that we make trade-offs between two factors: a compression algorithm must be fast, and it must be effective, which means it should embody expectations
about the kinds of regularities in the memory footprint. It is difficult for a single
compression algorithm applied to all memory data to achieve the win-win status that we expect.
Therefore, it is necessary to provide different compression algorithms for pages with different kinds of regularities.
The structure of this live migration system, which resides in Dom0, is shown in Figure 3.22.
Migration daemons running in the management VMs are responsible for performing migration. Shadow
page tables in the VMM layer trace modifications to the memory pages of migrated VMs during the precopy
phase, and corresponding flags are set in a dirty bitmap. At the start of each precopy round, the bitmap is
sent to the migration daemon. Then the bitmap is cleared and the shadow page tables are destroyed and
re-created in the next round. The system resides in Xen's management VM. Memory pages denoted by the
bitmap are extracted and compressed before they are sent to the destination. The compressed data is
then decompressed on the target.
Several research projects have built dynamic virtual clusters; their design objectives and reported results include:
• Cluster-on-Demand (COD) at Duke University: dynamic resource allocation with a virtual cluster management system; sharing of VMs by multiple virtual clusters using Sun GridEngine [12].
• Cellular Disco at Stanford University: deploying a virtual cluster on a shared-memory multiprocessor; VMs deployed on multiple processors under a VMM called Cellular Disco [8].
• VIOLIN at Purdue University: multiple VM clustering to prove the advantage of dynamic adaptation; reduced execution time of applications running VIOLIN with adaptation [25,55].
• GRAAL Project at INRIA in France: performance of parallel algorithms in Xen-enabled virtual clusters; 75% of maximum performance achieved with 30% resource slack over VM clusters.
FIGURE 3.23
COD partitioning a physical cluster into multiple dynamic virtual clusters, each with its own virtual cluster manager (VCM), using automatic configuration, resource negotiation, network boot, ACPI, and Wake-on-LAN.
(Courtesy of Jeff Chase, et al., HPDC-2003 © IEEE [12])
FIGURE 3.24
Cluster size variations (number of nodes, from 0 to 80) for the Systems, Architecture, and Biogeometry workloads in COD over eight days at Duke University.
(Courtesy of Jeff Chase, et al., HPDC-2003 © IEEE [12])
The COD (Cluster-on-Demand) system partitions a physical cluster into multiple virtual clusters (vClusters). vCluster owners specify the operating
systems and software for their clusters through an XML-RPC interface. The vClusters run a batch scheduler
from Sun's GridEngine on a web server cluster. The COD system can respond to load changes by restruc-
turing the virtual clusters dynamically.
The Duke researchers used the Sun GridEngine scheduler to demonstrate that dynamic virtual clusters
are an enabling abstraction for advanced resource management in computing utilities such as grids. The
system supports dynamic, policy-based cluster sharing between local users and hosted grid services.
Attractive features include resource reservation, adaptive provisioning, scavenging of idle resources, and
dynamic instantiation of grid services. The COD servers are backed by a configuration database. This
system provides resource policies and template definition in response to user requests.
Figure 3.24 shows the variation in the number of nodes in each of three virtual clusters during
eight days of a live deployment. Three application workloads requested by three user groups are labeled
“Systems,” “Architecture,” and “BioGeometry” in the trace plot. The experiments were performed with
multiple SGE batch pools on a test bed of 80 rack-mounted IBM xSeries-335 servers within the Duke clus-
ter. This trace plot clearly shows the sharp variation in cluster size (number of nodes) over the eight days.
Dynamic provisioning and deprovisioning of virtual clusters are needed in real-life cluster applications.
The VIOLIN project at Purdue University applies live VM migration to reconfigure virtual environments across multiple domains. The project leverages the maturity of VM migration and environment adaptation technology.
The approach is to enable mutually isolated virtual environments for executing parallel applications on
top of a shared physical infrastructure consisting of multiple domains. Figure 3.25 illustrates the idea
with five concurrent virtual environments, labeled VIOLIN 1–5, sharing two physical clusters.
The squares of various shadings represent the VMs deployed in the physical server nodes. The
major contribution by the Purdue group is to achieve autonomic adaptation of the virtual computation
environments as active, integrated entities. A virtual execution environment is able to relocate itself
across the infrastructure, and can scale its share of infrastructural resources. The adaptation is
transparent to both users of virtual environments and administrations of infrastructures. The adaptation
overhead is maintained at 20 sec out of 1,200 sec in solving a large NEMO3D problem of 1 million
particles.
The message being conveyed here is that the virtual environment adaptation can enhance resource
utilization significantly at the expense of less than a 1 percent increase in total execution time.
FIGURE 3.25
VIOLIN adaptation scenario of five virtual environments sharing two hosted clusters, shown as a timeline: (2) after VIOLIN 2 is finished, before adaptation; (3) after adaptation; (4) after VIOLIN 4 and 5 are created; (5) after VIOLIN 1 and 3 are finished. Note that there are more idle squares (blank nodes) before and after the adaptation.
(Courtesy of P. Ruth, et al. [55])
The migration of VIOLIN environments does pay off. Of course, the gain in shared resource utilization will benefit
many users, and the performance gain varies with different adaptation scenarios. We leave readers to
trace the execution of another scenario in Problem 3.17 at the end of this chapter to see the differences.
Virtual networking is a fundamental component of the VIOLIN system.
Therefore, it is common that most servers in data centers are underutilized, and a large amount of the
hardware, space, power, and management cost of these servers is wasted. Server consolidation is an
approach that improves the low utilization of hardware resources by reducing the number of physical
servers. Among several server consolidation techniques, such as centralized and physical consolidation,
virtualization-based server consolidation is the most powerful. Data centers need to optimize
their resource management, yet consolidation at the granularity of a full server
machine leaves resource management far from well optimized. Server virtualization enables
resource allocation at a finer grain than a physical machine.
In general, the use of VMs increases resource management complexity. This causes a challenge
in terms of how to improve resource utilization as well as guarantee QoS in data centers. In detail,
server virtualization has the following side effects:
• Consolidation enhances hardware utilization. Many underutilized servers are consolidated into
fewer servers to enhance resource utilization. Consolidation also facilitates backup services and
disaster recovery.
• This approach enables more agile provisioning and deployment of resources. In a virtual
environment, the images of the guest OSes and their applications are readily cloned and reused.
• The total cost of ownership is reduced. In this sense, server virtualization causes deferred
purchases of new servers, a smaller data-center footprint, lower maintenance costs, and lower
power, cooling, and cabling requirements.
• This approach improves availability and business continuity. The crash of a guest OS has no
effect on the host OS or any other guest OS. It becomes easier to transfer a VM from one
server to another, because virtual servers are unaware of the underlying hardware.
To automate data-center operations, one must consider resource scheduling, architectural support,
power management, automatic or autonomic resource management, the performance of analytical models,
and so on. In virtualized data centers, an efficient, on-demand, fine-grained scheduler is one of
the key factors in improving resource utilization. Scheduling and reallocation can be done at a wide
range of levels in a set of data centers: at least the VM level, the server level,
and the data-center level. Ideally, scheduling and resource reallocation should be done at all levels.
However, due to the complexity involved, current techniques focus on only a single level or, at most,
two levels.
Dynamic CPU allocation is based on VM utilization and application-level QoS metrics. One
method lets both CPU and memory resources flow among VMs, automatically adjusting the allocations
to the varying workloads of the hosted services. Another scheme uses a two-level resource
management system to handle the complexity involved: a local controller at the VM level and a
global controller at the server level are designed, and autonomic resource allocation is implemented via
the interaction of the local and global controllers. Multicore and virtualization are two cutting-edge
techniques that can enhance each other.
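A toy version of such a two-level arrangement is sketched below: each local controller proposes a CPU share from its VM's observed utilization, and the global controller scales the proposals to fit the server's capacity. The thresholds and formulas are arbitrary placeholders, not a published policy.

```python
# Schematic (hypothetical) of two-level autonomic CPU allocation:
# local controllers (per VM) propose shares; a global controller (per server)
# reconciles them against the server's capacity.
def local_controller(observed_util, current_share, headroom=0.2):
    """Ask for enough CPU to cover observed utilization plus some headroom."""
    return min(1.0, observed_util * current_share * (1 + headroom))

def global_controller(requests, server_capacity=1.0):
    """Scale all requests down proportionally if they exceed the server."""
    total = sum(requests.values())
    scale = min(1.0, server_capacity / total) if total else 1.0
    return {vm: share * scale for vm, share in requests.items()}

shares = {"vm1": 0.4, "vm2": 0.4, "vm3": 0.2}
utils = {"vm1": 0.9, "vm2": 0.5, "vm3": 1.0}   # fraction of current share used
requests = {vm: local_controller(utils[vm], shares[vm]) for vm in shares}
print(global_controller(requests))              # allocations summing to <= 1.0
```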
However, the use of CMPs is far from well optimized. The memory system of a CMP is a typical
example. One can design a virtual hierarchy on a CMP in data centers, and consider protocols
that minimize memory access time and inter-VM interference, facilitate VM reassignment, and
support inter-VM sharing. One can also consider a VM-aware power budgeting scheme that uses
multiple managers integrated to achieve better power management. The power budgeting policies
cannot ignore heterogeneity problems. Consequently, one must address the trade-off between power
saving and data-center performance.
Example 3.11 Parallax Providing Virtual Disks to Client VMs from a Large Common
Shared Physical Disk
The architecture of Parallax is scalable and especially suitable for use in cluster-based environments.
Figure 3.26 shows a high-level view of the structure of a Parallax-based cluster. A cluster-wide administrative
domain manages all storage appliance VMs, which makes storage management easy. The storage appliance
FIGURE 3.26
Parallax is a set of per-host storage appliances, running above the Xen VMMs on the physical hosts, that share access to a common block device and present virtual disks to client VMs.
(Courtesy of D. Meyer, et al. [43])
VM also allows functionality that is currently implemented within data-center hardware to be pushed out and
implemented on individual hosts. This mechanism enables advanced storage features such as snapshot
facilities to be implemented in software and delivered above commodity network storage targets.
Parallax itself runs as a user-level application in the storage appliance VM. It provides virtual disk images
(VDIs) to VMs. A VDI is a single-writer virtual disk which may be accessed in a location-transparent manner
from any of the physical hosts in the Parallax cluster. The VDIs are the core abstraction provided by Parallax.
Parallax uses Xen’s block tap driver to handle block requests and it is implemented as a tapdisk library. This
library acts as a single block virtualization service for all client VMs on the same physical host. In the Parallax
system, it is the storage appliance VM that connects to the physical hardware devices for block and network
access. As shown in Figure 3.26, physical device drivers are included in the storage appliance VM. This implementation
enables a storage administrator to live-upgrade the block device drivers in an active cluster.
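Features such as snapshots and the COW disks mentioned earlier rest on a simple copy-on-write discipline: reads fall through to a shared, read-only base image unless a block has been privately overwritten. The class below is a toy block-level illustration under assumed names, not Parallax's VDI format.

```python
# Minimal sketch (hypothetical): a copy-on-write virtual disk image.
# Many client VDIs can share one read-only base image; writes land in a
# private overlay, which is also what makes cheap snapshots possible.
class CowDisk:
    def __init__(self, base_image):
        self.base = base_image        # shared, read-only {block_no: bytes}
        self.overlay = {}             # private writes for this virtual disk

    def read(self, block_no):
        return self.overlay.get(block_no, self.base.get(block_no, b"\x00" * 512))

    def write(self, block_no, data):
        self.overlay[block_no] = data # copy-on-write: base is never modified

    def snapshot(self):
        return dict(self.overlay)     # freeze the current private state

base = {0: b"boot", 1: b"rootfs"}
disk_a, disk_b = CowDisk(base), CowDisk(base)
disk_a.write(1, b"patched rootfs")
print(disk_a.read(1), disk_b.read(1))   # b'patched rootfs' b'rootfs'
```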
Table 3.6 VI Managers and Operating Systems for Virtualizing Data Centers [9]
(The table compares each manager or OS by platforms and license, resources being virtualized and web link, client API and language, hypervisors used, public cloud interface, and special features; the entries are summarized in the text below.)
Nimbus, Eucalyptus, and OpenNebula are all open source software available to the general public. Only vSphere 4 is a
proprietary OS for cloud resource virtualization and management over data centers.
These VI managers are used to create VMs and aggregate them into virtual clusters as elastic
resources. Nimbus and Eucalyptus essentially support virtual networks. OpenNebula has additional
features to provision dynamic resources and make advance reservations. All three public VI
managers apply Xen and KVM for virtualization. vSphere 4 uses the hypervisors ESX and ESXi
from VMware. Only vSphere 4 supports virtual storage in addition to virtual networking and data
protection. We will study Eucalyptus and vSphere 4 in the next two examples.
FIGURE 3.27
Eucalyptus for building private clouds by establishing virtual networks over the VMs, linked through Ethernet and the Internet: a cloud manager (CM) on the public network coordinates group managers (GMs), each of which manages the instance managers (IMs) on the private networks of Cluster A and Cluster B.
(Courtesy of D. Nurmi, et al. [45])
Furthermore, the designers leverage existing web-service features such as WS-Security policies for
secure communication between components. The three resource managers in Figure 3.27 are specified
below:
• The Instance Manager controls the execution, inspection, and termination of VM instances on the host
where it runs.
• The Group Manager gathers information about and schedules VM execution on specific instance managers,
as well as manages the virtual instance network.
• The Cloud Manager is the entry point into the cloud for users and administrators. It queries node managers
for information about resources, makes scheduling decisions, and implements them by making
requests to group managers.
In terms of functionality, Eucalyptus works like the AWS APIs, so it can interact with EC2. It also
provides a storage API that emulates the Amazon S3 API for storing user data and VM images. Eucalyptus is installed
on Linux-based platforms, is compatible with EC2 via SOAP and Query interfaces, and is S3-compatible via SOAP
and REST. CLI and web portal services can be used with Eucalyptus.
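Because the front end is EC2-compatible, standard AWS client tooling can usually be pointed at a Eucalyptus installation. The boto3 snippet below is a hedged illustration: the endpoint URL, region name, credentials, and image ID are placeholders for whatever a particular private cloud provides, and the exact service path varies between Eucalyptus versions.

```python
# Illustrative sketch: talking to an EC2-compatible Eucalyptus front end with
# boto3. The endpoint, credentials, and image ID below are placeholders.
import boto3

ec2 = boto3.client(
    "ec2",
    endpoint_url="http://eucalyptus.example.edu:8773/services/compute",  # hypothetical
    region_name="eucalyptus",
    aws_access_key_id="EXAMPLE_ACCESS_KEY",
    aws_secret_access_key="EXAMPLE_SECRET_KEY",
)

# List instances known to the private cloud.
for reservation in ec2.describe_instances()["Reservations"]:
    for inst in reservation["Instances"]:
        print(inst["InstanceId"], inst["State"]["Name"])

# Launch one VM from a registered image ("emi-12345678" is a placeholder).
ec2.run_instances(ImageId="emi-12345678", MinCount=1, MaxCount=1,
                  InstanceType="m1.small")
```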
VMware vSphere 4
FIGURE 3.28
vSphere 4, a cloud operating system that manages compute, storage, and network resources over
virtualized data centers.
(Courtesy of VMware, April 2010 [72])
Figure 3.28 shows the overall architecture of vSphere 4. The system interacts with user applications via an interface layer, called vCenter.
vSphere is primarily intended to offer virtualization support and resource management of data-center
resources in building private clouds. VMware claims the system is the first cloud OS that supports
availability, security, and scalability in providing cloud computing services.
vSphere 4 is built with two functional software suites: infrastructure services and application
services. It also has three component packages intended mainly for virtualization purposes: vCompute is
supported by the ESX, ESXi, and DRS virtualization libraries from VMware; vStorage is supported by VMFS and
thin provisioning libraries; and vNetwork offers distributed switching and networking functions. These
packages interact with the hardware servers, disks, and networks in the data center. These infrastructure
functions also communicate with other external clouds.
The application services are also divided into three groups: availability, security, and scalability.
Availability support includes VMotion, Storage VMotion, HA, Fault Tolerance, and Data Recovery from
VMware. The security package supports vShield Zones and VMsafe. The scalability package was built with
DRS and Hot Add. Interested readers should refer to the vSphere 4 web site for more details regarding
these component software functions. To fully understand the use of vSphere 4, users must also learn how
to use the vCenter interfaces in order to link with existing applications or to develop new applications.
FIGURE 3.29
The architecture of Livewire for intrusion detection using a dedicated VM: a policy engine with policy modules, built on a policy framework, monitors the guest OS and applications in the target VM through an OS interface library and ptrace.
(Courtesy of Garfinkel and Rosenblum, 2002 [17])
Alternatively, the IDS can be integrated into the VMM, where it has the same privilege to access the hardware as the VMM. Garfinkel and Rosenblum
[17] have proposed an IDS to run on a VMM as a high-privileged VM. Figure 3.29 illustrates
the concept.
The VM-based IDS contains a policy engine and a policy module. The policy framework can
monitor events in different guest VMs through an operating system interface library, and ptrace is used to
trace the monitored host's behavior against the security policy. It is difficult to predict and prevent all intrusions without
delay; therefore, an analysis of the intrusion action is extremely important after an intrusion occurs.
At the time of this writing, most computer systems use logs to analyze attack actions, but it is hard to
ensure the credibility and integrity of a log when its service is based on the monitored operating system's kernel.
By keeping the log service outside the monitored OS, it should remain unaffected when that operating system is invaded by attackers.
Besides IDS, honeypots and honeynets are also prevalent in intrusion detection. They attract and
provide a fake system view to attackers in order to protect the real system. In addition, the attack
action can be analyzed, and a secure IDS can be built. A honeypot is a purposely defective system
that simulates an operating system to cheat and monitor the actions of an attacker. A honeypot can
be divided into physical and virtual forms. A guest operating system and the applications running
on it constitute a VM. In a virtual honeypot, the host operating system and VMM must be protected against attacks
launched from within the VM.
Example 3.14 EMC Establishment of Trusted Zones for Protection of Virtual Clusters
Provided to Multiple Tenants
EMC and VMware have joined forces in building security middleware for trust management in distribu-
ted systems and private clouds. The concept of trusted zones was established as part of the virtual
infrastructure. Figure 3.30 illustrates the concept of creating trusted zones for virtual clusters (multiple
FIGURE 3.30
Techniques for establishing trusted zones for virtual cluster insulation and VM isolation. On the left, the security functions are identity federation with public clouds, control and isolation of VMs in the virtual security infrastructure, segregation and control of user access, and end-to-end security information and event management across infrastructures. On the right, the countermeasures include anti-malware, cybercrime intelligence, strong authentication, data loss prevention, encryption and key management, and tokenization, which insulate the infrastructure from malware, Trojans, and cybercriminals, insulate tenant information from other tenants, and insulate information from the cloud provider's employees. The two tenants (Tenant #1 and Tenant #2), each with their own applications and OSes on separate virtual infrastructures, sit above the cloud provider's physical infrastructure.
(Courtesy of L. Nick, EMC [40])
applications and OSes for each tenant) provisioned in separate virtual environments. The physical
infrastructure is shown at the bottom, and marked as a cloud provider. The virtual clusters or infrastruc-
tures are shown in the upper boxes for two tenants. The public cloud is associated with the global user
communities at the top.
The arrowed boxes on the left and the brief description between the arrows and the zoning boxes are
security functions and actions taken at the four levels from the users to the providers. The small circles
between the four boxes refer to interactions between users and providers and among the users
themselves. The arrowed boxes on the right are those functions and actions applied between the tenant
environments, the provider, and the global communities.
Almost all available countermeasures, such as anti-virus, worm containment, intrusion detection,
encryption and decryption mechanisms, are applied here to insulate the trusted zones and isolate the
VMs for private tenants. The main innovation here is to establish the trust zones among the virtual clusters.
The end result is to enable an end-to-end view of security events and compliance across the virtual
clusters dedicated to different tenants. We will discuss security and trust issues in Chapter 7 when we
study clouds in more detail.
Acknowledgments
This chapter is coauthored by Zhibin Yu of Huazhong University of Science and Technology
(HUST), China and by Kai Hwang of USC. Dr. Hai Jin and Dr. Xiaofei Liao of HUST have extended
significant technical support to this work. The authors would like to thank Dr. Zhong-Yuan Qin
of Southeast University, China, Fan Zhang of Tsinghua University, and Lizhong Chen and Zhou Zhao
of USC for their assistance in drawing several sketched figures and updating the references in
this chapter.
References
[1] Advanced Micro Devices. AMD Secure Virtual Machine Architecture Reference Manual, 2008.
[2] K. Adams, O. Agesen, A comparison of software and hardware techniques for x86 virtualization, in:
Proceedings of the 12th International Conference on Architectural Support for Programming Languages
and Operating Systems, San Jose, CA, October 2006, pp. 21–25.
[3] V. Adve, C. Lattner, et al., LLVA: A low-level virtual instruction set architecture, in: Proceedings of the
36th International Symposium on Micro-architecture (MICRO-36 ’03), 2003.
[4] J. Alonso, L. Silva, A. Andrzejak, P. Silva, J. Torres, High-available grid services through the use of
virtualized clustering, in: Proceedings of the 8th Grid Computing Conference, 2007.
[5] P. Anedda, M. Gaggero, et al., A general service-oriented approach for managing virtual machine alloca-
tion, in: Proceedings of the 24th Annual ACM Symposium on Applied Computing (SAC 2009), ACM
Press, March 2009, pp. 9–12.
[6] H. Andre Lagar-Cavilla, J.A. Whitney, A. Scannell, et al., SnowFlock: rapid virtual machine cloning for
cloud computing, in: Proceedings of EuroSystems, 2009.
[7] P. Barham, B. Dragovic, K. Fraser, et al., Xen and the art of virtualization, in: Proceedings of the 19th
ACM Symposium on Operating System Principles (SOSP19), ACM Press, 2003, pp. 164–177.
[8] E. Bugnion, S. Devine, M. Rosenblum, Disco: running commodity OS on scalable multiprocessors, in:
Proceedings of SOSP, 1997.
[9] R. Buyya, J. Broberg, A. Goscinski (Eds.), Cloud Computing: Principles and Paradigms, Wiley Press,
New York, 2011.
[10] V. Chadha, R. Illikkal, R. Iyer, I/O Processing in a virtualized platform: a simulation-driven approach, in:
Proceedings of the 3rd International Conference on Virtual Execution Environments (VEE), 2007.
[11] R. Chandra, N. Zeldovich, C. Sapuntzakis, M.S. Lam, The collective: a cache-based system management
architecture, in: Proceedings of the Second Symposium on Networked Systems Design and Implementa-
tion (NSDI ’05), USENIX, Boston, May 2005, pp. 259–272.
[12] J. Chase, L. Grit, D. Irwin, J. Moore, S. Sprenkle, Dynamic virtual cluster in a grid site manager, in: IEEE
Int’l Symp. on High Performance Distributed Computing, (HPDC-12), 2003.
[13] D. Chisnall, The Definitive Guide to the Xen Hypervisor, Prentice Hall, International, 2007.
[14] C. Clark, K. Fraser, S. Hand, et al., Live migration of virtual machines, in: Proceedings of the Second
Symposium on Networked Systems Design and Implementation (NSDI ’05), 2005, pp. 273–286.
[15] Y. Dong, J. Dai, et al., Towards high-quality I/O virtualization, in: Proceedings of SYSTOR 2009, The
Israeli Experimental Systems Conference, 2009.
[16] E. Elnozahy, M. Kistler, R. Rajamony, Energy-efficient server clusters, in: Proceedings of the 2nd Work-
shop on Power-Aware Computing Systems, February 2002.
[17] J. Flich, et al., On the potential of NoC virtualization for multicore chips, in: IEEE Int'l Conf. on
Complex, Intelligent and Software-Intensive Systems, 2008, pp. 801–807.
[18] T. Garfinkel, M. Rosenblum, A virtual machine introspection-based architecture for intrusion detection,
2002.
[19] L. Grit, D. Irwin, A. Yumerefendi, J. Chase, Virtual machine hosting for networked clusters: building the
foundations for autonomic orchestration, in: First International Workshop on Virtualization Technology in
Distributed Computing (VTDC), November 2006.
[20] D. Gupta, S. Lee, M. Vrable, et al., Difference engine: Harnessing memory redundancy in virtual
machines, in: Proceedings of the USENIX Symposium on Operating Systems Design and Implementation
(OSDI ’08), 2008, pp. 309–322.
[21] M. Hines, K. Gopalan, Post-copy based live virtual machine migration using adaptive pre-paging and
dynamic self-ballooning, in: Proceedings of the ACM/USENIX International Conference on Virtual
Execution Environments (VEE ’09), 2009, pp. 51–60.
[22] T. Hirofuchi, H. Nakada, et al., A live storage migration mechanism over WAN and its performance eva-
luation, in: Proceedings of the 4th International Workshop on Virtualization Technologies in Distributed
Computing, 15 June, ACM Press, Barcelona, Spain, 2009.
[23] K. Hwang, D. Li, Trusted cloud computing with secure resources and data coloring, IEEE Internet
Comput., (September/October) (2010) 30–39.
[24] Intel Open Source Technology Center, System Virtualization—Principles and Implementation, Tsinghua
University Press, Beijing, China, 2009.
[25] X. Jiang, D. Xu, VIOLIN: Virtual internetworking on overlay infrastructure, in: Proceedings of the
International Symposium on Parallel and Distributed Processing and Applications, 2004, pp. 937–946.
[26] H. Jin, L. Deng, S. Wu, X. Shi, X. Pan, Live virtual machine migration with adaptive memory compres-
sion, in: Proceedings of the IEEE International Conference on Cluster Computing, 2009.
[27] K. Jin, E. Miller, The effectiveness of deduplication on virtual machine disk images, in: Proceedings of
SYSTOR, 2009, The Israeli Experimental Systems Conference, 2009.
[28] S. Jones, A. Arpaci-Dusseau, R. Arpaci-Dusseau, Geiger: Monitoring the buffer cache in a virtual machine
environment, in: ACM ASPLOS, San Jose, CA, October 2006, pp. 13–14.
[29] F. Kamoun, Virtualizing the datacenter without compromising server performance, ACM Ubiquity 2009,
(9) (2009).
[30] D. Kim, H. Kim, J. Huh, Virtual snooping: Filtering snoops in virtualized multi-cores, in: 43rd Annual
IEEE/ACM Int'l Symposium on Microarchitecture (MICRO-43), 2010.
[31] A. Kivity, et al., KVM: The linux virtual machine monitor, in: Proceedings of the Linux Symposium,
Ottawa, Canada, 2007, p. 225.
[32] A. Kochut, On impact of dynamic virtual machine reallocation on data center efficiency, in: Proceedings
of the IEEE International Symposium on Modeling, Analysis and Simulation of Computers and Telecom-
munication Systems (MASCOTS), 2008.
[33] R. Kumar, et al., Heterogeneous chip multiprocessors, IEEE Comput. Mag. 38 (November) (2005)
32–38.
[34] B. Kyrre, Managing large networks of virtual machines, in: Proceedings of the 20th Large Installation
System Administration Conference, 2006, pp. 205–214.
[35] J. Lange, P. Dinda, Transparent network services via a virtual traffic layer for virtual machines, in:
Proceedings of High Performance Distributed Computing, ACM Press, Monterey, CA, pp. 25–29,
June 2007.
[36] A. Liguori, E. Hensbergen, Experiences with content addressable storage and virtual disks, in: Proceedings
of the Workshop on I/O Virtualization (WIOV ’08), 2008.
[37] H. Liu, H. Jin, X. Liao, L. Hu, C. Yu, Live migration of virtual machine based on full system trace and
replay, in: Proceedings of the 18th International Symposium on High Performance Distributed Computing
(HPDC ’09), 2009, pp. 101–110.
[38] A. Mainwaring, D. Culler, Design challenges of virtual networks: Fast, general-purpose communication,
in: Proceedings of the Seventh ACM SIGPLAN Symposium on Principles and Practices of Parallel
Programming, 1999.
[39] M. Marty, M. Hill, Virtual hierarchies to support server consolidation, in: Proceedings of the 34th Annual
International Symposium on Computer Architecture (ISCA), 2007.
[40] M. McNett, D. Gupta, A. Vahdat, G.M. Voelker, Usher: An extensible framework for managing clusters
of virtual machines, in: 21st Large Installation System Administration Conference (LISA) 2007.
[41] D. Menasce, Virtualization: Concepts, applications, and performance modeling, in: Proceedings of the 31st
International Computer Measurement Group Conference, 2005, pp. 407–414.
[42] A. Menon, J. Renato, Y. Turner, Diagnosing performance overheads in the Xen virtual machine environ-
ment, in: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environ-
ments, 2005.
[43] D. Meyer, et al., Parallax: Virtual disks for virtual machines, in: Proceedings of EuroSys, 2008.
[44] J. Nick, Journey to the private cloud: Security and compliance, in: Technical presentation by EMC Visit-
ing Team, May 25, Tsinghua University, Beijing, 2010.
[45] D. Nurmi, et al., The eucalyptus open-source cloud computing system, in: Proceedings of the 9th
IEEE ACM International Symposium on Cluster Computing and The Grid (CCGrid), Shanghai, China,
September 2009, pp. 124–131.
[46] P. Padala, et al., Adaptive control of virtualized resources in utility computing environments, in: Proceed-
ings of EuroSys 2007.
[47] L. Peterson, A. Bavier, M.E. Fiuczynski, S. Muir, Experiences Building PlanetLab, in: Proceedings of the
7th USENIX Symposium on Operating Systems Design and Implementation (OSDI2006), 6–8 November
2006.
[48] B. Pfaff, T. Garfinkel, M. Rosenblum, Virtualization aware file systems: Getting beyond the limitations of
virtual disks, in: Proceedings of USENIX Networked Systems Design and Implementation (NSDI 2006),
May 2006, pp. 353–366.
[49] E. Pinheiro, R. Bianchini, E. Carrera, T. Heath, Dynamic cluster reconfiguration for power and
performance, in: L. Benini (Ed.), Compilers and Operating Systems for Low Power, Kluwer Academic
Publishers, 2003.
[50] H. Qian, E. Miller, et al., Agility in virtualized utility computing, in: Proceedings of the Third International
Workshop on Virtualization Technology in Distributed Computing (VTDC 2007), 12 November 2007.
[51] H. Raj, I. Ganev, K. Schwan, Self-Virtualized I/O: High Performance, Scalable I/O Virtualization in
Multi-core Systems, Technical Report GIT-CERCS-06-02, CERCS, Georgia Tech, 2006, www.cercs.
gatech.edu/tech-reports/tr2006/git-cercs-06-02.pdf.
[52] J. Robin, C. Irvine, Analysis of the Intel Pentium's ability to support a secure virtual machine monitor, in:
Proceedings of the 9th USENIX Security Symposium, Vol. 9, 2000.
[53] M. Rosenblum, The reincarnation of virtual machines, ACM QUEUE, (July/August) (2004).
[54] M. Rosenblum, T. Garfinkel, Virtual machine monitors: current technology and future trends, IEEE
Comput 38 (5) (2005) 39–47.
[55] P. Ruth, et al., Automatic Live Migration of Virtual Computational Environments in a Multi-domain
Infrastructure, Purdue University, 2006.
[56] C. Sapuntzakis, R. Chandra, B. Pfaff, et al., Optimizing the migration of virtual computers, in:
Proceedings of the 5th Symposium on Operating Systems Design and Implementation, Boston, 9–11
December 2002.
[57] L. Shi, H. Chen, J. Sun, vCUDA: GPU accelerated high performance computing in virtual machines, in:
Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, 2009.
[58] J. Smith, R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes, Morgan Kaufmann, 2005.
[59] J. Smith, R. Nair, The architecture of virtual machines, IEEE Comput., (May) (2005).
[60] Y. Song, H. Wang, et al., Multi-tiered on-demand resource scheduling for VM-based data center, in:
Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009.
[61] B. Sotomayor, K. Keahey, I. Foster, Combining batch execution and leasing using virtual machines, in:
Proceedings of the 17th International Symposium on High-Performance Distributed Computing, 2008.
HOMEWORK PROBLEMS
Problem 3.1
Briefly answer the following questions on virtualization levels. Highlight the key points and identify
the distinctions in different approaches. Discuss their relative advantages, shortcomings and limita-
tions. Also identify example systems implemented at each level.
Problem 3.2
Explain the differences between hypervisor and para-virtualization, and give one example of a VMM
(virtual machine monitor) that was built in each of the two categories.
Problem 3.3
Install the VMware Workstation on a Windows XP or Vista personal computer or laptop, and then
install Red Hat Linux and Windows XP in the VMware Workstation. Configure the network settings
of Red Hat Linux and Windows XP to get on the Internet. Write an installation and configuration
guide for the VMware Workstation, Red Hat Linux, and Windows XP systems. Include any trouble-
shooting tips in the guide.
Problem 3.4
Download a new kernel package from www.kernel.org/. Compile it both in Red Hat Linux running in the VMware
Workstation installed in Problem 3.3 and in Red Hat Linux on a real computer. Compare the time required
for the two compilations. Which one takes longer to compile? What are their major differences?
Problem 3.5
Install Xen on a Red Hat Linux machine using two methods: from the binary code and from the source
code. Compile installation guides for the two methods used. Describe the dependencies of utilities
and packages along with troubleshooting tips.
Problem 3.6
Install Red Hat Linux on the Xen you installed in Problem 3.5. Download nbench from www.tux
.org/~mayer/linux/bmark.html. Run the nbench on the VM using Xen and on a real machine. Com-
pare the performance of the programs on the two platforms.
Problem 3.7
Use the utilities for easing deployment of Google enterprise applications in VMs. The Google-vm-
deployment tool can be downloaded from http://code.google.com/p/google-vm-deployment/.
Problem 3.8
Describe the approaches used to exchange data among the domains of Xen and design experiments
to compare the performance of data communication between the domains. This is designed to famil-
iarize you with the Xen programming environment. It may require a longer period of time to port
the Xen code, implement the application code, perform the experiments, collect the performance
data, and interpret the results.
Problem 3.9
Build your own LAN by using the VMware Workstation. The topological structure of the LAN is
specified in Figure 3.31. Machine A is required to install Red Hat Linux while machine B is
required to install Windows XP.
FIGURE 3.31
The topological structure of the virtual LAN: Machine A (192.168.203.2) on LAN 1 and Machine B (192.168.204.2) on LAN 2 are connected through a router whose internal gateway is 192.168.203.1 and whose external gateway is 192.168.204.1.
Problem 3.10
Study the relevant papers [33,63,74] on asymmetric or heterogeneous chip multiprocessors (CMPs).
Write a study report to survey the area, identify the key research issues, review the current develop-
ment and open research challenges lying ahead.
Problem 3.11
Study the relevant papers [17,28,30,66] on network on chip (NoC) and virtualization of NoC
resources for multi-core CMP design and applications. Repeat Problem 3.10 with a survey report
after the research study.
Problem 3.12
Hardware and software resource deployment is often complicated and time-consuming.
Automatic VM deployment can significantly reduce the time to instantiate new services or
reallocate resources depending on user needs. Visit the following web site for more information:
http://wiki.systemimager.org/index.php/Automating_Xen_VM_deployment_with_SystemImager.
Report your experience with automatic deployment using the SystemImager and Xen-tools.
Problem 3.13
Design an experiment to analyze the performance of Xen live migration for I/O read-intensive appli-
cations. The performance merits include the time consumed by the precopy phase, the downtime,
the time used by the pull phase, and the total migration time.
Problem 3.14
Design an experiment to test the performance of Xen live migration for I/O write-intensive applica-
tions. The performance metrics include the time consumed by the precopy phase, the downtime,
the time used by the pull phase, and the total migration time. Compare the results with those from
Problem 3.13.
Problem 3.15
Design and implement a VM execution environment for grid computing based on VMware Server.
The environment should enable grid users and resource providers to use services that are unique to
a VM-based approach to distributed computing. Users can define customized execution environ-
ments which can then be archived, copied, shared, and instantiated as multiple runtime clones.
Problem 3.16
Design a large-scale virtual cluster system. This problem may require three students to work together for
a semester. Assume that users can create multiple VMs at one time. Users can also manipulate and
configure multiple VMs at the same time. Common software such as OS or libraries are preinstalled as
templates. These templates enable users to create a new execution environment rapidly. Finally, you
can assume that users have their own profiles which store the identification of data blocks.
Problem 3.17
Figure 3.32 shows another VIOLIN adaptation scenario for changes in virtual environments. There
are four VIOLIN applications running in two cluster domains. Trace the three steps of VIOLIN job
execution and discuss the gains in resource utilization after live migration of the virtual execution
environment in the two cluster domains. You can check your results against the cited paper to compare your observations.
FIGURE 3.32
An adaptation scenario with four VIOLINs running in two cluster domains in the VIOLIN virtual clustering experiments: (1) initially, VIOLINs 1, 2, 3, and 4 are computing; (2) VIOLIN 3 finishes and VIOLIN 1 runs a less CPU-demanding application; (3) after adaptation.
(Courtesy of P. Ruth, et al. [55])
Problem 3.18
After studying the material presented in Section 3.3.5, plus reading the papers by Wells, et al. [74] and by Marty and Hill [39], answer the following two questions:
a. Distinguish virtual cores from physical cores and discuss the mapping technique in Wells's paper for improving resource utilization and fault tolerance when using virtualized multicore processors.
b. Study the cache coherence protocol presented in the Marty and Hill paper and discuss the feasibility and advantages of implementing it on future many-core CMPs.
PART 2
Computing Clouds, Service-Oriented Architecture, and Programming
Three chapters in Part 2 are devoted to cloud computing, including various cloud platforms for IaaS
(infrastructure as a service), PaaS (platform as a service), and SaaS (software as a service) applica-
tions. We present service-oriented architectures developed in recent years. Parallel and distributed
computing paradigms and their software support, language tools, and programming environments
are covered for various cloud computing services.
Pallickara, Marlon Pierce, Suresh Marru, Michael McLennan, George Adams, III, Gerhard Klimeck,
and Michael Wilde. The final manuscript was edited by Kai Hwang.
SUMMARY
This chapter covers design principles, architectures, and enabling technologies of cloud platforms.
We begin with a discussion of data-center design and management. Next, we present design choices
for building compute and storage cloud platforms. We cover layered platform structure, virtualiza-
tion support, resource provisioning, and infrastructure management. Several public cloud platforms
are also studied, including Amazon Web Services, the Google App Engine, and Microsoft Azure.
Subsequent chapters are devoted to service-oriented architectures, cloud computing paradigms, pro-
gramming environments, and future cloud extensions.
Recall the introduction in Chapter 1 in which we said that the cloud will free users to focus on
user application development and create business value by outsourcing job execution to cloud provi-
ders. In this scenario, the computations (programs) are sent to where the data is located, rather than
copying the data to millions of desktops as in the traditional approach. Cloud computing avoids
large data movement, resulting in much better network bandwidth utilization. Furthermore, machine
virtualization has enhanced resource utilization, increased application flexibility, and reduced the
total cost of using virtualized data-center resources.
The cloud offers significant benefit to IT companies by freeing them from the low-level task of
setting up the hardware (servers) and managing the system software. Cloud computing applies a vir-
tual platform with elastic resources put together by on-demand provisioning of hardware, software,
and data sets, dynamically. The main idea is to move desktop computing to a service-oriented plat-
form using server clusters and huge databases at data centers. Cloud computing leverages its low
cost and simplicity to both providers and users. According to Ian Foster [25], cloud computing
intends to leverage multitasking to achieve higher throughput by serving many heterogeneous appli-
cations, large or small, simultaneously.
4.1.1.1 Centralized versus Distributed Computing
Some people argue that cloud computing is centralized computing at data centers. Others claim that
cloud computing is the practice of distributed parallel computing over data-center resources. These
represent two opposite views of cloud computing. All computations in cloud applications are distributed
to servers in a data center. These are mainly virtual machines (VMs) in virtual clusters created out of
data-center resources. In this sense, cloud platforms are systems distributed through virtualization.
As Figure 4.1 shows, both public clouds and private clouds are developed over the Internet. As many clouds are generated by commercial providers or by enterprises in a distributed manner, they will be
interconnected over the Internet to achieve scalable and efficient computing services. Commercial cloud
providers such as Amazon, Google, and Microsoft created their platforms to be distributed geographi-
cally. This distribution is partially attributed to fault tolerance, response latency reduction, and even
legal reasons. Intranet-based private clouds are linked to public clouds to get additional resources.
Nevertheless, users in Europe may not feel comfortable using clouds in the United States, and vice
versa, until extensive service-level agreements (SLAs) are developed between the two user communities.
4.1.1.2 Public Clouds
A public cloud is built over the Internet and can be accessed by any user who has paid for the service.
Public clouds are owned by service providers and are accessible through a subscription. The callout box at the top of Figure 4.1 shows the architecture of a typical public cloud. Many public clouds are
available, including Google App Engine (GAE), Amazon Web Services (AWS), Microsoft Azure,
IBM Blue Cloud, and Salesforce.com’s Force.com. The providers of the aforementioned clouds are
commercial providers that offer a publicly accessible remote interface for creating and managing VM
instances within their proprietary infrastructure. A public cloud delivers a selected set of business pro-
cesses. The application and infrastructure services are offered on a flexible price-per-use basis.
4.1.1.3 Private Clouds
A private cloud is built within the domain of an intranet owned by a single organization. Therefore,
it is client owned and managed, and its access is limited to the owning clients and their partners. Its
deployment was not meant to sell capacity over the Internet through publicly accessible interfaces.
FIGURE 4.1
Public, private, and hybrid clouds illustrated by functional architecture and connectivity of representative clouds available by 2011. Public clouds shown include Microsoft Azure, Amazon AWS, IBM Blue Cloud, Google App Engine, and Salesforce Force.com over the Internet; a private cloud (IBM RC2) serves cloud users over an intranet; a hybrid cloud spans both.
Private clouds give local users a flexible and agile private infrastructure to run service workloads
within their administrative domains. A private cloud is supposed to deliver more efficient and con-
venient cloud services. It may impact the cloud standardization, while retaining greater customiza-
tion and organizational control.
requires that you assign work to virtual clusters created for users. The gateway nodes provide the
access points of the service from the outside world. These gateway nodes can also be used for secur-
ity control of the entire cloud platform. In physical clusters and traditional grids, users expect static
demand of resources. Clouds are designed to handle fluctuating workloads, and thus demand variable
resources dynamically. Private clouds will satisfy this demand if properly designed and managed.
Data centers and supercomputers have some similarities as well as fundamental differences.
We discussed supercomputers in Chapter 2. In the case of data centers, scaling is a fundamental
requirement. Data-center server clusters are typically built with a large number of servers, ranging
from thousands to millions of servers (nodes). For example, Microsoft has a data center in the
Chicago area that has 100,000 eight-core servers, housed in 50 containers. In supercomputers, a
separate data farm is used, while a data center uses disks on server nodes plus memory cache and
databases.
Data centers and supercomputers also differ in networking requirements, as illustrated in
Figure 4.2. Supercomputers use custom-designed high-bandwidth networks such as fat trees or 3D
torus networks (which we discussed in Chapter 2). Data-center networks are mostly IP-based
commodity networks, such as the 10 Gbps Ethernet network, which is optimized for Internet
access. Figure 4.2 shows a multilayer structure for accessing the Internet. The server racks are at
the bottom Layer 2, and they are connected through fast switches (S) as the hardware core. The
data center is connected to the Internet at Layer 3 with many access routers (ARs) and border
routers (BRs).
FIGURE 4.2
Standard data-center networking for the cloud to access the Internet. Key: BR = L3 border router; AR = L3 access router; S = L2 switch; LB = load balancer; A = rack of servers; the server racks and switches at the bottom form a single Layer 2 domain.
(Courtesy of Dennis Gannon, 2010 [26])
An example of a private cloud is the one the U.S. National Aeronautics and Space Administra-
tion (NASA) is building to enable researchers to run climate models on remote systems it provides.
This can save users the capital expense of HPC machines at local sites. Furthermore, NASA can
build the complex weather models around its data centers, which is more cost-effective. Another
good example is the cloud built by the European Organization for Nuclear Research (CERN). This is a
very big private cloud designed to distribute data, applications, and computing resources to thou-
sands of scientists around the world.
These cloud models demand different levels of performance, data protection, and security enfor-
cement. In this case, different SLAs may be applied to satisfy both providers and paid users. Cloud
computing exploits many existing technologies. For example, grid computing is the backbone of
cloud computing in that the grid has the same goals of resource sharing with better utilization of
research facilities. Grids are more focused on delivering storage and computing resources while
cloud computing aims to achieve economies of scale with abstracted services and resources.
and executing the application code and managing the resources. What is even worse, this cycle repeats itself about every 18 months, meaning the machine we bought becomes obsolete every 18 months.
The cloud computing paradigm is shown on the right. This computing model follows a pay-
as-you-go model. Therefore the cost is significantly reduced, because we simply rent computer
resources without buying the computer in advance. All hardware and software resources are leased
from the cloud provider without capital investment on the part of the users. Only the execution
phase costs some money. The experts at IBM have estimated that an 80 percent to 95 percent
saving results from cloud computing, compared with the conventional computing paradigm. This is
very much desired, especially for small businesses, which require limited computing power and thus can avoid purchasing expensive computers or servers every few years.
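To see where such savings come from, consider a back-of-the-envelope calculation. The Python sketch below uses purely hypothetical numbers (the purchase price, utilization, and hourly rental rate are assumptions, not figures from this chapter) and ignores power, cooling, and staffing costs; it only illustrates how low utilization of owned hardware drives the pay-as-you-go advantage.

# Illustrative only: all input numbers below are hypothetical assumptions,
# chosen to show how the pay-as-you-go comparison is usually worked out.
server_price      = 6000.0   # USD, owned server (assumed)
useful_life_years = 3        # replaced roughly every hardware cycle (assumed)
utilization       = 0.15     # fraction of time the owned server does useful work (assumed)
rented_rate       = 0.40     # USD per instance-hour from a cloud provider (assumed)

hours_needed = useful_life_years * 365 * 24 * utilization   # hours of real work
owned_cost   = server_price                 # capital expense, paid up front
rented_cost  = rented_rate * hours_needed   # operational expense, paid as used

saving = 1 - rented_cost / owned_cost
print(f"Hours of useful work over {useful_life_years} years: {hours_needed:,.0f}")
print(f"Owned: ${owned_cost:,.0f}   Rented: ${rented_cost:,.0f}")
print(f"Saving by renting only what is used: {saving:.0%}")

With these assumed inputs the rented option costs roughly a quarter of the owned one; higher hourly rates or higher utilization of owned hardware shrink the gap, which is why the savings quoted for small businesses vary so widely.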
For example, IBM has estimated that the worldwide cloud service market may reach $126 billion
by 2012, including components, infrastructure services, and business services. Internet clouds work
as service factories built around multiple data centers. To formalize the above cloud computing
model, we characterize the cloud cost model, the cloud ecosystems, and enabling technologies. These
topics help our readers understand the motivations behind cloud computing. The intention is to remove the following barriers to cloud computing:
• Data privacy protection Can you trust data centers to handle your private data and records?
This concern must be addressed to make clouds successful as trusted services.
• High quality of cloud services The QoS of cloud computing must be standardized to make
clouds interoperable among multiple providers.
• New standards and interfaces This refers to solving the data lock-in problem associated with
data centers or cloud providers. Universally accepted APIs and access protocols are needed to
provide high portability and flexibility of virtualized applications.
FIGURE 4.3
Computing economics between traditional IT users and cloud users (the cloud user pays variable costs in operational expenses).
FIGURE 4.4
Cloud ecosystem for building private clouds: (a) Consumers (individual users needing raw infrastructure, other clouds needing to outsource excess workloads, and cloud consumers needing a platform-as-a-service on which to instantiate services such as web servers and databases for their users) demand a flexible platform; (b) Cloud manager provides virtualized resources over an IaaS platform; (c) VI manager (e.g., OpenNebula, VMware vSphere, and others) allocates VMs; (d) VM managers handle VMs installed on servers.
(Courtesy of Sotomayor, et al. © IEEE [68])
FIGURE 4.5
The IaaS, PaaS, and SaaS cloud service models at different service levels.
(Courtesy of J. Suh and S. Kang, USC)
applications over his chosen OS environment. The user does not manage or control the underlying
cloud infrastructure, but has control over the OS, storage, deployed applications, and possibly select
networking components. This IaaS model encompasses storage as a service, compute instances as
a service, and communication as a service. The Virtual Private Cloud (VPC) in Example 4.1 shows
how to provide Amazon EC2 clusters and S3 storage to multiple users. Many startup cloud provi-
ders have appeared in recent years. GoGrid, FlexiScale, and Aneka are good examples. Table 4.1
summarizes the IaaS offerings by five public cloud providers. Interested readers can visit the
companies’ web sites for updated information. More examples can be also found in two recent
cloud books [10,18].
Table 4.1 IaaS offerings of five public cloud providers (VM instance capacity; APIs and access tools; hypervisor and guest OS):
• Amazon EC2: Each instance has 1–20 EC2 processors, 1.7–15 GB of memory, and 160 GB–1.69 TB of storage. Access: CLI or Web Service (WS) portal. Platform: Xen; Linux, Windows.
• GoGrid: Each instance has 1–6 CPUs, 0.5–8 GB of memory, and 30–480 GB of storage. Access: REST, Java, PHP, Python, Ruby. Platform: Xen; Linux, Windows.
• Rackspace Cloud: Each instance has a four-core CPU, 0.25–16 GB of memory, and 10–620 GB of storage. Access: REST, Python, PHP, Java, C#, .NET. Platform: Xen; Linux.
• FlexiScale (in the UK): Each instance has 1–4 CPUs, 0.5–16 GB of memory, and 20–270 GB of storage. Access: web console. Platform: Xen; Linux, Windows.
• Joyent Cloud: Each instance has up to eight CPUs, 0.25–32 GB of memory, and 30–480 GB of storage. Access: no specific API, SSH, VirtualMin. Platform: OS-level virtualization; OpenSolaris.
Customer’s isolated
AWS resources
EC2 EC2 EC2
Subnets
S3
VPN Router S3
S3
gateway
EC2
EC2 Amazon
EC2
web services
cloud
Secure VPN
connection over
the Internet
Customer’s
network
FIGURE 4.6
Amazon VPC (virtual private cloud).
(Courtesy of VMWare, http://aws.amazon.com/vpc/ )
SSH and Windows, 99.5 percent availability agreements, per-hour pricing, Linux and Windows OSes, and automatic scaling and load balancing. We will illustrate the use of EC2 in more detail in Chapter 6. VPC
allows the user to isolate provisioned AWS processors, memory, and storage from interference by other
users. Both auto-scaling and elastic load balancing services can support related demands. Auto-scaling
enables users to automatically scale their VM instance capacity up or down. With auto-scaling, one
can ensure that a sufficient number of Amazon EC2 instances are provisioned to meet desired
performance. Or one can scale down the VM instance capacity to reduce costs, when the workload
is reduced.
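The decision logic behind such auto-scaling can be pictured as a simple threshold policy evaluated periodically against a monitored metric. The sketch below is not the EC2 auto-scaling API; it is a minimal Python illustration with hypothetical thresholds and a synthetic CPU-load trace.

# A minimal sketch of threshold-based auto-scaling logic (not the EC2 Auto
# Scaling API). The thresholds, fleet sizes, and CPU samples are hypothetical.
def desired_instances(current, avg_cpu, scale_out_at=70, scale_in_at=30,
                      min_instances=2, max_instances=20):
    """Return the new instance count for one evaluation period."""
    if avg_cpu > scale_out_at and current < max_instances:
        return current + 1          # add capacity when the fleet runs hot
    if avg_cpu < scale_in_at and current > min_instances:
        return current - 1          # release capacity (and cost) when idle
    return current                  # otherwise leave the fleet unchanged

# Walk through a synthetic load pattern.
fleet = 2
for cpu in [20, 45, 75, 85, 90, 60, 25, 15]:
    fleet = desired_instances(fleet, cpu)
    print(f"avg CPU {cpu:3d}% -> {fleet} instances")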
Five public cloud PaaS offerings (languages and developer tools; programming models supported by the provider; target applications):
• Google App Engine: Python, Java, and an Eclipse-based IDE; MapReduce and web programming on demand; web applications and BigTable storage.
• Salesforce.com's Force.com: Apex, an Eclipse-based IDE, and a web-based Wizard; workflow, Excel-like formulas, and web programming on demand; business applications such as CRM.
• Microsoft Azure: .NET and Azure tools for MS Visual Studio; unrestricted model; enterprise and web applications.
• Amazon Elastic MapReduce: Hive, Pig, Cascading, Java, Ruby, Perl, Python, PHP, R, C++; MapReduce; data processing and e-commerce.
• Aneka: .NET and a stand-alone SDK; threads, task, and MapReduce models; .NET enterprise applications and HPC.
software development platform for users from different parts of the world. This model also encourages
third parties to provide software management, integration, and service monitoring solutions.
FIGURE 4.7
Google App Engine platform for PaaS operations.
(Courtesy of Yangting Wu, USC)
well. After these steps are finished, the provided SDK offers a tool for uploading the user's application to Google's infrastructure, where the applications are actually deployed. Many additional third-party
capabilities, including software management, integration, and service monitoring solutions, are also
provided.
Here are some useful links when logging on to the GAE system:
3. Pitney Bowes, an e-commerce company, offers clients the opportunity to perform B2B transactions
using the Microsoft Azure platform, along with .NET and SQL services. These offerings have signifi-
cantly increased the company’s client base.
FIGURE 4.8
A huge data center that is 11 times the size of a football field, housing 400,000 to 1 million servers.
(Courtesy of Dennis Gannon [26] )
racks are accessible via a cluster-level switch. Consider a data center built with 2,000 servers, each
with 8 GB of DRAM and four 1 TB disk drives. Each group of 40 servers is connected through a
1 Gbps link to a rack-level switch that has an additional eight 1 Gbps ports used for connecting the
rack to the cluster-level switch.
It was estimated [9] that the bandwidth available from local disks is 200 MB/s, whereas the band-
width from off-rack disks is 25 MB/s via shared rack uplinks. The total disk storage in the cluster is
almost 10 million times larger than local DRAM. A large application must deal with large discrepancies
in latency, bandwidth, and capacity. In a very large-scale data center, components are relatively cheaper.
The components used in data centers are very different from those in building supercomputer systems.
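The following short Python sketch reproduces the capacity arithmetic for the example cluster just described; the rack-uplink oversubscription ratio is derived here from the stated link counts rather than quoted from the text.

# Back-of-the-envelope figures for the example cluster described above
# (2,000 servers, 8 GB of DRAM and four 1 TB disks each, 40 servers per rack,
#  1 Gbps server links, eight 1 Gbps rack uplinks).
servers          = 2000
dram_per_server  = 8      # GB
disks_per_server = 4
disk_size        = 1000   # GB (1 TB)
servers_per_rack = 40
server_link      = 1      # Gbps to the rack-level switch
rack_uplinks     = 8      # 1 Gbps ports to the cluster-level switch

total_dram = servers * dram_per_server                 # 16,000 GB
total_disk = servers * disks_per_server * disk_size    # 8,000,000 GB

# Bandwidth leaving a rack is shared by all 40 servers in it.
oversubscription = (servers_per_rack * server_link) / rack_uplinks   # 5:1

print(f"Total DRAM: {total_dram/1000:.0f} TB, total disk: {total_disk/1e6:.0f} PB")
print(f"Rack uplink oversubscription: {oversubscription:.0f}:1")
print(f"Local vs. off-rack disk bandwidth gap: {200/25:.0f}x (200 MB/s vs. 25 MB/s)")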
With a scale of thousands of servers, concurrent failure, either hardware failure or software
failure, of 1 percent of nodes is common. Many failures can happen in hardware; for example, CPU
failure, disk I/O failure, and network failure. It is even quite possible that the whole data center stops working in the case of a power outage. Also, some failures are brought on by software. The
service and data should not be lost in a failure situation. Reliability can be achieved by redundant
hardware. The software must keep multiple copies of data in different locations and keep the data
accessible while facing hardware or software errors.
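One common way to keep multiple copies in different locations is rack-aware replica placement. The toy sketch below illustrates the idea only; it is not the placement policy of any particular file system, and the cluster layout it uses is hypothetical.

# A toy sketch of rack-aware replica placement, illustrating the idea of
# keeping copies in different failure domains (not GFS's actual policy).
import random

def place_replicas(block_id, servers_by_rack, copies=3):
    """Pick `copies` servers, each on a different rack, for one data block."""
    racks = random.sample(list(servers_by_rack), k=min(copies, len(servers_by_rack)))
    return [(rack, random.choice(servers_by_rack[rack])) for rack in racks]

# Hypothetical layout: 4 racks of 5 servers each.
cluster = {f"rack{r}": [f"rack{r}-node{n}" for n in range(5)] for r in range(4)}
print(place_replicas("block-0042", cluster))   # three replicas on three distinct racks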
FIGURE 4.9
The cooling system in a raised-floor data center with hot-cold air circulation supporting water heat exchange facilities.
(Courtesy of DLB Associates, D. Dyer [22])
about 2–4 ft above the concrete floor. The under-floor area is often used to route power cables to
racks, but its primary use is to distribute cool air to the server rack. The CRAC (computer room air
conditioning) unit pressurizes the raised floor plenum by blowing cold air into the plenum.
The cold air escapes from the plenum through perforated tiles that are placed in front of server
racks. Racks are arranged in long aisles that alternate between cold aisles and hot aisles to avoid
mixing hot and cold air. The hot air produced by the servers circulates back to the intakes of the
CRAC units that cool it and then exhaust the cool air into the raised floor plenum again. Typically,
the incoming coolant is at 12–14°C and the warm coolant returns to a chiller. Newer data centers
often insert a cooling tower to pre-cool the condenser water loop fluid. Water-based free cooling
uses cooling towers to dissipate heat. The cooling towers use a separate cooling loop in which
water absorbs the coolant’s heat in a heat exchanger.
to meet this requirement. For example, one-to-many communications are used for supporting distrib-
uted file access. One can use one or a few servers as metadata master servers which need to com-
municate with slave server nodes in the cluster. To support the MapReduce programming paradigm,
the network must be designed to perform the map and reduce functions (to be treated in Chapter 7)
at a high speed. In other words, the underlying network structure should support various network
traffic patterns demanded by user applications.
FIGURE 4.10
A fat-tree interconnection topology for scalable data-center construction, built from edge and aggregation switch layers.
(Courtesy of M. Al-Fares, et al. [2])
FIGURE 4.11
A modular data center built in a truck-towed ICE Cube container that can be cooled by chilled air circulation with cold-water heat exchangers.
(Courtesy of SGI, Inc., http://www.sgi.com/icecube)
take half a day. Extending this to a whole container system with multiple racks for 1,000 servers
requires the layout of the floor space with power, networking, cooling, and complete testing.
The container must be designed to be weatherproof and easy to transport. Modular data-center
construction and testing may take a few days to complete if all components are available and power
and water supplies are handy. The modular data-center approach supports many cloud service appli-
cations. For example, the health care industry will benefit by installing a data center at all clinic
sites. However, how to exchange information with the central database and maintain periodic con-
sistency becomes a rather challenging design issue in a hierarchically structured data center. The
security of collocation cloud services may involve multiple data centers.
FIGURE 4.12
BCube, a high-performance, server-centric network for building modular data centers; 16 servers (addressed 00 through 33) are connected by level-0 and level-1 switches.
(Courtesy of C. Guo, et al. [30])
The BCube provides multiple paths between any two nodes. Multiple paths provide extra bandwidth to
support communication patterns in different cloud applications. The BCube provides a kernel module in
the server OS to perform routing operations. The kernel module forwards packets when the incoming packets are not destined for the current node. Such modification of the kernel will not influence
the upper layer applications. Thus, the cloud application can still run on top of the BCube network
structure without any modification.
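The digit-correcting idea behind BCube routing can be sketched in a few lines. The code below assumes a BCube_k network in which servers are addressed by k+1 digits and two servers that differ only in one digit share a switch at the corresponding level; it shows the path-selection idea, not the kernel-module implementation mentioned above.

# A toy sketch of BCube-style digit-correcting routing. Each hop corrects one
# address digit by traversing the switch shared at that digit's level.
def bcube_route(src, dst, order=None):
    """Return the server-level hops obtained by correcting one digit per hop."""
    order = order if order is not None else range(len(src))
    path, cur = [tuple(src)], list(src)
    for i in order:
        if cur[i] != dst[i]:
            cur[i] = dst[i]
            path.append(tuple(cur))
    return path

# BCube_1 with 4-port switches: 16 servers addressed 00..33 (as in Figure 4.12).
print(bcube_route((0, 0), (2, 3)))                 # [(0, 0), (2, 0), (2, 3)]
# Correcting the digits in a different order yields a parallel path,
# which is where BCube's extra bandwidth between node pairs comes from.
print(bcube_route((0, 0), (2, 3), order=[1, 0]))   # [(0, 0), (0, 3), (2, 3)]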
FIGURE 4.13
A 2D MDCube constructed from nine BCube containers.
(Courtesy of Wu, et al. [82] )
• Low cost to both users and providers The cost to users and providers of the cloud system
built over the data centers should be reduced, including all operational costs.
• Security enforcement and data protection Data privacy and security defense mechanisms must
be deployed to protect the data center against network attacks and system interrupts and to
maintain data integrity from user abuses or network attacks.
• Green information technology Saving power consumption and upgrading energy efficiency are
in high demand when designing and operating current and future data centers.
• Fast platform deployment: fast, efficient, and flexible deployment of cloud resources to provide a dynamic computing environment to users.
• Virtual clusters on demand: virtualized clusters of VMs provisioned to satisfy user demand, and virtual clusters reconfigured as workloads change.
• Multitenant techniques: SaaS for distributing software to a large number of users for their simultaneous use and resource sharing if so desired.
• Massive data processing: Internet search and web services, which often require massive data processing, especially to support personalized services.
• Web-scale communication: support for e-commerce, distance education, telemedicine, social networking, digital government, and digital entertainment applications.
• Distributed storage: large-scale storage of personal records and public archive information, which demands distributed storage over the clouds.
• Licensing and billing services: license management and billing services, which greatly benefit all types of cloud services in utility computing.
These technologies play instrumental roles in making cloud computing a reality. Most of these
technologies are mature enough today to meet increasing demand. In the hardware area, the rapid progress
in multicore CPUs, memory chips, and disk arrays has made it possible to build faster data centers
with huge amounts of storage space. Resource virtualization enables rapid cloud deployment and
disaster recovery. Service-oriented architecture (SOA) also plays a vital role.
Progress in providing SaaS, Web 2.0 standards, and Internet performance have all contributed to
the emergence of cloud services. Today’s clouds are designed to serve a large number of tenants
over massive volumes of data. The availability of large-scale, distributed storage systems is the
foundation of today's data centers. Of course, cloud computing has also benefited greatly from the progress made in license management and automatic billing techniques in recent years.
FIGURE 4.14
A security-aware cloud platform built with a virtual cluster of VMs, storage, and networking resources over the data-center servers operated by providers.
(Courtesy of K. Hwang and D. Li, 2010 [36])
special APIs that enable developers to exploit Internet clouds. Monitoring and metering units are
used to track the usage and performance of provisioned resources.
The software infrastructure of a cloud platform must handle all resource management and do most of
the maintenance automatically. Software must detect the status of each node server joining and leaving,
and perform relevant tasks accordingly. Cloud computing providers, such as Google and Microsoft,
have built a large number of data centers all over the world. Each data center may have thousands of
servers. The location of the data center is chosen to reduce power and cooling costs. Thus, the data cen-
ters are often built around hydroelectric power. The cloud physical platform builder is more concerned
about the performance/price ratio and reliability issues than sheer speed performance.
In general, private clouds are easier to manage, and public clouds are easier to access. The
trends in cloud development are that more and more clouds will be hybrid. This is because many
cloud applications must go beyond the boundary of an intranet. One must learn how to create a pri-
vate cloud and how to interact with public clouds in the open Internet. Security becomes a critical
issue in safeguarding the operation of all cloud types. We will study cloud security and privacy
issues at the end of this chapter.
FIGURE 4.15
Layered architectural development of the cloud platform for IaaS, PaaS, and SaaS applications over the Internet.
The application layer is formed with a collection of all needed software modules for SaaS
applications. Service applications in this layer include daily office management work, such as
information retrieval, document processing, and calendar and authentication services. The applica-
tion layer is also heavily used by enterprises in business marketing and sales, consumer relation-
ship management (CRM), financial transactions, and supply chain management. It should be
noted that not all cloud services are restricted to a single layer. Many applications may apply
resources at mixed layers. After all, the three layers are built from the bottom up with a depen-
dence relationship.
From the provider’s perspective, the services at various layers demand different amounts of
functionality support and resource management by providers. In general, SaaS demands the most
work from the provider, PaaS is in the middle, and IaaS demands the least. For example, Amazon
EC2 provides not only virtualized CPU resources to users, but also management of these provi-
sioned resources. Services at the application layer demand more work from providers. The best
example of this is the Salesforce.com CRM service, in which the provider supplies not only the
hardware at the bottom layer and the software at the top layer, but also the platform and software
tools for user application development and monitoring.
FIGURE 4.16
Market-oriented cloud architecture to expand/shrink leasing of resources with variation in QoS/demand from users; users/brokers submit requests at the top, the VM monitor, dispatcher, and service request monitor mechanisms operate in the middle, and virtual machines (VMs) run over physical machines at the bottom.
(Courtesy of Raj Buyya, et al. [11])
resources (supply/demand). Pricing serves as a basis for managing the supply and demand of computing resources within the data center and facilitates prioritizing resource allocations effectively. The Accounting mechanism maintains the actual usage of resources by requests so that the final cost can be computed
and charged to users. In addition, the maintained historical usage information can be utilized by the Ser-
vice Request Examiner and Admission Control mechanism to improve resource allocation decisions.
The VM Monitor mechanism keeps track of the availability of VMs and their resource entitle-
ments. The Dispatcher mechanism starts the execution of accepted service requests on allocated
VMs. The Service Request Monitor mechanism keeps track of the execution progress of service
requests. Multiple VMs can be started and stopped on demand on a single physical machine to
meet accepted service requests, hence providing maximum flexibility to configure various partitions
of resources on the same physical machine to different specific requirements of service requests. In
addition, multiple VMs can concurrently run applications based on different operating system envir-
onments on a single physical machine since the VMs are isolated from one another on the same
physical machine.
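A schematic sketch of how the examiner, pricing, and accounting mechanisms might fit together is given below; the class, prices, and capacities are hypothetical and are not taken from the cited architecture.

# A schematic sketch of the admission-control, pricing, and accounting roles
# described above. All names and numbers are hypothetical, not from [11].
class DataCenter:
    def __init__(self, total_vms, price_per_vm_hour):
        self.free_vms = total_vms
        self.price = price_per_vm_hour
        self.ledger = {}                          # user -> accumulated charge

    def admit(self, user, vms_needed, hours):
        """Service Request Examiner: accept only if capacity (and the SLA) can be met."""
        if vms_needed > self.free_vms:
            return False                          # reject rather than risk SLA violations
        self.free_vms -= vms_needed
        # Accounting: record the expected usage so the final cost can be charged.
        self.ledger[user] = self.ledger.get(user, 0.0) + vms_needed * hours * self.price
        return True

dc = DataCenter(total_vms=100, price_per_vm_hour=0.10)
print(dc.admit("alice", vms_needed=30, hours=5))   # True: 30 VMs reserved
print(dc.admit("bob",   vms_needed=90, hours=2))   # False: not enough capacity left
print(dc.ledger)                                   # {'alice': 15.0}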
be greater importance on customers since they pay to access services in clouds. In addition, the state
of the art in cloud computing has no or limited support for dynamic negotiation of SLAs between par-
ticipants and mechanisms for automatic allocation of resources to multiple competing requests. Nego-
tiation mechanisms are needed to respond to alternate offers protocol for establishing SLAs [72].
Commercial cloud offerings must be able to support customer-driven service management based
on customer profiles and requested service requirements. Commercial clouds define computational
risk management tactics to identify, assess, and manage risks involved in the execution of applica-
tions with regard to service requirements and customer needs. The cloud also derives appropriate
market-based resource management strategies that encompass both customer-driven service manage-
ment and computational risk management to sustain SLA-oriented resource allocation. The system
incorporates autonomic resource management models that effectively self-manage changes in service
requirements to satisfy both new service demands and existing service obligations, and leverage VM
technology to dynamically assign resource shares according to service requirements.
FIGURE 4.17
Virtualized servers, storage, and network for cloud platform construction; infrastructure services are delivered over a virtualized infrastructure with whitebox management.
(Courtesy of Zhong-Yuan Qin, SouthEast University, China)
Using VMs in a cloud computing platform ensures extreme flexibility for users. As the comput-
ing resources are shared by many users, a method is required to maximize the users’ privileges and
still keep them separated safely. Traditional sharing of cluster resources depends on the user and
group mechanism on a system. Such sharing is not flexible. Users cannot customize the system
for their special purposes. Operating systems cannot be changed. The separation is not complete.
Table 4.4 Virtualized Resources in Compute, Storage, and Network Clouds [4]
Compute cloud with virtual cluster of servers:
• AWS: x86 instruction set, Xen VMs; resource elasticity allows scalability through a virtual cluster, or a third party such as RightScale must provide the cluster.
• Microsoft Azure: Common Language Runtime VMs provisioned by declarative descriptions.
• GAE: predefined application framework handlers written in Python; automatic scaling up and down; server failover inconsistent with the web applications.
Storage cloud with virtual storage:
• AWS: models for block store (EBS) and augmented key/blob store (SimpleDB); automatic scaling varies from EBS to fully automatic (SimpleDB, S3).
• Microsoft Azure: SQL Data Services (a restricted view of SQL Server); Azure storage service.
• GAE: MegaStore/BigTable.
Network cloud services:
• AWS: declarative IP-level topology; placement details hidden; security groups restricting communication; availability zones isolate network failure; elastic IP applied.
• Microsoft Azure: automatic, based on the user's declarative descriptions or roles of application components.
• GAE: fixed topology to accommodate the three-tier web application structure; scaling up and down is automatic and programmer-invisible.
FIGURE 4.18
Recovery overhead of a conventional disaster recovery scheme, compared with that required to recover from
live migration of VMs.
An environment that meets one user’s requirements often cannot satisfy another user. Virtualization
allows users to have full privileges while keeping them separate.
Users have full access to their own VMs, which are completely separate from other users’ VMs.
Multiple VMs can be mounted on the same physical server. Different VMs may run with different
OSes. We also need to establish the virtual disk storage and virtual networks needed by the VMs.
The virtualized resources form a resource pool. The virtualization is carried out by special servers
dedicated to generating the virtualized resource pool. The virtualized infrastructure (black box in
the middle) is built with many virtualizing integration managers. These managers handle loads,
resources, security, data, and provisioning functions. Figure 4.18 shows two VM platforms. Each
platform carries out a virtual solution to a user job. All cloud services are managed in the boxes at
the top.
We discussed disaster recovery in Chapters 2 and 3. The cloning of VMs offers an effective
solution. The idea is to make a clone VM on a remote server for every running VM on a local server.
Among all the clone VMs, only one needs to be active. The remote VM should be in a suspended
mode. A cloud control center should be able to activate this clone VM in case of failure of the origi-
nal VM, taking a snapshot of the VM to enable live migration in a minimal amount of time. The
migrated VM can run on a shared Internet connection. Only updated data and modified states are sent
to the suspended VM to update its state. The Recovery Point Objective (RPO) and Recovery Time
Objective (RTO) are affected by the number of snapshots taken. Security of the VMs should be
enforced during live migration of VMs.
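The incremental update of the suspended clone can be sketched with a simple dirty-state diff. The toy Python example below uses a dictionary of page versions to stand in for VM memory state (all names are hypothetical); real systems track dirty pages at the hypervisor level.

# A toy sketch of the "send only updated state" idea behind keeping a
# suspended clone VM in sync with the running primary VM.
def dirty_pages(previous, current):
    """Return only the pages that changed since the last synchronization."""
    return {p: v for p, v in current.items() if previous.get(p) != v}

primary_state = {"page0": 1, "page1": 1, "page2": 1}
clone_state   = dict(primary_state)            # the clone starts as an exact copy

primary_state["page1"] = 2                     # the running VM keeps mutating state
primary_state["page3"] = 1

delta = dirty_pages(clone_state, primary_state)
clone_state.update(delta)                      # ship only the delta to the clone
print(delta)                                   # {'page1': 2, 'page3': 1}
print(clone_state == primary_state)            # True: the clone can take over (RPO ~ last sync)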
attack for VM migrations. In general, passive attacks steal sensitive data or passwords. Active
attacks may manipulate kernel data structures which will cause major damage to cloud servers. We
will study all of these security and privacy problems on clouds in Section 4.5.
OVF also defines a transport mechanism for VM templates, and can apply to different virtualiza-
tion platforms with different levels of virtualization. In terms of cloud standardization, we suggest
the ability for virtual appliances to run on any virtual platform. We also need to enable VMs to run
on heterogeneous hardware platform hypervisors. This requires hypervisor-agnostic VMs. We also
need to realize cross-platform live migration between x86 Intel and AMD technologies and support
legacy hardware for load balancing. All these issues are wide open for further research.
FIGURE 4.19
Roles of individual and organizational users and their interaction with cloud providers under various cloud
service models.
Table 4.5 Five Major Cloud Platforms and Their Service Offerings [36]
Model IBM Amazon Google Microsoft Salesforce
Note: WCA: WebSphere CloudBurst Appliance; RC2: Research Compute Cloud; RAD: Rational Application Developer;
SOA: Service-Oriented Architecture; TSAM: Tivoli Service Automation Manager; EC2: Elastic Compute Cloud; S3: Simple
Storage Service; SQS: Simple Queue Service; GAE: Google App Engine; AWS: Amazon Web Services; SQL: Structured
Query Language; EBS: Elastic Block Store; CRM: Consumer Relationship Management.
Amazon pioneered the IaaS business in supporting e-commerce and cloud applications by
millions of customers simultaneously. The elasticity in the Amazon cloud comes from the flexibi-
lity provided by the hardware and software services. EC2 provides an environment for running
virtual servers on demand. S3 provides unlimited online storage space. Both EC2 and S3 are sup-
ported in the AWS platform. Microsoft offers the Azure platform for cloud applications. It has also
supported the .NET service, dynamic CRM, Hotmail, and SQL applications. Salesforce.com offers
extensive SaaS applications for online CRM applications using its Force.com platforms.
As Table 4.5 shows, all IaaS, PaaS, and SaaS models allow users to access services over the
Internet, relying entirely on the infrastructures of the cloud service providers. These models are
offered based on various SLAs between the providers and the users. SLAs are more common in net-
work services as they account for the QoS characteristics of network services. For cloud computing
services, it is difficult to find a reasonable precedent for negotiating an SLA. In a broader sense, the
SLAs for cloud computing address service availability, data integrity, privacy, and security protection.
Blank spaces in the table refer to unknown or underdeveloped features.
FIGURE 4.20
Google cloud platform and major building blocks; the blocks shown are large clusters of low-cost servers running the Google cloud infrastructure (GFS chunkservers, scheduler slaves, BigTable servers, and MapReduce jobs) under Linux.
(Courtesy of Kang Chen, Tsinghua University, China)
web interface provided by each application. Third-party application providers can use GAE to
build cloud applications for providing services. The applications all run in data centers under tight
management by Google engineers. Inside each data center, there are thousands of servers forming
different clusters.
Google is one of the larger cloud application providers, although its fundamental service pro-
gram is private and outside people cannot use the Google infrastructure to build their own service.
The building blocks of Google’s cloud computing application include the Google File System for
storing large amounts of data, the MapReduce programming framework for application developers,
Chubby for distributed application lock services, and BigTable as a storage service for accessing
structural or semistructural data. With these building blocks, Google has built many cloud applica-
tions. Figure 4.20 shows the overall architecture of the Google cloud infrastructure. A typical cluster
configuration can run the Google File System, MapReduce jobs, and BigTable servers for structure
data. Extra services such as Chubby for distributed locks can also run in the clusters.
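For readers new to MapReduce, the following toy, single-process Python example shows the map, group-by-key (shuffle), and reduce structure of the model; it is an illustration of the idea only, not Google's implementation.

# A toy, single-process illustration of the MapReduce structure:
# map -> group intermediate pairs by key (shuffle) -> reduce.
from collections import defaultdict

def map_phase(doc):
    return [(word, 1) for word in doc.split()]

def reduce_phase(word, counts):
    return word, sum(counts)

docs = ["cloud computing on clusters", "virtual clusters in the cloud"]

groups = defaultdict(list)          # shuffle: group (key, value) pairs by key
for doc in docs:
    for word, count in map_phase(doc):
        groups[word].append(count)

print(dict(reduce_phase(w, c) for w, c in groups.items()))
# e.g. {'cloud': 2, 'computing': 1, 'on': 1, 'clusters': 2, 'virtual': 1, 'in': 1, 'the': 1}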
GAE runs the user program on Google’s infrastructure. As it is a platform running third-party
programs, application developers now do not need to worry about the maintenance of servers. GAE
can be thought of as the combination of several software components. The frontend is an application
framework which is similar to other web application frameworks such as ASP, J2EE, and JSP. At the
time of this writing, GAE supports Python and Java programming environments. The applications run in a manner similar to web application containers. The frontend can be used as the dynamic web serving
infrastructure which can provide the full support of common technologies.
a. The datastore offers object-oriented, distributed, structured data storage services based on
BigTable techniques. The datastore secures data management operations.
b. The application runtime environment offers a platform for scalable web programming and
execution. It supports two development languages: Python and Java.
c. The software development kit (SDK) is used for local application development. The SDK allows
users to execute test runs of local applications and upload application code.
d. The administration console is used for easy management of user application development cycles,
instead of for physical resource management.
e. The GAE web service infrastructure provides special interfaces to guarantee flexible use and
management of storage and network resources by GAE.
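As a concrete illustration of items (b) and (c), a request handler in the style of GAE's classic Python runtime is sketched below. It assumes the App Engine SDK is installed and omits the app.yaml configuration that a deployable application also needs.

# A minimal handler in the style of GAE's classic Python runtime (the webapp
# framework). Shown only as a sketch; requires the App Engine SDK to run.
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class MainPage(webapp.RequestHandler):
    def get(self):
        # The runtime environment (item b) executes this handler on demand.
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write('Hello from Google App Engine')

application = webapp.WSGIApplication([('/', MainPage)], debug=True)

def main():
    run_wsgi_app(application)   # the SDK (item c) serves this locally before upload

if __name__ == '__main__':
    main()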
Google offers essentially free GAE services to all Gmail account owners. You can register for a
GAE account or use your Gmail account name to sign up for the service. The service is free within
a quota. If you exceed the quota, the page instructs you on how to pay for the service. Then you
download the SDK and read the Python or Java guide to get started. Note that GAE only accepts
Python and Java programming languages. The platform does not provide any IaaS services,
unlike Amazon, which offers Iaas and PaaS. This model allows the user to deploy user-built appli-
cations on top of the cloud infrastructure that are built using the programming languages and soft-
ware tools supported by the provider (e.g., Java, Python). Azure does this similarly for .NET. The
user does not manage the underlying cloud infrastructure. The cloud provider facilitates support of
application development, testing, and operation support on a well-defined service platform.
FIGURE 4.21
Amazon cloud computing infrastructure (key services identified here include EC2 instances, EBS volumes, SimpleDB, S3 storage, and SQS; many more are listed in Table 4.6).
(Courtesy of Kang Chen, Tsinghua University, China)
interface which can be used to support traditional applications. SQS stands for Simple Queue
Service, and its job is to ensure a reliable message service between two processes. The message can
be kept reliably even when the receiver processes are not running. Users can access their objects
through SOAP with either browsers or other client programs which support the SOAP standard.
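The queueing semantics just described, namely a decoupled producer and consumer with messages retained until they are explicitly deleted, can be mimicked by the small in-memory sketch below; it imitates the behavior but is not Amazon's SQS API.

# An in-memory sketch of SQS-style semantics: the producer and consumer are
# decoupled, and a message stays in the queue until explicitly deleted after
# successful processing. This mimics the behavior, not Amazon's API.
import uuid

class ToyQueue:
    def __init__(self):
        self._messages = {}                        # message id -> body

    def send(self, body):
        msg_id = str(uuid.uuid4())
        self._messages[msg_id] = body              # kept even if no receiver is running
        return msg_id

    def receive(self):
        # Return one message without removing it; redelivered if never deleted.
        return next(iter(self._messages.items()), None)

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)           # acknowledge successful processing

q = ToyQueue()
q.send("resize image 42")                          # producer runs now...
msg = q.receive()                                  # ...the consumer may run much later
if msg:
    msg_id, body = msg
    print("processing:", body)
    q.delete(msg_id)                               # only now does the message disappear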
Table 4.6 summarizes the service offerings by AWS in 12 application tracks. Details of EC2,
S3, and EBS are available in Chapter 6 where we discuss programming examples. Amazon offers
queuing and notification services (SQS and SNS), which are implemented in the AWS cloud. Note that brokering systems run very efficiently in clouds and offer a striking model for controlling sensors and providing office support for smartphones and tablets. Different from Google, Amazon provides
a more flexible cloud computing platform for developers to build cloud applications. Small and
medium-size companies can put their business on the Amazon cloud platform. Using the AWS plat-
form, they can service large numbers of Internet users and make profits through those paid services.
ELB automatically distributes incoming application traffic across multiple Amazon EC2 instances and allows users to avoid nonoperating nodes and to equalize load on functioning images. Both auto-
scaling and ELB are enabled by CloudWatch which monitors running instances. CloudWatch is a
web service that provides monitoring for AWS cloud resources, starting with Amazon EC2. It pro-
vides customers with visibility into resource utilization, operational performance, and overall demand
patterns, including metrics such as CPU utilization, disk reads and writes, and network traffic.
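The load-balancing behavior described here amounts to spreading traffic across healthy instances while skipping nonoperating ones. The toy sketch below uses hypothetical instance names and health flags to illustrate that round-robin policy; it is not the ELB or CloudWatch API.

# A toy sketch of round-robin load balancing that skips unhealthy instances,
# as reported by monitoring. Instance names and health data are hypothetical.
from itertools import cycle

instances = {"i-01": True, "i-02": False, "i-03": True}   # id -> healthy?

def balanced(instance_health):
    """Yield healthy instance ids in round-robin order."""
    for inst in cycle(instance_health):
        if instance_health[inst]:
            yield inst

targets = balanced(instances)
for request in range(5):
    print(f"request {request} -> {next(targets)}")   # i-01, i-03, i-01, i-03, i-01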
Amazon (like Azure) offers a Relational Database Service (RDS) with a messaging interface to
be covered in Section 4.1. The Elastic MapReduce capability is equivalent to Hadoop running on
the basic EC2 offering. AWS Import/Export allows one to ship large volumes of data to and from
EC2 by shipping physical disks; it is well known that this is often the highest bandwidth connection
between geographically distant systems. Amazon CloudFront implements a content distribution
network. Amazon DevPay is a simple-to-use online billing and account management service that
makes it easy for businesses to sell applications that are built into or run on top of AWS.
FPS provides developers of commercial systems on AWS with a convenient way to charge Amazon’s
customers that use such services built on AWS. Customers can pay using the same login credentials,
shipping address, and payment information they already have on file with Amazon. The FWS allows
merchants to access Amazon’s fulfillment capabilities through a simple web service interface.
Merchants can send order information to Amazon to fulfill customer orders on their behalf. In July 2010,
Amazon offered MPI clusters and cluster compute instances. The AWS cluster compute instances use
hardware-assisted virtualization instead of the para-virtualization used by other instance types and
require booting from EBS. Users are free to create a new AMI as needed.
FIGURE 4.22
Microsoft Windows Azure platform for cloud computing; user applications such as Windows Live, Office Live, Exchange Online, SharePoint Online, and Dynamic CRM Online run on top of Windows Azure.
(Courtesy of Microsoft, 2010, http://www.microsoft.com/windowsazure)
• SQL Azure This function makes it easier for users to access and use the relational database associated with the SQL server in the cloud.
• SharePoint service This provides a scalable and manageable platform for users to develop their
special business applications in upgraded web services.
• Dynamic CRM service This provides software developers a business platform in managing
CRM applications in financing, marketing, and sales and promotions.
All these cloud services in Azure can interact with traditional Microsoft software applications,
such as Windows Live, Office Live, Exchange online, SharePoint online, and dynamic CRM
online. The Azure platform applies the standard web communication protocols SOAP and REST.
The Azure service applications allow users to integrate the cloud application with other platforms
or third-party clouds. You can download the Azure development kit to run a local version
of Azure. The powerful SDK allows Azure applications to be developed and debugged on the
Windows hosts.
FIGURE 4.23
A stack of six layers of cloud services and their providers.
(Courtesy of T. Chou, Active Book Express, 2010 [16] )
• IT administrators/cloud providers: monitor SLAs (IaaS); monitor SLAs and enable service platforms (PaaS); monitor SLAs and deploy software (SaaS).
• Software developers (vendors): deploy and store data (IaaS); enable platforms via configurators and APIs (PaaS); develop and deploy software (SaaS).
• End users or business users: deploy and store data (IaaS); develop and test web software (PaaS); use business software (SaaS).
performance is the primary concern. From the end users’ perspective, the quality of services, including
security, is the most important.
the file system in a traditional single machine. Other layers running on top of the file system are the
layers for executing cloud computing applications. They include the database storage system, program-
ming for large-scale clusters, and data query language support. The next layers are the components in
the software stack.
In the EC2 platform, some predefined VM templates are also provided. Users can choose different
kinds of VMs from the templates. IBM’s Blue Cloud does not provide any VM templates. In general,
any type of VM can run on top of Xen. Microsoft also applies virtualization in its Azure cloud platform.
The provider should offer resource-economic services. Power-efficient schemes for caching, query pro-
cessing, and thermal management are mandatory due to increasing energy waste by heat dissipation
from data centers. Public or private clouds promise to streamline the on-demand provisioning of soft-
ware, hardware, and data as a service, achieving economies of scale in IT deployment and operation.
FIGURE 4.24
Three cases of cloud resource provisioning without elasticity, plotting resources (capacity versus demand) against time in days: (a) heavy waste due to overprovisioning for peak load, (b) underprovisioning, and (c) under- and then overprovisioning.
(Courtesy of Armbrust, et al., UC Berkeley, 2009 [4])
FIGURE 4.25
EC2 performance results on the AWS EC2 platform, collected from experiments at the University of Southern California using three resource provisioning methods: (a) demand-driven, (b) event-driven, and (c) popularity-driven. Each plot traces CPU utilization (CPU_load, in percent) and the number of provisioned nodes (Num_nodes) over time.
(Courtesy of Ken Wu, USC)
FIGURE 4.26
Cloud resource deployment using an IGG (intergrid gateway) to allocate VMs from a local cluster and to interact with the IGG of a public cloud provider: (1) the user application requests VMs, (2) the IGG enacts the leases on physical resources through the VIE (cloud APIs), and (3) the application is deployed.
(Courtesy of Constanzo, et al. © IEEE [21])
deploy applications in three steps: (1) requesting the VMs, (2) enacting the leases, and (3) deploying
the VMs as requested. Under peak demand, this IGG interacts with another IGG that can allocate
resources from a cloud computing provider.
A grid has predefined peering arrangements with other grids, which the IGG manages.
Through multiple IGGs, the system coordinates the use of InterGrid resources. An IGG is aware
of the peering terms with other grids, selects suitable grids that can provide the required
resources, and replies to requests from other IGGs. Request redirection policies determine which
peering grid InterGrid selects to process a request and a price for which that grid will perform
the task. An IGG can also allocate resources from a cloud provider. The cloud system creates a
virtual environment to help users deploy their applications. These applications use the distributed
grid resources.
The InterGrid allocates and provides a distributed virtual environment (DVE). This is a virtual
cluster of VMs that runs isolated from other virtual clusters. A component called the DVE manager
performs resource allocation and management on behalf of specific user applications. The core com-
ponent of the IGG is a scheduler for implementing provisioning policies and peering with other
gateways. The communication component provides an asynchronous message-passing mechanism.
Received messages are handled in parallel by a thread pool.
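The asynchronous, thread-pool style of message handling mentioned here can be sketched with Python's standard library; the example below is a minimal illustration of the pattern, not the IGG's actual communication component.

# A minimal sketch of the pattern described above: received messages are
# queued and handled in parallel by a thread pool.
import queue
from concurrent.futures import ThreadPoolExecutor

inbox = queue.Queue()

def handle(message):
    # Placeholder for the scheduling or peering logic triggered by a message.
    return f"handled {message}"

def receiver_loop(pool, shutdown="STOP"):
    futures = []
    while True:
        msg = inbox.get()               # asynchronous: senders never block on handling
        if msg == shutdown:
            break
        futures.append(pool.submit(handle, msg))
    return futures

with ThreadPoolExecutor(max_workers=4) as pool:
    for m in ["request-vms", "enact-lease", "redirect-load", "STOP"]:
        inbox.put(m)
    for f in receiver_loop(pool):
        print(f.result())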
• GFS (Google File System): very large sustainable reading and writing bandwidth, mostly continuous accessing instead of random accessing. The programming interface is similar to that of the POSIX file system interface.
• HDFS (Hadoop Distributed File System): the open source clone of GFS, written in Java. The programming interfaces are similar to POSIX but not identical.
• Amazon S3 and EBS: S3 is used for retrieving and storing data from/to remote servers. EBS is built on top of S3 for using virtual disks in running EC2 instances.
storage or distributed file systems. The scale of such a database might be quite large for processing
huge amounts of data. The main purpose is to store the data in structural or semi-structural ways
so that application developers can use it easily and build their applications rapidly. Traditional data-
bases will hit a performance bottleneck when the system is expanded to a larger scale. How-
ever, some real applications do not need such strong consistency. The scale of such databases can
be quite large. Typical cloud databases include BigTable from Google, SimpleDB from Amazon,
and the SQL service from Microsoft Azure.
FIGURE 4.27
Interactions among VM managers for cloud creation and management over the local physical infrastructure and Grid'5000; the manager provides a public API for users to submit and control the VMs.
(Courtesy of Constanzo, et al. © IEEE [21])
FIGURE 4.28
The InterGrid test bed over the French Grid'5000, located in nine cities across France: (a) Grid'5000 sites (including Lille, Nancy, Rennes, Orsay, Bordeaux, Toulouse, Grenoble, and Sophia) connected by 10 Gbps links, with site core counts ranging from 272 to 650; (b) peering arrangements among two-, three-, and four-gateway (IGG) configurations.
(Courtesy of Constanzo, et al. © IEEE [21])
To prevent the gateways from interfering with real Grid’5000 users, emulated VM managers were
implemented to instantiate fictitious VMs. The number of emulated hosts is limited by the core number at
each site. A balanced workload was configured among the sites. The maximum number of VMs requested
does not exceed the number of cores in any site. The load characteristics are shown in Figure 4.29 under
a four-gateway scenario. The teal bars indicate each grid site’s load. The magenta bars show the load
when gateways redirect requests to one another. The green bars correspond to the amount of load each
gateway accepts from other gateways. The brown bars represent the amount of load that is redirected.
The results show that the loading policy can balance the load across the nine sites. Rennes, a site with a
heavy load, benefits from peering with other gateways as the gateway redirects a great share of its load to
other sites.
(Bar chart: normalized load, on a scale of 0 to 2.0, at the Nancy, Rennes, Orsay, and Sophia sites, with bars for load without peering, load when peering, load accepted, and load redirected.)
FIGURE 4.29
Cloud loading results at four gateways at resource sites in the Grid’5000 system.
(Courtesy of Constanzo, et al. © IEEE [21] )
Amazon has data centers in the United States (e.g., one on the East Coast and another on the West Coast)
and Europe. However, currently Amazon expects its cloud customers (i.e., SaaS providers) to
express a preference regarding where they want their application services to be hosted. Amazon
does not provide seamless/automatic mechanisms for scaling its hosted services across multiple
geographically distributed data centers.
This approach has many shortcomings. First, it is difficult for cloud customers to determine in
advance the best location for hosting their services as they may not know the origin of consumers
of their services. Second, SaaS providers may not be able to meet the QoS expectations of their ser-
vice consumers originating from multiple geographical locations. This necessitates building mechan-
isms for seamless federation of data centers of a cloud provider or providers supporting dynamic
scaling of applications across multiple domains in order to meet QoS targets of cloud customers.
Figure 4.30 shows the high-level components of the Melbourne group’s proposed InterCloud
architecture.
In addition, no single cloud infrastructure provider will be able to establish its data centers
at all possible locations throughout the world. As a result, cloud application service (SaaS) pro-
viders will have difficulty in meeting QoS expectations for all their consumers. Hence, they
would like to make use of services of multiple cloud infrastructure service providers who can
provide better support for their specific consumer needs. This kind of requirement often arises
in enterprises with global operations and applications such as Internet services, media hosting,
and Web 2.0 applications. This necessitates federation of cloud infrastructure service providers
for seamless provisioning of services across different cloud providers. To realize this, the
Cloudbus Project at the University of Melbourne has proposed InterCloud architecture [12] sup-
porting brokering and exchange of cloud resources for scaling applications across multiple
clouds.
By realizing the InterCloud architectural principles in the mechanisms of their offerings, cloud providers
will be able to dynamically expand or resize their provisioning capability based on sudden spikes
in workload demands by leasing available computational and storage capabilities from other cloud
(Figure: enterprise IT consumers reach compute and storage clouds through application brokers (Broker 1 … Broker N) and an enterprise resource manager acting as a proxy; a global cloud exchange, comprising a directory, a bank, and an auctioneer, mediates between the brokers and the participating clouds.)
FIGURE 4.30
Inter-cloud exchange of cloud resources through brokering.
(Courtesy of R. Buyya, et al., University of Melbourne [12] )
service providers; operate as part of a market-driven resource leasing federation, where application
service providers such as Salesforce.com host their services based on negotiated SLA contracts dri-
ven by competitive market prices; and deliver on-demand, reliable, cost-effective, and QoS-aware
services based on virtualization technologies while ensuring high QoS standards and minimizing
service costs. They need to be able to utilize market-based utility models as the basis for provi-
sioning of virtualized software services and federated hardware infrastructure among users with
heterogeneous applications.
The architecture consists of client brokering and coordinator services that support utility-driven federation of
clouds: application scheduling, resource allocation, and migration of workloads. The architecture
cohesively couples the administratively and topologically distributed storage and compute capabil-
ities of clouds as part of a single resource leasing abstraction. The system will ease the cross-
domain capability integration for on-demand, flexible, energy-efficient, and reliable access to the
infrastructure based on virtualization technology [6,75].
The Cloud Exchange (CEx) acts as a market maker for bringing together service producers and
consumers. It aggregates the infrastructure demands from application brokers and evaluates them
against the available supply currently published by the cloud coordinators. It supports trading of
cloud services based on competitive economic models such as commodity markets and auctions.
CEx allows participants to locate providers and consumers with fitting offers. Such markets enable
services to be commoditized, and thus will pave the way for creation of dynamic market infrastruc-
ture for trading based on SLAs. An SLA specifies the details of the service to be provided in terms
of metrics agreed upon by all parties, and incentives and penalties for meeting and violating the
expectations, respectively. The availability of a banking system within the market ensures that
financial transactions pertaining to SLAs between participants are carried out in a secure and
dependable environment.
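As a rough illustration of the SLA notion just described, the following Java sketch models an agreement as a set of metric targets with an incentive and a per-violation penalty; the class, its fields, and the settlement rule are simplifying assumptions made for exposition only, not the contract format of any real cloud exchange.

import java.util.Map;

// Illustrative SLA model: agreed metric targets plus incentives and penalties.
public class ServiceLevelAgreement {
    private final Map<String, Double> targets;   // e.g., "availability" -> 0.999
    private final double penaltyPerViolation;    // credit owed per violated metric
    private final double incentive;              // bonus when every target is met

    public ServiceLevelAgreement(Map<String, Double> targets,
                                 double penaltyPerViolation, double incentive) {
        this.targets = targets;
        this.penaltyPerViolation = penaltyPerViolation;
        this.incentive = incentive;
    }

    // Settle one billing period by comparing measured metrics against targets;
    // a positive result rewards the provider, a negative one compensates the user.
    public double settle(Map<String, Double> measured) {
        long violations = targets.entrySet().stream()
                .filter(e -> measured.getOrDefault(e.getKey(), 0.0) < e.getValue())
                .count();
        return violations == 0 ? incentive : -violations * penaltyPerViolation;
    }
}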
(Figure: part (a) depicts the nested cloud service models; part (b) lists the protection measures applied at each level, including governance, risk management and compliance, identity and access management, VM and patch management, network IDS/IPS, firewalls, anti-DDoS, QoS, DNSSEC, and trusted computing based on hardware and software roots of trust with trust-overlay and reputation systems. Acronyms: IPS, intrusion-prevention system; RoT, root of trust; DDoS, distributed denial of service; DNSSEC, Domain Name System Security Extensions; QoS, quality of service.)
FIGURE 4.31
Cloud service models on the left (a) and corresponding security measures on the right (b); the IaaS is at the
innermost level, PaaS is at the middle level, and SaaS is at the outermost level, including all hardware,
software, datasets, and networking resources.
(Courtesy of Hwang and Li [36] )
Thus, security defenses are needed to protect all cluster servers and data centers. Here are some
cloud components that demand special security protection:
• Protection of servers from malicious software attacks such as worms, viruses, and malware
• Protection of hypervisors or VM monitors from software-based attacks and vulnerabilities
• Protection of VMs and monitors from service disruption and DoS attacks
• Protection of data and information from theft, corruption, and natural disasters
• Providing authenticated and authorized access to critical data and services
Secure data centers and computer buildings: Choose a hazard-free location and enforce building safety; avoid windows, keep a buffer zone around the site, and use bomb detection, camera surveillance, earthquake-proof construction, etc.
Use redundant utilities at multiple sites: Multiple power supplies, alternate network connections, multiple databases at separate sites, data consistency, data watermarking, user authentication, etc.
Trust delegation and negotiation: Cross certificates to delegate trust across PKI domains for various data centers; trust negotiation among certificate authorities (CAs) to resolve policy conflicts.
Worm containment and DDoS defense: Internet worm containment and distributed defense against DDoS attacks to secure all data centers and cloud platforms.
Reputation system for data centers: A reputation system could be built with P2P technology; one can build a hierarchy of reputation systems from data centers down to distributed file systems.
Fine-grained file access control: Fine-grained access control at the file or object level; this adds security protection beyond firewalls and IDSes.
Copyright protection and piracy prevention: Piracy prevention achieved with peer collusion prevention, filtering of poisoned content, nondestructive read, alteration detection, etc.
Privacy protection: Uses double authentication, biometric identification, intrusion detection and disaster recovery, privacy enforcement by data watermarking, data classification, etc.
The hypervisor provides visibility of the guest OS, with complete guest isolation. Fault containment and
failure isolation of VMs provide a more secure and robust environment. Malicious intrusions may
destroy valuable hosts, networks, and storage resources. Internet anomalies found in routers, gate-
ways, and distributed hosts may stop cloud services. Trust negotiation is often done at the SLA
level. Public Key Infrastructure (PKI) services could be augmented with data-center reputation sys-
tems. Worm and DDoS attacks must be contained. It is harder to establish security in the cloud
because all data and software are shared by default.
FIGURE 4.32
The typical security structure coordinated by a secured gateway plus external firewalls to safeguard the
access of public or private clouds.
(Courtesy of Vordel Company )
Signature-based intrusion detection technology is more mature, but requires frequent updates of the signature databases. Network anomaly detection reveals abnormal traffic patterns, such as unauthorized episodes of TCP connection
sequences, against normal traffic patterns. Distributed IDSes are needed to combat both types of
intrusions.
(a) The traffic flow pattern of a DDoS attack launched from attacking zombies toward a victim; (b) the corresponding attack traffic flow tree over 10 routers (R0–R9), with the malicious flows converging toward the victim at R0.
FIGURE 4.33
DDoS attacks and defense by change-point detection at all routers on the flooding tree.
(Courtesy of Chen, Hwang, and Ku [15] )
This change-point detection scheme is suitable for protecting cloud core networks. The provider-level cooperation
eliminates the need for intervention by edge networks.
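The change-point idea behind this defense can be illustrated with a simple CUSUM-style detector applied to per-router traffic counts, sketched in Java below; the slack and threshold parameters and the alarm rule are illustrative assumptions and do not reproduce the collaborative detection algorithm of Chen, Hwang, and Ku [15].

// Minimal CUSUM-style change-point detector over a stream of traffic counts.
// A sustained upward deviation from the running mean raises an alarm, which a
// router could then share with its neighbors on the flooding tree.
public class ChangePointDetector {
    private double mean = 0.0;      // running estimate of normal traffic
    private double cusum = 0.0;     // accumulated positive deviation
    private long samples = 0;
    private final double slack;     // tolerated deviation per sample
    private final double threshold; // alarm threshold for the accumulated sum

    public ChangePointDetector(double slack, double threshold) {
        this.slack = slack;
        this.threshold = threshold;
    }

    // Feed one traffic measurement; returns true if a change point is suspected.
    public boolean observe(double packetsPerSecond) {
        samples++;
        mean += (packetsPerSecond - mean) / samples;              // update the mean
        cusum = Math.max(0.0, cusum + packetsPerSecond - mean - slack);
        return cusum > threshold;
    }

    public static void main(String[] args) {
        ChangePointDetector detector = new ChangePointDetector(5.0, 50.0);
        double[] traffic = {100, 102, 98, 101, 99, 180, 185, 190, 200, 210};
        for (double t : traffic) {
            System.out.println(t + " pps -> attack suspected: " + detector.observe(t));
        }
    }
}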
FIGURE 4.34
A VM migrating from host A to host B through a vulnerable network threatened by a man-in-the-middle attack
to modify the VM template and OS state.
• A way to prevent the ISP or cloud service provider from invading users’ privacy
• Personal firewalls at user ends to keep shared data sets from Java, JavaScript, and ActiveX
applets
• A privacy policy consistent with the cloud service provider’s policy, to protect against identity
theft, spyware, and web bugs
• VPN channels between resource sites to secure transmission of critical data objects
(Figure: the data owner applies forward data coloring, adding color drops generated from the parameters En, Ex, and He to the user data before it enters the cloud; after negotiation and confirmation with the service provider, backward color extraction and matching (En', Ex', He') verifies the colored data.)
FIGURE 4.35
Data coloring with cloud watermarking for trust management at various security clearance levels in data
centers.
(Courtesy of Hwang and Li [36] )
Encrypting entire data sets, by contrast, is computationally expensive. Data coloring takes only a minimal number of calculations to color or decolor the data objects. Cryptography and watermarking or coloring can be used jointly in a cloud environment.
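To suggest why coloring is so much cheaper than full encryption, the following Java sketch attaches, and later re-checks, a short color tag derived from the owner's credential; this simplified scheme (a truncated SHA-256 digest standing in for the color drops) is an assumption made purely for illustration and is not the cloud-drop generator of Hwang and Li [36].

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Simplified illustration of data coloring: a short color tag derived from the
// owner's credential is attached to a data object and checked later. Only a
// hash is computed, so coloring and decoloring cost far less than encrypting
// the whole object.
public class DataColoring {
    // Derive a short color from the owner's secret credential.
    static byte[] color(String ownerSecret) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(ownerSecret.getBytes(StandardCharsets.UTF_8));
        return Arrays.copyOf(digest, 8);   // keep only a few "color drops"
    }

    // Color matching: ownership is confirmed by recomputing the color.
    static boolean matches(byte[] storedColor, String claimedSecret)
            throws NoSuchAlgorithmException {
        return Arrays.equals(storedColor, color(claimedSecret));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] c = color("alice-credential");          // hypothetical credentials
        System.out.println("owner verified: " + matches(c, "alice-credential"));
        System.out.println("imposter verified: " + matches(c, "mallory-credential"));
    }
}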
FIGURE 4.36
Design options of reputation systems for social networks and cloud platforms.
A centralized reputation system is easier to implement, but demands more powerful and reliable
server resources; a distributed reputation system is much more complex to build. Distributed systems
are more scalable and reliable in terms of handling failures. At the second tier, reputation systems are
further classified by the scope of reputation evaluation. User-oriented reputation systems focus on
individual users or agents. Most P2P reputation systems belong to this category. In data centers, repu-
tation is modeled for the resource site as a whole. This reputation applies to products or services
offered by the cloud. Commercial reputation systems have been built by eBay, Google, and Amazon
in connection with the services they provide. These are centralized reputation systems.
Distributed reputation systems are mostly developed by academic research communities. Aberer and Despotovic have proposed a model to manage trust in P2P systems. The EigenTrust reputation system was developed at Stanford University using a trust matrix approach. The PeerTrust system was developed at the Georgia Institute of Technology to support e-commerce applications. The PowerTrust system was developed at the University of Southern California based on the power-law characteristics of Internet traffic for P2P applications. Vu et al. proposed a QoS-based ranking system for P2P transactions.
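To make the trust-matrix idea concrete, the following Java sketch computes global reputation scores by power iteration over a row-normalized matrix of local trust values, in the spirit of EigenTrust; the matrix values and iteration count are illustrative assumptions, and the sketch omits the pre-trusted peers and the distributed computation used by the published algorithms.

import java.util.Arrays;

// EigenTrust-style global reputation: repeatedly weight each peer's local trust
// opinions by the reputation of the peer holding them until the vector settles.
public class GlobalReputation {
    // trust[i][j]: how much peer i trusts peer j (each row sums to 1).
    public static double[] compute(double[][] trust, int iterations) {
        int n = trust.length;
        double[] rep = new double[n];
        Arrays.fill(rep, 1.0 / n);                     // start from uniform reputation
        for (int k = 0; k < iterations; k++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                for (int i = 0; i < n; i++) {
                    next[j] += trust[i][j] * rep[i];   // aggregate weighted opinions
                }
            }
            rep = next;
        }
        return rep;
    }

    public static void main(String[] args) {
        double[][] trust = {
            {0.0, 0.7, 0.3},      // hypothetical normalized local trust scores
            {0.5, 0.0, 0.5},
            {0.9, 0.1, 0.0}
        };
        System.out.println(Arrays.toString(compute(trust, 50)));
    }
}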
Such a reputation system can be implemented with a trust overlay network. A hierarchy of P2P reputation systems is suggested
to protect cloud resources at the site level and data objects at the file level. This demands both
coarse-grained and fine-grained access control of shared resources. These reputation systems keep
track of security breaches at all levels.
The reputation system must be designed to benefit both cloud users and data centers. Data objects
used in cloud computing reside in multiple data centers over a SAN. In the past, most reputation
systems were designed for P2P social networking or for online shopping services. These reputation
systems can be converted to protect cloud platform resources or user applications in the cloud. A cen-
tralized reputation system is easier to implement, but demands more powerful and reliable server
resources. Distributed reputation systems are more scalable and reliable in terms of handling failures.
The five security mechanisms presented earlier can be greatly assisted by using a reputation system
specifically designed for data centers.
However, it is possible to add social tools such as reputation systems to support safe cloning of
VMs. Snapshot control is based on the defined RPO. Users demand new security mechanisms to
protect the cloud. For example, one can apply secured information logging, migrate over secured
virtual LANs, and apply ECC-based encryption for secure migration. Sandboxes provide a safe
execution platform for running programs. Further, sandboxes can provide a tightly controlled set of
resources for guest operating systems, which allows a security test bed to test the application code
from third-party vendors.
(Figure: a layered trust overlay: reputation aggregation and integration at the bottom; trust overlay networks spanning cloud resource sites and data centers in the middle; and, at the top, defenses against piracy or network attacks, namely worm containment, DDoS defense, piracy prevention, and hybrid intrusion detection.)
FIGURE 4.37
DHT-based trust overlay networks built over cloud resources provisioned from multiple data centers for trust
management and distributed security enforcement.
(Courtesy of Hwang and Li [36] )
The design is aimed at a trusted cloud environment to ensure high-quality services,
including security.
The cloud security trend is to apply virtualization support for security enforcement in data centers.
Both reputation systems and data watermarking mechanisms can protect data-center access at the coarse-grained level and limit data access at the fine-grained file level. In the long run, a new Security as a Service is desired. This “SaaS” is crucial to the universal acceptance of web-scale cloud computing in personal, business, community, and government applications. Internet clouds are certainly in line with IT globalization and efficient computer outsourcing. However, interoperability among different clouds relies on a common operational standard and a healthy cloud ecosystem.
Acknowledgements
This chapter was primarily written by Kai Hwang of the University of Southern California. Raj Buyya of the University of Melbourne and Kang Chen of Tsinghua University partially contributed to this chapter. Valuable suggestions were made by Geoffrey Fox of Indiana University in revising this chapter.
References
[1] K. Aberer, Z. Despotovic, Managing trust in a peer-to-peer information system, in: ACM CIKM
International Conference on Information and Knowledge Management, 2001.
[2] M. Al-Fares, A. Loukissas, A. Vahdat, A scalable, commodity datacenter network architecture, in: Proceedings
of the ACM SIGCOMM 2008 Conference on Data Communication, Seattle, WA, 17–22 August 2008.
[3] Amazon EC2 and S3. Elastic Compute Cloud (EC2) and Simple Storage Service (S3). http://spatten_presentations.s3.amazonaws.com/s3-on-rails.pdf.
[4] M. Armbrust, A. Fox, R. Griffith, et al., Above the Clouds: A Berkeley View of Cloud Computing, Technical
Report No. UCB/EECS-2009-28, University of California at Berkeley, 10 February 2009.
[5] A. Avetisyan, et al., Open Cirrus: a global cloud computing testbed, IEEE Comput. Mag. (2010) 35–43.
[6] P. Barham, et al., Xen and the art of virtualization, in: Proceedings of the 19th ACM Symposium on
Operating System Principles, ACM Press, New York, 2003.
[7] L. Barroso, J. Dean, U. Holzle, Web search for a planet: the architecture of the Google cluster, IEEE
Micro (2003), doi: 10.1109/MM.2003.1196112.
[8] L. Barroso, U. Holzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale
Machines, Morgan Claypool Publishers, 2009.
[9] G. Boss, P. Malladi, et al., Cloud computing: the BlueCloud project. www.ibm.com/developerworks/
websphere/zones/hipods/, 2007.
[10] R. Buyya, J. Broberg, A. Goscinski (Eds.), Cloud Computing: Principles and Paradigms, Wiley Press,
2011.
[11] R. Buyya, C.S. Yeo, S. Venugopal, Market-oriented cloud computing: vision, hype, and reality for deliver-
ing IT services as computing utilities, in: Proceedings of the 10th IEEE International Conference on High
Performance Computing and Communications (HPCC), Dalian, China, 25–27 September 2008.
[12] R. Buyya, R. Ranjan, R.N. Calheiros, InterCloud: utility-oriented federation of cloud computing environ-
ments for scaling of application services, in: Proceedings of the 10th International Conference on
Algorithms and Architectures for Parallel Processing (ICA3PP 2010, LNCS 608), Busan, South Korea,
21–23 May 2010.
[13] M. Cai, K. Hwang, J. Pan, C. Papadopoulos, WormShield: fast worm signature generation with distributed
fingerprint aggregation, in: IEEE Transactions of Dependable and Secure Computing (TDSC), Vol. 4,
No. 2, April/June 2007, pp. 88–104.
[14] Chase, et al., Dynamic virtual clusters in a grid site manager, in: IEEE 12th Symposium on High-Performance
Distributed Computing (HPDC), 2003.
[15] Y. Chen, K. Hwang, W.S. Ku, Collaborative detection of DDoS attacks over multiple network domains, in:
IEEE Transaction on Parallel and Distributed Systems, Vol. 18, No. 12, December 2007, pp. 1649–1662.
[16] T. Chou, Introduction to Cloud Computing: Business and Technology, Active Book Press, 2010.
[17] C. Clark, K. Fraser, J. Hansen, et al., Live migration of virtual machines, in: Proceedings of the Second
Symposium on Networked Systems Design and Implementation, Boston, MA, 2 May 2005, pp. 273–286.
[18] Cloud Computing Tutorial. www.thecloudtutorial.com, January 2010.
[19] Cloud Security Alliance, Security guidance for critical areas of focus in cloud computing, April 2009.
[20] C. Collberg, C. Thomborson, Watermarking, tamper-proofing, and obfuscation tools for software protection,
IEEE Trans. Software Eng. 28 (2002) 735–746.
[21] A. Costanzo, M. Assuncao, R. Buyya, Harnessing cloud technologies for a virtualized distributed computing
infrastructure, IEEE Internet Comput. (2009).
[22] D. Dyer, Current trends/Challenges in datacenter thermal management—A facilities perspective, in:
Presentation at ITHERM, San Diego, 1 June 2006.
[23] Q. Feng, K. Hwang, Y. Dai, Rainbow product ranking for upgrading e-commerce, IEEE Internet
Comput. (2009).
[24] I. Foster, The grid: computing without bounds, Sci. Am. 288 (4) (2003) 78–85.
[25] I. Foster, Y. Zhao, J. Raicu, S. Lu, Cloud computing and grid computing 360-degree compared, in: Grid
Computing Environments Workshop, 12–16 November 2008.
[26] D. Gannon, The client+cloud: changing the paradigm for scientific research, Keynote address, in:
CloudCom 2010, Indianapolis, 2 November 2010.
[27] Google Inc, Efficient data center summit. www.google.com/corporate/green/datacenters/summit.html, 2009.
[28] Green Grid, Quantitative analysis of power distribution configurations for datacenters. www.thegreengrid
.org/gg_content/.
[29] A. Greenberg, J. Hamilton, D. Maltz, P. Patel, The cost of a cloud: research problems in datacenter
networks, in: ACM SIGCOMM Computer Communication Review, Vol. 39, No. 1, January 2009.
[30] C. Guo, G. Lu, et al., BCube: a high-performance server-centric network architecture for modular datacen-
ters, in: ACM SIGCOMM Computer Communication Review, Vol. 39, No. 4, October 2009.
[31] E. Hakan, Cloud computing: does Nirvana hide behind the Nebula? IEEE Softw. (2009).
[32] B. Hayes, Cloud computing, Commun. ACM 51 (2008) 9–11.
[33] C. Hoffa, et al., On the use of cloud computing for scientific workflows, in: IEEE Fourth International
Conference on eScience, December 2008.
[34] R. He, J. Hu, J. Niu, M. Yuan, A novel cloud-based trust model for pervasive computing, in: Fourth Inter-
national Conference on Computer and Information Technology, 14–16 September 2004, pp. 693–700.
[35] K. Hwang, S. Kulkarni, Y. Hu, Cloud security with virtualized defense and reputation-based trust manage-
ment, in: IEEE International Conference on Dependable, Autonomic, and Secure Computing (DASC 09),
Chengdu, China, 12–14 December 2009.
[36] K. Hwang, D. Li, Trusted cloud computing with secure resources and data coloring, IEEE Internet
Comput. (2010).
[37] K. Hwang, M. Cai, Y. Chen, M. Qin, Hybrid intrusion detection with weighted signature generation over
anomalous internet episodes, in: IEEE Transactions on Dependable and Secure Computing, Vol.4, No.1,
January–March 2007, pp. 41–55.
[38] V. Jinesh, Cloud Architectures, White paper, Amazon. http://aws.amazon.com/about-aws/whats-new/2008/
07/16/cloud-architectures-white-paper/.
[39] S.D. Kamvar, M.T. Schlosser, H. Garcia-Molina, The EigenTrust algorithm for reputation management in P2P
networks, in: Proceedings of the 12th International Conference on the World Wide Web, 2003.
[40] K. Keahey, M. Tsugawa, A. Matsunaga, J. Fortes, Sky computing, IEEE Internet Comput., (2009).
[41] A. Kivity, et al., KVM: the Linux virtual machine monitor, in: Proceedings of the Linux Symposium, Ottawa,
Canada, 2007, p. 225.
[42] KVM Project, Kernel-based virtual machines. www.linux-kvm.org, 2011 (accessed 02.11).
[43] G. Lakshmanan, Cloud Computing: Relevance to Enterprise, Infosys Technologies, Inc., 2009.
[44] N. Leavitt, et al., Is cloud computing really ready for prime time? IEEE Comput. 42 (1) (2009) 15–20.
[45] D. Li, H. Meng, X. Shi, Membership clouds and membership cloud generator, J. Comput. Res. Dev.
32 (6) (1995) 15–20.
[46] D. Li, C. Liu, W. Gan, A new cognitive model: cloud model, Int. J. Intell. Syst. (2009).
[47] D. Linthicum, Cloud Computing and SOA Convergence in Your Enterprise: A Step-by-Step Guide,
Addison Wesley Professional, 2009.
[48] X. Lou, K. Hwang, Collusive piracy prevention in P2P content delivery networks, IEEE Trans. Comput.,
(2009).
[49] L.M. Vaquero, L. Rodero-Merino, et al., A break in the clouds: towards a cloud definition, in: ACM
SIGCOMM Computer Communication Review Archive, January 2009.
[50] D. Manchala, E-Commerce trust metrics and models, IEEE Internet Comput. (2000).
[51] T. Mather, et al., Cloud Security and Privacy: An Enterprise Perspective on Risks and Compliance,
O’Reilly Media, Inc., 2009.
[52] L. Mei, W. Chan, T. Tse, A tale of clouds: paradigm comparisons and some thoughts on research issues,
in: IEEE Asia-Pacific Services Computing Conference, December 2008.
[53] D. Nelson, M. Ryan, S. DeVito, et al., The role of modularity in datacenter design, Sun BluePrints. www.sun
.com/storagetek/docs/EED.pdf.
[54] M. Nelson, B.H. Lim, G. Hutchins, Fast transparent migration for virtual machines, in: Proceedings of the
USENIX 2005 Annual Technical Conference, Anaheim, CA, 10–15 April 2005, pp. 391–394.
[55] D. Nurmi, R. Wolski, et al., Eucalyptus: an Elastic utility computing architecture linking your programs to
useful systems, in: UCSB Computer Science Technical Report No. 2008–10, August 2008.
[56] W. Norman, M. Paton, T. de Aragao, et al., Optimizing utility in cloud computing through autonomic
workload execution, in: Bulletin of the IEEE Computer Society Technical Committee on Data
Engineering, 2009.
[57] D.A. Patterson, et al., Recovery-Oriented Computing (ROC): Motivation, Definition, Techniques, and
Case Studies, UC Berkeley CS Technical Report UCB//CSD-02-1175, 15 March 2002.
[58] M. Pujol, et al., Extracting reputation in multi-agent systems by means of social network topology, in:
Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems, 2002.
[59] S. Roschke, F. Cheng, C. Meinel, Intrusion detection in the cloud, in: IEEE International Conference on
Dependable, Autonomic, and Secure Computing (DASC 09), 13 December 2009.
[60] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, X. Zhu, No ‘power’ struggles: coordinated multi-level
power management for the datacenter, in: Proceedings of the ACM International Conference on Architectural
Support for Programming Languages and Operating Systems, Seattle, WA, March 2008.
[61] D. Reed, Clouds, clusters and many core: the revolution ahead, in: Proceedings of the 2008 IEEE Interna-
tional Conference on Cluster Computing, 29 September – 1 October 2008.
[62] J. Rittinghouse, J. Ransome, Cloud Computing: Implementation, Management, and Security, CRC
Publishers, 2010.
[63] B. Rochwerger, D. Breitgand, E. Levy, et al., The RESERVOIR Model and Architecture for Open Federated
Cloud Computing, IBM Syst. J. (2008).
[64] R.N. Calheiros, R. Ranjan, C.A.F. De Rose, R. Buyya, CloudSim: a novel framework for modeling and
simulation of cloud computing infrastructures and services, Technical Report, GRIDS-TR-2009-1, University
of Melbourne, Australia, 13 March 2009.
[65] Salesforce.com, http://en.wikipedia.org/wiki/Salesforce.com/, 2010.
[66] H. Shen, K. Hwang, Locality-preserving clustering and discovery of resources in wide-area computational
grids, IEEE Trans. Comput. (2011) Accepted To Appear.
[67] S. Song, K. Hwang, R. Zhou, Y. Kwok, Trusted P2P transactions with fuzzy reputation aggregation, in:
IEEE Internet Computing, Special Issue on Security for P2P and Ad Hoc Networks, November 2005,
pp. 24–34.
[68] B. Sotomayor, R. Montero, I. Foster, Virtual infrastructure management in private and hybrid clouds,
IEEE Internet Comput. (2009).
[69] G. Stuer, K. Vanmechelena, J. Broeckhovea, A commodity market algorithm for pricing substitutable grid
resources, Future Gener. Comput. Syst. 23 (5) (2007) 688–701.
[70] C. Vecchiola, X. Chu, R. Buyya, Aneka: a software platform for .NET-based cloud computing, in:
W. Gentzsch, et al. (Eds.), High Speed and Large Scale Scientific Computing, IOS Press, Amsterdam,
Netherlands, 2009, pp. 267–295.
[71] T. Velte, A. Velte, R. Elsenpeter, Cloud Computing, A Practical Approach, McGraw-Hill Osborne Media,
2010.
[72] S. Venugopal, X. Chu, R. Buyya, A negotiation mechanism for advance resource reservation using the
alternate offers protocol, in: Proceedings of the 16th International Workshop on Quality of Service
(IWQoS 2008), Twente, The Netherlands, June 2008.
[73] K. Vlitalo, Y. Kortesniemi, Privacy in distributed reputation management, in: Workshop of the 1st Int’l
Conference on Security and Privacy for Emerging Areas in Communication Networks, September 2005.
[74] VMware, Inc., Disaster Recovery Solutions from VMware, White paper, www.vmware.com/, 2007
(accessed 07).
[75] VMware, Inc., Migrating Virtual Machines with Zero Downtime, www.vmware.com/, 2010 (accessed 07).
[76] VMware, Inc., vSphere, www.vmware.com/products/vsphere/, 2010 (accessed 02.10).
[77] L. Vu, M. Hauswirth, K. Aberer, QoS-based service selection and ranking with trust and reputation
management, in: Proceedings of the On The Move Conference (OTM ’05), LNCS 3760, 2005.
[78] W. Voorsluys, et al., Cost of VM live migration in clouds: a performance evaluation, in: Proceedings of
the First International Conference on Cloud Computing, IOS Press, Netherlands, 2009, pp. 267–295.
[79] Y. Wang, J. Vassileva, Toward trust and reputation based web service selection: a survey, J. Multi-agent
Grid Syst. (MAGS) (2007).
[80] Wikipedia, Cloud computing, http://en.wikipedia.org/wiki/Cloud_computing, 2010 (accessed 26.01.10).
[81] Wikipedia, Data center, http://en.wikipedia.org/wiki/Data_center, 2010 (accessed 26.01.10).
[82] H. Wu, G. Lu, D. Li, C. Guo, Y. Zhang, MDCube: a high performance network structure for modular data
center interconnection, ACM CoNEXT ’09, Rome, 1–4 December 2009.
[83] Y. Wu, K. Hwang, Y. Yuan, W. Zheng, Adaptive workload prediction of grid performance in confidence
windows, IEEE Trans. Parallel Distrib. Syst. (2010).
[84] XEN Organization, www.xen.org, 2011 (accessed 20.02.11).
[85] L. Xiong, L. Liu, PeerTrust: supporting reputation-based trust for peer-to-peer electronic communities,
IEEE Trans. Knowl. Data Eng. (2004) 843–857.
[86] J. Yang, J. Wang, C. Wang, D. Li., A novel scheme for watermarking natural language text, in: Proceed-
ings of the Third International Conference on Intelligent Information Hiding and Multimedia Signal
Processing, 2007, pp. 481–484.
[87] L. Youseff, M. Butrico, D. Maria, D. Silva, Toward a unified ontology of cloud computing, in: Grid
Computing Environments Workshop (GCE ’08), November 2008, pp. 1–10.
[88] F. Zhang, J. Cao, K. Hwang, C. Wu, Ordinal optimized scheduling of scientific workflows in elastic compute
Clouds, IEEE Trans. Comput. (2011) submitted (under review).
[89] R. Zhou, K. Hwang, PowerTrust: a robust and scalable reputation system for trusted peer-to-peer computing,
IEEE Trans. Parallel Distrib. Syst. (2007).
HOMEWORK PROBLEMS
Problem 4.1
Compile a table to compare public clouds and private clouds in each of the following four aspects.
Also identify their differences, advantages, and shortcomings in terms of design technologies and
application flexibility. Give several example platforms that you know of under each cloud class.
a. Technology leveraging and IT resource ownership
b. Provisioning methods of resources including data and VMs, and their management
c. Workload distribution methods and loading policies
d. Security precautions and data privacy enforcement
Problem 4.2
Describe the following techniques or terminologies used in cloud computing and cloud services.
Use a concrete example cloud or case study to explain the addressed technology.
a. Virtualized data center
b. Green information technology
c. Multitenant technique
Problem 4.3
You have read a number of success or failure stories on cloud computing applications by startup
entrepreneurs or established IT companies. In Chou’s introductory book on business computing
[18], six categories of cloud services are identified, as listed below and also shown in Figure 4.23.
Software Applications (SaaS): Concur, RightNOW, Teleo, Kenexa, Webex, Blackbaud,
Salesforce.com, Netsuite, Omniture, Kenexa, Vocus
Platform Services (PaaS): Force.com, App Engine, Postini, Facebook, MS Azure, NetSuite, IBM
RC2, IBM BlueCloud, SGI Cyclone, eBay, Pitney Bowes
Infrastructure Services (IaaS): Amazon AWS, OpSource Cloud, IBM Ensembles, Eli Lilly,
Rackspace cloud, Windows Azure, HP, Bank North, New York Times
Collocation Services (LaaS): Savvis, Internap, NTT Communications, Digital Realty Trust, 365
Main
Network Cloud Services (NaaS): Qwest, AT&T, AboveNet
Hardware/Virtualization Services (HaaS): VMware, Intel, IBM, XenEnterprise
Pick one company from each of the six cloud service categories. Dig into the company services in
detail, contact the company directly, or visit their web sites. The purpose is to report on their inno-
vations in cloud technology, good business ideas implemented, software application development,
business models they have developed, and success/failure lessons learned.
Problem 4.4
Check the AWS cloud web site. Plan a real computing application using EC2, or S3, or SQS, separately.
You must specify the resources requested and figure out the costs charged by Amazon. Carry out the
EC2, S3, or SQS experiments on the AWS platform and report and analyze the performance results
measured.
Problem 4.5
Consider two cloud service systems: Google File System and Amazon S3. Explain how they
achieve their design goals to secure data integrity and to maintain data consistency while facing the
problems of hardware failure, especially concurrent hardware failures.
Problem 4.6
Read the articles by Buyya and associates [10–14] on market-oriented cloud architecture and
intercloud resource exchanges:
a. Discuss the reasons for scaling applications across multiple clouds. Suggest ways to leverage the
ideas in cloud mashup applications.
b. Identify key architectural elements that need to be added to realize market-oriented clouds within
commercial cloud systems supported by companies such as IBM and Microsoft.
Problem 4.7
Suggest two hardware mechanisms and software schemes to secure the application cloud (SaaS), the
infrastructure cloud (IaaS), and the platform cloud (PaaS). Discuss their specific requirements and
difficulties and limitations that may be encountered.
Problem 4.8
Identify infrastructure components of IaaS services offered by Amazon EC2 and Google, and use
CloudSim from the University of Melbourne to model and simulate these infrastructures. Develop
an algorithm for scheduling applications across an InterCloud or federation of these simulated
infrastructures. In this evaluation, carry out experiments by modeling applications that are short-
lived such as web applications and long-lived such as high-performance computing.
Problem 4.9
Explain the following terminologies associated with data-center design and managements:
a. Differences in warehouse-scale data centers and modular data centers
b. Study three data center architecture papers in [2,24,25,45,68] and report on their contributions to
advance data-center performance and dependability
c. Discuss the scalability of those data-center architectures you studied in Part (b)
Problem 4.10
Explain the differences in the following two machine recovery schemes. Comment on their imple-
mentation requirements, advantages and shortcomings, and application potential.
a. Recovery of a physical machine failure by another physical machine
b. Recovery of a VM failure by another VM
Problem 4.11
Elaborate on four major advantages of using virtualized resources in cloud computing applications.
Your discussion should address resource management issues from the provider’s perspective and
the application flexibility, cost-effectiveness, and dependability concerns by cloud users.
Problem 4.12
Answer the following questions on the InterGrid Gateway (IGG) experiments. Read the original
paper [21] for details beyond Sections 6.3.3 and 6.3.4 on cloud creation and load peering experi-
ments in a distributed virtualized cloud infrastructure built on top of the French Grid’5000 system.
a. Study the details of the IGG software components and their usage to link resource sites
b. Repeat the IGG experiments in a local small-scale network or grid environment by getting the
IGG software from the University of Melbourne
c. Use the CloudSim simulator from the University of Melbourne or write your own simulator to
repeat the experiments in Part (b).
Problem 4.13
Map the following 14 names or abbreviated terms on the left column with the best-match definitions
or descriptions on the right column. Just enter the description label (a, b, c, ….) in the blank spaces in
front of the terms being matched to. This is a 1-to-1 correspondence.
_____ GAE a) The agreement signed between users and providers in cloud computing
_____ CRM b) A public cloud that must run from Windows 7 based host
_____ AWS c) A public cloud used mainly for PaaS applications
_____ SLA d) A public compute cloud used in scalable business computing application
_____ Azure e) A cloud platform built by SalesForce.com
_____ EC2 f) A commercial cloud OS for transforming a data center to a cloud platform
_____ S3 g) One of the most frequently used SaaS applications used in the business world
_____ Force.com h) A cloud platform built mainly for IaaS applications
_____ vSphere/4 i) A storage cloud service for distributed storage applications
_____ EBS j) An open-source hypervisor developed at Cambridge University
_____ SQL k) The distributed file system used in Google search engines and in App Engine
_____ Chubby l) An Amazon block storage interface for saving and restoring VM instances
_____ XEN m) An Azure service module that lets users access and use a relational database
_____ GFS n) A distributed lock service module used in the Google App Engine
Problem 4.14
This is an extended research problem on an efficient data-center network by Al-Fares, et al. [2] that
you studied in Section 4.2.2. The scalable data-center architecture was proposed to extend the fat-
tree concept. You are asked to perform the following tasks:
a. Study these two papers and justify the claimed network features on scalability, efficient routing,
and compatibility with the Ethernet, IP, and TCP protocols
b. Suggest a means to improve the network in fault tolerance, cross-section bandwidth, and
implementation efficiency based on today’s technology.
Problem 4.15
You have studied the basic concepts of live migration of VMs and disaster recovery in Section
3.5.2 and Section 4.3.3. Read the related articles in [16,19,52,55,72] and answer the following
questions based on your research findings:
a. What virtualization support is needed to achieve fast cloning of VMs? Explain how VM cloning
can enable fast recovery.
b. What are the RPO and RTO in the design of a disaster recovery scheme? Explain the role of
snapshots in disaster recovery schemes.
Problem 4.16
Investigate the differences among encryption, watermarking, and coloring for protecting data sets
and software in cloud environments. Discuss their relative strengths and limitations.
Problem 4.17
This is an extended research problem on efficient data-center networks. In the papers by Guo, et al. [30] and Wu, et al. [82], the MDCube network module was proposed for building mega data cen-
ters as shown in Figure 4.13 in Section 4.2.4.
a. Discuss the advantages of using this network to improve the intercontainer bandwidth, reduce
the cost of the interconnection structure, and reduce the cabling complexity.
b. Justify the claimed low diameter, high capacity, and fault tolerance of the MDCube network
design for interconnecting data-center containers.
Problem 4.18
Read the articles [15,17,21,35,37,46,49,57] on cloud security and data protection. Can you suggest
any way to upgrade the cloud infrastructure for protection of data, privacy, and security in a cloud
platform or data-center environment? Evaluate the trust models reported in [1,25,34,36,48,64] for
securing cloud service applications.
Problem 4.19
Draw a layered diagram to relate the construction of IaaS, PaaS, and SaaS clouds from bare machine hardware to the user's applications. Briefly list the representative cloud service offerings at
each cloud layer from the major cloud providers that you know of.
Problem 4.20
Discuss the enabling technologies for building cloud platforms from virtualized and automated
data centers to provide IaaS, PaaS, or SaaS services. Identify hardware, software, and networking
mechanisms or business models that enable multitenant services.
CHAPTER 5 Service-Oriented Architectures for Distributed Computing
SUMMARY
In this chapter, we cover two major service-oriented architecture styles, namely REST (REpresenta-
tional State Transfer) and WS (Web Services) and their extensions. We discuss message-oriented
middleware and enterprise bus infrastructure with a publish-subscribe infrastructure. Two application
interfaces (OGCE and HUBzero) are described using web service (portlet) and Web 2.0 (gadget)
technologies. We handle data and metadata in distributed systems using service registries and
semantic web/grid. Finally, we describe a general workflow approach illustrating it with the BPEL
web service standard, Pegasus, Taverna, Kepler, Trident, and Swift.
• Network orientation Services tend to be oriented toward use over a network, though this is
not an absolute requirement.
• Platform-neutral Messages are sent in a platform-neutral, standardized format delivered
through the interfaces. XML is the most obvious format that meets this constraint.
Unlike the component-based model, which is based on design and development of tightly
coupled components for processes within an enterprise, using different protocols and technologies
such as CORBA and DCOM, SOA focuses on loosely coupled software applications running across
different administrative domains, based on common protocols and technologies, such as HTTP and
XML. SOA is related to early efforts on the architecture style of large-scale distributed systems,
particularly Representational State Transfer (REST). Nowadays, REST still provides an alternative
to the complex standard-driven web services technology and is used in many Web 2.0 services. In
the following subsections, we introduce REST and standard-based SOA in distributed systems.
(Figure: an HTTP-REST client exchanges HTTP/S packets with a REST-aware HTTP server; each packet carries an HTTP verb (POST, GET, PUT, or DELETE) and a REST payload such as HTML, XML, JPEG, or JSON, with SSL providing application authentication.)
FIGURE 5.1
A simple REST interaction between user and server in HTTP specification.
(Courtesy of Thomas Fielding [2])
The key abstraction of information in REST is a resource. Any information that can be named can be a resource, such as a document or
image or a temporal service. A resource is a conceptual mapping to a set of entities. Each
particular resource is identified by a unique name, or more precisely, a Uniform Resource
Identifier (URI) which is of type URL, providing a global addressing space for resources
involved in an interaction between components as well as facilitating service discovery. The
URIs can be bookmarked or exchanged via hyperlinks, providing more readability and the
potential for advertisement.
Uniform, Constrained Interface: Interaction with RESTful web services is done via the standard HTTP client/server cacheable protocol. Resources are manipulated using a fixed set of four
CRUD (create, read, update, delete) verbs or operations: PUT, GET, POST, and DELETE. PUT
creates a new resource, which can then be destroyed by using DELETE. GET retrieves the
current state of a resource. POST transfers a new state onto a resource.
Self-Descriptive Message: A REST message includes enough information to describe how to
process the message. This enables intermediaries to do more with the message without parsing
the message contents. In REST, resources are decoupled from their representation so that their
content can be accessed in a variety of standard formats (e.g., HTML, XML, MIME, plain text,
PDF, JPEG, JSON, etc.). REST provides multiple/alternate representations of each resource.
Metadata about the resource is available and can be used for various purposes, such as cache
control, transmission error detection, authentication or authorization, and access control.
Stateless Interactions: The REST interactions are “stateless” in the sense that the meaning of a message does not depend on the state of the conversation. Statelessness improves visibility, since a monitoring system does not have to look beyond a single request to determine its full nature; it improves reliability, since it facilitates recovery from partial failures; and it increases scalability, since discarding state between requests allows the server component to quickly free resources. However, stateless interactions may decrease network
performance by increasing the repetitive data (per-interaction overhead). Stateful interactions are
based on the concept of explicit state transfer. Several techniques exist to exchange state, such as
URI rewriting, cookies, and hidden form fields. State can be embedded in response messages to
point to valid future states of the interaction.
Such lightweight infrastructure, where services can be built with minimal development tools, is
inexpensive and easy to adopt. The effort required to build a client to interact with a RESTful ser-
vice is rather small as developers can begin testing such services from an ordinary web browser,
without having to develop custom client-side software. From an operational point of view, a state-
less RESTful web service is scalable to serve a very large number of clients, as a result of REST
support for caching, clustering, and load balancing.
RESTful web services can be considered an alternative to SOAP stack or “big web services,”
described in the next section, because of their simplicity, lightweight nature, and integration with
HTTP. With the help of URIs and hyperlinks, REST has shown that it is possible to discover web
resources without an approach based on registration to a centralized repository. Recently, the web
Application Description Language (WADL) [3] has been proposed as an XML vocabulary to
describe RESTful web services, enabling them to be discovered and accessed immediately by poten-
tial clients. However, there are not yet many toolkits for developing RESTful applications. Also, restrictions on GET length, which does not allow encoding of more than 4 KB of data in the resource URI, can create problems, because the server may reject such long URIs or may even crash. REST is not a standard; it is a design and architectural style for large-
scale distributed systems.
Table 5.1 lists the REST architectural elements. Several Java frameworks have emerged to help
with building RESTful web services. Restlet [4], a lightweight framework, implements REST archi-
tectural elements such as resources, representation, connector, and media type for any kind of
RESTful system, including web services. In the Restlet framework, both the client and the server
are components. Components communicate with each other via connectors.
JSR-311 (JAX-RS) [5], a specification provided by Sun Microsystems, defines a set of Java
APIs for the development of RESTful web services. The specification provides a set of annotations
with associated classes and interfaces that can be used to expose Java objects as web resources.
JSR-311 provides clear mappings between the URI and corresponding resources, and mappings
between HTTP methods with the methods in Java objects, by using annotations. The API supports
a wide range of HTTP entity content types including HTML, XML, JSON, GIF, JPG, and so on.
Jersey [6] is a reference implementation of the JSR-311 specification for building RESTful web ser-
vices. It also provides an API for developers to extend Jersey based on their needs.
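A minimal JAX-RS resource, in the spirit of the JSR-311 mappings described above, might look like the sketch below; the resource path /documents/{id}, the in-memory store, and the plain-text payload are assumptions chosen for illustration, and running it requires the JAX-RS API (javax.ws.rs) together with an implementation such as Jersey.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.ws.rs.DELETE;
import javax.ws.rs.GET;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// A hypothetical RESTful resource exposing simple documents under /documents/{id}.
// The JAX-RS annotations map HTTP verbs onto Java methods.
@Path("/documents/{id}")
public class DocumentResource {
    private static final Map<String, String> STORE = new ConcurrentHashMap<>();

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String read(@PathParam("id") String id) {          // GET retrieves state
        return STORE.getOrDefault(id, "");
    }

    @PUT
    public void create(@PathParam("id") String id, String body) {  // PUT creates/replaces
        STORE.put(id, body);
    }

    @DELETE
    public void remove(@PathParam("id") String id) {          // DELETE destroys the resource
        STORE.remove(id);
    }
}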
Objects in Amazon S3 are stored in containers called “buckets,” each identified by a unique key. Buckets serve several purposes:
They organize the Amazon S3 namespace at the highest level, identify the account responsible for storage
and data transfer charges, play a role in access control, and serve as the unit of aggregation for usage
reporting. Amazon S3 provides three types of resources: a list of user buckets, a particular bucket, and a
particular S3 object, accessible through https://s3.amazonaws.com/{name-of-bucket}/{name-of-object}.
These resources are retrieved, created, or manipulated by basic HTTP standard operations: GET,
HEAD, PUT, and DELETE. GET can be used to list buckets created by the user, objects kept inside a
bucket, or an object’s value and its related metadata. PUT can be used for creating a bucket or setting an
object’s value or metadata, DELETE for removing a particular bucket or object, and HEAD for getting a
specific object’s metadata. The Amazon S3 API supports the ability to find buckets, objects, and their
related metadata; create new buckets; upload objects; and delete existing buckets and objects for the
aforementioned operations. Table 5.2 shows some sample REST request-response message syntax for
creating an S3 bucket.
Amazon S3 REST operations are HTTP requests to create, fetch, and delete buckets and objects.
A typical REST operation consists of sending a single HTTP request to Amazon S3, followed by waiting for
an HTTP response. Like any HTTP request, a request to Amazon S3 contains a request method, a URI,
request headers which contain basic information about the request, and sometimes a query string and
request body. The response contains a status code, response headers, and sometimes a response body.
The request consists of a PUT command followed by the bucket name created on S3. The Amazon S3
REST API uses standard HTTP headers to pass authentication information. The Authorization header contains the AWS Access Key ID, issued to developers when they register for S3 web services, followed by a signature computed with the corresponding AWS Secret Access Key. To authenticate a request, the AWSAccessKeyId element identifies the secret key that S3 uses to recompute the signature. If the recomputed signature matches the signature included in the request, the requester is authorized and the request is processed.
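The shape of such a bucket-creation call from client code can be sketched in Java with java.net.HttpURLConnection, as below; the access key, signature, and date values are placeholders (computing a real signature from the AWS Secret Access Key is omitted), so the sketch shows only the structure of the PUT request rather than a working S3 client.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of an S3-style REST call: create a bucket with an HTTP PUT.
// The Authorization value is a placeholder; a real request must carry a
// signature computed from the AWS Secret Access Key.
public class CreateBucketExample {
    public static void main(String[] args) throws IOException {
        URL url = new URL("https://s3.amazonaws.com/SampleBucket");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");                        // PUT creates the bucket
        conn.setRequestProperty("x-amz-date", "Mon, 15 Mar 2010 14:40:00 GMT");
        conn.setRequestProperty("Authorization",
                "AWS AKIAEXAMPLEKEYID:signature-goes-here"); // placeholder credentials
        conn.setDoOutput(true);
        conn.getOutputStream().close();                      // empty request body
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
    }
}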
The composition of RESTful web services has been the main focus of composite Web 2.0 appli-
cations, such as mashups, to be discussed in Sections 5.4 and 9.1.3. A mashup application combines
capabilities from existing web-based applications. A good example of a mashup is taking images
from an online repository such as Flickr and overlaying them on Google Maps. Mashups differ
from all-in-one software products in that instead of developing a new feature into an existing tool,
they combine the existing tool with another tool that already has the desired feature. All tools work
independently, but create a uniquely customized experience when used together in harmony.
(Figure: a service provider publishes and updates its service description as a WSDL document; the UDDI registry refers to that WSDL document; a service user queries the registry to find the service description and then binds to the provider by sending a SOAP request.)
FIGURE 5.2
A simple web service interaction among provider, user, and the UDDI registry.
conjunction with other web-related standards. The technologies that make up the core of today’s
web services are as follows:
Simple Object Access Protocol (SOAP) SOAP provides a standard packaging structure for
transmission of XML documents over various Internet protocols, such as SMTP, HTTP, and
FTP. By having such a standard message format, heterogeneous middleware systems can
achieve interoperability. A SOAP message consists of a root element called envelope, which
contains a header: a container that can be extended by intermediaries with additional
application-level elements such as routing information, authentication, transaction management,
message parsing instructions, and Quality of Service (QoS) configurations, as well as a body
element that carries the payload of the message. The content of the payload will be marshaled
by the sender’s SOAP engine and unmarshaled at the receiver side, based on the XML schema
that describes the structure of the SOAP message.
Web Services Description Language (WSDL) WSDL describes the interface, a set of
operations supported by a web service in a standard format. It standardizes the representation of
input and output parameters of its operations as well as the service’s protocol binding, the way
in which the messages will be transferred on the wire. Using WSDL enables disparate clients to
automatically understand how to interact with a web service.
Universal Description, Discovery, and Integration (UDDI) UDDI provides a global registry
for advertising and discovery of web services, by searching for names, identifiers, categories, or
the specification implemented by the web service. UDDI is explained in detail in Section 5.4.
SOAP is an extension, and an evolved version of XML-RPC, a simple and effective remote pro-
cedure call protocol which uses XML for encoding its calls and HTTP as a transport mechanism,
introduced in 1999 [7]. According to its conventions, a procedure executed on the server and the
value it returns was a formatted in XML. However, XML-RPC was not fully aligned with the latest
XML standardization. Moreover, it did not allow developers to extend the request or response for-
mat of an XML-RPC call. As the XML schema became a W3C recommendation in 2001, SOAP
mainly describes the protocols between interacting parties and leaves the data format of exchanging
messages to XML schema to handle.
The major difference between web service technology and other technologies such as J2EE,
CORBA, and CGI scripting is its standardization, since it is based on standardized XML, providing a
language-neutral representation of data. Most web services transmit messages over HTTP, making
them available as Internet-scale applications. In addition, unlike CORBA and J2EE, using HTTP as
the tunneling protocol by web services enables remote communication through firewalls and proxies.
SOAP-based web services are also referred to as “big web services” [7]. As we saw earlier in
this chapter, RESTful [8] services can also be considered a web service, in an HTTP context.
SOAP-based web services interaction can be either synchronous or asynchronous, making them sui-
table for both request-response and one-way exchange patterns, thus increasing web service avail-
ability in case of failure.
FIGURE 5.3
WS-I protocol stack and its related specifications.
“The automation of the processes, which involves the orchestration of a set of Grid services, agents
and actors that must be combined together to solve a problem or to define a new service.”
The jBPM [16] project, built for the JBoss [17] open source middleware platform, is an example
of a workflow management and business process execution system. Another workflow system,
Taverna [18], has been extensively used in life science applications. There are a variety of tools for
developing and deploying web services in different languages, among them SOAP engines such as
Apache Axis for Java, gSOAP [19] for C++, the Zolera Soap Infrastructure (ZSI) [20] for Python,
and Axis2/Java and Axis2/C. These toolkits, consisting of a SOAP engine and WSDL tools for gen-
erating client stubs, considerably hide the complexity of web service application development and
integration. As there is no standard SOAP mapping for any of the aforementioned languages, two
different implementations of SOAP may produce different encodings for the same objects.
Since SOAP can combine the strengths of XML and HTTP, as a standard transmission protocol
for data, it is an attractive technology for heterogeneous distributed computing environments, such
as grids and clouds, to ensure interoperability. As we discussed in Section 7.4, Open Grid Services
Architecture (OGSA) grid services are extensions of web services and in new grid middleware,
such as Globus Toolkit 4 and its latest released version GT5, pure standard web services. Amazon
S3 as a cloud-based persistent storage service is accessible through both a SOAP and a REST inter-
face. However, REST is the preferred mechanism for communicating with S3 due to the difficulties
of processing large binary objects in the SOAP API, and in particular, the limitation that SOAP
puts on the size of the objects to be managed and processed. Table 5.3 depicts a sample SOAP request (a CreateBucket call) sent to S3.
A SOAP message consists of an envelope used by the applications to enclose the information that needs to be sent. An envelope contains a header and a body block. The EncodingStyle element refers
Table 5.3 A Sample SOAP Request to Amazon S3 (CreateBucket)

<soap:Envelope
    xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">
  <soap:Body>
    <CreateBucket xmlns="http://doc.s3.amazonaws.com/2010-03-15">
      <Bucket>SampleBucket</Bucket>
      <AWSAccessKeyId>1B9FVRAYCP1VJEXAMPLE=</AWSAccessKeyId>
      <Timestamp>2010-03-15T14:40:00.165Z</Timestamp>
      <Signature>Iuyz3d3P0aTou39dzbqaEXAMPLE=</Signature>
    </CreateBucket>
  </soap:Body>
</soap:Envelope>
to the URI address of an XML schema for encoding elements of the message. Each element of a
SOAP message may have a different encoding, but unless specified, the encoding of the whole mes-
sage is as defined in the XML schema of the root element. The header is an optional part of a
SOAP message that may contain auxiliary information as mentioned earlier, which does not exist in
this example.
The body of a SOAP request-response message contains the main information of the conversa-
tion, formatted in one or more XML blocks. In this example, the client is calling CreateBucket of
the Amazon S3 web service interface. In case of an error in service invocation, a SOAP message
including a Fault element in the body will be forwarded to the service client as a response, as an
indicator of a protocol-level error.
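Because SOAP rides on plain HTTP, sending such an envelope is an ordinary POST. The sketch below, using Python's standard urllib library, is illustrative only: the endpoint URL is a placeholder, and a real S3 SOAP call would also need valid credentials and a signature.

# Illustrative sketch: sending a SOAP envelope as an HTTP POST with Python's
# standard library. The endpoint URL and bucket are placeholders, not a working
# S3 call; real requests need valid keys and signatures.
import urllib.request

envelope = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <CreateBucket xmlns="http://doc.s3.amazonaws.com/2010-03-15">
      <Bucket>SampleBucket</Bucket>
    </CreateBucket>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    "https://s3.example.com/soap",            # placeholder endpoint
    data=envelope.encode("utf-8"),
    headers={"Content-Type": "application/soap+xml; charset=utf-8"},
    method="POST",
)
try:
    with urllib.request.urlopen(request, timeout=10) as response:
        print(response.status, response.read()[:200])
except Exception as err:                      # a SOAP Fault or transport error
    print("request failed:", err)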
FIGURE 5.4
Three-tier system architecture: a client (web browser) interacts with the presentation layer, which calls the application server, which in turn accesses data storage in the information system.
(Gustavo Alonso, et al, Web Services: Concepts, Architectures and Applications (Data-Centric Systems and Applications). Springer Verlag, 2010. With kind permission from Springer Science+Business Media B.V.)
• Business/application logic layer or middleware Programs that implement the actual operations
requested by the client through the presentation layer. The middle tier can also control user
authentication and access to resources, as well as performing some of the query processing for
the client, thus removing some of the load from the database servers.
• Resource management layer Also known as the data layer, deals with and implements the
different data sources of an information system.
In fact, a three-tier system is an extension of two-tier architecture where the application logic is
separated from the resource management layer [21]. By the late 1990s, as the Internet became an
important part of many applications, the industry extended the three-tier model to an N-tier
approach. SOAP-based and RESTful web services have become more integrated into applications.
As a consequence, the data tier split into a data storage tier and a data access tier. In very sophisti-
cated systems, an additional wrapper tier can be added to unify data access to both databases and
web services. Web services can benefit from the separation of concerns inherent in multitier archi-
tecture in almost the same way as most dynamic web applications [22].
The business logic and data can be shared by both automated and GUI clients. The only differ-
ences are the nature of the client and the presentation layer of the middle tier. Moreover, separating
business logic from data access enables database independence. N-tier architecture is characterized
by the functional decomposition of applications, service components, and their distributed deployment.
Such an architecture for both web services and dynamic web applications leads to reusability,
simplicity, extensibility, and clear separation of component functionalities.
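A minimal sketch of this separation of concerns may help; the class and function names below are invented for illustration and stand in for a real presentation tier, business logic tier, and resource management tier.

# Minimal sketch of three-tier separation of concerns; all names are
# illustrative. Each tier can be replaced or scaled independently.

class ResourceLayer:                       # data / resource management tier
    def __init__(self):
        self._orders = {1: {"item": "book", "qty": 2}}
    def fetch_order(self, order_id):
        return self._orders.get(order_id)

class BusinessLogic:                       # middle tier: rules, authentication
    def __init__(self, resources):
        self.resources = resources
    def order_summary(self, user, order_id):
        if user != "alice":                # crude access control
            raise PermissionError("not authorized")
        order = self.resources.fetch_order(order_id)
        return f"{order['qty']} x {order['item']}" if order else "no such order"

def presentation(logic, user, order_id):   # presentation tier (web page, GUI, or service)
    return f"<p>{logic.order_summary(user, order_id)}</p>"

print(presentation(BusinessLogic(ResourceLayer()), "alice", 1))   # <p>2 x book</p>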
Web services can be seen as another tier on top of the middleware and application integration
infrastructure [23], allowing systems to interact with a standard protocol across the Internet. Because
each tier can be managed or scaled independently, flexibility is increased in the IT infrastructure that
employs N-tier architecture. In the next section, we describe OGSA, a multitier, service-oriented architecture for middleware that describes the capabilities of a grid computing environment and embodies web services to make computing resources accessible in large-scale heterogeneous environments.
FIGURE 5.5
The OGSA architecture: grid applications run over OGSA middleware services (execution management, data, resource management, information, security, self-management, and infrastructure services), which build on web service standards (WSRF, WS-Management, WS-Addressing, and OGSI) above the fabric layer.
(Courtesy of Foster, et al. [24], http://www.ogf.org/documents/GFD.80.pdf.)
status monitoring; relatively static data used for discovery; and any data that is logged.
Troubleshooting is just one of the possible uses for information provided by these services.
• Self-Management Services Support service-level attainment for a set of services (or resources),
with as much automation as possible, to reduce the costs and complexity of managing the system.
These services are essential in addressing the increasing complexity of owning and operating an
IT infrastructure.
OGSA has been adopted as reference grid architecture by a number of grid projects. The first
prototype grid service implementation was demonstrated January 29, 2002, at a Globus Toolkit
tutorial held at Argonne National Laboratory. Since then, the Globus Toolkit 3.0 and 3.2 have
offered an OGSA implementation based on OGSI. Two key properties of a grid service are transi-
ence and statefulness. Creation and destruction of a transient grid service can be done dynamically.
The creation and lifetime of OGSA grid services are handled following the “factory pattern,” to be
explained in Section 7.3.1. Web service technologies are designed to support loosely coupled,
coarse-grained dynamic systems, and hence do not meet all grid requirements, such as keeping state
information, and thus they are unable to fully address the wide range of distributed systems OGSA
is designed to support.
OGSA applies a set of WSDL extensions to represent the identifiers necessary to implement a
grid service instance across any system. These extensions were defined by OGSI. A key extension
is the grid service reference: a network-wide pointer to a specific grid service instance, which
makes that instance accessible to remote client applications. These extensions, including the Grid
Service Handle (GSH) and Grid Service Reference (GSR), will be described in Chapter 7. OGSI nevertheless drew criticism: it defines stateful grid services, which sits uneasily with the stateless nature of plain web services; its specification is dense and lengthy; it is incompatible with some current web service tools; and it borrows many concepts from object orientation. These concerns led to close cooperation between the grid and web service communities. As a result of these joint efforts, the Web Services Resource Framework (WSRF) [27],
WS-Addressing [28], and WS-Notification (WSN) specifications have been proposed to OASIS. Conse-
quently, OGSI extensions to web services have been deprecated in favor of new web service standards,
and in particular, WSRF. WSRF is a collection of five different specifications. Of course, they all relate
to the management of WS-Resources. Table 5.5 depicts WSRF-related interface operations.
Plain web services are usually stateless: the service cannot “remember” information, or keep state, from one invocation to another, so subsequent invocations have no knowledge of what was done in previous ones. Grid applications
generally require web services to keep state information as they interact with the clients or other
web services. The purpose of WSRF is to define a generic framework for modeling and accessing
persistent resources using web services in order to facilitate the definition and implementation of a
service and the integration and management of multiple services. Note that “stateless” services can,
in fact, remember state if that is carried in messages they receive. These could contain a token
remembered in a cookie on the client side and a database or cache accessed by the service. Again,
the user accessing a stateless service can establish state for the session through the user login that
references permanent information stored in a database.
The state information of a web service is kept in a separate entity called a resource. A service
may have more than one (singleton) resource, distinguished by assigning a unique key to each
resource. Resources can be either in memory or persistent, stored in secondary storage such as a
file or database. The pairing of a web service with a resource is called a WS-Resource. The
preferred way of addressing a specific WS-Resource is to use the qualified endpoint reference
(EPR) construct, proposed by the WS-Addressing specification. Resources store actual data items,
referred to as resource properties. Resource properties are usually used to keep service data
values, providing information on the current state of the service, or metadata about such values,
or they may contain information required to manage the state, such as the time when the resource
must be destroyed. Currently, the Globus Toolkit 4.0 provides a set of OGSA capabilities based
on WSRF.
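The WS-Resource pattern can be sketched in a few lines of Python. This is not the WSRF or Globus API; it only illustrates the idea of a stateless service paired with keyed resources addressed by an endpoint-reference-like construct, with all names invented for illustration.

# Sketch of the WS-Resource pattern (not the actual WSRF API): a stateless
# service is paired with resources that hold state, each addressed by an
# endpoint-reference-like (service address, resource key) pair.
import time, uuid

class CounterResource:
    def __init__(self, lifetime_seconds):
        self.value = 0                                  # a resource property
        self.termination_time = time.time() + lifetime_seconds

class CounterService:
    """Stateless service logic; all state lives in the resource home."""
    def __init__(self):
        self.resource_home = {}                         # key -> resource

    def create_resource(self, lifetime_seconds=3600):
        key = str(uuid.uuid4())
        self.resource_home[key] = CounterResource(lifetime_seconds)
        # EPR-like reference: illustrative address plus the resource key
        return {"address": "http://example.org/CounterService", "key": key}

    def add(self, epr, amount):
        resource = self.resource_home[epr["key"]]
        resource.value += amount
        return resource.value

service = CounterService()
epr = service.create_resource()
service.add(epr, 5)
print(service.add(epr, 3))    # 8: state survives across invocations via the resource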
FIGURE 5.6
Two message bus implementations between services or using a broker network: (a) services (Service 1 ... Service N) and clients linked by a message bus providing a communication and routing abstraction; (b) the same bus realized by a network of brokers (Broker 1 ... Broker M).
messages in different styles (say, SOAP, REST, or Java RMI) can communicate with each other. The
term “enterprise service bus” or ESB [30,31] refers to the case where the bus supports the convenient
integration of many components, often in different styles. These remarks motivate the messaging black
box abstraction shown in Figure 5.6.
One does not open a channel between source and destination, but rather injects a message into the
bus with enough information to allow it to be delivered correctly. This injection is performed by code
loaded into each service and represented by the filled ovals as client interfaces in Figure 5.6(a). The
message bus is shown linking services in this figure, but it can work with any software or hardware
entity sending and receiving messages. A simple example could be desktops or smart phones as the
clients. Further, such buses can be implemented internally to an application, or in a distributed fashion.
In the latter case, the message bus is typically implemented as a set of “brokers” shown in Figure 5.6(b).
The use of multiple brokers allows the bus to scale to many clients (services) and large mes-
sage traffic. Note that the brokers of Figure 5.6(b) are “just” special servers/services that receive
messages and perform needed transformations and routing on them and send out new messages.
There is a special (simple) case of message buses where the brokers shown in Figure 5.6(b) are
not separate servers but are included in the client software. Note that such buses support not
just point-to-point messaging but broadcast or selective multicast to many recipient clients
(services).
Often, one implements brokers as managers of queues, and software in this area often has MQ
or “Message Queue” in its description. An early important example is MQSeries [32] from IBM
which is now marketed as the more recent WebSphereMQ [32,33]. Later, when we study cloud
platforms in Chapter 8, we will find that both Azure and Amazon offer basic queuing software.
A typical use of a message queue is to relate the master and workers in the “farm” model of parallel
computing where the “master” defines separate work items that are placed in a queue which is
accessed by multiple workers that select the next available item. This provides a simple dynamically
load-balanced parallel execution model. If necessary, the multiple brokers of Figure 5.6(b) can be
used to achieve scalability.
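The farm model is easy to sketch with an in-process queue standing in for the broker; the worker logic below is illustrative only, and a production system would use a real queuing service such as the MQ products or cloud queues mentioned above.

# Minimal sketch of the "farm" model: a master places work items on a queue and
# several workers pull the next available item, giving dynamic load balancing.
# An in-process queue stands in for a real broker or cloud queuing service.
import queue, threading

work_queue = queue.Queue()
results = []

def worker(worker_id):
    while True:
        item = work_queue.get()
        if item is None:               # sentinel: no more work
            work_queue.task_done()
            return
        results.append((worker_id, item, item * item))   # "process" the item
        work_queue.task_done()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for item in range(10):                 # master defines separate work items
    work_queue.put(item)
for _ in threads:                      # one sentinel per worker
    work_queue.put(None)
work_queue.join()
print(sorted(r[1] for r in results))   # all ten items were processed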
FIGURE 5.7
Display of GPS sensors managed by NaradaBrokering in Southern California; the map displays the time series (coordinate in meters versus time in seconds, recorded between 11:28:58 PM and 11:39:01 PM EST) produced by one of the GPS stations. (http://quakesim.jpl.nasa.gov/).
application similar to Example 5.4 where publish-subscribe systems manage sensor networks. In fact, one
can consider webcams as sensors (as they produce real-time streams) and so Example 5.5 is also of this
type. Clouds are an important implementation for this type of application as brokers can be added on
demand to support myriad dynamic sensors from cell phones to military or civil sensors responding to an
extreme event. Figure 5.7 shows such a display of GPS sensors managed by NaradaBrokering, with the map presenting the time series produced by one of the GPS stations.
FIGURE 5.8
A gateway component software stack for scientific applications.
storage systems, national-scale grid resources such as the TeraGrid [46] and the Open Science Grid
[47], and computational cloud resources. The second tier consists of middleware for interacting with
these resources. Common middleware examples include Globus [48] (providing job execution and
remote file management), Condor [49] (providing job scheduling), and iRods [50] (providing data
and metadata movement and management). Middleware systems typically expose secure, network-
accessible APIs and developer libraries as well as providing numerous command-line tools.
Tier 2 software and services run on the resources of Tier 1. The next two tiers are not tied to these
resources and can run on separate host machines. Tier 3 is the gateway software layer and consists of
service components that manage the execution of scientific applications and data on the Tier 1
resources through the Tier 2 interfaces. Execution management for both data and science applications
can be decoupled into several components: the initial invocation, monitoring, fault tolerance, and task
coupling as workflows. Security considerations [51,52] (such as authentication and authorization)
permeate all tiers, but we locate the component in Tier 3 since the user’s identity is established here.
We also place user and group management (including social networking) in this tier, along with
third-party information services [53]. Finally, the top tier (Tier 4) is the user presentation layer. Pre-
sentation layers may be built with a number of different tools, including content management sys-
tems, Open Social gadgets, and desktop applications, as described in more detail shortly. Figure 5.8
is flexible enough to describe most gateways. We next make the description more concrete by
reviewing two example science gateways: GridChem and UltraScan. Both gateways have been used
for significant scientific research.
and material scientists for fitting experimental data to finite element solutions of flow equations. More than
700 biochemists, biophysicists, biologists, and material scientists worldwide rely on UltraScan software for
analysis of their experimental data. The software has assisted in understanding an array of disease pro-
cesses, including cancer, neurodegenerative diseases, HIV/AIDS, diabetes, Huntington’s disease, and
aging studies. By conservative estimates, UltraScan has contributed to more than 250 peer-reviewed publi-
cations including 23 known publications from 2009 [59].
UltraScan makes its core experimental analysis software available as an online service to scientific
users through its Science Gateway. This analysis software is computationally demanding and must be run
on clusters and supercomputers. UltraScan job management services (Tier 3) hide complexities and pro-
vide fault tolerance for the experimental scientists using the portal. Although it uses TeraGrid for some of
its computing power, UltraScan needs to span multiple resource providers: It also uses university clusters
and would like to extend its resources to include international grids in Germany and Australia. The key to
UltraScan’s success and growth over the next three years is to provide the ability to manage jobs across
multiple cyberinfrastructure resources that are not collectively part of a single, managed grid.
Gateways such as UltraScan provide an example where code optimization and efficient use of compute
resources are done by experts in the field and shared with hundreds of end users. The gateways lower the
entry barrier for analyzing the data on high-end resources. As an example, the UltraScan gateway provides
an optimal solution for solving large non-negative least squares problems that arise in data analysis
[60,61]. Solving these problems requires significant compute resources. The procedure improves compute
resource utilization targeting the inverse problem involved in modeling analytical ultracentrifugation experi-
mental data. Solving large non-negatively constrained least squares systems is frequently used in the phy-
sical sciences to estimate model parameters which best fit experimental data.
AUC is an important hydrodynamic experimental technique used in biophysics to characterize macro-
molecules and to determine parameters such as molecular weight and shape. Newer AUC instruments
equipped with multi-wavelength (MWL) detectors have recently increased the data sizes by three orders of
magnitude. Analyzing the MWL data requires significant compute resources. UltraScan gateway bridges
these requirements for end users by providing both the procedures and the capabilities to execute them
on supercomputing resources.
courses at 131 institutions of higher education have used nanoHUB.org resources. The journal citations and the documented use in classrooms are evidence that nanoHUB.org resources aid both research and education.
1 HUBzero is a trademark of Purdue University.
• NEES.org 15,000 users, online since August 2010 Home of the NSF Network for Earthquake
Engineering Simulation, which catalogs experimental data from 14 institutions that are
simulating earthquake conditions in the lab. The site also hosts modeling tools used to visualize
and analyze the data [64a].
FIGURE 5.9
The HUBzero architecture and its major functional components, from the web front end through the HUBzero middleware down to grid computing resources.
may vary from seconds on a single core to hours and days on multiple, possibly large numbers of
cores [68,69].
Each session runs in a restricted virtual environment implemented using OpenVZ [70], which
controls access to file systems, networking, and other system resources. Users can see their files
and processes, but cannot see others on the system, cannot launch attacks on other hosts, and can-
not exceed their file quota. If a tool already has a GUI that runs under Linux/X11, it can be
deployed as is in a matter of hours. If not, tool developers can use HUBzero’s Rappture toolkit
(http://rappture.org) to create a GUI with very little effort. Developers start by defining the inputs
and outputs for their tool as a hierarchy of data objects written in an XML format. The data objects
include simple elements, such as integers, Boolean values, numbers with units, and sets of mutually
exclusive choices, as well as more complex objects, such as physical structures, finite element
meshes, and molecules.
Rappture reads the XML description of a tool and generates a standard GUI for the tool automa-
tically. The GUI prompts for values, launches simulation jobs, and loads results for visualization.
The underlying simulation code uses an API to retrieve input values and store output results. Rappture
includes libraries for C/C++, FORTRAN, MATLAB, Java, Python, Perl, Ruby, and Tcl/Tk, so the
underlying simulator is not constrained to a particular language, but can be written in the developer’s
language of choice. The tool shown in Figure 5.9, for example, is a Rappture GUI for a MATLAB
program that simulates carbon nanotubes.
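The following sketch conveys the flavor of such declarative tool descriptions; the XML vocabulary and the tool itself are invented for illustration and do not reproduce the actual Rappture schema or API.

# Sketch of driving a tool from a declarative XML description, in the spirit of
# Rappture. The XML vocabulary here is invented and is not the real Rappture schema.
import xml.etree.ElementTree as ET

TOOL_XML = """
<tool name="nanotube-sim">
  <input>
    <number id="length" label="Tube length" units="nm" default="10"/>
    <number id="temperature" label="Temperature" units="K" default="300"/>
  </input>
  <output>
    <curve id="conductance" label="Conductance vs. bias"/>
  </output>
</tool>
"""

def collect_inputs(tool_xml):
    """Walk the declared inputs and build a value dictionary (defaults here;
    a generated GUI would prompt the user instead)."""
    root = ET.fromstring(tool_xml)
    values = {}
    for field in root.find("input"):
        values[field.get("id")] = float(field.get("default"))
        print(f"{field.get('label')} [{field.get('units')}] = {values[field.get('id')]}")
    return values

inputs = collect_inputs(TOOL_XML)   # the simulator would then read these values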
Users can click on the Help or Support link near the top of any page and fill out a form to file a support
ticket. Tickets can be handled by the staff managing the hub, or forwarded to the researchers sup-
porting various simulation tools. Some questions are beyond the understanding of the hub staff, or
even beyond the knowledge of a single researcher.
HUBzero includes a question-and-answer forum patterned after Askville on Amazon.com that
can engage the whole community. Any registered user can post a question, and other users can pro-
vide answers. At some point, the best answer is chosen as the “final” answer by the person who
asked the question, and participants earn points as a reward for their effort. Points can be used as
bragging rights, or as currency to purchase t-shirts and other merchandise in a hub store. The list of
past questions/answers forms a knowledge base upon which a community can draw for immediate
help with a similar problem.
Other issues are beyond software problems and physics questions, but really entail requests for
tool improvements and new features. Such requests are tracked on a “wish list” for each tool and also
on a wish list for the entire hub. All tickets, questions, and wishes are managed by the HUBzero soft-
ware. Various administrators, software developers, and community members are given access to these
facilities to help address the community support issues.
may need to be run hundreds of times to perform an overall sensitivity analysis, to optimize a
design goal, or to quantify uncertainty in the output. A hub should not only catalog tools, but
connect them one to another, to grid computing resources, and ultimately, to the researcher with a
question to answer.
5.3.3.1 Workflows
The OGCE scientific workflow system [74,76,77] provides a programming model that allows the
scientist to program experiments using application web services developed with the GFAC service
that abstract the complexities of the underlying middleware. The workflow system enables scientists
to author experiments that can be stored, replayed, and shared with others. The workflow suite
is bundled with interfaces for composition, execution, and monitoring of the workflows. Salient
features include support for long-running applications and steering/dynamic user interactions.
OGCE’s software stack is designed to be flexible in its coupling of various components to har-
ness end-to-end, multiscale gateway infrastructures. Individual OGCE tools can be integrated into
gateway deployments; likewise, other standard specification-based tools can be swapped within
OGCE’s software stack. As a specific example, the OGCE workflow system provides (with its
XBaya frontend) a graphical interface to browse various application service registries (such as the
OGCE’s XRegistry).
From these registries, users can construct task graphs as workflows. The representation is captured
as an abstract, high-level, workflow-neutral format, which can be translated into workflow execution
specific syntax. Currently, integrations of BPEL [78], Jython, Taverna SCUFL [79], and Pegasus
DAX [80] have all been demonstrated and exist at various levels of support. By default, the workflow
enactment is facilitated by an open source BPEL implementation, the Apache Orchestration and
Director Engine (ODE) [81], that OGCE developers have enhanced to support long-running scientific
workflows on computational grids.
strategies must be specialized to the application. Providing extensibility for gateways to address their
application-specific error conditions is an upcoming effort for GFAC development.
5.3.3.4 Packaging
OGCE software is open source and available for download through SourceForge, and plans are
underway to start an Apache Incubator project for GFAC, XBaya, and supporting components. The
preferred download mechanism is by SVN client checkouts and updates of tagged releases. The cur-
rent OGCE release bridges several component projects. Each subproject can be built separately
using Apache Maven; a master Maven POM is used to build all subprojects. This approach simpli-
fies both development and deployment. Subprojects can be added when they mature, replaced with
major upgrades, and discarded. Updates can be applied to specific components without rebuilding
the entire software stack or developing a specific patching system. The OGCE software stack is
designed to be portable and to compile on many platforms.
Building a science gateway requires matching end-user requirements, that is, the scientific use cases to be supported, to the capabilities of the gateway stack. Furthermore, developing the underlying work-
flows of the gateway is itself a lengthy scientific task, apart from the implementation issues. There are
always components that can be used as is (security credential management and file browsers, for exam-
ple), but OGCE software is intended to be extended and modified by gateway developers.
This requires a close collaboration between science domain experts and cyberinfrastructure specia-
lists. Long-term sustainability is an important challenge facing all gateways, particularly those that
depend on external resource providers such as the TeraGrid and Open Science Grid (see Figure 5.8
Tiers 1 and 2). These resources and their middleware evolve; gateways with Tier 3 and 4 components
that are not actively maintained will decay. The challenge for many gateways is to maintain their
middleware with reduced funding as the gateway matures from active development to stable usage.
publishing service descriptions and querying the registry. Data in a UDDI registry is organized as
instance types:
• businessEntity Describes an organization or a business that provides the web services, including
the company name, contact information, industry/product/geographic classification, and so on
• businessService Describes a collection of related instances of web services offered by an
organization, such as the name of the service, a description, and so forth
• bindingTemplate Describes the technical information necessary to use a particular web service,
such as the URL address to access the web service instance and references to its description
• tModel A generic container for specification of WSDL documents in general web services
• publisherAssertion Defines a relationship between two or more businessEntity elements
• subscription A standing request to keep track of changes to the entities in the subscription
The entities businessEntity, businessService, bindingTemplate, and tModel form the core data
structures of UDDI, each of which can be uniquely identified and accessed by a URI, called the
“UDDI key.” These entities and their relationships are depicted in Figure 5.10. A UDDI registry
can be used by service providers, service requestors, or other registries. For such interactions with
FIGURE 5.10
UDDI entities and their relationships: a businessEntity (human-readable information about the service publisher) contains businessService entries (descriptive information about particular services), each of which holds bindingTemplate elements (the access URL) pointing to tModel elements (service descriptions).
the registry, UDDI provides a set of APIs. Two basic types of API operations are applied to UDDI
components:
• UDDI Inquiry API Locates registry entries such as business, service, binding, or tModel details: find_ operations match a particular search criterion, and get_ operations return the details of an entry corresponding to a given UDDI key.
• UDDI Publishers API Enables adding, modifying, and deleting entries through save_ and delete_ operations. In addition to these look-up and publishing APIs, UDDI also defines general-purpose operation types, exposed through the following four specialized APIs:
• UDDI Security API Allows users to get and discard authentication tokens (get_authToken, discard_authToken)
• UDDI Custody and Ownership Transfer API Enables registries to transfer the custody of information among themselves and to transfer ownership of these structures to one another (transfer_entities, transfer_custody)
• UDDI Subscription API Enables monitoring of changes in a registry by subscribing to
track new, modified, and deleted entries (delete_subscription, get_subscriptionResults,
get_subscriptions, save_subscriptions)
• UDDI Replication API Supports replication of information between registries so that different
registries can be kept synchronized
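The flavor of these operations can be sketched with a toy in-memory registry; the class below is purely illustrative and is not a UDDI client or server, but it mirrors the save_/find_/get_/delete_ naming style and the use of a UDDI key.

# Illustrative in-memory registry mirroring the UDDI operation style
# (find_/get_/save_/delete_); it is not a real UDDI implementation.
import uuid

class Registry:
    def __init__(self):
        self.services = {}                       # UDDI key -> businessService-like record

    def save_service(self, name, description, access_point):
        key = "uddi:" + str(uuid.uuid4())        # the "UDDI key"
        self.services[key] = {"name": name, "description": description,
                              "bindingTemplate": {"accessPoint": access_point}}
        return key

    def find_service(self, name_fragment):
        return [key for key, record in self.services.items()
                if name_fragment.lower() in record["name"].lower()]

    def get_serviceDetail(self, key):
        return self.services[key]

    def delete_service(self, key):
        del self.services[key]

registry = Registry()
key = registry.save_service("WeatherLookup", "Returns forecasts",
                            "http://example.org/weather?wsdl")
print(registry.find_service("weather"))         # [the generated UDDI key]
print(registry.get_serviceDetail(key)["bindingTemplate"]["accessPoint"])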
Although UDDI is an open standard, it has never gained much popularity among various enter-
prise and scientific communities, as there has been no global registry for registering enterprise,
e-science, or grid services following the UDDI specification after the closure of the public nodes of
the Universal Business Registry operated by IBM, Microsoft, and SAP in January 2006. However,
several public registries have been launched for public use by various communities providing a
categorized list of a variety of services and related APIs. One of them is ProgrammableWeb.com [84].
ProgrammableWeb.com is a registry of a variety of Web 2.0 applications, such as mashups
and APIs organized by category, date, or popularity. It has similar goals to UDDI, but does not
use the detailed UDDI specifications. Mashups are composite Web 2.0 applications which com-
bine capabilities from existing web-based applications, typically RESTful web services. Mashups
can be compared to workflows, as they both implement distributed programming at the service
level. Content used in mashups is typically sourced from a third party via a public interface or
API. According to data released by the ProgrammableWeb.com registry [84], most mashups and APIs are applied in the mapping, search, travel, social networking, instant messaging, shopping, and video streaming areas.
Other methods of sourcing content for mashups include web feeds (such as RSS) and JavaScript. Using the provided API, which is based on open standards including XML, RSS, OpenSearch, and the Atom Publishing Protocol (APP) [85], web developers can programmatically search and retrieve APIs, mashups, member profiles, and other data from the ProgrammableWeb.com catalog, integrate on-demand registry and repository functionality into any service, and dynamically add new content as well as comment on existing entries. Among the most popular mashups
frequently used on the web site are those provided for Google Maps, Flickr, Facebook, Twitter, and
YouTube.
FIGURE 5.11
The Oracle publish-subscribe model, built on Advanced Queuing.
Message queuing and publish-subscribe are tightly intertwined: Oracle introduced a publish-
subscribe solution for enterprise information delivery and messaging, shown in Figure 5.11, and
based on Advanced Queuing—an implementation based on JMS and fully integrated in the Oracle
database—to publish data changes and to automate business process workflows for distributed
applications. Ontologies and other semantic mechanisms can be used to make events in the system
more “context-aware.” Advanced Queuing was introduced in Oracle 8 and extended in Oracle 9i to
support publish-subscription. In Oracle release 10.1, Advanced Queuing was integrated into Oracle
Streams, called Oracle Streams Advanced Queuing.
A combination of features is introduced to allow a publish-subscribe style of messaging between applications. These features include rule-based subscribers, message propagation, the listen
feature, and notification capabilities. Oracle Streams Advanced Queuing is built on top of Oracle
Streams and leverages the functionality of Oracle Database so that messages can be stored persis-
tently, propagated between queues on different computers and databases, and transmitted using Ora-
cle Net Services and HTTP(S). As Oracle Streams Advanced Queuing is implemented in database
tables, all operational benefits of high availability (HA), scalability, and reliability are also applic-
able to queue data. Standard database features such as recovery, restart, and security are supported
by Oracle Streams Advanced Queuing. Database development and management tools such as Oracle
Enterprise Manager can be applied to monitor queues. Like other database tables, queue tables can
be imported and exported.
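A rule-based subscriber can be sketched independently of any database machinery; the in-process dispatcher below is only illustrative, whereas Oracle Streams Advanced Queuing evaluates such rules against persistent queues inside the database.

# Sketch of rule-based publish-subscribe: subscribers register a predicate
# ("rule") and receive only the messages that satisfy it. This in-process
# dispatcher only illustrates the idea and uses invented message fields.
class PubSubQueue:
    def __init__(self):
        self.subscribers = []                      # (name, rule, callback)

    def subscribe(self, name, rule, callback):
        self.subscribers.append((name, rule, callback))

    def publish(self, message):
        for name, rule, callback in self.subscribers:
            if rule(message):                      # evaluate the subscriber's rule
                callback(name, message)

def deliver(name, message):
    print(f"{name} received {message}")

bus = PubSubQueue()
bus.subscribe("billing", lambda m: m["type"] == "order" and m["amount"] > 100, deliver)
bus.subscribe("audit",   lambda m: True, deliver)            # wildcard subscriber
bus.publish({"type": "order", "amount": 250})                # both subscribers fire
bus.publish({"type": "order", "amount": 20})                 # only "audit" fires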
function and location without significant rich context. Metadata catalogs are used by various
groups and communities, ranging from high-energy physics to biomedical, earth observation, and
geological science.
Because of the importance of metadata services for the use of large-scale local or wide area sto-
rage resources, many groups have made efforts to investigate and implement such services. Among
the earliest metadata catalogs is the Metadata Catalog Service (MCAT) [90], which is a part of the
Storage Resource Broker (SRB) [91], which has evolved to the iRODS system [50]. MCAT, devel-
oped by the San Diego Supercomputing Center, aims to provide an abstraction layer over heteroge-
neous storage devices and file systems either inside or across computing centers. MCAT stores the
data hierarchically using a tree of collections and is both a file and metadata catalog. Later versions
of MCAT support replication and federation of data resources.
The MCS developed by the Globus Alliance [92] provides hierarchical organization of metadata
and flexible schemas, and hides the storage backend from the user. The Globus project also contains
the Replica Location Service (RLS) [93], which uses index servers to provide a global list of files
available on different replica catalogs. Several LHC experiments have implemented their own speci-
fic metadata catalogs using a standard relational database backend and providing an intermediate
layer to access the catalog on a distributed environment.
AMGA (the ARDA Metadata for Grid Applications) [94] is the official metadata catalog of the
gLite software stack of the EGEE project. It began as an exploratory project to study the metadata
requirements of the LHC experiments, and has since been deployed by several groups from differ-
ent user communities, including high-energy physics (for LHCb bookkeeping), biomedicine, and
earth observation. AMGA uses a hierarchical file system-like model to structure metadata, stored in
a relational database. It stores entries representing the entities that are being described, such as files.
The entries are grouped into collections, each of which can have a variable number of user-defined attributes, called the schema of the collection.
Attributes are represented as key-value pairs with type information, and each entry assigns an
individual value to the attributes of its collection. A schema can be a representation of a directory,
which can contain either entries or other schemas. As an advantage of this tree-like structure, users
can define a hierarchical structure which can help to better organize metadata in subtrees that can
be queried independently. The server supports several storage systems by using modules. AMGA
can manage groups of users with different permissions on directories. In a grid environment, file and metadata catalogs are used for discovering and locating data among the hundreds of grid sites.
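The following sketch, with invented class names and attributes, illustrates the AMGA-style organization of collections, schemas, and entries; it is not the AMGA interface.

# Illustrative sketch of an AMGA-style hierarchical metadata catalog (not the
# AMGA interface itself): collections form a directory tree, each collection
# declares a schema, and entries assign values to those attributes.
class Collection:
    def __init__(self, name, schema):
        self.name = name
        self.schema = schema                 # attribute name -> type
        self.entries = {}                    # entry name -> attribute values
        self.children = {}                   # subcollections

    def add_entry(self, name, **attributes):
        for key, value in attributes.items():
            assert isinstance(value, self.schema[key]), f"bad type for {key}"
        self.entries[name] = attributes

    def query(self, predicate):
        """Query this subtree independently of the rest of the catalog."""
        hits = [(self.name, e) for e, attrs in self.entries.items() if predicate(attrs)]
        for child in self.children.values():
            hits.extend(child.query(predicate))
        return hits

root = Collection("/experiment", {})
runs = Collection("/experiment/runs", {"energy_TeV": float, "events": int})
root.children["runs"] = runs
runs.add_entry("file_001.root", energy_TeV=7.0, events=120000)
runs.add_entry("file_002.root", energy_TeV=8.0, events=95000)
print(root.query(lambda a: a["energy_TeV"] >= 8.0))   # [('/experiment/runs', 'file_002.root')]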
To achieve this, work has been undertaken to assert the meaning of web resources in a common
data model, the Resource Description Framework (RDF), using agreed ontologies expressed in a
common language, such as the OWL web ontology language, so we can share the metadata and
add in background knowledge. From this basis we should be able to query, filter, integrate, and
aggregate the metadata, and reason over it to infer more metadata by applying rules and policies.
RDF is the first language developed for the Semantic web, using XML to represent information
(including metadata) about resources on the web. RDF uses web identifiers (URIs), and describes
resources in terms of simple properties and property values. OWL is an expressive ontology lan-
guage that extends RDF schema. OWL adds more vocabulary for describing properties and classes:
among others, relations between classes, cardinality, equality, richer typing of properties, character-
istics of properties, and enumerated classes.
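A minimal sketch may make the triple model concrete; the URIs and property names below are invented for illustration, and a real application would use a library such as rdflib or a dedicated triple store.

# Minimal sketch of RDF's data model: statements are (subject, property, value)
# triples naming resources by URI. The URIs and vocabulary below are invented.
triples = [
    ("http://example.org/svc/blast", "http://purl.org/dc/terms/title", "BLAST service"),
    ("http://example.org/svc/blast", "http://example.org/vocab#domain", "bioinformatics"),
    ("http://example.org/svc/emma",  "http://example.org/vocab#domain", "bioinformatics"),
]

def query(subject=None, prop=None, value=None):
    """Return triples matching the given pattern (None acts as a wildcard)."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (prop is None or t[1] == prop)
            and (value is None or t[2] == value)]

# All resources described as belonging to the bioinformatics domain:
for s, p, v in query(prop="http://example.org/vocab#domain", value="bioinformatics"):
    print(s)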
Semantic web services describe and annotate various aspects of a web service using explicit,
machine-understandable semantics, facilitating the discovery, execution, monitoring, and aggregation
of resources and web services, which solves interoperability issues and helps bring resources
together to create virtual organizations. The OWL-S ontology enables web services to be described
semantically and their descriptions to be processed and understood by software agents. It provides a
standard vocabulary that can be used together with the other aspects of the OWL description lan-
guage to create service descriptions. The OWL-S ontology defines the top-level concept of a
“Service” and three OWL-S subontologies:
• Service profile Expresses what a service does in order to enable service advertisement and
discovery.
• Service model Describes how the service works in order to enable service invocation,
composition, monitoring, and recovery.
• Service grounding Specifies the details of how to access the service. A grounding can be
thought of as a mapping from an abstract to a concrete specification, based on WSDL as a
particular specification language.
The “semantic grid” or “grid with semantics” aims to incorporate the advantages of the grid,
Semantic web, and web services. Built on the W3C Semantic Web Initiative, it is an extension of
the current grid in which information and services are given well-defined meanings (ontologies,
annotations, and negotiation processes as studied in the Semantic web and Software Agent para-
digms), better enabling computers and people to work in cooperation. The semantic grid provides a
general semantic-based, computational, and knowledge-based service for enabling the management
and sharing of complex resources and reasoning mechanisms, thus systematically exposing semanti-
cally rich information associated with resources to build more intelligent grid services.
The notion of the semantic grid shown in Figures 5.12 and 5.13 was first articulated in the context of e-science. E-science refers to scientific investigation carried out by scientists of various communities, such as physicists, biologists, and chemists, who pool their resources through distributed global collaborations such as the grid to solve scientific problems by generating, analyzing, sharing, and discussing their insights, experiments, and results in an effective manner; it also refers to the computing infrastructure that enables this joint effort. Higher-level services use the information
relating to the resources’ capability and the mechanisms for service fulfillment to automatically
discover interoperable services and select the most appropriate service for the user with minimal
human intervention.
FIGURE 5.12
Semantic grid-related concepts and technologies (web services, grid services, Semantic web services, and the semantic grid).
(Courtesy of Goble and Roure, (ECAI-2004), Valencia, Spain, [95])
FIGURE 5.13
Semantic grid architecture: knowledge grid services (portal, collaboratory, data mining, text mining, and knowledge services) are layered over semantic grid services (knowledge-based data/computation, information, and grid services) and base grid services (computation, data, and information services), all built on OGSA, web services, WSRF, and the grid middleware fabric.
(Courtesy of Goble and Roure, (ECAI-2004), Valencia, Spain, [95])
Semantic OGSA (S-OGSA) [96] has been proposed as a reference architecture for semantic-based grids. S-OGSA extends OGSA to support the explicit handling of semantics, and defines the asso-
ciated knowledge services to support a set of service capabilities. This is done by introducing
semantic provisioning services that give support to the provision of semantics, by allowing the crea-
tion, storage, update, removal, and access of different forms of knowledge and metadata. This set of services includes ontology management and reasoning services, metadata services, and annotation services.
S-OGSA has three main aspects: the model (the elements which it is composed of and their
interrelationships), the capabilities (the services needed to deal with such components), and the
mechanisms (the elements that will enable delivery when deploying the architecture in an applica-
tion, grounded to a grid platform). The myGrid [97] project community, a pioneer of semantic grid efforts focused on the challenges of e-science, has developed a suite of tools and services to enable workflow-based composition of diverse biological data and computational resources. Within
the project, Semantic web technologies have been applied to the problems of resource discovery
and workflow results management in a grid environment.
resources to publish the job requirements and resource description in a form of semistructured data.
The agents and resources subscribe to a matchmaker, which scans the published data and pairs jobs
with resources according to the preferences and constraints specified in the data. Once a pair is
formed, the matchmaker notifies the matching agent and resource. The agent and resource then
establish contact and execute a job.
Dryad has a similar scheduling mechanism. Each execution of a vertex in a DAG has an execu-
tion record that contains the state of the execution and the versions of the predecessor vertices pro-
viding its input data. A vertex is placed into a scheduling queue when its input data is ready. The
constraints and preferences of running the vertex are attached to its execution record (e.g., the ver-
tex may have a list of computers on which it would like to run, or prefers to be located in a compu-
ter where a data set is stored). The scheduler then does the matchmaking by allocating the vertex to
the resource. The approach has a long history and can be traced back to the Linda programming
model [102] and middleware inspired by Linda, such as JavaSpaces [103] where jobs are published
in a shared space generalizing the messaging queue and consumed by resources subscribed to the
space.
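The matchmaking idea can be sketched in a few lines; the attribute names and the ranking rule below are invented for illustration and do not reproduce Condor's ClassAd language or Dryad's scheduler.

# Sketch of matchmaking: jobs publish requirements and preferences, resources
# publish descriptions, and a matchmaker pairs them. All attributes are invented.
jobs = [
    {"id": "job1", "needs_memory_gb": 8,  "prefers_site": "argonne"},
    {"id": "job2", "needs_memory_gb": 32, "prefers_site": None},
]
resources = [
    {"id": "node-a", "memory_gb": 16, "site": "argonne", "free": True},
    {"id": "node-b", "memory_gb": 64, "site": "tacc",    "free": True},
]

def matchmake(jobs, resources):
    pairs = []
    for job in jobs:
        candidates = [r for r in resources
                      if r["free"] and r["memory_gb"] >= job["needs_memory_gb"]]
        if not candidates:
            continue                                        # job stays in the queue
        # rank: prefer the requested site, then the smallest adequate node
        candidates.sort(key=lambda r: (r["site"] != job["prefers_site"], r["memory_gb"]))
        chosen = candidates[0]
        chosen["free"] = False                              # notify both parties
        pairs.append((job["id"], chosen["id"]))
    return pairs

print(matchmake(jobs, resources))   # [('job1', 'node-a'), ('job2', 'node-b')]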
As jobs can be dispatched to different nodes, a job execution environment typically needs the
support of a distributed data management system for a job to access remote data sets and, some-
times, exchange data with other jobs. As mentioned earlier, MapReduce is supported by GFS.
Dryad also has a GFS-like distributed storage system, which can break large data files into
small chunks. These chunks are then distributed and replicated across the disks of nodes in the
system. The communication channels between jobs are often file-based. Through a distributed
storage system, a file can be passed from one job to another transparently. Accessing and
exchanging data with certain structures can be supported through abstractions built on top of the
distributed storage system. Google BigTable [104] and Amazon Dynamo [105] are two
examples.
The data abstraction in BigTable, described in Section 6.3, is a multidimensional sorted map, and in Dynamo it is a key-value pair. With these abstractions, a job can access multidimensional data or
key-value pairs from any node in the system. As a consequence, a job execution environment is
very powerful nowadays. It is capable of running not only computational jobs, but also a variety of
data-intensive jobs. However, as maintaining data consistency comes with a high cost in a distribu-
ted environment, such a job execution environment has limitations for running applications with
strong consistency requirements. At the time of this writing, BigTable and Dynamo only support
relaxed consistency.
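The difference between the two abstractions can be sketched as follows; the row, column, and key names are invented for illustration, and neither snippet reproduces the real BigTable or Dynamo APIs.

# Tiny illustration (not the real APIs) of the two data abstractions: a
# BigTable-like multidimensional sorted map keyed by (row, column, timestamp),
# and a Dynamo-like flat key-value store.
bigtable_like = {
    ("com.example/www", "anchor:cnn.com", 1001): "Example",
    ("com.example/www", "contents:html",  1001): "<html>v1</html>",
    ("com.example/www", "contents:html",  1002): "<html>v2</html>",
}

def scan_row(table, row_key):
    """Return one row's cells in sorted (column, timestamp) order,
    mimicking a sorted-map range scan."""
    return sorted((k, v) for k, v in table.items() if k[0] == row_key)

for (row, column, ts), value in scan_row(bigtable_like, "com.example/www"):
    print(column, ts, value)

dynamo_like = {"cart:alice": {"items": ["book", "pen"]}}   # opaque key -> value blob
print(dynamo_like["cart:alice"]["items"])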
In a large-scale distributed system where applications are dynamically placed onto any compu-
ters to run, the capability of collecting information regarding the application and resource status
becomes crucial for the system to achieve efficiency, detect failures or risks, and track system
states. A contemporary distributed system often has a complex monitoring subsystem. Astrolabe
[106], used in Amazon, is such an example. Astrolabe monitors the state of a collection of distribu-
ted resources and is capable of continuously computing summaries of the data in the system
through aggregation. Its aggregation mechanism is driven by SQL queries (e.g., the query SELECT
MIN(load) AS load returns the minimal load in the system).
Aggregating such information dynamically from a large number of nodes is challenging. Tradi-
tional monitoring systems in clusters do not scale beyond a few dozen nodes [106]. Astrolabe
adopts a decentralized approach to achieve scalability. It runs an agent on each node. The agents
talk with one another using a gossip protocol. Information managed by Astrolabe is organized
hierarchically. Each agent maintains a data structure that contains a subset of the hierarchy. The
local information is updated directly in an agent. An agent that needs the information of an internal
node of the hierarchy can aggregate the information of the children in the hierarchy from agents
that are responsible for them. Information of sibling nodes in the hierarchy can be obtained from
relevant agents through the gossip protocol. The gossip protocol is simple: Periodically, each agent randomly selects some other agent and exchanges information with it. The information exchanged is that associated with the least common ancestor of the two agents in the hierarchy.
Through this mechanism, Astrolabe is capable of taking snapshots of the distributed state. The
state is aggregated dynamically without a central agent storing all the information. However, the
aggregated information is replicated among all the agents involved in the process and it raises con-
sistency issues. Two users retrieving the same data attribute may get different results. Astrolabe
only supports a relaxed consistency called eventual consistency, that is, given an aggregate attribute
X that depends on some other attribute Y, Astrolabe guarantees that an update made to Y is even-
tually reflected in X. Such a monitoring system is important to a job execution environment, parti-
cularly when the underlying system scales.
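A toy simulation conveys how gossip-based aggregation of a query such as MIN(load) proceeds without a central collector; it is only a sketch and omits Astrolabe's zone hierarchy, SQL interface, and failure handling.

# Toy simulation of gossip-based aggregation (not Astrolabe's actual protocol):
# each agent keeps its own load plus its current estimate of MIN(load); in each
# round an agent exchanges estimates with a random peer and both keep the smaller
# value, so the true minimum spreads epidemically with no central collector.
import random

random.seed(1)
loads = {f"node{i}": random.uniform(0.0, 1.0) for i in range(16)}
estimates = dict(loads)                       # each agent starts with its own load
rounds = 0

while len(set(estimates.values())) > 1:       # until every agent agrees
    rounds += 1
    for agent in list(estimates):
        peer = random.choice(list(estimates))
        merged = min(estimates[agent], estimates[peer])
        estimates[agent] = estimates[peer] = merged

print(rounds, "rounds; every agent now holds MIN(load) =", min(loads.values()))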
FIGURE 5.14
The concept of the grid of grids of services: application grids are built from core grid capabilities (data access/storage, registry, metadata, security, notification, workflow, and messaging) layered over the physical network.
In Figure 5.14, we show multiple application grids in different areas of “critical infrastructure.”
Component grids (subgrids) are invoked for collaboration, visualization, sensor fusion, computing,
and GIS applications. This builds on core grid capabilities in areas covered in Sections 5.2, 5.3, and
5.4. By adding application-specific services, one builds a distributed system to support the study of
flood, gas, or electrical infrastructure. Workflow is used to integrate component grids and services.
Note that this discussion is phrased in terms of grids, but is equally applicable to clouds. We will
discuss grid computing in more detail in Chapter 7.
FIGURE 5.15
Hierarchical computing, data, and programming abstraction, contrasting traditional programming with workflow.
UNIX primitives. For example, TCP channels or publish-subscribe messaging replaces pipes.
In Figure 5.15, we show that multilevel architectures are very common—not just in programming
as in workflow, but also in computing, databases, and sensors.
The concept of workflow was first developed by the Workflow Management Coalition [111]
which has existed for almost 20 years and generated standard reference models, documents, and a
substantial industry of tools and workflow management support products. However, this coalition is
largely concerned with business process management, with steps in a workflow often involving
human and not computer steps. For example, Allen defined business workflow as the automation of
a business process, in whole or in part, during which documents, information, or tasks are passed
from one participant to another for action, according to a set of procedural rules [112]. Thus, work-
flow in this chapter addresses very different issues from those described by the Workflow Manage-
ment Coalition.
We can identify the development of workflow concepts and systems as a major achievement
of the grid and distributed system community. Note that services are, of course, equally critical,
but the essence of that idea came from business systems. Historically, workflow concepts evolved
from distributed programming using systems such as Linda [113], HeNCE [114], AVS [115], and
Khoros [116] as well as complex Shell (Perl) scripts. Petri Nets also can be considered one of the
early ideas important for workflow. Although several good workflow systems have been built on
Petri Nets [117–119], the most popular systems today are not based on Petri Nets. There are
important commercial areas using environments very similar to scientific workflow. These
include systems to analyze experimental (laboratory) data, termed Laboratory Information Management Systems or LIMS, as exemplified by [120,121] or the general resource [122]. The chemical infor-
matics and business intelligence areas have several workflow-like systems including InforSense
[123] and Pipeline Pilot [124].
FIGURE 5.16
Two typical (Load and Merge) workflows from the Pan-STARRS astronomy data processing area. The Load workflow runs a sanity check of the network files, manifest, and checksum; creates and registers a load database and table from a template schema; and, for each CSV file in the batch, validates the CSV file and table schema, bulk loads the CSV into the table, and performs CSV file/table and load database/batch validation. The Merge workflow determines “merge worthy” load databases and slices cold databases.
(Courtesy of Barga, et al. [129])
FIGURE 5.17
A workflow in the Taverna system, chaining bioinformatics services such as hsapiens_gene_ensembl, GetUniqueHomolog, CreateFasta, seqret, emma, plot, and FlattenImageList to produce the workflow outputs.
(Courtesy of C. Goble, 2008 [132])
nontrivial control constructs such as looping over regions of workflow and conditional decisions. Note that
each box can be a sequential or parallel component.
The need for the above operations is seen in the tornado tracking workflow [131] of LEAD
II [133] which involves data processing services linked to traditional (MPI) parallel weather
simulations. The concurrency expressed in workflow is often termed “functional” parallelism, distinguishing it from “data” parallelism. Typically, functional parallelism (corresponding to the different services in an application) is absolute, with its size and nature determined by the problem. Data parallelism, on the other hand, is achieved by dividing a large data set into parts, and its degree is determined by the number of available cores; processing the different parts corresponds to using a parallel computing technology (MPI, threading, MapReduce) to implement the individual services.
made overhead less important than, say, for MPI. This same feature typically allows workflow to be
executed in a distributed fashion—the network latency of long communication hops is often not
important. The classification in [139] is of workflows used in domains from meteorology and ocean
modeling, bioinformatics and biomedical workflows, astronomy and neutron science. These are
examined according to their size, resource use, graph pattern, data pattern, and usage scenario.
BPEL specifies the control and not the data flow of a workflow. Of course, the control structure
implies the data flow structure for a given set of nodes.
Figure 5.18 shows a pipeline as seen for the astronomy workflows of Figure 5.16. A more gen-
eral workflow structure is that of the DAG, which is a collection of vertices and directed edges,
each edge connecting one vertex to another such that there are no cycles. That is, there is no way
to start at some vertex V and follow a sequence of edges that eventually loops back to that vertex V
again. In spite of sophisticated specialized workflow systems, scripting using traditional languages
and toolkits is perhaps the dominant technique used to build workflows. Often this is done in an
informal fashion using any environment with distributed computing (Internet) support; PHP may be
the most popular environment for building mashups, but Python and JavaScript are also well used.
Figure 5.18 shows a simple cyclic graph. Dagman [159], used in Condor [160], is a sophisticated DAG processing engine. This leads to a class of workflow systems, such as Pegasus [146], aimed at sche-
duling the nodes of DAG-based workflows. Karajan [161] and Ant [162] can also easily represent
DAGs. Note that most sophisticated workflow systems support hierarchical specifications—namely
the nodes of a workflow can be either services or collections of services (subworkflows). This is
consistent with the Grid of Grids concept.
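The core scheduling idea behind DAG-based workflow engines can be sketched briefly; the task names below are invented, and real engines such as DAGMan or Pegasus add resource matching, retries, data staging, and provenance on top of this skeleton.

# Minimal sketch of DAG-based workflow execution: a node is run as soon as all
# of its predecessors have completed (Kahn-style topological scheduling).
from collections import deque

# edges: predecessor -> successors (the task names are illustrative)
dag = {"fetch": ["clean"], "clean": ["analyze", "plot"], "analyze": ["report"],
       "plot": ["report"], "report": []}
tasks = {name: (lambda n=name: print("running", n)) for name in dag}

def run_workflow(dag, tasks):
    indegree = {n: 0 for n in dag}
    for successors in dag.values():
        for s in successors:
            indegree[s] += 1
    ready = deque(n for n, d in indegree.items() if d == 0)   # inputs available
    while ready:
        node = ready.popleft()
        tasks[node]()                                         # invoke the job
        for successor in dag[node]:
            indegree[successor] -= 1
            if indegree[successor] == 0:                      # all inputs now on "disk"
                ready.append(successor)

run_workflow(dag, tasks)   # fetch, clean, then analyze/plot in either order, then report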
The issues that need to be addressed by a workflow engine are well understood in the active
research community but are not well described in an integrated fashion [15,107–110]. Clearly,
many and perhaps all of the execution issues discussed in this book for distributed systems, clouds,
and grids are implicitly or explicitly relevant for workflow execution engines. Figure 5.19 speci-
fies the interaction between the constituent services or activities. One important technology
choice is the mechanism for transferring information between the nodes of the graph. The sim-
plest choice is that each node reads from and writes to disk and this allows one to treat the
execution of each node as an independent job invoked when all its needed input data is available
on disk.
FIGURE 5.18
A workflow graph that includes subgraphs illustrating pipelines and loops.
This could seem inefficient, but we are not operating in a region for which MPI and parallel
computing technologies are designed. There, low latency (microseconds) is often essential, but in a
workflow, we have a different communication pattern: long-running jobs inputting and outputting
substantial data sets. The cost of reading and writing is often quite acceptable and allows simpler
fault tolerant implementations. Of course, one can use the messaging systems described in Section
5.2 to manage data transfer in a workflow, and in extreme cases, simple models where all communi-
cation is handled by a single central “control node.”
Obviously, this latter solution could lead to poor performance that does not properly scale as
workflow size increases. Distributed object technology such as CORBA can also be used for com-
munication, as it is in the Discover Middleware [136]. In fact, there are often two communication
systems in workflow environments corresponding to “control” and “data,” respectively. Obviously,
the control communication would usually have small messages and very different requirements
from the data network. In this regard, one should mention the “proxy model” which is often used
in grid architectures and workflow.
Suppose one of the nodes in Figure 5.18 corresponds to the execution of a large simulation
job—say, the chemistry code Amber or Gaussian. Then one could consider the node as the Amber
code with links corresponding directly to data used by this code. However, this is not the usual arrangement. Rather, the node is a service that contains the metadata and mechanisms needed to invoke Amber on a (remote) machine and to determine when the run is complete. Then the information flowing between
proxy nodes is all essentially control information. This proxy model can also be considered an
agent framework [137].
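As a rough sketch of this proxy model (Python; the submit and status callables are hypothetical stand-ins for whatever grid middleware actually launches the job), the workflow node holds only metadata and control logic, while the real Amber or Gaussian run happens on the remote machine.

import time

class ProxyNode:
    # A workflow node that represents a remote simulation job rather than the job itself.
    # Only small control messages flow between proxy nodes; the bulk data stays remote.
    def __init__(self, executable, input_spec, host, submit, status):
        self.executable = executable      # e.g., "amber" -- metadata only
        self.input_spec = input_spec      # description of the input data set
        self.host = host                  # (remote) machine that will run the code
        self._submit = submit             # middleware-specific submission function
        self._status = status             # middleware-specific status query function
        self.job_id = None

    def launch(self):
        # Ask the remote resource to start the job and remember its identifier.
        self.job_id = self._submit(self.host, self.executable, self.input_spec)

    def wait(self, poll_seconds=30):
        # Poll the remote resource until the job finishes, then report the outcome.
        while self._status(self.host, self.job_id) not in ("DONE", "FAILED"):
            time.sleep(poll_seconds)
        return self._status(self.host, self.job_id)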
(Figure: a Swift script on a submit host, such as a laptop or login host, dispatches application jobs to compute nodes on the TeraGrid, the Open Science Grid, and clouds, using site and application lists and recording workflow status and provenance logs.)
FIGURE 5.19
Swift workflow system architecture.
(Courtesy of www.ci.uchicago.edu/swift/ [138] )
(Figure: 100 unknown protein sequences, such as >seq_42 MKKLATLISAALLSSTVSVA..., are matched by the BLAST program against the 'NR' protein database to annotate organism and putative function; 10 parallel BLAST runs test small sample data sets on a workstation.)
FIGURE 5.20
Application of Swift to bioinformatics.
(Courtesy of Wikipedia, http://en.wikipedia.org/wiki/Many-task_computing, [154] )
Pegasus [146], noted above, implements the scheduling style of workflow. Taverna [18,147] from the myGrid project [148] is very popular in the bioinformatics community, and the UK OMII effort [149] has put substantial work into making the system robust. An innovative extension of this project is the
myExperiment scientific social networking site [150], which enables sharing of workflows. Other
key aspects of workflow are security [163] and fault tolerance [110].
The HPSearch project [50,151,152] at Indiana University supported workflow efficiently from JavaScript, focusing on control of the messaging system NaradaBrokering [39], which is used to manage streaming workflows (i.e., those in which data is streamed between workflow nodes rather than written to disk). There are a few other workflow systems aimed at streaming data [153]. The newer scripting environment Swift [138] supports the Many Task programming model, in which very large numbers of jobs need to be executed [154].
There are very many approaches to workflow, which are largely successful in prototype, one-off situations. However, experience has found that most are not really robust enough for production use outside the development team. This observation motivated Microsoft to put its Trident workflow environment [129,130] into open source for science. Trident is built on the commercial-quality Windows Workflow Foundation. Note that in most cases one would use Trident in a
“proxy” fashion with workflow agents running on Windows controlling remote components that
largely run on Linux machines [131,133]. Some workflow systems are built around the data flow concept, which was the original model [115,116,155] and is the direct opposite of the proxy concept.
Today's mashups are typically data flow systems with the interaction scripted in languages such as JavaScript or PHP. ProgrammableWeb.com has, for example, more than 2,000 APIs and 5,000 mashups recorded as of September 2010 [84]. Commercial mashup approaches include Yahoo! Pipes (http://pipes.yahoo.com/pipes/), Microsoft Popfly (http://en.wikipedia.org/wiki/Microsoft_Popfly), and IBM's sMash enhanced BPEL Web 2.0 workflow environment [156]. There is some disagreement as to whether MapReduce is a workflow environment (see, for example, Oozie at http://yahoo.github.com/oozie/), since MapReduce and workflow systems can address similar problems. Other workflow approaches extend the "remote method invocation" model, which underlies the Common Component Architecture (CCA) [157,158].
Acknowledgements
This chapter was coauthored by Geoffrey Fox of Indiana University and Albert Zomaya of
Sydney University, jointly. Their work was assisted by Ali Javadzadeh Boloori and Chen Wang
of the Univ. of Sydney; by Shrideep Pallickara, Marlon Pierce, and Suresh Marru of Indiana
Univ.; by Michael McLennan, George B. Adams III, and Gerhard Klimeck of Purdue Univ. and
by Michael Wilde of the Univ. of Chicago. Kai Hwang of USC has edited the final manuscript
of this chapter.
References
[1] D. Booth, H. Haas, F. McCabe, et al., Working Group Note 11: Web Services Architecture, www.w3
.org/TR/2004/NOTE-ws-arch-20040211/ (accessed 18.10.10).
[2] R. Fielding, Architectural Styles and the Design of Network-Based Software Architectures, University of
California at Irvine, 2000, p. 162, http://portal.acm.org/citation.cfm?id=932295.
[3] M. Hadley, Web Application Description Language (WADL), W3C Member Submission, www.w3.org/
Submission/wadl/, 2009 (accessed 18.10.10).
[4] Restlet, the leading RESTful web framework for Java, www.restlet.org/ (accessed 18.10.10).
[5] JSR 311 – JAX-RS: Java API for RESTful Web Services, https://jsr311.dev.java.net/ (accessed 18.10.10).
[6] Jersey open source, production quality, JAX-RS (JSR 311) reference implementation for building REST-
ful web services, https://jersey.dev.java.net/ (accessed 18.10.10).
[7] D. Winer, The XML-RPC specification, www.xmlrpc.com/, 1999 (accessed 18.10.10).
[8] L. Richardson, S. Ruby, RESTful Web Services, O’Reilly, 2007.
[9] A. Nadalin, C. Kaler, R. Monzillo, P. Hallam-Baker, Web services security: SOAP message security 1.1
(WS-Security 2004), OASIS Standard Specification, http://docs.oasis-open.org/wss/v1.1/wss-v1.1-spec-os-
SOAPMessageSecurity.pdf, 2006 (accessed 18.10.10).
[10] A. Andrieux, K. Czajkowski, A. Dan, et al., Web services agreement specification (WS-Agreement),
OGF Documents, GFD.107, www.ogf.org/documents/GFD.107.pdf, 2007 (accessed 18.10.2010).
[11] D. Davis, A. Karmarkar, G. Pilz, S. Winkler, Ü. Yalçinalp, Web services reliable messaging (WS-Reliable-
Messaging), OASIS Standard, http://docs.oasis-open.org/ws-rx/wsrm/200702, 2009 (accessed 18.10.2010).
[12] M. Little, A. Wilkinson, Web services atomic transaction (WS-AtomicTransaction) Version 1.2, OASIS
Standard, http://docs.oasis-open.org/ws-tx/wstx-wsat-1.2-spec-os.pdf, 2009.
[13] M. Feingold, R. Jeyaraman, Web services coordination (WS-Coordination) Version 1.2, OASIS Standard,
http://docs.oasis-open.org/ws-tx/wstx-wscoor-1.2-spec-os.pdf, 2009 (accessed 18.10.2010).
[14] K. Chiu, M. Govindaraju, R. Bramley, Investigating the limits of SOAP performance for scientific
computing, in: 11th IEEE International Symposium on High Performance Distributed Computing, 2002,
pp. 246–254.
[15] D. Gannon, G. Fox, Workflow in grid systems, Editorial of special issue of Concurrency & Computation:
Practice & Experience, based on GGF10 Berlin meeting, Vol. 18, No. 10, 2006, pp. 1009–1019, doi:
http://dx.doi.org/10.1002/cpe.v18:10 and http://grids.ucs.indiana.edu/ptliupages/publications/Workflow-over-
view.pdf.
[16] JBPM Flexible business process management (BPM) suite, www.jboss.org/jbpm (accessed 18.10.10).
[17] JBoss enterprise middleware, www.jboss.org/ (accessed 18.10.10).
[18] Taverna workflow management system, www.taverna.org.uk/ (accessed 18.10.10).
[19] R.V. Englen, K. Gallivan, The gSOAP toolkit for web services and peer-to-peer computing networks,
in: 2nd IEEE/ACM International Symposium on Cluster Computing and The Grid (CCGRID ’02),
2002.
[20] R. Salz, ZSI: The zolera soap infrastructure, http://pywebsvcs.sourceforge.net/zsi.html, 2005 (accessed
18.10.10).
[21] J. Edwards, 3-Tier server/client at work, first ed., John Wiley & Sons, 1999.
[22] B. Sun, A multi-tier architecture for building RESTful web services, IBM Developer Works 2009, www.ibm.com/developerworks/web/library/wa-aj-multitier/index.html, 2009 (accessed 18.10.2010).
[23] G. Alonso, F. Casati, H. Kuno, V. Machiraju, Web Services: Concepts, Architectures and Applications
(Data-Centric Systems and Applications), Springer Verlag, 2010.
[24] I. Foster, H. Kishimoto, A. Savva, et al., The open grid services architecture version 1.5, Open Grid
Forum, GFD.80, www.ogf.org/documents/GFD.80.pdf, 2006.
[25] I. Foster, S. Tuecke, C. Kesselman, The philosophy of the grid, in: 1st International Symposium on Clus-
ter Computing and the Grid (CCGRID0), IEEE Computer Society, 2001.
[26] S. Tuecke, K. Czajkowski, I. Foster, et al., Open Grid Services Infrastructure (OGSI) Version 1.0. Global Grid
Forum Proposed Recommendation, GFD15, www.ggf.org/documents/GFD.15.pdf, 2003 (accessed
18.10.10).
[27] S. Graham, A. Karmarkar, J. Mischkinsky, I. Robinson, I. Sedukhin, Web Services Resource 1.2 (WS-
Resource) WSRF, OASIS Standard, http://docs.oasis-open.org/wsrf/wsrf-ws_resource-1.2-spec-os.pdf,
2006 (accessed 18.10.10).
[28] M. Gudgin, M. Hadley, T. Rogers, Web Services Addressing 1.0 – Core, W3C Recommendation, 9 May
2006.
[29] G. Fox, D. Gannon, A Survey of the Role and Use of Web Services and Service Oriented Architectures
in Scientific/Technical Grids, http://grids.ucs.indiana.edu/ptliupages/publications/ReviewofServices
andWorkflow-IU-Aug2006B.pdf, 2006 (accessed 16.10.10).
[30] Supported version of Mule ESB, www.mulesoft.com/.
[31] Open source version of Mule ESB, www.mulesoft.org/.
[32] IBM’s original network software MQSeries rebranded WebSphereMQ in 2002, http://en.wikipedia.org/
wiki/IBM_WebSphere_MQ.
[33] WebSphereMQ IBM network software, http://en.wikipedia.org/wiki/IBM_WebSphere_MQ.
[34] H. Shen, Content-based publish/subscribe systems, in: X. Shen, et al., (Eds.), Handbook of Peer-to-Peer
Networking, Springer Science+Business Media, LLC, 2010, pp. 1333–1366.
[35] Java Message Service JMS API, www.oracle.com/technetwork/java/index-jsp-142945.html and http://en
.wikipedia.org/wiki/Java_Message_Service.
[63] A. Strachan, G. Klimeck, M.S. Lundstrom, Cyber-enabled simulations in nanoscale science and engineer-
ing, Comput. Sci. Eng. 12 (2010) 12–17.
[64] H. Abelson, The creation of OpenCourseWare at MIT, J. Sci. Educ. Technol. 17 (2) (2007) 164–174.
[64a] T.J. Hacker, R. Eigenmann, S. Bagchi, A. Irfanoglu, S. Pujol, A. Catlin, E. Rathje, The NEEShub cyberinfrastructure for earthquake engineering, Computing in Science & Engineering 13 (4) (2011) 67–78, doi:10.1109/MCSE.2011.70.
[65] T. Richardson, Q. Stafford-Fraser, K.R. Wood, A. Hopper, Virtual network computing, IEEE Internet
Comput. 2 (1) (1998) 33–38.
[66] P.M. Smith, T.J. Hacker, C.X. Song, Implementing an industrial-strength academic cyberinfrastructure at
Purdue University, in: IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2008.
[67] W. Qiao, M. McLennan, R. Kennell, D. Ebert, G. Klimeck, Hub-based simulation and graphics hardware acce-
lerated visualization for nanotechnology applications, IEEE Trans. Vis. Comput. Graph. 12 (2006) 1061–1068.
[68] G. Klimeck, M. Luisier, Atomistic modeling of realistically extended semiconductor devices with
NEMO/OMEN, IEEE Comput. Sci. Eng. 12 (2010) 28–35.
[69] B.P. Haley, G. Klimeck, M. Luisier, et al., Computational nanoelectronics research and education at
nanoHUB.org, J. Comput. Electron. 8 (2009) 124–131.
[70] OpenVZ web site, http://openvz.org (accessed 17.08.10).
[71] J. Alameda, M. Christie, G. Fox, et al., The open grid computing environments collaboration: portlets
and services for science gateways, Concurr. Comput. Pract. Exper. 19 (6) (2007) 921–942.
[72] Open Grid Computing Environments web site, www.collab-ogce.org (accessed 18.10.10).
[73] Z. Guo, R. Singh, M.E. Pierce, Building the PolarGrid portal using Web 2.0 and OpenSocial, in:
SC-GCE, 2009.
[74] T. Gunarathne, C. Herath, E. Chinthaka, S. Marru, Experience with adapting a WS-BPEL runtime for
eScience workflows, in: 5th Grid Computing Environments Workshop, ACM, 2009.
[75] S. Marru, S. Perera, M. Feller, S. Martin, Reliable and Scalable Job Submission: LEAD Science Gate-
ways Testing and Experiences with WS GRAM on TeraGrid Resources, in: TeraGrid Conference, 2008.
[76] S. Perera, S. Marru, T. Gunarathne, D. Gannon, B. Plale, Application of management frameworks to
manage workflow-based systems: A case study on a large scale e-science project, in: IEEE International
Conference on Web Services, 2009.
[77] S. Perera, S. Marru, C. Herath, Workflow Infrastructure for Multi-scale Science Gateways, in: TeraGrid
Conference, 2008.
[78] T. Andrews, F. Curbera, H. Dholakia, et al., Business process execution language for web services,
version 1.1, 2003.
[79] T. Oinn, M. Addis, J. Ferris, et al., Taverna: a tool for the composition and enactment of bioinformatics
workflows, Bioinformatics, (2004).
[80] E. Deelman, J. Blythe, Y. Gil, et al., Pegasus: Mapping scientific workflows onto the grid, in: Grid Com-
puting, Springer, 2004.
[81] Apache ODE (Orchestration Director Engine) open source BPEL execution engine, http://ode.apache.org/.
[82] M. Christie, S. Marru, The LEAD Portal: a TeraGrid gateway and application service architecture,
Concurr. Comput. Pract. Exper. 19 (6) (2007) 767–781.
[83] UDDI Version 3 Specification, OASIS Standard, OASIS UDDI Specifications TC – Committee Specifi-
cations, www.oasis-open.org/committees/uddi-spec/doc/tcspecs.htm#uddiv3, 2005 (accessed 18.10.10).
[84] Programmable Web site for contributed service APIs and mashups, www.programmableweb.com/
(accessed 18.10.10).
[85] J. Gregorio, B. de hOra, RFC 5023 – The Atom Publishing Protocol, IETF Request for Comments,
http://tools.ietf.org/html/rfc5023, 2007 (accessed 18.10.10).
[86] L. Vargas, J. Bacon, Integrating Databases with Publish/Subscribe, in: 25th IEEE International Confer-
ence on Distributed Computing Systems Workshops (ICDCSW ’05), 2005.
[87] M. Aktas, Thesis. Information Federation in Grid Information Services. Indiana University, http://grids
.ucs.indiana.edu/ptliupages/publications/MehmetAktasThesis.pdf, 2007.
[88] M.S. Aktas, G.C. Fox, M. Pierce, A federated approach to information management in grids, J. Web Serv. Res.
7 (1) (2010) 65–98, http://grids.ucs.indiana.edu/ptliupages/publications/JWSR-PaperRevisedSubmission529-
Proofread.pdf.
[89] M.S. Aktas, M. Pierce, High-performance hybrid information service architecture, Concurr. Comput.
Pract. Exper. 22 (15) (2010) 2095–2123.
[90] E. Deelman, G. Singh, M.P. Atkinson, et al., Grid based metadata services, in: 16th International Conference on Scientific and Statistical Database Management (SSDBM ’04), Santorini, Greece, 2004.
[91] C.K. Baru, The SDSC Storage Resource Broker, in: I. Press, (Ed.), CASCON ’98 Conference, Toronto, 1998.
[92] G. Singh, S. Bharathi, A. Chervenak, et al., A Metadata Catalog Service for Data Intensive Applications,
in: 2003 ACM/IEEE Conference on Supercomputing, Conference on High Performance Networking and
Computing, ACM Press, 15–21 November 2003, p. 17.
[93] L. Chervenak, et al., Performance and Scalability of a Replica Location Service, in: 13th IEEE Interna-
tional Symposium on High Performance Distributed Computing (HPDC 13), IEEE Computer Society,
Washington, DC, 2004, pp. 182–191.
[94] B. Koblitz, N. Santos, V. Pose, The AMGA metadata service, J. Grid Comput. 6 (1) (2007) 61–76.
[95] C.A. Goble, D. De Roure, The Semantic Grid: Myth busting and bridge building, in: 16th European
Conference on Artificial Intelligence (ECAI-2004), Valencia, Spain, 2004.
[96] O. Corcho, et al., An Overview of S-OGSA: A Reference Semantic Grid Architecture, in: Journal of
Web Semantics: Science, Services and Agents on the World Wide Web, 2006, pp. 102–115.
[97] MyGrid, www.mygrid.org.uk/.
[98] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in: Sixth Symposium
on Operating Systems Design and Implementation, 2004, pp. 137–150.
[99] M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly, Dryad: Distributed Data-Parallel Programs from Sequential
Building Blocks, in: ACM SIGOPS Operating Systems Review, ACM Press, Lisbon, Portugal, 2007.
[100] S. Ghemawat, The Google File System, in: 19th ACM Symposium on Operating System Principles, 2003,
pp. 20–43.
[101] D. Thain, T. Tannenbaum, M. Livny, Distributed computing in practice: the Condor experience, Concurr.
Comput. Pract. Exper. 17 (2–4) (2005) 323–356.
[102] D. Gelernter, Generative communication in Linda, in: ACM Transactions on Programming Languages
and Systems, 1985, pp. 80–112.
[103] E. Freeman, S. Hupfer, K. Arnold, JavaSpaces: Principles, Patterns, and Practice, Addison-Wesley, 1999.
[104] F. Chang, et al., BigTable: A Distributed Storage System for Structured Data, in: OSDI 2006, Seattle,
pp. 205–218.
[105] G. De Candia, et al., Dynamo: Amazon’s highly available key-value store, in: SOSP, Stevenson, WA,
pp. 205–219.
[106] R. Van Renesse, K. Birman, W. Vogels, Astrolabe: A robust and scalable technology for distributed system
monitoring, management, and data mining, in: ACM Transactions on Computer Systems, 2003, pp. 164–206.
[107] J. Yu, R. Buyya, A taxonomy of workflow management systems for grid computing, in: Technical
Report, GRIDS-TR-2005-1, Grid Computing and Distributed Systems Laboratory, University of
Melbourne, Australia, 2005.
[108] I.J. Taylor, E. Deelman, D.B. Gannon, M. Shields, Workflows for e-Science: Scientific Workflows for
Grids, Springer, 2006.
[109] Z. Zhao, A. Belloum, M. Bubak, Editorial: Special section on workflow systems and applications in
e-Science, Future Generation Comp. Syst. 25 (5) (2009) 525–527, http://dx.doi.org/10.1016/j.future
.2008.10.011.
[110] E. Deelman, D. Gannon, M. Shields, I. Taylor, Workflows and e-Science: an overview of workflow sys-
tem features and capabilities, Future Generation Comp. Syst. 25 (5) (2009) 528–540, doi: http://dx.doi
.org/10.1016/j.future.2008.06.012.
[111] Workflow Management Consortium, www.wfmc.org/.
[112] R. Allen, Workflow: An Introduction, Workflow Handbook. Workflow Management Coalition, 2001.
[113] N. Carriero, D. Gelernter, Linda in context, Commun. ACM 32 (4) (1989) 444–458.
[114] A. Beguelin, J. Dongarra, G.A. Geist, HeNCE: A User’s Guide, Version 2.0, www.netlib.org/hence/
hence-2.0-doc-html/hence-2.0-doc.html.
[115] C. Upson, T. Faulhaber Jr., D.H. Laidlaw, et al., The application visualization system: a computational
environment for scientific visualization, IEEE Comput. Graph. Appl., (1989) 30–42.
[116] J. Rasure, S. Kubica, The Khoros application development environment, in: Khoral Research Inc., Albu-
querque, New Mexico, 1992.
[117] A. Hoheisel, User tools and languages for graph-based Grid workflows, Concurr. Comput. Pract. Exper.
18 (10) (2006) 1101–1113, http://dx.doi.org/10.1002/cpe.v18:10.
[118] Z. Guan, F. Hernandez, P. Bangalore, et al., Grid-Flow: A Grid-enabled scientific workflow system with a
Petri-net-based interface, Concurr. Comput. Pract. Exper. 18 (10) (2006) 1115–1140, http://dx.doi.org/
10.1002/cpe.v18:10.
[119] M. Kosiedowski, K. Kurowski, C. Mazurek, J. Nabrzyski, J. Pukacki, Workflow applications in GridLab and
PROGRESS projects, Concurr. Comput. Pract. Exper. 18 (10) (2006) 1141–1154, http://dx.doi.org/10.1002/
cpe.v18:10.
[120] LabVIEW Laboratory Virtual Instrumentation Engineering Workbench, http://en.wikipedia.org/wiki/
LabVIEW.
[121] LabSoft LIMS laboratory information management system, www.labsoftlims.com/.
[122] LIMSource Internet LIMS resource, http://limsource.com/home.html.
[123] InforSense Business Intelligence platform, www.inforsense.com/products/core_technology/inforsense_
platform/index.html.
[124] Pipeline Pilot scientific informatics platform from Accelrys, http://accelrys.com/products/pipeline-pilot/.
[125] OASIS Web Services Business Process Execution Language Version 2.0 BPEL, http://docs.oasis-open
.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.html.
[126] F. Curbera, R. Khalaf, W.A. Nagy, S. Weerawarana, Implementing BPEL4WS: The architecture of a
BPEL4WS implementation, Concurr. Comput. Pract. Exper. 18 (10) (2006) 1219–1228, http://dx.doi.org/
10.1002/cpe.v18:10.
[127] ActiveBPEL Open Source workflow engine, www.activebpel.org/.
[128] F. Leyman, Choreography for the Grid: Towards fitting BPEL to the resource framework, Concurr.
Comput. Pract. Exper. 18 (10) (2006) 1201–1217, http://dx.doi.org/10.1002/cpe.v18:10.
[129] R. Barga, D. Guo, J. Jackson, N. Araujo, Trident: a scientific workflow workbench, in: Tutorial eScience
Conference, Indianapolis, 2008.
[130] Microsoft, Project Trident: A Scientific Workflow Workbench, http://tridentworkflow.codeplex.com/ and
http://research.microsoft.com/en-us/collaboration/tools/trident.aspx.
[131] The forecast before the storm. [iSGTW International Science Grid This Week], www.isgtw.org/?
pid=1002719, 2010.
[132] C. Goble, Curating services and workflows: the good, the bad and the ugly, a personal story in the small,
in: European Conference on Research and Advanced Technology for Digital Libraries, 2008.
[133] Linked Environments for Atmospheric Discovery II (LEAD II), http://pti.iu.edu/d2i/leadII-home.
[134] XBaya workflow composition tool, www.collab-ogce.org/ogce/index.php/XBaya.
[135] XBaya integration with OGCE Open Grid Computing Environments Portal, www.collab-ogce.org/ogce/
index.php/XBaya.
[136] V. Bhat, M. Parashar, Discover middleware substrate for integrating services on the grid, in: Proceedings
of the 10th International Conference on High Performance Computing (HiPC 2003), Lecture Notes in
Computer Science. Springer-Verlag, Hyderabad, India, 2003.
[137] Z. Zhao, A. Belloum, C.D. Laat, P. Adriaans, B. Hertzberger, Distributed execution of aggregated multi-
domain workflows using an agent framework, in: IEEE Congress on Services (Services 2007), 2007.
[138] Open source scripting workflow supporting the many task execution paradigm, www.ci.uchicago.edu/swift/.
[139] L. Ramakrishnan, B. Plale, A multi-dimensional classification model for scientific workflow characteristics,
in: 1st International Workshop on Workflow Approaches to New Data-Centric Science, Indianapolis, 2010.
[140] Kepler Open Source Scientific Workflow System, http://kepler-project.org.
[141] B. Ludäscher, I. Altintas, C. Berkley, et al., Scientific workflow management and the Kepler system,
Concurr. Comput. Pract. Exper. 18 (10) (2006) 1039–1065, http://dx.doi.org/10.1002/cpe.v18:10.
[142] T. McPhillips, S. Bowers, D. Zinn, B. Ludäscher, Scientific workflow design for mere mortals, Future
Generation Comp. Syst. 25 (5) (2009) 541–551, http://dx.doi.org/10.1016/j.future.2008.06.013.
[143] Triana, Triana Open Source Problem Solving Environment, www.trianacode.org/index.html (accessed
18.10.10).
[144] I. Taylor, M. Shields, I. Wang, A. Harrison, in: I. Taylor, et al., (Eds.), Workflows for e-Science, Springer,
2007, pp. 320–339.
[145] D. Churches, G. Gombas, A. Harrison, et al., Programming scientific and distributed workflow with Triana
services, Concurr. Comput. Pract. Exper. 18 (10) (2006) 1021–1037, http://dx.doi.org/10.1002/cpe.v18:10.
[146] Pegasus Workflow Management System, http://pegasus.isi.edu/.
[147] T. Oinn, M. Greenwood, M. Addis, et al., Taverna: Lessons in creating a workflow environment for the
life sciences, Concurr. Comput. Pract. Exper. 18 (10) (2006) 1067–1100, http://dx.doi.org/10.1002/cpe
.v18:10.
[148] myGrid multi-institutional, multi-disciplinary research group focusing on the challenges of eScience,
www.mygrid.org.uk/.
[149] OMII UK Software Solutions for e-Research, www.omii.ac.uk/index.jhtml.
[150] Collaborative workflow social networking site, www.myexperiment.org/.
[151] H. Gadgil, G. Fox, S. Pallickara, M. Pierce, Managing grid messaging middleware, in: Challenges of
Large Applications in Distributed Environments (CLADE), 2006, pp. 83–91.
[152] HPSearch, Scripting environment for managing streaming workflows and their messaging-based communication, www.hpsearch.org/, 2005 (accessed 18.10.10).
[153] C. Herath, B. Plale, Streamflow, in: 10th IEEE/ACM International Conference on Cluster, Cloud and
Grid Computing, 2010.
[154] Many Task Computing Paradigm, http://en.wikipedia.org/wiki/Many-task_computing.
[155] D. Bhatia, V. Burzevski, M. Camuseva, et al., WebFlow: A visual programming paradigm for web/Java
based coarse grain distributed computing, Concurr. Comput. Pract. Exper. 9 (6) (1997) 555–577.
[156] IBM WebSphere sMash Web 2.0 Workflow system, IBM DeveloperWorks, www.ibm.com/developerworks/websphere/zones/smash/ (accessed 19.10.10).
[157] Common Component Architecture CCA Forum, www.cca-forum.org/.
[158] D. Gannon, S. Krishnan, L. Fang, et al., On building parallel & grid applications: component technology and
distributed services, Cluster Comput. 8 (4) (2005) 271–277, http://dx.doi.org/10.1007/s10586-005-4094-2.
[159] E. Deelman, T. Kosar, C. Kesselman, M. Livny, What makes workflows work in an opportunistic envir-
onment? Concurr. Comput. Pract. Exper. 18 (10) (2006), http://dx.doi.org/10.1002/cpe.v18:10.
[160] Condor home page, www.cs.wisc.edu/condor/.
[161] Karajan parallel scripting language, http://wiki.cogkit.org/index.php/Karajan.
[162] GridAnt extension of the Apache Ant build tool residing in the Globus COG kit, www.gridworkflow.org/
snips/gridworkflow/space/GridAnt.
[163] H. Chivers, J. McDermid, Refactoring service-based systems: How to avoid trusting a workflow ser-
vice, Concurr. Comput. Pract. Exper. 18 (10) (2006) 1255–1275, http://dx.doi.org/10.1002/cpe.
v18:10.
[164] S. Weerawarana, F. Curbera, F. Leymann, T. Storey, D.F. Ferguson. Web Services Platform Architecture:
SOAP, WSDL, WS-Policy, WS-Addressing, WS-BPEL, WS-Reliable Messaging, and More, Prentice
Hall, 2005.
[165] M. Atkinson, D. DeRoure, A. Dunlop, et al., Web Service Grids: An evolutionary approach, Concurr.
Comput. Pract. Exper. 17 (2005) 377–389, http://dx.doi.org/10.1002/cpe.936.
[166] T. Segaran, C. Evans, J. Taylor, Programming the Semantic Web, O’Reilly, 2009.
[167] G. Fox, Data and metadata on the semantic grid, Comput. Sci. Eng. 5 (5) (2003) 76–78.
[168] H. Gonzalez, A. Halevy, C.S. Jensen, et al., Google fusion tables: Data management, integration and col-
laboration in the cloud. International Conference on Management of Data, in: Proceedings of the 1st
ACM Symposium on Cloud Computing, ACM, Indianapolis, 2010, pp. 175–180.
HOMEWORK PROBLEMS
Problem 5.1
Discuss the strengths and drawbacks of WS-* and RESTful web services. Compare their architectural principles. Which one is the preferred mechanism for communicating with Amazon S3? Why?
Problem 5.2
Discuss advantages and shortcomings of stateless web services. How can we keep the state in REST
and standard web services?
Problem 5.3
This assignment requires you to combine the queuing and publish-subscribe paradigms within a
single application. Organizations A and B are two businesses that use queuing for B2B transactions. Every transaction is stored (prior to forwarding) and has a 128-bit UUID identifier associated with it. Within the organization, messages are delivered using publish-subscribe. Create
five subscribers (sales, marketing, audit, packaging, and finance) within each organization; these
subscribers do not have to log the messages again since there is a copy of that message already
available. If the assignment is done in Java, it is prescribed that the Java Message Service (JMS) be used.
Problem 5.4
Develop an application that uses publish-subscribe to communicate between entities developed in
different languages. Producers should be written in C++, and consumers should be in Java. Distributed components communicate with one another using the AMQP wire format.
Problem 5.5
Map the terms and abbreviations on the left to the best suited descriptions on the right.
1. _____ S-OGSA (a) A set of specifications defining a way to describe, publish, and discover
information about web services
2. _____ REST (b) The first language developed for the Semantic web, using XML to represent
information about resources on the web
3. _____ UDDI (c) An XML vocabulary to describe RESTful web services
4. _____ MCS (d) A family of OASIS-published specifications which provide a set of operations that
web services may implement to become stateful
5. _____ WSDL (e) A catalog service developed by the Globus Alliance providing hierarchical
organization of the metadata
6. _____ RDF (f) Composite Web 2.0 applications which combine capabilities from existing
web-based applications
7. _____ WADL (g) A software architecture for distributed hypermedia systems based on HTTP
8. _____ WSRF (h) An expressive ontology language that extends RDF schema
9. _____ Mashups (i) A reference architecture for semantic-based grids by extending Open Grid
Services Architecture
10. _____ OWL-S (j) An XML-based language to describe the set of operations and messages
supported by a web service
Problem 5.6
Combine Problems 5.3 and 5.4 above, where organization A is developed in Java, while
organization B is developed in C++. Use AMQP format in these communications.
Problem 5.7
Describe possible uses of enterprise buses in grid architectures.
Problem 5.8
Run NaradaBrokering or an equivalent system selected from Table 5.8 in an available cloud. Use it
to connect two Android smartphones exchanging messages and pictures snapped by the phone.
Problem 5.9
Install the OGCE toolkit using instructions at the web site: www.collab-ogce.org/ogce/index.php/
Portal_download. Then select one or more from the range of projects given at another web site:
www.collab-ogce.org/ogce/index.php/Tutorials#Screen_Capture_Demos.
Problem 5.10
Compare Google App Engine, Drupal, HUBzero, and OGCE approaches to science gateways.
Discuss in the context of portals you are familiar with.
Problem 5.11
What are the three main categories of information that a service registry contains? Which entities
map these categories of information in UDDI?
Problem 5.12
Outline the main components of a real-life parallel or distributed job execution environment. You
can choose any example platforms such as clusters in Chapter 2, clouds in Chapter 4, grids in
Chapter 7, and P2P networks in Chapter 8.
Problem 5.13
Describe the role of a distributed file system in a job execution environment such as MapReduce in
a large-scale cloud system.
Problem 5.14
Resource Description Framework (RDF) and the OWL web ontology language are two technologies
for the Semantic web. Describe the relationship between them.
Problem 5.15
Why are grids popular in academic applications while clouds dominate commercial use? Use case
studies to compare their strengths and weaknesses.
Problem 5.16
Study more details from [138,154]. Implement the Swift application shown in Figure 5.20 on an
available cloud platform.
Problem 5.17
Use Swift to implement the word count problem described in MapReduce (Details of MapReduce
are given in Section 6.2.2).
Problem 5.18
Use Taverna to build a workflow linking a module to extract comments on grids and clouds (or
your favorite topic) from Twitter, Flickr, or Facebook. The social networking APIs can be found at
www.programmableweb.com.
CHAPTER 6
Cloud Programming and Software Environments
SUMMARY
This chapter is devoted to programming real cloud platforms. MapReduce, BigTable, Twister,
Dryad, DryadLINQ, Hadoop, Sawzall, and Pig Latin are introduced and assessed. We use concrete
examples to explain the implementation and application requirements in the cloud. We review core
service models and access technologies. Cloud services provided by Google App Engine (GAE),
Amazon Web Service (AWS), and Microsoft Windows Azure are illustrated by example applications.
In particular, we illustrate how to program GAE, AWS EC2, S3, and EBS. We review the open-source Eucalyptus, Nimbus, and OpenNebula systems and the startup Manjrasoft's Aneka system for cloud computing.
Physical or virtual computing platform: The cloud environment consists of some physical or virtual platforms. Virtual platforms have unique capabilities to provide isolated environments for different applications and users.
Massive data storage service, distributed file system: With large data sets, cloud data storage services provide large disk capacity and the service interfaces that allow users to put and get data. The distributed file system offers a massive data storage service. It can provide interfaces similar to those of local file systems.
Massive database storage service: Some distributed file systems are sufficient to provide the underlying storage service, but application developers need to save data in a more semantic way. Just like DBMS in the traditional software stack, massive database storage services are needed in the cloud.
Massive data processing method and programming model: Cloud infrastructure provides thousands of computing nodes for even a very simple application. Programmers need to be able to harness the power of these machines without considering tedious infrastructure management issues such as handling network failures or scaling the running code to use all the computing facilities provided by the platforms.
Workflow and data query language support: The programming model offers abstraction of the cloud infrastructure. Similar to the SQL language used for database systems, in cloud computing, providers have built workflow languages as well as data query languages to support better application logic.
Programming interface and service deployment: Web interfaces or special APIs are required for cloud applications: J2EE, PHP, ASP, or Rails. Cloud applications can use Ajax technologies to improve the user experience while using web browsers to access the functions provided. Each cloud provider opens its programming interface for accessing the data stored in massive storage.
Runtime support: Runtime support is transparent to users and their applications. Support includes distributed monitoring services, a distributed task scheduler, as well as distributed locking and other services. They are critical in running cloud applications.
Support services: Important support services include data and computing services. For example, clouds offer rich data services and interesting data-parallel execution models like MapReduce.
6.1.2.1 Workflow
As introduced in Section 5.5, workflow has spawned many projects in the United States and Europe. Pegasus, Taverna, and Kepler are popular, but no choice has gained wide acceptance. There are commercial systems such as Pipeline Pilot, AVS (dated), and the LIMS environments. A recent entry is Trident [2] from Microsoft Research, which is built on top of the Windows Workflow Foundation. If Trident runs on Azure or just any old Windows machine, it will run workflow proxy services on external (Linux) environments. Workflow links multiple cloud and noncloud services in
real applications on demand.
Table 6.3 Traditional Features in Cluster, Grid, and Parallel Computing Environments
Cluster management: ROCKS and packages offering a range of tools to make it easy to bring up
clusters
Data management: Includes metadata support such as RDF triple stores (a Semantic Web success, and can be built on MapReduce as in SHARD); SQL and NoSQL are also included
Grid programming environment: Varies from link-together services as in Open Grid Services Architecture
(OGSA) to GridRPC (Ninf, GridSolve) and SAGA
OpenMP/threading: Can include parallel compilers such as Cilk; roughly shared memory technologies. Even
transactional memory and fine-grained data flow come here
Portals: Can be called (science) gateways and see an interesting change in technology from portlets to
HUBzero and now in the cloud: Azure Web Roles and GAE
Scalable parallel computing environments: MPI and associated higher level concepts including the ill-fated High Performance Fortran (HPF), PGAS (not successful but not disgraced), HPCS languages (X10, Fortress, Chapel), patterns (including the Berkeley dwarves), and functional languages such as F# for distributed memory
Virtual organizations: From specialized grid solutions to popular Web 2.0 capabilities such as Facebook
Workflow: Supports workflows that link job components either within or between grids and clouds; related to LIMS (Laboratory Information Management Systems).
the national cyberinfrastructure we can expect that high-bandwidth links will be made available between
clouds and TeraGrid. The special structure of cloud data with blocks (in Azure blobs) and tables could
allow high-performance parallel algorithms, but initially, simple HTTP mechanisms are used to
transport data [3–5] on academic systems/TeraGrid and commercial clouds.
6.1.3.3 DPFS
This covers the support of file systems such as Google File System (MapReduce), HDFS (Hadoop), and
Cosmos (Dryad) with compute-data affinity optimized for data processing. It could be possible to link
DPFS to basic blob and drive-based architecture, but it’s simpler to use DPFS as an application-centric
storage model with compute-data affinity and blobs and drives as the repository-centric view.
In general, data transport will be needed to link these two data views. It seems important to consider
this carefully, as DPFS file systems are precisely designed for efficient execution of data-intensive
applications. However, the importance of DPFS for linkage with Amazon and Azure is not clear, as
these clouds do not currently offer fine-grained support for compute-data affinity. We note here that
Azure Affinity Groups are one interesting capability [7]. We expect that initially blobs, drives, tables,
and queues will be the areas where academic systems will most usefully provide a platform similar to
Azure (and Amazon). Note the HDFS (Apache) and Sector (UIC) projects in this area.
deployment clearly simplifies one's system. For N platform features, one only needs N services, whereas the number of possible images with alternative approaches is a prohibitive 2^N (for example, 10 features require only 10 services, but up to 2^10 = 1,024 distinct images).
fault-tolerant, distributed fashion. Web roles provide an interesting approach to portals. GAE is
largely aimed at web applications, whereas science gateways are successful in TeraGrid.
6.1.4.2 MapReduce
There has been substantial interest in “data parallel” languages largely aimed at loosely coupled
computations which execute over different data samples. The language and runtime generate
and provide efficient execution of “many task” problems that are well known as successful grid
applications. However, MapReduce, summarized in Table 6.5, has several advantages over
traditional implementations for many task problems, as it supports dynamic execution, strong
fault tolerance, and an easy-to-use high-level interface. The major open source/commercial
MapReduce implementations are Hadoop [23] and Dryad [24–27] with execution possible with
or without VMs.
Hadoop is currently offered by Amazon, and we expect Dryad to be available on Azure.
A prototype Azure MapReduce was built at Indiana University, which we will discuss shortly. On
FutureGrid, we already intend to support Hadoop, Dryad, and other MapReduce approaches, including Twister [29] support for iterative computations seen in many data-mining and linear algebra
applications. Note that this approach has some similarities with Cloudera [35] which offers a variety
of Hadoop distributions including Amazon and Linux. MapReduce is closer to broad deployment
than other cloud platform features, as there is quite a bit of experience with Hadoop and Dryad outside clouds.
6.1.4.4 SaaS
Services are used in a similar fashion in commercial clouds and most modern distributed systems.
We expect users to package their programs wherever possible, so no special support is needed to
enable SaaS. We already discussed in Section 6.1.3 why “Systems software as a service” was an
interesting idea in the context of a database service. We desire a SaaS environment that provides
many useful tools to develop cloud applications over large data sets. In addition to the technical
features, such as MapReduce, BigTable, EC2, S3, Hadoop, AWS, GAE, and WebSphere2, we need
protection features that may help us to achieve scalability, security, privacy, and availability.
standpoint, it increases throughput and resource utilization. Running a parallel program on a distributed
computing system, however, could be a very complicated process. Therefore, to place the complexity in
perspective, the data flow of running a typical parallel program on a distributed system is further explained in this chapter.
Parallel and distributed programming paradigms or models are offered to abstract many parts of this data flow from users. In other words, these models aim to provide users with an abstraction layer that hides the implementation details of the data flow which users formerly had to code themselves. Therefore, simplicity of writing parallel programs is an important metric for parallel and distributed programming paradigms. Other motivations behind parallel and distributed programming models are (1) to improve the productivity of programmers, (2) to decrease programs' time to market, (3) to leverage underlying resources more efficiently, (4) to increase system throughput, and (5) to support higher levels of abstraction [40].
MapReduce, Hadoop, and Dryad are three of the most recently proposed parallel and distributed
programming models. They were developed for information retrieval applications but have been
shown to be applicable for a variety of important applications [41]. Further, the loose coupling of
components in these paradigms makes them suitable for VM implementation and leads to much better fault tolerance and scalability for some applications than traditional parallel computing models
such as MPI [42–44].
FIGURE 6.1
MapReduce framework: input data flows through the Map and Reduce functions to generate the output result under the control flow of the MapReduce software library. Special user interfaces are used to access the Map
and Reduce resources.
the “value” part of the data, (key, value), is the actual data, and the “key” part is only used by the
MapReduce controller to control the data flow [37].
(Figure: input (key, value) pairs pass through the Map, Sort, and Group stages, producing (key, [val, val, ...]) groups that the Reduce stage turns into output pairs, all coordinated by the MapReduce library.)
FIGURE 6.2
MapReduce logical data flow in 5 processing stages over successive (key, value) pairs.
FIGURE 6.3
The data flow of a word-count problem using the MapReduce functions (Map, Sort, Group, and Reduce) in cascaded operations.
MapReduce framework forms these groups by first sorting the intermediate (key, value) pairs and
then grouping values with the same key. It should be noted that the data is sorted to simplify the
grouping process. The Reduce function processes each (key, [set of values]) group and produces a set
of (key, value) pairs as output.
To clarify the data flow in a sample MapReduce application, word count, one of the well-known MapReduce problems (counting the number of occurrences of each word in a collection of documents), is presented here. Figure 6.3 demonstrates the data flow of the word-count problem for
a simple input file containing only two lines as follows: (1) “most people ignore most poetry” and
(2) “most poetry ignores most people.” In this case, the Map function simultaneously produces a
number of intermediate (key, value) pairs for each line of content so that each word is the intermediate key with 1 as its intermediate value; for example, (ignore, 1). Then the MapReduce library
collects all the generated intermediate (key, value) pairs and sorts them to group the 1’s for identical
words; for example, (people, [1,1]). Groups are then sent to the Reduce function in parallel so that
it can sum up the 1 values for each word and generate the actual number of occurrence for each
word in the file; for example, (people, 2).
Then the MapReduce library collects all the produced intermediate (key, value) pairs from all input
(key, value) pairs, and sorts them based on the "key" part. It then groups the values of all occurrences of the same key. Finally, the Reduce function is applied in parallel to each group producing
the collection of values as output as illustrated here:
Reduce Function
(key2, List(val2)) → List(val2)    (6.2)
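The logical data flow just described can be simulated in a few lines of Python. This is only a single-process sketch of what the MapReduce library does (the real library runs the Map and Reduce calls on many workers in parallel), using the word-count example of Figure 6.3.

from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: emit an intermediate (word, 1) pair for every word in a line of input.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: sum the 1s collected for one word, producing (word, count).
    return (key, sum(values))

lines = ["most people ignore most poetry",
         "most poetry ignores most people"]

# The library collects the intermediate pairs produced by every Map call ...
intermediate = [pair for line in lines for pair in map_fn(line)]
# ... sorts them on the key part, groups values that share a key ...
intermediate.sort(key=itemgetter(0))
groups = [(key, [v for _, v in group])
          for key, group in groupby(intermediate, key=itemgetter(0))]
# ... and applies Reduce to each (key, [values]) group, as in Equation (6.2).
output = [reduce_fn(key, values) for key, values in groups]
print(output)
# [('ignore', 1), ('ignores', 1), ('most', 4), ('people', 2), ('poetry', 2)]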
(Figure: each Map worker partitions its intermediate pairs into regions 1 to 3, and Reduce worker i reads region i from every Map worker.)
FIGURE 6.4
Use of MapReduce partitioning function to link the Map and Reduce workers.
FIGURE 6.5
Data flow implementation of many functions in the Map workers and in the Reduce workers through multiple
sequences of partitioning, combining, synchronization and communication, sorting and grouping, and reduce
operations.
9. Communication Reduce worker i, already notified of the location of region i of all map workers,
uses a remote procedure call to read the data from the respective region of all map workers. Since
all reduce workers read the data from all map workers, all-to-all communication among all map
and reduce workers, which incurs network congestion, occurs in the network. This issue is one of
the major bottlenecks in increasing the performance of such systems [50–52]. A data transfer
module was proposed to schedule data transfers independently [55].
Steps 10 and 11 correspond to the reduce worker domain:
10. Sorting and Grouping When the process of reading the input data is finalized by a reduce
worker, the data is initially buffered in the local disk of the reduce worker. Then the reduce
worker groups intermediate (key, value) pairs by sorting the data based on their keys,
followed by grouping all occurrences of identical keys. Note that the buffered data is sorted
and grouped because the number of unique keys produced by a map worker may be more
than R regions in which more than one key exists in each region of a map worker (see
Figure 6.4).
11. Reduce function The reduce worker iterates over the grouped (key, value) pairs, and for each
unique key, it sends the key and corresponding values to the Reduce function. Then this function
processes its input data and stores the output results in predetermined files in the user’s program.
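The region numbers used in steps 9 and 10 (and shown in Figure 6.4) come from the partitioning function applied to each intermediate key on the map side; in the original MapReduce design this is typically a hash of the key modulo the number of reduce workers R. A minimal sketch, with a stable hash chosen here only for illustration:

import hashlib

def partition(key, R):
    # Assign an intermediate key to one of R regions (one region per reduce worker).
    # A stable hash is used so that every map worker places a given key in the same region.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % R

R = 3
for key in ["most", "people", "poetry"]:
    print(key, "-> region", partition(key, R))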
To better clarify the interrelated data flow and control flow in the MapReduce framework, Figure 6.6 shows the exact order of processing control in such a system, in contrast with the data flow shown in Figure 6.5.
FIGURE 6.6
Control flow implementation of the MapReduce functionalities in Map workers and Reduce workers (running
user programs) from input files to the output files under the control of the master user program.
(Courtesy of Yahoo! Pig Tutorial [54] )
• MPI does not transfer all data from node to node, but just the amount needed to update
information. We can call the MPI flow δ flow and the MapReduce flow full data flow.
The same phenomenon is seen in all “classic parallel” loosely synchronous applications which
typically exhibit an iteration structure over compute phases followed by communication phases. We
can address the performance issues with two important changes:
1. Stream information between steps without writing intermediate steps to disk.
2. Use long-running threads or processors to communicate the δ (between iterations) flow.
These changes will lead to major performance increases at the cost of poorer fault tolerance and reduced ability to support dynamic changes such as in the number of available nodes. This concept [42] has
been investigated in several projects [34,57–59] while the direct idea of using MPI for MapReduce
applications is investigated in [44]. The Twister programming paradigm and its implementation
architecture at run time are illustrated in Figure 6.7(a, b). In Example 6.1, we summarize Twister [60]
whose performance results for K means are shown in Figure 6.8 [55,56], where Twister is much faster
than traditional MapReduce. Twister distinguishes the static data which is never reloaded from the
dynamic δ flow that is communicated.
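The iteration structure that Twister exploits can be illustrated with a plain Python sketch (this is not the Twister API): the large static data is loaded once and kept by long-running workers, while only the small δ flow, here the K-means centroids, is recomputed and communicated on each iteration.

import random

def assign_and_sum(points, centroids):
    # Map/Combine over the cached static data: partial sums and counts per centroid.
    sums = {i: ([0.0, 0.0], 0) for i in range(len(centroids))}
    for x, y in points:
        i = min(range(len(centroids)),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2)
        (sx, sy), n = sums[i]
        sums[i] = ([sx + x, sy + y], n + 1)
    return sums

def update_centroids(sums, old):
    # Reduce: the new centroids are the small delta flow broadcast for the next iteration.
    return [(s[0][0] / s[1], s[0][1] / s[1]) if s[1] else old[i]
            for i, s in sums.items()]

points = [(random.random(), random.random()) for _ in range(1000)]  # static data, loaded once
centroids = list(points[:3])                                        # initial delta flow
for _ in range(10):                                                 # the Iterate ... Close() loop
    centroids = update_centroids(assign_and_sum(points, centroids), centroids)
print(centroids)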
(Figure: (a) a Twister computation configures static data once and then iterates over Map(key, value), Reduce, and Combine(key, list<value>) operations before Close(); (b) at run time, a user driver program coordinates MapReduce daemons on worker nodes that host the Map and Reduce workers and perform data reads and writes.)
FIGURE 6.7
Twister: An iterative MapReduce programming paradigm for repeated MapReduce executions.
Example 6.1 Performance of K Means Clustering in MPI, Twister, Hadoop, and DryadLINQ
The MapReduce approach leads to fault tolerance and flexible scheduling, but for some applications the performance degradation compared to MPI is serious, as illustrated in Figure 6.8 for a simple parallel K means
clustering algorithm. Hadoop and DryadLINQ are more than a factor of 10 slower than MPI for the largest data
set, and perform even more poorly for smaller data sets. One could use many communication mechanisms in
Iterative MapReduce, but Twister chose a publish-subscribe network using a distributed set of brokers, as
described in Section 5.2 with similar performance achieved with ActiveMQ and NaradaBrokering.
The Map-Reduce pair is iteratively executed in long-running threads. We compare in Figure 6.9 the different thread and process structures of four parallel programming paradigms, namely Hadoop,
Dryad, Twister (also called MapReduce++), and MPI. Note that Dryad can use pipes and avoids
costly disk writing according to the original papers [26,27].
(Plot: run time on a logarithmic scale versus the number of 2D data points in millions (0.512 to 20.48), run using 256 CPU cores, for Hadoop, DryadLINQ, Twister, and MPI.)
FIGURE 6.8
Performance of K means clustering for MPI, Twister, Hadoop, and DryadLINQ.
FIGURE 6.9
Thread and process structure of four parallel programming paradigms at run time.
(Plot: run time on a logarithmic scale versus the number of URLs in billions (0.27 to 1.4) for Hadoop and Twister.)
FIGURE 6.10
Performance of Hadoop and Twister on ClueWeb data set using 256 processing cores.
Example 6.2 Performance of Hadoop and Twister on ClueWeb Data Set over 256 Processor
Cores
Important research areas for iterative MapReduce include fault tolerance and scalable approaches to communication. Iterative algorithms are also found in information retrieval. Figure 6.10 shows [55] the famous PageRank algorithm (with a kernel of iterative matrix-vector multiplication) run on the public ClueWeb data sets; independent of data size, Twister is about 20 times faster than Hadoop, as revealed by the gap between the upper and lower curves in Figure 6.10.
all DataNodes. Each DataNode, usually one per node in a cluster, manages the storage attached
to the node. Each DataNode is responsible for storing and retrieving its file blocks [61].
HDFS Features: Distributed file systems have special requirements, such as performance,
scalability, concurrency control, fault tolerance, and security requirements [62], to operate
efficiently. However, because HDFS is not a general-purpose file system and only serves specific types of applications, it does not need to meet all the requirements of a general distributed file
system. For example, security has never been supported for HDFS systems. The following
discussion highlights two important characteristics of HDFS to distinguish it from other generic
distributed file systems [63].
HDFS Fault Tolerance: One of the main aspects of HDFS is its fault tolerance characteristic.
Since Hadoop is designed to be deployed on low-cost hardware by default, a hardware failure in
this system is considered to be common rather than an exception. Therefore, Hadoop considers
the following issues to fulfill reliability requirements of the file system [64]:
• Block replication To reliably store data in HDFS, file blocks are replicated in this system.
In other words, HDFS stores a file as a set of blocks and each block is replicated and
distributed across the whole cluster. The replication factor is set by the user and is three by
default.
• Replica placement The placement of replicas is another factor to fulfill the desired fault
tolerance in HDFS. Although storing replicas on different nodes (DataNodes) located in
different racks across the whole cluster provides more reliability, it is sometimes ignored as
the cost of communication between two nodes in different racks is relatively high in
comparison with that of different nodes located in the same rack. Therefore, sometimes
HDFS compromises its reliability to achieve lower communication costs. For example, for the default replication factor of three, HDFS stores one replica on the node where the original data is written, one replica on a different node in the same rack, and one replica on a node in a different rack, providing three copies of the data [65] (this default policy is sketched in code after this list).
• Heartbeat and Blockreport messages Heartbeats and Blockreports are periodic messages
sent to the NameNode by each DataNode in a cluster. Receipt of a Heartbeat implies that the
DataNode is functioning properly, while each Blockreport contains a list of all blocks on a
DataNode [65]. The NameNode receives such messages because it is the sole decision maker
of all replicas in the system.
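The default placement policy described in the second bullet can be sketched as a toy function; the cluster description and DataNode names below are hypothetical, and this is not the actual HDFS implementation.

import random

def place_replicas(cluster, writer_node):
    # Default HDFS-style policy for replication factor three: one replica on the
    # writer's node, one on a different node in the same rack, and one on a node
    # in a different rack. Assumes at least two racks and two nodes per rack.
    writer_rack = next(rack for rack, nodes in cluster.items() if writer_node in nodes)
    same_rack = [n for n in cluster[writer_rack] if n != writer_node]
    other_racks = [n for rack, nodes in cluster.items() if rack != writer_rack for n in nodes]
    return [writer_node, random.choice(same_rack), random.choice(other_racks)]

cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
print(place_replicas(cluster, "dn2"))   # e.g., ['dn2', 'dn3', 'dn5']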
HDFS High-Throughput Access to Large Data Sets (Files): Because HDFS is primarily
designed for batch processing rather than interactive processing, data access throughput in
HDFS is more important than latency. Also, because applications run on HDFS typically have
large data sets, individual files are broken into large blocks (e.g., 64 MB) to allow HDFS to
decrease the amount of metadata storage required per file. This provides two advantages: The
list of blocks per file will shrink as the size of individual blocks increases, and by keeping
large amounts of data sequentially within a block, HDFS provides fast streaming reads
of data.
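As a simple piece of arithmetic (not a measured HDFS figure), consider the metadata for a single 1 GB file under two block sizes:

file_size = 1024 * 1024 * 1024            # a 1 GB file
for block_size_mb in (4, 64):
    blocks = file_size // (block_size_mb * 1024 * 1024)
    print(f"{block_size_mb} MB blocks -> {blocks} block entries kept by the NameNode")
# 4 MB blocks would need 256 entries for this file; 64 MB blocks need only 16.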
HDFS Operation: The control flow of HDFS operations such as write and read can properly
highlight roles of the NameNode and DataNodes in the managing operations. In this section, the
control flow of the main operations of HDFS on files is further described to manifest the
interaction between the user, the NameNode, and the DataNodes in such systems [63].
• Reading a file To read a file in HDFS, a user sends an “open” request to the NameNode to
get the location of file blocks. For each file block, the NameNode returns the address of a
set of DataNodes containing replica information for the requested file. The number of
addresses depends on the number of block replicas. Upon receiving such information, the
user calls the read function to connect to the closest DataNode containing the first block of
the file. After the first block is streamed from the respective DataNode to the user, the
established connection is terminated and the same process is repeated for all blocks of the
requested file until the whole file is streamed to the user.
• Writing to a file To write a file in HDFS, a user sends a “create” request to the NameNode
to create a new file in the file system namespace. If the file does not exist, the NameNode
notifies the user, who can then start writing data to the file by calling the write function.
The first block of the file is written to an internal queue termed the data queue while a data
streamer monitors its writing into a DataNode. Since each file block needs to be replicated
by a predefined factor, the data streamer first sends a request to the NameNode to get a list
of suitable DataNodes to store replicas of the first block.
The streamer then stores the block in the first allocated DataNode. Afterward, the block is forwarded
to the second DataNode by the first DataNode. The process continues until all allocated DataNodes
receive a replica of the first block from the previous DataNode. Once this replication process is
finalized, the same process starts for the second block and continues until all blocks of the file are
stored and replicated on the file system. A simplified sketch of this write pipeline follows.
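The sketch below simulates the write pipeline in plain Python; it is not the HDFS client code, and
the block size used in the demonstration, the node names, and the block-to-node assignment rule
are illustrative assumptions.
# Simplified simulation of the HDFS write pipeline described above.
BLOCK_SIZE = 64 * 1024 * 1024           # 64 MB blocks, as in the text
REPLICATION = 3

def split_into_blocks(data, block_size):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def namenode_allocate(block_id, datanodes, replication=REPLICATION):
    """Stand-in for asking the NameNode for DataNodes to hold the replicas."""
    start = block_id % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)] for i in range(replication)]

def write_file(data, datanodes, block_size=BLOCK_SIZE):
    storage = {}
    for block_id, block in enumerate(split_into_blocks(data, block_size)):
        # The data streamer sends the block to the first DataNode, which
        # forwards it to the second, and so on down the pipeline.
        for node in namenode_allocate(block_id, datanodes):
            storage.setdefault(node, []).append(block_id)
    return storage

# Tiny demonstration with a 10-byte "block size" so that three blocks are produced.
print(write_file(b"abcdefghijklmnopqrstuvwxyz", ["dn1", "dn2", "dn3", "dn4"], block_size=10))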
FIGURE 6.11
HDFS and MapReduce architecture in Hadoop where boxes with different shadings refer to different
functional nodes applied to different blocks of data.
The Hadoop MapReduce engine has a master/slave architecture consisting of a single JobTracker as the master and a number of
TaskTrackers as the slaves (workers). The JobTracker manages the MapReduce job over a cluster
and is responsible for monitoring jobs and assigning tasks to TaskTrackers. The TaskTracker man-
ages the execution of the map and/or reduce tasks on a single computation node in the cluster.
Each TaskTracker node has a number of simultaneous execution slots, each executing either a
map or a reduce task. Slots are defined as the number of simultaneous threads supported by CPUs
of the TaskTracker node. For example, a TaskTracker node with N CPUs, each supporting M
threads, has M * N simultaneous execution slots [66]. It is worth noting that each data block is pro-
cessed by one map task running on a single slot. Therefore, there is a one-to-one correspondence
between map tasks in a TaskTracker and data blocks in the respective DataNode.
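As a concrete illustration of the map and reduce tasks that occupy these slots, the following
word-count mapper and reducer are written for Hadoop Streaming, which lets such tasks be
expressed as ordinary scripts reading standard input and writing tab-separated key/value pairs.
The file name and the streaming invocation shown in the comments are illustrative assumptions.
# wordcount_streaming.py -- a map task and a reduce task for Hadoop Streaming.
# Each map task processes one input split (typically one HDFS block); the
# framework sorts map output by key before it reaches the reducer.
import sys

def run_mapper(stream=sys.stdin):
    for line in stream:
        for word in line.split():
            print("%s\t%d" % (word, 1))

def run_reducer(stream=sys.stdin):
    current, count = None, 0
    for line in stream:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    # Invoked roughly as (paths are illustrative):
    #   hadoop jar hadoop-streaming.jar -input in -output out \
    #     -mapper "wordcount_streaming.py map" -reducer "wordcount_streaming.py reduce"
    run_mapper() if sys.argv[1:] == ["map"] else run_reducer()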
FIGURE 6.12
Data flow in running a MapReduce job at various task trackers using the Hadoop library.
The JobTracker also creates reduce tasks and assigns them to the TaskTrackers. The number of reduce tasks is
predetermined by the user, and there is no locality consideration in assigning them.
• Task execution The control flow to execute a task (either map or reduce) starts inside the
TaskTracker by copying the job JAR file to its file system. Instructions inside the job JAR file
are executed after launching a Java Virtual Machine (JVM) to run its map or reduce task.
• Task running check A task running check is performed through periodic heartbeat messages sent
to the JobTracker by the TaskTrackers. Each heartbeat notifies the JobTracker that the sending
TaskTracker is alive, and whether the sending TaskTracker is ready to run a new task.
6.2.4.1 Dryad
Dryad is more flexible than MapReduce as the data flow of its applications is not dictated/predeter-
mined and can be easily defined by users. To achieve such flexibility, a Dryad program or job is defined
by a directed acyclic graph (DAG) where vertices are computation engines and edges are communica-
tion channels between vertices. Therefore, users or application developers can easily specify arbitrary
DAGs to describe the data flows in their jobs.
Given a DAG, Dryad assigns the computational vertices to the underlying computation engines
(cluster nodes) and controls the data flow through edges (communication between cluster nodes).
Data partitioning, scheduling, mapping, synchronization, communication, and fault tolerance are
major implementation details hidden by Dryad to facilitate its programming environment. Because
the data flow of a job is arbitrary for this system, only the control flow of the runtime environment
is further explained here. As shown in Figure 6.13(a), the two main components handling the con-
trol flow of Dryad are the job manager and the name server.
In Dryad, the distributed job is represented as a DAG where each vertex is a program and edges
represent data channels. Thus, the whole job will be constructed by the application programmer
who defines the processing procedures as well as the flow of data. This logical computation graph
will be automatically mapped onto the physical nodes by the Dryad runtime. A Dryad job is con-
trolled by the job manager, which is responsible for deploying the program to the multiple nodes in
the cluster. It runs either within the computing cluster or as a process in the user’s workstation
which can access the cluster. The job manager has the code to construct the DAG as well as the
library to schedule the work running on top of the available resources. Data transfer is done via
channels without involving the job manager. Thus, the job manager should not be the performance
bottleneck. In summary, the job manager
1. Constructs a job’s communication graph (data flow graph) using the application-specific
program provided by the user.
2. Collects the information required to map the data flow graph to the underlying resources
(computation engine) from the name server.
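Such a communication graph can be pictured as a small DAG of named vertices (processing
programs) and edges (channels). The sketch below is plain Python, not the Dryad job-construction
API; the vertex names echo the grep/sed/sort/awk/perl stages of Figure 6.13(b).
# A minimal DAG sketch: vertices are processing programs, edges are channels.
from collections import deque

edges = {                       # vertex -> downstream vertices (channels)
    "grep": ["sort"],
    "sed":  ["sort"],
    "sort": ["awk"],
    "awk":  ["perl"],
    "perl": [],
}

def topological_order(edges):
    """One order in which the job manager could schedule the vertices."""
    indegree = {v: 0 for v in edges}
    for outs in edges.values():
        for v in outs:
            indegree[v] += 1
    ready = deque(v for v, d in indegree.items() if d == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in edges[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return order

print(topological_order(edges))   # e.g., ['grep', 'sed', 'sort', 'awk', 'perl']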
The cluster has a name server which is used to enumerate all the available computing resources
in the cluster. Thus, the job manager can contact the name server to get the topology of the whole
cluster and make scheduling decisions.
FIGURE 6.13
Dryad framework and its job structure, control and data flow.
(Courtesy of Isard, et al., ACM SIGOPS Operating Systems Review, 2007 [26])
A processing daemon runs in each computing node in the
cluster. The binary of the program will be sent to the corresponding processing node directly from
the job manager. The daemon can be viewed as a proxy so that the job manager can communicate
with the remote vertices and monitor the state of the computation. By gathering this information,
the name server provides the job manager with a perfect view of the underlying resources and net-
work topology. Therefore, the job manager is able to:
1. Map the data flow graph to the underlying resources.
2. Schedule all necessary communications and synchronization across the respective resources.
It also considers data and computation locality when mapping the data flow graph to the underlying
resources [26]. When the data flow graph is mapped on a set of computation engines, a light dae-
mon runs on each cluster node to run the assigned tasks. Each task is defined by the user using an
application-specific program. During runtime, the job manager communicates with each daemon
to monitor the state of the computation of the node and its communication with its preceding and
succeeding nodes. At runtime, the channels are used to transport the structured items between vertices
which represent the processing programs. There are several types of communication mechanisms for
implementing channels such as shared memory, TCP sockets, or even distributed file systems.
The execution of a Dryad job can be considered a 2D distributed set of pipes. Traditional UNIX
pipes are 1D pipes, with each node in the pipe as a single program. Dryad’s 2D distributed pipe sys-
tem has multiple processing programs in each vertex node. In this way, large-scale data can be pro-
cessed simultaneously. Figure 6.13(b) shows the Dryad 2D pipe job structure. During 2D pipe
execution, Dryad defines many operations to construct and change the DAG dynamically. The opera-
tions include creating new vertices, adding graph edges, merging two graphs, as well as handling job
input and output. Dryad also has a fault-tolerant mechanism built in. As it is built on a DAG, there
are typically two types of failures: vertex failures and channel failures, which are handled differently.
As there are many nodes in the cluster, the job manager can choose another node to re-execute the
corresponding job assigned to the failed node. In case of an edge (channel) failure, the vertex that
created the channel will be re-executed and a new channel will be created to reconnect the
corresponding nodes. Dryad provides other mechanisms, in addition to the runtime graph refinements, which are used
for improving execution performance. As a general framework, Dryad can be used in many situations,
including scripting language support, map-reduce programming, and SQL service integration.
FIGURE 6.14
LINQ-expression execution in DryadLINQ.
(Courtesy of Yu, et al. [27] )
[Serializable]
public struct Pair {
  string word;
  int count;
  // The rest of the struct (constructor and accessors) is omitted in this excerpt.
}
The details of job execution and data movement are handled inside the DryadLINQ framework,
and the programs are more concerned with the application program logic. This can greatly reduce
the level of programming skill required for parallel data processing.
FIGURE 6.15
Data flow in WebVisCounter program execution.
Because the output of interest is the number of times users connect to a given web site using a particular
operating system, the Map function parses each line to extract the type of the operating system used
(e.g., WinXP) as a key and assigns a value (1 in this case) to it. The Reduce function in turn sums up the
number of 1s for each unique key (operating system type in this case). Figure 6.15 shows the associated
data flow for the WebVisCounter program.
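The logic of these two functions can be sketched in a few lines of Python. This is an illustration of
the Map and Reduce behavior just described, not the actual WebVisCounter code; the log-line
format and the operating system tokens are illustrative assumptions.
# Illustrative Map and Reduce logic for counting connections per operating system.
def web_map(log_line):
    """Extract the operating system token from one web-log line."""
    os_type = log_line.split("...")[-1].split(";")[0].strip()
    return (os_type, 1)

def web_reduce(os_type, values):
    """Sum the 1s emitted for one operating system."""
    return (os_type, sum(values))

lines = ["a.b.c.d -- [12/Jan/2010...] ... Macintosh; U",
         "e.f.g.h -- [12/Jan/2010...] ... WinXP; U",
         "i.j.k.l -- [10/Jan/2010...] ... WinXP; U"]
pairs = [web_map(l) for l in lines]
groups = {}
for key, value in pairs:                 # the framework's sort-and-group step
    groups.setdefault(key, []).append(value)
print([web_reduce(k, v) for k, v in groups.items()])   # [('Macintosh', 1), ('WinXP', 2)]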
FIGURE 6.16
The overall flow of filtering, aggregating, and collating in Sawzall.
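The short Sawzall script that the next paragraph walks through does not appear on this page; the
version below is the well-known example from the Sawzall paper [31], reproduced here on the
assumption that it matches the script the text describes.
count: table sum of int;
total: table sum of float;
sum_of_squares: table sum of float;
x: float = input;
emit count <- 1;
emit total <- x;
emit sum_of_squares <- x * x;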
The first three lines declare the aggregators as count, total, and sum_of_squares. table is a keyword
which defines an aggregator type. These particular tables are sum tables and they will automatically add
up the values emitted to them. For each input record, Sawzall initializes the predefined variable input to
the uninterpreted byte string of the input. The line x: float = input; converts the input into a floating-point
number and stores it in local variable x. The three emit statements send the intermediate values to the
aggregators. One can translate the Sawzall scripts into MapReduce programs and run them with multiple
servers.
supported collective operations. This is an important general approach to parallelism and was seen,
for example, a long time ago in High Performance Fortran [68]. There are several discussions of
Pig and Pig Latin in the literature [69,70], and here we summarize the language features. Table 6.8
lists the four data types in Pig Latin and Table 6.9 the 14 operators.
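A bag of query-log tuples is first produced by a LOAD statement such as the following. This is a
reconstruction consistent with the description that follows; the file name and the timestamp field
are assumed, following the published Pig Latin examples [69,70].
Queries = LOAD 'query_log.txt' USING myUDF() AS (userID, queryString, timestamp);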
The LOAD command returns a handle to the bag Queries. myUDF() is an optional custom reader
which is an example of a user-defined function. The AS syntax defines the schema of the tuples that
make up the bag Queries. The data can now be processed by commands such as:
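(The statement below is a reconstruction consistent with the description that follows; the output
alias Expanded_queries is an assumed name.)
Expanded_queries = FOREACH Queries GENERATE userID, expandQueryUDF(queryString);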
The example maps each tuple in queries as determined by the user-defined function expandQueryUDF.
FOREACH runs over all tuples in the bag. Alternatively, one could use FILTER as in the following example to
remove all tuples with a userID equal to Alice as a string:
Real_queries = FILTER queries BY userID neq 'Alice';
Pig Latin offers the equivalent of a SQL JOIN capability using COGROUP as in
Grouped_data = COGROUP results BY queryString, revenue BY queryString;
where results and revenue are bags of tuples (either from LOAD or from processing of LOADed data) with the schemas
Results: (queryString, url, position)
Revenue: (queryString, adSlot, amount)
COGROUP is more general than JOIN in the sense that COGROUP does not produce a set of tuples
(queryString, url, position, adSlot, amount), but rather a tuple consisting of three fields. The first field is
queryString; the second field is all tuples from Results with that value of queryString; and the third field is
all tuples from Revenue with that value of queryString. FLATTEN can map the result of COGROUP to the SQL
Join (queryString, url, position, adSlot, amount) syntax.
Pig Latin operations are performed in the order listed as a data flow pipeline. This is in contrast
to declarative SQL where one just specifies “what” has to be done, not how it is to be done. Pig
Latin supports user-defined functions, as illustrated in the preceding code, as first-class operations
in the language which could be an advantage over SQL. User-defined functions can be placed in
Load, Store, Group, Filter, and Foreach operators, depending on user preference. Note that the rich
set of data flow operations allowed in Pig Latin makes it similar to a scripting approach to work-
flow, as we discussed in Section 5.5.5. The Pig! Apache project [69] maps Pig Latin into a
sequence of MapReduce operations implemented in Hadoop.
In the SPMD style, each unit executes the same program, but at any given time, there is no requirement that the same instruction
be executed. Category 1 corresponds to regular problems, whereas category 2 includes dynamic irregu-
lar cases with complex geometries for solving partial differential equations or particle dynamics. Note
that synchronous problems are still around, but they are run on MIMD machines with the SPMD
model. Also note that category 2 consists of compute–communicate phases and the computations are
synchronized by communication. No additional synchronization is needed.
Category 3 consists of asynchronously interacting objects and is often considered the people’s
view of a typical parallel problem. It probably does describe the concurrent threads in a modern oper-
ating system, as well as some important applications, such as event-driven simulations and areas such
as search in computer games and graph algorithms. Shared memory is natural due to the low latency
often needed to perform dynamic synchronization. It wasn’t clear in the past, but now it appears that
this category is not very common in large-scale parallel problems of importance.
Category 4 is the simplest algorithmically, with disconnected parallel components. However, the
importance of this category has probably grown since the original 1988 analysis when it was esti-
mated to account for 20 percent of all parallel computing. Both grids and clouds are very natural for
this class, which does not need high-performance communication between different nodes.
Category 5 refers to the coarse-grained linkage of different “atomic” problems, and this was
fully covered in Section 5.5. This area is clearly common and is expected to grow in importance.
Remember the critical observation in Section 5.5 that we use a two-level programming model with
the metaproblem (workflow) linkage specified in one fashion and the component problems with
approaches such as those in this chapter. Grids or clouds are suitable for metaproblems as coarse-
grained decomposition does not usually require stringent performance.
As noted earlier, we added a sixth category to cover data-intensive applications motivated by the
clear importance of MapReduce as a new programming model. We call this category MapReduce++,
and it has three subcategories: “map only” applications similar to pleasingly parallel category 4
[41,43,74,75]; the classic MapReduce with file-to-file operations consisting of parallel maps followed
by parallel reduce operations; and a subcategory that captures the extended MapReduce version cov-
ered in Section 6.2.2. Note that category 6 is a subset of categories 2 and 4 with additional reading
and writing of data and a specialized, loosely synchronous structure compared to that used in data
analysis. This comparison is made clearer in Table 6.11.
FIGURE 6.17
Programming environment for Google AppEngine.
An update of an entity occurs in a transaction that is retried a fixed number of times if other
processes are trying to update the same entity simultaneously. Your application can execute multiple
data store operations in a single transaction which either all succeed or all fail together. The data
store implements transactions across its distributed network using “entity groups.” A transaction
manipulates entities within a single group. Entities of the same group are stored together for effi-
cient execution of transactions. Your GAE application can assign entities to groups when the enti-
ties are created. The performance of the data store can be enhanced by in-memory caching using
the memcache, which can also be used independently of the data store.
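As a small illustration of these ideas, the following sketch uses the original Python runtime's db
and memcache APIs to update an entity inside a transaction and cache the result. The model class,
the key names, and the cache lifetime are illustrative assumptions, and error handling is omitted.
# Illustrative GAE (Python runtime) sketch: a transactional counter update plus memcache.
from google.appengine.ext import db
from google.appengine.api import memcache

class PageCounter(db.Model):              # example model, not part of GAE itself
    count = db.IntegerProperty(default=0)

def increment(counter_name):
    def txn():
        counter = PageCounter.get_by_key_name(counter_name)
        if counter is None:
            counter = PageCounter(key_name=counter_name)
        counter.count += 1
        counter.put()
        return counter.count
    # Retried automatically a fixed number of times if another process updates
    # the same entity group concurrently.
    value = db.run_in_transaction(txn)
    memcache.set("counter:" + counter_name, value, time=60)   # cache for fast reads
    return value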
Recently, Google added the blobstore which is suitable for large files as its size limit is 2 GB.
There are several mechanisms for incorporating external resources. The Google SDC Secure Data
Connection can tunnel through the Internet and link your intranet to an external GAE application.
The URL Fetch operation provides the ability for applications to fetch resources and communicate
with other hosts over the Internet using HTTP and HTTPS requests. There is a specialized mail
mechanism to send e-mail from your GAE application.
Applications can access resources on the Internet, such as web services or other data, using
GAE’s URL fetch service. The URL fetch service retrieves web resources using the same high-
speed Google infrastructure that retrieves web pages for many other Google products. There are
dozens of Google “corporate” facilities including maps, sites, groups, calendar, docs, and YouTube,
among others. These support the Google Data API which can be used inside GAE.
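For example, fetching an external resource from application code takes only a few lines with the
URL Fetch API of the Python runtime; the URL below is a placeholder.
# Illustrative use of GAE's URL Fetch service.
from google.appengine.api import urlfetch

result = urlfetch.fetch("http://example.com/data.json")   # placeholder URL
if result.status_code == 200:
    data = result.content            # raw bytes of the fetched resource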
An application can use Google Accounts for user authentication. Google Accounts handles user
account creation and sign-in, and a user that already has a Google account (such as a Gmail
account) can use that account with your app. GAE provides the ability to manipulate image data
using a dedicated Images service which can resize, rotate, flip, crop, and enhance images. An appli-
cation can perform tasks outside of responding to web requests. Your application can perform these
tasks on a schedule that you configure, such as on a daily or hourly basis using “cron jobs,”
handled by the Cron service.
Alternatively, the application can perform tasks added to a queue by the application itself, such
as a background task created while handling a request. A GAE application is configured to consume
resources up to certain limits or quotas. With quotas, GAE ensures that your application won’t
exceed your budget, and that other applications running on GAE won’t impact the performance of
your app. In particular, GAE use is free up to certain quotas.
FIGURE 6.18
Architecture of Google File System (GFS).
(Courtesy of S. Ghemawat, et al. [53] )
The customized API adds snapshot and record append operations to facilitate the building of
Google applications.
Figure 6.18 shows the GFS architecture. It is quite obvious that there is a single master in the
whole cluster. Other nodes act as the chunk servers for storing data, while the single master
stores the metadata. The file system namespace and locking facilities are managed by the master.
The master periodically communicates with the chunk servers to collect management information
as well as give instructions to the chunk servers to do work such as load balancing or fail
recovery.
The master has enough information to keep the whole cluster in a healthy state. With a single
master, many complicated distributed algorithms can be avoided and the design of the system can
be simplified. However, this design does have a potential weakness, as the single GFS master could
be the performance bottleneck and the single point of failure. To mitigate this, Google uses a
shadow master to replicate all the data on the master, and the design guarantees that all the data
operations are performed directly between the client and the chunk server. The control messages
are transferred between the master and the clients and they can be cached for future use. With the
current quality of commodity servers, the single master can handle a cluster of more than 1,000 nodes.
FIGURE 6.19
Data mutation sequence in GFS.
Figure 6.19 shows the data mutation (write, append operations) in GFS. Data blocks must be created for
all replicas. The goal is to minimize involvement of the master. The mutation takes the following
steps:
1. The client asks the master which chunk server holds the current lease for the chunk and the
locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses
(not shown).
2. The master replies with the identity of the primary and the locations of the other (secondary)
replicas. The client caches this data for future mutations. It needs to contact the master again
only when the primary becomes unreachable or replies that it no longer holds a lease.
3. The client pushes the data to all the replicas. A client can do so in any order. Each chunk server will
store the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the
data flow from the control flow, we can improve performance by scheduling the expensive data
flow based on the network topology regardless of which chunk server is the primary.
4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the
primary. The request identifies the data pushed earlier to all the replicas. The primary assigns
consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which
provides the necessary serialization. It applies the mutation to its own local state in serial order.
5. The primary forwards the write request to all secondary replicas. Each secondary replica applies
mutations in the same serial number order assigned by the primary.
6. The secondaries all reply to the primary indicating that they have completed the operation.
7. The primary replies to the client. Any errors encountered at any replicas are reported to the client.
In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary
replicas. The client request is considered to have failed, and the modified region is left in an
inconsistent state. The client code handles such errors by retrying the failed mutation. It will make
a few attempts at steps 3 through 7 before falling back to a retry from the beginning of the write.
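A compact simulation of this control and data flow is sketched below in plain Python; it is not GFS
code. The chunk servers are modeled as simple objects, and steps 1 and 2 (lease lookup and caching)
are folded into the setup.
# Simplified simulation of the GFS write (mutation) sequence described above.
import itertools

class ChunkServer:
    def __init__(self, name):
        self.name, self.buffer, self.chunk = name, None, []
    def push(self, data):                 # step 3: data pushed to every replica
        self.buffer = data
    def apply(self, serial):              # steps 4-5: apply the mutation in serial order
        self.chunk.append((serial, self.buffer))
        self.buffer = None
        return "ok"

_serials = itertools.count(1)

def write(data, primary, secondaries):
    for server in [primary] + secondaries:            # step 3: push data to all replicas
        server.push(data)
    serial = next(_serials)                           # step 4: primary assigns a serial number
    primary.apply(serial)
    acks = [s.apply(serial) for s in secondaries]     # steps 5-6: forward, then collect replies
    return "success" if all(a == "ok" for a in acks) else "retry"   # step 7: reply to client

primary = ChunkServer("primary")
secondaries = [ChunkServer("secondary-A"), ChunkServer("secondary-B")]
print(write(b"record-1", primary, secondaries))       # -> success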
Thus, besides the writing operation provided by GFS, special appending operations can be
used to append the data blocks to the end of the files. The reason for providing such operations
is that some of the Google applications need a lot of append operations. For example, while
crawlers are gathering data from the web, the contents of web pages will be appended to page
files. Thus, the appending operation is provided and optimized. The client specifies data to be
appended and GFS appends it to the file atomically at least once. GFS picks the offset and the
clients cannot decide the offset of the data position. The appending operation works for concur-
rent writers.
GFS was designed for high fault tolerance and adopted some methods to achieve this goal.
Master and chunk servers can be restarted in a few seconds, and with such a fast recovery capabil-
ity, the window of time in which the data is unavailable can be greatly reduced. As we mentioned
before, each chunk is replicated in at least three places and can tolerate at least two data crashes for
a single chunk of data. The shadow master handles the failure of the GFS master. For data integrity,
GFS makes checksums on every 64 KB block in each chunk. With the previously discussed design
and implementation, GFS can achieve the goals of high availability (HA), high performance, and
large scale. GFS demonstrates how to support large-scale processing workloads on commodity hardware
that is designed to tolerate frequent component failures and optimized for huge files that are mostly
appended to and then read.
FIGURE 6.20
BigTable data model and system structure.
(Courtesy of Chang, et al. [11] )
BigTable is not a relational database providing abundant atomic operations (transactions). Row creation is implicit
upon storing data. Rows are ordered lexicographically, so rows with lexicographically close keys are usually stored on
one or a small number of machines.
Large tables are broken into tablets at row boundaries. A tablet holds a contiguous range of rows. Clients
can often choose row keys to achieve locality. The system aims for about 100 MB to 200 MB of data per tablet.
Each serving machine is responsible for about 100 tablets. This can achieve faster recovery times as
100 machines each pick up one tablet from the failed machine. This also results in fine-grained load balan-
cing, that is, migrating tablets away from the overloaded machine. Similar to the design in GFS, a master
machine in BigTable makes load-balancing decisions.
Figure 6.20(b) shows the BigTable system structure. A BigTable master manages and stores the meta-
data of the BigTable system. BigTable clients use the BigTable client programming library to communicate
with the BigTable master and tablet servers. BigTable relies on a highly available and persistent distributed
lock service called Chubby [76] discussed in Section 6.3.4.
FIGURE 6.21
Tablet location hierarchy in using the BigTable.
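The three-level lookup in Figure 6.21 can be pictured with ordinary dictionaries, as in the sketch
below. This is plain Python, not the BigTable client library; the table, key-range, and server names
are illustrative. Clients cache the locations they discover, so most reads skip the upper levels.
# Illustrative three-level tablet location lookup:
# Chubby file -> root tablet -> METADATA tablet -> user tablet.
chubby_file = "root-tablet@server0"          # level 0: Chubby file naming the root tablet's server
root_tablet = {"METADATA-1": "server1", "METADATA-2": "server2"}        # level 1: root tablet
metadata_tablets = {                         # level 2: METADATA tablets -> user tablet servers
    "METADATA-1": {("UserTable1", "a-m"): "server3"},
    "METADATA-2": {("UserTable1", "n-z"): "server4", ("UserTableN", "a-z"): "server5"},
}
location_cache = {}

def locate(table, key_range):
    if (table, key_range) in location_cache:
        return location_cache[(table, key_range)]
    for meta_name in root_tablet:            # consult the root tablet entries
        server = metadata_tablets[meta_name].get((table, key_range))
        if server:
            location_cache[(table, key_range)] = server   # cache for later reads
            return server
    return None

print(locate("UserTable1", "n-z"))           # -> server4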
FIGURE 6.22
Structure of Google Chubby for distributed lock service.
BigTable uses compaction to store data efficiently. Shared logs are used for logging the operations of multiple
tablets so as to reduce the log space as well as keep the system consistent.
AWS also provides client support for a growing number of smartphones and tablets [77]. We further note the auto-scaling and elastic
load balancing services which support related capabilities. Auto-scaling enables you to automatically
scale your Amazon EC2 capacity up or down according to conditions that you define. With auto-
scaling, you can ensure that the number of Amazon EC2 instances you’re using scales up seam-
lessly during demand spikes to maintain performance, and scales down automatically during
demand lulls to minimize cost.
Elastic load balancing automatically distributes incoming application traffic across multiple
Amazon EC2 instances and allows you to avoid nonoperating nodes and to equalize load on func-
tioning images. Both auto-scaling and elastic load balancing are enabled by CloudWatch which
monitors running instances. CloudWatch is a web service that provides monitoring for AWS cloud
resources, starting with Amazon EC2. It provides customers with visibility into resource utilization,
operational performance, and overall demand patterns—including metrics such as CPU utilization,
disk reads and writes, and network traffic.
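The scaling decision itself is simple threshold logic over such metrics, as the toy sketch below
suggests. This is generic Python, not the Auto Scaling or CloudWatch APIs, and the thresholds and
instance limits are illustrative assumptions.
# Toy scaling policy driven by a CloudWatch-style CPU utilization metric.
def desired_instances(current, avg_cpu, low=30.0, high=70.0, min_n=2, max_n=20):
    if avg_cpu > high and current < max_n:
        return current + 1          # demand spike: scale out to maintain performance
    if avg_cpu < low and current > min_n:
        return current - 1          # demand lull: scale in to minimize cost
    return current

print(desired_instances(current=4, avg_cpu=85.0))   # -> 5
print(desired_instances(current=4, avg_cpu=12.0))   # -> 3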
This sequence is supported by public, private, and paid AMIs shown in Figure 6.24. The
AMIs are formed from the virtualized compute, storage, and server resources shown at the bottom
of Figure 6.23.
Private AMI Images created by you, which are private by default. You can grant access to other
users to launch your private images.
Public AMI Images created by users and released to the AWS community, so anyone can launch
instances based on them and use them any way they like. AWS lists all public images at
http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=171.
Paid AMI You can create images providing specific functions that can be launched by anyone
willing to pay you per each hour of usage on top of Amazon’s charges.
FIGURE 6.23
Amazon EC2 execution environment.
4. High-CPU instances have proportionally more CPU resources than memory (RAM) and are well suited
for compute-intensive applications.
5. Cluster compute instances provide proportionally high CPU resources with increased network perfor-
mance and are well suited for high-performance computing (HPC) applications and other demanding
network-bound applications. They use 10 Gigabit Ethernet interconnections.
The cost in the third column is expressed in terms of EC2 Compute Units (ECUs) where one ECU provides
the CPU capacity of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor. This leads to the cost per hour for
CPUs shown in Table 6.14. Note that a real-world use of EC2 must pay for use of many different resources;
the CPU charge in Table 6.14 is just one component and all charges (which naturally change often and so
the reader should get the latest values online) are available on the AWS web site.
FIGURE 6.24
Amazon S3 execution environment.
• Authentication mechanisms to ensure that data is kept secure from unauthorized access. Objects
can be made private or public, and rights can be granted to specific users.
• Per-object URLs and ACLs (access control lists).
• Default download protocol of HTTP. A BitTorrent protocol interface is provided to lower costs
for high-scale distribution.
• Storage priced from $0.055 (for more than 5,000 TB) to $0.15 per GB per month, depending on the total amount stored.
• First 1 GB per month of input or output free, and then $0.08 to $0.15 per GB for transfers outside an
S3 region.
• There is no data transfer charge for data transferred between Amazon EC2 and Amazon S3
within the same region or for data transferred between the Amazon EC2 Northern Virginia
region and the Amazon S3 U.S. Standard region (as of October 6, 2010).
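The feature list above maps directly onto a few calls in an S3 client library such as boto [86]. The
bucket and object names below are placeholders, and credentials are assumed to be configured in
the environment.
# Storing and publishing one object with boto; names are placeholders.
import boto
from boto.s3.key import Key

conn = boto.connect_s3()                       # uses credentials from the environment
bucket = conn.create_bucket("my-example-bucket")
obj = Key(bucket)
obj.key = "reports/2010/summary.txt"
obj.set_contents_from_string("hello, S3")      # upload the object
obj.set_acl("public-read")                     # per-object ACL
print(obj.generate_url(expires_in=3600))       # signed per-object URL, valid for one hour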
Amazon Elastic Block Store (EBS) volumes attach to an EC2 instance as a raw block device (like a hard drive). Snapshots are provided so that the data can be saved incrementally.
This can improve performance when saving and restoring data. In terms of pricing, Amazon provides
a similar pay-per-use schema as EC2 and S3. Volume storage charges are based on the amount of
storage users allocate until it is released, and is priced at $0.10 per GB/month. EBS also charges
$0.10 per 1 million I/O requests made to the storage (as of October 6, 2010). The equivalent of EBS
has been offered in open source clouds such as Nimbus.
FIGURE 6.25
Features of the Azure cloud platform.
As discussed in Chapter 4, the Azure concept of roles is an interesting idea that we can expect
to be expanded in terms of role types and use in other cloud environments. Figure 6.25 shows
that compute roles can be load-balanced, similar to what GAE and AWS clouds have done (see
Section 4.1).
6.4.4.1 SQLAzure
Azure offers a very rich set of storage capabilities, as shown in Figure 6.25. The SQLAzure ser-
vice offers SQL Server as a service and is described in detail in Example 6.10. All the storage
modalities are accessed with REST interfaces except for the recently introduced Drives that are
analogous to Amazon EBS discussed in Section 6.4.3, and offer a file system interface as a dur-
able NTFS volume backed by blob storage. The REST interfaces are automatically associated
with URLs and all storage is replicated three times for fault tolerance and is guaranteed to be
consistent in access.
The basic storage system is built from blobs which are analogous to S3 for Amazon. Blobs are
arranged as a three-level hierarchy: Account → Containers → Page or Block Blobs. Containers are
analogous to directories in traditional file systems with the account acting as the root. The block
blob is used for streaming data and each such blob is made up as a sequence of blocks of up
to 4 MB each, while each block has a 64 byte ID. Block blobs can be up to 200 GB in size.
Page blobs are for random read/write access and consist of an array of pages with a maximum blob
size of 1 TB. One can associate metadata with blobs as <name, value> pairs with up to 8 KB
per blob.
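The hierarchy maps naturally onto the blob URL structure and a block-splitting rule, as the small
sketch below shows. The account, container, and blob names are placeholders; the URL pattern is
assumed to be the standard Azure blob addressing scheme.
# Illustrative view of the Account -> Container -> Blob hierarchy.
def blob_url(account, container, blob):
    # Standard Azure blob addressing; names here are placeholders.
    return "https://%s.blob.core.windows.net/%s/%s" % (account, container, blob)

def num_blocks(blob_size_bytes, block_size=4 * 1024 * 1024):
    """How many blocks a block blob of this size needs (4 MB maximum per block)."""
    return -(-blob_size_bytes // block_size)        # ceiling division

print(blob_url("myaccount", "photos", "2010/trip.jpg"))
print(num_blocks(10 * 1024 * 1024))                 # a 10 MB block blob -> 3 blocks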
SDS (in fact, the total Azure Platform SDK) can be integrated with Microsoft's powerful Visual
Studio development environment. This can greatly improve the effectiveness and efficiency with which
developers can make cloud applications.
6.5.1.3 Nimbus
Nimbus [81,82] is a set of open source tools that together provide an IaaS cloud computing solu-
tion. Figure 6.27 shows the architecture of Nimbus, which allows a client to lease remote resources
by deploying VMs on those resources and configuring them to represent the environment desired
by the user.
FIGURE 6.26
The Eucalyptus architecture for VM image management.
(Courtesy of Eucalyptus LLC [81])
FIGURE 6.27
Nimbus cloud infrastructure.
(Courtesy of Nimbus Project [82])
To this end, Nimbus provides a special web interface known as Nimbus Web [83]. Its
aim is to provide administrative and user functions in a friendly interface. Nimbus Web is centered
around a Python Django [84] web application that is intended to be deployable completely separate
from the Nimbus service.
As shown in Figure 6.27, a storage cloud implementation called Cumulus [83] has been tightly
integrated with the other central services, although it can also be used stand-alone. Cumulus is com-
patible with the Amazon S3 REST API [85], but extends its capabilities by including features such
as quota management. Therefore, clients such as boto [86] and s3cmd [87], which work against the
S3 REST API, work with Cumulus. On the other hand, the Nimbus cloud client uses the Java
Jets3t library [88] to interact with Cumulus.
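Because Cumulus speaks the S3 REST API, pointing an S3 client at it is mostly a matter of
overriding the endpoint, as in the boto sketch below. The host, port, and credentials are placeholders.
# Connecting an S3-style client (boto) to a Cumulus endpoint; values are placeholders.
from boto.s3.connection import S3Connection, OrdinaryCallingFormat

cumulus = S3Connection(
    aws_access_key_id="CumulusUserKey",
    aws_secret_access_key="CumulusUserSecret",
    host="cumulus.example.org",             # the Cumulus service endpoint
    port=8888,
    is_secure=False,
    calling_format=OrdinaryCallingFormat()  # path-style addressing instead of virtual hosting
)
for bucket in cumulus.get_all_buckets():
    print(bucket.name)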
Nimbus supports two resource management strategies. The first is the default “resource pool”
mode. In this mode, the service has direct control of a pool of VM manager nodes and it assumes it
can start VMs. The other supported mode is called “pilot.” Here, the service makes requests to a
cluster’s Local Resource Management System (LRMS) to get a VM manager available to deploy
VMs. Nimbus also provides an implementation of Amazon’s EC2 interface [89] that allows users to
use clients developed for the real EC2 system against Nimbus-based clouds.
FIGURE 6.28
OpenNebula architecture and its main components.
(Courtesy of Sotomayor, et al. [94] )
OpenNebula also allows organizations to supplement their local infrastructure with computing capacity from a public cloud to meet peak demands, or
implement HA strategies. OpenNebula currently includes an EC2 driver, which can submit requests
to Amazon EC2 [89] and Eucalyptus [80], as well as an ElasticHosts driver [96]. Regarding storage,
an Image Repository allows users to easily specify disk images from a catalog without worrying
about low-level disk configuration attributes or block device mapping. Also, image access control is
applied to the images registered in the repository, hence simplifying multiuser environments and
image sharing. Nevertheless, users can also set up their own images.
6.5.2.1 Sector/Sphere
Sector/Sphere is a software platform that supports very large distributed data storage and simplified dis-
tributed data processing over large clusters of commodity computers, either within a data center or
across multiple data centers. The system consists of the Sector distributed file system and the Sphere
parallel data processing framework [97,98]. Sector is a distributed file system (DFS) that can be
deployed over a wide area and allows users to manage large data sets from any location with a high-
speed network connection [99]. The fault tolerance is implemented by replicating data in the file system
and managing the replicas.
Since Sector is aware of the network topology when it places replicas, it also provides better
reliability, availability, and access throughput. The communication is performed using the User Datagram
Protocol (UDP) for message passing and UDT (UDP-based Data Transfer) [100] for data transfer.
Obviously, UDP is faster than TCP for message passing because it does not require connection
setup, but it could become a problem if Sector is used over the Internet. Meanwhile, UDT is a
reliable UDP-based application-level data transport protocol which has been specifically designed
to enable high-speed data transfer over wide area high-speed networks [100]. Finally, the Sector
client provides a programming API, tools, and a FUSE [101] user space file system module.
FIGURE 6.29
The Sector/Sphere system architecture.
(Courtesy of Gu and Grossman [102])
On the other hand, Sphere is a parallel data processing engine designed to work with data
managed by Sector. This coupling allows the
system to make accurate decisions about job
scheduling and data location. Sphere provides a programming framework that developers can use to
process data stored in Sector. Thus, it allows UDFs to run on all input data segments in parallel.
Such data segments are processed at their storage locations whenever possible (data locality). Failed
data segments may be restarted on other nodes to achieve fault tolerance. In a Sphere application,
both inputs and outputs are Sector files. Multiple Sphere processing segments can be combined
to support more complicated applications, with inputs/outputs exchanged/shared via the Sector file
system [102].
The Sector/Sphere platform [102] is supported by the architecture shown in Figure 6.29, which
is composed of four components. The first component is the security server, which is responsible
for authenticating master servers, slave nodes, and users. We also have the master servers that can
be considered the infrastructure core. The master server maintains file system metadata, schedules
jobs, and responds to users’ requests. Sector supports multiple active masters that can join and
leave at runtime and can manage the requests. Another component is the slave nodes, where data is
stored and processed. The slave nodes can be located within a single data center or across multiple
data centers with high-speed network connections. The last component is the client component.
This provides tools and programming APIs for accessing and processing Sector data.
Finally, it is worth mentioning that a new component, called Space [97], has been developed as
part of this platform; it is a framework to support column-based distributed
data tables. Tables are stored by columns and are segmented onto multiple slave nodes.
Tables are independent and no relationships between them are supported. A reduced set of SQL
operations is supported, including, but not limited to, table creation and modification, key-value
updates and lookups, and select UDF operations.
6.5.2.2 OpenStack
OpenStack [103] was introduced by Rackspace and NASA in July 2010. The project is building
an open source community spanning technologists, developers, researchers, and industry to share
resources and technologies with the goal of creating a massively scalable and secure cloud infrastruc-
ture. In the tradition of other open source projects, the software is open source and is not limited
to proprietary APIs such as Amazon's.
Currently, OpenStack focuses on the development of two aspects of cloud computing to address
compute and storage aspects with the OpenStack Compute and OpenStack Storage solutions.
“OpenStack Compute is the internal fabric of the cloud creating and managing large groups of
virtual private servers” and “OpenStack Object Storage is software for creating redundant, scalable
object storage using clusters of commodity servers to store terabytes or even petabytes of data.”
Recently, an image repository was prototyped. The image repository contains an image registration
and discovery service and an image delivery service. Together they deliver images to the compute
service while obtaining them from the storage service. This development gives an indication that
the project is striving to integrate more services into its portfolio.
FIGURE 6.30
OpenStack Nova system architecture. The AMQP (Advanced Messaging Queuing Protocol) was described in
Section 5.2.
The Nova cloud controller interacts with the S3 service, and manages nodes, as well as storage workers, through a queue. Additionally, Nova
integrates networking components to manage private networks, public IP addressing, virtual private
network (VPN) connectivity, and firewall rules. It includes the following types:
• NetworkController manages address and virtual LAN (VLAN) allocations
• RoutingNode governs the NAT (network address translation) conversion of public IPs to private
IPs, and enforces firewall rules
• AddressingNode runs Dynamic Host Configuration Protocol (DHCP) services for private
networks
• TunnelingNode provides VPN connectivity
The network state (managed in the distributed object store) consists of the following:
• VLAN assignment to a project
• Private subnet assignment to a security group in a VLAN
• Private IP assignments to running instances
• Public IP allocations to a project
• Public IP associations to a private IP/running instance
Aneka can be deployed either as a public cloud accessible through the Internet to its subscribers, or as a private cloud constituted by a set of nodes
with restricted access, as shown in Figure 6.31.
Aneka acts as a workload distribution and management platform for accelerating applications in
both Linux and Microsoft .NET framework environments. Some of the key advantages of Aneka
over other workload distribution solutions include:
• Support of multiple programming and application environments
• Simultaneous support of multiple runtime environments
• Rapid deployment tools and framework
• Ability to harness multiple virtual and/or physical machines for accelerating application
provisioning based on users’ Quality of Service/service-level agreement (QoS/SLA) requirements
• Built on top of the Microsoft .NET framework, with support for Linux environments through Mono
FIGURE 6.31
Architecture and components of Aneka.
(Courtesy of Raj Buyya, Manjrasoft )
Aneka offers three types of capabilities which are essential for building, accelerating, and managing
clouds and their applications:
1. Build Aneka includes a new SDK which combines APIs and tools to enable users to rapidly
develop applications. Aneka also allows users to build different runtime environments such as
enterprise/private cloud by harnessing compute resources in network or enterprise data centers,
Amazon EC2, and hybrid clouds by combining enterprise private clouds managed by Aneka
with resources from Amazon EC2 or other enterprise clouds built and managed using
XenServer.
2. Accelerate Aneka supports rapid development and deployment of applications in multiple
runtime environments running different operating systems such as Windows or Linux/UNIX.
Aneka uses physical machines as much as possible to achieve maximum utilization in local
environments. Whenever users set QoS parameters such as deadlines, and if the enterprise
resources are insufficient to meet the deadline, Aneka supports dynamic leasing of extra
capabilities from public clouds such as EC2 to complete the task within the deadline (see
Figure 6.32).
FIGURE 6.32
Aneka using private cloud resources along with dynamically leased public cloud resources.
(Courtesy of Raj Buyya, Manjrasoft, www.manjrasoft.com/ )
3. Manage Management tools and capabilities supported by Aneka include a GUI and APIs to set
up, monitor, manage, and maintain remote and global Aneka compute clouds. Aneka also has
an accounting mechanism and manages priorities and scalability based on SLA/QoS which
enables dynamic provisioning.
Here are three important programming models supported by Aneka for both cloud and tradi-
tional parallel applications:
1. Thread programming model, best solution to adopt for leveraging the computing capabilities of
multicore nodes in a cloud of computers
2. Task programming model, which allows for quickly prototyping and implementing an
independent bag of task applications
3. MapReduce programming model, as discussed in Section 6.2.2
In addition, specific models can require additional services or a different type of support.
Aneka provides support for the most well-known application programming patterns, such as distrib-
uted threads, bags of tasks, and MapReduce.
Additional services can be designed and deployed in the system. This is how the infrastructure is
enriched with additional features and capabilities. The SDK provides straightforward interfaces and
ready-to-use components for rapid service prototyping. Deployment and integration of new services
is painless and immediate: The container leverages the Spring framework and allows for dynamic
integration of new components such as services.
(Figure 6.33 compares Setup 1, a single four-core server rendering the raw locomotive design files by
running Maya directly in graphical mode, with Setup 2, an Aneka enterprise cloud over a LAN that
runs Maya in batch mode on demand; for 2,000 to 6,000 frames, Aneka utilizes idle desktops to
decrease the task time from days to hours.)
FIGURE 6.33
Rendering images of locomotive design on GoFront’s private cloud using Aneka.
(Courtesy of Raj Buyya, Manjrasoft )
The GoFront Group is responsible for designing the high-speed electric locomotive, metro car, urban
transportation vehicle, and motor train. The raw design of the prototypes requires high-quality 3D images
using Autodesk’s Maya rendering software. By examining the 3D images, engineers identify problems in
the original design and make the appropriate design improvements. However, such designs on a single
four-core server took three days to render scenes with 2,000 frames.
To reduce this time, GoFront used Aneka and created an enterprise cloud by utilizing networked PCs.
It used Aneka Design Explorer, a tool for rapid creation of parameter sweep applications in which the
same program is executed many times on different data items (in this case, executing Maya for rendering
different images). A customized Design Explorer (called Maya GUI) has been implemented for Maya
rendering. Maya GUI managed parameters, generated Aneka tasks, monitored submitted Aneka tasks, and
collected final rendered images. The design image used to take three days to render (2,000+ frames, each
frame with more than five different camera angles). Using only a 20-node Aneka cloud, GoFront was able
to reduce the rendering time from three days to three hours.
Acknowledgement
This chapter was coauthored by Geoffrey Fox of Indiana University and Albert Zomaya of Sydney
University, Australia. Their work was assisted by Kang Chen (Tsinghua University, China), Judy
Qiu, Gregor von Laszewski, Javier Diaz, Archit Kulshrestha, and Andrew Younge (Indiana Univer-
sity), Reza Moravaeji and Javid Teheri (University of Sydney), and Renato Figueiredo (University
of Florida). Rajkumar Buyya (University of Melbourne) has contributed Section 6.5.3. Kai Hwang
(USC) edited the final manuscript.
References
[1] C. Dabrowski, Reliability in grid computing systems, Concurr. Comput. Pract. Exper. 21 (8) (2009)
927–959.
[2] Microsoft, Project Trident: A Scientific Workflow Workbench, http://research.microsoft.com/en-us/
collaboration/tools/trident.aspx, 2010.
[3] W. Lu, J. Jackson, R. Barga, AzureBlast: A case study of developing science applications on the cloud,
in: ScienceCloud: 1st Workshop on Scientific Cloud Computing co-located with HPDC (High Perfor-
mance Distributed Computing), ACM, Chicago, IL, 21 June 2010.
[4] Distributed Systems Laboratory (DSL) at University of Chicago Wiki. Performance Comparison: Remote
Usage, NFS, S3-fuse, EBS. 2010.
[5] D. Jensen, Blog entry on Compare Amazon S3 to EBS data read performance, http://jensendarren
.wordpress.com/2009/12/30/compare-amazon-s3-to-ebs-data-read-performance/, 2009.
[6] Zend PHP Company, The Simple Cloud API for Storage, Queues and Table, http://www.simplecloud.org/
home, 2010.
[7] Microsoft, Windows Azure Geo-location Live, http://blogs.msdn.com/b/windowsazure/archive/2009/04/
30/windows-azure-geo-location-live.aspx, 2009.
[8] Raytheon BBN, SHARD (Scalable, High-Performance, Robust and Distributed) Triple Store based on
Hadoop. http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-
scalable-distributed-triple-store/, 2010.
[9] NOSQL Movement, Wikipedia list of resources, http://en.wikipedia.org/wiki/NoSQL, 2010.
[10] NOSQL Link Archive, LIST OF NOSQL DATABASES, http://nosql-database.org/, 2010.
[11] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D. Wallach, M. Burrows, et al., BigTable: A distributed
storage system for structured data, in: OSDI’06: Seventh Symposium on Operating System Design and
Implementation, USENIX, Seattle, WA, 2006.
[12] Amazon, Welcome to Amazon SimpleDB, http://docs.amazonwebservices.com/AmazonSimpleDB/latest/
DeveloperGuide/index.html, 2010.
[13] J. Haridas, N. Nilakantan, B. Calder, Windows Azure Table, http://go.microsoft.com/fwlink/?LinkId=153401,
2009.
[14] International Virtual Observatory Alliance, VOTable Format Definition Version 1.1, http://www.ivoa.net/
Documents/VOTable/20040811/, 2004.
[15] Apache Incubator, Heart (Highly Extensible & Accumulative RDF Table) planet-scale RDF data
store and a distributed processing engine based on Hadoop & Hbase, http://wiki.apache.org/incubator/
HeartProposal, 2010.
[16] M. King, Amazon SimpleDB and CouchDB Compared, http://www.automatthew.com/2007/12/amazon-
simpledb-and-couchdb-compared.html, 2007.
[17] Apache, Hbase implementation of BigTable on Hadoop File System, http://hbase.apache.org/, 2010.
[18] Apache, The CouchDB document-oriented database project, http://couchdb.apache.org/index.html, 2010.
[19] M/Gateway Developments Ltd, M/DB Open Source “plug-compatible” alternative to Amazon’s
SimpleDB database, http://gradvs1.mgateway.com/main/index.html?path=mdb, 2009.
[20] ActiveMQ, http://activemq.apache.org/, 2009.
[21] S. Pallickara, G. Fox, NaradaBrokering: a distributed middleware framework and architecture for
enabling durable peer-to-peer grids, in: ACM/IFIP/USENIX 2003 International Conference on Middle-
ware, Rio de Janeiro, Brazil, Springer-Verlag, New York, Inc., 2003.
[22] NaradaBrokering, Scalable Publish Subscribe System, http://www.naradabrokering.org/, 2010.
[23] Apache Hadoop, http://hadoop.apache.org/, 2009.
[24] J. Ekanayake, A.S. Balkir, T. Gunarathne, G. Fox, C. Poulain, N. Araujo, et al., DryadLINQ for scienti-
fic analyses, in: Fifth IEEE International Conference on eScience, Oxford, 2009.
[25] J. Ekanayake, T. Gunarathne, J. Qiu, G. Fox, S. Beason, J.Y. Choi, et al., Applicability of DryadLINQ to
Scientific Applications, Community Grids Laboratory, Indiana University, 2009.
[26] M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly, Dryad: Distributed data-parallel programs from
sequential building blocks, in: ACM SIGOPS Operating Systems Review, ACM Press, 2007.
[27] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P.K. Gunda, et al., DryadLINQ: A System for
General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, in: Symposium on
Operating System Design and Implementation (OSDI), 2008.
[28] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1)
(2008) 107–113.
[29] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, et al., Twister: a runtime for iterative
MapReduce, in: Proceedings of the First International Workshop on MapReduce and Its Applications of
ACM HPDC 2010 conference, ACM, Chicago, IL, 20–25 June 2010.
[30] T. Gunarathne, T. Wu, J. Qiu, G. Fox, MapReduce in the Clouds for Science, in: CloudCom, IUPUI
Conference Center, Indianapolis, 30 November–3 December 2010.
[31] R. Pike, S. Dorward, R. Griesemer, S. Quinlan, Interpreting the data: parallel analysis with Sawzall, Scientific
Prog. J. 13 (4) (2005) 277–298 (Special Issue on Grids and Worldwide Computing Programming Models
and Infrastructure).
[32] Pig! Platform for analyzing large data sets, http://hadoop.apache.org/pig/, 2010.
[33] C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins, Pig Latin: a not-so-foreign language for data
processing, in: Proceedings of the 2008 ACM SIGMOD International Conference on Management of
Data, ACM, Vancouver, Canada, 2008, pp. 1099–1110.
[34] G. Malewicz, M.H. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, et al., Pregel: a system for large-scale
graph processing, in: Proceedings of the twenty-first annual symposium on parallelism in algorithms and
architectures, ACM, Calgary, Canada, 2009, p. 48.
[35] Cloudera, CDH: A free, stable Hadoop distribution offering RPM, Debian, AWS and automatic config-
uration options. http://www.cloudera.com/hadoop/, 2010.
[36] A. Grama, G. Karypis, V. Kumar, A. Gupta, Introduction to Parallel Computing, second ed., Addison
Wesley, 2003.
[37] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in: Sixth Symposium
on Operating Systems Design and Implementation, 2004, pp. 137–150.
[38] H. Kasim, V. March, R. Zhang, S. See, Survey on Parallel Programming Model, in: IFIP International
Conference on Network and Parallel Computing, Lecture Notes in Computer Science, Vol. 5245,
Springer-Verlag, Shanghai, China, 2008, pp. 266–275.
[39] S. Hariri, M. Parashar, Tools and Environments for Parallel and Distributed Computing, Series on
Parallel and Distributed Computing, Wiley, 2004, ISBN:978-0471332886.
[40] L. Silva, R. Buyya, Parallel Programming Models and Paradigms, (2007).
[41] T. Gunarathne, T. Wu, J. Qiu, G. Fox, Cloud Computing Paradigms for Pleasingly Parallel Biomedical
Applications, in: Proceedings of the Emerging Computational Methods for the Life Sciences Workshop
of ACM HPDC 2010 conference, Chicago, IL, 20–25 June 2010.
[42] G. Fox, MPI and MapReduce, in: Clusters, Clouds, and Grids for Scientific Computing CCGSC, Flat
Rock, NC, http://grids.ucs.indiana.edu/ptliupages/presentations/CCGSC-Sept8-2010.pptx, 8 September
2010.
[43] J. Ekanayake, X. Qiu, T. Gunarathne, S. Beason, G. Fox, High Performance Parallel Computing with
Clouds and Cloud Technologies, Cloud Computing and Software Services: Theory and Techniques, CRC
Press (Taylor and Francis), 2010.
[44] T. Hoefler, A. Lumsdaine, J. Dongarra, Towards Efficient MapReduce Using MPI, in: Recent Advances
in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science: vol.
5759, Springer Verlag, Espoo Finland, 2009, pp. 240–249.
[45] S. Ibrahim, H. Jin, B. Cheng, H. Cao, S. Wu, L. Qi, CLOUDLET: towards mapreduce implementation
on virtual machines, in: Proceedings of the 18th ACM International Symposium on High Performance
Distributed Computing, ACM, Garching, Germany, 2009, pp. 65–66.
[46] T. Sandholm, K. Lai, MapReduce optimization using regulated dynamic prioritization, in: Proceedings of
the eleventh international joint conference on measurement and modeling of computer systems, ACM,
Seattle, WA, 2009, pp. 299–310.
[47] Wikipedia, MapReduce, http://en.wikipedia.org/wiki/MapReduce, 2010 (accessed 06.11.10).
[48] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in: Presentation at
OSDI-2004 Conference. http://labs.google.com/papers/mapreduce-osdi04-slides/index.html, 2004
(accessed 6.11.10).
[49] R. Lammel, Google’s MapReduce programming model – Revisited, Sci. Comput. Prog. 68 (3) (2007)
208–237.
[50] M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz, I. Stoica, Improving MapReduce performance in het-
erogeneous environments, in: Proceedings of the 8th USENIX conference on operating systems design
and implementation, USENIX Association, San Diego, California, 2008, pp. 29–42.
[51] N. Vasic, M. Barisits, V. Salzgeber, D. Kostic, Making cluster applications energy-aware, in: Proceed-
ings of the 1st workshop on automated control for datacenters and clouds, ACM, Barcelona, Spain,
2009, pp. 37–42.
[52] D.J. DeWitt, E. Paulson, E. Robinson, J. Naughton, J. Royalty, S. Shankar, et al., Clustera: an integrated
computation and data management system, in: Proc. VLDB Endow, 2008, 1(1), pp. 28–41.
[53] S. Ghemawat, H. Gobioff, S. Leung, The Google File System, in: 19th ACM Symposium on Operating
Systems Principles, 2003, pp. 20–43.
[54] Google, Introduction to Parallel Programming and MapReduce, http://code.google.com/edu/parallel/
mapreduce-tutorial.html, 2010.
[55] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, et al., Twister: a runtime for iterative
MapReduce, in: Proceedings of the First International Workshop on MapReduce and Its Applications of
ACM HPDC 2010 Conference, ACM, Chicago, IL, 20–25 June 2010.
[56] B. Zhang, Y. Ruan, T. Wu, J. Qiu, A. Hughes, G. Fox, Applying Twister to Scientific Applications, in:
CloudCom 2010, IUPUI Conference Center, Indianapolis, 30 November–3 December 2010.
[57] G. Malewicz, M.H. Austern, A. Bik, J.C. Dehnert, I. Horn, N. Leiser, et al., Pregel: A System for Large-
Scale Graph Processing, in: International conference on management of data, Indianapolis, Indiana, 2010,
pp. 135–146.
[58] Y. Bu, B. Howe, M. Balazinska, M.D. Ernst, HaLoop: Efficient Iterative Data Processing on Large
Clusters, in: The 36th International Conference on Very Large Data Bases, VLDB Endowment, Vol. 3,
Singapore, 13–17 September 2010.
[59] M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica, Spark: Cluster Computing with Working
Sets, in: 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud ’10), Boston, 22 June
2010.
[60] SALSA Group, Iterative MapReduce, http://www.iterativemapreduce.org/, 2010.
[61] Yahoo, Yahoo! Hadoop Tutorial, http://developer.yahoo.com/hadoop/tutorial/index.html, 2010.
[62] G. Coulouris, J. Dollimore, T. Kindberg, Distributed Systems: Concepts and Design. International Com-
puter Science Series, 4th ed., Addison-Wesley, 2004.
[63] T. White, Hadoop: The Definitive Guide, Second ed., Yahoo Press, 2010.
[64] Apache, HDFS Overview, http://hadoop.apache.org/hdfs/, 2010.
[65] Apache, Hadoop MapReduce, http://hadoop.apache.org/mapreduce/docs/current/index.html, 2010.
[66] J. Venner, Pro Hadoop, first ed., Apress, 2009, ISBN:978-1430219422.
[67] Apache! Pig! (part of Hadoop), http://pig.apache.org/, 2010.
[68] A. Choudhary, G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, S. Ranka, et al., Unified compilation of
Fortran 77D and 90D, ACM Lett. Program. Lang. Syst. 2 (1–4) (1993) 95–114.
[69] Yahoo, Pig! Tutorial. http://developer.yahoo.com/hadoop/tutorial/module6.html#pig.
[70] Systems@ETH Zurich, Massively Parallel Data Analysis with MapReduce. Lectures on MapReduce,
Hadoop and Pig Latin. http://www.systems.ethz.ch/education/past-courses/hs08/map-reduce/map-reduce/
lecture-slides, 2008 (accessed 07.11.10).
[71] G. Fox, R.D. Williams, P.C. Messina, Parallel computing works! Morgan Kaufmann Publishers, 1994.
[72] J. Ekanayake, T. Gunarathne, J. Qiu, G. Fox, S. Beason, J. Choi, et al., Applicability of DryadLINQ to
Scientific Applications, Community Grids Laboratory, Indiana University, http://grids.ucs.indiana.edu/
ptliupages/publications/DryadReport.pdf, 2010.
[73] J. Qiu, J. Ekanayake, T. Gunarathne, J. Choi, S. Bae, Y. Ruan, et al., Data Intensive Computing
for Bioinformatics, http://grids.ucs.indiana.edu/ptliupages/publications/DataIntensiveComputing_
BookChapter.pdf, 2009.
[74] J. Ekanayake, T. Gunarathne, J. Qiu, Cloud Technologies for Bioinformatics Applications, IEEE Trans.
Parallel Distrib. Syst., (2010).
[75] J. Qiu, T. Gunarathne, J. Ekanayake, J. Choi, S. Bae, H. Li, et al., Hybrid Cloud and Cluster Computing
Paradigms for Life Science Applications, in: 11th Annual Bioinformatics Open Source Conference
BOSC, Boston, 9–10 July 2010.
[76] M. Burrows, The Chubby Lock Service for Loosely-Coupled Distributed Systems, in: OSDI’06:
Seventh Symposium on Operating System Design and Implementation, USENIX, Seattle, WA, 2006,
pp. 335–350.
[77] G.C. Fox, A. Ho, E. Chan, W. Wang, Measured characteristics of distributed cloud computing infrastruc-
ture for message-based collaboration applications, in: Proceedings of the 2009 International Symposium
on Collaborative Technologies and Systems, IEEE Computer Society, 2009, pp. 465–467.
[78] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, et al., Dynamo: Amazon’s
highly available key-value store, SIGOPS Oper. Syst. Rev. 41 (6) (2007) 205–220.
[79] Eucalyptus LLC, White Papers. http://www.eucalyptus.com/whitepapers.
[80] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, et al., The Eucalyptus Open-
Source Cloud-Computing System, in: 9th IEEE/ACM International Symposium on Cluster Computing
and the Grid, CCGRID ’09, Shanghai, 18–21 May 2009, pp. 124–131.
[81] K. Keahey, I. Foster, T. Freeman, X. Zhang, Virtual workspaces: achieving quality of service and quality
of life in the Grid, Scientific Prog. J. 13 (4) (2005) 265–275.
[82] Nimbus, Cloud computing for science, http://www.nimbusproject.org, 2010.
[83] Nimbus, Frequently Asked Questions, http://www.nimbusproject.org/docs/current/faq.html, 2010.
[84] Django, High-Level Python Web Framework, http://www.djangoproject.com/, 2010.
[85] Amazon, Simple Storage Service API Reference: API Version 2006-03-01, http://awsdocs.s3.amazonaws
.com/S3/latest/s3-api.pdf, 2006.
[86] boto, Python interface to Amazon Web Services, http://code.google.com/p/boto/, 2010.
[87] S3tools project, Open source tools for accessing Amazon S3 – Simple Storage Service, http://s3tools.org/
s3tools, 2010.
[88] bitbucket, JetS3t: open-source Java toolkit and application suite for Amazon Simple Storage Service
(Amazon S3), Amazon CloudFront content delivery network, and Google Storage. http://bitbucket.org/
jmurty/jets3t/wiki/Home.
[89] Amazon, Amazon Elastic Compute Cloud (Amazon EC2). http://aws.amazon.com/ec2.
[90] OpenNebula, industry standard open source cloud computing tool. http://opennebula.org/.
[91] I.M. Llorente, R. Moreno-Vozmediano, R.S. Montero, Cloud Computing for On-Demand Grid Resource
Provisioning, in: Advances in Parallel Computing: High Speed and Large Scale Scientific Computing
vol. 18, IOS Press, 2009, pp. 177–191.
[92] R. Monteroa, R. Moreno-Vozmediano, I. Llorente, An elasticity model for high throughput computing
clusters, J. Parallel Distrib. Comput. (May) (2010).
[93] B. Sotomayor, R.S. Montero, I.M. Llorente, I. Foster, Capacity Leasing in Cloud Systems Using the
OpenNebula Engine, in: 2008 Workshop on Cloud Computing and Its Applications (CCA08), Chicago,
IL, 2008, http://www.cca08.org/papers/Paper20-Sotomayor.pdf.
[94] libvirt, virtualization API. http://libvirt.org.
[95] B. Sotomayor, R.S. Montero, I.M. Llorente, I. Foster, Virtual infrastructure management in private and
hybrid clouds, IEEE Internet Comp. 13 (5) (2009) 14–22.
[96] ElasticHosts, Flexible servers in the cloud. http://www.elastichosts.com/.
[97] Sector/Sphere, High Performance Distributed File System and Parallel Data Processing Engine. http://
sector.sourceforge.net.
[98] Y. Gu, R. Grossman, Sector/Sphere: A Distributed Storage and Computing Platform. SC08 Poster, http://
sector.sourceforge.net/pub/sector-sc08-poster.pdf, 2008.
[99] Y. Gu, R. Grossman, Lessons learned from a year’s worth of benchmarks of large data clouds, in: Pro-
ceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, ACM, Portland,
Oregon, 2009, pp. 1–6.
[100] Y. Gu, R. Grossman, UDT: UDP-based data transfer for high-speed wide area networks, Comput. Netw.
51 (7) (2007) 1777–1799.
[101] FUSE, Filesystem in Userspace. http://fuse.sourceforge.net.
[102] Y. Gu, R.L. Grossman, Sector and sphere: the design and implementation of a high-performance data
cloud, Phil. Trans. R. Soc. A 367 (2009) 2429–2445.
[103] Open Stack, Open Source, Open Standards Cloud, http://openstack.org/index.php, 2010.
[104] VSCSE, Big Data for Science. Virtual Summer School hosted by SALSA group at Indiana University,
26 July–30 July 2010, http://salsahpc.indiana.edu/tutorial/.
[105] T. Hey, The Fourth Paradigm: Data-Intensive Scientific Discovery, http://research.microsoft.com/en-us/
um/redmond/events/TonyHey/21216/player.htm, 2010.
[106] T. Hey, S. Tansley, K. Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery, http://research
.microsoft.com/en-us/collaboration/fourthparadigm/, 2009 (accessed 07.11.10).
[107] SALSA Group, Catalog of Cloud Material, http://salsahpc.indiana.edu/content/cloud-materials,
2010.
[108] Microsoft Research, Cloud Futures Workshop, http://research.microsoft.com/en-us/events/cloudfu-
tures2010/default.aspx, 2010.
[109] T. Chou, Introduction to Cloud Computing: Business and Technology, Active Book Press, LLC, 2010,
p. 252, http://www.lulu.com/items/volume_67/8215000/8215197/1/print/8215197.pdf.
[110] R. Buyya, C. Yeo, S. Venugopal, J. Broberg, I. Brandic, Cloud computing and emerging IT platforms:
vision, hype, and reality for delivering computing as the 5th utility, Future Gener. Comput. Syst. 25 (6)
(2009) 599–616.
[111] P. Chaganti, Cloud computing with Amazon Web Services, Part 1: Introduction — When it’s
smarter to rent than to buy, http://www.ibm.com/developerworks/architecture/library/ar-cloudaws1/,
2008.
[112] P. Chaganti, Cloud computing with Amazon Web Services, Part 2: Storage in the cloud with Amazon Simple Storage
Service (S3) – Reliable, flexible, and inexpensive storage and retrieval of your data, http://www.ibm
.com/developerworks/architecture/library/ar-cloudaws2/, 2008.
[113] P. Chaganti, Cloud computing with Amazon Web Services, Part 3: Servers on demand with EC2, http://
www.ibm.com/developerworks/architecture/library/ar-cloudaws3/, 2008.
[114] M.R. Palankar, A. Iamnitchi, M. Ripeanu, S. Garfinkel, Amazon S3 for science grids: a viable solution? in:
Proceedings of the 2008 International Workshop on Data-aware Distributed Computing, ACM, Boston, MA,
2008, pp. 55–64.
[115] W. Sun, K. Zhang, S. Chen, X. Zhang, H. Liang, Software as a Service: An Integration Perspective, in:
Fifth International Conference Service-Oriented Computing – ICSOC, Lecture Notes in Computer
Science, Vol. 4749, Springer Verlag, Vienna Austria, 2007, pp. 558–569.
[116] G. Lakshmanan, Cloud Computing. Relevance to Enterprise, http://www.infosys.com/cloud-computing/
white-papers/Documents/relevance-enterprise.pdf, 2009.
[117] N. Leavitt, Is cloud computing really ready for prime time? Computer 42 (1) (2009) 15–20.
[118] G. Lin, G. Dasmalchi, J. Zhu, Cloud Computing and IT as a Service: Opportunities and Challenges, in:
Web Services, ICWS ’08, IEEE, Beijing, 23–26 September 2008.
[119] D.S. Linthicum, Cloud Computing and SOA Convergence in Your Enterprise: A Step-by-Step Guide,
Addison-Wesley Professional, 2009.
[120] L. Mei, W.K. Chan, T.H. Tse, A Tale of Clouds: Paradigm Comparisons and Some Thoughts on
Research Issues, in: Asia-Pacific Services Computing Conference. APSCC ’08 IEEE, Yilan, Taiwan,
9–12 December 2008, pp. 464–469.
[121] G. Fox, S. Bae, J. Ekanayake, X. Qiu, H. Yuan, Parallel Data Mining from Multicore to Cloudy Grids,
book chapter of High Speed and Large Scale Scientific Computing, IOS Press, Amsterdam, 2009, http://
grids.ucs.indiana.edu/ptliupages/publications/CetraroWriteupJune11-09.pdf.
[122] J. Ekanayake, G. Fox, High Performance Parallel Computing with Clouds and Cloud Technologies, in:
First International Conference CloudComp on Cloud Computing, Munich, Germany, 2009.
[123] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, et al., BigTable: a distributed
storage system for structured data, ACM Trans. Comput. Syst. 26 (2) (2008) 1–26.
[124] E. Deelman, G. Singh, M. Livny, B. Berriman, J. Good, The cost of doing science on the cloud: the
Montage example, in: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press,
Austin, Texas, 2008, pp. 1–12, http://www.csd.uwo.ca/faculty/hanan/cs843/papers/ewa-ec2.pdf.
[125] C. Hoffa, G. Mehta, T. Freeman, E. Deelman, K. Keahey, B. Berriman, et al., On the Use of Cloud
Computing for Scientific Workflows, in: Proceedings of the 2008 Fourth IEEE International Conference
on eScience, IEEE Computer Society, 2008, pp. 640–645.
[126] N. Paton, A. Marcelo, T. De Aragão, K. Lee, A. Alvaro, R. Sakellariou, Optimizing utility in cloud
computing through autonomic workload execution, Bulletin of the IEEE Computer Society Technical
Committee on Data Engineering, http://www.cs.man.ac.uk/~alvaro/publications/TCDEBull09.pdf,
2009.
[127] B. Rochwerger, et al., The Reservoir model and architecture for open federated cloud computing, IBM
J. Res. Dev. 53 (4) (2009) 4:1–4:11.
HOMEWORK PROBLEMS
Problem 6.1
Visit the Google App Engine (GAE) web site, download the SDK, and read the Python Guide or
the Java Guide to get started. Note that GAE only accepts Python, Ruby, and Java programming
languages. The platform does not provide any IaaS services.
a. Develop a specific cloud application using available software services such as Gmail, Docs, or
CRM on the GAE platform. Test your applications on the GAE platform.
b. Report your application development experience and experimental results in terms of some
selected performance measures such as job queuing time, execution time, resource utilization
rate, or some QoS attributes such as goal achievement, success rate, fault tolerance, and cost-
effectiveness.
c. Change the problem size or data set size and platform configuration to study the scalability and
efficiency issues in your GAE experiments.
Problem 6.2
Write application code to run on the GAE platform for backup storage of a large amount of
your personal, family, or company data and records such as photos, video, music, sales receipts,
documents, news media, inventory, market records, financial matters, supply chain information,
human resources, public data sets, and so forth. Note that strict privacy protection is required
here. Minimizing the storage cost is another objective. The access path, software development tools, and platform information for using GAE can be found in Chapter 4. You should explain your code development experience and report the results of using the GAE
platform.
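As a starting point, the sketch below shows one hedged way a backup record might be persisted with the GAE Java low-level Datastore API; the kind name BackupItem, the property names, and the use of a Blob property for small payloads are illustrative assumptions, and large files would normally go to the Blobstore instead.

import com.google.appengine.api.datastore.Blob;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;

// Persists one backup record (metadata plus a small payload) in the GAE datastore.
// The caller is responsible for encrypting the payload if strict privacy is required.
public class BackupStore {
    private final DatastoreService datastore =
        DatastoreServiceFactory.getDatastoreService();

    public void saveItem(String owner, String fileName, byte[] content) {
        Entity item = new Entity("BackupItem");          // "BackupItem" is an assumed kind name
        item.setProperty("owner", owner);                // lets restore queries filter by user
        item.setProperty("fileName", fileName);
        item.setProperty("uploadedAt", System.currentTimeMillis());
        item.setProperty("payload", new Blob(content));  // suitable only for small payloads
        datastore.put(item);                             // each write is billed, so batch when possible
    }
}

A companion query on the owner property would list a user's records for restore, which is where the privacy protection and storage-cost objectives of this problem come into play.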
Problem 6.3
Get set up to use Amazon Web Services; instructors can use the program Amazon provides at http://aws.amazon.com/education/ to arrange free education accounts for their students. Test the AWS platform with a SimpleDB database application.
This application could be similar to that chosen for Problems 6.1 and 6.2. You should study the
system’s performance, scalability, throughput, efficiency, resource utilization, fault tolerance, and
cost-effectiveness.
Problem 6.4
Design and request an EC2 configuration on the AWS platform for parallel multiplication of two
very large matrices with an order exceeding 50,000.
a. Report your experimental results including execution time, speed performance, VM instances
initiated, compute unit and storage utilization, and service charge experienced.
b. You can also study related issues such as scalability, throughput, efficiency, resource utilization,
fault tolerance, and cost-effectiveness in this scientific cloud experiment.
Problem 6.5
Repeat Problem 6.4 on the AWS platform for an Amazon S3 application. This application could be
similar to that chosen for Problems 6.1–6.2. You can also study the performance, scalability,
throughput, efficiency, resource utilization, fault tolerance, cost-effectiveness, and related issues.
Problem 6.6
Study some reported EC2 or S3 applications on the AWS platform deployed by businesses, the service industry, or large enterprises. Many examples and success stories are told in Chou's book [109]. For
example, Vertica Systems applies EC2 in DBMS (database management system) applications. Eli
Lilly uses EC2 in drug development. Animoto offers an online service to facilitate personal video
production. Contact a few of those service companies for technical implementation and service
details. Submit a study report with suggestions on improving the selected compute and storage
service applications.
Problem 6.7
Visit the Microsoft Windows Azure Developer Center. You can download the Azure development
kit to run a local version of Azure. Design an application experiment and test it on a local compu-
ter, such as your desktop, a notebook computer, or a university workstation or server. Report your experiences in using the Azure platform.
Problem 6.8
In the MapReduce programming model, there is a special case in which only the map phase is implemented, also known as a "map-only" problem. This approach lets existing applications/binaries achieve high throughput by running in a parallel fashion; in other words, it helps stand-alone programs exploit large-scale computing capability. The goal of this exercise is to write a Hadoop "map-only" program around the bioinformatics application BLAST (NCBI BLAST+).
With the given pseudocode, you need to implement the map() interface within RunnerMap.java
(http://salsahpc.indiana.edu/tutorial/source_code/Hadoop-Blast-sketch.zip), which executes an exter-
nal Java process to run the stand-alone BLAST for each map task with the assigned input query.
The “input.fa” needs to be split into chunks (chunk[1-n].fa) as a set of input queries. This set of
queries is stored on the local disk; you may need to upload them to HDFS before the Hadoop
BLAST program executes. The HDFS directory name is one of the program parameters. Then, the
Hadoop framework assigns an input query, formed as a <key,value> pair such as <filename,file
path on HDFS>, to each worker.
The <key,value> generation is handled for you if you use the provided DataFileInputFormat.java and FileRecordReader.java files as the InputFormatClass. The BLAST binary requires a database to find matching hits for the input queries. This database can either be stored in a shared file system or placed on each node's local disk via the distributed cache, depending on how you write the program; either solution is acceptable. For the final step, collect the results from each mapper and merge them into a single output file, result.fa. In addition, select the proper BLAST+ binary (32-bit or 64-bit) matching the Linux OS architecture (i686/i386 or x86_64).
For further suggestions, see the SalsaHPC Hadoop BLAST Tutorial with the pseudocode (http://
salsahpc.indiana.edu/tutorials/hadoopblast.html).
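A hedged sketch of what the map() body in RunnerMap.java might look like is given below; the Text/Text key and value types, the configuration property names (blast.binary, blast.db), and the exact BLAST+ command line are assumptions for illustration, so the skeleton classes in the tutorial ZIP should take precedence.

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only task: each map() call receives one query chunk and runs the stand-alone
// BLAST+ binary on it as an external process; no reduce phase is configured.
public class BlastRunnerMap extends Mapper<Text, Text, Text, Text> {

    public void map(Text fileName, Text hdfsPath, Context context)
            throws IOException, InterruptedException {
        // Copy the assigned query chunk from HDFS to the local working directory.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path localQuery = new Path("./" + fileName.toString());
        fs.copyToLocalFile(new Path(hdfsPath.toString()), localQuery);

        // Launch BLAST+ as an external process; the binary path, database path,
        // and flags below are placeholders read from the job configuration.
        String blastBin = context.getConfiguration().get("blast.binary", "./bin/blastx");
        String database = context.getConfiguration().get("blast.db", "./db/nr");
        String outFile = fileName.toString() + ".out";
        ProcessBuilder pb = new ProcessBuilder(
            blastBin, "-query", localQuery.toString(), "-db", database, "-out", outFile);
        pb.inheritIO();                      // send BLAST's stdout/stderr to the task logs
        int rc = pb.start().waitFor();       // block until this chunk is finished

        // Emit (chunk name, local result file) so a final step can merge all outputs.
        context.write(fileName, new Text(rc == 0 ? outFile : "FAILED"));
    }
}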
Problem 6.9
Visit the Manjrasoft Aneka Software Center at www.manjrasoft.com/. Download the Aneka soft-
ware and create an enterprise cloud using LAN-connected PCs in your university/student laboratory,
and run various example applications that come with Aneka. Write simple parallel programs using
Aneka’s Task, Thread, and MapReduce programming models and carry out their execution on
workers ranging from 2 to 20 in steps of 2.
Problem 6.10
Repeat applications in Problems 6.1–6.5 using the academic/open source packages described in
Section 6.5—namely Eucalyptus, Nimbus, OpenStack, OpenNebula, and Sector/Sphere. This soft-
ware is available on the FutureGrid web site (http://www.futuregrid.org) and includes a growing
number of tutorials.
Problem 6.11
Test the large-scale matrix multiplication program of Problem 6.4 on two or three cloud platforms
(GAE, AWS, and/or Azure). You can also choose a data-intensive application such as a large-scale
search or business processing application. Implement the application on at least two or all three
cloud platforms, separately. The major objective is to minimize the execution time of the applica-
tion. The minor objective is to minimize the user service costs.
a. Run the service on the GAE platform.
b. Run the service on the AWS platform.
c. Run the service on the Windows Azure platform.
d. Compare your compute and storage costs, design experiences, and experimental results on all
three cloud platforms. Report the relative performance and QoS results measured.
Problem 6.12
There are three implementations of the MapReduce engine and its extensions: Google MapReduce,
Apache Hadoop, and Microsoft Dryad. Complete the missing 14 entries to show their similarities
and differences in 6 technical dimensions. 4 entries are already given below as example answers.
You do not have to explain the table entries in detail; just enter the names of the languages, models, methods, mechanisms, and platforms applied.
Problem 6.13
Using a MapReduce programming model supported in Aneka, develop a program for image filtering
of hundreds of photos you have taken using a digital camera. Conduct and report the results of your
scalability experiments by varying the number of computing nodes/workers and images of different
resolutions or file size on an Aneka-based enterprise cloud.
Problem 6.14
Examine the tutorials at http://www.futuregrid.org/tutorials and compare the word count application
on Hadoop using Eucalyptus, Nimbus, and at least one other of OpenStack, OpenNebula, or Sector/
Sphere.
Problem 6.15
Examine the tutorials at http://salsahpc.indiana.edu and http://www.iterativemapreduce.org/samples.
html. Compare Hadoop and Twister on cases specified by the instructor from examples given there.
Discuss their relative strengths and weaknesses.
Problem 6.16
The following program, named WebVisCounter, is written with Hadoop. Trace through the program or run it on a cloud platform to which you have access. Analyze the programming tasks performed by this Hadoop program and learn from using the Hadoop library.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// WebVisCounter mapper: extracts the operating system name from the user-agent
// field of each web server log line and emits (OS version, 1).
public class WebVisCounter {

    public static class OSCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable one = new IntWritable(1);

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            // Skip ahead to the token that carries the user-agent field
            // (the 8th whitespace-separated token in this log format).
            String userInfo = "";
            int i = 0;
            while (tokenizer.hasMoreTokens() && i != 8) {
                i++;
                userInfo = tokenizer.nextToken();
            }
            // The OS version string lies between '(' and ';' inside the user-agent field.
            int startIndex = 0;
            i = 0;
            while (i < userInfo.length() && userInfo.charAt(i) != ';') {
                if (userInfo.charAt(i) == '(') {
                    startIndex = i + 1;
                }
                i++;
            }
            int endIndex = i;
            if (endIndex > startIndex) {
                Text osVersion = new Text(userInfo.substring(startIndex, endIndex));
                context.write(osVersion, one);   // emit (OS version, 1)
            }
        }
    }
}
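The listing above shows only the mapper. A minimal driver sketch (not part of the original program) that wires the mapper to Hadoop's built-in IntSumReducer might look as follows, assuming a Hadoop 2.x release:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Configures the WebVisCounter job: the mapper above plus a sum combiner/reducer.
public class WebVisCounterDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "WebVisCounter");
        job.setJarByClass(WebVisCounterDriver.class);
        job.setMapperClass(WebVisCounter.OSCountMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // partial sums on each mapper node
        job.setReducerClass(IntSumReducer.class);    // final count per OS version
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // web server log directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // counts per OS version
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}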
Problem 6.17
Twister extends the MapReduce programming model to iterative computations. Many data analysis techniques require such iterations; K-means clustering, for example, is an application in which multiple rounds of MapReduce computation are needed to reach the overall result. Twister is an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. For this
problem, study the iterative MapReduce programming model and learn how to implement the
K-means algorithm with Twister.
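Twister's own API is not reproduced here, so the following framework-agnostic Java sketch only illustrates the per-iteration map and reduce logic that an iterative MapReduce runtime such as Twister would cache and repeat; all class and method names are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One K-means iteration expressed as a map step and a reduce step. An iterative
// runtime keeps the static input points cached across iterations; this driver
// simply re-runs the two steps until the iteration budget is exhausted.
public class KMeansIterationSketch {

    // Map: assign each point to its nearest centroid, producing (centroid index, point).
    static Map<Integer, List<double[]>> mapStep(List<double[]> points, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = 0;
                for (int i = 0; i < p.length; i++)
                    d += (p[i] - centroids[c][i]) * (p[i] - centroids[c][i]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            groups.computeIfAbsent(best, k -> new ArrayList<>()).add(p);
        }
        return groups;
    }

    // Reduce: average the points assigned to a centroid to obtain the new centroid.
    static double[] reduceStep(List<double[]> assigned, int dim) {
        double[] mean = new double[dim];
        for (double[] p : assigned)
            for (int i = 0; i < dim; i++) mean[i] += p[i] / assigned.size();
        return mean;
    }

    // Driver loop: repeat map + reduce for a fixed number of iterations.
    static double[][] run(List<double[]> points, double[][] centroids, int iterations) {
        for (int it = 0; it < iterations; it++) {
            Map<Integer, List<double[]>> groups = mapStep(points, centroids);
            for (Map.Entry<Integer, List<double[]>> e : groups.entrySet())
                centroids[e.getKey()] = reduceStep(e.getValue(), centroids[0].length);
        }
        return centroids;
    }
}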
Problem 6.18
This problem concerns implementing PageRank with DryadLINQ. PageRank is a well-known link analysis algorithm that assigns a numerical value to each element of a hyperlinked set of web pages, reflecting the probability that a random surfer will visit that page. Implementing PageRank with MapReduce is difficult in terms of both efficiency and programmability, owing to the iterative structure and the random access pattern over large-scale web graphs. DryadLINQ provides SQL-like query APIs that help programmers implement PageRank with little effort, and the Dryad infrastructure makes it easy to scale the application out. Learn how to implement a simple PageRank application with DryadLINQ.
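As background for the exercise, here is a framework-agnostic Java sketch of one PageRank iteration expressed in map/reduce form; the damping factor value and the in-memory data structures are assumptions, and DryadLINQ would express the same dataflow as LINQ queries over partitioned tables.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One PageRank iteration in map/reduce form: the map step emits each page's rank
// share to its out-links, and the reduce step sums the shares per page and applies
// the damping formula.
public class PageRankIterationSketch {
    static final double DAMPING = 0.85;   // assumed damping factor

    static Map<String, Double> iterate(Map<String, List<String>> outLinks,
                                       Map<String, Double> ranks) {
        int n = outLinks.size();
        Map<String, Double> contributions = new HashMap<>();
        // "Map" phase: divide each page's current rank among its out-links.
        for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
            double share = ranks.get(e.getKey()) / Math.max(1, e.getValue().size());
            for (String target : e.getValue())
                contributions.merge(target, share, Double::sum);
        }
        // "Reduce" phase: sum the incoming shares and apply damping.
        Map<String, Double> next = new HashMap<>();
        for (String page : outLinks.keySet())
            next.put(page, (1 - DAMPING) / n + DAMPING * contributions.getOrDefault(page, 0.0));
        return next;
    }
}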
Problem 6.19
Using a thread programming model supported in Aneka, develop a program for parallel multiplication
of two very large square matrices of order greater than 500. Conduct and report the results of your
scalability experiments by varying the order of the matrix from 500 to 1,000 in steps of 100 on com-
puting nodes/workers varying from 10 to 50 in steps of 10 in an Aneka-based enterprise cloud.
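The following plain-Java sketch shows the row-band decomposition such a program relies on; Aneka's Thread programming model would wrap comparable work units for execution on remote workers, so treat this only as the local logic to be ported.

// Row-band parallel multiplication of two square matrices with plain Java threads.
public class ParallelMatMul {
    public static double[][] multiply(double[][] a, double[][] b, int workers)
            throws InterruptedException {
        final int n = a.length;
        final double[][] c = new double[n][n];
        Thread[] pool = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int lo = w * n / workers, hi = (w + 1) * n / workers;
            pool[w] = new Thread(() -> {
                // Each worker computes the contiguous band of result rows [lo, hi).
                for (int i = lo; i < hi; i++)
                    for (int j = 0; j < n; j++) {
                        double sum = 0;
                        for (int x = 0; x < n; x++) sum += a[i][x] * b[x][j];
                        c[i][j] = sum;
                    }
            });
            pool[w].start();
        }
        for (Thread t : pool) t.join();   // wait for all bands before returning the product
        return c;
    }
}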
Problem 6.20
Compare Pig Latin and "bare" Hadoop for implementing a data-intensive application specified by
your instructor. Discuss their relative advantages and shortcomings.
PART 3
Grids, P2P, and the Future Internet

CHAPTER 7 Grid Computing Systems and Resource Management
SUMMARY
This chapter is devoted to building grid systems for distributed computing over an integrated
network of supercomputers, server clusters, and data centers. First we examine the open grid service
architecture. We study a dozen grid architectures and their software environments, including the TeraGrid, EGEE, DataGrid, ChinaGrid, BOINC, and Grid'5000. These case studies illustrate grid design principles, service models, and implementation technologies. Several grid middleware packages, namely Condor-G, SGE, the Globus Toolkit, and CGSP, are detailed. Finally, we cover grid trends
and security infrastructure. Past and current grid projects are reviewed against emerging
technologies.
Table 7.1 Four Grid Families Identified in the Great Global Grid (GGG) (Forbes, 2001)
Grid Family / Representative Grid Systems and References
Computational Grids or Data Grids: TeraGrid (US), EGEE (EU), DataGrid (EU), Grid'5000 (France), ChinaGrid (China), NAS (NASA), LCG (CERN), e-Science (UK), D-Grid (Nordic), FutureGrid (US), etc.
Information Grids or Knowledge Grids: Semantic Grid, Ontology Platform, BOINC (Berkeley), D4Science, Einstein@Home, Information Power Grid (NASA)
Business Grids: BEinGrid (EU), HP eSpeak, IBM WebSphere, Sun Grid Engine, Microsoft .NET, etc.
P2P/Volunteer Grids: SETI@Home, Parasitic Grid, FightAIDS@Home, Folding@Home, GIMPS, etc.
Project supported by DARPA has promoted the maturity of grid technology with a rich collection
of software and middleware tools for grid computing. In 2007, the concept of cloud computing was put forward, which in many ways extends grid computing through virtualized data centers. In this opening section, we introduce major grid families and review the grid service evolution over
the past 15 years.
Grids differ from conventional HPC clusters. Cluster nodes are more homogeneous machines
that are better coordinated to work collectively and cooperatively. The grid nodes are heterogeneous
computers that are more loosely coupled together over geographically dispersed sites. In 2001,
Forbes Magazine advocated the emergence of the great global grid (GGG) as a new global infrastruc-
ture. This GGG evolved from the World Wide Web (WWW) technology we have enjoyed for many
years. Four major families of grid computing systems were suggested by the Forbes GGG categori-
zation as summarized in Table 7.1. We will study some of the listed grids in subsequent sections.
FIGURE 7.1
The layered grid service protocols and their relationship with the Internet service protocols.
(Courtesy of Foster, Kesselman, and Tuecke [11])
Table 7.2 Control Operations and Enquiries for Aggregating Grid Resources
Resources / Control Operations / Enquiries
Compute resources. Control operations: starting, monitoring, and controlling the execution of resultant processes; control over resources: advance reservation. Enquiries: hardware and software characteristics; relevant load information: current load and queue state.
Storage resources. Control operations: putting and getting files; control over resources allocated to data transfers: advance reservation. Enquiries: hardware and software characteristics; relevant load information: available space and bandwidth utilization.
Network resources. Control operations: control over resources allocated. Enquiries: network characteristics and load.
Code repositories. Control operations: managing versioned source and object code. Enquiries: software files and compile support.
Service catalogs. Control operations: implementing catalog query and update operations: a relational database. Enquiries: service order information and agreements.
resources. This connectivity could be established directly on physical networks or it could be built
with virtual networking technology.
The connectivity must support the grid fabric, including the network links and virtual private
channels. The fabric layer includes all computational resources, storage systems, catalogs, network
resources, sensors, and their network connections. The connectivity layer enables the exchange of
data between fabric layer resources. The five-layer grid architecture is closely related to the layered
Internet protocol stack shown on the right-hand side of Figure 7.1. The fabric layer corresponds to
the link layer in the Internet stack. The connectivity layer is supported by the network and transport
layers of the Internet stack. The Internet application layer supports the top three layers.
route and transfer data. The resource layer is responsible for sharing single resources. An interface
is needed to claim the static structure and dynamic status of local resources. The grid should be
able to accept resource requests, negotiate the Quality of Service (QoS), and perform the operations
specified in user applications.
The collective layer handles the interactions among a collection of resources. This layer
implements functions such as resource discovery, co-allocation, scheduling, brokering, monitoring,
and diagnostics. Other desired features include replication, grid-enabled programming, workload
management, collaboration, software discovery, access authorization, and community accounting
and payment. The application layer comprises mainly user applications. The applications interact
with components in other layers by using well-defined APIs (application programming interfaces)
and SDKs (software development kits).
recent examples include BOINC [7] and Folding@Home [14]. In practice, these virtual
grids can be viewed as virtual supercomputers.
resource demand. A flexible grid system should be designed to adapt to varying workloads.
In reality, physical organizations include a real company, a university, or a branch of government.
These real organizations often share some common objectives.
For example, several research institutes and hospitals may undertake some joint research
challenges together to explore a new cancer drug. Another concrete example is the joint venture
among IBM, Apple, and Motorola to develop PowerPC processors and their supporting software
in the past. The joint venture was based on the VO model. Grids definitely can promote the
concept of VOs. Still, joint ventures demand resources and labor from all participants. The
following example shows how two VOs or grid configurations can be formed out of three physi-
cal organizations.
FIGURE 7.2
Two VOs (labeled by X and Y) formed from getting specific resources from three physical organizations (A, B, and C).
also join an existing VO. A participant may leave a VO once its job is done. The dynamic nature of the resources in a VO poses a great challenge for grid computing. Resources have to cooperate closely to produce a rewarding result. Without an effective resource management system, the grid or VO may be inefficient and wasteful of resources.
Table 7.3 OGSA Grid Service Interfaces Developed by the OGSA Working Group
Port Type / Operation / Brief Description
Grid service. Find service data: query a grid service instance, including the handle, reference, primary key, home handle map, interface information, and service-specific information; extensible support for various query languages. Termination time: set (and get) the termination time for a grid service instance. Destroy: terminate a grid service instance.
Notification source. Subscribe to notification topic: subscribe to notifications of service events; allow delivery via third-party messaging services.
Notification sink. Deliver notification: carry out asynchronous delivery of notification messages.
Registry. Register service: conduct soft-state registration of Grid Service Handles (GSHs). Unregister service: unregister a GSH.
Factory. Create service: create a new grid service instance.
Handle map. Find by handle: return the Grid Service Reference (GSR) associated with the GSH.
the service can be created and destroyed dynamically; statefulness refers to the fact that one can
distinguish one service instance from another.
FIGURE 7.3
A GSH resolving to a different GSR for a migrated service instance before (shown on the left) and after (on the right) the migration at time T.
FIGURE 7.4
The OGSA security model implemented at various protection levels.
(Courtesy of I. Foster, et al., http://www.ogf.org/documents/GFD.80.pdf)
operations require little overhead, static strategies cannot adapt to changes in demand, bandwidth, and storage availability. Dynamic strategies can adjust the locations and number of data replicas according to changes in conditions (e.g., user behavior).
However, frequent data-moving operations can result in much more overhead than in static stra-
tegies. The replication strategy must be optimized with respect to the status of data replicas. For sta-
tic replication, optimization is required to determine the location and number of data replicas. For
dynamic replication, optimization may be determined based on whether the data replica is being cre-
ated, deleted, or moved. The most common replication strategies include preserving locality, mini-
mizing update costs, and maximizing profits.
7.1.4.2 Grid Data Access Models
Multiple participants may want to share the same data collection. To retrieve any piece of data, we
need a grid with a unique global namespace. Similarly, we desire to have unique file names. To
achieve these, we have to resolve inconsistencies among multiple data objects bearing the same
name. Access restrictions may be imposed to avoid confusion. Also, data needs to be protected to
avoid leakage and damage. Users who want to access data have to be authenticated first and then
authorized for access. In general, there are four access models for organizing a data grid, as listed
here and shown in Figure 7.5.
Monadic model: This is a centralized data repository model, shown in Figure 7.5(a). All the
data is saved in a central data repository. When users want to access some data they have to
submit requests directly to the central repository. No data is replicated for preserving data
locality. This model is the simplest to implement for a small grid. For a large grid, this model is
not efficient in terms of performance and reliability. Data replication is permitted in this model
only when fault tolerance is demanded.
Hierarchical model: The hierarchical model, shown in Figure 7.5(b), is suitable for building a
large data grid which has only one large data access directory. The data may be transferred from
the source to a second-level center. Then some data in the regional center is transferred to the
third-level center. After being forwarded several times, specific data objects are accessed directly
by users. Generally speaking, a higher-level data center has a wider coverage area. It provides
higher bandwidth for access than a lower-level data center. PKI security services are easier to
implement in this hierarchical data access model. The European Data Grid (EDG) to be studied
in Section 7.2.3 adopts this data access model.
Federation model: This data access model shown in Figure 7.5(c) is better suited for designing a
data grid with multiple sources of data supplies. Sometimes this model is also known as a mesh
model. The data sources are distributed to many different locations. Although the data is shared,
the data items are still owned and controlled by their original owners. According to predefined
access policies, only authenticated users are authorized to request data from any data source. This
mesh model may cost the most when the number of grid institutions becomes very large.
Hybrid model: This data access model is shown in Figure 7.5(d). The model combines the best
features of the hierarchical and mesh models. Traditional data transfer technology, such as FTP,
applies for networks with lower bandwidth. Network links in a data grid often have fairly high
bandwidth, and other data transfer models are exploited by high-speed data transfer tools such
as GridFTP developed with the Globus library. The cost of the hybrid model can be traded off
between the two extreme models for hierarchical and mesh-connected grids.
FIGURE 7.5
Four architectural models for building a data grid.
TeraGrid (USA), NSF, 2002 (Section 7.2.2): An open grid infrastructure spanning 11 resource sites, with 40 Gbps Internet2 backbone links, 2 Pflops of computing power, 50 PB of storage, and 100 specific databases.
DataGrid (EU), EU and CERN, 2001 (Section 7.2.3): Claimed as the largest grid on earth for data analysis in high-energy physics, environmental science, bioinformatics, etc.
Grid'5000 (France), French government, 2006 (Section 7.2.1): An experimental grid comprising 5,000 processor cores at nine sites in French cities for HPC and research use.
ChinaGrid (China), Education Ministry, 2005 (Section 7.2.4): An educational computing grid for research use, linking HPC systems in 100 Chinese universities.
NAS Grid (USA), NASA Ames Lab, 2008: Running genetic algorithms with the Condor cycle scavenger on a large number of Sun and SGI workstations.

Example 7.4 The EGEE Grid for Enabling Grids for E-sciencE
The EGEE grid infrastructure and projects were funded by the European Union in three stages between 2004 and 2010. EGEE stands for Enabling Grids for E-sciencE. The project is no longer active. However, the
distributed computing infrastructure built and nurtured by the projects DataGrid (2002-2004), EGEE-I, -II
and -III (2004-2010) is now supported by the European Grid Infrastructure. This long-term organization
coordinates National Grid Initiatives. Transitioning the care of the EGEE infrastructure to EGI is part of the
process for ensuring a vibrant and sustained European research community.
The EGEE grid infrastructure consists of a set of middleware services deployed on a worldwide collection of computational and storage resources. For production services, a large multi-science grid infrastructure involves 250 resource centers worldwide, providing some 40,000 CPUs and several petabytes of storage. This infrastructure is used on a daily basis by several thousand scientists across more than 200 virtual organizations. It is a stable, well-supported infrastructure, running the latest released versions of the gLite middleware. EGEE also has a pre-production service, which provides access to grid services to test and evaluate features of the middleware.
In addition, the pre-production service extends the middleware certification activity, helping to evaluate deployment procedures, interoperability, and basic software functionality. Furthermore, there is an EGEE Network Operations Center (ENOC), which handles the network operational coordination between EGEE
and other network providers such as GEANT2 and NRENs. In the long run, the primary goals of EGEE and
EGI projects are to promote global collaboration for scientific research between academic and industrial
applications. Reported EGEE application challenges include health and medicine, genomics and
bioscience, particle physics and astronomy, environmental science, engineering design, chemistry and
material science, and social sciences. Interested readers are referred to the web site: http://www.eu-egee
.org for details.
FIGURE 7.6
Overview of Grid'5000, located at nine resource sites in France and connected by 10 Gbps links (core counts per site: Lille 618, Orsay 684, Rennes 722, Nancy 574, Lyon 268, Bordeaux 650, Grenoble 272, Toulouse 436, Sophia 568).
(Courtesy of https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home)
X.509 certificates are used to enforce security in Grid’5000. The system uses dedicated private networks to
construct a trustable backbone network.
The Grid’5000 sites are connected via high-bandwidth links. Each user has a universal account. Each
site manages its local user accounts, and each site runs a Lightweight Directory Access Protocol (LDAP)
server, which maintains the same directory tree for all users. The user accounts at the local site are treated
with higher priority to read and write shared files. The user accounts at remote sites are only allowed to
read files locally. The account information of remote sites in the tree is synchronized periodically. From the
user’s point of view, a globally enabled account is installed in the entire grid system.
National Institute for Computational Sciences, Oak Ridge National Laboratory (ORNL), the Pittsburgh
Supercomputing Center (PSC), Purdue University, the San Diego Supercomputer Center (SDSC),
the Texas Advanced Computing Center (TACC), the University of Chicago/ Argonne National
Laboratory (ANL), and the National Center for Atmospheric Research (NCAR).
When the project started in 2001, TeraGrid had only four resource sites at ANL, SDSC, NCSA, and Caltech. In 2003, NSF added four sites to TeraGrid as well as established a third network hub,
in Atlanta. These new sites were at ORNL, Purdue University, Indiana University, and TACC.
TeraGrid construction was also made possible through key corporate partnerships with Sun Micro-
systems, IBM, Intel Corporation, Qwest Communications, Juniper, Myricom, Hewlett-Packard Com-
pany, and Oracle Corporation. At the time of this writing, the TeraGrid facility is entering full
production use. The next-generation petascale resources, called XD, may be added into one or two
national grid facilities in the United States from 2011 to 2016.
TeraGrid is a high-performance computing and communication (HPCC) system using the
Globus software/middleware tools to develop grids linking multiteraflop computers together. At the
time of this writing, the Internet backbone networks used to build TeraGrid include 40 Gbps links (four 10 Gigabit Ethernet links bundled together). Figure 7.7 shows five TeraGrid resource sites and their backbone networks. Major resource types at each grid site are identified. A 40 Gbps extensible backplane network connects two regional gateway hubs in Chicago and Los Angeles.
TeraGrid resources are integrated through SOA. Each resource provides a service in terms of
specific interfaces and operational support. Computational resources run a set of software packages
called Coordinated TeraGrid Software and Services (CTSS). CTSS provides a familiar user environ-
ment on all TeraGrid systems, allowing scientists to port code from one system to another. CTSS
also provides integrative functions such as single sign-on, remote job submission, workflow support,
and data movement tools. CTSS includes the Globus Toolkit, Condor, distributed accounting man-
agement software, verification and validation software, and a set of compilers, programming tools,
and environment variables.
TeraGrid resources are interconnected by a dedicated optical network, with each resource provi-
der site connecting at either 10 Gigabits per second or 30 Gigabits per second. TeraGrid users
access the facility through national research networks such as the Internet2 Abilene backbone and
National LambdaRail. TeraGrid users primarily come from U.S. universities. Academic researchers in the United States can obtain exploratory allocations of CPU hours. More extensive allocations involve proposal submissions and awards after a peer-review process. All allocation proposals are handled
through the TeraGrid web site, https://www.teragrid.org/web.
FIGURE 7.7
Five major TeraGrid resource sites, connected by a 40 Gbps backplane network between the LA and Chicago hubs.
(Courtesy of https://www.teragrid.org/web)
FIGURE 7.8
Hierarchical data distribution of the European Data Grid (EDG).
(http://eu-datagrid.web.cern.ch/eu-datagrid © CERN)
network directory. Resource dissemination is batched and periodically pushed to other parts of the
grid. Resource discovery in the EDG is decentralized and query-based. The scheduler uses a hier-
archical organization with an extensible scheduling policy. A special workload distribution facility
balances the analysis of jobs submitted from hundreds of physicists to different entry points in the
grid. Application monitoring and collection of user access patterns can be used to optimize global
access and data distribution.
The data accessor and data locator are used to map location-independent identifiers to location-
dependent identifiers. The data accessor is an interface encapsulating the details of the local file system
and mass storage systems such as Castor and HPSS. Several implementations of this generic interface
exist. The data locator makes use of the generic metadata manager, which is responsible for efficient
publishing and management of a distributed and hierarchical set of objects. Query optimization ensures
that for a given query an optimal migration and replication plan is produced. Such plans provide appro-
priate security mechanisms to cover connected resources, independent of organizational boundaries.
FIGURE 7.9
Major components of a grid resource management system (RMS) supported by Globus.
(Courtesy of I. Foster [17])
FIGURE 7.10
Hierarchical classification of grid job scheduling methods for grids and computer clusters.
and computation time. Some sample data sets and benchmark programs are needed to collect the
information for training the performance model.
The resources in a grid are dynamically allocated to different applications. In traditional distributed systems, the system administrator has full control of the resources. (P2P is not considered here, because job scheduling is not a key issue in P2P systems.) There are only two players (the system and
the user). The scheduling technologies can completely focus on performance. The optimization
objective of the scheduling can be system-centric or application-centric. However, the owner of the
resources has the right to determine the number and quality of resources to be shared in the grid.
The grid has three players: the grid system, resource provider, and resource customer.
Without any incentives, the resource providers will be reluctant to share more and better
resources. So the optimization objectives of scheduling in the grid have to take the benefit of
resource providers into account. The most common incentive is to allocate resources to jobs based on economic principles. In an economy-based grid computing environment, users express their QoS requirements and the cost they are willing to pay for different levels of QoS. The grid service providers need to work out a cost model for the various services.
FIGURE 7.11
Producer/consumer grid monitoring model implemented in the grid monitoring architecture (GMA).
security, and scalability. In particular, the discovery and transmission of grid monitoring data should be
handled separately. Related metadata describing the profiles of monitoring data needs to be stored in a
public location.
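To make the producer, consumer, and directory roles of the GMA concrete, the illustrative Java interfaces below (not a real GMA library) sketch the interaction pattern: producers register the event types they publish with the directory, consumers look up matching producers, and monitoring data then flows directly between producer and consumer rather than through the directory.

// Illustrative interfaces (not a standard API) for the GMA pattern. Producers
// register the event types they can publish with the directory; consumers look up
// a matching producer and then either subscribe (push) or query (pull) directly.
public final class GmaSketch {
    public interface Directory {
        void register(String eventType, Producer producer);   // called by producers
        Producer lookup(String eventType);                     // called by consumers
    }
    public interface Producer {
        void subscribe(Consumer consumer, String eventType);   // streaming delivery
        String query(String eventType);                        // one-shot request
    }
    public interface Consumer {
        void deliver(String eventType, String eventData);      // callback for pushed events
    }
}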
FIGURE 7.12
Architecture of CGSV module deployment.
(Courtesy of Y. Wu, et al. [56])
FIGURE 7.13
A generic framework of a grid accounting system for billing services.
GridBank [8], the accounting system of Gridbus, uses SOAP over Globus Toolkit sockets. Access from the
remote service is authenticated and authorized with this grid security infrastructure. Grids normally
consist of multiple real organizations. The accounting has to be performed across multiple domains,
so the accounting system must provide the access interface of standard grid services. Globus Toolkit
is the most popular and de facto standard for grid services.
FIGURE 7.14
An economy model for assessing grid computing services, developed at the University of Melbourne.
(Courtesy of Buyya, et al. [8])
FIGURE 7.15
A grid resource broker managing job scheduling and execution.
(Courtesy of Buyya and Bubendorfer (editors) [9])
core grid middleware such as Globus in UNIX/Linux environments. Multinode clusters need to be
presented as a single resource to the grid, and this can be achieved by deploying job management
systems such as the Sun Grid Engine on them.
In a grid environment where data needs to be federated for sharing among various interested
parties, data grid technologies such as SRB, Globus RLS, and EU DataGrid need to be deployed.
The user-level middleware needs to be deployed on resources responsible for providing resource
brokering and application execution management services. Users may even access these services via
web portals. Several grid resource brokers have been developed. Some of the more prominent
include Nimrod-G, Condor-G, GridWay, and Gridbus Resource Broker. The following 11 steps are
followed to aggregate the Grid resources:
1. The user composes his application as a distributed application (e.g., parameter sweep) using
visual application development tools.
2. The user specifies his analysis and QoS requirements and submits them to the grid resource
broker.
3. The grid resource broker performs resource discovery using the grid information service.
4. The broker identifies resource service prices by querying the grid market directory.
5. The broker identifies the list of data sources or replicas and selects the optimal ones.
6. The broker identifies computational resources that provide the required services.
7. The broker ensures that the user has necessary credit or an authorized share to utilize the
resources.
8. The broker scheduler analyzes the resources to meet the user’s QoS requirements.
9. The broker agent on a resource executes the job and returns the results.
10. The broker collates the results and passes them to the user.
11. The meter charges the user by passing the resource usage information to the accountant.
FIGURE 7.16
Architecture of the Gridbus resource broker and its interaction with many public or private grids/clouds.
(Courtesy of Buyya and Bubendorfer (editors) [9])
adapters are implemented as plug-in actuators that support deployment of jobs on resources accessible via
grid middleware such as Globus and enterprise cloud application platforms such as Aneka.
The broker also works with public clouds such as Amazon EC2 and supports remote job execution via
SSH on clusters managed by queuing systems such as PBS, Condor, and SGE. The broker enables
different types of scheduling depending on the objectives of the user and type of resources. At present, the
broker can accommodate compute, storage, network, and information resources with prices based on time
(e.g., 1 Grid Dollar/s) or capacity (e.g., 1 Grid Dollar for 1 MB). It can also accommodate user objectives
such as the fastest computation within the budget (time optimization), or the cheapest computation within
the deadline (cost optimization) for both compute and data-intensive applications. The compute-intensive
algorithms are based on those developed previously in Nimrod/G [10].
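As a rough illustration of these two objectives, the following Python sketch selects a resource under a deadline (cost optimization) or under a budget (time optimization). The Resource records, prices, and runtime estimates are hypothetical placeholders, not the Gridbus broker's actual interfaces.

from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    price_per_sec: float     # e.g., Grid Dollars per CPU-second (hypothetical rate)
    est_runtime_sec: float   # estimated completion time of the job on this resource

def cost_optimize(resources, deadline):
    # Cheapest feasible resource that still finishes before the deadline.
    feasible = [r for r in resources if r.est_runtime_sec <= deadline]
    return min(feasible, key=lambda r: r.price_per_sec * r.est_runtime_sec, default=None)

def time_optimize(resources, budget):
    # Fastest feasible resource whose total cost stays within the budget.
    feasible = [r for r in resources if r.price_per_sec * r.est_runtime_sec <= budget]
    return min(feasible, key=lambda r: r.est_runtime_sec, default=None)

pool = [Resource("clusterA", 0.5, 1200), Resource("cloudB", 2.0, 400)]
print(cost_optimize(pool, deadline=1500).name)   # clusterA: cheapest within the deadline
print(time_optimize(pool, budget=1000.0).name)   # cloudB: fastest within the budget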
manager are studied in more detail. Finally, we study the CGSP library specifically developed out
of the ChinaGrid Project.
Popular grid middleware tools include the Globus Toolkit (USA), gLite, UNICORE (Germany), BOINC (Berkeley), CGSP (China), Condor-G, and the Sun Grid Engine [52]. Table 7.6 summarizes the grid software support and middleware packages developed for grid systems since 1995.
In subsequent sections, we will describe the features in Condor-G, SGE, GT4, and CGSP.
Example 7.9 introduces the matchmaking capability of Condor and Condor-G, developed at the
University of Wisconsin. Example 7.10 discusses the features of the Sun Grid Engine, developed by
Sun Microsystems for business grid computing.
Example 7.9 Features in Condor Kernel and Condor-G for Grid Computing
Condor is a software tool for high-throughput distributed batch computing. It was designed to exploit the idle
cycles of a network of distributed computers. The major components of Condor are the user agent,
resources, and matchmaker, as shown in Figure 7.17. The ClassAds (classified advertisements) language
was exploited in Condor to express user requests against available resources in a distributed system. Agents
and resources advertise their status and requirements in ClassAds to a central matchmaker. The match-
maker scans the ClassAds and creates pairs of (resources, agents) that match each other’s requirements.
Subsequently, the matched agent negotiates with the available resource to execute the job. Two problem
solvers are provided in Condor: the master-worker and the DAG manager. For specific jobs, Condor records
checkpoints and subsequently resumes program execution from the checkpoints. When running a job
on remote machines without a shared file system, Condor accesses the local execution environment via
remote system calls. Condor-G was designed to match the GRAM protocol (to be covered in Section 7.4.4)
for coupling jobs with resources in the Globus Project. Condor-G adds durability and a two-phase commit protocol
to prevent the loss or duplication of jobs in GRAM. However, Condor-G does not support all the features
in GRAM.
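The matchmaking cycle can be approximated with a toy Python matchmaker, shown below. The attribute names and requirement predicates are illustrative stand-ins and do not follow the real ClassAds syntax.

# Toy matchmaker in the spirit of Condor: agents (jobs) and resources advertise
# attribute dictionaries plus a requirements predicate, and a central matchmaker
# pairs ads whose requirements are mutually satisfied.
job_ads = [
    {"id": "job1", "attrs": {"ImageSize": 50},
     "requires": lambda r: r["Memory"] >= 64 and r["Arch"] == "x86_64"},
]
resource_ads = [
    {"id": "nodeA", "attrs": {"Memory": 128, "Arch": "x86_64"},
     "requires": lambda j: j["ImageSize"] <= 100},
    {"id": "nodeB", "attrs": {"Memory": 32, "Arch": "x86_64"},
     "requires": lambda j: True},
]

def matchmake(jobs, resources):
    # Return (job, resource) pairs whose requirements are mutually satisfied.
    matches, free = [], list(resources)
    for job in jobs:
        for res in free:
            if job["requires"](res["attrs"]) and res["requires"](job["attrs"]):
                matches.append((job["id"], res["id"]))
                free.remove(res)      # each resource is claimed by one agent
                break
    return matches

print(matchmake(job_ads, resource_ads))   # [('job1', 'nodeA')]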
FIGURE 7.17
Major functional components in a Condor system (Condor-specific names are in parentheses): the user submits jobs to an agent (schedd) with its problem solvers (master-worker and DAGMan) and shadow, the matchmaker (central manager) pairs agents with resources, and the resource (startd) runs the job in a sandbox (starter).
The SGE system uses reserved ports, Kerberos, DCE, SSL, and authentication in classified hosts to
enforce security at different trust levels and resource access restrictions.
FIGURE 7.18
Globus Toolkit GT4 supports distributed and cluster computing services. The core services shown include GridFTP, GASS, the metacomputing directory service (MDS), GRAM, the Globus security interface (GSI), the replica catalog, and I/O, beneath the applications layer.
(Courtesy of I. Foster [17])
• Global Resource Allocation Manager (GRAM): Grid Resource Access and Management (HTTP-based)
• Communication (Nexus): Unicast and multicast communication
• Grid Security Infrastructure (GSI): Authentication and related security services
• Monitoring and Discovery Service (MDS): Distributed access to structure and state information
• Health and Status (HBM): Heartbeat monitoring of system components
• Global Access of Secondary Storage (GASS): Grid access of data in remote secondary storage
• Grid File Transfer (GridFTP): Inter-node fast file transfer
available resources, move data between sites, manage user credentials, and so on. As a de facto
standard in grid middleware, GT4 is based on industry-standard web service technologies.
In Section 7.4.3, we will examine the HTTP-based GRAM and MDS modules. Nexus is used
for collective communications and HBM for heartbeat monitoring of resource nodes. GridFTP is for
speeding up internode file transfers. The module GASS is used for global access of secondary storage.
GSI (grid security infrastructure) will be covered in Section 7.5. More details of the functional modules
of Globus GT4 and their applications are available at www.globus.org/toolkit/.
FIGURE 7.19
Globus job workflow among interactive functional modules.
(Courtesy of Foster and Kesselman [15])
FIGURE 7.20
Client and GT4 server interactions; vertical boxes correspond to service programs and horizontal boxes
represent the user codes.
(Courtesy of Foster and Kesselman [15])
GT4 components do not, in general, address end-user needs directly. Instead, GT4 provides a set
of infrastructure services for accessing, monitoring, managing, and controlling access to infrastruc-
ture elements. The server code in the vertical boxes in Figure 7.20 corresponds to 15 grid services
that are in heavy use in the GT4 library. These demand computational, communication, data, and
storage resources. We must enable a range of end-user tools that provide the higher-level capabilities
needed in specific user applications. Wherever possible, GT4 implements standards to facilitate con-
struction of operable and reusable user code. Developers can use these services and libraries to build
simple and complex systems quickly.
A high-security subsystem addresses message protection, authentication, delegation, and
authorization. Comprising both a set of service implementations (server programs at the bottom
of Figure 7.21) and associated client libraries at the top, GT4 provides both web services and
non-WS applications. The horizontal boxes in the client domain denote custom applications and/
or third-party tools that access GT4 services. The toolkit programs provide a set of useful infrastructure
services.
Three containers are used to host user-developed services written in Java, Python, and C, respec-
tively. These containers provide implementations of security, management, discovery, state manage-
ment, and other mechanisms frequently required when building services. They extend open source
service hosting environments with support for a range of useful web service specifications, includ-
ing WSRF, WS-Notification, and WS-Security.
A set of client libraries allow client programs in Java, C, and Python to invoke operations on both
GT4 and user-developed services. In many cases, multiple interfaces provide different levels of
control: For example, in the case of GridFTP, there is not only a simple command-line client (globus-
url-copy) but also control and data channel libraries for use in programs—and the XIO library allow-
ing for the integration of alternative transports. The use of uniform abstractions and mechanisms
means clients can interact with different services in similar ways, which facilitates construction of
complex, interoperable systems and encourages code reuse.
FIGURE 7.21
Globus container serving as a runtime environment for implementing web services in a grid platform.
(Courtesy of Foster and Kesselman [16])
security credentials to be used, and job persistence requirements. Other operations enable clients to
monitor the status of both allocated resources and running tasks and notify users about their status,
as well as guide job execution on the grid.
The heart of GRAM contains a set of web services designed to run in the Globus Web Services
Resource Framework (WSRF) core hosting environment. Each submitted job is exposed as a distinct
resource qualifying the requested service. The service provides an interface to monitor the sta-
tus of the job or to terminate the job. Each compute element is accessed through a local scheduler.
The service provides an interface to create the managed job resources to perform a job in that local
scheduler, as illustrated in the job workflow diagram.
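A minimal Python simulation of this "managed job as a resource" idea is sketched below; the class names and job states are invented for illustration and do not correspond to the actual GRAM interfaces.

import itertools

class ManagedJobResource:
    _ids = itertools.count(1)
    def __init__(self, description):
        self.resource_id = next(self._ids)   # a distinct resource per submitted job
        self.description = description
        self.state = "Pending"
    def start(self):
        self.state = "Active"                # a local scheduler would launch the job here
    def query_state(self):
        return self.state
    def terminate(self):
        self.state = "Terminated"

class GramLikeService:
    def __init__(self):
        self.jobs = {}
    def create_managed_job(self, description):
        job = ManagedJobResource(description)
        self.jobs[job.resource_id] = job
        job.start()
        return job.resource_id               # the client holds an identifier, not the job itself

service = GramLikeService()
jid = service.create_managed_job({"executable": "/bin/hostname"})
print(jid, service.jobs[jid].query_state())  # 1 Active
service.jobs[jid].terminate()
print(service.jobs[jid].query_state())       # Terminated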
3. RLS (Replica Location Service) is a scalable system for maintaining and providing access to
information about the location of replicated files and data sets.
4. OGSA-DAI (Globus Data Access and Integration) tools were developed by the UK e-Science
program and provide access to relational and XML databases.
FIGURE 7.22
GT4 system monitoring and resource discovery infrastructure.
(Courtesy of Ian Foster [17])
• Service container: Produces a basic environment for the installation, deployment, running, and monitoring of services, especially CGSP kernel services.
• Security manager: Focuses on user identity authentication, identity mapping, service and resource authorization, and secure message passing between CGSP nodes.
• Information center: Provides service registration, publishing, metadata management, service querying, service matching, and resource status collection for the CGSP services in a uniform way.
• Data manager: Shields users from the heterogeneity of underlying storage resources to provide a uniform storage resource access mode.
• Execution manager: Accepts users' job execution requests, invokes relevant services according to the job description, and manages the job during its lifetime.
• Domain manager: In charge of user management, logging and accounting, and user identity mapping between different CGSP domains.
• Grid monitor: Named ChinaGrid Super Vision (CGSV); mainly focuses on monitoring CGSP resource load, QoS, user actions, job status, and the network, to ensure that the system is running normally and to enhance grid performance.
• Portal: A web page for learning about CGSP services and providing support for new grid applications.
resources for a given application. It brings the user closer to both grid developers and field experts.
CGSP has been tested in image processing, bioinformatics, distance education, computational fluid
dynamics, and large-scale information processing on ChinaGrid.
Table 7.8 shows the functional blocks built in CGSP. ChinaGrid aims at constructing a public
grid service system for research and higher education in China. CGSP integrates all sorts of hetero-
geneous resources, especially educational and research resources distributed over CERNET in
China. CGSP provides a grid platform for grid construction from the top portal level to the integra-
tion of underlying resources. This system supports uniform management of heterogeneous resources,
portal building, job definitions, service packaging, and grid monitoring. It provides an extensible
grid framework to serve primarily the top 100 universities in China.
The CGSP software platform meets the OGSA standard and the WSRF criteria. All software,
hardware, storage, and network technology becomes virtual resources to the user. Like Globus,
CGSP emphasizes a high degree of resource sharing and cooperation among resources. CGSP com-
prises nine major components, as briefly introduced in Table 7.8. In addition, CGSP provides a set
of APIs and toolkits that enable end users and developers to write grid-enabled programs, and
query and deploy their applications.
7.4.4.1 CGSP Job Execution Flow
Figure 7.24 shows the flow of jobs in CGSP within a single domain. First, the grid service containers
should include the allocated resources. The resources are extracted through a service packaging tool
and are included in the service containers. The three symbols, S1, S2, and S3, represent command-line software, a legacy program, and the requested service, respectively. We need to use a job defi-
nition tool to model the workflow of a job. Because job requests are submitted via SOAP
messages, the job manager is deployed in the service container.
FIGURE 7.23
Functional building blocks in the CGSP library for ChinaGrid, developed by Tsinghua University, China [53].
The job manager is responsible for job request analysis, service selection, and job scheduling
and execution management. Each domain needs to deploy a service manager which collects service
information from containers to support other software modules. In addition, each domain has to
deploy a domain manager to implement the functions of user management and other issues. Lastly,
one or more grid portals can be deployed to satisfy specific needs of different specialties. A portal
development tool can help users to develop web applications that satisfy interaction requirements
and can be deployed in the portal.
After an application is deployed, the following eight steps take place, as illustrated in Figure 7.24. More details can be found in [53].
1. The user should register and acquire the certificate signed by ChinaGrid. Then the user logs in
to the grid portal through a certain security protocol (such as HTTPS) and interacts with the
grid system through web applications deployed in the portal.
2. If necessary, the user uploads the input data required by the computing job/request to the
personal data space in the data manager. Then the user can define a job (such as J1) by using
the job defining tool. The job definition is then sent to the job manager.
3. According to the job process rule in J1, which is stored locally, the job manager queries
available services from the information center and helps to complete job J1.
4. The information center collects the service information using a “push/pull” model. It maintains
the resource view so that it can return an available service list sorted by a certain rule. In
principle, the information center will return local available services. If necessary, it will query
other information centers for more available services if it cannot find a service satisfying the
requirement in its own domain.
5. The job manager selects service S2 and sends it to the allocated computing node.
6. After service containers receive the requests, they need to invoke the access interfaces provided
by the data manager in order to get the data by a logical file path in the unified data space.
Then the service containers execute the services.
FIGURE 7.24
The job execution sequence in using the CGSP on ChinaGrid [53].
7. The job manager continuously collects the services’ status in the execution processes to report
the information to the user through the portal.
8. When the job is completed, the service container moves the computing results to the personal
data space in the data manager.
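Step 4 above, where an information center answers from its own domain first and falls back to peer domains otherwise, can be sketched in Python as follows. The class names, domains, and service names are hypothetical placeholders rather than CGSP interfaces.

class InformationCenter:
    def __init__(self, domain, services, peers=None):
        self.domain = domain
        self.services = services          # {service_name: [provider nodes]}
        self.peers = peers or []          # information centers of other domains

    def query(self, service_name):
        local = self.services.get(service_name, [])
        if local:                         # prefer services registered in the local domain
            return [(self.domain, node) for node in local]
        for peer in self.peers:           # otherwise fall back to peer domains
            remote = peer.query(service_name)
            if remote:
                return remote
        return []

remote_ic = InformationCenter("domainB", {"S2": ["nodeB1"]})
local_ic = InformationCenter("domainA", {"S1": ["nodeA1"]}, peers=[remote_ic])
print(local_ic.query("S1"))   # [('domainA', 'nodeA1')]: found locally
print(local_ic.query("S2"))   # [('domainB', 'nodeB1')]: resolved via a peer domain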
• Cloud computing: Cloud and grid technologies share some common resources and infrastructures. They are gradually merging in high-tech and service industries (Chapters 4–6).
• P2P computing: Volunteer grid computing can benefit from P2P computing technology (Chapter 8).
• HP and HA cluster systems: Supercomputers appear in various network platforms, not restricted to centralized MPP or high-performance/high-availability clusters (Chapter 2).
• Data center automation: Through virtualization, the boundaries among data centers, clouds, and supercomputers are blurred (Chapters 3, 4, 9).
• Fuzzy theory and reputation systems: Fuzzy theory and reputation systems can be applied to trust management in grids, clouds, and other distributed systems (Sections 7.5.2, 7.5.3, and 8.4 [42]).
• Game-theoretic optimization: Game theory can be applied to achieve win-win optimization and to avoid selfish behaviors that degrade grid and cloud applications (Example 7.11 and [29]).
FIGURE 7.25
A conceptual example to illustrate the selfish grid problem and a game-theoretical solution.
(Courtesy of Kwok, Hwang, and Song [30], 2007)
Traditional point-value prediction methods apply a very short prediction window. Although they
may work well to predict CPU load in centralized computer systems, they do not work well on
large-scale production grids due to the long execution time expected. In fact, point-value prediction
can hardly cover workload fluctuation in a long time frame. In general, the load index is used to
estimate the percentage of peak performance achievable on a given computational grid. The man-
agement console of such a grid can monitor CPU utilization.
Workload managers in large grid infrastructures are notoriously weak in determining correct
scheduling scenarios, which affects the application execution time on the grid. This section dis-
cusses predicting future workload within a reasonable confidence range. The narrower the prediction
range, the higher the accuracy of prediction. Long-range workload prediction is bound to have some
errors. But we try to minimize these prediction errors by using look-ahead filtering techniques
within a trainable confidence window.
FIGURE 7.26
Adaptive prediction of grid workload using confidence windows derived from historical load information and projected load in the lookahead window.
(Courtesy of Wu, Hwang, Yuan, and Zheng [55], 2010)
5,000 processors in nine sites throughout France. The Grid Workloads Archive (GWA) collected the
workload traces.
FIGURE 7.27
Workload variation in the Grid'5000 from Jan. to Oct. 2006.
1. Produce the refined workload index X(t)          // using a Kalman filter on the raw workload index X(t) //
2. Generate the smoothed workload index X(t)        // using a Savitzky-Golay filter on the refined workload index X(t) //
3. Predict the workload index series [X(t + 1), ..., X(t + n)]
       // using Algorithm 9.1 on the smoothed loads [X(t − p), ..., X(t)] //
4. forall i = 1 to n
5.     Form A(i) = smoothed load indices [X(t − w + 1), ..., X(t)]
       followed by the predicted load indices [X(t + 1), ..., X(t + i)]
6.     Compute the confidence window {X_l(t + i), X_u(t + i)} of A(i) to yield
       A_l(i) = X_l(t + i) and A_u(i) = X_u(t + i)
7. endforall
8. Generate the smoothed A_l and A_u series          // using the Savitzky-Golay filter //
9. Generate X_l(t + n) = A_l(n), the lower end of the confidence window, and
   X_u(t + n) = A_u(n), the upper end of the confidence window
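A simplified Python stand-in for this pipeline is sketched below. It keeps the Savitzky-Golay smoothing (step 2) and the confidence-window construction (steps 4 through 9), but replaces the Kalman refinement and the exact lookahead algorithm with a naive AR(1) predictor, so the numbers it produces are only illustrative. NumPy and SciPy are assumed to be available.

import numpy as np
from scipy.signal import savgol_filter

def predict_with_window(load, n=10, w=20):
    smoothed = savgol_filter(load, window_length=11, polyorder=3)   # step 2
    # Fit a one-step model x[t+1] ~ a*x[t] + b by least squares (stand-in for the predictor).
    X = np.vstack([smoothed[:-1], np.ones(len(smoothed) - 1)]).T
    a, b = np.linalg.lstsq(X, smoothed[1:], rcond=None)[0]
    preds, x = [], smoothed[-1]
    for _ in range(n):                      # step 3: lookahead predictions
        x = a * x + b
        preds.append(x)
    history = list(smoothed[-w:])
    lower, upper = [], []
    for i in range(n):                      # steps 4-7: a confidence window per lookahead point
        window = history + preds[: i + 1]
        lower.append(min(window))
        upper.append(max(window))
    return preds[-1], lower[-1], upper[-1]  # step 9: ends of the final confidence window

loads = 30 + 5 * np.sin(np.linspace(0, 6, 200)) + np.random.normal(0, 1, 200)
print(predict_with_window(loads, n=10))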
FIGURE 7.28
Mean square errors of two implementations of the AH-Model, compared with the best MSE obtained using the AR model, plotted against the lookahead span n for (a) AuverGrid and (b) Grid'5000.
(Courtesy of Wu, Hwang, Yuan, and Zheng [55], 2010)
On the one hand, a user job demands the resource site to provide security assurance by issuing a
security demand (SD). On the other hand, the site needs to reveal its trustworthiness, called its trust
index (TI). These two parameters must satisfy a security-assurance condition: TI ≥ SD during the
job mapping process. When determining its security demand, users usually care about some typical
attributes. These attributes and their values are dynamically changing and depend heavily on the
trust model, security policy, accumulated reputation, self-defense capability, attack history, and site
vulnerability. Three challenges are outlined below to establish the trust among grid sites [39].
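Before turning to these challenges, the assurance condition itself is easy to sketch: a job with security demand SD is eligible only for sites whose trust index TI satisfies TI >= SD. The site names and trust values below are hypothetical.

sites = {"site1": 0.9, "site2": 0.6, "site3": 0.3}   # trust index (TI) per site

def eligible_sites(security_demand, site_trust):
    # Return the sites satisfying the assurance condition TI >= SD.
    return [s for s, ti in site_trust.items() if ti >= security_demand]

print(eligible_sites(0.7, sites))   # ['site1']
print(eligible_sites(0.5, sites))   # ['site1', 'site2']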
The first challenge is integration with existing systems and technologies. The resource sites in a
grid are usually heterogeneous and autonomous. It is unrealistic to expect that a single type of
security can be compatible with and adopted by every hosting environment. At the same time, exist-
ing security infrastructure on the sites cannot be replaced overnight. Thus, to be successful, grid
security architecture needs to step up to the challenge of integrating with existing security architec-
ture and models across platforms and hosting environments.
The second challenge is interoperability with different “hosting environments.” Services are
often invoked across multiple domains, and need to be able to interact with one another. The intero-
peration is demanded at the protocol, policy, and identity levels. For all these levels, interoperation
must be protected securely. The third challenge is to construct trust relationships among interacting
hosting environments. Grid service requests can be handled by combining resources on multiple
security domains. Trust relationships are required by these domains during the end-to-end traversals.
A service needs to be open to friendly and interested entities so that they can submit requests and
access securely.
Resource sharing among entities is one of the major goals of grid computing. A trust relation-
ship must be established before the entities in the grid interoperate with one another. The entities
have to choose other entities that can meet the requirements of trust to coordinate with. The entities
that submit requests should believe the resource providers will try to process their requests and
return the results with a specified QoS. To create the proper trust relationship between grid entities,
two kinds of trust models are often used. One is the PKI-based model, which mainly exploits the
PKI to authenticate and authorize entities; we will discuss this in the next section. The other is the
reputation-based model.
The grid aims to construct a large-scale network computing system by integrating distributed,
heterogeneous, and autonomous resources. The security challenges faced by the grid are much
greater than other computing systems. Before any effective sharing and cooperation occurs, a trust
relationship has to be established among participants. Otherwise, not only will participants be
reluctant to share their resources and services, but the grid may also suffer a lot of damage [63].
FIGURE 7.29
A general trust model for grid computing, relating the security demand and site trustworthiness to attributes such as response capabilities, anti-virus capacity, job turnaround time, and job slowdown ratio.
(Courtesy of Song, Hwang, and Kwok [43], 2005)
FIGURE 7.30
Interactions among multiple parties (Alice, Bob, Charlie, and the CA, each holding a private key and a certificate) in a sequence of trust delegation operations using the PKI services in a GT4-enabled grid environment.
FIGURE 7.31
Three authorization models: the subject-push model, the resource-pulling model, and the authorization agent model.
First, the generation of the proxy credential is similar to the procedure of generating a user certificate
in the traditional PKI. Second, when Bob acts on behalf of Alice, he sends the request together with
Alice’s proxy credential and the Alice certificate to Charlie. Third, after obtaining the proxy credential,
Charlie finds out that the proxy credential is signed by Alice. So he tries to verify the identity of Alice and
finds Alice trustable. Finally, Charlie accepts Bob’s requests on behalf of Alice. This is called a trust
delegation chain.
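The chain itself can be modeled with a toy verification routine, sketched below in Python. Signatures are simulated by recording who signed each credential; no real cryptography or GSI calls are involved.

from dataclasses import dataclass

@dataclass
class Credential:
    subject: str
    signed_by: str

ca_cert    = Credential("Alice", signed_by="CA")     # Alice's end-entity certificate, signed by the CA
proxy_cred = Credential("Bob",   signed_by="Alice")  # proxy credential Alice issued to Bob

def verify_chain(leaf, certs, trusted_roots={"CA"}):
    # Follow signer links from the leaf credential up to a trusted root.
    index = {c.subject: c for c in certs}
    current = leaf
    while current.signed_by not in trusted_roots:
        signer_cert = index.get(current.signed_by)
        if signer_cert is None:
            return False            # the signer's certificate is unknown
        current = signer_cert
    return True

print(verify_chain(proxy_cred, [ca_cert, proxy_cred]))   # True: Bob -> Alice -> CA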
FIGURE 7.32
GSI functional layers at the message and transport levels: message protection uses WS-Security and WS-SecureConversation at the message level and TLS at the transport level.
(See Berman, Fox and Hey [5], Foster and Kesselman [15], and Foster [17] for more details.)
WS-Trust are used for delegation. An Authorization Framework allows for a variety of authorization
schemes, including a “grid-mapfile” ACL, an ACL defined by a service, a custom authorization
handler, and access to an authorization service via the SAML protocol. In addition, associated
security tools provide for the storage of X.509 credentials (MyProxy and Delegation services), the
mapping between GSI and other authentication mechanisms (e.g., KX509 and PKINIT for Kerberos,
MyProxy for one-time passwords), and maintenance of information used for authorization (VOMS,
GUMS, PERMIS).
The remainder of this section reviews both the GT implementations of each of these functions
and the standards that are used in these implementations. The web services portions of GT4 use
SOAP as their message protocol for communication. Message protection can be provided either by
transport-level security, which transports SOAP messages over TLS, or by message-level security,
which involves signing and/or encrypting portions of the SOAP message using the WS-Security standard.
Here we describe these two methods.
FIGURE 7.33
Multiple handshaking in a mutual authentication scheme: each party presents its certificate (EEC ID, public key, and CA), is checked for validity, and proves possession of its private key by encrypting a challenge message that the other party decrypts with the certified public key.
(Courtesy of Foster and Kesselman [15], 2002)
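The challenge-response step at the heart of this handshake can be sketched with the third-party Python cryptography package (assumed installed). Certificate and CA validation are omitted; this is only an illustration of the idea, not the GSI implementation.

import os
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.exceptions import InvalidSignature

alice_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
alice_pub = alice_key.public_key()        # Bob would obtain this from Alice's certificate

challenge = os.urandom(32)                # Bob's random challenge message
signature = alice_key.sign(challenge, padding.PKCS1v15(), hashes.SHA256())   # Alice responds

try:
    alice_pub.verify(signature, challenge, padding.PKCS1v15(), hashes.SHA256())
    print("Alice authenticated")          # Bob accepts; the roles are then reversed
except InvalidSignature:
    print("authentication failed")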
FIGURE 7.34
A sequence of trust delegations in which new certificates are signed by the owners rather than by the CA.
signature on the owner’s certificate. This establishes a chain of trust from the CA to the last proxy
through the successive owners of resources. The GSI uses WS-Security with textual usernames and
passwords. This mechanism supports more rudimentary web service applications. When using user-
names and passwords as opposed to X.509 credentials, the GSI provides authentication, but no
advanced security features such as delegation, confidentiality, integrity, and replay prevention.
However, one can use usernames and passwords with anonymous transport-level security such as
unauthenticated TLS to ensure privacy.
ACKNOWLEDGMENTS
This chapter was authored by Yongwei Wu of Tsinghua University (THU) China and Kai Hwang of
University of Southern California, jointly. Rajkumar Buyya of the University of Melbourne has contrib-
uted Section 7.3.4. The authors appreciate the technical support from Professor Weimin Zheng of THU
and Dr. Zhiwei Xu of Chinese Academy of Sciences during the course of developing this chapter.
References
[1] W. Allcock, J. Bresnahan, R. Kettimuthu, et al., The globus striped gridFTP framework and server, in:
Proceedings of the ACM/IEEE Conference on Supercomputing, 2005.
[2] R. Aydt, D. Gunter, W. Smith, et al., A Grid Monitoring Architecture, Global Grid Forum Performance
Working Group, 2002.
[3] F. Azzedin, M. Maheswaran, A trust brokering system and its application to resource management in
public-resource grids, in: Proceedings of the 18th International Parallel and Distributed Processing Sympo-
sium (IPDPS ’04), Santa Fe, NM, April 26, 2004, p. 22a.
[4] F. Berman, R. Wolski, S. Figueira, J. Schopf, G. Shao, Application-level scheduling on distributed
heterogeneous networks, in: Proceedings of the ACM/IEEE Conference on Supercomputing,
Pittsburgh, 1996.
[5] F. Berman, G. Fox, T. Hey (Eds.), Grid Computing: Making the Global Infrastructure a Reality, Wiley
Series in Communications Networking and Distributed Systems, Wiley, 2003.
[6] V. Berstis, Fundamentals of grid computing. IBM Publication, http://www.redbooks.ibm.com/abstracts/
redp3613.html, 2011 (accessed 26.04.11).
[7] BOINCstats - Boinc combined credit overview, http://www.boincstats.com/stats/project_graph.php?pr=bo,
2011 (accessed 26.04.11).
[8] R. Buyya, D. Abramson, S. Venugopal, The Grid Economy, in: Proceedings of the IEEE, 2005, pp. 698–714.
[9] R. Buyya, K. Bubendorfer (Eds.), Market Oriented Grid and Utility Computing, John Wiley & Sons,
2009.
[10] A. Chervenak, et al., Giggle: a framework for constructing scalable replica location services, in: Proceed-
ings of the 2002 ACM/IEEE Conference on Supercomputing, Baltimore, 16–22 November 2002.
[11] J. Dongarra, I. Foster, G. Fox, et al., Sourcebook of Parallel Computing, Morgan Kaufmann Publishers, 2002.
[12] L. Ferreira, et al., Introduction to Grid Computing with Globus, (http://www.redbooks.ibm.com/abstracts/
sg246895.html?OPen)
[13] L. Ferreira, et al., Grid Computing in Research and Education, (http://www.redbooks.ibm.com/abstracts/
sg246649.html?OPen)
[14] Folding@Home. (http://fah-web.stanford.edu/cgi-bin/main.py?qtype=ossrats), 2011, (accessed 7.03.2011).
[15] I. Foster, C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers,
2002.
[16] I. Foster, C. Kesselman, S. Tuecke, The anatomy of the grid: enabling scalable virtual organizations, Int.
J High. Perform. Comput. Appl. 15 (3) (2001) 200.
[17] I. Foster, Globus toolkit version 4: software for service-oriented systems, J. Comput. Sci. Technol. 21 (4)
(2006) 513–520.
[18] L. Francesco, et al., The many faces of the integration of instruments and the grid, Int. J. Web. Grid.
Services 3 (3) (2007) 239–266.
[19] L. Gong, S.H. Sun, E.F. Watson, Performance modeling and prediction of non-dedicated network computing,
IEEE. Trans. Comput., (2002).
[20] G. Yuan, H. Jin, M. Li, N. Xiao, W. Li, Z. Wu, Y. Wu, Grid computing in China, J. Grid. Comput. 2 (2) (2004).
[21] W. Hoschek, J. Jaen-Martinez, A. Samar, H. Stockinger, K. Stockinger, Data management in an interna-
tional data grid project, in: Proceedings of the 1st IEEE/ACM International Workshop on Grid Computing
(Grid 2000), Bangalore, India, 17 December 2000, pp. 77–90.
[22] K. Hwang, Z. Xu, Scalable Parallel Computing, McGraw-Hill, 1998.
[23] K. Hwang, Y. Kwok, S. Song, M. Cai, Yu Chen, Y. Chen, DHT-based security infrastructure for trusted
internet and grid computing, Int. J. Crit. Infrastructures 2 (4) (2006) 412–433.
[24] M.A. Iverson, F. Ozguner, L. Potter, Statistical prediction of task execution time through analytical bench-
marking for scheduling in a heterogeneous environment, IEEE. Trans. Comput., (1999) 1374–1379.
[25] L. Jiadao, R. Yahyapour, Negotiation model supporting co-allocation for grid scheduling, in: Proceedings
of the 7th IEEE/ACM International Conference on Grid Computing, 2006, p. 8.
[26] H. Jin, Challenges of grid computing, Advances in Web-Age Information Management. Lecture Notes in
Computer Science, 3739 (2005) 25–31.
[27] M. Kalantari, M. Akbari, Fault-aware grid scheduling using performance prediction by workload model-
ling, J. Supercomput. 46 (1) (2008).
[28] C. Karnow, The grid: blueprint for a new computing infrastructure, Leonardo 32 (4) (1999) 331–332.
[29] K. Krauter, R. Buyya, M. Maheswaran, A taxonomy and survey of grid resource management systems for
distributed computing, Softw.Pract. Exper. 32 (2) (2002) 135–164.
[30] R. Kwok, K. Hwang, S. Song, Selfish grids: game-theoretic modeling and NAS/PSA benchmark evaluation,
IEEE Trans. Parallel. Distrib. Syst., (May) (2007).
[31] M. Li, M. Baker, The Grid: Core Technologies, Wiley, 2005, (http://coregridtechnologies.org/).
[32] H. Li, Performance evaluation in grid computing: a modeling and prediction perspective, in: Seventh IEEE
International Symposium on Cluster Computing and The Grid, (CCGrid 2007), May 2007, pp. 869–874.
[33] X. Lijuan, Z. Yanmin, L.M. Ni, Z. Xu, GridIS: an incentive-based grid scheduling, in: Proceedings of the
19th IEEE Int’l Parallel and Distributed Processing Symposium, 4–8 April 2005.
[34] C. Lin, V. Varadharajan, Y. Wang, V. Pruthi. Enhancing grid security with trust management, in: Proc. of
the 2004 IEEE International Conference on Services Computing, Washington, DC, 14 September 2004,
pp. 303–310.
[35] J. Nash, Non-Cooperative Games, Ann. Math. Second Series 54 (2) (1951) 286–295.
[36] P. Plaszczak, R. Wellner, Grid Computing: The Savvy Manager’s Guide, Kaufmann, 2006.
[37] M. Poess, N. Raghunath, Large-scale data warehouses on grid, http://www.vldb2005.org/program/papertue/
p1055-poess.pdf, 2005.
[38] J. Schopf, F. Berman, Performance prediction in production environments, in: 12th International Parallel
Processing Symposium, Orlando, FL, April 1998, pp. 647–653.
[39] SETI@Home credit overview, (http://www.boincstats.com/stats/project_graph.php?pr=sah), (accessed
21.04.11).
[40] H. Shen, K. Hwang, Locality-preserving clustering and discovery of resources in wide-area computational
grids, IEEE. Trans. Comput. (accepted to appear 2011).
[41] R. Smith, Grid computing: a brief technology analysis, CTO Network Library. (http://www.ctonet.org/
documents/GridComputing_analysis.pdf).
[42] S. Song, K. Hwang, R. Zhou, Y.K. Kwok, Trusted P2P transactions with fuzzy reputation aggregation,
IEEE. Internet. Comput. (November–December) (2005) 18–28.
[43] S. Song, K. Hwang, Y.K. Kwok, Trusted grid computing with security binding and trust integration,
J. Grid. Comput. 3 (1–2) (2005).
[44] S. Song, K. Hwang, Y. Kwok, Risk-tolerant heuristics and genetic algorithms for security-assured grid job
scheduling, IEEE. Trans. Comput., (2006) 703–719.
[45] Sun Microsystems, How Sun Grid Engine, Enterprise Edition works. White paper, www.sun.com/software/
gridware/sgeee53/wp-sgeee.pdf, 2001.
[46] I. Taylor, From P2P to Web Services and Grids, Springer-Verlag, London, 2005.
[47] D. Thain, T. Tannenbaum, M. Livny, Distributed computing in practice: the condor experience, Concur-
rency. Comput. Pract. Exp., (2005) 323–356.
[48] S. Venugopal, R. Buyya, K. Ramamohanarao, A taxonomy of data grids for distributed data sharing, man-
agement, and processing. ACM. Comput. Surv. 38 (1) (2006) 1–53.
[49] S. Venugopal, R. Buyya, L. Winton, A grid service broker for scheduling e-science applications on global
data grids, Concurrency. Comput. Pract. Exp. 18 (6) (2006) 685–699.
[50] H. Wang, Z. Xu, Y. Gong, W. Li, Agora: grid community in Vega grid, in: Proceedings of the 2nd International
Workshop on Grid and Cooperative Computing, Shanghai, China, 7 December, 2003, pp. 685–691.
[51] V. Welch, et al., Security for grid services, in: Proceedings of the 12th IEEE International Symposium on
High Performance Distributed Computing, June 22–24, 2003, pp. 48–57.
[52] Wikipedia, Grid Computing. http://en.wikipedia.org/wiki/Grid_computing, (accessed 26.04.11).
[53] Y. Wu, S. Wu, H. Yu, C. Hu, CGSP: an extensible and reconfigurable grid framework, Lect. Notes.
Comput. Sci. 3756 (2005) 292–300.
[54] Y. Wu, C. Hu, L. Zha, S. Wu, Grid middleware in China, Int. J. Web. Grid. Serv. 3 (4) (2007) 371–402.
[55] Y. Wu, K. Hwang, Y. Yuan, W. Zheng, Adaptive workload prediction of grid performance in confidence
windows, IEEE Trans. Parallel. Distrib. Syst., (July) (2010).
[56] Y.W. Wu, S. Wu, H.S. Yu, C.M. Hu, Introduction to ChinaGrid support platform, in: Proceedings of
Parallel and Distributed Processing and Applications, 2005, pp. 232–240.
[57] Z. Xu, W. Li, L. Zha, H. Yu, D. Liu, Vega: a computer systems approach to grid computing, J. Grid.
Comput. 2 (2) (2004) 109–120.
[58] L. Yang, I. Foster, J.M. Schopf, Homeostatic and tendency-based CPU load predictions, in: International
Parallel and Distributed Processing Symposium, 2003, pp. 42–50.
[59] X. Zhang, J. Schopf, Performance analysis of the globus toolkit monitoring and discovery service, MDS2,
in: Proceedings of the Int’l Workshop on Middleware Performance, 2004, pp. 843–849.
[60] W. Zheng, L. Liu, M. Hu, Y. Wu, L. Li, F. He, J. Tie, CGSV: an adaptable stream-integrated grid monitor-
ing system, Network. Parallel. Comput. 3779 (2005) 22–31. Lecture Notes in Computer Science, Springer.
HOMEWORK PROBLEMS
Problem 7.1
Consider the following list of resource allocations, code files, data transfers, and computation tasks
performed in typical grid experiments. You are asked to execute a typical job on a computational
grid that is available to you. Discuss the resource items that are controlled by the grid resource
management system on your platform. Either Globus GT4 or CGSP is a candidate system to use.
a. Starting, monitoring, and controlling the execution of the resultant processes
b. Putting and getting files
c. Relevant load information: current load and queue state
d. Resources allocated to network transfers
e. Resources with advance reservation
f. Network characteristics and load
g. Relevant load information: available space and bandwidth utilization
h. Versioned source and object codes
Problem 7.2
Briefly specify the functionality of the software modules GRAM, Nexus, GSI, MDS, and GASS
implemented in the Globus GT4 middleware library. Try to illustrate an example application
using each of the five functional modules.
Problem 7.3
Circle only one correct answer in the following question related to the use of GT4 or CGSP middle-
ware tools for grid computing. Justify why you chose your answer. Repeat the question for those
features that are not supported by CGSP.
a. Grid workflow with WSRF support
b. JSDL job submission and legacy job execution using General Running Service
c. Heterogeneous integration using virtual database and virtual table.
d. Multi-Local Scheduler support including PBS, LSF, and Condor
Problem 7.4
Four steps are required to verify proxy credentials of two parties (Bob and Alice) in Figure 7.30.
Extend the trust delegation scheme to three parties (say, adding John). Draw a diagram similar to Figure 7.30 to show all the credential-checking operations in a GT4-enabled grid.
Problem 7.5
In Example 7.11, you have learned the basic concept of a game-theoretic approach to solving the
grid selfishness problem. Now, you are asked to read the original paper by Kwok, Hwang, and Song [30] in detail, covering the game theory and the NAS verification experiments. Discuss the advantages of the game-theoretical solution and its implementation requirements in a real-life grid environment.
Problem 7.6
Compare the three performance prediction methods AR(p), H-Model, and AH-Model presented in Section 7.5.2. Discuss their relative merits and limitations in grid workload characterization and performance prediction. Comment on the advantages of using the Savitzky-Golay and Kalman filters in the prediction process.
Problem 7.7
Study the Gridbus broker system in Section 7.3.4. Try to use the system to schedule an application
job with many tasks on an available grid or cloud platform. Report on the experimental settings and
performance results obtained. Discuss the strengths and weaknesses of the Gridbus system.
Problem 7.8
Visit the BOINC web site (http://boinc.berkeley.edu) and register your desktop or personal computer as a contributor machine. Join one of the listed grid application projects to gain some concrete
experience in volunteer computing. Report on your research findings and discuss the experimental
results obtained. Comment on the advantages of using a virtual grid platform against traditional
HPC systems and identify the shortcomings or limitations encountered.
Problem 7.9
Match the seven abbreviated terms and system models on the left with the best-match descriptions
on the right. Just enter the description label (a, b, c, …, g) in the underlined blanks in front of the
specific terms. This is a one-to-one correspondence mapping.
Problem 7.10
Study the grid systems EGEE, Grid'5000, DataGrid, and ChinaGrid in the examples in Section 7.2.
Choose one example system to dig deeper by finding more details from its Web site and published
papers. Conduct original research on its updated system development. Identify some open pro-
blems discovered and report on your research findings.
Problem 7.11
This problem requires you to run a distributed program (MPI code) on any GT4-enabled grid
platform that you have access to or you can set up with a research team. Write a report to explain
the grid experimental setting using the Globus tools. Report the lessons learned and performance
results obtained.
Problem 7.12
Describe how mutual authentication is implemented between two user parties A and B in Figure 7.35
across a grid using the Globus Grid Security Infrastructure and PKI. You need to describe five
FIGURE 7.35
Mutual authentication between two parties using the GSI and X.509 certificate issued from the certificate authority (CA).
Table 7.10 Comparison among P2P Networks, Grids, and P2P Grids
Features P2P Networks Grid Systems P2P Grids
handshaking steps for how B verifies the identity of A. Assume that a trusted certificate authority
(CA) is available to issue X.509 certificates containing the needed public keys to any grid user. Each
user can apply his private key to encrypt a message. The receiving end can apply the sender’s public
key to decrypt the message. User A can verify the identity of B in the exact same operation, except
reversing the roles. You do not need to repeat that reversed process.
Problem 7.13
Distinguish P2P networks from grids and P2P grids by filling in the missing entries in Table 7.10.
Some entries are already given. You may want to review material in Table 1.3, Table 1.5, and
Table 1.9. More material on P2P and P2P grid systems can be found in Chapter 8.
Problem 7.14
This is a survey report about all the P2P grids or volunteer computing grids that you can manage to join as a contributing host. Compare the relative performance of all the volunteer grids you have used. Discuss their strengths and weaknesses in hardware capabilities, software support, and middleware
services provided.
Problem 7.15
Try to virtualize some of the resources in Problems 7.8, 7.13, or 7.14. Repeat one of the three problems using virtual machines. Discuss the gains and losses in using a virtualized production grid. You can also use Amazon EC2 or other virtualized data centers to conduct the needed experiments.
CHAPTER 8
Peer-to-Peer Computing and Overlay Networks
SUMMARY
In this chapter, we study peer-to-peer (P2P) networks and their applications. P2P overlay networks
are built over a large number of client computers through virtual networking over the Internet. We
present unstructured, structured, and hybrid overlay networks. Distributed hash tables (DHTs) are
studied for fast routing and locality preservation operations. Fault tolerance and churn resilience are
studied to improve dependability. On the application side, P2P networks are used for distributed file
sharing and content distribution applications. Trust management and copyright protection are studied
with reputation systems and content poisoning techniques. Examples are given to illustrate the
design principles of P2P networks and the protocols applied in real-life applications. We will cover
P2P social networks in Chapter 9.
FIGURE 8.1
The distribution of Internet traffic based on application trends: the percentage of Internet traffic attributed to Email, FTP, Web, and P2P applications from 1993 to 2006.
(Courtesy of CacheLogic Research, 2007)
• Decentralization In a pure P2P computing system, peers are equal and there is no central server
to coordinate the system. Each peer has a partial view of the whole system to construct the
overlay network, and controls its data and resources.
• Self-organization Self-organization means no central management point is required to organize
the peer computing resources and the data resources spreading over the peers. The resources in
P2P computing systems are dynamic or fluctuate in the sense that they come and go at any
time, at will. It would be very costly to dedicate a server to manage these systems.
• Ad hoc connectivity and dynamics The availability of peers is unpredictable. A peer may join
or leave the system at any time. The overlay topologies and the system scales will vary
enormously and rapidly. P2P computing systems should provide stable services even in highly
dynamic environments.
• Anonymity In a decentralized P2P network, peers resort to detour paths to send/receive requests
(i.e., two nodes communicate through some middle nodes). This ensures senders’ anonymity.
Anonymity can be also achieved through hashing.
• Scalability The P2P model eliminates the single point of failure problem related to the traditional
centralized client/server model. Each peer only maintains a limited number of system states and shares
resources directly with others. These features enable the high scalability of P2P computing systems.
• Fault tolerance In a P2P network, all peers are created equal and no peer can dominate the
system or become a single point of failure to pull down the entire system. Resources could be
stored in multiple peers. These features facilitate fault tolerance.
FIGURE 8.2
P2P network model versus client/server architecture.
FIGURE 8.3
Evolution of P2P network systems from centralized Napster to flat Gnutella and hierarchically structured KaZaA using super nodes: (a) centralized model (Napster), (b) pure P2P model (Gnutella 0.4), and (c) hierarchical model (KaZaA).
Since there is no central index server, file request messages are flooded to their neighbor peers.
Gnutella 0.6 systems, KaZaA, and Skype leverage the third model, the hierarchical model; see
Figure 8.3(c). A portion of powerful peers are selected as super nodes in the system. The requests
are only flooded among powerful peers, hence increasing scalability. P2P distributed computing sys-
tems always maintain central servers for task management and communication to client peers. How-
ever, client peers do not need to communicate. Therefore, peers in this context refer to computing
systems that make their resources available. P2P platforms act as a middleware foundation to enable
easy development of P2P systems. The platforms provide security services, communication services,
and standard services (indexing, searching, and file sharing).
• File sharing: Queries are forwarded by peers collaboratively; file objects are transferred directly between peers. Examples: Napster, Gnutella, KaZaA, eDonkey.
• Content distribution: Nodes forward the content they have received to other peers until all nodes receive the content. Examples: BitTorrent, PPlive.
• P2P instant messaging: Text messages or video and audio data are relayed by peers in the systems, instead of central servers. Examples: Skype, MSN.
• Distributed computing: Computationally intensive tasks are split into subtasks; nodes process subtasks independently. Examples: SETI@Home, Folding@Home.
• Collaboration support: Events and messages exchanged are relayed instantly to all other peer members in the group. Examples: Groove, Magi, AIM.
• Platforms: General frameworks that provide support for P2P computing or use P2P mechanisms. Examples: JXTA, Microsoft .NET.
FIGURE 8.4
The Skype architecture and its major components: super nodes, ordinary hosts, and the login server with which messages are exchanged during login.
A super node or ordinary host is also called a Skype client. Each client maintains a list of reachable
nodes. This list is called the host cache, and it contains the IP addresses and ports of a number of super
nodes (typically 200). A number of super nodes, which are run by the service provider, are hardcoded
into the client host cache. After the first login, a client learns other super nodes’ information to fill its
cache. The login server is the only central component in Skype. It stores usernames and passwords for
user authentication. During its login process, a client first chooses a valid entry from its cache to build a
TCP connection. If there is no valid entry in the cache, the client will report a login failure. The client then
contacts the login server for authentication.
A client detects whether it is behind the network address translator (NAT)/firewall or not during its login
process. Online and offline user information is stored and propagated in a decentralized fashion, and so
are user search queries. Users’ buddy lists are stored locally on the Skype clients. Skype claims it uses an
advanced Global Index technique for quick user search. However, it is unclear how it works since all the
communications, including both control messages and media data, are encrypted. Skype clients prefer to
use TCP for call signaling and UDP for media transfers. However, if it is impossible to communicate
through UDP due to a NAT/firewall, TCP connections are used. In the case that the caller or callee is
behind a NAT/firewall, a valid node which has a publicly accessible IP address is used for signaling and
media data relay. The relay nodes may change during conversations.
FIGURE 8.5
Workload distribution process in SETI@Home for volunteer computing: user machines get work units from the server, process them, and report the results back.
FIGURE 8.6
Building a proximity-aware P2P overlay network: (a) an inefficient overlay, (b) a proximity-aware overlay, and (c) the underlying IP topology.
FIGURE 8.7
An example of onion routing for anonymous communication.
FIGURE 8.8
An example of a DDoS attack in a small P2P network through message flooding.
have to find other super nodes to connect with. The index information kept in the faulty super node
would be lost. Node failure imposes a great challenge to achieve fault tolerance in P2P networks.
Since there is no central coordinator, peers use periodic heartbeat messages to check the status and
availability of other peers.
structure, but later adopt a more efficient overlay structure to speed up searches and downloads with
enhanced security and efficiency. Here are some useful features of structured overlays:
• Structured routing mechanisms in the overlay
• Add application-level overlays over the peers
• Reduce routing hops, compared with using random graphs
• Avoid flooding and eliminate hotspots
• Guarantee the search results
• Provide load balancing among the peer nodes
• Provide enhanced scalability and fault tolerance
• Preserve data locality if desired
• Promote self-organization under constraints
• Provide enhanced security protection
• Support node heterogeneity
FIGURE 8.9
Classification of various P2P systems by functionality (file sharing, distributed computing, collaboration, and platforms) and design/control style.
(Courtesy of Min Cai, EE 657 Report, University of Southern California, 2007)
CSCW refers to computer-supported cooperative work. CAN, Chord, Pastry, and Tapestry are
now all using structured overlays with decentralized control. Functionally speaking, the P2P grids
are mainly for distributed computing and collaboration applications that have no fixed structures.
SETI@Home has centralized control for distributed supercomputing applications. The rest of the
P2P networks fall in various squares in the 2D design space as shown. Note that some lattice
squares are empty or are labeled with a question mark, meaning no systems are being built yet
under that category.
The expected degree of a vertex is E{d} = (n − 1)p. If p = c/n, then for large n, P{d = k} = (c^k / k!) e^{−c}; that is, the node degree follows a Poisson distribution. For unstructured P2P systems, the Erdos-Renyi random graph is too random to design a distributed algorithm for determining the forwarding path from one peer to another peer.
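For completeness, the Poisson form follows from the binomial degree distribution of the Erdos-Renyi model; in LaTeX notation, with p = c/n:

P\{d = k\} = \binom{n-1}{k}\left(\frac{c}{n}\right)^{k}\left(1-\frac{c}{n}\right)^{n-1-k}
\;\longrightarrow\; \frac{c^{k}}{k!}\, e^{-c} \quad \text{as } n \to \infty .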
The three designs compared below are the centralized model, the pure P2P model, and the hierarchical model with super peers:
• File indices: Centralized: index servers keep all file links. Pure P2P: each node keeps file indices for itself only. Hierarchical: super nodes keep the indices for the nodes that connect to them.
• Search algorithm: Centralized: queries are sent to servers, which return the requested files. Pure P2P: queries are flooded over the entire overlay, and matching peers reply to the originators. Hierarchical: queries go to super nodes, are flooded among the super nodes, and replies go back to the originators.
• Search time (hops): Centralized: O(1). Pure P2P: O(log n). Hierarchical: O(log m), where m counts the super nodes.
• Search cost (# messages): Centralized: O(1). Pure P2P: O(d*n), where d is the node degree. Hierarchical: O(c*m), where c is the super-node degree.
• File fetch method: Centralized: the originator gets the file from a single peer. Pure P2P: the query originator gets the file directly from a peer. Hierarchical: the originator gets the file from multiple peers concurrently.
• Scalability: Centralized: limited by the central servers. Pure P2P: limited by the high search cost. Hierarchical: relatively good scalability due to super peers.
• Fault tolerance: Centralized: single point of failure. Pure P2P: good fault tolerance due to equality among peers. Hierarchical: super peers are vulnerable to attacks.
file-sharing systems, data is distributed randomly over the peers. Flooding on the overlay is used to
find files in response to requests. The TTL is imposed to limit the flooding, in order to save heavy
traffic generated by flooding. Moreover, there is no guarantee on the search results.
The major difference among these systems lies in the fact that Napster applies a centralized con-
trol, Gnutella uses pure P2P networking, and KaZaA applies a hierarchical structure. Napster has a
constant search time. Flooding in Gnutella takes the longest search time across the entire network.
KaZaA performs in between, because binary search is conducted only across the super nodes. Both
Napster and Gnutella have limited scalability. KaZaA has better scalability. However, these three
P2P file-sharing services have all suffered greatly from illegal downloads by copyright violators in
recent years.
FIGURE 8.10
Flooding scheme used in Gnutella to search for peers that can provide files of digital content.
FIGURE 8.11
Gnutella packet descriptor format: a 23-byte header containing the Descriptor ID (bytes 0–15), the payload descriptor (byte 16), the TTL (byte 17), the Hops count (byte 18), and the payload length (bytes 19–22), followed by a payload of variable length.
The payload descriptor field only uses 1 byte. Gnutella defines five main types of descriptor: Ping (0x00), Pong (0x01), Push (0x40), Query (0x80), and QueryHit (0x81). The TTL and Hops fields use one byte each. The TTL is initialized by the descriptor originator and is decreased by one each time the descriptor traverses a servant. When the TTL reaches 0, the descriptor is discarded.
The TTL limits the horizon of the descriptor in order to save the communication cost of unlimited flooding. The Hops field counts the number of hops the descriptor has traveled. The last four bytes of the header hold the payload length, which is followed by a payload of variable size. For example, the Ping has no payload, while the payloads of the other four descriptors vary in size.
The Ping descriptor is used by servants to discover others. A servant that receives a Ping sends back one or more Pong descriptors along the reverse of the Ping's path. A Pong carries the address of the responding servant and the amount of data it shares. Query descriptors are used for search, while a QueryHit descriptor is the response to a Query. The Push descriptor enables firewalled servants to share files with others. A new servant contacts bootstrap nodes to join. After joining the system, a servant communicates with others using descriptors. It sends a Ping to discover more servants to connect with.
In general, the Ping and Query descriptors are sent in a flooding fashion. A servant that receives such a descriptor forwards it to its neighboring servants. Clearly, if the horizon of the flooding is not limited, such descriptors would, on average, be forwarded to d neighbors by each peer, where d is the average number of neighboring peers of each node. Thus, an n-node Gnutella system must use d*n
messages, within which (d-1)*n messages are duplicated. Such a large number of messages limits system
scalability.
Gnutella limits the horizon of flooding with a TTL in each descriptor. The TTL is decreased by one each time the descriptor is forwarded. If the TTL expires, that is, it reaches 0, the descriptor is discarded without further forwarding. By default, the TTL is set to 7, since within this horizon 95 percent of servants can be reached. However, the requested files may be owned only by the unreached 5 percent of servants. In such cases, the requests cannot be satisfied, even though files that match the requests do exist.
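The reach of TTL-limited flooding can be estimated with a few lines of code. The sketch below is only a rough illustration, not a measurement of Gnutella: it assumes an idealized overlay in which every servant forwards to d neighbors and duplicate messages are not suppressed, which yields an upper bound on both reach and message count.

```python
def flooding_estimate(d=4, ttl=7):
    """Estimate servants reached and messages sent by TTL-limited flooding.

    Assumes tree-like growth: every servant forwards to d neighbors and
    duplicates are not suppressed (an idealized upper bound).
    """
    reached = 1          # the query originator
    messages = 0
    frontier = 1         # servants forwarding at the current hop
    for _ in range(ttl):
        sent = frontier * d      # each frontier servant forwards to d neighbors
        messages += sent
        frontier = sent          # newly contacted servants forward at the next hop
        reached += sent
    return reached, messages

if __name__ == "__main__":
    for d in (3, 4, 5):
        r, m = flooding_estimate(d)
        print(f"d={d}: ~{r} servants reached, ~{m} messages")
```

Even for modest d, the message count grows geometrically with the TTL, which is why the default horizon is kept small.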
The flooding search in Gnutella is suitable for highly replicated files (i.e., hay), not for rarely replicated ones (i.e., needles). Since users often request hay, Gnutella works well in practice. The request originator downloads the file from one peer directly using HTTP. Since version 0.6, Gnutella has leveraged powerful peers as ultrapeers to construct a hierarchical architecture similar to KaZaA's. Descriptors are flooded only among ultrapeers, greatly reducing search cost and improving scalability.
FIGURE 8.12
Key mapping in distributed hash tables (DHTs). Content objects such as video and audio files are hashed (e.g., to identifiers CcDd7890 and EeFf5678), and the resulting keys are mapped onto peers in the P2P network.
The DHT applies consistent hashing to map keys to nodes. The consistent hashing should define an abstract notion of proximity or distance between two nodes. Keys are mapped to the nearest nodes, that is, those with minimum distance or maximum proximity. When a new node joins or an existing node leaves, consistent hashing causes only a small impact on the existing nodes through key reassignment. For n nodes with k keys, each node is responsible for at most a (1 + ε)/n fraction of the whole ID space with high probability, where ε = O(log n). The addition or removal of a node requires reassigning only an O(1/n) fraction of the keys. Thus, the DHT is suitable for P2P systems where nodes join and leave frequently.
The DHT provides two primitive operations: put(key, data) and get(key). The put primitive stores the data or a data index under the key on the node whose identifier is closest to the key, while the get primitive locates that node and retrieves the data or data index associated with the key. Peer nodes form an overlay network, and every node keeps links to a subset of the other nodes. A get request for a key with identifier k is forwarded over the overlay toward ever-closer nodes until it arrives at the node whose identifier is closest to k.
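As a concrete illustration of the two primitives, the sketch below implements a toy DHT on a ring of identifiers using consistent hashing. The SHA-1-based hash, the 16-bit identifier space, and the peer names are our own assumptions for illustration and do not correspond to any particular DHT protocol.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 16                          # assumed identifier space of 2**16 points

def hash_id(name: str) -> int:
    """Map a key or node name onto the identifier ring."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << ID_BITS)

class ToyDHT:
    def __init__(self, node_names):
        # Sorted ring of (identifier, node name) pairs.
        self.ring = sorted((hash_id(n), n) for n in node_names)
        self.store = {name: {} for name in node_names}

    def _responsible_node(self, key_id: int) -> str:
        # First node clockwise from the key identifier (wrapping around the ring).
        ids = [nid for nid, _ in self.ring]
        idx = bisect_left(ids, key_id) % len(self.ring)
        return self.ring[idx][1]

    def put(self, key: str, data):
        node = self._responsible_node(hash_id(key))
        self.store[node][key] = data
        return node

    def get(self, key: str):
        node = self._responsible_node(hash_id(key))
        return node, self.store[node].get(key)

if __name__ == "__main__":
    dht = ToyDHT([f"peer{i}" for i in range(8)])
    owner = dht.put("video.mpg", "index-of-video")
    print("stored on", owner, "->", dht.get("video.mpg"))
```

A real DHT distributes this logic: each peer knows only a few other peers and the get request is routed hop by hop, as described next.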
Popular P2P networks that use DHTs include BitTorrent’s distributed tracker, the Bitcoin mone-
tary network, the Kad network, the Storm botnet, YaCy, and the Coral Content Distribution Net-
work. Some prominent research projects including the Chord project; the PAST storage utility;
P-Grid, a self-organized and emerging overlay network; and the CoopNet content distribution sys-
tem also apply the DHT in their overlay constructions. Others apply the DHT in resource discovery
in grid computing systems. Resource discovery searches for suitable resource types to match with
user application demands.
FIGURE 8.13
DHT for fast and secure search and other Internet applications. The DHT forms a self-organizing overlay network that serves as a P2P substrate layered on top of the TCP/IP Internet.
Table 8.3 compares different structured P2P systems, such as Chord, Pastry, CAN, and Kademlia. This section focuses on the structures of these structured P2P systems. In Table 8.3, n is the number of nodes in the P2P network. The quantity 2^b used to specify Pastry is the base of the identifier space, and log denotes the base-2 logarithm.
A good example of a DHT system, Chord, is introduced in Example 8.6. The identifier space of m-bit strings is mapped to a unidirectional logical ring modulo 2^m. The distance between identifiers id1 and id2 is measured by the interval traveled from id1 to id2 in the clockwise direction. Nodes are placed on the ring at the positions given by their identifiers. A node with identifier idx stores the data or data index whose key identifiers fall into the interval from the immediate predecessor of idx to idx itself. Each node keeps m chordal links to the nodes whose identifiers immediately succeed (x + 2^i) mod 2^m, where x is the node's own identifier, for 0 ≤ i ≤ m − 1.
FIGURE 8.14
An example of a Chord network with a 16-key search space, populated by nodes 0, 1, 3, 6, 8, 10, and 14. Node 0's finger table lists (gap, span, successor) entries (1, (1,2), 1), (2, (2,4), 3), (4, (4,8), 6), and (8, (8,1), 10). The finger table establishes direct links between the populated nodes located in the various span intervals.
2^2 away from N1. Although there are m entries in a peer's finger table, on average a peer's finger table contains information on only O(log n) nodes, where n is the total number of nodes in the search space.
In Figure 8.14, N1's finger table has three rows. However, the first two rows contain the same node's information, that is, N3's information. This stems from the fact that the identifier range from a node to its ⌊m − log n⌋-th entry is within distance 2^m/n. Since the nodes are randomly distributed on the identifier ring, on average one node lies within each range of size 2^m/n. Thus, on average, the number of nodes in a node's finger table is O(m − ⌊m − log n⌋ + 1) = O(log n). DHT-based search is often used to locate rarely replicated items.
To improve robustness, each node maintains a successor list, which starts from its immediate successor and continues down the ring. If the current successor fails, a node replaces the faulty successor with the first live entry in the list. A newly arriving node with ID X first contacts a bootstrap node, which forwards the join message to X's predecessor Y. The new node joins as Y's successor, and Y supplies entries to fill the new node's finger table. The nodes whose finger tables should contain X then update their tables. The join/leave operation of a node requires O(log^2 n) messages. Chord uses a stabilization procedure, run periodically on each node, to update the successor and finger-table entries.
FIGURE 8.15
A CAN network constructed by the repeated partitioning of the 2D coordinate space, and its routing process. Part (a) shows the unit square partitioned into zones such as (0–0.5, 0–0.5), (0.5–1, 0–0.5), (0–0.5, 0.5–1), and (0.5–0.75, 0.5–1) owned by nodes A through E; part (b) shows the greedy routing path among nodes A through H.
(Courtesy of S. Ratnasamy, P. Francis, and R. Karp [40] )
In Figure 8.15(a), node B and node C are neighbors of node A, but node D is not A's neighbor. Keys are mapped to the coordinate space using uniform hash functions. Each key is associated with a d-dimensional vector, which represents its position in the coordinate space. If a key's position falls in the zone owned by a node i, node i stores a pointer to the object related to the key or stores the object directly. In a d-dimensional coordinate space, each node maintains 2d neighbors on average, and the routing complexity is of the order O((d/4)(n^{1/d})).
A newly arriving node X contacts a bootstrap node, which is located using the DNS service. The bootstrap node supplies the addresses of some randomly chosen peers already populated in the overlay. The new node randomly chooses a point P in the coordinate space as its position and asks a populated node to route the join message to the node Y whose zone contains P. Node Y splits its zone into two parts of equal size and hands over one subzone to the incoming node.
The CAN overlay uses a greedy routing algorithm. A node greedily forwards the message to the neighbor whose coordinates are closest to the destination. Figure 8.15(b) shows a 2D CAN overlay and the routing path from node C to node H. In a CAN overlay with d-dimensional coordinates, each dimension has n^{1/d} nodes on average. On a torus, there are (1/4)n^{1/d} nodes between any two nodes on average. Since a message traverses all d dimensions on its way to the destination, this yields the routing complexity O((d/4)(n^{1/d})).
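The greedy forwarding rule can be illustrated with a short sketch. The node coordinates and neighbor lists below are invented for illustration and do not reproduce the zones of Figure 8.15; the routing rule, forwarding to the neighbor whose coordinates are closest to the destination point, is the one described above.

```python
import math

# Hypothetical 2D CAN overlay: node -> zone-center coordinates
NODES = {
    "A": (0.25, 0.25), "B": (0.25, 0.75), "C": (0.625, 0.75),
    "D": (0.875, 0.75), "E": (0.75, 0.25), "G": (0.125, 0.10), "H": (0.90, 0.10),
}
# Hypothetical neighbor lists (zones sharing a border)
NEIGHBORS = {
    "A": ["B", "E", "G"], "B": ["A", "C"], "C": ["B", "D", "E"],
    "D": ["C", "H"], "E": ["A", "C", "H"], "G": ["A"], "H": ["D", "E"],
}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def greedy_route(src, dest_point):
    """Forward greedily to the neighbor closest to the destination point."""
    path, current = [src], src
    while dist(NODES[current], dest_point) > 0.2:        # stop near the target zone
        nxt = min(NEIGHBORS[current], key=lambda n: dist(NODES[n], dest_point))
        if dist(NODES[nxt], dest_point) >= dist(NODES[current], dest_point):
            break                                        # no closer neighbor: deliver locally
        path.append(nxt)
        current = nxt
    return path

if __name__ == "__main__":
    print(greedy_route("C", NODES["H"]))   # e.g., ['C', 'E', 'H'] in this made-up layout
```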
FIGURE 8.16
The intuition of the hybrid P2P overlay construction: each peer keeps the primary components (the Pastry leaf set) and obtains the secondary components (the interest-based cluster view) for free.
Example 8.8 KaZaA Network Using Super Nodes to Form a Backbone Overlay
KaZaA connects peers in a two-layer hierarchical structure (see Figure 8.17). The upper layer consists of super nodes that are more powerful in terms of CPU, bandwidth, and stability. Lightweight user nodes form the second layer and connect to a parent super node. A super node acts as a hub for user nodes. A user node uploads to its parent super node the filename, content hash, and file description of each shared file. Thus, each super node maintains the file indices for all of its children nodes. The super node provides each child with a super-node list, which contains up to 200 super nodes in the system.
In 2006, more than 3 million KaZaA peers shared more than 3,000 terabytes of digital content, including MP3s, videos, and games. At one time, more than 50 percent of Internet traffic was generated by KaZaA users. On average, a user node stays connected to its super node for 56 minutes, and each super node has about 100 to 150 children. In total, the system has roughly 30,000 super nodes. Each super node keeps TCP connections to 30–50 other super nodes, so the super-node overlay is not fully connected; its connectivity is about 0.1 percent, and a connection between two super nodes lasts about 23 minutes on average.
Each super node acts like a mini-Napster hub by tracking the content and IP addresses of its descendants. A super node is represented by its IP address, port number, workload, timestamp, and similar data. This list is cached at the user node for use in bootstrapping and failure recovery. KaZaA encrypts all of its signaling traffic. A query from a user node is first submitted to its parent super node. If any files match the query, the super node answers the query directly. Otherwise, the super node forwards the query to other super nodes using TTL-limited flooding. KaZaA uses the UUHash to identify files uniquely. Nodes request files from multiple peers using this unique identifier in order to enable parallel downloading.
The UUHash hashes the first 300 KB of a file using MD5, yielding a 128-bit hash value. It then applies a smallhash function to 300 KB blocks at file offsets of 2^m MB, with m an integer starting from 0, until the offset reaches the end of the file. The smallhash generates a 32-bit hash value. The MD5 hash and this smallhash are concatenated, yielding a 160-bit hash as the file identifier. The UUHash is fast to compute since it hashes only a fraction of the file. This weakness allows users to modify large parts of a file without changing its UUHash identifier. Copyright infringement remains a serious problem, and both DDoS attacks on super nodes and content pollution have been encountered in the past.
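A rough sketch of the hashing scheme just described follows. The exact smallhash function used by UUHash is not specified here, so CRC32 is used purely as a stand-in; treat the block as an approximation of the idea, not as a compatible UUHash implementation.

```python
import hashlib
import zlib

BLOCK = 300 * 1024                      # 300 KB blocks, as described in the text

def uuhash_like(path: str) -> bytes:
    """Illustrative 160-bit file ID: MD5 of the first 300 KB concatenated with a
    32-bit hash over 300 KB blocks at offsets 2**m MB (CRC32 stands in for the
    unspecified smallhash used by KaZaA)."""
    with open(path, "rb") as f:
        md5_part = hashlib.md5(f.read(BLOCK)).digest()      # 128 bits

        small, m = 0, 0
        while True:
            f.seek((2 ** m) * 1024 * 1024)                  # offsets 1 MB, 2 MB, 4 MB, ...
            block = f.read(BLOCK)
            if not block:
                break
            small = zlib.crc32(block, small)
            m += 1
        small_part = small.to_bytes(4, "big")               # 32 bits

    return md5_part + small_part                            # 160-bit identifier

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(uuhash_like(sys.argv[1]).hex())
```

Because only a small, predictable fraction of the file is read, the identifier is cheap to compute, which is exactly what makes it easy to forge.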
FIGURE 8.17
The KaZaA architecture built with super nodes forming a backbone overlay above the user nodes.
The system automatically switches to a new download server when the current server fails to deliver, and it gives users an estimated download time. Super nodes are relatively stable, so node churn has a limited effect. The bandwidth demand on a super node is low, even when relaying VoIP traffic. The two-layer structure takes heterogeneity in capacity into account and enables network proximity awareness by assigning user nodes to nearby super nodes. Flooding is performed only on the upper layer, which improves scalability. However, the system still provides no real guarantees on search scope or search time.
FIGURE 8.18
A hybrid P2P architecture of Gnutella and Chord, in which high-capacity nodes (H1–H4) form a structured backbone and low-capacity nodes (L) attach to them as leaves.
Design issue: Gnutella | Chord | CAN | TerraDir | SkipNet
Overlay structure: Random graph | Circular identifier space | Multi-dimensional Cartesian space | Tree data hierarchy | Distributed skip list
Design parameters: N/A | r: immediate successors | d: dimensionality | N/A | N/A
Routing hops: O(N) | O(log N) | O(d · N^{1/d}) | O(h), h is the tree height | O(log N)
Node state: O(1) | O(log N) | O(2d) | O(1) | O(2 log N)
Node join: O(1) | O(log N) | O(2d) | O(1) | O(log N)
Load balancing and reducing query hotspots: N/A | Via consistent hashing and virtual nodes | Partitioned space, zone reassignment, caching and replication | Via caching and replication | Consistent hashing of the object-name suffix
Fault handling: Flooding to other valid neighbors | Successor list, application-layer replication | Multiple realities, overloading zones | Node replication, routing to the next best prefix node | Redundant neighbors at level 0, survive failures at the boundary
Network proximity: No | Proximity routing to the finger's successor with lowest latency | Topology-aware overlay and proximity routing | No | Proximity neighbor selection via P-Table and C-Table
Support for node heterogeneity: No | No, but possible through virtual nodes | No, but zones might be split in proportion to node capabilities | No | No
Self-organization: Yes | Yes | Yes | Yes | Yes
Guaranteed determinate search results: No, only within the broadcast scope | Yes | Yes | Yes | Yes
Data locality: No | No, destroyed by uniform hashing | No, destroyed by mapping to the hyperspace via uniform hashing | Yes | Yes
Applications: N/A | CFS [27] | Data-centric sensor networks | Resource discovery | Global event notification service
The preceding table compares all five overlay structures. With uniform hashing, both Chord and CAN destroy data locality. Network proximity is supported only in Chord, CAN, and SkipNet. All of them except Gnutella guarantee determinate search results. Gnutella applies a random graph. Chord applies a chordal identifier space, using fingers to identify immediate successors.
CAN applies a multidimensional Cartesian space. TerraDir has a tree data hierarchy, and SkipNet uses a distributed skip list. The major attraction of Gnutella is its constant node state and node-joining time, but its routing time is the longest, O(n). Flooding in Gnutella is an expensive way to search, but it supports fault tolerance. The rest all support fault tolerance using replication techniques. O(log n) routing is achieved in Chord and SkipNet. The routing time in TerraDir is proportional to the tree height (h). The dimension d of CAN affects its performance greatly.
DHT-based structured P2P overlay networks have strict, predefined structures, which facilitate message routing. Each node is associated with a globally unique identifier. The routing process gradually reduces the identifier distance from the message-processing node to the destination node. Although different structured overlays have different routing algorithms, the routing complexity is O(log n), where n is the number of nodes. We describe the routing algorithms for three representative structured P2P overlays in the following examples.
The finger whose identifier most closely precedes the destination identifier y is chosen as the next-step node. For example, in Figure 8.20, seven nodes are populated in a 16-identifier Chord structure. The minimum routing path from node 0 to node 8 is via node 6.
The routing complexity of Chord is O(log n) for an n-node system. Suppose node x selects the i-th entry in its finger table as the next-step node z. Then node z is at least 2^i away from x in the identifier space. Since the identifier of the (i + 1)-th entry of x's finger table is not smaller than y, node y is at most 2^{i+1} away from x in the identifier space. Therefore, the distance between nodes x and z is at least equal to the distance between nodes z and y, which means node z at least halves the distance between nodes x and y. After log n steps, the distance from the current node to the destination is at most 2^m/n, where m is the identifier length. Since the populated nodes are randomly distributed, on average one node exists within each range of size 2^m/n.
FIGURE 8.20
An example of the table lookup search in a Chord overlay network with populated nodes 0, 1, 3, 6, 8, 10, and 14. Node 0's finger table lists (gap, span, successor) entries (1, (1,2), 1), (2, (2,4), 3), (4, (4,8), 6), and (8, (8,1), 10); the routing path from node 0 to node 8 passes through node 6.
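The table lookup just illustrated can be traced with a short sketch. It assumes the seven populated nodes {0, 1, 3, 6, 8, 10, 14} in a 16-identifier space and applies the standard Chord rules: fingers point to the successor of (x + 2^i) mod 2^m (treating "succeed" as inclusive), and each hop forwards to the closest finger preceding the target. The trace reproduces the routing path 0 → 6 → 8 of the figure.

```python
M = 4
SIZE = 1 << M                               # 16-identifier ring
NODES = sorted([0, 1, 3, 6, 8, 10, 14])     # populated nodes, as in Figure 8.20

def successor(i):
    """First populated node clockwise from identifier i (inclusive)."""
    i %= SIZE
    return next((n for n in NODES if n >= i), NODES[0])

def fingers(x):
    """Fingers of node x: successor((x + 2**i) mod 2**M) for i = 0..M-1."""
    return [successor(x + (1 << i)) for i in range(M)]

def between(i, a, b):
    """True if identifier i lies in the clockwise-open interval (a, b)."""
    return (a < i < b) if a < b else (i > a or i < b)

def lookup(start, key):
    """Iterative lookup: each hop moves to the closest finger preceding the key."""
    path, node = [start], start
    target = successor(key)                 # node responsible for the key
    while node != target:
        candidates = [f for f in fingers(node) if between(f, node, key)]
        if not candidates:                  # no closer finger: the successor is next
            node = target
        else:                               # closest preceding finger w.r.t. the key
            node = max(candidates, key=lambda f: (f - node) % SIZE)
        path.append(node)
    return path

if __name__ == "__main__":
    print(lookup(0, 8))                     # prints [0, 6, 8], matching the figure
```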
Three approaches exploit network proximity in structured P2P overlays: geographic layout, proximity routing, and proximity neighbor selection.
For n = 2^16 (64 K) nodes, it takes at most 4 routing steps (log_16 32 K ≈ 4) to route the messages among 16 groups, compared with the log_2 64 K = 16 steps needed when using Chord.
With node failures, Pastry is definitely faster than Chord and CAN, although during failure recovery Pastry may take O(n) routing hops in the worst case. Each routing table entry points to a node that is close to the local node in the proximity space, among all node identifiers with the appropriate prefix. In this sense, Pastry has the best locality properties: the expected distance traveled by a message is within a small constant factor of the source-destination distance in the proximity space. This fast lookup within the proximity space makes Pastry well suited to building very large-scale P2P networks. Details of Pastry can be found in [39].
FIGURE 8.21
An example CRP-enabled overlay network design in six steps: (a) the initial network of nodes A, B, and C; (b) after node D joins; (c) after node E joins; (d) after node F joins; (e) after link adjustment; and (f) after node E departs. Edge labels show the link weights.
(Courtesy of Li, Xie, Hwang, and Li [24] )
Each node keeps a list of successor nodes down the base ring. If a node detects a faulty successor, it replaces that successor with the first live node in the list. In Figure 8.21(e), for example, node C keeps two successor nodes, E and D. If node E fails or leaves, C takes D as its new successor.
FIGURE 8.22
Average delivery time of five data dissemination schemes (including CAM-Chord, ACOM, and CRP under different parameter settings) over a P2P network.
(Courtesy of Z. Li, et al. [24] )
A fair reputation system is thus needed to establish trust or distrust among peers, based on the peers' recorded historical behavior. The purpose is to distinguish good peers from bad ones through a systematic screening process. The quality of a reputation system is indicated primarily by its accuracy and by how effectively it is updated over time.
The edge labels are the local scores issued between all (source, destination) pairs. The fraction value inside each node in Figure 8.23 is the global reputation score of that peer at time t. This global score results from aggregating all local scores issued by all peers toward the peer being evaluated, with each local score weighted by the issuer's own global reputation. In other words, the global reputation is the weighted sum of all local scores. For clarity, all global reputation scores are normalized so that their sum is always 1.
FIGURE 8.23
A directed graph showing the trust relationship among five peers (N1 through N5) in a P2P network. Edge labels are the local trust scores issued between peer pairs, and the value inside each node is its global reputation score (0.32, 0.001, 0.009, 0.04, and 0.63 for N1 through N5, respectively).
For example, the global scores of the five peers in Figure 8.23 are represented by a reputation vector of five components that add up to 1:
V(t) = {v1(t), v2(t), v3(t), v4(t), v5(t)} = {0.32, 0.001, 0.009, 0.04, 0.63}    (8.3)
Similarly, the remaining global scores for the other four peers are computed to yield the following
updated global reputation vector:
V(t + 1) = {v1(t + 1), v2(t + 1), v3(t + 1), v4(t + 1), v5(t + 1)} = {0.5673, 0.0063, 0, 0.1370, 0.2573}    (8.5)
This vector is unnormalized. Dividing each peer score by the sum of all component scores yields the normalized global reputation vector. Note that the sum of the five peer reputation scores equals 1 in the normalized vector.
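The normalization step is simple arithmetic. The short sketch below divides each component of the unnormalized vector in Eq. (8.5) by the component sum, so that the global reputation scores again add up to 1.

```python
v_unnormalized = [0.5673, 0.0063, 0.0, 0.1370, 0.2573]    # from Eq. (8.5)

total = sum(v_unnormalized)
v_normalized = [v / total for v in v_unnormalized]

print(f"sum before = {total:.4f}")
print("normalized vector:", [round(v, 4) for v in v_normalized])
print(f"sum after = {sum(v_normalized):.4f}")             # prints 1.0000
```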
FIGURE 8.24
A trust overlay network for P2P trust management, where each edge is weighted by the peer feedback score for the service provided. The global reputation of a peer (node) is calculated as the weighted sum of all local trust scores received on its incoming edges.
(Courtesy of R. Zhou and K. Hwang [57] )
Our system can incorporate different methods to generate feedback scores, such as Bayesian
learning. In a TON, every node keeps feedback scores for its neighbors. Because every peer has its
own criteria to generate feedback scores, the feedback will be normalized to local trust scores.
Each node Ni is rated with a global reputation score vi. This global reputation is aggregated from
local trust scores weighted by the global reputations of all in-degree neighbors. For example, the global reputation score of N2 could be calculated by weighting three incoming local scores (0.8, 0.7, 0.6) from N1, N5, and N10,000, respectively: v2 = 0.8v1 + 0.7v5 + 0.6v10000. Given v1 = 0.04, v5 = 0.0007, and v10000 = 0.000005, we obtain v2 = 0.8 × 0.04 + 0.7 × 0.0007 + 0.6 × 0.000005 = 0.032 + 0.00049 + 0.000003 = 0.032493. It is interesting to note that node N1 has a much higher reputation score v1 than v5 and v10000 in this example. Thus, node N1 carries more weight in the global reputation aggregation process.
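The aggregation rule just worked out for N2 generalizes to every node: each incoming local trust score is weighted by the rater's current global reputation and the products are summed. Below is a minimal sketch of this step; the data structure is our own, and only the entries for N2 mirror the numerical example above.

```python
# Incoming local trust scores: node -> list of (rater, local score)
incoming_scores = {
    "N2": [("N1", 0.8), ("N5", 0.7), ("N10000", 0.6)],     # as in the example above
}
global_reputation = {"N1": 0.04, "N5": 0.0007, "N10000": 0.000005}

def aggregate(node):
    """Global reputation = weighted sum of incoming local scores."""
    return sum(global_reputation[rater] * score
               for rater, score in incoming_scores[node])

print(f"v2 = {aggregate('N2'):.6f}")     # 0.032493, matching the worked example
```

In a full system this computation is iterated over all nodes until the reputation vector converges, followed by the normalization shown earlier.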
FIGURE 8.25
Distributed reputation ranking using the locality-preserving hash function over a DHT-based P2P system. Score managers such as node B (N6), node C (N8), and node D (N12) store triplets of the form <global reputation value, node ID> (e.g., <0.40, N8>, <0.28, N15>, <0.02, N12>) and exchange sort requests keyed on the hashed reputation values.
(Courtesy of R. Zhou and K. Hwang [57] )
Node N2 is the successor node of the maximum hash value 15, so node N2 initiates the process to find the m power nodes. Node N2 is responsible for the hash values in the union of the two ranges [15, 16] ∪ [0, 2]. Since it has no corresponding triplets within the range [15, 16], it stores zero triplets with the highest reputation values, that is, k = 0. Therefore, it sends a Top_m_Request(m = 1, k = 0) message to its predecessor node N15, which finds that its stored triplet with value 0.4 is the highest one. Thus, node N8 is the most reputable node in this example system. Multiple LPH functions could be used to prevent cheating by malicious peers.
FIGURE 8.26
Functional modules in the PowerTrust system for trust score collection and global reputation aggregation: a regular random walk module for initial reputation aggregation, a look-ahead random walk module for reputation updating, and a distributed ranking module for identifying the power nodes.
(Courtesy of Zhou and Hwang [57] )
FIGURE 8.27
Convergence overhead (number of aggregation iterations, 0–500) of the PowerTrust and EigenTrust reputation systems in a P2P network, plotted against the greedy factor α (0.0–1.0).
(Courtesy of Zhou and Hwang [57] )
than the threshold. The EigenTrust approach relies on a few pretrusted nodes to compute the global
reputations. They assumed that some peers are known to be trustworthy, essentially among the very
first few peers joining the system.
For fairness, the number of power nodes chosen equals the number of pretrusted nodes used in EigenTrust. Figure 8.27 shows the convergence overheads of the two reputation systems when the power nodes in PowerTrust and the pretrusted nodes in EigenTrust are allowed to leave freely. The iteration count of PowerTrust drops sharply to a flat, small number below 50 as α increases from 0.15 to 1, while EigenTrust still requires more than 100 iterations to converge. The EigenTrust overhead can reach as high as 400 iterations as the system grows to 4,000 nodes.
The EigenTrust system converges very slowly and cannot guarantee convergence when the pretrusted nodes are allowed to leave the system freely. In the PowerTrust system, the power nodes are re-elected after each aggregation round. Based on the distributed ranking mechanism, the score managers of departing power nodes notify the system to replace them in a timely manner with other, more qualified power nodes. The decrease in computation overhead means a significant reduction in network traffic and less work for all peers involved. The low overhead of the PowerTrust system makes it attractive for highly scalable P2P applications.
8.4.3.2 Query Success Rate
The PowerTrust system was applied in simulated P2P file-sharing experiments with more than 100,000 files in the simulated P2P system. The number of copies of each file is determined by a content power-law distribution with β = 1.2. Each peer is assigned a number
FIGURE 8.28
Query success rate (percent) versus the reputation aggregation round number (t) in distributed file sharing, using two reputation systems to compensate for the free departure of power nodes or pretrusted peers.
(Courtesy of Zhou and Hwang [57] )
of files based on the Saroiu distribution. At each time step, a query is randomly generated at a peer and completely executed before the next query/time step. The query distribution determines which file each query searches for. The queries are ranked according to their popularity; a power-law distribution with β = 0.63 is used for queries ranked 1 to 250, and β = 1.24 for the remaining lower-ranked queries. This distribution models the query popularity observed in existing P2P systems. When a query for a file is issued, the file owner with the highest global reputation is selected for downloading the desired file.
The query success rate is measured as the percentage of successful queries over the total number of queries issued. Every node may respond to a query with inauthentic files; for simplicity, the probability of this behavior is modeled as inversely proportional to the node's global reputation. Both cases, allowing and disallowing power nodes or pretrusted nodes to leave the system, are considered, along with a no-trust system, meaning a P2P system with no trust management, which randomly selects a node for downloading without considering reputation. Figure 8.28 shows the results with the free departure of power nodes or pretrusted nodes.
There are 1,000 queries issued after each round of global reputation aggregation. The query suc-
cess rate of PowerTrust is maintained at 90 percent after just one round of reputation aggregation.
The query success rate of EigenTrust drops from 85 percent to 50 percent after several aggregation
rounds. This is because the pretrusted nodes may leave the system randomly. The EigenTrust query
success rate drops to a 50 percent low, equal to that of a no-trust system, after 17 rounds of reputa-
tion aggregation.
Understanding what can be done to hinder the attacks is crucial to achieving harmonious operation in a P2P network. Four classes of network attacks often occur in a P2P network:
1. A P2P network could enter a churn condition if a large number of peers join or leave the network rapidly or randomly. Churn can create inconsistent behavior or resource deadlocks.
2. Distributed denial-of-service (DDoS) attacks are caused by flooding attacks on a target node.
3. Routing attackers attempt to reroute messages in order to steal content or to stage DDoS attacks.
4. Storage/retrieval attacks are launched by attackers to prevent the delivery of the requested data.
To become resilient to network churn, one can force nodes to sign all messages so that inconsistency can easily be detected. To cope with DDoS attacks, one can replicate the content and spread it over the network; the SOS example below presents a more sophisticated solution. Identity spoofing attacks occur when a peer node uses another node's identity or delivers bogus content. One solution is to insist that all peers hold certificates signed by a trusted authority. Another solution to prevent spoofing is to tie the identity to a protected IP address, or to send a query to verify the address.
To prevent routing attacks, one solution is to apply a convergence policy iteratively, so that each hop moves closer to the destination in the ID space. Another solution is to provide multiple paths to reroute messages around the attackers. Onion routing is also used to hide the identity of the originator and the final destination. This is done by recursively encrypting the incoming message and the next-hop information using public keys. In this way, any node knows only the immediate sender and the next receiver. To combat storage/retrieval attacks, one can replicate the data object to enhance availability.
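The recursive wrapping used by onion routing can be sketched as follows. The encrypt_for and decrypt_by helpers are hypothetical placeholders standing in for real public-key operations, so the block only illustrates the layering of next-hop information, not a usable onion-routing implementation.

```python
import json

def encrypt_for(node, payload: str) -> str:
    """Hypothetical stand-in for public-key encryption toward `node`."""
    return json.dumps({"to": node, "ciphertext": payload})

def decrypt_by(node, onion: str) -> str:
    """Hypothetical stand-in for decryption with `node`'s private key."""
    layer = json.loads(onion)
    assert layer["to"] == node, "layer not addressed to this node"
    return layer["ciphertext"]

def build_onion(path, message):
    """Wrap the message once per hop, innermost layer first."""
    onion = message
    for hop, nxt in zip(reversed(path), reversed(path[1:] + ["DESTINATION"])):
        onion = encrypt_for(hop, json.dumps({"next": nxt, "data": onion}))
    return onion

def forward(onion, path):
    """Each hop learns only the previous sender and the next receiver."""
    for hop in path:
        layer = json.loads(decrypt_by(hop, onion))
        print(f"{hop} forwards to {layer['next']}")
        onion = layer["data"]
    return onion      # plaintext message at the destination

if __name__ == "__main__":
    relays = ["R1", "R2", "R3"]
    print(forward(build_onion(relays, "hello"), relays))
```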
as fast as possible. Caching of P2P content is helpful to improve content downloading speed and
traffic locality. This section reviews fast file search methods. Replicas and consistency techniques
to enhance delivery speed and success rate are discussed as well.
Random walk algorithms reduce overhead in another way. A query node deploys k independent random walkers. Each walker carries a query message and walks on the overlay graph until the desired object is located. Random walk algorithms suffer from the same TTL-selection problem as flooding algorithms. However, in random walk algorithms each round increases the number of visited nodes only by k, so it is reasonable for walkers to check back periodically with the query node to determine whether to continue or stop. Random walk algorithms generally outperform expanding-ring algorithms and save on message overhead by two orders of magnitude, at the cost of a slight increase in query path length compared to flooding algorithms.
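A minimal sketch of k-walker random search on an unstructured overlay appears below. The adjacency list and object placement are fabricated for illustration, and the periodic check-back with the query node is reduced to a simple hop budget.

```python
import random

# Fabricated unstructured overlay (adjacency list) and object placement
OVERLAY = {
    1: [2, 3, 4], 2: [1, 5], 3: [1, 5, 6], 4: [1, 6],
    5: [2, 3, 7], 6: [3, 4, 7], 7: [5, 6],
}
HOLDERS = {7}          # peers that store the requested object

def k_random_walk(source, k=3, max_hops=16, seed=0):
    """Deploy k independent walkers; each forwards the query to a random neighbor."""
    rng = random.Random(seed)
    walkers = [source] * k
    messages = 0
    for hop in range(max_hops):
        for i, node in enumerate(walkers):
            if node in HOLDERS:
                return node, hop, messages        # object located
            walkers[i] = rng.choice(OVERLAY[node])
            messages += 1
    return None, max_hops, messages               # give up after the hop budget

if __name__ == "__main__":
    print(k_random_walk(1))
```

Because only k messages are issued per round, the search front grows linearly rather than exponentially, which is the source of the message savings noted above.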
The principles of scalable blind search algorithms for unstructured P2P overlays are summarized
as follows. First, the algorithms should adopt adaptive termination. Simply limiting the TTL values
does not work well as that would adversely affect search scope. Second, the number of visited
nodes should grow slowly with the increase in search round. This is because only a small portion
of the nodes are required to be visited. If the number of visited nodes grows quickly (e.g., exponen-
tially in flooding algorithms), most of the messages generated at the last several hops are duplicated.
The message overhead incurred by blind search algorithms can be greatly reduced if nodes keep some information about the files shared by others. Informed search algorithms are designed for this purpose: each node keeps some information about the files shared by others, and this information is used to guide the forwarding of query messages. The ideal case is that each node keeps an index of all the files shared by others, so that the query path is only one hop long. However, this is not feasible, because it would require not only unaffordable storage but also a great deal of synchronization overhead among peers.
Search performance can also be enhanced through overlay topology adaptation. In Gnutella, one can take the heterogeneity of node capacity into account by assigning more load to high-capacity nodes. High-capacity nodes adaptively build more overlay links and keep the file indices of the nodes one hop away. Since random walks are biased toward high-degree nodes, most queries are forwarded to the high-capacity nodes and satisfied there. Another form of topology adaptation considers the semantic interests of nodes. Semantically relevant nodes are organized into the same semantic clusters on the overlay, and queries are forwarded to semantically relevant clusters for answers. The idea is that if nodes are semantically relevant, they are likely to be relevant to the same queries. BitTorrent is the most popular P2P content distribution application; it uses swarming content distribution.
FIGURE 8.29
The BitTorrent system architecture. A downloader retrieves the torrent file from a web site, obtains peer information from the tracker, and then exchanges file pieces with other downloaders and with the seed that holds the complete file.
(Courtesy of “BitTorrent Protocol Specification”; www.bittorrent.org/protocol.html, 2006 )
The file is divided into pieces of equal size, ranging from 32 KB to 4 MB, with a typical size of 256 KB.
The publisher of the file creates a hash for each piece using the SHA-1 hash function. The hash for each
piece is used to verify the integrity of the piece during downloading and is stored in the torrent file. A file
piece may be further divided into blocks. In this case, each piece is a trading unit, while a block is a
request unit. Only after all the blocks of a piece have been downloaded can this piece be publicly shared
with others.
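The per-piece integrity check can be sketched directly with Python's hashlib. The 256 KB piece size follows the typical value quoted above; the first function simply returns the list of SHA-1 digests a publisher would place in the torrent file, and the metainfo encoding and block subdivision are omitted.

```python
import hashlib

PIECE_SIZE = 256 * 1024          # typical piece size quoted in the text

def piece_hashes(path: str, piece_size: int = PIECE_SIZE):
    """Return the SHA-1 digest of each fixed-size piece of the file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            piece = f.read(piece_size)
            if not piece:
                break
            hashes.append(hashlib.sha1(piece).hexdigest())
    return hashes

def verify_piece(piece: bytes, expected_hex: str) -> bool:
    """A downloader re-hashes a received piece and compares it with the torrent entry."""
    return hashlib.sha1(piece).hexdigest() == expected_hex
```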
A new node first gets the torrent file from a web site. By parsing the file, it obtains the tracker's address and asks the tracker for peer information. Usually, the tracker responds with a peer list containing 50 randomly selected peers. The number of TCP connections initiated by a peer is limited to 35, and the total number of connections of a peer is limited to 55. If a client ever fails to maintain at least 20 connections that it initiated, it recontacts the tracker to obtain additional peers. To keep the tracker's global view of the system up to date, active nodes report their states to the tracker periodically (every 30 minutes) and at the time of joining or departure.
Each peer has a bit-vector indicating the pieces' availability on it, one bit per piece. The bit-vectors are exchanged between neighbors. If peer y's neighbor x finds that y has a piece that x does not have, x sends an "interested" message to y to request that piece. Whether and when y sends the requested piece to x is decided by an incentive strategy called tit-for-tat (TFT). All connections are choked by default. Each peer periodically (every 10 seconds) updates the download rates from all of its neighbors and chooses four neighbors to unchoke.
The TFT is an incentive strategy to combat free riders, the peers that contribute no or very limited upload bandwidth. A peer also uses an optimistic unchoking strategy, unchoking a uniformly randomly chosen peer every 30 seconds, which enables nodes that have no or very few file pieces to get started. When unchoked by a neighboring peer, a node selects the file piece that is least replicated among its neighbors. This selection is called (local) rarest-first (RF) piece selection. A new peer that has no file piece, however, randomly chooses a piece to download. The RF piece selection scheme promotes a random distribution of file pieces across the overlay network, which increases the diversity of pieces held by neighboring peers.
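The local rarest-first rule can be written in a few lines. The bit-vectors below are invented; the function counts, over the neighbors' bit-vectors, how many copies of each still-missing piece exist, picks the least replicated one (ties broken at random), and falls back to a random piece for a brand-new peer.

```python
import random
from collections import Counter

def rarest_first(my_pieces, neighbor_bitvectors, rng=random):
    """Pick the missing piece that is least replicated among the neighbors."""
    num_pieces = len(my_pieces)
    if not any(my_pieces):                      # new peer: pick a random first piece
        return rng.randrange(num_pieces)
    counts = Counter()
    for bv in neighbor_bitvectors:
        for idx, has in enumerate(bv):
            if has and not my_pieces[idx]:
                counts[idx] += 1
    if not counts:
        return None                             # neighbors offer nothing new
    rarest = min(counts.values())
    return rng.choice([i for i, c in counts.items() if c == rarest])

if __name__ == "__main__":
    mine = [1, 0, 0, 0]
    neighbors = [[1, 1, 0, 1], [1, 1, 1, 1], [0, 1, 0, 1]]
    print(rarest_first(mine, neighbors))        # piece 2 is the rarest (one copy)
```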
Figure 8.30 shows the hierarchical structure of multiple swarms built up in a BitTorrent network. A swarm is a collection of peer machines that form a virtual cluster under the coordination of special trackers. Interswarm traffic is possible if the local swarm cannot satisfy a user request. Multiple trackers exchange updated directory information to keep the global view consistent. The idea is quite similar to that of the KaZaA architecture shown in Figure 8.17. The purpose is to improve management efficiency and shorten search time for users.
By convention, the name of a torrent file has the suffix (.torrent). Torrent files have an “announce” sec-
tion, which specifies the URL of the tracker, and an “info” section, containing the suggested names for the
files, their lengths, the piece length used, and an SHA-1 hash code for each piece, all of which are used
FIGURE 8.30
An operational view of a BitTorrent system built with multiple swarms of resource nodes under the coordination of special trackers.
by clients to verify the integrity of the data they receive. The tracker maintains lists of the clients currently participating in the torrent. Alternatively, with decentralized tracking, which works without a central tracker, every peer acts as a tracker. Azureus was the first BitTorrent client to implement such a system through a DHT. An alternative and incompatible DHT system, known as Mainline DHT, was later developed and adopted by BitTorrent, µTorrent, and other clients.
FIGURE 8.31
Auxiliary structure for a replica group of physically close peers attached to the Chord ring.
First generation (client/server systems): Clients access a server using FTP and HTTP | Limited by single-server bandwidth | Small files with limited downloads | Apache (www.apache.com/), GetRight (www.getright.com/), CuteFTP (www.cuteftp.com/)
Second generation (CDNs): Clients connect to multiple servers | Load balancing among servers | Video streaming, virus signatures | Akamai (www.akamai.com/), SyncCast (www.synccast.com/)
Third generation (P2P file-sharing systems): P2P overlay network | Low security; peers are not secured and are unreliable | Large files where speed is not critical | BitTorrent (www.bittorrent.org/), eDonkey (www.edonkey.com/), eMule (www.emule.org/)
FIGURE 8.32
The concept of a global CDN using surrogate servers in major user regions or countries. An origin server in the United States and a CDN content distributor feed surrogate servers in Japan, China, Australia, and the U.K., which in turn serve the clients in their respective regions.
(Courtesy of Pallis and Vakali [37] )
Figure 8.32 shows the conceptual architecture of a global-scale CDN, modeled after the Akamai Technologies system. By 2006, the Akamai CDN provided 80 percent of overall content delivery services. At that time, more than 12,000 surrogate servers were deployed in more than 1,000 networks across 62 countries; this number is now much greater. Surrogate servers act as cache centers for distributing digital content more cost-effectively in terms of transmission, storage, access speed, and fault tolerance. Each surrogate center is responsible for distributing the requested content to the end users in its country or region, while the CDN content distribution center coordinates the traffic among the servers. For copyright-protection reasons, open P2P technology is not used here for commercial content distribution.
A P2P network does not require many expensive servers to deliver content. Instead, content is
distributed and shared among the peers. P2P networks feature improvements over conventional
CDNs in terms of content availability and system scalability. Electronic publishing was hindered by
the rapid growth of copyright violations. The major source of illegal P2P content distribution lies in
peer collusion to share copyrighted content with other peers or pirates.
Digital watermarking is often considered for digital copyright protection. A unique watermark is injected into each distributed copy of a content file so that when a pirated copy is discovered, authorities can trace the origin of the piracy. In a P2P network, however, all peers share exactly the same file (if it is not poisoned), which effectively defeats the purpose of watermarking. Thus, watermarking is not a suitable technology for P2P file sharing. By subdividing a large file into
smaller chunks, P2P content delivery allows a peer to download multiple chunks from different
sources. File chunking increases availability and decreases download time. Table 8.7 summarizes
three major P2P CDN families.
P2P networks in the same family have some common features. They are variants or descendants
of their predecessors: BitTorrent, Gnutella, and eMule, respectively. These families are distinguished
primarily by the file-chunking or hashing protocols applied. At the time of this writing, none of
these P2P networks has been built with satisfactory support for copyright protection. The BitTorrent
family applies the strongest hashing, at the individual piece or chunk level, which is most resistant to
poisoning. The Gnutella family applies file-level hashing, which is easily poisoned. The eMule
family applies part-level hashing with fixed chunking.
Our analytical and experimental results show that the eMule family demonstrates a moderate level of
resistance to content poisoning. The ability to detect and identify poisoned chunks is different in the
three P2P network families. The BitTorrent family keeps clean chunks and discards poisoned chunks.
Table 8.7 File Chunking and Hashing Schemes in P2P Content Networks
P2P networks: BitTorrent Family | Gnutella Family | eMule Family
Chunking scheme: Divides files into fixed-size chunks (256 KB), called pieces | Peers negotiate the chunk size at run time, 64 KB/chunk (180 KB) | Divides files into 9,500 KB parts, each with 53 chunks
Hash distribution: SHA hashing at the piece level, embedded in the index file | SHA hashing applied to the entire file to generate a unique file ID; no chunk-level hashing | MD-4 hashing at the part level; peers exchange hash sets to detect corrupted content
Poison resistance: Poison detected at the piece level; each piece is handled independently | Poison detected only after downloading the entire file; heavy overhead if poisoned | Poison detected at the part level; works if the part hash set is not poisoned
Download policy: Keeps clean pieces and discards poisoned ones; repeats download until all chunks are clean | Repeats downloading of the entire file until all chunks are clean, a time-consuming policy | Keeps clean parts and discards poisoned ones; downloads until all parts are clean
Example networks: BitTorrent, Snark, BitComet, BNBT, BitTyrant, Azureus, JTorrent, etc. | Gnutella, KaZaA, LimeWire, Phex, Freenet, BareShare, Swapper, Ares, etc. | eMule, aMule, Shareaza, iMule, Morpheus, eDonkey, FastTrack, etc.
The Gnutella family must download the entire file before any poisoned chunks can be detected. Because
of part-level hashing, the eMule family will either keep or discard the entire part of 53 chunks.
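The difference in poison detectability can be illustrated with a toy sketch: with piece-level hashing each received chunk is checked against its own digest and only the poisoned chunk needs to be re-fetched, whereas with a single file-level hash the corruption is noticed only after the whole file has been assembled. The chunk contents below are fabricated.

```python
import hashlib

def sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

clean_chunks = [b"chunk-0", b"chunk-1", b"chunk-2", b"chunk-3"]
piece_digests = [sha1(c) for c in clean_chunks]        # published in the index file
file_digest = sha1(b"".join(clean_chunks))             # single file-level hash

# A poisoner corrupts one chunk during delivery.
received = list(clean_chunks)
received[2] = b"poisoned!"

# Piece-level hashing (BitTorrent style): only chunk 2 is flagged and re-fetched.
bad = [i for i, c in enumerate(received) if sha1(c) != piece_digests[i]]
print("piece-level check flags chunks:", bad)

# File-level hashing (Gnutella style): the mismatch appears only after full download.
print("file-level check passes:", sha1(b"".join(received)) == file_digest)
```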
FIGURE 8.33
A secure P2P platform for copyright-protected content delivery. Distribution agents serve paid clients over a P2P network that may also contain colluders and pirates; the platform supports pirate identification, colluder detection, content delivery, and proactive poisoning.
(Courtesy of Lou and Hwang [31] )
Table 8.8 Key Protocols and Mechanisms for Building a Trusted P2P Content Delivery System
Secure file indexing: The file index format is modified to include a token and an IBS signature.
PAP: A peer sends a digital receipt to its bootstrap agent and obtains an authorization token. The token must be refreshed periodically.
Proactive content poisoning: The token and IBS signature are checked on all download requests and responses, and clean or poisoned content is sent accordingly.
Random collusion prevention: Distribution agents randomly recruit decoys to probe for colluders. Collusion reports are weighted against client trust rates.
The system uses 10 PC-based distribution agents to handle a swarm size of 2,000 peers. These agents authorize peers to download and prevent piracy by unpaid peers.
Paid clients, colluders, and pirates are all mixed up without visible labels. The copyright-
protection network is designed to distinguish them automatically. Each client is assigned a bootstrap
agent, selected from one of the distribution agents, as its entry point. In current P2P networks, a
peer can self-assert its username without verification. Therefore, the peer endpoint address (IP
address + port number) instead of the username is used to identify a peer. A peer is considered
fully connected if it is reachable via a listening port on its host.
Table 8.8 summarizes the key protocols and mechanisms used to construct a trusted P2P system.
In this approach, the modified file index format enables pirate detection. Peer Authorization Proto-
col (PAP) authorizes legitimate download privileges to clients. The content distributor applies con-
tent poisoning to disrupt illegal file distribution to unpaid clients. The system is enhanced by
randomized collusion detection. In this system, a content file must be downloaded in full to be useful. Such a constraint is easily achieved by compressing and encrypting the file with a trivial password that is known to every peer. This encryption does not offer any protection of the content; it merely packages the entire file for distribution.
FIGURE 8.34
Proactive poisoning in a trusted P2P network, where clean chunks (white) are received by legitimate clients and poisoned chunks (shaded) by detected pirates.
(Courtesy of Lou and Hwang [31] )
FIGURE 8.35
A distribution agent randomly recruits some clients as decoys to probe suspected peers: (1) the agent recruits decoys; (2) a decoy issues an illegal request; (3) the probed peer returns clean or poisoned chunks; (4) the decoy reports a colluder when a clean copyrighted file chunk is supplied to the illegal requestor.
(Courtesy of Lou and Hwang [31] )
suspected peers. If an illegal request is returned with a clean file chunk, the decoy reports the collu-
sion event. Since the decoy is randomly chosen, there exists a risk that the report is not trustworthy
either by error or by cheating. Thus, a reputation system is needed to screen the peers.
A new PAP was developed to distinguish pirates from legitimate clients. Detected pirates will
receive poisoned chunks in their repeated attempts. Pirates are thus severely penalized with no chance to
download successfully in a tolerable amount of time. Based on simulation results, a 99.9 percent preven-
tion rate was found in Gnutella, KaZaA, and Freenet, while for eMule, eDonkey, Morpheus, and others
the rate is between 85 percent and 98 percent. This prevention system is shown to be less effective in
protecting poison-resilient networks such as BitTorrent and Azureus. However, the poisoning approach
opens up low-cost P2P technology for copyrighted content delivery. The advantage lies mainly in mini-
mum delivery speed, higher content availability, and copyright compliance (see Figure 8.36).
FIGURE 8.36
The protected peer-joining process for copyrighted P2P content delivery, using seven messages (including an authentication request, a private-key request and reply, and an authentication token) to secure communications among the four parties involved.
(Courtesy of Lou and Hwang [31] )
FIGURE 8.37
Protection success rate of a simulated eMule network achieved by prolonging the delivery delay to an intolerable magnitude (curves shown for the Gnutella, eMule, and BitTorrent families).
(Courtesy of Lou and Hwang [31] )
Without poisoning, the average download time for a client is 1.5 hours. The Gnutella family, including KaZaA and LimeWire, has a near-perfect protection success rate (higher than 99.9 percent) for both file sizes. The eMule network averages 85 percent after a tolerance of up to 20 days. These success rates are satisfactory, because most pirates will give up after a few days without success.
With a tolerance threshold of 20 days in Figure 8.37, the eMule network has an average
98 percent success rate for the 4.5 GB file and 85 percent success rate for the 700 MB file. In
comparison, an average pirate in a BitTorrent network takes about 100 minutes to download the
700 MB file, only marginally longer than a paid client. Hence the success rate drops rapidly
before two hours. This implies that our system does not protect BitTorrent well due to its strong
resistance to content poisoning.
Acknowledgements
This chapter was jointly written by Zhenyu Li of the Chinese Academy of Sciences (CAS) and by
Kai Hwang of the University of Southern California. The authors would like to acknowledge the
technical support from Gaogang Xie of CAS Institute of Computing Technology.
References
[1] S. Androutsellis, D. Spinellis, A survey of P2P content distribution technologies. ACM. Comput. Surv.
(December) (2004).
[2] L. Amaral, A. Scala, M. Barthelemy, H.E. Stanley, Classes of small-world networks, Proc. Natl. Acad. Sci. USA 97 (21) (2000).
[3] S.A. Baset, H. Schulzrinne, An analysis of the Skype peer-to-peer Internet telephony protocol, in: Pro-
ceedings of IEEE INFOCOM, April 2006.
[4] R. Bindal, P. Cao, W. Chan, et al., Improving traffic locality in BitTorrent via biased neighbor selection,
in: Proceedings of IEEE ICDCS, 2006.
[5] B. Bloom, Space/time tradeoffs in hash coding with allowable errors. Commun. ACM. 13 (7) (1970)
422–426.
[6] J. Buford, H. Yu, E. Lua, P2P Networking and Applications, Morgan Kaufmann, December 2008. Also
www.p2pna.com.
[7] M. Castro, P. Druschel, Y.C. Hu, A. Rowstron, Exploiting network proximity in distributed hash tables,
in: Proceedings of the International Workshop on Future Directions in Distributed Computing, June 2002.
[8] X. Cheng, J. Liu, NetTube: exploring social networks for peer-to-peer short video sharing, in: Proceedings
of IEEE Infocom, March 2009.
[9] S. Chen, B. Shi, S. Chen, ACOM: any-source capacity-constrained overlay multicast in non-DHT P2P net-
works, in: IEEE Transactions on Parallel and Distributed Systems, September 2007, pp. 1188–1201.
[10] I. Clarke, O. Sandberg, B. Wiley, T.W. Hong, Freenet: a distributed anonymous information storage and
retrieval system, in: ICSI Workshop on Design Issues in Anonymity and Unobservability, June 2000.
[11] E. Cohen, S. Shenker, Replication strategies in unstructured peer-to-peer networks, in: ACM SIGCOMM,
2002.
[12] A. Fiat, J. Saia, M. Young, Making chord robust to Byzantine attacks, in: Proceedings of the European
Symposium on Algorithms (ESA), 2005.
[13] A.J. Ganesh, A.M. Kermarrec, L. Massoulié, Peer-to-peer membership management for gossip-based pro-
tocols. IEEE. Trans. Computers. 52 (2) (2003) 139–149.
[14] P.B. Godfrey, S. Shenker, I. Stoica, Minimizing churn in distributed systems. in: Proceedings of ACM
SIGCOMM, 2006.
[15] S. Guha, N. Daswani, R. Jain, An experimental study of the Skype peer-to-peer VoIP system, in: Proceed-
ings of the International Workshop on Peer-to-Peer Systems (IPTPS), February 2006.
[16] K.P. Gummadi, R. Gummadi, S.D. Gribble, et al., The impact of DHT routing geometry on resilience and
proximity, in: Proceedings of ACM SIGCOMM, 2003.
[17] M. Hefeeda, O. Saleh, Traffic modeling and proportional partial caching for peer-to-peer systems, in:
IEEE/ACM Transactions on Networking, December 2008, pp. 1447–1460.
[18] Y. Huang, et al., Challenges, design and analysis of a large-scale P2P VOD system, in: Proceedings of
ACM SIGCOMM 2008, Seattle, August 2008.
[19] K. Hwang, D. Li, Trusted cloud computing with secure resources and data coloring, IEEE Internet Computing (September) (2010) 14–22.
[20] S. Kamvar, M. Schlosser, H. Garcia-Molina, The Eigentrust algorithm for reputation management in P2P
networks, ACM WWW ’03, Budapest, Hungary, 2003.
[21] A. Keromytis, V. Misra, D. Rubenstein, SOS: secure overlay services, in: Proceedings of ACM SIGCOMM '02, Pittsburgh, PA, August 2002.
[22] J. Kleinberg, The small-world phenomenon: an algorithmic perspective, in: Proceedings 32nd ACM Sym-
posium on Theory of Computing, Portland, OR. May 2000.
[23] J. Leitao, J. Pereira, L. Rodrigues, Epidemic broadcast trees, in: Proceedings of the 26th IEEE Interna-
tional Symposium on Reliable Distributed Systems, October 2007.
[24] Z. Li, G. Xie, K. Hwang, Z. Li, Churn-resilient protocol for massive data dissemination in P2P networks,
in: IEEE Transactions on Parallel and Distributed Systems, accepted to appear 2011.
[25] Z. Li, G. Xie, Z. Li, Efficient and scalable consistency maintenance for heterogeneous peer-to-peer sys-
tems, in: IEEE Transactions on Parallel and Distributed Systems, December 2008, pp. 1695–1708.
[26] Z. Li, G. Xie, Enhancing content distribution performance of locality-aware BitTorrent systems, in: Pro-
ceedings of IEEE Globecom, December 2010.
[27] L. Liu, W. Shi, Trust and reputation management, in: IEEE Internet Computing, September 2010,
pp. 10–13. (special issue).
[28] Y. Liu, L. Xiao, X. Liu, L. M Ni, X. Zhang, Location awareness in unstructured peer-to-peer systems,
in: IEEE Transactions on Parallel and Distributed Systems, February 2005, pp. 163–174.
[29] J. Liu, S.G. Rao, B. Li, H. Zhang, Opportunities and challenges of peer-to-peer internet video broadcast,
in: Proceedings of the IEEE Special Issue on Recent Advances in Distributed Multimedia Communica-
tions, Vol. 96 (1), 2008, pp. 11–24.
[30] T. Locher, P. Moor, S. Schmid, R. Wattenhofer, Free riding in BitTorrent is cheap, in: Proceedings of ACM HotNets, November 2006.
[31] X. Lou, K. Hwang, Collusive piracy prevention in P2P content delivery networks, in: IEEE Transactions
on Computers, Vol. 58, (7) 2009, pp. 970–983.
[32] B.T. Loo, R. Huebsch, I. Stoica, J. M. Hellerstein, The Case for a Hybrid P2P Search Infrastructure, 3rd
International Workshop on Peer-to-Peer Systems, February 2004.
[33] Q. Lv, P. Cao, E. Cohen, K. Li, S. Shenker, Search and replication in unstructured peer-to-peer networks,
in: Proceedings of the ACM International Conference on Supercomputing, June 2002.
[34] B. Maniymaran, M. Bertier, A.-M. Kermarrec, Build one, get one free: leveraging the coexistence of mul-
tiple P2P overlay networks, in: Proceedings of the International Conference on Distributed Computing
Systems, Toronto, June 2007.
[35] P. Maymounkov, D. Mazières, Kademlia: a peer-to-peer information system based on the XOR metric, in:
Proceedings of the International Workshop on Peer-to-Peer Systems (IPTPS), March 2002.
[36] N. Magharei, R. Rejaie, Y. Guo, Mesh or multiple-tree: a comparative study of live P2P streaming
approaches, in: Proceedings of IEEE INFOCOM, Alaska, May 2007.
[37] G. Pallis, A. Vakali, Insight and perspectives for content delivery networks, Commun. ACM 49 (1) (2006)
101–106.
[38] K. Ross, D. Rubenstein, Peer-to-peer systems, in: IEEE Infocom, Hong Kong, 2004, (Tutorial slides).
[39] A. Rowstron, P. Druschel, Pastry: scalable, decentralized object location and routing for large-scale peer-
to-peer systems, in: Proceedings of the IFIP/ACM International Conference on Distributed Systems Plat-
forms (Middleware), Heidelberg, Germany, November 2001, pp. 329–350.
[40] S. Ratnasamy, P. Francis, M. Handley, R. Karp, A scalable content-addressable network, in: Proceedings
of ACM SIGCOMM, August 2001.
[41] S. Saroiu, P. Gummadi, S. Gribble, A measurement study of peer-to-peer file sharing systems, in: Multi-
media Computing and Networking (MMCN ’02), January 2002.
[42] H. Shen, C. Xu, Locality-aware and churn-resilient load balancing algorithms in structured peer-to-peer net-
works, in: IEEE Transactions on Parallel and Distributed Systems, Vol. 18 (6), June 2007, pp. 849–862.
[43] S. Song, K. Hwang, R Zhou, Y.K. Kwok, Trusted P2P transactions with fuzzy reputation aggregation,
in: IEEE Internet Computing, November–December 2005, pp. 18–28.
[44] M. Steiner, T. En-Najjary, E. Biersack, Long term study of peer behavior in the KAD DHT, in: IEEE/
ACM Transactions on Networking, Vol. 17 (6) 2009.
[45] I. Stoica, R. Morris, D. Liben-Nowell, et al., Chord: a scalable peer-to-peer lookup service for Internet
applications, in: IEEE/ACM Transactions on Networking, Vol. 11 (1) February 2003, pp. 17–32.
[46] K. Sripanidkulchai, B. Maggs, H. Zhang, Efficient content location using interest-based locality in peer-to-peer systems, in: Proceedings of IEEE INFOCOM, 2003.
[47] C. Tang, R.N. Chang, C. Ward, GoCast: gossip-enhanced overlay multicast for fast and dependable group
communication, in: Proceedings of the International Conference on Dependable Systems and Networks,
Yokohama, Japan, June 2005, pp. 140–149.
[48] V. Venkataraman, K. Yoshida, P. Francis, Chunkyspread: heterogeneous unstructured tree-based peer to
peer multicast, in: 14th IEEE International Conference on Network Protocols, November 2006, pp. 2–11.
[49] Y. Wang, J. Vassileva, Trust and reputation model in peer-to-peer networks, in: Third International Con-
ference on Peer-to-Peer Computing, August 2003.
[50] Wikipedia, Peer to Peer. http://en.wikipedia.org/wiki/Peer-to-peer, 2010 (accessed 14.08.2010).
[51] R.H. Wouhaybi, A.T. Campbell, Phoenix: supporting resilient low-diameter peer-to-peer topologies,
in: IEEE INFOCOM, 2004.
[52] L. Xiao, Y. Liu, L.M. Ni, Improving unstructured peer-to-peer systems by adaptive connection establish-
ment, in: IEEE Transactions on Computers, Vol. 54 (9), September 2005, pp. 1091–1103.
[53] L. Xiong, L. Liu, Peertrust: supporting reputation-based trust for peer-to-peer electronic communities,
in: IEEE Transactions on Knowledge and Data Engineering, Vol. 16 (7), 2004, pp. 843–857.
[54] Z. Xu, C. Tang, Z. Zhang, Building topology-aware overlays using global soft-state, in: Proceedings on
the International Conference on Distributed Computing Systems, 2003.
[55] M. Yang, Y. Dai, X. Li, Bring reputation system to social network in the maze P2P file-sharing system,
in: IEEE 2006 International Symposium on Collaborative Technologies and Systems (CTS 2006),
Las Vegas, 14–17 May 2006.
[56] Z. Zhang, S. Chen, Y. Ling, R. Chow, Capacity-aware multicast algorithms in heterogeneous overlay net-
works, in: IEEE Transactions on Parallel and Distributed Systems, February 2006, pp. 135–147.
[57] R. Zhou, K. Hwang, PowerTrust for fast reputation aggregation in peer-to-peer networks, in: IEEE Trans-
actions on Parallel and Distributed Systems, April 2007, pp. 460–473.
[58] R. Zhou, K. Hwang, M. Cai, GossipTrust for fast reputation aggregation in peer-to-peer networks, in:
IEEE Transactions on Knowledge and Data Engineering, September 2008, pp. 1282–1295.
[59] Z. Zhou, Z. Li, G. Xie, ACNS: Adaptive complementary neighbor selection in BitTorrent-like applica-
tions, in: Proceedings of IEEE ICC 2009, Germany, 2009.
[60] Y. Zhu, Y. Hu, Enhancing search performance on Gnutella-Like P2P systems, in: IEEE Transactions on
Parallel and Distributed Systems, Vol. 17 (12), December 2006, pp. 1482–1495.
HOMEWORK PROBLEMS
Problem 8.1
You studied the basic idea of DHTs in Section 8.3.1. One important DHT-based protocol is
Kademlia [35], which is notable for having been successfully deployed in the eDonkey/eMule
file-sharing systems. Study the Kademlia paper and compare it with Chord and CAN in
terms of routing complexity, maintenance cost, and scalability. Finally, download the source code
of eMule at www.emule-project.net and study how it works. If possible, modify the client
protocol to measure the eMule system. For measurement, please refer to the paper authored by
M. Steiner [44].
Problem 8.2
P2P VoD (video on demand) systems are different from P2P live streaming systems since P2P-VoD
has less synchrony in the users sharing video content, and thus it is very difficult to leverage end
users’ resources. Please read the paper [18] authored by Huang, et al. Then list the challenges of
designing P2P VoD systems and explain how PPlive solves these challenges.
Problem 8.3
List the three types of approaches to exploit network proximity in structured P2P overlays and their
advantages and disadvantages. Section 8.3.3 points out that, in Chord overlay networks, a node x
can take the physically closest node whose identifier falls in the range [x + 2^i, x + 2^(i+1)) as its i-th
finger entry, which does not affect routing complexity. Why does this hold?
Problem 8.4
Building network proximity-aware overlays would degrade the performance of search algorithms
such as flood-based algorithms and random walk-based algorithms since nodes are clustered into
groups and fewer links are built among the groups. For example, a random walk query message
may be forwarded within a cluster for many hops before reaching the cluster where the resource is
located. You are asked to design a search algorithm for clustered overlay networks.
Problem 8.5
Data delivery can also be implemented on top of a DHT-based structure. A representative
approach is CAM-Chord [56], which builds a capacity-aware Chord overlay and delivers data items
on top of it. CAM-Chord introduces no redundant messages; that is, to deliver an item to n nodes, only
n messages are required. However, maintaining such a structure is nontrivial. Read the paper and
compare CAM-Chord with the ACOM [9] and CRP [24] approaches. Focus on the following
performance metrics: hop complexity, proximity awareness, overlay maintenance overhead, and data
replication ratio.
Problem 8.6
In Table 8.9, compare the routing algorithms of Chord, Pastry, and CAN by briefly filling in the
missing table entries. Let n be the network size (number of peer nodes) and d be the dimension
size. A few of the entries are already given as examples.
Problem 8.7
P2P traffic locality can also be improved by deploying a cache at the gateway of the network pro-
vider. However, unlike the HTTP access pattern, the popularity distribution of objects (files)
does not follow a Zipf law. Read the paper [17] and explain how the popularity distribution affects
the cache algorithm. Furthermore, discuss the advantages and shortcomings of different methods to
preserve locality.
Problem 8.8
In Section 8.5.1, you studied the basic search algorithms that do not consider semantic information
of shared content. It has been shown in [60] that building semantic overlays can significantly
improve search performance. Read this paper and its related work, and describe the reasons
why semantic overlays can enhance search performance.
Problem 8.9
YouTube is one of the largest web sites for user-generated videos. Tens of thousands of videos are
viewed each day by a large number of users. YouTube naturally leverages P2P techniques to allevi-
ate server loading. However, unlike traditional videos, user-generated videos are small. Study the
NetTube paper [8] and discuss the solution for using P2P techniques in YouTube content
distribution.
Problem 8.10
Analyze the fault tolerance of a Chord overlay with respect to the size of the successor list. Is it
sufficient to guarantee a stable Chord ring with high probability in the case that the successor list
size is O(log n) (n is the number of nodes), when every peer fails with probability ½?
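As a hedged hint on how the analysis might be set up (one possible line of reasoning, not the complete answer): with independent failures of probability 1/2, a given node loses its entire successor list only if all r = c log2 n entries fail simultaneously, which happens with probability (1/2)^(c log2 n) = n^(-c). A union bound over all n nodes then bounds the probability that any node is disconnected from every successor by n * n^(-c) = n^(1-c), which tends to zero for any constant c > 1. Consider also how Chord's periodic stabilization refreshes the successor list as failures accumulate.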
Problem 8.11
BitTorrent is widely used for file sharing. Files are divided into pieces. Nodes exchange the piece
availability information with their neighbors. When a node finds there is a piece that its neighbor
has but it has not downloaded yet, it sends a request for the piece to that neighbor. By default, the
piece size is 256 KB. Perform simulations using the code in http://theory.stanford.edu/~cao/biased-
bt.html to evaluate how piece size affects content distribution performance for files of various sizes.
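Before modifying the simulator, a quick sketch (illustrative numbers only, not part of the assignment code) shows how the piece size determines the number of pieces, which in turn bounds how much parallel exchange is possible for small files:

```python
# Hedged sketch: how piece size affects the number of pieces a file is split into.
import math

KB = 1024
MB = 1024 * KB

def num_pieces(file_size_bytes, piece_size_bytes=256 * KB):
    """Number of pieces for a file (the last piece may be shorter)."""
    return math.ceil(file_size_bytes / piece_size_bytes)

if __name__ == "__main__":
    for size in (10 * MB, 100 * MB, 1024 * MB):
        for piece in (64 * KB, 256 * KB, 1 * MB):
            print(f"file {size // MB:5d} MB, piece {piece // KB:5d} KB "
                  f"-> {num_pieces(size, piece):6d} pieces")
```

Fewer, larger pieces mean less per-piece signaling overhead but also fewer opportunities for neighbors to trade with one another, which is why the effect of piece size depends strongly on file size.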
Problem 8.12
Skype is a P2P-based VoIP system. It has been widely used and studied. Read the papers on Skype
[3]. Then describe the overlay structure of Skype, the bandwidth consumption characteristic, and
the impact of access capacity constraint on VoIP quality.
Problem 8.13
In mesh-based pull methods, each node advertises to its neighbors which data blocks it has received.
Neighbors then explicitly send request messages for the blocks they need. Tree-based push methods, on the other hand,
deliver data blocks along one or multiple tree structures. It is considered that mesh-based methods
suffer from a trade-off between control overhead and delay, while tree-based methods have a trade-
off between continuity and delay [48]. Read a survey paper on content delivery algorithms [36] and
a comparison study on mesh- and tree-based methods [29]. You may consider combining these two
methods by leveraging network coding in tree-based systems.
Problem 8.14
BitTorrent systems use the so-called “Tit-for-Tat” incentive scheme to enable contribution aware-
ness in content delivery. Some researchers argue that this mechanism is sufficient to ensure
fairness, while others doubt this and show that free riding is still cheap. Study the three papers in
[20,29,37]. Then answer the following two questions: (1) Which factor makes free riding so easy?
(2) Do you think “Tit-for-Tat” is sufficient to solve the problem? Justify your answers with
reasoning.
Problem 8.15
Match the overlay networks on the left with their best-match descriptions on the right:
Problem 8.16
Consider a Chord overlay network with 64 keys identified by 6-bit key identifiers. Assume six
nodes (0, 4, 7, 12, 32, 50) are populated at present. The rest are not assigned yet. Answer the fol-
lowing questions using this Chord graph:
a. What are the data keys and node keys stored in node 0?
b. Work out the finger table for node 7. Describe how node 7 uses the finger table to find the
shortest routing path to reach the home node of a document file where data key 3 is located.
c. Describe how to insert a new node 45 into this Chord structure. You need to show the shortest
path on the Chord graph leading to the node holding key 45.
d. Use the finger table at the predecessor node to perform five lookup operations to establish the
finger table for this new node 45.
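A small script may help in checking parts (a), (b), and (d) of this problem. The sketch below assumes the standard Chord rule that finger i of node x points to successor((x + 2^i) mod 2^m); the helper names are hypothetical, not part of any particular Chord implementation.

```python
# Hedged sketch: finger-table calculation for the 6-bit Chord ring of Problem 8.16.
M = 6                       # identifier bits
RING = 2 ** M               # 64 identifiers
NODES = sorted([0, 4, 7, 12, 32, 50])

def successor(key):
    """Return the first populated node clockwise from key (inclusive)."""
    for n in NODES:
        if n >= key:
            return n
    return NODES[0]         # wrap around the ring

def finger_table(x):
    """finger[i] = successor((x + 2**i) mod 2**M) for i = 0..M-1."""
    return [successor((x + 2 ** i) % RING) for i in range(M)]

if __name__ == "__main__":
    for node in NODES:
        print(node, finger_table(node))
    # Example for part (b): the home node of data key 3 is successor(3) = 4.
    print("successor(3) =", successor(3))
```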
CHAPTER 9
Ubiquitous Clouds and the Internet of Things
SUMMARY
Data-intensive applications and future trends of Internet clouds are discussed along the lines of sup-
porting mobile computing, ubiquitous computing, and social networking. First, public, private, and
science clouds and grid projects are reviewed, including those built by IBM, SGI, Salesforce.com,
NASA, and CERN. Requirements for data-intensive scalable computing are reviewed. Then perfor-
mance metrics are presented for high-performance computing grids and for high-throughput comput-
ing clouds. Cloud mashups, mobility support, and benchmark results on existing grids and clouds
are reported. The emergence of the Internet of Things and cyber-physical systems is assessed.
Finally, we analyze large-scale social networks. In particular, Facebook and Twitter are reviewed
with innovative applications.
connection. Science in the cloud benefits from a new research environment that leverages
software and services provided on user demand. New advances in research discovery are
expected in high-energy physics, observatory astronomy, atmospheric modeling, and biomedicine.
Dan Reed [69] of Microsoft Research has assessed the revolution of today’s big-data, many-core,
pervasive clouds and their impact on scientific research in the information age. Some observations
by Reed are discussed below.
Major cloud providers, their service offerings, and representative customers include the following:
• Amazon (Seattle, 1994): Amazon Web Services (AWS), six Infrastructure-as-a-Service (IaaS) systems including EC2 for computing capacity and S3 for on-demand storage capacity. Customers: 10,000 businesses and individual users, including The New York Times, The Washington Post, and Eli Lilly.
• Enomaly (Toronto, 2004): Elastic Computing Platform integrating enterprise data centers with commercial cloud offerings, managing internal and external resources with virtual machine (VM) migration. Customers: Business Objects, France Telecom, NBC, Deutsche Bank, and Best Buy.
• Google (Mountain View, CA, 1998): GAE offering Platform-as-a-Service (PaaS) capability plus office productivity tools including Gmail, calendar tools, the Postini web site creation tool, and security protection services. Customers: small businesses, enterprises, and colleges including Arizona State University and Northwestern University.
• GoGrid (San Francisco, 2008): web-based storage plus Windows- and Linux-based virtual servers deployed in the cloud with preinstalled software from Apache, PHP, and Microsoft SQL. Customers: mostly startups, Web 2.0, and Software-as-a-Service (SaaS) companies, plus a few big names like SAP and Novell.
• Microsoft (Seattle, 1975): Azure, a Windows-as-a-service platform consisting of the OS and developer services that can be used to build and enhance web-hosted applications. Customers: Epicor, S3Edge, and Micro Focus, which use Azure to develop cloud applications.
• NetSuite (San Mateo, CA, 1998): a business software suite including e-commerce, customer relationship management (CRM), accounting, and enterprise resource planning (ERP) tools. Customers: business customers such as Puck Coffee and Wrigleyville Sports.
• Rackspace (San Antonio, TX, 1998): the Mosso cloud offering a platform for building web sites; Cloud Files for a storage service; and Cloud Servers, an EC2-like service that provides access to virtualized server instances. Customers: web developers and SaaS providers such as Zapproved, which uses Mosso to deliver an online productivity tool.
• Salesforce.com (San Francisco, 1999): CRM tools for sales force automation, analytics, marketing, and social networking; Force.com offers a PaaS for building web applications on the Salesforce.com infrastructure. Customers: 500,000 customers in financial services, communications and media, energy, and health care.
private clouds. IBM LotusLive offers an SaaS cloud for application service development. The services
include an online conference service, coordinated office management, and e-mail services.
computing environment on user demand. At the time of this writing, the RC2 cloud serves more
than 3,000 IBM researchers and developers worldwide as a large testbed for private cloud services.
The system supports machine virtualization, service life-cycle management, and performance
monitoring. The idea is to reduce computing costs by outsourcing jobs to best-fit sites for rapid
resource deployment, and thus optimal execution. The project also evaluates green energy savings
and dynamic resource optimization. In a typical RC2 service setting, the user uses the web service
to submit a new order of RC2 resources such as VMs and storage. Then the user specifies the order
with special service-level agreements (SLAs) regarding resource quantity and capacity classes. RC2
management approves the order with provisioned resources and services, and RC2 monitors the pat-
terns of requested resources and handles billing at the completion of the service.
FIGURE 9.1
Architecture of the BlueCloud system built with IBM hardware and software along with some open source
Linux and Xen.
(Courtesy of IBM BlueCloud Project )
Special software includes Tivoli for monitoring of virtual cluster operations (see www-01.ibm.com/software/
tivoli/products/monitor for details). DB2 manages structured data, and the Tivoli Provisioning Manager provi-
sions virtual servers to physical nodes (see www-03.ibm.com/systems/power/software/virtualization/index.html
for details). The WebSphere Application Server is used for building application portals. Open source software
is used in the background for building the cloud application to process large-scale data. Xen is used as the
container of running instances of cloud applications. Hadoop is used to process massive amounts of data
based on the MapReduce paradigm.
FIGURE 9.2
SGI Cyclone HPC cloud for enabling SaaS and IaaS applications.
(Courtesy of M. Barrenechea, SGI CEO, www.sgi.com/cyclone, Feb. 2010 )
SGI Altix provides scale-up, Altix ICE provides scale-out, and Altix XE hybrid clusters are used for
hybrid management of cloud servers, all based on Intel Xeon or Itanium processors. Altix is a line of ser-
vers and supercomputers produced by Silicon Graphics, based on Intel processors. Altix servers have
gone through many revisions, the latest being the SGI Altix 4700. This server offers “plug and solve” blade
configurability in an uncompromising data-centric architecture capable of sustaining terabytes of data
throughput. Like its predecessors, the SGI Altix 4700 platform has been designed for technical users
based on industry-standard CPUs, memory, and I/O.
FIGURE 9.3
The Nebula cloud under development at NASA (http://nebula.nasa.gov).
(Courtesy NASA/Ames Research Center )
The CERN VM technology was originally started with the intent to supply portable development
environments that scientists could run on their laptops and desktops. A variety of virtual image for-
mats are now supported, including the Xen images used by Amazon EC2 as well as science clouds.
The challenge was to find a way to deploy these images so that they would dynamically and
securely register with the AliEn scheduler, and thus join the ALICE resource pool.
Example 9.4 A Mashup Cloud Designed on the AWS and GAE Platforms
Figure 9.4 shows the concept of mashing up the GAE and AWS platforms. Essentially, one gets user input
files from GAE and uses the AWS infrastructure for MapReduce. To prove the effectiveness of mashups,
the performance of MapReduce on the mashup platform is evaluated with scalable EC2 resources.
We demonstrate the results of AWS-GAE mashup experiments conducted at the University of Southern
California [18]. The mashup provider is crucial to link the two cloud platforms together.
The mashup design leverages Google’s web agility with the scaling power of AWS EC2. This
mashup helps a user to write agile software on GAE and use the user inputs to perform parallel com-
puting operations on AWS. Also, the creator can apply parallel computing operations to a user-owned
FIGURE 9.4
The concept of a mashup between the GAE and AWS cloud platforms.
(Courtesy of V. Dixit of University of Southern California [18] )
cluster. AWS is not very handy as a scalable web interface because the size of EC2 has to be
increased by subscriber request. Also, the name server has to be built by the user to isolate the end
server from becoming directly visible on the network. On the other hand, Google is more cost-effective
in this aspect.
A typical online business such as eBay requires computing power that generates relevant results
for its consumers to increase sales. This is beyond what a web-based architecture such as GAE can
deliver. So, such a business would need a larger computing “farm,” available from Amazon through
its AWS offerings. This model allows users to perform a number of computations on a VM. Con-
sider a business running information located on an Excel spreadsheet in a Windows host and com-
putation code written on a UNIX-based system. In this case, GAE, which has no capability to
“execute” programs, becomes useless. In fact, any executable code needed by a business cannot use
the Google model of service. AWS then becomes the preferred way to accomplish this.
FIGURE 9.5
Scalable performance on the mashup cloud with decreasing execution time of a 2 GB dataset on EC2 up
to 60 VM instances.
(Courtesy of V. Dixit of University of Southern California [18] )
Figure 9.5 plots the execution time of the 2 GB dataset as the number of EC2 instances scales from 1 to 60 nodes. It takes 1,000 seconds to process 2 GB of data on one node, and the time is reduced to 300 seconds over 60 nodes.
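As a quick check on these figures: dropping from 1,000 seconds on one node to 300 seconds on 60 nodes is a speedup of roughly 1,000/300 ≈ 3.3, or a parallel efficiency of about 3.3/60 ≈ 5.6 percent, which suggests that data movement between GAE and EC2 and job startup overheads, rather than raw EC2 capacity, limit the scalability of this particular mashup experiment.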
The mobile cloud infrastructure proposed by Satyanarayanan, et al. [74] is called a Cloudlet, and it offers a resource-rich portal for upgrading mobile devices with cognitive
abilities to access distant clouds. This portal should be designed to be trustworthy and use VMs to
explore location-aware cloud applications. The idea can be applied to opportunity discovery, fast
information processing, and intelligent decision making on the road. Cloudlet makes it possible for
mobile devices to access the Internet cloud easily in cost-effective mobile computing services.
FIGURE 9.6
VM-based cloudlets for mobile cloud computing applications.
(Courtesy of Satyanarayanan, et al. © IEEE [74] )
[Figure steps: launch the VM; user-driven device-VM interactions at the cloudlet; finish use; create VM residue on departure; discard the VM.]
FIGURE 9.7
Fast VM synthesis in Kimberley prototype cloudlet built at CMU.
(Courtesy of Satyanarayanan, et al. © IEEE [74] )
[Figure shows FutureGrid sites linked by a core router to TeraGrid, Internet2, and GEANT (Dresden, Grid5000, and other European grids): UC, 7TF IBM, 672 cores; PU, 4TF Dell, 384 cores; IU, 11TF IBM, 1,024 cores, plus 7TF Cray, 684 cores; UCSD, 7TF IBM, 672 cores; TACC, 12TF Dell, 1,152 cores; UF, 3TF IBM, 256 cores; and a ~5TF data/memory-intensive testbed of ~512 cores.]
FIGURE 9.8
FutureGrid distributed system testbed across the U.S. continent [30].
(Courtesy of Judy Qiu [30] )
[Stack components: portal, command-line interfaces, APIs, security, Pegasus, SWE tools, Moab, Nimbus, and Eucalyptus.]
FIGURE 9.9
The FutureGrid software stack.
(Courtesy of Judy Qiu [30] )
It has been used in a variety of computer science research projects including cloud and grid computing
and green IT with both software systems and performance goals. Grid’5000 aims at providing a highly
reconfigurable, controllable, and monitored experimental platform to its users. Work explores all levels,
including application, programming environment, application runtime, grid, cloud, P2P middleware,
operating system, and network.
The OCC also manages a testbed for cloud computing, the Open Cloud Testbed, and operates cloud
computing infrastructure to support scientific research, called Open Science Data Cloud, with an
emphasis on data-intensive computing. As of November 2010, the Open Cloud Testbed includes
two sites in Chicago, one in La Jolla, California, and one at Johns Hopkins University, with a total
of 250 nodes and 1,000 cores.
Example 9.5 Wal-Mart Data Warehouse and Sloan Digital Sky Survey [5]
The first example of big data sources comes from the global operations of Wal-Mart. Wal-Mart has 6,000
stores with supply chains involving more than 10,000 manufacturers. Each day, Wal-Mart sells 267 million
items. HP helped Wal-Mart to build data centers to handle 4 petabytes of data daily. Wal-Mart needs to
perform DISC for data mining and management of all its supply chains, understanding market trends, and
formulating pricing strategies.
The second example comes from the Sloan Digital Sky Survey. Its telescope, in the New Mexico Obser-
vatory, captures 200 GB of image data per day. Its latest data set consists of 10 TB of data on 287 million
objects in outer space. Sloan uses SkyServer to provide SQL access in DISC-like data processing
continuously.
To build DISC systems to handle the oceans of data in both examples, one must solve the skinny
processing pipe problem. Use of many-core GPUs has created a possible solution to exploit massive
data-level parallelism in these DISC applications. Cloud computing may be slow compared with super-
computers, but massive parallelism is achieved through the use of large numbers of data centers
collectively.
[Figure contrasts two software stacks: conventional supercomputers run application programs over software packages with a machine-dependent programming model on the hardware, whereas DISC systems run application programs over a machine-independent programming model and a runtime system on the hardware.]
FIGURE 9.10
Programmer’s perspectives of DISC systems against conventional supercomputers.
(Courtesy of Randal Bryant [5], Reprinted with Permission)
systems is a much more complex problem to evaluate because the systems are used by many clients
simultaneously.
Thus, HTC performance is attributed to many orthogonal factors. Some of them are measurable
and some are not. The basic assumption is that many (millions or more) independent users may use
the shared resources in the cloud or data centers simultaneously. The goal is to satisfy as many con-
current users as possible, even if each user program is a simple web service, social contact, or sim-
ple cloud storage. In this section, we attempt to present some performance evaluations and QoS
assessment models for both HPC and HTC systems.
configuration. These machine instances could be cluster nodes (or processor cores) if physical ser-
vers are used, or VM counts if a virtualized cloud is used.
We compute the total job execution time (or makespan) with Ttotal = kTo + Te(n, m). Then the
system throughput (π) is defined by:
π = n/Ttotal = n/[kTo + Te(n, m)]    (9.1)
In the ideal case, the overhead kTo is assumed to be very small or simply zero, compared with the
magnitude of the execution time Te. Hence, the ideal throughput is simply measured by n/Te(n, m).
The system efficiency (α) is defined by the normalized throughput as follows:
α = π/(n/Te(n, m)) = Te(n, m)/[kTo + Te(n, m)]
System efficiency represents the effective utilization of all provisioned resources in a system.
Note that the parameters n and m in the expression Te(n, m) vary with the workload (user number n),
the system size (m), and the parallel and distributed computing model applied.
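A minimal sketch of these two metrics in code, treating k, To, Te(n, m), and n as known inputs (the numbers below are illustrative only):

```python
# Hedged sketch of the throughput and efficiency formulas above.
def throughput(n, k, t_overhead, t_exec):
    """pi = n / (k*To + Te(n, m)), in jobs per unit time."""
    return n / (k * t_overhead + t_exec)

def efficiency(n, k, t_overhead, t_exec):
    """alpha = pi / pi_ideal = Te(n, m) / (k*To + Te(n, m))."""
    ideal = n / t_exec
    return throughput(n, k, t_overhead, t_exec) / ideal

if __name__ == "__main__":
    n, k = 1000, 4                   # jobs and overhead repetitions (illustrative)
    t_overhead, t_exec = 5.0, 200.0  # seconds (illustrative)
    print("throughput:", throughput(n, k, t_overhead, t_exec))
    print("efficiency:", efficiency(n, k, t_overhead, t_exec))
```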
System availability is formally defined as MTTF/(MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. That definition applies to all systems (clusters, MPPs, grids, and clouds) studied in this book.
High availability (HA) specifies the performance level of a system with controlled downtime
through quick recovery from failure. The term “continuous availability” (CA) refers to an operational
level with almost no failure. Generally speaking, system availability should be the responsibility of
the providers. Usually, availability is maintained at five 9s or higher, meaning that
with 99.999 percent uptime (at most, about five minutes of downtime per year) the system can provide
services. Of course, an HA cluster could demand 99.99999 percent availability, meaning only about three
seconds of downtime per year. In general, that would be very expensive to achieve in a real-life
machine environment.
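The downtime figures quoted above follow directly from the availability definition; a short sketch of the arithmetic (illustrative only):

```python
# Hedged sketch: availability = MTTF / (MTTF + MTTR), and yearly downtime per level.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_minutes_per_year(avail):
    return (1.0 - avail) * MINUTES_PER_YEAR

if __name__ == "__main__":
    for avail in (0.999, 0.99999, 0.9999999):    # three, five, and seven 9s
        print(f"{avail:.7f} -> {downtime_minutes_per_year(avail):8.2f} min/year")
```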
margin by the provider. If the provider charges the user a reduced hourly rate of $2/hour, the profit
will be reduced to only 20 percent of the cost. In general, c is set to be greater than d by a large
margin.
A lower value of the θ measure means poorer performance, while a value of θ approaching 1
implies higher performance. Figures 9.11(b)–(d) show the corresponding Kiviat graphs for three cloud
services hypothetically running on three cloud platforms. This is based on the aforementioned ana-
lytical modeling of cloud performance. It would be interesting to validate the analytical results with
real-life benchmark experiments on the cloud platforms from Amazon, Google, and Salesforce.com.
[Panels: (a) the Kiviat graph template with five attribute axes, including efficiency (α), scalability (β), and cost (μ); (b) SaaS with low efficiency, scalability, and security; (c) PaaS under very good running conditions; (d) IaaS under poor running conditions but with HA.]
FIGURE 9.11
Kiviat graphs showing the QoCS of three cloud service models under assumed running conditions on
five attribute dimensions.
(Courtesy of K. Hwang and S. Kulkarni, USC 2009 )
Table 9.5 Relative Performance of Three Computing Cloud Service Models Applied to an EC2 Application under Two Extreme Security and Availability Conditions. Columns: Service Model; QoCS of Cloud Services (θ) and Ranking Order under the Worst-Case Scenario; QoCS of Cloud Services (θ) and Ranking Order under the Best-Case Scenario.
The results shown in Table 9.5 are estimated values of QoCS and the relative rankings of the three
cloud service models under optimistic versus pessimistic assumptions. The pessimistic QoCS results indi-
cate that all three service models demonstrate only about 30 percent of their maximum performance. The
QoCS of the IaaS model is very low compared with those of the SaaS and PaaS models shown in Figures 9.11(b)
and 9.11(c), respectively. This happens because of the scalability constraints and lower cost-effectiveness
values assumed. The QoCS of the PaaS and SaaS models are fairly close to each other, with the PaaS
model being secure and efficient and SaaS being cost-effective.
In an optimistic analysis, the QoCS of the three cloud service models are in the range of 50 percent to
60 percent. The IaaS model performs better than the PaaS and SaaS models. This is attributed primarily
to improved efficiency, scalability, and security measures. The PaaS and SaaS models have similar levels of
performance, with the PaaS model slightly higher thanks to significant
improvements by PaaS providers in the areas of efficiency and scalability. However, the ranking of IaaS,
PaaS, and SaaS models varies with specific applications. Benchmark experiments are needed to rank
cloud performance.
[Plot of parallel efficiency (roughly 0.4 to 1.0) versus the number of virtual cores.]
FIGURE 9.12
Cloud performance variation with respect to the number of virtual cores used [19].
Linux and Windows [22,39]; further details may be found in papers from this group. Consider the
performance of running the generative topographic mapping (GTM) interpolation algorithm on various
EC2 sizes. The purpose is for dimension reduction to analyze the NIH PubChem data set of 26.4 million
data points in 166 dimensions. Azure tests on variable instances are shown in Figure 9.12.
The EC2 tests were performed on EC2 Large, High-CPU-Extra-Large (HCXL), and High-Memory-
Quadruple-Extra-Large (HM4XL) instances separately. HM4XL and HCXL instances use eight cores
per instance while “large” instances use two cores per instance. The MapReduce implementations lie in
the middle. The low efficiency is due to memory bandwidth effects on individual nodes, which is most
pronounced for DryadLINQ runs on 16 core systems, whereas Hadoop systems had eight cores per
node. This example illustrates that the cloud instance choice does affect the scalable performance.
[Plot of execution time in seconds (0 to 4,500) versus the number of cores × number of blocks, from 64 × 1024 to 192 × 3072, for Azure MR, Amazon EMR, Hadoop on EC2, and Hadoop on bare metal.]
FIGURE 9.13
Relative SWG performance of MapReduce on four cloud platforms: Azure MR, Amazon EMR, Hadoop on
EC2, and Hadoop on bare metal [19–22,67,68].
programming models suited for clouds. Most of the current studies have looked at test kernels of a
problem, but we can highlight two papers that examine complete applications. Montage delivers
science-grade mosaics of the sky to a community composed of both professional and amateur
astronomers and can naturally be implemented as a service in the cloud.
Everyday objects include not only electronic devices, but also objects such as humans, animals,
food, clothing, homes, vehicles, commodities, trees, hills, and landmarks. The IoT system could
greatly reduce the chances of a company running out of stock or wasting products, as all concerned
parties would know exactly what products are needed. Mislaid items and physical theft would be
reduced, because the location of an item would be known to all concerned parties. This demands that all
objects on Earth be radio-tagged, ranging from a milk carton to a truck container to a jumbo jet. Com-
puters identify and manage all tagged objects in the same way humans do. This IoT system will
improve our quality of life and make society cleaner and more secure, convenient, and enjoyable.
FIGURE 9.14
Technology road map of the Internet of things.
(Source: SRI Consulting Business Intelligence )
could be widely applied to personalize the interactions among humans, machines, and objects.
Artificial intelligence, computer vision, robotics, and telepresence can make our lives more auto-
mated in the future.
The IoT development could become more mature and more sophisticated in the next 15 years.
Figure 9.14 shows the major technology advances and key applications that may benefit from the
IoT. For example, supply chains are now better supported than before. Vertical market applications
may represent the next wave of advances. Ubiquitous positioning is expected to become a reality as
we move toward 2020. Beyond that, a physical IoT may be in place on a global scale. These
advances will significantly upgrade human abilities, societal outcomes, national productivity, and
quality of life.
[Figure layers: a cloud computing platform at the top; a network layer made up of mobile telecom networks, the Internet, and information networks; and a sensing layer of RFID labels, sensor nodes, GPS receivers, and road navigators.]
FIGURE 9.15
The architecture of an IoT consisting of sensing devices that are connected to various applications via mobile
networks, the Internet, and processing clouds.
applications. The middle layer is also considered to be a semantic web or grid. Some actors (services,
components, avatars) are self-referenced.
FIGURE 9.16
RFID tagging and signals read/write through sensor or WiFi networks.
devices, and backend computers. RFID tags are still not in widespread use by ordinary people,
because they are expensive and bulky. They are more often used by large businesses, moving or
shipping companies, and service companies. However, RFID tags can make our daily life and work
much easier and more convenient by giving us access to visible or hidden objects surrounding us. These
tags are not that expensive to produce in large quantities, and they can be made small enough to fit
on almost any product or object.
External RF antenna devices can illuminate the tag devices similar to radar illuminating a target.
However, RFID is more effective at operating at shorter ranges. A variety of agencies are defining
the standards and regulating the use of RFID. They include the International Organization for Stan-
dardization (ISO), International Electrotechnical Commission (IEC), ASTM International, DASH7
Alliance, and EPCglobal. RFID has many industrial applications including merchandise tracking
and supply chain management. Its use can improve the efficiency of inventory control and accuracy
of assembly line management. Active and semi-passive RFID tags use internal batteries to power
their circuits. An active tag can also use its own battery to broadcast radio waves to a reader,
whereas a semi-passive tag relies on the reader to supply its power for broadcasting.
Active and semi-passive tags are reserved for reading over 30 to 100 meters, if repeater batteries
are used to boost a tag’s range. For example, a customer finds a pair of jeans in her size on a store
shelf with a radio wand provided by the store. She can order the clothing and pay via her cell phone
with the help of an RFID setting. Passive RFID tags rely entirely on the reader as their power source.
These tags are read only up to 20 feet away. They have lower production costs, meaning they can be
applied to less expensive merchandise. These tags are manufactured to be disposable. A railway car
may apply active RFID tags. A bottle of shampoo may use only a passive tag.
Another factor that influences the cost of RFID tags is data storage. There are three storage
types: read-write, read-only, and write-once-read-many (WORM). A read-write tag’s data can be
added to or overwritten. Read-only tags cannot be added to or overwritten. WORM tags can have
additional data added once, but they cannot be overwritten. Most passive RFID tags cost 7–20
cents each. Active and semi-passive tags are more expensive, and RFID manufacturers typically do
not quote prices for these tags without first determining their range, storage type, and quantity.
The RFID industry’s goal is to get the cost of passive RFID tags down to 5 cents each once more
merchandisers adopt them.
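The tag taxonomy above can be captured as a small data structure; the sketch below uses the approximate ranges and prices quoted in this section, and the selection rule is only a hypothetical illustration, not an industry guideline:

```python
# Hedged sketch: RFID tag classes with the approximate ranges and costs cited above.
from dataclasses import dataclass

@dataclass
class TagClass:
    name: str
    power_source: str
    max_range_m: float      # approximate read range
    approx_cost: str

TAG_CLASSES = [
    TagClass("passive",      "reader field only",      6.0,   "$0.07-$0.20"),   # ~20 feet
    TagClass("semi-passive", "battery + reader field", 100.0, "quoted per order"),
    TagClass("active",       "internal battery",       100.0, "quoted per order"),
]

def cheapest_tag_for(range_m):
    """Pick the first (cheapest) class whose read range covers the requirement."""
    for tag in TAG_CLASSES:
        if tag.max_range_m >= range_m:
            return tag
    return None

if __name__ == "__main__":
    print(cheapest_tag_for(3))     # shampoo bottle on a shelf -> passive
    print(cheapest_tag_for(60))    # railway car in a yard     -> semi-passive/active
```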
The following compares ZigBee, GPRS/GSM, Wi-Fi, and Bluetooth by application focus, system resources, battery life, network size, bandwidth, range, and success metrics:
• ZigBee: monitoring and control; 4 KB–32 KB of system resources; battery life of 100–1,000+ days; unlimited network size (2^64); 20–250 Kbps bandwidth; 1–100+ meter range; success metrics are reliability, power, and cost.
• GPRS/GSM: wide area voice and data; 18 MB+ of system resources; battery life of 1–7 days; network size of 1; 64–128+ Kbps bandwidth; 1,000+ meter range; success metrics are reach and quality.
• Wi-Fi: web, e-mail, and video; 1 MB+ of system resources; battery life of 0.5–5 days; network size of 32; 54,000+ Kbps bandwidth; 1–100 meter range; success metrics are speed and flexibility.
• Bluetooth: cable replacement; 250 KB+ of system resources; battery life of 1–7 days; network size of 7; 720 Kbps bandwidth; 1–10+ meter range; success metrics are cost and convenience.
energy or from physical destruction), and new nodes may join the network. Therefore, the network
must be able to periodically reconfigure itself so that it can continue to function. Individual nodes
may become disconnected from the rest of the network, but a high degree of connectivity must be
maintained.
Collaborative signal processing is desired in a sensor network. To improve detection/estimation
performance, it is often quite useful to fuse data from multiple sensors. This data fusion requires
the transmission of data and control messages. A user may want to query an individual node or a
group of nodes for information collected in the region. Depending on the amount of data fusion per-
formed, it may not be feasible to transmit a large amount of the data across the network. Instead,
various local sink nodes will collect the data from a given area and create summary messages.
A query may be directed to the sink node nearest to the desired location.
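A minimal sketch of this query pattern, assuming global knowledge of sink locations for simplicity (a real WSN would resolve the nearest sink through its routing protocol, and the node layout below is invented):

```python
# Hedged sketch: route a query to the sink nearest the region of interest and
# return a fused summary instead of shipping raw sensor data across the network.
import math

SINKS = {"sink-A": (0.0, 0.0), "sink-B": (50.0, 10.0), "sink-C": (20.0, 80.0)}
READINGS = {"sink-B": [21.5, 22.0, 21.8, 22.3]}   # temperatures collected near sink-B

def nearest_sink(x, y):
    return min(SINKS, key=lambda s: math.dist((x, y), SINKS[s]))

def summary(sink):
    data = READINGS.get(sink, [])
    return {"count": len(data), "mean": sum(data) / len(data)} if data else None

if __name__ == "__main__":
    sink = nearest_sink(45.0, 12.0)        # query aimed at this region
    print(sink, summary(sink))             # only the small summary crosses the network
```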
[Figure legend: ZigBee coordinator, ZigBee router, ZigBee end device.]
FIGURE 9.17
Structure of a typical ZigBee network.
(Source: http://mesh-matrix.com/en/technology/tech_zigbee.aspx )
This wakeup pattern results in a long lifetime for the sensors (months or years) without changing batteries. Good
power management is the key to WSN success.
[Figure components: body sensor nodes and wearable devices (chest belt, wrist band, etc.) using IEEE 802.15.4/ZigBee-compliant communications with the patient's mobile phone, which relays data over cellular networks to a hospital or medical center server PC and to doctors' mobile phones and PDAs.]
FIGURE 9.19
An example of how measured data can be transferred to doctors or medical professionals using a wireless
sensor network.
(Source: www.infotech.oulu.fi/Annual/2007/pics/opme_19_sm.jpg )
FIGURE 9.20
The ground GPS receiver, which calculates its 3D location from four or more satellites with help from a few
ground reference stations and a master station.
Passive GPS tracking devices store GPS location data in their internal memory which can then
be downloaded to a computer for viewing at a later time, while active GPS tracking systems send
the data at regular intervals to be viewed in real time. When real-time data is not required, passive
GPS tracking devices tend to be favored more by individual consumers for their compact conveni-
ence and affordability. Concerned parents can install a GPS tracking unit just about anywhere in
their teens’ vehicles to monitor their driving habits and know where they’ve been going, and even
law enforcement officials now rely on passive GPS tracking to trail criminal suspects and enhance
civilian safety via electronic surveillance of parolees. Passive GPS tracking units also serve as a
theft prevention and retrieval aid in consumer as well as commercial vehicles.
final two intersecting points of three satellite spheres. The final receiver location is thus decided by
noting the closest point calculated from the two final points to the sphere surface of the fourth satel-
lite. In the case of no error, the precise position is located. Otherwise, the error introduced results in
some offset, say 10 meters, from the exact location. To further reduce the errors,
more satellites could be involved, but this could be rather costly.
Each satellite continually transmits messages that include the time the message was transmitted,
precise orbital information (the ephemeris), and the general system health and rough orbits of all GPS
satellites (the almanac). A GPS receiver calculates its position by precisely timing the signals sent by
GPS satellites high above Earth. A GPS receiver must be locked onto the signal of at least three satel-
lites to calculate a 2D position (latitude and longitude) and track movement. With four or more satel-
lites in view, the receiver can determine the user’s 3D position (latitude, longitude, and altitude).
Once the user’s position has been determined, the GPS unit can calculate other information,
such as speed, bearing, track, trip distance, distance to destination, sunrise and sunset times, and
more. The receiver utilizes the messages it receives to determine the transit time of each message
and computes the distances to each satellite. These distances along with the satellites’ locations are
used to compute the position of the receiver. This position is displayed perhaps with a moving map
display or latitude, longitude, and elevation information. Many GPS units show derived information
such as direction and speed calculated from positional changes.
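The distance-based position computation can be sketched as a small Gauss-Newton solver; the satellite coordinates, clock bias, and starting point below are illustrative only, and real receivers add many corrections (ionospheric delay, ephemeris errors) that are omitted here:

```python
# Hedged sketch: solving receiver position and clock bias from pseudoranges
# by Gauss-Newton iteration, using illustrative satellite data.
import numpy as np

def gps_fix(sat_positions, pseudoranges, iters=10):
    """sat_positions: (k, 3) ECEF metres; pseudoranges: (k,) metres; k >= 4."""
    sats = np.asarray(sat_positions, dtype=float)
    rho = np.asarray(pseudoranges, dtype=float)
    x = np.zeros(4)                        # [x, y, z, clock_bias_m], start at Earth centre
    for _ in range(iters):
        diff = x[:3] - sats                # vectors from each satellite to the estimate
        dist = np.linalg.norm(diff, axis=1)
        residual = dist + x[3] - rho       # predicted minus measured range
        J = np.hstack([diff / dist[:, None], np.ones((len(rho), 1))])
        x -= np.linalg.lstsq(J, residual, rcond=None)[0]
    return x[:3], x[3]

if __name__ == "__main__":
    # Four made-up satellites roughly 20,000 km up and a receiver near the surface.
    truth = np.array([1.0e6, 2.0e6, 6.0e6])
    sats = np.array([[15e6, 10e6, 20e6], [-12e6, 14e6, 18e6],
                     [9e6, -16e6, 19e6], [2e6, 3e6, 26e6]])
    bias = 3000.0                                       # receiver clock error, in metres
    ranges = np.linalg.norm(sats - truth, axis=1) + bias
    pos, b = gps_fix(sats, ranges)
    print(np.round(pos), round(b, 1))                   # recovers the position and bias
```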
The accurate time computed along with the position is hidden by most GPS applications, which use only the location information. A few specialized GPS
applications do use the time for time transfer, traffic signal timing, and synchronization of cell
phone base stations.
Although four satellites are required for normal operation, if one dimension (such as elevation) is already
known, a receiver can determine its position using only three satellites. For example, a ship or plane
may have a known elevation. Some GPS receivers may use additional clues or assumptions (such as reus-
ing the last known altitude, dead reckoning, inertial navigation, or including information from the vehi-
cle computer) to give a less accurate (degraded) position when fewer than four satellites are available.
9.3.4.5 Worldwide Deployment Status
In addition to the GPS deployed by the United States which is now open for global civilian applica-
tions by many countries, Russia has deployed GLONASS (for Global Navigation Satellite System)
for Russian military use exclusively. In the European Union, there is the Galileo positioning system
(EGNSS). By 2011, China had launched 8 satellites for the 31-satellite Beidou navigation system.
The full Beidou system will become operational by 2020.
Civilian GPS applications include navigation, surveying, mapmaking, cellular phones, tectonics,
disaster relief and emergency services, GPS tours, geofencing, recreation, aircraft tracking, and geo-
tagging, among others. Military applications of GPS include navigation, target tracking, missile and
projectile guidance, search and rescue, reconnaissance, and nuclear detonation detection, among
others. The following example shows a civilian application.
• Stage 1: Planning and Coordination A plan or strategy must be developed to address how a good or
service can satisfy the needs of customers.
• Stage 2: Material and Equipment Supplies This phase involves building a strong relationship with the
raw material suppliers and planning methods for shipping, delivery, and payment.
[Figure cycle: sale scanned at checkout, store forecast updated, store demand generated, store replenishment, delivery to store.]
FIGURE 9.21
Supply chain management in the multipartner business arena.
(Courtesy IGD Academy )
• Stage 3: Manufacturing and Testing The product is tested, manufactured, and scheduled for delivery.
• Stage 4: Delivery of Products Customer orders are taken and delivery of goods is planned.
• Stage 5: After-Sale Service and Returns Customers may return defective products and the company
addresses customers’ demands. Supply chain software is used by many companies for efficient supply
chain management.
FIGURE 9.22
Smart power grid.
(Source: http://impactnews.com/images/stories/CTA/2009/04/15-grid.jpg. Courtesy of: Community Impact Newspaper )
[Figure components: the cyberspace of embedded systems and the physical world under sensing and tracking, linked by a sensor network, with humans interacting in real time through a man-machine interface.]
FIGURE 9.23
Three major components working interactively and intelligently in a CPS.
should be designed as a network of interacting elements with physical input and output instead of
as a stand-alone device. The notion is extended from the concepts of robotics and sensor networks.
The expectation from the CPS community is to improve the link between computational and physi-
cal elements. The design objectives include the adaptability, autonomy, efficiency, functionality,
reliability, safety, and usability of future CPSes.
These three components interact with one another heavily. Sensor networks and man-machine
interfaces need to be established to glue the three sectors together. In a CPS, the cyberspace and
physical worlds are intertwined. The embedded computing, sensed real-world data, and real-time
responses are all equally important. New understandings and techniques are needed to deal with
this new fusion. New advances are desired to broaden the potential of CPSes in several dimensions.
One is human intervention at the right moment, such as collision avoidance in driving a vehicle.
Doctors need precision in robotic surgery and workers need nano-level manufacturing.
It is desired to design CPSes to replace humans in dangerous or inaccessible environments such
as battlefields, rescue from earthquake rubble, and deep-sea exploration. Users demand close coordi-
nation in air traffic control and war fighting and enhanced efficiency and augmentation of human
capabilities in health care monitoring and life-saving operations. Since 2006, the U.S. National
Science Foundation (NSF) has identified CPS as a key area of research. The NSF and other federal
agencies have sponsored several workshops on CPS in recent years.
The applications of CPS typically fall under sensor-based systems and autonomous systems. For
example, many wireless sensor networks monitor some aspect of the environment and relay the pro-
cessed information to a central node. Other types of CPSes include autonomous automotive sys-
tems, medical monitoring, process control systems, distributed robotics, and automatic pilot
avionics. The following example gives new directions of CPS applications.
and sensors. Emphasis of the project is on scalability and robustness. The solution was fully distributed
with simple individual components.
The MIT research group constructed a mobile end-effector to manipulate the robotic garden. They
installed intelligent pots to provide distributed sensing and data storage in the system. They consider
plants, pots, and robots to be systems with different levels of mobility, sensing, actuation, and autonomy.
The system transforms energy, water, and nutrients into production. The system studies important pro-
blems in multirobot coordination, force-feedback grasping, and spatial computing. The MIT research group
envisions the following long-term applications of their CPS work:
• Sustainable, organic agriculture Using a large number of robots and embedding intelligence enables
large-scale heterogeneous plant populations that are less prone to deplete the soil of specific nutrients
and might allow for reduction of pesticides by intelligent local combinations of plants. Also, on-demand
watering can reduce water consumption.
• Autonomous farms Providing a variety of products on a small scale in a fully autonomous fashion
enables long-term deployments of human populations in remote areas and on outer space missions.
• Green architecture Fruits and vegetables can be cultivated on the inside and outside of buildings where
they positively contribute to room climate and energy efficiency, and support the inhabitants’ diet.
Many scientific areas could be blended together to realize truly intelligent CPSes in addition to
embedded and real-time systems. These areas include avionics, automobiles, power systems, control
theory, and electro-mechanical engineering, among others. It has been suggested that CPS be a
priority area for R&D in federal networking and information technology in the United States.
studied. Users need to understand why they are so successful. To discover the problems they have
caused in cyberspace, users need to generate effective solutions without degrading network
performance.
Table 9.10 Assessment of Top Five Social Networking Web Sites in April 2011. Columns: Name and Web Site (Year); Brief Description; Registered Users; Personal Profiles; Privacy Security; Networking Features; Search Support; Overall Rating.
We are facing a so-called “small-world” society, in which all people are related through a short chain of social
acquaintances, one way or another. Social networks are not as chaotic or random as once
assumed, but rather have underlying structures. Some of the social networking rules or behaviors
are analyzed in this section.
Social relationships are often mapped into directed or undirected graphs, sometimes called
acquaintance graphs or simply social connection graphs. The nodes in a social graph correspond to
the users or actors and the graph edges or links refer to the ties or relationships among the nodes.
The graphs can be complex and hierarchically structured to reflect relationships at all levels. There
can be many kinds of ties between the nodes. Social networks operate from the family level up to
national and global levels. There are pros and cons of social networks. Most free societies welcome
social networks. For political or religious reasons, some countries block the use of social networks
to prevent possible abuses.
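The graph abstraction described here is easy to prototype; the sketch below builds a small undirected acquaintance graph as an adjacency list and reports node degrees (the toy data is invented, not taken from any real network):

```python
# Hedged sketch: an undirected social (acquaintance) graph as an adjacency list.
from collections import defaultdict

class SocialGraph:
    def __init__(self):
        self.adj = defaultdict(set)

    def add_tie(self, a, b):
        """Add an undirected tie (relationship) between actors a and b."""
        self.adj[a].add(b)
        self.adj[b].add(a)

    def degree(self, actor):
        return len(self.adj[actor])

if __name__ == "__main__":
    g = SocialGraph()
    for a, b in [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]:
        g.add_tie(a, b)
    print({actor: g.degree(actor) for actor in g.adj})   # A is the best-connected actor
```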
FIGURE 9.24
A graph representation of an example social network.
FIGURE 9.25
Graph representations of two social networks.
has identified many epidemic graphs on news, opinions, political mobilization, etc. Models for cascading
behavior are a popular topic in social networks.
FIGURE 9.26
Partitioning in paper citation graph to identify research communities.
(Courtesy of Yookyung Jo, Cornell University )
Figure 9.26(a) shows a citation graph of referenced papers by authors working in related areas. The
dots represent individual papers. The edges link two papers if one references the other, either one way or
mutually. Through some cleaning process, certain papers are merged due to self-citation or other con-
cerns. The reduced citation graph in Figure 9.26(b) is obtained with referenced papers cleaned up. One
can partition the cleaned graph into three categories of papers that are related to a similar approach or
based on the same solution model.
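A very simple way to experiment with this kind of grouping is to split a cleaned graph into its connected components, as sketched below on a toy citation graph; this is only the crudest possible partitioning rule, not the clustering method behind Figure 9.26:

```python
# Hedged sketch: grouping a cleaned citation graph by connected components.
from collections import deque

def connected_components(adj):
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, group = deque([start]), set()
        seen.add(start)
        while queue:
            v = queue.popleft()
            group.add(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        groups.append(sorted(group))
    return groups

if __name__ == "__main__":
    # Toy citation graph: an undirected edge means one paper references the other.
    adj = {
        "p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"],
        "p4": ["p5"], "p5": ["p4"],
        "p6": [],
    }
    print(connected_components(adj))    # three groups of related papers
```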
In tracking communities in social networks, many authors have revealed interesting statistical
properties of community structure in large social and information networks. Jure Leskovec, et al.
[49] have indicated that the best communities are of size 100–150. This finding is drawn from the
study of 70 large, sparse, real-world social networks. Such small communities are formed with
special relationships such as family and friends, classmates, and work colleagues. These commu-
nities may overlap in many ways. The peers could be connected inside or outside the community
boundary. He has revealed the fact that there are more connections outside than inside for most
people, which is intuitively valid based on the common experiences of many of us. Other inter-
esting observations include the existence of a core in social networks. Often, the core can be
either unique or shared.
These networks often act as a customer relationship management tool for companies selling pro-
ducts and services. Companies can also use social networks to advertise. With global operations,
social networks can make it easier to promote business around the world.
One example of social networking being used for business purposes is LinkedIn.com, which
aims to interconnect professionals. Another is the use of physical spaces available to members of a
social network. Some hub culture extends invitations only to entrepreneurs and other business suc-
cessors to join a social network. Having a physical presence allows members to network in the real
as well as virtual worlds, adding extra business value. Applications for social networking sites have
extended to businesses, and brands are creating their own sector known as brand networking, which
is based on the idea that a brand can build its consumer relationship by connecting consumers to
the brand image on a platform.
institutions. The advantage of using a dedicated medical social networking site is that all the
members are screened against the state licensing board list of practitioners. The role of social
networks is especially of interest to pharmaceutical companies who spend approximately 32 percent
of their marketing dollars to influence the opinion leaders of social networks.
A new trend is emerging with social networks created to help members with various physical and
mental ailments. For people suffering from life-altering diseases, PatientsLikeMe offers its members
the chance to connect with others dealing with similar issues and research patient data related to their
condition. For alcoholics and addicts, SoberCircle gives people in recovery the ability to communi-
cate with one another and strengthen their recovery through the encouragement of others who can
relate to their situation. DailyStrength is also a web site that offers support groups for a wide array of
topics and conditions, including the support topics offered by PatientsLikeMe and SoberCircle. Spark-
People offers community and social networking tools for peer support during weight loss.
[Panels: (a) a huge network of 100M+ nodes; (b) three users targeted; (c) attack accounts created (100M + 12 nodes); (d) attack accounts embedded; (e) hidden subgraph H found; (f) finding the target nodes.]
FIGURE 9.27
Attack tracking in an anonymized social network.
(Courtesy of Jon Kleinberg [44] )
attacker may create a small set of new accounts with links among them to form a subgraph H attached to
the targeted accounts, as shown in Figure 9.27(c). When the data set is released, the organization wants
to find H. This is a computationally hard problem, as shown in Figure 9.27(d).
In fact, Kleinberg indicated that the small random graph H will likely be unique and efficiently findable
as shown in Figure 9.27(e), because each edge of H is present with probability 0.5 in such a
random graph. Once H is found, one can find the targeted nodes by following the edges from H in
Figure 9.27(f). One can breach privacy with only O((log n)^(1/2)) new nodes. The company should avoid releasing
an anonymized social network. Privacy-preserving mechanisms are still very much in demand to protect
users from such attacks.
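The construction step of the attack can be sketched in a few lines (illustrative only; re-identifying H inside the released anonymized graph, the step that makes the attack work, is deliberately omitted, and all names below are hypothetical):

```python
# Hedged sketch of the construction described in [44]: build a small random subgraph H
# in which each internal edge appears with probability 0.5, then attach it to the targets.
import itertools
import random

def build_attack_subgraph(num_new, targets, seed=7):
    rng = random.Random(seed)
    new_nodes = [f"attack-{i}" for i in range(num_new)]
    edges = []
    for u, v in itertools.combinations(new_nodes, 2):
        if rng.random() < 0.5:                 # each edge of H present with probability 1/2
            edges.append((u, v))
    for node, target in zip(new_nodes, targets):
        edges.append((node, target))           # link H to the accounts being attacked
    return new_nodes, edges

if __name__ == "__main__":
    nodes, edges = build_attack_subgraph(num_new=12, targets=["u1", "u2", "u3"])
    print(len(nodes), "new accounts,", len(edges), "edges")
```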
[Chart categories: community, social good, events, naming, dating, fortunes, play with digital pet, enhanced communication, play game, media sharing, tell me about me, send gift, social comparison, profile enhancement, play social game, social selection, and miscellaneous.]
FIGURE 9.28
Distribution of 28.4 million Facebook applications, averaged for the week ending November 8, 2007.
(Source: The Facebook Application Ecosystem: Why Some Thrive- and Most Don’t An O’Reilly Radar Report by
Shelly D. Farnham, Ph.D Copyright © 2008 Waggle Labs All rights reserved. Used with Permission )
This distribution varies with time. Once the privacy and security issues are resolved for social net-
works, the number of registered users will grow even more in the future.
Profile page: Combined profile: profile picture, bio information, friends list, user's activity log, public message board, and other components selectively displayed
Social graph traversal: Access through the user's friends list on profile pages, with access control
Communication tools: Internal e-mail-like messages: send and receive private messages among friends; instant messaging: accessed on the web page or through a third-party client; public message board: the "Wall," with access control; status updates: short messages, like micro-blogging, with access control
Shared information: Photo album: built in, with access control; links: posted links to outside URLs appear in the activity log; videos: outside videos embedded on the profile page
Access control: Every item on the profile page can be set to one of four access control levels: only me, only friends, friends of friends, or everyone (a minimal sketch of this check follows the table)
Special APIs: Games, calendars, mobile clients
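The four access levels in this table can be modeled as a simple visibility check. The sketch below is a hypothetical illustration; the level names, the friends mapping, and the friend-of-friend lookup are assumptions made for clarity, not Facebook's actual data model or enforcement code.

```python
# Hypothetical model of the four profile-item access levels listed above:
# "only me" < "only friends" < "friends of friends" < "everyone".
from enum import IntEnum

class Access(IntEnum):
    ONLY_ME = 0
    ONLY_FRIENDS = 1
    FRIENDS_OF_FRIENDS = 2
    EVERYONE = 3

def can_view(viewer, owner, item_level, friends):
    """friends maps each user to the set of that user's friends (toy data)."""
    if viewer == owner:
        return True
    if item_level == Access.EVERYONE:
        return True
    if item_level == Access.ONLY_FRIENDS:
        return viewer in friends.get(owner, set())
    if item_level == Access.FRIENDS_OF_FRIENDS:
        direct = friends.get(owner, set())
        return viewer in direct or any(viewer in friends.get(f, set()) for f in direct)
    return False  # Access.ONLY_ME: visible to nobody but the owner

friends = {"alice": {"bob"}, "bob": {"alice", "carol"}, "carol": {"bob"}}
print(can_view("carol", "alice", Access.FRIENDS_OF_FRIENDS, friends))  # True
print(can_view("carol", "alice", Access.ONLY_FRIENDS, friends))        # False
```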
FIGURE 9.29
The Facebook architecture built with social engine servers and Application Programming Interfaces (APIs). Requests from pages, sites, and the user's networks enter the Ringside social application server, which combines a community engine; a social engine handling ID, security, rendering, Facebook integration, and a control panel; a social graph of profiles, tags, and relationships; and an extensible API layer comprising the Facebook API, the Ringside API, open social integration, Open DSL (fbml,*), and system apps.
(Courtesy of Facebook, Inc. 2010)
essentially a huge data center with a very large storage capacity, intelligent file systems, and search-
ing capabilities. Figure 9.29 illustrates the structure of the Facebook platform.
The platform is formed by a huge cluster of servers, called Ringside social application servers. The
social engine is the core of each server; it performs the functions of identification, security,
rendering, and Facebook integration. Three specific APIs are installed to
facilitate user access. A community engine provides networking services to users. Requests are
shown as pages, sites, and networks entering the Facebook server from the top. The social engine
executes all sorts of user applications. Open DSL is used to support application execution.
behind Twitter’s popularity. While blogging requires good writing skills and lots of content to fill
pages, Twitter restricts posts to 140 characters, which encourages many people to post.
Simplicity is provided through an open, HTTP-based API that lets third-party applications share and
retrieve posts. Twitter's API consists of two parts: a Search API and a REST API. The REST API
enables developers to access core Twitter data, including tweets, timelines, and user data. The
Search API enables developers to query tweets and also provides information about trending topics.
The use of both APIs is subject to rate limiting; Twitter may add trusted applications to its white
list to remove the request limitations. The ease and flexibility of the Twitter APIs have encouraged
many developers to write Twitter applications.
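As a concrete illustration, the short script below queries the Search API for recent tweets matching a keyword. The endpoint, the q and rpp parameters, and the results/from_user/text response fields reflect the 2010-era public JSON interface and are stated here as assumptions; the current Twitter API uses different, authenticated endpoints, so treat this purely as a sketch of how third-party applications consumed the Search API.

```python
# Hedged sketch: keyword search against the (2010-era) Twitter Search API.
# The endpoint, parameters, and response fields below are assumptions based on
# that era's public JSON interface; today's API requires authentication and
# uses different endpoints, so this is a historical illustration only.
import requests

def search_tweets(query, count=5):
    resp = requests.get(
        "http://search.twitter.com/search.json",
        params={"q": query, "rpp": count},   # rpp: results per page (historic name)
        timeout=10,
    )
    resp.raise_for_status()                  # unauthenticated calls were rate-limited
    return resp.json().get("results", [])

for tweet in search_tweets("cloud computing"):
    print(tweet.get("from_user"), ":", tweet.get("text"))
```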
FIGURE 9.30
Twitter architecture and access protocol sequence.
(Courtesy of Hao Song, USC 2009 )
Profile page: Simple profile: profile picture, bio information, followers list, following list, and the user's message timeline
Social graph traversal: Access through the following and followers lists on users' profile pages, without access control
Communication tools: Internal e-mail: send private direct messages to followers; instant messaging: not truly instant, but can be implemented with direct messages; public message board: the message timeline, with access control
Shared information: Photo album: third party, such as Twitpic, with no combined access control; links: posted links to outside URLs appear as messages on the timeline
Access control: The message timeline on the profile page can be set to one of two access control levels: private or public
API applications: Mobile clients, PC clients, photo sharing, automatic news feeds
simple: just sending messages to a group of people. All other uses are contributed by third-party
developers. Twitter provides a very open API, so third-party developers can create powerful clients or
special applications for Twitter. The connections on Twitter are not offline social relationships. Users
can follow someone they do not know in real life. Not everyone on Twitter uses his or her real name,
and even when real names are used, there is no effective way to verify a user's identity. Since
2009, Twitter has offered a verified-account service for celebrities and official agencies. However,
the disclosure of personal information on Twitter still poses a privacy problem.
Twitter cannot be regarded as a fully trusted network or a reliable acquaintance basis. Consider
two types of accounts: public and protected. Anyone can see the updates of a public account, whereas
only approved followers can see the updates of a protected account. To prevent identity confusion,
Twitter verifies selected accounts, mostly those of celebrities, and posts a verified badge on such
accounts. Twitter has also introduced two features, Twitter Lists and Location. Twitter Lists allow
users to create lists of Twitter accounts, which help them organize the people they follow. The
Location feature enables location-based services; location-sensitive applications, such as those
aided by GPS devices, can be greatly simplified with Twitter support.
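As an example of how such location-sensitive applications can lean on Twitter, the sketch below filters search results by location using the geocode parameter ("latitude,longitude,radius") of the same 2010-era Search API. The endpoint and parameter names are assumptions based on that era's interface and do not work against today's authenticated API.

```python
# Hedged sketch: location-filtered tweet search using the historic "geocode"
# parameter ("lat,long,radius") of the 2010-era Search API. Endpoint and
# parameter names are assumptions from that era, not today's Twitter API.
import requests

def tweets_near(lat, lon, radius_km, query=""):
    resp = requests.get(
        "http://search.twitter.com/search.json",
        params={"q": query, "geocode": f"{lat},{lon},{radius_km}km"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

# Example: recent tweets posted within 10 km of downtown Los Angeles.
for tweet in tweets_near(34.05, -118.25, 10, query="traffic"):
    print(tweet.get("from_user"), ":", tweet.get("text"))
```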
Acknowledgements
Kai Hwang of USC is the primary author of this chapter. He was assisted by Zhongyuan Qin, Kaikun
Dong, Vikram Dixit, and Hao Song of the University of Southern California. Judy Qiu of Indiana
University has contributed Sections 9.2.1 and 9.2.5.
References
[1] M. Armbrust, A. Fox, R. Griffith, et al., Above the Clouds: A Berkeley View of Cloud Computing, Tech-
nical Report No. UCB/EECS-2009-28, University of California at Berkeley, 10 February 2009.
[2] C. Bardaki, P. Kourouthanassis, RFID-integrated retail supply chain services: Lessons learnt from the
SMART project, in: Proceedings of the Mediterranean Conference on Information Systems (MCIS 2009),
Athens, Greece.
[3] M. Barrenechea, SGI CEO. HPC Cloud-Cyclone. www.sgi.com/cyclone, 2010.
[4] J. Bishop, Understanding and Facilitating the Development of Social Networks in Online Dating Commu-
nities: A Case Study and Model. www.jonathanbishop.com/Library/Documents/EN/docSNCEDS_Ch15
.pdf, 2008.
[5] R. Bryant, Data Intensive Supercomputing: The Case for DISC, Technical Report, CMU CS-07-128,
http://www.cs.cmu.edu/~bryant, 2007.
[6] J. Brodkin, Ten Cloud Computing Companies to Watch. Network World. www.cio.com/article/print/
492885, 2010.
[7] S. Buchegger, A. Datta, A case for P2P infrastructure for social networks: Opportunities, challenges,
in: Sixth International Conference on Wireless on-Demand Network Systems and Services (WONS 2009).
[8] R. Buyya, J. Broberg, A. Goscinski (Eds.), Cloud Computing: Principles and Paradigms, Wiley Press,
New York, 2011.
[9] R. Buyya, C. Yeo, S. Venugopal, Market-oriented cloud computing: Vision, hype, and reality for delivering
IT services as computing utilities, in: 10th IEEE International Conference on High Performance Computing
and Communications, September 2008.
[10] P. Carrington, et al., (Eds.), Models and Methods in Social Network Analysis, Cambridge University Press,
2005.
[11] CERN VM download. http://rbuilder.cern.ch/project/cernvm/build?id=81, 2010.
[12] K. Chen, K. Hwang, G. Chen, Heuristic discovery of role-based trust chains in P2P networks, IEEE Trans.
Parallel Distrib. Syst. (2009) 83–96.
[13] Condor Cloud Computing. www.cs.wisc.edu/condor/description.html, 2010.
[14] J. Dean, Handling Large Datasets at Google: Current Systems and Future Directions, Invited Talk at HSF
Panel. http://labs.google.com/people/jeff, 2008.
[15] E. Deelman, G. Singh, M. Livny, B. Berriman, J. Good, The cost of doing science on the cloud: The
Montage example, in: Proc. of ACM/IEEE Conf. on Supercomputing, IEEE Press, Austin, TX, 2008,
pp. 1–12.
[16] M. Demirbas, M.A. Bayir, C.G. Akcora, Y.S. Yilmaz, H. Ferhatosmanoglu, Crowd-sourced sensing and
collaboration using Twitter, in: IEEE International Symposium on World of Wireless Mobile and Multi-
media Networks (WoWMoM), Montreal, 14–17 June 2010, pp. 1–9.
[17] Distributed Robotics Garden, MIT. http://people.csail.mit.edu/nikolaus/drg/, 2010.
[18] V. Dixit, Cloud Mashup: Agility and Scalability, EE 657 Final Project Report, Univ. of S. Calif., May 2010.
[19] J. Ekanayake, X. Qiu, T. Gunarathne, S. Beason, G. Fox, High performance parallel computing with
clouds and cloud technologies, in: Cloud Computing and Software Services: Theory and Techniques,
CRC Press (Taylor and Francis), 2010, p. 30.
[20] J. Ekanayake, A.S. Balkir, T. Gunarathne, et al., DryadLINQ for scientific analyses, in: Fifth IEEE Inter-
national Conference on eScience, Oxford, England, 2009.
[21] J. Ekanayake, H. Li, B. Zhang, et al., Twister: A runtime for iterative MapReduce, in: Proceedings of the
First Int’l Workshop on MapReduce and Its Applications, ACM HPDC, 20–25 June 2010, Chicago.
[22] J. Ekanayake, T. Gunarathne, J. Qiu, Cloud technologies for bioinformatics applications, in: IEEE Trans-
actions on Parallel and Distributed Systems, accepted to appear, http://grids.ucs.indiana.edu/ptliupages/
publications/BioCloud_TPDS_Journal_Jan4_2010.pdf, 2011.
[23] S. Farnham, The Facebook Application Ecosystem, An O’Reilly Radar Report, 2008.
[24] Q. Feng, K. Hwang, Y. Dai, Rainbow Product Ranking for Upgrading e-Commerce, IEEE Internet Com-
put. 13 (5) (2009) 72–80.
[25] D.G. Feitelson, Workload Modeling for Computer Systems Performance Evaluation, Draft Version 0.7,
Hebrew University of Jerusalem, 2006.
[26] P. Fong, M. Anwar, Z. Zhao, A privacy preservation model for Facebook-style social network systems,
in: European Symposium on Research in Computer Security (ESORICS 2009), 21–23 September 2009.
[27] F. Fovet, Impact of the use of Facebook amongst students of high school age with social, emotional and
behavioural difficulties (SEBD), in: IEEE 39th Frontiers in Education Conference, 2009.
[28] G. Fox, S. Bae, J. Ekanayake, X. Qiu, H. Yuan, Parallel data mining from multicore to cloudy grids,
in: High Speed and Large Scale Scientific Computing, IOS Press, Amsterdam, 2009.
[29] L. Freeman, The Development of Social Network Analysis, Empirical Press, Vancouver, 2006.
[30] FutureGrid Cyberinfrastructure to allow testing of innovative systems and applications, Home page. www
.futuregrid.org, (accessed 13.11.10).
[31] S.L. Garfinkel, An evaluation of Amazon’s grid computing services: EC2, S3 and SQS, in: Center for
Research on Computation and Society, Harvard University, Technical Report, 2007.
[32] S. Garfinkel, Commodity grid computing with Amazon’s S3 and EC2, Login 32 (1) (2007) 7–13.
[33] L. Gong, S.H. Sun, E.F. Watson, Performance modeling and prediction of non-dedicated network comput-
ing, IEEE Trans. Computers 51 (9) (2002) 1041–1055.
[34] A. Greenberg, J. Hamilton, D.A. Maltz, P. Patel, The cost of a cloud: Research problems in data center
networks, in: SIGCOMM Computer Communication Review, Vol. 39, No. 1, pp. 68–73, 2008, http://doi
.acm.org/10.1145/1496091.1496103.
[35] Grid’5000 and ALADDIN-G5K: An infrastructure distributed in 9 sites around France, for research in
large-scale parallel and distributed systems. https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home,
(accessed 20.11.10).
[36] R. Grossman, Y. Gu, M. Sabala, et al., The open cloud testbed: Supporting open source cloud computing
systems based on large scale high performance, in: A. Doulamis, et al., (Eds.), Dynamic Network Services,
Springer, Berlin Heidelberg, 2010, pp. 89–97.
[37] R. Grossman, Y. Gu, J. Mambretti, et al., An overview of the open science data cloud, in: Proc. of the
19th ACM Int’l Symp. on High Performance Distributed Computing, Chicago, 2010, pp. 377–384.
[38] T. Gunarathne, T.L. Wu, J. Qiu, G. Fox, Cloud Computing Paradigms for Pleasingly Parallel Biomedical
Applications, in: Proceedings of the Emerging Computational Methods for the Life Sciences Workshop of
ACM HPDC 2010 Conference, Chicago, 20–25 June 2010.
[39] C. Hoffa, et al., On the use of cloud computing for scientific workflows, in: IEEE Fourth International
Conference on eScience, December 2008.
[40] J. Hopcroft, Computer science theory to support research in the information age, in: Distinguished Lecture,
University of Southern California, 6 April 2010.
[41] K. Hwang, Z. Xu, Scalable Parallel Computing: Technology, Architecture and Programmability, McGraw-
Hill Book Co., New York, 1998.
[42] K. Hwang, D. Li, Trusted cloud computing with secure resources and data coloring, IEEE Internet Com-
put. (September) (2010) 14–22.
[43] K. Keahey, M. Tsugawa, A. Matsunaga, J. Fortes, Sky computing, IEEE Internet Comput. 13 (2009)
43–51, doi:http://doi.ieeecomputersociety.org/10.1109/MIC.2009.94; www.nimbusproject.org/files/Sky_
Computing.pdf.
[44] J. Kleinberg, Algorithmic Perspectives on Large-Scale Social Network Data, Cornell University, 2008.
[45] G. Kortuem, F. Kawsar, D. Fitton, V. Sundramoorthy, Smart objects as building blocks for the Internet of
things, IEEE Internet Comput. 14 (1) (2010) 44–51.
[46] R. Kumar, J. Novak, A. Tomkins, Structure and evolution of online social networks, in: The 12th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2006.
[47] E.A. Lee, Cyber physical systems: Design challenges, in: 11th IEEE International Symposium on Object
Oriented Real-Time Distributed Computing, 5–7 May 2008, pp. 363–369.
[48] A. Langville, C. Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings,
Princeton University Press, Princeton, NJ, 2006.
[49] J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney, Statistical properties of community structure in large
social and information networks, in: International World Wide Web Conference, (WWW), 2008.
[50] H. Li, Performance evaluation in grid computing: A modeling and prediction perspective, in: Seventh IEEE
International Symposium on Cluster Computing and the Grid (CCGrid 2007), May 2007, pp. 869–874.
[51] Z.Y. Li, G. Xie, K. Hwang, Z.C. Li, Proximity-Aware overlay network for fast and churn resilient data
dissemination, in: IEEE Transactions on Parallel and Distributed Systems, Accepted to appear 2011.
[52] D. Linthicum, Cloud Computing and SOA Convergence in Your Enterprise: A Step-by-Step Guide,
Addison Wesley Professional, 2009.
[53] B.A. Lloyd, Professional networking on the Internet, in: Pulp and Paper Industry Technical Conference,
Birmingham, AL, 2009, pp. 62–66.
[54] X. Lou, K. Hwang, Collusive Piracy Prevention in P2P Content Delivery Networks, IEEE Trans. Compu-
ters 58 (July) (2009) 970–983.
[55] X. Lou, K. Hwang, Y. Hu, Accountable file indexing against poisoning DDoS attacks in P2P networks,
in: IEEE Globecom, Honolulu, 3 November 2009.
[56] Magellan: A cloud for science at Argonne. http://magellan.alcf.anl.gov/, (accessed 15.11.10).
[57] L. Mei, W. Chan, T. Tse, A tale of clouds: Paradigm comparisons and some thoughts on research issues,
in: IEEE Asia-Pacific Services Computing Conference, December 2008.
[58] A. Mislove, M. Marcon, K.P. Gummadi, P. Druschel, B. Bhattacharjee, Measurement and analysis of
online social networks, in: The 7th ACM SIGCOMM Conference on Internet Measurement, October 2007.
[59] J. Napper, P. Bientinesi, Can cloud computing reach the top500? in: Proceedings of the Combined Work-
shops on Unconventional High Performance Computing Workshop Plus Memory Access Workshop,
ACM, Ischia, Italy, 2009, pp. 17–20.
[60] Nimbus Cloud Computing. http://workspace.globus.org/, 2010.
[61] W. Norman, M. Paton, T. de Aragao, et al., Optimizing utility in cloud computing through autonomic
workload execution, in: IEEE Computer Society Technical Committee on Data Engineering, 2009.
[62] D. Nurmi, R. Wolski, C. Grzegorczyk, et al., Eucalyptus: A technical report on an elastic utility comput-
ing architecture linking your programs to useful systems, UCSB, Santa Barbara, Technical Report, 2008.
[63] C. Olston, B. Reed, U. Srivastava, et al., Pig Latin: A not-so-foreign language for data processing,
in: Proceedings of the 2008 ACM SIGMOD Int’l Conf. on Management of Data, Vancouver, 9–12
June 2008.
[64] Open Cirrus, Welcome to Open Cirrus, the HP/Intel/Yahoo! Open Cloud Computing Research Testbed.
https://opencirrus.org/, (accessed 20.11.10).
[65] M. Palankar, A. Onibokun, A. Iamnitchi, M. Ripeanu, Amazon S3 for science grids: A viable solution?
Computer Science and Engineering, University of South Florida, Technical Report, 2007.
[66] A. Passant, T. Hastrup, U. Bojars, J. Breslin, Microblogging: A semantic web and distributed approach,
in: 4th Workshop on Scripting for the Semantic Web in conjunction with ESWC 2008.
[67] J. Qiu, T. Gunarathne, J. Ekanayake, et al., Hybrid cloud and cluster computing paradigms for life science
applications, in: 11th Annual Bioinformatics Open Source Conference (BOSC 2010), Boston, 9–10 July 2010.
[68] J. Qiu, T. Ekanayake, T. Gunarathne, et al., Data Intensive Computing for Bioinformatics. http://grids.ucs
.indiana.edu/ptliupages/publications/DataIntensiveComputing_BookChapter.pdf, 29 December 2009.
[69] D. Reed, Clouds, clusters and ManyCore: The revolution ahead, in: IEEE International Conference on
Cluster Computing, 29 September–1 October 2008.
[70] J. Rittinghouse, J. Ransome, Cloud Computing: Implementation, Management and Security, CRC Pub-
lisher, 2010.
[71] B. Rochwerger, D. Breitgand, E. Levy, et al., The Reservoir Model and Architecture for Open Federated
Cloud Computing, IBM Syst. J. (2008).
[72] V. Sanhu, The CERN Virtual Machine and Cloud Computing, B.S. Thesis at the Dept. of Physics,
University of Victoria, Canada, 29 January 2010.
[73] G. Santucci, The Internet of Things: Between the Revolution of the Internet and the Metamorphosis of
Objects. http://ec.europa.eu/information_society/policy/rfid/documents/iotrevolution.pdf, 2010.
[74] M. Satyanarayanan, V. Bahl, R. Caceres, N. Davies, The case for VM-based cloudlets in mobile comput-
ing, IEEE Pervasive Comput. 8 (4) (2009) 14–23.
[75] J. Schopf, F. Berman, Performance prediction in production environments, in: 12th International Parallel
Processing Symposium, Orlando, FL, April 1998, pp. 647–653.
[76] Science Clouds: Informal group of small clouds made available by various institutions on a voluntary
basis. http://scienceclouds.org/, (accessed November 2010).
[77] H. Song, Exploring Facebook and Twitter Technologies for P2P Social Networking, in: EE 657 Final
Project Report, University of Southern California, May 2010.
[78] H. Sundmaeker, P. Guillemin, P. Friess, S. Woelfflé, Vision and Challenges for Realising the Internet of
Things, European Union, March 2010.
[79] Venus-C, Virtual Multidisciplinary Environments Using Cloud Infrastructure. www.venus-c.eu/Pages/
Home.aspx, (accessed November 2010).
[80] E. Walker, Benchmarking Amazon EC2 for high-performance scientific computing, Login 33 (5) (2008)
18–23.
[81] E. Welbourne, L. Battle, G. Cole, et al., Building the Internet of things using RFID: the RFID ecosystem
experience, IEEE Internet Comput. 13 (3) (2009) 48–55.
[82] S. Wasserman, K. Faust, Social Network Analysis: Methods and Applications, Cambridge University
Press, Cambridge, 1994.
[83] D. Watts, Small Worlds: The Dynamics of Networks between Order and Randomness, Princeton University
Press, Princeton, 2003.
[84] M. Weng, A Multimedia Social Networking Community for Mobile Devices, Tisch School of The Arts,
New York University, 2007.
[85] Wikipedia, Cyber-physical systems. http://en.wikipedia.org/wiki/Cyber-physical_system, 2010.
[86] Wikipedia, Social Network. http://en.wikipedia.org/wiki/Social_network, 2010.
[87] Wikipedia, Facebook. http://en.wikipedia.org/wiki/Facebook, March 3, 2011.
[88] Wikipedia, Mashup (web app hybrid). http://en.wikipedia.org/wiki/Mashup_%28web_application_hybrid%,
29 November 2010.
[89] Y. Wu, K. Hwang, Y. Yuan, C. Wu, Adaptive workload prediction of grid performance in confidence
windows, IEEE Trans. Parallel Distrib. Syst. 21 (July) (2010) 925–938.
[90] Z. Xu, K. Hwang, Early prediction of MPP performance: SP2, T3D, and Paragon experiences, J. Parallel
Comput. 22 (7) (1996) 917–942.
[91] L. Yan, Y. Zhang, L.T. Yang, H. Ning, The Internet of Things: From RFID to the Next-Generation Pervasive
Networked Systems, Auerbach Publications, 2008.
[92] B. Yan, G. Huang, Supply chain information transmission based on RFID and Internet of things, in: Interna-
tional Colloquium on Computing, Communication, Control, and Management (CCCM 2009) Vol. 4, 2009,
pp. 166–169.
[93] M. Yigitbasi, A. Iosup, D. Epema, C-Meter: a framework for performance analysis of computing clouds,
in: International Workshop on Cloud Computing, May 2009.
[94] B.J. Zhang, Y. Ruan, T.L. Wu, J. Qiu, A. Hughes, G. Fox, Applying twister to scientific applications, in:
International Conference on Cloud Computing (CloudCom 2010), http://grids.ucs.indiana.edu/ptliupages/
publications/PID1510523.pdf, 2010.
HOMEWORK PROBLEMS
Problem 9.1
Describe the shortcomings of mobile devices and distance clouds for mobile cloud computing.
Explain how the cloudlets developed at CMU can alleviate these shortcomings to enable mobile
cloud computing requests submitted from handheld pervasive devices.
Problem 9.2
Answer the following two questions on GPSes:
a. Compare the four satellite positioning systems developed in the United States, Russia, the
European Union, and China, in terms of their satellite technology, deployment architecture,
coverage areas, precision, capability, and application areas.
b. Explain why a GPS receiver needs to use the signals from four satellites to locate a single point
(a 3D location) on the Earth’s surface.
Problem 9.3
What are the differences between active RFID and passive RFID? Comment on their advantages
and limitations. Do the same for active versus passive GPS receivers.
Problem 9.4
Compare Facebook and Twitter in four reported applications and discuss their relative strengths and
weaknesses in these applications.
Problem 9.5
Workload traces on AuverGrid can be described with three workload patterns including sudden
level change, sudden fluctuation, and gradual level change. What are the three load patterns and
how would you characterize their potential impact on the performance prediction results?
Problem 9.6
The IoT differs from the traditional Internet in many ways. Identify their differences and describe
their distinctions in connecting entities, infrastructure and networking, and application domains.
Problem 9.7
Distinguish the following pairs of terminologies:
a. Open source versus proprietary operating systems
b. Internet of Things versus cyber-physical systems
c. Social networks versus professional networks
Problem 9.8
Answer the following two questions on wireless sensor networks:
a. Characterize three generations of wireless sensor networks in terms of their size, weight, power
supply, deployment, and so on.
b. Low power operation is desired in using a wireless sensor network (WSN). Plot the desired
power distribution in using a WSN. You should explain how to minimize power consumption
by operating the WSN in three modes periodically.
Problem 9.9
Explain how to apply the RFID technology in merchandise positioning during transportation and
counting of the remaining goods in a supermarket.
Problem 9.10
Describe the following technology terms, functional modules, and cloud services associated with the
IBM Blue Cloud and RC2 cloud. You may need to get more information from IBM’s web sites or
the IBM Journal of Research and Development for reported achievements.
a. IBM WebSphere
b. Tivoli Service Automation Manager (TSAM)
c. IBM Research Compute Cloud (RC2)
d. Tivoli provisioning manager
e. Linux with Xen
Problem 9.11
Conduct research on the possible use of cloud resources for processing data collected in selected IoT
applications (see Figure 9.19). Propose a solution scheme and the possible cloud resources needed.
You can narrow down the scope of sensor data collected to control the complexity involved.
Problem 9.12
Study the performance metrics for HPC applications (Section 9.2.2) and HTC applications (Section
9.2.3). Research how to improve their accuracy and effectiveness through analysis of real-life HPC
and HTC benchmark experiments on data-center clusters or on cloud platforms that you have
accessed. Suggest new or improved performance metrics that can cover more machine characteris-
tics and application parameters identified in the benchmark experiments.
Problem 9.13
Briefly explain how Facebook and Twitter cope with the security, privacy and copyright violation
issues in online social networking applications. Common protection techniques and unique solutions
to each system should be treated separately.
Problem 9.14
Match the following 20 names or abbreviated terms in the left column with the best-match defini-
tions or descriptions in the right column.
_____MySpace a) The world’s largest social network in terms of registered users today
_____Cloudlet b) A social network mainly used for video strip forwarding applications
_____Bluetooth c) A natural extension of the Internet involving humans in the loop of control
_____IoT d) A social network for microblogging and news alert applications
_____WiMAX e) A radio tagging technology used in Internet of Things construction
_____RFID f) An infrastructure that supports remote cloud access by mobile devices
_____ZigBee g) A private cloud built by IBM for internal use in R&D applications
_____GPS h) An Internet extension for interconnecting all objects, people, animals, and products
_____Facebook i) An Internet access technology used mainly for its speed and flexibility
_____YouTube j) The most scalable wireless sensor network
_____Twitter k) A fixed wireless technology for broadband Internet access
_____Nebula l) A satellite technology for location-sensitive applications
_____CPS m) A wireless technology used to replace cables in short-range applications
_____RC2 n) One of the most frequently used SaaS applications in the business world
_____LinkedIn o) A social network mainly used in professional networking over the Internet
_____WiFi p) A social network web site specialized in customized social services
_____CRM q) A large private cloud used initially for aerospace and weather forecasting applications
_____Cloudlet r) A public cloud used in IaaS applications under a pay-as-you-go billing system
Problem 9.15
Circle only one answer in each of the following multiple-choice questions:
1. Which one of the following social network properties is correctly stated?
a. Structural holes can be eliminated by filling in one or more links to other nodes.
b. The local bridge consists of only one link, like a bridge.
c. The centrality has nothing to do with the node degree.
2. Which design choice was NOT adopted in the Kimberley cloudlet prototype at CMU?
a. The use of an imported virtual machine (VM) from a mobile device.
b. A base VM is deployed in the cloudlet host machine.
c. An imported VM, once deployed in the cloudlet, cannot be removed.
3. Which of the following descriptions of the Internet of Things is correctly stated?
a. The IoT does not apply GPS technology in supply chain management.
b. A smart grid does not explore solar energy or wind power.
c. ZigBee networks are more scalable than WiFi and Bluetooth networks.
4. Which of the following assessments on today’s Facebook applications is correctly stated?
a. The most popular use of Facebook today is to enhance communication.
b. Gift sending and dating services are not supported by Facebook.
c. Personal profiles of Twitter users are better supported than those for Facebook users.
Problem 9.16
Conduct a Google search for technical information on the following two large-scale private clouds.
Describe what you have found out from reliable sources of information. Try to reveal the cloud
platform architecture and its infrastructure bases and scope of major applications.
a. The private cloud under development by NASA in the United States
b. The private cloud under development by CERN in Europe
Problem 9.17
Briefly describe the following technology terms, functional modules, and cloud services associated
with the SGI Cyclone cloud engine:
a. Scale-up, scale-out, and hybrid management of cloud servers
b. SGI Altix servers and ICE clusters
c. Reported Cyclone HPC applications
Problem 9.18
This problem is related to Example 9.13. Prove that the identification of the attack accounts (sub-
graph H ) and the tracing of the targeted users are solvable mathematically.
Index
Page numbers in italics indicate figures and tables
enabling technologies for, 576–590, 578 Job scheduling methods, 104–106, 105
GPS and, 587–590 resource management and, 435–437
mobility and security support for, 561, 561 selfish grids and, 457
RFID and, 579, 580–582 space sharing, 105–106
sensor networks and, 582–587, 586 tiling, 105, 106
technology roadmap for, 578 time sharing, 106
ubiquitous computing and, 576–580 Job-level parallelism (JLP), 9
Internet Protocol Security (IPsec), 37 Joint ventures, 421
Internet Suspend-Resume (ISR), 162 Joyent Cloud, 202
Internet traffic, 480, 481 JPA. See Java Persistence API (JPA)
Internode communication, 70 JSR-311 (JAX0RS), 275
Interoperability, 69, 226–227, 462 JVM. See Java Virtual Machine (JVM)
Intrusion detection systems (IDSes), 50, 462
distributed, 252–253 K
host-based, 176 K means clustering, 353, 354
network-based, 176 Kad network, 497
VM-based, 176–179 Kademlia, 499
I/O address space, single, 92, 93 KaZaA, 482, 483, 493–494, 494, 502–503, 502, 537
I/O node, 88 Kernel layer, 88
I/O virtualization, 150–153, 152 Kernel level, checkpointing at, 101–102
IP addresses, 2128, 13 Kesselman, Carl, 416
IPv6 protocol, 13 Kimberley, 560
IRods, 295, 309 Kleinberg, Jon, 607–608
Isolated redundancy, 98 Klienrock, Leonard, 6
IT as a service, 200 Knowledge grids, 417
IT growth, 29 Khronos Group, 86
Iterative MapReduce, 351–355, 353, 371 KVM (kernel-based VM), 144, 146
J L
Jaguar. See Cray XT5 Jaguar Laboratory Information Systems (LIMS), 316
Java, 37–38 LAMP architecture, 298
Java Data Object (JDO), 370 Language Integrated Query (LINQ), 361–365
Java Message Service (JMS), 37, 105, 291 Latency processor, 84
Java Persistence API (JPA), 370 LDAP (Lightweight Directory Access Protocol), 38
Java Virtual Machine (JVM), 133 Libraries, checkpointing at, 101–102
JavaSpace, 313 Library support level virtualization, 132, 138–140
JBoss, 280 Lightweight Directory Access Protocol (LDAP), 392, 430
JBPM project, 280 LimeWire, 537
JDO. See Java Data Object (JDO) Linda programming model, 313
JLP. See Job-level parallelism (JLP) Link layer, 418
JMS. See Java Message Service (JMS) Linked Environments for Atmospheric Discovery (LEAD),
JNDI (Jini and Java Naming Directory Interface), 38 302
Job execution environments, 312–314 LinkedIn, 606
Job management, 70 Linux clusters, MOSIX2 for, 41–42, 110
Job management systems (JMS), 93, 95, 107–109 Linux OS, 72, 137
administration, 107 Linux vServer, 137
cluster job types, 107 Live migration, 159–162, 160, 165
cluster workload characteristics, 107–108 Load balancing, 158, 202, 380, 385
features of, 108–109 Load index, 457
migration schemes, 108 Load information managers (LIMs), 109
Job response time, 45 Load management, 107
Job scheduler, 107 Load sharing, 107
Load sharing facility (LSF), 109–110, 554 control flow implementation, 352
application of, on cluster, 110 data and control flow, 348–351
architecture, 109 data flow implementation, 350
utility commands, 109–110 definition of, 346
Load-balancing clusters, 71 description and features, 42, 43
Local area networks (LANs), 21–22, 27 formal notation of data flow, 348
virtual, 235 framework, 345
Local bridge, 602 Hadoop implementation, 355–359, 364–365
Local files, 90 iterative, 351–355, 353, 371
Local jobs, 107 logical data flow, 346–348, 347
competition between cluster jobs and, 106 performance, 575, 576
Local node, 88 problem-solving strategy, 348
Local Resource Management System (LRMS), 389 scalability, 558
Local scheduler, 448 MapReduce++, 353, 369, 370, 371
Local scheduling, 106 Mashups. See Cloud mashups
Local storage, 90 Massive parallelism, 17, 18, 19
Local trust scores, 518, 520 clustering for, 66–75
Location as a Service (LaaS), 235 GPU clusters for, 83–87
Location transparency, 88, 91 Massive systems, 27, 27
Location-aware request distribution algorithms (LARD), Massively distributed systems, 8
312 Massively parallel processors (MPPs), 5, 69, 369, 416, 546
Location-based service (LBS), 587 architecture, 75–87
Logistics services, 592 design issues, 69–71
Look-ahead random walk (LRW), 520 packaging, 77–79
Loosely synchronous applications, 369, 371 Master node, 99–100
Low-cost design philosophy, 25–26 Master-worker model, 349
Lsbatch, 109 Matchmaking, 312
Lshosts, 109 Matrix multiplication, 86
LSLIB (load sharing library), 109 Maya GUI, 398
Lsmake, 110 MC Cluster, 67
Lsrun, 109 MCAT, 309
Lstcsh, 109 M/DB, 341
Lstools, 109 MDCube (Modularized Datacenter Cube), 213, 214
Lxrun, 138, 139 MDS module, 447, 450, 452
Mean square error (MSE), 460
M Mean time to failure (MTTF), 48, 98, 100–101, 102
Machine memory, 148 Mean time to repair (MTTR), 48, 98, 100–101
Machine size, 45 Medical applications, of social networks, 606
Magellan, 565 Megabits per second (Mbps), 14
Mainframe computers, 4 Megastations, 70
Malware-based attacks, 249, 251 Memcache, 372
Man-in-the-middle attacks, 225, 251, 255, 255 Memex, 547
Manjrasoft, 205, 393–398 Memory
Many-core chip multiprocessors (CMPs), 153, 155 machine, 148
Many-core GPU processors, 15 physical, 148
Map function, 43, 345–346, 348 shared, 77, 369, 370
Mapping, 344, 368–370 virtual, 130, 148
MapReduce, 26, 203, 312–313, 336 Memory access time, 20
architecture, 357, 357–358 Memory bus, 77
benchmarking, 574–576 Memory controllers (MCs), 18, 84
comparisons with, 342 Memory management unit (MMU), 148
compute-data affinity, 351 Memory migration, 162
Replica Location Service (RLS), 309 Ringside social application servers, 610
Replication, 528 RISC processors, 15
Representational State Transfer (REST), 273, 273–277 RLS (Replica Location Service), 452
architecture, 275 Roadrunner. See IBM Roadrunner
principles of, 273 Robotic garden project, 596–597
queuing services and, 341 Rollback recovery, 70, 100
Reputation management, in P2P networks, 489 Root directories, 137
Reputation sharing, 227 Rootkits, 251
Reputation systems, 257–260, 456, 456 Routing, in P2P networks, 505–507
for clouds, 258–259 Routing efficiency, 487
design options, 257–258, 258 RowKey property, 386
in P2P networks, 514, 518, 520 Run() method, 384
trust overlay networks, 259–260, 260 RunJob(conf) function, 358
Reputation vector, 520 Runtime environment, 451
Reputation-based trust model, 257–260, 462–463 Runtime support services, 237
Research clouds, 562–566
Research Compute Cloud (RC2), 194–196 S
Research grids, 444 SaaS. See Software as a Service (SaaS)
Resource Description Framework (RDF), 310 SAGA (Simple API for Grid Applications), 444
Resource layer, 52, 53, 419 Salesforce.com, 205, 228, 229, 548, 552
Resource management SAML protocol, 466
See also Grid resource management SAN. See Storage area networks (SANs)
inter-cloud, 234–249 Savitsky-Golay filter, 460
optimization of, 170 Sawzall, 365–366, 365–368
services, 284 SAXPY, 86
virtual clusters and, 155–169 Saxpy_serial routine, 86
Resource management layer, 282, 283 Scalability, 8, 10, 68, 176, 558
Resource management system (RMS), 388, 435–437, 435 application, 45
Resource manager, 107 cloud, 226–227
Resource monitoring, 437, 437 dimensions of, 45
Resource provisioning, 237–243, 238 horizontal, 570
of compute resources, 237–238 multitasking, 569–570
demand-driven, 239, 240 vs. OS image count, 46, 46–47
dynamic, 239–241 of P2P networks, 481, 486
event-driven, 238–239, 240 size, 45
methods, 238–239 software, 45
popularity-driven, 238–239, 240 vertical, 570
of storage resources, 242–243 Scalable coherence interface (SCI), 77
underprovisioning, 238 Scalable computing, 4–13
Resource sharing, 7, 77, 78 Scalable parallel computing environments, 338
Resource-pulling model, 465, 465 Scalable performance, 46, 70
REST. See Representational State Transfer (REST) Scaled-workload speedup, 47
REST systems, 38 Scale-free graphs, 493
RESTful Web services Scheduling, 344
in Amazon S3, 275–276 Science and research clouds, 562–566
interaction with, 274 Science Clouds, 566
Web 2.0 applications and, 276 Science gateways, 294, 294–303, 338
Restlet, 275 exemplars, 295–297
Retailing, IoT and, 591–594 HUBzero, 297–301
RFID. See Radio-frequency identification (RFID) Open Gateway Computing Environments (OGCE) project,
RFID tags, 580–582 301–303
RFT (Reliable File Transfer), 448, 451 Scratch space, 91
Simple Object Access Protocol (SOAP), 38, 278–281, 382, 466 Software stacks, for cloud computing, 236
Simple Queue Service (SQS), 379 Solid-state drives (SSDs), 20
Simple Storage Service. See Amazon S3 (Simple Storage Service) SourceForge.net, 297, 303
SimpleDB, 336, 379, 383–384 SP2 Server Cluster, 67, 98
SimpleGrid project, 294 Space, 45
Simultaneous multithreaded (SMT) processor, 16 Space sharing, 105–106
Single Cloud File Storage API, 340 Special function units (SFUs), 18
Single entry point, 89 Speed performance, 5, 72
Single file hierarchy, 90 Speedup factor, 47
Single instruction, multiple data (SIMD), 71 Sphere, 336, 390–391
Single memory space, 92, 92 SPMD (single-program and multiple data), 368
Single networking, 91–93, 92 SPRESI services, 288–289
Single point of control, 88, 92, 92 Spyware, 251
Single point of failure, 48, 97, 97 SQL, 340–341
Single point of management, 92 SQL Data Services (SDS), 386
Single-chip cloud computer (SCC), 87 SQL queries, 313
Single-precision (SP) arithmetic, 83 SQLAzure, 385
Single-system image (SSI), 29, 41, 70 SSDs. See Solid-state drives (SSDs)
features, 87–95 SSI. See Single-system image (SSI)
in MOSIX2, 110–111 Stable storage, 90
Size scalability, 45 Standard instances, 381, 382
Skip-list-based systems, 498 Standardization, 226–227, 257, 278
SkipNet, 504, 505 Stateless interactions, 274
Sky Computing, 566 Static mode, 104
Skype, 483–484 Static priority, 104
Slack clusters, 68 Static replication, 425
SLEs10, 72 Stay scheme, 106
Sloan Digital Sky Survey, 567 Storage
Small-world graphs, 493 Azure, 384–387
Smart buildings, 594 cloud, 340
Smart power grids, 594, 595 distributed, 226
SMP. See Symmetric multiprocessor (SMP) local, 90
Snapshots, 103–104 provisioning of, 242–243, 242
SOA. See Service-oriented architecture (SOA) remote, 90
SOAP. See Simple Object Access Protocol (SOAP) stable, 90
SOAP messages, 280, 280 virtual, 171–172
SOAP-based Web services, 277–282 Storage area networks (SANs), 21, 22, 163
Social circles/clusters, 602 Storage Resource Broker (SRB), 309
Social engine, 610 Storage technology, 20
Social network analysis, 600–603 Storage virtualization, 171–172, 224
Social networks. See Online social networks (OSNs) Storm botnet, 497
Software Streaming multiprocessors (SMs), 18, 84
clusters and, 83 Streaming processors (SPs), 18, 19
for grid computing, 443–455 Striped data transfer, 427
licensing, 227 Structural cohesion, 602–603
support, 444–446 Structural equivalence, 602
virtualization, 221 Structural holes, 602
Software as a Service (SaaS), 35, 36, 49, 192, 200, 205–206, Structured overlay networks, 33, 490–491, 498, 498–501, 502–503
227, 229, 235, 343, 572, 574 hierarchical, 501–505
Software bugs, 226 network proximity in, 507–509
Software layer, for virtualization, 130 search algorithms, 524–528
Software scalability, 45 Subgrids, 314
Subject-push model, 465, 465 Tera-floating-point operations per second (Tflops/second), 485
SubmitJob() function, 358 TeraGrid, 30, 301, 303, 428, 430–431, 432
Subscription, 305 TeraGrid Science Gateway, 295
Sun Grid Engine (SGE), 440, 444, 446 TerraDir, 504, 505
Sun Niagara 2, 14 Tesla, 17
Sun One, 417 Tesla GPU, 18, 87
Super nodes, 502–503, 510 Tflops (tera floating-point operations per second), 45
Supercomputers, 5 Third-party applications, 244
See also Top 500 computer systems Throughput per watt, 8
case studies, 112–120 Tianhe-1A, 18, 73, 74–75, 78, 112–116
data centers and, 195 ARCH fat-tree interconnect, 113
vs. DISC systems, 567, 568 architecture, 112, 113
top five in 2010, 73, 74 hardware implementation, 112–114
virtual, 416, 419–422 Linpack Benchmark results, 115–116
Superscalar processor, 15 planned applications, 115–116
Supply chain management, 592–594, 593 power consumption, space, and cost, 115
Suspended state, 159 software stack, 115, 115
Sustained speed (Rmax), 73 Tiling, 105–106, 105
Swarm-based delivery methods, 532 Time, 45
Swift workflow system, 321–322, 322 Time sharing, 106
Switch-centric design, 210–211 Tit-for-tat (TFT), 526
Symmetric multiprocessor (SMP), 46, 144 Tivoli Provisioning Manager, 550
Symmetry, 88, 91 Tivoli Service Automation Manager (TSAM), 547
Synchronization, 344, 349 TLP. See Task-level parallelism (TLP)
Synchronous applications, 369 TModel, 305
Synergistic technologies, 577–579, 578 Top 500 computer systems, 5, 66
System availability, 45, 48, 49, 570–571 analysis of, 71–75
System defense technologies, 50 application share, 74
System directories, 90 ARCH fat-tree interconnect, 114
System efficiency, 47, 73, 570 architectural evolution, 71–72, 72
System of systems, 40 case studies, 112–120
System throughput, 45, 569–570 cluster architecture, 28
System-area interconnects, 21, 22 country share, 74
Systems of systems, 273–277 Cray XT5 Jaguar, 116–119
IBM Roadrunner, 119–120
T operating systems, 72
Table-lookup search, 506, 506–507 performance, 72–73
Tables, 341 power versus performance, 75, 75
Azure, 386–387 share of system interconnects in, 81–83, 81
finger, 499, 506 Tianhe-1A, 112
Tablet location hierarchy, 378–379, 378 top five in 2010, 73, 74–75
Target channel adapter (TCA), 82 Topic-based systems, 307
Task-level parallelism (TLP), 9, 14 Total execution time, 569
Taverna, 318, 319 Total failures, 97
Tcsh, 109 ToxTree service, 288–289
Technologies TPS (transactions per second), 45
convergence of, 26 Transaction servers, 533
fusion, 456–457 Transient failures, 96
hype cycle of new, 11, 12 Translation lookaside buffer (TLB), 148
scalability, 45–48 Transparency
Telnet command, 29 location, 88, 91
Template VMs, 157–158, 244–245 in programming environments, 42, 42
Transport-level security (TLS), 466–467, 469 Unstructured overlay networks, 33, 73, 509
TreadMarks, 67 URL Fetch operation, 372
Tree-based delivery methods, 532 User authentication, 373
Tree-structure systems, 498 User Datagram Protocol (UDP), 390
Trident, 337 User interface, 77, 93
Trojan horses, 251 User server, 107
Trust delegation, 469–470, 470 User-application level virtualization, 132–133
Trust delegation chain, 465 User-defined type (UDT), 390
Trust index (TI), 462–463 User-directed checkpointing, 103
Trust management Usernames, 468–469
clouds, 249–260 User-oriented reputation systems, 258
in P2P networks, 489, 514–523 User-space (US) protocol, 98
in virtualized data centers, 176–179 Utility computing, 7, 10–11, 26
Trust matrix, 515–516 UUHash, 502
Trust models, 461–463
fuzzy, 463 V
generalized model, 462, 463 Varuna database service, 288–289
PKI-based, 464–465 VAX Series, 4
reputation-based, 257–260, 463 VAXcluster, 67
Trust overlay networks (TONs), 259–260, 260, 517–519, 518 VCenter, 174
Trusted zones, 177–179, 178 VCompute, 175
Tsubame, 18, 75 VCUDA, 138–139
TSUBAME 2.0, 73, 74 VCuda, 139–140
Tuecke, Steve, 416 Vector machines, 9
Twister, 342, 343, 351–355, 353–355 Venus-C, 566
Twitter, 480, 597, 599, 611–613, 612 Vertical scalability, 570
Two-core CMP, 15 VIOLIN cluster, 165, 166, 167–169
Virtual appliances, 23, 398
U Virtual clusters
Ubiquitous computing, 546–561 application partitioning and, 157
clouds for, 546–561 cloud computing and, 194
definition of, 546 dynamic deployment of, 165–169
Internet of Things and, 576–590 fast deployment and effective scheduling, 158
wireless sensor networks and, 582 management of, 159
UDDI (Universal Description, Discovery, and Integration), 38, migration of memory, files, and network resources,
304–306, 305 162–165
UDDI Custody and Ownership Transfer API, 306 vs. physical clusters, 156, 156–159
UDDI Inquiry API, 306 resource management and, 155–169
UDDI Publisher API, 306 trusted zones for, 177–179
UDDI Replication API, 306 Virtual disk images (VDIs), 172
UDDI Security API, 306 Virtual execution environment (VE), 135
UDDI Subscription API, 306 Virtual grids, 419
UltraScan Biophysics Gateway, 295–296 Virtual hierarchies, 153–154
UNICORE, 444 Virtual infrastructures, 24
Unified namespace, 425–426 Virtual infrastructures (VI) managers, 172, 173, 199
Uniform Resource Identifiers (URIs), 273, 275 Virtual instruction set architecture (V-ISA), 132
Uninterruptible power supply (UPS), 25 Virtual LANs, 235
Universal Business Registry, 306 Virtual machine monitor (VMM), 23, 130, 131, 237, 244
Universal Description, Discovery, and Integration (UDDI), binary translation using, 142
278 comparison of, 134
UNIX semantics, 91 design requirements and providers, 133–135
Unplanned failures, 96 trust management and, 176–179
Virtual machine (VM) managers, 243, 244 structures/tools and mechanisms, 140–145
Virtual machines (VMs), 22–24, 130 VMM design requirements and providers, 133–135
architectures, 23 Virtualization layer, 130
bare-metal, 23 Virtualization middleware, 22–24
cloning, 224–225 Virtualization software, 221
cloud computing and, 193, 222 Virtualized data centers
deployment, 158 automation, 169–179
initialization of, 156 cloud OSes for, 172–176
management of, 243–246 server consolidation, 169–170
migration, 159–162, 160–161 storage management, 171–172
Parallel Virtual Machine (PVM), 43 trust management in, 176–179
performance effects, 159–162 Virtualizing integration managers, 223
primitive operations, 23–24, 24 Viruses, 49
provisioning of, 237–243 Visual MainWin, 138, 139
provisioning of, to virtual clusters, 156 Visual Studio, 386
states, 159 VLIW (very long instruction word) architectures, 9
storage, 171 VLSI microprocessors, 4
synthesis, 560 VM image management, 387, 388
template, 157–158, 244–245 VM Orchestrator, 199
Virtual memory, 130, 148 VMApp, 153
Virtual Network Computing (VNC), 298 VMDriver, 152
Virtual organizations (VOs), 30, 338, 416, 420–422, 421 VMNet, 153
Virtual platforms, 30 VMware ESX Server, 144–145, 145, 156
Virtual Private Cloud (VPC), 201–203, 202 VMware vSphere 4, 173, 174–176
Virtual private networks (VPNs), 28, 433 VMware Workstation, 146, 152–153, 174
Virtual Processor ID (VPID), 149 VNetwork, 175
Virtual reality (VR) applications, 13 Voice over IP (VoIP), 483
Virtual root directories, 137 VOTables, 288–289
Virtual storage, 158–159 VSphere 4, 173, 174–176, 199
management of, 171–172 VStorage, 175
Virtual supercomputers, 416, 419–420 VT-i technique, 146
Virtualization, 6 VT-x technique, 146
binary translation with full, 141–143
cloud architecture and, 221–225 W
cloud security and, 251 WABI. See Windows Application Binary Interface (WABI)
CPU, 147–148 Wal-Mart data warehouse, 567
data center, 25–26 Warps, 18
for data-center automation, 169–179 Watermarking, 256, 256–257, 531
hardware-assisted, 145, 148 Web 2.0 services, 5, 6, 26
hardware-level, 132–133 Web Application Description Language (WADL), 274
host-based, 141–143 Web pages, 242
for IaaS, 224 Web roles, 341–342, 384
implementation levels of, 130–140, 131, 133 Web Service Choreography Description Language (WS-CDL),
I/O, 150–153, 152 317
on Linux or Windows platforms, 137–138 Web Service Choreography Interface (WSCI), 317
memory, 148–150, 149–150 Web service core, 451
middleware support for, 138, 138–140 Web services, 277, 422
in multicore processors, 153–154, 154 application-specific, 301
at OS level, 135–138 definition of, 277
para-virtualization, 143–145 layered architecture for, 37–38, 37
in public clouds, 224 multitier architecture, 282–283
storage, 224 RESTful, 273–277