
Justin Y. Shi

The Cloud Book

Foreword

Preface

Contributors

Justin Y. Shi
Temple University
Philadelphia, Pennsylvania
List of Figures

1.1 Packet Switching Network . . . . . . . . . . . . . . . . . . . . 7


1.2 Enterprise Service Bus . . . . . . . . . . . . . . . . . . . . . . 8
1.3 ESB with Spatial Redundancy . . . . . . . . . . . . . . . . . 11
1.4 ESB with Re-Transmission API and Passive Redundancy . . 12
1.5 Conceptual Diagram of DB x . . . . . . . . . . . . . . . . . . 16
1.6 Automatic Re-transmission . . . . . . . . . . . . . . . . . . . 22
1.7 Replicated Partitioned Database (P =3, R=2) . . . . . . . . . 24
1.8 K-Order Shift Mirroring, K=P =4 . . . . . . . . . . . . . . . 25
1.9 Message-based “Bag of Tasks” Parallel Processing . . . . . . 27
1.10 Parallel Processing Using Tuple Space . . . . . . . . . . . . . 31
1.11 Stateless Parallel Processor . . . . . . . . . . . . . . . . . . . 32
1.12 Application Dependent CMSD Envelope . . . . . . . . . . . . 36
1.13 Parallel Performance Map of Matrix Multiplication . . . . . . 37
1.14 PML Tag Structure . . . . . . . . . . . . . . . . . . . . . . . 39
1.15 PML Marked Matrix Program . . . . . . . . . . . . . . . . . 41
1.16 PML Performance . . . . . . . . . . . . . . . . . . . . . . . . 43

Contents

I Fundamentals of Cloud Computing 1


1 Fundamentals of Cloud Application Architecture 3
Justin Y. Shi
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Necessary and Sufficient Conditions . . . . . . . . . . . . . . 5
1.2.1 Necessary Conditions . . . . . . . . . . . . . . . . . . 6
1.2.2 NCA and Sufficient Conditions . . . . . . . . . . . . . 7
1.3 Unit of Transmission (UT) . . . . . . . . . . . . . . . . . . . 9
1.4 Mission Critical Application Architecture: A First Example . 9
1.5 Maximally Survivable Transaction Processing . . . . . . . . . 13
1.5.1 Maximal Survivability and Performance Scalability . . 13
1.5.2 Transaction Processing Failure Model . . . . . . . . . 14
1.5.3 Parallel Synchronous Transaction Replication . . . . . 15
1.5.4 Transaction Trust Model . . . . . . . . . . . . . . . . 17
1.5.5 Non-Stop Resynchronization - 2PCr Protocol . . . . . 17
1.5.6 ACID Properties and RML . . . . . . . . . . . . . . . 18
1.5.7 Cluster Failure Model . . . . . . . . . . . . . . . . . . 20
1.5.8 Lossless Transaction Processing . . . . . . . . . . . . . 21
1.5.9 Cloud-Ready VLDB Application Development . . . . 23
1.5.10 Unlimited Performance Scalability . . . . . . . . . . . 23
1.6 Maximally Survivable High Performance Computing . . . . . 26
1.6.1 Protection for the “Bag of Tasks” . . . . . . . . . . . 26
1.6.2 Data Parallel Programming Using Tuple Space . . . . 30
1.6.3 Stateless Parallel Processing Machine . . . . . . . . . 32
1.6.4 Extreme Parallel Processing Efficiency . . . . . . . . . 34
1.6.5 Performance Scalability . . . . . . . . . . . . . . . . . 37
1.6.6 Automatic Parallel Program Generation – The Higher
Dimension . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.8 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 44

Bibliography 45

Part I

Fundamentals of Cloud Computing

1
Fundamentals of Cloud Application Architecture

Justin Y. Shi
Temple University
CONTENTS
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Necessary and Sufficient Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Necessary Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 NCA and Sufficient Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Unit of Transmission (UT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Mission Critical Application Architecture: A First Example . . . . . . . . . . . . . 9
1.5 Maximally Survivable Transaction Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.1 Maximal Survivability and Performance Scalability . . . . . . . . . . . . . . 13
1.5.2 Transaction Processing Failure Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.3 Parallel Synchronous Transaction Replication . . . . . . . . . . . . . . . . . . . . 15
1.5.4 Transaction Trust Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5.5 Non-Stop Resynchronization - 2PCr Protocol . . . . . . . . . . . . . . . . . . . . 17
1.5.6 ACID Properties and RML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5.7 Cluster Failure Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.5.8 Lossless Transaction Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5.9 Cloud-Ready VLDB Application Development . . . . . . . . . . . . . . . . . . . 21
1.5.10 Unlimited Performance Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.6 Maximally Survivable High Performance Computing . . . . . . . . . . . . . . . . . . . . 26
1.6.1 Protection for the “Bag of Tasks” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.6.2 Data Parallel Programming Using Tuple Space . . . . . . . . . . . . . . . . . . 30
1.6.3 Stateless Parallel Processing Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.6.4 Extreme Parallel Processing Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.6.5 Performance Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.6.6 Automatic Parallel Program Generation – The Higher Dimension 38
1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.8 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

1.1 Introduction
The economy of scale of cloud computing has ignited widespread imagination. With the promise of mighty computing power at unbelievable prices, all applications are poised to gain mission critical status. The fundamentals of cloud application architecture, however, remain surprisingly thin. For example, how can mission critical applications leverage cloud resources? Would non-mission critical applications gain mission critical status by moving to the cloud? This chapter attempts to fill in the fundamentals for mission critical networked computing applications.
The definition of "mission criticalness" has also evolved. Decades ago, few applications were considered "mission critical". They were mostly related to the military, nuclear power plants, financial institutions, aircraft controls, etc. Today, it is harder to find applications that are not mission critical. Evidently the application requirements grew faster than technologies could deliver. The availability of cloud computing simply adds fuel to the fire. In other words, cloud computing has not only offered a cost-effective way of computing, it has also forced the performance, information security and assurance issues to surface.
This chapter discusses the fundamentals of mission critical cloud applica-
tions. In particular, we focus on three objectives: maximal application sur-
vivability, unlimited application performance scalability and zero data losses.
These properties are desirable for all mission critical applications. The avail-
ability of cloud computing resources has inspired all applications to attain the
mission critical status.
Application survivability is a long standing practical (and research) is-
sue. Although processor virtualization has shortened service interruptions by
quicker restart of check-pointed virtual machines, the fundamental challenges
of non-stop service and maximal survivability remain the same.
Application performance scalability is another nagging issue. Each application needs to serve an increasingly large number of clients without degrading performance. The low costs of cloud services have reduced the budget barriers. However, except for a few "stateless" applications, such as web crawling, file sharing and mass mailing, most other applications face scalability challenges, regardless of the computing platform.
Perhaps the most puzzling issue is the data loss in networked computing applications. The traditional synchronous and asynchronous replication methods
have been proven incapable of delivering performance and survivability at the
same time. The puzzling fact is that we are losing data while using the lossless
TCP protocol. Why?
This chapter discusses the generic “DNA” sequence of the maximally sur-
vivable systems. For cloud ready mission critical applications, we advocate a
focus on the networked computing application architecture - a semantic net-
work architecture formed via application programming interface (API) and
the logical communication channels within. Unlike traditional computer ar-
chitecture and data communication network studies, our focus is a holistic,
application-centric networked computing application development methodol-
ogy. In particular, we focus on embedding the maximally survivable “DNA”
sequence in the networked application architectures.
This chapter is organized as follows: Section 1.2 covers the historical background of the "DNA" sequence of maximally survivable systems, the necessary and sufficient conditions for maximally survivable systems, and the introduction of Networked Computing Application Architectures (NCA²). Section 1.3 defines the Unit of Transmission (UT) for four basic NCAs: messaging, storage, computing and transaction processing. As the first example, Section 1.4 describes the detailed steps for embedding the maximally survivable sequence in a messaging system: the Enterprise Service Bus (ESB). Section 1.5 describes the steps required for embedding the maximally survivable sequence in a transaction processing system. Section 1.6 describes the design of a maximally scalable high performance computing system. Section 1.7 is the summary.

1.2 Necessary and Sufficient Conditions


Approximately fifty years ago, Mr. Paul Baran was given the task of designing
a survivable network for the United States Air Force. It was in the middle of
the Cold War. The objective was simple: build a dependable network that can
survive a nuclear attack.
The packet switching network concept was born. However, deeply invested
in analog technologies (circuit-switching networks), major research institu-
tions refused to embrace the concept. Telecom carriers refused to adopt the
technology.
The key arguments were "poor performance" and "unknown stability". Comparing packet switching to circuit switching networks, on the surface, the "store-and-forward" nature of the packet switching protocol did seem counterproductive for transmitting voice signals in real time.
Fast forward to today: a twenty-year-old would have a tough time identifying the existence of circuit switching networks. The Internet is a seamless "wrap" of packet switching circuits on top of circuit switching networks. Although there is still much to improve in the IP addressing scheme and service qualities, the two-tier Internet architecture seems destined to stay. And the TCP protocol remains a lossless protocol. In other words, the packet switching concept has stood the test of time.
The performance scalability benefit can only be realized after the packet
switching networks reach enough scale. Indeed, it took the world quite a few
years to change the “world-wide wait” to the world-wide web we enjoy today.
Although circuit switching networks still play a critical role in today's Internet, the lesson was that performance-centric architectures may not be sustainable at larger scales. The packet switching network was one of those counter-intuitive ideas.
Much research has been conducted on the packet switching protocols.
The results are captured in continually improving Internet services for higher speed, better reliability and lower costs. These improvements have brought
the power of networked computing to the masses.
Now, we are confronted with the nagging survivability issues for networked
computing applications (NCAs). Server virtualization, the key technology that
has made cloud computing possible, although cost-effective, merely shifts the
service locations. The technological challenges remain.
It seems that history is repeating itself. The same problems we solved
four decades ago for the telecommunication industry have come back to chal-
lenge us at a higher level for NCAs.
We ask: what is the generic “DNA” sequence in the packet switching net-
work that has made it possible to create the maximally survivable systems?
How can this sequence be “inherited” to the networked computer applications
to achieve the similar results? What are the necessary and sufficient conditions
for the maximally survivable systems?

1.2.1 Necessary Conditions


Any survivable system must rely on redundancy [6]. Theoretically there are
two types of redundancies: temporal (or re-transmission), and spatial (or repli-
cation). There has been no clear direction as to when to apply which type of redundancy and for what purposes.
“Store-and-forward”, or, “statistical multiplexing”, characterizes the
essence of packet switching architecture. Unlike circuit switching networks,
voice signals are transmitted via discrete packets by interconnected routers
and switches (Figure 1.1).
The “DNA” sequence of this “store-and-forward” architecture has four
components:

1. A well-defined unit of transmission (UT),


2. Transient UT storage,
3. Forward with once-only UT re-transmission logic, and
4. Passive (stateless) spatial UT redundancy.
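As a concrete illustration, the following toy sketch wires the four components together for a single unit of transmission. It is only a conceptual model, not part of the original text: the class name, the simulated loss rate and the acknowledgement behavior are all hypothetical.

import random

class StoreAndForwardNode:
    # Toy node showing the four "DNA" components for one unit of transmission (UT).

    def __init__(self, next_hops, loss_rate=0.3):
        self.next_hops = next_hops    # (4) passive spatial redundancy: alternate next hops
        self.loss_rate = loss_rate    # simulated per-attempt loss probability
        self.pending = {}             # (2) transient UT storage, keyed by UT id

    def accept(self, ut_id, payload):
        self.pending[ut_id] = payload            # (1) a well-defined UT, stored transiently
        return self.forward(ut_id)

    def forward(self, ut_id):
        payload = self.pending[ut_id]
        for hop in self.next_hops:               # try each redundant next hop in turn
            if self.send(hop, ut_id, payload):   # (3) forward with once-only re-transmission
                del self.pending[ut_id]          # clear transient storage after confirmation
                return True
        return False                             # redundancy exhausted: permanent error

    def send(self, hop, ut_id, payload):
        # Stand-in for a real datagram send plus acknowledgement wait.
        return random.random() > self.loss_rate

node = StoreAndForwardNode(next_hops=["router-a", "router-b", "router-c"])
print(node.accept("pkt-1", b"voice sample"))     # True unless every alternative fails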

To date, the “store-and-forward” network has been proven the most surviv-
able architecture. The Internet has inherited this “DNA” sequence by wrap-
ping packet switching circuits around circuit switching networks.
Looking deeper, the “store-and-forward” packet switching protocol is a
clever utilization of multiple redundancy types for the communication sys-
tem: with a given unit of transmission (a data packet), a packet switching
network supports transient spatial redundancy (store), temporal redundancy
(forward with re-transmission) and passive spatial redundancy (routers and
switches). These essential statistical multiplexing methods afford: a) the abil-
ity to mitigate massive component failures, b) the potential to leverage parallel processing for performance scalability; and c) provably least cost fault tolerance.

FIGURE 1.1
Packet Switching Network
The primary winning argument for the packet switching network was the provably least-cost and maximally survivable design. The unlimited performance scalability came as a welcome after-effect - a benefit that can only be delivered after the system has reached enough scale.
The maximal survivability feature allowed decades of continued innovation. After quite a few years, the "world-wide-wait" network eventually became the world-wide web that is poised to change the landscape of our lives.

1.2.2 NCA and Sufficient Conditions


Networked Computing Applications (NCAs) represent the vast majority of computer applications today. Invariably, all NCAs face performance scalability, service availability and data loss challenges. For example, scaling up a database server is a non-trivial challenge. Providing non-stop service is another nagging issue. The most puzzling is that none of the NCAs would promise lossless service, even though almost all are riding atop the lossless TCP (Transmission Control Protocol).
All cloud-computing applications are NCAs.
The common feature of all NCAs is the need for networking. Indeed, each
basic NCA has its own higher level communication protocol on top of TCP/IP,
such as messaging, transaction processing, online storage and parallel comput-
ing. It should not have been a surprise that the maximally survivable “DNA”
is probably needed for all NCAs.
FIGURE 1.2
Enterprise Service Bus

A key observation is the maximal entropy between communication protocol layers. This means that the benefits of the packet-level network are not automatic
for higher layers. Therefore, the necessary conditions for gaining the same
benefits at a higher level should be somewhat similar.
The sufficient conditions include all those required for the correct respec-
tive NCA processing. These are application dependent conditions.
Unlike the low-level data communications, the NCA semantic network typ-
ically carries complex application dependencies and must satisfy non-trivial
semantics. Low-level data packets can be transmitted and re-transmitted with-
out violating the semantics of data communication; it is not clear how to de-
fine the application dependent UTs that can be supported at architecture level
without breaking the application processing correctness. Poor “packetization”
could violate the crucial application dependencies (too little state information)
or incur excessive communication overheads (too much state information). It
becomes a non-trivial exercise to embed the generic "DNA" sequence in any NCA.
For example, the Enterprise Service Bus (ESB) is the core of Service Oriented Architectures (SOA). It consists of a set of message servers dispatching requests and replies between clients and an assortment of heterogeneous servers (Figure 1.2).
A sustainable cloud-ready ESB should meet the following (sufficient) re-
quirements:

1. Adding processing and communication components should increase overall performance.
2. Adding processing and communication components should increase both data and service availabilities.
3. Adding processing and communication components should decrease the probability of permanent failures.
4. There should be zero message loss at all times.


Simply moving an existing ESB to the cloud cannot meet all requirements. Since each message is typically partitioned into multiple packets, and each packet is routed individually, on the surface it seems that meeting all requirements can be cost prohibitive.
Ironically, the packet switching network has met all of the above requirements for data transmission for many years. What did we miss?

1.3 Unit of Transmission (UT)


The first step in embedding the “store-and-forward” sequence in a target NCA
is defining the unit of transmission (UT).
For example, we are interested in the following basic NCAs:

1. The UT of an Enterprise Service Bus is a message.
2. The UT of a transaction processing system is a transaction. For simplicity, we also consider a simple read-only query a transaction.
3. The UT of a mission critical high performance computing system is a computation task.
4. The UT of a mission critical storage system is a disk transaction. Similarly, we consider a storage retrieval query a disk transaction.
These UT definitions represent the four basic service layers of modern
NCAs: messaging, transaction, storage and computing. Once we understand
how to build these basic NCA architectures, we should be able to construct
robust higher level applications with full mission critical support down to the
physical medium.

1.4 Mission Critical Application Architecture: A First Example
As mentioned earlier, the basic NCA architectures include messaging, trans-
action, storage and computing. As the first example, this section describes the
construction steps for a mission critical ESB.
We have three operational objectives:

1. Maximal survivability.
2. Unlimited performance scalability.
3. Zero data losses.

Traditional replication technologies can only address one objective. It seems cost prohibitive, if not impossible, to meet all objectives using traditional methods.
For a mission critical messaging architecture, the UT is a message. The UT
of TCP is a packet. A message is typically decomposed into multiple packets.
Although the lossless feature of TCP guarantees the delivery of every packet,
the actual delivery of each message is only known to the application. This
is because each packet is routed individually. Since packets can be delivered
at different times, only the application could know if a message is delivered
consistently on the semantic network.
It is interesting to observe the absence of a re-transmission API in existing messaging systems, as if every message transmission must either succeed or fail.
There is a third state: timeout. It happens more often than we would like. The problem is the lack of direction as to what to do with timeouts. Printing an error message is the most common practice. Textbooks simply avoid this very subject. This is, however, the cause of information losses.
A timeout message is in an "unknown" state since we are not certain if the message has ever been delivered. Dealing with timeouts is actually a non-trivial challenge since not all messages can be safely re-transmitted: consider a timeout message that carried a missile firing command.
The irony is that because of the maximal entropy between protocol abstrac-
tion layers, only the application knows the exact semantics of each message.
Therefore, unlike the popular designs, the only possible layer for implementing
this non-trivial re-transmission logic is the application layer. The lower layers
simply do not have enough information. Optimization is possible leveraging
multiple lower level protocol layers. But methods like “group communication”
cannot address the fundamental survivability and information loss issues.
Having the correct re-transmission logic at the application layer is not
enough. Without transient storage and passive spatial redundancy supported
in the architecture, repeated transmissions will not deliver appreciable sur-
vivability improvements. For the messaging systems, it is necessary to install
passive redundant ESB hardware and allow it to be discovered at runtime (Figure
1.4). This is because each application (a messaging client) holds the transient
storage of its own message. The transient storage is automatically cleared
after the confirmation of the message delivery. Therefore, with the help of
any passive messaging hardware, this new infrastructure (Figure 1.4) ensures
maximal survivability (defined by the degree of hardware redundancy) and
there can be no message losses.
For comparison, without showing the service servers, Figure 1.3 shows a
conceptual diagram of a mission critical ESB system with synchronous and
asynchronous storage replication via fiber optic channels. Figure 1.4 shows a conceptual lossless ESB system with re-transmission API at all clients without the disk replication harness.

FIGURE 1.3
ESB with Spatial Redundancy
Although the differences are subtle, the benefits are non-trivial.
In Figure 1.3, the disk replication harness adds significant overheads to
each message transmission, either synchronously or asynchronously. In asyn-
chronous mode, message losses cannot be avoided and the ESB overall speed
must be throttled to avoid the replication queue overflow. In synchronous
mode, the two-phase-commit replication protocol imposes severe performance
penalty to each message transfer. Adding message queue managers increases
the replication overheads linearly. Amongst the three operating objectives,
only one can be attempted by any production system. This problem persists
even if Figure 1.3 is moved to a computing cloud.
In Figure 1.4, the disk replication harness is removed. There will be zero
replication overhead. This enables unlimited performance scalability: one can
add as many messaging servers as the application requires. There will be zero
message losses since all clients hold the transient storage of their own messages
until confirmed delivery. Further, the re-transmission logic will automatically
discover alternative message routes and recover lost messages. All we need to
supply is the passive ESB redundancy in different location(s). Passive ESB
system needs much less maintenance (similar to a network router) than active
systems. Properly configured, this system can automatically survive multiple
simultaneous component failures.
The critical observation in all of this is that the message stores are all transient. Replication of transient messages on permanent storage is absolutely unnecessary.
FIGURE 1.4
ESB with Re-Transmission API and Passive Redundancy

This scheme can be further improved if we enhance the re-transmission protocol to leverage the passive ESB hardware. For example, we can build an
interconnecting network of ESB servers where each server runs an enhanced
message store-and-forward protocol at each stage. The result is an extreme
scale ESB infrastructure that can actually meet all three operating objectives
simultaneously. Implementing the improved ESB in a computing cloud can
result in a provably optimal messaging architecture.

It is also worth mentioning that not all messages need confirmed delivery.
Non-mission critical messages can be sent without delivery confirmation. In
other words, there should be two kinds of messaging services in its API: lossless
with automatic re-transmission (like TCP), and others (like UDP).
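A minimal sketch of such a two-service messaging API is shown below, in Python. The class name, endpoint list and delivery call are hypothetical; the point is that the client itself holds the transient copy of each lossless message and re-transmits it over the passive redundant ESB endpoints until delivery is confirmed.

import uuid

class ESBClient:
    # Sketch: TCP-like (confirmed, re-transmitted) and UDP-like (fire-and-forget) sends.

    def __init__(self, esb_endpoints, max_attempts=5):
        self.endpoints = list(esb_endpoints)   # active plus passive redundant ESB servers
        self.max_attempts = max_attempts

    def send_lossless(self, body):
        # The client keeps the transient copy until delivery is confirmed, then clears it.
        msg_id = str(uuid.uuid4())
        transient = (msg_id, body)             # transient UT storage held by the client
        for attempt in range(self.max_attempts):
            endpoint = self.endpoints[attempt % len(self.endpoints)]  # discover alternatives
            if self._deliver(endpoint, transient):
                return msg_id                  # confirmed: the transient copy can be dropped
        raise RuntimeError("permanent error: all ESB redundancy exhausted for " + msg_id)

    def send_unreliable(self, body):
        # UDP-like service: no confirmation and no re-transmission.
        self._deliver(self.endpoints[0], (None, body))

    def _deliver(self, endpoint, message):
        # Stand-in for a real request/acknowledgement exchange with one ESB server.
        print("sending", message[0], "to", endpoint)
        return True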

Since none of the existing messaging systems was built as shown in Figure
1.4, these applications are really using UDP-like messaging without delivery
guarantees.

This first example illustrates the non-obviousness of introducing statistical multiplexing into the common messaging task. In this case, both the client and the messaging servers provide the transient UT storage. Therefore, having the correct embedding of the "DNA" sequence satisfies both the necessary and sufficient conditions for the maximally survivable messaging system.

This example also alludes to the potential difficulties and significant benefits of introducing statistical multiplexing to the other basic NCA architectures. The next sections describe the challenges and solutions for embedding the "store-and-forward" sequence in transaction processing and high performance computing systems.

1.5 Maximally Survivable Transaction Processing


If a message contains a transaction, would the above lossless messaging archi-
tecture deliver lossless transaction processing?
The answer is NO. This is because of the maximal entropy between the transaction processing and messaging protocol layers. The units of transmission are different.
A transaction processing system must meet the ACID (Atomicity, Consistency, Isolation and Durability) requirements. The messaging service layer
simply does not have enough information regarding transactions. As will
be shown in the following section, embedding transient storage and re-
transmission in a transaction processing system is much harder. It must en-
sure ACID properties, zero single-point of failures, transaction fault tolerance,
transaction integrity, non-stop service and unlimited performance scalability.
The results are equally non-trivial.
Another interesting case is the high performance computing architecture
where performance scalability and fault tolerance issues have long troubled
researchers and practitioners alike. Section 1.6 describes a possible DNA sequence embedding for multiprocessor architectures that can deliver maximally survivable applications with unlimited performance scalability.
A high performance cloud-ready lossless storage network is a simplified trans-
action processing system. The details are omitted for brevity.

1.5.1 Maximal Survivability and Performance Scalability


Transaction processing systems (databases) provide the fundamental layer for all electronic information processing applications. Databases are well known for their scalability limitations in performance and service/data availability. Embedding the maximally survivable DNA sequence could potentially eliminate all these scalability limitations.
This section describes the DB x (database extension gateway) project. It is the only commercial transaction processing architecture with statistical multiplexing known to the author at the time of this writing.
A DB x cluster uses statistical multiplexing to link multiple redundant database servers. Like the Internet, it does not have a performance scalability limitation. Therefore it is a candidate architecture for VLDB (Very Large-scale DataBase) systems.
DB x is a joint work of the author and his former students at Temple
University (www.temple.edu) and later the Parallel Computers Technology
Inc. (www.pcticorp.com). Section 1.5.2 describes the theoretical foundations
of clustering database servers for unlimited scalability in performance and
availability.

1.5.2 Transaction Processing Failure Model


We define a transaction as:

Definition 1 A collection of conditional data change requests that uniquely defines the resulting dataset(s) when the conditions are satisfied at runtime.

A committed transaction is a collection of data change requests with a confirmed positive result (transaction state). A committed transaction should result in a unique persistent dataset until modified by other transactions. All data changes in a transaction are assumed to be either all committed or all canceled. There should be no partial updates. All modern database engines today meet these requirements.
Theoretically, each transaction can be in any one of the following four
states:

1. A commit success is a transaction that is committed to the processing harness and the user has received a successful acknowledgement.
2. A commit failure is a transaction that has failed to commit in the processing harness and the user has received an error notification.
3. An unknown transaction is a timeout transaction whose status is not yet verified.
4. A lost transaction is a committed successful transaction whose presence cannot be confirmed afterwards.
A correct transaction processing system must meet the following:

• The maximum possible number of committed transactions,

• Zero unknown transactions, and

• Zero lost transactions.

The number of commit failures is a function of the quality of database/application design and implementation. Commit failures do not contribute to unknown and lost transactions.
Like the ESB project, using a single updatable database in the process-
ing harness, it makes little sense to re-transmit the timeout transaction. For
statistical multiplexing, spatial data redundancy is necessary. Without sta-
tistical multiplexing, database applications can only generate error messages
for unknown transactions. For high value transactions, such as for banks and
stock exchanges, the unknown transactions are manually repaired, if found.
For others, error messages are the only artifacts left behind.
The lack of statistical multiplexing also inhibits performance scalability
since the processing harness simply does not have the option of exploiting
alternative resources during peak moments (especially for read-only queries).

Existing transaction replication methods are designed only for protecting service uptimes [22][24]. Like in ESB, the synchronous mode uses the 2-Phase-
Commit (2PC) protocol that serializes all transactions for the purpose of
replication. Failures in any phase of the protocol require rolling back the entire
transaction. It has serious performance and downtime consequences. But it
ensures the replication of all transactions in the exact same order on multiple
targets (therefore there is no transaction loss but plenty of service time losses).
Asynchronous transaction replication uses either an explicit or implicit (log
scan) replication queue to ease the immediate performance degradation. The
replication queue affords a time buffer for the queue to be emptied. However,
replication is strictly serial since it must ensure the same commit order on the
replication target(s). According to Little's formula [21] for queued services, the primary database server must be throttled below the serial transaction replication speed to avoid overwhelming the replication queue and a potential service shutdown. In practice, a multicore processor with multi-spindle storage can easily outperform any high speed serial replication. The system must shut down and enter a recovery mode if the replication queue is persistently overwhelmed.
Transferring service from the primary to the secondary risks transaction losses.
Failures in the replication target can also force extended service shutdown if
the replication queue (and its manager) is not manually reconfigured quickly.
These deficiencies exist in all current transaction processing systems [4][3].
In this sense, databases are the weakest links in all existing IT infrastructures.
For banks, lost transactions can potentially make money when small ac-
count owners do not balance their accounts. For traders, a lost trade may never
be recovered. For service providers, millions of revenue can be lost annually
due to un-recorded usages. For mission critical applications, the consequences
can be serious if a critical transaction is forever lost. Identifying what was
actually lost is theoretically impossible. Many losses can cause irreparable
damages.
Due to the lack of transaction failure model support, large scale transaction
clouds can only exacerbate this problem.
As the importance and scale of transaction processing applications grow,
these limitations impose serious threats to the higher level social, economic
and political infrastructures. Large scale computing clouds only accelerate the
exposure of scalability and reliability issues.
It is worth mentioning that the same problems exist in storage systems
[17],[18], and [39]. Storage data updates are also transactions since the changes
must be persistent. They also require a single updatable persistent image.

1.5.3 Parallel Synchronous Transaction Replication


Currently, most people believe transaction replication is dangerous [19]. In
theory, synchronous transaction replication can eliminate transaction losses
at a cost. The strictly sequential, "all-or-nothing" 2PC protocol causes severe performance degradation at runtime. Since very few applications today can afford synchronous transaction replication, the vast majority of transaction processing systems today use "leaky solutions" powered by asynchronous replication and human intervention.

FIGURE 1.5
Conceptual Diagram of DB x
Careful examination of practical applications reveals that most concur-
rent transactions do not have update conflicts; serializing only the concurrent
update conflicts can alleviate much of the performance losses.
Further, it is also highly desirable in practice to continue a transaction
when only a subset of replication targets fail. In other words, dynamic serial-
ization and best effort fault tolerance can drastically enhance the quality of
transaction processing. Further, for read-only queries, there is no need to repli-
cate. Dynamically load balancing read-only queries can deliver more scalable
performance.
Synchronous replication with dynamic transaction serialization and dy-
namic load balancing requires a database communication protocol gateway
to capture all concurrent transactions in transit. We call this the database
extension (DB x ) gateway [35][27]. The new statistically multiplexed database
cluster is capable of parallel synchronous transaction replication where serial-
ization is optimized only for concurrent update conflicts. Figure 1.5 shows a
conceptual diagram of a DB x cluster.
In Figure 1.5, the UT is a transaction. DB x is the transient storage and
the database clients are assumed to contain automatic re-transmission logic.
Unlike ESB where all messages are transient in the messaging servers, pas-
sive ESB redundancy enables unlimited survivability; database transactions
leave persistent data changes that can only be protected by active spatial
redundancy. Combined with the automatic re-transmission logic, the statistically multiplexed spatial redundancy not only provides unlimited survivability but also the potential for performance scalability.
Similar to ESB, there are two kinds of transaction services: lossless with
confirmed delivery (TCP-like) and others (UDP-like).
However, to make a practically useful cloud capable VLDB system, we
must also address the following fundamental questions (sufficient conditions):

1. What is the transaction trust model?


2. How to deliver non-stop service when servers with very large
datasets become out-of-sync?
3. How to maintain database internals without shutting down service?
4. What is the cluster failure model? How to eliminate single-point-
failures?
5. How to ensure transaction ACID properties?
6. How to eliminate transaction losses?
7. How to deliver unlimited performance scalability?

1.5.4 Transaction Trust Model


From a database application’s perspective, each transaction must result in a
database state that is consistent with all data changes issued from the appli-
cation. If multiple databases reply differently for the same transaction, each
reply is valid since each individual transaction processing context is semanti-
cally identical to a single database environment. However, the data inconsis-
tency cannot be tolerated (caused by different processing orders on different
servers).
This means that in practice, we can simply designate any server as the "primary". Servers that reply with different results than the primary can be removed from service and resynchronized at a later time.
This trust model ensures a single consistent data view with the designated
“primary” server for all clients. Like a re-transmitted packet, Byzantine failure
makes little sense in this context. There is no need for a quorum.

1.5.5 Non-Stop Resynchronization - 2PCr Protocol


When one or more databases become out of sync for any reason with respect to
the primary server, deactivation is trivial. Bringing them back in service with
reconciled contents and without service downtime is a non-trivial challenge.
Instead of trying to reconcile the datasets by finding their differences,
the critical network position of DB x affords the execution of a “non-stop”
resynchronization algorithm [27]. The resynchronization algorithm is capable of reconciling multiple out-of-sync datasets (of arbitrary sizes) with at most 60 seconds of service downtime:

1. Assume that multiple database servers are synchronously replicated. In the same cluster, one or more database servers are deactivated for hardware/software errors or for scheduled maintenance. They become instantly out-of-sync.
2. Start a full backup of any synchronously replicated server, say S. S
continues to be updated while backing up.
3. Restore the backup set to the group of out-of-sync servers G in
parallel.
4. Obtain transaction differentials from the transaction log on S.
5. If the transaction differential is empty, enable all members of G.
They are all in sync.
6. Apply the transaction differentials. Wait T . Repeat (4).
7. If the loop does not terminate, pause the DB x gateway and disconnect all clients. Repeat (4).
The worst-case downtime comes from step (7) where 60 seconds can cover
multiple shorter scan intervals (typically 10 seconds each). The better cases
can have zero downtime if the DB x gateway could find a quiet network time.
Step (7) is necessary to guard against the inherent speed differences between
parallel transaction processing and the sequential nature of the backup and
restore processes.
The resynchronization and dynamic serialization algorithms can be viewed
as an optimistic two-phase-commit protocol (2PCr). 2PCr allows any server
to be taken out of the cluster and put back at a later time while providing
virtually non-stop service.
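A compact sketch of steps (2)-(7), in Python, may make the control flow clearer. The objects source, out_of_sync and gateway are hypothetical stand-ins for the synchronously replicated server S, the out-of-sync group G and the DB x gateway; the scan interval and round limit are illustrative only.

import time

def resynchronize(source, out_of_sync, gateway, scan_interval=10, max_rounds=5):
    # Sketch of the non-stop resynchronization (2PCr) steps described above.
    backup = source.full_backup()              # step 2: S keeps accepting updates meanwhile
    for server in out_of_sync:
        server.restore(backup)                 # step 3: restore to every member of G in parallel
    position = backup.end_position

    for _ in range(max_rounds):                # steps 4-6: chase the transaction log on S
        diff = source.log_differential(since=position)
        if not diff:
            break                              # step 5: caught up without pausing
        for server in out_of_sync:
            server.apply(diff)
        position = diff.end_position
        time.sleep(scan_interval)              # wait T, then repeat step 4
    else:
        gateway.pause()                        # step 7: short pause (worst case about 60 s)
        try:
            tail = source.log_differential(since=position)
            for server in out_of_sync:
                server.apply(tail)
        finally:
            gateway.resume()

    for server in out_of_sync:
        server.enable()                        # all members of G re-join the cluster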

1.5.6 ACID Properties and RML


Ensuring ACID properties for all transactions is probably the most difficult
task with multiple replicated databases.
For a single database, the Atomic property is ensured by the database engine using the transactional two-phase-commit (2PC) protocol - if any update in a transaction is not complete, the entire transaction rolls back.
For multiple databases in the DB x cluster, the Atomic property is also
ensured by the same protocol since the gateway simply replicates the seman-
tically identical sequence of commands transparently. The transaction trust
model guards against the potential inconsistencies. Dynamic serialization does
not affect the execution of the transactional 2PC protocol.
Transaction Durability is much enhanced in a DB x cluster since there are
multiple synchronously replicated datasets.

The Isolation and Consistency properties are inter-twined. In practice, application programmers often relax the isolation level in exchange for higher throughput (less locking and higher concurrency). ANSI/ISO defines the following isolation levels:

1. Serializable
2. Repeatable Read
3. Read Committed
4. Read Uncommitted (dirty read)

In [8], a snapshot isolation level is also defined. Serializable is the highest isolation level that ensures that all concurrent applications will see a consistent database at runtime. Others are progressively more relaxed to increase the degree of concurrency.
In a DB x cluster, dynamic load balancing will alter the isolation behav-
iors subtly. For example, unless explicitly directed using RML (Replication
Markup Language), dynamic load balancing will break the assumed isolation
behaviors, since the reading target may not be the same server as the one with the preceding updates. For many applications this can be tolerated, since the delay to value stabilization is small and the isolation levels are often relaxed anyway. Dynamic load balancing can be disabled if the application requires strict
adherence to isolation levels.
With a single database target, concurrency issues cause deadlocks. With transaction replication onto multiple servers, the chance of deadlocks is increased: multiple servers can lock different transactions, thus causing deadlocks and data inconsistencies.
These race conditions can be eliminated if conflicting concurrent updates
are globally synchronized. The discipline is identical to composing “thread
safe” programs. For “cloud safe” transaction applications, the programmer
should have the knowledge of concurrent update zones and should be able to
markup the concurrent updates with correct “cluster locks”.
The DB x gateway will use the named “cluster locks” for dynamic serial-
ization of subsequent operations.
Dynamic serialization with cluster locks ensures the delivery of the Con-
sistency property to all “cloud safe” transaction applications.
Like database locks, the “cluster locks” can have shared and exclusive lev-
els; and with different granularity - database, table and row. Unlike database
locks where the locks are typically acquired one at a time, a single cluster
lock can synchronize multiple objects. For example, for a transaction updat-
ing three rows, say r1, r2 and r3, it can create a single cluster lock named
r123. All other concurrent updates on r1, r2 or r3 can synchronize via the
same lock. In this way we eliminate all potential deadlocks.
A “cluster lock” is really a named “mutex” that can be associated with
a table, a row, a field value or an arbitrary symbol. Smaller lock sets can give better performance but increase the risk of deadlocks. The discipline of cluster lock optimization is exactly the same as for database locks.
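The idea can be sketched as a small named-lock registry. The sketch below is a single-process simplification in Python (shared/exclusive levels and cross-gateway coordination are omitted); the registry and helper names are illustrative, while the r123 example follows the discussion above.

import threading
from collections import defaultdict

class ClusterLockRegistry:
    # Named cluster locks (mutexes) for dynamic serialization, reduced to one process.

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def acquire(self, name):
        with self._guard:
            lock = self._locks[name]   # one named mutex can cover many objects
        lock.acquire()

    def release(self, name):
        self._locks[name].release()

registry = ClusterLockRegistry()

def update_r1_r2_r3(run_updates):
    # Every transaction touching r1, r2 or r3 serializes on the single lock "r123",
    # which removes the deadlocks that independent per-row locking could create.
    registry.acquire("r123")
    try:
        run_updates()
    finally:
        registry.release("r123")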
For practical purposes, we will also need two more tags to complete RML.
Here is the complete list:

1. Lock (name, level)/unlock (name). The Lock tag sets up a named mutex with lock level designation. The Unlock tag frees the named mutex. Proper use of this pair can prevent all potential deadlocks and race conditions.
2. LoadBalance on/off. There are three purposes. It can instruct DB x
gateway to load balance read-only stored procedures. It can also
force the gateway to load balance a complex query that the parser
fails to analyze. This tag can also be used to enforce transaction
isolation levels by turning off dynamic load balancing.
3. NoRep on/off. This tag suppresses the replication functionality.
This is necessary for cluster administration.
These tags are inserted in application programs as SQL comments. Correct use of RML ensures the Consistency property for all "cloud-safe" transaction processing applications, including applications using stored procedures, triggers and runtime-generated SQL dialects. Transaction isolation levels can be preserved at the expense of processing concurrency. RML also allows performance optimization and effective cloud administration, since not all administrative tasks need cloud-wide replication. The formal ACID property proof is inductive and is omitted here for brevity.
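The chapter does not fix a concrete comment syntax for RML, so the Python sketch below simply assumes tags written as SQL comments of the form -- RML: Lock(r123, exclusive); it only illustrates how a gateway could recognize the tags in transit.

import re

# Hypothetical comment syntax; the text only states that RML tags ride in SQL comments.
RML_TAG = re.compile(r"--\s*RML:\s*(Lock|Unlock|LoadBalance|NoRep)\s*\(([^)]*)\)", re.IGNORECASE)

def extract_rml_tags(sql_text):
    # Return (tag, arguments) pairs found in the SQL comments of one statement batch.
    return [(tag, [a.strip() for a in args.split(",") if a.strip()])
            for tag, args in RML_TAG.findall(sql_text)]

statement = """
-- RML: Lock(r123, exclusive)
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- RML: Unlock(r123)
"""
print(extract_rml_tags(statement))
# [('Lock', ['r123', 'exclusive']), ('Unlock', ['r123'])]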
In summary, this section has described the treatments for meeting ACID
properties using statistical database multiplexing. The basic challenge is that
the DB x gateway is not a true transient storage for complete transactions.
It is only capable of “seeing” one query at a time. Correct use of the lock-
ing tags (Section 1.5.9) and automatic transaction re-transmission (Section
1.5.8) can satisfy the ACID requirements and transient storage requirement
simultaneously. Like the ESB project, the database clients are responsible for
the transient transaction storages until their statuses are confirmed.

1.5.7 Cluster Failure Model


Since the DB x gateway does not maintain transaction states, like the ESB
messaging servers, DB x failures can be masked by simple physical replacement
of passive DB x gateways (IP-takeover or DNS takeover). For the same reason,
transaction re-transmits can automatically discover alternative resources and
can ensure unique execution of mission critical transactions. The very feature
also allows DB x gateways to upscale processing performance by horizontally
distributing transaction loads (active redundancy).
Failback from a repaired gateway is trickier. Non-stop service can be ac-
complished using a virtual IP and the following algorithm.

Let X be the virtual IP address that is known to all database clients, and let G be a group of two DB x gateways supporting the services provided via X. Each DB x gateway has a real IP address Gi.
The zero downtime failover and failback sequence is as follows:

1. Select a primary gateway by binding Gi with X. Set Gj to monitor and fail over to X.
2. When Gj detects Gi's failure, Gj takes over by binding Gj with X.
3. On failback, remove X from Gi's binding, and set Gi to monitor and fail over to X.
This simple algorithm ensures the elimination of all single-point-failures
in a DB x cluster and absolute zero downtime for multiple consecutive DB x
gateway failures.
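The monitoring loop below sketches the same failover/failback behavior in Python. The gateway handles and their is_bound/bind/is_alive calls are hypothetical; in production the binding would be an IP takeover or DNS takeover.

import time

def gateway_monitor(me, peer, virtual_ip, poll_seconds=1):
    # Failover/failback loop for one DB x gateway acting as the standby.
    while True:
        if me.is_bound(virtual_ip):
            pass                        # I currently hold the virtual IP: I am the primary.
        elif not peer.is_alive():
            me.bind(virtual_ip)         # Step 2: the peer holding X has failed; take over X.
        # Step 3 (failback): a repaired gateway runs this same loop as the standby and
        # re-acquires X only if the current primary fails later.
        time.sleep(poll_seconds)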

1.5.8 Lossless Transaction Processing


Like in ESB, we use the term "lossless transaction processing" analogously to the lossless packet switching protocol. Like TCP/IP, applications with automatic re-transmits represent a "best effort" lossless transaction processing protocol that exploits infrastructural redundancy for overall transaction processing reliability and performance gains. A permanent error occurs only after all component redundancies are exhausted.
The key to every transaction is the database updates. There are two types
of database updates:

1. Updates that cannot tolerate redundant commits. Additive updates, such as X = X + 100, are examples. These updates (or transactions) must have unique IDs for tracking to maintain the "commit once" property.
2. Updates that can tolerate redundant commits or are idempotent. For idempotent updates and simple inserts, tracking IDs may not be necessary.
For the lossless transaction processing protocol, the only difference between
(1) and (2) is the status verification of the last timed out transaction. Figure
1.6 shows the pseudo code for automatic transaction re-transmission.
It is worth mentioning that the probability of a permanent transaction error decreases exponentially as the number of redundant database servers increases. For example, if the probability of permanent failure is 10^-6 for a single database server, then the probability of permanent failure using P servers is (10^-6)^P = 10^(-6P).
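Spelled out as a one-line derivation, assuming the P redundant servers fail independently, each with the same permanent-failure probability p:

\[
  P_{\mathrm{permanent}} \;=\; p^{P},
  \qquad p = 10^{-6},\; P = 2 \;\Rightarrow\; P_{\mathrm{permanent}} = 10^{-12}.
\]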
If all transactions are protected by this re-transmission logic, like in ESB
(Figure 1.4), we can claim lossless transaction processing allowing arbitrary
failures of hardware, software and network components.

Count = 0;
Tid = GetTransID();              // Unique tracking ID for this transaction

Repeat:
Begin Tran:
    Update1;                     // Inserts Tid into sys$trans$table for tracking
    Update2;
    ...                          // Other operations
End Tran;

catch (timeout | TransactionException):
{
    if (!Exist(Tid))             // Status verification: the timed-out attempt
    {                            // did not commit, so it is safe to re-transmit.
        if ((Count++) > Max)
            FatalError("Fatal Error:", Tid, QueryBuffer);
        goto Repeat;             // Re-transmission discovers alternative servers.
    }
    // Exist(Tid) is true: the earlier attempt actually committed;
    // do not re-transmit a non-idempotent transaction.
}

int Exist(int Tid)
{
    sprintf(QueryBuffer,
            "select * from sys$trans$table where Tid = %d", Tid);
    SQL_exec(DBH, QueryBuffer);
    bind(ReturnBuffer, DBH);
    while (!Empty(ReturnBuffer)) // Scan returned rows for the tracking ID
    {
        if (Tid == ReturnBuffer.Tid)
            return (1);
        Next(ReturnBuffer);
    }
    return (0);
}

FIGURE 1.6
Automatic Re-transmission

1.5.9 Cloud-Ready VLDB Application Development


One of the last sufficient conditions for a maximally survivable transaction processing system is that the VLDB applications must satisfy the ACID properties. RML and automatic re-transmission are two important additions for making cloud-ready lossless transaction processing (OLTP) applications.
There are four steps:

1. Move all server-specific backend data inserts (such as GUID and Timestamp) to the client end. This eliminates data inconsistencies from the known sources.
2. Markup all concurrent update conflicts with cluster locks.
3. Markup all concurrent reads for acceptable isolation behavior.
4. Add automatic re-transmit logic: There should be a new transaction processing API, RCommit, where RCommit(0) bypasses the last-transaction status verification; otherwise it automatically adds a unique transaction ID and re-transmits the transaction until all alternatives are exhausted. In other words:

(a) RCommit(1) should be used for all updates that require tracking IDs (TCP-like).
(b) RCommit(0) should be used for all other updates without tracking IDs (UDP-like).

Note that transactions with idempotent updates can gain performance advantages by not using tracking IDs.
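A thin wrapper in Python suggests what the proposed RCommit API could look like; the callables run_transaction, verify_tid and new_tid are hypothetical hooks into the application's database client, and the retry budget is illustrative.

def rcommit(track, run_transaction, verify_tid, new_tid, max_attempts=3):
    # RCommit(1) -> track=True: tag the transaction with a unique ID and verify it after a
    #               timeout before re-transmitting (TCP-like).
    # RCommit(0) -> track=False: skip the status check; only safe for idempotent updates
    #               and simple inserts (UDP-like).
    tid = new_tid() if track else None
    for attempt in range(max_attempts):
        try:
            run_transaction(tid)          # the transaction body records tid when one is given
            return tid
        except TimeoutError:
            if track and verify_tid(tid): # the timed-out attempt actually committed
                return tid
            # otherwise fall through and re-transmit against any surviving server
    raise RuntimeError("permanent error: transaction could not be committed")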
Like in ESB, the cloud-ready VLDB applications will generate a permanent
error for each re-transmission that exceeds a pre-defined threshold (Max).
The permanent error log will contain the failed transaction details so it can
be manually executed at a later time. The probability of permanent error
decreases exponentially as the degree of spatial redundancy increases.
The cloud-ready VLDB applications can also enjoy accelerated asynchronous replication on remote targets by employing dynamic serialization on the far-flung replication targets. Although the queuing effects remain the same (throttling is still needed), the same buffer size will tolerate much heavier update loads compared to existing asynchronous replication schemes.

1.5.10 Unlimited Performance Scalability


There are three ways we can leverage multiple real time synchronized
databases for performance scalability:

1. Dynamic load balancing by distributing read-only queries (and stored procedures) amongst all database servers in the cluster. This benefit is delivered by compromising the default isolation behavior.

FIGURE 1.7
Replicated Partitioned Database (P =3, R=2)

2. Session-based load balancing by distributing "sticky" connections to the database servers in the cluster. Without any code changes, all OLAP applications are "cloud safe" applications in this mode.
3. Data partitioning with DB x replication.
Item (1) delivers localized performance benefits for OLTP applications
with relaxed isolation requirements. Item (2) enables plug-and-play perfor-
mance boost for all OLAP applications. Item (3) enables unlimited perfor-
mance scalability for OLTP applications.
These benefits were predicted for shared-nothing database clusters many years ago [37].
To see item (3), consider database partitioning without using DB x. To date, database partitioning has been used to meet performance demands at the expense of reduced overall system availability. The overall system reliability is adversely affected as the number of partitions increases; a higher number of partitions represents more potential failure points. With DB x, we can meet arbitrarily high performance and reliability demands by creating a large number of partitions (P) while maintaining a smaller degree of redundancy (R) (Figure 1.7).
Any heavily accessed data portion can be partitioned and distributed to
multiple servers. The growth of data tables can be regulated automatically
via DB x gateway [42]. The small degree of redundancy eliminates availabil-
ity concerns with added dynamic load balancing benefits. The entire sys-
tem can deliver virtually non-stop service allowing any number of subsets
to be repaired/patched/resynchronized. DB x gateway also conveniently affords database structural changes without shutting down service for extended periods.

FIGURE 1.8
K-Order Shift Mirroring, K = P = 4
For further acceleration of heavy SELECT queries, a k-order shift mirror-
ing (KSM) scheme can be deployed, where 1 < k ≤ P . The idea is to create a
RAID-like database cluster as shown in Figure 1.8.
KSM is suitable for extremely demanding OLAP applications with light
updates. Higher degree of redundancy can unleash the full parallel processing
potentials of all processors. KSM uses the same ideas as in the RAID storage
system to implement a parallel transaction processing architecture with full
parallel I/O support [27].
In Figure 1.8, "joining" two partitioned tables across four servers can be done in two steps:

1. Parallel partial joins. For example, for tables A and B, there are 16 partial join pairs [A0 ∗ B0], [A0 ∗ B1], [A0 ∗ B2], [A0 ∗ B3], [A1 ∗ B0], [A1 ∗ B1], ..., [A3 ∗ B3]. All 16 pairs can be executed in parallel on the four servers.

2. Merge and post-processing.

Linear speedups can be expected.
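The enumeration of partial joins is mechanical, as the Python sketch below shows; the round-robin placement and the table/partition naming are illustrative only.

from itertools import product

def ksm_partial_joins(num_partitions=4, num_servers=4):
    # Enumerate the P*P partial joins of two partitioned tables A and B and
    # assign them round-robin to the servers (step 1 of the KSM join).
    pairs = list(product(range(num_partitions), repeat=2))     # (Ai, Bj) pairs, 16 when P = 4
    plan = {s: [] for s in range(num_servers)}
    for k, (i, j) in enumerate(pairs):
        plan[k % num_servers].append(("A%d" % i, "B%d" % j))   # four partial joins per server
    return plan

for server, joins in ksm_partial_joins().items():
    print("server", server, "executes", joins)
# Step 2 (merge and post-processing) unions the partial results on the coordinator.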



For extreme scale OLTP applications, since the degree R of synchronous transaction replication can be as small as 2 and P is unrestricted, DB x clusters can deliver unlimited performance scalability.

1.6 Maximally Survivable High Performance Computing


Scientific computing is another basic NCA that can benefit from this statistical multiplexing magic. Unlike the transaction processing system, where a lost transaction may not be recoverable, every lost computation task can always be re-computed from identical inputs to deliver semantically identical results, even for non-deterministic applications. Therefore, if every HPC (High Performance Computing) application can be decomposed into stages of “bags of tasks” (Figure 1.9), cheap fault tolerance and non-stop HPC are potentially possible. In theory, unlimited performance scalability should also be within reach.
This section describes the high level concepts of the Synergy Project [28]
at Temple University. The Synergy Project was designed to address:
1. Maximally survivable HPC applications (Sections 1.6.1-1.6.3).
2. Unlimited performance scalability with heterogeneous processor
types including GPGPU, DSP, single and multicore processors (Sec-
tions 1.6.5-1.6.6).
Section 1.6.4 introduces a practical parallel performance modeling method.
A maximally survivable HPC application does not have the information loss
issue. It has, however, a higher dimension scalability issue that cannot be ad-
dressed by performance scalability and application survivability alone: parallel
programmability (Section 1.6.6).
Each HPC program requires a combination of skills in processing and communication hardware, domain knowledge and computer programming. While job security is granted to the few talented individuals, the productivity is notoriously low. Section 1.6.6 reports an attempt to address this high-order scalability issue using a Parallel Markup Language (PML).

1.6.1 Protection for the “Bag of Tasks”


Each HPC application contains two types of programs: a) sequential (not par-
allelizable); and b) parallelizable. There are three types of parallelism [13]:
SIMD (Single Instruction Multiple Data), MIMD (Multiple Instruction Mul-
tiple Data) and pipeline.
In the literature, MIMD is further divided into SPMD (Single Program Mul-
tiple Data) and MPMD (Multiple Program Multiple Data). Due to the topo-
logical similarity, we consider SIMD and SPMD synonymous.

FIGURE 1.9
Message-based “Bag of Tasks” Parallel Processing

By far, SPMD/SIMD is the simplest and the most dominant parallelism type, delivering high performance using massively many identical processing elements. In legacy parallel programming environments, such as message passing and shared memory, exploiting MPMD and pipeline parallelisms requires inventive programming. For data parallel processing, these parallelism types are automatically exploited [12] [23].
Each SPMD component can be implemented as a “bag of tasks”. Each “bag of tasks” group consists of two kinds of programs: a master that distributes working assignments and collects the results; and replicated workers that continuously work on available assignments. The natural dataflow between the master and workers automatically exploits MPMD and pipeline parallelisms (Figure 1.9).
For fault tolerance, it is apparent that master failures can only be protected by check-pointing. The failure of workers, however, may be recovered by re-issuing the lost working assignments.
Specifically, master failures may only be recovered via either global or local check-pointing. Global check-pointing requires freezing and saving the state of the entire computation (often with tens of thousands of processing elements), thus it can be very costly. Local check-pointing has a much smaller “footprint” but suffers from the difficulty of consistent global state re-construction [11]. It is possible, however, to execute synchronized local check-points and recover a consistent global state with the help of a distributed synchronized termination algorithm [40].
In order to understand the potential performance benefits of protecting the SPMD/SIMD components (workers), for simplicity, the following discussion assumes the same overhead for global and local master check-pointing.
We define the expected computing time with failure using the following
model [15]:

t0 : Interval of application-wide check-pointing
α : Density function of any processing element failure
K0 : Time needed to create a check-point
K1 : Time needed to read and recover a check-point
T : Time needed to run the application without check-points

Further, we define

α1 : Density function of critical (non-worker) element failure
α2 : Density function of non-critical (worker) element failure

Thus, α = α1 + α2 .

Assuming failure occurs only once per check-point interval and all failures are
independent, then the expected running time E per check-point interval with
any processing element failure is

E = (1 − αt0 )(K0 + t0 ) + αt0 (K0 + t0 + K1 + t0 /2) (1.1)


The expected running time per check-point interval with worker failure toler-
ance will be:

E ′ = (1 − αt0 )(K0 + t0 )+ α1 t0 (K0 + t0 + K1 + t0 /2)+ α2 t0 (K0 + t0 + X), (1.2)

where X is the recovery time for the lost worker work. Since α2 = α − α1 , we can compute the difference E − E ′ as follows:

E − E ′ = (α − α1 )t0 (K0 + t0 + K1 + t0 /2) − α2 t0 (K0 + t0 + X)
        = α2 t0 (K0 + t0 + K1 + t0 /2 − K0 − t0 − X)
        = α2 t0 (K1 + t0 /2 − X)

In other words, the savings equal the product of the probability of partial (worker) failure and the sum of the check-point reading time and the expected lost master time, offset by the lost worker recovery time. Since the number of workers is typically very large, the savings are substantial.
The total expected application running time ET without worker fault tolerance is:

E_T = \frac{T}{t_0}\left(K_0 + t_0 + \alpha\Big(t_0 K_1 + \frac{t_0^2}{2}\Big)\right) \qquad (1.3)
We can compute the optimal check-point interval as:

\frac{dE_T}{dt_0} = T\Big(-\frac{K_0}{t_0^2} + \frac{\alpha}{2}\Big) = 0
\;\Rightarrow\; \frac{K_0}{t_0^2} = \frac{\alpha}{2}
\;\Rightarrow\; t_0 = \sqrt{\frac{2K_0}{\alpha}} \qquad (1.4)
The total application running time E′T with worker fault tolerance is:

E_T' = \frac{T}{t_0}\Big(K_0 + t_0 + \alpha t_0 K_1 + \frac{\alpha t_0^2}{2} - \alpha_2 t_0 K_1 - \frac{\alpha_2 t_0^2}{2} + \alpha_2 t_0 X\Big)
     = T\Big(1 + \frac{K_0}{t_0} + \alpha K_1 + \frac{\alpha t_0}{2} - \alpha_2 K_1 - \frac{\alpha_2 t_0}{2} + \alpha_2 X\Big) \qquad (1.5)
The optimal check-point interval with worker fault tolerance is:

\frac{dE_T'}{dt_0} = T\Big(-\frac{K_0}{t_0^2} + \frac{\alpha - \alpha_2}{2}\Big) = 0
\;\Rightarrow\; t_0 = \sqrt{\frac{2K_0}{\alpha - \alpha_2}} \qquad (1.6)
For example, if we set the check-point interval t0 = 60 minutes, the check-
point reading and writing time K0 = K1 = 10 minutes, and the average worker
failure delay X = 30 sec = 0.5 minute, the expected savings per check-point
under any single worker failure is about 39.5 minutes.

E − E ′ = (α − α1 )t0 (K0 + t0 + K1 + t0 /2) − α2 t0 (K0 + t0 + X)
        = α2 t0 (K0 + t0 + K1 + t0 /2 − K0 − t0 − X)
        = α2 t0 (K1 + t0 /2 − X)    (with α2 t0 = 1 for any single worker failure)
        = 10 + 30 − 0.5
        = 39.5 minutes.

Furthermore, if the mean time between failures (MTBF) is 3 hours (180 minutes) in a system of 1024 processors, then α · 180 = 1, or α = 1/180. Thus, α1 = 1/(180P ) = 1/184,320. The optimal check-point interval (in minutes) is:

t_0 = \sqrt{\frac{2K_0}{\alpha - \alpha_2}} = \sqrt{2 \cdot 10 \cdot 184{,}320} = 1{,}920 \qquad (1.7)

This means that it is not necessary to check-point the masters unless the application running time T is greater than 32 hours.
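The arithmetic above can be reproduced with a small calculator program. The following C sketch uses the example parameters assumed in the text (K0 = K1 = 10 minutes, X = 0.5 minutes, 1024 processors, a 3-hour MTBF, and the conditioning α2 t0 = 1 for a single worker failure); it is an illustration, not part of the Synergy runtime.

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Parameters from the worked example, in minutes. */
    double K0 = 10.0;      /* check-point creation time              */
    double K1 = 10.0;      /* check-point read/recovery time         */
    double X  = 0.5;       /* average worker recovery delay          */
    double t0 = 60.0;      /* chosen check-point interval            */
    double P  = 1024.0;    /* number of processing elements          */
    double mtbf = 180.0;   /* system-wide mean time between failures */

    double alpha  = 1.0 / mtbf;     /* any-element failure density       */
    double alpha1 = alpha / P;      /* critical (master) failure density */
    double alpha2 = alpha - alpha1; /* worker failure density            */

    /* Savings per interval, conditioned on one worker failure (alpha2*t0 = 1). */
    printf("savings per worker failure: %.1f minutes\n", K1 + t0 / 2.0 - X);

    /* Optimal check-point intervals from (1.4) and (1.6). */
    printf("optimal t0 without worker FT: %.0f minutes\n", sqrt(2.0 * K0 / alpha));
    printf("optimal t0 with worker FT:    %.0f minutes\n",
           sqrt(2.0 * K0 / (alpha - alpha2)));
    return 0;
}

Running it reproduces the 39.5-minute saving and the 60-minute and 1,920-minute optimal intervals derived above.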
Implementing the “worker fault tolerance” scheme requires:

1. All workers must be free from global side-effects. Otherwise the correctness of the parallel programs may not be maintained.
2. Dynamic sender-receiver binding at runtime. Otherwise worker failure recovery cannot be done.

Running side-effect-free parallel programs also has the benefit of reduced check-point sizes.

1.6.2 Data Parallel Programming Using Tuple Space


Amongst all existing HPC programming paradigms, such as message passing (MPI [14]), shared memory (OpenMP [10]) and Tuple Space (Linda [9] and Synergy [28]), only the Tuple Space paradigm seems to be a natural fit (Figure 1.10).
A Tuple Space supports three operators:

1. Put(TupleName, value);
2. Read(NamePattern, &buffer);
3. Get(NamePattern, &buffer);
The Read operator fetches the value from a matching tuple. The Get operator extracts the value from a matching tuple and destroys the tuple. Both operators are “blocking” in that they suspend the calling program indefinitely until a matching tuple is found. Tuple name matching allows the same tuple to be fetched by different programs, which enables a “shadow tuple” implementation (transient tuple storage).

FIGURE 1.10
Parallel Processing Using Tuple Space

Specifically, under the Tuple Space parallel processing paradigm, the mas-
ter simply puts working assignments into a tuple space. The massively many
workers simply repeat the same loop: retrieve the working assignments, com-
pute and return the results. The Tuple Space runtime system can easily
support automatic worker failure detection and recovery by implementing
“shadow tuples” - tuples masked invisible after being retrieved via the “get”
operator, but recovered if the corresponding worker crashes, and destroyed
only after the corresponding result is delivered [28].
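To make this master/worker dataflow concrete, the following self-contained C sketch mimics the Put and Get operators over a tiny in-memory array standing in for the tuple space (Read is omitted for brevity). The names follow the operator list above, but the bodies are toy stand-ins: a real runtime such as Synergy provides distributed, blocking implementations with shadow-tuple recovery, and its actual API differs.

#include <stdio.h>
#include <string.h>

#define SPACE  64
#define NTASKS 8

/* A toy, single-process "tuple space": name/value pairs in a flat array. */
typedef struct { char name[32]; double value; int used; } Tuple;
static Tuple space[SPACE];

static void Put(const char *name, double value) {
    for (int i = 0; i < SPACE; i++)
        if (!space[i].used) {
            strncpy(space[i].name, name, sizeof(space[i].name) - 1);
            space[i].value = value;
            space[i].used  = 1;
            return;
        }
}

/* Get: destructive fetch of the first tuple whose name matches the prefix.
   A real runtime would block until a match appears and keep a shadow copy
   until the corresponding result is delivered. */
static int Get(const char *pattern, double *buffer) {
    for (int i = 0; i < SPACE; i++)
        if (space[i].used && strncmp(space[i].name, pattern, strlen(pattern)) == 0) {
            *buffer = space[i].value;
            space[i].used = 0;
            return 1;
        }
    return 0;
}

static void master(void) {            /* scatter working assignments       */
    char name[32];
    for (int t = 0; t < NTASKS; t++) {
        sprintf(name, "task%d", t);
        Put(name, (double)t);
    }
}

static void worker(void) {            /* fetch, compute, return            */
    double x;
    char name[32];
    while (Get("task", &x)) {
        sprintf(name, "result%d", (int)x);
        Put(name, x * x);
    }
}

int main(void) {
    double r;
    master();
    worker();                         /* in reality, many replicated workers */
    while (Get("result", &r))         /* master collects the results        */
        printf("result: %g\n", r);
    return 0;
}

The point of the sketch is the structure: the master never names a worker, and any worker crash can be masked simply by re-inserting the corresponding assignment tuple.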
The problem was that the Tuple Space parallel programming paradigm requires indirect communication and implicit parallel programming. These ideas seemed counter-productive against the performance-centric designs of HPC systems, where programmers are used to manipulating program structures to match hardware features. To pursue higher performance, some earlier Tuple Space efforts even tried to manipulate the compiler's code generator [9] [2].
For mitigating massive component failures and for attaining unlimited scalability, the transient tuple storage is not only necessary but also desirable due to statistical multiplexing needs. This argument can only be persuasive if a physical Tuple Space architecture could demonstrate the feasibility.
As discussed in Section 1.3, for this type of NCA service, the UT (Unit
of Transmission) is a working assignment. The worker’s work-seeking code is
naturally amenable to the re-try logic with ”shadow tuple” support from the
potential Tuple Space architecture. The theoretical framework is identical to
data parallel computing ([12][5] and [2]). The Tuple Space runtime (centralized
or distributed) implementations could support the tuple transient storage with
spatial hardware redundancy.
Theoretically, these treatments transform the seemingly insurmountable
HPC performance, scalability and reliability issues into solvable communica-
tion problems.

FIGURE 1.11
Stateless Parallel Processor

1.6.3 Stateless Parallel Processing Machine


The failure of any path in the interconnection network of a multiprocessor will cause the HPC application to halt. Therefore the interconnection network is the single point-of-failure in all HPC systems. The problem was caused by the lack of statistical multiplexing on the semantic network. Without a properly defined UT (Unit of Transmission), interconnection network virtualization would only enhance data transmission with little help to the HPC application's scalability.
Further, explicit parallel programming paradigms, such as message passing (MPI) and shared memory (OpenMP), rely on direct low-level resource binding to deliver performance. Semantic network statistical multiplexing is not possible.
These observations inspired the design of a Stateless Parallel Processor (SPP) [29].

1. Architecture: The maximally survivable HPC architecture will feature a UT-defined, statistically multiplexed data parallel interconnection network. It is matched with a UT-defined high-level data parallel API and a distributed transient UT storage implementation with passive spatial redundancy. The conceptual diagram of SPP is depicted in Figure 1.11 [29].
In Figure 1.11, SW represents the passive spatial redundancy using
multiple redundant interconnection networks. The number of inter-
connection networks must match the number of interfaces on each
compute node.
UVR stands for Unidirectional Virtual Ring – a virtualized interconnection network responsible for transporting application data and system administration requests. The dotted ring represents the virtual paths that data requests travel. The solid lines represent the physical data travel paths. Further, not shown in Figure 1.11, the HPC application programming interface (API) supports the three Tuple Space operators. Together, these form a higher-level interconnection network responsible for the robust UT operations.
In this design, since the application API and the supporting architecture are both part of the statistical multiplexing for the defined UT, there is no single point-of-failure for any application's workers. The masters are the only application-level single points of failure. Masters can be protected by check-pointing and recovery (CPR).
In other words, failures of computing or communication components can be masked by transient storage (for workers) and check-points (for masters). The earlier discussion has made it clear that the master check-point overhead can be reduced drastically if worker fault tolerance is supported.
The only scalability concern is the round-trip overhead of the UVR. Assuming multiple redundant physical switched networks, the worst-case UVR round-trip latency will be on the order of log_k(P) [4], where P is the maximal number of computing nodes per application and k is the fan-out factor (degree of parallel communication). For example, for an application running on a million nodes with k = 2, at most 20 hops are required for all participating nodes to communicate their data needs. Actual data transfers will proceed in parallel using the multiple redundant interconnection networks [30][29].
Only one UVR is required for each HPC application in order to maintain
a single consistent system image. Therefore, there can be multiple
UVRs for multiple applications. There is no limit on the number of
computing nodes. Like the Internet, theoretically there is also no
limit on the number of networks to be included, even though each
node does have a physical limit on the number of installed network
interfaces.
2. Implementation: A Data Routing Agent (DRA) implements the
UVR using the passive spatially redundant interconnection net-
works. DRA will also support the UT-based communications and
synchronization (Tuple Space semantics) [30] [29].
Unlike other tuple space implementations [9] [7], DRA is a dis-
tributed tuple space mechanism with automatic failure detection
and recovery capabilities. The DRA has a T -shaped communica-
tion pattern (see “Compute Node” in Figure 1.11). It is responsible
for three tasks:
(a) Local Tuple Space operations: tuple storage, matching and program activation; tuple recovery upon notification of the retriever's failure; program control and monitoring; program check-pointing and re-start.
(b) Global Tuple Space services: data request service, provisioning and fast parallel forwarding. It must account for potential downstream node and link failures.
(c) Administrative services: user account management, application activation, control and monitoring.
3. Automated Master Protection: Even though the masters are not natively protected by the SPP architecture, the worker fault tolerance feature reduces master check-point sizes. Unlike application-level check-pointing, where the programmer is responsible for finding the most economical locations for saving global state, it is possible to implement a non-blocking system-level check-point-restart protocol automatically with little overhead. The technical challenge of migrating communication stacks between heterogeneous compute nodes can be handled by a “precision surgery” of the TCP/IP kernel stack [43]. The theoretical difficulty [34] [20] of reconstructing a consistent global state from multiple independently check-pointed masters can be resolved using a synchronized distributed termination algorithm [40].
Under the SPP paradigm, the “put”, “read” and “get” operations block and unblock requesting programs to automatically form application-dependent SPMD, MPMD and pipeline clusters at runtime.
The three Tuple Space operations allow application programs to express arbitrary communication patterns, including broadcasts, neighbor-to-neighbor exchanges and one-to-one conversations.
Each HPC application builds its own semantic network at runtime. Runtime statistical multiplexing allows automatic protection of massive numbers of workers. SPP applications can deliver virtually non-stop service regardless of multiple computing and communication component failures.

1.6.4 Extreme Parallel Processing Efficiency


Unlike explicit parallel programming paradigms where overlapping computa-
tion and communication requires inventive programming, finding the optimal
processing grain size in a data parallel application can automatically maximize
concurrency. Processors form dynamic SPMD, MPMD and pipeline clusters
automatically for a data parallel application. Optimal granularity ensures zero
synchronization overhead, thus maximal overlapping of computing and com-
munication.
Optimal granularity can be found using a coarse-to-fine linear search
method. There are two steps: a) find the most promising loop depth; and
b) find the optimal grouping factor. The first step is labor intensive since each

change in partition loop depth requires re-programming. This is the very rea-
son that has prevented practical HPC applications from being optimized. Once
the loop depth is found, the optimal grouping factor can be calculated statically or heuristically at runtime [25][16].
Since deeper loop decomposition (finer processing grain) risks higher com-
munication overheads, maximizing the degree of parallelism can adversely im-
pact performance. The optimal loop depth gives the best possibility to deliver
the optimal parallel performance.
Finding the optimal loop depth can be done experimentally via a back-of-the-envelope calculation or aided by an analytical tool we call timing models [31] [33].
The timing model for a compute-intensive loop is an equation containing estimation models for the major time-consuming elements, such as computing, communication, synchronization and disk I/O. Setting the synchronization time to zero makes it possible to find the performance upper bound in which all parallel tasks complete in time. In reality, the synchronization time may be negative when computing overlaps with communication.
For example, let

T_{seq}(n) = \frac{c\,f(n)}{\omega'(n)} \qquad (1.8)

be the running time of a program of input size n, where f (n) is the time
complexity, c > 0 captures the efficiency losses (instruction to algorithmic
step ratio) of the programmer, compiler and operating system scheduler, and
ω ′ (n) is the processor speed measured in instructions processed per second.
Since both c and ω ′ (n) are hard to obtain in practice, we introduce

\omega(n) = \frac{\omega'(n)}{c} \qquad (1.9)

measured in algorithmic steps per second. Thus ω(n) can be obtained from program instrumentation:

\omega(n) = \frac{f(n)}{T_{seq}(n)} \qquad (1.10)

ω(n) is typically smaller than the manufacturer's peak performance claim. ω(n) will also vary with the problem size (n) due to memory handling overheads. A typical memory-intensive application will exhibit ω(n) characteristics (in MOPS = Millions of Operations Per Second) as shown in Figure 1.12.
Depending upon the partitioned problem sizes (n), Figure 1.12 gives the
entire spectrum of ω(n). This is useful in speedup calculations.

FIGURE 1.12
Application Dependent CMSD Envelope
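As a concrete illustration of obtaining ω(n) via instrumentation, the following C sketch times one sequential matrix multiplication run and reports f(n)/T_seq(n) in MOPS. The problem size and the use of clock() are illustrative assumptions, not the instrumentation harness used to produce Figure 1.12.

#include <stdio.h>
#include <time.h>

#define N 512   /* illustrative problem size for the calibration run */
static double a[N][N], b[N][N], c[N][N];

int main(void) {
    /* Instrument one sequential run to estimate omega(n) = f(n)/Tseq(n),
       with f(n) = n^3 algorithmic steps for dense matrix multiplication. */
    clock_t start = clock();
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
    double tseq = (double)(clock() - start) / CLOCKS_PER_SEC;

    double fn = (double)N * N * N;      /* algorithmic steps              */
    double omega = fn / tseq / 1e6;     /* MOPS, as plotted in Figure 1.12 */
    printf("n=%d  Tseq=%.3f s  omega(n)=%.1f MOPS\n", N, tseq, omega);
    return 0;
}

Repeating the run over a range of N traces out the ω(n) envelope of Figure 1.12 for a given machine.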

Ignoring the disk I/O and synchronization times, a parallel application's timing model contains the following:

T_{par}(n, p) = T_{comp}(n, p) + T_{comm}(n, p). \qquad (1.11)

Let µ be the network capability in bytes per second; a parallel matrix multiplication timing model is:

T_{par}(n, p) = \frac{n^3}{p\,\omega(n)} + \frac{\delta n^2 (p + 1)}{\mu}; \qquad (1.12)

where n = problem size, p = number of processors, µ = the application-dependent network capability in bytes per second, and δ = matrix cell size in bytes. This model represents a row or column partitioning strategy but not both (tiling).
The sequential timing model is:

T_{seq}(n) = \frac{n^3}{\omega(n)}. \qquad (1.13)

The speedup is:

S_p = \frac{T_{seq}(n)}{T_{par}(n, p)} = \frac{p}{1 + \dfrac{\delta\,\omega(n)\,(p^2 + p)}{n\,\mu}} \qquad (1.14)

Given δ = 8 bytes (double precision), ω(n) = 300 MOPS and µ = 120 MBPS, Figure 1.13 shows the performance map of parallel matrix multiplication for a small number of processors and problem sizes.

FIGURE 1.13
Parallel Performance Map of Matrix Multiplication
The timing model helps to determine the optimal partition depth. For example, if n = 10,000, for the matrix multiplication program in a processing environment characterized by ω(n) = 300 MOPS and µ = 120 MBPS, it is not a good idea to pursue anything deeper than the top-level parallelization since the speedup will not be greater than 12. As shown in Figure 1.13, spreading calculations onto too many processors can have severe adverse performance effects. However, deeper loop partitioning should be re-evaluated if the processing environment changes, for example with slower processors, larger problem sizes, a faster network or a combination of these. For simulation applications, the optimal processing grain size should be identified using the same process within each simulated time period, because the time-marching loop does not impact the computing-versus-communication ratio.
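A few lines of C suffice to evaluate the speedup model (1.14) for the parameters above and check the claim that the top-level parallelization of the n = 10,000 case peaks below 12; the parameter values are those assumed in the text, and the loop over powers of two is purely illustrative.

#include <stdio.h>

/* Speedup model (1.14): Sp = p / (1 + delta*omega*(p^2+p)/(n*mu)).
   omega in operations/second, mu in bytes/second, delta in bytes. */
static double speedup(double n, double p, double delta, double omega, double mu) {
    return p / (1.0 + delta * omega * (p * p + p) / (n * mu));
}

int main(void) {
    double delta = 8.0;      /* double-precision cell size (bytes) */
    double omega = 300e6;    /* 300 MOPS                           */
    double mu    = 120e6;    /* 120 MBPS                           */
    double n     = 10000.0;  /* problem size                       */

    for (int p = 2; p <= 64; p *= 2)
        printf("p=%2d  Sp=%.2f\n", p, speedup(n, (double)p, delta, omega, mu));
    return 0;
}

The printed speedups rise to roughly 10 and then fall as p grows, consistent with the performance map of Figure 1.13.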
Theoretically, the Timing Model method overcomes the impossibility of
program performance prediction (due to Turing’s Halting Theorem [41]) by
introducing application dependent ω(n). ω(n) is obtained via program instru-
mentation. The rest is straightforward.

1.6.5 Performance Scalability


The statistically multiplexed interconnection network not only eliminates the
single point-of-failure but also allows the applications to leverage multiple
redundant networks simultaneously. Higher network speed allows for better
performance scalability. This is the enabling factor.
The ultimate performance scalability can be achieved using a hybrid ap-
proach analogous to the Internet. The idea is to wrap the data parallel pro-
cessing layer on top of multiple “coarse grain workers”. Each coarse grain
worker is a piece of HPC code running in the conventional message passing or

shared memory environment (analogous to the circuit-switching networks). The overall application is a multi-scale algorithm [1] employing multiple coarse grain workers integrated via the Stateless Parallel Processing virtual machine. The SPP properties will deliver extreme dependability and efficiency for these very large scale parallel applications.
Multi-scale modeling and simulation [1] is an emerging field involving
mathematics, chemistry, physics, engineering, environmental science and com-
puter science. Traditional mono-scale simulation has been proven inadequate
due to prohibitively high costs of interconnection communication. The focus
is on the fundamental modeling and computational principles of underlying
multi-scale methods. These studies have been supported by observations of
multi-scale phenomena.
Since the maximal survivability framework is based on the Unit of Trans-
mission (tuples), as long as the supporting architecture can find a way to
transmit tuples, it can mitigate multiple simultaneous runtime (computing
and communication) component failures for all applications. Non-stop HPC
service has thus become possible by allowing online component crashes, repair
and reboots (since the application architecture supports UT fault tolerance).
Like the Internet, an HPC application's performance and reliability are bound only by the maximal degree of passive spatial redundancy.

1.6.6 Automatic Parallel Program Generation – The Higher Dimension
Traditional parallel programming requires a combination of three hard-to-acquire skills: domain knowledge, programming and processing/communication hardware architecture. Although very high performance has been delivered through carefully optimized codes, the productivity of these systems is notoriously low. A typical HPC production system takes many years to mature. Once it has reached a usable stage, the underlying sciences could have evolved to a point that the value of the entire system may be in question.
Recently, using markup languages to aid automated parallel program generation has achieved varying degrees of success ([36], [44] and [26]). This section reports a similar attempt focused on Tuple Space parallel programming.
Parallel programming using Tuple Space relies on tuples to communicate
and synchronize parallel programs. This is a five-step process:

1. Identify the computing intensive parts and separate them into in-
dependent sub-programs.
2. Identify data dependencies amongst all parts. This defines the tuples
to connect the sub-programs.
3. For each computing intensive sub-program, decide a partition loop
depth.

<reference></reference>
<parallel>
<reference></reference>
<master>
<send> or <read>
<worker>
<send> or <read>
<target>
the loop to be parallelized
</target>
<send> or <read>
</worker>
<send> or <read>
</master>
</parallel>

FIGURE 1.14
PML Tag Structure

4. Design a “bag of tasks” scheme for each loop partition.
5. Develop a complete sequential program for each sub-program. This will produce the necessary masters and workers.

The resulting parallel programs will run in a Tuple Space supported environment, such as Synergy [28] [32], where the workers will be automatically replicated to run on multiple compute nodes.
For performance optimization, each partition depth change requires re-coding steps 3-5. This is labor intensive. PML (Parallel Markup Language) [38] was developed to ease the search for the optimal processing granularity and to ease parallel programming in general.
Like all other parallel program markup efforts ([36], [26] and [44]), the core concept of PML is to eliminate automated dependency analysis, a common component in traditional parallel compilers. Instead, the user is required to perform the dependency analysis and mark the results in the sequential source code using PML tags.
PML is an XML-like language designed for generating data parallel programs from sequential programs. It contains seven tags (Figure 1.14).
The “reference” tag marks program segments for direct source-to-source copy in their relative positions. The “master” tag marks the range of the
parallel master. The “send” or “read” tags define the master-worker interface
based on their data exchange formats. The “worker” tag marks the compute-intensive segment of the program that is to be parallelized. The “target” tag defines the actual partitioning strategy based on loop subscripts, such as tiling (2D), striping (1D) or wave-front (2D). The coarse-to-fine grain size search places “target” tags in an outer loop first and then gradually drills into deeper loop(s) if the timing model indicates that there are speedup advantages in the deeper loops.
Figure 1.15 shows the PML marked sequential matrix multiplication pro-
gram.
In this example, the variable declarations of i,j and k will be copied exactly.

/* <reference id="123"> */
int i, j, k;
/* </reference> */

The first three lines of the “master” tag define the output data interface to the workers. The two “put” tags insert two tuples named “B” and “A” with double precision [N, N ] cells into the space.

/* <master id="123"> */
/* <put var="B" type="double[N][N]" opt="ONCE" /> */
/* <put var="A" type="double[N][N]"/> */

The “worker” tags define how the tuples are to be accessed. Tuple “B” will be read “ONCE”. Tuple “A” must be retrieved N times along the i subscript.

/* <worker> */
/* <read var="B" type="double[N][N]" opt="ONCE"/> */
/* <get var="A" type="double[N(i)][N]"/> */

The “target” tag defines the partition depth and grouping factor. In this case, the partition happens at subscript i, within the range [0, N ] with step 1 and grouping factor G. This is the first-order partition.

/* <target index="i" limits="(0,N,1)" chunk="G" order="1"> */


for (i = 0; i < N; i++)
/* </target> */

The “worker” body concludes with a single output tag describing the over-
all dimensions and the partitioned dimension [N (i)].

/* <put var="C" type="double[N(i)][N]"/> */


/* </worker> */

/* <parallel appname="matrix"> */
main(int argc, char **argv[]) {
/* <reference id="123"> */
int i, j, k;
/* </reference> */

/* <master id="123"> */
/* <put var="B" type="double[N][N]" opt="ONCE" /> */
/* <put var="A" type="double[N][N]" /> */

/* <worker> */
/* <read var="B" type="double[N][N]" opt="ONCE"/> */
/* <get var="A" type="double[N(i)][N]"/> */
/* <target index="i" limits="(0,N,1)" chunk="G" order="1"> */
for (i = 0; i < N; i++)
/* </target> */
{
for (k = 0; k < N; k++)
for (j = 0; j < N; j++)
C[i][j] += A[i][k]*B[k][j];
}
/* <put var="C" type="double[N(i)][N]"/> */
/* </worker> */
/* <put var="C" type="double[N][N]"/> */

/* </master> */
exit(0);
}
/* </parallel> */

FIGURE 1.15
PML Marked Matrix Program

Finally, the “master” is responsible for collecting the results with a single
read tag.

/* <put var="C" type="double[N][N]"/> */


/* </master> */

In the “target” tag, the “type” attribute design allows complex loop (subscript) manipulations. In this example, the expression [N (i), N ] indicates a one-dimensional (leftmost) parallelization. The grouping factor is described by the “chunk” attribute. The expression [N (i), N (j)] indicates a two-dimensional parallelization. As mentioned earlier, deeper parallelization typically requires a higher communication volume, so it should be used with care. Similarly, sliding windows, equal partitions and wave-fronts can all be implemented via similar mechanisms.
In the “master” tag, it is also possible to insert check-point instructions.
To show the feasibility of the PML approach, a PML compiler was constructed [38]. The PML compiler generates two parallel programs: a master and a worker. We tested the manually crafted and the generated programs in the Synergy parallel processing environment [28][31].
We also installed MPICH2-0.971 [14], compiled with the -enable-fast switch. An MPI parallel matrix application program was also obtained.
All test programs were compiled with gcc (version 2.95.3) using the -O3 switch. All tests were conducted on a Solaris cluster consisting of 25 Sun Blade500 processors connected via 100 Mbps switches. All nodes had exactly identical configurations.
The Synergy and PML experiments were timed with worker fault tolerance turned on. The Synergy master and the MPICH2 programs have no fault tolerance protection.
Figure 1.16 shows the recorded performance results comparing Synergy
hand crafted programs, PML generated programs and MPI parallel programs.
This study [38] also included other common computational algorithms, such as a Laplacian solver using Gauss-Seidel iteration, block LU factorization, convolution and others. The study revealed that a) the tuple space can be augmented to reduce code generation complexities; and b) the PML tags are flexible enough to accommodate arbitrarily complex partition patterns. Like other markup languages ([36], [44] and [26]), practice and memorization help with coding efficiency. In comparison, the scale of effort is still much smaller than coding data parallel applications directly.
For extreme scale applications, we could develop new PML tags to in-
clude the support for multi-scale algorithms [1] with automatic check-point
generation. The new features would allow programmers to compose sequential
multi-scale algorithms and use existing MPI or OpenMP codes as coarse-grain
workers directly. PML tags can parallelize the multi-scale program with direct
references (via the ”reference” tag) to the legacy MPI or OpenMP codes.

Nodes(P )  Size(n)  PML    Manual         MPICH2  Sequential
2          600      6.7    5 (G=25)       5.12    8.9
2          800      15.3   12.2 (G=200)   11.87   21.6
2          1000     28.3   23.4 (G=63)    22.86   42.4
2          1600     118.3  95 (G=100)     95      181.7
2          2000     231.3  187.6 (G=75)   186     358
4          600      4.5    3.6 (G=16)     3.3     8.9
4          800      10     7.8 (G=12)     7.2     21.6
4          1000     17.3   14.1 (G=13)    13.4    42.4
4          1600     66.7   53 (G=23)      53      181.7
4          2000     128.3  101.2 (G=21)   100     358.7

FIGURE 1.16
PML Performance Comparisons

1.7 Summary
Cloud computing has brought “mission criticalness” within the reach of all applications. Reaping the full cloud benefits, however, requires non-trivial efforts, though none of the challenges is new. This chapter has described the fundamentals of mission critical applications for scalability, availability and information assurance goals.
Starting from the packet switching networks, we identified the generic
“DNA” sequence that is essential to the robustness of extreme data com-
munication systems. Using information theory, we argued that extreme scale networked computing applications (NCA) are also possible if statistical multiplexing is introduced to the application semantic networks.
The necessary condition of the maximally survivable NCA contains four
elements: Unit of Transmission (UT), transient spatial redundancy, temporal
redundancy and passive spatial redundancy. The sufficient conditions include
maximal NCA service/data availability, unlimited NCA scalability and loss
free NCA processing. NCA correctness and integrity are also assumed.
This chapter also describes specific steps toward meeting the necessary and sufficient conditions of mission critical applications. We have included detailed “DNA” embedding sequences for three basic NCAs: mission critical Enterprise Service Bus (ESB), lossless mission critical transaction processing and non-stop high performance computing.
Although the results are non-trivial and still somewhat experimental, it is hard to argue with the technological direction.
We have only scratched the surface. The basic NCA services (messaging, storage, transaction processing and computing) suggest the direction for the development of next-generation networking equipment and distributed application APIs. Differentiating from the traditional silo development processes, these are integrating frameworks based on the NCA semantic networks. The next generation of systems is likely to include more advanced support for the basic NCA services.
The holistic NCA development methodology has brought the essential but non-functional factors to the surface. We hope this chapter can help application API designers, system developers, application integrators and network developers to better understand the critical factors for solving some of the most difficult non-functional networked computing problems.

1.8 Acknowledgements
The author wishes to thank the many students and colleagues who contributed in different ways to the body of work described here. In particular, the contributions of Kostas Blathras, John Dougherty, David Muchler, Suntian Song, Feijian Sun, Yijian Yang and Fanfan Xiong are essential to the philosophical developments in the “decoupling” school of thought.
The projects described here are partially supported by the National Sci-
ence Foundation, the Office of Naval Research, Temple University Vice Provost
Office for Research and Parallel Computers Technology Inc. Special thanks to
Professors David Clark (MIT) and Chip Elliott (GENI) for the informal dis-
cussions after their Distinguished Speaker talks on TCP/IP protocols, trans-
action losses and information entropies.
This manuscript was particularly inspired by recent invigorating discussions with the author's new colleagues, Drs. Abdallah Khreishah and Shan Ken Lin, and Ph.D. student Moussa Taifi, on fault tolerance modeling and information theory.
Bibliography

[1] Multi-scale modeling and simulation, 2010. [Online],
http://www.math.princeton.edu/multiscale/.

[2] S. Ahuja, N. Carriero, and D. Gelernter. Linda and friends. Computer,
19(8):26–34, 1986.

[3] Novell BEA Systems. Complete Transition of TUXEDO to BEA.
http://www.bea.com/framework.jsp?CNT=pr00002.htm&FP=/content/news_events/press_releases/1996, 1996.

[4] Juan M. Andrade. The TUXEDO System: Software for Constructing and
Managing Distributed Business Applications, 1996.

[5] Arvind. Decomposing a program for multiple processor system. In Pro-


ceedings of the 1980 International Conference on Parallel Processing,
pages 7–14, 1980.

[6] Paul Baran. On distributed communications, RM-3420. Technical report,
http://www.rand.org/about/history/baran.list.html, 1964.

[7] J. Basney. A Distributed Implementation of the C-Linda Programming


Language. PhD thesis, Oberlin College, June 1995.

[8] Hal Berenson, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O’Neil,
and Patrick O’Neil. A critique of ANSI SQL isolation levels. In Proceedings
of the 1995 ACM SIGMOD International Conference on Management of
Data, pages 1–10. ACM, 1995.

[9] N. Carriero and D. Gelernter. How to Write Parallel Programs - A First


Course. The MIT Press, Cambridge, MA, 1990.

[10] Rohit Chandra. Parallel Programming in OpenMP. Morgan Kaufmann.

[11] M. Chandy and L. Lamport. Distributed snapshots: Determining global


states of distributed systems. ACM Transactions on Computing Systems,
3(1):63 –75, 1985.

[12] J. B. Dennis. Data flow supercomputers. Computer, pages 48–56, 1980.

[13] Michael Flynn. Some computer organizations and their effectiveness.


IEEE Trans. Comput., c(21):948, 1972.


[14] Message-Passing Interface Forum. MPI: A message-passing interface standard.
http://www.mpi.org, 1994.

[15] William Gropp and Ewing Lusk. Fault tolerance in mpi programs. Special
Issue, Journal of High Performance Computing Applications, 1(18):363–
372, 2002.

[16] S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: A method for


scheduling parallel loops. CACM, 35(8):90–101, August 1992.

[17] IBM. WebSphere: MQ V6 Fundamentals, 2005.

[18] IBM. MC91: High Availability for WebSphere MQ on Unix Platforms,


Vol.,7, 2008.

[19] Jim Gray. The dangers of replication and a solution. In ACM SIGMOD
International Conference on Management of Data Archive, pages 173–182,
Montreal, Quebec, Canada, 1996.

[20] Leslie Lamport. Time, clocks, and the ordering of events in a distributed
system. CACM, 21(7):558–565, 1978.

[21] John Little. A proof for the queueing formula: l = λw. Operations
Research: A Journal of the Institute for Operations Research and the
Management Sciences, pages 383–387, 1961.

[22] Microsoft. SQL Server Online Book.
http://msdn.microsoft.com/en-us/library/ms187956.aspx, 2010.

[23] University of Manchester. Dataflow research project, 1997.

[24] Oracle. Oracle 11g Online Library.
http://download.oracle.com/docs/cd/B28359_01/server.111/b28313/usingpe.htm#i1007101, 2010.

[25] C. Polychronopoulos and D. Kuck. Guided self-scheduling: A practical


scheduling scheme for parallel computers. IEEE Transactions on Com-
puters, C-36(12):1425–1439, December 1987.

[26] W3C School. MathML3 Manual. 2011.

[27] Justin Shi and Suntian Song. Apparatus and Method of Optimizing
Database Clustering with Zero Transaction Loss. (Pending), 2007.

[28] Justin Y. Shi. Synergy v3.0 Manual. http://spartan.cis.temple.edu/synergy/, 1995.

[29] Justin Y. Shi. Fault tolerant self-optimizing multiprocessor system and


method thereof. 2007.

[30] Justin Y. Shi. Decoupling as a foundation for large scale parallel process-
ing. In Proceedings of 2009 High Performance Computing and Commu-
nications, Seoul, Korea, 2009.

[31] Y. Shi. Heterogeneous computing for graphics applications. In National


Conference on Graphics Applications, April 1991.

[32] Y. Shi. A distributed programming model and its applications to compu-


tation intensive applications for heterogeneous environments”. In Inter-
national Space Year Conference on Earth and Space Information Systems,
pages 10–13, Pasadena, CA., February 1992.

[33] Y. Shi. Program scalability analysis. In International Conference on
Distributed and Parallel Processing, Georgetown University, Washington
D.C., October 1997.

[34] M. Schulz, G. Bronevetsky, R. Fernandes, D. Marques, K. Pingali, and
P. Stodghill. Implementation and evaluation of a scalable application-
level checkpoint-recovery scheme for MPI programs. In Proceedings of the
Supercomputing 2004 Conference, Pittsburgh, PA., November 2004.

[35] Suntian Song. Method and apparatus for database fault tolerance with
instant transaction replication using off-the-shelf database servers and
low bandwidth networks. (#6,421,688), 2002.

[36] Scott Spetka, Haris Hadzimujic, Stephen Peek, and Christopher Flynn.
High Productivity Languages for Parallel Programming Compared to
MPI. HPCMP Users Group Conference, 0:413–417, 2008.

[37] Michael Stonebraker. The case for shared nothing architecture. Database
Engineering, 9(1), 1986.

[38] Feijian Sun. Automatic Program Parallelization Using Stateless Parallel


Processing Architecture. PhD thesis, Temple University, 2004.

[39] Symantec. Veritas Cluster Server (VCS).
http://eval.symantec.com/mktginfo/products/Datasheets/High_Availability/cluster_server_datasheet.pdf, 2006.

[40] B. Szymanski, Y. Shi, and N. Prywes. Synchronized distributed termina-


tion. IEEE Transactions on Software Engineering, SE11(10):1136–1140,
1985.

[41] Alan Turing. On computable numbers, with an application to the


entscheidungsproblem. Proceedings of the London Mathematical Society,
Series 2(42):230–265, 1936.

[42] Fanfan Xiong. Resource Efficient Parallel VLDB with Customizable De-
gree of Redundancy. PhD thesis, Temple University, 2009.

[43] Yijian Yang. Fault Tolerance Protocol for Multiple Dependent Master
Protection in a Stateless Parallel Processing Framework. PhD thesis,
Temple University, August 2007.
[44] Yingqian Zhang, Bin Sun, and Jia Liu. A markup language for
parallel programming model on multi-core system. SCALCOM-
EMBEDDEDCOM 2009 International Conference, pages 640–643, Sept.
2009.
