An Overview of Cloud Computing
During the last several decades, dramatic advances in computing power, storage, and networking
technology have allowed the human race to generate, process, and share increasing amounts
technology have allowed the human race to generate, process, and share increasing amounts
of information in dramatically new ways. As new applications of computing technology are
developed and introduced, these applications are often used in ways that their designers never envisioned.
New applications, in turn, lead to new demands for even more powerful computing infrastructure.
To meet these computing-infrastructure demands, system designers are constantly looking for new
system architectures and algorithms to process larger collections of data more quickly than is feasible
with today's systems. It is now possible to assemble very large, powerful systems consisting of many
small, inexpensive commodity components because computers have become smaller and less expensive,
disk drive capacity continues to increase, and networks have gotten faster. Such systems tend to be
much less costly than a single, faster machine with comparable capabilities.
Building systems from large numbers of commodity components leads to some significant challenges,
however. Because many more computers can be put into a computer room today than was possible
even a few years ago, electrical-power consumption, air-conditioning capacity, and equipment weight
have all become important considerations for system designs. Software challenges also arise in this
environment because writing software that can take full advantage of the aggregate computing power of
many machines is far more difficult than writing software for a single, faster machine.
Recently, a number of commercial and academic organizations have built large systems from
commodity computers, disks, and networks, and have created software to make this hardware easier
to program and manage. These organizations have taken a variety of novel approaches to address the
challenges outlined above. In some cases, these organizations have used their hardware and software to
provide storage, computational, and data management services to their own internal users, or to provide
these services to external customers for a fee. We refer to the hardware and software environment that
implements this service-based model as a cloud-computing environment. Because the term cloud
computing is relatively new, there is not universal agreement on this definition. Some people use the
terms grid computing, utility computing, or application service providers to describe the same storage,
computation, and data-management ideas that constitute cloud computing.
Regardless of the exact definition used, numerous companies and research organizations, including
Google, Amazon, Yahoo, and a number of universities, are applying cloud-computing concepts to their
business or research problems. This article provides an overview of some of the most popular cloud-
computing services and architectures in use today. We also describe potential applications for cloud
computing and conclude by discussing areas for further research.
Nomenclature
Before describing examples of cloud computing
technology, we must first define a few related terms
more precisely. A computing cluster consists of
a collection of similar or identical machines that
physically sit in the same computer room or building.
Each machine in the cluster is a complete computer
consisting of one or more CPUs, memory, disk
drives, and network interfaces. The machines are
networked together via one or more high-speed local-
area networks. Another important characteristic of
a cluster is that it's owned and operated by a single
administrative entity such as a research center or
a company. Finally, the software used to program
and manage clusters should give users the illusion
that they're interacting with a single large computer
when in reality the cluster may consist of hundreds
or thousands of individual machines. Clusters
are typically used for scientific or commercial
applications that can be parallelized. Since clusters
can be built out of commodity components, they are
often less expensive to construct and operate than
supercomputers.
Although the term grid is sometimes used
interchangeably with cluster, a computational
grid takes a somewhat different approach to high-
performance computing. A grid typically consists
of a collection of heterogeneous machines that are
geographically distributed. As with a cluster, each
machine is a complete computer, and the machines
are connected via high-speed networks. Because
a grid is geographically distributed, some of the
machines are connected via wide-area networks
that may have less bandwidth and/or higher latency
than machines sitting in the same computer room.
Another important distinction between a grid and
a cluster is that the machines that constitute a grid
may not all be owned by the same administrative
entity. Consequently, grids typically provide
services to authenticate and authorize users to access
resources on a remote set of machines on the same
grid. Because researchers in the physical sciences
often use grids to collect, process, and disseminate
data, grid software provides services to perform
bulk transfers of large files between sites. Since a
computation may involve moving data between sites
and performing different computations on the data,
grids usually provide mechanisms for managing
long-running jobs across all of the machines in
the grid.
Grid computing and cluster computing are not
mutually exclusive. Some high-performance
computing systems combine some of the attributes
of both. For example, the Globus Toolkit [1], a set of
software tools that is currently the de facto standard
for building grid-computing systems, provides
mechanisms to manage clusters at different sites
that are part of the same grid. As you'll see later
in this article, many cloud-computing systems also
share many of the same attributes as clusters and
grids.
The Google Approach to Cloud Computing
Google is well known for its expanding list of
services, including its very popular search engine,
email service, mapping services, and productivity
applications. Underlying these applications
is Google's internally developed cloud-based
computing infrastructure. Google has published a
series of papers in the computer-science research
literature that demonstrate how they put together
a small collection of good ideas to build a wide
variety of high-performance, scalable applications.
In this section we describe what Google has built
and how they use it.
Google Design Philosophy
Although Google's clouds are very high-
performance computer systems, the company took
a dramatically different approach to building them
than what is commonly done today in the high-
performance and enterprise-computing communities
[2]. Rather than building a system from a moderate
number of very high-performance computers,
Google builds their cloud hardware as a cluster
containing a much larger number of commodity
computers. They assume that hardware will fail
regularly and design their software to deal with
that fact. Since Google is not using state-of-the-art
hardware, they're also not using the most expensive
hardware. Consequently, they can optimize their
costs, power consumption, and space needs by
making appropriate tradeoffs.
Another key aspect of Google's design philosophy
is to optimize their system software for the specific
applications they plan to build on it. In contrast, the
designers of most system software (e.g. operating
systems, compilers, and database management
systems) try to provide a combination of good
performance and scalability to a wide user base.
Since it is not known how different applications
will use system resources, the designer uses his or
her best judgment and experience to build systems
that provide good overall performance for the
types of applications they expect to be run most
often. Because Google is building both the system
and application software, they know what their
applications require and can focus their system-
software design efforts on meeting exactly those
requirements.
Google File System
Google's design philosophy is readily evident in
the architecture of the Google File System (GFS)
[3]. GFS serves as the foundation of Google's
cloud software stack and is intended to resemble
a Unix-like file system to its users. Unlike Unix
or Windows file systems, GFS is optimized for
storing very large files (> 1 GB) because Google's
applications typically manipulate files of this size.
One way that Google implements this optimization
is by changing the smallest unit of allocated storage
in the fle system from the 8 KB block size typical
of most Unix fle systems to 64 MB. Using a 64
MB block size (Google calls these blocks chunks)
results in much higher performance for large fles
at the expense of very ineffcient storage utilization
for fles that are substantially smaller than 64 MB.
Another important design choice Google makes
in GFS is to optimize the performance of the
types of I/O access patterns they expect to see
most frequently. A typical file system is designed
to support both sequential and random reads and
writes reasonably efficiently. Because Google's
applications typically write a file sequentially once
and then only read it, GFS is optimized to support
append operations. While any portion of a file may
be written in any order, this type of random write
operation will be much slower than operations that
append new data to the end of a file.
Since GFS is optimized for storing very large
files, Google designed it so that the chunks that
constitute a single GFS file do not need to reside on
one disk as they would in a conventional Unix file
system. Instead, GFS allocates chunks across all of
the machines in the cloud. Allocating chunks in
this manner also provides the architectural
underpinnings for GFS fault tolerance. Each chunk
can be replicated onto one or more machines in the
cloud so that no files are lost if a single host or disk
drive fails. Google states that they normally use
a replication factor of three (each chunk stored on
three different machines), and that they do not use
the fault-tolerance techniques used in enterprise-
class servers such as redundant arrays of inexpensive
disks (RAID).
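To make the chunk-allocation and replication ideas concrete, the following Python sketch divides a file into 64 MB chunks and assigns each chunk to three distinct machines. This is our own illustration, not Google's implementation; the names (place_chunks, hosts) are hypothetical, and real GFS placement also weighs disk utilization and network topology.

import random

CHUNK_SIZE = 64 * 1024 * 1024   # GFS chunk size: 64 MB
REPLICATION = 3                 # each chunk stored on three machines

def place_chunks(file_size, machines):
    # Assign each chunk of a file to REPLICATION distinct machines.
    # Illustrative only: GFS's actual placement policy is more sophisticated.
    num_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE   # round up
    return [(i, random.sample(machines, REPLICATION))
            for i in range(num_chunks)]

# A 200 MB file spread across a ten-machine cloud:
hosts = ["host%02d" % i for i in range(10)]
for index, replicas in place_chunks(200 * 1024 * 1024, hosts):
    print("chunk", index, "->", replicas)

Because no chunk's three replicas live on a single disk, the loss of one host or drive leaves every file fully recoverable.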
MapReduce
Built on top of GFS, Google's MapReduce
framework is the heart of the computational model
for their approach to cloud computing [4, 5]. The
basic idea behind Google's computational model is
that a software developer writes a program containing
two simple functions, map and reduce, to process
a collection of data. Google's underlying runtime
system then divides the program into many small
tasks, which are then run on hundreds or thousands
of hosts in the cloud. The runtime system also
ensures that the correct subset of data is delivered
to each task.
The developer-written map function takes as its
input a sequence of <key_in, value_in> tuples, performs
some computation on these tuples, and produces as
its output a sequence of <key_out, value_out> tuples.
There does not necessarily need to be a one-to-one
correspondence between input and output key/value
tuples. Also, key_in does not necessarily equal key_out
for a given key/value tuple.
The developer-written reduce function takes as its
input a key and a set of values corresponding to that
key. Thus, for all <key', value_i> tuples produced by
the map function that have the same key key', the
reduce function will be invoked once with key' and
the set of all values value_i. It's important to note
that if the tuple <key', value_i> is generated multiple
times by the map function, value_i will appear the
same number of times in the set of values provided
to the reduce function, i.e., duplicate values are not
removed. Once invoked, the reduce function will
perform some computation on the set of values
and produce some output value that the runtime
infrastructure will associate with the key that was
supplied as input to reduce.
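Expressed in Python (our choice of notation; Google's actual interface is C++), the shapes of the two developer-written functions might look like the following sketch, where the names map_fn and reduce_fn are our own:

from typing import Iterable, Iterator, Tuple

def map_fn(key_in: str, value_in: str) -> Iterator[Tuple[str, str]]:
    # Emit zero or more <key_out, value_out> tuples for one input tuple;
    # key_out need not equal key_in.
    ...

def reduce_fn(key: str, values: Iterable[str]) -> str:
    # Invoked once per distinct key with every value emitted for that key
    # (duplicates included); combines them into a single output value.
    ...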
Example
To illustrate how MapReduce might be used
to solve a real problem, consider the following
hypothetical application. Suppose a software
developer is asked to build a tool to search for words
or phrases in a collection of thousands or millions
of text documents. One common data structure that
is useful for building this application is an inverted
index. For every word w that appears in any of the
documents, there will be a record in the inverted
index that lists every document where w appears at
least once. Once the inverted index is constructed,
the search tool can rapidly identify where words
appear in the document collection by searching the
index rather than the entire collection.
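In Python terms, an inverted index is essentially a dictionary mapping each word to the documents that contain it. The sketch below (with made-up contents) shows the structure and a lookup:

# An inverted index maps each word to the documents that contain it.
inverted_index = {
    "moon": ["speech.txt", "song.txt"],
    "choose": ["speech.txt"],
    "dark": ["song.txt"],
}

def search(word):
    # Return documents containing the word without scanning the collection.
    return inverted_index.get(word.lower(), [])

print(search("moon"))   # ['speech.txt', 'song.txt']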
Constructing the inverted index is straightforward
using MapReduce. To do so, the developer would
construct a map function that takes as its input a
single <document name, document contents> tuple,
parses the document, and outputs a list of <word,
document name> tuples. For one invocation of
map, key_in might be speech.txt, and value_in
might be "We choose to go to the moon in
this decade and do the other things, not
because they are easy, but because they
are hard!" If the map function were invoked with
the key/value tuple shown above, map would parse
the document by locating each word in the document
using whitespace and punctuation, removing the
punctuation, and normalizing the capitalization. For
each word found, map would output a tuple <key_out,
value_out> where key_out is a word and value_out is the
name of the document key_in. In this example, 25
<key_out, value_out> tuples would be output as follows:
<we, speech.txt>
<choose, speech.txt>
<to, speech.txt>
<go, speech.txt>
<to, speech.txt>
...
<hard, speech.txt>
The map function would be invoked for each
document in the collection. If the cloud has
thousands of nodes, map could be processing
thousands of documents in parallel. Accessing
any document from any machine in the cloud via
GFS makes map task scheduling easier, since a file
doesn't need to be pre-positioned at the machine
processing it.
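A map function for this task might look like the following Python sketch; the regular expression stands in for the whitespace-and-punctuation parsing described above, and the names are again our own rather than Google's:

import re

def map_fn(document_name, document_contents):
    # Emit one <word, document name> tuple per word occurrence,
    # stripping punctuation and normalizing capitalization.
    for word in re.findall(r"[a-zA-Z']+", document_contents):
        yield (word.lower(), document_name)

for tuple_out in map_fn("speech.txt", "We choose to go to the moon"):
    print(tuple_out)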
The reduce function in this example is very easy
to implement. The MapReduce infrastructure will
aggregate all <key,value> tuples with the same key
that were generated by all map functions, and send
the aggregate to a single reduce function as <key,
list of values>. This reduce invocation will iterate
over all values and output the key and all values
associated with it. In our example, suppose we
have a second document whose name is song.txt
and its contents are "I'll see you on the
dark side of the moon." The map functions
processing speech.txt and song.txt will output
the tuples <moon, speech.txt> and <moon,
song.txt>, respectively. Reduce will be invoked
with the key/value tuple <moon, list{speech.txt,
song.txt}> and will output something that
looks like
moon: speech.txt song.txt
If the word moon appeared in other documents,
those document names would also appear on this
line.
Hundreds or thousands of reduce functions will
process the output of different map functions the
same way, once again taking advantage of the
parallelism in the cloud. Because of the way the
MapReduce infrastructure allocates work, the
processing of a single word w will be done by only
one reduce task. This design decision dramatically
simplifies scheduling reduce tasks and improves
fault tolerance since a failed reduce task can be
restarted without affecting other reduce tasks and
without doing unnecessary work.
After all reduce functions have finished processing
tuples, there will be a collection of files containing
one or more lines of word: document list mappings
as shown above. By concatenating these files, we
have constructed the inverted index.
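The whole computation can be simulated on a single machine in a few lines of Python, shown below. The grouping step in the middle stands in for the MapReduce runtime's aggregation of tuples by key; in a real cloud, the map and reduce invocations would run in parallel on thousands of hosts. As before, every name here is our own invention for illustration.

import re
from collections import defaultdict

def map_fn(document_name, document_contents):
    # Emit one <word, document name> tuple per word occurrence.
    for word in re.findall(r"[a-zA-Z']+", document_contents):
        yield (word.lower(), document_name)

def reduce_fn(word, document_names):
    # Emit one inverted-index line. Duplicate names arrive in the value
    # list, so we collapse them (preserving order) for the final index.
    return "%s: %s" % (word, " ".join(dict.fromkeys(document_names)))

def run_mapreduce(documents):
    grouped = defaultdict(list)        # the runtime's grouping-by-key step
    for name, contents in documents.items():
        for word, doc in map_fn(name, contents):
            grouped[word].append(doc)
    return [reduce_fn(word, docs) for word, docs in sorted(grouped.items())]

docs = {
    "speech.txt": "We choose to go to the moon in this decade and do "
                  "the other things, not because they are easy, but "
                  "because they are hard!",
    "song.txt": "I'll see you on the dark side of the moon",
}
for line in run_mapreduce(docs):
    print(line)

Running this produces lines such as moon: speech.txt song.txt, matching the output described above.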
Discussion
The MapReduce model is interesting from a number
of perspectives. Decades of high-performance-
computing experience have demonstrated the
difficulty of writing software that takes full
advantage of the processing power provided by a
parallel or distributed system. Nevertheless, the
MapReduce runtime system is able to automatically
partition a computation and run the individual parts
on many different machines, achieving substantial
performance improvements. Part of the reason
that this is possible goes back to Google's design
philosophy. MapReduce is not a general-purpose
parallel programming paradigm, but rather a special-
purpose paradigm that is designed for problems that
can be partitioned in such a way that there are very
few dependencies between the output of one part
of the computation and the input of another. Not
all problems can be partitioned in this manner, but
since many of Google's problems can, MapReduce
is a very effective approach for them. By optimizing
MapReduce performance on their cloud hardware,
Google can amortize the costs and reap the benefits
across many applications.
Another key benefit of the special-purpose nature
of MapReduce is that it can be used to enhance the
fault-tolerance of applications. Recall that Google
builds its clouds as clusters of commodity hardware
and designs its software to cope with hardware
failures. Because a MapReduce computation can
be partitioned into many independent parts, the
MapReduce runtime system can restart one part of
the computation if its underlying hardware fails.
This restarting operation can be accomplished
without affecting the rest of the computation,
and without requiring additional expertise or
programming effort on the part of the application
developer. Google's approach to fault tolerance is in
stark contrast to most approaches today that require
substantial programmer effort and/or expensive
hardware redundancy.
Performance
How well does MapReduce work in practice?
Google's published results are impressive for
processing large data sets. In a paper published in
2004 [4], Google describes two experiments they
ran on an 1,800-machine cloud. Each machine had
two 2 GHz Intel Xeon processors, 4 GB of memory,
two 160 GB disks connected via an IDE interface,
and gigabit Ethernet. Although not explicitly stated
in the paper, we assume the processors were single-
core processors.
Google's first experiment was to create and run
a program they called grep, which was designed
to search for a three-character pattern in 10