
UNIT – 1
Systems Modeling, Clustering and Virtualization: Distributed System Models and Enabling Technologies,
Computer Clusters for Scalable Parallel Computing, Virtual Machines and Virtualization of Clusters and
Data centers.
UNIT 1

INTRODUCTION

• Centralized computing: This is a computing paradigm by which all computer resources are centralized in one physical system. All resources (processors, memory, and storage) are fully shared and tightly coupled within one integrated OS. Many data centers and supercomputers are centralized systems, but they are used in parallel, distributed, and cloud computing applications.

• Parallel computing: In parallel computing, all processors are either tightly coupled with centralized shared memory or loosely coupled with distributed memory. Some authors refer to this discipline as parallel processing. Interprocessor communication is accomplished through shared memory or via message passing. A computer system capable of parallel computing is commonly known as a parallel computer. Programs running in a parallel computer are called parallel programs. The process of writing parallel programs is often referred to as parallel programming.

• Distributed computing: This is a field of computer science/engineering that studies distributed systems. A distributed system consists of multiple autonomous computers, each having its own private memory, communicating through a computer network. Information exchange in a distributed system is accomplished through message passing. A computer program that runs in a distributed system is known as a distributed program. The process of writing distributed programs is referred to as distributed programming.

• Cloud computing: An Internet cloud of resources can be either a centralized or a distributed computing system. The cloud applies parallel or distributed computing, or both. Clouds can be built with physical or virtualized resources over large data centers that are centralized or distributed. Some authors consider cloud computing to be a form of utility computing or service computing.

• Ubiquitous computing refers to computing with pervasive devices at any place and time using wired or wireless communication.

• The Internet of Things (IoT) is a networked connection of everyday objects, including computers, sensors, humans, etc. The IoT is supported by Internet clouds to achieve ubiquitous computing with any object at any place and time.

• High-performance computing (HPC) emphasizes raw speed performance. The speed of HPC systems increased from Gflops in the early 1990s to Pflops in 2010. This improvement was driven mainly by the demands from the scientific, engineering, and manufacturing communities.

• High-throughput computing (HTC) systems pay more attention to high-flux computing. The main application of high-flux computing is in Internet searches and web services by millions or more users simultaneously. The performance goal thus shifts to measuring high throughput, or the number of tasks completed per unit of time. HTC technology needs to not only improve in terms of batch processing speed, but also address the acute problems of cost, energy savings, security, and reliability at many data and enterprise computing centers.

Degrees of Parallelism:

Bit-level parallelism (BLP) converts bit-serial processing to word-level processing gradually. Over the years, users graduated from 4-bit microprocessors to 8-, 16-, 32-, and 64-bit CPUs.

Instruction-level parallelism (ILP), in which the processor executes multiple instructions simultaneously rather than only one instruction at a time.

Data-level parallelism (DLP) was made popular through SIMD (single instruction, multiple data) and vector machines using vector or array types of instructions. DLP requires even more hardware support and compiler assistance to work properly.
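As a small illustration of data-level parallelism (not from the text), the sketch below contrasts an element-wise loop with a vectorized NumPy operation that SIMD-capable hardware can apply to whole vectors at once; it assumes NumPy is installed.

```python
# Hedged illustration of data-level parallelism (DLP): the same element-wise
# addition written as a scalar loop and as a single vectorized operation.
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

# Scalar form: one element at a time, no data-level parallelism exposed.
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# Vectorized form: whole arrays at once, letting the library/CPU exploit SIMD.
c_vector = a + b

assert np.allclose(c_scalar, c_vector)
```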

Ever since the introduction of multicore processors and chip multiprocessors (CMPs), we have been exploring task-level parallelism (TLP).

TECHNOLOGIES FOR NETWORK-BASED SYSTEMS

Multicore CPUs and Multithreading Technologies: Today, advanced CPUs or microprocessor chips assume a multicore architecture with dual, quad, six, or more processing cores. These processors exploit parallelism at the ILP and TLP levels. Both multicore CPU and many-core GPU processors can handle multiple instruction threads at different magnitudes today. Multiple cores are housed in the same chip with an L2 cache that is shared by all cores. In the future, multiple CMPs could be built on the same CPU chip with even the L3 cache on the chip. Many high-end processors, including the Intel i7, Xeon, AMD Opteron, Sun Niagara, IBM Power 6, and X cell processors, are multicore and multithreaded. Each core could also be multithreaded.

Memory, Storage, and Wide-Area Networking: Memory chips have experienced a 4x increase in capacity every three years. For hard drives, capacity increased from 260 MB in 1981 to 250 GB in 2004. Disks or disk arrays have exceeded 3 TB in capacity. The rapid growth of flash memory and solid-state drives (SSDs) also impacts the future of HPC and HTC systems.

System-Area Interconnects: The nodes in small clusters are mostly interconnected by an Ethernet switch or a local area network (LAN).

As the figure shows, a LAN typically is used to connect client hosts to big servers. A storage area network (SAN) connects servers to network storage such as disk arrays. Network attached storage (NAS) connects client hosts directly to the disk arrays. All three types of networks often appear in a large cluster built with commercial network components.

Wide-Area Networking: High-bandwidth networking increases the capability of building massively distributed systems. Ethernet bandwidth grew rapidly from 10 Mbps in 1979 to 1 Gbps in 1999, and to 40~100 GE in 2011. It has been speculated that 1 Tbps network links would become available by 2013.

Virtual Machines and Virtualization Middleware

Virtual machines (VMs) offer novel solutions to underutilized resources, application inflexibility, software manageability, and security concerns in existing physical machines. Today, to build large clusters, grids, and clouds, we need to access large amounts of computing, storage, and networking resources in a virtualized manner. We need to aggregate those resources and, hopefully, offer a single system image. In particular, a cloud of provisioned resources must rely on virtualization of processors, memory, and I/O facilities dynamically.

The host machine is equipped with the physical hardware. The VM is built with virtual resources managed by a guest OS to run a specific application. Between the VMs and the host platform, one needs to deploy a middleware layer called a virtual machine monitor (VMM).

The figure shows a native VM installed with the use of a VMM called a hypervisor in privileged mode. The guest OS could be a Linux system and the hypervisor is the XEN system developed at Cambridge University. This hypervisor approach is also called bare-metal VM, because the hypervisor handles the bare hardware (CPU, memory, and I/O) directly. Another architecture is the host VM shown in Figure (c). Here the VMM runs in non-privileged mode. The host OS need not be modified. The VM can also be implemented with a dual mode, as shown in Figure 1.12(d). Part of the VMM runs at the user level and another part runs at the supervisor level. In this case, the host OS may have to be modified to some extent. Multiple VMs can be ported to a given hardware system to support the virtualization process. The VM approach offers hardware independence of the OS and applications.

VM Primitive Operations: The VMM provides the VM abstraction to the guest OS. With full virtualization, the VMM exports a VM abstraction identical to the physical machine so that a standard OS such as Windows 2000 or Linux can run just as it would on the physical hardware.

Low-level VMM operations are:

• the VMs can be multiplexed between hardware machines,
• a VM can be suspended and stored in stable storage,
• a suspended VM can be resumed or provisioned to a new hardware platform,
• a VM can be migrated from one hardware platform to another.
These VM operations enable a VM to be provisioned to any available hardware platform. They also enable flexibility in porting distributed application executions. Furthermore, the VM approach will significantly enhance the utilization of server resources.
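As a hedged sketch (not part of the text), the low-level VM operations above roughly correspond to calls offered by hypervisor management libraries such as libvirt; the connection URI, the domain name 'vm1', and the file path are illustrative only, and error handling is omitted.

```python
# Hedged sketch of VM primitive operations using the libvirt Python bindings.
# The URI 'qemu:///system' and the domain name 'vm1' are made-up examples.
import libvirt

conn = libvirt.open('qemu:///system')   # connect to the local hypervisor
dom = conn.lookupByName('vm1')          # find a VM (a libvirt "domain")

dom.suspend()                           # pause the VM in memory
dom.resume()                            # resume it

dom.save('/tmp/vm1.state')              # suspend the VM to stable storage
conn.restore('/tmp/vm1.state')          # resume/provision it again later

# Live migration to another hardware platform would use dom.migrate(...)
# against a second connection; the details depend on the hypervisor setup.
conn.close()
```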

SYSTEM MODELS FOR DISTRIBUTED AND CLOUD COMPUTING

Distributed and cloud computing systems are built over a large number of autonomous computer nodes. These node machines are interconnected by SANs, LANs, or WANs in a hierarchical manner. Massive systems are considered highly scalable, and can reach web-scale connectivity, either physically or logically. Massive systems are classified into four groups: clusters, P2P networks, computing grids, and Internet clouds over huge data centers. In terms of node number, these four system classes may involve hundreds, thousands, or even millions of computers as participating nodes. These machines work collectively, cooperatively, or collaboratively at various levels.

1. Clusters of Cooperative Computers: A computing cluster consists of interconnected stand-alone computers which work cooperatively as a single integrated computing resource.

Cluster Architecture

A cluster of servers interconnected by a high-bandwidth interconnection network.

The figure shows the architecture of a typical server cluster built around a low-latency, high-bandwidth interconnection network. Through hierarchical construction using a SAN, LAN, or WAN, one can build scalable clusters with an increasing number of nodes. The cluster is connected to the Internet via a virtual private network (VPN) gateway. The gateway IP address locates the cluster. The system image of a computer is decided by the way the OS manages the shared cluster resources.

Single-System Image: An ideal cluster should merge multiple system images into a single-system image (SSI). Cluster designers desire a cluster operating system or some middleware to support SSI at various levels, including the sharing of CPUs, memory, and I/O across all cluster nodes. An SSI is an illusion created by software or hardware that presents a collection of resources as one integrated, powerful resource. SSI makes the cluster appear like a single machine to the user.

Hardware, Software, and Middleware Support: Clusters exploring massive parallelism are commonly known as MPPs. Special cluster middleware supports are needed to create SSI or high availability (HA). Both sequential and parallel applications can run on the cluster, and special parallel environments are needed to facilitate use of the cluster resources.

2. Grid Computing Infrastructures: A computing grid offers an infrastructure that couples computers, software/middleware, special instruments, and people and sensors together. The grid is often constructed across LAN, WAN, or Internet backbone networks at a regional, national, or global scale. Enterprises or organizations present grids as integrated computing resources. They can also be viewed as virtual platforms to support virtual organizations. The computers used in a grid are primarily workstations, servers, clusters, and supercomputers. Personal computers, laptops, and PDAs can be used as access devices to a grid system.

3. Peer-to-Peer Network Families

The structure of a P2P system: mapping a physical IP network to an overlay network.

In a P2P system, every node acts as both a client and a server, providing part of the system resources. Peer machines are simply client computers connected to the Internet. All client machines act autonomously to join or leave the system freely. This implies that no master-slave relationship exists among the peers. No central coordination or central database is needed. In other words, no peer machine has a global view of the entire P2P system. The system is self-organizing with distributed control. Unlike the cluster or grid, a P2P network does not use a dedicated interconnection network. The physical network is simply an ad hoc network formed at various Internet domains randomly using the TCP/IP and NAI protocols.

Data items or files are distributed among the participating peers. Based on communication or file-sharing needs, the peer IDs form an overlay network at the logical level. This overlay is a virtual network formed by mapping each physical machine with its ID, logically, through a virtual mapping.

P2P performance is affected by routing efficiency and self-organization by participating peers. Fault tolerance, failure management, and load balancing are other important issues in using overlay networks. Lack of trust among peers poses another problem. Peers are strangers to one another. Security, privacy, and copyright violations are major worries.

4. Cloud Computing over the Internet: A cloud is a pool of virtualized computer resources. A cloud can host a variety of different workloads, including batch-style backend jobs and interactive, user-facing applications.

Virtualized resources from data centers form an Internet cloud.

Cloud computing applies a virtualized platform with elastic resources on demand by provisioning hardware, software, and data sets dynamically (see Figure 1.18). The idea is to move desktop computing to a service-oriented platform using server clusters and huge databases at data centers.

THREE CLOUD SERVICE MODELS

• Infrastructure as a Service (IaaS): This model puts together the infrastructures demanded by users, namely servers, storage, networks, and the data center fabric. The user can deploy and run specific applications on multiple VMs running guest OSes. The user does not manage or control the underlying cloud infrastructure, but can specify when to request and release the needed resources.
• Platform as a Service (PaaS): This model enables the user to deploy user-built applications onto a virtualized cloud platform. PaaS includes middleware, databases, development tools, and some runtime support such as Web 2.0 and Java. The platform includes both hardware and software integrated with specific programming interfaces. The provider supplies the API and software tools (e.g., Java, Python, Web 2.0, .NET). The user is freed from managing the cloud infrastructure.
• Software as a Service (SaaS): This refers to browser-initiated application software delivered to thousands of paid cloud customers. The SaaS model applies to business processes, industry applications, customer relationship management (CRM), enterprise resource planning (ERP), human resources (HR), and collaborative applications. On the customer side, there is no upfront investment in servers or software licensing. On the provider side, costs are rather low, compared with conventional hosting of user applications.

SOFTWARE ENVIRONMENTS FOR DISTRIBUTED SYSTEMS AND CLOUDS

Service-Oriented Architecture (SOA)

• A paradigm for organizing and utilizing distributed capabilities that may be under the control of different ownership domains and implemented using various technology stacks
• A set of components which can be invoked, and whose interface descriptions can be published and discovered (W3C)
• SOA is an architectural style whose goal is to achieve loose coupling among interacting software agents
• A service is a unit of work done by a service provider to achieve desired end results for a service consumer
• Both provider and consumer are roles played by software agents on behalf of their owners

Distributed operating systems

The computers in most distributed systems are loosely coupled. Thus, a distributed system inherently has multiple system images. This is mainly due to the fact that all node machines run with an independent operating system. To promote resource sharing and fast communication among node machines, it is best to have a distributed OS that manages all resources coherently and efficiently. Such a system is most likely to be a closed system, and it will likely rely on message passing and RPCs for internode communications.

MOSIX2 for Linux Clusters

MOSIX2 runs with a virtualization layer in the Linux environment. This layer provides a partial single-system image to user applications. It supports both sequential and parallel applications, discovers resources, and migrates software processes among Linux nodes. It can manage a Linux cluster or a grid of multiple clusters.

PARALLEL AND DISTRIBUTED PROGRAMMING MODELS

Message-Passing Interface (MPI): This is the primary programming standard used to develop parallel and concurrent programs to run on a distributed system. MPI is essentially a library of subprograms that can be called from C or FORTRAN to write parallel programs running on a distributed system. The idea is to embody clusters, grid systems, and P2P systems with upgraded web services and utility computing applications.
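As a minimal message-passing sketch (not part of the text), the example below uses the mpi4py bindings, which wrap the MPI standard described above; it assumes mpi4py and an MPI runtime are installed and would be launched with something like `mpirun -n 4 python mpi_sum.py`.

```python
# Minimal MPI example with mpi4py: every process contributes its rank and
# rank 0 collects the total via message passing (a reduce operation).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # this process's ID
size = comm.Get_size()          # total number of processes

local_value = rank              # each process's private data
total = comm.reduce(local_value, op=MPI.SUM, root=0)

if rank == 0:
    print(f"Sum of ranks over {size} processes:", total)
```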

MapReduce: This is a web programming model developed by Google for scalable data processing on large clusters over large data sets. The model is applied mainly in web-scale search and cloud computing applications. The user specifies a Map function to divide the input into sub-problems and a Reduce function to merge all intermediate values with the same intermediate key. The model is highly scalable, exploiting high degrees of parallelism at different job levels. A typical MapReduce computation process can handle terabytes of data on tens of thousands or more client machines. Thousands of MapReduce jobs are executed on Google's clusters every day.
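To make the Map/Reduce idea concrete, here is a tiny single-machine word-count sketch (not Google's implementation): the map step emits intermediate (key, value) pairs and the reduce step merges all values that share the same intermediate key.

```python
# Toy MapReduce-style word count: map emits (word, 1) pairs, the pairs are
# grouped by key (the "shuffle"), and reduce sums the values for each key.
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    return (key, sum(values))

documents = ["the cloud scales", "the cluster scales"]

# Shuffle: group intermediate values by their intermediate key.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        groups[key].append(value)

results = [reduce_phase(k, v) for k, v in groups.items()]
print(results)   # e.g. [('the', 2), ('cloud', 1), ('scales', 2), ('cluster', 1)]
```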

Hadoop Library: Hadoop offers a software platform that was originally developed by a Yahoo! group. The package enables users to write and run applications over vast amounts of distributed data. Users can easily scale Hadoop to store and process petabytes of data in the web space.

Economical: Comes with an open source version of MapReduce that minimizes overhead in task spawning and massive data communication.
Efficient: Processes data with a high degree of parallelism across a large number of commodity nodes.
Reliable: Automatically keeps multiple data copies to facilitate redeployment of computing tasks upon unexpected system failures.

Amdahl's Law

Let us suppose a uniprocessor workstation executes a given program in time T minutes.

Now the same program is partitioned for parallel execution on a cluster of many nodes.

We assume that a fraction α of the code must be executed sequentially. Therefore, (1 - α) of the code can be compiled for parallel execution by n processors.

The total execution time of the program is calculated by:

αT + (1 - α)T/n

where the first term is the sequential execution time on a single processor and the second term is the parallel execution time on n processing nodes.

Amdahl's Law: Speedup factor

The speedup factor of using the n-processor system over the use of a single processor is expressed by:

Speedup = S = T/[αT + (1 - α)T/n] = 1/[α + (1 - α)/n]

The maximum speedup of n is achieved only if the code is fully parallelizable, with α = 0.

As the cluster becomes sufficiently large, that is, n -> ∞, S approaches 1/α, an upper bound on the speedup S.

The sequential bottleneck is the portion of the code that cannot be parallelized.

If α = 0.25, then 1 - α = 0.75 and the maximum speedup is 1/α = 4.
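A short illustrative sketch of the speedup formula above, reproducing the α = 0.25 example:

```python
# Amdahl's law: speedup of a fixed workload on n processors when a
# fraction alpha of the code must run sequentially.
def amdahl_speedup(alpha, n):
    return 1.0 / (alpha + (1.0 - alpha) / n)

print(amdahl_speedup(0.25, 4))       # ~2.29 on 4 processors
print(amdahl_speedup(0.25, 1024))    # ~3.99, approaching the 1/alpha = 4 bound
```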

Gustafson's Law

Scaling the problem size to match the cluster capability (scaled-workload speedup).

Let W be the workload in a given program. When using an n-processor system, the user scales the workload to

W' = αW + (1 - α)nW

The parallel execution time of the scaled workload W' on n processors defines the scaled-workload speedup:

S' = W'/W = [αW + (1 - α)nW]/W = α + (1 - α)n

Thus efficiency is

E' = S'/n = α/n + (1 - α)

For α = 0.25 and n = 256, E' = 0.25/256 + 0.75 ≈ 75%.
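And the scaled-workload counterpart (illustrative only), checking the α = 0.25, n = 256 case:

```python
# Gustafson's law: scaled-workload speedup and efficiency on n processors.
def gustafson_speedup(alpha, n):
    return alpha + (1.0 - alpha) * n

def gustafson_efficiency(alpha, n):
    return gustafson_speedup(alpha, n) / n    # = alpha/n + (1 - alpha)

print(gustafson_speedup(0.25, 256))      # 192.25
print(gustafson_efficiency(0.25, 256))   # ~0.751, i.e. about 75%
```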
Availability

A system is highly available if it has a long mean time to failure (MTTF) and a short mean time to repair (MTTR).

System Availability = MTTF / (MTTF + MTTR)
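For example (the MTTF and MTTR numbers here are made up for illustration):

```python
# Availability from MTTF and MTTR. With MTTF = 1000 hours and MTTR = 2 hours,
# the system is available about 99.8% of the time.
def availability(mttf, mttr):
    return mttf / (mttf + mttr)

print(availability(1000, 2))   # ~0.998
```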



Failure may occur in a hardware, software, or network component. Any failure that will pull down the operation of the entire system is called a single point of failure. A reliable computing system must be designed with no single point of failure.

In general, as a distributed system increases in size, availability decreases due to a higher chance of failure and a difficulty in isolating the failures.

ENERGY EFFICIENCY IN DISTRIBUTED COMPUTING

The primary performance goals in conventional parallel and distributed computing systems are high performance and high throughput, considering some form of performance reliability (e.g., fault tolerance and security). However, these systems have recently encountered new challenging issues, including energy efficiency and workload and resource outsourcing.

Energy Consumption of Unused Servers: To run a server farm (data center), a company has to spend a huge amount of money for hardware, software, operational support, and energy every year. Therefore, companies should thoroughly identify whether their installed server farm (more specifically, the volume of provisioned resources) is at an appropriate level, particularly in terms of utilization.

Reducing Energy in Active Servers: In addition to identifying unused/underutilized servers for energy savings, it is also necessary to apply appropriate techniques to decrease energy consumption in active distributed systems with negligible influence on their performance.

Application Layer: Until now, most user applications in science, business, engineering, and financial areas have tended to increase a system's speed or quality. By introducing energy-aware applications, the challenge is to design sophisticated multilevel and multi-domain energy management applications without hurting performance.

Middleware Layer: The middleware layer acts as a bridge between the application layer and the resource layer. This layer provides resource broker, communication service, task analyzer, task scheduler, security access, reliability control, and information service capabilities. It is also responsible for applying energy-efficient techniques, particularly in task scheduling.

Resource Layer: The resource layer consists of a wide range of resources, including computing nodes and storage units. This layer generally interacts with hardware devices and the operating system; therefore, it is responsible for controlling all distributed resources in distributed computing systems. Dynamic power management (DPM) and dynamic voltage-frequency scaling (DVFS) are two popular methods incorporated into recent computer hardware systems. In DPM, hardware devices, such as the CPU, have the capability to switch from idle mode to one or more lower-power modes. In DVFS, energy savings are achieved based on the fact that the power consumption in CMOS circuits has a direct relationship with frequency and the square of the voltage supply.
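As a rough, illustrative calculation (not from the text) of why DVFS saves energy: dynamic power in CMOS scales with frequency times the square of the supply voltage, so lowering both cuts power super-linearly. The capacitance constant and voltage/frequency values below are invented for the example.

```python
# Illustrative DVFS estimate: dynamic CMOS power ~ C * V^2 * f (the activity
# factor is folded into the constant). Halving f and dropping V from 1.2 V to
# 0.9 V cuts dynamic power to roughly 28% of the original in this toy model.
def dynamic_power(c_eff, voltage, frequency):
    return c_eff * voltage ** 2 * frequency

p_high = dynamic_power(1.0, 1.2, 2.0e9)   # nominal operating point
p_low  = dynamic_power(1.0, 0.9, 1.0e9)   # scaled-down operating point
print(p_low / p_high)                      # ~0.28
```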

Network Layer: Routing and transferring packets and enabling network services to the resource layer are the main responsibilities of the network layer in distributed computing systems. The major challenge in building energy-efficient networks is, again, determining how to measure, predict, and create a balance between energy consumption and performance.

A computer cluster is a collection of interconnected stand-alone computers which can work together collectively and cooperatively as a single integrated computing resource pool. Clustering explores massive parallelism at the job level and achieves high availability (HA) through stand-alone operations.

The benefits of computer clusters and massively parallel processors (MPPs) include scalable performance, high availability, fault tolerance, modular growth, and use of commodity components.

Design Objectives of Computer Clusters

Clusters are classified using six orthogonal attributes: scalability, packaging, control, homogeneity, programmability, and security.

1. Scalability: Clustering of computers is based on the concept of modular growth. The scalability could be limited by a number of factors, such as the multicore chip technology, cluster topology, packaging method, power consumption, and the cooling scheme applied. The purpose is to achieve scalable performance constrained by the aforementioned factors.

2. Packaging: Cluster nodes can be packaged in a compact or a slack fashion. In a compact cluster, the nodes are closely packaged in one or more racks sitting in a room, and the nodes are not attached to peripherals (monitors, keyboards, mice, etc.). In a slack cluster, the nodes are attached to their usual peripherals and they may be located in different rooms, different buildings, or even remote regions. Packaging directly affects communication wire length, and thus the selection of the interconnection technology used. While a compact cluster can utilize a high-bandwidth, low-latency communication network that is often proprietary, nodes of a slack cluster are normally connected through standard LANs or WANs.
3. Control: A cluster can be either controlled or managed in a centralized or decentralized fashion. A compact cluster normally has centralized control, while a slack cluster can be controlled either way. In a centralized cluster, all the nodes are owned, controlled, managed, and administered by a central operator. In a decentralized cluster, the nodes have individual owners. This lack of a single point of control makes system administration of such a cluster very difficult. It also calls for special techniques for process scheduling, workload migration, checkpointing, accounting, and other similar tasks.

4. Homogeneity: A homogeneous cluster uses nodes from the same platform, that is, the same processor architecture and the same operating system; often, the nodes are from the same vendors. A heterogeneous cluster uses nodes of different platforms. Interoperability is an important issue in heterogeneous clusters. For instance, process migration is often needed for load balancing or availability. In a homogeneous cluster, a binary process image can migrate to another node and continue execution. This is not feasible in a heterogeneous cluster, as the binary code will not be executable when the process migrates to a node of a different platform.

5. Security: Intracluster communication can be either exposed or enclosed. In an exposed cluster, the communication paths among the nodes are exposed to the outside world. An outside machine can access the communication paths, and thus individual nodes, using standard protocols (e.g., TCP/IP). Such exposed clusters are easy to implement, but have several disadvantages:

• Being exposed, intracluster communication is not secure, unless the communication subsystem performs additional work to ensure privacy and security.
• Outside communications may disrupt intracluster communications in an unpredictable fashion.
• Standard communication protocols tend to have high overhead. A disadvantage of the enclosed approach, on the other hand, is that there is currently no standard for efficient, enclosed intracluster communication.

6. Dedicated versus Enterprise Clusters: A dedicated cluster is typically installed in a deskside rack in a central computer room. It is homogeneously configured with the same type of computer nodes and managed by a single administrator group like a frontend host. Dedicated clusters are used as substitutes for traditional mainframes or supercomputers. A dedicated cluster is installed, used, and administered as a single machine. An enterprise cluster is mainly used to utilize idle resources in the nodes. Each node is usually a full-fledged SMP, workstation, or PC, with all the necessary peripherals attached. The nodes are typically geographically distributed, and are not necessarily in the same room or even in the same building. The nodes are individually owned by multiple owners.

Fundamental Cluster Design Issues

1. Scalable Performance: Scaling of resources (cluster nodes, memory capacity, I/O bandwidth, etc.) leads to a proportional increase in performance. Both scale-up and scale-down capabilities are needed, depending on application demand or cost-effectiveness considerations. Clustering is driven by scalability.

2. Single-System Image (SSI): A set of workstations connected by an Ethernet network is not necessarily a cluster. A cluster is a single system.

3. Availability Support: Clusters can provide cost-effective HA capability with lots of redundancy in processors, memory, disks, I/O devices, networks, and operating system images.

4. Cluster Job Management: Clusters try to achieve high system utilization from traditional workstations or PC nodes that are normally not highly utilized. Job management software is required to provide batching, load balancing, parallel processing, and other functionality.

5. Internode Communication: The internode physical wire lengths are longer in a cluster than in an MPP. A long wire implies greater interconnect network latency. Longer wires also have more problems in terms of reliability, clock skew, and crosstalk. These problems call for reliable and secure communication protocols, which increase overhead. Clusters often use commodity networks (e.g., Ethernet) with standard protocols such as TCP/IP.

6. Fault Tolerance and Recovery: Clusters of machines can be designed to eliminate all single points of failure. Through redundancy, a cluster can tolerate faulty conditions up to a certain extent. Heartbeat mechanisms can be installed to monitor the running condition of all nodes. In case of a node failure, critical jobs running on the failing nodes can be saved by failing over to the surviving node machines. Rollback recovery schemes restore the computing results through periodic checkpointing.

7. Cluster Family Classification: Computer clusters are divided into three classes:

• Compute clusters: These are clusters designed mainly for collective computation over a single large job. Compute clusters do not handle many I/O operations, such as database services. When a single compute job requires frequent communication among the cluster nodes, the cluster must share a dedicated network, and thus the nodes are mostly homogeneous and tightly coupled. This type of cluster is also known as a Beowulf cluster.

• High-Availability clusters: HA (high-availability) clusters are designed to be fault-tolerant and achieve HA of services. HA clusters operate with many redundant nodes to sustain faults or failures.

• Load-balancing clusters: These clusters shoot for higher resource utilization through load balancing among all participating nodes in the cluster. All nodes share the workload or function as a single virtual machine (VM). Requests initiated from the user are distributed to all node computers to form a cluster. This results in a balanced workload among different machines, and thus higher resource utilization or higher performance. Middleware is needed to achieve dynamic load balancing by job or process migration among all the cluster nodes.

A Basic Cluster Architecture

The figure shows a simple cluster of computers built with commodity components and fully supported with the desired SSI features and HA capability. The processing nodes are commodity workstations, PCs, or servers. The node operating systems should be designed for multiuser, multitasking, and multithreaded applications. The nodes are interconnected by one or more fast commodity networks. These networks use standard communication protocols and operate at a speed that should be two orders of magnitude faster than that of the current TCP/IP speed over Ethernet. The network interface card is connected to the node's standard I/O bus (e.g., PCI). When the processor or the operating system is changed, only the driver software needs to change.

Cluster middleware combines all node platforms at the user space. An availability middleware offers HA services. An SSI layer provides a single entry point, a single file hierarchy, a single point of control, and a single job management system. In addition to running sequential user programs, the cluster supports parallel programming based on standard languages and communication libraries using PVM, MPI, or OpenMP. The programming environment also includes tools for debugging, profiling, monitoring, and so forth. A user interface subsystem is needed to combine the advantages of the web interface and the Windows GUI. It should also provide user-friendly links to various programming environments, job management tools, hypertext, and search support so that users can easily get help in programming the computer cluster.

Resource Sharing in Clusters

Clustering improves both availability and performance.

The nodes of a cluster can be connected in one of three ways, as shown in the figure.

The shared-nothing architecture in Part (a) is used in most clusters, where the nodes are connected through the I/O bus. This architecture simply connects two or more autonomous computers via a LAN such as Ethernet.

The shared-disk cluster shown in Part (b) is favored for small-scale availability clusters in business applications. When one node fails, the other node takes over. This is what most business clusters desire so that they can enable recovery support in case of node failure. The shared disk can hold checkpoint files or critical system images to enhance cluster availability. Without shared disks, checkpointing, rollback recovery, failover, and failback are not possible in a cluster.

The shared-memory cluster in Part (c) is much more difficult to realize. The nodes could be connected by a scalable coherence interface (SCI) ring, which is connected to the memory bus of each node through an NIC module. In the other two architectures, the interconnect is attached to the I/O bus. The memory bus operates at a higher frequency than the I/O bus.

DESIGN PRINCIPLES OF COMPUTER CLUSTERS

General-purpose computers and clusters of cooperative computers should be designed for scalability, availability, single system image, high availability, fault tolerance, and rollback recovery.

1. Single System Image: A single system image is the illusion, created by software or hardware, that presents a collection of resources as an integrated, powerful resource. SSI makes the cluster appear like a single machine to the user, applications, and network. A cluster with multiple system images is nothing but a collection of independent computers (distributed systems in general).

Single-System-Image Features

● Single System: The entire cluster is viewed by the users as one system, which has multiple processors.
● Single Control: Logically, an end user or system user utilizes services from one place with a single interface.
● Symmetry: A user can use a cluster service from any node. All cluster services and functionalities are symmetric to all nodes and all users, except those protected by access rights.
● Location Transparent: The user is not aware of the whereabouts of the physical device that eventually provides a service.

Basic SSI Services

A. Single Entry Point

telnet cluster.usc.edu
telnet node1.cluster.usc.edu

1. Four nodes of a cluster are used as host nodes to receive users' login requests.
2. To log into the cluster, a standard Unix command such as "telnet cluster.cs.hku.hk", using the symbolic name of the cluster system, is issued.
3. The symbolic name is translated by the DNS, which returns the IP address 159.226.41.150 of the least-loaded node, which happens to be node Host 1.
4. The user then logs in using this IP address.
5. The DNS periodically receives load information from the host nodes to make load-balancing translation decisions.
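A toy sketch of the load-balanced name translation described in the steps above; the hostnames, additional IP addresses, and load values are invented for illustration.

```python
# Toy version of DNS-based entry-point selection: return the IP address of the
# least-loaded host node. A real cluster DNS would refresh these load reports
# periodically from the host nodes.
host_load = {
    "159.226.41.150": 0.20,   # Host 1
    "159.226.41.151": 0.65,
    "159.226.41.152": 0.90,
    "159.226.41.153": 0.45,
}

def resolve_cluster_name(load_table):
    # Pick the host with the smallest reported load.
    return min(load_table, key=load_table.get)

print(resolve_cluster_name(host_load))   # -> "159.226.41.150"
```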
B. Single File Hierarchy: xFS, AFS, Solaris MC Proxy

The illusion of a single, huge file system image that transparently integrates local and global disks and other file devices (e.g., tapes). Files can reside in three types of locations in a cluster:

Local storage - disk on the local node.
Remote storage - disks on remote nodes.
Stable storage -
Persistent: data, once written to the stable storage, will stay there at least for a period of time (e.g., a week), even after the cluster shuts down.
Fault tolerant: to some degree, by using redundancy and periodic backup to tapes.

Three types of storage in a single file hierarchy. Solid lines show what process P can access and the dashed line shows what P may be able to access.

C. Single I/O, Networking, and Memory Space: To achieve SSI, we need:

• a single control point
• a single address space
• a single job management system
• a single user interface
• single process control

Single Networking: A properly designed cluster should behave as one system. Any process on any node can use any network and I/O device as though it were attached to the local node. Single networking means any node can access any network connection.

Single Point of Control: The system administrator should be able to configure, monitor, test, and control the entire cluster and each individual node from a single point. Many clusters help with this through a system console that is connected to all nodes of the cluster.

Single Memory Space: Single memory space gives users the illusion of a big, centralized main memory, which in reality may be a set of distributed local memory spaces.

Single I/O Address Space: A single I/O space implies that any node can access the RAIDs.

A cluster with single networking, single I/O space, single memory, and single point of control.

Other Services

Single Job Management: All cluster jobs can be submitted from any node to a single job management system (GLUnix, Codine, LSF, etc.).

Single User Interface: The users use the cluster through a single graphical interface. Such an interface is available for workstations and PCs, like CDE in Solaris/NT.

Single Process Space: All user processes created on various nodes form a single process space and share a uniform process identification scheme. A process on any node can create (e.g., through a UNIX fork) or communicate with (e.g., through signals, pipes, etc.) processes on remote nodes.

Middleware support for SSI clustering: SSI features are supported by middleware developed at three cluster application levels:

• Management level: This level handles user applications and provides a job management system such as GLUnix, MOSIX, Load Sharing Facility (LSF), or Codine.

• Programming level: This level provides a single file hierarchy (NFS, xFS, AFS, Proxy) and distributed shared memory (TreadMarks, Wind Tunnel).

• Implementation level: This level supports a single process space, checkpointing, process migration, and a single I/O space. These features must interface with the cluster hardware and OS platform.

Relationship among clustering middleware at the job management, programming, and implementation levels.

2. High Availability through Redundancy:

• Reliability measures how long a system can operate without a breakdown.
• Availability indicates the percentage of time that a system is available to the user, that is, the percentage of system uptime.
• Serviceability refers to how easy it is to service the system, including hardware and software maintenance, repair, upgrades, and so on.

A system's reliability is measured by the mean time to failure (MTTF), which is the average time of normal operation before the system (or a component of the system) fails. The metric for serviceability is the mean time to repair (MTTR), which is the average time it takes to repair the system and restore it to working condition after it fails.

The availability of a system is defined by:

Availability = MTTF / (MTTF + MTTR)

Failure is any event that prevents the system from normal operation.

• Unplanned failures: The system breaks, due to an operating system crash, a hardware failure, a network disconnection, human operation errors, a power outage, and so on. All these are simply called failures. The system must be repaired to correct the failure.
• Planned shutdowns: The system is not broken, but is periodically taken off normal operation for upgrades, reconfiguration, and maintenance.

Transient versus Permanent Failures

A lot of failures are transient in that they occur temporarily and then disappear. They can be dealt with without replacing any components. A standard approach is to roll back the system to a known state and start over. Permanent failures cannot be corrected by rebooting. Some hardware or software component must be repaired or replaced. For instance, rebooting will not work if the system hard disk is broken.

Partial versus Total Failures

A failure that renders the entire system unusable is called a total failure. A failure that only affects part of the system is called a partial failure if the system is still usable, even at a reduced capacity.

Redundancy Techniques

Isolated Redundancy: A key technique to improve availability in any system is to use redundant components. When a component (the primary component) fails, the service it provided is taken over by another component (the backup component). Furthermore, the primary and the backup components should be isolated from each other, meaning they should not be subject to the same cause of failure. Clusters provide HA with redundancy in power supplies, fans, processors, memories, disks, I/O devices, networks, operating system images, and so on. In a carefully designed cluster, redundancy is also isolated.

N-Version Programming to Enhance Software Reliability

A common isolated-redundancy approach to constructing a mission-critical software system is called N-version programming. The software is implemented by N isolated teams who may not even know the others exist. Different teams are asked to implement the software using different algorithms, programming languages, environment tools, and even platforms. In a fault-tolerant system, the N versions all run simultaneously and their results are constantly compared. If the results differ, the system is notified that a fault has occurred.
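A minimal sketch of the N-version idea (illustrative only): run independently written versions of the same specification on the same input and flag a fault when their results disagree.

```python
# N-version programming sketch: three independently written implementations of
# "sum of integers 1..x" run side by side and their outputs are compared.
def version_a(x):
    return sum(range(1, x + 1))

def version_b(x):
    return x * (x + 1) // 2

def version_c(x):
    total = 0
    for i in range(1, x + 1):
        total += i
    return total

def n_version_run(x, versions=(version_a, version_b, version_c)):
    results = [v(x) for v in versions]
    if len(set(results)) != 1:
        raise RuntimeError(f"Fault detected: versions disagree: {results}")
    return results[0]

print(n_version_run(100))   # 5050 when all versions agree
```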

3. Fault-Tolerant Cluster Configurations: The cluster solution was targeted to provide availability support for two server nodes with three ascending levels of availability: hot standby, active takeover, and fault-tolerant. The level of availability increases from standby to active and fault-tolerant cluster configurations. The shorter the recovery time, the higher the cluster availability. Failback refers to the ability of a recovered node to return to normal operation after repair or maintenance. Activeness refers to whether the node is used in active work during normal operation.

• Hot standby server clusters: In a hot standby cluster, only the primary node is actively doing all the useful work normally. The standby node is powered on (hot) and running some monitoring programs to communicate heartbeat signals to check the status of the primary node, but is not actively running other useful workloads. The primary node must mirror any data to shared disk storage, which is accessible by the standby node. The standby node requires a second copy of data.

• Active-takeover clusters: In this case, the architecture is symmetric among multiple server nodes. Both servers are primary, doing useful work normally. Both failover and failback are often supported on both server nodes. When a node fails, the user applications fail over to the available node in the cluster. Depending on the time required to implement the failover, users may experience some delays or may lose some data that was not saved in the last checkpoint.

• Failover cluster: When a component fails, this technique allows the remaining system to take over the services originally provided by the failed component. A failover mechanism must provide several functions, such as failure diagnosis, failure notification, and failure recovery. Failure diagnosis refers to the detection of a failure and the location of the failed component that caused the failure. A commonly used technique is heartbeat, whereby the cluster nodes send out a stream of heartbeat messages to one another. If the system does not receive the stream of heartbeat messages from a node, it can conclude that either the node or the network connection has failed.
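A simplified heartbeat-based failure-detection sketch (the timeout, node names, and timings are invented); a real cluster monitor would run this check periodically and trigger failover.

```python
# Simplified heartbeat failure detection: a node is declared failed if no
# heartbeat has been received from it within the timeout window.
import time

HEARTBEAT_TIMEOUT = 5.0          # seconds without a heartbeat before failover
last_heartbeat = {"node1": time.time(), "node2": time.time() - 10.0}

def record_heartbeat(node):
    last_heartbeat[node] = time.time()

def detect_failures():
    now = time.time()
    return [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

record_heartbeat("node1")        # node1 keeps sending heartbeats
for node in detect_failures():   # node2 has missed its heartbeats
    print(f"{node} missed heartbeats; failing over its workload")
```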

Recovery Schemes

Failure recovery refers to the actions needed to take over the workload of a failed component. There are two types of recovery techniques. In backward recovery, the processes running on a cluster periodically save a consistent state (called a checkpoint) to stable storage. After a failure, the system is reconfigured to isolate the failed component, restores the previous checkpoint, and resumes normal operation. This is called rollback. Backward recovery is relatively easy to implement in an application-independent, portable fashion.

If execution time is crucial, such as in real-time systems where the rollback time cannot be tolerated, a forward recovery scheme should be used. With such a scheme, the system is not rolled back to the previous checkpoint upon a failure. Instead, the system utilizes the failure diagnosis information to reconstruct a valid system state and continues execution. Forward recovery is application-dependent and may need extra hardware.

Checkpointing and Recovery Techniques

Checkpointing is the process of periodically saving the state of an executing program to stable storage, from which the system can recover after a failure. Each program state saved is called a checkpoint. The disk file that contains the saved state is called the checkpoint file. Checkpointing techniques are useful not only for availability, but also for program debugging, process migration, and load balancing.

Checkpointing can be realized by the operating system at the kernel level, where the OS transparently checkpoints and restarts processes. A less transparent approach links the user code with a checkpointing library in the user space. Checkpointing and restarting are handled by this runtime support. This approach is used widely because it has the advantage that user applications do not have to be modified.

A third approach requires the user (or the compiler) to insert checkpointing functions in the application; thus, the application has to be modified, and the transparency is lost. However, it has the advantage that the user can specify where to checkpoint. This is helpful to reduce checkpointing overhead. Checkpointing incurs both time and storage overheads.

Checkpoint Overheads

During a program's execution, its state may be saved many times. The time overhead is the time consumed to save one checkpoint. The storage overhead is the extra memory and disk space required for checkpointing. Both time and storage overheads depend on the size of the checkpoint file.

Choosing an Optimal Checkpoint Interval

The time period between two checkpoints is called the checkpoint interval. Making the interval larger can reduce checkpoint time overhead. Wong and Franklin derived an expression for the optimal checkpoint interval:

Optimal checkpoint interval = Square root of (MTTF x t_c) / h

Here MTTF is the system's mean time to failure, t_c is the time consumed to save one checkpoint, and h is the average percentage of normal computation performed in a checkpoint interval before the system fails. The parameter h always lies between 0 and 1. After a system is restored, it needs to spend h × (checkpoint interval) time to recompute.
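Plugging illustrative numbers into the expression quoted above (the MTTF, checkpoint time, and h values are made up):

```python
# Optimal checkpoint interval per the Wong-Franklin expression quoted above:
# sqrt(MTTF * t_c) / h, with MTTF and the checkpoint time t_c in the same units.
import math

def optimal_checkpoint_interval(mttf, t_c, h):
    return math.sqrt(mttf * t_c) / h

# e.g. MTTF = 10 hours = 36000 s, t_c = 10 s per checkpoint, h = 0.5
print(optimal_checkpoint_interval(36000, 10, 0.5))   # ~1200 seconds
```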

Incremental Checkpoint

Instead of saving the full state at each checkpoint, an incremental checkpoint scheme saves only the portion of the state that has changed since the previous checkpoint. In full-state checkpointing, only one checkpoint file needs to be kept on disk; subsequent checkpoints simply overwrite this file. With incremental checkpointing, old files need to be kept, because a state may span many files. Thus, the total storage requirement is larger.

Forked Checkpointing

Most checkpoint schemes are blocking in that the normal computation is stopped while checkpointing is in progress. With enough memory, checkpoint overhead can be reduced by making a copy of the program state in memory and invoking another asynchronous thread to perform the checkpointing concurrently. A simple way to overlap checkpointing with computation is to use the UNIX fork() system call. The forked child process duplicates the parent process's address space and checkpoints it. Meanwhile, the parent process continues execution. Overlapping is achieved since checkpointing is disk-I/O intensive.
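A bare-bones sketch of forked checkpointing on a UNIX system; the state dictionary and file path are illustrative, and a real checkpointer would capture the full process image rather than a small object.

```python
# Forked checkpointing sketch: the child process writes the checkpoint while
# the parent continues computing. Requires a platform that supports os.fork().
import os, pickle

def checkpoint(state, path="checkpoint.pkl"):
    pid = os.fork()
    if pid == 0:                      # child: inherits a copy of the state
        with open(path, "wb") as f:
            pickle.dump(state, f)     # disk-I/O-heavy work happens here
        os._exit(0)                   # child exits without running cleanup
    return pid                        # parent: returns immediately and computes on

state = {"iteration": 42, "partial_sum": 3.14}
child = checkpoint(state)
# ... parent keeps computing here, overlapping with the child's disk I/O ...
os.waitpid(child, 0)                  # eventually reap the finished child
```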

User-Directed Checkpointing

The checkpoint overheads can sometimes be substantially reduced if the user inserts code (e.g., library or system calls) to tell the system when to save, what to save, and what not to save. What should be the exact contents of a checkpoint? It should contain just enough information to allow a system to recover. The state of a process includes its data state and control state.

Checkpointing Parallel Programs: The state of a parallel program is usually much larger than that of a sequential program, as it consists of the set of the states of individual processes, plus the state of the communication network. Parallelism also introduces various timing and consistency problems.

Consistent Snapshot: A global snapshot is called consistent if there is no message that is received by the checkpoint of one process but not yet sent by another process. Graphically, this corresponds to the case that no arrow crosses a snapshot line from right to left.

Coordinated versus Independent Checkpointing

Checkpointing schemes for parallel programs can be classified into two types. In coordinated checkpointing (also called consistent checkpointing), the parallel program is frozen, and all processes are checkpointed at the same time. In independent checkpointing, the processes are checkpointed independently of one another.

Cluster Job Scheduling and Management

A Job Management System (JMS) should have three parts:

● A user server lets the user submit jobs to one or more queues, specify resource requirements for each job, delete a job from a queue, and inquire about the status of a job or a queue.
● A job scheduler performs job scheduling and queuing according to job types, resource requirements, resource availability, and scheduling policies.
● A resource manager allocates and monitors resources, enforces scheduling policies, and collects accounting information.

JMS Administration

● The JMS should be able to dynamically reconfigure the cluster with minimal impact on the running jobs.
● The administrator's prologue and epilogue scripts should be able to run before and after each job for security checking, accounting, and cleanup.
● Users should be able to cleanly kill their own jobs.
● The administrator or the JMS should be able to cleanly suspend or kill any job.
➢ Clean means that when a job is suspended or killed, all its processes must be included.
➢ Otherwise, some "orphan" processes are left in the system, wasting cluster resources and possibly eventually rendering the system unusable.

Several types of jobs execute on a cluster.

● Serial jobs run on a single node.
● Parallel jobs use multiple nodes.
● Interactive jobs are those that require fast turnaround time, and their input/output is directed to a terminal.
➢ These jobs do not need large resources, and the users expect them to execute immediately, not to wait in a queue.
● Batch jobs normally need more resources, such as large memory space and long CPU time.
➢ But they do not need immediate response.
➢ They are submitted to a job queue to be scheduled to run when the resource becomes available (e.g., during off hours).
Multi-Job Scheduling Schemes

● Cluster jobs may be scheduled to run at a specific time (calendar scheduling) or when a particular event happens (event scheduling).
● Jobs are scheduled according to priorities based on submission time, resource nodes, execution time, memory, disk, job type, and user identity.
● With static priority, jobs are assigned priorities according to a predetermined, fixed scheme.
➢ A simple scheme is to schedule jobs in a first-come, first-served fashion.
➢ Another scheme is to assign different priorities to users.
● With dynamic priority, the priority of a job may change over time.
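A tiny static-priority scheduler sketch (job names and priority values are invented), illustrating the priority-queue idea behind the schemes listed above:

```python
# Static-priority job queue: lower number = higher priority; ties are broken
# by submission order (first-come, first-served within a priority level).
import heapq

queue = []
for order, (name, priority) in enumerate([("batch-sim", 5),
                                          ("interactive-shell", 1),
                                          ("parallel-render", 3)]):
    heapq.heappush(queue, (priority, order, name))

while queue:
    priority, order, name = heapq.heappop(queue)
    print(f"dispatching {name} (priority {priority})")
# dispatches interactive-shell, then parallel-render, then batch-sim
```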

Job Scheduling Issues and Schemes for Cluster Nodes

Scheduling Modes

Dedicated Mode:
● Only one job runs in the cluster at a time, and at most one process of the job is assigned to a node at a time.
● The single job runs until completion before it releases the cluster to run other jobs.

Space Sharing:
● Multiple jobs can run on disjoint partitions (groups) of nodes simultaneously.
● At most one process is assigned to a node at a time.
● Although a partition of nodes is dedicated to a job, the interconnect and the I/O subsystem may be shared by all jobs.

Time Sharing:
● Multiple user processes are assigned to the same node.

Time-sharing introduces the following parallel scheduling policies:

● Independent Scheduling (Independent): Uses the operating system of each cluster node to schedule different processes as in a traditional workstation.
● Gang Scheduling: Schedules all processes of a parallel job together. When one process is active, all processes are active.
● Competition with Foreign (Local) Jobs: Scheduling becomes more complicated when both cluster jobs and local jobs are running. The local jobs should have priority over cluster jobs.

Migration Scheme Issues

1. Node Availability: Can the job find another available node to
migrate to?
➢ Berkeley study: Even during peak hours, 60% of workstations in
a cluster are available.

2. Migration Overhead: What is the effect of the migration overhead?
The migration time can significantly slow down a parallel job.
➢ Berkeley study: a slowdown as great as 2.4 times.
➢ The slowdown is less if a parallel job is run on a cluster of twice the
size.
➢ e.g., a 32-node job on a 60-node cluster sees a migration slowdown of no
more than 20%, even with a migration time of 3 minutes.

3. Recruitment Threshold: the amount of time a workstation stays
unused before the cluster considers it an idle node. What should be
the recruitment threshold? (A small sketch of these checks follows.)
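The two checks below are a hedged, illustrative reading of the migration issues above: a workstation is recruited only after it has been idle longer than a chosen threshold, and migration is attempted only when its fixed cost stays under a chosen fraction of the job's remaining runtime. The default numbers merely echo the figures quoted above; they are not prescribed values.

def should_recruit(idle_minutes, recruitment_threshold=10):
    # A workstation joins the pool of candidate migration targets only
    # after it has been unused longer than the recruitment threshold.
    return idle_minutes >= recruitment_threshold

def migration_worthwhile(remaining_runtime_min, migration_time_min=3.0,
                         overhead_fraction=0.2):
    # Migrate only if the fixed migration time stays under a chosen fraction
    # of the job's remaining runtime (the 20% echoes the figures above).
    return migration_time_min <= overhead_fraction * remaining_runtime_min

print(should_recruit(idle_minutes=12))                  # True
print(migration_worthwhile(remaining_runtime_min=30))   # True: 3 <= 0.2 * 30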

Virtualization is a computer architecture technology by which multiple
virtual machines (VMs) are multiplexed in the same hardware machine. The
purpose of a VM is to enhance resource sharing by many users and improve
computer performance in terms of resource utilization and application
flexibility. Hardware resources (CPU, memory, I/O devices, etc.) or software
resources (operating system and software libraries) can be virtualized in
various functional layers.

A traditional computer runs with a host operating system specially tailored
for its hardware architecture, as shown in Figure (a). After virtualization,
different user applications managed by their own operating systems (guest
OS) can run on the same hardware, independent of the host OS. This is
often done by adding additional software called a virtualization layer, as
shown in Figure (b). This virtualization layer is known as the hypervisor or
virtual machine monitor (VMM). The VMs are in the upper boxes, where
applications run with their own guest OS over the virtualized CPU, memory,
and I/O resources. The main function of the software layer for virtualization
is to virtualize the physical hardware of a host machine into virtual
resources to be used by the VMs.

Levels of Virtualization Implementation

The virtualization software creates the abstraction of VMs by interposing a
virtualization layer at various levels of a computer system.

Common virtualization layers include:

• Instruction set architecture (ISA) level
• Hardware level
• Operating system level
• Library support level
• Application level

Instruction Set Architecture Level: Virtualization is performed by
emulating a given ISA by the ISA of the host machine.

e.g., MIPS binary code can run on an x86-based host machine with
the help of ISA emulation. Typical systems: Bochs, Crusoe, QEMU, BIRD,
Dynamo.

Advantage:
• It can run a large amount of legacy binary code written for
various processors on any given new hardware host machine.
• Best application flexibility.
Limitation:
• One source instruction may require tens or hundreds of native
target instructions to perform its function, which is relatively
slow (see the emulation sketch below).
• V-ISA requires adding a processor-specific software translation
layer in the compiler.
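The toy interpreter below suggests why ISA-level emulation is comparatively slow: every guest instruction must be fetched, decoded, and carried out by several host-level steps. The three-instruction mini ISA is invented purely for illustration and does not correspond to MIPS or any real ISA.

def emulate(program, regs):
    """Toy sketch of ISA-level emulation: each guest instruction (a tuple)
    is decoded and executed by several host-level Python operations."""
    pc = 0
    while pc < len(program):
        op, *args = program[pc]          # fetch + decode
        if op == "ADD":                  # ADD rd, rs1, rs2
            rd, rs1, rs2 = args
            regs[rd] = (regs[rs1] + regs[rs2]) & 0xFFFFFFFF
        elif op == "LI":                 # LI rd, imm  (load immediate)
            rd, imm = args
            regs[rd] = imm & 0xFFFFFFFF
        elif op == "HALT":
            break
        pc += 1                          # advance the guest program counter
    return regs

regs = {f"r{i}": 0 for i in range(4)}
prog = [("LI", "r1", 5), ("LI", "r2", 7), ("ADD", "r0", "r1", "r2"), ("HALT",)]
print(emulate(prog, regs)["r0"])         # 12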
Virtualization at Hardware Abstraction Level: Virtualization is
performed right on top of the hardware.

• It generates virtual hardware environments for VMs and manages the
underlying hardware through virtualization.
• Typical systems: VMware, Virtual PC, Denali, Xen

Advantage:
• Has higher performance and good application isolation.

Limitation:
• Very expensive to implement (complexity).

Virtualization at Operating System (OS) Level: It is an abstraction layer
between the traditional OS and user applications.
• This virtualization creates isolated containers on a single physical
server and the OS instances to utilize the hardware and software in
data centers.
• Typical systems: Jail / Virtual Environment / Ensim's VPS / FVM
Advantage:
• Has minimal startup/shutdown cost, low resource requirements, and
high scalability; synchronizes VM and host state changes.
Limitation:
• All VMs at the operating system level must have the same kind of
guest OS.
• Poor application flexibility and isolation.

Library Support Level: It creates execution environments for running alien
programs on a platform, rather than creating a VM to run the entire operating
system.
• It is done by API call interception and remapping.
• Typical systems: Wine, WAB, LxRun, Visual MainWin
Advantage:
• Very low implementation effort.

Limitation:
• Poor application flexibility and isolation.

User-Application Level: It virtualizes an application as a virtual machine.
• This layer sits as an application program on top of an operating
system and exports an abstraction of a VM that can run programs
written and compiled to a particular abstract machine definition.
• Typical systems: JVM, .NET CLI, Panot
Advantage:
• Has the best application isolation.

Limitation:
• Low performance, low application flexibility, and high implementation
complexity.

OS-Level Virtualization

Operating system virtualization inserts a virtualization layer inside an
operating system to partition a machine's physical resources. It enables
multiple isolated VMs within a single operating system kernel. This kind of
VM is often called a virtual execution environment (VE), Virtual Private
System (VPS), or simply a container. From the user's point of view, VEs look
like real servers. This means a VE has its own set of processes, file system,
user accounts, network interfaces with IP addresses, routing tables, firewall
rules, and other personal settings. Although VEs can be customized for
different people, they share the same operating system kernel. Therefore,
OS-level virtualization is also called single-OS-image virtualization.

Operating system virtualization from the point of view of a machine stack

Advantages of OS Extension for Virtualization

1. VMs at the OS level have minimal startup/shutdown costs.
2. An OS-level VM can easily synchronize with its environment.

Disadvantage of OS Extension for Virtualization

All VMs in the same OS container must have the same or similar guest
OS, which restricts the application flexibility of different VMs on the same
physical machine.

Hypervisor and Xen Architecture

The hypervisor supports hardware-level virtualization (see Figure 3.1(b)) on
bare-metal devices like CPU, memory, disk, and network interfaces. The
hypervisor software sits directly between the physical hardware and its OS.
This virtualization layer is referred to as either the VMM or the hypervisor.
The hypervisor provides hypercalls for the guest OSes and applications.

Depending on the functionality, a hypervisor can assume a micro-kernel
architecture like Microsoft Hyper-V, or it can assume a monolithic
hypervisor architecture like VMware ESX for server virtualization.

The Xen Architecture

Xen is an open source hypervisor program developed by Cambridge
University. Xen is a microkernel hypervisor, which separates the policy from
the mechanism. The Xen hypervisor implements all the mechanisms, leaving
the policy to be handled by Domain 0. Xen does not include any device
drivers natively. It just provides a mechanism by which a guest OS can have
direct access to the physical devices. Xen provides a virtual environment
located between the hardware and the OS.

The Xen architecture's special Domain 0 for control and I/O, and several
guest domains for user applications

The core components of a Xen system are the hypervisor, kernel, and
applications. The guest OS which has control ability is called Domain 0, and
the others are called Domain U. Domain 0 is a privileged guest OS of Xen. It
is first loaded when Xen boots, without any file system drivers being
available. Domain 0 is designed to access hardware directly and manage
devices. Therefore, one of the responsibilities of Domain 0 is to allocate and
map hardware resources for the guest domains (the Domain U domains).

Binary Translation with Full Virtualization

Depending on implementation technologies, hardware virtualization can be
classified into two categories: full virtualization and host-based
virtualization.

Full virtualization does not need to modify the host OS. It relies on binary
translation to trap and to virtualize the execution of certain sensitive, non-
virtualizable instructions. The guest OSes and their applications consist of
noncritical and critical instructions. In a host-based system, both a host OS
and a guest OS are used. A virtualization software layer is built between the
host OS and the guest OS.

Full Virtualization

With full virtualization, noncritical instructions run on the hardware directly,
while critical instructions are discovered and replaced with traps into the VMM
to be emulated by software. Both the hypervisor and VMM approaches are
considered full virtualization. Why are only critical instructions trapped into
the VMM? Because binary translation can incur a large performance
overhead. Noncritical instructions do not control hardware or threaten the
security of the system, but critical instructions do. Therefore, running
noncritical instructions on hardware not only promotes efficiency, but
also ensures system security.

Binary Translation of Guest OS Requests Using a VMM

This approach was implemented by VMware and many other software
companies. VMware puts the VMM at Ring 0 and the guest OS at Ring 1.
The VMM scans the instruction stream and identifies the privileged, control-
sensitive, and behavior-sensitive instructions. When these instructions are
identified, they are trapped into the VMM, which emulates the behavior of
these instructions. The method used in this emulation is called binary
translation. Therefore, full virtualization combines binary translation and
direct execution. The guest OS is completely decoupled from the underlying
hardware. Consequently, the guest OS is unaware that it is being virtualized.
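A rough sketch of this scan-and-trap idea follows, under the simplifying assumption that instructions arrive as text and that the set of critical opcodes is known in advance; vmm_emulate and execute_direct are hypothetical stand-ins for the VMM's emulation routine and direct native execution.

CRITICAL = {"CLI", "STI", "LGDT", "OUT"}   # illustrative "sensitive" opcodes

def run_guest_stream(instructions, vmm_emulate, execute_direct):
    """Sketch of full virtualization: scan the instruction stream, run
    noncritical instructions directly on the hardware, and trap critical
    ones into the VMM to be emulated (binary translation replaces them
    with safe instruction sequences)."""
    for instr in instructions:
        opcode = instr.split()[0]
        if opcode in CRITICAL:
            vmm_emulate(instr)           # trap into the VMM
        else:
            execute_direct(instr)        # native, full-speed execution

run_guest_stream(
    ["MOV eax, 1", "ADD eax, 2", "CLI", "MOV ebx, eax"],
    vmm_emulate=lambda i: print("VMM emulates:", i),
    execute_direct=lambda i: print("direct:", i),
)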

Host-Based Virtualization

An alternative VM architecture is to install a virtualization layer on top of
the host OS. This host OS is still responsible for managing the hardware.
The guest OSes are installed and run on top of the virtualization layer.
Dedicated applications may run on the VMs. Certainly, some other
applications can also run with the host OS directly.

This host-based architecture has some distinct advantages. First, the user
can install this VM architecture without modifying the host OS. The
virtualizing software can rely on the host OS to provide device drivers and
other low-level services. This simplifies the VM design and eases its
deployment. Second, the host-based approach appeals to many host
machine configurations.

Compared to the hypervisor/VMM architecture, the performance of the host-
based architecture may also be low. When an application requests hardware
access, it involves four layers of mapping, which downgrades performance
significantly. When the ISA of a guest OS is different from the ISA of the
underlying hardware, binary translation must be adopted. Although the
host-based architecture has flexibility, the performance is too low to be
useful in practice.
Para-Virtualization with Compiler Support

Para-virtualization needs to modify the guest operating systems. A para-
virtualized VM provides special APIs requiring substantial OS modifications
in user applications. Para-virtualization attempts to reduce the virtualization
overhead, and thus improve performance, by modifying only the guest OS
kernel. The guest operating systems are para-virtualized. They are assisted
by an intelligent compiler to replace the nonvirtualizable OS instructions
with hypercalls. The lower the ring number, the higher the privilege of the
instruction being executed. The OS is responsible for managing the hardware
and the privileged instructions that execute at Ring 0, while user-level
applications run at Ring 3.
The best example of para-virtualization is KVM.

Para-virtualized VM architecture

VIRTUALIZATION OF CPU, MEMORY, AND I/O DEVICES

To support virtualization, processors such as the x86 employ a special
running mode and instructions, known as hardware-assisted virtualization.
In this way, the VMM and guest OS run in different modes, and all sensitive
instructions of the guest OS and its applications are trapped in the VMM. To
save processor states, mode switching is completed by hardware.

Modern operating systems and processors permit multiple processes to run
simultaneously. If there were no protection mechanism in a processor, all
instructions from different processes would access the hardware directly and
cause a system crash. Therefore, all processors have at least two modes,
user mode and supervisor mode, to ensure controlled access to critical
hardware. Instructions running in supervisor mode are called privileged
instructions; other instructions are unprivileged instructions. In a
virtualized environment, it is more difficult to make OSes and applications
run correctly because there are more layers in the machine stack. VMware
Workstation is a VM software suite for x86 and x86-64 computers. This
software suite allows users to set up multiple x86 and x86-64 virtual
computers and to use one or more of these VMs simultaneously with the
host operating system. VMware Workstation assumes host-based
virtualization. Xen is a hypervisor for use on IA-32, x86-64, Itanium, and
PowerPC 970 hosts.

CPU Virtualization

A VM is a duplicate of an existing computer system in which a majority of
the VM instructions are executed on the host processor in native mode.
Thus, unprivileged instructions of VMs run directly on the host machine for
higher efficiency. Other critical instructions should be handled carefully for
correctness and stability. The critical instructions are divided into three
categories: privileged instructions, control-sensitive instructions, and
behavior-sensitive instructions. Privileged instructions execute in a
privileged mode and will be trapped if executed outside this mode. Control-
sensitive instructions attempt to change the configuration of the resources
used. Behavior-sensitive instructions have different behaviors depending on
the configuration of resources, including the load and store operations over
the virtual memory. A CPU architecture is virtualizable if it supports the
ability to run the VM's privileged and unprivileged instructions in the CPU's
user mode while the VMM runs in supervisor mode.

Memory Virtualization

Virtual memory virtualization is similar to the virtual memory support
provided by modern operating systems. In a traditional execution
environment, the operating system maintains mappings of virtual memory to
machine memory using page tables, which is a one-stage mapping from
virtual memory to machine memory. All modern x86 CPUs include a memory
management unit (MMU) and a translation lookaside buffer (TLB) to
optimize virtual memory performance. However, in a virtual execution
environment, virtual memory virtualization involves sharing the physical
system memory in RAM and dynamically allocating it to the physical
memory of the VMs. A two-stage mapping process must be maintained by
the guest OS and the VMM, respectively: virtual memory to physical
memory, and physical memory to machine memory. The guest OS continues
to control the mapping of virtual addresses to the physical memory
addresses of VMs, but the guest OS cannot directly access the actual
machine memory. The VMM is responsible for mapping the guest physical
memory to the actual machine memory.
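The two-stage lookup can be pictured as two dictionaries: the guest page table maps virtual pages to guest-physical pages, and a VMM-owned map takes guest-physical pages to machine pages. Real VMMs collapse the two stages with shadow page tables or hardware nested paging; the class and field names below are assumptions made only for illustration.

class GuestPageTable:
    """Guest OS view: virtual page -> guest-physical page (stage one)."""
    def __init__(self, mapping):
        self.mapping = mapping

class VMM:
    """VMM view: guest-physical page -> machine page (stage two)."""
    def __init__(self, guest_to_machine):
        self.guest_to_machine = guest_to_machine

    def translate(self, guest_pt, virtual_page):
        guest_physical = guest_pt.mapping[virtual_page]   # stage 1 (guest OS)
        return self.guest_to_machine[guest_physical]      # stage 2 (VMM)

gpt = GuestPageTable({0x10: 0x2, 0x11: 0x3})   # the VM's own page table
vmm = VMM({0x2: 0x7F, 0x3: 0x80})              # VMM-owned mapping
print(hex(vmm.translate(gpt, 0x10)))           # 0x7f  (machine page)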

I/O Virtualization

I/O virtualization involves managing the routing of I/O requests between
virtual devices and the shared physical hardware. All the functions of a
device or bus infrastructure, such as device enumeration, identification,
interrupts, and DMA, are replicated in software. This software is located in
the VMM and acts as a virtual device. The I/O access requests of the guest
OS are trapped in the VMM, which interacts with the I/O devices. A single
hardware device can be shared by multiple VMs that run concurrently.
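A minimal sketch of this device-emulation idea, assuming a hypothetical VirtualNIC/PhysicalNIC pair: each VM sees its own virtual network card, but every send is intercepted by the VMM-resident device model and multiplexed onto the one shared physical device.

class PhysicalNIC:
    def transmit(self, vm_name, frame):
        # Stand-in for the real driver call on the shared hardware device.
        print(f"physical NIC sends {len(frame)} bytes on behalf of {vm_name}")

class VirtualNIC:
    """Virtual device exposed to one VM; its I/O requests are trapped and
    routed by the VMM onto the shared physical NIC."""
    def __init__(self, vm_name, physical_nic):
        self.vm_name = vm_name
        self.physical_nic = physical_nic

    def send(self, frame):
        # The guest's I/O request is tagged with its origin and forwarded.
        self.physical_nic.transmit(self.vm_name, frame)

pnic = PhysicalNIC()
nics = [VirtualNIC(f"VM{i}", pnic) for i in range(3)]   # one device, many VMs
for n in nics:
    n.send(b"\x00" * 64)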

VIRTUAL CLUSTERS AND RESOURCE MANAGEMENT

A physical cluster is a collection of servers (physical machines)
interconnected by a physical network such as a LAN.
Virtual clusters are built with VMs installed at distributed servers from one
or more physical clusters.
The VMs in a virtual cluster are interconnected logically by a virtual
network across several physical networks.

A cloud platform with four virtual clusters over three physical clusters

Physical versus Virtual Processor Cores

Physical cores | Virtual cores
The actual physical cores present in the processor. | There can be more virtual cores visible to a single OS than there are physical cores.
More burden on the software to write applications that can run directly on the cores. | Design of software becomes easier, as the hardware assists the software in dynamic resource utilization.
Hardware provides no assistance to the software and is hence simpler. | Hardware provides assistance to the software and is hence more complex.
Poor resource management. | Better resource management.
The lowest level of system software has to be modified. | The lowest level of system software need not be modified.

VIRTUALIZATION FOR DATA-CENTER AUTOMATION

Data-center automation means that huge volumes of hardware, software,
and database resources in these data centers can be allocated dynamically
to millions of Internet users simultaneously, with guaranteed QoS and cost-
effectiveness. This automation process is triggered by the growth of
virtualization products and cloud computing services. Virtualization is
moving toward enhancing mobility, reducing planned downtime (for
maintenance), and increasing the number of virtual clients.
The latest virtualization developments highlight high availability (HA),
backup services, workload balancing, and further increases in client bases.

Server Consolidation in Data Centers

In data centers, a large number of heterogeneous workloads can run on
servers at various times. These heterogeneous workloads can be roughly
divided into two categories: chatty workloads and noninteractive
workloads. Chatty workloads may burst at some point and return to a silent
state at some other point. A web video service is an example of this,
whereby a lot of people use it at night and few people use it during the day.
Noninteractive workloads do not require people's efforts to make progress
after they are submitted. High-performance computing is a typical example
of this. At various stages, the requirements for resources of these workloads
are dramatically different. However, to guarantee that a workload will
always be able to cope with all demand levels, the workload is statically
allocated enough resources so that peak demand is satisfied.

Therefore, it is common that most servers in data centers are underutilized.
A large amount of hardware, space, power, and management cost of these
servers is wasted. Server consolidation is an approach to improve the low
utility ratio of hardware resources by reducing the number of physical
servers. Among several server consolidation techniques, such as centralized
and physical consolidation, virtualization-based server consolidation is the
most powerful. Data centers need to optimize their resource management,
but the use of VMs increases resource management complexity. This causes
a challenge in terms of how to improve resource utilization as well as
guarantee QoS in data centers.
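A first-fit-decreasing packing is one simple way to picture consolidation; the sketch below packs VM peak CPU demands onto as few physical servers as possible. The demand figures and the single CPU dimension are assumptions made for illustration; real consolidation must also honor memory, I/O, and QoS constraints.

def consolidate(vm_demands, server_capacity):
    """First-fit-decreasing sketch of virtualization-based server consolidation:
    place each VM (by peak CPU demand) on the first server with room,
    opening a new server only when none fits."""
    servers = []                                  # each entry = remaining capacity
    placement = {}
    for vm, demand in sorted(vm_demands.items(), key=lambda kv: -kv[1]):
        for idx, free in enumerate(servers):
            if demand <= free:
                servers[idx] -= demand
                placement[vm] = idx
                break
        else:                                     # no existing server fits: add one
            servers.append(server_capacity - demand)
            placement[vm] = len(servers) - 1
    return placement, len(servers)

demands = {"web": 2, "db": 4, "batch": 3, "cache": 1, "ci": 2}
print(consolidate(demands, server_capacity=8))    # 12 CPU units fit on 2 servers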

Advantages

• Consolidation enhances hardware utilization. Many underutilized
servers are consolidated into fewer servers to enhance resource
utilization. Consolidation also facilitates backup services and disaster
recovery.

• This approach enables more agile provisioning and deployment of
resources. In a virtual environment, the images of the guest OSes and
their applications are readily cloned and reused.

• The total cost of ownership is reduced. In this sense, server
virtualization causes deferred purchases of new servers, a smaller
data-center footprint, lower maintenance costs, and lower power,
cooling, and cabling requirements.

• This approach improves availability and business continuity. The
crash of a guest OS has no effect on the host OS or any other guest
OS. It becomes easier to transfer a VM from one server to another,
because virtual servers are unaware of the underlying hardware.

To automate data-center operations, one must consider resource
scheduling, architectural support, power management, automatic or
autonomic resource management, and the performance of analytical models.
