CC Unit 1
UNIT – 1
Systems Modeling, Clustering and Virtualization: Distributed System Models and Enabling Technologies,
Computer Clusters for Scalable Parallel Computing, Virtual Machines and Virtualization of Clusters and
Data centers.
UNIT 1
INTRODUCTION
• Parallel computing: In parallel computing, all processors are either tightly coupled with centralized shared memory or loosely coupled with distributed memory. Some authors refer to this discipline as parallel processing. Interprocessor communication is accomplished through shared memory or via message passing. A computer system capable of parallel computing is commonly known as a parallel computer. Programs running in a parallel computer are called parallel programs. The process of writing parallel programs is often referred to as parallel programming.
Degrees of Parallelism:
Bit-level parallelism (BLP) converts bit-serial processing to word-level processing gradually. Over the years, users graduated from 4-bit microprocessors to 8-, 16-, 32-, and 64-bit CPUs.
Data-level parallelism (DLP) was made popular through SIMD (single instruction, multiple data) and vector machines using vector or array types of instructions. DLP requires even more hardware support and compiler assistance to work properly.
Ever since the introduction of multicore processors and chip multiprocessors (CMPs), we have been exploring task-level parallelism (TLP).
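As a quick illustration of data-level parallelism in the spirit of the SIMD/vector description above, the sketch below contrasts an element-by-element loop with a single vectorized (array) operation. The array size and the use of NumPy are illustrative assumptions, not part of the original notes.

# Illustrative sketch of data-level parallelism (DLP): one array operation
# replaces a loop of scalar operations. NumPy stands in here for
# SIMD/vector hardware support.
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)

# Scalar style: one add per loop iteration.
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# Data-parallel style: a single vector operation over all elements.
c_vector = a + b

assert np.allclose(c_scalar, c_vector)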
Memory, Storage, and Wide-Area Networking: Memory chips have experienced a 4x increase in capacity every three years. For hard drives, capacity increased from 260 MB in 1981 to 250 GB in 2004.
Virtual machines (VMs) offer novel solutions to underutilized resources, application inflexibility, software manageability, and security concerns in existing physical machines. Today, to build large clusters, grids, and clouds, we need to access large amounts of computing, storage, and networking resources in a virtualized manner. We need to aggregate those resources and, hopefully, offer a single system image. In particular, a cloud of provisioned resources must rely on virtualization of processors, memory, and I/O facilities dynamically.
The host machine is equipped with the physical hardware. The VM is built with virtual resources managed by a guest OS to run a specific application. Between the VMs and the host platform, one needs to deploy a middleware layer called a virtual machine monitor (VMM).
VM Primitive Operations: The VMM provides the VM abstraction to the guest OS. With full virtualization, the VMM exports a VM abstraction identical to the physical machine so that a standard OS such as Windows 2000 or Linux can run just as it would on the physical hardware.
Distributed and cloud computing systems are built over a large number of autonomous computer nodes. These node machines are interconnected by SANs, LANs, or WANs in a hierarchical manner. Massive systems are considered highly scalable, and can reach web-scale connectivity, either physically or logically. Massive systems are classified into four groups: clusters, P2P networks, computing grids, and Internet clouds over huge data centers. In terms of node number, these four system classes may involve hundreds, thousands, or even millions of computers as participating nodes. These machines work collectively, cooperatively, or collaboratively at various levels.
Cluster Architecture
Single-System Image: An ideal cluster should merge multiple system images into a single-system image (SSI). Cluster designers desire a cluster operating system or some middleware to support SSI at various levels, including the sharing of CPUs, memory, and I/O across all cluster nodes. An SSI is an illusion created by software or hardware that presents a collection of resources as one integrated, powerful resource.
In a P2P system, every node acts as both a client and a server, providing part of the system resources. Peer machines are simply client computers connected to the Internet. All client machines act autonomously to join or leave the system freely. This implies that no master-slave relationship exists among the peers. No central coordination or central database is needed. In other words, no peer machine has a global view of the entire P2P system. The system is self-organizing with distributed control. Unlike the cluster or grid, a P2P network does not use a dedicated interconnection network. The physical network is simply an ad hoc network formed at various Internet domains randomly using the TCP/IP and NAI protocols.
Data items or files are distributed across the participating peers. Based on communication or file-sharing needs, the peer IDs form an overlay network at the logical level. This overlay is a virtual network formed by mapping each physical machine with its ID, logically, through a virtual mapping.
• Platform as a Service (PaaS): The provider supplies the API and software tools (e.g., Java, Python, Web 2.0, .NET). The user is freed from managing the cloud infrastructure.
• Software as a Service (SaaS): This refers to browser-initiated application software offered to thousands of paid cloud customers. The SaaS model applies to business processes, industry applications, consumer relationship management (CRM), enterprise resources planning (ERP), human resources (HR), and collaborative applications. On the customer side, there is no upfront investment in servers or software licensing. On the provider side, costs are rather low, compared with conventional hosting of user applications.
The computers in most distributed systems are loosely coupled. Thus, a distributed system inherently has multiple system images. This is mainly due to the fact that all node machines run with an independent operating system. To promote resource sharing and fast communication among node machines, it is best to have a distributed OS that manages all resources coherently and efficiently. Such a system is most likely to be a closed system, and it will likely rely on message passing and RPCs for internode communications.
MOSIX2: Runs with a virtualization layer in the Linux environment. This layer provides a partial single-system image to user applications. It supports both sequential and parallel applications, and discovers resources and migrates software processes among Linux nodes. It can manage a Linux cluster or a grid of multiple clusters.
MapReduce: This is a web programming model developed by Google for scalable data processing on large clusters over large data sets. The model is applied mainly in web-scale search and cloud computing applications. The master node specifies a Map function to divide the input into sub-problems and applies a Reduce function to merge all intermediate values with the same intermediate key. The model is highly scalable and can explore high degrees of parallelism at different job levels. A typical MapReduce computation process can handle terabytes of data on tens of thousands or more client machines. Thousands of MapReduce jobs are executed on Google's clusters every day.
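To make the Map/Reduce division concrete, here is a minimal, single-machine sketch of the model described above: a Map function emits intermediate (key, value) pairs, the runtime groups values by intermediate key, and a Reduce function merges the values for each key. The word-count example and function names are illustrative only; this is not Google's implementation, which distributes the same steps over many nodes.

# Minimal single-process sketch of the MapReduce model (word count).
from collections import defaultdict

def map_fn(document):
    # Map: split the input into sub-problems, emitting (key, value) pairs.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Reduce: merge all intermediate values that share the same key.
    return (key, sum(values))

def mapreduce(documents):
    intermediate = defaultdict(list)
    for doc in documents:                 # normally spread across many nodes
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

print(mapreduce(["the cloud", "the cluster and the cloud"]))
# {'the': 3, 'cloud': 2, 'cluster': 1, 'and': 1}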
Amdahl's Law
Consider a program that takes T units of time to run on a single processor, and let α be the fraction of the code that must be executed sequentially (the sequential bottleneck). Now the same program is partitioned for parallel execution on a cluster of many nodes. The total execution time becomes
α T + (1 - α) T / n
where the first term is the sequential execution time on a single processor and the second term is the parallel execution time on n processing nodes.
The speedup factor of using the n-processor system over the use of a single processor is expressed by:
S = T / [α T + (1 - α) T / n] = 1 / [α + (1 - α)/n]
The sequential bottleneck is the portion of the code that cannot be parallelized. The maximum speedup, as n grows large, is 1/α. If α = 0.25, then 1 - α = 0.75 and the maximum speedup is 4.
Scaling the problem size to match the cluster capability (scaled-workload speedup):
W' = α W + (1 - α) n W
S' = W'/W = [α W + (1 - α) n W]/W = α + (1 - α) n
Thus the efficiency is
E' = S'/n = α/n + (1 - α)
For α = 0.25 and n = 256, E' ≈ 75%.
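The short sketch below simply evaluates the two speedup formulas above (fixed workload and scaled workload) so the α = 0.25 numbers can be checked; the function names are mine, not from the notes.

# Evaluate Amdahl's law (fixed workload) and the scaled-workload case.
def amdahl_speedup(alpha, n):
    # S = 1 / [alpha + (1 - alpha)/n]; max speedup as n -> infinity is 1/alpha.
    return 1.0 / (alpha + (1.0 - alpha) / n)

def scaled_speedup(alpha, n):
    # S' = alpha + (1 - alpha) * n, with efficiency E' = S'/n.
    return alpha + (1.0 - alpha) * n

alpha, n = 0.25, 256
print(round(amdahl_speedup(alpha, n), 2))      # ~3.95, approaching the limit 1/alpha = 4
print(round(scaled_speedup(alpha, n) / n, 3))  # efficiency E' ~ 0.751, i.e. about 75%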
Availability
A system is highly available if it has a long mean time to failure (MTTF) and a short mean time to repair (MTTR).
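Expressed as a formula (the standard definition, stated here for completeness rather than taken from these notes):
Availability = MTTF / (MTTF + MTTR)
so a long MTTF and a short MTTR both push availability toward 100%.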
Application Layer: Until now, most user applications in science, business, engineering, and financial areas tend to increase a system's speed or quality. By introducing energy-aware applications, the challenge is to design sophisticated multilevel and multi-domain energy management applications without hurting performance.
With DVFS (dynamic voltage and frequency scaling), energy savings are achieved based on the fact that the power consumption in CMOS circuits has a direct relationship with frequency and the square of the voltage supply.
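The relationship referred to here is the usual dynamic-power model for CMOS logic (a well-known result, made explicit here for reference):
P_dynamic ≈ C_eff × V^2 × f
where C_eff is the effective switched capacitance, V is the supply voltage, and f is the clock frequency. Because lowering the frequency also permits a lower supply voltage, DVFS can reduce dynamic power roughly with the cube of the scaling factor, which is what makes it such an effective energy-saving mechanism.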
Network Layer: Routing and transferring packets and enabling network services to the resource layer are the main responsibilities of the network layer in distributed computing systems. The major challenge in building energy-efficient networks is, again, determining how to measure, predict, and create a balance between energy consumption and performance.
Clusters are classified using six orthogonal attributes: scalability, packaging, control, homogeneity, programmability, and security.
• Being exposed, intracluster communication is not secure unless the communication subsystem performs additional work to ensure privacy and security.
• Outside communications may disrupt intracluster communications in an unpredictable fashion.
• Standard communication protocols tend to have high overhead. A disadvantage is that there is currently no standard for efficient, enclosed intracluster communication.
The nodes are typically geographically distributed, and are not necessarily in the same room or even in the same building. The nodes are individually owned by multiple owners.
Cluster middleware combines together all node platforms at the user space. An availability middleware offers HA services. An SSI layer provides a single entry point, a single file hierarchy, a single point of control, and a single job management system. In addition to running sequential user programs, the cluster supports parallel programming based on standard languages and communication libraries using PVM, MPI, or OpenMP. The programming environment also includes tools for debugging, profiling, monitoring, and so forth. A user interface subsystem is needed to combine the advantages of the web interface and the Windows GUI. It should also provide user-friendly links to various programming environments, job management tools, hypertext, and search support so that users can easily get help in programming the computer cluster.
The shared-nothing architecture in Part (a) is used in most clusters, where the nodes are connected through the I/O bus. This architecture simply connects two or more autonomous computers via a LAN such as Ethernet. The shared-memory cluster in Part (c) is much more difficult to realize. The nodes could be connected by a scalable coherence interface (SCI) ring, which is connected to the memory bus of each node through an NIC module. In the other two architectures, the interconnect is attached to the I/O bus. The memory bus operates at a higher frequency than the I/O bus.
General-purpose computers and clusters of cooperative computers should be designed for scalability, availability, single-system image, high availability, fault tolerance, and rollback recovery.
1. Four nodes of a cluster are used as host nodes to receive users’ login requests.
2. To log into the cluster, a standard Unix command such as “telnet cluster.cs.hku.hk”, using the symbolic name of the cluster system, is issued.
3. The symbolic name is translated by the DNS, which returns the IP address 159.226.41.150 of the least-loaded node, which happens to be node Host 1.
4. The user then logs in using this IP address.
5. The DNS periodically receives load information from the host nodes to make load-balancing translation decisions (a minimal sketch of this translation step follows this list).
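As referenced above, here is a minimal sketch of the load-balancing translation in steps 3 and 5: a DNS-like resolver keeps the latest reported load of each host node and returns the address of the least-loaded one. The host addresses and load numbers below are made up for illustration.

# Toy model of DNS-based load balancing for a cluster's single entry point.
class LoadBalancingDNS:
    def __init__(self):
        self.load = {}                      # host IP -> last reported load

    def report_load(self, ip, load):
        # Step 5: hosts periodically report their load to the DNS.
        self.load[ip] = load

    def resolve(self, symbolic_name):
        # Step 3: return the IP of the least-loaded host node.
        return min(self.load, key=self.load.get)

dns = LoadBalancingDNS()
dns.report_load("10.0.0.1", 0.82)
dns.report_load("10.0.0.2", 0.15)           # least loaded
dns.report_load("10.0.0.3", 0.47)
print(dns.resolve("cluster.cs.hku.hk"))     # -> 10.0.0.2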
B. Single File Hierarchy: xFS, AFS, Solaris MC Proxy
The illusion of a single, huge file system image that transparently integrates local and global disks and other file devices (e.g., tapes). Files can reside in three types of locations in a cluster:
Local storage - disk on the local node.
Remote storage - disks on remote nodes.
Stable storage -
Persistent - data, once written to the stable storage, will stay there at least for a period of time (e.g., a week), even after the cluster shuts down.
Fault tolerant - to some degree, by using redundancy and periodic backup to tapes.
(Figure: Three types of storage in a single file hierarchy. Solid lines show what process P can access; the dashed line shows what P may be able to access.)
Single Point of Control: The administrator should be able to monitor and control the entire cluster and each individual node from a single point. Many clusters help with this through a system console that is connected to all nodes of the cluster.
Single Memory Space: Single memory space gives users the illusion of a big, centralized main memory, which in reality may be a set of distributed local memory spaces.
Single I/O Address Space: A single I/O space implies that any node can access the RAIDs.
(Figure: A cluster with single networking, single I/O space, single memory, and a single point of control.)
Other Services
Single Job Management: All cluster jobs can be submitted from any node to a single job management system (e.g., GLUnix, Codine, LSF).
Single User Interface: The users use the cluster through a single graphical interface. Such an interface is available for workstations and PCs, like CDE in Solaris/NT.
Single Process Space: All user processes created on various nodes form a single process space and share a uniform process identification scheme. A process on any node can create (e.g., through a UNIX fork) or communicate with (e.g., through signals, pipes, etc.) processes on remote nodes.
Middleware Support for SSI Clustering: SSI features are supported by middleware developed at three cluster application levels:
• Management level: This level handles user applications and provides a job management system such as GLUnix, MOSIX, Load Sharing Facility (LSF), or Codine.
• Programming level: This level provides a single file hierarchy (NFS, xFS, AFS, Proxy) and distributed shared memory (TreadMarks, Wind Tunnel).
The repair time is measured by the mean time to repair (MTTR), which is the average time it takes to repair the system and restore it to working condition after it fails.
A failure is any event that prevents the system from normal operation.
• Unplanned failures: The system breaks, due to an operating system crash, a hardware failure, a network disconnection, human operation errors, a power outage, and so on. All these are simply called failures. The system must be repaired to correct the failure.
• Planned shutdowns: The system is not broken, but is periodically taken off normal operation for upgrades, reconfiguration, and maintenance.
• Hot standby server clusters: In a hot standby cluster, only the primary node is actively doing all the useful work normally. The standby node is powered on (hot) and running some monitoring programs to communicate heartbeat signals to check the status of the primary node (see the heartbeat sketch after this list), but is not actively running other useful workloads. The primary node must mirror any data to shared disk storage, which is accessible by the standby node. The standby node requires a second copy of data.
• Active-takeover clusters: In this case, the architecture is symmetric among multiple server nodes. Both servers are primary, doing useful work normally. Both failover and failback are often supported on both server nodes. When a node fails, the user applications fail over to the available node in the cluster. Depending on the time required to implement the failover, users may experience some delays or may lose some data that was not saved in the last checkpoint.
• Failover cluster: When a component fails, this technique allows the remaining system to take over the services originally provided by the failed component. A failover mechanism must provide several functions, such as failure diagnosis, failure notification, and failure recovery.
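As referenced in the hot-standby description above, the sketch below shows the heartbeat idea in its simplest form: the standby declares the primary failed when no heartbeat has arrived within a timeout, and then triggers failover. The timeout value and function names are illustrative assumptions.

# Minimal heartbeat-based failure detection for a hot standby node.
import time

class HeartbeatMonitor:
    def __init__(self, timeout_s=3.0):
        self.timeout_s = timeout_s
        self.last_beat = time.monotonic()

    def beat(self):
        # Called whenever a heartbeat message from the primary arrives.
        self.last_beat = time.monotonic()

    def primary_alive(self):
        # Failure diagnosis: primary is presumed failed after a silent timeout.
        return (time.monotonic() - self.last_beat) < self.timeout_s

monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.beat()
if not monitor.primary_alive():
    print("primary failed -> fail over to the standby node")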
Failure recovery refers to the actions needed to take over the workload of a failed component. There are two types of recovery techniques. In backward recovery, the processes running on a cluster periodically save a consistent state (called a checkpoint) to stable storage. After a failure, the system is reconfigured to isolate the failed component, the previous checkpoint is restored, and normal operation resumes. This is called rollback. Backward recovery is relatively easy to implement in an application-independent, portable fashion.
If execution time is crucial, such as in real-time systems where the rollback time cannot be tolerated, a forward recovery scheme should be used. With such a scheme, the system is not rolled back to the previous checkpoint upon a failure. Instead, the system utilizes the failure diagnosis information to reconstruct a valid system state and continues execution. Forward recovery is application-dependent and may need extra hardware.
Checkpointing is the process of periodically saving the state of an executing program to stable storage, from which the system can recover after a failure. Each program state saved is called a checkpoint. The disk file that contains the saved state is called the checkpoint file.
Checkpointing techniques are useful not only for availability, but also for program debugging, process migration, and load balancing.
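A minimal sketch of checkpointing and rollback as just described: the program state is periodically serialized to a checkpoint file on stable storage, and after a failure the last checkpoint is restored and execution resumes from there. The pickle-based format, file name, and interval are illustrative choices, not a prescribed mechanism.

# Application-level checkpointing: save program state to a checkpoint file,
# and roll back to it after a failure (backward recovery).
import os
import pickle

CHECKPOINT_FILE = "app.ckpt"          # the "checkpoint file" on stable storage

def save_checkpoint(state):
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump(state, f)

def restore_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)     # rollback: resume from the saved state
    return {"iteration": 0, "result": 0}

state = restore_checkpoint()
for i in range(state["iteration"], 10):
    state["result"] += i              # normal computation
    state["iteration"] = i + 1
    if state["iteration"] % 3 == 0:   # checkpoint interval: every 3 iterations
        save_checkpoint(state)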
Checkpointing can be realized by the operating system at the kernel level, where the OS transparently checkpoints and restarts processes.
A less transparent approach links the user code with a checkpointing library in the user space. Checkpointing and restarting are handled by this runtime support. This approach is used widely because it has the advantage that user applications do not have to be modified.
A third approach requires the user (or the compiler) to insert checkpointing functions in the application; thus, the application has to be modified, and the transparency is lost. However, it has the advantage that the user can specify where to checkpoint. This is helpful to reduce checkpointing overhead. Checkpointing incurs both time and storage overheads.
Checkpoint Overheads
During a program's execution, its state may be saved many times. The time overhead is the time consumed to save one checkpoint. The storage overhead is the extra memory and disk space required for checkpointing. Both time and storage overheads depend on the size of the checkpoint file.
The time period between two checkpoints is called the checkpoint interval. Making the interval larger can reduce the checkpoint time overhead.
Wong and Franklin derived an expression for the optimal checkpoint interval in terms of MTTF, the system's mean time to failure; the time consumed to save one checkpoint; and h, the average percentage of normal computation performed in a checkpoint interval before the system fails. The parameter h is always in the range 0 < h < 1. After a system is restored, it needs to spend h × (checkpoint interval) time to recompute.
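The Wong-Franklin expression itself has not survived in these notes. For orientation only, a closely related and widely cited rule of thumb for choosing the interval is Young's approximation, which balances the cost of writing a checkpoint against the expected recomputation after a failure:
optimal checkpoint interval ≈ sqrt(2 × t_c × MTTF)
where t_c is the time to save one checkpoint. This is offered as a stand-in, not as the exact formula the notes refer to.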
Forked Checkpointing
Most checkpoint schemes are blocking, in that the normal computation is stopped while checkpointing is in progress. With enough memory, checkpoint overhead can be reduced by making a copy of the program state in memory and invoking another asynchronous thread to perform the checkpointing concurrently. A simple way to overlap checkpointing with computation is to use the UNIX fork() system call. The forked child process duplicates the parent process's address space and checkpoints it. Meanwhile, the parent process continues execution. Overlapping is achieved since checkpointing is disk-I/O intensive.
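A minimal sketch of the fork-based scheme described above: the child inherits a copy of the parent's address space at the moment of the fork and writes the checkpoint to disk, while the parent keeps computing. This is POSIX-only (os.fork), and the state and file names are illustrative.

# Forked (non-blocking) checkpointing: the child writes the checkpoint while
# the parent continues normal computation.
import os
import pickle

def forked_checkpoint(state, path="forked.ckpt"):
    pid = os.fork()                   # child gets a snapshot of the address space
    if pid == 0:                      # child process: do the disk-I/O-heavy work
        with open(path, "wb") as f:
            pickle.dump(state, f)
        os._exit(0)
    return pid                        # parent returns immediately and keeps computing

state = {"step": 42, "partial_sum": 9000}
child = forked_checkpoint(state)
state["step"] += 1                    # parent continues while the child checkpoints
os.waitpid(child, 0)                  # reap the child (could be deferred)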
Dedicated Mode:
● Only one job runs in the cluster at a time, and at most one process of the job is assigned to a node at a time.
● The single job runs until completion before it releases the cluster to run other jobs.
Space Sharing:
● Multiple jobs can run on disjoint partitions (groups) of nodes simultaneously.
● At most one process is assigned to a node at a time.
● Although a partition of nodes is dedicated to a job, the interconnect and the I/O subsystem may be shared by all jobs.
Time Sharing:
● Multiple user processes are assigned to the same node.
Time sharing introduces the following parallel scheduling policies:
A traditional computer runs with a host operating system specially tailored for its hardware architecture, as shown in Figure (a). After virtualization, different user applications managed by their own operating systems (guest OS) can run on the same hardware, independent of the host OS. This is often done by adding additional software, called a virtualization layer, as shown in Figure (b). This virtualization layer is known as the hypervisor or virtual machine monitor (VMM). The VMs are shown in the upper boxes, where applications run with their own guest OS over the virtualized CPU, memory, and I/O resources. The main function of the software layer for virtualization is to virtualize the physical hardware of a host machine into virtual resources to be used by the VMs.
The virtualization software creates the abstraction of VMs by interposing a virtualization layer at various levels of a computer system.
Instruction Set Architecture (ISA) level: e.g., MIPS binary code can run on an x86-based host machine with the help of ISA emulation. Typical systems: Bochs, Crusoe, QEMU, BIRD, Dynamo.
Advantage:
• It can run a large amount of legacy binary code written for various processors on any given new hardware host machine.
• Best application flexibility.
Limitation:
• One source instruction may require tens or hundreds of native target instructions to perform its function, which is relatively slow.
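The slowdown mentioned in the limitation comes from interpreting each source instruction in software. The toy fetch-decode-execute loop below, for an invented three-instruction ISA, shows why one guest instruction expands into many host operations; it is an illustration only, not how Bochs or QEMU are actually written.

# Toy ISA emulator: each guest instruction is decoded and executed in software,
# costing many host instructions per guest instruction (hence the slowdown).
def emulate(program):
    regs = {"r0": 0, "r1": 0}
    pc = 0
    while pc < len(program):
        op, *args = program[pc]       # fetch + decode
        if op == "LOADI":             # LOADI reg, immediate
            regs[args[0]] = args[1]
        elif op == "ADD":             # ADD dst, src
            regs[args[0]] += regs[args[1]]
        elif op == "HALT":
            break
        pc += 1
    return regs

print(emulate([("LOADI", "r0", 5), ("LOADI", "r1", 7), ("ADD", "r0", "r1"), ("HALT",)]))
# {'r0': 12, 'r1': 7}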
Limitation:
• Very expensive to implement (complexity).
Library Support level: It creates execution environments for running alien programs on a platform rather than creating a VM to run the entire operating system.
• It is done by API call interception and remapping.
• Typical systems: Wine, WABI, LxRun, Visual MainWin.
Advantage:
• It has very low implementation effort.
Limitation:
• Poor application flexibility and isolation.
Limitation:
• Low performance, low application flexibility, and high implementation complexity.
The hypervisor supports hardware-level virtualization (see Figure 3.1(b)) on bare-metal devices like CPU, memory, disk, and network interfaces. The hypervisor software sits directly between the physical hardware and its OS. This virtualization layer is referred to as either the VMM or the hypervisor. The hypervisor provides hypercalls for the guest OSes and applications.
(Figure: The Xen architecture's special Domain 0 for control and I/O, and several guest domains for user applications.)
The core components of a Xen system are the hypervisor, kernel, and applications. The guest OS, which has control ability, is called Domain 0, and the others are called Domain U. Domain 0 is a privileged guest OS of Xen. It is first loaded when Xen boots, without any file system drivers being available. Domain 0 is designed to access hardware directly and manage devices. Therefore, one of the responsibilities of Domain 0 is to allocate and map hardware resources for the guest domains (the Domain U domains).
Full virtualization does not need to modify the host OS. It relies on binary translation to trap and to virtualize the execution of certain sensitive, nonvirtualizable instructions. The guest OSes and their applications consist of noncritical and critical instructions. In a host-based system, both a host OS and a guest OS are used. A virtualization software layer is built between the host OS and guest OS.
Full Virtualization
With full virtualization, noncritical instructions run on the hardware directly, while critical instructions are discovered and replaced with traps into the VMM to be emulated by software. Both the hypervisor and VMM approaches are considered full virtualization. Why are only critical instructions trapped into the VMM? This is because binary translation can incur a large performance overhead. Noncritical instructions do not control hardware or threaten the security of the system, but critical instructions do. Therefore, running noncritical instructions on hardware not only can promote efficiency, but also can ensure system security.
This host-based architecture has some distinct advantages. First, the user can install this VM architecture without modifying the host OS. The virtualizing software can rely on the host OS to provide device drivers and other low-level services. This will simplify the VM design and ease its deployment. Second, the host-based approach appeals to many host machine configurations.
Para-virtualization needs to modify the guest operating systems. A para-virtualized VM provides special APIs requiring substantial OS modifications in user applications. Para-virtualization attempts to reduce the virtualization overhead, and thus improve performance, by modifying only the guest OS kernel.
The guest operating systems are para-virtualized. They are assisted by an intelligent compiler to replace the nonvirtualizable OS instructions with hypercalls. The lower the ring number, the higher the privilege of the instruction being executed. The OS is responsible for managing the hardware, and its privileged instructions execute at Ring 0, while user-level applications run at Ring 3.
The best example of para-virtualization is KVM.
CPU Virtualization
A VM is a duplicate of an existing computer system in which a majority of the VM instructions are executed on the host processor in native mode. Thus, unprivileged instructions of VMs run directly on the host machine for higher efficiency. Other critical instructions should be handled carefully for correctness and stability. The critical instructions are divided into three categories: privileged instructions, control-sensitive instructions, and behavior-sensitive instructions. Privileged instructions execute in a privileged mode and will be trapped if executed outside this mode. Control-sensitive instructions attempt to change the configuration of resources used. Behavior-sensitive instructions have different behaviors depending on the configuration of resources, including the load and store operations over the virtual memory. A CPU architecture is virtualizable if it supports the ability to run the VM's privileged and unprivileged instructions in the CPU's user mode while the VMM runs in supervisor mode.
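A highly simplified sketch of the trap-and-emulate idea in this paragraph: unprivileged instructions run "natively", while an instruction classified as critical causes a trap that the VMM handles on behalf of the guest. The instruction names and the classification set are invented for illustration.

# Toy trap-and-emulate dispatcher: critical guest instructions trap to the VMM.
CRITICAL = {"privileged", "control_sensitive", "behavior_sensitive"}

def vmm_emulate(instr):
    # The VMM validates the operation and performs it on the guest's behalf.
    return f"VMM emulated {instr}"

def execute(instr, category):
    if category in CRITICAL:
        return vmm_emulate(instr)                        # trap into the VMM
    return f"ran {instr} directly on the host CPU"       # native, unprivileged path

print(execute("add r1, r2", "unprivileged"))
print(execute("write cr3", "control_sensitive"))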
Memory Virtualization: In a virtualized environment, the guest OS continues to control the mapping of virtual addresses to the physical memory addresses of its VMs. But the guest OS cannot directly access the actual machine memory. The VMM is responsible for mapping the guest physical memory to the actual machine memory.
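The two-level mapping described here (guest virtual address -> guest "physical" address -> actual machine memory) can be sketched as two table lookups. Real VMMs use shadow page tables or hardware-assisted nested paging, so treat the dictionaries below purely as an illustration at page granularity.

# Two-stage address translation in a virtualized system (illustrative only).
guest_page_table = {0x1000: 0x4000}   # guest virtual page -> guest physical page (guest OS)
vmm_p2m_table    = {0x4000: 0x9000}   # guest physical page -> machine page (VMM)

def translate(guest_virtual_page):
    guest_physical = guest_page_table[guest_virtual_page]   # maintained by the guest OS
    machine_page   = vmm_p2m_table[guest_physical]          # maintained by the VMM
    return machine_page

print(hex(translate(0x1000)))          # -> 0x9000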
I/O Virtualization
I/O virtualization involves managing the routing of I/O requests between virtual devices and the shared physical hardware. All the functions of a device or bus infrastructure, such as device enumeration, identification, interrupts, and DMA, are replicated in software. This software is located in the VMM and acts as a virtual device. The I/O access requests of the guest OS are trapped in the VMM, which interacts with the I/O devices. A single hardware device can be shared by multiple VMs that run concurrently.
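A toy sketch of the mechanism just described: guest I/O requests are trapped by the VMM, which presents each VM with a virtual device while multiplexing their requests onto one shared physical device. The class and method names are invented for illustration.

# The VMM traps guest I/O requests and multiplexes them onto one physical device.
class PhysicalDisk:
    def write(self, vm_id, block, data):
        return f"physical disk: wrote block {block} for {vm_id}"

class VirtualDisk:
    def __init__(self, vm_id, vmm):
        self.vm_id, self.vmm = vm_id, vmm

    def write(self, block, data):
        # The guest's I/O request is trapped and forwarded to the VMM.
        return self.vmm.handle_io(self.vm_id, block, data)

class VMM:
    def __init__(self):
        self.disk = PhysicalDisk()     # single shared hardware device

    def handle_io(self, vm_id, block, data):
        return self.disk.write(vm_id, block, data)

vmm = VMM()
vm1_disk, vm2_disk = VirtualDisk("VM1", vmm), VirtualDisk("VM2", vmm)
print(vm1_disk.write(7, b"hello"))
print(vm2_disk.write(7, b"world"))     # same block number, kept isolated per VM by a real VMM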
Therefore, it is common that most servers in data centers are underutilized. A large amount of the hardware, space, power, and management cost of these servers is wasted. Server consolidation is an approach to improve the low utility ratio of hardware resources by reducing the number of physical servers. Among several server consolidation techniques, such as centralized and physical consolidation, virtualization-based server consolidation is the most powerful. Data centers need to optimize their resource management, yet the use of VMs increases resource management complexity. This causes a challenge in terms of how to improve resource utilization as well as guarantee QoS in data centers.
Advantages