Cluster Computing
A SEMINAR REPORT
Submitted by
KUMAR KAUSHIK
SCHOOL OF ENGINEERING
AUGUST 2008
DIVISION OF COMPUTER SCIENCE AND ENGINEERING
SCHOOL OF ENGINEERING
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
KOCHI-682022
Certificate
of the VII semester, Computer Science and Engineering in the year 2008 in partial
fulfillment of the requirements in the award of Degree of Bachelor of Technology in
Computer Science and Engineering of Cochin University of Science and Technology.
Date :
Acknowledgement
Many people have contributed to the success of this seminar. Although a single sentence
hardly suffices, I would like to thank Almighty God for blessing us with His grace. I
extend my sincere and heartfelt thanks to Dr. David Peter, Head of Division, Computer
Science and Engineering, for providing us the right ambience for carrying out this work.
I am profoundly indebted to my seminar guide, Ms. Ancy Zachariah, for innumerable acts
of timely advice and encouragement, and I sincerely express my gratitude to her.
I express my immense pleasure and thankfulness to all the teachers and staff of the
Department of Computer Science and Engineering, CUSAT for their cooperation and
support.
Last but not least, I thank all others, and especially my classmates, who in one way or
another helped me in the successful completion of this work.
KUMAR KAUSHIK
ABSTRACT
other through fast local area networks. Clusters are usually deployed to
The seminar will contain the concepts of cluster computing and the
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
1.1 GENERAL
1.1.1 Cluster Computing
1.1.2 Cluster Benefits
2. Types of Clusters
3. Cluster Components
4. Cluster Operation
5.1 Throughput
5.1.1 Slow Start
5.1.2 Congestion Avoidance
6. Cluster Applications
6.1 Google Search Engine
6.2 Petroleum Reservoir Simulation
6.3 Protein Explorer
6.4 Earthquake Simulation
6.5 Image Rendering
References
1. INTRODUCTION
Parallel computing has seen many changes since the days of the highly expensive and
proprietary supercomputers. Changes and improvements in performance have also
been seen in the area of mainframe computing for many environments. But these
compute environments may not be the most cost-effective and flexible solution for a
problem. Over the past decade, cluster technologies have been developed that allow
multiple low-cost computers to work in a coordinated fashion to process applications.
The economics, performance, and flexibility of compute clusters make cluster
computing an attractive alternative to centralized computing models and the attendant
cost, inflexibility, and scalability issues inherent to these models.
Many enterprises are now looking at clusters of high-performance, low-cost
computers to provide increased application performance, high availability, and ease of
scaling within the data center. Interest in and deployment of computer clusters have
largely been driven by the increase in the performance of off-the-shelf commodity
computers and high-speed, low-latency network switches, and by the maturity of the
software components. Application performance continues to be of significant concern for
various entities including government, military, education, scientific, and now
enterprise organizations. This document provides a review of cluster computing, the
various types of clusters, and their associated applications. It is a high-level
informational document; it does not provide details about various cluster
implementations and applications.
clusters) and high availability clusters. Clusters may also be deployed to address load
balancing, parallel processing, systems management, and scalability. Today, clusters
are made up of commodity computers usually restricted to a single switch or group of
interconnected switches operating at Layer 2 and within a single virtual local-area
network (VLAN). Each compute node (computer) may have different characteristics
such as single processor or symmetric multiprocessor design, and access to various
types of storage devices. The underlying network is a dedicated network made up of
high-speed, low-latency switches that may consist of a single switch or a hierarchy of
multiple switches.
A growing range of possibilities exists for a cluster interconnection technology.
Different variables will determine the network hardware for the cluster. Price-per-
port, bandwidth, latency, and throughput are key variables. The choice of network
technology depends on a number of factors, including price, performance, and
compatibility with other cluster hardware and system software as well as
communication characteristics of the applications that will use the cluster. Clusters are
not commodities in themselves, although they may be based on commodity hardware.
A number of decisions need to be made (for example, what type of hardware the
nodes run on, which interconnect to use, and which type of switching architecture to
build on) before assembling a cluster. Each decision will affect the others, and
some will probably be dictated by the intended use of the cluster. Selecting the right
cluster elements involves an understanding of the application and the necessary
resources that include, but are not limited to, storage, throughput, latency, and number
of nodes.
When considering a cluster implementation, there are some basic questions that can
help determine the cluster attributes such that technology options can be evaluated:
1. Will the application be primarily processing a single dataset?
2. Will the application be passing data around or will it generate real-time
information?
3. Is the application 32- or 64-bit?
The answers to these questions will influence the type of CPU, memory architecture,
storage, cluster interconnect, and cluster network design. Cluster applications are
often CPU-bound so that interconnect and storage bandwidth are not limiting factors,
although this is not always the case.
The main benefits of clusters are scalability, availability, and performance. For
scalability, a cluster uses the combined processing power of compute nodes to run
cluster-enabled applications such as a parallel database server at a higher performance
than a single machine can provide. Scaling the cluster's processing power is achieved
by simply adding additional nodes to the cluster. Availability within the cluster is
assured as nodes within the cluster provide backup to each other in the event of a
failure. In high-availability clusters, if a node is taken out of service or fails, the load
is transferred to another node (or nodes) within the cluster. To the user, this operation
is transparent as the applications and data running are also available on the failover
nodes. An additional benefit comes with the existence of a single system image and
the ease of manageability of the cluster. From the user's perspective, the user sees an
application resource as the provider of services and applications. The user does not
know or care if this resource is a single server, a cluster, or even which node within
the cluster is providing services. These benefits map to the needs of today's enterprise
business, education, military, and scientific community infrastructures. In summary,
clusters provide:
• Scalable capacity for compute, data, and transaction intensive applications,
including support of mixed workloads
• Horizontal and vertical scalability without downtime
• Ability to handle unexpected peaks in workload
• Central system management of a single system image
• 24 x 7 availability.
2. TYPES OF CLUSTERS
There are several types of clusters, each with specific design goals and functionality.
These clusters range from distributed or parallel clusters for computation intensive or
data intensive applications that are used for protein, seismic, or nuclear modeling to
simple load-balanced clusters.
(Failover Clusters)
(Figure- 2.1)
the same content, and the client does not care which server provides
the requested content.
This type of cluster distributes incoming requests for resources or content among
multiple nodes running the same programs or having the same content (see Figure
2.2). Every node in the cluster is able to handle requests for the same content or
application. If a node fails, requests are redistributed between the remaining available
nodes. This type of distribution is typically seen in a web-hosting environment.
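The request distribution described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a simple round-robin policy; the class name RoundRobinBalancer and the node names used with it are hypothetical and do not come from any particular load-balancing product.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand each incoming request to the next healthy node in turn.

    Illustrative sketch only: round-robin is an assumed policy, and every
    node is assumed to serve the same content.
    """

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)   # nodes currently able to serve
        self._ring = cycle(self.nodes)   # endless rotation over all nodes

    def mark_failed(self, node):
        """Take a node out of rotation after a failure."""
        self.healthy.discard(node)

    def mark_recovered(self, node):
        """Put a known node back into rotation."""
        if node in self.nodes:
            self.healthy.add(node)

    def route(self, request):
        """Return the node chosen for this request, skipping failed nodes."""
        # The request itself is not inspected: any node can serve it.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")
```

When a node is marked as failed, subsequent requests are simply shared among the remaining nodes, which mirrors the redistribution behaviour described in the paragraph above.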
Both the high availability and load-balancing cluster technologies can be combined to
increase the reliability, availability, and scalability of application and data resources
that are widely deployed for web, mail, news, or FTP services.
3. CLUSTER COMPONENTS
The basic building blocks of clusters are broken down into multiple categories: the
cluster nodes, cluster operating system, network switching hardware, and the
node/switch interconnect (see Figure 3). Significant advances have been
made over the past five years in the performance of both
the compute nodes and the underlying switching infrastructure.
(Cluster Components)
(Figure- 3)
Application: This layer covers all the applications that are running for a particular
group. These applications run in parallel; they include, for example, various queries
running on different nodes of the cluster. This layer can be regarded as the input part
of the cluster components.
Middleware: These are software packages through which the user interacts with the
operating system for cluster computing. In other words, they are the layers
of software between the applications and the operating system. Middleware provides the
various services required by an application to function correctly. The software packages
used as middleware include:
OSCAR
Features:
Image based Installation.
Supported by Red Hat 9.0 and Mandrake 9.0.
Processors supported: x86, Itanium (in beta).
Interconnects: Ethernet, Myrinet.
Diskless support in development.
Opteron support in development.
High-availability support in alpha testing.
SCYLD
Features:
Commercial distribution.
Single system image design.
Processors: x86 and Opteron.
Interconnects: Ethernet and Infiniband.
MPI and PVM.
Diskful and diskless support.
Rocks
Features:
Interconnect: The nodes of the cluster system can be interconnected using 10GbE,
Myrinet, etc. In a small cluster system, they can be connected with the help of
simple switches.
Nodes: The nodes of the cluster system are the individual computers that are
connected together. Their processors can be, for example, 64-bit Intel or AMD processors.
4. CLUSTER OPERATION
4.1 Cluster Nodes
Node technology has migrated from the conventional tower cases to single rack-unit
multiprocessor systems and blade servers that provide a much higher processor
density within a decreased area. Processor speeds and server architectures have
increased in performance, and solutions are available with either 32-bit or
64-bit processor systems. Additionally, memory performance as well as hard-disk
access speeds and storage capacities have also increased. It is interesting to note that
even though performance is growing exponentially in some cases, the cost of these
technologies has dropped considerably. As shown in Figure 4.1 below, node
participation in the cluster falls into one of two responsibilities: master (or head) node
and compute (or slave) nodes. The master node is the unique server in a cluster system.
It is responsible for running the file system and also serves as the key system for
clustering middleware to route processes and duties and to monitor the health and
status of each slave node. A compute (or slave) node within a cluster provides the
cluster with computing and data storage capability. These nodes are derived from fully
operational, standalone computers that are typically marketed as desktop or server
systems and, as such, are off-the-shelf commodity systems.
(Cluster Nodes)
(Figure- 4.1)
Commodity cluster solutions are viable today due to a number of factors such as the
high performance commodity servers and the availability of high speed, low-latency
network switch technologies that provide the inter-nodal communications.
Commodity clusters typically incorporate one or more dedicated switches to support
communication between the cluster nodes. The speed and type of node interconnects
vary based on the requirements of the application and organization. With today's low
per-port costs for Gigabit Ethernet switches, the adoption of 10-Gigabit Ethernet, and
the standardization of 10/100/1000 network interfaces on the node hardware, Ethernet
continues to be a leading interconnect technology for many clusters. In addition to
Ethernet, alternative network or interconnect technologies include Myrinet, Quadrics,
and Infiniband, which support bandwidths above 1 Gbps and end-to-end message
latencies below 10 microseconds (µs).
Ethernet is the most widely used interconnect technology for local area networking
(LAN). Ethernet as a technology supports speeds varying from 10Mbps to 10 Gbps
and it is successfully deployed and operational within many high-performance cluster
computing environments.
Compute intensive is a term that applies to any computer application that demands a
lot of computation cycles (for example, scientific applications such as meteorological
prediction). These types of applications are very sensitive to end-to-end message
latency. This latency sensitivity arises because processors either have to wait for
instruction messages or have to wait longer while results data are transmitted between
nodes. In general, the more time spent idle waiting for an instruction or for results
data, the longer it takes to complete the application.
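The effect of message latency on completion time can be illustrated with a very simple model. The sketch below assumes that communication is never overlapped with computation, so every message adds its full latency to the run time; the function name and all the numbers in the usage note are hypothetical and chosen only for illustration.

```python
def completion_time(compute_s, messages, latency_s):
    """Estimate run time in seconds for a compute-intensive job.

    Assumed model: the processor sits idle for the full end-to-end
    latency of every message, so the waits simply add up.
    """
    return compute_s + messages * latency_s
```

For example, under this model a job with 100 seconds of pure computation and one million messages would take about 150 seconds at 50 microseconds per message, but only about 105 seconds at 5 microseconds per message.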
Data intensive is a term that applies to any application that places high demands on
attached storage facilities. Performance of many of these applications is impacted by
the quality of the I/O mechanisms supported by current cluster architectures, the
bandwidth available for network attached storage, and, in some cases, the
performance of the underlying network components at both Layer 2 and 3.
Data-intensive applications can be found in the area of data mining, image processing,
and genome and protein science applications. The movement to parallel I/O systems
continues to occur to improve the I/O performance for many of these applications.
Transaction intensive is a term that applies to any application that has a high level of
interactive transactions between an application resource and the cluster resources.
Many financial, banking, human resource, and web-based applications fall into this
category.
There are three main concerns for cluster applications: message latency, CPU
utilization, and throughput. Each of these plays an important part in improving or
impeding application performance. This section describes each of these issues and
their associated impact on application performance.
(Message Latency)
(Figure- 4.4.1)
To address host stack latency, reductions in protocol processing have been achieved
to some extent through the implementation of TCP Offload Engines (TOE), and further
developments combining TOE with Remote Direct Memory Access (RDMA) technologies are
under way that will significantly reduce the protocol processing in the host. See
Figure 4.4.2 through Figure 4.4.4 below for examples.
(Progression)
(Figure- 4.4.2)
(CPU Utilization)
(Figure- 4.5.1)
The example on the left side of Figure 4.5.1 shows that when there is virtually no
network or protocol processing going on, CPU 0 and 1 of each node are 100%
devoted to application processing. The right side of Figure 4.5.1 shows that the
network traffic levels have significantly increased. As this happens, the CPU spends
cycles processing the MPI and TCP protocol stacks, including moving data to and
from the wire. This results in reduced or suspended application processing. With the
increase in protocol processing, note that the utilization percentages of CPU 0 and 1
are dramatically reduced, in some cases to 0.
5.1 Throughput
(Throughput)
(Figure- 5.1)
As shown in the examples in Figure 5.1, there are both simple and complex collective
routines. As more scatter-gather, all-gather, and all-to-all routines are used, multiple
head-of-line blocking instances may occur within the switch, even within
non-blocking switch architectures. Additionally, the buffer architectures of the
underlying network, specifically the depth and allocation of ingress and egress
buffers, become key to throughput levels. If buffers fill, congestion management
routines may be invoked. In the switch, this means that pause frames will be sent,
causing the sending node to stop sending traffic until the congestion
subsides. In the case of TCP, the congestion avoidance algorithms come into effect.
To alleviate this problem, modern TCP devices are restrained in the rate at which they
initially send segments. Each sender is at first restricted to sending only an amount of
data equal to one “full-sized” segment, that is, one equal to the MSS value for the
connection. Each time an acknowledgment is received, the amount of data the device
can send is increased by the size of another full-sized segment. Thus, the device
“starts slow” in terms of how much data it can send, with the amount it sends
increasing until either the full window size is reached or congestion is detected on the
link. In the latter case, the congestion avoidance feature, described below, is used.
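The growth pattern described above can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: no segment is lost, the receiver acknowledges every full-sized segment individually, and the whole window is sent once per round trip. The function name slow_start and its parameters are hypothetical.

```python
def slow_start(mss, window, rounds):
    """Return the number of bytes the sender may transmit in each round trip.

    Assumes no loss and one acknowledgment per full-sized segment.
    """
    allowed = mss                      # start with one full-sized segment
    history = []
    for _ in range(rounds):
        history.append(allowed)
        acks = allowed // mss          # one ACK per segment sent this round
        # Each ACK adds one more full-sized segment, up to the full window.
        allowed = min(allowed + acks * mss, window)
    return history
```

With an assumed MSS of 1460 bytes and a window of 16 segments, the amount the sender may transmit doubles every round trip (1, 2, 4, 8, 16 segments) and then stays at the full window size. Although the increase per acknowledgment is only one segment, the overall growth per round trip is therefore exponential until the window limit or congestion is reached.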
6. CLUSTER APPLICATIONS
Internet search engines enable Internet users to search for information on the Internet
by entering specific keywords. A widely used search engine, Google, uses cluster
computing to meet the huge quantity of worldwide search requests, which peak at
thousands of queries per second. A single Google query needs to use at
least tens of billions of processing cycles and access a few hundred megabytes of data
in order to return satisfactory search results.
Google uses cluster computing as its solution to the high demand for system
resources since clusters have better price-performance ratios than alternative
high-performance computing platforms, and also use less electrical power. Google
focuses on two important design factors: reliability and request throughput.
Google is able to achieve reliability at the software level so that a reliable computing
infrastructure can be constructed on clusters of 15,000 commodity PCs distributed
worldwide. The services for Google are also replicated across multiple machines in
the clusters to provide the necessary availability. Google maximizes overall request
throughput by performing parallel execution of individual search requests. This means
that more search requests can be completed within a specific time interval.
The figure shows how a Google Web Server (GWS) operates within a local cluster. The
first phase of query execution involves index servers consulting an inverted index that
matches each query keyword to a list of matching documents. Relevance scores are also
computed for matching documents so that the search result returned to the user is
ordered by score. In the second phase, document servers fetch each document from disk
to extract the title and the keyword-in-context portion of the document. In addition to
the two phases, the GWS also activates the spell checker and the ad server. The spell
checker verifies that the spelling of the query keywords is correct, while the ad
server generates advertisements that relate to the query and may therefore interest the
user.
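The first phase of query execution can be illustrated with a toy inverted index. The sketch below is not Google's implementation: the function names are hypothetical, the whole index lives in one process rather than on many index servers, and the relevance score (a simple count of keyword occurrences) is an assumption made purely for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, docs, query):
    """Return ids of documents containing every query keyword, best first."""
    words = query.lower().split()
    if not words:
        return []
    # Intersect the document lists of all keywords.
    hits = set.intersection(*(index.get(w, set()) for w in words))

    def score(doc_id):
        # Toy relevance score: total occurrences of the query keywords.
        tokens = docs[doc_id].lower().split()
        return sum(tokens.count(w) for w in words)

    # Highest score first; ties are broken by document id.
    return sorted(hits, key=lambda d: (-score(d), d))
```

In a real cluster the index is far too large for one machine, which is why the work of consulting it is spread across many index servers that each handle part of the data.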
requirements. With the advent of cluster computing, more researchers are using
compositional simulators that use more data to characterize reservoirs.
The GPAS is a compositional petroleum reservoir simulator that can perform more
accurate, efficient, and high-resolution simulation of fluid flow in permeable media.
It uses a finite-difference method which divides a continuous domain into smaller
cells to solve the governing partial differential equations. A higher number of
cells produces more accurate results, but requires more computation time. A fully
implicit solution results in a system of non-linear equations that are then
solved using Newton’s method. However, large sparse linear systems of
equations must be solved to obtain the numerical solution of these non-linear equations.
Therefore, the Portable Extensible Toolkit for Scientific Computation (PETSc), a set
of tools for solving partial differential equations, is used to solve these linear systems.
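The relationship between Newton's method and the linear solves can be seen in a scalar sketch. This is an illustration only, not code from GPAS or PETSc: with a single unknown, the linear system solved in each Newton iteration reduces to one division, whereas in a reservoir simulator it is a large sparse system. The function name newton and the test equation in the usage note are assumptions chosen for illustration.

```python
def newton(f, df, x0, tol=1e-10, max_iter=50):
    """Solve f(x) = 0 for a single unknown with Newton's method.

    f is the non-linear equation, df its derivative, x0 the initial guess.
    """
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:              # residual small enough: converged
            return x
        # Linear solve for the update: df(x) * dx = -f(x).
        # With one unknown this is a single division.
        dx = -fx / df(x)
        x += dx
    raise RuntimeError("Newton iteration did not converge")
```

For instance, newton(lambda x: x**3 - 2*x - 5, lambda x: 3*x**2 - 2, 2.0) converges in a few iterations to a root near 2.0946. With many unknowns, df becomes a Jacobian matrix and the division becomes a sparse linear solve, which is the step delegated to PETSc.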
To handle the parallel processing requirements, an Integrated Parallel Accurate
Reservoir Simulator (IPARS) framework has been developed to separate the
physical model development from parallel processing. IPARS provides input and
output, memory management, domain decomposition, and message passing among
processors to update overlapping regions. Communications between the
simulator framework and a physical model are carried out through FORTRAN
subroutine calls provided within the IPARS, thus hiding the complexities from
the physical model developers who only need to call the FORTRAN subroutines
to perform corresponding tasks.
by the national ‘Protein 3000’ project in Japan that was initiated in 2002 with
the goal of solving the structures of 3,000 proteins by the year 2007.
The figure shows the components of the PE system. It will be a cluster of 256
dual-processor nodes giving a total of 512 processors, connected via Gigabit Ethernet.
Each cluster node has 2 special-purpose engine boards (with 12 MDGRAPE-3
chips on each board) connected to it, giving the system a total of 6,144 chips.
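The totals quoted above follow directly from the per-node figures, as the short calculation below shows. All four input values are taken from the description of the PE system; only the variable names are introduced here.

```python
nodes = 256               # cluster nodes in the PE system
processors_per_node = 2   # each node is a dual-processor machine
boards_per_node = 2       # special-purpose engine boards per node
chips_per_board = 12      # MDGRAPE-3 chips on each board

total_processors = nodes * processors_per_node           # 256 * 2 = 512
total_chips = nodes * boards_per_node * chips_per_board  # 256 * 2 * 12 = 6144
```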
The cluster nodes will transmit the coordinates and the other data of particles for
the molecular dynamics simulation to the special-purpose engines, which then
calculate the non-bonded forces, such as the Coulomb force and the van der Waals force
between particles, before returning the results to the hosts. In other words, the
special-purpose engines focus only on computing the most demanding portion of the
simulation, namely calculating the non-bonded forces. All the coordination and
other calculations are handled by the cluster nodes themselves.
The simulations are based on multi-resolution mesh algorithms that can model the
wide range of length and time scales depicting the earthquake response. Figure 6.4
shows the process of generating a mesh using the etree method. That method is used
for earthquake simulations in heterogeneous basins where the shear wave velocity
and maximum resolved frequency determine the local element size. At the initial
“construct” step, an octree is constructed and stored on disk. The decompositions of
the octants are dependent on the geometry or physics being modeled, thus
resulting in an unbalanced octree. Then, the balance step recursively decomposes
all the large octants that violate the 2-to-1 constraint until there are no more illegal
conditions, thus creating a balanced octree. Finally, in the transform step, mesh-
specific information such as the element-node relationship and the node
coordinates are derived from the balanced octree and separately stored in two
databases: one for the mesh elements, another for the mesh nodes.
For the balancing step, the whole domain is first partitioned into equal-size blocks.
Then, internal balancing enforces the 2-to-1 constraint within each block. Finally,
boundary balancing is used to resolve interactions between adjacent blocks. This local
balancing step is very effective since it can achieve a speed-up ranging from 8 to 28,
depending on the size of the meshes being balanced.
The Scientific Computing and Imaging (SCI) Institute at the University of Utah has
explored cluster-based scientific visualization using a 32-node visualization
cluster composed of commodity hardware components connected with a
high-speed network. The OpenGL scientific visualization tool Simian has been modified
to create a cluster-aware version of Simian that supports parallelization by
making explicit use of remote cluster nodes through the Message Passing Interface
(MPI). Simian is able to generate 3D images for fire-spread simulations that model
scenarios such as when a missile located within a pool of jet fuel catches fire and
explodes. Using image rendering for fire-spread simulations enables researchers to
have a better visualization of the destructive effects.
Normally, Simian uses a swapping mechanism to manage datasets that are too
large to load into the available texture memory, resulting in low performance
and interactivity. For the cluster-aware Simian, large datasets are divided into
sub-volumes that can be distributed across multiple cluster nodes, thus achieving
interactive performance. This “divide-and-conquer” technique first decomposes
the dataset into sub-volumes before distributing the sub-volumes to multiple remote
cluster nodes. Each node is then responsible for rendering its sub-volume using
the locally available graphics hardware. The individual results are finally
combined using a binary-swap compositing algorithm to generate the final image.
This enables the cluster-aware Simian to visualize large-scale datasets while
maintaining interactive rates without the need for texture swapping. The figure shows
the visualization of two fire-spread datasets simulating a heptane pool fire, generated
by the cluster-aware version of Simian using 8 cluster nodes. The top row of the figure
shows two views (side and top views) of the h300_0075 dataset, while the bottom row
shows the h300_0130 dataset.
supported by the Digipede Network™. The chart below gives an overview of the two
solutions.
Both systems use similar terminology to define submitted requests: A job defines the
work submitted to the system which includes the required resources and the tasks to
execute. A task is an individual unit of work that can be executed concurrently with
other tasks.
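The job and task terminology can be made concrete with a short sketch. It is not the API of either system compared here: the class names, the fields, and the use of a local thread pool in place of cluster nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    """An individual unit of work that can run concurrently with others."""
    func: Callable[[], object]


@dataclass
class Job:
    """Work submitted to the system: required resources plus tasks."""
    name: str
    required_workers: int               # stands in for "required resources"
    tasks: List[Task] = field(default_factory=list)

    def run(self):
        """Execute all tasks concurrently and return results in task order."""
        # Threads on one machine stand in for cluster nodes in this sketch.
        with ThreadPoolExecutor(max_workers=self.required_workers) as pool:
            futures = [pool.submit(task.func) for task in self.tasks]
            return [future.result() for future in futures]
```

A job built as Job("squares", 4, [Task(lambda n=n: n * n) for n in range(5)]) returns [0, 1, 4, 9, 16] from run(). In a real cluster scheduler the tasks would be dispatched to different nodes rather than to threads.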
8. CONCLUSION