1.1 Introduction
Very often applications need more computing power than a sequential computer
can provide. One way of overcoming this limitation is to improve the operating
speed of processors and other components so that they can offer the power required
by computationally intensive applications. Even though this is currently possible
to a certain extent, future improvements are constrained by the speed of light, thermodynamic
laws, and the high financial costs of processor fabrication. A viable
and cost-effective alternative is to connect multiple processors together and
coordinate their computational efforts. The resulting systems are popularly known
as parallel computers, and they allow the sharing of a computational task among
multiple processors.
As Pfister [1] points out, there are three ways to improve performance:
• Work harder,
• Work smarter, and
• Get help.
In terms of computing technologies, the analogy to this mantra is that working
harder is like using faster hardware (high performance processors or peripheral
devices). Working smarter concerns doing things more efficiently and this revolves
around the algorithms and techniques used to solve computational tasks. Finally,
getting help refers to using multiple computers to solve a particular task.
A review of the changes in computing eras is shown in Figure 1.1. Each computing
era started with a development in hardware architectures, followed by system
software (particularly compilers and operating systems) and applications,
and reached its zenith with the growth of Problem Solving Environments (PSEs).
Each component of a computing system undergoes three phases: R&D (Research
and Development), commercialization, and commodity. The technology behind the
development of computing system components in the sequential era has matured,
and similar developments are yet to happen in the parallel era. That is, parallel
computing technology needs to advance, as it is not mature enough to be exploited
as commodity technology.
The main reason for creating and using parallel computers is that parallelism
is one of the best ways to overcome the speed bottleneck of a single processor. In
addition, the price/performance ratio of a small cluster-based parallel computer is
much lower than that of a minicomputer, making it a better value. In
short, developing and producing systems of moderate speed using parallel architectures
is much cheaper than achieving the equivalent performance with a sequential system.
The remaining parts of this chapter focus on architecture alternatives for con-
structing parallel computers, motivations for transition to low cost parallel comput-
ing, a generic model of a cluster computer, commodity components used in building
clusters, cluster middleware, resource management and scheduling, programming
environments and tools, and representative cluster systems. The chapter ends with
a summary of hardware and software trends, and concludes with future cluster
technologies.
[Figure 1.1: The two eras of computing. In both the sequential and the parallel era, developments in architecture are followed by system software and applications, with each component passing through R&D, commercialization, and commodity phases.]
SMP systems today have from 2 to 64 processors and can be considered to have
a shared-everything architecture. In these systems, all processors share all the global
resources available (bus, memory, I/O system), and a single copy of the operating system
runs on them.
CC-NUMA is a scalable multiprocessor system having a cache-coherent nonuni-
form memory access architecture. Like an SMP, every processor in a CC-NUMA
system has a global view of all of the memory. This type of system gets its name
(NUMA) from the nonuniform times to access the nearest and most remote parts
of memory.
Distributed systems can be considered conventional networks of independent
computers. They have multiple system images, as each node runs its own operating
system, and the individual machines in a distributed system could be, for example,
combinations of MPPs, SMPs, clusters, and individual computers.
At a basic level, a cluster is a collection of workstations or PCs that are interconnected
via some network technology. For parallel computing purposes, a cluster
will generally consist of high performance workstations or PCs interconnected by
a high-speed network. A cluster works as an integrated collection of resources and
can have a single system image spanning all its nodes. Refer to [1] and [2] for a
detailed discussion of the architectural and functional characteristics of the competing
computer architectures.
applications to be developed, tested, and even run on a NOW, and then at a later
stage to be ported, with little modification, onto dedicated parallel platforms where
CPU-time is accounted and charged.
The following list highlights some of the reasons NOW is preferred over specialized
parallel computers [5], [4]:
• Individual workstations are becoming increasingly powerful. That is, workstation
performance has increased dramatically in the last few years and is
doubling every 18 to 24 months. This is likely to continue for several years,
with faster processors and more efficient multiprocessor machines coming into
the market.
• The communications bandwidth between workstations is increasing and latency
is decreasing as new networking technologies and protocols are implemented
in a LAN.
• Workstation clusters are easier to integrate into existing networks than special
parallel computers.
• Personal workstations typically have low user utilization.
• The development tools for workstations are more mature compared to the
contrasting proprietary solutions for parallel computers, mainly due to the
nonstandard nature of many parallel systems.
• Workstation clusters are a cheap and readily available alternative to specialized
high performance computing platforms.
• Clusters can be easily grown; a node's capability can be easily increased by
adding memory or additional processors.
Clearly, the workstation environment is better suited to applications that are not
communication-intensive since a LAN typically has high message start-up latencies
and low bandwidths. If an application requires higher communication performance,
the existing commonly deployed LAN architectures, such as Ethernet, are not ca-
pable of providing it.
Traditionally, in science and industry, a workstation referred to a UNIX plat-
form and the dominant function of PC-based machines was for administrative work
and word processing. There has been, however, a rapid convergence in proces-
sor performance and kernel-level functionality of UNIX workstations and PC-based
machines in the last three years (this can be attributed to the introduction of high
performance Pentium-based machines and the Linux and Windows NT operating
systems). This convergence has led to an increased level of interest in utilizing PC-
based systems as a cost-effective computational resource for parallel computing.
This factor coupled with the comparatively low cost of PCs and their widespread
availability in both academia and industry has helped initiate a number of software
projects whose primary aim is to harness these resources in some collaborative way.
1.4 Windows of Opportunity
The resources available in the average NOW, such as processors, network interfaces,
memory, and hard disks, offer a number of research opportunities, such as:
• Parallel Processing - Use the multiple processors to build MPP/DSM-like systems
for parallel computing.
• Network RAM - Use the memory associated with each workstation as an aggregate
DRAM cache; this can dramatically improve virtual memory and file system
performance.
• Software RAID (Redundant Array of Inexpensive Disks) - Use arrays
of workstation disks to provide cheap, highly available, and scalable file storage,
with the LAN serving as the I/O backplane. In
addition, it is possible to provide parallel I/O support to applications through
middleware such as MPI-IO.
• Multipath Communication - Use the multiple networks for parallel data transfer
between nodes.
Scalable parallel applications require good floating-point performance, low-latency
and high-bandwidth communications, scalable network bandwidth, and fast
access to files. Cluster software can meet these requirements by using the resources
associated with clusters. A file system supporting parallel I/O can be built using the
disks associated with each workstation instead of expensive hardware RAID.
Virtual memory performance can be drastically improved by using Network RAM
as a backing store instead of hard disk. In this way, parallel file systems and Network
RAM reduce the widening performance gap between processors and disks.
It is very common to connect cluster nodes using the standard Ethernet and spe-
cialized high performance networks such as Myrinet. These multiple networks can
be utilized for transferring data simultaneously across cluster nodes. The multipath
communication software performs demultiplexing of data at the transmitting end
across multiple networks and multiplexing of data at the receiving end. Thus, all
available networks can be utilized for faster communication of data between cluster
nodes.
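To make the demultiplexing/multiplexing step concrete, the following is a minimal C sketch that stripes a buffer across two already-connected TCP sockets, one per physical network, and reassembles it at the receiver. The helper names (stripe_send, stripe_recv, send_all, recv_all) are hypothetical, and the sketch ignores issues such as unequal link speeds, per-link sequencing, and failover; it illustrates only the core split-and-reassemble idea, not any particular multipath communication package.

/* Minimal striping sketch: split a buffer across two connected sockets
 * (one per physical network) and reassemble it at the receiver.
 * Socket setup and teardown are assumed to have been done elsewhere. */
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Send exactly len bytes on one socket. Returns 0 on success, -1 on error. */
static int send_all(int sock, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = send(sock, buf, len, 0);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Receive exactly len bytes from one socket. */
static int recv_all(int sock, char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = recv(sock, buf, len, 0);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Demultiplex at the sender: first half on link 0, second half on link 1. */
int stripe_send(int socks[2], const char *buf, size_t len)
{
    size_t half = len / 2;
    if (send_all(socks[0], buf, half) < 0)
        return -1;
    return send_all(socks[1], buf + half, len - half);
}

/* Multiplex at the receiver: reassemble the two halves into one buffer. */
int stripe_recv(int socks[2], char *buf, size_t len)
{
    size_t half = len / 2;
    if (recv_all(socks[0], buf, half) < 0)
        return -1;
    return recv_all(socks[1], buf + half, len - half);
}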
[Figure: Cluster computer architecture. Parallel applications run over cluster middleware (single system image and availability infrastructure), with each node contributing its own communications software and network interface hardware.]
The performance of applications also improves with the support of a scalable software
environment. Another benefit of clustering is a failover capability that allows
a backup computer to take over the tasks of a failed computer located in its cluster.
Clusters are classified into many categories based on various factors, as indicated
below.
1. Application Target - Computational science or mission-critical applications.
• High Performance (HP) Clusters
• High Availability (HA) Clusters
The main concentration of this book is on HP clusters and the technologies
and environments required for using them in parallel computing. However, we
also discuss issues involved in building HA clusters, with the aim of integrating
performance and availability into a single system (see Chapter 4).
2. Node Ownership - Owned by an individual or dedicated as a cluster node.
• Dedicated Clusters
• Nondedicated Clusters
The distinction between these two cases is based on the ownership of the nodes
in a cluster. In the case of dedicated clusters, a particular individual does not
own a workstation; the resources are shared so that parallel computing can be
performed across the entire cluster [6]. The alternative nondedicated case is
where individuals own workstations and applications are executed by stealing
idle CPU cycles [7]. The motivation for this scenario is based on the fact that
most workstation CPU cycles are unused, even during peak hours. Parallel
computing on a dynamically changing set of nondedicated workstations is
called adaptive parallel computing.
In nondedicated clusters, a tension exists between the workstation owners and
remote users who need the workstations to run their application. The former
expects fast interactive response from their workstation, while the latter is
only concerned with fast application turnaround by utilizing any spare CPU
cycles. This emphasis on sharing the processing resources erodes the concept
of node ownership and introduces the need for complexities such as process
migration and load balancing strategies. Such strategies allow clusters to
deliver adequate interactive performance as well as to provide shared resources
to demanding sequential and parallel applications.
3. Node Hardware - PC, Workstation, or SMP.
• Clusters of PCs (CoPs) or Piles of PCs (PoPs)
• Clusters of Workstations (COWs)
• Clusters of SMPs (CLUMPs)
4. Node Operating System - Linux, NT, Solaris, AIX, etc.
• Linux Clusters (e.g., Beowulf)
• Solaris Clusters (e.g., Berkeley NOW)
• NT Clusters (e.g., HPVM)
• AIX Clusters (e.g., IBM SP2)
• Digital VMS Clusters
• HP-UX Clusters
• Microsoft Wolfpack Clusters
5. Node Configuration - Node architecture and type of OS it is loaded with.
• Homogeneous Clusters: All nodes have similar architectures and run
the same OS.
• Heterogeneous Clusters: Nodes have different architectures and
run different OSs.
6. Levels of Clustering - Based on location of nodes and their count.
• Group Clusters (#nodes: 2-99): Nodes are connected by SANs (System
Area Networks) like Myrinet and they are either stacked into a frame or
exist within a center.
• Departmental Clusters (#nodes: 10s to 100s)
• Organizational Clusters (#nodes: many 100s)
• National Metacomputers (WAN/Internet-based) (#nodes: many
departmental/organizational systems or clusters)
• International Metacomputers (Internet-based) (#nodes: 1000s to many
millions)
Individual clusters may be interconnected to form a larger system (clusters of
clusters) and, in fact, the Internet itself can be used as a computing cluster. The
use of wide-area networks of computer resources for high performance computing
has led to the emergence of a new field called Metacomputing. (Refer to Chapter 7
for further details on Metacomputing.)
1.7.1 Processors
Over the past two decades, phenomenal progress has taken place in microprocessor
architecture (for example RISC, CISC, VLIW, and Vector) and this is making
single-chip CPUs almost as powerful as processors used in supercomputers. Most
recently, researchers have been trying to integrate the processor and memory or network
interface into a single chip. The Berkeley Intelligent RAM (IRAM) project [9]
is exploring the entire spectrum of issues involved in designing general purpose
computer systems that integrate a processor and DRAM onto a single chip, from
circuits, VLSI design, and architectures to compilers and operating systems. Digital,
with its Alpha 21364 processor, is trying to integrate processing, memory controller,
and network interface into a single chip.
Intel processors are most commonly used in PC-based computers. The cur-
rent generation Intel x86 processor family includes the Pentium Pro and II. These
processors, while not in the high range of performance, match the performance of
medium-level workstation processors [10]. In the high performance range, the Pentium
Pro shows very strong integer performance, beating Sun's UltraSPARC at
the same clock speed; however, its floating-point performance is much lower. The
Pentium II Xeon, like the newer Pentium IIs, uses a 100 MHz memory bus. It is
available with a choice of 512KB to 2MB of L2 cache, and the cache is clocked at
the same speed as the CPU, overcoming the L2 cache size and performance issues
of the plain Pentium II. The accompanying 450NX chipset for the Xeon supports
64-bit PCI busses that can support Gigabit interconnects.
Other popular processors include x86 variants (AMD x86, Cyrix x86), Digital
Alpha, IBM PowerPC, Sun SPARC, SGI MIPS, and HP PA. Computer systems
based on these processors have also been used as clusters; for example, Berkeley
NOW uses Sun's SPARC family of processors in its cluster nodes. (For further
information on industrial high performance microprocessors refer to the web-based VLSI
Microprocessors Guide [11].)
1.7.2 Memory and Cache
Originally, the memory present within a PC was 640 KBytes, usually `hardwired'
onto the motherboard. Typically, a PC today is delivered with between 32 and
64 MBytes installed in slots, with each slot holding a Single In-line Memory
Module (SIMM); the potential capacity of a PC is now many hundreds of MBytes.
Computer systems can use various types of memory, including Extended
Data Out (EDO) and fast page. EDO allows the next access to begin while the
previous data is still being read, and fast page allows multiple adjacent accesses to
be made more efficiently.
The amount of memory needed for the cluster is likely to be determined by the
cluster target applications. Programs that are parallelized should be distributed
such that the memory, as well as the processing, is distributed between processors
for scalability. Thus, it is not necessary for each system to have enough RAM to hold the entire
problem in memory, but there should be enough to avoid the occurrence
of too much swapping of memory blocks (page-misses) to disk, since disk access has
a large impact on performance.
Access to DRAM is extremely slow compared to the speed of the processor,
taking up to orders of magnitude more time than a CPU clock cycle. Caches
are used to keep recently used blocks of memory for very fast access if the CPU
references a word from that block again. However, the very fast memory used for
cache is expensive and cache control circuitry becomes more complex as the size of
the cache grows. Because of these limitations, the total size of a cache is usually in
the range of 8KB to 2MB.
Within Pentium-based machines it is not uncommon to have a 64-bit wide mem-
ory bus as well as a chip set that supports 2 MBytes of external cache. These
improvements were necessary to exploit the full power of the Pentium and to make
the memory architecture very similar to that of UNIX workstations.
1.7.3 Disk and I/O
Improvements in disk access time have not kept pace with microprocessor per-
formance, which has been improving by 50 percent or more per year. Although
magnetic media densities have increased, reducing disk transfer times by approxi-
mately 60 to 80 percent per year, overall improvement in disk access times, which
rely upon advances in mechanical systems, has been less than 10 percent per year.
Grand challenge applications often need to process large amounts of data and
data sets. Amdahl's law implies that the speedup obtained from faster processors is
limited by the slowest system component; therefore, it is necessary to improve I/O
performance such that it balances with CPU performance. One way of improving
I/O performance is to carry out I/O operations in parallel, which is supported by
parallel file systems based on hardware or software RAID. Since hardware RAIDs
can be expensive, software RAIDs can be constructed by using the disks associated with
each workstation in the cluster.
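As a hedged illustration of parallel I/O over such a software RAID or parallel file system, the sketch below uses the standard MPI-IO interface (the MPI-IO middleware mentioned earlier) to let each process write its own block of a shared file at a distinct offset. It assumes an MPI implementation with MPI-IO support is installed on the cluster; how the underlying file system stripes the blocks across workstation disks is left to that layer, and the file name and block size are arbitrary.

/* Each process writes its own contiguous block of a shared file in parallel. */
#include <mpi.h>
#include <stdio.h>

#define BLOCK 1024  /* doubles per process; illustrative size only */

int main(int argc, char **argv)
{
    int rank;
    double buf[BLOCK];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < BLOCK; i++)
        buf[i] = rank + i * 1e-6;   /* fill with some per-rank data */

    /* Open (create) the shared file collectively. */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset, so the writes proceed in parallel. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at(fh, offset, buf, BLOCK, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

A run such as mpirun -np 4 ./a.out produces a single file containing one block per process, with the writes proceeding concurrently.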
1.7.4 System Bus
The initial PC bus (AT, now known as the ISA bus) was clocked at 5 MHz
and was 8 bits wide. When first introduced, its abilities were well matched to the
rest of the system. PCs are modular systems and, until fairly recently, only the
processor and memory were located on the motherboard; other components were
typically found on daughter cards connected via a system bus. The performance of
PCs has increased by orders of magnitude since the ISA bus was first used, and it
has consequently become a bottleneck which has limited machine throughput.
The ISA bus was extended to be 16 bits wide and was clocked in excess of 13 MHz.
This, however, is still not sufficient to meet the demands of the latest CPUs, disk
interfaces, and other peripherals.
A group of PC manufacturers introduced the VESA local bus, a 32-bit bus that
matched the system's clock speed. The VESA bus has largely been superseded by
the Intel-created PCI bus, which allows 133 Mbytes/s transfers and is used inside
Pentium-based PCs. PCI has also been adopted for use in non-Intel based platforms
such as the Digital AlphaServer range. This has further blurred the distinction
between PCs and workstations, as the I/O subsystem of a workstation may be built
from commodity interface and interconnect cards.
1.7.5 Cluster Interconnects
The nodes in a cluster communicate over high-speed networks using a standard net-
working protocol such as TCP/IP or a low-level protocol such as Active Messages.
In most facilities it is likely that the interconnection will be via standard Ethernet.
In terms of performance (latency and bandwidth), this technology is showing
its age. However, Ethernet is a cheap and easy way to provide file and printer
sharing. A single Ethernet connection cannot be used seriously as the basis for
cluster-based computing; its bandwidth and latency are not balanced compared to
the computational power of the workstations now available. Typically, one would
expect the cluster interconnect bandwidth to exceed 10 MBytes/s and have message
latencies of less than 100 µs. A number of high performance network technologies
are available in the marketplace; in this section we discuss a few of them.
Ethernet, Fast Ethernet, and Gigabit Ethernet
Standard Ethernet has become almost synonymous with workstation networking.
This technology is in widespread usage, both in the academic and commercial sec-
tors. However, its 10 Mbps bandwidth is no longer sufficient for use in environments
where users are transferring large data quantities or there are high traffic densities.
An improved version, commonly known as Fast Ethernet, provides 100 Mbps bandwidth
and has been designed to provide an upgrade path for existing Ethernet
installations. Standard and Fast Ethernet cannot coexist on a particular cable,
but each uses the same cable type. When an installation is hub-based and uses
twisted pair, it is possible to upgrade the hub to one which supports both standards,
and to replace the Ethernet cards in only those machines where it is believed
to be necessary.
Now, the state-of-the-art Ethernet is Gigabit Ethernet (Ethernet, only faster!), and its attraction
is largely due to two key characteristics. First, it preserves Ethernet's simplicity
while enabling a smooth migration to Gigabit-per-second (Gbps) speeds. Second,
it delivers very high bandwidth to aggregate multiple Fast Ethernet segments
and to support high-speed server connections, switched intrabuilding backbones,
interswitch links, and high-speed workgroup networks.
Asynchronous Transfer Mode (ATM)
ATM is a switched virtual-circuit technology and was originally developed for the
telecommunications industry [12]. It is embodied within a set of protocols and standards
defined by the International Telecommunications Union. The international
ATM Forum, a non-profit organization, continues this work. Unlike some other
networking technologies, ATM is intended to be used for both LANs and WANs,
presenting a unified approach to both. ATM is based around small fixed-size data
packets termed cells. It is designed to allow cells to be transferred using a number
of different media, such as copper wire and fiber optic cables. This hardware
variety also results in a number of different interconnect performance levels.
When first introduced, ATM used optical fiber as the link technology. However,
this is undesirable in desktop environments; for example, where twisted pair cables
have been used to interconnect a networked environment, moving to fiber-based
ATM would mean an expensive upgrade. The two most common cabling technologies
found in a desktop environment are telephone-style cables (CAT-3) and a better
quality cable (CAT-5). CAT-5 can be used with ATM, allowing upgrades of existing
networks without replacing cabling.
Scalable Coherent Interface (SCI)
SCI is an IEEE 1596-1992 standard aimed at providing a low-latency distributed
shared memory across a cluster [13]. SCI is the modern equivalent of a Processor-
Memory-I/O bus and LAN combined. It is designed to support distributed multi-
processing with high bandwidth and low latency. It provides a scalable architecture
that allows large systems to be built out of many inexpensive mass-produced com-
ponents.
SCI is a point-to-point architecture with directory-based cache coherence. It
can reduce the delay of interprocessor communications even when compared to the
newest and best technologies currently available, such as Fibre Channel and ATM.
SCI achieves this by eliminating the need for runtime layers of software protocol-paradigm
translation. A remote communication in SCI takes place as just part of
a simple load or store operation in a processor. Typically, a remote address results in
a cache miss. This in turn causes the cache controller to address remote memory
via SCI to get the data. The data is fetched into the cache with a delay on the order
of a few µs, and then the processor continues execution.
Dolphin currently produces SCI cards for SPARC's SBus; however, they have
also announced the availability of PCI-based SCI cards. They have produced an SCI
MPI which offers less than 12 µs zero message-length latency on the Sun SPARC
platform, and they intend to provide MPI for Windows NT. An SCI version of High
Performance Fortran (HPF) is available from the Portland Group, Inc.
Although SCI is favored in terms of fast distributed shared memory support, it
has not been taken up widely because its scalability is constrained by the current
generation of switches and its components are relatively expensive.
Myrinet
Myrinet is a 1.28 Gbps full duplex interconnection network supplied by Myricom
[15]. It is a proprietary, high performance interconnect. Myrinet uses low-latency
cut-through routing switches, which are able to offer fault tolerance by automatic
mapping of the network configuration. This also simplifies setting up the
network. Myrinet supports both Linux and NT. In addition to TCP/IP support, the
MPICH implementation of MPI is also available on a number of custom-developed
packages such as Berkeley Active Messages, which provide sub-10 µs latencies.
Myrinet is relatively expensive when compared to Fast Ethernet, but has real advantages
over it: very low latency (5 µs, one-way point-to-point), very high throughput,
and a programmable on-board processor allowing for greater flexibility. It can
saturate the effective bandwidth of a PCI bus at almost 120 MBytes/s with 4 KByte
packets.
One of the main disadvantages of Myrinet is, as mentioned, its price compared
to Fast Ethernet. The cost of Myrinet-LAN components, including the cables and
switches, is in the range of $1,500 per host. Also, switches with more than 16 ports
are unavailable, so scaling can be complicated, although switch chaining is used to
construct larger Myrinet clusters.
Linux
Linux [16] is a UNIX-like OS which was initially developed by Linus Torvalds, a
Finnish undergraduate student, in 1991-92. The original releases of Linux relied
heavily on the Minix OS; however, the efforts of a number of collaborating programmers
have resulted in the development and implementation of a robust and
reliable, POSIX compliant, OS.
Although Linux was developed by a single author initially, a large number of au-
thors are now involved in its development. One major advantage of this distributed
development has been that there is a wide range of software tools, libraries, and
utilities available. This is due to the fact that any capable programmer has access
to the OS source and can implement the feature that they wish. Linux quality
control is maintained by only allowing kernel releases from a single point, and its
availability via the Internet helps in getting fast feedback about bugs and other
problems. The following are some advantages of using Linux:
• Linux runs on cheap x86 platforms, yet offers the power and flexibility of
UNIX.
• Linux is readily available on the Internet and can be downloaded without cost.
• It is easy to fix bugs and improve system performance.
• Users can develop or fine-tune hardware drivers which can easily be made
available to other users.
Linux provides the features typically found in UNIX implementations such as:
preemptive multitasking, demand-paged virtual memory, multiuser, and multipro-
cessor support [17]. Most applications written for UNIX will require little more
than a recompilation. In addition to the Linux kernel, a large amount of applica-
tion/systems software is also freely available, including GNU software and XFree86,
a public domain X-server.
Solaris
The Solaris operating system from SunSoft is a UNIX-based multithreaded and
multiuser operating system. It supports Intel x86 and SPARC-based platforms. Its
networking support includes a TCP/IP protocol stack and layered features such as
Remote Procedure Calls (RPC), and the Network File System (NFS). The Solaris
programming environment includes ANSI-compliant C and C++ compilers, as well
as tools to profile and debug multithreaded programs.
The Solaris kernel supports multithreading, multiprocessing, and has real-time
scheduling features that are critical for multimedia applications. Solaris supports
two kinds of threads: Light Weight Processes (LWPs) and user level threads. The
threads are intended to be sufficiently lightweight so that there can be thousands
present and that synchronization and context switching can be accomplished rapidly
without entering the kernel.
Solaris, in addition to the BSD file system, also supports several types of non-BSD
file systems to increase performance and ease of use. For performance there
are three new file system types: CacheFS, AutoClient, and TmpFS. The CacheFS
caching file system allows a local disk to be used as an operating system managed
cache of either remote NFS disk or CD-ROM file systems. With AutoClient and
CacheFS, an entire local disk can be used as cache. The TmpFS temporary file
system uses main memory to contain a file system. In addition, there are other
file systems like the Proc file system and Volume file system to improve system
usability.
Solaris supports distributed computing and is able to store and retrieve distributed
information to describe the system and users through the Network Information
Service (NIS) and its database. The Solaris GUI, OpenWindows, is a combination
of X11R5 and the Adobe PostScript system, which allows applications to be
run on remote systems with the display shown alongside local applications.
Microsoft Windows NT
Microsoft Windows NT (New Technology) is a dominant operating system in the
personal computing marketplace [18]. It is a preemptive, multitasking, multiuser,
32-bit operating system. NT supports multiple CPUs and provides multitasking
using symmetrical multiprocessing. Each 32-bit NT application operates in its own
virtual memory address space. Unlike earlier versions (such as Windows for Workgroups
and Windows 95/98), NT is a complete operating system, and not an addition
to DOS. NT supports different CPUs and multiprocessor machines with
threads. NT has an object-based security model and its own special file system
(NTFS) that allows permissions to be set on a file and directory basis.
A schematic diagram of the NT architecture is shown in Figure 1.3. NT has the
network protocols and services integrated with the base operating system.
[Figure 1.3: Windows NT architecture. Applications and protected subsystems (e.g., POSIX, OS/2) run above the operating system's I/O and graphics services, with a hardware abstraction layer between the operating system and the underlying hardware.]
provide the cluster with important quality of service parameters, such as latency,
bandwidth, reliability, fault-tolerance, and jitter control. Typically, the network
services are designed as a hierarchical stack of protocols. In such a layered system
each protocol layer in the stack exploits the services provided by the protocols below
it in the stack. The classic example of such a network architecture is the ISO OSI
7-layer system.
Traditionally, the operating system services (pipes/sockets) have been used for
communication between processes in message passing systems. As a result, commu-
nication between source and destination involves expensive operations, such as the
passing of messages between many layers, data copying, protection checking, and
reliable communication measures. Often, clusters with a special network or switch,
like Myrinet, use lightweight communication protocols such as Active Messages for
fast communication among their nodes. They potentially bypass the operating system
and thus remove the critical communication overheads, providing direct, user-level
access to the network interface.
Often in clusters, the network services will be built from a relatively low-level
communication API (Application Programming Interface) that can be used to sup-
port a wide range of high-level communication libraries and protocols. These mech-
anisms provide the means to implement a wide range of communications methodolo-
gies, including RPC, DSM, and stream-based and message passing interfaces such
as MPI and PVM. (A further discussion of communications and network protocols
can be found in Chapter 10.)
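To make this layering concrete, here is a minimal MPI point-to-point exchange in C. The application sees only the high-level MPI calls; whether the library maps them onto sockets, Active Messages, or some other low-level communication layer is hidden from it. The sketch assumes an MPI implementation is installed and that at least two processes are launched.

/* Rank 0 sends a message to rank 1, which replies; the application is
 * oblivious to whichever low-level transport the MPI library uses. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size;
    char msg[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            fprintf(stderr, "run with at least two processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        strcpy(msg, "ping");
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 received: %s\n", msg);
    } else if (rank == 1) {
        MPI_Recv(msg, sizeof(msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        strcpy(msg, "pong");
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The same program runs unchanged whether the cluster uses Ethernet with TCP/IP or a Myrinet-class network with a lightweight protocol underneath, which is precisely the portability the layered approach is meant to provide.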
It should also be noted that programming and runtime systems like PVM can also
serve as cluster middleware.
The SSI layers support both cluster-aware (such as parallel applications de-
veloped using MPI) and non-aware applications (typically sequential programs).
These applications (cluster-aware, in particular) demand operational transparency
and scalable performance (i.e., when cluster capability is enhanced, they need to
run faster). Clusters, at one operational extreme, act like an SMP or MPP system
with a high degree of SSI, and at another they can function as a distributed system
with multiple system images.
The SSI and system availability services play a major role in the success of
clusters. In the following section, we briefly discuss the layers supporting this
infrastructure. A detailed discussion on cluster infrastructure can be found in the
rest of the chapter with suitable pointers for further information.
Hardware Layer
Systems such as Digital's (DEC's) Memory Channel and hardware DSM offer SSI
at the hardware level and allow the user to view the cluster as a shared memory system.
Digital's Memory Channel, a dedicated cluster interconnect, provides virtual shared
memory among nodes by means of internodal address space mapping. (Refer to
Chapter 9 for further discussion of the DEC Memory Channel.)
Operating System Kernel (Underware) or Gluing Layer
Cluster operating systems support an efficient execution of parallel applications in
an environment shared with sequential applications. A goal is to pool resources in a
cluster to provide better performance for both sequential and parallel applications.
To realize this goal, the operating system must support gang-scheduling of parallel
programs, identify idle resources in the system (such as processors, memory, and
networks), and offer globalized access to them. It has to support process migration
for dynamic load balancing and fast interprocess communication for both the system
and user-level applications. The OS must make these features available to
the user without the need for new system calls or commands, while keeping the same
syntax. OS kernels supporting SSI include SCO UnixWare and Sun Solaris-MC.
A full cluster-wide SSI allows all physical resources and kernel resources to be
visible and accessible from all nodes within the system. Full SSI can be achieved as
underware (SSI at the OS level); in other words, each node's OS kernel cooperates to
present the same view from all kernel interfaces on all nodes.
Full SSI at the kernel level can save time and money, because existing programs
and applications do not have to be rewritten to work in this new environment. In
addition, these applications will run on any node without administrative setup, and
processes can be migrated to balance load between the nodes and also to support
fault tolerance if necessary.
Most of the operating systems that support an SSI are built as a layer on top
of the existing operating systems and perform global resource allocation. This
strategy makes the system easily portable, tracks vendor software upgrades, and
reduces development time. Berkeley GLUnix follows this philosophy and proves that
new systems can be built quickly by mapping new services onto the functionality
provided by the layer underneath.
Applications and Subsystems Layer (Middleware)
SSI can also be supported by applications and subsystems, which present multiple,
cooperating components of an application to the user/administrator as a single
application. Application-level SSI is the highest level and, in a sense, the most important,
because this is what the end user sees. For instance, a cluster administration tool
offers SSI services such as a single point of management and control. These can be built as
GUI-based tools offering a single window for the monitoring and control of the cluster
as a whole, individual nodes, or specific system components.
The subsystems offer a software means for creating an easy-to-use and efficient
cluster system. Run-time systems, such as cluster file systems, make disks attached
to cluster nodes appear as a single large storage system. SSI offered by
file systems ensures that every node in the cluster has the same view of the data.
Global job scheduling systems manage resources and enable the scheduling of system
activities and execution of applications while offering high availability services
transparently.
1.9.2 SSI Boundaries
A key that provides structure to the SSI lies in noting the following points [1]:
• Every single system image has a boundary, and
• Single system image support can exist at different levels within a system, one
able to be built on another.
For instance, a subsystem (resource management systems like LSF and CO-
DINE) can make a collection of interconnected machines appear as one big machine.
When any operation is performed within the SSI boundary of the subsystem, it pro-
vides an illusion of a classical supercomputer. But if anything is performed outside
its SSI boundary, the cluster appears to be just a bunch of connected comput-
ers. Another subsystem/application can make the same set of machines appear
as a large database/storage system. For instance, a cluster file system built using
local disks associated with nodes can appear as a large storage system (software
RAID)/parallel file system and offer faster access to the data.
1.9.3 Middleware Design Goals
The design goals of cluster-based systems are mainly focused on complete trans-
parency in resource management, scalable performance, and system availability in
supporting user applications.
Complete Transparency
The SSI layer must allow the user to use a cluster easily and effectively without
knowledge of the underlying system architecture. The operating environment
appears familiar (by providing the same look and feel as the existing system) and
is convenient to use. The user is provided with the view of a globalized file system,
processes, and network. For example, in a cluster with a single entry point, the
user can log in at any node, and the system administrator can install or load software
on any node and have it visible across the entire cluster. Note that on distributed
systems, one needs to install the same software on each node. The details of resource
management and control activities such as resource allocation, de-allocation, and
replication are invisible to user processes. This allows the user to access system
resources such as memory, processors, and the network transparently, irrespective
of whether they are available locally or remotely.
Scalable Performance
As clusters can easily be expanded, their performance should scale as well. This
scalability should happen without the need for new protocols and APIs. To extract
the maximum performance, the SSI service must support load balancing and parallelism
by distributing the workload evenly among nodes. For instance, the single point of
entry should distribute ftp/remote exec/login requests to lightly loaded nodes. The
cluster must offer these services with small overhead and must also ensure that the time
required to execute an operation on the cluster is not larger than on a
single workstation (assuming cluster nodes and workstations have similar configurations).
Enhanced Availability
The middleware services must be highly available at all times. At any time, a point
of failure should be recoverable without affecting a user's application. This can be
achieved by employing checkpointing and fault-tolerant technologies (hot standby,
mirroring, failover, and failback services) to enable rollback recovery.
When SSI services are offered using the resources available on multiple nodes,
failure of any node should not affect the system's operation, and a particular service
should support one or more of the design goals. For instance, when a file system is
distributed among many nodes with a certain degree of redundancy and a node
fails, that portion of the file system can be migrated to another node transparently.
node without the loss of computation. Process migration allows for dynamic
load balancing among the cluster nodes.
1.11.1 Threads
Threads are a popular paradigm for concurrent programming on uniprocessor as
well as multiprocessor machines. On multiprocessor systems, threads are primarily
used to utilize all the available processors simultaneously. In uniprocessor systems,
threads are used to utilize the system resources effectively. This is achieved by
exploiting the asynchronous behavior of an application to overlap computation
and communication. Multithreaded applications offer quicker response to user input
and run faster. Unlike forked processes, threads are cheaper to create and easier to
manage. Threads communicate using shared variables, as they are created within
their parent process's address space.
Threads are potentially portable, as there exists an IEEE standard for the POSIX
threads interface, popularly called pthreads. The POSIX standard multithreading
interface is available on PCs, workstations, SMPs, and clusters [21]. A programming
language such as Java has built-in multithreading support, enabling easy development
of multithreaded applications. Threads have been extensively used in
developing both application and system software (including the environment used
to create this chapter and the book as a whole!).
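As a small illustration of the pthreads interface mentioned above, the following sketch creates several threads that communicate through a shared counter protected by a mutex. It is a generic POSIX threads example rather than anything cluster-specific, and builds with a command such as cc -pthread.

/* POSIX threads sketch: several threads increment a shared counter,
 * communicating through a shared variable guarded by a mutex. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;                          /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);                /* protect the shared update */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
    return 0;
}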
the cache coherence schemes, and increased hardware complexity. Two examples of
hardware DSM systems are DASH [26] and Merlin [27].
1.11.4 Parallel Debuggers and Profilers
To develop correct and efficient high performance applications it is highly desirable
to have some form of easy-to-use parallel debugger and performance profiling tools.
Most vendors of HPC systems provide some form of debugger and performance
analyzer for their platforms. Ideally, these tools should be able to work in a heterogeneous
environment, thus making it possible to develop and implement a parallel
application on, say, a NOW, and then actually do production runs on a dedicated
HPC platform, such as the Cray T3E.
Debuggers
The number of parallel debuggers that are capable of being used in a cross-platform,
heterogeneous development environment is very limited. Therefore, in 1996 an
effort was begun to define a cross-platform parallel debugging standard that specified
the features and interface users wanted. The High Performance Debugging Forum
(HPDF) was formed as a Parallel Tools Consortium project [28]. The forum has
developed the HPD Version 1 specification, which defines the functionality, semantics,
and syntax for a command-line parallel debugger. Ideally, a parallel debugger should
be capable of:
• Managing multiple processes and multiple threads within a process.
• Displaying each process in its own window.
• Displaying source code, stack trace, and stack frame for one or more processes.
• Diving into objects, subroutines, and functions.
• Setting both source-level and machine-level breakpoints.
• Sharing breakpoints between groups of processes.
• Defining watch and evaluation points.
• Displaying arrays and their slices.
• Manipulating code variables and constants.
TotalView
TotalView is a commercial product from Dolphin Interconnect Solutions [29]. It
is currently the only widely available GUI-based parallel debugger that supports
multiple HPC platforms. TotalView supports the most commonly used scientific languages
(C, C++, F77/F90, and HPF), message passing libraries (MPI and PVM),
and operating systems (SunOS/Solaris, IBM AIX, Digital UNIX, and SGI IRIX).
Even though TotalView can run on multiple platforms, it can only be used in homogeneous
environments; namely, each process of the parallel application being
debugged must be running under the same version of the OS.
1.11.5 Performance Analysis Tools
The basic purpose of performance analysis tools is to help a programmer to un-
derstand the performance characteristics of an application. In particular, it should
analyze and locate parts of an application that exhibit poor performance and create
program bottlenecks. Such tools are useful for understanding the behavior of nor-
mal sequential applications and can be enormously helpful when trying to analyze
the performance characteristics of parallel applications.
Most performance monitoring tools consist of some or all of the following com-
ponents:
• A means of inserting instrumentation calls to the performance monitoring
routines into the user's application.
• A run-time performance library that consists of a set of monitoring routines
that measure and record various aspects of a program's performance.
• A set of tools for processing and displaying the performance data.
A particular issue with performance monitoring tools is the intrusiveness of the
tracing calls and their impact on the application's performance. It is very important
to note that instrumentation affects the performance characteristics of the parallel
application and thus can provide a false view of its performance behavior. Table 1.3
shows the most commonly used tools for performance analysis of message passing
programs.
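The sketch below illustrates the instrumentation idea in its simplest form: hand-inserted calls around a region of interest record wall-clock timestamps to a trace file, which a separate processing and display tool (not shown) could later aggregate. Real profilers insert such probes automatically and record far more detail; the function and file names here (now_seconds, trace_event, trace.log) are purely illustrative.

/* Hand-inserted instrumentation sketch: record wall-clock time spent in an
 * instrumented region and append one line per event to a trace file. */
#include <stdio.h>
#include <sys/time.h>

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Monitoring "library": log region name, start time, and elapsed time. */
static void trace_event(FILE *trace, const char *region, double start, double end)
{
    fprintf(trace, "%s %.6f %.6f\n", region, start, end - start);
}

static void compute_kernel(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 10000000L; i++)   /* stand-in for real work */
        x += i * 1e-9;
}

int main(void)
{
    FILE *trace = fopen("trace.log", "w");
    if (!trace)
        return 1;

    double t0 = now_seconds();             /* instrumentation call: enter */
    compute_kernel();
    double t1 = now_seconds();             /* instrumentation call: exit */
    trace_event(trace, "compute_kernel", t0, t1);

    fclose(trace);
    return 0;
}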
1.11.6 Cluster Administration Tools
Monitoring clusters is a challenging task that can be eased by tools that allow entire
clusters to be observed at different levels using a GUI. Good management software
is crucial for exploiting a cluster as a high performance computing platform.
There are many projects investigating system administration of clusters that
support parallel computing, including Berkeley NOW [4], SMILE [30] (Scalable
Multicomputer Implementation using Low-cost Equipment), and PARMON [31].
The Berkeley NOW system administration tool gathers and stores data in a relational
database. It uses a Java applet to allow users to monitor a system from their
browser. The SMILE administration tool is called K-CAP. Its environment consists
of compute nodes (these execute the compute-intensive tasks), a management node
(a file server and cluster manager as well as a management console), and a client
that can control and monitor the cluster. K-CAP uses a Java applet to connect
to the management node through a predefined URL address in the cluster. The
Node Status Reporter (NSR) provides a standard mechanism for measurement and
access to status information of clusters [32]. Parallel applications/tools can access
NSR through the NSR Interface. PARMON is a comprehensive environment for
monitoring large clusters. It uses client-server techniques to provide transparent access
to all nodes to be monitored. The two major components of PARMON are the
Interprocess Communication
Active Messages (AM) is the basic communications primitive in Berkeley NOW.
It generalizes previous AM interfaces to support a broader spectrum of applications
(such as client/server programs, file systems, and operating systems) and to provide
continuous support for parallel programs. AM communication is essentially
a simplified remote procedure call that can be implemented efficiently on a wide
range of hardware. NOW includes a collection of low-latency, parallel communication
primitives: Berkeley Sockets, Fast Sockets, shared address space parallel C
(Split-C), and MPI.
[Figures: the Berkeley NOW communication architecture (parallel applications over Active Messages on each node's network interface hardware) and the layering of applications and sockets over Fast Messages.]
FM contains functions for sending long and short messages and for extracting messages
from the network. The services provided by FM guarantee and control the
memory hierarchy that FM presents to software built with it. FM also guarantees
reliable and ordered packet delivery, as well as control over the scheduling of
communication work.
The FM interface was originally developed on a Cray T3D and a cluster of
SPARCstations connected by Myrinet hardware. Myricom's Myrinet hardware is a
programmable network interface card capable of providing 160 MBytes/s links with
switch latencies of under a µs. FM has a low-level software interface that delivers
hardware communication performance; however, higher-level interfaces offer
greater functionality, application portability, and ease of use.
1.13.3 The Beowulf Project
The Beowulf project's [6] aim was to investigate the potential of PC clusters for
performing computational tasks. Beowulf uses the term Pile-of-PCs (PoPC) to describe
a loose ensemble or cluster of PCs, which is similar to a COW or NOW. PoPC
emphasizes the use of mass-market commodity components, dedicated processors
(rather than stealing cycles from idle workstations), and the use of a private com-
munications network. An overall goal of Beowulf is to achieve the `best' overall
system cost/performance ratio for the cluster.
System Software
The collection of software tools being developed and evolving within the Beowulf
project is known as Grendel. These tools provide resource management and support
for distributed applications. The Beowulf distribution includes several programming
environments and development libraries as separate packages. These include
PVM, MPI, and BSP, as well as SYS V-style IPC and pthreads.
The communication between processors in Beowulf is through TCP/IP over the
Ethernet internal to the cluster. The performance of interprocessor communications
is, therefore, limited by the performance characteristics of the Ethernet and the
system software managing message passing. Beowulf has been used to explore the
feasibility of employing multiple Ethernet networks in parallel to satisfy the internal
data transfer bandwidths required. Each Beowulf workstation has user-transparent
access to multiple parallel Ethernet networks. This architecture was achieved by
`channel bonding' techniques implemented as a number of enhancements to the
Linux kernel. The Beowulf project has shown that up to three networks can be
ganged together to obtain significant throughput, thus validating their use of the
channel bonding technique. New network technologies, such as Fast Ethernet, will
ensure even better interprocessor communications performance.
In the interests of presenting a uniform system image to both users and appli-
cations, Beowulf has extended the Linux kernel to allow a loose ensemble of nodes
to participate in a number of global namespaces. In a distributed scheme it is often
convenient for processes to have a PID that is unique across an entire cluster, spanning
several kernels. Beowulf implements two Global Process ID (GPID) schemes.
The first is independent of external libraries. The second, GPID-PVM, is designed
to be compatible with PVM Task ID format and uses PVM as its signal trans-
port. While the GPID extension is sufficient for cluster-wide control and signaling
of processes, it is of little use without a global view of the processes. To this end,
the Beowulf project is developing a mechanism that allows unmodied versions of
standard UNIX utilities (e.g., ps) to work across a cluster.
[Figure: Solaris MC architecture. Applications run over the Solaris MC layer, which globalizes the filesystem, processes, and networking across the cluster's nodes.]
for commands such as 'ps' and for the debuggers. It supports remote execution,
which allows new processes to be started on any node in the system.
Solaris MC also globalizes its support for networking and I/O. It allows more
than one network connection and provides support to multiplex between arbitrary
network links.
systems provide a wide variety of tools and utilities commonly used to develop, test,
and run parallel applications. These include various high-level APIs for message
passing and shared-memory programming.
and well known, rather than an exhaustive list of all those available.
All the projects discussed claim to consist of commodity components. Although
this is true, one could argue that true commodity technologies would
be those that are pervasive at most academic or industrial sites. If this were the
case, then true commodity would mean PCs running Windows 95 with standard 10
Mbps Ethernet. However, when considering parallel applications with demanding
computational and network needs, this type of low-end cluster would be incapable
of providing the resources needed.
Each of the projects discussed tries, in a slightly different way, to overcome the
bottlenecks that arise when cluster-based systems are used to run demanding parallel
applications. Without fail, however, the main bottleneck is not the computational
resource (be it a PC or UNIX workstation); rather, it is the provision of a low-latency,
high-bandwidth interconnect and an efficient low-level communications protocol on
which to build high-level APIs.
The Beowulf project explores the use of multiple standard Ethernet cards to
overcome the communications bottleneck, whereas Berkeley NOW and HPVM use
programmable Myrinet cards and AM/FM communications protocols. Solaris MC
uses Myrinet NICs and TCP/IP. The choice of the best solution cannot
be based on performance alone; the cost per node of providing the NIC should also be
considered. For example, a standard Ethernet card costs less than $100, whereas
Myrinet cards cost in excess of $1,000 each. Another factor that must also be considered
in this equation is the availability of Fast Ethernet and the advent of Gigabit
Ethernet. It seems that Ethernet technologies are likely to be more mainstream,
mass produced, and consequently cheaper than specialized network interfaces. As
an aside, all the projects that have been discussed are in the vanguard of the clus-
ter computing revolution and their research is helping the following army determine
which are the best techniques and technologies to adopt.
Acknowledgments
We thank Dan Hyde, Toni Cortes, Lars Rzymianowicz, Marian Bubak, Krzysztof
Sowa, Lori Pollock, Jay Fenwick, Eduardo Pinheiro, and Miguel Barreiro Paz for
their comments and suggestions on this chapter.
1.16 Bibliography
[1] G. Pfister. In Search of Clusters. Prentice Hall PTR, NJ, 2nd Edition, 1998.
[2] K. Hwang and Z. Xu. Scalable Parallel Computing: Technology, Architecture, Programming. WCB/McGraw-Hill, NY, 1998.
[3] C. Koelbel et al. The High Performance Fortran Handbook. The MIT Press, Massachusetts, 1994.
[4] T. Anderson, D. Culler, and D. Patterson. A Case for Networks of Workstations. IEEE Micro, Feb. 1995. http://now.cs.berkeley.edu/
[5] M.A. Baker, G.C. Fox, and H.W. Yau. Review of Cluster Management Software. NHSE Review, May 1996. http://www.nhse.org/NHSEreview/CMS/
[6] The Beowulf Project. http://www.beowulf.org
[7] QUT Gardens Project. http://www.t.qut.edu.au/CompSci/PLAS/
[8] MPI Forum. http://www.mpi-forum.org/docs/docs.html
[9] The Berkeley Intelligent RAM Project. http://iram.cs.berkeley.edu/
[10] The Standard Performance Evaluation Corporation (SPEC). http://open.specbench.org
[11] Russian Academy of Sciences. VLSI Microprocessors: A Guide to High Performance Microprocessors. http://www.microprocessor.sscc.ru/
[12] ATM Forum. ATM User Level Network Interface Specification. Prentice Hall, NJ, June 1995.
[13] SCI Association. http://www.SCIzzL.com/
[14] MPI-FM: MPI for Fast Messages. http://www-csag.cs.uiuc.edu/projects/comm/mpi-fm.html
[15] N. Boden et al. Myrinet - A Gigabit-per-Second Local-Area Network. IEEE Micro, February 1995. http://www.myri.com/
[16] The Linux Documentation Project. http://sunsite.unc.edu/mdw/linux.html
[17] Parallel Processing using Linux. http://yara.ecn.purdue.edu/pplinux/
[18] H. Custer. Inside Windows NT. Microsoft Press, NY, 1993.
[19] K. Hwang et al. Designing SSI Clusters with Hierarchical Checkpointing and Single I/O Space. IEEE Concurrency, vol. 7(1), Jan.-March 1999.
[20] J. Jones and C. Bricknell. Second Evaluation of Job Scheduling Software. http://science.nas.nasa.gov/Pubs/TechReports/NASreports/NAS-97-013/
[21] F. Mueller. On the Design and Implementation of DSM-Threads. In Proceedings of the PDPTA'97 Conference, Las Vegas, USA, 1997.
[22] The PVM project. http://www.epm.ornl.gov/pvm/
[23] mpiJava Wrapper. http://www.npac.syr.edu/projects/prpc/mpiJava/, Aug. 1998.
[24] TreadMarks. http://www.cs.rice.edu/willy/TreadMarks/overview.html
[25] N. Carriero and D. Gelernter. Linda in Context. Communications of the ACM, April 1989.
[26] D. Lenoski et al. The Stanford DASH Multiprocessor. IEEE Computer, March 1992.
[27] C. Maples and L. Wittie. Merlin: A Superglue for Multiprocessor Systems. In Proceedings of COMPCON'90, March 1990.
[28] Parallel Tools Consortium project. http://www.ptools.org/
[29] Dolphin Interconnect Solutions. http://www.dolphinics.no/
[30] P. Uthayopas et al. Building a Resources Monitoring System for SMILE Beowulf Cluster. In Proceedings of the HPC Asia'98 Conference, Singapore, 1998.
[31] R. Buyya et al. PARMON: A Comprehensive Cluster Monitoring System. In Proceedings of the AUUG'98 Conference, Sydney, Australia, 1998.
[32] C. Roder et al. Flexible Status Measurement in a Heterogeneous Environment. In Proceedings of the PDPTA'98 Conference, Las Vegas, 1998.
[33] Grand Challenge Applications. http://www.mcs.anl.gov/Projects/grand-challenges/
[34] R. Buyya. High Performance Cluster Computing: Programming and Applications. vol. 2, Prentice Hall PTR, NJ, 1999.
[35] Computer Architecture Links. http://www.cs.wisc.edu/arch/www/
[36] HPVM. http://www-csag.cs.uiuc.edu/projects/clusters.html
[37] Solaris MC. http://www.sunlabs.com/research/solaris-mc/
[38] D. E. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. M. K. Publishers, San Francisco, CA, 1998.