
Information Technology

FIT3143 Parallel Computing


Semester 2 2024

Topic 6A:
Clusters and Cluster Performance
Dr Carlo Kopp, MACM, SMIEEE, AFAIAA
Faculty of Information Technology
© 2011 - 2024 Monash University
Why Study Clusters and Their Performance?
 Clusters are the most common “building block” used to form parallel (MIMD)
systems, and therefore must be understood;
 Most Grids are formed by aggregating clusters using middleware; Clouds are
usually clusters running Cloud hypervisors;
 Foundation knowledge: Because the performance of clusters determines many
critical aspects of parallel system performance, understanding clusters is
essential;
 Foundation knowledge: The limitations of a cluster can limit what a parallel
application can or cannot do;
 Practical skills: When coding parallel applications you will have to run the code
on a cluster, or a cluster in a grid/cloud;
 Practical skills: You may have to benchmark an application on a cluster, so
understanding cluster performance matters

2
Clusters vs. Grids vs. Clouds
Clusters
 Clusters emerged during the 1990s as an alternative to traditional
“supercomputers”, which were usually optimised for vector processing, and
architected to exploit Instruction Level Parallelism;
 A “cluster” is a term applied to a group of general purpose processors,
connected by a high speed “fabric” of links, that run software to execute
(usually large) parallel processing jobs;
 Parallelism in clusters occurs at the level of the process, unlike traditional
supercomputers;
 In principle, the limits to the number of cores in a cluster are determined by the
performance limits of the “fabric” interconnecting the machines forming the
cluster;
 Most contemporary supercomputers are built as clusters.

4
Clusters vs. Grids vs. Clouds (I)
 The term “Cloud” is now very widely employed to describe cloud computing
environments, but also any large distributed computing system, even if it is not
running genuine “Cloud” middleware and a runtime environment!
 Clusters are typically confined to one computer room, and run middleware and
programming environments optimised for parallel jobs, especially
“supercomputing” tasks;
 A “Grid” is usually formed by aggregating multiple clusters over a Wide Area
Network (WAN), to increase aggregate performance, using “Grid Middleware”;
 A “Cloud” like a “Grid” may aggregate vast numbers of cores over multiple sites,
but “Cloud middleware” is usually built to support a disparate mix of different
users, and provide “elastic” allocation of computing resources.

5
Clusters vs. Grids vs. Clouds (II)
 A cluster or multiple clusters may be running cluster middleware, or Grid
middleware, or Cloud middleware;
 In some instances, such middleware may be run concurrently, such as a
cluster that is used for local jobs but also participates in a Grid;
 An ongoing problem with distributed computing is the imprecision of
language and labels used to describe systems, especially in industry;
 The best indication of what category a distributed system falls under is the
type of middleware and programming environment being run;
 In current usage, the term “cluster” is often only used to describe the
hardware, reflecting the fact that middleware may support various models;

6
Clusters - Integration
 The simplest way to form a cluster is to interconnect a large number of racked
general purpose processors, using a high speed network;
 The principal challenge is in providing a way of managing jobs, distributing the
computing load across the cores, and providing seamless IPC between
processes in jobs;
 These tasks are typically performed by “middleware”, and in many instances,
“runtime environment” software for managing jobs, down to individual
processes;
 In constructing clusters, which are the basic building block in Grids and clouds,
the performance of the interconnecting fabric is critically important;
 Two parameters of interest are the latency and bandwidth of communication
between processes running on cores.

7
Hardware and Fabrics
TIK Experimental Cluster “Scylla” – ETH Zurich, Switzerland
Ethernet Switches
22 x Athlon
Commodity PCs
Debian Linux

http://www.tik.ee.ethz.ch/~ddosvax/cluster/

9
Juno Linux Cluster - Lawrence Livermore National Laboratory, USA
https://computing.llnl.gov/tutorials/linux_clusters/
Infiniband Fabric
1,152 x Quad Core
Opteron Nodes

10
Clusters - Fabrics
 Clusters need interconnects to carry traffic between the machines or “nodes” forming the
cluster
 The general term for such an interconnect is “fabric”
 Fabrics are typically defined by the type of interconnect used and the type of topology used
 While variants of Ethernet remain the most common interconnect used in clusters,
especially small/cheap clusters, higher performing clusters use faster interconnects (below)
 Topologies vary widely and reflect operator priorities – tradeoffs between performance and
reliability (redundancy)
 Considerations are path length, latency, fault tolerance, scalability and cost (Chkirbene et
al, 2018)
 Path length and latency impact performance, fault tolerance impacts reliability, and cost
reflects complexity

11
Comparing Topologies (Chkirbene et al, 2018)
 Degree of nodes: the number of ports per server in the data centre. “Flat”
topologies have degree = 2, while “Recursive” topologies have degree > 2
 Scalability: number of nodes in the topology, good scalability allows incremental
increases to large numbers
 Diameter: defined as the maximum of the shortest distances between all pairs
of nodes
 Fault tolerance: continues to operate even in the presence of component
failures
 Average Path Length: the APL shows the efficiency of the routing algorithms
employed in packet transmission
 Bandwidth: data transfer rate or throughput from one node to another. One-to-
One, One-to-All, One-to-Several and All-to-All bandwidths are often used (a
sketch of the diameter and APL calculations follows below)
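
A minimal C sketch (not from the slides) of how the diameter and APL metrics can be derived from a topology's adjacency structure via breadth-first search; the 4-node ring adjacency matrix is a made-up example:

```c
#include <stdio.h>

#define NODES 4  /* nodes in the hypothetical example topology */

/* Made-up 4-node ring topology: 0-1-2-3-0 */
static const int adj[NODES][NODES] = {
    {0, 1, 0, 1},
    {1, 0, 1, 0},
    {0, 1, 0, 1},
    {1, 0, 1, 0},
};

int main(void) {
    int dist[NODES][NODES];
    /* BFS from each source gives shortest hop counts to all nodes */
    for (int s = 0; s < NODES; s++) {
        int queue[NODES], head = 0, tail = 0;
        for (int i = 0; i < NODES; i++) dist[s][i] = -1;
        dist[s][s] = 0;
        queue[tail++] = s;
        while (head < tail) {
            int u = queue[head++];
            for (int v = 0; v < NODES; v++)
                if (adj[u][v] && dist[s][v] < 0) {
                    dist[s][v] = dist[s][u] + 1;
                    queue[tail++] = v;
                }
        }
    }
    /* Diameter: maximum of the shortest distances over all pairs;
       APL: their average */
    int diameter = 0, pairs = 0;
    double apl = 0.0;
    for (int s = 0; s < NODES; s++)
        for (int t = 0; t < NODES; t++)
            if (s != t) {
                if (dist[s][t] > diameter) diameter = dist[s][t];
                apl += dist[s][t];
                pairs++;
            }
    printf("diameter = %d, APL = %.2f\n", diameter, apl / pairs);
    return 0;
}
```

For the ring above this prints diameter = 2 and APL ≈ 1.33.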

12
Comparing Topologies (Chkirbene et al, 2018)

13
Cluster Fabric Topology Example – Fat Tree Model

Note that core layer routers will be higher performing than aggregation and edge layer switches

14
Cluster Fabric Topology Example – HyperFlatNet

15
Cluster Fabric Topology Example – VacoNet

16
Cluster Fabric Topology Performance – Average Path Length

Lower APL is better due to lower cumulative queuing delay across routers/switches (Chkirbene et al, 2018)

17
Cluster Design and Implementation
Cluster Design
 The earliest clusters were formed by stacking PCs or Unix workstations on
benches, and later in 19 inch racks, interconnecting them with 10
Mbit/s 802.3 Ethernet (e.g. the Monash PPME);
 Interconnect performance was soon found to be important for computing
tasks performing a lot of IPC, or demanding a lot of bandwidth;
 The general trend since then has been to use commodity Ethernet and
Ethernet switches for small / cheap / low performance clusters, and much
more expensive and elaborate interconnects for large / expensive / high
performance clusters;
 Clusters are usually used as building blocks in larger data centres – the
hardware configuration reflecting the performance demands of the
application;

19
Cluster Building Blocks
 Computational Nodes:
– Individual computers, using multiple core CPUs;
– Multiple CPU, multiple core systems;
– Server “blades” sharing backplanes in racks;
 Fabrics:
– Low performance commodity Ethernet switches and routers;
– High performance 2.5/5/10G Ethernet switches and routers;
– High performance interconnect Infiniband, ePCIe switches and routers;
 Middleware:
– Load balancing/sharing middleware for clusters;
– Parametric computing or other cluster / grid middleware;
– Cloud middleware to run virtual machines

20
“Economy” Cluster Interconnects
 Initially 10 Mbit/s Ethernet, and later 100 Mbit/s “Fast Ethernet”;
 Currently “Gigabit Ethernet” (GbE) mostly used, with commodity “Gigabit
Ethernet” switches;
 Gigabit Ethernet over TCP/IP doesn’t substantially reduce latency in comparison
with “Fast Ethernet”; this is due to protocol processing delays in hosts which
remain CPU bound;
 Network interconnects are queuing systems and behave accordingly;
 Emerging “economy” interconnects are based on:
A. 10GBASE-T, 5GBASE-T and 2.5GBASE-T Ethernet adaptors and switches
intended for data centres and end user premises
B. External PCIe (Express) switches, similar in performance to Thunderbolt, also
based on PCIe – BSD Socket API via PCIe device driver or TCP/IP over PCIe;

21
10GBASE-T, 5GBASE-T and 2.5GBASE-T Ethernet
 10 Gigabit Ethernet first defined in 2002 for optical fibre links
 Later revisions (IEEE 802.3-2018) support optical fibre, copper
twisted pair and twin-axial cable, backplanes and printed circuit
boards
 Used primarily as a fabric in data centres to interconnect servers /
hosts in clusters and clouds
 More expensive than 1000BASE-T but cheaper than Infiniband
 5GBASE-T and 2.5GBASE-T (802.3bz) use a reduced data rate
10GBASE-T waveform over cheap Cat-5e/6 twisted pair Ethernet
cables

22
10GBASE-T Ethernet Hardware
Intel X540-T2, dual port 10 Gigabit Ethernet NIC, PCIe 2.1 x8 card. Two 8P8C (RJ45) connectors, 10GBASE-T standard
(Nosachev – Wikimedia)

Cat 6 Ethernet 8P8C (RJ45) Twisted Pair Cable


(Raysonho - Wikimedia)

23
External PCIe (PCI Express) Hardware
Cronologic Ndigo External PCIe Expander

External 24-port PCIe switch


[https://www.serialcables.com/product-category/pcie-gen3-switch/]

External PCIe Cables [http://www.onestopsystems.com]

24
External PCIe (PCI Express) Software

Dolphin Software eXpressWare suite – TCP/IP over ePCIe / Proprietary SuperSockets


[https://www.dolphinics.com/products/dolphin_pci_express_software.html]

25
“Performance” Cluster Interconnects
 cLAN – a VIA (Virtual Interface Architecture) implementation with very low
latencies of 0.5 µs and high bandwidth – now obsolete;
 QsNet from Quadrics, similar to Myrinet, latency around 5 µs – now
obsolete;
 ANSI/VITA 26-1998 Myrinet – simple low-latency switches, 640 Mbps
to 10 Gbps – now obsolete;
 InfiniBand – high performance switched fabric typically providing 56 – 100 Gbps;
 InfiniBand and Gigabit Ethernet variants are now dominant in large
commercial and scientific supercomputing clusters;

26
InfiniBand Fabric

Mellanox - http://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf

27
InfiniBand Hardware

Mellanox InfiniBand 1U Switch, 12 QSFP+ ports

Mellanox ConnectX®-4 Single/Dual-Port Adapter – 100 Gb/s, PCIe 3.0 x16

28
Cluster Performance
Benchmarking the Cluster
 How do we know how fast the cluster runs?
 Measurement of performance is an important factor when selecting a
system design and architecture
 Numerous methods exist for benchmarking a cluster – none are ideal
 It is important that the technique used provides a compute load
representative of the intended application;
 Workloads can be homogeneous – e.g. parallel identical jobs – clusters,
grids;
 Workloads can be inhomogeneous/heterogeneous – parallel but very
different jobs – clouds;

30
Network Latency and Bandwidth
 Latency is easy to determine by measurement of elapsed time using common
clocks;
 Bandwidth (throughput, capacity) is easy to determine by benchmarking;
 Results must be used with caution, as time variant network conditions dictate
latency and usable bandwidth at any time;
 Latency and bandwidth only characterise the network and not the
computational aspects of the system;
 Measurements usually done on lightly loaded systems to determine best
achievable performance;
 Many tools are available for latency and bandwidth measurements
 Some lab tasks will provide an opportunity to perform such measurements; a
first-order latency/bandwidth model is sketched below
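
A common first-order model (an assumption here, not stated on the slide) treats the transfer time of an n-byte message as T(n) = L + n/B, for latency L and bandwidth B. Two timed transfers of different sizes are then enough to solve for both parameters. A minimal C sketch with made-up measurements:

```c
#include <stdio.h>

int main(void) {
    /* Two measured (size, time) points - illustrative numbers only */
    double n1 = 1e3, t1 = 60e-6;   /* 1 KB message took 60 us */
    double n2 = 1e6, t2 = 8.6e-3;  /* 1 MB message took 8.6 ms */

    /* Solve T(n) = L + n/B for B and L */
    double B = (n2 - n1) / (t2 - t1);  /* bandwidth, bytes per second */
    double L = t1 - n1 / B;            /* zero-byte latency, seconds */

    printf("B = %.1f MB/s, L = %.1f us\n", B / 1e6, L * 1e6);
    return 0;
}
```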

31
Computational Performance
 How much computational work a cluster can perform determines its
economic viability and usefulness
 Performance is a measure of computational effort done per time
under some set of conditions
 Computational effort is the quantity or number of computational
operations e.g. machine instructions performed
 Performance metrics are usually highly sensitive to the
conditions of measurement – e.g. type of workload, application
program, operating system, and configuration of the hardware
platform

32
Computational Benchmarking
 The performance of a computer is usually measured by benchmarking
 Benchmarking can be done by running a specific application and measuring
how long it takes to run and what compute resources are consumed
 Benchmarking can also be done by running dedicated benchmark
applications or benchmark suites
 In all benchmarking we seek “like comparisons”, as in “comparing apples to
apples, rather than apples to oranges”
 The intent is always to perform measurements of performance that permit
honest side by side comparisons between systems
 Vendors have a long running history of manipulating benchmarks to favour
their products – caveat emptor applies!

33
Performance Metrics – FLOPS/MOPS/MIPS
 Maximum aggregate performance of the system can be measured in terms of
Maximum aggregate floating-point operations:
P = N*C*F*R
 Where:
 P performance in FLOPS, MFLOPS, GFLOPS,
 N number of nodes,
 C number of CPUs (CPU cores),
 F floating point operations per clock period,
 R clock rate.
 The other measure is for integer operations – using MOPS/MIPS, where P is
expressed in integer Mega- “Operations” or “Instructions” (a worked example
of the formula follows below)
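
A worked example of P = N*C*F*R in C; the node, core, and clock figures are hypothetical (loosely inspired by the Juno configuration shown earlier, with an assumed clock rate), and the percent-of-peak figure anticipates a later slide:

```c
#include <stdio.h>

int main(void) {
    /* Peak performance P = N * C * F * R (slide formula).
       All values below are hypothetical. */
    double N = 1152;   /* number of nodes */
    double C = 4;      /* CPU cores per node */
    double F = 4;      /* floating point operations per clock period */
    double R = 2.3e9;  /* clock rate, Hz */
    double P = N * C * F * R;
    printf("Peak = %.1f TFLOPS\n", P / 1e12);

    /* "Percentage of peak" (see later slide): measured vs. theoretical */
    double measured = 28.0e12;  /* hypothetical sustained FLOPS */
    printf("%.1f%% of peak\n", 100.0 * measured / P);
    return 0;
}
```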

34
Application Performance
 Number of operations performed while executing the application,
divided by the total runtime, M/GFLOPS or M/GOPS.
 Computed using a program similar to the actual program the user
intends to run on the production system.
 More meaningful than theoretical peak performance.
 Need to correctly estimate the number of floating point (or integer)
operations in the code.
 The algorithm must be optimised.
 Problems may arise if the code was tuned for a particular platform;
specifically where code includes features written around machine
specific performance accelerators.

35
Application Runtime Comparison
 The total “wall-clock” run time for an existing application and
dataset.
 It frees us from counting the operations in the code – simply
compare time to compute
 Also it removes the need to develop benchmarking code which
may differ from the intended application.
 Performance tuning could distort results
 The same application must be run on all systems being
benchmarked against one another, under identical conditions; a minimal
timing harness is sketched below
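
A minimal wall-clock timing harness in C, assuming a POSIX system (clock_gettime is POSIX; the run_application stub stands in for the real application):

```c
#include <stdio.h>
#include <time.h>

/* Placeholder for the actual application under test */
static void run_application(void) { /* ... */ }

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);  /* monotonic wall clock */
    run_application();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec)
                + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("wall-clock runtime: %.3f s\n", secs);
    return 0;
}
```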

36
Scalability – How Well Does It Scale?
 One measure of Scalability is computed thus:
S = T(1) / T(N)
 Where T(1) is the wall clock time for a program to run on a single
processor
 T(N) is the runtime over N processors
 A scalability figure close to N (i.e. S ≅ N) means the program scales
well
 The scalability metric helps estimate the optimal number of processors
for an application
 Amdahl’s Law and other models can be used to estimate scalability (see
the sketch below)
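
A small sketch of the scalability calculation, with an Amdahl's Law estimate for comparison; the runtimes and serial fraction f are assumed inputs:

```c
#include <stdio.h>

/* Measured scalability: S = T(1) / T(N) (slide formula) */
double scalability(double t1, double tn) { return t1 / tn; }

/* Amdahl's Law estimate: S(N) = 1 / (f + (1 - f)/N),
   where f is the serial (non-parallelisable) fraction */
double amdahl(double f, int n) { return 1.0 / (f + (1.0 - f) / n); }

int main(void) {
    double t1 = 100.0, t16 = 8.0;  /* hypothetical runtimes, seconds */
    printf("measured S over 16 processors = %.1f\n", scalability(t1, t16));
    printf("Amdahl estimate (f = 0.05)    = %.1f\n", amdahl(0.05, 16));
    return 0;
}
```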

37
Efficiency
 It is calculated thus:

E = P(N)/N where P is performance

 Values close to unity or 100% are ideally sought

 This metric suffers from the same problems as the Scalability measure (see
the sketch below)
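
Continuing the scalability sketch, and assuming that P(N) here denotes the measured speedup S(N) = T(1)/T(N):

```c
/* Efficiency sketch: E = S(N) / N, values near 1.0 (100%) are ideal.
   E.g. efficiency(100.0, 8.0, 16) returns 0.78. */
double efficiency(double t1, double tn, int n) {
    return (t1 / tn) / (double)n;
}
```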

38
Percentage of Peak
 Application performance statistics are gathered in terms of
the percentage of the theoretical peak performance
 A real application is run and results compared to a theoretical
estimate
 Such statistics highlight the extent to which an application is making
use of the computational power of the system
 This depends on the type of application and the statistical mix of
executed instructions

39
System Utilisation
 System level effects include:
A. Competition between tasks executing on the system
B. I/O contention
C. Memory swapping
D. Job Scheduler inefficiencies
E. Job start-up delays
 A system can be assessed on its long-term throughput through these
statistics
 Statistics are easily collected using sar, vmstat, netstat, iostat
or other tools.

40
MPI Ping-Pong Test
 A widely used measure in clusters
 Tests the aggregated bandwidth and latency of the interprocessor
communication network
 The test is typically written in C against the MPI API, and assumes the MPI
libraries have been installed – a minimal sketch follows below
 Reading: H. Kamal, B. Penoff and A. Wagner, “SCTP versus TCP
for MPI,” SC '05: Proceedings of the 2005 ACM/IEEE Conference
on Supercomputing, Seattle, WA, USA, 2005, pp. 30-30. doi:
10.1109/SC.2005.63
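
A minimal MPI ping-pong sketch in C (a common pattern, assuming an MPI installation; the message size and repetition count are arbitrary choices):

```c
#include <mpi.h>
#include <stdio.h>

#define MSG_BYTES 1024
#define REPS      1000

int main(int argc, char **argv) {
    int rank;
    char buf[MSG_BYTES];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);  /* start the ranks together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {          /* rank 0: send, then await the echo */
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {   /* rank 1: echo everything back */
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / REPS;  /* mean round-trip time */
        printf("latency estimate:   %.2f us\n", rtt / 2 * 1e6);
        printf("bandwidth estimate: %.1f MB/s\n",
               2.0 * MSG_BYTES / rtt / 1e6);
    }
    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and run with mpirun -np 2, this reports an estimated one-way latency and effective bandwidth for 1 KB messages.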

41
MPI Ping-Pong Test (Kamal et al 2005)

Abstract: SCTP (Stream Control Transmission Protocol) is a recently standardized transport
level protocol with several features that better
support the communication requirements of
parallel applications; these features are not
present in traditional TCP (Transmission
Control Protocol). These features make SCTP
a good candidate as a transport level protocol
for MPI (Message Passing Interface). MPI is a
message passing middleware that is widely
used to parallelize scientific and compute
intensive applications. TCP is often used as
the transport protocol for MPI in both local
area and wide area networks. Prior to this
work, SCTP has not been used for MPI. We
compared and evaluated the benefits of using
SCTP instead of TCP as the underlying
transport protocol for MPI. …..

42
LINPACK Benchmark
 High Performance LINPACK (HPL) Benchmark is widely used
within clusters.
 These benchmarks execute the LINPACK codes available on
Netlib.
 These benchmarks can overestimate performance.
 LINPACK is a library of functions written to solve linear
equations and linear least-squares problems;
 LINPACK was written in FORTRAN, and is widely used in many
applications; a sketch of the HPL rate calculation follows below.
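
HPL reports its GFLOPS rate using the standard operation count for solving a dense n×n system via LU factorisation, (2/3)n³ + 2n² floating point operations. A minimal sketch, with made-up problem size and runtime:

```c
#include <stdio.h>

int main(void) {
    double n = 50000.0;  /* HPL problem size (hypothetical) */
    double t = 3600.0;   /* measured runtime in seconds (hypothetical) */

    /* Standard HPL operation count: (2/3) n^3 + 2 n^2 */
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;

    printf("HPL rate: %.1f GFLOPS\n", flops / t / 1e9);
    return 0;
}
```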

43
Measuring and Monitoring Performance
Measuring and Monitoring Performance
 There are many available tools for real time monitoring and / or
logging of machine performance;
 Unix and unix-like systems have provided the text based vmstat
(BSD) and sar (SVR4) utilities since the 1980s;
 The user must analyse and interpret text based logs of data, which can
be challenging;
 More recent ncurses based top and htop provide realtime text based
monitoring;
 The more recent netdata tool provides browser based realtime graphical
representation;

45
Output from vmstat tool

46
Output from top tool

47
Output from htop tool

48
Output from netdata tool

49
Output from netdata tool

50
Observations
 Different applications require different runtime environments, operating
systems, libraries, parallel or distributed computing middleware, and hardware
interconnection topologies;
 There are no panacea solutions – Grids, Clouds, conventional Clusters, HPC
“supercomputers” all perform best for applications which fit their unique
characteristics;
 Coarse assessments of performance may be unrealistically optimistic – or
pessimistic – for many applications;
 Benchmarks can be highly accurate, but only for applications which are very
close in behaviour to the benchmark;
 The best performance benchmark is the intended application itself,
using a representative dataset or parameters.

51
Summary
Summary
 Clusters vs. Grids vs. Clouds
 Hardware and Fabrics
 Cluster Design and Implementation
 Cluster Performance
 Measuring and Monitoring Performance

53
Reading Materials
References / Reading
 http://users.monash.edu/~ckopp/SYSTEMS/Vector-CPU-0600.htm
 http://users.monash.edu/~ckopp/SYSTEMS/Cluster-Practical-1299.htm
 http://www.drdobbs.com/parallel/managing-cluster-computers/184404165
 http://users.monash.edu/~ckopp/SYSTEMS/SCSI-SAN-0799.htm
 http://gridbus.org/papers/encyclopedia.pdf
 http://users.monash.edu/~ckopp/SYSTEMS/Gigabit-IP-LAN-1097.htm
 http://www.linuxvirtualserver.org/
 http://users.monash.edu/~ckopp/SYSTEMS/Infiniband-Intro-0901.htm
 Blaise Barney, Linux Clusters Overview, Lawrence Livermore National Laboratory:
https://computing.llnl.gov/tutorials/linux_clusters/
 Eric Hazen, Linux Cluster for Computational Physics Applications, Boston University:
http://joule.bu.edu/~hazen/LinuxCluster/
 Chkirbene, Z., Hamila, R., Foufou, S. (2018). A Survey on Data Center Network Topologies. In:
Boudriga, N., et al (eds) Ubiquitous Networking. UNet 2018. Lecture Notes in Computer Science, vol
11277. Springer, Cham. https://doi.org/10.1007/978-3-030-02849-7_13

55
