
Parallel and Distributed Algorithms

A course for the 3rd year students (Major in Computer Science)

Parallel Platforms for High Performance Computing (HPC)
Contents
• A taxonomy of parallel solutions:
  – pipelining
  – functional
  – vectorial, etc.
• Other HPC solutions:
  – SMPs vs. MPPs
  – Clusters, Grids and Clouds
  – NOWs
• State of the art in HPC in 2025: Top500 Supercomputers
  – Focus on years 2016 – 2024
The Birth of Parallel Computing
• It was Gill who first formalized the concept on 16 December 1957, in a lecture
at the British Computer Society, published together with comments and
questions from the audience and his responses
• Subsequent papers on the subject did not appear for 7 years, but a decade later,
interest in parallel programming had increased greatly
• This article correctly predicted the future importance of parallel computing:
"The use of multiple control units within a single machine will enable still
higher overall computing speeds to be achieved when the ultimate speed of a
single arithmetical unit has been reached"
• In the 1960s and 1970s parallel computing was heavily utilized in industries
that relied on large investments for R&D such as aircraft design and defence,
as well as modelling scientific problems such as meteorology
• Parallelism became of central importance in HPC, especially with the advent
of supercomputers in the late 1960s that had multiple physical CPUs on nodes
with their respective memory, networked together in a hybrid-memory model
• Today, "parallel computing has become the dominant paradigm in computer
architecture, mainly in the form of multi-core processors"
PARALLEL vs. Distributed Computing

A working definition may differentiate them by:
• Focus (relative): coarse, medium or fine grain
• Main goal: shorter running time!
• The processors are intended to cooperate in a more efficient
execution of a solution to the same problem
• In parallel computing a problem involves lots of computations
and data (e.g. matrix multiplication, sorting); to attain efficiency,
communication has to be kept to a minimum (optimal)
• In distributed systems problems are different: often coordinating
resources is the most important (e.g. termination detection, leader
election, commit) and communication may not be minimal
Performance-oriented definition
Parallel System: An optimized collection of (tightly coupled)
processors that form a powerful (multi)computer, dedicated to the
execution of complex tasks; each processor executes a subtask in a
semi-independent manner and coordinates with the others from time
to time. The primary goal of parallel processing is a significant
increase in performance.
Distributed System: A collection of multiple autonomous (loosely
coupled) computers, communicating through a computer network,
that interact with each other in order to achieve a common goal.
Leslie Lamport: "A distributed system is one in which the failure
of a computer you didn't even know existed can render your own
computer unusable."
Remark. Parallel processing in distributed environments is not
only possible, but also a cost-effective and attractive alternative.
Parallel Hardware and Software
Both have grown out of conventional serial hardware and software.
To better understand how parallel systems evolved to their current
state, one must look at the important aspects of serial
computation, such as:
The separation of memory and CPU, called the von Neumann bottleneck
– The potentially vast quantity of data and instructions (DI)
  needed to run a program is effectively isolated from the CPU
– In the 2020s, CPUs are able to execute instructions > 100x faster
  than they can fetch/store items from/to the main memory
Improving Computer Performance
To address the von Neumann bottleneck, many modifications to the
basic architecture have been experimented with:
• Caching: use a wider interconnection to transport more DI in a
  single memory access / store blocks of DI closer to the CPU
  (a short C sketch after this list illustrates the effect)
• Virtual memory: for large programs/data sets, it functions as a
  cache for secondary storage; main memory holds only the active
  parts of the many running programs, while idle parts are kept in
  a block of secondary storage called swap space
• Instruction-level parallelism: there are two main approaches,
  pipelining and data parallelism
• Hardware multithreading: allows the system to continue doing
  useful work when the currently executing task has stalled
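The caching bullet above can be made concrete with a short sketch.
This is a minimal illustration (not part of the course material; the
array size and function names are arbitrary) of how the order of
memory accesses interacts with the cache:

#include <stdio.h>

#define N 1024
static double a[N][N];

/* Row-major traversal: consecutive accesses fall within the same
   cache line, so most loads are served from the cache. */
double sum_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: successive accesses are N*sizeof(double)
   bytes apart, so almost every load misses the cache and pays the
   full memory latency (the von Neumann bottleneck). */
double sum_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_rows(), sum_cols());
    return 0;
}

Both functions compute the same sum; on typical hardware the second
one is several times slower, purely because of how it uses the
memory hierarchy.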
1st : Taxonomy of Parallel Solutions

Pipelining
- instructions are decomposed into elementary operations; many
different operations may be at a given moment in execution
Functional parallelism
- there are independent units to execute specialized functions
Vector parallelism
- identical units are provided to execute under unique control
the same operation on different data items
Multi-processing
- several “tightly coupled” processors execute independent
instructions, communicate through a common shared memory
Multi-computing
- several “tightly coupled” processors execute independent
instructions, communicate with each other using messages
Pipelining (often complemented by functional/vector parallelism)
Ex. IBM 360/195, CDC 6600/7600, Cray 1


Vector Processors
• Early parallel computers used vector processors; their design was
MISD, their programming was SIMD (see Flynn’s taxonomy next)
• Most significant representatives of this class:
• CDC Cyber 205, CDC 6600
• Cray-1, Cray-2, Cray XMP, Cray YMP etc.
• IBM 3090 Vector
• Innovative aspects:
  • Superior organization
  • Use of high-performance technologies (not CMOS), e.g. special cooling
  • Use of "peripheral processors" (minicomputers)
  • Generally, they do not rely on the usual paging/segmentation
    techniques, which slow down computations
Vector Parallelism
• Is based on "primary" high-level, efficient operations, able to
  process whole linear arrays (vectors) in one step (see the SAXPY
  sketch below)
• It may be extended to matrix processing etc.
(Photos: Cray-2 and Cray X-MP/4)
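As a concrete illustration of such a "primary" vector operation,
here is a minimal SAXPY sketch in C (not taken from the course; the
function and variable names are illustrative). On a vector
processor, a vectorizing compiler maps the loop body to vector
instructions that each operate on a whole slice of x and y:

#include <stdio.h>

#define N 8

/* saxpy: y = alpha*x + y, the classic vector operation.
   Each iteration is independent, so the whole loop can be executed
   as a few vector instructions instead of N scalar ones. */
void saxpy(int n, float alpha, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

int main(void) {
    float x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float y[N] = {0};
    saxpy(N, 2.0f, x, y);
    for (int i = 0; i < N; i++) printf("%.1f ", y[i]);
    printf("\n");
    return 0;
}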
2nd : Flynn's Taxonomy

                        Data stream
  Instruction stream    Single      Multiple
  Single                SISD        SIMD
  Multiple              MISD        MIMD

SIMD (Single Instruction stream, Multiple Data stream)

(Diagram: a global control unit drives an array of PEs connected by
an interconnection network)

Ex: early parallel machines
• Illiac IV, MPP, CM-2, MasPar MP-1
Modern settings
• multimedia extensions - MMX, SSE
• DSP chips
SIMD (cont.)
Positives:
• less hardware needed (compared to MIMD computers, they
have only one global control unit)
• less memory needed (must store only a copy of the program)
• less startup time to communicate with neighboring
processors
• easy to understand and reason about
Negatives:
• proprietary hardware needed – fast obsolescence, high
development costs/time
• rigid structure suitable only for highly structured problems
• inherent inefficiency due to selective turn-off
SIMD and Data-Parallelism
SIMD computers are naturally suited for data-parallel programs
• that is, programs in which the same set of instructions is
  executed on a large data set
Example:
for (i=0; i<1000; i++) pardo
c[i] = a[i]+b[i];
Processor k executes c[k] = a[k]+b[k]
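The pardo notation above is pseudocode. A minimal rendering of the
same data-parallel loop in C is sketched below; it assumes an
OpenMP-capable compiler (an assumption, not part of the slide), and
keeps the array names from the example:

#include <stdio.h>

#define N 1000

int a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    /* Every iteration is independent, so the runtime may spread them
       over SIMD lanes and/or threads - the data-parallel pattern. */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[999] = %d\n", c[N - 1]);
    return 0;
}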
SIMD – inefficiency example (1)

Different processors cannot execute distinct instructions in the same clock cycle
Example:
for (i=0; i<10; i++) ~
if (a[i]<b[i]) ~
c[i] = a[i]+b[i]; ~
else ~
c[i] = 0;

a[] 4 1 7 2 9 3 3 0 6 7
b[] 5 3 4 1 4 5 3 1 4 8
c[]

p0  p1  p2  p3  p4  p5  p6  p7  p8  p9
SIMD – inefficiency example (2)

Example:
for (i=0; i<10; i++) pardo
if (a[i]<b[i]) ~
c[i] = a[i]+b[i]; ~
else ~
c[i] = 0;

a[] 4 1 7 2 9 3 3 0 6 7
b[] 5 3 4 1 4 5 3 1 4 8
c[]

p0  p1  p2  p3  p4  p5  p6  p7  p8  p9
SIMD – inefficiency example (3)

Example:
for (i=0; i<10; i++) pardo
~ if (a[i]<b[i])
~ c[i] = a[i]+b[i];
~ else
~ c[i] = 0;

a[] 4 1 7 2 9 3 3 0 6 7
b[] 5 3 4 1 4 5 3 1 4 8
c[] 9 4 . . . 8 . 1 . 15   (only the PEs where a[i] < b[i] have stored a result so far)

p0  p1  p2  p3  p4  p5  p6  p7  p8  p9
SIMD – inefficiency example (4)

Example:
for (i=0; i<10; i++) pardo
if (a[i]<b[i]) ~
c[i] = a[i]+b[i]; ~
else ~
c[i] = 0;

a[] 4 1 7 2 9 3 3 0 6 7
b[] 5 3 4 1 4 5 3 1 4 8
c[] 9 4 0 0 0 8 0 1 0 15

p0  p1  p2  p3  p4  p5  p6  p7  p8  p9
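What the four snapshots above show can be emulated in plain C. The
sketch below is only an illustration (not actual SIMD hardware
code): every PE evaluates the condition into a flag, then the "then"
and "else" branches are broadcast one after the other, with the
non-matching PEs idling in each phase.

#include <stdio.h>

#define N 10

int main(void) {
    int a[N] = {4, 1, 7, 2, 9, 3, 3, 0, 6, 7};
    int b[N] = {5, 3, 4, 1, 4, 5, 3, 1, 4, 8};
    int c[N];
    int mask[N];

    /* Phase 1: every PE evaluates the condition and records a flag. */
    for (int i = 0; i < N; i++)
        mask[i] = (a[i] < b[i]);

    /* Phase 2: the "then" branch is broadcast; only PEs with the
       flag set actually store a result, the others sit idle. */
    for (int i = 0; i < N; i++)
        if (mask[i]) c[i] = a[i] + b[i];

    /* Phase 3: the "else" branch is broadcast; now the previously
       active PEs idle. Both phases take a full step each, which is
       the inefficiency illustrated above. */
    for (int i = 0; i < N; i++)
        if (!mask[i]) c[i] = 0;

    for (int i = 0; i < N; i++) printf("%d ", c[i]);
    printf("\n");
    return 0;
}

Running it prints 9 4 0 0 0 8 0 1 0 15, matching example (4).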
MIMD (Multiple Instruction stream, Multiple Data stream)

(Diagram: each PE is paired with its own control unit; the PE +
control unit pairs are connected by an interconnection network)

Single Program, Multiple Data (SPMD)
• a popular way to program MIMD computers
• simplifies code maintenance and program distribution
• equivalent in power to MIMD (a "big switch" at the beginning of
  the single program selects each process's role)
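A minimal sketch of the SPMD "big switch" in C with MPI (assuming an
MPI installation; the coordinator/worker role split is only
illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The "big switch": one program text, but each process picks its
       role from its rank, so the single SPMD source behaves like a
       collection of different MIMD programs. */
    if (rank == 0)
        printf("coordinator among %d processes\n", size);
    else
        printf("worker %d\n", rank);

    MPI_Finalize();
    return 0;
}

It would be launched with something like mpirun -np 4 ./a.out, with
each of the 4 processes executing the same binary.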
MIMD (cont)
Positives:
• can be built easily, quickly and cheaply from existing microprocessors
• very flexible (suitable for irregular problems)
• can have extra hardware to provide fast synchronization, which
enables them to operate in SIMD mode (ex. CM5)
Negatives:
• more complex (each processor has its own control unit)
• requires more resources (duplicated program, OS, …)
• more difficult to reason about/design correct programs
3rd : Bell’s Taxonomy
Aka Address-Space Organization (only for MIMD computers)
•Multiprocessors
(single address space, communication uses common memory)
• Distributed memory (Scalable)
• Centralized memory (Not scalable )
•Multicomputers
(multiple address space, communication uses transfer of messages)
• Distributed
• Centralized
Multiprocessors (shared memory)

Ex. Compaq SystemPro, Sequent Symmetry 2000


Multicomputers (distributed memory)

Ex. nCube, Intel iPSC/860


Multiprocessor Architectures
• Typical examples (case study) are the Connection Machines
(Photos: CM-2 and CM-5)
CM Organization
(Diagram: Host Computer -> Microcontroller -> CM Processors and Memories)
• The host sends commands/data to a microcontroller
• The microcontroller broadcasts control signals and data to the
  processor network
• It also collects data back from the network
CM Processors and Memory
• Bit-level organization (the memory is addressable at bit level)
• Operations are bit-serial
• Data organization in fields is arbitrary (a field may include any
  number of bits and start anywhere)
• A set of contextual bits (flags) in all processors determines
  their activation
CM Programming
• PARIS – PArallel Instruction Set, similar to an assembly language
• *LISP – Common Lisp extension that includes explicit parallel
  operations
• C* – C extension with explicit parallel data and implicit parallel
  operations
• CM-Fortran – the implemented dialect of Fortran 90
CM2 Architecture
(Diagram: a Front End host connects through the Nexus switch to four
sequencers, numbered 0-3; each sequencer drives one quarter of the
Connection Machine processors)
Interconnection Network of CM2 Processors
• Any node in the network is a cluster ("chip"), with:
  – 16 data processors on a chip
  – Memory
  – Routing node
• Nodes are connected in a 12D hypercube
  – There are 4096 nodes, each with direct links to 12 other nodes
  – The maximal dimension of a CM is thus 16 x 4096, or 64K
    processors
CM5
• Starting with the CM-5, Thinking Machines Co. moved (in 1991) from
  a hypercube architecture of simple processors to a completely new
  MIMD one
  – This was based on a "fat tree" of RISC processors (SPARC)
• A few years later the CM-5E replaced the SPARC processors with
  faster SuperSPARCs
Other HPC Taxonomies/Solutions
There are several widely used classes of systems for parallel
processing, which are fundamentally different from the programming
point of view:
SMP (Symmetric Multi-Processing )
MPP (Massively Parallel Processing)
(Computer) Clusters
Grids and Clouds
NOW (Network of Workstations)
Symmetric Multi-Processing
• An architecture in which multiple CPUs,
residing in one cabinet, are driven from a
single O/S image
A Pool of Resources
• Each processor is a peer (one is not favored more
than another)
– Shared bus
– Shared memory address space
– Common I/O channels and disks
– Separate caches per processor, synchronized via various
techniques

• But if one CPU fails, the entire SMP system is down


– Clusters of two or more SMP systems can be used to
provide high availability (fault resilience)
Scalability of SMPs
• Is limited (2-32 processors), reduced by several factors, such as:
  – Inter-processor communication
  – Bus contention among CPUs and serialization points
  – Kernel serialization
  – The bottleneck problems inherent in SMP systems when all CPUs
    attempt to access the same memory
• Most vendors have SMP models on the market:
– Sequent, Pyramid, Encore: SMP on Unix platforms
– IBM, HP, NCR, Unisys also provide SMP servers
– Many versions of Unix, Windows NT, NetWare and OS/2
have been designed or adapted for SMP
Speedup and Efficiency of SMPs
• SMPs help with overall throughput, not a single job, speeding
up whatever processes can be overlapped
– In a desktop computer, it would speed up the running of
multiple applications simultaneously
– If an application is multithreaded, it will improve the
performance of that single application
• The OS controls all CPUs, executing simultaneously, either
processing data or in an idle loop waiting to do something
– CPUs are assigned to the next available task or thread that
can run concurrently
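A small pthreads sketch of the "multithreaded application" case
(illustrative only; the thread count and workload are arbitrary). On
an SMP, the OS can schedule the four threads onto different CPUs at
the same time, so this single application finishes faster:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Each thread is an independently schedulable unit of the same
   application; on an SMP the OS may run all of them simultaneously
   on different processors. */
static void *work(void *arg) {
    long id = (long)arg;
    double s = 0.0;
    for (long i = 1; i <= 10000000L; i++)
        s += 1.0 / (double)(i + id);
    printf("thread %ld partial result %f\n", id, s);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}

(Compiled with the -pthread flag on typical Unix systems.)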
Massively Parallel Processing
• A class of architectures in which each available
processing node runs a separate copy of the O/S
Distributed Resources
• Each CPU is a subsystem with its own memory and
copy of the OS and application
• Each subsystem communicates with the others via a
high-speed interconnect
• There are independent cache/memory/and I/O
subsystems per node
• Data is shared via function shipping, generally from node to node
• Sometimes this is referred as “shared-nothing”
architecture (example: IBM SP2)
Speedup and Efficiency of
MPPs
• Nearly all supercomputers as of 2005 are massively parallel, and
  may have on the order of 100,000 CPUs
• The cumulative output of the many constituent CPUs
can result in large total peak FLOPS
– The true amount of computation accomplished depends on
the nature of the computational task and its implementation
– Some problems are more intrinsically able to be separated
into parallel computational tasks than others
• Single chip implementations of massively parallel
architectures are becoming cost effective
Integrated MPP and SMP
The Reliant RM1000 computer from Pyramid Technology
combined both MPP and SMP processing.
Speedup and Efficiency
• Nodes communicate by passing messages, using
standards such as MPI
Programming SMPs and MPPs
• To use MPP effectively, a problem must be breakable
into pieces that can all be solved simultaneously
• This is the case in scientific environments: simulations or
  mathematical problems can be split apart and each part processed
  at the same time
• In the business world: a parallel data query (PDQ) can
divide a large database into pieces (parallel groups)
• In contrast: applications that support parallel operations
(multithreading) may immediately take advantage of
SMPs - and performance gains are available to all
applications simply because there are more processors
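A hedged sketch of the "breakable into pieces" requirement, in C
with MPI (the toy computation and names are illustrative, not the
course's example): each node works on its own slice of the index
range and a single reduction combines the partial results.

#include <mpi.h>
#include <stdio.h>

#define N 1000000L

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split the range [0, N) into one contiguous slice per node. */
    long lo = N * rank / size;
    long hi = N * (rank + 1) / size;

    double local = 0.0, total = 0.0;
    for (long i = lo; i < hi; i++)
        local += 1.0 / (double)(i + 1);

    /* Combine the partial sums on node 0. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("harmonic sum over %ld terms: %f\n", N, total);

    MPI_Finalize();
    return 0;
}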
Levels of parallelism
Implicit Parallelism in Modern Microprocessors
• pipelining, superscalar execution, VLIW
Hardware parallelism
- as given by machine architecture and hardware multiplicity
(Hwang)
- reflects a model of resource utilization by operations with a
  potential for simultaneous execution, or refers to the resources'
  peak performance
Software parallelism
- acts at job, program, instruction or even bit (arithmetic)
level
Limitations of Memory System Performance
• Problem: the high latency of memory vs. the speed of computation
• Solutions: caches, and latency hiding using multithreading and
  prefetching (a small prefetching sketch follows)
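A minimal latency-hiding sketch using software prefetching. It
relies on the GCC/Clang __builtin_prefetch intrinsic (an assumption
not made by the slides), and the prefetch distance of 16 elements is
an arbitrary tuning choice:

#include <stdio.h>

#define DIST 16   /* prefetch distance, in elements (tuning parameter) */

/* While element i is being used, the cache line holding element
   i + DIST is requested, so its memory latency overlaps with useful
   computation instead of stalling the processor. */
double sum_squares(const double *x, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&x[i + DIST], 0, 1);
        s += x[i] * x[i];
    }
    return s;
}

int main(void) {
    static double x[1 << 20];
    for (long i = 0; i < (1 << 20); i++) x[i] = 1.0;
    printf("%f\n", sum_squares(x, 1 << 20));
    return 0;
}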
Granularity
- A measure for the amount of computations within
a process
- Usually described as coarse, medium and fine
Latency
- In contrast to granularity, it measures the overhead due to
  communication between fragments of code
Computer Clusters
• Composed of multiple computing nodes working
together closely so that in many respects they form
a single computer to process computational jobs
• Clusters are usually built by assembling commodity machines of a
  similar type that have one or more CPUs and CPU cores
• The typical use of clusters is when the tasks of a job are
  relatively independent of each other, so that they can be farmed
  out to different nodes of the cluster
• In some cases, the tasks of a job may still need to be processed
  in a parallel manner, i.e. tasks may be required to interact with
  each other during execution
Computer vs. Data Clusters
• Computer clusters should not be confused with data clusters, which
  refer to disk allocation units for files/directories
• They are loosely coupled sets of independent
processors functioning as a single system to provide:
– Higher Availability (remember: clusters of 2+ SMP
systems are used to provide fault resilience)
– Performance and Load Balancing
– Maintainability
• Examples: RS/6000 (up to 8 nodes), DEC Open
VMS Cluster (up to 16 nodes), IBM Sysplex (up to
32 nodes), Sun SparcCluster
Clustering Issues
(valid for both computing and data)

• A cluster of servers may


provide fault tolerance
and/or load balancing
– If one server fails, one or
more additional servers
are still available
– Load balancing is used to
distribute the workload
over multiple systems
How It Works
• The allocation of jobs to individual nodes of a cluster
is handled by a Distributed Resource Manager (DRM)
• The DRM allocates a task to a node using the resource
allocation policies that may consider node availability,
user priority, job waiting time, etc.
• Typically, DRMs also provide a submission and monitoring
  interface, enabling users to specify jobs to be executed and to
  keep track of the progress of execution
• Examples of popular resource managers are: Condor,
the Sun Grid Engine (SGE) and the Portable Batch
Queuing System (PBS)
Types of Clusters
The primary distinction within computer clusters is how
tightly-coupled the individual nodes are:
• The Beowulf Cluster Design: densely located, sharing a dedicated
  network, probably with homogeneous nodes
• "Grid" Computing: when a compute task uses one or a few nodes and
  needs little inter-node communication
Middleware such as MPI (Message Passing Interface) or PVM (Parallel
Virtual Machine) allows well-designed programs to be portable to a
wide variety of clusters
Speedup and Efficiency
• The TOP500 list includes many clusters
• Tightly-coupled computer clusters are often designed for
  "supercomputing"
  – The central concept of a Beowulf cluster is the use of
    commercial off-the-shelf (COTS) computers to produce a
    cost-effective alternative to a traditional supercomputer
• But clusters, which can achieve very high Flops, may be poorer at
  accessing all the data in the cluster
  – They are excellent for parallel computation, but inferior to
    traditional supercomputers at non-parallel computation
Grid Computing
• A term that refers to a group of distributed computing
resources offering two types of services, either of
online computation or online storage
• A subcategory of distributed computing, or a special
case in parallel computing, which relies on complete
computers connected by a conventional network
• The ancestor of the Grid is Metacomputing, which tried
to interconnect supercomputer centers with the purpose
to obtain superior processing resources
Grids
• “A computational grid is a hardware and software
infrastructure providing dependable, consistent,
pervasive and cheap access to high-end computational
capabilities.” (Foster, Kesselman, “The Grid: Blueprint
for a New Computing Infrastructure”, 1998)
• The key concept in a grid is the ability to negotiate
resource-sharing arrangements among a set of
resources from participating parties and then to use the
resulting resource pool for some purpose
• In contrast, a traditional supercomputer has its resource pool
  formed a priori, from many processors connected by a local
  high-speed computer bus
How Grids Work
(A Grid Checklist)
1) Coordination of resources that are not
subject to centralized control …
2) Use of standard, open, general-purpose
protocols and interfaces
3) Delivery of nontrivial qualities of service
(response time, throughput, availability, security)
Grid computing is concerned with “coordinated resource sharing
and problem solving in dynamic, multi-institutional virtual
organizations” (Foster, Tuecke, “The Anatomy of the Grid,”
2000), and/or co-allocation of multiple resource types to meet
complex user demands, so that the utility of the combined system
is significantly greater than that of the sum of its parts
(Meta-)Scheduling
• A typical grid computing architecture includes a meta-
scheduler connecting a number of geographically
distributed clusters that are managed by local DRMs
• The meta-scheduler (e.g. GridWay, GridSAM) aims to optimize
  computational workloads by combining an organization's multiple
  DRMs into an aggregated single view, allowing jobs to be directed
  to the best location (cluster) for execution
• It integrates computational heterogeneous resources
into a global infrastructure, so users no longer need to
be aware of which resources are used for their jobs
Grid Middleware
• Grid computing tried to introduce common interfaces
and standards that eliminate the heterogeneity from
the resource access in different domains
• Therefore, several grid middleware systems have been
developed to resolve the differences that exist between
submission, monitoring and query interfaces of DRMs
• In addition, a set of new open standards and protocols like OGSA
  (Open Grid Services Architecture) and WSRF (Web Services Resource
  Framework) have been introduced to facilitate mapping between
  independent systems
Example: The Globus Toolkit
• Includes: a platform-independent job submission interface, GRAM,
  which cooperates with the underlying DRMs to integrate job
  submission methods; a security framework, GSI; and the resource
  information mechanism MDS
• Interactions with its components are mapped to local
management system specific calls; support is provided
for many DRMs, including Condor, SGE and PBS
– GridSAM provides a common job submission/ monitoring
interface to multiple underlying DRMs
– As a Web Service based submission service, it implements
the Job Submission Description Language (JSDL) and a
collection of DRM plug-ins that map JSDL requests and
monitoring calls to system-specific calls
Grid Challenges
Computing in grid environments may be difficult due to:
• Resource heterogeneity: results in differing capability
of processing jobs, making the execution performance
difficult to assess
• Resource dynamic behavior: of both network and
computational resources
• Resource co-allocation: the required resources must be offered at
  the same time, or the computation cannot proceed
• Resource access security: important things need to be managed,
  i.e. access policy (what is shared? to whom? when?),
  authentication (how do users/resources identify themselves?),
  authorization (are operations consistent with the rules?)
Grid Platform Design
• A standard programming interface is very important to
the parallel computing developers
• Security is a big concern in an unreliable environment
– incomplete or corrupt data will fail the grid computation
• Resource management: allocation and scheduling
• Data locality
• System management
• Network management
Case study: SAGA (Simple API
for Grid Applications, 2009)
• Targets application developers with no background in
distributed computing
• Does not replace Globus or similar grid computing
middleware systems
• Designed as an object oriented API
– Encapsulates related functionality in a set of objects, grouped
in functional namespaces, called packages
– The SAGA Core specification defines a set of general principles
  (the 'SAGA Look and Feel'), plus a set of API packages
– Based on that, a number of API extensions have been defined
  (they are currently going through a standardization process)
The SAGA 'Look and Feel'
• Covers the following areas:
– security and session management
– permission management
– asynchronous operations and monitoring
– asynchronous notifications
– attribute management
– I/O buffer management
• All SAGA specifications are defined in IDL (Interface Definition
  Language); they are thus object oriented, but language neutral
• Many language bindings exist (Java, Python, C++),
but they are not standardized
The SAGA Architecture
• Follows the adaptor pattern, a software design pattern
which is used for translating one interface into another
• The SAGA core implementation defines the packages:
  – saga::advert - interface for advert service access
  – saga::filesystem - interface for file and directory access
  – saga::replica - interface for replica management
  – saga::namespace - abstract interface (used by the first 3 above)
  – saga::job - interface for job definition, management and control
  – saga::rpc - interface for RPC (remote procedure call) clients and servers
  – saga::sd - interface for service discovery in distributed environments
  – saga::stream - interface for data stream clients and servers
Traditional Grids
• Run in a closed environment, so they will not have a security
  problem
• There are 3 types of nodes in this architecture
– A controller handles the resource allocation,
scheduling and result collection
– A submission node submits a job to the controller
node and gets the result from the controller node
– An execution node executes the set of tasks that
come from the controller node, then returns the
partial result to the controller node
Problems with Traditional Grids
• The resource is fixed and cannot be increased or
decreased while performing a task computation
• When a fault occurs in any execution node, the computation
  operation will fail
• If one of the partial results is not correct, the whole
result will not be correct, etc.
So, the traditional grid cannot provide a good solution
towards “dependable and consistent computing”.
We need a better solution able to provide more flexibility
than the traditional one.
Clouds
• Cloud computing uses the Web server facilities of a 3rd
party provider on the Internet to store, deploy and run
applications
• It takes two main forms:
1. “Infrastructure as a Service” (IaaS): only hardware/
software infrastructure (OS, databases) are offered
– Includes “Utility Computing”, “DeskTop Virtualization”
2. “Software as a Service" (SaaS), which includes the
business applications as well
• Regardless of whether the cloud is infrastructure only or includes
  applications, its major features are self service, scalability and
  speed
Speedup and Performance
• Customers log into the cloud and run their applications
as desired; although a representative of the provider
may be involved in setting up the service, customers
make all configuration changes from their browsers
• In most cases, everything is handled online from start
to finish by the customer
• The cloud provides virtually unlimited computing
capacity and supports extra workloads on demand
• Cloud providers may be connected to multiple Tier 1
Internet backbones for fast response times/ availability
Infrastructure Only (IaaS/PaaS)
• Using the cloud for computing power only can be a cheap way to
  support new projects or seasonal increases in demand
• When constructing a new datacenter, there are very big
security, environmental and management issues, not to
mention hardware/software maintenance forever after
• In addition, commercial cloud facilities may be able to
withstand natural disasters
• Infrastructure-only cloud computing is also named
infrastructure as a service (IaaS), platform as a service
(PaaS), cloud hosting, utility computing, grid hosting
Infrastructure & Applications (SaaS)
• More often, cloud computing refers to application service
  providers (ASPs) that offer everything: the infrastructure as
  outlined above and the applications, relieving the organization of
  virtually all maintenance
• MS Office 365 or Google Apps and Salesforce.com's
CRM products are examples of this SaaS model
• This is a paradigm shift because company data are
stored externally; even if data are duplicated in-house,
copies "in the cloud" create security and privacy issues
• Companies may create private clouds within their own
datacenters, or use hybrid clouds (both private/public)
Networks of Workstations (NOW)
• Use a network-based architecture (possibly even the Internet),
  even when working in a massively parallel model
• More appropriate to distributed computing, therefore they are
  sometimes seen as distributed computers
  – NOWs formed the hardware/software foundation used by the Inktomi
    search engine (Inktomi was acquired by Yahoo! in 2002)
  – This led to a multi-tier architecture for Internet services
    based on distributed systems, in use today
• We are only interested here in a NOW as a solution for HPC, in
  which the workstations form a VCE (virtual computing environment)
NOWs Working in Parallel
• Application partitions task into manageable subtasks
• Application asks participating nodes to post available
resources and computational burdens
– Network bandwidth
– Available RAM
– Processing power available
• Nodes respond and the application parcels out subtasks to the
  nodes with the least computational burden and most available
  resources
• The application must parcel out the subtasks and synchronize the
  answers
Top500 Supercomputers
• All data available from www.top500.org, since June 1993
• A listing of the 500 most powerful computers in the World
• Yardstick (benchmark): Rmax from LINPACK MPP
  – Rmax stands for "Maximal LINPACK performance achieved"
  – Very simply: Ax=b, a dense problem, is solved using the Gaussian
    elimination algorithm
• Updated twice a year:
  – SC'xy in the States in November
  – Meeting in Germany in June
Significant Performance Development of HPC: milestones over 3 decades
(Chart: performance over time of the fastest and slowest systems on the list)

Performance Development of HPC: the Run towards Exaflops
The Top 10 Systems in Top500 (Nov 2016)
The Top 10 Systems in Top500 (Nov 2017)
China’s First Homegrown
Many-core Processor
ShenWei SW26010 Processor
• Vendor : Shanghai High Performance IC Design Center
• Supported by National Science and Technology Major
Project (NMP): Core Electronic Devices, High-end Generic
Chips, and Basic Software
• 28 nm technology
• 260 Cores
• 3 Tflop/s peak
Sunway TaihuLight (http://bit.ly/sunway-2016)
State of HPC in 2017 (J.Dongarra)
• Pflops computing (> 10^15 Flop/s) fully established: 117 systems
  (adds or multiplies on 64-bit machines)
• Three technology architectures ("swim lanes") are thriving:
– Commodity (e.g. Intel)
– Commodity + accelerator (e.g. GPUs): 88 systems
– Lightweight cores (e.g. IBM BG, ARM, Intel’s Knights Landing)
• Interest in supercomputing is now worldwide, and growing in
many new markets (~50% of Top500 computers are in industry)
• Exascale (10^18 Flop/s) projects exist in many countries and
  regions
• Largest share: Intel processors (x86 instruction set), 92%;
followed by AMD, 1%
Towards Exascale Computing
The Top 10 Systems in Top500 (Nov 2018)
The Top 10 Systems in Top500 (Nov 2018, cont.)
The Top 10 Systems in Top500 (Nov 2018)
The Top 10 Systems in Top500 (Nov 2019)
The Top 10 Systems in Top500 (Nov 2019, cont.)
The Top 10 Systems in Top500 (June 2020)
- Research Market -
The Top 10 Systems in Top500 (June 2020)
- Commercial Market -
Performance Development (June 2020)
Performance Fraction of the Top 5 Systems
(June 2020)
VENDORS / SYSTEM SHARE (2017)
VENDORS / SYSTEM SHARE (2020)
COUNTRIES SHARE of Top500 (tree map)
COUNTRIES SHARE Variation
COUNTRIES / PERFORMANCE SHARE
State of HPC in 2020 (Erich Strohmaier)
• A renewed TOP10, but Fugaku (Mount Fuji) solidified its #1 status
  in a list reflecting a flattening performance growth curve
• The full list recorded the smallest number of new entries since the
project began in 1993
• The entry level to the list moved up to 1.32 petaflops on the HPL
benchmark, increasing from 1.23 Pflops recorded in June 2020
• The aggregate performance of all 500 systems grew from 2.22
Eflops in June to just 2.43 exaflops on the November list
• Thanks to additional hardware, Fugaku grew its performance:
HPL to 442 petaflops and mixed precision HPC-AI benchmark to
2.0 exaflops, besting its 1.4 exaflops mark recorded in June 2020
These represent the first benchmark measurements above one
exaflop for any precision on any type of hardware!
• TOP100 Research System and Commercial Systems show very
different markets
COUNTRIES / SYSTEM Share (June 2020)
The Top 10 Systems in Top500 (June 2021)

Erich Strohmaier
Still waiting for Exascale: Japan's Fugaku outperforms all
competition once again
Nov. 15, 2021

FRANKFURT, Germany; BERKELEY, Calif.; and KNOXVILLE, Tenn. – The 58th edition of
the TOP500 saw little change in the Top10. Here's a summary of the systems in the Top10:

• Fugaku remains the No. 1 system. It has 7,630,848 cores which allowed it to achieve an
  HPL benchmark score of 442 Pflop/s. This puts it 3x ahead of the No. 2 system in the list.
• Summit, an IBM-built system at the Oak Ridge National Laboratory (ORNL) in Tennessee,
  USA, remains the fastest system in the U.S. and at the No. 2 spot worldwide. It has a
  performance of 148.8 Pflop/s on the HPL benchmark, which is used to rank the TOP500
  list. Summit has 4,356 nodes, each housing two Power9 CPUs with 22 cores each and six
  NVIDIA Tesla V100 GPUs, each with 80 streaming multiprocessors (SMs). The nodes are
  linked together with a Mellanox dual-rail EDR InfiniBand network.
• Sierra, a system at the Lawrence Livermore National Laboratory, CA, USA, is at No. 3. Its
  architecture is very similar to the No. 2 system, Summit. It is built with 4,320 nodes with
  two Power9 CPUs and four NVIDIA Tesla V100 GPUs. Sierra achieved 94.6 Pflop/s.
• Sunway TaihuLight, a system developed by China's National Research Center of Parallel
  Computer Engineering & Technology (NRCPC) and installed at the National
  Supercomputing Center in Wuxi, in China's Jiangsu province, is listed at the No. 4
  position with 93 Pflop/s.

The Top 10 Systems in Top500 (Nov.2022)
The Top 10 Systems in Top500 (Nov. 2023)

 #   Supercomputer              Country        OS                                    Rmax              Rpeak
 1   Frontier                   United States  HPE Cray OS                           1,194 PFlops      1,679.82 PFlops
 2   Aurora                     United States  SUSE Linux Enterprise Server 15 SP4   585.34 PFlops     1,059.33 PFlops
 3   Eagle                      United States  Ubuntu 22.04                          561.20 PFlops     846.84 PFlops
 4   Supercomputer Fugaku       Japan          Red Hat Enterprise Linux (RHEL)       442.01 PFlops     537.21 PFlops
 5   LUMI                       Finland        HPE Cray OS                           379.70 PFlops     531.51 PFlops
 6   Leonardo                   Italy          Linux                                 238.70 PFlops     304.47 PFlops
 7   Summit                     United States  RHEL 7.4                              148.60 PFlops     200.79 PFlops
 8   MareNostrum 5 ACC          Spain          RedHat 9.1                            138.20 PFlops     265.57 PFlops
     (Accelerated Partition)
 9   Eos NVIDIA DGX SuperPOD    United States  Ubuntu 22.04.3 LTS                    121.40 PFlops     188.65 PFlops
10   Sierra                     United States  RHEL                                  94.64 PFlops      125.71 PFlops
Top 5 Systems in Top500 (Nov.2024)
Top 6 – 10 Systems in Top500 (Nov.2024)
Performance Development
Dominant OS: Linux
• In the ‘90s, commercial versions of Unix such as IRIX, UNICOS,
AIX or Solaris dominated the TOP500 list.
• Windows and MacOS were already focused on end user devices
• Linux appeared for the first time in this list of the world’s most
powerful supercomputers in June 1998
• The first supercomputer using Linux to be included in the list was
the Avalon Cluster from the United States, in the 314th position
• From 1998, Linux’s growth has been outstanding, especially
between Nov.2002 and Nov. 2009: Linux went from being used
in 71 supercomputers to being used in 448 supercomputers.
• Currently, Linux is the clear leader in supercomputing; for it to
  lose this leadership, a big hardware revolution would have to take
  place.
Why is Linux perfect for supercomputers?
• Open source system, 100% customizable: this feature enables
free modification of any part of the code. It is useful both for
improving performance and solving any security problem
• Lower consumption of resources: as it is customizable,
performance can be boosted by specifying that only vital
applications are executed. This cannot be done with other OSs,
since they do not usually allow specific configurations
• Modular structure: new modules can be added without affecting
other parts within the OS. Resource optimization is much easier
• Adapted to all kinds of workloads: this makes measuring the
supercomputer efficiency, consumption and performance easier
• Costs: as it is completely free, there is no need to pay for a
license to use it. So, the necessary investment is only the time
required to modify the OS for each project.
Benchmarks
• The “best” performance is used, as measured by a parallel
implementation of Linpack benchmark, working to solve a
dense system of linear equations
• More specifically, that version of the benchmark is used that
allows the user to scale problem size and to optimize the
software to achieve the best performance for a given system
• This performance does not reflect the overall performance of a
  given system, but the performance of a dedicated system for
  solving a dense system of linear equations
• Since the problem is very regular, the achieved performance is
  quite high, and the performance numbers give a good indication of
  the peak performance, Rmax
• By measuring the actual performance for different problem
sizes n, a user can get not only Rmax for the problem size
Nmax but also the problem size N1/2 where half of the
performance Rmax is achieved
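For reference, the relation behind these quantities can be written
explicitly. This uses the standard HPL operation count for an n x n
dense system (about 2/3 n^3 + 2 n^2 floating-point operations),
which the slide does not state but which matches the benchmark's
reporting convention:

\[
  R(n) \;=\; \frac{\tfrac{2}{3}\,n^{3} + 2\,n^{2}}{t(n)}\ \text{Flop/s},
  \qquad R_{\max} = R(N_{\max}),
  \qquad R\!\left(N_{1/2}\right) = \tfrac{1}{2}\,R_{\max}
\]

where t(n) is the measured time to solution for problem size n.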
