Pda 2
Algorithms
Pipelining
- instructions are decomposed into elementary operations; many
different operations may be in execution at any given moment
Functional parallelism
- there are independent units to execute specialized functions
Vector parallelism
- identical units, under a single control unit, execute the
same operation on different data items
Multi-processing
- several “tightly coupled” processors execute independent
instructions and communicate through a common shared memory
Multi-computing
- several “loosely coupled” processors execute independent
instructions and communicate with each other using messages
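The semantics of vector parallelism above can be sketched in plain Python (a conceptual sketch of the execution model, not of real hardware): one operation is broadcast over all data items, and each element-wise step is independent, so identical units could apply it in lockstep.

```python
# Vector parallelism, conceptually: one instruction ("add"), many data items.
# On a real vector machine all element-wise additions below would execute
# in lockstep under a single control unit; here we only model the semantics.
a = [4, 1, 7, 2, 9]
b = [5, 3, 4, 1, 4]

# One operation applied to all data items -- no per-element control flow.
c = [x + y for x, y in zip(a, b)]

print(c)  # -> [9, 4, 11, 3, 13]
```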
Pipelining (often complemented by functional/vector parallelism)
[Table: Flynn's classification by instruction flow and data flow, each single or multiple]
SIMD (Single Instruction stream, Multiple Data stream)
[Figure: an array of PEs under a single control unit, connected by an interconnection network]
Different processors cannot execute distinct instructions in the same clock cycle
SIMD – inefficiency example (1)
Example:
for (i=0; i<10; i++)
    if (a[i]<b[i])
        c[i] = a[i]+b[i];
    else
        c[i] = 0;
a[]  4  1  7  2  9  3  3  0  6  7
b[]  5  3  4  1  4  5  3  1  4  8
c[]
     p0 p1 p2 p3 p4 p5 p6 p7 p8 p9
SIMD – inefficiency example (2)
Example:
for (i=0; i<10; i++) pardo
    if (a[i]<b[i])
        c[i] = a[i]+b[i];
    else
        c[i] = 0;
a[]  4  1  7  2  9  3  3  0  6  7
b[]  5  3  4  1  4  5  3  1  4  8
c[]
     p0 p1 p2 p3 p4 p5 p6 p7 p8 p9
(each element i is assigned to processor p_i; all PEs evaluate the condition a[i]<b[i] simultaneously)
SIMD – inefficiency example (3)
Example:
for (i=0; i<10; i++) pardo
    if (a[i]<b[i])
        c[i] = a[i]+b[i];
    else
        c[i] = 0;
a[]  4  1  7  2  9  3  3  0  6  7
b[]  5  3  4  1  4  5  3  1  4  8
c[]  9  4           8     1     15
     p0 p1 p2 p3 p4 p5 p6 p7 p8 p9
(then-branch phase: only p0, p1, p5, p7, p9 are active; the remaining PEs are masked and idle)
SIMD – inefficiency example (4)
Example:
for (i=0; i<10; i++) pardo
    if (a[i]<b[i])
        c[i] = a[i]+b[i];
    else
        c[i] = 0;
a[]  4  1  7  2  9  3  3  0  6  7
b[]  5  3  4  1  4  5  3  1  4  8
c[]  9  4  0  0  0  8  0  1  0  15
     p0 p1 p2 p3 p4 p5 p6 p7 p8 p9
(else-branch phase: p2, p3, p4, p6, p8 store 0; the two branches are serialized, so each PE sits idle during one of the two phases)
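The masked, two-phase execution illustrated above can be simulated in a few lines of plain Python (a sketch of the SIMD semantics, not of real hardware): every PE sees the same instruction stream, and a mask decides which PEs actually commit results in each phase.

```python
# Simulate SIMD execution of: if (a[i]<b[i]) c[i]=a[i]+b[i]; else c[i]=0;
# All "PEs" (array positions) receive the same instruction stream; a mask
# marks which PEs are active in each phase, while the rest sit idle.
a = [4, 1, 7, 2, 9, 3, 3, 0, 6, 7]
b = [5, 3, 4, 1, 4, 5, 3, 1, 4, 8]
c = [0] * 10

mask = [a[i] < b[i] for i in range(10)]  # condition evaluated by all PEs at once

# Phase 1: then-branch; only PEs with mask[i] == True commit a result.
for i in range(10):
    if mask[i]:
        c[i] = a[i] + b[i]

# Phase 2: else-branch; the complementary set of PEs commits.
for i in range(10):
    if not mask[i]:
        c[i] = 0

print(c)  # -> [9, 4, 0, 0, 0, 8, 0, 1, 0, 15]
```

The two phases take two full steps even though each PE does useful work in only one of them, which is exactly the inefficiency the slides point out.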
MIMD (Multiple Instruction stream, Multiple Data stream)
[Figure: an array of PEs, each with its own control unit, connected by an interconnection network]
CM2, CM5
CM Organization
[Figure: a host computer (front end) connects through the Nexus switch to four sequencers (0-3), which drive the CM processors and memories]
Performance Development of HPC: the Run towards Exaflops
[Chart: Top500 performance development over time, from the slowest to the fastest system]
The Top 10 Systems in Top500 (Nov 2016)
The Top 10 Systems in Top500 (Nov 2017)
China’s First Homegrown
Many-core Processor
ShenWei SW26010 Processor
• Vendor : Shanghai High Performance IC Design Center
• Supported by National Science and Technology Major
Project (NMP): Core Electronic Devices, High-end Generic
Chips, and Basic Software
• 28 nm technology
• 260 Cores
• 3 Tflop/s peak
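The quoted 3 Tflop/s peak is consistent with the chip's published specifications; the sketch below rederives it (the clock rate and flops-per-cycle figures are assumptions taken from public descriptions of the SW26010, not from these slides).

```python
# Back-of-the-envelope peak for the ShenWei SW26010 (assumed figures:
# 260 cores at ~1.45 GHz, 8 double-precision flops per core per cycle
# via 256-bit fused multiply-add vector units).
cores = 260
clock_hz = 1.45e9
flops_per_cycle = 8

peak_flops = cores * clock_hz * flops_per_cycle
print(peak_flops / 1e12)  # ~3.0 Tflop/s
```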
Sunway TaihuLight (http://bit.ly/sunway-2016)
State of HPC in 2017 (J.Dongarra)
• Pflop/s computing (>10^15 Flop/s) fully established: 117 systems
(adds or multiplies on 64-bit machines)
• Three technology architectures (“swim lanes”) are thriving:
– Commodity (e.g. Intel)
– Commodity + accelerator (e.g. GPUs): 88 systems
– Lightweight cores (e.g. IBM BG, ARM, Intel’s Knights Landing)
• Interest in supercomputing is now worldwide, and growing in
many new markets (~50% of Top500 computers are in industry)
• Exascale (10^18 Flop/s) projects exist in many countries and
regions
• Largest share: Intel processors (x86 instruction set), 92%;
followed by AMD, 1%
Towards Exascale Computing
The Top 10 Systems in Top500 (Nov 2018)
The Top 10 Systems in Top500 (Nov 2018, cont.)
The Top 10 Systems in Top500 (Nov 2019)
The Top 10 Systems in Top500 (Nov 2019, cont.)
The Top 10 Systems in Top500 (June 2020)
- Research Market -
The Top 10 Systems in Top500 (June 2020)
- Commercial Market -
Performance Development (June 2020)
Performance Fraction of the Top 5 Systems
(June 2020)
VENDORS / SYSTEM SHARE (2017)
VENDORS / SYSTEM SHARE (2020)
COUNTRIES SHARE of Top500 (tree map)
COUNTRIES SHARE Variation
COUNTRIES / PERFORMANCE SHARE
State of HPC in 2020 (Erich Strohmaier)
• A renewed TOP10, but Fugaku (“Mount Fuji”) solidified its #1 status
in a list reflecting a flattening performance growth curve
• The full list recorded the smallest number of new entries since the
project began in 1993
• The entry level to the list moved up to 1.32 petaflops on the HPL
benchmark, increasing from 1.23 Pflops recorded in June 2020
• The aggregate performance of all 500 systems grew from 2.22
exaflops in June to just 2.43 exaflops on the November list
• Thanks to additional hardware, Fugaku grew its performance:
HPL to 442 petaflops and mixed precision HPC-AI benchmark to
2.0 exaflops, besting its 1.4 exaflops mark recorded in June 2020.
These represent the first benchmark measurements above one
exaflop for any precision on any type of hardware!
• TOP100 Research System and Commercial Systems show very
different markets
COUNTRIES / SYSTEM Share (June 2020)
The Top 10 Systems in Top500 (June 2021)
Erich Strohmaier
Still waiting for Exascale: Japan's Fugaku outperforms all
competition once again
Nov. 15, 2021
FRANKFURT, Germany; BERKELEY, Calif.; and KNOXVILLE, Tenn.— The 58th edition of
the TOP500 saw little change in the Top10. Here’s a summary of the systems in the Top10:
Fugaku remains the No. 1 system. It has 7,630,848 cores which allowed it to achieve an
HPL benchmark score of 442 Pflop/s. This puts it 3x ahead of the No. 2 system in the list.
Summit, an IBM-built system at the Oak Ridge National Laboratory (ORNL) in Tennessee,
USA, remains the fastest system in the U.S. and at the No. 2 spot worldwide. It has a
performance of 148.8 Pflop/s on the HPL benchmark, which is used to rank the TOP500
list. Summit has 4,356 nodes, each housing two Power9 CPUs with 22 cores each and six
NVIDIA Tesla V100 GPUs, each with 80 streaming multiprocessors (SMs). The nodes are
linked together with a Mellanox dual-rail EDR InfiniBand network.
Sierra, a system at the Lawrence Livermore National Laboratory, CA, USA, is at No. 3. Its
architecture is very similar to that of the #2 system, Summit. It is built with 4,320 nodes, each
with two Power9 CPUs and four NVIDIA Tesla V100 GPUs. Sierra achieved 94.6 Pflop/s.
Sunway TaihuLight, a system developed by China’s National Research Center of Parallel
Computer Engineering & Technology (NRCPC) and installed at the National
Supercomputing Center in Wuxi (Jiangsu province, China), is listed at the No. 4 position
with 93 Pflop/s.
The Top 10 Systems in Top500 (Nov.2022)
The Top 10 Systems in Top500 (Nov.2023)
#   Supercomputer            Country         OS                                    Rmax           Rpeak
#1  Frontier                 United States   HPE Cray OS                           1,194 PFlops   1,679.82 PFlops
#2  Aurora                   United States   SUSE Linux Enterprise Server 15 SP4   585.34 PFlops  1,059.33 PFlops
#3  Eagle                    United States   Ubuntu 22.04                          561.20 PFlops  846.84 PFlops
#4  Supercomputer Fugaku     Japan           Red Hat Enterprise Linux (RHEL)       442.01 PFlops  537.21 PFlops
#5  LUMI                     Finland         HPE Cray OS                           379.70 PFlops  531.51 PFlops
#6  Leonardo                 Italy           Linux                                 238.70 PFlops  304.47 PFlops
#7  Summit                   United States   RHEL 7.4                              148.60 PFlops  200.79 PFlops
#8  MareNostrum 5 ACC        Spain           RedHat 9.1                            138.20 PFlops  265.57 PFlops
    (Accelerated Partition)