
CS 408 - Parallel Processing

Professor BenBella S. Tawfik

Lecture 1
WHY PARALLEL PROCESSING?
The quest for higher-performance digital
computers seems unending. In the past two
decades, the performance of microprocessors
has enjoyed exponential growth. The growth of
microprocessor speed/performance by a factor
of 2 every 18 months (about 60% per year) is
commonly referred to as Moore's law.
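As a quick check of the quoted rate (a worked number added here, not part of the original slide): doubling every 18 months corresponds to an annual growth factor of

$2^{12/18} = 2^{2/3} \approx 1.59$,

i.e., roughly 60% per year, which matches the "1.6 / yr" slope annotated on the performance curve of Figure 1.1 below.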
[Figure 1.1: processor performance (KIPS to TIPS scale) versus calendar year, 1980-2020, with data points for the 68000, 80286, 80386, 68040, 80486, Pentium, Pentium II, and R10000, a trend line of about 1.6x per year, and projections circa 1998 and circa 2012. An annotation notes that the number of cores per chip has been increasing from a few in 2005 to the current tens, and is projected to reach hundreds by 2020.]

Figure 1.1 The exponential growth of microprocessor performance, known as Moore's Law, shown over the past two decades (extrapolated)
This growth is the result of a combination of two factors:
- Increase in complexity (related both to higher device density and to larger size) of VLSI chips, projected to rise to around 10 M transistors per chip for microprocessors, and 1 B for dynamic random-access memories (DRAMs), by the year 2000 [SIA94]
- Introduction of, and improvements in, architectural features such as on-chip cache memories, large instruction buffers, multiple instruction issue per cycle, multithreading, deep pipelines, out-of-order instruction execution, and branch prediction

Moore's law was originally formulated in 1965 in terms of the doubling of chip complexity every year (later revised to every 18 months), based only on a small number of data points [Scha97]. Moore's revised prediction matches almost perfectly the actual increases in the number of transistors in DRAM and microprocessor chips.
Moore's law seems to hold regardless of how one measures processor performance: counting the number of executed instructions per second (IPS), counting the number of floating-point operations per second (FLOPS), or using sophisticated benchmark suites that attempt to measure the processor's performance on real applications. This is because all of these measures, though numerically different, tend to rise at roughly the same rate.

Figure 1.1 shows that the performance of actual processors has in fact followed Moore's law quite closely since 1980 and is on the verge of reaching the GIPS (giga IPS = 10^9 IPS) milestone.
Even though it is expected that Moore's law will continue to hold for the near future, there is a limit that will eventually be reached. That some previous predictions about when the limit will be reached have proven wrong does not alter the fact that a limit, dictated by physical laws, does exist. The most easily understood physical limit is that imposed by the finite speed of signal propagation along a wire. This is sometimes referred to as the speed-of-light argument (or limit), explained as follows.
The Speed-of-Light Argument. The speed of light is about 30 cm/ns. Signals travel on a wire at a fraction of the speed of light. If the chip diameter is 3 cm, say, any computation that involves signal transmission from one end of the chip to another cannot be executed faster than 10^10 times per second. Reducing distances by a factor of 10 or even 100 will only increase the limit by these factors; we still cannot go beyond 10^12 computations per second.

To relate the above limit to the instruction execution rate (MIPS or FLOPS), we need to estimate the distance that signals must travel within an instruction cycle. This is not easy to do, given the extensive use of pipelining and memory-latency-hiding techniques in modern high-performance processors. Despite this difficulty, it should be clear that we are in fact not very far from limits imposed by the speed of signal propagation and several other physical laws.
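To make the speed-of-light bound explicit (a worked version of the argument above, using the 3 cm chip-crossing distance from the text):

$t_{\min} = \dfrac{d}{c} = \dfrac{3\ \text{cm}}{30\ \text{cm/ns}} = 0.1\ \text{ns} \quad\Longrightarrow\quad \text{rate} \le \dfrac{1}{t_{\min}} = 10^{10}\ \text{per second}.$

Shrinking the distance to 0.3 cm or 0.03 cm raises the bound only to $10^{11}$ or $10^{12}$ per second, which is the limit quoted above.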
Evolution of Computer Performance/Cost

[Chart of "mental power in four scales": computer performance per unit cost over time. From "Robots After All," by H. Moravec, CACM, pp. 90-97, October 2003.]
The Semiconductor Technology Roadmap

From the 2001 edition of the roadmap [Alla02]:

Calendar year       2001   2004   2007   2010   2013   2016
Halfpitch (nm)       140     90     65     45     32     22
Clock freq. (GHz)      2      4      7     12     20     30
Wiring levels          7      8      9     10     10     10
Power supply (V)     1.1    1.0    0.8    0.7    0.6    0.5
Max. power (W)       130    160    190    220    250    290

From the 2011 edition (last updated in 2013), for the years 2015, 2020, and 2025: halfpitch of 19, 12, and 8 nm. The remaining values interleaved on the original slide (clock frequencies of 3.6, 4.1, 4.6, 4.4, 5.3, and 6.5 GHz and a 0.6 V supply) revise the earlier clock and voltage projections sharply downward.

Actual halfpitch (Wikipedia, 2019): 2001, 130 nm; 2010, 32 nm; 2014, 14 nm; 2018, 7 nm.
Factors contributing to the validity of Moore's law:
- Denser circuits
- Architectural improvements

Measures of processor performance:
- Instructions per second (MIPS, GIPS, TIPS, PIPS)
- Floating-point operations per second (MFLOPS, GFLOPS, TFLOPS, PFLOPS)
- Running time on benchmark suites

[The slide repeats the performance-versus-year chart of Figure 1.1 (KIPS to TIPS scale, 1980-2010, growth of about 1.6x per year).]
Trends in Processor Chip Density, Performance, Clock Speed, Power, and Number of Cores

[Chart plotting, against year of introduction: transistors per chip (1000s), relative performance, clock speed (MHz), power dissipation (W), and number of cores per chip.]

NRC Report (2011): The Future of Computing Performance: Game Over or Next Level?
Trends in Processor Chip Density, Performance, Clock Speed, Power, and Number of Cores

[Updated version of the same chart, against year of introduction. Original data up to 2010 collected/plotted by M. Horowitz et al.; data for the 2010-2017 extension collected by K. Rupp.]
Shares of Technology and Architecture in Processor Performance Improvement

[Chart covering roughly 1985 to 2010, showing overall performance improvement (SPECINT, relative to the 386), gate speed improvement (FO4 delay, relative to the 386), and feature size. Much of the architectural improvement has already been achieved.]

Source: [DANO12] "CPU DB: Recording Microprocessor History," CACM, April 2012.
Why High-Performance Computing?

1. Higher speed (solve problems faster)
   Important when there are "hard" or "soft" deadlines;
   e.g., 24-hour weather forecast
2. Higher throughput (solve more problems)
   Important when we have many similar tasks to perform;
   e.g., transaction processing
3. Higher computational power (solve larger problems)
   e.g., weather forecast for a week rather than 24 hours,
   or with a finer mesh for greater accuracy

Categories of supercomputers:
- Uniprocessor; aka vector machine
- Multiprocessor; centralized or distributed shared memory
- Multicomputer; communicating via message passing
- Massively parallel processor (MPP; 1K or more processors)
The Speed-of-Light Argument

- The speed of light is about 30 cm/ns.
- Signals travel on a wire at 40-70% of the speed of light (say, 15 cm/ns).
- If signals must travel 1.5 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS.
- This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist.
- How does parallel processing help? Wouldn't multiple processors need to communicate via signals as well?
Interesting Quotes about Parallel Programming

1. "There are 3 rules to follow when parallelizing large codes. Unfortunately, no one knows what these rules are." ~ W. Somerset Maugham, Gary Montry
2. "The wall is there. We probably won't have any more products without multicore processors [but] we see a lot of problems in parallel programming." ~ Alex Bachmutsky
3. "We can solve [the software crisis in parallel computing], but only if we work from the algorithm down to the hardware — not the traditional hardware-first mentality." ~ Tim Mattson
4. "[The processor industry is adding] more and more cores, but nobody knows how to program those things. I mean, two, yeah; four, not really; eight, forget it." ~ Steve Jobs
The Three Walls of High-Performance Computing

1. Memory-wall challenge:
   Memory already limits single-processor performance. How can we design a memory system that provides a bandwidth of several terabytes/s for data-intensive high-performance applications?
2. Power-wall challenge:
   When there are millions of processing nodes, each drawing a few watts of power, we are faced with the energy bill and cooling challenges of MWs of power dissipation, even ignoring the power needs of the interconnection network and peripheral devices.
3. Reliability-wall challenge:
   Ensuring continuous and correct functioning of a system with many thousands or even millions of processing nodes is non-trivial, given that a few of the nodes are bound to malfunction at any given time.
Power-Dissipation Challenge

A challenge at both ends:
- Supercomputers
- Personal electronics

Koomey's Law: exponential improvement in energy-efficient computing, with computations performed per kWh doubling every 1.57 years.

How long will Koomey's law be in effect? It will come to an end, like Moore's Law.
https://cacm.acm.org/magazines/2017/1/211094-exponential-laws-of-computing-growth/fulltext
Why Do We Need TIPS or TFLOPS Performance?

Reasonable running time = fraction of an hour to several hours (10^3-10^4 s).
In this time, a TIPS/TFLOPS machine can perform 10^15-10^16 operations.

- Example 1: Southern oceans heat modeling (10-minute iterations)
  300 GFLOP per iteration x 300,000 iterations per 6 years = 10^16 FLOP

- Example 2: Fluid dynamics calculations (1000 x 1000 x 1000 lattice)
  10^9 lattice points x 1000 FLOP/point x 10,000 time steps = 10^16 FLOP

- Example 3: Monte Carlo simulation of a nuclear reactor
  10^11 particles to track (for 1000 escapes) x 10^4 FLOP/particle = 10^15 FLOP

- Decentralized supercomputing: a grid of tens of thousands of networked computers discovered the Mersenne prime 2^82,589,933 - 1 as the largest known prime number as of Jan. 2021 (it has 24,862,048 digits in decimal).
Supercomputer Performance Growth

[Chart of supercomputer performance (MFLOPS to PFLOPS scale) versus calendar year, 1980-2010, with data points for the Cray X-MP, Y-MP, CM-2, CM-5, and the 80386, 80860, and Alpha microprocessors, grouped into vector supers, micros, $30M MPPs, and $240M MPPs, with the ASCI goals shown as dotted lines.]

Fig. 1.2 The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines).
The ASCI Program

[Plan / Develop / Use timeline of performance (TFLOPS) versus calendar year, 1995-2010:]
  ASCI Red      1+ TFLOPS,   0.5 TB
  ASCI Blue     3+ TFLOPS,   1.5 TB
  ASCI White    10+ TFLOPS,  5 TB
  ASCI Q        30+ TFLOPS,  10 TB
  ASCI Purple   100+ TFLOPS, 20 TB

Fig. 24.1 Milestones in the Accelerated Strategic (Advanced Simulation &) Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level.
The Quest for Higher Performance

Top Three Supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16)

1. IBM Blue Gene/L (LLNL, California)
   Material science, nuclear stockpile simulation
   32,768 processors, 8 TB memory, 28 TB disk storage
   Linux + custom OS
   71 TFLOPS, $100 M
   Dual-processor PowerPC chips (10-15 W power)
   Full system: 130k processors, 360 TFLOPS (est.)

2. SGI Columbia (NASA Ames, California)
   Aerospace/space simulation, climate research
   10,240 processors, 20 TB memory, 440 TB disk storage
   Linux
   52 TFLOPS, $50 M
   20x Altix (512 Itanium 2) linked by InfiniBand

3. NEC Earth Simulator (Earth Simulator Center, Yokohama)
   Atmospheric, oceanic, and earth sciences
   5,120 processors, 10 TB memory, 700 TB disk storage
   Unix
   36 TFLOPS*, $400 M?
   Built of custom vector microprocessors
   Volume = 50x IBM, power = 14x IBM

* Led the top500 list for 2.5 years
The Quest for Higher Performance: 2008 Update

Top Three Supercomputers in June 2008 (http://www.top500.org)

1. IBM Roadrunner (LANL, New Mexico)
   Nuclear stockpile calculations, and more
   122,400 processors, 98 TB memory, 0.4 TB/s file system I/O
   Red Hat Linux
   1.38 PFLOPS, $130 M
   PowerXCell 8i 3.2 GHz + AMD Opteron (hybrid)
   2.35 MW power, expands to 1M processors

2. IBM Blue Gene/L (LLNL, California)
   Advanced scientific simulations
   212,992 processors, 74 TB memory, 2 PB disk storage
   CNK/SLES 9
   0.596 PFLOPS, $100 M
   PowerPC 440, 700 MHz
   1.60 MW power, expands to 0.5M processors

3. Sun Blade X6420 (U Texas Austin)
   Open science research
   62,976 processors, 126 TB memory
   Linux
   0.504 PFLOPS*
   AMD x86-64 Opteron, quad core, 2 GHz
   2.00 MW power, expands to 0.3M processors

* Actually 4th on the top-500 list, with the 3rd being another IBM Blue Gene system at 0.557 PFLOPS
The Quest for Higher Performance: 2012 Update

Top Three Supercomputers in November 2012 (http://www.top500.org)

1. Cray Titan (ORNL, Tennessee)
   XK7 architecture, Cray Gemini interconnect
   560,640 cores, 710 TB memory, Cray Linux
   17.6/27.1 PFLOPS*
   AMD Opteron, 16-core, 2.2 GHz + NVIDIA K20x
   8.2 MW power

2. IBM Sequoia (LLNL, California)
   Blue Gene/Q architecture, custom interconnect
   1,572,864 cores, 1573 TB memory, Linux
   16.3/20.1 PFLOPS*
   Power BQC, 16-core, 1.6 GHz
   7.9 MW power

3. Fujitsu K Computer (RIKEN AICS, Japan)
   RIKEN architecture, Tofu interconnect
   705,024 cores, 1410 TB memory, Linux
   10.5/11.3 PFLOPS*
   SPARC64 VIIIfx, 2.0 GHz
   12.7 MW power

* max/peak performance

In the top 10, IBM also occupies ranks 4-7 and 9-10. Dell and NUDT (China) hold ranks 7-8.
The Quest for Higher Performance: 2018 Update
Top Three Supercomputers in November 2018 (http://www.top500.org)

The Quest for Higher Performance: 2020 Update
Top Five Supercomputers in November 2020 (http://www.top500.org)

Top 500 Supercomputers in the World
[Charts of the Top 500 lists for 2014, 2016, 2018, and 2020.]

What Exactly is Parallel Processing?

Parallelism = concurrency: doing more than one thing at a time.
It has been around for decades, since early computers:
I/O channels, DMA, device controllers, multiple ALUs.

The sense in which we use it in this course:
multiple agents (hardware units, software processes) collaborate to perform our main computational task, e.g.,
- Multiplying two matrices
- Breaking a secret code
- Deciding on the next chess move
1.2 A Motivating Example

[Table showing the list 2-30 after initialization and after Passes 1-3 of the sieve: Pass 1 uses the current prime 2 to erase 4, 6, 8, ..., 30; Pass 2 uses 3 to erase 9, 15, 21, 27; Pass 3 uses 5 to erase 25; the surviving entries 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 are the 10 primes up to 30.]

Fig. 1.3 The sieve of Eratosthenes yielding a list of 10 primes for n = 30. Marked elements have been distinguished by erasure from the list.

Any composite number has a prime factor that is no greater than its square root.
Single-Processor Implementation of the Sieve

[Schematic: a single processor P holds the "current prime" and an index register, and sweeps over a bit-vector of n cells, numbered 1 to n.]

Fig. 1.4 Schematic representation of single-processor solution for the sieve of Eratosthenes.
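A minimal Python sketch of this single-processor scheme (an illustration added here, not the lecture's code; the function name and output format are assumptions of this sketch):

def sieve_sequential(n):
    """Sieve of Eratosthenes over a bit-vector indexed 1..n."""
    is_prime = [True] * (n + 1)            # the bit-vector (index 0 unused)
    is_prime[0] = is_prime[1] = False
    current = 2                            # the "current prime" pointer
    while current * current <= n:          # primes > sqrt(n) need no marking pass
        if is_prime[current]:
            for m in range(current * current, n + 1, current):
                is_prime[m] = False        # "erase" the composite from the list
        current += 1
    return [i for i in range(2, n + 1) if is_prime[i]]

print(sieve_sequential(30))    # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29], as in Fig. 1.3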
Control-Parallel Implementation of the Sieve

[Schematic: processors P1, P2, ..., Pp, each with its own index register, access a shared memory holding the "current prime" and the bit-vector of cells 1 to n, plus an I/O device.]

Fig. 1.5 Schematic representation of a control-parallel solution for the sieve of Eratosthenes.
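A rough Python sketch of this control-parallel scheme (my own illustration under assumptions, not the lecture's reference code: threads stand in for the processors P1..Pp, a lock protects the shared "current prime", and because of Python's global interpreter lock the sketch shows the coordination pattern rather than real speedup):

import threading

def sieve_control_parallel(n, p):
    """p workers share one bit-vector; each repeatedly claims the next
    unmarked value as its "current prime" and erases its multiples."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    shared = {"current": 1}                    # shared "current prime" pointer
    lock = threading.Lock()

    def worker():
        while True:
            with lock:                         # claim the next candidate atomically
                c = shared["current"] + 1
                while c * c <= n and not is_prime[c]:
                    c += 1
                if c * c > n:
                    return
                shared["current"] = c
            for m in range(c * c, n + 1, c):   # marking needs no lock: a cell only
                is_prime[m] = False            # ever changes from True to False

    threads = [threading.Thread(target=worker) for _ in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [i for i in range(2, n + 1) if is_prime[i]]

print(len(sieve_control_parallel(1000, 3)))    # 168 primes below 1000

Note that a worker may occasionally claim a value that another worker has not yet had time to erase; the result is still correct (such a value is composite, so its multiples would be erased anyway), but the duplicated marking is one source of wasted work in the control-parallel approach.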
Running Time of the Sequential/Parallel Sieve

[Timing diagram over 0-1500 time units showing which processor performs the marking pass for each prime (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, ...):]
  p = 1: t = 1411
  p = 2: t = 706
  p = 3: t = 499

Fig. 1.6 Control-parallel realization of the sieve of Eratosthenes with n = 1000 and 1 <= p <= 3.
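From the times in Fig. 1.6, the speedups follow directly:

$S(2) = \dfrac{1411}{706} \approx 2.0, \qquad S(3) = \dfrac{1411}{499} \approx 2.8.$

The three-processor speedup falls short of 3 because the work per prime is very uneven: the pass that erases the multiples of 2 (roughly n/2, about 500 of the 1000 cells) is handled by a single processor in this scheme and bounds the parallel running time from below.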
Data-Parallel Implementation of the Sieve

[Schematic: each processor holds its own "current prime" and index over a private segment of the list, and the processors are linked by a communication network. P1 holds cells 1 to n/p, P2 holds n/p+1 to 2n/p, ..., Pp holds n-n/p+1 to n. Assume at most sqrt(n) processors, so that sqrt(n) <= n/p and all prime factors dealt with are in P1 (which broadcasts them).]

Fig. 1.7 Data-parallel realization of the sieve of Eratosthenes.
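A compact Python sketch of this data-parallel scheme (again an added illustration under assumptions, not the lecture's code: a process pool plays the role of P1..Pp, and P1's "broadcast" of the small primes is modeled by passing them to every worker):

from concurrent.futures import ProcessPoolExecutor
from math import isqrt

def sieve_segment(args):
    """One processor's work: erase multiples of the broadcast primes
    inside its own subrange [lo, hi] and report the survivors."""
    lo, hi, base_primes = args
    flags = [True] * (hi - lo + 1)
    for q in base_primes:
        start = max(q * q, ((lo + q - 1) // q) * q)   # first multiple of q in [lo, hi]
        for m in range(start, hi + 1, q):
            flags[m - lo] = False
    return [i for i, alive in zip(range(lo, hi + 1), flags) if alive and i > 1]

def sieve_data_parallel(n, p):
    # Requires p <= sqrt(n), so that all primes up to sqrt(n) lie in P1's segment.
    base = [q for q in range(2, isqrt(n) + 1)
            if all(q % d for d in range(2, isqrt(q) + 1))]
    size = n // p
    segments = [(k * size + 1, n if k == p - 1 else (k + 1) * size, base)
                for k in range(p)]
    with ProcessPoolExecutor(max_workers=p) as pool:
        return [x for part in pool.map(sieve_segment, segments) for x in part]

if __name__ == "__main__":
    print(sieve_data_parallel(1000, 4)[-5:])   # largest primes below 1000

The design point this sketch mirrors is the one in Fig. 1.7: only P1's segment needs to contain the primes up to sqrt(n), and the cost of getting those primes to the other processors is the communication overhead discussed next.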


One Reason for Sublinear Speedup: Communication Overhead

[Two plots versus number of processors: on the left, solution time split into computation time (decreasing) and communication time (increasing); on the right, actual speedup falling below the ideal speedup line.]

Fig. 1.8 Trade-off between communication time and computation time in the data-parallel realization of the sieve of Eratosthenes.
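A simple model behind Fig. 1.8 (added here for clarity): if the marking work $T_{\text{comp}}$ divides evenly over $p$ processors but each processor also incurs a communication time $T_{\text{comm}}(p)$ for receiving the broadcast primes, then

$T(p) = \dfrac{T_{\text{comp}}}{p} + T_{\text{comm}}(p), \qquad S(p) = \dfrac{T(1)}{T(p)} = \dfrac{T_{\text{comp}}}{T_{\text{comp}}/p + T_{\text{comm}}(p)} < p.$

Because $T_{\text{comm}}(p)$ typically grows with $p$, the total solution time eventually turns back upward and the actual speedup curve bends away from the ideal line, as sketched in the figure.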
Another Reason for Sublinear Speedup: Input/Output Overhead

[Two plots versus number of processors: on the left, solution time split into computation time (decreasing) and a constant I/O time; on the right, actual speedup flattening out below the ideal speedup line.]

Fig. 1.9 Effect of a constant I/O time on the data-parallel realization of the sieve of Eratosthenes.
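The same kind of model explains Fig. 1.9: with a constant, non-parallelizable input/output time $T_{\text{I/O}}$,

$T(p) = \dfrac{T_{\text{comp}}}{p} + T_{\text{I/O}}, \qquad S(p) = \dfrac{T_{\text{comp}} + T_{\text{I/O}}}{T_{\text{comp}}/p + T_{\text{I/O}}} \;\longrightarrow\; 1 + \dfrac{T_{\text{comp}}}{T_{\text{I/O}}} \quad (p \to \infty),$

so the speedup saturates at a fixed value no matter how many processors are added; this is essentially Amdahl's law with the I/O time playing the role of the serial fraction.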
