
CS 408 - Parallel Processing

Professor BenBella S. Tawfik

Lecture 1
WHY PARALLEL PROCESSING?
The quest for higher-performance digital
computers seems unending. In the past two
decades, the performance of microprocessors
has enjoyed exponential growth. The growth of
microprocessor speed/performance by a factor
of 2 every 18 months (about 60% per year) is
commonly referred to as Moore's law.
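As a quick check of the quoted rate (a worked number added here, not part of the original slide): doubling every 18 months corresponds to an annual growth factor of

$2^{12/18} = 2^{2/3} \approx 1.59$,

i.e., roughly 60% per year, which matches the "1.6 / yr" slope annotated on the performance curve of Figure 1.1 below.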
[Figure 1.1: processor performance (KIPS to TIPS scale) versus calendar year, 1980-2020, with data points for the 68000, 80286, 80386, 68040, 80486, Pentium, Pentium II, and R10000, a trend line of about 1.6x per year, and projections circa 1998 and circa 2012. An annotation notes that the number of cores per chip has been increasing from a few in 2005 to the current tens, and is projected to reach hundreds by 2020.]

Figure 1.1 The exponential growth of microprocessor performance, known as Moore's Law, shown over the past two decades (extrapolated)
This growth is the result of a combination of two factors:
- Increase in complexity (related both to higher device density and to larger size) of VLSI chips, projected to rise to around 10 M transistors per chip for microprocessors, and 1 B for dynamic random-access memories (DRAMs), by the year 2000 [SIA94]
- Introduction of, and improvements in, architectural features such as on-chip cache memories, large instruction buffers, multiple instruction issue per cycle, multithreading, deep pipelines, out-of-order instruction execution, and branch prediction

Moore's law was originally formulated in 1965 in terms of the doubling of chip complexity every year (later revised to every 18 months), based only on a small number of data points [Scha97]. Moore's revised prediction matches almost perfectly the actual increases in the number of transistors in DRAM and microprocessor chips.
Moore's law seems to hold regardless of how one measures processor performance: counting the number of executed instructions per second (IPS), counting the number of floating-point operations per second (FLOPS), or using sophisticated benchmark suites that attempt to measure the processor's performance on real applications. This is because all of these measures, though numerically different, tend to rise at roughly the same rate.

Figure 1.1 shows that the performance of actual processors has in fact followed Moore's law quite closely since 1980 and is on the verge of reaching the GIPS (giga IPS = 10^9 IPS) milestone.
Even though it is expected that Moore's law will continue to hold for the near future, there is a limit that will eventually be reached. That some previous predictions about when the limit will be reached have proven wrong does not alter the fact that a limit, dictated by physical laws, does exist. The most easily understood physical limit is that imposed by the finite speed of signal propagation along a wire. This is sometimes referred to as the speed-of-light argument (or limit), explained as follows.
The Speed-of-Light Argument. The speed of light is about 30 cm/ns. Signals travel on a wire at a fraction of the speed of light. If the chip diameter is 3 cm, say, any computation that involves signal transmission from one end of the chip to another cannot be executed faster than 10^10 times per second. Reducing distances by a factor of 10 or even 100 will only increase the limit by these factors; we still cannot go beyond 10^12 computations per second.

To relate the above limit to the instruction execution rate (MIPS or FLOPS), we need to estimate the distance that signals must travel within an instruction cycle. This is not easy to do, given the extensive use of pipelining and memory-latency-hiding techniques in modern high-performance processors. Despite this difficulty, it should be clear that we are in fact not very far from limits imposed by the speed of signal propagation and several other physical laws.
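To make the speed-of-light bound explicit (a worked version of the argument above, using the 3 cm chip-crossing distance from the text):

$t_{\min} = \dfrac{d}{c} = \dfrac{3\ \text{cm}}{30\ \text{cm/ns}} = 0.1\ \text{ns} \quad\Longrightarrow\quad \text{rate} \le \dfrac{1}{t_{\min}} = 10^{10}\ \text{per second}.$

Shrinking the distance to 0.3 cm or 0.03 cm raises the bound only to $10^{11}$ or $10^{12}$ per second, which is the limit quoted above.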
Evolution of Computer Performance/Cost

[Chart of "mental power in four scales": computer performance per unit cost over time. From "Robots After All," by H. Moravec, CACM, pp. 90-97, October 2003.]
The Semiconductor Technology Roadmap

From the 2001 edition of the roadmap [Alla02]:

Calendar year       2001   2004   2007   2010   2013   2016
Halfpitch (nm)       140     90     65     45     32     22
Clock freq. (GHz)      2      4      7     12     20     30
Wiring levels          7      8      9     10     10     10
Power supply (V)     1.1    1.0    0.8    0.7    0.6    0.5
Max. power (W)       130    160    190    220    250    290

From the 2011 edition (last updated in 2013), for the years 2015, 2020, and 2025: halfpitch of 19, 12, and 8 nm. The remaining values interleaved on the original slide (clock frequencies of 3.6, 4.1, 4.6, 4.4, 5.3, and 6.5 GHz and a 0.6 V supply) revise the earlier clock and voltage projections sharply downward.

Actual halfpitch (Wikipedia, 2019): 2001, 130 nm; 2010, 32 nm; 2014, 14 nm; 2018, 7 nm.
Factors contributing to the validity of Moore's law:
- Denser circuits
- Architectural improvements

Measures of processor performance:
- Instructions per second (MIPS, GIPS, TIPS, PIPS)
- Floating-point operations per second (MFLOPS, GFLOPS, TFLOPS, PFLOPS)
- Running time on benchmark suites

[The slide repeats the performance-versus-year chart of Figure 1.1 (KIPS to TIPS scale, 1980-2010, growth of about 1.6x per year).]
Trends in Processor Chip Density, Performance, Clock Speed, Power, and Number of Cores

[Chart plotting, against year of introduction: transistors per chip (1000s), relative performance, clock speed (MHz), power dissipation (W), and number of cores per chip.]

NRC Report (2011): The Future of Computing Performance: Game Over or Next Level?
Trends in Processor Chip Density, Performance, Clock Speed, Power, and Number of Cores

[Updated version of the same chart, against year of introduction. Original data up to 2010 collected/plotted by M. Horowitz et al.; data for the 2010-2017 extension collected by K. Rupp.]
Shares of Technology and Architecture in Processor Performance Improvement

[Chart covering roughly 1985 to 2010, showing overall performance improvement (SPECINT, relative to the 386), gate speed improvement (FO4 delay, relative to the 386), and feature size. Much of the architectural improvement has already been achieved.]

Source: [DANO12] "CPU DB: Recording Microprocessor History," CACM, April 2012.
Why High-Performance Computing?

1. Higher speed (solve problems faster)
   Important when there are "hard" or "soft" deadlines;
   e.g., 24-hour weather forecast
2. Higher throughput (solve more problems)
   Important when we have many similar tasks to perform;
   e.g., transaction processing
3. Higher computational power (solve larger problems)
   e.g., weather forecast for a week rather than 24 hours,
   or with a finer mesh for greater accuracy

Categories of supercomputers:
- Uniprocessor; aka vector machine
- Multiprocessor; centralized or distributed shared memory
- Multicomputer; communicating via message passing
- Massively parallel processor (MPP; 1K or more processors)
The Speed-of-Light Argument

- The speed of light is about 30 cm/ns.
- Signals travel on a wire at 40-70% of the speed of light (say, 15 cm/ns).
- If signals must travel 1.5 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS.
- This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist.
- How does parallel processing help? Wouldn't multiple processors need to communicate via signals as well?
Interesting Quotes about Parallel Programming

1. "There are 3 rules to follow when parallelizing large codes. Unfortunately, no one knows what these rules are." ~ W. Somerset Maugham, Gary Montry
2. "The wall is there. We probably won't have any more products without multicore processors [but] we see a lot of problems in parallel programming." ~ Alex Bachmutsky
3. "We can solve [the software crisis in parallel computing], but only if we work from the algorithm down to the hardware — not the traditional hardware-first mentality." ~ Tim Mattson
4. "[The processor industry is adding] more and more cores, but nobody knows how to program those things. I mean, two, yeah; four, not really; eight, forget it." ~ Steve Jobs
The Three Walls of High-Performance Computing

1. Memory-wall challenge:
   Memory already limits single-processor performance. How can we design a memory system that provides a bandwidth of several terabytes/s for data-intensive high-performance applications?
2. Power-wall challenge:
   When there are millions of processing nodes, each drawing a few watts of power, we are faced with the energy bill and cooling challenges of MWs of power dissipation, even ignoring the power needs of the interconnection network and peripheral devices.
3. Reliability-wall challenge:
   Ensuring continuous and correct functioning of a system with many thousands or even millions of processing nodes is non-trivial, given that a few of the nodes are bound to malfunction at any given time.
Power-Dissipation Challenge

A challenge at both ends:
- Supercomputers
- Personal electronics

Koomey's Law: exponential improvement in energy-efficient computing, with computations performed per kWh doubling every 1.57 years.

How long will Koomey's law be in effect? It will come to an end, like Moore's Law.
https://cacm.acm.org/magazines/2017/1/211094-exponential-laws-of-computing-growth/fulltext
Why Do We Need TIPS or TFLOPS Performance?

Reasonable running time = fraction of an hour to several hours (10^3-10^4 s).
In this time, a TIPS/TFLOPS machine can perform 10^15-10^16 operations.

- Example 1: Southern oceans heat modeling (10-minute iterations)
  300 GFLOP per iteration x 300,000 iterations per 6 years = 10^16 FLOP

- Example 2: Fluid dynamics calculations (1000 x 1000 x 1000 lattice)
  10^9 lattice points x 1000 FLOP/point x 10,000 time steps = 10^16 FLOP

- Example 3: Monte Carlo simulation of a nuclear reactor
  10^11 particles to track (for 1000 escapes) x 10^4 FLOP/particle = 10^15 FLOP

- Decentralized supercomputing: a grid of tens of thousands of networked computers discovered the Mersenne prime 2^82,589,933 - 1 as the largest known prime number as of Jan. 2021 (it has 24,862,048 digits in decimal).
Supercomputer Performance Growth

[Chart of supercomputer performance (MFLOPS to PFLOPS scale) versus calendar year, 1980-2010, with data points for the Cray X-MP, Y-MP, CM-2, CM-5, and the 80386, 80860, and Alpha microprocessors, grouped into vector supers, micros, $30M MPPs, and $240M MPPs, with the ASCI goals shown as dotted lines.]

Fig. 1.2 The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines).
The ASCI Program

[Plan / Develop / Use timeline of performance (TFLOPS) versus calendar year, 1995-2010:]
  ASCI Red      1+ TFLOPS,   0.5 TB
  ASCI Blue     3+ TFLOPS,   1.5 TB
  ASCI White    10+ TFLOPS,  5 TB
  ASCI Q        30+ TFLOPS,  10 TB
  ASCI Purple   100+ TFLOPS, 20 TB

Fig. 24.1 Milestones in the Accelerated Strategic (Advanced Simulation &) Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level.
The Quest for Higher Performance

Top Three Supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16)

1. IBM Blue Gene/L (LLNL, California)
   Material science, nuclear stockpile simulation
   32,768 processors, 8 TB memory, 28 TB disk storage
   Linux + custom OS
   71 TFLOPS, $100 M
   Dual-processor PowerPC chips (10-15 W power)
   Full system: 130k processors, 360 TFLOPS (est.)

2. SGI Columbia (NASA Ames, California)
   Aerospace/space simulation, climate research
   10,240 processors, 20 TB memory, 440 TB disk storage
   Linux
   52 TFLOPS, $50 M
   20x Altix (512 Itanium 2) linked by InfiniBand

3. NEC Earth Simulator (Earth Simulator Center, Yokohama)
   Atmospheric, oceanic, and earth sciences
   5,120 processors, 10 TB memory, 700 TB disk storage
   Unix
   36 TFLOPS*, $400 M?
   Built of custom vector microprocessors
   Volume = 50x IBM, power = 14x IBM

* Led the top500 list for 2.5 years
The Quest for Higher Performance: 2008 Update

Top Three Supercomputers in June 2008 (http://www.top500.org)

1. IBM Roadrunner (LANL, New Mexico)
   Nuclear stockpile calculations, and more
   122,400 processors, 98 TB memory, 0.4 TB/s file system I/O
   Red Hat Linux
   1.38 PFLOPS, $130 M
   PowerXCell 8i 3.2 GHz + AMD Opteron (hybrid)
   2.35 MW power, expands to 1M processors

2. IBM Blue Gene/L (LLNL, California)
   Advanced scientific simulations
   212,992 processors, 74 TB memory, 2 PB disk storage
   CNK/SLES 9
   0.596 PFLOPS, $100 M
   PowerPC 440, 700 MHz
   1.60 MW power, expands to 0.5M processors

3. Sun Blade X6420 (U Texas Austin)
   Open science research
   62,976 processors, 126 TB memory
   Linux
   0.504 PFLOPS*
   AMD x86-64 Opteron, quad core, 2 GHz
   2.00 MW power, expands to 0.3M processors

* Actually 4th on the top-500 list, with the 3rd being another IBM Blue Gene system at 0.557 PFLOPS
The Quest for Higher Performance: 2012 Update

Top Three Supercomputers in November 2012 (http://www.top500.org)

1. Cray Titan (ORNL, Tennessee)
   XK7 architecture, Cray Gemini interconnect
   560,640 cores, 710 TB memory, Cray Linux
   17.6/27.1 PFLOPS*
   AMD Opteron, 16-core, 2.2 GHz + NVIDIA K20x
   8.2 MW power

2. IBM Sequoia (LLNL, California)
   Blue Gene/Q architecture, custom interconnect
   1,572,864 cores, 1573 TB memory, Linux
   16.3/20.1 PFLOPS*
   Power BQC, 16-core, 1.6 GHz
   7.9 MW power

3. Fujitsu K Computer (RIKEN AICS, Japan)
   RIKEN architecture, Tofu interconnect
   705,024 cores, 1410 TB memory, Linux
   10.5/11.3 PFLOPS*
   SPARC64 VIIIfx, 2.0 GHz
   12.7 MW power

* max/peak performance

In the top 10, IBM also occupies ranks 4-7 and 9-10. Dell and NUDT (China) hold ranks 7-8.
The Quest for Higher Performance: 2018 Update
Top Three Supercomputers in November 2018 (http://www.top500.org)

The Quest for Higher Performance: 2020 Update
Top Five Supercomputers in November 2020 (http://www.top500.org)

Top 500 Supercomputers in the World
[Charts of the Top 500 lists for 2014, 2016, 2018, and 2020.]

What Exactly is Parallel Processing?

Parallelism = concurrency: doing more than one thing at a time.
It has been around for decades, since early computers:
I/O channels, DMA, device controllers, multiple ALUs.

The sense in which we use it in this course:
multiple agents (hardware units, software processes) collaborate to perform our main computational task, e.g.,
- Multiplying two matrices
- Breaking a secret code
- Deciding on the next chess move
1.2 A Motivating Example

[Table showing the list 2-30 after initialization and after Passes 1-3 of the sieve: Pass 1 uses the current prime 2 to erase 4, 6, 8, ..., 30; Pass 2 uses 3 to erase 9, 15, 21, 27; Pass 3 uses 5 to erase 25; the surviving entries 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 are the 10 primes up to 30.]

Fig. 1.3 The sieve of Eratosthenes yielding a list of 10 primes for n = 30. Marked elements have been distinguished by erasure from the list.

Any composite number has a prime factor that is no greater than its square root.
Single-Processor Implementation of the Sieve

[Schematic: a single processor P holds the "current prime" and an index register, and sweeps over a bit-vector of n cells, numbered 1 to n.]

Fig. 1.4 Schematic representation of single-processor solution for the sieve of Eratosthenes.
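A minimal Python sketch of this single-processor scheme (an illustration added here, not the lecture's code; the function name and output format are assumptions of this sketch):

def sieve_sequential(n):
    """Sieve of Eratosthenes over a bit-vector indexed 1..n."""
    is_prime = [True] * (n + 1)            # the bit-vector (index 0 unused)
    is_prime[0] = is_prime[1] = False
    current = 2                            # the "current prime" pointer
    while current * current <= n:          # primes > sqrt(n) need no marking pass
        if is_prime[current]:
            for m in range(current * current, n + 1, current):
                is_prime[m] = False        # "erase" the composite from the list
        current += 1
    return [i for i in range(2, n + 1) if is_prime[i]]

print(sieve_sequential(30))    # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29], as in Fig. 1.3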
Control-Parallel Implementation of the Sieve

[Schematic: processors P1, P2, ..., Pp, each with its own index register, access a shared memory holding the "current prime" and the bit-vector of cells 1 to n, plus an I/O device.]

Fig. 1.5 Schematic representation of a control-parallel solution for the sieve of Eratosthenes.
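A rough Python sketch of this control-parallel scheme (my own illustration under assumptions, not the lecture's reference code: threads stand in for the processors P1..Pp, a lock protects the shared "current prime", and because of Python's global interpreter lock the sketch shows the coordination pattern rather than real speedup):

import threading

def sieve_control_parallel(n, p):
    """p workers share one bit-vector; each repeatedly claims the next
    unmarked value as its "current prime" and erases its multiples."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    shared = {"current": 1}                    # shared "current prime" pointer
    lock = threading.Lock()

    def worker():
        while True:
            with lock:                         # claim the next candidate atomically
                c = shared["current"] + 1
                while c * c <= n and not is_prime[c]:
                    c += 1
                if c * c > n:
                    return
                shared["current"] = c
            for m in range(c * c, n + 1, c):   # marking needs no lock: a cell only
                is_prime[m] = False            # ever changes from True to False

    threads = [threading.Thread(target=worker) for _ in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [i for i in range(2, n + 1) if is_prime[i]]

print(len(sieve_control_parallel(1000, 3)))    # 168 primes below 1000

Note that a worker may occasionally claim a value that another worker has not yet had time to erase; the result is still correct (such a value is composite, so its multiples would be erased anyway), but the duplicated marking is one source of wasted work in the control-parallel approach.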
Running Time of the Sequential/Parallel Sieve

[Timing diagram over 0-1500 time units showing which processor performs the marking pass for each prime (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, ...):]
  p = 1: t = 1411
  p = 2: t = 706
  p = 3: t = 499

Fig. 1.6 Control-parallel realization of the sieve of Eratosthenes with n = 1000 and 1 <= p <= 3.
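From the times in Fig. 1.6, the speedups follow directly:

$S(2) = \dfrac{1411}{706} \approx 2.0, \qquad S(3) = \dfrac{1411}{499} \approx 2.8.$

The three-processor speedup falls short of 3 because the work per prime is very uneven: the pass that erases the multiples of 2 (roughly n/2, about 500 of the 1000 cells) is handled by a single processor in this scheme and bounds the parallel running time from below.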
Data-Parallel Implementation of the Sieve

[Schematic: each processor holds its own "current prime" and index over a private segment of the list, and the processors are linked by a communication network. P1 holds cells 1 to n/p, P2 holds n/p+1 to 2n/p, ..., Pp holds n-n/p+1 to n. Assume at most sqrt(n) processors, so that sqrt(n) <= n/p and all prime factors dealt with are in P1 (which broadcasts them).]

Fig. 1.7 Data-parallel realization of the sieve of Eratosthenes.
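A compact Python sketch of this data-parallel scheme (again an added illustration under assumptions, not the lecture's code: a process pool plays the role of P1..Pp, and P1's "broadcast" of the small primes is modeled by passing them to every worker):

from concurrent.futures import ProcessPoolExecutor
from math import isqrt

def sieve_segment(args):
    """One processor's work: erase multiples of the broadcast primes
    inside its own subrange [lo, hi] and report the survivors."""
    lo, hi, base_primes = args
    flags = [True] * (hi - lo + 1)
    for q in base_primes:
        start = max(q * q, ((lo + q - 1) // q) * q)   # first multiple of q in [lo, hi]
        for m in range(start, hi + 1, q):
            flags[m - lo] = False
    return [i for i, alive in zip(range(lo, hi + 1), flags) if alive and i > 1]

def sieve_data_parallel(n, p):
    # Requires p <= sqrt(n), so that all primes up to sqrt(n) lie in P1's segment.
    base = [q for q in range(2, isqrt(n) + 1)
            if all(q % d for d in range(2, isqrt(q) + 1))]
    size = n // p
    segments = [(k * size + 1, n if k == p - 1 else (k + 1) * size, base)
                for k in range(p)]
    with ProcessPoolExecutor(max_workers=p) as pool:
        return [x for part in pool.map(sieve_segment, segments) for x in part]

if __name__ == "__main__":
    print(sieve_data_parallel(1000, 4)[-5:])   # largest primes below 1000

The design point this sketch mirrors is the one in Fig. 1.7: only P1's segment needs to contain the primes up to sqrt(n), and the cost of getting those primes to the other processors is the communication overhead discussed next.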


One Reason for Sublinear Speedup: Communication Overhead

[Two plots versus number of processors: on the left, solution time split into computation time (decreasing) and communication time (increasing); on the right, actual speedup falling below the ideal speedup line.]

Fig. 1.8 Trade-off between communication time and computation time in the data-parallel realization of the sieve of Eratosthenes.
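A simple model behind Fig. 1.8 (added here for clarity): if the marking work $T_{\text{comp}}$ divides evenly over $p$ processors but each processor also incurs a communication time $T_{\text{comm}}(p)$ for receiving the broadcast primes, then

$T(p) = \dfrac{T_{\text{comp}}}{p} + T_{\text{comm}}(p), \qquad S(p) = \dfrac{T(1)}{T(p)} = \dfrac{T_{\text{comp}}}{T_{\text{comp}}/p + T_{\text{comm}}(p)} < p.$

Because $T_{\text{comm}}(p)$ typically grows with $p$, the total solution time eventually turns back upward and the actual speedup curve bends away from the ideal line, as sketched in the figure.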
Another Reason for Sublinear Speedup: Input/Output Overhead

[Two plots versus number of processors: on the left, solution time split into computation time (decreasing) and a constant I/O time; on the right, actual speedup flattening out below the ideal speedup line.]

Fig. 1.9 Effect of a constant I/O time on the data-parallel realization of the sieve of Eratosthenes.
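The same kind of model explains Fig. 1.9: with a constant, non-parallelizable input/output time $T_{\text{I/O}}$,

$T(p) = \dfrac{T_{\text{comp}}}{p} + T_{\text{I/O}}, \qquad S(p) = \dfrac{T_{\text{comp}} + T_{\text{I/O}}}{T_{\text{comp}}/p + T_{\text{I/O}}} \;\longrightarrow\; 1 + \dfrac{T_{\text{comp}}}{T_{\text{I/O}}} \quad (p \to \infty),$

so the speedup saturates at a fixed value no matter how many processors are added; this is essentially Amdahl's law with the I/O time playing the role of the serial fraction.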
