CS 408 - Parallel Processing
Professor BenBella S. Tawfik
Lecture 2
Analysis
Useful Diagram
One node per O(1) operation.
Question: why are there no cycles in this graph?
Analysis
Definitions
More Definitions
Amdahl’s Law
Now it’s time for some bad news.
In practice, your program won’t just sum all the elements in an array.
You will have a program with:
  Some parts that parallelize well
    Can turn them into a map or a reduce.
  Some parts that won’t parallelize at all
    Operations on a linked list.
    Reading a text file.
    A computation where each step needs the result of the previous steps.
Amdahl’s Law
This is BAD NEWS.
If 1/3 of our program can’t be parallelized, we can’t get a speedup better than 3, no matter how many processors we throw at our problem.
And while the first few processors make a huge difference, the benefit diminishes quickly (see the sketch below).
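To see the diminishing returns concretely, here is a minimal C sketch (not from the slides) that evaluates the speedup bound s = 1/(f + (1 – f)/p) from Fig. 1.12 later in these notes, for f = 1/3 as in the example above; the function name amdahl_speedup is purely illustrative.

#include <stdio.h>

/* Amdahl's law: s = 1 / (f + (1 - f) / p), where f is the serial
 * (unparallelizable) fraction and p is the number of processors. */
static double amdahl_speedup(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    const double f = 1.0 / 3.0;  /* one third of the program is serial */
    for (int p = 1; p <= 1024; p *= 2)
        printf("p = %4d  speedup = %.3f\n", p, amdahl_speedup(f, p));
    /* The printed values climb quickly at first, then flatten out and
     * never exceed 1/f = 3, no matter how large p gets. */
    return 0;
}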
Amdahl’s Law and Moore’s Law
Amdahl’s Law: Moving Forward
Unparallelized code becomes a bottleneck quickly.
What do we do? Design smarter algorithms!
Consider the following problem:
Given an array of numbers, return an array with the “running sum”.
Example: input 3 7 6 2 4  →  output 3 10 16 18 22
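For reference, a minimal sequential C sketch of the running-sum (prefix-sum) problem that reproduces the example above. The loop looks inherently serial (each step needs the previous one), which is exactly why a smarter parallel-scan algorithm is needed; the parallel version is not shown here, and the function name running_sum is made up for illustration.

#include <stdio.h>

/* Running (prefix) sum: out[i] = in[0] + in[1] + ... + in[i]. */
static void running_sum(const int *in, int *out, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += in[i];
        out[i] = sum;
    }
}

int main(void) {
    int in[] = {3, 7, 6, 2, 4};
    int out[5];
    running_sum(in, out, 5);
    for (int i = 0; i < 5; i++) printf("%d ", out[i]);  /* 3 10 16 18 22 */
    printf("\n");
    return 0;
}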
1.3 Parallel Processing Ups and Downs
Fig. 1.10 Richardson’s circular theater for weather forecasting calculations: thousands of “computers” (humans with calculators), directed by a conductor, producing a 24-hr weather prediction in a few hours.
1960s: ILLIAC IV (U Illinois) – four 8×8 mesh quadrants, SIMD
1980s: Commercial interest – technology was driven by government grants & contracts. Once funding dried up, many companies went bankrupt.
2000s: Internet revolution – info providers, multimedia, data mining, etc. need lots of power
2020s: Cloud, big-data, AI/ML
Trends in High-Technology Development
[Figure: development of Graphics, Networking, RISC, and Parallelism into $1B businesses over 1960–2000, each passing through government research (GovRes), industrial research (IndRes), and industrial development (IndDev) stages, with transfer of ideas/people between stages.]
Evolution of parallel processing has been quite different from other high-tech fields.
Development of some technical fields into $1B businesses and the roles played by government research and industrial R&D over time (IEEE Computer, early 90s?).
Trends in Hi-Tech Development (2003): the same chart, with the timeline extended to 2010–2020.
Status of Computing Power (circa 2000)
TFLOPS → PFLOPS (Peta = 10^15)
GFLOPS on desktop: Apple Macintosh, with G4 processor
PFLOPS → EFLOPS (Exa = 10^18)
TFLOPS in supercomputer center:
  1152-processor IBM RS/6000 SP (switch-based network)
  Cray T3E, torus-connected
EFLOPS → ZFLOPS (Zeta = 10^21)
PFLOPS on the drawing board:
  1M-processor IBM Blue Gene (2005?)
  32 proc’s/chip, 64 chips/board, 8 boards/tower, 64 towers
  Processor: 8 threads, on-chip memory, no data cache
  Chip: defect-tolerant, row/column rings in a 6×6 array
  Board: 8×8 chip grid organized as a 4×4×4 cube
  Tower: boards linked to 4 neighbors in adjacent towers
  System: 32×32×32 cube of chips, 1.5 MW (water-cooled)
Flynn Taxonomy
Flynn (1966):
                        Single Data   Multiple Data
  Single Instruction    SISD          SIMD
  Multiple Instruction  MISD          MIMD
• MISD
  • Fault tolerance
  • Pipeline processing/streaming or systolic array
• Now extended to SPMD (single program, multiple data)
Memory Taxonomy
For shared memory:
                      Uniform Memory  Nonuniform Memory
  Cache Coherence     CC-UMA          CC-NUMA
  No Cache Coherence  NCC-UMA         NCC-NUMA
• NUMA wins out for practical implementation
• Cache coherence favors programmer
  • Common in general-purpose systems
• NCC widespread in scalable systems
  • CC overhead is too high, not always necessary
Flynn Taxonomy (1966) Recap
                              Data Stream
                     Single                     Multi
Instruction  Single  SISD                       SIMD
Stream               (Single-Core Processors)   (GPUs, Intel SSE/AVX extensions, …)
             Multi   MISD                       MIMD
                     (Systolic Arrays, …)       (VLIW, Parallel Computers)
Flynn Taxonomy Recap
Single-Instruction Single-Data: Single-Core Processors
Single-Instruction Multi-Data: GPUs, Intel SIMD Extensions
Multi-Instruction Single-Data: Systolic Arrays, …
Multi-Instruction Multi-Data: Parallel Computers
Intel SIMD Extensions
New instructions, new registers
Introduced in phases/groups of functionality:
  SSE – SSE4 (1999–2006): 128-bit-wide operations
  AVX, FMA, AVX2, AVX-512 (2008–2015): 256- to 512-bit-wide operations
  AVX-512 chips not available yet (as of Spring 2019), but soon!
  F16C, and more to come?
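As a quick illustration (not part of the original slides), a minimal AVX sketch in C that adds 8 floats in a single 256-bit operation; it assumes an AVX-capable CPU and a compiler flag such as gcc -mavx, and the array values are arbitrary.

#include <immintrin.h>  /* AVX intrinsics */
#include <stdio.h>

int main(void) {
    /* Two vectors of 8 single-precision floats each (one 256-bit YMM register per vector). */
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);     /* load 8 floats (unaligned) */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  /* 8 additions in one instruction */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}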
Intel SIMD Registers (AVX-512)
XMM0 – XMM15: 128-bit registers (SSE)
YMM0 – YMM15: 256-bit registers (AVX, AVX2)
ZMM0 – ZMM31: 512-bit registers (AVX-512)
Each XMMn register is the low 128 bits of YMMn, which is in turn the low 256 bits of ZMMn; AVX-512 also widens the register file from 16 to 32 registers.
SSE/AVX Data Types
A 256-bit YMM register (bits 255–0) can hold:
  8 × float
  4 × double
  8 × int32
  16 × int16
  32 × int8
Operation on 32 8-bit values in one instruction!
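A similar hedged sketch for the narrowest lanes: assuming an AVX2-capable CPU (compile with something like gcc -mavx2), _mm256_add_epi8 performs the 32 byte-wide additions in one instruction, as the slide claims. The array values are arbitrary.

#include <immintrin.h>  /* AVX2 integer intrinsics */
#include <stdio.h>

int main(void) {
    /* 32 unsigned 8-bit values per 256-bit YMM register. */
    unsigned char a[32], b[32], c[32];
    for (int i = 0; i < 32; i++) { a[i] = (unsigned char)i; b[i] = 100; }

    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vc = _mm256_add_epi8(va, vb);   /* 32 byte-wide additions at once */
    _mm256_storeu_si256((__m256i *)c, vc);

    for (int i = 0; i < 32; i++) printf("%d ", c[i]);   /* 100 101 ... 131 */
    printf("\n");
    return 0;
}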
Sandy Bridge Microarchitecture
e.g., “Port 5 pressure” when code uses too many shuffle operations
Skylake Die Layout
Aside: Do I Have SIMD Capabilities?
less /proc/cpuinfo
The Global View of Computer Architecture
[Figure: Computer Architecture – instruction set design, organization, hardware/software boundary – at the center, shaped by Applications, Parallelism, History, Technology, Programming Languages, Operating Systems, Compilers, Measurement and Evaluation, and Interface Design (ISA).]
The Task of a Computer Designer
Determine what attributes are important for a new machine.
Design a machine to maximize performance while staying within cost constraints.
Flynn’s Taxonomy
Michael Flynn (from Stanford) made a characterization of computer systems which became known as Flynn’s Taxonomy.
[Diagram: a Computer fed by an Instructions stream and a Data stream.]
Flynn’s Taxonomy
SISD – Single Instruction Single Data Systems
[Diagram: one SISD processing unit with a single instruction stream (SI) and a single data stream (SD).]
Flynn’s Taxonomy
SIMD – Single Instruction Multiple Data Systems (“Array Processors”)
[Diagram: one instruction stream (SI) broadcast to several processing units, each with its own data stream (SD).]
Flynn’s Taxonomy
MIMD – Multiple Instructions Multiple Data Systems: “Multiprocessors”
[Diagram: several processing units, each with its own instruction stream (SI) and its own data stream (SD).]
Flynn’s Taxonomy
MISD – Multiple Instructions / Single Data Systems
Some people say “pipelining” lies here, but this is debatable.
[Diagram: several processing units with distinct instruction streams (SI) operating on a single data stream (SD).]
SISD One-Address Machine
Abbreviations:
  IP: Instruction Pointer
  MAR: Memory Address Register
  MDR: Memory Data Register
  A: Accumulator
  ALU: Arithmetic Logic Unit
  IR: Instruction Register
  OP: Opcode
  ADDR: Address
[Diagram: IP, MAR, MDR, IR (OP | ADDR), A, DECODER, and ALU connected to MEMORY.]
One-Address Format: LOAD X
  MAR ← IP
  MDR ← M[MAR] || IP ← IP + 1
  IR ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← M[MAR]
  A ← MDR
One-Address Format: ADD X
  MAR ← IP
  MDR ← M[MAR] || IP ← IP + 1
  IR ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← M[MAR]
  A ← A + MDR
One-Address Format: STORE X
  MAR ← IP
  MDR ← M[MAR] || IP ← IP + 1
  IR ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← A
  M[MAR] ← MDR
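Putting the three register-transfer sequences together, here is a small C sketch (not from the slides) of the fetch-decode-execute cycle of this one-address machine; the opcode values and the separation of program memory from data memory are simplifications made purely for illustration.

#include <stdio.h>

/* Illustrative opcodes; the slides do not fix an encoding. */
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

typedef struct { int op; int addr; } Instr;   /* OP | ADDR format */

int main(void) {
    /* Program and data kept in separate arrays for clarity. */
    Instr prog[] = { {OP_LOAD, 0}, {OP_ADD, 1}, {OP_STORE, 2}, {OP_HALT, 0} };
    int data[3] = { 5, 7, 0 };   /* data[2] will receive 5 + 7 = 12 */

    int IP = 0;        /* instruction pointer             */
    int A = 0;         /* accumulator                     */
    int MAR, MDR;      /* memory address / data registers */
    Instr IR;          /* instruction register            */

    for (;;) {
        /* Fetch:  MAR <- IP;  IR <- M[MAR];  IP <- IP + 1 */
        MAR = IP; IR = prog[MAR]; IP = IP + 1;

        /* Decode and execute, following the register transfers on the slides. */
        if (IR.op == OP_HALT) break;
        MAR = IR.addr;
        if (IR.op == OP_LOAD)  { MDR = data[MAR]; A = MDR; }       /* LOAD X  */
        if (IR.op == OP_ADD)   { MDR = data[MAR]; A = A + MDR; }   /* ADD X   */
        if (IR.op == OP_STORE) { MDR = A; data[MAR] = MDR; }       /* STORE X */
    }
    printf("data[2] = %d\n", data[2]);   /* prints 12 */
    return 0;
}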
SISD Stack Machine
First stack machine: the Burroughs B5000.
[Diagram: IP, MAR, MDR, IR (OP | ADDR), DECODER, and ALU connected to MEMORY, with a stack in place of the accumulator.]
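For contrast with the accumulator machine above, a toy C sketch (my own illustration, not the B5000 instruction set) of zero-address, stack-style evaluation of 3 + 4 × 2:

#include <stdio.h>

int main(void) {
    int stack[16], sp = 0;                 /* sp points to the next free slot */

    stack[sp++] = 3;                       /* PUSH 3 */
    stack[sp++] = 4;                       /* PUSH 4 */
    stack[sp++] = 2;                       /* PUSH 2 */

    { int b = stack[--sp], a = stack[--sp]; stack[sp++] = a * b; }  /* MUL */
    { int b = stack[--sp], a = stack[--sp]; stack[sp++] = a + b; }  /* ADD */

    printf("result = %d\n", stack[sp - 1]);   /* prints 11 */
    return 0;
}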
Multiprocessor Machines (MIMD)
[Diagram: several CPUs sharing a single MEMORY.]
Flynn’s Taxonomy (figure 2.20 from Quinn)
  Single instruction stream, single data stream – SISD: uniprocessor
  Single instruction stream, multiple data streams – SIMD: processor arrays, pipelined vector processors
  Multiple instruction streams, single data stream – MISD: systolic array
  Multiple instruction streams, multiple data streams – MIMD: multiprocessors, multicomputers
2-D Mesh Network
Direct topology
Switches arranged into a 2-D lattice
Communication allowed only between neighboring switches
Variants allow wraparound connections between switches on the edge of the mesh
2-D Meshes
Evaluating 2-D Meshes
Diameter: Θ(n^(1/2))
Bisection width: Θ(n^(1/2))
Number of edges per switch: 4
Constant edge length? Yes
Binary Tree Network
Indirect topology
n = 2^d processor nodes, n – 1 switches
Evaluating Binary Tree Network
Diameter: 2 log n
Bisection width: 1
Edges / node: 3
Constant edge length? Yes/No?
Hypertree Network
Indirect topology
Shares low diameter of binary tree
Greatly improves bisection width
From the “front” it looks like a k-ary tree of height d
From the “side” it looks like an upside-down binary tree of height d
Hypertree Network
Evaluating 4-ary Hypertree
Diameter: log n
Bisection width: n / 2
Edges / node: 6
Constant edge length? No
Butterfly Network
Indirect topology
n = 2^d processor nodes connected by n(log n + 1) switching nodes
[Figure: processors 0–7 on top; switching nodes arranged in ranks 0 through 3, labeled (rank, column) from (0,0) to (3,7).]
Butterfly Network Routing
Evaluating Butterfly Network
Diameter: log n
Bisection width: n / 2
Edges per node: 4
Constant edge length? No
Hypercube
Direct topology
2 × 2 × … × 2 mesh
Number of nodes a power of 2
Node addresses 0, 1, …, 2^k – 1
Node i connected to the k nodes whose addresses differ from i in exactly one bit position (see the sketch below)
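A small C sketch (illustrative only) of that addressing rule: the k neighbors of node i are obtained by flipping each of its k address bits, i.e. i XOR (1 << b).

#include <stdio.h>

int main(void) {
    const int k = 4;    /* a 4-cube with 16 nodes, as in the addressing figure below */
    int i = 6;          /* example node, 0110 in binary */

    printf("neighbors of %d:", i);
    for (int b = 0; b < k; b++)
        printf(" %d", i ^ (1 << b));   /* flip bit b of the address */
    printf("\n");                      /* 7 4 2 14, i.e. 0111 0100 0010 1110 */
    return 0;
}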
Hypercube Addressing
1110 1111
0110 0111
1010 1011
0010 0011
1100 1101
0100 0101
1000 1001
0000 0001
Evaluating Hypercube Network
Diameter: log n
Bisection width: n / 2
Edges per node: log n
Constant edge length? No
1.4 Types of Parallelism: A Taxonomy
Flynn’s categories cross single/multiple instruction streams with single/multiple data streams; Johnson’s expansion further splits MIMD by memory organization (global vs. distributed) and communication style (shared variables vs. message passing):
  SISD: uniprocessors
  SIMD: array or vector processors
  MISD: rarely used
  MIMD: multiprocessors or multicomputers, subdivided into
    GMSV (global memory, shared variables): shared-memory multiprocessors
    GMMP (global memory, message passing): rarely used
    DMSV (distributed memory, shared variables): distributed shared memory
    DMMP (distributed memory, message passing): distributed-memory multicomputers
Fig. 1.11 The Flynn-Johnson classification of computer systems.
1.5 Roadblocks to Parallel Processing
Grosch’s law: Economy of scale applies, or power = cost^2
  No longer valid; in fact we can get more bang per buck in micros
Minsky’s conjecture: Speedup tends to be proportional to log p
  Has roots in analysis of memory bank conflicts; can be overcome
Tyranny of IC technology: Uniprocessors suffice (×10 faster / 5 yrs)
  Faster ICs make parallel machines faster too; what about ×1000?
Tyranny of vector supercomputers: Familiar programming model
  Not all computations involve vectors; parallel vector machines
Software inertia: Billions of dollars investment in software
  New programs; even uniprocessors benefit from parallelism spec
Amdahl’s law: Unparallelizable code severely limits the speedup
Amdahl’s Law
  f = fraction unaffected (not sped up)
  p = speedup of the rest
  s = 1 / [f + (1 – f)/p] ≤ min(p, 1/f)
[Plot: speedup s versus enhancement factor p (0–50), with curves for f = 0, 0.01, 0.02, 0.05, and 0.1.]
Fig. 1.12 Limit on speed-up according to Amdahl’s law.
1.6 Effectiveness of Parallel Processing
  p     Number of processors
  W(p)  Work performed by p processors
  T(p)  Execution time with p processors; T(1) = W(1), T(p) ≤ W(p)
  S(p)  Speedup = T(1) / T(p)
  E(p)  Efficiency = T(1) / [p T(p)]
  R(p)  Redundancy = W(p) / W(1)
  U(p)  Utilization = W(p) / [p T(p)]
  Q(p)  Quality = T^3(1) / [p T^2(p) W(p)]
Fig. 1.13 Task graph exhibiting limited inherent parallelism (nodes 1–13): W(1) = 13, T(1) = 13, T(∞) = 8.
Reduction or Fan-in Computation
Example: adding 16 numbers, 8 processors, unit-time additions.
With zero-time communication:
  S(8) = 15 / 4 = 3.75
  E(8) = 15 / (8 × 4) = 47%
  R(8) = 15 / 15 = 1
  Q(8) = 1.76
With unit-time communication:
  S(8) = 15 / 7 = 2.14
  E(8) = 15 / (8 × 7) = 27%
  R(8) = 22 / 15 = 1.47
  Q(8) = 0.39
Fig. 1.14 Computation graph for finding the sum of 16 numbers (a binary tree of 15 additions).
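The numbers in Fig. 1.14 can be reproduced directly from the Section 1.6 definitions; the following C sketch (the helper name report is mine) plugs in W(1) = T(1) = 15, with T(8) = 4 and W(8) = 15 for zero-time communication, and T(8) = 7 and W(8) = 22 for unit-time communication.

#include <stdio.h>

/* Effectiveness measures from Section 1.6, applied to the Fig. 1.14 example. */
static void report(const char *label, double p, double W1, double T1,
                   double Wp, double Tp) {
    double S = T1 / Tp;                              /* speedup     */
    double E = T1 / (p * Tp);                        /* efficiency  */
    double R = Wp / W1;                              /* redundancy  */
    double U = Wp / (p * Tp);                        /* utilization */
    double Q = (T1 * T1 * T1) / (p * Tp * Tp * Wp);  /* quality     */
    printf("%s: S=%.2f E=%.0f%% R=%.2f U=%.2f Q=%.2f\n",
           label, S, 100 * E, R, U, Q);
}

int main(void) {
    report("zero-time comm", 8, 15, 15, 15, 4);  /* S=3.75 E=47% R=1.00 Q=1.76 */
    report("unit-time comm", 8, 15, 15, 22, 7);  /* S=2.14 E=27% R=1.47 Q=0.39 */
    return 0;
}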
ABCs of Parallel Processing in One Slide
A  Amdahl’s Law (Speedup Formula)
   Bad news – Sequential overhead will kill you, because:
     Speedup = T1/Tp ≤ 1/[f + (1 – f)/p] ≤ min(1/f, p)
   Morale: For f = 0.1, speedup is at best 10, regardless of peak OPS.
B  Brent’s Scheduling Theorem
   Good news – Optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure:
     T1/p ≤ Tp ≤ T1/p + T∞ = (T1/p)[1 + p/(T1/T∞)]
   Result: For a reasonably parallel task (large T1/T∞), or for a suitably small p (say, p ≤ T1/T∞), good speedup and efficiency are possible.
C  Cost-Effectiveness Adage
   Real news – The most cost-effective parallel solution may not be the one with the highest peak OPS (communication?), greatest speed-up (at what cost?), or best utilization (hardware busy doing what?).
   Analogy: Mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
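As a final illustration (not from the slides), a short C sketch of Brent’s bound T1/p ≤ Tp ≤ T1/p + T∞, using T1 = 15 and T∞ = 4 from the zero-communication reduction of Fig. 1.14 purely as example values.

#include <stdio.h>

int main(void) {
    /* Brent's scheduling bound: T1/p <= Tp <= T1/p + Tinf. */
    const double T1 = 15.0, Tinf = 4.0;
    for (int p = 1; p <= 16; p *= 2) {
        double lower = T1 / p;           /* best case: perfect load balance   */
        double upper = T1 / p + Tinf;    /* guaranteed even by naive schedule */
        printf("p = %2d:  %.2f <= Tp <= %.2f  (speedup >= %.2f)\n",
               p, lower, upper, T1 / upper);
    }
    return 0;
}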