
Lect. 2: Types of Parallelism
▪ Parallelism in Hardware (Uniprocessor)
– Pipelining
– Superscalar, VLIW etc.
▪ Parallelism in Hardware (SIMD, Vector processors, GPUs)
▪ Parallelism in Hardware (Multiprocessor)
– Shared-memory multiprocessors
– Distributed-memory multiprocessors
– Chip-multiprocessors a.k.a. Multi-cores
▪ Parallelism in Hardware (Multicomputers a.k.a. clusters)
▪ Parallelism in Software
– Task parallelism
– Data parallelism

Taxonomy of Parallel Computers
▪ According to instruction and data streams (Flynn):
– Single instruction, single data (SISD): this is the standard uniprocessor
– Single instruction, multiple data streams (SIMD):
▪ Same instruction is executed in all processors with different data
▪ E.g., graphics processing
– Multiple instruction, single data streams (MISD):
▪ Different instructions on the same data
▪ Rarely used in practice
– Multiple instruction, multiple data streams (MIMD): the “common”
multiprocessor
▪ Each processor uses its own data and executes its own program (or part of the program)
▪ Most flexible approach
▪ Easier/cheaper to build by putting together “off-the-shelf” processors

Taxonomy of Parallel Computers
▪ According to physical organization of processors and memory:
– Physically centralized memory, uniform memory access (UMA)
▪ All memory is allocated at the same distance from all processors
▪ Also called symmetric multiprocessors (SMP)
▪ Memory bandwidth is fixed and must accommodate all processors → does not scale to a large number of processors
▪ Used in most CMPs today (e.g., IBM Power, Intel Core)

[Figure: UMA/SMP organization. Four CPUs, each with a private cache, connected through an interconnection network to a single shared main memory.]

Taxonomy of Parallel Computers
▪ According to physical organization of processors and memory:
– Physically distributed memory, non-uniform memory access (NUMA)
▪ A portion of memory is allocated with each processor (node)
▪ Accessing local memory is much faster than remote memory
▪ If most accesses are to local memory, then overall memory bandwidth increases linearly with the number of processors

[Figure: NUMA organization. Four nodes, each containing a CPU, its cache, and a local memory, connected through an interconnection network.]

Taxonomy of Parallel Computers
▪ According to memory communication model
– Shared address or shared memory
▪ Processes in different processors can use the same virtual address space
▪ Any processor can directly access memory in another processor node
▪ Communication is done through shared memory variables
▪ Explicit synchronization with locks and critical sections
▪ Arguably easier to program??
– Distributed address or message passing
▪ Processes in different processors use different virtual address spaces
▪ Each processor can only directly access memory in its own node
▪ Communication is done through explicit messages
▪ Synchronization is implicit in the messages
▪ Arguably harder to program??
▪ Some standard message passing libraries (e.g., MPI)

Shared Memory vs. Message Passing
▪ Shared memory
Producer (p1):
flag = 0;
…
a = 10;
flag = 1;

Consumer (p2):
flag = 0;
…
while (!flag) {}
x = a * y;
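The flag handshake above is pseudocode: with plain variables a real compiler or CPU may reorder or cache the accesses. A minimal C sketch using C11 atomics is given below; the thread setup and the operand y = 3 are illustrative assumptions, not part of the slide.

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

/* Shared state; names follow the slide's example. */
atomic_int flag = 0;     /* 0 = data not ready, 1 = data ready */
int a;                   /* the value being communicated */

void *producer(void *arg) {                                  /* p1 */
    a = 10;                                                  /* write the data  */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* then publish it */
    return NULL;
}

void *consumer(void *arg) {                                  /* p2 */
    int y = 3;                                               /* illustrative operand */
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                                    /* spin until flag is set */
    int x = a * y;                                           /* write to a is visible here */
    printf("x = %d\n", x);
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    pthread_create(&p1, NULL, producer, NULL);
    pthread_create(&p2, NULL, consumer, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return 0;
}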

▪ Message passing
Producer (p1):
…
a = 10;
send(p2, a, label);

Consumer (p2):
…
receive(p1, b, label);
x = b * y;
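For the message-passing version, a minimal MPI sketch is shown below (MPI being the standard library mentioned on the previous slide); the ranks, the tag value 0, and the operand y = 3 are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                    /* producer (p1) */
        int a = 10;
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {             /* consumer (p2) */
        int b, y = 3;
        MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        int x = b * y;                  /* synchronization is implicit in the receive */
        printf("x = %d\n", x);
    }

    MPI_Finalize();
    return 0;
}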

Types of Parallelism in Applications
▪ Instruction-level parallelism (ILP)
– Multiple instructions from the same instruction stream can be executed
concurrently
– Generated and managed by hardware (superscalar) or by compiler (VLIW)
– Limited in practice by data and control dependences (see the fragment below)

▪ Thread-level or task-level parallelism (TLP)


– Multiple threads or instruction sequences from the same application can be
executed concurrently
– Generated by compiler/user and managed by compiler and hardware
– Limited in practice by communication/synchronization overheads and by
algorithm characteristics
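To make the ILP limitation concrete, the small C fragment below (purely illustrative, not from the slides) contains one pair of statements serialized by a data dependence and one pair that is independent:

/* Statements 1-2 are serialized by a data dependence; statements 3-4 are
   independent of 1-2 and of each other, so a superscalar core or a VLIW
   compiler can overlap them. */
int ilp_example(int a, int b) {
    int c = a + b;      /* 1 */
    int d = c * 2;      /* 2: must wait for c             */
    int e = a - b;      /* 3: independent                 */
    int f = b + 7;      /* 4: independent, can run with 3 */
    return d + e + f;
}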

Types of Parallelism in Applications

▪ Data-level parallelism (DLP)


– Instructions from a single stream operate concurrently on several data
– Limited by non-regular data manipulation patterns and by memory bandwidth (a minimal loop sketch is given at the end of this slide)

▪ Transaction-level parallelism
– Multiple threads/processes from different transactions can be executed
concurrently
– Limited by access to metadata and by interconnection bandwidth
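A minimal data-parallel loop sketch (illustrative; the function and the OpenMP simd hint are assumptions, not from the slides). The same multiply is applied independently to every element, so iterations can be mapped to SIMD lanes or spread across threads:

#include <stddef.h>

/* Every iteration is independent: same operation, different data. */
void scale(float *dst, const float *src, float k, size_t n) {
    #pragma omp simd                 /* vectorization hint (needs -fopenmp or -fopenmp-simd) */
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}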

Example: Equation Solver Kernel
▪ The problem:
– Operate on an (n+2)x(n+2) matrix
– Points on the rim have fixed value
– Inner points are updated as:

A[i,j] = 0.2 x (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])
– Updates are in-place, so the values above and to the left are new, while those below and to the right are old
– Updates occur over multiple sweeps
– Keep the difference between old and new values and stop when the difference for all points is small enough

Example: Equation Solver Kernel
▪ Dependences:
– Computing the new value of a given point requires the new values of the points directly above and directly to the left
– By transitivity, it requires all points in the sub-matrix in the upper-left corner
– Points along the top-right to bottom-left diagonals can be computed
independently
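A sketch of how this diagonal (wavefront) independence could be exploited, assuming A is an (n+2)x(n+2) C array of doubles; this restructured loop is an illustration, not shown in the slides:

/* Sweep the inner n x n points along anti-diagonals d = i + j.
   Each point on diagonal d needs new values only from diagonal d-1
   (above and to the left) and old values from diagonal d+1, so the
   iterations of the inner i-loop are mutually independent and could
   run in parallel (one thread or vector lane per point). */
void diagonal_sweep(int n, double A[n+2][n+2]) {
    for (int d = 2; d <= 2 * n; d++) {
        int ilo = (d - n > 1) ? d - n : 1;   /* keep j = d - i inside 1..n */
        int ihi = (d - 1 < n) ? d - 1 : n;
        for (int i = ilo; i <= ihi; i++) {   /* independent iterations */
            int j = d - i;
            A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                           + A[i][j+1] + A[i+1][j]);
        }
    }
}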

Example: Equation Solver Kernel
▪ ILP version (from sequential code):
– Machine instructions from each j iteration can occur in parallel
– Branch prediction allows overlap of multiple iterations of the j loop
– Some of the instructions from multiple j iterations can occur in parallel

while (!done) {
  diff = 0;
  for (i=1; i<=n; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i,j];
      A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
      diff += abs(A[i,j] - temp);
    }
  }
  if (diff/(n*n) < TOL) done=1;
}
Example: Equation Solver Kernel
▪ TLP version (shared-memory):
int mymin = 1 + (pid * n/P);
int mymax = mymin + n/P - 1;

while (!done) {
  diff = 0; mydiff = 0;
  for (i=mymin; i<=mymax; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i,j];
      A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
      mydiff += abs(A[i,j] - temp);
    }
  }
  lock(diff_lock); diff += mydiff; unlock(diff_lock);
  barrier(bar, P);
  if (diff/(n*n) < TOL) done=1;
  barrier(bar, P);
}
Example: Equation Solver Kernel
▪ TLP version (shared-memory) (for 2 processors):
– Each processor gets a contiguous chunk of rows
▪ E.g., with n=4 and P=2, processor 0 gets mymin=1 and mymax=2,
and processor 1 gets mymin=3 and mymax=4

int mymin = 1 + (pid * n/P);
int mymax = mymin + n/P - 1;

while (!done) {
  diff = 0; mydiff = 0;
  for (i=mymin; i<=mymax; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i,j];
      A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
      mydiff += abs(A[i,j] - temp);
    }
  ...
Example: Equation Solver Kernel
▪ TLP version (shared-memory):
– All processors can freely access the same data structure A
– Accesses to the shared diff, however, must take turns (hence the lock)
– Each processor updates its own done variable, but all do so at the same point, based on the shared diff
  ...
  for (i=mymin; i<=mymax; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i,j];
      A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
      mydiff += abs(A[i,j] - temp);
    }
  }
  lock(diff_lock); diff += mydiff; unlock(diff_lock);
  barrier(bar, P);
  if (diff/(n*n) < TOL) done=1;
  barrier(bar, P);
}
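The lock() / unlock() / barrier() calls in the kernel above are generic. One possible mapping onto POSIX threads primitives is sketched below; the helper function, variable layout, and initialization notes are assumptions, not part of the slides.

#include <pthread.h>

static pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t bar;     /* initialize once with pthread_barrier_init(&bar, NULL, P) */
static double diff;               /* shared accumulated difference */

/* End-of-sweep section executed by every thread. */
static void end_of_sweep(double mydiff, int *done, int n, double tol) {
    pthread_mutex_lock(&diff_lock);      /* lock(diff_lock)                */
    diff += mydiff;                      /* add this thread's partial sum  */
    pthread_mutex_unlock(&diff_lock);    /* unlock(diff_lock)              */

    pthread_barrier_wait(&bar);          /* barrier(bar, P): diff is now complete   */
    if (diff / (n * n) < tol)
        *done = 1;                       /* each thread sets its own (private) done */
    pthread_barrier_wait(&bar);          /* barrier(bar, P): before diff is reset   */
}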
Types of Speedups and Scaling
▪ Scalability: adding x times more resources to the machine yields
close to x times better “performance”
– Usually resources are processors (but can also be memory size or
interconnect bandwidth)
– Usually means that with x times more processors we can get ~x times
speedup for the same problem
– In other words: how does efficiency (see Lecture 1) hold up as the number of processors increases?

▪ In reality we have different scalability models:


– Problem constrained
– Time constrained

▪ The most appropriate scalability model depends on the user's interests

Types of Speedups and Scaling
▪ Problem constrained (PC) scaling:
– Problem size is kept fixed
– Wall-clock execution time reduction is the goal
– Number of processors and memory size are increased
– “Speedup” is then defined as:

S_PC = Time(1 processor) / Time(p processors)

– Example: weather simulation that does not complete in a reasonable time
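▪ Hypothetical illustration (numbers are invented, not from the slides): if the fixed-size simulation takes 600 s on 1 processor and 80 s on 8 processors, then S_PC = 600 / 80 = 7.5, i.e. an efficiency of 7.5 / 8 ≈ 94%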

Types of Speedups and Scaling
▪ Time constrained (TC) scaling:
– Maximum allowable execution time is kept fixed
– Problem size increase is the goal
– Number of processors and memory size are increased
– “Speedup” is then defined as:

S_TC = Work(p processors) / Work(1 processor)

– Example: weather simulation with a refined grid
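▪ Hypothetical illustration (numbers are invented, not from the slides): if 1 processor handles a 100x100 grid within the time budget and 16 processors handle a 400x400 grid (16x the points) in the same time, then, assuming work grows linearly with the number of points, S_TC = 16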

