
Parallel Computing

Parallel Computers
Parallel processing/computing:
– at least two processors have to cooperate
– by means of exchanging data
– while working on different parts of one and the same problem

Parallel computers: make use of multiple processors for parallel processing
Speed-up

Basic idea of parallel processing: execution time can be reduced by
employing more than one processor; the larger the number of processors,
the smaller the execution time.

Speed-up:

s(p) = T1 / Tp

T1 -- execution time on one processor
Tp -- execution time on p processors

• Best case: s(p) = p (linear speed-up)
• Worst case: s(p) = 1
• Generally: 1 ≤ s(p) ≤ p

Execution on one processor:

T1 = Tseq + Tpar

Tseq -- execution time of the sequential part
Tpar -- execution time of the parallelisable part

Execution on p processors:

Tp = Tseq + Tpar / p

Combining the two expressions gives s(p) = (Tseq + Tpar) / (Tseq + Tpar/p),
so the speed-up is bounded above by (Tseq + Tpar) / Tseq no matter how many
processors are used (Amdahl's law); see the sketch below.
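The effect of the sequential part is easy to check numerically. A minimal
sketch in Python, assuming made-up illustrative values Tseq = 1 and
Tpar = 99 (these numbers are not from the slides):

# Speed-up under Amdahl's law: T1 = Tseq + Tpar, Tp = Tseq + Tpar / p.
def speedup(t_seq: float, t_par: float, p: int) -> float:
    t1 = t_seq + t_par          # execution time on one processor
    tp = t_seq + t_par / p      # execution time on p processors
    return t1 / tp

for p in (1, 2, 8, 32, 1024):
    print(p, round(speedup(1.0, 99.0, p), 2))
# Prints 1.0, 1.98, 7.48, 24.43, 91.18: the speed-up approaches, but
# never exceeds, (Tseq + Tpar) / Tseq = 100.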
Shared memory parallel computers

• Processors have equally fast access to any location in memory.

Distributed memory parallel computers

• Access to a processor's own memory is faster than access to the
memory of other processors.

Non-Uniform Memory Access (NUMA) Architectures
Performance

• Theoretical peak performance
• Linpack benchmark performance

Theoretical peak performance

• Theoretical peak performance Rpeak -- the maximal number of
arithmetical operations (additions and/or multiplications) a processor
can carry out per second:

Rpeak,1 = f · µpr

where:
– f -- clock frequency
– µpr -- maximum number of operations per clock cycle

• The theoretical peak performance of a parallel computer is equal to
the product of the number of processors and the theoretical peak
performance of one processor:

Rpeak,p = p · Rpeak,1
Examples:

• Cray J32
– f = 100 MHz, µpr = 2, Rpeak,1 = 200 Mflops
– p = 32, Rpeak,32 = 6.4 Gflops

• NEC SX-5
– f = 250 MHz, µpr = 32, Rpeak,1 = 8 Gflops
– p = 16, Rpeak,16 = 128 Gflops
– n = 32 nodes (NUMA), Rpeak,32×16 = 4 Tflops
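These figures follow directly from Rpeak,p = p · f · µpr. A quick sketch
to reproduce them, using the values from the slide:

# Theoretical peak performance Rpeak,p = p * f * mu_pr (flops).
def peak_flops(f_hz: float, mu_pr: int, p: int) -> float:
    return p * f_hz * mu_pr

print(peak_flops(100e6, 2, 32))        # Cray J32: 6.4e9 = 6.4 Gflops
print(peak_flops(250e6, 32, 16))       # NEC SX-5: 1.28e11 = 128 Gflops
print(peak_flops(250e6, 32, 16) * 32)  # 32 NUMA nodes: ~4.1e12 ~ 4 Tflops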
Benchmark performance

• Benchmark:
– a program for a specific problem
– the number of operations which are executed is known
– used to measure the run time in single-user mode
– to determine the benchmark performance (operations per second)
Linpack benchmark

• Linpack -- a popular library of Fortran subroutines for the numerical
solution of linear algebra problems

• Linpack benchmark -- based on one particular subroutine which is used
for the solution of a dense system of linear equations
– algorithm: LU factorization by Gaussian elimination with partial pivoting
– number of operations: 2n³/3 + O(n²) (n -- number of unknowns)

• Top-500 list of most powerful computer installations:
http://www.top500.org/
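Benchmark performance is then just the known operation count divided by
the measured run time. A minimal sketch; the problem size and run time
below are made-up illustrative values:

# Linpack-style rate: known flop count 2n^3/3 divided by the run time.
def linpack_rate(n: int, t_seconds: float) -> float:
    flops = 2 * n**3 / 3        # leading term of the operation count
    return flops / t_seconds    # operations per second

print(linpack_rate(10_000, 5.0) / 1e9)  # ~133.3 Gflops for a 5 s solve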
Interconnection structures for parallel computers

Bisection or cross-section bandwidth

• Definition: the effective rate at which one half of the processing
nodes can send data to the other half (for the worst-case division of
the processors); a brute-force check for small n is sketched below.
• It does not scale linearly with the number of processing nodes in
most interconnection schemes.
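For small node counts the worst-case division can be found by brute
force. A sketch that counts cut links (link counts, not data rates; the
topology builders are illustrative):

# Brute-force bisection width: the minimum number of links cut over all
# divisions of the nodes into two equal halves. Feasible only for small n.
from itertools import combinations

def bisection_width(n, edges):
    best = None
    for half in combinations(range(n), n // 2):
        side = set(half)
        cut = sum(1 for u, v in edges if (u in side) != (v in side))
        best = cut if best is None else min(best, cut)
    return best

ring = [(i, (i + 1) % 8) for i in range(8)]
complete = [(i, j) for i in range(8) for j in range(i + 1, 8)]
print(bisection_width(8, ring))      # 2: a ring is cut in two places
print(bisection_width(8, complete))  # 16 = (n/2)^2 crossing links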
Complete communication graph

• The bisection bandwidth grows in proportion to the number of nodes.
• The number of edges: n(n-1)/2

Bus

• The bisection bandwidth of the system is constant and equal to the
bandwidth of the bus.
• Simple software and hardware.

Crossbar switch

• Bisection bandwidth scales with the number of processing nodes.
• Total number of communication network ports -- Θ(n)
• Number of links -- Θ(n²)
• In practice the crossbar switch is used only to interconnect a
relatively small number of processors.
Multistage switching networks

• A series of switches grouped in stages realizes the connection
between pairs of inputs and outputs.
• Can be organized in many different topologies fitted to particular
applications.
• Number of links -- Θ(n log(n))
• Bisection bandwidth -- Θ(n)

Example - Beneš network
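A Beneš network on n = 2^k terminals consists of 2k − 1 stages, each of
n/2 two-by-two switches; this construction detail is standard background
rather than something stated on the slides. A small counting sketch:

# Size of a Benes network on n = 2**k terminals: 2k - 1 stages of
# n/2 two-by-two switches each (standard construction, assumed here).
import math

def benes_size(n: int):
    k = int(math.log2(n))
    assert 2**k == n, "n must be a power of two"
    stages = 2 * k - 1
    return stages, stages * n // 2   # (stages, total switches)

print(benes_size(8))    # (5, 20)
print(benes_size(64))   # (11, 352)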
Regular grids: 1-D arrays

• Linear processor array and ring.
• Bisection bandwidth -- Ω(1)
• Remote communication -- O(n)

Regular grids: 2-D arrays

• 2-D mesh.
• Torus.
• Twisted torus.
• Remote communication needs time O(n^(1/2)): the diameter of a
√n × √n mesh is 2(√n − 1).
• Bisection bandwidth -- Ω(n^(1/2)): cutting a √n × √n mesh in half
severs √n links.

A two-dimensional mesh
Regular grids: 3-D arrays

• Remote communication -- O(n^(1/3))
• Bisection bandwidth -- Ω(n^(2/3))

Example: Cray T3E -- 10 × 10 × 10 grid

Trees

Binary tree

• Remote communication -- O(log(n)).
• Trees fit well the communication requirements of reduction operations
and of a number of optimal algorithms based on divide-and-conquer
techniques (see the sketch after this list).
• Less suited for regular data array redistribution operations.
• The decreasing aggregate bandwidth of a tree network in its upper
levels, in particular around the root, presents a severe bottleneck for
massive communication: the bisection bandwidth of a binary tree is only
O(1), since cutting a single link near the root splits the machine in half.
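Why trees suit reductions: n partial results can be combined pairwise in
ceil(log2(n)) rounds, halving the number of active values each round. A
minimal sketch with illustrative names:

# Pairwise (tree) reduction: n values are combined in ceil(log2 n) rounds.
def tree_reduce(values, op):
    vals = list(values)
    while len(vals) > 1:
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:            # an odd value passes through unchanged
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

print(tree_reduce(range(8), lambda a, b: a + b))  # 28, computed in 3 rounds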
Fat tree

• The aggregate bandwidth of a fat tree network is kept constant at all
levels of the tree.
Binary Hypercubes

• A binary hypercube of degree d consists of n = 2^d nodes labeled by
distinct d-bit binary numbers.
• Two nodes are connected by an edge iff their respective labels differ
in exactly one bit position (see the sketch below).
• Number of links -- O(n log(n))
• Bisection bandwidth scales in proportion to the number of nodes.
• Remote communication -- O(log(n))
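The one-bit-difference rule makes adjacency a simple XOR test. A minimal
sketch (function names are illustrative):

# Hypercube adjacency: nodes u and v are neighbours iff u XOR v has
# exactly one bit set, i.e. it is a power of two.
def adjacent(u: int, v: int) -> bool:
    x = u ^ v
    return x != 0 and (x & (x - 1)) == 0

def neighbours(u: int, d: int):
    return [u ^ (1 << i) for i in range(d)]   # flip each of the d bits

print(neighbours(0b000, 3))    # [1, 2, 4] -- the 3 neighbours of node 0
print(adjacent(0b011, 0b111))  # True: the labels differ in one bit only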


Examples of binary hypercubes
