
COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface, 5th Edition

Chapter 6
Parallel Processors from
Client to Cloud
§6.1 Introduction
• Multiprocessor
  • Goal: connecting multiple computers to get higher performance
  • Scalability, availability, power efficiency
• Multicore processors
  • Chips with multiple processors (cores)
• Task-level (process-level) parallelism
  • High throughput for independent jobs
• Parallel program (parallel software)
  • A single program run on multiple processors
  • Challenges: partitioning, coordination, communications overhead
Amdahl's Law
• The sequential part can limit speedup
• Example: can 100 processors yield a 90× speedup?
  • T_new = T_parallelizable/100 + T_sequential
  • Speedup = 1 / ((1 − F_parallelizable) + F_parallelizable/100) = 90
  • Solving: F_parallelizable ≈ 0.999
• The sequential part must be no more than 0.1% of the original execution time
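
A minimal C sketch (not from the slides) that plugs the example's numbers into Amdahl's law; since the slide rounds F_parallelizable, 0.999 actually yields about a 91× speedup:

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / ((1 - f) + f/p), where f is the
       parallelizable fraction of execution time and p is the number
       of processors. */
    static double amdahl_speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void) {
        /* The slide's example: p = 100 needs f ~= 0.999 for ~90x. */
        printf("f = 0.999, p = 100: speedup = %.1f\n",
               amdahl_speedup(0.999, 100));
        /* Contrast: a 99%-parallel program falls far short of 90x. */
        printf("f = 0.990, p = 100: speedup = %.1f\n",
               amdahl_speedup(0.990, 100));
        return 0;
    }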



Scaling Example
• Workload
  • Sum of 10 scalars (sequential)
  • Sum of a pair of 10×10 matrices (parallel)
• What is the speedup going from a single processor to 10 and to 100 processors?
  (Assumes the load is balanced across processors.)
• Single processor:
  • Time = (10 + 100) × t_add
• 10 processors:
  • Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  • Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors:
  • Time = 10 × t_add + 100/100 × t_add = 11 × t_add
  • Speedup = 110/11 = 10 (10% of potential)
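
The same arithmetic as a small C sketch, with the time model (in units of t_add) taken straight from the slide:

    #include <stdio.h>

    /* Time model from the slide, in units of t_add: 10 sequential
       scalar adds plus 100 matrix-element adds divided evenly
       across p processors. */
    static double time_units(int seq, int par, int p) {
        return seq + (double)par / p;
    }

    int main(void) {
        double t1 = time_units(10, 100, 1);   /* 110 x t_add */
        for (int p = 10; p <= 100; p *= 10) {
            double tp = time_units(10, 100, p);
            printf("p = %3d: time = %5.1f x t_add, speedup = %4.1f "
                   "(%2.0f%% of potential)\n",
                   p, tp, t1 / tp, 100.0 * t1 / tp / p);
        }
        return 0;
    }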



Scaling Example (cont)
• What if the matrix size is 100×100?
  (Still assumes the load is balanced across processors.)
• Single processor:
  • Time = (10 + 10000) × t_add
• 10 processors:
  • Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
  • Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors:
  • Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
  • Speedup = 10010/110 = 91 (91% of potential)



Strong vs. Weak Scaling
• Strong scaling: problem size fixed
  • Measure the speedup achieved on a multiprocessor while keeping the problem size fixed
• Weak scaling: problem size proportional to the number of processors
  • Measure the speedup achieved on a multiprocessor while increasing the problem size proportionally to the increase in the number of processors
• Weak-scaling the example gives roughly constant performance:
  • 10 processors, 10×10 matrix: Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  • 100 processors, 32×32 matrix: Time = 10 × t_add + 1024/100 × t_add ≈ 20 × t_add
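
A corresponding sketch of the weak-scaling arithmetic; the element counts (100 for 10×10, 1024 for 32×32) come from the slide:

    #include <stdio.h>

    /* Weak scaling per the slide: grow the matrix with the processor
       count so each processor keeps ~10 elements; time stays roughly
       constant at ~20 x t_add. */
    int main(void) {
        struct { int p, elems; } cfg[] = { {10, 100}, {100, 1024} };
        for (int i = 0; i < 2; i++) {
            double t = 10 + (double)cfg[i].elems / cfg[i].p;
            printf("p = %3d, %4d elements: time = %.2f x t_add\n",
                   cfg[i].p, cfg[i].elems, t);
        }
        return 0;
    }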

§6.3 SISD, MIMD, SIMD, SPMD, and Vector
Instruction and Data Streams
• An alternative classification of parallel hardware:

                                      Data Streams
                            Single             Multiple
  Instruction   Single      SISD:              SIMD (vector processor):
  Streams                   Intel Pentium 4    SSE instructions of x86
                Multiple    MISD:              MIMD:
                            no example yet     Intel Xeon e5345



Vector Architecture
• Highly pipelined functional units
• Stream data from/to vector registers to the units
  • Data collected from memory into vector registers
  • Results stored from vector registers to memory
• Example: vector extension to MIPS
  • 32 vector registers, each with 64 64-bit elements
  • Vector instructions
    • lv, sv: load/store vector
    • addv.d: add two vectors of doubles
    • addvs.d: add a scalar to each element of a vector of doubles



Vector Architecture
• Single add pipeline
  • Completes one addition per cycle
• Four add pipelines (an array of parallel functional units)
  • Complete four additions per cycle



Example: DAXPY (Y = a × X + Y)
• Conventional MIPS code:

          l.d    $f0,a($sp)      ;load scalar a
          addiu  $t1,$s0,#512    ;upper bound of what to load
    loop: l.d    $f2,0($s0)      ;load x(i)
          mul.d  $f2,$f2,$f0     ;a × x(i)
          l.d    $f4,0($s1)      ;load y(i)
          add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
          s.d    $f4,0($s1)      ;store into y(i)
          addiu  $s0,$s0,#8      ;increment index to x
          addiu  $s1,$s1,#8      ;increment index to y
          subu   $t0,$t1,$s0     ;compute remaining bytes
          bne    $t0,$zero,loop  ;check if done

• Vector MIPS code:

          l.d     $f0,a($sp)     ;load scalar a
          lv      $v1,0($s0)     ;load vector x
          mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
          lv      $v3,0($s1)     ;load vector y
          addv.d  $v4,$v2,$v3    ;add y to product
          sv      $v4,0($s1)     ;store the result
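
For reference, the loop that both sequences above implement, written in C; the 64-element length matches the vector register length (64 elements × 8 bytes gives the #512 byte bound):

    /* DAXPY: Y = a*X + Y over 64 double-precision elements. */
    void daxpy(double a, const double x[64], double y[64]) {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];
    }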
Vector vs. Scalar Architecture
• Reduced instruction fetch and decode
  • A single vector instruction is equivalent to executing an entire loop.
• Less data-hazard checking
  • Hazards are checked once per vector operand, not for every element, because the computations on the elements within a vector are independent.
• No loop control hazard
  • The entire loop is replaced by one vector instruction, so the loop branch (and its control hazard) is non-existent.
• Efficient memory access
  • If the vector's elements are all adjacent in memory, fetching them from interleaved memory banks is efficient.



§6.4 Hardware Multithreading
• Hardware multithreading
  • Multiple hardware threads (replicated registers, PC, etc.)
  • Fast context switching between threads
• 1. Fine-grained multithreading
  • Switch threads after each instruction
  • If one thread stalls (whether long or short), others are executed
  • Slows down the execution of individual threads, especially those without stalls
• 2. Coarse-grained multithreading
  • Only switch on a long stall (e.g., an L2-cache miss)
  • Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)



Simultaneous Multithreading (SMT)
• 3. Simultaneous multithreading (SMT)
  • Used in multiple-issue, dynamically scheduled pipelined CPUs
• Motivation
  • Multiple-issue processors often have more functional units available than a single thread can use
• Schedule instructions from multiple threads
  • Instructions from independent threads execute whenever functional units are available
  • Hides the throughput loss from both short and long stalls



Multithreading Example
[Figure: issue slots over time for coarse-grained multithreading (Coarse MT), fine-grained multithreading (Fine MT), and simultaneous multithreading (SMT)]
§6.5 Multicore and Other Shared Memory Multiprocessors
Shared Memory Multiprocessor
• SMP: Symmetric Multi-Processing
  • Hardware provides a single physical address space for all processors
  • Synchronize shared variables using locks (see the sketch just below)
  • Memory access time: UMA (uniform) vs. NUMA (nonuniform)
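
A minimal lock sketch using POSIX threads (pthreads is an assumption; the slide does not name a threading library):

    #include <pthread.h>

    /* On an SMP all threads see the same physical address for
       `total`, so concurrent updates must be serialized by a lock. */
    static double total = 0.0;
    static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

    void add_to_total(double v) {
        pthread_mutex_lock(&total_lock);   /* acquire the lock */
        total += v;                        /* critical section */
        pthread_mutex_unlock(&total_lock); /* release the lock */
    }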

[Figure: UMA organization (processors with private caches sharing memory through one interconnection network) vs. NUMA organization (a memory attached to each processor, linked by the interconnection network)]
Example: Sum Reduction
• Sum 100,000 numbers on a 100-processor UMA machine
  • Each processor has an ID: 0 ≤ Pn ≤ 99
  • Partition: 1,000 numbers per processor
• Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

• Now the partial sums need to be added
  • Reduction: divide and conquer
  • Half of the processors add pairs, then a quarter, ...
  • Need to synchronize between reduction steps



Example: Sum Reduction

    half = 100;
    repeat
        synch();                  /* wait for all processors */
        if (half%2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
                                  /* conditional sum needed when half
                                     is odd; processor 0 gets the
                                     leftover element */
        half = half/2;            /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);            /* final sum is in sum[0] */
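
Below is a runnable pthreads version of the whole example, as a sketch: one thread stands in for each processor, and pthread_barrier_wait plays the role of synch(). The data values are made up so the result is checkable.

    #include <pthread.h>
    #include <stdio.h>

    #define P 100                 /* "processors" (threads) */
    #define N 100000              /* numbers to sum */

    static double A[N];
    static double sum[P];
    static pthread_barrier_t barrier;

    /* Each thread Pn: local partial sum, then the slide's
       divide-and-conquer reduction. */
    static void *worker(void *arg) {
        int Pn = (int)(long)arg;
        sum[Pn] = 0.0;
        for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
            sum[Pn] += A[i];
        int half = P;
        do {
            pthread_barrier_wait(&barrier);  /* synch() */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half - 1];     /* odd case: leftover */
            half /= 2;
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half > 1);                  /* until (half == 1) */
        return NULL;
    }

    int main(void) {
        pthread_t tid[P];
        for (int i = 0; i < N; i++) A[i] = 1.0;   /* expect 100000 */
        pthread_barrier_init(&barrier, NULL, P);
        for (long i = 0; i < P; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < P; i++)
            pthread_join(tid[i], NULL);
        printf("total = %.0f\n", sum[0]);
        pthread_barrier_destroy(&barrier);
        return 0;
    }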



§6.7 Clusters, WSC, and Other Message-Passing MPs
Loosely Coupled Clusters
• Network of independent computers
  • Each has private memory and its own OS
  • Connected via the I/O system, e.g., Ethernet/switch, the Internet
• Suitable for applications with independent tasks
  • Web servers, databases, simulations, ...
• High availability, scalable, affordable
• Problems
  • Administration cost
  • Low interconnect bandwidth (cf. processor/memory bandwidth on an SMP)



Message Passing
• Each processor has a private physical address space
• Hardware sends/receives messages between processors
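
As a concrete sketch of this style, here is the sum example redone with explicit messages. MPI is an assumption (the slide names no library), but MPI_Send/MPI_Recv are its standard send/receive primitives:

    #include <mpi.h>
    #include <stdio.h>

    /* Message-passing sum: no shared addresses; partial sums travel
       between processes as explicit messages. */
    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local = 0.0;
        for (int i = 0; i < 1000; i++)  /* each process sums its own slice */
            local += 1.0;               /* made-up data: 1000 ones each */

        if (rank != 0) {                /* send partial sum to process 0 */
            MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else {                        /* process 0 receives and totals */
            double total = local, part;
            for (int src = 1; src < nprocs; src++) {
                MPI_Recv(&part, 1, MPI_DOUBLE, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                total += part;
            }
            printf("total = %.0f\n", total);  /* 1000 x nprocs */
        }
        MPI_Finalize();
        return 0;
    }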



Interconnection Networks
• Network topologies
  • Arrangements of processors, switches, and links
[Figure: common topologies — bus, ring, 2D mesh, N-cube (N = 3), fully connected]



§6.14 Concluding Remarks
• Higher performance by using multiple processors
• Difficulties
  • Developing parallel software
  • Devising appropriate architectures
• SIMD and vector operations match multimedia applications and are easy to program
• Higher disk performance by using RAID

