
COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface, 5th Edition

Chapter 6
Parallel Processors from
Client to Cloud
§6.1 Introduction
• Multiprocessor
  • Goal: connecting multiple computers to get higher performance
  • Scalability, availability, power efficiency
• Multicore processors
  • Chips with multiple processors (cores)
• Task-level (process-level) parallelism
  • High throughput for independent jobs
• Parallel program (parallel software)
  • A single program run on multiple processors
  • Challenges: partitioning, coordination, communications overhead
Amdahl's Law
• The sequential part can limit speedup
• Example: can 100 processors yield a 90× speedup?
  • T_new = T_parallelizable/100 + T_sequential
  • Speedup = 1 / ((1 − F_parallelizable) + F_parallelizable/100) = 90
  • Solving: F_parallelizable ≈ 0.999
• The sequential part must be no more than 0.1% of the original execution time
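
A minimal C sketch (not from the slides) that plugs the example's numbers into Amdahl's law; since the slide rounds F_parallelizable, 0.999 actually yields about a 91× speedup:

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / ((1 - f) + f/p), where f is the
       parallelizable fraction of execution time and p is the number
       of processors. */
    static double amdahl_speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void) {
        /* The slide's example: p = 100 needs f ~= 0.999 for ~90x. */
        printf("f = 0.999, p = 100: speedup = %.1f\n",
               amdahl_speedup(0.999, 100));
        /* Contrast: a 99%-parallel program falls far short of 90x. */
        printf("f = 0.990, p = 100: speedup = %.1f\n",
               amdahl_speedup(0.990, 100));
        return 0;
    }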



Scaling Example
• Workload
  • Sum of 10 scalars (sequential)
  • Sum of a pair of 10×10 matrices (parallel)
• What is the speedup going from a single processor to 10 and to 100 processors?
  (Assumes the load is balanced across processors.)
• Single processor:
  • Time = (10 + 100) × t_add
• 10 processors:
  • Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  • Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors:
  • Time = 10 × t_add + 100/100 × t_add = 11 × t_add
  • Speedup = 110/11 = 10 (10% of potential)
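
The same arithmetic as a small C sketch, with the time model (in units of t_add) taken straight from the slide:

    #include <stdio.h>

    /* Time model from the slide, in units of t_add: 10 sequential
       scalar adds plus 100 matrix-element adds divided evenly
       across p processors. */
    static double time_units(int seq, int par, int p) {
        return seq + (double)par / p;
    }

    int main(void) {
        double t1 = time_units(10, 100, 1);   /* 110 x t_add */
        for (int p = 10; p <= 100; p *= 10) {
            double tp = time_units(10, 100, p);
            printf("p = %3d: time = %5.1f x t_add, speedup = %4.1f "
                   "(%2.0f%% of potential)\n",
                   p, tp, t1 / tp, 100.0 * t1 / tp / p);
        }
        return 0;
    }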



Scaling Example (cont)
• What if the matrix size is 100×100?
  (Still assumes the load is balanced across processors.)
• Single processor:
  • Time = (10 + 10000) × t_add
• 10 processors:
  • Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
  • Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors:
  • Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
  • Speedup = 10010/110 = 91 (91% of potential)



Strong vs. Weak Scaling
• Strong scaling: problem size fixed
  • Measure the speedup achieved on a multiprocessor while keeping the problem size fixed
• Weak scaling: problem size proportional to the number of processors
  • Measure the speedup achieved on a multiprocessor while increasing the problem size proportionally to the increase in the number of processors
• Weak-scaling the example gives roughly constant performance:
  • 10 processors, 10×10 matrix: Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  • 100 processors, 32×32 matrix: Time = 10 × t_add + 1024/100 × t_add ≈ 20 × t_add
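
A corresponding sketch of the weak-scaling arithmetic; the element counts (100 for 10×10, 1024 for 32×32) come from the slide:

    #include <stdio.h>

    /* Weak scaling per the slide: grow the matrix with the processor
       count so each processor keeps ~10 elements; time stays roughly
       constant at ~20 x t_add. */
    int main(void) {
        struct { int p, elems; } cfg[] = { {10, 100}, {100, 1024} };
        for (int i = 0; i < 2; i++) {
            double t = 10 + (double)cfg[i].elems / cfg[i].p;
            printf("p = %3d, %4d elements: time = %.2f x t_add\n",
                   cfg[i].p, cfg[i].elems, t);
        }
        return 0;
    }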

§6.3 SISD, MIMD, SIMD, SPMD, and Vector
Instruction and Data Streams
• An alternative classification of parallel hardware:

                                      Data Streams
                            Single             Multiple
  Instruction   Single      SISD:              SIMD (vector processor):
  Streams                   Intel Pentium 4    SSE instructions of x86
                Multiple    MISD:              MIMD:
                            no example yet     Intel Xeon e5345



Vector Architecture
• Highly pipelined functional units
• Stream data from/to vector registers to the units
  • Data collected from memory into vector registers
  • Results stored from vector registers to memory
• Example: vector extension to MIPS
  • 32 vector registers, each with 64 64-bit elements
  • Vector instructions
    • lv, sv: load/store vector
    • addv.d: add two vectors of doubles
    • addvs.d: add a scalar to each element of a vector of doubles



Vector Architecture
• Single add pipeline
  • Completes one addition per cycle
• Four add pipelines (an array of parallel functional units)
  • Complete four additions per cycle



Example: DAXPY (Y = a × X + Y)
• Conventional MIPS code:

          l.d    $f0,a($sp)      ;load scalar a
          addiu  $t1,$s0,#512    ;upper bound of what to load
    loop: l.d    $f2,0($s0)      ;load x(i)
          mul.d  $f2,$f2,$f0     ;a × x(i)
          l.d    $f4,0($s1)      ;load y(i)
          add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
          s.d    $f4,0($s1)      ;store into y(i)
          addiu  $s0,$s0,#8      ;increment index to x
          addiu  $s1,$s1,#8      ;increment index to y
          subu   $t0,$t1,$s0     ;compute remaining bytes
          bne    $t0,$zero,loop  ;check if done

• Vector MIPS code:

          l.d     $f0,a($sp)     ;load scalar a
          lv      $v1,0($s0)     ;load vector x
          mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
          lv      $v3,0($s1)     ;load vector y
          addv.d  $v4,$v2,$v3    ;add y to product
          sv      $v4,0($s1)     ;store the result
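
For reference, the loop that both sequences above implement, written in C; the 64-element length matches the vector register length (64 elements × 8 bytes gives the #512 byte bound):

    /* DAXPY: Y = a*X + Y over 64 double-precision elements. */
    void daxpy(double a, const double x[64], double y[64]) {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];
    }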
Vector vs. Scalar Architecture
• Reduced instruction fetch and decode
  • A single vector instruction is equivalent to executing an entire loop.
• Less data-hazard checking
  • Hazards are checked once per vector operand, not for every element, because the computations on the elements within a vector are independent.
• No loop control hazard
  • The entire loop is replaced by one vector instruction, so the loop branch (and its control hazard) is non-existent.
• Efficient memory access
  • If the vector's elements are all adjacent in memory, fetching them from interleaved memory banks is efficient.



§6.4 Hardware Multithreading
• Hardware multithreading
  • Multiple hardware threads (replicated registers, PC, etc.)
  • Fast context switching between threads
• 1. Fine-grained multithreading
  • Switch threads after each instruction
  • If one thread stalls (whether long or short), others are executed
  • Slows down the execution of individual threads, especially those without stalls
• 2. Coarse-grained multithreading
  • Only switch on a long stall (e.g., an L2-cache miss)
  • Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)



Simultaneous Multithreading (SMT)
• 3. Simultaneous multithreading (SMT)
  • Used in multiple-issue, dynamically scheduled pipelined CPUs
• Motivation
  • Multiple-issue processors often have more functional units available than a single thread can use
• Schedule instructions from multiple threads
  • Instructions from independent threads execute whenever functional units are available
  • Hides the throughput loss from both short and long stalls



Multithreading Example
[Figure: issue slots over time for coarse-grained multithreading (Coarse MT), fine-grained multithreading (Fine MT), and simultaneous multithreading (SMT)]
§6.5 Multicore and Other Shared Memory Multiprocessors
Shared Memory Multiprocessor
• SMP: Symmetric Multi-Processing
  • Hardware provides a single physical address space for all processors
  • Synchronize shared variables using locks (see the sketch just below)
  • Memory access time: UMA (uniform) vs. NUMA (nonuniform)
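
A minimal lock sketch using POSIX threads (pthreads is an assumption; the slide does not name a threading library):

    #include <pthread.h>

    /* On an SMP all threads see the same physical address for
       `total`, so concurrent updates must be serialized by a lock. */
    static double total = 0.0;
    static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

    void add_to_total(double v) {
        pthread_mutex_lock(&total_lock);   /* acquire the lock */
        total += v;                        /* critical section */
        pthread_mutex_unlock(&total_lock); /* release the lock */
    }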

[Figure: UMA organization (processors with private caches sharing memory through one interconnection network) vs. NUMA organization (a memory attached to each processor, linked by the interconnection network)]
Example: Sum Reduction
• Sum 100,000 numbers on a 100-processor UMA machine
  • Each processor has an ID: 0 ≤ Pn ≤ 99
  • Partition: 1,000 numbers per processor
• Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

• Now the partial sums need to be added
  • Reduction: divide and conquer
  • Half of the processors add pairs, then a quarter, ...
  • Need to synchronize between reduction steps



Example: Sum Reduction

    half = 100;
    repeat
        synch();                  /* wait for all processors */
        if (half%2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
                                  /* conditional sum needed when half
                                     is odd; processor 0 gets the
                                     leftover element */
        half = half/2;            /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);            /* final sum is in sum[0] */
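
Below is a runnable pthreads version of the whole example, as a sketch: one thread stands in for each processor, and pthread_barrier_wait plays the role of synch(). The data values are made up so the result is checkable.

    #include <pthread.h>
    #include <stdio.h>

    #define P 100                 /* "processors" (threads) */
    #define N 100000              /* numbers to sum */

    static double A[N];
    static double sum[P];
    static pthread_barrier_t barrier;

    /* Each thread Pn: local partial sum, then the slide's
       divide-and-conquer reduction. */
    static void *worker(void *arg) {
        int Pn = (int)(long)arg;
        sum[Pn] = 0.0;
        for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
            sum[Pn] += A[i];
        int half = P;
        do {
            pthread_barrier_wait(&barrier);  /* synch() */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half - 1];     /* odd case: leftover */
            half /= 2;
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half > 1);                  /* until (half == 1) */
        return NULL;
    }

    int main(void) {
        pthread_t tid[P];
        for (int i = 0; i < N; i++) A[i] = 1.0;   /* expect 100000 */
        pthread_barrier_init(&barrier, NULL, P);
        for (long i = 0; i < P; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < P; i++)
            pthread_join(tid[i], NULL);
        printf("total = %.0f\n", sum[0]);
        pthread_barrier_destroy(&barrier);
        return 0;
    }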



§6.7 Clusters, WSC, and Other Message-Passing MPs
Loosely Coupled Clusters
• Network of independent computers
  • Each has private memory and its own OS
  • Connected via the I/O system, e.g., Ethernet/switch, the Internet
• Suitable for applications with independent tasks
  • Web servers, databases, simulations, ...
• High availability, scalable, affordable
• Problems
  • Administration cost
  • Low interconnect bandwidth (cf. processor/memory bandwidth on an SMP)



Message Passing
• Each processor has a private physical address space
• Hardware sends/receives messages between processors
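
As a concrete sketch of this style, here is the sum example redone with explicit messages. MPI is an assumption (the slide names no library), but MPI_Send/MPI_Recv are its standard send/receive primitives:

    #include <mpi.h>
    #include <stdio.h>

    /* Message-passing sum: no shared addresses; partial sums travel
       between processes as explicit messages. */
    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local = 0.0;
        for (int i = 0; i < 1000; i++)  /* each process sums its own slice */
            local += 1.0;               /* made-up data: 1000 ones each */

        if (rank != 0) {                /* send partial sum to process 0 */
            MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else {                        /* process 0 receives and totals */
            double total = local, part;
            for (int src = 1; src < nprocs; src++) {
                MPI_Recv(&part, 1, MPI_DOUBLE, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                total += part;
            }
            printf("total = %.0f\n", total);  /* 1000 x nprocs */
        }
        MPI_Finalize();
        return 0;
    }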



Interconnection Networks
• Network topologies
  • Arrangements of processors, switches, and links
[Figure: common topologies — bus, ring, 2D mesh, N-cube (N = 3), fully connected]



§6.14 Concluding Remarks
• Higher performance by using multiple processors
• Difficulties
  • Developing parallel software
  • Devising appropriate architectures
• SIMD and vector operations match multimedia applications and are easy to program
• Higher disk performance by using RAID

