Parallel Computation Models: Vivek Sarkar
[Figure: matrix operands laid out in cachelines. Labels: "Stride-N access to one row*" and "Sequential access through entire matrix".]
When cache (or TLB or memory) can’t hold entire B matrix, there will be
a miss on every line.
When cache (or TLB or memory) can’t hold a row of A, there will be a
miss on each access
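The two miss patterns above can be reproduced with a toy simulation. This is a minimal sketch, assuming a fully associative LRU cache and a row-major matrix; the names `count_misses`, `N`, `LINE_WORDS`, and `NUM_LINES` are illustrative and not from the slides.

```python
from collections import OrderedDict

def count_misses(addresses, num_lines, line_words):
    """Count misses in a fully associative LRU cache (a simplifying assumption)."""
    cache = OrderedDict()  # keys are cache-line tags, ordered by recency
    misses = 0
    for addr in addresses:
        tag = addr // line_words
        if tag in cache:
            cache.move_to_end(tag)         # hit: mark line as most recently used
        else:
            misses += 1
            cache[tag] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

def sequential_addresses(n):
    """Row-by-row walk through a row-major n x n matrix (consecutive addresses)."""
    return [i * n + j for i in range(n) for j in range(n)]

def stride_n_addresses(n):
    """Column-by-column walk through a row-major n x n matrix (stride-n addresses)."""
    return [i * n + j for j in range(n) for i in range(n)]

# Hypothetical sizes: 64 x 64 words, 8-word lines, 16 lines of cache.
N, LINE_WORDS, NUM_LINES = 64, 8, 16
seq_misses = count_misses(sequential_addresses(N), NUM_LINES, LINE_WORDS)
stride_misses = count_misses(stride_n_addresses(N), NUM_LINES, LINE_WORDS)
print(seq_misses, stride_misses)  # 512 4096
```

With these sizes the sequential walk misses once per line (512 lines), while the stride-N walk misses on every one of the 4096 accesses, because a line is evicted before it is touched again.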
A uniprocessor is a sequence of memory modules
Highest level is large memory, low speed
Processor (level 0) is tiny memory, high speed
Modules are connected by channels
All channels can be active simultaneously
Data are moved in fixed-sized blocks
A block is a chunk of contiguous data
Block size depends on level
[Figure: memory modules stacked from DISK through DRAM and cache to regs.]
A “cubelet” of computation is the product of a submatrix of A with a submatrix of B.
- Data involved is proportional to surface area.
- Computation is proportional to volume.
[Figure: computation cube with faces labeled A, B, and C.]
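A blocked matrix multiply makes the cubelet idea concrete. This is a minimal sketch in plain Python, not the slides' own code; `matmul_blocked`, `cubelet_costs`, and the block size `b` are illustrative names, and the count of three b x b faces per cubelet is an assumption about what "surface area" tallies.

```python
def matmul_blocked(A, B, n, b):
    """Multiply n x n matrices (lists of lists) one b x b x b cubelet at a time."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # One cubelet: a submatrix of A times a submatrix of B,
                # accumulated into a submatrix of C.
                for i in range(i0, min(i0 + b, n)):
                    for k in range(k0, min(k0 + b, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + b, n)):
                            C[i][j] += a * B[k][j]
    return C

def cubelet_costs(b):
    """Return (words touched, multiply-adds) for one full cubelet.

    Assumption: data is counted as the three b x b faces (A, B, C blocks).
    """
    return 3 * b * b, b ** 3

print(cubelet_costs(8))  # (192, 512)
```

Under this counting, doubling `b` multiplies the data per cubelet by 4 and the computation by 8, so larger cubelets do more work per word moved, as long as the three blocks still fit in the faster level.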
Height of module = lg(blocksize)
Width = lg(number of blocks)
Length of channel = lg(transfer time)
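The drawing convention above is simple arithmetic on each module's parameters. A minimal sketch follows, assuming lg means log base 2; the `MHModule` class, the units in the comments, and the example numbers are hypothetical and chosen only to give round logarithms.

```python
import math
from dataclasses import dataclass

@dataclass
class MHModule:
    name: str
    block_size: int     # words per block (unit is an assumption)
    num_blocks: int     # number of blocks the module holds
    transfer_time: int  # time to move one block over the module's channel (assumed unit: cycles)

def drawing_dims(m):
    """Return (height, width, channel_length) as drawn in the MH picture."""
    return (math.log2(m.block_size),
            math.log2(m.num_blocks),
            math.log2(m.transfer_time))

# Hypothetical three-level hierarchy; none of these numbers come from the slides.
hierarchy = [
    MHModule("regs", 1, 32, 1),
    MHModule("cache", 8, 1024, 4),
    MHModule("DRAM", 512, 1 << 20, 64),
]
for m in hierarchy:
    print(m.name, drawing_dims(m))
```

Because both axes are logarithmic, the drawn area of a module is lg(blocksize) times lg(number of blocks), which is not the logarithm of its capacity; capacity is the product `block_size * num_blocks`.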
Alpern & Carter: “Since MH model is so great, let’s generalize it for parallel
computers!”
A computer is a tree of memory modules
Largest memory is at root.
Children have less memory, more compute power.
Four parameters per module
Block size, number of blocks, transfer time from parent, and number of
children.
Homogeneous: all modules at a level have the same parameters
(PMH ignores difference between shared and distributed address space
computation.)
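The tree-of-modules description maps directly onto a small recursive data structure. This is a minimal sketch, assuming one parameter tuple per level (the homogeneous case); `PMHModule`, `build_homogeneous`, `count_leaves`, and the example numbers are illustrative, not part of the PMH definition.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PMHModule:
    block_size: int      # parameter 1
    num_blocks: int      # parameter 2
    transfer_time: int   # parameter 3: transfer time from parent
    children: List["PMHModule"] = field(default_factory=list)  # parameter 4 is len(children)

def build_homogeneous(levels):
    """Build a tree from per-level tuples, root first.

    Each tuple is (block_size, num_blocks, transfer_time, num_children).
    The last level's num_children is ignored: those modules are leaves.
    """
    block_size, num_blocks, transfer_time, num_children = levels[0]
    node = PMHModule(block_size, num_blocks, transfer_time)
    if len(levels) > 1:
        node.children = [build_homogeneous(levels[1:]) for _ in range(num_children)]
    return node

def count_leaves(node):
    """Number of leaf modules, i.e. the modules with the most compute power."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)

# Hypothetical machine: one large root memory, 4 nodes, 2 processors per node.
tree = build_homogeneous([
    (1024, 1 << 20, 1000, 4),
    (64, 4096, 100, 2),
    (1, 32, 1, 0),
])
print(count_leaves(tree))  # 8
```

The root has no parent, so its transfer time is a placeholder in this sketch. Nothing in the structure records whether children share an address space, which matches the remark that PMH ignores that distinction.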
[Figure: example PMH trees. Module labels: network, extended storage, main memories, caches, disks, scalar cache, vector regs, registers. Captions: "Vector supercomputer", "The Grid", "NOW".]
[Figure: machine hierarchy. Labels, top to bottom: internodal network; DRAM; Node; L2; SRAM; L1; P1 ... Pn; registers; functional units; synch.]
• Horizontal Structure
– Concurrency among a fixed number of virtual processors.
– Processes do not have a particular order.
– Locality plays no role in the placement of processes on processors.
– p = number of processors.
[Figure labels: "Global Communication", "Barrier Synchronization".]
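The horizontal structure can be imitated with threads: p virtual processors compute locally, communicate globally, then meet at a barrier. This is a minimal sketch using Python's `threading` module, not BSPlib; `run_superstep`, the ring-style destination, and the squared value are all hypothetical choices made for illustration.

```python
import threading

def run_superstep(p):
    """Run one compute / communicate / barrier round on p virtual processors."""
    inbox = [[] for _ in range(p)]
    locks = [threading.Lock() for _ in range(p)]
    barrier = threading.Barrier(p)
    totals = [0] * p

    def worker(pid):
        value = pid * pid              # local computation
        dest = (pid + 1) % p           # any destination is allowed; locality plays no role
        with locks[dest]:
            inbox[dest].append(value)  # global communication
        barrier.wait()                 # barrier synchronization
        totals[pid] = sum(inbox[pid])  # read messages only after the barrier

    threads = [threading.Thread(target=worker, args=(pid,)) for pid in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals

print(run_superstep(4))  # [9, 0, 1, 4]
```

The threads may run in any order, yet the result is the same on every run, because no process reads its inbox until all processes have passed the barrier.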
Cray T3E    47  506  1.2  40
IBM SP2    26  5400  9  6
Pentium NOW, serial Ethernet    1  61  540,000  2800  61
• Halt Functions
– bsp_abort()
• one process halts all
[Figure: timing diagram for processors P0 through P7, with intervals labeled g, o, and L; time axis marked 4, 8, 12, 16, 20, 24.]