
Chapter 8
Multivector and SIMD Computers

by Prajwala T R
Dept. of CSE, PESIT

Overview: Vector processors apply operations to whole vectors of data rather than to scalars, using vector-vector, vector-scalar, and vector-memory instructions. Vectorization improves performance by reducing software overhead, and memory is organized for concurrent access to sustain pipeline throughput. Vector supercomputers balance vector and scalar performance through design goals such as scalability and high I/O performance. SIMD computers apply the same instruction to many data elements using arrays of processing elements with local interconnects.
Vector processing principles
• Vector: an ordered collection of scalar items of the same type.
• Successive elements are addressed with a fixed increment called the stride.
• A vector processor is an ensemble of vector registers, functional pipelines, scalar registers, and a vectorizer.
• Vector processing: arithmetic and logical operations are applied to entire vectors.
• Vectorization: conversion of scalar loop code into equivalent vector instructions.
• Vector processing is faster and more efficient than scalar processing because it reduces software overhead (one vector instruction replaces many scalar instructions and their loop control).
Vector instruction types
• Vector-vector instructions: vector operands produce a vector result (e.g., V1 op V2 -> V3).
• Vector-scalar instructions: a scalar operand is combined with a vector operand (e.g., s op V1 -> V2).
• Vector-memory instructions: vector load and vector store between memory and vector registers.
• Gather and scatter instructions (using an index vector V0):
  – Gather: M -> V1 x V0 (indexed load from memory into V1)
  – Scatter: V1 x V0 -> M (indexed store from V1 back to memory)
• Masking instructions: use a mask vector to compress or expand a vector.
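A minimal C sketch of what these instruction types compute element by element, assuming illustrative array names (M is memory, V0 an index vector, V1 a vector register image):

#include <stddef.h>

/* Gather (M -> V1 x V0): load the elements of M indexed by V0 into V1. */
void gather(const double *M, const size_t *V0, double *V1, size_t n) {
    for (size_t i = 0; i < n; i++)
        V1[i] = M[V0[i]];
}

/* Scatter (V1 x V0 -> M): store the elements of V1 into M at positions V0. */
void scatter(double *M, const size_t *V0, const double *V1, size_t n) {
    for (size_t i = 0; i < n; i++)
        M[V0[i]] = V1[i];
}

/* Masking (compress): keep only the elements of V whose mask bit is set;
   returns the compressed length. */
size_t compress(const double *V, const int *mask, double *out, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i])
            out[k++] = V[i];
    return k;
}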
Vector instructions in Cray-like computers
Vector-access memory schemes
• A vector operand is specified by:
  – Base address
  – Stride
  – Length
• The memory access rate should match the pipeline rate so the pipelines are kept busy.
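As a small illustration of the operand specification, element i of a vector operand is found at base + i x stride; the numbers below are only examples:

#include <stdio.h>

int main(void) {
    long base   = 0x1000;  /* base address in bytes (illustrative)              */
    long stride = 8;       /* address increment between elements (8-byte words) */
    long length = 4;       /* number of elements in the vector operand          */

    for (long i = 0; i < length; i++)
        printf("element %ld at address 0x%lx\n", i, base + i * stride);
    return 0;
}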
C-access (concurrent-access) memory organization
• An m-way low-order interleaved memory structure.
• With stride 1, successive elements fall in successive modules and their accesses can be started one minor cycle apart.
• With stride 2, successive accesses are separated by two minor cycles, so only half the modules are used.
• Maximum throughput is m words per memory cycle.
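A short C sketch of how low-order interleaving spreads accesses over the modules, and how the stride interacts with the module count (m = 8 and word addressing are assumptions for the example):

#include <stdio.h>

int main(void) {
    int m = 8;                        /* number of interleaved memory modules */
    int strides[] = {1, 2, 8};

    for (int s = 0; s < 3; s++) {
        printf("stride %d hits modules:", strides[s]);
        for (int i = 0; i < 8; i++)
            printf(" %d", (i * strides[s]) % m);  /* module = word address mod m */
        printf("\n");
    }
    return 0;
}

With stride 1 all eight modules are visited, so up to m words can be in flight per memory cycle; with stride 2 only the even modules are used, and with stride 8 every access collides on module 0.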
Low-order interleaving
S-access (simultaneous-access) memory organization
C/S-access memory organization
• n buses, each with m memory modules attached.
• The n buses operate in parallel (S-access).
• The m modules on each bus are interleaved to allow C-access.
• This is a widely used memory access scheme in vector computers.
NEC SX vector supercomputer
Relative vector/scalar performance
• Amdahl's law, restated for vector processing:
  P = 1 / ((1 - f) + f / r)
  where f is the fraction of code executed in vector mode and r is the vector/scalar speed ratio.
• P gives the speedup of mixed vector/scalar execution over purely scalar processing.
• The hardware speed ratio r is the designer's choice; the vectorization ratio f depends on the program and the compiler.
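A small numeric check of the formula in C, with illustrative values of f and r, showing how the remaining scalar fraction limits the achievable speedup:

#include <stdio.h>

/* P = 1 / ((1 - f) + f / r): f = fraction of work run in vector mode,
   r = vector/scalar speed ratio. */
double vector_speedup(double f, double r) {
    return 1.0 / ((1.0 - f) + f / r);
}

int main(void) {
    printf("f = 0.5, r = 10 -> P = %.2f\n", vector_speedup(0.5, 10.0)); /* about 1.82 */
    printf("f = 0.9, r = 10 -> P = %.2f\n", vector_speedup(0.9, 10.0)); /* about 5.26 */
    return 0;
}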
Performance-directed design goals
• Architectural design goals:
  – Maintaining a good balance between vector and scalar performance
  – Supporting scalability
  – Increasing memory system capacity and performance
  – Providing high-performance I/O and easy access to networks
Balancing the vector/scalar ratio
• Scalar processing is an indispensable part of a general-purpose architecture.
• Vector balance point: the percentage of vector code needed for the vector and scalar hardware to be kept equally busy.
• Vector performance example: with 9 Mflops in vector mode and 1 Mflops in scalar mode, equal time is spent in each mode when the code is 90% vector, i.e., a vector balance point of 0.9.
• I/O and networking performance
  – As supercomputer speeds increase, problem sizes grow, and so does the required I/O bandwidth.
  – Cray systems support I/O transfer rates of up to 100 GBps.
• Memory demand
  – Both latency and bandwidth matter; an effective memory hierarchy is needed.
  – Memory sizes available on chip are increasing rapidly, but the relative speed mismatch between processor and memory remains.
• Scalability
  – Shared memory must be supported as the number of processors and memory ports grows.
  – Constraints: latency and communication overhead.
Table of comparison
Cray Y-MP 816 system organization
Cray C-90 and clusters
Cray MPP systems
• Off-the-shelf components alone are not suitable.
• A balance of speed between processor, memory, and I/O is required.
• Commodity RISC processors lack efficient memory operations such as synchronization and communication.
• All of this led to the introduction of the Cray MPP (massively parallel processing) systems.
• T3D
  – 150 MHz clock; partitions can be configured dynamically to emulate SIMD or MIMD operation.
  – Distributed memory.
  – Mach-based microkernel operating system.
  – Program debugging and performance tools.
Development phases
Fujitsu VP2000
Fujitsu VPP500
Mainframe computers
LINPACK results
Compound vector processing
• CVF (compound vector function): a composite function of vector operations, obtained by converting a looping structure of linked scalar operations.
• Example (scalar loop):

    Do I = 1, N
      Load  R1, X(I)
      Load  R2, Y(I)
      Mul   R1, S
      Add   R2, R1
      Store Y(I), R2
    Continue

• After vectorization:

    M(x : x+N-1) -> V1          (vector load of X)
    M(y : y+N-1) -> V2          (vector load of Y)
    S x V1 -> V1                (vector-scalar multiply)
    V2 + V1 -> V2               (vector add)
    V2 -> M(y : y+N-1)          (vector store of Y)

• The CVF being computed is Y(I) = S x X(I) + Y(I).
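The same CVF written as a plain C loop (names are illustrative); a vectorizing compiler can turn this loop into the five vector instructions above, removing the per-iteration loop overhead:

/* Y(i) = S * X(i) + Y(i) for i = 0 .. n-1 */
void saxpy(long n, double s, const double *x, double *y) {
    for (long i = 0; i < n; i++)
        y[i] = s * x[i] + y[i];
}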
Compound vector functions
• Vector loops and chaining
  – The loop count is determined at compile time or at run time.
• Strip mining: used when a vector is longer than the vector registers.
  – The vector is processed in register-length segments, and the vector registers are not allocated to any other operation until all segments of the current vector have been handled (see the strip-mining sketch after this list).
• Functional-unit independence
  – Vector registers act as interfaces between pipeline stages.
  – Vector registers and functional units must be reserved before a vector chain can be established.
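A strip-mining sketch in C, assuming a maximum vector (register) length of 64 elements; each pass of the outer loop corresponds to one register-length vector operation:

#define MVL 64   /* assumed maximum vector register length */

void saxpy_strip_mined(long n, double s, const double *x, double *y) {
    for (long start = 0; start < n; start += MVL) {
        long len = (n - start < MVL) ? (n - start) : MVL;  /* last strip may be short */
        for (long i = start; i < start + len; i++)         /* one vector op of length len */
            y[i] = s * x[i] + y[i];
    }
}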
Example
Timing diagram
Chaining limitations
• The degree of chaining is limited by:
  – the number of vector operations in the CVF,
  – the number of functional pipeline units available,
  – the number of interfaces between adjacent pipeline stages,
  – how many unary and binary operators the CVF contains, and
  – how many scalar and vector operands are involved.
• Vector recurrence
  – A functional pipeline feeds its output back to its own source register.
  – Example: a component counter.
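A simplified timing model, assuming two pipelines (multiply, then add), one result per clock after a fixed start-up latency, and illustrative latency values, to show what chaining buys:

#include <stdio.h>

/* Without chaining, the add pipeline waits for the full multiply result vector. */
long time_unchained(long n, long s_mul, long s_add) {
    return (s_mul + n) + (s_add + n);
}

/* With chaining, the add pipeline starts as soon as the first product emerges. */
long time_chained(long n, long s_mul, long s_add) {
    return s_mul + s_add + n;
}

int main(void) {
    long n = 64, s_mul = 7, s_add = 6;
    printf("unchained: %ld cycles\n", time_unchained(n, s_mul, s_add)); /* 141 */
    printf("chained:   %ld cycles\n", time_chained(n, s_mul, s_add));   /* 77  */
    return 0;
}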
What is Systolic Computing?
• A set of simple processing elements with local connections, which take external inputs and process them in a predetermined manner, in a pipelined fashion.
Host Station in Systolic Architecture
• As a result of the local-communication scheme, a systolic network is easily extended without adding any burden to the I/O.
• [Figure: systolic array — a row of control units driving processing units connected by a local interconnection network]
• Systolic arrays usually pipe data in from an outside host and pipe the results back to the host.
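A small C simulation of one classical 1-D systolic arrangement, a 3-tap FIR filter in which the weights stay in the PEs while input samples and partial sums march from PE to PE in lockstep (the samples move at half speed through two registers per PE); all values are illustrative:

#include <stdio.h>
#include <string.h>

#define NPE 3   /* number of processing elements = number of filter taps */

int main(void) {
    double w[NPE] = {0.25, 0.5, 0.25};               /* weight held by each PE    */
    double x[] = {1, 2, 3, 4, 5, 0, 0, 0, 0, 0};     /* input stream, zero-padded */
    int nbeats = 10;

    double xr1[NPE] = {0}, xr2[NPE] = {0}, yreg[NPE] = {0};  /* per-PE registers  */

    for (int t = 0; t < nbeats; t++) {               /* one "beat" per time step  */
        double xr1_old[NPE], xr2_old[NPE], yreg_old[NPE];
        memcpy(xr1_old, xr1, sizeof xr1);            /* all PEs fire in lockstep, */
        memcpy(xr2_old, xr2, sizeof xr2);            /* so update from old values */
        memcpy(yreg_old, yreg, sizeof yreg);

        for (int p = 0; p < NPE; p++) {
            double x_in = (p == 0) ? x[t] : xr2_old[p - 1];  /* from left neighbour */
            double y_in = (p == 0) ? 0.0  : yreg_old[p - 1];
            yreg[p] = y_in + w[p] * x_in;            /* local multiply-accumulate */
            xr2[p]  = xr1_old[p];                    /* x passes through two regs */
            xr1[p]  = x_in;
        }
        printf("beat %d: y out = %.2f\n", t, yreg[NPE - 1]);  /* rightmost PE to host */
    }
    return 0;
}

After the array fills, the rightmost PE emits y[k] = 0.25*x[k] + 0.5*x[k-1] + 0.25*x[k-2], one result per beat, using only nearest-neighbour communication.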
Multipipeline networking
• A pipeline net is constructed by interconnecting multiple functional pipelines through a BCN.
• This gives a two-level pipelining architecture: pipelining within each functional unit and pipelining across the network of units.
Program graph transformation
• Rule 1: k delays can be added at any node of the graph, with k delays subtracted from all of that node's incoming edges.
• Rule 2: all edge delays can be multiplied by a scaling constant.
• A graph in which every edge carries a nonzero delay is called a systolic program graph.
SIMD computer organization
• Distributed-memory model
  – Each processing element has its own local memory.
  – Scalar and array (vector) control units issue the instructions.
  – All processing elements are interconnected by a data-routing network.
  – Masking logic selects which PEs participate in an instruction.
Shared-memory model
• An alignment network connects the PEs to the shared memory modules; it must be set properly to avoid access conflicts.
• SIMD instructions
  – All instructions use vector operands of equal length n.
  – Data-routing functions move operands among the PEs (see the masking sketch after this section).
• Host and I/O
  – Control memory
  – Mass storage and graphics display of results
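A minimal C sketch of how masking logic conditions a lockstep SIMD operation, assuming n = 8 PEs and illustrative data; every PE sees the same add instruction, but only PEs whose mask bit is set update their result element:

#include <stdio.h>

#define N 8   /* all SIMD operands have the same length n */

void masked_add(const double *a, const double *b, double *c, const int *mask) {
    for (int pe = 0; pe < N; pe++)   /* conceptually, all PEs execute at once */
        if (mask[pe])
            c[pe] = a[pe] + b[pe];   /* disabled PEs leave c unchanged */
}

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    double c[N] = {0};
    int mask[N] = {1, 0, 1, 0, 1, 0, 1, 0};   /* enable every other PE */

    masked_add(a, b, c, mask);
    for (int i = 0; i < N; i++)
        printf("%.0f ", c[i]);                /* prints: 9 0 9 0 9 0 9 0 */
    printf("\n");
    return 0;
}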
CM-2 architecture
• Front-end computer
• Sequencer
• Modes of communication
  – Broadcasting
  – Global combining
  – Scalar memory bus
• Processing nodes
  – 32 bit-slice processors
  – Floating-point accelerator
  – Bit-slice ALUs
• Hypercube routers
• Applications
MasPar MP-1 architecture
• Array control unit (ACU)
  – A scalar RISC processor; it fetches and decodes the instructions.
  – Uses demand paging.
• PE array
  – 1024 PEs, organized as 64 PE clusters with 16 PEs per cluster.
  – A multistage crossbar interconnection network connects the clusters.
• Parallel disk arrays
