SIMD Array Processors
• SIMD array processors were developed to
perform parallel computations on vector or
matrix types of data.
• Parallel processing algorithms have been
developed by many computer scientists for
SIMD computers.
• Important SIMD algorithms can be used to
perform:
Matrix multiplication,
Fast Fourier transform (FFT),
Matrix transposition,
Summation of vector elements,
Matrix inversion,
Parallel sorting,
Linear recurrence,
Boolean matrix operations, and
Solution of partial differential equations.
• The implementation of parallel algorithms on
SIMD machines is described using concurrent
ALGOL.
• The physical memory allocations and program
implementation depend on the specific
architecture of a given SIMD machine.
SIMD Matrix Multiplication
• Cumulative multiplication refers to the linked
multiply-add operation c ← c + a × b.
• The addition is merged into the multiplication
because the multiply is equivalent to multi-
operand addition.
• Therefore, unit time is the time required to
perform one cumulative multiplication.
Example: O(n²) algorithm for
SIMD matrix multiplication
• It should be noted that the vector load
operation is performed to initialize the row
vectors of matrix C one row at a time.
• In the vector multiply operation, the same
multiplier aij is broadcast from the CU to all PEs
to multiply all n elements {bjk for k = 1, 2, ..., n}
of the jth row vector of B.
• In total, n² vector multiply operations are
needed in the double loops.
• Each vector multiply instruction performs n
parallel scalar multiplications, one in each PE,
in each of the n² iterations.
• This algorithm is implementable on an array of
n PEs.
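A minimal Python sketch of this O(n²) algorithm (the simulation and the function name simd_matmul are illustrative assumptions, not from the text; the per-PE parallelism of a vector multiply is modelled by the innermost loop):

```python
def simd_matmul(A, B):
    """Simulate the O(n^2) SIMD algorithm: PE k owns column k of B and C."""
    n = len(A)
    C = []
    for i in range(n):
        row_c = [0] * n              # vector load: clear row i of C in all PEs
        for j in range(n):
            a_ij = A[i][j]           # CU broadcasts the scalar multiplier a_ij
            # vector multiply: every PE k does c_ik <- c_ik + a_ij * b_jk;
            # on real hardware this loop is one parallel step
            for k in range(n):
                row_c[k] += a_ij * B[j][k]
        C.append(row_c)
    return C
```

Counting each broadcast-and-vector-multiply as unit time, only the two outer loops remain, which is where the O(n²) step count comes from.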
Memory allocation
• Implementation of matrix multiplication on a
SIMD computer with n PEs.
• The algorithm construct depends heavily on the
memory allocation of the A, B and C matrices in
the PEMs.
• Each row vector of a matrix is stored across
the PEMs.
• Column vectors are then stored within the
same PEM.
• This memory allocation scheme allows parallel
access of all the elements in each row vector of
the matrices.
• Based on this data distribution, we obtain the
O(n²) matrix multiplication parallel algorithm.
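A small Python sketch of this allocation (the PEM contents are simulated as plain lists; all names are illustrative): PEM m keeps the m-th element of every row vector, so a column vector lives inside a single PEM while a row vector is fetched in one parallel access across all n PEMs.

```python
n = 4
A = [[i * n + j for j in range(n)] for i in range(n)]   # sample matrix

# pems[m][i] is the element of row i of A stored in PEM m,
# i.e. PEM m holds column m of A.
pems = [[A[i][m] for i in range(n)] for m in range(n)]

def fetch_row(pems, i):
    # one parallel access: each PEM contributes one element of row i
    return [pem[i] for pem in pems]
```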
• The two parallel do operations correspond to
vector load for initialization and vector
multiply for the inner loop of additive
multiplications.
• The time complexity has been reduced to
O(n²).
• Therefore, the SIMD algorithm is n times
faster than the O(n³) SISD algorithm for matrix
multiplication.
• The successive memory contents in the
execution of the SIMD matrix multiplication
program are shown in the accompanying figure.
Parallel sorting on array processors
• A SIMD algorithm is to be presented for sorting
n² elements on a mesh-connected processor
array in O(n) routing and comparison steps.
• This shows a speedup of O(log₂n) over the best
sorting algorithm, which takes O(n log₂n) steps
on a uniprocessor system.
• Assume an array processor with N = n² identical
PEs interconnected by a mesh network similar
to the Illiac-IV, except that the PEs at the
perimeter have two or three rather than four
neighbours.
• That is, there are no wraparound connections
in this simplified mesh network.
• Eliminating the wraparound connections
simplifies the array sorting algorithm.
• The time complexity of the array sorting
algorithm would be affected by at most a
factor of two if the wraparound connections
were included.
Two time measures are needed to estimate
the time complexity of the parallel sorting
algorithm.
1. Routing time, tR
2. Comparison time, tC
• Let tR be the routing time required to move
one item from a PE to one of its neighbours,
and tC be the comparison time required for
one comparison step.
• Concurrent data routing is allowed, and up to
N comparisons may be performed
simultaneously.
• This means that a comparison-interchange step
between two items in adjacent PEs can be
done in 2tR + tC time units (route left, compare
and route right).
• A mixture of horizontal and vertical
comparison interchanges requires at least
4tR + tC time units.
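As an illustrative aside (this linear-array sorter is not the mesh algorithm of this section, just the simplest sorter built from the comparison-interchange primitive), odd-even transposition sort performs n such steps, each costing 2tR + tC:

```python
def odd_even_transposition_sort(items):
    """Sort on a simulated linear array of PEs using only
    comparison-interchange steps between adjacent neighbours."""
    a = list(items)
    n = len(a)
    steps = 0
    for phase in range(n):
        start = phase % 2                    # alternate even/odd pairs
        for i in range(start, n - 1, 2):     # disjoint pairs run in parallel
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
        steps += 1                           # one comparison-interchange step
    return a, steps
```

Each of the n phases is one parallel comparison-interchange step, so the total time on a linear array is n(2tR + tC).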
• The sorting problem depends on the indexing
scheme of the PEs.
• The PEs may be indexed by a bijection from
{1, 2, ..., n} × {1, 2, ..., n} to {0, 1, ..., N−1}, where N = n².
• The choice of a particular indexing scheme
depends upon how the sorted elements will
be used.
• The longest routing path on the mesh in a
sorting process is the transposition of two
elements initially loaded at opposite corner
PEs.
• This transposition needs at least 4(n−1) routing
steps.
Batcher's odd-even merge sort of two sorted
sequences on a set of linearly connected PEs
• The shuffle and unshuffle operations can each
be implemented with a sequence of
interchange operations (marked by the
double-arrows).
• Both the perfect shuffle and its inverse
(unshuffle) can be done in k − 1 interchanges,
or 2(k − 1) routing steps, on a linear array of 2k
PEs.
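A Python sketch of this interchange schedule (0-indexed positions; the triangular swap pattern below is an assumption consistent with the k − 1 step count quoted above, not a schedule taken from the text):

```python
def shuffle_by_interchanges(a):
    """Perfect shuffle of 2k items on a simulated linear array,
    using k - 1 parallel interchange steps of adjacent swaps."""
    a = list(a)
    k = len(a) // 2
    steps = 0
    for s in range(1, k):
        # the s swaps in this step touch disjoint PE pairs,
        # so a real array performs them simultaneously
        for j in range(s):
            p = k - s + 2 * j
            a[p], a[p + 1] = a[p + 1], a[p]
        steps += 1                 # one interchange = 2 routing steps
    return a, steps
```

For k = 4, the eight items (a1..a4, b1..b4) are interleaved into (a1, b1, a2, b2, a3, b3, a4, b4) in 3 interchange steps, i.e. 6 routing steps.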
Example: M(j,2) sorting algorithm
• Given two sorted columns of length j ≥ 2, the
M(j,2) algorithm consists of the following
steps:
The M(j,2) algorithm is illustrated for an
M(4,2) sort.
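Since the step list itself is not reproduced here, the following is a hedged Python sketch of the underlying Batcher odd-even merge of two sorted sequences (recursive formulation; equal power-of-two lengths are assumed for simplicity):

```python
def odd_even_merge(x, y):
    """Batcher's odd-even merge of two sorted sequences of equal
    power-of-two length."""
    if len(x) == 1:
        return [min(x[0], y[0]), max(x[0], y[0])]
    # merge the odd-indexed and even-indexed subsequences recursively
    odd = odd_even_merge(x[::2], y[::2])
    even = odd_even_merge(x[1::2], y[1::2])
    merged = [0] * (len(x) + len(y))
    merged[::2] = odd                 # shuffle the two partial results
    merged[1::2] = even
    # final sweep of comparison-interchanges between neighbours
    for i in range(1, len(merged) - 1, 2):
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged
```

The recursion mirrors the structure sketched above: unshuffle into odd and even subsequences, merge each half, shuffle them back together, and finish with one round of comparison-interchanges.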
Connection Issues for SIMD
Processing
• SIMD array processors allow explicit expression
of parallelism in user programs.
• The compiler detects the parallelism and
generates object code suitable for execution in
the multiple processing elements and the
control unit.
• Program segments that cannot be converted
into parallel executable forms are executed in
the control unit.
• Program segments that can be converted into
parallel executable forms are sent to the PEs
and executed synchronously on data fetched
from parallel memory modules under the
control of the control unit.
• To enable synchronous manipulation in the
PEs, the data is permuted and arranged in
vector form.
• Thus, to run a program more efficiently on an
array processor, one must develop a
technique for vectorizing the program codes.
• The interconnection network plays a major
role in vectorization.
• Several connection issues in using SIMD
interconnection networks are:
1. Permutation and connectivity
2. Partitioning and reconfigurability
3. Reliability and bandwidth