
Parallel Computing

Parallel Computers
Parallel processing/computing:
– at least two processors have to cooperate
– by means of exchanging data
– while working on different parts of one and the same problem

Parallel computers: make use of multiple processors for parallel processing
Speed-up

Basic idea of parallel processing: execution time can be reduced by
employing more than one processor; the larger the number of processors,
the smaller the execution time.

Speed-up:

s(p) = T1 / Tp

T1 -- execution time on one processor
Tp -- execution time on p processors

• Best case: s(p) = p (linear speed-up)
• Worst case: s(p) = 1
• Generally: 1 ≤ s(p) ≤ p

Execution on one processor:

T1 = Tseq + Tpar

Tseq -- execution time of the sequential part
Tpar -- execution time of the parallelisable part

Execution on p processors:

Tp = Tseq + Tpar / p

Combining the two expressions gives s(p) = (Tseq + Tpar) / (Tseq + Tpar/p),
so the speed-up is bounded above by (Tseq + Tpar) / Tseq no matter how many
processors are used (Amdahl's law); see the sketch below.
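The effect of the sequential part is easy to check numerically. A minimal
sketch in Python, assuming made-up illustrative values Tseq = 1 and
Tpar = 99 (these numbers are not from the slides):

# Speed-up under Amdahl's law: T1 = Tseq + Tpar, Tp = Tseq + Tpar / p.
def speedup(t_seq: float, t_par: float, p: int) -> float:
    t1 = t_seq + t_par          # execution time on one processor
    tp = t_seq + t_par / p      # execution time on p processors
    return t1 / tp

for p in (1, 2, 8, 32, 1024):
    print(p, round(speedup(1.0, 99.0, p), 2))
# Prints 1.0, 1.98, 7.48, 24.43, 91.18: the speed-up approaches, but
# never exceeds, (Tseq + Tpar) / Tseq = 100.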
Shared memory parallel computers

• Processors have equally fast access to any location in memory.

Distributed memory parallel computers

• Access to a processor's own memory is faster than access to the
memory of other processors.

Non-Uniform Memory Access (NUMA) Architectures
Performance

• Theoretical peak performance
• Linpack benchmark performance

Theoretical peak performance

• Theoretical peak performance Rpeak -- the maximal number of
arithmetical operations (additions and/or multiplications) a processor
can carry out per second:

Rpeak,1 = f · µpr

where:
– f -- clock frequency
– µpr -- maximum number of operations per clock cycle

• The theoretical peak performance of a parallel computer is equal to
the product of the number of processors and the theoretical peak
performance of one processor:

Rpeak,p = p · Rpeak,1
Examples:

• Cray J32
– f = 100 MHz, µpr = 2, Rpeak,1 = 200 Mflops
– p = 32, Rpeak,32 = 6.4 Gflops

• NEC SX-5
– f = 250 MHz, µpr = 32, Rpeak,1 = 8 Gflops
– p = 16, Rpeak,16 = 128 Gflops
– n = 32 nodes (NUMA), Rpeak,32×16 = 4 Tflops
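These figures follow directly from Rpeak,p = p · f · µpr. A quick sketch
to reproduce them, using the values from the slide:

# Theoretical peak performance Rpeak,p = p * f * mu_pr (flops).
def peak_flops(f_hz: float, mu_pr: int, p: int) -> float:
    return p * f_hz * mu_pr

print(peak_flops(100e6, 2, 32))        # Cray J32: 6.4e9 = 6.4 Gflops
print(peak_flops(250e6, 32, 16))       # NEC SX-5: 1.28e11 = 128 Gflops
print(peak_flops(250e6, 32, 16) * 32)  # 32 NUMA nodes: ~4.1e12 ~ 4 Tflops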
Benchmark performance

• Benchmark:
– a program for a specific problem
– the number of operations which are executed is known
– used to measure the run time in single-user mode
– to determine the benchmark performance (operations per second)
Linpack benchmark

• Linpack -- a popular library of Fortran subroutines for the numerical
solution of linear algebra problems

• Linpack benchmark -- based on one particular subroutine which is used
for the solution of a dense system of linear equations
– algorithm: LU factorization by Gaussian elimination with partial pivoting
– number of operations: 2n³/3 + O(n²) (n -- number of unknowns)

• Top-500 list of most powerful computer installations:
http://www.top500.org/
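Benchmark performance is then just the known operation count divided by
the measured run time. A minimal sketch; the problem size and run time
below are made-up illustrative values:

# Linpack-style rate: known flop count 2n^3/3 divided by the run time.
def linpack_rate(n: int, t_seconds: float) -> float:
    flops = 2 * n**3 / 3        # leading term of the operation count
    return flops / t_seconds    # operations per second

print(linpack_rate(10_000, 5.0) / 1e9)  # ~133.3 Gflops for a 5 s solve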
Interconnection structures for parallel computers

Bisection or cross-section bandwidth

• Definition: the effective rate at which one half of the processing
nodes can send data to the other half (for the worst-case division of
the processors); a brute-force check for small n is sketched below.
• It does not scale linearly with the number of processing nodes in
most interconnection schemes.
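For small node counts the worst-case division can be found by brute
force. A sketch that counts cut links (link counts, not data rates; the
topology builders are illustrative):

# Brute-force bisection width: the minimum number of links cut over all
# divisions of the nodes into two equal halves. Feasible only for small n.
from itertools import combinations

def bisection_width(n, edges):
    best = None
    for half in combinations(range(n), n // 2):
        side = set(half)
        cut = sum(1 for u, v in edges if (u in side) != (v in side))
        best = cut if best is None else min(best, cut)
    return best

ring = [(i, (i + 1) % 8) for i in range(8)]
complete = [(i, j) for i in range(8) for j in range(i + 1, 8)]
print(bisection_width(8, ring))      # 2: a ring is cut in two places
print(bisection_width(8, complete))  # 16 = (n/2)^2 crossing links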
Complete communication graph

• The bisection bandwidth grows in proportion to the number of nodes.
• The number of edges: n(n-1)/2

Bus

• The bisection bandwidth of the system is constant and equal to the
bandwidth of the bus.
• Simple software and hardware.

Crossbar switch

• Bisection bandwidth scales with the number of processing nodes.
• Total number of communication network ports -- Θ(n)
• Number of links -- Θ(n²)
• In practice the crossbar switch is used only to interconnect a
relatively small number of processors.
Multistage switching networks

• A series of switches grouped in stages realizes the connection
between pairs of inputs and outputs.
• Can be organized in many different topologies fitted to particular
applications.
• Number of links -- Θ(n log(n))
• Bisection bandwidth -- Θ(n)

Example - Beneš network
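A Beneš network on n = 2^k terminals consists of 2k − 1 stages, each of
n/2 two-by-two switches; this construction detail is standard background
rather than something stated on the slides. A small counting sketch:

# Size of a Benes network on n = 2**k terminals: 2k - 1 stages of
# n/2 two-by-two switches each (standard construction, assumed here).
import math

def benes_size(n: int):
    k = int(math.log2(n))
    assert 2**k == n, "n must be a power of two"
    stages = 2 * k - 1
    return stages, stages * n // 2   # (stages, total switches)

print(benes_size(8))    # (5, 20)
print(benes_size(64))   # (11, 352)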
Regular grids: 1-D arrays

• Linear processor array and ring.
• Bisection bandwidth -- Ω(1)
• Remote communication -- O(n)

Regular grids: 2-D arrays

• 2-D mesh.
• Torus.
• Twisted torus.
• Remote communication needs time O(n^(1/2)): the diameter of a
√n × √n mesh is 2(√n − 1).
• Bisection bandwidth -- Ω(n^(1/2)): cutting a √n × √n mesh in half
severs √n links.

A two-dimensional mesh
Regular grids: 3-D arrays

• Remote communication -- O(n^(1/3))
• Bisection bandwidth -- Ω(n^(2/3))

Example: Cray T3E -- 10 × 10 × 10 grid

Trees

Binary tree

• Remote communication -- O(log(n)).
• Trees fit well the communication requirements of reduction operations
and of a number of optimal algorithms based on divide-and-conquer
techniques (see the sketch after this list).
• Less suited for regular data array redistribution operations.
• The decreasing aggregate bandwidth of a tree network in its upper
levels, in particular around the root, presents a severe bottleneck for
massive communication: the bisection bandwidth of a binary tree is only
O(1), since cutting a single link near the root splits the machine in half.
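Why trees suit reductions: n partial results can be combined pairwise in
ceil(log2(n)) rounds, halving the number of active values each round. A
minimal sketch with illustrative names:

# Pairwise (tree) reduction: n values are combined in ceil(log2 n) rounds.
def tree_reduce(values, op):
    vals = list(values)
    while len(vals) > 1:
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:            # an odd value passes through unchanged
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

print(tree_reduce(range(8), lambda a, b: a + b))  # 28, computed in 3 rounds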
Fat tree

• The aggregate bandwidth of a fat tree network is kept constant at all
levels of the tree.
Binary Hypercubes

• A binary hypercube of degree d consists of n = 2^d nodes labeled by
distinct d-bit binary numbers.
• Two nodes are connected by an edge iff their respective labels differ
in exactly one bit position (see the sketch below).
• Number of links -- O(n log(n))
• Bisection bandwidth scales in proportion to the number of nodes.
• Remote communication -- O(log(n))
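The one-bit-difference rule makes adjacency a simple XOR test. A minimal
sketch (function names are illustrative):

# Hypercube adjacency: nodes u and v are neighbours iff u XOR v has
# exactly one bit set, i.e. it is a power of two.
def adjacent(u: int, v: int) -> bool:
    x = u ^ v
    return x != 0 and (x & (x - 1)) == 0

def neighbours(u: int, d: int):
    return [u ^ (1 << i) for i in range(d)]   # flip each of the d bits

print(neighbours(0b000, 3))    # [1, 2, 4] -- the 3 neighbours of node 0
print(adjacent(0b011, 0b111))  # True: the labels differ in one bit only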


Examples of binary hypercubes
