ACA Unit 1
Module I
Parallel Processing
Jyoti Kumari
1
Parallel Processing
Definition:
• Parallel processing can be described as a class of techniques which
enables the system to achieve simultaneous data-processing tasks
to increase the computational speed of a computer system.
• A parallel processing system can carry out simultaneous
data-processing to achieve faster execution time.
• For instance, while an instruction is being processed in the ALU
component of the CPU, the next instruction can be read from
memory.
Why?
• It enhances the computer’s processing capability
• increases its throughput, i.e. the amount of processing that can be
accomplished during a given interval of time.
2
Parallel Computer Models
3
Parallel/Vector Computers
• Parallel computers are those that execute programs in MIMD mode.
• There are two major classes of parallel computers:
Shared-memory multiprocessors
Message passing multicomputers
• The major distinction between multiprocessors and multicomputers lies in memory sharing
and the mechanisms used for interprocessor communication.
• A vector processor is equipped with multiple vector pipelines that can be concurrently used
under hardware or firmware control.
There are two families of pipelined vector processors:
Memory-to-memory:
This architecture supports the pipelined flow of vector operands directly from the memory
to pipelines and then back to the memory.
Register-to-register:
This architecture uses vector registers to interface between the memory and functional
pipelines.
4
System Attributes to Performance
• The ideal performance of a computer system demands a perfect match between machine
capability and program behaviour.
• Machine capability can be enhanced with better hardware technology, innovative architectural
features, and efficient resources management.
• Program behaviour is difficult to predict due to its heavy dependence on application and
run-time conditions.
• Other factors affecting program behaviour include:
algorithm design, data structures,
language efficiency, programmer skill,
and compiler technology.
• These attributes/performance indicators guide system architects in designing better machines
and help programmers and compiler writers optimize code for more efficient execution by the
hardware.
5
• Computer architects have come up with a variety of metrics to describe the computer
performance:
Clock rate and CPI :
• Since I/O and system overheads frequently overlap processing by other programs, it is fair to
consider only the CPU time used by a program; the user CPU time is the most important factor.
• The CPU is driven by a clock with a constant cycle time τ (usually measured in nanoseconds),
which controls the rate of internal operations in the CPU.
• The inverse of the cycle time is the clock rate (f = 1/τ, measured in megahertz).
• A shorter clock cycle time, or equivalently a larger number of cycles per second, implies more
operations can be performed per unit time.
• The size of the program is determined by the instruction count (Ic), the number of machine
instructions to be executed by the program.
• Different machine instructions require different numbers of clock cycles to execute. CPI
(cycles per instruction) is thus an important parameter.
6
• Performance Factors:
Let Ic be the number of instructions in a given program, or the
instruction count.
The CPU time ( T in seconds/program) needed to execute the program is
estimated by finding the product of three contributing factors:
T = Ic * CPI * τ (1)
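• As a quick illustration (the numbers here are hypothetical, not from the text): for a program
with Ic = 2,000,000 instructions, an average CPI of 1.5, and a cycle time τ = 2.5 ns
(f = 400 MHz), Eq. 1 gives T = 2,000,000 × 1.5 × 2.5 ns = 7.5 ms.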
7
• The time required to access memory is called the memory cycle, which is
usually k times the processor cycle time τ.
• The value of k depends on the memory technology and the processor-memory
interconnection scheme.
• The processor cycles required for each instruction (CPI) can be attributed to
– cycles needed for instruction decode and execution (p),
– and cycles needed for memory references (m* k).
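• Combining this with Eq. 1, the CPU time can be rewritten (in the standard formulation) as
T = Ic × (p + m × k) × τ, where m is the number of memory references per instruction.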
8
System Attributes:
• The five performance factors (Ic, p, m, k, τ) are influenced by four system
attributes:
Instruction-set architecture: affects Ic and p
Compiler technology: affects Ic, p, and m
CPU implementation and control: determines the total processor time needed (p × τ)
Cache and memory hierarchy: affects the memory access latency (k × τ)
The attributes are shown in Table 1.2.
9
10
MIPS Rate:
• The processor speed is often measured in millions of instructions per second (MIPS), i.e. the
MIPS rate of a given processor.
• The MIPS rate is obtained by dividing the number of instructions executed in a running program
by the time required to run the program (and scaling by 10^6).
• Let C be the total number of clock cycles needed to execute a given program.
• Then the CPU time in Eq. 1 can be estimated as T = C × τ = C/f.
• Also, CPI = C/Ic, so T = Ic × CPI × τ = Ic × CPI/f.
• MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6)        (3)
• Based on Eq. 3, the CPU time in Eq. 1 can also be written as T = Ic × 10^-6 / (MIPS rate).
• The MIPS rate is directly proportional to the clock rate (f) and inversely proportional to the CPI.
• All four systems attributes (instruction set, compiler, processor, and memory technologies) affect
the MIPS rate.
11
Floating Point Operations per Second
• Most compute-intensive applications in science and engineering make heavy use of floating
point operations.
• Compared to instructions per second, for such applications a more relevant measure of
performance is floating point operations per second, which is abbreviated as flops.
• With the prefixes mega (10^6), giga (10^9), tera (10^12) and peta (10^15), this is written as
megaflops (mflops), gigaflops (gflops), teraflops or petaflops.
Throughput Rate
The number of programs a system can execute per unit time is called the system throughput Ws (in
programs/second).
In a multiprogrammed system, the system throughput is often lower than the CPU throughput
Wp, defined by:
Wp = f / (Ic* CPI)
(4)
• Note that Wp = (MIPS rate) × 10^6 / Ic, from Eq. 3.
• The unit for Wp is programs/second.
• The CPU throughput is a measure of how many programs can be executed per second, based
only on the MIPS rate and average program length (Ic).
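• A minimal sketch tying Eqs. 1, 3 and 4 together (the numeric values below are hypothetical,
chosen only for illustration):

    # hypothetical values for illustration only
    f = 400e6        # clock rate in Hz (400 MHz)
    Ic = 2_000_000   # instruction count of the program
    CPI = 1.5        # average cycles per instruction

    T = Ic * CPI / f          # Eq. 1: CPU time in seconds      -> 0.0075 s
    mips = f / (CPI * 1e6)    # Eq. 3: MIPS rate                -> 266.67
    Wp = f / (Ic * CPI)       # Eq. 4: CPU throughput (prog/s)  -> 133.33
    print(T, mips, Wp)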
12
Implicit Parallelism
• An implicit approach uses a
conventional language, such as C, C++,
Fortran, or Pascal, to write the source
program.
• The sequentially coded source program
is translated into parallel object code by
a parallelizing compiler. As illustrated in
Fig. 1.a, this compiler must be able to
detect parallelism and assign target
machine resources.
• This compiler approach has been
applied in programming shared-memory
multiprocessors. With parallelism being
implicit, success relies heavily on the
“intelligence” of a parallelizing compiler.
• This approach requires less effort on
the part of the programmer.
Fig. 1.a.
13
• Explicit Parallelism This
approach (Fig. 1.b.) requires more
effort by the programmer to
develop a source program using
parallel dialects of C, C++,
Fortran, or Pascal.
• Parallelism is explicitly specified
in the user programs. This reduces
the burden on the compiler to
detect parallelism.
• Instead, the compiler needs to
preserve parallelism and, where
possible, assign target machine
resources.
Fig. 1.b.
14
Multiprocessors and Multicomputers
• Shared-Memory Multiprocessors
– UMA Model
– NUMA Model
– COMA Model
– Discussed in Flynn's Classification PPT
• Distributed-Memory Multicomputers
15
Multivector and SIMD computers
Vector Supercomputers
• A vector computer is often built on
top of a scalar processor.
• As shown in Fig. 2., the vector
processor is attached to the scalar
processor as an optional feature.
• Program and data are first loaded
into the main memory through a host
computer.
• All instructions are first decoded by
the scalar control unit.
• If the decoded instruction is a scalar
operation or a program control
operation, it will be directly executed
by the scalar processor using the
scalar functional pipelines.
Fig. 2. Architecture of a vector processor
16
• If the instruction is decoded as a vector operation, it will be sent to the
vector control unit.
• This control unit will supervise the flow of vector data between the main memory
and vector functional pipelines.
• The vector data flow is coordinated by the control unit. A number of
vector functional pipelines may be built into a vector processor.
• Two pipelined vector supercomputer models are described below.
• Vector processor models:
– Register-to-register architecture
– Memory-to-memory architecture
Register-to-register
• Vector registers are used to hold the vector operands, intermediate and final vector
results.
• The vector functional pipelines retrieve operands from and put results into
the vector registers. All vector registers are programmable in user
instructions.
• Each vector register is equipped with a component counter which keeps track of the
component registers used in successive pipeline cycles.
17
• The length of each vector register is usually fixed, say, sixty-four 64-bit
component registers in a vector register in a Cray Series supercomputer.
• Other machines, like the Fujitsu VP2000 Series, use reconfigurable vector
registers to dynamically match the register length with that of the vector
operands.
• In general, there are fixed numbers of vector registers and functional pipelines
in a vector processor. Therefore, both resources must be reserved in advance to
avoid resource conflicts between different vector operations.
18
SIMD Supercomputers
An operational model of an SIMD computer is specified by a 5-tuple M = (N, C, I, M, R),
where –
1. N is the number of processing elements (PEs) in
the machine.
For example, the Illiac IV had 64 PEs and the
Connection Machine CM-2 had 65,536 PEs.
2. C is the set of instructions directly executed
by the control unit (CU), including scalar and
program flow control instructions.
3. I is the set of instructions broadcast by the CU to
all PEs for parallel execution. These include
arithmetic, logic, data routing, masking, and
other local operations executed by each active PE
over data within that PE.
4. M is the set of masking schemes, where each
mask partitions the set of PEs into enabled and
disabled subsets.
5. R is the set of data-routing functions, specifying
various patterns to be set up in the
interconnection network for inter-PE
communications.
Fig. 3. Operational model of SIMD
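• A toy sketch of this operational model (the names and data below are hypothetical, purely for
illustration): the CU broadcasts one instruction per step, and only the PEs enabled by the
current mask apply it to their local data.

    # one hypothetical SIMD step: broadcast an operation to all enabled PEs
    def simd_step(local_data, op, mask):
        # local_data: one value per PE; mask: True means the PE is enabled
        return [op(x) if enabled else x
                for x, enabled in zip(local_data, mask)]

    data = [1, 2, 3, 4]                  # 4 PEs, each holding one local element
    mask = [True, False, True, True]     # the second PE is disabled this step
    data = simd_step(data, lambda x: x * 10, mask)
    print(data)                          # [10, 2, 30, 40]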
19
PRAM and VLSI Models
Parallel Random Access Machines (PRAM) Models
• Theoretical model
• These models are often used by algorithm designers and VLSI device/chip developers.
• The ideal models provide a convenient framework for developing parallel algorithms
without worrying about the implementation details or physical constraints.
20
• The time complexity of an algorithm is a function of the problem size s, counting the number of
operational steps the algorithm requires; the space complexity can be similarly defined as a
function of s.
• The asymptotic space complexity refers to the data storage of large problems.
• Note that the program (code) storage requirement and the storage for input data
are not considered in this.
• The time complexity of a serial algorithm is simply called serial complexity.
• The time complexity of a parallel algorithm is called parallel complexity.
• Intuitively, the parallel complexity should be lower than the serial
complexity, at least asymptotically.
• We consider only deterministic algorithms, in which every operational step is
uniquely defined in agreement with the way programs are executed on real
computers.
• A nondeterministic algorithm contains operations resulting in one outcome from a
set of possible outcomes. There exist no real computers that can execute
nondeterministic algorithms.
21
NP-Completeness
• An algorithm has polynomial complexity if there exists a polynomial p(s)
such that the time complexity is O(p(s)) for problem size s.
• The set of problems having polynomial-complexity algorithms is called
P-class (for polynomial class).
• The set of problems solvable by nondeterministic algorithms in
polynomial time is called NP-class (for non deterministic polynomial
class).
• Since deterministic algorithms are special cases of the nondeterministic
ones, we know that P is a subset of NP.
• The P-class problems are computationally tractable, while the NP - P-class
problems are intractable. But we do not know whether P = NP or P != NP.
This is still an open problem in computer science.
• To simulate a nondeterministic algorithm with a deterministic algorithm
may require exponential time. Therefore, intractable NP-class problems are
also said to have exponential-time complexity.
22
PRAM Models
• A parallel random access machine (PRAM) is
a model of idealized parallel
computers with zero synchronization or
memory access overhead.
• This PRAM model will be used for parallel
algorithm development and for scalability and
complexity analysis.
• An n-processor PRAM (Fig. 4) has a globally
addressable memory.
• The shared memory can be distributed among
the processors or centralized in one place.
• The n processors—also called processing
elements (PEs)—operate on a synchronized
read-memory, compute, and write-memory
cycle.
• With shared memory, the model must specify
how concurrent read and concurrent write of
memory are handled.
Fig. 4. PRAM
23
Four memory-update options are possible:
• Exclusive read (ER): This allows at most one processor to read from any memory
location in each cycle, a rather restrictive policy.
• Exclusive write (EW): This allows at most one processor to write into a
memory location at a time.
• Concurrent read (CR): This allows multiple processors to read the same information
from the same memory cell in the same cycle.
• Concurrent write (CW): This allows simultaneous writes to the same memory
location. In order to avoid confusion, some policy must be set up to resolve the
write conflicts.
24
• Since CR does not create a conflict problem, various PRAM variants differ mainly in how
they handle the CW conflicts.
PRAM variant:
• Described below are four variants of the PRAM model, depending on how the memory reads
and writes are handled.
1. EREW-PRAM model—This model forbids more than one processor from reading or writing
the same memory cell simultaneously. This is the most restrictive PRAM model proposed.
2. CREW-PRAM Model—The write conflicts are avoided by mutual exclusion. Concurrent
reads to the same memory location are allowed.
3. ERCW-PRAM model—This allows exclusive read or concurrent writes to the same memory
location.
4. CRCW-PRAM model—This model allows either concurrent reads or concurrent writes to
the same memory location.
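• A small sketch (illustrative only, with hypothetical helper names) of how one PRAM cycle could
be checked against the read/write rules of each variant:

    # check one PRAM cycle against a variant's access policy (illustrative only)
    def check_cycle(reads, writes, variant):
        # reads/writes: memory addresses accessed by the processors this cycle
        def exclusive(addrs):
            return len(addrs) == len(set(addrs))   # no address touched twice
        rules = {
            "EREW": exclusive(reads) and exclusive(writes),
            "CREW": exclusive(writes),   # concurrent reads allowed
            "ERCW": exclusive(reads),    # concurrent writes allowed
            "CRCW": True,                # both allowed; a write-conflict policy is still needed
        }
        return rules[variant]

    # two processors read address 7 in the same cycle: legal for CREW, not for EREW
    print(check_cycle(reads=[7, 7], writes=[1, 2], variant="EREW"))  # False
    print(check_cycle(reads=[7, 7], writes=[1, 2], variant="CREW"))  # True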
25
1. CPI = total clock cycle count / total instruction count (Ic)
   = ((450000 × 1) + (320000 × 2) + (150000 × 2) + (80000 × 2)) / 1000000
   = 1550000 / 1000000
   = 1.55
2. MIPS rate = Ic / (T × 10^6)
   = Ic / (Ic × CPI × τ × 10^6)
   = 1 / (CPI × (1/f) × 10^6)
   = f / (CPI × 10^6)
   With f = 400 MHz and CPI = 1.55:
   MIPS rate = (400 × 10^6) / (1.55 × 10^6) = 258.06
3. Execution time T = Ic × CPI × (1/f)
   = (1000000 × 1.55) / (400 × 10^6) = 3.875 ms
26
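• A short sketch that reproduces the arithmetic above (using the per-type instruction counts and
cycle counts given in this example, and the 400 MHz clock):

    # instruction mix from the example: (count, cycles per instruction)
    mix = [(450000, 1), (320000, 2), (150000, 2), (80000, 2)]
    f = 400e6                                  # 400 MHz clock

    Ic = sum(count for count, _ in mix)        # 1,000,000 instructions
    cycles = sum(count * c for count, c in mix)
    CPI = cycles / Ic                          # 1.55
    mips = f / (CPI * 1e6)                     # ~258.06
    T = Ic * CPI / f                           # ~0.003875 s
    print(CPI, round(mips, 2), T)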
a. What is the relative size of the instruction count of the machine code for this benchmark
   program running on the two machines?
b. What are the CPI values for the two machines?

a. The MIPS rate can be used to compute the instruction count:
   MIPS rate = Ic / (T × 10^6), so Ic = T × (MIPS rate) × 10^6
   Computing the ratio of the instruction count of S2 to S1:
   [x × 1800] / [12x × 100] = 1800x / 1200x = 1.5
b. CPI = f / (MIPS rate × 10^6)
   For S1, CPI = (500 MHz) / (100 MIPS) = 5
   For S2, CPI = (2.5 GHz) / (1800 MIPS) ≈ 1.4
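• A brief sketch of the same derivation (machine parameters taken from the solution above; the
run times are the symbolic x and 12x, so only their ratio matters):

    def instr_count(mips, T):
        return mips * 1e6 * T          # Ic = MIPS rate * 10^6 * T

    def cpi(f, mips):
        return f / (mips * 1e6)        # CPI = f / (MIPS rate * 10^6)

    f1, mips1, T1 = 500e6, 100, 12.0   # S1: 500 MHz, 100 MIPS, run time 12x
    f2, mips2, T2 = 2.5e9, 1800, 1.0   # S2: 2.5 GHz, 1800 MIPS, run time x

    print(instr_count(mips2, T2) / instr_count(mips1, T1))  # 1.5
    print(cpi(f1, mips1), cpi(f2, mips2))                   # 5.0, ~1.39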
27
Level of Parallelism
Hardware and Software Parallelism
Implementing parallelism requires special hardware and software support. Beyond the theoretical
conditions for parallelism, joint efforts between hardware designers and software programmers
are needed to exploit parallelism and improve computer performance.
Hardware Parallelism
• This refers to the type of parallelism defined by the machine architecture and hardware
multiplicity.
• It is a function of cost and performance tradeoffs.
• It displays the resource utilization patterns of simultaneously executable operations.
• It also indicates the peak performance of the processor resources.
One way to characterize the parallelism in a processor is by the number of instruction issues per
machine cycle.
• If a processor issues k instructions per machine cycle, then it is called a k-issue processor.
• A conventional pipelined processor takes one machine cycle to issue a single instruction. These
types of processors are called one-issue machines, with a single instruction pipeline in the processor.
28
• In a modern processor, two or more instructions can be issued per machine cycle.
• For example, the Intel i960CA was a three-issue processor with one arithmetic, one memory
access, and one branch instruction issued per cycle.
• The IBM RISC/System 6000 is a four-issue processor capable of issuing one arithmetic, one
memory access, one floating-point, and one branch operation per cycle.
Software parallelism
• This type of parallelism is revealed in the program profile or in the program flow graph.
• It is a function of algorithm, programming style, and program design.
• The program flow graph displays the patterns of simultaneously executable operations.
29
Mismatch between software parallelism
and hardware parallelism
• There are eight instructions (four
loads and four arithmetic operations)
to be executed in three consecutive
machine cycles as shown in Fig. 5.a.
• Four load operations are performed
in the first cycle, followed by two
multiply operations in the second
cycle and two add/subtract
operations in the third cycle.
• Therefore, the parallelism varies
from 4 to 2 in three cycles.
• The average software parallelism is
equal to 8/3 = 2.67 instructions per
cycle in this example.
Fig. 5.a.
30
Consider execution of the same
program by a two-issue processor
which can execute one memory
access (load or write) and one
arithmetic (add, subtract, multiply
etc.) operation simultaneously.
31
• Let us try to match the software
parallelism shown in Fig. 5.a in a
hardware platform of a dual-processor
system, where single-issue processors
are used.
• The achievable hardware parallelism is
shown in Fig. 6, where L/S stands for
load/store operations.
• Note that six processor cycles are needed
to execute the 12 instructions by two
processors.
• S1 and S2 are two inserted store
operations, and l5 and l6 are two
inserted load operations. These added
instructions are needed for
interprocessor communication through
the shared memory.
Fig. 6. Dual Processor Execution of Program in
Fig. 5.a.
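• From the numbers above, the parallelism actually exploited works out to 12 instructions /
6 cycles = 2 instructions per cycle across the two processors.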
32
Types of Software Parallelism
33
a. Avg. CPI = total cycles / Ic
   A & L  = 60% × 2 × 10^6 = 1,200,000 instructions, 1 cycle each
   Load   = 18% × 2 × 10^6 = 360,000 instructions, 2 cycles each
   Branch = 12% × 2 × 10^6 = 240,000 instructions, 4 cycles each
   Mem    = 10% × 2 × 10^6 = 200,000 instructions, 8 cycles each
   CPI = ((1,200,000 × 1) + (360,000 × 2) + (240,000 × 4) + (200,000 × 8)) / Ic
       = 4,480,000 / 2,000,000
       = 2.24
b. MIPS rate = f / (CPI × 10^6) = (400 × 10^6) / (2.24 × 10^6) = 178.57
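• A compact sketch of the weighted-CPI calculation above (instruction mix and cycle counts as
given; the 400 MHz clock is inferred from the stated MIPS rate):

    Ic = 2_000_000
    f = 400e6                      # inferred: 178.57 MIPS x 2.24 CPI ~ 400 MHz
    # (fraction of instructions, cycles per instruction)
    mix = [(0.60, 1), (0.18, 2), (0.12, 4), (0.10, 8)]

    CPI = sum(frac * c for frac, c in mix)     # 2.24
    mips = f / (CPI * 1e6)                     # ~178.57
    print(CPI, round(mips, 2))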
34