
Parallel Processing

Agenda
• Introduction to parallel processing
• Flynn’s classification
• Pipelining
– General Pipeline
– Arithmetic pipeline
– Instruction pipeline
• Instruction level parallelism
Parallel Processing
• Execution of concurrent events in the computing process to achieve faster computational speed
• Levels of Parallel Processing
– Job or Program level
– Task or Procedure level
– Inter-Instruction level
– Intra-Instruction level
Parallel Computers
• Flynn's classification: Based on the multiplicity
of Instruction Streams and Data Streams
– Instruction Stream: Sequence of Instructions read
from memory
– Data Stream: Operations performed on the data in the processor

                                Number of Data Streams
                                Single      Multiple
  Number of         Single      SISD        SIMD
  Instruction
  Streams           Multiple    MISD        MIMD
SISD COMPUTER SYSTEMS
[Figure: Control Unit → Processor Unit → Memory; the control unit issues the instruction stream to the processor unit, which exchanges a data stream with memory]

• Characteristics
– Standard von Neumann machine
– Instructions and data are stored in memory
– One operation at a time

• Limitations
– Limitation on Memory Bandwidth
– Memory is shared by CPU and I/O
MISD COMPUTER SYSTEMS

[Figure: multiple control unit / processor pairs (CU, P), each receiving its own instruction stream from memory, all operating on a single shared data stream]

• There is no computer at present that can be classified as MISD
SIMD COMPUTER SYSTEMS
[Figure: a single Control Unit, connected to memory by a data bus, broadcasts one instruction stream to processor units P1 ... Pn; each processor handles its own data stream through an alignment network to memory modules M1 ... Mn]
• Characteristics
– Only one copy of the program exists
– A single controller executes one instruction at a time
MIMD COMPUTER SYSTEMS
[Figure: processor-memory pairs (P, M) connected through an interconnection network to a shared memory]
• Characteristics
– Multiple processing units
– Execution of multiple instructions on multiple data
• Types of MIMD computer systems
– Shared memory multiprocessors
– Message-passing multicomputers
Pipelining
• A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.
Example: Ai * Bi + Ci for i = 1, 2, 3, ..., 7

[Figure: Segment 1 loads Ai and Bi from memory into R1 and R2; Segment 2 feeds R1 and R2 to a multiplier whose result goes to R3 while Ci is loaded into R4; Segment 3 feeds R3 and R4 to an adder whose result goes to R5]

R1 ← Ai, R2 ← Bi          Load Ai and Bi
R3 ← R1 * R2, R4 ← Ci     Multiply and load Ci
R5 ← R3 + R4              Add
OPERATIONS IN EACH PIPELINE STAGE

Clock        Segment 1        Segment 2               Segment 3
Pulse        R1      R2       R3           R4         R5
1            A1      B1
2            A2      B2       A1 * B1      C1
3            A3      B3       A2 * B2      C2         A1 * B1 + C1
4            A4      B4       A3 * B3      C3         A2 * B2 + C2
5            A5      B5       A4 * B4      C4         A3 * B3 + C3
6            A6      B6       A5 * B5      C5         A4 * B4 + C4
7            A7      B7       A6 * B6      C6         A5 * B5 + C5
8                             A7 * B7      C7         A6 * B6 + C6
9                                                     A7 * B7 + C7
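The same schedule can be generated mechanically. Below is a minimal Python sketch (the loop structure and variable names are mine, not from the slides) that walks the three segments clock pulse by clock pulse and prints the table above:

    # Task i enters segment 1 at pulse i and leaves segment 3 at pulse i + 2.
    n = 7  # tasks A_i * B_i + C_i for i = 1..7

    for pulse in range(1, n + 3):            # k + n - 1 = 3 + 7 - 1 = 9 pulses
        s1 = pulse if pulse <= n else None                 # segment 1: load A_i, B_i
        s2 = pulse - 1 if 1 <= pulse - 1 <= n else None    # segment 2: multiply, load C_i
        s3 = pulse - 2 if 1 <= pulse - 2 <= n else None    # segment 3: add
        print(f"pulse {pulse}:",
              f"A{s1},B{s1}" if s1 else "-",
              f"A{s2}*B{s2},C{s2}" if s2 else "-",
              f"A{s3}*B{s3}+C{s3}" if s3 else "-")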
GENERAL PIPELINE
• General Structure of a 4-Segment Pipeline
– Any operation that can be decomposed into a sequence of suboperations of about the same complexity can be implemented by a pipeline processor.
– A task is defined as the total operation performed in going through all the segments of the pipeline (Ti denotes task i).
General Pipeline
[Figure: Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4, with all segment registers driven by a common clock]

• Space-Time Diagram

Clock cycles:  1   2   3   4   5   6   7   8   9
Segment 1:     T1  T2  T3  T4  T5  T6
Segment 2:         T1  T2  T3  T4  T5  T6
Segment 3:             T1  T2  T3  T4  T5  T6
Segment 4:                 T1  T2  T3  T4  T5  T6
PIPELINE SPEEDUP
n: Number of tasks to be performed

• Conventional Machine (Non-Pipelined)
  – t: time to complete one task
  – t1: time required to complete the n tasks
  – t1 = n * t

• Pipelined Machine (k stages)
  – tp: clock cycle (time to complete each suboperation)
  – tk: time required to complete the n tasks
  – tk = (k + n - 1) * tp
PIPELINE SPEEDUP
• Speedup: the speedup of pipeline processing over an equivalent non-pipelined processing is defined by the ratio
  – Sk: Speedup

    Sk = (n * t) / ((k + n - 1) * tp)

• As the number of tasks increases, n becomes much larger than k - 1, and k + n - 1 approaches the value of n. Then:

    lim (n→∞) Sk = t / tp   ( = k, if t = k * tp )

• Thus, the theoretical maximum speedup of the pipeline is k, where k is the number of segments in the pipeline.
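These formulas are straightforward to check in code. A minimal Python sketch (the function name and test values are mine, not from the slides):

    def pipeline_speedup(n, k, t, tp):
        """Speedup Sk = (n * t) / ((k + n - 1) * tp) of a k-stage pipeline
        with clock cycle tp over a non-pipelined unit taking t per task."""
        return (n * t) / ((k + n - 1) * tp)

    # With t = k * tp, the speedup approaches k as n grows:
    print(pipeline_speedup(n=10**6, k=4, t=80, tp=20))  # -> 3.99998..., i.e. ~k = 4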
Hurdle to maximum speedup
• There are various reasons why the pipeline
cannot operate at its maximum theoretical rate.
• Different segments may take different times to complete their suboperations.
• The clock cycle must be chosen equal to the time delay of the segment with the maximum propagation time.
  – This causes all other segments to waste time while waiting for the next clock pulse.
PIPELINE SPEEDUP: Example
• Example
  – 4-stage pipeline, tp = 20 ns per suboperation
  – 100 tasks to be executed
  – time for 1 task in a non-pipelined system: 4 * 20 = 80 ns
• Pipelined system:
  tk = (k + n - 1) * tp = (4 + 100 - 1) * 20 = 2060 ns
• Non-pipelined system:
  t1 = n * t = 100 * 80 = 8000 ns
• Speedup:
  Sk = 8000 / 2060 = 3.88
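Plugging the same numbers into the pipeline_speedup sketch above reproduces the result:

    # Slide's example: k = 4 stages, tp = 20 ns, n = 100 tasks,
    # non-pipelined time per task t = k * tp = 80 ns.
    print(pipeline_speedup(n=100, k=4, t=80, tp=20))  # 8000 / 2060 = 3.883...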
Types of Pipeline
• There are two areas of computer design where the pipeline organization is applicable:
– An arithmetic pipeline divides an arithmetic
operation into sub-operations for execution in the
pipeline segments.
– An instruction pipeline operates on a stream of
instructions by overlapping the fetch, decode, and
execute phases of the instruction cycle.
ARITHMETIC PIPELINE
• Used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems.
• E.g., a floating-point adder-subtractor:
  – Let X and Y be two floating-point numbers:
    X = A x 2^a, Y = B x 2^b
  – Assume a 4-stage pipeline with the following segments is used:
    1. Compare the exponents
    2. Align the mantissas
    3. Add/subtract the mantissas
    4. Normalize the result
• Ex:
  X = 0.9504 x 10^3
  Y = 0.8200 x 10^2
  Z = X + Y = 0.10324 x 10^4

[Figure: exponents a, b and mantissas A, B enter through input registers; Segment 1 compares the exponents by subtraction to get the difference; Segment 2 chooses the larger exponent and aligns the mantissa of the smaller number; Segment 3 adds or subtracts the mantissas; Segment 4 adjusts the exponent and normalizes the result, with interface registers R between segments]

• Suppose that the time delays of the four segments are t1 = 60 ns, t2 = 70 ns, t3 = 100 ns, t4 = 80 ns, and the interface registers have a delay of tr = 10 ns.
• The clock cycle is chosen to be tp = t3 + tr = 110 ns.
• An equivalent non-pipelined floating-point adder-subtractor will have a delay time of tn = t1 + t2 + t3 + t4 + 4*tr = 350 ns.
• In this case the pipelined adder has a speedup of 350/110 ≈ 3.2 over the non-pipelined adder.
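The four segments are easy to trace on the slide's decimal example. A hedged Python sketch follows (my own decomposition; real hardware works on base-2 mantissas, but the slide's example uses base 10, which the sketch mirrors):

    def fp_add(a_mant, a_exp, b_mant, b_exp):
        """Add two base-10 floating-point numbers mant x 10^exp by walking
        the four pipeline segments from the slide."""
        # Segment 1: compare the exponents by subtraction.
        if a_exp < b_exp:                  # keep the larger exponent in a
            a_mant, a_exp, b_mant, b_exp = b_mant, b_exp, a_mant, a_exp
        diff = a_exp - b_exp
        # Segment 2: choose the larger exponent, align the other mantissa.
        b_mant /= 10 ** diff
        # Segment 3: add the mantissas.
        z_mant, z_exp = a_mant + b_mant, a_exp
        # Segment 4: normalize the result so that 0.1 <= mantissa < 1.
        while z_mant >= 1.0:
            z_mant /= 10
            z_exp += 1
        return z_mant, z_exp

    # X = 0.9504 x 10^3, Y = 0.8200 x 10^2 -> Z = 0.10324 x 10^4
    print(fp_add(0.9504, 3, 0.8200, 2))  # (0.10324..., 4), up to float rounding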
INSTRUCTION CYCLE
• Six Phases* in an Instruction Cycle
1. Fetch an instruction from memory
2. Decode the instruction
3. Calculate the effective address of the operand
4. Fetch the operands from memory
5. Execute the operation
6. Store the result in the proper place

* Some instructions skip some phases
* Effective address calculation can be done as part of the decoding phase
* Storage of the operation result in a register is done automatically in the execution phase
INSTRUCTION CYCLE
• 4-Stage Pipeline
1. FI: Fetch an instruction from memory
2. DA: Decode the instruction and calculate the
effective address of the operand
3. FO: Fetch the operand
4. EX: Execute the operation
It is assumed that the processor has separate instruction and data memories so that the operations in FI and FO can proceed at the same time.
Instruction Execution In a 4-stage Pipeline
[Flowchart: Segment 1 fetches the instruction from memory; Segment 2 decodes it and calculates the effective address; if the instruction is a branch, the pipe is emptied and the PC is updated before fetching resumes; otherwise Segment 3 fetches the operand from memory and Segment 4 executes the instruction; if an interrupt is pending, the pipe is emptied and interrupt handling runs; the PC is then updated and the cycle repeats]
Instruction Execution In a 4-stage Pipeline

Step:            1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1    FI  DA  FO  EX
            2        FI  DA  FO  EX
(Branch)    3            FI  DA  FO  EX
            4                FI  --  --  FI  DA  FO  EX
            5                                FI  DA  FO  EX
            6                                    FI  DA  FO  EX
            7                                        FI  DA  FO  EX
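A small Python sketch of this schedule (my own simplified model: a taken branch empties the pipe, so the next useful fetch happens the step after the branch's EX; the discarded fetch of instruction 4 at step 4 is not modeled):

    STAGES = ["FI", "DA", "FO", "EX"]

    def schedule(n, branch_at=None):
        """Map each instruction (1-based) to the step at which it occupies
        each stage, restarting fetch after a taken branch resolves in EX."""
        table, fetch_step = {}, 1
        for i in range(1, n + 1):
            table[i] = {stage: fetch_step + j for j, stage in enumerate(STAGES)}
            fetch_step += 1
            if i == branch_at:          # target known only after the EX stage
                fetch_step = table[i]["EX"] + 1
        return table

    for i, row in schedule(7, branch_at=3).items():
        print(i, row)   # instruction 4 now fetches at step 7, EX at step 10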
Major Hazards In Pipelined Execution
• 3 categories
1. Resource Hazard
2. Data Dependency Hazard
3. Branching Hazard
Resource Hazard
• Occurs when the hardware resources required by instructions in simultaneous overlapped execution cannot all be provided.
  E.g., fetching an instruction and fetching an operand for two different instructions from memory at the same time.
• A solution is to use two separate memory buses to fetch instructions and data respectively.
Data Dependency Hazard
• Occurs when the execution of an instruction depends on the result of a previous instruction
• E.g.  ADD R1, R2, R3
        SUB R4, R1, R5
• Data hazards can be dealt with by either hardware or software techniques.
Data Dependency Solution
Hardware Technique:
• Interlock
– hardware detects the data dependencies and delays
the scheduling of the dependent instruction by
stalling enough clock cycles
• Forwarding (bypassing, short-circuiting)
  – Accomplished by a data path that routes a value from a source (usually an ALU) to a user, bypassing a designated register. This allows the value being produced to be used at an earlier stage in the pipeline than would otherwise be possible.
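A toy Python model (the stage timings are my assumptions, not the slide's) of the ADD/SUB pair above shows why forwarding helps:

    # In FI-DA-FO-EX, ADD R1,R2,R3 produces R1 at the end of its EX (step 4),
    # while SUB R4,R1,R5, issued one cycle later, wants R1 in its FO (step 4).
    ADD_PRODUCES = 4   # step at which ADD's EX result exists
    SUB_NEEDS = 4      # step at which SUB's FO would read R1

    def stall_cycles(forwarding):
        # With forwarding, the ALU output is routed straight to the consumer
        # in the same step it is produced; an interlock instead waits one
        # extra step for the register write to complete.
        ready = ADD_PRODUCES if forwarding else ADD_PRODUCES + 1
        return max(0, ready - SUB_NEEDS)

    print(stall_cycles(forwarding=False))  # 1 stall under the interlock
    print(stall_cycles(forwarding=True))   # 0 stalls with forwarding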
Data Dependency Solution
Software Technique:
• Delayed Load:
– Gives the responsibility for solving data conflict problems to the compiler that translates the high-level programming language into a machine language program.
– The compiler for such computers is designed to detect a data conflict and reorder the instructions as necessary to delay the loading of the conflicting data, by inserting no-operation instructions.
– This method is referred to as delayed load.
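A minimal sketch of the compiler-side pass (the tuple encoding of instructions is invented for illustration, and a one-slot load delay is assumed):

    def insert_delayed_load_nops(program):
        """Insert a NOP after any LOAD whose destination register is read
        by the very next instruction."""
        out = []
        for ins in program:
            prev = out[-1] if out else None
            if prev and prev[0] == "LOAD" and prev[1] in ins[2:]:
                out.append(("NOP",))    # fill the load delay slot
            out.append(ins)
        return out

    prog = [("LOAD", "R1", "A"), ("LOAD", "R2", "B"),
            ("ADD", "R3", "R1", "R2"), ("STORE", "C", "R3")]
    for ins in insert_delayed_load_nops(prog):
        print(ins)                      # a NOP lands before the ADD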
Branching Hazards
• Branch Instructions
– Branch target address is not known until the branch
instruction is completed
Branch Instruction:  FI  DA  FO  EX
Next Instruction:        FI  DA  FO  EX
                         ^-- fetched before the target address is available
                             (the target is known only after the branch's EX)

• Dealing with Branching hazards
– Pre-fetch target instruction
– Branch Target Buffer / Loop Buffer
– Branch Prediction
– Delayed Branch
Control Hazards
• Pre-fetch target instruction:
  – Prefetch the target instruction in addition to the instruction following the branch. Both are saved until the branch is executed.
  – If the branch condition succeeds, the pipeline continues from the branch target instruction.
  – An extension of this procedure is to continue fetching instructions from both places until the branch decision is made. At that time, control chooses the instruction stream of the correct program flow.
• Branch Target Buffer (BTB; associative memory)
  – Entry: address of a previously executed branch, plus (the address of) the target instruction and the next few instructions.
  – When fetching an instruction, search the BTB.
  – If found, fetch the instruction stream from the BTB;
  – If not, fetch the new stream and update the BTB.
Control Hazards
• Loop Buffer (a small, very high-speed register file maintained by the instruction fetch segment)
  – When a program loop is detected, it is stored in the loop buffer in its entirety, including all branches.
  – The program loop can then be executed directly, without accessing memory, until the loop mode is removed by the final branch out of the loop.
• Branch Prediction
  – Guess the branch outcome and fetch an instruction stream based on the guess; a correct guess eliminates the branch penalty.
• Delayed Branch
  – The compiler detects the branch and rearranges the instruction sequence by inserting no-operation instructions that keep the pipeline busy in the presence of a branch instruction.
  – This causes the computer to fetch the target instruction during the execution of the no-operation instruction, allowing a continuous flow through the pipeline (see the sketch below).
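A hedged sketch of the compiler transformation (instruction tuples are invented; the slides describe inserting no-ops, while real delayed-branch compilers also try to fill the slot with a useful earlier instruction, as attempted here):

    def fill_delay_slot(program, branch_idx):
        """Fill one branch delay slot: hoist the instruction just before the
        branch if the branch condition does not read its destination register,
        otherwise fall back to inserting a NOP after the branch."""
        prog = list(program)
        branch = prog[branch_idx]
        prev = prog[branch_idx - 1] if branch_idx > 0 else None
        if prev and prev[0] != "NOP" and prev[1] not in branch[2:]:
            prog[branch_idx - 1:branch_idx + 1] = [branch, prev]  # swap
        else:
            prog.insert(branch_idx + 1, ("NOP",))
        return prog

    prog = [("LOAD", "R1", "A"), ("ADD", "R2", "R2", "1"),
            ("BRANCH", "LOOP", "R3")]          # branch condition reads R3
    print(fill_delay_slot(prog, branch_idx=2)) # ADD moves into the delay slot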
