Module 5
Parallel Processing
• The system may have two or more ALUs and be able to execute two or
more instructions at the same time
• Also, the system may have two or more processors operating concurrently
• Example: the ALU can be separated into three units and the operands diverted
to each unit under the supervision of a control unit
• Parallel processing can be classified according to:
o The internal organization of the processors
o The interconnection structure between processors
o The flow of information through the system
o The number of instructions and data items that are manipulated simultaneously
• The sequence of instructions read from memory constitutes the instruction stream, and the operations performed on the data in the processor constitute the data stream
• Parallel processing may occur in the instruction stream, the data stream, or both
Computer classification:
o Single instruction stream, single data stream – SISD
o Single instruction stream, multiple data stream – SIMD
o Multiple instruction stream, single data stream – MISD
o Multiple instruction stream, multiple data stream – MIMD
PIPELINING
• Each segment performs partial processing dictated by the way the task is
partitioned
• The result obtained from the computation in each segment is transferred to the
next segment in the pipeline
• The final result is obtained after the data have passed through all segments
• The suboperations performed in each segment for the combined operation Ai * Bi + Ci are:
Segment 1: R1 ← Ai , R2 ← Bi
Segment 2: R3 ← R1 * R2, R4 ← Ci
Segment 3: R5 ← R3 + R4
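The three suboperations above compute Ai * Bi + Ci for a stream of operands. A minimal Python sketch of the clocked behavior follows; only the register transfers come from the notes, while the simulation loop itself and the function name are illustrative:

```python
def pipeline_multiply_add(A, B, C):
    """Clock-by-clock sketch of the three-segment pipeline:
    segment 1: R1 <- Ai, R2 <- Bi
    segment 2: R3 <- R1 * R2, R4 <- Ci
    segment 3: R5 <- R3 + R4
    """
    n = len(A)
    R1 = R2 = R3 = R4 = R5 = None
    results = []
    # n tasks need n + 2 clock pulses: 2 to fill the pipeline,
    # then one result per pulse once it is full
    for clock in range(n + 2):
        # Update later segments first, so each segment reads the values
        # latched by the previous segment on the preceding clock pulse
        if clock >= 2:                      # segment 3
            R5 = R3 + R4
            results.append(R5)
        if 1 <= clock <= n:                 # segment 2
            R3, R4 = R1 * R2, C[clock - 1]
        if clock < n:                       # segment 1
            R1, R2 = A[clock], B[clock]
    return results
```

After the two-pulse fill, one finished result emerges on every clock pulse, which is the behavior the following bullets describe.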
• Any operation that can be decomposed into a sequence of suboperations of
about the same complexity can be implemented by a pipeline processor
• Once the pipeline is full, it takes only one clock period to obtain an output
• Consider a nonpipeline unit that performs the same operation and takes tn time to complete each task, so n tasks require n·tn
• A pipeline with k segments and clock period tp completes the first task after k·tp and each remaining task one clock period later, for a total of (k + n – 1)·tp
• The speedup of pipeline processing over an equivalent nonpipeline processing is defined by the ratio
S = n·tn / ((k + n – 1)·tp)
• If we assume that the time to process a task is the same in both circuits, tn = k·tp, and the speedup becomes S = n·k / (k + n – 1), which approaches k as n grows
• Example:
o Cycle time = tp = 20 ns
o Number of segments = k = 4
o Number of tasks = n = 100
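The speedup for these numbers can be checked directly from the ratio S = n·tn / ((k + n – 1)·tp) with tn = k·tp; the small helper below is illustrative:

```python
def speedup(k, n, tp):
    """Pipeline speedup S = n*tn / ((k + n - 1)*tp), assuming tn = k*tp."""
    tn = k * tp                       # nonpipeline time per task
    nonpipeline_time = n * tn         # 100 * 80 ns = 8000 ns
    pipeline_time = (k + n - 1) * tp  # 103 * 20 ns = 2060 ns
    return nonpipeline_time / pipeline_time

print(round(speedup(k=4, n=100, tp=20), 2))  # prints 3.88
```

The result, about 3.88, falls short of the theoretical maximum of k = 4 because the pipeline spends k – 1 extra cycles filling up.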
• In practice a pipeline cannot operate at its maximum theoretical rate; one reason is that the clock cycle must be chosen to equal the time delay of the segment with the maximum propagation time
• Pipeline organization is applicable for arithmetic operations and fetching
instructions
• As the number of tasks increases, k + n – 1 approaches n and the speedup becomes
S = tn / tp
• Therefore, the theoretical maximum speedup that a pipeline can provide is k
Arithmetic Pipeline
• Pipeline arithmetic units are usually found in very high speed computers
• Four segments are used to perform floating-point addition: compare the exponents, align the mantissas, add the mantissas, and normalize the result
• X = 0.9504 x 10^3 and Y = 0.8200 x 10^2
• The two exponents are subtracted in the first segment to obtain 3 – 2 = 1
• The larger exponent, 3, is chosen as the exponent of the result
• Segment 2 shifts the mantissa of Y to the right to obtain Y = 0.0820 x 10^3
• The mantissas are now aligned
• Segment 3 produces the sum Z = 1.0324 x 10^3
• Segment 4 normalizes the result by shifting the mantissa once to the right and incrementing the exponent by one to obtain Z = 0.10324 x 10^4
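The four segments of this worked example can be sketched as one function; the (mantissa, exponent) representation and the function name are assumptions, and left-shift normalization of results smaller than 0.1 is omitted for brevity:

```python
def fp_add(x, y):
    """Four-segment floating-point add on (mantissa, exponent) pairs,
    where a pair (m, e) represents the value m * 10**e."""
    (mx, ex), (my, ey) = x, y
    # Segment 1: compare the exponents by subtraction; keep the larger one
    if ex < ey:
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    # Segment 2: align the mantissas by shifting the smaller one right
    my = my / 10 ** (ex - ey)
    # Segment 3: add the aligned mantissas
    mz, ez = mx + my, ex
    # Segment 4: normalize by shifting right and incrementing the exponent
    while abs(mz) >= 1:
        mz, ez = mz / 10, ez + 1
    return mz, ez

mz, ez = fp_add((0.9504, 3), (0.8200, 2))  # Z ~ 0.10324 x 10^4
```

In hardware each segment would process a different pair of operands on every clock pulse; here the four steps simply run in sequence for one pair.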
Instruction Pipeline
• If a branch out of sequence occurs, the pipeline must be emptied and all
the instructions that have been read from memory after the branch
instruction must be discarded
• Thus, an instruction stream can be placed in a queue, waiting for decoding and
processing by the execution segment
• This reduces the average access time to memory for reading instructions
• Whenever there is space in the buffer, the control unit initiates the next
instruction fetch phase
• The following steps are needed to process each instruction:
o Fetch the instruction from memory
o Decode the instruction
o Calculate the effective address
o Fetch the operands from memory
o Execute the instruction
o Store the result in the proper place
• The pipeline may not perform at its maximum rate due to:
o Different segments taking different times to operate
o Some segment being skipped for certain operations
o Memory access conflicts
• Example: Four-segment instruction pipeline
• Assume that the decoding can be combined with calculating the EA in one
segment
• Assume that most of the instructions store the result in a register so that the execution
and storing of the result can be combined in one segment
• Up to four suboperations in the instruction cycle can overlap and up to four different
instructions can be in progress of being processed at the same time
• It is assumed that the processor has separate instruction and data memories
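The overlap can be visualized with a small space-time table. The segment mnemonics FI (fetch instruction), DA (decode and calculate effective address), FO (fetch operand), and EX (execute) follow the four-segment model described above; the table-building code itself is illustrative:

```python
SEGMENTS = ["FI", "DA", "FO", "EX"]  # the four pipeline segments

def space_time(n_instructions):
    """Map each clock cycle to the (instruction, segment) pairs active in it,
    assuming no conflicts and one segment per cycle per instruction."""
    table = {}
    for i in range(n_instructions):          # instruction i enters at cycle i+1
        for s, name in enumerate(SEGMENTS):
            table.setdefault(i + s + 1, []).append((i + 1, name))
    return table

table = space_time(4)
# At cycle 4, all four segments are busy with four different instructions
print(table[4])
```

Printing the table row by row reproduces the familiar staircase diagram: four instructions finish in 7 cycles instead of the 16 a purely sequential unit would need.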
• Reasons for the pipeline to deviate from its normal operation are:
o Resource conflicts caused by access to memory by two segments at the
same time.
o Data dependency conflicts arise when an instruction depends on the result of
a previous instruction, but this result is not yet available
o Branch difficulties arise from program control instructions that may change the value of the PC
• Methods to handle data dependency conflicts include:
o Hardware interlocks are circuits that detect instructions whose source operands
are destinations of prior instructions. Detection causes the hardware to insert
the required delays without altering the program sequence.
o Operand forwarding uses special hardware to detect a conflict and then avoid
it by routing the data through special paths between pipeline segments. This
requires additional hardware paths through multiplexers as well as the circuit
to detect the conflict.
o Delayed load is a procedure that gives the responsibility for solving data
conflicts to the compiler. The compiler is designed to detect a data conflict and
reorder the instructions as necessary to delay the loading of the conflicting data
by inserting no-operation instructions.
o Branch prediction uses some additional logic to guess the outcome of
a conditional branch instruction before it is executed. The pipeline
then begins prefetching instructions from the predicted path.
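The delayed-load idea above can be sketched as a small compiler-style pass over a toy instruction list; the tuple encoding, the one-cycle delay, and all names below are illustrative assumptions, not notation from the notes:

```python
NOP = ("NOP", None, ())  # no-operation instruction inserted by the compiler

def insert_load_delays(program, delay=1):
    """Insert NOPs after a LOAD whose destination register is read by the
    very next instruction, giving the load time to complete."""
    out = []
    for opcode, dest, sources in program:
        if out:
            prev_op, prev_dest, _ = out[-1]
            if prev_op == "LOAD" and prev_dest in sources:
                out.extend([NOP] * delay)  # delay the conflicting use
        out.append((opcode, dest, sources))
    return out

prog = [("LOAD", "R1", ("A",)),        # R1 <- M[A]
        ("ADD",  "R2", ("R1", "R3"))]  # reads R1 one cycle too early
fixed = insert_load_delays(prog)       # a NOP now separates LOAD and ADD
```

A real compiler would first try to move an independent instruction into the delay slot and fall back to a NOP only when none is available; this sketch always inserts the NOP for simplicity.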